Abstract:Depth imaging is a foundational building block for broad applications, such as autonomous driving and virtual/augmented reality. Traditionally, depth cameras have relied on time-of-flight sensors or multi-lens systems to obtain physical depth measurements. However, these systems often face a trade-off between a bulky form factor and imprecise approximations, limiting their suitability for spatially constrained scenarios. Inspired by emerging advances in nano-optics, we present Nano-3D, a metasurface-based neural depth imaging solution with an ultra-compact footprint. Nano-3D integrates our custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract precise metric depth information from monocular metasurface-polarized imagery. We demonstrate the effectiveness of Nano-3D with both simulated and physical experiments. We hope the exhibited success paves the way for the community to bridge future graphics systems with emerging nanomaterial technologies through novel computational approaches.
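For illustration, the learning stage described above can be pictured as an encoder-decoder that regresses a metric depth map from a single metasurface-polarized image. The sketch below is a minimal, hypothetical stand-in: the four-channel polarization input, layer sizes, and module names are assumptions for illustration, not Nano-3D's actual multi-module network.

```python
# Minimal sketch (not the Nano-3D network): regress a metric depth map from a
# single metasurface-polarized image. The 4-channel input and layer sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self, in_ch=4):  # e.g., four polarization channels (assumed)
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Softplus(),  # depth is non-negative; metric scale is learned from data
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

depth = TinyDepthNet()(torch.randn(1, 4, 128, 128))  # -> (1, 1, 128, 128) depth map
print(depth.shape)
```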
Abstract:Many deep learning (DL) powered mobile and wearable applications today continuously and unobtrusively sense their surroundings to enhance many aspects of daily life. To enable robust and private mobile sensing, DL models are often deployed locally on resource-constrained mobile devices using techniques such as model compression or offloading. However, existing methods, whether at the front-end algorithm level (i.e., DL model compression/partitioning) or the back-end scheduling level (i.e., operator/resource scheduling), cannot adapt online on the device: they either require offline retraining to preserve accuracy or rely on manually pre-defined strategies, and thus struggle with dynamic adaptability. The primary challenge lies in feeding runtime performance from the back-end level back into front-end optimization decisions. Moreover, adaptive mobile DL model porting middleware with cross-level co-adaptation remains underexplored, particularly in diverse and dynamic mobile environments. In response, we introduce CrowdHMTware, a dynamic, context-adaptive DL model deployment middleware for heterogeneous mobile devices. It establishes an automated adaptation loop across cross-level functional components, i.e., elastic inference, scalable offloading, and a model-adaptive engine, enhancing scalability and adaptability. Experiments with four typical tasks across 15 platforms and a real-world case study demonstrate that CrowdHMTware can effectively scale DL model, offloading, and engine actions across diverse platforms and tasks. It hides run-time system issues from developers, reducing the required developer expertise.
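The automated adaptation loop described above can be pictured as a simple feedback loop in which back-end runtime measurements drive front-end choices of model variant and offloading split. The sketch below is schematic only; the component interfaces, variants, and thresholds are assumed for illustration and are not the CrowdHMTware API.

```python
# Schematic sketch of a cross-level adaptation loop; component interfaces and
# numbers are assumed for illustration (this is not the CrowdHMTware API).
import random

MODEL_VARIANTS = ["tiny", "small", "base"]                  # elastic-inference options (assumed)
BASE_LATENCY = {"tiny": 20.0, "small": 45.0, "base": 90.0}  # ms, illustrative

def measure_latency_ms(model, offload_frac):
    # Stand-in for back-end runtime feedback: offloading hides part of the
    # on-device compute but the measurement is noisy.
    return BASE_LATENCY[model] * (1 - 0.6 * offload_frac) + random.uniform(0, 10)

def adapt(target_ms=50.0, rounds=10):
    model, offload = "base", 0.0
    for _ in range(rounds):
        latency = measure_latency_ms(model, offload)
        if latency > target_ms and offload < 1.0:        # first try offloading more
            offload = min(offload + 0.25, 1.0)
        elif latency > target_ms and model != "tiny":    # then fall back to a smaller model
            model = MODEL_VARIANTS[MODEL_VARIANTS.index(model) - 1]
        print(f"model={model:5s} offload={offload:.2f} latency={latency:5.1f} ms")

adapt()
```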
Abstract:This paper explores the digital modeling and robotic reproduction of traditional Chinese medicine (TCM) massage techniques. We adopt an adaptive admittance control algorithm to optimize force and position control, ensuring safety and comfort. The paper analyzes key TCM techniques from kinematic and dynamic perspectives, and designs robotic systems to reproduce these massage techniques. The results demonstrate that the robot successfully mimics the characteristics of TCM massage, providing a foundation for integrating traditional therapy with modern robotics and expanding assistive therapy applications.
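For reference, the core of an admittance controller is the virtual mass-damper-spring relation M·ë + B·ė + K·e = f_ext − f_desired, which turns a measured contact-force error into a position correction. The discrete-time sketch below illustrates one such update step with assumed gains, not the paper's tuned parameters.

```python
# Minimal sketch of one admittance-control update: the contact-force error
# drives virtual dynamics M*e_dd + B*e_d + K*e = f_ext - f_desired, and the
# resulting displacement e corrects the commanded position. Gains are assumed.
M, B, K = 2.0, 40.0, 300.0         # virtual mass, damping, stiffness (illustrative)
dt = 0.001                          # 1 kHz control loop

def admittance_step(e, e_dot, f_ext, f_desired):
    e_ddot = (f_ext - f_desired - B * e_dot - K * e) / M  # virtual dynamics
    e_dot += e_ddot * dt                                   # explicit Euler integration
    e += e_dot * dt
    return e, e_dot                                        # e is added to the reference trajectory

# Example: a 5 N over-pressure relaxes the commanded position away from the tissue.
e, e_dot = 0.0, 0.0
for _ in range(1000):
    e, e_dot = admittance_step(e, e_dot, f_ext=15.0, f_desired=10.0)
print(f"steady-state offset ≈ {e * 1000:.1f} mm")          # ≈ (15 - 10) / K
```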
Abstract:Understanding non-human primate behavior is crucial for improving animal welfare, modeling social behavior, and gaining insights into both distinctly human and shared behaviors. Despite recent advances in computer vision, automated analysis of primate behavior remains challenging due to the complexity of their social interactions and the lack of specialized algorithms. Existing methods often struggle with the nuanced behaviors and frequent occlusions characteristic of primate social dynamics. This study aims to develop an effective method for automated detection, tracking, and recognition of chimpanzee behaviors in video footage. Here we show that our proposed method, AlphaChimp, an end-to-end approach that simultaneously detects chimpanzee positions and estimates behavior categories from videos, significantly outperforms existing methods in behavior recognition. AlphaChimp achieves approximately 10% higher tracking accuracy and a 20% improvement in behavior recognition compared to state-of-the-art methods, particularly excelling in the recognition of social behaviors. This superior performance stems from AlphaChimp's innovative architecture, which integrates temporal feature fusion with a Transformer-based self-attention mechanism, enabling more effective capture and interpretation of complex social interactions among chimpanzees. Our approach bridges the gap between computer vision and primatology, enhancing technical capabilities and deepening our understanding of primate communication and sociality. We release our code and models and hope this will facilitate future research in animal social dynamics. This work contributes to ethology, cognitive science, and artificial intelligence, offering new perspectives on social intelligence.
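The temporal feature fusion with Transformer self-attention mentioned above can be sketched as a block in which each frame's features attend to all other frames in a clip. The snippet below is a generic illustration with assumed tensor shapes and sizes; it is not AlphaChimp's actual architecture.

```python
# Generic sketch of temporal feature fusion via self-attention (shapes and
# sizes are assumed; this is not the AlphaChimp implementation).
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, frames, dim) per-frame features
        fused, _ = self.attn(x, x, x)      # each frame attends to all frames in the clip
        return self.norm(x + fused)        # residual connection + normalization

feats = torch.randn(2, 16, 256)            # 2 clips, 16 frames, 256-d features
print(TemporalFusion()(feats).shape)       # torch.Size([2, 16, 256])
```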
Abstract:Though advanced in understanding visual information together with human language, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that, during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: when presented with a query relevant to a previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground-truth visual information is available? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs' behavior when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiments show that the performance of open-source LVLMs drops by at least $31\%$, indicating that LVLMs are prone to accepting the generated hallucinations and making false claims that they would not otherwise have supported. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate it, we further propose a training-free method called Residual Visual Decoding, in which we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than $24\%$ of the snowballed multimodal hallucinations while maintaining the models' capabilities.
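One plausible way to realize the idea of revising the output distribution with one derived from the residual visual input is a simple logit-space correction, as sketched below. The blending rule and weight are assumptions for illustration and are not necessarily the paper's exact Residual Visual Decoding formulation.

```python
# Hypothetical sketch of revising a decoding distribution with one conditioned
# on the residual visual input. The blending rule and alpha are assumptions,
# not necessarily the paper's exact Residual Visual Decoding formula.
import torch
import torch.nn.functional as F

def residual_visual_decode(logits_ctx, logits_resid, alpha=0.7):
    """logits_ctx:   next-token logits given the full (possibly hallucinatory) context
       logits_resid: next-token logits given the residual visual input
       Pushes the distribution toward what the visual evidence alone supports."""
    revised = logits_ctx + alpha * (logits_resid - logits_ctx)
    return F.softmax(revised, dim=-1)

vocab = 8
probs = residual_visual_decode(torch.randn(vocab), torch.randn(vocab))
print(probs.sum())   # ≈ 1.0, a valid probability distribution
```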
Abstract:In this paper, we investigate the beam training problem in multi-user millimeter wave (mmWave) communication systems, where multiple reconfigurable intelligent surfaces (RISs) are deployed to improve coverage and the achievable rate. However, existing beam training techniques in mmWave systems suffer from high complexity (i.e., exponential order) and low identification accuracy. To address these problems, we propose a novel hashing multi-arm beam (HMB) training scheme that reduces the training complexity to logarithmic order while maintaining high accuracy. Specifically, we first design a generation mechanism for HMB codebooks. Then, we propose a soft-decision-based demultiplexing algorithm to distinguish signals from different RIS reflective links. Finally, we utilize a multi-round voting mechanism to align the beams. Simulation results show that the proposed HMB training scheme enables simultaneous training for multiple RISs and multiple users, and reduces the beam training overhead to the logarithmic level. Moreover, the results show that our proposed scheme improves the identification accuracy by at least 20% compared to existing beam training techniques.
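The hashing-and-voting idea behind the scheme can be illustrated with a toy example: in each round, a random hash maps the candidate beam directions into a small set of multi-arm codewords, and the receiver votes for every direction consistent with the strongest codeword, so the true direction accumulates the most votes. The dimensions and measurement model below are made up for illustration and do not reproduce the paper's codebook design.

```python
# Toy sketch of hashing-based multi-arm beam training with multi-round voting.
# Dimensions and the "measurement" model are made up for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N, B, ROUNDS = 64, 8, 6            # candidate beams, codewords per round, voting rounds
true_beam = 37
votes = np.zeros(N)

for _ in range(ROUNDS):
    hash_map = rng.integers(0, B, size=N)        # random hash: beam index -> codeword
    # Idealized measurement: the codeword containing the true beam is strongest.
    received_power = np.array([1.0 if b == hash_map[true_beam] else 0.1 for b in range(B)])
    best_codeword = int(np.argmax(received_power + 0.05 * rng.standard_normal(B)))
    votes[hash_map == best_codeword] += 1        # vote for all beams consistent with it

print("estimated beam:", int(np.argmax(votes)), "| true beam:", true_beam)
```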
Abstract:In this paper, we investigate the millimeter-wave (mmWave) near-field beam training problem of finding the correct beam direction. To address the high complexity and low identification accuracy of existing beam training techniques, we propose an efficient hashing multi-arm beam (HMB) training scheme for the near-field scenario. Specifically, we first design a set of sparse bases based on the polar-domain sparsity of the near-field channel. Then, random hash functions are chosen to construct the near-field multi-arm beam training codebook. Each multi-arm beam codeword is scanned in one time slot until all predefined codewords have been traversed. Finally, soft-decision and voting methods are applied to distinguish signals from different base stations and obtain correctly aligned beams. Simulation results show that our proposed near-field HMB training method reduces the beam training overhead to the logarithmic level while achieving 96.4% of the identification accuracy of exhaustive beam training. Moreover, we also verify its applicability in the far-field scenario.
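The polar-domain codebook mentioned above samples both angle and distance because near-field wavefronts are spherical. The snippet below sketches one standard way to build near-field steering vectors for a uniform linear array; the array size, wavelength, and sampling grids are illustrative assumptions rather than the paper's design.

```python
# Sketch of polar-domain (angle + distance) near-field steering vectors for a
# uniform linear array; array size, wavelength, and grids are assumed values.
import numpy as np

N_ANT, WAVELEN = 128, 0.005                  # antennas, ~60 GHz carrier (illustrative)
d = WAVELEN / 2                               # half-wavelength element spacing
n = np.arange(N_ANT) - (N_ANT - 1) / 2        # element indices centered at the array middle

def near_field_steering(theta, r):
    """Spherical-wave steering vector toward angle theta (rad) and distance r (m)."""
    # Exact distance from each element to the focal point (spherical wavefront).
    dist = np.sqrt(r**2 + (n * d) ** 2 - 2 * r * n * d * np.sin(theta))
    return np.exp(-1j * 2 * np.pi * (dist - r) / WAVELEN) / np.sqrt(N_ANT)

# A small polar-domain grid: uniform in sin(angle), a few sampled distances.
angles = np.arcsin(np.linspace(-1, 1, 16))
distances = [1.0, 2.0, 5.0, 20.0]             # meters (assumed sampling)
codebook = np.stack([near_field_steering(a, r) for a in angles for r in distances])
print(codebook.shape)                          # (64, 128)
```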
Abstract:Millimeter wave (mmWave) communication has attracted considerable attention due to its wide bandwidth and high frequency. However, it is highly susceptible to blockages, resulting in significant degradation of coverage and the sum rate. A promising remedy is to deploy distributed reconfigurable intelligent surfaces (RISs), which can establish extra communication links. In this paper, we investigate the impact of distributed RISs on the coverage probability and the sum rate in mmWave wireless communication systems. Specifically, we first introduce the system model, which captures the blockage, RIS, and user distributions via Poisson point processes. Then, we define the association criterion and derive the conditional coverage probabilities for the two cases of direct association and reflective association through RISs. Finally, we combine the two cases using Campbell's theorem and the total probability theorem to obtain closed-form expressions for the ergodic coverage probability and the sum rate. Simulation results validate the effectiveness of the proposed analytical approach and demonstrate that deploying distributed RISs improves the ergodic coverage probability by 45.4% and the sum rate by more than 1.5 times.
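The analytical results can be cross-checked with a small Monte Carlo experiment: draw RISs from a Poisson point process, decide whether the user's direct link is blocked, and otherwise test whether some RIS offers a sufficiently strong reflected link. The densities, geometry, and SNR rule below are toy assumptions, not the paper's system parameters.

```python
# Toy Monte Carlo sketch of PPP-based coverage analysis: RISs are drawn from a
# Poisson point process, and a user is "covered" if its direct link is unblocked
# or some RIS provides a strong enough reflected link. Densities, geometry, and
# the SNR rule are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
AREA, LAM_RIS = 200.0, 0.002           # square side (m), RIS density (per m^2)
P_BLOCK, SNR_TH = 0.4, 1.0             # direct-link blockage probability, SNR threshold

def one_trial():
    user = rng.uniform(0, AREA, 2)
    bs = np.array([AREA / 2, AREA / 2])                # single BS at the center
    if rng.random() > P_BLOCK:                         # direct LoS link survives
        return True
    n_ris = rng.poisson(LAM_RIS * AREA**2)             # PPP: Poisson count, uniform positions
    ris = rng.uniform(0, AREA, (n_ris, 2))
    d_total = np.linalg.norm(ris - bs, axis=1) + np.linalg.norm(ris - user, axis=1)
    snr = 1e4 / np.maximum(d_total, 1.0) ** 2          # toy reflected-path SNR model
    return bool(n_ris) and snr.max() > SNR_TH

coverage = np.mean([one_trial() for _ in range(5000)])
print(f"estimated coverage probability ≈ {coverage:.2f}")
```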
Abstract:Millimeter wave (mmWave) communication has received considerable interest due to its expansive bandwidth and high frequency. However, a noteworthy challenge arises from its vulnerability to blockages, leading to reduced coverage and achievable rates. To address these limitations, a potential solution is to deploy distributed reconfigurable intelligent surfaces (RISs), which comprise many low-cost, passively reflecting elements and can facilitate the establishment of extra communication links. In this paper, we leverage stochastic geometry to investigate the ergodic coverage probability and the achievable rate in both distributed RIS-assisted single-cell and multi-cell mmWave wireless communication systems. Specifically, we first establish the system model, in which the blockages, RISs, and users are stochastically distributed according to Poisson point processes. Then we give the association criterion and derive the association probabilities, the distance distributions, and the conditional coverage probabilities for the two cases in which base stations and users associate without or with RISs. Finally, we use Campbell's theorem and the total probability theorem to obtain closed-form expressions for the ergodic coverage probability and the achievable rate. Simulation results verify the effectiveness of our analysis and demonstrate that deploying distributed RISs improves the ergodic coverage probability by approximately 50% and increases the achievable rate by more than 1.5 times.
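The final combination step can be written compactly: the total probability theorem weights the two conditional coverage probabilities by their association probabilities, and the achievable rate follows from integrating the coverage probability. The notation below is generic and illustrative, not the paper's exact symbols.

```latex
% Illustrative combination of the two association cases via the total
% probability theorem (generic notation, not the paper's exact symbols):
%   A_d, A_r            : probabilities of direct and RIS-assisted association
%   P_c^d(\tau), P_c^r(\tau) : conditional coverage probabilities at SINR threshold \tau
\[
  P_c(\tau) \;=\; A_d \, P_c^{d}(\tau) \;+\; A_r \, P_c^{r}(\tau),
  \qquad A_d + A_r = 1 .
\]
\[
  R \;=\; \frac{B}{\ln 2} \int_{0}^{\infty} \frac{P_c(\tau)}{1+\tau}\, d\tau
  \quad \text{(achievable rate over bandwidth } B \text{, from the coverage probability)} .
\]
```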
Abstract:In millimeter-wave communications, large-scale antenna arrays are commonly employed to mitigate obstacle occlusion and path loss. However, these large-scale arrays generate pencil-shaped beams, which require a larger number of training beams to cover the desired space, resulting in heavy beam training overhead. Furthermore, as the antenna aperture increases, users are more likely to be situated in the near-field region of the base station (BS) antenna array. This motivates our investigation of the beam training problem in the near-field region to achieve efficient beam alignment. To address the high complexity and low identification accuracy of existing beam training techniques, we propose an efficient hashing multi-arm beam (HMB) training scheme for the near-field scenario. Specifically, we first design a set of sparse bases based on the polar-domain sparsity of the near-field channel and construct a near-field single-beam training codebook. Then, hash functions are chosen to construct the near-field multi-arm beam training codebook. Each multi-arm beam training codeword is used in one time slot until the predefined codebook has been traversed. Finally, soft-decision and voting methods are applied to distinguish signals from different BSs and obtain the correctly aligned beams. In addition, we provide a rigorous proof of the computational complexity. Simulation results show that our proposed near-field HMB training method achieves 96.4% of the identification accuracy of exhaustive beam training while greatly reducing the training overhead to the logarithmic level. Furthermore, we verify its applicability in the far-field scenario as well.
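The claimed logarithmic-level overhead can be illustrated with a quick count: exhaustive training sweeps every codeword, whereas hashing-based multi-arm training needs on the order of log2(N) codewords per round over a small number of voting rounds. The numbers below are illustrative rather than the paper's exact accounting.

```python
# Back-of-the-envelope comparison of beam-training overhead (illustrative
# accounting only, not the paper's exact complexity analysis).
import math

N = 1024                 # candidate near-field codewords (angle x distance grid, assumed)
ROUNDS = 5               # voting rounds assumed for the hashing-based scheme

exhaustive_slots = N                                   # one time slot per codeword
hashing_slots = ROUNDS * math.ceil(math.log2(N))       # ~log2(N) multi-arm codewords per round

print(f"exhaustive: {exhaustive_slots} slots")         # 1024
print(f"hashing   : {hashing_slots} slots")            # 5 * 10 = 50
print(f"reduction : {exhaustive_slots / hashing_slots:.0f}x")
```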