Abstract:Cross-Modal Retrieval (CMR), which retrieves relevant items from one modality (e.g., audio) given a query in another modality (e.g., visual), has undergone significant advancements in recent years. This capability is crucial for robots to integrate and interpret information across diverse sensory inputs. However, the retrieval space in existing robotic CMR approaches often consists of only one modality, which limits the robot's performance. In this paper, we propose a novel CMR model, named VAT-CMR, that incorporates three modalities, i.e., visual, audio and tactile, for enhanced multi-modal object retrieval. In this model, multi-modal representations are first fused to provide a holistic view of object features. To mitigate the semantic gap between representations of different modalities, a dominant modality is then selected during the classification training phase to improve the distinctiveness of the representations and thereby the retrieval performance. To evaluate the proposed approach, we conducted a case study, and the results demonstrate that VAT-CMR surpasses competing approaches. Furthermore, the proposed dominant modality selection significantly enhances cross-modal retrieval accuracy.
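A minimal sketch of how fused multi-modal classification with a dominant-modality term could look in a PyTorch-style setup; the module names, the loss weighting, and the rule of picking the modality with the lowest per-modality classification loss are illustrative assumptions, not the VAT-CMR implementation.

```python
# Illustrative sketch (hypothetical names): fuse visual/audio/tactile features
# and emphasise a "dominant" modality, chosen here as the one whose own
# classifier currently achieves the lowest loss. Not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ("visual", "audio", "tactile")

class VATFusion(nn.Module):
    def __init__(self, in_dim=512, dim=128, num_classes=10):
        super().__init__()
        # One encoder per modality; real encoders would be image, audio and
        # tactile networks. Linear layers keep the sketch compact.
        self.encoders = nn.ModuleDict({m: nn.Linear(in_dim, dim) for m in MODALITIES})
        self.fusion_head = nn.Linear(len(MODALITIES) * dim, num_classes)
        self.modality_heads = nn.ModuleDict({m: nn.Linear(dim, num_classes) for m in MODALITIES})

    def forward(self, inputs, labels):
        feats = {m: self.encoders[m](inputs[m]) for m in MODALITIES}
        fused = torch.cat([feats[m] for m in MODALITIES], dim=-1)
        fused_loss = F.cross_entropy(self.fusion_head(fused), labels)
        # Per-modality classification losses; the best-separating modality is
        # treated as dominant and its loss is weighted more strongly.
        mod_loss = {m: F.cross_entropy(self.modality_heads[m](feats[m]), labels) for m in MODALITIES}
        dominant = min(mod_loss, key=lambda m: mod_loss[m].item())
        return fused_loss + 2.0 * mod_loss[dominant], dominant
```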
Abstract:Unsupervised Domain Adaptive Object Detection (DAOD) adapts a model trained on a source domain to an unlabeled target domain for object detection. Existing unsupervised DAOD methods usually perform feature alignment from the target to the source. Such unidirectional domain transfer omits information about the target samples and results in suboptimal adaptation when domain shifts are large. We therefore propose a pairwise attentive adversarial network with a Domain Mixup (DomMix) module to mitigate these challenges. Specifically, a deep-level mixup is employed to construct an intermediate domain in which features from both domains share their differences. A pairwise attentive adversarial network is then applied, with attentive encoding of both image-level and instance-level features at different scales, and domain alignment is optimized through adversarial learning. This allows the network to focus on regions with disparate contextual information and learn the similarities between the domains. Extensive experiments on several benchmark datasets demonstrate the superiority of the proposed method.
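A minimal sketch of deep-level (feature-space) mixup between source and target features to form an intermediate domain; the Beta-sampled coefficient and the tensor shapes are assumptions, not the exact DomMix configuration.

```python
# Sketch of feature-space mixup for an intermediate domain; lambda sampling
# and shapes are assumptions, not the paper's exact DomMix design.
import torch

def domain_mixup(src_feat: torch.Tensor, tgt_feat: torch.Tensor, alpha: float = 1.0):
    """Mix same-shaped source/target feature maps with a Beta-sampled weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * src_feat + (1.0 - lam) * tgt_feat  # intermediate-domain features
    return mixed, lam

# The mixed features would then feed image-level and instance-level domain
# discriminators trained adversarially (e.g., through a gradient reversal layer).
```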
Abstract:This study explores an indoor system for tracking multiple humans and detecting falls, employing three Texas Instruments millimeter-wave radars placed on x-, y-, and z-facing surfaces. Compared to wearable and camera-based methods, millimeter-wave radar is not plagued by mobility inconvenience, lighting conditions, or privacy issues. We establish a real-time framework to integrate the signals received from these radars, allowing us to track the position and body status of human targets non-intrusively. To ensure the overall accuracy of our system, we conduct an initial evaluation of the radar characteristics, covering aspects such as resolution, inter-radar interference, and coverage area. Additionally, we introduce innovative strategies, including dynamic DBSCAN clustering based on signal energy levels, a probability matrix for enhanced target tracking, target-status prediction for fall detection, and a feedback loop for noise reduction. We conduct an extensive evaluation using over 300 minutes of data, equating to approximately 360,000 frames. Our prototype system achieves a tracking precision of 98.9% for a single target, and 96.5% and 94.0% for two and three targets, respectively. For human fall detection, the system achieves a high accuracy of 98.2%, underscoring its effectiveness in distinguishing falls from other statuses.
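An illustrative sketch of energy-aware DBSCAN clustering on a single radar frame; the energy threshold and the eps scaling rule are hypothetical stand-ins for the system's dynamic parameters.

```python
# Illustrative sketch: per-frame DBSCAN whose eps shrinks when the reflected
# signal energy is high (tight, strong returns) and grows when it is low.
# The 20 dB threshold and scaling rule are assumptions, not the system's values.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_frame(points_xyz: np.ndarray, energy_db: np.ndarray,
                  base_eps: float = 0.3, min_samples: int = 5) -> np.ndarray:
    """Cluster one radar frame; points_xyz is (N, 3), energy_db is (N,)."""
    mean_energy = float(energy_db.mean())
    eps = base_eps * (0.5 if mean_energy > 20.0 else 1.0)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xyz)
    return labels  # -1 marks noise; other labels are candidate human targets
```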
Abstract:Millimetre-wave (mmWave) radar has emerged as an attractive and cost-effective alternative to traditional camera-based systems for human activity sensing. mmWave radars are also non-intrusive, providing better protection of user privacy. However, as a Radio Frequency (RF) based technology, mmWave radars rely on capturing reflected signals from objects, making them more prone to noise than cameras. This raises an intriguing question for the deep learning community: can we develop more effective point-set-based deep learning methods for such attractive sensors? To answer it, our work, termed MiliPoint, provides a large-scale, open dataset for the community to explore how mmWave radars can be utilised for human activity recognition. MiliPoint stands out as it is larger than existing datasets, represents more diverse human actions, and encompasses all three key tasks in human activity recognition. We have also established a range of point-based deep neural networks, such as DGCNN, PointNet++ and PointTransformer, on MiliPoint, which serve as baselines for further development.
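A toy PointNet-style classifier showing how point-set networks consume per-frame mmWave point clouds; this minimal model is for illustration only and is not one of the MiliPoint reference baselines (DGCNN, PointNet++, PointTransformer).

```python
# Toy PointNet-style classifier over (batch, num_points, 3) mmWave frames;
# a sketch for illustration, not one of the MiliPoint baselines.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feats = self.point_mlp(points)          # per-point features
        global_feat = feats.max(dim=1).values   # order-invariant pooling
        return self.classifier(global_feat)     # per-frame class logits
```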
Abstract:Millimetre-wave (mmWave) radars can generate 3D point clouds to represent objects in a scene. However, the accuracy and density of the generated point cloud can be lower than those of a laser sensor. Although researchers have used mmWave radars for various applications, there are few quantitative evaluations of the quality of the point cloud generated by the radar, and there is no standard for assessing this quality. This work aims to fill that gap in the literature. A radar simulator is built to evaluate the most common data processing chains for 3D point cloud construction and to examine the capability of the mmWave radar as a 3D imaging sensor under various factors. It is shown that radar detections can be noisy and have an imbalanced distribution. To address this problem, a novel super-resolution point cloud construction (SRPC) algorithm is proposed to improve the spatial resolution of the point cloud; it is shown to produce a more natural point cloud and to reduce outliers.
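One standard way to refine range estimates in an FMCW processing chain is to zero-pad the ADC samples before the range FFT, interpolating the spectrum onto a finer grid. The sketch below illustrates this kind of resolution refinement only; it is not the proposed SRPC algorithm, and the bandwidth and padding factor are assumed values.

```python
# Sketch of a common resolution-refinement step in an FMCW range FFT:
# zero-padding interpolates the spectrum onto a finer range grid. This is not
# the SRPC algorithm; the radar parameters below are assumptions.
import numpy as np

def estimate_range(adc_samples: np.ndarray, pad_factor: int = 4,
                   bandwidth_hz: float = 4e9) -> float:
    """Estimate the range (m) of the strongest reflector in one chirp."""
    n = len(adc_samples)  # assumed to span the full chirp bandwidth
    spectrum = np.abs(np.fft.rfft(adc_samples, n=pad_factor * n))
    peak_bin = int(np.argmax(spectrum))
    c = 3e8
    bin_width_m = c / (2.0 * bandwidth_hz * pad_factor)  # finer than c / (2B)
    return peak_bin * bin_width_m
```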