Abstract:Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.
Abstract:In the field of psychology, the static nature and lack of customization of psychological test scales, along with the challenge of quantifying psychological indicators, have long been critical issues. Despite numerous attempts to use AI to address psychological challenges, a dynamically interactive psychological test has yet to emerge. In contrast to traditional psychological assessment methods, we propose PsyDI, a multi-modal, interactive, and customized chatbot for psychological assessments, using the Myers-Briggs Type Indicator (MBTI) as an example. PsyDI initiates with user-related multi-modal information, then engaging in customized interaction to discern the user's MBTI type based on their multiple rounds of responses. Despite these advancements, accurately quantifying absolute value of psychological indicators remains challenging. To tackle such difficulty, we introduce the PsyDI framework that trains LLMs to discern the relative magnitude of psychological traits rather than their absolute values. Through various experiments, we demonstrate the effectiveness of the training techniques proposed in PsyDI on various datasets, and we have also launched its web version, reaching about ~3k accesses. Additionally, comprehensive post-deployment data analysis has provided profound insights into the implications and applications of PsyDI, demonstrating its potential to serve as a general framework for psychological assessment.
Abstract:Diffusion models have shown impressive performance in many domains, including image generation, time series prediction, and reinforcement learning. The algorithm demonstrates superior performance over the traditional GAN and transformer based methods. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. This has been an important research area to enhance such capability. Prior works adopt reinforcement learning to adjust the behavior of the diffusion models. However, RL methods not only require careful reward design and complex hyperparameter tuning, but also fails to incorporate rich natural language feedback. In this work, we propose iterative prompt relabeling (IP-RLDF), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. IP-RLDF first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on three different models, including SDv2, GLIGEN, and SDXL, testing their capability to generate images following instructions. With IP-RLDF, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods.
Abstract:4D human perception plays an essential role in a myriad of applications, such as home automation and metaverse avatar simulation. However, existing solutions which mainly rely on cameras and wearable devices are either privacy intrusive or inconvenient to use. To address these issues, wireless sensing has emerged as a promising alternative, leveraging LiDAR, mmWave radar, and WiFi signals for device-free human sensing. In this paper, we propose MM-Fi, the first multi-modal non-intrusive 4D human dataset with 27 daily or rehabilitation action categories, to bridge the gap between wireless sensing and high-level human perception tasks. MM-Fi consists of over 320k synchronized frames of five modalities from 40 human subjects. Various annotations are provided to support potential sensing tasks, e.g., human pose estimation and action recognition. Extensive experiments have been conducted to compare the sensing capacity of each or several modalities in terms of multiple tasks. We envision that MM-Fi can contribute to wireless sensing research with respect to action recognition, human pose estimation, multi-modal learning, cross-modal supervision, and interdisciplinary healthcare research.
Abstract:Industrial SAT formula generation is a critical yet challenging task for heuristic development and the surging learning-based methods in practical SAT applications. Existing SAT generation approaches can hardly simultaneously capture the global structural properties and maintain plausible computational hardness, which can be hazardous for the various downstream engagements. To this end, we first present an in-depth analysis for the limitation of previous learning methods in reproducing the computational hardness of original instances, which may stem from the inherent homogeneity in their adopted split-merge procedure. On top of the observations that industrial formulae exhibit clear community structure and oversplit substructures lead to the difficulty in semantic formation of logical structures, we propose HardSATGEN, which introduces a fine-grained control mechanism to the neural split-merge paradigm for SAT formula generation to better recover the structural and computational properties of the industrial benchmarks. Experimental results including evaluations on private corporate data and hyperparameter tuning over solvers in practical use show the significant superiority of HardSATGEN being the only method to successfully augments formulae maintaining similar computational hardness and capturing the global structural properties simultaneously. Compared to the best previous methods to our best knowledge, the average performance gains achieve 38.5% in structural statistics, 88.4% in computational metrics, and over 140.7% in the effectiveness of guiding solver development tuned by our generated instances.
Abstract:WiFi sensing has been evolving rapidly in recent years. Empowered by propagation models and deep learning methods, many challenging applications are realized such as WiFi-based human activity recognition and gesture recognition. However, in contrast to deep learning for visual recognition and natural language processing, no sufficiently comprehensive public benchmark exists. In this paper, we highlight the recent progress on deep learning enabled WiFi sensing, and then propose a benchmark, SenseFi, to study the effectiveness of various deep learning models for WiFi sensing. These advanced models are compared in terms of distinct sensing tasks, WiFi platforms, recognition accuracy, model size, computational complexity, feature transferability, and adaptability of unsupervised learning. It is also regarded as a tutorial for deep learning based WiFi sensing, starting from CSI hardware platform to sensing algorithms. The extensive experiments provide us with experiences in deep model design, learning strategy skills and training techniques for real-world applications. To the best of our knowledge, this is the first benchmark with an open-source library for deep learning in WiFi sensing research. The benchmark codes are available at https://github.com/CHENXINYAN-sg/WiFi-CSI-Sensing-Benchmark.
Abstract:WiFi technology has been applied to various places due to the increasing requirement of high-speed Internet access. Recently, besides network services, WiFi sensing is appealing in smart homes since it is device-free, cost-effective and privacy-preserving. Though numerous WiFi sensing methods have been developed, most of them only consider single smart home scenario. Without the connection of powerful cloud server and massive users, large-scale WiFi sensing is still difficult. In this paper, we firstly analyze and summarize these obstacles, and propose an efficient large-scale WiFi sensing framework, namely EfficientFi. The EfficientFi works with edge computing at WiFi APs and cloud computing at center servers. It consists of a novel deep neural network that can compress fine-grained WiFi Channel State Information (CSI) at edge, restore CSI at cloud, and perform sensing tasks simultaneously. A quantized auto-encoder and a joint classifier are designed to achieve these goals in an end-to-end fashion. To the best of our knowledge, the EfficientFi is the first IoT-cloud-enabled WiFi sensing framework that significantly reduces communication overhead while realizing sensing tasks accurately. We utilized human activity recognition and identification via WiFi sensing as two case studies, and conduct extensive experiments to evaluate the EfficientFi. The results show that it compresses CSI data from 1.368Mb/s to 0.768Kb/s with extremely low error of data reconstruction and achieves over 98% accuracy for human activity recognition.