Abstract:Despite the widespread integration of ambient light sensors (ALS) in smart devices commonly used for screen brightness adaptation, their application in human activity recognition (HAR), primarily through body-worn ALS, is largely unexplored. In this work, we developed ALS-HAR, a robust wearable light-based motion activity classifier. Although ALS-HAR achieves comparable accuracy to other modalities, its natural sensitivity to external disturbances, such as changes in ambient light, weather conditions, or indoor lighting, makes it challenging for daily use. To address such drawbacks, we introduce strategies to enhance environment-invariant IMU-based activity classifications through augmented multi-modal and contrastive classifications by transferring the knowledge extracted from the ALS. Our experiments on a real-world activity dataset for three different scenarios demonstrate that while ALS-HAR's accuracy strongly relies on external lighting conditions, cross-modal information can still improve other HAR systems, such as IMU-based classifiers.Even in scenarios where ALS performs insufficiently, the additional knowledge enables improved accuracy and macro F1 score by up to 4.2 % and 6.4 %, respectively, for IMU-based classifiers and even surpasses multi-modal sensor fusion models in two of our three experiment scenarios. Our research highlights the untapped potential of ALS integration in advancing sensor-based HAR technology, paving the way for practical and efficient wearable ALS-based activity recognition systems with potential applications in healthcare, sports monitoring, and smart indoor environments.
Abstract:Due to the scarcity of labeled sensor data in HAR, prior research has turned to video data to synthesize Inertial Measurement Units (IMU) data, capitalizing on its rich activity annotations. However, generating IMU data from videos presents challenges for HAR in real-world settings, attributed to the poor quality of synthetic IMU data and its limited efficacy in subtle, fine-grained motions. In this paper, we propose Multi$^3$Net, our novel multi-modal, multitask, and contrastive-based framework approach to address the issue of limited data. Our pretraining procedure uses videos from online repositories, aiming to learn joint representations of text, pose, and IMU simultaneously. By employing video data and contrastive learning, our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.Our experimental findings validate the effectiveness of our approach in improving HAR performance with IMU data. We demonstrate that models trained with synthetic IMU data generated from videos using our method surpass existing approaches in recognizing fine-grained activities.
Abstract:In human activity recognition (HAR), the availability of substantial ground truth is necessary for training efficient models. However, acquiring ground pressure data through physical sensors itself can be cost-prohibitive, time-consuming. To address this critical need, we introduce Text-to-Pressure (T2P), a framework designed to generate extensive ground pressure sequences from textual descriptions of human activities using deep learning techniques. We show that the combination of vector quantization of sensor data along with simple text conditioned auto regressive strategy allows us to obtain high-quality generated pressure sequences from textual descriptions with the help of discrete latent correlation between text and pressure maps. We achieved comparable performance on the consistency between text and generated motion with an R squared value of 0.722, Masked R squared value of 0.892, and FID score of 1.83. Additionally, we trained a HAR model with the the synthesized data and evaluated it on pressure dynamics collected by a real pressure sensor which is on par with a model trained on only real data. Combining both real and synthesized training data increases the overall macro F1 score by 5.9 percent.
Abstract:We present a novel local-global feature fusion framework for body-weight exercise recognition with floor-based dynamic pressure maps. One step further from the existing studies using deep neural networks mainly focusing on global feature extraction, the proposed framework aims to combine local and global features using image processing techniques and the YOLO object detection to localize pressure profiles from different body parts and consider physical constraints. The proposed local feature extraction method generates two sets of high-level local features consisting of cropped pressure mapping and numerical features such as angular orientation, location on the mat, and pressure area. In addition, we adopt a knowledge distillation for regularization to preserve the knowledge of the global feature extraction and improve the performance of the exercise recognition. Our experimental results demonstrate a notable 11 percent improvement in F1 score for exercise recognition while preserving label-specific features.
Abstract:We propose PressureTransferNet, a novel method for Human Activity Recognition (HAR) using ground pressure information. Our approach generates body-specific dynamic ground pressure profiles for specific activities by leveraging existing pressure data from different individuals. PressureTransferNet is an encoder-decoder model taking a source pressure map and a target human attribute vector as inputs, producing a new pressure map reflecting the target attribute. To train the model, we use a sensor simulation to create a diverse dataset with various human attributes and pressure profiles. Evaluation on a real-world dataset shows its effectiveness in accurately transferring human attributes to ground pressure profiles across different scenarios. We visually confirm the fidelity of the synthesized pressure shapes using a physics-based deep learning model and achieve a binary R-square value of 0.79 on areas with ground contact. Validation through classification with F1 score (0.911$\pm$0.015) on physical pressure mat data demonstrates the correctness of the synthesized pressure maps, making our method valuable for data augmentation, denoising, sensor simulation, and anomaly detection. Applications span sports science, rehabilitation, and bio-mechanics, contributing to the development of HAR systems.
Abstract:To help smart wearable researchers choose the optimal ground truth methods for motion capturing (MoCap) for all types of loose garments, we present a benchmark, DrapeMoCapBench (DMCB), specifically designed to evaluate the performance of optical marker-based and marker-less MoCap. High-cost marker-based MoCap systems are well-known as precise golden standards. However, a less well-known caveat is that they require skin-tight fitting markers on bony areas to ensure the specified precision, making them questionable for loose garments. On the other hand, marker-less MoCap methods powered by computer vision models have matured over the years, which have meager costs as smartphone cameras would suffice. To this end, DMCB uses large real-world recorded MoCap datasets to perform parallel 3D physics simulations with a wide range of diversities: six levels of drape from skin-tight to extremely draped garments, three levels of motions and six body type - gender combinations to benchmark state-of-the-art optical marker-based and marker-less MoCap methods to identify the best-performing method in different scenarios. In assessing the performance of marker-based and low-cost marker-less MoCap for casual loose garments both approaches exhibit significant performance loss (>10cm), but for everyday activities involving basic and fast motions, marker-less MoCap slightly outperforms marker-based MoCap, making it a favorable and cost-effective choice for wearable studies.
Abstract:Folding is an unique structural technique to enable planer materials with motion or 3D mechanical properties. Textile-based capacitive sensing has shown to be sensitive to the geometry deformation and relative motion of conductive textiles. In this work, we propose a novel self-tracking foldable smart textile by combining folded fabric structures and capacitive sensing to detect the structural motions using state-of-the-art sensing circuits and deep learning technologies. We created two folding patterns, Accordion and Chevron, each with two layouts of capacitive sensors in the form of thermobonded conductive textile patches. In an experiment of manually moving patches of the folding patterns, we developed deep neural network to learn and reconstruct the vision-tracked shape of the patches. Through our approach, the geometry primitives defining the patch shape can be reconstructed from the capacitive signals with R-squared value of up to 95\% and tracking error of 1cm for 22.5cm long patches. With mechanical, electrical and sensing properties, Capafoldable could enable a new range of smart textile applications.
Abstract:Accurate camera calibration is crucial for various computer vision applications. However, measuring camera parameters in the real world is challenging and arduous, and there needs to be a dataset with ground truth to evaluate calibration algorithms' accuracy. In this paper, we present SynthCal, a synthetic camera calibration benchmarking pipeline that generates images of calibration patterns to measure and enable accurate quantification of calibration algorithm performance in camera parameter estimation. We present a SynthCal-generated calibration dataset with four common patterns, two camera types, and two environments with varying view, distortion, lighting, and noise levels. The dataset evaluates single-view calibration algorithms by measuring reprojection and root-mean-square errors for identical patterns and camera settings. Additionally, we analyze the significance of different patterns using Zhang's method, which estimates intrinsic and extrinsic camera parameters with known correspondences between 3D points and their 2D projections in different configurations and environments. The experimental results demonstrate the effectiveness of SynthCal in evaluating various calibration algorithms and patterns.
Abstract:Online clothing shopping has become increasingly popular, but the high rate of returns due to size and fit issues has remained a major challenge. To address this problem, virtual try-on systems have been developed to provide customers with a more realistic and personalized way to try on clothing. In this paper, we propose a novel virtual try-on method called ClothFit, which can predict the draping shape of a garment on a target body based on the actual size of the garment and human attributes. Unlike existing try-on models, ClothFit considers the actual body proportions of the person and available cloth sizes for clothing virtualization, making it more appropriate for current online apparel outlets. The proposed method utilizes a U-Net-based network architecture that incorporates cloth and human attributes to guide the realistic virtual try-on synthesis. Specifically, we extract features from a cloth image using an auto-encoder and combine them with features from the user's height, weight, and cloth size. The features are concatenated with the features from the U-Net encoder, and the U-Net decoder synthesizes the final virtual try-on image. Our experimental results demonstrate that ClothFit can significantly improve the existing state-of-the-art methods in terms of photo-realistic virtual try-on results.
Abstract:Ground pressure exerted by the human body is a valuable source of information for human activity recognition (HAR) in unobtrusive pervasive sensing. While data collection from pressure sensors to develop HAR solutions requires significant resources and effort, we present a novel end-to-end framework, PresSim, to synthesize sensor data from videos of human activities to reduce such effort significantly. PresSim adopts a 3-stage process: first, extract the 3D activity information from videos with computer vision architectures; then simulate the floor mesh deformation profiles based on the 3D activity information and gravity-included physics simulation; lastly, generate the simulated pressure sensor data with deep learning models. We explored two approaches for the 3D activity information: inverse kinematics with mesh re-targeting, and volumetric pose and shape estimation. We validated PresSim with an experimental setup with a monocular camera to provide input and a pressure-sensing fitness mat (80x28 spatial resolution) to provide the sensor ground truth, where nine participants performed a set of predefined yoga sequences.