Abstract:End-effector based assistive robots face persistent challenges in generating smooth and robust trajectories when controlled by human's noisy and unreliable biosignals such as muscle activities and brainwaves. The produced endpoint trajectories are often jerky and imprecise to perform complex tasks such as stable robotic grasping. We propose STREAMS (Self-Training Robotic End-to-end Adaptive Multimodal Shared autonomy) as a novel framework leveraged deep reinforcement learning to tackle this challenge in biosignal based robotic control systems. STREAMS blends environmental information and synthetic user input into a Deep Q Learning Network (DQN) pipeline for an interactive end-to-end and self-training mechanism to produce smooth trajectories for the control of end-effector based robots. The proposed framework achieved a high-performance record of 98% in simulation with dynamic target estimation and acquisition without any pre-existing datasets. As a zero-shot sim-to-real user study with five participants controlling a physical robotic arm with noisy head movements, STREAMS (as an assistive mode) demonstrated significant improvements in trajectory stabilization, user satisfaction, and task performance reported as a success rate of 83% compared to manual mode which was 44% without any task support. STREAMS seeks to improve biosignal based assistive robotic controls by offering an interactive, end-to-end solution that stabilizes end-effector trajectories, enhancing task performance and accuracy.
Abstract:Multiple toddler tracking (MTT) involves identifying and differentiating toddlers in video footage. While conventional multi-object tracking (MOT) algorithms are adept at tracking diverse objects, toddlers pose unique challenges due to their unpredictable movements, various poses, and similar appearance. Tracking toddlers in indoor environments introduces additional complexities such as occlusions and limited fields of view. In this paper, we address the challenges of MTT and propose MTTSort, a customized method built upon the DeepSort algorithm. MTTSort is designed to track multiple toddlers in indoor videos accurately. Our contributions include discussing the primary challenges in MTT, introducing a genetic algorithm to optimize hyperparameters, proposing an accurate tracking algorithm, and curating the MTTrack dataset using unbiased AI co-labeling techniques. We quantitatively compare MTTSort to state-of-the-art MOT methods on MTTrack, DanceTrack, and MOT15 datasets. In our evaluation, the proposed method outperformed other MOT methods, achieving 0.98, 0.68, and 0.98 in multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), and iterative and discriminative framework 1 (IDF1) metrics, respectively.
Abstract:Automated human action recognition, a burgeoning field within computer vision, boasts diverse applications spanning surveillance, security, human-computer interaction, tele-health, and sports analysis. Precise action recognition in infants serves a multitude of pivotal purposes, encompassing safety monitoring, developmental milestone tracking, early intervention for developmental delays, fostering parent-infant bonds, advancing computer-aided diagnostics, and contributing to the scientific comprehension of child development. This paper delves into the intricacies of infant action recognition, a domain that has remained relatively uncharted despite the accomplishments in adult action recognition. In this study, we introduce a groundbreaking dataset called ``InfActPrimitive'', encompassing five significant infant milestone action categories, and we incorporate specialized preprocessing for infant data. We conducted an extensive comparative analysis employing cutting-edge skeleton-based action recognition models using this dataset. Our findings reveal that, although the PoseC3D model achieves the highest accuracy at approximately 71%, the remaining models struggle to accurately capture the dynamics of infant actions. This highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.
Abstract:Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.
Abstract:Respiration is a critical vital sign for infants, and continuous respiratory monitoring is particularly important for newborns. However, neonates are sensitive and contact-based sensors present challenges in comfort, hygiene, and skin health, especially for preterm babies. As a step toward fully automatic, continuous, and contactless respiratory monitoring, we develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings. Our automated infant respiration flow-based network (AIRFlowNet) combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. We support our model with the first public annotated infant respiration dataset with 125 videos (AIR-125), drawn from eight infant subjects, set varied pose, lighting, and camera conditions. We include manual respiration annotations and optimize AIRFlowNet training on them using a novel spectral bandpass loss function. When trained and tested on the AIR-125 infant data, our method significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of $\sim$2.9 breaths per minute, compared to $\sim$4.7--6.2 for other public models designed for adult subjects and more uniform environments.
Abstract:Modern autonomous systems require extensive testing to ensure reliability and build trust in ground vehicles. However, testing these systems in the real-world is challenging due to the lack of large and diverse datasets, especially in edge cases. Therefore, simulations are necessary for their development and evaluation. However, existing open-source simulators often exhibit a significant gap between synthetic and real-world domains, leading to deteriorated mobility performance and reduced platform reliability when using simulation data. To address this issue, our Scoping Autonomous Vehicle Simulation (SAVeS) platform benchmarks the performance of simulated environments for autonomous ground vehicle testing between synthetic and real-world domains. Our platform aims to quantify the domain gap and enable researchers to develop and test autonomous systems in a controlled environment. Additionally, we propose using domain adaptation technologies to address the domain gap between synthetic and real-world data with our SAVeS$^+$ extension. Our results demonstrate that SAVeS$^+$ is effective in helping to close the gap between synthetic and real-world domains and yields comparable performance for models trained with processed synthetic datasets to those trained on real-world datasets of same scale. This paper highlights our efforts to quantify and address the domain gap between synthetic and real-world data for autonomy simulation. By enabling researchers to develop and test autonomous systems in a controlled environment, we hope to bring autonomy simulation one step closer to realization.
Abstract:Animal pose estimation has become a crucial area of research, but the scarcity of annotated data is a significant challenge in developing accurate models. Synthetic data has emerged as a promising alternative, but it frequently exhibits domain discrepancies with real data. Style transfer algorithms have been proposed to address this issue, but they suffer from insufficient spatial correspondence, leading to the loss of label information. In this work, we present a new approach called Synthetic Pose-aware Animal ControlNet (SPAC-Net), which incorporates ControlNet into the previously proposed Prior-Aware Synthetic animal data generation (PASyn) pipeline. We leverage the plausible pose data generated by the Variational Auto-Encoder (VAE)-based data generation pipeline as input for the ControlNet Holistically-nested Edge Detection (HED) boundary task model to generate synthetic data with pose labels that are closer to real data, making it possible to train a high-precision pose estimation network without the need for real data. In addition, we propose the Bi-ControlNet structure to separately detect the HED boundary of animals and backgrounds, improving the precision and stability of the generated data. Using the SPAC-Net pipeline, we generate synthetic zebra and rhino images and test them on the AP10K real dataset, demonstrating superior performance compared to using only real images or synthetic data generated by other methods. Our work demonstrates the potential for synthetic data to overcome the challenge of limited annotated data in animal pose estimation.
Abstract:We present an end-to-end computer vision pipeline to detect non-nutritive sucking (NNS) -- an infant sucking pattern with no nutrition delivered -- as a potential biomarker for developmental delays, using off-the-shelf baby monitor video footage. One barrier to clinical (or algorithmic) assessment of NNS stems from its sparsity, requiring experts to wade through hours of footage to find minutes of relevant activity. Our NNS activity segmentation algorithm solves this problem by identifying periods of NNS with high certainty -- up to 94.0\% average precision and 84.9\% average recall across 30 heterogeneous 60 s clips, drawn from our manually annotated NNS clinical in-crib dataset of 183 hours of overnight baby monitor footage from 19 infants. Our method is based on an underlying NNS action recognition algorithm, which uses spatiotemporal deep learning networks and infant-specific pose estimation, achieving 94.9\% accuracy in binary classification of 960 2.5 s balanced NNS vs. non-NNS clips. Tested on our second, independent, and public NNS in-the-wild dataset, NNS recognition classification reaches 92.3\% accuracy, and NNS segmentation achieves 90.8\% precision and 84.2\% recall.
Abstract:We apply computer vision pose estimation techniques developed expressly for the data-scarce infant domain to the study of torticollis, a common condition in infants for which early identification and treatment is critical. Specifically, we use a combination of facial landmark and body joint estimation techniques designed for infants to estimate a range of geometric measures pertaining to face and upper body symmetry, drawn from an array of sources in the physical therapy and ophthalmology research literature in torticollis. We gauge performance with a range of metrics and show that the estimates of most these geometric measures are successful, yielding strong to very strong Spearman's $\rho$ correlation with ground truth values. Furthermore, we show that these estimates, derived from pose estimation neural networks designed for the infant domain, cleanly outperform estimates derived from more widely known networks designed for the adult domain
Abstract:Accurately annotated image datasets are essential components for studying animal behaviors from their poses. Compared to the number of species we know and may exist, the existing labeled pose datasets cover only a small portion of them, while building comprehensive large-scale datasets is prohibitively expensive. Here, we present a very data efficient strategy targeted for pose estimation in quadrupeds that requires only a small amount of real images from the target animal. It is confirmed that fine-tuning a backbone network with pretrained weights on generic image datasets such as ImageNet can mitigate the high demand for target animal pose data and shorten the training time by learning the the prior knowledge of object segmentation and keypoint estimation in advance. However, when faced with serious data scarcity (i.e., $<10^2$ real images), the model performance stays unsatisfactory, particularly for limbs with considerable flexibility and several comparable parts. We therefore introduce a prior-aware synthetic animal data generation pipeline called PASyn to augment the animal pose data essential for robust pose estimation. PASyn generates a probabilistically-valid synthetic pose dataset, SynAP, through training a variational generative model on several animated 3D animal models. In addition, a style transfer strategy is utilized to blend the synthetic animal image into the real backgrounds. We evaluate the improvement made by our approach with three popular backbone networks and test their pose estimation accuracy on publicly available animal pose images as well as collected from real animals in a zoo.