Abstract: This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through parameter-efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudo-annotations derived directly from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Extensive experiments show that Humanoid-VLA, built upon whole-body control architectures, achieves object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
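To make the pseudo-annotation idea concrete, below is a minimal sketch of turning raw, unlabeled motion sequences into question-answer supervision. The data structure (MotionClip), the velocity-based heuristic labeler, and the QA templates are all illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: convert an unlabeled motion clip into QA pseudo-annotations.
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionClip:
    joints: np.ndarray  # (T, J, 3) joint positions over T frames (assumed layout)

def describe(clip: MotionClip) -> str:
    """Heuristic pseudo-label: coarse speed class from root-joint displacement."""
    root = clip.joints[:, 0, :]  # assume joint 0 is the pelvis/root
    speed = np.linalg.norm(np.diff(root, axis=0), axis=1).mean()
    if speed < 0.01:
        return "standing still"
    return "walking" if speed < 0.05 else "running"

def to_qa_pairs(clip: MotionClip) -> list[tuple[str, str]]:
    """Turn one unlabeled clip into question-answer training pairs."""
    label = describe(clip)
    return [
        ("What is the person doing in this motion sequence?", label),
        (f"Generate a motion where the person is {label}.", "<motion tokens>"),
    ]

# Usage: fabricate a short random clip and print its pseudo-annotations.
clip = MotionClip(joints=np.cumsum(np.random.randn(30, 24, 3) * 0.01, axis=0))
for q, a in to_qa_pairs(clip):
    print(q, "->", a)
```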
Abstract: In complex road traffic scenes, illegal lane intrusion by pedestrians or cyclists constitutes one of the main safety challenges for autonomous driving applications. In this paper, we propose a novel object-level phase space reconstruction network (PSRNet) for motion time series classification, aiming to recognize lane intrusion actions occurring 150 m ahead through a monocular camera fixed on a moving vehicle. In the PSRNet, the movement of pedestrians and cyclists, viewed as an observable object-level dynamic process, is reconstructed as trajectories of state vectors in a latent phase space and further characterized by a learnable Lyapunov exponent-like classifier, which discriminates actions by the average exponential divergence of their state trajectories. Additionally, to first transform video inputs into a one-dimensional motion time series for each object, we present a lane-width normalization scheme based on visual tracking-by-detection. Extensive experiments are conducted on the THU-IntrudBehavior dataset collected from real urban roads. The results show that PSRNet achieves an accuracy of 98.0%, exceeding existing action recognition approaches by more than 30%.
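The following sketch illustrates the core intuition behind a Lyapunov exponent-like score: embed a one-dimensional motion time series in a phase space via time-delay embedding, then measure the average exponential divergence of initially close state trajectories (in the spirit of Rosenstein's classical method). PSRNet's learnable network version is not reproduced here; the embedding dimension, delay, and horizon are illustrative assumptions.

```python
# Hedged sketch: delay embedding + nearest-neighbor divergence rate.
import numpy as np

def delay_embed(x: np.ndarray, dim: int = 3, tau: int = 2) -> np.ndarray:
    """Reconstruct state vectors [x_t, x_{t+tau}, ..., x_{t+(dim-1)tau}]."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, dim * tau, tau)], axis=1)

def divergence_rate(x: np.ndarray, dim: int = 3, tau: int = 2, horizon: int = 10) -> float:
    """Average log-rate at which nearest-neighbor state trajectories separate."""
    states = delay_embed(x, dim, tau)
    n = len(states)
    rates = []
    for i in range(n - horizon):
        # nearest neighbor of state i, excluding temporally adjacent points
        d = np.linalg.norm(states - states[i], axis=1)
        d[max(0, i - tau) : i + tau + 1] = np.inf
        j = int(np.argmin(d[: n - horizon]))
        d0 = d[j]
        dk = np.linalg.norm(states[i + horizon] - states[j + horizon])
        if d0 > 0 and dk > 0:
            rates.append(np.log(dk / d0) / horizon)
    return float(np.mean(rates)) if rates else 0.0

# Usage: a stochastic series diverges faster than a periodic one.
t = np.linspace(0, 20, 400)
print(divergence_rate(np.sin(t)))                        # near zero
print(divergence_rate(np.cumsum(np.random.randn(400))))  # typically larger
```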
Abstract: Object detection in videos has drawn increasing attention recently, since it is closer to real-world scenarios than still-image detection. Most deep learning methods for video analysis apply convolutional neural networks designed for image-wise parsing frame by frame to a video stream, ignoring the fact that a video is generally stored and transmitted in a compressed data format. In this paper, we propose a fast object detection model incorporating a light-weight motion-aided memory network (MMNet), which can operate directly on H.264 compressed video. MMNet has two major advantages: 1) for a group of pictures (GOP) in a compressed video stream, it runs the heavy computational network only for I-frames, i.e., the few reference frames in a video, while a light-weight memory network generates features for the predicted frames (P-frames); 2) instead of building an additional network to explicitly model motion among frames, it directly exploits the motion vectors and residual errors already encoded in the compressed video; these signals preserve spatial variations and are freely available. To the best of our knowledge, MMNet is the first work to explore a convolutional detector on compressed video together with a motion-based memory to achieve significant speedup. Our model is evaluated on the large-scale ImageNet VID dataset, and the results show that it is about 3x faster than the single-image detector R-FCN and about 10x faster than high-performance detectors such as FGFA and MANet.
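Below is a minimal sketch of the propagation idea MMNet exploits: features computed once on an I-frame are warped to P-frames using the motion vectors already present in the H.264 bitstream, instead of re-running the heavy backbone per frame. The nearest-neighbor warp on a numpy feature map is an illustrative stand-in; the actual memory network, residual-error fusion, and feature shapes are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: propagate I-frame features to a P-frame via motion vectors.
import numpy as np

def warp_features(feat: np.ndarray, mv: np.ndarray) -> np.ndarray:
    """Shift each spatial feature location by its decoded motion vector.

    feat: (H, W, C) features from the last full I-frame computation.
    mv:   (H, W, 2) per-location (dy, dx) motion vectors from the bitstream,
          assumed already downsampled to feature-map resolution.
    """
    h, w, _ = feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - mv[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - mv[..., 1]).astype(int), 0, w - 1)
    return feat[src_y, src_x]

# Usage: warp I-frame features one P-frame forward; in MMNet a light-weight
# memory network would then refine them (e.g., with the residual errors).
i_feat = np.random.randn(38, 50, 256).astype(np.float32)
motion = np.random.randint(-3, 4, size=(38, 50, 2)).astype(np.float32)
p_feat = warp_features(i_feat, motion)
print(p_feat.shape)  # (38, 50, 256) -- ready for the detection head
```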