Abstract:Autoregressive language models demonstrate excellent performance in various scenarios. However, their inference efficiency is limited by the one-step-one-word generation mode, which has become a pressing problem as models grow increasingly large. Speculative decoding employs a "draft and then verify" mechanism that allows multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm for constructing adaptive and scalable draft trees. It searches for the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.
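The idea of choosing a draft tree that maximizes the expected acceptance length can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the draft model's probability for a token approximates the chance that the token is accepted, so a node's acceptance probability is the product of probabilities along its path, and the expected number of accepted draft tokens is the sum of these products over all nodes. Under that assumption, a best-first search that keeps the `budget` nodes with the largest path probabilities yields a valid tree, because a child's path probability never exceeds its parent's. The names `build_draft_tree`, `expected_accepted_tokens`, and `toy_draft_model` are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]  # token ids from the root to a node


def build_draft_tree(
    next_probs: Callable[[Path], Dict[int, float]], budget: int
) -> List[Tuple[Path, float]]:
    """Pick the `budget` draft nodes with the largest path probabilities.

    `next_probs(path)` is a hypothetical stand-in for the draft model: it
    returns candidate next tokens and their probabilities given a prefix.
    """
    # Max-heap via negated probabilities: entries are (-path_prob, path).
    heap = [(-p, (tok,)) for tok, p in next_probs(()).items()]
    heapq.heapify(heap)
    selected: List[Tuple[Path, float]] = []
    while heap and len(selected) < budget:
        neg_p, path = heapq.heappop(heap)
        selected.append((path, -neg_p))
        # Children only become candidates once their parent is selected,
        # so the selected set is always a connected tree.
        for tok, p in next_probs(path).items():
            heapq.heappush(heap, (neg_p * p, path + (tok,)))
    return selected


def expected_accepted_tokens(nodes: List[Tuple[Path, float]]) -> float:
    """Sum of path probabilities (the assumed expected acceptance length)."""
    return sum(p for _, p in nodes)


def toy_draft_model(path: Path) -> Dict[int, float]:
    """Context-independent toy distribution, for illustration only."""
    return {0: 0.6, 1: 0.3, 2: 0.1}


if __name__ == "__main__":
    tree = build_draft_tree(toy_draft_model, budget=4)
    for path, prob in tree:
        print(path, round(prob, 3))
    print("expected accepted draft tokens:", expected_accepted_tokens(tree))
```

With the toy distribution, the search spends most of the node budget deepening the most likely branch rather than widening the tree, which is the kind of adaptation a fixed heuristic structure cannot provide.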
Abstract:We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.
Abstract:Large Language Models (LLMs) have played an important role in many fields due to their powerful capabilities. However, their massive number of parameters leads to high deployment requirements and incurs significant inference costs, which impedes their practical applications. Training smaller models is an effective way to address this problem. Therefore, we introduce OpenBA-V2, a 3.4B model derived from multi-stage compression and continual pre-training from the original 15B OpenBA model. OpenBA-V2 utilizes more data, more flexible training objectives, and techniques such as layer pruning, neural pruning, and vocabulary pruning to achieve a compression rate of 77.3% with minimal performance loss. OpenBA-V2 demonstrates competitive performance compared to other open-source models of similar size, achieving results close to or on par with the 15B OpenBA model in downstream tasks such as common sense reasoning and Named Entity Recognition (NER). OpenBA-V2 illustrates that LLMs can be compressed into smaller ones with minimal performance loss by employing advanced training objectives and data strategies, which may help deploy LLMs in resource-limited scenarios.
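As a quick consistency check, the stated compression rate agrees with the rounded parameter counts given in the abstract, assuming the rate is defined as the fraction of parameters removed:

```latex
\text{compression rate} = 1 - \frac{N_{\text{compressed}}}{N_{\text{original}}}
                        = 1 - \frac{3.4\,\text{B}}{15\,\text{B}}
                        \approx 0.773 = 77.3\%
```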
Abstract:Following step-by-step procedures is an essential component of various activities carried out by individuals in their daily lives. These procedures serve as a guiding framework that helps to achieve goals efficiently, whether it is assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and the ability to reason about the structure of the activity. To this end, we collect a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. This dataset consists of two distinct types of activity: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset for the following tasks: supervised error recognition, multistep localization, and procedure learning.
Abstract:Agile quadrotor flight relies on rapidly planning and accurately tracking time-optimal trajectories, a technology critical to the application of quadrotors in the wild. However, the computational burden of computing time-optimal trajectories based on the full quadrotor dynamics (typically on the order of minutes or even hours) can hinder the quadrotor's ability to respond quickly to changing scenarios. Additionally, modeling errors and external disturbances can lead to deviations from the desired trajectory during real-time tracking. This letter proposes a novel approach to computing time-optimal trajectories by fixing the nodes with waypoint constraints and adopting separate sampling intervals for trajectories between waypoints, which significantly accelerates trajectory planning. Furthermore, the planned paths are tracked via a time-adaptive model predictive control scheme whose allocated tracking time can be adaptively adjusted on the fly, thereby enhancing tracking accuracy and robustness. We evaluate our approach through simulations and experimentally validate its performance in dynamic waypoint scenarios for time-optimal trajectory replanning and trajectory tracking.
Abstract:Recent research has focused on generating human models and garments from 2D images. However, state-of-the-art methods either model only a single garment layer on a human body or generate multiple garment layers without any guarantee of an intersection-free geometric relationship between them. In reality, people wear multiple layers of garments in their daily lives, where an inner garment layer may be partially covered by an outer one. In this paper, we address this multi-layer modeling problem and propose the Layered-Garment Net (LGN), which is capable of generating intersection-free multiple layers of garments defined by implicit function fields over the body surface, given the person's near front-view image. With a special design of garment indication fields (GIF), we can enforce an implicit covering relationship between the signed distance fields (SDF) of different layers to avoid self-intersections among different garment surfaces and the human body. Experiments demonstrate the strength of our proposed LGN framework in generating multi-layer garments compared to state-of-the-art methods. To the best of our knowledge, LGN is the first work to generate intersection-free multiple layers of garments on the human body from a single image.
Abstract:Multi-illuminant color constancy is a challenging problem with only a few existing methods. For example, one prior work used a small set of predefined white balance settings and spatially blended among them, limiting the solution to predefined illuminations. Another method proposed a generative adversarial network and an angular loss, yet its performance is suboptimal due to the lack of regularization for multi-illumination colors. This paper introduces a transformer-based multi-task learning method to estimate single and multiple light colors from a single input image. To give our deep learning model better cues about the light colors, achromatic-pixel detection and edge detection are used as auxiliary tasks in our multi-task learning setting. By exploiting content features extracted from the input image as tokens, illuminant color correlations between pixels are learned by leveraging contextual information in our transformer. Our transformer approach is further assisted by a contrastive loss defined between the input, output, and ground truth. We demonstrate that our proposed model achieves a 40.7% improvement over a state-of-the-art multi-illuminant color constancy method on a multi-illuminant dataset (LSMI). Moreover, our model maintains robust performance on the single-illuminant dataset (NUS-8) and provides a 22.3% improvement over the state-of-the-art single-illuminant color constancy method.
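The angular loss mentioned above is not defined in the abstract. A minimal sketch, assuming it follows the angular error commonly used in color constancy (the angle between the predicted and ground-truth illuminant RGB vectors), is shown below. The function name `angular_error_deg` is hypothetical, and whether the reported improvements are measured with this metric is not stated in the abstract.

```python
import math
from typing import Sequence


def angular_error_deg(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Angle in degrees between two illuminant color vectors.

    The measure ignores vector magnitude, so it compares chromaticity
    only. Both inputs must be non-zero vectors of equal length.
    """
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # A neutral estimate compared with a slightly warm ground-truth illuminant.
    print(angular_error_deg((1.0, 1.0, 1.0), (1.0, 0.9, 0.8)))
```

For a multi-illuminant image, one plausible use is to evaluate this error per pixel against a ground-truth illumination map and then average; that aggregation is an assumption here.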
Abstract:Accurate trajectory prediction of vehicles is essential for reliable autonomous driving. To maintain consistent performance as a vehicle drives through different cities, it is crucial to adapt to changing traffic circumstances and achieve a lifelong trajectory prediction model. To realize this, catastrophic forgetting is the main problem to be addressed. In this paper, a divergence measurement method based on conditional Kullback-Leibler divergence is first proposed to evaluate the difference in spatiotemporal dependency among varied driving circumstances. Then, based on generative replay, a novel lifelong vehicle trajectory prediction framework is developed. The framework consists of a conditional generation model and a vehicle trajectory prediction model. The conditional generation model is a generative adversarial network conditioned on the position configuration of vehicles. After learning and merging the trajectory distributions of vehicles across different cities, the generation model replays trajectories with prior samplings as inputs, which alleviates catastrophic forgetting. The vehicle trajectory prediction model is trained on the replayed trajectories and achieves consistent prediction performance on visited cities. A lifelong experiment setup is established on four open datasets comprising five tasks. Spatiotemporal dependency divergence is calculated for the different tasks. Despite these divergences, the proposed framework exhibits lifelong learning ability and achieves consistent performance on all tasks.
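For intuition, the conditional Kullback-Leibler divergence between two conditional distributions P(Y|X) and Q(Y|X) is the KL divergence of Y given each condition x, averaged over P(x). The sketch below computes it for small discrete distributions. It is an illustration only: the abstract does not specify how the divergence is estimated from trajectory data, and the function name `conditional_kl` and the toy "city" distributions are hypothetical.

```python
import math
from typing import Dict

Marginal = Dict[str, float]
Conditional = Dict[str, Dict[str, float]]


def conditional_kl(
    p_x: Marginal, p_y_given_x: Conditional, q_y_given_x: Conditional,
    eps: float = 1e-12,
) -> float:
    """D_KL(P(Y|X) || Q(Y|X)) = sum_x P(x) sum_y P(y|x) log(P(y|x) / Q(y|x))."""
    total = 0.0
    for x, px in p_x.items():
        for y, pyx in p_y_given_x[x].items():
            if pyx > 0.0:
                # Floor Q at eps so unseen outcomes give a large, finite penalty.
                qyx = max(q_y_given_x[x].get(y, 0.0), eps)
                total += px * pyx * math.log(pyx / qyx)
    return total


if __name__ == "__main__":
    # Toy example: X is a coarse traffic configuration, Y a maneuver.
    p_x = {"dense": 0.5, "sparse": 0.5}
    city_a = {"dense": {"keep": 0.5, "turn": 0.5}, "sparse": {"keep": 0.9, "turn": 0.1}}
    city_b = {"dense": {"keep": 0.25, "turn": 0.75}, "sparse": {"keep": 0.9, "turn": 0.1}}
    print(conditional_kl(p_x, city_a, city_b))
```

A value of zero means the two cities share the same conditional behavior under every configuration; larger values indicate a larger shift between tasks.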
Abstract:In recent years, visual SLAM has made great progress in different scenes; however, many problems remain to be solved. A SLAM system is not only restricted by external scenes but is also affected by its movement mode, such as movement speed and rotational motion. As representatives of the best-performing networks for frame interpolation, Sepconv-slomo and EDSC can predict high-quality intermediate frames between the previous frame and the current frame. Intuitively, frame interpolation can enrich the information in image sequences, whose length is limited by the camera's frame rate, and thus decrease the probability that the SLAM system fails. In this article, we propose an InterpolationSLAM framework. InterpolationSLAM is robust to rotational movement for Monocular and RGB-D configurations. By detecting the rotation and performing interpolation processing at the rotated position, the pose of the system can be estimated more accurately, thereby improving the accuracy and robustness of the SLAM system under rotational movement.
Abstract:In recent years, visual SLAM has made great progress, but in complex scenes, especially rotating scenes, the mapping error increases significantly and the SLAM system easily loses track. In this article, we propose an InterpolationSLAM framework, which is a visual SLAM framework based on ORB-SLAM2. InterpolationSLAM is robust in rotating scenes for Monocular and RGB-D configurations. By detecting the rotation and performing interpolation processing at the rotated position, the pose of the system can be estimated more accurately there, thereby improving the accuracy and robustness of the SLAM system in rotating scenes. To the best of our knowledge, it is the first work to combine an interpolation network with a visual SLAM system to improve robustness in rotating scenes. We conduct experiments on both the KITTI Monocular and TUM RGB-D datasets. The results demonstrate that InterpolationSLAM achieves higher accuracy than standard visual SLAM baselines.