Abstract:Spatial reasoning in Large Language Models (LLMs) is the foundation for embodied intelligence. However, even in simple maze environments, LLMs still struggle with long-term path planning, primarily because of spatial hallucination and the context inconsistency hallucination that arises during long-term reasoning. To address this challenge, this study proposes an innovative model, Spatial-to-Relational Transformation and Curriculum Q-Learning (S2RCQL). To address the spatial hallucination of LLMs, we propose the Spatial-to-Relational approach, which transforms spatial prompts into entity relations and paths representing entity relation chains. This approach fully taps the potential of LLMs in terms of sequential thinking. Building on this, we design a path-planning algorithm based on Q-learning to mitigate the context inconsistency hallucination, which enhances the reasoning ability of LLMs. Using the Q-value of state-action as auxiliary information for prompts, we correct the hallucinations of LLMs, thereby guiding LLMs to learn the optimal path. Finally, we propose a reverse curriculum learning technique based on LLMs to further mitigate the context inconsistency hallucination. By first reducing task difficulty, LLMs can rapidly accumulate successful experiences and then leverage them to tackle more complex tasks. We performed comprehensive experiments based on Baidu's self-developed LLM, ERNIE-Bot 4.0. The results showed that our S2RCQL achieved a 23% to 40% improvement in both success and optimality rates compared with advanced prompt engineering.
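As a rough illustration of the Q-learning component mentioned in the abstract above, the sketch below runs plain tabular Q-learning on a small grid maze and formats the learned Q-values of a state as text that could be appended to a prompt. It is not the authors' S2RCQL implementation and involves no LLM: the maze layout, reward values, hyperparameters, and the `q_hint` helper are all assumptions made for illustration.

```python
import random

# Hypothetical 4x4 maze (assumption): 0 = free cell, 1 = wall.
MAZE = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
START, GOAL = (0, 0), (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Apply an action; moving into a wall or off the grid leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] == 0:
        state = (r, c)
    # Assumed reward scheme: +1 at the goal, small step cost elsewhere.
    reward = 1.0 if state == GOAL else -0.01
    return state, reward, state == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {}  # maps (state, action) -> estimated value, default 0.0
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q.get((nxt, a), 0.0) for a in ACTIONS)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest-valued action from START until GOAL or the step cap."""
    state, path = START, [START]
    while state != GOAL and len(path) <= max_steps:
        action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        state, _, _ = step(state, action)
        path.append(state)
    return path


def q_hint(q, state):
    """Format the Q-values of one state as text (hypothetical prompt hint)."""
    return ", ".join(f"{a}: {q.get((state, a), 0.0):.2f}" for a in ACTIONS)
```

Calling `q = train()` and then `greedy_path(q)` yields a route from `START` to `GOAL`, and `q_hint(q, START)` produces a one-line summary of the action values at the start cell. How such a summary is actually worded and injected into the prompt is not specified in the abstract.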
Abstract:Being able to map the activities of others into one's own point of view is a fundamental human skill, present even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing on the potential applications in daily assistance and professional support, EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end, we present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. We expect EgoExoLearn to serve as an important resource for bridging actions across views, thus paving the way for creating AI agents capable of seamlessly learning by observing humans in the real world. Code and data can be found at: https://github.com/OpenGVLab/EgoExoLearn
Abstract:We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Different training stages guide our model to capture different levels of structure and semantic information through different pretext tasks. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. We scale both data and model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
Abstract:While several long-form VideoQA datasets have been introduced, neither the length of the videos used to curate questions nor the length of the sub-clips containing the clues needed to answer them has yet reached the criteria for genuine long-form video understanding. Moreover, their QAs are unduly narrow and modality-biased, lacking a wider view of understanding long-term video content with rich dynamics and complex narratives. To remedy this, we introduce MoVQA, a long-form movie question-answering dataset and benchmark that assesses the diverse cognitive capabilities of multimodal systems across multi-level temporal lengths, considering both video length and clue length. Additionally, to take a step towards human-level understanding of long-form video, versatile and multimodal question-answering is designed from the moviegoer's perspective to assess model capabilities along various perceptual and cognitive axes. Analysis involving various baselines reveals a consistent trend: the performance of all methods deteriorates significantly with increasing video and clue length. Meanwhile, our established baseline method shows some improvement, but there is still ample scope for enhancement on our challenging MoVQA dataset. We expect MoVQA to provide a new perspective and to encourage inspiring work on long-form video understanding research.
Abstract:Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in high-dimensional multi-view data. Contrastive learning (CL), which is a popular self-supervised learning method, has recently attracted considerable attention. In this study, we propose a novel multi-view feature extraction method based on triple contrastive heads (MFETCH), which combines the sample-, recovery-, and feature-level contrastive losses to extract the sufficient yet minimal subspace discriminative information in compliance with the information bottleneck principle. In MFETCH, we construct the feature-level contrastive loss, which removes the redundant information in the consistency information to achieve the minimality of the subspace discriminative information. Moreover, the recovery-level contrastive loss is also constructed in MFETCH, which captures the view-specific discriminative information to achieve the sufficiency of the subspace discriminative information. The numerical experiments demonstrate that the proposed method offers a strong advantage for multi-view feature extraction.
Abstract:Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in high-dimensional multi-view data. Contrastive learning (CL), which is a popular self-supervised learning method, has recently attracted considerable attention. Most CL-based methods were constructed only from the sample level. In this study, we propose a novel multi-view feature extraction method based on a dual contrastive head, which introduces a structural-level contrastive loss into the sample-level CL-based method. Structural-level CL pushes the potential subspace structures to be consistent across any two views, which helps sample-level CL extract discriminative features more effectively. Furthermore, we prove the relationships between structural-level CL and mutual information and probabilistic intra- and inter-scatter, which provide theoretical support for the excellent performance. Finally, numerical experiments on six real datasets demonstrate the superior performance of the proposed method compared to existing methods.
Abstract:Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
Abstract:Deep reinforcement learning has achieved great success in various fields thanks to its strong decision-making ability. However, the policy learning process requires a large amount of training time, which leads to high energy consumption. Inspired by the redundancy of neural networks, we propose a lightweight parallel training framework based on neural network compression, AcceRL, to accelerate policy learning while ensuring policy quality. Specifically, AcceRL speeds up experience collection by flexibly combining various neural network compression methods. Overall, AcceRL consists of five components, namely Actor, Learner, Compressor, Corrector, and Monitor. The Actor uses the Compressor to compress the Learner's policy network and interacts with the environment using the compressed network. The generated experiences are transformed by the Corrector with off-policy methods such as V-trace and Retrace. The corrected experiences are then fed to the Learner for policy learning. We believe this is the first general reinforcement learning framework that incorporates multiple neural network compression techniques. Extensive experiments conducted in gym show that AcceRL reduces the time cost of the actor by about 2.0× to 4.13× compared to traditional methods. Furthermore, AcceRL reduces the whole training time by about 29.8% to 40.3% compared to traditional methods while keeping the same policy quality.
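The abstract above names V-trace as one of the off-policy corrections the Corrector can apply. As a minimal sketch of that technique only, the function below computes V-trace value targets for a single trajectory using the standard truncated importance weights. It is not taken from AcceRL: the function name, argument layout, and default thresholds are assumptions, and `ratios` stands for the per-step ratio between the learner's policy probability and the (here, compressed) actor's policy probability for the action taken.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory.

    rewards[t]      reward received after step t
    values[t]       learner's value estimate V(x_t)
    bootstrap_value learner's value estimate for the state after the last step
    ratios[t]       importance ratio pi(a_t | x_t) / mu(a_t | x_t)
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # running value of v_{t+1} - V(x_{t+1})
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # truncated weight on the TD error
        c = min(c_bar, ratios[t])      # truncated weight on the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

When every ratio equals 1 (the actor and learner policies agree) and both thresholds are 1, the targets reduce to ordinary n-step bootstrapped returns; when a ratio is 0, the corresponding target falls back to the learner's own value estimate.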
Abstract:In this report, we present our champion solutions to five tracks of the Ego4D challenge. We leverage our developed InternVideo, a video foundation model, for five Ego4D tasks, including Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm for adapting the strong foundation model to downstream egocentric video understanding tasks with simple head designs. In these five tasks, the performance of InternVideo-Ego4D comprehensively surpasses the baseline methods and the champions of CVPR2022, demonstrating the powerful representation ability of InternVideo as a video foundation model. Our code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions
Abstract:In this study, we propose a feature extraction framework based on contrastive learning with adaptive positive and negative samples (CL-FEFA) that is suitable for unsupervised, supervised, and semi-supervised single-view feature extraction. CL-FEFA adaptively constructs the positive and negative samples from the results of feature extraction, which makes them more appropriate and accurate. Thereafter, the discriminative features are re-extracted according to the InfoNCE loss based on the previous positive and negative samples, which makes the intra-class samples more compact and the inter-class samples more dispersed. At the same time, using the potential structure information of subspace samples to dynamically construct positive and negative samples makes our framework more robust to noisy data. Furthermore, CL-FEFA considers the mutual information between positive samples, that is, similar samples in potential structures, which provides theoretical support for its advantages in feature extraction. Finally, numerical experiments show that the proposed framework has a strong advantage over traditional feature extraction methods and contrastive learning methods.
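To make the InfoNCE loss referenced in the abstract above concrete, the sketch below computes it for one anchor sample given explicit lists of positives and negatives. It illustrates the generic loss only, not the CL-FEFA framework: the use of cosine similarity, the temperature value, and the way positives and negatives are passed in are assumptions, and the adaptive construction of those sets described in the abstract is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positives, negatives, temperature=0.5):
    """InfoNCE loss for one anchor, averaged over its positives.

    Each positive is contrasted against all negatives: the loss is low when
    the anchor is more similar to the positive than to any negative.
    """
    neg_terms = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    neg_sum = sum(neg_terms)
    losses = []
    for p in positives:
        pos_term = math.exp(cosine(anchor, p) / temperature)
        losses.append(-math.log(pos_term / (pos_term + neg_sum)))
    return sum(losses) / len(losses)
```

For example, with `anchor = [1.0, 0.0]` and negatives `[[0.0, 1.0], [-1.0, 0.2]]`, a nearby positive such as `[0.9, 0.1]` gives a smaller loss than a distant one such as `[-1.0, 0.0]`, which is the behavior that pulls intra-class samples together and pushes inter-class samples apart.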