Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shijia Huang

Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Dec 11, 2025

Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Figure 1 for Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Figure 2 for Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Figure 3 for Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Figure 4 for Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Abstract:Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.

Via

Access Paper or Ask Questions

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

May 30, 2025

Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Abstract:Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.

Via

Access Paper or Ask Questions

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Nov 30, 2024

Duo Zheng, Shijia Huang, Liwei Wang

Figure 1 for Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Figure 2 for Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Figure 3 for Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Figure 4 for Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.

* 14 pages, 4 figures

Via

Access Paper or Ask Questions

Enhancing Temporal Modeling of Video LLMs via Time Gating

Oct 08, 2024

Zi-Yuan Hu, Yiwu Zhong, Shijia Huang, Michael R. Lyu, Liwei Wang

Figure 1 for Enhancing Temporal Modeling of Video LLMs via Time Gating

Figure 2 for Enhancing Temporal Modeling of Video LLMs via Time Gating

Figure 3 for Enhancing Temporal Modeling of Video LLMs via Time Gating

Figure 4 for Enhancing Temporal Modeling of Video LLMs via Time Gating

Abstract:Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering. However, most existing Video LLMs neglect temporal information in video data, leading to struggles with temporal-aware video understanding. To address this gap, we propose a Time Gating Video LLM (TG-Vid) designed to enhance temporal modeling through a novel Time Gating module (TG). The TG module employs a time gating mechanism on its sub-modules, comprising gating spatial attention, gating temporal attention, and gating MLP. This architecture enables our model to achieve a robust understanding of temporal information within videos. Extensive evaluation of temporal-sensitive video benchmarks (i.e., MVBench, TempCompass, and NExT-QA) demonstrates that our TG-Vid model significantly outperforms the existing Video LLMs. Further, comprehensive ablation studies validate that the performance gains are attributed to the designs of our TG module. Our code is available at https://github.com/LaVi-Lab/TG-Vid.

* EMNLP 2024 Findings (Short)

Via

Access Paper or Ask Questions

Towards Learning a Generalist Model for Embodied Navigation

Dec 06, 2023

Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, Liwei Wang

Figure 1 for Towards Learning a Generalist Model for Embodied Navigation

Figure 2 for Towards Learning a Generalist Model for Embodied Navigation

Figure 3 for Towards Learning a Generalist Model for Embodied Navigation

Figure 4 for Towards Learning a Generalist Model for Embodied Navigation

Abstract:Building a generalist agent that can interact with the world is the intriguing target of AI systems, thus spurring the research for embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields, and provided a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training, equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous stats-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.

* 13 pages, 3 figures. Official code: https://github.com/zd11024/NaviLLM

Via

Access Paper or Ask Questions

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Dec 05, 2023

Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li(+1 more)

Figure 1 for LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Figure 2 for LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Figure 3 for LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Figure 4 for LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Abstract:With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding .

Via

Access Paper or Ask Questions

CLEVA: Chinese Language Models EVAluation Platform

Aug 09, 2023

Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu(+1 more)

Figure 1 for CLEVA: Chinese Language Models EVAluation Platform

Figure 2 for CLEVA: Chinese Language Models EVAluation Platform

Figure 3 for CLEVA: Chinese Language Models EVAluation Platform

Figure 4 for CLEVA: Chinese Language Models EVAluation Platform

Abstract:With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 influential Chinese LLMs have validated CLEVA's efficacy.

Via

Access Paper or Ask Questions

MP-Former: Mask-Piloted Transformer for Image Segmentation

Mar 15, 2023

Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M. Ni, Lei Zhang

Abstract:We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in mask-attention, the ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Based on this technique, our \M achieves a remarkable performance improvement on all three image segmentation tasks (instance, panoptic, and semantic), yielding $+2.3$AP and $+1.6$mIoU on the Cityscapes instance and semantic segmentation tasks with a ResNet-50 backbone. Our method also significantly speeds up the training, outperforming Mask2Former with half of the number of training epochs on ADE20K with both a ResNet-50 and a Swin-L backbones. Moreover, our method only introduces little computation during training and no extra computation during inference. Our code will be released at \url{https://github.com/IDEA-Research/MP-Former}.

* CVPR 2023

Via

Access Paper or Ask Questions

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Nov 30, 2022

Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su, Jun Zhu, Lei Zhang

Figure 1 for DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Figure 2 for DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Figure 3 for DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Figure 4 for DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Abstract:In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves $91.04\%$ and $83.51\%$ in terms of recall rate on RefCOCO testA and testB with a ResNet-101 backbone. Code will be availabl at \url{https://github.com/IDEA-Research/DQ-DETR}.

* Accepted to AAAI 2023

Via

Access Paper or Ask Questions

A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation

Nov 15, 2022

Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Liwei Wang

Abstract:Reference Expression Segmentation (RES) and Reference Expression Generation (REG) are mutually inverse tasks that can be naturally jointly trained. Though recent work has explored such joint training, the mechanism of how RES and REG can benefit each other is still unclear. In this paper, we propose a unified mutual supervision framework that enables two tasks to improve each other. Our mutual supervision contains two directions. On the one hand, Disambiguation Supervision leverages the expression unambiguity measurement provided by RES to enhance the language generation of REG. On the other hand, Generation Supervision uses expressions automatically generated by REG to scale up the training of RES. Such unified mutual supervision effectively improves two tasks by solving their bottleneck problems. Extensive experiments show that our approach significantly outperforms all existing methods on REG and RES tasks under the same setting, and detailed ablation studies demonstrate the effectiveness of all components in our framework.

Via

Access Paper or Ask Questions