Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zehao Lin

TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models

Sep 09, 2025

Zongzheng Zhang, Haobo Xu, Zhuo Yang, Chenghao Yue, Zehao Lin, Huan-ang Gao, Ziwei Wang, Hao Zhao

Abstract:Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder.Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.

* Accepted to CoRL 2025, project page: \url{https://zzongzheng0918.github.io/Torque-Aware-VLA.github.io/}

Via

Access Paper or Ask Questions

Scalable Complexity Control Facilitates Reasoning Ability of LLMs

May 29, 2025

Liangkai Hang, Junjie Yao, Zhiwei Bai, Tianyi Chen, Yang Chen, Rongjie Diao, Hezhou Li, Pengxiao Lin, Zhiwei Wang, Cheng Xu(+10 more)

Abstract:The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.

Via

Access Paper or Ask Questions

Xinyu AI Search: Enhanced Relevance and Comprehensive Results with Rich Answer Presentations

May 28, 2025

Bo Tang, Junyi Zhu, Chenyang Xi, Yunhang Ge, Jiahao Wu, Yuchen Feng, Yijun Niu, Wenqiang Wei, Yu Yu, Chunyu Li(+8 more)

Abstract:Traditional search engines struggle to synthesize fragmented information for complex queries, while generative AI search engines face challenges in relevance, comprehensiveness, and presentation. To address these limitations, we introduce Xinyu AI Search, a novel system that incorporates a query-decomposition graph to dynamically break down complex queries into sub-queries, enabling stepwise retrieval and generation. Our retrieval pipeline enhances diversity through multi-source aggregation and query expansion, while filtering and re-ranking strategies optimize passage relevance. Additionally, Xinyu AI Search introduces a novel approach for fine-grained, precise built-in citation and innovates in result presentation by integrating timeline visualization and textual-visual choreography. Evaluated on recent real-world queries, Xinyu AI Search outperforms eight existing technologies in human assessments, excelling in relevance, comprehensiveness, and insightfulness. Ablation studies validate the necessity of its key sub-modules. Our work presents the first comprehensive framework for generative AI search engines, bridging retrieval, generation, and user-centric presentation.

Via

Access Paper or Ask Questions

MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

May 28, 2025

Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang(+12 more)

Figure 1 for MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Figure 2 for MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Figure 3 for MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Figure 4 for MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Abstract:Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.

Via

Access Paper or Ask Questions

FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Oct 16, 2024

Jiacheng Ruan, Yebin Yang, Zehao Lin, Feiyu Xiong, Zeyun Tang, Zhiyu Li

Figure 1 for FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Figure 2 for FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Figure 3 for FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Figure 4 for FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Abstract:Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluating only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs' potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion task (FTII). This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the candidates to insert after the corresponding paragraph. Constructing a benchmark for such a task is highly challenging, particularly in determining the sequence of flowing text and images. To address this challenge, we turn to professional news reports, which naturally contain a gold standard for image-text sequences. Based on this, we introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains. Using these 625 high-quality articles, we construct problems of two different types with multiple levels of difficulty. Furthermore, we establish two different evaluation pipelines based on the CLIP model and existing LVLMs. We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models. Results indicate that even the most advanced models (e.g., GPT-4o) face significant challenges when tackling the FTII task.

* Work in progress. 9 pages, 3 figures

Via

Access Paper or Ask Questions

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Sep 24, 2024

Jiacheng Ruan, Wenzhen Yuan, Zehao Lin, Ning Liao, Zhiyu Li, Feiyu Xiong, Ting Liu, Yuzhuo Fu

Figure 1 for MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Figure 2 for MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Figure 3 for MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Figure 4 for MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Abstract:Large visual-language models (LVLMs) have achieved great success in multiple applications. However, they still encounter challenges in complex scenes, especially those involving camouflaged objects. This is primarily due to the lack of samples related to camouflaged scenes in the training dataset. To mitigate this issue, we construct the MM-CamObj dataset for the first time, comprising two subsets: CamObj-Align and CamObj-Instruct. Specifically, CamObj-Align contains 11,363 image-text pairs, and it is designed for VL alignment and injecting rich knowledge of camouflaged scenes into LVLMs. CamObj-Instruct is collected for fine-tuning the LVLMs with improved instruction-following capabilities, and it includes 11,363 images and 68,849 conversations with diverse instructions. Based on the MM-CamObj dataset, we propose the CamObj-Llava, an LVLM specifically designed for addressing tasks in camouflaged scenes. To facilitate our model's effective acquisition of knowledge about camouflaged objects and scenes, we introduce a curriculum learning strategy with six distinct modes. Additionally, we construct the CamObj-Bench to evaluate the existing LVLMs' capabilities of understanding, recognition, localization and count in camouflage scenes. This benchmark includes 600 images and 7 tasks, with a total of 9,449 questions. Extensive experiments are conducted on the CamObj-Bench with CamObj-Llava, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o. Code and datasets will be available at https://github.com/JCruan519/MM-CamObj.

* 9 pages, 5 figures. Work in progress

Via

Access Paper or Ask Questions

$\text{Memory}^3$: Language Modeling with Explicit Memory

Jul 01, 2024

Hongkang Yang, Zehao Lin, Wenjin Wang, Hao Wu, Zhiyu Li, Bo Tang, Wenqiang Wei, Jinbo Wang, Zeyun Tang, Shichao Song(+6 more)

$Figure 1 for $\text{Memory}^3$: Language Modeling with Explicit Memory$

$Figure 2 for $\text{Memory}^3$: Language Modeling with Explicit Memory$

$Figure 3 for $\text{Memory}^3$: Language Modeling with Explicit Memory$

$Figure 4 for $\text{Memory}^3$: Language Modeling with Explicit Memory$

Abstract:The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

Via

Access Paper or Ask Questions

Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Aug 26, 2021

Guodun Li, Yuchen Zhai, Zehao Lin, Yin Zhang

Figure 1 for Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Figure 2 for Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Figure 3 for Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Figure 4 for Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Abstract:Stylized image captioning systems aim to generate a caption not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges with this task is the lack of sufficient paired stylized data. Many studies focus on unsupervised approaches, without considering from the perspective of data augmentation. We begin with the observation that people may recall similar emotions when they are in similar scenes, and often express similar emotions with similar style phrases, which underpins our data augmentation idea. In this paper, we propose a novel Extract-Retrieve-Generate data augmentation framework to extract style phrases from small-scale stylized sentences and graft them to large-scale factual captions. First, we design the emotional signal extractor to extract style phrases from small-scale stylized sentences. Second, we construct the plugable multi-modal scene retriever to retrieve scenes represented with pairs of an image and its stylized caption, which are similar to the query image or caption in the large-scale factual data. In the end, based on the style phrases of similar scenes and the factual description of the current scene, we build the emotion-aware caption generator to generate fluent and diversified stylized captions for the current scene. Extensive experimental results show that our framework can alleviate the data scarcity problem effectively. It also significantly boosts the performance of several existing image captioning models in both supervised and unsupervised settings, which outperforms the state-of-the-art stylized image captioning methods in terms of both sentence relevance and stylishness by a substantial margin.

* Accepted at ACM Multimedia (ACMMM) 2021

Via

Access Paper or Ask Questions

Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Jul 12, 2021

Jingyao Zhou, Haipang Wu, Zehao Lin, Guodun Li, Yin Zhang

Figure 1 for Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Figure 2 for Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Figure 3 for Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Figure 4 for Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Abstract:Most recently proposed approaches in dialogue state tracking (DST) leverage the context and the last dialogue states to track current dialogue states, which are often slot-value pairs. Although the context contains the complete dialogue information, the information is usually indirect and even requires reasoning to obtain. The information in the lastly predicted dialogue states is direct, but when there is a prediction error, the dialogue information from this source will be incomplete or erroneous. In this paper, we propose the Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations network (FPDSC). This model extracts information of each dialogue turn by modeling interactions among each turn utterance, the corresponding last dialogue states, and dialogue slots. Then the representation of each dialogue turn is aggregated by a hierarchical structure to form the passage information, which is utilized in the current turn of DST. Experimental results validate the effectiveness of the fusion network with 55.03% and 59.07% joint accuracy on MultiWOZ 2.0 and MultiWOZ 2.1 datasets, which reaches the state-of-the-art performance. Furthermore, we conduct the deleted-value and related-slot experiments on MultiWOZ 2.1 to evaluate our model.

* SIGDIAL 2021 Accepted paper

Via

Access Paper or Ask Questions

Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

May 27, 2020

Zehao Lin, Shaobo Cui, Xiaoming Kang, Guodun Li, Feng Ji, Haiqing Chen, Yin Zhang

Figure 1 for Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

Figure 2 for Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

Figure 3 for Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

Figure 4 for Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

Abstract:Different people have different habits of describing their intents in conversations. Some people may tend to deliberate their full intents in several successive utterances, i.e., they use several consistent messages for readability instead of a long sentence to express their question. This creates a predicament faced by dialogue systems' application, especially in real-world industrial scenarios, in which the dialogue system is unsure that whether it should answer the user's query immediately or wait for users' further supplementary input. Motivated by such interesting quandary, we define a novel task: Wait-or-Answer to better tackle this dilemma faced by dialogue systems. We shed light on a new research topic about how the dialogue system can be more competent to behave in this Wait-or-Answer quandary. Further, we propose a predictive approach dubbed Imagine-then-Arbitrate (ITA) to resolve this Wait-or-Answer task. More specifically, we take advantage of an arbitrator model to help the dialogue system decide to wait or answer. The arbitrator's decision is made with the assistance of two ancillary imaginator models: a wait imaginator and an answer imaginator. The wait imaginator tries to predict what the user would supplement and use its prediction to persuade the arbitrator that the user has some information to add, so the dialogue system should wait. The answer imaginator, nevertheless, struggles to predict the answer of the dialogue system and convince the arbitrator that it's a superior choice to answer the users' query immediately. To our best knowledge, our paper is the first work to explicitly define the Wait-or-Answer task in the dialogue system. Additionally, our proposed ITA approach significantly outperforms the existing models in solving this Wait-or-Answer problem.

* This previously appeared as arXiv:2002.09616v2, which was mistakenly submitted as a replacement. arXiv admin note: text overlap with arXiv:2002.09616v3

Via

Access Paper or Ask Questions