Renshan Zhang

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

Jan 27, 2025

Renshan Zhang, Rui Shao, Gongwei Chen, Kaiwen Zhou, Weili Guan, Liqiang Nie

Abstract: The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception capabilities for real-world tasks. However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. FALCON introduces a novel visual register technique to simultaneously: 1) Eliminate redundant tokens at the stage of visual encoding. To directly address the visual redundancy present in the output of the vision encoder, we propose a Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers designed to adaptively aggregate essential information while discarding redundancy. It enables the encoder to produce a more compact visual representation with a minimal number of output tokens, thus eliminating the need for an additional compression module. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates effective and efficient information exchange across sub-images by enabling interactions between visual registers. It ensures the continuity of visual semantics throughout the encoding. We conduct comprehensive experiments with FALCON on high-resolution benchmarks across a wide range of scenarios. FALCON demonstrates superior performance with a remarkable 9-fold and 16-fold reduction in visual tokens.
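The idea of a small set of registers aggregating information from many patch tokens can be illustrated with a toy attention-pooling step. This is only a sketch under stated assumptions, not the ReCompact mechanism itself: the abstract does not specify how registers interact with patch tokens, so the function `aggregate`, the fixed register values, and the 4-to-2 token ratio below are all hypothetical, and the registers here are hand-set rather than learned.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(registers, patches):
    # Each register acts as a query over all patch tokens and returns
    # an attention-weighted average of them (one output per register).
    dim = len(patches[0])
    scale = math.sqrt(dim)
    out = []
    for r in registers:
        scores = [sum(a * b for a, b in zip(r, p)) / scale for p in patches]
        weights = softmax(scores)
        out.append([sum(w * p[d] for w, p in zip(weights, patches))
                    for d in range(dim)])
    return out

# Hypothetical toy data: 4 patch tokens, 2 registers (hand-set, not learned).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
registers = [[5.0, 0.0], [0.0, 5.0]]

compact = aggregate(registers, patches)
reduction = len(patches) / len(compact)  # token-count reduction factor
```

Only the register outputs are passed on, so the number of visual tokens shrinks by the ratio of patches to registers; in this toy case that ratio is 2, whereas the abstract reports 9-fold and 16-fold reductions.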

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Jul 19, 2024

Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie

Abstract: Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to perform document understanding. Most current document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. To perform more adaptive and efficient document understanding, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing. First, we propose an innovative approach for assessing pattern repetitiveness based on the correlation between patch tokens. This method identifies redundant tokens, allowing for the determination of the sub-image's information density. Second, we present a token-level sampling method that efficiently captures the most informative tokens by exploiting the correlation between the [CLS] token and patch tokens. By integrating these strategies, we develop a plug-and-play adaptive compressor module that can be seamlessly incorporated into MLLMs utilizing cropping techniques. This module not only enhances the processing speed during training and inference but also maintains comparable performance. We conduct experiments with the SOTA document understanding model mPLUG-DocOwl1.5, and the effectiveness is demonstrated through extensive comparisons with other compression methods.
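The two steps described in the abstract can be sketched in a parameter-free way. The following is an illustration only, assuming cosine similarity as the correlation measure; the abstract does not say which measure, threshold, or sampling rule the authors use, so `redundancy_ratio`, `select_tokens`, the 0.95 threshold, and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def redundancy_ratio(patch_tokens, threshold=0.95):
    # Step 1 (assumed form): a token counts as redundant if some other
    # token is nearly identical to it. The fraction of such tokens is a
    # rough proxy for how repetitive the sub-image is.
    n = len(patch_tokens)
    redundant = 0
    for i in range(n):
        if any(cosine(patch_tokens[i], patch_tokens[j]) > threshold
               for j in range(n) if j != i):
            redundant += 1
    return redundant / n

def select_tokens(cls_token, patch_tokens, keep):
    # Step 2 (assumed form): rank patch tokens by similarity to the
    # [CLS] token and keep the top `keep`, preserving original order.
    scores = [cosine(cls_token, p) for p in patch_tokens]
    ranked = sorted(range(len(patch_tokens)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical toy data.
cls_token = [1.0, 0.0]
patch_tokens = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

ratio = redundancy_ratio(patch_tokens)
kept = select_tokens(cls_token, patch_tokens, keep=2)
```

In an adaptive compressor, a quantity like `ratio` could set how many tokens to keep for each sub-image, with `select_tokens` then choosing which ones. How the paper combines the two steps is not stated in the abstract.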

A Novel Dual Quaternion Based Dynamic Motion Primitives for Acrobatic Flight

Jul 13, 2021

Renshan Zhang, Yongyang Hu, Kuang Zhao, Su Cao

Abstract: The realization of motion description is a challenging task for fixed-wing Unmanned Aerial Vehicle (UAV) acrobatic flight, due to the inherent coupling problem in translational-rotational motion. This paper aims to develop a novel maneuver description method through the idea of imitation learning, and there are two main contributions of our work: 1) A dual quaternion based dynamic motion primitives (DQ-DMP) method is proposed, in which the state equations of the position and attitude can be combined without loss of accuracy. 2) An online hardware-in-the-loop (HITL) training system is established. Based on the DQ-DMP method, the geometric features of the demonstrated maneuver can be obtained in real time, and the stability of the DQ-DMP is theoretically proved. The simulation results illustrate the superiority of the proposed method compared to the traditional position/attitude decoupling method.
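To show how a dual quaternion carries position and attitude in a single object, here is a minimal sketch of the standard unit dual quaternion pose representation. It does not reproduce the DQ-DMP state equations, which the abstract does not give; the helper names (`qmul`, `pose_to_dq`, `dq_mul`, `dq_to_translation`) and the convention of a real part r and dual part 0.5 * t * r are assumptions chosen for illustration.

```python
import math

def qmul(a, b):
    # Hamilton product of quaternions given as (w, x, y, z).
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def pose_to_dq(rotation, translation):
    # Real part: unit rotation quaternion r.
    # Dual part: 0.5 * t * r, with t as a pure quaternion (0, tx, ty, tz).
    t = (0.0, *translation)
    dual = tuple(0.5 * c for c in qmul(t, rotation))
    return rotation, dual

def dq_to_translation(dq):
    # Recover t = 2 * dual * conj(real).
    real, dual = dq
    t = qmul(tuple(2.0 * c for c in dual), qconj(real))
    return t[1:]

def dq_mul(a, b):
    # Dual quaternion product; composes poses (b applied first, then a).
    ar, ad = a
    br, bd = b
    real = qmul(ar, br)
    dual = tuple(x + y for x, y in zip(qmul(ar, bd), qmul(ad, br)))
    return real, dual

# 90-degree rotation about z plus a unit step along x.
half = math.pi / 4
rot_z90 = (math.cos(half), 0.0, 0.0, math.sin(half))
a = pose_to_dq(rot_z90, (1.0, 0.0, 0.0))
# Pure translation along x, no rotation.
b = pose_to_dq((1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0))

composed = dq_mul(a, b)
translation = dq_to_translation(composed)
```

A single product updates rotation and translation together: the step along x in `b` is rotated onto y by `a` before being added, so `translation` comes out close to (1, 1, 0). This joint update is what a decoupled position/attitude formulation handles in two separate equations.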

* 6 pages
