Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ray Zhang

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

May 22, 2025

Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang

Abstract:The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long video sequences, the number of required tokens has grown significantly, leading to quadratically computational costs. This has made the efficient compression of video tokens in LMMs, while maintaining performance integrity, a pressing research challenge. In this paper, we introduce CrossLMM, decoupling long video sequences from LMMs via a dual cross-attention mechanism, which substantially reduces visual token quantity with minimal performance degradation. Specifically, we first implement a significant token reduction from pretrained visual encoders through a pooling methodology. Then, within LLM layers, we employ a visual-to-visual cross-attention mechanism, wherein the pooled visual tokens function as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens, enriching the visual comprehension of the text tokens. Comprehensive empirical evaluation demonstrates that our approach achieves comparable or superior performance across diverse video-based LMM benchmarks, despite utilizing substantially fewer computational resources.

* Project page: https://github.com/shilinyan99/CrossLMM

Via

Access Paper or Ask Questions

SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

Mar 13, 2025

Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng

Abstract:The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io

* Initially released in September 2024. Project page: https://sciverse-cuhk.github.io

Via

Access Paper or Ask Questions

Exploring the Potential of Encoder-free Architectures in 3D LMMs

Feb 13, 2025

Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li(+1 more)

Figure 1 for Exploring the Potential of Encoder-free Architectures in 3D LMMs

Figure 2 for Exploring the Potential of Encoder-free Architectures in 3D LMMs

Figure 3 for Exploring the Potential of Encoder-free Architectures in 3D LMMs

Figure 4 for Exploring the Potential of Encoder-free Architectures in 3D LMMs

Abstract:Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To the end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL

* The code is released at https://github.com/Ivan-Tang-3D/ENEL

Via

Access Paper or Ask Questions

SLAM assisted 3D tracking system for laparoscopic surgery

Sep 18, 2024

Jingwei Song, Ray Zhang, Wenwei Zhang, Hao Zhou, Maani Ghaffari

Figure 1 for SLAM assisted 3D tracking system for laparoscopic surgery

Figure 2 for SLAM assisted 3D tracking system for laparoscopic surgery

Figure 3 for SLAM assisted 3D tracking system for laparoscopic surgery

Figure 4 for SLAM assisted 3D tracking system for laparoscopic surgery

Abstract:A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.

* Demo: https://youtu.be/B1xZW8bj3cM

Via

Access Paper or Ask Questions

Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning

Jul 29, 2024

Ray Zhang, Zheming Zhou, Min Sun, Omid Ghasemalizadeh, Cheng-Hao Kuo, Ryan Eustice, Maani Ghaffari, Arnie Sen

Abstract:This paper introduces a robust unsupervised SE(3) point cloud registration method that operates without requiring point correspondences. The method frames point clouds as functions in a reproducing kernel Hilbert space (RKHS), leveraging SE(3)-equivariant features for direct feature space registration. A novel RKHS distance metric is proposed, offering reliable performance amidst noise, outliers, and asymmetrical data. An unsupervised training approach is introduced to effectively handle limited ground truth data, facilitating adaptation to real datasets. The proposed method outperforms classical and supervised methods in terms of registration accuracy on both synthetic (ModelNet40) and real-world (ETH3D) noisy, outlier-rich datasets. To our best knowledge, this marks the first instance of successful real RGB-D odometry data registration using an equivariant method. The code is available at {https://sites.google.com/view/eccv24-equivalign}

* 10 pages, to be published in ECCV 2024

Via

Access Paper or Ask Questions

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Jun 22, 2024

Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

Figure 1 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 2 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 3 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Figure 4 for MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Abstract:In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model's prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension.

* 14 pages, 8 figures

Via

Access Paper or Ask Questions

RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Mar 02, 2024

Ray Zhang, Jingwei Song, Xiang Gao, Junzhe Wu, Tianyi Liu, Jinyuan Zhang, Ryan Eustice, Maani Ghaffari

Figure 1 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 2 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 3 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 4 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Abstract:This work reports a novel Bundle Adjustment (BA) formulation using a Reproducing Kernel Hilbert Space (RKHS) representation called RKHS-BA. The proposed formulation is correspondence-free, enables the BA to use RGB-D/LiDAR and semantic labels in the optimization directly, and provides a generalization for the photometric loss function commonly used in direct methods. RKHS-BA can incorporate appearance and semantic labels within a continuous spatial-semantic functional representation that does not require optimization via image pyramids. We demonstrate its applications in sliding-window odometry and global LiDAR mapping, which show highly robust performance in extremely challenging scenes and the best trade-off of generalization and accuracy.

* 16 pages, 12 figures, technical report under review

Via

Access Paper or Ask Questions

Cloud-Device Collaborative Learning for Multimodal Large Language Models

Dec 26, 2023

Guanqun Wang, Jiaming Liu, Chenxuan Li, Junpeng Ma, Yuan Zhang, Xinyu Wei, Kevin Zhang, Maurice Chong, Ray Zhang, Yijiang Liu(+1 more)

Figure 1 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 2 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 3 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Figure 4 for Cloud-Device Collaborative Learning for Multimodal Large Language Models

Abstract:The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning, commonsense reasoning, and visual scene understanding. However, the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters, leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge, we introduce a Cloud-Device Collaborative Continual Adaptation framework, designed to enhance the performance of compressed, device-deployed MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment. In the uplink phase, we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively filter out-of-distribution tokens, thereby reducing transmission costs and improving training efficiency. On the cloud side, we propose Adapter-based Knowledge Distillation (AKD) method to transfer refined knowledge from large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic Weight update Compression (DWC) strategy for the downlink, which adaptively selects and quantizes updated weight parameters, enhancing transmission efficiency and reducing the representational disparity between cloud and device models. Extensive experiments on several multimodal benchmarks demonstrate the superiority of our proposed framework over prior Knowledge Distillation and device-cloud collaboration methods. Notably, we also validate the feasibility of our approach to real-world experiments.

Via

Access Paper or Ask Questions

BDIS-SLAM: A lightweight CPU-based dense stereo SLAM for surgery

Dec 25, 2023

Jingwei Song, Ray Zhang, Qiuchen Zhu, Jianyu Lin, Maani Ghaffari

Figure 1 for BDIS-SLAM: A lightweight CPU-based dense stereo SLAM for surgery

Figure 2 for BDIS-SLAM: A lightweight CPU-based dense stereo SLAM for surgery

Figure 3 for BDIS-SLAM: A lightweight CPU-based dense stereo SLAM for surgery

Figure 4 for BDIS-SLAM: A lightweight CPU-based dense stereo SLAM for surgery

Abstract:Purpose: Common dense stereo Simultaneous Localization and Mapping (SLAM) approaches in Minimally Invasive Surgery (MIS) require high-end parallel computational resources for real-time implementation. Yet, it is not always feasible since the computational resources should be allocated to other tasks like segmentation, detection, and tracking. To solve the problem of limited parallel computational power, this research aims at a lightweight dense stereo SLAM system that works on a single-core CPU and achieves real-time performance (more than 30 Hz in typical scenarios). Methods: A new dense stereo mapping module is integrated with the ORB-SLAM2 system and named BDIS-SLAM. Our new dense stereo mapping module includes stereo matching and 3D dense depth mosaic methods. Stereo matching is achieved with the recently proposed CPU-level real-time matching algorithm Bayesian Dense Inverse Searching (BDIS). A BDIS-based shape recovery and a depth mosaic strategy are integrated as a new thread and coupled with the backbone ORB-SLAM2 system for real-time stereo shape recovery. Results: Experiments on in-vivo data sets show that BDIS-SLAM runs at over 30 Hz speed on modern single-core CPU in typical endoscopy/colonoscopy scenarios. BDIS-SLAM only consumes around an additional 12% time compared with the backbone ORB-SLAM2. Although our lightweight BDIS-SLAM simplifies the process by ignoring deformation and fusion procedures, it can provide a usable dense mapping for modern MIS on computationally constrained devices. Conclusion: The proposed BDIS-SLAM is a lightweight stereo dense SLAM system for MIS. It achieves 30 Hz on a modern single-core CPU in typical endoscopy/colonoscopy scenarios (image size around 640*480). BDIS-SLAM provides a low-cost solution for dense mapping in MIS and has the potential to be applied in surgical robots and AR systems.

* This paper has been accepted by International Journal of Computer Assisted Radiology and Surgery. Code is available at https://github.com/JingweiSong/BDIS-SLAM

Via

Access Paper or Ask Questions

FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Dec 22, 2023

Dongmei Zhang, Chang Li, Ray Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang

Figure 1 for FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Figure 2 for FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Figure 3 for FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Figure 4 for FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Abstract:The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.

* Accepted by AAAI 2024. Code will be released at https://github.com/dmzhang0425/FM-OV3D.git

Via

Access Paper or Ask Questions