In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate over how to construct MLLM architectures, particularly regarding the choice of connector for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Using a unified classification standard, we categorize the sub-tasks of three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate connector performance on each. Our findings reveal that feature-preserving connectors excel at \emph{fine-grained perception} tasks because they retain detailed visual information. In contrast, feature-compressing connectors, while less effective on fine-grained perception tasks, offer significant speed advantages and perform comparably on \emph{coarse-grained perception} and \emph{reasoning} tasks. These insights provide practical guidance for the design and optimization of MLLM architectures.
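To make the two connector families concrete, the following is a minimal PyTorch sketch, not the paper's exact implementations: a feature-preserving connector projects every visual token into the LLM embedding space (keeping the token count, and thus detail, intact), while a feature-compressing connector reduces the number of visual tokens before projection (trading detail for a shorter LLM input sequence and faster inference). The class names, dimensions (1024-dim ViT patch tokens, 4096-dim LLM embeddings), and pooling stride are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Feature-preserving connector (illustrative): maps every visual
    token into the LLM embedding space; token count is unchanged."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens):    # (B, N, vis_dim)
        return self.proj(vis_tokens)  # (B, N, llm_dim): all N tokens kept

class PoolingCompressor(nn.Module):
    """Feature-compressing connector (illustrative): average-pools
    visual tokens before projection, shortening the LLM sequence."""
    def __init__(self, vis_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):                   # (B, N, vis_dim)
        x = vis_tokens.transpose(1, 2)               # (B, vis_dim, N)
        x = self.pool(x).transpose(1, 2)             # (B, N // stride, vis_dim)
        return self.proj(x)                          # fewer tokens -> faster LLM

vis = torch.randn(2, 576, 1024)          # e.g., 24x24 ViT patch grid
print(MLPProjector()(vis).shape)         # torch.Size([2, 576, 4096])
print(PoolingCompressor()(vis).shape)    # torch.Size([2, 144, 4096])
\end{verbatim}

The shape comments show the essential trade-off studied here: the preserving connector hands all 576 visual tokens to the LLM, which helps fine-grained perception, while the compressing connector passes only 144, which speeds up inference at some cost in visual detail.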