Title: MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Feb 21, 2025

Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

Abstract: Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a dedicated adapter. While existing approaches commonly rely on a single pre-trained vision encoder, a wide variety of specialized encoders is available that can boost a model's performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach that leverages multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes each input to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.
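
The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the router is built, so the class name, the score-based argmax selection, the toy router rule, and the encoder labels `chart`, `general`, and `formula` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Stand-ins for real image tensors and feature vectors (assumption).
Image = Sequence[float]
Features = List[float]


@dataclass
class MixtureOfVisionEncoders:
    """Hypothetical wrapper: pick one encoder per input, then encode."""
    router: Callable[[Image], Dict[str, float]]          # image -> score per encoder name
    encoders: Dict[str, Callable[[Image], Features]]     # encoder name -> encoder

    def route(self, image: Image) -> str:
        scores = self.router(image)
        # Ignore scores for encoders that are not registered.
        known = {name: s for name, s in scores.items() if name in self.encoders}
        if not known:
            raise ValueError("router returned no score for a registered encoder")
        # Hard routing: only the top-scoring encoder is run.
        return max(known, key=known.get)

    def encode(self, image: Image) -> Tuple[str, Features]:
        name = self.route(image)
        return name, self.encoders[name](image)


def toy_router(image: Image) -> Dict[str, float]:
    """Made-up rule based on mean intensity; a real router would be learned."""
    mean = sum(image) / len(image)
    return {"chart": mean, "general": 0.5, "formula": 1.0 - mean}


# Toy encoders; real ones would be pre-trained vision models.
move = MixtureOfVisionEncoders(
    router=toy_router,
    encoders={
        "chart": lambda img: [sum(img)],
        "general": lambda img: [max(img)],
        "formula": lambda img: [min(img)],
    },
)

chosen_bright, _ = move.encode([0.9, 0.8])
chosen_dark, _ = move.encode([0.1, 0.2])
print(chosen_bright, chosen_dark)
```

Because only the selected encoder runs, the cost per image in this sketch is one router call plus one encoder call, regardless of how many candidates are registered.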

* 10 pages, 6 figures, 4 tables
