Title: VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Jan 03, 2025

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li (+5 more)

Abstract: Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model has both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

Code: https://github.com/VITA-MLLM/VITA
