Title: Mini-Omni2: Towards Open-source GPT-4o Model with Vision, Speech and Duplex

Oct 15, 2024

Zhifei Xie, Changqiao Wu

Abstract: GPT4o, an all-encompassing model, represents a milestone in the development of multi-modal large models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. However, its technical framework is not open-sourced. Models from the open-source community often achieve some functionalities of GPT4o, such as visual understanding and voice dialogue. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to user video and voice queries, while also incorporating auditory capabilities. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains strong performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a semantic-based interruption mechanism, enabling more flexible dialogues with users. All modeling approaches and data construction methods will be open-sourced. To the best of our knowledge, Mini-Omni2 is one of the models closest to GPT4o in functionality, and we hope it can offer valuable insights for subsequent research.

* 13 pages, 6 figures
