Title: Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Nov 01, 2024

Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, Long Ma

Abstract: The rapid development of large language models (LLMs) has enabled many new intelligent applications; in particular, the multimodal human-computer interaction of GPT-4o has given users an impressive experience. Against this background, researchers have recently proposed many multimodal LLMs capable of speech-to-speech dialogue. In this paper, we propose Freeze-Omni, a speech-text multimodal LLM architecture. Our main contribution is that the speech input and output modalities can be connected to the LLM while the LLM stays frozen throughout the training process. We design three-stage training strategies for modeling both speech input and speech output, enabling Freeze-Omni to acquire speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is on par with that of its backbone LLM in the text modality, while the end-to-end latency of the spoken response remains low. In addition, we design a multi-task training method that gives Freeze-Omni duplex dialogue ability, making its conversations with users more natural. Freeze-Omni mainly offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the various effects of catastrophic forgetting in the LLM caused by limited data and training resources.

* Project Page: https://freeze-omni.github.io/
