Title: Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels

Mar 08, 2025

Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Ricard Marxer

Abstract: Text-Speech Language Models (TSLMs), language models trained to jointly process and generate text and speech, aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
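
The baseline vocabulary-expansion step described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation and does not include the proposed alignment modules. The function names, the toy sizes, the Gaussian initialization, and the assumption that speech is represented as discrete unit indices are all illustrative choices that the abstract does not specify.

```python
import random


def expand_vocabulary(embeddings, output_proj, num_speech_tokens, seed=0, scale=0.02):
    """Append randomly initialised rows for new speech tokens (hypothetical helper).

    embeddings:  list of rows, shape (text_vocab, d_model), input embedding table
    output_proj: list of rows, shape (text_vocab, d_model), output projection weights
    Returns two new tables of shape (text_vocab + num_speech_tokens, d_model).
    The pre-trained text rows are copied unchanged.
    """
    rng = random.Random(seed)
    d_model = len(embeddings[0])

    def new_rows():
        # Assumption: small Gaussian init; the abstract does not state the scheme.
        return [[rng.gauss(0.0, scale) for _ in range(d_model)]
                for _ in range(num_speech_tokens)]

    expanded_emb = [row[:] for row in embeddings] + new_rows()
    expanded_out = [row[:] for row in output_proj] + new_rows()
    return expanded_emb, expanded_out


def speech_token_id(unit, text_vocab_size):
    """Map a discrete speech unit index to its id in the expanded vocabulary."""
    return text_vocab_size + unit


# Toy sizes, chosen only for illustration.
text_vocab, d_model, speech_units = 8, 4, 3
emb = [[float(i)] * d_model for i in range(text_vocab)]
out = [[float(-i)] * d_model for i in range(text_vocab)]

new_emb, new_out = expand_vocabulary(emb, out, speech_units)
```

In this sketch the speech tokens enter and leave the model only through the appended rows, so every intermediate layer is shared with text. That is the setting in which the abstract argues abstraction levels may be misaligned.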
