Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shuyu Lei

Seal: Advancing Speech Language Models to be Few-Shot Learners

Jul 20, 2024

Shuyu Lei, Lingen Liu, Jiaolong Yang, Yasen Jiao, Yuxiang Yang, Yushu Yang, Xiang Guo

Figure 1 for Seal: Advancing Speech Language Models to be Few-Shot Learners

Figure 2 for Seal: Advancing Speech Language Models to be Few-Shot Learners

Figure 3 for Seal: Advancing Speech Language Models to be Few-Shot Learners

Abstract:Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in prompt, without requiring any additional training. In order to extend this capability to a multi-modal setting (i.e. speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which Kullback-Leibler divergence loss is performed to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.

Via

Access Paper or Ask Questions

GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Sep 11, 2023

Aobo Xia, Shuyu Lei, Yushu Yang, Xiang Guo, Hua Chai

Figure 1 for GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Figure 2 for GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Figure 3 for GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Abstract:This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.

Via

Access Paper or Ask Questions