Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kyu Han

Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Dec 21, 2024

Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, Kyu Han

Figure 1 for Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Figure 2 for Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Figure 3 for Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Figure 4 for Speech Retrieval-Augmented Generation without Automatic Speech Recognition

Abstract:One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)--based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade over the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when there is high WER in the transcripts.

Via

Access Paper or Ask Questions

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

May 02, 2022

Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi

Figure 1 for Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Figure 2 for Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Figure 3 for Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Figure 4 for Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Abstract:We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.

* Code available at https://github.com/asappresearch/wav2seq

Via

Access Paper or Ask Questions

SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Oct 11, 2021

Jing Pan, Tao Lei, Kwangyoun Kim, Kyu Han, Shinji Watanabe

Figure 1 for SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Figure 2 for SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Figure 3 for SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Figure 4 for SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition

Abstract:The Transformer architecture has been well adopted as a dominant architecture in most sequence transduction tasks including automatic speech recognition (ASR), since its attention mechanism excels in capturing long-range dependencies. While models built solely upon attention can be better parallelized than regular RNN, a novel network architecture, SRU++, was recently proposed. By combining the fast recurrence and attention mechanism, SRU++ exhibits strong capability in sequence modeling and achieves near-state-of-the-art results in various language modeling and machine translation tasks with improved compute efficiency. In this work, we present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks and study how the benefits can be generalized to long-form speech inputs. On the popular LibriSpeech benchmark, our SRU++ model achieves 2.0% / 4.7% WER on test-clean / test-other, showing competitive performances compared with the state-of-the-art Conformer encoder under the same set-up. Specifically, SRU++ can surpass Conformer on long-form speech input with a large margin, based on our analysis.

Via

Access Paper or Ask Questions

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Sep 14, 2021

Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Figure 1 for Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Figure 2 for Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Figure 3 for Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Figure 4 for Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Abstract:This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.

* Code available at https://github.com/asappresearch/sew

Via

Access Paper or Ask Questions

Speaker Diarization With Lexical Information

Nov 28, 2018

Tae Jin Park, Kyu Han, Ian Lane, Panayiotis Georgiou

Figure 1 for Speaker Diarization With Lexical Information

Figure 2 for Speaker Diarization With Lexical Information

Figure 3 for Speaker Diarization With Lexical Information

Figure 4 for Speaker Diarization With Lexical Information

Abstract:This work presents a novel approach to leverage lexical information for speaker diarization. We introduce a speaker diarization system that can directly integrate lexical as well as acoustic information into a speaker clustering process. Thus, we propose an adjacency matrix integration technique to integrate word level speaker turn probabilities with speaker embeddings in a comprehensive way. Our proposed method works without any reference transcript. Words, and word boundary information are provided by an ASR system. We show that our proposed method improves a baseline speaker diarization system solely based on speaker embeddings, achieving a meaningful improvement on the CALLHOME American English Speech dataset.

* 5 pages, 6 figures

Via

Access Paper or Ask Questions