Abstract:Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the limited availability of aligned data (concepts that carry similar meaning but are expressed differently across modalities), and the difficulty of effectively leveraging unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal generation performance (e.g., text-to-text generation) by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS), where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
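To make the cross-attention composition concrete, here is a minimal PyTorch sketch of fusing two independently pre-trained decoder towers, with the option of freezing the text tower to preserve its unimodal behavior. The embedding-only tower stand-ins, hidden size, and single fusion step are illustrative assumptions, not the paper's exact configuration, which applies cross-attention inside the towers rather than once at the output.

```python
# A minimal sketch of cross-attention fusion between two pre-trained decoder
# towers. All sizes and the single fusion layer are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Lets one tower's hidden states attend to the other tower's hidden states."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, context):
        # Residual cross-attention: queries from one tower, keys/values from the other.
        attended, _ = self.attn(queries, context, context)
        return self.norm(queries + attended)

class TwoTowerFusionModel(nn.Module):
    def __init__(self, text_tower: nn.Module, speech_tower: nn.Module,
                 d_model: int, freeze_text: bool = True):
        super().__init__()
        self.text_tower = text_tower        # stand-in for a pre-trained text decoder
        self.speech_tower = speech_tower    # stand-in for a pre-trained speech decoder
        self.fusion = CrossAttentionFusion(d_model)
        if freeze_text:
            # Freezing the text tower preserves its unimodal (text-to-text) behavior.
            for p in self.text_tower.parameters():
                p.requires_grad = False

    def forward(self, text_tokens, speech_tokens):
        text_h = self.text_tower(text_tokens)        # (batch, text_len, d_model)
        speech_h = self.speech_tower(speech_tokens)  # (batch, speech_len, d_model)
        # Speech-side representations are conditioned on text via cross-attention.
        return self.fusion(speech_h, text_h)

# Toy usage with embedding-only stand-ins for the pre-trained towers:
model = TwoTowerFusionModel(nn.Embedding(32_000, 256), nn.Embedding(1_024, 256), d_model=256)
out = model(torch.randint(0, 32_000, (2, 16)), torch.randint(0, 1_024, (2, 50)))
print(out.shape)  # torch.Size([2, 50, 256])
```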
Abstract:We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. From AudioLM, AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation; from text-based large language models such as PaLM-2, it inherits linguistic knowledge available only from text. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
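A core ingredient here is a single decoder that reads and writes both text tokens and discrete audio tokens from one shared vocabulary. The sketch below (PyTorch) shows only that vocabulary-extension step; the vocabulary sizes, embedding width, and random initialization of the new audio rows are assumptions for illustration, not AudioPaLM's actual configuration.

```python
# A minimal sketch of extending a text LM's embedding table with audio tokens,
# so one decoder can consume and emit both token types. Sizes are assumptions.
import torch
import torch.nn as nn

def extend_embeddings(text_embedding: nn.Embedding, n_audio_tokens: int) -> nn.Embedding:
    d_model = text_embedding.embedding_dim
    n_text = text_embedding.num_embeddings
    extended = nn.Embedding(n_text + n_audio_tokens, d_model)
    with torch.no_grad():
        # Keep the pre-trained text rows; the new audio rows stay randomly initialized.
        extended.weight[:n_text] = text_embedding.weight
    return extended

# Example: a 32k-entry text vocabulary extended with 1,024 discrete audio token ids.
text_emb = nn.Embedding(32_000, 512)
combined_emb = extend_embeddings(text_emb, n_audio_tokens=1_024)
audio_ids = torch.randint(32_000, 33_024, (2, 8))  # audio ids live above the text range
print(combined_emb(audio_ids).shape)  # torch.Size([2, 8, 512])
```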
Abstract:Current disfluency detection models focus on individual utterances, each from a single speaker. However, numerous discontinuity phenomena in spoken conversational transcripts occur across multiple turns, hampering human readability and the performance of downstream NLP tasks. This study addresses these phenomena by proposing an innovative Multi-Turn Cleanup task for spoken conversational transcripts and collecting a new dataset, MultiTurnCleanup. We design a data labeling schema to collect this high-quality dataset and provide extensive data analysis. Furthermore, we leverage two modeling approaches for experimental evaluation as benchmarks for future research.
Abstract:Diarization partitions an audio stream into segments based on the voices of the speakers. Real-time diarization systems that include an enrollment step should limit enrollment training samples to reduce user interaction time. Although training on a small number of samples yields poor performance, we show that the accuracy can be improved dramatically using a chronological self-training approach. We studied the tradeoff between training time and classification performance and found that 1 second is sufficient to reach over 95% accuracy. We evaluated on 700 audio conversation files of about 10 minutes each from 6 different languages and demonstrated average diarization error rates as low as 10%.
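The chronological self-training idea can be sketched as follows: start from a short enrollment embedding per speaker, walk through the conversation in time order, assign each segment to the nearest speaker centroid, and fold confident assignments back into that centroid. The embedding inputs, cosine scoring, confidence threshold, and running-mean update below are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of chronological self-training for speaker classification.
# The scoring, threshold, and centroid update are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def chronological_self_training(segments, enrollment, threshold=0.6):
    """segments: segment embeddings in time order; enrollment: {speaker: embedding}."""
    centroids = {spk: emb.astype(float) for spk, emb in enrollment.items()}
    counts = {spk: 1 for spk in centroids}
    labels = []
    for emb in segments:
        # Assign the segment to the closest speaker centroid.
        spk, score = max(((s, cosine(emb, c)) for s, c in centroids.items()),
                         key=lambda x: x[1])
        labels.append(spk)
        if score >= threshold:
            # Confident prediction: fold it back into the centroid (running mean).
            counts[spk] += 1
            centroids[spk] += (emb - centroids[spk]) / counts[spk]
    return labels

# Toy usage: two enrolled speakers, five segments close to speaker A's embedding.
rng = np.random.default_rng(0)
enroll = {"A": rng.normal(size=16), "B": rng.normal(size=16)}
segments = [enroll["A"] + 0.1 * rng.normal(size=16) for _ in range(5)]
print(chronological_self_training(segments, enroll))  # ['A', 'A', 'A', 'A', 'A']
```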
Abstract:In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.
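The decision the model learns, whether to emit a label for the current token now or wait for more right context, amounts to a dynamically sized lookahead window. The following sketch shows only that control flow with a placeholder scorer; the scoring function, threshold, and maximum lookahead are hypothetical stand-ins for the trained streaming tagger.

```python
# A minimal sketch of the streaming emit-or-wait loop with dynamic lookahead.
# score_fn is a placeholder for the trained model; all knobs are assumptions.
import random

def stream_tagging(tokens, score_fn, wait_threshold=0.5, max_lookahead=4):
    pending, outputs = [], []
    for tok in tokens:                      # tokens arrive incrementally
        pending.append(tok)
        while pending:
            label, wait_score = score_fn(pending)
            if wait_score > wait_threshold and len(pending) < max_lookahead:
                break                       # not enough right context yet: wait
            outputs.append((pending.pop(0), label))  # commit a prediction now
    while pending:                          # flush whatever is left at end of stream
        label, _ = score_fn(pending)
        outputs.append((pending.pop(0), label))
    return outputs

def dummy_score_fn(window):
    # Placeholder scorer: tag fillers as disfluent ("E"), everything else "_",
    # and return a random wait score purely to exercise the control flow.
    label = "E" if window[0] in {"um", "uh"} else "_"
    return label, random.random()

random.seed(0)
print(stream_tagging("i um i want uh a ticket".split(), dummy_score_fn))
```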
Abstract:Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to model fine-tuning while updating only a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
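A residual adapter in this sense is a small bottleneck module added to the output of a frozen encoder layer, so per-speaker adaptation only needs the adapter weights. The PyTorch sketch below uses a linear stand-in for the encoder layer and an arbitrary bottleneck size; both are assumptions for illustration.

```python
# A minimal sketch of a residual adapter attached to a frozen encoder layer.
# The stand-in layer and bottleneck width are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # The residual connection keeps the frozen layer's output intact when the
        # adapter contributes little, so adaptation starts close to the base model.
        return x + self.up(self.act(self.down(x)))

class AdaptedEncoderLayer(nn.Module):
    def __init__(self, frozen_layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.adapter = ResidualAdapter(d_model)

    def forward(self, x):
        return self.adapter(self.layer(x))

# For a 256-dim layer with a 16-dim bottleneck, the trainable part is tiny:
# 2 * 256 * 16 + 16 + 256 = 8,464 parameters per adapted layer.
layer = AdaptedEncoderLayer(nn.Linear(256, 256), d_model=256)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8464
print(layer(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```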
Abstract:Neural Machine Translation (NMT) models have demonstrated strong state-of-the-art performance on translation tasks where well-formed training and evaluation data are provided, but they remain sensitive to inputs that include errors of various types. Specifically, in the context of long-form speech translation systems, where the input transcripts come from Automatic Speech Recognition (ASR), NMT models have to handle errors including phoneme substitutions, ungrammatical structure, and incorrect sentence boundaries, all of which pose challenges to NMT robustness. Through in-depth error analysis, we show that sentence boundary segmentation has the largest impact on quality, and we develop a simple data augmentation strategy to improve segmentation robustness.
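One way to realize such a segmentation-focused augmentation is to randomly merge neighboring training sentences and split others at arbitrary points, so the translation model sees the kinds of mis-segmented inputs produced by upstream ASR. The probabilities and the merge/split scheme below are illustrative assumptions, not the paper's exact strategy.

```python
# A minimal sketch of segmentation-perturbing data augmentation for NMT training.
# Merge/split probabilities are illustrative assumptions.
import random

def perturb_segmentation(sentences, merge_prob=0.3, split_prob=0.3, seed=0):
    rng = random.Random(seed)
    out = []
    for sent in sentences:
        words = sent.split()
        if out and rng.random() < merge_prob:
            # Merge with the previous segment, dropping the true boundary.
            out[-1] = out[-1] + " " + sent
        elif len(words) > 3 and rng.random() < split_prob:
            # Split at a random interior point, creating a spurious boundary.
            cut = rng.randint(1, len(words) - 1)
            out.extend([" ".join(words[:cut]), " ".join(words[cut:])])
        else:
            out.append(sent)
    return out

print(perturb_segmentation([
    "the meeting starts at nine",
    "please forward the agenda",
    "we will review the results afterwards",
]))
```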