Yash Khare

Universal-2-TF: Robust All-Neural Text Formatting for ASR

Jan 10, 2025

Yash Khare, Taufiquzzaman Peyash, Andrea Vanzo, Takuya Yoshioka

Abstract: This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
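
To make the two-stage idea concrete, here is a minimal sketch of how a token classifier and a span rewriter could be composed. The abstract names only the two components; the division of labor shown here (per-token punctuation, casing, and ITN-span flags from stage one, with stage two rewriting only the flagged spans) is an assumption, and `TokenLabels`, `format_text`, and `toy_rewrite` are hypothetical names. Both stages are replaced by hard-coded stand-ins rather than neural models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TokenLabels:
    """Hypothetical per-token output of a stage-one classifier."""
    punct: str          # punctuation to append after the token ("" for none)
    capitalize: bool    # truecasing decision for the token
    in_itn_span: bool   # True if the token belongs to a span needing ITN


def format_text(
    tokens: List[str],
    labels: List[TokenLabels],
    rewrite_span: Callable[[List[str]], str],
) -> str:
    """Apply token-level labels, calling the span rewriter only on flagged spans."""
    out = []
    i = 0
    while i < len(tokens):
        if labels[i].in_itn_span:
            # Collect the maximal run of consecutive flagged tokens.
            j = i
            while j < len(tokens) and labels[j].in_itn_span:
                j += 1
            # Stage two (a seq2seq model in the paper, a stand-in here)
            # sees only this short span, not the whole utterance.
            out.append(rewrite_span(tokens[i:j]) + labels[j - 1].punct)
            i = j
        else:
            word = tokens[i].capitalize() if labels[i].capitalize else tokens[i]
            out.append(word + labels[i].punct)
            i += 1
    return " ".join(out)


# Stand-in for the seq2seq model: a fixed lookup, for illustration only.
_TOY_ITN = {"twenty five dollars": "$25"}


def toy_rewrite(span: List[str]) -> str:
    text = " ".join(span)
    return _TOY_ITN.get(text, text)


tokens = "the ticket costs twenty five dollars".split()
labels = [
    TokenLabels("", True, False),    # the -> The
    TokenLabels("", False, False),   # ticket
    TokenLabels("", False, False),   # costs
    TokenLabels("", False, True),    # twenty  (ITN span)
    TokenLabels("", False, True),    # five    (ITN span)
    TokenLabels(".", False, True),   # dollars (ITN span, sentence end)
]
formatted = format_text(tokens, labels, toy_rewrite)
```

Under this assumed split, the more expensive rewriter runs on three of the six tokens, which is one way a two-stage design could limit both compute and the text a generative model is free to alter.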


Anatomy of Industrial Scale Multilingual ASR

Apr 16, 2024

Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang (+7 more)


Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.


Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

Apr 12, 2024

Kevin Zhang, Luka Chkhetiani, Francis McCann Ramirez, Yash Khare, Andrea Vanzo, Michael Liang, Sergio Ramirez Martin, Gabriel Oexle, Ruben Bousbib, Taufiquzzaman Peyash (+3 more)


Abstract: This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.
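
For readers unfamiliar with the metric, the sketch below shows how WER and a relative WER improvement are conventionally computed. It is a generic illustration, not the paper's evaluation code: the function names are hypothetical, the text is assumed to be already normalized and whitespace-tokenized, and the numbers in the usage lines are made up to show the arithmetic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Illustrative numbers only: a drop from 10.0% to 8.85% WER is an 11.5%
# relative improvement, even though the absolute change is 1.15 points.
example_reduction = relative_wer_reduction(10.0, 8.85)
```

The distinction matters when reading the abstract: a relative improvement is measured against the baseline's own error rate, so it is not a difference in percentage points.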


MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

Apr 03, 2021

Yash Khare, Viraj Bagal, Minesh Mathew, Adithi Devi, U Deva Priyakumar, CV Jawahar


Abstract: Images in the medical domain are fundamentally different from general-domain images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models for the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Language Modeling (MLM) with image features as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images -- VQA-Med 2019 and VQA-RAD, outperforming even the ensemble models of previous best solutions. Moreover, our solution provides attention maps which help in model interpretability. The code is available at https://github.com/VirajBagal/MMBERT
