Andrea Vanzo

Heriot-Watt University

Universal-2-TF: Robust All-Neural Text Formatting for ASR

Jan 10, 2025

Yash Khare, Taufiquzzaman Peyash, Andrea Vanzo, Takuya Yoshioka

Abstract: This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.

Anatomy of Industrial Scale Multilingual ASR

Apr 16, 2024

Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang (+7 more)

Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

Apr 12, 2024

Kevin Zhang, Luka Chkhetiani, Francis McCann Ramirez, Yash Khare, Andrea Vanzo, Michael Liang, Sergio Ramirez Martin, Gabriel Oexle, Ruben Bousbib, Taufiquzzaman Peyash (+3 more)

Abstract: This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.
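
The relative WER figures quoted in the abstract can be read with the help of a small sketch. The code below computes WER as word-level edit distance divided by reference length, and a relative reduction between two WER values. The function names and the numbers in the usage lines are illustrative assumptions, not taken from the paper, which does not state absolute WERs in this abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Hypothetical absolute WERs chosen so the relative reduction is 11.5%.
example_reduction = relative_wer_reduction(0.10, 0.0885)
```

Under this reading, an 11.5% relative improvement means the new WER is 88.5% of the baseline WER, not that the WER dropped by 11.5 absolute points.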

Going for GOAL: A Resource for Grounded Football Commentaries

Nov 08, 2022

Alessandro Suglia, José Lopes, Emanuele Bastianelli, Andrea Vanzo, Shubham Agarwal, Malvina Nikandrou, Lu Yu, Ioannis Konstas, Verena Rieser

Abstract: Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead to spurious cues to be exploited by models rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.

* Preprint formatted using the ACM Multimedia template (8 pages + appendix)

An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

Jan 31, 2021

Alessandro Suglia, Yonatan Bisk, Ioannis Konstas, Antonio Vergari, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Abstract: Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalize: an in-domain evaluation shows an increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31) thanks to more fine-grained object representations learned via SPIEL.
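
As a rough illustration of the "harmonic average accuracy" mentioned above, the sketch below takes the harmonic mean of per-question-type accuracies, which penalizes a model that does well on some question types and poorly on others. The question-type names and accuracy values are made up for illustration, and the assumption that the metric is a plain harmonic mean over per-type accuracies is not confirmed by the abstract itself.

```python
from statistics import harmonic_mean, mean
from typing import Mapping


def harmonic_accuracy(per_type_accuracy: Mapping[str, float]) -> float:
    """Harmonic mean of per-question-type accuracies (each in [0, 1])."""
    return harmonic_mean(list(per_type_accuracy.values()))


# Hypothetical per-type accuracies, not results from the paper.
accuracies = {"counting": 0.5, "color": 1.0}

harmonic = harmonic_accuracy(accuracies)       # 2 / (1/0.5 + 1/1.0)
arithmetic = mean(accuracies.values())         # (0.5 + 1.0) / 2
```

With these made-up values the harmonic mean is lower than the arithmetic mean, which shows how the weaker question type pulls the score down.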

* Accepted paper for the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)

Encoding Syntactic Constituency Paths for Frame-Semantic Parsing with Graph Convolutional Networks

Nov 26, 2020

Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Abstract: We study the problem of integrating syntactic information from constituency trees into a neural model in Frame-semantic parsing sub-tasks, namely Target Identification (TI), Frame Identification (FI), and Semantic Role Labeling (SRL). We use a Graph Convolutional Network to learn specific representations of constituents, such that each constituent is profiled as the production grammar rule it corresponds to. We leverage these representations to build syntactic features for each word in a sentence, computed as the sum of all the constituents on the path between a word and a task-specific node in the tree, e.g. the target predicate for SRL. Our approach improves state-of-the-art results on the TI and SRL of ~1% and ~3.5% points, respectively (+2.5% additional points are gained with BERT as input), when tested on FrameNet 1.5, while yielding comparable results on the CoNLL05 dataset to other syntax-aware systems.
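
The path-based feature described in the abstract can be pictured with a minimal sketch: given a constituency tree and one vector per constituent, sum the vectors of the constituents lying on the tree path between a word and a target node. Everything here is an assumption for illustration: the toy tree, the node names, the fixed two-dimensional vectors (the paper learns them with a Graph Convolutional Network), and the choice to include the lowest common ancestor in the path.

```python
from typing import Dict, List

Vector = List[float]


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Nodes from `node` up to the root, following parent pointers."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_path(a: str, b: str, parent: Dict[str, str]) -> List[str]:
    """Nodes on the unique tree path from a to b, via their lowest common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    above_b = set(up_b)
    lca = next(n for n in up_a if n in above_b)
    return up_a[: up_a.index(lca) + 1] + list(reversed(up_b[: up_b.index(lca)]))


def path_feature(word: str, target: str, parent: Dict[str, str],
                 embedding: Dict[str, Vector]) -> Vector:
    """Sum the embeddings of the constituents on the path between word and target."""
    dim = len(next(iter(embedding.values())))
    total = [0.0] * dim
    for node in tree_path(word, target, parent):
        if node in embedding:  # word leaves carry no constituent embedding here
            total = [t + x for t, x in zip(total, embedding[node])]
    return total


# Toy tree for "she eats apples": S -> NP1 VP, NP1 -> w0, VP -> w1 NP2, NP2 -> w2.
parent = {"w0": "NP1", "NP1": "S", "w1": "VP", "VP": "S", "w2": "NP2", "NP2": "VP"}
# Hypothetical fixed constituent vectors standing in for learned representations.
embedding = {"S": [1.0, 0.0], "NP1": [0.0, 1.0], "VP": [1.0, 1.0], "NP2": [2.0, 0.0]}

feature_w2 = path_feature("w2", "w1", parent, embedding)  # path: w2, NP2, VP, w1
feature_w0 = path_feature("w0", "w1", parent, embedding)  # path: w0, NP1, S, VP, w1
```

Here `w1` plays the role of the task-specific node (for example the target predicate in SRL), so each word gets a feature that depends on its syntactic route to that node.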

SLURP: A Spoken Language Understanding Resource Package

Nov 26, 2020

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, Verena Rieser

Abstract: Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp.

* Published at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP-2020)

Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

Nov 05, 2020

Alessandro Suglia, Antonio Vergari, Ioannis Konstas, Yonatan Bisk, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Abstract: In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders, that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.

* Accepted to the International Conference on Computational Linguistics (COLING) 2020

CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

Jun 03, 2020

Alessandro Suglia, Ioannis Konstas, Andrea Vanzo, Emanuele Bastianelli, Desmond Elliott, Stella Frank, Oliver Lemon

Abstract: Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes with three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular concerning attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with abstract and situated attributes. By using diagnostic classifiers, we show that current models learn representations that are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%).

* Accepted to the Annual Conference of the Association for Computational Linguistics (ACL) 2020

Hierarchical Multi-Task Natural Language Understanding for Cross-domain Conversational AI: HERMIT NLU

Oct 02, 2019

Andrea Vanzo, Emanuele Bastianelli, Oliver Lemon

Abstract: We present a new neural architecture for wide-coverage Natural Language Understanding in Spoken Dialogue Systems. We develop a hierarchical multi-task architecture, which delivers a multi-layer representation of sentence meaning (i.e., Dialogue Acts and Frame-like structures). The architecture is a hierarchy of self-attention mechanisms and BiLSTM encoders followed by CRF tagging layers. We describe a variety of experiments, showing that our approach obtains promising results on a dataset annotated with Dialogue Acts and Frame Semantics. Moreover, we demonstrate its applicability to a different, publicly available NLU dataset annotated with domain-specific intents and corresponding semantic roles, providing overall performance higher than state-of-the-art tools such as RASA, Dialogflow, LUIS, and Watson. For example, we show an average 4.45% improvement in entity tagging F-score over Rasa, Dialogflow and LUIS.

* SIGDial 2019
* 10 pages
