Yeon-Jun Kim

1SPU: 1-step Speech Processing Unit

Nov 10, 2023

Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern

Abstract: Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. We propose 1SPU, a 1-step Speech Processing Unit that can recognize speech events (e.g., speaker change) or an NL event (intent, emotion) while also transcribing vocal content. It extends the E2E automatic speech recognition (ASR) system's vocabulary by adding a set of unused placeholder symbols, conceptually akin to the <pad> tokens used in sequence modeling. These placeholders are then assigned to represent semantic events (in the form of tags) and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements on the SLUE benchmark and obtain results that are on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the use of supplementary semantic tags.

* It's a work in progress. More tasks and interesting experiments planned
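
The vocabulary extension described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the base vocabulary, the number of placeholders, the `<unusedN>` naming, and the tag names are all assumptions made for illustration. It shows reserved placeholder symbols being appended to a character vocabulary, mapped to semantic tags, and then separated from the transcript after a standard greedy CTC collapse.

```python
# Hypothetical base vocabulary: CTC blank, word separator, and characters.
BASE_VOCAB = ["<blank>", "|"] + list("abcdefghijklmnopqrstuvwxyz'")
NUM_PLACEHOLDERS = 4  # assumed number of reserved, initially unused symbols

# Extend the vocabulary with unused placeholder symbols.
PLACEHOLDERS = [f"<unused{i}>" for i in range(NUM_PLACEHOLDERS)]
VOCAB = BASE_VOCAB + PLACEHOLDERS

# Assign semantic events to some placeholders (tag names are illustrative).
TAG_FOR_PLACEHOLDER = {
    "<unused0>": "SPEAKER_CHANGE",
    "<unused1>": "INTENT_BOOKING",
}


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse repeated ids and drop blanks (standard greedy CTC decoding)."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


def decode(frame_ids):
    """Return (transcript, tags); each tag is (character position, tag name)."""
    chars, tags = [], []
    for i in ctc_greedy_collapse(frame_ids):
        sym = VOCAB[i]
        if sym in TAG_FOR_PLACEHOLDER:
            # A tag token marks an event at the current transcript position.
            tags.append((len(chars), TAG_FOR_PLACEHOLDER[sym]))
        elif sym in PLACEHOLDERS:
            continue  # placeholder that has not been assigned a meaning
        elif sym == "|":
            chars.append(" ")
        else:
            chars.append(sym)
    return "".join(chars), tags
```

Because the tags are ordinary output tokens, the same per-frame predictions yield both the transcript and the position of each event, with no second tagging pass over the text.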


E2E Spoken Entity Extraction for Virtual Agents

Mar 01, 2023

Karan Singla, Yeon-Jun Kim, Ryan Price, Shahab Jalalvand, Srinivas Bangalore

Abstract: This paper reimagines some aspects of speech processing using speech encoders, specifically extracting entities directly from speech with no intermediate textual representation. In human-computer conversations, extracting entities such as names, postal addresses, and email addresses from speech is a challenging task. In this paper, we study the impact of fine-tuning pre-trained speech encoders on extracting spoken entities in human-readable form directly from speech without the need for text transcription. We illustrate that such a direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases and spellings of entities. In the context of dialogs from an enterprise virtual agent, we demonstrate that the 1-step approach outperforms the typical 2-step cascade of first generating lexical transcriptions and then applying text-based entity extraction to identify spoken entities.


Cross-stitched Multi-modal Encoders

Apr 20, 2022

Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, Srinivas Bangalore

Abstract: In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resulting encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
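
The contrast drawn in the abstract above, attention-based fusion versus concatenation of pre-pooled representations, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's model: it uses a single attention head with no learned projections, plain Python lists instead of tensors, and hypothetical function names (`cross_attend`, `concat_baseline`).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention from one modality over the other.

    Simplification (assumption): one head, and the same vectors serve as
    keys and values, with no learned projection matrices.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        context = [
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(d)
        ]
        fused.append(context)
    return fused


def mean_pool(vectors):
    """Average a sequence of vectors into one vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def concat_baseline(text_states, speech_states):
    """Baseline: pool each modality separately, then concatenate."""
    return mean_pool(text_states) + mean_pool(speech_states)
```

In this sketch `cross_attend(text_states, speech_states)` returns one speech-informed vector per text token, which is what a token-level classifier would consume, whereas `concat_baseline` yields a single utterance-level vector in which the two modalities never interact before pooling.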


Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture

Mar 29, 2022

Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Ryan Price, Daniel Pressel, Srinivas Bangalore

Abstract: Person name capture from human speech is a difficult task in human-machine conversations. In this paper, we propose a novel approach to capture person names from caller utterances in response to the prompt "say and spell your first/last name". Inspired by work on spell correction, disfluency removal, and text normalization, we propose a lightweight Seq-2-Seq system that generates a name spelling from varying user input. Our proposed method outperforms a strong baseline based on an LM-driven, rule-based approach.

* Under review at InterSpeech 2022
