Swayambhu Nath Ray

DuRep: Dual-Mode Speech Representation Learning via ASR-Aware Distillation

May 26, 2025

Prabash Reddy Male, Swayambhu Nath Ray, Harish Arsikere, Akshat Jaiswal, Prakhar Swarup, Prantik Sen, Debmalya Chakrabarty, K V Vijay Girish, Nikhil Bhave, Frederick Weber(+2 more)

Abstract:Recent advancements in speech encoders have drawn attention due to their integration with Large Language Models for various speech tasks. While most research has focused on either causal or full-context speech encoders, there has been limited exploration of how to effectively handle both streaming and non-streaming applications while achieving state-of-the-art performance. We introduce DuRep, a Dual-mode Speech Representation learning setup, which enables a single speech encoder to function efficiently in both offline and online modes across downstream tasks, without additional parameters or mode-specific adjustments. DuRep-200M, our 200M parameter dual-mode encoder, achieves 12% and 11.6% improvements in streaming and non-streaming modes, respectively, over baseline encoders on Multilingual ASR. Scaling this approach to 2B parameters, DuRep-2B sets new performance benchmarks across ASR and non-ASR tasks. Our analysis reveals interesting trade-offs between acoustic and semantic information across encoder layers.

Unified Modeling of Multi-Domain Multi-Device ASR Systems

May 13, 2022

Soumyajit Mitra, Swayambhu Nath Ray, Bharat Padi, Arunasish Sen, Raghavendra Bilgi, Harish Arsikere, Shalini Ghosh, Ajay Srinivasamurthy, Sri Garimella

Abstract:Modern Automatic Speech Recognition (ASR) systems often use a portfolio of domain-specific models in order to get high accuracy for distinct user utterance types across different devices. In this paper, we propose an innovative approach that integrates the different per-domain per-device models into a unified model, using a combination of domain embedding, domain experts, mixture of experts and adversarial training. We run careful ablation studies to show the benefit of each of these innovations in contributing to the accuracy of the overall unified model. Experiments show that our proposed unified modeling approach actually outperforms the carefully tuned per-domain models, giving relative gains of up to 10% over a baseline model with negligible increase in the number of parameters.

* Submitted to Interspeech 2022

Improving RNN-T ASR Performance with Date-Time and Location Awareness

Jun 16, 2021

Swayambhu Nath Ray, Soumyajit Mitra, Raghavendra Bilgi, Sri Garimella

Abstract:In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network Transducer (RNN-T) based Automatic Speech Recognition (ASR) model to improve speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and the approximate location information to make ASR context aware. We show that these contextual signals, when used individually, improve overall performance by as much as 3.48% relative to the baseline, and when the contexts are combined, the model learns complementary features and the recognition improves by 4.62%. On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others. We ran experiments with models trained on data of sizes 30K hours and 10K hours. We show that the scale of improvement with the 10K hours dataset is much higher than the one obtained with the 30K hours dataset. Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.

* To appear in TSD 2021

Timestamping Documents and Beliefs

Jun 09, 2021

Swayambhu Nath Ray

Abstract:Most of the textual information available to us is temporally variable. In a world where information is dynamic, time-stamping it is a very important task. Documents are a good source of information and are used for many tasks, such as sentiment analysis, classification of reviews, etc. Knowledge of the creation date of documents facilitates several tasks, such as summarization, event extraction, temporally focused information extraction, etc. Unfortunately, for most of the documents on the web, the time-stamp meta-data is either erroneous or missing. Thus document dating is a challenging problem which requires inference over the temporal structure of the document alongside the contextual information of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of the document in a principled way. We also point out some limitations of NeuralDater and utilize both context and temporal information in documents in a more flexible and intuitive manner by proposing AD3: Attentive Deep Document Dater, an attention-based document dating system. To the best of our knowledge, these are the first applications of deep learning methods to the task. Through extensive experiments on real-world datasets, we find that our models outperform state-of-the-art baselines by a significant margin.

* Master's Report

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

May 14, 2021

Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Abstract:Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).

Dating Documents using Graph Convolution Networks

Feb 01, 2019

Shikhar Vashishth, Shib Sankar Dasgupta, Swayambhu Nath Ray, Partha Talukdar

Abstract:Document date is essential for many important tasks, such as document retrieval, summarization, event detection, etc. While existing approaches for these tasks assume accurate knowledge of the document date, this is not always available, especially for arbitrary documents from the Web. Document Dating is a challenging problem which requires inference over the temporal structure of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper, we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of the document in a principled way. To the best of our knowledge, this is the first application of deep learning to the problem of document dating. Through extensive experiments on real-world datasets, we find that NeuralDater significantly outperforms the state-of-the-art baseline by 19% absolute (45% relative) accuracy points.

* Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics 2018
* Accepted at ACL 2018
