Lillian Zhou

The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning

Sep 29, 2023

Lillian Zhou, Yuxin Ding, Mingqing Chen, Harry Zhang, Rohit Prabhavalkar, Dhruv Guliani, Giovanni Motta, Rajiv Mathews

Abstract: Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.

* Accepted to IEEE ASRU 2023

Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training

Oct 08, 2021

Lillian Zhou, Dhruv Guliani, Andreas Kabel, Giovanni Motta, Françoise Beaufays

Abstract: Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate that the state-of-the-art Conformer models generally have multiple ambient layers. We study the stability of these layers across runs and model sizes, propose that group normalization may be used without disrupting their formation, and examine their correlation with model weight updates in each layer. Finally, we apply these findings to Federated Learning in order to improve the training procedure, by targeting Federated Dropout to layers by importance. This allows us to reduce the model size optimized by clients without quality degradation, and shows potential for future exploration.
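One way to picture targeting Federated Dropout by layer importance is a mapping from per-layer importance scores to per-layer dropout rates. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function name `dropout_rates_by_importance`, the linear mapping, the `max_rate` parameter, and the example layer names and scores are all hypothetical.

```python
def dropout_rates_by_importance(importance, max_rate=0.5):
    """Map per-layer importance scores to per-layer dropout rates.

    Assumption (not from the paper): the least important layer gets
    `max_rate`, the most important layer gets 0.0, and the layers in
    between are interpolated linearly.
    """
    lo = min(importance.values())
    hi = max(importance.values())
    if hi == lo:
        # All layers look equally important: conservatively drop nothing.
        return {name: 0.0 for name in importance}
    return {
        name: max_rate * (1.0 - (score - lo) / (hi - lo))
        for name, score in importance.items()
    }


# Hypothetical importance scores for three layers (higher = more important).
scores = {"block_0": 1.0, "block_1": 0.5, "block_2": 0.0}
rates = dropout_rates_by_importance(scores, max_rate=0.6)
```

Under this mapping, layers judged less important shed more units on the client, while the most important layer is trained at full width.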

* © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Enabling On-Device Training of Speech Recognition Models with Federated Dropout

Oct 07, 2021

Dhruv Guliani, Lillian Zhou, Changwan Ryu, Tien-Ju Yang, Harry Zhang, Yonghui Xiao, Françoise Beaufays, Giovanni Motta

Abstract: Federated learning can be used to train machine learning models on the edge on local data that never leave devices, providing privacy by default. This presents a challenge pertaining to the communication and computation costs associated with clients' devices. These costs are strongly correlated with the size of the model being trained, and are significant for state-of-the-art automatic speech recognition models. We propose using federated dropout to reduce the size of client models while training a full-size model server-side. We provide empirical evidence of the effectiveness of federated dropout, and propose a novel approach to vary the dropout rate applied at each layer. Furthermore, we find that federated dropout enables a set of smaller sub-models within the larger model to independently have low word error rates, making it easier to dynamically adjust the size of the model deployed for inference.
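The general idea of federated dropout described above (clients train a smaller sub-model, the server keeps the full-size model) can be sketched for a single weight matrix as follows. This is a minimal illustration under assumptions, not the paper's implementation: the helper names (`sample_kept_units`, `extract_submodel`, `aggregate`), dropping whole rows of one matrix, and averaging each row over only the clients that trained it are all choices made for this example.

```python
import random


def sample_kept_units(num_units, dropout_rate, rng):
    """Pick which units (rows) a client keeps; at least one is kept."""
    keep = max(1, round(num_units * (1.0 - dropout_rate)))
    return sorted(rng.sample(range(num_units), keep))


def extract_submodel(weights, kept_rows):
    """Copy only the kept rows, giving the smaller matrix sent to a client."""
    return [list(weights[r]) for r in kept_rows]


def aggregate(weights, client_results):
    """Average client rows back into the full-size server matrix.

    `client_results` is a list of (kept_rows, trained_sub_matrix) pairs.
    Rows that no client trained in this round keep their server value.
    """
    sums, counts = {}, {}
    for kept_rows, sub in client_results:
        for sub_idx, r in enumerate(kept_rows):
            if r not in sums:
                sums[r] = [0.0] * len(sub[sub_idx])
                counts[r] = 0
            sums[r] = [s + v for s, v in zip(sums[r], sub[sub_idx])]
            counts[r] += 1
    merged = [list(row) for row in weights]
    for r, total in sums.items():
        merged[r] = [s / counts[r] for s in total]
    return merged


# Toy round: a 4x2 server matrix and two clients with overlapping sub-models.
server = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
kept = sample_kept_units(len(server), dropout_rate=0.5, rng=random.Random(0))
sub_for_client = extract_submodel(server, kept)

# Stand-ins for locally trained sub-models (values are made up).
results = [([0, 1], [[0.0, 0.0], [0.0, 0.0]]),
           ([1, 2], [[4.0, 4.0], [5.0, 5.0]])]
new_server = aggregate(server, results)
```

A per-layer variant would call `sample_kept_units` with a different `dropout_rate` for each layer, which is where a layer-specific rate schedule would plug in.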

* © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

Personalization of End-to-end Speech Recognition On Mobile Devices For Named Entities

Dec 14, 2019

Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson (+2 more)

Abstract: We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amounts of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are difficult to recognize. Therefore, the baseline recall rate for proper names in this dataset is very low: 2.4%. A data synthesis approach we developed brings it to 48.6%, with no need for speech input from the user. With speech input, if the user corrects only the names, the name recall rate improves to 64.4%. If the user corrects all the recognition errors, we achieve the best recall of 73.5%. To eliminate the need to upload user data and store personalized models on a server, we focus on performing the entire personalization workflow on a mobile device.
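To make the keyword-dependent precision and recall metrics concrete, here is a small sketch of how they could be computed for one keyword over a set of (reference, hypothesis) transcript pairs. The abstract does not give the exact definition, so the counting rule is an assumption: per utterance, the number of correct keyword hits is the smaller of the keyword's count in the reference and in the hypothesis. The function name and the example sentences are hypothetical.

```python
from collections import Counter


def keyword_precision_recall(pairs, keyword):
    """Precision and recall for one keyword over (reference, hypothesis) pairs.

    Assumed counting rule: per utterance, true positives are
    min(count in reference, count in hypothesis); matching is
    case-insensitive on whitespace-separated tokens.
    """
    keyword = keyword.lower()
    true_pos = ref_total = hyp_total = 0
    for reference, hypothesis in pairs:
        ref_n = Counter(reference.lower().split())[keyword]
        hyp_n = Counter(hypothesis.lower().split())[keyword]
        true_pos += min(ref_n, hyp_n)
        ref_total += ref_n
        hyp_total += hyp_n
    precision = true_pos / hyp_total if hyp_total else 0.0
    recall = true_pos / ref_total if ref_total else 0.0
    return precision, recall


# Hypothetical transcripts: one miss, one hit, one false insertion of the name.
pairs = [
    ("call anneliese now", "call anna lisa now"),
    ("text anneliese", "text anneliese"),
    ("call mom", "call anneliese"),
]
precision, recall = keyword_precision_recall(pairs, "Anneliese")
```

Recall here answers "how often was the name recognized when it was spoken", which is the quantity the abstract reports rising from 2.4% to 73.5%; precision guards against a personalized model that over-produces the name.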
