Tycho Max Sylvester Tax

On Scaling Contrastive Representations for Low-Resource Speech Recognition

Feb 01, 2021

Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, Lars Maaløe, Christian Igel

Abstract: Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.

* © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
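
The abstract mentions two related ideas: representations that occupy a low-dimensional subspace, and decorrelating features to stabilize training. The listing does not say how the authors implement either, so the following is only a minimal sketch using PCA whitening, one common way to decorrelate features. The function name `pca_whiten`, the 99% variance threshold, and the `eps` value are assumptions made for illustration.

```python
import numpy as np


def pca_whiten(features, eps=1e-5, variance_kept=0.99):
    """Decorrelate a (frames, dims) feature matrix with PCA whitening.

    Returns the whitened features and the number of principal components
    needed to explain `variance_kept` of the total variance, which gives a
    rough estimate of the subspace dimensionality. (Illustrative helper,
    not taken from the paper.)
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)

    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    # Count the components needed to reach the variance threshold.
    explained = np.cumsum(eigvals) / eigvals.sum()
    n_components = int(np.searchsorted(explained, variance_kept) + 1)

    # Rotate onto the principal axes and rescale each axis; eps guards
    # against division by (near-)zero eigenvalues.
    whitened = centered @ eigvecs / np.sqrt(eigvals + eps)
    return whitened, n_components
```

After whitening, the sample covariance of the output is diagonal, so the feature dimensions are uncorrelated. A small `n_components` relative to the feature dimension is one sign that the representations sit in a low-dimensional subspace.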

Utilizing Domain Knowledge in End-to-End Audio Processing

Dec 01, 2017

Tycho Max Sylvester Tax, Jose Luis Diez Antich, Hendrik Purwins, Lars Maaløe

Abstract: End-to-end neural network based approaches to audio modelling are generally outperformed by models trained on high-level data representations. In this paper we present preliminary work that shows the feasibility of training the first layers of a deep convolutional neural network (CNN) model to learn the commonly-used log-scaled mel-spectrogram transformation. Secondly, we demonstrate that upon initializing the first layers of an end-to-end CNN classifier with the learned transformation, convergence and performance on the ESC-50 environmental sound classification dataset are similar to a CNN-based model trained on the highly pre-processed log-scaled mel-spectrogram features.

* Accepted at the ML4Audio workshop at NIPS 2017.
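
For readers unfamiliar with the log-scaled mel-spectrogram named in the abstract, here is a minimal NumPy sketch of the conventional hand-crafted transformation, that is, the target the first CNN layers would be trained to approximate. It is not the authors' code. The frame length, hop size, number of mel bands, and the specific Hz-to-mel formula are assumptions chosen for illustration.

```python
import numpy as np


def hz_to_mel(hz):
    # A widely used mel-scale formula; other variants exist.
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)

    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=1024, hop=512, n_mels=40, eps=1e-10):
    """Return a (frames, n_mels) log-scaled mel-spectrogram in decibels.

    Assumes len(signal) >= n_fft.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])

    # Power spectrum per frame, then pool FFT bins into mel bands.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T

    # eps avoids log(0) in silent bands.
    return 10.0 * np.log10(mel_power + eps)
```

The pipeline is a windowed short-time Fourier transform, a fixed linear map onto mel bands, and a logarithm. The first two stages are linear in the framed signal up to the squared magnitude, which makes the transformation a plausible target for convolutional layers to approximate.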

Exploiting Nontrivial Connectivity for Automatic Speech Recognition

Nov 28, 2017

Marius Paraschiv, Lasse Borgholt, Tycho Max Sylvester Tax, Marco Singh, Lars Maaløe

Abstract: Nontrivial connectivity has allowed the training of very deep networks by addressing the problem of vanishing gradients and offering a more efficient method of reusing parameters. In this paper we make a comparison between residual networks, densely-connected networks and highway networks on an image classification task. Next, we show that these methodologies can easily be deployed into automatic speech recognition and provide significant improvements to existing models.

* Accepted at the ML4Audio workshop at NIPS 2017.
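
The three connectivity patterns compared in the abstract differ in how a block combines its input with the output of its layers. The sketch below shows each pattern with plain fully connected ReLU layers in NumPy. It only illustrates the general ideas: the layer type, function names, and shapes are assumptions and do not reproduce the architectures used in the paper.

```python
import numpy as np


def layer(x, weight, bias):
    """A single fully connected layer with ReLU activation."""
    return np.maximum(0.0, x @ weight + bias)


def residual_block(x, weight, bias):
    """Residual connection: add the input to the layer output."""
    return x + layer(x, weight, bias)


def highway_block(x, weight, bias, gate_weight, gate_bias):
    """Highway connection: a learned sigmoid gate mixes the layer output
    with the untouched input."""
    gate = 1.0 / (1.0 + np.exp(-(x @ gate_weight + gate_bias)))
    return gate * layer(x, weight, bias) + (1.0 - gate) * x


def dense_block(x, weights, biases):
    """Dense connectivity: each layer sees the concatenation of the block
    input and all earlier layer outputs."""
    features = x
    for weight, bias in zip(weights, biases):
        new_features = layer(features, weight, bias)
        features = np.concatenate([features, new_features], axis=1)
    return features
```

In all three cases the input reaches the block output through a path that skips the nonlinearity (an addition, a gate, or a concatenation), which is what lets gradients flow through many stacked blocks. Residual and highway blocks require the layer output to match the input width, while a dense block grows the feature width with every layer.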
