Tanay Sharma

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Jan 29, 2020

Rohith Aralikatti, Sharad Roy, Abhinav Thanda, Dilip Kumar Margam, Pujitha Appan Kandala, Tanay Sharma, Shankar M Venkatesan

Abstract: Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, namely the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from the audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST, generated by taking a special union of the HMM components, is used for decoding with a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying SNR and show that our methods give significant improvements over acoustic-only WER.

* Submitted for review to ICASSP 2020 on October 21st, 2019
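The seq2seq fusion described in the abstract can be illustrated with a toy sketch. It assumes the common log-linear form of shallow fusion, where each candidate token is scored as the audio log-probability plus a weight times the visual log-probability. The weight `lam`, the function names, and the toy models are hypothetical; the abstract does not give the exact scoring rule used in the paper.

```python
import math

EOS = "</s>"

def shallow_fusion_beam_search(audio_logp, visual_logp, lam, beam_size, max_len):
    """Beam search over one shared beam, scoring tokens with both models.

    audio_logp / visual_logp: callables mapping a token prefix (tuple) to a
    dict {token: log-probability}. They stand in for two independently
    trained seq2seq decoders and are assumed to share one vocabulary.
    """
    beam = [((), 0.0)]  # (prefix, fused log-score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished hypotheses carry over
                continue
            a = audio_logp(prefix)
            v = visual_logp(prefix)
            for tok in a:
                # Assumed fusion rule: log p_audio + lam * log p_visual
                fused = a[tok] + lam * v[tok]
                candidates.append((prefix + (tok,), score + fused))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
        if all(p and p[-1] == EOS for p, _ in beam):
            break
    return beam

def _logs(probs):
    return {tok: math.log(p) for tok, p in probs.items()}

# Toy models: the (noisy) audio model slightly prefers "bin",
# while the visual model clearly prefers "pin".
def toy_audio(prefix):
    if not prefix:
        return _logs({"bin": 0.5, "pin": 0.4, EOS: 0.1})
    return _logs({"bin": 0.05, "pin": 0.05, EOS: 0.9})

def toy_visual(prefix):
    if not prefix:
        return _logs({"bin": 0.2, "pin": 0.7, EOS: 0.1})
    return _logs({"bin": 0.05, "pin": 0.05, EOS: 0.9})

audio_only = shallow_fusion_beam_search(toy_audio, toy_visual, 0.0, 2, 5)
fused = shallow_fusion_beam_search(toy_audio, toy_visual, 1.0, 2, 5)
print(audio_only[0][0])  # ('bin', '</s>')
print(fused[0][0])       # ('pin', '</s>')
```

With `lam = 0.0` the search reduces to audio-only decoding; with `lam = 1.0` the visual evidence flips the top hypothesis in this toy case. Because both models score the same prefixes, neither needs to be retrained to take part in the fusion.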

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

Jun 25, 2019

Dilip Kumar Margam, Rohith Aralikatti, Tanay Sharma, Abhinav Thanda, Pujitha A K, Sharad Roy, Shankar M Venkatesan

Abstract: In recent years, deep learning based machine lipreading has gained prominence. To this end, several architectures such as LipNet, LCANet and others have been proposed which perform extremely well compared to traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture: a 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present an analysis of two different approaches for lipreading on this architecture. In the first approach, the 3D-2D-CNN-BLSTM network is trained with CTC loss on characters (ch-CTC). Then a BLSTM-HMM model is trained on bottleneck lip features (extracted from the 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, the same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features perform better than DCT features. Using the second approach on the seen-speaker test set of the Grid corpus, we report 1.3% WER, a 55% improvement relative to LCANet. On the unseen-speaker test set we report 8.6% WER, which is a 24.5% improvement relative to LipNet. We also verify the method on a second dataset of 81 speakers which we collected. Finally, we also discuss the effect of feature duplication on BLSTM-HMM model performance.

* Submitted to Interspeech 2019
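The difference between the ch-CTC and w-CTC label units mentioned in the abstract can be seen in a short sketch of the standard CTC collapsing rule (merge consecutive repeats, then drop blanks). The frame sequences below are made up for illustration, and greedy per-frame labels are an assumption; the abstract does not say how the networks' outputs are decoded.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical name for the CTC blank symbol

def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule to a per-frame label sequence."""
    merged = [label for label, _ in groupby(frame_labels)]  # merge repeats
    return [label for label in merged if label != blank]    # drop blanks

# ch-CTC: each frame carries a character (or blank).
char_frames = ["b", "b", BLANK, "i", "i", "n", BLANK, BLANK,
               " ", "b", "l", "l", BLANK, "u", "e"]
print("".join(ctc_collapse(char_frames)))   # bin blue

# w-CTC: each frame carries a whole word label (or blank).
word_frames = [BLANK, "bin", "bin", BLANK, "blue", "blue", "blue", BLANK]
print(" ".join(ctc_collapse(word_frames)))  # bin blue
```

Both label sets collapse to the same transcript here. The word-level variant emits far fewer distinct labels per utterance, at the cost of a fixed word vocabulary.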

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks

Apr 12, 2018

Rohith Aralikatti, Dilip Margam, Tanay Sharma, Thanda Abhinav, Shankar M Venkatesan

Abstract: This paper demonstrates two novel methods to estimate the global SNR of speech signals. In both methods, the Deep Neural Network-Hidden Markov Model (DNN-HMM) acoustic model used in speech recognition systems is leveraged for the additional task of SNR estimation. In the first method, the entropy of the DNN-HMM output is computed. Recent work on Bayesian deep learning has shown that a DNN-HMM trained with dropout can be used to estimate model uncertainty by approximating it as a deep Gaussian process. In the second method, this approximation is used to obtain model uncertainty estimates. Noise-specific regressors are used to predict the SNR from the entropy and model uncertainty. The DNN-HMM is trained on the GRID corpus and tested on different noise profiles from the DEMAND noise database at SNR levels ranging from -10 dB to 30 dB.
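The two quantities named in the abstract, output entropy and dropout-based model uncertainty, can be sketched as follows. This is a minimal illustration under assumptions: the posteriors are hand-written lists rather than real DNN-HMM outputs, and the variance across stochastic forward passes is used as a simple uncertainty proxy, which may differ from the estimator used in the paper. The regression from these features to SNR is not shown.

```python
import math

def entropy(posterior):
    """Shannon entropy (in nats) of one frame's posterior over output classes."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def mean_frame_entropy(posteriors):
    """Average entropy over all frames of an utterance."""
    return sum(entropy(p) for p in posteriors) / len(posteriors)

def mc_dropout_variance(passes):
    """Per-class variance across stochastic forward passes, averaged over classes.

    passes: list of posterior vectors for the same frame, each obtained with
    dropout kept active at inference time (hypothetical inputs here).
    """
    n_passes = len(passes)
    n_classes = len(passes[0])
    total = 0.0
    for c in range(n_classes):
        values = [p[c] for p in passes]
        mean = sum(values) / n_passes
        total += sum((x - mean) ** 2 for x in values) / n_passes
    return total / n_classes

# A confident frame has low entropy; a flat posterior has high entropy.
print(entropy([1.0, 0.0]))   # 0.0
print(entropy([0.5, 0.5]))   # log(2), about 0.693

# Passes that disagree yield a larger variance than passes that agree.
print(mc_dropout_variance([[0.9, 0.1], [0.9, 0.1]]))  # 0.0
print(mc_dropout_variance([[0.9, 0.1], [0.5, 0.5]]))  # about 0.04
```

The intuition is that noisier input tends to flatten the posteriors and make dropout passes disagree more, so both numbers can serve as input features for a regressor that predicts SNR.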
