Abhinav Thanda

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Jan 29, 2020

Rohith Aralikatti, Sharad Roy, Abhinav Thanda, Dilip Kumar Margam, Pujitha Appan Kandala, Tanay Sharma, Shankar M Venkatesan

Abstract: Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, namely the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from the audio and visual modalities at inference time. This enables us to train the acoustic and visual models independently. First, we train separate RNN-HMM based acoustic and visual models. A common WFST, generated by taking a special union of the HMM components, is used for decoding with a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models. The decoding step is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion without the weighting parameter. We present results at varying SNR and show that our methods give significant improvements over acoustic-only WER.

* Submitted for review to ICASSP 2020 on October 21st, 2019
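
The seq2seq decoding described in the abstract can be illustrated with a minimal sketch. It assumes each model exposes a function that returns next-token log-probabilities for a given prefix, and that the two scores are combined log-linearly with a weight `lam`. The function names, the toy models, and the exact combination rule are illustrative assumptions, not the paper's implementation.

```python
import math


def fuse_step(audio_logp, visual_logp, lam):
    """Log-linear combination of per-token log-probabilities (assumed rule)."""
    return {tok: (1.0 - lam) * audio_logp[tok] + lam * visual_logp[tok]
            for tok in audio_logp}


def fused_beam_search(audio_step, visual_step, lam, beam_width=2,
                      max_len=10, eos="</s>"):
    """Beam search that keeps ONE hypothesis beam scored by both models."""
    beam = [((), 0.0)]  # (token prefix, accumulated fused log-score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished hypothesis
                continue
            fused = fuse_step(audio_step(prefix), visual_step(prefix), lam)
            for tok, logp in fused.items():
                candidates.append((prefix + (tok,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
        if all(p and p[-1] == eos for p, _ in beam):
            break
    return beam


def toy_model(preferred):
    """Hypothetical stand-in for a trained decoder: favours one first token."""
    vocab = ["a", "b", "</s>"]

    def step(prefix):
        if prefix:  # after the first token, strongly prefer end-of-sequence
            probs = {"a": 0.05, "b": 0.05, "</s>": 0.9}
        else:
            probs = {t: (0.8 if t == preferred else 0.1) for t in vocab}
        return {t: math.log(p) for t, p in probs.items()}

    return step


audio, visual = toy_model("a"), toy_model("b")
best_audio_only = fused_beam_search(audio, visual, lam=0.0)[0][0]
best_visual_only = fused_beam_search(audio, visual, lam=1.0)[0][0]
```

Setting `lam` to 0 or 1 recovers single-modality decoding, which makes the weight easy to sanity-check before tuning it per noise level.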

LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

Jun 25, 2019

Dilip Kumar Margam, Rohith Aralikatti, Tanay Sharma, Abhinav Thanda, Pujitha A K, Sharad Roy, Shankar M Venkatesan

Abstract: In recent years, deep learning based machine lipreading has gained prominence. To this end, several architectures such as LipNet, LCANet and others have been proposed which perform extremely well compared to traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture, a 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present an analysis of two different approaches for lipreading on this architecture. In the first approach, the 3D-2D-CNN-BLSTM network is trained with CTC loss on characters (ch-CTC). Then a BLSTM-HMM model is trained on bottleneck lip features (extracted from the 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, the same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features perform better than DCT features. Using the second approach on the GRID corpus seen-speaker test set, we report 1.3% WER, a 55% improvement relative to LCANet. On the unseen-speaker test set we report 8.6% WER, which is a 24.5% improvement relative to LipNet. We also verify the method on a second dataset of 81 speakers which we collected. Finally, we also discuss the effect of feature duplication on BLSTM-HMM model performance.

* Submitted to Interspeech 2019
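
The WER figures and relative improvements quoted in the abstract can be related with a short sketch. It assumes the standard definitions (word-level edit distance divided by reference length, and relative improvement as the fractional reduction in WER); the function names and the example sentences are illustrative, and the abstract does not state the baseline WERs themselves.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fractional WER reduction with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, rel_improvement):
    """Baseline WER implied by a new WER and a relative improvement."""
    return new_wer / (1.0 - rel_improvement)


# Illustrative sentence pair with one substituted word out of six.
example_wer = word_error_rate("bin blue at f two now", "bin blue at f too now")

# Baselines implied by the abstract's numbers (derived, not quoted).
seen_baseline = implied_baseline(1.3, 0.55)
unseen_baseline = implied_baseline(8.6, 0.245)
```

Applied to the figures above, 1.3% WER at a 55% relative improvement implies a baseline near 2.9%, and 8.6% at 24.5% implies one near 11.4%. These are derived values, not numbers quoted in the abstract.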

Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition

Jan 10, 2017

Abhinav Thanda, Shankar M Venkatesan

Abstract: Multi-task learning (MTL) involves the simultaneous training of two or more related tasks over shared representations. In this work, we apply MTL to audio-visual automatic speech recognition (AV-ASR). Our primary task is to learn a mapping between audio-visual fused features and frame labels obtained from an acoustic GMM/HMM model. This is combined with an auxiliary task which maps visual features to frame labels obtained from a separate visual GMM/HMM model. The MTL model is tested at various levels of babble noise and the results are compared with a baseline hybrid DNN-HMM AV-ASR model. Our results indicate that MTL is especially useful at higher noise levels. Compared to the baseline, up to 7% relative improvement in WER is reported at -3 dB SNR.
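
A common way to combine a primary and an auxiliary task is a weighted sum of per-task losses, sketched below for a single frame. The abstract does not state the loss or the weighting, so the cross-entropy form, the `aux_weight` parameter, and its default value are assumptions made for illustration.

```python
import math


def cross_entropy(posteriors, label):
    """Negative log-probability assigned to the correct frame label."""
    return -math.log(posteriors[label])


def mtl_loss(primary_post, primary_label, aux_post, aux_label, aux_weight=0.5):
    """Primary-task loss plus a down-weighted auxiliary-task loss.

    primary_post: posteriors over acoustic frame labels from the
                  audio-visual branch (hypothetical network output)
    aux_post:     posteriors over visual frame labels from the
                  visual-only branch (hypothetical network output)
    """
    primary = cross_entropy(primary_post, primary_label)
    auxiliary = cross_entropy(aux_post, aux_label)
    return primary + aux_weight * auxiliary


# One frame: the primary branch gives the correct label probability 0.5,
# the auxiliary branch gives its correct label probability 0.25.
loss = mtl_loss([0.5, 0.5], 0, [0.25, 0.75], 0, aux_weight=0.5)
```

With `aux_weight=0` the objective reduces to the primary task alone, which gives a direct single-task comparison point.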

Audio Visual Speech Recognition using Deep Recurrent Neural Networks

Nov 09, 2016

Abhinav Thanda, Shankar M Venkatesan

Abstract: In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using a deep recurrent neural network (RNN). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. Audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for the visual modality helps the model to converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality gives significant improvement in character error rate (CER) at various levels of noise, even when the model is trained without noisy data. We also provide a comparison of two fusion methods: feature fusion and decision fusion.
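
For readers unfamiliar with CTC outputs, the sketch below shows how a per-frame label sequence maps to a character sequence under the standard CTC collapsing rule (merge repeated labels, then drop blanks). The abstract does not say how the authors decode, so best-path decoding, the function name, and the `_` blank symbol are assumptions for illustration.

```python
from itertools import groupby


def ctc_best_path(frame_labels, blank="_"):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Eleven frames of hypothetical per-frame argmax labels spelling "bin".
decoded = "".join(ctc_best_path(list("__bb_ii_n__")))

# A blank between identical labels keeps both; without it they merge.
double_l = ctc_best_path(list("l_l"))
single_l = ctc_best_path(list("ll"))
```

The blank symbol is what lets CTC represent doubled characters, which is why `l_l` and `ll` collapse differently.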
