Jian-Shu Zhang

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Feb 15, 2022

Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Abstract: With the advance of self-supervised learning for the audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. Such a representation should improve audio-visual speech recognition (AVSR) performance, since multi-modal inputs in principle carry richer information than either modality alone. In this paper, we therefore propose an audio-visual representation learning approach that builds on existing self-supervised representation learning methods for the audio modality. The proposed approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model is able to extract the fused representations required by AVSR. It can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The pre-trained model is evaluated on speech recognition and lipreading tasks using one or two modalities, where it is shown to be superior.
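
The idea of masking out one modality before fusion can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_inputs`, the zero-filling of the missing modality, the per-frame concatenation, and the assumption that both streams are already aligned to the same frame rate are all hypothetical choices made for illustration. The abstract does not specify how the masking is realized, and the transformer-based fusion module itself is omitted.

```python
from typing import List, Optional

Frame = List[float]


def fuse_inputs(
    audio: Optional[List[Frame]],
    video: Optional[List[Frame]],
    audio_dim: int,
    video_dim: int,
) -> List[Frame]:
    """Build per-frame inputs for a (hypothetical) fusion module.

    Each output frame is the audio features followed by the video features.
    A modality passed as None is masked out, here by filling it with zeros
    (an assumption; a learned mask embedding would be another option).
    """
    if audio is None and video is None:
        raise ValueError("at least one modality is required")
    if audio is not None and video is not None and len(audio) != len(video):
        raise ValueError("audio and video must have the same number of frames")

    # Number of frames is taken from whichever modality is present.
    n_frames = len(audio) if audio is not None else len(video)

    # Replace a missing modality with all-zero frames of the right width.
    a = audio if audio is not None else [[0.0] * audio_dim for _ in range(n_frames)]
    v = video if video is not None else [[0.0] * video_dim for _ in range(n_frames)]

    # Concatenate the two feature vectors frame by frame.
    return [af + vf for af, vf in zip(a, v)]


# Audio-visual input: both modalities present.
av = fuse_inputs([[1.0, 2.0]], [[3.0]], audio_dim=2, video_dim=1)

# Lipreading-style input: audio masked out, video only.
video_only = fuse_inputs(None, [[3.0]], audio_dim=2, video_dim=1)
```

Under these assumptions, the same downstream model sees inputs of a fixed width whether one or two modalities are supplied, which is what allows a single pre-trained model to serve AVSR, audio-only recognition, and lipreading.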
