Abstract: Recently, the Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, Transformer-based models degrade significantly on long-form speech, such as lectures, because the self-attention mechanism becomes unreliable and its computational cost grows quadratically with the input length. To address this problem, we incorporate a state-space model, Hungry Hungry Hippos (H3), to replace or complement multi-head self-attention (MHSA). H3 enables efficient modeling of long-form sequences with computation that is linear in the input length. In experiments on two datasets, CSJ and LibriSpeech, our proposed H3-Conformer model achieves efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA, and show that using H3 in the higher layers and MHSA in the lower layers yields a significant improvement in online recognition. We also investigate the parallel use of H3 and MHSA in all layers, which gives the best performance.
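To make the parallel combination of the two branches concrete, the following is a minimal sketch assuming a PyTorch implementation. The SimpleSSM class, the plain diagonal recurrence, and all layer sizes are illustrative placeholders rather than the actual H3 layer or Conformer block described in the paper (H3 additionally uses shift and diagonal SSMs with FFT-based evaluation for efficiency); the sketch only illustrates summing an attention branch and a linear-time state-space branch inside one block.

```python
# Hypothetical sketch: a Conformer-style block combining MHSA and a
# simplified linear-time state-space layer in parallel.  Not the exact
# H3-Conformer architecture; names and dimensions are illustrative.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal state-space layer: O(T) recurrence over time."""

    def __init__(self, d_model: int):
        super().__init__()
        # Per-channel decay in (0, 1) plus input/output projections.
        self.log_decay = nn.Parameter(torch.zeros(d_model))
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        u = self.in_proj(x)
        a = torch.sigmoid(self.log_decay)           # (d_model,)
        state = torch.zeros_like(u[:, 0])
        outputs = []
        for t in range(u.size(1)):                  # linear in sequence length
            state = a * state + (1.0 - a) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class ParallelAttentionSSMBlock(nn.Module):
    """MHSA and the SSM branch applied in parallel; outputs are summed."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = SimpleSSM(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.mhsa(h, h, h, need_weights=False)
        return x + attn_out + self.ssm(h)           # residual + both branches


if __name__ == "__main__":
    block = ParallelAttentionSSMBlock()
    speech_features = torch.randn(2, 1000, 256)     # (batch, frames, features)
    print(block(speech_features).shape)             # torch.Size([2, 1000, 256])
```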