Bhargav Srinivas Chinnari

Cross-stitched Multi-modal Encoders

Apr 20, 2022

Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, Srinivas Bangalore

Abstract: In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resultant architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resultant encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
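
To make the two fusion strategies named in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It contrasts multi-headed cross-modal attention with a plain concatenation of mean-pooled, modality-specific vectors. Everything beyond that contrast is an assumption made for illustration: random arrays stand in for the pretrained encoder outputs, text tokens act as queries over speech frames, a residual connection is applied, mean pooling produces the utterance vector, and names such as `cross_modal_attention` and `concat_pooled` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(text, speech, w_q, w_k, w_v, w_o, num_heads):
    """Text tokens (queries) attend over speech frames (keys and values).

    text:   (text_len, d_model)   stand-in for text-encoder outputs
    speech: (speech_len, d_model) stand-in for speech-encoder outputs
    Returns the attended features (text_len, d_model) and the attention
    weights (num_heads, text_len, speech_len).
    """
    text_len, d_model = text.shape
    speech_len = speech.shape[0]
    head_dim = d_model // num_heads

    def split_heads(x, length):
        # (length, d_model) -> (num_heads, length, head_dim)
        return x.reshape(length, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(text @ w_q, text_len)
    k = split_heads(speech @ w_k, speech_len)
    v = split_heads(speech @ w_v, speech_len)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, text_len, head_dim)

    # Merge heads back to (text_len, d_model) and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(text_len, d_model)
    return merged @ w_o, weights


def concat_pooled(text, speech):
    """Baseline: mean-pool each modality over time, then concatenate."""
    return np.concatenate([text.mean(axis=0), speech.mean(axis=0)])


rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
text = rng.normal(size=(5, d_model))     # 5 text tokens (assumed sizes)
speech = rng.normal(size=(40, d_model))  # 40 speech frames (assumed sizes)
w_q, w_k, w_v, w_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

fused, attn = cross_modal_attention(text, speech, w_q, w_k, w_v, w_o, num_heads)
token_level = text + fused                      # per-token features, residual assumed
utterance_attn = token_level.mean(axis=0)       # utterance vector from attention fusion
utterance_concat = concat_pooled(text, speech)  # utterance vector from the baseline
```

In this sketch `token_level` keeps one vector per text token, which is the shape a token-level classifier would consume, while `utterance_attn` and `utterance_concat` are single vectors suitable for utterance-level prediction. The baseline never lets one modality look at individual time steps of the other, which is the difference the abstract's comparison targets.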
