Title: Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

May 22, 2023

Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai


Abstract: End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot learning by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model trained on text-semantics corpora. However, this method requires the domains of the speech-text and text-semantics corpora to match, and they often do not because the corpora are collected separately. Furthermore, using the entire speech-text corpus from arbitrary domains leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both settings, with significantly reduced sample sizes and training time.
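The pseudolabel-then-select idea in the abstract can be illustrated with a toy sketch. This is not the authors' CMSST implementation: the keyword-based `toy_nlu` is a hypothetical stand-in for an NLU model trained on text-semantics pairs, a plain confidence threshold stands in for the selection network, and a per-label cap stands in for clustering-based balancing in the joint space. All names and data below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, List, NamedTuple, Tuple


class SpeechText(NamedTuple):
    audio_id: str      # stand-in for an audio clip
    transcript: str


class PseudoLabeled(NamedTuple):
    audio_id: str
    transcript: str
    label: str
    confidence: float


def toy_nlu(text: str) -> Tuple[str, float]:
    """Hypothetical keyword NLU; a real system would use a trained model."""
    keywords = {"play": "music", "song": "music",
                "weather": "weather", "rain": "weather"}
    words = text.lower().split()
    hits = [keywords[w] for w in words if w in keywords]
    if not hits:
        return "unknown", 0.0
    # Most frequent label; ties are broken alphabetically for determinism.
    label = max(sorted(set(hits)), key=hits.count)
    return label, hits.count(label) / len(words)


def pseudolabel(pairs: List[SpeechText],
                nlu: Callable[[str], Tuple[str, float]]) -> List[PseudoLabeled]:
    """Attach an NLU-predicted label to every speech-text pair."""
    return [PseudoLabeled(p.audio_id, p.transcript, *nlu(p.transcript))
            for p in pairs]


def select_balanced(samples: List[PseudoLabeled],
                    min_conf: float, per_label: int) -> List[PseudoLabeled]:
    """Drop low-confidence pseudolabels, then cap the count per label."""
    by_label = defaultdict(list)
    for s in samples:
        if s.confidence >= min_conf:
            by_label[s.label].append(s)
    selected = []
    for label in sorted(by_label):
        ranked = sorted(by_label[label],
                        key=lambda s: s.confidence, reverse=True)
        selected.extend(ranked[:per_label])
    return selected


pairs = [
    SpeechText("a1", "play a song"),
    SpeechText("a2", "play music now"),
    SpeechText("a3", "play that again please"),
    SpeechText("a4", "will it rain"),
    SpeechText("a5", "hello there"),
]
selected = select_balanced(pseudolabel(pairs, toy_nlu),
                           min_conf=0.3, per_label=1)
print([s.audio_id for s in selected])
```

The selected pairs would then serve as speech-semantics training data for the E2E SLU model, using far fewer samples than labeling the whole speech-text corpus.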

* 15 pages, 7 figures
