Chih-Chen Chen

Evaluating Self-Supervised Speech Representations for Indigenous American Languages

Oct 08, 2023

Chih-Chen Chen, William Chen, Rodolfo Zevallos, John E. Ortega

Abstract: The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American language. We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR. Our results show surprisingly strong performance by state-of-the-art SSL models, demonstrating the potential generalizability of large-scale models to real-world data.

Benchmarking Azerbaijani Neural Machine Translation

Jul 29, 2022

Chih-Chen Chen, William Chen

Abstract: Little research has been done on Neural Machine Translation (NMT) for Azerbaijani. In this paper, we benchmark the performance of Azerbaijani-English NMT systems across a range of techniques and datasets. We evaluate which segmentation techniques work best for Azerbaijani translation and benchmark the performance of Azerbaijani NMT models across several domains of text. Our results show that while Unigram segmentation improves NMT performance and Azerbaijani translation models scale better with dataset quality than quantity, cross-domain generalization remains a challenge.

* Published in The International Conference and Workshop on Agglutinative Language Technologies as a Challenge for NLP (ALTNLP) https://www.altnlp.org
