Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Noel Chen

Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

Dec 06, 2023

Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang(+5 more)

Abstract:Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-supervised learning (SSL) speech representations on our dataset, we find that model size does not consistently determine performance. In fact, certain smaller models outperform larger ones. Furthermore, linguistic alignment between pretraining data and the target language plays a crucial role.

* Accepted to ASRU 2023

Via

Access Paper or Ask Questions

Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Jun 22, 2022

Chonghan Chen, Qi Jiang, Chih-Hao Wang, Noel Chen, Haohan Wang, Xiang Li, Bhiksha Raj

Figure 1 for Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Figure 2 for Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Figure 3 for Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Figure 4 for Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

Abstract:Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. In addition, the query-aware visual features are informative enough to achieve comparable performance to the latest methods when directly used for prediction without further multi-modal fusion.

Via

Access Paper or Ask Questions