Jiahuan Li

Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

Apr 02, 2025

Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang

Abstract: Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
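The abstract does not say how the synthetic code-switching data is produced. One simple and widely used approach, shown here only as an illustration and not as the authors' pipeline, is to swap some words for dictionary translations. The lexicon `EN_ZH`, the function `code_switch`, and the `ratio` parameter are all hypothetical names introduced for this sketch.

```python
import random

# Hypothetical toy English-to-Chinese lexicon, for illustration only.
EN_ZH = {"cat": "猫", "water": "水", "book": "书"}

def code_switch(sentence, lexicon, ratio=0.5, seed=0):
    """Replace a random subset of lexicon-covered words with their translations.

    `ratio` is the probability that a covered word is switched; this is an
    assumed knob, not a parameter taken from the paper.
    """
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in lexicon and rng.random() < ratio:
            out.append(lexicon[key])   # switch to the other language
        else:
            out.append(word)           # keep the original word
    return " ".join(out)

if __name__ == "__main__":
    print(code_switch("the cat drinks water", EN_ZH, ratio=1.0))
```

Setting `ratio` to 1.0 switches every covered word and 0.0 leaves the sentence unchanged, which makes the amount of mixing easy to control when scaling up the synthetic data.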

"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Dec 26, 2024

Jiawei Yu, Xiang Geng, Yuang Li, Mengxin Ren, Wei Tang, Jiahuan Li, Zhibin Lan, Min Zhang, Hao Yang, Shujian Huang(+1 more)

Abstract: Spoken named entity recognition (NER) aims to identify named entities from speech, playing an important role in speech processing. New named entities appear every day; however, annotating their Spoken NER data is costly. In this paper, we demonstrate that existing Spoken NER systems perform poorly when dealing with previously unseen named entities. To tackle this challenge, we propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs. Specifically, we first use a large language model (LLM) to generate sentences from the sampled named entities and then use a text-to-speech (TTS) system to generate the speech. Furthermore, we introduce a noise metric to filter out noisy data. To evaluate our approach, we release a novel Spoken NER benchmark along with a corresponding NED containing 8,853 entities. Experimental results show that our method achieves state-of-the-art (SOTA) performance in the in-domain, zero-shot domain adaptation, and fully zero-shot settings. Our data will be available at https://github.com/DeepLearnXMU/HeardU.

* Accepted by ICASSP 2025

Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Oct 07, 2024

Jiahuan Li, Yiqing Cao, Shujian Huang, Jiajun Chen

Abstract: Having been trained on massive pretraining data, large language models have shown excellent performance on many knowledge-intensive tasks. However, pretraining data tends to contain misleading and even conflicting information, and it is intriguing to understand how LLMs handle these noisy data during training. In this study, we systematically analyze LLMs' learning preferences for data with conflicting knowledge. We find that pretrained LLMs establish learning preferences similar to humans, i.e., preferences towards formal texts and texts with fewer spelling errors, resulting in faster learning and more favorable treatment of knowledge in data with such features when facing conflicts. This finding is generalizable across models and languages and is more evident in larger models. An in-depth analysis reveals that LLMs tend to trust data with features that signify consistency with the majority of data, and it is possible to instill new preferences and erase old ones by manipulating the degree of consistency with the majority data.

* Accepted by EMNLP 2024, main conference

PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment

Jul 23, 2024

Jiahuan Li, Shujian Huang, Xinyu Dai, Jiajun Chen

Abstract: Large language models demonstrate reasonable multilingual abilities, despite predominantly English-centric pretraining. However, the spontaneous multilingual alignment in these models is shown to be weak, leading to unsatisfactory cross-lingual transfer and knowledge sharing. Previous works attempt to address this issue by explicitly injecting multilingual alignment information during or after pretraining. As a result, alignment in the early stage of pretraining remains too weak to share information or knowledge across languages. In this paper, we propose PreAlign, a framework that establishes multilingual alignment prior to language model pretraining. PreAlign injects multilingual alignment by initializing the model to generate similar representations of aligned words and preserves this alignment using a code-switching strategy during pretraining. Extensive experiments in a synthetic English to English-Clone setting demonstrate that PreAlign significantly outperforms standard multilingual joint training in language modeling, zero-shot cross-lingual transfer, and cross-lingual knowledge application. Experiments in real-world scenarios further validate PreAlign's effectiveness across various model sizes.

Why Not Transform Chat Large Language Models to Non-English?

May 22, 2024

Xiang Geng, Ming Zhu, Jiahuan Li, Zhejian Lai, Wei Zou, Shuaijie She, Jiaxin Guo, Xiaofeng Zhao, Yinglu Li, Yuang Li(+7 more)

Abstract: The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g., GPT-4. Compared to base LLMs, chat LLMs are further optimized for advanced abilities, e.g., multi-turn conversation and human preference alignment, and are thus more powerful in both helpfulness and safety. However, transforming a chat LLM involves two critical issues: (1) How can we effectively transfer advanced abilities without their supervised data? (2) How can we prevent the original knowledge from catastrophic forgetting during transformation? We target these issues by introducing a simple framework called TransLLM. For the first issue, TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought, which uses translation as the bridge between English and non-English step by step. We further enhance the performance of sub-tasks with publicly available data. For the second issue, we propose a method comprising two synergistic components: low-rank adaptation for training to maintain the original LLM parameters, and recovery KD, which utilizes data generated by the chat LLM itself to recover the original knowledge from the frozen parameters. In the experiments, we transform LLaMA-2-chat-7B to the Thai language. Our method, using only single-turn data, outperforms strong baselines and ChatGPT on the multi-turn benchmark MT-bench. Furthermore, our method, without safety data, rejects more harmful queries on the safety benchmark AdvBench than both ChatGPT and GPT-4.
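The low-rank adaptation component mentioned above keeps the original weights frozen and learns only a small additive update. A minimal sketch of the general idea follows, with toy values; the function names, the `scale` parameter, and the matrices are assumptions for illustration and are not taken from TransLLM.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W is the frozen pretrained weight; only the low-rank factors A and B
    would be updated during training.
    """
    base = matvec(W, x)                 # output of the frozen layer
    delta = matvec(B, matvec(A, x))     # low-rank correction
    return [b + scale * d for b, d in zip(base, delta)]

# Toy values (assumed): a 2x2 frozen weight and rank-1 adapter factors.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]            # down-projection, shape 1x2
B_zero = [[0.0], [0.0]]      # zero-initialised up-projection, shape 2x1

if __name__ == "__main__":
    # With B at zero the adapted layer reproduces the frozen layer.
    print(lora_forward([1.0, 1.0], W, A, B_zero))
```

Because `W` is never modified, the original model's behaviour can be recovered at any time by dropping the adapter, which is what makes it possible to refer back to the frozen parameters during training.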

MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation

Apr 01, 2024

Jiahuan Li, Shanbo Cheng, Shujian Huang, Jiajun Chen

Abstract: Large language models (LLMs) have demonstrated strong ability in machine translation (MT), yet they suffer from high computational cost and latency. Therefore, transferring translation knowledge from giant LLMs to medium-sized machine translation models is a promising research direction. However, traditional knowledge distillation methods do not take the capabilities of the student and teacher models into consideration; they repeatedly teach student models knowledge they have already learned and fail to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive, and proactive manner. Considering the current translation ability of student MT models, we identify and correct only their translation errors, instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct LLM teachers to synthesize diverse contexts and anticipate more potential errors for the student. Experimental results on translating both specific language phenomena and general MT benchmarks demonstrate that finetuning the student MT model on about 10% of the examples can achieve results comparable to the traditional knowledge distillation method, and that synthesized potential errors and diverse contexts further improve translation performance on unseen contexts and words.

* Accepted to NAACL-2024 main conference

Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions

May 24, 2023

Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Cheng, Jiajun Chen

Abstract: Large-scale pretrained language models (LLMs), such as ChatGPT and GPT-4, have shown strong abilities in multilingual translation without being explicitly trained on parallel corpora. It is interesting to ask how LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation following given instructions. Firstly, we show that multilingual LLMs have stronger translation abilities than previously demonstrated. For a given language pair, the performance depends on both the language families and the amount of data used in the pretraining phase. Secondly, we find that LLMs' ability to carry out translation instructions relies on the understanding of translation instructions and the alignment among different languages. With proper enhancement, LLMs could perform the translation task well even for those language pairs unseen during the instruction tuning phase.

Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation

Dec 17, 2022

Jiahuan Li, Shanbo Cheng, Zewei Sun, Mingxuan Wang, Shujian Huang

Abstract: Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism. The effectiveness of kNN-MT directly depends on the quality of retrieved neighbors. However, the original kNN-MT builds datastores based on representations from NMT models, which results in poor retrieval accuracy when NMT models are not good enough, leading to sub-optimal translation performance. In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT. Better representations from pre-trained models allow us to build datastores of better quality. We also design a novel contrastive alignment objective to mitigate the representation gap between the NMT model and pre-trained models, enabling the NMT model to retrieve from better datastores. We conduct extensive experiments on both bilingual and multilingual translation benchmarks, including WMT17 English ↔ Chinese, WMT14 English ↔ German, IWSLT14 German ↔ English, and IWSLT14 multilingual datasets. Empirical results demonstrate the effectiveness of PRED.
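For readers unfamiliar with token-level retrieval, the sketch below shows the commonly used kNN-MT formulation: neighbors are retrieved from a datastore of (representation, target token) pairs, turned into a distribution with a softmax over negative squared distances, and interpolated with the NMT distribution. It illustrates the baseline mechanism only, not PRED itself; the function names, `temperature`, `lam`, and the toy datastore are assumptions.

```python
import math

def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Build p_kNN over target tokens from the k nearest (key, token) entries."""
    # Squared L2 distance between the query and every datastore key.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Softmax over negative distances; closer neighbors get more weight.
    weights = [(math.exp(-dist / temperature), token) for dist, token in scored]
    total = sum(w for w, _ in weights)
    probs = {}
    for w, token in weights:
        probs[token] = probs.get(token, 0.0) + w / total
    return probs

def interpolate(p_nmt, p_knn, lam=0.5):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_NMT."""
    tokens = set(p_nmt) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_nmt.get(t, 0.0)
            for t in tokens}

# Toy datastore (assumed values): decoder-state keys paired with target tokens.
DATASTORE = [((0.0, 0.0), "Haus"), ((0.0, 0.1), "Haus"), ((5.0, 5.0), "Maus")]

if __name__ == "__main__":
    p_knn = knn_distribution((0.0, 0.0), DATASTORE, k=2)
    print(interpolate({"Haus": 0.6, "Maus": 0.4}, p_knn, lam=0.5))
```

Since the retrieved distribution depends entirely on which keys lie close to the query, the quality of the representations used to build the datastore directly limits what retrieval can contribute.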

DirectQE: Direct Pretraining for Machine Translation Quality Estimation

May 15, 2021

Qu Cui, Shujian Huang, Jiahuan Li, Xiang Geng, Zaixiang Zheng, Guoping Huang, Jiajun Chen

Abstract: Machine Translation Quality Estimation (QE) is the task of predicting the quality of machine translations without relying on any reference. Recently, the predictor-estimator framework trains the predictor as a feature extractor, which leverages extra parallel corpora without QE labels, achieving promising QE performance. However, we argue that there are gaps between the predictor and the estimator in both data quality and training objectives, which preclude QE models from benefiting from a large number of parallel corpora more directly. We propose a novel framework called DirectQE that provides direct pretraining for QE tasks. In DirectQE, a generator is trained to produce pseudo data that is closer to the real QE data, and a detector is pretrained on these data with novel objectives that are akin to the QE task. Experiments on widely used benchmarks show that DirectQE outperforms existing methods, without using any pretraining models such as BERT. We also give extensive analyses showing how fixing the two gaps contributes to our improvements.
