Title: Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Sep 15, 2021

Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei

Abstract: Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models because of limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size, achieving comparable performance with faster pre-training. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
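The abstract does not spell out how k-NN-based target sampling works, so the following is only a rough sketch of one plausible reading, not the authors' implementation: the softmax is normalised over a small candidate set built from the target tokens in a batch plus each target's k nearest neighbours in embedding space. The function names, the dot-product similarity, and the toy sizes are all assumptions made for illustration.

```python
import math
import random

def knn_candidates(embeddings, target_ids, k):
    """Union of the batch targets and each target's k nearest neighbours.

    Similarity is the dot product between embedding rows (an assumption).
    """
    candidates = set(target_ids)
    for t in set(target_ids):
        scores = [(sum(a * b for a, b in zip(embeddings[t], e)), j)
                  for j, e in enumerate(embeddings) if j != t]
        scores.sort(reverse=True)
        candidates.update(j for _, j in scores[:k])
    return sorted(candidates)

def restricted_softmax_loss(hidden, embeddings, target_id, candidates):
    """Cross-entropy of target_id with the softmax restricted to candidates."""
    logits = [sum(h * w for h, w in zip(hidden, embeddings[j])) for j in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[candidates.index(target_id)]

# Toy setup: random embeddings and hidden states (hypothetical sizes).
random.seed(0)
vocab_size, dim = 1000, 16
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
target_ids = [3, 42, 42, 977]
hidden_states = [[random.gauss(0, 1) for _ in range(dim)] for _ in target_ids]

candidates = knn_candidates(embeddings, target_ids, k=8)
losses = [restricted_softmax_loss(h, embeddings, t, candidates)
          for h, t in zip(hidden_states, target_ids)]
print(len(candidates), "candidates instead of", vocab_size)
```

In this sketch the normaliser is computed over at most a few dozen candidates rather than the full vocabulary, which is where the speed-up would come from. The restricted loss is never larger than the full-vocabulary loss, so it is an approximation of the training objective rather than an exact replacement.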

* EMNLP 2021
