Title: Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

Oct 12, 2024

HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, Huda Khayrallah

Abstract: Vocabulary adaptation, which integrates new vocabulary into pre-trained language models (LMs), enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristic or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. Across 11 languages (with various scripts, resource availability, and fragmentation), we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation is still beneficial after fine-tuning and that VocADT is the most effective method.
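To make the core idea in the abstract concrete, here is a minimal toy sketch of an adapter that builds each new-vocabulary embedding as a learned linear combination of frozen original embeddings. It is not the authors' implementation: the sizes, the names `old_emb`, `adapter`, and `new_embeddings`, and the squared-error objective toward random targets are all assumptions chosen only so the example runs on its own. The abstract does not specify the actual training objective or initialization used by VocADT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumption).
V_OLD, V_NEW, DIM = 6, 4, 3

# Frozen embedding table standing in for the pre-trained model's embeddings.
old_emb = rng.normal(size=(V_OLD, DIM))
old_emb_before = old_emb.copy()  # kept to check that it is never modified

# Trainable adapter: one row of mixing weights per new-vocabulary token.
adapter = rng.normal(scale=0.1, size=(V_NEW, V_OLD))


def new_embeddings(adapter, old_emb):
    """Each new-token embedding is a linear combination of old embeddings."""
    return adapter @ old_emb


# Stand-in objective (assumption, not the paper's loss): match random targets.
target = rng.normal(size=(V_NEW, DIM))


def loss(adapter):
    diff = new_embeddings(adapter, old_emb) - target
    return 0.5 * np.sum(diff ** 2)


initial_loss = loss(adapter)

lr = 0.01
for _ in range(200):
    diff = new_embeddings(adapter, old_emb) - target
    grad = diff @ old_emb.T   # gradient of the loss with respect to the adapter
    adapter -= lr * grad      # only the adapter is updated; old_emb stays fixed

final_loss = loss(adapter)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch the only parameters that change are the mixing weights in `adapter`, which mirrors the abstract's statement that the model's weights stay fixed while the adapter is trained.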
