arXiv

Effective vocabulary expansion of multilingual language models for extremely low-resource languages

Title: Enhancing Vocabulary for Multilingual Language Models in Extremely Low-Resource Settings

Abstract:

Multilingual pre-trained language models (mPLMs) offer substantial benefits for many low-resource languages. Existing research has largely concentrated on extending model support through continued pre-training, and strategies for adapting mPLMs to previously unsupported languages remain scarce. To address this gap, we propose a method that expands the model’s vocabulary using a target language corpus. We then identify and remove a subset of the original vocabulary that is heavily biased toward the source language (such as English), and use bilingual dictionaries to initialize the representations of the newly added vocabulary items. Finally, we continue pre-training the mPLMs on the target language corpus, starting from these initialized representations.

Our experiments show that this approach outperforms the baseline, which continues pre-training with a randomly initialized expanded vocabulary, with gains of 0.54% on Part-of-Speech (POS) tagging and 2.60% on Named Entity Recognition (NER). The method is also robust to the choice of training corpus, and continued pre-training does not degrade performance on the source language.
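
The dictionary-based initialization step in the abstract can be illustrated with a small sketch. The abstract does not say how dictionary entries are turned into vectors, so this sketch assumes one plausible scheme: each new target-language token takes the mean of the embeddings of its source-language translations, and falls back to a small random vector when no translation is in the model's vocabulary. The names `init_expanded_vocab`, `bilingual_dict`, `scale`, and the toy tokens are hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def init_expanded_vocab(
    source_embeddings: Dict[str, Vector],
    new_tokens: Sequence[str],
    bilingual_dict: Dict[str, List[str]],
    dim: int,
    seed: int = 0,
    scale: float = 0.02,
) -> Dict[str, Vector]:
    """Initialize embeddings for newly added target-language tokens.

    ASSUMPTION: averaging translation embeddings is an illustrative choice,
    not a rule confirmed by the abstract.
    """
    rng = random.Random(seed)
    initialized: Dict[str, Vector] = {}
    for token in new_tokens:
        # Collect embeddings of translations that exist in the source vocabulary.
        translations = [
            source_embeddings[t]
            for t in bilingual_dict.get(token, [])
            if t in source_embeddings
        ]
        if translations:
            initialized[token] = mean_vector(translations)
        else:
            # Fallback mirrors the random-initialization baseline.
            initialized[token] = [rng.gauss(0.0, scale) for _ in range(dim)]
    return initialized


# Toy usage with made-up tokens and 2-dimensional embeddings.
toy_embeddings = {"water": [1.0, 0.0], "river": [0.0, 1.0]}
toy_dict = {"tgt_a": ["water", "river"], "tgt_b": ["unknown"]}
new_vectors = init_expanded_vocab(
    toy_embeddings, ["tgt_a", "tgt_b", "tgt_c"], toy_dict, dim=2
)
```

In this toy run, `tgt_a` gets the average of the `water` and `river` vectors. `tgt_b` and `tgt_c` have no usable translations, so they get random vectors. In a real setting the resulting vectors would be copied into the model's resized embedding matrix before continued pre-training.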


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
