arXiv

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Original: arXiv:2606.00284v1 Announce Type: new Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.

Rewritten:

Continual pretraining (CPT) offers a practical way to extend large language models to new languages; however, naive fine-tuning on targeted data leads to catastrophic forgetting, degrading existing capabilities. Although structuring training around language families reduces cross-language interference, this approach alone does not protect the general knowledge needed for downstream tasks. This study attributes the forgetting to parameter drift during multilingual CPT and introduces a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. These strategies are systematically compared against two unregularized CPT baselines on benchmarks that cover 32 training languages drawn from five language families, along with held-out languages. The comparison spans four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. The findings indicate that parameter alignment substantially reduces forgetting at minimal cost to language acquisition. Specifically, layer freezing and regularization best preserve comprehension, while post-hoc reversion yields the strongest translation gains. Collectively, these results map the trade-off between language acquisition and forgetting in family-expert CPT and offer practical deployment guidelines that pair each strategy with the tasks it best serves.
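
The abstract names the alignment strategies but does not give their formulas, so the following is only a toy sketch of what each family of technique could look like, not the authors' implementation. It assumes parameters are stored as a plain mapping from layer name to a list of floats, that soft regularization is a per-layer squared-distance penalty toward the base weights, and that merging is linear interpolation. All function names, layer names, and coefficients are hypothetical.

```python
from typing import Dict, List, Set

# Toy stand-in for a model: layer name -> flat list of weights (assumption).
Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, lr: float, frozen: Set[str]) -> Params:
    """Hard layer freezing: layers in `frozen` receive no update during training."""
    return {
        name: list(w) if name in frozen
        else [wi - lr * gi for wi, gi in zip(w, grads[name])]
        for name, w in params.items()
    }


def drift_penalty(base: Params, current: Params, strength: Dict[str, float]) -> float:
    """Soft regularization: layer-weighted squared distance from the base weights.

    In training this term would be added to the language-modeling loss
    (assumed form; the abstract does not specify the regularizer).
    """
    return sum(
        strength.get(name, 0.0)
        * sum((c - b) ** 2 for c, b in zip(current[name], base[name]))
        for name in current
    )


def revert_layers(base: Params, tuned: Params, layers: Set[str]) -> Params:
    """Post-hoc weight reversion: after training, copy base weights back into `layers`."""
    return {
        name: list(base[name]) if name in layers else list(w)
        for name, w in tuned.items()
    }


def merge_models(base: Params, tuned: Params, alpha: float) -> Params:
    """Model merging as linear interpolation: (1 - alpha) * base + alpha * tuned."""
    return {
        name: [(1.0 - alpha) * b + alpha * t for b, t in zip(base[name], tuned[name])]
        for name in tuned
    }


# Hypothetical two-layer example.
base = {"embed": [1.0, 2.0], "top": [0.0, 0.0]}
tuned = {"embed": [2.0, 4.0], "top": [1.0, -1.0]}

penalty = drift_penalty(base, tuned, {"embed": 0.5})  # only "embed" is penalized
reverted = revert_layers(base, tuned, {"embed"})      # "embed" returns to base values
merged = merge_models(base, tuned, alpha=0.5)         # midpoint of base and tuned
```

The "layer-aware" aspect shows up here as the per-layer `frozen`, `strength`, and `layers` arguments: each strategy can constrain some layers tightly while leaving others free to adapt to the new languages.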


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
