arXiv

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

Title: The $M^3$ Scaling Law: Streamlining Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

Abstract:

This study addresses a core design problem in pretraining Large Language Models (LLMs) for low-resource languages. Current methods use multi-epoch, multi-lingual, and multi-stage training to get the most out of scarce target-language data, but no prior scaling law can compare these strategies under the same compute budget ($C$) and target-corpus size ($D_T$). As a result, the best training configuration remains unclear. To close this gap, we introduce the $M^3$ Scaling Law, a unified predictive framework parameterized by four quantities: model scale, the number of target-corpus epochs ($k$), the average target-language ratio ($r$), and the final-stage target-language ratio ($r_f$). By mapping monolingual single-stage, multi-lingual single-stage, and multi-lingual multi-stage training recipes onto a single target-language loss surface, the $M^3$ law extrapolates more accurately into unseen hyperparameter regions than existing scaling laws, as validated on three language pairs. Using $M^3$ as a surrogate objective, we establish two actionable guidelines for pretraining low-resource LLMs. First, as the target-corpus size $D_T$ shrinks, the optimal strategy switches directly from monolingual single-stage to multi-lingual two-stage training at a threshold determined by the compute budget; notably, multi-lingual single-stage training was found to be suboptimal across our experimental grid. Second, the optimal number of epochs falls on a single curve in the scarcity metric $D_T/D^*(C)$, where $D^*(C) \propto C^{\alpha/(\alpha+\beta)}$ is the compute-optimal corpus size for monolingual training.
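
To make the scarcity metric concrete, the following is a minimal sketch of how the ratio of $D_T$ to the compute-optimal monolingual corpus size could be computed. The abstract only gives the proportionality $C^{\alpha/(\alpha+\beta)}$, so the prefactor `coeff`, the function names, and the example values of `alpha`, `beta`, and `compute` are all hypothetical placeholders rather than values fitted in this work.

```python
def compute_optimal_tokens(compute, alpha, beta, coeff=1.0):
    """Compute-optimal corpus size, coeff * C ** (alpha / (alpha + beta)).

    `coeff` is a hypothetical prefactor: the abstract states only the
    proportionality, not the constant.
    """
    if compute <= 0 or alpha <= 0 or beta <= 0 or coeff <= 0:
        raise ValueError("compute, alpha, beta and coeff must be positive")
    return coeff * compute ** (alpha / (alpha + beta))


def scarcity(target_tokens, compute, alpha, beta, coeff=1.0):
    """Scarcity metric: target-corpus size over compute-optimal corpus size."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return target_tokens / compute_optimal_tokens(compute, alpha, beta, coeff)


if __name__ == "__main__":
    # Illustrative exponents only (assumed, not taken from the paper).
    alpha, beta = 0.34, 0.28
    for compute in (1e18, 1e20, 1e22):
        s = scarcity(target_tokens=1e9, compute=compute, alpha=alpha, beta=beta)
        print(f"C={compute:.0e}  scarcity={s:.3e}")
```

For a fixed $D_T$, the ratio falls as $C$ grows, so the same target corpus counts as scarcer under a larger compute budget. This is why the abstract describes the optimal epoch count as a function of the ratio rather than of $D_T$ alone.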


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
