arXiv

$M^3$ Scaling Law: Optimizing Multi-Epoch, Multi-Lingual, and Multi-Stage Training for Low-Resource Language Models

June 2, 2026 · Kosuke Akimoto, Taiki Miyagawa, Masafumi Oyamada

Abstract:

This study investigates a core design challenge in pretraining Large Language Models (LLMs) for low-resource languages. Current methods use multi-epoch, multi-lingual, and multi-stage training to get the most out of scarce target-language data, but no existing scaling law can compare these approaches under the same compute budget ($C$) and target-corpus size ($D_T$). Consequently, the best training configuration remains unclear. To bridge this gap, we introduce the $M^3$ Scaling Law, a unified predictive framework. The law is a function of four variables: model scale, the number of target-corpus epochs ($k$), the average target-language ratio ($r$), and the final-stage target-language ratio ($r_f$). By mapping monolingual single-stage, multi-lingual single-stage, and multi-lingual multi-stage training recipes onto a single target-language loss surface, the $M^3$ law extrapolates more accurately into unseen hyperparameter regions than existing scaling laws, as validated across three language pairs. Using $M^3$ as a surrogate objective, we establish two actionable guidelines for pretraining low-resource LLMs. First, as the target corpus size ($D_T$) shrinks, the optimal strategy switches directly from monolingual single-stage to multi-lingual two-stage training at a threshold determined by the compute budget; notably, multi-lingual single-stage training was suboptimal across our experimental grid. Second, the optimal number of epochs falls on a single curve as a function of the scarcity metric $D_T/D^{*}(C)$, where $D^{*}(C) \propto C^{\alpha/(\alpha+\beta)}$ is the compute-optimal corpus size for monolingual training.
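To make the scarcity metric concrete, the following is a minimal sketch of how the compute-optimal monolingual corpus size and the resulting ratio could be computed. It rests on assumptions that the abstract does not state: a Chinchilla-style loss $A N^{-\alpha} + B D^{-\beta}$ (the constant irreducible term is omitted because it does not affect the minimizer), the common approximation $C \approx 6ND$, and placeholder coefficients that are not fitted values from this paper. The function names are hypothetical. Under these assumptions, minimizing the loss at fixed $C$ yields the exponent $\alpha/(\alpha+\beta)$ quoted above.

```python
# Illustrative only: the loss form, C = 6*N*D, and the coefficients below
# are assumptions, not the fitted M^3 law from the paper.

def compute_optimal_tokens(compute, A, B, alpha, beta, flops_per_param_token=6.0):
    """Token count D minimizing A*N**-alpha + B*D**-beta subject to C = 6*N*D."""
    prefactor = (beta * B / (alpha * A)) ** (1.0 / (alpha + beta))
    scaled = compute / flops_per_param_token
    return prefactor * scaled ** (alpha / (alpha + beta))


def scarcity(target_tokens, compute, A, B, alpha, beta):
    """Ratio of available target tokens D_T to the compute-optimal corpus size."""
    return target_tokens / compute_optimal_tokens(compute, A, B, alpha, beta)


def mono_loss(tokens, compute, A, B, alpha, beta, flops_per_param_token=6.0):
    """Reducible monolingual loss when all compute is spent on `tokens` tokens."""
    params = compute / (flops_per_param_token * tokens)
    return A * params ** (-alpha) + B * tokens ** (-beta)


# Placeholder coefficients chosen for illustration.
COEFFS = dict(A=400.0, B=400.0, alpha=0.34, beta=0.28)

if __name__ == "__main__":
    budget = 1e20          # FLOPs (hypothetical budget)
    target_corpus = 1e9    # target-language tokens (hypothetical corpus)
    print("compute-optimal tokens:", compute_optimal_tokens(budget, **COEFFS))
    print("scarcity ratio:", scarcity(target_corpus, budget, **COEFFS))
```

A scarcity ratio well below 1 means the target corpus is much smaller than what the compute budget could use in a single monolingual pass, which is the regime where the abstract reports that repeated epochs and multi-lingual two-stage training become relevant.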

Source: arXiv