arXiv

Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

June 4, 2026 · Jianhao Huang, Baharan Mirzasoleiman

Abstract: Masked Diffusion Language Models (MDLMs) have emerged as a powerful generative framework, yet their generalization properties have received far less scrutiny than those of auto-regressive models. This work studies MDLM generalization through the $k$-parity problem, in which the target is the XOR (parity) of $k$ specific bits. On this task, neural networks typically exhibit "grokking": a prolonged plateau at chance-level performance followed by an abrupt transition to generalization. Our theoretical analysis decomposes the masked diffusion (MD) objective into two regimes: a Signal regime, which drives feature learning, and a Noise regime, which acts as an implicit regularizer. Training nanoGPT on the $k$-parity problem with the MD objective, we show that it reshapes the learning landscape and enables rapid, simultaneous generalization without a grokking phase. Building on these theoretical findings, we tune the mask probability distribution in the MD objective. This yields substantial perplexity improvements on 50M-parameter models and superior results in both pre-training from scratch and supervised fine-tuning. On 8B-parameter models, our method achieves gains of up to $8.8\%$ and $5.8\%$ in these two settings, respectively, demonstrating the scalability and effectiveness of our framework for large-scale masked diffusion language models.
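To make the setup concrete, the following is a minimal sketch of a $k$-parity example and of one masking step. It is not the authors' implementation. It assumes the common masked-diffusion formulation in which a mask probability is drawn per sequence and each token is then masked independently with that probability. The names `make_parity_example`, `mask_sequence`, `sample_mask_prob`, and the `MASK` id are hypothetical, and the uniform draw is only a placeholder for whatever mask probability distribution the paper actually tunes.

```python
import random

MASK = -1  # hypothetical mask-token id, chosen so it cannot collide with bits 0/1


def make_parity_example(n_bits, support, rng):
    """Draw a uniform random bit string and label it with the XOR of the bits in `support`."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    label = 0
    for i in support:
        label ^= bits[i]
    return bits, label


def sample_mask_prob(rng, low=0.0, high=1.0):
    """Draw a per-sequence mask probability.

    Uniform on [low, high] is an assumption for illustration; the abstract
    does not state the form of the tuned distribution.
    """
    return rng.uniform(low, high)


def mask_sequence(tokens, mask_prob, rng):
    """Independently replace each token with MASK with probability `mask_prob`."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


rng = random.Random(0)
support = [0, 3, 5]  # k = 3 relevant positions (illustrative choice)
bits, label = make_parity_example(n_bits=8, support=support, rng=rng)

sequence = bits + [label]        # input bits followed by the parity target
t = sample_mask_prob(rng)        # mask probability for this sequence
noisy = mask_sequence(sequence, t, rng)
# A model trained with a masked-diffusion objective would be asked to
# reconstruct the original tokens at the positions where noisy[i] == MASK.
```

In this sketch, changing `low` and `high` (or replacing the uniform draw altogether) is one hypothetical way to shift how often training sees lightly versus heavily masked sequences; how that maps onto the Signal and Noise regimes is specified in the paper, not here.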
