DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Title: DLLM-JEPA: Integrating Joint Embedding Predictive Architectures with Masked Diffusion Language Models
Abstract
Joint Embedding Predictive Architectures (JEPAs) have transformed self-supervised representation learning in computer vision. The recently introduced LLM-JEPA adapted this framework to autoregressive language models, but it retained significant inefficiencies inherent to causal attention. Specifically, LLM-JEPA requires explicit multi-view datasets, such as parallel text-code pairs, and relies on two gradient-carrying forward passes for every training step. To address these limitations, we propose DLLM-JEPA, which combines JEPAs with masked-diffusion language models and removes both burdens at once.
The bidirectional attention used by diffusion models makes it possible to generate two semantically distinct views of a single input simply by varying the masking rate, eliminating the need for explicit data pairs. This architecture also requires only a single gradient-carrying forward pass, which reduces training FLOPs by 33% compared to LLM-JEPA.
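The two claims in the paragraph above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the mask-token id, the two masking rates, and the helper names are hypothetical, and the cost model assumes that a backward pass costs about two forward passes and that the second view is encoded by a forward pass without gradients. Under those assumptions, replacing one of two gradient-carrying passes with a gradient-free pass saves one third of the per-step compute, which is consistent with the reported 33% figure.

```python
import random

MASK_ID = -1  # hypothetical mask-token id; real tokenizers define their own


def mask_view(token_ids, rate, rng):
    """Replace each token with MASK_ID independently with probability `rate`."""
    return [MASK_ID if rng.random() < rate else tok for tok in token_ids]


def make_views(token_ids, low_rate=0.15, high_rate=0.6, seed=0):
    """Build two views of one sequence by masking it at two different rates.

    The rates are illustrative defaults, not values taken from the paper.
    """
    rng = random.Random(seed)
    light = mask_view(token_ids, low_rate, rng)
    heavy = mask_view(token_ids, high_rate, rng)
    return light, heavy


def step_cost(grad_passes, nograd_passes, backward_ratio=2.0):
    """Cost of one training step, in units of a single forward pass.

    Assumes a backward pass costs `backward_ratio` forward passes
    (2.0 is a common rule of thumb, not a number from the paper).
    """
    return grad_passes * (1.0 + backward_ratio) + nograd_passes


tokens = list(range(100, 120))
light_view, heavy_view = make_views(tokens)

# Two gradient-carrying passes versus one gradient-carrying pass plus one
# gradient-free pass (the gradient-free pass is an assumption of this sketch).
two_grad = step_cost(grad_passes=2, nograd_passes=0)
one_grad = step_cost(grad_passes=1, nograd_passes=1)
saving = 1.0 - one_grad / two_grad
print(f"relative saving: {saving:.1%}")
```

Both views come from the same token sequence, so no paired dataset is needed; only the masking rate differs between them.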
Our evaluation demonstrates that DLLM-JEPA consistently outperforms diffusion-only fine-tuning across all tested (task, architecture) configurations. Notable improvements include an accuracy increase of up to +18.7 percentage points on GSM8K for LLaDA-8B and +11.4 percentage points for Dream-7B. DLLM-JEPA also yields consistent gains on the Spider, NL-RX-SYNTH, and Django benchmarks.
Beyond raw accuracy, DLLM-JEPA delivers a "dual-win" outcome. Using the LLaDA-8B model with the Wide-t configuration, the approach simultaneously boosted GSM8K accuracy from 65.2 to 67.1 (+1.8 pp), reduced held-out Wikitext loss below the levels of the pre-trained base model, and maintained MMLU accuracy at baseline levels across three fine-tuning seeds. In contrast, an L2-to-base parameter anchor strategy only matched baseline accuracy without delivering any task-specific improvements.
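For readers unfamiliar with the comparison baseline, an L2-to-base parameter anchor is commonly implemented as a penalty on the squared distance between the fine-tuned parameters and their pre-trained values. The sketch below shows that general idea in plain Python; the function name, the dictionary layout, and the `strength` value are hypothetical, and the paper's exact formulation may differ.

```python
def l2_to_base_penalty(params, base_params, strength=0.01):
    """Return strength * sum of squared differences from the base parameters.

    `params` and `base_params` map parameter names to flat lists of floats.
    `strength` is an illustrative coefficient, not a value from the paper.
    """
    total = 0.0
    for name, values in params.items():
        base_values = base_params[name]
        total += sum((v - b) ** 2 for v, b in zip(values, base_values))
    return strength * total


# Toy parameters, made up for illustration only.
base = {"w": [1.0, 2.0], "b": [0.0]}
tuned = {"w": [1.5, 2.0], "b": [-0.5]}

penalty = l2_to_base_penalty(tuned, base)
# In training, this term would be added to the task loss so that the
# optimizer is pulled back toward the pre-trained weights.
```

Such a penalty constrains how far the weights move, which is why it serves as a natural point of comparison for a method that moves the weights farther while retaining more knowledge.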
Layer-wise probing reveals the underlying mechanism: a dissociation between geometric and functional drift. The fine-tuned backbone diverges further from the pre-trained weights than the baseline does, yet it retains more knowledge on held-out Wikitext data, an effect that is most pronounced in the middle transformer layers. The same pattern is observed in Dream-7B, suggesting that the phenomenon generalizes and is not limited to a specific backbone architecture.
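One way to make the dissociation concrete is to measure geometric drift as the relative L2 distance of a layer's weights from their pre-trained values, and functional drift as the held-out loss. The sketch below assumes those two definitions, which are illustrative choices and not necessarily the metrics used in the paper; all numbers are toy values made up for the example.

```python
import math


def relative_drift(tuned, base):
    """Relative L2 distance ||tuned - base|| / ||base|| for one layer."""
    diff = math.sqrt(sum((t - b) ** 2 for t, b in zip(tuned, base)))
    norm = math.sqrt(sum(b * b for b in base))
    return diff / norm


def is_dissociated(drift_a, drift_b, loss_a, loss_b):
    """True when model A moved farther from the base than model B
    yet has the lower held-out loss."""
    return drift_a > drift_b and loss_a < loss_b


# Toy flattened weights for a single layer (not real model values).
base_layer = [3.0, 4.0]
model_a_layer = [3.0, 5.0]   # stands in for the JEPA-regularized model
model_b_layer = [3.0, 4.5]   # stands in for the diffusion-only baseline

drift_a = relative_drift(model_a_layer, base_layer)
drift_b = relative_drift(model_b_layer, base_layer)

# Toy held-out losses, chosen only to exercise the check.
dissociated = is_dissociated(drift_a, drift_b, loss_a=2.0, loss_b=2.3)
```

Repeating the comparison per layer would show where in the network the gap between weight movement and retained knowledge is largest.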
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




