Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding
Source: arXiv:2606.03080v1 Announcement Type: Cross
Abstract:
Standard causal language models rely on a factorization of sequence probabilities that considers only preceding context, thereby failing to leverage future information that is present within the training dataset. To address this limitation, we present Regret Pre-training, a self-supervised approach rooted in the Learning Using Privileged Information (LUPI) paradigm. This framework utilizes a dual-view architecture where a single model produces both a causal Student distribution and a future-conditioned Teacher distribution. The training process enhances standard language modeling by incorporating a regret loss function designed to minimize the KL divergence from the teacher to the student, effectively transferring future-aware signals into causal representations.
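As a rough illustration of the objective described above, the following sketch combines a standard next-token loss with a regret term for a single position. It is not the authors' implementation. The function names, the `regret_weight` parameter, and the choice of KL(teacher || student) as the direction of the divergence are assumptions made for this example; the abstract does not specify how the two terms are weighted. The teacher distribution is treated as a fixed target, consistent with it coming from an inference-mode forward pass.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def regret_objective(student_logits, teacher_logits, target_id, regret_weight=1.0):
    """Language-modeling loss plus a weighted regret term for one position.

    regret_weight is a hypothetical hyperparameter, not taken from the paper.
    Returns (total_loss, lm_loss, regret).
    """
    student = softmax(student_logits)   # causal view: past context only
    teacher = softmax(teacher_logits)   # future-conditioned view, held constant
    lm_loss = -math.log(student[target_id])    # standard cross-entropy
    regret = kl_divergence(teacher, student)   # assumed direction: KL(teacher || student)
    return lm_loss + regret_weight * regret, lm_loss, regret


# Toy usage with a 3-token vocabulary.
total, lm_loss, regret = regret_objective([1.0, 2.0, 0.5], [0.0, 5.0, 0.0], target_id=1)
```

When the student already matches the teacher, the regret term vanishes and the objective reduces to ordinary language modeling.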
We evaluate two teacher configurations using the OLMoE-1B-7B architecture: LocalRegret, which extends attention to include one future token, and GlobalRegret, which conditions on bidirectional context while masking the target position. After training on 4 billion tokens, both configurations consistently exceed the baseline across nine downstream tasks. On average, GlobalRegret and LocalRegret achieve accuracies of 33.9% and 32.2%, respectively, compared with the baseline’s 30.2%. Notably, GlobalRegret improves BoolQ accuracy by 18.1 percentage points (61.0% versus the baseline’s 42.9%). The proposed framework adds no extra parameters and requires only one additional inference-mode forward pass per training step.
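The difference between the student view and the two teacher views can be pictured as the set of context positions visible when predicting a given target. The sketch below is one possible reading of the abstract, not the paper's specification. In particular, the helper names are hypothetical, and the interpretation of LocalRegret as "past context plus the single token just after the target" is an assumption; the abstract only states that attention is extended by one future token.

```python
def student_context(target, seq_len):
    """Causal view: only positions strictly before the target are visible."""
    return [j for j in range(seq_len) if j < target]


def local_teacher_context(target, seq_len):
    """Assumed LocalRegret view: past context plus one token after the target."""
    return [j for j in range(seq_len) if j < target or j == target + 1]


def global_teacher_context(target, seq_len):
    """GlobalRegret view: bidirectional context with the target position masked."""
    return [j for j in range(seq_len) if j != target]


# Toy usage: a 5-token sequence, predicting the token at position 2.
print(student_context(2, 5))         # [0, 1]
print(local_teacher_context(2, 5))   # [0, 1, 3]
print(global_teacher_context(2, 5))  # [0, 1, 3, 4]
```

In every view the target position itself stays hidden, so the teacher is privileged by future context rather than by direct access to the answer.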
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



