Verifying Meta-Awareness via Predictive Rewards in Reasoning Models
Abstract:
Recent work on reasoning models increasingly focuses on their meta-awareness: the ability to decide how long to reason, to recognize the limits of their own knowledge, and to organize their reasoning at a conceptual level. Existing large reasoning models rely exclusively on verifying final answers. Our findings show that adding meta-awareness objectives yields substantial performance gains over models trained without this meta-knowledge.
The proposed MAPR (Meta-Awareness via Predictive Reward) framework introduces a self-generated task wherein the model predicts rollout statistics (namely length, pass rate, and concepts utilized) and verifies these predictions against actual outcomes. By exploiting this self-predictive ability, the model can modulate its reasoning behavior through three primary mechanisms: i) discarding prompts that are trivial or unsolvable, ii) curbing lengthy generations that are prone to errors, and iii) producing hints pertinent to the specific problem.
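To make the self-prediction idea concrete, the following is a minimal sketch of how predicted rollout statistics could be scored against observed ones, and how a predicted pass rate could gate prompts. It is not the authors' implementation. The class names, the scoring rules (a tolerance band for length, absolute error for pass rate, Jaccard overlap for concepts), and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    """One sampled solution attempt (hypothetical container)."""
    length: int           # number of generated tokens
    correct: bool         # whether the final answer was verified as correct
    concepts: frozenset   # concepts detected in the reasoning trace


@dataclass
class MetaPrediction:
    """What the model predicts about its own rollouts (hypothetical container)."""
    length: float         # predicted rollout length
    pass_rate: float      # predicted fraction of correct rollouts, in [0, 1]
    concepts: frozenset   # concepts the model expects to use


def predictive_reward(pred, rollouts, length_tol=0.25):
    """Score a meta-prediction against observed rollout statistics.

    All three scoring rules are illustrative assumptions, not the paper's.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")

    # Observed statistics over the sampled rollouts.
    actual_pass = mean(1.0 if r.correct else 0.0 for r in rollouts)
    actual_len = mean(r.length for r in rollouts)
    used = frozenset().union(*(r.concepts for r in rollouts))

    # Length: full reward inside a relative tolerance band, else zero.
    length_reward = 1.0 if abs(pred.length - actual_len) <= length_tol * actual_len else 0.0
    # Pass rate: one minus the absolute prediction error.
    pass_reward = 1.0 - abs(pred.pass_rate - actual_pass)
    # Concepts: Jaccard overlap between predicted and observed concept sets.
    union = pred.concepts | used
    concept_reward = len(pred.concepts & used) / len(union) if union else 1.0

    return {"length": length_reward, "pass_rate": pass_reward, "concepts": concept_reward}


def keep_prompt(pred, low=0.1, high=0.9):
    """Mechanism (i): drop prompts predicted to be trivial or unsolvable.

    The thresholds are assumed values for illustration only.
    """
    return low < pred.pass_rate < high


rollouts = [
    Rollout(length=100, correct=True, concepts=frozenset({"a", "b"})),
    Rollout(length=300, correct=False, concepts=frozenset({"b"})),
]
pred = MetaPrediction(length=210, pass_rate=0.5, concepts=frozenset({"a", "c"}))
rewards = predictive_reward(pred, rollouts)
```

In this sketch the three reward components are returned separately so that a training loop could weight them as it sees fit. How MAPR actually combines them is not specified in the abstract.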
MAPR improves both accuracy and training efficiency across multiple reasoning benchmarks. Specifically, the method speeds up GRPO training by a factor of more than 1.28 to reach the same performance. It also achieves an 83.18% accuracy gain on the AIME25 benchmark and an average gain of 13.04% across six mathematics benchmarks. The code is open source and available at https://github.com/akatigre/MAPR-RL.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




