arXiv

Enhancing LLM Metacognition via Cognitive Pairwise Training

Title: Boosting LLM Metacognition Through Cognitive Pairwise Training

Original: arXiv:2606.00869v1 Announce Type: new

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.

Rewrite:

Reinforcement learning with verifiable rewards (RLVR) has become central to reasoning in large language models (LLMs). However, its outcome-level rewards can make models more willing to give confident answers even when the underlying evidence or reasoning is unreliable. Existing supervised fine-tuning (SFT) and reinforcement learning (RL) methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior instead of improving the reliability of the reasoning itself.

To address this limitation, we introduce Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a boundary for discriminating reasoning quality, rather than memorize surface refusal patterns.
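To make the idea of a pairwise signal over reasoning traces concrete, the sketch below uses a Bradley-Terry style logistic loss. This is an illustrative assumption: the abstract does not specify the objective CPT actually uses, and the function names and the scalar trace scores here are hypothetical.

```python
import math
from typing import Iterable, Tuple


def pairwise_trace_loss(score_trusted: float, score_flawed: float) -> float:
    """Loss for one (trustworthy, flawed) trace pair.

    Assumed form (not taken from the paper): -log(sigmoid(margin)),
    where margin = score_trusted - score_flawed. The loss shrinks as the
    trustworthy trace is scored further above the flawed one.
    """
    margin = score_trusted - score_flawed
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean pairwise loss over (score_trusted, score_flawed) tuples."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(pairwise_trace_loss(t, f) for t, f in pairs) / len(pairs)


# Hypothetical scores: the first pair is ranked correctly, the second is not.
example = batch_loss([(2.0, -1.0), (0.5, 1.5)])
```

Under this assumed objective, a pair that is already ranked correctly by a wide margin contributes almost nothing, while a misranked pair dominates the loss, which is one plausible way a comparison over traces could become a training signal.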

Across five model scales and three model families, CPT improves the trade-off between reasoning performance and metacognition. At the 14B parameter scale, CPT+RL outperforms the standard SFT+RL pipeline by 2.2 points in math average and 5.2 points in abstention F1. Further analyses show that CPT improves the quality of reasoning traces and remains robust and scalable across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.
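For readers unfamiliar with the abstention F1 metric reported above, here is a minimal sketch of one common way to compute it. It assumes "the model should abstain" is treated as the positive class; the abstract does not give the exact definition used in the paper, and the function and argument names are hypothetical.

```python
from typing import Sequence


def abstention_f1(should_abstain: Sequence[bool], did_abstain: Sequence[bool]) -> float:
    """F1 over abstention decisions, with "abstain" as the positive class.

    should_abstain[i]: whether abstaining was the right call on item i.
    did_abstain[i]:    whether the model actually abstained on item i.
    This definition is an assumption, not the paper's stated metric.
    """
    if len(should_abstain) != len(did_abstain):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(should_abstain, did_abstain))
    tp = sum(1 for s, d in pairs if s and d)        # correct abstentions
    fp = sum(1 for s, d in pairs if not s and d)    # unnecessary abstentions
    fn = sum(1 for s, d in pairs if s and not d)    # missed abstentions
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A metric of this shape penalizes both over-refusal (low precision) and overconfident answering (low recall), which is why it is typically reported alongside task accuracy rather than instead of it.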


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
