arXiv

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

June 2, 2026 · Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King


Abstract: Progress in the reasoning capabilities of Large Language Models (LLMs) is currently limited by a shortage of high-quality process data. Self-alignment through endogenous rewards is a potential remedy, but extracting effective supervision faces three primary obstacles: (1) Label Noise stemming from Mimetic Bias, in which rewards favor statistical likelihood over logical validity, fostering a "correctness illusion" that hides accumulating errors; (2) Coarse-Grained Supervision, in which sparse global outcomes (such as those used in Group Relative Policy Optimization, GRPO) lack the granularity to guide individual steps and effectively treat entire reasoning chains as indivisible units; and (3) Distributional Collapse, in which signals fail to generalize without amplifying pre-training biases. To overcome these issues, we propose LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework that casts self-alignment as the mining of latent structures. We construct a Variational Logic Potential by aggregating consensus from the model’s Latent Logic Expertise (LLE) to clean the reasoning manifold, and we implement a Multi-Agent Value Decomposition protocol grounded in the Individual-Global-Max (IGM) principle to measure the utility of individual steps. Experimental results show that LC-ERD supports a robust self-evolution trajectory, reveals the trade-offs between logical consistency and accuracy, and highlights high-value reasoning patterns that conventional rewards overlook. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.
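To make the Coarse-Grained Supervision obstacle concrete, the following is a minimal sketch of how a GRPO-style outcome reward is normalized within a group of sampled chains and then applied uniformly to every step of a chain. It is an illustration of the problem the abstract describes, not the LC-ERD method or the authors' code. The function names, the example rewards, and the step counts are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation; eps avoids division by zero
    when every chain in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(advantage, num_steps):
    """Give every step of a chain the same chain-level advantage."""
    return [advantage] * num_steps


# Hypothetical group of four sampled chains: 1.0 means the final answer was correct.
outcome_rewards = [1.0, 0.0, 0.0, 1.0]
# Hypothetical number of reasoning steps in each chain.
steps_per_chain = [3, 5, 2, 4]

advantages = group_relative_advantages(outcome_rewards)
per_step = [
    broadcast_to_steps(a, n) for a, n in zip(advantages, steps_per_chain)
]

for chain_id, step_advantages in enumerate(per_step):
    print(chain_id, step_advantages)
```

Every step inside a chain receives an identical value, so a flawed intermediate step in a chain that happens to reach the right answer is rewarded exactly as much as a sound one. A step-level decomposition, such as the one the abstract proposes, would replace the constant list with per-step utilities.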
