ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) applied to Chains-of-Thought (CoTs) has driven significant advances in Large Reasoning Models (LRMs), but these models suffer from "over-thinking." The issue arises because long CoTs inherently involve trial and error, and standard RLVR methods reinforce the entire trajectory, including redundant explorations, whenever they select an outcome-correct path for memorization. Prior efforts address this by favoring shorter trajectories, but because they rely on outcome-based signals, they cannot eliminate the memorization of unnecessary steps within longer chains. To overcome this limitation, we introduce ThoughtFold, a framework that uses fine-grained preference learning to curb redundant exploration and improve reasoning efficiency. ThoughtFold uses an introspective mechanism to pinpoint redundancies within every correct trajectory, generating a diverse set of candidate sub-trajectories. Building on this spectrum of candidates, we propose a masked preference optimization objective that actively penalizes redundant actions and incentivizes the model to connect key reasoning steps directly. This process effectively "folds" the reasoning chain into a more streamlined path. Extensive experiments demonstrate that ThoughtFold markedly improves efficiency: it cuts the token consumption of DeepSeek-R1-Distill-Qwen-7B by roughly 56% without compromising its state-of-the-art accuracy.
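The abstract does not give the form of the masked preference optimization objective, so the following is only a minimal sketch of what such an objective might look like, not the authors' formulation. It assumes a DPO-style pairwise loss (a technique swapped in here for illustration) in which a 0/1 token mask restricts the comparison to the tokens that differ between a folded (preferred) sub-trajectory and its redundant (dispreferred) counterpart. The function names, the `beta` parameter, and the masking convention are all hypothetical.

```python
import math


def masked_logprob(token_logps, mask):
    """Sum per-token log-probabilities over positions where mask is 1."""
    if len(token_logps) != len(mask):
        raise ValueError("token_logps and mask must have the same length")
    return sum(lp for lp, m in zip(token_logps, mask) if m)


def masked_preference_loss(pol_chosen, ref_chosen, mask_chosen,
                           pol_rejected, ref_rejected, mask_rejected,
                           beta=0.1):
    """DPO-style loss restricted to masked tokens (illustrative assumption).

    pol_* / ref_*: per-token log-probs under the policy / a frozen reference.
    mask_*: 1 for tokens that should contribute (e.g. the redundant span in
            the rejected chain, the bridging tokens in the chosen chain),
            0 for tokens shared by both chains.
    """
    chosen_margin = (masked_logprob(pol_chosen, mask_chosen)
                     - masked_logprob(ref_chosen, mask_chosen))
    rejected_margin = (masked_logprob(pol_rejected, mask_rejected)
                       - masked_logprob(ref_rejected, mask_rejected))
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy example: the first token is a shared prefix and is masked out.
loss = masked_preference_loss(
    pol_chosen=[-1.0, -1.0], ref_chosen=[-1.0, -2.0], mask_chosen=[0, 1],
    pol_rejected=[-1.0, -4.0], ref_rejected=[-1.0, -3.0], mask_rejected=[0, 1],
)
print(loss)
```

Under these assumptions, the loss equals log 2 when the policy matches the reference and falls below it as the policy shifts probability toward the folded path and away from the redundant span. Tokens with mask 0 have no effect on the result.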
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



