InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
Title: InftyThink+: Achieving Efficient and Effective Infinite-Horizon Reasoning Through Reinforcement Learning
Abstract:
While large reasoning models have demonstrated impressive capabilities by scaling inference-time chain-of-thought processes, this approach is hindered by quadratic computational costs, restrictions on context length, and reasoning degradation caused by lost-in-the-middle phenomena. Iterative reasoning offers a solution by periodically summarizing intermediate thoughts, but current methods typically depend on supervised learning or static heuristics, failing to effectively optimize the timing of summaries, the selection of preserved information, and the resumption of the reasoning process.
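The quadratic-cost argument can be made concrete with a rough count of attention token pairs. The sketch below is illustrative only: the function names, the cost model (each generated token attends to every earlier token and itself), and the simplification that ignores the prompt and the tokens spent writing summaries are assumptions, not details taken from the paper.

```python
def decode_cost(prefix_len: int, new_tokens: int) -> int:
    """Attention token pairs when decoding new_tokens after a prefix.

    Assumed cost model: the i-th generated token attends to the prefix
    plus the i tokens generated so far (including itself).
    """
    return new_tokens * prefix_len + new_tokens * (new_tokens + 1) // 2


def monolithic_cost(total_tokens: int) -> int:
    """One long chain of thought generated in a single context."""
    return decode_cost(0, total_tokens)


def iterative_cost(total_tokens: int, segment_len: int, summary_len: int) -> int:
    """The same number of reasoning tokens split into segments.

    Every segment after the first starts from a summary of summary_len
    tokens instead of the full history (hypothetical simplification).
    """
    cost = 0
    remaining = total_tokens
    prefix = 0  # the first segment has no summary to condition on
    while remaining > 0:
        step = min(segment_len, remaining)
        cost += decode_cost(prefix, step)
        remaining -= step
        prefix = summary_len
    return cost


if __name__ == "__main__":
    print(monolithic_cost(32_000))
    print(iterative_cost(32_000, segment_len=4_000, summary_len=500))
```

Under this model the monolithic cost grows with the square of the total length, while the iterative cost grows roughly linearly in the number of segments once the segment and summary lengths are fixed.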
To address these limitations, we introduce InftyThink+, an end-to-end reinforcement learning framework designed to optimize the entire iterative reasoning trajectory. This approach leverages model-controlled iteration boundaries and explicit summarization mechanisms. InftyThink+ employs a two-stage training protocol, beginning with a supervised cold-start phase followed by trajectory-level reinforcement learning. This structure enables the model to master strategic decisions regarding when to summarize and how to continue reasoning.
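As a rough illustration of the inference-time control flow described above, the sketch below runs a reason-summarize-resume loop. It is a minimal sketch under stated assumptions, not the authors' implementation: `iterative_reason`, the `step` callable, and `toy_step` are hypothetical names, and the toy step function stands in for a model that chooses its own iteration boundaries. The training stages (supervised cold start, trajectory-level reinforcement learning) are not modeled here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: given the question and the previous summary (or None),
# one iteration returns (new_summary, final_answer). Exactly one is expected
# to be non-None: a summary to carry forward, or a final answer.
StepFn = Callable[[str, Optional[str]], Tuple[Optional[str], Optional[str]]]


def iterative_reason(
    question: str, step: StepFn, max_iters: int = 8
) -> Tuple[Optional[str], int]:
    """Run iterations until the step function produces a final answer.

    Each iteration sees only the question and the latest summary, so the
    context does not grow with the total amount of reasoning.
    """
    summary: Optional[str] = None
    for iteration in range(1, max_iters + 1):
        new_summary, answer = step(question, summary)
        if answer is not None:
            return answer, iteration
        summary = new_summary  # discard earlier reasoning, keep the summary
    return None, max_iters  # iteration budget exhausted without an answer


def toy_step(question: str, summary: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Stand-in for a model: adds one number per iteration.

    The summary encodes progress as "index:running_total".
    """
    numbers = [int(tok) for tok in question.split()]
    idx, total = (0, 0) if summary is None else map(int, summary.split(":"))
    total += numbers[idx]
    idx += 1
    if idx == len(numbers):
        return None, str(total)
    return f"{idx}:{total}", None


if __name__ == "__main__":
    print(iterative_reason("3 4 5", toy_step))
```

In a real system the step function would wrap a model call that emits reasoning, then either a summary or an answer; the point of the sketch is only that the state carried between iterations is the summary alone.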
Experimental results on the DeepSeek-R1-Distill-Qwen-1.5B model indicate that InftyThink+ boosts accuracy by 21% on AIME24. It significantly outperforms conventional long chain-of-thought reinforcement learning methods and exhibits superior generalization to out-of-distribution benchmarks. Furthermore, InftyThink+ substantially lowers inference latency and speeds up reinforcement learning training, thereby enhancing both reasoning efficiency and overall performance.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



