Characterizing, Evaluating, and Optimizing Complex Reasoning
Abstract:
As Large Reasoning Models (LRMs) become increasingly dependent on reasoning traces with intricate internal structures, a significant gap remains in the literature regarding three core inquiries: (1) the criteria for defining high-quality reasoning, (2) methods for reliably assessing long, implicitly structured reasoning paths, and (3) the application of these evaluation metrics to optimize reasoning processes. This study offers a cohesive framework to resolve these issues. First, we propose the ME$^2$ principle, which assesses reasoning quality through both macro- and micro-level dimensions of efficiency and effectiveness. Second, leveraging this principle, we represent reasoning traces as directed acyclic graphs (DAGs) and introduce a pairwise evaluation technique based on DAGs capable of capturing complex structural nuances. Third, utilizing this evaluation method, we create the TRM-Preference dataset and develop a Thinking Reward Model (TRM) to facilitate large-scale assessment of reasoning quality. Our experimental results demonstrate that thinking rewards function as a potent signal for optimization. During inference, choosing superior reasoning paths yields improved results, with gains reaching up to 19.3\%. Furthermore, incorporating thinking rewards into Reinforcement Learning (RL) training enhances both reasoning capabilities and overall performance by up to 3.9\% across various tasks. The associated code and data can be accessed at https://github.com/Simplified-Reasoning/TRM.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
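
The inference-time use of thinking rewards described in the abstract (scoring several candidate reasoning traces and keeping the best one) can be illustrated with a minimal best-of-N sketch. This is not the authors' implementation: `select_best_trace` and `toy_reward` are hypothetical names, and `toy_reward` is a crude length-based stand-in for a trained Thinking Reward Model, used only so the example runs on its own.

```python
from typing import Callable, Sequence


def select_best_trace(
    traces: Sequence[str],
    reward_fn: Callable[[str], float],
) -> str:
    """Return the candidate trace with the highest reward (best-of-N selection)."""
    if not traces:
        raise ValueError("need at least one candidate trace")
    return max(traces, key=reward_fn)


def toy_reward(trace: str) -> float:
    """Hypothetical stand-in for a thinking reward model.

    Assumption for illustration only: fewer words means a higher score.
    A real reward model would judge effectiveness as well as efficiency.
    """
    return -float(len(trace.split()))


candidates = [
    "Try x=1. No. Try x=2. No. Try x=3. Yes, so x=3.",
    "Solve 2x=6 directly, so x=3.",
]

# Pick the trace that the (toy) reward function prefers.
best = select_best_trace(candidates, toy_reward)
print(best)
```

In practice `reward_fn` would wrap a call to a learned reward model; the selection logic stays the same regardless of how the score is produced.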






