arXiv

Inducing Reasoning Primitives from Agent Traces

June 3, 2026 · Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Title: Inducing Reasoning Primitives from Agent Traces

Original: arXiv:2606.02994v1 Announce Type: new Abstract: ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

Rewrite: arXiv:2606.02994v1 Announce Type: new Abstract: ReAct-style LLM agents often rediscover the same reasoning routines from one problem to the next, but those routines stay confined to transient scratchpads and are never retained. To address this, we present Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurring reasoning moves, and converts the most frequent ones into a compact library of typed pseudo-tools. Each pseudo-tool is defined by a natural-language docstring that an LLM interprets when the tool is invoked, and a standard ReAct loop composes these primitives at test time. The central finding is that the induced libraries outperform the very agent that generated their traces: by +44 percentage points on RuleArena NBA (30 to 74), +30 percentage points on MuSR team allocation (38 to 68), and +22 percentage points on NatPlan meeting planning (7 to 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.
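
The abstract gives no implementation details, so the following is only a minimal sketch of the mine, cluster, and convert steps it describes, not the authors' method. Everything here is an assumption made for illustration: the Step, Trace, and PseudoTool types, the normalize and induce_library helpers, and the llm callable are all hypothetical. In particular, grouping moves by a normalized label is a crude stand-in for whatever clustering the paper actually uses.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    move: str   # short label for the reasoning move (hypothetical field)
    text: str   # scratchpad text recorded for this step


@dataclass
class Trace:
    steps: List[Step]
    success: bool  # whether the agent solved the problem


@dataclass
class PseudoTool:
    name: str
    docstring: str            # natural-language spec, read by an LLM at call time
    input_type: str = "str"   # illustrative type annotations only
    output_type: str = "str"

    def invoke(self, llm: Callable[[str], str], argument: str) -> str:
        # The tool has no code body: the LLM interprets the docstring.
        prompt = f"{self.docstring}\nInput: {argument}\nOutput:"
        return llm(prompt)


def normalize(move: str) -> str:
    """Stand-in for clustering: lowercase and join words with underscores."""
    return "_".join(move.lower().split())


def induce_library(traces: List[Trace], k: int) -> List[PseudoTool]:
    """Single pass over successful traces; keep the k most frequent moves."""
    counts: Counter = Counter()
    examples: Dict[str, str] = {}
    for trace in traces:
        if not trace.success:
            continue  # only successful traces are mined
        for step in trace.steps:
            key = normalize(step.move)
            counts[key] += 1
            examples.setdefault(key, step.text)  # keep first example seen
    return [
        PseudoTool(
            name=key,
            docstring=f"Perform the reasoning move '{key}'. Example: {examples[key]}",
        )
        for key, _ in counts.most_common(k)
    ]


traces = [
    Trace([Step("List Constraints", "A must precede B"),
           Step("Check Rule", "Rule 3 applies")], success=True),
    Trace([Step("list constraints", "C is busy at 2pm"),
           Step("Compute Total", "sum is 12")], success=True),
    Trace([Step("Compute Total", "sum is 9")] * 3, success=False),
    Trace([Step("list  constraints", "D needs 30 minutes"),
           Step("check rule", "Rule 1 applies")], success=True),
]
library = induce_library(traces, k=2)
```

At test time, a ReAct loop would expose each PseudoTool in the library as a callable action and pass a real model client as llm; that loop is omitted here because the abstract does not specify it.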

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC