Agentic Transformers Provably Learn to Search via Reinforcement Learning

June 2, 2026 · Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

Abstract: Tree search is a fundamental framework for many reasoning and decision-making tasks involving language agents: the agent must explore candidate actions, keep a record of failed attempts, and backtrack toward more promising options. Despite its prevalence, no theoretical framework currently explains how transformer-based policies acquire these search abilities through reinforcement learning (RL) training dynamics. To address this gap, we study a stochastic $k$-ary tree environment in which an agentic transformer interacts solely through its trajectory history and receives a terminal reward upon reaching a hidden leaf goal node.

We demonstrate that a two-head transformer can execute randomized depth-first search (DFS). In this architecture, one head monitors the sequence of prior actions, while the other identifies failure states to initiate backtracking. By analyzing policy gradient training dynamics under a depth-wise curriculum, we show that this DFS mechanism arises in distinct stages from sparse RL feedback, independent of expert demonstrations. The trained policy displays depth generalization, successfully navigating deeper full trees despite being trained exclusively on depth-$1$ and depth-$2$ structures. Additionally, we find that when goal distributions are imbalanced, applying return discounting yields a ranked DFS policy that favors branches with higher probabilities. Collectively, these findings reveal a mechanistic normal form for transformer-based search, where specialized attention heads collaborate to distill decision-relevant information from context and translate it into agentic actions through RL training.
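For readers unfamiliar with the search procedure named above, the following is a minimal sketch of randomized depth-first search with backtracking on a full $k$-ary tree with a hidden goal leaf. It is a plain algorithmic illustration, not the transformer policy or the environment analyzed in the paper: the function name `randomized_dfs`, the encoding of nodes as tuples of child indices, and the step-counting convention are all assumptions made for this example, the tree is treated as deterministic, and `depth` is assumed to be at least 1.

```python
import random


def randomized_dfs(k, depth, goal, rng):
    """Search a full k-ary tree of the given depth for a hidden goal leaf.

    Nodes are encoded as tuples of child indices; () is the root.
    Returns (found, steps), where steps counts every move down or up.
    Hypothetical helper for illustration only; assumes depth >= 1.
    """
    path = []    # child indices from the root to the current node
    tried = {}   # node -> set of child indices already explored from it
    steps = 0
    while True:
        node = tuple(path)
        if len(path) == depth:
            if node == goal:
                return True, steps
            # Failed leaf: backtrack one level.
            path.pop()
            steps += 1
            continue
        explored = tried.setdefault(node, set())
        untried = [a for a in range(k) if a not in explored]
        if not untried:
            if not path:
                # Root exhausted: the goal is not in the tree.
                return False, steps
            # Every child of this node failed: backtrack.
            path.pop()
            steps += 1
            continue
        # Pick an unexplored child uniformly at random and descend.
        action = rng.choice(untried)
        explored.add(action)
        path.append(action)
        steps += 1


if __name__ == "__main__":
    rng = random.Random(0)
    found, steps = randomized_dfs(k=3, depth=4, goal=(2, 0, 1, 2), rng=rng)
    print(found, steps)
```

In this sketch the `path` list plays the role of tracking prior actions and the leaf check plays the role of detecting failure and triggering a backtrack, loosely mirroring the two functions the abstract attributes to the two attention heads. The ranked variant mentioned above would replace the uniform `rng.choice` with an ordering that prefers more probable branches.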

Source: arXiv