StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Abstract:
Agentic reinforcement learning (RL) is rapidly becoming a vital post-training strategy for enhancing the capabilities of Large Language Model (LLM) agents. Current RL methods for LLMs predominantly adhere to a token-centric framework, similar to RLHF and RLVR, treating individual tokens as the fundamental units for both modeling and optimization. Yet, this approach creates a granularity mismatch within agentic RL contexts. While these algorithms focus on optimizing token-level predictions, LLM agents actually operate by making decisions at the step level, driven by iterative cycles of environmental observation and action.
To address this discrepancy, we introduce StepPO, a step-centric paradigm for agentic RL that relies on step-aligned policy optimization. We fundamentally reframe agentic RL by shifting from a token-level Markov Decision Process (MDP) to a step-level MDP, where interaction steps function as the primary units for trajectory representation. Additionally, we implement step-level credit assignment to ensure that policy optimization corresponds with the natural granularity of agent decision-making. Consequently, StepPO enhances agent policies at the step level to facilitate effective multi-turn interaction between the agent and its environment.
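To make the granularity shift concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a trajectory is stored as a list of interaction steps, that each step carries a scalar reward, and that a step's advantage is its discounted return minus a constant baseline. The names `Step`, `step_returns`, and `token_weights`, and the parameters `gamma` and `baseline`, are all hypothetical. The point it shows is that credit is computed once per step and then shared by every token the policy emitted within that step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction: observe, then act (hypothetical layout)."""
    observation: str
    action_tokens: List[str]  # tokens the policy generated for this step
    reward: float             # assumed per-step scalar reward


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go, computed once per step rather than per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def token_weights(
    steps: List[Step], baseline: float = 0.0, gamma: float = 1.0
) -> List[List[float]]:
    """Give every token in a step the same step-level advantage.

    The advantage here is simply return minus a constant baseline, which is
    an assumption made for illustration only.
    """
    return [
        [g - baseline] * len(s.action_tokens)
        for s, g in zip(steps, step_returns(steps, gamma))
    ]


# Toy two-step trajectory: a search step followed by an answer step.
trajectory = [
    Step("question", ["search", "(", "query", ")"], 0.0),
    Step("search results", ["answer", "(", "x", ")"], 1.0),
]

print(step_returns(trajectory, gamma=0.5))                 # [0.5, 1.0]
print(token_weights(trajectory, baseline=0.5, gamma=0.5))  # one shared weight per step
```

In a token-centric formulation each of the eight tokens above would be its own decision with its own credit; in this sketch there are only two decisions, and the per-token weights are a broadcast of the two step-level values.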
Empirical evaluations across tasks such as multi-hop question answering, academic paper search, and text-world actions demonstrate that StepPO consistently surpasses a variety of existing RL algorithms. Further investigation offers valuable insights into how adopting a step-centric paradigm improves the training process for agents. We anticipate that this step-centric perspective will serve as both a practical framework for training more capable LLM agents and a useful analytical lens for understanding agent behavior.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





