arXiv

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

June 3, 2026 · Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

Abstract:

The development of Large Language Models (LLMs) capable of managing multi-step tool interactions is currently hindered by three interconnected challenges: the high expense of constructing realistic, stateful execution environments; the disconnect between synthetic training queries and the server’s actual state, which leads to failed tool executions; and the tendency of recall-based reinforcement learning (RL) rewards to encourage unnecessarily verbose tool-calling behaviors.

To address these issues, we introduce PROVE (Programmatic Rewards On Verified Environments), a framework built on three contributions. First, we provide a library of 20 stateful Model Context Protocol (MCP) servers that together expose 343 tools. This infrastructure supports live-execution RL training with session-scoped state isolation. Second, we develop an automated data synthesis pipeline that produces validated, multi-turn tool-call trajectories. The pipeline runs dependency-graph-guided conversation simulations grounded in live-sampled server state, so every generated query references entities that actually exist on the server. Third, we design a multi-component programmatic reward that requires no external judge model. It combines graduated validity scoring, dependency-aware coverage, a tool-name signal, an argument-value matching bonus, and an adaptive efficiency penalty that scales the call budget with task complexity.
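To make the shape of such a reward concrete, the following is a minimal sketch of how the five named signals could be combined into one scalar. It is not the PROVE implementation: the `ToolCall` type, the function names, the weights in `WEIGHTS`, the `slack` parameter, and the approximation of dependency-aware coverage by in-order matching against a reference trajectory are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    # Assumed graduated validity in [0, 1], e.g. 0.5 for a well-formed call that failed.
    validity: float = 1.0


# Illustrative weights only; the abstract does not state the actual values.
WEIGHTS = {"validity": 0.2, "coverage": 0.4, "name": 0.1, "args": 0.2, "efficiency": 0.1}


def ordered_coverage(pred, ref):
    """Fraction of reference calls matched in order (stand-in for dependency-aware coverage)."""
    matched = 0
    for call in pred:
        if matched < len(ref) and call.name == ref[matched].name:
            matched += 1
    return matched / len(ref)


def name_signal(pred, ref):
    """Fraction of distinct reference tool names that appear anywhere in the prediction."""
    ref_names = {c.name for c in ref}
    return len(ref_names & {c.name for c in pred}) / len(ref_names)


def arg_bonus(pred, ref):
    """Average, over reference calls, of the best fraction of argument values matched."""
    scores = []
    for r in ref:
        candidates = [p for p in pred if p.name == r.name]
        if not candidates:
            scores.append(0.0)
        elif not r.args:
            scores.append(1.0)
        else:
            best = max(sum(p.args.get(k) == v for k, v in r.args.items()) for p in candidates)
            scores.append(best / len(r.args))
    return sum(scores) / len(scores)


def efficiency_penalty(pred, ref, slack=0.5):
    """Penalty in [0, 1] for calls beyond a budget that grows with reference length."""
    budget = math.ceil(len(ref) * (1 + slack))
    return min(1.0, max(0, len(pred) - budget) / budget)


def reward(pred, ref):
    """Weighted sum of the four positive signals minus the efficiency penalty."""
    if not ref:
        raise ValueError("reference trajectory must contain at least one call")
    validity = sum(c.validity for c in pred) / len(pred) if pred else 0.0
    return (
        WEIGHTS["validity"] * validity
        + WEIGHTS["coverage"] * ordered_coverage(pred, ref)
        + WEIGHTS["name"] * name_signal(pred, ref)
        + WEIGHTS["args"] * arg_bonus(pred, ref)
        - WEIGHTS["efficiency"] * efficiency_penalty(pred, ref, slack=0.5)
    )
```

Under these assumptions, a prediction that reproduces the reference exactly scores higher than one that pads the same calls with redundant repeats, which is the verbosity a purely recall-based reward would not discourage.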

We trained four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, and Granite-4.1-8B) with GRPO on approximately 13,000 examples. Reward hyperparameters were identical across all models; only the learning rate was adjusted per model family, based on a three-point sweep. PROVE delivers consistent improvements across two distinct model families, with gains of up to +10.2 points on BFCL Multi-Turn, +6.8 points on tau2-bench, and +6.5 points on T-Eval, indicating that a compact programmatic reward can effectively improve multi-step tool orchestration.
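For readers unfamiliar with GRPO, its core step is to score each sampled rollout relative to the other rollouts for the same prompt, which is why a scalar programmatic reward is sufficient and no learned value model is needed. The sketch below shows that group-relative normalization in its commonly described form; the function name and the `eps` term are illustrative, and the exact variant used in this work (for example, how the standard deviation is computed) is not stated in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group average receive positive advantages and the rest receive negative ones; when every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.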
