arXiv

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Abstract:

The development of Large Language Models (LLMs) capable of managing multi-step tool interactions is currently hindered by three interconnected challenges: the high expense of constructing realistic, stateful execution environments; the disconnect between synthetic training queries and the server’s actual state, which leads to failed tool executions; and the tendency of recall-based reinforcement learning (RL) rewards to encourage unnecessarily verbose tool-calling behaviors.

To address these issues, we introduce PROVE (Programmatic Rewards On Verified Environments), a novel framework comprising three key innovations. First, we provide a library containing 20 stateful Model Context Protocol (MCP) servers that expose a total of 343 tools. This infrastructure supports live-execution RL training while ensuring session-scoped state isolation. Second, we developed an automated data synthesis pipeline that creates validated, multi-turn tool-call trajectories. This process utilizes dependency-graph-guided conversation simulations grounded in live-sampled server states, ensuring that every generated query references entities that genuinely exist within the server context. Third, we designed a multi-component programmatic reward system that eliminates the need for an external judge model. This system incorporates graduated validity scoring, dependency-aware coverage metrics, a tool-name signal, an argument-value matching bonus, and an adaptive efficiency penalty that scales call budgets according to complexity.
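As a rough illustration of how such a judge-free reward could be assembled, the sketch below combines the five components named above into one scalar. Everything specific in it is an assumption made for illustration and not taken from the paper: the `Call` record, the weights, the `slack` factor that sets the call budget, and the exact formula for each component.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Call:
    """Hypothetical record of one tool call (not the paper's schema)."""
    name: str
    args: dict = field(default_factory=dict)
    # Graduated validity in [0, 1], e.g. 0 = unparseable, 0.5 = parsed but
    # failed on the server, 1 = executed successfully (assumed scale).
    valid: float = 1.0


def coverage(pred, ref, deps):
    """Fraction of reference calls covered, counting a call only if all of
    its prerequisite calls are covered too. `ref` is assumed to be listed
    in dependency order; `deps` maps a ref index to prerequisite indices."""
    pred_names = {c.name for c in pred}
    covered = set()
    for i, call in enumerate(ref):
        if call.name in pred_names and all(d in covered for d in deps.get(i, [])):
            covered.add(i)
    return len(covered) / len(ref) if ref else 1.0


def arg_match(pred, ref):
    """Mean fraction of reference argument values reproduced exactly by the
    first predicted call with the same tool name."""
    first = {}
    for c in pred:
        first.setdefault(c.name, c)
    scores = []
    for r in ref:
        p = first.get(r.name)
        if p is None:
            scores.append(0.0)
        elif not r.args:
            scores.append(1.0)
        else:
            hits = sum(1 for k, v in r.args.items() if p.args.get(k) == v)
            scores.append(hits / len(r.args))
    return sum(scores) / len(scores) if scores else 1.0


def efficiency_penalty(n_pred, n_ref, slack=1.5):
    """Penalty in [0, 1] for exceeding a call budget that grows with task
    complexity (here: number of reference calls times an assumed slack)."""
    budget = max(1, math.ceil(slack * n_ref))
    return min(1.0, max(0, n_pred - budget) / budget)


def reward(pred, ref, deps, w=(0.2, 0.4, 0.1, 0.2, 0.1)):
    """Weighted sum of the components; the weights are illustrative only."""
    validity = sum(c.valid for c in pred) / len(pred) if pred else 0.0
    pn, rn = {c.name for c in pred}, {c.name for c in ref}
    name_signal = len(pn & rn) / len(pn | rn) if (pn | rn) else 1.0
    return (
        w[0] * validity
        + w[1] * coverage(pred, ref, deps)
        + w[2] * name_signal
        + w[3] * arg_match(pred, ref)
        - w[4] * efficiency_penalty(len(pred), len(ref))
    )
```

Under these assumed definitions, a trajectory that repeats correct calls well beyond the budget scores lower than a concise one with the same coverage, which is the behavior the efficiency penalty is meant to encourage.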

We trained and evaluated four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, and Granite-4.1-8B) using GRPO. Training used approximately 13,000 examples with identical reward hyperparameters across all models; only the learning rate was adjusted per model family, based on a three-point sweep. Our results show that PROVE delivers consistent improvements across two distinct model families, with gains of up to +10.2 points on BFCL Multi-Turn, +6.8 points on tau2-bench, and +6.5 points on T-Eval, indicating that a compact programmatic reward structure can effectively enhance multi-step tool orchestration.
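For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation as it is commonly described: several rollouts are sampled for the same prompt, and each rollout's scalar reward is normalized against the group's mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the abstract does not state which variant or settings were used.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout of the same prompt.
    `eps` (an assumed value) avoids division by zero when all rollouts
    in the group receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages are relative within a group, a prompt whose rollouts all earn the same reward contributes no learning signal, which is one reason graded reward components can be more useful than a single pass/fail score.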


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
