Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
Title: Mastering Restraint: Preventing Tool Misuse in Agentic Reinforcement Learning
Abstract: Agentic reinforcement learning systems are prone to tool abuse, in which a model over-relies on external tools for tasks it could resolve through internal reasoning. Current mitigation strategies typically apply uniform penalties or strict caps on tool usage; these reduce tool-call frequency but risk stifling valuable tool-assisted exploration. To address this, we introduce EAPO (Efficient Agentic Policy Optimization), a framework that teaches selective tool use. EAPO includes tool-free trajectories in each rollout group, applies difficulty-sensitive reward shaping that concentrates penalties for redundant calls on simpler queries, and uses confidence-based token reweighting to improve policy training. Evaluated on nine mathematical and knowledge-heavy reasoning benchmarks, EAPO achieves a better balance between accuracy and efficiency than GRPO on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B: it improves average performance by 10.45%, 7.27%, and 9.69%, while cutting average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These findings indicate that agents can learn to refrain from using tools without undermining the benefits of tool-integrated reasoning.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
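
The abstract names difficulty-sensitive reward shaping over rollout groups that contain tool-free trajectories, but gives no formula. The sketch below is one possible reading, not EAPO's actual definition. Everything in it is an assumption made for illustration: the `Rollout` type, the `shaped_rewards` and `group_advantages` helpers, the `penalty` coefficient, and above all the choice to estimate how easy a query is from the accuracy of the tool-free rollouts in its group. The advantage step uses the usual GRPO-style group normalization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical record of one sampled trajectory."""
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of external tool invocations in the trajectory


def shaped_rewards(group: List[Rollout], penalty: float = 0.2) -> List[float]:
    """Illustrative difficulty-sensitive shaping for one rollout group.

    Assumption: easiness is the accuracy of the tool-free rollouts in the
    group (0.0 if the group has none). The abstract does not say how
    difficulty is measured.
    """
    tool_free = [r for r in group if r.tool_calls == 0]
    if tool_free:
        easiness = sum(r.correct for r in tool_free) / len(tool_free)
    else:
        easiness = 0.0

    rewards = []
    for r in group:
        base = 1.0 if r.correct else 0.0
        # The per-call penalty scales with easiness, so tool calls on
        # queries the model cannot solve unaided are not penalized.
        rewards.append(base - penalty * easiness * r.tool_calls)
    return rewards


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: center on the group mean, scale by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((x - mean) ** 2 for x in rewards) / len(rewards)
    std = var ** 0.5
    return [(x - mean) / (std + 1e-6) for x in rewards]


# Easy query: tool-free rollouts already succeed, so tool calls are penalized.
easy = [Rollout(True, 0), Rollout(True, 0), Rollout(True, 2)]
# Hard query: tool-free rollouts fail, so the tool-using success keeps full reward.
hard = [Rollout(False, 0), Rollout(False, 0), Rollout(True, 2)]

easy_rewards = shaped_rewards(easy)
hard_rewards = shaped_rewards(hard)
hard_adv = group_advantages(hard_rewards)
```

Under these assumptions, the tool-using rollout on the easy query receives a lower reward than its tool-free peers, while on the hard query it keeps its full reward and the largest advantage in the group. A uniform penalty would instead reduce the reward in both cases, which is the behavior the abstract argues against.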




