arXiv

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

June 2, 2026 · Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

Abstract:

As Large Language Model (LLM) agents increasingly rely on external tools to execute complex operations, a new privacy vulnerability has emerged: Tools Orchestration Privacy Risk (TOP-R). The risk materializes when an agent combines multiple tool outputs, each non-sensitive on its own, and inadvertently reveals a sensitive conclusion. We formally define TOP-R through three criteria: (1) the final conclusion is sensitive; (2) the conclusion cannot be inferred from any single source; and (3) the conclusion becomes inferable once the sources are combined.
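
The three criteria can be read as a simple predicate over a set of tool outputs. The following is a minimal sketch, not the paper's formalization: the `infers` oracle, the function name `is_top_r`, and the toy scenario are all hypothetical, and in practice the oracle would be a human or model-based judgment rather than a set lookup.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable

# Hypothetical oracle: True if the sensitive conclusion can be inferred
# from exactly this set of tool outputs.
Infers = Callable[[FrozenSet[str]], bool]


def is_top_r(outputs: Iterable[str], conclusion_is_sensitive: bool, infers: Infers) -> bool:
    """Check the three TOP-R criteria for one scenario (illustrative only)."""
    items = list(outputs)
    # Criterion 1: the final conclusion must itself be sensitive.
    if not conclusion_is_sensitive:
        return False
    # Criterion 2: no single tool output reveals the conclusion on its own.
    if any(infers(frozenset([item])) for item in items):
        return False
    # Criterion 3: some combination of two or more outputs does reveal it.
    return any(
        infers(frozenset(combo))
        for size in range(2, len(items) + 1)
        for combo in combinations(items, size)
    )


# Toy scenario (invented for illustration): each fact alone is innocuous,
# but together they suggest a health-related conclusion.
calendar_note = "calendar: weekly appointment at 12 Elm Street"
maps_note = "maps: 12 Elm Street is an oncology clinic"
needed = frozenset([calendar_note, maps_note])


def toy_infers(seen: FrozenSet[str]) -> bool:
    # The conclusion is inferable only when both facts are available.
    return needed <= seen


print(is_top_r([calendar_note, maps_note], True, toy_infers))
```

Under these assumptions the two-output scenario satisfies all three criteria, while passing only one of the two outputs, or marking the conclusion as non-sensitive, does not.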

To investigate this issue, we developed LRSE (Library-Grounded Reverse-Inference Seed Expansion), a reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios. Using this pipeline, we created TOP-Bench, a benchmark of 1,000 instances that assesses semantic disclosure in final responses under a controlled, two-stage tool-use protocol.

Our evaluation of six LLM agents showed that, while task completion rates remained high, the average leakage rate reached 88.6 percent, yielding an H-score of only 20.4. Prompt-only safeguards helped modestly, raising the H-score by approximately 2.7 points on the main benchmark.
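
The abstract does not define the H-score. One plausible reading, assumed here and not confirmed by the source, is a harmonic mean of task completion and non-leakage (1 minus the leakage rate). The sketch below uses that assumed form to show why high leakage keeps such a score low even when completion is high; the function name `h_score` and the completion value 0.95 are hypothetical.

```python
def h_score(task_completion: float, leakage: float) -> float:
    """Harmonic mean of completion and non-leakage on a 0-100 scale.

    ASSUMPTION: this formula is a guess at the metric's form, not the
    definition used in the paper.
    """
    safety = 1.0 - leakage
    if task_completion + safety == 0:
        return 0.0
    return 100.0 * 2.0 * task_completion * safety / (task_completion + safety)


assumed_completion = 0.95  # hypothetical; the abstract only says "high"
reported_leakage = 0.886   # average leakage rate stated in the abstract
score = h_score(assumed_completion, reported_leakage)
print(f"{score:.1f}")
```

With these inputs the assumed formula gives a score of about 20, and even perfect task completion would not lift it past 21, because a harmonic mean is dominated by its smaller term.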

To address these limitations, we propose TOP-Align, a post-training approach that combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to establish safer boundaries for task completion. On a separate evaluation split designated for post-training assessment, TOP-Align improved the H-score by 16.2 points over the base model, more than three times the 4.9-point average gain achieved by prompt-only mitigation on the same split. These findings indicate that addressing TOP-R requires mitigation strategies that go beyond prompting.
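
For readers unfamiliar with DPO, the following sketch computes the standard DPO objective for a single preference pair. It is not the TOP-Align training code: the abstract does not say how preference pairs are built, so treating the privacy-preserving response as "chosen" and the leaking response as "rejected" is an assumption, as are the function name `dpo_loss` and the value of `beta`.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model.
    """
    # How much more the policy favors "chosen" over "rejected" than the
    # reference model does.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# ASSUMPTION: "chosen" is a response that completes the task without the
# sensitive inference; "rejected" is a response that discloses it.
no_preference = dpo_loss(-2.0, -2.0, -2.0, -2.0)
prefers_safe = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(no_preference, prefers_safe)
```

The loss equals log 2 when the policy matches the reference and decreases as the policy shifts probability toward the chosen response relative to the reference.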
