arXiv

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

June 3, 2026 · Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichart

Title: TASTE: Enhancing Benchmark Coverage and Complexity

Abstract

As artificial agents grow more capable, established evaluation suites like $\tau^2$-Bench are nearing saturation. However, creating new benchmark tasks remains difficult, expensive, and heavily reliant on manual effort. Furthermore, the conventional methodology, in which scenarios are drafted in natural language and subsequently converted into tool sequences, fails to capture the full spectrum of tool-use patterns that agents actually employ. To address these limitations, this study inverts the traditional task creation workflow: tool sequences come first, and tasks are built from them. We introduce TASTE (Task Synthesis from Tool Sequence Evolution), an automated framework designed to produce rigorous tasks with expanded tool-use diversity.

TASTE leverages an Adaptive Contrastive $n$-gram model, which is trained on validity signals assessed by Large Language Models (LLMs). This approach allows for the sampling of valid tool sequences that encompass a wide variety of tool combinations. From this extensive pool, the system employs clustering to identify representative sequences, which are then instantiated into full benchmark tasks and polished through iterative difficulty evolution.
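To make the sampling step concrete, here is a minimal sketch of one way a contrastive $n$-gram sampler could work, assuming a bigram model ($n = 2$) and a simple smoothed ratio of valid to invalid counts. The abstract does not specify the model's order, its scoring rule, or how it adapts, so the function names (`bigram_counts`, `contrastive_weights`, `sample_sequence`), the tool names, and the smoothing constant `alpha` are all illustrative assumptions and not the authors' implementation.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sequence boundary markers


def bigram_counts(sequences):
    """Count adjacent tool pairs, including the start and end markers."""
    counts = Counter()
    for seq in sequences:
        padded = [START, *seq, END]
        counts.update(zip(padded, padded[1:]))
    return counts


def contrastive_weights(prev, vocab, valid, invalid, alpha=1.0):
    """Weight each candidate by its smoothed valid/invalid count ratio.

    Transitions seen mostly in valid sequences get a weight above 1,
    transitions seen mostly in invalid ones get a weight below 1.
    """
    return {
        tool: (valid[(prev, tool)] + alpha) / (invalid[(prev, tool)] + alpha)
        for tool in vocab
    }


def sample_sequence(valid, invalid, tools, rng, max_len=6):
    """Sample one tool sequence, stopping at END or after max_len tools."""
    vocab = [*tools, END]
    seq, prev = [], START
    while len(seq) < max_len:
        weights = contrastive_weights(prev, vocab, valid, invalid)
        nxt = rng.choices(vocab, weights=[weights[t] for t in vocab])[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


# Toy labels standing in for LLM validity judgments (invented data).
tools = ["find_user", "get_order", "refund"]
valid = bigram_counts([["find_user", "get_order", "refund"]])
invalid = bigram_counts([["refund", "find_user"]])

rng = random.Random(0)
samples = [sample_sequence(valid, invalid, tools, rng) for _ in range(5)]
```

In this sketch, the "adaptive" part would amount to judging each new sample, adding its bigrams to `valid` or `invalid`, and sampling again, so that the weights shift as more validity signals arrive. The clustering and task instantiation stages are not shown.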

Using this methodology, we developed $\tau^c$-Bench, a more demanding extension of the three domains found in $\tau^2$-Bench. Our evaluation of 11 distinct agent/user LLM pairs reveals that models which had nearly maxed out their scores on $\tau^2$-Bench experienced significant performance declines on our new tasks. For instance, Gemini-3-Flash’s scores plummeted from a range of $0.82\!-\!0.94$ to $0.28\!-\!0.61$. In addition to raising the difficulty bar, our generated tasks more than double the number of unique tool combinations agents are required to execute. These findings indicate that high performance on current benchmarks often signals saturation rather than genuine, robust problem-solving skills. By automating the creation of difficult, high-coverage benchmarks, TASTE facilitates the continuous and scalable evaluation of future agent technologies.
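The coverage claim can be illustrated with a small counting sketch. It assumes that a "tool combination" means the unordered set of tools called in a task's reference solution; the abstract does not define the term precisely, so this reading, the helper names (`tool_combinations`, `coverage_ratio`), and the toy data are all assumptions made for illustration.

```python
def tool_combinations(solutions):
    """Return the distinct unordered tool sets across reference solutions."""
    return {frozenset(calls) for calls in solutions}


def coverage_ratio(new_solutions, old_solutions):
    """Ratio of distinct tool combinations in the new set versus the old."""
    return len(tool_combinations(new_solutions)) / len(tool_combinations(old_solutions))


# Invented reference solutions, one list of tool calls per task.
old_tasks = [
    ["find_user", "get_order"],
    ["get_order", "find_user"],  # same combination, different order
    ["find_user", "refund"],
]
new_tasks = [
    ["find_user", "get_order"],
    ["find_user", "refund"],
    ["find_user", "get_order", "refund"],
    ["get_order", "cancel_order"],
    ["find_user", "cancel_order", "refund"],
]

ratio = coverage_ratio(new_tasks, old_tasks)  # 5 combinations / 2 combinations = 2.5
```

Under this definition, reordering the same tools does not count as new coverage. An ordered definition (tuples instead of `frozenset`) would count it, and would generally yield larger numbers for both benchmarks.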
