Can Generalist Agents Automate Data Curation?
Abstract
Curating training data is one of the most critical and labor-intensive phases of contemporary AI development. Practitioners typically iterate through a cycle of proposing, implementing, evaluating, and refining data policies in response to noisy benchmark feedback. This study investigates whether generalist coding agents can automate this data-curation workflow. To this end, we present Curation-Bench, a benchmark designed with an agent-centric approach. The benchmark holds the model, training recipe, and evaluation suite constant, and gives agents command-line access to inspect data, implement policies, submit them to a fixed training and evaluation pipeline, and then revise their approach.
In experiments on vision-language instruction tuning, out-of-the-box agents achieved performance comparable to strong published data-selection baselines within just ten iterations. However, an analysis of agent trajectories reveals a persistent "execution-research gap": agents primarily adjusted local policy variants rather than exploring new policy families, even when provided with strategic guides and references to existing papers. Introducing scaffolds that require agents to cite, instantiate, and adapt prior methods at each iteration shifted the agents toward method-guided exploration. The resulting scaffolded agent autonomously composed a data-selection policy, without any human design input, that surpassed strong published baselines while using only one-tenth of their data budget. Ultimately, while current agents can execute the curation loop, reliable data research requires scaffolded method adaptation rather than open-ended prompting alone. The code and benchmark are publicly available.
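As a rough illustration of the propose, implement, evaluate, and refine loop described in the abstract, the following minimal Python sketch runs a few candidate policies against a pipeline that stays fixed across iterations. It is not the Curation-Bench interface: `Sample`, `make_threshold_policy`, `fixed_pipeline`, and `curation_loop` are hypothetical names, the integer `quality` field is an assumed per-sample signal, and the scoring function is a toy stand-in for real training and evaluation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    text: str
    quality: int  # hypothetical per-sample signal; not defined by the source


# A data policy maps the full pool to the subset used for training.
Policy = Callable[[List[Sample]], List[Sample]]


def make_threshold_policy(threshold: int) -> Policy:
    """Hypothetical policy family: keep samples at or above a quality threshold."""
    def policy(pool: List[Sample]) -> List[Sample]:
        return [s for s in pool if s.quality >= threshold]
    return policy


def fixed_pipeline(subset: List[Sample]) -> float:
    """Toy stand-in for the fixed train-and-evaluate step (an assumption).

    Rewards high mean quality and penalizes subsets smaller than 4 samples.
    """
    if not subset:
        return 0.0
    mean_quality = sum(s.quality for s in subset) / len(subset)
    coverage = min(1.0, len(subset) / 4)
    return mean_quality * coverage


def curation_loop(
    pool: List[Sample], thresholds: List[int], budget: int = 10
) -> Tuple[Optional[int], float, List[Tuple[int, float]]]:
    """Try at most `budget` candidate policies and keep the best-scoring one."""
    best_threshold: Optional[int] = None
    best_score = float("-inf")
    history: List[Tuple[int, float]] = []
    for threshold in thresholds[:budget]:
        subset = make_threshold_policy(threshold)(pool)  # implement the policy
        score = fixed_pipeline(subset)                   # evaluate with the fixed pipeline
        history.append((threshold, score))
        if score > best_score:                           # refine: remember the best so far
            best_threshold, best_score = threshold, score
    return best_threshold, best_score, history


pool = [Sample(text=f"example {i}", quality=i) for i in range(1, 9)]
best_threshold, best_score, history = curation_loop(pool, [0, 3, 5, 7])
print(best_threshold, best_score)
```

Sweeping a single threshold in this way only varies one parameter within one policy family, which loosely resembles the local adjustments the abstract attributes to unscaffolded agents; exploring a new policy family would mean swapping in a different `Policy` constructor rather than a different threshold.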
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




