AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Title: AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Abstract:
Language model (LM) agents are increasingly deployed to automate scientific research, yet assessing the scientific value they actually contribute remains difficult. Ablation experiments are a key tool for generating such insights. To support this kind of assessment, we present AblationBench, a benchmark suite for evaluating agents on ablation planning in empirical AI research. The suite comprises two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section (83 instances), and ReviewerAblation, which helps reviewers identify ablations missing from a complete paper (350 instances). For both tasks, we develop LM-based judges that serve as an automated evaluation framework.
Our experiments with state-of-the-art LMs show that these tasks remain difficult: the best-performing LM system identified only 45% of the original ablations on average, which falls short of human-level accuracy. We observed an inverse relationship between performance on the author task and performance on the reviewer task, which we attribute to differences in model grounding. Our analysis of current LM limitations on these tasks further indicates that chain-of-thought prompting outperforms agent-based approaches. The dataset is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and the code at https://github.com/ai-scientist-bench/ablation-bench.
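To make the headline figure concrete, the sketch below shows one plausible way an "average share of original ablations identified" could be computed from judge verdicts. The abstract does not specify the aggregation, so the per-paper recall followed by an unweighted mean is an assumption, and the function names and example verdicts are hypothetical rather than taken from the AblationBench code.

```python
from statistics import mean


def recall(judge_matches: list[bool]) -> float:
    """Fraction of one paper's original ablations that a judge marked as recovered.

    Each entry corresponds to one original (reference) ablation; True means the
    judge decided that the generated plan covers it. (Hypothetical data layout.)
    """
    if not judge_matches:
        raise ValueError("a paper needs at least one reference ablation")
    return sum(judge_matches) / len(judge_matches)


def average_recall(per_paper_matches: list[list[bool]]) -> float:
    """Unweighted mean of per-paper recall (assumed aggregation, not confirmed)."""
    return mean(recall(matches) for matches in per_paper_matches)


# Made-up judge verdicts for three papers, for illustration only.
example_verdicts = [
    [True, False, False, True],   # 2 of 4 original ablations recovered
    [True, False],                # 1 of 2 recovered
    [False, False, True, False],  # 1 of 4 recovered
]

score = average_recall(example_verdicts)
print(f"average recall: {score:.3f}")
```

Averaging per paper weights every paper equally regardless of how many ablations it contains. Pooling all ablations before dividing would be an equally reasonable alternative and can yield a different number.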
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




