arXiv

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

June 2, 2026 · Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song


Abstract:

The rapid advancement of frontier large language models has saturated many benchmarks, leaving current datasets unable to distinguish model capabilities or provide meaningful training signals. For example, on LiveCodeBench, leading models surpass 99% Pass@1 on easy splits and average over 90% Pass@1 across difficulty levels. Creating new, rigorous datasets usually demands significant human labor, which makes benchmark construction a bottleneck for further progress. To address this, we present BenchEvolver, a solution-centric evolutionary framework that automatically converts existing coding problems into more challenging variants. Instead of writing problems from scratch, BenchEvolver evolves reference solutions through structured transformations, then derives the corresponding problem statements and test cases from the evolved solutions. This approach anchors generation in executable semantics, enabling the scalable creation of diverse, high-quality, and difficult tasks with verifiable correctness.
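To make the solution-centric idea concrete, the following is a minimal Python sketch, not BenchEvolver's actual implementation. Everything in it is an assumption chosen for illustration: the seed task (maximum subarray sum), the transformation (allowing at most one deletion), and the helper names `derive_tests` and `passes`. It only shows how test cases can be obtained by executing an evolved reference solution, so that expected outputs are correct by construction.

```python
import random
from typing import Callable, List, Tuple

Solution = Callable[[List[int]], int]
TestCase = Tuple[List[int], int]


def seed_solution(nums: List[int]) -> int:
    """Seed task (illustrative): maximum sum of a non-empty contiguous subarray."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def evolved_solution(nums: List[int]) -> int:
    """Hypothetical evolved variant: at most one element may be deleted
    from the chosen subarray, which must remain non-empty."""
    no_del = best = nums[0]      # best sum ending here with no deletion
    one_del = float("-inf")      # best sum ending here with one deletion
    for x in nums[1:]:
        # Either delete x itself, or extend a subarray that already used its deletion.
        one_del = max(no_del, one_del + x)
        no_del = max(x, no_del + x)
        best = max(best, no_del, one_del)
    return best


def derive_tests(reference: Solution, n_cases: int, seed: int = 0) -> List[TestCase]:
    """Generate random inputs and label them by running the reference solution."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_cases):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        tests.append((nums, reference(nums)))
    return tests


def passes(candidate: Solution, tests: List[TestCase]) -> bool:
    """Check a candidate against every derived (input, expected) pair."""
    return all(candidate(list(nums)) == expected for nums, expected in tests)


tests = derive_tests(evolved_solution, n_cases=50)
print(passes(evolved_solution, tests))  # True by construction
print(passes(seed_solution, tests))     # the seed solution need not solve the evolved task
```

Because the expected outputs come from running the evolved reference, a natural-language statement written afterwards only has to describe what the code already does. How BenchEvolver selects transformations and validates the resulting statements is not covered by this sketch.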

When applied to LiveCodeBench and SciCode, BenchEvolver produces evolved tasks that are significantly more difficult while preserving validity, reference correctness, and diversity. We additionally curate LiveCodeBench-Plus, a 91-problem benchmark that combines evolved tasks with difficult original LCB-v6 items. On this benchmark, frontier models score between 27.5% and 62.6% Pass@1, restoring clear discrimination among top-tier coding models. Notably, the evolved tasks remain challenging even for the model that generated them, which makes them usable for self-improvement. Furthermore, we demonstrate that reinforcement learning (RL) on evolved LCB tasks improves coding performance on held-out datasets: for gpt-oss-20b, training on seed+evolved data yields Pass@1 improvements of +8.7 on LCB v6 Hard and +8.3 on LCB-Pro Easy. These gains exceed those achieved with seed-only training by 70.7% and 34.8%, respectively. Our findings indicate that BenchEvolver can transform saturated benchmarks into frontier-level evaluation suites and generate reusable training signals.
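The relative comparison with seed-only training can be unpacked with a small worked calculation. The sketch below assumes that "exceed by 70.7%" means the seed+evolved gain equals the seed-only gain multiplied by 1.707 (and likewise 1.348 for the second benchmark). The seed-only gains it prints are back-of-the-envelope values implied by that reading, not figures reported in the abstract, and the function name is hypothetical.

```python
def implied_baseline_gain(gain: float, relative_excess: float) -> float:
    """Solve gain = baseline * (1 + relative_excess) for the baseline gain.

    Assumes 'exceeds by X%' is a relative (multiplicative) comparison.
    """
    return gain / (1.0 + relative_excess)


reported = [
    ("LCB v6 Hard", 8.7, 0.707),
    ("LCB-Pro Easy", 8.3, 0.348),
]

for name, gain, excess in reported:
    baseline = implied_baseline_gain(gain, excess)
    print(f"{name}: seed+evolved +{gain:.1f}, implied seed-only about +{baseline:.2f}")
# LCB v6 Hard: seed+evolved +8.7, implied seed-only about +5.10
# LCB-Pro Easy: seed+evolved +8.3, implied seed-only about +6.16
```

Under this reading, seed-only training would account for roughly +5.1 and +6.2 of the improvement on the two benchmarks, with the evolved tasks contributing the remainder.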
