Global News Digest

arXiv

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Abstract:

Rapid progress in state-of-the-art large language models has saturated many benchmarks, leaving current datasets unable to distinguish model capabilities or to offer meaningful training signals. For example, on LiveCodeBench, leading models surpass 99% Pass@1 on easy splits and average over 90% Pass@1 across difficulty levels. Creating new, rigorous datasets usually demands significant human labor, which makes benchmark construction a bottleneck for further progress. To address this, we present BenchEvolver, a solution-centric evolutionary framework that automatically converts existing coding problems into more challenging variants. Instead of writing problems from scratch, BenchEvolver evolves reference solutions through structured transformations and then derives the corresponding problem statements and test cases from the evolved solutions. This approach anchors generation in executable semantics, enabling the scalable creation of diverse, high-quality, and difficult tasks with verifiable correctness.
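The solution-centric idea can be illustrated with a toy sketch. This is not BenchEvolver's implementation: the seed task, the hand-written "evolved" solution, and the helper names `derive_tests` and `passes` are all hypothetical. The only point it shows is that expected outputs come from executing the evolved reference solution, so the derived tests are correct by construction.

```python
import random
from typing import Callable, List, Tuple

TestCase = Tuple[List[int], int]

def seed_solution(nums: List[int]) -> int:
    """Reference solution for a hypothetical seed task: sum of a list."""
    return sum(nums)

def evolved_solution(nums: List[int]) -> int:
    """Hypothetical evolved variant: maximum sum over any non-empty
    contiguous subarray (Kadane's algorithm)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def derive_tests(solution: Callable[[List[int]], int],
                 n_cases: int = 5, seed: int = 0) -> List[TestCase]:
    """Generate random inputs and label them by running the reference
    solution, so every expected output is backed by executable code."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_cases):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        tests.append((nums, solution(nums)))
    return tests

def passes(candidate: Callable[[List[int]], int],
           tests: List[TestCase]) -> bool:
    """Check a candidate solution against the derived test cases."""
    return all(candidate(list(inp)) == out for inp, out in tests)

if __name__ == "__main__":
    tests = derive_tests(evolved_solution)
    print("evolved reference passes its own tests:", passes(evolved_solution, tests))
```

In this sketch the transformation from the seed to the evolved solution is written by hand; in the framework described above that step is automated, and a problem statement is also derived from the evolved solution.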

When applied to LiveCodeBench and SciCode, BenchEvolver produces evolved tasks that are significantly more difficult while preserving validity, reference correctness, and diversity. We also curate LiveCodeBench-Plus, a benchmark of 91 problems that combines evolved tasks with difficult original LCB-v6 items. On this benchmark, frontier models' Pass@1 scores range from 27.5% to 62.6%, restoring clear discrimination among top-tier coding models. Notably, the evolved tasks remain challenging even for the model that generated them, which facilitates self-improvement. Furthermore, we demonstrate that reinforcement learning (RL) on evolved LCB tasks improves coding performance on held-out datasets: for gpt-oss-20b, training with seed+evolved data yields Pass@1 improvements of +8.7 on LCB v6 Hard and +8.3 on LCB-Pro Easy. These gains exceed those achieved with seed-only training by 70.7% and 34.8%, respectively. Our findings indicate that BenchEvolver can transform saturated benchmarks into frontier-level evaluation suites and generate reusable training signals.
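As a rough sanity check on the reported range, Pass@1 on a 91-problem benchmark can be computed as solved problems over total problems. The solved counts below (25 and 57) are illustrative assumptions chosen because they reproduce the quoted percentages; the abstract does not state per-model counts, and Pass@1 may in practice be averaged over several samples per problem.

```python
def pass_at_1(solved: int, total: int) -> float:
    """Pass@1 as a percentage, assuming one attempt per problem."""
    return 100.0 * solved / total

TOTAL_PROBLEMS = 91  # size of LiveCodeBench-Plus as stated in the abstract

# Hypothetical solved counts that match the reported range.
low = pass_at_1(25, TOTAL_PROBLEMS)
high = pass_at_1(57, TOTAL_PROBLEMS)

print(f"{low:.1f}% to {high:.1f}%")
```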


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
