arXiv

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Title: PBT-Bench: Evaluating AI Agents via Property-Based Testing

Abstract:

Current code benchmarks primarily assess an agent's ability to generate tests that reproduce known defects or to create patches that resolve described issues. These metrics fail to isolate the specific competency required for property-based testing: the capacity to extract semantic invariants from documentation and formulate input-generation strategies with sufficient precision to expose violations through random search. To address this gap, we present PBT-Bench, a benchmark comprising 100 carefully curated property-based testing challenges drawn from 40 real-world Python libraries.

Each challenge incorporates one or more semantic bugs, 365 in total across the dataset (an average of 3.65 per problem). These bugs are engineered so that default random inputs rarely trigger them: the agent must analyze the library documentation, pinpoint the relevant invariant, and define a Hypothesis @given strategy that concentrates probability mass on the specific trigger regions. The dataset is divided into three difficulty levels (L1–L3), ranging from single-constraint boundary errors to complex, stateful, cross-function protocol violations.
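To make the idea of a targeted strategy concrete, the following is a minimal sketch and is not taken from the benchmark. The function `buggy_chunk`, its injected bug, and the strategy `exact_multiple_inputs` are all hypothetical. The bug fires only when the chunk size is at least 8 and the input length is a nonzero exact multiple of that size, a region that unconstrained list and integer strategies would hit only occasionally. The composite strategy draws every input from that region.

```python
from hypothesis import given, settings, strategies as st


def buggy_chunk(items, size):
    """Hypothetical library function: split `items` into chunks of `size`.

    Injected bug (illustration only): when size >= 8 and len(items) is a
    nonzero exact multiple of size, the final chunk is silently dropped.
    """
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    if size >= 8 and items and len(items) % size == 0:
        chunks = chunks[:-1]
    return chunks


@st.composite
def exact_multiple_inputs(draw):
    """Draw (items, size) pairs where len(items) is an exact multiple of size."""
    size = draw(st.integers(min_value=8, max_value=16))
    n_chunks = draw(st.integers(min_value=1, max_value=4))
    length = size * n_chunks
    items = draw(st.lists(st.integers(), min_size=length, max_size=length))
    return items, size


# Invariant assumed to come from the (hypothetical) documentation:
# concatenating the chunks reproduces the original sequence.
@settings(max_examples=50, database=None, derandomize=True)
@given(exact_multiple_inputs())
def chunks_concatenate_to_input(args):
    items, size = args
    flat = [x for chunk in buggy_chunk(items, size) for x in chunk]
    assert flat == items
```

Calling `chunks_concatenate_to_input()` raises an `AssertionError` with a shrunk counterexample, because every drawn input lies in the trigger region. The same property run with plain `st.lists(st.integers())` and `st.integers(min_value=1)` would depend on chance to land on an exact multiple with a large enough chunk size.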

We evaluated eight contemporary large language models (LLMs) using two distinct prompting approaches: an open-ended baseline and an explicit Hypothesis scaffolding method, with three independent runs conducted for each configuration. The results show that bug recall under the PBT-guided prompt varies between 42.1% and 83.4% across the models, whereas the open-ended baseline yields recall rates between 31.4% and 76.7%. While Hypothesis scaffolding improves performance by more than 20 percentage points for mid-tier models, it offers more modest benefits for the most capable models. Notably, two models exhibited performance degradation under the structured prompt, suggesting that such scaffolding may sometimes hinder rather than assist specific model behaviors. The most difficult bugs revealed model-specific vulnerabilities, with different architectures failing on distinct problems, indicating persistent gaps that no single model currently addresses. We are releasing the benchmark, the associated harness, and the complete evaluation corpus to facilitate further research into documentation-grounded semantic reasoning.
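The abstract does not spell out how bug recall is computed. The sketch below assumes the simplest reading: recall for one run is the fraction of injected bugs that the agent's tests detect, and the reported figure is the mean over the independent runs. The function names and the toy bug identifiers are hypothetical, and the numbers are not results from the paper.

```python
def bug_recall(detected, total_bugs):
    """Fraction of injected bugs detected in a single run (assumed definition)."""
    if total_bugs <= 0:
        raise ValueError("total_bugs must be positive")
    return len(set(detected)) / total_bugs


def mean_recall(runs, total_bugs):
    """Average recall over independent runs of one model/prompt configuration."""
    return sum(bug_recall(run, total_bugs) for run in runs) / len(runs)


# Toy data: 4 injected bugs, three runs detecting different subsets.
toy_runs = [{"b1", "b2"}, {"b1", "b2", "b3"}, {"b1"}]
toy_mean = mean_recall(toy_runs, total_bugs=4)
```

Under this reading, the gain from scaffolding for a given model would be the difference between its mean recall under the two prompts, expressed in percentage points.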


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
