arXiv

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

June 2, 2026 · Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

Title: PBT-Bench: Evaluating AI Agents via Property-Based Testing

Abstract:

Current code benchmarks primarily assess an agent's ability to generate tests that reproduce known defects or to write patches that resolve described issues. Such tasks do not isolate the competency that property-based testing requires: extracting semantic invariants from documentation and designing input-generation strategies precise enough to expose violations through random search. To address this gap, we present PBT-Bench, a benchmark of 100 carefully curated property-based testing problems drawn from 40 real-world Python libraries.

Each problem contains one or more semantic bugs, 365 in total across the dataset (an average of 3.65 per problem). The bugs are engineered so that default random inputs rarely trigger them: the agent must read the library documentation, identify the relevant invariant, and write a Hypothesis @given strategy that concentrates probability mass on the specific trigger regions. Problems are grouped into three difficulty levels (L1–L3), ranging from single-constraint boundary errors to complex, stateful, cross-function protocol violations.
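To make the idea of a targeted strategy concrete, the following is a minimal sketch and not a problem from the benchmark. The function `buggy_is_leap`, its bug, and the checker `check_leap_rule` are hypothetical, invented for illustration; the standard library's `calendar.isleap` stands in for the documented invariant. The strategy mixes a broad range of years with multiples of 400, the only inputs that expose the bug.

```python
import calendar

from hypothesis import given, settings, strategies as st


def buggy_is_leap(year: int) -> bool:
    """Hypothetical library function (not from PBT-Bench).

    Assumed bug: century years divisible by 400 are wrongly rejected.
    """
    return year % 4 == 0 and year % 100 != 0


# Targeted strategy: besides arbitrary years, draw multiples of 400 directly,
# so the trigger region is sampled often instead of once in ~400 draws.
years = st.one_of(
    st.integers(min_value=1, max_value=9999),
    st.integers(min_value=1, max_value=24).map(lambda k: 400 * k),
)


@settings(max_examples=200, derandomize=True, database=None)
@given(years)
def check_leap_rule(year: int) -> None:
    # Invariant taken from the documented leap-year rule, with
    # calendar.isleap used as the reference oracle.
    assert buggy_is_leap(year) == calendar.isleap(year)
```

Calling `check_leap_rule()` runs the property and is expected to fail with an assertion error on a multiple of 400. A strategy that draws only uniformly from the full range would reach that region in roughly one draw out of 400, which is the kind of gap a targeted strategy is meant to close.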

We evaluated eight contemporary large language models (LLMs) using two distinct prompting approaches: an open-ended baseline and an explicit Hypothesis scaffolding method, with three independent runs conducted for each configuration. The results show that bug recall under the PBT-guided prompt varies between 42.1% and 83.4% across the models, whereas the open-ended baseline yields recall rates between 31.4% and 76.7%. While Hypothesis scaffolding improves performance by more than 20 percentage points for mid-tier models, it offers more modest benefits for the most capable models. Notably, two models exhibited performance degradation under the structured prompt, suggesting that such scaffolding may sometimes hinder rather than assist specific model behaviors. The most difficult bugs revealed model-specific vulnerabilities, with different architectures failing on distinct problems, indicating persistent gaps that no single model currently addresses. We are releasing the benchmark, the associated harness, and the complete evaluation corpus to facilitate further research into documentation-grounded semantic reasoning.
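As an illustration of the metric, bug recall can be read as the fraction of known bugs that an agent's tests detect. The sketch below assumes that recall is averaged over independent runs; the abstract does not specify the aggregation, so that choice, the function names, and the toy bug identifiers are all assumptions made for this example.

```python
def bug_recall(detected: set[str], all_bugs: set[str]) -> float:
    """Fraction of known bugs that were detected in one run."""
    if not all_bugs:
        raise ValueError("all_bugs must not be empty")
    return len(detected & all_bugs) / len(all_bugs)


def mean_recall(runs: list[set[str]], all_bugs: set[str]) -> float:
    """Average recall over independent runs (assumed aggregation)."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(bug_recall(run, all_bugs) for run in runs) / len(runs)


# Toy data, not benchmark results: four known bugs and three runs.
all_bugs = {"bug_a", "bug_b", "bug_c", "bug_d"}
runs = [
    {"bug_a", "bug_b"},
    {"bug_a"},
    {"bug_a", "bug_b", "bug_c"},
]
average = mean_recall(runs, all_bugs)
```

Other aggregations, such as counting a bug as found if any run detects it, would give different numbers from the same runs.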

Source: arXiv