arXiv

FVSpec: Real-World Property-Based Tests as Lean Challenges

June 2, 2026 · Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

Title: FVSpec: Transforming Real-World Property-Based Tests into Lean Challenges

Abstract

We introduce a new benchmark for assessing how well AI models and agents handle real-world formal software verification tasks. We built the dataset by extracting 11,039 property-based tests (PBTs) from active Python repositories. Of these, 2,772 (25%) were automatically translated into 9,415 Lean 4 specifications, which contain sorry placeholders marking incomplete proofs. That is an average of roughly 3.4 formalizations per translated PBT; when multiple attempts were generated, we kept several versions rather than selecting a single winner, preserving diversity across quality metrics.

Translating PBTs into Lean specifications is difficult. It requires modeling Python semantics accurately in Lean, inferring the logical property implicit in imperative test code, and working in a complex, dependently typed language that is not widely used. To address these challenges, we describe a three-agent LLM pipeline that transpiles PBTs into Lean. We also report coverage and quality evaluations, along with baseline proof-generation results for both automated and model-based techniques. The full codebase, including the scraper and agent implementations, and the complete dataset of PBTs and Lean specifications are released as open source. The benchmark aims to advance research on AI-assisted formal verification of real-world software, an area of growing importance as AI contributes an increasing share of the world's code.
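To make the task concrete, here is a minimal sketch of what a PBT and its Lean formalizations might look like. It is an illustration only: the example is not taken from the FVSpec dataset, the test and theorem names are hypothetical, and modeling Python lists of integers as `List Int` is an assumption made for this sketch rather than the paper's stated encoding. The Python test appears as a comment, followed by two alternative Lean 4 statements of the same property, each with its proof left as `sorry`.

```lean
-- Hypothetical Python PBT (Hypothesis-style), shown here only as a comment:
--
--   @given(st.lists(st.integers()))
--   def test_reverse_involutive(xs):
--       assert list(reversed(list(reversed(xs)))) == xs

-- Formalization 1 (assumption: Python `list[int]` is modeled as `List Int`).
-- The proof is deliberately left open; filling it in is the challenge.
theorem reverse_involutive (xs : List Int) :
    xs.reverse.reverse = xs := by
  sorry

-- Formalization 2: the same property with an explicit quantifier.
-- Keeping several statements per PBT mirrors the idea of retaining
-- multiple formalizations instead of picking a single one.
theorem reverse_involutive' :
    ∀ xs : List Int, List.reverse (List.reverse xs) = xs := by
  sorry
```

Both statements type-check in Lean 4 with a warning that the declaration uses `sorry`; a prover, human or automated, would then replace each `sorry` with an actual proof.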
