arXiv

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

June 4, 2026 · Hayden Moore, Asfahan Shah


Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their capabilities, they still struggle to produce formalizations that are both grounded and verifiable. Prior work on text-to-SQL has shown that LLMs are sensitive to paraphrased natural language (NL) inputs, even when the paraphrase largely preserves the original meaning. This study examines the same effect in autoformalization. Specifically, we assess how robustly LLMs generate formal proofs from semantically similar paraphrased NL statements, measuring both semantic accuracy and compilation validity. Using the MiniF2F benchmark and the Lean 4 version of ProofNet with two modern LLMs, we generate paraphrased NL statements and cross-evaluate them across both models. We find significant performance variation across paraphrased inputs, showing that minor changes in NL phrasing can substantially affect model outputs.
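The cross-evaluation described in the abstract can be illustrated with a minimal sketch. It assumes each model can act both as a paraphraser and as a formalizer, and that a separate checker reports whether a candidate Lean 4 source compiles. The names `cross_evaluate`, `Paraphraser`, `Formalizer`, and `Checker` are hypothetical and do not come from the paper, and the sketch covers only compilation validity, not semantic accuracy.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures for illustration; none of these names come from the paper.
Paraphraser = Callable[[str], str]  # NL statement -> paraphrased NL statement
Formalizer = Callable[[str], str]   # NL statement -> candidate Lean 4 source
Checker = Callable[[str], bool]     # Lean 4 source -> True if it compiles


def cross_evaluate(
    statements: List[str],
    paraphrasers: Dict[str, Paraphraser],
    formalizers: Dict[str, Formalizer],
    compiles: Checker,
) -> Dict[Tuple[str, str], float]:
    """Return the compile rate for every (paraphraser, formalizer) pair."""
    if not statements:
        raise ValueError("statements must be non-empty")
    rates: Dict[Tuple[str, str], float] = {}
    for p_name, paraphrase in paraphrasers.items():
        # Paraphrase once per paraphraser, then reuse for every formalizer.
        paraphrased = [paraphrase(s) for s in statements]
        for f_name, formalize in formalizers.items():
            passed = sum(compiles(formalize(s)) for s in paraphrased)
            rates[(p_name, f_name)] = passed / len(paraphrased)
    return rates
```

Comparing each pair's rate against the rate on the unparaphrased statements would give one rough measure of the sensitivity the abstract reports. In practice the checker would wrap a call to the Lean 4 toolchain, which is left abstract here.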
