arXiv

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

June 2, 2026 · Alessio Bruno

Title: AXIOM: A Neuro-Symbolic Execution Framework Prioritizing Trust for Verifiable Mathematical Logic

Abstract

This paper introduces AXIOM, a neuro-symbolic execution architecture that takes a "trust-first" approach to natural-language mathematical reasoning. Within this framework, the large language model (LLM) serves exclusively as a canonicalizer: it transforms informal problem descriptions into a constrained schema, which a deterministic Computer-Algebra-System (CAS) pipeline then processes. The pipeline derives and verifies solutions, and abstention is treated as a first-class output rather than a failure mode.
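The division of labor described above can be illustrated with a minimal Python sketch. It is not AXIOM's implementation: the `LinearEquation` schema, the canned `canonicalize` stub standing in for the LLM, and the `Result` type are all hypothetical, and exact rational arithmetic from the standard library stands in for a real CAS. The point is only the shape of the flow: the model emits schema fields, deterministic code derives and checks the answer, and any failure along the way yields an abstention.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass(frozen=True)
class LinearEquation:
    """Hypothetical constrained schema: a*x + b = c with rational coefficients."""
    a: Fraction
    b: Fraction
    c: Fraction


@dataclass(frozen=True)
class Result:
    status: str                      # "ANSWER" or "ABSTAIN"
    value: Optional[Fraction] = None
    reason: str = ""


def canonicalize(problem: str) -> Optional[dict]:
    """Stand-in for the LLM step (assumption): emits schema fields, never an answer."""
    canned = {
        "Solve 3x + 4 = 19 for x.": {"a": "3", "b": "4", "c": "19"},
        "Solve 0x + 1 = 2 for x.": {"a": "0", "b": "1", "c": "2"},
    }
    return canned.get(problem)


def solve(eq: LinearEquation) -> Optional[Fraction]:
    """Deterministic closed-form handler; returns None when no unique solution exists."""
    if eq.a == 0:
        return None
    return (eq.c - eq.b) / eq.a


def verify(eq: LinearEquation, x: Fraction) -> bool:
    """Independent check: substitute the candidate back into the equation."""
    return eq.a * x + eq.b == eq.c


def answer(problem: str) -> Result:
    fields = canonicalize(problem)
    if fields is None:
        return Result("ABSTAIN", reason="no canonical form")
    try:
        eq = LinearEquation(*(Fraction(fields[k]) for k in ("a", "b", "c")))
    except (KeyError, ValueError, ZeroDivisionError):
        return Result("ABSTAIN", reason="schema validation failed")
    x = solve(eq)
    if x is None or not verify(eq, x):
        return Result("ABSTAIN", reason="no verified solution")
    return Result("ANSWER", value=x)
```

In this sketch, `answer("Solve 3x + 4 = 19 for x.")` returns an `ANSWER` with value 5, while the degenerate equation and any unrecognized problem return `ABSTAIN` with a reason instead of a guess.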

The system’s routing mechanism relies on a strict 1:1:1 correspondence between problem-shape regular expressions, schema-specific prompts, and closed-form CAS handlers. To date, the system has deployed over 3,100 distinct routes, maintaining zero LOST_CORRECT regressions across more than 250 consecutive software releases.
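One way to picture the 1:1:1 correspondence is a registry in which each route bundles exactly one problem-shape regex, one schema-specific prompt, and one closed-form handler. The sketch below is an assumption-laden illustration, not the system's registry: the route names, patterns, prompts, and the `route` function are invented, and the handlers read operands straight from the regex match rather than from an LLM-produced schema, so the prompt is stored but never invoked.

```python
import re
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    """One route = one shape regex + one prompt + one handler (hypothetical layout)."""
    name: str
    pattern: re.Pattern
    prompt: str
    handler: Callable[[re.Match], Optional[Fraction]]


def _add(m: re.Match) -> Optional[Fraction]:
    return Fraction(m["a"]) + Fraction(m["b"])


def _divide(m: re.Match) -> Optional[Fraction]:
    divisor = Fraction(m["b"])
    return None if divisor == 0 else Fraction(m["a"]) / divisor


REGISTRY: List[Route] = []


def register(route: Route) -> None:
    """Adding a route never edits an existing one; duplicate names are rejected."""
    if any(r.name == route.name for r in REGISTRY):
        raise ValueError(f"duplicate route: {route.name}")
    REGISTRY.append(route)


register(Route(
    "add_two_ints",
    re.compile(r"What is (?P<a>-?\d+) plus (?P<b>-?\d+)\?"),
    "Extract the two integers a and b.",
    _add,
))
register(Route(
    "divide_two_ints",
    re.compile(r"What is (?P<a>-?\d+) divided by (?P<b>-?\d+)\?"),
    "Extract the dividend a and the divisor b.",
    _divide,
))


def route(problem: str) -> Tuple[str, Optional[Fraction]]:
    """Dispatch to the single matching route; abstain on zero or multiple matches."""
    matches = []
    for r in REGISTRY:
        m = r.pattern.fullmatch(problem)
        if m is not None:
            matches.append((r, m))
    if len(matches) != 1:
        return ("ABSTAIN", None)
    r, m = matches[0]
    value = r.handler(m)
    return ("ABSTAIN", None) if value is None else (r.name, value)
```

Abstaining when more than one pattern matches is a design choice of this sketch: it keeps dispatch unambiguous, so a newly registered route cannot silently change the answer produced by an existing one.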

Empirical evaluation across four MATH categories shows a cumulative correctness rate of 94.36% (2,592 correct out of 2,747 cases) while achieving 100.00% trust, with zero confident-wrong answers across the entire 2,747-record benchmark. All four domains exceeded the per-domain floor of 70/90/70, and per-domain trust was 100.0% in each. On the 20,000-record lm-eval arithmetic benchmark, rule-only handlers covered 88% of records with a median latency of 1 ms. A public deployment of the architecture has already processed approximately 30,000 production queries.
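The headline figures can be reproduced from the reported counts. The short Python check below makes two assumptions that the abstract does not state explicitly: that trust is the share of records without a confident-wrong answer, and that every record that is neither correct nor confident-wrong is an abstention. The function name and variable names are illustrative only.

```python
def summarize(total: int, correct: int, confident_wrong: int) -> dict:
    """Compute correctness and trust from raw counts (trust definition is an assumption)."""
    abstained = total - correct - confident_wrong
    return {
        "abstained": abstained,
        "correctness": correct / total,
        "trust": 1 - confident_wrong / total,
    }


# Counts taken from the reported benchmark results.
stats = summarize(total=2747, correct=2592, confident_wrong=0)
print(f"correctness = {stats['correctness']:.2%}")  # 94.36%
print(f"trust       = {stats['trust']:.2%}")        # 100.00%
print(f"abstained   = {stats['abstained']}")        # 155
```

Under these assumptions, the 155 records that were not answered correctly are all abstentions, which is what lets correctness sit below 100% while trust stays at 100%.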

Rather than focusing solely on static accuracy metrics, we highlight the forward dynamic this architecture establishes: every abstention logged in production becomes a candidate for a correct answer after a single ship cycle, because new tasks can be composed without causing regressions in the existing registry. The operational discipline underpinning this reliability (math-template bucketing, LOST_CORRECT scans as regression oracles, parseable-first onboarding protocols, and abstention as a first-class output) forms a transferable framework for building trustworthy neuro-symbolic systems in fields beyond mathematics.
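A LOST_CORRECT scan used as a regression oracle might look like the following Python sketch. It assumes, without confirmation from the abstract, that a LOST_CORRECT record is one that was correct under the previous release and is no longer correct under the candidate release. The status strings, record ids, and function names are hypothetical.

```python
from typing import Dict, List


def lost_correct(baseline: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Record ids that were CORRECT in the baseline but are not CORRECT in the candidate."""
    return sorted(
        rid for rid, status in baseline.items()
        if status == "CORRECT" and candidate.get(rid) != "CORRECT"
    )


def newly_correct(baseline: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Record ids that were not CORRECT before (e.g. abstentions) and are CORRECT now."""
    return sorted(
        rid for rid, status in candidate.items()
        if status == "CORRECT" and baseline.get(rid) != "CORRECT"
    )


def release_allowed(baseline: Dict[str, str], candidate: Dict[str, str]) -> bool:
    """Gate a release: ship only when the scan finds no lost records."""
    return not lost_correct(baseline, candidate)


# Hypothetical per-record statuses for two builds.
previous = {"q1": "CORRECT", "q2": "ABSTAIN", "q3": "CORRECT"}
good_build = {"q1": "CORRECT", "q2": "CORRECT", "q3": "CORRECT"}
bad_build = {"q1": "CORRECT", "q2": "CORRECT", "q3": "ABSTAIN"}
```

Here `good_build` converts the abstention on `q2` into a correct answer and passes the gate, while `bad_build` gains `q2` but loses `q3`, so `release_allowed` rejects it even though its total number of correct records is unchanged.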
