arXiv

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

June 3, 2026 · Malia Barker, Bishal Lakha, Edoardo Serra, Francesco Gullo

Title: Evaluating the Generalization of LLM Arithmetic Reasoning via Automated Numeric-Remapping Attacks

Abstract:

Large language models perform strongly on arithmetic reasoning benchmarks, and a common way to work around their computational limitations is to offload calculations to code execution. However, in many practical scenarios models must reason directly in natural language, and a truly reliable system should be able to solve simple arithmetic word problems without external tools. Previous research has shown that LLMs are highly sensitive to changes in the numbers: a model may solve an original problem yet fail on structurally similar variants that require the same logical procedure but involve different numbers.

This study investigates whether this fragility persists under a stricter setting: small, schema-preserving numerical adjustments that keep the original reasoning program intact and avoid large-number stress tests. To this end, we propose an automatic algorithm that generates numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation techniques, which require manually written schemas or constraints, our method constructs a problem-specific symbolic representation, generates constrained numeric remappings, recomputes the correct answers, and realizes the transformed questions through deterministic edits that follow LLM-generated edit plans. Stage-wise validation and a high-confidence audit filter for reliable attacks, so the pipeline scales with minimal human oversight.
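As an illustration only, the remapping and recomputation steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `SymbolicProblem` class, the `remap` function, the `max_delta` bound, and the sample problem are all hypothetical, and a plain format-string template stands in for the LLM-generated edit plan.

```python
import random
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SymbolicProblem:
    """Hypothetical symbolic form of one word problem (not from the paper)."""
    template: str                     # question text with {name} placeholders
    values: Dict[str, int]            # numbers used in the original question
    program: Callable[..., Fraction]  # reasoning program over the named quantities


def remap(
    problem: SymbolicProblem,
    rng: random.Random,
    max_delta: int = 3,
    max_tries: int = 200,
) -> Optional[Tuple[Dict[str, int], int, str]]:
    """Sample a small remapping that keeps the same program valid.

    Returns (new values, recomputed answer, rewritten question), or None
    if no candidate passes the constraints within max_tries attempts.
    """
    for _ in range(max_tries):
        # Small perturbation of every quantity (assumed constraint).
        candidate = {
            name: value + rng.randint(-max_delta, max_delta)
            for name, value in problem.values.items()
        }
        # Reject the identity mapping and non-positive quantities.
        if candidate == problem.values or min(candidate.values()) <= 0:
            continue
        try:
            answer = Fraction(problem.program(**candidate))
        except ZeroDivisionError:
            continue
        # Keep only positive whole-number answers.
        if answer.denominator != 1 or answer <= 0:
            continue
        # Deterministic edit: substitute the new numbers into the fixed template.
        return candidate, int(answer), problem.template.format(**candidate)
    return None


example = SymbolicProblem(
    template=(
        "Sam has {apples} apples and gives away {given}. He splits the rest "
        "equally among {friends} friends. How many apples does each friend get?"
    ),
    values={"apples": 14, "given": 2, "friends": 4},
    program=lambda apples, given, friends: Fraction(apples - given, friends),
)

result = remap(example, random.Random(0))
if result is not None:
    new_values, new_answer, new_question = result
    print(new_question, "->", new_answer)
```

Recomputing the answer from the program, rather than asking a model for it, is what keeps the gold label trustworthy in this sketch; the validation and audit stages mentioned above are not modeled here.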

We evaluated DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on the GSM8K, MAWPS, and MultiArith datasets. On GSM8K, completed runs showed conditional accuracy drops of 12.16 to 25.82 percentage points. By contrast, MAWPS and MultiArith were far more stable, with most accuracies under attack at or above 98%. These findings indicate that robustness to numeric remapping depends heavily on dataset structure: shorter, more regular datasets are resilient, while GSM8K remains vulnerable even when reasoning programs are preserved and answers are recomputed.
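The abstract does not define "conditional accuracy". One plausible reading, assumed here purely for illustration, is accuracy on attacked variants restricted to problems the model solved in their original form, so that the drop is measured from 100%. The function name and the toy inputs below are hypothetical.

```python
from typing import Sequence


def conditional_accuracy_drop(
    original_correct: Sequence[bool],
    attacked_correct: Sequence[bool],
) -> float:
    """Drop in percentage points, conditioned on the original being solved.

    Assumed definition (not confirmed by the source): among problems answered
    correctly in original form, the share answered incorrectly after remapping.
    """
    if len(original_correct) != len(attacked_correct):
        raise ValueError("both sequences must cover the same problems")
    # Keep attacked outcomes only for problems solved in their original form.
    solved = [att for orig, att in zip(original_correct, attacked_correct) if orig]
    if not solved:
        raise ValueError("no originally solved problems to condition on")
    attacked_accuracy = sum(solved) / len(solved)
    return 100.0 * (1.0 - attacked_accuracy)


# Toy data: four of five originals solved, one of those four fails under attack.
drop = conditional_accuracy_drop(
    [True, True, True, True, False],
    [True, False, True, True, True],
)
print(drop)  # 25.0
```

Conditioning on originally solved problems separates sensitivity to the new numbers from problems the model could not solve in the first place.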
