arXiv

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Title: Evaluating the Generalization of LLM Arithmetic Reasoning via Automated Numeric-Remapping Attacks

Abstract:

Large language models perform strongly on arithmetic reasoning benchmarks, and a common remedy for their brittle computation is to offload calculations to code execution. In many practical settings, however, models must reason directly in natural language, and a truly reliable system should solve simple arithmetic word problems without external tools. Prior work shows that LLMs are highly sensitive to numerical changes: a model may solve an original problem yet fail on structurally similar variants that require the same reasoning procedure with different numbers.

This study investigates whether this fragility persists under a stricter setting: small, schema-preserving numeric changes that keep the original reasoning program intact and avoid large-number stress tests. To this end, we propose an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods, which require manually written schemas or constraints, our method builds problem-specific symbolic representations, generates constrained numeric remappings, recomputes the correct answers, and realizes the transformed questions through deterministic edits driven by LLM-generated edit plans. Stage-wise validation and a high-confidence audit filter for reliable attacks, so the pipeline scales with minimal human oversight.
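
As a rough illustration of the remapping and answer-recomputation steps, the following Python sketch perturbs the numbers of one word problem by small amounts and recomputes the answer from a fixed reasoning program. It is a minimal sketch under stated assumptions, not the paper's pipeline: the `SymbolicProblem` class, the `remap` function, the `max_delta` bound, and the positivity constraints are all hypothetical, and the hand-written template stands in for the symbolic representation and LLM-generated edit plan that the method produces automatically.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SymbolicProblem:
    """Hypothetical symbolic form of one word problem (not the paper's format)."""
    template: str                 # question text with named numeric slots
    values: Dict[str, int]        # numbers used in the original problem
    program: Callable[..., int]   # reasoning program over the named slots


def remap(
    problem: SymbolicProblem,
    rng: random.Random,
    max_delta: int = 3,
    max_tries: int = 100,
) -> Optional[Tuple[str, int, Dict[str, int]]]:
    """Try to build one small numeric remapping.

    Returns (question, recomputed_answer, new_values), or None if no
    candidate satisfied the constraints within max_tries attempts.
    """
    for _ in range(max_tries):
        # Shift every number by a small amount so magnitudes stay comparable.
        candidate = {
            name: value + rng.randint(-max_delta, max_delta)
            for name, value in problem.values.items()
        }
        if candidate == problem.values:
            continue  # nothing changed, so this is not a perturbation
        if any(value <= 0 for value in candidate.values()):
            continue  # assumed constraint: quantities stay positive
        # Recompute the answer with the same program, never reuse the old one.
        answer = problem.program(**candidate)
        if answer <= 0:
            continue  # assumed constraint: the answer stays positive
        # Deterministic edit: substitute the new numbers into the fixed text.
        return problem.template.format(**candidate), answer, candidate
    return None


# Illustrative problem; the text and numbers are invented for this sketch.
problem = SymbolicProblem(
    template=(
        "Tom has {a} apples and buys {b} bags with {c} apples each. "
        "How many apples does he have now?"
    ),
    values={"a": 5, "b": 3, "c": 4},
    program=lambda a, b, c: a + b * c,
)

attack = remap(problem, random.Random(0))
```

Because the question text changes only through slot substitution, the reasoning program is identical before and after the remapping, which is the property the schema-preserving setting relies on.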

We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on the GSM8K, MAWPS, and MultiArith datasets. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith, by contrast, are far more stable, with most attacked accuracies at or above 98%. These results indicate that robustness to numeric remapping depends heavily on dataset structure: shorter, more regular datasets are resilient, while GSM8K remains vulnerable even when reasoning programs are preserved and answers are recomputed.
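
The abstract does not define "conditional accuracy". One plausible reading, assumed here and not confirmed by the source, is accuracy on attacked variants restricted to problems whose original version the model solved correctly, so that the drop is measured from 100%. The sketch below computes the metric under that assumption; the function name and the record format are hypothetical.

```python
from typing import Iterable, Tuple


def conditional_accuracy_drop(records: Iterable[Tuple[bool, bool]]) -> float:
    """Drop in percentage points under the assumed definition.

    Each record is (original_correct, attacked_correct). Only problems the
    model solved in their original form are counted, so baseline accuracy
    on that subset is 100% by construction.
    """
    attacked_outcomes = [attacked for original, attacked in records if original]
    if not attacked_outcomes:
        raise ValueError("no originally solved problems to condition on")
    attacked_accuracy = sum(attacked_outcomes) / len(attacked_outcomes)
    return 100.0 * (1.0 - attacked_accuracy)


# Invented records for illustration only; these are not results from the paper.
example_records = [
    (True, True),
    (True, False),
    (True, True),
    (True, True),
    (False, False),  # excluded: the original problem was not solved
]
drop = conditional_accuracy_drop(example_records)
```

Conditioning on originally solved problems separates fragility to the numeric change from problems the model could not solve in the first place.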


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
