arXiv

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

June 3, 2026 · Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

Abstract:

Numerical reasoning serves as the foundation for mathematical competence in large language models (LLMs), a critical factor across various applications. However, current evaluation frameworks rarely assess LLMs by combining numerical processing with mathematical reasoning, which limits our ability to interpret failures in mathematical tasks effectively. To bridge this gap, we present PyraMathBench, a multi-layered benchmark comprising 32,505 questions sourced from 7,404 mathematical word problems. This dataset covers four primary cognitive dimensions, 14 subcategories, and two distinct modalities. Our experimental results indicate that LLM performance is significantly hindered by poor numerical computation skills and difficulties in managing abstract numerical inquiries. In response, we introduce the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO). These approaches strengthen the synergy between numerical and mathematical tasks within LLMs by facilitating efficient tool usage, including fuzzy matching and the rejection of low-quality tool calls. Comparative analyses demonstrate that Qwen-2.5, when trained with SOLVE and IRPO, achieves a score increase of 5.0.
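The abstract does not describe how SOLVE implements fuzzy matching or the rejection of low-quality tool calls. As a rough illustration of the general idea only, the following sketch assumes a hypothetical registry of calculator tools, resolves a possibly misspelled tool name with Python's standard `difflib`, and rejects calls whose name or arguments cannot be used. The tool names, the `cutoff` threshold, and the functions `resolve_tool` and `call_tool` are all assumptions made for this example and are not taken from the paper.

```python
import difflib
import math

# Hypothetical tool registry; these names are illustrative, not from the paper.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "sqrt": lambda x: math.sqrt(x),
}


def resolve_tool(name, cutoff=0.8):
    """Return the registered tool name closest to `name`, or None if no
    candidate reaches the similarity `cutoff` (an assumed threshold)."""
    matches = difflib.get_close_matches(name.lower(), list(TOOLS), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def call_tool(name, args):
    """Run a tool call; return None when the call is rejected."""
    tool = resolve_tool(name)
    if tool is None:
        # Reject: no registered tool is similar enough to the requested name.
        return None
    try:
        numbers = [float(a) for a in args]
        return TOOLS[tool](*numbers)
    except (TypeError, ValueError):
        # Reject: wrong argument count, non-numeric input, or a domain error.
        return None


if __name__ == "__main__":
    print(call_tool("multipy", ["3", "4"]))  # misspelled name, fuzzy-matched to "multiply"
    print(call_tool("divide", ["1", "2"]))   # no close match, call rejected
    print(call_tool("add", ["1"]))           # wrong argument count, call rejected
```

In this sketch a rejected call simply returns `None`, so the caller can fall back to the model's own answer or request a corrected call. A trained system would presumably learn when to call and when to reject rather than rely on a fixed threshold.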

Source: arXiv

Generated at: 2026-06-03 00:00:00 UTC
