arXiv

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Title: PyraMathBench: Enhancing and Assessing Mathematical Proficiency in Large Language Models

Abstract:

Numerical reasoning serves as the foundation for mathematical competence in large language models (LLMs), a critical factor across various applications. However, current evaluation frameworks rarely assess LLMs by combining numerical processing with mathematical reasoning, which limits our ability to interpret failures in mathematical tasks effectively. To bridge this gap, we present PyraMathBench, a multi-layered benchmark comprising 32,505 questions sourced from 7,404 mathematical word problems. This dataset covers four primary cognitive dimensions, 14 subcategories, and two distinct modalities. Our experimental results indicate that LLM performance is significantly hindered by poor numerical computation skills and difficulties in managing abstract numerical inquiries. In response, we introduce the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO). These approaches strengthen the synergy between numerical and mathematical tasks within LLMs by facilitating efficient tool usage, including fuzzy matching and the rejection of low-quality tool calls. Comparative analyses demonstrate that Qwen-2.5, when trained with SOLVE and IRPO, achieves a score increase of 5.0.


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
