PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
Title: PyraMathBench: Enhancing and Assessing Mathematical Proficiency in Large Language Models
Abstract:
Numerical reasoning serves as the foundation for mathematical competence in large language models (LLMs), a critical factor across various applications. However, current evaluation frameworks rarely assess LLMs by combining numerical processing with mathematical reasoning, which limits our ability to interpret failures in mathematical tasks effectively. To bridge this gap, we present PyraMathBench, a multi-layered benchmark comprising 32,505 questions sourced from 7,404 mathematical word problems. This dataset covers four primary cognitive dimensions, 14 subcategories, and two distinct modalities. Our experimental results indicate that LLM performance is significantly hindered by poor numerical computation skills and difficulties in managing abstract numerical inquiries. In response, we introduce the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO). These approaches strengthen the synergy between numerical and mathematical tasks within LLMs by facilitating efficient tool usage, including fuzzy matching and the rejection of low-quality tool calls. Comparative analyses demonstrate that Qwen-2.5, when trained with SOLVE and IRPO, achieves a score increase of 5.0.
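To make the idea of tool usage with fuzzy matching and rejection of low-quality tool calls concrete, the following is a minimal sketch and not the SOLVE or IRPO implementation, which the abstract does not specify. The tool registry `KNOWN_TOOLS`, the helpers `resolve_tool` and `run_tool_call`, and the 0.8 similarity threshold are all hypothetical choices made for illustration. It assumes a model emits a tool name and a list of arguments, maps a slightly misspelled name to the closest known tool, and rejects calls whose name or arguments cannot be trusted.

```python
from difflib import SequenceMatcher

# Hypothetical tool registry; these names are illustrative, not from the paper.
KNOWN_TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "power": lambda a, b: a ** b,
}


def resolve_tool(name, threshold=0.8):
    """Return the closest known tool name, or None if nothing is similar enough."""
    best_name, best_score = None, 0.0
    for candidate in KNOWN_TOOLS:
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, name.lower(), candidate).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    # Assumed threshold: below it, the call is treated as low quality.
    return best_name if best_score >= threshold else None


def is_number(value):
    """True for int or float arguments, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def run_tool_call(name, args):
    """Run a tool call, returning None when the call is rejected."""
    tool = resolve_tool(name)
    if tool is None:
        return None  # rejected: tool name matches no known tool closely enough
    if len(args) != 2 or not all(is_number(a) for a in args):
        return None  # rejected: malformed arguments
    return KNOWN_TOOLS[tool](*args)
```

In this sketch, a misspelled call such as `run_tool_call("multiplly", [3, 4])` resolves to `multiply`, while an unknown tool name or a call with the wrong number of arguments returns `None` instead of being executed.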
Source: arXiv



