arXiv

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

June 3, 2026 · Hu Xu, Zhaolong Xing, Congcong Liu, Jiaxing Wang, Zhida Jiang, Junshi Huang, Zhen Chen, Jianfeng Xu

Title: Navigating Calibration Data Trade-offs Across Capability Dimensions: The Strategic Value of Multi-Source Mixing in High-Sparsity LLM Pruning

Abstract

Recent studies have suggested that the choice of unlabelled calibration data has a negligible effect on the averaged accuracy of large language models (LLMs) after post-training pruning to high sparsity levels. This conclusion warrants scrutiny when performance is assessed not as a single aggregate metric but across distinct capability dimensions. We decompose post-pruning performance into four dimensions (General knowledge, Commonsense reasoning, Code generation, and Mathematical ability) and evaluate 15 calibration sources, computing Spearman correlations between OIT information metrics and the retention rate in each dimension. The correlations take opposite signs across dimensions, which reveals a significant trade-off.

Our analysis reveals that calibration perplexity correlates positively with General capability retention ($\rho = +0.71$) but negatively with both Math ($\rho = -0.53$) and Code ($\rho = -0.59$) retention ($p < 0.05$). Because a source whose perplexity favors General retention tends to hurt Math and Code retention, and vice versa, no single calibration source can preserve all model capabilities simultaneously. To address this limitation, we introduce multi-source calibration mixing and propose IGSP, an information-guided self-calibration protocol. IGSP automates the construction of multi-source datasets without requiring corpora aligned with specific capabilities, by minimizing 4-gram aggregation while balancing perplexity across dimensions.
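To make the sign analysis concrete, the following is a minimal sketch of how a Spearman rank correlation between calibration perplexity and per-dimension retention could be computed. The helper names (`average_ranks`, `spearman_rho`) and all numbers in the toy data are hypothetical and chosen only for illustration. They are not the paper's measurements, and the sketch does not compute the reported $p$-values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one entry per calibration source (NOT from the paper).
perplexity = [5.2, 8.1, 12.4, 20.3, 35.0]
general_retention = [0.41, 0.45, 0.50, 0.52, 0.60]
math_retention = [0.55, 0.50, 0.42, 0.40, 0.31]

rho_general = spearman_rho(perplexity, general_retention)
rho_math = spearman_rho(perplexity, math_retention)
print(f"General: {rho_general:+.2f}, Math: {rho_math:+.2f}")
```

In this toy setup the two coefficients have opposite signs, which is the pattern the abstract describes: ranking sources by perplexity orders them one way for General retention and the opposite way for Math retention.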

In experiments on LLaMA-3.1-8B pruned to 60% sparsity with SparseGPT, a uniform multi-source mixture achieved a total retention rate of 58.8%. This outperforms the strongest single source, MetaMath (50.0%), by 8.8 percentage points and the C4 default by 18.8 percentage points. IGSP, in turn, improves upon Self-Cal by 2.4 points and SGS by 4.8 points.
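As a rough illustration of what a uniform multi-source mixture might look like in practice, the sketch below draws an equal number of calibration samples from each source. It is not the authors' implementation and it is not IGSP. The function name `uniform_mixture`, the source pools, and the sample counts are all assumptions made for the example.

```python
import random


def uniform_mixture(sources, total, seed=0):
    """Draw `total` samples split as evenly as possible across sources.

    `sources` maps a source name to a list of candidate samples.
    Returns a shuffled list of (source_name, sample) pairs.
    """
    rng = random.Random(seed)
    names = sorted(sources)  # fixed order so the result is reproducible
    base, extra = divmod(total, len(names))
    mixture = []
    for idx, name in enumerate(names):
        # The first `extra` sources absorb the remainder, one sample each.
        quota = base + (1 if idx < extra else 0)
        if quota > len(sources[name]):
            raise ValueError(f"source {name!r} has fewer than {quota} samples")
        mixture.extend((name, s) for s in rng.sample(sources[name], quota))
    rng.shuffle(mixture)
    return mixture


# Hypothetical pools: placeholder strings stand in for real text segments.
pools = {
    "c4": [f"c4-{i}" for i in range(10)],
    "metamath": [f"metamath-{i}" for i in range(10)],
    "code_corpus": [f"code-{i}" for i in range(10)],  # hypothetical source
}
calibration_set = uniform_mixture(pools, total=8, seed=0)
counts = {name: sum(1 for n, _ in calibration_set if n == name) for name in pools}
print(counts)
```

Sampling without replacement within each source avoids duplicated calibration segments, and the fixed seed keeps the mixture reproducible across pruning runs.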
