arXiv

SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

June 4, 2026 · Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun

Abstract

Mixed-precision quantization is a viable way to compress large language models when memory is severely constrained. However, existing mixed-precision methods generally suffer from one of two drawbacks: they either rely on computationally expensive discrete optimization to allocate precision, or they introduce hardware inefficiencies caused by irregular memory layouts. To address these issues, we introduce SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models.

SFMP is built on four components. First, fractional bit-widths let the bit-width of a weight matrix take non-integer values, which turns discrete precision allocation into a continuous optimization problem. Second, block-wise mixed precision varies precision at fine granularity within a weight matrix while remaining hardware-friendly. Third, row-column weight reordering clusters significant weights through row and column permutations; it adds little inference overhead because only a light reordering of activations is required. Fourth, a unified GEMM kernel supports mixed-precision general matrix multiplication (GEMM) at an arbitrary average bit-width.
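The first three components can be illustrated with a small sketch. This is not the authors' implementation: the two candidate precisions (2 and 4 bits), the per-block importance scores, and every function name below are hypothetical and chosen only for illustration. Under those assumptions, the sketch shows how mixing precisions across equal-sized blocks gives a fractional average bit-width, and why permuting the rows and columns of a weight matrix leaves the matrix-vector product unchanged once the activations are reordered to match. The unified GEMM kernel is not modeled here.

```python
from fractions import Fraction


def average_bits(block_bits):
    """Average bit-width of a matrix split into equal-sized blocks."""
    return Fraction(sum(block_bits), len(block_bits))


def assign_block_bits(block_scores, target_avg, low=2, high=4):
    """Give `high` bits to the highest-scoring blocks and `low` to the rest,
    without exceeding `target_avg` bits on average (hypothetical rule)."""
    n = len(block_scores)
    # Number of high-precision blocks implied by the target average.
    n_high = int((Fraction(target_avg) - low) * n / (high - low))
    n_high = max(0, min(n, n_high))
    ranked = sorted(range(n), key=lambda i: block_scores[i], reverse=True)
    bits = [low] * n
    for i in ranked[:n_high]:
        bits[i] = high
    return bits


def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def permute_matrix(W, row_perm, col_perm):
    """Reordered matrix with entry [i][j] = W[row_perm[i]][col_perm[j]]."""
    return [[W[r][c] for c in col_perm] for r in row_perm]


def reordered_matvec(W_perm, x, row_perm, col_perm):
    """Compute W @ x using only the reordered matrix."""
    x_perm = [x[c] for c in col_perm]   # reorder activations to match columns
    y_perm = matvec(W_perm, x_perm)     # y_perm[i] equals y[row_perm[i]]
    y = [0] * len(y_perm)
    for i, r in enumerate(row_perm):    # undo the row permutation
        y[r] = y_perm[i]
    return y


# Eight blocks with made-up importance scores and a 2.5-bit budget.
scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.3, 0.7, 0.05]
bits = assign_block_bits(scores, target_avg=2.5)
avg = average_bits(bits)

# Reordering does not change the result of the product.
W = [[1, 2, 3], [4, 5, 6]]
x = [1, 0, -1]
row_perm, col_perm = [1, 0], [2, 0, 1]
W_perm = permute_matrix(W, row_perm, col_perm)
y_reordered = reordered_matvec(W_perm, x, row_perm, col_perm)
```

In this toy setting, two of the eight blocks receive 4 bits and the rest receive 2, so the average is exactly 2.5 bits even though every block uses an integer precision. The reordered product matches `matvec(W, x)` because the column permutation is applied identically to the weights and the activations, and the row permutation is inverted on the output.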

Extensive experiments show that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while substantially reducing quantization cost and improving inference efficiency. The code is available at https://github.com/Nkniexin/SFMP.
