SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models
Title: SFMP: Search-Free, Hardware-Efficient Mixed-Precision Quantization for Large Language Models with Fine-Grained Control
Abstract
Mixed-precision quantization offers a viable way to compress large language models when memory is severely constrained. However, current mixed-precision techniques generally have one of two drawbacks: they either depend on computationally intensive discrete optimization to assign precision levels, or they introduce hardware inefficiencies through irregular memory layouts. To address these challenges, we introduce SFMP, a search-free, hardware-friendly mixed-precision quantization framework for large language models.
SFMP is built on four components. First, Fractional bit-width allows the bit-width of a weight matrix to take fractional values rather than being restricted to integers, which turns the discrete precision allocation task into a continuous optimization problem. Second, Block-wise mixed-precision permits fine-grained precision variation within a weight matrix while remaining compatible with hardware. Third, Row-column weight reordering clusters salient weights through row and column permutations, adding minimal inference overhead because only a slight reordering of activations is required. Finally, a Unified GEMM kernel handles mixed-precision general matrix multiplication at an arbitrary average bit-width.
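The fractional bit-width, block-wise precision, and reordering ideas can be illustrated with a small sketch. This is not the SFMP implementation: the abstract does not specify the allocation rule, so the saliency scores, the two-level split between floor(b) and floor(b)+1 bits, and the function names `allocate_block_bits`, `matvec`, and `reorder_rows` are all assumptions made for illustration. The sketch shows how a fractional average such as 2.5 bits could be realized by mixing integer-precision blocks, and why permuting the rows of a weight matrix is harmless when the activations are permuted the same way.

```python
def allocate_block_bits(saliency, avg_bits):
    """Assign each block floor(avg_bits) or floor(avg_bits) + 1 bits.

    saliency: hypothetical per-block importance scores (an assumption;
              the abstract does not say how importance is measured).
    avg_bits: target average bit-width, may be fractional (e.g. 2.5).
    """
    low = int(avg_bits)                 # integer part of the target
    high = low + 1
    n = len(saliency)
    n_high = round((avg_bits - low) * n)  # blocks that get the extra bit
    # Most salient blocks receive the higher precision.
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    bits = [low] * n
    for i in order[:n_high]:
        bits[i] = high
    return bits


def matvec(x, w):
    """Compute y = x @ w for a vector x (length K) and a K x N matrix w."""
    n_cols = len(w[0])
    return [sum(x[k] * w[k][j] for k in range(len(x))) for j in range(n_cols)]


def reorder_rows(x, w, perm):
    """Apply the same permutation to the activations and to the rows of w."""
    return [x[p] for p in perm], [w[p] for p in perm]


# Four blocks at an average of 2.5 bits: the two most salient get 3 bits.
bits = allocate_block_bits([0.1, 0.9, 0.4, 0.7], avg_bits=2.5)
print(bits, sum(bits) / len(bits))

# Reordering rows of w together with x leaves the product unchanged.
x = [1, 2, 3]
w = [[1, 0], [0, 1], [2, 2]]
x_p, w_p = reorder_rows(x, w, perm=[2, 0, 1])
print(matvec(x, w), matvec(x_p, w_p))
```

In this toy setting the permutation costs only a gather on the activation vector, which is consistent with the abstract's claim that reordering adds little inference overhead. A column permutation of the weight matrix would instead permute the output entries.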
Extensive experiments show that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while substantially lowering quantization cost and improving inference efficiency. The code is publicly available at https://github.com/Nkniexin/SFMP.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






