arXiv

SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models


Abstract

Mixed-precision quantization offers a viable pathway for compressing large language models when memory resources are severely constrained. Nevertheless, current mixed-precision techniques generally face one of two primary drawbacks: they either depend on computationally intensive discrete optimization to assign precision levels or create hardware inefficiencies stemming from irregular memory structures. To address these challenges, we introduce SFMP, a novel mixed-precision quantization framework tailored for large language models that is both hardware-compatible and free from search-based processes.

SFMP is built on four components. First, fractional bit-widths allow the bit-width of a weight matrix to take non-integer values, which turns the discrete precision allocation task into a continuous optimization problem. Second, block-wise mixed precision permits fine-grained precision variation within a weight matrix while remaining hardware-friendly. Third, row-column weight reordering clusters salient weights through row and column permutations; at inference time this adds only a small overhead, because the activations need a matching reordering. Fourth, a unified GEMM kernel supports mixed-precision general matrix multiplication (GEMM) at arbitrary average bit-widths.
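The sketch below illustrates, in plain Python, how a fractional average bit-width can arise from block-wise allocation and why reordering rows and columns does not change the layer's output. It is not the authors' implementation. The function names, the 2-bit and 3-bit levels, the equal-sized blocks, and the per-block saliency scores are all assumptions made for illustration; the abstract does not specify them.

```python
def block_bit_allocation(saliency, avg_bits, low=2, high=3):
    """Give `high` bits to the most salient blocks and `low` bits to the rest.

    Assumes equal-sized blocks, so the mean of the returned list is the
    average bit-width of the matrix (hypothetical helper, not from the paper).
    """
    n = len(saliency)
    frac_high = (avg_bits - low) / (high - low)   # share of high-precision blocks
    n_high = int(frac_high * n + 1e-9)            # round down, tolerate float error
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    bits = [low] * n
    for i in order[:n_high]:
        bits[i] = high
    return bits


def matvec(W, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def reorder(W, row_perm, col_perm):
    """Return W with rows and columns permuted: W2[i][j] = W[row_perm[i]][col_perm[j]]."""
    return [[W[r][c] for c in col_perm] for r in row_perm]


# Four blocks with made-up saliency scores and a 2.5-bit target.
bits = block_bit_allocation([0.9, 0.1, 0.5, 0.3], avg_bits=2.5)
average = sum(bits) / len(bits)

# Reordering check: permute the weights, permute the activations to match,
# and the output is the original output with its rows permuted.
W = [[1, 2], [3, 4]]
x = [1, 1]
row_perm, col_perm = [1, 0], [1, 0]
y = matvec(W, x)
x_perm = [x[c] for c in col_perm]
y_perm = matvec(reorder(W, row_perm, col_perm), x_perm)
```

Here half of the blocks receive 3 bits and half receive 2 bits, so the matrix averages 2.5 bits even though every block uses an integer width. In the reordering check, `y_perm[i]` equals `y[row_perm[i]]`, which is why a permutation of the activations is enough to keep the computation equivalent.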

Extensive experiments show that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while substantially reducing quantization cost and improving inference efficiency. The code is available at https://github.com/Nkniexin/SFMP.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
