arXiv

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

Title: QuBLAST: Implementing Block-Level Compression and Activation Scaling for Large Language Model Quantization

Abstract:

While Large Language Models (LLMs) currently deliver state-of-the-art performance in Natural Language Processing (NLP) applications, their substantial computational and memory requirements are a significant barrier to deployment on embedded systems. Existing state-of-the-art approaches often apply uniform post-training quantization (PTQ) across all attention blocks, forgoing the benefits of mixed quantization precision within a single network. These methods also typically depend on computationally expensive operations to counteract the detrimental effects of activation outliers. In addition, current techniques have largely not been evaluated on emerging LLMs whose architectures depart from traditional attention, such as those built on state-space models, which introduce distinct quantization challenges.

To overcome these constraints, we introduce QuBLAST, a novel PTQ methodology that integrates a block-level compression strategy with an activation scaling technique. The block-level compression enables mixed-precision quantization across the network's blocks, while the activation scaling strategy mitigates the adverse effects of activation outliers.
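The abstract does not specify how the activation scaling is computed. As a rough illustration of why rescaling helps with outliers, the sketch below uses a SmoothQuant-style per-channel scale, which is an assumption swapped in here and not necessarily the QuBLAST method. The names `quantize_symmetric` and `scaling_map`, the `alpha` parameter, and the toy tensors are all hypothetical.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Per-tensor symmetric uniform quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def scaling_map(x, w, alpha=0.5, eps=1e-8):
    """Hypothetical per-input-channel scale (SmoothQuant-style assumption)."""
    act_max = np.max(np.abs(x), axis=0)   # shape: (in_features,)
    w_max = np.max(np.abs(w), axis=1)     # shape: (in_features,)
    return np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1 - alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))   # toy activations: (tokens, in_features)
x[:, 3] *= 50.0                 # inject one outlier channel
w = rng.normal(size=(16, 8))    # toy weights: (in_features, out_features)

s = scaling_map(x, w)
x_scaled = x / s                # shrink the range of the outlier channel
w_scaled = w * s[:, None]       # fold the inverse scale into the weights

reference = x @ w               # full-precision output

# Mean squared output error with 8-bit activations and 8-bit weights.
err_plain = np.mean(
    (quantize_symmetric(x, 8) @ quantize_symmetric(w, 8) - reference) ** 2
)
err_scaled = np.mean(
    (quantize_symmetric(x_scaled, 8) @ quantize_symmetric(w_scaled, 8) - reference) ** 2
)
print(f"MSE without scaling: {err_plain:.4f}")
print(f"MSE with scaling:    {err_scaled:.4f}")
```

Because the scale is divided out of the activations and multiplied into the weights, the full-precision product is unchanged. Only the quantization grid sees a narrower activation range.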

The QuBLAST process begins by evaluating the sensitivity of individual attention blocks in pre-trained models via cross-entropy loss analysis. This sensitivity data guides the determination of the optimal weight quantization level for each specific attention block. Subsequently, QuBLAST utilizes an activation scaling map for each block to regulate the range of activation values, thereby reducing the impact of outliers and improving overall quantization accuracy.
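The first two steps above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the residual `tanh` blocks stand in for attention blocks, sensitivity is taken as the rise in cross-entropy loss when a single block is quantized at a low probe precision, and the rank-based rule in `allocate_bits` (top quarter of blocks get 8 bits, the rest 4) is a hypothetical allocation policy. All function names and parameters are illustrative.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Per-tensor symmetric uniform weight quantization (dequantized form)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def forward_logits(blocks, head, x, bits_per_block):
    """Toy residual network; a bit width of None keeps that block in full precision."""
    h = x
    for w, bits in zip(blocks, bits_per_block):
        w_used = w if bits is None else quantize_symmetric(w, bits)
        h = h + np.tanh(h @ w_used)
    return h @ head

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def block_sensitivities(blocks, head, x, labels, probe_bits=3):
    """Loss increase when only block i is quantized to probe_bits."""
    n = len(blocks)
    base = cross_entropy(forward_logits(blocks, head, x, [None] * n), labels)
    sens = []
    for i in range(n):
        bits = [None] * n
        bits[i] = probe_bits
        loss = cross_entropy(forward_logits(blocks, head, x, bits), labels)
        sens.append(loss - base)
    return np.array(sens)

def allocate_bits(sensitivities, low=4, high=8, high_fraction=0.25):
    """Hypothetical rule: the most sensitive blocks get the higher precision."""
    n = len(sensitivities)
    n_high = max(1, int(round(n * high_fraction)))
    order = np.argsort(sensitivities)[::-1]   # most sensitive first
    bits = np.full(n, low)
    bits[order[:n_high]] = high
    return bits

rng = np.random.default_rng(0)
blocks = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(8)]
head = rng.normal(size=(16, 10))
x = rng.normal(size=(128, 16))
# Calibration labels: the full-precision model's own predictions.
labels = forward_logits(blocks, head, x, [None] * 8).argmax(axis=1)

sens = block_sensitivities(blocks, head, x, labels)
bits = allocate_bits(sens)
print("sensitivity per block:", np.round(sens, 4))
print("bit width per block:  ", bits)
```

A real pipeline would run the sensitivity probe on calibration text through the pre-trained LLM and would likely choose bit widths under an explicit size budget. The rank-based split here only shows how a sensitivity ranking can drive a mixed-precision assignment.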

Empirical results demonstrate that QuBLAST achieves a model size reduction of 40% to 45.2% across diverse architectures, including Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B. This compression is achieved while keeping the increase in perplexity within 5% on both the WikiText-2 and WikiText-103 datasets.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
