QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy
Title: QuBLAST: Implementing Block-Level Compression and Activation Scaling for Large Language Model Quantization
Abstract:
While Large Language Models (LLMs) currently represent the pinnacle of performance for Natural Language Processing (NLP) applications, their substantial computational and memory requirements present significant barriers to deployment on embedded systems. Existing state-of-the-art approaches often rely on uniform post-training quantization (PTQ) applied across all attention blocks, thereby ignoring the benefits of utilizing varied quantization precision within a single network. Furthermore, these methods typically depend on computationally expensive operations to counteract the detrimental effects of activation outliers. Additionally, current techniques have largely overlooked the evaluation of emerging LLMs whose blocks depart from conventional attention, such as those built on state-space models, which introduce distinct quantization challenges.
To overcome these constraints, we introduce QuBLAST, a novel PTQ methodology that integrates a block-level compression strategy with an activation scaling technique. This block-level compression facilitates mixed-precision quantization across the network's various blocks, while the activation scaling strategy effectively neutralizes the adverse effects of activation outliers.
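The abstract does not specify how the activation scaling is computed, so the following is only a minimal sketch of one common way such scaling can work: divide each activation channel by a scale and multiply the matching weight row by the same scale, which shrinks outlier channels without changing the layer output. The function names, the max-abs rule, and the `alpha` exponent are all assumptions made for illustration, not details taken from QuBLAST.

```python
# Illustrative sketch only: per-channel activation scaling ahead of quantization.
# The scale rule (per-channel max-abs raised to alpha) is an assumption, not
# the rule used by QuBLAST.

def channel_scales(acts, alpha=0.5):
    """Return one scale per input channel from calibration activations.

    acts is a list of rows (tokens), each row a list of channel values.
    """
    n_channels = len(acts[0])
    scales = []
    for c in range(n_channels):
        peak = max(abs(row[c]) for row in acts)
        scales.append(max(peak, 1e-8) ** alpha)  # guard against zero channels
    return scales

def scale_activations(acts, scales):
    """Divide every activation by the scale of its channel."""
    return [[x / s for x, s in zip(row, scales)] for row in acts]

def scale_weights(weights, scales):
    """Multiply weight row c (input channel c) by scales[c] to compensate."""
    return [[w * s for w in row] for row, s in zip(weights, scales)]

def matmul(a, b):
    """Plain list-based matrix product, a is (n x k), b is (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy calibration data: channel 1 carries large outliers.
acts = [[0.5, -40.0, 1.0], [-0.25, 64.0, 0.75]]
weights = [[0.2, -0.1], [0.01, 0.03], [-0.3, 0.4]]

scales = channel_scales(acts)
scaled_acts = scale_activations(acts, scales)
scaled_weights = scale_weights(weights, scales)

reference = matmul(acts, weights)               # original layer output
rescaled = matmul(scaled_acts, scaled_weights)  # same output, narrower activations
```

In this toy case the outlier channel's peak drops from 64 to 8 while `rescaled` matches `reference` up to floating-point error, which is the property that makes the narrower activation range easier to quantize.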
The QuBLAST process begins by evaluating the sensitivity of individual attention blocks in pre-trained models via cross-entropy loss analysis. This sensitivity data guides the determination of the optimal weight quantization level for each specific attention block. Subsequently, QuBLAST utilizes an activation scaling map for each block to regulate the range of activation values, thereby reducing the impact of outliers and improving overall quantization accuracy.
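The sensitivity-guided step described above can be sketched roughly as follows. This assumes sensitivity is measured as the rise in cross-entropy loss when a single block is quantized in isolation, and that the most sensitive blocks simply receive a higher bit width; the allocation rule, the `keep_fraction` parameter, the bit widths, and every loss value below are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: map per-block cross-entropy sensitivity to bit widths.
# The ranking rule and all numbers are assumptions made for this example.

def assign_bit_widths(baseline_loss, block_losses, high_bits=8, low_bits=4,
                      keep_fraction=0.25):
    """Give high_bits to the most sensitive blocks and low_bits to the rest.

    block_losses[i] is the calibration cross-entropy loss measured with only
    block i quantized; baseline_loss is the loss of the unquantized model.
    """
    sensitivity = [loss - baseline_loss for loss in block_losses]
    n_high = max(1, round(keep_fraction * len(block_losses)))
    ranked = sorted(range(len(sensitivity)),
                    key=lambda i: sensitivity[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits
            for i in range(len(block_losses))]

def average_bits(bits):
    """Mean bit width across blocks, assuming equally sized blocks."""
    return sum(bits) / len(bits)

# Made-up losses for an 8-block model; blocks 0 and 7 are the most sensitive.
baseline = 2.10
losses = [2.45, 2.12, 2.11, 2.13, 2.14, 2.12, 2.11, 2.30]

bits = assign_bit_widths(baseline, losses)
print(bits)                # [8, 4, 4, 4, 4, 4, 4, 8]
print(average_bits(bits))  # 5.0
```

A real pipeline would replace the hard-coded losses with measurements on a calibration set and would likely weigh sensitivity against a target model size rather than using a fixed fraction.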
Empirical results demonstrate that QuBLAST achieves a model size reduction of 40% to 45.2% across diverse architectures, including Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B. Crucially, this compression is achieved while maintaining performance stability, with perplexity increases remaining within 5% on both the WikiText-2 and WikiText-103 datasets.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





