STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models
Abstract:
Diffusion large language models (DLLMs) have recently emerged as a compelling alternative to traditional autoregressive LLMs. They generate text by iterative masked denoising over bidirectional context rather than by left-to-right next-token prediction. However, their large model sizes and the computational cost of iterative denoising create significant memory and latency bottlenecks, motivating post-training quantization for efficient deployment.
This work identifies two primary obstacles to quantizing DLLMs at low bit-widths: state-dependent activation disparity and temporal error accumulation. Within each denoising step, masked and unmasked tokens exhibit distinct activation distributions. Across steps, quantization errors can compound throughout iterative decoding.
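To make the state-dependent disparity concrete, the following minimal sketch quantizes two synthetic activation groups to 4 bits, once with a single shared scale and once with a scale fitted to the narrower group alone. The Gaussian distributions, their widths, and all function names are illustrative assumptions, not measurements or code from the paper; the sketch only shows why one quantization grid can serve two very different distributions poorly.

```python
import random

def absmax_scale(values, bits=4):
    """Step size of a symmetric uniform quantizer that covers max |value|."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax

def fake_quantize(values, scale, bits=4):
    """Round to the signed integer grid, clip, and map back to real values."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
# Assumption for illustration: masked-token activations are narrow,
# unmasked-token activations are wide. Real DLLM statistics will differ.
masked = [rng.gauss(0.0, 0.1) for _ in range(1000)]
unmasked = [rng.gauss(0.0, 2.0) for _ in range(1000)]

# One scale for both states: the wide group dictates the step size.
shared_scale = absmax_scale(masked + unmasked)
err_shared = mse(masked, fake_quantize(masked, shared_scale))

# A scale fitted to the masked group alone.
masked_scale = absmax_scale(masked)
err_separate = mse(masked, fake_quantize(masked, masked_scale))

print(f"masked-token MSE, shared scale:   {err_shared:.6f}")
print(f"masked-token MSE, separate scale: {err_separate:.6f}")
```

With the shared scale, most of the narrow group collapses onto very few grid points, so its error is much larger than with a dedicated scale. In iterative decoding, an error of this kind is fed back into later denoising steps, which is the setting in which the compounding effect described above arises.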
To address these challenges, we introduce STaR-Quant, a state-time consistent post-training quantization (PTQ) framework for DLLMs. STaR-Quant has two components. State-Guided Activation Transformation (SGAT) routes masked and unmasked tokens into separate activation transformation spaces while using a single, static weight-side transformation. Temporal Attention Compensation (TAC) corrects quantized attention representations with a lightweight block-diagonal affine mapping.
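As a rough illustration of what a block-diagonal affine mapping looks like, and why it is lightweight, the sketch below applies y = blockdiag(A_1, ..., A_k) x + b to a single feature vector and compares its parameter count with that of a dense affine map. The abstract does not say how TAC chooses the block size or fits the blocks, so the function names, the toy values, and the block size here are hypothetical.

```python
def block_diag_affine(x, blocks, bias):
    """Apply y = blockdiag(blocks) @ x + bias to one feature vector.

    blocks: list of square matrices (lists of rows); sizes must sum to len(x).
    bias:   list of len(x) offsets.
    """
    y, start = [], 0
    for block in blocks:
        size = len(block)
        chunk = x[start:start + size]
        # Each block mixes only the features inside its own slice.
        y.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
        start += size
    if start != len(x):
        raise ValueError("block sizes must sum to the feature dimension")
    return [yi + bi for yi, bi in zip(y, bias)]

def param_count(dim, block_size):
    """Return (block-diagonal, dense) parameter counts for an affine map."""
    block_diag = (dim // block_size) * block_size ** 2 + dim
    dense = dim ** 2 + dim
    return block_diag, dense

# Hypothetical 4-dimensional quantized attention output and two 2x2 blocks.
x = [1.0, 2.0, 3.0, 4.0]
blocks = [
    [[1.0, 0.5], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
bias = [0.1, 0.0, 0.0, -1.0]
y = block_diag_affine(x, blocks, bias)

print(y)
print(param_count(4096, 128))
```

Because each block touches only its own slice of the features, the number of weights grows linearly with the feature dimension for a fixed block size, rather than quadratically as in a dense correction matrix.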
Experiments on representative DLLMs show that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines. The framework also delivers substantial efficiency gains, with up to a 3.14x reduction in memory usage and a 1.69x speedup relative to FP16 deployment.
Source: arXiv





