arXiv

WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

June 2, 2026 · Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

Abstract:

While quantizing both weights and activations is a widely adopted strategy to enable efficient large language model (LLM) deployment, the presence of severe outliers can expand the dynamic range, thereby exacerbating quantization errors in low-bit formats. Existing transform-based solutions, such as Hadamard rotations, are static and independent of data characteristics, leaving their true optimality for quantization tasks uncertain. In this work, we establish closed-form optimal linear blockwise transforms designed for joint weight-activation quantization using standard RTN AbsMax-scaled block quantizers, applicable to both integer and floating-point representations. The proposed method, WUSH, integrates a Hadamard foundation with a data-driven second-moment component to create a non-orthogonal transform. Under mild assumptions, this approach is proven to be near-optimal for both FP and INT quantizers and supports an efficient, fused implementation on GPUs. Empirical evaluations demonstrate that WUSH significantly enhances W4A4 accuracy compared to leading Hadamard-based baselines; for instance, on Llama-3.1-8B-Instruct using MXFP4, it yields an average improvement of +2.8 points with RTN and +0.7 points with GPTQ. Additionally, the method achieves up to 5.8$\times$ higher per-layer throughput than BF16 through FP4 MatMul operations. The source code is publicly accessible at https://github.com/IST-DASLab/WUSH.
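To make the abstract's terms concrete, the following is a minimal sketch of an AbsMax-scaled round-to-nearest (RTN) block quantizer and of the static Hadamard rotation used by the baselines. It does not implement WUSH: the data-driven second-moment component is not specified in the abstract and is left out. The function names (`absmax_rtn_int`, `hadamard`, `demo`), the symmetric INT4 grid, the block size of 32, and the synthetic outlier are all assumptions chosen for illustration; MXFP4 in particular uses FP4 elements rather than this integer grid.

```python
import math
import random


def absmax_rtn_int(block, bits=4):
    """Quantize and dequantize one block using a shared AbsMax scale and RTN."""
    qmax = 2 ** (bits - 1) - 1  # 7 for a symmetric INT4 grid (assumed)
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return list(block)
    scale = absmax / qmax
    # Round each value to the nearest integer level, then map back to reals.
    return [round(v / scale) * scale for v in block]


def hadamard(block):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of 2."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def demo(block_size=32, seed=0):
    """Compare direct quantization with quantization in the rotated basis."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
    x[3] = 20.0  # synthetic outlier that inflates the block's dynamic range

    direct = absmax_rtn_int(x)
    # The normalized Sylvester Hadamard matrix is symmetric and orthogonal,
    # so applying the same transform again inverts it.
    rotated = hadamard(absmax_rtn_int(hadamard(x)))
    return mse(x, direct), mse(x, rotated)


if __name__ == "__main__":
    direct_err, rotated_err = demo()
    print(f"MSE without transform: {direct_err:.4f}")
    print(f"MSE with Hadamard:     {rotated_err:.4f}")
```

In this sketch the outlier sets the AbsMax scale, so most of the remaining values in the block collapse onto a few coarse levels. The rotation spreads the outlier's energy across all coordinates, which shrinks the quantization step. The transform is fixed and does not depend on the data, which is the limitation the abstract attributes to Hadamard-based methods.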

Source: arXiv