OASIS: Outlier-Aware LUT-Based GEMM with Dual-Side Quantization for LLM Inference Acceleration
Large language models (LLMs) have shown remarkable performance across numerous applications, yet their inference processes place heavy burdens on memory and computational resources. Current quantization techniques face a dilemma between efficiency and accuracy: weight-only quantization (WOQ) suffers from expensive dequantization overheads, whereas integer weight-and-activation quantization (INT-WAQ) sacrifices precision, leading to reduced model quality. While non-uniform weight-and-activation quantization (NU-WAQ) effectively handles the skewed distributions of LLM data, it lacks compatibility with standard low-precision hardware.
To address these challenges, this study introduces OASIS, a lookup table (LUT)-based architecture that performs general matrix multiplication (GEMM) directly on non-uniformly quantized weights and activations, eliminating the need for dequantization. By using pre-computed Cartesian Product LUTs, OASIS reduces LUT storage requirements by 64x and increases computational parallelism by 1,024x compared to existing LUT-based GEMM approaches.
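The idea of replacing multiplications with lookups into a table of pre-computed products can be illustrated with a minimal sketch. This is not the OASIS design itself: the codebook values, the 2-bit index width, and the function names `build_product_lut`, `lut_gemm`, and `reference_gemm` are all assumptions made for illustration.

```python
def build_product_lut(w_codebook, a_codebook):
    """Precompute every weight-centroid x activation-centroid product."""
    return [[w * a for a in a_codebook] for w in w_codebook]


def lut_gemm(w_idx, a_idx, lut):
    """Multiply index matrices W (M x K) and A (K x N) using lookups and adds only."""
    m, k, n = len(w_idx), len(a_idx), len(a_idx[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for t in range(k):
                # One table lookup replaces dequantize + multiply.
                acc += lut[w_idx[i][t]][a_idx[t][j]]
            out[i][j] = acc
    return out


def reference_gemm(w_idx, a_idx, w_codebook, a_codebook):
    """Dequantize-then-multiply baseline, used only to check the LUT path."""
    w = [[w_codebook[c] for c in row] for row in w_idx]
    a = [[a_codebook[c] for c in row] for row in a_idx]
    return [
        [sum(w[i][t] * a[t][j] for t in range(len(a))) for j in range(len(a[0]))]
        for i in range(len(w))
    ]


# Hypothetical non-uniform 2-bit codebooks (4 centroids each).
W_CODEBOOK = [-1.0, -0.25, 0.25, 1.0]
A_CODEBOOK = [-2.0, -0.5, 0.5, 2.0]

w_idx = [[0, 3, 1], [2, 2, 0]]    # 2 x 3 matrix of weight codes
a_idx = [[1, 2], [3, 0], [0, 1]]  # 3 x 2 matrix of activation codes

lut = build_product_lut(W_CODEBOOK, A_CODEBOOK)
result = lut_gemm(w_idx, a_idx, lut)
```

With b-bit codes on both sides, the table in this sketch holds 2^b x 2^b entries regardless of the matrix dimensions, which is why the lookup path never needs the dequantized values.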
To maintain high accuracy despite aggressive activation quantization, OASIS incorporates an outlier-aware quantization framework that combines LUT-based GEMM with error compensation targeted specifically at outliers. The authors also developed Orizuru, a high-efficiency engine for real-time detection of top-k activation outliers.
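One way to picture outlier-targeted error compensation is the sketch below. The abstract does not specify the actual OASIS scheme or how Orizuru works, so everything here is an assumption: outliers are taken to be the k largest-magnitude activations, `heapq.nlargest` stands in for the hardware top-k engine, and the compensation term simply adds back the quantization residual of those outliers. The names `nearest_index`, `top_k_outliers`, and `compensated_matvec` are hypothetical.

```python
import heapq


def nearest_index(x, codebook):
    """Index of the codebook centroid closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))


def top_k_outliers(acts, k):
    """Positions of the k largest-magnitude activations (software stand-in)."""
    return sorted(heapq.nlargest(k, range(len(acts)), key=lambda i: abs(acts[i])))


def compensated_matvec(weights, acts, codebook, k):
    """Quantized matrix-vector product plus residual correction for top-k outliers."""
    quant = [codebook[nearest_index(x, codebook)] for x in acts]
    # Main path: every activation is replaced by its nearest centroid.
    out = [sum(w * q for w, q in zip(row, quant)) for row in weights]
    # Compensation path: add back the quantization error of the outliers only.
    for j in top_k_outliers(acts, k):
        residual = acts[j] - quant[j]
        for r, row in enumerate(weights):
            out[r] += row[j] * residual
    return out


# Hypothetical data: two activations far outside the codebook range.
CODEBOOK = [-1.0, -0.5, 0.5, 1.0]
acts = [0.5, -8.0, -0.5, 6.0]
weights = [[1.0, 2.0, -1.0, 0.5], [0.0, -1.0, 4.0, 1.0]]

exact = [sum(w * x for w, x in zip(row, acts)) for row in weights]
uncompensated = compensated_matvec(weights, acts, CODEBOOK, k=0)
compensated = compensated_matvec(weights, acts, CODEBOOK, k=2)
```

In this toy input the two outliers carry all of the quantization error, so correcting only those two positions recovers the exact product, while the uncorrected result is far off. Real activations would leave some residual error in the non-outlier positions.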
Extensive evaluations demonstrate that OASIS maintains an average accuracy loss of just 1.98% relative to the FP16 baseline, representing a 5.18% improvement over Atom. In terms of hardware performance, OASIS delivers an average speedup of 3.00x and enhances energy efficiency by 1.44x when compared to the FIGLUT accelerator.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



