LLM Compression with Jointly Optimizing Architectural and Quantization choices
Title: LLM Compression via Joint Optimization of Architecture and Quantization
Abstract:
The substantial memory footprint and computational demands of Large Language Models (LLMs) present significant hurdles to their deployment. Although training small or tiny language models from scratch is one solution, it requires extensive GPU resources. A more practical alternative is to compress existing, pre-trained LLMs for use on edge devices. While techniques such as pruning and quantization are common, Neural Architecture Search (NAS) offers a powerful approach to effective compression. However, previous NAS methods have often restricted the search space and treated architectural design and quantization as separate steps. To address this, we present a differentiable NAS framework that explores the design space comprehensively by jointly optimizing architectural configurations and mixed-precision quantization of the linear layers of LLMs. Our experiments show an improved accuracy-latency trade-off: compared with baselines that apply NAS followed by quantization, our models run inference up to 1.4 times faster at similar accuracy, or achieve an average accuracy gain of up to 6% across seven reasoning benchmarks at the same latency.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
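
The abstract describes a differentiable search over mixed-precision quantization of linear layers. As a rough illustration of the general idea (not the authors' implementation), the sketch below relaxes the discrete choice of bit-width into a softmax over learnable logits, blends the weights quantized at each candidate precision, and computes an expected cost that a search could penalize. All names (`fake_quantize`, `mixed_precision_weights`, `expected_cost`, `alphas`) and the bits-proportional cost model are assumptions made for this example; architectural choices such as layer width could in principle be relaxed the same way.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fake_quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization.

    Assumes a single per-tensor scale; real systems often use per-channel
    or per-group scales instead.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for symmetric quantization")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mixed_precision_weights(weights, bit_choices, alphas):
    """Softmax-weighted blend of the weights quantized at each candidate bit-width.

    `alphas` are the (hypothetical) learnable logits, one per bit-width.
    """
    probs = softmax(alphas)
    quantized = [fake_quantize(weights, b) for b in bit_choices]
    return [
        sum(p * q[i] for p, q in zip(probs, quantized))
        for i in range(len(weights))
    ]


def expected_cost(bit_choices, alphas, cost_per_bit=1.0):
    """Expected cost under the softmax distribution.

    Assumes cost grows linearly with bit-width, which is a simplification;
    measured latency on real hardware is generally not linear in bits.
    """
    probs = softmax(alphas)
    return sum(p * b * cost_per_bit for p, b in zip(probs, bit_choices))


if __name__ == "__main__":
    w = [0.5, -1.0, 0.25, 0.75]
    choices = [4, 8, 16]
    alphas = [0.0, 0.0, 0.0]  # uniform preference before any search step
    print(mixed_precision_weights(w, choices, alphas))
    print(expected_cost(choices, alphas))
```

In a full search, the logits would be trained by gradient descent on a loss that combines task accuracy with the expected cost, and the highest-probability bit-width would be kept for each layer once the search ends.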




