arXiv

LLM Compression with Jointly Optimizing Architectural and Quantization choices

Title: LLM Compression via Joint Optimization of Architecture and Quantization

Abstract:

The large memory footprint and computational cost of Large Language Models (LLMs) are significant obstacles to their deployment. Training small language models from scratch is one option, but it requires extensive GPU resources. A more practical alternative is to compress existing pre-trained LLMs for use on edge devices. Pruning and quantization are the common techniques, and Neural Architecture Search (NAS) offers a further route to effective compression. However, previous NAS methods have frequently restricted the search space and treated architecture design and quantization as separate steps. To address this, we present a differentiable NAS framework that explores the design space more fully by simultaneously optimizing architectural configurations and mixed-precision quantization for the linear layers of LLMs. Our experiments show an improved accuracy-latency trade-off: compared with baselines that apply NAS followed by quantization, our models deliver up to 1.4x faster inference at similar accuracy, or an average accuracy gain of up to 6% across seven reasoning benchmarks at the same latency.
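
The abstract does not give implementation details, so the following is only a minimal sketch of one common way to make a bit-width choice differentiable: a softmax over learnable logits weights several fake-quantized copies of a linear layer's weights, and the same probabilities give an expected bit-width that can serve as a cost term. It is a generic illustration, not the paper's method. All names here (`fake_quantize`, `relaxed_weights`, `expected_bits`, `alpha`, `bit_choices`) are hypothetical, and the architectural half of the search is left out.

```python
import math


def fake_quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization.

    Assumption: one scale per weight list (per-tensor), chosen from the max magnitude.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_weights(weights, bit_choices, alpha):
    """Softmax-weighted mixture of the candidate quantized weights.

    `alpha` holds one hypothetical search logit per entry of `bit_choices`.
    """
    probs = softmax(alpha)
    candidates = [fake_quantize(weights, b) for b in bit_choices]
    return [
        sum(p * cand[i] for p, cand in zip(probs, candidates))
        for i in range(len(weights))
    ]


def expected_bits(bit_choices, alpha):
    """Expected bit-width under the softmax, usable as a differentiable cost proxy."""
    probs = softmax(alpha)
    return sum(p * b for p, b in zip(probs, bit_choices))


if __name__ == "__main__":
    w = [1.0, -0.5, 0.25]
    print(relaxed_weights(w, [4, 8], [0.0, 0.0]))
    print(expected_bits([4, 8], [0.0, 0.0]))
```

In a real search the logits would be trained by gradient descent alongside the model weights, with the rounding step handled by a straight-through estimator or similar. This sketch only evaluates the relaxed forward computation.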


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
