arXiv

Measuring Maximum Activations in Open Large Language Models

Title: Quantifying Peak Activations in Open-Source Large Language Models

Abstract: The dynamic range of neural activations is a critical bottleneck for low-bit quantization, activation scaling, and stable inference in Large Language Models (LLMs). Prior work extensively documented outlier features and extreme activation values in pre-2024 LLaMA-style architectures, but the subsequent surge of open models has not been re-measured, so downstream activation-quantization pipelines still rely on outdated assumptions. This study addresses a practical, deployment-focused question: how large do activations get in contemporary open LLMs, and how do these maxima vary across model families, generations, and training phases?

Using a standardized evaluation pipeline, we analyzed 27 checkpoints drawn from eight distinct open-source families. The methodology involved a unified 5,000-sample multi-domain corpus, family-specific tokenization strategies, and consistent monitoring hooks across embeddings, hidden states, attention mechanisms, MLP/MoE layers, SwiGLU gates, and the final normalization layer. Our analysis covered dense, Mixture-of-Experts (MoE), vision-language, intermediate-training, and instruction-tuned variants.
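The monitoring described above can be illustrated with PyTorch forward hooks that keep a running maximum of the absolute activation at each module. This is a minimal sketch, not the authors' pipeline: the helper name `attach_max_abs_hooks`, the toy `nn.Sequential` model, and the random inputs are all assumptions made for illustration, standing in for real checkpoints and the 5,000-sample corpus.

```python
import torch
import torch.nn as nn


def attach_max_abs_hooks(model: nn.Module, peaks: dict) -> list:
    """Register forward hooks that track the largest |activation| per named module.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Some modules return tuples (e.g. attention); use the first element.
            tensor = output[0] if isinstance(output, (tuple, list)) else output
            if not torch.is_tensor(tensor):
                return
            value = tensor.detach().abs().max().item()
            peaks[name] = max(peaks.get(name, 0.0), value)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles


# Toy stand-in for a model; a real run would load an actual checkpoint.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 16), nn.SiLU(), nn.Linear(16, 8), nn.LayerNorm(8))

peaks = {}
handles = attach_max_abs_hooks(toy, peaks)
with torch.no_grad():
    for _ in range(4):  # a few random batches in place of corpus samples
        toy(torch.randn(2, 5, 8))
for h in handles:
    h.remove()  # always detach hooks once measurement is done

# Module holding the global maximum, and its value.
site, value = max(peaks.items(), key=lambda kv: kv[1])
print(site, value)
```

Keeping one running maximum per module name makes it possible to report both the global peak and the site where it occurs, which is the kind of per-location comparison the abstract describes.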

Our findings reveal three key insights: (i) Global maxima span nearly four orders of magnitude among models with comparable parameter counts. Specifically, Qwen3.5 and MoE checkpoints generally fall within the $10^2$ to $10^3$ range, whereas Gemma3-27B-it reaches approximately $7 \times 10^5$. (ii) Comparisons across families and generations do not follow a simple monotonic scaling pattern. (iii) MoE architectures show markedly lower peaks, 14.0 to 23.4 times smaller than those of dense counterparts of similar scale. Furthermore, in 22 out of 24 checkpoints, the global maximum occurred in the residual stream.

A lightweight INT8 sanity check confirmed that the measured maxima correlate with low-bit reconstruction error, chiefly because the maximum determines the activation scale chosen for quantization. We conclude that maximum activation magnitude is an intrinsic property of a model, determined by its family, architecture, and training stage, rather than merely a byproduct of model size. Consequently, these maxima should be measured and reported alongside any open-weight release prior to low-bit deployment. The code for this study is publicly accessible at https://github.com/clx1415926/Max_act_llm.
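The link between a large maximum and low-bit reconstruction error can be seen in a small example. The sketch below assumes symmetric per-tensor absmax quantization, which may differ from the scheme the authors used; the function name `int8_roundtrip_mse` and the synthetic values (including the outlier of 700.0) are hypothetical and chosen only for illustration.

```python
def int8_roundtrip_mse(values):
    """Symmetric per-tensor absmax INT8 quantize/dequantize.

    Returns (scale, mean squared reconstruction error).
    Illustrative assumption; not necessarily the paper's quantizer.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    total = 0.0
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # quantize and clamp
        total += (v - q * scale) ** 2              # dequantize and compare
    return scale, total / len(values)


# Synthetic "bulk" activations in [-1, 1].
bulk = [((i % 21) - 10) / 10.0 for i in range(1000)]
# The same activations plus a single large outlier (hypothetical value).
with_outlier = bulk + [700.0]

scale_small, mse_small = int8_roundtrip_mse(bulk)
scale_large, mse_large = int8_roundtrip_mse(with_outlier)

print(f"no outlier:   scale={scale_small:.5f}  mse={mse_small:.3e}")
print(f"with outlier: scale={scale_large:.5f}  mse={mse_large:.3e}")
```

Because the scale is set by the largest absolute value, a single outlier widens the quantization step for every other element, so the bulk values collapse toward zero and the reconstruction error grows.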


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
