Measuring Maximum Activations in Open Large Language Models
Title: Quantifying Peak Activations in Open-Source Large Language Models
Abstract: The dynamic range of neural activations is a critical bottleneck for low-bit quantization, activation scaling, and stable inference in Large Language Models (LLMs). Previous research extensively documented outlier features and extreme activation values in pre-2024 LLaMA-style architectures, but the subsequent surge of open models has not been measured in the same way, so downstream activation-quantization pipelines still rely on outdated assumptions. This study addresses a practical, deployment-focused question: how large do activation magnitudes get in contemporary open LLMs, and how do these values vary across model families, generations, and training stages?
Using a standardized evaluation pipeline, we analyzed 27 checkpoints drawn from eight distinct open-source families. The methodology involved a unified 5,000-sample multi-domain corpus, family-specific tokenization strategies, and consistent monitoring hooks across embeddings, hidden states, attention mechanisms, MLP/MoE layers, SwiGLU gates, and the final normalization layer. Our analysis covered dense, Mixture-of-Experts (MoE), vision-language, intermediate-training, and instruction-tuned variants.
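The per-site monitoring described here amounts to keeping a running maximum of absolute values for each hooked location. The following is a minimal, framework-free sketch of that bookkeeping, not the study's implementation: the class name `MaxActivationTracker`, the site labels, and the toy values are hypothetical. In a real pipeline, `observe` would presumably be called from a forward hook (for example PyTorch's `register_forward_hook`) with the flattened output of each module.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


class MaxActivationTracker:
    """Keeps the largest absolute value seen at each named monitoring site."""

    def __init__(self) -> None:
        # Hypothetical site names (e.g. "residual", "mlp_out") map to running peaks.
        self.peaks: Dict[str, float] = defaultdict(float)

    def observe(self, site: str, values: Iterable[float]) -> None:
        # Update the running max of |x| for this site with one batch of activations.
        batch_peak = max((abs(v) for v in values), default=0.0)
        if batch_peak > self.peaks[site]:
            self.peaks[site] = batch_peak

    def global_max(self) -> Optional[Tuple[str, float]]:
        # Return (site, peak) for the site holding the overall maximum, if any.
        if not self.peaks:
            return None
        site = max(self.peaks, key=self.peaks.get)
        return site, self.peaks[site]


# Toy usage with made-up numbers standing in for hooked activations.
tracker = MaxActivationTracker()
tracker.observe("residual", [0.5, -120.0, 3.0])
tracker.observe("mlp_out", [2.0, -7.5])
tracker.observe("residual", [40.0, 1.0])
print(tracker.global_max())
```

Tracking one scalar per site keeps memory use constant regardless of corpus size, which is one plausible reason to prefer a running maximum over storing activations.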
Our findings reveal three key insights: (i) Global maxima exhibit a spread of nearly four orders of magnitude among models with comparable parameter counts. Specifically, Qwen3.5 and MoE checkpoints generally fall within the $10^2$ to $10^3$ range, whereas Gemma3-27B-it reaches approximately $7 \times 10^5$. (ii) Comparisons across families and generations do not follow a simple monotonic scaling pattern. (iii) MoE architectures demonstrate significantly lower peak values, ranging from 14.0 to 23.4 times smaller than their dense counterparts of similar scale. Furthermore, in 22 out of 24 checkpoints, the residual stream contained the global maximum.
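As a rough check on the "nearly four orders of magnitude" figure, one can take the lower end of the quoted $10^2$ to $10^3$ range as the small extreme (an assumption, since the abstract does not state the exact minimum) and compare it with the Gemma3-27B-it peak:

$$\log_{10}\!\left(\frac{7 \times 10^{5}}{10^{2}}\right) = \log_{10}\!\left(7 \times 10^{3}\right) \approx 3.85$$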
A lightweight INT8 sanity check confirmed that the measured maxima correlate with low-bit reconstruction error, particularly through activation-scale selection. We conclude that maximum activation magnitude is an intrinsic property of a model, determined by its family, architecture, and training stage, rather than merely a byproduct of model size. These maxima should therefore be measured and reported with every open-weight release before low-bit deployment. The code for this study is publicly available at https://github.com/clx1415926/Max_act_llm.
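To illustrate why the activation maximum matters for scale selection, here is a minimal sketch of symmetric per-tensor INT8 quantization in which the scale is set from an assumed peak value. It is not the study's sanity check: the function name `int8_roundtrip_error`, the toy data, and the two `scale_max` values are hypothetical. The idea is that when one extreme activation sets the scale, the quantization step grows and typical small values lose precision.

```python
def int8_roundtrip_error(values, scale_max):
    """Mean absolute error after symmetric INT8 quantize/dequantize.

    scale_max is the activation magnitude mapped to the INT8 limit 127.
    """
    scale = scale_max / 127.0
    total = 0.0
    for v in values:
        # Quantize to the nearest integer level, clamped to [-127, 127].
        q = max(-127, min(127, round(v / scale)))
        # Dequantize and accumulate the reconstruction error.
        total += abs(v - q * scale)
    return total / len(values)


# Toy "typical" activations in [-2, 2] (made-up data).
typical = [0.1 * i for i in range(-20, 21)]

# Scale chosen from the typical range versus from a hypothetical outlier of 2000.
err_small = int8_roundtrip_error(typical, scale_max=2.0)
err_large = int8_roundtrip_error(typical, scale_max=2000.0)
print(err_small, err_large)
```

With the outlier-driven scale, every value in the toy set rounds to zero, so the reconstruction error equals the mean magnitude of the inputs.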
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





