arXiv

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Abstract:

Pre-Layer Normalization (Pre-LN) has emerged as the standard normalization technique for large language models (LLMs), playing a pivotal role in ensuring stable pretraining and facilitating effective transfer learning. Despite its widespread adoption, Pre-LN suffers from significant computational overhead due to repeated statistical calculations. Furthermore, it is susceptible to the "curse of depth," a phenomenon where the magnitude and variance of hidden states escalate with additional layers, leading to training instability. While normalization-free approaches like Dynamic Tanh (DyT) offer improved throughput, they often lack robustness in deeper architectures.
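
For orientation, the two baselines named above can be sketched in a few lines. RMSNorm has to compute a reduction (the mean of squares) over the hidden dimension on every call, whereas DyT applies an element-wise tanh with a learnable scalar and needs no reduction. The sketch below is plain Python on a single hidden vector; the function names, the `eps` value, and the `alpha` default are illustrative assumptions, not taken from the paper's code.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply a per-channel gain."""
    # The reduction over the hidden dimension is the statistic that
    # has to be recomputed on every call.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def dyt(x, gamma, beta, alpha=0.5):
    """Dynamic Tanh: element-wise gamma * tanh(alpha * x) + beta, no statistics."""
    return [g * math.tanh(alpha * v) + b for g, v, b in zip(gamma, x, beta)]

hidden = [0.5, -2.0, 8.0, 1.5]
ones = [1.0] * len(hidden)
zeros = [0.0] * len(hidden)

normed = rms_norm(hidden, ones)                       # scale-invariant output
normed_x10 = rms_norm([10.0 * v for v in hidden], ones)
squashed = dyt(hidden, ones, zeros)                   # bounded, but scale-sensitive
```

Multiplying the input by 10 leaves the RMSNorm output essentially unchanged, while the DyT output moves toward saturation because `alpha` is fixed rather than derived from the data.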

To simultaneously resolve issues of stability and efficiency, this study introduces Bounded Hyperbolic Tanh (BHyT), a seamless alternative to Pre-LN. BHyT integrates a tanh activation function with explicit, data-driven input constraints to maintain activations within a non-saturating range. This mechanism effectively curbs the depth-wise expansion of activation magnitude and variance, backed by a theoretical guarantee of stability. In terms of efficiency, BHyT calculates exact statistics only once per block and substitutes a secondary normalization step with a computationally lightweight variance approximation.
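
One way to picture the idea of bounding the tanh input is sketched below. This is an illustrative toy, not the BHyT formulation from the paper: the function name `bounded_tanh`, the choice of RMS as the statistic, and the bound `k` are all assumptions introduced here. The input is divided by a multiple of its own RMS, so an element whose magnitude equals the RMS reaches tanh with input `1 / k`, and the output stays in (-1, 1) whatever the input scale.

```python
import math

def bounded_tanh(x, k=3.0, eps=1e-6):
    """Toy bounded tanh: scale x by 1 / (k * rms(x)) before applying tanh.

    Hypothetical illustration only; `k` and the RMS statistic are assumptions,
    not the paper's definition of BHyT.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scale = 1.0 / (k * rms)
    return [math.tanh(v * scale) for v in x]

small = [0.5, -2.0, 8.0, 1.5]
large = [100.0 * v for v in small]   # same direction, 100x the magnitude

out_small = bounded_tanh(small)
out_large = bounded_tanh(large)
```

Because the scale is derived from the data, `out_small` and `out_large` agree to within rounding, which is the property that keeps activation magnitude from compounding across layers in this toy setting.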

Experimental results indicate that BHyT enhances both stability and efficiency during the pretraining phase. Compared to RMSNorm, BHyT accelerates training by an average of 1.6% and increases token generation throughput by an average of 1.77%. Additionally, it preserves strong performance in both pretraining-only and post-Supervised Fine-Tuning (SFT) evaluations across various benchmarks for language understanding and reasoning.

Code is available at: https://github.com/MLAI-Yonsei/BHyT


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
