Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models
Title: Leveraging Anisotropy: Transforming Large-Scale Activations into Interpretable Controls for Large Language Models
Abstract: Large Language Models (LLMs) display highly anisotropic internal representations, frequently marked by massive activations: a small number of feature dimensions whose magnitudes dwarf those of the remaining dimensions. While previous research has largely treated these extreme dimensions as artifacts to be managed, we argue instead that they function as intrinsic, interpretable functional units arising from domain-specific specialization. To this end, we introduce a simple, training-free, magnitude-based method for identifying Domain-Critical Dimensions. Our analysis shows that these dimensions act as interpretable semantic detectors for domain-specific terminology or for symbolic and quantitative patterns. Furthermore, we present Critical Dimension Steering, a technique that applies activation steering only to the identified dimensions. Experiments indicate that this strategy outperforms conventional whole-dimension steering in both jailbreaking and domain adaptation settings.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
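
The two ideas in the abstract, magnitude-based selection of dimensions and steering restricted to those dimensions, can be illustrated with a minimal sketch. The abstract does not give the exact selection criterion or how the steering vector is built, so everything here is an assumption made for illustration: ranking dimensions by mean absolute activation, using a mean-difference steering vector, the synthetic data, and the hypothetical helper names `find_critical_dims`, `mean_vector`, and `critical_dim_steer`.

```python
import random

def find_critical_dims(acts, k):
    """Return the k dimensions with the largest mean |activation| (assumed criterion)."""
    d = len(acts[0])
    mean_mag = [sum(abs(row[j]) for row in acts) / len(acts) for j in range(d)]
    top = sorted(range(d), key=lambda j: mean_mag[j], reverse=True)[:k]
    return sorted(top)

def mean_vector(acts):
    """Per-dimension mean over a list of activation vectors."""
    d = len(acts[0])
    return [sum(row[j] for row in acts) / len(acts) for j in range(d)]

def critical_dim_steer(hidden, steer_vec, dims, alpha=1.0):
    """Add alpha * steer_vec to hidden on the selected dims only; other dims are untouched."""
    out = list(hidden)
    for j in dims:
        out[j] += alpha * steer_vec[j]
    return out

# Synthetic stand-in for hidden states: 200 samples, 16 dimensions.
random.seed(0)
d = 16
general_acts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(200)]
domain_acts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(200)]
for row in domain_acts:
    row[3] *= 50.0   # simulate massive activations on two dimensions
    row[11] *= 50.0

dims = find_critical_dims(domain_acts, k=2)

# Assumed steering vector: difference of domain and general mean activations.
steer_vec = [a - b for a, b in zip(mean_vector(domain_acts), mean_vector(general_acts))]

hidden = [random.gauss(0, 1) for _ in range(d)]
steered = critical_dim_steer(hidden, steer_vec, dims, alpha=0.5)
print("critical dims:", dims)
```

In a real model the activation lists would come from hidden states collected at a chosen layer on domain and general prompts; restricting the update to `dims` is what distinguishes this from steering over all dimensions.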





