MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Abstract:
While preference alignment has significantly enhanced the observable performance of large language models, the internal mechanisms driving these improvements remain poorly understood. The continued vulnerability of aligned systems to jailbreaks, prompt injection, and retrieval-time corruption indicates that assessments based solely on behavior are insufficient. We posit that post-training processes should generate detectable signatures within internal computations. This study investigates the geometric transformations that occur when an instruction-tuned (IT) model is converted into a preference-aligned (PA) model, specifically examining where these changes concentrate and how they vary across different concepts, prompts, and model architectures.
To address these questions, we present MENTIS, a geometry-centric framework that quantifies alignment-induced internal reorganization by comparing paired model checkpoints. MENTIS combines a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) to localize changes to specific depths. Our analysis of four 7-8B model pairs on the LITMUS benchmark shows that alignment-induced modifications are selective rather than uniform. Specifically, normative concepts show larger torsion shifts than factual ones on average, and torsion correlates negatively with contextual entropy. Furthermore, the strongest effects are localized to mid-to-late layers, a pattern that holds across word-level, prompt-level, and model-level evaluations. These findings suggest that preference alignment imprints structured, depth-specific geometric signatures onto internal computation, offering insights beyond what behavior-level metrics alone can capture.
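The abstract does not define T1, so the following is only an illustrative sketch of what a layerwise covariance comparison between paired checkpoints could look like. It assumes hidden states have already been extracted from the IT and PA models as one (tokens x features) matrix per layer, and it uses a relative Frobenius distance between covariance matrices as a stand-in score. The function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def layer_covariance(hidden):
    """Sample covariance of hidden states; hidden has shape (n_tokens, d_model)."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    return centered.T @ centered / max(hidden.shape[0] - 1, 1)

def covariance_shift(hidden_it, hidden_pa):
    """Relative Frobenius distance between the two layer covariances.

    Assumed proxy for a covariance-based shift score, NOT the paper's T1.
    """
    cov_it = layer_covariance(hidden_it)
    cov_pa = layer_covariance(hidden_pa)
    denom = np.linalg.norm(cov_it, "fro") + 1e-12  # guard against division by zero
    return float(np.linalg.norm(cov_pa - cov_it, "fro") / denom)

def layerwise_shift(acts_it, acts_pa):
    """One score per layer, given two lists of per-layer activation matrices."""
    return [covariance_shift(a, b) for a, b in zip(acts_it, acts_pa)]

# Synthetic stand-in for paired checkpoints: 4 layers, 256 tokens, 16 features.
rng = np.random.default_rng(0)
base = [rng.normal(size=(256, 16)) for _ in range(4)]
aligned = [h.copy() for h in base]
# Rescale the features of layer 2 only, so the change is concentrated at one depth.
aligned[2] = aligned[2] * np.linspace(1.0, 2.0, 16)

scores = layerwise_shift(base, aligned)
print(int(np.argmax(scores)))  # index of the layer with the largest shift
```

On this toy input the untouched layers score exactly zero and the rescaled layer stands out, which mirrors the kind of depth localization the abstract describes. A real analysis would replace the synthetic arrays with activations collected from both checkpoints on the same prompts.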
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




