Global News Digest

arXiv

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

Abstract:

While preference alignment has significantly enhanced the observable performance of large language models, the internal mechanisms driving these improvements remain poorly understood. The continued vulnerability of aligned systems to jailbreaks, prompt injection, and retrieval-time corruption indicates that assessments based solely on behavior are insufficient. We posit that post-training processes should generate detectable signatures within internal computations. This study investigates the geometric transformations that occur when an instruction-tuned (IT) model is converted into a preference-aligned (PA) model, specifically examining where these changes concentrate and how they vary across different concepts, prompts, and model architectures.

To address these questions, we present MENTIS, a geometry-centric framework designed to quantify alignment-induced internal reorganization by comparing paired model checkpoints. MENTIS employs a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) to pinpoint changes at specific depths. Our analysis of four 7-8B model pairs on the LITMUS benchmark demonstrates that alignment-induced modifications are selective rather than uniform. Specifically, normative concepts show greater torsion shifts than factual ones on average, and torsion exhibits a negative correlation with contextual entropy. Furthermore, the most significant effects are localized to mid-to-late layers, a pattern that holds true across word-level, prompt-level, and model-level evaluations. These findings suggest that preference alignment imprints structured, depth-specific geometric signatures onto internal computation, offering insights that go beyond what can be captured by behavior-level metrics alone.
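The abstract does not define T1, so the following is only a minimal sketch of the general idea of a layerwise covariance comparison between paired checkpoints, not the paper's metric. It assumes hidden states have already been extracted per layer for the same prompts from both models, and it uses a relative Frobenius distance between covariance matrices as a stand-in. The function names `layer_covariance` and `covariance_shift_per_layer`, and the synthetic data, are hypothetical.

```python
import numpy as np


def layer_covariance(hidden):
    """Sample covariance of hidden states with shape (n_tokens, d_model)."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    return centered.T @ centered / max(hidden.shape[0] - 1, 1)


def covariance_shift_per_layer(it_layers, pa_layers):
    """Relative Frobenius distance between IT and PA covariances, per layer.

    Assumption: this is a stand-in for a covariance-based shift score,
    not the T1 definition used by MENTIS.
    """
    shifts = []
    for h_it, h_pa in zip(it_layers, pa_layers):
        c_it = layer_covariance(h_it)
        c_pa = layer_covariance(h_pa)
        denom = np.linalg.norm(c_it, ord="fro") + 1e-12
        shifts.append(np.linalg.norm(c_pa - c_it, ord="fro") / denom)
    return np.array(shifts)


# Synthetic stand-in for activations: 6 layers, 256 tokens, 16 dimensions.
rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 6, 256, 16
it_layers = [rng.normal(size=(n_tokens, d_model)) for _ in range(n_layers)]

# Pretend "alignment" rescales features only in layers 3 and 4.
pa_layers = [h.copy() for h in it_layers]
for i in (3, 4):
    pa_layers[i] = pa_layers[i] * np.linspace(1.0, 2.0, d_model)

shifts = covariance_shift_per_layer(it_layers, pa_layers)
peak_layer = int(np.argmax(shifts))
```

On this toy data the score is zero for untouched layers and positive only where the activations were rescaled, which is the kind of depth-localized profile the abstract describes for mid-to-late layers.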


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
