arXiv

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

June 3, 2026 · Shariar Kabir

Title: When Models Say No: Assessing Ideological Depth via Political Steerability and Feature Richness

Abstract

Large language models (LLMs) frequently decline to execute benign directives, such as adopting a specific persona or debating a particular political stance. While these refusals are typically interpreted as evidence of effective safety protocols, this study explores an alternative explanation: they may instead indicate a capability deficit, stemming from a lack of the internal representations necessary to reason from the requested viewpoint. To examine this hypothesis, we propose ideological depth, a metric composed of two elements: (i) the model’s steerability, defined as its capacity to adhere to political instructions without failure, and (ii) the feature richness of its internal political representations, quantified using sparse autoencoders (SAEs).
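As a rough illustration of the two components, the sketch below computes a steerability score as the fraction of non-refused responses and a feature-richness score as the number of distinct political SAE features that activate on at least one prompt. The keyword-based `is_refusal` check, the `threshold` parameter, and all function names are hypothetical and are not taken from the paper, whose exact definitions may differ.

```python
from typing import Iterable, Sequence, Set

# Hypothetical refusal markers (an assumption for this sketch);
# a real evaluation would likely use a more robust refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def is_refusal(response: str) -> bool:
    """Crude keyword check for refusal phrasing (illustrative only)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def steerability(responses: Sequence[str]) -> float:
    """Fraction of responses that follow the instruction instead of refusing."""
    if not responses:
        raise ValueError("need at least one response")
    complied = sum(1 for r in responses if not is_refusal(r))
    return complied / len(responses)


def feature_richness(
    activations: Iterable[Sequence[float]],
    political_ids: Set[int],
    threshold: float = 0.0,
) -> int:
    """Count distinct political SAE features that fire on at least one prompt.

    `activations` holds one row of SAE feature activations per prompt;
    `political_ids` is an assumed, pre-identified set of political feature indices.
    """
    active = set()
    for row in activations:
        for idx in political_ids:
            if row[idx] > threshold:
                active.add(idx)
    return len(active)
```

Under these assumptions, comparing `feature_richness` between two models gives a ratio of the kind reported in the abstract, although the paper's own feature-identification procedure is not specified here.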

By analyzing two prominent open-weight LLMs, we evaluated interventions involving both prompt engineering and activation steering, while probing political features through publicly accessible SAEs. Our findings reveal significant, systematic variations between the models. Specifically, the model demonstrating higher steerability across both ideological spectrums activated approximately 7.3x more distinct political features compared to its counterpart, which predominantly responded with increased refusals. Furthermore, we causally ablated a targeted subset of political features in the more capable model; this intervention replicated the feature-poor behavior and triggered a rise in refusals, mirroring the performance of the less capable model. Collectively, these results suggest that refusals on benign prompts may stem from capability deficits rather than rigid safety constraints. Additionally, they establish ideological depth as a quantifiable property of LLMs that serves as a predictor for refusal behavior.
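The feature ablation described above can be pictured with a minimal sketch: selected SAE latents are set to zero before the latent vector is decoded back into the model's activation space. This assumes a plain linear SAE decoder; the helper names `ablate_features` and `decode` are hypothetical, and the paper's actual intervention (for example, how it handles the SAE's reconstruction error) may differ.

```python
from typing import List, Sequence, Set


def ablate_features(latents: Sequence[float], feature_ids: Set[int]) -> List[float]:
    """Return a copy of the SAE latent vector with the chosen features zeroed."""
    return [0.0 if i in feature_ids else v for i, v in enumerate(latents)]


def decode(
    latents: Sequence[float],
    decoder: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Assumed linear SAE decoder: bias + sum_i latents[i] * decoder[i]."""
    out = list(bias)
    for value, direction in zip(latents, decoder):
        for j, weight in enumerate(direction):
            out[j] += value * weight
    return out


# Toy example: three SAE features writing into a two-dimensional activation.
latents = [2.0, 0.0, 1.0]
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [1.0, 1.0]

original = decode(latents, decoder, bias)
ablated = decode(ablate_features(latents, {0}), decoder, bias)
```

In a real model the ablated reconstruction would replace the activation at the hooked layer during the forward pass, and refusal rates would then be measured on the resulting generations.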
