Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Title: Value Entanglement: The Blurring of Distinct Value Categories in Certain Large Language Models
Abstract: Achieving value alignment in Large Language Models (LLMs) necessitates the empirical assessment of the values these systems have actually internalized. A key feature of human value representation is the ability to differentiate between various types of worth. This study examines whether LLMs similarly distinguish among three specific categories of good: moral, grammatical, and economic. Through an analysis of model behavior, embeddings, and residual stream activations, we identify widespread instances of "value entanglement"—a merging of these otherwise distinct value representations. Our findings indicate that, compared to human standards, both grammatical and economic judgments are disproportionately swayed by moral considerations. However, this conflation can be rectified by selectively removing the activation vectors linked to moral reasoning.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
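
The abstract's closing claim, that the conflation can be rectified by selectively removing activation vectors linked to moral reasoning, can be illustrated with a toy example. The abstract does not say how the removal is performed, so the sketch below assumes one common approach: projecting a single direction out of an activation vector. The function name `ablate_direction`, the variable `moral_direction`, and all numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    This is an orthogonal projection: x - (x . u) u, where u is the
    unit vector along `direction`. It is an assumed technique, not
    necessarily the procedure used in the paper.
    """
    norm = sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the projected component from every coordinate.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy 3-dimensional vectors; illustrative values only.
moral_direction = [1.0, 0.0, 0.0]  # hypothetical "moral" direction
activation = [2.0, 3.0, -1.0]      # hypothetical residual-stream activation

ablated = ablate_direction(activation, moral_direction)
# The result has no remaining component along moral_direction.
residual = sum(a * d for a, d in zip(ablated, moral_direction))
```

In a real model the direction would have to be estimated from data, for example by contrasting activations on morally loaded and neutral inputs, and the projection would be applied at chosen layers during the forward pass. Both of those steps are outside what the abstract specifies.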




