The Representation-Rationalizability Tradeoff in Reward Learning
Title: Navigating the Tension Between Representation and Rationalizability in Reward Learning
Abstract:
Reinforcement Learning from Human Feedback (RLHF) typically involves training examples composed of a prompt $x$ and two potential responses, $y$ and $y'$. Human annotators then state a pairwise preference between the two responses. The core challenge lies in transforming these diverse pairwise judgments into a single scalar reward, $r(x,y)$, which assesses the quality of a response for a given prompt. Classical social choice theory indicates that this is not possible in general: aggregating heterogeneous annotator preferences can produce Condorcet cycles, in which case no scalar reward can rank all compared response pairs consistently.
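As a minimal illustration of the Condorcet-cycle obstruction described above, the following sketch uses hypothetical toy data (three annotators and three responses labeled A, B, C, none of which come from the paper). It aggregates the annotators by pairwise majority and then searches all strict orderings for one that agrees with every majority comparison.

```python
from itertools import permutations

# Hypothetical toy data: three annotators, each with a strict ranking
# (best to worst) over three responses to the same prompt.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]
responses = ["A", "B", "C"]


def majority_prefers(y, y_prime, rankings):
    """Return True if a strict majority of annotators rank y above y_prime."""
    votes = sum(r.index(y) < r.index(y_prime) for r in rankings)
    return votes > len(rankings) / 2


# Majority preference relation over all ordered pairs of distinct responses.
majority = {
    (y, yp): majority_prefers(y, yp, rankings)
    for y in responses
    for yp in responses
    if y != yp
}


def consistent_orderings(responses, majority):
    """List every strict ordering that agrees with the majority on all pairs.

    A scalar reward without ties induces exactly one such ordering, so an
    empty result means no tie-free scalar reward matches the majority.
    """
    found = []
    for order in permutations(responses):
        rank = {y: i for i, y in enumerate(order)}  # position 0 is best
        if all((rank[y] < rank[yp]) == pref for (y, yp), pref in majority.items()):
            found.append(order)
    return found


cycle_free = consistent_orderings(responses, majority)
print(majority[("A", "B")], majority[("B", "C")], majority[("C", "A")])
print(cycle_free)
```

In this toy case the majority prefers A to B, B to C, and C to A, so the search comes back empty: every ordering contradicts at least one majority comparison.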
While recent research has increasingly framed RLHF through the lens of social choice, most studies assume a fixed, finite set of alternatives, namely a pre-listed set of candidate responses for each prompt. In contrast, contemporary systems evaluate responses by first mapping them to a learned representation, $\phi(x,y)$, and then passing that representation through a scalar head. Consequently, the embedding function $\phi$ dictates which responses are recognized as distinct alternatives and determines which comparisons are accessible to the reward model. Integrating this embedding into the framework transforms the theoretical impossibility into a manageable tradeoff.
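To make the role of $\phi$ concrete, here is a small sketch under stated assumptions: the two embeddings `phi_coarse` and `phi_fine`, the linear head, the weights, and the example strings are all hypothetical stand-ins, not the learned representations the paper studies. It shows that two responses with the same embedding necessarily receive the same reward, so the comparison between them is invisible to the reward model.

```python
def phi_coarse(x, y):
    # Hypothetical coarse embedding: only the word count of the response.
    return (float(len(y.split())),)


def phi_fine(x, y):
    # Hypothetical finer embedding: word count plus a flag that is 1.0
    # when the response ends with a question mark.
    return (float(len(y.split())), 1.0 if y.endswith("?") else 0.0)


def make_reward(phi, w):
    """Build a scalar head r(x, y) = <w, phi(x, y)> on top of an embedding."""
    def r(x, y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))
    return r


x = "How do I reset my password?"
y = "Open settings and choose reset."
y_prime = "Which account do you mean?"

# Under the coarse embedding both responses have five words, so they map to
# the same vector and get the same reward for every choice of weights.
r_coarse = make_reward(phi_coarse, w=(0.1,))
print(r_coarse(x, y) == r_coarse(x, y_prime))

# Under the finer embedding the two responses are distinct alternatives,
# and the head can rank one above the other.
r_fine = make_reward(phi_fine, w=(0.1, -1.0))
print(r_fine(x, y) > r_fine(x, y_prime))
```

The finer embedding exposes a comparison that the coarse one hides, which is the sense in which the choice of $\phi$ fixes the set of alternatives the reward model can rank.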
We demonstrate that the excess cross-entropy loss associated with any reward model constructed upon $\phi$ can be precisely decomposed into two components: a representational term and an aggregation term. A more expressive $\phi$ reduces the representational term but simultaneously increases the aggregation term by revealing additional comparisons that no single scalar value can rank without inconsistency. These findings also apply to Direct Preference Optimization (DPO). Furthermore, our analysis indicates that jointly optimizing the embedding and the reward parameters does not ensure the recovery of the optimal balance in this tradeoff. Our theoretical conclusions are validated through experiments conducted on both synthetic data and actual preference datasets.
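The paper's exact decomposition is not reproduced here. As a rough illustration of why some excess loss can remain no matter how the scalar reward is chosen, the sketch below assumes a Bradley-Terry preference model, defines excess cross-entropy as the loss minus the entropy of the observed preference rates (one common convention, assumed rather than taken from the paper), and fits rewards by plain gradient descent on hypothetical cyclic data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical aggregate data: prefs[(y, y')] is the fraction of annotators
# who prefer y over y'. The three pairs form a Condorcet cycle.
prefs = {("A", "B"): 2 / 3, ("B", "C"): 2 / 3, ("C", "A"): 2 / 3}


def cross_entropy(r, prefs):
    """Average Bradley-Terry cross-entropy of rewards r against prefs."""
    total = 0.0
    for (y, yp), p in prefs.items():
        q = sigmoid(r[y] - r[yp])  # model probability that y beats y'
        total -= p * math.log(q) + (1 - p) * math.log(1 - q)
    return total / len(prefs)


def entropy_floor(prefs):
    """Average entropy of the observed rates: the loss of a model that
    matches every pairwise rate exactly."""
    total = 0.0
    for p in prefs.values():
        total -= p * math.log(p) + (1 - p) * math.log(1 - p)
    return total / len(prefs)


def fit_rewards(prefs, steps=2000, lr=0.5):
    """Minimize the cross-entropy over scalar rewards by gradient descent."""
    r = {"A": 0.5, "B": -0.3, "C": 0.1}  # arbitrary starting point
    for _ in range(steps):
        grad = {y: 0.0 for y in r}
        for (y, yp), p in prefs.items():
            # Derivative of the pair loss with respect to r[y] is q - p.
            g = (sigmoid(r[y] - r[yp]) - p) / len(prefs)
            grad[y] += g
            grad[yp] -= g
        for y in r:
            r[y] -= lr * grad[y]
    return r


r_hat = fit_rewards(prefs)
excess = cross_entropy(r_hat, prefs) - entropy_floor(prefs)
print(r_hat)
print(round(excess, 4))
```

Because the loss is convex in the rewards, the fitted rewards end up equal across the three responses, and the remaining excess of roughly 0.0566 nats per pair is the KL divergence between a 2/3 preference rate and the 1/2 that any cycle-respecting scalar reward is forced to predict here. No choice of scalar reward removes it in this toy example.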
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





