arXiv

The Representation-Rationalizability Tradeoff in Reward Learning

Title: Navigating the Tension Between Representation and Rationalizability in Reward Learning

Abstract:

Reinforcement Learning from Human Feedback (RLHF) typically trains on examples composed of a prompt $x$ and two candidate responses, $y$ and $y'$, between which human annotators express a pairwise preference. The core challenge is to turn these diverse pairwise judgments into a single scalar reward, $r(x,y)$, that scores the quality of a response for a given prompt. Classical social choice theory suggests this task can be impossible: aggregating heterogeneous annotator preferences can produce Condorcet cycles (majorities preferring $y_1$ to $y_2$, $y_2$ to $y_3$, and $y_3$ to $y_1$), in which case no scalar reward can rank all compared response pairs consistently.
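The Condorcet obstruction can be checked by brute force on a toy case. The sketch below is illustrative only: the three annotator rankings and the response labels A, B, C are hypothetical and do not come from the paper. It aggregates the rankings by pairwise majority and then searches for a strict scalar reward that agrees with every majority preference.

```python
from itertools import permutations

# Hypothetical rankings from three annotators over responses A, B, C (best first).
annotator_rankings = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]
responses = ("A", "B", "C")


def majority_prefers(a, b):
    """Return True if a strict majority of annotators rank a above b."""
    votes = sum(r.index(a) < r.index(b) for r in annotator_rankings)
    return votes > len(annotator_rankings) / 2


# Pairwise majority preferences; for these rankings they form a cycle.
majority_pairs = [
    (a, b)
    for a in responses
    for b in responses
    if a != b and majority_prefers(a, b)
]


def consistent_orderings():
    """List the orderings whose induced scalar reward agrees with every majority pair."""
    result = []
    for order in permutations(responses):
        # Any strict scalar reward induces an ordering, so checking orderings suffices.
        reward = {y: -rank for rank, y in enumerate(order)}
        if all(reward[a] > reward[b] for a, b in majority_pairs):
            result.append(order)
    return result


print(majority_pairs)
print(consistent_orderings())
```

Each annotator here is individually consistent; the cycle appears only after aggregation, which is why no single scalar reward survives the search.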

While recent research has increasingly framed RLHF through the lens of social choice, most studies operate under the assumption of a fixed, finite set of alternatives—specifically, a pre-listed group of candidate responses for every prompt. In contrast, contemporary systems evaluate responses by first mapping them to a learned representation, $\phi(x,y)$, prior to passing them through a scalar head. Consequently, the embedding function $\phi$ dictates which responses are recognized as distinct alternatives and determines which comparisons are accessible to the reward model. Integrating this embedding into the framework transforms the theoretical impossibility into a manageable tradeoff.
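To make concrete how the embedding fixes the set of distinct alternatives, here is a minimal sketch assuming a linear scalar head on top of $\phi(x,y)$. The two embeddings, `coarse_phi` and `fine_phi`, and the weight vectors are hypothetical values chosen for illustration; the abstract does not specify an architecture.

```python
def scalar_head(weights, features):
    """Linear scalar head: reward is the dot product of weights and phi(x, y)."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical embeddings of three responses to the same prompt "x".
# The coarse embedding maps y1 and y2 to the same vector.
coarse_phi = {("x", "y1"): (1.0,), ("x", "y2"): (1.0,), ("x", "y3"): (0.0,)}
# The finer embedding adds a coordinate that separates y1 from y2.
fine_phi = {("x", "y1"): (1.0, 0.0), ("x", "y2"): (1.0, 1.0), ("x", "y3"): (0.0, 0.0)}


def distinguishable_pairs(phi):
    """Response pairs that a reward built on phi could, in principle, rank differently."""
    keys = sorted(phi)
    return [
        (a[1], b[1])
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
        if phi[a] != phi[b]
    ]


# Under the coarse embedding, every choice of head weights ties y1 and y2.
for weights in [(0.5,), (-2.0,), (3.0,)]:
    assert scalar_head(weights, coarse_phi[("x", "y1")]) == scalar_head(
        weights, coarse_phi[("x", "y2")]
    )

print(distinguishable_pairs(coarse_phi))
print(distinguishable_pairs(fine_phi))
```

The finer embedding exposes the comparison between `y1` and `y2` to the reward model. That lets the reward fit preferences on that pair, and it also adds one more comparison that a single scalar must rank consistently with the others.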

We demonstrate that the excess cross-entropy loss associated with any reward model constructed upon $\phi$ can be precisely decomposed into two components: a representational term and an aggregation term. A more expressive $\phi$ reduces the representational term but simultaneously increases the aggregation term by revealing additional comparisons that no single scalar value can rank without inconsistency. These findings also apply to Direct Preference Optimization (DPO). Furthermore, our analysis indicates that jointly optimizing the embedding and the reward parameters does not ensure the recovery of the optimal balance in this tradeoff. Our theoretical conclusions are validated through experiments conducted on both synthetic data and actual preference datasets.
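The abstract does not state the decomposition itself, so the sketch below does not reproduce it. It only illustrates, under assumptions of our own, why some excess cross-entropy cannot be removed by any scalar reward when preferences are cyclic. The assumptions are a Bradley-Terry link, $P(y \succ y') = \sigma(r(y) - r(y'))$, and hypothetical population preference probabilities of 2/3 around a three-way cycle.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical probabilities that the first response is preferred to the second.
pref = {("A", "B"): 2 / 3, ("B", "C"): 2 / 3, ("C", "A"): 2 / 3}


def cross_entropy(reward):
    """Mean cross-entropy of a Bradley-Terry model with the given scalar rewards."""
    total = 0.0
    for (a, b), p in pref.items():
        q = sigmoid(reward[a] - reward[b])
        total -= p * math.log(q) + (1 - p) * math.log(1 - q)
    return total / len(pref)


def entropy_floor():
    """Mean entropy of the preference probabilities: the loss of a perfect predictor."""
    return sum(
        -(p * math.log(p) + (1 - p) * math.log(1 - p)) for p in pref.values()
    ) / len(pref)


# Grid search over rewards, fixing r(A) = 0 because only differences matter.
grid = [i / 10 - 2 for i in range(41)]
best = min(cross_entropy({"A": 0.0, "B": rb, "C": rc}) for rb in grid for rc in grid)

excess = best - entropy_floor()
print(round(best, 4), round(entropy_floor(), 4), round(excess, 4))
```

In this toy case the best scalar reward is the constant one, which predicts 1/2 for every pair. The remaining gap equals the KL divergence between a 2/3 coin and a fair coin, so it is strictly positive however the rewards are chosen.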


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
