arXiv

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

June 2, 2026 · Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

Title: Pairwise Preference-Based Reinforcement Learning for Extended Time Horizons

Abstract:

Standard reinforcement learning frameworks generally aim to maximize the expected value of a scalar reward function. However, specifying goals through pairwise preferences is often more intuitive for users and can capture objectives that scalar rewards cannot represent. Consequently, reinforcement learning techniques that use pairwise preferences have attracted increasing attention. Existing approaches, however, are inefficient on problems with long time horizons. Furthermore, they provide no guarantees on how Markov policies perform relative to history-dependent policies, leaving a gap between theory and practice.

To address these limitations, we introduce the Markov decision contest, a novel problem model for reinforcement learning with pairwise preferences. We show that stationary Markov policies are optimal among all history-dependent policies. Additionally, we establish that a Markov decision contest can be solved exactly in polynomial time (the problem is in P) and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Finally, empirical evaluations on high-dimensional decision problems with long time horizons show that our approximate algorithm learns significantly more efficiently than previous methods.
