arXiv

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

June 2, 2026 · Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu, Zhipeng Wang, Huayu Li, ZhengXiao He, Xuanzhao Dong, Prayag Tiwari, Mingkun Xu, Yujian Xiong, Feng Luo, Abolfazl Razi, Brendan Hogan Rappazzo, Anderson Schneider, Yuriy Nevmyvaka

Original: arXiv:2606.01561v1 · Announce Type: new

Abstract: Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

Rewritten:

Abstract:

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry formulation of DPO is limited in its ability to model the departures from transitivity that are common in human preferences. Self-Play Preference Optimization (SPPO) was recently proposed to address this limitation: it iteratively refines the policy by training on self-generated pairs of winning and losing responses. Our analysis nevertheless identifies a critical instability in SPPO. The optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to responses that are semantically indistinguishable.

To mitigate this issue, we introduce S-SPPO, a dual-space semantic calibration framework with two components. The first, Supervision Calibration, uses semantic gating to anneal win-rate targets toward the maximum-entropy baseline as semantic overlap increases. The second, Representation Calibration, uses latent repulsion to enforce geometric diversity, which prevents manifold collapse and keeps chosen and rejected samples distinct in latent space.

Theoretically, we show that the calibration preserves the constant-sum game structure, which facilitates convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation observed in prior methods. With Llama-3-8B on AlpacaEval 2.0, it achieves a win rate of 52.19% and a length-controlled win rate of 47.46%, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.
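
The semantic gating idea can be illustrated with a small sketch. The abstract does not give the gating function, so the code below assumes a simple linear gate: the oracle's win probability is interpolated toward 0.5 (taken here as the maximum-entropy baseline for a binary win/lose outcome) in proportion to a semantic overlap score in [0, 1]. The function name `calibrate_win_rate`, the `semantic_overlap` parameter, and the linear form are all hypothetical and are not taken from the paper or its code.

```python
def calibrate_win_rate(win_prob: float, semantic_overlap: float) -> float:
    """Shrink an oracle win probability toward 0.5 as semantic overlap grows.

    Hypothetical linear gate for illustration only; the gate used in
    S-SPPO may differ.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must lie in [0, 1]")
    if not 0.0 <= semantic_overlap <= 1.0:
        raise ValueError("semantic_overlap must lie in [0, 1]")
    # gate = 1 keeps the oracle's target unchanged (no overlap);
    # gate = 0 replaces it with the 0.5 baseline (full overlap).
    gate = 1.0 - semantic_overlap
    return 0.5 + gate * (win_prob - 0.5)


if __name__ == "__main__":
    # The same oracle verdict is softened as the two responses get closer.
    for overlap in (0.0, 0.5, 1.0):
        print(overlap, calibrate_win_rate(0.9, overlap))
```

Under this assumed gate, the calibrated targets for the two players still sum to one: if the oracle gives p to one response and 1 - p to the other, the calibrated values are 0.5 + g(p - 0.5) and 0.5 - g(p - 0.5). This is one simple way in which a calibration can leave the constant-sum structure intact, and it is offered only as an intuition for the theoretical claim above.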

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC