Global News Digest

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Original: arXiv:2606.01561v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

Rewritten:

Source: arXiv:2606.01561v1 Announcement Type: New

Abstract:

The alignment of Large Language Models (LLMs) with human preferences is frequently formulated as Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in its ability to model the departures from transitivity that are common in human preferences. To address this limitation, recent work proposed Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated pairs of winning and losing responses. Our analysis nevertheless identifies a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to responses that are semantically indistinguishable.

To mitigate this issue, we introduce S-SPPO, a dual-space semantic calibration framework with two components. The first, Supervision Calibration, uses semantic gating to anneal win-rate targets toward the maximum-entropy baseline as semantic overlap increases. The second, Representation Calibration, uses latent repulsion to enforce geometric diversity, preventing manifold collapse and maintaining latent diversity between chosen and rejected samples.
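The two calibration components can be illustrated with a minimal Python sketch. The abstract does not give the actual gating function or repulsion loss, so everything here is an assumption made for illustration: the function names `gated_target` and `repulsion_penalty`, the linear interpolation toward 0.5 (taken here as the maximum-entropy target for a binary preference), the use of cosine similarity as the overlap measure, and the `margin` value.

```python
import math


def gated_target(win_prob: float, similarity: float) -> float:
    """Shrink an oracle win probability toward 0.5 as similarity approaches 1.

    Hypothetical linear gate: similarity 0 keeps the oracle's value,
    similarity 1 returns the 0.5 baseline. Not the paper's formula.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must lie in [0, 1]")
    s = min(max(similarity, 0.0), 1.0)  # clamp similarity to [0, 1]
    return 0.5 + (1.0 - s) * (win_prob - 0.5)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def repulsion_penalty(chosen: list[float], rejected: list[float],
                      margin: float = 0.9) -> float:
    """Hinge penalty that is positive only when the two latent vectors
    are more similar than `margin` (an assumed threshold)."""
    return max(0.0, cosine(chosen, rejected) - margin)


if __name__ == "__main__":
    # An overconfident oracle win (0.95) on two near-identical responses
    # is pulled most of the way back toward 0.5.
    print(gated_target(0.95, similarity=0.9))
    # Swapping the two responses gives the complementary target, so the
    # pair of targets still sums to 1 under this particular gate.
    print(gated_target(0.95, 0.9) + gated_target(0.05, 0.9))
    # Nearly parallel latent vectors incur a small positive penalty.
    print(repulsion_penalty([1.0, 0.0], [1.0, 0.05]))
```

With this linear gate, the targets for a pair and its swapped counterpart always sum to 1, which matches the constant-sum property that the abstract says the calibration preserves. Whether the authors' gate has this exact form is not stated in the source.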

Theoretically, we show that the calibration preserves the constant-sum game structure, which facilitates convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods: with Llama-3-8B on AlpacaEval 2.0, it achieves a 52.19% win rate and a 47.46% length-controlled win rate without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
