arXiv

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Title: ASymPO: Stabilizing Asynchronous LLM Post-Training via Asymmetric-Scale Policy Optimization Without Reliance on Behavior Data

Abstract:

While asynchronous reinforcement learning offers enhanced throughput for language model post-training by separating response generation from policy updates, the resulting stale responses can cause distribution drift. Conventional approaches mitigate this drift using behavior-policy probabilities, importance weights, or clipping mechanisms; however, these methods necessitate token-aligned, versioned, and numerically consistent behavior log-probabilities across both rollout and learner systems. This study investigates whether asynchronous group-relative reinforcement learning can be stabilized using solely current-policy probabilities. We pinpoint a specific failure mode characterized by scale imbalance: when stale responses are assessed by the current policy, positive and negative loss components may manifest at disparate negative log-probability magnitudes, thereby breaking the zero-sum advantage assumption and leading to unbalanced loss contributions. To address this, we introduce Asymmetric-Scale Policy Optimization (ASymPO), a method that normalizes the token loss of each response according to its current average token negative log-probability. ASymPO eliminates the need for behavior-policy probabilities, re-establishes response-level zero-sum balance, and maintains a robust learning signal. Additionally, we present Scaled Policy Optimization (SPO) as a fixed negative-scaling baseline. Both current-policy-only objectives are evaluated in the context of asynchronous mathematical reasoning post-training.
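The scale-imbalance failure mode and the proposed normalization can be illustrated with a small numerical sketch. The abstract does not state the exact objective, so what follows is only one plausible reading: each response contributes its group-relative advantage times its average token negative log-probability under the current policy, and the ASymPO-style variant divides that contribution by the same average (which a real trainer would presumably treat as a constant with no gradient). The function names, rewards, and log-probabilities are hypothetical and chosen for illustration; the code inspects loss values only, not gradients.

```python
from statistics import mean


def group_advantages(rewards):
    """Group-relative advantages: reward minus the group mean (sums to zero)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def response_nll(token_logprobs):
    """Average token negative log-probability under the current policy."""
    return -mean(token_logprobs)


def loss_contributions(rewards, logprobs_per_response, normalize):
    """Per-response loss values A_i * NLL_i, optionally divided by NLL_i.

    Assumption: in a real trainer the divisor would be a stop-gradient
    constant; this sketch only looks at the resulting loss values.
    """
    advantages = group_advantages(rewards)
    contributions = []
    for adv, token_logprobs in zip(advantages, logprobs_per_response):
        nll = response_nll(token_logprobs)
        scale = nll if normalize else 1.0
        contributions.append(adv * nll / scale)
    return contributions


# Hypothetical group of four stale responses: two rewarded, two not.
rewards = [1.0, 1.0, 0.0, 0.0]
logprobs = [
    [-0.2, -0.3, -0.1],  # rewarded, still likely under the current policy
    [-0.4, -0.2],
    [-2.0, -3.0, -2.5],  # unrewarded, now unlikely under the current policy
    [-1.5, -2.5],
]

plain = loss_contributions(rewards, logprobs, normalize=False)
scaled = loss_contributions(rewards, logprobs, normalize=True)

print(sum(plain))   # nonzero: the negative-advantage terms dominate
print(sum(scaled))  # approximately zero: response-level balance restored
```

In this toy group the advantages sum to zero, but the unnormalized contributions do not, because the unrewarded responses sit at a much larger negative log-probability scale. Dividing each response by its own scale makes every contribution equal to its advantage in value, which is the response-level zero-sum property the abstract describes.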


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
