ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Title: ASymPO: Stabilizing Asynchronous LLM Post-Training via Asymmetric-Scale Policy Optimization Without Reliance on Behavior Data
Abstract:
Asynchronous reinforcement learning raises throughput in language model post-training by decoupling response generation from policy updates, but the resulting stale responses can cause distribution drift. Conventional approaches mitigate this drift with behavior-policy probabilities, importance weights, or clipping mechanisms; these require token-aligned, versioned, and numerically consistent behavior log-probabilities across both the rollout and learner systems. This study investigates whether asynchronous group-relative reinforcement learning can be stabilized using only current-policy probabilities. We identify a specific failure mode, scale imbalance: when stale responses are scored by the current policy, the positive and negative loss components can sit at very different negative log-probability magnitudes, which breaks the zero-sum advantage assumption and leads to unbalanced loss contributions. To address this, we introduce Asymmetric-Scale Policy Optimization (ASymPO), which normalizes the token loss of each response by that response's current average token negative log-probability. ASymPO eliminates the need for behavior-policy probabilities, re-establishes response-level zero-sum balance, and maintains a robust learning signal. We also present Scaled Policy Optimization (SPO) as a fixed negative-scaling baseline. Both current-policy-only objectives are evaluated on asynchronous mathematical reasoning post-training.
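The scale-imbalance failure mode and the per-response normalization described in the abstract can be illustrated with a small numeric sketch. The abstract does not state the exact objective, so everything below is an assumption made for illustration: the function names, the reward and log-probability values, the `eps` constant, and the choice to treat each response's mean token negative log-probability as a constant divisor (in an autograd framework this divisor would presumably be detached). It should not be read as the authors' implementation.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Center rewards within one group so the advantages sum to zero."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def mean_token_nll(token_logprobs):
    """Average token negative log-probability under the current policy."""
    return -mean(token_logprobs)


def naive_contributions(advantages, nlls):
    # Advantage-weighted NLL: each contribution scales with the response's NLL.
    return [a * n for a, n in zip(advantages, nlls)]


def normalized_contributions(advantages, nlls, eps=1e-8):
    # Assumed form: divide each response's loss by its own mean NLL,
    # treated as a constant, so every response contributes at unit scale.
    return [a * n / (n + eps) for a, n in zip(advantages, nlls)]


# Hypothetical group: two correct responses the current policy still finds
# likely, and two stale incorrect responses it now finds unlikely.
rewards = [1.0, 1.0, 0.0, 0.0]
token_logprobs = [
    [-0.2, -0.2, -0.2],
    [-0.2, -0.2, -0.2],
    [-2.0, -2.0, -2.0],
    [-2.0, -2.0, -2.0],
]

advantages = group_relative_advantages(rewards)
nlls = [mean_token_nll(lp) for lp in token_logprobs]

naive_total = sum(naive_contributions(advantages, nlls))
normalized_total = sum(normalized_contributions(advantages, nlls))

print("advantages sum:", sum(advantages))
print("naive loss total:", naive_total)
print("normalized loss total:", normalized_total)
```

In this toy group the advantages sum to zero, yet the unnormalized loss is dominated by the high-NLL negative responses; after dividing by each response's own mean NLL, the response-level contributions cancel almost exactly.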
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



