Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
Title: Policy Split: Encouraging Dual-Mode Exploration in LLM Reinforcement Learning via Dual-Mode Entropy Regularization
Abstract:
We introduce Policy Split, a novel framework designed to foster diverse exploration in reinforcement learning (RL) for large language models (LLMs) while maintaining high accuracy. The approach splits the policy into two distinct modes, normal and high-entropy, guided by a high-entropy prompt. Both modes share the same underlying model parameters, but they are trained under a collaborative dual-mode entropy regularization scheme tailored to their respective goals: the normal mode optimizes task correctness, whereas the high-entropy mode prioritizes exploration, so the two learn in tandem. Extensive experiments show that Policy Split consistently surpasses established entropy-guided RL baselines across model sizes on both general and creative tasks. Further analysis indicates that the method enables dual-mode exploration: the high-entropy mode produces behavioral patterns distinct from those of the normal mode and thereby offers unique learning signals.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC