arXiv

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

Abstract

Conventional large language model (LLM) development typically reserves reinforcement learning (RL) for the final stages, applying it only after pre-training and supervised fine-tuning (SFT). This study challenges that workflow by applying RL, SFT, and the combined SFT-then-RL sequence directly to intermediate checkpoints taken during pre-training. Our findings indicate that RL yields significant benefits remarkably early in pre-training, with early checkpoints often reaching performance comparable to the traditional full SFT$\to$RL pipeline.

On more complex tasks, we find that the composition of the pre-training data is a critical factor in RL effectiveness, more influential than increasing model scale. Furthermore, while applying RL directly to base checkpoints broadens the model’s output distribution, the "sharpening" effect documented in recent literature appears only when RL follows SFT. Notably, general model capabilities remain stable under RL but tend to deteriorate after SFT. To address this, we propose merging the RL and SFT objectives through parallel averaging. This hybrid method outperforms the other training strategies we study across various metrics while preserving the model’s general capabilities. Collectively, these findings suggest that using RL more broadly throughout the LLM training lifecycle could offer substantial advantages.
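
The abstract does not spell out what "parallel averaging" involves. One plausible reading, assumed here purely for illustration, is that an RL run and an SFT run start from the same checkpoint, train independently, and have their parameters averaged afterwards. The sketch below shows only that averaging step on toy parameter dictionaries; the function name `average_params`, the `rl_weight` argument, and the flat-list parameter format are hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
# (Assumption: real checkpoints would be tensors; lists keep this runnable.)
Params = Dict[str, List[float]]


def average_params(rl: Params, sft: Params, rl_weight: float = 0.5) -> Params:
    """Interpolate two checkpoints that share the same parameter layout.

    rl_weight = 0.5 gives a uniform average; this default is an assumption,
    not a value reported in the abstract.
    """
    if rl.keys() != sft.keys():
        raise ValueError("checkpoints must have the same parameter names")

    merged: Params = {}
    for name in rl:
        if len(rl[name]) != len(sft[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            rl_weight * a + (1.0 - rl_weight) * b
            for a, b in zip(rl[name], sft[name])
        ]
    return merged
```

With the default weight, `average_params({"w": [1.0, 3.0]}, {"w": [3.0, 5.0]})` averages each entry of the two toy checkpoints. If the authors instead average gradients or losses during a single run, the merge would happen inside the training loop rather than after it.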


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
