arXiv

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

Title: Tailoring Data for Each Phase: Stage-Specific Datasets for SFT-then-RL in Small Language Model Reasoning

Abstract:

While the standard post-training approach for Small Language Models (SLMs) involves a sequential SFT-then-RL pipeline, current research largely overlooks the strategic selection of data for each distinct phase. We contend that data curation must align with the unique objectives of SFT and RL: SFT is more effective for acquiring reasoning capabilities the model has not yet mastered, whereas RL excels at reinforcing skills the model can already partially execute. Guided by this insight, we introduce a difficulty-aware SFT-then-RL framework that structures training data into phase-specific collections. During the SFT stage, we implement a Bridge mechanism to convert raw reasoning traces generated by teachers into more accessible supervision for SLMs, specifically targeting difficult samples. In the RL stage, for challenging examples that remain unsolved, we employ Critique Fine-Tuning; this converts instances of zero-reward failure into diagnostic, corrective, and novel reasoning trace supervision to inform the subsequent SFT phase. Evaluations across five reasoning benchmarks using two different SLMs demonstrate that our approach consistently outperforms standard SFT, distillation, and RL baselines. These findings underscore the critical need to coordinate data difficulty between SFT and RL stages to achieve effective post-training reasoning in SLMs.
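The routing idea in the abstract (unmastered problems go to SFT, partially solved ones go to RL) can be illustrated with a minimal sketch. The abstract does not say how difficulty is measured, so everything here is an assumption for illustration: the `Problem` record, the use of a per-problem pass rate from sampled rollouts as the difficulty proxy, the `low` and `high` thresholds, and the choice to set aside problems the model already solves every time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Problem:
    # Hypothetical record; the field names are not taken from the paper.
    problem_id: str
    pass_rate: float  # assumed proxy: fraction of sampled rollouts that were correct


def split_by_stage(
    problems: List[Problem], low: float = 0.0, high: float = 1.0
) -> Tuple[List[Problem], List[Problem], List[Problem]]:
    """Route each problem to a stage by its pass rate.

    pass_rate <= low        -> SFT pool (not yet mastered)
    low < pass_rate < high  -> RL pool (partially solvable)
    pass_rate >= high       -> set aside (already solved; an assumption of this sketch)
    """
    sft_pool, rl_pool, set_aside = [], [], []
    for problem in problems:
        if problem.pass_rate <= low:
            sft_pool.append(problem)
        elif problem.pass_rate < high:
            rl_pool.append(problem)
        else:
            set_aside.append(problem)
    return sft_pool, rl_pool, set_aside
```

With the default thresholds, a problem that no rollout solves lands in the SFT pool, one solved in some rollouts lands in the RL pool, and one solved in every rollout is set aside. The Bridge mechanism and Critique Fine-Tuning described in the abstract are not modeled here.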


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
