HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

Original: arXiv:2606.01934v1 Announce Type: new

Abstract: Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.

Rewrite: arXiv:2606.01934v1 Announcement Type: New

Large language models deliver strong results through extended chain-of-thought (CoT) reasoning, but the long outputs add substantial inference cost. Current CoT compression methods have three main limitations: rigid, manually set length budgets; computationally expensive multi-stage training; and scalability that has so far been confined to small models. To address these, the authors introduce HMPO (Hybrid Median-length Policy Optimization), a cost-effective reinforcement learning framework that trains in a single stage. HMPO compresses CoT by combining three mechanisms: an adaptive budget set to the median length of successful rollouts, which removes the need for manual tuning; a cosine-decay token reward that penalizes length smoothly; and a multiplicative reward formulation that mitigates trivial reward hacking by strictly prioritizing answer correctness. Although trained only on mathematical data, HMPO generalizes across math, code, science, and instruction-following tasks. Experiments on models from 9B to 122B parameters, covering both dense and Mixture-of-Experts (MoE) architectures, show that HMPO reduces token usage by 19% to 46% with negligible loss of accuracy, while substantially lowering training cost compared with existing multi-stage baselines.
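The abstract names the three components but gives no formulas, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. The function names, the `max_len` cap, the choice of a reward of 1.0 at or below the budget, and the half-cosine decay from the budget to `max_len` are all assumptions made for illustration.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length of the correct rollouts, or None if none are correct."""
    ok = [n for n, c in zip(lengths, correct) if c]
    return median(ok) if ok else None


def cosine_length_reward(length, budget, max_len):
    """Assumed shape: 1.0 at or below the budget, half-cosine decay to 0.0 at max_len."""
    if length <= budget:
        return 1.0
    if length >= max_len:
        return 0.0
    frac = (length - budget) / (max_len - budget)
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def hybrid_rewards(lengths, correct, max_len):
    """Multiplicative reward: correctness (0 or 1) times the length reward.

    An incorrect rollout scores 0 regardless of its length, so a short
    wrong answer can never outscore a correct one.
    """
    budget = median_budget(lengths, correct)
    if budget is None:
        # No successful rollout in this group: no budget can be derived (assumed handling).
        return [0.0] * len(lengths)
    return [
        float(c) * cosine_length_reward(n, budget, max_len)
        for n, c in zip(lengths, correct)
    ]


# Hypothetical group of four rollouts for one prompt.
rewards = hybrid_rewards([100, 200, 300, 50], [True, True, True, False], max_len=400)
```

In this sketch the budget adapts per prompt because it is recomputed from each group of rollouts, and multiplying by correctness rather than adding a length bonus is what keeps a short but wrong rollout from earning any reward.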


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
