HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression
Original: arXiv:2606.01934v1 Announce Type: new
Abstract: Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19% to 46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
Rewrite: arXiv:2606.01934v1 Announce Type: new
Large language models achieve strong results through extended chain-of-thought (CoT) reasoning, but the long outputs add substantial inference overhead. Existing CoT compression methods have three limitations: rigid, manually set length budgets; computationally expensive multi-stage training pipelines; and fragile scalability that is restricted to small models. To address these, we introduce HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO compresses CoT with three components: an adaptive budget based on the median length of successful rollouts, which removes manual tuning; a cosine-decay token reward that penalizes length smoothly; and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Although trained exclusively on mathematical data, HMPO generalizes across math, code, science, and instruction-following tasks. In experiments on dense and Mixture-of-Experts (MoE) models from 9B to 122B parameters, HMPO achieves 19% to 46% token compression with negligible accuracy degradation, while substantially reducing training cost compared with existing multi-stage baselines.
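The abstract names the three components but gives no formulas, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the hard cap `max_len`, the `floor` value, the choice to give full length score at or below the budget, and the exact shape of the cosine decay. The sketch computes the budget as the median length of the correct rollouts in a group, scores each rollout's length with a half-cosine decay, and multiplies that score by a 0/1 correctness indicator so that a short but wrong answer earns nothing.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over correct rollouts, or None if none are correct."""
    successful = [n for n, ok in zip(lengths, correct) if ok]
    return median(successful) if successful else None


def cosine_length_reward(length, budget, max_len, floor=0.1):
    """Length score: 1.0 up to the budget, then a half-cosine decay to `floor` at max_len.

    `max_len` and `floor` are hypothetical parameters, not taken from the paper.
    """
    if length <= budget:
        return 1.0
    if length >= max_len:
        return floor
    t = (length - budget) / (max_len - budget)  # 0 at the budget, 1 at max_len
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * t))


def hmpo_style_rewards(lengths, correct, max_len, floor=0.1):
    """Return (budget, rewards) for one group of rollouts on the same prompt."""
    budget = median_budget(lengths, correct)
    rewards = []
    for n, ok in zip(lengths, correct):
        # With no correct rollout there is no budget and every reward is zero.
        score = 0.0 if budget is None else cosine_length_reward(n, budget, max_len, floor)
        # Multiplicative form: an incorrect answer gets 0 regardless of its length.
        rewards.append(float(ok) * score)
    return budget, rewards


if __name__ == "__main__":
    lengths = [100, 200, 300, 400, 50]
    correct = [True, True, True, True, False]
    budget, rewards = hmpo_style_rewards(lengths, correct, max_len=400)
    print(budget, rewards)
```

In this sketch the nonzero `floor` keeps every correct rollout strictly above every incorrect one, which is one possible reading of "strictly prioritizing answer correctness"; the paper may achieve this differently.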
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





