arXiv

Efficiently Aligning Language Models with Online Natural Language Feedback

Title: Optimizing Language Model Alignment Through Online Natural Language Feedback

Abstract: Reinforcement learning with verifiable rewards has driven strong language model performance in many domains, but broad deployment of AI also requires training models that excel in "fuzzy" domains whose success criteria are hard to supervise. This study introduces techniques for aligning language models in such domains, using online natural language feedback so that human experts need to provide high-quality supervision for only a small subset of model outputs. The approach is iterative: optimize the model against a proxy reward signal, halt before over-optimization occurs, gather new expert supervision, and update the proxy reward. The proxy reward models are built by applying fine-tuning and in-context learning (ICL) to language models. We evaluated the methods by improving creative writing in Qwen3-8B and alignment research capabilities in Haiku 4.5. For Qwen3-8B, ICL methods achieved up to 35% performance recovery using 50 times fewer expert samples, whereas fine-tuning methods recovered 80% of performance with up to 20 times fewer samples and 100% with three times fewer samples. For Haiku 4.5, ICL methods yielded up to 35% performance recovery with 30 times fewer samples, and fine-tuning methods attained full performance recovery using 10 times fewer samples. These findings indicate that online natural language feedback can significantly improve the data efficiency of expert supervision.
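
The iterative procedure in the abstract (optimize against a proxy reward, stop early, collect fresh expert supervision, refit the proxy) can be illustrated with a toy sketch. This is not the paper's implementation. Everything in it is a hypothetical stand-in: the "policy" is a single number `theta`, `expert_score` plays the role of expensive expert feedback, and `fit_proxy` builds a local linear proxy from two expert labels rather than fine-tuning or prompting a language model.

```python
from typing import Callable, Tuple


def expert_score(theta: float) -> float:
    """Hypothetical stand-in for expensive expert feedback; best at theta = 3."""
    return -(theta - 3.0) ** 2


def fit_proxy(theta: float, eps: float = 0.1) -> Tuple[Callable[[float], float], int]:
    """Fit a local linear proxy reward from two expert labels near theta.

    Returns the proxy and the number of expert labels consumed.
    """
    lo = expert_score(theta - eps)
    hi = expert_score(theta + eps)
    slope = (hi - lo) / (2 * eps)
    mid = (hi + lo) / 2

    def proxy(x: float) -> float:
        return mid + slope * (x - theta)

    return proxy, 2


def optimize_against_proxy(
    theta: float,
    proxy: Callable[[float], float],
    steps: int,
    lr: float = 0.05,
    h: float = 1e-3,
) -> float:
    """Gradient ascent on the proxy only; `steps` acts as the early-stopping budget."""
    for _ in range(steps):
        grad = (proxy(theta + h) - proxy(theta - h)) / (2 * h)
        theta += lr * grad
    return theta


def online_feedback_loop(
    theta: float, rounds: int, steps_per_round: int
) -> Tuple[float, int]:
    """Alternate between refreshing the proxy from expert labels and optimizing it."""
    labels_used = 0
    for _ in range(rounds):
        proxy, n_labels = fit_proxy(theta)  # gather new expert supervision
        labels_used += n_labels
        theta = optimize_against_proxy(theta, proxy, steps_per_round)  # halt early
    return theta, labels_used


final_theta, labels = online_feedback_loop(theta=0.0, rounds=6, steps_per_round=5)
print(final_theta, labels)
```

In this toy setting, a small `steps_per_round` keeps each optimization phase inside the region where the linear proxy still tracks the expert score. A much larger budget (for example 50 steps per round) overshoots the optimum and moves further away each round, which is a simple analogue of over-optimizing a stale proxy.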


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
