Efficiently Aligning Language Models with Online Natural Language Feedback
Title: Optimizing Language Model Alignment Through Online Natural Language Feedback
Abstract: Reinforcement learning with verifiable rewards has driven strong language model performance across many domains, but the broad deployment of AI also requires models that excel in "fuzzy" domains whose success criteria are difficult to supervise. This study introduces techniques for aligning language models in such domains using online natural language feedback, so that human experts need to provide high-quality supervision for only a small subset of model outputs. Our approach iteratively optimizes the model against a proxy reward signal, halts before over-optimization occurs, gathers new expert supervision, and then updates the proxy reward. The proxy reward models are built by applying fine-tuning and in-context learning (ICL) to language models. We evaluate our methods on improving creative writing in Qwen3-8B and alignment research capabilities in Haiku 4.5. For Qwen3-8B, ICL methods achieve up to 35% performance recovery with 50 times fewer expert samples, whereas fine-tuning methods recover 80% of performance with up to 20 times fewer samples and 100% with three times fewer samples. For Haiku 4.5, ICL methods achieve up to 35% performance recovery with 30 times fewer samples, and fine-tuning methods attain full performance recovery with 10 times fewer samples. These findings indicate that online natural language feedback can significantly improve the data efficiency of expert supervision.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
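
The iterative procedure described in the abstract (optimize against a proxy reward, halt before over-optimization, gather new expert supervision, update the proxy) can be illustrated with a small toy sketch. This is not the paper's implementation: the "policy" here is a single number, `expert_score` is a hypothetical stand-in for costly expert feedback, and the proxy is a least-squares line rather than a fine-tuned or in-context language model. All names and parameters are assumptions chosen for illustration.

```python
# Toy sketch only: a scalar "policy" and a linear proxy stand in for the
# language models described in the abstract. All names are hypothetical.

def expert_score(x: float) -> float:
    """Stand-in for costly expert feedback; the expert prefers x = 3."""
    return -(x - 3.0) ** 2


def fit_linear_proxy(xs, ys):
    """Fit y ~ a*x + b by least squares; the line acts as the proxy reward."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def align(rounds=20, max_steps=5, lr=0.05):
    """Alternate between querying the expert and optimizing against the proxy."""
    offsets = [-0.5, -0.25, 0.25, 0.5]  # fixed probe points around the policy
    theta = 0.0                          # current policy parameter
    labels_used = 0
    for _ in range(rounds):
        # Gather expert supervision on a small batch of current outputs.
        xs = [theta + o for o in offsets]
        ys = [expert_score(x) for x in xs]
        labels_used += len(xs)
        # Update the proxy reward from the fresh expert labels.
        slope, _ = fit_linear_proxy(xs, ys)
        # Optimize against the proxy for a bounded number of steps only.
        # The line is accurate near the labeled points, so stopping early
        # keeps the policy where the proxy can still be trusted.
        for _ in range(max_steps):
            theta += lr * slope
    return theta, labels_used


if __name__ == "__main__":
    theta, labels_used = align()
    print(f"final policy parameter: {theta:.6f}, expert labels: {labels_used}")
```

In this toy setting the step budget `max_steps` plays the role of halting before over-optimization: with the defaults the policy approaches the expert's preferred value, while a much larger budget (for example `max_steps=50`) follows the stale linear proxy too far each round and drifts away from it.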




