arXiv

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

June 4, 2026 · Vinay Samuel, Yapei Chang, Mohit Iyyer

Title: Restoring Diversity in Post-Trained LLMs Without Compromising Alignment: A DPO-Based Approach

Abstract: Many open-ended prompts admit multiple correct answers that users would benefit from seeing, yet post-training often narrows a large language model’s (LLM) outputs to a small set of canonical responses. To address this, we present REDIPO, an offline data-construction pipeline based on Direct Preference Optimization (DPO) designed to recover distinct, valid answer modes while maintaining the alignment advantages of the instruct model. For each prompt, REDIPO samples responses from both the base and instruct models, rewrites the base model’s outputs using the instruct model, and filters the resulting candidates for safety and instruction-following quality. It then constructs preference pairs that prioritize marginally diverse responses among candidates with similar instruction-following rewards.
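The final pairing step of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` fields, the `min_reward` and `reward_gap` thresholds, and the use of mean token-set Jaccard distance as a stand-in for "marginal diversity" are all assumptions made for illustration. Sampling from the base and instruct models, and rewriting base outputs, are assumed to have already happened upstream.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One candidate response for a prompt (all fields are hypothetical)."""
    text: str
    reward: float  # instruction-following reward from some scorer
    safe: bool     # verdict of some safety filter


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance; an assumed stand-in diversity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def marginal_diversity(index: int, pool: list[Candidate]) -> float:
    """Mean distance from one candidate to every other candidate in the pool."""
    others = [c for j, c in enumerate(pool) if j != index]
    if not others:
        return 0.0
    total = sum(jaccard_distance(pool[index].text, o.text) for o in others)
    return total / len(others)


def build_pairs(
    candidates: list[Candidate],
    min_reward: float = 0.5,   # assumed quality threshold
    reward_gap: float = 0.1,   # assumed bound on reward difference within a pair
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) text pairs.

    Candidates are first filtered for safety and quality. Two surviving
    candidates are paired only if their rewards are close, and the one
    with higher marginal diversity becomes the chosen response.
    """
    pool = [c for c in candidates if c.safe and c.reward >= min_reward]
    diversity = [marginal_diversity(i, pool) for i in range(len(pool))]

    pairs = []
    for i, j in combinations(range(len(pool)), 2):
        if abs(pool[i].reward - pool[j].reward) > reward_gap:
            continue  # quality-bounded pairing: skip pairs with dissimilar rewards
        if diversity[i] == diversity[j]:
            continue  # no diversity preference to express
        chosen, rejected = (i, j) if diversity[i] > diversity[j] else (j, i)
        pairs.append((pool[chosen].text, pool[rejected].text))
    return pairs
```

Bounding the reward gap inside each pair means the preference signal expresses diversity rather than quality, which matches the abstract's description of pairing candidates with similar rewards. In practice, pairs like these would typically be stored as prompt/chosen/rejected records for a DPO trainer.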

Evaluations across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B show that REDIPO increases NoveltyBench distinct_k scores by 134%, 33%, and 44%, respectively, relative to the instruct checkpoints. By contrast, DivPO changes diversity by 0%, -6%, and -4% on the same models. These diversity gains largely preserve performance on MTBench, IFEval, and Arena-Hard, while also lowering the direct-category attack success rate on HarmBench. Ablation studies indicate that the diversity gains are driven primarily by the selection of marginally diverse pairs and the rewriting of base responses, whereas filtering and quality-bounded pairing are crucial for sustaining alignment. These findings suggest that carefully curated preference data can reintroduce diverse, valid answers derived from base-model generations without sacrificing the alignment benefits of post-training. Our code and data are available at https://github.com/vsamuel2003/ReDiPO.

Source: arXiv