Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling
Title: Interpretable and Specialized Experts in Sparse Mixture-of-Experts Reward Models for Personalized Preference Modeling
Abstract:
Reinforcement learning from human feedback (RLHF) relies heavily on preference modeling to ensure that large language models (LLMs) align with human values. Yet, conventional methods typically operate under the assumption of a single, universal reward function, thereby overlooking the varied and heterogeneous nature of human preferences. To overcome this constraint without incurring extra annotation expenses, recent studies have suggested deriving multiple preference components from binary data and integrating them to represent individual tastes. However, such components frequently struggle to exhibit coherent and disentangled structures, which hampers both their interpretability and their capacity to personalize effectively.
In response, this study introduces a sparse Mixture-of-Experts (MoE) reward model designed to foster expert diversity and enforce sparse routing during training on binary preference datasets. Experiments in both controlled and real-world settings show that the sparse MoE architecture learns interpretable routing and distinct, specialized experts, and that it improves personalization at inference time. In addition, shifts in expert weights after adaptation offer a qualitative view of how the model adjusts to individual preferences.
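To make the idea of sparse routing over expert reward heads concrete, the following is a minimal sketch and not the authors' implementation. It assumes linear experts and a linear router over a fixed feature vector, top-k gating with a softmax over the selected experts, and a Bradley-Terry loss on binary preference pairs. All names here (`top_k_gate`, `moe_reward`, `bradley_terry_loss`) and the toy weights are hypothetical, and the diversity regularizer mentioned in the abstract is omitted because the abstract does not specify it.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_gate(logits, k):
    """Keep the k largest router logits, softmax over them, zero the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    weights = [0.0] * len(logits)
    for i, p in zip(keep, probs):
        weights[i] = p
    return weights


def moe_reward(features, router, experts, k=1):
    """Scalar reward as a sparse weighted sum of per-expert linear rewards."""
    weights = top_k_gate([dot(r, features) for r in router], k)
    reward = sum(w * dot(e, features) for w, e in zip(weights, experts) if w > 0.0)
    return reward, weights


def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response is preferred."""
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


# Toy example (assumed values): two experts, two-dimensional features.
router = [[1.0, 0.0], [0.0, 1.0]]    # one router row per expert
experts = [[2.0, 0.0], [0.0, -1.0]]  # one linear reward head per expert
chosen, rejected = [1.0, 0.2], [0.1, 0.9]

r_c, w_c = moe_reward(chosen, router, experts, k=1)
r_r, w_r = moe_reward(rejected, router, experts, k=1)
loss = bradley_terry_loss(r_c, r_r)
```

In this toy setup the two responses are routed to different experts, so the gate weights `w_c` and `w_r` show which expert scored each response. Inspecting such weights is the kind of qualitative signal the abstract describes for understanding adaptation.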
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



