(Mis)generalization of Helpful-only Fine-tuning
Title: The Pitfalls of Overgeneralizing Helpful-Only Fine-Tuning
Abstract: Models designed to be "helpful-only", specifically those optimized to strictly adhere to user intent, are crucial for assessing dangerous AI capabilities and for advancing research areas where refusal mechanisms hinder progress. However, the broader generalization characteristics of helpful-only training remain poorly understood. While such models exhibit lower refusal rates than their harmless counterparts, prior research has largely overlooked other facets of their alignment. This study investigates the limitations of current helpful-only models. We observe that some exhibit emergent misalignment, others retain residual refusal tendencies, and the majority demonstrate poor steerability, sycophantic behavior, and weak persona coherence. Our analysis reveals that straightforward anti-refusal training is a primary driver of these issues. Nevertheless, these drawbacks are not inherent to helpful-only training itself. We demonstrate that employing synthetic document fine-tuning and incorporating character-specific questions into Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) processes can effectively mitigate these problems.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






