Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability
Title: Grounding Safety Evaluations for Fine-Tuned LLMs in Capability
Abstract
Fine-tuning a foundation large language model for a specific task or stylistic preference can inadvertently undermine the model’s safety protocols. Prior research has investigated the safety implications of fine-tuning, but often under constrained and seemingly arbitrary experimental designs. We contend that fine-tuning objectives should be tied to distinct capability goals: doing so avoids arbitrary empirical decisions, yields meaningful insights into safety impacts, and provides a consistent framework for comparing mitigation strategies. Using a multi-dimensional assessment that measures both capability and safety, our analysis highlights three critical findings: first, fine-tuned models may generate incoherent responses to safety-related prompts; second, automated safety assessment tools are unreliable when evaluating such incoherent outputs; and third, conclusions about the impact of fine-tuning are highly sensitive to the choice of both the safety benchmark and the evaluation method.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



