The Refusal--Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models
Title: The Refusal–Compliance Tradeoff: A Comprehensive Safety Audit of Large Language Models
Abstract: Refusal rates alone are an inadequate measure of Large Language Model (LLM) safety, because a system may over-refuse safe queries while still complying with malicious requests. This study jointly audits both failure modes across 21 open-weight LLMs using four safety benchmarks: OR-Bench, XSTest, ToxiGen, and BOLD. A composition adjustment technique disentangles model sensitivity from the confounding effect of dataset toxicity. Our analysis yields three primary findings. First, models follow distinct calibration strategies: conservative model families such as Llama prioritize suppressing unsafe content at the cost of higher over-refusal rates, whereas permissive ecosystems such as DeepSeek and Qwen remain more helpful but accept a higher risk of harmful compliance. Second, protection is unevenly distributed: models tend to over-protect prominent racial and religious demographics, often refusing benign inquiries about these groups, while offering significantly less defense against attacks targeting people with disabilities. Third, refusal and compliance tendencies remain consistent within a model family across generations and scales, indicating that post-training objectives shape safety behavior more strongly than architectural design. These findings underscore the need for safety evaluations that cover over-refusal and harmful compliance jointly, are sensitive to demographic differences, and are supported by multi-judge assessment.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




