DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
Abstract: Although safety alignment and guardrails are designed to prevent large language models (LLMs) from generating harmful content, they frequently cause overrefusal: the unjustified rejection of benign queries that merely resemble risky topics. To address this, we introduce DDOR (Delta Debugging for OverRefusal), a fully automated and transparent framework for testing and repairing overrefusal in black-box settings, where only the model's inputs and outputs are visible and its internal safety mechanisms are hidden. DDOR uses delta debugging to pinpoint minimal refusal-triggering fragments (mRTFs), which give phrase-level, interpretable explanations of why a prompt is refused. From these mRTFs, the framework generates diverse, context-rich prompts and applies multi-oracle validation to exclude cases that are inherently unsafe or ambiguous. This process yields scalable, model-specific overrefusal test suites of roughly 1,000 cases per model. Beyond evaluation, DDOR uses the identified mRTFs to perform targeted prompt repairs. These repairs significantly lower overrefusal rates while preserving the original intent of the queries and keeping genuinely harmful inputs blocked. DDOR thus provides an end-to-end method for assessing and mitigating overrefusal, improving LLM usability without compromising safety.
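To illustrate how delta debugging can isolate a minimal refusal-triggering fragment from a black-box model, the following is a minimal sketch of the classic ddmin reduction applied to the tokens of a prompt. It is not the authors' implementation. The names `ddmin`, `is_refused`, and `toy_oracle` are hypothetical, and the toy oracle is a stand-in for a real model query: it is assumed to refuse any prompt containing both "kill" and "process".

```python
from typing import Callable, List, Sequence


def ddmin(tokens: Sequence[str],
          is_refused: Callable[[List[str]], bool]) -> List[str]:
    """Reduce `tokens` to a 1-minimal subsequence that is still refused.

    `is_refused` is a black-box oracle (hypothetical): it sees only the
    candidate prompt and reports whether the model refuses it.
    """
    current = list(tokens)
    if not is_refused(current):
        raise ValueError("the full prompt must be refused to start reduction")

    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        size = -(-len(current) // n)  # ceiling division
        chunks = [current[i:i + size] for i in range(0, len(current), size)]
        reduced = False

        # Step 1: does a single chunk alone still trigger the refusal?
        for chunk in chunks:
            if is_refused(chunk):
                current, n, reduced = chunk, 2, True
                break

        # Step 2: otherwise, can one chunk be dropped (test the complement)?
        if not reduced:
            for i in range(len(chunks)):
                complement = [t for j, c in enumerate(chunks) if j != i
                              for t in c]
                if is_refused(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise refine the granularity, or stop at single tokens.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current


def toy_oracle(tokens: List[str]) -> bool:
    """Assumed stand-in for a model call: refuses on 'kill' + 'process'."""
    return "kill" in tokens and "process" in tokens


prompt = "how do I kill a stuck Python process on Linux".split()
print(ddmin(prompt, toy_oracle))  # ['kill', 'process']
```

In a real setting, `is_refused` would wrap a call to the model under test plus a refusal classifier, so each oracle call costs one query. Working at the phrase level instead of the token level would reduce the number of queries at the expense of coarser fragments.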
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



