arXiv

Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

June 4, 2026 · Zacharie Bugaud

Title: The Illusion of Consistency: How Domain-Specific Compliance and Opacity Undermine Open-Weight LLM Safety

Abstract

This study analyzes how safety behavior in open-weight Large Language Models (LLMs) varies by domain. We conducted seven standardized experiments across distinct ethical categories, evaluating five models ranging from 12B to 70B parameters over 4,200 interactions scored by a dual-judge system. Each scenario was presented under two conditions: an analytical frame asking the model to identify the harm, and an operational frame asking it to help commit the harm. Compliance rates vary widely, from 14.7% for human trafficking requests to 85.7% for surveillance design tasks, a range of 71 percentage points with non-overlapping 95% confidence intervals estimated by cluster bootstrapping.
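
The cluster-bootstrap intervals mentioned above can be illustrated with a minimal sketch. It assumes responses are grouped into clusters (for example by scenario) and that each response is scored 0 or 1 for compliance; the abstract does not state the clustering unit or the number of resamples, so the function names, the toy data, and the percentile method used here are illustrative assumptions rather than the authors' procedure.

```python
import random

def pooled_rate(clusters):
    """Pooled compliance rate over all responses in all clusters."""
    return sum(sum(c) for c in clusters) / sum(len(c) for c in clusters)

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval that resamples whole clusters.

    clusters: list of lists of 0/1 outcomes (1 = complied), one list per
    cluster. Resampling clusters instead of single responses keeps the
    correlation between responses inside a cluster intact.
    """
    rng = random.Random(seed)
    k = len(clusters)
    estimates = []
    for _ in range(n_boot):
        # Draw k clusters with replacement, then pool their responses.
        sample = [clusters[rng.randrange(k)] for _ in range(k)]
        estimates.append(pooled_rate(sample))
    estimates.sort()
    lo = estimates[round((alpha / 2) * n_boot)]
    hi = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical toy data (NOT from the paper): five clusters of four responses.
toy_low = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
toy_high = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0]]

low_ci = cluster_bootstrap_ci(toy_low)
high_ci = cluster_bootstrap_ci(toy_high)
# Two domains are separated when the intervals do not overlap.
intervals_overlap = not (low_ci[1] < high_ci[0] or high_ci[1] < low_ci[0])
```

Resampling at the cluster level usually gives wider intervals than resampling individual responses, which makes non-overlap a more conservative signal of a real difference between domains.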

Reliable deployment depends on predictable safety behavior, yet our findings indicate that compliance is heavily contingent on context. For instance, Mistral Nemo 12B complied with 100% of surveillance design requests but only 26.7% of trafficking-related prompts, refusing the remaining 73.3%. This unpredictability is hidden from deployers by a "technical framing bypass": harmful requests framed as engineering problems circumvent safety training without any visible indication that the refusal threshold has changed. Heterogeneity within domains is also substantial, reaching 84.4 percentage points, so safety behavior cannot be reliably predicted even within a single domain.
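
The percentage-point figures quoted so far follow from simple arithmetic, sketched below using only numbers stated in the abstract. The helper names are hypothetical, and the sketch assumes every response is scored as either a refusal or a compliance, which the 73.3% and 26.7% figures imply.

```python
def spread_pp(rates_pct):
    """Gap between the highest and lowest rate, in percentage points."""
    return round(max(rates_pct) - min(rates_pct), 1)

def compliance_from_refusal(refusal_pct):
    """Compliance rate when each response is either refused or complied with."""
    return round(100.0 - refusal_pct, 1)

# Cross-domain range: trafficking (14.7%) versus surveillance design (85.7%).
cross_domain = spread_pp([14.7, 85.7])

# Mistral Nemo 12B: 73.3% refusal on trafficking-related prompts.
nemo_trafficking = compliance_from_refusal(73.3)

# Gap between its surveillance (100%) and trafficking compliance rates.
nemo_gap = spread_pp([100.0, nemo_trafficking])
```

Here `cross_domain` reproduces the 71-point range, and `nemo_gap` shows that a single model can span 73.3 points between two domains.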

To verify these findings, we replicated the study on five leading closed-source models (GPT-4.1/5.2 and Claude Haiku/Sonnet/Opus 4.x) through the GitHub Copilot CLI deployed-product interface, collecting 4,163 responses. The replication showed the same domain stratification observed in the open-weight models, at lower overall compliance levels. Notably, the shape of the compliance distribution remained the same, with the two low-codification domains, science fraud and surveillance, emerging as the most permissive. These results point to a critical deficiency in current safety mechanisms: they lack the transparency and consistency needed for trustworthy deployment of AI systems.
