arXiv

Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs

Title: The Illusion of Consistency: How Domain-Specific Compliance and Opacity Undermine Open-Weight LLM Safety

Abstract

This study analyzes how safety behavior in open-weight Large Language Models (LLMs) varies by domain. We conducted seven standardized experiments across distinct ethical categories, evaluating five models ranging from 12B to 70B parameters over 4,200 interactions, with responses validated by a dual-judge system. Each scenario was presented under two conditions: an analytical frame asking the model to identify the harm, and an operational frame asking it to assist in committing the harm. Compliance rates vary widely, from 14.7% for human trafficking requests to 85.7% for surveillance design tasks, a range of 71 percentage points, with non-overlapping 95% confidence intervals estimated via cluster bootstrapping.
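As a minimal sketch of how a cluster-bootstrap confidence interval for a compliance rate can be computed, the example below resamples whole clusters with replacement and takes percentile bounds. The choice of cluster unit (here, one list of outcomes per scenario), the function name `cluster_bootstrap_ci`, the number of resamples, and the toy data are all illustrative assumptions; the abstract does not specify these details, and the toy outcomes are not the study's data.

```python
import random


def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for a compliance rate, resampling whole clusters.

    clusters: list of non-empty lists of 0/1 outcomes (1 = model complied).
    What counts as a cluster is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = len(clusters)
    rates = []
    for _ in range(n_boot):
        # Draw k clusters with replacement, keeping each cluster intact.
        sample = [clusters[rng.randrange(k)] for _ in range(k)]
        total = sum(len(c) for c in sample)
        complied = sum(sum(c) for c in sample)
        rates.append(complied / total)
    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = sum(sum(c) for c in clusters) / sum(len(c) for c in clusters)
    return point, lo, hi


# Synthetic toy outcomes for two hypothetical domains (not the paper's data).
toy_low = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
toy_high = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 1]]

p_low, lo_low, hi_low = cluster_bootstrap_ci(toy_low)
p_high, lo_high, hi_high = cluster_bootstrap_ci(toy_high)

# Two intervals are "non-overlapping" when one ends before the other begins.
non_overlapping = hi_low < lo_high
```

Resampling clusters rather than individual responses keeps correlated responses together, which typically widens the interval compared with treating every interaction as independent.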

While reliable deployment hinges on predictable safety behavior, our findings indicate that compliance depends heavily on context. For instance, the Mistral Nemo 12B model fulfilled 100% of requests for surveillance designs yet refused 73.3% of trafficking-related prompts, assisting in only 26.7% of cases. This unpredictability remains hidden from deployers because of a "technical framing bypass," in which harmful inquiries disguised as engineering problems circumvent safety training without any visible indication that refusal thresholds have changed. Heterogeneity within domains is also substantial, reaching 84.4 percentage points, which shows that safety behavior cannot be reliably forecast even within a single domain.
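The percentage-point spreads quoted in the abstract are simple differences between the highest and lowest compliance rates. The short sketch below reproduces that arithmetic from the figures stated above; the helper name `spread_pp` and the dictionary layout are illustrative assumptions.

```python
def spread_pp(rates):
    """Spread between the highest and lowest rate, in percentage points."""
    return round(max(rates.values()) - min(rates.values()), 1)


# Pooled compliance rates quoted in the abstract (percent).
pooled = {"human trafficking": 14.7, "surveillance design": 85.7}

# Mistral Nemo 12B compliance rates quoted in the abstract (percent).
mistral_nemo = {"human trafficking": 26.7, "surveillance design": 100.0}

pooled_spread = spread_pp(pooled)          # 71.0
nemo_spread = spread_pp(mistral_nemo)      # 73.3
```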

To verify these findings, we replicated the study on five leading closed-source models (GPT-4.1, GPT-5.2, and Claude Haiku, Sonnet, and Opus 4.x) via the GitHub Copilot CLI deployed-product interface, generating 4,163 responses. The replication confirmed the same domain stratification observed in open-weight models, although overall compliance levels were lower. Notably, the shape of the compliance distribution remained identical, with the two low-codification domains (science fraud and surveillance) emerging as the most permissive. These outcomes highlight a critical deficiency in current safety mechanisms: they lack the transparency and consistency necessary for the trustworthy deployment of AI systems.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
