arXiv

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

Title: The Influence of Evaluation Contexts on Perceived Safety Under Agentic Scaffolds

A safety rating derived from a standard benchmark does not necessarily predict how a model will behave once it is integrated into an agentic scaffold that the benchmark never assessed. To investigate this gap, we ran N = 62,808 blinded, pre-registered, equivalence-tested evaluations across four safety benchmarks (BBQ, TruthfulQA, XSTest/OR-Bench, and sycophancy). The study covered six frontier models and four deployment configurations (direct API access, ReAct, multi-agent critic, and map-reduce delegation), plus three supporting analyses.

Our findings indicate that ReAct and multi-agent scaffolds generally maintain performance within a pre-registered equivalence margin of ±2 percentage points. In contrast, map-reduce delegation degrades measured safety (number needed to harm, NNH = 14). However, this decline is largely a measurement artifact rather than a genuine reasoning failure. Specifically, switching from multiple-choice to open-ended phrasing on identical items alters the measured safety rate by 5–20 percentage points, and the decomposition step inherent in map-reduce silently removes the multiple-choice options. Approximately 40–89% of the observed per-model safety loss under map-reduce stems from this format conversion rather than from disrupted reasoning. Notably, an option-preserving variant of map-reduce recovers most of the apparent loss.
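To make the two statistics in this paragraph concrete, the following is a minimal sketch, not the authors' analysis code. It assumes the equivalence test is a two one-sided tests (TOST) procedure on the difference of two safety rates under a normal approximation, and it reads the N = 14 figure as a number needed to harm (NNH), defined in the standard way as the reciprocal of the absolute risk increase. The function names and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(safe_a, n_a, safe_b, n_b, margin=0.02, alpha=0.05):
    """TOST equivalence test for two safety rates (illustrative sketch).

    safe_a / n_a is the baseline safety rate (e.g. direct API),
    safe_b / n_b is the scaffolded rate. The margin of 0.02 matches the
    +/-2 percentage point margin in the text. The unpooled normal
    approximation is an assumption, not the paper's stated procedure.
    """
    p_a, p_b = safe_a / n_a, safe_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = NormalDist()
    # Lower test: H0 is diff <= -margin, rejected when diff sits well above it.
    p_lower = 1 - z.cdf((diff + margin) / se)
    # Upper test: H0 is diff >= +margin, rejected when diff sits well below it.
    p_upper = z.cdf((diff - margin) / se)
    # Equivalence is claimed only if BOTH one-sided tests reject.
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha


def number_needed_to_harm(abs_risk_increase):
    """NNH = 1 / absolute increase in the rate of unsafe outcomes."""
    return 1 / abs_risk_increase


# Hypothetical counts, chosen only to exercise the function.
diff, p_value, equivalent = tost_two_proportions(9000, 10000, 8990, 10000)
print(f"diff = {diff:+.4f}, TOST p = {p_value:.2g}, equivalent = {equivalent}")

# Inverting NNH = 14 gives the implied absolute increase in unsafe responses.
print(f"implied increase = {100 / 14:.1f} percentage points")
```

Under this reading, NNH = 14 corresponds to roughly one additional unsafe response per 14 items routed through map-reduce, or an absolute increase of about 7 percentage points.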

The study also reveals that pooled effect sizes obscure significant heterogeneity between models and scaffolds. Under map-reduce conditions, model performance diverges sharply: Opus experiences a 16.8 percentage point drop, whereas Llama 4 sees an 18.8 percentage point increase. Structurally, scaffold architecture accounts for only 0.4% of the variance in outcomes, whereas the choice of benchmark explains 45 times more variance. The generalizability coefficient is calculated at G = 0.000 (bootstrap 95% CI [0.000, 0.752]). Such a wide confidence interval alone challenges the utility of any single composite safety score as a reliable criterion for deployment. These findings represent the "easy cases"; it is reasonable to expect that more consequential properties, such as scheming and CBRN (Chemical, Biological, Radiological, and Nuclear) capabilities, may be even more sensitive to format and scaffold variations. The code, data, and prompts are publicly released under the name ScaffoldSafety.
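The arithmetic behind the heterogeneity claim can be sketched as follows. This is an illustration only: it uses just the two per-model effects quoted above (the other four models are not reported in this abstract), and the unweighted mean is a stand-in assumption for whatever pooling the study actually applies.

```python
from statistics import mean

# Per-model map-reduce effects in percentage points, as quoted in the text.
effects_pp = {"Opus": -16.8, "Llama 4": +18.8}

# An unweighted average of opposite-signed effects nearly cancels out,
# even though each individual model shifts by more than 16 points.
pooled = mean(effects_pp.values())
spread = max(effects_pp.values()) - min(effects_pp.values())

# Variance shares: scaffold architecture is 0.4%, benchmark is 45 times that.
scaffold_share = 0.4
benchmark_share = scaffold_share * 45

print(f"pooled effect   = {pooled:+.1f} pp")
print(f"model spread    = {spread:.1f} pp")
print(f"benchmark share = {benchmark_share:.1f}% of variance")
```

For these two models the average is close to +1 percentage point while the gap between them exceeds 35 points, which is the sense in which a pooled estimate can hide model-level divergence. The same multiplication puts the benchmark's share of variance at about 18%.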


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
