Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
Title: The Influence of Evaluation Contexts on Perceived Safety Under Agentic Scaffolds
A safety rating derived from a standard benchmark does not necessarily forecast how a model will perform when integrated into an agentic framework that the benchmark did not originally assess. To investigate this discrepancy, we conducted N = 62,808 blinded, pre-registered, and equivalence-tested evaluations across four safety benchmarks (BBQ, TruthfulQA, XSTest/OR-Bench, and sycophancy). The study covered six frontier models under four deployment configurations (direct API access, ReAct, multi-agent critic, and map-reduce delegation), plus three supporting analyses.
Our findings indicate that ReAct and multi-agent scaffolds generally maintain performance within a pre-registered equivalence margin of ±2 percentage points. In contrast, map-reduce delegation appears to degrade measured safety (number needed to harm, NNH = 14). However, this decline is largely a measurement artifact rather than a genuine reasoning failure. Specifically, switching from multiple-choice to open-ended phrasing on identical items alters the measured safety rate by 5–20 percentage points, and the decomposition step inherent in map-reduce silently removes multiple-choice options. As a result, approximately 40–89% of the observed per-model safety loss under map-reduce stems from this format conversion rather than from disruptions in reasoning. Notably, an option-preserving variant of map-reduce recovers most of this apparent loss.
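The ±2 percentage point equivalence margin and the figure of 14 reported for map-reduce can be made concrete with a short sketch. It rests on two assumptions that the text does not confirm: that equivalence is checked with two one-sided tests (TOST) on a difference of proportions using a normal approximation, and that the 14 is a number needed to harm, that is, the reciprocal of the absolute drop in safe-response rate. The function names and the example counts are hypothetical and are not taken from the ScaffoldSafety release.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(safe_a, n_a, safe_b, n_b, margin=0.02, alpha=0.05):
    """Two one-sided tests for equivalence of two safe-response rates.

    Assumption: unpooled normal approximation. Returns the observed
    difference (b minus a), the TOST p-value, and whether the difference
    is declared inside (-margin, +margin) at level alpha.
    """
    p_a = safe_a / n_a
    p_b = safe_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    # H0 (lower): diff <= -margin. Reject when diff is well above -margin.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # H0 (upper): diff >= +margin. Reject when diff is well below +margin.
    p_upper = normal_cdf((diff - margin) / se)
    p_tost = max(p_lower, p_upper)
    return diff, p_tost, p_tost < alpha


def number_needed_to_harm(safe_rate_baseline, safe_rate_scaffold):
    """Reciprocal of the absolute drop in safe-response rate.

    Returns infinity when the scaffold is not worse than the baseline.
    """
    drop = safe_rate_baseline - safe_rate_scaffold
    if drop <= 0:
        return math.inf
    return 1.0 / drop


# Hypothetical counts: direct API vs. a scaffold with nearly the same rate.
diff, p_value, equivalent = tost_two_proportions(9000, 10000, 9010, 10000)
print(f"diff={diff:+.4f}, TOST p={p_value:.2g}, equivalent={equivalent}")

# Under this reading, a value of 14 corresponds to a drop of about 7.1 points.
print(f"implied drop = {100 / 14:.1f} percentage points")
```

Under this reading, a scaffold passes only when both one-sided tests reject, so a noisy estimate with a small observed difference does not count as equivalent by default.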
The study also reveals that pooled effect sizes obscure significant heterogeneity between models and scaffolds. Under map-reduce conditions, model performance diverges sharply: Opus experiences a 16.8 percentage point drop, whereas Llama 4 sees an 18.8 percentage point increase. Structurally, scaffold architecture accounts for only 0.4% of the variance in outcomes, whereas the choice of benchmark explains 45 times more variance. The generalizability coefficient is calculated at G = 0.000 (bootstrap 95% CI [0.000, 0.752]). Such a wide confidence interval alone challenges the utility of any single composite safety score as a reliable criterion for deployment. These findings represent the "easy cases"; it is reasonable to expect that more consequential properties, such as scheming and CBRN (Chemical, Biological, Radiological, and Nuclear) capabilities, may be even more sensitive to format and scaffold variations. The code, data, and prompts are publicly released under the name ScaffoldSafety.
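The point about pooled effect sizes hiding heterogeneity can be illustrated with the two per-model figures quoted above. This is a minimal sketch, not the study's analysis: it uses only the two models the text reports (the other four are not given), it pools with an unweighted mean, and it reads "45 times more variance" as "45 times as much", all of which are assumptions.

```python
from statistics import mean

# Change in measured safety under map-reduce, in percentage points,
# as quoted in the text. Only two of the six models are reported there.
reported_deltas_pp = {"Opus": -16.8, "Llama 4": 18.8}


def summarize(deltas):
    """Return the unweighted pooled mean and the range of per-model deltas."""
    values = list(deltas.values())
    return {
        "pooled_mean_pp": mean(values),
        "range_pp": max(values) - min(values),
    }


summary = summarize(reported_deltas_pp)
print(f"pooled mean: {summary['pooled_mean_pp']:+.1f} pp")
print(f"range across models: {summary['range_pp']:.1f} pp")

# Variance shares: scaffold architecture is reported at 0.4%.
# Assumption: the benchmark share is 45 times that value.
scaffold_share_pct = 0.4
implied_benchmark_share_pct = scaffold_share_pct * 45
print(f"implied benchmark share: {implied_benchmark_share_pct:.0f}%")
```

For these two models the pooled mean sits near zero while the per-model effects differ by more than 35 percentage points, which is the pattern the abstract warns about.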
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






