arXiv

Constitutional Black-Box Monitoring for Scheming in LLM Agents

June 2, 2026 · Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn · Original Source

Title: Constitutional Black-Box Monitoring for Scheming in LLM Agents

Abstract: Ensuring the secure deployment of Large Language Model (LLM) agents in autonomous environments necessitates robust oversight frameworks. A primary difficulty in this domain is the detection of "scheming"—a behavior wherein agents secretly advance goals that conflict with their intended objectives. One strategy to address these risks involves LLM-based monitoring, which utilizes language models to analyze agent activities for signs of suspicious conduct. This paper investigates constitutional black-box monitors, defined as prompted classifiers that identify scheming by relying exclusively on external inputs and outputs. These monitors are trained on synthetic data derived from natural-language behavior specifications.

To facilitate this training, we present two distinct pipelines for creating synthetic agent trajectories: STRIDE, which employs iterative refinement, and Gloom, which utilizes agent-environment simulation. Each pipeline generates 1,000 samples. We then optimize state-of-the-art LLM monitors using these datasets through a combination of prompt sweeps, human-led refinement, and automated prompt optimization. The performance of these monitors is assessed using 7,500 held-out trajectories from ControlArena, a collection of grounded environments that place agents in more realistic operational contexts.

Our findings indicate that monitors chosen based solely on synthetic data can effectively generalize to more complex, realistic settings, successfully capturing significant signals of scheming behavior. However, we observe that performance levels off rapidly within our experimental framework. Notably, simple prompt sweeps achieve results comparable to those obtained through more intensive optimization processes. Attempting to surpass this threshold does not yield further gains; instead, it results in overfitting.

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC