arXiv

Monitoring Agentic Systems Before They're Reliable

June 2, 2026 · Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase

Title: Pre-Requisite Checks for Agentic Systems Prior to Reliability Assurance

Original: arXiv:2606.02494v1

Abstract: When agentic systems are deployed into production, they often function as partially integrated components. In this phase, the primary causes of failure are structural flaws rather than mistakes at the task level. At this stage of maturity, detecting errors at the task level is often impossible because structural failure modes obscure the specific signals that task-level monitors are built to identify.

We introduce a monitoring and triage approach that breaks down the evaluation of agentic systems into three dimensions (quality, suitability, and efficiency) across three monitoring scopes: within-run, cross-run, and structural. The method uses variance as a primary signal for characterization. Findings are then categorized by severity using a classification system derived from Failure Mode and Effects Analysis (FMEA), which directs human attention to the subset of issues that require investigation.
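As a rough illustration of using variance to characterize a monitoring scope, the sketch below computes a coefficient of variation (standard deviation divided by mean) over per-run finding counts. The abstract does not say which quantity the variance is computed over, so the per-run counts, the scope keys, and the `coefficient_of_variation` helper are all assumptions made for this example, not the authors' implementation.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return population stdev / |mean| for a non-empty list of numbers.

    A result near 0 indicates deterministic behavior across runs;
    a result above 1 indicates highly variable behavior.
    """
    if not values:
        raise ValueError("need at least one observation")
    mu = mean(values)
    if mu == 0:
        # All-zero series is perfectly consistent; otherwise CV is undefined.
        return 0.0 if all(v == 0 for v in values) else float("inf")
    return pstdev(values) / abs(mu)


# Hypothetical finding counts per run, grouped by monitoring scope.
# These numbers are made up for illustration and are not from the paper.
findings_per_run = {
    "within_run": [50, 51, 49, 50],
    "cross_run": [0, 0, 9, 1, 0, 2],
    "structural": [7, 7, 7, 7],
}

cv_by_scope = {
    scope: coefficient_of_variation(counts)
    for scope, counts in findings_per_run.items()
}

for scope, cv in cv_by_scope.items():
    print(f"{scope}: CV = {cv:.2f}")
```

Normalizing by the mean makes scopes with very different finding volumes comparable, which is one plausible reason to prefer a ratio over raw variance here.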

We evaluated the approach on a synthetic testbed of 220 runs across 120 document bundles, with controlled error injection. Three findings emerged. First, the scope of the monitor determines the type of failure it reveals: within-run monitors expose deterministic defects at specific stages (coefficient of variation, CV = 0.02); cross-run monitors uncover stochastic integration consequences (CV = 1.25, representing 24% at L2); and structural monitors detect integration gaps with absolute consistency (CV = 0.00). Second, injected task-level errors were indistinguishable from clean baselines, confirming that structural defects mask task-level signals. Third, deterministic triage routed 97% of findings to automated tracking, leaving only the 2% associated with variable behavior for human investigation.
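The routing described in the third finding can be sketched as a deterministic rule: findings whose behavior is stable across runs go to automated tracking, and findings with variable behavior go to a human queue ordered by severity. The abstract does not give the actual thresholds or severity scale, so `CV_VARIABLE_THRESHOLD`, the 1 to 4 severity scale, the `Finding` type, and the finding names below are hypothetical; the CV values are reused from the abstract purely as illustrative inputs.

```python
from dataclasses import dataclass

# Assumed cutoff separating stable from variable behavior (not from the paper).
CV_VARIABLE_THRESHOLD = 0.5


@dataclass(frozen=True)
class Finding:
    name: str
    scope: str     # "within_run", "cross_run", or "structural"
    severity: int  # hypothetical scale: 1 (low) to 4 (high)
    cv: float      # coefficient of variation observed for this finding


def triage(finding: Finding) -> str:
    """Route a finding with a fixed rule, so the same input always routes the same way."""
    if finding.cv > CV_VARIABLE_THRESHOLD:
        return "human_investigation"
    return "automated_tracking"


# Hypothetical findings for illustration only.
findings = [
    Finding("missing_handoff_field", "structural", severity=4, cv=0.00),
    Finding("stage_parse_defect", "within_run", severity=3, cv=0.02),
    Finding("intermittent_downstream_mismatch", "cross_run", severity=2, cv=1.25),
]

routes = {f.name: triage(f) for f in findings}

# The human queue lists variable findings, highest severity first.
human_queue = sorted(
    (f for f in findings if triage(f) == "human_investigation"),
    key=lambda f: f.severity,
    reverse=True,
)

for name, route in routes.items():
    print(f"{name} -> {route}")
```

In this sketch, severity only orders the human queue and does not affect routing; a real FMEA-derived scheme would likely combine severity with other factors.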

Based on this Stage 1 evidence, we propose a maturity-staging model for monitoring. This model suggests that as integration defects are resolved, monitoring should evolve from structural characterization to error detection, and finally to reliability tracking. This taxonomy, along with the variance-based scope characterization and severity model, is architecturally transferable to document-driven, multi-stage agentic workflows in regulated industries, though specific calibrations will vary by domain. Our recommendation is to deploy monitoring early, as the initial issues identified are typically the most critical to address.
