The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Abstract:
As autonomous AI agents evolve from simple conversational interfaces into systems capable of long-horizon software execution, runtime safety mechanisms that can decide when to interrupt an agent have become critical. This study investigates intervention timing using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic tool. We evaluate four families of intervention triggers (absolute state thresholds, composite state-action patterns, regex-based reasoning-feature extraction, and zero-shot LLM-as-judge methods) by comparing them against human-annotated intervention points in SWE-bench-Verified debugging traces. Our analysis yields three primary findings.
First, we identify a "State Saturation Trap." Agents exhibit no recovery signals when facing sustained difficulty, causing modeled frustration to rapidly hit its ceiling and remain there. Consequently, triggers based on state thresholds shift from detecting specific moments to acting as near-constant indicators, firing on 39-83% of actions across five tested trajectories.
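As a rough illustration of the saturation effect, the sketch below uses a hypothetical clipped accumulator, not HEART's actual dynamics. The update rule, the `gain`, `decay`, and `threshold` values, and the function name `firing_rate` are all assumptions chosen for the example. It shows how a threshold trigger can go from rarely firing to firing on almost every action once failures are sustained and no recovery signal arrives.

```python
def firing_rate(outcomes, gain=0.25, decay=0.5, threshold=0.75):
    """Fraction of actions on which a state-threshold trigger fires.

    outcomes: iterable of booleans, True meaning the action failed.
    The update rule is a hypothetical stand-in for an affect engine:
    frustration rises by `gain` on failure, falls by `decay` on success,
    and is clipped to [0, 1].
    """
    frustration = 0.0
    fired = 0
    total = 0
    for failed in outcomes:
        if failed:
            frustration = min(1.0, frustration + gain)
        else:
            frustration = max(0.0, frustration - decay)
        if frustration >= threshold:
            fired += 1
        total += 1
    return fired / total if total else 0.0


# Sustained difficulty with no recovery: the state hits its ceiling and stays there.
rate_sustained = firing_rate([True] * 50)

# Intermittent successes: the state keeps resetting below the threshold.
rate_mixed = firing_rate([True, True, False] * 17)

print(rate_sustained, rate_mixed)
```

Under these assumed parameters the trigger fires on nearly every action in the sustained-failure trace and never in the mixed trace, so the threshold carries almost no timing information in the first case.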
Second, we observe a significant capability and context floor for LLM judges. A smaller model (gpt-5.4-mini) never triggered an intervention. While frontier and cross-vendor models managed to escape this zero-firing baseline, they required full-trajectory context to do so. Even under these conditions, their performance remained low, achieving an F1 score of only 0.17-0.40 at costs up to 90 times higher.
Third, and most critically, the supervised target itself lacks reproducibility among humans. When three trained annotators applied a single rubric to a 56-action trajectory, their agreement on where to intervene was only marginally better than chance (Krippendorff’s alpha = +0.047; best pairwise Cohen’s kappa = +0.349). Agreement on the type of intervention was negligible: pause decisions were degenerate, clarification decisions fell below chance, and reflection decisions showed only an alpha of +0.226.
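For readers unfamiliar with the pairwise statistic, the following minimal sketch computes Cohen's kappa for two annotators who label each action as intervene (1) or not (0). The label sequences are invented for illustration and are not data from the study; the function name `cohen_kappa` is likewise an assumption.

```python
import math


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items on which the two annotators match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        # Both annotators used one identical label throughout: kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


# Invented per-action labels (1 = intervene here, 0 = do not).
annotator_a = [0, 0, 1, 1, 0, 0, 0, 1]
annotator_b = [0, 1, 1, 0, 0, 0, 0, 1]

kappa = cohen_kappa(annotator_a, annotator_b)
degenerate = cohen_kappa([0, 0, 0], [0, 0, 0])
print(round(kappa, 3), degenerate)
```

The two invented annotators agree on 6 of 8 actions, yet kappa is well below that raw agreement because most of it is expected by chance when "do not intervene" dominates. When both annotators use a single label throughout, chance agreement equals 1 and the statistic is undefined, which is one way an agreement estimate can be degenerate.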
We conclude that intervention timing is a construct with low reliability, rendering single-annotator F1 an inappropriate optimization target. Our contribution lies in the comprehensive mapping of this issue across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and the reproduction of the saturation effect, rather than in the accuracy of any single detector.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





