arXiv

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

June 2, 2026 · Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

Title: Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Abstract:

In tool-using multi-agent systems built on large language models (LLMs), a run consumes computation through token processing, tool calls, retries, and code execution before it produces an answer. Final-answer evaluation can show that a run failed, but it often does not show where in the trajectory recoverable progress stopped. To address this gap, we present a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM execution traces.

The framework maps common failure patterns to trace signals that can be computed while a run is in progress: tool reliability, execution recoverability, orchestration loops, evidence availability, information volatility, and budget constraints. We implemented the framework in a three-agent question-answering architecture and evaluated it on 165 GAIA validation traces under uniform execution limits.
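To make the idea of cheap trace-level signals concrete, the following is a minimal sketch of how three of them (a tool-failure streak, a repeated-action loop, and budget exhaustion) might be computed from a recorded trace. It is not the authors' implementation: the `Step` record, the function names, and every threshold and budget value are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One trace event. Field names are hypothetical, not from the paper."""
    agent: str
    action: str      # e.g. tool name plus normalized arguments
    tool_ok: bool    # whether the tool call returned without error
    tokens: int      # tokens consumed by this step


def longest_failure_streak(steps: List[Step]) -> int:
    """Length of the longest run of consecutive failed tool calls."""
    best = cur = 0
    for s in steps:
        cur = 0 if s.tool_ok else cur + 1
        best = max(best, cur)
    return best


def first_loop_step(steps: List[Step], repeat_threshold: int = 3) -> Optional[int]:
    """Index at which the same (agent, action) pair has repeated
    `repeat_threshold` times in a row, or None if that never happens."""
    run = 1
    for i in range(1, len(steps)):
        same = (steps[i].agent, steps[i].action) == (steps[i - 1].agent, steps[i - 1].action)
        run = run + 1 if same else 1
        if run >= repeat_threshold:
            return i
    return None


def diagnose(steps: List[Step], max_steps: int = 20, token_budget: int = 20_000,
             streak_threshold: int = 3, repeat_threshold: int = 3) -> Dict[str, bool]:
    """Return one boolean flag per signal. All limits are assumed values."""
    return {
        "tool_failure_streak": longest_failure_streak(steps) >= streak_threshold,
        "orchestration_loop": first_loop_step(steps, repeat_threshold) is not None,
        "max_steps_reached": len(steps) >= max_steps,
        "token_budget_exhausted": sum(s.tokens for s in steps) >= token_budget,
    }


# Hypothetical trace: one successful plan step, then three identical failed searches.
trace = [Step("planner", "plan", True, 200)] + [
    Step("searcher", "web_search:q", False, 100) for _ in range(3)
]
flags = diagnose(trace)
```

In this toy trace, the loop is first detected at step index 3, which is the kind of "where did progress stop" marker that a final-answer check alone would not provide.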

Our analysis shows that operational failures occurred in a large share of runs at every level: 22 of 53 runs at Level 1, 33 of 86 at Level 2, and 12 of 26 at Level 3 failed to yield a usable final answer. The causes varied and included inadequate evidence, repetitive action loops, max-step terminations, streaks of tool failures, and execution calls that returned without useful output. Mean token consumption also roughly doubled, from 8,152 tokens at Level 1 to 16,389 tokens at Level 3, while the evidence-availability and sentence-level support metrics diverged.
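The counts above can be turned into per-level failure rates with simple arithmetic. The short sketch below does this using only the figures reported in the abstract; the variable names are illustrative.

```python
# (failed runs, total runs) per GAIA level, as reported in the abstract.
counts = {"Level 1": (22, 53), "Level 2": (33, 86), "Level 3": (12, 26)}

# Per-level share of runs without a usable final answer.
rates = {level: failed / total for level, (failed, total) in counts.items()}

total_failed = sum(failed for failed, _ in counts.values())
total_runs = sum(total for _, total in counts.values())
overall_rate = total_failed / total_runs

# Ratio of mean token consumption, Level 3 relative to Level 1.
token_ratio = 16_389 / 8_152

for level, rate in rates.items():
    print(f"{level}: {rate:.1%}")
print(f"Overall: {total_failed}/{total_runs} = {overall_rate:.1%}")
print(f"Token ratio (L3/L1): {token_ratio:.2f}x")
```

This gives failure rates of about 41.5%, 38.4%, and 46.2% for Levels 1 to 3, an overall rate of 67 of 165 runs (about 40.6%), and a Level 3 to Level 1 token ratio of about 2.01.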

An LLM-judge grounding audit on a cached set of 10 traces showed that inexpensive online signals and more expensive semantic metrics capture distinct, complementary aspects of failure. These findings position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
