arXiv

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

June 2, 2026 · Nazmus Ashrafi

Title: The Impact of Multi-Agent LLM Architectures on Code Complexity: A Paired Analysis of HumanEval

The landscape of large-language-model code generation has evolved from simple, single-shot prompting into complex multi-agent orchestrations involving distinct roles such as analysts, coders, testers, and debuggers. While these systems are typically assessed solely on their functional correctness, a critical question remains unanswered: do these architectural choices influence the structural complexity of the generated code, and which specific layers contribute to this burden? Although previous research has established that prompt-level variations affect code complexity, the impact of the underlying architecture itself has not been thoroughly investigated.

This study addresses that gap by evaluating six popular multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, and ACT+Debugger) across all 164 tasks in the HumanEval dataset. Using two models from the GPT-4o family, the research generates 1,968 paired observations (6 architectures × 164 tasks × 2 models) and measures complexity with five Radon metrics: source lines of code (SLOC), cyclomatic complexity, and Halstead Volume, Difficulty, and Effort. The analysis uses a paired non-parametric statistical pipeline consisting of Friedman omnibus tests, Wilcoxon signed-rank post-hoc tests with Holm correction, Kendall’s W, and matched-pairs rank-biserial effect sizes, applied to both all-completion and passing-only scenarios.
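As a rough illustration of the paired pipeline described above, the following standard-library sketch computes the Friedman statistic, Kendall's W, Holm-adjusted p-values, and the matched-pairs rank-biserial effect size. It is not the authors' code: the function names and the table layout (one row per task, one column per architecture) are assumptions made for this example, the Friedman statistic omits the tie correction, and the Wilcoxon p-values themselves are assumed to come from a statistics library such as `scipy.stats.wilcoxon`.

```python
def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_and_kendall_w(table):
    """table[i][j] is the metric for task i under architecture j (assumed layout).

    Returns (Friedman chi-square, Kendall's W), both without tie correction.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    chi2 = 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))  # W rescales chi-square to the range [0, 1]
    return chi2, w


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = min(1.0, (m - step) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def matched_pairs_rank_biserial(x, y):
    """Effect size for paired samples; positive when x tends to exceed y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (positive - negative) / (positive + negative)
```

In this sketch a Kendall's W near 1 means the tasks rank the architectures almost identically, and a rank-biserial value near +1 or -1 means one architecture is larger on nearly every task. For real analyses, `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` handle ties and p-values more carefully.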

The results reveal that the six architectures fall into two distinct complexity clusters, separated by a significant gap of 50–130%. This division is consistent across both GPT-4o models and under both evaluation conditions. Regarding specific architectural components, separating the analyst and coder roles tends to increase complexity, whereas including a runtime debugger does not. Notably, when added on top of an analyst-coder baseline, the debugger actually reduces complexity. However, adding a tester layer reverses this trend, inflating complexity once again.

Crucially, the study finds that the additional complexity incurred by "heavy" architectures provides no advantage in terms of pass@1 accuracy. In fact, the leanest architectures perform as well as or better than their more complex counterparts. These findings suggest that architectural elaboration in LLM-based code generation should not be assumed to be beneficial; rather, any added structural complexity must be justified by demonstrable improvements in the metrics that truly matter.
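For readers unfamiliar with pass@1, the sketch below shows the commonly used unbiased pass@k estimator and how it reduces to a plain pass rate at k = 1. The summary does not say how many samples per task the study drew, so the sample counts here are purely illustrative, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one task from n samples, of which c passed.

    Uses 1 - C(n - c, k) / C(n, k), the chance that a random draw of k
    samples contains at least one passing sample.
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results):
    """results is a list of (n, c) pairs, one per task (illustrative layout)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With a single sample per task, `benchmark_pass_at_1` is simply the fraction of tasks whose one completion passes all tests.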
