Global News Digest

arXiv

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Large-language-model code generation has evolved from single-shot prompting into multi-agent orchestrations with distinct roles such as analyst, coder, tester, and debugger. These systems are typically assessed only on functional correctness, which leaves a question open: do architectural choices influence the structural complexity of the generated code, and if so, which layers contribute to it? Previous research has established that prompt-level variations affect code complexity, but the impact of the architecture itself has not been thoroughly investigated.

This study addresses that gap by evaluating six popular multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, and ACT+Debugger) across all 164 tasks in the HumanEval dataset. Using two models from the GPT-4o family, it generates 1,968 paired observations and measures complexity with five Radon metrics: source lines of code (SLOC), cyclomatic complexity, and Halstead Volume, Difficulty, and Effort. The analysis uses a paired non-parametric pipeline consisting of Friedman omnibus tests, Wilcoxon signed-rank post-hoc tests with Holm correction, Kendall's W, and matched-pairs rank-biserial effect sizes, and it is applied to both the all-completions and the passing-only conditions.
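The effect-size and correction steps named above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the study's own code: the function names (`average_ranks`, `kendalls_w`, `rank_biserial`, `holm`) are hypothetical, the Kendall's W shown here omits the tie correction, and the assumption that Holm correction runs over all 15 architecture pairs is ours. The observation count is consistent with 164 tasks × 6 architectures × 2 models. A real analysis would more likely rely on library routines such as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`.

```python
from itertools import combinations

ARCHITECTURES = ["Basic", "AC", "ACT", "Debugger", "AC+Debugger", "ACT+Debugger"]
N_TASKS, N_MODELS = 164, 2
N_OBSERVATIONS = N_TASKS * len(ARCHITECTURES) * N_MODELS  # 1,968, as reported
# Assumption: post-hoc tests compare every unordered pair of architectures.
PAIRS = list(combinations(ARCHITECTURES, 2))


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(table):
    """Kendall's W for n tasks (rows) x k architectures (columns).

    Uses the untied formula W = 12 S / (n^2 (k^3 - k)); no tie correction.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for col, r in enumerate(average_ranks(row)):
            rank_sums[col] += r
    mean_rank_sum = n * (k + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (n ** 2 * (k ** 3 - k))


def rank_biserial(x, y):
    """Matched-pairs rank-biserial r = (W+ - W-) / (W+ + W-).

    Zero differences are dropped, as in the usual Wilcoxon signed-rank setup.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0
    ranks = average_ranks([abs(d) for d in diffs])
    pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (pos - neg) / (pos + neg)


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted
```

For each metric, `kendalls_w` would take one row per task and one column per architecture, while `rank_biserial` would take the per-task values of two architectures. A value near +1 means the first architecture is almost always the more complex of the pair, and a value near -1 means the reverse.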

The results show that the six architectures fall into two distinct complexity clusters separated by a gap of 50-130%. This division holds across both GPT-4o models and under both evaluation conditions. At the component level, separating the analyst and coder roles tends to increase complexity, whereas adding a runtime debugger does not. On top of an analyst-coder baseline, the debugger actually reduces complexity, but adding a tester layer reverses this trend and inflates complexity again.

The study also finds that the additional complexity incurred by "heavy" architectures provides no advantage in pass@1 accuracy: the leanest architectures perform as well as or better than their more complex counterparts. These findings suggest that architectural elaboration in LLM-based code generation should not be assumed to be beneficial. Any added structural complexity must be justified by demonstrable improvements in the metrics that matter.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
