Global News Digest

arXiv

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

Title: Optimizing Software Architecture Design: A Controlled Study of Multi-Agent LLM Collaboration Structures

Abstract:

This study presents a controlled experiment assessing twelve distinct multi-agent Large Language Model (LLM) collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design that varies Authority, Roles, and Dynamics, the experiment comprised 520 runs across eight design tasks of differing complexity, with five repetitions per task. Three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6) scored the resulting designs against a 12-dimensional rubric.
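As a rough illustration of the $2\times2\times2$ factorial design described above, the sketch below enumerates the base grid of conditions. The factor names come from the abstract; the level labels (`level_0`, `level_1`) and the helper `factorial_cells` are hypothetical placeholders, since the abstract does not name the levels. The grid yields eight cells, and the abstract does not say how the twelve reported topologies map onto them.

```python
from itertools import product

# Factor names are taken from the abstract. The two level labels per factor
# are hypothetical placeholders: the abstract does not name the levels.
FACTORS = {
    "Authority": ("level_0", "level_1"),
    "Roles": ("level_0", "level_1"),
    "Dynamics": ("level_0", "level_1"),
}


def factorial_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = factorial_cells(FACTORS)
for index, cell in enumerate(cells, start=1):
    print(index, cell)
```

Each printed row is one cell of the full factorial grid, that is, one combination of an Authority, a Roles, and a Dynamics setting.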

The analysis yields four primary findings. First, the structural adversarial topology (v4b) ranked first in the ensemble evaluation. This prompt-engineered adversarial variant, which prioritizes mandatory rewrites over incremental patches, earned a weighted ensemble score of 4.637 out of 5.0. Second, cross-model review was the unanimous second-place strategy: generating content with one model and reviewing it with another ranked #2 with all three evaluators, with a weighted ensemble score of 4.606.

Third, evaluator diversity is itself a significant finding. All three evaluators agreed that v4b was the best method and v3 the worst, but they disagreed sharply on v2b: the effect size (Cohen’s d) was 1.44 for Claude versus 0.45 for GPT-OSS, suggesting that different model families prioritize different design qualities. Fourth, parallel merge strategies performed poorly across the board. All evaluators placed the merge variants in the lowest tier, with scores from 3.65 to 3.79, a result attributed to token starvation and the "Frankenstein effect." The rankings derive from the weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS), and their robustness was validated through independent cross-validation across the 520 runs.
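Two quantities in this paragraph can be made concrete with a minimal sketch. It assumes that the weighted ensemble is a weighted mean normalised by the total weight (consistent with scores reported out of 5.0) and that Cohen's d uses the common pooled-standard-deviation form. The abstract confirms neither detail, nor does it say which baseline the reported d values compare against. The function names and all sample numbers are made up for illustration and are not results from the study.

```python
from math import sqrt
from statistics import mean, stdev

# Weights as stated in the abstract: 2x Opus + 2x Sonnet + 1x GPT-OSS.
# The dictionary keys are illustrative names, not identifiers from the paper.
WEIGHTS = {"opus": 2.0, "sonnet": 2.0, "gpt_oss": 1.0}


def weighted_ensemble(scores, weights=WEIGHTS):
    """Weighted mean of per-evaluator scores, normalised by total weight."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def cohens_d(treatment, baseline):
    """Cohen's d using a pooled sample standard deviation (assumed form)."""
    n1, n2 = len(treatment), len(baseline)
    s1, s2 = stdev(treatment), stdev(baseline)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(baseline)) / pooled


# Made-up illustrative inputs, not data from the study.
example_scores = {"opus": 4.0, "sonnet": 4.5, "gpt_oss": 3.5}
ensemble = weighted_ensemble(example_scores)
effect = cohens_d([4.0, 4.5, 5.0], [3.0, 3.5, 4.0])
print(ensemble, effect)
```

With this weighting the two Claude evaluators carry four fifths of the total weight, so a topology that GPT-OSS rates low can still rank highly in the ensemble.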


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
