arXiv

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

June 2, 2026 · Nagarjuna Kanamarlapudi, Praveen K

Title: Optimizing Software Architecture Design: A Controlled Study of Multi-Agent LLM Collaboration Structures

Abstract:

This study reports a controlled experiment comparing twelve multi-agent Large Language Model (LLM) collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design that varies Authority, Roles, and Dynamics, the experiment comprised 520 runs across eight design tasks of differing complexity, with five repetitions per task. Three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6) scored the resulting designs against a 12-dimensional rubric.
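As a minimal sketch of what a $2\times2\times2$ factorial design implies, the snippet below enumerates the eight cells obtained by crossing three two-level factors. The factor names come from the abstract; the level labels (`A0`, `A1`, and so on) are hypothetical placeholders, since the abstract does not name the levels, and it does not state how the twelve topologies map onto these eight cells.

```python
from itertools import product

# Factor names are from the abstract; the level labels are hypothetical
# placeholders because the abstract does not name the actual levels.
FACTORS = {
    "Authority": ("A0", "A1"),
    "Roles": ("R0", "R1"),
    "Dynamics": ("D0", "D1"),
}


def factorial_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = factorial_cells(FACTORS)
for cell in cells:
    print(cell)
print(len(cells))  # 2 * 2 * 2 = 8 cells
```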

The analysis yields four primary findings. First, the structural adversarial topology (v4b) ranked first in the ensemble evaluation. This prompt-engineered adversarial variant prioritizes mandatory rewrites over incremental patches and achieved a weighted ensemble score of 4.637 out of 5.0. Second, cross-model review, in which one model generates the design and a different model reviews it, ranked #2 for all three evaluators, with a weighted ensemble score of 4.606.
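A weighted ensemble score of this kind can be sketched as follows, using the 2:2:1 weighting of Opus, Sonnet, and GPT-OSS that the abstract reports. Dividing by the sum of the weights is an assumption (it keeps the result on the same 5.0-point scale), and the per-evaluator scores in the example are made-up placeholders, not results from the study.

```python
# Weights as stated in the abstract: 2x Opus + 2x Sonnet + 1x GPT-OSS.
WEIGHTS = {"opus": 2.0, "sonnet": 2.0, "gpt_oss": 1.0}


def weighted_ensemble(scores, weights=WEIGHTS):
    """Weighted mean of per-evaluator scores.

    Normalizing by the total weight is an assumption; it keeps the
    ensemble score on the same scale as the individual scores.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing evaluator scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Placeholder scores for illustration only (not data from the study).
example_scores = {"opus": 4.5, "sonnet": 4.0, "gpt_oss": 5.0}
print(weighted_ensemble(example_scores))  # (9.0 + 8.0 + 5.0) / 5 = 4.4
```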

Third, the evaluators diverge in informative ways. All three agreed that v4b was the best topology and v3 the worst, but they disagreed sharply on v2b: Cohen’s d was 1.44 for Claude versus 0.45 for GPT-OSS, illustrating how distinct model families prioritize different design qualities. Fourth, parallel merge strategies were found to be fundamentally flawed. All evaluators placed the merge variants in the lowest tier, with scores ranging from 3.65 to 3.79, a result attributed to token starvation and the "Frankenstein effect." The rankings are derived from the weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS), and their robustness was confirmed by independent cross-validation across the 520 runs.
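For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of Cohen's d using the pooled standard deviation. The abstract does not say which formulation or which comparison group was used, so both the pooled-SD form and the sample data below are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed formulation).

    Uses sample variances (n - 1 denominator) for both groups.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Placeholder rubric scores, not data from the study.
topology_scores = [4.0, 4.5, 5.0]
baseline_scores = [3.0, 3.5, 4.0]
print(cohens_d(topology_scores, baseline_scores))  # 1.0 / 0.5 = 2.0
```

By the common rule of thumb, values around 0.5 are read as medium effects and values above 0.8 as large, which is why 1.44 versus 0.45 marks a substantial disagreement between evaluators.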
