arXiv

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

June 3, 2026 · Blaž Bertalanič, Carolina Fortuna

Abstract:

Current approaches to scaling multi-agent Large Language Model (LLM) systems at inference time lack a standardized metric: simply counting agents mistakenly equates financial cost with independent statistical evidence. To address this, we introduce a two-parameter scaling law, $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$, where $N$ is the number of agents and $N_\text{eff}$ is the effective team size. The regime exponent $\beta$ places any system configuration in one of three asymptotic regimes for $N_\text{eff}$: a hard ceiling at $1/c$ ($\beta = 0$), sublinear growth as $N^\beta/c$ ($0 < \beta < 1$), or linear scaling ($\beta = 1$).
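The three regimes can be checked numerically. The following is a minimal Python sketch of the scaling law as stated above; the function names and the value $c = 0.5$ are illustrative assumptions, not taken from the paper.

```python
def retention(n: int, c: float, beta: float) -> float:
    """R(N) = N_eff / N = 1 / (1 + c (N - 1) N^(-beta))."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_team_size(n: int, c: float, beta: float) -> float:
    """N_eff = N * R(N)."""
    return n * retention(n, c, beta)


if __name__ == "__main__":
    c = 0.5  # hypothetical value, chosen only for illustration
    for beta in (0.0, 0.5, 1.0):
        sizes = [effective_team_size(n, c, beta) for n in (1, 5, 30, 1000)]
        print(f"beta={beta}: {[round(s, 2) for s in sizes]}")
```

With $\beta = 0$ the values approach $1/c$, with $\beta = 0.5$ they grow roughly as $N^{0.5}/c$, and with $\beta = 1$ they approach $N/(1+c)$, which is linear in $N$.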

Our analysis, which includes the MMLU-Hard benchmark, shows that benchmarks differ in absolute performance but share consistent structural parameters $(c, \beta)$. On free-form mathematical tasks, however, dense peer influence fundamentally alters the dynamics: the answer-level regime collapses from sublinear to a hard ceiling, while the correctness-level fits show a hard ceiling throughout.

These results yield three key practical implications:

Diminishing Returns in Homogeneous Teams: On MMLU-Hard, deploying thirty densely interacting debating agents yields no greater answer diversity than using a single agent.
The Illusion of Debate: A noise placebo experiment demonstrates that self-correction on free-form math tasks scales at a rate of $4\times$. This suggests that within homogeneous teams, performance improvements typically attributed to "debate" are actually driven by self-re-evaluation rather than the exchange of peer-generated content.
Predictive Pilots and Architectural Diversity: A small pilot study with $N \le 5$ agents can accurately predict the structural ceiling observed at $N=30$. Furthermore, among the configurations tested, only architectural diversity (heterogeneous teams) successfully reduced the parameter $c$ and allowed systems to escape the hard-ceiling regime. In contrast, interventions focused solely on communication modes failed to achieve this effect.
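The pilot-based prediction in the third implication can be illustrated with a small sketch. The Python below fits $(c, \beta)$ to pilot measurements at $N = 2, \dots, 5$ and extrapolates to $N = 30$. It is not the authors' fitting procedure: the grid search is a simple stand-in, the names `fit_pilot`, `c_grid`, and `beta_grid` are hypothetical, and the pilot data are synthetic values generated from the law itself with assumed parameters $c = 0.8$, $\beta = 0$. Note that $N = 1$ carries no information, since $R(1) = 1$ for any parameters.

```python
def n_eff(n: int, c: float, beta: float) -> float:
    """Effective team size N_eff = N / (1 + c (N - 1) N^(-beta))."""
    return n / (1.0 + c * (n - 1) * n ** (-beta))


def fit_pilot(pilot, c_grid, beta_grid):
    """Least-squares grid search for (c, beta) over pilot observations.

    `pilot` maps team size N to an observed N_eff.
    """
    best = None
    for c in c_grid:
        for beta in beta_grid:
            err = sum((n_eff(n, c, beta) - obs) ** 2 for n, obs in pilot.items())
            if best is None or err < best[0]:
                best = (err, c, beta)
    return best[1], best[2]


# Synthetic pilot data (assumption): generated from the law with c=0.8, beta=0.
pilot = {n: n_eff(n, 0.8, 0.0) for n in range(2, 6)}

c_grid = [i / 100 for i in range(1, 201)]     # c in (0, 2]
beta_grid = [j / 100 for j in range(0, 101)]  # beta in [0, 1]

c_hat, beta_hat = fit_pilot(pilot, c_grid, beta_grid)
predicted_30 = n_eff(30, c_hat, beta_hat)

print(f"fitted c={c_hat}, beta={beta_hat}")
print(f"predicted N_eff at N=30: {predicted_30:.3f} (ceiling 1/c = {1 / c_hat:.3f})")
```

On real measurements the pilot values would be noisy, so a continuous optimizer with uncertainty estimates would likely be preferable to a coarse grid.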
