arXiv

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

June 2, 2026 · Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

Abstract:

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent evidence suggests that vanilla MAD often underperforms simple majority voting despite its higher computational cost. Research indicates that, when agents are homogeneous and belief updates are uniform, debate only preserves expected accuracy and therefore cannot guarantee better outcomes. Drawing on insights from human deliberation and collective decision-making, we identify two critical deficiencies in vanilla MAD: a lack of diverse initial perspectives and the absence of explicit, calibrated confidence signaling.

To address these gaps, we introduce two lightweight interventions. The first is a diversity-aware initialization strategy that curates a more varied set of candidate answers, thereby increasing the probability that the correct hypothesis is available at the debate’s onset. The second is a confidence-modulated debate protocol, where agents communicate calibrated confidence levels and adjust their updates based on the confidence expressed by their peers. Theoretically, we demonstrate that diversity-aware initialization boosts the prior probability of MAD success without altering the core update dynamics, whereas confidence-modulated updates allow the debate process to systematically converge toward the correct hypothesis. Empirical evaluations across six reasoning-focused question-answering benchmarks confirm that our proposed methods consistently surpass both vanilla MAD and majority vote. These findings bridge the gap between human deliberation and LLM-based debate, illustrating how straightforward, principled adjustments can significantly amplify the effectiveness of debate mechanisms.
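To make the two interventions concrete, the following is a minimal Python sketch, not the authors' protocol. The abstract does not specify the selection or update rules, so everything here is an assumption: the function names `diverse_init`, `majority_vote`, and `confidence_weighted_round` are hypothetical, diversity is approximated by preferring distinct sampled answers, and the update rule is a plain confidence-weighted vote over confidences that are assumed to be already calibrated and supplied by the caller.

```python
from collections import Counter, defaultdict


def diverse_init(candidates, n_agents):
    """Choose n_agents starting answers, preferring distinct ones.

    Assumption: `candidates` is a list of sampled answers (duplicates allowed).
    Distinct answers are taken in first-seen order; if there are fewer distinct
    answers than agents, the remaining slots cycle through answers by frequency.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    distinct = list(dict.fromkeys(candidates))  # unique, order-preserving
    chosen = distinct[:n_agents]
    by_frequency = [answer for answer, _ in Counter(candidates).most_common()]
    i = 0
    while len(chosen) < n_agents:
        chosen.append(by_frequency[i % len(by_frequency)])
        i += 1
    return chosen


def majority_vote(answers):
    """Baseline: the most frequent answer, ignoring confidence."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_round(agents):
    """One debate round over (answer, confidence) pairs, confidence in [0, 1].

    Assumption: every agent sees every peer's stated confidence and adopts the
    answer with the largest total confidence mass. Its new confidence is that
    answer's share of the total mass.
    """
    mass = defaultdict(float)
    for answer, confidence in agents:
        mass[answer] += confidence
    total = sum(mass.values())
    if total == 0:
        return list(agents)  # no confidence signal, so nobody updates
    best = max(mass, key=mass.get)
    return [(best, mass[best] / total) for _ in agents]


# A confident minority can outweigh a hesitant majority.
agents = [("A", 0.9), ("B", 0.4), ("B", 0.3)]
baseline = majority_vote([answer for answer, _ in agents])
after_round = confidence_weighted_round(agents)
starts = diverse_init(["B", "B", "A", "B", "C"], 3)
```

In this toy case the majority-vote baseline picks "B", while the confidence-weighted round moves every agent to "A" because its confidence mass (0.9) exceeds that of "B" (0.7). Whether such a shift is beneficial depends entirely on the confidences being well calibrated, which is the condition the abstract emphasizes.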
