arXiv

Don't Let a Few Network Failures Slow the Entire AllReduce

June 2, 2026 · Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu

Title: Preventing Isolated Network Glitches from Stalling Global AllReduce Operations

Original: arXiv:2606.01680v1 (Announce Type: cross)

Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.

Rewritten:

Title: Avoiding the Drag of Isolated Network Errors on Collective AllReduce Tasks

Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of interrupted training jobs. Modern collective communication libraries such as NCCL handle these failures by rerouting traffic through the surviving network interface cards (NICs) on the same server, accepting lower inter-node bandwidth in exchange for uninterrupted training. The drawback is that the degraded server stays on the critical path of the standard ring algorithm and slows the entire collective. This paper presents the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth. Our analysis shows that if the straggler retains at least 50% of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. Building on this result, we design OptCC, a four-stage pipelined AllReduce algorithm that approaches the lower bound. Experiments on SimAI show that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2–6% of NCCL's fault-free ring performance, whereas the state of the art incurs up to 57% overhead.
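To illustrate why a single degraded server slows the whole ring, here is a minimal sketch using the textbook bandwidth-only cost model for ring AllReduce. It is not the paper's model, and it is neither OptCC nor the fault-tolerant baseline the abstract compares against. The function names, the 1 GiB message size, and the 50 GB/s link bandwidth are illustrative assumptions. In this simplified model every step of the ring waits for the slowest link, so the completion time is governed by the minimum bandwidth among the participants.

```python
def ring_allreduce_time(message_bytes, link_bandwidths):
    """Bandwidth-only estimate of ring AllReduce time in seconds.

    Assumed textbook model (not taken from the paper): the message is split
    into p chunks, the ring runs 2 * (p - 1) steps (reduce-scatter followed
    by all-gather), and each step is paced by the slowest link in the ring.
    Latency terms are ignored.
    """
    p = len(link_bandwidths)
    if p < 2:
        return 0.0  # a single participant has nothing to exchange
    chunk_bytes = message_bytes / p
    steps = 2 * (p - 1)
    return steps * chunk_bytes / min(link_bandwidths)


def straggler_slowdown(p, retained_fraction, message_bytes=1 << 30, bandwidth=50e9):
    """Ratio of degraded to fault-free ring time with one straggler.

    `retained_fraction` is the share of bandwidth the degraded server keeps.
    The default message size and bandwidth are placeholder values.
    """
    healthy = ring_allreduce_time(message_bytes, [bandwidth] * p)
    degraded_links = [bandwidth] * (p - 1) + [bandwidth * retained_fraction]
    degraded = ring_allreduce_time(message_bytes, degraded_links)
    return degraded / healthy


if __name__ == "__main__":
    # One of 8 servers keeps half of its bandwidth.
    print(straggler_slowdown(p=8, retained_fraction=0.5))  # prints 2.0
```

Under these assumptions the slowdown equals the inverse of the retained fraction regardless of p, which is the kind of gap that the O(1/p) lower bound in the abstract suggests is avoidable. The figures this sketch produces are not the measurements reported for SimAI.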

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC