arXiv

Don't Let a Few Network Failures Slow the Entire AllReduce

Title: Preventing Isolated Network Glitches from Stalling Global AllReduce Operations

Original: arXiv:2606.01680v1, Announce Type: cross

Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.

Rewritten:

Title: Avoiding the Drag of Isolated Network Errors on Collective AllReduce Tasks

Abstract: Network failures are among the most common hardware faults in large-scale GPU clusters and a primary cause of interrupted training jobs. Contemporary collective communication libraries such as NCCL handle them by shifting traffic to the surviving network interface cards (NICs) on the same server, accepting lower inter-node bandwidth in exchange for uninterrupted training. The drawback is that the degraded server stays on the critical path of the standard ring algorithm and therefore slows the entire collective. This paper introduces the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth. Our analysis shows that if the straggler retains at least 50% of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for a system of p GPUs. Building on this finding, we design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on the SimAI simulator show that OptCC closes the gap left by existing fault-tolerant schemes. Under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2–6% of the completion time of NCCL’s fault-free ring. In contrast, the state-of-the-art scheme incurs overhead as high as 57%.
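To illustrate why a single degraded server can slow a ring collective, here is a minimal sketch of the textbook bandwidth-only cost model for ring AllReduce. It is not the paper's lower bound and not OptCC. The function names and the parameters `size_bytes`, `bandwidth`, and `straggler_fraction` are hypothetical and chosen only for illustration. The model assumes that every rank sends 2(p-1)/p of the buffer in lockstep, so the slowest link sets the pace, and it ignores latency and intra-node traffic.

```python
def ring_allreduce_time(size_bytes, p, bandwidth, straggler_fraction=1.0):
    """Bandwidth-only estimate of ring AllReduce completion time in seconds.

    Assumption (illustrative model, not taken from the paper): each of the
    p ranks sends 2*(p-1)/p * size_bytes in lockstep, so the link with the
    lowest bandwidth determines the duration of every step.
    """
    if p < 2:
        raise ValueError("ring AllReduce needs at least 2 ranks")
    if not 0.0 < straggler_fraction <= 1.0:
        raise ValueError("straggler_fraction must be in (0, 1]")
    slowest_link = bandwidth * straggler_fraction  # bytes per second
    return 2.0 * (p - 1) / p * size_bytes / slowest_link


def ring_relative_overhead(size_bytes, p, bandwidth, straggler_fraction):
    """Slowdown of the degraded ring relative to the fault-free ring."""
    healthy = ring_allreduce_time(size_bytes, p, bandwidth)
    degraded = ring_allreduce_time(size_bytes, p, bandwidth, straggler_fraction)
    return degraded / healthy - 1.0


if __name__ == "__main__":
    # Hypothetical setup: 1 GB buffer, 8 ranks, 50 GB/s per link.
    for fraction in (1.0, 0.875, 0.75, 0.5):
        overhead = ring_relative_overhead(1e9, 8, 50e9, fraction)
        print(f"straggler keeps {fraction:.3f} of bandwidth: "
              f"ring overhead {overhead:.1%}")
```

In this simplified model the overhead of the plain ring is 1/f - 1 for a straggler that keeps a fraction f of its bandwidth, independent of p. That is the kind of gap the abstract contrasts with the O(1/p) unavoidable overhead of its lower bound.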


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
