arXiv

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

June 3, 2026 · Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong

Abstract:

Deploying the DeepSeek-R1 671B model on a single multi-GPU server makes efficient Multi-Head Latent Attention (MLA) inference difficult. To address this, this study presents FlashMLA-ETAP, a framework designed to optimize MLA inference for single-instance deployments on NVIDIA H20 GPUs. Central to the approach is the Efficient Transpose Attention Pipeline (ETAP), which restructures the attention computation via transposition. This reconfiguration aligns the KV context length with the $M$-dimension of WGMMA operations, substantially reducing redundant computation.
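The transposition idea can be illustrated with a small pure-Python sketch. It is not the FlashMLA-ETAP kernel; it only checks that computing $S^T = K Q^T$ and $O^T = V^T P^T$ gives the same output as the standard $S = Q K^T$, $O = P V$ form, and estimates how much of a 64-row WGMMA $M$ tile would be padding. The function names, the toy sizes, and the figure of 16 query heads per GPU are illustrative assumptions, not values taken from the abstract.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Standard form: S = Q K^T has shape (heads, kv_len); softmax runs over each row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    s = matmul(q, transpose(k))
    p = []
    for row in s:
        m = max(row)
        e = [math.exp((x - m) * scale) for x in row]
        z = sum(e)
        p.append([x / z for x in e])
    return matmul(p, v)  # (heads, head_dim)

def attention_transposed(q, k, v):
    """Transposed form: S^T = K Q^T has shape (kv_len, heads); softmax runs over each column."""
    scale = 1.0 / math.sqrt(len(q[0]))
    st = matmul(k, transpose(q))
    n_kv, n_heads = len(st), len(st[0])
    col_max = [max(st[i][j] for i in range(n_kv)) for j in range(n_heads)]
    e = [[math.exp((st[i][j] - col_max[j]) * scale) for j in range(n_heads)]
         for i in range(n_kv)]
    col_sum = [sum(e[i][j] for i in range(n_kv)) for j in range(n_heads)]
    pt = [[e[i][j] / col_sum[j] for j in range(n_heads)] for i in range(n_kv)]
    ot = matmul(transpose(v), pt)  # O^T has shape (head_dim, heads)
    return transpose(ot)           # back to (heads, head_dim)

WGMMA_M = 64  # Hopper WGMMA instructions use a fixed M of 64 rows

def padded_fraction(rows, tile=WGMMA_M):
    """Fraction of the M dimension that is padding when `rows` is rounded up to the tile."""
    padded = math.ceil(rows / tile) * tile
    return 1 - rows / padded

random.seed(0)
heads, kv_len, dim = 16, 128, 8  # toy sizes; 16 heads per GPU is an assumption
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(heads)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(kv_len)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(kv_len)]

out_a = attention(q, k, v)
out_b = attention_transposed(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(out_a, out_b) for x, y in zip(ra, rb))

print("max abs difference:", max_diff)
print("padding with heads on M:", padded_fraction(heads))
print("padding with a 64K KV length on M:", padded_fraction(64 * 1024))
```

In this toy setting, placing the head count on the $M$ axis leaves most of each tile as padding, while a long KV length fills the tiles completely. This is one way to read the abstract's claim about reduced redundant computation.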

Benchmarks show that FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at a sequence length of 64K and a batch size of 16. It also outperforms FlashAttention-3 by 5.24x and FlashInfer by 4.94x. These gains do not come at the cost of precision: the framework maintains numerical stability, with an RMSE of $1.25 \times 10^{-5}$, which is 15.2x lower than that of FlashAttention-3.
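For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how a root-mean-square error between a reference output and a kernel output is typically computed. The helper name `rmse` is hypothetical, and the abstract does not say which reference the authors compare against. The last lines work out the FlashAttention-3 error implied by the quoted figures, assuming "15.2x lower" means the ratio of the two RMSE values.

```python
import math

def rmse(reference, candidate):
    """Root-mean-square error between two equal-length sequences of floats."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    return math.sqrt(squared / len(reference))

# Figures quoted in the abstract.
ETAP_RMSE = 1.25e-5
RATIO_VS_FA3 = 15.2

# Implied FlashAttention-3 RMSE under the ratio interpretation (an assumption).
implied_fa3_rmse = ETAP_RMSE * RATIO_VS_FA3

print("example rmse:", rmse([0.0, 0.0], [3.0, 4.0]))
print("implied FlashAttention-3 RMSE:", implied_fa3_rmse)
```

Under that reading, the implied FlashAttention-3 error is roughly $1.9 \times 10^{-4}$.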

The design of ETAP is supported by theoretical analysis, and it can be integrated into existing frameworks such as FlashAttention-3 and FlashInfer. By addressing resource-constrained inference, this work provides a scalable solution for mid-tier GPUs and supports wider adoption of hardware-aware optimization techniques. The source code is publicly available at https://github.com/pengcuo/FlashMLA-ETAP.

Source: arXiv