APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
Abstract: Long-video inference remains a significant bottleneck, primarily because of the heavy computational cost of the prefill phase in Large Multimodal Models (LMMs). Existing approaches, which typically compress visual embeddings or apply sparse attention on a single GPU, often yield limited speedups or reduced accuracy, which restricts the ability of LMMs to process longer and more complex videos. To address these challenges, we introduce APB-V, a sequence-parallel framework with optimized attention designed to accelerate long-video inference across multiple GPUs. By distributing approximate attention computation across GPUs, APB-V reduces compute while increasing parallelism, so it can process more visual embeddings without compression, which in turn improves task performance. Additional system-level optimizations, including load balancing and fused forward passes, further improve efficiency, achieving speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, with no significant degradation in task performance. Code is available at https://github.com/thunlp/APB
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




