arXiv

FlowNar: Scalable Streaming Narration for Long-Form Videos

June 2, 2026 · Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, Juergen Beyerer

Title: FlowNar: Enabling Scalable Streaming Narration for Extended Video Content

Abstract:

Although Large Multimodal Models (LMMs) have advanced significantly, they are designed predominantly for offline use and are poorly suited to the dynamic demands of streaming video. Recent work has adapted these models for online, real-time processing, but these approaches still face severe scalability limits: resource consumption grows at least linearly with video duration. To address this bottleneck, we introduce FlowNar, a framework for scalable streaming video narration.

FlowNar is built on a dynamic context management strategy that removes past visual context, paired with our novel CLAM (Cross Linear Attentive Memory) module, which retains visual history during streaming. Together they guarantee bounded visual memory usage and constant computational complexity, both of which are essential for efficient streaming. We also propose a realistic self-conditioned evaluation protocol, along with complementary metrics, to rigorously assess streaming narration models under conditions that mirror real-world deployment.
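
To make the bounded-memory idea concrete, the following is a minimal sketch of a generic linear-attention memory in pure Python. It is not the CLAM module: the abstract does not specify CLAM's internals, so the class `LinearAttentionMemory`, the `elu(x) + 1` feature map, and all dimensions here are illustrative assumptions. The sketch only shows how a running state of fixed size can summarize an arbitrarily long stream of frame features, so that each write and read costs the same regardless of how many frames have been seen.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionMemory:
    """Hypothetical fixed-size memory; NOT the paper's CLAM implementation.

    The state holds a (key_dim x value_dim) matrix S and a key_dim vector z,
    so its size depends only on the feature dimensions, never on stream length.
    """

    def __init__(self, key_dim, value_dim):
        self.S = [[0.0] * value_dim for _ in range(key_dim)]  # sum of phi(k) v^T
        self.z = [0.0] * key_dim                               # sum of phi(k)
        self.frames_seen = 0

    def write(self, key, value):
        """Fold one frame's (key, value) features into the running state."""
        phi_k = feature_map(key)
        for i, k_i in enumerate(phi_k):
            self.z[i] += k_i
            row = self.S[i]
            for j, v_j in enumerate(value):
                row[j] += k_i * v_j
        self.frames_seen += 1

    def read(self, query):
        """Return a normalized, attention-weighted average of stored values."""
        phi_q = feature_map(query)
        value_dim = len(self.S[0])
        denom = sum(q_i * z_i for q_i, z_i in zip(phi_q, self.z))
        if denom == 0.0:  # nothing written yet
            return [0.0] * value_dim
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(value_dim)
        ]

    def state_size(self):
        """Number of floats held in memory (constant over the stream)."""
        return len(self.S) * len(self.S[0]) + len(self.z)


if __name__ == "__main__":
    memory = LinearAttentionMemory(key_dim=4, value_dim=3)
    for t in range(1000):
        # Toy per-frame features; a real system would use visual encoder outputs.
        key = [math.sin(t + i) for i in range(4)]
        value = [(t % 10) / 10.0, 0.5, 1.0]
        memory.write(key, value)
    print(memory.frames_seen, memory.state_size())
    print(memory.read([0.1, -0.2, 0.3, 0.0]))
```

Because the feature map is strictly positive, each read is a convex combination of the stored values, and the state stays at `key_dim * value_dim + key_dim` floats whether 10 or 10,000 frames have been written. This is the sense in which a linear-attention-style memory can offer bounded memory and constant per-step cost.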

Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets show that FlowNar significantly improves narration quality over strong baselines. It is also highly efficient, processing videos ten times longer and achieving a threefold increase in throughput (FPS). The source code is available at https://github.com/zeyun-zhong/FlowNar.
