Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation
Title: Recovering Lost Context: A Coverage-Driven Retrieval Strategy for Stable Long-Form Video Synthesis
Abstract: Achieving geometric coherence over extended durations remains a significant hurdle in autoregressive video generation. While memory-augmented generative models attempt to solve this by pulling historical frames from storage, their performance hinges on two critical decisions: determining the appropriate 3D geometric representation of past data and selecting specific memory frames from that data. Current approaches typically rely on camera poses or field-of-view intersections; while computationally inexpensive, these metrics are too coarse to accurately assess pixel-level visibility. Conversely, methods employing explicit 3D reconstruction offer detailed evidence but incur prohibitive maintenance costs during long generation sequences. To address these limitations, we introduce Coverage-Maximizing Retrieval-Augmented Generation (COVRAG). This depth-centric framework leverages pretrained 3D priors to build a target-view coverage map, serving as a lightweight form of 3D memory evidence. Regarding frame selection, COVRAG operates by maximizing residual coverage gain, systematically retrieving frames that account for target-view areas left unexplained by the current context or previously chosen memories. To enhance scalability for long-video tasks, we also propose sliding-window depth caching to streamline geometry estimation. Evaluations on the RealEstate10K and DL3DV10K datasets demonstrate that COVRAG boosts long-horizon geometric consistency while keeping latency lower than that of baseline methods.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
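
The frame-selection rule described in the abstract (repeatedly retrieving the memory frame that explains the most target-view area still left unexplained) can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the function name `select_memory_frames`, its parameters, and the representation of coverage as sets of flat pixel indices are all assumptions made for illustration. The abstract states only that coverage maps are derived from depth using pretrained 3D priors; how they are computed is not shown here, and the toy coverage sets below are invented.

```python
def select_memory_frames(candidate_coverage, context_coverage, target_pixels, budget):
    """Greedy residual-coverage selection (illustrative sketch, hypothetical API).

    candidate_coverage: dict mapping frame id -> set of target-view pixels
                        that the frame is assumed to cover.
    context_coverage:   set of target-view pixels already explained by the
                        current context window.
    target_pixels:      iterable of all pixel indices in the target view.
    budget:             maximum number of memory frames to retrieve.
    """
    # Residual = target-view pixels that nothing selected so far explains.
    residual = set(target_pixels) - set(context_coverage)
    selected = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for frame_id, covered in candidate_coverage.items():
            if frame_id in selected:
                continue
            # Gain counts only pixels that are still unexplained.
            gain = len(residual & covered)
            if gain > best_gain:
                best_id, best_gain = frame_id, gain
        if best_id is None:
            # No remaining frame adds new coverage, so stop early.
            break
        selected.append(best_id)
        residual -= candidate_coverage[best_id]
    return selected, residual


# Toy 4x4 target view; pixel index = row * 4 + col (invented data).
def column(c):
    return {r * 4 + c for r in range(4)}

candidates = {
    "a": column(0) | column(1),  # overlaps the context on column 0
    "b": column(2) | column(3),  # covers the largest unexplained area
    "c": column(3),              # redundant once "b" is selected
}
selected, residual = select_memory_frames(
    candidates, context_coverage=column(0), target_pixels=range(16), budget=3
)
print(selected, len(residual))
```

In this toy case frame "c" is never retrieved even though the budget allows a third frame, because everything it covers is already explained by "b". Under the assumptions above, that is the behavior which separates residual-gain selection from ranking frames by raw overlap with the target view.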





