R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking
Abstract: The CoVR-R challenge focuses on composed video retrieval, a task requiring systems to identify a specific target video from a vast collection based on a reference video and a textual instruction describing an edit. This scenario diverges from conventional video-text retrieval, as the query relies on both the visual content of the source video and the transformations suggested by the edit command. While robust embedding models can efficiently generate a broad set of candidate recalls, they often fail to capture nuanced target-side outcomes, such as shifts in state, action substitutions, object continuity, or temporal coherence. Conversely, pairwise multimodal rerankers can assess these details more precisely, yet applying them exhaustively across an entire gallery is computationally prohibitive. To address these issues, we introduce $\mathbb{R}^3$, a zero-shot pipeline for composed video retrieval that leverages Reasoning-guided Recalling and Reranking. Our approach transforms the source-edit query into a retrieval program grounded in reasoning, rather than treating the edit text merely as a brief caption. Initially, the model produces a reasoning trace outlining the anticipated characteristics of the target video post-edit. This trace is then encoded alongside the source video to form a reasoning-augmented query. The retrieval score from this augmented query is combined with that of the standard composed query using an agreement-gated residual mechanism. Finally, a reranker validates the recalled candidates through direct comparison between the source and the candidates. Experimental results confirm the efficacy of our method in tackling this challenge. The code is accessible at https://github.com/Lee-zixu/R-3.
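The recall, fusion, and rerank stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the gating formula, so everything here is an assumption made for illustration: the gate is taken to be the Jaccard overlap of the two top-k candidate sets, the fusion is taken to be the standard score plus the gated reasoning-augmented score, and the names `top_k`, `agreement_gate`, `fuse`, `recall_then_rerank`, and `rerank_fn` are hypothetical and not drawn from the released code.

```python
from typing import Callable, List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, best first (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def agreement_gate(base: Sequence[float], reasoning: Sequence[float], k: int) -> float:
    """Assumed gate: Jaccard overlap of the two top-k candidate sets, in [0, 1]."""
    a, b = set(top_k(base, k)), set(top_k(reasoning, k))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuse(base: Sequence[float], reasoning: Sequence[float], k: int) -> List[float]:
    """Assumed residual fusion: standard score plus the gated reasoning score."""
    g = agreement_gate(base, reasoning, k)
    return [b + g * r for b, r in zip(base, reasoning)]


def recall_then_rerank(
    base: Sequence[float],
    reasoning: Sequence[float],
    rerank_fn: Callable[[int], float],
    k_gate: int,
    k_recall: int,
) -> List[int]:
    """Shortlist candidates by fused score, then order the shortlist by the reranker."""
    fused = fuse(base, reasoning, k_gate)
    shortlist = top_k(fused, k_recall)
    # rerank_fn stands in for a pairwise source-vs-candidate multimodal reranker.
    return sorted(shortlist, key=lambda i: (-rerank_fn(i), i))
```

Under these assumptions, the reasoning-augmented scores contribute more when the two rankings agree on their top candidates and are suppressed when they disagree, and the expensive reranker only ever sees `k_recall` candidates rather than the whole gallery.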
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





