Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval
Abstract:
Composed Video Retrieval (CoVR) aims to retrieve the target video that results from applying a free-form textual modification to a reference video. We tackle the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, which operates under strict zero-shot retrieval conditions. To this end, we introduce R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline constructed exclusively from frozen foundation models. First, a multimodal large language model (Qwen3-VL-8B) analyzes the after-effects of an edit (such as state transitions, action phases, scene composition, camera movement, and tempo) and generates a concise post-edit description. Second, a contrastive video-text encoder (SigLIP-2) embeds this description alongside the video gallery to perform first-stage retrieval. Finally, a constraint-aware re-ranking stage employs the same multimodal model as a judge, scoring each shortlisted candidate against the desired edited outcome.
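The three stages described above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the released implementation: `describe`, `embed_text`, and `judge` are hypothetical callables standing in for the multimodal model, the text tower of the contrastive encoder, and the re-ranking judge; the gallery is assumed to hold precomputed video embeddings; and `shortlist_k` is an illustrative default, since the abstract does not state the shortlist depth.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def r3_retrieve(
    reference_video: str,
    edit_text: str,
    gallery: Dict[str, Vector],             # video id -> precomputed embedding
    describe: Callable[[str, str], str],    # hypothetical: MLLM reasoning step
    embed_text: Callable[[str], Vector],    # hypothetical: text encoder
    judge: Callable[[str, str], float],     # hypothetical: MLLM judge score
    shortlist_k: int = 10,                  # assumed depth, not from the paper
) -> List[str]:
    # Stage 1 (Reason): turn reference video + edit into a post-edit description.
    description = describe(reference_video, edit_text)

    # Stage 2 (Retrieve): rank the whole gallery by similarity to the description.
    query = embed_text(description)
    ranked = sorted(gallery, key=lambda vid: cosine(query, gallery[vid]), reverse=True)
    shortlist, rest = ranked[:shortlist_k], ranked[shortlist_k:]

    # Stage 3 (Re-rank): reorder only the shortlist by judge score.
    # sorted() is stable, so ties keep their first-stage order.
    reranked = sorted(shortlist, key=lambda vid: judge(description, vid), reverse=True)
    return reranked + rest
```

Because the judge only sees the shortlist, items outside it keep their first-stage positions regardless of how the judge would have scored them.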
On the challenge test set, R3-CoVR achieves 91.9% R@1 and 98.2% R@10. Our results are driven by two key insights: (i) aligning the description length with the contrastive encoder's text window boosts R@1 from 67.5% to 72.7%; and (ii) the constraint-aware re-ranker, which reorders only the shortlisted items, raises R@1 from 72.7% to 91.9%, the largest single gain. We provide an analysis of the re-ranker's behavior, the interplay between retrieval and re-ranking, and the impact of shortlist depth, while also releasing a streamlined three-layer implementation.
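For readers unfamiliar with the metric, R@K is the fraction of queries whose target appears among the top K results. The sketch below computes it on toy data and illustrates one consequence of reordering only the shortlist: R@1 after re-ranking cannot exceed first-stage recall at the shortlist depth. The function names, the toy rankings, and the depth of 2 are illustrative assumptions, not values from the paper.

```python
from typing import Callable, List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose target is within the top-k ranked items."""
    if not targets:
        raise ValueError("at least one query is required")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)


def rerank_shortlist(ranked: Sequence[str], depth: int, score: Callable[[str], float]) -> List[str]:
    """Reorder only the first `depth` items by score; leave the tail untouched."""
    head = sorted(ranked[:depth], key=score, reverse=True)
    return head + list(ranked[depth:])


# Toy first-stage rankings for three queries; the target is "a" each time.
first_stage = [["b", "a", "c"], ["a", "b", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
depth = 2  # assumed shortlist depth for illustration

# An idealized judge that always prefers the target when it can see it.
oracle = lambda vid: 1.0 if vid == "a" else 0.0
reranked = [rerank_shortlist(r, depth, oracle) for r in first_stage]

before_r1 = recall_at_k(first_stage, targets, 1)
ceiling = recall_at_k(first_stage, targets, depth)
after_r1 = recall_at_k(reranked, targets, 1)
```

In this toy setting even a perfect judge cannot recover the third query, because its target never enters the shortlist, so `after_r1` equals `ceiling` rather than 1.0.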
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





