Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion
Abstract:
CoVR-R focuses on reason-aware composed video retrieval, where the goal is to identify a target video given a reference video and an edit instruction. The main challenge is that the target video is never described explicitly; it must be inferred from subtle changes in object identity, action sequence, final state, hand interactions, and scene transitions. To address this, we build a zero-shot "reason-then-retrieve" pipeline on the Qwen3.5-27B model.
For every video in the gallery, the model produces both a retrieval-oriented structured description and a dense embedding. The dense embedding is obtained by pooling the hidden states of the generated tokens with token-dependent weights. On the query side, the model first reasons about the edit implied by the reference video and the instruction, then generates a description of the target video; the hidden states from this generation serve as the query embedding.
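The weighted pooling step can be illustrated with a minimal sketch. The abstract does not say how the token-dependent weights are computed or whether the pooled vector is normalised, so both are assumptions here: the weights are taken as given inputs, and the result is L2-normalised so that a dot product acts as a cosine score. The function names `weighted_pool` and `cosine` are hypothetical.

```python
import math


def weighted_pool(hidden_states, weights):
    """Collapse per-token hidden states into one unit-length embedding.

    hidden_states: list of equal-length float vectors, one per generated token.
    weights: one non-negative weight per token. How these are chosen is an
             assumption; the abstract only says they are token-dependent.
    """
    if not hidden_states or len(hidden_states) != len(weights):
        raise ValueError("need at least one token and one weight per token")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")

    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for vec, w in zip(hidden_states, weights):
        for i, x in enumerate(vec):
            pooled[i] += (w / total) * x

    # Assumed step: L2-normalise so dot products behave as cosine similarity.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0.0 else pooled


def cosine(a, b):
    """Dot product of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Under this sketch, the same function would be applied to the gallery-side description tokens and to the query-side target-description tokens, and `cosine` would give the dense retrieval score between the two.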
To improve retrieval performance, we complement dense retrieval with a TF-IDF branch that operates on the generated texts. The final ranking fuses the two branches with split-specific weights. Our best submission to date achieves R@1 of 80.81, R@5 of 94.86, R@10 of 97.11, and R@50 of 98.59 on the validation set, and R@1 of 89.73, R@5 of 95.79, R@10 of 96.63, and R@50 of 97.98 on the blind test split.
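The dense-sparse fusion can be sketched as a convex combination of the two scores. This is an illustration under stated assumptions, not the submitted system: the whitespace tokeniser, the smoothed IDF formula, the linear fusion rule, and the single weight `alpha` are all hypothetical choices, and the abstract does not report the actual split-specific weights.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency over the gallery descriptions."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    df = Counter(tok for toks in tokenised for tok in set(toks))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_vector(text, idf):
    """Unit-length sparse TF-IDF vector; terms unseen in the gallery are dropped."""
    tf = Counter(tok for tok in text.lower().split() if tok in idf)
    vec = {tok: f * idf[tok] for tok, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm > 0.0 else vec


def sparse_scores(query_text, gallery_texts):
    """Cosine similarity between the query text and each gallery text."""
    idf = fit_idf(gallery_texts)
    q = tfidf_vector(query_text, idf)
    scores = []
    for text in gallery_texts:
        doc = tfidf_vector(text, idf)
        scores.append(sum(q.get(tok, 0.0) * v for tok, v in doc.items()))
    return scores


def fuse_and_rank(dense, sparse, alpha):
    """Rank gallery indices by alpha * dense + (1 - alpha) * sparse, best first."""
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse score lists must have equal length")
    fused = [alpha * d + (1.0 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy usage with made-up texts and made-up dense scores.
gallery = ["person opens red box", "person closes red box", "dog runs on grass"]
sparse = sparse_scores("person closes box", gallery)
ranking = fuse_and_rank([0.9, 0.8, 0.1], sparse, alpha=0.5)
```

In this toy case the dense scores alone prefer the first gallery item, while the lexical match on "closes" lets the fused ranking prefer the second. A split-specific weight would correspond to choosing a different `alpha` for the validation and test splits.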
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





