Title: TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge
Abstract:
Large-scale vision-language pretraining has substantially advanced video-text retrieval, yet most current methods inherit an assumption from image-text retrieval: that visual meaning can be adequately captured frame by frame. This assumption ignores the temporal dynamics inherent in egocentric video. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge adds further difficulty by supplying soft-label relevance matrices instead of binary labels, so models must resolve graded semantic alignment across modalities.
In this report, we introduce TempRet, our solution for the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Built on a CLIP-based dual-encoder backbone, our method adds two components that address temporal and cross-modal complexity. First, a temporal transformer operates only on the video stream, capturing inter-frame dependencies through multi-head self-attention and learnable positional encodings applied to frame-level CLIP features. Second, a two-stage reranking pipeline retrieves the Top-K candidates with the dual-encoder and then recalibrates their scores with a cross-encoder equipped with an Image-Text Matching (ITM) head.
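The two-stage reranking idea can be illustrated with a minimal sketch. It is not the TempRet implementation: the function names, the use of cosine similarity in the first stage, and the choice to order the shortlist purely by the second-stage score are assumptions made for illustration, and `rescore` is a hypothetical stand-in for a cross-encoder with an ITM head.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_rerank(
    query_emb: Sequence[float],
    candidate_embs: Sequence[Sequence[float]],
    rescore: Callable[[int], float],
    top_k: int,
) -> List[int]:
    """Return candidate indices ordered by a coarse-then-fine ranking.

    Stage 1 ranks every candidate by dual-encoder similarity.
    Stage 2 re-scores only the Top-K shortlist with `rescore`
    (hypothetical: a placeholder for a cross-encoder ITM score).
    Candidates outside the shortlist keep their stage-1 order.
    """
    # Stage 1: cheap similarity against all candidates.
    coarse = [(i, cosine(query_emb, emb)) for i, emb in enumerate(candidate_embs)]
    coarse.sort(key=lambda pair: pair[1], reverse=True)
    shortlist, tail = coarse[:top_k], coarse[top_k:]

    # Stage 2: more expensive scoring, applied to the shortlist only.
    fine = sorted(
        ((i, rescore(i)) for i, _ in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [i for i, _ in fine] + [i for i, _ in tail]
```

Restricting the second stage to the Top-K shortlist keeps the number of cross-encoder evaluations per query at K rather than the full gallery size, which is the usual motivation for this kind of pipeline.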
The system is trained with the Symmetric Multi-Similarity Loss, which uses the soft-label relevance matrices provided by the challenge. Our approach achieves an average mean Average Precision (mAP) of 67.97% and an average normalized Discounted Cumulative Gain (nDCG) of 82.92% on the EK-100 MIR benchmark. These results indicate that combining temporal modeling with cross-modal refinement is effective for egocentric video retrieval.
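For readers unfamiliar with graded-relevance metrics, the following sketch computes nDCG for a single query from soft relevance values. It uses the generic linear-gain definition and is an assumption-laden illustration; the benchmark's official evaluation script may differ in details such as which items are ranked and how scores are averaged across directions.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_relevances: Sequence[float]) -> float:
    """nDCG for one query.

    `ranked_relevances` holds the soft relevance (e.g. in [0, 1]) of each
    retrieved item, listed in the order the model ranked them.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists items in decreasing relevance scores 1.0, and placing a highly relevant item lower in the list reduces the score through the logarithmic discount.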
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




