arXiv

TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

Abstract:

While large-scale vision-language pretraining has propelled significant advancements in video-text retrieval, the majority of current methods rely on a legacy assumption borrowed from image-text retrieval: that visual meaning can be adequately captured on a per-frame basis. This perspective fails to account for the temporal dynamics inherent in egocentric videos. Furthermore, the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge intensifies the difficulty by supplying soft-label relevance matrices instead of binary labels, thereby requiring models capable of resolving graded semantic alignments across different modalities.

In this report, we introduce TempRet, our solution for the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Built upon a CLIP-based dual-encoder backbone, our method adds two components that address temporal and cross-modal complexity. First, a temporal transformer operates only on the video stream, capturing inter-frame dependencies via multi-head self-attention and learnable positional encodings applied to frame-level CLIP features. Second, we implement a two-stage reranking pipeline: the first stage retrieves Top-K candidates using the dual-encoder, while the second stage recalibrates their scores through a cross-encoder with an Image-Text Matching (ITM) head.
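
As a rough illustration of the retrieve-then-rerank idea, the minimal sketch below ranks candidates by a dot-product similarity (standing in for the dual-encoder) and then reorders only the Top-K using a second scorer (standing in for the cross-encoder with an ITM head). The function names `two_stage_rerank` and `cross_score`, the use of plain Python lists as embeddings, and the choice to leave candidates outside the Top-K in their first-stage order are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as a stand-in for dual-encoder similarity."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_rerank(
    query_emb: Sequence[float],
    cand_embs: Sequence[Sequence[float]],
    cross_score: Callable[[int], float],  # hypothetical cross-encoder scorer
    top_k: int,
) -> List[int]:
    """Return candidate indices, best first.

    Stage 1 ranks every candidate by dual-encoder similarity.
    Stage 2 rescores only the top_k candidates with cross_score.
    Candidates outside the top_k keep their stage-1 order (an assumption).
    """
    coarse = sorted(
        range(len(cand_embs)),
        key=lambda i: dot(query_emb, cand_embs[i]),
        reverse=True,
    )
    head, tail = coarse[:top_k], coarse[top_k:]
    head = sorted(head, key=cross_score, reverse=True)
    return head + tail


# Toy data: four candidates with 2-D embeddings and made-up stage-2 scores.
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
stage2_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}

ranking = two_stage_rerank(query, candidates, stage2_scores.__getitem__, top_k=3)
print(ranking)
```

In this toy run, candidate 2 has the highest second-stage score but is never rescored because it falls outside the first-stage Top-K. This is the usual trade-off of such pipelines: the expensive scorer runs on only K items, at the cost of depending on first-stage recall.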

The system is optimized using Symmetric Multi-Similarity Loss, leveraging the soft-label relevance matrices provided by the challenge. Our approach yields an average mean Average Precision (mAP) of 67.97% and an average normalized Discounted Cumulative Gain (nDCG) of 82.92% on the EK-100 MIR benchmark. These results underscore the efficacy of integrating temporal modeling with cross-modal refinement for egocentric video retrieval.
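
To make the role of graded relevance in the nDCG metric concrete, the sketch below computes nDCG for a single query using the common textbook definition, in which each item's relevance is discounted by the base-2 logarithm of its rank plus one and the sum is normalized by the best achievable ordering. This is an assumption for illustration; the challenge's official evaluation code may differ in details such as how the per-direction scores are averaged. The relevance values in the example are invented.

```python
import math
from typing import List, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(scores: Sequence[float], relevances: Sequence[float]) -> float:
    """nDCG for one query.

    scores:     model similarity for each candidate (higher is better).
    relevances: graded (soft-label) relevance of each candidate, in [0, 1].
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked: List[float] = [relevances[i] for i in order]
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented soft labels for three candidates.
soft_labels = [1.0, 0.5, 0.0]
perfect = ndcg([0.9, 0.5, 0.1], soft_labels)   # ranking agrees with labels
reversed_ = ndcg([0.1, 0.5, 0.9], soft_labels)  # ranking is inverted
print(perfect, reversed_)
```

Unlike a binary metric, this score still gives partial credit when a partially relevant item (here, the one with relevance 0.5) is ranked highly.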


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
