arXiv

TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation

June 2, 2026 · Chaoyang Wang, Lexuan Xu

Title: TAP-JEPA: Two-Stage Score Fusion and Frozen Future-Latent Probing for EPIC-KITCHENS-100 Action Anticipation

Abstract

This paper introduces TAP-JEPA, which placed second in the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The challenge asks participants to predict the next verb, noun, or combined verb-noun action from an egocentric video clip that ends before the target action begins. Rather than fine-tuning a large video backbone, TAP-JEPA builds a lightweight anticipation model on frozen V-JEPA 2.1 features. A ViT-G/384 encoder processes the visible pre-action tokens, and a pre-trained latent predictor infers near-future tokens from that context. Attentive probes then combine the two token sets, using task-specific queries for verbs, nouns, and action pairs.
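To make the probing step concrete, the following is a minimal sketch of attention pooling with one query per task over the concatenated context and predicted-future tokens. It is an illustration under stated assumptions, not the authors' implementation: the helper `attentive_pool`, the toy 2-dimensional tokens, and the fixed query vectors are all hypothetical, and the sketch omits the learned projections, multiple heads, and classifier layers that a real attentive probe would likely include.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(query, tokens):
    """Single-head attention pooling: one query vector attends over all tokens.

    Returns the pooled feature vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


# Toy stand-ins (assumptions): real tokens would come from the frozen encoder
# and the pre-trained latent predictor, with a much larger feature dimension.
context_tokens = [[1.0, 0.0], [0.0, 1.0]]  # visible pre-action tokens
future_tokens = [[1.0, 1.0]]               # predicted near-future tokens
tokens = context_tokens + future_tokens    # both sets are probed jointly

# One query per task; in a trained probe these would be learned parameters.
queries = {"verb": [1.0, 0.0], "noun": [0.0, 1.0], "action": [1.0, 1.0]}

# Each task gets its own pooled feature, which a task-specific classifier
# (not shown) would map to class scores.
pooled = {task: attentive_pool(q, tokens)[0] for task, q in queries.items()}
```

Because the encoder and predictor stay frozen, only the queries and the downstream classifiers would need training in a setup like this, which is what keeps the anticipation model lightweight.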

For the final entry, we expanded the supervised training set to include the official training split and most of the validation split, holding out only a small subset for sanity checks and qualitative assessment. We also applied a two-stage score fusion: first, within each epoch, we averaged the scores of eight independently initialized probe replicas; second, we combined the resulting candidates from epochs 12 through 20 using field-dependent weights. On the official open-testing leaderboard, our team (sunshinesky) achieved an overall action Mean Top-5 Recall (MT5R) of 27.91%, placing second and trailing the top score by 0.04 percentage points.
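The two-stage fusion can be sketched as follows. This is a hedged illustration only: the function names, the nested-dictionary layout, the toy scores, and the weight values are assumptions, since the abstract does not state the actual weights or how they were chosen. The toy example uses two replicas and two epochs for brevity, where the described setup uses eight replicas and epochs 12 through 20.

```python
def mean_over_replicas(replica_scores):
    """Stage 1: average per-class scores across the probe replicas of one epoch."""
    n = len(replica_scores)
    return [sum(col) / n for col in zip(*replica_scores)]


def fuse_epochs(epoch_scores, epoch_weights):
    """Stage 2: weighted sum of per-epoch scores; weights are normalized to sum to 1."""
    total = sum(epoch_weights[e] for e in epoch_scores)
    num_classes = len(next(iter(epoch_scores.values())))
    fused = [0.0] * num_classes
    for epoch, scores in epoch_scores.items():
        w = epoch_weights[epoch] / total
        for c, s in enumerate(scores):
            fused[c] += w * s
    return fused


def two_stage_fusion(scores, weights):
    """Fuse scores per field (verb, noun, action).

    scores[field][epoch]  -> list of per-replica score vectors
    weights[field][epoch] -> float weight for that field and epoch
    """
    fused = {}
    for field, per_epoch in scores.items():
        stage1 = {e: mean_over_replicas(reps) for e, reps in per_epoch.items()}
        fused[field] = fuse_epochs(stage1, weights[field])
    return fused


# Toy example (hypothetical numbers): one field, two epochs, two replicas, three classes.
scores = {
    "verb": {
        12: [[0.2, 0.8, 0.0], [0.4, 0.6, 0.0]],
        13: [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    }
}
weights = {"verb": {12: 1.0, 13: 3.0}}  # field-dependent epoch weights (assumed)

fused = two_stage_fusion(scores, weights)
top_class = max(range(3), key=lambda c: fused["verb"][c])
```

Keeping a separate weight table per field lets verbs, nouns, and actions each favor different epochs, which is one plausible reading of "field-dependent weights".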
