FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation
Title: FROST-STA: Leveraging Frozen Dense Features for Short-Term Object Interaction Anticipation in Ego4D
Abstract:
Predicting short-term interactions in egocentric video demands more than mere scene recognition; a robust system must determine which object the wearer is about to touch, identify the subsequent action, and estimate the time until contact. This paper introduces FROST-STA, our entry into the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For every queried timestamp, the model generates a ranked list of structured hypotheses, each specifying an active-object bounding box, noun and verb labels, time-to-contact (TTC), and a confidence score.
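To make the output format concrete, the sketch below represents one hypothesis as a small Python record and ranks candidates by confidence. The class name `StaHypothesis`, its field names, the pixel box convention, and the sample values are illustrative assumptions, not the challenge's official submission schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StaHypothesis:
    """One structured prediction for a queried timestamp (field names are illustrative)."""
    box: Tuple[float, float, float, float]  # active-object box, assumed (x1, y1, x2, y2) in pixels
    noun: str     # category of the object about to be touched
    verb: str     # anticipated action
    ttc: float    # time to contact, assumed to be in seconds
    score: float  # confidence used for ranking


def top_k(hypotheses: List[StaHypothesis], k: int = 5) -> List[StaHypothesis]:
    """Return the k most confident hypotheses, highest score first."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)[:k]


# Made-up candidates for a single queried timestamp.
candidates = [
    StaHypothesis((40.0, 60.0, 120.0, 160.0), "mug", "take", 0.8, 0.62),
    StaHypothesis((200.0, 80.0, 300.0, 220.0), "knife", "take", 1.5, 0.31),
    StaHypothesis((38.0, 58.0, 125.0, 158.0), "mug", "move", 0.9, 0.47),
]
ranked = top_k(candidates, k=2)
```

A Top-5 metric only looks at the five best-ranked hypotheses per query, which is why the ranking step is part of the output contract rather than a post-processing detail.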
FROST-STA follows the V-JEPA 2.1 STA evaluation protocol but adapts the pipeline for the competition with object-centric decoding, multi-head prediction, and a training and ensembling strategy optimized for submission performance. We keep the V-JEPA 2.1 ViT-G backbone frozen and extract two dense token streams: video tokens from a short clip (resized to 384 pixels) preceding the query, and image tokens from the last observed frame at high resolution. A streamlined alignment module, comprising an attentive probe and frame-guided temporal pooling, projects the clip representation onto the spatial coordinate system of the last frame; the projected representation is then fused with the image features.
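The abstract does not spell out how frame-guided temporal pooling is computed. One plausible reading, sketched below in plain Python, treats each last-frame token as a query that weights the clip's time steps at the same spatial position, then concatenates the pooled clip feature with the image feature. The function names, the scaled dot-product weighting, the concatenation-based fusion, and the assumption that both streams already share one spatial grid and channel width are ours; the attentive probe is omitted.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_guided_pool(video_tokens: List[List[Vector]], frame_tokens: List[Vector]) -> List[Vector]:
    """Pool clip tokens over time, guided by the last frame.

    video_tokens[t][p] is the clip token at time step t and spatial position p;
    frame_tokens[p] is the last-frame token at position p (same grid assumed).
    """
    dim = len(frame_tokens[0])
    pooled = []
    for p, query in enumerate(frame_tokens):
        keys = [step[p] for step in video_tokens]
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in keys])
        pooled.append([sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)])
    return pooled


def fuse(pooled: List[Vector], frame_tokens: List[Vector]) -> List[Vector]:
    """Fuse by channel concatenation (an assumption; the source only says 'fused')."""
    return [clip_feat + img_feat for clip_feat, img_feat in zip(pooled, frame_tokens)]


# Toy input: 2 time steps, 2 spatial positions, 2 channels.
video = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [0.0, 1.0]]]
frame = [[1.0, 0.0], [0.0, 1.0]]
pooled = frame_guided_pool(video, frame)
fused = fuse(pooled, frame)
```

In this toy input, position 0 leans toward the first time step because that clip token matches the last-frame token, while position 1 sees identical tokens at both steps and averages them.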
These fused feature maps are processed by STA heads modeled after Faster R-CNN architectures, which predict box offsets, noun and verb classifications, TTC values, and interaction quality metrics. For our final submission, we trained the model for 25 epochs using the official training data alongside additional permitted validation annotations. The final results were obtained by aggregating predictions from eight heads and checkpoints spanning epochs 15 through 25. FROST-STA achieved an Overall Top-5 mAP of 5.13 on the official test server, securing second place in the challenge. These results demonstrate that frozen dense image-video features constitute a powerful foundation for forecasting object-level interactions.
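The abstract does not state how predictions from different heads and checkpoints are aggregated. As a hedged illustration only, the sketch below pools all predictions, groups those that share noun and verb labels and overlap spatially, and averages each group weighted by confidence. The function name `merge_predictions`, the dictionary keys, the IoU threshold of 0.5, and the score normalization are assumptions, not the submission's actual procedure.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_predictions(runs: List[List[dict]], iou_thr: float = 0.5, k: int = 5) -> List[dict]:
    """Merge predictions from several ensemble members (scores assumed positive)."""
    pooled = sorted((p for run in runs for p in run), key=lambda p: p["score"], reverse=True)
    groups: List[List[dict]] = []
    for pred in pooled:
        for group in groups:
            lead = group[0]  # the highest-scoring member defines the group
            same_labels = (pred["noun"], pred["verb"]) == (lead["noun"], lead["verb"])
            if same_labels and iou(pred["box"], lead["box"]) >= iou_thr:
                group.append(pred)
                break
        else:
            groups.append([pred])
    merged = []
    for group in groups:
        total = sum(p["score"] for p in group)
        box = tuple(sum(p["score"] * p["box"][i] for p in group) / total for i in range(4))
        ttc = sum(p["score"] * p["ttc"] for p in group) / total
        # Dividing by the number of members rewards hypotheses that many of them agree on.
        merged.append({"box": box, "noun": group[0]["noun"], "verb": group[0]["verb"],
                       "ttc": ttc, "score": total / len(runs)})
    return sorted(merged, key=lambda p: p["score"], reverse=True)[:k]


# Two hypothetical ensemble members with made-up predictions.
run_a = [{"box": (0.0, 0.0, 10.0, 10.0), "noun": "mug", "verb": "take", "ttc": 1.0, "score": 0.6}]
run_b = [{"box": (0.0, 0.0, 10.0, 10.0), "noun": "mug", "verb": "take", "ttc": 2.0, "score": 0.2},
         {"box": (50.0, 50.0, 60.0, 60.0), "noun": "knife", "verb": "take", "ttc": 0.5, "score": 0.5}]
merged = merge_predictions([run_a, run_b])
```

Here the two "mug, take" predictions collapse into one hypothesis whose TTC sits closer to the more confident member, and the "knife, take" prediction survives as a separate, lower-ranked entry.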
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





