arXiv

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation

Title: FROST-STA: Leveraging Frozen Dense Features for Short-Term Object Interaction Anticipation in Ego4D

Abstract:

Predicting short-term interactions in egocentric video demands more than mere scene recognition; a robust system must determine which object the wearer is about to touch, identify the subsequent action, and estimate the time until contact. This paper introduces FROST-STA, our entry into the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For every queried timestamp, the model generates a ranked list of structured hypotheses, each specifying an active-object bounding box, noun and verb labels, time-to-contact (TTC), and a confidence score.
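As an illustration of the output format described above, the sketch below models one hypothesis as a small Python record and ranks the hypotheses for a single queried timestamp by confidence. The class name `StaHypothesis`, its field names, the `(x1, y1, x2, y2)` box convention, the use of seconds for TTC, and the helper `rank_hypotheses` are all assumptions made for illustration; the abstract does not specify a schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StaHypothesis:
    """One structured prediction for a queried timestamp (hypothetical schema)."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    noun: str     # active-object category
    verb: str     # anticipated action
    ttc: float    # time to contact, assumed to be in seconds
    score: float  # confidence used for ranking


def rank_hypotheses(hyps: List[StaHypothesis], k: int = 5) -> List[StaHypothesis]:
    """Return the k most confident hypotheses, highest score first."""
    return sorted(hyps, key=lambda h: h.score, reverse=True)[:k]


# Made-up values, used only to show the ranking step.
example = [
    StaHypothesis((10, 20, 110, 220), "cup", "take", 0.6, 0.41),
    StaHypothesis((300, 40, 380, 160), "knife", "take", 1.2, 0.77),
    StaHypothesis((12, 22, 108, 215), "cup", "move", 0.5, 0.18),
]
top = rank_hypotheses(example, k=2)
print([(h.noun, h.verb, h.score) for h in top])
```

Keeping the record immutable makes it safe to pass the same hypotheses through several ranking or filtering steps without accidental modification.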

FROST-STA follows the V-JEPA 2.1 STA evaluation protocol but adapts it for the competition with object-centric decoding, multi-head prediction, and a training and ensembling strategy tuned for submission performance. We keep the V-JEPA 2.1 ViT-G backbone frozen and extract two dense token streams: video tokens from a short clip (resized to 384 pixels) preceding the query, and image tokens from the last observed frame at high resolution. A streamlined alignment module, comprising an attentive probe and frame-guided temporal pooling, projects the clip representation onto the spatial coordinate system of the last frame, where it is fused with the image features.
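The abstract does not give the details of the alignment module, so the following is only one plausible reading of "frame-guided temporal pooling", written in plain Python for clarity. It assumes that the clip tokens have already been resampled to the last frame's spatial grid and share its channel width, that the last-frame token at each location acts as an attention query over the clip tokens at that location, and that fusion is a simple concatenation. The attentive probe is not modeled, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_guided_pool(frame_tok: Vector, clip_toks: List[Vector]) -> Vector:
    """Pool clip tokens over time at one spatial location.

    frame_tok: last-frame token used as the query (length D).
    clip_toks: T clip tokens at the same location (each of length D).
    """
    scale = math.sqrt(len(frame_tok))
    weights = softmax([dot(frame_tok, v) / scale for v in clip_toks])
    return [sum(w * v[d] for w, v in zip(weights, clip_toks))
            for d in range(len(frame_tok))]


def fuse(frame_map: List[Vector], clip_map: List[List[Vector]]) -> List[Vector]:
    """Concatenate each last-frame token with its temporally pooled clip token."""
    return [f + frame_guided_pool(f, vs) for f, vs in zip(frame_map, clip_map)]


# Two spatial locations, two time steps, two channels (toy numbers).
frame_map = [[1.0, 0.0], [0.0, 1.0]]
clip_map = [
    [[1.0, 0.0], [0.0, 1.0]],  # location 0: the first time step matches the query
    [[2.0, 2.0], [2.0, 2.0]],  # location 1: identical tokens, so uniform weights
]
fused = fuse(frame_map, clip_map)
print(len(fused), len(fused[0]))
```

In this toy input the pooled token at location 0 leans toward the time step that agrees with the last-frame query, which is the behavior that the phrase "frame-guided" suggests; a real implementation would operate on batched tensors rather than nested lists.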

The fused feature maps are processed by Faster R-CNN-style STA heads, which predict box offsets, noun and verb classes, TTC values, and interaction quality scores. For our final submission, we trained the model for 25 epochs on the official training data together with additional permitted validation annotations. The final results were obtained by aggregating predictions from eight heads and checkpoints spanning epochs 15 through 25. FROST-STA achieved an Overall Top-5 mAP of 5.13 on the official test server, placing second in the challenge. These results indicate that frozen dense image-video features are a strong foundation for forecasting object-level interactions.
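The abstract does not say how the predictions from the different heads and checkpoints are aggregated. The sketch below therefore shows a generic pooling-and-suppression scheme, not the authors' procedure: all predictions are pooled, sorted by score, and a prediction is dropped when a higher-scoring one with the same noun and verb overlaps it strongly. The names `Pred` and `merge_predictions`, the IoU threshold of 0.5, and the cut-off of five predictions are assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


class Pred(NamedTuple):
    box: Box
    noun: str
    verb: str
    ttc: float
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_predictions(per_model: List[List[Pred]], iou_thr: float = 0.5,
                      k: int = 5) -> List[Pred]:
    """Pool predictions from several models, suppress duplicates, keep the top k."""
    pooled = sorted((p for preds in per_model for p in preds),
                    key=lambda p: p.score, reverse=True)
    kept: List[Pred] = []
    for p in pooled:
        duplicate = any(q.noun == p.noun and q.verb == p.verb
                        and iou(q.box, p.box) >= iou_thr for q in kept)
        if not duplicate:
            kept.append(p)
        if len(kept) == k:
            break
    return kept


# Toy outputs from two hypothetical checkpoints for the same timestamp.
model_a = [Pred((0, 0, 10, 10), "cup", "take", 0.5, 0.9),
           Pred((50, 50, 60, 60), "knife", "cut", 1.0, 0.3)]
model_b = [Pred((1, 1, 10, 10), "cup", "take", 0.6, 0.8),
           Pred((0, 0, 10, 10), "cup", "move", 0.4, 0.5)]
merged = merge_predictions([model_a, model_b])
print([(p.noun, p.verb, p.score) for p in merged])
```

Here the second "cup, take" box is suppressed because it overlaps a higher-scoring prediction with the same labels, while the "cup, move" prediction survives because its verb differs. Alternatives such as score averaging or weighted box fusion would fit the same interface.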


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
