Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
Title: Integrating Hand Trajectory Data to Enhance Egocentric Natural Language Query Grounding
Abstract: Egocentric Natural Language Query (NLQ) grounding requires a model to identify the time segment within a long first-person video that answers a given free-text query. Current approaches typically fuse visual content with the textual query but frequently overlook hand motion. This omission is significant because approximately 41% of queries in the Ego4D dataset are resolved during hand-object interaction or immediately afterwards. To address this gap, we introduce a hand-trajectory encoder that transforms sequences of hand skeletons into semantically rich kinematic features. These features are then aligned with pre-trained video-text representations through a cross-attention fusion mechanism with adaptive gating. Evaluations on the Ego4D NLQ v2 validation set show that incorporating hand trajectory data yields the largest improvements for queries involving hand-object interactions (+2.54 in R1@IoU=0.3) and for queries concerning quantity or state (+4.32 in R1@IoU=0.3). These results suggest that hand trajectory information offers critical grounding signals beyond what visual appearance alone can provide.
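As an illustration of the fusion step named in the abstract, the following is a minimal sketch of cross-attention with a scalar adaptive gate, written in plain Python. It is not the authors' implementation: the function name `gated_cross_attention`, the use of identity projections in place of learned query/key/value projections, the single attention head, and the gate parameters `gate_w` and `gate_b` are all assumptions made for illustration. Each video-text feature attends over the hand-trajectory features, and a sigmoid gate decides how much of the attended hand context is added back.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(video_feats, hand_feats, gate_w, gate_b):
    """Hypothetical gated cross-attention fusion (illustrative sketch only).

    video_feats: T x d list of video-text features (used as queries).
    hand_feats:  T_h x d list of hand-trajectory features (keys and values).
    gate_w:      list of length 2*d, assumed gate weights.
    gate_b:      float, assumed gate bias.
    Returns (fused, gates): a T x d list and the T gate values.
    """
    d = len(video_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused, gates = [], []
    for v in video_feats:
        # Scaled dot-product attention of one video step over all hand steps.
        weights = softmax([dot(v, h) * scale for h in hand_feats])
        context = [sum(w * h[j] for w, h in zip(weights, hand_feats))
                   for j in range(d)]
        # Scalar gate computed from the concatenation [v; context].
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, v + context) + gate_b)))
        # Residual fusion: the gate scales the hand context before adding it.
        fused.append([vi + g * ci for vi, ci in zip(v, context)])
        gates.append(g)
    return fused, gates


video = [[1.0, 0.0], [0.0, 1.0]]
hand = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused, gates = gated_cross_attention(video, hand, gate_w=[0.1] * 4, gate_b=0.0)
```

Under these assumptions, a gate close to 0 leaves the video-text feature essentially unchanged, while a gate close to 1 adds the full attended hand context. This is one plausible reading of how adaptive gating could let hand information contribute mainly where it helps.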
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



