TrAction: Action Recognition with Sparse Trajectories
Original (arXiv:2606.03490v1): Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction
Rewrite: Current action recognition models operate on dense RGB video volumes that are costly in memory and compute, and they often exploit appearance and background shortcuts, predicting an action from the objects or scene in view rather than from its characteristic motion. We investigate sparse point trajectories as an efficient alternative input modality that is largely free of such biases by construction. We develop a simple transformer architecture for 2.5D trajectory-based recognition, together with a masked-trajectory pretraining scheme that substantially improves downstream action recognition accuracy. Although it uses only a fraction of the dense RGB input, the method reaches 45% top-1 accuracy on Something-Something V2 and 54% on EPIC-Kitchens-100, and it surpasses V-JEPA on time-reversal sensitivity. More importantly, trajectory features prove complementary to state-of-the-art appearance-based features: fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
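
The abstract names masked-trajectory pretraining but does not describe how it works. The sketch below illustrates one plausible reading and is not the authors' implementation. It makes three assumptions that do not come from the abstract: "2.5D" means image-plane coordinates plus a depth value per point, masking hides whole trajectories rather than individual time steps, and the pretraining loss is a mean squared error over the hidden trajectories. All function names are hypothetical.

```python
import random


def make_trajectories(num_points, num_frames, seed=0):
    """Build toy trajectories: one (x, y, depth) triple per point per frame (assumed layout)."""
    rng = random.Random(seed)
    return [
        [(rng.random(), rng.random(), rng.random()) for _ in range(num_frames)]
        for _ in range(num_points)
    ]


def mask_trajectories(trajectories, mask_ratio=0.5, seed=0):
    """Hide a random subset of whole trajectories (assumed masking scheme).

    Returns the visible input, with None in place of hidden trajectories,
    and a list of booleans marking which trajectories were hidden.
    """
    rng = random.Random(seed)
    num_points = len(trajectories)
    num_masked = int(round(mask_ratio * num_points))
    masked_ids = set(rng.sample(range(num_points), num_masked))
    mask = [i in masked_ids for i in range(num_points)]
    # A real model would substitute a learned mask token where None appears.
    visible = [None if hidden else traj for traj, hidden in zip(trajectories, mask)]
    return visible, mask


def reconstruction_loss(predicted, target, mask):
    """Mean squared error over hidden trajectories only (assumed objective)."""
    total, count = 0.0, 0
    for pred_traj, target_traj, hidden in zip(predicted, target, mask):
        if not hidden:
            continue
        for pred_point, target_point in zip(pred_traj, target_traj):
            for p, t in zip(pred_point, target_point):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy usage: 8 tracked points over 4 frames, half of them hidden.
trajectories = make_trajectories(num_points=8, num_frames=4)
visible, mask = mask_trajectories(trajectories, mask_ratio=0.5)
```

In this sketch a model would receive `visible`, predict the full set of trajectories, and be scored by `reconstruction_loss` only where `mask` is true, so it cannot lower the loss by copying its visible input.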





