TrajTok: Learning Trajectory Tokens enables better Video Understanding
Video models typically tokenize their input by patchification, which produces a large number of redundant tokens. This redundancy limits the efficiency and scalability of video processing. Recent trajectory-based tokenizers offer an alternative by decoupling token count from video duration, but they generally depend on complex external segmentation and tracking pipelines that are slow and task-agnostic. To address these limitations, we introduce TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with downstream video models. TrajTok dynamically adapts its token granularity to semantic complexity rather than to video length.
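To make the scaling argument concrete, the following minimal sketch compares token counts under uniform patchification with a trajectory-style budget. The patch size of 16, the 224x224 resolution, and the helper names are illustrative assumptions, not values or APIs from TrajTok.

```python
def patch_token_count(frames, height, width, patch=16, tubelet=1):
    """Tokens produced by a uniform space-time patch grid (ViT-style).

    patch and tubelet are assumed values chosen for illustration.
    """
    return (frames // tubelet) * (height // patch) * (width // patch)


def trajectory_token_count(num_trajectories, tokens_per_trajectory=1):
    """Hypothetical budget: tokens depend on scene content, not on frame count."""
    return num_trajectories * tokens_per_trajectory


short_clip = patch_token_count(frames=16, height=224, width=224)
long_clip = patch_token_count(frames=64, height=224, width=224)
print(short_clip, long_clip)  # 3136 12544

# The same 20 tracked regions cost 20 tokens whether the clip has 16 or 64 frames.
print(trajectory_token_count(num_trajectories=20))  # 20
```

Under these assumptions the patch count grows linearly with the number of frames, while the trajectory count changes only when the scene content does.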
At the core of TrajTok is a unified segmenter that performs implicit clustering of pixels across both space and time, producing object trajectories in a single forward pass. By prioritizing adaptability to downstream tasks over pixel-perfect segmentation accuracy, TrajTok stays lightweight and efficient while empirically improving video understanding. Using TrajTok, we build TrajViT2, a video CLIP model trained from scratch. It sets a new standard for accuracy at scale on both retrieval and classification benchmarks, with efficiency comparable to leading token-merging methods.
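The idea of implicit clustering can be illustrated with a toy soft-assignment pooling step. This is a simplified sketch and not the TrajTok segmenter: the query vectors, the dot-product similarity, and the function names are all assumptions made for illustration, and the number of clusters is fixed here, whereas TrajTok adapts its granularity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cluster_tokens(features, queries):
    """Pool per-pixel features into one token per query via soft assignment.

    features: N vectors of dim D (pixels flattened over space and time)
    queries:  K vectors of dim D (hypothetical cluster centers)
    Returns K tokens, each a weighted mean of the features.
    """
    dim = len(queries[0])
    sums = [[0.0] * dim for _ in queries]
    mass = [0.0] * len(queries)
    for f in features:
        logits = [sum(a * b for a, b in zip(f, q)) for q in queries]
        weights = softmax(logits)  # soft assignment of this pixel to K clusters
        for k, w in enumerate(weights):
            mass[k] += w
            for d in range(dim):
                sums[k][d] += w * f[d]
    # Softmax weights are strictly positive, so every mass is nonzero.
    return [[s / m for s in row] for row, m in zip(sums, mass)]


# Four "pixels" from two frames, forming two groups in feature space.
features = [[4.0, 0.0], [5.0, 0.0], [0.0, 4.0], [0.0, 5.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = soft_cluster_tokens(features, queries)
print(len(tokens))  # 2
```

Because all pixels from all frames are assigned in one pass, the output size equals the number of queries regardless of how many frames contributed pixels.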
TrajTok is also versatile beyond its role as a tokenizer. It integrates seamlessly as a probing head for pretrained visual features (TrajAdapter) and as an alignment connector in vision-language models (TrajVLM), with particularly strong performance on long-video reasoning tasks.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






