arXiv

TrajTok: Learning Trajectory Tokens enables better Video Understanding

Video models typically rely on patchification for tokenization, a process that results in a high volume of redundant tokens. This redundancy significantly hinders the efficiency and scalability of video processing. Although recent trajectory-based tokenizers present a viable alternative by separating token count from video duration, they generally depend on intricate, task-agnostic external pipelines for segmentation and tracking, which are notably slow. To address these limitations, we introduce TrajTok, an end-to-end video tokenizer module designed to be fully integrated and co-trained with downstream video models. TrajTok dynamically adjusts its token granularity based on semantic complexity rather than video length.

At the core of TrajTok is a unified segmenter that executes implicit clustering of pixels across both spatial and temporal dimensions, generating object trajectories in a single forward pass. By prioritizing adaptability to downstream tasks over pixel-perfect segmentation accuracy, TrajTok remains lightweight and efficient while empirically boosting video understanding capabilities. Utilizing this approach, we developed TrajViT2, a video CLIP model trained from the ground up. It sets a new standard for accuracy at scale in both retrieval and classification benchmarks, all while maintaining efficiency levels comparable to leading token-merging techniques.
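The clustering idea in the paragraph above can be illustrated with a toy sketch. This is not TrajTok's actual segmenter, which the abstract does not specify. It assumes a hypothetical set of cluster queries (`QUERIES`), hard nearest-query assignment by dot product, and mean pooling per cluster, all chosen only for illustration. What it shows is the property the abstract emphasizes: patches from all frames are grouped together, so the number of output tokens follows the number of distinct groups in the video rather than its length.

```python
# Toy illustration only: hard assignment of spatio-temporal patch features
# to a fixed set of cluster queries, followed by mean pooling per cluster.
# The function names, queries, and features are hypothetical, not from the paper.

def trajectory_tokens(patch_feats, queries):
    """Pool patch features (flattened over time and space) into one token
    per non-empty cluster. Empty clusters produce no token."""
    dim = len(queries[0])
    sums = [[0.0] * dim for _ in queries]
    counts = [0] * len(queries)
    for feat in patch_feats:
        # Score the patch against every query and pick the best match.
        scores = [sum(a * b for a, b in zip(feat, q)) for q in queries]
        best = scores.index(max(scores))
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += feat[d]
    return [
        [s / counts[j] for s in sums[j]]
        for j in range(len(queries))
        if counts[j] > 0
    ]

def make_video(num_frames):
    """Synthetic video: every frame has three patches, two belonging to
    'object A' (feature [1, 0]) and one to 'object B' (feature [0, 1])."""
    frame = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    return [list(patch) for _ in range(num_frames) for patch in frame]

# Three candidate clusters; the third never matches any patch here.
QUERIES = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

short_tokens = trajectory_tokens(make_video(4), QUERIES)
long_tokens = trajectory_tokens(make_video(16), QUERIES)
print(len(short_tokens), len(long_tokens))
```

In this toy setup the 4-frame and 16-frame videos contain the same two objects, so both yield the same number of tokens even though the longer video has four times as many patches. A patch-based tokenizer would instead produce a token count proportional to the number of frames.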

Furthermore, TrajTok is useful beyond its primary role as a tokenizer. We show that it integrates seamlessly as a probing head for pretrained visual features (TrajAdapter) and as an alignment connector within vision-language models (TrajVLM). Its performance is particularly strong on long-video reasoning tasks.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
