arXiv

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

June 2, 2026 · Gabriel Fiastre, Antoine Yang, Cordelia Schmid

Title: CaptionFormer: A Unified Framework for Spatio-Temporal Object Segmentation, Tracking, and Captioning

Abstract:

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and describing object trajectories in videos in natural language. It requires fine-grained spatio-temporal understanding as well as language generation. Because the task is complex and manual annotation is costly, prior methods have often been trained on limited datasets, which can lead to suboptimal performance. To address this limitation, we use a state-of-the-art Vision-Language Model (VLM) to generate captions for spatio-temporally localized entities. We augment the LVIS and LV-VIS datasets with these synthetic captions, yielding the LVISCap and LV-VISCap datasets, respectively. We also present CaptionFormer, an end-to-end architecture that jointly detects, segments, tracks, and captions object trajectories. Our model sets a new state of the art for DVOC on three established benchmarks: VidSTG, VLN, and BenSMOT. The code and datasets are available at https://www.gabriel.fiastre.fr/captionformer/.
