arXiv

Scene-Centric Unsupervised Video Panoptic Segmentation

June 4, 2026 · Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixé, Christian Rupprecht, Daniel Cremers, Stefan Roth

Abstract:

Video panoptic segmentation (VPS) aims to detect, segment, and track every object in a video while also partitioning the video into semantically coherent regions. In this work, we introduce the setting of unsupervised VPS, which requires no human-generated labels. Research on unsupervised scene understanding has so far concentrated on static image segmentation, and the video domain has received little attention. To address this gap, we present VideoCUPS, the first method for unsupervised VPS. VideoCUPS creates temporally consistent panoptic pseudo-labels from unsupervised cues such as depth, motion, and visual features derived from scene-centric videos. Training on these pseudo-labels with a newly designed Video DropLoss yields a highly accurate VPS model without supervision. To measure progress in this setting, we develop a thorough evaluation protocol alongside four strong baselines, which adapt leading unsupervised image panoptic segmentation and video instance segmentation techniques to VPS. VideoCUPS outperforms all of these baselines and is highly label-efficient. Together with our evaluation protocol and baselines, VideoCUPS lays a solid foundation for future research on unsupervised video panoptic segmentation.

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC