arXiv

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

June 2, 2026 · Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Title: Establishing a Baseline and Open-Source Benchmark for Multi-temporal Referring Segmentation

Abstract:

While Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in visual comprehension and language-guided grounding, their ability to perform multi-temporal visual reasoning has yet to be thoroughly investigated. To address this limitation, we present Multi-temporal Referring Segmentation (MTRS), a novel task focused on segmenting language-described temporal changes within multi-temporal imagery. MTRS expands upon traditional referring segmentation and change detection by simultaneously demanding temporal correspondence reasoning, linguistic grounding, and pixel-level mask generation.

To facilitate this research, we introduce CRAFT-Agent, an automated data-construction pipeline with human auditing. Using this pipeline, we built MTRefSeg-21K, the first MTRS benchmark. The dataset comprises 21,000 high-quality triplets of multi-temporal images, text, and masks, spanning a wide variety of scenes, viewpoints, and domains.
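To make the triplet structure concrete, the following is a minimal sketch of how one (multi-temporal images, text, mask) sample might be represented. The class name `MTRSSample`, its field names, and the validation rules are hypothetical and are not taken from the MTRefSeg-21K release; in particular, the sketch assumes that every image and the mask share one spatial size, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import List

# Toy stand-in for an image or a mask: rows of integer pixel values.
Grid = List[List[int]]


@dataclass
class MTRSSample:
    """Hypothetical container for one (images, text, mask) triplet."""
    images: List[Grid]  # two or more views of one scene, ordered by time
    expression: str     # language description of the change to segment
    mask: Grid          # binary mask: 1 where the referred change occurs

    def validate(self) -> None:
        if len(self.images) < 2:
            raise ValueError("a multi-temporal sample needs at least two images")
        if not self.expression.strip():
            raise ValueError("the referring expression must not be empty")
        if not self.mask or not self.mask[0]:
            raise ValueError("the mask must not be empty")
        height, width = len(self.mask), len(self.mask[0])
        # Assumption of this sketch: images and mask share one spatial size.
        for grid in self.images + [self.mask]:
            if len(grid) != height or any(len(row) != width for row in grid):
                raise ValueError("images and mask must share one spatial size")
        if any(value not in (0, 1) for row in self.mask for value in row):
            raise ValueError("the mask must be binary")


# A 2x2 toy sample in which one pixel changes between the two time steps.
sample = MTRSSample(
    images=[[[0, 0], [0, 0]], [[0, 9], [0, 0]]],
    expression="the structure that appeared in the top-right corner",
    mask=[[0, 1], [0, 0]],
)
sample.validate()
```

Keeping the referring expression next to the image sequence and the mask reflects the point that MTRS supervises only the change the text refers to, rather than every difference between the images.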

Our extensive benchmarking of numerous VLM- and LVLM-based models indicates that direct inference yields subpar results, and task-specific fine-tuning offers limited improvements. To overcome these challenges, we propose MTRefSeg-R1, a change-aware LVLM framework employing a two-stage training strategy. Initially, the model acquires general temporal-change perception using 20,000 vision-only bi-temporal samples. Subsequently, it undergoes fine-tuning on MTRefSeg-21K to achieve fine-grained, language-guided temporal localization. MTRefSeg-R1 explicitly captures cross-temporal visual disparities, aligns linguistic instructions with temporal variations, and generates masks for the referred changes. Comprehensive experiments demonstrate that MTRefSeg-R1 delivers robust, and often superior, performance relative to existing LVLM baselines, underscoring both the difficulties and the promise inherent in MTRS.
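The two-stage strategy described above can be sketched as an ordered schedule in which the second stage starts from the state produced by the first. This is an illustrative outline only: the `Stage` fields, the stage names, and the `fake_train_stage` placeholder are assumptions, and the actual MTRefSeg-R1 training code, losses, and hyperparameters are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    name: str         # hypothetical label for the stage
    data_source: str  # description of the data used in this stage
    uses_text: bool   # whether language input is part of the supervision


# The two stages named in the abstract; the field values are descriptive labels.
STAGES: List[Stage] = [
    Stage("change_perception", "20,000 vision-only bi-temporal samples", False),
    Stage("referring_finetune", "MTRefSeg-21K", True),
]


def run_schedule(
    stages: List[Stage],
    train_stage: Callable[[Stage, Dict], Dict],
    state: Dict,
) -> Dict:
    """Run the stages in order, feeding each stage's output state into the next."""
    for stage in stages:
        state = train_stage(stage, state)
    return state


def fake_train_stage(stage: Stage, state: Dict) -> Dict:
    # Placeholder for a real training loop: it only records which stage ran.
    return {**state, "history": state["history"] + [stage.name]}


final_state = run_schedule(STAGES, fake_train_stage, {"history": []})
print(final_state["history"])  # ['change_perception', 'referring_finetune']
```

The ordering is the point of the sketch: language is absent from the first stage, so text-conditioned localization is learned on top of an already change-aware model.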
