Global News Digest

arXiv

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Abstract:

While Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in visual comprehension and language-guided grounding, their ability to perform multi-temporal visual reasoning has yet to be thoroughly investigated. To address this limitation, we present Multi-temporal Referring Segmentation (MTRS), a novel task focused on segmenting language-described temporal changes within multi-temporal imagery. MTRS expands upon traditional referring segmentation and change detection by simultaneously demanding temporal correspondence reasoning, linguistic grounding, and pixel-level mask generation.

To facilitate this research, we introduce CRAFT-Agent, an automated pipeline for data construction that incorporates human auditing. Utilizing this tool, we developed MTRefSeg-21K, the inaugural MTRS benchmark. This dataset comprises 21,000 high-quality triplets of multi-temporal images, text, and masks, spanning a wide variety of scenes, viewpoints, and domains.
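As a rough illustration of what one such triplet might look like in code, the sketch below models a single sample as a set of time-ordered images, a referring text, and a binary mask. The class name `MTRSSample`, the helper `changed_pixel_count`, the nested-list image representation, and the requirement that every image matches the mask's shape are all assumptions made for this example; the abstract does not describe the actual storage format of MTRefSeg-21K.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical representations: an H x W grid of intensities and an
# H x W grid of 0/1 labels. The real dataset format is not specified.
Image = List[List[int]]
Mask = List[List[int]]


@dataclass(frozen=True)
class MTRSSample:
    """One (multi-temporal images, text, mask) triplet (illustrative only)."""

    images: Sequence[Image]  # two or more time-ordered views of a scene
    text: str                # language description of the referred change
    mask: Mask               # 1 marks pixels belonging to the referred change

    def __post_init__(self) -> None:
        if len(self.images) < 2:
            raise ValueError("need at least two temporal images")
        if not self.mask or not self.mask[0]:
            raise ValueError("mask must be non-empty")
        height, width = len(self.mask), len(self.mask[0])
        # Assumption: all images share the mask's spatial shape.
        for img in self.images:
            if len(img) != height or any(len(row) != width for row in img):
                raise ValueError("image and mask shapes differ")
        if any(v not in (0, 1) for row in self.mask for v in row):
            raise ValueError("mask must be binary")


def changed_pixel_count(sample: MTRSSample) -> int:
    """Count the pixels that the mask marks as part of the referred change."""
    return sum(sum(row) for row in sample.mask)


before = [[0, 0], [0, 0]]
after = [[0, 9], [0, 0]]
sample = MTRSSample(
    images=[before, after],
    text="the new structure in the top-right corner",
    mask=[[0, 1], [0, 0]],
)
print(changed_pixel_count(sample))
```

Validating the triplet at construction time keeps malformed samples (a single image, or a mask that does not line up with the images) from reaching a training or evaluation loop.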

Our extensive benchmarking of numerous VLM- and LVLM-based models indicates that direct inference yields subpar results, and task-specific fine-tuning offers limited improvements. To overcome these challenges, we propose MTRefSeg-R1, a change-aware LVLM framework employing a two-stage training strategy. Initially, the model acquires general temporal-change perception using 20,000 vision-only bi-temporal samples. Subsequently, it undergoes fine-tuning on MTRefSeg-21K to achieve fine-grained, language-guided temporal localization. MTRefSeg-R1 explicitly captures cross-temporal visual disparities, aligns linguistic instructions with temporal variations, and generates masks for the referred changes. Comprehensive experiments demonstrate that MTRefSeg-R1 delivers robust, and often superior, performance relative to existing LVLM baselines, underscoring both the difficulties and the promise inherent in MTRS.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
