arXiv

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

June 3, 2026 · Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang

Original: arXiv:2606.03564v1 · Announce Type: cross

Abstract: Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-modal alignment, or explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation, termed CR-Seg, a two-stage framework for coarse-to-refined reasoning segmentation. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning-answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.

Rewritten:

Abstract: Reasoning segmentation isolates target objects that are specified by complex language descriptions, which requires joint visual-textual reasoning. Current approaches depend either on learned semantic tokens to connect Multimodal Large Language Models (MLLMs) with segmentation models, which makes cross-modal alignment difficult, or on explicit spatial prompts such as bounding boxes, which can discard the holistic semantics of the response. To overcome these shortcomings, we introduce CR-Seg (Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation), a two-stage framework for coarse-to-refined reasoning segmentation. Its Extract Attention Maps and Points (EAP) module extracts attention maps for coarse target localization and selects informative points; both the maps and the points are fed into SAM for mask refinement. To mitigate inconsistency between the reasoning and the final answer, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.
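The abstract does not specify how EAP turns an attention map into a coarse localization and a set of points, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a 2D attention map is already available as a NumPy array. The function names (`normalize_map`, `select_points`, `coarse_mask`), the greedy peak picking, and the parameters `k`, `min_dist`, and `threshold` are all hypothetical choices made for illustration. The hand-off to SAM is omitted.

```python
import numpy as np


def normalize_map(attn: np.ndarray) -> np.ndarray:
    """Min-max scale an attention map to [0, 1]; a flat map becomes all zeros."""
    attn = attn.astype(np.float64)
    span = attn.max() - attn.min()
    if span == 0:
        return np.zeros_like(attn)
    return (attn - attn.min()) / span


def select_points(attn: np.ndarray, k: int = 3, min_dist: int = 2) -> list[tuple[int, int]]:
    """Greedily pick up to k high-attention points as (x, y) pairs.

    Assumption: "informative points" are approximated here by attention peaks,
    with a square neighbourhood of radius min_dist suppressed after each pick
    so that the selected points do not cluster on a single peak.
    """
    work = normalize_map(attn)  # a fresh array, so the input is not modified
    height, width = work.shape
    points: list[tuple[int, int]] = []
    for _ in range(k):
        y, x = np.unravel_index(int(np.argmax(work)), work.shape)
        if work[y, x] <= 0:
            break  # no remaining attention mass worth selecting
        points.append((int(x), int(y)))
        y0, y1 = max(0, y - min_dist), min(height, y + min_dist + 1)
        x0, x1 = max(0, x - min_dist), min(width, x + min_dist + 1)
        work[y0:y1, x0:x1] = 0.0  # suppress the neighbourhood of the chosen peak
    return points


def coarse_mask(attn: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Threshold the normalized map to get a rough boolean localization mask."""
    return normalize_map(attn) >= threshold
```

In a pipeline of the kind the abstract describes, the coarse map and the selected points would then serve as prompts to a promptable segmenter such as SAM, which produces the refined mask. How CR-Seg actually selects points and conditions SAM may differ from this sketch.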

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC