InstructSAM: Segment Any Instance with Any Instructions
Abstract:
This study presents InstructSAM, a cohesive and efficient framework tailored for multi-instance segmentation guided by arbitrary instructions. We approach instruction-driven instance segmentation as a problem of set-structured query prediction. To connect a vision-language model (VLM) with SAM3, we introduce an explicit reasoning-to-instance query interface. This method injects a bank of learnable instance queries into the VLM, where they are contextualized using both visual data and instructional input, allowing each query to function as an instance-aware slot. A hybrid-attention mechanism enhances the interaction between these queries, visual tokens, and instruction tokens, which boosts instance enumeration capabilities and minimizes redundant predictions. These LLM-conditioned queries are then projected into SAM3’s detector query space, facilitating accurate multi-instance segmentation in a single forward pass. This architecture grants SAM3 advanced capabilities, including high-level instruction comprehension, compositional reasoning, and instance-level set prediction, all without altering its fundamental structure. To facilitate training and assessment, we also develop Inst2Seg, a large-scale, high-quality benchmark and dataset for instruction-based instance segmentation that pairs free-form instructions with instance-level masks. Comprehensive experiments demonstrate that InstructSAM, despite having only 2 billion parameters, delivers robust performance on complex instruction-driven and phrase-level referring segmentation benchmarks. It surpasses previous end-to-end approaches and SAM3’s agentic pipeline, while maintaining the efficiency of single-pass multi-instance prediction.
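The query interface described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not define the hybrid-attention pattern, so the sketch assumes one plausible reading: context tokens (visual and instruction) attend causally, while instance queries attend to all context tokens and to each other bidirectionally. All function names, dimensions, and the single linear projection are hypothetical and chosen only for illustration; no real SAM3 or VLM API is used.

```python
import numpy as np


def hybrid_attention_mask(num_context: int, num_queries: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True when token i may attend to token j.

    Assumption (not stated in the source): context tokens are causal,
    instance queries see every context token and every other query.
    """
    total = num_context + num_queries
    mask = np.tril(np.ones((total, total), dtype=bool))  # causal baseline
    mask[num_context:, num_context:] = True  # queries attend to each other both ways
    return mask


def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head self-attention without learned weights, restricted by mask."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def project_queries(queries: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Hypothetical linear map from the VLM hidden size to a detector query size."""
    return queries @ weight + bias


rng = np.random.default_rng(0)
num_context, num_queries, vlm_dim, det_dim = 6, 3, 8, 4  # toy sizes

context = rng.normal(size=(num_context, vlm_dim))  # visual + instruction tokens
instance_queries = rng.normal(size=(num_queries, vlm_dim))  # stand-in for learnable queries
tokens = np.concatenate([context, instance_queries], axis=0)

mask = hybrid_attention_mask(num_context, num_queries)
contextualized = masked_self_attention(tokens, mask)

# Only the query slots are passed on; each row would seed one instance prediction.
detector_queries = project_queries(
    contextualized[num_context:],
    rng.normal(size=(vlm_dim, det_dim)),
    np.zeros(det_dim),
)
print(detector_queries.shape)
```

In this sketch, letting the queries see one another is what would allow each slot to specialize on a different instance, which matches the abstract's stated goal of reducing redundant predictions. In the real model the queries and the projection would be trained parameters.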
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




