arXiv

InstructSAM: Segment Any Instance with Any Instructions

Abstract:

This study presents InstructSAM, a cohesive and efficient framework tailored for multi-instance segmentation guided by arbitrary instructions. We approach instruction-driven instance segmentation as a problem of set-structured query prediction. To connect a vision-language model (VLM) with SAM3, we introduce an explicit reasoning-to-instance query interface. This method injects a bank of learnable instance queries into the VLM, where they are contextualized using both visual data and instructional input, allowing each query to function as an instance-aware slot. A hybrid-attention mechanism enhances the interaction between these queries, visual tokens, and instruction tokens, which boosts instance enumeration capabilities and minimizes redundant predictions. These LLM-conditioned queries are then projected into SAM3’s detector query space, facilitating accurate multi-instance segmentation in a single forward pass. This architecture grants SAM3 advanced capabilities, including high-level instruction comprehension, compositional reasoning, and instance-level set prediction, all without altering its fundamental structure. To facilitate training and assessment, we also develop Inst2Seg, a large-scale, high-quality benchmark and dataset for instruction-based instance segmentation that pairs free-form instructions with instance-level masks. Comprehensive experiments demonstrate that InstructSAM, despite having only 2 billion parameters, delivers robust performance on complex instruction-driven and phrase-level referring segmentation benchmarks. It surpasses previous end-to-end approaches and SAM3’s agentic pipeline, while maintaining the efficiency of single-pass multi-instance prediction.
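The abstract does not spell out how the hybrid-attention mechanism is laid out. As a minimal sketch of one plausible reading, assume the visual and instruction tokens form a causal prefix while the appended instance queries attend to the full prefix and to each other bidirectionally. The function name `build_hybrid_mask` and the token counts are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def build_hybrid_mask(num_prefix: int, num_queries: int) -> list[list[bool]]:
    """Return a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed layout (hypothetical, not taken from the paper):
      positions [0, num_prefix)      -> visual + instruction tokens, causal attention
      positions [num_prefix, total)  -> learnable instance queries, bidirectional attention
    """
    total = num_prefix + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_prefix:
                # Prefix tokens keep standard causal attention.
                mask[i][j] = j <= i
            else:
                # Query tokens see the whole prefix and every other query,
                # so each slot can check what the other slots already cover.
                mask[i][j] = True
    return mask


mask = build_hybrid_mask(num_prefix=3, num_queries=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, letting queries see one another is what would allow the model to enumerate instances and avoid assigning the same object to two slots. The contextualized query states would then be passed through a projection layer into the segmentation model's query space, which the sketch leaves out.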


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
