arXiv

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Abstract:

In embodied tasks that rely on vision-language decision-making, such as robotic navigation and manipulation, Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs) serve as potent instruments, each offering distinct advantages. VLMs excel in long-term strategic planning, whereas VLAs are superior for immediate, reactive control. Despite their strengths, both model types are hindered by a shared perceptual bottleneck: visual hallucinations occur because the models struggle to separate task-critical objects from background noise or distractors.

Overcoming this limitation requires precisely identifying the task-critical elements of a scene and concentrating on them while ignoring irrelevant information. A simple one-step focus strategy that directly targets key objects might seem viable, but it often fails, because effective focus demands a thorough understanding of the scene, which a single step cannot provide.

To address this challenge, we introduce SceneDiver, a method designed for VLMs that generates a coarse-to-fine focus plan by capitalizing on their long-term planning capabilities. The process begins by constructing a comprehensive scene graph to build an initial holistic understanding of the scene. It then iteratively breaks complex tasks down into manageable sub-problems through a repeated loop of recognition, comprehension, and analysis. To support reactive control, we also develop a lightweight adapter that distills this deliberate focus capability into VLAs.
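The coarse-to-fine idea can be illustrated with a toy sketch. This is not the SceneDiver implementation: the abstract does not specify the scene-graph format or how the VLM is queried, so the `SceneNode` class, the `focus_plan` function, and the keyword-overlap scoring below are hypothetical stand-ins chosen only to show how a plan might narrow from the whole scene to a task-critical object.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """Hypothetical scene-graph node: a region or object and what it contains."""
    name: str
    tags: set[str] = field(default_factory=set)
    children: list[SceneNode] = field(default_factory=list)


def subtree_tags(node: SceneNode) -> set[str]:
    """Collect the tags of a node and everything nested beneath it."""
    tags = set(node.tags)
    for child in node.children:
        tags |= subtree_tags(child)
    return tags


def focus_plan(root: SceneNode, task: str) -> list[str]:
    """Return a coarse-to-fine list of node names to attend to for a task.

    Keyword overlap stands in for the model's judgment (an assumption of
    this sketch, not something stated in the abstract).
    """
    task_words = set(task.lower().split())
    plan = [root.name]
    current = root
    while current.children:
        # Recognition: enumerate the candidate sub-regions of the current focus.
        candidates = current.children
        # Comprehension: score each candidate against the task description.
        scored = [(len(subtree_tags(c) & task_words), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Analysis: stop refining when no candidate is relevant to the task.
        if best_score == 0:
            break
        plan.append(best.name)
        current = best
    return plan


# A small hand-built scene, used only for illustration.
kitchen = SceneNode("kitchen", children=[
    SceneNode("counter", {"counter"}, [
        SceneNode("red mug", {"red", "mug"}),
        SceneNode("kettle", {"kettle"}),
    ]),
    SceneNode("table", {"table"}, [
        SceneNode("blue mug", {"blue", "mug"}),
    ]),
])

print(focus_plan(kitchen, "pick up the red mug"))
```

In this toy scene the plan descends from the kitchen to the counter and then to the red mug, while the table and its blue mug, which act as distractors, are never selected. A task that matches nothing in the graph leaves the focus at the root.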

Our evaluations across standard embodied AI benchmarks demonstrate that SceneDiver significantly diminishes visual hallucinations in both VLMs and VLAs. Importantly, the method maintains computational efficiency, ensuring it remains viable for tasks requiring rapid execution. The code and data associated with this research are available at: https://future-item.github.io/SceneDiver.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
