arXiv

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

June 4, 2026 · Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

Title: Overcoming the Perceptual Bottleneck in Vision-Language Decision Making Through Strategic Focus Plan Generation

Abstract:

In embodied tasks that rely on vision-language decision-making, such as robotic navigation and manipulation, Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs) serve as potent instruments, each offering distinct advantages. VLMs excel in long-term strategic planning, whereas VLAs are superior for immediate, reactive control. Despite their strengths, both model types are hindered by a shared perceptual bottleneck: visual hallucinations occur because the models struggle to separate task-critical objects from background noise or distractors.

Fundamentally, overcoming this limitation requires identifying the essential elements precisely and concentrating on them while ignoring irrelevant information. A simple one-step focus strategy that directly targets key objects might seem viable, but it often fails, because effective focus demands a deep understanding of the scene that a single step cannot provide.

To address this challenge, we introduce SceneDiver, a method designed for VLMs that generates a coarse-to-fine focus plan by capitalizing on their long-term planning capabilities. The process begins by constructing a comprehensive scene graph to build an initial holistic understanding. It then iteratively breaks down complex tasks into manageable sub-problems through a continuous loop of recognition, comprehension, and analysis. To facilitate reactive control, we also develop a lightweight adapter that distills this deliberate focus capability into VLAs.
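The coarse-to-fine idea can be illustrated with a toy sketch. This is not the SceneDiver implementation: the `SceneNode` structure, the `relevance` keyword-overlap score (a stand-in for a VLM judgment), and the `focus_plan` function are all hypothetical names chosen for illustration. The sketch assumes the scene graph is a simple tree and descends from the whole scene toward the most task-relevant object, recording each focus step.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One region or object in a toy scene graph (hypothetical structure)."""
    name: str
    tags: set[str] = field(default_factory=set)
    children: list["SceneNode"] = field(default_factory=list)


def relevance(node: SceneNode, task_words: set[str]) -> int:
    """Stand-in for a VLM relevance judgment (assumption): count task words
    that match tags anywhere in the node's subtree."""
    score = len(node.tags & task_words)
    return score + sum(relevance(c, task_words) for c in node.children)


def focus_plan(root: SceneNode, task: str) -> list[str]:
    """Descend from the whole scene toward the most task-relevant node,
    recording each focus step from coarse to fine."""
    task_words = set(task.lower().split())
    plan = [root.name]
    node = root
    while node.children:
        best = max(node.children, key=lambda c: relevance(c, task_words))
        if relevance(best, task_words) == 0:
            break  # nothing below is relevant, so stop refining
        plan.append(best.name)
        node = best
    return plan


# Toy scene: two mugs, only one of which matters for the task.
scene = SceneNode("kitchen", children=[
    SceneNode("counter", {"counter"}, [
        SceneNode("red mug", {"red", "mug"}),
        SceneNode("kettle", {"kettle"}),
    ]),
    SceneNode("table", {"table"}, [
        SceneNode("blue mug", {"blue", "mug"}),
    ]),
])

plan = focus_plan(scene, "pick up the red mug")
print(plan)
```

In this toy scene the plan narrows from the kitchen to the counter and then to the red mug, bypassing the blue mug on the table, which acts as a distractor. A one-step strategy that matched only the word "mug" would have no basis for choosing between the two mugs.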

Our evaluations across standard embodied AI benchmarks demonstrate that SceneDiver significantly diminishes visual hallucinations in both VLMs and VLAs. Importantly, the method maintains computational efficiency, ensuring it remains viable for tasks requiring rapid execution. The code and data associated with this research are available at: https://future-item.github.io/SceneDiver.

Source: arXiv