arXiv

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Original: arXiv:2606.02735v1 Announce Type: cross

Abstract: Vision-language-action (VLA) models face a significant hurdle in generalization. When confronted with distractors, changes in appearance, or tasks that are semantically similar, these policies must often deduce specific execution details from broad instructions while simultaneously determining which visual elements are relevant for control. To address this, we introduce S2 (See Less, Specify More), a framework designed to enhance VLA generalization by training the executor via a streamlined interface. The "Specify More" component retains the original instruction as a fixed high-level objective but relabels each trajectory with refined language at both the subtask and trajectory levels, thereby clarifying the current mode of execution. Meanwhile, "See Less" introduces an explicit budget for visual evidence. This trains the executor to rely on task-sufficient data rather than unconstrained visual context, all without requiring region or mask annotations. This approach allows the executor to adhere to detailed guidance without being sidetracked by distracting visual patches or having to resolve unnecessary ambiguity independently. Furthermore, this interface remains compatible with off-the-shelf VLM planners when using in-context learning. In our primary evaluation settings, S2 boosts overall generalization metrics by altering the executor's learning challenge: coarse instructions lead to avoidable supervision aliasing, goal-preserving local guidance proves superior to instruction replacement in our main ablations, and explicit evidence budgeting lessens the reliance on broad visual context beyond just efficiency gains. Across eight real-robot tasks conducted on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 increased the mean subtask success rate from 54.2% to 79.0% compared to pi0.5. These findings indicate that VLA generalization improves when executors are trained to utilize informative local guidance and task-sufficient visual evidence, rather than attempting to infer both from weak supervision.
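
The abstract does not say how the "See Less" evidence budget is enforced, so the following is only an illustrative sketch of one way such a budget could work: keep the highest-scoring fraction of visual patches and mask the rest. The function name `apply_evidence_budget`, the per-patch relevance scores, and the top-k rule are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def apply_evidence_budget(scores: List[float], budget: float) -> List[bool]:
    """Return a keep-mask retaining roughly the top `budget` fraction of patches.

    Hypothetical helper: `scores` holds one relevance value per visual patch,
    and `budget` is the fraction of patches the executor is allowed to see.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    # Always keep at least one patch so the executor is never fully blind.
    n_keep = max(1, int(round(budget * len(scores))))
    # Rank patch indices by score, highest first; equal scores keep index order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(scores))]


# Eight patches with a 25% budget: only the two highest-scoring patches survive.
patch_scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.2, 0.7, 0.4]
mask = apply_evidence_budget(patch_scores, budget=0.25)
```

In this toy setup the mask keeps patches 1 and 3 and hides the other six. In a trained system the scores would have to come from the model itself, since the abstract states that no region or mask annotations are required.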


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
