See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Original: arXiv:2606.02735v1 Announce Type: cross
Abstract: Vision-language-action (VLA) models struggle to generalize. When confronted with distractors, appearance changes, or semantically similar tasks, these policies must often infer specific execution details from coarse instructions while simultaneously determining which visual elements are relevant for control. To address this, we introduce S2 (See Less, Specify More), a framework that improves VLA generalization by training the executor through a streamlined interface. "Specify More" retains the original instruction as a fixed high-level objective but relabels each trajectory with refined language at both the subtask and trajectory levels, clarifying the current mode of execution. "See Less" introduces an explicit budget for visual evidence, training the executor to rely on task-sufficient visual evidence rather than unconstrained visual context, without requiring region or mask annotations. Together, these components let the executor follow detailed guidance without being sidetracked by distracting visual patches or having to resolve unnecessary ambiguity on its own. The interface also remains compatible with off-the-shelf VLM planners via in-context learning. In our primary evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem, and our analysis supports three findings: coarse instructions cause avoidable supervision aliasing; goal-preserving local guidance outperforms instruction replacement in our main ablations; and explicit evidence budgeting reduces reliance on broad visual context, a benefit beyond efficiency gains. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises the mean subtask success rate from 54.2% for pi0.5 to 79.0%, a gain of 24.8 percentage points. These findings indicate that VLA generalization improves when executors are trained to use informative local guidance and task-sufficient visual evidence, rather than having to infer both from weak supervision.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



