arXiv

Agent Skills Should Go Beyond Text: The Case for Visual Skills

June 2, 2026 · Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua

Title: Expanding Agent Capabilities: Why Visual Skills Must Complement Text

Reusable skills are a key driver of agent capability: they let an agent accumulate experience and tackle progressively harder tasks. However, the prevailing approach to skill acquisition largely confines this reusable knowledge to text, such as instructional guides, reasoning chains, or condensed action histories. We contend that relying exclusively on text significantly constrains visually intensive applications, where effective knowledge transfer depends on spatial layout, visual grounding, fine-grained appearance, and localized state transitions.

To overcome these limitations, we introduce \textbf{\NAME}, a novel multimodal framework that integrates declarative textual logic with robust visual components. This approach categorizes reusable assets into three distinct types: static priors that capture enduring spatial conventions; dynamic priors that function as in-situ visual working memory; and interleaved visual skills that link sequential text instructions to specific source frames, screenshots, or page areas that validate them. Unlike text-only methods that merely outline actions, visual skills also specify where to focus attention, how to conduct inspections, and how to confirm visual results.
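One way to picture the three asset types is as a small data structure. The sketch below is illustrative only: the class names (`VisualAnchor`, `SkillStep`, `VisualSkill`), the field names, the bounding-box convention, and the file paths are all assumptions made for this example, not the schema used by the framework.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration; not the framework's actual schema.
BBox = Tuple[int, int, int, int]  # assumed (left, top, right, bottom) in pixels


@dataclass
class VisualAnchor:
    """A reference to a source frame, screenshot, or page region."""
    image_path: str
    region: Optional[BBox] = None  # None means the whole image


@dataclass
class SkillStep:
    """One text instruction, optionally tied to the visual evidence for it."""
    instruction: str
    anchor: Optional[VisualAnchor] = None   # where to look
    expected_result: Optional[str] = None   # what to confirm visually


@dataclass
class VisualSkill:
    name: str
    # Static priors: enduring spatial conventions (e.g. a toolbar layout).
    static_priors: List[VisualAnchor] = field(default_factory=list)
    # Dynamic priors: visual working memory filled in during a task.
    dynamic_priors: List[VisualAnchor] = field(default_factory=list)
    # Interleaved steps: text instructions linked to visual anchors.
    steps: List[SkillStep] = field(default_factory=list)

    def grounded_steps(self) -> List[SkillStep]:
        """Return only the steps that carry a visual anchor."""
        return [s for s in self.steps if s.anchor is not None]

    def remember(self, anchor: VisualAnchor) -> None:
        """Append an observation to the dynamic (in-situ) priors."""
        self.dynamic_priors.append(anchor)


def build_example_skill() -> VisualSkill:
    # All paths and coordinates below are made up for the example.
    save_button = VisualAnchor("frames/step_03.png", region=(40, 12, 120, 44))
    return VisualSkill(
        name="save_document",
        static_priors=[VisualAnchor("priors/toolbar_layout.png")],
        steps=[
            SkillStep(
                "Click the Save button in the toolbar.",
                anchor=save_button,
                expected_result="A 'Saved' indicator appears.",
            ),
            SkillStep("Close the document."),
        ],
    )
```

In this sketch a text-only skill is simply a `VisualSkill` whose steps have no anchors, which makes the contrast with visually grounded steps explicit.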

To facilitate the large-scale creation of these skills, we present \textbf{\SYSTEM}, an automated infrastructure designed to transform agent experiences into reusable multimodal resources. This system retains critical elements from task trajectories, including textual reasoning processes, spatial references, visual boundaries, and interaction patterns. Our evaluations across GUI and other visually oriented tasks demonstrate that visual skills consistently surpass their text-only counterparts, especially in scenarios demanding spatial accuracy, visual proof, and state-aware interactions. These findings reinforce our core argument: for the next generation of multimodal agents, reusable skills must evolve beyond text to encompass multimodal assets.
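The idea of turning a task trajectory into a reusable multimodal resource can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual pipeline: the event keys (`thought`, `action`, `screenshot`, `bbox`, `success`) and the rule of keeping only successful steps are hypothetical choices made for the example.

```python
from typing import Any, Dict, List


def distill_trajectory(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep successful steps, retaining reasoning, spatial reference, and action.

    The event format and the success filter are assumptions for illustration.
    """
    skill_steps = []
    for event in events:
        # Assumed filter: drop steps that did not succeed.
        if not event.get("success", False):
            continue
        skill_steps.append({
            "reasoning": event.get("thought", ""),    # textual reasoning
            "action": event.get("action", ""),        # interaction pattern
            "screenshot": event.get("screenshot"),    # visual evidence
            "bbox": event.get("bbox"),                # spatial reference
        })
    return skill_steps


# Hypothetical trajectory; all values are invented for the example.
example_events = [
    {"thought": "The menu is collapsed.", "action": "click", "screenshot": "t0.png",
     "bbox": (10, 10, 50, 30), "success": True},
    {"thought": "Try the wrong tab.", "action": "click", "screenshot": "t1.png",
     "bbox": (60, 10, 90, 30), "success": False},
]
example_steps = distill_trajectory(example_events)
```

Keeping the screenshot and bounding box alongside each reasoning string is what distinguishes this record from a text-only action history.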
