arXiv

MMSkills: Towards Multimodal Skills for General Visual Agents

June 2, 2026 · Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

Abstract:

While reusable skills are increasingly viewed as the fundamental building blocks for enhancing agent performance, current approaches largely confine these skills to textual prompts, executable scripts, or learned behavioral patterns. This traditional paradigm overlooks a critical aspect of visual agents: their procedural knowledge is inherently multimodal. Effective reuse requires more than just knowing which operation to execute; it demands the ability to recognize specific states, interpret visual cues indicating progress or setbacks, and determine subsequent actions based on that evidence.

In this work, we formalize this need as "multimodal procedural knowledge" and address three practical challenges: what a multimodal skill package should contain, where in public interaction histories such packages can be derived from, and how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots.

We present MMSkills, a novel framework designed to represent, generate, and deploy reusable multimodal procedures to support visual decision-making at runtime. An MMSkill functions as a concise, state-conditioned package that integrates a textual procedure with runtime state cards and keyframes captured from multiple viewpoints.
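As a rough illustration of the package structure described above, the following minimal sketch models a skill as a textual procedure plus state cards that carry keyframes. All class and field names (`MMSkill`, `StateCard`, `Keyframe`, `visual_cues`, `view`, and so on) are assumptions made for this example; the abstract does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Keyframe:
    # Hypothetical fields: a reference screenshot and the viewpoint it shows.
    image_path: str
    view: str  # e.g. "full_screen" or "cropped_region" (assumed labels)


@dataclass
class StateCard:
    # Hypothetical fields: a recognizable state, the visual cues that
    # identify it, and the action the procedure suggests from that state.
    state_name: str
    visual_cues: List[str]
    next_action: str
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    # A textual procedure paired with state-conditioned visual evidence.
    name: str
    procedure: List[str]
    state_cards: List[StateCard] = field(default_factory=list)

    def card_for(self, state_name: str) -> Optional[StateCard]:
        """Return the state card with the given name, or None if absent."""
        for card in self.state_cards:
            if card.state_name == state_name:
                return card
        return None
```

Keying the visual evidence by state, rather than attaching screenshots to the skill as a whole, reflects the "state-conditioned" wording: only the cards relevant to the current state need to be consulted.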

To build these packages, we develop an agentic trajectory-to-skill Generator. This component converts public, non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. For deployment, we introduce a branch-loaded multimodal skill agent, in which selected state cards and keyframes are examined within a temporary branch, aligned with the live environment, and distilled into structured guidance for the primary agent.
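The branch-loading idea can be sketched in a few lines. This is a toy stand-in rather than the authors' method: a real branch would use a multimodal model to compare keyframes with the live screen, whereas this sketch matches plain-text cues by set overlap. The function name `branch_guidance` and the dictionary keys are hypothetical.

```python
def branch_guidance(state_cards, live_cues):
    """Pick the state card that best matches the live cues and return
    compact, text-only guidance for the primary agent.

    state_cards: list of dicts with the assumed keys
                 'state', 'cues', 'next_action', 'keyframes'.
    live_cues:   list of strings describing the live environment.
    """
    live = set(live_cues)
    best, best_overlap = None, 0
    for card in state_cards:
        # Toy alignment score: number of cues shared with the live state.
        overlap = len(live & set(card["cues"]))
        if overlap > best_overlap:
            best, best_overlap = card, overlap

    if best is None:
        # Nothing matched: return empty guidance instead of forcing a card.
        return {"matched_state": None, "suggested_action": None, "evidence": []}

    # Keyframes stay inside the branch; only distilled text is returned.
    return {
        "matched_state": best["state"],
        "suggested_action": best["next_action"],
        "evidence": sorted(live & set(best["cues"])),
    }
```

Returning only the distilled fields keeps reference images out of the primary agent's context, which is the stated purpose of the temporary branch; returning empty guidance when nothing matches is one simple way to avoid anchoring on an unrelated screenshot.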

Experiments on GUI and game-based visual agent benchmarks show that MMSkills consistently improves the performance of both state-of-the-art and smaller multimodal agents. These results indicate that external multimodal procedural knowledge complements the models' internal priors.
