Global News Digest

arXiv

MMSkills: Towards Multimodal Skills for General Visual Agents

Abstract:

While reusable skills are increasingly viewed as the fundamental building blocks for enhancing agent performance, current approaches largely confine these skills to textual prompts, executable scripts, or learned behavioral patterns. This traditional paradigm overlooks a critical aspect of visual agents: their procedural knowledge is inherently multimodal. Effective reuse requires more than just knowing which operation to execute; it demands the ability to recognize specific states, interpret visual cues indicating progress or setbacks, and determine subsequent actions based on that evidence.

In this work, we formalize this requirement as "multimodal procedural knowledge" and address three practical challenges: deciding what a multimodal skill package should contain, deriving such packages from public interaction histories, and letting agents consult multimodal evidence at inference time without consuming excessive image context or over-anchoring to reference screenshots.

We present MMSkills, a novel framework designed to represent, generate, and deploy reusable multimodal procedures to support visual decision-making at runtime. An MMSkill functions as a concise, state-conditioned package that integrates a textual procedure with runtime state cards and keyframes captured from multiple viewpoints.
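As a rough illustration of what such a package might look like as a data structure, the sketch below pairs a textual procedure with state cards and keyframes. Every class name, field name, and the `cards_for` helper are hypothetical choices made for this example; the abstract does not specify the actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keyframe:
    """A reference screenshot; `view` labels the viewpoint (hypothetical field)."""
    image_path: str
    view: str


@dataclass
class StateCard:
    """Describes one recognizable state and what to do when it is observed."""
    state: str          # what the screen should look like
    progress_cue: str   # visual sign that the step succeeded
    failure_cue: str    # visual sign that something went wrong
    next_action: str    # suggested action once the state is recognized
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure plus the state cards that condition it (assumed layout)."""
    name: str
    procedure: List[str]
    state_cards: List[StateCard] = field(default_factory=list)

    def cards_for(self, query: str) -> List[StateCard]:
        """Return cards whose state description contains `query`, ignoring case."""
        q = query.lower()
        return [c for c in self.state_cards if q in c.state.lower()]
```

Keeping the cues and the next action on the state card, rather than in the procedure text, is one way to make the package state-conditioned: the agent looks up guidance by what it currently sees instead of by step number.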

To build these packages, we devised an agentic trajectory-to-skill Generator. This component converts public, non-evaluation trajectories into reusable multimodal skills by employing workflow grouping, procedure induction, visual grounding, and auditing guided by meta-skills. For deployment, we introduce a branch-loaded multimodal skill agent. This mechanism allows selected state cards and keyframes to be examined within a temporary branch, where they are aligned with the live environment and distilled into structured guidance for the primary agent.
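The branch-loading idea can be sketched as a function whose inputs (state cards and their keyframes) never leave its scope, so that only compact guidance returns to the primary agent. This is a minimal sketch under strong assumptions: the function names, the dictionary keys, and the threshold are hypothetical, and a word-overlap score on text descriptions stands in for the multimodal comparison that a real system would perform on screenshots.

```python
from typing import Dict, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a stand-in for a visual matcher."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def run_branch(live_description: str, cards: List[Dict[str, str]],
               threshold: float = 0.3) -> Dict[str, object]:
    """Inspect cards in a temporary branch and return compact guidance.

    Only the returned dict goes back to the primary agent; the cards and
    any keyframes they reference stay inside this function's scope.
    """
    best, best_score = None, 0.0
    for card in cards:
        score = token_overlap(live_description, card["state"])
        if score > best_score:
            best, best_score = card, score
    # No card resembles the live state closely enough: return no guidance
    # rather than anchoring the agent to an unrelated reference.
    if best is None or best_score < threshold:
        return {"matched": False, "guidance": None}
    return {"matched": True, "state": best["state"],
            "guidance": best["next_action"], "score": round(best_score, 2)}
```

Returning "no match" below the threshold reflects the concern raised in the abstract about agents becoming overly anchored to reference screenshots; the threshold value itself is arbitrary here.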

Experiments on GUI and game-based visual-agent benchmarks show that MMSkills consistently improves the performance of both state-of-the-art and smaller multimodal agents. These results indicate that external multimodal procedural knowledge complements the models' internal priors.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
