MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
Title: MMG2Skill: Transforming Unstructured Guides into Self-Improving Agent Skills
Abstract: The vast repository of procedural knowledge available on the web offers significant potential for empowering agents to tackle complex, long-horizon tasks. Yet, this information is frequently multimodal, fragmented, noisy, and written with human users in mind, rendering it unsuitable for direct application as agent skills. To address the disconnect between human-centric instructions and machine-executable commands, we define the challenge of "guide-to-skill learning." This process involves converting raw, real-world guides into executable skills and refining them through agent-observable trajectories.
To assess how well current agents handle this task, we present MMG2Skill-Bench, a benchmark tailored to the problem. We also introduce MMG2Skill, a closed-loop framework that translates guides into modifiable skills. The framework conditions a fixed vision-language model (VLM) agent on these skills during operation and revises them using root-cause feedback derived from trajectories rather than benchmark scores.
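The closed loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Skill` structure, `build_skill`, `refine_skill`, and the toy agent and analyzer are all hypothetical names introduced for illustration, and the real system would use a VLM where this sketch uses plain functions. The one property it tries to mirror is that revisions are driven by analyzer feedback on the trajectory, never by a benchmark score.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Skill:
    """Hypothetical editable skill: a name plus an ordered list of steps."""
    name: str
    steps: List[str]
    revisions: int = 0


def build_skill(guide_text: str, name: str = "skill") -> Skill:
    """Toy stand-in for skill construction: keep each non-empty guide line as a step."""
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(name=name, steps=steps)


def refine_skill(
    skill: Skill,
    run_agent: Callable[[Skill], List[str]],
    analyze: Callable[[Skill, List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Skill:
    """Run the agent with the skill, then revise the skill from analyzer feedback.

    `analyze` sees only the skill and the trajectory (no benchmark score) and
    returns a suggested fix, or None when it finds nothing to change.
    """
    for _ in range(max_rounds):
        trajectory = run_agent(skill)
        feedback = analyze(skill, trajectory)
        if feedback is None:
            break
        skill = Skill(skill.name, skill.steps + [feedback], skill.revisions + 1)
    return skill


# Toy stand-ins (assumptions): the agent just echoes the steps it was given,
# and the analyzer flags a missing confirmation step.
def toy_agent(skill: Skill) -> List[str]:
    return [f"did: {step}" for step in skill.steps]


def toy_analyzer(skill: Skill, trajectory: List[str]) -> Optional[str]:
    if not any("confirm" in step for step in skill.steps):
        return "confirm the dialog before continuing"
    return None


refined = refine_skill(build_skill("open settings\n\nclick export\n"), toy_agent, toy_analyzer)
```

In this toy run the analyzer requests one fix in the first round and nothing in the second, so the loop stops after a single revision.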
Evaluated across GUI control, open-ended gaming, and strategic card playing using six different VLM backbones, MMG2Skill consistently surpasses standard baseline agents in every configuration. The approach yields macro-average performance improvements ranging from +12.8 to +25.3 percentage points. Ablation studies reveal that simply prompting agents with unstructured guides actually harms performance. Instead, both the structured construction of skills and the revision of those skills via trajectory data are essential for the observed gains. Furthermore, on tasks where success is inferable, an analyzer-driven early-stopping mechanism helps avoid late-stage performance drops, reducing the number of attempts by 25% to 53% when the success signal is accurately calibrated.
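One plausible reading of the analyzer-driven early stopping is sketched below in Python. The function names, the success predicate, and the attempt budget are assumptions made for illustration, and the numbers in the toy run are not taken from the paper.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_early_stop(
    attempt: Callable[[int], T],
    looks_successful: Callable[[T], bool],
    max_attempts: int,
) -> List[T]:
    """Run attempts in order and stop once the analyzer infers success."""
    outcomes: List[T] = []
    for index in range(max_attempts):
        outcome = attempt(index)
        outcomes.append(outcome)
        if looks_successful(outcome):
            break  # later attempts are skipped, so they cannot degrade the result
    return outcomes


def attempts_saved(used: int, budget: int) -> float:
    """Fraction of the attempt budget that early stopping left unused."""
    return 1.0 - used / budget


# Toy run (assumption): the third attempt is the first one judged successful.
outcomes = run_with_early_stop(lambda i: i >= 2, lambda ok: ok, max_attempts=8)
saved = attempts_saved(len(outcomes), 8)
```

The saving depends entirely on how reliable `looks_successful` is: a predicate that fires too early stops before the task is actually solved, which is consistent with the abstract's caveat that the success signal must be accurately calibrated.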
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




