Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Title: Demo2Tutorial: Converting Human Experience into Multimodal Software Tutorials
Abstract:
Digital environments hold a largely untapped store of authentic, unedited interactions that encode procedural knowledge derived from human experience. We present Demo2Tutorial, a framework that converts this captured experience, gathered through screen recordings and interaction logs, into structured, multimodal software tutorials suitable for instructing both humans and AI agents. The process begins with the collection of human experience using a specialized recorder. A multimodal Action Parser then interprets the raw data to reconstruct the user's perception, actions, and intent. Next, a Step Planner organizes these elements into hierarchical task graphs that specify goals and steps. Finally, a Tutorial Composer synthesizes the parsed experience into reusable, structured instructions combining images and text.
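As a rough illustration of how the planning and composition stages described above might fit together, the following minimal sketch groups parsed actions into steps and renders them as text with image references. It is not the authors' implementation: the names `ParsedAction`, `Step`, `plan_steps`, and `compose_tutorial` are hypothetical, the two-level grouping is a simplified stand-in for the hierarchical task graph, and the parsed actions are hand-written here rather than produced by a multimodal parser.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class ParsedAction:
    """One parsed event (hypothetical schema, not from the paper)."""
    screenshot: str  # path to the frame captured at this event
    action: str      # what the user did
    intent: str      # inferred purpose, used here as the grouping key


@dataclass
class Step:
    """A goal together with the low-level actions that achieve it."""
    goal: str
    actions: list = field(default_factory=list)


def plan_steps(actions):
    """Group consecutive actions that share an intent into one step."""
    return [
        Step(goal=intent, actions=list(group))
        for intent, group in groupby(actions, key=lambda a: a.intent)
    ]


def compose_tutorial(task, steps):
    """Render the steps as numbered text with image references."""
    lines = [f"Task: {task}"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"{i}. {step.goal}")
        for j, act in enumerate(step.actions, start=1):
            lines.append(f"   {i}.{j} {act.action} [image: {act.screenshot}]")
    return "\n".join(lines)


# Hand-written example input; a real system would derive this from recordings.
actions = [
    ParsedAction("f1.png", "Click the 'Insert' tab", "Open the chart dialog"),
    ParsedAction("f2.png", "Click 'Chart'", "Open the chart dialog"),
    ParsedAction("f3.png", "Select 'Bar' and press OK", "Choose a chart type"),
]
steps = plan_steps(actions)
tutorial = compose_tutorial("Insert a bar chart", steps)
print(tutorial)
```

Grouping by consecutive intent keeps the original order of the demonstration, so a user who returns to an earlier goal later in the recording yields a separate step rather than being merged out of sequence.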
We assess the quality of the generated tutorials using a new benchmark based on official software documentation. Our results indicate that this distilled representation offers dual benefits: it enhances human learning through the automatic creation of multimodal tutorials, and it boosts agent learning by improving downstream GUI-agent planning and generalization. Experiments show that Demo2Tutorial generates high-quality tutorials that exceed the standard of human-authored content and significantly outperform baseline methods. The framework also enables faster task completion for humans and improves planning for GUI agents, demonstrating that structured tutorials extracted from human experience can serve as effective knowledge representations for advancing both human education and AI capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





