Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
Abstract
We propose a framework for a proactive, multimodal assistant that provides real-time, step-by-step guidance for procedural tasks and decides autonomously when to interrupt and how to coach. Progress has been hindered by the lack of large-scale, cross-domain benchmarks that capture realistic scenarios, especially cases where users deviate from the expected sequence of steps. To bridge this gap, we make four contributions: (1) EgoProactive, a wearable egocentric dataset for proactive procedural assistance with explicit annotations for Out-of-Plan (OOP) deviations and the corresponding recovery actions; (2) Pro²Bench, which extends five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, and HowTo100M) under a unified schema for proactive guidance; (3) a decoupled planner–interaction architecture designed to handle procedural state, visual signals, and the injection of recovery steps; and (4) a post-training methodology that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. Extensive experiments show that the Llama 4 system significantly improves objective intervention quality over both strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight models (Qwen3 VL 235B) across all six datasets. Furthermore, oracle-plan experiments show that, when plan quality is held constant, the trained duplex model delivers high-quality guidance and achieves substantial improvements in OOP recovery performance.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC


