See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence
Abstract:
To move beyond passive observation, multimodal retail agents must anticipate customer needs and decide when and how to assist before any explicit request is made. This study explores that capability through the See–Infer–Intervene (SII) framework. In this framework, a system first observes pre-interaction behavior, infers the customer’s latent intent, and then decides whether to perform a specific service intervention or to hold back.
We implement the SII framework using the Proactive Intent World Model (PIWM). This model represents the customer's state through AIDA purchasing stages (Attention, Interest, Desire, Action) and BDI psychological dimensions (belief, desire, intention). It predicts how intent shifts under each candidate action and chooses among five response categories: Greet, Elicit, Inform, Recommend, and Hold. To support this research, we introduce GuidanceSalesBench, a smart-retail benchmark featuring state manifests, pre-interaction footage, candidate responses, action-conditioned outcomes, and labels for the optimal action.
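As a rough illustration of the state and action spaces described above, the following Python sketch encodes an AIDA stage plus BDI fields and picks the action whose predicted outcome scores highest. It is not PIWM itself: the names `CustomerState`, `select_action`, and `toy_scorer`, the free-text BDI fields, and the stage-to-action preferences are all assumptions made for the example, and the scorer is a hand-written stand-in for a learned action-conditioned model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class AidaStage(Enum):
    ATTENTION = "attention"
    INTEREST = "interest"
    DESIRE = "desire"
    ACTION = "action"


class ServiceAction(Enum):
    GREET = "Greet"
    ELICIT = "Elicit"
    INFORM = "Inform"
    RECOMMEND = "Recommend"
    HOLD = "Hold"


@dataclass(frozen=True)
class CustomerState:
    """Hypothetical state record: one AIDA stage plus free-text BDI fields."""
    stage: AidaStage
    belief: str
    desire: str
    intention: str


# A scorer maps (state, candidate action) to a predicted outcome quality.
Scorer = Callable[[CustomerState, ServiceAction], float]


def select_action(state: CustomerState, score_outcome: Scorer) -> ServiceAction:
    """Score every candidate action and return the highest-scoring one."""
    scores = {action: score_outcome(state, action) for action in ServiceAction}
    return max(scores, key=scores.get)


# Illustrative preferences only; these pairings are assumptions, not results.
_TOY_PREFERENCE = {
    AidaStage.ATTENTION: ServiceAction.HOLD,
    AidaStage.INTEREST: ServiceAction.ELICIT,
    AidaStage.DESIRE: ServiceAction.RECOMMEND,
    AidaStage.ACTION: ServiceAction.INFORM,
}


def toy_scorer(state: CustomerState, action: ServiceAction) -> float:
    """Stand-in for a learned model: 1.0 for the preferred action, else 0.0."""
    return 1.0 if _TOY_PREFERENCE[state.stage] is action else 0.0


browsing = CustomerState(
    stage=AidaStage.ATTENTION,
    belief="the shelf may have what I need",
    desire="look around undisturbed",
    intention="keep browsing",
)
chosen = select_action(browsing, toy_scorer)
print(chosen.value)
```

Treating Hold as an ordinary member of the action set, rather than as a fallback, means that staying passive competes with the four interventions on the same score.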
When PIWM is conditioned on ground-truth customer states to isolate the action-selection step, it achieves a macro F1 score of 0.641 across 30 held-out target videos. This performance surpasses both a zero-shot Qwen2.5-VL-7B baseline and training variants lacking balanced action supervision. However, end-to-end selection from video input alone scores only 0.295, below the 0.414 of a 5-class balanced random baseline. This gap identifies video-to-state grounding as the primary bottleneck for deployment. Additionally, a preliminary staged pilot in a real store, in which paid participants enacted scripted customer behaviors, yielded an action macro F1 of 0.579 on 20 fully annotated videos. We also release 10 additional accessible videos accompanied by index-level labels.
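For readers unfamiliar with the metric, macro F1 averages the per-class F1 scores with equal weight, so rarely chosen actions count as much as frequent ones. The sketch below computes it over the five action labels. The function name `macro_f1` and the toy label lists are assumptions for illustration, not benchmark data, and assigning 0 to a class with no true or predicted instances is one common convention that may differ from the evaluation protocol used in this work.

```python
from typing import Sequence

ACTIONS = ["Greet", "Elicit", "Inform", "Recommend", "Hold"]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given labels."""
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no support and no predictions gets 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels for illustration only.
truth = ["Greet", "Hold", "Hold", "Inform"]
preds = ["Greet", "Hold", "Inform", "Inform"]
score = macro_f1(truth, preds, ACTIONS)
print(round(score, 3))
```

Because every class contributes equally, a model that ignores some actions entirely is penalized even if its overall accuracy looks reasonable.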
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





