Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion
Abstract
Current vision-language models (VLMs) struggle to maintain temporal consistency, plan in a context-aware way, and reason in a grounded manner over video content. To address these limitations, we present pause-and-think-T, a reasoning-focused training dataset that teaches models to pause, analyze the visual evidence, and then generate brief, actionable replies. By enforcing structured reasoning before an answer is generated, the dataset steers models toward assistance that is both human-like and anchored in the specific scene.
We fine-tuned a compact 4-billion-parameter model and evaluated it on pause-and-think-B, a benchmark targeting contextual comprehension and goal-oriented planning. Our model attained 58.0% accuracy, within 0.9 percentage points of Qwen3-VL-235B (58.9%), while using roughly 59 times fewer parameters. On scene understanding, it performed on par with GPT-5.2 and exceeded GPT-4o. The model also generalized out of distribution: on EgoThink and TempCompass, it achieved significant improvements in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without any training specific to those benchmarks. These findings suggest that focused reasoning supervision allows smaller models to provide effective, visually grounded guidance that generalizes well beyond the training data, without requiring massive model scaling.
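As a quick sanity check on the size and accuracy comparison above, the arithmetic can be reproduced in a few lines. This is only an illustrative sketch: the 4-billion figure and both accuracies come from the abstract, while the 235-billion figure is an assumption read off the model name Qwen3-VL-235B, and the variable names are hypothetical.

```python
# Parameter counts in billions.
# 4.0 is stated in the abstract; 235.0 is assumed from the name "Qwen3-VL-235B".
small_params_b = 4.0
large_params_b = 235.0

# Reported benchmark accuracies (percent) on pause-and-think-B.
accuracy_small = 58.0
accuracy_large = 58.9

# How many times smaller the fine-tuned model is.
ratio = large_params_b / small_params_b

# Accuracy gap in percentage points, rounded to one decimal place.
gap_points = round(accuracy_large - accuracy_small, 1)

print(f"Parameter ratio: {ratio:.2f}x (about {round(ratio)}x)")
print(f"Accuracy gap: {gap_points} percentage points")
```

Under these assumptions the ratio is 58.75, consistent with the "59 times fewer parameters" figure after rounding.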
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




