arXiv

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

June 2, 2026 · Shiyu Wang (East China Normal University, Shanghai, China), Ziyu Liu (East China Normal University, Shanghai, China), Chaoyi Yu (East China Normal University, Shanghai, China), Yujie Yin (East China Normal University, Shanghai, China), Zhongqian Mao (East

Title: InsightVQA: A High-Dimensional Benchmark for Emotion-Cognitive Visual Question Answering

Abstract:

True visual emotion comprehension requires models to move beyond simple state recognition: they must explain the causes of emotions and engage in advanced cognitive reasoning. Current benchmarks, however, predominantly target emotion identification and offer little support for grounded understanding or response-oriented analysis. To bridge this gap, we present InsightVQA, a large-scale dataset for hierarchical visual question answering focused on emotion understanding and cognitive reasoning.

Starting with 351,000 images sourced from six public datasets, we employed a rigorous multi-stage filtering process to select 138,000 high-confidence images. These images are annotated across three hierarchical tiers:

1. Perception QA: focused on recognizing emotions and valence.
2. Grounded Understanding QA: developed via constraint-guided generation based on extracted visual triggers.
3. Cognition QA: centered on predicting response intent and performing sequential insight reasoning.
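
As a rough illustration of the three-tier annotation structure, the following Python sketch models one QA record per tier and groups records by tier. The class names, field names, image identifier, and sample questions are all hypothetical; the abstract does not describe the dataset's actual file format or schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Tier names follow the abstract; the enum itself is an assumption.
    PERCEPTION = 1
    GROUNDED_UNDERSTANDING = 2
    COGNITION = 3


@dataclass(frozen=True)
class QARecord:
    image_id: str  # hypothetical identifier, not a real dataset key
    tier: Tier
    question: str
    answer: str


def group_by_tier(records):
    """Group QA records by their annotation tier."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.tier].append(record)
    return dict(grouped)


# Invented sample data, for illustration only.
records = [
    QARecord("img_0001", Tier.PERCEPTION,
             "Which emotion does this image evoke?", "contentment"),
    QARecord("img_0001", Tier.GROUNDED_UNDERSTANDING,
             "Which visual element triggers that emotion?", "a sunlit garden"),
    QARecord("img_0001", Tier.COGNITION,
             "How might a viewer respond to this scene?", "by relaxing"),
]

by_tier = group_by_tier(records)
for tier, items in by_tier.items():
    print(tier.name, len(items))
```

Keeping the tier as an explicit field, rather than splitting the tiers into separate files, is one possible design choice: it lets all questions about the same image be retrieved together.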

The resulting InsightVQA dataset comprises 725,000 QA pairs. Additionally, we introduce InsightVQA-Bench, a high-quality evaluation benchmark of 30,000 samples for fine-grained assessment. To support this evaluation, we developed InsightNet, a baseline Multimodal Large Language Model (MLLM) tuned for emotion tasks. Our results indicate that InsightVQA poses substantial challenges for models attempting grounded emotion understanding and reasoning.
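
The headline figures imply two simple ratios, computed in the short sketch below. The counts come from the abstract and appear to be rounded to the nearest thousand, so the results are approximate; the variable names are illustrative, and the per-image average assumes that all QA pairs are attached to the 138,000 filtered images.

```python
# Counts as quoted in the abstract (rounded figures).
SOURCE_IMAGES = 351_000
FILTERED_IMAGES = 138_000
QA_PAIRS = 725_000

# Share of source images that survive the multi-stage filtering.
retention_rate = FILTERED_IMAGES / SOURCE_IMAGES

# Average number of QA pairs per retained image (assumes every
# QA pair belongs to one of the filtered images).
qa_per_image = QA_PAIRS / FILTERED_IMAGES

print(f"retention rate: {retention_rate:.1%}")    # retention rate: 39.3%
print(f"QA pairs per image: {qa_per_image:.2f}")  # QA pairs per image: 5.25
```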

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC