TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation
arXiv:2606.02320v1 Announce Type: new
Abstract: While Deep Research Agents have demonstrated robust proficiency in multi-step information retrieval, complex reasoning, and long-form report generation, current benchmarks and systems remain largely text-centric. As a result, the factual reliability of visual components and their alignment with the accompanying textual analysis are insufficiently evaluated. To bridge this gap, we present TVIR (Text--Visual Interleaved Report Generation). It introduces TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks in which visual elements are integral to achieving specific analytical sub-goals. We also propose TVIR-Agent, a hierarchical multi-agent framework that establishes a strong baseline for generating report outlines, retrieving images, creating charts with traceable sources, and composing reports via context-aware sequential writing. In addition, we introduce a dual-path evaluation framework that integrates Textual and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent delivers superior overall performance, highlighting the critical need for explicit multimodal design and evaluation strategies in evidence-driven report generation.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





