arXiv

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

June 4, 2026 · Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Xiaoxi Li, Zhicheng Dou

Title: Advancing Verifiable Multimodal Deep Research: A Multi-Agent Framework for Interleaved Report Creation

Original: arXiv:2605.29861v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines. Our code is released at https://github.com/SnowNation101/Ptah

Rewrite: Large Language Models (LLMs) have moved autonomous agents beyond deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. Verifiable multimodal deep research nonetheless remains difficult, for two reasons: open-ended synthesis has no deterministic ground truth, and textual arguments must be interleaved with visual evidence. To address this, we introduce Ptah, a multi-agent harness for interleaved report generation. Ptah manages the full lifecycle, from user query to rendered web report, across three stages: planning, research, and writing. Within these stages, specialized agents construct visual-aware plans, collect evidence grounded in specific claims, maintain source-aligned images in a Visual Working Memory, and compose the report through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We also present PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces multimodal reports that are more reliable, more visually informative, and more usable for human readers than those of strong baselines. The code is available at https://github.com/SnowNation101/Ptah
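The abstract describes the control flow only at a high level, so the following is a minimal illustrative sketch of a plan, research, verify, write loop, not Ptah's actual implementation. Every name here (`Evidence`, `VisualWorkingMemory`, `verify`, `run_pipeline`, and the three stage callables) is hypothetical and does not come from the Ptah repository. The verifier is also simplified to a single gate before writing, whereas the abstract says verification applies throughout the workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    claim: str          # the claim this piece of evidence supports
    source_url: str     # citation for the claim (empty string means uncited)
    image_id: str = ""  # optional key into the visual working memory


@dataclass
class VisualWorkingMemory:
    # Hypothetical store: maps an image id to the source it was taken from,
    # so each image stays aligned with its citation.
    images: Dict[str, str] = field(default_factory=dict)

    def add(self, image_id: str, source_url: str) -> None:
        self.images[image_id] = source_url

    def source_of(self, image_id: str) -> str:
        return self.images[image_id]


# Hypothetical stage signatures; real agents would call an LLM and tools.
Planner = Callable[[str], List[str]]
Researcher = Callable[[List[str], VisualWorkingMemory], List[Evidence]]
Writer = Callable[[str, List[Evidence], VisualWorkingMemory], str]


def verify(evidence: List[Evidence], memory: VisualWorkingMemory) -> List[str]:
    """Toy acceptance function: return a list of problems (empty means accept)."""
    problems: List[str] = []
    for ev in evidence:
        if not ev.source_url:
            problems.append(f"uncited claim: {ev.claim}")
        if ev.image_id:
            if ev.image_id not in memory.images:
                problems.append(f"unknown image: {ev.image_id}")
            elif memory.source_of(ev.image_id) != ev.source_url:
                problems.append(f"image/source mismatch: {ev.image_id}")
    return problems


def run_pipeline(query: str, plan: Planner, research: Researcher,
                 write: Writer, max_rounds: int = 2) -> str:
    """Plan once, then research until the verifier accepts, then write."""
    memory = VisualWorkingMemory()
    claims = plan(query)
    problems = ["no research round was run"]
    for _ in range(max_rounds):
        evidence = research(claims, memory)
        problems = verify(evidence, memory)
        if not problems:
            return write(query, evidence, memory)
    raise ValueError(f"verification failed: {problems}")
```

In this sketch the writer is never called on evidence the verifier rejected, which is one simple way to read "acceptance function". The checks shown (citation present, image known, image source matches citation) are placeholders for the factual grounding, citation fidelity, and cross-modal consistency checks the abstract names.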

Source: arXiv · Generated at: 2026-06-04 00:00:00 UTC