MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
Title: MM-BizRAG: Reevaluating Multimodal Retrieval-Augmented Generation for Comprehensive Enterprise Question Answering
Abstract:
Recent work on multimodal retrieval-augmented generation (MM-RAG) has favored minimal parsing: page-level images are used both to compute retriever embeddings and to drive answer generation. This efficiency-driven approach does not explicitly handle the complex, structured data found in corporate documents; instead, it relies on vision-language models or pre-trained embeddings to infer structure implicitly. In contrast, MM-BizRAG explicitly extracts and encodes document structure. A structure-aware splitting mechanism dynamically routes documents into orientation-specific ingestion pipelines: layout-aware parsing for vertically oriented materials such as reports, and holistic page-level representations for horizontally formatted content such as slide decks.
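As a rough illustration of orientation-based routing, the sketch below groups consecutive pages by orientation and assigns each group to a pipeline. It is an assumption-laden sketch, not the authors' implementation: the `Page` type, the pipeline names, the per-page granularity, and the use of page aspect ratio as the orientation signal are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical page record; only the dimensions are used here.
    width: float
    height: float


def route_page(page: Page) -> str:
    """Pick a (hypothetical) ingestion pipeline name from page orientation."""
    # Assumption: portrait pages are report-like, landscape pages are slide-like.
    if page.height >= page.width:
        return "layout_aware_parsing"
    return "page_level_image"


def split_by_orientation(pages):
    """Group consecutive pages that share a route into (route, page_indices) segments."""
    segments = []
    for index, page in enumerate(pages):
        route = route_page(page)
        if segments and segments[-1][0] == route:
            # Same pipeline as the previous page: extend the current segment.
            segments[-1][1].append(index)
        else:
            # Orientation changed: start a new segment.
            segments.append((route, [index]))
    return segments
```

For a document with two portrait pages followed by one landscape page, `split_by_orientation` would return one segment for each pipeline, so a mixed report-plus-slides file could be ingested by both paths.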
To preserve natural reading order, a unified LLM-driven artifact transformation pipeline uses placeholder-based positional alignment. At inference time, multimodal assembly decouples the retrieval representation from the generation context, which yields more grounded and detailed responses without any fine-tuning. On a diverse, large-scale enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms leading vision-centric baselines by up to 32 percentage points, with the largest gains on report-style layouts. We also present FastRAGEval, a new single-call LLM-judge metric for fine-grained generative recall that reduces the computational cost of RAGChecker by 50% while aligning more closely with human judgment.
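One way to picture placeholder-based positional alignment is sketched below: non-text artifacts are swapped for numbered placeholders in reading order, transformed separately, and then substituted back at the same positions. The placeholder syntax `[[ARTIFACT_n]]`, the `(kind, payload)` block format, and both function names are assumptions made for illustration; the abstract does not specify them.

```python
import re

# Hypothetical placeholder syntax; the source does not specify one.
PLACEHOLDER = re.compile(r"\[\[ARTIFACT_(\d+)\]\]")


def insert_placeholders(blocks):
    """Replace non-text blocks with numbered placeholders, preserving reading order.

    `blocks` is a list of (kind, payload) tuples in reading order.
    Returns the placeholder-bearing text and the extracted artifacts.
    """
    text_parts, artifacts = [], []
    for kind, payload in blocks:
        if kind == "text":
            text_parts.append(payload)
        else:
            # The placeholder index records where the artifact belongs.
            text_parts.append(f"[[ARTIFACT_{len(artifacts)}]]")
            artifacts.append((kind, payload))
    return "\n".join(text_parts), artifacts


def restore(text, transformed):
    """Substitute each placeholder with its transformed artifact (e.g. LLM output)."""
    return PLACEHOLDER.sub(lambda m: transformed[int(m.group(1))], text)
```

In this sketch the transformation step itself (for example, an LLM turning a table image into text) happens between the two calls and is left out; only the positional bookkeeping is shown.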
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




