arXiv

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

Abstract:

Recent work on multimodal retrieval-augmented generation (MM-RAG) has favored minimal parsing: page-level images are used both to generate retriever embeddings and to drive answer production. This efficiency-driven approach often fails to handle explicitly the complex, structured data found in corporate documents; instead, it relies on vision-language models or pre-trained embeddings to infer structure implicitly. In contrast, MM-BizRAG explicitly extracts and encodes document structure. A structure-aware splitting mechanism dynamically routes each document into an orientation-specific ingestion pipeline: layout-aware parsing for vertically oriented materials such as reports, and holistic page-level representations for horizontally formatted content such as slide decks.
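The orientation-based routing described above can be illustrated with a minimal sketch. The abstract does not say how orientation is detected, so the aspect-ratio rule, the majority vote, and the names `Page`, `route_page`, and `route_document` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record holding only its dimensions (any unit)."""
    width: float
    height: float


def route_page(page: Page) -> str:
    """Pick an ingestion pipeline name for one page (assumed rule)."""
    # Assumption: portrait pages (taller than wide) are report-style
    # and go to layout-aware parsing.
    if page.height > page.width:
        return "layout_aware_parsing"
    # Assumption: landscape or square pages are slide-style and are
    # kept as a single page-level image.
    return "page_level_image"


def route_document(pages: list[Page]) -> str:
    """Route a whole document by majority vote over its pages."""
    if not pages:
        raise ValueError("document has no pages")
    votes = Counter(route_page(p) for p in pages)
    # most_common(1) returns the pipeline chosen by the most pages.
    return votes.most_common(1)[0][0]


report = [Page(612, 792), Page(612, 792), Page(792, 612)]
slides = [Page(1280, 720), Page(1280, 720)]
print(route_document(report))
print(route_document(slides))
```

A real system would likely need more than page dimensions (for example, rotated scans or mixed-orientation documents), which this sketch does not attempt to cover.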

To preserve natural reading order, we implement a unified LLM-driven artifact-transformation pipeline that uses placeholder-based positional alignment. In addition, our inference-time multimodal assembly decouples retrieval representations from the generation context, enabling more grounded and detailed responses without any fine-tuning. Evaluations on a diverse, large-scale enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V) show that MM-BizRAG consistently outperforms leading vision-centric baselines by up to 32 percentage points, with the largest gains on report-style layouts. We also present FastRAGEval, a novel single-call LLM-judge metric for fine-grained generative recall that reduces computational cost by 50% relative to RAGChecker while aligning more closely with human judgment.
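As a rough illustration of placeholder-based positional alignment, the sketch below assumes that each extracted artifact (table, chart, figure) leaves a marker in the page text and that its transformed form is later substituted back at the same position, so the reading order is unchanged. The marker syntax `[[ARTIFACT:id]]` and the function `restore_artifacts` are hypothetical; the abstract does not specify either.

```python
import re

# Hypothetical marker format left behind where an artifact was extracted.
PLACEHOLDER = re.compile(r"\[\[ARTIFACT:(\w+)\]\]")


def restore_artifacts(text: str, transformed: dict[str, str]) -> str:
    """Substitute each placeholder with its transformed artifact text.

    Placeholders with no matching entry are left untouched, so a
    missing transformation stays visible instead of silently vanishing.
    """
    def substitute(match: re.Match) -> str:
        artifact_id = match.group(1)
        return transformed.get(artifact_id, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


page_text = "Revenue grew.\n[[ARTIFACT:t1]]\nCosts fell.\n[[ARTIFACT:f2]]"
# Assumed output of an upstream LLM step that describes each artifact.
transformed = {"t1": "Table: Q1 revenue 10, Q2 revenue 12"}
print(restore_artifacts(page_text, transformed))
```

Because substitution happens in place, the surrounding sentences keep their original order regardless of the order in which artifacts were processed.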


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
