Global News Digest

arXiv

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Abstract:

Frontier deep research agents (DRAs) are increasingly integrated into enterprise workflows, capable of planning research initiatives, synthesizing information across multiple documents, and generating structured deliverables on request. However, their deployment is outpacing their evaluation. Current benchmarks primarily assess factual recall, single-hop question answering, or general agentic capabilities, failing to capture the multi-document, decision-grade analysis that DRAs are designed to produce. To address this gap, we introduce a benchmark focused on the structured analytical outputs that constitute a management consultant’s typical weekly workload.

We evaluated three leading agents (Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research) on 42 prompts authored by subject matter experts (SMEs). Each of the resulting 126 responses was assessed with a two-tier scoring system: deterministic ground-truth verifiers (averaging 13.8 checks per task) and a five-criterion SME rubric scored on a 0-3 scale. The two components were combined into a Verifier-Rubric Score (VRS) ranging from 0 to 100. Most prompts also included cognitive traps designed to penalize superficial pattern matching.

Acceptance under our joint threshold (a rubric mean of at least 2.5 and a verifier rate of at least 80%) was low for every model: Gemini achieved 21.4%, while o3 and Claude each reached only 9.5%. The mean VRS results align with existing rubric-based benchmarks, with our top score at 62.6 compared to APEX-v1’s 64.2, ProfBench’s 65.9, and ResearchRubrics’ figures below 68%, thereby validating the rubric construct.
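
The joint acceptance rule described above is conjunctive: a response must clear both thresholds at once. The following is a minimal Python sketch of that rule, not the authors' implementation. The function names `is_accepted` and `acceptance_rate`, and the representation of a response as a list of rubric scores plus a list of pass/fail verifier results, are assumptions made for illustration. Only the two thresholds (2.5 and 80%) come from the abstract.

```python
from statistics import mean

# Thresholds as stated in the abstract.
RUBRIC_MEAN_MIN = 2.5      # mean of the five rubric criteria, each scored 0-3
VERIFIER_RATE_MIN = 0.80   # fraction of deterministic verifier checks passed


def is_accepted(rubric_scores, verifier_passes):
    """Return True only if BOTH thresholds hold (conjunctive grading).

    rubric_scores: list of ints in 0..3 (hypothetical representation).
    verifier_passes: list of bools, one per verifier check (hypothetical).
    """
    if not rubric_scores or not verifier_passes:
        return False
    rubric_mean = mean(rubric_scores)
    verifier_rate = sum(verifier_passes) / len(verifier_passes)
    return rubric_mean >= RUBRIC_MEAN_MIN and verifier_rate >= VERIFIER_RATE_MIN


def acceptance_rate(responses):
    """Percentage of (rubric_scores, verifier_passes) pairs that are accepted."""
    accepted = sum(is_accepted(r, v) for r, v in responses)
    return 100.0 * accepted / len(responses)


# Illustrative, made-up responses (not data from the paper).
strong_both = ([3, 3, 2, 3, 2], [True] * 12 + [False] * 2)    # mean 2.6, rate 12/14
weak_verifier = ([3, 3, 3, 3, 3], [True] * 10 + [False] * 4)  # mean 3.0, rate 10/14
weak_rubric = ([3, 2, 2, 2, 3], [True] * 14)                  # mean 2.4, rate 14/14

print(is_accepted(*strong_both))    # True
print(is_accepted(*weak_verifier))  # False: verifier rate below 80%
print(is_accepted(*weak_rubric))    # False: rubric mean below 2.5
```

Because either threshold alone can reject a response, a perfect rubric score does not rescue a deliverable that fails too many verifier checks, and vice versa. As a consistency check rather than a figure stated in the source, 9 accepted responses out of 42 prompts gives 21.4% and 4 out of 42 gives 9.5%, which matches the reported rates.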

Acceptance rates fall below the Pass@1 band of 12.3-22.7% observed in APEX-Agents’ MC-segment for dedicated DR agents. Our lower floor, despite the advantages of our testing harness, stems from stricter conjunctive grading and the inclusion of cognitive traps. Each agent exhibited distinct failure modes: Claude generated the most reliable deliverables (exhibiting a file-required task success rate 4.5 times higher than its competitors) but showed the highest propensity for fabrication. The o3 model demonstrated the cleanest average reasoning but frequently omitted required sections and propagated arithmetic errors. Gemini displayed bimodal performance, achieving the highest acceptance rate alongside the highest number of zero-scored rubric cells.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
