Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Abstract:
Frontier deep research agents (DRAs) are increasingly integrated into enterprise workflows, capable of planning research initiatives, synthesizing information across multiple documents, and generating structured deliverables on request. However, their deployment is outpacing their evaluation. Current benchmarks primarily assess factual recall, single-hop question answering, or general agentic capabilities, failing to capture the multi-document, decision-grade analysis that DRAs are designed to produce. To address this gap, we introduce a benchmark focused on the structured analytical outputs that constitute a management consultant’s typical weekly workload.
We evaluated three leading agents—Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research—against 42 prompts authored by subject matter experts (SMEs). Each of the resulting 126 responses was assessed using a two-tiered scoring system: deterministic ground-truth verifiers (averaging 13.8 checks per task) and a five-criterion SME rubric scored on a 0-3 scale. These components were combined to create a Verifier-Rubric Score (VRS) ranging from 0 to 100. Notably, most prompts included cognitive traps designed to penalize superficial pattern matching.
Acceptance under our joint threshold (a rubric mean of at least 2.5 and a verifier rate of at least 80%) was low for every agent: Gemini reached 21.4%, while o3 and Claude each reached only 9.5%. Mean VRS is in line with existing rubric-based benchmarks: our top score is 62.6, compared with 64.2 on APEX-v1, 65.9 on ProfBench, and below 68% on ResearchRubrics, which supports the validity of the rubric construct.
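The joint acceptance rule can be illustrated with a minimal Python sketch. It is not the benchmark's actual harness: the `Response` class, its field names, and the helper functions are hypothetical, and only the two thresholds (rubric mean of at least 2.5, verifier rate of at least 80%) come from the abstract. The sketch shows why conjunctive grading is strict: a response must clear both bars at once.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence

# Thresholds stated in the abstract.
RUBRIC_MEAN_MIN = 2.5     # mean over the SME rubric criteria, each scored 0-3
VERIFIER_RATE_MIN = 0.80  # fraction of deterministic verifier checks passed


@dataclass
class Response:
    """Hypothetical container for one graded response (names are assumptions)."""
    rubric_scores: Sequence[int]  # five criteria, each 0-3
    verifier_passed: int          # number of verifier checks passed
    verifier_total: int           # number of verifier checks for the task


def is_accepted(r: Response) -> bool:
    """Conjunctive rule: both the rubric bar and the verifier bar must be met."""
    rubric_mean = mean(r.rubric_scores)
    verifier_rate = r.verifier_passed / r.verifier_total
    return rubric_mean >= RUBRIC_MEAN_MIN and verifier_rate >= VERIFIER_RATE_MIN


def acceptance_rate(responses: Sequence[Response]) -> float:
    """Percentage of responses that satisfy the joint threshold."""
    return 100.0 * sum(is_accepted(r) for r in responses) / len(responses)


# Illustrative, made-up responses (not benchmark data).
strong = Response([3, 3, 2, 3, 2], verifier_passed=12, verifier_total=14)
weak_verifier = Response([3, 3, 3, 3, 3], verifier_passed=10, verifier_total=14)
weak_rubric = Response([2, 2, 3, 2, 3], verifier_passed=14, verifier_total=14)
```

In this sketch only `strong` is accepted: `weak_verifier` has a perfect rubric but passes about 71% of its checks, and `weak_rubric` passes every check but has a rubric mean of 2.4. With 42 prompts per agent, the reported rates are arithmetically consistent with 9 accepted responses (9/42 ≈ 21.4%) and 4 accepted responses (4/42 ≈ 9.5%), although the abstract does not state these counts.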
At the low end, our acceptance rates fall below the Pass@1 band of 12.3-22.7% observed in the MC segment of APEX-Agents for dedicated DR agents. We attribute this lower floor, which holds despite the advantages of our testing harness, to stricter conjunctive grading and the inclusion of cognitive traps. Each agent exhibited a distinct failure mode. Claude generated the most reliable deliverables (its success rate on file-required tasks was 4.5 times higher than its competitors') but showed the highest propensity for fabrication. The o3 model demonstrated the cleanest reasoning on average but frequently omitted required sections and propagated arithmetic errors. Gemini was bimodal, achieving the highest acceptance rate but also the largest number of zero-scored rubric cells.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




