BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning
Title: BADGER: A Unified Framework for Evaluating Agentic and Deterministic Aspects of Enterprise Generative Reasoning
Abstract
Evaluating enterprise AI systems that translate natural language into SQL and manage multi-step agentic workflows requires methodologies distinct from traditional academic benchmarks. Foundational benchmarks such as Spider and BIRD established execution accuracy as a standard, tools such as G-Eval and RAGAS advanced LLM-based assessment, and recent initiatives including Spider 2.0, BEAVER, and BIRD-Interact have begun to tackle the complexities of enterprise and agentic contexts. However, no existing framework unifies text-to-SQL evaluation with agentic behavior assessment in a production-ready pipeline calibrated against human expert standards.
To address this gap, we introduce BADGER, a comprehensive evaluation framework developed at Merkle. BADGER unifies the assessment of text-to-SQL capabilities with agentic behavior analysis. The framework delivers three primary contributions:
- LLM-Assisted SQL Component Extraction: This feature extends the Spider methodology to handle SQL queries that use Common Table Expressions (CTEs) and dialect-specific syntax, improving extraction precision in complex enterprise environments.
- Hybrid Execution Accuracy Metric (Hybrid-EX): Traditional execution accuracy is brittle under column aliasing and numeric tolerance. Hybrid-EX addresses this by using an LLM to infer structural alignments before deterministic cell-level scoring. In validation on 150 human-annotated industry queries, it achieved a Cohen’s kappa of 0.717 [95% CI: 0.600-0.822], indicating substantial agreement, and a balanced accuracy of 87.3%. It significantly outperforms six competing frameworks, with Delta-kappa values ranging from 0.322 to 0.502 (all p ≤ 0.001).
- Enterprise Agentic Evaluation Suite: This component consolidates metrics from RAGAS, G-Eval, and various agent benchmarks into a single cohesive pipeline. The only novel addition to this suite is the "Excess Tool Usage" metric.
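To make the Hybrid-EX idea concrete, the following is a minimal sketch of the deterministic cell-level scoring step only. It is not the authors' implementation. It assumes the LLM alignment step has already produced a mapping from gold column names to predicted column names (the `column_map` argument), and the function names `cell_match` and `hybrid_ex`, the row-as-dict representation, and the default tolerances are all hypothetical choices made for illustration.

```python
from math import isclose


def cell_match(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two cells: numbers within a tolerance, everything else exactly."""
    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if is_number(a) and is_number(b):
        return isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def hybrid_ex(gold_rows, pred_rows, column_map, rel_tol=1e-6):
    """Return True if the predicted result matches the gold result.

    gold_rows, pred_rows: lists of dicts (one dict per result row).
    column_map: gold column name -> predicted column name. In this sketch
    it is ASSUMED to come from an upstream LLM alignment step.
    """
    if len(gold_rows) != len(pred_rows):
        return False

    gold_cols = list(column_map)
    # Project both results onto the aligned columns, in the same order.
    gold = [tuple(row[c] for c in gold_cols) for row in gold_rows]
    pred = [tuple(row[column_map[c]] for c in gold_cols) for row in pred_rows]

    # Order-insensitive comparison: greedily pair each gold row with an
    # unused predicted row whose cells all match within tolerance.
    unused = list(pred)
    for g in gold:
        for i, p in enumerate(unused):
            if all(cell_match(x, y, rel_tol) for x, y in zip(g, p)):
                del unused[i]
                break
        else:
            return False  # no predicted row matches this gold row
    return True


# Illustrative data only: the aliases differ and one value carries float noise.
gold = [{"region": "EMEA", "revenue": 100.0}, {"region": "APAC", "revenue": 250.5}]
pred = [{"r": "APAC", "total_rev": 250.5000001}, {"r": "EMEA", "total_rev": 100}]
mapping = {"region": "r", "revenue": "total_rev"}
matched = hybrid_ex(gold, pred, mapping)
```

Strict execution accuracy would reject this prediction because the column names differ and `250.5000001 != 250.5`; with the alignment applied and a tolerance in place, the two results compare equal. The greedy row pairing is a simplification, since tolerance-based equality is not transitive, and a production metric would likely need a more careful matching strategy.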
Designed to operate entirely within a client’s governed data environment, BADGER supports configurable LLM judge backends. It facilitates the rapid prototyping of custom judges and metrics, functioning as a continuous evaluation backbone for ongoing quality assurance rather than serving merely as a static quality gate.
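The agreement statistics reported for Hybrid-EX, Cohen's kappa and balanced accuracy, can be computed from paired judgments as sketched below. This is a plain-Python illustration of the standard definitions, not the authors' evaluation code. The function names and the eight example labels are hypothetical and are not data from the paper.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative labels only: 1 = "query judged correct", 0 = "judged incorrect".
human = [1, 1, 1, 0, 0, 0, 1, 0]
metric = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(human, metric)
bal_acc = balanced_accuracy(human, metric)
```

In this toy example the two raters agree on 6 of 8 items while chance agreement is 0.5, so kappa is 0.5 and balanced accuracy is 0.75. Kappa is lower than raw agreement because it discounts the agreement expected by chance.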
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




