DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering
Original: arXiv:2606.01434v1 Announce Type: new
Abstract: In the high-stakes domain of drug-information question answering, the provenance of a cited fact matters as much as the fact itself, because hallucinations can severely mislead clinical decision-making. To address this, we introduce DrugClaw, a multi-agent retrieval-augmented system. It uses a reflection-driven state-machine workflow to query a registry of drug and pharmacovigilance capabilities and delivers answers strictly grounded in primary regulatory documents or peer-reviewed records. We also present DrugAudit, an authority-aware benchmark of 3,772 items. Its evaluation panel assesses upstream-of-gold source matching, token-level semantic snippet overlap, and citation faithfulness. The evaluation uses a dual-judge LLM-as-judge protocol with an inter-judge kappa coefficient of 0.88, indicating almost-perfect agreement.
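The inter-judge agreement figure can be illustrated with a small sketch. The abstract does not say which kappa variant was used; the code below assumes Cohen's kappa for two raters, and the function name `cohen_kappa` and the toy verdict lists are hypothetical, not data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the paper's "kappa coefficient" is Cohen's kappa;
    the abstract does not confirm the variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy verdicts (1 = correct, 0 = incorrect); illustrative only.
judge_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(judge_a, judge_b)
```

Kappa corrects raw agreement for the agreement two judges would reach by chance, which is why it is usually reported instead of a plain match rate.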
In comparisons on DrugAudit and on drug-specific subsets of MedQA (751 items) and PubMedQA (512 items), DrugClaw ranked first on every metric in the primary results table: the composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, 10.1 percentage points above the next-best model), and faithfulness (0.887, a gain of 5.9 percentage points). DrugClaw also scored 0.920 on MedQA and 0.693 on PubMedQA.
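The percentage-point gains can be turned back into approximate runner-up scores. The sketch below assumes that each gain is an absolute difference on a 0 to 1 scale and that the faithfulness gain is also measured against the next-best model; the abstract reports neither runner-up value, so the results are implied estimates subject to rounding. The helper name `implied_runner_up` is hypothetical.

```python
def implied_runner_up(score, gain_pp):
    """Back out a runner-up score from a reported score and a gain.

    Assumption: gain_pp is an absolute gain in percentage points on a
    metric reported on a 0-1 scale.
    """
    return round(score - gain_pp / 100.0, 3)


# Reported values from the abstract; the outputs are implied, not reported.
primary_source_runner_up = implied_runner_up(0.918, 10.1)
faithfulness_runner_up = implied_runner_up(0.887, 5.9)
```

Under these assumptions the next-best primary-source rate is about 0.817 and the next-best faithfulness is about 0.828.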
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





