arXiv

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Original: arXiv:2606.01434v1 Announce Type: new

Abstract: In the high-stakes domain of drug-information question answering, the provenance of a cited fact matters as much as the fact itself, because hallucinated content can seriously mislead clinical decision-making. To address this, we introduce DrugClaw, a multi-agent retrieval-augmented system. DrugClaw uses a reflection-driven state-machine workflow to query a registry of drug and pharmacovigilance capabilities, and it grounds its answers strictly in primary regulatory documents or peer-reviewed records. We also present DrugAudit, an authority-aware benchmark of 3,772 items. Its evaluation panel assesses upstream-of-gold source matching, token-level semantic snippet overlap, and citation faithfulness. Scoring uses a dual-judge LLM-as-judge protocol; the two judges reach an inter-judge kappa of 0.88, which indicates almost-perfect agreement.
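The abstract reports an inter-judge kappa of 0.88 but does not say which kappa statistic was used. For two raters labelling the same items, Cohen's kappa is the standard choice, so the minimal sketch below assumes it, together with binary correct/incorrect verdicts. The function name `cohens_kappa` and the judge labels are hypothetical and purely illustrative. Kappa discounts the agreement two judges would reach by chance given how often each one uses each label; on the widely used Landis and Koch scale, values from 0.81 to 1.00 count as "almost perfect", the band the reported 0.88 falls in.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two raters labelling the same items (assumed statistic)."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty list of items")
    n = len(judge_a)
    # Observed agreement: fraction of items on which both judges give the same label.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of each judge's marginal frequency for every label.
    freq_a = Counter(judge_a)
    freq_b = Counter(judge_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both judges used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two LLM judges on ten answers (illustrative data only).
judge_a = ["correct", "correct", "incorrect", "correct", "incorrect",
           "correct", "correct", "incorrect", "correct", "correct"]
judge_b = ["correct", "correct", "incorrect", "correct", "incorrect",
           "correct", "correct", "correct", "correct", "correct"]
print(round(cohens_kappa(judge_a, judge_b), 3))  # → 0.737
```

Here the judges disagree on one answer in ten, so raw agreement is 0.90, but chance agreement implied by the label frequencies is already 0.62, which pulls kappa down to about 0.74. This is why a kappa of 0.88 is a stronger statement than a raw agreement rate of the same size.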

Evaluated on DrugAudit and on drug-specific subsets of MedQA (751 items) and PubMedQA (512 items), DrugClaw ranked first on every metric in the primary results table. It led on the composite Evidence Index under both judges, on judge-mediated answer correctness, on primary-source rate (0.918, 10.1 percentage points above the next-best model), and on faithfulness (0.887, a 5.9-percentage-point gain). It also scored 0.920 on the MedQA subset and 0.693 on the PubMedQA subset.
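The gains quoted above are absolute differences in percentage points, not relative improvements. Working backwards from the reported figures gives the implied next-best score on each metric and the corresponding relative gain. This is simple arithmetic on the rounded values in the abstract, not numbers taken from the paper's table, and it assumes the faithfulness gain, like the primary-source gain, is measured against the next-best model:

```latex
\begin{aligned}
\text{primary-source rate: } & 0.918 - 0.101 = 0.817, \quad \tfrac{0.101}{0.817} \approx 12.4\% \text{ relative gain} \\
\text{faithfulness: } & 0.887 - 0.059 = 0.828, \quad \tfrac{0.059}{0.828} \approx 7.1\% \text{ relative gain}
\end{aligned}
```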


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC