arXiv

Truth, Trust, and Trouble: Medical AI on the Edge

June 2, 2026 · Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

Title: Navigating the Precipice of Medical AI: The Interplay of Truth, Trust, and Risk

Integrating Large Language Models (LLMs) into digital health holds considerable promise, particularly for automating medical question answering. Ensuring that these models meet strict standards of factual accuracy, usefulness, and safety remains difficult, especially in open-source settings. To address this, we introduce a benchmarking framework built on a dataset of more than 1,000 health-related questions. The study evaluates models along three dimensions: honesty, helpfulness, and harmlessness.

Our analysis reveals trade-offs between factual reliability and safety among the three tested models: Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B performed best, with the highest accuracy (91.7%) and a harmlessness score of 0.92. BioMistral-7B-DARE, despite its smaller scale, reached a safety score of 0.90 through domain-specific tuning. Few-shot prompting raised accuracy from 78% to 85%. However, helpfulness declined for all models on complex queries, which underscores the persistent difficulty of clinical question answering.
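
To make the few-shot result concrete, here is a minimal Python sketch of how a few-shot prompt might be assembled and how an accuracy gain such as 78% to 85% can be read. The prompt template, the exact-match scoring rule, and the function names (`build_few_shot_prompt`, `accuracy`, `gain`) are assumptions for illustration; the summary does not describe the prompt format or scoring procedure actually used.

```python
def build_few_shot_prompt(examples, question):
    """Prepend worked question/answer pairs to a new question.

    Assumed generic template; not the format used in the study.
    """
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references.

    Exact match is an assumed scoring rule for this sketch.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative


# Placeholder strings stand in for real question/answer pairs.
prompt = build_few_shot_prompt([("Q1", "A1"), ("Q2", "A2")], "Q3")

# The reported few-shot figures: 78% -> 85% accuracy.
absolute_gain, relative_gain = gain(78, 85)
```

Under these assumptions, the reported change is a gain of 7 percentage points, or roughly 9% relative to the 78% baseline.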

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC