Truth, Trust, and Trouble: Medical AI on the Edge
Title: Navigating the Precipice of Medical AI: The Interplay of Truth, Trust, and Risk
The integration of Large Language Models (LLMs) into digital health offers transformative potential, particularly through the automation of medical question-and-answer systems. Nevertheless, guaranteeing that these technologies adhere to stringent industry benchmarks regarding factual precision, utility, and safety, especially within open-source frameworks, presents a formidable hurdle. To address this, we introduce a comprehensive benchmarking framework utilizing a dataset comprising more than 1,000 health-related inquiries. This study evaluates model capabilities across three critical dimensions: honesty, helpfulness, and harmlessness.
Our analysis reveals significant trade-offs between factual reliability and safety among the tested models: Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. Specifically, AlpaCare-13B demonstrated superior performance, securing the highest accuracy rate of 91.7% alongside a harmlessness score of 0.92. Meanwhile, BioMistral-7B-DARE, despite its smaller architectural scale, leveraged domain-specific tuning to enhance its safety profile to 0.90. Furthermore, the implementation of few-shot prompting was shown to elevate accuracy from 78% to 85%. However, a consistent decline in helpfulness across all models when addressing complex queries underscores the persistent difficulties inherent in clinical question answering.
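As a rough illustration of the few-shot setup mentioned above, the following minimal sketch prepends worked question/answer pairs to a target question and scores predictions by exact match. The abstract does not specify the prompt template or how accuracy was measured, so the `Example` type, the `build_few_shot_prompt` and `accuracy` helpers, the "Question:/Answer:" layout, and the case-insensitive exact-match rule are all assumptions made for illustration, not the study's actual method.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A hypothetical worked example used as one 'shot' in the prompt."""
    question: str
    answer: str


def build_few_shot_prompt(examples, question):
    """Prepend worked question/answer pairs to the target question.

    The "Question:/Answer:" template is an assumed layout, not one taken
    from the study.
    """
    shots = [f"Question: {ex.question}\nAnswer: {ex.answer}" for ex in examples]
    # The final block leaves the answer empty for the model to complete.
    shots.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(shots)


def accuracy(predictions, references):
    """Fraction of predictions matching the reference label.

    Matching is case-insensitive and ignores surrounding whitespace; this
    scoring rule is an assumption.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

With zero examples, `build_few_shot_prompt` reduces to a plain zero-shot prompt, so the same helper could serve both conditions when comparing accuracy with and without few-shot examples.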
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




