arXiv

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Abstract:

Clinical decision-support systems function as expert tools whose recommendations clinicians follow directly, yet their approval often hinges on a single aggregate accuracy metric computed on a held-out test set. Such a metric does not capture input reliability under encoding shifts, disparities across subgroups, sensitivity to thresholds, or operational feasibility. To address these gaps, we introduce RISED, a pre-deployment evaluation framework that operationalizes five dimensions: Reliability, Inclusivity, Sensitivity, Equity, and Deployability. The framework uses BCa bootstrap 95% confidence intervals, literature-derived thresholds, and Holm-Bonferroni-corrected verdicts of PASS, FAIL, or INCONCLUSIVE. Equity serves as a diagnostic for proxy dependence rather than as a gating criterion.
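The verdict logic described above can be illustrated with a minimal sketch. It is not the RISED package's API: the function names, the lower-is-better convention, and the example threshold of 0.10 are assumptions made for illustration, and the abstract does not state how the confidence intervals and the Holm-Bonferroni correction are combined. The interval itself could come from any BCa routine, for instance scipy.stats.bootstrap with method="BCa".

```python
def ci_verdict(ci_low, ci_high, threshold):
    """Three-way verdict for a lower-is-better metric (assumed convention).

    PASS if the whole interval lies below the threshold, FAIL if the whole
    interval lies above it, INCONCLUSIVE if the interval straddles it.
    """
    if ci_high < threshold:
        return "PASS"
    if ci_low > threshold:
        return "FAIL"
    return "INCONCLUSIVE"


def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans, aligned with p_values, that is True where
    the null hypothesis is rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: every larger p-value is also retained
    return reject


# Hypothetical intervals checked against a hypothetical threshold of 0.10.
verdicts = [
    ci_verdict(0.01, 0.05, 0.10),
    ci_verdict(0.20, 0.30, 0.10),
    ci_verdict(0.05, 0.15, 0.10),
]
rejections = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Under these assumptions, INCONCLUSIVE is reported only when the interval straddles the threshold, so a wide interval from a small cohort cannot be read as a PASS.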

We applied RISED to seven cohorts spanning 35 years, with sample sizes ranging from 303 to 99,492. The framework revealed critical failures that AUROC alone did not expose. In the Diabetes 130 cohort, Reliability passed by a margin of three orders of magnitude (PSS = 0.0004), whereas Inclusivity (AUC parity gap of 0.262) and Sensitivity (maximum threshold-flip rate of 49.1%) failed decisively. These results were replicated in both NHIS cohorts. In contrast, NHANES 2021-2023, which has a complete feature profile, yielded INCONCLUSIVE verdicts. BRFSS 2024 showed the most severe Sensitivity failure in the suite, a maximum threshold-flip rate of 64.2%, after hypertension and cholesterol data were removed due to instrument rotation.
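The abstract does not define the AUC parity gap or the threshold-flip rate, so the sketch below uses plausible working definitions and should be read as an assumption, not as the paper's method. The parity gap is taken as the largest minus the smallest subgroup AUROC. The flip rate is taken as the share of cases whose binary decision changes when the threshold moves from threshold - delta to threshold + delta. All function names and the toy data are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_parity_gap(labels, scores, groups):
    """Assumed definition: largest minus smallest subgroup AUROC."""
    by_group = {}
    for y, s, g in zip(labels, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    aucs = [auroc(ys, ss) for ys, ss in by_group.values()]
    return max(aucs) - min(aucs)


def threshold_flip_rate(scores, threshold, delta):
    """Assumed definition: fraction of cases whose decision (score >= t)
    differs between t = threshold - delta and t = threshold + delta."""
    low, high = threshold - delta, threshold + delta
    flipped = sum(1 for s in scores if low <= s < high)
    return flipped / len(scores)


# Toy data: group "a" is ranked perfectly, group "b" is not.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.7, 0.3]
groups = ["a"] * 4 + ["b"] * 4

gap = auc_parity_gap(labels, scores, groups)
flip = threshold_flip_rate(scores, threshold=0.5, delta=0.15)
```

A "maximum" flip rate, as reported in the abstract, would then be the largest value of such a rate over a set of candidate thresholds or perturbation sizes; which set the authors use is not stated here.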

This pattern of failure was also observed in credit- and income-prediction cohorts, underscoring the framework’s domain-agnostic nature. Furthermore, multi-model checks confirmed that these failures stem from data characteristics rather than from specific model architectures. RISED is available as an open-source Python package designed to complement existing standards and tools such as TRIPOD+AI, FUTURE-AI, and Fairlearn, providing the structured numerical evidence that these guidelines call for but do not explicitly mandate.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
