arXiv

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

June 2, 2026 · Rohith Reddy Bellibatlu, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal, Abhishek Israni

Abstract:

Clinical decision-support systems act as expert tools whose recommendations clinicians follow directly, yet their approval often hinges solely on a single aggregate accuracy metric computed on a held-out test set. Such a metric does not capture input reliability under encoding shifts, performance disparities across subgroups, sensitivity to decision thresholds, or operational feasibility. To address these gaps, we introduce RISED, a pre-deployment evaluation framework that operationalizes five dimensions: Reliability, Inclusivity, Sensitivity, Equity, and Deployability. The framework uses BCa (bias-corrected and accelerated) bootstrap 95% confidence intervals, literature-derived thresholds, and Holm-Bonferroni-corrected verdicts of PASS, FAIL, or INCONCLUSIVE. Notably, Equity serves as a diagnostic for proxy dependence rather than as a gating criterion.
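As a rough illustration of how Holm-Bonferroni-corrected three-way verdicts could be produced, the sketch below applies the standard Holm step-down procedure to one p-value per dimension and reports a decisive verdict only when the corrected test rejects. This is not the RISED package's API: the function names, the input layout, the "lower is better" convention, and the rule mapping rejections to PASS or FAIL are all assumptions made for this example. The abstract does not specify how the BCa intervals feed into the verdicts; in practice such intervals could be obtained with, for example, `scipy.stats.bootstrap(..., method="BCa")`.

```python
from typing import Dict, Tuple


def holm_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Holm step-down procedure: return {name: True if the null is rejected}."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank, 0-based) is compared to alpha / (m - k).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return rejected


def corrected_verdicts(
    results: Dict[str, Tuple[float, float, float]], alpha: float = 0.05
) -> Dict[str, str]:
    """Map {name: (estimate, threshold, p_value)} to PASS / FAIL / INCONCLUSIVE.

    Assumptions (hypothetical, not taken from the paper):
      * each p-value tests H0: metric == threshold;
      * lower estimates are better, so estimate < threshold means PASS.
    """
    rejected = holm_bonferroni({k: v[2] for k, v in results.items()}, alpha)
    verdicts = {}
    for name, (estimate, threshold, _) in results.items():
        if not rejected[name]:
            verdicts[name] = "INCONCLUSIVE"  # not distinguishable from the threshold
        else:
            verdicts[name] = "PASS" if estimate < threshold else "FAIL"
    return verdicts
```

Under these assumptions, a dimension whose estimate sits close to its threshold stays INCONCLUSIVE instead of being forced into PASS or FAIL, which is the behavior the three-way verdict scheme is meant to provide.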

We applied RISED to seven cohorts spanning 35 years, with sample sizes ranging from 303 to 99,492. The framework revealed critical failures that remain hidden when relying on AUROC alone. For instance, in the Diabetes 130 cohort, Reliability passed by a margin of three orders of magnitude (PSS = 0.0004), whereas Inclusivity (AUC parity gap of 0.262) and Sensitivity (maximum threshold-flip rate of 49.1%) failed decisively. The same pattern was replicated in both NHIS cohorts. In contrast, NHANES 2021-2023, which has a complete feature profile, yielded INCONCLUSIVE verdicts. BRFSS 2024 showed the most severe Sensitivity failure in the suite, with a maximum threshold-flip rate of 64.2%, after hypertension and cholesterol data were removed due to instrument rotation.
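The abstract does not define the AUC parity gap or the threshold-flip rate, so the following is only a minimal sketch under assumed definitions: the parity gap is taken as the difference between the highest and lowest subgroup AUROC, and the flip rate as the fraction of cases whose binary decision changes when the decision threshold is shifted by a given offset, maximized over a set of offsets. All function names and parameters are hypothetical and are not the RISED package's interface.

```python
from typing import Hashable, Sequence


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_parity_gap(
    y_true: Sequence[int], scores: Sequence[float], groups: Sequence[Hashable]
) -> float:
    """Assumed definition: max subgroup AUROC minus min subgroup AUROC."""
    by_group = {}
    for y, s, g in zip(y_true, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    aucs = [auroc(ys, ss) for ys, ss in by_group.values()]
    return max(aucs) - min(aucs)


def max_threshold_flip_rate(
    scores: Sequence[float], threshold: float, deltas: Sequence[float]
) -> float:
    """Assumed definition: largest share of decisions that flip when the
    threshold moves from `threshold` to `threshold + delta`, over all deltas."""
    base = [s >= threshold for s in scores]
    worst = 0.0
    for delta in deltas:
        shifted = [s >= threshold + delta for s in scores]
        flips = sum(b != t for b, t in zip(base, shifted))
        worst = max(worst, flips / len(scores))
    return worst
```

The pairwise AUROC above is quadratic in the number of cases and is meant for clarity only; for cohorts of the sizes reported here, a rank-based implementation such as `sklearn.metrics.roc_auc_score` would be the more practical choice.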

The same failure pattern appeared in credit- and income-prediction cohorts, underscoring the framework's domain-agnostic nature. Multi-model checks further confirmed that these failures stem from data characteristics rather than from specific model architectures. RISED is available as an open-source Python package designed to complement existing standards and tools such as TRIPOD+AI, FUTURE-AI, and Fairlearn, providing the structured numerical evidence that these guidelines call for but do not explicitly mandate.
