Explainably Safe Reinforcement Learning
Title: Explainably Safe Reinforcement Learning
Abstract:
Building trust in decision-making systems necessitates not only safety guarantees but also the interpretability and understandability of their behavior. This is especially critical for learned systems, where decision-making processes are often opaque. While shielding—a prominent model-based technique for enforcing safety in reinforcement learning—is automatically synthesized using rigorous formal methods, its decisions are typically difficult for humans to interpret. Recently, decision trees have become customary for representing controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become excessively large and impractical for explainability.
To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe.
Our method facilitates explainability of the safety aspect in safe-by-shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.
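The pipeline described above (world model, risk analysis, shield, risk categories, local explanations) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the corridor world, the `SLOPE` and `CLIFF` states, the function names, and in particular the category definitions (unsafe = the state violates safety, dangerous = no action keeps the agent in the safe region, critical = only some actions are allowed, safe = all actions are allowed). It also lists categories and explanations directly rather than learning decision trees over them.

```python
from typing import Dict, List, Set

# Hypothetical toy world model: a corridor with positions 0..5.
ACTIONS: Dict[str, int] = {"left": -1, "stay": 0, "right": 1, "sprint": 2}
N_STATES = 6
SLOPE = 4  # assumed: every action taken here slides into the cliff
CLIFF = 5  # assumed: the only state that violates the safety property


def step(state: int, action: str) -> int:
    """Deterministic successor state in the toy world model."""
    if state in (SLOPE, CLIFF):
        return CLIFF
    return max(0, min(N_STATES - 1, state + ACTIONS[action]))


def safe_region() -> Set[int]:
    """Fixed point: states from which some action stays inside the region."""
    region = {s for s in range(N_STATES) if s != CLIFF}
    changed = True
    while changed:
        changed = False
        for s in sorted(region):
            if not any(step(s, a) in region for a in ACTIONS):
                region.discard(s)
                changed = True
    return region


def build_shield(region: Set[int]) -> Dict[int, List[str]]:
    """The shield maps each state to the actions that stay in the region."""
    return {
        s: [a for a in ACTIONS if step(s, a) in region]
        for s in range(N_STATES)
    }


def categorize(state: int, region: Set[int],
               shield: Dict[int, List[str]]) -> str:
    """Assumed risk categories; the paper's definitions may differ."""
    if state == CLIFF:
        return "unsafe"
    if not shield[state]:
        return "dangerous"
    if len(shield[state]) < len(ACTIONS):
        return "critical"
    return "safe"


def explain(state: int, action: str, region: Set[int]) -> str:
    """Local, case-based explanation for a single state-action pair."""
    nxt = step(state, action)
    if nxt in region:
        return f"allowed: '{action}' leads to state {nxt}, inside the safe region"
    return f"blocked: '{action}' leads to state {nxt}, outside the safe region"


REGION = safe_region()
SHIELD = build_shield(REGION)

if __name__ == "__main__":
    for s in range(N_STATES):
        print(s, categorize(s, REGION, SHIELD), SHIELD[s])
    print(explain(3, "sprint", REGION))
```

In a setting closer to the one described in the abstract, the (state, category) pairs and the per-state allowed-action sets would be the training data for the high-level and localized decision trees, respectively; the sketch stops short of that step.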
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






