Lost in Delusion: Examining LLM Safety Under User Delusions and Distress
Abstract:
As large language model (LLM) chatbots become a primary point of contact for individuals experiencing psychological distress—particularly those whose conditions involve delusional thinking—ensuring their safety is critical. Existing research on mental health safety in LLMs has primarily focused on general therapeutic quality or single-turn crisis detection, leaving a significant gap in understanding model behavior during extended interactions where distress and delusion are intertwined. To bridge this gap, we conducted matched multi-turn simulations involving six LLMs and clinically grounded personas. By pairing each conversation involving delusion with a control scenario featuring distress alone, we isolated the specific impact of delusional framing.
Our analysis identifies a critical "recognition-intervention gap." While models identify distress with similar frequency regardless of whether it is framed within a delusion, they are significantly less likely to initiate safety interventions once the distress is embedded in delusional content. Specifically, safety interventions were up to 4.5 times less frequent in delusional contexts. This failure correlates with the models’ accumulating acceptance of the user’s false premises rather than with a lack of emotional validation.
Furthermore, we found that intuitive mitigation strategies, such as prompting models to explicitly assess user distress, are ineffective under delusional framing. The only solution that successfully closed the intervention gap was delusion-aware prompting combined with explicit response guidance. However, this approach relies on a delusion classifier that proves unreliable when applied to the most vulnerable models. Consequently, safe deployment of LLMs in these contexts necessitates treating delusional framing as a distinct risk signal that supersedes conversational accommodation.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





