Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs
Title: Caught in the Crossfire: Navigating the Conflict Between Moral Reasoning and Safety Protocols in Large Language Models
Abstract:
Current safety alignment strategies for Large Language Models (LLMs) largely rely on a binary framework that classifies each user input as either safe or unsafe. This dichotomous approach fails to account for ethical dilemmas, where a model’s ability to navigate moral trade-offs opens up a distinct vulnerability surface. To address this gap, we propose TRIAL, a novel multi-turn red-teaming protocol that conceals malicious intent within ethically charged contexts. TRIAL exploits the ethical reasoning faculties of various models, persuading them to justify harmful actions as necessary moral compromises, and achieves high attack success rates. In response to these findings, we present ERR (Ethical Reasoning Robustness), a defensive framework designed to differentiate between instrumental actions that facilitate harm and explanatory analyses that explore ethical structures without endorsing harmful outcomes. ERR uses a Layer-Stratified Harm-Gated LoRA architecture, offering strong protection against reasoning-driven attacks while maintaining the model’s functional utility.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




