NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense
Title: NeuroArmor: Ensuring Representation Consistency via Safe-Variant-Guided Selective Re-anchoring for Jailbreak Mitigation
Abstract:
Large language models remain vulnerable to jailbreak attacks, which hide malicious objectives inside innocuous-looking queries through role-playing scenarios, translation tasks, encoding schemes, adversarial suffixes, and multi-turn conversational buildups. Existing defenses often fail to balance security and utility: they apply the same handling to every prompt and therefore fall back on blanket restrictions that block legitimate but sensitive requests. To address this, we introduce NeuroArmor, a white-box runtime defense. NeuroArmor uses prompt-specific safe variants in two roles: as a local safety reference for deciding whether intervention is needed, and as a safe target for the corrective action when intervention is triggered. For every incoming prompt, NeuroArmor generates K safe variants and compares the prompt’s hidden state against this local reference in hidden-state space. Based on this comparison, it routes anomalous prompts either to a refusal pathway, for clearly malicious inputs, or to a helpful recovery route, for ambiguous and potentially benign requests. On Llama-3-8B-Instruct, NeuroArmor reduces the malicious attack success rate (ASR) from 41.56% to 1.57% while lowering the benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%, outperforming matched baselines on this trade-off. External-judge and manual behavioral reviews further indicate that outputs the system does not block are far less likely to be operationally harmful. By combining prompt-specific consistency checks, routing, and selective intervention, NeuroArmor offers a stronger runtime strategy for jailbreak defense.
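The check-then-route flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not state the distance measure, the thresholds, the routing criterion, or the form of the corrective step, so everything below is an assumption made for illustration: the mean of the K safe-variant states as the local reference, cosine distance as the consistency score, two hypothetical thresholds for routing, and linear interpolation toward the reference as the re-anchoring step. The names `route`, `re_anchor`, `recover_threshold`, and `refuse_threshold` are likewise hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean of the K safe-variant hidden states (assumed local reference)."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def route(
    prompt_state: Vector,
    safe_variant_states: Sequence[Vector],
    recover_threshold: float = 0.2,  # hypothetical value, not from the paper
    refuse_threshold: float = 0.6,   # hypothetical value, not from the paper
) -> str:
    """Pick a pathway from the prompt's distance to its local safe reference.

    Assumed rule: small distance -> no intervention, medium distance ->
    helpful recovery, large distance -> refusal.
    """
    reference = centroid(safe_variant_states)
    distance = cosine_distance(prompt_state, reference)
    if distance < recover_threshold:
        return "pass"
    if distance < refuse_threshold:
        return "recover"
    return "refuse"


def re_anchor(prompt_state: Vector, reference: Vector, alpha: float) -> List[float]:
    """Move the prompt state a fraction alpha toward the safe reference.

    Linear interpolation is an assumed stand-in for the corrective action.
    """
    return [h + alpha * (r - h) for h, r in zip(prompt_state, reference)]


if __name__ == "__main__":
    # Toy 2-D "hidden states"; real states would come from a model layer.
    safe_states = [[1.0, 0.0], [1.0, 0.0]]
    for state in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0]):
        print(state, route(state, safe_states))
```

In this toy setup the decision is made per prompt against that prompt's own safe variants rather than against a global threshold on raw activations, which is the property the abstract emphasizes; the specific numbers carry no meaning beyond the example.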
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



