arXiv

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

Title: NeuroArmor: Ensuring Representation Consistency via Safe-Variant-Guided Selective Re-anchoring for Jailbreak Mitigation

Abstract:

Large language models remain vulnerable to jailbreak attacks, which conceal malicious objectives inside innocuous-looking queries through role-playing scenarios, translation tasks, encoding schemes, adversarial suffixes, and multi-turn conversational buildups. Existing defenses often fail to balance security and utility: they apply the same handling to every prompt and fall back on blanket restrictions that block legitimate, albeit sensitive, user requests. To address this, we introduce NeuroArmor, a white-box runtime defense. It uses prompt-specific safe variants both as a local safety reference for deciding whether intervention is needed and as a safe target for corrective action when intervention is triggered. For every incoming prompt, NeuroArmor generates K safe variants and compares the prompt’s hidden state against this local reference. Based on this comparison, it routes anomalous prompts either to a refusal pathway, for clearly malicious inputs, or to a helpful recovery route, for ambiguous and potentially benign requests. On Llama-3-8B-Instruct, NeuroArmor reduces the malicious attack success rate (ASR) from 41.56% to 1.57% while lowering the benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%, a better safety-utility trade-off than matched baselines. External-judge and manual behavioral reviews further indicate that outputs the system does not block are far less likely to be operationally harmful. Overall, NeuroArmor combines prompt-specific consistency checks, routing, and selective intervention into a runtime jailbreak defense.
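
The routing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the reference construction, the distance measure, the thresholds, or the correction rule, so everything below is an assumption made for illustration: the local reference is taken to be the centroid of the K safe-variant hidden states, the comparison is cosine distance, the two thresholds `low` and `high` are hypothetical values, and `re_anchor` uses simple linear interpolation toward the centroid. The function names are likewise hypothetical and are not taken from the paper.

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of equal-length vectors (assumed local reference)."""
    return [mean(column) for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def route(prompt_state, variant_states, low=0.1, high=0.5):
    """Compare a prompt's hidden state with its safe-variant reference.

    Returns (label, distance), where label is one of:
      "pass"    - consistent with the safe variants, no intervention
      "recover" - moderately anomalous, sent to the helpful recovery route
      "refuse"  - strongly anomalous, sent to the refusal pathway
    The thresholds `low` and `high` are hypothetical, not values from the paper.
    """
    distance = cosine_distance(prompt_state, centroid(variant_states))
    if distance < low:
        return "pass", distance
    if distance < high:
        return "recover", distance
    return "refuse", distance


def re_anchor(prompt_state, variant_states, alpha=0.5):
    """Move the prompt state a fraction `alpha` toward the safe centroid.

    Linear interpolation is an assumed stand-in for the corrective action.
    """
    reference = centroid(variant_states)
    return [(1 - alpha) * p + alpha * r for p, r in zip(prompt_state, reference)]


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states" for K = 3 safe variants.
    variants = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
    for state in ([1.0, 0.05], [1.0, 1.0], [-1.0, 0.0]):
        label, distance = route(state, variants)
        print(state, label, round(distance, 4))
```

With the toy vectors above, the three states are routed to "pass", "recover", and "refuse" respectively. In a real white-box setting the vectors would be hidden states read from a chosen layer of the model, and the thresholds would need to be calibrated on held-out benign and malicious prompts.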


Source: arXiv. Generated at: 2026-06-03 00:00:00 UTC
