PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations
Abstract
Multi-turn jailbreak attempts against large language models (LLMs) expose a critical vulnerability in existing safety mechanisms: defensive guardrails typically analyze isolated turns, whereas adversarial strategies unfold as continuous trajectories spanning entire dialogues. To address this, we advocate a shift from static content analysis to dynamic modeling, treating conversations as paths in representation space and asking whether adversarial intent is geometrically encoded from the outset. We present PsychoPass, a framework that derives geometric features from conversation trajectories in embedding space to forecast potential attacks before harmful material is generated. Although our initial geometric features yield near-perfect accuracy with naive classifiers, this performance is driven primarily by including the total number of turns as a feature. After controlling for this confound, we observe a persistent, albeit subtler, geometric signal. Notably, classification performance remains stable regardless of the encoder used. Importantly, this predictive signal emerges early in the interaction: even short prefixes allow detection at rates significantly above chance, and more reliably than standard baseline guardrails. Our theoretical analysis explains these observations through a length and shape decomposition, establishes a detection bound tied to prefix length, and confirms encoder invariance. Collectively, these findings demonstrate that adversarial exchanges imprint an early, representation-robust geometric signature, making them viable targets for real-time online monitoring.
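As a rough illustration of what "geometric features of a conversation trajectory" could look like, the following minimal sketch treats each turn's embedding as a point and summarizes the resulting path. The abstract does not list the features PsychoPass actually uses, so the function name `trajectory_features` and the specific quantities (path length, mean step, straightness, mean turning angle) are assumptions chosen for illustration. The sketch also reports the turn count separately from the length-normalized shape quantities, which is one plausible way to keep the confound described above visible.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def trajectory_features(embeddings):
    """Summarize a conversation as a path through embedding space.

    `embeddings` is a list of per-turn vectors (lists of floats), in turn order.
    All feature choices here are illustrative assumptions, not the paper's
    feature set.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two turn embeddings")

    # Displacement vectors between consecutive turns.
    steps = [_sub(b, a) for a, b in zip(embeddings, embeddings[1:])]
    step_lengths = [_norm(s) for s in steps]

    # Length-type quantities: these grow with the number of turns.
    path_length = sum(step_lengths)
    displacement = _norm(_sub(embeddings[-1], embeddings[0]))

    # Turning angles between consecutive steps (skipping zero-length steps).
    angles = []
    for s, t, ls, lt in zip(steps, steps[1:], step_lengths, step_lengths[1:]):
        if ls > 0 and lt > 0:
            cos = sum(x * y for x, y in zip(s, t)) / (ls * lt)
            angles.append(math.acos(max(-1.0, min(1.0, cos))))

    return {
        "num_turns": len(embeddings),          # the potential confound
        "path_length": path_length,            # scales with num_turns
        "mean_step": path_length / len(steps),  # length-normalized
        "straightness": displacement / path_length if path_length > 0 else 0.0,
        "mean_turning_angle": sum(angles) / len(angles) if angles else 0.0,
    }
```

Early detection on a prefix could then be approximated by calling `trajectory_features(embeddings[:k])` for a small `k` and passing only the shape quantities (everything except `num_turns` and `path_length`) to a classifier. This is again a sketch under the stated assumptions, not the authors' pipeline.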
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





