Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
Title: Intercepting Credential Theft: A Framework for Pre-Output and Multi-Turn Detection in LLM Agents
Abstract
Large Language Model (LLM) agents frequently expose sensitive credentials within context windows that also contain untrusted, retrieved data. This overlap creates a vulnerable pathway for indirect prompt injections, which can manipulate the model into exfiltrating these credentials. To address this security failure, we evaluate three distinct defensive strategies. First, we investigate the efficacy of activation probes in identifying credential access prior to the generation of output tokens. Second, we develop honeytokens derived from format-specific character models and refine detection precision using split conformal prediction. Third, we frame multi-turn exfiltration as a cumulative information-flow issue, monitoring an estimated leakage budget across successive conversation turns.
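As a rough illustration of the split conformal step mentioned above, the sketch below picks a detection threshold from a held-out calibration set of benign scores. It assumes a higher score means "more canary-like" and that calibration and test scores are exchangeable; the function names, the score definition, and the example numbers are hypothetical and not taken from the paper.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split conformal threshold from benign calibration scores.

    Under exchangeability, a fresh benign score exceeds the returned
    threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this alpha: never flag.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def flags_canary(score, threshold):
    """Flag a string whose score is strictly above the calibrated threshold."""
    return score > threshold


# Hypothetical benign calibration scores (assumption, for illustration only).
benign_scores = list(range(1, 20))  # n = 19
threshold = conformal_threshold(benign_scores, alpha=0.25)
```

With 19 calibration scores and alpha = 0.25, the rank is ceil(20 × 0.75) = 15, so the threshold is the 15th smallest benign score. The guarantee concerns false positives on benign text only; it says nothing by itself about how many real canary leaks are caught.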
Our controlled experiments on open-weight models show that activation features distinguish benign queries from credential-seeking ones with high accuracy, even under held-out encoding transformations. In a synthetic multi-turn test suite, cumulative accounting identified attacks that single-turn detectors missed. These findings are preliminary: the multi-turn benchmark is proprietary and limited in scale, the activation approach requires white-box model access, and the information estimator is a practical indicator rather than a strict upper bound. They nonetheless suggest that robust defenses against credential exfiltration must integrate pre-output monitoring, calibrated canary detection, and temporal leakage accounting, rather than relying exclusively on text-level output filters.
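The cumulative accounting idea can be sketched as a running budget over conversation turns. The example below is a minimal illustration under stated assumptions, not the paper's estimator: it counts secret characters revealed through verbatim fragments of at least `min_fragment` characters and charges log2(`alphabet_size`) bits per character. The class name, parameters, and the sample secret are all hypothetical, and verbatim matching would miss encoded or paraphrased leaks.

```python
import math


class LeakageBudget:
    """Tracks an estimated number of secret bits revealed across turns."""

    def __init__(self, secret, budget_bits, alphabet_size=64, min_fragment=3):
        self.secret = secret
        self.budget_bits = budget_bits
        self.bits_per_char = math.log2(alphabet_size)
        self.min_fragment = min_fragment
        self.revealed = set()  # indices of secret characters seen so far

    def _positions_in(self, text):
        """Indices of secret characters covered by fragments found in text."""
        hits = set()
        k = self.min_fragment
        for i in range(len(self.secret) - k + 1):
            if self.secret[i:i + k] in text:
                hits.update(range(i, i + k))
        return hits

    def observe(self, output_text):
        """Account for one turn; return (bits spent so far, over budget?)."""
        self.revealed |= self._positions_in(output_text)
        spent = len(self.revealed) * self.bits_per_char
        return spent, spent > self.budget_bits


# Hypothetical secret and budget: 48 bits is 8 characters at 6 bits each.
monitor = LeakageBudget("sk-ABCDEF123456", budget_bits=48)
turn1 = monitor.observe("first part is sk-AB")  # 5 characters revealed
turn2 = monitor.observe("next: CDEF1")          # 5 more characters revealed
```

In this toy run each turn reveals only 5 characters (30 bits), which stays under the 48-bit budget on its own, but the second turn brings the running total to 60 bits and trips the monitor. That is the gap between a single-turn check and cumulative accounting that the abstract describes.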
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






