Title: REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
Abstract:
Despite their impressive capabilities, Large Language Models (LLMs) remain vulnerable to intricate, multi-step jailbreak strategies. These attacks bypass traditional surface-level safety alignment by manipulating the model’s internal generation mechanisms. To mitigate these risks, we introduce Reflector, a two-stage framework designed to embed self-reflection directly into the generation trajectory.
The first stage employs teacher-guided generation to curate high-quality reflection data, which is then utilized for Supervised Fine-Tuning (SFT) to establish structured reflection patterns. In the second stage, the framework applies Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to cultivate robust, autonomous self-reflection abilities.
Empirical evaluations demonstrate that Reflector maintains Defense Success Rates (DSR) above 90% against complex indirect attacks and generalizes well across a variety of attack types. Furthermore, the framework improves both task-specific performance and general utility, recording a 5.85% improvement on GSM8K and enhanced results on knowledge-intensive benchmarks. By embedding trajectory-level safety, Reflector addresses the inherent constraints of surface alignment without incurring substantial computational costs, offering an efficient and scalable approach to building secure and capable LLMs.
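To make the DSR figure concrete, the following minimal sketch computes a defense success rate under the assumption that DSR is the percentage of attack prompts the model successfully defends against. The abstract does not spell out the exact definition or the judging procedure, so the function name `defense_success_rate` and the per-attack boolean outcomes are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Return the percentage of attacks that were defended.

    Assumption: each entry is True when the model's response to one attack
    prompt was judged safe. How that judgment is made (human or automated)
    is outside the scope of this sketch.
    """
    if not defended:
        raise ValueError("at least one attack outcome is required")
    return 100.0 * sum(defended) / len(defended)


# Hypothetical outcomes for 20 attack prompts: 19 defended, 1 not.
outcomes = [True] * 19 + [False]
print(f"DSR = {defense_success_rate(outcomes):.1f}%")  # prints "DSR = 95.0%"
```

Under this reading, a DSR above 90% means that fewer than one in ten attack prompts in the evaluated set elicited an unsafe response.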
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC





