arXiv

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Abstract:

Despite their impressive capabilities, Large Language Models (LLMs) remain vulnerable to intricate, multi-step jailbreak strategies. These attacks bypass traditional surface-level safety alignment by manipulating the model’s internal generation mechanisms. To mitigate these risks, we introduce Reflector, a two-stage framework designed to embed self-reflection directly into the generation trajectory.

The first stage employs teacher-guided generation to curate high-quality reflection data, which is then utilized for Supervised Fine-Tuning (SFT) to establish structured reflection patterns. In the second stage, the framework applies Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to cultivate robust, autonomous self-reflection abilities.

Empirical evaluations demonstrate that Reflector maintains Defense Success Rates (DSR) above 90% when facing complex indirect attacks, showing strong generalization across a variety of threat landscapes. Furthermore, the framework bolsters both specific task performance and general utility, recording a 5.85% improvement on GSM8K and enhanced results on knowledge-intensive benchmarks. By embedding trajectory-level safety, Reflector addresses the inherent constraints of surface alignment without incurring substantial computational costs, presenting an efficient and scalable approach for advancing the development of secure and capable LLMs.
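As a rough illustration of the DSR figure quoted above, the sketch below assumes DSR is simply the percentage of attack prompts whose responses are judged safe. The abstract does not give the exact definition, so the function name `defense_success_rate`, the boolean judgment format, and the sample outcomes are all hypothetical.

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Return the percentage of attack attempts that were defended.

    Assumption: each entry is True when the model's response to one
    attack prompt was judged safe. The paper may define DSR differently.
    """
    if not defended:
        raise ValueError("need at least one attack outcome")
    return 100.0 * sum(defended) / len(defended)


# Hypothetical outcomes: 19 of 20 attack prompts defended.
outcomes = [True] * 19 + [False]
print(f"DSR = {defense_success_rate(outcomes):.1f}%")
```

Under this assumed definition, a DSR above 90% means fewer than one in ten attack prompts produced a response judged unsafe.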


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
