Hybrid Adversarial Defence for Natural Language Understanding Tasks
Title: A Hybrid Adversarial Defense Strategy for Natural Language Comprehension
Abstract:
Large Language Models (LLMs) face significant risks from both hallucination and adversarial manipulation. Although these issues are deeply interconnected, current defensive measures usually treat them as distinct problems. This study introduces a hybrid defense framework that integrates entropy-based models, which aim to mitigate hallucinations, with uncertainty-based and geometry-based models intended to enhance resistance against attacks.
Our evaluation on Natural Language Understanding datasets, including FEVER, HotpotQA, CSQA, and SIQA, reveals that the hybrid approach boosts performance on clean tasks by as much as 43.34%. It also significantly strengthens adversarial robustness, achieving up to a 64.92% increase in accuracy and a 62.27% decrease in the attack success rate. When tested on out-of-distribution datasets such as AeroEngQA and CPIQA, the approach maintained comparable robustness, showing accuracy improvements of up to 57.14%.
The framework also proved highly effective in prompt injection detection (SafeGuard) and jailbreak detection (AdvBench, DAN) scenarios, reducing the attack success rate by up to 51% relative to state-of-the-art baseline models. Collectively, these findings indicate that combining entropy, uncertainty, and geometric features yields a stronger defensive strategy than relying on any single feature type, and that it is effective in both in-domain and out-of-distribution contexts.
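As a rough illustration of what combining the three feature types could look like, the sketch below builds one feature vector per input. It is not the framework described in the abstract: the function names, the choice of Shannon entropy, the "one minus maximum probability" uncertainty score, and the distance-to-centroid geometric score are all assumptions made for this example.

```python
import math

# Hypothetical feature extractors; the abstract does not specify how the
# entropy, uncertainty, and geometric features are actually computed.

def shannon_entropy(probs):
    """Entropy feature: Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_prob_uncertainty(probs):
    """Uncertainty feature (assumed): one minus the top class probability."""
    return 1.0 - max(probs)

def centroid_distance(embedding, centroid):
    """Geometric feature (assumed): Euclidean distance to a reference centroid."""
    return math.sqrt(sum((e - c) ** 2 for e, c in zip(embedding, centroid)))

def hybrid_features(probs, embedding, centroid):
    """Concatenate the three feature types into a single vector."""
    return [
        shannon_entropy(probs),
        max_prob_uncertainty(probs),
        centroid_distance(embedding, centroid),
    ]

# Toy inputs: a confident prediction near the centroid versus a
# uniform prediction far from it.
confident = hybrid_features([0.97, 0.01, 0.01, 0.01], [0.1, 0.2], [0.0, 0.0])
uncertain = hybrid_features([0.25, 0.25, 0.25, 0.25], [3.0, 4.0], [0.0, 0.0])
```

In a setup like this, a downstream detector (for example, a simple classifier) would consume the concatenated vector, so that an input flagged by any one feature type can still influence the decision.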
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




