arXiv

Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

June 2, 2026 · Yang Yu, Zhuangzhuang Chen, Lanqing Li, Xiaomeng Li

Title: Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention

Abstract: Reinforcement learning (RL) has become a common strategy for strengthening the reasoning abilities of vision-language models (VLMs). In RL-based fine-tuning, entropy intervention is an effective way to increase exploration, which in turn improves policy performance. However, prior work largely restricts entropy intervention to manipulating the updates of specific tokens during RL policy optimization and overlooks intervention during the RL sampling phase. Addressing this gap, we show that entropy intervention during sampling can significantly improve the performance of Group Relative Policy Optimization (GRPO) by increasing response diversity. To this end, we introduce Selective-adversarial Entropy Intervention (SaEI), a method that raises policy entropy by perturbing visual inputs with a token-selective adversarial objective derived from the entropy of sampled responses. SaEI has two components. First, Entropy-guided Adversarial Sampling (EgAS) treats the entropy of sampled responses as an adversarial objective and uses the resulting adversarial gradient to attack the visual inputs, producing adversarial samples that let the policy model explore a broader answer space during RL sampling. Second, Token-selective Entropy Computation (TsEC) restricts which tokens contribute to that entropy, maximizing the effectiveness of the EgAS attack while keeping the factual knowledge within VLMs intact. Comprehensive experiments on both in-domain and out-of-domain datasets confirm that our method substantially improves policy exploration through entropy intervention, thereby improving reasoning capabilities. The code will be made available upon acceptance.
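The two components described in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the "VLM" is replaced by one hypothetical linear head per response token, the attack is a single signed-gradient (FGSM-style) ascent step, and the token-selection rule (keep only tokens whose entropy exceeds a threshold) is a guess at what a token-selective objective might look like. The abstract does not specify any of these details.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def logits_for(head, x):
    """Toy stand-in for a VLM (assumption): one linear head per token, logits = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in head]

def token_entropies(heads, x):
    """Entropy of the next-token distribution at each response position."""
    return [entropy(softmax(logits_for(h, x))) for h in heads]

def select_tokens(entropies, threshold):
    """Hypothetical TsEC-like rule (assumption): keep only tokens above an entropy threshold."""
    return [h > threshold for h in entropies]

def selected_entropy(heads, x, mask):
    """Adversarial objective: mean entropy over the selected tokens only."""
    chosen = [h for h, keep in zip(token_entropies(heads, x), mask) if keep]
    return sum(chosen) / len(chosen) if chosen else 0.0

def selected_entropy_grad(heads, x, mask):
    """Analytic gradient of selected_entropy with respect to the input x."""
    grad = [0.0] * len(x)
    n_selected = sum(mask)
    if n_selected == 0:
        return grad
    for head, keep in zip(heads, mask):
        if not keep:
            continue
        p = softmax(logits_for(head, x))
        h = entropy(p)
        # For softmax logits z: dH/dz_k = -p_k * (log p_k + H).
        dh_dz = [-pk * (math.log(pk) + h) if pk > 0.0 else 0.0 for pk in p]
        # Chain rule through z = W @ x, averaged over the selected tokens.
        for j in range(len(x)):
            grad[j] += sum(dh_dz[k] * head[k][j] for k in range(len(head))) / n_selected
    return grad

def egas_step(heads, x, eps, threshold):
    """One signed-gradient ascent step on the input (assumed attack form), clipped to [0, 1]."""
    mask = select_tokens(token_entropies(heads, x), threshold)
    grad = selected_entropy_grad(heads, x, mask)
    sign = [(g > 0) - (g < 0) for g in grad]
    x_adv = [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]
    return x_adv, mask

# Illustrative values only: three response tokens, a vocabulary of 3, a 2-"pixel" image.
heads = [
    [[8.0, 8.0], [0.0, 0.0], [0.0, 0.0]],    # near-deterministic token, excluded by the mask
    [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]],
]
x = [0.6, 0.3]
x_adv, mask = egas_step(heads, x, eps=0.05, threshold=0.1)
before = selected_entropy(heads, x, mask)
after = selected_entropy(heads, x_adv, mask)
```

In this toy setting the perturbed input stays within `eps` of the original in every coordinate, the low-entropy token is left out of the objective, and the mean entropy over the selected tokens is higher after the step than before. In an actual GRPO pipeline the gradient would instead come from backpropagation through the policy model, and the perturbed image would be used when sampling responses.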

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC