P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
Title: P²-DPO: Tackling Hallucination in Perceptual Processing through Calibration Direct Preference Optimization
Abstract:
Hallucination has recently become a focal point of research on Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) offers a solution by leveraging corrected human preferences to guide learning, effectively mitigating these hallucinations. However, this approach has limitations: it does not specifically address the perceptual bottlenecks present in attended regions, and it does not remedy the model's insufficient visual robustness when images are degraded. Additionally, conventional preference pairs are typically vision-agnostic, and their off-policy nature limits their usefulness in guiding model training.
To overcome these hurdles, we introduce Perceptual Processing Direct Preference Optimization (P²-DPO), a new training framework where the model constructs and learns from its own preference pairs. This approach directly targets the aforementioned visual bottlenecks while sidestepping the problems associated with vision-agnostic and off-policy data. The proposed method features two key components: (1) an on-policy strategy for constructing preference pairs that focus on enhancing perception and ensuring visual robustness, and (2) a specialized Calibration Loss designed to accurately align visual inputs with the causal generation of text.
Our experiments show that P²-DPO surpasses strong baseline models that depend on expensive human feedback, achieving this with similar training costs and data volumes. Moreover, assessments using Attention Region Fidelity (ARF) and tests under image degradation conditions confirm that P²-DPO successfully resolves perceptual bottlenecks in attended areas and enhances the model's robustness against degraded visual inputs.
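For reference, the standard DPO objective that this abstract builds on can be sketched for a single preference pair as below. This is a minimal illustration of vanilla DPO only, not the paper's Calibration Loss, whose form the abstract does not specify. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the trainable policy or the frozen
    reference model. `beta` scales the implicit reward; 0.1 is an
    assumed default, not a value taken from the paper.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy prefers the chosen response more than the reference does,
    # so the loss falls below the log(2) value of an indifferent policy.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this formulation the loss depends only on response log-probabilities, which is why preference pairs that ignore the image (vision-agnostic pairs) give the model no direct signal about visual grounding.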
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



