arXiv

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

June 4, 2026 · Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin

Title: Is the Issue Poor Vision or Poor Logic? Enhancing Multimodal Reasoning by Incentivizing Accurate Perception

Abstract: Establishing a robust synergy between perception and reasoning remains a primary objective for next-generation Vision-Language Models (VLMs). Recent efforts have attempted to bridge this gap through novel architectural designs or agentic workflows, but these methods frequently encounter significant hurdles: they are often constrained by rigid textual reasoning processes or burdened by the substantial engineering and computational costs of complex external agents. Moreover, this extra investment rarely translates into proportional performance improvements and instead frequently triggers a "seesaw effect," where gains in one capability come at the expense of the other. This phenomenon prompts a critical re-evaluation of the actual bottleneck. In this study, we posit that the underlying cause of this trade-off lies in the ambiguity of modality credit assignment: when a VLM fails, it is unclear whether the error stems from defective perception ("bad seeing") or flawed logic ("bad thinking"). To address this, we present a reinforcement learning framework designed to enhance perception-reasoning synergy by explicitly rewarding perceptual fidelity. We achieve this by decomposing the generation process into alternating perception and reasoning phases, allowing for targeted supervision of the perceptual component. Central to our approach is Perception Verification (PV), which uses a "blindfolded reasoning" proxy to assess perceptual accuracy independently of the final reasoning result. Additionally, to facilitate scalable training across diverse, free-form VL tasks, we introduce Structured Verbal Verification, which substitutes the high-variance judgments of LLMs with structured algorithmic execution. These innovations are consolidated within a Modality-Aware Credit Assignment (MoCA) mechanism. MoCA directs rewards to the precise origin of errors, distinguishing between "bad seeing" and "bad thinking," thereby enabling a single VLM to deliver simultaneous performance enhancements across a broad spectrum of tasks.
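
The abstract does not give implementation details, so the following is only a toy sketch of how separate perception and reasoning rewards might be computed with a "blindfolded" proxy. Every name here (Rollout, modality_rewards, diagnose, toy_blind_reasoner) is hypothetical, and exact string matching stands in for whatever verification the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rollout:
    perception: str  # text the model wrote during its perception phase
    answer: str      # final answer produced after the reasoning phase


def modality_rewards(
    rollout: Rollout,
    question: str,
    gold: str,
    blind_reasoner: Callable[[str, str], str],
) -> Tuple[float, float]:
    """Return (perception_reward, reasoning_reward), each 0.0 or 1.0.

    Assumption: the perception text counts as faithful if a text-only
    reasoner that never sees the image can reach the gold answer from it.
    """
    blind_answer = blind_reasoner(question, rollout.perception)
    perception_reward = 1.0 if blind_answer.strip() == gold.strip() else 0.0
    reasoning_reward = 1.0 if rollout.answer.strip() == gold.strip() else 0.0
    return perception_reward, reasoning_reward


def diagnose(perception_reward: float, reasoning_reward: float) -> str:
    """Attribute a failure to perception first, then to reasoning."""
    if perception_reward == 0.0:
        return "bad seeing"
    if reasoning_reward == 0.0:
        return "bad thinking"
    return "ok"


def toy_blind_reasoner(question: str, perception: str) -> str:
    # Hypothetical stand-in for a text-only model.
    return "3" if "three apples" in perception else "unknown"


question = "How many apples are on the table?"
good_eyes = Rollout(perception="three apples on a table", answer="4")
bad_eyes = Rollout(perception="two pears on a table", answer="4")

print(diagnose(*modality_rewards(good_eyes, question, "3", toy_blind_reasoner)))
print(diagnose(*modality_rewards(bad_eyes, question, "3", toy_blind_reasoner)))
```

In this sketch the first rollout describes the image correctly but still answers wrongly, so the failure is attributed to reasoning; the second describes the image incorrectly, so it is attributed to perception.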
