arXiv

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Title: Is the Issue Poor Vision or Poor Logic? Enhancing Multimodal Reasoning by Incentivizing Accurate Perception

Abstract: Establishing a robust synergy between perception and reasoning remains a primary objective for next-generation Vision-Language Models (VLMs). Recent efforts have attempted to bridge this gap through novel architectural designs or agentic workflows, but these methods frequently encounter significant hurdles. They are often constrained by rigid textual reasoning processes or burdened by the substantial engineering and computational costs of complex external agents. Moreover, this extensive resource allocation rarely translates into proportional performance improvements, and instead frequently triggers a "seesaw effect" in which gains in either perception or reasoning come at the expense of the other. This phenomenon prompts a critical re-evaluation of the actual bottleneck. In this study, we posit that the underlying cause of this trade-off lies in the ambiguity of modality credit assignment: when a VLM fails, it is unclear whether the error stems from defective perception ("bad seeing") or flawed logic ("bad thinking"). To address this, we present a reinforcement learning framework designed to enhance perception-reasoning synergy by explicitly rewarding perceptual fidelity. We achieve this by decomposing the generation process into alternating perception and reasoning phases, allowing for targeted supervision of the perceptual component. Central to our approach is Perception Verification (PV), which uses a "blindfolded reasoning" proxy to assess perceptual accuracy independently of the final reasoning result. Additionally, to facilitate scalable training across diverse, free-form VL tasks, we introduce Structured Verbal Verification, which substitutes the high-variance judgments of LLMs with structured algorithmic execution. These innovations are consolidated within a Modality-Aware Credit Assignment (MoCA) mechanism. MoCA directs rewards to the precise origin of errors, distinguishing between "bad seeing" and "bad thinking", thereby enabling a single VLM to deliver simultaneous performance enhancements across a broad spectrum of tasks.
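
The credit-assignment idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract gives no reward formula, so the `Rollout` type, the `assign_credit` function, the +1/-1/0 reward values, and the string-matching `toy_blind_solver` are all hypothetical stand-ins. The sketch assumes a "blindfolded" text-only solver that sees only the perception text; if it can recover the correct answer, perception is treated as adequate and any remaining error is attributed to reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rollout:
    """One model response, split into phases (hypothetical structure)."""
    perception: str  # what the model says it sees
    reasoning: str   # the model's reasoning over that description
    answer: str      # the final answer


def toy_blind_solver(perception: str) -> str:
    """Hypothetical stand-in for a text-only solver that never sees the image.

    It simply reads the value after '=' in a string such as 'count=3'.
    """
    return perception.split("=", 1)[-1].strip()


def assign_credit(
    rollout: Rollout,
    gold: str,
    blind_solver: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (perception_reward, reasoning_reward) under assumed reward values."""
    # Perception check: can the answer be recovered from the perception text alone?
    perception_ok = blind_solver(rollout.perception) == gold
    answer_ok = rollout.answer == gold

    # "Bad seeing": penalize perception and leave reasoning neutral,
    # since reasoning over a wrong description cannot be judged fairly.
    perception_reward = 1.0 if perception_ok else -1.0
    if perception_ok:
        # "Bad thinking": perception was adequate, so the outcome is
        # attributed to the reasoning phase.
        reasoning_reward = 1.0 if answer_ok else -1.0
    else:
        reasoning_reward = 0.0
    return perception_reward, reasoning_reward


if __name__ == "__main__":
    bad_thinking = Rollout("count=3", "three objects plus one", "4")
    bad_seeing = Rollout("count=5", "five objects", "5")
    print(assign_credit(bad_thinking, "3", toy_blind_solver))
    print(assign_credit(bad_seeing, "3", toy_blind_solver))
```

In this sketch the two failure cases receive different reward vectors even though both produce a wrong final answer, which is the distinction the abstract attributes to MoCA. How the actual method scores each phase may differ.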


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
