Title: Belief-Aware VLM Model for Human-like Reasoning
Abstract: Conventional neural network architectures for intent inference depend predominantly on observable states, which limits their ability to generalize across diverse tasks and dynamic settings. Recent Vision Language Models (VLMs) and Vision Language Action (VLA) models have acquired common-sense reasoning through large-scale multimodal pretraining, which enables zero-shot performance, but these systems still lack explicit mechanisms to represent and update belief states. This gap restricts their ability to reason in a human-like manner or to track evolving human intent over extended periods. To address these limitations, we introduce a belief-aware VLM framework that combines reinforcement learning with retrieval-based memory. Rather than constructing an explicit belief model, our approach approximates belief through a vector-based memory that retrieves relevant multimodal context, which is then fed into the VLM to support reasoning. We further refine decision-making by applying a reinforcement learning policy in the VLM's latent space. Evaluations on publicly accessible VQA datasets, including HD-EPIC, show consistent performance gains over zero-shot baselines, underscoring the importance of belief-aware reasoning.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
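
The retrieval step described in the abstract (a vector-based memory whose retrieved context is passed to the VLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `VectorMemory`, the helper `build_prompt`, the 3-dimensional hand-written embeddings, and the text prompt format are all hypothetical stand-ins chosen for illustration, and the reinforcement learning policy in latent space is not modeled here. A real system would presumably obtain embeddings from a multimodal encoder.

```python
import numpy as np


class VectorMemory:
    """Toy retrieval memory (hypothetical): stores unit-normalized embeddings
    with text payloads and returns the top-k payloads by cosine similarity."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.keys = np.empty((0, dim), dtype=np.float64)
        self.payloads = []

    def add(self, embedding, payload: str) -> None:
        vec = np.asarray(embedding, dtype=np.float64).reshape(1, self.dim)
        norm = np.linalg.norm(vec)
        if norm == 0.0:
            raise ValueError("embedding must be non-zero")
        # Store normalized keys so a dot product equals cosine similarity.
        self.keys = np.vstack([self.keys, vec / norm])
        self.payloads.append(payload)

    def retrieve(self, query, k: int = 2):
        if not self.payloads:
            return []
        q = np.asarray(query, dtype=np.float64).reshape(self.dim)
        q_norm = np.linalg.norm(q)
        if q_norm == 0.0:
            raise ValueError("query must be non-zero")
        scores = self.keys @ (q / q_norm)
        top = np.argsort(-scores)[:k]  # indices of the k highest scores
        return [self.payloads[i] for i in top]


def build_prompt(question: str, context) -> str:
    """Prepend retrieved memory entries to the question (assumed format)."""
    lines = ["Context from memory:"]
    lines += [f"- {entry}" for entry in context]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hand-written embeddings stand in for encoder outputs (assumption).
memory = VectorMemory(dim=3)
memory.add([1.0, 0.0, 0.0], "person picked up a knife")
memory.add([0.9, 0.1, 0.0], "person placed an onion on the board")
memory.add([0.0, 0.0, 1.0], "person opened the window")

context = memory.retrieve([1.0, 0.0, 0.1], k=2)
prompt = build_prompt("What will the person do next?", context)
print(prompt)
```

In this sketch the memory plays the role of an approximate belief state: only the entries closest to the current query are surfaced to the model, so older but relevant observations can still inform the answer without an explicit belief model.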




