Stateful Visual Encoders for Vision-Language Models
Vision-language models (VLMs) are increasingly deployed in multi-turn, multi-image agentic environments where decisions depend on detecting visual changes. Despite this trend, current open-weight VLMs share a structural limitation: their visual encoders are stateless. Each image is processed in isolation, without prior visual context, so all visual comparison must happen inside the language model. Consequently, subtle but critical changes, particularly those that do not alter the high-level semantics of a scene, may be diluted before the language model can analyze them.
To address this, we propose a Stateful Visual Encoder that conditions every visual representation on previously encoded visual features. Through supervised fine-tuning, VLMs utilizing these stateful encoders demonstrate reliable performance gains in tasks requiring cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These enhancements remain robust across varying input resolutions, different language model scales, and distinct VLM architectures.
Furthermore, we evaluate our approach on real-world applications such as longitudinal radiology, fine-grained image comparison, and remote sensing. In these domains, stateful encoders consistently improve generalist VLM baselines, allowing them to rival or exceed specialized models in specific areas.
Project page: https://statefulvisualencoders.github.io/
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






