arXiv

Stateful Visual Encoders for Vision-Language Models

Vision-language models (VLMs) are seeing growing adoption in multi-turn, multi-image agentic environments where decision-making relies on detecting visual changes. Despite this trend, current open-weight VLMs suffer from a structural limitation: their visual encoders are stateless. Each image is processed in isolation, without incorporating prior visual context, so all visual comparison must occur within the language model. Consequently, subtle but critical changes, particularly those that do not alter the high-level semantics of a scene, may be diluted before the language model can effectively analyze them.

To address this, we propose a Stateful Visual Encoder that conditions every visual representation on previously encoded visual features. Through supervised fine-tuning, VLMs utilizing these stateful encoders demonstrate reliable performance gains in tasks requiring cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These enhancements remain robust across varying input resolutions, different language model scales, and distinct VLM architectures.
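The difference between a stateless and a stateful encoder can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's architecture: `stateless_encode`, `cross_attend`, and `StatefulEncoder` are hypothetical names, the "features" are simple row statistics in place of learned patch embeddings, and residual dot-product attention over stored features is just one plausible way to condition on earlier images.

```python
import math


def stateless_encode(image):
    """Toy stand-in for a vision encoder: one 2-d feature per pixel row.

    Hypothetical: a real encoder would emit learned patch embeddings.
    """
    return [[float(sum(row)), float(max(row))] for row in image]


def cross_attend(queries, memory):
    """Residual dot-product attention of current tokens over stored tokens."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [
            sum(w * m[d] for w, m in zip(weights, memory)) / total
            for d in range(len(q))
        ]
        # Residual connection: current feature plus attended history.
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


class StatefulEncoder:
    """Keeps features of earlier images and conditions new ones on them."""

    def __init__(self):
        self.memory = []  # flat list of feature vectors from earlier images

    def encode(self, image):
        tokens = stateless_encode(image)
        conditioned = cross_attend(tokens, self.memory) if self.memory else tokens
        self.memory.extend(tokens)  # store unconditioned features for later turns
        return conditioned


# Two nearly identical frames that differ in a single pixel.
frame_a = [[0, 1], [2, 3]]
frame_b = [[0, 1], [2, 4]]

encoder = StatefulEncoder()
first = encoder.encode(frame_a)   # no history yet: same as the stateless output
second = encoder.encode(frame_b)  # depends on frame_a as well as frame_b
```

In this sketch the representation of `frame_b` changes with what was encoded before it, whereas `stateless_encode(frame_b)` returns the same features regardless of history. That dependence on history is the property the abstract attributes to stateful encoders, shown here in a deliberately simplified form.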

Furthermore, we evaluate our approach on practical, real-world applications such as longitudinal radiology, fine-grained image comparison, and remote sensing. In these domains, stateful encoders consistently improve the performance of generalist VLM baselines, allowing them to rival or exceed specialized models in specific areas.

Project page: https://statefulvisualencoders.github.io/


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
