arXiv

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

June 2, 2026 · Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

Title: DeepLatent: Enabling Visual Reasoning Through Parallel Latent Processing

Abstract:

The burgeoning concept of "thinking with images" represents a significant leap forward for Vision-Language Models by integrating visual states directly into intermediate reasoning stages. Current methodologies generally fall into two distinct categories. The first, tool-assisted approaches, rely on explicit visual operations but are hindered by high latency and limited manipulation capabilities. The second, latent reasoning methods, generate implicit visual states autoregressively; however, these often lag behind tool-assisted techniques in performance, with their latent tokens proving inadequate for capturing robust visual information.

To address these limitations, we introduce DeepLatent, a novel framework designed for parallel latent visual reasoning. Our approach features two key innovations. First, we present LatentFormer, a component that utilizes learnable 2D tokens to concurrently generate context-conditioned latent states. This mechanism ensures that every visual update is firmly grounded in the original image features. Second, we have developed a continuous-space reinforcement learning algorithm that fine-tunes latent modulation parameters directly within the embedding space, thereby markedly enhancing the quality of latent representations. The entire framework is trained through a two-step process: initial knowledge distillation followed by the application of our continuous-space reinforcement learning algorithm. Additionally, we release DeepLatent-180K, a comprehensive dataset specifically curated for latent visual reasoning tasks. Extensive testing across various benchmarks confirms that DeepLatent delivers state-of-the-art performance.
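The abstract does not specify LatentFormer's internals, so the following is only a minimal sketch of the general idea it describes: a fixed grid of learnable tokens that all attend to the original image features in one pass, rather than being generated one after another. It assumes single-head cross-attention and additive conditioning on a context vector; the function name `parallel_latent_states`, the shapes, and the random weights are all hypothetical placeholders, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def parallel_latent_states(queries, image_feats, context, w_q, w_k, w_v):
    """Produce all latent states in a single cross-attention pass.

    queries:     (n_latent, d)  learnable tokens (a flattened 2D grid)
    image_feats: (n_patches, d) original image features
    context:     (d,)           summary of the reasoning context
    """
    # Assumption: the context conditions each query additively.
    q = (queries + context) @ w_q
    # Keys and values come only from the original image features,
    # so every latent state is a weighted mix of those features.
    k = image_feats @ w_k
    v = image_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
d, grid, n_patches = 16, 4, 49

queries = rng.normal(size=(grid * grid, d))    # 4x4 grid of tokens, flattened
image_feats = rng.normal(size=(n_patches, d))  # stand-in for encoder output
context = rng.normal(size=(d,))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

latents, attn = parallel_latent_states(
    queries, image_feats, context, w_q, w_k, w_v
)
print(latents.shape)  # (16, 16): one latent state per grid token
```

Because no latent state depends on another, the whole grid is computed with a few matrix products instead of a token-by-token loop, which is the contrast with autoregressive latent generation that the abstract draws.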
