Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Title: Render-of-Thought: Converting Textual Chain-of-Thought into Images for Visual Latent Reasoning
Abstract:
Chain-of-Thought (CoT) prompting has proven highly effective at unlocking the reasoning potential of Large Language Models (LLMs). However, its verbosity imposes a significant computational burden. Current approaches predominantly prioritize outcome alignment and often neglect supervision of the intermediate reasoning steps, which makes the latent reasoning chain hard to analyze. To overcome these limitations, we present Render-of-Thought (RoT), the first framework that materializes the reasoning process by rendering textual steps as images, making the latent rationale explicit and traceable. By using the vision encoders of existing Vision Language Models (VLMs) as semantic anchors, we align vision embeddings with the textual space. This allows plug-and-play integration without additional pre-training cost. Experiments on mathematical and logical reasoning benchmarks show that our method achieves 3-4x token compression and substantial inference acceleration relative to explicit CoT, while remaining competitive with other methods, confirming the viability of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
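As a rough illustration of what a token compression figure means, the sketch below assumes compression is measured as the ratio of explicit CoT token count to the number of latent visual tokens that replace it. That definition, the function name `compression_ratio`, and the token counts are illustrative assumptions, not values or code from the paper.

```python
def compression_ratio(cot_tokens: int, latent_tokens: int) -> float:
    """Return how many explicit CoT tokens each latent token stands in for.

    Assumed definition (not taken from the paper):
        ratio = explicit CoT token count / latent visual token count
    """
    if cot_tokens < 0:
        raise ValueError("cot_tokens must be non-negative")
    if latent_tokens <= 0:
        raise ValueError("latent_tokens must be positive")
    return cot_tokens / latent_tokens


# Hypothetical counts chosen only to illustrate the arithmetic.
ratio = compression_ratio(cot_tokens=120, latent_tokens=32)
print(f"{ratio:.2f}x compression")
```

Under this assumed definition, a ratio between 3 and 4 would mean the model emits roughly a quarter to a third as many reasoning tokens as explicit CoT.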
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





