DenseMLLM: Standard Multimodal LLMs for Dense Prediction
Title: DenseMLLM: Enabling Standard Multimodal LLMs for Dense Prediction
Abstract:
While Multimodal Large Language Models (MLLMs) have proven their prowess in high-level visual comprehension, adapting them for fine-grained dense prediction tasks, such as depth estimation and semantic segmentation, usually demands the addition of intricate, task-specific decoders and other custom modifications. This architectural fragmentation not only inflates model complexity but also strays from the core generalist philosophy of MLLMs, thereby hindering their real-world applicability. In this study, we overturn this convention by demonstrating that standard MLLMs can perform dense prediction without extra task-specific decoders. We introduce DenseMLLM, a model built upon a standard architecture that uses a novel vision token supervision strategy to handle multiple labels and tasks. Despite its minimalist design, DenseMLLM delivers highly competitive results across a broad spectrum of vision-language benchmarks and dense prediction tasks. These findings confirm that a general-purpose MLLM, without architectural specialization, is capable of effective dense perception. The project code is available at github.com/Eli-YiLi/DenseMLLM.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC






