arXiv

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

June 2, 2026 · Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

Title: VLM4VLA: Re-evaluating the Role of Vision-Language Models in Vision-Language-Action Frameworks

Abstract:

Vision-Language-Action (VLA) models, which embed pre-trained large Vision-Language Models (VLMs) into their policy architecture, are increasingly recognized for their generalization potential. This study addresses a critical yet under-explored question: how much do the choice and competence of a VLM influence the performance of downstream VLA policies? To enable fair and efficient comparisons, we propose VLM4VLA, a streamlined adaptation pipeline that converts general-purpose VLMs into VLA policies by adding only a minimal set of new learnable parameters. Despite its minimal design, VLM4VLA performs competitively with more complex network architectures.
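
As a rough illustration of what "a minimal set of new learnable parameters" can mean, the sketch below attaches a single linear action head to a stand-in backbone and counts the parameters it adds. The abstract does not say which parameters VLM4VLA introduces, so the head design, the 7-dimensional action, and every name here (`ActionHead`, `stand_in_backbone`, `policy`) are assumptions made for illustration only; the backbone is a random placeholder, not a real VLM.

```python
import random


class ActionHead:
    """Hypothetical linear head: the only newly added parameters in this sketch."""

    def __init__(self, hidden_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # One weight row per action dimension, plus a bias vector.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                       for _ in range(action_dim)]
        self.bias = [0.0] * action_dim

    def num_parameters(self):
        return sum(len(row) for row in self.weight) + len(self.bias)

    def __call__(self, hidden):
        # Plain matrix-vector product: action = W @ hidden + b.
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.weight, self.bias)]


def stand_in_backbone(token_ids, hidden_dim):
    """Placeholder for a pre-trained VLM: returns one pooled hidden vector."""
    rng = random.Random(sum(token_ids))
    return [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]


def policy(token_ids, head, hidden_dim):
    """Backbone features in, continuous action vector out."""
    return head(stand_in_backbone(token_ids, hidden_dim))


HIDDEN_DIM = 16
ACTION_DIM = 7  # assumed action size, e.g. 6-DoF end-effector delta plus gripper

head = ActionHead(HIDDEN_DIM, ACTION_DIM)
action = policy([101, 2023, 102], head, HIDDEN_DIM)
new_params = head.num_parameters()  # HIDDEN_DIM * ACTION_DIM + ACTION_DIM
```

In this toy setup the new parameter count grows only with the hidden size times the action size, which is tiny next to a pre-trained backbone. Keeping the added part this small means differences between policies can be attributed mostly to the VLM rather than to the adapter.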

Our extensive empirical analysis across three benchmarks and various downstream tasks reveals two key insights. First, although initializing from a pre-trained VLM gives a consistent advantage over training from scratch, a VLM’s general capabilities are poor predictors of its performance on specific downstream tasks. This contradicts a prevailing assumption: standard VLM competence is necessary but not sufficient for effective embodied control. Second, we examine the influence of specialized embodied skills by fine-tuning VLMs on seven auxiliary tasks, including embodied question answering, visual pointing, and depth estimation. Surprisingly, improving a VLM’s proficiency on these tasks does not necessarily improve downstream control performance.
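
One way to quantify whether general capability predicts control performance is a rank correlation between VLM benchmark scores and VLA success rates. The sketch below implements Spearman's rank correlation in plain Python. The abstract does not state which statistic the authors use, so this choice is an assumption, and the two score lists are made-up placeholder values, not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder numbers for five hypothetical VLMs (NOT from the paper).
vlm_benchmark_score = [1.0, 2.0, 3.0, 4.0, 5.0]
vla_success_rate = [3.0, 1.0, 4.0, 2.0, 5.0]

rho = spearman(vlm_benchmark_score, vla_success_rate)
```

A value of `rho` near 1 would mean that stronger VLMs reliably yield stronger policies; values near 0 would match the claim that general capability is a poor predictor.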

Through modality-level ablation studies, we identify the visual module, rather than the language component, as the primary bottleneck affecting performance. We show that incorporating control-relevant supervision into the VLM’s vision encoder leads to consistent performance improvements, even when the encoder is kept frozen during subsequent fine-tuning. This result highlights a persistent domain gap between the objectives of current VLM pretraining and the specific demands of embodied action planning.
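
A modality-level ablation of this kind amounts to choosing which parameter groups are updated during fine-tuning. The sketch below splits parameter names into frozen and trainable sets by prefix. The prefixes and parameter names (`vision_encoder.`, `language_model.`, `action_head.`) are hypothetical and do not come from the paper or from any particular VLM codebase.

```python
def split_parameters(param_names, frozen_prefixes):
    """Partition parameter names into (frozen, trainable) by name prefix."""
    prefixes = tuple(frozen_prefixes)
    frozen, trainable = [], []
    for name in param_names:
        if name.startswith(prefixes):
            frozen.append(name)
        else:
            trainable.append(name)
    return frozen, trainable


# Hypothetical parameter names for a VLM-based policy.
param_names = [
    "vision_encoder.patch_embed.weight",
    "vision_encoder.blocks.0.attn.weight",
    "language_model.layers.0.mlp.weight",
    "action_head.weight",
]

# Ablation setting assumed here: freeze vision, fine-tune everything else.
frozen, trainable = split_parameters(param_names, ["vision_encoder."])
```

Swapping the prefix list (for example, freezing `language_model.` instead) gives the complementary setting, so the same helper covers both sides of the comparison.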
