Global News Digest

arXiv

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Abstract:

Vision-Language-Action (VLA) models, which embed pre-trained large Vision-Language Models (VLMs) into their policy architecture, are increasingly recognized for their generalization potential. This study addresses a critical yet under-explored question: how much do the choice and competence of a VLM influence the performance of downstream VLA policies? To enable fair and efficient comparisons, we propose VLM4VLA, a streamlined adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters. Despite its simple design, VLM4VLA is competitive with more complex network architectures.
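
To make "a small set of new learnable parameters" concrete, the sketch below attaches a linear action head to a stand-in backbone and reports what share of all parameters is newly added. It is an illustration under stated assumptions, not the VLM4VLA implementation: the abstract does not say what the new parameters are, so the linear head, the class names `TinyBackbone` and `ActionHead`, and every dimension are hypothetical.

```python
import random


class TinyBackbone:
    """Hypothetical stand-in for a pre-trained VLM: observation -> feature vector."""

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Random weights play the role of pre-trained weights in this toy example.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(feat_dim)
        ]

    def num_params(self):
        return sum(len(row) for row in self.weights)

    def features(self, obs):
        return [sum(w * x for w, x in zip(row, obs)) for row in self.weights]


class ActionHead:
    """Assumed form of the newly added parameters: a linear map from features to an action."""

    def __init__(self, feat_dim, action_dim):
        # Zero initialization, so the untrained policy outputs a zero action.
        self.weights = [[0.0] * feat_dim for _ in range(action_dim)]
        self.bias = [0.0] * action_dim

    def num_params(self):
        return sum(len(row) for row in self.weights) + len(self.bias)

    def __call__(self, feats):
        return [
            sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def policy(backbone, head, obs):
    """Compose backbone and head into a policy: observation -> action."""
    return head(backbone.features(obs))


backbone = TinyBackbone(in_dim=64, feat_dim=32)   # toy sizes, chosen arbitrarily
head = ActionHead(feat_dim=32, action_dim=7)      # 7-dim action is an assumption
action = policy(backbone, head, [1.0] * 64)

# Share of all parameters that were introduced by the adaptation step.
new_fraction = head.num_params() / (head.num_params() + backbone.num_params())
print(len(action), head.num_params(), backbone.num_params(), round(new_fraction, 3))
```

With a real VLM the backbone would hold billions of parameters, so a head of this kind would account for a far smaller share than in this toy setting.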

Our extensive empirical analysis across three benchmarks and various downstream tasks reveals two key insights. First, while initializing from a VLM provides a consistent advantage over training from scratch, a VLM’s general capabilities are poor predictors of its downstream task performance. This finding challenges prevailing assumptions: standard VLM competence is necessary but not sufficient for effective embodied control. Second, we examine the influence of specialized embodied skills by fine-tuning VLMs on seven auxiliary tasks, including embodied question answering, visual pointing, and depth estimation. Surprisingly, improving a VLM’s proficiency on these tasks does not necessarily improve downstream control performance.

Through modality-level ablation studies, we identify the visual module, rather than the language component, as the primary bottleneck affecting performance. We show that incorporating control-relevant supervision into the VLM’s vision encoder leads to consistent performance improvements, even when the encoder is kept frozen during subsequent fine-tuning. This result highlights a persistent domain gap between the objectives of current VLM pretraining and the specific demands of embodied action planning.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
