arXiv

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

Title: Prioritizing Per-Group Error Over Aggregate MSE for Optimizing Vision-Language-Action Models in 11-DoF Mobile Manipulation

Abstract:

When fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint configurations, researchers may encounter a seemingly paradoxical outcome: the checkpoint with the lowest overall Mean Squared Error (MSE) often fails to deliver the best performance on physical hardware. We argue that this is an expected consequence of aggregating distinct joint groups (the arm, gripper, head, and wheeled base) into a single metric, where high prediction accuracy on simpler joints can obscure failures on more complex ones.
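The masking effect can be made explicit with a short identity. The notation here is an assumption introduced for illustration and does not come from the paper: the $D$ action dimensions are partitioned into groups $g$ containing $d_g$ dimensions each, and $\mathrm{MSE}_g$ is the mean squared error over the dimensions of group $g$. Total MSE is then the dimension-weighted average of the per-group errors:

$$
\mathrm{MSE}_{\text{total}} \;=\; \frac{1}{D}\sum_{g} d_g\,\mathrm{MSE}_g, \qquad \sum_{g} d_g = D .
$$

Under this identity, a drop in one group's error can lower $\mathrm{MSE}_{\text{total}}$ even while another group's error rises, so a lower total does not imply that every group improved.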

To investigate this, we fine-tuned SmolVLA (450M parameters, action-expert only) on an 11-DoF Toyota HSR platform and benchmarked it against $\pi_{0.5}$ (3.3B parameters), a more robust pretrained baseline. A per-group analysis revealed two key findings. First, for SmolVLA, the mobile base converged most slowly and thereby bottlenecked overall system performance. Second, expert-only fine-tuning of $\pi_{0.5}$ (training only the action head while the backbone remains frozen) reduced total MSE below the baseline level, but at the cost of degraded arm accuracy.
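A per-group analysis of this kind can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the index ranges in `GROUPS` (5 arm, 1 gripper, 2 head, 3 base) are an assumed split of the 11-dimensional action vector, the helper names are hypothetical, and the two "checkpoints" use made-up prediction errors chosen only to show how total MSE and arm MSE can disagree.

```python
# Assumed joint-group layout for an 11-DoF action vector. The source names
# the groups (arm, gripper, head, base) but not their index ranges, so this
# split is a labeled assumption.
GROUPS = {
    "arm": range(0, 5),
    "gripper": range(5, 6),
    "head": range(6, 8),
    "base": range(8, 11),
}


def per_group_mse(pred, target, groups=GROUPS):
    """Return ({group: MSE}, total MSE) for equal-length lists of action vectors."""
    n_steps = len(pred)
    dim = len(pred[0])
    # Squared error per action dimension, summed over all timesteps.
    sq_err = [0.0] * dim
    for p, t in zip(pred, target):
        for j in range(dim):
            sq_err[j] += (p[j] - t[j]) ** 2
    group_mse = {
        name: sum(sq_err[j] for j in idx) / (n_steps * len(idx))
        for name, idx in groups.items()
    }
    total_mse = sum(sq_err) / (n_steps * dim)
    return group_mse, total_mse


def select_checkpoint(results, key_group="arm"):
    """Pick the checkpoint whose MSE on `key_group` is lowest."""
    return min(results, key=lambda name: results[name][0][key_group])


# Toy data (hypothetical, not from the paper): targets are all zeros.
target = [[0.0] * 11] * 4
pred_a = [[0.3] * 5 + [0.0] * 6] * 4              # poor arm, perfect elsewhere
pred_b = [[0.1] * 5 + [0.0] * 3 + [0.5] * 3] * 4  # good arm, poor base

groups_a, total_a = per_group_mse(pred_a, target)
groups_b, total_b = per_group_mse(pred_b, target)
results = {"ckpt_a": (groups_a, total_a), "ckpt_b": (groups_b, total_b)}

best_by_total = min(results, key=lambda name: results[name][1])
best_by_arm = select_checkpoint(results, key_group="arm")
print("lowest total MSE:", best_by_total)
print("lowest arm MSE:  ", best_by_arm)
```

In this toy setup the two selection rules pick different checkpoints, which is the kind of disagreement that a single aggregate number cannot reveal.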

In 60 real-robot trials (20 per model), the $\pi_{0.5}$ 80k variant (4.0/4) significantly outperformed both fine-tuned alternatives, the expert-only 3k model (3.75/4) and the HSR-SmolVLA model (3.5/4), as confirmed by a Mann-Whitney test ($p \leq 0.010$). Notably, the expert-only 3k model achieved the lowest total MSE yet performed worse on the robot. This divergence suggests that offline arm-group error, rather than total MSE or base-group error, is the primary indicator of real-world success. We therefore recommend per-group error over total MSE as the metric for checkpoint selection on robots with heterogeneous action spaces.
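For readers unfamiliar with the test named above, the following is a minimal pure-Python sketch of a two-sided Mann-Whitney U test using the normal approximation with a tie correction and no continuity correction. Whether the authors used this variant or an exact test is not stated, so the choice is an assumption. The score lists are made-up placeholders on a 0 to 4 scale and are not the paper's trial data.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, label) in zip(ranks, pooled) if label == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all observations identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal
    return u_x, p_value


# Placeholder per-trial scores (hypothetical, 20 trials each).
scores_model_1 = [4] * 20
scores_model_2 = [4] * 12 + [3] * 5 + [2] * 3

u_stat, p_value = mann_whitney_u(scores_model_1, scores_model_2)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

With heavily tied ordinal scores like these, the tie correction matters; an exact or permutation-based test may be preferable for small samples.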

Code: https://github.com/paumontagut/per-group-mse-vla


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
