arXiv

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

June 2, 2026 · Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze

Title: Prioritizing Per-Group Error Over Aggregate MSE for Optimizing Vision-Language-Action Models in 11-DoF Mobile Manipulation

Abstract:

When fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint groups, a counterintuitive outcome can arise: the checkpoint with the lowest overall Mean Squared Error (MSE) is often not the one that performs best on physical hardware. We argue that this is an expected consequence of aggregating distinct joint groups (arm, gripper, head, and wheeled base) into a single metric: accurate predictions for the simpler groups can mask failures in the more complex ones.
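The masking effect can be illustrated with a minimal sketch. The group index layout below (5 arm, 1 gripper, 2 head, 3 base dimensions) is an assumption made only so the groups sum to 11; the helper names and the error values are invented for illustration and are not taken from the paper or its code.

```python
# Hypothetical index layout for an 11-DoF action vector (assumption: the
# paper's actual joint ordering and group sizes are not given here).
GROUPS = {
    "arm": [0, 1, 2, 3, 4],
    "gripper": [5],
    "head": [6, 7],
    "base": [8, 9, 10],
}

def mse(pred, target, dims):
    """Mean squared error over the selected action dimensions."""
    dims = list(dims)
    sq = [(p[d] - t[d]) ** 2 for p, t in zip(pred, target) for d in dims]
    return sum(sq) / len(sq)

def per_group_mse(pred, target, groups=GROUPS):
    """MSE computed separately for each joint group."""
    return {name: mse(pred, target, dims) for name, dims in groups.items()}

def total_mse(pred, target):
    """Single aggregate MSE over all action dimensions."""
    return mse(pred, target, range(len(target[0])))

def constant_error_batch(errors, n_steps=4, groups=GROUPS):
    """Toy data: zero targets, predictions with a fixed error per group."""
    n_dims = sum(len(d) for d in groups.values())
    target = [[0.0] * n_dims for _ in range(n_steps)]
    pred = [[0.0] * n_dims for _ in range(n_steps)]
    for name, err in errors.items():
        for row in pred:
            for d in groups[name]:
                row[d] = err
    return pred, target

# Invented numbers: checkpoint A is off on the arm only; checkpoint B is
# better on the arm but worse on the base.
ckpt_a = constant_error_batch({"arm": 0.3})
ckpt_b = constant_error_batch({"arm": 0.1, "base": 0.4})

for name, (pred, target) in {"A": ckpt_a, "B": ckpt_b}.items():
    print(name, round(total_mse(pred, target), 4), per_group_mse(pred, target))
```

In this toy setup checkpoint A has the lower total MSE even though its arm error is higher than B's, which is the kind of disagreement a single aggregate number cannot reveal.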

To investigate this, we fine-tuned SmolVLA (450M parameters, action-expert only) on an 11-DoF Toyota HSR platform and benchmarked it against $\pi_{0.5}$ (3.3B parameters), a more robust pretrained baseline. A per-group analysis revealed two findings. First, for SmolVLA, the mobile base converged most slowly and was the bottleneck for overall performance. Second, expert-only fine-tuning of $\pi_{0.5}$ (training only the action head while the backbone stays frozen) reduced total MSE below the baseline level, but at the cost of degraded arm accuracy.

In 60 real-robot trials (20 per model), the $\pi_{0.5}$ 80k variant (4.0/4) significantly outperformed both fine-tuned alternatives, the expert-only 3k model (3.75/4) and the HSR-SmolVLA model (3.5/4), according to a Mann-Whitney U test ($p \leq 0.010$). Notably, the expert-only 3k model achieved the lowest total MSE yet performed worse on the robot. This divergence suggests that offline arm-group error, rather than total MSE or base-group error, is the primary indicator of real-world success. We therefore recommend per-group error over total MSE for checkpoint selection on robots with heterogeneous action spaces.
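For readers unfamiliar with the test, the following is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation with tie correction. It is not the authors' analysis code: whether the paper used an exact or asymptotic variant is not stated here, the function name is hypothetical, and the per-trial scores are invented placeholders rather than the reported data.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation with tie
    correction, no continuity correction). Returns (U for x, p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:  # all observations identical: no evidence of a difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return u_x, p

# Invented per-trial task scores (0 to 4) for two models, 20 trials each.
scores_model_1 = [4] * 20
scores_model_2 = [4] * 10 + [3] * 10

u_stat, p_value = mann_whitney_u(scores_model_1, scores_model_2)
print(f"U = {u_stat}, p = {p_value:.4g}")
```

A rank-based test suits this setting because trial scores on a 0 to 4 scale are ordinal and heavily tied, so the tie correction in the variance term matters.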

Code: https://github.com/paumontagut/per-group-mse-vla
