Contrastive Representation Regularization for Vision-Language-Action Models
Title: Enhancing Vision-Language-Action Models via Contrastive Representation Regularization
Vision-Language-Action (VLA) models have demonstrated significant proficiency in robotic manipulation by drawing on the rich semantic representations of pre-trained Vision-Language Models (VLMs). Despite these successes, their internal representations are often considered suboptimal, particularly in how weakly they respond to critical robotic signals such as control actions and proprioceptive data. To address this limitation, we propose the Robot State-aware Contrastive Loss (RS-CL), a simple and effective regularization technique for VLA models that narrows the gap between VLM embeddings and robotic signals. Specifically, RS-CL uses the relative distances between robot states as soft supervision, aligning the representations more closely with the robot’s proprioceptive state. Used alongside the standard action prediction objective, RS-CL strengthens control-relevant representation learning. It is lightweight and integrates directly into conventional VLA training pipelines. Our experiments show that RS-CL significantly improves the performance of leading VLA architectures: it achieves a state-of-the-art score of 69.7% on the RoboCasa-Kitchen benchmark and raises success rates on difficult real-world robot manipulation tasks from 45.0% to 58.3%.
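The abstract does not give the exact form of RS-CL, so the following is only a minimal NumPy sketch of one way "relative distances between states as soft supervision" could be turned into a contrastive loss. Everything in it is an assumption made for illustration: the function name `state_aware_contrastive_loss`, the use of Euclidean distance between proprioceptive states, cosine similarity between embeddings, and the temperature parameters `tau` and `beta`. It should not be read as the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def state_aware_contrastive_loss(embeddings, states, tau=0.1, beta=1.0):
    """Soft-label contrastive loss (illustrative assumption, not the paper's formula).

    embeddings: (B, D) array of representations, one per sample, B >= 2.
    states:     (B, S) array of proprioceptive states for the same samples.
    tau:        temperature for embedding similarities (hypothetical parameter).
    beta:       temperature for state distances (hypothetical parameter).
    """
    # Cosine similarity between every pair of embeddings in the batch.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / tau

    # Pairwise Euclidean distances between robot states; closer states
    # receive larger soft-target weights.
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    target_logits = -dists / beta

    # Exclude self-pairs from both distributions with a large negative value.
    eye = np.eye(len(z), dtype=bool)
    logits = np.where(eye, -1e9, logits)
    target_logits = np.where(eye, -1e9, target_logits)

    log_pred = log_softmax(logits, axis=1)
    targets = np.exp(log_softmax(target_logits, axis=1))

    # Cross-entropy between the state-derived soft targets and the
    # embedding-derived similarity distribution, averaged over the batch.
    return float(-(targets * log_pred).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))    # stand-in for VLM-derived embeddings
    state = rng.normal(size=(8, 7))   # stand-in for proprioceptive states
    print(state_aware_contrastive_loss(emb, state))
```

In a training loop, a term of this kind would presumably be added to the action prediction loss with a weighting coefficient. That combination rule is also an assumption here, since the abstract only states that the two objectives are used together.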
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





