From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Title: Bridging Abstraction and Instantiation: Acquiring Behavioral Representations for Vision-Language-Action Models
Abstract: Vision-Language-Action (VLA) models often lose performance under distribution shift, primarily because they struggle to acquire generalized behavioral representations that span diverse environments. Current methods typically build these representations from action-centric latent variables; however, they are hindered by static execution alignment and short-horizon temporal fragmentation, which lead to inconsistent performance in complex scenarios. To overcome these challenges, we introduce BehaviorVLA, a framework designed to enable robust manipulation by learning temporally coherent behavioral representations. The framework comprises two symmetric modules: (1) the Visuomotor Behavior Encoder (VBE), which employs a causal Mamba-based architecture to consolidate long-horizon trajectory data into a cohesive behavior representation; and (2) the Phase-conditioned Behavior Decoder (PBD), which translates this representation into precise actions by dynamically synchronizing task-level priors with real-time execution progress. In our evaluations, BehaviorVLA achieves state-of-the-art results: success rates of 58% on RoboTwin 2.0 and 98% on LIBERO, and an average length of 4.36 on CALVIN. Furthermore, in real-world sim-to-real transfer tasks, BehaviorVLA achieves performance comparable to OpenVLA-OFT while using only 50% of the demonstration data, highlighting its data efficiency and generalization capabilities.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




