Title: SeeTraceAct: Visibility-Aware Latent Planning from Cross-Embodiment Demonstration Videos
Abstract:
Vision-Language-Action (VLA) models are a promising route to general-purpose robotic policies, but adapting them to novel tasks usually requires expensive, task-specific teleoperation data. To avoid this dependency, we investigate one-shot demo-conditioned VLAs, in which the robot policy is guided by a single video demonstration of an unfamiliar task. Our analysis shows that current end-to-end methods frequently fail when tasks require precise localization of small target areas. To address this, we introduce SeeTraceAct, a framework for demo-conditioned VLAs that improves spatial grounding by predicting visibility-aware future end-effector traces. To support reproducible research on cross-embodiment demonstrations, we present and open-source RoboCasa-DC, an extension of the RoboCasa environment that includes episode-paired videos of humanoid actions. Evaluations on RoboCasa-DC and on a real-world benchmark, in which a Franka Panda manipulator is conditioned on human demonstrations, show that SeeTraceAct outperforms baseline methods: it achieves the highest success rates across all four RoboCasa-DC configurations and raises the average real-world success rate by 12.5 percentage points.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



