arXiv

How Visible Are Silent Manipulation Failures? An Observability Study of False-Success Detection in Simulated Robot Episodes

June 3, 2026 · Aarav Bedi (University of California, Berkeley)

Title: Assessing the Detectability of Covert Manipulation Errors: An Observability Analysis of False-Success Identification in Simulated Robotic Tasks

Abstract: Robot manipulation policies trained via imitation learning are inherently constrained by the accuracy of the success labels attached to their training episodes, which typically come from the robot’s internal success verification mechanisms. A critical flaw in this process is the "false success," where the system logs an episode as successful even though the task failed. This study addresses a specific, practical question about such episodes: once an episode is marked as successful, how much of the information required to reclassify it is contained in proprioceptive signals, and how much requires visual input? To investigate this, we constructed a simulation environment featuring two bimanual ALOHA tasks. We introduced failures by applying environmental perturbations rather than altering labels, and we annotated every episode using privileged simulator states that remained inaccessible to the detection models. Our dataset was strictly limited to episodes the robot had already classified as successes. We then evaluated detectors relying solely on proprioception against those incorporating visual data. Our results indicate that recoverability varies significantly by task: for cube transfer, false successes are almost entirely detectable using joint data alone. In contrast, for peg insertion, proprioceptive data identifies these errors only partially, with visual detectors closing most of the remaining gap. Furthermore, we demonstrate that the separability observed in proprioceptive data relies on velocity differences that fall below any realistic sensor noise threshold. Consequently, these findings should be interpreted as an optimistic upper bound, inflated by the noiseless nature of the simulator. We have made both the generation and evaluation pipelines publicly available.
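The noise-floor caveat in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' pipeline. It assumes a single hypothetical feature (a residual joint velocity at episode end), and the values chosen for `gap`, `spread`, and `noise_std` are made up for illustration. It shows only the general mechanism: a velocity difference that separates the two classes perfectly in noiseless data can become nearly uninformative once Gaussian sensor noise much larger than that difference is added.

```python
import random


def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def simulate(noise_std, n=200, gap=1e-4, spread=1e-5, seed=0):
    """AUC of a one-feature detector under additive Gaussian sensor noise.

    All parameters are illustrative assumptions, not values from the paper:
      gap       -- mean velocity offset (rad/s) of false successes
      spread    -- episode-to-episode variation of the noiseless feature
      noise_std -- standard deviation of the simulated sensor noise
    """
    rng = random.Random(seed)
    # Hypothetical residual joint velocity for each labelled-success episode.
    true_success = [rng.gauss(0.0, spread) for _ in range(n)]
    false_success = [rng.gauss(gap, spread) for _ in range(n)]

    def add_noise(values):
        return [v + rng.gauss(0.0, noise_std) for v in values]

    # False successes are the positive class the detector should flag.
    return auc(add_noise(false_success), add_noise(true_success))


clean_auc = simulate(noise_std=0.0)   # noiseless, simulator-like readings
noisy_auc = simulate(noise_std=1e-2)  # noise far larger than the gap

print(f"AUC without sensor noise: {clean_auc:.3f}")
print(f"AUC with sensor noise:    {noisy_auc:.3f}")
```

Under these assumed settings the noiseless AUC sits at or very near 1, while the noisy AUC falls close to chance. This is the sense in which proprioceptive results obtained in a noiseless simulator are best read as an upper bound.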
