arXiv

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

June 2, 2026 · Can Zhang, Gim Hee Lee

Abstract: Existing approaches for estimating category-level object articulation from a single 3D view typically depend on dense supervision, multi-frame sequences, or CAD templates, and they often fail to disentangle geometric structure from articulation or to recover explicit joint parameters. To address these limitations, we introduce SCAPO, a self-supervised framework that infers canonical geometry, rigid part segmentation, and joint parameters (pivots, axes, and articulation states) from a single RGB-D input, without ground-truth annotations or category-specific models. SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose, aligning diverse instances in a shared canonical space. A joint-aware blend-skinning module then models part motion within this canonical frame. The learned representation is further refined through cycle reconstruction between observed and canonical shapes and through cross-space alignment with a learnable canonical template, which decouples shared category geometry from instance-specific residual shapes. Experiments on synthetic and real-world articulated-object datasets show that SCAPO recovers consistent part structures and accurate articulation parameters, outperforming all existing self-supervised baselines.
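
The pose-factoring stage relies on vector-neuron layers, which store each feature as a 3D vector so that rotating the input rotates the features in the same way. As an illustration only, the NumPy sketch below implements a generic vector-neuron linear layer and ReLU and checks their rotation equivariance; it is not SCAPO's encoder, and the function names, channel sizes, and random weights are assumptions. Vector-neuron layers handle rotation by construction; translation is commonly removed by centering the point cloud beforehand, which is also an assumption here rather than something the abstract states.

```python
import numpy as np

def vn_linear(V, W):
    """Vector-neuron linear layer (illustrative).

    V: (C_in, 3) vector features of one point; W: (C_out, C_in) weights.
    Weights mix channels, never xyz coordinates, so for any rotation R:
    vn_linear(V @ R, W) == vn_linear(V, W) @ R.
    """
    return W @ V

def vn_relu(V, W, U, eps=1e-8):
    """Vector-neuron ReLU: drop the component of each feature that points against a learned direction."""
    q = W @ V                                            # (C_out, 3) candidate features
    k = U @ V                                            # (C_out, 3) learned directions
    dot = np.sum(q * k, axis=-1, keepdims=True)          # (C_out, 1), unchanged by rotation
    k_sq = np.sum(k * k, axis=-1, keepdims=True) + eps   # (C_out, 1)
    # Keep q where <q, k> >= 0; otherwise remove its component along k.
    return np.where(dot >= 0, q, q - (dot / k_sq) * k)

def random_rotation(rng):
    """Sample a random 3x3 rotation matrix (orthogonal, det = +1) via QR."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))      # fix the column signs of the QR factor
    if np.linalg.det(Q) < 0:         # turn a reflection into a proper rotation
        Q[:, 0] = -Q[:, 0]
    return Q

# Equivariance check with random weights: rotate-then-apply equals apply-then-rotate.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))          # 4 vector channels for one point
W = rng.normal(size=(8, 4))
U = rng.normal(size=(8, 4))
Rm = random_rotation(rng)
print(np.allclose(vn_relu(V @ Rm, W, U), vn_relu(V, W, U) @ Rm))
```

Because every operation either mixes channels linearly or depends only on rotation-invariant dot products, stacking such layers keeps the whole encoder equivariant, which is what allows a predicted pose to be separated from pose-free canonical features.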
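
The joint-aware blend-skinning step can be pictured as moving each canonical point by a weighted mix of per-joint rigid motions. The sketch below is hypothetical: it assumes revolute joints only, soft per-point part weights that sum to 1, and one pivot, axis, and angle per part. The names joint_blend_skinning and axis_angle_to_matrix are illustrative and do not come from the paper. Each point is rotated about every joint axis through that joint's pivot (Rodrigues' formula), and the results are blended with the point's part weights.

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis` (normalized here)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])       # cross-product matrix of a
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def joint_blend_skinning(points, weights, pivots, axes, angles):
    """Articulate canonical points with revolute joints (hypothetical sketch).

    points:  (N, 3) canonical points
    weights: (N, K) soft part assignments; each row sums to 1
    pivots:  (K, 3) one point on each joint axis
    axes:    (K, 3) joint axis directions
    angles:  (K,)   joint states in radians (0 leaves a part static)
    """
    points = np.asarray(points, dtype=float)
    out = np.zeros_like(points)
    for k in range(weights.shape[1]):
        R = axis_angle_to_matrix(axes[k], angles[k])
        # Rotate about the axis through pivot k: x' = R (x - p_k) + p_k (row-vector form).
        moved = (points - pivots[k]) @ R.T + pivots[k]
        # Blend part k's motion into every point according to its weight for part k.
        out += weights[:, k:k + 1] * moved
    return out

# Two parts: a static base and a door hinged on the vertical axis through (1, 0, 0).
points = np.array([[0.0, 0.0, 0.0],    # base point
                   [2.0, 0.0, 0.0]])   # door point
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
pivots = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
axes = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
angles = np.array([0.0, np.pi / 2])
posed = joint_blend_skinning(points, weights, pivots, axes, angles)
print(np.round(posed, 6))
```

Opening the door by 90 degrees leaves the base point at the origin and moves the door point from (2, 0, 0) to (1, 1, 0). With hard (one-hot) weights this reduces to per-part rigid motion; soft weights are what make the segmentation differentiable.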
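
The abstract does not give the exact form of the cycle-reconstruction and cross-space alignment objectives. A plausible sketch, assuming Chamfer distance as the point-set loss, an additive decomposition canonical = template + residual with per-point correspondence, and a given rigid pose (R, t), is shown below. The function names and the three loss terms are hypothetical illustrations, not SCAPO's published losses.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def to_observed(canonical, R, t):
    """Canonical -> observed frame: x = R c + t (row-vector form)."""
    return canonical @ R.T + t

def to_canonical(observed, R, t):
    """Observed -> canonical frame, the exact inverse: c = R^T (x - t)."""
    return (observed - t) @ R

def sketch_losses(observed, template, residual, R, t):
    """Hypothetical loss terms; the abstract does not specify SCAPO's objectives."""
    canonical = template + residual                  # assumed additive decomposition
    # Canonical -> observed: the posed canonical shape should reproduce the observation.
    recon = chamfer_distance(to_observed(canonical, R, t), observed)
    # Observed -> canonical: the canonicalized observation should align with the shared template.
    align = chamfer_distance(to_canonical(observed, R, t), template)
    # Penalize large residuals so shared category geometry stays in the template.
    residual_reg = np.mean(np.sum(residual ** 2, axis=-1))
    return recon, align, residual_reg

rng = np.random.default_rng(0)
template = rng.normal(size=(64, 3))
residual = 0.05 * rng.normal(size=(64, 3))
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, -0.2, 1.0])
observed = to_observed(template + residual, R, t)
recon, align, reg = sketch_losses(observed, template, residual, R, t)
print(recon, align, reg)
```

Here the observation is synthesized from the template, residual, and pose, so the reconstruction term is essentially zero while the alignment and residual terms stay positive. In a learned model, R, t, and the residual would be network predictions and the template a learnable parameter, so all three terms would provide gradients.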

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC