SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition
Abstract:
Skeleton-based action recognition interprets human behavior from sequences of body joints, a task that becomes particularly difficult in one-shot scenarios where only a single labeled example exists for each new action class. The primary difficulty lies in learning representations that capture the hierarchical and compositional nature of human movement while staying aligned with high-level semantic meaning, despite severe data limitations. Current methods, which predominantly rely on low-level motion cues and Euclidean embeddings, do not adequately represent the tree-like structure inherent in skeletal data. This limitation hinders effective cross-modal alignment and reduces the model's ability to generalize to novel action categories.
To address these issues, we introduce SkelHCC, a comprehensive framework for one-shot skeleton-based action recognition that combines hyperbolic geometry with CLIP-driven cache adaptation. Central to this approach is the Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module, which maps both skeleton sequences and action language into a shared hyperbolic space. By exploiting the negative curvature and exponential volume expansion characteristic of hyperbolic geometry, EH-HCLIP inherently models the anatomical hierarchy of joints, parts, and the whole body, resulting in structurally coherent cross-modal representations. Furthermore, to facilitate efficient one-shot adaptation, SkelHCC incorporates a training-free, LLM-guided Multi-granularity Voting Cache (LMV-Cache) to enable context-aware inference. Our experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets indicate that SkelHCC consistently surpasses existing state-of-the-art techniques.
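The role of negative curvature mentioned above can be illustrated with a minimal sketch. The abstract does not say which model of hyperbolic space EH-HCLIP uses, so the Poincaré ball, the curvature parameter `c`, and the function names `expmap0` and `poincare_distance` are assumptions made for illustration only, not the authors' implementation. The sketch maps a Euclidean feature vector into the ball and measures geodesic distance, showing that the same Euclidean gap corresponds to a larger hyperbolic distance near the boundary, which is what leaves room for tree-like hierarchies.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Assumption: the Poincare ball model; the abstract does not name a model.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


# Same Euclidean gap (0.1), measured near the origin and near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])

# A Euclidean feature (hypothetical values) mapped into the ball.
embedded = expmap0([0.3, 0.4])
```

Under these assumptions, coarse concepts such as the whole body would sit near the origin and fine-grained ones such as individual joints nearer the boundary, where distances grow quickly; how SkelHCC actually places joints, parts, and the body is not specified in the abstract.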
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





