arXiv

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

June 2, 2026 · Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

Abstract

Developing a unified representation that links spoken language with gesture is a fundamental requirement for advancing co-speech gesture retrieval, synthesis, and comprehension. However, this task proves particularly difficult when dealing with semantically rich gestures, as their communicative purpose cannot be derived from movement data alone. Traditional methods that rely on direct contrastive alignment between text transcripts and continuous motion embeddings tend to prioritize low-level kinematic details, thereby overlooking the symbolic significance inherent in semantic gestures.

To address this, we introduce "semantic motion anchors," which serve as natural-language abstractions that encapsulate both the physical structure of a gesture and its communicative intent. Our approach involves breaking down 3D gestures into distinct body-hand motion primitives, converting these into structured verbal descriptions, and linking them to the accompanying transcript to offer auxiliary contrastive supervision.
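The auxiliary supervision described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the symmetric InfoNCE form, the choice to align motion embeddings with anchor-description embeddings as the auxiliary term, the `aux_weight` and `temperature` values, and all function names are assumptions made for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Guard against zero vectors so the division below is always defined.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [normalize(x) for x in queries]
    k = [normalize(x) for x in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def symmetric_info_nce(a, b, temperature=0.07):
    # Average the two retrieval directions (a -> b and b -> a).
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def anchored_loss(text_emb, motion_emb, anchor_emb, aux_weight=0.5, temperature=0.07):
    """Direct text-motion alignment plus a hypothetical auxiliary anchor term.

    anchor_emb stands in for embeddings of the natural-language gesture
    descriptions; how the paper actually wires this term is not stated
    in the abstract.
    """
    main = symmetric_info_nce(text_emb, motion_emb, temperature)
    aux = symmetric_info_nce(motion_emb, anchor_emb, temperature)
    return main + aux_weight * aux
```

In this sketch, row `i` of each embedding list belongs to the same speech-gesture pair, so the loss falls when matching rows are more similar to each other than to the other rows in the batch. Setting `aux_weight=0.0` recovers the direct text-motion baseline.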

Evaluations on the BEAT2 dataset demonstrate that our method increases text-to-gesture retrieval accuracy (R@1) by 8.2% compared to a baseline that aligns text and motion directly. Furthermore, it surpasses previous retrieval techniques in both text-to-gesture and gesture-to-text directions. Beyond standard aggregate metrics, the supervision provided by semantic motion anchors ensures that the retrieved gestures are semantically relevant to the spoken query, preventing the system from defaulting to generic movement patterns.
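For readers unfamiliar with the R@1 metric cited above, the following is a minimal sketch of Recall@K over a query-by-gallery similarity matrix. It assumes exactly one ground-truth gallery item per query, stored at the same index; the actual BEAT2 evaluation protocol may differ, and the function names are illustrative.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item appears in the top k.

    similarity[i][j] is the score of query i against gallery item j.
    Assumes the ground truth for query i is gallery item i.
    Ties are broken in favor of the lower gallery index.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(similarity):
    # Swap query and gallery roles, e.g. text-to-gesture -> gesture-to-text.
    return [list(col) for col in zip(*similarity)]


# Toy 3x3 example: rows are text queries, columns are gestures.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
text_to_gesture_r1 = recall_at_k(sim, k=1)
gesture_to_text_r1 = recall_at_k(transpose(sim), k=1)
```

In the toy matrix, the second text query ranks the wrong gesture first, so two of the three queries count as hits at K=1 in the text-to-gesture direction.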

In a downstream study focused on retrieval-augmented gesture generation, participants showed a strong preference for gestures generated using our method over those produced by a standard retrieval-augmented generation baseline. This outcome highlights that grounding retrieval processes in semantic meaning leads to generated gestures that more effectively convey the intended communicative message.
