arXiv

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract

Developing a unified representation that links spoken language with gesture is a fundamental requirement for advancing co-speech gesture retrieval, synthesis, and comprehension. However, this task proves particularly difficult when dealing with semantically rich gestures, as their communicative purpose cannot be derived from movement data alone. Traditional methods that rely on direct contrastive alignment between text transcripts and continuous motion embeddings tend to prioritize low-level kinematic details, thereby overlooking the symbolic significance inherent in semantic gestures.

To address this, we introduce "semantic motion anchors," which serve as natural-language abstractions that encapsulate both the physical structure of a gesture and its communicative intent. Our approach involves breaking down 3D gestures into distinct body-hand motion primitives, converting these into structured verbal descriptions, and linking them to the accompanying transcript to offer auxiliary contrastive supervision.
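The auxiliary contrastive supervision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the loss, so the symmetric InfoNCE form, the choice of which pairs are contrasted, the `temperature` value, and the `aux_weight` parameter are all assumptions made for illustration. Embeddings are plain lists of floats, and item `i` in each list is assumed to belong to the same clip.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] is expected to match keys[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine_similarity(query, key) / temperature for key in keys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(queries)


def symmetric_info_nce(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a contrastive losses."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def anchored_loss(text_emb, motion_emb, anchor_emb,
                  aux_weight=0.5, temperature=0.07):
    """Direct text-motion alignment plus an anchor-based auxiliary term.

    Assumption: the anchor description is contrasted against both the
    transcript and the motion embedding. The abstract only says anchors are
    linked to the transcript, so this pairing is hypothetical.
    """
    direct = symmetric_info_nce(text_emb, motion_emb, temperature)
    auxiliary = (
        symmetric_info_nce(text_emb, anchor_emb, temperature)
        + symmetric_info_nce(motion_emb, anchor_emb, temperature)
    )
    return direct + aux_weight * auxiliary
```

Setting `aux_weight=0` in this sketch recovers a direct text-motion contrastive objective, which is the kind of baseline the abstract compares against.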

Evaluations on the BEAT2 dataset demonstrate that our method increases text-to-gesture retrieval accuracy (R@1) by 8.2% compared to a baseline that aligns text and motion directly. Furthermore, it surpasses previous retrieval techniques in both text-to-gesture and gesture-to-text directions. Beyond standard aggregate metrics, the supervision provided by semantic motion anchors ensures that the retrieved gestures are semantically relevant to the spoken query, preventing the system from defaulting to generic movement patterns.
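For readers unfamiliar with the metric, R@1 (recall at rank 1) is the fraction of queries whose top-ranked candidate is the correct one. The sketch below shows one common way to compute it from a similarity matrix. It assumes a single ground-truth candidate per query, stored at the same index; the exact BEAT2 evaluation protocol is not given in the abstract, and the function names here are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match is among the top-k candidates.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap query and candidate roles, e.g. to score the reverse direction."""
    return [list(column) for column in zip(*matrix)]


# Toy text-to-gesture scores: rows are text queries, columns are gestures.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
text_to_gesture_r1 = recall_at_k(toy_scores, k=1)
gesture_to_text_r1 = recall_at_k(transpose(toy_scores), k=1)
```

In this toy matrix, two of the three text queries rank their own gesture first, so `text_to_gesture_r1` is 2/3.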

In a downstream study focused on retrieval-augmented gesture generation, participants showed a strong preference for gestures generated using our method over those produced by a standard retrieval-augmented generation baseline. This outcome highlights that grounding retrieval processes in semantic meaning leads to generated gestures that more effectively convey the intended communicative message.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
