DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction
Title: DyaPlex: A Full-Duplex Speech-Motion Model for Dyadic Interaction
Original: arXiv:2606.03874v1. Abstract: We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.
Rewritten:
We introduce DyaPlex, a streaming, full-duplex speech-and-motion model built for dyadic interaction. To mirror the continuous, reciprocal nature of human dialogue, its full-duplex design lets an agent perceive and generate both speech and physical motion simultaneously, in a streaming fashion. The approach builds on the strong priors of a foundational full-duplex speech model and adds a new motion pathway to achieve fully synchronized multi-modal interaction.
The model uses a dual-tower Transformer architecture. This design preserves the zero-shot conversational reasoning of a frozen base speech model while adding a deeply coupled, streaming motion pathway. Two mechanisms align the autoregressively generated motion with rich latent speech features: a unified dyadic token interleaving scheme, and a time-aligned speech-motion RoPE (rotary position embedding) that guides cross-attention.
Trained on the 4,000-hour Seamless Interaction dataset, the model captures cross-speaker dependencies and establishes new state-of-the-art performance on both monadic and dyadic human interaction benchmarks.
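
The abstract names dyadic token interleaving and a time-aligned speech-motion RoPE but gives no implementation details, so the following is only an illustrative sketch of one plausible reading, not the authors' method. It assumes that the two speakers' per-frame tokens are interleaved frame by frame, and that rotary positions are taken from timestamps in seconds so that speech and motion streams running at different frame rates share one time axis. All function names, the tuple layout, and the frame rates (10 Hz and 20 Hz) are hypothetical.

```python
import math


def interleave_dyadic(agent_tokens, partner_tokens):
    """Interleave two per-frame token streams as [a0, p0, a1, p1, ...].

    Hypothetical layout: each entry is (speaker, frame_index, token).
    """
    if len(agent_tokens) != len(partner_tokens):
        raise ValueError("both streams must cover the same number of frames")
    sequence = []
    for frame, (a, p) in enumerate(zip(agent_tokens, partner_tokens)):
        sequence.append(("agent", frame, a))
        sequence.append(("partner", frame, p))
    return sequence


def time_aligned_positions(num_frames, frame_rate_hz):
    """Return one position per frame, measured in seconds rather than in
    token index, so streams with different frame rates stay comparable."""
    return [i / frame_rate_hz for i in range(num_frames)]


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary position embedding to an even-length vector.

    Each consecutive pair (vec[k], vec[k + 1]) is rotated by the angle
    position * base ** (-k / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for k in range(0, d, 2):
        theta = position * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Assumed frame rates for illustration only.
SPEECH_RATE_HZ = 10.0
MOTION_RATE_HZ = 20.0

speech_pos = time_aligned_positions(8, SPEECH_RATE_HZ)
motion_pos = time_aligned_positions(16, MOTION_RATE_HZ)

# Speech frame 5 and motion frame 10 both fall at t = 0.5 s, so a motion
# query attending to that speech key sees a relative offset of zero.
query = rope_rotate([1.0, 2.0, 3.0, 4.0], motion_pos[10])
key = rope_rotate([0.5, -1.0, 2.0, 0.0], speech_pos[5])
score_same_time = dot(query, key)
```

Because a rotary embedding makes the query-key score depend only on the difference between the two positions, indexing both streams by time rather than by token count means that cross-attention between motion and speech is sensitive to how far apart two frames are in seconds, regardless of each stream's frame rate.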
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC






