Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production
Abstract:
Prior Sign Language Production (SLP) systems have predominantly depended on autoregressive decoding. While this method inherently maintains temporal causality, it is prone to error accumulation during inference. Conversely, newer diffusion-based techniques enhance generation fidelity via iterative denoising; however, their sequence-level refinement steps often result in significant latency. To resolve this conflict between speed and quality, we introduce HybridSign, a novel hybrid architecture designed for low-latency sign language production. This model integrates causal frame generation with flow-based diffusion refinement.
The framework incorporates a Multi-Scale Pose Representation module to capture detailed articulator features. It also uses a Confidence-Aware Causal Attention mechanism, which employs joint-level confidence scores to improve robustness against noisy 2D pose data. Evaluations on the PHOENIX14T and How2Sign datasets show that HybridSign achieves the best trade-off between quality and efficiency among the compared baselines. Specifically, on the How2Sign test set, the model achieved BLEU-1 and BLEU-4 scores of 30.12 and 6.48, respectively, alongside a Dynamic Time Warping (DTW) score of 3.89. Under a 60-frame evaluation protocol, HybridSign reduced the time-to-first-frame to 5.90 seconds while raising throughput to 10.17 FPS.
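For readers unfamiliar with the DTW score reported above, the following is a minimal sketch of classic Dynamic Time Warping between a reference and a generated pose sequence. It is an illustration only: the function names `frame_distance` and `dtw_distance`, the Euclidean per-frame cost, and the absence of any length or joint normalization are assumptions, and the abstract does not state which DTW variant or normalization the authors use.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two pose frames given as flat coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(ref, hyp):
    """Unnormalized DTW cost between two pose sequences.

    ref, hyp: lists of frames, each frame a flat list of joint coordinates.
    Assumption: plain DTW with steps (i-1, j), (i, j-1), (i-1, j-1);
    the paper's exact protocol may differ.
    """
    n, m = len(ref), len(hyp)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # acc[i][j] holds the minimal cost of aligning ref[:i] with hyp[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], hyp[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    return acc[n][m]


if __name__ == "__main__":
    reference = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    # Same motion with one frame held for an extra step.
    generated = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    print(dtw_distance(reference, generated))
```

Because DTW allows a frame to be matched more than once, a generated sequence that differs from the reference only in timing receives a low cost. A lower DTW score therefore indicates poses that are closer to the reference after temporal alignment.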
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





