Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument
Abstract
Sign Language Production (SLP) synthesizes avatar-based sign language motion from natural language text. Generated motion is conventionally assessed using the Fréchet Inception Distance (FID) in motion space and back-translation (BT) BLEU scores, particularly on benchmarks such as How2Sign. However, these metrics can improve substantially while the generator still fails to reproduce the actual gestures of the sign language.
To address this issue, this study introduces a three-tiered evaluation framework for generated motion: initial-pose conditioning ($\tau_1$), output diversity ($\tau_2$), and target faithfulness ($\tau_3$). These metrics are calculated as pairwise-distance ratios derived from the latent representations of a frozen motion autoencoder (MoAE).
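As a rough illustration of what a pairwise-distance ratio over latents can look like, the sketch below compares each generated latent with its own reference against all other references. It is not the paper's definition of $\tau_1$, $\tau_2$, or $\tau_3$: the function names, the matched-over-mismatched form of the ratio, and the random arrays standing in for MoAE latents are all assumptions made for illustration.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def matched_to_mismatched_ratio(gen, ref):
    """Hypothetical faithfulness-style ratio (an assumption, not the paper's formula).

    gen[i] and ref[i] are latents for the same text prompt. The result is the
    mean distance of matched pairs divided by the mean distance of mismatched
    pairs. Values well below 1 mean each output sits closer to its own
    reference than to the others; values near 1 mean the text is not
    distinguishing the outputs.
    """
    d = pairwise_dist(gen, ref)
    n = d.shape[0]
    matched = np.trace(d) / n
    mismatched = (d.sum() - np.trace(d)) / (n * (n - 1))
    return matched / mismatched


# Random vectors stand in for frozen-autoencoder latents (illustrative only).
rng = np.random.default_rng(0)
ref = rng.normal(size=(32, 16))

# A faithful generator: each output lands near its own reference.
faithful = ref + 0.01 * rng.normal(size=ref.shape)
faithful_ratio = matched_to_mismatched_ratio(faithful, ref)

# A collapsed generator: the same output regardless of the input text.
collapsed = np.tile(rng.normal(size=(1, 16)), (32, 1))
collapsed_ratio = matched_to_mismatched_ratio(collapsed, ref)

print(f"faithful ratio:  {faithful_ratio:.4f}")
print(f"collapsed ratio: {collapsed_ratio:.4f}")
```

In this toy setup the collapsed generator yields a ratio of 1 up to floating-point error, because every row of the distance matrix is identical, so matched and mismatched pairs are indistinguishable. A distribution-level score such as FID does not use this per-prompt pairing, which is one way such a score can move without faithfulness changing.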
We evaluated 14 SLP model checkpoints on the How2Sign dataset, including a re-implementation of Neural Sign Actors (NSA). None of them attains target faithfulness ($\tau_3$), even though FID varies by nearly two orders of magnitude and shows no correlation with faithfulness. In contrast, experiments on the isolated-gloss dataset ASL3DWord demonstrate that favorable $\tau_3$ values are achievable. This suggests that the limited size of the sentence-level paired dataset is the primary bottleneck in current SLP systems.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





