Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
Abstract: The goal of this work is a general method for aligning subtitles (spoken language text with precise timestamps) to continuous sign language video. Previous approaches have largely relied on end-to-end training tied to a specific language or dataset, which limits their generality. To address this, we introduce Segment, Embed, and Align (SEA), a single framework that works across multiple languages and domains. SEA uses two pre-trained models: the first segments a video frame sequence into individual signs, and the second embeds the video clip of each sign into a latent space shared with text. Alignment is then performed by a lightweight dynamic programming procedure that runs on a CPU and completes within a minute, even for hour-long episodes. SEA is flexible, adapting to settings that range from small lexicons to large continuous corpora. Experiments on four sign language datasets show that SEA achieves state-of-the-art alignment performance, highlighting its potential to produce high-quality parallel data for advancing sign language processing research. The code and models for SEA are publicly available.
Source: arXiv
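
The abstract describes the alignment step only as a lightweight dynamic programming procedure and gives no further detail. As a rough illustration of what such a step can look like, the sketch below runs a generic monotonic alignment over a similarity matrix between subtitle text embeddings and sign segment embeddings. It is not SEA's actual algorithm. The function name `align_monotonic`, the matrix layout, and the simplifying assumption that every sign segment belongs to exactly one subtitle (with each subtitle receiving at least one segment, in order) are all hypothetical choices made for this example.

```python
from math import inf


def align_monotonic(sim):
    """Assign contiguous, ordered runs of sign segments to subtitles.

    sim[i][j] is an assumed similarity score between subtitle i and sign
    segment j (for example, a cosine similarity in a shared embedding space).
    Returns one (start, end) segment index range per subtitle, end exclusive,
    chosen to maximise the total similarity.
    """
    n_subs, n_segs = len(sim), len(sim[0])
    if n_subs > n_segs:
        raise ValueError("need at least one segment per subtitle")

    # best[i][j]: best total score when the first j segments are covered by
    # the first i subtitles, with segment j-1 assigned to subtitle i-1.
    best = [[-inf] * (n_segs + 1) for _ in range(n_subs + 1)]
    # opens[i][j]: True if segment j-1 is the first segment of subtitle i-1.
    opens = [[False] * (n_segs + 1) for _ in range(n_subs + 1)]
    best[0][0] = 0.0

    for i in range(1, n_subs + 1):
        for j in range(i, n_segs + 1):
            extend = best[i][j - 1]        # subtitle i-1 keeps growing
            start = best[i - 1][j - 1]     # subtitle i-1 starts here
            opens[i][j] = start > extend
            best[i][j] = sim[i - 1][j - 1] + max(extend, start)

    # Walk back from the final cell to recover the segment range per subtitle.
    spans = []
    i, j, end = n_subs, n_segs, n_segs
    while i > 0:
        if opens[i][j]:
            spans.append((j - 1, end))
            end = j - 1
            i -= 1
        j -= 1
    spans.reverse()
    return spans


# Toy example: 3 subtitles, 5 detected sign segments.
toy_sim = [
    [0.9, 0.8, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.7, 0.2],
    [0.0, 0.1, 0.2, 0.3, 0.8],
]
print(align_monotonic(toy_sim))  # [(0, 2), (2, 4), (4, 5)]
```

Given segment timestamps from a segmentation model, each subtitle's new timing could then be read off as the start of its first segment and the end of its last. The table has one cell per subtitle and segment pair, so the cost grows with the product of the two counts, which is why a step of this kind is cheap enough to run on a CPU. A practical system would also need to handle segments that match no subtitle, which this sketch does not attempt.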


