arXiv

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

June 3, 2026 · Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

Abstract

Current Sign Language Translation (SLT) evaluation relies heavily on surface-form metrics such as BLEU and ROUGE. These measures reward lexical overlap but do not capture whether a translation preserves the meaning of the source sign sequence, a critical shortcoming given that the ultimate goal of SLT is integration into assistive technologies. To address this, we shift the focus from SLT to Sign Language Understanding (SLU), which prioritizes semantic comprehension. We assess systems on their capacity to extract key semantic details from input videos, such as specific actions and facts about people and objects.

To facilitate systematic evaluation, we introduce SLU-2K, a benchmark of 2,350 closed-ended video question-answer pairs derived from the widely used PHOENIX-2014T and CSL-Daily datasets. SLU-2K is built with an automated data generation pipeline that we developed and rigorously tested. Its questions span seven categories: actions, locations, numbers, objects, people, time, and weather conditions.

We demonstrate the utility of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two state-of-the-art SLT systems, MMSTL and SpaMo. MLLMs perform at near-random levels, underscoring the need for deeper integration of SLU into existing AI frameworks. Even state-of-the-art translation systems carefully fine-tuned on in-domain data show a significant semantic gap, with accuracy ranging from 56.7% to 75.2%. These findings suggest that traditional SLT evaluation protocols may overestimate genuine understanding. Consequently, future progress should be gauged not just by fluency and n-gram overlap but also by semantic accuracy. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K
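To make the evaluation protocol concrete, the following is a minimal sketch of how accuracy on closed-ended question-answer pairs could be scored per category and compared against a random-guess baseline. It is not the authors' released code. The `QAItem` record layout, the multiple-choice `options` field, and the function names are all assumptions made for illustration; the actual benchmark files in the repository may be organized differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# The seven question categories named in the abstract.
CATEGORIES = ("actions", "locations", "numbers", "objects",
              "people", "time", "weather")


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one closed-ended question (assumed layout)."""
    video_id: str
    category: str
    question: str
    options: Tuple[str, ...]   # candidate answers (assumed multiple choice)
    answer_index: int          # index of the correct option


def per_category_accuracy(items: Sequence[QAItem],
                          predictions: Sequence[int]) -> Dict[str, float]:
    """Fraction of correctly answered questions within each category."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, predicted_index in zip(items, predictions):
        total[item.category] += 1
        if predicted_index == item.answer_index:
            correct[item.category] += 1
    return {category: correct[category] / total[category] for category in total}


def chance_accuracy(items: Sequence[QAItem]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(1.0 / len(item.options) for item in items) / len(items)
```

Under these assumptions, a system would be described as "near-random" when its overall accuracy is close to the value returned by `chance_accuracy`, whereas scores such as the reported 56.7% to 75.2% would sit above that baseline but well below perfect accuracy.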
