arXiv

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

Abstract

Sign Language Translation (SLT) is currently evaluated mainly with surface-form metrics such as BLEU and ROUGE. These metrics reward lexical overlap but do not capture whether a translation preserves the meaning of the source sign sequence, a critical shortcoming given that the ultimate goal of SLT is integration into assistive technologies. To address this, this study shifts the focus from SLT to Sign Language Understanding (SLU), prioritizing semantic comprehension. We assess systems on their capacity to extract key semantic details from input videos, such as specific actions and facts about people and objects.

To facilitate systematic evaluation, we introduce SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs derived from the widely used PHOENIX-2014T and CSL-Daily datasets. SLU-2K is created with an automated data generation pipeline, which we developed and rigorously tested, and its questions span seven categories: actions, locations, numbers, objects, people, time, and weather conditions.

We demonstrate the utility of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two state-of-the-art SLT systems, MMSTL and SpaMo. The results indicate that MLLMs perform at near-random levels, underscoring the need for deeper integration of SLU into existing AI frameworks. Even the state-of-the-art translation systems, carefully fine-tuned on in-domain data, show a significant semantic gap, with accuracy ranging from 56.7% to 75.2%. These findings suggest that traditional SLT evaluation protocols may overestimate genuine understanding. Consequently, future progress should be gauged not only by fluency and n-gram overlap but also by semantic accuracy. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K
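As a rough illustration of how closed-ended question answering can be scored against a chance baseline, the sketch below computes overall accuracy, per-category accuracy, and the expected accuracy of a uniform random guesser. It is not the benchmark's released evaluation code: the record keys (`category`, `options`, `answer`, `prediction`) and the number of answer options per question are assumptions made only for this example.

```python
from collections import defaultdict


def score(records):
    """Return (overall accuracy, per-category accuracy, chance-level accuracy).

    Each record is a dict with hypothetical keys: 'category', 'options'
    (candidate answers), 'answer' (gold option) and 'prediction'.
    """
    if not records:
        raise ValueError("no records to score")
    correct = defaultdict(int)
    total = defaultdict(int)
    chance = 0.0
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
        # Expected accuracy of a uniform random guesser on this question.
        chance += 1.0 / len(r["options"])
    n = len(records)
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / n
    return overall, per_category, chance / n


# Toy records (invented for illustration, not taken from SLU-2K).
toy = [
    {"category": "weather", "options": ["rain", "snow", "sun", "fog"],
     "answer": "rain", "prediction": "rain"},
    {"category": "weather", "options": ["rain", "snow", "sun", "fog"],
     "answer": "snow", "prediction": "sun"},
    {"category": "numbers", "options": ["2", "5", "7", "9"],
     "answer": "5", "prediction": "5"},
    {"category": "people", "options": ["a", "b", "c", "d"],
     "answer": "a", "prediction": "b"},
]
overall, per_category, chance = score(toy)
print(overall, per_category, chance)
```

Reporting the chance level next to accuracy makes a "near-random" result directly readable, and the per-category breakdown shows which of the seven question types account for the gap.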


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
