arXiv

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Abstract:

Accurate assessment of human motion understanding is a cornerstone of progress in animation, robotics, and embodied AI. Yet current benchmarks are hindered by imprecise semantic granularity, uniform difficulty levels, subpar annotation quality, and widespread answer ambiguity, which makes them ineffective at pinpointing specific model failures. To address these shortcomings, we present NextMotionQA, a robust benchmark whose dataset is built semi-automatically with vision-language models (VLMs) and then verified by experts.

NextMotionQA comprises three distinct tasks: fine-grained error correction, video captioning, and multiple-choice question answering. These tasks are organized along three fundamental semantic axes and divided into three tiers of complexity. Through an extensive evaluation of twelve prominent VLMs, we identify significant capability gaps and weaknesses that conventional, single-task assessments typically overlook.

In a parallel investigation, we examine the efficacy of VLMs as judges for text-to-motion generation, a practice gaining traction in recent research. We analyze whether these models' judgments degrade as the judging task becomes more challenging. Our findings indicate that while VLMs agree well with expert ratings on broad criteria (Cohen's $\kappa=0.70$), their reliability collapses when they are required to make fine-grained, part-level judgments (Cohen's $\kappa=0.10$). This result confirms the utility of the VLM-as-judge paradigm in high-confidence scenarios while clearly delineating its limitations.
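For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's $\kappa$ between an expert and a VLM judge could be computed. It uses the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. The function name `cohens_kappa` and the label lists are illustrative assumptions; they are not the paper's evaluation code or data.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary judgments (NOT data from the paper).
expert = ["good", "good", "bad", "bad", "good", "bad", "good", "bad", "good", "bad"]
vlm = ["good", "good", "bad", "bad", "good", "bad", "good", "good", "bad", "bad"]
kappa = cohens_kappa(expert, vlm)
print(f"kappa = {kappa:.2f}")
```

In this toy example the two raters agree on 8 of 10 items and chance agreement is 0.5, so $\kappa$ is about 0.6. A value near 0.10, as reported for part-level judgments, means agreement is barely above chance even if raw agreement looks moderate.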


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
