arXiv

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

June 2, 2026 · Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

Title: Optimizing Long-Tailed Egocentric Error Identification Through Understanding-Driven Model Synergy

Abstract

This study addresses the detection of incorrect user actions in egocentric video. We introduce the Understanding-Enhanced Model Collaboration Method (UE-MCM), a framework that combines efficient, coarse-grained video understanding with precise, fine-grained action reasoning. UE-MCM has two branches: a lightweight small-model branch and a more capable large-model branch. The large-model branch judges whether a specific fine-grained action is executed incorrectly. The small-model branch processes both the coarse-grained video context and the fine-grained segment, so it can flag actions that appear correct in isolation but conflict with the broader workflow.

Architecturally, the small-model branch uses a CLIP4Clip video encoder initialized from a CLIP model improved via Diffusion Contrastive Reconstruction. The large-model branch uses the Qwen3-VL Embedding model to extract high-dimensional representations of the specific action segments. A lightweight collaboration gate then dynamically combines the predictions of the two branches. To handle the long-tailed distribution of mistake instances, we refine the classifiers with a set of complementary objectives: reweighted cross-entropy, AUC-oriented learning, and label-aware adjustments. The resulting method balances computational efficiency with precision and is effective at detecting subtle, uncommon, and ambiguous errors in egocentric instructional videos.
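
As an illustration only, the collaboration gate and the long-tailed objectives described above can be sketched in plain Python. The abstract does not specify the gate's inputs or the exact form of any loss, so everything here is an assumption: the gate is taken to be a sigmoid over a linear function of the two branch probabilities, the label-aware adjustment is taken to be a prior-based logit shift, and the AUC-oriented term is taken to be a pairwise squared-hinge surrogate. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_mistake_prob(p_small, p_large, gate_weights=(0.0, 0.0), gate_bias=0.0):
    """Blend two branch probabilities with a scalar gate in (0, 1).

    Assumption: the gate sees only the two branch probabilities. The
    source says only that the gate is lightweight and dynamic.
    """
    w_small, w_large = gate_weights
    g = sigmoid(w_small * p_small + w_large * p_large + gate_bias)
    # g close to 1 trusts the large branch, g close to 0 the small branch.
    return g * p_large + (1.0 - g) * p_small


def reweighted_adjusted_bce(logit, label, prior_pos, tau=1.0):
    """Inverse-frequency weighted binary cross-entropy on a shifted logit.

    Assumption: "label-aware adjustment" is modeled as adding the
    log-odds of the mistake prior to the logit during training.
    """
    adjusted = logit + tau * math.log(prior_pos / (1.0 - prior_pos))
    p = sigmoid(adjusted)
    if label == 1:
        return -(1.0 / prior_pos) * math.log(p)
    return -(1.0 / (1.0 - prior_pos)) * math.log(1.0 - p)


def pairwise_auc_loss(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge surrogate that penalizes poorly ranked (pos, neg) pairs."""
    total = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            total += max(0.0, margin - (s_pos - s_neg)) ** 2
    return total / (len(pos_scores) * len(neg_scores))
```

With the default zero gate weights and bias, the gate equals 0.5 and the blend reduces to a plain average of the two branches; a trained gate would shift that weight per sample. In the loss sketch, a rare mistake class (small `prior_pos`) receives both a larger weight and a logit penalty, which is one common way to keep frequent correct actions from dominating training.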

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC