Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
Title: Optimizing Long-Tailed Egocentric Error Identification Through Understanding-Driven Model Synergy
Abstract
This study tackles the challenge of identifying incorrect user actions in egocentric video footage. To address it, we introduce the Understanding-Enhanced Model Collaboration Method (UE-MCM), a framework that combines efficient, broad-scope video comprehension with precise, detailed action reasoning. UE-MCM operates through two distinct pathways: a lightweight small model branch and a robust large model branch. The large branch is dedicated to determining whether a specific fine-grained action is executed incorrectly. In contrast, the small branch processes both the coarse-grained video context and the fine-grained segment to spot actions that appear correct in isolation but conflict with the broader workflow.
Architecturally, the small model branch relies on a CLIP4Clip video encoder, initialized from a CLIP model improved via Diffusion Contrastive Reconstruction. The large model branch uses the Qwen3-VL Embedding model to derive high-dimensional representations of the specific action segments. A lightweight collaboration gate then dynamically combines the predictions of the two branches. To handle the long-tailed distribution of mistake instances, we refine the classifiers with a set of complementary objectives: reweighted cross-entropy, AUC-oriented learning, and label-aware adjustments. This approach balances computational efficiency with precision and is effective at detecting subtle, uncommon, and ambiguous errors in egocentric instructional videos.
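As an illustration of the collaboration gate and the long-tail objectives named above, the following is a minimal sketch and not the authors' implementation. It assumes each branch emits a single mistake logit, that the gate is a sigmoid over the two logits, that reweighting uses inverse class frequency, that the AUC-oriented term is a pairwise squared-hinge surrogate, and that the label-aware adjustment is a prior-based logit shift. All function names and parameters (`gated_fusion`, `gate_w`, `tau`, and so on) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(small_logit, large_logit, gate_w=(0.0, 0.0), gate_b=0.0):
    """Blend two branch logits with a scalar gate g in (0, 1).

    Assumed form: g = sigmoid(w . [z_small, z_large] + b),
    fused = g * z_small + (1 - g) * z_large.
    """
    g = sigmoid(gate_w[0] * small_logit + gate_w[1] * large_logit + gate_b)
    return g * small_logit + (1.0 - g) * large_logit, g


def class_weights(n_pos: int, n_neg: int):
    """Inverse-frequency weights so the rare (mistake) class counts more."""
    total = n_pos + n_neg
    return total / (2.0 * n_pos), total / (2.0 * n_neg)


def reweighted_bce(logit, label, w_pos, w_neg, prior_pos=None, tau=1.0):
    """Class-reweighted binary cross-entropy for one sample.

    If prior_pos is given, the logit is shifted by tau * log-odds of the
    class prior during training (one common label-aware adjustment; an
    assumption here, the source does not specify the exact form).
    """
    if prior_pos is not None:
        logit = logit + tau * math.log(prior_pos / (1.0 - prior_pos))
    p = min(max(sigmoid(logit), 1e-12), 1.0 - 1e-12)
    return -w_pos * math.log(p) if label == 1 else -w_neg * math.log(1.0 - p)


def pairwise_auc_loss(pos_scores, neg_scores, margin=1.0):
    """Squared-hinge AUC surrogate: penalize every (mistake, correct) pair
    whose score gap is smaller than the margin."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    return sum(max(0.0, margin - (p - n)) ** 2 for p, n in pairs) / len(pairs)


if __name__ == "__main__":
    fused, g = gated_fusion(2.0, -2.0)      # untrained gate: equal blend
    w_pos, w_neg = class_weights(10, 90)    # 10% of segments are mistakes
    print(fused, g, w_pos, w_neg)
    print(reweighted_bce(fused, 1, w_pos, w_neg, prior_pos=0.1))
    print(pairwise_auc_loss([1.5, 0.2], [0.0, -1.0]))
```

In a trained system the gate parameters would be learned jointly with the classifiers; the zero-initialized gate here simply averages the two branches.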
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




