arXiv

Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

June 3, 2026 · Jessica Wenninger, Gabriel Skantze

Title: Comparing Face and Body Tracking for Human-Robot Interaction: A New Egocentric Dataset

Abstract

To facilitate genuine human-robot interaction (HRI), robots must maintain continuous awareness of user engagement through consistent longitudinal tracking. However, prevailing computer vision models are primarily tailored for surveillance and autonomous driving contexts. These general-purpose systems struggle with the unique egocentric challenges posed by social robots, where users may bounce, block one another, or exit the visual field. Such scenarios frequently result in identity switches (IDSW), causing the robot to lose track of the conversation partner.

In response, this study introduces a novel, custom-annotated egocentric dataset gathered using the Furhat robot, designed to capture intricate social dynamics. We conduct a systematic evaluation that separates detection errors from tracking logic, contrasting face and body tracking methods, and examining the influence of extended spatial memory and appearance re-identification (ReID).

Our findings reveal that while expanded spatial memory helps handle long-term occlusions, it remains ineffective against complex dynamic events. Conversely, integrating ReID successfully resolves intricate identity switches but produces divergent results: it significantly enhances the stability of body tracking, yet triggers a spike in facial IDSW due to sensitivity to profile angles. Ultimately, our refined pipeline decreases IDSW by 49%, thereby reducing interaction failures. Since conventional benchmarks do not account for dense, close-proximity occlusions, this research underscores the necessity of datasets that natively capture social dynamics to properly validate HRI perception models.
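As an illustration of the IDSW metric discussed above, the following is a minimal sketch of how identity switches can be counted, loosely following the common CLEAR MOT convention: a switch is recorded whenever a ground-truth person is matched to a different tracker ID than the one they were last matched to. It assumes the per-frame matching between ground-truth and predicted tracks (for example by bounding-box overlap) has already been done. The function name, the input layout, and the sample data are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Dict, Hashable, List


def count_identity_switches(frames: List[Dict[Hashable, Hashable]]) -> int:
    """Count identity switches over a sequence of frames.

    Each frame is a dict mapping a ground-truth person ID to the tracker ID
    it was matched to in that frame (assumed input format). A person who is
    unmatched in a frame (e.g. occluded or out of view) is simply absent
    from that frame's dict.
    """
    last_match: Dict[Hashable, Hashable] = {}  # last tracker ID seen per person
    switches = 0
    for matches in frames:
        for gt_id, pred_id in matches.items():
            previous = last_match.get(gt_id)
            # A switch occurs when the person reappears under a different
            # tracker ID than the one they were last matched to.
            if previous is not None and previous != pred_id:
                switches += 1
            last_match[gt_id] = pred_id
    return switches


# Hypothetical example: "bob" is occluded in frame 2 and returns with a new
# tracker ID; "alice" changes tracker ID in the last frame.
frames = [
    {"alice": 1, "bob": 2},
    {"alice": 1},
    {"alice": 1, "bob": 3},
    {"alice": 2, "bob": 3},
]
print(count_identity_switches(frames))
```

Because the last known match is remembered across gaps, a person who leaves the field of view and returns under a new tracker ID still counts as a switch, which is the failure mode the abstract associates with occlusions and exits from the visual field.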
