Reconstructing Objects along Hand Interaction Timelines in Egocentric Video
Title: Reconstructing Objects Along Hand Interaction Timelines in Egocentric Video
Abstract: We introduce ROHIT, the task of Reconstructing Objects along Hand Interaction Timelines. We define the Hand Interaction Timeline (HIT) from the perspective of a rigid object: the object is initially static relative to the scene, its pose then changes once a hand makes contact, it is held firmly while in use, and it is finally released and becomes static relative to the scene again. We model these pose constraints and propose the Constrained Optimisation and Propagation (COP) framework, which propagates the object’s pose along the HIT to improve reconstruction quality. We focus on timelines that contain stable grasps, in which the hand holds the object steadily and maintains consistent contact throughout the interaction. This focus enables efficient annotation, analysis, and evaluation of object reconstruction in video without 3D ground truth. We evaluate ROHIT on two egocentric datasets: HOT3D and the in-the-wild EPIC-Kitchens. For HOT3D, we curate 1,200 clips of stable grasps. For EPIC-Kitchens, we annotate 2,400 clips of stable grasps covering 390 object instances across nine categories, drawn from videos of daily interactions in 141 distinct environments. In the absence of 3D ground truth, we use 2D projection error to assess reconstruction accuracy. Quantitatively, COP improves stable grasp reconstruction by 6.2% to 11.3% and HIT reconstruction by up to 24.5% through constrained pose propagation.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





