TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
Abstract: Establishing a globally consistent 4D representation of individuals alongside their surrounding environments is a critical component of holistic perception. Yet, previous methodologies often rely on single-view data or treat humans, scenes, and cameras as separate entities, which prevents the accurate recovery of coherent geometry, stable movement patterns, and trajectories that align with physical laws. To address these constraints, we define a novel objective: the unified reconstruction of humans, scenes, and cameras using multi-view video inputs. This approach seeks to simultaneously estimate dynamic human figures, static environmental geometry, and camera poses within a single global coordinate system.
In response, we present TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a comprehensive framework designed specifically for this challenge. The architecture comprises two main components: a Human Branch, which utilizes temporal and spatial reasoning to model human dynamics, and a Scene Branch, which reconstructs static geometry by employing human-aware attention mechanisms. These two branches are integrated through a global alignment and optimization module that ensures scale consistency, incorporates contact priors, and maintains cross-view temporal coherence.
Our evaluation on the EgoHuman and EgoExo4D datasets indicates that TROPHIES produces 4D reconstructions that are both globally aligned and physically plausible. Furthermore, the method consistently surpasses current state-of-the-art approaches in terms of global fidelity and the consistency between human and scene elements.
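The abstract does not say how the global alignment module enforces scale consistency or applies contact priors. As a minimal illustration only, the sketch below assumes that the human reconstruction is metric while the scene reconstruction is known only up to scale, so that a single scale factor can be fitted in closed form by least squares, and that a contact prior can be written as a penalty on foot height above a ground plane at height 0. The function names `estimate_scale` and `contact_penalty`, their inputs, and the ground-plane convention are hypothetical and are not taken from TROPHIES.

```python
from typing import Sequence


def estimate_scale(scene_depths: Sequence[float],
                   human_depths: Sequence[float]) -> float:
    """Closed-form least-squares scale s minimizing sum((s * d_scene - d_human)^2).

    Assumption: human_depths are metric, scene_depths are up-to-scale,
    and the two sequences refer to the same points.
    """
    if len(scene_depths) != len(human_depths) or not scene_depths:
        raise ValueError("need two non-empty sequences of equal length")
    num = sum(ds * dh for ds, dh in zip(scene_depths, human_depths))
    den = sum(ds * ds for ds in scene_depths)
    if den == 0.0:
        raise ValueError("scene depths are all zero")
    return num / den


def contact_penalty(foot_heights: Sequence[float],
                    in_contact: Sequence[bool]) -> float:
    """Mean squared foot height over frames flagged as ground contact.

    Assumption: the ground plane sits at height 0 in the global frame.
    Frames not flagged as contact are ignored.
    """
    heights = [h for h, c in zip(foot_heights, in_contact) if c]
    if not heights:
        return 0.0
    return sum(h * h for h in heights) / len(heights)
```

For example, `estimate_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` returns `2.0`, meaning the scene would be scaled by a factor of two to match the metric human estimates. In a joint optimization, a term like `contact_penalty` would typically be one weighted summand alongside reprojection and temporal smoothness terms.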
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





