OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance
Title: OrthoPhys: Achieving Physically Plausible Video Generation via Orthogonal-View Geometry Guidance
Abstract: While recent advances in video generation have markedly improved visual fidelity, maintaining physically consistent motion remains a significant hurdle. This limitation stems from a fundamental disconnect: real-world object movement occurs in three-dimensional space, whereas video footage offers only partial, view-dependent snapshots of these dynamics. To overcome this, we introduce OrthoPhys, a two-stage framework that utilizes geometry guidance from orthogonal views to guarantee physical plausibility. Rather than producing unstructured 2D clips directly, the first stage creates synchronized orthogonal videos featuring four distinct viewpoints of foreground dynamics. Through a geometry-enhanced attention mechanism applied across these views, the model ensures 3D spatial coherence and implicitly grounds motion in physical attributes. In the second stage, these physically accurate orthogonal foregrounds act as strict guidance to generate the final, complete video, effectively capturing the interplay between foreground motion and background context. To facilitate this training approach, we developed PhysMV, a comprehensive dataset comprising 40,000 scenes, with each scene offering four orthogonal perspectives, totaling 160,000 video sequences. Comprehensive experiments indicate that OrthoPhys substantially outperforms current video generation techniques in terms of spatial-temporal coherence and physical realism. Project page: https://anonymous.4open.science/w/Phys4D/.
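The abstract does not specify how the geometry-enhanced attention is built. As a rough illustration of the general idea of attention applied across views, the sketch below lets every token attend over the tokens of all four views using plain scaled dot-product attention. The function name `cross_view_attention`, the use of tokens as their own queries, keys, and values, and the omission of any geometry term are all assumptions made for illustration; this is not the authors' mechanism.

```python
import math


def cross_view_attention(views):
    """Plain joint attention over tokens from several views (illustrative only).

    views: list of views, each a list of token vectors (lists of floats).
    Assumes every view has the same number of tokens and the same dimension.
    Tokens serve as their own queries, keys, and values (no learned
    projections, no geometry term).
    """
    # Flatten all views into one token sequence so attention spans views.
    tokens = [tok for view in views for tok in view]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    mixed_tokens = []
    for query in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of all tokens, taken across every view.
        mixed = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        mixed_tokens.append(mixed)

    # Regroup the flat output back into per-view lists.
    per_view = len(views[0])
    return [mixed_tokens[i * per_view:(i + 1) * per_view] for i in range(len(views))]
```

Because each output token is a weighted average over all views, information from one viewpoint can influence the others, which is the property the abstract relies on for 3D spatial coherence. A geometry-aware variant would presumably modify the scores or the token features, but the abstract gives no detail on that.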
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





