Multi-view Pyramid Transformer: Look Coarser to See Broader
Original: arXiv:2512.07806v2 Announce Type: replace
Abstract: We introduce the Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture capable of directly reconstructing extensive 3D scenes from tens to hundreds of images in a single forward pass. Guided by the principle of "looking broader to see the whole, looking finer to see the details," MVP relies on two foundational design elements: 1) a local-to-global inter-view hierarchy that systematically expands the model's viewpoint from localized perspectives to groups, and finally to the entire scene, and 2) a fine-to-coarse intra-view hierarchy that initiates with detailed spatial representations and progressively consolidates them into compact, information-rich tokens. This dual hierarchical structure ensures both computational efficiency and representational depth, facilitating rapid reconstruction of large and intricate scenes. We evaluate MVP across various datasets and demonstrate that, when integrated with 3D Gaussian Splatting as the underlying 3D representation, it delivers state-of-the-art generalizable reconstruction quality while preserving high efficiency and scalability across a broad spectrum of view configurations.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
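
To make the efficiency argument in the abstract concrete, the following is a minimal counting sketch of why pairing a local-to-global inter-view hierarchy with a fine-to-coarse intra-view hierarchy can keep attention cost manageable. Everything numeric here is an assumption chosen for illustration: the three-stage schedule, the group sizes, the downsampling factors, the 32x32 token grid, and the function names are hypothetical and are not taken from the paper. It counts query-key pairs for a single attention layer per stage and ignores everything else the actual model does.

```python
def attention_pairs(num_views, tokens_per_view, group_size):
    """Query-key pairs for one attention layer when views attend only
    within disjoint groups of `group_size` views."""
    num_groups = num_views // group_size
    tokens_per_group = group_size * tokens_per_view
    return num_groups * tokens_per_group ** 2


def dual_hierarchy_pairs(num_views, base_tokens, stages):
    """Sum attention pairs over stages.

    Each stage is (group_size, downsample), where `downsample` is the
    per-side spatial reduction applied to each view's token grid.
    This schedule format is an illustrative assumption.
    """
    total = 0
    for group_size, downsample in stages:
        tokens_per_view = base_tokens // (downsample ** 2)
        total += attention_pairs(num_views, tokens_per_view, group_size)
    return total


num_views = 64
base_tokens = 32 * 32  # hypothetical fine-resolution token grid per view

# Hypothetical schedule: single view at full resolution, then groups of
# 8 views at half resolution, then all views at quarter resolution.
stages = [(1, 1), (8, 2), (num_views, 4)]

hier = dual_hierarchy_pairs(num_views, base_tokens, stages)
flat = attention_pairs(num_views, base_tokens, num_views)  # all views, fine tokens

print(f"hierarchical pairs: {hier:,}")
print(f"flat global pairs:  {flat:,}")
print(f"ratio flat / hierarchical: {flat / hier:.1f}x")
```

Under these assumed settings, the widest attention window (all views) only ever sees the coarsest tokens, so the three stages together cost far fewer pairs than a single layer of global attention over fine tokens. The real savings depend on the actual stage layout and layer counts, which this sketch does not model.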





