ZipSplat: Fewer Gaussians, Better Splats
Abstract:
Existing feed-forward 3D Gaussian Splatting techniques reconstruct scenes from posed or pose-free images in a single forward pass. However, these methods typically assign one Gaussian to each input pixel, which ties representational capacity to camera resolution rather than to the actual complexity of the scene. Consequently, a simple flat wall receives the same number of Gaussians as a complex, textured object covering the same number of pixels, despite their very different geometric requirements.
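The pixel-aligned budget described above can be illustrated with a short arithmetic sketch. The function name, view count, and resolution below are hypothetical values chosen for illustration, not figures from the paper.

```python
def pixel_aligned_gaussian_count(num_views: int, height: int, width: int) -> int:
    """One Gaussian per input pixel: the count depends only on image size,
    not on what the images actually show."""
    return num_views * height * width


# Hypothetical setting: two input views at 256x256 resolution.
flat_wall = pixel_aligned_gaussian_count(num_views=2, height=256, width=256)
textured_object = pixel_aligned_gaussian_count(num_views=2, height=256, width=256)

# Both scenes receive the same budget, regardless of geometric complexity.
print(flat_wall, textured_object)
```

Doubling the image resolution along each axis quadruples the count in this model, even if the scene itself is unchanged.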
To address this, we introduce ZipSplat, a token-driven feed-forward architecture that decouples Gaussian placement from the pixel grid. A multi-view backbone first extracts dense visual tokens, which are then compressed into a compact set of scene tokens via k-means clustering. These scene tokens are refined through cross- and self-attention before a lightweight MLP decodes each of them into a group of Gaussians with free-form 3D coordinates. Since clustering occurs at inference time, the number of scene tokens can be changed after training, so a single trained model can trade quality against efficiency without retraining.
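The token compression step can be sketched as plain Lloyd's k-means over token vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the distance metric, initialization, or iteration count, so Euclidean distance, random initialization, and a fixed number of iterations are assumed here, and the function names are hypothetical. The attention refinement and the MLP decoder are omitted.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_compress(tokens, num_scene_tokens, num_iters=10, seed=0):
    """Compress dense tokens into `num_scene_tokens` centroids (Lloyd's k-means).

    Assumptions (not from the paper): Euclidean distance, centroids
    initialized from randomly sampled tokens, fixed iteration count.
    """
    rng = random.Random(seed)
    centroids = [list(t) for t in rng.sample(tokens, num_scene_tokens)]
    dim = len(tokens[0])
    for _ in range(num_iters):
        sums = [[0.0] * dim for _ in range(num_scene_tokens)]
        counts = [0] * num_scene_tokens
        # Assignment step: attach each token to its nearest centroid.
        for token in tokens:
            nearest = min(
                range(num_scene_tokens),
                key=lambda k: squared_distance(token, centroids[k]),
            )
            counts[nearest] += 1
            for d in range(dim):
                sums[nearest][d] += token[d]
        # Update step: move each centroid to the mean of its tokens.
        # Empty clusters keep their previous centroid.
        for k in range(num_scene_tokens):
            if counts[k] > 0:
                centroids[k] = [s / counts[k] for s in sums[k]]
    return centroids


# Toy 2-D "tokens"; real backbone tokens would be high-dimensional features.
dense_tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]

# The cluster count is an inference-time argument, so the same inputs can
# be compressed to different budgets without changing any learned weights.
coarse = kmeans_compress(dense_tokens, num_scene_tokens=2)
fine = kmeans_compress(dense_tokens, num_scene_tokens=4)
print(len(coarse), len(fine))
```

Because `num_scene_tokens` is just a function argument in this sketch, it mirrors how a clustering step that runs at inference can expose the quality-efficiency trade-off without retraining.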
ZipSplat requires neither ground-truth camera poses nor intrinsics. It establishes a new state of the art on both DL3DV and RealEstate10K while using approximately six times fewer Gaussians than pixel-aligned methods. Specifically, it outperforms the leading pose-free baseline in PSNR by 2.1 dB on DL3DV and 1.2 dB on RealEstate10K. The model also demonstrates strong zero-shot generalization to Mip-NeRF360 and ScanNet++, surpassing all comparable baselines in these settings.
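To give a rough sense of what the reported PSNR gains mean, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) to convert a dB gain into the corresponding reduction in mean squared error. The function names are hypothetical, and the conversion is only approximate here: it assumes both methods are evaluated on the same images with the same peak value, whereas reported PSNR figures are typically averages over many images.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (gain_db / 10.0)


# Gains quoted in the abstract, expressed as approximate MSE reduction factors.
print(round(mse_ratio_for_gain(2.1), 2))  # DL3DV gain
print(round(mse_ratio_for_gain(1.2), 2))  # RealEstate10K gain
```

Under these assumptions, a 2.1 dB gain corresponds to roughly 1.6 times lower MSE, and a 1.2 dB gain to roughly 1.3 times lower MSE.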
Our project page is available at https://veichta.com/zipsplat.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




