Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis
Abstract: Modern vehicle platforms carry sensor suites that include LiDAR, calibrated multi-camera rigs, and ego-motion data, all of which provide signals for re-rendering driving scenes from novel viewpoints. Recent work applies video diffusion models to generate plausible novel views from sparse vehicle observations. Existing methods, however, often use only a subset of the available data, so quality degrades as the target trajectory diverges from the recorded path.
This study treats the problem as a multi-sensor fusion task that integrates sparse LiDAR reprojections for metric geometry, surround-view imagery for appearance, and camera poses for cross-view alignment. The proposed framework, StreetNVS, is a video diffusion model that conditions on these three signals through a Reference-Enhanced Camera Attention module with relative ray-level positional encoding. The model is trained with a two-stage curriculum that progressively reduces LiDAR density.
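To make the LiDAR conditioning and the density curriculum concrete, here is a minimal sketch. It is not the StreetNVS implementation: the function names, the linear decay, the 5% floor, and the pinhole projection are all assumptions chosen for illustration. It shows points being thinned according to a two-stage schedule and then reprojected into a sparse depth map for one camera.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward


def keep_fraction(step: int, stage1_steps: int, stage2_steps: int,
                  dense: float = 1.0, sparse: float = 0.05) -> float:
    """Hypothetical schedule: full density in stage 1, linear decay in stage 2."""
    if step < stage1_steps:
        return dense
    progress = min(1.0, (step - stage1_steps) / max(1, stage2_steps))
    return dense + progress * (sparse - dense)


def subsample(points: Sequence[Point], fraction: float,
              rng: random.Random) -> List[Point]:
    """Randomly keep roughly `fraction` of the points (at least one if any exist)."""
    if not points:
        return []
    count = min(len(points), max(1, round(len(points) * fraction)))
    return rng.sample(list(points), count)


def project_to_depth(points: Sequence[Point], fx: float, fy: float,
                     cx: float, cy: float, width: int,
                     height: int) -> Dict[Tuple[int, int], float]:
    """Pinhole-project points to pixels, keeping the nearest depth per pixel."""
    depth: Dict[Tuple[int, int], float] = {}
    for x, y, z in points:
        if z <= 0.0:  # behind the camera, cannot be observed
            continue
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            depth[(u, v)] = min(z, depth.get((u, v), z))
    return depth
```

In a training loop under these assumptions, each step would call `keep_fraction(step, ...)`, pass the result to `subsample`, and feed the output of `project_to_depth` to the model as the geometry signal, so that late-stage training sees the same sparse input expected at inference time.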
Evaluation on the Waymo Open Dataset shows that StreetNVS outperforms state-of-the-art baselines under sparse LiDAR conditioning and performs comparably to methods that use point clouds 10 to 100 times denser. The framework can also synthesize coherent videos along extreme out-of-trajectory paths, including elevation changes, lane shifts, pullbacks, and rotations.
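The four out-of-trajectory edits named above can each be written as a rigid transform applied to a recorded camera pose. The sketch below is illustrative only. It assumes 4x4 camera-to-world matrices and an ego frame with x forward, y left, and z up. The helper names and the offsets in `EDITS` are hypothetical and are not values taken from the paper.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major

IDENTITY: Matrix = [[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(dx: float, dy: float, dz: float) -> Matrix:
    """Pure translation in the ego frame (metres)."""
    return [[1.0, 0.0, 0.0, dx],
            [0.0, 1.0, 0.0, dy],
            [0.0, 0.0, 1.0, dz],
            [0.0, 0.0, 0.0, 1.0]]


def yaw(degrees: float) -> Matrix:
    """Rotation about the ego z (up) axis."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def perturb(pose: Matrix, dx: float = 0.0, dy: float = 0.0,
            dz: float = 0.0, yaw_deg: float = 0.0) -> Matrix:
    """Apply an ego-frame offset to a recorded camera-to-world pose."""
    return matmul(pose, matmul(translation(dx, dy, dz), yaw(yaw_deg)))


# Hypothetical magnitudes, one per edit type listed in the abstract.
EDITS: Dict[str, Dict[str, float]] = {
    "elevation": {"dz": 2.0},
    "lane_shift": {"dy": 3.5},
    "pullback": {"dx": -5.0},
    "rotation": {"yaw_deg": 30.0},
}


def target_poses(recorded: List[Matrix], edit: str) -> List[Matrix]:
    """Build a target trajectory by applying one edit to every recorded pose."""
    return [perturb(pose, **EDITS[edit]) for pose in recorded]
```

Because the offset is right-multiplied, it is expressed in the vehicle's own frame, so a pullback always moves the camera backward along the current heading, whichever way the vehicle faces in world coordinates.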
Project website: https://streetnvs.github.io
Source: arXiv





