HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Title: HumanNOVA: Rapid, Universal, and Photorealistic 3D Human Avatar Synthesis from Single Images
Abstract:
We introduce HumanNOVA, a framework for photorealistic, universal, and rapid 3D human avatar generation from a single RGB image. Achieving both high-fidelity realism and broad generalization has historically been difficult because diverse, high-quality 3D human datasets are scarce. To overcome this data bottleneck, we build a scalable data generation pipeline based on two strategies. First, we animate existing rigged assets with a wide variety of everyday poses. Second, we apply fitting techniques to multi-camera human captures to synthesize additional, diverse viewpoints for training. Combining the two, we expand our dataset to 100,000 assets, substantially increasing both the volume and the variety of data available for training.
Architecturally, HumanNOVA uses a feed-forward, token-conditioned design that performs inference in under one second, with no test-time optimization. It first encodes the input image and an estimated SMPL body mesh, a simplified human mesh that carries no detailed geometry or appearance, into compact tokens. These tokens act as conditioning signals and are integrated through cross-attention to build a triplane-based 3D avatar representation. Experiments across multiple benchmarks show that our method outperforms existing approaches both quantitatively and qualitatively and remains robust across a wide range of input image conditions. For more details, visit the project page at https://HumanNOVA.github.io.
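The token-conditioned design described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that the image tokens and SMPL tokens are concatenated into one conditioning sequence, and that a set of learnable query tokens (one per triplane cell) attends to that sequence through a single cross-attention head. All names, sizes, and the random stand-ins for encoder outputs are hypothetical, and pure Python is used only to keep the sketch self-contained; a real model would use a tensor library and many attention layers.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends to all context tokens."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose keys
    scores = matmul(q, k_t)                       # (num_queries x num_context)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
DIM = 8   # token width (hypothetical)
RES = 4   # triplane resolution per side (hypothetical)

# Stand-ins for the two encoders; the real encoders are not specified here.
image_tokens = random_matrix(6, DIM, rng)
smpl_tokens = random_matrix(5, DIM, rng)
context = image_tokens + smpl_tokens   # assumption: conditioning tokens are concatenated

# Assumption: one learnable query token per triplane cell (3 planes x RES x RES).
triplane_queries = random_matrix(3 * RES * RES, DIM, rng)
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))

updated, attn = cross_attention(triplane_queries, context, w_q, w_k, w_v)

# Reshape the flat token list into three RES x RES feature planes.
planes = [
    [updated[(p * RES + r) * RES:(p * RES + r + 1) * RES] for r in range(RES)]
    for p in range(3)
]
```

Because the queries are fixed in number, the output triplane has the same size regardless of how many conditioning tokens the encoders produce, which is one reason a cross-attention design suits a single feed-forward pass.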
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





