GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation
Abstract:
Current methods that employ vision-language models (VLMs) as prompt encoders for conditioning generative models often depend on costly end-to-end training or compress features into representations that strip away the dense spatial structure essential for geometry-centric tasks, such as 3D asset creation. To overcome these limitations, we introduce GAP3D, a modular, diffusion-based technique that aligns VLM-derived latents directly with the full, patch-level feature space of a pre-trained image encoder. This approach allows a frozen downstream generative model to leverage a VLM for prompting while preserving a spatially structured conditioning signal. In evaluations focused on 3D asset generation, our method eliminates the requirement for extensive 3D datasets by relying primarily on general-domain image-text pairs for training. Notably, it displays emergent zero-shot capabilities for multimodal prompts, even though it is trained exclusively on text inputs. Although GAP3D currently emphasizes high-level semantics rather than fine-grained details, it shows that the representational divide between VLM and image-encoder feature spaces can be partially closed via diffusion-based alignment. This work marks an initial step toward the modular integration of foundation models into dense embedding spaces through generative alignment.
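The abstract does not spell out the training objective, so the following is only a minimal sketch of what a diffusion-based alignment step could look like, assuming a standard noise-prediction loss: patch-level image-encoder embeddings are noised, and a denoiser conditioned on VLM latents is trained to predict that noise. Every name here (`cosine_alpha_bar`, `noise_patches`, `toy_denoiser`, `alignment_loss`), the cosine schedule, the mean pooling of VLM tokens, and the linear denoiser are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (cosine schedule; an assumed choice)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2

def noise_patches(patches, alpha_bar, rng):
    """Forward diffusion per patch: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in patches]
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [[s * x + n * e for x, e in zip(row, erow)]
             for row, erow in zip(patches, eps)]
    return noisy, eps

def toy_denoiser(noisy, vlm_tokens, w_self, w_cond):
    """Hypothetical stand-in for the alignment network.

    Mean-pools the VLM tokens, projects them into the patch-embedding space
    with w_cond, and adds the result to a scaled copy of every noisy patch.
    """
    d_vlm = len(vlm_tokens[0])
    pooled = [sum(tok[k] for tok in vlm_tokens) / len(vlm_tokens)
              for k in range(d_vlm)]
    d_img = len(noisy[0])
    proj = [sum(pooled[k] * w_cond[k][j] for k in range(d_vlm))
            for j in range(d_img)]
    return [[w_self * v + proj[j] for j, v in enumerate(row)] for row in noisy]

def alignment_loss(patches, vlm_tokens, t, num_steps, w_self, w_cond, rng):
    """Mean squared error between the predicted and the true noise."""
    noisy, eps = noise_patches(patches, cosine_alpha_bar(t, num_steps), rng)
    eps_hat = toy_denoiser(noisy, vlm_tokens, w_self, w_cond)
    sq = [(p - e) ** 2
          for prow, erow in zip(eps_hat, eps)
          for p, e in zip(prow, erow)]
    return sum(sq) / len(sq)

# Toy shapes: 4 patches of dimension 3, conditioned on 2 VLM tokens of dimension 5.
rng = random.Random(0)
n_patches, d_img, n_tokens, d_vlm = 4, 3, 2, 5
patches = [[rng.gauss(0.0, 1.0) for _ in range(d_img)] for _ in range(n_patches)]
vlm_tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_vlm)] for _ in range(n_tokens)]
w_cond = [[0.0] * d_img for _ in range(d_vlm)]  # untrained projection
loss = alignment_loss(patches, vlm_tokens, 50, 100, 0.0, w_cond, rng)
```

The point of the sketch is that the diffusion target keeps its full patch grid (one vector per patch) rather than being collapsed into a single pooled vector, which is the spatially structured conditioning signal the abstract refers to. In a real system the linear denoiser would be replaced by a learned network, and the generated patch embeddings would be passed to the frozen downstream 3D generator.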
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




