SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

June 2, 2026 · Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

Original: arXiv:2602.09153v2

Abstract:

While simulation has emerged as a critical instrument for the large-scale training and assessment of domestic robots, prevailing virtual environments struggle to replicate the physical intricacy and variety found in actual indoor settings. Existing scene synthesis techniques typically yield sparsely decorated rooms that lack the dense clutter, articulated furnishings, and physical attributes necessary for effective robotic manipulation. To address this gap, we present SceneSmith, a hierarchical agentic framework that creates simulation-ready indoor environments from natural language inputs. The system builds scenes in successive stages (architectural layout, furniture arrangement, and small object placement), each managed by an interaction between three VLM agents: the designer, the critic, and the orchestrator. The framework combines asset creation via text-to-3D synthesis for static items, dataset retrieval for articulated objects, and estimation of physical properties. SceneSmith produces three to six times more objects than previous approaches while keeping inter-object collisions below 2% and 96% of objects stable during physics simulation. In a user study involving 205 participants, the system achieved win rates of 92% for average realism and 91% for average prompt faithfulness when compared to baseline methods. Furthermore, we show that these generated environments can be integrated into an end-to-end pipeline for the automated evaluation of robot policies.
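
The staged designer/critic/orchestrator interaction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the agents exchange information, so the function signatures, the score threshold, the round limit, and the stub agents below are all assumptions made purely for illustration. In the real system the designer and critic would be VLM agents; here they are plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# The three stages named in the abstract, in order.
STAGES = ["architectural layout", "furniture arrangement", "small object placement"]


@dataclass
class Scene:
    prompt: str
    objects: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


# Hypothetical agent interfaces (assumed, not taken from the paper).
Designer = Callable[[Scene, str, str], List[str]]                 # (scene, stage, feedback) -> proposal
Critic = Callable[[Scene, str, List[str]], Tuple[float, str]]     # -> (score in [0, 1], feedback)


def orchestrate(prompt: str, designer: Designer, critic: Critic,
                threshold: float = 0.8, max_rounds: int = 3) -> Scene:
    """Assumed orchestrator policy: per stage, loop designer -> critic until
    the critic's score reaches the threshold or the round limit is hit."""
    scene = Scene(prompt)
    for stage in STAGES:
        feedback = ""
        proposal: List[str] = []
        for round_idx in range(1, max_rounds + 1):
            proposal = designer(scene, stage, feedback)
            score, feedback = critic(scene, stage, proposal)
            scene.log.append(f"{stage} round {round_idx}: score={score:.2f}")
            if score >= threshold:
                break
        # Commit the last proposal for this stage before moving on.
        scene.objects.extend(proposal)
    return scene


def stub_designer(scene: Scene, stage: str, feedback: str) -> List[str]:
    # Stand-in for a VLM designer: adds one more item when given feedback.
    items = [f"{stage}: base"]
    if feedback:
        items.append(f"{stage}: revised")
    return items


def stub_critic(scene: Scene, stage: str, proposal: List[str]) -> Tuple[float, str]:
    # Stand-in for a VLM critic: rejects proposals with fewer than two items.
    if len(proposal) < 2:
        return 0.5, "add more detail"
    return 0.9, ""


if __name__ == "__main__":
    result = orchestrate("a cluttered home office", stub_designer, stub_critic)
    for line in result.log:
        print(line)
```

With these stubs, every stage takes two rounds: the critic rejects the first proposal, the designer revises it, and the orchestrator accepts the revision. The threshold and round limit are the knobs one would tune in such a loop; how SceneSmith actually decides when a stage is complete is not stated in the abstract.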

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC