MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
Title: MetaPoint: Enabling Accurate Spatial Management in Agentic Visual Generation
Abstract:
Generative visual models still struggle with precise spatial control. The limitation stems from a fundamental disconnect: these models can interpret textual descriptions of space, but they cannot directly map numerical coordinates onto a 2D image canvas. To address this, we present MetaPoint, an approach that closes the gap by encoding a continuous 2D coordinate as a single, unique token. MetaPoint adds no new architectural components. Instead, it relies on the model’s existing positional encoding mechanism to interpret the coordinate, effectively treating the token as a virtual point on the canvas. This enables pixel-level positioning of an object with one token, or the definition of a bounding box with two tokens, without custom attention masking. MetaPoint tokens are designed to be compositional and function as spatial primitives, so a planner agent can break down complex, high-level user instructions into a structured sequence of these primitives for the generator. By offering a simple, accurate, and scalable foundation for spatial control, MetaPoint supports the development of more robust compositional generative agents and intuitive, interactive editing workflows.
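The core idea of the abstract, a token whose position is a continuous 2D coordinate read through the positional encoding, can be illustrated with a minimal sketch. The abstract does not say which positional encoding the underlying model uses, so the sketch assumes a standard 2D sinusoidal encoding as a stand-in. The names `PointToken`, `point_encoding`, and `box_tokens` are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def axis_encoding(pos: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one continuous position into `dim` values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    out = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def point_encoding(x: float, y: float, dim: int = 8) -> List[float]:
    """2D encoding: first half of the channels encodes x, second half y.

    Because sin/cos accept any real number, the same function that encodes
    integer patch positions also encodes fractional coordinates.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return axis_encoding(x, half) + axis_encoding(y, half)


@dataclass(frozen=True)
class PointToken:
    """Hypothetical spatial primitive: one token placed at a continuous (x, y).

    Coordinates are assumed to be in patch-grid units and may be fractional.
    """
    x: float
    y: float

    def encoding(self, dim: int = 8) -> List[float]:
        return point_encoding(self.x, self.y, dim)


def box_tokens(x0: float, y0: float, x1: float, y1: float) -> Tuple[PointToken, PointToken]:
    """A bounding box expressed as two point tokens (assumed: opposite corners)."""
    return PointToken(x0, y0), PointToken(x1, y1)


if __name__ == "__main__":
    anchor = PointToken(3.5, 2.25)          # lies between patch centers
    top_left, bottom_right = box_tokens(1.0, 2.0, 5.5, 6.5)
    print(len(anchor.encoding()), top_left, bottom_right)
```

In this sketch a point token needs no separate coordinate embedding table: its location is carried entirely by the same encoding function applied to image patches, which is one plausible reading of "treating the token as a virtual point on the canvas". A planner agent, under the same assumptions, would emit a sequence of such `PointToken` values alongside the text prompt.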
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






