MAOAM: Unified Object and Material Selection with Vision-Language Models
Title: MAOAM: A Unified Approach to Object and Material Selection Using Vision-Language Models
Abstract: Precise selection is a fundamental component of interactive image editing. To be practical, a selection system must let users specify and refine target regions through either text descriptions or clicks. It should also go beyond selecting whole objects and support criteria such as material type, which is essential for applications like surface re-texturing or editing specific material instances. Current selection methods based on vision-language models (VLMs) are generally object-centric and support only a single interaction mode, which restricts their utility. To address these limitations, we introduce Mask Any Object And Material (MAOAM), a unified framework for accurate selection at both the object and material levels that supports both text and click interactions. MAOAM employs a VLM equipped with a segmentation head to generate pixel-accurate masks from user prompts. The VLM interprets the user's intent (whether the target is an object or a material) while encoding visual entities, attributes, and spatial relationships; the segmentation head then translates the resulting output token into a precise mask. A significant hurdle in this domain is the scarcity of datasets with text-annotated material selections. To overcome this, we developed a scalable data generation pipeline that uses both real and synthetic images with material masks and employs VLMs to create detailed material descriptions enriched with visual-semantic information. MAOAM is trained with a multi-task objective that covers both click-based and text-based selection, supplemented by an auxiliary visual question answering (VQA) task derived from the material descriptions to strengthen the model's understanding of material properties. Although the model is trained on unimodal prompts, it shows an emergent ability to improve selection accuracy when text and clicks are combined at inference time, which supports flexible editing workflows. Our experiments confirm that MAOAM delivers accurate and coherent selections across a wide variety of objects, materials, and interaction contexts, underscoring its robustness in real-world scenarios.
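The abstract says the segmentation head translates the VLM's output token into a mask, but it does not describe the mechanism. One common design in VLM-based segmentation scores each pixel by the dot product between the token embedding and a per-pixel feature vector, then thresholds the result. The sketch below illustrates that idea only; it is not MAOAM's confirmed implementation, and the names `decode_mask`, `seg_token`, and `pixel_features` are hypothetical.

```python
def decode_mask(seg_token, pixel_features):
    """Turn one token embedding into a binary mask (illustrative assumption).

    seg_token:      list of D floats, the embedding of the VLM's output token.
    pixel_features: H x W grid (nested lists) of D-dimensional pixel features.
    Returns an H x W grid of 0/1 values.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Dot product between the token embedding and this pixel's feature.
            logit = sum(t * f for t, f in zip(seg_token, feat))
            # logit > 0 is equivalent to sigmoid(logit) > 0.5.
            mask_row.append(1 if logit > 0 else 0)
        mask.append(mask_row)
    return mask


# Toy 2 x 2 image with 2-dimensional features (made-up values).
seg_token = [1.0, -1.0]
pixel_features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.5], [-1.0, 1.0]],
]
mask = decode_mask(seg_token, pixel_features)
print(mask)
```

In this toy setup, pixels whose features align with the token embedding are selected. A real head would operate on learned, high-resolution feature maps and would typically be trained with a mask loss.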
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





