AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Title: AFUN: Pioneering an Affordance Foundation Model for Functionality Comprehension
Abstract: The comprehension of affordances serves as a critical link between visual perception and physical action, offering an interpretable framework for robotic manipulation within complex, unstructured, and open-world settings. Despite its importance, the development of a comprehensive affordance foundation model—one that not only identifies the location and manner of interaction but also maintains robust generalization across varied environments, objects, and tasks—has long posed a significant research hurdle. Current approaches typically fall short by addressing only fragments of this problem: some localize relevant areas without defining executable movements, while others predict motion but lack scalability.
In this study, we introduce AFUN, a step toward an affordance foundation model for functionality understanding. Given a single RGB-D image and a textual task description, AFUN predicts a task-specific functional mask that indicates where to interact and a 3D post-contact motion curve that specifies how the interaction proceeds. To support generalization in open-world scenarios, we build a large-scale, standardized data pipeline that converts diverse data sources (robot logs, human demonstrations, simulations, and real-world scans) into a unified affordance schema with language tags, masks, and object-centric 3D motion labels.
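As a rough illustration of what one record in such a unified affordance schema might hold, the sketch below pairs a language tag with a binary mask and a list of 3D waypoints. The class name `AffordanceRecord`, its field names, the `SOURCES` tags, and the validation rules are all hypothetical and are not taken from the paper; they only mirror the three label types and four data sources named in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical source tags mirroring the four data sources named in the abstract.
SOURCES = {"robot_log", "human_demo", "simulation", "real_scan"}

Point3D = Tuple[float, float, float]


@dataclass
class AffordanceRecord:
    """One training example in an assumed unified schema (illustrative only)."""
    task: str                    # language tag, e.g. "open the drawer"
    mask: List[List[bool]]       # H x W functional mask: where to interact
    motion_curve: List[Point3D]  # post-contact waypoints, assumed object-centric
    source: str                  # which kind of data the record came from

    def validate(self) -> None:
        """Raise ValueError if the record is structurally inconsistent."""
        if not self.task.strip():
            raise ValueError("task description is empty")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if not self.mask or not self.mask[0]:
            raise ValueError("mask is empty")
        width = len(self.mask[0])
        if any(len(row) != width for row in self.mask):
            raise ValueError("mask rows have different lengths")
        if len(self.motion_curve) < 2:
            raise ValueError("motion curve needs at least two waypoints")
        if any(len(p) != 3 for p in self.motion_curve):
            raise ValueError("every waypoint must have three coordinates")
```

A record built from any of the four sources would pass through the same `validate` call, which is the point of normalizing heterogeneous data into one schema before training.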
We evaluate AFUN along three dimensions. In affordance segmentation, it outperforms all baselines by a substantial margin on eight test sets drawn from four benchmarks, with mean gIoU and cIoU gains of +23.9% and +26.3%, respectively. In contact-point prediction, it improves the hit rate by 12.7% to 61.3% over the strongest baseline. It also achieves the best performance on all three 3D motion test sets. Notably, AFUN can be deployed directly for real-world robotic manipulation without embodiment-specific fine-tuning or task-specific heuristics, demonstrating its ability to adapt to affordance tasks in open-world settings.
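The abstract does not define its metrics, so the following is a minimal sketch under common conventions from the segmentation literature: gIoU is taken as the mean of per-image IoUs, cIoU as the cumulative intersection divided by the cumulative union over the whole test set, and a contact point counts as a hit when it lands inside the ground-truth mask. These definitions, the function names, and the choice to score an empty prediction against an empty ground truth as 1.0 are assumptions, not details confirmed by the paper.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    return inter, union


def giou_ciou(preds, gts):
    """gIoU: mean of per-image IoU. cIoU: summed intersection / summed union."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def hit_rate(points, gts):
    """Fraction of predicted (row, col) contact points inside the ground-truth mask."""
    hits = sum(1 for (r, c), gt in zip(points, gts) if gt[r][c])
    return hits / len(points)


# Two tiny 2x2 examples: the first prediction over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, gts)
rate = hit_rate([(0, 0), (1, 1)], gts)
```

Under these definitions gIoU weights every image equally, while cIoU is dominated by images with large masks, which is one reason the two are usually reported together.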
Project page: https://www.zhaoningwang.com/AFUN
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





