arXiv

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Abstract: The comprehension of affordances serves as a critical link between visual perception and physical action, offering an interpretable framework for robotic manipulation within complex, unstructured, and open-world settings. Despite its importance, the development of a comprehensive affordance foundation model—one that not only identifies the location and manner of interaction but also maintains robust generalization across varied environments, objects, and tasks—has long posed a significant research hurdle. Current approaches typically fall short by addressing only fragments of this problem: some localize relevant areas without defining executable movements, while others predict motion but lack scalability.

In this study, we introduce AFUN, a step toward an affordance foundation model for functionality understanding. Given a single RGB-D image and a textual task description, AFUN predicts a task-specific functional mask indicating where interaction should occur and a 3D post-contact motion curve specifying how the interaction proceeds. To support generalization in open-world scenarios, we build a large-scale, standardized data pipeline that converts diverse data sources (robot logs, human demonstrations, simulations, and real-world scans) into a unified affordance schema with language tags, masks, and object-centric 3D motion labels.
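
As an illustration only, one record of the unified affordance schema described above might be represented as follows. This is a minimal sketch: the class name `AffordanceSample`, its field names, the list-based mask format, and the validation rules are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in an object-centric frame (assumed)


@dataclass
class AffordanceSample:
    """One record of a unified affordance schema (hypothetical field names)."""

    task_text: str                # language tag / task description
    mask: List[List[int]]         # binary functional mask as rows of 0/1
    motion_curve: List[Point3D]   # 3D post-contact waypoints
    source: str                   # e.g. "robot_log", "human_demo", "simulation", "scan"

    def validate(self) -> None:
        """Raise ValueError if the record is structurally inconsistent."""
        if not self.task_text.strip():
            raise ValueError("task_text must be non-empty")
        if len({len(row) for row in self.mask}) != 1:
            raise ValueError("mask must be non-empty with equal-length rows")
        if any(v not in (0, 1) for row in self.mask for v in row):
            raise ValueError("mask values must be 0 or 1")
        if len(self.motion_curve) < 2:
            raise ValueError("motion_curve needs at least two waypoints")


sample = AffordanceSample(
    task_text="open the drawer",
    mask=[[0, 1], [0, 1]],
    motion_curve=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)],
    source="simulation",
)
sample.validate()  # passes silently for a well-formed record
```

Normalizing every data source into a single record type of this kind is what would let heterogeneous logs, demonstrations, and scans be mixed in one training set.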

We evaluate AFUN along three dimensions. In affordance segmentation, it outperforms all baseline methods by a substantial margin across eight test sets drawn from four benchmarks, with mean gIoU and cIoU improvements of +23.9% and +26.3%, respectively. In contact-point prediction, it achieves hit rates 12.7% to 61.3% higher than the strongest baseline. It also achieves the best performance on all three 3D motion test sets. Notably, AFUN can be deployed directly for real-world robotic manipulation without fine-tuning for specific robot embodiments or relying on task-specific heuristics, demonstrating its capacity to adapt to affordance tasks in open-world contexts.
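
For readers unfamiliar with the two segmentation metrics, the sketch below assumes the definitions commonly used in the reasoning-segmentation literature: gIoU as the mean of per-image IoUs, and cIoU as the cumulative intersection divided by the cumulative union over the whole test set. The abstract does not spell these out, so the definitions, the function names, and the convention for empty masks are assumptions.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 (illustrative format)


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a non-empty list of prediction/ground-truth pairs."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumed convention: two empty masks count as a perfect match.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, gts)  # per-image IoUs are 0.5 and 1.0
```

Under these definitions gIoU weights every image equally, while cIoU is dominated by images with large masks, which is why the two are usually reported together.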

Project page: https://www.zhaoningwang.com/AFUN


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
