arxiv:2606.02551

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Published on Jun 1 · Submitted by Zhaoning Wang on Jun 2

Abstract

AFUN, an affordance understanding model, predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments.

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7-61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
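
The abstract reports segmentation gains in gIoU and cIoU without defining them. The sketch below assumes the convention common in language-conditioned segmentation benchmarks: gIoU is the mean of per-sample IoUs, and cIoU is the cumulative intersection over the cumulative union across the whole test set. The function name `giou_ciou` and the handling of empty masks are illustrative assumptions, not taken from the paper, and the paper's evaluation code may differ.

```python
import numpy as np


def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) for paired binary masks.

    preds, gts: equal-length sequences of array-likes; each pair shares a shape.
    Assumed convention (not confirmed by the paper):
      gIoU = mean over samples of intersection / union
      cIoU = sum of intersections / sum of unions over all samples
    """
    per_sample_iou = []
    total_intersection = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        intersection = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumption: a sample where both masks are empty counts as a perfect match.
        per_sample_iou.append(1.0 if union == 0 else intersection / union)
        total_intersection += intersection
        total_union += union
    giou = float(np.mean(per_sample_iou))
    ciou = total_intersection / total_union if total_union > 0 else 1.0
    return giou, ciou


# Two toy 2x2 samples: the first over-segments by one pixel, the second is exact.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
example_giou, example_ciou = giou_ciou(example_preds, example_gts)
```

Under this convention the two metrics weight errors differently: gIoU treats every sample equally, while cIoU is dominated by samples with large masks, which is one reason papers often report both.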

Community

Paper submitter

Humans glance at any object and instantly know where to act and how. AFUN is an affordance foundation model that gives robots the same ability. From a single RGB-D image and a language command, it jointly predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to move), trained on one of the largest real-world affordance datasets to date, spanning robot, human, simulation, and 3D-scan sources. It takes a step toward open-world generalization, with state-of-the-art results across segmentation, contact-point, and 3D-motion benchmarks. It also deploys directly to a real robot, executing manipulation tasks like opening and grasping without any robot-specific fine-tuning.
