arxiv:2603.23478

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Published on Mar 24 · Submitted by Jiaying Lin on Mar 26

Abstract

AI-generated summary: UniFunc3D enables 3D scene functionality segmentation by treating multimodal large language models as active observers that perform joint semantic, temporal, and spatial reasoning through adaptive frame selection.

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.
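Since the code has not yet been released, the following is only a minimal Python sketch of what the abstract describes: an MLLM acting as an active observer that first ranks candidate frames by relevance (temporal grounding), then localizes coarsely on the full frame and zooms in to segment the fine-grained part (spatial grounding), before lifting the result into 3D. Every identifier here (`Frame`, `relevance`, `locate`, `crop`, `lift_to_3d`) is a hypothetical placeholder rather than the authors' API, and the sketch decomposes into separate calls what the paper consolidates into a single forward pass.

```python
# Hypothetical sketch of UniFunc3D-style active coarse-to-fine grounding.
# Based only on the abstract: every name and interface below is an assumption,
# and the real method fuses these steps into one MLLM forward pass rather
# than the separate calls shown here.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Frame:
    image: object   # a rendered view of the 3D scene (e.g., a numpy array)
    pose: object    # camera pose, needed to back-project 2D masks into 3D

def active_coarse_to_fine(
    frames: List[Frame],
    query: str,                                  # e.g., "open the window"
    relevance: Callable[[object, str], float],   # MLLM: does this view show the part?
    locate: Callable[[object, str], object],     # MLLM: 2D mask/box for the part
    crop: Callable[[object, object], object],    # zoom into a predicted region
    lift_to_3d: Callable[[List[Tuple[Frame, object]]], object],
    k: int = 2,                                  # how many views to refine
) -> object:
    """Return a 3D functionality mask for `query` from a frame sequence."""
    # Temporal grounding: actively rank frames by MLLM-judged relevance,
    # rather than the passive, heuristic sampling the paper criticizes.
    ranked = sorted(frames, key=lambda f: relevance(f.image, query), reverse=True)

    # Spatial grounding, coarse to fine: localize on the full frame first
    # (global context for disambiguation), then zoom so the fine-grained
    # interactive part (handle, knob, button) is seen in high detail.
    masks_2d: List[Tuple[Frame, object]] = []
    for frame in ranked[:k]:
        region = locate(frame.image, query)              # coarse 2D region
        zoomed = crop(frame.image, region)               # high-detail view
        masks_2d.append((frame, locate(zoomed, query)))  # refined 2D mask

    # Aggregate: back-project the refined 2D masks with camera poses to
    # obtain the final 3D segmentation of the interactive element.
    return lift_to_3d(masks_2d)
```

The coarse-then-zoom step is the crux: fine-grained interactive elements often cover only a few pixels in a full frame, so re-querying the model on a crop trades field of view for the detail needed to produce a precise mask.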

Community

Paper author · Paper submitter

UniFunc3D is a training-free framework that enables AI agents to accurately segment interactive 3D object parts from natural-language instructions. By treating a multimodal large language model as an "active observer," it uses a coarse-to-fine strategy to adaptively zoom in on relevant details while maintaining global scene context. This approach significantly outperforms previous baselines, including training-based methods, achieving a relative 59.9% mIoU improvement on the SceneFun3D benchmark.
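Note that 59.9% is a relative gain, not an absolute jump in mIoU points: the previous best score is multiplied by roughly 1.6. A tiny sketch of the arithmetic, where the 20.0 baseline is a made-up placeholder and not a number from the paper:

```python
# "Relative 59.9% mIoU improvement" means new = baseline * (1 + 0.599).
baseline_miou = 20.0                  # hypothetical prior best; NOT from the paper
new_miou = baseline_miou * (1 + 0.599)
print(round(new_miou, 2))             # 31.98 -> here only ~12 mIoU points absolute
```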



Get this paper in your agent:

hf papers read 2603.23478
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models, datasets, Spaces, and collections citing this paper: none yet.

Cite arxiv.org/abs/2603.23478 in a model, dataset, or Space README.md, or add the paper to a collection, to link it from this page.