EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
Abstract
EVA01 enables native 3D mesh integration in multimodal language models through a Mixture-of-Transformers architecture that aligns semantic and geometric manifolds for improved generation and editing capabilities.
This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{und}$) and a structurally mirrored Generation Expert ($E_{gen}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01
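The abstract's phrase "shared global self-attention with hard modality routing" can be illustrated with a small NumPy sketch of one MoT-style layer. This is not the authors' implementation. It assumes details the abstract does not give: a single attention head, no normalization or masking, a ReLU feed-forward block, and two modality ids (0 for tokens handled by the Understanding Expert, 1 for tokens handled by the Generation Expert). The names `init_expert`, `route`, and `mot_layer` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_expert(rng, d_model, d_ff):
    # Hypothetical per-expert parameters: attention projections plus a
    # two-layer feed-forward block. Both experts share this structure
    # ("structurally mirrored") but hold separate weights.
    s = 1.0 / np.sqrt(d_model)
    return {
        "wq": rng.normal(0.0, s, (d_model, d_model)),
        "wk": rng.normal(0.0, s, (d_model, d_model)),
        "wv": rng.normal(0.0, s, (d_model, d_model)),
        "wo": rng.normal(0.0, s, (d_model, d_model)),
        "w1": rng.normal(0.0, s, (d_model, d_ff)),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)),
    }


def route(x, modality, experts, key):
    # Hard routing: each token is multiplied by exactly one expert's
    # weight matrix, selected by its modality id (no soft gating).
    out = np.zeros((x.shape[0], experts[0][key].shape[1]))
    for idx, params in enumerate(experts):
        mask = modality == idx
        out[mask] = x[mask] @ params[key]
    return out


def mot_layer(x, modality, experts):
    # x: (seq_len, d_model); modality: (seq_len,) integer expert ids.
    d_model = x.shape[1]
    q = route(x, modality, experts, "wq")
    k = route(x, modality, experts, "wk")
    v = route(x, modality, experts, "wv")
    # Shared global self-attention: every token attends over the full
    # mixed-modality sequence, which is what couples the two experts.
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v
    h = x + route(attn, modality, experts, "wo")
    # Modality-specific feed-forward block with a residual connection.
    hidden = np.maximum(route(h, modality, experts, "w1"), 0.0)
    return h + route(hidden, modality, experts, "w2")


rng = np.random.default_rng(0)
experts = [init_expert(rng, 8, 16), init_expert(rng, 8, 16)]  # [E_und, E_gen]
tokens = rng.normal(size=(5, 8))
modality = np.array([0, 0, 1, 1, 1])  # two text tokens, three 3D tokens
out = mot_layer(tokens, modality, experts)
```

In this sketch the experts interact only through the attention step: the feed-forward weights of one expert never touch the other modality's tokens, while keys and values from both modalities are visible to every query.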
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniMesh: Unifying 3D Mesh Understanding and Generation (2026)
- Feedforward 3D Editing Learns from Semantic-Part Transformation (2026)
- Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation (2026)
- DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing (2026)
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness (2026)
- AniGen: Unified $S^3$ Fields for Animatable 3D Asset Generation (2026)
- Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data (2026)
Get this paper in your agent:
hf papers read 2605.16745
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash