Dream-VLA 7B

Dream-VLA 7B is an open vision-language-action (VLA) model built on Dream-VL, a diffusion-based vision-language model. The model takes a language instruction and camera images as input and generates robot actions. It controls multiple robots out of the box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our training codebase, are released under the Apache 2.0 license.

For full details, please read our blog and paper (pending).

Model Summary

Uses

Dream-VLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized using statistics computed on a per-robot, per-dataset basis. The available un-normalization keys are listed in config.json.
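
For illustration, here is a minimal sketch of this kind of un-normalization, assuming actions are normalized to [-1, 1] and that per-dimension low/high statistics are available for the chosen key (the exact statistics and key names come from config.json). Note that predict_action (shown below) already applies this step for you when you pass unnorm_key:

import numpy as np

def unnormalize_action(norm_action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    # Map a normalized 7-DoF action from [-1, 1] back to the robot's native action range,
    # using the per-dimension statistics stored for the chosen un-normalization key.
    # (Sketch only: the assumed [-1, 1] range and low/high statistics are illustrative.)
    return 0.5 * (norm_action + 1.0) * (high - low) + low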

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., for BridgeV2 environments with a Widow-X robot). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see here.

Like OpenVLA, Dream-VLA models do not generalize zero-shot to new (unseen) robot embodiments, or to setups that are not represented in the pretraining mixture. In these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning Dream-VLA instead.
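
For parameter-efficient fine-tuning, one option is to attach LoRA adapters to the released checkpoint. The sketch below uses Hugging Face peft and is only illustrative: the target module names are assumptions about the backbone's layer naming, the training loop is omitted, and the fine-tuning scripts in our training codebase should be preferred:

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Attach LoRA adapters; the module names below are assumptions and may need to be
# adjusted to match the actual attention-projection layer names in Dream-VLA.
lora_config = LoraConfig(r=32, lora_alpha=16, lora_dropout=0.0, target_modules=["q_proj", "v_proj"])
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()

# ... train on your demonstration data with your preferred training loop ...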

Getting Started

Dream-VLA 7B can be used out of the box to control multiple robots for domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in BridgeV2 environments with a Widow-X robot:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)  # current RGB frame from the robot's workspace camera
task_description = "..."                   # free-form language instruction, e.g. "put the spoon on the towel"
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": f"What action should the robot take to {task_description}?"}]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action, ...)
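
In practice, Dream-VLA is queried in a closed loop: grab a fresh camera frame, predict the next action, execute it, and repeat until the task is done. A minimal sketch, where get_from_camera, task_done, and robot are placeholders for your own robot stack:

# Closed-loop control sketch; `get_from_camera`, `task_done`, and `robot` are placeholders.
while not task_done():
    image = get_from_camera(...)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action, ...)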

Citation

BibTeX:

@article{ye2025dreamvla,
  title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint},
  year={2025}
}