Dream-VLA 7B
Dream-VLA 7B is an open vision-language-action model built on Dream-VL, a diffusion-based vision-language model. The model takes language instructions and camera images as input and generates robot actions. It supports controlling multiple robots out of the box, and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.
All Dream-VLA checkpoints, as well as our training codebase, are released under the Apache 2.0 License.
For full details, please read our blog and paper (pending).
Model Summary
- Model type: Vision-language-action (language, image => robot actions)
- Language(s) (NLP): en
- License: apache-2.0
- Finetuned from: Dream-VL, a VLM trained from:
  - Vision Backbone: Qwen2ViT
  - Language Model: Dream-7B
- Pretraining Dataset: Open X-Embodiment, with specific dataset mixture following OpenVLA.
- Repository: https://github.com/DreamLM/Dream-VLX
- Project Page & Videos: https://hkunlp.github.io/blog/2025/dream-vlx
Uses
Dream-VLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas
of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized using statistics computed on a per-robot,
per-dataset basis. The available un-normalization keys are listed inside config.json.
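As a reference for what this un-normalization does, below is a minimal sketch assuming OpenVLA-style 1st/99th-percentile (q01/q99) statistics; the exact statistics layout is defined by the per-dataset entries in config.json, and predict_action (shown below) applies this for you when given an unnorm_key.

import numpy as np

def unnormalize_action(normalized_action, stats):
    # `stats` is a per-dataset dict of 1st/99th-percentile statistics
    # (layout assumed; see the entries referenced by `unnorm_key` in config.json).
    q01, q99 = np.asarray(stats["q01"]), np.asarray(stats["q99"])
    mask = np.asarray(stats.get("mask", np.ones_like(q01, dtype=bool)))
    # Map each dimension from [-1, 1] back to the dataset's action range.
    action = 0.5 * (normalized_action + 1.0) * (q99 - q01) + q01
    # Dimensions excluded by the mask (typically the gripper) pass through unchanged.
    return np.where(mask, action, normalized_action)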
Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., for BridgeV2 environments with a Widow-X robot). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see here.
Similar to OpenVLA, Dream-VLA models do not zero-shot generalize to new (unseen) robot embodiments, or to setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning Dream-VLA instead, for example with a parameter-efficient approach as sketched below.
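One parameter-efficient option is to attach LoRA adapters to the loaded model. The snippet below is a sketch using the Hugging Face peft library; the target module names, LoRA hyperparameters, and training setup are assumptions to adapt to your checkpoint and data, not an official fine-tuning recipe.

from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to the VLA loaded as in "Getting Started" below.
# Target module names are assumptions; inspect vla.named_modules() to pick
# the projection layers of the language backbone in your checkpoint.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()
# Train on your demonstration dataset with a standard PyTorch loop or the HF Trainer.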
Getting Started
Dream-VLA 7B can be used out of the box to control multiple robots for domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": f"What action should the robot take to {task_description}?"}]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action, ...)
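In deployment, this single prediction typically sits inside a closed control loop. The sketch below reuses the placeholders above (get_from_camera, robot.act); the loop bound and re-prompting strategy are assumptions, not part of the released API.

# Hypothetical closed-loop rollout reusing the prompt built above.
max_steps = 100  # assumption: episode length depends on your task/controller
for _ in range(max_steps):
    image = get_from_camera(...)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action, ...)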
Citation
BibTeX:
@article{ye2025dreamvla,
  title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint},
  year={2025}
}