---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
base_model: Dream-org/Dream-VL-7B
tags:
- robotics
- vla
- multimodal
- pretraining
---

# Dream-VLA 7B

[Paper](https://huggingface.co/papers/2512.22615) | [Project Page](https://hkunlp.github.io/blog/2025/dream-vlx/) | [GitHub](https://github.com/DreamLM/Dream-VLX)

Dream-VLA 7B is an open vision-language-action model built on the diffusion VLM [Dream-VL](https://huggingface.co/Dream-org/Dream-VL-7B).
The model takes language instructions and camera images as input and generates robot actions. It controls multiple robots out-of-the-box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our [training codebase](https://github.com/DreamLM/Dream-VLX), are released under the Apache 2.0 license.

## Model Summary

- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Finetuned from:** [`Dream-VL`](https://huggingface.co/Dream-org/Dream-VL-7B), a VLM trained from:
  + **Vision Backbone**: Qwen2ViT
  + **Language Model**: Dream-7B (Diffusion Language Model)
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/), with the dataset mixture following [OpenVLA](https://github.com/openvla/openvla)
- **Repository:** [https://github.com/DreamLM/Dream-VLX](https://github.com/DreamLM/Dream-VLX)

## Uses

Dream-VLA models take a language instruction and a camera image of the robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas
of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions must be *un-normalized* using statistics computed on a per-robot,
per-dataset basis. The available un-normalization keys (`unnorm_key` values) are listed in `config.json`.
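
For intuition, here is a minimal sketch of what that un-normalization step computes, assuming OpenVLA-style per-dimension quantile normalization to [-1, 1]; the statistic values below are illustrative placeholders, and in practice `predict_action` applies the real statistics for you when given an `unnorm_key`:

```python
import numpy as np

# Illustrative only: the real per-robot, per-dataset statistics ship with the
# model and are selected via `unnorm_key`; the values below are placeholders.
q01 = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])  # 1st percentile
q99 = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])  # 99th percentile

def unnormalize(action: np.ndarray) -> np.ndarray:
    """Map a normalized 7-DoF action in [-1, 1] back to raw robot units."""
    return 0.5 * (action + 1.0) * (q99 - q01) + q01
```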

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., for
[BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently *fine-tuned* for new tasks and robot setups
given minimal demonstration data; see the parameter-efficient fine-tuning sketch below.
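
As one illustration of parameter-efficient fine-tuning, the sketch below wraps the model with LoRA adapters via the `peft` library. This is an assumption-laden example rather than our training recipe: the `target_modules` names are hypothetical and should be replaced with the projection layers actually present in the checkpoint (see the [training codebase](https://github.com/DreamLM/Dream-VLX) for the supported fine-tuning scripts).

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Wrap the VLA with LoRA adapters; the `target_modules` below are illustrative
# attention projections -- inspect the model to pick the real module names.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only a small fraction should be trainable
```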

## Getting Started

Dream-VLA 7B can control multiple robots out-of-the-box for domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:

```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)  # Replace with actual camera loading
task_description = "pick up the block"
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": f"What action should the robot take to {task_description}?"}]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action)
```
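
In deployment, the prediction step typically runs in a closed loop. Continuing from the snippet above (so `processor`, `vla`, and `text` are already defined, and `get_from_camera` / `robot.act` remain placeholders for your own camera and control interfaces):

```python
# Closed-loop control sketch; each iteration observes, predicts, and acts.
for step in range(100):  # or until the task is done
    image = get_from_camera(...)  # placeholder camera read
    inputs = processor(text=[text], images=[image], padding=True,
                       return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action)  # placeholder robot interface
```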

## Citation

```bibtex
@article{ye2025dreamvla,
  title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2512.22615},
  year={2025}
}
```