---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
base_model: Dream-org/Dream-VL-7B
tags:
- robotics
- vla
- multimodal
- pretraining
---

# Dream-VLA 7B

[Paper](https://huggingface.co/papers/2512.22615) | [Project Page](https://hkunlp.github.io/blog/2025/dream-vlx/) | [GitHub](https://github.com/DreamLM/Dream-VLX)

Dream-VLA 7B is an open vision-language-action model built on the diffusion VLM [Dream-VL](https://huggingface.co/Dream-org/Dream-VL-7B).
The model takes language instructions and camera images as input and generates robot actions. It controls multiple robots out-of-the-box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our [training codebase](https://github.com/DreamLM/Dream-VLX), are released under the Apache 2.0 license.

## Model Summary

- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Finetuned from:** [`Dream-VL`](https://huggingface.co/Dream-org/Dream-VL-7B), a VLM trained from:
  + **Vision Backbone**: Qwen2ViT
  + **Language Model**: Dream-7B (Diffusion Language Model)
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/), with the dataset mixture following [OpenVLA](https://github.com/openvla/openvla)
- **Repository:** [https://github.com/DreamLM/Dream-VLX](https://github.com/DreamLM/Dream-VLX)

## Uses

Dream-VLA models take a language instruction and a camera image of the robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas
of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions must be *un-normalized* using statistics computed on a per-robot,
per-dataset basis. The available un-normalization keys (`unnorm_key` values) are listed in `config.json`.
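
For intuition, here is a minimal sketch of what that un-normalization step computes, assuming OpenVLA-style per-dimension quantile normalization to [-1, 1]; the statistic values below are illustrative placeholders, and in practice `predict_action` applies the real statistics for you when given an `unnorm_key`:

```python
import numpy as np

# Illustrative only: the real per-robot, per-dataset statistics ship with the
# model and are selected via `unnorm_key`; the values below are placeholders.
q01 = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])  # 1st percentile
q99 = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])  # 99th percentile

def unnormalize(action: np.ndarray) -> np.ndarray:
    """Map a normalized 7-DoF action in [-1, 1] back to raw robot units."""
    return 0.5 * (action + 1.0) * (q99 - q01) + q01
```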

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., for
[BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently *fine-tuned* for new tasks and robot setups
given minimal demonstration data; see the parameter-efficient fine-tuning sketch below.
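
As one illustration of parameter-efficient fine-tuning, the sketch below wraps the model with LoRA adapters via the `peft` library. This is an assumption-laden example rather than our training recipe: the `target_modules` names are hypothetical and should be replaced with the projection layers actually present in the checkpoint (see the [training codebase](https://github.com/DreamLM/Dream-VLX) for the supported fine-tuning scripts).

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Wrap the VLA with LoRA adapters; the `target_modules` below are illustrative
# attention projections -- inspect the model to pick the real module names.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only a small fraction should be trainable
```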

## Getting Started

Dream-VLA 7B can control multiple robots out-of-the-box for domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:

```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)  # Replace with actual camera loading
task_description = "pick up the block"
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": f"What action should the robot take to {task_description}?"}]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action)
```
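
In deployment, the prediction step typically runs in a closed loop. Continuing from the snippet above (so `processor`, `vla`, and `text` are already defined, and `get_from_camera` / `robot.act` remain placeholders for your own camera and control interfaces):

```python
# Closed-loop control sketch; each iteration observes, predicts, and acts.
for step in range(100):  # or until the task is done
    image = get_from_camera(...)  # placeholder camera read
    inputs = processor(text=[text], images=[image], padding=True,
                       return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action)  # placeholder robot interface
```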

## Citation

```bibtex
@article{ye2025dreamvla,
  title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2512.22615},
  year={2025}
}
```