Code for running on <= 24GB cards
Tested with an RTX 5000 24GB. It's slow, though.
import torch
from PIL import Image

from diffusers import QwenImageTransformer2DModel, QwenImageLayeredPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

model_id = "Qwen/Qwen-Image-Layered"
torch_dtype = torch.bfloat16

# 4-bit NF4 quantization for the transformer; keep transformer_blocks.0.img_mod unquantized
quantization_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["transformer_blocks.0.img_mod"],
)
transformer = QwenImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
transformer = transformer.to("cpu")

# 4-bit NF4 quantization for the Qwen2.5-VL text encoder
quantization_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
text_encoder = text_encoder.to("cpu")

pipeline = QwenImageLayeredPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch_dtype,
)
pipeline.enable_model_cpu_offload()  # move each submodule to the GPU only while it is needed
pipeline.set_progress_bar_config(disable=None)

image = Image.open("workdir/bc1e03f40776b8bee006ea2b2b0d8103.webp").convert("RGBA")
inputs = {
    "image": image,
    "generator": torch.Generator(device="cuda").manual_seed(777),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 50,
    "num_images_per_prompt": 1,
    "layers": 4,
    "resolution": 640,  # resolution bucket (640 or 1024); 640 is recommended for this version
    "cfg_normalize": True,  # whether to enable CFG normalization
    "use_en_prompt": True,  # automatically pick the caption language if the user does not provide a caption
}

with torch.inference_mode():
    output = pipeline(**inputs)

output_image = output.images[0]
for i, image in enumerate(output_image):
    image.save(f"{i}.png")
Use SageAttention and Triton; you can run full BF16 on 10 GB of VRAM and 64 GB of system RAM, at a speed of almost 7 s/it.
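For anyone who wants to try that with the diffusers script above, here is a minimal sketch of one common way to route attention through SageAttention: monkey-patch torch.nn.functional.scaled_dot_product_attention before building the pipeline. It assumes the sageattention package (with its Triton kernels) is installed; the wrapper name sdpa_with_sage and the fallback conditions are illustrative additions, not something described in this thread.

# Illustrative SageAttention patch (assumes: pip install sageattention triton, CUDA GPU).
import torch
import torch.nn.functional as F
from sageattention import sageattn  # quantized attention kernels built on Triton

_original_sdpa = F.scaled_dot_product_attention

def sdpa_with_sage(query, key, value, attn_mask=None, dropout_p=0.0,
                   is_causal=False, scale=None, **kwargs):
    # Diffusion transformers call SDPA with (batch, heads, seq, head_dim) tensors,
    # no mask and no dropout; hand that case to SageAttention, fall back otherwise.
    if (attn_mask is None and dropout_p == 0.0 and scale is None
            and query.dtype in (torch.float16, torch.bfloat16)):
        return sageattn(query, key, value, is_causal=is_causal)
    return _original_sdpa(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p,
                          is_causal=is_causal, scale=scale, **kwargs)

F.scaled_dot_product_attention = sdpa_with_sage  # patch before loading/running the pipeline

Note that this only swaps the attention kernel; it does not manage VRAM, so offloading or quantization is still a separate question, as a later reply points out.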
Thanks for replying!
Just to make sure I understand correctly: you mentioned that with SageAttention + Triton you can run full BF16 on 10 GB of VRAM at ~7 s/it.
Does this mean that on an RTX 5090 Mobile with 24 GB of VRAM I should be able to run the model:
- in full BF16 precision (no quantization needed),
- without CPU offload (everything fits in VRAM),
- significantly faster than 7 s/it?
Or would you still recommend using your 4-bit quantization code for 24 GB cards?
Thanks!
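For reference, the "full BF16, no quantization, no offload" setup being asked about would look roughly like the script at the top with the BitsAndBytes configs dropped and the whole pipeline moved to the GPU; whether that actually fits in 24 GB is exactly the open question here.

# Sketch of the no-quantization variant (same model id as above, everything in BF16 on the GPU).
import torch
from diffusers import QwenImageLayeredPipeline

pipeline = QwenImageLayeredPipeline.from_pretrained(
    "Qwen/Qwen-Image-Layered",
    torch_dtype=torch.bfloat16,
)
pipeline.to("cuda")                      # no CPU offload: all components stay in VRAM
# pipeline.enable_model_cpu_offload()    # fallback if the all-on-GPU load runs out of memory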
If I'm not mistaken, SageAttention doesn't deal with offloading, but I'm using an RTX 5080 right now running full BF16. A week ago I had an RTX 3080 and it was the same story, even with WAN 2.2 FP16 and the text encoder in FP16 in ComfyUI. The first run takes time, but after that the speed improves by almost 30%. The Python env is installed on a high-speed NVMe, and I use the KJ diffusion loader, which allows me to do this.
On Windows I had a hard time and Python often crashed. On Ubuntu, the first run crashes if I have a 1080p YouTube video playing in the background (only with WAN 2.2 FP16 + text encoder FP16); otherwise, switching the text encoder to FP8 lets me keep almost 10 Firefox tabs open with 2 gaming monitors.
The screenshot shows 4.5 s/it. Here I'm using the RTX 5080, but on the RTX 3080 it was around 7 s/it with the full text encoder model and no quantization.
