DiffusionGemma 26B-A4B-it — INT8 W8A8 (dynamic)
INT8 weight + activation quantization of google/diffusiongemma-26B-A4B-it, Google's block-diffusion language model (dLLM) built on the Gemma 4 26B-A4B MoE backbone.
Why INT8 instead of FP8? The existing FP8-dynamic quantization requires Hopper/Ada GPUs: on Ampere (sm_86, e.g. RTX 3090), vLLM's Marlin FP8 MoE kernel cannot tile this model's expert shape (K=352) within Ampere's 99 KB shared memory (Invalid thread config), and the Triton FP8 MoE backend rejects the per-channel × per-token scheme. INT8 W8A8 routes through vLLM's Triton Int8 MoE backend, which runs on Ampere and newer — this checkpoint serves on 2× RTX 3090.
Quantization
- Weights: int8, symmetric, per-output-channel (RTN,
memoryless_minmax) - Activations: int8, symmetric, per-token, dynamic (computed at runtime, nothing stored)
- Format:
compressed-tensors,int-quantized - Ignored (kept bf16):
lm_head(tied), embeddings, MoE routers, vision tower, self-conditioning — same ignore list as the FP8-dynamic release - Coverage: 11,725 quantized weight tensors (all 128 experts × 48 layers split per-expert + dense MLP/attention projections), identical tensor layout to the FP8-dynamic release
- Size: ~26 GB (from ~49 GB bf16)
Quantization was performed by direct streaming RTN over the base safetensors (one shard at a time, CPU-only). Note for reproducers: the base checkpoint stores MoE experts fused (experts.gate_up_proj [E, 2I, H], experts.down_proj [E, H, I]); this release splits them into per-expert gate_proj/up_proj/down_proj Linears so compressed-tensors/vLLM load them with targets: [Linear]. (llm-compressor oneshot with transformers ≥ 5.11 silently skips the fused 3-D expert tensors — 88% of parameters — producing a barely-compressed model.)
Fidelity vs base weights: relative reconstruction error < 1.2%, cosine similarity ≈ 0.99996 (experts and dense layers). Spot-checked outputs match base-model behavior, including its quirks (see below).
Serving with vLLM
Requires a vLLM build with DiffusionGemma support (the vllm/vllm-openai:gemma image / diffusion branch; mainline ≤ 0.14-nightly does not include it).
vllm serve <this-repo> \
--trust-remote-code \
--max-num-seqs 4 \
--hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
--override-generation-config '{"max_new_tokens": 8192}' \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}'
Optional faster decoding (~3× over the 48-step default, minor quality trade):
--diffusion-config '{"canvas_length": 256, "max_denoising_steps": 16}' \
--hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1, "diffusion_confidence_threshold": 0.0}'
Operational notes (important)
- Default
max_new_tokensis 256 (fromgeneration_config.json). With thinking enabled, reasoning will consume the whole budget andcontentcomes back empty/truncated for clients that don't setmax_tokens(e.g. OpenWebUI). Strongly recommended:--override-generation-config '{"max_new_tokens": 8192}'as above. - Reasoning output field: with
--reasoning-parser gemma4, parsed chain-of-thought is returned inmessage.reasoning(notreasoning_content) on current diffusion-branch builds. - Use the bundled
chat_template.jinjaas-is. The model self-emits<|channel>thought ... <channel|>delimiters; pre-opening the channel in a custom template suppresses them. - Memory under concurrency: each concurrent diffusion decode materializes ~1.9 GB of full-vocab fp32 sampler buffers beyond vLLM's profiled budget. Leave
--gpu-memory-utilizationheadroom accordingly (e.g. 24 GB cards: ≈0.75 at--max-num-seqs 4). - Tensor parallel on multi-GPU consumer cards: the diffusion-branch sampler assumes an unsharded embedding for its self-conditioning matmul (
probs @ embed_weight) and its prompt-logprobs path; TP > 1 requires small patches until fixed upstream (all-reduce of vocab-sharded partials; disable prompt-logprobs). Single-GPU (≥ 32 GB) needs no patches. On PCIe-linked consumer cards also pass--disable-custom-all-reduce.
Known model behaviors (present in the bf16 original — not quantization artifacts)
- Trivial imperative prompts ("Say OK.") can deterministically commit an empty canvas (immediate EOS, empty output). Verified identical on the bf16 base via the transformers reference implementation.
- Thinking consumes the output token budget; use
max_tokens ≥ 1024for thinking-enabled requests.
License
Gemma. This is a derivative (quantized) version of google/diffusiongemma-26B-A4B-it; use is subject to the Gemma Terms of Use.
- Downloads last month
- 4,299
Model tree for aidendle94/diffusiongemma-26B-A4B-it-INT8-dynamic
Base model
google/diffusiongemma-26B-A4B-it