Do you know which nightly it worked with? Because it does not work with the current one.
https://hub.docker.com/r/vllm/vllm-openai/tags
nvm, it is vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
let me know if it works for you with that image
it does, same performance as AWQ - maybe extra 1% on tests
We've got to get to the bottom of what is going on with vLLM NVFP4 and Blackwell SM120.
how is support on non-blackwell gpus?
Really wanted to try this model, but it looks like that nightly is no longer available :(
The commit is:
https://github.com/vllm-project/vllm/commit/11fd69dd54060a59c6f62a6d217e1ecc47d74a68
The commit is in v0.11.2 (doesn't work), v0.11.1 (doesn't work), v0.11.1rc7 (not available), v0.11.1rc6 (not available).
I also tried a bunch of nightlies containing that commit, but nothing.
I haven't tried building a Dockerfile for it, but if someone has one that works, that would be great!
Unfortunately NVFP4, despite being pretty amazing (on paper), doesn't receive enough love yet. Hopefully soon.
In any case, thank you for quantizing this awesome model!
Try running with the latest vLLM compiled from source, not docker. There is a CUDA image regression in docker currently.
You want ENABLE_CUTLASS_MOE_SM120=1: https://github.com/vllm-project/vllm/pull/29242
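Roughly something like this (a sketch, not exact commands; it assumes a working CUDA 13 toolchain and a matching PyTorch, and the model/TP settings are just the ones used elsewhere in this thread):

```bash
# sketch: build current vLLM from source, then enable the SM120 CUTLASS MoE path
# from PR #29242 when serving (untested; toolchain/PyTorch assumed present)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

export ENABLE_CUTLASS_MOE_SM120=1
vllm serve lukealonso/MiniMax-M2-NVFP4 --tensor-parallel-size 2 --trust-remote-code
```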
I was getting errors like this:
ValueError: CutlassExpertsFp4 doesn't support DP. Use flashinfer CUTLASS Fused MoE backend instead (set VLLM_USE_FLASHINFER_MOE_FP4=1)
I set VLLM_USE_FLASHINFER_MOE_FP4=1 but the errors remain.
Examining the vllm code, this error is thrown because only 2 values of VLLM_FLASHINFER_MOE_BACKEND are supported: masked_gemm and throughput.
I set export VLLM_FLASHINFER_MOE_BACKEND=throughput and it loaded it.
This seems to work reliably on 2x NVIDIA RTX 6000 PRO Blackwell (2x 96GB VRAM):
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--pipeline-parallel-size 1 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m2_append_think \
--tool-call-parser minimax_m2 \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8
My environment:
Python: 3.12.3
vLLM: 0.11.2.dev360+g8e7a89160
PyTorch: 2.9.0+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.0
FlashInfer: 0.5.3
Works with the above, though with no gains over vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68: ~72 tps at the high end, ~332 tps at 10 concurrent requests (2 x RTX PRO 6000 @ 50k context).
python -c "import vllm; print(vllm.version); import torch; print(f'Torch Version: {torch.version}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); import triton; print(triton.version); import flashinfer; print(flashinfer.version)"
0.11.2.dev403+gb9d0504a3
Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
3.5.0
0.5.3
That vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68 seems to be the last one that actually ran NVFP4 out of the box at all.
For those who stumble on this, just use it via
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
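A minimal run command with that image would look something like this (a sketch; the port, cache mount and TP size are illustrative, borrowed from the serve commands above):

```bash
# sketch: serve the NVFP4 quant from the mirrored nightly image
# (the vllm-openai image's entrypoint is `vllm serve`, so arguments go right after the image name)
docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lavd/vllm-openai:nvfp4 \
  lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2
```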
Thanks, it finally worked. But as others confirmed, it's the same speed as AWQ at the moment. :(
same speed, but better perplexity
It would help if someone could measure against a2s-ai/MiniMax-M2-AWQ, for example.
Yeah, AWQ was never very good quality it seems; it failed every time in testing vs Q4 GGUFs. Hopefully NVFP4 is a much higher quality quantization method.
Interesting. My experience with the AWQ model was pretty bad. Has anyone tried a Qwen3 VL 235B NVFP4? I was looking for a quantized version of that one.
Like this? RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
AWQ losing 1 point here:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2-AWQ/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks     | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-----------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k_cot |       3 | flexible-extract |      8 | exact_match | ↑ | 0.9189 | ± | 0.0075 |
|           |         | strict-match     |      8 | exact_match | ↑ | 0.9143 | ± | 0.0077 |
| humaneval |       1 | create_test      |      0 | pass@1      |   | 0.5000 | ± | 0.0392 |

Requesting API: 100% 1319/1319 [06:37<00:00, 3.32it/s]

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9272 | ± | 0.0072 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9265 | ± | 0.0072 |

Requesting API: 100% 56168/56168 [18:22<00:00, 50.94it/s]

| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8164 | ± | 0.0031 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7488 | ± | 0.0061 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8616 | ± | 0.0060 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.8921 | ± | 0.0055 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.7989 | ± | 0.0069 |
==========================================================================================================================
lm_eval --model local-completions --model_args model=mmm2nv,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8003/v1/completions,tokenizer=/data1/MiniMax-M2-NVFP4/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks     | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-----------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k_cot |       3 | flexible-extract |      8 | exact_match | ↑ | 0.9257 | ± | 0.0072 |
|           |         | strict-match     |      8 | exact_match | ↑ | 0.9212 | ± | 0.0074 |
| humaneval |       1 | create_test      |      0 | pass@1      |   | 0.5854 | ± | 0.0386 |

Requesting API: 100% 1319/1319 [06:31<00:00, 3.37it/s]

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9378 | ± | 0.0067 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9363 | ± | 0.0067 |

Requesting API: 100% 56168/56168 [16:58<00:00, 55.17it/s]

| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8173 | ± | 0.0031 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7564 | ± | 0.0060 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8622 | ± | 0.0060 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.8859 | ± | 0.0056 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.7967 | ± | 0.0069 |
============================================================================================================================
The full (unquantized) model, just for comparison:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2/,num_concurrent=40,max_tokens=8168 --tasks mmlu --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/
| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8329 | ± | 0.0030 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7747 | ± | 0.0059 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8774 | ± | 0.0057 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.9022 | ± | 0.0053 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.8081 | ± | 0.0068 |
Could also be that modern Q4 GGUFs are mostly dynamic quants using imatrix, so a lot of the important layers are kept at higher precision. I think I've seen some AWQ that are fp4/fp8 mixes? However, for DeepSeek the only one I can run in 4x96GB is the smallest 4-bit AWQ; the other 4-bit variations I found did not fit.
The new vLLM 0.12.0 seems to have better support for NVFP4.
If anyone tries it, it would be nice to see whether there is any speed improvement.
For this model, 0.11.0 vs 0.12.0 is a ~1-2% difference in tps at 50k context (2 x RTX PRO 6000); it's now 1:1 with sglang at ~73 tps @ 50k context.
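If anyone wants to reproduce the tps numbers, vLLM's built-in serving benchmark is the easiest way. A sketch (flag names follow the serving benchmark script and may differ slightly between versions; the base URL, served name and token lengths are illustrative):

```bash
# sketch: rough throughput check against a running OpenAI-compatible vLLM server
# (--tokenizer is needed because the served name is not a HF repo id)
vllm bench serve \
  --base-url http://127.0.0.1:8001 \
  --model mmm2nv \
  --tokenizer lukealonso/MiniMax-M2-NVFP4 \
  --dataset-name random \
  --random-input-len 50000 \
  --random-output-len 512 \
  --num-prompts 10 \
  --max-concurrency 10
```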
Thanks for testing it!
Does this model support running with a single RTX Pro 6000?
I have been unable to run this on my 2x RTX 6000 Pros; any help would be great. Is there any guide for this? I have tried the latest instructions in the readme of this repo and built the image, but the GPUs just spin at 100% with 125-130 watts of power consumption.
Any help will be appreciated.
2 x 6000 RTX PRO
source kk2.env/bin/activate #0.11.2.dev365+g0808eb813.cu130 PyTorch 2.9.1+cu130 Triton 3.5.1 flashinfer 0.5.3
export TORCH_CUDA_ARCH_LIST="12.0 12.1"
export VLLM_USE_TRITON_AWQ=1
vllm serve /data1/MiniMax-M2/ \
  --served-model-name mmm2 --host 0.0.0.0 --port 8001 --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think \
  --tensor-parallel-size 4 \
  --max-num-seqs 128 \
  --dtype auto \
  --trust-remote-code \
  --enable-prefix-caching --enable-chunked-prefill \
  --async-scheduling \
  --max-num-batched-tokens 8192
What vllm docker image are you using for this purpose?
I keep getting this:
(Worker_TP0 pid=672) INFO 12-15 00:01:52 [backends.py:288] Compiling a graph for dynamic shape takes 74.89 s
(EngineCore_DP0 pid=537) INFO 12-15 00:02:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:03:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:04:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:05:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
and that last message repeats for hours.
I think --tensor-parallel-size 4 needs to be --tensor-parallel-size 2 since you have 2 cards.
Also --max-num-seqs 128 is probably too big, try --max-num-seqs 8 for a start
Generally your settings are not sane. This is an NVFP4 quant, not an AWQ one.
I use this on the same setup (2x rtx 6000 pro blackwell):
#!/bin/bash
# Activate vLLM venv
source /opt/vllm/bin/activate
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
export TRANSFORMERS_CACHE=/opt/models/huggingface/hub
# Required Environment Variables
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP4=1
export ENABLE_NVFP4_SM120=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALL2ALL_BACKEND=pplx
export SAFETENSORS_FAST_GPU=1
# Run command
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model minimax-m2 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 8 \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--all2all-backend pplx \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill
It works with the latest nightly version of vllm.
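By "latest nightly" I mean the nightly wheels rather than a docker image; installing them into a venv looks roughly like this (a sketch; the cu130 index assumes a CUDA 13 build of torch):

```bash
# sketch: pull the vLLM nightly wheels built against CUDA 13 into the active venv
uv pip install --upgrade --prerelease=allow \
  --index-url https://wheels.vllm.ai/nightly/cu130 \
  --extra-index-url https://pypi.org/simple \
  vllm
```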
Keep in mind that in my tests vllm is not as good as sglang. I find sglang more reliable for agentic use (latest nightly of sglang too):
#!/bin/bash
# MiniMax-M2-NVFP4 model script using SGLang
# NVFP4 quantized MoE model with TP=2
# Optimized for 2x RTX 6000 PRO Blackwell (96GB each)
# Activate SGLang venv
source /opt/sglang/bin/activate
# Set CUDA environment - use 13.0 for Blackwell
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-13.0/bin:$PATH
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
# SGLang optimization environment variables
export PYTORCH_ALLOC_CONF=expandable_segments:True
# Use both GPUs for TP=2
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1}
MODEL="lukealonso/MiniMax-M2-NVFP4"
PORT=8345
echo "Starting MiniMax-M2-NVFP4 with SGLang (TP=2)"
echo "Model: $MODEL"
echo "GPUs: $CUDA_VISIBLE_DEVICES, Port: ${PORT}"
echo "Context length: 196608"
echo ""
# Launch server
exec /opt/sglang/bin/python -m sglang.launch_server \
--model-path "$MODEL" \
--served-model-name default-model \
--trust-remote-code \
--dtype auto \
--quantization modelopt_fp4 \
--host 0.0.0.0 \
--port ${PORT} \
--tp 2 \
--mem-fraction-static 0.95 \
--context-length 196608 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 4 \
--chunked-prefill-size 16384 \
--attention-backend triton \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 56}' \
"$@"
Is there a docker image you are using that I can try? That would eliminate all external factors, my friend. I am open to trying sglang; my primary use case is agentic as well.
I was indeed using TP of 2 and not 4. I have tried both NVFP4 and AWQ models with the same result.
For example, the command below works perfectly on one GPU but fails on 2 GPUs. PrimeIntellect fits in one 96GB card, so I am OK running it with TP of 1; however, when it comes to MiniMax, I cannot get it to run on 1 card, and TP of 2 fails for me...
Works:
docker run --gpus '"device=1"' \
--rm \
--shm-size=24g \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
vllm/vllm-openai:v0.12.0 \
/models/Firworks/INTELLECT-3-nvfp4 \
--served-model-name jarvis-thinker \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser deepseek_r1 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 48 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1
fails:
docker run --gpus '"device=0,1"' \
--rm \
--shm-size=24g \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
vllm/vllm-openai:v0.12.0 \
/models/Firworks/INTELLECT-3-nvfp4 \
--served-model-name jarvis-thinker \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser deepseek_r1 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 48 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
I took some AI help to put together this docker container with vllm nightly and CUDA 13 in it:
FROM pytorch/pytorch:2.9.1-cuda13.0-cudnn9-devel
RUN nvcc --version && nvidia-smi || true
RUN apt-get update && apt-get install -y \
git wget build-essential cmake ninja-build \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN wget -qO- https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
ENV TORCH_CUDA_ARCH_LIST="12.0"
ENV VLLM_USE_PRECOMPILED=0
WORKDIR /workspace
# ---- flashinfer ----
RUN git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
WORKDIR /workspace/flashinfer
RUN python -m pip install -v .
# ---- vLLM (nightly wheels for CUDA 13.0) ----
WORKDIR /workspace
RUN uv pip install --system --upgrade --prerelease=allow \
--index-url https://wheels.vllm.ai/nightly/cu130 \
--extra-index-url https://pypi.org/simple \
"vllm[all]" \
&& python -c "import torch, vllm; print('torch:', torch.__version__); print('vLLM:', vllm.__version__)"
Well, I tried that and got the same result; this is my startup command.
Are you using docker? Could that be an issue?
docker run --rm -it \
--gpus '"device=0,1"' \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e SAFETENSORS_FAST_GPU=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
-e VLLM_ALL2ALL_BACKEND=pplx \
-e ENABLE_NVFP4_SM120=1 \
vllm-6000:nightly-2025-12-14 \
vllm serve /models/lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name jarvis-thinker \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 65536 \
--max-num-seqs 8 \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--enable-expert-parallel \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill
Well, I give up...
This is the vLLM image it initially worked with, using the command on this model page:
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
It also works with image vllm/vllm-openai:v0.12.0, with these environment variables and args:
ENABLE_NVFP4_SM120=1
VLLM_ATTENTION_BACKEND=FLASHINFER
- --async-scheduling
- --enable-auto-tool-choice
- --tool-call-parser
- minimax_m2
- --reasoning-parser
- minimax_m2_append_think
- --all2all-backend
- pplx
- --mm-encoder-tp-mode
- "data"
- --enable-prefix-caching
- --enable-chunked-prefill
- --served-model-name
- "mmm2nv"
- --tensor-parallel-size
- "2"
- --gpu-memory-utilization
- "0.95"
- --max-num-batched-tokens
- "16384"
- --max-num-seqs
- "128"
- --host
- "0.0.0.0"
- --port
- "8000"
Use docker compose instead of docker run; a minimal example is sketched below.
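For completeness, one way the fragments above could be assembled into a compose service (a sketch, not an exact file; the model id, port mapping, shm size and GPU reservation are assumptions filled in from the docker run commands earlier in the thread):

```yaml
# sketch: compose service assembling the fragments above (paths/ids are placeholders)
services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    ipc: host
    shm_size: "24gb"
    ports:
      - "8000:8000"
    environment:
      - ENABLE_NVFP4_SM120=1
      - VLLM_ATTENTION_BACKEND=FLASHINFER
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    command:
      - lukealonso/MiniMax-M2-NVFP4
      - --async-scheduling
      - --enable-auto-tool-choice
      - --tool-call-parser
      - minimax_m2
      - --reasoning-parser
      - minimax_m2_append_think
      - --all2all-backend
      - pplx
      - --mm-encoder-tp-mode
      - "data"
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --served-model-name
      - "mmm2nv"
      - --tensor-parallel-size
      - "2"
      - --gpu-memory-utilization
      - "0.95"
      - --max-num-batched-tokens
      - "16384"
      - --max-num-seqs
      - "128"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
```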