Do you know which nightly it worked with? Because it does not work with the current one.
https://hub.docker.com/r/vllm/vllm-openai/tags
nvm, it is vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
let me know if it works for you with that image
it does, same performance as AWQ - maybe extra 1% on tests
We've got to get to the bottom of what is going on with vLLM NVFP4 and Blackwell SM120.
how is support on non-blackwell gpus?
Really wanted to try this model, but it looks like that nightly is no longer available :(
The commit is:
https://github.com/vllm-project/vllm/commit/11fd69dd54060a59c6f62a6d217e1ecc47d74a68
The commit is in v0.11.2 (doesn't work), v0.11.1 (doesn't work), v0.11.1rc7 (not available), v0.11.1rc6 (not available).
I also tried a bunch of nightlies containing that commit, but nothing.
I haven't tried building a Dockerfile for it, but if someone has one that works, that would be great!
Unfortunately NVFP4, despite being pretty amazing (on paper), doesn't receive enough love yet. Hopefully soon.
In any case, thank you for quantizing this awesome model!
Try running with the latest vLLM compiled from source, not docker. There is a CUDA image regression in docker currently.
You want ENABLE_CUTLASS_MOE_SM120=1: https://github.com/vllm-project/vllm/pull/29242
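Roughly something like this (a sketch, not exact commands; it assumes a working CUDA 13 toolchain and a matching PyTorch, and the model/TP settings are just the ones used elsewhere in this thread):

```bash
# sketch: build current vLLM from source, then enable the SM120 CUTLASS MoE path
# from PR #29242 when serving (untested; toolchain/PyTorch assumed present)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

export ENABLE_CUTLASS_MOE_SM120=1
vllm serve lukealonso/MiniMax-M2-NVFP4 --tensor-parallel-size 2 --trust-remote-code
```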
I was getting errors like this:
ValueError: CutlassExpertsFp4 doesn't support DP. Use flashinfer CUTLASS Fused MoE backend instead (set VLLM_USE_FLASHINFER_MOE_FP4=1)
I set VLLM_USE_FLASHINFER_MOE_FP4=1 but the errors remain.
Examining the vllm code, this error is thrown because only 2 values of VLLM_FLASHINFER_MOE_BACKEND are supported: masked_gemm and throughput.
I set export VLLM_FLASHINFER_MOE_BACKEND=throughput and it loaded it.
This seems to work reliably on 2x NVIDIA RTX 6000 PRO Blackwell (2x 96GB VRAM):
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--pipeline-parallel-size 1 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m2_append_think \
--tool-call-parser minimax_m2 \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8
My environment:
Python: 3.12.3
vLLM: 0.11.2.dev360+g8e7a89160
PyTorch: 2.9.0+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.0
FlashInfer: 0.5.3
Works with the above, though with no gains over vllm/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68: ~72 tps at the high end, ~332 tps at 10 concurrent requests (2 x RTX PRO 6000 @ 50k context).
python -c "import vllm; print(vllm.version); import torch; print(f'Torch Version: {torch.version}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'CUDA Version: {torch.version.cuda}'); import triton; print(triton.version); import flashinfer; print(flashinfer.version)"
0.11.2.dev403+gb9d0504a3
Torch Version: 2.9.0+cu130
CUDA Available: True
CUDA Version: 13.0
3.5.0
0.5.3
That vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68 seems to be the last one that actually ran NVFP4 out of the box at all.
For those who stumble on this, just use it via
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
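A minimal run command with that image would look something like this (a sketch; the port, cache mount and TP size are illustrative, borrowed from the serve commands above):

```bash
# sketch: serve the NVFP4 quant from the mirrored nightly image
# (the vllm-openai image's entrypoint is `vllm serve`, so arguments go right after the image name)
docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lavd/vllm-openai:nvfp4 \
  lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2
```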
Thanks, it finally worked. But as others confirmed, it's the same speed as AWQ at the moment. :(
same speed, but better perplexity
It would help if someone could measure against a2s-ai/MiniMax-M2-AWQ, for example.
Yeah, AWQ was never very good quality it seems; it failed every time in testing vs Q4 GGUFs. Hopefully NVFP4 is a much higher quality quantization method.
Interesting. My experience with the AWQ model was pretty bad. Has anyone tried a Qwen3 VL 235B NVFP4? I was looking for a quantized version of that one.
Like this? RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
AWQ losing 1 point here:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2-AWQ/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks     | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-----------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k_cot |       3 | flexible-extract |      8 | exact_match | ↑ | 0.9189 | ± | 0.0075 |
|           |         | strict-match     |      8 | exact_match | ↑ | 0.9143 | ± | 0.0077 |
| humaneval |       1 | create_test      |      0 | pass@1      |   | 0.5000 | ± | 0.0392 |

Requesting API: 100% 1319/1319 [06:37<00:00, 3.32it/s]

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9272 | ± | 0.0072 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9265 | ± | 0.0072 |

Requesting API: 100% 56168/56168 [18:22<00:00, 50.94it/s]

| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8164 | ± | 0.0031 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7488 | ± | 0.0061 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8616 | ± | 0.0060 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.8921 | ± | 0.0055 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.7989 | ± | 0.0069 |
==========================================================================================================================
lm_eval --model local-completions --model_args model=mmm2nv,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8003/v1/completions,tokenizer=/data1/MiniMax-M2-NVFP4/,max_tokens=4096 --tasks gsm8k_cot,humaneval --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/ --batch_size 20
| Tasks     | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-----------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k_cot |       3 | flexible-extract |      8 | exact_match | ↑ | 0.9257 | ± | 0.0072 |
|           |         | strict-match     |      8 | exact_match | ↑ | 0.9212 | ± | 0.0074 |
| humaneval |       1 | create_test      |      0 | pass@1      |   | 0.5854 | ± | 0.0386 |

Requesting API: 100% 1319/1319 [06:31<00:00, 3.37it/s]

| Tasks | Version | Filter           | n-shot | Metric      |   | Value  |   | Stderr |
|-------|--------:|------------------|-------:|-------------|---|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | ↑ | 0.9378 | ± | 0.0067 |
|       |         | strict-match     |      5 | exact_match | ↑ | 0.9363 | ± | 0.0067 |

Requesting API: 100% 56168/56168 [16:58<00:00, 55.17it/s]

| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8173 | ± | 0.0031 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7564 | ± | 0.0060 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8622 | ± | 0.0060 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.8859 | ± | 0.0056 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.7967 | ± | 0.0069 |
============================================================================================================================
The full (unquantized) model, just for comparison:
lm_eval --model local-completions --model_args model=mmm2,system_prompt="Provide only the final numerical answer. No reasoning or explanations.",base_url=http://192.168.1.200:8001/v1/completions,tokenizer=/data1/MiniMax-M2/,num_concurrent=40,max_tokens=8168 --tasks mmlu --confirm_run_unsafe_code --gen_kwargs do_sample=False --trust_remote_code --log_samples --output_path eval_results/
| Groups             | Version | Filter | n-shot | Metric |   | Value  |   | Stderr |
|--------------------|--------:|--------|-------:|--------|---|-------:|---|-------:|
| mmlu               |       2 | none   |        | acc    | ↑ | 0.8329 | ± | 0.0030 |
|  - humanities      |       2 | none   |        | acc    | ↑ | 0.7747 | ± | 0.0059 |
|  - other           |       2 | none   |        | acc    | ↑ | 0.8774 | ± | 0.0057 |
|  - social sciences |       2 | none   |        | acc    | ↑ | 0.9022 | ± | 0.0053 |
|  - stem            |       2 | none   |        | acc    | ↑ | 0.8081 | ± | 0.0068 |
Could also be that modern Q4 GGUFs are mostly dynamic quants using imatrix, so a lot of the important layers are kept at higher precision. I think I've seen some AWQ that are fp4/fp8 mixes? However, for DeepSeek the only one I can run in 4x96GB is the smallest 4-bit AWQ; the other 4-bit variations I found did not fit.
The new vLLM 0.12.0 seems to have better support for NVFP4.
If anyone tries it, it would be nice to see whether there is any speed improvement.
For this model, 0.11.0 vs 0.12.0 is a ~1-2% difference in tps at 50k context (2 x RTX PRO 6000); it's now 1:1 with sglang at ~73 tps @ 50k context.
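If anyone wants to reproduce the tps numbers, vLLM's built-in serving benchmark is the easiest way. A sketch (flag names follow the serving benchmark script and may differ slightly between versions; the base URL, served name and token lengths are illustrative):

```bash
# sketch: rough throughput check against a running OpenAI-compatible vLLM server
# (--tokenizer is needed because the served name is not a HF repo id)
vllm bench serve \
  --base-url http://127.0.0.1:8001 \
  --model mmm2nv \
  --tokenizer lukealonso/MiniMax-M2-NVFP4 \
  --dataset-name random \
  --random-input-len 50000 \
  --random-output-len 512 \
  --num-prompts 10 \
  --max-concurrency 10
```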
Thanks for testing it!
Does this model support running with a single RTX Pro 6000?
I have been unable to run this on my 2x RTX 6000 Pros; any help would be great. Is there any guide for this? I have tried the latest instructions in the readme of this repo and built the image, but the GPUs just spin at 100% with 125-130 watts of power consumption.
Any help will be appreciated.
2 x 6000 RTX PRO
source kk2.env/bin/activate #0.11.2.dev365+g0808eb813.cu130 PyTorch 2.9.1+cu130 Triton 3.5.1 flashinfer 0.5.3
export TORCH_CUDA_ARCH_LIST="12.0 12.1"
export VLLM_USE_TRITON_AWQ=1
vllm serve /data1/MiniMax-M2/ \
  --served-model-name mmm2 --host 0.0.0.0 --port 8001 --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think \
  --tensor-parallel-size 4 \
  --max-num-seqs 128 \
  --dtype auto \
  --trust-remote-code \
  --enable-prefix-caching --enable-chunked-prefill \
  --async-scheduling \
  --max-num-batched-tokens 8192
What vllm docker image are you using for this purpose?
I keep getting this:
(Worker_TP0 pid=672) INFO 12-15 00:01:52 [backends.py:288] Compiling a graph for dynamic shape takes 74.89 s
(EngineCore_DP0 pid=537) INFO 12-15 00:02:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:03:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:04:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=537) INFO 12-15 00:05:26 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
and that last message repeats for hours.
I think --tensor-parallel-size 4 needs to be --tensor-parallel-size 2 since you have 2 cards.
Also --max-num-seqs 128 is probably too big, try --max-num-seqs 8 for a start
Generally your settings are not sane. This is an NVFP4 quant, not an AWQ one.
I use this on the same setup (2x rtx 6000 pro blackwell):
#!/bin/bash
# Activate vLLM venv
source /opt/vllm/bin/activate
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
export TRANSFORMERS_CACHE=/opt/models/huggingface/hub
# Required Environment Variables
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP4=1
export ENABLE_NVFP4_SM120=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALL2ALL_BACKEND=pplx
export SAFETENSORS_FAST_GPU=1
# Run command
exec env CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model minimax-m2 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 196608 \
--max-num-seqs 8 \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--all2all-backend pplx \
--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" \
--enable-expert-parallel \
--enable-prefix-caching \
--enable-chunked-prefill
It works with the latest nightly version of vllm.
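By "latest nightly" I mean the nightly wheels rather than a docker image; installing them into a venv looks roughly like this (a sketch; the cu130 index assumes a CUDA 13 build of torch):

```bash
# sketch: pull the vLLM nightly wheels built against CUDA 13 into the active venv
uv pip install --upgrade --prerelease=allow \
  --index-url https://wheels.vllm.ai/nightly/cu130 \
  --extra-index-url https://pypi.org/simple \
  vllm
```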
Keep in mind that in my tests vllm is not as good as sglang. I find sglang more reliable for agentic use (latest nightly of sglang too):
#!/bin/bash
# MiniMax-M2-NVFP4 model script using SGLang
# NVFP4 quantized MoE model with TP=2
# Optimized for 2x RTX 6000 PRO Blackwell (96GB each)
# Activate SGLang venv
source /opt/sglang/bin/activate
# Set CUDA environment - use 13.0 for Blackwell
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda-13.0/bin:$PATH
# Set HuggingFace cache to use existing models
export HF_HOME=/opt/models/huggingface
# SGLang optimization environment variables
export PYTORCH_ALLOC_CONF=expandable_segments:True
# Use both GPUs for TP=2
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1}
MODEL="lukealonso/MiniMax-M2-NVFP4"
PORT=8345
echo "Starting MiniMax-M2-NVFP4 with SGLang (TP=2)"
echo "Model: $MODEL"
echo "GPUs: $CUDA_VISIBLE_DEVICES, Port: ${PORT}"
echo "Context length: 196608"
echo ""
# Launch server
exec /opt/sglang/bin/python -m sglang.launch_server \
--model-path "$MODEL" \
--served-model-name default-model \
--trust-remote-code \
--dtype auto \
--quantization modelopt_fp4 \
--host 0.0.0.0 \
--port ${PORT} \
--tp 2 \
--mem-fraction-static 0.95 \
--context-length 196608 \
--kv-cache-dtype fp8_e4m3 \
--max-running-requests 4 \
--chunked-prefill-size 16384 \
--attention-backend triton \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 56}' \
"$@"
Is there a docker image you are using that I can try? That would eliminate all external factors, my friend. I am open to trying sglang; my primary use case is agentic as well.
I was indeed using TP of 2 and not 4. I have tried both NVFP4 and AWQ models with the same result.
For example, the command below works perfectly on one GPU but fails on 2 GPUs. PrimeIntellect fits in one 96GB card, so I am OK running it with TP of 1; however, when it comes to MiniMax, I cannot get it to run on 1 card, and TP of 2 fails for me...
Works:
docker run --gpus '"device=1"' \
--rm \
--shm-size=24g \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
vllm/vllm-openai:v0.12.0 \
/models/Firworks/INTELLECT-3-nvfp4 \
--served-model-name jarvis-thinker \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser deepseek_r1 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 48 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1
fails:
docker run --gpus '"device=0,1"' \
--rm \
--shm-size=24g \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e VLLM_FLASHINFER_FORCE_TENSOR_CORES=1 \
vllm/vllm-openai:v0.12.0 \
/models/Firworks/INTELLECT-3-nvfp4 \
--served-model-name jarvis-thinker \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser deepseek_r1 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 48 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2
I took some AI help to put together this docker container with vllm nightly and CUDA 13 in it:
FROM pytorch/pytorch:2.9.1-cuda13.0-cudnn9-devel
RUN nvcc --version && nvidia-smi || true
RUN apt-get update && apt-get install -y \
git wget build-essential cmake ninja-build \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN wget -qO- https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:$PATH"
ENV TORCH_CUDA_ARCH_LIST="12.0"
ENV VLLM_USE_PRECOMPILED=0
WORKDIR /workspace
# ---- flashinfer ----
RUN git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
WORKDIR /workspace/flashinfer
RUN python -m pip install -v .
# ---- vLLM (nightly wheels for CUDA 13.0) ----
WORKDIR /workspace
RUN uv pip install --system --upgrade --prerelease=allow \
--index-url https://wheels.vllm.ai/nightly/cu130 \
--extra-index-url https://pypi.org/simple \
"vllm[all]" \
&& python -c "import torch, vllm; print('torch:', torch.__version__); print('vLLM:', vllm.__version__)"
Well, I tried that and got the same result; this is my startup command.
Are you using docker? Could that be an issue?
docker run --rm -it \
--gpus '"device=0,1"' \
--ipc=host \
-p 10002:8000 \
-v /media/mukul/data/models:/models \
-e CUDA_VISIBLE_DEVICES=0,1 \
-e SAFETENSORS_FAST_GPU=1 \
-e VLLM_FLASHINFER_MOE_BACKEND=throughput \
-e VLLM_ALL2ALL_BACKEND=pplx \
-e ENABLE_NVFP4_SM120=1 \
vllm-6000:nightly-2025-12-14 \
vllm serve /models/lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name jarvis-thinker \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--max-model-len 65536 \
--max-num-seqs 8 \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--enable-expert-parallel \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill
Well, I give up...
This is the vLLM image it initially worked with, using the command on this model page:
docker pull lavd/vllm-openai:nightly-11fd69dd54060a59c6f62a6d217e1ecc47d74a68
or
docker pull lavd/vllm-openai:nvfp4
It also works with image vllm/vllm-openai:v0.12.0, with these environment variables and args:
ENABLE_NVFP4_SM120=1
VLLM_ATTENTION_BACKEND=FLASHINFER
- --async-scheduling
- --enable-auto-tool-choice
- --tool-call-parser
- minimax_m2
- --reasoning-parser
- minimax_m2_append_think
- --all2all-backend
- pplx
- --mm-encoder-tp-mode
- "data"
- --enable-prefix-caching
- --enable-chunked-prefill
- --served-model-name
- "mmm2nv"
- --tensor-parallel-size
- "2"
- --gpu-memory-utilization
- "0.95"
- --max-num-batched-tokens
- "16384"
- --max-num-seqs
- "128"
- --host
- "0.0.0.0"
- --port
- "8000"
Use docker compose instead of docker run; a minimal example is sketched below.
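For completeness, one way the fragments above could be assembled into a compose service (a sketch, not an exact file; the model id, port mapping, shm size and GPU reservation are assumptions filled in from the docker run commands earlier in the thread):

```yaml
# sketch: compose service assembling the fragments above (paths/ids are placeholders)
services:
  vllm:
    image: vllm/vllm-openai:v0.12.0
    ipc: host
    shm_size: "24gb"
    ports:
      - "8000:8000"
    environment:
      - ENABLE_NVFP4_SM120=1
      - VLLM_ATTENTION_BACKEND=FLASHINFER
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    command:
      - lukealonso/MiniMax-M2-NVFP4
      - --async-scheduling
      - --enable-auto-tool-choice
      - --tool-call-parser
      - minimax_m2
      - --reasoning-parser
      - minimax_m2_append_think
      - --all2all-backend
      - pplx
      - --mm-encoder-tp-mode
      - "data"
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --served-model-name
      - "mmm2nv"
      - --tensor-parallel-size
      - "2"
      - --gpu-memory-utilization
      - "0.95"
      - --max-num-batched-tokens
      - "16384"
      - --max-num-seqs
      - "128"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
```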