Dual RTX Pro 6000 Blackwell 96GB - IT FITS!
I can confirm I was able to get the model to load on dual RTX Pro 6000s, and it just barely fits. I run full context because coding agents chew through context like nobody's business. Here's the setup I used.
Start command:
vllm serve /root/.ollama/models/GLM-4.5-Air-REAP-82B-A12B \
--host 0.0.0.0 \
--port 8080 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 131072 \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--served-model-name GLM-4.5-Air-REAP-82B-A12B
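Once the server is up, you can sanity-check it with a quick request against the OpenAI-compatible endpoint. This is just a minimal example using the port and served model name from the command above; the prompt and max_tokens are placeholders:
# Smoke test against the vLLM OpenAI-compatible API (port 8080, model name from the serve command above)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5-Air-REAP-82B-A12B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'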
When I say "fits", I mean just barely, since 99.3% of VRAM is used:
nvidia-smi
Sat Oct 25 18:08:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:02:00.0 Off | Off |
| 30% 33C P8 8W / 300W | 96513MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 6000 Blac... Off | 00000000:04:00.0 Off | Off |
| 30% 31C P8 16W / 300W | 96513MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 67222 C VLLM::Worker_TP0 96504MiB |
| 1 N/A N/A 67223 C VLLM::Worker_TP1 96504MiB |
+-----------------------------------------------------------------------------------------+
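With this little headroom, it's worth watching memory while tuning --gpu-memory-utilization or --max-model-len. A simple poll with standard nvidia-smi query flags does the job (the 5-second interval is just an example):
# Poll per-GPU memory every 5 seconds to watch headroom while the server loads and warms up
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5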
It's definitely not a speed demon on this constrained hardware. Here it is with about 32K of context in an opencode session:
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO 10-25 11:01:58 [loggers.py:123] Engine 000: Avg prompt throughput: 3047.5 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 89.8%
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO 10-25 11:02:08 [loggers.py:123] Engine 000: Avg prompt throughput: 6115.7 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 90.6%
(APIServer pid=2141) INFO 10-25 11:02:18 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 90.6%
Generation averages out in the low single digits of tokens per second, and the GPUs are absolutely maxed out while generating. Note that both GPUs were capped at 300W for this run, and each one is running at x8 in an x16 slot, so upgrading my rig and power delivery should get me some speedups. For now I'm just stoked it loaded at all, and opencode seems to be working OK. I was previously using a Q6 quant of GLM-4.5-Air, which was decent; the whole goal of trying this model was to do a bit better than the Q6. I'll put it through its paces to see whether it does, and if it turns out worse than the Q6, I'll of course go back to the Q6.
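If you want to confirm the power cap and PCIe link situation on your own box, nvidia-smi can report both, and can raise the cap where the board allows it. The 600W value below is only an example; check power.max_limit for what your card actually supports:
# Check the current power limit and negotiated PCIe link per GPU
nvidia-smi --query-gpu=index,power.limit,power.max_limit,pcie.link.gen.current,pcie.link.width.current --format=csv

# Raise the power limit on both GPUs (example value; must be within the board's supported range)
sudo nvidia-smi -i 0,1 -pl 600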
Bravo to the Cerebras team for putting this out! It's this kind of work that will get companies to pay attention to what they're doing. Great job, team!
You should use the following command to run the full FP8 version on 2x 96GB GPUs.
export CUDA_VISIBLE_DEVICES="6,7"
python3 -m sglang.launch_server \
--model-path /hdd2/ltg/GLM-4.5-Air-FP8 \
--tp-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.9 \
--disable-shared-experts-fusion \
--kv-cache-dtype auto \
--host 0.0.0.0 \
--port 8515 &
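Once that launches, you can check the sglang server the same way, since it exposes an OpenAI-compatible API on the chosen port (8515 from the command above). The model name in the request should match whatever /v1/models reports; the prompt is just an example:
# List the models the sglang server is exposing
curl -s http://localhost:8515/v1/models

# Simple chat completion against the OpenAI-compatible endpoint
curl -s http://localhost:8515/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5-Air-FP8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'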
And here's the GPU usage:
ltg@8XH20:~/GLM-4.5-Air-FP8$ nvidia-smi -i 6,7
Mon Oct 27 15:30:49 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 6 NVIDIA H20 On | 00000000:6B:02.0 Off | 0 |
| N/A 39C P0 122W / 500W | 96131MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:6B:03.0 Off | 0 |
| N/A 34C P0 116W / 500W | 96130MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 6 N/A N/A 3063546 C sglang::scheduler_TP0 96124MiB |
| 7 N/A N/A 3063547 C sglang::scheduler_TP1 96124MiB |
+---------------------------------------------------------------------------------------+
Thanks for the tip @CHNtentes. I wasn't able to get GLM-4.5-Air FP8 running with the command line you posted. I tried for a few hours but eventually gave up and went back to unsloth's GLM-4.5-Air-UD-Q6_K_XL, which has been the most stable for me when using a custom chat template for tool calling. It breaks when using MCP servers with most coding tools, but it's still been the most dependable.
This model here, GLM-4.5-Air-REAP-82B-A12B, did not perform well for me at all. It doesn't crash or anything like that, but it gets confused so easily that it's not usable for much. When writing code for a single task, by the end of the task it would frequently forget important details and produce broken code. For example, it would create an async function at the start, go implement it in a few places while forgetting it was async, then try to run the code and realize everything needed to be handled as async. Then it goes off fixing code it should have gotten right the first time, taking WAY longer than it should and chewing through the available context. That's a simple example, but it kept making these kinds of f-ups over and over, which really slows down development. The Q6 quant does much better for me.
GLM-4.5-Air-REAP-82B-A12B is a neat curiosity, but it's not good for my use.
7 tokens/s generation is way too low. I would suggest just using 1 GPU instead of 2 (it looks like the 2-GPU setup is hitting a bandwidth limitation). You can use the QuantTrio GLM-4.5-Air quant with vLLM instead; it's much better, you can run 128K context, and you get speeds around 40 t/s to 120 t/s. I use it with Claude Code or Roo Code (RTX Pro 6000 Blackwell). It's basically Sonnet at home.
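For reference, a single-GPU vLLM launch along the lines of this suggestion might look like the sketch below. The model path is a placeholder for whichever GLM-4.5-Air quant you download, and the memory and context settings are assumptions to tune for your card:
# Sketch of a single-GPU vLLM launch (model path is a placeholder, not a specific repo;
# no --tensor-parallel-size means it defaults to 1)
vllm serve /path/to/GLM-4.5-Air-AWQ \
  --host 0.0.0.0 \
  --port 8080 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072 \
  --tool-call-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name GLM-4.5-Air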