Dual RTX Pro 6000 Blackwell 96GB - IT FITS!
I can confirm I was able to get the model to load on dual RTX Pro 6000s, and it just barely fits. I run full context because coding agents chew through context like nobody's business. Here's the setup I used.
Start command:
vllm serve /root/.ollama/models/GLM-4.5-Air-REAP-82B-A12B \
--host 0.0.0.0 \
--port 8080 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--max-model-len 131072 \
--quantization bitsandbytes \
--load-format bitsandbytes \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--served-model-name GLM-4.5-Air-REAP-82B-A12B
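Once the server is up, you can sanity-check it with a quick request against the OpenAI-compatible endpoint. This is just a minimal example using the port and served model name from the command above; the prompt and max_tokens are placeholders:
# Smoke test against the vLLM OpenAI-compatible API (port 8080, model name from the serve command above)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5-Air-REAP-82B-A12B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'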
When I say "fits", I mean just barely, since 99.3% of VRAM is used:
nvidia-smi
Sat Oct 25 18:08:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... Off | 00000000:02:00.0 Off | Off |
| 30% 33C P8 8W / 300W | 96513MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 6000 Blac... Off | 00000000:04:00.0 Off | Off |
| 30% 31C P8 16W / 300W | 96513MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 67222 C VLLM::Worker_TP0 96504MiB |
| 1 N/A N/A 67223 C VLLM::Worker_TP1 96504MiB |
+-----------------------------------------------------------------------------------------+
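With this little headroom, it's worth watching memory while tuning --gpu-memory-utilization or --max-model-len. A simple poll with standard nvidia-smi query flags does the job (the 5-second interval is just an example):
# Poll per-GPU memory every 5 seconds to watch headroom while the server loads and warms up
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5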
It's definitely not a speed demon on this constrained hardware. Here it is with about 32K of context in an opencode session:
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO 10-25 11:01:58 [loggers.py:123] Engine 000: Avg prompt throughput: 3047.5 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 89.8%
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO: 172.20.0.107:36644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2141) INFO 10-25 11:02:08 [loggers.py:123] Engine 000: Avg prompt throughput: 6115.7 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 90.6%
(APIServer pid=2141) INFO 10-25 11:02:18 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 90.6%
Generation averages out in the low single digits of tokens per second, and the GPUs are absolutely maxed out while generating. Note that both GPUs were capped at 300W for this run, and each one is running at x8 in an x16 slot, so upgrading my rig and power delivery should get me some speedups. For now I'm just stoked it loaded at all, and opencode seems to be working OK. I was previously using a Q6 quant of GLM-4.5-Air, which was decent; the whole goal of trying this model was to do a bit better than the Q6. I'll put it through its paces to see whether it does, and if it turns out worse than the Q6, I'll of course go back to the Q6.
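If you want to confirm the power cap and PCIe link situation on your own box, nvidia-smi can report both, and can raise the cap where the board allows it. The 600W value below is only an example; check power.max_limit for what your card actually supports:
# Check the current power limit and negotiated PCIe link per GPU
nvidia-smi --query-gpu=index,power.limit,power.max_limit,pcie.link.gen.current,pcie.link.width.current --format=csv

# Raise the power limit on both GPUs (example value; must be within the board's supported range)
sudo nvidia-smi -i 0,1 -pl 600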
Bravo to the Cerebras team for putting this out! It's this kind of work that will get companies to pay attention to what they're doing. Great job, team!
You should use the following command to run the full FP8 version on 2x 96GB GPUs.
export CUDA_VISIBLE_DEVICES="6,7"
python3 -m sglang.launch_server \
--model-path /hdd2/ltg/GLM-4.5-Air-FP8 \
--tp-size 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.9 \
--disable-shared-experts-fusion \
--kv-cache-dtype auto \
--host 0.0.0.0 \
--port 8515 &
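Once that launches, you can check the sglang server the same way, since it exposes an OpenAI-compatible API on the chosen port (8515 from the command above). The model name in the request should match whatever /v1/models reports; the prompt is just an example:
# List the models the sglang server is exposing
curl -s http://localhost:8515/v1/models

# Simple chat completion against the OpenAI-compatible endpoint
curl -s http://localhost:8515/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5-Air-FP8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'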
And here's the GPU usage:
ltg@8XH20:~/GLM-4.5-Air-FP8$ nvidia-smi -i 6,7
Mon Oct 27 15:30:49 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 6 NVIDIA H20 On | 00000000:6B:02.0 Off | 0 |
| N/A 39C P0 122W / 500W | 96131MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:6B:03.0 Off | 0 |
| N/A 34C P0 116W / 500W | 96130MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 6 N/A N/A 3063546 C sglang::scheduler_TP0 96124MiB |
| 7 N/A N/A 3063547 C sglang::scheduler_TP1 96124MiB |
+---------------------------------------------------------------------------------------+
Thanks for the tip @CHNtentes. I wasn't able to get GLM-4.5-Air FP8 running with the command line you posted. I tried for a few hours but eventually gave up and went back to unsloth's GLM-4.5-Air-UD-Q6_K_XL, which has been the most stable for me when using a custom chat template for tool calling. It breaks when using MCP servers with most coding tools, but it's still been the most dependable.
This model here, GLM-4.5-Air-REAP-82B-A12B, did not perform well for me at all. It doesn't crash or anything like that, but it gets confused so easily that it's not usable for much. When writing code for a single task, by the end of the task it would frequently forget important details and produce broken code. For example, it would create an async function at the start, go implement it in a few places while forgetting it was async, then try to run the code and realize everything needed to be handled as async. Then it goes off fixing code it should have gotten right the first time, taking WAY longer than it should and chewing through the available context. That's a simple example, but it kept making these kinds of f-ups over and over, which really slows down development. The Q6 quant does much better for me.
GLM-4.5-Air-REAP-82B-A12B is a neat curiosity, but it's not good for my use.
7 tokens/s generation is way too low. I would suggest just using 1 GPU instead of 2 (it looks like the 2-GPU setup is hitting a bandwidth limitation). You can use the QuantTrio GLM-4.5-Air quant with vLLM instead; it's much better, you can run 128K context, and you get speeds around 40 t/s to 120 t/s. I use it with Claude Code or Roo Code (RTX Pro 6000 Blackwell). It's basically Sonnet at home.
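For reference, a single-GPU vLLM launch along the lines of this suggestion might look like the sketch below. The model path is a placeholder for whichever GLM-4.5-Air quant you download, and the memory and context settings are assumptions to tune for your card:
# Sketch of a single-GPU vLLM launch (model path is a placeholder, not a specific repo;
# no --tensor-parallel-size means it defaults to 1)
vllm serve /path/to/GLM-4.5-Air-AWQ \
  --host 0.0.0.0 \
  --port 8080 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072 \
  --tool-call-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name GLM-4.5-Air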