Hardware question

by spanspek - opened

Maybe a dumb question but are these models only meant to be run on Cerebras' custom in-house hardware or will they perform well on Nvidia GPUs as well?

I am currently running this model, albeit a quantized version, using llama.cpp on a Radeon 7900XTX

Thank you for the response, but the question was whether it performs "well", not whether it runs :) If you don't mind:

  • Which quantization are you running (team, bits and method - e.g. unsloth Q2_K_XL)?
  • What type of performance are you seeing (time to first token, tokens per second output, etc)?

I just tested bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q2_K_L (model weights are 9.2 GB) on a laptop with an RTX 4060 (8 GB VRAM, only 256 GB/s of memory bandwidth), 64 GB of DDR5-4800, and a middle-of-the-pack CPU, and I got the following (ollama verbose stats):

total duration: 24.1978335s
load duration: 71.064ms
prompt eval count: 35 token(s)
prompt eval duration: 429.1029ms
prompt eval rate: 81.57 tokens/s
eval count: 917 token(s)
eval duration: 23.559657s
eval rate: 38.92 tokens/s
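
(For reference, those numbers are the timing block Ollama prints when run with the verbose flag. A minimal way to reproduce the setup, assuming the Hugging Face tag above pulls directly into Ollama:)

ollama run hf.co/bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q2_K_L --verbose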

I'm very impressed!

I then passed my prompt and the model's output into Grok to judge whether it answered correctly, and it did. I told it to ignore error handling, etc., and focus on the core technical task only.

It was an easy scripting question, so the quality on more difficult questions remains to be seen.

EDIT: Just to add some more detail:

CPU: Intel Core i7-12650H
NVIDIA Driver Version: 566.14
CUDA Version: 12.7
Ollama version (off-the-shelf): 0.12.6
Context = 8192 (this is 25% CPU / 75% GPU on my system with the Q2_K_XL model above)

  • At 32k context it's 12 GB (39% CPU / 61% GPU); I haven't tested inference performance at this context yet
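
To set that 8192 context on a stock Ollama install (it defaults to a shorter context window), two options that should work - treat these as a sketch, with parameter/variable names as per the Ollama docs:

/set parameter num_ctx 8192                 (inside an interactive "ollama run" session)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve     (as an environment variable when starting the server)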

My bad, I misread the original question.

Which quantization are you running (team, bits and method - e.g. unsloth Q2_K_XL)?
Bartowski Q6_K_L GGUF using llama.cpp build 6853 with Vulkan backend

What type of performance are you seeing (time to first token, tokens per second output, etc)?
I ran a benchmark for you using llama-bench with the following parameters:
.\llama-bench.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q4_K_M.gguf -fa 1 -t 12 -p 4096 -n 1024 -d 32768
So processing 4096 tokens, generating 1024 tokens, at a context depth of 32k.

The results were:

  • Total Benchmark Duration: 1131 seconds
  • Initial Load Duration: 14.6 seconds
  • Prompt Eval Token Count (pp4096): 4096 tokens
  • Prompt Eval Duration (pp4096): 24.6 seconds
  • Prompt Eval Rate: 166.42 tokens/second
  • Generation Token Count (tg1024): 1024 tokens
  • Generation Duration (tg1024): 18.8 seconds
  • Generation Rate: 54.41 tokens/second
  • Overall Throughput (full sequence): 54.4 tokens/second
  • KV Cache Size (pp4096): 3456 MiB
  • KV Cache Size (tg1024): 3168 MiB
  • Model VRAM Usage: 14.3 GiB
  • Total VRAM (incl. buffers/cache): 18.0 GiB
  • Graphs Reused: 5100
  • Std Dev (Prompt Eval): ±0.87 tokens/second
  • Std Dev (Generation): ±0.03 tokens/second

This is with a 7900XTX and a 5900X with 64GB of DDR4 RAM, on Windows 10, using llama.cpp with the Vulkan backend.

Brilliant, thank you. That's impressive. Speed is one thing, but quality always remains to be seen; if you encounter any strange, unexpected decreases in output quality, please share them here - I'll do the same.

I am still looking into a way to benchmark the models I run locally. For instance, I can run the Q6 model with the same context at about half the speed, since it requires offloading MoE layers to the CPU, but I am wondering if it is worth the tradeoff. "Benchmarking" by hand is an impossible task, because I won't necessarily notice minor deviations.
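
For anyone wanting to try the same tradeoff: this is roughly how I'd pin the MoE expert tensors to the CPU and sanity-check quant quality with llama.cpp's perplexity tool. Treat it as a sketch - the Q6_K_L filename is assumed from the naming pattern above, the tensor-override regex is the commonly used pattern for Qwen3-style MoE expert tensors, and the text file can be any representative corpus:

.\llama-server.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q6_K_L.gguf -ngl 99 -c 32768 -ot ".ffn_.*_exps.=CPU"
.\llama-perplexity.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q6_K_L.gguf -f wiki.test.raw -c 8192

Comparing the perplexity number across the Q4_K_M and Q6_K_L files on the same text gives at least a rough, repeatable quality signal, which is easier than trying to spot minor deviations by hand.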
