Hardware question

by spanspek - opened

Maybe a dumb question but are these models only meant to be run on Cerebras' custom in-house hardware or will they perform well on Nvidia GPUs as well?

I am currently running this model, albeit a quantized version, using llama.cpp on a Radeon 7900XTX

Thank you for the response, but the question was whether it performs "well", not whether it runs :) If you don't mind:

  • Which quantization are you running (team, bits and method - e.g. unsloth Q2_K_XL)?
  • What type of performance are you seeing (time to first token, tokens per second output, etc)?

I just tested bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q2_K_L (model weights are 9.2 GB) on a laptop with an RTX 4060 (8 GB VRAM, only 256 GB/s of memory bandwidth), 64 GB of DDR5-4800, and a middle-of-the-pack CPU, and I got the following (ollama verbose stats):

total duration: 24.1978335s
load duration: 71.064ms
prompt eval count: 35 token(s)
prompt eval duration: 429.1029ms
prompt eval rate: 81.57 tokens/s
eval count: 917 token(s)
eval duration: 23.559657s
eval rate: 38.92 tokens/s
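
(For reference, those numbers are the timing block Ollama prints when run with the verbose flag. A minimal way to reproduce the setup, assuming the Hugging Face tag above pulls directly into Ollama:)

ollama run hf.co/bartowski/cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF:Q2_K_L --verbose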

I'm very impressed!

I then passed my prompt and the model's output into Grok to judge whether it answered correctly, and it did. I told it to ignore error handling, etc., and focus on the core technical task only.

It was an easy scripting question, so the quality on more difficult questions remains to be seen.

EDIT: Just to add some more detail:

CPU: Intel Core i7-12650H
NVIDIA Driver Version: 566.14
CUDA Version: 12.7
Ollama version (off-the-shelf): 0.12.6
Context = 8192 (this is 25% CPU / 75% GPU on my system with the Q2_K_XL model above)

  • At 32k context it's 12 GB (39% CPU / 61% GPU); I haven't tested inference performance at this context yet
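
To set that 8192 context on a stock Ollama install (it defaults to a shorter context window), two options that should work - treat these as a sketch, with parameter/variable names as per the Ollama docs:

/set parameter num_ctx 8192                 (inside an interactive "ollama run" session)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve     (as an environment variable when starting the server)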

My bad, I misread the original question.

Which quantization are you running (team, bits and method - e.g. unsloth Q2_K_XL)?
Bartowski Q6_K_L GGUF using llama.cpp build 6853 with Vulkan backend

What type of performance are you seeing (time to first token, tokens per second output, etc)?
I ran a benchmark for you using llama-bench with the following parameters:
.\llama-bench.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q4_K_M.gguf -fa 1 -t 12 -p 4096 -n 1024 -d 32768
So processing 4096 tokens, generating 1024 tokens, at a context depth of 32k.

The results were:

  • Total Benchmark Duration: 1131 seconds
  • Initial Load Duration: 14.6 seconds
  • Prompt Eval Token Count (pp4096): 4096 tokens
  • Prompt Eval Duration (pp4096): 24.6 seconds
  • Prompt Eval Rate: 166.42 tokens/second
  • Generation Token Count (tg1024): 1024 tokens
  • Generation Duration (tg1024): 18.8 seconds
  • Generation Rate: 54.41 tokens/second
  • Overall Throughput (full sequence): 54.4 tokens/second
  • KV Cache Size (pp4096): 3456 MiB
  • KV Cache Size (tg1024): 3168 MiB
  • Model VRAM Usage: 14.3 GiB
  • Total VRAM (incl. buffers/cache): 18.0 GiB
  • Graphs Reused: 5100
  • Std Dev (Prompt Eval): ±0.87 tokens/second
  • Std Dev (Generation): ±0.03 tokens/second

This is with a 7900XTX and a 5900X with 64GB of DDR4 RAM, on Windows 10, using llama.cpp with the Vulkan backend.

Brilliant, thank you. That's impressive. Speed is one thing, but quality always remains to be seen; if you encounter any strange, unexpected decreases in output quality, please share them here - I'll do the same.

I am still looking into a way to benchmark the models I run locally. For instance, I can run the Q6 model with the same context at about half the speed, since it requires offloading MoE layers to the CPU, but I am wondering if it is worth the tradeoff. "Benchmarking" by hand is an impossible task, because I won't necessarily notice minor deviations.
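
For anyone wanting to try the same tradeoff: this is roughly how I'd pin the MoE expert tensors to the CPU and sanity-check quant quality with llama.cpp's perplexity tool. Treat it as a sketch - the Q6_K_L filename is assumed from the naming pattern above, the tensor-override regex is the commonly used pattern for Qwen3-style MoE expert tensors, and the text file can be any representative corpus:

.\llama-server.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q6_K_L.gguf -ngl 99 -c 32768 -ot ".ffn_.*_exps.=CPU"
.\llama-perplexity.exe -m cerebras_Qwen3-Coder-REAP-25B-A3B-Q6_K_L.gguf -f wiki.test.raw -c 8192

Comparing the perplexity number across the Q4_K_M and Q6_K_L files on the same text gives at least a rough, repeatable quality signal, which is easier than trying to spot minor deviations by hand.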
