Qwen3.5-397B-A17B-EXL3 Pareto Frontier
Pareto-frontier EXL3 quants for Qwen 3.5 397B.
Pick the quant with the lowest KL divergence and perplexity (PPL) that fits your hardware. Each quant lives in its own model repository.
| Quant | GiB | GB | bpw | PPL | KL(q‖o) | KL(o‖q) | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MikeRoz 2.0bpw | 97 | 104 | 2.00 | 5.072 | 0.5160 | 0.8210 | 76.1% | 41.3% | 18.6% | 7.5% | 2.9% |
| MikeRoz 2.08bpw | 100 | 107 | 2.08 | 3.386 | 0.1210 | 0.1630 | 89.3% | 62.6% | 38.3% | 21.6% | 11.7% |
| cpral 2.20bpw | 104 | 112 | 2.20 | 3.381 | 0.1198 | 0.1591 | 89.4% | 62.8% | 38.5% | 21.6% | 11.6% |
| cpral 2.36bpw | 113 | 121 | 2.36 | 3.260 | 0.0819 | 0.1054 | 91.6% | 68.1% | 44.6% | 27.1% | 15.7% |
| cpral 2.64bpw | 126 | 135 | 2.64 | 3.139 | 0.0429 | 0.0490 | 94.1% | 75.5% | 54.3% | 36.5% | 23.4% |
| cpral 2.93bpw | 139 | 149 | 2.93 | 3.117 | 0.0319 | 0.0349 | 94.8% | 78.1% | 58.3% | 40.6% | 27.0% |
| NeuroSenko 3.0bpw | 142 | 153 | 3.00 | 3.220 | 0.0674 | 0.0776 | 91.9% | 68.4% | 44.5% | 26.6% | 14.8% |
| NeuroSenko 3.03bpw | 143 | 154 | 3.03 | 3.173 | 0.0474 | 0.0531 | 93.5% | 73.4% | 51.1% | 32.9% | 20.1% |
| cpral 3.11bpw | 147 | 158 | 3.11 | 3.114 | 0.0270 | 0.0296 | 95.3% | 79.8% | 60.7% | 43.3% | 29.6% |
| cpral 3.29bpw | 156 | 167 | 3.29 | 3.089 | 0.0200 | 0.0213 | 96.0% | 82.1% | 64.3% | 47.4% | 33.5% |
| cpral 3.45bpw | 163 | 175 | 3.45 | 3.081 | 0.0159 | 0.0166 | 96.4% | 83.7% | 67.3% | 51.2% | 37.3% |
| mratsim 3.47bpw | 164 | 175 | 3.47 | 3.096 | 0.0203 | 0.0216 | 96.0% | 82.2% | 64.7% | 48.1% | 34.1% |
| cpral 3.53bpw | 167 | 179 | 3.53 | 3.075 | 0.0134 | 0.0139 | 96.7% | 84.9% | 69.3% | 53.5% | 39.8% |
| cpral 3.57bpw | 169 | 181 | 3.57 | 3.072 | 0.0127 | 0.0130 | 96.7% | 85.2% | 69.8% | 54.2% | 40.4% |
| cpral 3.68bpw | 173 | 186 | 3.68 | 3.069 | 0.0120 | 0.0122 | 96.9% | 85.7% | 70.6% | 55.1% | 41.3% |
| NeuroSenko 4.0bpw | 188 | 202 | 4.00 | 3.101 | 0.0203 | 0.0210 | 95.7% | 81.0% | 62.3% | 44.7% | 30.5% |
| NeuroSenko 4.03bpw | 189 | 203 | 4.03 | 3.082 | 0.0149 | 0.0153 | 96.3% | 83.9% | 67.2% | 50.7% | 36.6% |
| cpral 4.61bpw | 216 | 232 | 4.61 | 3.059 | 0.0054 | 0.0054 | 97.8% | 90.0% | 78.4% | 65.4% | 52.6% |
| NeuroSenko 5.0bpw | 234 | 252 | 5.00 | 3.067 | 0.0079 | 0.0079 | 97.3% | 87.6% | 73.9% | 59.0% | 45.3% |
| mratsim 8.0bpw | 385 | 400 | 8.00 | 3.055 | 0.0025 | 0.0026 | 98.6% | 93.3% | 85.0% | 75.1% | 64.7% |
| original bf16 | 752 | 807 | 16.00 | 3.053 | - | - | - | - | - | - | - |
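As a rough illustration of how to read the table, the sketch below filters the rows down to those that are not beaten on KL(q‖o) by any quant of equal or smaller size, and picks the lowest-KL quant under a size budget. The sizes and KL values are copied from the table; the helper names `pareto_frontier` and `best_under` are made up for this example, and the budget only accounts for weight size, not KV cache or activations.

```python
# (name, size in GiB, KL(q‖o)) copied from the table above.
QUANTS = [
    ("MikeRoz 2.0bpw", 97, 0.5160), ("MikeRoz 2.08bpw", 100, 0.1210),
    ("cpral 2.20bpw", 104, 0.1198), ("cpral 2.36bpw", 113, 0.0819),
    ("cpral 2.64bpw", 126, 0.0429), ("cpral 2.93bpw", 139, 0.0319),
    ("NeuroSenko 3.0bpw", 142, 0.0674), ("NeuroSenko 3.03bpw", 143, 0.0474),
    ("cpral 3.11bpw", 147, 0.0270), ("cpral 3.29bpw", 156, 0.0200),
    ("cpral 3.45bpw", 163, 0.0159), ("mratsim 3.47bpw", 164, 0.0203),
    ("cpral 3.53bpw", 167, 0.0134), ("cpral 3.57bpw", 169, 0.0127),
    ("cpral 3.68bpw", 173, 0.0120), ("NeuroSenko 4.0bpw", 188, 0.0203),
    ("NeuroSenko 4.03bpw", 189, 0.0149), ("cpral 4.61bpw", 216, 0.0054),
    ("NeuroSenko 5.0bpw", 234, 0.0079), ("mratsim 8.0bpw", 385, 0.0025),
]


def pareto_frontier(quants):
    """Keep quants whose KL is lower than that of every smaller-or-equal quant."""
    frontier = []
    best_kl = float("inf")
    # Walk from smallest to largest; a quant survives only if it improves KL.
    for name, gib, kl in sorted(quants, key=lambda q: (q[1], q[2])):
        if kl < best_kl:
            frontier.append((name, gib, kl))
            best_kl = kl
    return frontier


def best_under(quants, budget_gib):
    """Return the lowest-KL quant whose weights fit in budget_gib, or None."""
    fitting = [q for q in quants if q[1] <= budget_gib]
    return min(fitting, key=lambda q: q[2]) if fitting else None


if __name__ == "__main__":
    for name, gib, kl in pareto_frontier(QUANTS):
        print(f"{name:20s} {gib:4d} GiB  KL={kl:.4f}")
    print(best_under(QUANTS, 160))
```

Leave headroom below your total memory when choosing a budget, since context length and the inference backend add to the footprint.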
Methodology
The methodology I used to create the custom quants is documented in https://github.com/adamo1139/qwen397b-exl3 and is mostly reproducible (I may have manually overridden some auto-generated configs in minor ways). Custom override configs will be placed in the model repositories of all quants produced using this method soon. The 2.2bpw quant was produced using exllamav3's optimize.py tool.
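For readers unfamiliar with the KL and Top-k columns in the table above, here is a minimal per-token sketch of such metrics in plain Python. It rests on two assumptions not stated in this card: that q denotes the quantized model's next-token distribution and o the original's, and that Top-k agreement means both models rank the same k tokens first, in the same order. The function names are illustrative and are not taken from exllamav3.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p‖q) = sum_i p_i * log(p_i / q_i), in nats. Not symmetric in p and q."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def top_k_agree(logits_a, logits_b, k):
    """True if both models rank the same k tokens first, in the same order (assumed definition)."""
    def top(logits):
        return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    return top(logits_a) == top(logits_b)


if __name__ == "__main__":
    # Toy logits over a 4-token vocabulary for a single position.
    original = [2.0, 1.0, 0.5, -1.0]
    quantized = [1.8, 0.4, 0.6, -0.7]
    o, q = softmax(original), softmax(quantized)
    print("KL(q‖o):", kl_divergence(q, o))
    print("KL(o‖q):", kl_divergence(o, q))
    print("Top-1 agree:", top_k_agree(original, quantized, 1))
    print("Top-2 agree:", top_k_agree(original, quantized, 2))
```

A full evaluation would average these values over every token position in a test set; the two KL directions generally differ, which is why the table reports both.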
Credits
Thanks to @mratsim for sharing his custom quants, methodology, override config and 8bpw baseline. Thanks to @Goldkoron for sharing the per-module KLD sensitivity chart for Qwen 3.5 397B. Thanks to @NeuroSenko for sharing the 3bpw, 4bpw and 5bpw baseline quants. Thanks to @MikeRoz for sharing the 2bpw baseline quant. Thanks to @turboderp for creating exllamav3.
Potential for future work
Future work could yield better quants by tweaking the superlinear penalty, incorporating 6bpw and 7bpw baselines, and quantizing individual experts to varying degrees as informed by REAP/REAM data. The methodology used here should also apply to other MoE models such as the GLM 4.5 family and Qwen 3.5 122B, and it might carry over to the GGUF ecosystem as well.
TODO
The quantization_bits metadata is misleading: it is a constant 3.00 because it was copied over from one of the baseline quants. head_bits might be incorrect for the same reason.
Model tree for cpral/Qwen3.5-397B-A17B-exl3
Base model
Qwen/Qwen3.5-397B-A17B