Qwen3.5-397B-A17B-EXL3 Pareto Frontier

Pareto-frontier EXL3 quants for Qwen 3.5 397B.

Pick the quant with the lowest KL divergence and perplexity (PPL) that fits your hardware. Each quant lives in its own model repository.
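As a rough illustration of that selection rule, the sketch below copies the size and KL(q→o) columns from the table in this card, drops quants that a smaller quant already beats on KL, and picks the lowest-KL quant under a size budget. The helper names and the 24 GiB headroom for KV cache and activations are hypothetical; real headroom depends on context length and backend.

```python
# (name, size in GiB, KL(q->o)) copied from the table in this card.
QUANTS = [
    ("MikeRoz 2.0bpw", 97, 0.5160), ("MikeRoz 2.08bpw", 100, 0.1210),
    ("cpral 2.20bpw", 104, 0.1198), ("cpral 2.36bpw", 113, 0.0819),
    ("cpral 2.64bpw", 126, 0.0429), ("cpral 2.93bpw", 139, 0.0319),
    ("NeuroSenko 3.0bpw", 142, 0.0674), ("NeuroSenko 3.03bpw", 143, 0.0474),
    ("cpral 3.11bpw", 147, 0.0270), ("cpral 3.29bpw", 156, 0.0200),
    ("cpral 3.45bpw", 163, 0.0159), ("mratsim 3.47bpw", 164, 0.0203),
    ("cpral 3.53bpw", 167, 0.0134), ("cpral 3.57bpw", 169, 0.0127),
    ("cpral 3.68bpw", 173, 0.0120), ("NeuroSenko 4.0bpw", 188, 0.0203),
    ("NeuroSenko 4.03bpw", 189, 0.0149), ("cpral 4.61bpw", 216, 0.0054),
    ("NeuroSenko 5.0bpw", 234, 0.0079), ("mratsim 8.0bpw", 385, 0.0025),
]


def pareto_front(quants):
    """Keep only quants that improve KL over every smaller-or-equal quant."""
    front = []
    best_kl = float("inf")
    for name, size, kl in sorted(quants, key=lambda q: (q[1], q[2])):
        if kl < best_kl:
            front.append((name, size, kl))
            best_kl = kl
    return front


def pick_quant(quants, budget_gib):
    """Return the lowest-KL quant whose weights fit in budget_gib, or None."""
    fitting = [q for q in quants if q[1] <= budget_gib]
    return min(fitting, key=lambda q: q[2]) if fitting else None


# Hypothetical setup: 192 GiB of VRAM with 24 GiB reserved as headroom.
budget = 192 - 24
print(pick_quant(pareto_front(QUANTS), budget))
```

With these numbers the pick is the 3.53bpw quant at 167 GiB. The filter also shows that some baseline quants in the table are dominated: a smaller quant reaches a lower KL.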

| Quant | GiB | GB | bpw | PPL | KL(q→o) | KL(o→q) | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MikeRoz 2.0bpw | 97 | 104 | 2.00 | 5.072 | 0.5160 | 0.8210 | 76.1% | 41.3% | 18.6% | 7.5% | 2.9% |
| MikeRoz 2.08bpw | 100 | 107 | 2.08 | 3.386 | 0.1210 | 0.1630 | 89.3% | 62.6% | 38.3% | 21.6% | 11.7% |
| cpral 2.20bpw | 104 | 112 | 2.20 | 3.381 | 0.1198 | 0.1591 | 89.4% | 62.8% | 38.5% | 21.6% | 11.6% |
| cpral 2.36bpw | 113 | 121 | 2.36 | 3.260 | 0.0819 | 0.1054 | 91.6% | 68.1% | 44.6% | 27.1% | 15.7% |
| cpral 2.64bpw | 126 | 135 | 2.64 | 3.139 | 0.0429 | 0.0490 | 94.1% | 75.5% | 54.3% | 36.5% | 23.4% |
| cpral 2.93bpw | 139 | 149 | 2.93 | 3.117 | 0.0319 | 0.0349 | 94.8% | 78.1% | 58.3% | 40.6% | 27.0% |
| NeuroSenko 3.0bpw | 142 | 153 | 3.00 | 3.220 | 0.0674 | 0.0776 | 91.9% | 68.4% | 44.5% | 26.6% | 14.8% |
| NeuroSenko 3.03bpw | 143 | 154 | 3.03 | 3.173 | 0.0474 | 0.0531 | 93.5% | 73.4% | 51.1% | 32.9% | 20.1% |
| cpral 3.11bpw | 147 | 158 | 3.11 | 3.114 | 0.0270 | 0.0296 | 95.3% | 79.8% | 60.7% | 43.3% | 29.6% |
| cpral 3.29bpw | 156 | 167 | 3.29 | 3.089 | 0.0200 | 0.0213 | 96.0% | 82.1% | 64.3% | 47.4% | 33.5% |
| cpral 3.45bpw | 163 | 175 | 3.45 | 3.081 | 0.0159 | 0.0166 | 96.4% | 83.7% | 67.3% | 51.2% | 37.3% |
| mratsim 3.47bpw | 164 | 175 | 3.47 | 3.096 | 0.0203 | 0.0216 | 96.0% | 82.2% | 64.7% | 48.1% | 34.1% |
| cpral 3.53bpw | 167 | 179 | 3.53 | 3.075 | 0.0134 | 0.0139 | 96.7% | 84.9% | 69.3% | 53.5% | 39.8% |
| cpral 3.57bpw | 169 | 181 | 3.57 | 3.072 | 0.0127 | 0.0130 | 96.7% | 85.2% | 69.8% | 54.2% | 40.4% |
| cpral 3.68bpw | 173 | 186 | 3.68 | 3.069 | 0.0120 | 0.0122 | 96.9% | 85.7% | 70.6% | 55.1% | 41.3% |
| NeuroSenko 4.0bpw | 188 | 202 | 4.00 | 3.101 | 0.0203 | 0.0210 | 95.7% | 81.0% | 62.3% | 44.7% | 30.5% |
| NeuroSenko 4.03bpw | 189 | 203 | 4.03 | 3.082 | 0.0149 | 0.0153 | 96.3% | 83.9% | 67.2% | 50.7% | 36.6% |
| cpral 4.61bpw | 216 | 232 | 4.61 | 3.059 | 0.0054 | 0.0054 | 97.8% | 90.0% | 78.4% | 65.4% | 52.6% |
| NeuroSenko 5.0bpw | 234 | 252 | 5.00 | 3.067 | 0.0079 | 0.0079 | 97.3% | 87.6% | 73.9% | 59.0% | 45.3% |
| mratsim 8.0bpw | 385 | 400 | 8.00 | 3.055 | 0.0025 | 0.0026 | 98.6% | 93.3% | 85.0% | 75.1% | 64.7% |
| original bf16 | 752 | 807 | 16.00 | 3.053 | - | - | - | - | - | - | - |
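The evaluation script's exact definitions are not stated in this card, so the following is only a sketch of how metrics like these are commonly computed from next-token distributions. It assumes PPL is the exponential of the mean negative log-likelihood, that the two KL columns are the two directions of KL divergence between the quantized (q) and original (o) distributions, and that Top-N counts positions where both models rank the same N tokens first in the same order. Which KL direction maps to which column is not specified here, and the logits are made-up toy values.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top_n_agree(p, q, n):
    """True if both distributions rank the same n tokens first, in order."""
    def rank(d):
        return sorted(range(len(d)), key=lambda i: -d[i])[:n]
    return rank(p) == rank(q)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood of the reference tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy logits over a 4-token vocabulary (not real model output).
original = softmax([2.0, 1.0, 0.5, -1.0])
quantized = softmax([1.8, 1.1, -0.7, 0.2])

# KL is asymmetric, which is why the table reports both directions.
print(kl_divergence(quantized, original), kl_divergence(original, quantized))

# Agreement gets stricter as N grows, so Top-5 is always <= Top-1.
print([top_n_agree(original, quantized, n) for n in (1, 2, 3)])
```

In this toy case the two models agree on the first two ranks but swap the third and fourth tokens, which mirrors how the Top-N percentages in the table fall off as N increases.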

Methodology

The methodology I used to create the custom quants is documented at https://github.com/adamo1139/qwen397b-exl3 and is mostly reproducible (I may have manually overridden some auto-generated configs in minor ways). Custom override configs will be added soon to the model repositories of all quants produced with this method. The 2.2bpw quant was produced with exllamav3's optimize.py tool.

Credits

Thanks to @mratsim for sharing his custom quants, methodology, override config and 8bpw baseline. Thanks to @Goldkoron for sharing the per-module KLD sensitivity chart for Qwen 3.5 397B. Thanks to @NeuroSenko for sharing the 3bpw, 4bpw and 5bpw baseline quants. Thanks to @MikeRoz for sharing the 2bpw baseline quant. Thanks to @turboderp for creating exllamav3.

Potential for future work

Future work could produce better quants by tweaking the superlinear penalty, incorporating 6bpw and 7bpw baselines, and quantizing individual experts to varying degrees as informed by REAP/REAM data. The methodology used here should also apply to other MoE models, such as the GLM 4.5 family and Qwen 3.5 122B, and might carry over to the GGUF ecosystem.

TODO

The quantization_bits metadata is misleading: it is a constant 3.00 because it was copied over from one of the baseline quants. head_bits might be incorrect for the same reason.
