Qwen3.5-397B-A17B-EXL3 Pareto Frontier

Pareto-frontier EXL3 quants for Qwen 3.5 397B.

Pick the quant with the lowest KL divergence and perplexity (PPL) that fits your hardware. Each quant lives in its own model repository.

Quant	GiB	GB	bpw	PPL	KL(q→o)	KL(o→q)	Top-1	Top-2	Top-3	Top-4	Top-5
MikeRoz 2.0bpw	97	104	2.00	5.072	0.5160	0.8210	76.1%	41.3%	18.6%	7.5%	2.9%
MikeRoz 2.08bpw	100	107	2.08	3.386	0.1210	0.1630	89.3%	62.6%	38.3%	21.6%	11.7%
cpral 2.20bpw	104	112	2.20	3.381	0.1198	0.1591	89.4%	62.8%	38.5%	21.6%	11.6%
cpral 2.36bpw	113	121	2.36	3.260	0.0819	0.1054	91.6%	68.1%	44.6%	27.1%	15.7%
cpral 2.64bpw	126	135	2.64	3.139	0.0429	0.0490	94.1%	75.5%	54.3%	36.5%	23.4%
cpral 2.93bpw	139	149	2.93	3.117	0.0319	0.0349	94.8%	78.1%	58.3%	40.6%	27.0%
NeuroSenko 3.0bpw	142	153	3.00	3.220	0.0674	0.0776	91.9%	68.4%	44.5%	26.6%	14.8%
NeuroSenko 3.03bpw	143	154	3.03	3.173	0.0474	0.0531	93.5%	73.4%	51.1%	32.9%	20.1%
cpral 3.11bpw	147	158	3.11	3.114	0.0270	0.0296	95.3%	79.8%	60.7%	43.3%	29.6%
cpral 3.29bpw	156	167	3.29	3.089	0.0200	0.0213	96.0%	82.1%	64.3%	47.4%	33.5%
cpral 3.45bpw	163	175	3.45	3.081	0.0159	0.0166	96.4%	83.7%	67.3%	51.2%	37.3%
mratsim 3.47bpw	164	175	3.47	3.096	0.0203	0.0216	96.0%	82.2%	64.7%	48.1%	34.1%
cpral 3.53bpw	167	179	3.53	3.075	0.0134	0.0139	96.7%	84.9%	69.3%	53.5%	39.8%
cpral 3.57bpw	169	181	3.57	3.072	0.0127	0.0130	96.7%	85.2%	69.8%	54.2%	40.4%
cpral 3.68bpw	173	186	3.68	3.069	0.0120	0.0122	96.9%	85.7%	70.6%	55.1%	41.3%
NeuroSenko 4.0bpw	188	202	4.00	3.101	0.0203	0.0210	95.7%	81.0%	62.3%	44.7%	30.5%
NeuroSenko 4.03bpw	189	203	4.03	3.082	0.0149	0.0153	96.3%	83.9%	67.2%	50.7%	36.6%
cpral 4.61bpw	216	232	4.61	3.059	0.0054	0.0054	97.8%	90.0%	78.4%	65.4%	52.6%
NeuroSenko 5.0bpw	234	252	5.00	3.067	0.0079	0.0079	97.3%	87.6%	73.9%	59.0%	45.3%
mratsim 8.0bpw	385	400	8.00	3.055	0.0025	0.0026	98.6%	93.3%	85.0%	75.1%	64.7%
original bf16	752	807	16.00	3.053	—	—	—	—	—	—	—
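
As a rough illustration of how to read the table, the sketch below recomputes the size-versus-KL Pareto frontier and picks the lowest-KL quant that fits a given memory budget. It uses only the GiB and KL(q→o) columns copied from the table. The function names and the 192 GiB budget are hypothetical, chosen for the example. The budget covers weights only, so leave headroom for the KV cache and activations.

```python
# (name, size in GiB, KL(q→o)) copied from the table above.
QUANTS = [
    ("MikeRoz 2.0bpw", 97, 0.5160),
    ("MikeRoz 2.08bpw", 100, 0.1210),
    ("cpral 2.20bpw", 104, 0.1198),
    ("cpral 2.36bpw", 113, 0.0819),
    ("cpral 2.64bpw", 126, 0.0429),
    ("cpral 2.93bpw", 139, 0.0319),
    ("NeuroSenko 3.0bpw", 142, 0.0674),
    ("NeuroSenko 3.03bpw", 143, 0.0474),
    ("cpral 3.11bpw", 147, 0.0270),
    ("cpral 3.29bpw", 156, 0.0200),
    ("cpral 3.45bpw", 163, 0.0159),
    ("mratsim 3.47bpw", 164, 0.0203),
    ("cpral 3.53bpw", 167, 0.0134),
    ("cpral 3.57bpw", 169, 0.0127),
    ("cpral 3.68bpw", 173, 0.0120),
    ("NeuroSenko 4.0bpw", 188, 0.0203),
    ("NeuroSenko 4.03bpw", 189, 0.0149),
    ("cpral 4.61bpw", 216, 0.0054),
    ("NeuroSenko 5.0bpw", 234, 0.0079),
    ("mratsim 8.0bpw", 385, 0.0025),
]


def pareto_frontier(quants):
    """Keep only quants that beat every smaller quant on KL."""
    frontier = []
    best_kl = float("inf")
    # Walk from smallest to largest; a quant stays only if it lowers the best KL so far.
    for name, gib, kl in sorted(quants, key=lambda q: (q[1], q[2])):
        if kl < best_kl:
            frontier.append((name, gib, kl))
            best_kl = kl
    return frontier


def pick_quant(quants, budget_gib):
    """Return the lowest-KL quant whose weights fit in budget_gib, or None."""
    fitting = [q for q in quants if q[1] <= budget_gib]
    return min(fitting, key=lambda q: q[2]) if fitting else None


# Hypothetical example budget: 192 GiB available for weights.
best = pick_quant(QUANTS, budget_gib=192)
print(best)
for name, gib, kl in pareto_frontier(QUANTS):
    print(f"{name:20s} {gib:4d} GiB  KL={kl:.4f}")
```

Using KL(o→q) or PPL as the quality column instead should only require swapping the third tuple element.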

Methodology

The methodology I used to create the custom quants is documented in https://github.com/adamo1139/qwen397b-exl3 and is mostly reproducible (I may have manually overridden some auto-generated configs in minor ways). Custom override configs will be placed into the model repositories of all quants produced using this method soon. The 2.2bpw quant was produced using exllamav3's optimize.py tool.

Credits

Thanks to @mratsim for sharing his custom quants, methodology, override config and 8bpw baseline. Thanks to @Goldkoron for sharing the per-module KLD sensitivity chart for Qwen 3.5 397B. Thanks to @NeuroSenko for sharing the 3bpw, 4bpw and 5bpw baseline quants. Thanks to @MikeRoz for sharing the 2bpw baseline quant. Thanks to @turboderp for creating exllamav3.

Potential for future work

Future work could enable better quants by tweaking the superlinear penalty, incorporating 6bpw and 7bpw baselines, and quantizing individual experts to varying degrees as informed by REAP/REAM data. The methodology used here should also be applicable to other MoE models such as the GLM 4.5 family and Qwen 3.5 122B, and it might carry over to the GGUF ecosystem as well.

TODO

The quantization_bits metadata is misleading: it is a constant 3.00 because it was copied over from one of the baseline quants. head_bits might be incorrect in the same way.
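
Until the metadata is fixed, one rough cross-check is to estimate the average bits per weight from the on-disk size instead of trusting the stored value. The sketch below assumes size scales linearly with bpw relative to the 752 GiB bf16 original from the table. The helper name is hypothetical. Because the table sizes are rounded and not every tensor is quantized to the same width, expect the estimate to land near the listed bpw rather than match it exactly.

```python
BF16_GIB = 752     # size of the original bf16 checkpoint, from the table
BF16_BITS = 16.0   # bits per weight of the bf16 original


def estimate_bpw(quant_gib, bf16_gib=BF16_GIB):
    """Rough average bits per weight, assuming size scales linearly with bpw."""
    return BF16_BITS * quant_gib / bf16_gib


# The 173 GiB quant is listed as 3.68bpw in the table.
print(round(estimate_bpw(173), 2))
```

An estimate far from 3.00 for a quant whose metadata still reads 3.00 is a hint that the stored value was inherited from a baseline rather than measured.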

Model tree for cpral/Qwen3.5-397B-A17B-exl3

Base model

Qwen/Qwen3.5-397B-A17B