bad perplexity

#2
by snomile - opened

I'm afraid I have some bad news: perplexity is badly degraded across almost every dataset...

| Dataset | Baseline | Qwen3-30B-A3B-Instruct-2507-FP4 |
| --- | --- | --- |
| English (General) | 10.4457 | 12.37 (+18.4%) ❌ |
| Chinese (General) | 9.2344 | 11.15 (+20.8%) ❌ |
| Code (Python Logic) | 2.5934 | 3.09 (+19.2%) ❌ |
| Math (Reasoning) | 3.0195 | 3.01 (-0.5%) ✅ |
| German (Translation) | 3.2072 | 3.56 (+11.0%) ❌ |
| French (Euro Lang) | 2.5961 | 2.89 (+11.4%) ❌ |
| Japanese (Asian Lang) | 4.4435 | 5.03 (+13.2%) ❌ |
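For anyone reproducing these numbers: perplexity here is presumably the standard definition, the exponential of the mean negative log-likelihood per token, so a +18% jump means the quantized model assigns noticeably lower probability to held-out text. A minimal sketch of that computation (the toy log-probabilities are illustrative, not from the actual eval):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: if every token gets probability 1/10,
# perplexity is exactly 10.
logprobs = [math.log(0.1)] * 5
print(perplexity(logprobs))  # → 10.0
```

Lower is better, which is why every row except Math above is a regression.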

The eval results are also very bad:

================================================================================
(Benchmark Comparison)

+----------+-------------+--------+---------+--------+------------+----------+
| Task | Metric | Base | Quant | Diff | Recovery | Status |
+==========+=============+========+=========+========+============+==========+
| cmmlu | acc_norm | 83.34 | 75.27 | -8.06 | 90.3% | ❌ |
+----------+-------------+--------+---------+--------+------------+----------+
| gsm8k | exact_match | 88.63 | 79.83 | -8.79 | 90.1% | ❌ |
+----------+-------------+--------+---------+--------+------------+----------+
| mmlu | acc | 71.21 | 61.69 | -9.51 | 86.6% | ❌ |
+----------+-------------+--------+---------+--------+------------+----------+
| mmlu_pro | exact_match | 72.43 | 65.84 | -6.59 | 90.9% | ❌ |
+----------+-------------+--------+---------+--------+------------+----------+
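The Recovery column appears to be the quantized score as a percentage of the baseline score; a quick check against the table values confirms this (a small sketch, assuming that definition):

```python
def recovery(base, quant):
    """Recovery = quantized score as a percentage of the baseline."""
    return 100.0 * quant / base

# Values taken from the benchmark table above
print(round(recovery(83.34, 75.27), 1))  # cmmlu    → 90.3
print(round(recovery(88.63, 79.83), 1))  # gsm8k    → 90.1
print(round(recovery(71.21, 61.69), 1))  # mmlu     → 86.6
print(round(recovery(72.43, 65.84), 1))  # mmlu_pro → 90.9
```

All four tasks land below the ~99% recovery typically expected from a good quantization, hence the ❌ status on every row.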
