| Class | precision | recall | f1-score | support |
|---|---:|---:|---:|---:|
| Brainstorming | 0.94 | 0.93 | 0.94 | 496 |
| Coding | 0.97 | 1.00 | 0.98 | 500 |
| Extraction | 0.90 | 0.99 | 0.94 | 500 |
| Factual QA | 0.95 | 0.99 | 0.97 | 500 |
| Generation | 0.96 | 0.88 | 0.92 | 497 |
| Math | 0.98 | 1.00 | 0.99 | 500 |
| Reasoning | 1.00 | 0.90 | 0.94 | 500 |
| accuracy | | | 0.96 | 3493 |
| macro avg | 0.96 | 0.96 | 0.96 | 3493 |
| weighted avg | 0.96 | 0.96 | 0.96 | 3493 |
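
As a sanity check on the summary rows, the macro and weighted averages can be recomputed from the per-class rows. The snippet below is a minimal sketch, not the script that produced the report: the `report` dictionary and the helper names `macro_avg` and `weighted_avg` are assumptions introduced for illustration, and the inputs are the two-decimal values printed above, so the recomputed averages can differ from the reported ones in the last digit.

```python
# Per-class rows copied from the report: (precision, recall, f1-score, support).
# These are the rounded values as printed, not the underlying raw scores.
report = {
    "Brainstorming": (0.94, 0.93, 0.94, 496),
    "Coding":        (0.97, 1.00, 0.98, 500),
    "Extraction":    (0.90, 0.99, 0.94, 500),
    "Factual QA":    (0.95, 0.99, 0.97, 500),
    "Generation":    (0.96, 0.88, 0.92, 497),
    "Math":          (0.98, 1.00, 0.99, 500),
    "Reasoning":     (1.00, 0.90, 0.94, 500),
}

SUPPORT = 3  # index of the support column in each row

total_support = sum(row[SUPPORT] for row in report.values())


def macro_avg(col: int) -> float:
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col: int) -> float:
    """Mean of one metric column, weighted by each class's support."""
    return sum(row[col] * row[SUPPORT] for row in report.values()) / total_support


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:9s} macro={macro_avg(col):.3f} weighted={weighted_avg(col):.3f}")
print("total support:", total_support)
```

Because the class supports are nearly equal (496 to 500), the macro and weighted averages come out almost identical. For single-label classification, the support-weighted recall is also the overall accuracy, which is consistent with the accuracy row.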