Galahad-Quality-Classifier
- Model Type: Sequence Classification Model
- Base Model: google/embeddinggemma-300m
This is a text classification model designed to enable qualitative data annotation, facilitate the creation of quality-specific data blends, and allow metadata tags to be added based on document quality. Unlike NVIDIA's quality-classifier, this model focuses only on political content.
The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:
| Class ID | Class Name | Score Mapping | Quality Description |
|---|---|---|---|
| 2 | High Quality | Score > 4 | Highly informative, professional, and well-structured content. |
| 1 | Medium Quality | 3 ≤ Score ≤ 4 | Acceptable, non-offensive, and relevant content with moderate structure. |
| 0 | Low Quality | Score < 3 | Poorly structured, potentially spammy, low-information, or inappropriate content. |
The scores were labeled by prompting DeepSeek-V3.2 (Chat) through the official API.
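The mapping in the table can be written as a small helper. This is a minimal sketch, assuming the annotation score is an integer from 0 to 5 (the grading prompt awards up to five points); the names `score_to_class` and `CLASS_NAMES` are illustrative and not part of the released code.

```python
# Class names as listed in the table above.
CLASS_NAMES = {0: "Low Quality", 1: "Medium Quality", 2: "High Quality"}


def score_to_class(score: int) -> int:
    """Map an annotation score (assumed 0-5) to a class ID per the table."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    if score > 4:       # Score > 4        -> High Quality
        return 2
    if score >= 3:      # 3 <= Score <= 4  -> Medium Quality
        return 1
    return 0            # Score < 3        -> Low Quality


if __name__ == "__main__":
    for s in range(6):
        print(s, CLASS_NAMES[score_to_class(s)])
```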
Data Labelling
We adopt SmolLM2's grading prompt to generate the annotations (a score and its reasoning):
SYSTEM_PROMPT = """
Evaluate the following [CORPUS] for its potential usefulness for training a professional political science large language model.
Use the following 5-point scoring system. Points are accumulated based on the satisfaction of each criterion:
1. Relevance (+1): Add 1 point if the extract contains explicit political content (e.g., government, power, elections, policy). Note: General social science (history, economy, law) counts ONLY if analyzed through a political lens (e.g., political history, political economy).
2. Topic Coverage (+1): Add another point if the extract specifically addresses core subfields: political institutions, political behavior, comparative politics, international relations, public policy, or political theory. The text must be coherent and readable (not spam or gibberish).
3. Analytical Depth (+1): Award a third point if the extract goes beyond simple reporting/facts and demonstrates analysis, argumentation, causal explanation, or interpretation relevant to political science.
4. Academic Structure (+1): Grant a fourth point if the extract presents structured, coherent, and academically useful analysis similar to undergraduate textbooks, think-tank reports, or detailed geopolitical commentary.
5. Scholarly Value (+1): Give a fifth point if the extract is outstanding in its scholarly value, containing theoretically grounded arguments, empirical reasoning, or advanced frameworks suitable for graduate-level study or professional research.
After examining the extract:
1. Briefly justify your total score (up to 100 words).
2. Conclude with the score
"""
PROMPT_TEMPLATE = """
[CORPUS]:
{DOCUMENT}
"""
Results
| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.81 | 0.66 | 0.73 | 119 |
| 1 | 0.78 | 0.84 | 0.81 | 354 |
| 2 | 0.42 | 0.39 | 0.41 | 69 |
| Accuracy | — | — | 0.75 | 542 |
| Macro Avg | 0.67 | 0.63 | 0.65 | 542 |
| Weighted Avg | 0.74 | 0.75 | 0.74 | 542 |
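For readers unfamiliar with the two averaging rows: the macro average is the unweighted mean over the three classes, while the weighted average weights each class by its support. The sketch below recomputes both from the per-class rows of the table. Because the inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit.

```python
# Per-class (precision, recall, f1, support), copied from the table above.
per_class = {
    0: (0.81, 0.66, 0.73, 119),
    1: (0.78, 0.84, 0.81, 354),
    2: (0.42, 0.39, 0.41, 69),
}

total_support = sum(row[3] for row in per_class.values())


def macro_avg(metric_idx):
    """Unweighted mean of one metric across classes."""
    return sum(row[metric_idx] for row in per_class.values()) / len(per_class)


def weighted_avg(metric_idx):
    """Mean of one metric across classes, weighted by class support."""
    return sum(row[metric_idx] * row[3] for row in per_class.values()) / total_support


if __name__ == "__main__":
    for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
        print(f"{name}: macro={macro_avg(idx):.3f} weighted={weighted_avg(idx):.3f}")
```

The gap between the macro and weighted F1 reflects the class imbalance: class 2 has the lowest F1 and the smallest support, so it pulls the macro average down more than the weighted one.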
Limitations: The quality assessment is inherently subjective because the data is labeled by a single model, and the labels may vary if a different model is used. The dataset is also relatively small, at only 30k samples.
HOW-TO
To try the model, visit the Hugging Face Space galahad-classifier, or use transformers.pipeline directly:
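from sklearn.metrics import classification_report
from transformers import pipeline

# Load the classifier first; the label mapping below depends on it.
classifier = pipeline("text-classification", model="TerenceLau/galahad-classifier")

# Map label names back to integer class IDs (0, 1, 2).
LABEL_TO_ID = {v: k for k, v in classifier.model.config.id2label.items()}

def classify_text_batch(texts):
    """Return the predicted class ID for each input text."""
    results = classifier(texts, truncation=True, batch_size=32, function_to_apply="none", max_length=1024)
    return [LABEL_TO_ID[r["label"]] for r in results]

# `data` is assumed to be a pandas DataFrame with "text" and "label" columns.
texts = data["text"].tolist()
y_true = data["label"].tolist()
y_pred = classify_text_batch(texts)
# Predictions are already integer class IDs, so no rounding is needed.
print(classification_report(y_true, y_pred, labels=[0, 1, 2]))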
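The predicted class IDs can then be used for the purposes named in the introduction: attaching quality metadata and building quality-specific blends. A minimal sketch, assuming class IDs like those returned by `classify_text_batch`; the helper names `tag_documents` and `build_blend` and the tag strings are hypothetical.

```python
# Hypothetical short tag names for the three class IDs.
QUALITY_TAGS = {0: "low", 1: "medium", 2: "high"}


def tag_documents(texts, class_ids):
    """Attach a quality ID and tag to each document."""
    if len(texts) != len(class_ids):
        raise ValueError("texts and class_ids must have the same length")
    return [
        {"text": text, "quality_id": cid, "quality": QUALITY_TAGS[cid]}
        for text, cid in zip(texts, class_ids)
    ]


def build_blend(tagged, min_quality_id=1):
    """Keep only documents at or above the given quality class."""
    return [doc for doc in tagged if doc["quality_id"] >= min_quality_id]


if __name__ == "__main__":
    docs = ["spam spam", "a news summary", "a detailed policy analysis"]
    tagged = tag_documents(docs, [0, 1, 2])
    print([d["quality"] for d in build_blend(tagged, min_quality_id=1)])
```

Given the lower precision and recall on class 2 reported above, a blend restricted to `min_quality_id=2` will be noisier than one that also admits class 1.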