Galahad-Quality-Classifier

This is a text classification model designed to annotate data by quality, facilitate the creation of quality-specific data blends, and add quality-based metadata tags to documents. (Unlike NVIDIA's quality-classifier, this model focuses only on politics-related content.)

The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:

| Class ID | Class Name | Score Mapping | Quality Description |
|---|---|---|---|
| 2 | High Quality | Score > 4 | Highly informative, professional, and well-structured content. |
| 1 | Medium Quality | 3 ≤ Score ≤ 4 | Acceptable, non-offensive, and relevant content with moderate structure. |
| 0 | Low Quality | Score < 3 | Poorly structured, potentially spammy, low-information, or inappropriate content. |
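The score-to-class mapping above can be sketched as a small helper (a minimal illustration of the thresholds in the table, not part of the released code):

```python
def score_to_class(score: float) -> int:
    """Map a 0-5 grader score to a quality class id."""
    if score > 4:
        return 2  # High Quality
    if score >= 3:
        return 1  # Medium Quality
    return 0      # Low Quality
```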

Scores are labeled by prompting DeepSeek-V3.2 (Chat) via the official API.

Data Labelling

We adapt SmolLM2's grading prompt to generate annotations (score and reasoning):

```python
SYSTEM_PROMPT = """
Evaluate the following [CORPUS] for its potential usefulness for training a professional political science large language model.
Use the following 5-point scoring system. Points are accumulated based on the satisfaction of each criterion:

1.  Relevance (+1): Add 1 point if the extract contains explicit political content (e.g., government, power, elections, policy). Note: General social science (history, economy, law) counts ONLY if analyzed through a political lens (e.g., political history, political economy).
2.  Topic Coverage (+1): Add another point if the extract specifically addresses core subfields: political institutions, political behavior, comparative politics, international relations, public policy, or political theory. The text must be coherent and readable (not spam or gibberish).
3.  Analytical Depth (+1): Award a third point if the extract goes beyond simple reporting/facts and demonstrates analysis, argumentation, causal explanation, or interpretation relevant to political science.
4.  Academic Structure (+1): Grant a fourth point if the extract presents structured, coherent, and academically useful analysis similar to undergraduate textbooks, think-tank reports, or detailed geopolitical commentary.
5.  Scholarly Value (+1): Give a fifth point if the extract is outstanding in its scholarly value, containing theoretically grounded arguments, empirical reasoning, or advanced frameworks suitable for graduate-level study or professional research.

After examining the extract:
    1. Briefly justify your total score (up to 100 words).
    2. Conclude with the total score.
"""

PROMPT_TEMPLATE = """
[CORPUS]:
{DOCUMENT}
"""
```
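Since the grader replies with free-text reasoning followed by the score, the numeric score still has to be extracted from the response. A minimal sketch of such a parser (the `parse_score` helper is hypothetical, not part of the released pipeline; it assumes the score appears as the last standalone 0-5 digit in the reply):

```python
import re

def parse_score(reply: str) -> int:
    """Extract the final 0-5 score from the grader's free-text reply."""
    matches = re.findall(r"\b([0-5])\b", reply)
    if not matches:
        raise ValueError("no score found in reply")
    return int(matches[-1])  # the score is stated last, after the justification
```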

Results

| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.81 | 0.66 | 0.73 | 119 |
| 1 | 0.78 | 0.84 | 0.81 | 354 |
| 2 | 0.42 | 0.39 | 0.41 | 69 |
| Accuracy | — | — | 0.75 | 542 |
| Macro Avg | 0.67 | 0.63 | 0.65 | 542 |
| Weighted Avg | 0.74 | 0.75 | 0.74 | 542 |
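As a sanity check on the table, the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support. For F1:

```python
# Per-class F1 scores and supports from the results table above.
f1 = {0: 0.73, 1: 0.81, 2: 0.41}
support = {0: 119, 1: 354, 2: 69}

macro_f1 = sum(f1.values()) / len(f1)                       # unweighted mean
total = sum(support.values())                               # 542
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total   # support-weighted mean
```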

Limitations: The quality assessment is inherently subjective, as the data is labeled by a single model and may vary across models. The dataset is also relatively small, at only ~30k documents.

HOW-TO

To try the model, visit the Hugging Face Space galahad-classifier, or use transformers.pipeline directly:

```python
from transformers import pipeline
from sklearn.metrics import classification_report

classifier = pipeline("text-classification", model="TerenceLau/galahad-classifier")

# Map label names (e.g. "High Quality") back to integer class ids.
LABEL_TO_ID = {v: k for k, v in classifier.model.config.id2label.items()}

def classify_text_batch(texts):
    results = classifier(texts, truncation=True, batch_size=32, max_length=1024)
    return [LABEL_TO_ID[r["label"]] for r in results]

# `data` is assumed to be a DataFrame with "text" and "label" columns.
texts = data["text"].tolist()
y_true = data["label"].tolist()
y_pred = classify_text_batch(texts)
print(classification_report(y_true, y_pred, labels=[0, 1, 2]))
```
Model size: 0.3B parameters (Safetensors, F32)