Galahad-Quality-Classifier

Model Type: Sequence Classification Model
Base Model: google/embeddinggemma-300m

This is a text classification model designed to enable qualitative data annotation, facilitate the creation of quality-specific data blends, and allow for the addition of metadata tags based on document quality. (Unlike nvidia's quality-classifier, this model only focus on political-related content)

The model classifies documents (e.g., text samples, comments, reviews, or web pages) into one of three distinct quality classes:

Class ID	Class Name	Score Mapping	Quality Description
2	High Quality	Score > 4	Highly informative, professional, and well-structured content.
1	Medium Quality	3 ≤ Score ≤ 4	Acceptable, non-offensive, and relevant content with moderate structure.
0	Low Quality	Score < 3	Poorly structured, potentially spammy, low-information, or inappropriate content.

The score is labeled by prompting the Deepseek-V3.2(Chat) via official api.

Data Labelling

We adopt SmolLM2's grading prompt to help generate annotation (score and reasoning):

SYSTEM_PROMPT = """
Evaluate the following [CORPUS] for its potential usefulness for training a professional political science large language model.
Use the following 5-point scoring system. Points are accumulated based on the satisfaction of each criterion:

1.  Relevance (+1): Add 1 point if the extract contains explicit political content (e.g., government, power, elections, policy). Note: General social science (history, economy, law) counts ONLY if analyzed through a political lens (e.g., political history, political economy).
2.  Topic Coverage (+1): Add another point if the extract specifically addresses core subfields: political institutions, political behavior, comparative politics, international relations, public policy, or political theory. The text must be coherent and readable (not spam or gibberish).
3.  Analytical Depth (+1): Award a third point if the extract goes beyond simple reporting/facts and demonstrates analysis, argumentation, causal explanation, or interpretation relevant to political science.
4.  Academic Structure (+1): Grant a fourth point if the extract presents structured, coherent, and academically useful analysis similar to undergraduate textbooks, think-tank reports, or detailed geopolitical commentary.
5.  Scholarly Value (+1): Give a fifth point if the extract is outstanding in its scholarly value, containing theoretically grounded arguments, empirical reasoning, or advanced frameworks suitable for graduate-level study or professional research.

After examining the extract:
    1. Briefly justify your total score (up to 100 words).
    2. Conclude with the score
"""

PROMPT_TEMPLATE = """
[CORPUS]:
{DOCUMENT}
"""

Results

Label	Precision	Recall	F1-Score	Support
0	0.81	0.66	0.73	119
1	0.78	0.84	0.81	354
2	0.42	0.39	0.41	69
Accuracy	—	—	0.75	542
Macro Avg	0.67	0.63	0.65	542
Weighted Avg	0.74	0.75	0.74	542

Limitations: The quality assessment is inherently subjective as the data is labeled by a single model and may vary among different models. The dataset is relatively small with the size is only 30k.

HOW-TO

To try, please visit the huggingface space: galahad-classifier or, just use transformers.pipeline:

from transformers import pipeline
from sklearn.metrics import classification_report

LABEL_TO_ID = {v: k for k, v in classifier.model.config.id2label.items()}
classifier = pipeline("text-classification", model="TerenceLau/galahad-classifier")

def classify_text_batch(texts):
    results = classifier(texts, truncation=True, batch_size=32, function_to_apply="none", max_length=1024)
    return [LABEL_TO_ID[r['label']] for r in results]

texts = data['text'].tolist()
y_true = data['label'].tolist()
y_pred = classify_text_batch(texts)
print(classification_report(y_true, np.round(y_pred), labels=[0, 1, 2]))

Downloads last month: -

Safetensors

Model size

0.3B params

Tensor type

F32

Model tree for TerenceLau/galahad-classifier-300m

Base model

google/embeddinggemma-300m

Finetuned

(177)

this model

Dataset used to train TerenceLau/galahad-classifier-300m

Space using TerenceLau/galahad-classifier-300m 1

Collection including TerenceLau/galahad-classifier-300m

Galahad

Collection

Galahad • 3 items • Updated 19 days ago