🧠 Open Multi-Label ASJC Classification
Model Overview
This model is a fine-tuned version of allenai/scibert_scivocab_uncased that assigns documents to 307 ASJC (All Science Journal Classification) subject categories, enabling document-level classification beyond traditional journal-level schemes.
- Task: Multi-label classification
- Labels: 307 ASJC subjects (the full label list is in the accompanying Google Sheet)
- Base Model: SciBERT
- Training Data: Crossref 2023 dataset (titles, abstracts, container titles)
- License: MIT
- Framework: Hugging Face Transformers
📚 Intended Use
- Classify individual research documents into multiple ASJC subjects.
- Analyze disciplinary orientation of collections (authors, institutions, databases).
- Works with title, abstract, and optionally container title metadata.
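The input layout shown under Example Usage (`title={...}, container_title={...}, abstract={...}`) can be assembled with a small helper. This is only a sketch: `build_input` is a hypothetical name, and leaving out fields that are unavailable is an assumption, since the card does not document how missing fields were encoded during training.

```python
from typing import Optional


def build_input(title: str,
                abstract: Optional[str] = None,
                container_title: Optional[str] = None) -> str:
    """Assemble one input string in the key={value} layout of Example Usage.

    Assumption: unavailable fields are simply omitted; the model card does
    not state how missing metadata was represented in the training data.
    """
    parts = [f"title={{{title}}}"]
    if container_title:
        parts.append(f"container_title={{{container_title}}}")
    if abstract:
        parts.append(f"abstract={{{abstract}}}")
    return ", ".join(parts)


# Title-only input and full-metadata input
print(build_input("Dose optimization of β-lactams"))
print(build_input("Dose optimization of β-lactams",
                  abstract="Background: ...",
                  container_title="Frontiers in Pharmacology"))
```

The field order (title, container title, abstract) follows the example input further down in this card.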
🛠 Training Details
- Preprocessing:
  - Removed “multidisciplinary” and 26 “miscellaneous” categories → 307 subjects.
  - Multi-hot encoding for multi-label classification.
  - Data augmentation for underrepresented classes.
- Fine-tuning:
  - Optimizer: AdamW
  - Loss: Binary Cross-Entropy
  - Learning Rate: 2e-5
  - Epochs: 1
  - Batch Size: 16
- Threshold for label assignment: 0.3
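To make the multi-hot encoding, the binary cross-entropy loss, and the 0.3 threshold listed above concrete, here is a minimal pure-Python sketch on a toy problem with 5 labels instead of 307. The function names and the toy logits are illustrative assumptions; the actual training code presumably relied on a framework implementation of the loss.

```python
import math
from typing import List, Sequence


def multi_hot(label_ids: Sequence[int], num_labels: int) -> List[float]:
    """Return a 0/1 vector with a 1 at every assigned label index."""
    vec = [0.0] * num_labels
    for idx in label_ids:
        vec[idx] = 1.0
    return vec


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy; every label is scored independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


def assign_labels(logits: Sequence[float], threshold: float = 0.3) -> List[int]:
    """Indices of all labels whose probability reaches the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]


# Toy document tagged with labels 1 and 3 (hypothetical indices)
targets = multi_hot([1, 3], num_labels=5)       # [0.0, 1.0, 0.0, 1.0, 0.0]
logits = [-4.0, 3.0, -0.5, 0.2, -3.0]           # made-up model outputs

print(bce_loss(logits, targets))
print(assign_labels(logits, threshold=0.3))     # -> [1, 2, 3]
print(assign_labels(logits, threshold=0.5))     # -> [1, 3]
```

In this toy case, label 2 has a probability of about 0.38, so it is assigned at the 0.3 threshold but not at 0.5. A lower threshold trades precision for recall.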
📈 Metrics
| Input Features | Labels | Precision | Recall | F1-Score (weighted) |
|---|---|---|---|---|
| Title + Container Title + Abstract | 307 | 0.912 | 0.885 | 0.892 |
| Title + Abstract | 307 | 0.607 | 0.503 | 0.532 |
| Title + Container Title | 307 | 0.949 | 0.957 | 0.952 |
| Title only | 307 | 0.528 | 0.416 | 0.448 |
At the level of the 26 parent subjects, the weighted F1-score rises to 0.934 with full metadata (Title + Container Title + Abstract) and to 0.694 with Title + Abstract.
The evaluation dataset is publicly available: Test-Dataset
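The card does not spell out how the weighted scores in the table are averaged. A common convention, the same as `average="weighted"` in scikit-learn, computes precision, recall, and F1 per label and weights each by the label's support (its number of true occurrences). The sketch below assumes that convention and uses made-up toy data; `weighted_prf` is a hypothetical helper name.

```python
from typing import Dict, List, Set


def weighted_prf(y_true: List[Set[str]], y_pred: List[Set[str]]) -> Dict[str, float]:
    """Support-weighted precision, recall, and F1 for multi-label predictions."""
    labels = set().union(*y_true, *y_pred)
    total_support = sum(len(t) for t in y_true)
    out = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        support = tp + fn
        if support == 0:
            continue  # labels that never occur in y_true get zero weight
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / support
        f1 = 2 * tp / (2 * tp + fp + fn)
        weight = support / total_support
        out["precision"] += weight * precision
        out["recall"] += weight * recall
        out["f1"] += weight * f1
    return out


# Toy gold labels and predictions for three documents (illustrative only)
y_true = [{"Pharmacology", "Pharmacology (medical)"}, {"Pharmacology"}, {"Oncology"}]
y_pred = [{"Pharmacology", "Pharmacology (medical)"}, {"Pharmacology", "Oncology"}, set()]

print(weighted_prf(y_true, y_pred))
```

Here "Pharmacology" (support 2) and "Pharmacology (medical)" (support 1) are predicted perfectly, while "Oncology" (support 1) is missed once and predicted wrongly once, so all three weighted scores come out at 0.75.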
✅ Model Strengths
- Handles interdisciplinary and general science journals.
- Works without a container title, though with markedly lower accuracy (weighted F1 of 0.532 with Title + Abstract versus 0.892 with full metadata).
- Scalable for large collections.
⚠️ Limitations
- Performance relies on metadata completeness (title, abstract, container title).
- Lower accuracy for rare subjects and for records whose source (container title) is missing.
- Reflects the ASJC schema as of April 2023; it is not updated for emerging fields.
🔍 Example Usage
from transformers import TextClassificationPipeline, pipeline
import torch
# --- Custom multi-label pipeline ---
class ASJCMultiLabelPipeline(TextClassificationPipeline):
    """
    Multi-label classification pipeline for ASJC categories.
    Uses a configurable threshold to return all labels with scores at or above the threshold.
    """

    def __init__(self, *args, **kwargs):
        # Allow threshold override; default falls back to model config
        self.threshold = kwargs.pop("threshold", None)
        super().__init__(*args, **kwargs)
        if self.threshold is None:
            self.threshold = getattr(self.model.config, "threshold", 0.3)

    def postprocess(self, model_outputs, **kwargs):
        # Convert logits to probabilities using sigmoid
        # (as_tensor reuses the logits if they are already a tensor)
        logits = torch.as_tensor(model_outputs["logits"]).float()
        scores = torch.sigmoid(logits).tolist()
        results = []
        for i, score in enumerate(scores[0]):
            if score >= self.threshold:
                label = self.model.config.id2label[i]
                results.append({"label": label, "score": float(score)})
        # Sort by descending score
        results = sorted(results, key=lambda x: x["score"], reverse=True)
        return results

# --- Create the pipeline explicitly using the custom class ---
pipe = pipeline(
    task="text-classification",
    model="asjc-classification/scibert_multilabel_asjc_classifier",
    pipeline_class=ASJCMultiLabelPipeline,
)
# --- Example text input ---
text = (
"title={Dose optimization of β-lactams antibiotics in pediatrics and adults: A systematic review}, "
"container_title={Frontiers in Pharmacology}, "
"abstract={Background: β-lactams remain the cornerstone of the empirical therapy to treat various bacterial infections. This systematic review aimed to analyze the data describing the dosing regimen of β-lactams.Methods: Systematic scientific and grey literature was performed in accordance with Preferred Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. The studies were retrieved and screened on the basis of pre-defined exclusion and inclusion criteria. The cohort studies, randomized controlled trials (RCT) and case reports that reported the dosing schedule of β-lactams are included in this study.Results: A total of 52 studies met the inclusion criteria, of which 40 were cohort studies, 2 were case reports and 10 were RCTs. The majority of the studies (34/52) studied the pharmacokinetic (PK) parameters of a drug. A total of 20 studies proposed dosing schedule in pediatrics while 32 studies proposed dosing regimen among adults. Piperacillin (12/52) and Meropenem (11/52) were the most commonly used β-lactams used in hospitalized patients. As per available evidence, continuous infusion is considered as the most appropriate mode of administration to optimize the safety and efficacy of the treatment and improve the clinical outcomes.Conclusion: Appropriate antibiotic therapy is challenging due to pathophysiological changes among different age groups. The optimization of pharmacokinetic/pharmacodynamic parameters is useful to support alternative dosing regimens such as an increase in dosing interval, continuous infusion, and increased bolus doses.}"
)
# --- Get multi-label predictions ---
result = pipe(text)
print(result)
# Predicted labels:
# [
#     {'label': 'Pharmacology (medical)', 'score': 0.9922493696212769},
#     {'label': 'Pharmacology', 'score': 0.902540922164917}
# ]
# Expected labels:
# - Pharmacology (medical)
# - Pharmacology
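For the collection-level use case (authors, institutions, databases), per-document predictions in the list-of-dicts format printed above can be aggregated into a subject profile. The following is a minimal sketch, not the authors' procedure: `collection_profile` is a hypothetical helper, and the predictions are hand-written stand-ins for real pipeline output.

```python
from collections import Counter
from typing import Dict, Iterable, List


def collection_profile(predictions: Iterable[List[dict]]) -> Dict[str, float]:
    """Share of documents in a collection that were assigned each label."""
    counts: Counter = Counter()
    n_docs = 0
    for doc_preds in predictions:
        n_docs += 1
        # Count each label at most once per document
        counts.update({p["label"] for p in doc_preds})
    if n_docs == 0:
        return {}
    # most_common() orders labels from most to least frequent
    return {label: count / n_docs for label, count in counts.most_common()}


# Hand-written predictions in the same shape as the pipeline output above
predictions = [
    [{"label": "Pharmacology (medical)", "score": 0.99},
     {"label": "Pharmacology", "score": 0.90}],
    [{"label": "Pharmacology", "score": 0.81}],
]

print(collection_profile(predictions))
# -> {'Pharmacology': 1.0, 'Pharmacology (medical)': 0.5}
```

Counting shares of documents rather than summing scores keeps the profile easy to interpret; a score-weighted variant would be an equally reasonable design choice.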
📖 Citation
If you use this work, please cite:
@article{Gusenbauer.2025,
  author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
  year = {2025},
  title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
  keywords = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label classification;SciBERT;Transformer-based language models},
  issn = {0138-9130},
  journal = {Scientometrics},
  doi = {10.1007/s11192-025-05490-0},
}