🧠 Open Multi-Label ASJC Classification

Model Overview

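This model is a fine-tuned version of allenai/scibert_scivocab_uncased that assigns research documents to 307 All Science Journal Classification (ASJC) subject categories. It classifies at the document level rather than inheriting subjects from the journal, as traditional journal-level schemes do.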

Task: Multi-label classification
Labels: 307 ASJC subjects (see the Google Sheet for the full list of labels)
Base Model: SciBERT
Training Data: Crossref 2023 dataset (titles, abstracts, container titles)
License: MIT
Framework: Hugging Face Transformers

📚 Intended Use

Classify individual research documents into multiple ASJC subjects.
Analyze disciplinary orientation of collections (authors, institutions, databases).
Works with title, abstract, and optionally container title metadata.
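
As a minimal sketch of how the metadata fields might be combined into one input string: the `key={value}` layout and the field order below mirror the usage example further down in this card. The helper name `build_input` is hypothetical, and simply dropping a missing field is an assumption, since the card does not state how incomplete records were formatted during training.

```python
from typing import Optional


def build_input(title: str, abstract: Optional[str] = None,
                container_title: Optional[str] = None) -> str:
    """Join the available metadata fields into a single input string.

    Hypothetical helper: the key={value} layout follows the usage example
    in this card; omitting empty fields is an assumption.
    """
    parts = [f"title={{{title}}}"]
    if container_title:
        parts.append(f"container_title={{{container_title}}}")
    if abstract:
        parts.append(f"abstract={{{abstract}}}")
    return ", ".join(parts)


# Title plus container title, no abstract
print(build_input("Dose optimization of β-lactams",
                  container_title="Frontiers in Pharmacology"))
```

The resulting string can be passed to the classification pipeline in place of a hand-written input.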

🛠 Training Details

Preprocessing:
- Removed the “multidisciplinary” category and the 26 “miscellaneous” categories, leaving 307 subjects.
- Multi-hot encoding for multi-label classification.
- Data augmentation for underrepresented classes.
Fine-tuning:
- Optimizer: AdamW
- Loss: Binary Cross-Entropy
- Learning Rate: 2e-5
- Epochs: 1
- Batch Size: 16
- Threshold for label assignment: 0.3
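
The multi-hot encoding, binary cross-entropy loss, and 0.3 threshold listed above can be illustrated with a small standard-library sketch. The four-label set and the logits are made up for illustration, and this is not the authors' training code, which the card does not include.

```python
import math

# Toy label set for illustration; the real model uses 307 ASJC subjects.
LABELS = ["Pharmacology", "Pharmacology (medical)", "Oncology", "Software"]
THRESHOLD = 0.3  # label-assignment threshold stated in this card


def multi_hot(subjects, labels=LABELS):
    """Encode a document's subjects as a 0/1 vector, one position per label."""
    present = set(subjects)
    return [1.0 if label in present else 0.0 for label in labels]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Mean binary cross-entropy, scoring every label independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)


def assign_labels(logits, labels=LABELS, threshold=THRESHOLD):
    """Return every label whose sigmoid probability reaches the threshold."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]


target = multi_hot(["Pharmacology", "Pharmacology (medical)"])
logits = [2.0, 4.0, -3.0, -5.0]  # made-up model outputs
print(target)
print(bce_loss(logits, target))
print(assign_labels(logits))
```

Because each label gets its own sigmoid, a document can receive several subjects at once, or none if no probability reaches the threshold.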

📈 Metrics

Input Features	Labels	Precision	Recall	F1-Score (weighted)
Title + Container Title + Abstract	307	0.912	0.885	0.892
Title + Abstract	307	0.607	0.503	0.532
Title + Container Title	307	0.949	0.957	0.952
Title only	307	0.528	0.416	0.448

When evaluated at the level of the 26 parent subjects, the weighted F1-score rises to 0.934 with full metadata (Title + Container Title + Abstract) and to 0.694 with Title + Abstract.
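
For readers who want to compute comparable numbers on their own data, here is a minimal sketch of support-weighted precision, recall, and F1 over multi-hot label matrices. It assumes that "weighted" means averaging per-label scores by the number of true instances of each label (the convention scikit-learn calls `average="weighted"`); the card does not spell out the exact procedure, and the toy data are made up.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 across label columns.

    y_true and y_pred are lists of 0/1 rows, one row per document.
    """
    n_labels = len(y_true[0])
    sums = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    total_support = 0
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        support = tp + fn  # number of true instances of label j
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        sums["precision"] += support * precision
        sums["recall"] += support * recall
        sums["f1"] += support * f1
        total_support += support
    if total_support == 0:
        return {name: 0.0 for name in sums}
    return {name: value / total_support for name, value in sums.items()}


# Three documents, three labels (toy data)
y_true = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 1], [0, 0, 1]]
scores = weighted_scores(y_true, y_pred)
print(scores)  # precision 0.625, recall 0.75, f1 about 0.667
```

Weighting by support means frequent subjects dominate the average, so strong aggregate scores can coexist with weak performance on rare subjects.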

The evaluation dataset is publicly available: Test-Dataset

✅ Model Strengths

Handles interdisciplinary and general science journals.
Works even without container title (lower accuracy).
Scalable for large collections.

⚠️ Limitations

Performance relies on metadata completeness (title, abstract, container title).
Accuracy is lower for rare subjects and when the container title (source) is missing.
Snapshot of ASJC schema as of April 2023 (not updated for emerging fields).

🔍 Example Usage

from transformers import TextClassificationPipeline, pipeline
import torch

# --- Custom multi-label pipeline ---
class ASJCMultiLabelPipeline(TextClassificationPipeline):
    """
    Multi-label classification pipeline for ASJC categories.
    Uses a configurable threshold to return all labels with scores above the threshold.
    """
    def __init__(self, *args, **kwargs):
        # Allow threshold override; default falls back to model config
        self.threshold = kwargs.pop("threshold", None)
        super().__init__(*args, **kwargs)
        if self.threshold is None:
            self.threshold = getattr(self.model.config, "threshold", 0.3)

    def postprocess(self, model_outputs, **kwargs):
        # The logits are already a tensor of shape (1, num_labels); an
        # element-wise sigmoid gives each label an independent probability
        scores = torch.sigmoid(model_outputs["logits"][0].float()).tolist()

        results = []
        for i, score in enumerate(scores):
            if score >= self.threshold:
                label = self.model.config.id2label[i]
                results.append({"label": label, "score": float(score)})

        # Sort by descending score
        results = sorted(results, key=lambda x: x["score"], reverse=True)
        return results

# --- Create the pipeline explicitly using the custom class ---
pipe = pipeline(
    task="text-classification",
    model="asjc-classification/scibert_multilabel_asjc_classifier",
    pipeline_class=ASJCMultiLabelPipeline
)

# --- Example text input ---
text = (
    "title={Dose optimization of β-lactams antibiotics in pediatrics and adults: A systematic review}, "
    "container_title={Frontiers in Pharmacology}, "
    "abstract={Background: β-lactams remain the cornerstone of the empirical therapy to treat various bacterial infections. This systematic review aimed to analyze the data describing the dosing regimen of β-lactams.Methods: Systematic scientific and grey literature was performed in accordance with Preferred Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. The studies were retrieved and screened on the basis of pre-defined exclusion and inclusion criteria. The cohort studies, randomized controlled trials (RCT) and case reports that reported the dosing schedule of β-lactams are included in this study.Results: A total of 52 studies met the inclusion criteria, of which 40 were cohort studies, 2 were case reports and 10 were RCTs. The majority of the studies (34/52) studied the pharmacokinetic (PK) parameters of a drug. A total of 20 studies proposed dosing schedule in pediatrics while 32 studies proposed dosing regimen among adults. Piperacillin (12/52) and Meropenem (11/52) were the most commonly used β-lactams used in hospitalized patients. As per available evidence, continuous infusion is considered as the most appropriate mode of administration to optimize the safety and efficacy of the treatment and improve the clinical outcomes.Conclusion: Appropriate antibiotic therapy is challenging due to pathophysiological changes among different age groups. The optimization of pharmacokinetic/pharmacodynamic parameters is useful to support alternative dosing regimens such as an increase in dosing interval, continuous infusion, and increased bolus doses.}"
)

# --- Get multi-label predictions ---
result = pipe(text)
print(result)

# Predicted labels:
# [
#   {'label': 'Pharmacology (medical)', 'score': 0.9922493696212769}, 
#   {'label': 'Pharmacology', 'score': 0.902540922164917}
# ]

# Expected labels:
# - Pharmacology (medical)
# - Pharmacology
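
To assess the disciplinary orientation of a whole collection, per-document predictions in the format printed above can be aggregated into subject shares. The sketch below is one possible approach, not the method from the cited paper: `subject_profile` is a hypothetical helper and the sample predictions are made up.

```python
from collections import Counter


def subject_profile(predictions):
    """Share of documents in a collection assigned to each subject.

    predictions: one list per document, each holding {"label", "score"}
    dicts in the format returned by the pipeline in this card.
    """
    n_docs = len(predictions)
    if n_docs == 0:
        return {}
    counts = Counter()
    for doc_labels in predictions:
        # Count each subject at most once per document
        counts.update({item["label"] for item in doc_labels})
    return {label: count / n_docs for label, count in counts.most_common()}


# Made-up predictions for a three-document collection
collection = [
    [{"label": "Pharmacology (medical)", "score": 0.99},
     {"label": "Pharmacology", "score": 0.90}],
    [{"label": "Pharmacology", "score": 0.75}],
    [],  # no label reached the threshold
]
profile = subject_profile(collection)
print(profile)
```

Shares can sum to more than 1 because a document may carry several subjects; documents with no predicted label still count toward the denominator.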

📖 Citation

If you use this work, please cite:

@article{Gusenbauer.2025,
author = {Gusenbauer, Michael and Endermann, Jochen and Huber, Harald and Strasser, Simon and Granitzer, Andreas-Nizar and Ströhle, Thomas},
year = {2025},
title = {Fine-tuning SciBERT to enable ASJC-based assessments of the disciplinary orientation of research collections},
keywords = {All Science Journal Classification;Disciplinary coverage;Fine-tuning;multi-label classification;SciBERT;Transformer-based language models},
issn = {0138-9130},
journal = {Scientometrics},
doi = {10.1007/s11192-025-05490-0},
}
