# Model Card for 2025-24679-text-distilbert-predictor
This is a text classification model fine-tuned on a dataset containing text examples related to 'Pittsburgh' and 'Shanghai'. The model is based on the DistilBERT architecture and is designed to classify text into one of these two categories.
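A minimal usage sketch is shown below. It assumes the model is published under the repository id in the title and that the saved configuration carries the 'Pittsburgh'/'Shanghai' label mapping; if it does not, the pipeline will report generic `LABEL_0`/`LABEL_1` names instead.

```python
# Minimal inference sketch; repository id and label names are taken from this card.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="kaitongg/2025-24679-text-distilbert-predictor",
)

# Example input; the reported label depends on the id2label mapping saved with the model.
print(classifier("The Steelers play their home games on the North Shore."))
```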
## Data
The model was trained using a dataset sourced from the Hugging Face Hub: `cassieli226/cities-text-dataset`.
This dataset contains two splits:
- `original`: 100 examples used for external validation.
- `augmented`: 1000 examples used for training, validation, and testing.
The dataset consists of text samples with corresponding labels indicating whether the text is related to 'Pittsburgh' or 'Shanghai'.
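As a sketch, the splits described above can be loaded from the Hub as follows; the exact column names depend on how the dataset was published.

```python
# Sketch of loading the dataset; split names follow the description in this card.
from datasets import load_dataset

dataset = load_dataset("cassieli226/cities-text-dataset")
print(dataset)                   # expected: a DatasetDict with "original" and "augmented" splits
print(dataset["augmented"][0])   # inspect one example and its column names
```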
## Preprocessing
The text data was preprocessed using the `distilbert-base-uncased` tokenizer. The preprocessing steps included the following (a code sketch follows the list):
- Tokenization of text inputs.
- Truncation to a maximum length of 128 tokens.
- Padding was handled dynamically during training using `DataCollatorWithPadding`.
- String labels ('Pittsburgh', 'Shanghai') were mapped to integer IDs (0 and 1, respectively).
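A sketch of these preprocessing steps is below. It assumes the dataset exposes `text` and `label` columns with string labels, as described above; the column names are an assumption.

```python
# Preprocessing sketch: tokenize, truncate to 128 tokens, map string labels to ids,
# and prepare a collator for dynamic padding. Column names "text"/"label" are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

dataset = load_dataset("cassieli226/cities-text-dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label2id = {"Pittsburgh": 0, "Shanghai": 1}

def preprocess(batch):
    encoded = tokenizer(batch["text"], truncation=True, max_length=128)
    encoded["labels"] = [label2id[label] for label in batch["label"]]
    return encoded

tokenized = dataset["augmented"].map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch dynamically
```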
## Training Setup
The model was fine-tuned using the `Trainer` API from the Hugging Face `transformers` library; a configuration sketch follows the list below.
- Base Model: `distilbert-base-uncased`
- Training Data: split from the `augmented` dataset (640 examples).
- Validation Data: split from the `augmented` dataset (160 examples), used for evaluation during training.
- Training Arguments:
  - Learning Rate: 2e-5
  - Per Device Train Batch Size: 8
  - Per Device Eval Batch Size: 8
  - Number of Training Epochs: 5
  - Weight Decay: 0.01
  - Evaluation Strategy: `epoch`
  - Save Strategy: `epoch`
  - Load Best Model at End: True (based on accuracy)
  - Seed: 42
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss (the default for `AutoModelForSequenceClassification`)
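The listed configuration corresponds roughly to the sketch below. The train/validation split variables, the preprocessing objects, and the `compute_metrics` function (see the Metrics section) refer to the other sketches in this card; their names are assumptions about the notebook's variables.

```python
# Fine-tuning sketch with the hyperparameters listed above.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "Pittsburgh", 1: "Shanghai"},
    label2id={"Pittsburgh": 0, "Shanghai": 1},
)

training_args = TrainingArguments(
    output_dir="2025-24679-text-distilbert-predictor",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="epoch",            # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_split,        # 640 tokenized augmented examples (assumed variable name)
    eval_dataset=val_split,           # 160 tokenized augmented examples (assumed variable name)
    data_collator=data_collator,      # from the preprocessing sketch above
    compute_metrics=compute_metrics,  # from the Metrics section below
)
trainer.train()
```

AdamW and cross-entropy loss are the `Trainer` defaults for single-label sequence classification, so they do not need to be configured explicitly.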
## Metrics
The model was evaluated on the augmented test set and the original external validation set using the following metrics (a `compute_metrics` sketch follows the list):
- Accuracy: The proportion of correctly classified examples.
- Precision (weighted): The fraction of predicted positives that are correct, computed per class and averaged with weights proportional to class support.
- Recall (weighted): The fraction of actual positives that are correctly identified, computed per class and averaged with weights proportional to class support.
- F1 Score (weighted): The harmonic mean of precision and recall, computed per class and averaged with weights proportional to class support.
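A sketch of a `compute_metrics` function producing these metrics with scikit-learn's weighted averaging:

```python
# Metrics sketch: accuracy plus weighted precision, recall, and F1.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```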
## Evaluation Results
The model's performance was evaluated on both the augmented test set and the original external validation set.
| Dataset | Accuracy | Weighted F1 |
|---|---|---|
| Augmented Test Set | 1.0000 | 1.0000 |
| Original Validation Set | 1.0000 | 1.0000 |
A confusion matrix for the external validation set is also available in the notebook to visualize per-class performance.
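As a sketch, that confusion matrix can be reproduced from the trained `Trainer` and the tokenized external split; the variable names below are assumptions about the notebook.

```python
# Confusion-matrix sketch for the external validation ("original") split.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

pred_output = trainer.predict(tokenized_original)   # tokenized "original" split (assumed name)
y_pred = pred_output.predictions.argmax(axis=-1)
y_true = pred_output.label_ids

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, display_labels=["Pittsburgh", "Shanghai"]
)
plt.show()
```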
## Limitations and Ethical Considerations
- Dataset Bias: The model's performance is highly dependent on the training data. If the data does not fully represent the diversity of text related to 'Pittsburgh' and 'Shanghai', the model may exhibit biases.
- Generalization: While the model performed well on the provided datasets, its performance on completely novel text outside the distribution of the training data is not guaranteed.
- Interpretability: As a deep learning model, the decision-making process can be less transparent compared to simpler models.
- Potential Misuse: The model could potentially be misused to misrepresent or stereotype content related to these cities.
It is important to use this model responsibly and be aware of its limitations.
## License
The model is released under the Apache 2.0 License. Please refer to the license file in the repository for full details.
## Hardware and Compute
The model was trained on a system with a GPU (the training output reports `Using device: cuda`). The specific hardware (e.g., GPU type and memory) is not logged in the notebook output, but it was sufficient to train the DistilBERT model on the provided dataset.
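The device line can be reproduced with a standard PyTorch check, for example:

```python
# Report whether CUDA is available, mirroring the "Using device: cuda" log line.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```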
## AI Usage Disclosure
This model was developed and trained using publicly available datasets and open-source libraries, including Hugging Face `transformers`, `datasets`, scikit-learn, and matplotlib. The training process followed standard fine-tuning practices, and the notebook was executed in the Google Colab environment.