# Model Card for 2025-24679-text-distilbert-predictor
This is a text classification model fine-tuned on a dataset containing text examples related to 'Pittsburgh' and 'Shanghai'. The model is based on the DistilBERT architecture and is designed to classify text into one of these two categories.
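A minimal usage sketch is shown below. It assumes the model is published under the repository id in the title and that the saved configuration carries the 'Pittsburgh'/'Shanghai' label mapping; if it does not, the pipeline will report generic `LABEL_0`/`LABEL_1` names instead.

```python
# Minimal inference sketch; repository id and label names are taken from this card.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="kaitongg/2025-24679-text-distilbert-predictor",
)

# Example input; the reported label depends on the id2label mapping saved with the model.
print(classifier("The Steelers play their home games on the North Shore."))
```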
## Data
The model was trained using a dataset sourced from the Hugging Face Hub: `cassieli226/cities-text-dataset`.
This dataset contains two splits:
- `original`: 100 examples used for external validation.
- `augmented`: 1000 examples used for training, validation, and testing.
The dataset consists of text samples with corresponding labels indicating whether the text is related to 'Pittsburgh' or 'Shanghai'.
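As a sketch, the splits described above can be loaded from the Hub as follows; the exact column names depend on how the dataset was published.

```python
# Sketch of loading the dataset; split names follow the description in this card.
from datasets import load_dataset

dataset = load_dataset("cassieli226/cities-text-dataset")
print(dataset)                   # expected: a DatasetDict with "original" and "augmented" splits
print(dataset["augmented"][0])   # inspect one example and its column names
```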
## Preprocessing
The text data was preprocessed using the `distilbert-base-uncased` tokenizer. The preprocessing steps included the following (a code sketch follows the list):
- Tokenization of text inputs.
- Truncation to a maximum length of 128 tokens.
- Padding was handled dynamically during training using `DataCollatorWithPadding`.
- String labels ('Pittsburgh', 'Shanghai') were mapped to integer IDs (0 and 1, respectively).
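A sketch of these preprocessing steps is below. It assumes the dataset exposes `text` and `label` columns with string labels, as described above; the column names are an assumption.

```python
# Preprocessing sketch: tokenize, truncate to 128 tokens, map string labels to ids,
# and prepare a collator for dynamic padding. Column names "text"/"label" are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

dataset = load_dataset("cassieli226/cities-text-dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label2id = {"Pittsburgh": 0, "Shanghai": 1}

def preprocess(batch):
    encoded = tokenizer(batch["text"], truncation=True, max_length=128)
    encoded["labels"] = [label2id[label] for label in batch["label"]]
    return encoded

tokenized = dataset["augmented"].map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch dynamically
```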
## Training Setup
The model was fine-tuned using the `Trainer` API from the Hugging Face `transformers` library; a configuration sketch follows the list below.
- Base Model: `distilbert-base-uncased`
- Training Data: split from the `augmented` dataset (640 examples).
- Validation Data: split from the `augmented` dataset (160 examples), used for evaluation during training.
- Training Arguments:
  - Learning Rate: 2e-5
  - Per Device Train Batch Size: 8
  - Per Device Eval Batch Size: 8
  - Number of Training Epochs: 5
  - Weight Decay: 0.01
  - Evaluation Strategy: `epoch`
  - Save Strategy: `epoch`
  - Load Best Model at End: True (based on accuracy)
  - Seed: 42
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss (the default for `AutoModelForSequenceClassification`)
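The listed configuration corresponds roughly to the sketch below. The train/validation split variables, the preprocessing objects, and the `compute_metrics` function (see the Metrics section) refer to the other sketches in this card; their names are assumptions about the notebook's variables.

```python
# Fine-tuning sketch with the hyperparameters listed above.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "Pittsburgh", 1: "Shanghai"},
    label2id={"Pittsburgh": 0, "Shanghai": 1},
)

training_args = TrainingArguments(
    output_dir="2025-24679-text-distilbert-predictor",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="epoch",            # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_split,        # 640 tokenized augmented examples (assumed variable name)
    eval_dataset=val_split,           # 160 tokenized augmented examples (assumed variable name)
    data_collator=data_collator,      # from the preprocessing sketch above
    compute_metrics=compute_metrics,  # from the Metrics section below
)
trainer.train()
```

AdamW and cross-entropy loss are the `Trainer` defaults for single-label sequence classification, so they do not need to be configured explicitly.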
## Metrics
The model was evaluated on the augmented test set and the original external validation set using the following metrics (a `compute_metrics` sketch follows the list):
- Accuracy: The proportion of correctly classified examples.
- Precision (weighted): The fraction of predicted positives that are correct, computed per class and averaged with weights proportional to class support.
- Recall (weighted): The fraction of actual positives that are correctly identified, computed per class and averaged with weights proportional to class support.
- F1 Score (weighted): The harmonic mean of precision and recall, computed per class and averaged with weights proportional to class support.
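A sketch of a `compute_metrics` function producing these metrics with scikit-learn's weighted averaging:

```python
# Metrics sketch: accuracy plus weighted precision, recall, and F1.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```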
## Evaluation Results
The model's performance was evaluated on both the augmented test set and the original external validation set.
| Dataset | Accuracy | Weighted F1 |
|---|---|---|
| Augmented Test Set | 1.0000 | 1.0000 |
| Original Validation Set | 1.0000 | 1.0000 |
A confusion matrix for the external validation set is also available in the notebook to visualize per-class performance.
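As a sketch, that confusion matrix can be reproduced from the trained `Trainer` and the tokenized external split; the variable names below are assumptions about the notebook.

```python
# Confusion-matrix sketch for the external validation ("original") split.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

pred_output = trainer.predict(tokenized_original)   # tokenized "original" split (assumed name)
y_pred = pred_output.predictions.argmax(axis=-1)
y_true = pred_output.label_ids

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, display_labels=["Pittsburgh", "Shanghai"]
)
plt.show()
```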
## Limitations and Ethical Considerations
- Dataset Bias: The model's performance is highly dependent on the training data. If the data does not fully represent the diversity of text related to 'Pittsburgh' and 'Shanghai', the model may exhibit biases.
- Generalization: While the model performed well on the provided datasets, its performance on completely novel text outside the distribution of the training data is not guaranteed.
- Interpretability: As a deep learning model, the decision-making process can be less transparent compared to simpler models.
- Potential Misuse: The model could potentially be misused to misrepresent or stereotype content related to these cities.
It is important to use this model responsibly and be aware of its limitations.
## License
The model is released under the Apache 2.0 License. Please refer to the license file in the repository for full details.
## Hardware and Compute
The model was trained on a system with a GPU (the training output reports `Using device: cuda`). The specific hardware (e.g., GPU type and memory) is not logged in the notebook output, but it was sufficient to train the DistilBERT model on the provided dataset.
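The device line can be reproduced with a standard PyTorch check, for example:

```python
# Report whether CUDA is available, mirroring the "Using device: cuda" log line.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```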
## AI Usage Disclosure
This model was developed and trained using publicly available datasets and open-source libraries, including Hugging Face `transformers`, `datasets`, scikit-learn, and matplotlib. The training process followed standard fine-tuning practices, and the notebook was executed in the Google Colab environment.