ChemBERTa-700k-Augmented-SMILES
A RoBERTa-based transformer model pre-trained with masked language modeling (MLM) on 70k unique nonisomeric SMILES from PubChem, augmented 10x to 700k training samples. This model is designed for molecular property prediction and other SMILES understanding tasks.
Model Details
Model Description
- Model Type: RoBERTa (RobertaForMaskedLM)
- Architecture: Transformer-based encoder
- Task: Masked Language Modeling (MLM)
- Language: SMILES (Simplified Molecular Input Line Entry System)
Model Architecture
- Hidden Size: 768
- Number of Hidden Layers: 6
- Number of Attention Heads: 12
- Intermediate Size: 12288
- Vocabulary Size: 600
- Max Position Embeddings: 515
- Activation Function: GELU
- Hidden Dropout: 0.1
- Attention Dropout: 0.1
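For reference, the list above can be expressed as a Transformers configuration; the sketch below is illustrative, and any value not listed above (for example, the special-token IDs) is left at its library default and is an assumption rather than the exact training configuration.
from transformers import RobertaConfig, RobertaForMaskedLM
# Architecture described above; unspecified fields keep the transformers defaults
config = RobertaConfig(
    vocab_size=600,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=12288,
    max_position_embeddings=515,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = RobertaForMaskedLM(config)  # randomly initialized encoder with this shape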
Training Details
Training Data
- Dataset: 70k unique random nonisomeric SMILES from PubChem, augmented 10x to 700k total training samples
- Augmentation: 10x augmentation using random SMILES generation (a sketch of this procedure appears after this list)
  - Starting from 70k unique nonisomeric SMILES strings
  - Each SMILES string was augmented by generating 10 random (non-canonical) SMILES variants
  - The final training set contains 700k SMILES strings (70k × 10)
  - Augmentation helps the model learn representations that are invariant across different SMILES encodings of the same molecule
- SMILES Type: Nonisomeric SMILES (no stereochemical information)
- Pre-processing: SMILES strings tokenized using a custom SMILES tokenizer
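The exact augmentation code is not published with this card; the snippet below is an illustrative sketch of how random nonisomeric SMILES variants can be generated with RDKit (the function name and the 10-variant default are assumptions).
from rdkit import Chem

def augment_smiles(smiles: str, n_variants: int = 10) -> list[str]:
    # Generate n_variants random (non-canonical) nonisomeric SMILES for one molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # skip unparsable inputs
    return [
        Chem.MolToSmiles(mol, canonical=False, doRandom=True, isomericSmiles=False)
        for _ in range(n_variants)
    ]

# Example: 10 alternative SMILES strings for aspirin
print(augment_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))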
Training Procedure
- Training Method: Masked Language Modeling (MLM)
- Batch Size: 4 per device
- Number of Epochs: 10
- Optimizer: AdamW
- Learning Rate: (default RoBERTa settings)
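A hedged sketch of this MLM pre-training setup with the Trainer API is shown below. Only the batch size, epoch count, and AdamW optimizer (the Trainer default) come from this card; the masking probability, placeholder dataset, output directory, and sequence length are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Tokenizer shared with the baseline model (same vocabulary)
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")

# Placeholder data; the real training set is the 700k augmented SMILES
smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "c1ccccc1O"]
train_dataset = Dataset.from_dict({"text": smiles}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

model = RobertaForMaskedLM(RobertaConfig(
    vocab_size=600, hidden_size=768, num_hidden_layers=6, num_attention_heads=12,
    intermediate_size=12288, max_position_embeddings=515,
))

# Random token masking; 15% is the library default, not a value stated in this card
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="chemberta-700k-augmented-smiles",  # assumed
    per_device_train_batch_size=4,                 # stated above
    num_train_epochs=10,                           # stated above
)

Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
).train()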
Training Infrastructure
Trained using the Hugging Face Transformers library with distributed training capabilities.
Note: This model was pre-trained from scratch (not fine-tuned). The architecture and training approach are inspired by ChemBERTa, and seyonec/SMILES_tokenized_PubChem_shard00_160k serves as the baseline for comparison. Because the two models share the same vocabulary, the tokenizer from seyonec/SMILES_tokenized_PubChem_shard00_160k should be used with this model.
Usage
Direct Use
This model can be used for masked language modeling on SMILES strings to understand molecular structure representations. It can be fine-tuned on downstream tasks such as:
- Molecular property prediction (classification and regression)
- Drug discovery tasks
- Toxicity prediction
- Bioavailability prediction
- Flavor prediction
Downstream Use
The model should be fine-tuned on specific molecular property prediction tasks. Example tasks include:
- Classification: BBBP (Blood-Brain Barrier Penetration), Tox21 (Toxicity), ClinTox (Clinical Toxicity)
- Regression: Delaney (Aqueous Solubility), Lipo (Lipophilicity), Clearance (Molecular Clearance)
Out-of-Scope Use
This model was pre-trained on molecular SMILES data and is not designed for:
- Natural language understanding tasks
- Non-molecular chemical representations
- Tasks requiring 3D molecular geometry
How to Get Started with the Model
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import pipeline
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("terrytwk/chemberta-700k-augmented-smiles")
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Example: mask the final oxygen in aspirin's SMILES and predict it
# (use the tokenizer's own mask token rather than hard-coding it)
results = fill_mask(f"CC(=O)OC1=CC=CC=C1C(=O){tokenizer.mask_token}")
print(results)
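Each entry in results is a dictionary containing the filled-in sequence, the predicted token string, and its score, ordered from most to least likely.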
Fine-tuning Example
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
# Load pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "terrytwk/chemberta-700k-augmented-smiles",
    num_labels=2,  # Adjust based on your task
)
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")
# Fine-tune on your molecular property prediction dataset
# ... (add your training code)
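One possible way to complete the snippet above is sketched here, assuming train_ds and eval_ds are tokenized Hugging Face datasets with a labels column; the dataset names and hyperparameter values are illustrative, not values used by the author.
# Continuation sketch: train_ds / eval_ds are assumed tokenized datasets with labels
training_args = TrainingArguments(
    output_dir="chemberta-finetuned",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=2e-5,              # illustrative
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # assumed: tokenized SMILES with labels
    eval_dataset=eval_ds,    # assumed: held-out split
)
trainer.train()
trainer.evaluate()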
Limitations and Bias
- The model was trained on 70k unique molecules (700k augmented samples) from PubChem, which may not represent all chemical space
- Performance may vary on molecules with rare or unusual substructures
- SMILES augmentation introduces some variance in representations
- Model performance is limited by the vocabulary size (600 tokens)
Evaluation
FART (Flavor and Aroma Recognition Task) Evaluation
This model was evaluated on the FART dataset for molecular taste prediction. On this benchmark, the model achieved an evaluation accuracy of 0.8895, on par with or better than the baseline model seyonec/SMILES_tokenized_PubChem_shard00_160k.
Notably, this performance was achieved despite:
- Training on 7% of the data volume compared to the baseline (700k vs ~10M training samples)
- Training on 0.7% of unique molecules compared to the baseline (70k unique molecules vs ~10M unique molecules)
- Using 10x SMILES augmentation instead of training on a larger dataset
This demonstrates the effectiveness of SMILES augmentation as a data-efficient training strategy for molecular language models.
Evaluation Reference: Zimmermann et al. (2025). A chemical language model for molecular taste prediction. npj Science of Food
For evaluation results on other molecular property prediction benchmarks, please refer to fine-tuning experiments on datasets such as the MoleculeNet benchmarks (BBBP, Tox21, ClinTox, etc.).
Citation
If you use this model in your research, please cite:
This Model
@misc{chemberta700kaugmented2024,
  title={ChemBERTa-700k-Augmented-SMILES: Improved Molecular Property Prediction with SMILES Augmentation},
  author={Terry Kim},
  year={2024},
  howpublished={\url{https://huggingface.co/terrytwk/chemberta-700k-augmented-smiles}}
}
FART Evaluation
@article{zimmermann2025chemical,
  title={A chemical language model for molecular taste prediction},
  author={Zimmermann, Yoel and Sieben, Leif and Seng, Henrik and Pestlin, Philipp and G{\"o}rlich, Franz},
  journal={npj Science of Food},
  volume={9},
  number={1},
  pages={122},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41538-025-00474-z},
  url={https://www.nature.com/articles/s41538-025-00474-z}
}
Baseline Model for Comparison
- seyonec/SMILES_tokenized_PubChem_shard00_160k - Baseline model used for comparison (hypothesized to be trained on ~10M SMILES)
Related Research
@article{chemberta2020,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  journal={arXiv preprint arXiv:2010.09885},
  year={2020}
}
Model Card Contact
For questions or issues regarding this model, please open an issue in the repository.