ChemBERTa-700k-Augmented-SMILES
A RoBERTa-based transformer model pre-trained with masked language modeling (MLM) on 70k unique nonisomeric SMILES from PubChem, augmented 10x to 700k training samples. This model is designed for molecular property prediction and other SMILES understanding tasks.
Model Details
Model Description
- Model Type: RoBERTa (RobertaForMaskedLM)
- Architecture: Transformer-based encoder
- Task: Masked Language Modeling (MLM)
- Language: SMILES (Simplified Molecular Input Line Entry System)
Model Architecture
- Hidden Size: 768
- Number of Hidden Layers: 6
- Number of Attention Heads: 12
- Intermediate Size: 12288
- Vocabulary Size: 600
- Max Position Embeddings: 515
- Activation Function: GELU
- Hidden Dropout: 0.1
- Attention Dropout: 0.1
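For reference, the list above can be expressed as a Transformers configuration; the sketch below is illustrative, and any value not listed above (for example, the special-token IDs) is left at its library default and is an assumption rather than the exact training configuration.
from transformers import RobertaConfig, RobertaForMaskedLM
# Architecture described above; unspecified fields keep the transformers defaults
config = RobertaConfig(
    vocab_size=600,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=12288,
    max_position_embeddings=515,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = RobertaForMaskedLM(config)  # randomly initialized encoder with this shape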
Training Details
Training Data
- Dataset: 70k unique random nonisomeric SMILES from PubChem, augmented 10x to 700k total training samples
- Augmentation: 10x augmentation using random SMILES generation (a sketch of this procedure appears after this list)
  - Starting from 70k unique nonisomeric SMILES strings
  - Each SMILES string was augmented by generating 10 random (non-canonical) SMILES variants
  - The final training set contains 700k SMILES strings (70k × 10)
  - Augmentation helps the model learn representations that are invariant across different SMILES encodings of the same molecule
- SMILES Type: Nonisomeric SMILES (no stereochemical information)
- Pre-processing: SMILES strings tokenized using a custom SMILES tokenizer
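The exact augmentation code is not published with this card; the snippet below is an illustrative sketch of how random nonisomeric SMILES variants can be generated with RDKit (the function name and the 10-variant default are assumptions).
from rdkit import Chem

def augment_smiles(smiles: str, n_variants: int = 10) -> list[str]:
    # Generate n_variants random (non-canonical) nonisomeric SMILES for one molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []  # skip unparsable inputs
    return [
        Chem.MolToSmiles(mol, canonical=False, doRandom=True, isomericSmiles=False)
        for _ in range(n_variants)
    ]

# Example: 10 alternative SMILES strings for aspirin
print(augment_smiles("CC(=O)OC1=CC=CC=C1C(=O)O"))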
Training Procedure
- Training Method: Masked Language Modeling (MLM)
- Batch Size: 4 per device
- Number of Epochs: 10
- Optimizer: AdamW
- Learning Rate: (default RoBERTa settings)
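A hedged sketch of this MLM pre-training setup with the Trainer API is shown below. Only the batch size, epoch count, and AdamW optimizer (the Trainer default) come from this card; the masking probability, placeholder dataset, output directory, and sequence length are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Tokenizer shared with the baseline model (same vocabulary)
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")

# Placeholder data; the real training set is the 700k augmented SMILES
smiles = ["CC(=O)OC1=CC=CC=C1C(=O)O", "c1ccccc1O"]
train_dataset = Dataset.from_dict({"text": smiles}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

model = RobertaForMaskedLM(RobertaConfig(
    vocab_size=600, hidden_size=768, num_hidden_layers=6, num_attention_heads=12,
    intermediate_size=12288, max_position_embeddings=515,
))

# Random token masking; 15% is the library default, not a value stated in this card
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="chemberta-700k-augmented-smiles",  # assumed
    per_device_train_batch_size=4,                 # stated above
    num_train_epochs=10,                           # stated above
)

Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
).train()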
Training Infrastructure
Trained using the Hugging Face Transformers library with distributed training capabilities.
Note: This model was pre-trained from scratch (not fine-tuned). The architecture and training approach are inspired by ChemBERTa, and seyonec/SMILES_tokenized_PubChem_shard00_160k serves as the baseline for comparison. Because the two models share the same vocabulary, the tokenizer from seyonec/SMILES_tokenized_PubChem_shard00_160k should be used with this model.
Usage
Direct Use
This model can be used for masked language modeling on SMILES strings to understand molecular structure representations. It can be fine-tuned on downstream tasks such as:
- Molecular property prediction (classification and regression)
- Drug discovery tasks
- Toxicity prediction
- Bioavailability prediction
- Flavor prediction
Downstream Use
The model should be fine-tuned on specific molecular property prediction tasks. Example tasks include:
- Classification: BBBP (Blood-Brain Barrier Penetration), Tox21 (Toxicity), ClinTox (Clinical Toxicity)
- Regression: Delaney (Aqueous Solubility), Lipo (Lipophilicity), Clearance (Molecular Clearance)
Out-of-Scope Use
This model was pre-trained on molecular SMILES data and is not designed for:
- Natural language understanding tasks
- Non-molecular chemical representations
- Tasks requiring 3D molecular geometry
How to Get Started with the Model
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers import pipeline
# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("terrytwk/chemberta-700k-augmented-smiles")
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Example: mask the final oxygen in aspirin's SMILES and predict it
# (use the tokenizer's own mask token rather than hard-coding it)
results = fill_mask(f"CC(=O)OC1=CC=CC=C1C(=O){tokenizer.mask_token}")
print(results)
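Each entry in results is a dictionary containing the filled-in sequence, the predicted token string, and its score, ordered from most to least likely.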
Fine-tuning Example
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
# Load pre-trained model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    "terrytwk/chemberta-700k-augmented-smiles",
    num_labels=2,  # Adjust based on your task
)
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")
# Fine-tune on your molecular property prediction dataset
# ... (add your training code)
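One possible way to complete the snippet above is sketched here, assuming train_ds and eval_ds are tokenized Hugging Face datasets with a labels column; the dataset names and hyperparameter values are illustrative, not values used by the author.
# Continuation sketch: train_ds / eval_ds are assumed tokenized datasets with labels
training_args = TrainingArguments(
    output_dir="chemberta-finetuned",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=2e-5,              # illustrative
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # assumed: tokenized SMILES with labels
    eval_dataset=eval_ds,    # assumed: held-out split
)
trainer.train()
trainer.evaluate()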
Limitations and Bias
- The model was trained on 70k unique molecules (700k augmented samples) from PubChem, which may not represent all chemical space
- Performance may vary on molecules with rare or unusual substructures
- SMILES augmentation introduces some variance in representations
- Model performance is limited by the vocabulary size (600 tokens)
Evaluation
FART (Flavor and Aroma Recognition Task) Evaluation
This model was evaluated on the FART dataset for molecular taste prediction. On this benchmark, the model achieved an evaluation accuracy of 0.8895, on par with or better than the baseline model seyonec/SMILES_tokenized_PubChem_shard00_160k.
Notably, this performance was achieved despite:
- Training on 7% of the data volume compared to the baseline (700k vs ~10M training samples)
- Training on 0.7% of unique molecules compared to the baseline (70k unique molecules vs ~10M unique molecules)
- Using 10x SMILES augmentation instead of training on a larger dataset
This demonstrates the effectiveness of SMILES augmentation as a data-efficient training strategy for molecular language models.
Evaluation Reference: Zimmermann et al. (2025). A chemical language model for molecular taste prediction. npj Science of Food
For evaluation results on other molecular property prediction benchmarks, please refer to fine-tuning experiments on datasets such as the MoleculeNet benchmarks (BBBP, Tox21, ClinTox, etc.).
Citation
If you use this model in your research, please cite:
This Model
@misc{chemberta700kaugmented2024,
  title={ChemBERTa-700k-Augmented-SMILES: Improved Molecular Property Prediction with SMILES Augmentation},
  author={Terry Kim},
  year={2024},
  howpublished={\url{https://huggingface.co/terrytwk/chemberta-700k-augmented-smiles}}
}
FART Evaluation
@article{zimmermann2025chemical,
  title={A chemical language model for molecular taste prediction},
  author={Zimmermann, Yoel and Sieben, Leif and Seng, Henrik and Pestlin, Philipp and G{\"o}rlich, Franz},
  journal={npj Science of Food},
  volume={9},
  number={1},
  pages={122},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41538-025-00474-z},
  url={https://www.nature.com/articles/s41538-025-00474-z}
}
Baseline Model for Comparison
- seyonec/SMILES_tokenized_PubChem_shard00_160k - Baseline model used for comparison (hypothesized to be trained on ~10M SMILES)
Related Research
@article{chemberta2020,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  journal={arXiv preprint arXiv:2010.09885},
  year={2020}
}
Model Card Contact
For questions or issues regarding this model, please open an issue in the repository.