# reteropred

## Model Details

- Model Type: BART (Bidirectional and Auto-Regressive Transformers)
- Task: Molecular Retrosynthesis (Product SMILES → Reactant SMILES)
- Language: SMILES (Simplified Molecular Input Line Entry System)
- Architecture: `BartForConditionalGeneration`
- Parameters: 44.8M
- License: mpl-2.0
## Model Description
This model is designed for computer-aided retrosynthesis analysis. It predicts reactant sets given a product molecule represented as a SMILES string. The model uses a sequence-to-sequence transformer architecture (BART) customized for chemical language processing.
Key features include:
- Custom Tokenization: Uses a Byte-Pair Encoding (BPE) tokenizer trained specifically on SMILES syntax with a vocabulary size of 1,000 tokens.
- Data Augmentation: Implements SMILES randomization during training to improve robustness against different molecular representations.
- Canonicalization: Utilizes RDKit for canonical SMILES conversion during preprocessing and evaluation to ensure chemical validity.
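The canonicalization and randomization steps above can be sketched with RDKit as follows. The helper names (`canonicalize`, `randomize`) are illustrative, not the model's actual preprocessing code:

```python
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Return the canonical SMILES, raising on unparseable input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol, canonical=True)

def randomize(smiles: str) -> str:
    """Return a randomized (non-canonical) SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)

# Different surface forms, same molecule: randomizing then re-canonicalizing
# recovers the canonical string.
print(canonicalize("OCC"))                 # ethanol in canonical form
print(randomize("c1ccccc1O"))              # one of many valid phenol SMILES
```

Feeding many randomized spellings of the same molecule is what makes the augmentation effective: the model learns the chemistry rather than one particular string layout.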
## Intended Uses

### Direct Use
- Predicting reactants for a given product molecule in organic chemistry research.
- Assisting chemists in planning synthesis routes.
- Educational purposes for computational chemistry.
### Out-of-Scope Use
- Not for Clinical Use: This model is not validated for pharmaceutical manufacturing or clinical applications.
- Not for Hazardous Materials: Should not be used to plan synthesis of regulated or dangerous substances without expert oversight.
- No Guarantee of Validity: The model outputs SMILES strings that must be validated chemically (e.g., via RDKit) before use.
## Training Data
The model was trained on a combination of public reaction datasets and template rules:
- USPTO Dataset: Curated patent reactions containing reactant-product pairs.
- Preprocessing:
  - Canonicalization: All SMILES were canonicalized using RDKit (`Chem.MolToSmiles`).
  - Cleaning: Atom maps were stripped, and explicit hydrogens were removed.
  - Filtering: Identity mappings (where product == reactant) were removed.
- Augmentation: Training inputs were randomized using `Chem.MolToSmiles(mol, doRandom=True)` to prevent overfitting to specific SMILES representations.
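The atom-map cleaning step can be sketched like this (the helper name `strip_atom_maps` is illustrative, not the project's actual code):

```python
from rdkit import Chem

def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers (e.g. [CH3:1]) and return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)  # 0 clears the map number
    return Chem.MolToSmiles(mol, canonical=True)

# Mapped methanol reduces to its plain canonical form.
print(strip_atom_maps("[CH3:1][OH:2]"))
```

Stripping maps matters because USPTO reactions carry atom-mapping annotations that would otherwise leak alignment information into the token stream and bloat the vocabulary.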
## Training Procedure

### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base Architecture | BART (Custom Config) |
| Hidden Size (d_model) | 512 |
| Encoder/Decoder Layers | 6 |
| Attention Heads | 8 |
| Vocabulary Size | 1,000 |
| Max Sequence Length | 128 |
| Batch Size | 192 |
| Epochs | 10 |
| Learning Rate | 1e-4 |
| Optimizer | AdamW (via Transformers) |
| Precision | FP16 |
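Under these hyperparameters, a matching model can be instantiated roughly as follows. Note the feed-forward dimension is not listed in the table; 2048 is an assumption chosen so the parameter count lands near the stated 44.8M:

```python
from transformers import BartConfig, BartForConditionalGeneration

# Custom BART config per the table above; encoder/decoder_ffn_dim=2048 is
# an assumption (not stated in the model card).
config = BartConfig(
    vocab_size=1000,
    d_model=512,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    max_position_embeddings=128,
)
model = BartForConditionalGeneration(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```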
### Framework & Libraries

- Deep Learning: PyTorch, Hugging Face `transformers`
- Cheminformatics: RDKit
- Data Processing: Pandas, Scikit-learn, Tokenizers
## Evaluation
The model was evaluated on a held-out validation set and external test sets (Enamine Real, ChEMBL, ZINC). Accuracy is measured using Exact Match (EM) based on canonical SMILES comparison.
### Metrics
Top-1, Top-3, and Top-5 exact match accuracy on the validation set.
### Performance by Reaction Class
The heatmap below illustrates how the model performs across different USPTO reaction classes, distinguishing between correct predictions, invalid SMILES generation, and reactant mismatches.
*Performance breakdown by USPTO reaction class.*
### Token-Level Analysis
To understand the model's "chemical vocabulary" performance, the following confusion matrix shows the most frequent SMILES tokens and how accurately the model predicts them.
*Confusion matrix for the top 15 most frequent chemical tokens.*
### Prediction Examples
Below are representative examples of the model's retrosynthetic predictions compared against the ground truth.
*Visual grid comparing the Target Product, the True Reactants, and the Model's Predicted Reactants.*
Note: Evaluation involves generating 5 beam search sequences and checking if the canonicalized ground truth matches any of the top-k predictions.
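The top-k exact-match check can be sketched as follows, assuming both predictions and references have already been canonicalized (names are illustrative):

```python
def top_k_accuracy(predictions, references, k):
    """predictions: one ranked candidate list per example;
    references: one canonical ground-truth SMILES per example."""
    hits = sum(ref in preds[:k] for preds, ref in zip(predictions, references))
    return hits / len(references)

# Two examples, each with 3 ranked beam candidates.
preds = [["CCO", "CC", "CO"], ["CC(C)O", "CCO", "C"]]
refs = ["CC", "CCO"]
print(top_k_accuracy(preds, refs, 1))  # 0.0 — neither truth is ranked first
print(top_k_accuracy(preds, refs, 2))  # 1.0 — both truths appear in the top 2
```

To obtain multiple ranked candidates from `model.generate`, `num_return_sequences` can be set alongside `num_beams` (e.g. both to 5).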
## How to Use

### Load with Transformers
```python
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast
import torch
from rdkit import Chem

# Load model and tokenizer
model = BartForConditionalGeneration.from_pretrained("surya/bart-retrosynth")  # Replace with your HF repo
tokenizer = PreTrainedTokenizerFast.from_pretrained("surya/bart-retrosynth")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def predict_reactants(product_smiles: str):
    # Canonicalize input
    mol = Chem.MolFromSmiles(product_smiles)
    if mol is None:
        return "Invalid SMILES"
    canon_smiles = Chem.MolToSmiles(mol, canonical=True)

    # Tokenize
    inputs = tokenizer(canon_smiles, return_tensors="pt", max_length=128, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=5,
            early_stopping=True,
        )

    # Decode
    predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return predictions

# Example
product = "CCO"  # Ethanol
reactants = predict_reactants(product)
print(f"Predicted Reactants: {reactants}")
```
## Limitations & Bias

- Stereochemistry: The model was trained with `isomericSmiles=False` in some preprocessing steps (check the `canonicalize` function). Stereochemical accuracy may be limited.
- Validity: Not all generated SMILES strings are guaranteed to be chemically valid. Post-processing validation is required.
- Length Constraint: Molecules requiring SMILES representations longer than 128 tokens will be truncated.
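Since generated strings are not guaranteed to parse, a minimal post-processing filter with RDKit might look like this (the helper name `filter_valid` is illustrative):

```python
from rdkit import Chem

def filter_valid(predicted_smiles):
    """Keep only predictions RDKit can parse, returned in canonical form."""
    valid = []
    for smi in predicted_smiles:
        mol = Chem.MolFromSmiles(smi)  # returns None on parse failure
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol, canonical=True))
    return valid

# The unclosed ring "C1CC" and the garbage string are dropped.
print(filter_valid(["CCO", "C1CC", "not_smiles"]))  # ['CCO']
```

Canonicalizing surviving predictions also makes them directly comparable against canonical references, matching the exact-match evaluation described above.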