---
license: apache-2.0
pipeline_tag: text-generation
tags:
- chemistry
- molecular-generation
- qwen2
- mtp
- selfies
- cheminformatics
---

# 🧬 ChemQ3MTP-base

ChemQ3MTP-base is a lightweight generative model for chemistry, trained on a curated dataset of 2.33 million valid bioactive and natural-product molecules drawn from ChEMBL34, COCONUTDB, and SuperNatural3. It is built on a compact Qwen2-like transformer backbone and trained with a multi-horizon predictive (MTP) loss objective to model molecular structures in the SELFIES representation.

Train Loss: 1.1720 → Perplexity: ~3.23

Validation Loss: 1.0448 → Perplexity: ~2.84

Current version: 0.2 (license: MIT for code, Apache 2.0 for weights)

A custom Qwen2-style language model, adapted for molecular generation:

- ✅ **Qwen2-like Mini Backbone** – Efficient causal LM architecture
- ✅ **Multi-Token Prediction (MTP Head)** – Parallel prediction of 1–3 future tokens, implemented as a plug-and-play head compatible with `AutoModel`
- ✅ **Horizon Loss** – Weighted multi-horizon objectives for long-term coherence
- ✅ **SELFIES-native Tokenizer** – Robust encoding with [FastChemTokenizer](https://github.com/gbyuvd/FastChemTokenizer)
- ✅ **Ranger21 Optimizer** – Warmup/warmdown scheduling for stable training
- ✅ **Gradient Checkpointing** – Lightweight, hardware-friendly, optimized for rapid RL prototyping
- ✅ **Adaptive Generation-Length Cap** – Dynamically limits `max_new_tokens` to 25% of the prompt length when no budget is given, reducing inference cost and preventing runaway molecule chains during RL (no retraining needed; works with `generate` and `generate_with_logprobs`); see the sketch after this list
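
A minimal sketch of how such a cap can behave (the helper name `capped_new_tokens` and the `floor` argument are illustrative assumptions, not the shipped API):

```python
from typing import Optional

def capped_new_tokens(prompt_len: int,
                      max_new_tokens: Optional[int] = None,
                      frac: float = 0.25,
                      floor: int = 8) -> int:
    """If no explicit budget is given, cap generation at 25% of the prompt
    length; `floor` (an assumption here) keeps very short prompts from
    being capped to zero."""
    if max_new_tokens is not None:
        return max_new_tokens  # explicit budgets are respected as-is
    return max(floor, int(prompt_len * frac))

# A 64-token prompt with no budget is capped at 16 new tokens.
assert capped_new_tokens(64) == 16
```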


RL-Ready Features:
- ✅ **Durrant's Lab Filter** – Integrated substructure filtering based on the [gypsum_dl](https://github.com/durrantlab/gypsum_dl/) (Ropp _et al._ 2019) methodology to remove improbable molecular variants during validity checks
- ✅ **Pareto Reward Controller** – Ready for RL fine-tuning with dynamic multi-objective optimization balancing validity, synthesizability, and molecular complexity with adaptive weight adjustment (see the sketch below)
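
As a toy illustration of the multi-objective idea behind the Pareto controller, the sketch below combines a validity bonus with QED as a drug-likeness proxy; the fixed weights and the `toy_reward` name are assumptions for illustration, whereas the actual controller adapts its weights online:

```python
from rdkit import Chem
from rdkit.Chem import QED

def toy_reward(smiles: str, w_valid: float = 1.0, w_qed: float = 1.0) -> float:
    """Validity bonus plus QED-weighted drug-likeness; invalid molecules score 0."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0  # invalid molecules earn no reward
    return w_valid + w_qed * QED.qed(mol)
```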

---
> 💡 **Target domain:** molecular generation (SELFIES).  
> 🔬 **Goal:** a general base model capable of generating SELFIES representations of new molecules  
> 🚀 **Core innovation:** fast, modular **MTP + RL fine-tuning pipelines** using standard HuggingFace components.
---

## Disclaimer and Responsible Use Policy
**Model Purpose**: This generative model is designed exclusively for research and development applications in drug discovery and materials science. The model is intended to assist researchers in hypothesis generation, molecular design, and materials exploration.

**Limitations and Accuracy**:

- The model's outputs are predictions and should be validated through experimental verification.
- The author makes no warranties regarding the accuracy, completeness, reliability, or suitability of generated results.
- Users assume all risks associated with model outputs and their applications.


**Prohibited Uses**:

The model must not be used for:
- Legal, medical, or regulatory decision-making without proper validation
- Generating dangerous, toxic, or harmful compounds
- Any illegal activities or purposes
- Military, defense, or weapons development applications
- Circumventing safety regulations or ethical guidelines

**Compliance**: Users are responsible for ensuring compliance with applicable laws, regulations, and institutional policies in their jurisdiction. 

**Liability**: The author disclaims all liability for damages arising from the use or misuse of this model.

## Usage
### 🚀 Quick Start
```bash
# Clone repository
git clone https://huggingface.co/gbyuvd/ChemQ3MTP-base
cd ChemQ3MTP-base

# Install dependencies
pip install datasets numpy pandas ranger21 rdkit scikit_learn selfies torch tqdm transformers
```

### Direct Usage
Please clone the repo first, then you can:

```python
# ==============================
# Generate SELFIES from ChemQ3MTP checkpoint
# LOADING THE MODEL & TOKENIZER
# ================================

import sys
import os
import torch

# --- Replicate local module loading exactly as in training ---
notebook_dir = os.getcwd()
chemq3mtp_path = os.path.join(notebook_dir, "ChemQ3MTP")

if chemq3mtp_path not in sys.path:
    sys.path.insert(0, chemq3mtp_path)

# Optional: clean up duplicate paths 
existing_paths = [p for p in sys.path if p.endswith("ChemQ3MTP")]
for path in existing_paths[:-1]:  # keep only the most recently added
    sys.path.remove(path)

# Now import from local ChemQ3MTP folder
from FastChemTokenizerHF import FastChemTokenizerSelfies
from ChemQ3MTP import ChemQ3MTPForCausalLM  

# --- Load from checkpoint (same as saved in training) ---
checkpoint_dir = "./"  # or your actual checkpoint path

print(f"Loading tokenizer...")
tokenizer = FastChemTokenizerSelfies.from_pretrained('./selftok_core/')

print(f"Loading ChemQ3MTP model from {checkpoint_dir}...")
model = ChemQ3MTPForCausalLM.from_pretrained(checkpoint_dir)

# --- Prepare for generation ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Disable MTP mode for standard autoregressive generation
if hasattr(model, 'set_mtp_training'):
    model.set_mtp_training(False)

try:
    # Tokenize start token
    input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
    
    with torch.no_grad():
        gen = model.generate(
            input_ids=input_ids,
            max_length=256,
            top_k=50,
            temperature=1.0,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            early_stopping=True
        )
    
    result = tokenizer.decode(gen[0], skip_special_tokens=True)
    print("Generated SELFIES:")
    print(result)

except Exception as e:
    print(f"Generation failed: {e}")
    import traceback
    traceback.print_exc()

# Loading tokenizer...
# ✅ Special tokens bound: 0 1 2 3 4
# Loading ChemQ3MTP model from ./...
# Generated SELFIES:
# .[N] [C] [C] [N] [C] [C] [=C] [C] [=C] [Branch1] ...
```

**Generate and Visualize:**

```python
# Generate a molecule and render it with RDKit
from rdkit import Chem
from rdkit.Chem import Draw
import selfies as sf

input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
gen = model.generate(
    input_ids,
    max_length=512,
    top_k=50,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
generatedmol = tokenizer.decode(gen[0], skip_special_tokens=True)

# The tokenizer emits space-separated SELFIES tokens; join them before decoding
selfies_str = generatedmol.replace(' ', '')
csmi_gen = sf.decoder(selfies_str)  # SELFIES -> SMILES
print(csmi_gen)

# MolFromSmiles returns None for unparsable SMILES, so check before drawing
mol = Chem.MolFromSmiles(csmi_gen)
if mol is not None:
    Draw.MolToImage(mol)

# Example output:
# NC1=NC2=C(Br)C=CC=C2N1CCCCNCCC3=CC=CC(Cl)=C3
```

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/Ro950Z7AVBGEXqfY5sV94.png)


---

## 📊 Model Architecture

| Component | Details |
|-----------|---------|
| **Base Architecture** | Qwen2-like Transformer with MTP Head |
| **Model Type** | `chemq3_mtp` |
| **Total Parameters** | 9.86M (9.10M base + 0.75M MTP head) |
| **Vocabulary Size** | 782 (SELFIES tokens) |
| **Hidden Size** | 320 |
| **Number of Layers** | 6 |
| **Attention Heads** | 4 (2 KV heads, GQA) |
| **Head Dimension** | 64 |
| **Intermediate Size** | 1280 (FFN) |
| **Max Sequence Length** | 512 |
| **Sliding Window** | 16 |
| **RoPE Theta** | 10,000 |
| **Attention Dropout** | 0.1 |
| **MTP Future Tokens** | 3 horizons |
| **Word Embeddings** | Tied (input/output) |
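
For orientation, the backbone dimensions above map onto a standard `transformers` config; the reconstruction below uses `Qwen2Config` purely for illustration (the released model ships its own `chemq3_mtp` config class, so exact field names may differ):

```python
from transformers import Qwen2Config

# Illustrative only: backbone hyperparameters transcribed from the table above.
cfg = Qwen2Config(
    vocab_size=782,
    hidden_size=320,
    num_hidden_layers=6,
    num_attention_heads=4,
    num_key_value_heads=2,      # GQA: 2 KV heads shared across 4 query heads
    intermediate_size=1280,
    max_position_embeddings=512,
    sliding_window=16,
    rope_theta=10000.0,
    attention_dropout=0.1,
    tie_word_embeddings=True,
)
```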

### Training Configuration

| Parameter | Value |
|-----------|-------|
| **Batch Size** | 16 (effective: 64 with grad accumulation) |
| **Gradient Accumulation** | 4 steps |
| **Learning Rate** | 7e-6 |
| **Weight Decay** | 0.01 |
| **Epochs** | 1 |
| **Optimizer** | Ranger21 (warmup/warmdown) |
| **Training Set** | 2,330,051 molecules (80%) |
| **Validation Set** | 291,256 molecules (10%) |
| **Test Set** | 291,256 molecules (10%) |

### Generation Defaults

| Parameter | Value |
|-----------|-------|
| **Max Length** | 512 tokens |
| **Sampling** | Top-k (50) + Temperature (1.0) |
| **Top-p** | 0.9 |
| **Num Sequences** | 3 per prompt |
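
Assuming `model`, `tokenizer`, and `device` are loaded as in the Quick Start, these defaults correspond to a `generate` call like the following sketch:

```python
input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
outs = model.generate(
    input_ids,
    max_length=512,            # Max Length
    do_sample=True,
    top_k=50,                  # Top-k
    top_p=0.9,                 # Top-p
    temperature=1.0,           # Temperature
    num_return_sequences=3,    # Num Sequences per prompt
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in outs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```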

## βš™οΈ Model Training and Evaluation
```text
Warm-up steps = 25% Γ— 36,407 β‰ˆ 9,102 steps 
Training set size: 2,330,051 molecules
Validation / Test set sizes: 291,256 molecules each
Total parameters: 9,857,155
Base transformer: 9,104,832 parameters
MTP prediction head: 752,320 parameters
Horizon loss control parameters: 3
Enhancement overhead (vs. standard NTP baseline): 8.26%
Performance at Step 36,407:
Train Loss: 1.1720 β†’ Perplexity: exp(1.1720) β‰ˆ 3.23
Validation Loss: 1.0448 β†’ Perplexity: exp(1.0448) β‰ˆ 2.84
```
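
To make the MTP objective concrete, here is a minimal sketch of a weighted multi-horizon cross-entropy; the fixed `weights` tuple stands in for the model's 3 learned horizon-control parameters, so treat this as a schematic rather than the actual training code:

```python
import torch.nn.functional as F

def horizon_loss(horizon_logits, input_ids, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of cross-entropies for predicting tokens at t+1..t+3.

    horizon_logits: list of 3 tensors of shape (batch, seq, vocab), one per horizon.
    """
    total = 0.0
    for h, (logits, w) in enumerate(zip(horizon_logits, weights), start=1):
        preds = logits[:, :-h, :]     # positions that have a t+h target
        targets = input_ids[:, h:]    # the tokens h steps ahead
        total = total + w * F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
    return total / sum(weights)
```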

## βš™οΈ Generated Molecules Evaluation
### On 1K generated examples:

using `model_eval.ipynb`:
**Overall Stats:**

```text
πŸ“Š FINAL EVALUATION SUMMARY
============================================================
Total generated:          1000
Valid SMILES:             976 (97.6%)
Lipinski-compliant:       687 (70.4% of valid)
Internal diversity:       0.6387
MACCS clusters (β‰₯0.7):    448
Average cluster size:     2.18
Largest cluster size:     15
============================================================

βœ… Results dictionary created
βœ… Valid SMILES saved to 'generated_valid_2500.smi'

```
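
The internal-diversity figure follows the usual definition (1 minus the mean pairwise Tanimoto similarity); below is a sketch over MACCS keys, assumed to mirror what `model_eval.ipynb` computes rather than copied from it:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def internal_diversity(smiles_list):
    """1 - mean pairwise Tanimoto over MACCS fingerprints.
    Assumes >= 2 entries, all of them valid SMILES."""
    fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles_list]
    sims = [DataStructs.TanimotoSimilarity(fps[i], fps[j])
            for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return 1.0 - sum(sims) / len(sims)
```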

**PCA & t-SNE of MACCS:**

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cWmq905lpkU_iA38nqXZL.png)

### On contextual generated examples:
Using `model_context_eval.ipynb` with 4 nAChR-a4b2 partial agonists as inputs:

**Generated 50 examples per input as context (200 total):**
```
=== Summary ===
Total generated: 200
Overall validity rate: 100.00%

Input: O=C(C[C@@H]1N([C@@H](CCC1)C[C@@H](C2=CC=CC=C2)O)C)C3=CC=CC=C3
  Validity: 100.00%
  Avg Similarity: 0.620
  Lipinski Pass Rate: 38.00%

Input: O=C2N(C)[C@H](c1cnccc1)CC2
  Validity: 100.00%
  Avg Similarity: 0.399
  Lipinski Pass Rate: 84.00%

Input: O=C1/C=C\C=C2/N1C[C@@H]3CNC[C@H]2C3
  Validity: 100.00%
  Avg Similarity: 0.613
  Lipinski Pass Rate: 98.00%

Input: n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4
  Validity: 100.00%
  Avg Similarity: 0.501
  Lipinski Pass Rate: 100.00%
```
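
The Lipinski pass rates above refer to the standard rule of five; a minimal check looks like the sketch below (not necessarily the notebook's exact implementation):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_lipinski(smiles: str) -> bool:
    """Rule of five: MW <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)
```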

Example outputs:

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/XXn-UqBoH4YWsQLsZ5k6H.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/O5t69ooYnp1RXpCT89ja9.png)


**t-SNE:**


![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/rB2PTTZ9VczhWv_jU7cxF.png)


## ❤️ Support the Project

Training and scaling require significant computational resources.  
If you'd like to support this research (e.g., helping us rent compute servers for rapid RL prototyping and MTP validation), you can contribute here:  

[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/O4O710GFBZ)

Every bit of support helps us push ChemQ3MTP further! 🚀🧬

---

## Citation
If you find this project useful in your research and wish to cite it, please use the following BibTeX entry:

```bibtex
@software{chemq3mtp_base,
  author = {GP Bayu},
  title = {{ChemQ3MTP}: Pretraining a Lightweight Transformer for Molecular Generation with Multi-Token Prediction and Horizon Loss},
  url = {https://huggingface.co/gbyuvd/ChemQ3MTP-base},
  version = {0.2},
  year = {2025},
}
```

## References
### BibTeX
#### Qwen2
```bibtex
@misc{yang2024qwen2technicalreport,
      title={Qwen2 Technical Report}, 
      author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Xuejing Liu and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhifang Guo and Zhihao Fan},
      year={2024},
      eprint={2407.10671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10671}, 
}
```

#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChEMBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2024},
  volume={52},
  number={D1},
  pages={D1180--D1192},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3
```bibtex
@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}
```

#### Ranger21 Optimizer
```bibtex
@article{wright2021ranger21,
      title={Ranger21: a synergistic deep learning optimizer}, 
      author={Wright, Less and Demeure, Nestor},
      year={2021},
      journal={arXiv preprint arXiv:2106.13731},
}
```

#### Durrant's Lab Filtering
```bibtex
@article{ropp2019gypsum,
  title={Gypsum-DL: An Open-source Program for Preparing Small-molecule Libraries for Structure-based Virtual Screening},
  author={Ropp, Patrick J. and Spiegel, Jacob O. and Walker, Jennifer L. and Green, Harrison and Morales, Guillermo A. and Milliken, Katherine A. and Ringe, John J. and Durrant, Jacob D.},
  journal={Journal of Cheminformatics},
  volume={11},
  number={1},
  year={2019},
  doi={10.1186/s13321-019-0358-3}
}
```