---
license: apache-2.0
pipeline_tag: text-generation
tags:
- chemistry
- molecular-generation
- qwen2
- mtp
- selfies
- cheminformatics
---

# 🧬 ChemQ3MTP-base

ChemQ3MTP-base is a lightweight generative model for chemistry, trained on a curated dataset of 2.33 million valid bioactive and natural-product molecules drawn from ChEMBL34, COCONUTDB, and SuperNatural3. It is built on a compact Qwen2-like transformer backbone and trained with a multi-horizon predictive (MTP) loss objective to model molecular structures in the SELFIES representation.

Train Loss: 1.1720 → Perplexity: ~3.23

Validation Loss: 1.0448 → Perplexity: ~2.84

Current version: 0.2 (license: MIT for code, Apache 2.0 for weights)

A custom Qwen2-style language model, adapted for molecular generation:

- ✅ **Qwen2-like Mini Backbone** – Efficient causal LM architecture
- ✅ **Multi-Token Prediction (MTP Head)** – Parallel prediction of 1–3 future tokens, implemented as a plug-and-play head compatible with `AutoModel`
- ✅ **Horizon Loss** – Weighted multi-horizon objectives for long-term coherence
- ✅ **SELFIES-native Tokenizer** – Robust encoding with [FastChemTokenizer](https://github.com/gbyuvd/FastChemTokenizer)
- ✅ **Ranger21 Optimizer** – Warmup/warmdown scheduling for stable training
- ✅ **Gradient Checkpointing** – Lightweight, hardware-friendly, optimized for rapid RL prototyping
- ✅ **Adaptive Generation-Length Cap** – Dynamically limits `max_new_tokens` to 25% of the prompt length when no budget is given, reducing inference cost and preventing runaway molecule chains during RL (no retraining needed; works with `generate` and `generate_with_logprobs`); see the sketch after this list
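
A minimal sketch of how such a cap can behave (the helper name `capped_new_tokens` and the `floor` argument are illustrative assumptions, not the shipped API):

```python
from typing import Optional

def capped_new_tokens(prompt_len: int,
                      max_new_tokens: Optional[int] = None,
                      frac: float = 0.25,
                      floor: int = 8) -> int:
    """If no explicit budget is given, cap generation at 25% of the prompt
    length; `floor` (an assumption here) keeps very short prompts from
    being capped to zero."""
    if max_new_tokens is not None:
        return max_new_tokens  # explicit budgets are respected as-is
    return max(floor, int(prompt_len * frac))

# A 64-token prompt with no budget is capped at 16 new tokens.
assert capped_new_tokens(64) == 16
```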


RL-Ready Features:
- ✅ **Durrant's Lab Filter** – Integrated substructure filtering based on the [gypsum_dl](https://github.com/durrantlab/gypsum_dl/) (Ropp _et al._ 2019) methodology to remove improbable molecular variants during validity checks
- ✅ **Pareto Reward Controller** – Ready for RL fine-tuning with dynamic multi-objective optimization balancing validity, synthesizability, and molecular complexity with adaptive weight adjustment (see the sketch below)
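
As a toy illustration of the multi-objective idea behind the Pareto controller, the sketch below combines a validity bonus with QED as a drug-likeness proxy; the fixed weights and the `toy_reward` name are assumptions for illustration, whereas the actual controller adapts its weights online:

```python
from rdkit import Chem
from rdkit.Chem import QED

def toy_reward(smiles: str, w_valid: float = 1.0, w_qed: float = 1.0) -> float:
    """Validity bonus plus QED-weighted drug-likeness; invalid molecules score 0."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0  # invalid molecules earn no reward
    return w_valid + w_qed * QED.qed(mol)
```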

---
> 💡 **Target domain:** molecular generation (SELFIES).  
> 🔬 **Goal:** a general base model capable of generating SELFIES representations of new molecules  
> 🚀 **Core innovation:** fast, modular **MTP + RL fine-tuning pipelines** using standard HuggingFace components.
---

## Disclaimer and Responsible Use Policy
**Model Purpose**: This generative model is designed exclusively for research and development applications in drug discovery and materials science. The model is intended to assist researchers in hypothesis generation, molecular design, and materials exploration.

**Limitations and Accuracy**:

- The model's outputs are predictions and should be validated through experimental verification.
- The author makes no warranties regarding the accuracy, completeness, reliability, or suitability of generated results.
- Users assume all risks associated with model outputs and their applications.


**Prohibited Uses**:

The model must not be used for:
- Legal, medical, or regulatory decision-making without proper validation
- Generating dangerous, toxic, or harmful compounds
- Any illegal activities or purposes
- Military, defense, or weapons development applications
- Circumventing safety regulations or ethical guidelines

**Compliance**: Users are responsible for ensuring compliance with applicable laws, regulations, and institutional policies in their jurisdiction. 

**Liability**: The author disclaims all liability for damages arising from the use or misuse of this model.

## Usage
### 🚀 Quick Start
```bash
# Clone repository
git clone https://huggingface.co/gbyuvd/ChemQ3MTP-base
cd ChemQ3MTP-base

# Install dependencies
pip install datasets numpy pandas ranger21 rdkit scikit_learn selfies torch tqdm transformers
```

### Direct Usage
Please clone the repo first, then you can:

```python
# ==============================
# Generate SELFIES from ChemQ3MTP checkpoint
# LOADING THE MODEL & TOKENIZER
# ================================

import sys
import os
import torch

# --- Replicate local module loading exactly as in training ---
notebook_dir = os.getcwd()
chemq3mtp_path = os.path.join(notebook_dir, "ChemQ3MTP")

if chemq3mtp_path not in sys.path:
    sys.path.insert(0, chemq3mtp_path)

# Optional: clean up duplicate paths 
existing_paths = [p for p in sys.path if p.endswith("ChemQ3MTP")]
for path in existing_paths[:-1]:  # keep only the most recently added
    sys.path.remove(path)

# Now import from local ChemQ3MTP folder
from FastChemTokenizerHF import FastChemTokenizerSelfies
from ChemQ3MTP import ChemQ3MTPForCausalLM  

# --- Load from checkpoint (same as saved in training) ---
checkpoint_dir = "./"  # or your actual checkpoint path

print(f"Loading tokenizer...")
tokenizer = FastChemTokenizerSelfies.from_pretrained('./selftok_core/')

print(f"Loading ChemQ3MTP model from {checkpoint_dir}...")
model = ChemQ3MTPForCausalLM.from_pretrained(checkpoint_dir)

# --- Prepare for generation ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Disable MTP mode for standard autoregressive generation
if hasattr(model, 'set_mtp_training'):
    model.set_mtp_training(False)

try:
    # Tokenize start token
    input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
    
    with torch.no_grad():
        gen = model.generate(
            input_ids=input_ids,
            max_length=256,
            top_k=50,
            temperature=1.0,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            early_stopping=True
        )
    
    result = tokenizer.decode(gen[0], skip_special_tokens=True)
    print("Generated SELFIES:")
    print(result)

except Exception as e:
    print(f"Generation failed: {e}")
    import traceback
    traceback.print_exc()

# Loading tokenizer...
# ✅ Special tokens bound: 0 1 2 3 4
# Loading ChemQ3MTP model from ./...
# Generated SELFIES:
# .[N] [C] [C] [N] [C] [C] [=C] [C] [=C] [Branch1] ...
```

**Generate and Visualize:**

```python
# Generate a molecule and render it with RDKit
from rdkit import Chem
from rdkit.Chem import Draw
import selfies as sf

input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
gen = model.generate(
    input_ids,
    max_length=512,
    top_k=50,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
generatedmol = tokenizer.decode(gen[0], skip_special_tokens=True)

# The tokenizer emits space-separated SELFIES tokens; join them before decoding
selfies_str = generatedmol.replace(' ', '')
csmi_gen = sf.decoder(selfies_str)  # SELFIES -> SMILES
print(csmi_gen)

# MolFromSmiles returns None for unparsable SMILES, so check before drawing
mol = Chem.MolFromSmiles(csmi_gen)
if mol is not None:
    Draw.MolToImage(mol)

# Example output:
# NC1=NC2=C(Br)C=CC=C2N1CCCCNCCC3=CC=CC(Cl)=C3
```

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/Ro950Z7AVBGEXqfY5sV94.png)


---

## 📊 Model Architecture

| Component | Details |
|-----------|---------|
| **Base Architecture** | Qwen2-like Transformer with MTP Head |
| **Model Type** | `chemq3_mtp` |
| **Total Parameters** | 9.86M (9.10M base + 0.75M MTP head) |
| **Vocabulary Size** | 782 (SELFIES tokens) |
| **Hidden Size** | 320 |
| **Number of Layers** | 6 |
| **Attention Heads** | 4 (2 KV heads, GQA) |
| **Head Dimension** | 64 |
| **Intermediate Size** | 1280 (FFN) |
| **Max Sequence Length** | 512 |
| **Sliding Window** | 16 |
| **RoPE Theta** | 10,000 |
| **Attention Dropout** | 0.1 |
| **MTP Future Tokens** | 3 horizons |
| **Word Embeddings** | Tied (input/output) |
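
For orientation, the backbone dimensions above map onto a standard `transformers` config; the reconstruction below uses `Qwen2Config` purely for illustration (the released model ships its own `chemq3_mtp` config class, so exact field names may differ):

```python
from transformers import Qwen2Config

# Illustrative only: backbone hyperparameters transcribed from the table above.
cfg = Qwen2Config(
    vocab_size=782,
    hidden_size=320,
    num_hidden_layers=6,
    num_attention_heads=4,
    num_key_value_heads=2,      # GQA: 2 KV heads shared across 4 query heads
    intermediate_size=1280,
    max_position_embeddings=512,
    sliding_window=16,
    rope_theta=10000.0,
    attention_dropout=0.1,
    tie_word_embeddings=True,
)
```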

### Training Configuration

| Parameter | Value |
|-----------|-------|
| **Batch Size** | 16 (effective: 64 with grad accumulation) |
| **Gradient Accumulation** | 4 steps |
| **Learning Rate** | 7e-6 |
| **Weight Decay** | 0.01 |
| **Epochs** | 1 |
| **Optimizer** | Ranger21 (warmup/warmdown) |
| **Training Set** | 2,330,051 molecules (80%) |
| **Validation Set** | 291,256 molecules (10%) |
| **Test Set** | 291,256 molecules (10%) |

### Generation Defaults

| Parameter | Value |
|-----------|-------|
| **Max Length** | 512 tokens |
| **Sampling** | Top-k (50) + Temperature (1.0) |
| **Top-p** | 0.9 |
| **Num Sequences** | 3 per prompt |
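
Assuming `model`, `tokenizer`, and `device` are loaded as in the Quick Start, these defaults correspond to a `generate` call like the following sketch:

```python
input_ids = tokenizer("<s>", return_tensors="pt").input_ids.to(device)
outs = model.generate(
    input_ids,
    max_length=512,            # Max Length
    do_sample=True,
    top_k=50,                  # Top-k
    top_p=0.9,                 # Top-p
    temperature=1.0,           # Temperature
    num_return_sequences=3,    # Num Sequences per prompt
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in outs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```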

## βš™οΈ Model Training and Evaluation
```text
Warm-up steps = 25% Γ— 36,407 β‰ˆ 9,102 steps 
Training set size: 2,330,051 molecules
Validation / Test set sizes: 291,256 molecules each
Total parameters: 9,857,155
Base transformer: 9,104,832 parameters
MTP prediction head: 752,320 parameters
Horizon loss control parameters: 3
Enhancement overhead (vs. standard NTP baseline): 8.26%
Performance at Step 36,407:
Train Loss: 1.1720 β†’ Perplexity: exp(1.1720) β‰ˆ 3.23
Validation Loss: 1.0448 β†’ Perplexity: exp(1.0448) β‰ˆ 2.84
```
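
To make the MTP objective concrete, here is a minimal sketch of a weighted multi-horizon cross-entropy; the fixed `weights` tuple stands in for the model's 3 learned horizon-control parameters, so treat this as a schematic rather than the actual training code:

```python
import torch.nn.functional as F

def horizon_loss(horizon_logits, input_ids, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of cross-entropies for predicting tokens at t+1..t+3.

    horizon_logits: list of 3 tensors of shape (batch, seq, vocab), one per horizon.
    """
    total = 0.0
    for h, (logits, w) in enumerate(zip(horizon_logits, weights), start=1):
        preds = logits[:, :-h, :]     # positions that have a t+h target
        targets = input_ids[:, h:]    # the tokens h steps ahead
        total = total + w * F.cross_entropy(
            preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
    return total / sum(weights)
```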

## βš™οΈ Generated Molecules Evaluation
### On 1K generated examples:

using `model_eval.ipynb`:
**Overall Stats:**

```text
πŸ“Š FINAL EVALUATION SUMMARY
============================================================
Total generated:          1000
Valid SMILES:             976 (97.6%)
Lipinski-compliant:       687 (70.4% of valid)
Internal diversity:       0.6387
MACCS clusters (β‰₯0.7):    448
Average cluster size:     2.18
Largest cluster size:     15
============================================================

βœ… Results dictionary created
βœ… Valid SMILES saved to 'generated_valid_2500.smi'

```
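
The internal-diversity figure follows the usual definition (1 minus the mean pairwise Tanimoto similarity); below is a sketch over MACCS keys, assumed to mirror what `model_eval.ipynb` computes rather than copied from it:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def internal_diversity(smiles_list):
    """1 - mean pairwise Tanimoto over MACCS fingerprints.
    Assumes >= 2 entries, all of them valid SMILES."""
    fps = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles_list]
    sims = [DataStructs.TanimotoSimilarity(fps[i], fps[j])
            for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return 1.0 - sum(sims) / len(sims)
```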

**PCA & t-SNE of MACCS:**

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/cWmq905lpkU_iA38nqXZL.png)

### On contextual generated examples:
Using `model_context_eval.ipynb` with 4 nAChR-a4b2 partial agonists as inputs:

**Generated 50 examples per input as context (200 total):**
```
=== Summary ===
Total generated: 200
Overall validity rate: 100.00%

Input: O=C(C[C@@H]1N([C@@H](CCC1)C[C@@H](C2=CC=CC=C2)O)C)C3=CC=CC=C3
  Validity: 100.00%
  Avg Similarity: 0.620
  Lipinski Pass Rate: 38.00%

Input: O=C2N(C)[C@H](c1cnccc1)CC2
  Validity: 100.00%
  Avg Similarity: 0.399
  Lipinski Pass Rate: 84.00%

Input: O=C1/C=C\C=C2/N1C[C@@H]3CNC[C@H]2C3
  Validity: 100.00%
  Avg Similarity: 0.613
  Lipinski Pass Rate: 98.00%

Input: n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4
  Validity: 100.00%
  Avg Similarity: 0.501
  Lipinski Pass Rate: 100.00%
```
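
The Lipinski pass rates above refer to the standard rule of five; a minimal check looks like the sketch below (not necessarily the notebook's exact implementation):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_lipinski(smiles: str) -> bool:
    """Rule of five: MW <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)
```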

Example outputs:

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/XXn-UqBoH4YWsQLsZ5k6H.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/O5t69ooYnp1RXpCT89ja9.png)


**t-SNE:**


![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/rB2PTTZ9VczhWv_jU7cxF.png)


## ❤️ Support the Project

Training and scaling require significant computational resources.  
If you'd like to support this research (e.g., helping us rent compute servers for rapid RL prototyping and MTP validation), you can contribute here:  

[![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/O4O710GFBZ)

Every bit of support helps us push ChemQ3MTP further! 🚀🧬

---

## Citation
If you find this project useful in your research and wish to cite it, please use the following BibTeX entry:

```bibtex
@software{chemq3mtp_base,
  author = {GP Bayu},
  title = {{ChemQ3MTP}: Pretraining a Lightweight Transformer for Molecular Generation with Multi-Token Prediction and Horizon Loss},
  url = {https://huggingface.co/gbyuvd/ChemQ3MTP-base},
  version = {0.2},
  year = {2025},
}
```

## References
### BibTeX
#### Qwen2
```bibtex
@misc{yang2024qwen2technicalreport,
      title={Qwen2 Technical Report}, 
      author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Xuejing Liu and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhifang Guo and Zhihao Fan},
      year={2024},
      eprint={2407.10671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10671}, 
}
```

#### COCONUTDB
```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

#### ChEMBL34
```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2024},
  volume={52},
  number={D1},
  pages={D1180--D1192},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

#### SuperNatural3
```bibtex
@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}
```

#### Ranger21 Optimizer
```bibtex
@article{wright2021ranger21,
      title={Ranger21: a synergistic deep learning optimizer}, 
      author={Wright, Less and Demeure, Nestor},
      year={2021},
      journal={arXiv preprint arXiv:2106.13731},
}
```

#### Durrant's Lab Filtering
```bibtex
@article{ropp2019gypsum,
  title={Gypsum-DL: An Open-source Program for Preparing Small-molecule Libraries for Structure-based Virtual Screening},
  author={Ropp, Patrick J. and Spiegel, Jacob O. and Walker, Jennifer L. and Green, Harrison and Morales, Guillermo A. and Milliken, Katherine A. and Ringe, John J. and Durrant, Jacob D.},
  journal={Journal of Cheminformatics},
  volume={11},
  number={1},
  year={2019},
  doi={10.1186/s13321-019-0358-3}
}
```