Update README.md
README.md
CHANGED
@@ -13,7 +13,7 @@ tags:
 # 🧬 ChemQ3MTP-base

-ChemQ3MTP-base is a lightweight generative model for chemistry, built on mini **Qwen2-like** backbone with **multi-horizon predictive loss** for molecular SELFIES representations.

 Current version (0.1) (Lic for Code: MIT; Weights: Apache 2.0)

@@ -24,17 +24,40 @@ A custom Qwen2-style language model, adapted for molecular generation:
 - ✅ **Horizon Loss** - Weighted multi-horizon objectives for long-term coherence
 - ✅ **SELFIES-native Tokenizer** - Robust encoding with [FastChemTokenizer](https://github.com/gbyuvd/FastChemTokenizer)
 - ✅ **Ranger21 Optimizer** - Warmup/warmdown scheduling for stable training
-- ✅ **Gradient Checkpointing
 - ✅ **Durrant's Lab Filter** - Integrated substructure filtering based on [gypsum_dl](https://github.com/durrantlab/gypsum_dl/) (Ropp _et al._ 2019) methodology to remove improbable molecular variants in validity check
 - ✅ **Pareto Reward Controller** - Ready for RL fine-tuning with dynamic multi-objective optimization balancing validity, synthesizability, and molecular complexity with adaptive weight adjustment

 ---
 > 💡 **Target domain:** molecular generation (SELFIES).
-> 🔬 **Goal:**
-> **Core innovation:** fast, modular
-
 ---

 ## Usage
 Requirements:

@@ -190,10 +213,14 @@ if mol is not None:
 else:
     print("Could not create molecule from generated SMILES")
 ```
-## Model Eval

 ---

 ## ❤️ Support the Project

 Training and scaling require significant computational resources.

 # 🧬 ChemQ3MTP-base

+ChemQ3MTP-base is a lightweight generative model for chemistry, trained on 2.3M valid bioactive and natural-product molecules and built on a mini **Qwen2-like** backbone with a **multi-horizon predictive loss** for molecular SELFIES representations.

 Current version (0.1) (Lic for Code: MIT; Weights: Apache 2.0)

 - ✅ **Horizon Loss** - Weighted multi-horizon objectives for long-term coherence
 - ✅ **SELFIES-native Tokenizer** - Robust encoding with [FastChemTokenizer](https://github.com/gbyuvd/FastChemTokenizer)
 - ✅ **Ranger21 Optimizer** - Warmup/warmdown scheduling for stable training
+- ✅ **Gradient Checkpointing** - Lightweight, hardware-friendly, optimized for rapid RL prototyping
+
+RL-Ready Features:
 - ✅ **Durrant's Lab Filter** - Integrated substructure filtering based on [gypsum_dl](https://github.com/durrantlab/gypsum_dl/) (Ropp _et al._ 2019) methodology to remove improbable molecular variants in validity check
 - ✅ **Pareto Reward Controller** - Ready for RL fine-tuning with dynamic multi-objective optimization balancing validity, synthesizability, and molecular complexity with adaptive weight adjustment
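
The **Horizon Loss** entry describes weighted multi-horizon objectives without giving a formula. The following is a minimal sketch only, assuming the loss is a weighted average of cross-entropies in which the head for horizon `k` predicts the token `k` positions ahead. The function names, the weights, and the nested-list data layout are illustrative assumptions, not the model's actual implementation.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under one probability vector."""
    return -math.log(probs[target])


def multi_horizon_loss(pred_probs, tokens, horizon_weights):
    """Weighted average of per-horizon mean cross-entropies (illustrative sketch).

    pred_probs[h][t] is the distribution predicted at position t for the token
    at position t + h + 1, i.e. by the hypothetical head for horizon h + 1.
    horizon_weights[h] is the assumed weight of that horizon.
    """
    total, weight_sum = 0.0, 0.0
    for h, weight in enumerate(horizon_weights):
        offset = h + 1
        # Positions whose target would fall past the end of the sequence are skipped.
        n_valid = len(tokens) - offset
        if n_valid <= 0:
            continue
        mean_ce = sum(
            cross_entropy(pred_probs[h][t], tokens[t + offset])
            for t in range(n_valid)
        ) / n_valid
        total += weight * mean_ce
        weight_sum += weight
    # Normalise so the result stays on the scale of a single cross-entropy.
    return total / weight_sum if weight_sum else 0.0
```

With uniform predictions over a four-token vocabulary, every horizon contributes `log(4)`, so the weighted average is also `log(4)` whatever weights are chosen. Normalising by the sum of the weights is a design choice of this sketch; an unnormalised weighted sum would work equally well as a training objective.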

 ---
 > 💡 **Target domain:** molecular generation (SELFIES).
+> 🔬 **Goal:** a general base model that is knowledgeable about, and capable of generating, SELFIES representations of new molecules.
+> **Core innovation:** fast, modular **MTP + RL fine-tuning pipelines** using standard Hugging Face components.
 ---

+# Disclaimer and Responsible Use Policy
+**Model Purpose**: This generative model is designed exclusively for research and development applications in drug discovery and materials science. The model is intended to assist researchers in hypothesis generation, molecular design, and materials exploration.
+
+**Limitations and Accuracy**:
+
+The model's outputs are predictions and should be validated through experimental verification
+The author makes no warranties regarding the accuracy, completeness, reliability, or suitability of generated results
+Users assume all risks associated with model outputs and their applications
+
+**Prohibited Uses**:
+
+The model must not be used for:
+
+Legal, medical, or regulatory decision-making without proper validation
+Generating dangerous, toxic, or harmful compounds without appropriate safety measures
+Any illegal activities or purposes
+Military, defense, or weapons development applications
+Circumventing safety regulations or ethical guidelines
+Compliance: Users are responsible for ensuring compliance with applicable laws, regulations, and institutional policies in their jurisdiction.
+
+**Liability**: The author disclaims all liability for damages arising from the use or misuse of this model.
+
 ## Usage
 Requirements:

 else:
     print("Could not create molecule from generated SMILES")
 ```

 ---

+
+
+## Model Eval
+- Perplexity on unseen set:
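
The evaluation entry names perplexity without reporting a value. For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. Here is a minimal sketch, assuming natural-log token probabilities have already been collected from the model on the held-out set; the `perplexity` helper is hypothetical and not part of the repository.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every held-out token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four tokens. Lower values indicate a better fit to the unseen molecules.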
+
 ## ❤️ Support the Project

 Training and scaling require significant computational resources.