RougeBERT: A Lightweight Experimental BERT-like Hybrid Transformer with RoPE+GQA
A BERT-style masked language model: an 8-layer transformer with GQA, RoPE, sliding-window + global attention, RMSNorm, and weight tying, optimized for speed, memory, and contextual modeling.
Core: 8L | 320d | 8H | GQA (2 groups) | RMSNorm | Weight Tied
Position: Rotary (RoPE) | extrapolates beyond 1024 tokens
Attention: Sliding Window (16) + Global Tokens | Vectorized Masks
Innovations: 4x smaller KV cache • No learned position embeddings • Local + global context
Prototype research code, not production-ready. Learning by building.
Potentially ideal for: Efficient training/inference • Medium-length NLP tasks • Memory-constrained environments. Target domain: chemistry & molecular generation (SELFIES). The architecture is potentially generalizable to other sequence domains.
Model Architecture
RougeBERT is a hybrid transformer architecture that combines modern efficiency techniques with BERT-style masked language modeling. The model integrates several key innovations:
Core Architecture
- 8-layer transformer with 320-dimensional hidden states and 8 attention heads
- Grouped Query Attention (GQA) with 2 key-value groups, reducing memory usage while maintaining performance (see the sketch after this list)
- RMSNorm instead of LayerNorm for improved training stability and efficiency
- Weight tying between input embeddings and output head for parameter efficiency
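To make the GQA layout concrete, here is a minimal, self-contained sketch of grouped-query attention with 8 query heads sharing 2 key-value groups (the 8/2 ratio is what yields the roughly 4x smaller KV cache). Class and variable names are illustrative assumptions, not RougeBERT's actual code; RMSNorm and weight tying are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Illustrative GQA block: 8 query heads share 2 key/value groups (hypothetical names)."""
    def __init__(self, dim=320, n_heads=8, n_kv_groups=2):
        super().__init__()
        assert n_heads % n_kv_groups == 0
        self.n_heads, self.n_kv_groups = n_heads, n_kv_groups
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # K/V are projected once per group, so the KV cache is n_heads / n_kv_groups (= 4x) smaller.
        self.k_proj = nn.Linear(dim, n_kv_groups * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_groups * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x, attn_mask=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_groups, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_groups, self.head_dim).transpose(1, 2)
        # Broadcast each KV group across the 4 query heads that share it.
        rep = self.n_heads // self.n_kv_groups
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # non-causal (BERT-style)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```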
Positional Encoding
- Rotary Position Embedding (RoPE) replaces traditional learned position embeddings
- Provides better length extrapolation and relative position understanding
- Configured for sequences up to 1024 tokens (see the RoPE sketch after this list)
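Below is a hedged sketch of the standard interleaved RoPE formulation for reference; the function names and the interleaving convention are assumptions and may not match the repo's implementation.

```python
import torch

def rope_cache(seq_len, head_dim, base=10000.0, device=None):
    # Per-pair rotation frequencies; positions beyond the training length reuse the same formula,
    # which is what lets RoPE extrapolate without learned position embeddings.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate each (even, odd) dimension pair by a position-dependent angle.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Usage (shapes only): cos, sin = rope_cache(seq_len, head_dim); q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
```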
Attention Mechanism
- Sliding window attention with configurable window size (default: 16 tokens)
- Global attention tokens that can attend to and be attended by all positions
- Combines local efficiency with global context modeling
- Vectorized attention mask computation for efficiency (see the mask sketch after this list)
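A possible vectorized construction of the combined sliding-window + global-token mask is sketched below. The choice of global indices and the boolean convention (True = attention allowed, as expected by PyTorch's scaled_dot_product_attention) are assumptions, not necessarily RougeBERT's exact semantics.

```python
import torch

def build_attention_mask(seq_len, window=16, global_idx=(0,), device=None):
    pos = torch.arange(seq_len, device=device)
    # Local band: each token attends to neighbours within +/- window (bidirectional, BERT-style).
    local = (pos[None, :] - pos[:, None]).abs() <= window
    is_global = torch.zeros(seq_len, dtype=torch.bool, device=device)
    is_global[list(global_idx)] = True
    # Global tokens attend to all positions (rows) and are attended to by all positions (columns).
    allowed = local | is_global[:, None] | is_global[None, :]
    return allowed  # (seq_len, seq_len) boolean mask; True = attention allowed
```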
Evaluation vs. RoBERTa on MLM (work in progress; currently 1K samples, 10 epochs, validation every 10 steps)
Dataset & Preprocessing
- Data: 1,000 SMILES molecular representations from sample_1k_smi_42.csv, a 1K-molecule sample drawn from a combined curated dataset built from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023)
- Split: 70% train / 15% validation / 15% test (stratified random split, seed=42); see the split sketch after this list
- Tokenization: FastChemTokenizer with a maximum sequence length of 512 tokens
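For reference, a minimal sketch of the 70/15/15 split with seed 42 described above. The CSV path and column layout are assumed from the description, and since the stratification key is not stated here, this plain random split is only an approximation of the actual preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample_1k_smi_42.csv")                 # 1,000 SMILES strings (column layout assumed)
train_df, rest = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train_df), len(val_df), len(test_df))          # 700 / 150 / 150
```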
Training Setup
- Task: Masked Language Modeling (MLM) with 15% token masking probability (see the masking sketch after this list)
- Architecture Comparison: RougeBERT (hybrid model) vs RoBERTa baseline (~9M parameters each)
- Training: 10 epochs, batch size 16, gradient accumulation 4 steps, learning rate 1e-5
- Optimizer: Ranger21 with AdaBelief, warmup, and MADGrad components
- Early Stopping: Patience of 10 validation steps based on validation loss
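The masking sketch referenced above: a minimal 15% MLM masking routine. The document only states the 15% probability, so the BERT-style 80/10/10 replacement scheme, the function name, and the special-token handling are assumptions rather than the repo's actual code.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, p=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    prob = torch.full(labels.shape, p)
    special = torch.isin(input_ids, torch.tensor(list(special_ids)))
    prob.masked_fill_(special, 0.0)                      # never mask special tokens
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                               # compute loss only on masked positions
    # Assumed standard BERT scheme: 80% of masked tokens -> [MASK], 10% -> random token, 10% -> unchanged.
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, labels.shape)[rand]
    return input_ids, labels
```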
Evaluation Metrics
- Perplexity: Primary metric for language modeling quality (lower is better)
- MLM Accuracy: Token-level accuracy on masked positions (see the metrics sketch after this list)
- Validation Loss: Cross-entropy loss on held-out validation set
- Evaluation Frequency: Every 10 training steps with continuous monitoring
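A small sketch of how these metrics can be computed from model outputs, assuming labels use -100 at unmasked positions; it mirrors the metric definitions above but is not necessarily the repo's evaluation code.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def mlm_metrics(logits, labels):
    # logits: (batch, seq, vocab); labels: (batch, seq) with -100 at unmasked positions.
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    mask = labels != -100
    acc = (logits.argmax(dim=-1)[mask] == labels[mask]).float().mean()
    return {"val_loss": loss.item(),
            "perplexity": math.exp(loss.item()),   # perplexity = exp(mean cross-entropy)
            "mlm_accuracy": acc.item()}
```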
Learning Curves of the Two Models on the 1K Dataset (10 Epochs)
Contributing
This project is a learning experiment, and all contributions are welcome!
- Have a better way to implement the methods?
- Want to add evaluation metrics?
- Found a bug? Please open an issue!
Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
Disclaimer
This is NOT a production model.
- Built during late-night prototyping sessions
- Not thoroughly validated or benchmarked due to compute constraints
- Some components are heuristic and unproven
- May crash, overfit, or generate nonsense (especially outside molecular data)
- I'm still learning PyTorch, attention mechanisms, and transformer internals
Use this code to learn and experiment, not to deploy.
License
MIT
COCONUTDB
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
ChEMBL34
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChEMBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
SuperNatural3
@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}
Ranger21 Optimizer
@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}