RougeBERT: A Lightweight Experimental BERT-like Hybrid Transformer with RoPE+GQA
A BERT-style masked language model: an 8-layer transformer with GQA, RoPE, sliding-window + global attention, RMSNorm, and weight tying, optimized for speed, memory, and contextual modeling.
Core: 8L | 320d | 8H | GQA (2 groups) | RMSNorm | Weight Tied
Position: Rotary (RoPE) | extrapolates beyond 1024 tokens
Attention: Sliding Window (16) + Global Tokens | Vectorized Masks
Innovations: 4x smaller KV cache • No learned position embeddings • Local + global context
Prototype research code, not production-ready. Learning by building.
Potentially ideal for: Efficient training/inference • Medium-length NLP tasks • Memory-constrained environments. Target domain: chemistry & molecular generation (SELFIES). The architecture is potentially generalizable to other sequence domains.
Model Architecture
RougeBERT is a hybrid transformer architecture that combines modern efficiency techniques with BERT-style masked language modeling. The model integrates several key innovations:
Core Architecture
- 8-layer transformer with 320-dimensional hidden states and 8 attention heads
- Grouped Query Attention (GQA) with 2 key-value groups, reducing memory usage while maintaining performance (see the sketch after this list)
- RMSNorm instead of LayerNorm for improved training stability and efficiency
- Weight tying between input embeddings and output head for parameter efficiency
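To make the GQA layout concrete, here is a minimal, self-contained sketch of grouped-query attention with 8 query heads sharing 2 key-value groups (the 8/2 ratio is what yields the roughly 4x smaller KV cache). Class and variable names are illustrative assumptions, not RougeBERT's actual code; RMSNorm and weight tying are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Illustrative GQA block: 8 query heads share 2 key/value groups (hypothetical names)."""
    def __init__(self, dim=320, n_heads=8, n_kv_groups=2):
        super().__init__()
        assert n_heads % n_kv_groups == 0
        self.n_heads, self.n_kv_groups = n_heads, n_kv_groups
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # K/V are projected once per group, so the KV cache is n_heads / n_kv_groups (= 4x) smaller.
        self.k_proj = nn.Linear(dim, n_kv_groups * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_groups * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x, attn_mask=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_groups, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_groups, self.head_dim).transpose(1, 2)
        # Broadcast each KV group across the 4 query heads that share it.
        rep = self.n_heads // self.n_kv_groups
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # non-causal (BERT-style)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```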
Positional Encoding
- Rotary Position Embedding (RoPE) replaces traditional learned position embeddings
- Provides better length extrapolation and relative position understanding
- Configured for sequences up to 1024 tokens (see the RoPE sketch after this list)
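Below is a hedged sketch of the standard interleaved RoPE formulation for reference; the function names and the interleaving convention are assumptions and may not match the repo's implementation.

```python
import torch

def rope_cache(seq_len, head_dim, base=10000.0, device=None):
    # Per-pair rotation frequencies; positions beyond the training length reuse the same formula,
    # which is what lets RoPE extrapolate without learned position embeddings.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate each (even, odd) dimension pair by a position-dependent angle.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Usage (shapes only): cos, sin = rope_cache(seq_len, head_dim); q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
```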
Attention Mechanism
- Sliding window attention with configurable window size (default: 16 tokens)
- Global attention tokens that can attend to and be attended by all positions
- Combines local efficiency with global context modeling
- Vectorized attention mask computation for efficiency (see the mask sketch after this list)
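A possible vectorized construction of the combined sliding-window + global-token mask is sketched below. The choice of global indices and the boolean convention (True = attention allowed, as expected by PyTorch's scaled_dot_product_attention) are assumptions, not necessarily RougeBERT's exact semantics.

```python
import torch

def build_attention_mask(seq_len, window=16, global_idx=(0,), device=None):
    pos = torch.arange(seq_len, device=device)
    # Local band: each token attends to neighbours within +/- window (bidirectional, BERT-style).
    local = (pos[None, :] - pos[:, None]).abs() <= window
    is_global = torch.zeros(seq_len, dtype=torch.bool, device=device)
    is_global[list(global_idx)] = True
    # Global tokens attend to all positions (rows) and are attended to by all positions (columns).
    allowed = local | is_global[:, None] | is_global[None, :]
    return allowed  # (seq_len, seq_len) boolean mask; True = attention allowed
```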
Evaluation vs. RoBERTa on MLM (work in progress; currently 1K samples, 10 epochs, validation every 10 steps)
Dataset & Preprocessing
- Data: 1,000 SMILES molecular representations from sample_1k_smi_42.csv, a 1K-molecule sample drawn from a combined curated dataset built from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023)
- Split: 70% train / 15% validation / 15% test (stratified random split, seed=42); see the split sketch after this list
- Tokenization: FastChemTokenizer with a maximum sequence length of 512 tokens
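For reference, a minimal sketch of the 70/15/15 split with seed 42 described above. The CSV path and column layout are assumed from the description, and since the stratification key is not stated here, this plain random split is only an approximation of the actual preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample_1k_smi_42.csv")                 # 1,000 SMILES strings (column layout assumed)
train_df, rest = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.50, random_state=42)
print(len(train_df), len(val_df), len(test_df))          # 700 / 150 / 150
```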
Training Setup
- Task: Masked Language Modeling (MLM) with 15% token masking probability (see the masking sketch after this list)
- Architecture Comparison: RougeBERT (hybrid model) vs RoBERTa baseline (~9M parameters each)
- Training: 10 epochs, batch size 16, gradient accumulation 4 steps, learning rate 1e-5
- Optimizer: Ranger21 with AdaBelief, warmup, and MADGrad components
- Early Stopping: Patience of 10 validation steps based on validation loss
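The masking sketch referenced above: a minimal 15% MLM masking routine. The document only states the 15% probability, so the BERT-style 80/10/10 replacement scheme, the function name, and the special-token handling are assumptions rather than the repo's actual code.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, p=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    prob = torch.full(labels.shape, p)
    special = torch.isin(input_ids, torch.tensor(list(special_ids)))
    prob.masked_fill_(special, 0.0)                      # never mask special tokens
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100                               # compute loss only on masked positions
    # Assumed standard BERT scheme: 80% of masked tokens -> [MASK], 10% -> random token, 10% -> unchanged.
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, labels.shape)[rand]
    return input_ids, labels
```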
Evaluation Metrics
- Perplexity: Primary metric for language modeling quality (lower is better)
- MLM Accuracy: Token-level accuracy on masked positions (see the metrics sketch after this list)
- Validation Loss: Cross-entropy loss on held-out validation set
- Evaluation Frequency: Every 10 training steps with continuous monitoring
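A small sketch of how these metrics can be computed from model outputs, assuming labels use -100 at unmasked positions; it mirrors the metric definitions above but is not necessarily the repo's evaluation code.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def mlm_metrics(logits, labels):
    # logits: (batch, seq, vocab); labels: (batch, seq) with -100 at unmasked positions.
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    mask = labels != -100
    acc = (logits.argmax(dim=-1)[mask] == labels[mask]).float().mean()
    return {"val_loss": loss.item(),
            "perplexity": math.exp(loss.item()),   # perplexity = exp(mean cross-entropy)
            "mlm_accuracy": acc.item()}
```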
Learning Curves of the Two Models on the 1K Dataset (10 Epochs)
Contributing
This project is a learning experiment, and all contributions are welcome!
- Have a better way to implement the methods?
- Want to add evaluation metrics?
- Found a bug? Please open an issue!
Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
Disclaimer
This is NOT a production model.
- Built during late-night prototyping sessions
- Not thoroughly validated or benchmarked due to compute constraints
- Some components are heuristic and unproven
- May crash, overfit, or generate nonsense (especially outside molecular data)
- I'm still learning PyTorch, attention mechanisms, and transformer internals
Use this code to learn and experiment, not to deploy.
License
MIT
COCONUTDB
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
ChEMBL34
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChEMBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
SuperNatural3
@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}
Ranger21 Optimizer
@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}