---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- genomics
- microbiome
- foundation-model
- functional-embedding
- orthology
datasets:
- jackkuo/Orthoformer
---
# Orthoformer Models
This repository contains **pre-trained Orthoformer foundation models** for **function-centric representation learning of microbial and viral genomes**.
Unlike sequence-based protein or nucleotide models, Orthoformer operates on **orthologous group composition and abundance**, treating **functional units as tokens** and learning genome-level embeddings that capture evolutionary, metabolic, and ecological signals.
The models are trained on approximately **3 million microbial and viral genomes**, encoded as functional profiles derived from orthologous gene groups.
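Concretely, "functional units as tokens" means each genome is encoded as a sequence of orthologous-group (OG) identifiers mapped to integer token IDs, analogous to words in a sentence. A minimal sketch of this encoding, with an entirely illustrative vocabulary (the real OG vocabulary and special tokens are defined by the released models):

```python
# Hypothetical OG vocabulary; the actual vocabulary ships with the models.
og_vocab = {"<pad>": 0, "<mask>": 1, "COG0001": 2, "COG0002": 3, "COG0003": 4}

def encode_genome(og_profile, vocab, max_len=2048):
    """Map a genome's OG composition to a fixed-length sequence of token IDs."""
    ids = [vocab[og] for og in og_profile if og in vocab]
    ids = ids[:max_len]                              # truncate long genomes
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # pad short ones
    return ids

tokens = encode_genome(["COG0003", "COG0001", "COG0002"], og_vocab, max_len=8)
# tokens -> [4, 2, 3, 0, 0, 0, 0, 0]
```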
---
## 🧬 Model Families
All Orthoformer models learn a **functional embedding space** that supports:
- Alignment-free phylogeny and taxonomy
- Functional convergence and divergence
- Metabolic and biosynthetic capacity prediction
- Genome-level phenotype inference
---
## πŸ“¦ Available Models
### 🧠 Foundation Models
| Model | Training Genomes | Max Length | Hidden | Layers | Heads | Description |
|------|-----------------|-----------|--------|--------|-------|-------------|
| `model_3M_2048_v8` | 3M | 2048 | 512 | 6 | 8 | Base Orthoformer foundation model |
| `model_3M_2048_v10` | 3M | 2048 | 1024 | 12 | 16 | Large Orthoformer foundation model |
| `model_140k_2048_v18` | 140k | 2048 | 512 | 6 | 8 | Compact foundation model |
All foundation models use:
- **ALiBi positional encoding**: enables long-context modeling across variable-length microbial genomes, preserving functional relationships between orthologous groups.
- **Span-masked language modeling (span-MLM, span=3)**: 15% of OG tokens are masked or corrupted following a BERT-style scheme, allowing the model to learn co-occurrence patterns, functional modules, and evolutionary dependencies in a self-supervised manner.
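As a rough illustration of these two ingredients, the sketch below computes ALiBi head slopes (the standard geometric sequence for a power-of-two head count) and applies a simplified span-masking scheme. It is a conceptual sketch only; the actual training code may differ, and BERT's full 80/10/10 mask/random/keep corruption is omitted here.

```python
import random

def alibi_slopes(num_heads):
    """ALiBi per-head slopes: geometric sequence starting at 2^(-8/num_heads)."""
    ratio = 2 ** (-8.0 / num_heads)
    return [ratio ** (i + 1) for i in range(num_heads)]

def span_mask(token_ids, mask_id, mask_rate=0.15, span=3, seed=0):
    """Simplified span masking: draw span starts until ~mask_rate of tokens
    are covered, then replace every covered token with the mask ID."""
    rng = random.Random(seed)
    n = len(token_ids)
    budget = max(1, int(n * mask_rate))
    covered = set()
    while len(covered) < budget:
        start = rng.randrange(n)
        covered.update(range(start, min(start + span, n)))
    masked = [mask_id if i in covered else t for i, t in enumerate(token_ids)]
    return masked, sorted(covered)

slopes = alibi_slopes(8)                  # 0.5, 0.25, ..., 2**-8
ids = list(range(100, 120))               # 20 dummy OG token IDs
masked, positions = span_mask(ids, mask_id=1)
```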
---
### 🎯 Task-Specific Models
| Model | Task | Initialized From |
|------|------|------------------|
| `Orthoformer_CRISPR_model` | CRISPR-associated genome prediction | `model_3M_2048_v10` |
| `BGC_abundance_regression_model` | Biosynthetic gene cluster abundance | `model_3M_2048_v10` |
These models adapt the foundation embeddings to **organism-level functional phenotypes**.
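The adaptation step is conceptually simple: pool the foundation model's per-token (per-OG) embeddings into a single genome vector, then attach a small task head. The NumPy sketch below illustrates the idea; the pooling strategy, toy dimensions, and linear head are assumptions for illustration, not the repository's actual fine-tuning code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling: collapse per-OG embeddings into one genome vector."""
    mask = attention_mask[..., None].astype(float)
    return (token_embeddings * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)

def phenotype_score(genome_vec, weights, bias):
    """Linear task head on the pooled genome embedding (e.g. abundance regression)."""
    return genome_vec @ weights + bias

# Toy dimensions: batch of 2 genomes, 3 OG tokens each, hidden size 4.
emb = np.ones((2, 3, 4))
mask = np.ones((2, 3))
pooled = mean_pool(emb, mask)                            # shape (2, 4)
score = phenotype_score(pooled, np.zeros((4, 1)), 1.0)   # shape (2, 1)
```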
## Download Methods
### Method 1: Using Hugging Face CLI
```bash
# Install huggingface-hub
pip install huggingface-hub
# Download entire model repository
huggingface-cli download jackkuo/Orthoformer --local-dir ./model
# Or download a specific model via include patterns
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v8/*" --local-dir ./model
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v10/*" --local-dir ./model
```
### Method 2: Using Python Code
```python
from huggingface_hub import snapshot_download

# Download entire model repository
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    local_dir="./model",
    local_dir_use_symlinks=False,
)

# Or download a specific model
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    allow_patterns="model_3M_2048_v8/*",
    local_dir="./model",
    local_dir_use_symlinks=False,
)
```
### Method 3: Using Git LFS
```bash
# Recommended for large model files
git lfs install
# Optional: enable Xet storage support (skipped if git-xet is not installed)
git xet install || true
git clone https://huggingface.co/jackkuo/Orthoformer ./model
```
## Model Usage
After downloading the models, you can use [`feature_extraction_example.py`](https://github.com/JackKuo666/Orthoformer/blob/main/foundation_model/feature_extraction_example.py) to load and use the models:
```bash
# Using model_3M_2048_v8 (ALiBi positional encoding)
python feature_extraction_example.py --model_dir model/model_3M_2048_v8 --use_alibi
```
---
## πŸ“œ License
These models are released under the **MIT License**.
---
## πŸ“– Citation
If you use these models, please cite:
```bibtex
@dataset{xxx,
title = {Orthoformer: xxx},
author = {xxx},
year = {2025},
}
```
---
## πŸ”— Related Resources
* **Datasets**: [https://huggingface.co/datasets/jackkuo/Orthoformer](https://huggingface.co/datasets/jackkuo/Orthoformer)
* **Code**: [https://github.com/JackKuo666/Orthoformer](https://github.com/JackKuo666/Orthoformer)
---
## Notes
- Model files are large; make sure you have sufficient disk space before downloading.
- Download speed depends on your network connection; a stable connection is recommended.
- If a download is interrupted, re-run the download command; the tool resumes automatically.