---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- genomics
- microbiome
- foundation-model
- functional-embedding
- orthology
datasets:
- jackkuo/Orthoformer
---

# Orthoformer Models

This repository contains **pre-trained Orthoformer foundation models** for **function-centric representation learning of microbial and viral genomes**.

Unlike sequence-based protein or nucleotide models, Orthoformer operates on **orthologous group composition and abundance**, treating **functional units as tokens** and learning genome-level embeddings that capture evolutionary, metabolic, and ecological signals.

The models are trained on approximately **3 million microbial and viral genomes**, encoded as functional profiles derived from orthologous gene groups.
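As a rough illustration of what a functional profile could look like as model input, the sketch below turns a genome's orthologous-group (OG) counts into a fixed-length token sequence. The vocabulary, the OG identifiers, and the abundance-by-repetition scheme are all assumptions made for this example; the released tokenization may differ.

```python
# Hypothetical sketch: encode a genome's orthologous-group (OG) profile
# as a token sequence. Vocabulary and OG IDs are invented for illustration.

def encode_profile(og_counts, vocab, max_length=2048):
    """Repeat each OG token according to its copy number (a simple way to
    encode abundance), then truncate and pad to the model context length."""
    tokens = []
    for og, count in sorted(og_counts.items()):
        tokens.extend([vocab[og]] * count)
    tokens = tokens[:max_length]
    pad_id = vocab["<pad>"]
    return tokens + [pad_id] * (max_length - len(tokens))

vocab = {"<pad>": 0, "COG0001": 1, "COG0002": 2, "COG0003": 3}
ids = encode_profile({"COG0001": 2, "COG0003": 1}, vocab, max_length=8)
print(ids)  # [1, 1, 3, 0, 0, 0, 0, 0]
```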

---

## 🧬 Model Families

All Orthoformer models learn a **functional embedding space** that supports:

- Alignment-free phylogeny and taxonomy
- Functional convergence and divergence
- Metabolic and biosynthetic capacity prediction
- Genome-level phenotype inference

---

## 📦 Available Models

### 🧠 Foundation Models

| Model | Training Genomes | Max Length | Hidden | Layers | Heads | Description |
|------|-----------------|-----------|--------|--------|-------|-------------|
| `model_3M_2048_v8` | 3M | 2048 | 512 | 6 | 8 | Base Orthoformer foundation model |
| `model_3M_2048_v10` | 3M | 2048 | 1024 | 12 | 16 | Large Orthoformer foundation model |
| `model_140k_2048_v18` | 140k | 2048 | 512 | 6 | 8 | Compact foundation model |

All foundation models use:

- **ALiBi positional encoding**: enables long-context modeling across variable-length microbial genomes, preserving functional relationships between orthologous groups.
- **Span-masked language modeling (span-MLM, span=3)**: 15% of OG tokens are masked or corrupted following a BERT-style scheme, allowing the model to learn co-occurrence patterns, functional modules, and evolutionary dependencies in a self-supervised manner.
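The span-masking objective can be sketched as follows. This is a simplified illustration only: it replaces every selected position with a mask token and omits the BERT-style mask/random/keep corruption mixture that the actual training scheme uses.

```python
import random

def span_mask(tokens, mask_id, mask_prob=0.15, span=3, seed=0):
    """Mask contiguous spans of `span` tokens until roughly `mask_prob`
    of the sequence is covered. Simplified sketch: every selected
    position becomes mask_id (no random/keep mixture)."""
    rng = random.Random(seed)
    out = list(tokens)
    target = max(span, int(len(out) * mask_prob))
    masked = set()
    while len(masked) < target:
        start = rng.randrange(0, len(out) - span + 1)
        masked.update(range(start, start + span))
    for i in masked:
        out[i] = mask_id
    return out, sorted(masked)

# 20-token toy sequence; ~15% coverage means one span of 3 is masked
corrupted, positions = span_mask(list(range(1, 21)), mask_id=0)
```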

---

### 🎯 Task-Specific Models

| Model | Task | Initialized From |
|------|------|------------------|
| `Orthoformer_CRISPR_model` | CRISPR-associated genome prediction | `model_3M_2048_v10` |
| `BGC_abundance_regression_model` | Biosynthetic gene cluster abundance | `model_3M_2048_v10` |

These models adapt the foundation embeddings to **organism-level functional phenotypes**.
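One plausible shape for such an adaptation, shown as a hedged sketch rather than the released architecture: mean-pool the foundation model's per-token embeddings into a single genome vector, then attach a small task head (here, a linear regressor, as one might use for BGC abundance). The class name, hidden size, and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PhenotypeRegressionHead(nn.Module):
    """Illustrative task head: mean-pool per-token embeddings into one
    genome-level vector, then regress a scalar phenotype."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        self.regressor = nn.Linear(hidden_size, 1)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
        # Masked mean pooling over the sequence dimension
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.regressor(pooled).squeeze(-1)  # (batch,)

head = PhenotypeRegressionHead(hidden_size=1024)
preds = head(torch.randn(2, 2048, 1024), torch.ones(2, 2048))
```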

---

## Download Methods

### Method 1: Using Hugging Face CLI

```bash
# Install huggingface-hub
pip install huggingface-hub

# Download the entire model repository
huggingface-cli download jackkuo/Orthoformer --local-dir ./model

# Or download a specific model via an include pattern
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v8/*" --local-dir ./model
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v10/*" --local-dir ./model
```

### Method 2: Using Python Code

```python
from huggingface_hub import snapshot_download

# Download the entire model repository
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    local_dir="./model",
    local_dir_use_symlinks=False
)

# Or download a specific model
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    allow_patterns="model_3M_2048_v8/*",
    local_dir="./model",
    local_dir_use_symlinks=False
)
```

### Method 3: Using Git LFS

```bash
# Recommended for large model files
git lfs install
# Optionally enable Xet-backed transfers if the git-xet extension is available
git xet install || true
git clone https://huggingface.co/jackkuo/Orthoformer ./model
```

## Model Usage

After downloading the models, you can use [`feature_extraction_example.py`](https://github.com/JackKuo666/Orthoformer/blob/main/foundation_model/feature_extraction_example.py) to load and use the models:

```bash
# Using model_3M_2048_v8 (ALiBi positional encoding)
python feature_extraction_example.py --model_dir model/model_3M_2048_v8 --use_alibi
```

---

## 📄 License

These models are released under the **MIT License**.

---

## 📚 Citation

If you use these models, please cite:

```bibtex
@dataset{xxx,
  title = {Orthoformer: xxx},
  author = {xxx},
  year = {2025},
}
```

---

## 🔗 Related Resources

* **Datasets**: [https://huggingface.co/datasets/jackkuo/Orthoformer](https://huggingface.co/datasets/jackkuo/Orthoformer)
* **Code**: [https://github.com/JackKuo666/Orthoformer](https://github.com/JackKuo666/Orthoformer)

---

## Notes

- Model files are large; make sure you have sufficient disk space before downloading.
- Download speed depends on your network connection; a stable connection is recommended.
- If a download is interrupted, simply re-run the command; the tool resumes automatically.