---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- genomics
- microbiome
- foundation-model
- functional-embedding
- orthology
datasets:
- jackkuo/Orthoformer
---

# Orthoformer Models

This repository contains **pre-trained Orthoformer foundation models** for **function-centric representation learning of microbial and viral genomes**.

Unlike sequence-based protein or nucleotide models, Orthoformer operates on **orthologous group composition and abundance**, treating **functional units as tokens** and learning genome-level embeddings that capture evolutionary, metabolic, and ecological signals.

The models are trained on approximately **3 million microbial and viral genomes**, encoded as functional profiles derived from orthologous gene groups.

---

## 🧬 Model Families

All Orthoformer models learn a **functional embedding space** that supports:

- Alignment-free phylogeny and taxonomy
- Functional convergence and divergence analysis
- Metabolic and biosynthetic capacity prediction
- Genome-level phenotype inference

---

## 📦 Available Models

### 🧠 Foundation Models

| Model | Training Genomes | Max Length | Hidden | Layers | Heads | Description |
|------|-----------------|-----------|--------|--------|-------|-------------|
| `model_3M_2048_v8` | 3M | 2048 | 512 | 6 | 8 | Base Orthoformer foundation model |
| `model_3M_2048_v10` | 3M | 2048 | 1024 | 12 | 16 | Large Orthoformer foundation model |
| `model_140k_2048_v18` | 140k | 2048 | 512 | 6 | 8 | Compact foundation model |

All foundation models use:

- **ALiBi positional encoding**: enables long-context modeling across variable-length microbial genomes, preserving functional relationships between orthologous groups.
- **Span-masked language modeling (span-MLM, span = 3)**: 15% of OG tokens are masked or corrupted following a BERT-style scheme, allowing the model to learn co-occurrence patterns, functional modules, and evolutionary dependencies in a self-supervised manner.
---

### 🎯 Task-Specific Models

| Model | Task | Initialized From |
|------|------|------------------|
| `Orthoformer_CRISPR_model` | CRISPR-associated genome prediction | `model_3M_2048_v10` |
| `BGC_abundance_regression_model` | Biosynthetic gene cluster abundance | `model_3M_2048_v10` |

These models adapt the foundation embeddings to **organism-level functional phenotypes**.

## Download Methods

### Method 1: Using Hugging Face CLI

```bash
# Install huggingface-hub
pip install huggingface-hub

# Download the entire model repository
huggingface-cli download jackkuo/Orthoformer --local-dir ./model

# Or download a specific model (the CLI takes a repo id plus an --include glob,
# not a repo/subfolder path)
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v8/*" --local-dir ./model
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v10/*" --local-dir ./model
```

### Method 2: Using Python Code

```python
from huggingface_hub import snapshot_download

# Download the entire model repository
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    local_dir="./model",
)

# Or download a specific model
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    allow_patterns="model_3M_2048_v8/*",
    local_dir="./model",
)
```

### Method 3: Using Git LFS

```bash
# Recommended for large model files
git lfs install
git xet install || true  # optional: enables Xet support if git-xet is installed
git clone https://huggingface.co/jackkuo/Orthoformer ./model
```

## Model Usage

After downloading the models, you can use [`feature_extraction_example.py`](https://github.com/JackKuo666/Orthoformer/blob/main/foundation_model/feature_extraction_example.py) to load the models and extract embeddings:

```bash
# Using model_3M_2048_v8 (ALiBi positional encoding)
python feature_extraction_example.py --model_dir model/model_3M_2048_v8 --use_alibi
```

---

## 📜 License

This model repository is released under the **MIT License**.
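The `allow_patterns` argument and the CLI `--include` flag shown under Download Methods both take shell-style globs. As an offline illustration (the file names below are hypothetical; the real repository layout may differ), the selection behaves like Python's `fnmatch`:

```python
from fnmatch import fnmatch

# Hypothetical repo file listing -- actual file names may differ.
files = [
    "model_3M_2048_v8/config.json",
    "model_3M_2048_v8/pytorch_model.bin",
    "model_3M_2048_v10/config.json",
    "README.md",
]

pattern = "model_3M_2048_v8/*"
selected = [f for f in files if fnmatch(f, pattern)]
print(selected)  # only files under model_3M_2048_v8/ are downloaded
```

Restricting the pattern to one model directory avoids downloading the full multi-model repository when you only need a single checkpoint.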
---

## 📖 Citation

If you use these models, please cite:

```bibtex
@dataset{xxx,
  title  = {Orthoformer: xxx},
  author = {xxx},
  year   = {2025},
}
```

---

## 🔗 Related Resources

* **Datasets**: [https://huggingface.co/datasets/jackkuo/Orthoformer](https://huggingface.co/datasets/jackkuo/Orthoformer)
* **Code**: [https://github.com/JackKuo666/Orthoformer](https://github.com/JackKuo666/Orthoformer)

---

## Notes

- Model files are large; make sure you have sufficient disk space before downloading.
- Download speed depends on your network connection; a stable connection is recommended.
- If a download is interrupted, simply re-run the download command; the tool will automatically resume from where it left off.