---
license: mit
language:
- en
pipeline_tag: feature-extraction
tags:
- genomics
- microbiome
- foundation-model
- functional-embedding
- orthology
datasets:
- jackkuo/Orthoformer
---
# Orthoformer Models
This repository contains **pre-trained Orthoformer foundation models** for **function-centric representation learning of microbial and viral genomes**.
Unlike sequence-based protein or nucleotide models, Orthoformer operates on **orthologous group composition and abundance**, treating **functional units as tokens** and learning genome-level embeddings that capture evolutionary, metabolic, and ecological signals.
The models are trained on approximately **3 million microbial and viral genomes**, encoded as functional profiles derived from orthologous gene groups.
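Concretely, "functional units as tokens" means each genome is encoded as a sequence of orthologous-group (OG) identifiers mapped to integer token IDs, analogous to words in a sentence. A minimal sketch of this encoding, with an entirely illustrative vocabulary (the real OG vocabulary and special tokens are defined by the released models):

```python
# Hypothetical OG vocabulary; the actual vocabulary ships with the models.
og_vocab = {"<pad>": 0, "<mask>": 1, "COG0001": 2, "COG0002": 3, "COG0003": 4}

def encode_genome(og_profile, vocab, max_len=2048):
    """Map a genome's OG composition to a fixed-length sequence of token IDs."""
    ids = [vocab[og] for og in og_profile if og in vocab]
    ids = ids[:max_len]                              # truncate long genomes
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # pad short ones
    return ids

tokens = encode_genome(["COG0003", "COG0001", "COG0002"], og_vocab, max_len=8)
# tokens -> [4, 2, 3, 0, 0, 0, 0, 0]
```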
---
## 🧬 Model Families
All Orthoformer models learn a **functional embedding space** that supports:
- Alignment-free phylogeny and taxonomy
- Functional convergence and divergence
- Metabolic and biosynthetic capacity prediction
- Genome-level phenotype inference
---
## πŸ“¦ Available Models
### 🧠 Foundation Models
| Model | Training Genomes | Max Length | Hidden | Layers | Heads | Description |
|------|-----------------|-----------|--------|--------|-------|-------------|
| `model_3M_2048_v8` | 3M | 2048 | 512 | 6 | 8 | Base Orthoformer foundation model |
| `model_3M_2048_v10` | 3M | 2048 | 1024 | 12 | 16 | Large Orthoformer foundation model |
| `model_140k_2048_v18` | 140k | 2048 | 512 | 6 | 8 | Compact foundation model |
All foundation models use:
- **ALiBi positional encoding**: enables long-context modeling across variable-length microbial genomes, preserving functional relationships between orthologous groups.
- **Span-masked language modeling (span-MLM, span=3)**: 15% of OG tokens are masked or corrupted following a BERT-style scheme, allowing the model to learn co-occurrence patterns, functional modules, and evolutionary dependencies in a self-supervised manner.
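As a rough illustration of these two ingredients, the sketch below computes ALiBi head slopes (the standard geometric sequence for a power-of-two head count) and applies a simplified span-masking scheme. It is a conceptual sketch only; the actual training code may differ, and BERT's full 80/10/10 mask/random/keep corruption is omitted here.

```python
import random

def alibi_slopes(num_heads):
    """ALiBi per-head slopes: geometric sequence starting at 2^(-8/num_heads)."""
    ratio = 2 ** (-8.0 / num_heads)
    return [ratio ** (i + 1) for i in range(num_heads)]

def span_mask(token_ids, mask_id, mask_rate=0.15, span=3, seed=0):
    """Simplified span masking: draw span starts until ~mask_rate of tokens
    are covered, then replace every covered token with the mask ID."""
    rng = random.Random(seed)
    n = len(token_ids)
    budget = max(1, int(n * mask_rate))
    covered = set()
    while len(covered) < budget:
        start = rng.randrange(n)
        covered.update(range(start, min(start + span, n)))
    masked = [mask_id if i in covered else t for i, t in enumerate(token_ids)]
    return masked, sorted(covered)

slopes = alibi_slopes(8)                  # 0.5, 0.25, ..., 2**-8
ids = list(range(100, 120))               # 20 dummy OG token IDs
masked, positions = span_mask(ids, mask_id=1)
```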
---
### 🎯 Task-Specific Models
| Model | Task | Initialized From |
|------|------|------------------|
| `Orthoformer_CRISPR_model` | CRISPR-associated genome prediction | `model_3M_2048_v10` |
| `BGC_abundance_regression_model` | Biosynthetic gene cluster abundance | `model_3M_2048_v10` |
These models adapt the foundation embeddings to **organism-level functional phenotypes**.
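The adaptation step is conceptually simple: pool the foundation model's per-token (per-OG) embeddings into a single genome vector, then attach a small task head. The NumPy sketch below illustrates the idea; the pooling strategy, toy dimensions, and linear head are assumptions for illustration, not the repository's actual fine-tuning code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling: collapse per-OG embeddings into one genome vector."""
    mask = attention_mask[..., None].astype(float)
    return (token_embeddings * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)

def phenotype_score(genome_vec, weights, bias):
    """Linear task head on the pooled genome embedding (e.g. abundance regression)."""
    return genome_vec @ weights + bias

# Toy dimensions: batch of 2 genomes, 3 OG tokens each, hidden size 4.
emb = np.ones((2, 3, 4))
mask = np.ones((2, 3))
pooled = mean_pool(emb, mask)                            # shape (2, 4)
score = phenotype_score(pooled, np.zeros((4, 1)), 1.0)   # shape (2, 1)
```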
## Download Methods
### Method 1: Using Hugging Face CLI
```bash
# Install huggingface-hub
pip install huggingface-hub
# Download entire model repository
huggingface-cli download jackkuo/Orthoformer --local-dir ./model
# Or download a specific model via include patterns
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v8/*" --local-dir ./model
huggingface-cli download jackkuo/Orthoformer --include "model_3M_2048_v10/*" --local-dir ./model
```
### Method 2: Using Python Code
```python
from huggingface_hub import snapshot_download

# Download entire model repository
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    local_dir="./model",
    local_dir_use_symlinks=False,
)

# Or download a specific model
snapshot_download(
    repo_id="jackkuo/Orthoformer",
    allow_patterns="model_3M_2048_v8/*",
    local_dir="./model",
    local_dir_use_symlinks=False,
)
```
### Method 3: Using Git LFS
```bash
# Recommended for large model files
git lfs install
# Optional: enable Xet storage support (skipped if git-xet is not installed)
git xet install || true
git clone https://huggingface.co/jackkuo/Orthoformer ./model
```
## Model Usage
After downloading the models, you can use [`feature_extraction_example.py`](https://github.com/JackKuo666/Orthoformer/blob/main/foundation_model/feature_extraction_example.py) to load and use the models:
```bash
# Using model_3M_2048_v8 (ALiBi positional encoding)
python feature_extraction_example.py --model_dir model/model_3M_2048_v8 --use_alibi
```
---
## πŸ“œ License
These models are released under the **MIT License**.
---
## πŸ“– Citation
If you use these models, please cite:
```bibtex
@dataset{xxx,
title = {Orthoformer: xxx},
author = {xxx},
year = {2025},
}
```
---
## πŸ”— Related Resources
* **Datasets**: [https://huggingface.co/datasets/jackkuo/Orthoformer](https://huggingface.co/datasets/jackkuo/Orthoformer)
* **Code**: [https://github.com/JackKuo666/Orthoformer](https://github.com/JackKuo666/Orthoformer)
---
## Notes
- Model files are large; make sure you have sufficient disk space before downloading.
- Download speed depends on your network connection; a stable connection is recommended.
- If a download is interrupted, re-run the download command; the tool resumes automatically.