HViLM-base: A Foundation Model for Viral Genomics
Model Description
HViLM (Human Virome Language Model) is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built by continued pre-training of DNABERT-2 on 5 million viral sequences derived from the VIRION database, HViLM captures universal viral genomic patterns relevant to human disease risk assessment.
Paper: HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism (RECOMB 2026)
Authors: Pratik Dutta, Jack Vaska, Pallavi Surana, Rekha Sathian, Max Chao, Zhihan Zhou, Han Liu, and Ramana V. Davuluri
Code & Benchmarks: GitHub repository (https://github.com/duttaprat/HViLM)
Key Features
- 🦠 Viral-specialized pre-training on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 Multi-task predictions across 3 epidemiologically critical tasks:
- Pathogenicity classification: 95.32% average accuracy
- Host tropism prediction: 96.25% accuracy
- Transmissibility assessment: 97.36% average accuracy
- 📊 HVUE Benchmark: 7 curated datasets totaling 60K+ viral sequences
- 🔍 Mechanistic interpretability: Identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚡ Parameter-efficient fine-tuning: LoRA adaptation (~0.3M trainable parameters per task)
- 🚀 State-of-the-art performance: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
Model Architecture
HViLM is built upon DNABERT-2 (117M parameters), which uses the MosaicBERT architecture with the following configuration (a quick programmatic check is sketched after this list):
- Tokenization: Byte Pair Encoding (BPE) with vocabulary size 4,096
- Max sequence length: 1,000 base pairs
- Hidden size: 768
- Attention heads: 12
- Layers: 12
- Positional encoding: Attention with Linear Biases (ALiBi)
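These settings can be checked directly from the published checkpoint. The snippet below is a minimal sketch and assumes the custom config exposes the standard Hugging Face BERT attribute names; the MosaicBERT remote code may name them differently.

```python
from transformers import AutoConfig, AutoTokenizer

# Load the config and tokenizer shipped with the checkpoint
config = AutoConfig.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Attribute names assume a standard BERT-style config (an assumption, not guaranteed
# by the custom remote code).
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(config.num_hidden_layers)    # expected: 12
print(tokenizer.vocab_size)        # expected: 4096 (BPE vocabulary)
```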
Continued pre-training:
- Objective: Masked Language Modeling (MLM)
- Training data: 5M viral sequence chunks (non-overlapping, 1000 bp)
- Data source: VIRION database (clustered at 80% identity with MMseqs2)
- Training: 10 epochs, AdamW optimizer, learning rate 5e-5
- Hardware: 4x NVIDIA A100 GPUs (72 hours)
- Performance: 94.2% MLM accuracy on the validation set (the sketch after this list shows one way to measure masked-token recovery)
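The following is a minimal illustration of the MLM objective on a single chunk. It assumes the remote code registers a masked-LM head under `AutoModelForMaskedLM`, that the checkpoint ships those weights, and that the tokenizer defines a `[MASK]` token; if any of these assumptions fail, use the pre-training scripts on GitHub instead.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Sketch of masked-token recovery on one chunk (assumes an AutoModelForMaskedLM mapping
# exists in the remote code and outputs follow the standard MaskedLMOutput interface).
tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
mlm_model.eval()

chunk = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"  # stand-in for a 1000 bp chunk
enc = tokenizer(chunk, return_tensors="pt")
labels = enc["input_ids"].clone()

# Mask ~15% of non-special tokens, as in standard BERT-style MLM
special = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
).bool()
candidates = (~special).nonzero(as_tuple=True)[0]
masked = candidates[torch.rand(len(candidates)) < 0.15]
if len(masked) == 0:
    masked = candidates[:1]  # make sure at least one token is masked
enc["input_ids"][0, masked] = tokenizer.mask_token_id

with torch.no_grad():
    logits = mlm_model(**enc).logits
preds = logits.argmax(dim=-1)
recovery = (preds[0, masked] == labels[0, masked]).float().mean()
print(f"Masked-token recovery on this chunk: {recovery:.2%}")
```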
Installation
```bash
pip install transformers torch
# peft and datasets are needed for the fine-tuning and benchmark examples below
pip install peft datasets
```
Quick Start
Basic Usage: Extract Sequence Embeddings
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True,  # Required for custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True,
)

# Example: Get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True,
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for a sequence-level representation
attention_mask = inputs["attention_mask"]
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask
print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```
Fine-tuning on Your Own Task
For fine-tuning HViLM on custom viral classification tasks, please refer to the GitHub repository for complete training scripts and examples.
```python
# Example fine-tuning setup (see GitHub for complete code)
from transformers import AutoModel, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                # rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention layers
    lora_dropout=0.1,
    bias="none",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Add a classification head and train (see GitHub for details)
```
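The classification head itself is not shown above. Below is a minimal sketch of one possible setup, not the repository's actual implementation: `HViLMClassifier` is a hypothetical name that simply mean-pools the encoder output before a linear layer. The `print_trainable_parameters()` call is the standard PEFT way to check the ~0.3M trainable-parameter figure quoted earlier. If LoRA reports that the target modules are not found, the names in `target_modules` must be adjusted to match the attention projection names used by the remote code.

```python
import torch
import torch.nn as nn

# Confirm how many parameters LoRA leaves trainable (standard PEFT utility)
model.print_trainable_parameters()

# Hypothetical classification head: mean-pool the encoder output, then classify.
# This is a sketch only; the repository's fine-tuning scripts are authoritative.
class HViLMClassifier(nn.Module):
    def __init__(self, encoder, num_labels=2, hidden_size=768):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, labels=None):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs.last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        logits = self.classifier(pooled)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return {"loss": loss, "logits": logits}

classifier = HViLMClassifier(model, num_labels=2)  # `model` is the LoRA-wrapped encoder above
```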
Performance on HVUE Benchmark
Pathogenicity Classification
| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---|---|---|---|---|
| CINI | 159 | 87.74% | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | 98.26% | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | 99.95% | 99.93 | 99.90 |
| Average | 49,314 | 95.32% | 95.06 | 90.30 |
Host Tropism Prediction
| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---|---|---|---|---|
| VHDB | 9,428 | 96.25% | 91.34 | 91.24 |
Transmissibility Assessment (R₀-based Classification)
| Viral Family | Sequences | Accuracy | F1-Score | MCC |
|---|---|---|---|---|
| Coronaviridae | ~3,000 | 97.45% | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | 95.62% | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | 99.95% | 99.95 | 99.90 |
| Average | ~7,300 | 97.36% | 97.59 | 94.80 |
Comparison with baselines: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks.
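For reference, the metrics reported in these tables (accuracy, F1, MCC) can be computed with scikit-learn (`pip install scikit-learn`). The snippet below uses toy labels purely for illustration; how the paper averages F1 across classes is an assumption here.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy labels for illustration; the tables above are computed on held-out HVUE test splits.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"F1-Score: {f1_score(y_true, y_pred):.4f}")   # binary F1 here; the paper's averaging may differ
print(f"MCC:      {matthews_corrcoef(y_true, y_pred):.4f}")
```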
Interpretability: Transcription Factor Mimicry
HViLM's attention patterns reveal biologically meaningful pathogenicity determinants that operate through molecular mimicry of host regulatory elements:
- 42 conserved motifs identified in high-attention regions of pathogenic coronaviruses
- 10 vertebrate transcription factors targeted, including:
- Irf1 (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
- Foxq1: Multiple motifs for epithelial cell tropism
- ZNF354A: 6 motifs for chromatin regulation
This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.
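A rough sketch of how high-attention regions might be extracted for downstream motif scanning is shown below. It assumes the remote code accepts `output_attentions=True`; if it does not (for example, when fused attention kernels are used), the attention maps would have to be captured with forward hooks instead, and the motif scan itself would be done with an external tool such as FIMO against JASPAR profiles. This illustrates the general approach, not the paper's exact interpretability pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustration: rank tokens of a sequence by the attention mass they receive in the last
# layer, as a starting point for motif scanning. Assumes output_attentions is supported
# by the remote code.
tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

seq = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

last_layer = outputs.attentions[-1]                         # [batch, heads, seq_len, seq_len]
attention_received = last_layer.mean(dim=1)[0].sum(dim=0)   # per-token attention mass

top = torch.topk(attention_received, k=5).indices
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("High-attention tokens:", [tokens[i] for i in top])
```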
Training Data
Pre-training Corpus
- Source: VIRION database (476,242 virus-host associations)
- Genomes: 10,817,265 unique NCBI accession numbers
- Processing:
- Segmented into non-overlapping 1000 bp chunks (see the sketch after this list)
- Clustered with MMseqs2 at 80% identity threshold
- Final dataset: 5 million unique sequences
- Coverage: 45+ viral families across all Baltimore classification groups
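A minimal sketch of the chunking step is shown below; it drops trailing fragments shorter than 1000 bp, which is an assumption rather than the paper's documented choice.

```python
# Split a genome into non-overlapping 1000 bp chunks; trailing fragments shorter than
# 1000 bp are dropped here (an assumption, not necessarily the paper's exact handling).
def chunk_genome(sequence: str, chunk_size: int = 1000) -> list[str]:
    return [
        sequence[i : i + chunk_size]
        for i in range(0, len(sequence) - chunk_size + 1, chunk_size)
    ]

genome = "ACGT" * 700                          # 2,800 bp toy genome
chunks = chunk_genome(genome)
print(len(chunks), [len(c) for c in chunks])   # 2 chunks of 1,000 bp each
```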
HVUE Benchmark Datasets
The Human Virome Understanding Evaluation (HVUE) benchmark consists of 7 curated datasets:
Pathogenicity Prediction (3 datasets)
- CINI: 159 sequences, 4 viral families, manual literature curation
- BVBRC-CoV: 18,066 coronaviruses
- BVBRC-Calici: 31,089 caliciviruses
Host Tropism Prediction (1 dataset)
- VHDB: 9,428 sequences, 30 viral families
- Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%)
Transmissibility Prediction (3 datasets)
- Coronaviridae: R₀-based classification (R₀<1 vs R₀≥1)
- Orthomyxoviridae: R₀-based classification
- Caliciviridae: R₀-based classification
All datasets available at: 🤗 duttaprat/HVUE
Download and Use
```python
from datasets import load_dataset

# Load a specific task
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")

# Load a specific split
train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
```
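Each configuration loads as a standard 🤗 `datasets` object. A quick way to inspect what was downloaded is sketched below; the split and column names are not guaranteed here, so check the HVUE dataset card for the actual schema.

```python
# Inspect the loaded dataset; split names and columns depend on the HVUE dataset card.
print(host_tropism)                          # DatasetDict with its splits and row counts
first_split = list(host_tropism.keys())[0]
example = host_tropism[first_split][0]
print(example.keys())                        # sequence and label column names may vary
```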
Reproducing Paper Results
Step 1: Download HVUE Benchmark
```python
from datasets import load_dataset

# Download all three task datasets
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
```
Step 2: Fine-tune and Evaluate
To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on the CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```
For detailed instructions, see the GitHub repository.
Citation
If you use HViLM in your research, please cite our paper:

```bibtex
@article{dutta2025hvilm,
  title={HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism},
  author={Dutta, Pratik and Vaska, Jack and Surana, Pallavi and Sathian, Rekha and Chao, Max and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V.},
  journal={Submitted to RECOMB},
  year={2025},
  note={Under review}
}
```

Since HViLM builds on DNABERT-2 (the base model), please also cite:

```bibtex
@article{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  journal={ICLR},
  year={2024}
}
```
Model Card Authors
- Pratik Dutta (Senior Research Scientist, Stony Brook University)
- Ramana V. Davuluri (Professor, Stony Brook University)
Contact
- Email: [email protected]
- Lab: Davuluri Lab, Stony Brook University
- GitHub Issues: Report bugs or request features
Acknowledgments
This work builds upon DNABERT-2 by Zhou et al. Pre-training data come from the VIRION database, maintained by the Viral Emergence Research Initiative (Verena).
License
This model is released under the Apache License 2.0.
Disclaimer
HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis.