DeepTaxa: Hierarchical 16S rRNA Taxonomy Classification
DeepTaxa is a deep learning model for hierarchical taxonomy classification of 16S rRNA gene sequences. The architecture couples a convolutional branch, which captures local k-mer motifs, with a BERT-style transformer, which captures long-range context. Both branches operate over tokens produced by the DNABERT-2 byte-pair encoder. Predictions are generated jointly for all seven standard taxonomic ranks: domain, phylum, class, order, family, genus, and species.
Three model families are released here: one trained on full-length 16S sequences, one on V3-V4 amplicons, and one on the shorter V4 amplicon. As of the v2 release, each family is provided as five checkpoints trained under identical settings with different random seeds (42, 123, 456, 789, and 1011), which enables ensembling and cross-seed uncertainty estimates. A single default checkpoint is also provided for users who need only one model.
What is new in v2
- Five seeds per region. Every family now ships all five seeds (42, 123, 456, 789, 1011) as deeptaxa-<region>-v2-seed<N>.pt, so the released set is directly reproducible and suitable for ensembling. A default single model, deeptaxa-<region>-v2.pt, is provided as a copy of the seed 42 checkpoint for backward-compatible single-model use.
- Self-describing checkpoints. Each checkpoint records its own seed in a seed field, so an ensemble member can be identified from the file contents alone, independent of the filename.
- Reported cross-seed statistics. All three families now report mean and standard deviation across five seeds at every rank (previously only the full-length family reported cross-seed statistics, over three seeds, and the V4 family was single-seed).
- Same recipe as v1. The architecture, hyperparameters, training data, and in-silico amplicon extraction are identical to v1. v2 is a fresh five-seed retraining on the current codebase, not a methodological change. The v1 checkpoints remain available in this repository.
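Because each checkpoint carries its seed, an ensemble can be checked from file contents rather than filenames. The following is a minimal sketch, not the package's own API: `check_ensemble_seeds` is a hypothetical helper, and it assumes each checkpoint deserializes to a dict-like object with a top-level "seed" key (the card states only that a seed field exists, not where it sits).

```python
EXPECTED_SEEDS = (42, 123, 456, 789, 1011)  # seeds listed in this model card


def check_ensemble_seeds(checkpoints, expected=EXPECTED_SEEDS):
    """Return the sorted seeds found, raising if any is missing or duplicated.

    `checkpoints` is an iterable of dict-like objects. A top-level "seed"
    key is an assumption about the checkpoint layout.
    """
    seeds = [int(ckpt["seed"]) for ckpt in checkpoints]
    if len(set(seeds)) != len(seeds):
        raise ValueError(f"duplicate seeds in ensemble: {sorted(seeds)}")
    missing = set(expected) - set(seeds)
    if missing:
        raise ValueError(f"missing seeds: {sorted(missing)}")
    return sorted(seeds)


# Stand-in dicts for demonstration; with real files one might instead pass
# torch.load(path, map_location="cpu") for each downloaded checkpoint.
demo = [{"seed": s} for s in (1011, 42, 789, 123, 456)]
print(check_ensemble_seeds(demo))
```

A check like this catches the case where the default file (a copy of seed 42) is mixed into an ensemble alongside the seed 42 checkpoint, which would otherwise double-weight one run.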
Checkpoint selection
| Sequencing protocol | Recommended family | Default file |
|---|---|---|
| Sanger 27F/1492R, PacBio HiFi 16S, Oxford Nanopore long-read 16S, full-length reference lookup | Full-length v2 | deeptaxa-full-length-v2.pt |
| Illumina paired-end V3-V4 with 341F/805R primers | V3-V4 v2 | deeptaxa-v3v4-v2.pt |
| Illumina paired-end V4 with 515F/806R primers | V4 v2 | deeptaxa-v4-v2.pt |
For maximum accuracy and calibrated uncertainty, average the softmax probabilities of all five seed checkpoints for the chosen region (see Seeds and ensembling).
Released checkpoints
Species-level test-set performance, reported as the mean and standard deviation across the five seeds:
| Family | Training data | Species Acc | Species F1 | Species ECE | Params |
|---|---|---|---|---|---|
| Full-length v2 | 277,336 full-length 16S sequences (approximately 1,500 bp) from Greengenes2 | 92.95% +/- 0.03 | 92.08% +/- 0.04 | 0.0241 | 76.4 M |
| V3-V4 v2 | 273,003 in-silico V3-V4 extractions (approximately 420 bp) from Greengenes2 | 87.54% +/- 0.09 | 85.90% +/- 0.09 | 0.0265 | 75.8 M |
| V4 v2 | 274,509 in-silico V4 extractions (approximately 253 bp) from Greengenes2 | 82.84% +/- 0.08 | 80.19% +/- 0.07 | 0.0254 | 76.4 M |
Standard deviations are given in percentage points. The cross-seed standard deviation is at most 0.09 percentage points of species F1 in every family, indicating high reproducibility. The v2 means match the v1 single-seed and three-seed numbers to within the cross-seed spread.
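As an illustration of how a "mean +/- standard deviation" entry is formed from five per-seed scores, here is a small sketch. The per-seed values below are invented placeholders, not the released per-seed results, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the card does not state which form was used.

```python
import statistics

# Hypothetical per-seed scores in percent, made up for illustration only.
per_seed_score = {42: 90.0, 123: 90.2, 456: 89.9, 789: 90.1, 1011: 89.8}

values = list(per_seed_score.values())
mean_score = statistics.mean(values)
# Sample standard deviation; the population form (statistics.pstdev) would
# give a slightly smaller number for the same five values.
sd_score = statistics.stdev(values)

# The spread is expressed in percentage points, like the table entries.
print(f"{mean_score:.2f}% +/- {sd_score:.2f}")
```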
All checkpoints are inference-only. Optimizer, scheduler, scaler, and RNG state, along with the training/validation split, have been removed to reduce file size; resuming training from these checkpoints is not supported.
Seeds and ensembling
Each region provides five seed checkpoints and one default. The five differ only in their random seed; they share architecture, hyperparameters, and training data.
| File | Contents |
|---|---|
| deeptaxa-<region>-v2-seed42.pt ... -seed1011.pt | The five individual seeds. Each records its seed in a seed field. |
| deeptaxa-<region>-v2.pt | Default single model, identical to the seed 42 checkpoint. |
A simple and effective ensemble averages the per-rank softmax probabilities across the five seeds and then takes the argmax at each rank. Because the seeds are independent training runs of the same configuration, this reduces variance and typically improves both accuracy and calibration relative to any single seed. When only one model is needed, use the default (seed 42) checkpoint.
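The averaging step described above can be sketched in plain Python. This is an illustration of the technique only: the function names are hypothetical, and the input layout (one dict per seed, mapping a rank name to that seed's logits for a single sequence) is an assumption rather than the format the DeepTaxa package returns.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_seed_logits):
    """Average per-rank softmax probabilities across seeds, then argmax.

    Returns a dict mapping each rank to (class_index, mean_probability).
    """
    n_seeds = len(per_seed_logits)
    result = {}
    for rank in per_seed_logits[0]:
        probs = [softmax(seed_out[rank]) for seed_out in per_seed_logits]
        # zip(*probs) groups the probabilities of each class across seeds.
        mean_probs = [sum(col) / n_seeds for col in zip(*probs)]
        best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
        result[rank] = (best, mean_probs[best])
    return result


# Toy logits for two ranks and three seeds (made-up numbers).
toy = [
    {"genus": [2.0, 1.0, 0.0], "species": [0.5, 0.4]},
    {"genus": [1.5, 1.6, 0.0], "species": [0.1, 0.9]},
    {"genus": [2.2, 0.8, 0.1], "species": [0.2, 0.7]},
]
for rank, (index, prob) in ensemble_predict(toy).items():
    print(rank, index, round(prob, 3))
```

Averaging probabilities rather than taking a majority vote over per-seed argmaxes keeps a usable confidence score for each rank. In the toy genus example the second seed narrowly prefers class 1, but the averaged distribution still selects class 0.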
Architecture
The full-length, V3-V4, and V4 families share the same compact HybridCNNBERT configuration.
| Component | Full-length v2 | V3-V4 v2 | V4 v2 |
|---|---|---|---|
| tokenizer_name | zhihan1996/DNABERT-2-117M | zhihan1996/DNABERT-2-117M | zhihan1996/DNABERT-2-117M |
| max_length | 512 (tokens) | 512 (tokens) | 512 (tokens) |
| embed_dim | 896 | 896 | 896 |
| num_filters | 256 | 256 | 256 |
| kernel_sizes | [3, 5, 7] | [3, 5, 7] | [3, 5, 7] |
| num_conv_layers | 1 | 1 | 1 |
| hidden_size | 896 | 896 | 896 |
| num_hidden_layers | 4 | 4 | 4 |
| num_attention_heads | 7 | 7 | 7 |
| intermediate_size | 3584 | 3584 | 3584 |
| hidden_dropout_prob | 0.20 | 0.20 | 0.20 |
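For reference, the shared configuration can be written down as a small Python object. The field names and values are copied from the table; the class `HybridCNNBERTConfig` itself is a hypothetical container for illustration and may not match how the DeepTaxa package stores its settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridCNNBERTConfig:
    """Values from the architecture table; the class itself is hypothetical."""
    tokenizer_name: str = "zhihan1996/DNABERT-2-117M"
    max_length: int = 512
    embed_dim: int = 896
    num_filters: int = 256
    kernel_sizes: tuple = (3, 5, 7)
    num_conv_layers: int = 1
    hidden_size: int = 896
    num_hidden_layers: int = 4
    num_attention_heads: int = 7
    intermediate_size: int = 3584
    hidden_dropout_prob: float = 0.20

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits hidden_size evenly across heads.
        if self.hidden_size % self.num_attention_heads:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


cfg = HybridCNNBERTConfig()
print(cfg.head_dim)                              # 896 / 7
print(cfg.intermediate_size // cfg.hidden_size)  # feed-forward expansion factor
```

With these values each attention head works on 128 dimensions, and the feed-forward layer expands the hidden size by a factor of four.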
Test-set performance
Each family was evaluated on its respective held-out Greengenes2 2024.09 test split. Numbers below are the five-seed mean at each rank.
| Rank | Full-length Acc | Full-length F1 | V3-V4 Acc | V3-V4 F1 | V4 Acc | V4 F1 |
|---|---|---|---|---|---|---|
| Domain | 99.98% | 99.98% | 99.99% | 99.99% | 99.98% | 99.98% |
| Phylum | 99.70% | 99.68% | 99.68% | 99.67% | 99.59% | 99.58% |
| Class | 99.63% | 99.59% | 99.62% | 99.59% | 99.54% | 99.49% |
| Order | 99.07% | 98.96% | 98.96% | 98.85% | 98.75% | 98.63% |
| Family | 98.61% | 98.41% | 98.42% | 98.21% | 98.06% | 97.84% |
| Genus | 96.88% | 96.45% | 95.33% | 94.84% | 93.42% | 92.62% |
| Species | 92.95% | 92.08% | 87.54% | 85.90% | 82.84% | 80.19% |
Training configuration
| Parameter | Full-length v2 | V3-V4 v2 | V4 v2 |
|---|---|---|---|
| Training data | Greengenes2 2024.09 training set (277,336 full-length sequences, approximately 1,500 bp) | In-silico V3-V4 extractions from the same training set (273,003 amplicons) | In-silico V4 extractions from the same training set (274,509 amplicons) |
| Test data | Greengenes2 2024.09 test split (69,335 full-length sequences) | V3-V4 extractions from the test split (68,282 amplicons) | V4 extractions from the test split (68,668 amplicons) |
| Extraction primers | N/A | 341F CCTACGGGNGGCWGCAG and 805R GACTACHVGGGTATCTAATCC | 515F GTGYCAGCMGCCGCGGTAA and 806R GGACTACNVGGGTWTCTAAT |
| Label space (species) | 16,909 | 8,347 | 16,909 |
| Label space (domain / phylum / class / order / family / genus) | 2 / 129 / 349 / 997 / 2,250 / 7,287 | 2 / 115 / 270 / 709 / 1,528 / 4,529 | 2 / 129 / 349 / 997 / 2,250 / 7,287 |
| Total parameters | 76,365,205 | 75,813,550 | 76,365,205 |
| Learning rate | 5e-4 | 5e-4 | 5e-4 |
| Batch size | 64 | 64 | 64 |
| Weight decay | 1e-2 | 1e-2 | 1e-2 |
| Epochs | 10 | 10 | 10 |
| Loss | Cross-entropy with uniform per-rank weights | Cross-entropy with uniform per-rank weights | Cross-entropy with uniform per-rank weights |
| Optimizer | AdamW (beta1 = 0.9, beta2 = 0.999) | AdamW (beta1 = 0.9, beta2 = 0.999) | AdamW (beta1 = 0.9, beta2 = 0.999) |
| Learning rate schedule | Linear warm-up over 10% of steps, followed by linear decay | Linear warm-up over 10% of steps, followed by linear decay | Linear warm-up over 10% of steps, followed by linear decay |
| Seeds | 42, 123, 456, 789, 1011 | 42, 123, 456, 789, 1011 | 42, 123, 456, 789, 1011 |
| Hardware | NVIDIA A40 | NVIDIA A40 | NVIDIA A40 |
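The learning rate schedule in the table (linear warm-up over 10% of steps, then linear decay) can be sketched as a function of the step index. Several details here are assumptions not stated in the card: that warm-up starts at zero, that decay ends at zero, and that the last partial batch of each epoch is kept when counting steps. The function name is hypothetical.

```python
import math


def linear_warmup_decay(step, total_steps, peak_lr=5e-4, warmup_frac=0.10):
    """Learning rate at `step` for linear warm-up followed by linear decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp from 0 up to peak_lr (starting at 0 is an assumption).
        return peak_lr * step / warmup_steps
    # Ramp from peak_lr down to 0 at the final step (also an assumption).
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Full-length family: 277,336 training sequences, batch size 64, 10 epochs.
steps_per_epoch = math.ceil(277_336 / 64)  # assumes the partial batch is kept
total_steps = steps_per_epoch * 10

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, linear_warmup_decay(step, total_steps))
```

Under these assumptions the full-length runs take 4,334 steps per epoch and 43,340 steps in total, so the peak rate of 5e-4 is reached at the end of the first epoch.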
Usage
Download
```bash
# Default single models (seed 42) for each region
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v2.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v3v4-v2.pt
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v4-v2.pt

# All five seeds for a region (example: full-length), for ensembling
for s in 42 123 456 789 1011; do
    wget "https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v2-seed${s}.pt"
done

# Or clone the full repository
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
```
After downloading, verify each file against its SHA-256 checksum. The sums for all v2 files are listed in SHA256SUMS; those for the default models are:
| Checkpoint | SHA-256 |
|---|---|
| deeptaxa-full-length-v2.pt | bc5907a881fffe40cccaf66cf9072fb71fc37802fcceb296263076a2a0990307 |
| deeptaxa-v3v4-v2.pt | 91a9243cee2fc420f8a78c609ac98e75dd875b2a8a0a06d6a2455624664d898f |
| deeptaxa-v4-v2.pt | bd48d9744f2d7b5c7fd930a3109c1223d3409ccdb953f8a8d11da1d2f62934b0 |
```bash
sha256sum --check SHA256SUMS
```
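On systems without the sha256sum utility, the same check can be done with the Python standard library. This is a sketch that assumes SHA256SUMS uses the usual sha256sum layout (a hex digest, whitespace, then a filename on each line); the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are never fully held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file):
    """Check each '<hex digest>  <filename>' line of a sha256sum-style file.

    Returns a dict mapping filename to True (match) or False (mismatch).
    Files are resolved relative to the directory containing `sums_file`.
    """
    base = Path(sums_file).parent
    results = {}
    for line in Path(sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.strip().split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        results[name] = sha256_of(base / name) == expected.lower()
    return results
```

Calling `verify_sums("SHA256SUMS")` in the download directory returns one boolean per listed file. A file that is listed but not downloaded raises FileNotFoundError, so a partial download would need the sums file trimmed or the exception handled.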
Python API with huggingface_hub:
```python
from huggingface_hub import hf_hub_download

# Default single model (seed 42)
full_length_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-full-length-v2.pt",
)

# All five seeds for ensembling
seed_ckpts = [
    hf_hub_download(
        repo_id="systems-genomics-lab/deeptaxa",
        filename=f"deeptaxa-full-length-v2-seed{s}.pt",
    )
    for s in (42, 123, 456, 789, 1011)
]
```
Install DeepTaxa and run predictions
```bash
pip install git+https://github.com/systems-genomics-lab/deeptaxa.git

# Full-length sequences (Sanger, PacBio HiFi, Oxford Nanopore)
deeptaxa predict \
    --fasta-file your_full_length_16s.fna.gz \
    --checkpoint deeptaxa-full-length-v2.pt \
    --output-dir predictions/

# V3-V4 amplicons (Illumina, already demultiplexed and primer-trimmed)
deeptaxa predict \
    --fasta-file your_v3v4_amplicons.fna.gz \
    --checkpoint deeptaxa-v3v4-v2.pt \
    --output-dir predictions/

# V4 amplicons (Illumina, already demultiplexed and primer-trimmed)
deeptaxa predict \
    --fasta-file your_v4_amplicons.fna.gz \
    --checkpoint deeptaxa-v4-v2.pt \
    --output-dir predictions/
```
Input preparation for amplicon checkpoints: the input FASTA file should contain region-matched sequences that have already been demultiplexed and primer-trimmed by an upstream tool such as DADA2, cutadapt, or QIIME2. The V3-V4 and V4 checkpoints were trained on in-silico primer extractions (341F/805R and 515F/806R respectively), which approximate merged paired-end amplicons. Paired-end reads should therefore be merged into consensus amplicons prior to prediction, or the forward read alone may be provided.
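Before running predictions, it can help to confirm that the input sequences are roughly the size the chosen family was trained on. The sketch below is a standard-library sanity check, not part of DeepTaxa: the helper names are hypothetical, the expected lengths are the approximate figures quoted in this card, and the 25% tolerance is an arbitrary illustrative choice rather than a documented threshold.

```python
import gzip


def fasta_lengths(path):
    """Yield (record_id, sequence_length) for a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    name, length = None, 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length
                header = line[1:].split()
                name, length = (header[0] if header else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length


# Approximate lengths stated in this card (bp).
EXPECTED_BP = {"full-length": 1500, "v3v4": 420, "v4": 253}


def off_length_records(path, family, tolerance=0.25):
    """Return records whose length deviates from the family's expected size."""
    target = EXPECTED_BP[family]
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    return [(rid, n) for rid, n in fasta_lengths(path) if not low <= n <= high]
```

For example, `off_length_records("your_v4_amplicons.fna.gz", "v4")` lists records that look too short or too long for the V4 family, which often points to unmerged reads or untrimmed primers. Length alone does not establish that the right primer pair was used, so this complements rather than replaces the upstream trimming step.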
Full usage documentation and analysis notebooks are available in the GitHub repository.
Limitations
Limitations that apply to all checkpoints:
- Approximately 44.8% of Greengenes2 species have only a single training example, which limits reliable prediction for those classes.
- The label space corresponds to Greengenes2 2024.09. Predictions are produced against the exact Greengenes2 hierarchy, and species absent from the training data cannot be predicted. Adapting the model to a different reference database, such as SILVA or GTDB, would require retraining.
- A GPU is strongly recommended; CPU inference is impractical for large datasets.
Limitations specific to the full-length family:
- Best performance is obtained on sequences of at least 1,200 bp; shorter amplicons should be classified using the V3-V4 or V4 family.
- Species-level accuracy plateaus near 93%.
Limitations specific to the V3-V4 family:
- Species-level accuracy plateaus near 87.5%. The approximately 420 bp V3-V4 region carries less taxonomic information than the full 16S gene.
- The label space contains 8,347 species. Those for which no V3-V4 amplicon could be extracted during training are absent and cannot be predicted.
- Primer specificity: the models were trained on 341F/805R extractions. Sequences amplified with other V3-V4 primers, such as 357F or 338F, or with substantially different region boundaries may yield degraded predictions.
Limitations specific to the V4 family:
- Species-level accuracy is approximately 82.8%. The approximately 253 bp V4 region carries less taxonomic information than V3-V4 or the full 16S gene, so species-level calls should be read together with their confidence scores.
- The label space contains 16,909 species (the same as the full-length family), retained because V4 amplicons were extracted at 99.0% yield. Species for which no V4 amplicon could be extracted during training are absent and cannot be predicted.
- Primer specificity: the models were trained on 515F/806R extractions. Sequences amplified with other V4 primers or with substantially different region boundaries may yield degraded predictions.
Citation
If DeepTaxa contributes to your research, please cite our paper in Bioinformatics Advances: https://doi.org/10.1093/bioadv/vbag166
@article{salah2026deeptaxa,
title={{DeepTaxa}: A Hybrid {CNN}-{BERT} Framework for {16S} {rRNA} Taxonomic Classification},
author={Salah, Rana and AbdElaal, Khlood R. and Ghonaim, Lobna and Awe, Olaitan I. and Moustafa, Ahmed},
journal={Bioinformatics Advances},
year={2026},
doi={10.1093/bioadv/vbag166},
publisher={Oxford University Press}
}
References
- Akiba, T., Sano, S., Yanase, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623-2631. DOI: 10.1145/3292500.3330701
- Bolyen, E., Rideout, J.R., Dillon, M.R., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852-857. DOI: 10.1038/s41587-019-0209-9
- Callahan, B.J., McMurdie, P.J., Rosen, M.J., et al. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-583. DOI: 10.1038/nmeth.3869
- Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. DOI: 10.14806/ej.17.1.200
- McDonald, D., Jiang, Y., Balaban, M., et al. (2024). Greengenes2 unifies microbial data in a single reference tree. Nature Biotechnology, 42(5), 715-718. DOI: 10.1038/s41587-023-01845-1
- Parks, D.H., Chuvochina, M., Chaumeil, P.A., et al. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, 38(9), 1079-1086. DOI: 10.1038/s41587-020-0501-8
- Quast, C., Pruesse, E., Yilmaz, P., et al. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(D1), D590-D596. DOI: 10.1093/nar/gks1219
- Zhou, Z., Ji, Y., Li, W., et al. (2024). DNABERT-2: Efficient foundation model and benchmark for multi-species genomes. International Conference on Learning Representations. arXiv:2306.15006
Contact
For support, please open an issue on the GitHub repository.
Acknowledgments
- Hugging Face, for hosting datasets and models.
- The High-Performance Computing Team of the School of Sciences and Engineering at the American University in Cairo, for granting access to the GPU resources used in training.
Version history
v2 (July 2026). Five-seed release. Each of the three families (full-length, V3-V4, V4) was retrained from scratch on the current codebase under the identical v1 recipe with five seeds (42, 123, 456, 789, 1011), for fifteen training runs in total on an NVIDIA A40. Every family now ships all five seeds as deeptaxa-<region>-v2-seed<N>.pt for ensembling, plus a default deeptaxa-<region>-v2.pt (a copy of seed 42). Each checkpoint records its seed in a seed field. Five-seed mean species performance: full-length 92.95% accuracy / 92.08% F1, V3-V4 87.54% / 85.90%, V4 82.84% / 80.19%; cross-seed standard deviation at most 0.09 percentage points of species F1 in every family. The means reproduce the v1 numbers to within the cross-seed spread. The v1 checkpoints remain available in this repository.
v1 (June 2026). Added the V4 checkpoint (deeptaxa-v4-v1.pt), trained from scratch in the compact HybridCNNBERT configuration on 274,509 in-silico V4 extractions (515F/806R, approximately 253 bp) from Greengenes2 2024.09. Single-seed (seed 42); species accuracy 82.84%, F1 80.16%, ECE 0.0256. The V4 amplicon was extracted at 99.0% yield, so the checkpoint keeps the full 16,909-species label space and matches the full-length parameter count (76.4 M).
v1 (April 2026). Initial release of the full-length and V3-V4 checkpoints. Both were updated in late April 2026 to the compact HybridCNNBERT architecture (76.4 M and 75.8 M parameters respectively; kernels 3/5/7, 256 filters, 4 transformer layers, 7 attention heads, 3584 FFN intermediate, 896 hidden, dropout 0.20). The full-length update (v1.1) matched or beat the prior full-length numbers at every taxonomic rank with roughly 32% fewer parameters and roughly half the training time. The V3-V4 update (v1.2) achieved equivalent species-level performance (Acc 87.55% vs 87.52%, F1 85.92% vs 85.79%) at roughly 24% fewer parameters, harmonizing the two checkpoints under the same architecture. Users who downloaded either checkpoint before the corresponding update may see different SHA-256 hashes; re-downloading retrieves the updated file.