PlasmidGPT-244M

A conditional DNA language model for generating plasmid sequences based on specified biological properties.

Model Description

PlasmidGPT is a GPT-2-style transformer trained to generate plasmid DNA sequences conditioned on biological properties: host organism, antibiotic resistance, sequence length, GC content, intended application, copy number, promoter, vector type, and protein tags.

Parameters: 244M
Context Length: 16,384 tokens
Architecture: 18 layers, 16 heads, 1024 embedding dim
Vocabulary: 72 tokens (special tokens + condition tokens + ACGT nucleotides)
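
As a sanity check, the stated parameter count can be roughly reproduced from the architecture figures above. This is a minimal sketch that assumes the standard GPT-2 layout (4x MLP expansion, learned position embeddings, output head tied to the token embeddings); none of these details are confirmed by this model card, and `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Estimate the parameter count of a standard GPT-2 stack with tied embeddings."""
    token_emb = vocab_size * d_model   # token embedding table (assumed shared with the LM head)
    pos_emb = n_ctx * d_model          # learned position embeddings
    # Attention: fused QKV projection plus output projection, each with a bias
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: assumed 4x expansion, as in the original GPT-2
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)    # two LayerNorms per block (weight + bias)
    final_ln = 2 * d_model             # final LayerNorm
    return token_emb + pos_emb + n_layer * (attn + mlp + layer_norms) + final_ln

total = gpt2_param_count(n_layer=18, d_model=1024, vocab_size=72, n_ctx=16384)
print(f"{total / 1e6:.1f}M parameters")  # prints "243.6M parameters"
```

Under these assumptions the estimate lands within rounding of the stated 244M. Note that the position embeddings for the 16,384-token context account for about 16.8M of that total, while the 72-token vocabulary contributes well under 0.1M.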

Training

Dataset: Deduplicated Addgene plasmid sequences (13,260 unique prompt-sequence pairs)
Training Steps: 10,000
Best Validation Loss: 0.721
Best Validation Perplexity: 2.06
WandB Run: xotxj25a
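
The two validation figures above are consistent with each other if the loss is the mean per-token cross-entropy in nats, in which case perplexity is simply its exponential. That interpretation is an assumption; the model card does not state the units.

```python
import math

val_loss = 0.721  # best validation loss reported above (assumed to be in nats per token)

perplexity = math.exp(val_loss)
bits_per_token = val_loss / math.log(2)  # the same loss expressed in bits

print(f"perplexity = {perplexity:.2f}")  # prints "perplexity = 2.06"
print(f"bits per token = {bits_per_token:.2f}")
```

For reference, a model that guessed uniformly among the four nucleotides would score 2 bits per token, or a perplexity of 4.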

Conditioning Tokens

The model accepts the following condition tokens:

| Category | Tokens |
| --- | --- |
| Host | `<HOST:ECOLI>`, `<HOST:HUMAN>`, `<HOST:MAMMALIAN>`, `<HOST:MOUSE>`, `<HOST:PLANT>`, `<HOST:RAT>`, `<HOST:SYNTHETIC>`, `<HOST:WORM>`, `<HOST:YEAST>` |
| Resistance | `<RESISTANCE:AMP>`, `<RESISTANCE:KAN>`, `<RESISTANCE:SPEC>`, `<RESISTANCE:CHLOR>`, `<RESISTANCE:GENT>`, `<RESISTANCE:STREP>`, `<RESISTANCE:TET>` |
| Length | `<LENGTH:SHORT>`, `<LENGTH:MEDIUM>`, `<LENGTH:LONG>` |
| GC Content | `<GC:LOW>`, `<GC:MEDIUM>`, `<GC:HIGH>` |
| Application | `<APPLICATION:CLONING>`, `<APPLICATION:CRISPR>`, `<APPLICATION:EDITING>`, `<APPLICATION:EXPRESSION>`, `<APPLICATION:RECOMBINATION>`, `<APPLICATION:REPORTER>`, `<APPLICATION:RNAI>` |
| Copy Number | `<COPY:HIGH>`, `<COPY:LOW>` |
| Promoter | `<PROMOTER:BAD>`, `<PROMOTER:CAG>`, `<PROMOTER:CBH>`, `<PROMOTER:CMV>`, `<PROMOTER:HSYN>`, `<PROMOTER:LAC>`, `<PROMOTER:PGK>`, `<PROMOTER:POLYH>`, `<PROMOTER:SFFV>`, `<PROMOTER:TAC>`, `<PROMOTER:TRE>` |
| Vector Type | `<VECTOR:AAV>`, `<VECTOR:LENTIVIRAL>`, `<VECTOR:RETROVIRAL>`, `<VECTOR:TRANSPOSON>` |
| Tags | `<TAG:FLAG>`, `<TAG:GFP>`, `<TAG:GST>`, `<TAG:HA>`, `<TAG:HIS>`, `<TAG:MBP>`, `<TAG:MCHERRY>`, `<TAG:MYC>`, `<TAG:NLS>`, `<TAG:SNAP>` |
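
Prompts are built by concatenating condition tokens and ending with `<SEQ>`. The sketch below shows one way to assemble such a prompt from keyword arguments. `build_prompt` and `CATEGORY_ORDER` are hypothetical names, not part of the released code, and the category ordering simply follows the table above; whether the model is sensitive to token order is not stated in this model card. The helper checks category names only, not the values.

```python
# Token prefixes in the order the categories are listed in the table above (ordering is an assumption)
CATEGORY_ORDER = ["HOST", "RESISTANCE", "LENGTH", "GC", "APPLICATION",
                  "COPY", "PROMOTER", "VECTOR", "TAG"]

def build_prompt(**conditions):
    """Assemble condition tokens followed by <SEQ> (hypothetical helper)."""
    normalized = {key.upper(): str(value).upper() for key, value in conditions.items()}
    unknown = sorted(set(normalized) - set(CATEGORY_ORDER))
    if unknown:
        raise ValueError(f"Unknown condition categories: {unknown}")
    # Emit one token per supplied category, in table order
    parts = [f"<{cat}:{normalized[cat]}>" for cat in CATEGORY_ORDER if cat in normalized]
    return "".join(parts) + "<SEQ>"

prompt = build_prompt(host="ecoli", resistance="amp", length="medium", gc="medium")
print(prompt)  # prints "<HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>"
```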

Usage

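import json
import re

import torch
from transformers import GPT2LMHeadModel

# Load the model and switch to inference mode
model = GPT2LMHeadModel.from_pretrained("mcclain/plasmid-gpt-244m")
model.eval()

# Load the vocabulary (token -> id); vocab.json must be in the working directory
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Matches condition tokens such as <HOST:ECOLI> and special tokens such as <SEQ>
TOKEN_PATTERN = re.compile(r"<[A-Z_]+:[A-Z_]+>|<[A-Z]+>")

def encode(text):
    """Convert a prompt string into a list of token ids, starting with <BOS>."""
    tokens = [vocab["<BOS>"]]
    pos = 0
    for match in TOKEN_PATTERN.finditer(text):
        # Text between bracketed tokens is encoded one nucleotide at a time
        for char in text[pos:match.start()]:
            tokens.append(vocab.get(char, vocab["<UNK>"]))
        tokens.append(vocab.get(match.group(), vocab["<UNK>"]))
        pos = match.end()
    # Encode any nucleotides after the last bracketed token
    for char in text[pos:]:
        tokens.append(vocab.get(char, vocab["<UNK>"]))
    return tokens

def decode(token_ids):
    """Convert a list of token ids back into a string."""
    return "".join(id_to_token.get(t, "<UNK>") for t in token_ids)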

# Generate
prompt = "<HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>"
input_ids = torch.tensor([encode(prompt)])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.85,
        top_k=50,
        repetition_penalty=1.15,
        pad_token_id=0,
        eos_token_id=2,
    )

generated = decode(output[0, input_ids.shape[1]:].tolist())
print(generated.replace("<EOS>", ""))
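
After generation, it can be useful to check whether the output matches the requested conditions. The sketch below strips any bracketed tokens from a decoded string and reports its length and GC fraction. `clean_sequence` and `gc_fraction` are hypothetical helpers, and the example string is made up for illustration. This model card does not state the thresholds behind `<GC:LOW>`, `<GC:MEDIUM>`, `<GC:HIGH>` or the length bins, so the numbers have to be interpreted manually.

```python
import re

def clean_sequence(text):
    """Drop bracketed tokens such as <EOS>, then keep only A, C, G, T characters."""
    without_tokens = re.sub(r"<[^>]*>", "", text)
    return "".join(c for c in without_tokens.upper() if c in "ACGT")

def gc_fraction(seq):
    """Fraction of bases that are G or C; returns 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

example = "ATGCGC<EOS>"  # stand-in for the decoded model output
seq = clean_sequence(example)
print(f"length = {len(seq)} nt, GC = {gc_fraction(seq):.1%}")  # prints "length = 6 nt, GC = 66.7%"
```

With real output, pass the `generated` string from the usage example to `clean_sequence` in place of `example`.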

Related Models

plasmid-gpt-319m - Larger model with lower perplexity (1.88, versus 2.06 for this model)

Citation

If you use this model, please cite:

@misc{plasmidgpt2024,
  title={PlasmidGPT: Conditional DNA Language Model for Plasmid Sequence Generation},
  author={McClain},
  year={2024},
  publisher={HuggingFace}
}

License

MIT License

Safetensors

Model size: 0.2B params
Tensor type: F32