PlasmidGPT-244M
A conditional DNA language model for generating plasmid sequences based on specified biological properties.
Model Description
PlasmidGPT is a GPT-2-style transformer trained to generate plasmid DNA sequences conditioned on biological properties such as host organism, antibiotic resistance, GC content, promoter, and vector type.
- Parameters: 244M
- Context Length: 16,384 tokens
- Architecture: 18 layers, 16 heads, 1024 embedding dim
- Vocabulary: 72 tokens (special tokens + condition tokens + ACGT nucleotides)
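As a rough sanity check, the figures above can be plugged into the usual GPT-2 parameter formula. This is a sketch under assumptions not stated in the card: a standard GPT-2 block layout (4x MLP expansion, biases, two LayerNorms per block), learned position embeddings sized to the context length, and an output head tied to the token embedding. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Estimate parameters of a standard GPT-2 layout (assumed, not confirmed by the card)."""
    # Attention: fused QKV projection plus output projection, each with bias
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model, then project back, each with bias
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with weight and bias
    layer_norms = 2 * 2 * d_model
    per_block = attention + mlp + layer_norms

    token_embedding = vocab_size * d_model
    position_embedding = n_positions * d_model
    final_layer_norm = 2 * d_model
    # The LM head is assumed tied to the token embedding, so it adds nothing
    return n_layer * per_block + token_embedding + position_embedding + final_layer_norm


total = gpt2_param_count(n_layer=18, d_model=1024, vocab_size=72, n_positions=16_384)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate is about 243.6M, in line with the stated 244M. Most of the parameters sit in the transformer blocks; with a 72-token vocabulary the token embedding is negligible, while the 16,384-position embedding accounts for roughly 16.8M.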
Training
- Dataset: Deduplicated Addgene plasmid sequences (13,260 unique prompt-sequence pairs)
- Training Steps: 10,000
- Best Validation Loss: 0.721
- Best Validation Perplexity: 2.06
- WandB Run: xotxj25a
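The loss and perplexity figures are related by exponentiation. A minimal check, assuming the reported loss is the mean per-token cross-entropy in nats (the card does not say so explicitly):

```python
import math

validation_loss = 0.721  # best validation loss reported above

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats
perplexity = math.exp(validation_loss)
print(f"perplexity = {perplexity:.2f}")
```

A perplexity near 2 means that, on average, the model is about as uncertain as a choice between two equally likely nucleotides at each position, compared with 4 for a uniform guess over A, C, G, and T.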
Conditioning Tokens
The model accepts the following condition tokens:
| Category | Tokens |
|---|---|
| Host | <HOST:ECOLI>, <HOST:HUMAN>, <HOST:MAMMALIAN>, <HOST:MOUSE>, <HOST:PLANT>, <HOST:RAT>, <HOST:SYNTHETIC>, <HOST:WORM>, <HOST:YEAST> |
| Resistance | <RESISTANCE:AMP>, <RESISTANCE:KAN>, <RESISTANCE:SPEC>, <RESISTANCE:CHLOR>, <RESISTANCE:GENT>, <RESISTANCE:STREP>, <RESISTANCE:TET> |
| Length | <LENGTH:SHORT>, <LENGTH:MEDIUM>, <LENGTH:LONG> |
| GC Content | <GC:LOW>, <GC:MEDIUM>, <GC:HIGH> |
| Application | <APPLICATION:CLONING>, <APPLICATION:CRISPR>, <APPLICATION:EDITING>, <APPLICATION:EXPRESSION>, <APPLICATION:RECOMBINATION>, <APPLICATION:REPORTER>, <APPLICATION:RNAI> |
| Copy Number | <COPY:HIGH>, <COPY:LOW> |
| Promoter | <PROMOTER:BAD>, <PROMOTER:CAG>, <PROMOTER:CBH>, <PROMOTER:CMV>, <PROMOTER:HSYN>, <PROMOTER:LAC>, <PROMOTER:PGK>, <PROMOTER:POLYH>, <PROMOTER:SFFV>, <PROMOTER:TAC>, <PROMOTER:TRE> |
| Vector Type | <VECTOR:AAV>, <VECTOR:LENTIVIRAL>, <VECTOR:RETROVIRAL>, <VECTOR:TRANSPOSON> |
| Tags | <TAG:FLAG>, <TAG:GFP>, <TAG:GST>, <TAG:HA>, <TAG:HIS>, <TAG:MBP>, <TAG:MCHERRY>, <TAG:MYC>, <TAG:NLS>, <TAG:SNAP> |
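One way to assemble a prompt from these tokens is sketched below. The helper name `build_prompt` is hypothetical, and the trailing `<SEQ>` marker follows the prompt in the Usage section. The card does not specify a required ordering of categories, so the helper simply keeps the order it is given.

```python
# Allowed values per category, copied from the table above
CONDITION_VALUES = {
    "HOST": {"ECOLI", "HUMAN", "MAMMALIAN", "MOUSE", "PLANT", "RAT", "SYNTHETIC", "WORM", "YEAST"},
    "RESISTANCE": {"AMP", "KAN", "SPEC", "CHLOR", "GENT", "STREP", "TET"},
    "LENGTH": {"SHORT", "MEDIUM", "LONG"},
    "GC": {"LOW", "MEDIUM", "HIGH"},
    "APPLICATION": {"CLONING", "CRISPR", "EDITING", "EXPRESSION", "RECOMBINATION", "REPORTER", "RNAI"},
    "COPY": {"HIGH", "LOW"},
    "PROMOTER": {"BAD", "CAG", "CBH", "CMV", "HSYN", "LAC", "PGK", "POLYH", "SFFV", "TAC", "TRE"},
    "VECTOR": {"AAV", "LENTIVIRAL", "RETROVIRAL", "TRANSPOSON"},
    "TAG": {"FLAG", "GFP", "GST", "HA", "HIS", "MBP", "MCHERRY", "MYC", "NLS", "SNAP"},
}


def build_prompt(conditions):
    """Build a prompt from (category, value) pairs, rejecting tokens not in the table.

    The pairs are emitted in the order given; the <SEQ> marker is appended last.
    """
    parts = []
    for category, value in conditions:
        category, value = category.upper(), value.upper()
        if value not in CONDITION_VALUES.get(category, set()):
            raise ValueError(f"Unknown condition token <{category}:{value}>")
        parts.append(f"<{category}:{value}>")
    return "".join(parts) + "<SEQ>"


prompt = build_prompt([
    ("HOST", "ECOLI"),
    ("RESISTANCE", "AMP"),
    ("LENGTH", "MEDIUM"),
    ("GC", "MEDIUM"),
])
print(prompt)
```

Validating against the table up front avoids silently encoding a misspelled condition as `<UNK>`.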
Usage
```python
import json
import re

import torch
from transformers import GPT2LMHeadModel

# Load model
model = GPT2LMHeadModel.from_pretrained("McClain/plasmid-gpt-244m")
model.eval()

# Load vocab (token -> id) for encoding; expects vocab.json in the working directory
with open("vocab.json") as f:
    vocab = json.load(f)
id_to_token = {v: k for k, v in vocab.items()}

# Matches condition tokens such as <HOST:ECOLI> and special tokens such as <SEQ>
TOKEN_PATTERN = re.compile(r"<[A-Z_]+:[A-Z_]+>|<[A-Z]+>")


def encode(text):
    """Encode a prompt: one id per bracketed token, one id per remaining character."""
    tokens = [vocab["<BOS>"]]
    pos = 0
    for match in TOKEN_PATTERN.finditer(text):
        # Characters between bracketed tokens are encoded one at a time
        for char in text[pos:match.start()]:
            tokens.append(vocab.get(char, vocab["<UNK>"]))
        tokens.append(vocab.get(match.group(), vocab["<UNK>"]))
        pos = match.end()
    # Any trailing characters after the last bracketed token
    for char in text[pos:]:
        tokens.append(vocab.get(char, vocab["<UNK>"]))
    return tokens


def decode(token_ids):
    return "".join(id_to_token.get(t, "<UNK>") for t in token_ids)


# Generate
prompt = "<HOST:ECOLI><RESISTANCE:AMP><LENGTH:MEDIUM><GC:MEDIUM><SEQ>"
input_ids = torch.tensor([encode(prompt)])
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.85,
        top_k=50,
        repetition_penalty=1.15,
        pad_token_id=0,
        eos_token_id=2,
    )

# Decode only the newly generated tokens, skipping the prompt
generated = decode(output[0, input_ids.shape[1]:].tolist())
print(generated.replace("<EOS>", ""))
```
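After generation it can be useful to check the output against the requested conditions. The sketch below strips any bracketed tokens and computes the sequence length and GC fraction. The helper names are hypothetical, and the card does not state the thresholds behind `<GC:LOW>`, `<GC:MEDIUM>`, `<GC:HIGH>` or the `<LENGTH:...>` bins, so the code only reports raw values.

```python
import re


def clean_sequence(generated):
    """Remove bracketed tokens such as <EOS> and keep only A, C, G, T characters."""
    without_tokens = re.sub(r"<[^>]*>", "", generated)
    return "".join(base for base in without_tokens.upper() if base in "ACGT")


def gc_fraction(sequence):
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Stand-in for model output; in practice pass the decoded string from the example above
example_output = "ATGCGC<EOS>"
sequence = clean_sequence(example_output)
print(len(sequence), gc_fraction(sequence))
```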
Related Models
- plasmid-gpt-319m: larger model with lower perplexity (1.88)
Citation
If you use this model, please cite:
@misc{plasmidgpt2024,
title={PlasmidGPT: Conditional DNA Language Model for Plasmid Sequence Generation},
author={McClain},
year={2024},
publisher={HuggingFace}
}
License
MIT License