More info?
This is very cool, could we get some info on how this was created, plus any scripts used?
yes please
Hey, thanks for the interest! I've added the script I used to generate the base model to the repo (frankenllama_22.py).
This actually came out of some experiments I was doing with attention head pruning. I decided to try going the other direction instead, and it's looking pretty promising so far.
For the fine tuning, I used axolotl: https://github.com/OpenAccess-AI-Collective/axolotl
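For readers wondering what "adding attention heads" can look like at the tensor level, here is a toy sketch. It is not the code from frankenllama_22.py and makes no claim about how that script works. It assumes the base and donor layers share the same hidden size and head dimension, which will not hold in general for real models of different sizes, and the function name `add_heads` and all shapes are hypothetical.

```python
import numpy as np


def add_heads(base_q, base_o, donor_q, donor_o, head_dim, donor_heads):
    """Append selected donor heads to a base attention block (toy illustration).

    q-style weights have shape (n_heads * head_dim, hidden).
    o-style weights have shape (hidden, n_heads * head_dim).
    Assumes base and donor use the same hidden size and head_dim.
    """
    # Row indices in the donor projection that belong to the chosen heads.
    rows = np.concatenate(
        [np.arange(h * head_dim, (h + 1) * head_dim) for h in donor_heads]
    )
    # New heads add output rows to the query projection...
    new_q = np.concatenate([base_q, donor_q[rows, :]], axis=0)
    # ...and matching input columns to the output projection.
    new_o = np.concatenate([base_o, donor_o[:, rows]], axis=1)
    return new_q, new_o


# Hypothetical sizes: hidden=6, head_dim=2, base has 2 heads, donor has 3.
rng = np.random.default_rng(0)
hidden, head_dim = 6, 2
base_q = rng.standard_normal((2 * head_dim, hidden))
base_o = rng.standard_normal((hidden, 2 * head_dim))
donor_q = rng.standard_normal((3 * head_dim, hidden))
donor_o = rng.standard_normal((hidden, 3 * head_dim))

# Transplant donor heads 0 and 2, giving the base block 4 heads in total.
new_q, new_o = add_heads(base_q, base_o, donor_q, donor_o, head_dim, [0, 2])
print(new_q.shape, new_o.shape)  # (8, 6) (6, 8)
```

In this sketch the key and value projections would be extended the same way as the query projection. The original base heads are left untouched, so only the appended slices come from the donor.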
@chargoddard Thanks for posting the script, I'm going to experiment with it. Do you know if it's possible to transplant heads from l2-70b instead of l1-33b like in the original script? And does the script need any changing other than pointing to the right donor?
I can't find this github repo, could you link it?
@Vezora Do you mean the merge script? It's the .py file in the files section of this model.