---
base_model: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
inference: false
license: llama2
model_creator: https://huggingface.co/Phind
model_name: Phind-Codellama-34B-v2
model_type: llama
quantized_by: latimar
---
# Phind-CodeLlama-34B-v2 EXL2

Weights of [Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2) converted
to [EXL2](https://github.com/turboderp/exllamav2#exl2-quantization) format.

Each quant is stored in its own branch, as in TheBloke's GPTQ repos. To fetch a single quant, clone only its branch:

```
export BRANCH=5_0-bpw-h8
git clone --single-branch --branch ${BRANCH} https://huggingface.co/latimar/Phind-Codellama-34B-v2-exl2
```
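The same branch can also be fetched from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; its `snapshot_download` function accepts a branch name as `revision`. The helper names `default_local_dir` and `download_quant` and the `models/` directory layout are hypothetical, chosen only for illustration. A call such as `download_quant("5_0-bpw-h8")` would download that branch and needs network access.

```python
from pathlib import Path

REPO_ID = "latimar/Phind-Codellama-34B-v2-exl2"


def default_local_dir(branch: str, root: str = "models") -> Path:
    """Hypothetical layout: one directory per branch under `root`."""
    return Path(root) / f"{REPO_ID.split('/')[-1]}-{branch}"


def download_quant(branch: str, root: str = "models") -> str:
    """Download a single quant branch and return the local snapshot path."""
    # Imported lazily so the path helper works without the package installed.
    from huggingface_hub import snapshot_download

    target = default_local_dir(branch, root)
    return snapshot_download(
        repo_id=REPO_ID,
        revision=branch,  # branch name, e.g. "5_0-bpw-h8"
        local_dir=str(target),
    )
```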
The following branches are available:

```
5_0-bpw-h8
5_0-bpw-h8-evol-ins
4_625-bpw-h6
4_4-bpw-h8
4_125-bpw-h6
3_8-bpw-h6
2_75-bpw-h6
2_55-bpw-h6
```
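The branch names follow a regular pattern, so they can be decoded programmatically. The sketch below assumes that the leading number is bits per weight with `_` standing in for the decimal point, and that `h6` / `h8` is the bit width of the output head layer; the README does not state the second point, so treat it as an assumption. `Quant` and `parse_branch` are illustrative names.

```python
import re
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    bpw: float             # bits per weight, e.g. 4.625
    head_bits: int         # assumed meaning of the "h6" / "h8" part
    suffix: Optional[str]  # trailing tag such as "evol-ins", if any


_BRANCH_RE = re.compile(r"^(\d+)_(\d+)-bpw-h(\d+)(?:-(.+))?$")


def parse_branch(branch: str) -> Quant:
    """Split a branch name like '4_625-bpw-h6' into its components."""
    match = _BRANCH_RE.match(branch)
    if match is None:
        raise ValueError(f"unrecognised branch name: {branch!r}")
    whole, frac, head, suffix = match.groups()
    return Quant(float(f"{whole}.{frac}"), int(head), suffix)
```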
* Calibration dataset used for conversion: [wikitext-2-v1](https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet) (test split)
* Evaluation dataset used to calculate perplexity: [wikitext-2-v1](https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation/0000.parquet) (validation split)
* Calibration dataset used for conversion of `5_0-bpw-h8-evol-ins`: [wizardLM-evol-instruct_70k](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k/blob/refs%2Fconvert%2Fparquet/default/train/0000.parquet)
* Evaluation dataset used to calculate perplexity for the `Evol-Ins` column: [nickrosh-evol-instruct](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/blob/refs%2Fconvert%2Fparquet/default/train/0000.parquet)
* When converting the `4_4-bpw-h8` quant, the additional `-mr 32` argument was used.
Perplexity (PPL) was measured with the exllamav2 [test_inference.py](https://github.com/turboderp/exllamav2/blob/master/test_inference.py) script:

```
python test_inference.py -m /storage/models/LLaMA/EXL2/Phind-Codellama-34B-v2 -ed /storage/datasets/text/evol-instruct/nickrosh-evol-instruct-code-80k.parquet
```
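For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows that textbook definition only; it is not taken from `test_inference.py`, whose windowing and batching details may differ. The function name `perplexity` and its input format (natural-log probabilities, one per token) are assumptions for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```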
| BPW         | PPL on Wiki | PPL on Evol-Ins | File Size (GB) |
| ----------- | ----------- | --------------- | -------------- |
| 2.55-h6     | 11.0310     | 2.4542          | 10.56          |
| 2.75-h6     | 9.7902      | 2.2888          | 11.33          |
| 3.8-h6      | 6.7293      | 2.0724          | 15.37          |
| 4.125-h6    | 6.6713      | 2.0617          | 16.65          |
| 4.4-h8      | 6.6487      | 2.0509          | 17.76          |
| 4.625-h6    | 6.6576      | 2.0459          | 18.58          |
| 5.0-h8      | 6.6379      | 2.0419          | 20.09          |
| 5.0-h8-ev   | 6.7785      | 2.0445          | 20.09          |
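The table can be used to pick a quant for a given size budget. The sketch below copies the table values and returns the row with the lowest Wiki perplexity among those whose file fits the budget. It compares file size only, so the memory actually needed at inference time (cache, activations) will be higher. `QuantRow` and `best_quant` are illustrative names, and mapping the `5.0-h8-ev` row to the `5_0-bpw-h8-evol-ins` branch is an assumption.

```python
from typing import NamedTuple, Optional


class QuantRow(NamedTuple):
    branch: str
    ppl_wiki: float
    ppl_evol: float
    size_gb: float  # file size as listed in the table


# Values copied from the table above.
ROWS = [
    QuantRow("2_55-bpw-h6", 11.0310, 2.4542, 10.56),
    QuantRow("2_75-bpw-h6", 9.7902, 2.2888, 11.33),
    QuantRow("3_8-bpw-h6", 6.7293, 2.0724, 15.37),
    QuantRow("4_125-bpw-h6", 6.6713, 2.0617, 16.65),
    QuantRow("4_4-bpw-h8", 6.6487, 2.0509, 17.76),
    QuantRow("4_625-bpw-h6", 6.6576, 2.0459, 18.58),
    QuantRow("5_0-bpw-h8", 6.6379, 2.0419, 20.09),
    QuantRow("5_0-bpw-h8-evol-ins", 6.7785, 2.0445, 20.09),  # assumed mapping
]


def best_quant(budget_gb: float) -> Optional[QuantRow]:
    """Lowest Wiki PPL among quants whose file size fits the budget."""
    fitting = [row for row in ROWS if row.size_gb <= budget_gb]
    return min(fitting, key=lambda row: row.ppl_wiki, default=None)
```

Note that a larger file is not always the better choice by this metric: with a budget of 19, the function selects `4_4-bpw-h8` rather than `4_625-bpw-h6`, because the former has the lower Wiki perplexity in the table.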