---
base_model: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
inference: false
license: llama2
model_creator: https://huggingface.co/Phind
model_name: Phind-Codellama-34B-v2
model_type: llama
quantized_by: latimar
---

# Phind-CodeLlama-34B-v2 EXL2

This repository contains the weights of [Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2) converted
to the [EXL2](https://github.com/turboderp/exllamav2#exl2-quantization) format.

Each quant lives in its own branch, as in TheBloke's GPTQ repos. To fetch a single quant, clone only its branch:

```
export BRANCH=5_0-bpw-h8
git clone --single-branch --branch ${BRANCH} https://huggingface.co/latimar/Phind-Codellama-34B-v2-exl2
```
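The same branch can also be fetched without git. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; its `snapshot_download` function accepts a branch name as `revision`. The helper names and the `models/` directory layout are illustrative and not part of this repository.

```python
from pathlib import Path

REPO_ID = "latimar/Phind-Codellama-34B-v2-exl2"


def download_args(branch: str, root: str = "models") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    `root` is a hypothetical local directory chosen for this example.
    """
    return {
        "repo_id": REPO_ID,
        "revision": branch,  # each quant lives in its own branch
        "local_dir": str(Path(root) / f"Phind-Codellama-34B-v2-{branch}"),
    }


def download_quant(branch: str, root: str = "models") -> str:
    """Download one quant branch and return the local path."""
    # Imported lazily so download_args() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(branch, root))


# Usage (downloads roughly 20 GB for this branch):
# path = download_quant("5_0-bpw-h8")
```

Keeping the argument construction separate from the download call makes it easy to inspect the target directory before starting a large transfer.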

The following branches are available:

```
5_0-bpw-h8
5_0-bpw-h8-evol-ins
4_625-bpw-h6
4_4-bpw-h8
4_125-bpw-h6
3_8-bpw-h6
2_75-bpw-h6
2_55-bpw-h6
```
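The branch names follow a regular pattern, so they can be decoded programmatically. The sketch below assumes that `4_625-bpw` means 4.625 bits per weight and that `h6`/`h8` is the bit width of the output (head) layer, following the exllamav2 naming convention; that reading, along with the `Quant` type and `parse_branch` helper, is an assumption and is not stated in this README.

```python
import re
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    bpw: float           # bits per weight, e.g. 4.625
    head_bits: int       # assumed meaning of the "h6" / "h8" suffix
    tag: Optional[str]   # trailing label such as "evol-ins", if any


_BRANCH_RE = re.compile(r"^(\d+)_(\d+)-bpw-h(\d+)(?:-(.+))?$")


def parse_branch(name: str) -> Quant:
    """Split a branch name like '5_0-bpw-h8-evol-ins' into its parts."""
    match = _BRANCH_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized branch name: {name!r}")
    whole, frac, head, tag = match.groups()
    return Quant(float(f"{whole}.{frac}"), int(head), tag)


if __name__ == "__main__":
    for branch in ("4_625-bpw-h6", "5_0-bpw-h8-evol-ins"):
        print(branch, "->", parse_branch(branch))
```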

* Calibration dataset used for conversion: [wikitext-2-v1, test split](https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet)
* Evaluation dataset used to calculate perplexity (the `PPL on Wiki` column): [wikitext-2-v1, validation split](https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation/0000.parquet)
* Calibration dataset used for conversion of `5_0-bpw-h8-evol-ins`: [WizardLM_evol_instruct_70k](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_70k/blob/refs%2Fconvert%2Fparquet/default/train/0000.parquet)
* Evaluation dataset used to calculate perplexity for the `PPL on Evol-Ins` column: [nickrosh Evol-Instruct-Code-80k-v1](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1/blob/refs%2Fconvert%2Fparquet/default/train/0000.parquet)
* The `4_4-bpw-h8` quant was converted with the additional `-mr 32` argument.

Perplexity (PPL) was measured with the exllamav2 [test_inference.py](https://github.com/turboderp/exllamav2/blob/master/test_inference.py) script, passing the model directory with `-m` and the evaluation dataset with `-ed`:

```
python test_inference.py -m /storage/models/LLaMA/EXL2/Phind-Codellama-34B-v2 -ed /storage/datasets/text/evol-instruct/nickrosh-evol-instruct-code-80k.parquet
```
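For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that general definition only; it is not taken from `test_inference.py`, whose windowing and batching details may differ, and the `perplexity` helper is a name made up for this example.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each evaluated token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token is as uncertain
    # as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 8))
```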

| BPW       | PPL on Wiki | PPL on Evol-Ins | File Size (GB) |
| --------- | ----------- | --------------- | -------------- |
| 2.55-h6   | 11.0310     | 2.4542          | 10.56          |
| 2.75-h6   | 9.7902      | 2.2888          | 11.33          |
| 3.8-h6    | 6.7293      | 2.0724          | 15.37          |
| 4.125-h6  | 6.6713      | 2.0617          | 16.65          |
| 4.4-h8    | 6.6487      | 2.0509          | 17.76          |
| 4.625-h6  | 6.6576      | 2.0459          | 18.58          |
| 5.0-h8    | 6.6379      | 2.0419          | 20.09          |
| 5.0-h8-ev | 6.7785      | 2.0445          | 20.09          |
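
One way to read the table is to pick the lowest-perplexity quant that fits a size budget. The sketch below hard-codes the rows from the table above; the `best_quant` helper and its parameters are illustrative. Note that file size is only a rough proxy for memory use, since the cache and activations need additional VRAM at inference time.

```python
from typing import Optional

# (label, PPL on Wiki, PPL on Evol-Ins, file size) copied from the table above.
QUANTS = [
    ("2.55-h6", 11.0310, 2.4542, 10.56),
    ("2.75-h6", 9.7902, 2.2888, 11.33),
    ("3.8-h6", 6.7293, 2.0724, 15.37),
    ("4.125-h6", 6.6713, 2.0617, 16.65),
    ("4.4-h8", 6.6487, 2.0509, 17.76),
    ("4.625-h6", 6.6576, 2.0459, 18.58),
    ("5.0-h8", 6.6379, 2.0419, 20.09),
    ("5.0-h8-ev", 6.7785, 2.0445, 20.09),
]


def best_quant(max_size: float, metric: str = "evol") -> Optional[str]:
    """Return the label with the lowest perplexity among quants that fit.

    metric is "wiki" or "evol", selecting which perplexity column to use.
    Returns None when no quant is small enough.
    """
    column = {"wiki": 1, "evol": 2}[metric]
    fitting = [row for row in QUANTS if row[3] <= max_size]
    if not fitting:
        return None
    return min(fitting, key=lambda row: row[column])[0]


if __name__ == "__main__":
    print(best_quant(19.0, "wiki"))
    print(best_quant(19.0, "evol"))
```

The two metrics do not always agree: under a budget of 19, the Wiki column favors `4.4-h8` while the Evol-Ins column favors `4.625-h6`, so the choice depends on which evaluation set better matches the intended workload.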