Instructions to use deepseek-ai/deepseek-coder-6.7b-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deepseek-ai/deepseek-coder-6.7b-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/deepseek-coder-6.7b-instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use deepseek-ai/deepseek-coder-6.7b-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/deepseek-coder-6.7b-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/deepseek-coder-6.7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
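
The same OpenAI-compatible endpoint can also be called from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes the server answers with the usual chat-completions JSON layout.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("http://localhost:8000", "deepseek-ai/deepseek-coder-6.7b-instruct", "What is the capital of France?")` should return the reply as a string.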

Use Docker

docker model run hf.co/deepseek-ai/deepseek-coder-6.7b-instruct

SGLang

How to use deepseek-ai/deepseek-coder-6.7b-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/deepseek-coder-6.7b-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/deepseek-coder-6.7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/deepseek-coder-6.7b-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/deepseek-coder-6.7b-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/deepseek-coder-6.7b-instruct with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/deepseek-coder-6.7b-instruct
```

Enhancement Request: Model Sharding for DeepSeek-Coder-6.7b-Instruct

by Firejowl - opened Nov 8, 2023

Discussion

Firejowl

Nov 8, 2023

Dear DeepSeek AI Team,

Greetings! I am reaching out to discuss the DeepSeek-Coder-6.7b-Instruct model and how its accessibility could be further improved. As someone eager to utilize this model, I've encountered constraints due to the limitations of lower-end hardware. Therefore, I propose the consideration of model sharding as a potential enhancement.

The introduction of a sharded version would be a significant step towards inclusivity, allowing those with less powerful machines to still take advantage of the model's capabilities. This not only benefits individual hobbyists and researchers with resource constraints but also enhances the utility of the model on cloud-based platforms where optimized resource usage is essential, such as Google Colab and Kaggle.

Understanding that model sharding entails technical complexities, I am hopeful that its implementation could widen the user base and foster a more diverse range of applications and innovations.

I am keen to hear your perspective on this suggestion and any other possible options to assist users like myself in overcoming hardware limitations.

Thank you for your pioneering work and for considering this request.

Chester111

Nov 9, 2023

I don't really get it. Do you want to fine-tune this model or just run inference with it? If you want to fine-tune it on low-end hardware, I'd recommend the QLoRA algorithm; if you only want to run inference, I'd recommend a quantized version of the model (e.g., the one from TheBloke).
Model sharding is not for your use case, I guess.
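
To make the quantization suggestion concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The parameter count is assumed from the "6.7b" in the model name, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and overhead.
PARAMS = 6.7e9  # assumption: taken from the "6.7b" in the model name

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gib(num_params, bytes_per_param):
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>18}: {weight_memory_gib(PARAMS, nbytes):5.1f} GiB")
```

Under these assumptions, half-precision weights take roughly 12.5 GiB, while a 4-bit quantized copy takes a bit over 3 GiB, which is why quantization is the usual suggestion for low-end hardware.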

Chester111 changed discussion status to closed Nov 9, 2023

Firejowl

Nov 9, 2023

If you shard the model, you can run it through transformers on either cloud platform. This removes inference rate limits and allows people with fewer financial resources to still access modern technology.

Here is an example:
https://youtu.be/c_S_KGRUzoY
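
For readers unfamiliar with the term, the idea behind sharding a checkpoint can be sketched as follows: the weight tensors are split across several files, each under a size budget, so they can be loaded one file at a time. This is only an illustration with hypothetical tensor names and sizes, not the actual implementation in any library. In transformers, `save_pretrained` accepts a `max_shard_size` argument for this purpose. Note that sharding mainly lowers peak memory while loading; the total memory needed to hold all the weights stays the same.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily pack tensors, in order, into shards of at most max_shard_bytes.

    A single tensor larger than the budget ends up in a shard of its own.
    """
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the budget.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor names and sizes (in bytes), for illustration only.
sizes = {
    "embed": 300,
    "layer0.attn": 200,
    "layer0.mlp": 400,
    "layer1.attn": 200,
    "layer1.mlp": 400,
    "lm_head": 300,
}
shards = plan_shards(sizes, max_shard_bytes=700)
for index, names in enumerate(shards, start=1):
    print(f"shard {index}: {names}")
```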

Firejowl changed discussion status to open Nov 9, 2023
