Enhancement Request: Model Sharding for DeepSeek-Coder-6.7b-Instruct
Dear DeepSeek AI Team,
Greetings! I am reaching out to discuss the DeepSeek-Coder-6.7b-Instruct model and how its accessibility could be further improved. As someone eager to utilize this model, I've encountered constraints due to the limitations of lower-end hardware. Therefore, I propose the consideration of model sharding as a potential enhancement.
The introduction of a sharded version would be a significant step towards inclusivity, allowing those with less powerful machines to still take advantage of the model's capabilities. This not only benefits individual hobbyists and researchers with resource constraints but also enhances the utility of the model on cloud-based platforms where optimized resource usage is essential, such as Google Colab and Kaggle.
Understanding that model sharding entails technical complexities, I am hopeful that its implementation could widen the user base and foster a more diverse range of applications and innovations.
I am keen to hear your perspective on this suggestion and any other possible options to assist users like myself in overcoming hardware limitations.
Thank you for your pioneering work and for considering this request.
I don't really get it. Do you want to finetune this model or just run inference with it? If you want to finetune it on low-end hardware, I'd recommend the QLoRA algorithm; if you only want to run inference, I'd recommend running a quantized version of the model (e.g., the one from TheBloke).
Model sharding is not for your use case, I guess.
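To make the two suggestions above a bit more concrete, here is a minimal sketch, not an official recipe. It assumes roughly 6.7e9 parameters (read off the "6.7b" in the model name), a CUDA GPU, and the `transformers`, `accelerate`, `bitsandbytes`, and `peft` packages. The LoRA hyperparameters and `target_modules` names are illustrative guesses for a Llama-style architecture, so check them against the actual checkpoint.

```python
# Assumption: ~6.7e9 parameters, taken from the "6.7b" in the model name.
APPROX_PARAMS = 6.7e9
MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"


def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores activations, the KV cache, and quantization overhead.
    """
    return num_params * bits_per_param / 8 / 1e9


def load_4bit(model_id: str = MODEL_ID):
    """Load the model with 4-bit NF4 weights for inference.

    Heavy imports stay inside the function; calling it needs a CUDA GPU
    plus transformers, accelerate, and bitsandbytes.
    """
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the layers
    )
    return tokenizer, model


def add_lora_adapters(model):
    """QLoRA-style setup: train small LoRA adapters on top of 4-bit weights.

    The rank, alpha, dropout, and target_modules values are illustrative
    assumptions, not tuned settings.
    """
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)


# Cheap arithmetic only; nothing is downloaded here.
for bits in (16, 8, 4):
    size = estimate_weight_memory_gb(APPROX_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.2f} GB")
```

Under that parameter-count assumption, the weights alone come to about 13.4 GB in fp16 and about 3.35 GB at 4 bits, which is why quantization rather than sharding is what shrinks the footprint. For inference you would call `load_4bit()` alone; for finetuning you would pass the returned model through `add_lora_adapters()`.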
If you shard the model, you can run it through transformers on either cloud platform (Google Colab or Kaggle). This removes inference rate limits and lets people with fewer financial resources still access modern technology.
Here is an example:
https://youtu.be/c_S_KGRUzoY
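For anyone who wants to try the re-sharding idea described above without waiting for an official sharded release, a rough sketch follows. It assumes `transformers`, `accelerate`, and `torch` are installed. The output directory name and the 2GB shard size are arbitrary choices, and the ~13.4 GB figure assumes fp16 weights at about 6.7e9 parameters. Note that the one-time re-sharding step itself still needs a machine with enough memory to hold the full model.

```python
import math

MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"


def approx_shard_count(total_gb: float, max_shard_gb: float) -> int:
    """Rough number of shard files for a checkpoint of total_gb.

    The real count can differ slightly, because the split happens on
    tensor boundaries rather than exact byte offsets.
    """
    if max_shard_gb <= 0:
        raise ValueError("max_shard_gb must be positive")
    return math.ceil(total_gb / max_shard_gb)


def reshard_checkpoint(
    model_id: str = MODEL_ID,
    out_dir: str = "deepseek-coder-6.7b-instruct-sharded",  # hypothetical path
    max_shard_size: str = "2GB",
) -> str:
    """Load the model once and save it again in smaller shard files."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    model.save_pretrained(out_dir, max_shard_size=max_shard_size)
    AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
    return out_dir


def load_sharded(path: str):
    """Load a sharded checkpoint; shards are read one after another."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(
        path,
        torch_dtype=torch.float16,
        device_map="auto",  # spread layers over GPU, CPU, and disk as needed
        low_cpu_mem_usage=True,
    )
    return tokenizer, model


# Cheap arithmetic only; nothing is downloaded here.
print(approx_shard_count(13.4, 2.0))
```

Smaller shards mainly lower the peak CPU RAM needed while loading, since each file is read in turn. The full set of weights still has to fit somewhere (GPU, CPU, or disk offload), so on a free-tier GPU this may need to be combined with quantization.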