mii-llm/nesso-4B-GGUF
This model is the GGUF version of mii-llm/nesso-4B.
Deployment
Docker and llama.cpp
You can use the following docker-compose.yml file (change the quantization (:Q4_K_M, :Q8_0, or :BF16) and the context size (--ctx-size 4000) according to your needs):
```yaml
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
      - 8080:8080
    volumes:
      - ./llama.cpp:/root/.cache/llama.cpp:rw,z
```
Run it with docker compose up: on localhost:8080 you will find a web interface for chatting, and at localhost:8080/v1 you can use the OpenAI-compatible API.
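For example, once the container is up you can call the OpenAI-compatible chat completions endpoint with curl. A minimal sketch: the model name matches the --alias set above, and the prompt is only a placeholder.

```bash
# Query the OpenAI-compatible API exposed by llama.cpp at /v1
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nesso:4B",
    "messages": [
      {"role": "user", "content": "Hello! Who are you?"}
    ]
  }'
```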
To use a GPU (this requires the NVIDIA Container Toolkit), switch to the CUDA image and declare the GPUs in the docker-compose.yml file as in the example below (change device_ids to the IDs of the GPUs you want to use):
```yaml
services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
      - 8080:8080
    volumes:
      - ./llama.cpp:/root/.cache/llama.cpp:rw,z
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
```
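After starting the stack, you can follow the server logs to confirm that the model loaded and that layers were offloaded to the GPU (the service name below matches the compose file above):

```bash
# Tail the llama.cpp server logs; the load output typically reports GPU offloading.
docker compose logs -f llamacpp
```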
You can also use Vulkan and other kinds of GPUs; see the llama.cpp documentation for more information.
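As a rough sketch only: assuming a Vulkan-enabled image tag (full-vulkan) is published for llama.cpp and that your GPU is exposed through /dev/dri (typical for AMD/Intel on Linux), a Vulkan setup might look like the following. Check the llama.cpp Docker documentation for the exact image names and device mappings your hardware needs.

```yaml
services:
  llamacpp:
    # Assumption: Vulkan-enabled image tag; verify against the llama.cpp docs.
    image: ghcr.io/ggml-org/llama.cpp:full-vulkan
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
      - 8080:8080
    volumes:
      - ./llama.cpp:/root/.cache/llama.cpp:rw,z
    devices:
      # Assumption: GPU reachable via DRI render nodes on the host.
      - /dev/dri:/dev/dri
```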