mii-llm/nesso-4B-GGUF

This model is the GGUF version of mii-llm/nesso-4B.

Deployment

Docker and llama.cpp

You can use the following docker-compose.yml file (change the quantization tag (:Q4_K_M, :Q8_0 or :BF16) and the context size (--ctx-size 4000) according to your needs):

services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
    - 8080:8080
    volumes:
    - ./llama.cpp:/root/.cache/llama.cpp:rw,z

Run it with docker compose up. On localhost:8080 you will find a web interface for chatting, and at localhost:8080/v1 you can use the OpenAI-compatible API.
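
Since the endpoint follows the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. Below is a minimal sketch in Python, assuming the openai package is installed (pip install openai); the model name matches the --alias set in the compose file, and the API key can be any placeholder string since the server does not require one by default:

from openai import OpenAI

# Point the client at the local llama.cpp server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nesso:4B",  # matches the --alias in docker-compose.yml
    messages=[{"role": "user", "content": "Ciao! Presentati in una frase."}],
)
print(response.choices[0].message.content)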

To use a GPU (this requires the NVIDIA Container Toolkit), switch to the CUDA image and declare the GPUs in the docker-compose.yml file as in the example below (change device_ids to the IDs of the GPUs you want to use):

services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
    - 8080:8080
    volumes:
    - ./llama.cpp:/root/.cache/llama.cpp:rw,z
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ["0", "1"]
            capabilities: [gpu]

You can also use Vulkan and other kinds of GPUs; see the llama.cpp documentation for more information.
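
Whichever backend you use, the first start can take a while because the GGUF file has to be downloaded and loaded. The sketch below, assuming the requests package is installed, polls the server's /health endpoint, which returns HTTP 200 once the model is ready:

import time
import requests

def wait_for_server(url="http://localhost:8080/health", timeout=300):
    # Poll the health endpoint until the model is loaded or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # container not accepting connections yet
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server not reachable")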
