mii-llm/nesso-4B-GGUF

This model is the GGUF version of mii-llm/nesso-4B.

Deployment

Docker and llama.cpp

You can use the following docker-compose.yml file (change the quantization tag (:Q4_K_M, :Q8_0 or :BF16) and the context size (--ctx-size 4000) according to your needs):

services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
    - 8080:8080
    volumes:
    - ./llama.cpp:/root/.cache/llama.cpp:rw,z

Run it with docker compose up. On localhost:8080 you will find a web interface for chatting, and at localhost:8080/v1 you can use the OpenAI-compatible API.
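
Since the endpoint follows the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. Below is a minimal sketch in Python, assuming the openai package is installed (pip install openai); the model name matches the --alias set in the compose file, and the API key can be any placeholder string since the server does not require one by default:

from openai import OpenAI

# Point the client at the local llama.cpp server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nesso:4B",  # matches the --alias in docker-compose.yml
    messages=[{"role": "user", "content": "Ciao! Presentati in una frase."}],
)
print(response.choices[0].message.content)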

To use a GPU (this requires the NVIDIA Container Toolkit), switch to the CUDA image and declare the GPUs in the docker-compose.yml file as in the example below (change device_ids to the IDs of the GPUs you want to use):

services:
  llamacpp:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda
    command: --server --host 0.0.0.0 -hf mii-llm/nesso-4B-GGUF:Q4_K_M --alias nesso:4B --jinja -ngl 99 --threads -1 --temp 0.15 --ctx-size 4000
    ports:
    - 8080:8080
    volumes:
    - ./llama.cpp:/root/.cache/llama.cpp:rw,z
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ["0", "1"]
            capabilities: [gpu]

You can also use Vulkan and other kinds of GPUs; see the llama.cpp documentation for more information.
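
Whichever backend you use, the first start can take a while because the GGUF file has to be downloaded and loaded. The sketch below, assuming the requests package is installed, polls the server's /health endpoint, which returns HTTP 200 once the model is ready:

import time
import requests

def wait_for_server(url="http://localhost:8080/health", timeout=300):
    # Poll the health endpoint until the model is loaded or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # container not accepting connections yet
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server not reachable")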
