Qwen3-8B-DMS-8x-4bit-NF4

This repository contains a 4-bit quantized version of the Qwen3-8B-DMS-8x model.

Model Description

Qwen3-8B-DMS-8x is an autoregressive language model based on the Qwen3-8B architecture and enhanced with Dynamic Memory Sparsification (DMS). DMS compresses the KV cache by 8x during inference using learned, per-head eviction policies, which substantially reduces memory usage and increases throughput for long-context generation.
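
To make the idea of per-head KV cache eviction concrete, here is a toy sketch that keeps only 1 out of every 8 cached tokens per attention head according to an importance score. It is purely illustrative: the function name, shapes, and the random score are stand-ins, and the real DMS implementation (which learns its eviction policy during training) lives in this repository's remote code.

```python
import torch

def evict_kv(keys, values, scores, ratio=8):
    """Keep the top 1/ratio of cached tokens per head (toy stand-in for DMS)."""
    num_heads, seq_len, head_dim = keys.shape
    keep = max(1, seq_len // ratio)
    # Select the highest-scoring positions per head, then restore temporal order.
    idx = scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)

keys = torch.randn(32, 1024, 128)   # [heads, seq_len, head_dim]
values = torch.randn(32, 1024, 128)
scores = torch.rand(32, 1024)       # stand-in for a learned importance score
k_small, v_small = evict_kv(keys, values, scores)
print(k_small.shape)                # torch.Size([32, 128, 128]) -> 8x fewer cached tokens
```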

This 4-bit quantization further reduces the VRAM requirements, making it possible to run this high-performance DMS model on consumer GPUs with limited memory.

  • Original Model: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
  • Architecture: Qwen3-8B with Dynamic Memory Sparsification
  • Quantization: 4-bit NF4 with double quantization (BitsAndBytes)
  • Compression Level: 4-bit weights + 8x KV cache compression (via DMS)
  • Parameters: 8.2 billion
  • Context Length: Up to 131,072 tokens with YaRN
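
The 131,072-token context comes from applying YaRN rope scaling on top of Qwen3-8B's native 32,768-token window. The sketch below follows the standard Qwen3 recipe (scaling factor 4.0); whether the custom DMS modeling code picks up this override in exactly the same way is an assumption, so treat it as a sketch rather than a tested configuration.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "g023/Qwen3-8B-DMS-8x-4bit-NF4"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    trust_remote_code=True,
)
```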

Quantization Method

  • Methodology: BitsAndBytes NF4 (Normal Float 4-bit) with double quantization
  • Bits: 4-bit
  • Compute Dtype: bfloat16
  • Storage: uint8
  • Calibration: None; default BitsAndBytes quantization, no calibration dataset required
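
In code, the settings above correspond to roughly the following BitsAndBytesConfig. This is a sketch of how the quantized checkpoint would be produced from the original model, not the exact script that was used.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for de-quantized matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Qwen3-8B-DMS-8x",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
# model.save_pretrained("Qwen3-8B-DMS-8x-4bit-NF4") would then produce a checkpoint like this one.
```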

Usage

See 'load_and_test.py' in this repository for a complete inference example.

Because this model uses custom Dynamic Memory Sparsification logic, you must set trust_remote_code=True to load the specialized configuration and modeling files.
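
A minimal loading sketch follows (the prompt and generation settings are illustrative, and it assumes recent transformers, bitsandbytes, and accelerate installs; the 4-bit NF4 quantization config is read from the checkpoint itself):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "g023/Qwen3-8B-DMS-8x-4bit-NF4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the bfloat16 compute dtype above
    device_map="auto",
    trust_remote_code=True,       # loads the custom DMS configuration/modeling files
)

prompt = "Explain KV cache compression in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```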

Limitations and License

For usage limitations and license terms, refer to the original model's license: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x/blob/main/LICENSE

Attribution and Citation

  • Model Page: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
  • Reference Paper: Łańcucki et al., "Inference-Time Hyper-Scaling with KV Cache Compression," 2025.
  • Paper Link: https://huggingface.co/papers/2506.05345
  • BibTeX:
    @misc{lancucki2025inferencetime,
          title={Inference-Time Hyper-Scaling with KV Cache Compression},
          author={Adrian Łańcucki and Konrad Staniszewski and Piotr Nawrot and Edoardo M. Ponti},
          year={2025},
          eprint={2506.05345},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2506.05345},
    }
    