Qwen3-8B-DMS-8x-4bit-NF4

This repository contains a 4-bit quantized version of the Qwen3-8B-DMS-8x model.

Model Description

Qwen3-8B-DMS-8x is an autoregressive language model based on the Qwen3-8B architecture and enhanced with Dynamic Memory Sparsification (DMS). DMS compresses the KV cache by 8x during inference using learned, per-head eviction policies, which substantially reduces memory usage and increases throughput for long-context generation.
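
To make the idea of per-head KV cache eviction concrete, here is a toy sketch that keeps only 1 out of every 8 cached tokens per attention head according to an importance score. It is purely illustrative: the function name, shapes, and the random score are stand-ins, and the real DMS implementation (which learns its eviction policy during training) lives in this repository's remote code.

```python
import torch

def evict_kv(keys, values, scores, ratio=8):
    """Keep the top 1/ratio of cached tokens per head (toy stand-in for DMS)."""
    num_heads, seq_len, head_dim = keys.shape
    keep = max(1, seq_len // ratio)
    # Select the highest-scoring positions per head, then restore temporal order.
    idx = scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)

keys = torch.randn(32, 1024, 128)   # [heads, seq_len, head_dim]
values = torch.randn(32, 1024, 128)
scores = torch.rand(32, 1024)       # stand-in for a learned importance score
k_small, v_small = evict_kv(keys, values, scores)
print(k_small.shape)                # torch.Size([32, 128, 128]) -> 8x fewer cached tokens
```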

This 4-bit quantization further reduces the VRAM requirements, making it possible to run this high-performance DMS model on consumer GPUs with limited memory.

  • Original Model: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
  • Architecture: Qwen3-8B with Dynamic Memory Sparsification
  • Quantization: 4-bit NF4 with double quantization (BitsAndBytes)
  • Compression Level: 4-bit weights + 8x KV cache compression (via DMS)
  • Parameters: 8.2 billion
  • Context Length: Up to 131,072 tokens with YaRN
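
The 131,072-token context comes from applying YaRN rope scaling on top of Qwen3-8B's native 32,768-token window. The sketch below follows the standard Qwen3 recipe (scaling factor 4.0); whether the custom DMS modeling code picks up this override in exactly the same way is an assumption, so treat it as a sketch rather than a tested configuration.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "g023/Qwen3-8B-DMS-8x-4bit-NF4"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    trust_remote_code=True,
)
```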

Quantization Method

  • Methodology: BitsAndBytes NF4 (Normal Float 4-bit) with double quantization
  • Bits: 4-bit
  • Compute Dtype: bfloat16
  • Storage: uint8
  • Calibration: None; default BitsAndBytes quantization, no calibration dataset required
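
In code, the settings above correspond to roughly the following BitsAndBytesConfig. This is a sketch of how the quantized checkpoint would be produced from the original model, not the exact script that was used.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weight quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for de-quantized matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Qwen3-8B-DMS-8x",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
# model.save_pretrained("Qwen3-8B-DMS-8x-4bit-NF4") would then produce a checkpoint like this one.
```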

Usage

See 'load_and_test.py' in this repository for a complete inference example.

Because this model uses custom Dynamic Memory Sparsification logic, you must set trust_remote_code=True to load the specialized configuration and modeling files.
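
A minimal loading sketch follows (the prompt and generation settings are illustrative, and it assumes recent transformers, bitsandbytes, and accelerate installs; the 4-bit NF4 quantization config is read from the checkpoint itself):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "g023/Qwen3-8B-DMS-8x-4bit-NF4"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the bfloat16 compute dtype above
    device_map="auto",
    trust_remote_code=True,       # loads the custom DMS configuration/modeling files
)

prompt = "Explain KV cache compression in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```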

Limitations and License

For usage limitations and license terms, refer to the original model's license: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x/blob/main/LICENSE

Attribution and Citation

  • Model Page: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
  • Reference Paper: Łańcucki et al., "Inference-Time Hyper-Scaling with KV Cache Compression," 2025.
  • Paper Link: https://huggingface.co/papers/2506.05345
  • BibTeX:
    @misc{lancucki2025inferencetime,
          title={Inference-Time Hyper-Scaling with KV Cache Compression},
          author={Adrian Łańcucki and Konrad Staniszewski and Piotr Nawrot and Edoardo M. Ponti},
          year={2025},
          eprint={2506.05345},
          archivePrefix={arXiv},
          primaryClass={cs.LG},
          url={https://arxiv.org/abs/2506.05345},
    }
    