# Qwen3-8B-DMS-8x-4bit-NF4
This repository contains a 4-bit quantized version of the Qwen3-8B-DMS-8x model.
## Model Description
Qwen3-8B-DMS-8x is an autoregressive language model based on the Qwen3-8B architecture and enhanced with Dynamic Memory Sparsification (DMS), a technique that compresses the KV cache 8x during inference using learned, per-head eviction policies. This significantly reduces memory usage and increases throughput for long-context generation.
This 4-bit quantization further reduces VRAM requirements, making it possible to run the DMS model on consumer GPUs with limited memory.
- Original Model: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
- Architecture: Qwen3-8B with Dynamic Memory Sparsification
- Quantization: 4-bit NF4 with double quantization (BitsAndBytes)
- Compression Level: 4-bit weights + 8x KV cache compression (via DMS)
- Parameters: 8.2 billion
- Context Length: Up to 131,072 tokens with YaRN
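To reach the 131,072-token context noted above, YaRN rope scaling must be enabled. A minimal sketch follows; the repo id is a placeholder, and the scaling values follow the upstream Qwen3 usage notes (native 32,768-token window, factor 4.0), so verify them against this repository's `config.json`:

```python
# Sketch: enabling the extended 131,072-token context via YaRN.
# Qwen3's native window is 32,768 tokens; a YaRN factor of 4.0
# scales it to 131,072. These values are assumptions taken from
# the upstream Qwen3 usage notes.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "<this-repo-id>",            # placeholder: substitute this repository's id
    trust_remote_code=True,      # required for the custom DMS code
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```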
## Quantization Method
- Methodology: BitsAndBytes NF4 (Normal Float 4-bit) with double quantization
- Bits: 4-bit
- Compute Dtype: bfloat16
- Storage: uint8
- Calibration: None (BitsAndBytes NF4 quantization is data-free; no calibration dataset was used)
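For reference, the settings above correspond to a BitsAndBytes configuration like the following. This is a minimal sketch of how the quantization was specified; the checkpoint in this repository is already quantized, so you do not need to rerun it:

```python
# Sketch of the BitsAndBytes settings described above:
# NF4 weights, double quantization, and bfloat16 compute.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for dequantized ops
)
```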
## Usage
See `load_and_test.py` in this repository for an inference example.
Because this model uses custom Dynamic Memory Sparsification logic, you must set `trust_remote_code=True` to load the specialized configuration and modeling files.
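A minimal loading sketch is shown below. The repo id is a placeholder, and the quantization settings are read from the checkpoint itself; refer to `load_and_test.py` for the full example:

```python
# Minimal inference sketch. The repo id below is a placeholder;
# substitute this repository's id on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # loads the custom DMS config/modeling code
    device_map="auto",       # 4-bit settings are read from the saved checkpoint
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```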
## Limitations and License
This model is subject to the original model's license: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x/blob/main/LICENSE
## Attribution and Citation
- Model Page: https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
- Reference Paper: Łańcucki et al., "Inference-Time Hyper-Scaling with KV Cache Compression," 2025.
- Paper Link: https://huggingface.co/papers/2506.05345
- BibTeX:
```bibtex
@misc{lancucki2025inferencetime,
  title={Inference-Time Hyper-Scaling with KV Cache Compression},
  author={Adrian Łańcucki and Konrad Staniszewski and Piotr Nawrot and Edoardo M. Ponti},
  year={2025},
  eprint={2506.05345},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.05345}
}
```