Mamba2-1.3B-HXQ
2.1x smaller. +8.0% PPL. Pure Mamba2 SSM scales with compression.
Mamba2-1.3B compressed from 2.9 GB to 1.4 GB at a cost of +8.0% perplexity, down from +18.4% at 130M: SSM compression quality improves with model size. No calibration data. Just `pip install` and `from_pretrained()`.
Install and Run
pip install "helix-substrate[hf]"
import helix_substrate # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EchoLabs33/mamba2-1.3b-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/mamba2-1.3b-hxq")
inputs = tokenizer("The future of artificial intelligence", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
That's it. `import helix_substrate` registers the quantizer, and `from_pretrained()` handles the rest automatically.
Benchmark
| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 2.9 GB | 1.4 GB |
| Perplexity (WikiText-2) | 9.26 | 10.00 (+8.0%) |
| Compression ratio | — | 2.1x |
| Compressed modules | — | 98 (in_proj + out_proj + embedding) |
| Architecture | Mamba2 (48 layers, pure SSM) | unchanged |
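
The derived figures in the table follow from the raw sizes and perplexities. A quick check, using only the numbers reported above:

```python
# Recompute the derived benchmark figures from the raw numbers in the table.
dense_size_gb, hxq_size_gb = 2.9, 1.4
dense_ppl, hxq_ppl = 9.26, 10.00

compression_ratio = dense_size_gb / hxq_size_gb
ppl_delta_pct = (hxq_ppl - dense_ppl) / dense_ppl * 100

print(f"Compression ratio: {compression_ratio:.1f}x")  # Compression ratio: 2.1x
print(f"PPL delta: {ppl_delta_pct:+.1f}%")             # PPL delta: +8.0%
```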
Good to Know
- +8.0% PPL delta: higher than the similarly sized Transformer models listed under Companion Models (for example, +0.78% for tinyllama-1.1b), but down from +18.4% at 130M. SSM compression quality improves with model size.
- GPU and CPU supported: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- Fine-tunable via LoRA: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- Requires `helix-substrate`: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- `mamba-ssm` recommended: without it, the model falls back to a slower sequential code path.
- Requires `transformers >= 4.45` for Mamba2 architecture support.
What is HelixCode?
HelixCode is a universal weight compression codec based on vector quantization:
- Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable: `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required: unlike GPTQ/AWQ, codebooks are fit from the weights alone
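
To make the codebook-plus-indices idea concrete, here is a minimal pure-Python sketch. It is not the helix-substrate implementation: how the real codebooks are fit is not described here, so the sketch assumes plain 1-D k-means, omits the sidecar corrections, and uses hypothetical helper names (`nearest`, `fit_codebook`, `quantize`, `helix_matvec`).

```python
import random

def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - value))

def fit_codebook(values, k=256, iters=5):
    """Fit a k-entry scalar codebook with 1-D k-means (an assumed method)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initialise the centroids at evenly spaced quantiles of the weights.
    codebook = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(codebook, v)
            sums[j] += v
            counts[j] += 1
        # Move each centroid to the mean of its members; keep empty ones as-is.
        codebook = [sums[j] / counts[j] if counts[j] else codebook[j] for j in range(k)]
    return codebook

def quantize(weight, k=256):
    """Return (codebook, indices); each row of indices is a bytes object (uint8)."""
    flat = [v for row in weight for v in row]
    codebook = fit_codebook(flat, k)
    indices = [bytes(nearest(codebook, v) for v in row) for row in weight]
    return codebook, indices

def helix_matvec(codebook, indices, x):
    """Compute codebook[indices] @ x, looking weights up on the fly."""
    return [sum(codebook[j] * xi for j, xi in zip(row, x)) for row in indices]

random.seed(0)
rows, cols = 16, 32
weight = [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]
x = [random.gauss(0.0, 1.0) for _ in range(cols)]

codebook, indices = quantize(weight)
dense_out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
hxq_out = helix_matvec(codebook, indices, x)
max_err = max(abs(a - b) for a, b in zip(dense_out, hxq_out))
print(f"max abs output error: {max_err:.6f}")
```

Each weight costs one byte of index, plus a fixed 256 x 4 bytes for the codebook, compared with two bytes per weight in BF16. For large matrices that is roughly a 2x reduction, in line with the ratio reported for this model.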
How It Works
- `import helix_substrate` registers the `hxq` quantizer with HuggingFace
- `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
- The quantizer replaces 98 modules with `HelixLinear` shells before weight loading
- Safetensors populates the codebook, indices, and sidecar buffers directly
- The model runs in compressed form, with no decompression needed
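
The first two steps amount to a registry lookup keyed on `quant_method`. The sketch below illustrates that pattern only; it is not the actual transformers or helix-substrate API. `QUANTIZERS`, `register_quantizer`, `HxqQuantizer`, and `select_quantizer` are hypothetical names, and the config fragment is made up apart from the `quant_method` field named above.

```python
import json

# Hypothetical registry mapping a quant_method string to a handler class.
QUANTIZERS = {}

def register_quantizer(name):
    """Decorator that records a quantizer class under the given method name."""
    def decorator(cls):
        QUANTIZERS[name] = cls
        return cls
    return decorator

@register_quantizer("hxq")  # stands in for the side effect of importing the package
class HxqQuantizer:
    def __init__(self, quant_config):
        self.quant_config = quant_config

def select_quantizer(config_text):
    """Pick a quantizer from the quantization_config block of a config.json."""
    quant_config = json.loads(config_text).get("quantization_config")
    if quant_config is None:
        return None  # dense checkpoint, nothing to swap
    method = quant_config["quant_method"]
    if method not in QUANTIZERS:
        raise ValueError(f"No quantizer registered for {method!r}; is the package imported?")
    return QUANTIZERS[method](quant_config)

# Illustrative config fragment, not the model's real config.json.
config_text = '{"model_type": "mamba2", "quantization_config": {"quant_method": "hxq"}}'
quantizer = select_quantizer(config_text)
print(type(quantizer).__name__)  # HxqQuantizer
```

This also shows why the import matters: without the registration step, the lookup for `"hxq"` has nothing to find.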
Architecture Details
Mamba2-1.3B is a pure state-space model with:
- 48 Mamba2 layers (SSD blocks with in_proj + out_proj linear layers)
- hidden_size=2048, 64 heads, head_dim=64
- state_size=128, conv_kernel=4
Only the in_proj, out_proj, and embedding layers are VQ-compressed. Mamba2-specific parameters (A_log, D, dt_bias, conv1d, norms) are stored at full precision.
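
The split between compressed and full-precision tensors can be pictured as a filter on parameter names. This is a sketch under assumptions: the names below follow the usual Hugging Face Mamba2 layout and may not match the checkpoint exactly, and `is_compressed` is a hypothetical helper rather than part of helix-substrate.

```python
# Shape sanity check from the numbers above: the inner width of each block
# is num_heads * head_dim, which is twice the hidden size.
hidden_size, num_heads, head_dim = 2048, 64, 64
inner_width = num_heads * head_dim
print(inner_width, inner_width // hidden_size)  # 4096 2

# Assumed name suffixes of the VQ-compressed weights.
COMPRESSED_SUFFIXES = ("in_proj.weight", "out_proj.weight", "embeddings.weight")

def is_compressed(param_name):
    """True if the parameter would be VQ-compressed under the rule above."""
    return param_name.endswith(COMPRESSED_SUFFIXES)

# Illustrative parameter names for the embedding and the first layer.
names = [
    "backbone.embeddings.weight",
    "backbone.layers.0.mixer.in_proj.weight",
    "backbone.layers.0.mixer.out_proj.weight",
    "backbone.layers.0.mixer.A_log",
    "backbone.layers.0.mixer.D",
    "backbone.layers.0.mixer.dt_bias",
    "backbone.layers.0.mixer.conv1d.weight",
    "backbone.layers.0.norm.weight",
]
for name in names:
    print(f"{'VQ' if is_compressed(name) else 'exact':5s} {name}")
```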
Compression Receipt
Compressed modules: 98 (in_proj + out_proj + embedding)
Exact tensors: 337 (A_log, D, dt_bias, conv1d, norms)
Total keys: 736
Output size: 1,388 MB
Weight ratio: 2.1x
PPL delta: +8.0% (10.00 vs 9.26 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
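
The eval line suggests a strided sliding-window perplexity. The exact evaluation script is not shown here, so the following is a sketch of the common recipe, assuming 2048-token windows advanced by 512 tokens, with each token scored once; `sliding_windows` and `perplexity` are hypothetical helper names.

```python
import math

def sliding_windows(seq_len, max_length=2048, stride=512):
    """Yield (begin, end, n_scored) spans covering seq_len tokens.

    Each window holds up to max_length tokens, but only the n_scored tokens
    not already scored by an earlier window contribute to the loss.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

def perplexity(total_nll, n_tokens):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(total_nll / n_tokens)

windows = list(sliding_windows(seq_len=3000))
print(windows)  # [(0, 2048, 2048), (512, 2560, 512), (1024, 3000, 440)]
```

After the first window, every later window gives its newly scored tokens at least 1536 tokens of preceding context, which is the point of using a stride smaller than the window.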
Companion Models
Same codec, same pip install, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |
Citation
@software{helix_substrate_2026,
title={Helix Substrate: Universal Weight Compression via HelixCode},
author={EchoLabs},
year={2026},
url={https://github.com/echo313unfolding/helix-substrate}
}
License
Apache 2.0 (inherited from state-spaces/mamba2-1.3b).