# IBM Granite 3.3 8B Instruct - TevunahAi Ultra-Hybrid GPTQ

## Model Details
| Property | Value |
|---|---|
| Base Model | IBM Granite 3.3 8B Instruct |
| Architecture | Dense Transformer |
| Parameters | 8B |
| Context Length | 128K |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~5 GB |
| Compression | ~69% reduction |
## Quantization Strategy

**TevunahAi Ultra-Hybrid Mixed-Precision with Full EoRA:**
| Component | Precision | EoRA | Rationale |
|---|---|---|---|
| Attention (q/k/v/o_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| MLP (gate/up/down_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| Embeddings | BF16 | - | Preserved for accuracy |
| LM Head | BF16 | - | Preserved for accuracy |
### EoRA (Error-corrected Low-Rank Adaptation)

This quantization uses NVIDIA's EoRA technique to compensate for quantization error with low-rank correction adapters. EoRA applies SVD-based error correction to every quantized layer, significantly improving output quality compared to standard GPTQ.
- Rank: 128
- Applied to: All attention projections (Q/K/V/O) AND all MLP projections (gate/up/down)
- Benefit: Near-lossless INT4 quantization with minimal overhead
Credit: EoRA technique developed by NVIDIA Research. See NVIDIA EoRA Paper for details.
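To make the idea concrete, here is a minimal sketch of rank-r error correction on a single weight matrix. It is illustrative only: it uses a plain SVD of the quantization error, whereas EoRA derives the correction from calibration activations, and the function names here are hypothetical.

```python
# Illustrative sketch only: plain SVD of the quantization error, not the full
# EoRA algorithm (which uses a calibration-derived projection of the error).
import torch

def low_rank_error_correction(w_fp: torch.Tensor, w_q: torch.Tensor, rank: int = 128):
    """Approximate the quantization error W - W_q with a rank-`rank` factorization."""
    error = w_fp - w_q                                # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    A = U[:, :rank] * S[:rank]                        # (out_features, rank)
    B = Vh[:rank, :]                                  # (rank, in_features)
    return A, B

def corrected_linear(x, w_q, A, B, bias=None):
    # y = x @ (W_q + A @ B)^T  ≈  x @ W^T
    return torch.nn.functional.linear(x, w_q + A @ B, bias)
```

At rank 128 the adapters add roughly `rank × (in_features + out_features)` parameters per projection, a small overhead next to the INT4 weights.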
## Calibration

- 2048 samples (8x the commonly used 256)
- 4096 sequence length
- Diverse datasets: UltraChat, SlimOrca, OpenHermes, Code-Feedback, Orca-Math
- Larger and more diverse calibration set than typical, for better quality retention (a sketch of assembling such a set follows below)
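A hedged sketch of how a mixed calibration set like this might be assembled. The dataset ID, split, and column names below are assumptions for illustration, not the exact sources or preprocessing used for this release.

```python
# Assumption-laden sketch: build calibration texts from one of the listed sources.
# Dataset ID, split, and column layout are illustrative, not the exact recipe.
from datasets import load_dataset

def sample_texts(dataset_id: str, split: str, to_text, n: int):
    ds = load_dataset(dataset_id, split=split).shuffle(seed=0).select(range(n))
    return [to_text(row) for row in ds]

calibration_texts = sample_texts(
    "HuggingFaceH4/ultrachat_200k", "train_sft",
    lambda row: "\n".join(turn["content"] for turn in row["messages"]),
    512,
)
# ...repeat for SlimOrca, OpenHermes, Code-Feedback, and Orca-Math until 2048
# samples are collected, then tokenize each text to the calibration sequence length.
```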
## About Granite 3.3
IBM Granite 3.3 is a family of enterprise-grade language models optimized for:
- Instruction following and chat
- Code generation and analysis
- RAG (Retrieval-Augmented Generation)
- Function calling and tool use
- Multilingual support
### Key Features
- 128K context window
- Trained on high-quality enterprise data
- Strong reasoning and coding capabilities
- Apache 2.0 license for commercial use
## Usage

**GPTQModel (Recommended):**

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True,
)

# Generate
prompt = "Explain the difference between machine learning and deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**With Chat Template:**

```python
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."},
]
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Installation:**

```bash
pip install gptqmodel transformers
```
**Inference Backends:**

For optimal performance, use one of these backends:

```python
import torch  # needed for torch_dtype below

# Triton (default, ~14 tok/s)
model = GPTQModel.load(model_path, backend="triton")

# BitBLAS (faster, requires compilation)
model = GPTQModel.load(model_path, backend="bitblas", torch_dtype=torch.float16)

# ExLlama v2 (fastest, if available)
model = GPTQModel.load(model_path, backend="exllama_v2")
```
## Memory Requirements

**Inference:**

- Minimum: 8GB VRAM
- Recommended: 12GB+ VRAM
- For 128K context: 24GB+ VRAM (see the KV-cache estimate below)
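The full-context figure is dominated by the KV cache rather than the weights. A back-of-envelope estimate, assuming an FP16 cache and a head dimension of 128 (hidden size 4096 / 32 heads, per the specifications table below):

```python
# Rough KV-cache estimate for 128K context (assumes FP16 cache, head_dim = 128).
num_layers = 40
num_kv_heads = 8
head_dim = 4096 // 32        # hidden_size / num_attention_heads
bytes_per_value = 2          # FP16
context_len = 128 * 1024

kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_len
print(f"KV cache at 128K tokens: {kv_cache_bytes / 1024**3:.1f} GiB")  # ≈ 20 GiB
```

Adding the ~5 GB of quantized weights lands near the 24GB+ recommendation; shorter contexts shrink the cache linearly.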
**Tested Hardware:**
- NVIDIA RTX 5000 Ada (32GB) - ~14 tok/s with Triton
- NVIDIA A30 (24GB) - BitBLAS compatible
## Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ + Full EoRA |
| Quantizer | GPTQModel |
| Calibration Samples | 2048 (8x standard) |
| Sequence Length | 256 tokens |
| Attention Precision | INT4 + EoRA (rank-128) |
| MLP Precision | INT4 + EoRA (rank-128) |
| desc_act | True |
| sym | True |
| group_size | 128 |
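A hedged sketch of a GPTQModel quantization setup matching the table above. The field names follow GPTQModel's `QuantizeConfig`, but the exact script, calibration pipeline, and EoRA adapter wiring used for this release are not published and are assumptions here.

```python
# Illustrative only: reconstructs the table's settings, not the published recipe.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=128,
    desc_act=True,   # activation-order quantization
    sym=True,        # symmetric quantization
)

# Placeholder calibration data; use the mixed set described in the Calibration section.
calibration_dataset = ["Example calibration text."] * 4

model = GPTQModel.load("ibm-granite/granite-3.3-8b-instruct", quant_config)
model.quantize(calibration_dataset)
model.save("Granite-3.3-8B-Instruct-GPTQ")
```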
## Expected Quality
Based on the Full EoRA quantization strategy (INT4 + EoRA on all layers):
- General instruction following: 97-99% of baseline
- Code generation: 96-98% of baseline
- Reasoning tasks: 96-98% of baseline
- Long context: 95-97% of baseline
Full EoRA coverage (attention + MLP) provides excellent quality recovery despite aggressive INT4 quantization.
## Technical Specifications
| Specification | Value |
|---|---|
| Model Family | IBM Granite 3.3 |
| Variant | 8B Instruct |
| Parameters | 8B |
| Hidden Size | 4096 |
| Num Layers | 40 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 |
| Intermediate Size | 12800 |
| Context Length | 128K |
| Vocab Size | 128256 |
## License
Apache 2.0 - Same as base model
## Citation

```bibtex
@software{granite_3_3_8b_gptq_2025,
  title  = {IBM Granite 3.3 8B Instruct - TevunahAi GPTQ with Full EoRA},
  author = {TevunahAi},
  year   = {2025},
  note   = {INT4 GPTQ quantization with full EoRA error correction on all layers},
  url    = {https://huggingface.co/TevunahAi/Granite-3.3-8B-Instruct-GPTQ}
}

@misc{ibm_granite_2024,
  title  = {Granite 3.0 Language Models},
  author = {IBM Research},
  year   = {2024},
  url    = {https://github.com/ibm-granite/granite-3.0-language-models}
}

@misc{nvidia_eora_2025,
  title  = {EoRA: Error-corrected Low-Rank Adaptation for Quantized LLMs},
  author = {NVIDIA Research},
  year   = {2025},
  note   = {Low-rank error correction technique for quantization quality recovery}
}
```
## Acknowledgments
- IBM Research for the excellent Granite model family
- NVIDIA Research for the EoRA technique enabling high-quality quantization
- GPTQModel team for the quantization framework