IBM Granite 3.3 8B Instruct - TevunahAi Ultra-Hybrid GPTQ

Model Details

| Property | Value |
|---|---|
| Base Model | IBM Granite 3.3 8B Instruct |
| Architecture | Dense Transformer |
| Parameters | 8B |
| Context Length | 128K |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~5 GB |
| Compression | ~69% reduction |

Quantization Strategy

TevunahAi Ultra-Hybrid Mixed-Precision with Full EoRA:

| Component | Precision | EoRA | Rationale |
|---|---|---|---|
| Attention (q/k/v/o_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| MLP (gate/up/down_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| Embeddings | BF16 | - | Preserved for accuracy |
| LM Head | BF16 | - | Preserved for accuracy |

EoRA (Error-corrected Low-Rank Adaptation)

This quantization uses NVIDIA's EoRA technique to compensate for quantization error with low-rank adapters. EoRA applies SVD-based error correction to all quantized layers, significantly improving output quality compared to standard GPTQ.

  • Rank: 128
  • Applied to: All attention projections (Q/K/V/O) AND all MLP projections (gate/up/down)
  • Benefit: Near-lossless INT4 quantization with minimal overhead

Credit: EoRA technique developed by NVIDIA Research. See NVIDIA EoRA Paper for details.
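
In essence, EoRA approximates the weight error introduced by quantization with a pair of low-rank matrices and adds that correction back on top of the INT4 matmul. The sketch below illustrates this error-compensation pattern with a plain truncated SVD; it is a simplified stand-in for the published method (which projects the error into an eigenspace derived from calibration activations), not the exact code used to produce this checkpoint.

import torch

def eora_style_correction(w_orig: torch.Tensor, w_quant: torch.Tensor, rank: int = 128):
    """Approximate the quantization error (w_orig - w_quant) with a rank-`rank` factorization."""
    error = (w_orig - w_quant).float()
    # Truncated SVD of the error matrix: keep only the top `rank` singular directions.
    u, s, vh = torch.linalg.svd(error, full_matrices=False)
    lora_b = u[:, :rank] * s[:rank]   # (out_features, rank)
    lora_a = vh[:rank, :]             # (rank, in_features)
    return lora_a, lora_b

def corrected_forward(x, w_quant, lora_a, lora_b):
    # Quantized matmul plus the low-rank correction: y ≈ x @ W_orig^T
    return x @ w_quant.T + (x @ lora_a.T) @ lora_b.T

At rank 128 the correction adds only two thin matrices per projection, which is why the overhead stays small relative to the INT4 weights.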

Calibration

  • 2048 samples (8x the common default of 256)
  • 4096 sequence length
  • Diverse datasets: UltraChat, SlimOrca, OpenHermes, Code-Feedback, Orca-Math
  • Premium calibration for superior quality retention
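
For reference, a quantization run with settings matching this card could be assembled along the following lines. The GPTQModel flow shown (QuantizeConfig, load, quantize, save) is the library's standard API, but the dataset repo id, field names, and preprocessing are illustrative assumptions, and the version-dependent EoRA adapter options are omitted.

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Settings mirror this card: 4-bit, group_size=128, desc_act, symmetric.
quant_config = QuantizeConfig(bits=4, group_size=128, desc_act=True, sym=True)

# Illustrative calibration mix from one of the listed sources; the repo id and
# message format are assumptions, and the actual run blended several datasets.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").shuffle(seed=0)
calibration_data = [
    "\n".join(turn["content"] for turn in row["messages"])
    for row in ds.select(range(2048))
]

model = GPTQModel.load("ibm-granite/granite-3.3-8b-instruct", quant_config)
model.quantize(calibration_data)
model.save("granite-3.3-8b-instruct-gptq")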

About Granite 3.3

IBM Granite 3.3 is a family of enterprise-grade language models optimized for:

  • Instruction following and chat
  • Code generation and analysis
  • RAG (Retrieval-Augmented Generation)
  • Function calling and tool use
  • Multilingual support

Key Features

  • 128K context window
  • Trained on high-quality enterprise data
  • Strong reasoning and coding capabilities
  • Apache 2.0 license for commercial use

Usage

GPTQModel (Recommended):

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True
)

# Generate
prompt = "Explain the difference between machine learning and deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Chat Template:

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
]

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Installation:

pip install gptqmodel transformers

Inference Backends:

For optimal performance, use one of these backends:

import torch

# Triton (default, ~14 tok/s)
model = GPTQModel.load(model_path, backend="triton")

# BitBLAS (faster, requires compilation)
model = GPTQModel.load(model_path, backend="bitblas", torch_dtype=torch.float16)

# ExLlama v2 (fastest, if available)
model = GPTQModel.load(model_path, backend="exllama_v2")

Memory Requirements

Inference:

  • Minimum: 8GB VRAM
  • Recommended: 12GB+ VRAM
  • For 128K context: 24GB+ VRAM

Tested Hardware:

  • NVIDIA RTX 5000 Ada (32GB) - ~14 tok/s with Triton
  • NVIDIA A30 (24GB) - BitBLAS compatible
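
At long context these figures are dominated by the KV cache rather than the ~5 GB of quantized weights. A rough estimate, assuming an unquantized FP16 cache and the architecture listed under Technical Specifications (40 layers, 8 KV heads, head dim 4096 / 32 = 128), looks like this; actual usage varies with the backend and any cache quantization:

def kv_cache_bytes(context_len, num_layers=40, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """FP16 KV-cache footprint: K and V tensors per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len

print(kv_cache_bytes(32_768) / 1e9)    # ~5.4 GB at 32K context
print(kv_cache_bytes(131_072) / 1e9)   # ~21.5 GB at the full 128K window

Adding the ~5 GB of weights, a full 128K-token session lands above 24 GB, consistent with the 24GB+ recommendation above.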

Quantization Details

| Specification | Value |
|---|---|
| Method | GPTQ + Full EoRA |
| Quantizer | GPTQModel |
| Calibration Samples | 2048 (8x standard) |
| Sequence Length | 256 tokens |
| Attention Precision | INT4 + EoRA (rank-128) |
| MLP Precision | INT4 + EoRA (rank-128) |
| desc_act | True |
| sym | True |
| group_size | 128 |
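
These settings are stored alongside the weights, so they can be verified directly from the repository. The snippet below assumes the checkpoint ships the conventional quantize_config.json file; the exact filename and fields depend on how GPTQModel serialized it.

import json
from huggingface_hub import hf_hub_download

# Download and print the serialized quantization config (filename is an assumption).
path = hf_hub_download("TevunahAi/Granite-3.3-8B-Instruct-GPTQ", "quantize_config.json")
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))  # expect bits=4, group_size=128, desc_act/sym=true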

Expected Quality

Based on the Full EoRA quantization strategy (INT4 + EoRA on all layers):

  • General instruction following: 97-99% of baseline
  • Code generation: 96-98% of baseline
  • Reasoning tasks: 96-98% of baseline
  • Long context: 95-97% of baseline

Full EoRA coverage (attention + MLP) provides excellent quality recovery despite aggressive INT4 quantization.
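
These retention figures can be spot-checked with lm-evaluation-harness by running the same tasks against this checkpoint and the BF16 base model and comparing scores. The snippet is a sketch: the task list is illustrative, and loading the GPTQ checkpoint through the harness's hf backend assumes a GPTQ-capable transformers/gptqmodel stack is installed.

import lm_eval

# Evaluate the quantized checkpoint; repeat with "ibm-granite/granite-3.3-8b-instruct"
# to compute retention as quantized_score / baseline_score.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TevunahAi/Granite-3.3-8B-Instruct-GPTQ,device_map=auto",
    tasks=["arc_challenge", "gsm8k"],
    batch_size=8,
)
print(results["results"])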

Technical Specifications

| Specification | Value |
|---|---|
| Model Family | IBM Granite 3.3 |
| Variant | 8B Instruct |
| Parameters | 8B |
| Hidden Size | 4096 |
| Num Layers | 40 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 |
| Intermediate Size | 12800 |
| Context Length | 128K |
| Vocab Size | 128256 |

License

Apache 2.0 - Same as base model

Citation

@software{granite_3_3_8b_gptq_2025,
  title = {IBM Granite 3.3 8B Instruct - TevunahAi GPTQ with Full EoRA},
  author = {TevunahAi},
  year = {2025},
  note = {INT4 GPTQ quantization with full EoRA error correction on all layers},
  url = {https://huggingface.co/TevunahAi/Granite-3.3-8B-Instruct-GPTQ}
}

@misc{ibm_granite_2024,
  title = {Granite 3.0 Language Models},
  author = {IBM Research},
  year = {2024},
  url = {https://github.com/ibm-granite/granite-3.0-language-models}
}

@misc{nvidia_eora_2025,
  title = {EoRA: Error-corrected Low-Rank Adaptation for Quantized LLMs},
  author = {NVIDIA Research},
  year = {2025},
  note = {Low-rank error correction technique for quantization quality recovery}
}

Acknowledgments

  • IBM Research for the excellent Granite model family
  • NVIDIA Research for the EoRA technique enabling high-quality quantization
  • GPTQModel team for the quantization framework

https://huggingface.co/TevunahAi
