# IBM Granite 3.3 8B Instruct - TevunahAi Ultra-Hybrid GPTQ

## Model Details
| Property | Value |
|---|---|
| Base Model | IBM Granite 3.3 8B Instruct |
| Architecture | Dense Transformer |
| Parameters | 8B |
| Context Length | 128K |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~5 GB |
| Compression | ~69% reduction |
## Quantization Strategy

**TevunahAi Ultra-Hybrid Mixed-Precision with Full EoRA:**
| Component | Precision | EoRA | Rationale |
|---|---|---|---|
| Attention (q/k/v/o_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| MLP (gate/up/down_proj) | INT4 | ✅ Rank-128 | Quality recovery with error correction |
| Embeddings | BF16 | - | Preserved for accuracy |
| LM Head | BF16 | - | Preserved for accuracy |
### EoRA (Error-corrected Low-Rank Adaptation)

This quantization uses NVIDIA's EoRA technique to compensate for quantization error with low-rank correction adapters. EoRA applies SVD-based error correction to every quantized layer, significantly improving output quality compared to standard GPTQ.
- Rank: 128
- Applied to: All attention projections (Q/K/V/O) AND all MLP projections (gate/up/down)
- Benefit: Near-lossless INT4 quantization with minimal overhead
Credit: EoRA technique developed by NVIDIA Research. See NVIDIA EoRA Paper for details.
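To make the idea concrete, here is a minimal sketch of rank-r error correction on a single weight matrix. It is illustrative only: it uses a plain SVD of the quantization error, whereas EoRA derives the correction from calibration activations, and the function names here are hypothetical.

```python
# Illustrative sketch only: plain SVD of the quantization error, not the full
# EoRA algorithm (which uses a calibration-derived projection of the error).
import torch

def low_rank_error_correction(w_fp: torch.Tensor, w_q: torch.Tensor, rank: int = 128):
    """Approximate the quantization error W - W_q with a rank-`rank` factorization."""
    error = w_fp - w_q                                # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    A = U[:, :rank] * S[:rank]                        # (out_features, rank)
    B = Vh[:rank, :]                                  # (rank, in_features)
    return A, B

def corrected_linear(x, w_q, A, B, bias=None):
    # y = x @ (W_q + A @ B)^T  ≈  x @ W^T
    return torch.nn.functional.linear(x, w_q + A @ B, bias)
```

At rank 128 the adapters add roughly `rank × (in_features + out_features)` parameters per projection, a small overhead next to the INT4 weights.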
## Calibration

- 2048 samples (8x the commonly used 256)
- 4096 sequence length
- Diverse datasets: UltraChat, SlimOrca, OpenHermes, Code-Feedback, Orca-Math
- Larger and more diverse calibration set than typical, for better quality retention (a sketch of assembling such a set follows below)
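A hedged sketch of how a mixed calibration set like this might be assembled. The dataset ID, split, and column names below are assumptions for illustration, not the exact sources or preprocessing used for this release.

```python
# Assumption-laden sketch: build calibration texts from one of the listed sources.
# Dataset ID, split, and column layout are illustrative, not the exact recipe.
from datasets import load_dataset

def sample_texts(dataset_id: str, split: str, to_text, n: int):
    ds = load_dataset(dataset_id, split=split).shuffle(seed=0).select(range(n))
    return [to_text(row) for row in ds]

calibration_texts = sample_texts(
    "HuggingFaceH4/ultrachat_200k", "train_sft",
    lambda row: "\n".join(turn["content"] for turn in row["messages"]),
    512,
)
# ...repeat for SlimOrca, OpenHermes, Code-Feedback, and Orca-Math until 2048
# samples are collected, then tokenize each text to the calibration sequence length.
```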
## About Granite 3.3
IBM Granite 3.3 is a family of enterprise-grade language models optimized for:
- Instruction following and chat
- Code generation and analysis
- RAG (Retrieval-Augmented Generation)
- Function calling and tool use
- Multilingual support
### Key Features
- 128K context window
- Trained on high-quality enterprise data
- Strong reasoning and coding capabilities
- Apache 2.0 license for commercial use
## Usage

**GPTQModel (Recommended):**

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Granite-3.3-8B-Instruct-GPTQ",
    trust_remote_code=True,
)

# Generate
prompt = "Explain the difference between machine learning and deep learning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**With Chat Template:**

```python
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."},
]
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    tokenized,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Installation:**

```bash
pip install gptqmodel transformers
```
**Inference Backends:**

For optimal performance, use one of these backends:

```python
import torch  # needed for torch_dtype below

# Triton (default, ~14 tok/s)
model = GPTQModel.load(model_path, backend="triton")

# BitBLAS (faster, requires compilation)
model = GPTQModel.load(model_path, backend="bitblas", torch_dtype=torch.float16)

# ExLlama v2 (fastest, if available)
model = GPTQModel.load(model_path, backend="exllama_v2")
```
## Memory Requirements

**Inference:**

- Minimum: 8GB VRAM
- Recommended: 12GB+ VRAM
- For 128K context: 24GB+ VRAM (see the KV-cache estimate below)
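The full-context figure is dominated by the KV cache rather than the weights. A back-of-envelope estimate, assuming an FP16 cache and a head dimension of 128 (hidden size 4096 / 32 heads, per the specifications table below):

```python
# Rough KV-cache estimate for 128K context (assumes FP16 cache, head_dim = 128).
num_layers = 40
num_kv_heads = 8
head_dim = 4096 // 32        # hidden_size / num_attention_heads
bytes_per_value = 2          # FP16
context_len = 128 * 1024

kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_len
print(f"KV cache at 128K tokens: {kv_cache_bytes / 1024**3:.1f} GiB")  # ≈ 20 GiB
```

Adding the ~5 GB of quantized weights lands near the 24GB+ recommendation; shorter contexts shrink the cache linearly.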
**Tested Hardware:**
- NVIDIA RTX 5000 Ada (32GB) - ~14 tok/s with Triton
- NVIDIA A30 (24GB) - BitBLAS compatible
## Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ + Full EoRA |
| Quantizer | GPTQModel |
| Calibration Samples | 2048 (8x standard) |
| Sequence Length | 256 tokens |
| Attention Precision | INT4 + EoRA (rank-128) |
| MLP Precision | INT4 + EoRA (rank-128) |
| desc_act | True |
| sym | True |
| group_size | 128 |
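A hedged sketch of a GPTQModel quantization setup matching the table above. The field names follow GPTQModel's `QuantizeConfig`, but the exact script, calibration pipeline, and EoRA adapter wiring used for this release are not published and are assumptions here.

```python
# Illustrative only: reconstructs the table's settings, not the published recipe.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=128,
    desc_act=True,   # activation-order quantization
    sym=True,        # symmetric quantization
)

# Placeholder calibration data; use the mixed set described in the Calibration section.
calibration_dataset = ["Example calibration text."] * 4

model = GPTQModel.load("ibm-granite/granite-3.3-8b-instruct", quant_config)
model.quantize(calibration_dataset)
model.save("Granite-3.3-8B-Instruct-GPTQ")
```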
## Expected Quality
Based on the Full EoRA quantization strategy (INT4 + EoRA on all layers):
- General instruction following: 97-99% of baseline
- Code generation: 96-98% of baseline
- Reasoning tasks: 96-98% of baseline
- Long context: 95-97% of baseline
Full EoRA coverage (attention + MLP) provides excellent quality recovery despite aggressive INT4 quantization.
## Technical Specifications
| Specification | Value |
|---|---|
| Model Family | IBM Granite 3.3 |
| Variant | 8B Instruct |
| Parameters | 8B |
| Hidden Size | 4096 |
| Num Layers | 40 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 |
| Intermediate Size | 12800 |
| Context Length | 128K |
| Vocab Size | 128256 |
## License
Apache 2.0 - Same as base model
## Citation

```bibtex
@software{granite_3_3_8b_gptq_2025,
  title  = {IBM Granite 3.3 8B Instruct - TevunahAi GPTQ with Full EoRA},
  author = {TevunahAi},
  year   = {2025},
  note   = {INT4 GPTQ quantization with full EoRA error correction on all layers},
  url    = {https://huggingface.co/TevunahAi/Granite-3.3-8B-Instruct-GPTQ}
}

@misc{ibm_granite_2024,
  title  = {Granite 3.0 Language Models},
  author = {IBM Research},
  year   = {2024},
  url    = {https://github.com/ibm-granite/granite-3.0-language-models}
}

@misc{nvidia_eora_2025,
  title  = {EoRA: Error-corrected Low-Rank Adaptation for Quantized LLMs},
  author = {NVIDIA Research},
  year   = {2025},
  note   = {Low-rank error correction technique for quantization quality recovery}
}
```
## Acknowledgments
- IBM Research for the excellent Granite model family
- NVIDIA Research for the EoRA technique enabling high-quality quantization
- GPTQModel team for the quantization framework