Training dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet (571k rows).
How to use Abdalrahmankamel/matryoshka-arabert with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Abdalrahmankamel/matryoshka-arabert", trust_remote_code=True)

sentences = [
    "الطفل يلعب في الحديقة",     # "The child is playing in the garden"
    "ولد صغير يلهو في البستان",  # "A young boy is playing in the orchard"
    "السيارة تسير في الشارع",    # "The car is driving down the street"
    "الأطفال يمرحون في الخارج",  # "The children are having fun outside"
]

embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```

This model enhances AraBERTv02 with Matryoshka Representation Learning and LoRA adaptation to produce Arabic sentence embeddings that outperform the base model. A single model supports multiple embedding dimensions (8, 64, 128, 256), letting you trade accuracy for efficiency.
| Feature | Description |
|---|---|
| 🔄 Multi-Dimensional | Single model supports 4 different embedding sizes (8, 64, 128, 256) |
| 🚀 High Performance | Outperforms base AraBERT across all dimensions |
| 📊 Arabic NLI Optimized | Trained specifically on Arabic Natural Language Inference |
| ⚡ Efficient Inference | Smaller dimensions for faster processing |
| 🎯 Triplet Loss Training | Enhanced semantic understanding through triplet learning |
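The multi-dimensional feature works because Matryoshka Representation Learning trains the model so that the first *k* components of an embedding are themselves a usable embedding. Using a smaller dimension is then just truncation plus re-normalization. A minimal pure-Python sketch of the idea (the input vector is stand-in data, not real model output):

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components of a Matryoshka embedding
    and re-scale the result to unit length."""
    truncated = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]

# Stand-in for a full 256-dim embedding produced by the model
full = [math.sin(i + 1) for i in range(256)]

for dim in (8, 64, 128, 256):
    small = truncate_embedding(full, dim)
    norm = math.sqrt(sum(x * x for x in small))
    print(f"dim={dim:>3} len={len(small):>3} norm={norm:.4f}")
```

Every truncated vector has unit length, so cosine similarities remain directly comparable at any dimension.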
```bash
pip install git+https://github.com/Abdalrahman54/matryoshka-wrapper.git
```
```python
from matryoshka_wrapper import load_model
from torch.nn.functional import cosine_similarity

# Load the model at the desired embedding dimension
repo_name = "Abdalrahmankamel/matryoshka-arabert"
model, tokenizer = load_model(repo_name, dim="256")

# Example texts
text1 = "هذا المنتج كان مخيبًا للآمال."  # "This product was disappointing."
text2 = "هذه البضاعة رائعة!"             # "This merchandise is wonderful!"

# Generate embeddings
emb1 = model.get_embedding(text1, tokenizer, dim="256").squeeze()
emb2 = model.get_embedding(text2, tokenizer, dim="256").squeeze()

# Calculate cosine similarity
similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(f"🔍 Cosine Similarity: {similarity:.4f}")
```
```python
# Triplet data example
anchor = "الطفل يلعب في الحديقة"      # "The child is playing in the garden"
positive = "ولد صغير يلهو في البستان"  # Similar sentence: "A young boy is playing in the orchard"
negative = "السيارة تسير في الشارع"    # Different sentence: "The car is driving down the street"

# Generate embeddings
anchor_emb = model.get_embedding(anchor, tokenizer, dim="256").squeeze()
positive_emb = model.get_embedding(positive, tokenizer, dim="256").squeeze()
negative_emb = model.get_embedding(negative, tokenizer, dim="256").squeeze()

# Calculate similarities
sim_positive = cosine_similarity(anchor_emb.unsqueeze(0), positive_emb.unsqueeze(0)).item()
sim_negative = cosine_similarity(anchor_emb.unsqueeze(0), negative_emb.unsqueeze(0)).item()

print("🔍 Triplet Results:")
print(f"📊 Anchor ↔ Positive: {sim_positive:.4f}")
print(f"📊 Anchor ↔ Negative: {sim_negative:.4f}")
print(f"📈 Margin: {sim_positive - sim_negative:.4f}")

if sim_positive > sim_negative:
    print("✅ Triplet Success!")
else:
    print("❌ Triplet Failed!")
```
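Triplet loss, the training objective used for this model, pushes the anchor–positive similarity above the anchor–negative similarity by at least a margin. A minimal pure-Python sketch of the loss computation on precomputed similarity scores (the margin value 0.5 and the score values are illustrative assumptions, not the model's actual hyperparameters or outputs):

```python
def triplet_loss(sim_positive, sim_negative, margin=0.5):
    """Hinge-style triplet loss on similarity scores: the loss is
    zero once sim_positive exceeds sim_negative by the margin."""
    return max(0.0, sim_negative - sim_positive + margin)

# Illustrative similarity scores, e.g. from the snippet above
print(triplet_loss(0.85, 0.20))  # margin satisfied -> loss is 0.0
print(triplet_loss(0.55, 0.40))  # margin violated  -> positive loss
```

During training this value is minimized over many triplets, which is what makes semantically close sentences cluster in embedding space.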
```python
# Compare similarity across all supported dimensions
text1 = "اطفال يمرحون سوياً بالكرة في المساحات الخضراء"  # "Children having fun together with a ball on the green spaces"
text2 = "أطفال يلعبون كرة القدم على العشب"               # "Children playing football on the grass"

dimensions = [8, 64, 128, 256]

print("🔍 Multi-Dimensional Similarity Comparison")
print("=" * 50)
for dim in dimensions:
    model, tokenizer = load_model(repo_name, dim=str(dim))
    emb1 = model.get_embedding(text1, tokenizer, dim=str(dim)).squeeze()
    emb2 = model.get_embedding(text2, tokenizer, dim=str(dim)).squeeze()
    similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
    print(f"📐 Dim {dim:>3}: Similarity = {similarity:.4f} | Shape = {emb1.shape}")
```
| Component | Details |
|---|---|
| Base Model | aubmindlab/bert-base-arabertv02 |
| Enhancement | LoRA (Low-Rank Adaptation) |
| Training Objective | Triplet Loss |
| Embedding Dimensions | 8, 64, 128, 256 |
| Language | Arabic |
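LoRA, listed above as the enhancement method, adapts the frozen base weights by adding a low-rank update: W' = W + B·A, where B (d×r) and A (r×d) contain far fewer trainable parameters than W (d×d). A tiny pure-Python sketch of the idea (the matrix sizes and values are illustrative, not the model's actual shapes or the actual LoRA rank):

```python
def matmul(X, Y):
    """Plain nested-list matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

d, r = 4, 1  # hidden size and LoRA rank (illustrative)

# Frozen base weight: identity for clarity
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
B = [[0.1] for _ in range(d)]  # d x r trainable matrix
A = [[0.2, 0.0, 0.0, 0.0]]    # r x d trainable matrix

delta = matmul(B, A)  # d x d update with rank <= r
W_adapted = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]
print(W_adapted[0])
```

Only B and A are updated during fine-tuning (2·d·r parameters instead of d²), which is why LoRA training is cheap relative to full fine-tuning.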
| Parameter | Value |
|---|---|
| Loss Function | Triplet Loss |
| Adaptation Method | LoRA |
| Precision | FP16 Mixed Precision |
| Epochs | 3 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 |
| Training Time | ~7 hours |
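The efficiency benefit of the smaller dimensions is easy to quantify: at fp32, both storage and dot-product cost scale linearly with embedding dimension. A quick back-of-the-envelope sketch (assuming 4-byte floats; the one-million-vector corpus size is an illustrative figure, not from the model card):

```python
BYTES_PER_FLOAT32 = 4
corpus_size = 1_000_000  # illustrative corpus size

for dim in (8, 64, 128, 256):
    per_vector = dim * BYTES_PER_FLOAT32
    total_mb = corpus_size * per_vector / (1024 ** 2)
    print(f"dim={dim:>3}: {per_vector:>5} B/vector, {total_mb:>7.1f} MB total")
```

Dropping from 256 to 8 dimensions cuts index size and similarity-search cost by 32×, at the price of some retrieval quality.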
If you use this model in your research, please cite:

```bibtex
@misc{kamel2025arabert-matryoshka,
  author = {Abdalrahman Kamel},
  title  = {AraBERT Matryoshka: Multi-Dimensional Arabic Sentence Embeddings with Triplet Loss},
  year   = {2025},
  url    = {https://huggingface.co/Abdalrahmankamel/matryoshka-arabert},
  note   = {Hugging Face Model Repository}
}
```
This model is released under the Apache 2.0 License.
Developed by Abdalrahman Kamel
Advancing Arabic NLP through innovative embedding techniques