ALIA MrBERT Spanish Legal and Administrative Embeddings Model
This repository contains ALIA MrBERT Spanish Legal and Administrative Embeddings, a Spanish legal domain bi-encoder model for semantic similarity and information retrieval tasks. It is built upon MrBERT-es, a bilingual (Spanish–English) foundational language model based on the ModernBERT architecture, and fine-tuned on domain-specific legal and administrative data using a Curriculum Learning strategy.
DISCLAIMER: This model is a domain-specific proof-of-concept designed to demonstrate retrieval capabilities in the Spanish legal and administrative domain. Although it is optimized for this domain, results should be verified against official legal sources. The model may fail on out-of-domain or adversarial inputs.
Model Details
Model Lineage
ModernBERT (architecture)
↓
MrBERT-es (BSC-LT)
Bilingual ES/EN encoder
150M parameters
↓
ALIA-MrBERT-es-legal-administrative-embeddings (SINAI)
Legal domain fine-tuning
Curriculum Learning + Hard Negatives
Key Features
- 🔍 Domain: Spanish legal and administrative texts
- 📐 Architecture: ModernBERT with Mean Pooling (bi-encoder)
- 📏 Long context: up to 8,192 tokens
- 🎓 Training strategy: Curriculum Learning (easy → medium → hard)
- ⚙️ Negative mining: Positive-Aware Hard Negative Mining (NVIDIA approach)
Architecture
This model uses the same base architecture as MrBERT-es, extended with a Mean Pooling layer for sentence-level embeddings:
| Property | Value |
|---|---|
| Base Architecture | ModernBERT |
| Total Parameters | ~150M |
| Hidden size | 768 |
| Intermediate size | 1,152 |
| Attention heads | 12 |
| Hidden layers | 22 |
| Context length | 8,192 tokens |
| Vocabulary size | 51,200 |
| Precision | bfloat16 |
| Positional encoding | RoPE |
| Activation function | GeLU |
| Attention type | Mixed (global every 3 layers + sliding window) |
| Pooling strategy | Mean Pooling |
Training
Training Strategy: Curriculum Learning
The model was fine-tuned with a two-phase Curriculum Learning strategy that progressively increases the difficulty of the training examples, using the train split of SINAI/ALIA-es-legal-administrative-triplets:
| Phase | Epochs | Negative Type | Difficulty Progression |
|---|---|---|---|
| Phase 1 | 6 | Random negatives | Easy → Medium → Hard |
| Phase 2 | 3 | Hard negatives (mined) | Easy → Medium → Hard |
| Total | 9 | — | — |
Phase 1 – Contrastive Learning with Random Negatives:
Training uses {query, relevant_doc, [irrelevant_docs]} with in-batch negatives. Examples are sorted by difficulty across 3 sub-phases (2 epochs each).
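The sub-phase ordering described above can be sketched in plain Python. This is only an illustration under stated assumptions: the card does not say how difficulty was scored or how the sub-phases were split, so the `difficulty` field, the equal-sized buckets, and the function name `curriculum_schedule` are all hypothetical.

```python
from typing import Dict, List, Sequence


def curriculum_schedule(
    examples: Sequence[dict],
    difficulty_key: str = "difficulty",  # hypothetical field name
    epochs_per_subphase: int = 2,        # "2 epochs each" in the card
) -> List[Dict]:
    """Sort examples from easy to hard and split them into three sub-phases."""
    names = ["easy", "medium", "hard"]
    ordered = sorted(examples, key=lambda ex: ex[difficulty_key])
    size, remainder = divmod(len(ordered), len(names))
    schedule, start = [], 0
    for i, name in enumerate(names):
        # Spread any leftover examples over the first buckets
        end = start + size + (1 if i < remainder else 0)
        schedule.append({
            "name": name,
            "epochs": epochs_per_subphase,
            "examples": list(ordered[start:end]),
        })
        start = end
    return schedule


# Toy examples with made-up difficulty scores
toy = [{"query": f"q{i}", "difficulty": d}
       for i, d in enumerate([0.9, 0.1, 0.5, 0.3, 0.7, 0.2])]
for stage in curriculum_schedule(toy):
    print(stage["name"], stage["epochs"], [ex["difficulty"] for ex in stage["examples"]])
```

With three sub-phases of two epochs each, the schedule adds up to the 6 epochs reported for Phase 1.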
Phase 2 – Advanced Refinement with Hard Negatives:
Refinement using mined hard negatives with Positive-Aware Mining (NVIDIA approach) to avoid false negatives. A candidate is only considered a negative if:
score < score_positive - margin (margin = 0.05)
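A minimal sketch of this filtering rule, assuming each candidate already has a similarity score against the query. The function name, the `max_negatives` cap, and the toy scores are hypothetical; only the threshold rule and the 0.05 margin come from the card.

```python
from typing import List, Tuple


def filter_hard_negatives(
    positive_score: float,
    candidates: List[Tuple[str, float]],
    margin: float = 0.05,    # margin stated in the card
    max_negatives: int = 5,  # hypothetical cap, not stated in the card
) -> List[Tuple[str, float]]:
    """Keep candidates that score strictly below positive_score - margin."""
    threshold = positive_score - margin
    kept = [(doc, score) for doc, score in candidates if score < threshold]
    # The hardest negatives are the highest-scoring survivors
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_negatives]


# Toy scores: doc_a outranks the positive and doc_b falls inside the margin,
# so both are treated as likely false negatives and dropped.
candidates = [("doc_a", 0.83), ("doc_b", 0.78), ("doc_c", 0.74), ("doc_d", 0.40)]
print(filter_hard_negatives(positive_score=0.80, candidates=candidates))
```

The margin trades recall of hard negatives for safety: a larger value discards more borderline candidates that might actually be relevant.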
Hyperparameter Optimization
Before training, hyperparameter search was conducted using Optuna (20 trials, subsets of 5,000 examples):
- Sampler: TPESampler (Tree-structured Parzen Estimator)
- Pruner: MedianPruner
- Storage: SQLite for trial persistence
| Hyperparameter | Search Space |
|---|---|
| Learning Rate | [1×10⁻⁶, 5×10⁻⁵] (log-uniform) |
| Warmup Ratio | [0.05, 0.2] |
| Weight Decay | [0.0, 0.1] |
| Mini Batch Size | {1, 4, 8, 12} |
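To make the search space concrete, the sketch below draws trial configurations from it with a plain random sampler from the standard library. It is not Optuna's TPE sampler and does no pruning; it only illustrates what a single trial's configuration could look like, and `sample_trial` is a hypothetical helper.

```python
import math
import random


def sample_trial(rng: random.Random) -> dict:
    """Draw one configuration from the search space in the table above."""
    # Log-uniform: sample uniformly in log space, then exponentiate
    log_lr = rng.uniform(math.log(1e-6), math.log(5e-5))
    return {
        "learning_rate": math.exp(log_lr),
        "warmup_ratio": rng.uniform(0.05, 0.2),
        "weight_decay": rng.uniform(0.0, 0.1),
        "mini_batch_size": rng.choice([1, 4, 8, 12]),
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
trials = [sample_trial(rng) for _ in range(20)]  # 20 trials, as in the card
for trial in trials[:3]:
    print(trial)
```

Sampling the learning rate in log space gives each order of magnitude in the range the same weight, which a uniform draw would not.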
Final Training Hyperparameters
| Hyperparameter | Value | Description |
|---|---|---|
| Learning Rate | 5×10⁻⁵ | Nominal learning rate |
| Batch Size | 256 | Global batch size |
| Cache Mini-Batch | 4 | For CachedMultipleNegativesRankingLoss |
| Warmup Ratio | 0.16 | Linear LR warmup at start of each phase |
| Weight Decay | 0.02 | L2 regularization |
| Optimizer | AdamW | Standard HuggingFace Trainer optimizer |
| Precision | bf16 | Bfloat16 for Ampere+ architectures |
| Max Sequence Length | 2,048 | Maximum tokens processed |
Loss Function
- CachedMultipleNegativesRankingLoss: Enables training with large batches (256) without VRAM overflow, by recalculating embeddings in smaller sub-batches (cache size: 4).
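The objective behind this loss can be sketched in plain Python: for each query, the paired document is the positive and every other document in the batch acts as a negative. The sketch leaves out the caching itself, which only changes how embeddings are computed (in small chunks) and not the value of the loss. The scale of 20 and the cosine similarity are assumptions taken from the sentence-transformers defaults; the card does not state which values were used.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(
    queries: List[Sequence[float]],
    docs: List[Sequence[float]],
    scale: float = 20.0,  # assumed, not stated in the card
) -> float:
    """Mean cross-entropy where docs[i] is the positive for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, doc) for doc in docs]
        # Numerically stable log-sum-exp over all documents in the batch
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is closest to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives and negatives exchanged
print(in_batch_ranking_loss(queries, aligned))
print(in_batch_ranking_loss(queries, swapped))
```

The loss is close to zero when each query is nearest to its own document and grows when another document in the batch scores higher, which is why larger batches supply more (and harder) negatives.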
Training Framework
| Component | Details |
|---|---|
| Library | sentence-transformers |
| Distributed | DDP (Distributed Data Parallel) via torchrun |
| Memory optimization | Gradient Checkpointing |
| Logging | WandB (offline mode) |
Intended Use
Direct Use
This model is designed for semantic similarity and information retrieval tasks in the Spanish legal and administrative domain. Primary use cases include:
- Semantic search: Finding relevant legal documents from a query
- RAG pipelines: Generating context-enriched legal answers using retrieval-augmented generation
- Legal document clustering: Grouping similar legal texts by semantic content
- Duplicate detection: Identifying semantically similar legal clauses or articles
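As an illustration of the duplicate-detection use case, the sketch below flags pairs of embeddings whose cosine similarity reaches a threshold. The vectors are toy values standing in for model embeddings, and the 0.9 threshold and the function name are hypothetical; a suitable threshold would have to be tuned on real data.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_near_duplicates(
    embeddings: List[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cut-off
) -> List[Tuple[int, int, float]]:
    """Return (i, j, score) for every pair with cosine similarity >= threshold."""
    pairs = []
    for (i, u), (j, v) in combinations(enumerate(embeddings), 2):
        score = cosine(u, v)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy 3-dimensional vectors standing in for clause embeddings
toy_embeddings = [[1.0, 0.0, 0.0], [0.98, 0.05, 0.0], [0.0, 1.0, 0.0]]
for i, j, score in find_near_duplicates(toy_embeddings):
    print(i, j, round(score, 3))
```

The pairwise loop is quadratic in the number of texts, so a large collection would normally use an approximate nearest-neighbour index instead.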
Out-of-Scope Use
- General-domain retrieval (model is specialized for legal/administrative Spanish)
- Cross-lingual retrieval beyond Spanish
- Use as a generative model (this is an encoder-only model)
- Legal advice or binding interpretations of legal texts
How to Use
With sentence-transformers
from sentence_transformers import SentenceTransformer

# Load the bi-encoder from the Hugging Face Hub
model = SentenceTransformer("SINAI/ALIA-MrBERT-es-legal-administrative-embeddings")

queries = ["¿Cuáles son los requisitos para solicitar una prestación por desempleo?"]
documents = [
    "El trabajador que cese en su actividad laboral tendrá derecho a la prestación por desempleo...",
    "La prestación por desempleo contributiva se reconoce a quienes hayan cotizado al menos 360 días...",
]

# Queries are encoded with the "query" prompt; documents are encoded without a prompt
query_embeddings = model.encode(queries, prompt_name="query")
doc_embeddings = model.encode(documents)

# Similarity matrix of shape (num_queries, num_documents)
scores = model.similarity(query_embeddings, doc_embeddings)
print(scores)
With transformers (manual)
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "SINAI/ALIA-MrBERT-es-legal-administrative-embeddings"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["¿Cuáles son los plazos para interponer un recurso administrativo?"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])
Evaluation
The model was evaluated using the MTEB (Massive Text Embedding Benchmark) framework, adapted for the legal domain. The main reported metric is NDCG@10 (Normalized Discounted Cumulative Gain at k=10), which is the standard metric used in retrieval leaderboards and aligns with the metric reported in the MrBERT family.
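For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gain and that the ranked list carries a relevance label for every relevant document; it is an illustration, not the MTEB implementation, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (rank 1 is undiscounted)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document, retrieved at rank 1 and then at rank 3
print(ndcg_at_k([1, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```

Moving the only relevant document from rank 1 to rank 3 halves the score, because the gain at rank 3 is discounted by log2(4) = 2. The reported figure is the mean of this value over all queries.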
An additional evaluation was carried out with the ragas evaluation framework and the MiniMaxAI/MiniMax-M2.5 language model. Each ragas metric is the average of the per-pair scores over particular subsets of some of the following datasets.
Evaluation Datasets
| Dataset | Category | Description |
|---|---|---|
| QA | Retrieval | Spanish subset of the MIRACL dataset in MTEB format (jinaai/miracl-es) |
| STS | STS | Combination of three Spanish STS datasets: PAWS-X (google-research-datasets/paws-x), STS22 (mteb/sts22-crosslingual-sts), and SemRel2024 (SemRel/SemRel2024) |
| justicio | Retrieval | Subset of legal domain QA dataset for RAG evaluation derived from the Justicio corpus (dariolopez/justicio-rag-embedding-qa-tmp) and a subset based on an article from the Spanish Constitution (dariolopez/justicio-BOE-A-1978-31229-constitucion-by-articles-qa-qa-groq_llama3_70b_8192-sas) |
| ssld | Retrieval | Spanish legal language triplets dataset (wilfredomartel/small-spanish-legal-dataset) |
| pairs10k | Retrieval | Subset of 10K evaluation pairs (query + context) derived from SINAI/ALIA-es-legal-administrative-pairs |
| pairs1.5k | Retrieval | Subset of 1.5K evaluation pairs (query + context) derived from SINAI/ALIA-es-legal-administrative-pairs |
| triplets1.3k | Retrieval | Subset of 1.3K evaluation triplets (query + context + response) derived from SINAI/ALIA-es-legal-administrative-pairs |
Results (NDCG@10)
| Model | QA | STS | justicio | ssld | pairs10k | pairs1.5k |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 0.9839 | 0.4629 | 0.6521 | 0.6955 | 0.7200 | 0.8372 |
| Qwen/Qwen3-Embedding-0.6B | 0.9869 | 0.5033 | 0.6344 | 0.7228 | 0.7979 | 0.8874 |
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | 0.9419 | 0.4632 | 0.5396 | 0.3060 | 0.1655 | 0.2672 |
| ALIA-MrBERT-es-legal-administrative-embeddings (ours) | 0.9504 | 0.4787 | 0.6401 | 0.6409 | 0.8716 | 0.9383 |
BAAI/bge-m3 and Qwen/Qwen3-Embedding-0.6B are SOTA models roughly four times larger than ours. Even so, our model obtains comparable results overall and the highest NDCG@10 on the pairs10k and pairs1.5k subsets.
Results (ragas)
Subset: triplets_queries1300_contexts1300 (triplets1.3k)
| Model | Context Precision (avg) | Context Utilization (avg) | Context Relevance (avg) |
|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.8637 | 0.8636 | 0.8636 |
| ALIA-MrBERT-es-legal-administrative-embeddings (ours) | 0.9112 | 0.9140 | 0.9121 |
Subset: justicio_queries512_contexts154
| Model | Context Precision (avg) | Context Utilization (avg) | Context Relevance (avg) |
|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.8187 | 0.8168 | 0.9345 |
| ALIA-MrBERT-es-legal-administrative-embeddings (ours) | 0.8509 | 0.8570 | 0.9631 |
Subset: ssld_queries500_contexts11234
| Model | Context Relevance (avg) |
|---|---|
| Qwen/Qwen3-Embedding-0.6B | 0.8145 |
| ALIA-MrBERT-es-legal-administrative-embeddings (ours) | 0.7765 |
Note: Each evaluation subset is named following the pattern {dataset}_queries{N}_contexts{M}, where N is the number of queries evaluated against M contexts taken from the datasets.
Limitations and Biases
Known Limitations
- Domain specificity: The model is optimized for Spanish legal and administrative texts. Performance may degrade significantly on general-domain or other specialized texts.
- Language: Although MrBERT-es supports Spanish and English, this fine-tuned model focuses on Spanish legal content.
- Legal accuracy: Semantic similarity does not guarantee legal correctness. Retrieved documents should always be verified by qualified professionals.
- Context length: Despite supporting up to 8,192 tokens, very long documents may require chunking strategies for optimal retrieval performance.
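One common chunking strategy is a sliding window with overlap, sketched below on a list of tokens. The window of 512 tokens and the overlap of 64 are hypothetical values chosen for illustration, and the placeholder tokens stand in for the output of the model's tokenizer; the card does not recommend specific settings.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of max_len that share `overlap` tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for a tokenized legal document
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [512, 512, 304]
```

Each chunk would then be embedded separately, and a document can be scored by its best-matching chunk. The overlap reduces the chance that a relevant passage is cut in half at a window boundary.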
Biases
- The model may reflect biases present in the Spanish legal corpus used for training.
- It may underperform on legal texts from Latin American jurisdictions, as training focused on Spanish national legislation and administration.
Additional Information
License
Citation
If you use this model in your research, please cite:
@misc{ALIA-MrBERT-es-legal-administrative-embeddings,
title = {ALIA MrBERT Spanish Legal and Administrative Embeddings Model},
author = {SINAI Research Group, Universidad de Jaén},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/SINAI/ALIA-MrBERT-es-legal-administrative-embeddings}}
}
Please also cite the base model:
@misc{tamayo2026mrbertmodernmultilingualencoders,
title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
year={2026},
eprint={2602.21379},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.21379},
}
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ALIA.
Acknowledgments
This model was developed thanks to CEATIC (Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación) – UJA (Universidad de Jaén), which provided the necessary computational resources on its clusters.
Contact: ALIA Project - SINAI Research Group - Universidad de Jaén
More Information: SINAI Research Group | ALIA-UJA Project