ALIA MrBERT Spanish Legal and Administrative Embeddings Model

This repository contains ALIA MrBERT Spanish Legal and Administrative Embeddings, a Spanish legal domain bi-encoder model for semantic similarity and information retrieval tasks. It is built upon MrBERT-es, a bilingual (Spanish–English) foundational language model based on the ModernBERT architecture, and fine-tuned on domain-specific legal and administrative data using a Curriculum Learning strategy.

DISCLAIMER: This model is a domain-specific proof of concept designed to demonstrate retrieval capabilities in the Spanish legal and administrative domain. Although it is optimized for this domain, results should be verified against official legal sources. The model may fail on out-of-domain or adversarial inputs.

Model Details

Model Lineage

ModernBERT (architecture)
       ↓
  MrBERT-es (BSC-LT)
  Bilingual ES/EN encoder
  150M parameters
       ↓
  ALIA-MrBERT-es-legal-administrative-embeddings (SINAI)
  Legal domain fine-tuning
  Curriculum Learning + Hard Negatives

Key Features

🔍 Domain: Spanish legal and administrative texts
📐 Architecture: ModernBERT with Mean Pooling (bi-encoder)
📏 Long context: up to 8,192 tokens
🎓 Training strategy: Curriculum Learning (easy → medium → hard)
⚙️ Negative mining: Positive-Aware Hard Negative Mining (NVIDIA approach)

Architecture

This model uses the same base architecture as MrBERT-es, extended with a Mean Pooling layer for sentence-level embeddings:


Parameter	Value
Base Architecture	ModernBERT
Total Parameters	~150M
Hidden size	768
Intermediate size	1,152
Attention heads	12
Hidden layers	22
Context length	8,192 tokens
Vocabulary size	51,200
Precision	bfloat16
Positional encoding	RoPE
Activation function	GeLU
Attention type	Mixed (global every 3 layers + sliding window)
Pooling strategy	Mean Pooling
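
The mixed attention pattern in the table can be illustrated with a small sketch. It assumes that layer 0 uses global attention and that every third layer after it does too, with the remaining layers using sliding-window attention; the function name `attention_layout` is hypothetical and the exact layer assignment should be checked against the model configuration.

```python
def attention_layout(num_layers: int = 22, global_every: int = 3) -> list[str]:
    """Label each layer as 'global' or 'sliding_window'.

    Assumption: layer 0 is global, and so is every `global_every`-th layer after it.
    """
    return [
        "global" if index % global_every == 0 else "sliding_window"
        for index in range(num_layers)
    ]


layout = attention_layout()
for index, kind in enumerate(layout):
    print(f"layer {index:2d}: {kind}")
print("global layers:", layout.count("global"))
```

Under this assumption, 8 of the 22 layers attend globally and the other 14 use the local window.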

Training

Training Strategy: Curriculum Learning

The model was fine-tuned on the train split of SINAI/ALIA-es-legal-administrative-triplets using a two-phase Curriculum Learning strategy that progressively increases the difficulty of the training examples:

Phase	Epochs	Negative Type	Difficulty Progression
Phase 1	6	Random negatives	Easy → Medium → Hard
Phase 2	3	Hard negatives (mined)	Easy → Medium → Hard
Total	9	—	—

Phase 1 – Contrastive Learning with Random Negatives: Training uses {query, relevant_doc, [irrelevant_docs]} with in-batch negatives. Examples are sorted by difficulty across 3 sub-phases (2 epochs each).
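
A minimal sketch of how examples might be ordered into three sub-phases is shown below. The card does not say how difficulty is measured or whether each sub-phase sees only its own bucket, so the `difficulty` field, the function `curriculum_stages`, and the one-bucket-per-sub-phase schedule are all assumptions made for illustration.

```python
from typing import Callable


def curriculum_stages(
    examples: list[dict],
    difficulty: Callable[[dict], float],
    num_stages: int = 3,
) -> list[list[dict]]:
    """Sort examples from easy to hard and split them into contiguous buckets."""
    ordered = sorted(examples, key=difficulty)
    size, remainder = divmod(len(ordered), num_stages)
    stages, start = [], 0
    for index in range(num_stages):
        # Spread any remainder over the first buckets so sizes differ by at most one.
        end = start + size + (1 if index < remainder else 0)
        stages.append(ordered[start:end])
        start = end
    return stages


# Toy data: the "difficulty" scores are hypothetical placeholders.
examples = [
    {"query": "q1", "difficulty": 0.9},
    {"query": "q2", "difficulty": 0.1},
    {"query": "q3", "difficulty": 0.5},
    {"query": "q4", "difficulty": 0.3},
    {"query": "q5", "difficulty": 0.7},
    {"query": "q6", "difficulty": 0.2},
]
stages = curriculum_stages(examples, difficulty=lambda ex: ex["difficulty"])

# Two epochs per sub-phase, as in Phase 1 (3 sub-phases x 2 epochs = 6 epochs).
schedule = [
    (name, epoch, stage)
    for name, stage in zip(("easy", "medium", "hard"), stages)
    for epoch in (1, 2)
]
for name, epoch, stage in schedule:
    print(name, epoch, [ex["query"] for ex in stage])
```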

Phase 2 – Advanced Refinement with Hard Negatives: Refinement using mined hard negatives with Positive-Aware Mining (NVIDIA approach) to avoid false negatives. A candidate is only considered a negative if:

score < score_positive - margin   (margin = 0.05)
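
The filtering rule above can be sketched in a few lines. This is an illustration of the inequality only, not the mining pipeline used for training; the function `select_hard_negatives`, the `top_k` parameter, and the toy scores are hypothetical.

```python
from typing import Optional


def select_hard_negatives(
    candidate_scores: dict[str, float],
    positive_score: float,
    margin: float = 0.05,
    top_k: Optional[int] = None,
) -> list[str]:
    """Keep candidates whose score is below positive_score - margin.

    Candidates scoring at or above the threshold are treated as likely
    false negatives and discarded. The survivors are returned hardest first.
    """
    threshold = positive_score - margin
    kept = {doc: score for doc, score in candidate_scores.items() if score < threshold}
    ranked = sorted(kept, key=kept.get, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Toy similarity scores between one query and four candidate documents.
candidates = {"d1": 0.82, "d2": 0.78, "d3": 0.74, "d4": 0.40}
negatives = select_hard_negatives(candidates, positive_score=0.80)
print(negatives)
```

With a positive score of 0.80 the threshold is 0.75, so `d1` and `d2` are rejected as possible false negatives, while `d3` and `d4` are kept.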

Hyperparameter Optimization

Before training, hyperparameter search was conducted using Optuna (20 trials, subsets of 5,000 examples):

Sampler: TPESampler (Tree-structured Parzen Estimator)
Pruner: MedianPruner
Storage: SQLite for trial persistence

Hyperparameter	Search Space
Learning Rate	[1×10⁻⁶, 5×10⁻⁵] (log-uniform)
Warmup Ratio	[0.05, 0.2]
Weight Decay	[0.0, 0.1]
Mini Batch Size	{1, 4, 8, 12}

Final Training Hyperparameters

Hyperparameter	Value	Description
Learning Rate	5×10⁻⁵	Nominal learning rate
Batch Size	256	Global batch size
Cache Mini-Batch	4	For CachedMultipleNegativesRankingLoss
Warmup Ratio	0.16	Linear LR warmup at start of each phase
Weight Decay	0.02	Decoupled weight decay (AdamW)
Optimizer	AdamW	Standard HuggingFace Trainer optimizer
Precision	bf16	Bfloat16 for Ampere+ architectures
Max Sequence Length	2,048	Maximum tokens processed

Loss Function

CachedMultipleNegativesRankingLoss: Enables training with a large global batch (256) without exhausting VRAM by computing the embeddings in small mini-batches (size 4) instead of encoding the whole batch at once.
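
To illustrate the idea, the NumPy sketch below computes an in-batch negatives ranking loss and checks that encoding the batch in mini-batches gives the same loss as encoding it in one pass. It is not the `sentence-transformers` implementation: gradient caching is omitted, the linear "encoder" is a toy stand-in, and the scale of 20 is assumed from the library's default for this loss family.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def in_batch_ranking_loss(query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; document i is the positive for query i."""
    sims = scale * normalize(query_emb) @ normalize(doc_emb).T  # shape (B, B)
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def encode_in_chunks(encode, inputs: np.ndarray, mini_batch_size: int = 4) -> np.ndarray:
    """Encode a large batch a few rows at a time and concatenate the results."""
    chunks = [
        encode(inputs[start:start + mini_batch_size])
        for start in range(0, len(inputs), mini_batch_size)
    ]
    return np.concatenate(chunks, axis=0)


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))


def encode(x: np.ndarray) -> np.ndarray:
    return x @ weights  # toy stand-in for the transformer encoder


queries = rng.normal(size=(16, 8))
documents = queries + 0.1 * rng.normal(size=(16, 8))  # noisy positives

full_loss = in_batch_ranking_loss(encode(queries), encode(documents))
chunked_loss = in_batch_ranking_loss(
    encode_in_chunks(encode, queries), encode_in_chunks(encode, documents)
)
print(full_loss, chunked_loss)
```

With a global batch of 256 and a mini-batch size of 4, the same scheme would use 64 mini-batches per step while still contrasting each query against all 256 documents.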

Training Framework

Component	Details
Library	`sentence-transformers`
Distributed	DDP (Distributed Data Parallel) via `torchrun`
Memory optimization	Gradient Checkpointing
Logging	WandB (offline mode)

Intended Use

Direct Use

This model is designed for semantic similarity and information retrieval tasks in the Spanish legal and administrative domain. Primary use cases include:

Semantic search: Finding relevant legal documents from a query
RAG pipelines: Generating context-enriched legal answers using retrieval-augmented generation
Legal document clustering: Grouping similar legal texts by semantic content
Duplicate detection: Identifying semantically similar legal clauses or articles

Out-of-Scope Use

General-domain retrieval (model is specialized for legal/administrative Spanish)
Cross-lingual retrieval beyond Spanish
Use as a generative model (this is an encoder-only model)
Legal advice or binding interpretations of legal texts

How to Use

With `sentence-transformers`

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("SINAI/ALIA-MrBERT-es-legal-administrative-embeddings")

queries = ["¿Cuáles son los requisitos para solicitar una prestación por desempleo?"]
documents = [
    "El trabajador que cese en su actividad laboral tendrá derecho a la prestación por desempleo...",
    "La prestación por desempleo contributiva se reconoce a quienes hayan cotizado al menos 360 días...",
]

query_embeddings = model.encode(queries, prompt_name="query")
doc_embeddings = model.encode(documents)

scores = model.similarity(query_embeddings, doc_embeddings)
print(scores)

With `transformers` (manual)

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "SINAI/ALIA-MrBERT-es-legal-administrative-embeddings"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["¿Cuáles son los plazos para interponer un recurso administrativo?"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

embeddings = mean_pooling(outputs, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 768])

Evaluation

The model was evaluated using the MTEB (Massive Text Embedding Benchmark) framework, adapted for the legal domain. The main reported metric is NDCG@10 (Normalized Discounted Cumulative Gain at k=10), which is the standard metric used in retrieval leaderboards and aligns with the metric reported in the MrBERT family.
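
As a reference for how the metric behaves, here is a minimal sketch of NDCG@10 for a single query. It uses linear gain and a log2 discount, and it assumes the list passed in contains the relevance labels of all judged documents for the query; evaluation frameworks may differ in details such as gain function and tie handling. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document, retrieved at rank 1 versus rank 3.
print(ndcg_at_k([1, 0, 0]))
print(ndcg_at_k([0, 0, 1]))
```

A single relevant document at rank 1 scores 1.0, while the same document at rank 3 scores 1 / log2(4) = 0.5. The reported figure is the mean of this value over all queries.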

An additional evaluation was carried out with the ragas evaluation framework and the MiniMaxAI/MiniMax-M2.5 language model. These metrics are computed by averaging the per-pair scores over particular subsets of some of the following datasets.

Evaluation Datasets

Dataset	Category	Description
QA	Retrieval	Spanish subset of the MIRACL dataset in MTEB format (jinaai/miracl-es)
STS	STS	Combination of three Spanish STS datasets: PAWS-X (google-research-datasets/paws-x), STS22 (mteb/sts22-crosslingual-sts), and SemRel2024 (SemRel/SemRel2024)
justicio	Retrieval	Subset of legal domain QA dataset for RAG evaluation derived from the Justicio corpus (dariolopez/justicio-rag-embedding-qa-tmp) and a subset based on an article from the Spanish Constitution (dariolopez/justicio-BOE-A-1978-31229-constitucion-by-articles-qa-qa-groq_llama3_70b_8192-sas)
ssld	Retrieval	Spanish legal language triplets dataset (wilfredomartel/small-spanish-legal-dataset)
pairs10k	Retrieval	Subset of 10K evaluation pairs (query + context) derived from SINAI/ALIA-es-legal-administrative-pairs
pairs1.5k	Retrieval	Subset of 1.5K evaluation pairs (query + context) derived from SINAI/ALIA-es-legal-administrative-pairs
triplets1.3k	Retrieval	Subset of 1.3K evaluation triplets (query + context + response) derived from SINAI/ALIA-es-legal-administrative-pairs

Results (NDCG@10)

Model	QA	STS	justicio	ssld	pairs10k	pairs1.5k
BAAI/bge-m3	0.9839	0.4629	0.6521	0.6955	0.7200	0.8372
Qwen/Qwen3-Embedding-0.6B	0.9869	0.5033	0.6344	0.7228	0.7979	0.8874
sentence-transformers/paraphrase-multilingual-mpnet-base-v2	0.9419	0.4632	0.5396	0.3060	0.1655	0.2672
ALIA-MrBERT-es-legal-administrative-embeddings (ours)	0.9504	0.4787	0.6401	0.6409	0.8716	0.9383

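BAAI/bge-m3 and Qwen/Qwen3-Embedding-0.6B are state-of-the-art models roughly four times larger than ours. Even so, our model achieves comparable results on QA, STS, justicio, and ssld, and the best scores on pairs10k and pairs1.5k.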

Results (ragas)

Subset: `triplets_queries1300_contexts1300 (triplets1.3k)`

Model	Context Precision (avg)	Context Utilization (avg)	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.8637	0.8636	0.8636
ALIA-MrBERT-es-legal-administrative-embeddings (ours)	0.9112	0.9140	0.9121

Subset: `justicio_queries512_contexts154`

Model	Context Precision (avg)	Context Utilization (avg)	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.8187	0.8168	0.9345
ALIA-MrBERT-es-legal-administrative-embeddings (ours)	0.8509	0.8570	0.9631

Subset: `ssld_queries500_contexts11234`

Model	Context Relevance (avg)
Qwen/Qwen3-Embedding-0.6B	0.8145
ALIA-MrBERT-es-legal-administrative-embeddings (ours)	0.7765

Note: Each evaluation subset is named following the pattern {dataset}_queries{N}_contexts{M}, where N is the number of queries evaluated against M contexts taken from the datasets.
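
The naming pattern can be parsed mechanically, for example when collecting results across subsets. The following sketch assumes the dataset part never ends in `_queries<digits>`; the helper name `parse_subset_name` is hypothetical.

```python
import re

SUBSET_PATTERN = re.compile(
    r"^(?P<dataset>.+)_queries(?P<queries>\d+)_contexts(?P<contexts>\d+)$"
)


def parse_subset_name(name: str) -> dict:
    """Split '{dataset}_queries{N}_contexts{M}' into its three components."""
    match = SUBSET_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected subset name: {name!r}")
    return {
        "dataset": match["dataset"],
        "queries": int(match["queries"]),
        "contexts": int(match["contexts"]),
    }


for name in (
    "triplets_queries1300_contexts1300",
    "justicio_queries512_contexts154",
    "ssld_queries500_contexts11234",
):
    print(parse_subset_name(name))
```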

Limitations and Biases

Known Limitations

Domain specificity: The model is optimized for Spanish legal and administrative texts. Performance may degrade significantly on general-domain or other specialized texts.
Language: Although MrBERT-es supports Spanish and English, this fine-tuned model focuses on Spanish legal content.
Legal accuracy: Semantic similarity does not guarantee legal correctness. Retrieved documents should always be verified by qualified professionals.
Context length: Despite supporting up to 8,192 tokens, very long documents may require chunking strategies for optimal retrieval performance.
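
One possible chunking strategy for long documents is a sliding window over token IDs with some overlap between consecutive chunks. The sketch below is only an illustration: the function `chunk_token_ids` is hypothetical, and the window of 512 tokens with an overlap of 64 is an arbitrary choice, not a recommendation from this card.

```python
def chunk_token_ids(token_ids: list[int], max_tokens: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split a token sequence into windows of at most max_tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the current window already reaches the end of the document
    return chunks


# Toy document of 1,000 token IDs.
chunks = chunk_token_ids(list(range(1000)))
print([(chunk[0], chunk[-1], len(chunk)) for chunk in chunks])
```

Each chunk can then be embedded separately, and a document can be scored by its best-matching chunk.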

Biases

The model may reflect biases present in the Spanish legal corpus used for training.
It may underperform on legal texts from Latin American jurisdictions, as training focused on Spanish national legislation and administration.

Additional Information

License

Apache License, Version 2.0

Citation

If you use this model in your research, please cite:

@misc{ALIA-MrBERT-es-legal-administrative-embeddings,
  title        = {ALIA MrBERT Spanish Legal and Administrative Embeddings Model},
  author       = {SINAI Research Group, Universidad de Jaén},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/SINAI/ALIA-MrBERT-es-legal-administrative-embeddings}}
}

Please also cite the base model:

@misc{tamayo2026mrbertmodernmultilingualencoders,
      title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation}, 
      author={Daniel Tamayo and Iñaki Lacunza and Paula Rivera-Hidalgo and Severino Da Dalt and Javier Aula-Blasco and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2026},
      eprint={2602.21379},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.21379}, 
}

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ALIA.

Acknowledgments

This model was trained thanks to CEATIC (Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación) – UJA (Universidad de Jaén), which provided the necessary computational resources on its clusters.

Contact: ALIA Project - SINAI Research Group - Universidad de Jaén

More Information: SINAI Research Group | ALIA-UJA Project
