identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
- easy-to-read
- meaning preservation
- accessibility
- spanish
- text pair classification
headline: >-
Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
(E2R) adaptations.
description: >
RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
adapted}, it predicts whether the adaptation preserves the meaning of the
original. ⚠️ Deprecation notice (base model): fine-tuned from
PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
maintained Spanish RoBERTa models, see BSC-LT.
task:
- Text classification
- Pairwise classification
modelCategory:
- Supervised classification
language:
- es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 2025-09-25
dateModified: 2025-10-06
citation: >
Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
Preservation for Easy-to-Read in Spanish. Retrieved from
https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
- Trained for Spanish E2R; out-of-domain performance may degrade.
- >-
Binary labels compress nuanced cases; borderline adaptations may require
human review.
- Synthetic negatives do not cover all real-world human errors.
- Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
- Accuracy
- F1
- ROC-AUC
evaluationResults: |
80/20 stratified split (seed=42). Results on the held-out test set:
- Accuracy: 0.81
- F1: 0.84
- ROC-AUC: 0.83
softwareRequirements:
- python>=3.9
- torch>=2.0
- transformers>=4.40
- datasets>=2.18
storageRequirements:
- ~500 MB
memoryRequirements:
- >-
>= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
inference
operatingSystem:
- Linux
- macOS
- Windows
processorRequirements:
- x86_64 CPU (AVX recommended)
GPURequirements:
- >-
Not required for single-pair inference; CUDA GPU recommended for batch
processing
distribution:
- encodingFormat: ''
contentUrl: ''
contentSize: ''
quantizationBits: ''
quantizationMethod: ''
trainedOn:
- identifier: internal:e2r-positives
name: Expert-validated E2R pairs (Spanish)
description: >
Positive pairs (original↔adapted) from an existing corpus validated by
experts; used as the positive class.
url: ''
- identifier: internal:synthetic-negatives
name: Synthetic hard negatives (Spanish)
description: >
Negatives generated via sentence shuffle, dropout, mismatch (derangement),
paraphrase-with-distortion, and zero-shot NLI contradictions; trivial
pairs filtered by BLEU/ROUGE-L thresholds.
url: ''
testedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Stratified 80/20 split by Label (seed=42); pairwise tokenization up to 512
tokens.
evaluatedOn:
- identifier: internal:heldout-20
name: Held-out 20% stratified split
description: >
Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden’s J
(ROC).
validatedOn: ''
author:
- name: Isam Diab Lozano
identifier: https://orcid.org/0000-0002-3967-0672
- name: Mari Carmen Suárez-Figueroa
identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
- name: Comunidad de Madrid — PIPF-2022/COM-25762
identifier: ''
sharedBy:
- name: Ontology Engineering Group (UPM)
identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
- trainingRegion:
- name: Europe (West)
cloudProvider:
- name: ''
url: ''
duration: ''
hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
- name: Ontology Engineering Group
url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
- accuracy
- f1
- roc_auc
base_model:
- PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
- easy-to-read
- meaning-preservation
Model Card for RoBERTaSense-FACIL
RoBERTaSense-FACIL (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess meaning preservation in Easy-to-Read (E2R) adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation preserves the meaning of the original.
⚠️ Deprecation notice (base model): This model was fine-tuned from PlanTL-GOB-ES/roberta-base-bne. As of September 2025, this checkpoint is deprecated and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the BSC-LT organization: https://huggingface.co/BSC-LT
🚀 How to Use
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)
original = "El lobo, que parecía amable, engañó a Caperucita."
adapted = "El lobo parecía amable.
El lobo engañó a Caperucita."
# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
Suggested labels (adjust to your checkpoint):
{
"id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
"label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
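If a downloaded checkpoint does not already carry these label names in its config, they can be supplied at load time. A minimal sketch, assuming the mapping above matches the checkpoint's label order:
from transformers import AutoModelForSequenceClassification

# Assumed mapping; verify it against model.config before relying on it.
id2label = {0: "DOES_NOT_PRESERVE", 1: "PRESERVES_MEANING"}
label2id = {name: idx for idx, name in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    "oeg/RoBERTaSense-FACIL",
    id2label=id2label,
    label2id=label2id,
)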
Model Description
- Developed by: Ontology Engineering Group (UPM) / Authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
- Funded by: "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (grants for predoctoral researchers in training, 2022; Reference: PIPF-2022/COM-25762), awarded by the Comunidad Autónoma de Madrid (Spain)
- Model type: Encoder-only Transformer (RoBERTa) with a classification head
- Language: Spanish (es)
- License: Apache-2.0
- Finetuned from model: PlanTL-GOB-ES/roberta-base-bne (deprecated; see notice above)
Uses
Direct Use
- Automatic scoring of meaning preservation for Spanish Easy-to-Read adaptations.
- As a signal in content quality checks for accessibility pipelines.
Out-of-Scope Use
- Clinical, legal, or other high-stakes decisions without human expert oversight.
- Non-Spanish or out-of-domain texts without prior adaptation or re-training.
Bias, Risks, and Limitations
- Domain limitation: trained for Spanish E2R; performance may degrade on other genres/domains.
- Binary labels: compress nuanced cases; borderline adaptations may require human review.
- Synthetic negatives: not all human errors are covered by synthetic negative strategies.
- Base deprecation: the upstream base model is deprecated; security/robustness updates won’t be inherited.
Recommendations
- Calibrate probabilities (e.g., temperature scaling) and expose confidence scores; a sketch follows this list.
- Use threshold tuning (e.g., Youden’s J) to trade precision/recall for your setting.
- Keep a human-in-the-loop for critical use cases and periodic error audits.
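A minimal temperature-scaling sketch for the calibration recommendation above; it assumes you already have validation logits and gold labels as tensors (the variable names are illustrative, not part of this repository):
import torch

def fit_temperature(val_logits, val_labels):
    # Fit a single scalar T by minimizing NLL on a validation set.
    log_t = torch.zeros(1, requires_grad=True)           # optimize log(T) so T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference time: probs = (logits / T).softmax(-1)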
How to Get Started with the Model
See How to Use above. For pairwise inputs, encode as sentence pairs:
inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
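For batch scoring (e.g., on a GPU, as suggested in the hardware notes), a hedged sketch; the example pairs below are placeholders:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
tokenizer = AutoTokenizer.from_pretrained(repo)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(repo).to(device).eval()

originals = ["El lobo, que parecía amable, engañó a Caperucita."]
adaptations = ["El lobo parecía amable. El lobo engañó a Caperucita."]

# Tokenize all (original, adapted) pairs as one padded batch.
inputs = tokenizer(originals, adaptations, return_tensors="pt",
                   padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(probs.cpu().tolist())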
Training Details
Training Data
- Source: Spanish pairs (original - adapted) curated/validated by experts.
- Columns: text1 (original), text2 (adaptation), Label (0/1), neg_type.
- Labels: 1 = PRESERVES_MEANING, 0 = DOES_NOT_PRESERVE.
- Negative types used in training data construction: shuffle, dropout, mismatch (derangement), paraphrase_distortion, nli_contradiction.
- Split: 80/20, stratified by Label (random_state=42); see the sketch below.
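A sketch of the split described above, assuming the pairs are available as a CSV with the listed columns (the file name is hypothetical; the training data is not distributed with this model):
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("e2r_pairs.csv")   # columns: text1, text2, Label, neg_type (hypothetical file)

train_df, test_df = train_test_split(
    df,
    test_size=0.20,
    stratify=df["Label"],           # preserve the 0/1 class balance in both splits
    random_state=42,
)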
Training Procedure
Preprocessing
- Pair tokenization with truncation at 512 tokens:
tokenizer(text1, text2, truncation=True, max_length=512)
Training Hyperparameters
- Training regime: fp16 mixed precision (if supported; otherwise fp32)
- Arguments: num_train_epochs=5, per_device_train_batch_size=32, per_device_eval_batch_size=16, learning_rate=2e-5, weight_decay=0.01, warmup_ratio=0.1, evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, metric_for_best_model="f1" (see the sketch below)
- Optimizer: AdamW
- Loss: CrossEntropy (2 logits)
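A sketch of this setup with the Hugging Face Trainer; the dataset objects and the compute_metrics function are placeholders and are not shipped with the model:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="robertasense-facil",      # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,                            # only if a CUDA GPU with fp16 support is available
)

trainer = Trainer(
    model=model,                          # the sequence-classification model loaded above
    args=args,
    train_dataset=train_dataset,          # tokenized pair datasets (placeholders)
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,      # returns accuracy/f1 (placeholder)
)
trainer.train()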
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Held-out 20% stratified split of the curated E2R pairs.
Factors
- Report per-negative-type breakdown (e.g., performance on mismatch, paraphrase_distortion, etc.).
Metrics
- Accuracy, F1, ROC-AUC.
Results
- Accuracy: 0.81
- F1: 0.84
- ROC-AUC: 0.83
- Threshold tuned via Youden’s J for operating point selection (see the sketch below).
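A minimal sketch of these metrics and the Youden’s J operating point, assuming y_true holds the 0/1 labels and y_prob the predicted probability of PRESERVES_MEANING (names are illustrative):
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

def evaluate(y_true, y_prob):
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    threshold = thresholds[np.argmax(tpr - fpr)]      # Youden's J = TPR - FPR
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "threshold": float(threshold),
    }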
Technical Specifications
Model Architecture and Objective
- Encoder-only RoBERTa with a classification head (Linear(hidden → 2)).
- Objective: supervised cross-entropy on the binary label.
Citation
BibTeX:
@software{roberta_facil_2025,
title = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
year = {2025},
url = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}
APA: Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish. Hugging Face. https://huggingface.co/oeg/RoBERTaSense-FACIL