---
identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
  - easy-to-read
  - meaning preservation
  - accessibility
  - spanish
  - text pair classification
headline: >-
  Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
  (E2R) adaptations.
description: >
  RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
  preservation in Easy-to-Read (E2R) adaptations. Given a pair
  {original, adapted}, it predicts whether the adaptation preserves the
  meaning of the original.

  ⚠️ Deprecation notice (base model): fine-tuned from
  PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
  maintained Spanish RoBERTa models, see BSC-LT.
task:
  - Text classification
  - Pairwise classification
modelCategory:
  - Supervised classification
language:
  - es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 25-09-2025
dateModified: 06-10-2025
citation: >
  Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL:
  Meaning Preservation for Easy-to-Read in Spanish. Retrieved from
  https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
  - Trained for Spanish E2R; out-of-domain performance may degrade.
  - >-
    Binary labels compress nuanced cases; borderline adaptations may require
    human review.
  - Synthetic negatives do not cover all real-world human errors.
  - Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
  - Accuracy
  - F1
  - ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Example results:
  - Accuracy: 0.81
  - F1: 0.84
  - ROC-AUC: 0.83
softwareRequirements:
  - python>=3.9
  - torch>=2.0
  - transformers>=4.40
  - datasets>=2.18
storageRequirements:
  - ~500 MB
memoryRequirements:
  - >-
    >= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
    inference
operatingSystem:
  - Linux
  - macOS
  - Windows
processorRequirements:
  - x86_64 CPU (AVX recommended)
GPURequirements:
  - >-
    Not required for single-pair inference; CUDA GPU recommended for batch
    processing
distribution:
  - encodingFormat: ''
    contentUrl: ''
    contentSize: ''
    quantizationBits: ''
    quantizationMethod: ''
trainedOn:
  - identifier: internal:e2r-positives
    name: Expert-validated E2R pairs (Spanish)
    description: >
      Positive pairs (original↔adapted) from an existing corpus validated by
      experts; used as the positive class.
    url: ''
  - identifier: internal:synthetic-negatives
    name: Synthetic hard negatives (Spanish)
    description: >
      Negatives generated via sentence shuffle, dropout, mismatch
      (derangement), paraphrase-with-distortion, and zero-shot NLI
      contradictions; trivial pairs filtered by BLEU/ROUGE-L thresholds.
    url: ''
testedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Stratified 80/20 split by Label (seed=42); pairwise tokenization up to
      512 tokens.
evaluatedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden's J
      (ROC).
validatedOn: ''
author:
  - name: Isam Diab Lozano
    identifier: https://orcid.org/0000-0002-3967-0672
  - name: Mari Carmen Suárez-Figueroa
    identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
  - name: Comunidad de Madrid (PIPF-2022/COM-25762)
    identifier: ''
sharedBy:
  - name: Ontology Engineering Group (UPM)
    identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
  - trainingRegion:
      - name: Europe (West)
    cloudProvider:
      - name: ''
        url: ''
    duration: ''
    hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
  - name: Ontology Engineering Group
    url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
  - accuracy
  - f1
  - roc_auc
base_model:
  - PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
  - easy-to-read
  - meaning-preservation
---

## Model Card for RoBERTaSense-FACIL

**RoBERTaSense-FACIL** (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess **meaning preservation** in **Easy-to-Read (E2R)** adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation **preserves** the meaning of the original.

⚠️ **Deprecation notice (base model):** This model was fine-tuned from `PlanTL-GOB-ES/roberta-base-bne`. As of September 2025, that checkpoint is **deprecated** and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the **BSC-LT** organization: https://huggingface.co/BSC-LT

---

## 🚀 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "oeg/RoBERTaSense-FACIL"
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

original = "El lobo, que parecía amable, engañó a Caperucita."
adapted = "El lobo parecía amable. El lobo engañó a Caperucita."

# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
```

**Suggested labels (adjust to your checkpoint):**

```json
{
  "id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
  "label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
```

---

## Model Description

* **Developed by:** Ontology Engineering Group (UPM). Authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
* **Funded by:** "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (Reference: PIPF-2022/COM-25762), Comunidad Autónoma de Madrid (Spain)
* **Model type:** Encoder-only Transformer (RoBERTa) with a classification head
* **Language:** Spanish (es)
* **License:** Apache-2.0
* **Finetuned from model:** `PlanTL-GOB-ES/roberta-base-bne` (deprecated; see notice above)

---

## Uses

### Direct Use

* Automatic scoring of **meaning preservation** for Spanish **Easy-to-Read** adaptations.
* As a signal in content quality checks for accessibility pipelines (see the batch-scoring sketch below).
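
The single-pair example in **How to Use** scores one adaptation at a time. For pipeline-style quality checks, batching pairs is usually more practical. The sketch below is illustrative only: it assumes the suggested `id2label` mapping above (label id 1 = `PRESERVES_MEANING`) and a hypothetical review threshold of 0.5; check `model.config.id2label` and tune the threshold for your checkpoint.

```python
from typing import List, Tuple

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "oeg/RoBERTaSense-FACIL"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()


def score_pairs(pairs: List[Tuple[str, str]], batch_size: int = 16) -> List[float]:
    """Return P(label id 1) for each (original, adapted) pair.

    Assumes label id 1 is PRESERVES_MEANING, as in the suggested mapping above.
    """
    scores: List[float] = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i : i + batch_size]
        originals = [o for o, _ in batch]
        adapted = [a for _, a in batch]
        # Encode each (original, adapted) pair jointly, padded to the longest in the batch
        inputs = tokenizer(originals, adapted, return_tensors="pt",
                           padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(-1)
        scores.extend(probs[:, 1].tolist())
    return scores


# Example: flag low-confidence adaptations for human review (threshold is illustrative)
pairs = [
    ("El lobo, que parecía amable, engañó a Caperucita.",
     "El lobo parecía amable. El lobo engañó a Caperucita."),
]
for (orig, adapt), p in zip(pairs, score_pairs(pairs)):
    flag = "REVIEW" if p < 0.5 else "OK"
    print(f"{flag}\t{p:.2f}\t{adapt}")
```
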
### Out-of-Scope Use

* Clinical, legal, or other high-stakes decisions without human expert oversight.
* Non-Spanish or out-of-domain texts without prior adaptation or re-training.

---

## Bias, Risks, and Limitations

* **Domain limitation:** trained for Spanish E2R; performance may degrade on other genres/domains.
* **Binary labels:** compress nuanced cases; borderline adaptations may require human review.
* **Synthetic negatives:** not all human errors are covered by synthetic negative strategies.
* **Base deprecation:** the upstream base model is deprecated; security/robustness updates won't be inherited.

### Recommendations

* Calibrate probabilities (e.g., temperature scaling) and expose confidence scores.
* Use threshold tuning (e.g., Youden's J) to trade precision/recall for your setting (see the sketch at the end of this card).
* Keep a **human-in-the-loop** for critical use cases and periodic error audits.

---

## How to Get Started with the Model

See **How to Use** above. For pairwise inputs, encode as sentence pairs:

```python
inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
```

---

## Training Details

### Training Data

* **Source:** Spanish pairs (*original - adapted*) curated/validated by experts.
* **Columns:** `text1` (original), `text2` (adaptation), `Label` (0/1), `neg_type`.
* **Labels:** `1 = PRESERVES_MEANING`, `0 = DOES_NOT_PRESERVE`.
* **Negative types** used in training data construction: `shuffle`, `dropout`, `mismatch` (derangement), `paraphrase_distortion`, `nli_contradiction`.
* **Split:** 80/20, stratified by `Label` (random_state=42).

### Training Procedure

#### Preprocessing

* Pair tokenization with truncation at 512 tokens:

```python
tokenizer(text1, text2, truncation=True, max_length=512)
```

#### Training Hyperparameters

* **Training regime:** fp16 mixed precision (if supported; otherwise fp32)
* **Arguments:**
  * `num_train_epochs=5`
  * `per_device_train_batch_size=32`
  * `per_device_eval_batch_size=16`
  * `learning_rate=2e-5`
  * `weight_decay=0.01`
  * `warmup_ratio=0.1`
  * `evaluation_strategy="epoch"`, `save_strategy="epoch"`
  * `load_best_model_at_end=True`, `metric_for_best_model="f1"`
* **Optimizer:** AdamW
* **Loss:** CrossEntropy (2 logits)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

* Held-out 20% stratified split of the curated E2R pairs.

#### Factors

* Per-negative-type breakdown (e.g., performance on `mismatch`, `paraphrase_distortion`, etc.).

#### Metrics

* Accuracy, F1, ROC-AUC.

### Results

* Accuracy: `0.81`
* F1: `0.84`
* ROC-AUC: `0.83`
* Threshold tuned via Youden's J for operating point selection.

## Technical Specifications

### Model Architecture and Objective

* Encoder-only RoBERTa with a classification head (`Linear(hidden → 2)`).
* Objective: supervised cross-entropy on binary labels.

---

## Citation

**BibTeX:**

```bibtex
@software{roberta_facil_2025,
  title  = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
  author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
  year   = {2025},
  url    = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}
```

**APA:**

Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). *RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish*. Hugging Face. [https://huggingface.co/oeg/RoBERTaSense-FACIL](https://huggingface.co/oeg/RoBERTaSense-FACIL)
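
---

## Appendix: Threshold Tuning Sketch (Youden's J)

The Recommendations and Evaluation sections mention selecting an operating threshold via Youden's J on the ROC curve. A minimal sketch of that idea is shown below; it is not the exact script used for this model. It assumes scikit-learn as an extra dependency, and the hypothetical arrays `y_true` / `y_score` stand in for held-out labels and positive-class probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve


def youden_j_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Pick the probability threshold that maximizes Youden's J = TPR - FPR.

    y_true  : binary labels of a held-out split (1 = PRESERVES_MEANING, assumed)
    y_score : model probabilities for the positive class
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr
    return float(thresholds[np.argmax(j)])


# Illustrative usage with dummy values; replace with real held-out scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.7, 0.6, 0.3, 0.55, 0.8, 0.2])
print(f"Operating threshold: {youden_j_threshold(y_true, y_score):.2f}")
```
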