---
identifier: https://huggingface.co/oeg/RoBERTaSense-FACIL
name: RoBERTaSense-FACIL
version: 0.1.0
keywords:
  - easy-to-read
  - meaning preservation
  - accessibility
  - spanish
  - text pair classification
headline: >-
  Spanish RoBERTa fine-tuned to assess meaning preservation in Easy-to-Read
  (E2R) adaptations.
description: >
  RoBERTaSense-FACIL is a Spanish RoBERTa model fine-tuned to assess meaning
  preservation in Easy-to-Read (E2R) adaptations. Given a pair {original,
  adapted}, it predicts whether the adaptation preserves the meaning of the
  original. ⚠️ Deprecation notice (base model): fine-tuned from
  PlanTL-GOB-ES/roberta-base-bne, which is deprecated as of 2025. For actively
  maintained Spanish RoBERTa models, see BSC-LT.
task:
  - Text classification
  - Pairwise classification
modelCategory:
  - Supervised classification
language:
  - es
license: apache-2.0
parameterSize: 125M
developmentStatus: Active
dateCreated: 2025-09-25
dateModified: 2025-10-06
citation: >
  Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning
  Preservation for Easy-to-Read in Spanish. Retrieved from
  https://huggingface.co/oeg/RoBERTaSense-FACIL
codeRepository: ''
referencePublication: ''
developmentLibrary: PyTorch + Transformers
usageInstructions: |
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  repo = "oeg/RoBERTaSense-FACIL"
  model = AutoModelForSequenceClassification.from_pretrained(repo)
  tokenizer = AutoTokenizer.from_pretrained(repo)

  original = "El lobo, que parecía amable, engañó a Caperucita."
  adapted  = "El lobo parecía amable. El lobo engañó a Caperucita."

  inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)
  with torch.no_grad():
      logits = model(**inputs).logits
  probs = logits.softmax(-1).squeeze().tolist()
  print({model.config.id2label[i]: probs[i] for i in range(len(probs))})
modelRisks:
  - Trained for Spanish E2R; out-of-domain performance may degrade.
  - >-
    Binary labels compress nuanced cases; borderline adaptations may require
    human review.
  - Synthetic negatives do not cover all real-world human errors.
  - Base model is deprecated; security/robustness updates will not be inherited.
evaluationMetrics:
  - Accuracy
  - F1
  - ROC-AUC
evaluationResults: |
  80/20 stratified split (seed=42). Example results:
    - Accuracy: 0.81
    - F1: 0.84
    - ROC-AUC: 0.83
softwareRequirements:
  - python>=3.9
  - torch>=2.0
  - transformers>=4.40
  - datasets>=2.18
storageRequirements:
  - ~500 MB
memoryRequirements:
  - >-
    >= 8 GB RAM (CPU inference), >= 12 GB VRAM recommended for large batch
    inference
operatingSystem:
  - Linux
  - macOS
  - Windows
processorRequirements:
  - x86_64 CPU (AVX recommended)
GPURequirements:
  - >-
    Not required for single-pair inference; CUDA GPU recommended for batch
    processing
distribution:
  - encodingFormat: ''
    contentUrl: ''
    contentSize: ''
    quantizationBits: ''
    quantizationMethod: ''
trainedOn:
  - identifier: internal:e2r-positives
    name: Expert-validated E2R pairs (Spanish)
    description: >
      Positive pairs (original↔adapted) from an existing corpus validated by
      experts; used as the positive class.
    url: ''
  - identifier: internal:synthetic-negatives
    name: Synthetic hard negatives (Spanish)
    description: >
      Negatives generated via sentence shuffle, dropout, mismatch (derangement),
      paraphrase-with-distortion, and zero-shot NLI contradictions; trivial
      pairs filtered by BLEU/ROUGE-L thresholds.
    url: ''
testedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Stratified 80/20 split by Label (seed=42); pairwise tokenization up to 512
      tokens.
evaluatedOn:
  - identifier: internal:heldout-20
    name: Held-out 20% stratified split
    description: >
      Metrics: Accuracy, F1, ROC-AUC; operating threshold tuned via Youden’s J
      (ROC).
validatedOn: ''
author:
  - name: Isam Diab Lozano
    identifier: https://orcid.org/0000-0002-3967-0672
  - name: Mari Carmen Suárez-Figueroa
    identifier: https://orcid.org/0000-0003-3807-5019
successorOf: ''
funder:
  - name: Comunidad de Madrid (PIPF-2022/COM-25762)
    identifier: ''
sharedBy:
  - name: Ontology Engineering Group (UPM)
    identifier: https://oeg.fi.upm.es/index.php/en/index.html
wasGeneratedBy:
  - trainingRegion:
      - name: Europe (West)
    cloudProvider:
      - name: ''
        url: ''
    duration: ''
    hardwareType: ''
fineTunedFromModel: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne
sdPublisher:
  - name: Ontology Engineering Group
    url: https://oeg.fi.upm.es/index.php/en/index.html
sdLicense: apache-2.0
metrics:
  - accuracy
  - f1
  - roc_auc
base_model:
  - PlanTL-GOB-ES/roberta-base-bne
pipeline_tag: text-classification
tags:
  - easy-to-read
  - meaning-preservation

---

Model Card for RoBERTaSense-FACIL

RoBERTaSense-FACIL (RoBERTa Fine-tuned for Accessible Comprehension In Language) is a Spanish RoBERTa model fine-tuned to assess meaning preservation in Easy-to-Read (E2R) adaptations. Given a pair of texts {original, adapted}, it predicts whether the adaptation preserves the meaning of the original.

⚠️ Deprecation notice (base model): This model was fine-tuned from PlanTL-GOB-ES/roberta-base-bne. As of September 2025, this checkpoint is deprecated and no longer actively maintained. For actively maintained Spanish RoBERTa models, please see the BSC-LT organization: https://huggingface.co/BSC-LT


🚀 How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "oeg/RoBERTaSense-FACIL"  
model = AutoModelForSequenceClassification.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

original = "El lobo, que parecía amable, engañó a Caperucita."
adapted  = "El lobo parecía amable. El lobo engañó a Caperucita."

# Encode the pair (original, adapted)
inputs = tokenizer(original, adapted, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(-1).squeeze().tolist()
print({model.config.id2label[i]: probs[i] for i in range(len(probs))})

Suggested labels (adjust to your checkpoint):

{
  "id2label": {"0": "DOES_NOT_PRESERVE", "1": "PRESERVES_MEANING"},
  "label2id": {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}
}
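
If a loaded checkpoint's config lacks these names, they can be set manually. A minimal sketch, continuing the usage snippet above and assuming the suggested binary mapping:

# Hypothetical override for a config without label names (adjust to your checkpoint).
model.config.id2label = {0: "DOES_NOT_PRESERVE", 1: "PRESERVES_MEANING"}
model.config.label2id = {"DOES_NOT_PRESERVE": 0, "PRESERVES_MEANING": 1}

# Most likely label for the encoded pair:
pred = int(logits.argmax(-1))
print(model.config.id2label[pred])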

Model Description

  • Developed by: Ontology Engineering Group (UPM); authors: Isam Diab Lozano and Mari Carmen Suárez-Figueroa
  • Funded by: "Ayudas para la contratación de personal investigador predoctoral en formación para el año 2022" (grants for hiring predoctoral researchers in training, 2022; reference: PIPF-2022/COM-25762), Comunidad Autónoma de Madrid (Spain)
  • Model type: Encoder-only Transformer (RoBERTa) with a classification head
  • Language: Spanish (es)
  • License: Apache-2.0
  • Finetuned from model: PlanTL-GOB-ES/roberta-base-bne (deprecated; see notice above)

Uses

Direct Use

  • Automatic scoring of meaning preservation for Spanish Easy-to-Read adaptations.
  • As a signal in content quality checks for accessibility pipelines (see the pipeline sketch after this list).
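
When the model feeds such a pipeline, the Transformers pipeline API accepts sentence pairs directly as {"text", "text_pair"} dicts. A minimal sketch:

from transformers import pipeline

# Load the checkpoint as a text-classification pipeline.
classify = pipeline("text-classification", model="oeg/RoBERTaSense-FACIL")

# Sentence pairs are passed as a dict with "text" and "text_pair" keys.
result = classify({
    "text": "El lobo, que parecía amable, engañó a Caperucita.",
    "text_pair": "El lobo parecía amable. El lobo engañó a Caperucita.",
})
print(result)  # label and score; label names depend on the checkpoint's id2label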

Out-of-Scope Use

  • Clinical, legal, or other high-stakes decisions without human expert oversight.
  • Non-Spanish or out-of-domain texts without prior adaptation or re-training.

Bias, Risks, and Limitations

  • Domain limitation: trained for Spanish E2R; performance may degrade on other genres/domains.
  • Binary labels: compress nuanced cases; borderline adaptations may require human review.
  • Synthetic negatives: the synthetic generation strategies do not cover all real-world human errors.
  • Base deprecation: the upstream base model is deprecated; security/robustness updates won’t be inherited.

Recommendations

  • Calibrate probabilities (e.g., temperature scaling; see the sketch after this list) and expose confidence scores.
  • Use threshold tuning (e.g., Youden’s J) to trade precision/recall for your setting.
  • Keep a human-in-the-loop for critical use cases and periodic error audits.
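
As an illustration of the calibration recommendation, below is a minimal temperature-scaling sketch in PyTorch. val_logits and val_labels are hypothetical tensors collected (under torch.no_grad()) on a validation set; they are not shipped with this repository:

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    """Fit a single temperature T by minimizing NLL on validation logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels); probs = (logits / T).softmax(-1)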

How to Get Started with the Model

See How to Use above. For pairwise inputs, encode as sentence pairs:

inputs = tokenizer(text_original, text_adapted, return_tensors="pt", truncation=True, max_length=512)
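
Batches of pairs can be encoded the same way by passing parallel lists; a sketch continuing the snippet from How to Use (the example pairs are illustrative):

originals   = ["El lobo engañó a Caperucita.", "La abuela vivía en el bosque."]
adaptations = ["El lobo engañó a Caperucita.", "La abuela vivía en la ciudad."]

# Parallel lists are encoded as sentence pairs; padding aligns lengths within the batch.
batch = tokenizer(originals, adaptations, return_tensors="pt",
                  truncation=True, max_length=512, padding=True)
with torch.no_grad():
    probs = model(**batch).logits.softmax(-1)
print(probs)  # one row of class probabilities per pair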

Training Details

Training Data

  • Source: Spanish pairs (original - adapted) curated/validated by experts.
  • Columns: text1 (original), text2 (adaptation), Label (0/1), neg_type.
  • Labels: 1 = PRESERVES_MEANING, 0 = DOES_NOT_PRESERVE.
  • Negative types used in training data construction: shuffle, dropout, mismatch (derangement), paraphrase_distortion, nli_contradiction.
  • Split: 80/20, stratified by Label (random_state=42); a split sketch follows below.
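
A minimal sketch of this split, assuming the pairs live in a CSV with the columns listed above (the file name is hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("e2r_pairs.csv")  # hypothetical path; columns: text1, text2, Label, neg_type

# 80/20 split, stratified by the binary Label, seed 42 as reported.
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df["Label"], random_state=42)
print(len(train_df), len(test_df))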

Training Procedure

Preprocessing

  • Pair tokenization with truncation at 512 tokens:
tokenizer(text1, text2, truncation=True, max_length=512)
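
Applied over a 🤗 Datasets dataset, the same call would look roughly as follows (a sketch; train_df is the hypothetical frame from the split sketch above, and tokenizer is the one loaded in How to Use):

from datasets import Dataset

ds = Dataset.from_pandas(train_df)

def encode(batch):
    # Pairwise tokenization with truncation at 512 tokens, as above.
    return tokenizer(batch["text1"], batch["text2"], truncation=True, max_length=512)

ds = ds.map(encode, batched=True)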

Training Hyperparameters

  • Training regime: fp16 mixed precision (if supported; otherwise fp32)

  • Arguments:

    • num_train_epochs=5
    • per_device_train_batch_size=32
    • per_device_eval_batch_size=16
    • learning_rate=2e-5
    • weight_decay=0.01
    • warmup_ratio=0.1
    • evaluation_strategy="epoch", save_strategy="epoch"
    • load_best_model_at_end=True, metric_for_best_model="f1"
  • Optimizer: AdamW

  • Loss: CrossEntropy (2 logits); a Trainer sketch with these settings follows below
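
Collected into a Hugging Face Trainer, these settings would look roughly as follows. This is a sketch under the stated hyperparameters, not the exact training script; output_dir, train_ds, and eval_ds are hypothetical:

import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # F1 drives best-checkpoint selection (metric_for_best_model="f1").
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"f1": f1_score(eval_pred.label_ids, preds)}

args = TrainingArguments(
    output_dir="robertasense-facil",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",      # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),   # mixed precision if supported, else fp32
)

trainer = Trainer(model=model, args=args,
                  tokenizer=tokenizer,            # enables dynamic padding per batch
                  compute_metrics=compute_metrics,
                  train_dataset=train_ds, eval_dataset=eval_ds)  # hypothetical tokenized datasets
trainer.train()  # AdamW is the Trainer default optimizer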

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Held-out 20% stratified split of the curated E2R pairs.

Factors

  • Report per-negative-type breakdown (e.g., performance on mismatch, paraphrase_distortion, etc.).

Metrics

  • Accuracy, F1, ROC-AUC.

Results

  • Accuracy: 0.81
  • F1: 0.84
  • ROC-AUC: 0.83
  • Threshold tuned via Youden's J for operating point selection (see the sketch below).
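
One way to reproduce that threshold selection, assuming arrays of true labels and positive-class probabilities from the held-out split (y_true and y_prob are illustrative names):

import numpy as np
from sklearn.metrics import roc_curve

# y_true: binary labels; y_prob: probability of PRESERVES_MEANING
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j_scores = tpr - fpr  # Youden's J = sensitivity + specificity - 1
best_threshold = thresholds[np.argmax(j_scores)]
print(f"operating threshold: {best_threshold:.3f}")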

Technical Specifications

Model Architecture and Objective

  • Encoder-only RoBERTa with a classification head (Linear(hidden → 2)).
  • Objective: supervised cross-entropy on binary label.

Citation

BibTeX:

@software{roberta_facil_2025,
  title  = {RoBERTaSense-FACIL: Meaning Preservation for Easy-to-Read in Spanish},
  author = {Diab Lozano, Isam and Suárez-Figueroa, Mari Carmen},
  year   = {2025},
  url    = {https://huggingface.co/oeg/RoBERTaSense-FACIL}
}

APA: Diab Lozano, I., & Suárez-Figueroa, M. C. (2025). RoBERTaSense-FACIL: Meaning preservation for Easy-to-Read in Spanish. Hugging Face. https://huggingface.co/oeg/RoBERTaSense-FACIL