---
language:
- eo
- en
- es
- ca
tags:
- translation
- machine-translation
- marian
- opus-mt
- multilingual
license: cc-by-4.0
pipeline_tag: translation
metrics:
- bleu
- chrf
---
# Catalan, English, Spanish -> Esperanto MT Model
## Model description
This repository contains a **multilingual MarianMT** model for **(English, Spanish, Catalan) → Esperanto** translation.
## Usage
Load and run the model with `transformers`:
```python
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-caenes-eo"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer, moving the model to the GPU when available.
model = MarianMTModel.from_pretrained(model_name).to(device)
tokenizer = MarianTokenizer.from_pretrained(model_name)

source_texts = [
    "Buenos días, qué tal?",
    "Bon dia, com estàs?",
    "Good morning, how are you?",
]

# Tokenize as one padded batch and move the tensors to the model's device.
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Pass the attention mask together with the input IDs so padding is ignored.
translated_ids = model.generate(**inputs)
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```
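
The example above translates three sentences as a single batch. For larger inputs, one option is to feed the model fixed-size chunks so that memory use stays bounded. The sketch below is illustrative only: `chunked`, `translate_all`, and the `translate_batch` callable are hypothetical names, not part of `transformers`, and the default batch size of 16 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(
    texts: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # assumed value; tune to the available memory
) -> List[str]:
    """Apply `translate_batch` chunk by chunk, preserving input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        results.extend(translate_batch(batch))
    return results
```

Here `translate_batch` would wrap the tokenize, generate, and decode steps from the usage example, taking a list of source strings and returning a list of translations.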
## Training data
The model was trained using **Tatoeba** parallel data, with **FLORES-200** used as the development set.
Training sentence-pair counts:
* **ca-eo**: 672,931
* **es-eo**: 4,677,945
* **eo-en**: 5,000,000
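
To make the balance of the training mix easier to see, the counts above can be turned into shares of the total. This is plain arithmetic on the listed numbers; the variable names are illustrative.

```python
# Sentence-pair counts copied from the list above.
pair_counts = {"ca-eo": 672_931, "es-eo": 4_677_945, "eo-en": 5_000_000}

total = sum(pair_counts.values())
shares = {pair: count / total for pair, count in pair_counts.items()}

for pair, share in shares.items():
    print(f"{pair}: {share:.1%}")
print(f"total: {total:,}")
```

Under these counts the total is 10,350,876 sentence pairs, of which Catalan accounts for roughly 6.5%, Spanish for about 45.2%, and English for about 48.3%.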
## Evaluation on FLORES
| Language Pair | BLEU | ChrF++ |
| ------------- | ----: | ----: |
| spa-epo | 16.25 | 49.10 |
| cat-epo | 21.43 | 51.37 |
| eng-epo | 26.42 | 58.23 |