---
library_name: transformers
tags:
- automatic-speech-recognition
- whisper
- fine-tuned
- peft
datasets:
- Vardis/Greek_Mosel
- mozilla-foundation/common_voice_11_0
- google/fleurs
language:
- el
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v2
---

# Fine-Tuned Whisper Large

This is a **Large-sized Whisper model** fine-tuned for Greek speech transcription. It has 1.5B parameters and achieves lower error rates than the medium-sized model. The fine-tuned weights are LoRA adapters that are loaded on top of the base model with PEFT.

- **WER:** 12.06%
- **CER:** 6.20%

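For readers unfamiliar with the two metrics: WER and CER are edit-distance error rates, computed over words and over characters respectively. The sketch below is a minimal pure-Python illustration of the standard definitions. It is not the evaluation script used for this model, and the helper names (`edit_distance`, `wer`, `cer`) and the example sentences are assumptions made for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentences only, not taken from the evaluation data
reference = "ο ασθενής έχει πυρετό"
hypothesis = "ο ασθενής είχε πυρετό"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Real evaluation pipelines usually normalize text first (casing, punctuation, accents), and that choice can shift both numbers noticeably.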
## Training Results

| Step | Training Loss | Validation Loss | WER | CER |
|-------|---------------|----------------|----------|----------|
| 250 | 0.1776 | 0.1904 | 13.52% | 6.74% |
| 500 | 0.1478 | 0.1698 | 12.55% | 6.38% |
| 750 | 0.1229 | 0.1608 | 12.33% | 6.24% |
| 1000 | 0.1057 | 0.1605 | 12.15% | 6.26% |
| 1250 | 0.0864 | 0.1630 | 12.65% | 6.65% |
| 1500 | 0.0677 | 0.1643 | 13.23% | 7.35% |
| 1750 | 0.0618 | 0.1681 | 12.86% | 6.86% |
| 2000 | 0.0533 | 0.1686 | 12.98% | 7.00% |

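In the table, training loss keeps falling through step 2000 while validation loss and WER stop improving around step 1000, which looks like the onset of overfitting. The short sketch below makes that explicit by locating the best checkpoint for each validation metric. The values are copied from the table; the variable names are illustrative only.

```python
# (step, training loss, validation loss, WER %, CER %) copied from the table
results = [
    (250, 0.1776, 0.1904, 13.52, 6.74),
    (500, 0.1478, 0.1698, 12.55, 6.38),
    (750, 0.1229, 0.1608, 12.33, 6.24),
    (1000, 0.1057, 0.1605, 12.15, 6.26),
    (1250, 0.0864, 0.1630, 12.65, 6.65),
    (1500, 0.0677, 0.1643, 13.23, 7.35),
    (1750, 0.0618, 0.1681, 12.86, 6.86),
    (2000, 0.0533, 0.1686, 12.98, 7.00),
]

# Pick the row that minimizes each validation metric
best_val_loss = min(results, key=lambda row: row[2])
best_wer = min(results, key=lambda row: row[3])
best_cer = min(results, key=lambda row: row[4])

print("Lowest validation loss at step", best_val_loss[0])  # 1000
print("Lowest WER at step", best_wer[0])                   # 1000
print("Lowest CER at step", best_cer[0])                   # 750
```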
## Model Details

- **Model Type:** Whisper (Large)
- **Fine-tuned From:** OpenAI Whisper Large-v2 (`openai/whisper-large-v2`)
- **Language(s):** Greek
- **Parameters:** 1.5B

## How to Use

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model, then attach the Greek fine-tuned LoRA weights
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model = PeftModel.from_pretrained(base_model, "Vardis/Whisper-Large-v2-Greek").to(device)
model.eval()
processor = WhisperProcessor.from_pretrained("Vardis/Whisper-Large-v2-Greek")

# Load your audio as a 1-D waveform sampled at 16 kHz (e.g., using librosa or torchaudio)
audio_input = ...

# Extract log-mel input features; Whisper's feature extractor expects 16 kHz audio
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_features.to(device)

# Generate transcription (keyword argument so the call passes cleanly through the PEFT wrapper)
with torch.no_grad():
    predicted_ids = model.generate(input_features=inputs)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```

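The `audio_input` placeholder in the snippet above needs a mono waveform at 16 kHz. As a rough, dependency-free sketch, the function below reads a 16-bit PCM WAV file with Python's standard library. The function name `load_wav_16k` is hypothetical, and the sketch does not resample; in practice librosa or torchaudio, as suggested in the comment above, are the more convenient choice because they also handle resampling and other formats.

```python
import array
import sys
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate Whisper's feature extractor works with


def load_wav_16k(path):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav_file.getframerate() != TARGET_SAMPLE_RATE:
            raise ValueError("expected 16 kHz audio; resample the file first")
        n_channels = wav_file.getnchannels()
        raw = wav_file.readframes(wav_file.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Average interleaved channels down to mono, then scale to floats
    mono = [
        sum(samples[i:i + n_channels]) / n_channels
        for i in range(0, len(samples), n_channels)
    ]
    return [value / 32768.0 for value in mono]
```

With a file of your own, `audio_input = load_wav_16k("dictation.wav")` (a placeholder file name) would yield a list of floats; converting it to a NumPy `float32` array before handing it to the processor is the more common convention.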
## Context / Reference

This model was developed as part of the work described in:

**Georgilas, V., Stafylakis, T. (2025). _Automatic Speech Recognition for Greek Medical Dictation_.**
The paper focuses on Greek medical ASR research in general and is **not primarily about the model itself**, but provides context for its development. Users are welcome to use the model freely for research and practical applications.

**BibTeX citation:**
```bibtex
@misc{georgilas2025greekasr,
title={Automatic Speech Recognition for Greek Medical Dictation},
author={Vardis Georgilas and Themos Stafylakis},
year={2025},
note={Available at: https://www.arxiv.org/abs/2509.23550}
}
```