Model Summary

OLaPh is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1. Its tokenizer was extended with 1,024 phoneme tokens, derived from a BPE tokenizer trained on phoneme sequences generated by the OLaPh framework.

The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 using the OLaPh framework.

Finetuned By: Institute for Information Systems at Hof University
Model type: Text-To-Text
Dataset: OLaPh Phonemization Dataset
Language(s): English, French, German, Spanish
License: Gemma (Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms)
Release Date: September 25, 2025

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # also supported: "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."

model_id = "iisys-hof/olaph"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# Stop at the EOS token or at the first full-stop token.
stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]

# The prompt names the source language, then gives the sentence and a "Phones:" cue.
prompt = f"Translate this from {lang} to Phones:\n{lang}: "
inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)

# The decoded text still contains the prompt: keep the last line and drop the "Phones:" cue.
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=False)
phonemized = phonemized.split("\n")[-1].replace("Phones:", "")

print(phonemized)
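
When phonemizing many sentences, it can help to keep the prompt construction and the output parsing in small helpers. The following is a minimal sketch, not part of the released code: the names SUPPORTED_LANGUAGES, build_prompt and extract_phones are hypothetical, and collapsing whitespace in the input and stripping the result are assumptions added so that the last-line parsing stays reliable.

```python
# Hypothetical helpers mirroring the prompt format used in the usage example.
SUPPORTED_LANGUAGES = ("English", "German", "French", "Spanish")


def build_prompt(sentence: str, lang: str = "English") -> str:
    """Return the full prompt string for one sentence."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {lang!r}")
    # Assumption: collapse newlines and repeated spaces so the sentence
    # occupies exactly one line of the prompt.
    sentence = " ".join(sentence.split())
    return f"Translate this from {lang} to Phones:\n{lang}: {sentence}\nPhones:"


def extract_phones(decoded: str) -> str:
    """Keep only the text after the final "Phones:" cue of the decoded output."""
    # Assumption: surrounding whitespace is not meaningful and can be stripped.
    return decoded.split("\n")[-1].replace("Phones:", "").strip()
```

The string returned by build_prompt can be passed to the tokenizer in place of the inline f-string, and extract_phones can be applied to the result of tokenizer.decode.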

Caveats

The model may sometimes emit full stops until max_new_tokens is reached instead of producing an EOS token. This behaviour is currently being examined.
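
Until this is resolved, one possible workaround is to post-process the output. The sketch below is an assumption about what such a cleanup could look like, not a fix provided by the authors: the hypothetical function trim_trailing_stops collapses a run of two or more full stops at the end of the string into a single full stop and leaves everything else untouched.

```python
import re

# Two or more full stops at the very end of the string, optionally
# separated or followed by whitespace.
_TRAILING_STOPS = re.compile(r"(?:\s*\.){2,}\s*$")


def trim_trailing_stops(phonemized: str) -> str:
    """Collapse a trailing run of full stops into a single full stop."""
    return _TRAILING_STOPS.sub(".", phonemized)
```

A single final full stop and full stops inside the string are not affected, so the function is safe to apply to every output.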

Citation

@misc{wirth2025olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer},
      author={Johannes Wirth},
      year={2025},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086},
}