GPT2-Large pre-trained on cleaned Dutch mC4 🇳🇱
A GPT2 large model (762M parameters) trained from scratch on Dutch, with perplexity 15.1 on cleaned Dutch mC4.
How To Use
You can use this GPT2-model directly with a pipeline for text generation.
MODEL_DIR='yhavinga/gpt2-large-dutch'
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model, tokenizer=tokenizer)
generated_text = generator('Het eiland West-', max_length=100, do_sample=True, top_k=40, top_p=0.95, repetition_penalty=2.0)
"Het eiland West-" - "Terschelling wordt sinds jaar en dag bewoond door de mens. De mensen die in het huidige Terherne wonen doen er alles aan om hun dorp te behouden voor deze diersoort, namelijk; een natuurreservaat dat vooral bestaat uit hoge duinen met lage begroeing waar planten van vroeger worden afgewisseld (zoals wilde hyacinten)en waarop grassen groeien waarvan sommige soorten zeldzame vormen hebben ontwikkeld: duinlelie of blauwe bosbes zijn bijvoorbeeld bekend vanwege onder andere kleurmole"
Tokenizer
- BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface Transformers Flax examples.
Dataset
This model was trained on the full configuration (33B tokens) of cleaned Dutch mC4, which is the original mC4, except that:
- Documents that contained words from a selection of the Dutch and English List of Dirty, Naughty, Obscene, and Otherwise Bad Words are removed
- Sentences with fewer than 3 words are removed
- Sentences with a word of more than 1000 characters are removed
- Documents with fewer than 5 sentences are removed
- Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
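The filtering rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual cleaning script: the sentence splitter, the case-insensitive matching, the order in which the rules are applied, and the placeholder `BAD_WORDS` set are all hypothetical choices made for the example.

```python
import re

# Phrases quoted in the list above; a document containing any of them is dropped.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)

# Hypothetical stand-in for the selection taken from the bad-words list.
BAD_WORDS = frozenset({"badword"})


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_document(text, bad_words=BAD_WORDS):
    """Return the cleaned text, or None if the whole document is dropped."""
    lowered = text.lower()
    # Document-level rules: banned phrases and bad words.
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    if any(word in bad_words for word in re.findall(r"\w+", lowered)):
        return None
    # Sentence-level rules: too few words, or one extremely long word.
    kept = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(words) < 3:
            continue
        if any(len(word) > 1000 for word in words):
            continue
        kept.append(sentence)
    # Assumption: the sentence count is checked after sentence filtering.
    if len(kept) < 5:
        return None
    return " ".join(kept)
```

Calling `clean_document(text)` on a raw mC4 document returns either the text with short or degenerate sentences stripped, or `None` when the document as a whole fails one of the rules.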
Models
TL;DR: yhavinga/gpt2-medium-dutch is the best model.
- The models with a/b in the steps column have been trained to step a of a total of b steps.
| model | architecture | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yhavinga/gpt-neo-125M-dutch | gpt neo | 125M | 512 | 20.9 | 3.04 | 128 | 1 | 190000/558608 | adam | 2.4e-3 | 1d 12h | full |
| yhavinga/gpt2-medium-dutch | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 1 | 320000/520502 | adam | 8e-4 | 7d 2h | full |
| yhavinga/gpt2-large-dutch | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
| yhavinga/gpt-neo-1.3B-dutch | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
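The numbers in the table can be cross-checked with a little arithmetic. The sketch below assumes that the reported loss is a mean cross-entropy in nats, so that perplexity is approximately exp(loss), and that every training step processes `batch size` sequences of exactly 512 tokens, which makes the token count an upper bound. The `summarize` helper is a name introduced here for illustration.

```python
import math

# (model, loss, ppl, batch size, steps done, total steps), copied from the table.
ROWS = [
    ("yhavinga/gpt-neo-125M-dutch", 3.04, 20.9, 128, 190000, 558608),
    ("yhavinga/gpt2-medium-dutch", 2.71, 15.1, 128, 320000, 520502),
    ("yhavinga/gpt2-large-dutch", 2.72, 15.1, 32, 1100000, 2082009),
    ("yhavinga/gpt-neo-1.3B-dutch", 2.77, 16.0, 16, 960000, 3049896),
]
SEQ_LEN = 512  # training sequence length from the table


def summarize(row):
    """Derive perplexity, training progress and tokens seen for one table row."""
    name, loss, _ppl, batch_size, done, total = row
    return {
        "model": name,
        "ppl_from_loss": math.exp(loss),           # assumes loss is in nats
        "fraction_trained": done / total,          # the a/b of the steps column
        "tokens_seen": done * batch_size * SEQ_LEN,  # assumes full-length sequences
    }


for row in ROWS:
    s = summarize(row)
    print(
        f"{s['model']}: exp(loss)={s['ppl_from_loss']:.1f}, "
        f"{s['fraction_trained']:.0%} of steps, "
        f"{s['tokens_seen'] / 1e9:.1f}B tokens"
    )
```

Under these assumptions, yhavinga/gpt2-large-dutch has completed roughly 53% of its planned steps and has seen about 18B tokens. The exp(loss) values agree with the ppl column to within the rounding of the two-decimal loss.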
Acknowledgements
This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and training the models:
- Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP
- HuggingFace Flax MLM examples
- gpt2-medium-persian
- gpt2-medium-indonesian
Created by Yeb Havinga