Norwegian
Hi BSC-LT team,
Great initiative! The language list on the model page includes Norwegian Nynorsk (nn) but not Norwegian Bokmål (nb). Is this a tagging mishap, or does the model really support only nn and not nb?
Thanks again for making this model!
Hi @exoplanet! Thanks for the question and for taking a look at the model.
The training data we use for Norwegian (from FineWeb2) contains text in both Norwegian written standards. Since Norwegian Bokmål is the more common standard, we listed it under the generic name Norwegian (no), while Norwegian Nynorsk is listed explicitly as nn.
So both variants are present in the training data. In terms of scale, we used 6,798,808,558 tokens for Norwegian Bokmål and 214,056,022 tokens for Norwegian Nynorsk, as detailed in our paper: https://arxiv.org/abs/2602.21379
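To put those two figures in proportion, here is a minimal Python sketch that computes each variant's share of the combined Norwegian token count. The dictionary labels and the `token_shares` helper are illustrative names, not part of the model or its tooling; only the two token counts come from the numbers quoted above.

```python
# Token counts quoted above; the labels are illustrative, not official tags.
TOKENS = {
    "Norwegian Bokmål (listed as no)": 6_798_808_558,
    "Norwegian Nynorsk (nn)": 214_056_022,
}


def token_shares(counts):
    """Return each entry's fraction of the combined token count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = token_shares(TOKENS)
for name, share in shares.items():
    # Print each variant's share as a percentage with two decimals.
    print(f"{name}: {share:.2%}")
```

By these counts, Nynorsk makes up roughly 3% of the Norwegian tokens, with Bokmål accounting for the rest.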
Thanks again for the interest in the model. Please feel free to reach out if you have any other questions.