Norwegian

by exoplanet - opened Mar 18

Mar 18

Hi BSC-LT team,
Great initiative! In the model page's language listing, there's Norwegian Nynorsk (nn), but not Norwegian Bokmål (nb). Is this a tagging mishap, or does the model really support nn but not nb?
Thanks again for making this model!

ilacunza

Language Technologies Laboratory @ Barcelona Supercomputing Center org Mar 18

edited Mar 18

Hi @exoplanet! Thanks for the question and for taking a look at the model.

The training data we use for Norwegian (from FineWeb2) actually contains text in both Norwegian written standards. Since Norwegian Bokmål is the more common written standard, we listed it under the name Norwegian (no), while we listed Norwegian Nynorsk explicitly as nn.

So both variants are present in the training data. In terms of scale, we used 6,798,808,558 tokens for Norwegian Bokmål and 214,056,022 tokens for Norwegian Nynorsk, as detailed in our paper: https://arxiv.org/abs/2602.21379
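
For a rough sense of the proportions, the two token counts quoted above can be compared with a short sketch. The variable names are illustrative only and are not taken from the paper or the model card.

```python
# Token counts as quoted in this thread (variable names are illustrative).
bokmaal_tokens = 6_798_808_558
nynorsk_tokens = 214_056_022

# Combined Norwegian token count across both written standards.
total_tokens = bokmaal_tokens + nynorsk_tokens

# Fraction of the Norwegian data that is Nynorsk, and the Bokmål-to-Nynorsk ratio.
nynorsk_share = nynorsk_tokens / total_tokens
ratio = bokmaal_tokens / nynorsk_tokens

print(f"Total Norwegian tokens: {total_tokens:,}")
print(f"Nynorsk share: {nynorsk_share:.2%}")
print(f"Bokmål-to-Nynorsk ratio: {ratio:.1f}")
```

By these figures, Nynorsk makes up roughly 3% of the Norwegian tokens, with about 32 Bokmål tokens for every Nynorsk token.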

Thanks again for the interest in the model. Please feel free to reach out if you have any other questions.

exoplanet

Mar 18

Hey @ilacunza, thanks for getting back to me quickly.

Great to hear that Norwegian is fully supported. I have a few other questions and will create separate threads for them, but first I'll have a look at your paper.

Cheers!

exoplanet changed discussion status to closed Mar 18
