| ## Overview | |
| This document provides guidelines for evaluating the fluency of responses generated by Norwegian language models. Annotators compare pairs of responses (Response A and Response B) and decide which response is more fluent, or whether the two are equally fluent. | |
| The evaluation focuses exclusively on language quality, naturalness, and grammaticality. Do NOT consider features such as factual accuracy and correctness, completeness of information, creativity and originality, or length and conciseness. | |
| ## Definitions | |
| #### What is fluency? | |
| Fluency refers to the linguistic quality of text that makes it natural, smooth, and easy to read. A fluent text should read as if it were written by a native speaker. It should consistently use either Bokmål or Nynorsk (depending on the prompt), and should sound genuinely Norwegian rather than as if it were translated from another language. | |
| #### Fluency issues to look for | |
| When evaluating fluency, pay attention to: | |
| 1. **Grammar errors**: Agreement errors (e.g. adjective-noun or determiner-noun disagreement), incorrect verb tense, incorrect word order (violating the V2 requirement), wrong word forms | |
| 2. **Awkward phrasing**: Unnatural word order, stilted expressions, robotic language | |
| 3. **Punctuation problems**: Missing or incorrect punctuation that affects readability | |
| 4. **Word choice issues**: Inappropriate vocabulary, incorrect word usage, repetitive language, wrong use of idioms or phrases, incorrect spacing or formation of compound words ("kaffe kopp" vs "kaffekopp"), preposition errors ("på" vs "i") | |
| 5. **Flow disruptions**: Abrupt transitions, disconnected ideas within sentences | |
| 6. **Spelling errors**: Typos and misspellings, wrong capitalization, incorrect use of diacritics (e.g. "å" vs "a", "ø" vs "o") | |
| 7. **Translationese**: A common problem with language models is that they base their output on English, the majority language in their training data. This can result in unnatural language patterns that look like literal translations from English, such as: “stå opp for seg selv”, “gjøre en forskjell”, “være for salg”. | |
| ## Annotation procedure | |
| #### Step-by-step process | |
| 1. **Read the prompt**: Do not analyze the fluency of the prompt, but look at it to understand the context and language style. | |
| 2. **Read both responses completely** without making immediate judgments | |
| 3. **Identify fluency issues** in each response using the criteria above; ignore content accuracy and relevance | |
| 4. **Compare the severity and frequency** of fluency issues between responses | |
| 5. **Make your decision** based on overall fluency | |
| #### Decision options | |
| You must select one of three options: | |
| - **A is more fluent**: Response A has better overall language quality than Response B | |
| - **B is more fluent**: Response B has better overall language quality than Response A | |
| - **Equally fluent**: Both responses have similar language quality (any differences are too minor to clearly favor either response) | |
| #### Important guidelines | |
| - **Minor differences matter**: Even small improvements in fluency should influence your decision | |
| - **Be consistent**: Apply the same standards across all evaluations | |
| - **When in doubt about equality**: If you cannot decisively determine which is better after careful analysis, select "Equally fluent" | |
| ## Examples | |
| Here are some examples of texts that should not be considered fluent Norwegian: | |
| - "Vi kan også prøve å finne måter å gjøre oppgavene dine mer overskuelige og gi deg mer tid til å gjøre dem på." (word choice) | |
| - "skrivemappa din" (agreement) | |
| - "en elsket medlem av kongefamilien" (agreement) | |
| - "jeg vil se deg neste gang" (English-influenced translationese; "sees neste gang" would be more fluent) | |
| - "banal hjertroman" (compound) | |
| - "den første konge" (double definiteness) | |
| ## Edge cases and special considerations | |
| - **Languages other than Norwegian**: If one of the responses is written wholly or partly in another language (e.g. English), it should be considered less fluent than the Norwegian response, regardless of its quality. | |
| - **Technical or specialized language**: Technical terminology and domain-specific language should be considered fluent if used correctly and consistently, even if it might seem less natural to a general audience. | |
| - **Formatting issues**: Ignore formatting differences (bold, italics, bullet points) unless they directly impact readability or sentence structure. | |
| - **Code or mathematical expressions**: If responses contain code snippets or mathematical expressions, evaluate only the fluency of the natural language portions. | |
| - **When in doubt, ask us :)** |
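
For the edge case about code or mathematical expressions, the following is a minimal sketch of how the non-language portions of a response might be masked before reading it for fluency. It is not part of any annotation tool: the function name `mask_non_language`, the placeholders `[CODE]` and `[MATH]`, and the assumption that responses use markdown backticks and `$$...$$` display math are all hypothetical. Inline math written with single dollar signs is deliberately left alone, since a simple pattern would also catch currency amounts.

```python
import re

# Hypothetical patterns; they assume markdown-style responses.
FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)    # fenced code blocks
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)   # $$ ... $$ display math
INLINE_CODE = re.compile(r"`[^`\n]+`")                 # `inline code`


def mask_non_language(response: str) -> str:
    """Replace code and display math with placeholders.

    Placeholders are used instead of deletion so that the surrounding
    sentence structure stays visible to the annotator.
    """
    # Fenced blocks are masked first so their backticks are not
    # mistaken for inline code.
    text = FENCED_CODE.sub("[CODE]", response)
    text = DISPLAY_MATH.sub("[MATH]", text)
    text = INLINE_CODE.sub("[CODE]", text)
    return text


if __name__ == "__main__":
    fence = "`" * 3
    sample = (
        "Du kan bruke funksjonen `len` slik:\n"
        + fence + "python\nprint(len('hei'))\n" + fence
        + "\nDen returnerer lengden."
    )
    print(mask_non_language(sample))
```

Only the remaining Norwegian text would then be judged against the fluency criteria; the masked spans themselves should not count for or against either response.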