Create guidelines.md
guidelines.md ADDED (+89 -0)
|
@@ -0,0 +1,89 @@
| 1 |
+
## Overview
|
| 2 |
+
|
| 3 |
+
This document provides guidelines for evaluating the fluency of Norwegian responses generated by language models. Annotators compare pairs of responses (Response A and Response B) and decide which response is more fluent, or whether the two are equally fluent. The evaluation focuses exclusively on language quality, naturalness, and grammaticality.
|
| 4 |
+
|
| 5 |
+
## Key principle
|
| 6 |
+
|
| 7 |
+
**Fluency evaluation is strictly limited to linguistic quality.** Do NOT consider:
|
| 8 |
+
- Factual accuracy or correctness
|
| 9 |
+
- Completeness of information
|
| 10 |
+
- Creativity or originality
|
| 11 |
+
- Formatting or structure (unless it affects readability)
|
| 12 |
+
- Length or conciseness
|
| 13 |
+
|
| 14 |
+
## Definitions
|
| 15 |
+
|
| 16 |
+
### What is fluency?
|
| 17 |
+
|
| 18 |
+
Fluency refers to the linguistic quality of text that makes it natural, smooth, and easy to read. A fluent response is:
|
| 19 |
+
|
| 20 |
+
- **Grammatically correct**: Follows standard grammar rules with proper syntax
|
| 21 |
+
- **Natural-sounding**: Reads like something a native speaker would write
|
| 22 |
+
- **Coherent**: Maintains logical flow between sentences and paragraphs
|
| 23 |
+
- **Well-formed**: Uses appropriate vocabulary, punctuation, and sentence structure
|
| 24 |
+
- **Smooth**: Flows naturally without awkward phrasing or jarring transitions
|
| 25 |
+
- **Norwegian**: The prompts are in Norwegian, so responses should always be written in either Norwegian Bokmål or Norwegian Nynorsk
|
| 26 |
+
|
| 27 |
+
### Fluency issues to look for
|
| 28 |
+
|
| 29 |
+
When evaluating fluency, pay attention to:
|
| 30 |
+
|
| 31 |
+
1. **Grammar errors**: Agreement errors (e.g. adjective-noun or determiner-noun disagreement), incorrect verb tense, incorrect word order (violating the V2 requirement, which places the finite verb second in main clauses), wrong word forms
|
| 32 |
+
2. **Awkward phrasing**: Unnatural word order, stilted expressions, robotic language
|
| 33 |
+
3. **Punctuation problems**: Missing or incorrect punctuation that affects readability
|
| 34 |
+
4. **Word choice issues**: Inappropriate vocabulary, incorrect word usage, repetitive language
|
| 35 |
+
5. **Flow disruptions**: Abrupt transitions, disconnected ideas within sentences
|
| 36 |
+
6. **Spelling errors**: Typos and misspellings that affect readability
|
| 37 |
+
7. **Translationese**: Language models often base their output on English, the majority language in their training data. This can produce unnatural patterns that read like literal translations from English, such as “stå opp for seg selv” (from "stand up for oneself"), “gjøre en forskjell” ("make a difference"), and “være for salg” ("be for sale").
|
| 38 |
+
|
| 39 |
+
## Annotation procedure
|
| 40 |
+
|
| 41 |
+
### Step-by-step process
|
| 42 |
+
|
| 43 |
+
1. **Read both responses completely** without making immediate judgments
|
| 44 |
+
2. **Focus solely on language quality** - ignore content accuracy and relevance
|
| 45 |
+
3. **Identify fluency issues** in each response using the criteria above
|
| 46 |
+
4. **Compare the severity and frequency** of fluency issues between responses
|
| 47 |
+
5. **Make your decision** based on overall fluency
|
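Steps 3 to 5 above can be supported by keeping structured notes while reading. The following is a minimal sketch, not part of the official procedure: the `Issue` record, the 1 to 3 severity scale, and the `summarize` helper are all hypothetical names and conventions introduced here for illustration. The category names mirror the seven issue types listed under "Fluency issues to look for".

```python
from collections import Counter
from dataclasses import dataclass

# The seven issue types from "Fluency issues to look for".
CATEGORIES = {
    "grammar", "awkward_phrasing", "punctuation", "word_choice",
    "flow", "spelling", "translationese",
}


@dataclass(frozen=True)
class Issue:
    """One fluency issue noted in a response (hypothetical note format)."""
    category: str
    severity: int  # assumed scale: 1 = minor, 2 = moderate, 3 = severe
    excerpt: str   # the text span where the issue occurs


def summarize(issues):
    """Return (issue count per category, summed severity) for one response."""
    for issue in issues:
        if issue.category not in CATEGORIES:
            raise ValueError(f"unknown category: {issue.category!r}")
    counts = Counter(issue.category for issue in issues)
    total_severity = sum(issue.severity for issue in issues)
    return counts, total_severity


# Illustrative notes for one pair of responses.
issues_a = [Issue("translationese", 2, "gjøre en forskjell")]
issues_b = [
    Issue("grammar", 3, "I dag jeg går på skolen"),  # V2 violation
    Issue("spelling", 1, "skolne"),
]
counts_a, severity_a = summarize(issues_a)
counts_b, severity_b = summarize(issues_b)
```

The totals are only a memory aid for comparing frequency and severity; the final decision should still rest on the annotator's overall impression of each response.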
| 48 |
+
|
| 49 |
+
### Decision options
|
| 50 |
+
|
| 51 |
+
You must select one of three options:
|
| 52 |
+
|
| 53 |
+
- **A is more fluent**: Response A has better overall language quality than Response B
|
| 54 |
+
- **B is more fluent**: Response B has better overall language quality than Response A
|
| 55 |
+
- **Equal fluency**: Both responses have similar language quality (minor differences that don't clearly favor either response)
|
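If annotations are stored or processed programmatically, the three options above map naturally onto a closed set of labels. The sketch below is one possible representation; the class names, the string values `"A"`, `"B"`, and `"equal"`, and the `Annotation` schema are assumptions, not a format defined by these guidelines.

```python
from dataclasses import dataclass
from enum import Enum


class FluencyLabel(Enum):
    """The three decision options; the string values are hypothetical."""
    A_MORE_FLUENT = "A"
    B_MORE_FLUENT = "B"
    EQUAL_FLUENCY = "equal"


@dataclass(frozen=True)
class Annotation:
    """One annotator's judgment for one response pair (hypothetical schema)."""
    pair_id: str
    label: FluencyLabel
    note: str = ""


def parse_label(raw: str) -> FluencyLabel:
    """Map a raw string to a label, rejecting anything outside the three options."""
    try:
        return FluencyLabel(raw.strip())
    except ValueError:
        raise ValueError(f"unknown fluency label: {raw!r}") from None


# Example: record a judgment that Response B is more fluent.
annotation = Annotation(pair_id="pair-001", label=parse_label("B"))
```

Using an enum rather than free text means a mistyped label fails immediately instead of silently becoming a fourth category.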
| 56 |
+
|
| 57 |
+
### Important guidelines
|
| 58 |
+
|
| 59 |
+
- **Minor differences matter**: Even small improvements in fluency should influence your decision
|
| 60 |
+
- **Be consistent**: Apply the same standards across all evaluations
|
| 61 |
+
- **When in doubt**: If, after careful analysis, you cannot determine which response is better, select "Equal fluency"
|
| 62 |
+
|
| 63 |
+
## Examples
|
| 64 |
+
|
| 65 |
+
### Example 1: Clear fluency difference
|
| 66 |
+
|
| 67 |
+
TODO
|
| 68 |
+
|
| 69 |
+
### Example 2: Equal fluency
|
| 70 |
+
|
| 71 |
+
TODO
|
| 72 |
+
|
| 73 |
+
### Example 3: Subtle fluency difference
|
| 74 |
+
|
| 75 |
+
TODO
|
| 76 |
+
|
| 77 |
+
### Example 4: Content vs. fluency
|
| 78 |
+
|
| 79 |
+
TODO
|
| 80 |
+
|
| 81 |
+
## Edge cases and special considerations
|
| 82 |
+
|
| 83 |
+
TODO
|
| 84 |
+
|
| 85 |
+
**Technical or specialized language**: Technical terminology and domain-specific language should be considered fluent if used correctly and consistently, even if it might seem less natural to a general audience.
|
| 86 |
+
|
| 87 |
+
**Formatting issues**: Ignore formatting differences (bold, italics, bullet points) unless they directly impact readability or sentence structure.
|
| 88 |
+
|
| 89 |
+
**Code or mathematical expressions**: If responses contain code snippets or mathematical expressions, evaluate only the fluency of the natural language portions.
|
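For tooling that prepares responses for annotation, the natural language portions can be isolated by masking code and math first. The sketch below assumes responses use Markdown-style backtick fences and dollar-delimited LaTeX math, which these guidelines do not specify; `natural_language_only` is a hypothetical helper name.

```python
import re

# Assumed delimiters: Markdown code fences/backticks and $-delimited math.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]+`")
INLINE_MATH = re.compile(r"\$[^$\n]+\$")


def natural_language_only(text: str) -> str:
    """Replace code and math spans with a space, leaving only the prose."""
    # Multi-line patterns run first so their contents are not
    # partially matched by the inline patterns.
    for pattern in (FENCED_CODE, DISPLAY_MATH, INLINE_CODE, INLINE_MATH):
        text = pattern.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"[ \t]+", " ", text).strip()


prose = natural_language_only("Bruk `print(x)` for å skrive ut verdien.")
```

The inline math pattern is deliberately simple and would also match text between two currency dollar signs on the same line, so the output should be spot-checked rather than trusted blindly.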