---
license: llama3.2
language:
- uz
- en
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
tags:
- llama
- uzbek
- uzbekllm
- uzbeknlp
- text-generation
- translation
- summarization
- question-answering
- tokenizer
datasets:
- HuggingFaceFW/fineweb-2
- tahrirchi/uz-crawl
- yakhyo/uz-wiki
- wikipedia
- tatsu-lab/alpaca
- behbudiy/alpaca-cleaned-uz
- UAzimov/uzbek-instruct-llm
- behbudiy/translation-instruction
metrics:
- bleu
- comet
- accuracy
pipeline_tag: text-generation
---

### Model Description

This is the 8B Base (continually pretrained) version of our Uzbek-optimized Llama 8B Instruct model. For instruction-following capability, check out our other models:

* **[alloma-1B-Instruct](https://huggingface.co/beruniy/Llama-3.2-1B-Instruct-Uz)**
* **[alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)**
* **[alloma-8B-Instruct](https://huggingface.co/beruniy/Llama-3.1-8B-Instruct-Uz)**

---

Our 8B Base model was continually pretrained with a context length of 4096 tokens on 3.6B tokens (67% English, 33% Uzbek). Our customized tokenizer averages 1.7 tokens per Uzbek word versus roughly 3.5 in the original Llama models, which translates to about 2x faster inference and a longer effective context length on Uzbek text. See the usage and tokenizer-comparison sketches at the end of this card.

## Methodology: Efficient Vocabulary Adaptation for Uzbek

The primary motivation for our technical approach is to create a model with a more efficient tokenizer for the Uzbek language. This gives both faster inference and a longer effective context length when processing Uzbek text, since fewer tokens are needed to represent the same amount of information.

## Acknowledgements

This project was developed by the teams at **[Examy.me](https://examy.me/)** and **[Teamwork.uz](https://teamwork.uz/)**. Their collaboration and resources were essential to the creation and success of the `alloma` model series.
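
## Usage

A minimal generation sketch using `transformers`, assuming a standard causal-LM checkpoint. The repo id below points to the 8B Instruct variant linked above, since this card does not state the base model's exact path; swap in this base model's repo id as appropriate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id: the 8B Instruct variant linked above.
# Replace with this base model's repo id as needed.
model_id = "beruniy/Llama-3.1-8B-Instruct-Uz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A base model does plain continuation, so prompt with a text prefix.
# Prompt: "Uzbekistan is a state located in Central Asia, and "
prompt = "O'zbekiston Markaziy Osiyoda joylashgan davlat bo'lib, "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```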
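
## Tokenizer Comparison

A minimal sketch of the tokens-per-word comparison described above, assuming the Uzbek-adapted tokenizer ships with the checkpoints linked in this card. The sample sentence is illustrative; the 1.7 vs. ~3.5 figures reported above reflect corpus-level averages, not a single sentence.

```python
from transformers import AutoTokenizer

# Original Llama tokenizer (base_model from this card's metadata)
# vs. the Uzbek-adapted tokenizer (8B Instruct variant linked above).
original = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
adapted = AutoTokenizer.from_pretrained("beruniy/Llama-3.1-8B-Instruct-Uz")

# Illustrative Uzbek sentence: "Uzbekistan is a state located in Central Asia."
text = "O'zbekiston Markaziy Osiyoda joylashgan davlatdir."
n_words = len(text.split())

for name, tok in [("original", original), ("adapted", adapted)]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: {len(ids)} tokens, {len(ids) / n_words:.2f} tokens/word")
```

Fewer tokens per word mean both faster generation (fewer decode steps for the same text) and more Uzbek text fitting into the 4096-token context window.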