---
license: llama3.2
language:
- uz
- en
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
tags:
- llama
- uzbek
- uzbekllm
- uzbeknlp
- text-generation
- translation
- summarization
- question-answering
- tokenizer
datasets:
- HuggingFaceFW/fineweb-2
- tahrirchi/uz-crawl
- yakhyo/uz-wiki
- wikipedia
- tatsu-lab/alpaca
- behbudiy/alpaca-cleaned-uz
- UAzimov/uzbek-instruct-llm
- behbudiy/translation-instruction
metrics:
- bleu
- comet
- accuracy
pipeline_tag: text-generation
---

### Model Description

This is the 8B Base (continually pretrained) version of our Uzbek-optimized Llama 8B Instruct model. For instruction-following capability, check out our other models:

* **[alloma-1B-Instruct](https://huggingface.co/beruniy/Llama-3.2-1B-Instruct-Uz)**
* **[alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)**
* **[alloma-8B-Instruct](https://huggingface.co/beruniy/Llama-3.1-8B-Instruct-Uz)**

---

Our 8B Base model was continually pretrained with a context length of 4096 tokens on 3.6B tokens (67% English, 33% Uzbek). Our customized tokenizer averages 1.7 tokens per Uzbek word versus roughly 3.5 in the original Llama models, which translates to about 2x faster inference and a longer effective context length on Uzbek text. See the usage and tokenizer-comparison sketches at the end of this card.

## Methodology: Efficient Vocabulary Adaptation for Uzbek

The primary motivation for our technical approach is to create a model with a more efficient tokenizer for the Uzbek language. This gives both faster inference and a longer effective context length when processing Uzbek text, since fewer tokens are needed to represent the same amount of information.

## Acknowledgements

This project was developed by the teams at **[Examy.me](https://examy.me/)** and **[Teamwork.uz](https://teamwork.uz/)**. Their collaboration and resources were essential to the creation and success of the `alloma` model series.
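
## Usage

A minimal generation sketch using `transformers`, assuming a standard causal-LM checkpoint. The repo id below points to the 8B Instruct variant linked above, since this card does not state the base model's exact path; swap in this base model's repo id as appropriate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repo id: the 8B Instruct variant linked above.
# Replace with this base model's repo id as needed.
model_id = "beruniy/Llama-3.1-8B-Instruct-Uz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A base model does plain continuation, so prompt with a text prefix.
# Prompt: "Uzbekistan is a state located in Central Asia, and "
prompt = "O'zbekiston Markaziy Osiyoda joylashgan davlat bo'lib, "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```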
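
## Tokenizer Comparison

A minimal sketch of the tokens-per-word comparison described above, assuming the Uzbek-adapted tokenizer ships with the checkpoints linked in this card. The sample sentence is illustrative; the 1.7 vs. ~3.5 figures reported above reflect corpus-level averages, not a single sentence.

```python
from transformers import AutoTokenizer

# Original Llama tokenizer (base_model from this card's metadata)
# vs. the Uzbek-adapted tokenizer (8B Instruct variant linked above).
original = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
adapted = AutoTokenizer.from_pretrained("beruniy/Llama-3.1-8B-Instruct-Uz")

# Illustrative Uzbek sentence: "Uzbekistan is a state located in Central Asia."
text = "O'zbekiston Markaziy Osiyoda joylashgan davlatdir."
n_words = len(text.split())

for name, tok in [("original", original), ("adapted", adapted)]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: {len(ids)} tokens, {len(ids) / n_words:.2f} tokens/word")
```

Fewer tokens per word mean both faster generation (fewer decode steps for the same text) and more Uzbek text fitting into the 4096-token context window.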