---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- lora
- peft
- llama-3.2
- iterative-dpo
- self-rewarding
library_name: peft
---

# Iterative DPO Fine-Tune of Llama-3.2-1B (Iteration 1)

This repository contains the LoRA adapters from the **first iteration** of a Direct Preference Optimization (DPO) fine-tuning process on the `meta-llama/Llama-3.2-1B-Instruct` model.

This work is part of a project exploring iterative DPO, where the model refines itself over multiple cycles of preference data generation and training, inspired by the "Self-Rewarding Language Models" paper.

- **Repository for Iteration 2:** [NilayR/llama32-iterative-dpo-iter2](https://huggingface.co/NilayR/llama32-iterative-dpo-iter2)

## Model Details

### Model Description

This model is a fine-tuned version of `meta-llama/Llama-3.2-1B-Instruct`. It was trained using DPO on a preference dataset that the base model generated itself. An LLM Judge, powered by GPT-3.5-Turbo, evaluated pairs of model-generated responses to create the chosen/rejected pairs for training.

The goal of this iteration was to establish the first step in a self-improvement loop, aligning the model more closely with human-like preferences for accuracy, helpfulness, and clarity.

- **Developed by:** NilayR
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** apache-2.0
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct`

## How to Get Started with the Model

To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-iterative-dpo-iter1"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer; Llama has no dedicated pad token, so reuse EOS
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the key benefits of meditation?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Build the chat-formatted prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)

# Decode only the newly generated tokens. Splitting the full decoded text on
# "assistant" would truncate any reply that itself contains that word.
new_tokens = outputs[0][input_ids.shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response.strip())
```

## Training Details

### Training Data

The model was trained on a preference dataset generated by the `meta-llama/Llama-3.2-1B-Instruct` model itself.

* **Data Generation Process:**
    1. **Instructions:** 20 instructions were selected from the LIMA dataset.
    2. **Response Generation:** The base model generated multiple diverse responses for each instruction.
    3. **Preference Labeling:** A custom LLM Judge powered by `GPT-3.5-Turbo` was used to compare pairs of the generated responses, creating a dataset of **56 chosen/rejected pairs**.
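The preference-labeling step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for this model: the card does not say how many responses were sampled per instruction or which pairs were compared, so the sketch simply compares every pair. `build_preference_pairs` and `toy_length_judge` are hypothetical names, and the toy judge (which prefers the longer response) stands in for the GPT-3.5-Turbo judge. The output records use the `prompt` / `chosen` / `rejected` keys that preference-tuning datasets for DPO commonly follow.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical judge signature: takes (instruction, first, second) and
# returns 0 if the first response is preferred, 1 if the second is.
Judge = Callable[[str, str, str], int]


def build_preference_pairs(
    instruction: str, responses: List[str], judge: Judge
) -> List[Dict[str, str]]:
    """Compare every pair of responses and record the judge's preference."""
    pairs = []
    for first, second in combinations(responses, 2):
        winner = judge(instruction, first, second)
        chosen, rejected = (first, second) if winner == 0 else (second, first)
        pairs.append({"prompt": instruction, "chosen": chosen, "rejected": rejected})
    return pairs


def toy_length_judge(instruction: str, first: str, second: str) -> int:
    """Stand-in for an LLM judge (assumption): prefers the longer response."""
    return 0 if len(first) >= len(second) else 1


example_pairs = build_preference_pairs(
    "What are the key benefits of meditation?",
    [
        "It helps.",
        "It can reduce stress and improve focus.",
        "Meditation may lower stress.",
    ],
    toy_length_judge,
)

for pair in example_pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

In a real setup the judge callable would wrap an API call to the judging model and would likely need to handle ties, position bias, and unparseable verdicts, none of which the sketch covers.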
### Training Procedure

The model was trained for one epoch using the TRL library's `DPOTrainer`.

#### Training Hyperparameters

* **Framework:** `trl.DPOTrainer`
* **Epochs:** 1
* **Batch Size:** 1
* **Gradient Accumulation Steps:** 2
* **Optimizer:** `paged_adamw_8bit`
* **Learning Rate:** 2e-5
* **DPO Beta (β):** 0.1
* **Max Steps:** 50
* **Final Training Loss:** `0.6405`

#### LoRA Configuration

* **Rank (`r`):** 16
* **Alpha (`lora_alpha`):** 32
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`
* **Dropout:** 0.05

### Compute Infrastructure

* **Hardware:** 1x NVIDIA A100 40GB GPU
* **Cloud Provider:** Google Colab
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes`
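To put the final training loss listed under Training Hyperparameters in context, the sketch below computes the per-pair DPO loss in plain Python. It assumes the standard sigmoid form of the DPO objective (the card does not state which loss variant was used), and the log-probabilities are made-up illustrative numbers, not values from this training run. When the policy and the reference model assign identical log-probabilities, the loss equals ln 2 ≈ 0.693, so a reported value of 0.6405 sits slightly below that starting point.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Sigmoid DPO loss for one chosen/rejected pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Illustrative log-probabilities (assumptions, not measured values).
# Policy identical to the reference: margin is 0, loss is ln 2.
initial_loss = dpo_loss(-42.0, -40.0, -42.0, -40.0)

# Policy shifts probability toward the chosen response: loss drops below ln 2.
improved_loss = dpo_loss(-41.0, -41.0, -42.0, -40.0)

print(f"initial:  {initial_loss:.4f}")
print(f"improved: {improved_loss:.4f}")
```

With β = 0.1, the log-ratio gap between chosen and rejected responses has to grow fairly large before the loss moves far from ln 2, which is one reason a small drop is plausible after a short run on a small dataset.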