---
base_model: sovitrath/Phi-3.5-vision-instruct
library_name: peft
---

# Model Card for Phi-3.5-Vision-Instruct-OCR

This is a fine-tuned Phi 3.5 Vision Instruct model built specifically for receipt OCR. It was fine-tuned on the SROIEv2 dataset, with annotations generated using Qwen2.5-3B VL. The dataset is **[available on Kaggle](https://www.kaggle.com/datasets/sovitrath/receipt-ocr-input)**.

## Model Details

- The base model is **[sovitrath/Phi-3.5-vision-instruct](https://huggingface.co/sovitrath/Phi-3.5-vision-instruct)**.

## Technical Specifications

### Compute Infrastructure

The model was trained on a system with a 10GB RTX 3080 GPU, a 10th generation i7 CPU, and 32GB of RAM.

### Framework versions

```
torch==2.5.1
torchvision==0.20.1
torchaudio==2.5.1
flash-attn==2.7.2.post1
triton==3.1.0
transformers==4.51.3
accelerate==1.2.0
datasets==4.1.1
huggingface-hub==0.31.1
peft==0.15.2
trl==0.18.0
safetensors==0.4.5
sentencepiece==0.2.0
tiktoken==0.8.0
einops==0.8.0
opencv-python==4.10.0.84
pillow==10.2.0
numpy==2.2.0
scipy==1.14.1
tqdm==4.66.4
pandas==2.2.2
pyarrow==21.0.0
regex==2024.11.6
requests==2.32.3
python-dotenv==1.1.1
wandb==0.22.1
rich==13.9.4
jiwer==4.0.0
bitsandbytes==0.45.0
```

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

# Load the fine-tuned model.
# Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # _attn_implementation='flash_attention_2',
    _attn_implementation='eager',
)

# Load the processor from the same repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

test_image = Image.open('../inference_data/image_1.jpeg').convert('RGB')

plt.figure(figsize=(9, 7))
plt.imshow(test_image)
plt.show()

def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = '<|image_1|>\n'
    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]

    # Prepare the text input by applying the chat template.
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Prepare the inputs for the model and move them to the specified device.
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)

    # Generate text with the model.
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids.
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text.
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text.

output = test(model, processor, test_image)
print(output)
```
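Because this repository is tagged with `library_name: peft`, it may also be possible to load the base model and attach the fine-tuned LoRA adapter explicitly. The following is a minimal sketch, assuming the adapter weights are hosted in this repository in the standard PEFT layout; the snippet above remains the loading path shown in this card.

```python
import torch

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

base_id = 'sovitrath/Phi-3.5-vision-instruct'          # base model from the YAML metadata
adapter_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'   # assumption: this repo holds the LoRA adapter

# Load the base Phi 3.5 Vision Instruct model first.
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='eager',
)

# Attach the fine-tuned LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Optionally merge the adapter into the base weights for faster inference.
model = model.merge_and_unload()

# The processor comes from the base model repository.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
```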
## Training Details

### Training Data

The model was fine-tuned on the SROIEv2 dataset. The text file annotations were generated using Qwen2.5-3B VL.

### Training Procedure

* The model was fine-tuned for 1200 steps. However, the uploaded checkpoint corresponds to the model saved at 400 steps, which gave the best validation loss.
* The text file annotations were generated using Qwen2.5-3B VL.

#### Training Hyperparameters

* It is a LoRA model (with DoRA enabled via `use_dora=True` in the configuration below).

**LoRA configuration:**

```python
from peft import LoraConfig, get_peft_model

# Configure LoRA.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply PEFT model adaptation.
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters.
peft_model.print_trainable_parameters()
```

**Trainer configuration:**

```python
import transformers

# Configure training arguments.
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    # num_train_epochs=1,
    max_steps=1200,
    per_device_train_batch_size=1,  # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    per_device_eval_batch_size=1,   # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    save_safetensors=True,
)
```

## Evaluation

The current best validation loss is **0.377421**. The character error rate (CER) on the test set is **0.355**, computed with the Qwen2.5-3B VL test annotations as ground truth.
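The `jiwer` package is pinned in the framework versions above, so the CER was presumably computed with it. Below is a minimal sketch of how such a character error rate can be reproduced; the two strings are hypothetical examples standing in for a Qwen2.5-3B VL test annotation and the model's OCR output for the same receipt.

```python
from jiwer import cer

# Hypothetical example strings; in practice, the reference is the Qwen2.5-3B VL
# test annotation and the hypothesis is the fine-tuned model's OCR output.
ground_truth = 'TOTAL: 12.50\nCASH: 20.00\nCHANGE: 7.50'
prediction = 'TOTAL: 12.50\nCASH: 20.00\nCHANGE: 7.80'

# jiwer's `cer` computes the character error rate between reference and hypothesis.
# Passing lists of strings instead aggregates the CER over a whole test set.
score = cer(ground_truth, prediction)
print(f'CER: {score:.3f}')
```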