---
base_model: sovitrath/Phi-3.5-vision-instruct
library_name: peft
---

# Model Card for Phi-3.5-Vision-Instruct-OCR

This is a fine-tuned Phi 3.5 Vision Instruct model built specifically for receipt OCR. It was fine-tuned on the SROIEv2 dataset, with annotations generated using Qwen2.5-3B VL. The dataset is **[available on Kaggle](https://www.kaggle.com/datasets/sovitrath/receipt-ocr-input)**.

## Model Details

- The base model is **[sovitrath/Phi-3.5-vision-instruct](https://huggingface.co/sovitrath/Phi-3.5-vision-instruct)**.

## Technical Specifications

### Compute Infrastructure

The model was trained on a system with a 10GB RTX 3080 GPU, a 10th generation i7 CPU, and 32GB of RAM.

### Framework versions

```
torch==2.5.1
torchvision==0.20.1
torchaudio==2.5.1
flash-attn==2.7.2.post1
triton==3.1.0
transformers==4.51.3
accelerate==1.2.0
datasets==4.1.1
huggingface-hub==0.31.1
peft==0.15.2
trl==0.18.0
safetensors==0.4.5
sentencepiece==0.2.0
tiktoken==0.8.0
einops==0.8.0
opencv-python==4.10.0.84
pillow==10.2.0
numpy==2.2.0
scipy==1.14.1
tqdm==4.66.4
pandas==2.2.2
pyarrow==21.0.0
regex==2024.11.6
requests==2.32.3
python-dotenv==1.1.1
wandb==0.22.1
rich==13.9.4
jiwer==4.0.0
bitsandbytes==0.45.0
```

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

# Load the fine-tuned model.
# Use `flash_attention_2` on Ampere GPUs and above and `eager` on older GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # _attn_implementation='flash_attention_2',
    _attn_implementation='eager',
)

# Load the processor from the same repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

test_image = Image.open('../inference_data/image_1.jpeg').convert('RGB')

plt.figure(figsize=(9, 7))
plt.imshow(test_image)
plt.show()

def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = '<|image_1|>\n'
    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]

    # Prepare the text input by applying the chat template.
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Prepare the inputs for the model and move them to the specified device.
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)

    # Generate text with the model.
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids.
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text.
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text.

output = test(model, processor, test_image)
print(output)
```
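Because this repository is tagged with `library_name: peft`, it may also be possible to load the base model and attach the fine-tuned LoRA adapter explicitly. The following is a minimal sketch, assuming the adapter weights are hosted in this repository in the standard PEFT layout; the snippet above remains the loading path shown in this card.

```python
import torch

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor

base_id = 'sovitrath/Phi-3.5-vision-instruct'          # base model from the YAML metadata
adapter_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'   # assumption: this repo holds the LoRA adapter

# Load the base Phi 3.5 Vision Instruct model first.
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='eager',
)

# Attach the fine-tuned LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Optionally merge the adapter into the base weights for faster inference.
model = model.merge_and_unload()

# The processor comes from the base model repository.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
```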
## Training Details

### Training Data

The model was fine-tuned on the SROIEv2 dataset. The text file annotations were generated using Qwen2.5-3B VL.

### Training Procedure

* The model was fine-tuned for 1200 steps. However, the uploaded checkpoint corresponds to the model saved at 400 steps, which gave the best validation loss.
* The text file annotations were generated using Qwen2.5-3B VL.

#### Training Hyperparameters

* It is a LoRA model (with DoRA enabled via `use_dora=True` in the configuration below).

**LoRA configuration:**

```python
from peft import LoraConfig, get_peft_model

# Configure LoRA.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply PEFT model adaptation.
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters.
peft_model.print_trainable_parameters()
```

**Trainer configuration:**

```python
import transformers

# Configure training arguments.
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    # num_train_epochs=1,
    max_steps=1200,
    per_device_train_batch_size=1,  # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    per_device_eval_batch_size=1,   # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    save_safetensors=True,
)
```

## Evaluation

The current best validation loss is **0.377421**. The character error rate (CER) on the test set is **0.355**, computed with the Qwen2.5-3B VL test annotations as ground truth.
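The `jiwer` package is pinned in the framework versions above, so the CER was presumably computed with it. Below is a minimal sketch of how such a character error rate can be reproduced; the two strings are hypothetical examples standing in for a Qwen2.5-3B VL test annotation and the model's OCR output for the same receipt.

```python
from jiwer import cer

# Hypothetical example strings; in practice, the reference is the Qwen2.5-3B VL
# test annotation and the hypothesis is the fine-tuned model's OCR output.
ground_truth = 'TOTAL: 12.50\nCASH: 20.00\nCHANGE: 7.50'
prediction = 'TOTAL: 12.50\nCASH: 20.00\nCHANGE: 7.80'

# jiwer's `cer` computes the character error rate between reference and hypothesis.
# Passing lists of strings instead aggregates the CER over a whole test set.
score = cer(ground_truth, prediction)
print(f'CER: {score:.3f}')
```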