MedOCR-Vision: Medical Document OCR with PaddleOCR-VL
ERNIE AI Developer Challenge Submission
A fine-tuned PaddleOCR-VL model specialized for medical document OCR, achieving high accuracy on medical prescriptions, lab reports, and forms while maintaining general OCR capabilities.
Model Description
MedOCR-Vision is a vision-language model fine-tuned specifically for optical character recognition (OCR) of medical documents. The model is based on PaddleOCR-VL (1B parameters) and has been fine-tuned using LoRA (Low-Rank Adaptation) on a carefully curated dataset of 2,462 medical and general documents.
Key Features
- Specialized for Medical Documents: Optimized for prescriptions, lab reports, and medical forms
- Domain-Balanced Training: Maintains general OCR capabilities (invoices, receipts, business documents)
- Production-Ready: Full merged model in float16 precision
- Efficient Fine-tuning: LoRA-based training that updates only low-rank adapter weights instead of the full model
- Low Validation Loss: 0.5787 after 3 epochs of training
Model Architecture
- Base Model: unsloth/PaddleOCR-VL (1B parameters)
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 64
- LoRA Alpha: 64
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, out_proj, fc1, fc2, linear_1, linear_2
- Precision: Mixed (BF16/FP16)
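To make the rank and alpha settings above concrete, the sketch below computes the LoRA scaling factor and the number of trainable parameters a single adapted linear layer adds. It assumes the standard LoRA formulation (the update is scaled by alpha / r and consists of one r × d_in and one d_out × r matrix). The 1024 × 1024 layer shape is a hypothetical example, not the actual dimensions of PaddleOCR-VL.

```python
def lora_scaling(alpha: int, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / rank


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return rank * (d_in + d_out)


# Values from the model card: rank 64, alpha 64
scaling = lora_scaling(alpha=64, rank=64)

# Hypothetical layer shape, chosen only for illustration
d_in, d_out = 1024, 1024
added = lora_added_params(d_in, d_out, rank=64)
full = d_in * d_out

print(f"scaling factor: {scaling}")
print(f"adapter params: {added} ({added / full:.1%} of the full weight matrix)")
```

With rank equal to alpha, the scaling factor is 1, so the adapter update is applied unscaled.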
Performance Highlights
Model Improvements Over Base Model
Compared with the base model, the fine-tuned model shows the following improvements:
- ✅ Enhanced Information Extraction: Captures more complete medical information including headers, test values, and reference ranges
- ✅ Better Document Understanding: Improved coverage of document structure and context
- ✅ Medical Domain Specialization: Superior performance on medical terminology and clinical data
- ✅ Comprehensive Coverage: Extracts significantly more relevant content from medical documents
Intended Uses
Primary Use Cases
- Medical Prescription Digitization: Extract text from handwritten and printed prescriptions
- Lab Report Processing: Extract data from medical laboratory reports
- Medical Form OCR: Process various medical forms and documents
- Healthcare Document Management: Digitize medical records and documentation
- General Document OCR: Invoices, receipts, and business documents
Out-of-Scope Uses
- Real-time medical diagnosis (this is an OCR tool, not a diagnostic system)
- Legal document verification (requires domain-specific training)
- Privacy-sensitive applications without proper data handling protocols
Training Data
Dataset Composition
The model was trained on naazimsnh02/medocr-vision-dataset, a curated dataset of 2,462 samples with the following composition:
| Dataset | Samples | Domain | Type |
|---|---|---|---|
| Medical Prescriptions | 1,000 | Medical | Handwritten |
| OMR Scanned Documents | 36 | Medical | Scanned Forms |
| Medical Lab Reports | 426 | Medical | Printed Reports |
| Invoices & Receipts | 1,000 | General | Business Docs |
| Total | 2,462 | - | - |
Dataset Statistics
- Training: 1,969 samples (80%)
- Validation: 246 samples (10%)
- Test: 247 samples (10%)
- Domain Balance: 59.4% Medical / 40.6% General
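The percentages above follow directly from the composition table. As a quick check, this short sketch recomputes the domain balance and split shares from the listed sample counts; the variable names are illustrative only.

```python
# Sample counts and domains copied from the composition table
composition = {
    "Medical Prescriptions": (1000, "Medical"),
    "OMR Scanned Documents": (36, "Medical"),
    "Medical Lab Reports": (426, "Medical"),
    "Invoices & Receipts": (1000, "General"),
}
splits = {"train": 1969, "validation": 246, "test": 247}

total = sum(count for count, _ in composition.values())
medical = sum(count for count, domain in composition.values() if domain == "Medical")

medical_pct = round(100 * medical / total, 1)
general_pct = round(100 * (total - medical) / total, 1)
split_pct = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(f"total samples: {total}")
print(f"medical / general: {medical_pct}% / {general_pct}%")
print(f"split shares: {split_pct}")
```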
Data Sources
Training Procedure
Training Hyperparameters
# Training Duration
num_train_epochs: 3
total_steps: 741
# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8
# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear
# Optimization
optimizer: adamw_8bit
weight_decay: 0.001
# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0
# Checkpointing
save_steps: 100
eval_steps: 100
save_total_limit: 5
load_best_model_at_end: true
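The step count and learning-rate schedule implied by these hyperparameters can be reproduced with a little arithmetic. The sketch below assumes a single GPU, that a partial final batch still counts as an optimizer step, and a linear schedule that warms up from zero and then decays linearly to zero. The function name `linear_lr` is illustrative.

```python
import math

# Values from the model card
train_samples = 1969
per_device_batch = 4
grad_accum = 2
epochs = 3
base_lr = 5e-5
warmup_steps = 50

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(train_samples / per_device_batch)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, 25, 50, 400, total_steps):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```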
Training Results
| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 1.7026 | 0.7900 |
| 200 | 1.3005 | 0.6821 |
| 300 | 1.0004 | 0.6402 |
| 400 | 0.8176 | 0.6036 |
| 500 | 0.7387 | 0.5806 |
| 600 | 0.7406 | 0.5819 |
| 700 | 0.8801 | 0.5787 |
Final Metrics:
- Final Validation Loss: 0.5787
- Training Time: 38.36 minutes (2,301.80 seconds)
- Peak GPU Memory: 15.84 GB
- GPU Utilization: 71.89%
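For a rough sense of training throughput, the reported wall-clock time can be combined with the step and sample counts. This is only a back-of-the-envelope sketch: it assumes the 2,301.80 seconds cover the full 741-step run (periodic evaluation time may be included) and that every training sample is seen once per epoch.

```python
# Values from the model card
training_seconds = 2301.80
total_steps = 741
train_samples = 1969
epochs = 3

minutes = training_seconds / 60
seconds_per_step = training_seconds / total_steps
samples_per_second = train_samples * epochs / training_seconds

print(f"training time: {minutes:.2f} min")
print(f"average time per optimizer step: {seconds_per_step:.2f} s")
print(f"average throughput: {samples_per_second:.2f} samples/s")
```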
Training Environment
- GPU: NVIDIA L4 (24GB VRAM)
- Framework: Unsloth + HuggingFace Transformers
- Precision: Mixed (BF16/FP16)
- Memory Usage: ~14 GB for training
Training Strategy
- Domain-Balanced Approach: Roughly 60/40 split (59.4% medical, 40.6% general) to limit catastrophic forgetting of general OCR ability
- LoRA Fine-tuning: Parameter-efficient fine-tuning targeting attention and MLP projection layers
- Checkpoint Selection: Best model selected based on lowest validation loss
- Evaluation: Regular evaluation every 100 steps to monitor convergence
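The checkpoint selection rule can be illustrated with the evaluation log from the Training Results table. This minimal sketch assumes validation loss is the selection metric, as stated above, and simply picks the logged step with the lowest value.

```python
# (step, training loss, validation loss), copied from the Training Results table
eval_log = [
    (100, 1.7026, 0.7900),
    (200, 1.3005, 0.6821),
    (300, 1.0004, 0.6402),
    (400, 0.8176, 0.6036),
    (500, 0.7387, 0.5806),
    (600, 0.7406, 0.5819),
    (700, 0.8801, 0.5787),
]

# Select the evaluation point with the lowest validation loss
best_step, best_train_loss, best_val_loss = min(eval_log, key=lambda row: row[2])

print(f"best checkpoint: step {best_step} (validation loss {best_val_loss})")
```

Note that the training loss at that step is not the lowest in the log; selection depends only on validation loss.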
How to Use
Installation
pip install transformers unsloth einops torch pillow
Basic Usage
from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image
# Load model and processor
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)
# Enable inference mode
FastVisionModel.for_inference(model)
# Load image and normalize to RGB (scans may be grayscale or RGBA)
image = Image.open("medical_document.jpg").convert("RGB")
# Prepare input
instruction = "Extract all text from this medical document:"
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]
}]
# Generate
text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    image,
    text_prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")
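# Greedy decoding keeps OCR output deterministic; temperature and min_p
# only take effect when do_sample=True, so they are omitted here
output = model.generate(
    **inputs,
    max_new_tokens=256,
    use_cache=False,
    do_sample=False,
)
# Decode only the newly generated tokens (skip the echoed prompt)
prompt_length = inputs["input_ids"].shape[1]
text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(text)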
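When processing many documents, the steps above are typically wrapped in a function and applied to a folder of images. The sketch below shows one way to structure that loop using only the standard library. `ocr_directory`, `IMAGE_SUFFIXES`, and the `ocr_fn` callback are hypothetical names introduced for illustration; `ocr_fn` stands in for a function that runs the load, generate, and decode steps on a single image path.

```python
from pathlib import Path
from typing import Callable, Dict

# File extensions treated as images (an assumption; adjust to your data)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def ocr_directory(folder: str, ocr_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Apply ocr_fn to every image file in folder and map file name to extracted text."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = ocr_fn(path)
    return results


# Example usage, assuming run_ocr(path) wraps the model calls shown earlier:
# texts = ocr_directory("scans/", run_ocr)
# for name, text in texts.items():
#     print(name, text[:80])
```

Passing the OCR step in as a callback keeps the file handling testable without loading the model.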
Advanced Usage with Streaming
from transformers import TextStreamer
# Create text streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
# Generate with streaming
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=False,
    do_sample=False,  # greedy decoding for deterministic OCR output
)
Limitations and Biases
Limitations
- Image Quality: Performance may degrade with very low-quality or heavily degraded images
- Handwriting Variability: Extremely poor handwriting may not be accurately recognized
- Language: Primarily trained on English medical documents
- Document Types: Optimized for the specific document types in the training set
- Context Understanding: This is an OCR model, not a medical understanding model
Potential Biases
- Dataset Bias: Training data is primarily from specific medical document sources
- Domain Bias: Better performance on medical documents similar to training data
- Language Bias: Primarily English-language documents
- Format Bias: May perform better on document formats similar to training data
Recommendations
- Validate outputs in critical medical applications
- Use as part of a larger system with human oversight
- Test on your specific use case before production deployment
- Consider fine-tuning on domain-specific data for specialized applications
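One common way to test OCR quality on your own documents is character error rate (CER): the edit distance between the model output and a manually verified transcription, divided by the length of the transcription. The sketch below is a minimal pure-Python implementation. It is not the evaluation method used for this model, and the sample strings are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one missing character in a drug name
reference = "Amoxicillin 500 mg"
hypothesis = "Amoxicilin 500 mg"
print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

Averaging CER over a small, representative set of your own documents gives a concrete baseline before deployment.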
Ethical Considerations
Privacy and Security
- Medical Data: This model processes medical documents which may contain sensitive patient information
- HIPAA Compliance: Users must ensure compliance with relevant healthcare data protection regulations
- Data Handling: Implement appropriate data security measures when using this model
- Audit Trail: Maintain logs of OCR processing for accountability
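An audit trail does not need to store document contents. As a minimal sketch, the function below builds a JSON log line containing a timestamp, the model identifier, the operator, and a SHA-256 hash of the input file, so a processed document can be matched later without writing patient data to the log. The function name and record fields are assumptions for illustration, not a compliance-reviewed schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(document_bytes: bytes, model_id: str, operator: str) -> str:
    """Build one JSON log line describing an OCR run without storing the document itself."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_id,
        "operator": operator,
        "sha256": hashlib.sha256(document_bytes).hexdigest(),
        "num_bytes": len(document_bytes),
    }
    return json.dumps(record, sort_keys=True)


# Example with placeholder bytes; in practice, read the scanned file in binary mode
line = audit_record(b"example document bytes", "naazimsnh02/medocr-vision", "clerk-01")
print(line)
```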
Responsible Use
- This model should be used as an assistive tool, not a replacement for human review
- Medical professionals should verify all extracted information
- Implement appropriate error handling and validation
- Consider the implications of automated medical document processing
Citation
@misc{medocr-vision-2025,
title={MedOCR-Vision: Medical Document OCR with PaddleOCR-VL},
author={Syed Naazim Hussain},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/naazimsnh02/medocr-vision}}
}
Additional Resources
- Code Repository: https://github.com/naazimsnh02/medocr-vision
- Training Dataset: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
- Training Notebook: Available in the repository
- ERNIE Challenge: Submitted to the ERNIE AI Developer Challenge
License
This model is released under the MIT License. Please refer to individual dataset licenses for usage terms of the training data.
Acknowledgments
- Base Model: unsloth/PaddleOCR-VL
- Framework: Unsloth for efficient training
- Dataset Sources: chinmays18, saurabh1896, dikshaasinghhh, mychen76
- LLM Providers: Nebius and Novita for data processing
- PaddleOCR Team: For the excellent OCR framework
Model Version: 1.0
Release Date: December 2025
Challenge: ERNIE AI Developer Challenge