MedOCR-Vision: Medical Document OCR with PaddleOCR-VL

ERNIE AI Developer Challenge Submission

A fine-tuned PaddleOCR-VL model specialized for medical document OCR, achieving high accuracy on medical prescriptions, lab reports, and forms while maintaining general OCR capabilities.

Model Description

MedOCR-Vision is a vision-language model fine-tuned specifically for optical character recognition (OCR) of medical documents. The model is based on PaddleOCR-VL (1B parameters) and has been fine-tuned using LoRA (Low-Rank Adaptation) on a carefully curated dataset of 2,462 medical and general documents.

Key Features

Specialized for Medical Documents: Optimized for prescriptions, lab reports, and medical forms
Domain-Balanced Training: Maintains general OCR capabilities (invoices, receipts, business documents)
Production-Ready: Fully merged model published in 16-bit precision (the repository's safetensors are listed as BF16)
Efficient Fine-tuning: LoRA-based training that updates only low-rank adapter weights rather than the full model
Low Validation Loss: Final validation loss of 0.5787 after 3 epochs of training

Model Architecture

Base Model: unsloth/PaddleOCR-VL (1B parameters)
Fine-tuning Method: LoRA (Low-Rank Adaptation)
LoRA Rank: 64
LoRA Alpha: 64
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, out_proj, fc1, fc2, linear_1, linear_2
Precision: Mixed (BF16/FP16)
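
To make the LoRA settings above more concrete, the following sketch computes the standard LoRA scaling factor (alpha / rank) and the number of trainable parameters that a rank-64 adapter adds to one linear layer. The 1024 x 1024 layer size is a hypothetical value chosen for illustration and is not taken from PaddleOCR-VL.

```python
def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / rank."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer.

    The adapter consists of A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out)


scaling = lora_scaling(alpha=64, rank=64)

# Hypothetical layer size, for illustration only.
d_in, d_out = 1024, 1024
adapter_params = lora_param_count(d_in, d_out, rank=64)
full_params = d_in * d_out

print(f"scaling factor: {scaling}")
print(f"adapter parameters: {adapter_params} "
      f"({adapter_params / full_params:.1%} of the full weight matrix)")
```

With rank and alpha both set to 64, the scaling factor is 1, so the low-rank update is added to the frozen weights without rescaling.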

Performance Highlights

Model Improvements Over Base Model

Compared with the base model, the fine-tuned model improves in the following areas:

✅ Enhanced Information Extraction: Captures more complete medical information including headers, test values, and reference ranges
✅ Better Document Understanding: Improved coverage of document structure and context
✅ Medical Domain Specialization: Superior performance on medical terminology and clinical data
✅ Comprehensive Coverage: Extracts significantly more relevant content from medical documents

Intended Uses

Primary Use Cases

Medical Prescription Digitization: Extract text from handwritten and printed prescriptions
Lab Report Processing: Extract data from medical laboratory reports
Medical Form OCR: Process various medical forms and documents
Healthcare Document Management: Digitize medical records and documentation
General Document OCR: Invoices, receipts, and business documents

Out-of-Scope Uses

Real-time medical diagnosis (this is an OCR tool, not a diagnostic system)
Legal document verification (requires domain-specific training)
Privacy-sensitive applications without proper data handling protocols

Training Data

Dataset Composition

The model was trained on naazimsnh02/medocr-vision-dataset, a curated dataset of 2,462 samples with the following composition:

Dataset	Samples	Domain	Type
Medical Prescriptions	1,000	Medical	Handwritten
OMR Scanned Documents	36	Medical	Scanned Forms
Medical Lab Reports	426	Medical	Printed Reports
Invoices & Receipts	1,000	General	Business Docs
Total	2,462	-	-

Dataset Statistics

Training: 1,969 samples (80%)
Validation: 246 samples (10%)
Test: 247 samples (10%)
Domain Balance: 59.4% Medical / 40.6% General
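
The split sizes and domain balance above follow directly from the composition table. The sketch below recomputes them from the listed sample counts; the variable names are illustrative only.

```python
# Sample counts and domains copied from the composition table.
sources = {
    "Medical Prescriptions": (1000, "Medical"),
    "OMR Scanned Documents": (36, "Medical"),
    "Medical Lab Reports": (426, "Medical"),
    "Invoices & Receipts": (1000, "General"),
}
splits = {"train": 1969, "validation": 246, "test": 247}

total = sum(count for count, _ in sources.values())
medical = sum(count for count, domain in sources.values() if domain == "Medical")

medical_share = round(100 * medical / total, 1)
general_share = round(100 * (total - medical) / total, 1)
split_shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(f"total samples: {total}")
print(f"medical / general: {medical_share}% / {general_share}%")
print(f"split shares: {split_shares}")
```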

Data Sources

Training Procedure

Training Hyperparameters

# Training Duration
num_train_epochs: 3
total_steps: 741

# Batch Configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
effective_batch_size: 8

# Learning Rate
learning_rate: 5e-5
warmup_steps: 50
lr_scheduler_type: linear

# Optimization
optimizer: adamw_8bit
weight_decay: 0.001

# LoRA Configuration
lora_r: 64
lora_alpha: 64
lora_dropout: 0

# Checkpointing
save_steps: 100
eval_steps: 100
save_total_limit: 5
load_best_model_at_end: true
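
As a sanity check on the values above, this sketch derives the effective batch size and total step count from the training-set size, then models the learning-rate schedule. It assumes a single GPU, one optimizer step per 8 samples with a final partial batch in each epoch, and a schedule that warms up linearly and then decays linearly to zero. The linear_lr function is an illustrative helper, not part of the training code.

```python
import math

train_samples = 1969
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3

# Single-GPU assumption: effective batch = per-device batch * accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs


def linear_lr(step: int, peak: float = 5e-5, warmup: int = 50,
              total: int = total_steps) -> float:
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, 25, 50, 400, total_steps):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the computed total matches the total_steps value of 741 listed above.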

Training Results

Step	Training Loss	Validation Loss
100	1.7026	0.7900
200	1.3005	0.6821
300	1.0004	0.6402
400	0.8176	0.6036
500	0.7387	0.5806
600	0.7406	0.5819
700	0.8801	0.5787

Final Metrics:

Final Validation Loss: 0.5787
Training Time: 38.36 minutes (2,301.80 seconds)
Peak GPU Memory: 15.84 GB
GPU Utilization: 71.89%
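
The reported training time can be turned into rough throughput figures. The sketch below uses only numbers stated in this card (2,301.80 seconds, 741 steps, 1,969 training samples, 3 epochs). Because the wall-clock time presumably also includes the periodic evaluation runs, treat the results as approximate.

```python
train_seconds = 2301.80
total_steps = 741
samples_seen = 1969 * 3  # training samples * epochs

minutes = train_seconds / 60
seconds_per_step = train_seconds / total_steps
samples_per_second = samples_seen / train_seconds

print(f"training time: {minutes:.2f} min")
print(f"average time per optimizer step: {seconds_per_step:.2f} s")
print(f"average throughput: {samples_per_second:.2f} samples/s")
```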

Training Environment

GPU: NVIDIA L4 (24GB VRAM)
Framework: Unsloth + HuggingFace Transformers
Precision: Mixed (BF16/FP16)
Memory Usage: ~14 GB for training

Training Strategy

Domain-Balanced Approach: Roughly 60/40 split (59.4% medical, 40.6% general) to prevent catastrophic forgetting of general OCR ability
LoRA Fine-tuning: Parameter-efficient fine-tuning targeting the attention and MLP projection layers
Checkpoint Selection: Best model selected based on lowest validation loss
Evaluation: Regular evaluation every 100 steps to monitor convergence
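
The checkpoint-selection rule described above can be illustrated with the logged values from the Training Results table. This is a minimal sketch of "pick the step with the lowest validation loss", not the trainer's internal implementation.

```python
# (step, training loss, validation loss) copied from the Training Results table.
training_log = [
    (100, 1.7026, 0.7900),
    (200, 1.3005, 0.6821),
    (300, 1.0004, 0.6402),
    (400, 0.8176, 0.6036),
    (500, 0.7387, 0.5806),
    (600, 0.7406, 0.5819),
    (700, 0.8801, 0.5787),
]

# Select the evaluated step with the lowest validation loss.
best_step, best_train_loss, best_val_loss = min(training_log, key=lambda row: row[2])

print(f"best checkpoint: step {best_step} (validation loss {best_val_loss:.4f})")
```

The step-700 checkpoint has the lowest validation loss even though its training loss is higher than at steps 500 and 600, which is why selection is based on validation loss rather than training loss.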

How to Use

Installation

pip install transformers unsloth einops torch pillow

Basic Usage

from unsloth import FastVisionModel
from transformers import AutoProcessor
from PIL import Image

# Load model and processor
model, tokenizer = FastVisionModel.from_pretrained(
    "naazimsnh02/medocr-vision"
)
processor = AutoProcessor.from_pretrained(
    "naazimsnh02/medocr-vision",
    trust_remote_code=True
)

# Enable inference mode
FastVisionModel.for_inference(model)

# Load image
image = Image.open("medical_document.jpg")

# Prepare input
instruction = "Extract all text from this medical document:"
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]
}]

# Generate
text_prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    image,
    text_prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

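# Note: temperature and min_p only take effect when sampling is enabled
# (do_sample=True); with greedy decoding they are ignored.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)

# Decode only the newly generated tokens, skipping the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
print(text)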

Advanced Usage with Streaming

from transformers import TextStreamer

# Create text streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate with streaming (reuses model and inputs from Basic Usage)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=False,
    temperature=1.5,
    min_p=0.1
)

Limitations and Biases

Limitations

Image Quality: Performance may degrade with very low-quality or heavily degraded images
Handwriting Variability: Extremely poor handwriting may not be accurately recognized
Language: Primarily trained on English medical documents
Document Types: Optimized for the specific document types in the training set
Context Understanding: This is an OCR model, not a medical understanding model

Potential Biases

Dataset Bias: Training data is primarily from specific medical document sources
Domain Bias: Better performance on medical documents similar to training data
Language Bias: Primarily English-language documents
Format Bias: May perform better on document formats similar to training data

Recommendations

Validate outputs in critical medical applications
Use as part of a larger system with human oversight
Test on your specific use case before production deployment
Consider fine-tuning on domain-specific data for specialized applications
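
One way to test the model on your own documents is to compare its output against hand-verified transcriptions using character error rate (CER). This card does not report CER, so the sketch below is a suggested evaluation helper rather than the method used for this model. The sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up example: the OCR output drops one letter from the drug name.
reference = "Amoxicillin 500 mg"
hypothesis = "Amoxicilin 500 mg"
print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

Averaging this score over a held-out set of your own documents gives a simple baseline to compare before and after any further fine-tuning.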

Ethical Considerations

Privacy and Security

Medical Data: This model processes medical documents which may contain sensitive patient information
HIPAA Compliance: Users must ensure compliance with relevant healthcare data protection regulations
Data Handling: Implement appropriate data security measures when using this model
Audit Trail: Maintain logs of OCR processing for accountability
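
An audit trail does not need to store the document itself. The following sketch logs a hash of the image bytes together with basic metadata, so a processing event can be traced without writing patient text to the log. The record fields and the audit_record function are hypothetical and should be adapted to your own compliance requirements.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(image_bytes: bytes, model_id: str, output_chars: int) -> str:
    """Build a JSON audit entry that identifies the input by hash only."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_id,
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "output_chars": output_chars,  # length only, not the extracted text
    }
    return json.dumps(record, sort_keys=True)


# Illustrative call with placeholder bytes instead of a real scan.
entry = audit_record(b"demo", "naazimsnh02/medocr-vision", output_chars=120)
print(entry)
```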

Responsible Use

This model should be used as an assistive tool, not a replacement for human review
Medical professionals should verify all extracted information
Implement appropriate error handling and validation
Consider the implications of automated medical document processing
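
As a minimal sketch of the validation step mentioned above, the helper below flags OCR results that should be routed to a human reviewer. The two checks (empty output, and output that used the entire max_new_tokens budget and may therefore be truncated) are illustrative assumptions, not an exhaustive validation scheme.

```python
def review_reasons(text: str, new_tokens: int, max_new_tokens: int = 256) -> list[str]:
    """Return the reasons, if any, why an OCR result needs human review."""
    reasons = []
    if not text.strip():
        reasons.append("empty output")
    if new_tokens >= max_new_tokens:
        reasons.append("possibly truncated at max_new_tokens")
    return reasons


# Illustrative cases with made-up values.
print(review_reasons("Hemoglobin 13.5 g/dL", new_tokens=12))
print(review_reasons("", new_tokens=0))
print(review_reasons("Patient name: ...", new_tokens=256))
```

An empty list means neither check fired. It does not mean the extracted text is correct, so verification by medical professionals still applies.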

Citation

@misc{medocr-vision-2025,
  title={MedOCR-Vision: Medical Document OCR with PaddleOCR-VL},
  author={Syed Naazim Hussain},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/naazimsnh02/medocr-vision}}
}

Additional Resources

Code Repository: https://github.com/naazimsnh02/medocr-vision
Training Dataset: https://huggingface.co/datasets/naazimsnh02/medocr-vision-dataset
Training Notebook: Available in the repository
ERNIE Challenge: Submitted for ERNIE AI Developer Challenge

License

This model is released under the MIT License. Please refer to individual dataset licenses for usage terms of the training data.

Acknowledgments

Base Model: unsloth/PaddleOCR-VL
Framework: Unsloth for efficient training
Dataset Sources: chinmays18, saurabh1896, dikshaasinghhh, mychen76
LLM Providers: Nebius and Novita for data processing
PaddleOCR Team: For the excellent OCR framework

Model Version: 1.0
Release Date: December 2025
Challenge: ERNIE AI Developer Challenge
