--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen3-VL-8B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- document-ai |
|
|
- node-implementations |
|
|
- table-extraction |
|
|
- layouts |
|
|
- markdown |
|
|
- html-markdown |
|
|
- document-retrieval |
|
|
- visual-grounding |
|
|
- pdf-ocr |
|
|
- layout-analysis |
|
|
- json |
|
|
- html |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# **proxima-ocr-d.markdown-post3.0.l** |
|
|
|
|
|
> **proxima-ocr-d.markdown-post3.0.l** is an experimental multimodal document-AI model fine-tuned on top of **Qwen3-VL-8B-Instruct** and optimized for high-precision OCR and structured document reconstruction. The model converts documents into **Markdown**, **HTML-Markdown**, and hybrid enriched documentation formats that can embed inline code in multiple programming languages and reconstruct complex layouts such as tables, forms, and mathematical content.
|
|
|
|
|
# Key Enhancements |
|
|
|
|
|
* **Dynamic Markdown Reconstruction** |
|
|
Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment. |
|
|
|
|
|
* **Inline Code and Language Embedding** |
|
|
Directly embeds Python, JavaScript, LaTeX, and shell syntax into reconstructed documents for technical and research documentation.
|
|
|
|
|
* **High Fidelity OCR and Visual Parsing** |
|
|
Accurate recognition of text across structured and unstructured scanned documents, including multi-page layout reasoning.
|
|
|
|
|
* **Complex Layout Interpretation** |
|
|
Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.
|
|
|
|
|
* **Document Retrieval and Semantic Linking** |
|
|
Efficient multi-page chunking with cross-reference recognition and content traceability (a chunking sketch follows this list).
|
|
|
|
|
* **Multimodal Long Reasoning** |
|
|
Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts. |
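Since Markdown is the primary output format, retrieval pipelines can chunk it client-side. The sketch below is a minimal illustration, assuming the model's output is plain Markdown; the `chunk_markdown` helper and the heading-split rule are illustrative, not part of the model:

```python
# Minimal sketch: split model-produced Markdown into heading-anchored chunks
# for retrieval indexing. The one-chunk-per-H1/H2 rule is an assumption.
import re

def chunk_markdown(md: str) -> list[dict]:
    """Split Markdown into chunks, each anchored to its nearest H1/H2 heading."""
    chunks, current = [], {"heading": "(preamble)", "body": []}
    for line in md.splitlines():
        if re.match(r"^#{1,2}\s", line):  # an H1 or H2 starts a new chunk
            if current["body"]:
                chunks.append(current)
            current = {"heading": line.lstrip("# ").strip(), "body": []}
        else:
            current["body"].append(line)
    if current["body"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["body"]).strip()} for c in chunks]

print(chunk_markdown("# Title\nIntro text.\n## Section A\nDetails."))
```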
|
|
|
|
|
--- |
|
|
|
|
|
> 👉 This model is a stage-progression checkpoint and may currently produce artifacts.
|
|
|
|
|
--- |
|
|
|
|
|
# Example Preview |
|
|
|
|
|
### [1] Markdown HTML |
|
|
|
|
|
| Input Image | Markdown Preview Page 1 | Markdown Preview Page 2 |
|-------------|-------------------------|-------------------------|
|  |  |  |
|
|
|
|
|
### [2] JSON Nodes |
|
|
|
|
|
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
|  |  |  |
|
|
|
|
|
### [3] YAML Nodes |
|
|
|
|
|
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
|  |  |  |
|
|
|
|
|
--- |
|
|
|
|
|
# Quick Start with Transformers |
|
|
|
|
|
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# A single user turn containing the document image and the conversion instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs referenced in `messages`.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new completion is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
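
For multi-page documents, several page images can be passed in a single user turn. The sketch below reuses `model`, `processor`, and `process_vision_info` from the Quick Start; the page URLs are placeholders:

```python
# Multi-page sketch: pass one image entry per page in the same user turn.
# The page URLs below are placeholders; reuse `model` and `processor` from above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/page-1.png"},
            {"type": "image", "image": "https://example.com/page-2.png"},
            {"type": "text", "text": "Convert both pages to a single Markdown document."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=4096)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```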
|
|
|
|
|
# Intended Use |
|
|
|
|
|
* OCR to Markdown or HTML-Markdown conversion |
|
|
* Complex document reconstruction and formatting regeneration |
|
|
* Multi-page document reasoning and retrieval
|
|
* Table extraction and structured output transformation (a table-parsing sketch follows this list)
|
|
* Mathematical OCR and LaTeX conversion |
|
|
* Form extraction and structured entity generation |
|
|
* Knowledge base indexing and large document QA |
|
|
* Documentation regeneration for enterprise automation |
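
For table extraction, the model's Markdown tables can be converted into structured records downstream. A minimal sketch (the `markdown_table_to_rows` helper and the sample table are illustrative):

```python
# Minimal sketch: turn a Markdown table emitted by the model into row dicts.
def markdown_table_to_rows(md_table: str) -> list[dict]:
    lines = [l.strip() for l in md_table.strip().splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

sample = """| Item | Qty |
|------|-----|
| Pen  | 2   |
| Pad  | 5   |"""
print(markdown_table_to_rows(sample))
```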
|
|
|
|
|
# Limitations |
|
|
|
|
|
* Accuracy may drop on extremely damaged or poorly scanned images |
|
|
* Significant GPU VRAM is required for long sequences and multi-page documents (a quantized-loading sketch follows this list)
|
|
* Language accuracy varies for low-resource scripts
|
|
* Complex objects such as mixed-orientation blocks may require secondary post-processing
|
|
* May occasionally produce formatting misalignment in highly irregular layouts |
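
To mitigate the VRAM requirement, the model can be loaded in 4-bit precision via `bitsandbytes`. A minimal sketch, assuming `bitsandbytes` is installed and supports this architecture; quantization may trade off some OCR fidelity:

```python
# Sketch: 4-bit loading to reduce VRAM, assuming bitsandbytes is installed.
# Fine-grained OCR quality may degrade under quantization; treat as a trade-off.
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")
```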
|
|
|
|
|
## Training Details |
|
|
|
|
|
| Parameter     | Value                                                                                                                     |
| ------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Dataset Size  | approx. 544K entries (a modular combination of open-source data and synthetic document data generated with Gemini 3 Pro) |
| Architecture  | Qwen3VLForConditionalGeneration                                                                                           |
| Training Time | approx. 17,040 seconds (4 h 44 m)                                                                                         |
| Precision     | bfloat16                                                                                                                  |
| Hardware      | 4x H100 SXM (320 GB VRAM)                                                                                                 |
| System Memory | 752 GB RAM                                                                                                                |
| CPU           | 80 vCPU                                                                                                                   |
|
|
|
|
|
## References |
|
|
|
|
|
* Qwen2.5-VL Technical Report
|
|
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923) |
|
|
|
|
|
* DocVLM: Make Your VLM an Efficient Reader
|
|
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1) |
|
|
|
|
|
* YaRN: Efficient Context Window Extension of Large Language Models
|
|
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071) |
|
|
|
|
|
* Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
|
|
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191) |
|
|
|
|
|
* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
|
|
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966) |
|
|
|
|
|
* OCR Benchmark for Multimodal Models |
|
|
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |