---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- node-implementations
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
- json
- html
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Z6i1Tpuq80rvooHrCsXJZ.png)

# **proxima-ocr-d.markdown-post3.0.l**

> **proxima-ocr-d.markdown-post3.0.l** is an experimental document-AI multimodal model fine-tuned from **Qwen3-VL-8B-Instruct**, optimized for high-precision OCR and structured document reconstruction. The model converts documents into **Markdown**, **HTML-Markdown**, and hybrid enriched documentation formats, embedding inline programming-language snippets and reconstructing complex layouts such as tables, forms, and mathematical content.

# Key Enhancements

* **Dynamic Markdown Reconstruction**
  Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.

* **Inline Code and Language Embedding**
  Directly embeds Python, JavaScript, LaTeX, and shell syntax into reconstructed documents for technical and research documentation.

* **High Fidelity OCR and Visual Parsing**
  Accurate recognition of text across structured and unstructured scanned documents, including multi-page layout reasoning.

* **Complex Layout Interpretation**
  Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.

* **Document Retrieval and Semantic Linking**
  Efficient multi-page chunking with cross-reference recognition and content traceability.

* **Multimodal Long Reasoning**
  Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts.
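Since the inline code embedding above means the reconstructed Markdown can contain fenced Python, JavaScript, LaTeX, or shell blocks, a small post-processing helper can pull those blocks out for separate handling. This is a minimal sketch, not part of the model's own tooling; the regex assumes standard triple-backtick fences in the output.

```python
import re

# Sketch: pull fenced code blocks out of the reconstructed Markdown so the
# embedded Python/JavaScript/LaTeX/shell snippets can be processed separately.
FENCE_RE = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)

def extract_code_blocks(markdown_text):
    """Return (language, body) pairs for every fenced block in the output."""
    return [(lang or "text", body) for lang, body in FENCE_RE.findall(markdown_text)]
```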

---

> 👉 This is a stage-progression model and may currently produce artifacts in its output.

---

# Example Preview

### [1] Markdown HTML

| Input Image | Markdown Preview Page 1 | Markdown Preview Page 2 |
|------------|-------------------------|--------------------------|
| ![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_Y-OAIttAgeANK7Dv_IGD.jpeg) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/WQlpD6VpMNwhQVzqJutuz.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fEGekbnM1NvmqIocxuYb7.png) |

### [2] JSON Nodes

| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|------------|----------------------|----------------------|
| ![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_jmGwS5ODHNNp1FswE2R7.jpeg) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/NrrCDWMenmxHjrhmGyoKZ.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/NRUOYT_fE90wqTJi-u39q.png) |

### [3] YAML Nodes

| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|------------|----------------------|----------------------|
| ![input](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/tGUTNem7wMUhlZQw7UMAr.png) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/w1AJfqnn7CyWAiJ9Ih4ll.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/pvmhwEJPkdBR9duo-e58G.png) |
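The three preview formats above are selected purely by the instruction text in the chat message. A small helper can keep those prompts in one place; note that only "Convert to Markdown." appears in this card's Quick Start, so the JSON-node and YAML-node prompt wordings below are illustrative assumptions.

```python
# Sketch: build a Qwen3-VL chat request for a chosen output format.
# Only the Markdown prompt is taken from this card's Quick Start; the
# JSON/YAML node prompt strings are assumed wordings.
FORMAT_PROMPTS = {
    "markdown": "Convert to Markdown.",
    "json-nodes": "Convert this document into structured JSON nodes.",  # assumed
    "yaml-nodes": "Convert this document into structured YAML nodes.",  # assumed
}

def build_messages(image, fmt="markdown"):
    """Return a chat message list for one document image and target format."""
    if fmt not in FORMAT_PROMPTS:
        raise ValueError(f"unknown format: {fmt}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": FORMAT_PROMPTS[fmt]},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template(...)` exactly as in the Quick Start.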

---

# Quick Start with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# A single user turn: one document image plus the conversion instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Render the chat template and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
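For multi-page documents, Qwen-VL-style chat templates accept several image entries in a single user turn, so all pages can share one instruction. A minimal sketch of packing the pages into one message (the page paths are placeholders):

```python
# Sketch: one request covering several pages of the same document.
# Qwen-VL-style chat templates accept multiple image entries per user turn;
# the page paths used here are placeholders.
def build_multipage_messages(page_images, instruction="Convert to Markdown."):
    """Pack every page image plus one instruction into a single user turn."""
    content = [{"type": "image", "image": img} for img in page_images]
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

# The result drops into the pipeline above unchanged:
# messages = build_multipage_messages(["page1.png", "page2.png"])
```

VRAM usage grows with page count, so very long documents may still need to be chunked across several requests.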

# Intended Use

* OCR to Markdown or HTML-Markdown conversion
* Complex document reconstruction and formatting regeneration
* Multi-page document reasoning and retrieval
* Table extraction and structured output transformation
* Mathematical OCR and LaTeX conversion
* Form extraction and structured entity generation
* Knowledge base indexing and large document QA
* Documentation regeneration for enterprise automation

# Limitations

* Accuracy may drop on severely damaged or poorly scanned images
* Significant GPU VRAM is required for long sequences and multi-page documents
* Language accuracy varies for low-resource scripts
* Complex objects such as mixed-orientation blocks may require secondary post-processing
* May occasionally produce formatting misalignment in highly irregular layouts

## Training Details

| Parameter     | Value                                             |
| ------------- | ------------------------------------------------- |
| Dataset Size  | approx. 544K entries (a modular combination of open-source data and synthetic document data from Gemini 3 Pro) |
| Architecture  | Qwen3VLForConditionalGeneration                   |
| Training Time | approx. 17,040 seconds (4 h 44 m)                 |
| Precision     | bfloat16                                          |
| Hardware      | 4x H100 SXM (320 GB VRAM)                         |
| System Memory | 752 GB RAM                                        |
| CPU           | 80 vCPU                                           |

## References

* Qwen2.5-VL
  [https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)

* DocVLM: Make Your VLM an Efficient Reader
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

* YaRN: Efficient Context Window Extension
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

* Qwen2-VL: High-Resolution Perception
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

* Qwen-VL: Vision-Language Understanding and OCR
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

* OCR Benchmark for Multimodal Models
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)