---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- node-implementations
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
- json
- html
---

# **proxima-ocr-d.markdown-post3.0.l**
> **proxima-ocr-d.markdown-post3.0.l** is an experimental document AI multimodal model fine-tuned on top of **Qwen3-VL-8B-Instruct** and optimized for high-precision OCR and structured document reconstruction. The model converts documents into **Markdown**, **HTML-Markdown**, and hybrid enriched documentation formats that can embed inline code and reconstruct complex layouts such as tables, forms, and mathematical content.
# Key Enhancements
* **Dynamic Markdown Reconstruction**
Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.
* **Inline Code and Language Embedding**
Embeds Python, JavaScript, LaTeX, and shell syntax directly into reconstructed documents for technical and research documentation.
* **High Fidelity OCR and Visual Parsing**
Accurate recognition of text across structured and unstructured scanned documents, including multi-page layout reasoning.
* **Complex Layout Interpretation**
Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.
* **Document Retrieval and Semantic Linking**
Efficient multi-page chunking with cross-reference recognition and content traceability.
* **Multimodal Long Reasoning**
Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts (a multi-page input sketch follows this list).
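
A minimal sketch of multi-page input, not a tested recipe: both page images go into a single user turn so the model can reason across pages. The file paths are placeholders; drop this in as the `messages` list in the Quick Start below and run the rest of that pipeline unchanged.

```python
# Hypothetical multi-page request: both page images share one user turn,
# letting the model resolve cross-page references. The paths below are
# placeholders, not files shipped with this model.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/page_1.png"},
            {"type": "image", "image": "file:///path/to/page_2.png"},
            {
                "type": "text",
                "text": "Convert both pages to Markdown, preserving cross-references.",
            },
        ],
    }
]
```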
---
> 👉 This model is an intermediate stage-progression checkpoint, and its outputs may currently contain artifacts.
---
# Example Preview
### [1] Markdown HTML
| Input Image | Markdown Preview Page 1 | Markdown Preview Page 2 |
|------------|-------------------------|--------------------------|
|  |  |  |
### [2] JSON Nodes
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|------------|----------------------|----------------------|
|  |  |  |
### [3] YAML Nodes
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|------------|----------------------|----------------------|
|  |  |  |
---
# Quick Start with Transformers
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its paired processor.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# One user turn: a document image plus the conversion instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
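The Example Preview section above shows Markdown, JSON-node, and YAML-node outputs; in this pipeline the output format is steered purely through the text instruction. The exact prompt wording the model was tuned on is an assumption on this card, so treat these strings as starting points:

```python
# Hypothetical prompt variants for the structured outputs shown in the
# Example Preview section; the precise phrasing is an assumption, not a
# documented spec. Swap the text entry in `messages` above, then re-run
# the template/processor/generate steps unchanged.
prompts = {
    "markdown": "Convert to Markdown.",
    "json_nodes": "Convert the document into JSON nodes.",
    "yaml_nodes": "Convert the document into YAML nodes.",
}
messages[0]["content"][1] = {"type": "text", "text": prompts["json_nodes"]}
```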
# Intended Use
* OCR to Markdown or HTML-Markdown conversion
* Complex document reconstruction and formatting regeneration
* Multi-page document reasoning and retrieval
* Table extraction and structured output transformation (see the parsing sketch after this list)
* Mathematical OCR and LaTeX conversion
* Form extraction and structured entity generation
* Knowledge base indexing and large document QA
* Documentation regeneration for enterprise automation
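
For the table-extraction use case, a minimal post-processing sketch is shown below. It assumes the model emitted a GitHub-style pipe table inside `output_text[0]` from the Quick Start; real outputs may need more robust parsing (escaped pipes, multi-line cells).

```python
# Minimal sketch: pull pipe-delimited Markdown table rows out of the
# decoded output and split them into cells. The separator row (|---|---|)
# contains only "|", "-", ":", and spaces, so it is skipped.
def parse_markdown_table(text: str) -> list[list[str]]:
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|") and not set(line) <= set("|-: "):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows

table = parse_markdown_table(output_text[0])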
# Limitations
* Accuracy may drop on extremely damaged or poorly scanned images
* Significant GPU VRAM is required for long sequences and multi-page documents (a 4-bit loading sketch follows this list)
* Language accuracy varies for low-resource scripts
* Complex objects such as mixed-orientation blocks may require secondary post-processing
* May occasionally produce formatting misalignment in highly irregular layouts
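
For the VRAM limitation, one common mitigation (not part of this card's tested setup) is 4-bit loading via bitsandbytes. Quantization can degrade OCR fidelity, so validate on your own documents before relying on it.

```python
# Sketch of a 4-bit load with bitsandbytes; an assumed mitigation, not a
# configuration this card was evaluated with.
import torch
from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l",
    quantization_config=bnb_config,
    device_map="auto",
)
```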
## Training Details
| Parameter | Value |
| ------------- | ------------------------------------------------- |
| Dataset Size  | approx. 544K entries (a modular combination of open-source data and synthetic document data from Gemini 3 Pro) |
| Architecture | Qwen3VLForConditionalGeneration |
| Training Time | approx. 17,040 seconds (4 h 44 m) |
| Precision | bfloat16 |
| Hardware | 4x H100 SXM (320 GB VRAM) |
| System Memory | 752 GB RAM |
| CPU | 80 vCPU |
## References
* Qwen2.5-VL Technical Report
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)
* DocVLM: Make Your VLM an Efficient Reader
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
* YaRN: Efficient Context Window Extension
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
* Qwen2-VL: High-Resolution Perception
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
* Qwen-VL: Vision-Language Understanding and OCR
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
* OCR Benchmark for Multimodal Models
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)