---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- document-ai
- node-implementations
- table-extraction
- layouts
- markdown
- html-markdown
- document-retrieval
- visual-grounding
- pdf-ocr
- layout-analysis
- json
- html
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Z6i1Tpuq80rvooHrCsXJZ.png)

# **proxima-ocr-d.markdown-post3.0.l**

> **proxima-ocr-d.markdown-post3.0.l** is an experimental document AI multimodal model fine-tuned on top of **Qwen3-VL-8B-Instruct**, optimized for high-precision OCR and structured document reconstruction. The model converts documents into **Markdown**, **HTML-Markdown**, and hybrid enriched documentation formats that can embed inline programming languages and reconstruct complex layouts such as tables, forms, and mathematical content.

# Key Enhancements

* **Dynamic Markdown Reconstruction**: Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.
* **Inline Code and Language Embedding**: Embeds Python, JavaScript, LaTeX, and shell syntax directly into reconstructed documents for technical and research documentation.
* **High-Fidelity OCR and Visual Parsing**: Accurate text recognition across structured and unstructured scanned documents, including multi-page layout reasoning.
* **Complex Layout Interpretation**: Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.
* **Document Retrieval and Semantic Linking**: Efficient multi-page chunking with cross-reference recognition and content traceability.
* **Multimodal Long Reasoning**: Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts.

---

> 👉 This is a stage-progression model and may currently contain artifacts.
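The previews in the next section were produced by steering the target output format through the user prompt. The strings below are illustrative sketches only: apart from `"Convert to Markdown."`, which appears verbatim in the Quick Start further down, the exact phrasings are assumptions rather than documented prompts.

```python
# Illustrative format-steering prompts (assumptions, not documented
# training prompts). Use one as the "text" entry of the chat message
# in the Quick Start example below.
FORMAT_PROMPTS = {
    "markdown": "Convert to Markdown.",  # used verbatim in the Quick Start
    "html_markdown": "Convert to HTML-Markdown, preserving tables and layout.",  # hypothetical
    "json_nodes": "Extract the document as structured JSON nodes.",  # hypothetical
    "yaml_nodes": "Extract the document as structured YAML nodes.",  # hypothetical
}
```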
---

# Example Preview

### [1] Markdown HTML

| Input Image | Markdown Preview Page 1 | Markdown Preview Page 2 |
|-------------|-------------------------|-------------------------|
| ![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_Y-OAIttAgeANK7Dv_IGD.jpeg) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/WQlpD6VpMNwhQVzqJutuz.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fEGekbnM1NvmqIocxuYb7.png) |

### [2] JSON Nodes

| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
| ![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_jmGwS5ODHNNp1FswE2R7.jpeg) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/NrrCDWMenmxHjrhmGyoKZ.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/NRUOYT_fE90wqTJi-u39q.png) |

### [3] YAML Nodes

| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
| ![input](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/tGUTNem7wMUhlZQw7UMAr.png) | ![Page1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/w1AJfqnn7CyWAiJ9Ih4ll.png) | ![Page2](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/pvmhwEJPkdBR9duo-e58G.png) |

---

# Quick Start with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (dtype picked automatically, weights
# spread across available devices).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# A single-turn request: one document image plus a format instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and preprocess the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

# Intended Use

* OCR to Markdown or HTML-Markdown conversion
* Complex document reconstruction and formatting regeneration
* Multi-page document reasoning and retrieval
* Table extraction and structured output transformation
* Mathematical OCR and LaTeX conversion
* Form extraction and structured entity generation
* Knowledge base indexing and large-document QA
* Documentation regeneration for enterprise automation

# Limitations

* Accuracy may drop on severely damaged or poorly scanned images
* Significant GPU VRAM is required for long sequences and multi-page documents (a quantized-loading sketch follows this list)
* Language accuracy varies for low-resource scripts
* Complex objects such as mixed-orientation blocks may require secondary post-processing
* May occasionally produce formatting misalignment in highly irregular layouts
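The VRAM constraint noted above can be eased with quantized loading. A minimal sketch, assuming `bitsandbytes` is installed and that this checkpoint tolerates 4-bit quantization (untested for this model):

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization cuts weight memory roughly 4x versus bfloat16;
# matmuls still run in bfloat16 for numerical stability.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")
```

Quantization can slightly degrade OCR fidelity on fine-grained layouts, so compare against full-precision outputs before relying on it.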
## Training Details

| Parameter     | Value                                                                                                  |
| ------------- | ------------------------------------------------------------------------------------------------------ |
| Dataset Size  | approx. 544K entries (modular combination of open-source data and synthetic document data from Gemini 3 Pro) |
| Architecture  | Qwen3VLForConditionalGeneration                                                                        |
| Training Time | approx. 17,040 seconds (4 h 44 m)                                                                      |
| Precision     | bfloat16                                                                                               |
| Hardware      | 4x H100 SXM (320 GB VRAM)                                                                              |
| System Memory | 752 GB RAM                                                                                             |
| CPU           | 80 vCPU                                                                                                |

## References

* Qwen2.5-VL [https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923)
* DocVLM: Make Your VLM an Efficient Reader [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
* YaRN: Efficient Context Window Extension [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
* Qwen2-VL: High-Resolution Perception [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
* Qwen-VL: Vision-Language Understanding and OCR [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
* OCR Benchmark for Multimodal Models [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)