--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- Qwen/Qwen3-VL-8B-Instruct |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- document-ai |
|
|
- node-implementations |
|
|
- table-extraction |
|
|
- layouts |
|
|
- markdown |
|
|
- html-markdown |
|
|
- document-retrieval |
|
|
- visual-grounding |
|
|
- pdf-ocr |
|
|
- layout-analysis |
|
|
- json |
|
|
- html |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# **proxima-ocr-d.markdown-post3.0.l** |
|
|
|
|
|
> **proxima-ocr-d.markdown-post3.0.l** is an experimental multimodal document-AI model fine-tuned on top of **Qwen3-VL-8B-Instruct** and optimized for high-precision OCR and structured document reconstruction. The model converts documents into **Markdown**, **HTML-Markdown**, and hybrid enriched documentation formats that can embed inline code in multiple programming languages and reconstruct complex layouts such as tables, forms, and mathematical content.
|
|
|
|
|
# Key Enhancements |
|
|
|
|
|
* **Dynamic Markdown Reconstruction** |
|
|
Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment. |
|
|
|
|
|
* **Inline Code and Language Embedding** |
|
|
Directly embeds Python, JavaScript, LaTeX, and shell syntax into reconstructed documents for technical and research documentation.
|
|
|
|
|
* **High Fidelity OCR and Visual Parsing** |
|
|
Accurate recognition of text across structured and unstructured scanned documents, including multi-page layout reasoning.
|
|
|
|
|
* **Complex Layout Interpretation** |
|
|
Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.
|
|
|
|
|
* **Document Retrieval and Semantic Linking** |
|
|
Efficient multi-page chunking with cross-reference recognition and content traceability (a chunking sketch follows this list).
|
|
|
|
|
* **Multimodal Long Reasoning** |
|
|
Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts. |
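Since Markdown is the primary output format, retrieval pipelines can chunk it client-side. The sketch below is a minimal illustration, assuming the model's output is plain Markdown; the `chunk_markdown` helper and the heading-split rule are illustrative, not part of the model:

```python
# Minimal sketch: split model-produced Markdown into heading-anchored chunks
# for retrieval indexing. The one-chunk-per-H1/H2 rule is an assumption.
import re

def chunk_markdown(md: str) -> list[dict]:
    """Split Markdown into chunks, each anchored to its nearest H1/H2 heading."""
    chunks, current = [], {"heading": "(preamble)", "body": []}
    for line in md.splitlines():
        if re.match(r"^#{1,2}\s", line):  # an H1 or H2 starts a new chunk
            if current["body"]:
                chunks.append(current)
            current = {"heading": line.lstrip("# ").strip(), "body": []}
        else:
            current["body"].append(line)
    if current["body"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["body"]).strip()} for c in chunks]

print(chunk_markdown("# Title\nIntro text.\n## Section A\nDetails."))
```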
|
|
|
|
|
--- |
|
|
|
|
|
> 👉 This model is a stage-progression checkpoint and may currently produce artifacts.
|
|
|
|
|
--- |
|
|
|
|
|
# Example Preview |
|
|
|
|
|
### [1] Markdown HTML |
|
|
|
|
|
| Input Image | Markdown Preview Page 1 | Markdown Preview Page 2 |
|-------------|-------------------------|-------------------------|
|  |  |  |
|
|
|
|
|
### [2] JSON Nodes |
|
|
|
|
|
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
|  |  |  |
|
|
|
|
|
### [3] YAML Nodes |
|
|
|
|
|
| Input Image | Node Preview Page 1 | Node Preview Page 2 |
|-------------|---------------------|---------------------|
|  |  |  |
|
|
|
|
|
--- |
|
|
|
|
|
# Quick Start with Transformers |
|
|
|
|
|
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# A single user turn containing the document image and the conversion instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs referenced in `messages`.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new completion is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
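
For multi-page documents, several page images can be passed in a single user turn. The sketch below reuses `model`, `processor`, and `process_vision_info` from the Quick Start; the page URLs are placeholders:

```python
# Multi-page sketch: pass one image entry per page in the same user turn.
# The page URLs below are placeholders; reuse `model` and `processor` from above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/page-1.png"},
            {"type": "image", "image": "https://example.com/page-2.png"},
            {"type": "text", "text": "Convert both pages to a single Markdown document."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=4096)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```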
|
|
|
|
|
# Intended Use |
|
|
|
|
|
* OCR to Markdown or HTML-Markdown conversion |
|
|
* Complex document reconstruction and formatting regeneration |
|
|
* Multi-page document reasoning and retrieval
|
|
* Table extraction and structured output transformation (a table-parsing sketch follows this list)
|
|
* Mathematical OCR and LaTeX conversion |
|
|
* Form extraction and structured entity generation |
|
|
* Knowledge base indexing and large document QA |
|
|
* Documentation regeneration for enterprise automation |
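
For table extraction, the model's Markdown tables can be converted into structured records downstream. A minimal sketch (the `markdown_table_to_rows` helper and the sample table are illustrative):

```python
# Minimal sketch: turn a Markdown table emitted by the model into row dicts.
def markdown_table_to_rows(md_table: str) -> list[dict]:
    lines = [l.strip() for l in md_table.strip().splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

sample = """| Item | Qty |
|------|-----|
| Pen  | 2   |
| Pad  | 5   |"""
print(markdown_table_to_rows(sample))
```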
|
|
|
|
|
# Limitations |
|
|
|
|
|
* Accuracy may drop on extremely damaged or poorly scanned images |
|
|
* Significant GPU VRAM is required for long sequences and multi-page documents (a quantized-loading sketch follows this list)
|
|
* Language accuracy varies for low-resource scripts
|
|
* Complex objects such as mixed-orientation blocks may require secondary post-processing
|
|
* May occasionally produce formatting misalignment in highly irregular layouts |
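
To mitigate the VRAM requirement, the model can be loaded in 4-bit precision via `bitsandbytes`. A minimal sketch, assuming `bitsandbytes` is installed and supports this architecture; quantization may trade off some OCR fidelity:

```python
# Sketch: 4-bit loading to reduce VRAM, assuming bitsandbytes is installed.
# Fine-grained OCR quality may degrade under quantization; treat as a trade-off.
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")
```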
|
|
|
|
|
## Training Details |
|
|
|
|
|
| Parameter     | Value                                                                                                                     |
| ------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Dataset Size  | approx. 544K entries (a modular combination of open-source data and synthetic document data generated with Gemini 3 Pro) |
| Architecture  | Qwen3VLForConditionalGeneration                                                                                           |
| Training Time | approx. 17,040 seconds (4 h 44 m)                                                                                         |
| Precision     | bfloat16                                                                                                                  |
| Hardware      | 4x H100 SXM (320 GB VRAM)                                                                                                 |
| System Memory | 752 GB RAM                                                                                                                |
| CPU           | 80 vCPU                                                                                                                   |
|
|
|
|
|
## References |
|
|
|
|
|
* Qwen2.5-VL Technical Report
|
|
[https://huggingface.co/papers/2502.13923](https://huggingface.co/papers/2502.13923) |
|
|
|
|
|
* DocVLM: Make Your VLM an Efficient Reader
|
|
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1) |
|
|
|
|
|
* YaRN: Efficient Context Window Extension of Large Language Models
|
|
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071) |
|
|
|
|
|
* Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
|
|
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191) |
|
|
|
|
|
* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
|
|
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966) |
|
|
|
|
|
* OCR Benchmark for Multimodal Models |
|
|
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |