- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 17
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  Paper • 2310.14566 • Published • 27
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 9
- Conditional Diffusion Distillation
  Paper • 2310.01407 • Published • 20
Collections including paper arxiv:2309.11419
- Kosmos-2.5: A Multimodal Literate Model
  Paper • 2309.11419 • Published • 55
- Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
  Paper • 2311.05698 • Published • 14
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
  Paper • 2311.06242 • Published • 95
- PolyMaX: General Dense Prediction with Mask Transformer
  Paper • 2311.05770 • Published • 11
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants
  Paper • 2309.10020 • Published • 41
- Kosmos-2.5: A Multimodal Literate Model
  Paper • 2309.11419 • Published • 55
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  Paper • 2309.16058 • Published • 56
- Jointly Training Large Autoregressive Multimodal Models
  Paper • 2309.15564 • Published • 8