-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 189 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 43
Collections
Discover the best community collections!
Collections including paper arxiv:2511.14993
-
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper • 2511.18538 • Published • 294 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Paper • 2509.26507 • Published • 546 -
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 506
-
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Paper • 2507.01352 • Published • 56 -
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Paper • 2507.13563 • Published • 53 -
Scaling Laws for Optimal Data Mixtures
Paper • 2507.09404 • Published • 37 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230
-
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Paper • 2512.16093 • Published • 95 -
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Paper • 2511.22699 • Published • 234 -
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 214 -
Sharp Monocular View Synthesis in Less Than a Second
Paper • 2512.10685 • Published • 28
-
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper • 2511.15065 • Published • 77 -
SAM 3: Segment Anything with Concepts
Paper • 2511.16719 • Published • 128 -
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Paper • 2511.21691 • Published • 36
-
SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Paper • 2509.14981 • Published • 27 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
Kling-Omni Technical Report
Paper • 2512.16776 • Published • 169 -
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper • 2512.04677 • Published • 170
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 189 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 43
-
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Paper • 2507.01352 • Published • 56 -
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Paper • 2507.13563 • Published • 53 -
Scaling Laws for Optimal Data Mixtures
Paper • 2507.09404 • Published • 37 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230
-
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Paper • 2511.18538 • Published • 294 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Paper • 2509.26507 • Published • 546 -
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 506
-
TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Paper • 2512.16093 • Published • 95 -
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Paper • 2511.22699 • Published • 234 -
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
Paper • 2512.16676 • Published • 214 -
Sharp Monocular View Synthesis in Less Than a Second
Paper • 2512.10685 • Published • 28
-
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper • 2511.15065 • Published • 77 -
SAM 3: Segment Anything with Concepts
Paper • 2511.16719 • Published • 128 -
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Paper • 2511.21691 • Published • 36
-
SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Paper • 2509.14981 • Published • 27 -
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper • 2511.14993 • Published • 230 -
Kling-Omni Technical Report
Paper • 2512.16776 • Published • 169 -
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper • 2512.04677 • Published • 170