
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao1, Bencheng Liao1, Shaoyu Chen2, Haoran Yin2, Qian Zhang2, Wenyu Liu1, Xinggang Wang1,✉️

1Huazhong University of Science and Technology, 2Horizon Robotics

(✉️) corresponding author: [email protected]



Introduction

InfiniteVL is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing unlimited multimodal streams.

By synergizing Sliding Window Attention (SWA) for fine-grained local perception and Gated DeltaNet for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.
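
For intuition, the Gated DeltaNet branch can be viewed as maintaining a fixed-size fast-weight state that is decayed and error-corrected once per token, while SWA handles fine-grained local context within a bounded window. Below is a naive, purely illustrative PyTorch recurrence in the spirit of the gated delta rule. It is a sequential reference sketch for exposition only (shapes, gate conventions, and names are our own illustrations), not the chunk-parallel FLA kernels the released model relies on.

import torch

def gated_delta_rule_reference(q, k, v, alpha, beta):
    # Naive O(T) recurrence over a (d x d) fast-weight state S (illustrative only):
    #   S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
    #   o_t = S_t @ q_t
    # q, k, v: (T, d) tensors; alpha, beta: (T,) gates in (0, 1).
    T, d = q.shape
    S = q.new_zeros(d, d)
    I = torch.eye(d, dtype=q.dtype, device=q.device)
    outputs = []
    for t in range(T):
        k_t = k[t].unsqueeze(1)  # (d, 1)
        v_t = v[t].unsqueeze(1)  # (d, 1)
        # Gated decay of the previous state plus an error-correcting (delta-rule) write.
        S = alpha[t] * S @ (I - beta[t] * k_t @ k_t.T) + beta[t] * v_t @ k_t.T
        outputs.append(S @ q[t])  # read the memory with the current query
    return torch.stack(outputs)  # (T, d)

# Toy check: the state S stays (d x d) no matter how long the sequence is.
T, d = 16, 8
o = gated_delta_rule_reference(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
                               torch.rand(T), torch.rand(T))
print(o.shape)  # torch.Size([16, 8])

Because the recurrent state has a fixed size, per-token compute and memory stay constant as the stream grows, which is what enables the constant-memory, unlimited-input behaviour highlighted below.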


✨ Key Highlights

  • 🚀 High Efficiency: Achieves a >3.6× inference speedup and a constant memory footprint compared to FlashAttention-2-accelerated Transformers.
  • ⚡ Real-Time Streaming: Sustains a stable 24 FPS prefill speed on a single NVIDIA RTX 4090 for continuous video understanding.
  • 🧠 Unlimited Context: Effectively retains context over extremely long sequences (tested on >500K tokens) without out-of-memory (OOM) errors.
  • 🏆 Strong Performance: Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive set of benchmarks.

Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

| Model | Stage | Description | Training Context Length | Download |
| --- | --- | --- | --- | --- |
| InfiniteVL-4B | Stage 2 | Best Generalist / Base. The checkpoint taken directly after instruction SFT. It delivers peak foundational performance on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | 🤗 Hugging Face |
| InfiniteVL-4B-LongSFT | Stage 3 | Long-Context Adapted. Fine-tuned with only a small amount of long-sequence multimodal data. It activates length generalization for streaming scenarios, though its full potential on extreme context lengths is not yet fully exploited. | 32K | 🤗 Hugging Face |

💡 Recommendations:

  • For Long-Context Inference: Please use the Stage 3 model. It enables stable streaming inference and avoids memory explosion.
  • For Training / Fine-tuning: We strongly recommend using the Stage 2 model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.

Getting Started

🛠️ Environment Setup

We recommend using Anaconda or Miniconda to manage the environment. The code is tested on Python 3.11 + PyTorch 2.6.0 + CUDA 12.1.

1. Create and activate a virtual environment:

conda create -n infinitevl python=3.11 -y
conda activate infinitevl

2. Install dependencies:

The core dependencies are listed below; an installation sketch follows the list.

# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1

# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1

# --- Linear Attention & Kernels (Critical) ---
# Note: These often require specific CUDA environments to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0

Using 🤗 Transformers to Chat

import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load Model
model_path = "hustvl/InfiniteVL-LongSFT" # Replace with your HF repo ID
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare Inputs
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

🖼️ Multi-Image Inference

InfiniteVL supports inputting multiple images in a single turn for comparison or storytelling.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What are the similarities between these two images?"},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])

🎥 Video Inference

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0, 
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])

🎥 Advanced Usage (CUDA Graph)

Please refer to the guidelines on the GitHub page.
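
For background, the snippet below shows the generic PyTorch pattern for capturing a fixed-shape step into a CUDA graph and replaying it. It uses a dummy module and only illustrates the mechanism; it is not the InfiniteVL-specific guideline from the repository.

import torch

# Generic CUDA Graph capture/replay pattern (illustrative, with a dummy module).
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(1, 1024, device="cuda")

# Warm up on a side stream before capture, as required for CUDA Graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one fixed-shape forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then replay the graph.
static_input.copy_(torch.randn(1, 1024, device="cuda"))
graph.replay()
print(static_output.shape)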

Citation

If you find InfiniteVL useful for your research or applications, please consider citing our paper:

@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgement

InfiniteVL builds on the shoulders of giants in the open-source community. We would like to express our gratitude to:

  • Qwen2.5-VL: For providing a powerful vision-language codebase and vision encoder.
  • Gated DeltaNet: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
  • Open-Source Datasets: We sincerely thank the creators of the high-quality datasets used in our training, including FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video, and others. Their contributions are essential to the development of efficient multimodal models.