---
tags:
- notebook
pipeline_tag: image-text-to-text
library_name: transformers
---

# Smol Vision 🐣
Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models. The original GitHub repository is [here](https://github.com/merveenoyan/smol-vision); it was migrated to Hugging Face because notebooks aren't rendered there 🥲
Latest examples 👇🏻
- [Fine-tune ColPali for Multimodal RAG](https://huggingface.co/merve/smol-vision/blob/main/Finetune_ColPali.ipynb)
- [Fine-tune Gemma-3n for all modalities (audio-text-image)](https://huggingface.co/merve/smol-vision/blob/main/Gemma3n_Fine_tuning_on_All_Modalities.ipynb)
- [Any-to-Any (Video) RAG with OmniEmbed and Qwen](https://huggingface.co/merve/smol-vision/blob/main/Any_to_Any_RAG.ipynb)
**Note**: The script and notebook have been updated to fix a few issues related to QLoRA!
| Category | Notebook | Description |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| Quantization/ONNX | [Faster and Smaller Zero-shot Object Detection with Optimum](https://huggingface.co/merve/smol-vision/blob/main/Faster_Zero_shot_Object_Detection_with_Optimum.ipynb) | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
| VLM Fine-tuning | [Fine-tune PaliGemma](https://huggingface.co/merve/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb) | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
| Intro to Optimum/ORT | [Optimizing DETR with 🤗 Optimum](https://huggingface.co/merve/smol-vision/blob/main/Reduce_any_model_to_fp16_using_%F0%9F%A4%97_Optimum_DETR.ipynb) | A soft introduction to exporting vision models to ONNX and quantizing them. |
| Model Shrinking | [Knowledge Distillation for Computer Vision](https://huggingface.co/docs/transformers/en/tasks/knowledge_distillation_for_image_classification) | Knowledge distillation for image classification. |
| Quantization | [Fit in vision models using Quanto](https://huggingface.co/merve/smol-vision/blob/main/Fit_in_vision_models_using_quanto.ipynb) | Fit vision models onto smaller hardware using Quanto. |
| Speed-up | [Faster foundation models with torch.compile](https://huggingface.co/merve/smol-vision/blob/main/Faster_foundation_models_with_torch_compile.ipynb) | Improving latency for foundation models using `torch.compile` |
| VLM Fine-tuning | [Fine-tune Florence-2](https://huggingface.co/merve/smol-vision/blob/main/Fine_tune_Florence_2.ipynb) | Fine-tune Florence-2 on the DocVQA dataset. |
| VLM Fine-tuning | [QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2](https://huggingface.co/merve/smol-vision/blob/main/Smol_VLM_FT.ipynb) | QLoRA or full fine-tuning of IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| VLM Fine-tuning (Script) | [QLoRA Fine-tune IDEFICS3 on VQAv2](https://huggingface.co/merve/smol-vision/blob/main/smolvlm.py) | QLoRA or full fine-tuning of IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| Multimodal RAG | [Multimodal RAG using ColPali and Qwen2-VL](https://huggingface.co/merve/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb) | Learn to retrieve documents for a RAG pipeline without heavy document processing, using ColPali through Byaldi for retrieval and Qwen2-VL for generation. |
| Multimodal Retriever Fine-tuning | [Fine-tune ColPali for Multimodal RAG](https://huggingface.co/merve/smol-vision/blob/main/Finetune_ColPali.ipynb) | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case |
| VLM Fine-tuning | [Fine-tune Gemma-3n for all modalities (audio-text-image)](https://huggingface.co/merve/smol-vision/blob/main/Gemma3n_Fine_tuning_on_All_Modalities.ipynb) | Fine-tune Gemma-3n model to handle any modality: audio, text, and image. |
| Multimodal RAG | [Any-to-Any (Video) RAG with OmniEmbed and Qwen](https://huggingface.co/merve/smol-vision/blob/main/Any_to_Any_RAG.ipynb) | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
| Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with Text Generation Inference (TGI). |
| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |
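
Several of the notebooks above rely on quantization. As a rough intuition for what that means, here is a minimal, library-free sketch of symmetric per-tensor int8 quantization. It is an illustration only and is not the Optimum or Quanto API; the function names `quantize_int8` and `dequantize_int8` and the sample weights are made up for this example, and real toolkits add calibration, per-channel scales, and optimized kernels.

```python
# Illustrative sketch of symmetric int8 weight quantization.
# Assumption: plain Python lists stand in for model weight tensors.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error bounded by `scale / 2`. Small weights such as `0.003` collapse to zero, which is one reason the notebooks evaluate model quality after quantizing.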