Xiaomi-MiMo-VL-Miloco


Introduction

Welcome to Xiaomi MiMo-VL-Miloco — the first open-source multimodal model built to actually understand what’s happening at home!

🤗 Why you’ll love it:

  • Built on MiMo-VL-7B: a rock-solid vision–language backbone with reliable video understanding and instruction-following.
  • Home-savvy by design: it spots everyday activities (esports, workouts, watching TV, reading, and more) and reads common hand gestures like the V sign, thumbs-up, open palm, OK, and even the shaka hand sign.
  • Base skills intact: with a mixed training strategy that combines SFT and RL, we boost home-scene smarts while keeping the model’s generality and transferability in great shape.

🌟 Training recipe:

We use a carefully tuned two-stage pipeline to nail home-scene skills without sacrificing general abilities.

Stage 1: Supervised Fine-Tuning (SFT)

This stage focuses on boosting the model’s core capabilities in home scenarios. Even with a limited training set, we strike a good balance between sample-efficient learning and fast inference:

  • Chain-of-thought supervision: we add chain-of-thought reasoning to the training samples so the model learns structured knowledge about home scenarios.
  • Token-budget-aware reasoning: training with “budgeted” reasoning encourages concise, straight-to-the-point answers at inference.

Stage 2: Reinforcement Learning (RL)

Building on fine-tuning, this stage introduces GRPO-based reinforcement learning to enhance the model’s overall performance:

  • Efficient training data: we use the Time-R1 data strategy (our work, accepted at NeurIPS 2025) to build efficient training datasets across multiple domains.
  • Keep-it-general: specialize for home tasks while preserving broad understanding and language generation.
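For readers unfamiliar with GRPO, the core idea is that several answers are sampled for the same prompt and each answer's reward is normalized against the rest of its group, so no separate value model is needed. The snippet below is a minimal sketch of that normalization step only, not our training code: the function name, the reward values, the `eps` constant, and the choice of population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same prompt.

    Hypothetical helper for illustration; eps avoids division by zero
    when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to one home-scene prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that score above the group mean get a positive advantage and are reinforced; answers below the mean get a negative one.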

In short: Xiaomi MiMo-VL-Miloco is your friendly, sharp-eyed model roommate—great at recognizing what’s going on around the house, and still ready for the wider world.

😉 Model Recommendation

Both versions of the MiMo-VL-Miloco-7B model are now open-sourced:

  • MiMo-VL-Miloco-7B

    • Recommended for most users.
  • MiMo-VL-Miloco-7B-GGUF

    • This is the mixed-precision quantized version of MiMo-VL-Miloco-7B. It is recommended for evaluation and use in compute-constrained environments.

Performance

Evaluation of Home-Scenario Understanding Capabilities (F1-Score)

  • MiMo-VL-Miloco-7B achieves leading performance in both gesture recognition and common household scene understanding.
[Figure: Accuracy & Recall]
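As a reminder of how the metric in this section is defined, the F1 score is the harmonic mean of precision and recall. The snippet below is a small sketch of that standard formula; the function name and the counts in the example are made up for illustration and are not taken from our evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score (harmonic mean of precision and recall) from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one gesture class: 8 correct detections,
# 2 false alarms, 2 missed gestures
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(round(example_f1, 3))
```

Because it is a harmonic mean, F1 stays low unless precision and recall are both high, which makes it a reasonable single number for classes such as gestures that appear only occasionally.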

Results of general capability evaluations

In household scene understanding, we prioritize video and image perception alongside the model’s reasoning ability.

  • Across three video benchmarks (Video-MME, Video-MMMU, Charades-STA), the model shows clear improvements over the base model.
  • On MMMU-Pro, a general-capabilities benchmark, it also improves significantly over the base model (by more than 10%).
  • Surprisingly, as video and image understanding improved, we observed corresponding gains on the text-only task MMLU-Pro.
  • We see a modest performance dip on tasks such as document understanding, OCR, and mathematics; this is in line with expectations and does not affect the model’s intended use cases.
[Figure: Accuracy & Recall]

Citation

@misc{xiaomimimovlmiloco,
  author       = {Jiaze Li and Yuxun Qu and Jingyang Chen and Shijie Xu and Zhenru Lin and Junyou Zhu and Boshen Xu and Wenhui Tan and Pei Fu and JianZhong Ju and Zhenbo Luo and Jian Luan},
  title        = {Xiaomi MiMo-VL-Miloco},
  year         = {2025},
  howpublished = {\url{https://github.com/XiaoMi/xiaomi-mimo-vl-miloco}},
}

Contact

Please contact us at [email protected] or open an issue if you have any questions.
