Xiaomi MiMo-V2.5
Last reviewed
Jun 3, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 1,703 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 1,703 words
Add missing citations, update stale details, or suggest a clearer explanation.
Xiaomi MiMo-V2.5 is an open-weights model family released by Xiaomi in April 2026, made up of two siblings that share a name but solve different problems. One of them, the plain MiMo-V2.5, is a native omnimodal system that takes in text, images, video, and audio. The other, MiMo-V2.5-Pro, is a much larger text-only reasoning and coding model that Xiaomi positions against frontier agents like Claude Opus and GPT-5.4. Both are built on a mixture of experts architecture, handle context windows of up to one million tokens, and ship under the permissive MIT license with weights on Hugging Face.[1][2][3]
People keep getting the two confused, and it is an easy mistake to make. The "Pro" suffix usually signals "the multimodal flagship," but here it is the opposite. Pro is the text specialist. The smaller, cheaper standard model is the one that can actually see and hear. I will try to keep that distinction front and center, because most of the early write-ups blurred it, and at least one downstream tool catalog tagged Pro as accepting image attachments when it does not.[4]
MiMo is Xiaomi's house brand for large models, and it has moved quickly. The first release, MiMo-7B, arrived in April 2025 as a reasoning-first 7-billion-parameter model that Xiaomi claimed beat OpenAI's o1-mini and Alibaba's QwQ-32B-Preview on math and competitive-programming benchmarks despite its small size.[5] After that the family branched out: MiMo-VL-7B added vision-language understanding, MiMo-Audio-7B handled speech, MiMo-Embodied (November 2025) targeted robotics and embodied tasks, and MiMo-V2-Flash (December 2025) was the first big MoE flagship, a 309B-parameter model aimed at agentic work and also released under MIT.[5][6]
So by the time V2.5 showed up, Xiaomi had already shipped separate models for reasoning, vision, audio, and embodiment. The V2.5 release is partly an attempt to fold some of that work back together, at least on the standard model, where a single network sees, hears, and acts. Xiaomi's own tagline for it is "a single model that sees, hears, and acts on what it perceives."[3]
The cleanest way to think about the pair is reasoning depth versus sensory breadth. Pro goes deep on text and code; the standard model goes wide across modalities. They also differ a lot in size and cost.
| MiMo-V2.5 (standard) | MiMo-V2.5-Pro | |
|---|---|---|
| Total parameters | 310B | 1.02T |
| Active parameters per token | 15B | 42B |
| Architecture | Sparse MoE | MoE |
| Routed experts (top-k) | 256 (top-8) | 384 (top-8) |
| Layers | 48 (1 dense + 47 MoE) | 70 (1 dense + 69 MoE) |
| Attention | Hybrid SWA + global, 5:1 | Hybrid SWA + global, 6:1 |
| Context length | up to 1M tokens | up to 1M tokens |
| Pre-training tokens | ~48T | ~27T |
| Precision | FP8 (E4M3) mixed | FP8 (E4M3) mixed |
| Modalities | text, image, video, audio | text only |
| License | MIT | MIT |
Sources: Xiaomi MiMo site and the two Hugging Face model cards.[1][2][7]
The training-token figures look backwards at first glance, since the smaller model trained on more tokens than the larger one. That is genuinely what the model cards report: roughly 48 trillion tokens for the 310B omnimodal model and roughly 27 trillion for the 1.02T Pro.[2][7] Multimodal pre-training tends to burn through a lot of tokens once you start counting image and audio data, so the gap is not as strange as it sounds, but it is worth flagging since a few secondary articles copied the 48T number onto the Pro by mistake.
Pro is the headline-grabber. It is a 1.02-trillion-parameter MoE language model with 42 billion parameters active on any given token, spread across 384 routed experts with the top 8 selected per token.[7] The network runs 70 layers (one dense layer followed by 69 MoE layers), uses 128 attention heads with 8 key-value heads under grouped-query attention, and ships natively in FP8 (E4M3) weights so it can be served without a separate quantization step.[7]
The attention scheme is the interesting part. Pro interleaves local sliding-window attention with full global attention at a 6:1 ratio, meaning for every six layers that only look at a 128-token window, one layer attends across the whole sequence.[7] Of the 70 layers, 60 use sliding-window attention and 10 use full attention. Xiaomi reports this design, paired with a learnable attention-sink bias, cuts the KV cache footprint by close to seven times compared with a full-attention model of the same size, which is what makes the million-token context window practical to serve.[1][7] The model also carries a three-layer Multi-Token Prediction head for speculative decoding, which Xiaomi says roughly triples output throughput.[7]
Despite the "Pro" badge, this model has no vision or audio encoders. The Hugging Face card describes it plainly as "a Mixture-of-Experts (MoE) language model," and independent coverage confirms it is text-only, built for coding, software engineering, and long-horizon autonomous agents rather than perception.[4][8][9]
The standard model is the one with senses. It is a 310B-parameter sparse MoE with 15B active parameters, 256 routed experts (top-8), and 48 layers (one dense plus 47 MoE), using the same hybrid attention idea as Pro but at a 5:1 sliding-window-to-global ratio.[2] Xiaomi describes it as "a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture."[2]
What makes that possible is a pair of dedicated encoders bolted onto the language backbone. There is a 729-million-parameter Vision Transformer with hybrid window attention for images and video, and a 261-million-parameter audio encoder initialized from the weights of Xiaomi's earlier MiMo-Audio model.[2] Both feed into the main network through lightweight projectors, so a single model and a single API call can handle a photo, a video tutorial, or a recorded meeting without you switching tools.[3] It is also the cheaper and faster of the two, which is an unusual place for the multimodal model to sit. Xiaomi prices it at roughly half the per-token cost of Pro and says it surpasses the older MiMo-V2-Flash on agentic tasks while matching Pro on many coding problems "at half the cost."[3][10]
Xiaomi's pitch for both models is frontier-level results at a fraction of the token cost, and the numbers it published lean heavily on coding and agentic evaluations rather than raw knowledge tests. As always with vendor-reported benchmarks, these are self-reported and worth treating as claims rather than settled fact.
For Pro, the most cited figure is SWE-Bench Pro, where Xiaomi reports 57.2%. It puts that just behind GPT-5.4 at 57.7% and ahead of Claude Opus 4.6 at 53.4%, while charging far less per output token.[8][9] On the regular SWE-Bench resolved metric the card lists 78.9%, and on TerminalBench 2 it reports around 68%.[7] Other reported Pro scores include 72.9 on the τ3-Bench tool-use benchmark and roughly 63 to 64 on Xiaomi's Claw-Eval agentic suite.[8][10] On Humanity's Last Exam the picture is murkier: one set of coverage cites 48.0% against GPT-5.4's 58.7%, while another quotes 33.8%, so I would not put much weight on that particular figure until the official numbers settle.[8][11]
The standard model's benchmarks are mostly multimodal. Xiaomi reports 87.7 on Video-MME, 77.9 on MMMU-Pro, and 81.0 on CharXiv reasoning questions, and says the model is "level with closed-source models" on image and video understanding, matching Gemini 3 Pro on video tasks and Claude Sonnet 4.6 on multimodal agentic work.[3][12] On the text-and-agent side it scores 62.3 on the general subset of Claw-Eval and 23.8 on the multimodal subset, which Xiaomi frames as sitting "at the Pareto frontier of performance and efficiency."[2][12] Xiaomi also makes a token-efficiency argument throughout, claiming the standard model uses roughly half the tokens of comparable systems and that Pro spends about 42% fewer tokens than Kimi K2.6 on agentic trajectories.[11][12]
Both models are fully open-weight under the MIT license, with weights, tokenizer, and model cards published on Hugging Face under the XiaomiMiMo organization.[1][2][7] Xiaomi released base checkpoints as well as the post-trained instruct versions; the Pro instruct model went through supervised fine-tuning, large-scale agentic reinforcement learning, and a multi-teacher on-policy distillation stage, while the base model ships with a shorter 256K context that the instruct model extends to 1M.[7] For deployment, Xiaomi recommends SGLang for long-context serving and also supports vLLM, and the models are accessible through Xiaomi's own MiMo API alongside a growing list of third-party inference providers.[2][13]
The release lands in a crowded field. Xiaomi is benchmarking against DeepSeek-V4, Kimi K2.6, Claude Opus 4.6, and Gemini 3.1 Pro, and the most credible part of its story is not that it beats all of them outright but that it gets close while costing much less to run and being something you can actually download. Whether the omnimodal-plus-text-flagship split is the right product decision is harder to judge. It means two downloads, two cost structures, and a naming scheme that trips people up. But it also means each model is doing one job well instead of compromising, and for the standard model in particular, shipping real audio and video understanding in a 15B-active package that anyone can self-host is the kind of thing that quietly changes what small teams can build.