MiniCPM-V

Chinese AI Multimodal AI Open Source AI

Updated Jul 23, 2026

MiniCPM-V is a family of open-weights multimodal large language models developed by the OpenBMB lab at Tsinghua University's Natural Language Processing group together with the spin-off company ModelBest Inc. The series targets efficient on-device deployment, packaging vision, OCR, video, and (since the MiniCPM-o branch) speech understanding into models small enough to run on a contemporary smartphone or tablet. The May 2024 release, MiniCPM-Llama3-V 2.5, was billed by its authors as "the first end-side MLLM achieving GPT-4V level performance," with results above GPT-4V-1106, Gemini Pro, and Claude 3 on the OpenCompass aggregate of eleven vision benchmarks.^[1] By the flagship MiniCPM-V 4.5 (open-sourced August 2025, technical report September 2025), the 8B model scored 77.0 on the OpenCompass average, which OpenBMB reported as surpassing GPT-4o-latest, Gemini 2.0 Pro, and the roughly nine-times-larger Qwen2.5-VL 72B.^[13]^[14] The lineage stretches from a 3B-parameter pilot in early 2024 through MiniCPM-o 2.6 (January 2025), the omnimodal MiniCPM-o 4.5 (February 2026), and the roughly 1B-parameter MiniCPM-V 4.6 (May 2026), holding the family's on-device focus while climbing from GPT-4V-class to Gemini-2.5-Flash-class quality.^[2]^[3]^[15]^[16]

Infobox

Field	Value
Developer	OpenBMB / ModelBest Inc. / Tsinghua University NLP Lab^[1]^[4]
First release	MiniCPM-V 1.0, February 2024^[5]
Latest covered release	MiniCPM-V 4.6, May 11, 2026^[16]
Parameter sizes	1B, 2.8B, 3B, 4.1B, 8B, 9B (depending on variant)^[5]^[12]^[13]^[16]
Vision encoder	SigLIP SoViT-400m/14 (v1.0-2.6); SigLIP2-400M (v4.0 onward)^[1]^[13]
Adaptive encoding	LLaVA-UHD style image slicing, perceiver resampler; unified 3D-Resampler for image and video (v4.5)^[1]^[7]^[14]
Max image resolution	1.8M pixels at any aspect ratio^[1]
Key papers	arXiv:2408.01800 (Aug 3, 2024); arXiv:2509.18154 (Sep 16, 2025)^[1]^[14]
Code repository	github.com/OpenBMB/MiniCPM-V^[4]
License	Apache-2.0 (code); weights free for academic use and for commercial use after registration (v2.0 through MiniCPM-o 2.6); weights also Apache-2.0 from v4.0^[6]^[12]

History

Origins: OpenBMB and ModelBest

OpenBMB ("Open Lab for Big Model Base") is a research lab jointly operated by the Tsinghua University NLP group and ModelBest Inc., a spin-off founded in 2022 in Beijing's Haidian district by Tsinghua researchers.^[8] ModelBest positions itself around a thesis that small, distilled language and multimodal models can match much larger systems on practical tasks while remaining cheap enough to deploy on personal devices.^[8] In February 2025, MIT Technology Review reported that the company had closed a third funding round, in the "tens of millions of dollars," in December 2024, part of a broader wave of Chinese on-device AI startups that emerged alongside DeepSeek.^[8] The company's leadership is drawn from Tsinghua: CEO Li Dahai, a former Zhihu chief technology officer, runs the commercial arm, while co-founder and chief scientist Liu Zhiyuan is a Tsinghua computer science professor.^[17]

The bet paid off commercially. In April 2026, ModelBest was recognized as a 2026 China Unicorn Enterprise at the Zhongguancun Forum after a fresh several-hundred-million-yuan round, led by Shenzhen Capital Group and Inovance Capital, pushed its post-money valuation past the 1 billion dollar unicorn threshold; combined with an earlier 2026 round led by China Telecom, the company raised more than 1 billion yuan (roughly 140 million US dollars) in the first quarter of 2026.^[17]

The MiniCPM ("Mini Chinese-English Pre-trained Model") project began as the text-only MiniCPM-2B, released in February 2024, which the authors reported as performing comparably to Mistral-7B on public benchmarks despite using roughly a third of the parameters.^[9] A 4B-parameter MiniCPM3 followed in September 2024, with the authors claiming results above Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and competitive with Qwen2-7B and 8B-class Llama 3 variants.^[9] The text-only line later iterated into MiniCPM4 (June 2025) and MiniCPM4.1 (September 2025), both emphasizing inference acceleration via trainable sparse attention (InfLLM-V2) and BitCPM-style quantization, but the vision branch, MiniCPM-V, predates and parallels that trajectory rather than depending on it directly.^[9] The vision lineage reused the same base language model in its first two releases and inherited the same on-device-first design philosophy.

MiniCPM-V 1.0 (January / February 2024)

The first MiniCPM-V model, sometimes referred to as OmniLMM-3B in early documentation, paired a SigLIP-400M vision encoder with the 2.4B-parameter MiniCPM text base through a perceiver-style resampler.^[5] The release emphasized aggressive token compression: image features were squeezed into 64 visual tokens, versus more than 512 tokens for typical MLP-projector models such as LLaVA.^[5] On general multimodal benchmarks the authors reported 1452 on MME, 67.9 on MMBench (English) and 37.2 on MMMU, ahead of the 9.6B-parameter Qwen-VL-Chat and the 17.4B-parameter CogVLM at similar settings.^[5] Although 1.0 lacked the adaptive resolution scheme that would come in 2.0, it already exhibited the family's signature design choice: privilege end-side deployability over raw parameter count, and rely on the resampler to keep visual token budgets small enough for mobile inference.^[5]

MiniCPM-V 2.0 (April 2024)

MiniCPM-V 2.0, released on April 12, 2024, kept the 2.8B-parameter footprint but introduced two changes that defined the rest of the family.^[6] First, it integrated the adaptive visual encoding scheme from LLaVA-UHD, allowing the model to accept images up to roughly 1.8 million pixels (for example 1344x1344) at any aspect ratio rather than forcing a fixed 336x336 square.^[6]^[7] Second, it was the first end-side multimodal model from the group aligned with multimodal RLHF, drawing on the RLHF-V technique from the same authors.^[6]^[10] The authors reported that 2.0 reached scene-text understanding comparable to Gemini Pro and surpassed Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OCRBench, TextVQA, MME, MMBench, and MathVista.^[6] Importantly, 2.0 also matched the much larger GPT-4V on the Object HalBench hallucination benchmark, which the authors attributed primarily to the RLHF-V alignment step rather than to the visual encoder change.^[6] The model shipped on Hugging Face under an Apache-2.0 code license, with weights free for academic use and free for commercial use after registration, a licensing template carried through all subsequent releases up to MiniCPM-o 2.6.^[6]

MiniCPM-Llama3-V 2.5 (May 2024)

Released on May 20, 2024, MiniCPM-Llama3-V 2.5 swapped the small MiniCPM base for Llama3-8B-Instruct, bringing the total parameter count to roughly 8B (8B LLM plus the 400M SigLIP encoder and a thin resampler).^[11] This was the version the authors positioned as the first open end-side model reaching GPT-4V-class quality: the OpenCompass average across eleven benchmarks rose to 65.1, ahead of GPT-4V-1106 at 63.5, with OCRBench above 700 (versus 656 for GPT-4V) and an Object HalBench hallucination rate of 10.3 percent versus GPT-4V's 13.6 percent.^[1]^[11] Language coverage expanded to more than thirty languages, spanning Chinese, English, German, French, Spanish, Italian, Korean, Japanese, and a long tail of European and Asian languages defined in the model card's language list.^[11] The release also formalized streaming output and customizable system prompts as first-class features, and it was the first MiniCPM-V to be paired with an RLAIF-V alignment pass rather than the older RLHF-V approach.^[11] The 2.5 model card also documented LoRA fine-tuning on two NVIDIA V100 GPUs as a supported fine-tuning path, which positioned the model as accessible to academic labs with modest hardware budgets.^[11]

MiniCPM-V 2.6 (August 2024)

The August 2024 release, paired with the formal technical report, dropped the Llama3 base in favor of Qwen2-7B, again landing at roughly 8B total parameters.^[2] MiniCPM-V 2.6 added two significant capabilities: multi-image reasoning (state-of-the-art on Mantis-Eval and BLINK at the model's scale, plus Mathverse mv and Sciverse mv) and full video understanding with temporal reasoning on Video-MME and Video-ChatGPT.^[2] OpenCompass scored the model at 65.2 on an updated eight-benchmark subset, with the authors claiming wins over GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image tasks.^[2] Token density was a marquee number: 1.8M-pixel images compressed to 640 visual tokens, around 75 percent fewer than the typical contemporary MLLM, enabling the demonstrated real-time video understanding on a stock iPad Pro.^[2] The OpenBMB team published a raw, unedited iPad Pro screen recording alongside the release to substantiate the on-device video understanding claim, a presentation pattern they repeated in later releases.^[2] In-context few-shot learning across multiple images, conversation and reasoning over image stacks, and chart and table understanding rounded out the capability bundle.^[2]

MiniCPM-o 2.6 (January 2025)

MiniCPM-o 2.6, released January 24, 2025, extended the architecture into a full speech-vision-text omnimodal model while staying at 8B parameters.^[3] The build was end-to-end across four pre-trained components: SigLIP-400M for vision, Whisper-medium-300M for audio understanding, ChatTTS-200M for speech generation, and Qwen2.5-7B as the language backbone.^[3] OpenBMB billed it as "a GPT-4o level MLLM for vision, speech, and multimodal live streaming on your phone."^[3] On OpenCompass the authors reported 70.2 average across the same eight-benchmark setup, ahead of GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding for models below 25B parameters.^[3] On English speech recognition (LibriSpeech test-clean) the model achieved 4.4 percent WER, and on Chinese ASR (AISHELL-1) 1.6 percent CER, with the authors reporting results above GPT-4o-Realtime on audio understanding tasks.^[3] A new StreamingBench score of 66.0 measured the model's ability to process continuous video and audio streams without explicit user queries, ahead of GPT-4o-202408 and Claude 3.5 Sonnet on that benchmark.^[3] On real-time video specifically the StreamingBench sub-score was 79.9.^[3] Speech generation was rated on community ELO scales at 1088 semantic and 1163 acoustic.^[3] A novel Time-Division Multiplexing (TDM) mechanism handled the streaming omnimodal scheduling, and configurable audio system prompts let downstream applications swap voices.^[3]

MiniCPM-V 4.0 (August 2025)

MiniCPM-V 4.0, open-sourced on August 2, 2025, pivoted back toward the smallest end of the family.^[4]^[12] It pairs the upgraded SigLIP2-400M vision encoder with the 3B-parameter MiniCPM4 text backbone for a total of 4.1B parameters, roughly half the size of the 2.6 and o-2.6 models.^[12] The jump from 2.6 to 4.0 skipped the 3.x label: the vision model is built on the MiniCPM4 text base, aligning its major version with that backbone. Despite the smaller footprint, OpenBMB reported an OpenCompass average of 69.0, which the model card states outperforms GPT-4.1-mini-20250414, the 8.1B-parameter MiniCPM-V 2.6 (65.2), and the 3.8B Qwen2.5-VL-3B-Instruct (64.5).^[12] The release leaned hard into deployment: on an iPhone 16 Pro Max the model card reports a first-token delay under 2 seconds and decoding above 17 tokens per second "without heating problems," and OpenBMB shipped an open-source iOS app that runs the model on iPhone and iPad.^[12] MiniCPM-V 4.0 was also the point at which the project moved its weights to a clean Apache-2.0 license, dropping the earlier registration requirement for commercial use and making the sign-up form optional.^[12]

MiniCPM-V 4.5 (August / September 2025)

MiniCPM-V 4.5, open-sourced on August 26, 2025 and documented in a technical report titled "MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe" (arXiv:2509.18154, September 16, 2025), returned to the 8B tier on a Qwen3-8B backbone with the SigLIP2-400M encoder, for roughly 8.7B parameters total.^[4]^[13]^[14] OpenBMB reported an OpenCompass average of 77.0, and the paper's abstract states the model "surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B," which the team framed as the strongest MLLM under 30B parameters.^[13]^[14] The headline architectural change was a unified 3D-Resampler that compresses images and video through the same module: six 448x448 video frames collapse into 64 tokens, a 96x compression rate that supports high-frame-rate (up to 10 FPS) and long-video understanding.^[13]^[14] On Video-MME the model scored 73.5 while using, per the paper, 46.7 percent of the GPU memory and 8.7 percent of the inference time of Qwen2.5-VL 7B.^[14] MiniCPM-V 4.5 also introduced a "Controllable Hybrid Fast/Deep Thinking" mode that lets users trade latency for step-by-step reasoning, and OpenBMB reported leading OCRBench results (ahead of GPT-4o-latest and Gemini 2.5) plus state-of-the-art PDF document parsing on OmniDocBench.^[13] OpenBMB summarized the release with the tagline "A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding on Your Phone."^[13]

MiniCPM-o 4.5 (February 2026)

MiniCPM-o 4.5, open-sourced on February 3, 2026, carried the 4.5 generation into the omnimodal branch.^[4]^[15] Built end-to-end on Qwen3-8B with SigLIP2 for vision, Whisper-medium for audio understanding, and CosyVoice2 for speech generation, the roughly 9B-parameter model reported an OpenCompass average of 77.6.^[15] The model card says it surpasses GPT-4o and Gemini 2.0 Pro and approaches Gemini 2.5 Flash on vision-language tasks, and OpenBMB brands it a "Gemini 2.5 Flash Level" system for vision, speech, and live streaming.^[15] Its marquee capability is full-duplex, proactive multimodal live streaming: the model can see, listen, and speak simultaneously rather than waiting for a turn to end.^[15] On speech generation OpenBMB reported a Chinese character error rate of 0.86 percent, below CosyVoice2's 1.45 percent, and a long-form speech word error rate of 3.37 percent versus CosyVoice2's 14.80 percent.^[15]

MiniCPM-V 4.6 (May 2026)

MiniCPM-V 4.6, open-sourced on May 11, 2026, pushed the family below the 2B mark.^[4]^[16] It pairs the SigLIP2-400M encoder with a Qwen3.5-0.8B backbone for a total of about 1 billion parameters (Artificial Analysis catalogs it as "MiniCPM-V 4.6 1.3B Instruct"), and introduces mixed 4x/16x visual token compression that lets a caller trade detail for speed at inference time.^[16]^[18] Independent testing by Artificial Analysis placed it at 13 on the Artificial Analysis Intelligence Index, the highest of any open-weights model under 2B parameters at release, with a 262K-token context window and an MMMU-Pro visual-reasoning score of 38 percent.^[18] The model card describes deployment across all three mainstream mobile platforms, iOS, Android, and HarmonyOS, with the edge-adaptation code open-sourced, and OpenBMB brands the current line "A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone."^[16]

Technical details

Architecture

All MiniCPM-V models share a three-part architecture: a visual encoder (SigLIP SoViT-400m/14 in every version from 1.0 through 2.6, upgraded to SigLIP2-400M from v4.0), a compression layer based on a perceiver resampler that converts visual features into a small set of query tokens, and a decoder-only language model.^[1]^[13] The compression target is aggressive: in 1.0 and 2.0 each image slice is reduced to 64 query tokens, while MiniCPM-Llama3-V 2.5 uses 96 tokens per slice.^[1] By the 2.6 release the total visual token count for a full 1.8M-pixel image had settled at roughly 640 tokens, which the authors compare to thousands of tokens emitted by MLP-projector competitors at similar resolutions.^[2] The resampler is a single-layer cross-attention module: the small set of learned query tokens attends over the dense vision transformer feature grid and produces the fixed-length output that the LLM then consumes.^[1] Because the resampler runs once per slice rather than once per generated text token, the cost of compressing high-resolution visual input is paid once up front instead of being repeated at every decode step.^[1]
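The resampler idea described above can be illustrated with a toy sketch in plain Python. It is not the project's implementation: it uses a single attention head, random vectors in place of learned queries, and omits the learned projections, normalization, and positional embeddings a real module would have. The function names, the 8-dimensional vectors, and the 448x448 slice with 14-pixel patches are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, features):
    """Single-head cross-attention: every query attends over all patch features.

    queries:  Q vectors (learned in a real model, random here)
    features: P patch vectors from the vision encoder
    returns:  Q vectors, so the output length does not depend on P
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        mixed = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


random.seed(0)
DIM = 8                          # toy feature width (assumption)
NUM_QUERIES = 64                 # tokens per slice, as quoted for versions 1.0 and 2.0
NUM_PATCHES = (448 // 14) ** 2   # assumes a 448x448 slice and 14-pixel patches

queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

tokens = resample(queries, features)
print(len(features), "patch features ->", len(tokens), "visual tokens")
```

The point of the sketch is the shape contract: however many patch features go in, the language model receives exactly as many visual tokens as there are queries.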

From MiniCPM-V 4.5 (August 2025) the design generalized the perceiver resampler into a unified 3D-Resampler that compresses images and video through one module.^[13]^[14] The 3D scheme raises video token density sharply: six 448x448 frames are packed into 64 tokens, a 96x compression rate that makes high-frame-rate (up to 10 FPS) and long-video understanding tractable on device.^[13]^[14] MiniCPM-V 4.6 (May 2026) then added switchable 4x/16x visual token compression, so a single model can trade fidelity for throughput on a per-request basis rather than at training time.^[16]
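The 96x figure can be reproduced with a short calculation, assuming the encoder splits each frame into 14-pixel patches (an assumption; with that patch size the numbers work out exactly). The helper names are hypothetical, and the clip estimate further assumes that a video is processed in consecutive groups of six frames, which the text above does not state.

```python
import math


def patches_per_frame(side_px, patch_px):
    """Number of ViT patches in a square frame."""
    per_side = side_px // patch_px
    return per_side * per_side


def compression_rate(num_frames, side_px, patch_px, output_tokens):
    """Ratio of raw patch count to tokens handed to the language model."""
    return num_frames * patches_per_frame(side_px, patch_px) / output_tokens


def tokens_for_clip(seconds, fps, frames_per_group=6, tokens_per_group=64):
    """Rough token budget for a clip, assuming fixed six-frame groups (illustrative only)."""
    frames = seconds * fps
    return math.ceil(frames / frames_per_group) * tokens_per_group


# Six 448x448 frames packed into 64 tokens, assuming 14-pixel patches.
rate = compression_rate(num_frames=6, side_px=448, patch_px=14, output_tokens=64)
print(rate)                     # 96.0
print(tokens_for_clip(10, 10))  # 1088 tokens for 10 seconds at 10 FPS under the grouping assumption
```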

Adaptive visual encoding

The defining technical contribution from 2.0 onward is the adaptive visual encoding pipeline borrowed from LLaVA-UHD.^[7] Rather than resize an input image to a fixed square, the system computes an ideal slice count N as the ceiling of the input image area divided by the ViT's pre-training area, then searches row-column factorizations of N to pick the partition whose aspect ratio is closest to the source image.^[1] Each slice is independently resized to match the ViT's training area, with its 2D positional embeddings interpolated to the slice's true aspect ratio.^[1] The full original image is also resized and encoded as an extra slice so that the model sees both global context and local detail. Slice features are wrapped with <slice> and </slice> tokens, with newline characters demarcating rows so the spatial schema is preserved in the LLM's token stream.^[1] The net effect is that an arbitrarily shaped 1.8M-pixel image (for example a tall screenshot or wide receipt) can be encoded without distortion, which materially improves OCR and document-understanding accuracy.
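The partition search described above can be sketched as follows. This is a simplified illustration rather than the released code: it considers only factorizations of N, measures aspect-ratio closeness in log space (an assumption), and takes 448x448 as the ViT's pre-training area (also an assumption). The function names and the `...` placeholder for slice features are hypothetical.

```python
import math


def choose_grid(width, height, vit_side=448):
    """Pick (rows, cols) for slicing an image.

    N is the ceiling of image area over the ViT pre-training area. Among the
    row-column factorizations of N, keep the one whose aspect ratio is closest
    to the image's, comparing ratios in log space.
    """
    n = max(1, math.ceil((width * height) / (vit_side * vit_side)))
    image_log_ratio = math.log(width / height)
    best = None
    for rows in range(1, n + 1):
        if n % rows:
            continue  # rows must divide N for a full grid
        cols = n // rows
        distance = abs(image_log_ratio - math.log(cols / rows))
        if best is None or distance < best[0]:
            best = (distance, rows, cols)
    return best[1], best[2]


def layout_placeholders(rows, cols):
    """Text skeleton of the slice stream: one <slice>...</slice> per slice, newline between rows."""
    row = "".join("<slice>...</slice>" for _ in range(cols))
    return "\n".join(row for _ in range(rows))


print(choose_grid(1344, 1344))   # (3, 3): a square 1.8M-pixel image
print(choose_grid(448, 1792))    # (4, 1): a tall screenshot, four rows of one slice
print(layout_placeholders(2, 3))
```

For the 1344x1344 case the grid has nine slices, and the resized overview image adds a tenth encoder pass. At 64 tokens per slice that would come to 640 tokens, which happens to match the figure quoted for 2.6, although the per-slice count for that release is not stated here.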

Training pipeline

The MiniCPM-V technical report describes a three-stage pre-training schedule.^[1] Stage 1 warms up the resampler at 224x224 resolution using roughly 200M image-caption pairs, with the ViT and LLM frozen. Stage 2 unfreezes the ViT and extends it to 448x448, training on another 200M samples to adapt the encoder to a higher fidelity regime. Stage 3 turns on the full adaptive encoding scheme and integrates dedicated OCR data, training both visual modules end to end at the 1.8M-pixel target.^[1] Supervised fine-tuning then runs in two parts: a recognition-focused phase using traditional VQA and captioning datasets, and a long-form interaction phase covering complex instructions across 36-plus languages, with roughly 2M curated samples in total.^[1] The two SFT phases are intentionally split: the first builds robust object, scene, and character recognition; the second teaches longer-form discourse and multi-turn reasoning, including responses that exceed the typical short-answer length of academic VQA datasets.^[1] The report frames this split as important for preventing the model from collapsing to terse, brittle responses, which is a common failure mode of MLLMs trained on captioning-style data alone.^[1] The MiniCPM-V 4.5 report later added a hybrid reinforcement learning stage so a single model can operate in both short-answer and long-reasoning modes, the training basis for the release's controllable fast and deep thinking switch.^[14]

RLAIF-V alignment

For trust and hallucination control, MiniCPM-V uses RLAIF-V, a framework proposed in a companion paper by Tianyu Yu and colleagues (arXiv:2405.17220, May 27, 2024), later accepted as a CVPR 2025 highlight.^[10] RLAIF-V is built on two ideas. First, a divide-and-conquer feedback pipeline decomposes each candidate response into atomic claims (factored using a small text LLM such as Llama-3-8B), converts each claim to a yes/no question, and scores the questions with an open-source MLLM rather than calling GPT-4V or human annotators.^[10] Second, the resulting preference pairs are consumed by an online iterative form of Direct Preference Optimization (DPO) that mitigates the distribution-shift problem of vanilla DPO.^[10] The reported effect is large: at 7B scale RLAIF-V cuts object hallucination by roughly 80 percent and overall hallucination by roughly 34 percent, and at 12B scale a model trained against its own feedback can outperform GPT-4V on object hallucination benchmarks.^[10] For MiniCPM-V the practical signal is that MiniCPM-Llama3-V 2.5's 10.3 percent Object HalBench rate is below GPT-4V-1106's 13.6 percent at a fraction of the parameter count.^[1] The training set used to drive the DPO step is the publicly released RLAIF-V dataset on Hugging Face, which makes the alignment stage unusually reproducible compared with most other open MLLM alignment pipelines, which rely on private feedback corpora.^[11] The technique carried forward: OpenBMB reported that MiniCPM-V 4.5 uses RLAIF-V to beat GPT-4o-latest on the MMHal-Bench hallucination benchmark.^[13]
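The two ideas can be sketched in a few lines. The scoring rule below (minus the number of claims the labeler rejects) is an assumption for illustration, and the verdict lists stand in for the yes/no answers an open-source MLLM would give. The loss is the standard single-pair DPO objective, not the online iterative variant the paper describes. All function names are hypothetical.

```python
import math


def response_score(claim_verdicts):
    """Score a response from its atomic claims; a verdict is True when the labeler answered 'yes'.

    Using minus the count of rejected claims is an illustrative assumption.
    """
    return -sum(1 for ok in claim_verdicts if not ok)


def preference_pair(resp_a, resp_b):
    """Return (chosen, rejected) response names, or None when the scores tie."""
    (name_a, verdicts_a), (name_b, verdicts_b) = resp_a, resp_b
    score_a, score_b = response_score(verdicts_a), response_score(verdicts_b)
    if score_a == score_b:
        return None  # a tie carries no preference signal
    return (name_a, name_b) if score_a > score_b else (name_b, name_a)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, computed from sequence log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


pair = preference_pair(("A", [True, True, False]), ("B", [True, False, False]))
print(pair)                               # ('A', 'B'): A has fewer rejected claims
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))   # below log(2), since the policy already prefers the chosen response
```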

Mobile and end-side deployment

The technical report devotes a full section to on-device deployment, taking MiniCPM-Llama3-V 2.5 from a 16-17 GB FP16 footprint to a working 8B model running on a Snapdragon 8 Gen 3-class smartphone.^[1] Concretely, the authors used a Xiaomi 14 Pro powered by Qualcomm's Snapdragon 8 Gen 3 mobile platform as the reference target, with a vivo X100 Pro as a secondary device and an Apple MacBook Pro M1 included for comparison.^[1] A 4-bit Q4_K_M quantization via the GGML/llama.cpp toolchain reduced the memory footprint to roughly 5 GB. Sequential ViT/LLM loading cut peak memory further and brought image-processing time from 45.2 to 31.5 seconds. Native compilation lowered encoding latency from 50.5 to 17.0 seconds and improved decode throughput from 1.3 to 3.2 tokens/second. Automatic parameter tuning pushed throughput to 8.2 tokens/second on the Snapdragon target.^[1] Finally, porting the visual encoder to Qualcomm's QNN framework to run on the on-chip NPU reduced visual encoding from 3.7 seconds to 1.3 seconds, roughly a 2.8x gain for that step; the roughly 150x image-encoding speed-up quoted alongside it is measured against the unoptimized baseline, not against the 3.7-second figure.^[1]^[11] The report concludes that throughput on the Xiaomi and vivo devices exceeds typical human reading speed, which is the authors' working threshold for "usable" deployment.^[1]
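The arithmetic behind these figures is easy to check. The sketch below uses hypothetical helper names and counts weight storage only, ignoring activations and the KV cache. The 4.85 bits per weight used for Q4_K_M is an assumed average, since the effective rate depends on the tensor mix.

```python
def footprint_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def speedup(before, after):
    """How many times smaller a latency became."""
    return before / after


print(footprint_gb(8.0, 16))     # 16.0 GB at FP16, in line with the 16-17 GB quoted
print(footprint_gb(8.0, 4.85))   # just under 5 GB at the assumed 4-bit rate

# Per-step ratios taken from the figures quoted above.
steps = {
    "sequential loading (image processing, s)": speedup(45.2, 31.5),
    "native compilation (encoding latency, s)": speedup(50.5, 17.0),
    "NPU visual encoding (s)": speedup(3.7, 1.3),
    "decode throughput (tokens/s, 1.3 to 8.2)": 8.2 / 1.3,
}
for name, ratio in steps.items():
    print(f"{name}: {ratio:.2f}x")
```

Each ratio here describes a single optimization step, so none of them should be read as a cumulative end-to-end figure.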

After that initial mobile demonstration, the inference story broadened. The MiniCPM-V 2.6 and MiniCPM-o 2.6 model cards document support for llama.cpp, Ollama, vLLM, int4 quantization at roughly 7 GB GPU memory, GGUF format in sixteen sizes, and integration with LLaMA-Factory for fine-tuning.^[2]^[3] The model cards also document a Gradio WebUI for local interactive testing and the existence of an online demo space.^[2]^[3] Later releases pushed the deployment story onto newer silicon and an official app: the MiniCPM-V 4.0 model card reports an iPhone 16 Pro Max first-token delay under 2 seconds and more than 17 tokens per second of decoding, OpenBMB open-sourced an iOS app that runs the model on iPhone and iPad, and MiniCPM-V 4.6 extended packaged edge support to iOS, Android, and HarmonyOS with SGLang added alongside vLLM, llama.cpp, and Ollama.^[12]^[16] The combination of compact size, permissive licensing (Apache-2.0 code, and from v4.0 fully Apache-2.0 weights), and broad inference-runtime coverage helps explain why MiniCPM-V is a frequent default in open vision-language stacks.^[11]^[12]

Variants

Version	Release	Total params	Vision encoder	LLM backbone	Key claim
MiniCPM-V 1.0 (OmniLMM-3B)	Jan/Feb 2024	3B	SigLIP-400M	MiniCPM-2.4B	Outperforms 9.6B Qwen-VL-Chat on MME/MMBench/MMMU at 3B^[5]
MiniCPM-V 2.0	Apr 12, 2024	2.8B	SigLIP-400M	MiniCPM-2.4B	1.8M-pixel adaptive encoding; matches Gemini Pro on scene text^[6]
MiniCPM-Llama3-V 2.5	May 20, 2024	8B	SigLIP-400M	Llama3-8B-Instruct	First end-side MLLM at GPT-4V level on OpenCompass; 30+ languages^[11]
MiniCPM-V 2.6	Aug 2024	8B	SigLIP-400M	Qwen2-7B	Multi-image + video; real-time video on iPad; 640 tokens per 1.8M-pixel image^[2]
MiniCPM-o 2.6	Jan 24, 2025	8B	SigLIP-400M (+ Whisper-medium + ChatTTS)	Qwen2.5-7B	Omnimodal speech/vision/text with streaming; bilingual real-time speech^[3]
MiniCPM-V 4.0	Aug 2, 2025	4.1B	SigLIP2-400M	MiniCPM4-3B	OpenCompass 69.0; beats GPT-4.1-mini; under 2s first token on iPhone 16 Pro Max^[12]
MiniCPM-V 4.5	Aug 26, 2025	8B (8.7B)	SigLIP2-400M	Qwen3-8B	OpenCompass 77.0; 3D-Resampler 96x video compression; hybrid fast/deep thinking^[13]^[14]
MiniCPM-o 4.5	Feb 3, 2026	9B	SigLIP2 (+ Whisper-medium + CosyVoice2)	Qwen3-8B	OpenCompass 77.6; full-duplex live streaming; "Gemini 2.5 Flash Level"^[15]
MiniCPM-V 4.6	May 11, 2026	~1B	SigLIP2-400M	Qwen3.5-0.8B	Mixed 4x/16x token compression; top open-weights MLLM under 2B on the AA Intelligence Index^[16]^[18]

The cross-row trend illustrates the project's strategy: hold a compact SigLIP-family vision encoder roughly constant (upgraded to SigLIP2-400M from v4.0), swap in successively stronger language backbones (MiniCPM, Llama 3, Qwen2, Qwen2.5, Qwen3, and Qwen3.5), and stack new modalities, compression schemes, and alignment techniques while keeping parameter counts between about 1B and 9B.

Is MiniCPM-V open source?

Yes. Every MiniCPM-V and MiniCPM-o release ships open weights on Hugging Face with a public technical report or model card, and the code has been Apache-2.0 from the start.^[4] The weight license evolved over time. MiniCPM-V 2.0 through MiniCPM-o 2.6 released weights that were free for academic use and free for commercial use only after a registration questionnaire.^[6]^[11] From MiniCPM-V 4.0 (August 2025) onward, OpenBMB moved the weights to a clean Apache-2.0 license and made the registration form optional, so commercial users no longer need prior approval.^[12]^[13] Alongside the weights, the project has open-sourced the RLAIF-V alignment dataset, the LLaVA-UHD encoding code, GGUF and int4 quantizations, and, from v4.0, the on-device iOS app and edge-adaptation code, which is why MiniCPM-V is often cited as one of the most reproducible open MLLM recipes available.^[10]^[12]^[16]

What is MiniCPM-V used for?

The MiniCPM-V model family's applications cluster around scenarios where cloud-hosted MLLMs are impractical because of cost, latency, privacy, or connectivity. The model cards and technical report call out three main use cases.^[1]^[2]^[3]

Document and scene-text understanding is the most thoroughly evaluated. Reported OCRBench scores above 700 for both MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.6, combined with native handling of 1.8M-pixel images at arbitrary aspect ratios, target tasks such as receipt parsing, ID-card and form OCR, screenshot question answering, and chart and table understanding, all on a smartphone or tablet.^[1]^[2] MiniCPM-V 4.5 later reported leading OCRBench results ahead of GPT-4o-latest and Gemini 2.5, plus state-of-the-art PDF document parsing on OmniDocBench.^[13]

Multilingual visual assistance follows from the model's support for thirty-plus languages plus aggressive token compression: a 2.6 deployment can run with about 7 GB of GPU memory at int4 quantization, putting it within reach of mid-range consumer devices for translation-with-vision and accessibility tasks.^[2]^[11]

Real-time multimodal interaction is the central pitch of MiniCPM-o 2.6, which accepts continuous video and audio streams independent of user queries and responds with bilingual speech.^[3] Together with the StreamingBench score of 66.0, that capability targets ambient assistants, video-call understanding, and on-device companion experiences without round-trips to a cloud server.^[3] MiniCPM-o 4.5 (February 2026) extended this to full-duplex operation, letting the model see, listen, and speak at the same time.^[15]

Outside the core MiniCPM-V repository the models have been packaged for Ollama, llama.cpp, and vLLM, making them a frequent default for developers building local vision language model prototypes that need both image and OCR capabilities under permissive licensing.^[2]^[4]

Robotics and embodied AI form a fourth, more speculative application area. ModelBest's public marketing materials position the MiniCPM family as a candidate brain for smart home devices and robots, where the latency, privacy, and connectivity constraints of cloud MLLMs are particularly acute.^[8] Concrete benchmarks of MiniCPM-V or MiniCPM-o inside robotics stacks are sparser than the desktop OCR and ASR numbers, but the inference cost profile (roughly 8 tokens/second on a Snapdragon 8 Gen 3 phone, about 7 GB of memory at int4) is in the regime where running a multimodal policy on a battery-powered platform is plausible.^[1]^[2] That pitch began moving from marketing to demonstration in 2026: at the 2026 Zhongguancun Forum, ModelBest showed MiniCPM-V 4.5 running on embodied robots, its first public deployment of the vision line inside a robotics stack.^[17]

A fifth application is education and accessibility. Because MiniCPM-V supports more than thirty languages and can run offline, the model has been used as a building block for offline study assistants, OCR-based reading aids for visually impaired users, and document translation pipelines in regions with intermittent connectivity.^[11] These uses are documented primarily in community forks and Hugging Face Spaces rather than in formal benchmarks, but they illustrate the practical reach of a freely downloadable multimodal model that now spans from about 1B to 9B parameters.^[11]^[16]

Significance

MiniCPM-V's significance comes from being one of the first public demonstrations that strong multimodal capability does not require frontier-scale compute. The August 2024 technical report frames the contribution as evidence of a "rapidly decreasing" model size needed for usable GPT-4V-class performance, paired with mobile silicon (the Snapdragon 8 Gen 3 generation specifically) that for the first time could host an 8B-parameter MLLM with NPU-accelerated visual encoding and 8-plus tokens/second decode.^[1] The combination undercut a widely held assumption that frontier multimodal models must be tens of billions of parameters and cloud-only.

A second contribution is the open release pattern. The code is Apache-2.0, and from v4.0 the weights are Apache-2.0 as well.^[11]^[12] Together with the public RLAIF-V dataset and the LLaVA-UHD code, MiniCPM-V provided one of the most complete openly reproducible MLLM recipes in 2024, covering the vision encoder choice, the adaptive encoding scheme, the supervised fine-tuning mix, and the alignment pipeline.^[4]^[10] Subsequent open multimodal projects, including later InternVL and Qwen2.5-VL revisions, share architectural family resemblance to the MiniCPM-V resampler-plus-SigLIP design, though those teams have published their own independent contributions.

Third, MiniCPM-V seeded a Chinese on-device AI ecosystem. ModelBest pitches its models as "Little Powerhouses" engineered for smartphones, PCs, automotive systems, smart home devices, and even robots, a positioning that competes with Gemini Nano, Phi-3 Vision, and proprietary on-device stacks from device OEMs.^[8] MIT Technology Review identified ModelBest as one of four Chinese AI startups worth watching beyond DeepSeek in its February 2025 coverage, noting the December 2024 funding round of "tens of millions of dollars."^[8]

Adoption has followed. By mid-2026 the OpenBMB/MiniCPM-V repository had passed 25,000 GitHub stars, and the MiniCPM-V 4.5 weights alone drew more than 260,000 Hugging Face downloads in a single month, while ModelBest's April 2026 unicorn round underscored commercial confidence in the on-device thesis.^[4]^[13]^[17]

Fourth, the project demonstrated that the alignment-data bottleneck for multimodal models could be overcome without proprietary feedback. The RLAIF-V framework uses open-source MLLMs as labelers in an atomic-claim decomposition loop, letting a 7B model reduce object hallucination by roughly 80 percent without any reliance on GPT-4V or human annotators.^[10] That result has implications well beyond MiniCPM-V: it suggests that the gap between open and closed MLLMs on trustworthy behavior may be closable using fully open data and tooling, which informs the broader academic argument about open-source multimodal models.^[10]
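The atomic-claim decomposition loop described above can be sketched in a few lines. This is a simplified illustration, not the RLAIF-V implementation: the sentence-based `split_into_claims` splitter, the `toy_labeler` lookup (standing in for an open-source MLLM that checks each claim against the image), and the scoring rule of minus the number of rejected claims are all assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def split_into_claims(response: str) -> List[str]:
    """Hypothetical splitter: treat each sentence as one atomic claim."""
    return [s.strip() for s in response.split(".") if s.strip()]


def score_response(response: str, labeler: Callable[[str], bool]) -> int:
    """Score a response as minus the number of claims the labeler rejects."""
    claims = split_into_claims(response)
    return -sum(1 for claim in claims if not labeler(claim))


def build_preference_pairs(
    responses: List[str], labeler: Callable[[str], bool]
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs wherever two scores differ."""
    scored = [(r, score_response(r, labeler)) for r in responses]
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored, 2):
        if score_a > score_b:
            pairs.append((resp_a, resp_b))
        elif score_b > score_a:
            pairs.append((resp_b, resp_a))
    return pairs


# Toy stand-in for an MLLM labeler: a claim is accepted only if it
# matches something known to be visible in the (imaginary) image.
VISIBLE_FACTS = {"a dog sits on the grass", "the sky is blue"}


def toy_labeler(claim: str) -> bool:
    return claim.lower() in VISIBLE_FACTS


responses = [
    "A dog sits on the grass. The sky is blue.",
    "A dog sits on the grass. A cat sleeps nearby.",
]
pairs = build_preference_pairs(responses, toy_labeler)
for chosen, rejected in pairs:
    print("chosen:", chosen, "| rejected:", rejected)
```

In this toy run the second response contains one unsupported claim, so it becomes the rejected side of a single preference pair. Pairs of this kind are what a preference-optimization stage would then train on.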

What are MiniCPM-V's limitations?

Several caveats apply to the headline claims, drawn either from the technical report itself or from third-party model card commentary.^[1]^[2]

Benchmark cherry-picking risk: the "GPT-4V level" and later "GPT-4o level" claims are anchored to OpenCompass at specific competitor snapshots and a fixed benchmark aggregate. On benchmarks not in that suite, particularly those involving long-horizon reasoning, complex spatial layouts, or video tasks beyond Video-MME, the gap to proprietary frontier models can be larger.^[1]^[14]

Token-density tradeoffs: aggressive resampler compression (640 tokens for a 1.8M-pixel image in MiniCPM-V 2.6, or the 16x video mode in MiniCPM-V 4.6) saves memory and decode time but can hurt fine-grained text recognition in dense documents or very small fonts, where MLP-projector models that emit thousands of visual tokens may still have an edge.^[2]^[16]

Mobile-NPU portability: the QNN-accelerated visual encoder demonstration specifically targets Snapdragon 8 Gen 3-class hardware, and the 8.2 tokens/second decode throughput on the Xiaomi 14 Pro reflects a heavily optimized stack.^[1] On older or non-Qualcomm mobile chips the deployment story degrades substantially, a limitation acknowledged implicitly by the choice of reference hardware in the paper.

License nuance, now largely resolved: through MiniCPM-o 2.6, commercial use of the weights required completing a registration questionnaire with OpenBMB, which made the terms less permissive than fully open licenses.^[11] From MiniCPM-V 4.0 the weights are Apache-2.0 with registration optional, which removes this friction for newer releases but not retroactively for the pre-4.0 checkpoints.^[12]

Hallucination is reduced, not solved: even after RLAIF-V, Object HalBench rates of around 10 percent remain non-trivial, and out-of-distribution image domains (medical, scientific diagrams, low-resource languages) are not extensively covered in the released benchmarks.^[1]^[10]
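The token-density and decode-throughput caveats above reduce to simple arithmetic. The sketch below is illustrative only: it uses the rounded figures quoted in this section (640 visual tokens for a 1.8M-pixel image, 8.2 tokens/second on the Xiaomi 14 Pro), while the 2,000-token comparison model and the 200-token answer length are hypothetical values chosen for the example, not measurements.

```python
def pixels_per_visual_token(image_pixels: int, visual_tokens: int) -> float:
    """Average number of image pixels summarized by one visual token."""
    return image_pixels / visual_tokens


def decode_seconds(answer_tokens: int, tokens_per_second: float) -> float:
    """Time to generate an answer at a fixed decode throughput."""
    return answer_tokens / tokens_per_second


# Figures quoted in this section (rounded).
IMAGE_PIXELS = 1_800_000          # "1.8M-pixel image"
RESAMPLER_TOKENS = 640            # visual tokens after resampler compression
DECODE_TOKENS_PER_SECOND = 8.2    # reported phone decode throughput

# Hypothetical values, chosen only to make the comparison concrete.
HYPOTHETICAL_MLP_TOKENS = 2_000   # an MLP-projector model emitting "thousands"
HYPOTHETICAL_ANSWER_TOKENS = 200  # length of one generated answer

compressed_density = pixels_per_visual_token(IMAGE_PIXELS, RESAMPLER_TOKENS)
uncompressed_density = pixels_per_visual_token(IMAGE_PIXELS, HYPOTHETICAL_MLP_TOKENS)
answer_latency = decode_seconds(HYPOTHETICAL_ANSWER_TOKENS, DECODE_TOKENS_PER_SECOND)

print(f"resampler: {compressed_density:.1f} pixels per token")
print(f"hypothetical MLP projector: {uncompressed_density:.1f} pixels per token")
print(f"{HYPOTHETICAL_ANSWER_TOKENS}-token answer: {answer_latency:.1f} s")
```

Under these assumptions each resampled token summarizes about 2,800 pixels, against 900 for the hypothetical projector model, which is why small fonts are the first thing to suffer. A 200-token answer would take roughly 24 seconds at the quoted phone throughput.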

A high-profile dispute in mid-2024 around alleged training-data overlap with LLaVA derivatives also surfaced briefly in the open-source community, though the OpenBMB team published clarifications and the project remained widely used. That episode is not extensively documented in the formal academic record cited here and is therefore not described in detail.

Reproducibility constraints also affect external verification. The full training datasets are not all released, so independent replication of the OpenCompass and Object HalBench numbers reported in the technical report is harder than the open-weights and Apache-2.0 code license suggest at first glance.^[1] The RLAIF-V dataset is publicly available on Hugging Face, which closes that gap for the alignment stage specifically, but the multilingual pre-training mixture is described at a relatively high level in the paper rather than released as a single downloadable corpus.^[1]^[10]

Related work and comparison

MiniCPM-V sits alongside several other open MLLM families targeting the 7-9B size class.

Family	Vision backbone	LLM	Key feature	Comparison point
MiniCPM-V 4.5	SigLIP2-400M	Qwen3-8B	Unified 3D-Resampler, hybrid fast/deep thinking, RLAIF-V	OpenCompass 77.0, 96x video compression at 8B^[13]^[14]
Qwen2.5-VL	Native dynamic ViT	Qwen2.5	Native dynamic resolution, agent capabilities	Alibaba's primary open VLM line^[2]
InternVL	InternViT-6B	InternLM / Qwen	High-resolution multi-image	Larger total parameter counts at the top of the lineup^[4]
LLaVA (1.5/NeXT)	CLIP / SigLIP	Vicuna / Llama	Simple MLP projector	Reference baseline; higher token counts per image^[5]
DeepSeek-VL2	Mixture-of-experts vision	DeepSeek-V2 / MoE LM	MoE multimodal scaling	Different efficiency strategy (MoE rather than compression)^[4]

The closest peer in spirit is arguably Gemini Nano, which is similarly engineered for edge AI deployment but ships as a closed system inside Android and Pixel devices, whereas MiniCPM-V is open-weights with a published technical report.^[8] Where most peers scale up total parameters to gain quality, the MiniCPM-V line has instead pushed token compression and encoder efficiency, letting an 8B model (4.5) claim wins over the 72B Qwen2.5-VL while a roughly 1B model (4.6) tops the open-weights field under 2B parameters.^[14]^[18]

References

1. Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, et al. (OpenBMB / Tsinghua University), "MiniCPM-V: A GPT-4V Level MLLM on Your Phone", arXiv, 2024-08-03. https://arxiv.org/abs/2408.01800. Accessed 2026-05-21.
2. OpenBMB, "openbmb/MiniCPM-V-2_6 Model Card", Hugging Face, 2024-08-06. https://huggingface.co/openbmb/MiniCPM-V-2_6. Accessed 2026-05-21.
3. OpenBMB, "openbmb/MiniCPM-o-2_6 Model Card", Hugging Face, 2025-01-24. https://huggingface.co/openbmb/MiniCPM-o-2_6. Accessed 2026-05-21.
4. OpenBMB, "OpenBMB/MiniCPM-V GitHub repository (News/changelog and star count)", GitHub, 2024-02-01 (initial release). https://github.com/OpenBMB/MiniCPM-V. Accessed 2026-07-12.
5. OpenBMB, "openbmb/MiniCPM-V Model Card (OmniLMM-3B)", Hugging Face, 2024-02-01. https://huggingface.co/openbmb/MiniCPM-V. Accessed 2026-05-21.
6. OpenBMB, "openbmb/MiniCPM-V-2 Model Card", Hugging Face, 2024-04-12. https://huggingface.co/openbmb/MiniCPM-V-2. Accessed 2026-05-21.
7. Ruyi Xu, Yuan Yao, Zonghao Guo, et al., "LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images", arXiv, 2024-03-18. https://arxiv.org/pdf/2403.11703. Accessed 2026-05-21.
8. Zeyi Yang, "Four Chinese AI startups to watch beyond DeepSeek", MIT Technology Review, 2025-02-04. https://www.technologyreview.com/2025/02/04/1110942/four-chinese-ai-startups-deepseek/. Accessed 2026-05-21.
9. OpenBMB, "OpenBMB/MiniCPM GitHub repository (text-only series)", GitHub, 2024-02-01. https://github.com/openbmb/minicpm. Accessed 2026-05-21.
10. Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, et al., "RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness", arXiv, 2024-05-27. https://arxiv.org/abs/2405.17220. Accessed 2026-05-21.
11. OpenBMB, "openbmb/MiniCPM-Llama3-V-2_5 Model Card", Hugging Face, 2024-05-20. https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5. Accessed 2026-05-21.
12. OpenBMB, "openbmb/MiniCPM-V-4 Model Card", Hugging Face, 2025-08-02. https://huggingface.co/openbmb/MiniCPM-V-4. Accessed 2026-07-12.
13. OpenBMB, "openbmb/MiniCPM-V-4_5 Model Card", Hugging Face, 2025-08-26. https://huggingface.co/openbmb/MiniCPM-V-4_5. Accessed 2026-07-12.
14. OpenBMB, "MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe", arXiv:2509.18154, 2025-09-16. https://arxiv.org/abs/2509.18154. Accessed 2026-07-12.
15. OpenBMB, "openbmb/MiniCPM-o-4_5 Model Card", Hugging Face, 2026-02-03. https://huggingface.co/openbmb/MiniCPM-o-4_5. Accessed 2026-07-12.
16. OpenBMB, "openbmb/MiniCPM-V-4.6 Model Card", Hugging Face, 2026-05-11. https://huggingface.co/openbmb/MiniCPM-V-4.6. Accessed 2026-07-12.
17. "大模型公司面壁智能完成数亿元融资投后估值迈入独角兽门槛" (ModelBest completes several-hundred-million-yuan financing, post-money valuation enters unicorn threshold), Sina Finance, 2026-04-09. https://finance.sina.com.cn/stock/t/2026-04-09/doc-inhtwhrc8925749.shtml. Accessed 2026-07-12.
18. Artificial Analysis, "OpenBMB launches MiniCPM-V 4.6 1.3B Instruct", Artificial Analysis, 2026-05. https://artificialanalysis.ai/articles/openbmb-launches-minicpm-v-4-6-1-3b-instruct. Accessed 2026-07-12.


