Molmo
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,586 words
Molmo is a family of open-weight, open-data vision-language models (VLMs) released by the Allen Institute for AI (Ai2) on 25 September 2024.[^1][^2] The family includes a 72-billion-parameter flagship (Molmo-72B), two 7-billion variants (Molmo-7B-D and Molmo-7B-O), and a Mixture-of-Experts model with roughly 1.5 billion active parameters (MolmoE-1B).[^3][^4][^5][^6] Unlike most contemporary open VLMs, Molmo was trained entirely without synthetic image-text data distilled from proprietary systems such as GPT-4o or Claude 3.5 Sonnet; instead, it relied on a new collection of human-curated datasets called PixMo.[^1][^7] Ai2 reported that Molmo-72B outperformed Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet on eleven academic vision benchmarks and placed second only to GPT-4o on a large pairwise human-preference study.[^1][^4] The paper introducing Molmo, "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models" (arXiv:2409.17146), received a Best Paper Honorable Mention at CVPR 2025.[^7][^8]
| Attribute | Detail |
|---|---|
| Developer | Allen Institute for AI (Ai2) |
| First release | 25 September 2024 (the "0924" tag in checkpoint names)[^2][^3] |
| Models | Molmo-72B, Molmo-7B-D, Molmo-7B-O, MolmoE-1B (all 0924 release)[^3][^4][^5][^6] |
| License | Hugging Face weights distributed under Apache 2.0; PixMo datasets under ODC-BY-1.0[^3][^9] |
| Vision encoder | OpenAI CLIP ViT-L/14 at 336 px input (all variants)[^4][^5][^10] |
| Backbone LLMs | Qwen2-72B (Molmo-72B), Qwen2-7B (Molmo-7B-D), OLMo-7B-1024-preview (Molmo-7B-O), OLMoE-1B-7B-0924 (MolmoE-1B)[^4][^5][^6][^10] |
| Training data | ~1 million image-text pairs from PixMo (roughly three orders of magnitude fewer than many competing approaches)[^1][^11] |
| Novel capability | Native 2D "pointing" output for objects in images[^1][^11] |
| Paper | Deitke et al., arXiv:2409.17146 (submitted 25 September 2024; v2 5 December 2024)[^7] |
| Venue | CVPR 2025 (Best Paper Honorable Mention)[^8] |
The release combined model weights on Hugging Face, a hosted demo at molmo.allenai.org, a publicly available training and evaluation codebase on GitHub, and the PixMo datasets, with the explicit goal of supplying the open community with foundational knowledge about how to build performant VLMs from scratch rather than by distilling closed competitors.[^1][^2][^11]
By mid-2024 the strongest VLMs were proprietary: OpenAI's GPT-4o and GPT-4V, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro/Flash dominated the leaderboards. Several openly released VLMs (such as LLaVA-OneVision, LLaVA, and various Qwen-VL derivatives) had narrowed the gap on academic benchmarks, but their published training pipelines almost universally relied on data either generated by, captioned by, or judged by closed models.[^1][^7] In effect, the open community was distilling proprietary systems, which left it without independent knowledge about how to train a competitive VLM from scratch and exposed downstream releases to the terms of service of the closed providers whose outputs they consumed.[^7][^11]
The Molmo project at Ai2 set out explicitly to break that dependency. The team, led by Matt Deitke with co-authors including Christopher Clark, Sangho Lee, Rohun Tripathi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Ranjay Krishna, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi (about 50 authors in total), argued that the bottleneck for performance was not model size or compute but the quality and provenance of training data.[^7] Their thesis was that a modestly sized, carefully curated, fully human-produced dataset, combined with a simple two-stage training pipeline and a standard ViT-plus-LLM architecture, could match or exceed VLMs trained on billions of image-text pairs.[^1][^11] Deitke later described the work as "a long journey over 1.5 years, from failing to get strong performance with massive scale, low quality data, to focusing on modest scale extremely high quality data."[^8]
The announcement was widely covered. TechCrunch quoted Ai2 chief executive Ali Farhadi summarizing the framing as "open is equal to closed, and small is now equal to big."[^2] DeepLearning.AI's The Batch newsletter highlighted Molmo's claim that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources."[^12] Nathan Lambert, writing on Interconnects, positioned Molmo against Meta's Llama 3.2 Vision release in the same week, concluding that "Llama 3.2 V is a better text model, maybe even much better, but Molmo is a better image model."[^13]
Molmo follows what the paper calls a "simple, standard design"[^7] with four components that are conceptually similar to most modern VLMs but trained end-to-end with carefully chosen learning rates.
The vision encoder is OpenAI's CLIP ViT-L/14 at 336 px (the openai/clip-vit-large-patch14-336 checkpoint), which maps each image tile independently into a sequence of vision tokens.[^4][^5][^10] There is no separately learned "perceiver resampler" or Q-former; the connector is intentionally minimal and the pooling is the only learned reduction.[^11]
Training is a two-stage pipeline that the authors describe as "streamlined" because it avoids the freeze/unfreeze schedules common to other VLMs.[^10][^11] The first stage is multimodal pre-training on PixMo-Cap (dense captioning), and the second stage is supervised fine-tuning on a mixture of the remaining PixMo subsets together with a handful of widely used academic datasets.[^7][^11]
The optimization recipe reported in the paper uses AdamW for four epochs with separate learning rates for each component: 2e-4 for the connector, 6e-6 for the ViT, and 2e-5 for the LLM, with a cosine schedule decaying to 10% of peak.[^14] The connector uses a 200-step linear warmup, while the ViT and LLM warm up for 2,000 steps; gradients are clipped separately for the encoder, connector, and decoder.[^14] The 72B variant was pre-trained on 128 NVIDIA H100 GPUs in roughly 33.3 hours, or about 4,260 H100-hours (128 × 33.3).[^14]
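The per-component schedule described above can be illustrated with a short sketch. It is not Ai2's training code: the peak rates, warmup lengths, and 10% floor come from the text, while the total step count, the choice to start warmup from zero, and the function name `lr_at` are assumptions made for illustration.

```python
import math

# Peak learning rates and warmup lengths quoted in the text.
PEAK_LR = {"connector": 2e-4, "vit": 6e-6, "llm": 2e-5}
WARMUP_STEPS = {"connector": 200, "vit": 2000, "llm": 2000}
FINAL_FRACTION = 0.10  # the cosine schedule decays to 10% of peak


def lr_at(step: int, component: str, total_steps: int) -> float:
    """Learning rate for one component at a given optimizer step."""
    peak = PEAK_LR[component]
    warmup = WARMUP_STEPS[component]
    if step < warmup:
        # Linear warmup; starting from zero is an assumption.
        return peak * step / warmup
    # Cosine decay from the peak down to FINAL_FRACTION * peak.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = FINAL_FRACTION * peak
    return floor + (peak - floor) * cosine


if __name__ == "__main__":
    TOTAL = 20_000  # hypothetical; the text does not give a step count
    for s in (0, 100, 200, 2000, 11_000, 20_000):
        print(s, {c: f"{lr_at(s, c, TOTAL):.2e}" for c in PEAK_LR})
```

Because the connector finishes warming up ten times sooner than the ViT and LLM, it reaches its much higher peak rate while the two pretrained components are still ramping up.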
The "0924" tag on the public checkpoints encodes the release month (September 2024). The four publicly released models pair the common vision tower with different language backbones:[^3][^4][^5][^6]
| Model | Hugging Face ID | LLM backbone | Vision encoder | Notes |
|---|---|---|---|---|
| Molmo-72B | allenai/Molmo-72B-0924 | Qwen2-72B (Alibaba) | CLIP ViT-L/14 336 px | Flagship; 73B total parameters[^4] |
| Molmo-7B-D | allenai/Molmo-7B-D-0924 | Qwen2-7B (Alibaba) | CLIP ViT-L/14 336 px | "Demo" model; about 8B parameters[^3] |
| Molmo-7B-O | allenai/Molmo-7B-O-0924 | OLMo-7B-1024-preview (Ai2) | CLIP ViT-L/14 336 px | Most "fully open" 7B variant, since both the LLM and the data are open[^5] |
| MolmoE-1B | allenai/MolmoE-1B-0924 | OLMoE-1B-7B-0924 (Ai2) | CLIP ViT-L/14 336 px | Mixture of Experts; ~1.5B active / 7.2B total parameters[^6] |
In addition to the publicly released checkpoints, the paper reports experiments with Mistral 7B, Gemma 2 9B, Phi-3 Medium, and other backbones to demonstrate that the recipe generalizes across LLMs.[^7][^11] The 7B-O suffix denotes the variant built on Ai2's own OLMo family (and is therefore the most independently reproducible), while 7B-D is the Qwen2-7B-based "demo" model intended for the public web demo.[^5][^15]
All four checkpoints are distributed under Apache 2.0,[^3][^4][^5][^6] which contrasts with the more restrictive Llama 3.2 Community License Agreement that governs Meta's Llama 3.2 Vision models released around the same time.[^13]
The paper's authors attribute "the success of our approach" chiefly to PixMo, and treat the model architecture as deliberately conventional so that the new data can be isolated as the source of performance gains.[^7][^11] PixMo is a collection of subsets, each designed for a specific capability and each collected without using outputs from any external VLM.
Naive crowdsourcing of dense image captions had two failure modes for the Ai2 team. First, asking annotators to type long descriptions yielded short and sparse text because typing is slow and tiring. Second, the team could not verify that annotators were not copy-pasting captions from a publicly available VLM such as GPT-4 or Gemini, which would silently re-introduce distillation.[^11][^16]
The solution was a "modality switching trick": annotators were asked to describe each image by speaking continuously for 60 to 90 seconds, with the audio recording kept as a receipt that no VLM was queried during the task.[^11][^16] The audio was then transcribed and the transcripts were passed to Claude to be turned into a single polished long-form caption averaging about 200 words.[^16][^17] Claude was used here only as a text-only post-processor over the human-generated transcripts; the underlying visual content described in the caption came from the human annotator, not from a VLM.[^16][^17] Both the cleaned caption and the raw audio transcripts are released publicly so users can audit the pipeline.[^17]
The released PixMo-Cap dataset contains 717,042 images with one or more captions, totalling roughly 1.3 million captions, and is distributed under the ODC-BY-1.0 license.[^17] In an early version of the protocol three annotators described every image; later in the project a single annotator per image with a 90-second minimum was used to reduce cost without measurably hurting downstream performance.[^11]
For supervised fine-tuning on free-form visual question answering, the team built PixMo-AskModelAnything, consisting of human-authored question/answer pairs about images. The reported scale is 162,000 QA pairs across 73,000 images.[^11] Annotators wrote both the question and the answer for an image they were shown, again with no VLM in the loop.[^11]
The most distinctive PixMo subset is PixMo-Points, which trains Molmo to output 2D pixel coordinates as part of its natural-language answer. Annotators were asked to point at something in an image, describe it, and then point to each instance of the same thing in the image so the labels exhaustively cover all occurrences. The released collection contains roughly 2.3 million question-point pairs over about 223,000 images.[^14][^11] To teach the model to refuse when a queried object is absent, the data also includes "not present" examples.[^14] A secondary "point explanations" pipeline let annotators feed the LLM a list of text-labelled points so that the model would learn to reference points when constructing a textual answer; this added roughly 79,000 point-explanation annotations on 14,000 images.[^14]
Coordinates are normalized to the 0-100 range so they are independent of input resolution, and points are ordered top-to-bottom, left-to-right with each point numbered, which lets the same data train the model to both point and count.[^14]
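A small sketch shows how a front end might consume such coordinates. It assumes points have already been extracted as (x, y) pairs in the 0-100 range; the function names are hypothetical, and sorting strictly by y and then x is one simple reading of "top-to-bottom, left-to-right", not necessarily the exact rule used to build the dataset.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), each normalized to the 0-100 range


def to_pixels(point: Point, width: int, height: int) -> Tuple[int, int]:
    """Map a 0-100 normalized point to integer pixel coordinates."""
    x, y = point
    # Scale by the image size, then clamp so 100.0 stays inside the image.
    px = min(width - 1, max(0, round(x / 100.0 * width)))
    py = min(height - 1, max(0, round(y / 100.0 * height)))
    return px, py


def order_points(points: List[Point]) -> List[Point]:
    """Sort top-to-bottom, then left-to-right (assumed tie-breaking rule)."""
    return sorted(points, key=lambda p: (p[1], p[0]))


def number_points(points: List[Point]) -> List[Tuple[int, Point]]:
    """Attach 1-based indices; the last index doubles as the count."""
    return list(enumerate(order_points(points), start=1))


if __name__ == "__main__":
    pts = [(50.0, 80.0), (10.0, 20.0), (90.0, 20.0)]
    for idx, p in number_points(pts):
        print(idx, p, to_pixels(p, width=640, height=480))
```

Because the points are numbered in a fixed order, the index of the final point is the count, which is how the same annotations can supervise both pointing and counting.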
The full PixMo collection released alongside Molmo includes several other subsets used in supervised fine-tuning:[^11][^16]
The total fine-tuning mixture (PixMo subsets plus standard academic supervised datasets) is described in the paper as roughly one million image-text pairs, which the Ai2 blog notes is "three orders of magnitude fewer" than the volume used by some competing approaches.[^1][^11]
Pointing is the most-discussed novel feature of Molmo. Where GPT-4V, Claude, and Gemini respond to an image only in natural language, Molmo can interleave its prose answer with structured point coordinates that the front end can render directly on the image.[^1][^13] Asked to count the dogs in a photo, for example, Molmo will output one dot per dog face; asked about tongues, it will output one dot per visible tongue.[^2]
Because pointing produces machine-readable spatial references, Ai2 frames it as a grounding primitive for downstream agents. The blog notes that "a robot could query a pointing-enabled VLM for a waypoint or the location of an object to pick up, or a web agent could query the VLM for the location of a user interface element to click."[^1] TechCrunch's coverage emphasized the implication for robotics and for browser automation: Molmo can navigate web interfaces without ever inspecting the underlying HTML, because it can look at a screenshot and point at the right button.[^2]
The released paper notes two known failure modes of the pointing system. First, because counting and pointing can produce very long output sequences, Molmo's training data was capped at 40 counts per image to control memory usage; the authors flag the cap as a planned future improvement.[^11] Second, the model sometimes fails to point on counting-style questions because "how many" prompts in the supervised data are not always paired with point labels; the authors describe a heuristic for detecting such questions and prefixing them with explicit instructions when point output is desired.[^11]
Ai2 evaluated Molmo on eleven academic vision-language benchmarks and on a large pairwise human-preference study, comparing against both the strongest closed models (GPT-4o, GPT-4V, Claude 3.5 Sonnet, Gemini 1.5 Pro and Flash) and several open VLMs (LLaVA OneVision 7B, Pixtral 12B, Qwen VL2 7B and 72B, among others).[^4][^6][^11]
The eleven benchmarks used were AI2D, ChartQA, VQA v2.0, DocVQA, InfographicVQA, TextVQA, RealWorldQA, MMMU, MathVista, CountBenchQA, and PixMo-Count (sometimes referred to as Flickr Count in the model cards).[^3][^4][^14] Selected per-benchmark scores reported in the paper are summarized below.[^14]
| Benchmark | MolmoE-1B | Molmo-7B-O | Molmo-7B-D | Molmo-72B |
|---|---|---|---|---|
| AI2D | 86.4 | 90.7 | 93.2 | 96.3 |
| ChartQA | 78.0 | 80.4 | 84.1 | 87.3 |
| VQA v2.0 | 83.9 | 85.3 | 85.6 | 86.5 |
| DocVQA | 77.7 | 90.8 | 92.2 | 93.5 |
| InfographicVQA | 53.9 | 70.0 | 72.6 | 81.9 |
| TextVQA | 78.8 | 80.4 | 81.7 | 83.1 |
| RealWorldQA | 60.4 | 67.5 | 70.7 | 75.2 |
| MMMU | 34.9 | 39.3 | 45.3 | 54.1 |
| MathVista | 34.0 | 44.5 | 51.6 | 58.6 |
| CountBenchQA | 87.2 | 89.0 | 88.5 | 91.2 |
Headline averages over the eleven benchmarks reported on the official model cards put Molmo-72B at 81.2, ahead of Qwen VL2 72B at 79.4, GPT-4o at 78.5, and Gemini 1.5 Pro at 78.3.[^4] Molmo-7B-D averages 77.3, edging Claude 3.5 Sonnet at 76.7 and ahead of LLaVA OneVision 7B at 72.0 and GPT-4V at 71.1.[^3] MolmoE-1B averages 68.6, comparable to small open models such as Pixtral 12B (69.5) despite running with only roughly 1.5 billion active parameters.[^6]
In parallel with the academic benchmarks, Ai2 ran a large pairwise human evaluation across 27 vision-language models. The study collected about 325,000 pairwise preference ratings, which the team noted was roughly three times the number of comparisons behind LMSYS Chatbot Arena's vision-model rankings at the time. The ratings were fitted with a Bradley-Terry model to produce Elo-style rankings.[^11][^14] Reported Elo scores include:[^3][^4][^14]
| Model | Human-preference Elo |
|---|---|
| GPT-4o | 1079 |
| Molmo-72B | 1077 |
| Gemini 1.5 Pro | 1074 |
| Claude 3.5 Sonnet | 1069 |
| Molmo-7B-D | 1056 |
| Molmo-7B-O | 1051 |
| GPT-4V | 1041 |
| MolmoE-1B | 1032 |
| Qwen VL2 7B | 1025 |
| LLaVA OneVision 7B | 1024 |
| Pixtral 12B | 1016 |
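For readers unfamiliar with how pairwise votes become a ranking like the one above, the following is a minimal Bradley-Terry sketch. It is not Ai2's evaluation code: the win counts are invented, ties are ignored, and the 1000-point centre and 400-point log10 scale are conventional Elo choices assumed here rather than taken from the paper.

```python
import math
from typing import Dict, Tuple


def fit_bradley_terry(wins: Dict[Tuple[str, str], int],
                      iters: int = 200) -> Dict[str, float]:
    """Fit strengths from wins[(a, b)] = times a was preferred over b.

    Uses the standard minorize-maximize update; every model must have
    at least one win for the estimate to be finite.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            if total_wins == 0:
                raise ValueError(f"{i} has no wins; strength is undefined")
            denom = 0.0
            for j in models:
                if j != i:
                    n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo(strength: Dict[str, float], centre: float = 1000.0) -> Dict[str, float]:
    """Convert strengths to an Elo-style scale (assumed 400 * log10)."""
    return {m: centre + 400.0 * math.log10(s) for m, s in strength.items()}


if __name__ == "__main__":
    # Hypothetical vote counts for three unnamed models.
    votes = {("A", "B"): 60, ("B", "A"): 40,
             ("B", "C"): 55, ("C", "B"): 45,
             ("A", "C"): 65, ("C", "A"): 35}
    for name, rating in sorted(to_elo(fit_bradley_terry(votes)).items()):
        print(name, round(rating))
```

On this scale a small rating gap corresponds to a win probability only slightly above one half, which is worth keeping in mind when reading differences of a few points between the top models.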
Across the two evaluation frameworks, Ai2 framed the result as: Molmo-72B places first on the academic-benchmark average and second on the human-preference Elo, with both scores reachable from approximately one million image-text training pairs.[^1][^4]
Most independent press coverage echoes the paper's framing. TechCrunch summarized that Molmo "matches performance of GPT-4o, Gemini 1.5 Pro, and Claude-3.5 Sonnet" while running at "approximately one-tenth the size of competing closed models."[^2] DeepLearning.AI's The Batch reported that Molmo-72B "outperformed Gemini 1.5 Pro and Claude 3.5 Sonnet on academic tests and certain vision benchmarks," that 7B variants "performed between GPT-4V and GPT-4o," and that MolmoE-1B "nearly matched GPT-4V's capabilities."[^12]
Nathan Lambert's comparison with Llama 3.2 Vision noted that Molmo outperformed Llama 3.2 Vision by roughly +1 to +4 points across MathVista, ChartQA, AI2D, and DocVQA, while Llama 3.2 Vision led on MMMU by about 6 points.[^13] Lambert's conclusion that Llama 3.2 Vision is "a better text model" while Molmo is "a better image model" is consistent with Molmo's explicit choice to omit large-scale text-only instruction tuning and RLHF.[^13]
Later open vision-language releases from 2024 and 2025, including Qwen2.5-VL-72B-Instruct (reported around 70.2 on MMMU val) and InternVL3-78B (reported around 72.2 on MMMU), have since exceeded Molmo's scores on the reasoning-heavy MMMU and MathVista benchmarks, though typically with training pipelines that include synthetic data, larger-scale text RLHF, or substantially more parameters.[^18]
The table below summarizes how Molmo's release positioning differs from the most-discussed contemporary open and closed VLMs.
| Model family | Weights | Training data released? | Synthetic data from proprietary VLMs? | Reported score (MMMU val, or 11-benchmark suite average where noted) |
|---|---|---|---|---|
| Molmo-72B (Ai2, Sep 2024) | Open, Apache 2.0[^4] | Yes (PixMo, ODC-BY-1.0)[^17] | No (audio receipts; human-only)[^11] | 54.1[^14] |
| Llama 3.2 Vision 11B/90B (Meta, Sep 2024) | Open, Llama 3.2 Community License[^13] | No[^13] | Not fully disclosed[^13] | Not directly reported in Molmo paper |
| Qwen2.5-VL-72B-Instruct (Alibaba, 2025) | Open weights | Partial | Yes (mixed) | ~70.2[^18] |
| GPT-4o (OpenAI) | Closed | No | n/a | Suite average 78.5[^4] |
| Claude 3.5 Sonnet (Anthropic) | Closed | No | n/a | Suite average 76.7[^3] |
| Gemini 1.5 Pro (Google) | Closed | No | n/a | Suite average 78.3[^4] |
| LLaVA OneVision 7B | Open | Partial | Yes (uses GPT-4V outputs) | Suite average 72.0[^3] |
| Pixtral 12B (Mistral, 2024) | Open weights | Partial | Not disclosed | Suite average 69.5[^6] |
What is distinctive about Molmo's release, on the dimensions tracked above, is the combination of: a fully permissive Apache 2.0 weights license, a fully released training data set, and a credible audit trail (audio receipts) that no proprietary VLM was queried at any point during data construction.[^1][^11][^17]
All four Molmo-0924 checkpoints are hosted on Hugging Face and load through AutoModelForCausalLM and AutoProcessor with trust_remote_code=True, with bfloat16 autocast as the default precision.[^3][^4] The official model cards document two image preprocessing gotchas: images must be in RGB mode (transparent PNGs and other modes degrade quality), and a recommended workaround composites RGBA inputs onto a white or black background chosen by average brightness.[^3]
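The loading path and the RGBA workaround can be sketched as follows. This is an illustration rather than the model card's exact snippet: the helper names are hypothetical, the 127 brightness threshold and the rule "bright image gets a black background" are assumptions about how the background is chosen, and Pillow and Transformers are imported lazily so the pure helper can be used without them.

```python
def pick_background(mean_brightness: float) -> tuple:
    """Choose a background colour that contrasts with the image content."""
    # Assumed rule: bright content -> black, dark content -> white.
    return (0, 0, 0) if mean_brightness > 127 else (255, 255, 255)


def to_rgb(image):
    """Return an RGB version of a PIL image, flattening any alpha channel."""
    from PIL import Image, ImageOps, ImageStat  # requires Pillow

    if image.mode == "RGB":
        return image
    if image.mode != "RGBA":
        return image.convert("RGB")
    # Average brightness of the image, measured on a grayscale copy.
    brightness = ImageStat.Stat(ImageOps.grayscale(image)).mean[0]
    background = Image.new("RGB", image.size, pick_background(brightness))
    # The RGBA image's own alpha band serves as the paste mask.
    background.paste(image, (0, 0), image)
    return background


def load_molmo(model_id: str = "allenai/Molmo-7B-D-0924"):
    """Load processor and model; downloads weights and runs remote code."""
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # Molmo ships its modelling code with the weights
        torch_dtype="auto",
        device_map="auto",
    )
    return processor, model
```

Calling `to_rgb` on every input before it reaches the processor keeps transparent PNGs from reaching the model in a mode it handles poorly.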
Beyond Hugging Face Transformers, the model cards note official support for vLLM (version 0.7.2 or earlier due to a preprocessing regression in newer releases), SGLang, and Docker Model Runner, with community quantizations available for llama.cpp, Ollama, LM Studio, and Jan.[^3] The training and evaluation codebase at github.com/allenai/molmo is open-source and includes scripts/train.py, helper launchers train_captioner.py and train_multitask_model.py, an eval_downstream.py evaluation runner, and a scripts/download.py data fetcher; MolmoE-1B additionally requires a megablocks fork for sparse MoE training.[^19]
A live demo at molmo.allenai.org accepts image uploads from desktop or mobile browsers and exposes both natural-language answers and the pointing visualization layer.[^1][^2]
Within months of release, Molmo's pointing capability became a building block for several agent-oriented systems and for a wave of follow-up Ai2 releases. Molmo's appeal for robotics researchers came from its ability to emit pixel-accurate waypoints that downstream controllers can convert into manipulation targets, and for browser-automation researchers from its ability to identify clickable UI elements without inspecting page source.[^1][^2]
Ai2 itself extended the Molmo program with several successor projects:
Outside Ai2, Molmo is among the most-downloaded open VLMs on Hugging Face and is widely used as a baseline in subsequent vision-language papers that need an open reference model whose training data provenance is fully known.[^3][^4]
The paper and accompanying model cards explicitly catalogue several limitations:[^3][^11]
More broadly, third-party safety research on multimodal models has repeatedly shown that visual inputs can bypass text-level safety filters in VLMs, and Molmo is not exempt from this class of issue. Ai2 distributes the weights under its Responsible Use Guidelines and labels them for research and educational use.[^3][^4][^6]
Coverage in late September 2024 framed Molmo as evidence that high-quality, human-curated data could substitute for both scale and distillation. Farhadi's "open is equal to closed, and small is now equal to big" framing was widely repeated;[^2] DeepLearning.AI's The Batch summarized the upshot as a demonstration that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources";[^12] and MarkTechPost called Molmo a release that "ranks second on human evaluation, just slightly behind GPT-4o" while being entirely open.[^15]
The longer-term impact is reflected in the paper's academic reception. The Molmo and PixMo paper (Deitke et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models") was published at CVPR 2025 and received a Best Paper Honorable Mention.[^8] The release is now commonly cited as evidence that the open-VLM community can produce competitive frontier-class models without distilling closed systems, provided it invests heavily in original data collection.[^7][^8]