Molmo

Updated Jun 7, 2026

Molmo is a family of open-weight, open-data vision-language models (VLMs) released by the Allen Institute for AI (Ai2) on 25 September 2024.^[1]^[2] The family includes a 72-billion-parameter flagship (Molmo-72B), two 7-billion-parameter variants (Molmo-7B-D and Molmo-7B-O), and a Mixture-of-Experts model with roughly 1.5 billion active parameters (MolmoE-1B).^[3]^[4]^[5]^[6] Unlike most contemporary open VLMs, Molmo was trained entirely without synthetic image-text data distilled from proprietary systems such as GPT-4o or Claude 3.5 Sonnet; instead, it relied on a new collection of human-curated datasets called PixMo.^[1]^[7] Ai2 reported that Molmo-72B outperformed Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet on eleven academic vision benchmarks and placed second only to GPT-4o on a large pairwise human-preference study.^[1]^[4] The paper introducing Molmo, "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models" (arXiv:2409.17146), received a Best Paper Honorable Mention at CVPR 2025.^[7]^[8]

Overview

Attribute	Detail
Developer	Allen Institute for AI (Ai2)
First release	25 September 2024 (the "0924" tag in checkpoint names)^[2]^[3]
Models	Molmo-72B, Molmo-7B-D, Molmo-7B-O, MolmoE-1B (all 0924 release)^[3]^[4]^[5]^[6]
License	Hugging Face weights distributed under Apache 2.0; PixMo datasets under ODC-BY-1.0^[3]^[9]
Vision encoder	OpenAI CLIP ViT-L/14 at 336 px input (all variants)^[4]^[5]^[10]
Backbone LLMs	Qwen2-72B (Molmo-72B), Qwen2-7B (Molmo-7B-D), OLMo-7B-1024-preview (Molmo-7B-O), OLMoE-1B-7B-0924 (MolmoE-1B)^[4]^[5]^[6]^[10]
Training data	~1 million image-text pairs from PixMo (over 100x fewer than typical open VLMs)^[1]^[11]
Novel capability	Native 2D "pointing" output for objects in images^[1]^[11]
Paper	Deitke et al., arXiv:2409.17146 (submitted 25 September 2024; v2 5 December 2024)^[7]
Venue	CVPR 2025 (Best Paper Honorable Mention)^[8]

The release combined model weights on Hugging Face, a hosted demo at molmo.allenai.org, a publicly available training and evaluation codebase on GitHub, and the PixMo datasets, with the explicit goal of supplying the open community with foundational knowledge about how to build performant VLMs from scratch rather than by distilling closed competitors.^[1]^[2]^[11]

Background and motivation

By mid-2024 the strongest VLMs were proprietary: OpenAI's GPT-4o and GPT-4V, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro/Flash dominated the leaderboards. Several openly released VLMs (such as LLaVA-OneVision, LLaVA, and various Qwen-VL derivatives) had narrowed the gap on academic benchmarks, but their published training pipelines almost universally relied on data either generated by, captioned by, or judged by closed models.^[1]^[7] In effect, the open community was distilling proprietary systems, which left it without independent knowledge about how to train a competitive VLM from scratch and exposed downstream releases to the terms of service of the closed providers whose outputs they consumed.^[7]^[11]

The Molmo project at Ai2 set out explicitly to break that dependency. The team, led by Matt Deitke with co-authors including Christopher Clark, Sangho Lee, Rohun Tripathi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Ranjay Krishna, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi (about 50 authors in total), argued that the bottleneck for performance was not model size or compute but the quality and provenance of training data.^[7] Their thesis was that a modestly sized, carefully curated, fully human-produced dataset, combined with a simple two-stage training pipeline and a standard ViT-plus-LLM architecture, could match or exceed VLMs trained on billions of image-text pairs.^[1]^[11] Deitke later described the work as "a long journey over 1.5 years, from failing to get strong performance with massive scale, low quality data, to focusing on modest scale extremely high quality data."^[8]

The announcement was widely covered. TechCrunch quoted Ai2 chief executive Ali Farhadi summarizing the framing as "open is equal to closed, and small is now equal to big."^[2] DeepLearning.AI's The Batch newsletter highlighted Molmo's claim that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources."^[12] Nathan Lambert, writing on Interconnects, positioned Molmo against Meta's Llama 3.2 Vision release in the same week, concluding that "Llama 3.2 V is a better text model, maybe even much better, but Molmo is a better image model."^[13]

Architecture

Molmo follows what the paper calls a "simple, standard design"^[7] with four components that are conceptually similar to most modern VLMs but trained end-to-end with carefully chosen learning rates.

Pre-processor. The input image is converted into a set of multi-scale, multi-crop tiles so that fine details in high-resolution images remain visible to the fixed-resolution vision tower.^[10]^[11]
Vision encoder. Every Molmo release uses OpenAI's CLIP ViT-L/14 at 336 px (the openai/clip-vit-large-patch14-336 checkpoint) to map each tile independently into a sequence of vision tokens.^[4]^[5]^[10]
Connector. A multi-layer perceptron projects vision tokens into the embedding space of the language decoder; the projected tokens are then pooled inside each 2x2 patch window to reduce token count while preserving spatial structure.^[10]^[11]
Language decoder. A decoder-only Transformer LLM consumes the interleaved image and text tokens. Each Molmo variant pairs the same vision tower and connector with a different LLM (see below).^[4]^[5]^[6]^[10]
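
As a rough illustration of how the connector's pooling shortens the sequence seen by the language decoder, the sketch below counts vision tokens for a single 336 px crop with 14 px patches, before and after 2x2 pooling. The crop count in the usage line is hypothetical: the real number of crops depends on the pre-processor's tiling, and any special or separator tokens are ignored here.

```python
def tokens_per_crop(image_size: int = 336, patch_size: int = 14, pool: int = 2):
    """Return (patch tokens before pooling, tokens after pool x pool pooling)."""
    side = image_size // patch_size      # 336 // 14 = 24 patches per side
    pooled_side = -(-side // pool)       # ceiling division: 24 -> 12
    return side * side, pooled_side * pooled_side


def total_vision_tokens(num_crops: int) -> int:
    """Pooled vision tokens for num_crops crops, ignoring any special tokens."""
    _, pooled = tokens_per_crop()
    return num_crops * pooled


print(tokens_per_crop())        # (576, 144)
print(total_vision_tokens(13))  # 1872, for a hypothetical 13-crop image
```

Under these assumptions, pooling cuts the per-crop token count by a factor of four, which matters because every additional crop multiplies the decoder's input length.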

There is no separately learned "perceiver resampler" or Q-former; the connector is intentionally minimal and the pooling is the only learned reduction.^[11] Training is a two-stage pipeline that the authors describe as "streamlined" because it avoids the freeze/unfreeze schedules common to other VLMs.^[10]^[11] The first stage is multimodal pre-training on PixMo-Cap (dense captioning), and the second stage is supervised fine-tuning on a mixture of the remaining PixMo subsets together with a handful of widely used academic datasets.^[7]^[11]

The optimization recipe reported in the paper uses AdamW for four epochs with separate learning rates for each component: 2e-4 for the connector, 6e-6 for the ViT, and 2e-5 for the LLM, with a cosine schedule decaying to 10% of peak.^[14] The connector uses a 200-step linear warmup, while the ViT and LLM warm up for 2,000 steps; gradients are clipped separately for the encoder, connector, and decoder.^[14] The 72B variant was pre-trained on 128 NVIDIA H100 GPUs in roughly 33.3 hours, or about 4,260 H100-hours (128 × 33.3 ≈ 4,262).^[14]
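
The per-component schedule described above can be sketched as a single function. This is a minimal illustration rather than the training code: the peak rates, warmup lengths, and 10% floor come from the text, while `TOTAL_STEPS`, the exact shape of the linear warmup, and the assumption that cosine decay begins when warmup ends are all placeholders chosen for the example.

```python
import math

# Peak learning rates and warmup lengths as reported above.
PEAK_LR = {"connector": 2e-4, "vit": 6e-6, "llm": 2e-5}
WARMUP_STEPS = {"connector": 200, "vit": 2000, "llm": 2000}
FINAL_FRACTION = 0.10  # the cosine schedule decays to 10% of peak


def lr_at(step: int, component: str, total_steps: int) -> float:
    """Learning rate for one component at a given optimizer step."""
    peak = PEAK_LR[component]
    warmup = WARMUP_STEPS[component]
    if step < warmup:
        # Assumed warmup form: reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup
    # Assumption: cosine decay runs from the end of warmup to total_steps.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (FINAL_FRACTION + (1.0 - FINAL_FRACTION) * cosine)


TOTAL_STEPS = 20_000  # hypothetical run length, not taken from the paper
for name in PEAK_LR:
    print(name, lr_at(0, name, TOTAL_STEPS), lr_at(TOTAL_STEPS, name, TOTAL_STEPS))
```

The shorter connector warmup means the freshly initialized projection reaches its (much larger) peak rate while the pre-trained ViT and LLM are still ramping up.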

Backbone LLMs and the 0924 family

The "0924" tag on the public checkpoints encodes the release month (September 2024). The four publicly released models pair the common vision tower with different language backbones:^[3]^[4]^[5]^[6]

Model	Hugging Face ID	LLM backbone	Vision encoder	Notes
Molmo-72B	allenai/Molmo-72B-0924	Qwen2-72B (Alibaba)	CLIP ViT-L/14 336 px	Flagship; 73B total parameters^[4]
Molmo-7B-D	allenai/Molmo-7B-D-0924	Qwen2-7B (Alibaba)	CLIP ViT-L/14 336 px	"Demo" model; about 8B parameters^[3]
Molmo-7B-O	allenai/Molmo-7B-O-0924	OLMo-7B-1024-preview (Ai2)	CLIP ViT-L/14 336 px	Most "fully open" 7B variant, since both the LLM and the data are open^[5]
MolmoE-1B	allenai/MolmoE-1B-0924	OLMoE-1B-7B-0924 (Ai2)	CLIP ViT-L/14 336 px	Mixture of Experts; ~1.5B active / 7.2B total parameters^[6]

In addition to the publicly released checkpoints, the paper reports experiments with Mistral 7B, Gemma 2 9B, Phi-3 Medium, and other backbones to demonstrate that the recipe generalizes across LLMs.^[7]^[11] The 7B-O suffix denotes the variant built on Ai2's own OLMo family (and is therefore the most independently reproducible), while 7B-D is the Qwen2-7B-based "demo" model intended for the public web demo.^[5]^[15]

All four checkpoints are distributed under Apache 2.0,^[3]^[4]^[5]^[6] which contrasts with the more restrictive Llama 3.2 Community License Agreement that governs Meta's Llama 3.2 Vision models released around the same time.^[13]

PixMo: data collection without proprietary VLMs

The paper's authors credit PixMo for "the success of our approach," with the model architecture treated as deliberately conventional so that the new data can be isolated as the source of performance gains.^[7]^[11] PixMo is a collection of subsets, each designed for a specific capability and each collected without using outputs from any external VLM.

PixMo-Cap (dense captions)

Naive crowdsourcing of dense image captions had two failure modes for the Ai2 team. First, asking annotators to type long descriptions yielded short and sparse text because typing is slow and tiring. Second, the team could not verify that annotators were not copy-pasting captions from a publicly available VLM such as GPT-4 or Gemini, which would silently re-introduce distillation.^[11]^[16]

The solution was a "modality switching trick": annotators were asked to describe each image by speaking continuously for 60 to 90 seconds, with the audio recording kept as a receipt that no VLM was queried during the task.^[11]^[16] The audio was then transcribed and the transcripts were passed to Claude to be turned into a single polished long-form caption averaging about 200 words.^[16]^[17] Claude was used here only as a text-only post-processor over the human-generated transcripts; the underlying visual content described in the caption came from the human annotator, not from a VLM.^[16]^[17] Both the cleaned caption and the raw audio transcripts are released publicly so users can audit the pipeline.^[17]

The released PixMo-Cap dataset contains 717,042 images with one or more captions, totalling roughly 1.3 million captions, and is distributed under the ODC-BY-1.0 license.^[17] In an early version of the protocol three annotators described every image; later in the project a single annotator per image with a 90-second minimum was used to reduce cost without measurably hurting downstream performance.^[11]

PixMo-AskModelAnything (free-form Q&A)

For supervised fine-tuning on free-form visual question answering, the team built PixMo-AskModelAnything, consisting of human-authored question/answer pairs about images. The reported scale is 162,000 QA pairs across 73,000 images.^[11] Annotators wrote both the question and the answer for an image they were shown, again with no VLM in the loop.^[11]

PixMo-Points (the pointing dataset)

The most distinctive PixMo subset is PixMo-Points, which trains Molmo to output 2D pixel coordinates as part of its natural-language answer. Annotators were asked to point at something in an image, describe it, and then point to each instance of the same thing in the image so the labels exhaustively cover all occurrences. The released collection contains roughly 2.3 million question-point pairs over about 223,000 images.^[11]^[14] To teach the model to refuse when a queried object is absent, the data also includes "not present" examples.^[14] A secondary "point explanations" pipeline let annotators feed the LLM a list of text-labelled points so that the model would learn to reference points when constructing a textual answer; this added roughly 79,000 point-explanation annotations on 14,000 images.^[14]

Coordinates are normalized to the 0-100 range so they are independent of input resolution, and points are ordered top-to-bottom, left-to-right with each point numbered, which lets the same data train the model to both point and count.^[14]
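
The normalization and ordering just described can be sketched as follows. The function names, the one-decimal rounding, and the tie-breaking rule (sort by y, then x) are assumptions made for illustration; the exact serialization Molmo uses for points is not shown here.

```python
def to_normalized(points_px, width, height):
    """Convert (x, y) pixel points to numbered 0-100 coordinates."""
    scaled = [(100.0 * x / width, 100.0 * y / height) for x, y in points_px]
    # Top-to-bottom first (y), then left-to-right (x) to break ties.
    ordered = sorted(scaled, key=lambda p: (p[1], p[0]))
    return [(i + 1, round(x, 1), round(y, 1)) for i, (x, y) in enumerate(ordered)]


def to_pixels(points_norm, width, height):
    """Map numbered 0-100 points back to pixel coordinates for rendering."""
    return [(n, x * width / 100.0, y * height / 100.0) for n, x, y in points_norm]


clicks = [(512, 300), (128, 90), (320, 90)]  # hypothetical clicks on a 640x400 image
points = to_normalized(clicks, 640, 400)
print(points)       # [(1, 20.0, 22.5), (2, 50.0, 22.5), (3, 80.0, 75.0)]
print(len(points))  # 3: the count falls out of the numbering
```

Because the last point's index equals the number of instances, the same annotation serves as both a pointing target and a counting label.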

Additional PixMo subsets

The full PixMo collection released alongside Molmo includes several other subsets used in supervised fine-tuning:^[11]^[16]

PixMo-CapQA: about 214,000 QA pairs generated from the long captions to produce caption-grounded Q&A data.
PixMo-Docs: about 255,000 synthetic document and chart images for OCR and document-VQA training.
PixMo-Clocks: about 826,000 synthetic images of analog clocks for time-reading training.
PixMo-Count: counting-task examples derived from the point annotations and used as both a training source and an internal benchmark.

The total fine-tuning mixture (PixMo subsets plus standard academic supervised datasets) is described in the paper as roughly one million image-text pairs, which the Ai2 blog notes is "three orders of magnitude fewer" than the volume used by some competing approaches.^[1]^[11]

Pointing capability

Pointing is the most-discussed novel feature of Molmo. Where GPT-4V, Claude, and Gemini respond to an image only in natural language, Molmo can interleave its prose answer with structured point coordinates that the front end can render directly on the image.^[1]^[13] Asked to count the dogs in a photo, for example, Molmo will output one dot per dog face; asked about tongues, it will output one dot per visible tongue.^[2]

Because pointing produces machine-readable spatial references, Ai2 frames it as a grounding primitive for downstream agents. The blog notes that "a robot could query a pointing-enabled VLM for a waypoint or the location of an object to pick up, or a web agent could query the VLM for the location of a user interface element to click."^[1] TechCrunch's coverage emphasized the implication for robotics and for browser automation: Molmo can navigate web interfaces without ever inspecting the underlying HTML, because it can look at a screenshot and point at the right button.^[2]

The released paper notes two known failure modes of the pointing system. First, because counting and pointing can produce very long output sequences, Molmo's training data was capped at 40 counts per image to control memory usage; the authors flag the cap as a planned future improvement.^[11] Second, the model sometimes fails to point on counting-style questions because "how many" prompts in the supervised data are not always paired with point labels; the authors describe a heuristic for detecting such questions and prefixing them with explicit instructions when point output is desired.^[11]
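
A prompt-rewriting heuristic of the kind mentioned above might look like the sketch below. The regular expression and the prefix string are hypothetical stand-ins, not the paper's actual heuristic or wording; only the idea (detect a counting question, then prepend an explicit pointing instruction) comes from the text.

```python
import re

# Hypothetical pattern and prefix; the paper's own heuristic is not reproduced here.
COUNT_PATTERN = re.compile(r"\b(how many|count|number of)\b", re.IGNORECASE)
POINT_PREFIX = "Point to each instance, then answer: "


def rewrite_prompt(question: str, want_points: bool = True) -> str:
    """Prefix counting-style questions with an explicit pointing instruction."""
    if want_points and COUNT_PATTERN.search(question):
        return POINT_PREFIX + question
    return question


print(rewrite_prompt("How many dogs are in the photo?"))
print(rewrite_prompt("What colour is the car?"))
```

A keyword match like this will miss paraphrased counting questions, so it is best treated as a convenience for callers who need point output rather than a guarantee.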

Performance and evaluation

Ai2 evaluated Molmo on eleven academic vision-language benchmarks and on a large pairwise human-preference study, comparing against both the strongest closed models (GPT-4o, GPT-4V, Claude 3.5 Sonnet, Gemini 1.5 Pro and Flash) and several open VLMs (LLaVA OneVision 7B, Pixtral 12B, and Qwen2-VL 7B and 72B, which the model cards list as "Qwen VL2", among others).^[4]^[6]^[11]

Academic benchmarks

The eleven benchmarks used were AI2D, ChartQA, VQA v2.0, DocVQA, InfographicVQA, TextVQA, RealWorldQA, MMMU, MathVista, CountBenchQA, and PixMo-Count (sometimes referred to as Flickr Count in the model cards).^[3]^[4]^[14] Selected per-benchmark scores reported in the paper are summarized below.^[14]

Benchmark	MolmoE-1B	Molmo-7B-O	Molmo-7B-D	Molmo-72B
AI2D	86.4	90.7	93.2	96.3
ChartQA	78.0	80.4	84.1	87.3
VQA v2.0	83.9	85.3	85.6	86.5
DocVQA	77.7	90.8	92.2	93.5
InfographicVQA	53.9	70.0	72.6	81.9
TextVQA	78.8	80.4	81.7	83.1
RealWorldQA	60.4	67.5	70.7	75.2
MMMU	34.9	39.3	45.3	54.1
MathVista	34.0	44.5	51.6	58.6
CountBenchQA	87.2	89.0	88.5	91.2

Headline averages over the eleven benchmarks reported on the official model cards put Molmo-72B at 81.2, ahead of Qwen VL2 72B at 79.4, GPT-4o at 78.5, and Gemini 1.5 Pro at 78.3.^[4] Molmo-7B-D averages 77.3, edging Claude 3.5 Sonnet at 76.7 and ahead of LLaVA OneVision 7B at 72.0 and GPT-4V at 71.1.^[3] MolmoE-1B averages 68.6, comparable to small open models such as Pixtral 12B (69.5) despite running with only roughly 1.5 billion active parameters.^[6]
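
When cross-checking these averages against the per-benchmark table, note that the table lists only ten of the eleven benchmarks (PixMo-Count is absent), so its column means will not match the headline figures exactly. The sketch below simply averages the ten listed scores per model; it makes no claim about the missing eleventh value.

```python
# Scores copied from the per-benchmark table above, in row order
# (AI2D, ChartQA, VQA v2.0, DocVQA, InfographicVQA, TextVQA,
#  RealWorldQA, MMMU, MathVista, CountBenchQA).
SCORES = {
    "MolmoE-1B":  [86.4, 78.0, 83.9, 77.7, 53.9, 78.8, 60.4, 34.9, 34.0, 87.2],
    "Molmo-7B-O": [90.7, 80.4, 85.3, 90.8, 70.0, 80.4, 67.5, 39.3, 44.5, 89.0],
    "Molmo-7B-D": [93.2, 84.1, 85.6, 92.2, 72.6, 81.7, 70.7, 45.3, 51.6, 88.5],
    "Molmo-72B":  [96.3, 87.3, 86.5, 93.5, 81.9, 83.1, 75.2, 54.1, 58.6, 91.2],
}


def ten_benchmark_mean(model: str) -> float:
    """Unweighted mean of the ten scores listed for a model."""
    values = SCORES[model]
    return sum(values) / len(values)


for model in SCORES:
    print(f"{model}: {ten_benchmark_mean(model):.2f}")
```

The ten-benchmark means come out slightly below the reported eleven-benchmark averages for MolmoE-1B, Molmo-7B-D, and Molmo-72B, which is consistent with the omitted benchmark scoring above each model's mean.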

Human-preference evaluation

In parallel with the academic benchmarks, Ai2 ran a large pairwise human evaluation across 27 vision-language models. The study collected about 325,000 pairwise preference ratings, which the team noted was roughly three times the volume of votes on the LMSYS Chatbot Arena at the time. The ratings were fitted with a Bradley-Terry model to produce Elo-style rankings.^[11]^[14] Reported Elo scores include:^[3]^[4]^[14]

Model	Human-preference Elo
GPT-4o	1079
Molmo-72B	1077
Gemini 1.5 Pro	1074
Claude 3.5 Sonnet	1069
Molmo-7B-D	1056
Molmo-7B-O	1051
GPT-4V	1041
MolmoE-1B	1032
Qwen VL2 7B	1025
LLaVA OneVision 7B	1024
Pixtral 12B	1016
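
To make the Bradley-Terry step concrete, the sketch below fits strengths from pairwise preference counts with a standard iterative (minorization-maximization) update and maps them to an Elo-style scale. It is a toy illustration under stated assumptions: the counts are made up, the 400 × log10 convention and the anchor of 1000 are conventional choices rather than details confirmed for Ai2's study, and the study's own fitting code may differ.

```python
import math


def fit_bradley_terry(wins, iterations=500):
    """Estimate Bradley-Terry strengths from pairwise preference counts.

    wins[a][b] is the number of comparisons in which a was preferred over b.
    Every model needs at least one win for the log-scale conversion below.
    """
    models = sorted(set(wins) | {b for row in wins.values() for b in row})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            total_wins = sum(wins.get(a, {}).values())
            denom = 0.0
            for b in models:
                if a == b:
                    continue
                games = wins.get(a, {}).get(b, 0) + wins.get(b, {}).get(a, 0)
                denom += games / (strength[a] + strength[b])
            updated[a] = total_wins / denom if denom else strength[a]
        scale = len(models) / sum(updated.values())  # keep strengths on a fixed scale
        strength = {m: s * scale for m, s in updated.items()}
    return strength


def to_elo(strength, anchor=1000.0):
    """Map strengths to an Elo-style scale centred on a hypothetical anchor."""
    raw = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    shift = anchor - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}


def expected_preference(rating_a, rating_b):
    """Probability that a is preferred over b under the 400 * log10 convention."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


toy_wins = {  # made-up counts for three placeholder models
    "A": {"B": 60, "C": 70},
    "B": {"A": 40, "C": 55},
    "C": {"A": 30, "B": 45},
}
ratings = to_elo(fit_bradley_terry(toy_wins))
print(sorted(ratings, key=ratings.get, reverse=True))  # ['A', 'B', 'C']
print(round(expected_preference(1079, 1077), 3))        # 0.503
```

If the published ratings follow the same convention, the two-point gap between GPT-4o and Molmo-72B corresponds to a preference rate only marginally above 50%.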

Across the two evaluation frameworks, Ai2 framed the result as follows: Molmo-72B places first on the academic-benchmark average and second on the human-preference Elo, with both results achieved using approximately one million image-text training pairs.^[1]^[4]

Cross-source consistency and caveats

Most independent press coverage echoes the paper's framing. TechCrunch summarized that Molmo "matches performance of GPT-4o, Gemini 1.5 Pro, and Claude-3.5 Sonnet" while running at "approximately one-tenth the size of competing closed models."^[2] DeepLearning.AI's The Batch reported that Molmo-72B "outperformed Gemini 1.5 Pro and Claude 3.5 Sonnet on academic tests and certain vision benchmarks," that 7B variants "performed between GPT-4V and GPT-4o," and that MolmoE-1B "nearly matched GPT-4V's capabilities."^[12]

Nathan Lambert's comparison with Llama 3.2 Vision noted that Molmo outperformed Llama 3.2 Vision by roughly +1 to +4 points across MathVista, ChartQA, AI2D, and DocVQA, while Llama 3.2 Vision led on MMLU (text-only reasoning) by about 6 points.^[13] Lambert's conclusion that Llama 3.2 Vision is "a better text model" while Molmo is "a better image model" is consistent with Molmo's explicit choice to omit large-scale text-only instruction tuning and RLHF.^[13]

Later open vision-language releases from 2024 and 2025, including Qwen2.5-VL-72B-Instruct (reported around 70.2 on MMMU val) and InternVL3-78B (reported around 72.2 on MMMU), have since exceeded Molmo's scores on the reasoning-heavy MMMU and MathVista benchmarks, though typically with training pipelines that include synthetic data, large-scale text RLHF, or substantially more parameters.^[18]

Comparison with contemporary VLMs

The table below summarizes how Molmo's release positioning differs from the most-discussed contemporary open and closed VLMs.

Model family	Weights	Training data released?	Synthetic data from proprietary VLMs?	Reported score (MMMU val or 11-benchmark suite average)
Molmo-72B (Ai2, Sep 2024)	Open, Apache 2.0^[4]	Yes (PixMo, ODC-BY-1.0)^[17]	No (audio receipts; human-only)^[11]	MMMU 54.1^[14]; suite average 81.2^[4]
Llama 3.2 Vision 11B/90B (Meta, Sep 2024)	Open, Llama 3.2 Community License^[13]	No^[13]	Not fully disclosed^[13]	Not directly reported in Molmo paper
Qwen2.5-VL-72B-Instruct (Alibaba, 2025)	Open weights	Partial	Yes (mixed)	MMMU ~70.2^[18]
GPT-4o (OpenAI)	Closed	No	n/a	Suite average 78.5^[4]
Claude 3.5 Sonnet (Anthropic)	Closed	No	n/a	Suite average 76.7^[3]
Gemini 1.5 Pro (Google)	Closed	No	n/a	Suite average 78.3^[4]
LLaVA OneVision 7B	Open	Partial	Yes (uses GPT-4V outputs)	Suite average 72.0^[3]
Pixtral 12B (Mistral, 2024)	Open weights	Partial	Not disclosed	Suite average 69.5^[6]

What is distinctive about Molmo's release, on the dimensions tracked above, is the combination of: a fully permissive Apache 2.0 weights license, a fully released training data set, and a credible audit trail (audio receipts) that no proprietary VLM was queried at any point during data construction.^[1]^[11]^[17]

Software ecosystem and deployment

All four Molmo-0924 checkpoints are hosted on Hugging Face and load through AutoModelForCausalLM and AutoProcessor with trust_remote_code=True, with bfloat16 autocast as the default precision.^[3]^[4] The official model cards document two image preprocessing gotchas: images must be in RGB mode (transparent PNGs and other modes degrade quality), and a recommended workaround composites RGBA inputs onto a white or black background chosen by average brightness.^[3]
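
The RGBA workaround can be illustrated without any imaging library. The sketch below works on a flat list of (R, G, B, A) tuples; the luma weights, the threshold of 127, and the choice of a black background for bright images (white for dark ones) are assumptions made for this example, so consult the model card's own sample code for the recommended behavior.

```python
def average_brightness(pixels):
    """Mean luma (0-255) over RGBA pixels, weighting each pixel by its alpha."""
    total = 0.0
    weight = 0.0
    for r, g, b, a in pixels:
        luma = 0.299 * r + 0.587 * g + 0.114 * b
        total += luma * a
        weight += a
    return total / weight if weight else 0.0


def composite_on_background(pixels, threshold=127):
    """Flatten RGBA pixels to RGB over a solid black or white background."""
    # Assumption: bright foregrounds get a black background, dark ones white,
    # so that the visible content keeps its contrast.
    bg = 0 if average_brightness(pixels) > threshold else 255
    flattened = []
    for r, g, b, a in pixels:
        alpha = a / 255.0
        flattened.append(tuple(round(c * alpha + bg * (1.0 - alpha)) for c in (r, g, b)))
    return flattened


dark_logo = [(10, 10, 10, 255), (0, 0, 0, 0)]  # one opaque dark pixel, one transparent
print(composite_on_background(dark_logo))      # [(10, 10, 10), (255, 255, 255)]
```

Weighting brightness by alpha keeps fully transparent pixels, whose stored color is often arbitrary, from skewing the background choice.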

Beyond Hugging Face Transformers, the model cards note official support for vLLM (version 0.7.2 or earlier due to a preprocessing regression in newer releases), SGLang, and Docker Model Runner, with community quantizations available for llama.cpp, Ollama, LM Studio, and Jan.^[3] The training and evaluation codebase at github.com/allenai/molmo is open-source and includes scripts/train.py, helper launchers train_captioner.py and train_multitask_model.py, an eval_downstream.py evaluation runner, and a scripts/download.py data fetcher; MolmoE-1B additionally requires a megablocks fork for sparse MoE training.^[19]

A live demo at molmo.allenai.org accepts image uploads from desktop or mobile browsers and exposes both natural-language answers and the pointing visualization layer.^[1]^[2]

Adoption and downstream work

Within months of release, Molmo's pointing capability became a building block for several agent-oriented systems and for a wave of follow-up Ai2 releases. Molmo's appeal for robotics researchers came from its ability to emit pixel-accurate waypoints that downstream controllers can convert into manipulation targets, and for browser-automation researchers from its ability to identify clickable UI elements without inspecting page source.^[1]^[2]

Ai2 itself extended the Molmo program with several successor projects:

MolmoAct, a robotics-focused model that Ai2 described as "thinking in 3D," targeting manipulation and embodied tasks.^[20]
MolmoPoint, focused specifically on improving the pointing architecture with dedicated grounding tokens.^[20]
MolmoWeb, an open-weight visual web agent released in 2025 alongside the MolmoWebMix dataset of 30,000 human task trajectories across more than 1,100 websites and 2.2 million screenshot-QA pairs.^[21]
Molmo 2, announced in December 2025, which extends the model family to multi-image and video inputs and adds object and action tracking, while preserving the open-weights, open-data philosophy of the original.^[22]

Outside Ai2, Molmo is among the most-downloaded open VLMs on Hugging Face and is widely used as a baseline in subsequent vision-language papers that need an open reference model whose training data provenance is fully known.^[3]^[4]

Limitations and criticisms

The paper and accompanying model cards explicitly catalogue several limitations:^[3]^[11]

Counting beyond 40 instances. To keep output sequences tractable, training data omits pointing labels for images with more than 40 instances; very dense scenes can confuse the model into over- or under-counting. The authors flag this as a planned fix.^[11]
Counting vs pointing confusion. Because "how many" questions in the supervised data are not always paired with explicit point labels, Molmo sometimes answers counting questions in plain language without producing the corresponding points; heuristic prompt rewriting is the documented workaround.^[11]
Image-mode sensitivity. Non-RGB inputs (notably transparent PNGs) produce noticeably worse outputs unless composited onto a solid background, a behavior the official model card documents with sample code.^[3]
vLLM compatibility ceiling. The custom preprocessing code requires vLLM 0.7.2 or earlier; later vLLM releases have a preprocessing bug that Ai2 directs users to avoid.^[3]
Limited text-only reasoning. Because Molmo skipped large-scale text-only instruction tuning and RLHF, it underperforms text-tuned models on reasoning-heavy or multi-turn-chat-style evaluations; Nathan Lambert observed that this "has a very different vibe than people are used to" for conversational reasoning tasks.^[13]
"Open source" definitional ambiguity. Because Molmo is fine-tuned from non-open base LLMs (in particular Qwen2 for the 72B and 7B-D models), it would not qualify as "open source" under the strictest Open Source Initiative-style definitions, although it remains "by far the closest vision model to being so," in Lambert's assessment.^[13] The 7B-O variant, which sits on Ai2's own OLMo backbone, comes closest to a fully open-source stack.^[5]^[13]

More broadly, third-party safety research on multimodal models has repeatedly shown that visual inputs can bypass text-level safety filters in VLMs, and Molmo is not exempt from this class of issue. Ai2 distributes the weights under its Responsible Use Guidelines and labels them for research and educational use.^[3]^[4]^[6]

Reception and significance

Coverage in late September 2024 framed Molmo as evidence that high-quality, human-curated data could substitute for both scale and distillation. Farhadi's "open is equal to closed, and small is now equal to big" framing was widely repeated;^[2] DeepLearning.AI's The Batch summarized the upshot as a demonstration that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources";^[12] and MarkTechPost called Molmo a release that "ranks second on human evaluation, just slightly behind GPT-4o" while being entirely open.^[15]

The longer-term impact is reflected in the paper's academic reception. The Molmo and PixMo paper (Deitke et al.) was published at CVPR 2025 and received a Best Paper Honorable Mention,^[8] one of the more visible recognitions given to vision-language work that year. The release is now commonly cited as evidence that the open-VLM community can produce competitive frontier-class models without distilling closed systems, provided it invests heavily in original data collection.^[7]^[8]

See also

Allen Institute for AI: the non-profit research lab that produced Molmo, PixMo, OLMo, OLMoE, Dolma, and Tulu 3.
OLMo and OLMoE: the open language model and Mixture-of-Experts language model that serve as Ai2-native backbones for Molmo-7B-O and MolmoE-1B respectively.^[5]^[6]
CLIP: OpenAI's contrastive image-text encoder, used unchanged as Molmo's vision tower at 336 px input resolution.^[3]^[4]
Mixture of Experts: the sparse-routing architecture used in MolmoE-1B via the OLMoE backbone.^[6]
LLaVA: an earlier open VLM family compared against Molmo on academic benchmarks and human-preference Elo.^[3]
Pixtral: Mistral's contemporary open-weight VLM, included in Molmo's benchmark tables for comparison.^[6]
Gemini 1.5, GPT-4o, and Claude 3.5 Sonnet: closed proprietary VLMs whose academic and human-preference performance Molmo is benchmarked against.^[1]^[4]
Llama 3.2: Meta's vision-language family released the same week as Molmo; widely compared with Molmo in the open-VLM ecosystem discussion.^[13]
MMMU and MathVista: two of the eleven benchmarks used to evaluate Molmo.^[14]
Knowledge Distillation and Synthetic data: the practices Molmo deliberately avoided in its training pipeline.^[7]^[11]
Grounding: the broader research area to which Molmo's pointing capability contributes.^[1]
Chatbot Arena: the open chatbot-evaluation platform whose pairwise-preference methodology inspired Molmo's larger image-VLM human-preference study.^[11]

References

1. Allen Institute for AI, "Molmo: A family of open state-of-the-art multimodal AI models", Ai2 Blog, 2024-09-25. https://allenai.org/blog/molmo. Accessed 2026-05-20.
2. Devin Coldewey, "Ai2's Molmo shows open source can meet, and beat, closed multimodal models", TechCrunch, 2024-09-25. https://techcrunch.com/2024/09/25/ai2s-molmo-shows-open-source-can-meet-and-beat-closed-multimodal-models/. Accessed 2026-05-20.
3. Allen Institute for AI, "allenai/Molmo-7B-D-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-7B-D-0924. Accessed 2026-05-20.
4. Allen Institute for AI, "allenai/Molmo-72B-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-72B-0924. Accessed 2026-05-20.
5. Allen Institute for AI, "allenai/Molmo-7B-O-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-7B-O-0924. Accessed 2026-05-20.
6. Allen Institute for AI, "allenai/MolmoE-1B-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/MolmoE-1B-0924. Accessed 2026-05-20.
7. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models", arXiv:2409.17146, 2024-09-25 (v1) / 2024-12-05 (v2). https://arxiv.org/abs/2409.17146. Accessed 2026-05-20.
8. Matt Deitke, post announcing Molmo's Best Paper Honorable Mention at CVPR 2025, X (formerly Twitter), 2025-06-13. https://x.com/mattdeitke/status/1933543308510347544. Accessed 2026-05-20.
9. Allen Institute for AI, "Pixmo dataset collection page", Hugging Face, 2024. https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b. Accessed 2026-05-20.
10. OpenCV, "MOLMO: A Powerful Open-Source Vision-Language Model", OpenCV.org, 2024. https://opencv.org/molmo-vlm/. Accessed 2026-05-20.
11. Matt Deitke et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models (HTML version)", arXiv:2409.17146v2, 2024-12-05. https://arxiv.org/html/2409.17146v2. Accessed 2026-05-20.
12. DeepLearning.AI, "Data Points: Molmo's impressive open multimodal models", The Batch, 2024-10. https://www.deeplearning.ai/the-batch/molmos-impressive-open-multimodal-models/. Accessed 2026-05-20.
13. Nathan Lambert, "Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem", Interconnects, 2024-09-26. https://www.interconnects.ai/p/molmo-and-llama-3-vision. Accessed 2026-05-20.
14. Matt Deitke et al., "Molmo and PixMo (paper PDF, Tables 1 and 8)", arXiv:2409.17146, 2024. https://arxiv.org/pdf/2409.17146. Accessed 2026-05-20.
15. MarkTechPost, "Allen Institute for Artificial Intelligence (Ai2) Releases Molmo: A Family of Open-Source Multimodal Language Models", MarkTechPost, 2024-09-26. https://www.marktechpost.com/2024/09/26/are-small-language-models-really-the-future-of-language-models-allen-institute-for-artificial-intelligence-ai2-releases-molmo-a-family-of-open-source-multimodal-language-models/. Accessed 2026-05-20.
16. Ritvik Rastogi, "Papers Explained 241: Pixmo and Molmo", Medium, 2024. https://ritvik19.medium.com/papers-explained-241-pixmo-and-molmo-239d70abebff. Accessed 2026-05-20.
17. Allen Institute for AI, "allenai/pixmo-cap dataset card", Hugging Face, 2024. https://huggingface.co/datasets/allenai/pixmo-cap. Accessed 2026-05-20.
18. Labellerr Team, "Top Vision LLMs Compared: Qwen 2.5-VL vs LLaMA 3.2", Labellerr Blog, 2025. https://www.labellerr.com/blog/qwen-2-5-vl-vs-llama-3-2/. Accessed 2026-05-20.
19. Allen Institute for AI, "allenai/molmo GitHub repository", GitHub, 2024. https://github.com/allenai/molmo. Accessed 2026-05-20.
20. Allen Institute for AI, "MolmoPoint: Better pointing architecture for vision-language models", Ai2 Blog, 2025. https://allenai.org/blog/molmopoint. Accessed 2026-05-20.
21. Sean Michael Kerner, "Ai2 releases MolmoWeb, an open-weight visual web agent with 30K human task trajectories and a full training stack", VentureBeat, 2025. https://venturebeat.com/data/ai2-releases-molmoweb-an-open-weight-visual-web-agent-with-30k-human-task. Accessed 2026-05-20.
22. Allen Institute for AI, "Molmo 2: State-of-the-art video understanding, pointing, and tracking", Ai2 Blog, 2025-12-16. https://allenai.org/blog/molmo2. Accessed 2026-05-20.
