# Molmo

> Source: https://aiwiki.ai/wiki/molmo
> Updated: 2026-06-07
> Categories: AI Models, Multimodal AI, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Molmo is a family of open-weight, open-data vision-language models (VLMs) released by the [Allen Institute for AI](/wiki/allen_institute_for_ai) (Ai2) on 25 September 2024.[^1][^2] The family includes a 72-billion-parameter flagship (Molmo-72B), two 7-billion variants (Molmo-7B-D and Molmo-7B-O), and a [Mixture-of-Experts](/wiki/mixture_of_experts) model with roughly 1.5 billion active parameters (MolmoE-1B).[^3][^4][^5][^6] Unlike most contemporary open VLMs, Molmo was trained entirely without synthetic image-text data distilled from proprietary systems such as [GPT-4o](/wiki/gpt_4o) or [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet); instead, it relied on a new collection of human-curated datasets called PixMo.[^1][^7] Ai2 reported that Molmo-72B outperformed [Gemini](/wiki/gemini) 1.5 Pro, Gemini 1.5 Flash, and Claude 3.5 Sonnet on the average score across eleven academic vision benchmarks and placed second only to GPT-4o in a large pairwise human-preference study.[^1][^4] The paper introducing Molmo, "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models" (arXiv:2409.17146), received a Best Paper Honorable Mention at [CVPR](/wiki/cvpr) 2025.[^7][^8]

## Overview

| Attribute | Detail |
| --- | --- |
| Developer | [Allen Institute for AI](/wiki/allen_institute_for_ai) (Ai2) |
| First release | 25 September 2024 (the "0924" tag in checkpoint names)[^2][^3] |
| Models | Molmo-72B, Molmo-7B-D, Molmo-7B-O, MolmoE-1B (all 0924 release)[^3][^4][^5][^6] |
| License | [Hugging Face](/wiki/hugging_face) weights distributed under Apache 2.0; PixMo datasets under ODC-BY-1.0[^3][^9] |
| Vision encoder | OpenAI [CLIP](/wiki/clip) ViT-L/14 at 336 px input (all variants)[^4][^5][^10] |
| Backbone LLMs | Qwen2-72B (Molmo-72B), Qwen2-7B (Molmo-7B-D), [OLMo](/wiki/olmo)-7B-1024-preview (Molmo-7B-O), [OLMoE](/wiki/olmoe)-1B-7B-0924 (MolmoE-1B)[^4][^5][^6][^10] |
| Training data | ~1 million image-text pairs from PixMo (over 100x fewer than typical open VLMs)[^1][^11] |
| Novel capability | Native 2D "pointing" output for objects in images[^1][^11] |
| Paper | Deitke et al., arXiv:2409.17146 (submitted 25 September 2024; v2 5 December 2024)[^7] |
| Venue | CVPR 2025 (Best Paper Honorable Mention)[^8] |

The release combined model weights on [Hugging Face](/wiki/hugging_face), a hosted demo at molmo.allenai.org, a publicly available training and evaluation codebase on GitHub, and the PixMo datasets, with the explicit goal of supplying the open community with foundational knowledge about how to build performant VLMs from scratch rather than by distilling closed competitors.[^1][^2][^11]

## Background and motivation

By mid-2024 the strongest VLMs were proprietary: OpenAI's GPT-4o and GPT-4V, Anthropic's [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), and Google's [Gemini](/wiki/gemini) 1.5 Pro/Flash dominated the leaderboards. Several openly released VLMs (such as LLaVA-OneVision, [LLaVA](/wiki/llava), and various Qwen-VL derivatives) had narrowed the gap on academic benchmarks, but their published training pipelines almost universally relied on data either generated by, captioned by, or judged by closed models.[^1][^7] In effect, the open community was distilling proprietary systems, which left it without independent knowledge about how to train a competitive VLM from scratch and exposed downstream releases to the terms of service of the closed providers whose outputs they consumed.[^7][^11]

The Molmo project at Ai2 set out explicitly to break that dependency. The team, led by Matt Deitke with co-authors including Christopher Clark, Sangho Lee, Rohun Tripathi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Ranjay Krishna, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi (about 50 authors in total), argued that the bottleneck for performance was not model size or compute but the quality and provenance of training data.[^7] Their thesis was that a modestly sized, carefully curated, fully human-produced dataset, combined with a simple two-stage training pipeline and a standard ViT-plus-LLM architecture, could match or exceed VLMs trained on billions of image-text pairs.[^1][^11] Deitke later described the work as "a long journey over 1.5 years, from failing to get strong performance with massive scale, low quality data, to focusing on modest scale extremely high quality data."[^8]

The announcement was widely covered. TechCrunch quoted Ai2 chief executive Ali Farhadi summarizing the framing as "open is equal to closed, and small is now equal to big."[^2] DeepLearning.AI's The Batch newsletter highlighted Molmo's claim that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources."[^12] Nathan Lambert, writing on Interconnects, positioned Molmo against Meta's Llama 3.2 Vision release in the same week, concluding that "Llama 3.2 V is a better text model, maybe even much better, but Molmo is a better image model."[^13]

## Architecture

Molmo follows what the paper calls a "simple, standard design"[^7] with four components that are conceptually similar to most modern VLMs but trained end-to-end with carefully chosen learning rates.

1. **Pre-processor.** The input image is converted into a set of multi-scale, multi-crop tiles so that fine details in high-resolution images remain visible to the fixed-resolution vision tower.[^10][^11]
2. **Vision encoder.** Every Molmo release uses OpenAI's [CLIP](/wiki/clip) ViT-L/14 at 336 px (the `openai/clip-vit-large-patch14-336` checkpoint) to map each tile independently into a sequence of vision tokens.[^4][^5][^10]
3. **Connector.** A multi-layer perceptron projects vision tokens into the embedding space of the language decoder; the projected tokens are then pooled inside each 2x2 patch window to reduce token count while preserving spatial structure.[^10][^11]
4. **Language decoder.** A decoder-only [Transformer](/wiki/transformer) LLM consumes the interleaved image and text tokens. Each Molmo variant pairs the same vision tower and connector with a different LLM (see below).[^4][^5][^6][^10]

There is no separately learned "perceiver resampler" or Q-former; the connector is intentionally minimal and the pooling is the only learned reduction.[^11] Training is a two-stage pipeline that the authors describe as "streamlined" because it avoids the freeze/unfreeze schedules common to other VLMs.[^10][^11] The first stage is multimodal pre-training on PixMo-Cap (dense captioning), and the second stage is supervised fine-tuning on a mixture of the remaining PixMo subsets together with a handful of widely used academic datasets.[^7][^11]
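
The effect of the 2x2 pooling on sequence length can be checked with a little arithmetic. The sketch below is illustrative only: it uses the 336 px input and 14 px patch size of CLIP ViT-L/14 from the text, while `num_tiles` is a hypothetical parameter, and it ignores any special or separator tokens and any overlap between crops that the real pre-processor may add.

```python
# Rough vision-token count per tile, before and after 2x2 pooling.
IMAGE_SIZE = 336  # CLIP ViT-L/14 input resolution in pixels
PATCH_SIZE = 14   # side length of one ViT patch in pixels
POOL = 2          # the connector pools each 2x2 patch window


def tokens_per_tile(image_size=IMAGE_SIZE, patch_size=PATCH_SIZE, pool=POOL):
    """Return (patch tokens, pooled tokens) for a single square tile."""
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    pooled_per_side = patches_per_side // pool   # 24 // 2 = 12
    return patches_per_side ** 2, pooled_per_side ** 2


def tokens_per_image(num_tiles):
    """Pooled vision tokens for an image split into num_tiles tiles.

    num_tiles is a hypothetical parameter; the actual tiling policy is
    not specified here.
    """
    _, pooled = tokens_per_tile()
    return num_tiles * pooled
```

Under these assumptions each tile yields 576 patch tokens, which pooling reduces to 144, a fourfold cut in the number of vision tokens the language decoder has to attend over.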

The optimization recipe reported in the paper uses [AdamW](/wiki/adamw) for four epochs with separate learning rates for each component: 2e-4 for the connector, 6e-6 for the ViT, and 2e-5 for the LLM, with a cosine schedule decaying to 10% of peak.[^14] The connector uses a 200-step linear warmup, while the ViT and LLM warm up for 2,000 steps; gradients are clipped separately for the encoder, connector, and decoder.[^14] The 72B variant was pre-trained on 128 NVIDIA H100 GPUs in roughly 33.3 hours, for about 4,200 H100-hours.[^14]
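
A minimal sketch of that schedule, assuming linear warmup followed by cosine decay measured from the end of warmup, is shown below. The peak rates, warmup lengths, and 10% floor come from the text; the function name, the `total_steps` argument, and the choice to start the cosine phase after warmup are assumptions rather than details confirmed by the paper.

```python
import math

# Values reported in the text.
PEAK_LR = {"connector": 2e-4, "vit": 6e-6, "llm": 2e-5}
WARMUP_STEPS = {"connector": 200, "vit": 2000, "llm": 2000}
FINAL_FRACTION = 0.10  # cosine schedule decays to 10% of peak


def learning_rate(component, step, total_steps):
    """Learning rate for one component at a given optimizer step.

    total_steps is a hypothetical argument; the real step count depends
    on batch size and dataset size, which are not given here.
    """
    peak = PEAK_LR[component]
    warmup = WARMUP_STEPS[component]
    if step < warmup:
        # Linear warmup from near zero up to the peak rate.
        return peak * (step + 1) / warmup
    # Assumption: cosine progress is measured after warmup ends.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak * FINAL_FRACTION
    return floor + (peak - floor) * cosine


# Compute check for the 72B pre-training run: GPUs x wall-clock hours.
GPU_HOURS = 128 * 33.3  # 4262.4 H100-hours, in line with the rounded figure above
```

Because each component has its own peak rate and warmup length, the freshly initialized connector reaches its full rate within 200 steps while the pretrained ViT and LLM ramp up ten times more slowly.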

### Backbone LLMs and the 0924 family

The "0924" tag on the public checkpoints encodes the release month (September 2024). The four publicly released models pair the common vision tower with different language backbones:[^3][^4][^5][^6]

| Model | Hugging Face ID | LLM backbone | Vision encoder | Notes |
| --- | --- | --- | --- | --- |
| Molmo-72B | allenai/Molmo-72B-0924 | Qwen2-72B (Alibaba) | CLIP ViT-L/14 336 px | Flagship; 73B total parameters[^4] |
| Molmo-7B-D | allenai/Molmo-7B-D-0924 | Qwen2-7B (Alibaba) | CLIP ViT-L/14 336 px | "Demo" model; about 8B parameters[^3] |
| Molmo-7B-O | allenai/Molmo-7B-O-0924 | [OLMo](/wiki/olmo)-7B-1024-preview (Ai2) | CLIP ViT-L/14 336 px | Most "fully open" 7B variant, since both the LLM and the data are open[^5] |
| MolmoE-1B | allenai/MolmoE-1B-0924 | [OLMoE](/wiki/olmoe)-1B-7B-0924 (Ai2) | CLIP ViT-L/14 336 px | [Mixture of Experts](/wiki/mixture_of_experts); ~1.5B active / 7.2B total parameters[^6] |

In addition to the publicly released checkpoints, the paper reports experiments with Mistral 7B, Gemma 2 9B, [Phi-3](/wiki/phi_3) Medium, and other backbones to demonstrate that the recipe generalizes across LLMs.[^7][^11] The 7B-O suffix denotes the variant built on Ai2's own [OLMo](/wiki/olmo) family (and is therefore the most independently reproducible), while 7B-D is the Qwen2-7B-based "demo" model intended for the public web demo.[^5][^15]

All four checkpoints are distributed under Apache 2.0,[^3][^4][^5][^6] which contrasts with the more restrictive Llama 3.2 Community License Agreement that governs Meta's Llama 3.2 Vision models released around the same time.[^13]

## PixMo: data collection without proprietary VLMs

The paper's authors credit the PixMo data as central to "the success of our approach," and treat the model architecture as deliberately conventional so that the new data can be isolated as the source of performance gains.[^7][^11] PixMo is a collection of subsets, each designed for a specific capability and each collected without using outputs from any external VLM.

### PixMo-Cap (dense captions)

Naive crowdsourcing of dense image captions had two failure modes for the Ai2 team. First, asking annotators to type long descriptions yielded short and sparse text because typing is slow and tiring. Second, the team could not verify that annotators were not copy-pasting captions from a publicly available VLM such as GPT-4 or Gemini, which would silently re-introduce distillation.[^11][^16]

The solution was a "modality switching trick": annotators were asked to describe each image by speaking continuously for 60 to 90 seconds, with the audio recording kept as a receipt that no VLM was queried during the task.[^11][^16] The audio was then transcribed and the transcripts were passed to [Claude](/wiki/claude) to be turned into a single polished long-form caption averaging about 200 words.[^16][^17] Claude was used here only as a text-only post-processor over the human-generated transcripts; the visual content described in the caption came from the human annotator, not from a VLM.[^16][^17] Both the cleaned caption and the raw audio transcripts are released publicly so users can audit the pipeline.[^17]

The released PixMo-Cap dataset contains 717,042 images with one or more captions, totalling roughly 1.3 million captions, and is distributed under the ODC-BY-1.0 license.[^17] In an early version of the protocol three annotators described every image; later in the project a single annotator per image with a 90-second minimum was used to reduce cost without measurably hurting downstream performance.[^11]

### PixMo-AskModelAnything (free-form Q&A)

For supervised fine-tuning on free-form visual question answering, the team built PixMo-AskModelAnything, consisting of human-authored question/answer pairs about images. The reported scale is 162,000 QA pairs across 73,000 images.[^11] Annotators wrote both the question and the answer for an image they were shown, again with no VLM in the loop.[^11]

### PixMo-Points (the pointing dataset)

The most distinctive PixMo subset is PixMo-Points, which trains Molmo to output 2D pixel coordinates as part of its natural-language answer. Annotators were asked to point at something in an image, describe it, and then point to each instance of the same thing in the image so the labels exhaustively cover all occurrences. The released collection contains roughly 2.3 million question-point pairs over about 223,000 images.[^11][^14] To teach the model to refuse when a queried object is absent, the data also includes "not present" examples.[^14] A secondary "point explanations" pipeline let annotators pass an LLM a list of text-labelled points to use in its answers, so that Molmo would learn to reference points when constructing a textual answer; this added roughly 79,000 point-explanation annotations on 14,000 images.[^14]

Coordinates are normalized to the 0-100 range so they are independent of input resolution, and points are ordered top-to-bottom, left-to-right with each point numbered, which lets the same data train the model to both point and count.[^14]
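
Downstream code usually has to turn those normalized coordinates back into pixels. The sketch below assumes the model emits XML-like tags such as `<point x="61.5" y="40.4" alt="dog">dog</point>` for a single point and `<points x1="10.0" y1="20.0" x2="50.0" y2="75.0" alt="dogs">dogs</points>` for several; that tag format and all function names here are assumptions for illustration, so check them against real model output before relying on them.

```python
import re

# Assumed tag format: <point x=".." y=".."> or <points x1=".." y1=".." x2=".." ...>
_TAG = re.compile(r"<points?\s([^>]*)>")
_COORD = re.compile(r'\b([xy])(\d*)="(\d+(?:\.\d+)?)"')


def parse_points(text):
    """Extract (x, y) pairs in the normalized 0-100 space from model output."""
    points = []
    for attrs in _TAG.findall(text):
        xs, ys = {}, {}
        for axis, index, value in _COORD.findall(attrs):
            key = int(index) if index else 1  # unnumbered x/y means a single point
            (xs if axis == "x" else ys)[key] = float(value)
        points.extend((xs[k], ys[k]) for k in sorted(xs) if k in ys)
    return points


def to_pixels(points, width, height):
    """Scale normalized points to pixel coordinates for a given image size."""
    return [(round(x / 100 * width), round(y / 100 * height)) for x, y in points]


def reading_order(points):
    """Sort points top-to-bottom, then left-to-right."""
    return sorted(points, key=lambda p: (p[1], p[0]))
```

For the two-point example string above and a 640x480 image, `to_pixels(parse_points(text), 640, 480)` gives `[(64, 96), (320, 360)]`, and the length of the parsed list doubles as the count.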

### Additional PixMo subsets

The full PixMo collection released alongside Molmo includes several other subsets used in supervised fine-tuning:[^11][^16]

- **PixMo-CapQA**: about 214,000 QA pairs generated from the long captions to produce caption-grounded Q&A data.
- **PixMo-Docs**: about 255,000 synthetic document and chart images for OCR and document-VQA training.
- **PixMo-Clocks**: about 826,000 synthetic images of analog clocks for time-reading training.
- **PixMo-Count**: counting-task examples derived from the point annotations and used as both a training source and an internal benchmark.

The total fine-tuning mixture (PixMo subsets plus standard academic supervised datasets) is described in the paper as roughly one million image-text pairs, which the Ai2 blog notes is "three orders of magnitude fewer" than the volume used by some competing approaches.[^1][^11]

## Pointing capability

Pointing is the most-discussed novel feature of Molmo. Where GPT-4V, [Claude](/wiki/claude), and [Gemini](/wiki/gemini) respond to an image only in natural language, Molmo can interleave its prose answer with structured point coordinates that the front end can render directly on the image.[^1][^13] Asked to count the dogs in a photo, for example, Molmo will output one dot per dog face; asked about tongues, it will output one dot per visible tongue.[^2]

Because pointing produces machine-readable spatial references, Ai2 frames it as a [grounding](/wiki/grounding) primitive for downstream agents. The blog notes that "a robot could query a pointing-enabled VLM for a waypoint or the location of an object to pick up, or a web agent could query the VLM for the location of a user interface element to click."[^1] TechCrunch's coverage emphasized the implication for [robotics](/wiki/robotics) and for browser automation: Molmo can navigate web interfaces without ever inspecting the underlying HTML, because it can look at a screenshot and point at the right button.[^2]

The released paper notes two known failure modes of the pointing system. First, because counting and pointing can produce very long output sequences, Molmo's training data was capped at 40 counts per image to control memory usage; the authors flag the cap as a planned future improvement.[^11] Second, the model sometimes fails to point on counting-style questions because "how many" prompts in the supervised data are not always paired with point labels; the authors describe a heuristic for detecting such questions and prefixing them with explicit instructions when point output is desired.[^11]

## Performance and evaluation

Ai2 evaluated Molmo on eleven academic vision-language benchmarks and on a large pairwise human-preference study, comparing against both the strongest closed models (GPT-4o, GPT-4V, [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), [Gemini](/wiki/gemini) 1.5 Pro and Flash) and several open VLMs ([LLaVA](/wiki/llava) OneVision 7B, [Pixtral](/wiki/pixtral) 12B, Qwen VL2 7B and 72B, among others).[^4][^6][^11]

### Academic benchmarks

The eleven benchmarks used were AI2D, ChartQA, VQA v2.0, DocVQA, InfographicVQA, TextVQA, RealWorldQA, [MMMU](/wiki/mmmu), [MathVista](/wiki/mathvista), CountBenchQA, and PixMo-Count (sometimes referred to as Flickr Count in the model cards).[^3][^4][^14] Selected per-benchmark scores reported in the paper are summarized below.[^14]

| Benchmark | MolmoE-1B | Molmo-7B-O | Molmo-7B-D | Molmo-72B |
| --- | --- | --- | --- | --- |
| AI2D | 86.4 | 90.7 | 93.2 | 96.3 |
| ChartQA | 78.0 | 80.4 | 84.1 | 87.3 |
| VQA v2.0 | 83.9 | 85.3 | 85.6 | 86.5 |
| DocVQA | 77.7 | 90.8 | 92.2 | 93.5 |
| InfographicVQA | 53.9 | 70.0 | 72.6 | 81.9 |
| TextVQA | 78.8 | 80.4 | 81.7 | 83.1 |
| RealWorldQA | 60.4 | 67.5 | 70.7 | 75.2 |
| MMMU | 34.9 | 39.3 | 45.3 | 54.1 |
| MathVista | 34.0 | 44.5 | 51.6 | 58.6 |
| CountBenchQA | 87.2 | 89.0 | 88.5 | 91.2 |

Headline averages over the eleven benchmarks reported on the official model cards put Molmo-72B at 81.2, ahead of Qwen VL2 72B at 79.4, GPT-4o at 78.5, and Gemini 1.5 Pro at 78.3.[^4] Molmo-7B-D averages 77.3, edging Claude 3.5 Sonnet at 76.7 and ahead of LLaVA OneVision 7B at 72.0 and GPT-4V at 71.1.[^3] MolmoE-1B averages 68.6, close to Pixtral 12B (69.5) despite using only about 1.5 billion active parameters.[^6]

### Human-preference evaluation

In parallel with the academic benchmarks, Ai2 ran a large pairwise human evaluation across 27 vision-language models. The study collected about 325,000 pairwise preference ratings, which the team noted was roughly three times the number of comparisons behind the vision-model rankings on the LMSYS [Chatbot Arena](/wiki/lmsys_chatbot_arena) at the time. The ratings were fitted with a Bradley-Terry model to produce Elo-style rankings.[^11][^14] Reported Elo scores include:[^3][^4][^14]

| Model | Human-preference Elo |
| --- | --- |
| GPT-4o | 1079 |
| Molmo-72B | 1077 |
| Gemini 1.5 Pro | 1074 |
| Claude 3.5 Sonnet | 1069 |
| Molmo-7B-D | 1056 |
| Molmo-7B-O | 1051 |
| GPT-4V | 1041 |
| MolmoE-1B | 1032 |
| Qwen VL2 7B | 1025 |
| LLaVA OneVision 7B | 1024 |
| Pixtral 12B | 1016 |

Ai2 framed the two evaluations together as follows: Molmo-72B places first on the academic-benchmark average and second on the human-preference Elo, with both results achieved using approximately one million image-text training pairs.[^1][^4]
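
To make the Bradley-Terry step concrete, the sketch below fits strengths from a list of (winner, loser) pairs with the standard minorization-maximization update and maps them onto an Elo-like scale. It is a generic illustration, not Ai2's evaluation code: the function names, the iteration count, and the 400-point scale anchored at 1000 are assumptions, ties are not handled, and every model is assumed to have at least one win.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # MM update: wins divided by expected exposure against each rival.
            denom = sum(
                pair_counts[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        # Rescale so the average strength stays at 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}
    return strength


def to_elo(strength, scale=400.0, base=1000.0):
    """Map strengths to an Elo-like scale; scale and base are conventions."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}
```

With only two models and a 3:1 win record, the fitted strength ratio is 3, which corresponds to a gap of about 191 points on this scale. The narrow spread in the table above (1016 to 1079) therefore implies that even the top and bottom models split their head-to-head comparisons fairly evenly.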

### Cross-source consistency and caveats

Most independent press coverage echoes the paper's framing. TechCrunch summarized that Molmo "matches performance of GPT-4o, Gemini 1.5 Pro, and Claude-3.5 Sonnet" while running at "approximately one-tenth the size of competing closed models."[^2] DeepLearning.AI's The Batch reported that Molmo-72B "outperformed Gemini 1.5 Pro and Claude 3.5 Sonnet on academic tests and certain vision benchmarks," that 7B variants "performed between GPT-4V and GPT-4o," and that MolmoE-1B "nearly matched GPT-4V's capabilities."[^12]

Nathan Lambert's comparison with Llama 3.2 Vision noted that Molmo outperformed Llama 3.2 Vision by roughly +1 to +4 points across [MathVista](/wiki/mathvista), ChartQA, AI2D, and DocVQA, while Llama 3.2 Vision led on MMLU (text-only reasoning) by about 6 points.[^13] Lambert's conclusion that Llama 3.2 Vision is "a better text model" while Molmo is "a better image model" is consistent with Molmo's explicit choice to omit large-scale text-only instruction tuning and RLHF.[^13]

Later open vision-language releases from 2024 and 2025, including Qwen2.5-VL-72B-Instruct (reported around 70.2 on MMMU val) and InternVL3-78B (reported around 72.2 on MMMU), have since exceeded Molmo's scores on reasoning-heavy benchmarks such as MMMU and MathVista, though typically with training pipelines that include synthetic data, large-scale text RLHF, or substantially more parameters.[^18]

## Comparison with contemporary VLMs

The table below summarizes how Molmo's release positioning differs from the most-discussed contemporary open and closed VLMs.

| Model family | Weights | Training data released? | Synthetic data from proprietary VLMs? | Reported MMMU (val) or 11-benchmark suite average |
| --- | --- | --- | --- | --- |
| Molmo-72B (Ai2, Sep 2024) | Open, Apache 2.0[^4] | Yes (PixMo, ODC-BY-1.0)[^17] | No (audio receipts; human-only)[^11] | 54.1[^14] |
| Llama 3.2 Vision 11B/90B (Meta, Sep 2024) | Open, Llama 3.2 Community License[^13] | No[^13] | Not fully disclosed[^13] | Not directly reported in Molmo paper |
| Qwen2.5-VL-72B-Instruct (Alibaba, 2025) | Open weights | Partial | Yes (mixed) | ~70.2[^18] |
| GPT-4o (OpenAI) | Closed | No | n/a | 78.5 average across the 11-benchmark suite[^4] |
| Claude 3.5 Sonnet (Anthropic) | Closed | No | n/a | Suite average 76.7[^3] |
| Gemini 1.5 Pro (Google) | Closed | No | n/a | Suite average 78.3[^4] |
| LLaVA OneVision 7B | Open | Partial | Yes (uses GPT-4V outputs) | Suite average 72.0[^3] |
| Pixtral 12B (Mistral, 2024) | Open weights | Partial | Not disclosed | Suite average 69.5[^6] |

What is distinctive about Molmo's release, on the dimensions tracked above, is the combination of: a fully permissive Apache 2.0 weights license, a fully released training data set, and a credible audit trail (audio receipts) that no proprietary VLM was queried at any point during data construction.[^1][^11][^17]

## Software ecosystem and deployment

All four Molmo-0924 checkpoints are hosted on [Hugging Face](/wiki/hugging_face) and load through `AutoModelForCausalLM` and `AutoProcessor` with `trust_remote_code=True`, with `bfloat16` autocast suggested for more efficient inference.[^3][^4] The official model cards document an image-preprocessing pitfall and its workaround: images must be in RGB mode (transparent PNGs and other modes degrade quality), and RGBA inputs should be composited onto a white or black background chosen by average brightness.[^3]
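
The compositing workaround can be sketched without any imaging library by treating an image as a list of `(r, g, b, a)` tuples. This is a dependency-free illustration of the logic, not the model card's code: the function names, the alpha-weighted brightness, the 127 threshold, and the rule that bright content gets a black background (and dark content a white one) are all assumptions, and a real pipeline would normally use an imaging library such as Pillow.

```python
def mean_brightness(pixels):
    """Alpha-weighted average luma (0-255) of RGBA pixels.

    Weighting by alpha is an assumption so that fully transparent
    pixels do not skew the estimate.
    """
    total = weight = 0.0
    for r, g, b, a in pixels:
        alpha = a / 255
        total += alpha * (0.299 * r + 0.587 * g + 0.114 * b)
        weight += alpha
    return total / weight if weight else 0.0


def choose_background(pixels, threshold=127):
    """Pick black for bright content and white for dark content (assumed rule)."""
    return (0, 0, 0) if mean_brightness(pixels) > threshold else (255, 255, 255)


def composite_to_rgb(pixels):
    """Alpha-blend RGBA pixels onto the chosen solid background."""
    background = choose_background(pixels)
    result = []
    for r, g, b, a in pixels:
        alpha = a / 255
        result.append(
            tuple(
                round(alpha * fg + (1 - alpha) * bg)
                for fg, bg in zip((r, g, b), background)
            )
        )
    return result
```

Choosing the background to contrast with the visible content keeps a transparent logo or diagram legible after the alpha channel is dropped, which is the situation the RGB-mode warning is meant to address.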

Beyond [Hugging Face Transformers](/wiki/transformers_library), the model cards note official support for [vLLM](/wiki/vllm) (version 0.7.2 or earlier due to a preprocessing regression in newer releases), SGLang, and Docker Model Runner, with community quantizations available for llama.cpp, Ollama, LM Studio, and Jan.[^3] The training and evaluation codebase at github.com/allenai/molmo is open-source and includes `scripts/train.py`, helper launchers `train_captioner.py` and `train_multitask_model.py`, an `eval_downstream.py` evaluation runner, and a `scripts/download.py` data fetcher; MolmoE-1B additionally requires a `megablocks` fork for sparse [MoE](/wiki/mixture_of_experts) training.[^19]

A live demo at molmo.allenai.org accepts image uploads from desktop or mobile browsers and exposes both natural-language answers and the pointing visualization layer.[^1][^2]

## Adoption and downstream work

Within months of release, Molmo's pointing capability became a building block for several agent-oriented systems and for a wave of follow-up Ai2 releases. Molmo's appeal for [robotics](/wiki/robotics) researchers came from its ability to emit pixel-accurate waypoints that downstream controllers can convert into manipulation targets, and for browser-automation researchers from its ability to identify clickable UI elements without inspecting page source.[^1][^2]

Ai2 itself extended the Molmo program with several successor projects:

- **MolmoAct**, a robotics-focused model that the company described as "thinking in 3D," targeting manipulation and embodied tasks.[^20]
- **MolmoPoint**, focused specifically on improving the pointing architecture with dedicated grounding tokens.[^20]
- **MolmoWeb**, an open-weight visual web agent released in 2025 alongside the MolmoWebMix dataset of 30,000 human task trajectories across more than 1,100 websites and 2.2 million screenshot-QA pairs.[^21]
- **Molmo 2**, announced in December 2025, which extends the model family to multi-image and video inputs and adds object and action tracking, while preserving the open-weights, open-data philosophy of the original.[^22]

Outside Ai2, Molmo is among the most-downloaded open VLMs on Hugging Face and is widely used as a baseline in subsequent vision-language papers that need an open reference model whose training data provenance is fully known.[^3][^4]

## Limitations and criticisms

The paper and accompanying model cards explicitly catalogue several limitations:[^3][^11]

- **Counting beyond 40 instances.** To keep output sequences tractable, training data omits pointing labels for images with more than 40 instances; very dense scenes can confuse the model into over- or under-counting. The authors flag this as a planned fix.[^11]
- **Counting vs pointing confusion.** Because "how many" questions in the supervised data are not always paired with explicit point labels, Molmo sometimes answers counting questions in plain language without producing the corresponding points; heuristic prompt rewriting is the documented workaround.[^11]
- **Image-mode sensitivity.** Non-RGB inputs (notably transparent PNGs) produce noticeably worse outputs unless composited onto a solid background, a behavior the official model card documents with sample code.[^3]
- **vLLM compatibility ceiling.** The custom preprocessing code requires vLLM 0.7.2 or earlier; later vLLM releases have a preprocessing bug that Ai2 directs users to avoid.[^3]
- **Limited text-only reasoning.** Because Molmo skipped large-scale text-only instruction tuning and [RLHF](/wiki/rlhf), it underperforms text-tuned models on reasoning-heavy or multi-turn-chat-style evaluations; Nathan Lambert observed that this "has a very different vibe than people are used to" for conversational reasoning tasks.[^13]
- **"Open source" definitional ambiguity.** Because Molmo is fine-tuned from base LLMs that are not themselves fully open (in particular Qwen2 for the 72B and 7B-D models, whose training data is not released), it would not qualify as "open source" under the strictest Open Source Initiative-style definitions, although it remains "by far the closest vision model to being so," in Lambert's assessment.[^13] The 7B-O variant, which sits on Ai2's own [OLMo](/wiki/olmo) backbone, comes closest to a fully open-source stack.[^5][^13]

More broadly, third-party safety research on multimodal models has repeatedly shown that visual inputs can bypass text-level safety filters in VLMs, and Molmo is not exempt from this class of issue. Ai2 distributes the weights under its Responsible Use Guidelines and labels them for research and educational use.[^3][^4][^6]

## Reception and significance

Coverage in late September 2024 framed Molmo as evidence that high-quality, human-curated data could substitute for both scale and distillation. Farhadi's "open is equal to closed, and small is now equal to big" framing was widely repeated;[^2] DeepLearning.AI's The Batch summarized the upshot as a demonstration that "vision models trained on fully open, high-quality datasets can compete with closed systems built using massive computational resources";[^12] and MarkTechPost called Molmo a release that "ranks second on human evaluation, just slightly behind GPT-4o" while being entirely open.[^15]

The longer-term impact is reflected in the paper's reception at academic venues. The Molmo and PixMo paper was published at CVPR 2025 (Deitke et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models") and received a Best Paper Honorable Mention,[^8] a prominent recognition for vision-language work at that year's conference. The release is now commonly cited as evidence that the open-VLM community can produce competitive frontier-class models without distilling closed systems, provided it invests heavily in original data collection.[^7][^8]

## Related work

- [Allen Institute for AI](/wiki/allen_institute_for_ai): the non-profit research lab that produced Molmo, PixMo, [OLMo](/wiki/olmo), [OLMoE](/wiki/olmoe), [Dolma](/wiki/dolma), and [Tülu 3](/wiki/tulu_3).
- [OLMo](/wiki/olmo) and [OLMoE](/wiki/olmoe): the open language model and Mixture-of-Experts language model that serve as Ai2-native backbones for Molmo-7B-O and MolmoE-1B respectively.[^5][^6]
- [CLIP](/wiki/clip): OpenAI's contrastive image-text encoder, used unchanged as Molmo's vision tower at 336 px input resolution.[^3][^4]
- [Mixture of Experts](/wiki/mixture_of_experts): the sparse-routing architecture used in MolmoE-1B via the OLMoE backbone.[^6]
- [LLaVA](/wiki/llava): an earlier open VLM family compared against Molmo on academic benchmarks and human-preference Elo.[^3]
- [Pixtral](/wiki/pixtral): Mistral's contemporary open-weight VLM, included in Molmo's benchmark tables for comparison.[^6]
- [Gemini](/wiki/gemini) 1.5, [GPT-4o](/wiki/gpt_4o), and [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet): closed proprietary VLMs whose academic and human-preference performance Molmo is benchmarked against.[^1][^4]
- [Llama 3.2](/wiki/llama_3_2): Meta's vision-language family released the same week as Molmo; widely compared with Molmo in the open-VLM ecosystem discussion.[^13]
- [MMMU](/wiki/mmmu) and [MathVista](/wiki/mathvista): two of the eleven benchmarks used to evaluate Molmo.[^14]
- [Knowledge Distillation](/wiki/knowledge_distillation) and [Synthetic data](/wiki/synthetic_data): the practices Molmo deliberately avoided in its training pipeline.[^7][^11]
- [Grounding](/wiki/grounding): the broader research area to which Molmo's pointing capability contributes.[^1]
- [Chatbot Arena](/wiki/lmsys_chatbot_arena): the open chatbot-evaluation platform whose pairwise-preference methodology inspired Molmo's larger image-VLM human-preference study.[^11]

## See also

- [Allen Institute for AI](/wiki/allen_institute_for_ai)
- [OLMo](/wiki/olmo)
- [OLMoE](/wiki/olmoe)
- [Dolma](/wiki/dolma)
- [Tülu 3](/wiki/tulu_3)
- [CLIP](/wiki/clip)
- [Vision Transformer](/wiki/vision_transformer)
- [Mixture of Experts](/wiki/mixture_of_experts)
- [LLaVA](/wiki/llava)
- [Pixtral](/wiki/pixtral)
- [Florence-2](/wiki/florence_2)
- [GPT-4o](/wiki/gpt_4o)
- [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet)
- [Gemini](/wiki/gemini)
- [Llama 3.2](/wiki/llama_3_2)
- [MMMU](/wiki/mmmu)
- [MathVista](/wiki/mathvista)
- [Grounding](/wiki/grounding)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [Synthetic data](/wiki/synthetic_data)
- [Instruction Tuning](/wiki/instruction_tuning)
- [Reinforcement Learning from Human Feedback](/wiki/rlhf)
- [Chatbot Arena](/wiki/lmsys_chatbot_arena)
- [AdamW](/wiki/adamw)
- [Hugging Face](/wiki/hugging_face)
- [vLLM](/wiki/vllm)
- [CVPR](/wiki/cvpr)
- [Multimodal Models](/wiki/multimodal_models)

## References

[^1]: Allen Institute for AI, "Molmo: A family of open state-of-the-art multimodal AI models", Ai2 Blog, 2024-09-25. https://allenai.org/blog/molmo. Accessed 2026-05-20.
[^2]: Devin Coldewey, "Ai2's Molmo shows open source can meet, and beat, closed multimodal models", TechCrunch, 2024-09-25. https://techcrunch.com/2024/09/25/ai2s-molmo-shows-open-source-can-meet-and-beat-closed-multimodal-models/. Accessed 2026-05-20.
[^3]: Allen Institute for AI, "allenai/Molmo-7B-D-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-7B-D-0924. Accessed 2026-05-20.
[^4]: Allen Institute for AI, "allenai/Molmo-72B-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-72B-0924. Accessed 2026-05-20.
[^5]: Allen Institute for AI, "allenai/Molmo-7B-O-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/Molmo-7B-O-0924. Accessed 2026-05-20.
[^6]: Allen Institute for AI, "allenai/MolmoE-1B-0924 model card", Hugging Face, 2024-09-25. https://huggingface.co/allenai/MolmoE-1B-0924. Accessed 2026-05-20.
[^7]: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models", arXiv:2409.17146, 2024-09-25 (v1) / 2024-12-05 (v2). https://arxiv.org/abs/2409.17146. Accessed 2026-05-20.
[^8]: Matt Deitke, post announcing Molmo's Best Paper Honorable Mention at CVPR 2025, X (formerly Twitter), 2025-06-13. https://x.com/mattdeitke/status/1933543308510347544. Accessed 2026-05-20.
[^9]: Allen Institute for AI, "Pixmo dataset collection page", Hugging Face, 2024. https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b. Accessed 2026-05-20.
[^10]: OpenCV, "MOLMO: A Powerful Open-Source Vision-Language Model", OpenCV.org, 2024. https://opencv.org/molmo-vlm/. Accessed 2026-05-20.
[^11]: Matt Deitke et al., "Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models (HTML version)", arXiv:2409.17146v2, 2024-12-05. https://arxiv.org/html/2409.17146v2. Accessed 2026-05-20.
[^12]: DeepLearning.AI, "Data Points: Molmo's impressive open multimodal models", The Batch, 2024-10. https://www.deeplearning.ai/the-batch/molmos-impressive-open-multimodal-models/. Accessed 2026-05-20.
[^13]: Nathan Lambert, "Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem", Interconnects, 2024-09-26. https://www.interconnects.ai/p/molmo-and-llama-3-vision. Accessed 2026-05-20.
[^14]: Matt Deitke et al., "Molmo and PixMo (paper PDF, Tables 1 and 8)", arXiv:2409.17146, 2024. https://arxiv.org/pdf/2409.17146. Accessed 2026-05-20.
[^15]: MarkTechPost, "Allen Institute for Artificial Intelligence (Ai2) Releases Molmo: A Family of Open-Source Multimodal Language Models", MarkTechPost, 2024-09-26. https://www.marktechpost.com/2024/09/26/are-small-language-models-really-the-future-of-language-models-allen-institute-for-artificial-intelligence-ai2-releases-molmo-a-family-of-open-source-multimodal-language-models/. Accessed 2026-05-20.
[^16]: Ritvik Rastogi, "Papers Explained 241: Pixmo and Molmo", Medium, 2024. https://ritvik19.medium.com/papers-explained-241-pixmo-and-molmo-239d70abebff. Accessed 2026-05-20.
[^17]: Allen Institute for AI, "allenai/pixmo-cap dataset card", Hugging Face, 2024. https://huggingface.co/datasets/allenai/pixmo-cap. Accessed 2026-05-20.
[^18]: Labellerr Team, "Top Vision LLMs Compared: Qwen 2.5-VL vs LLaMA 3.2", Labellerr Blog, 2025. https://www.labellerr.com/blog/qwen-2-5-vl-vs-llama-3-2/. Accessed 2026-05-20.
[^19]: Allen Institute for AI, "allenai/molmo GitHub repository", GitHub, 2024. https://github.com/allenai/molmo. Accessed 2026-05-20.
[^20]: Allen Institute for AI, "MolmoPoint: Better pointing architecture for vision-language models", Ai2 Blog, 2025. https://allenai.org/blog/molmopoint. Accessed 2026-05-20.
[^21]: Sean Michael Kerner, "Ai2 releases MolmoWeb, an open-weight visual web agent with 30K human task trajectories and a full training stack", VentureBeat, 2025. https://venturebeat.com/data/ai2-releases-molmoweb-an-open-weight-visual-web-agent-with-30k-human-task. Accessed 2026-05-20.
[^22]: Allen Institute for AI, "Molmo 2: State-of-the-art video understanding, pointing, and tracking", Ai2 Blog, 2025-12-16. https://allenai.org/blog/molmo2. Accessed 2026-05-20.