olmOCR
Last reviewed
Jun 7, 2026
Sources
11 citations
Review status
Source-backed
Revision
v3 · 1,917 words
olmOCR is an open toolkit and vision-language model from the Allen Institute for AI (Ai2) that turns PDFs and document images into clean, structured plain text and Markdown. It reads a page the way a person would, recovering the natural reading order while keeping sections, tables, lists, equations, and even handwriting intact. The project ships as an end-to-end pipeline built around a fine-tuned 7B vision-language model, and Ai2 released the model weights, the training data, the inference code, and a benchmark under a permissive license. The first version arrived in February 2025 with the paper "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models" [1][2], and a follow-up, olmOCR 2, landed in October 2025 [3][4].
The name and the motivation both come from the OLMo family of fully open language models. PDFs hold an enormous amount of high-quality text that never makes it into training corpora because the format is hard to parse, so Ai2 framed olmOCR as a way to unlock those tokens for pretraining [1][2].
Most of the world's written knowledge sits in PDFs and scanned images: scientific papers, government filings, books, legal records, manuals, and forms. Turning those documents into clean text is the job of OCR, and the quality of that text directly shapes what a large language model can learn from it. Garbled tables, scrambled column order, and dropped equations all leak into a model as noise.
There are two practical reasons the open part matters. First, the strongest commercial vision models can read documents well, but sending millions of pages through a proprietary API gets expensive fast: the paper puts GPT-4o at over 6,240 USD per million pages [1][2]. Second, many documents simply cannot leave a private network for legal or privacy reasons, which rules out a hosted API regardless of cost. An open model with open weights lets a lab run the whole job on its own hardware.
The same qualities help retrieval-augmented generation. A RAG system is only as good as the text it retrieves from, and a chunk of a financial table that lost its row and column structure will produce wrong answers no matter how good the language model is. Clean, faithfully linearized output is the foundation for both pretraining and retrieval.
There is also a quality gap between approaches. Traditional open-source OCR tools tend to produce noisier extractions than a vision-language model, especially on complex layouts, but the best vision models have until recently been closed and costly to run at volume [1][2]. olmOCR was built to close that gap, giving the extraction quality of a strong vision model in a form that anyone can download and run.
The original olmOCR fine-tunes Qwen2-VL-7B-Instruct, an open vision-language model from Alibaba, into a document reader [1][2]. olmOCR 2 moves the base to Qwen2.5-VL-7B-Instruct [3][4]. In both cases the model takes a rendered image of a page plus a text prompt and returns the page content as structured text.
The idea that sets olmOCR apart is document anchoring. A born-digital PDF already carries a lot of structure in its internal code: the coordinates of text blocks, the positions of images, font sizes, and other layout clues. Instead of throwing that away and treating the page as a flat picture, olmOCR pulls those text fragments and their positions out of the PDF and concatenates them into the prompt alongside the rasterized image [1][2]. The model then has two complementary views of the page, the pixels and the extracted anchor text, and it uses both to reconstruct the correct reading order and content. This helps most on the cases that break naive approaches, like multi-column layouts, sidebars, footnotes, and dense tables.
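A minimal sketch of the anchoring idea follows, assuming the text fragments and their coordinates have already been pulled out of the PDF by some extractor. The `TextFragment` type, the `[XxY]text` line format, the character budget, and the instruction wording are all hypothetical choices made for illustration; they are not the format olmOCR itself uses.

```python
from dataclasses import dataclass


@dataclass
class TextFragment:
    """One text run taken from a PDF's internal structure (hypothetical type)."""
    x: float  # left edge, in PDF points
    y: float  # vertical position, in PDF points
    text: str


def build_anchor_text(fragments, page_width, page_height, max_chars=4000):
    """Serialize fragments and their positions into a prompt-sized text block.

    The line format and the character budget are illustrative assumptions.
    """
    lines = [f"Page dimensions: {page_width:.1f}x{page_height:.1f}"]
    used = len(lines[0])
    for frag in fragments:
        line = f"[{frag.x:.0f}x{frag.y:.0f}]{frag.text.strip()}"
        if used + len(line) + 1 > max_chars:
            break  # stop before the anchor text outgrows the prompt budget
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)


def build_prompt(anchor_text):
    """Pair the anchor text with an instruction; the page image travels separately."""
    return (
        "Transcribe this page in natural reading order. "
        "Text fragments extracted from the PDF, with coordinates, follow.\n"
        + anchor_text
    )


# Two column headings at the same height: the coordinates tell the model
# they belong to different columns even though they share a baseline.
fragments = [
    TextFragment(72, 720, "Left column heading"),
    TextFragment(320, 720, "Right column heading"),
]
prompt = build_prompt(build_anchor_text(fragments, 612, 792))
```

The budget check matters because a dense page can carry far more fragments than fit comfortably beside the image in a single prompt, so some truncation rule is needed whatever the exact format.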
Training the first version used olmOCR-mix-0225, a dataset of roughly 260,000 pages sampled from more than 100,000 PDFs crawled from the open web, deliberately picked to include graphics, handwriting, and poor scans so the model would generalize [1][2]. The labels came from a strong teacher model prompted with document anchoring, so the fine-tuned 7B student learned to match much larger systems at a fraction of the serving cost. olmOCR 2 trains on an updated olmOCR-mix-1025 with about 268,000 pages [3][4].
olmOCR 2 adds a reinforcement learning stage on top of supervised fine-tuning. The team built synthetic documents whose correct output is known, then wrote thousands of small unit tests that each check one verifiable property of an OCR result, such as whether a specific table cell is present or whether an equation is rendered correctly. During training the model generates 28 candidate transcriptions per page, and reinforcement learning with Group Relative Policy Optimization (GRPO) rewards the ones that pass more tests [3][4]. The reward is simply the fraction of unit tests a transcription passes, which gives a clean, checkable training signal instead of a fuzzy similarity score. The released checkpoint, olmOCR-2-7B-1025, is also distributed in an FP8 quantized form for cheaper inference [4].
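The reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not Ai2's training code: the unit tests are modeled as plain Python callables, the three example checks are invented, and the mean and standard deviation normalization is the commonly described GRPO formulation, which may differ in detail from what olmOCR 2 uses.

```python
from statistics import mean, pstdev


def unit_test_reward(transcription, tests):
    """Fraction of unit tests (callables returning bool) the transcription passes."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(transcription)) / len(tests)


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of samples for the same page.

    Assumption: standard mean/std normalization; eps avoids division by zero
    when every sample in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical checks for one synthetic page.
tests = [
    lambda t: "Total revenue" in t,      # a specific table cell is present
    lambda t: "\\frac{a}{b}" in t,       # an equation is rendered as expected
    lambda t: "Page 3 of 10" not in t,   # a running footer was dropped
]

# Three candidates stand in for the 28 sampled per page during training.
candidates = [
    "Total revenue \\frac{a}{b}",
    "Total revenue Page 3 of 10",
    "",
]
rewards = [unit_test_reward(c, tests) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a candidate is pushed up only when it passes more tests than its siblings for the same page, which is why sampling many candidates per page is useful.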
olmOCR is not only a model; it is a batch pipeline meant to grind through huge document collections cheaply. It renders each PDF page to an image, builds the anchor text, runs the model with a high-throughput inference engine, and writes out linearized text. The pipeline is designed to scale from a single GPU to hundreds and to recover from failures partway through a job [1][2].
The headline number from the paper is that olmOCR can convert a million PDF pages for about 176 USD, which the authors put at roughly one thirty-second of the GPT-4o API cost [1][2]. The current toolkit documentation rounds this to under 200 USD per million pages and lists tested hardware including the NVIDIA RTX 4090, L40S, A100, and H100, with a minimum of about 12 GB of GPU memory [5]. That price point is what makes pretraining-scale extraction realistic for an open project rather than only for companies that can absorb a six-figure API bill.
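To make the scale of the difference concrete, the per-million-page figures quoted in this article can be turned into job-level estimates. The sketch below uses only those two figures; the corpus size and the budget are hypothetical inputs, and real costs depend on GPU utilization and current API pricing.

```python
OLMOCR_USD_PER_MILLION = 176    # olmOCR figure quoted from the paper
GPT4O_USD_PER_MILLION = 6_240   # GPT-4o figure quoted from the paper


def job_cost_usd(pages, usd_per_million):
    """Cost of converting `pages` pages at a flat per-million-page rate."""
    return pages / 1_000_000 * usd_per_million


def pages_for_budget(budget_usd, usd_per_million):
    """Number of pages a budget covers at a flat per-million-page rate."""
    return budget_usd / usd_per_million * 1_000_000


corpus_pages = 50_000_000  # hypothetical crawl size

olmocr_cost = job_cost_usd(corpus_pages, OLMOCR_USD_PER_MILLION)
gpt4o_cost = job_cost_usd(corpus_pages, GPT4O_USD_PER_MILLION)

# How many pages it takes for the API route to reach a six-figure bill.
six_figure_pages = pages_for_budget(100_000, GPT4O_USD_PER_MILLION)

print(f"olmOCR: {olmocr_cost:,.0f} USD, GPT-4o API: {gpt4o_cost:,.0f} USD")
print(f"Pages for a 100,000 USD API bill: {six_figure_pages:,.0f}")
```

At the hypothetical 50 million pages, the quoted rates work out to 8,800 USD for olmOCR against 312,000 USD for the API route, and the API bill crosses 100,000 USD at roughly 16 million pages.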
Getting there cheaply depends on keeping the GPU busy. A 7B model is small enough to serve many pages in parallel, and document anchoring lets the model spend its capacity on reconstruction rather than on rediscovering layout it could have read from the PDF. The toolkit is distributed as a Python package so a job can be launched on a single workstation for a small collection or fanned out across a cluster for a web-scale crawl [5][11]. Because the work is just a stream of independent pages, the pipeline parallelizes cleanly and can pick up where it left off after an interruption.
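The resume-after-interruption behavior can be illustrated with a small sketch. It is not the toolkit's actual implementation: `run_job`, `load_done`, the JSONL manifest, and the `process_page` callback are hypothetical names, and a thread pool stands in for whatever scheduler and inference engine a real deployment would use.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def load_done(manifest_path):
    """Return the ids of pages that finished in an earlier run."""
    path = Path(manifest_path)
    if not path.exists():
        return set()
    return {
        json.loads(line)["page_id"]
        for line in path.read_text().splitlines()
        if line
    }


def run_job(page_ids, process_page, manifest_path, workers=4):
    """Process pages independently and append each result to a JSONL manifest.

    Pages already recorded in the manifest are skipped, so rerunning the same
    job after a crash picks up where it left off. Returns the number of pages
    attempted in this run.
    """
    done = load_done(manifest_path)
    todo = [p for p in page_ids if p not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(manifest_path, "a") as out:
        futures = {pool.submit(process_page, p): p for p in todo}
        for fut in as_completed(futures):
            page_id = futures[fut]
            try:
                text = fut.result()
            except Exception:
                continue  # leave failed pages unrecorded so a rerun retries them
            # Only this loop writes to the manifest, so no lock is needed.
            out.write(json.dumps({"page_id": page_id, "text": text}) + "\n")
            out.flush()
    return len(todo)
```

Recording a page only after it succeeds is the design choice that makes recovery simple: the manifest is the single source of truth for what is finished, and nothing else has to be checkpointed.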
To compare systems fairly, Ai2 released olmOCR-Bench, a curated set of about 1,400 PDFs covering content that still trips up strong tools: formulas, tables, tiny fonts, old scans, multi-column pages, and headers and footers [1][6]. Rather than scoring against a single reference transcription, the benchmark uses more than 7,000 unit tests, each a pass or fail check on a property the output should have [3][4][6]. A system's score is the percentage of tests it passes, which sidesteps the problem that there are many valid ways to write the same page.
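The pass or fail scoring style can be sketched as follows. The three check types (`present`, `absent`, `ordered`) and the sample documents are invented for illustration and are not the benchmark's own test definitions; the scoring follows the simple percentage described above.

```python
def present(snippet):
    """Pass if the snippet appears somewhere in the output."""
    return lambda text: snippet in text


def absent(snippet):
    """Pass if the snippet (for example a running header) was left out."""
    return lambda text: snippet not in text


def ordered(first, second):
    """Pass only if both snippets appear and `first` comes before `second`."""
    def check(text):
        i, j = text.find(first), text.find(second)
        return i != -1 and j != -1 and i < j
    return check


def bench_score(outputs, tests_by_doc):
    """Percentage of all unit tests passed across every document."""
    passed = total = 0
    for doc_id, tests in tests_by_doc.items():
        text = outputs.get(doc_id, "")  # a missing output fails presence checks
        for test in tests:
            total += 1
            passed += bool(test(text))
    return 100.0 * passed / total if total else 0.0


# Hypothetical documents and checks.
tests_by_doc = {
    "paper.pdf": [
        present("Abstract"),
        ordered("Abstract", "1 Introduction"),
        absent("Preprint. Under review."),
    ],
    "form.pdf": [present("Date of birth")],
}
outputs = {
    "paper.pdf": "Abstract ... 1 Introduction ...",
    "form.pdf": "Name: ...",
}
score = bench_score(outputs, tests_by_doc)
```

None of these checks compares the output with a full reference transcription, so two systems that format the same page differently can both pass as long as the checked properties hold.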
The original olmOCR paper reported strong results against GPT-4o, Gemini Flash 2, and Qwen2.5-VL, and a human preference study gave olmOCR an Elo rating above 1800, ahead of Marker, MinerU, and GOT-OCR 2.0 [1][2]. The olmOCR 2 paper re-ran the updated benchmark across a wider field. The table below shows overall olmOCR-Bench scores as published in the olmOCR 2 work [3][4]. The benchmark and its leaderboard are updated over time, so newer entries and revised numbers appear in the project repository.
| System | olmOCR-Bench overall |
|---|---|
| olmOCR 2 (olmOCR-2-7B-1025) | 82.4 |
| dots.ocr | 79.1 |
| Marker 1.10.1 | 76.1 |
| DeepSeek-OCR | 75.7 |
| MinerU 2.5.4 | 75.2 |
| Mistral OCR API | 72.0 |
| GPT-4o | 68.9 |
| olmOCR v1 | 68.2 |
| Qwen2.5-VL-7B | 65.5 |
| Gemini Flash 2 | 57.8 |
The jump from the first olmOCR to olmOCR 2 is 14.2 points overall (68.2 to 82.4), with the largest gains on math formulas, table parsing, and multi-column layouts, which the team credits to the unit-test reward training [3][4]. Among the systems Ai2 measured, olmOCR 2 led general-purpose models like GPT-4o and Gemini and stayed close to the best specialist parsers.
Everything ships under the Apache 2.0 license: the model weights on Hugging Face, the olmOCR-mix datasets, the olmOCR-Bench data, and the training and inference code on GitHub [4][5]. That combination of open weights, open data, and a permissive license is the defining trait of the release and is what lets others reproduce the work, fine-tune on their own documents, or run the pipeline entirely on private infrastructure. It fits the broader open-source AI approach Ai2 takes with its OLMo models, where the goal is a fully inspectable stack rather than just a downloadable checkpoint.
olmOCR arrived during a busy year for document OCR built on vision-language models. Mistral OCR, released by Mistral AI in March 2025, is a hosted API rather than open weights, and it appears in the olmOCR-Bench comparison [3]. DeepSeek-OCR, from October 2025, takes a different angle with its "contexts optical compression" idea, encoding long text as a compact set of vision tokens, and it ships with open weights. dots.ocr, from rednote-hilab, is a single open vision-language model aimed at multilingual document layout parsing, and it scores near the top of olmOCR-Bench. olmOCR's distinguishing choices within this group are document anchoring, the focus on cheap pretraining-scale batch throughput, and the fully open release of data and benchmark, not just weights.
olmOCR targets English-language print documents first, and the olmOCR 2 work describes its state-of-the-art claim in that setting, so its coverage of other languages and scripts is weaker than that of tools built for multilingual parsing [3][4]. The benchmark itself is English-centric. As with any model-based OCR, the system can hallucinate plausible but wrong text on very degraded scans, and it inherits the context-length and resolution limits of its 7B base model, which can hurt very large or very dense pages. The cost figures assume well-utilized GPUs running large batches, so per-page economics look different for small jobs or interactive use. And because the cheapest setup uses an open base model rather than the strongest frontier vision model, the very hardest pages can still favor larger systems. None of these limits is unique to olmOCR, but together they bound where it is the right tool.