Nougat (model)
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,475 words
Nougat (Neural Optical Understanding for Academic Documents) is a document-understanding model from Meta AI that converts the rendered image of a document page into structured markup text. It is a vision transformer trained to read scientific papers, typically scanned pages or PDFs, and emit a lightweight Markdown-style markup that preserves reading order, tables, and mathematical expressions (as LaTeX). Unlike a conventional pipeline that passes each page through a separate optical character recognition (OCR) engine and then tries to reconstruct the layout, Nougat performs the recognition end to end inside one neural network. The work was introduced in the paper "Nougat: Neural Optical Understanding for Academic Documents" by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic, posted to arXiv on 25 August 2023.[1][2] Meta released the model weights and code publicly.[1]
Most scientific knowledge is distributed as PDF files, a format optimized for fixed visual presentation rather than for machine reading. When a document is stored as a PDF, much of its semantic structure is lost, and mathematical notation in particular is difficult to recover because equations are often rendered as positioned glyphs or images rather than as recoverable symbolic content. The paper frames Nougat as a way to bridge human-readable documents and machine-readable text, with an emphasis on making mathematics and other scientific content searchable and reusable.[2] This matters for downstream uses such as building training corpora for language models, indexing equations, and accessibility.
The abstract states the goal directly:
> Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language.[2]
Standard OCR systems detect and transcribe characters or words and return plain text, usually without reconstructing document structure, columns, or mathematics. Layout-aware pipelines add a separate stage to recover headings, paragraphs, and tables, and a dedicated math recognizer for equations. Nougat collapses these stages into a single image-to-sequence model. The paper notes that the model "does not require any OCR related inputs or modules" and that "the text is recognized implicitly by the network," citing prior visual document understanding work showing that "an external OCR engine is not necessarily needed to achieve competitive results."[2] In practice this means Nougat reads a page image and directly produces a structured document, including section headers, inline and display math in LaTeX, and tables, rather than a flat stream of characters.
Nougat is an encoder-decoder transformer built on the Donut architecture, an OCR-free visual document understanding model. The paper states plainly: "We build on the Donut architecture."[2]
The output is a lightweight markup that supports headings, bold and italic text, lists, algorithms, inline and display LaTeX math, and LaTeX tables.[2] The reference implementation writes this as .mmd files in a Mathpix Markdown style, described as mostly compatible with the Mathpix Markdown specification.[1]
Two checkpoints were released, distinguished mainly by decoder depth and maximum output length.[1][2]
| Model | Parameters | Decoder layers | Max sequence length |
|---|---|---|---|
| Nougat small (0.1.0-small) | 250M | 4 | 3584 tokens |
| Nougat base (0.1.0-base) | 350M | 10 | 4096 tokens |
The two models perform almost identically on the paper's benchmark (overall edit distance 0.073 versus 0.071, F1 92.9 versus 93.1), so the smaller one is a reasonable default for many uses.[2]
A central contribution of the work is the construction of a paired dataset, since no large public corpus of page images aligned with structured markup existed. The authors built one by pairing source documents with their rendered pages.[2]
For arXiv, they took the LaTeX source of papers and converted it to HTML using LaTeXML, then transformed that HTML into the target lightweight markup. They rendered the corresponding PDF pages to images and aligned each page image with the markup that belongs to it. Splitting a continuous markup document into the correct per-page segments is the hard part of this alignment. The authors trained a bag-of-words classifier (an SVM over TF-IDF features) to assign markup spans to pages, and kept only the pages where the split was judged reliable, which retained roughly 47% of candidate pages.[2]
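The page-assignment idea can be illustrated with a much simpler stand-in. The sketch below does not use an SVM or TF-IDF weighting; it swaps in plain bag-of-words cosine similarity and assigns each markup paragraph to the page whose extracted text it most resembles. The function names and the toy inputs are hypothetical, and `page_texts` is assumed to be non-empty.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in a string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def assign_pages(paragraphs, page_texts):
    """Return, for each markup paragraph, the index of the most similar page.

    Simplified stand-in for the classifier described above: similarity
    scoring replaces the trained SVM, and no reliability filter is applied.
    """
    page_bags = [bag_of_words(text) for text in page_texts]
    assignments = []
    for paragraph in paragraphs:
        bag = bag_of_words(paragraph)
        scores = [cosine(bag, page_bag) for page_bag in page_bags]
        assignments.append(max(range(len(scores)), key=scores.__getitem__))
    return assignments
```

A filter in the spirit of the one described above could discard pages whose best similarity score falls under a threshold; the threshold value would have to be tuned and is not given in the source.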
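The full training corpus drew on three sources.[2] The shares below are computed against the reported total; the three per-source page counts add up to 8,494,841, more than that total, so the shares sum to slightly over 100%.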
| Source | Pages | Share |
|---|---|---|
| arXiv | 7,511,745 | ~91.5% |
| PubMed Central (PMC) | 536,319 | ~6.5% |
| Industry Documents Library (IDL) | 446,777 | ~5.4% |
| Total | 8,204,754 | 100% |
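
The share column can be reproduced with a few lines of arithmetic. The page counts are copied from the table; the variable names are illustrative. The sketch also sums the per-source counts so they can be compared with the reported total before the figures are reused.

```python
# Page counts as listed in the table above.
pages = {"arXiv": 7_511_745, "PMC": 536_319, "IDL": 446_777}
reported_total = 8_204_754

# Share of each source relative to the reported total, in percent.
shares = {name: 100 * count / reported_total for name, count in pages.items()}

# Sum of the per-source counts, for comparison with the reported total.
summed_total = sum(pages.values())

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
print(f"sum of counts: {summed_total:,} (reported total: {reported_total:,})")
```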
The arXiv portion was derived from about 1.75 million articles.[2] To make the model robust to the appearance of real scanned and printed documents, training images were augmented with transformations such as erosion, dilation, Gaussian noise, Gaussian blur, bitmap conversion, image compression, grid distortion, and elastic transforms.[2]
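Two of the listed augmentations, erosion and dilation, are simple neighbourhood filters. The sketch below is a dependency-free illustration on a grayscale image stored as a list of rows; it is not the pipeline used for training, and the 3×3 window and function names are assumptions made for the example.

```python
def _neighbourhood_filter(image, pick):
    """Apply `pick` (min or max) over each pixel's 3x3 window, clipped at the borders."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(pick(window))
        result.append(row)
    return result


def erode(image):
    """Grayscale erosion: each pixel becomes the darkest value in its window."""
    return _neighbourhood_filter(image, min)


def dilate(image):
    """Grayscale dilation: each pixel becomes the brightest value in its window."""
    return _neighbourhood_filter(image, max)
```

For dark text on a light background, erosion in this sense thickens the strokes and dilation thins them, which mimics heavier or fainter printing.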
Training used the AdamW optimizer with an initial learning rate of 5e-5, decayed to 7.5e-6, an effective batch size of 192, and 3 epochs.[2]
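A decay from 5e-5 down to a floor of 7.5e-6 can be sketched as a stepwise exponential schedule. The two endpoints come from the text above; the decay factor `gamma` and the interval `every` are assumed values chosen for illustration, as is the function name.

```python
def learning_rate(step, lr_init=5e-5, lr_end=7.5e-6, gamma=0.9996, every=15):
    """Stepwise exponential decay with a lower bound.

    The rate is multiplied by `gamma` once per `every` optimizer updates and
    never drops below `lr_end`. `gamma` and `every` are illustrative
    assumptions, not values stated in this article.
    """
    decayed = lr_init * gamma ** (step // every)
    return max(decayed, lr_end)
```

Calling `learning_rate(0)` returns the initial rate, and for a large enough `step` the function returns the floor value.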
Nougat was evaluated on a held-out set of arXiv pages using edit distance (lower is better), BLEU, METEOR, and precision/recall/F1, reported overall and broken out by plain text, mathematical expressions, and tables. The authors compared against extracting the embedded text directly from the PDF and against GROBID, a widely used tool for parsing scholarly PDFs, including a variant of GROBID augmented with a LaTeX-OCR module for equations.[2]
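The edit-distance metric can be illustrated with a character-level Levenshtein distance divided by a length term so that scores fall between 0 and 1. This is a minimal sketch: normalizing by the length of the longer string is an assumption made here, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, reference):
    """Levenshtein distance scaled to [0, 1]; 0 means the strings are identical."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0
```

Under this definition, a score such as 0.07 means that roughly 7 character edits per 100 characters separate the prediction from the reference.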
Overall results ("All" content) were as follows:[2]
| Method | Edit distance | BLEU | METEOR | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| PDF text extraction | 0.255 | 65.8 | 82.1 | 77.1 | 81.4 | 79.2 |
| GROBID | 0.312 | 55.6 | 71.9 | 74.0 | 72.1 | 73.0 |
| GROBID + LaTeX OCR | 0.363 | 57.4 | 69.2 | 82.1 | 70.5 | 75.9 |
| Nougat small (250M) | 0.073 | 88.9 | 92.8 | 93.6 | 92.2 | 92.9 |
| Nougat base (350M) | 0.071 | 89.1 | 93.0 | 93.5 | 92.8 | 93.1 |
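
As a consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall columns and prints it beside the reported value. The figures are copied from the table, the names are illustrative, and small differences are expected because the published numbers are rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each method, from the table above.
reported = {
    "PDF text extraction": (77.1, 81.4, 79.2),
    "GROBID": (74.0, 72.1, 73.0),
    "GROBID + LaTeX OCR": (82.1, 70.5, 75.9),
    "Nougat small (250M)": (93.6, 92.2, 92.9),
    "Nougat base (350M)": (93.5, 92.8, 93.1),
}

for method, (precision, recall, f1) in reported.items():
    print(f"{method}: computed {f1_score(precision, recall):.2f}, reported {f1}")
```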
Broken down by content type, the base model scored an edit distance of 0.058 on plain text, 0.128 on math, and 0.211 on tables, with corresponding F1 scores of 95.7, 76.5, and 78.0.[2] As expected, prose is recognized most accurately, while mathematics and tables, which require both correct symbols and correct structure, are harder.
The paper is candid about failure modes.[2]
Meta released Nougat publicly under two licenses: the code is under the MIT license, and the model weights are under a Creative Commons Attribution-NonCommercial (CC-BY-NC) license, which restricts the weights to non-commercial use.[1] The reference implementation provides a command-line tool that takes a PDF and writes Mathpix Markdown output, as well as a small API server. Checkpoints download automatically on first run.[1] The model was also integrated into the Hugging Face Transformers library, which broadened its reach beyond the original repository.[3]
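Through the Transformers integration, a single page image can be transcribed roughly as follows. This is a hedged sketch, not the reference implementation: the helper name `transcribe_page`, the checkpoint identifier `facebook/nougat-small`, and the generation settings are assumptions, and the call needs `transformers`, `torch`, and `Pillow` installed (the post-processing step may require additional packages). The imports sit inside the function so that defining it does not load the heavy dependencies.

```python
def transcribe_page(image_path, checkpoint="facebook/nougat-small", max_new_tokens=3584):
    """Return the markup predicted for one page image (illustrative helper)."""
    # Lazy imports: only needed when the function is actually called.
    from PIL import Image
    from transformers import NougatProcessor, VisionEncoderDecoderModel

    # Downloads the checkpoint on first use (assumed Hub identifier).
    processor = NougatProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # Resize and normalize the page image into the tensor the encoder expects.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively decode markup tokens, never emitting the unknown token.
    output_ids = model.generate(
        pixel_values,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return processor.post_process_generation(text, fix_markdown=False)
```

A call such as `transcribe_page("page_1.png")` would return the predicted markup as a string; the file name is a placeholder. PDF input would first have to be rendered to page images, which this sketch leaves out.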