Nougat (model)

Updated Jul 16, 2026

Nougat (Neural Optical Understanding for Academic Documents) is a document-understanding model from Meta AI that converts the rendered image of a document page into structured markup text. It is a vision transformer trained to read scientific papers, typically scanned pages or PDFs, and emit a lightweight Markdown-style markup that preserves reading order, mathematical expressions as LaTeX, and tables. Unlike a conventional pipeline that passes page images through a separate optical character recognition (OCR) engine and then tries to reconstruct the layout, Nougat performs the recognition end to end inside one neural network. The work was introduced in the paper "Nougat: Neural Optical Understanding for Academic Documents" by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic, posted to arXiv on 25 August 2023.^[1]^[2] Meta released the model weights and code publicly.^[1]

Motivation

Most scientific knowledge is distributed as PDF files, a format optimized for fixed visual presentation rather than for machine reading. When a document is stored as a PDF, much of its semantic structure is lost, and mathematical notation in particular is difficult to recover because equations are often rendered as positioned glyphs or images rather than as recoverable symbolic content. The paper frames Nougat as a way to bridge human-readable documents and machine-readable text, with an emphasis on making mathematics and other scientific content searchable and reusable.^[2] This matters for downstream uses such as building training corpora for language models, indexing equations, and accessibility.

The abstract states the goal directly:

"Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language."^[2]

How it differs from generic OCR

Standard OCR systems detect and transcribe characters or words and return plain text, usually without reconstructing document structure, columns, or mathematics. Layout-aware pipelines add a separate stage to recover headings, paragraphs, and tables, and a dedicated math recognizer for equations. Nougat collapses these stages into a single image-to-sequence model. The paper notes that the model "does not require any OCR related inputs or modules" and that "the text is recognized implicitly by the network," citing prior visual document understanding work showing that "an external OCR engine is not necessarily needed to achieve competitive results."^[2] In practice this means Nougat reads a page image and directly produces a structured document, including section headers, inline and display math in LaTeX, and tables, rather than a flat stream of characters.

Architecture

Nougat is an encoder-decoder transformer built on the Donut architecture, an OCR-free visual document understanding model. The paper states plainly: "We build on the Donut architecture."^[2]

Encoder. A Swin Transformer, a hierarchical vision transformer, takes the document page as a fixed-size image and converts it into a sequence of latent embeddings. The image is processed at 96 DPI and resized to a fixed resolution of (H, W) = (896, 672) pixels, with padding to preserve aspect ratio.^[2]
Decoder. A transformer decoder based on mBART autoregressively generates the markup token sequence conditioned on the encoder embeddings. The base model uses 10 decoder layers; the smaller model uses 4.^[2]
Tokenizer. Nougat uses the tokenizer from Galactica, Meta's earlier scientific language model, which suits LaTeX and scientific notation.^[2]
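
The encoder's fixed input size can be illustrated with a small geometry helper. This is a minimal sketch, not the reference implementation: the function name `fit_page_to_canvas` is hypothetical, and centering the page on the canvas is an assumption, since the text above only says that padding preserves the aspect ratio.

```python
def fit_page_to_canvas(width, height, canvas_w=672, canvas_h=896):
    """Scale a page to fit inside the canvas without distortion.

    Returns (new_w, new_h, pad_left, pad_top). Centered padding is an
    assumption made for this sketch.
    """
    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(canvas_w / width, canvas_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Whatever space is left over is split evenly as padding.
    pad_left = (canvas_w - new_w) // 2
    pad_top = (canvas_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top


if __name__ == "__main__":
    # A US Letter page rendered at 96 DPI is 816 x 1056 pixels.
    print(fit_page_to_canvas(816, 1056))
```

For a US Letter page at 96 DPI the width is the limiting dimension, so the page is scaled to 672 x 870 and padded by 13 pixels at the top under the centering assumption.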

The output is a lightweight markup that supports headings, bold and italic text, lists, algorithms, inline and display LaTeX math, and LaTeX tables.^[2] The reference implementation writes this as .mmd files in a Mathpix Markdown style, described as mostly compatible with the Mathpix Markdown specification.^[1]

Model sizes

Two checkpoints were released, distinguished mainly by decoder depth and maximum output length.^[1]^[2]

Model	Parameters	Decoder layers	Max sequence length
Nougat small (0.1.0-small)	250M	4	3584 tokens
Nougat base (0.1.0-base)	350M	10	4096 tokens

The two models perform very similarly on the paper's benchmarks, so the smaller one is a reasonable default for many uses.^[2]

Training data

A central contribution of the work is the construction of a paired dataset, since no large public corpus of page images aligned with structured markup existed. The authors built one by pairing source documents with their rendered pages.^[2]

For arXiv, they took the LaTeX source of papers and converted it to HTML using LaTeXML, then transformed that HTML into the target lightweight markup. They rendered the corresponding PDF pages to images and aligned each page image with the markup that belongs to it. Splitting a continuous markup document into the correct per-page segments is the hard part of this alignment. The authors trained a Bag-of-Words classifier (an SVM over TF-IDF features) to match markup spans to page boundaries and kept only pages where the split was judged reliable, which retained roughly 47% of candidate pages.^[2]
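
The page-matching idea can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the authors' method: instead of an SVM it assigns each markup paragraph to the page whose embedded text has the highest TF-IDF cosine similarity, and it only ever moves forward through the pages. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_paragraphs_to_pages(page_texts, paragraphs):
    """Return, for each paragraph, the index of the best-matching page."""
    page_counts = [Counter(tokenize(p)) for p in page_texts]
    n = len(page_texts)
    # Document frequency: on how many pages does each term occur?
    df = Counter(t for c in page_counts for t in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    def vec(counter):
        # Terms that never occur on any page cannot help the match.
        return {t: tf * idf[t] for t, tf in counter.items() if t in idf}

    page_vecs = [vec(c) for c in page_counts]
    assignment, current = [], 0
    for para in paragraphs:
        v = vec(Counter(tokenize(para)))
        # Reading order is monotonic, so never jump back to an earlier page.
        best = max(range(current, n), key=lambda i: cosine(v, page_vecs[i]))
        assignment.append(best)
        current = best
    return assignment


if __name__ == "__main__":
    pages = [
        "We introduce a vision transformer encoder for document images",
        "Results show low edit distance on the arXiv test set",
    ]
    paras = [
        "A vision transformer encoder reads document images.",
        "The edit distance on the test set is low.",
    ]
    print(assign_paragraphs_to_pages(pages, paras))
```

A page break then falls wherever the assigned page index changes between consecutive paragraphs. A trained classifier, as described above, can be more robust than this nearest-page rule when neighboring pages share vocabulary.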

The full training corpus drew on three sources:^[2]

Source	Pages	Share
arXiv	7,511,745	~91.5%
PubMed Central (PMC)	536,319	~6.5%
Industry Documents Library (IDL)	446,777	~5.4%
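Total	8,204,754	100%

The three per-source counts listed here sum to 8,494,841, which is more than the listed total. The shares are computed against the listed total, so they add up to about 103.4% rather than 100%.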

The arXiv portion was derived from about 1.75 million articles.^[2] To make the model robust to the appearance of real scanned and printed documents, training images were augmented with transformations such as erosion, dilation, Gaussian noise, Gaussian blur, bitmap conversion, image compression, grid distortion, and elastic transforms.^[2]

Training used the AdamW optimizer with an initial learning rate of 5e-5, decayed to 7.5e-6, an effective batch size of 192, and 3 epochs.^[2]
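
A decaying learning rate with a floor, as described above, could be sketched as follows. The start and end values come from the text; the step-wise exponential form, the decay factor `gamma`, and the interval `every` are illustrative assumptions and should be checked against the paper before reuse.

```python
def learning_rate(step, lr_init=5e-5, lr_end=7.5e-6, gamma=0.9996, every=15):
    """Exponentially decayed learning rate, clamped at lr_end.

    gamma and every are assumed values for illustration only.
    """
    # Apply one multiplicative decay per completed interval of steps.
    decayed = lr_init * gamma ** (step // every)
    # Never let the rate drop below the final value.
    return max(decayed, lr_end)


if __name__ == "__main__":
    for step in (0, 1_000, 10_000, 100_000):
        print(step, learning_rate(step))
```

In a PyTorch training loop the same schedule would typically be attached to the AdamW optimizer through a scheduler object rather than computed by hand.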

Evaluation

Nougat was evaluated on a held-out set of arXiv pages using edit distance (lower is better), BLEU, METEOR, and precision/recall/F1, reported overall and broken out by plain text, mathematical expressions, and tables. The authors compared against extracting the embedded text directly from the PDF and against GROBID, a widely used tool for parsing scholarly PDFs, including a variant of GROBID augmented with a LaTeX-OCR module for equations.^[2]

Overall results ("All" content) were as follows:^[2]

Method	Edit distance	BLEU	METEOR	Precision	Recall	F1
PDF text extraction	0.255	65.8	82.1	77.1	81.4	79.2
GROBID	0.312	55.6	71.9	74.0	72.1	73.0
GROBID + LaTeX OCR	0.363	57.4	69.2	82.1	70.5	75.9
Nougat small (250M)	0.073	88.9	92.8	93.6	92.2	92.9
Nougat base (350M)	0.071	89.1	93.0	93.5	92.8	93.1

Broken down by content type, the base model scored an edit distance of 0.058 on plain text, 0.128 on math, and 0.211 on tables, with corresponding F1 scores of 95.7, 76.5, and 78.0.^[2] As expected, prose is recognized most accurately, while mathematics and tables, which require both correct symbols and correct structure, are harder.
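
The edit-distance metric in these results is a normalized Levenshtein distance between predicted and reference markup. A minimal sketch follows; dividing by the length of the longer string is an assumption made here so that the score stays between 0 and 1, and the paper's exact normalization may differ.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1]; 0 means the strings are identical."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


if __name__ == "__main__":
    reference = r"The loss is $\frac{1}{n}\sum_i x_i^2$."
    predicted = r"The loss is $\frac{1}{n}\sum_i x_i$."
    print(normalized_edit_distance(predicted, reference))
```

A single dropped exponent, as in the example, costs only two characters, which is one reason structure-sensitive content such as math is also reported separately.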

Limitations

The paper is candid about failure modes.^[2]

Repetition. Like other autoregressive transformer decoders, the model can fall into repetition loops, degenerating into repeated tokens or lines. This affected about 1.5% of test pages. The authors mitigated it during training by randomly perturbing tokens, and at inference by detecting anomalous patterns in the logit distribution over a sliding window so generation can be stopped.^[2]
Language. The training data is overwhelmingly English, and the model tends to break down (often into immediate repetition) on documents in other languages.^[2]
Speed. Generating markup token by token is slower than rule-based parsing; the authors report a mean generation time of about 19.5 seconds per batch of 6 pages for the base model on a single NVIDIA A10G GPU, far slower than GROBID's throughput of roughly 10.6 PDFs per second, though Nougat recovers content that rule-based tools cannot.^[2]
Page independence. Each page is processed on its own, so information that spans page boundaries, such as a table or equation split across two pages, is not always handled consistently.^[2]
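
The inference-time repetition check can be illustrated with a sliding-window variance test. This is a simplified sketch of the idea, not the paper's exact criterion: it assumes that the per-token maximum logit becomes nearly constant once the decoder starts looping, and the function name, `window`, and `threshold` are hypothetical.

```python
from statistics import pvariance


def first_collapse(max_logits, window=15, threshold=1e-3):
    """Return the start index of the first window whose variance drops
    below threshold, or None if no such window exists.

    max_logits holds one value per generated token (the largest logit
    at that decoding step).
    """
    for end in range(window, len(max_logits) + 1):
        # Near-zero variance across a whole window suggests a loop.
        if pvariance(max_logits[end - window:end]) < threshold:
            return end - window
    return None


if __name__ == "__main__":
    healthy = [float(i % 7) for i in range(40)]  # varied confidence
    looping = healthy + [9.0] * 30               # flat once the loop starts
    print(first_collapse(healthy), first_collapse(looping))
```

Once a collapse is detected, generation can be stopped and the output truncated at the reported position so that the looping text is dropped.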

Release and usage

Meta released Nougat under two licenses: the code is under the MIT license, and the model weights are under a Creative Commons Attribution-NonCommercial (CC-BY-NC) license, which restricts the weights to non-commercial use.^[1] The reference implementation provides a command-line tool that takes a PDF and writes Mathpix Markdown output, and a small API server. Checkpoints download automatically on first run.^[1] The model was also integrated into the Hugging Face Transformers library, which broadened its reach beyond the original repository.^[3]
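
A minimal sketch of single-page inference through the Transformers integration is shown below. It assumes that `transformers`, `torch`, and `Pillow` are installed and that the checkpoint is published on the Hugging Face Hub as `facebook/nougat-small`; the checkpoint name and the helper `transcribe_page` are assumptions not stated in this article, so check the Transformers documentation before relying on them.

```python
def transcribe_page(image_path, checkpoint="facebook/nougat-small",
                    max_new_tokens=3584):
    """Convert one page image to Nougat markup (downloads weights on first use)."""
    # Imported lazily so that defining the function does not need the packages.
    from PIL import Image
    from transformers import NougatProcessor, VisionEncoderDecoderModel

    processor = NougatProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # The processor resizes, pads and normalizes the page image.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding; the unknown token is banned from the output.
    output_ids = model.generate(
        pixel_values,
        min_length=1,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Clean up the decoded markup.
    return processor.post_process_generation(text, fix_markdown=False)
```

Calling `transcribe_page("page.png")` on a rendered page image would return the markup as a string. Because the weights are licensed for non-commercial use, the same restriction applies to anything built on this call.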

References

1. Meta AI / facebookresearch. "Nougat" (GitHub repository). https://github.com/facebookresearch/nougat
2. Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. "Nougat: Neural Optical Understanding for Academic Documents." arXiv:2308.13418, 25 August 2023. https://arxiv.org/abs/2308.13418
3. Hugging Face. "Nougat" (Transformers documentation). https://huggingface.co/docs/transformers/en/model_doc/nougat
