# Nougat (model)

> Source: https://aiwiki.ai/wiki/nougat
> Updated: 2026-06-03
> Categories: AI Models, Computer Vision, Meta AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Nougat** (Neural Optical Understanding for Academic Documents) is a document-understanding model from [Meta AI](/wiki/meta_ai) that converts the rendered image of a document page into structured markup text. It is a [vision transformer](/wiki/vision_transformer) trained to read scientific papers, typically scanned pages or PDFs, and emit a lightweight Markdown-style markup that preserves reading order, mathematical expressions as [LaTeX](/wiki/latex), and tables. Unlike a conventional pipeline that first runs the page image through a separate [optical character recognition](/wiki/optical_character_recognition) (OCR) engine and then tries to reconstruct layout, Nougat performs the recognition end to end inside one neural network. The work was introduced in the paper "Nougat: Neural Optical Understanding for Academic Documents" by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic, posted to arXiv on 25 August 2023.[1][2] Meta released the model weights and code publicly.[1]

## Motivation

Most scientific knowledge is distributed as PDF files, a format optimized for fixed visual presentation rather than for machine reading. When a document is stored as a PDF, much of its semantic structure is lost, and mathematical notation in particular is difficult to recover because equations are often rendered as positioned glyphs or images rather than as recoverable symbolic content. The paper frames Nougat as a way to bridge human-readable documents and machine-readable text, with an emphasis on making mathematics and other scientific content searchable and reusable.[2] This matters for downstream uses such as building training corpora for language models, indexing equations, and accessibility.

The abstract states the goal directly:

> Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language.[2]

## How it differs from generic OCR

Standard OCR systems detect and transcribe characters or words and return plain text, usually without reconstructing document structure, columns, or mathematics. Layout-aware pipelines add a separate stage to recover headings, paragraphs, and tables, and a dedicated math recognizer for equations. Nougat collapses these stages into a single image-to-sequence model. The paper notes that the model "does not require any OCR related inputs or modules" and that "the text is recognized implicitly by the network," citing prior visual document understanding work showing that "an external OCR engine is not necessarily needed to achieve competitive results."[2] In practice this means Nougat reads a page image and directly produces a structured document, including section headers, inline and display math in LaTeX, and tables, rather than a flat stream of characters.

## Architecture

Nougat is an encoder-decoder [transformer](/wiki/transformer) built on the Donut architecture, an OCR-free visual document understanding model. The paper states plainly: "We build on the Donut architecture."[2]

- **Encoder.** A Swin Transformer, a hierarchical vision transformer, takes the document page as a fixed-size image and converts it into a sequence of latent embeddings. The image is processed at 96 DPI and resized to a fixed resolution of (H, W) = (896, 672) pixels, with padding to preserve aspect ratio.[2]
- **Decoder.** A transformer decoder based on mBART autoregressively generates the markup token sequence conditioned on the encoder embeddings. The base model uses 10 decoder layers; the smaller model uses 4.[2]
- **Tokenizer.** Nougat uses the tokenizer from Galactica, Meta's earlier scientific language model, which suits LaTeX and scientific notation.[2]
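
The encoder's fixed input size can be illustrated with a little arithmetic. The sketch below is not the reference implementation: it assumes the page is scaled uniformly to fit inside the 672 × 896 canvas and that any leftover space is split evenly between opposite edges. The function name `fit_to_canvas` and the centered padding are assumptions made for this example.

```python
def fit_to_canvas(width, height, canvas_w=672, canvas_h=896):
    """Return the resized (w, h) and the (left, top, right, bottom) padding.

    Assumption for this sketch: uniform scaling plus centered padding.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # Use the tighter of the two ratios so the whole page fits on the canvas.
    scale = min(canvas_w / width, canvas_h / height)
    new_w = min(canvas_w, max(1, round(width * scale)))
    new_h = min(canvas_h, max(1, round(height * scale)))

    # Whatever the resized page does not cover is filled with padding.
    pad_w = canvas_w - new_w
    pad_h = canvas_h - new_h
    left, top = pad_w // 2, pad_h // 2
    return (new_w, new_h), (left, top, pad_w - left, pad_h - top)


# A US Letter page (8.5 x 11 inches) rendered at 96 DPI is 816 x 1056 pixels.
size, padding = fit_to_canvas(816, 1056)
```

For this US Letter page the result is a 672 × 870 image with 13 pixels of padding above and below, so the aspect ratio is preserved and the encoder always sees the same tensor shape.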

The output is a lightweight markup that supports headings, bold and italic text, lists, algorithms, inline and display LaTeX math, and LaTeX tables.[2] The reference implementation writes this as `.mmd` files in a Mathpix Markdown style, described as mostly compatible with the Mathpix Markdown specification.[1]

## Model sizes

Two checkpoints were released, distinguished mainly by decoder depth and maximum output length.[1][2]

| Model | Parameters | Decoder layers | Max sequence length |
|---|---|---|---|
| Nougat small (0.1.0-small) | 250M | 4 | 3584 tokens |
| Nougat base (0.1.0-base) | 350M | 10 | 4096 tokens |

The two models perform very similarly on the paper's benchmarks (overall edit distance 0.073 for small versus 0.071 for base, and F1 of 92.9 versus 93.1), so the smaller one is a reasonable default for many uses.[2]

## Training data

A central contribution of the work is the construction of a paired dataset, since no large public corpus of page images aligned with structured markup existed. The authors built one by pairing source documents with their rendered pages.[2]

For arXiv, they took the LaTeX source of papers and converted it to HTML using LaTeXML, then transformed that HTML into the target lightweight markup. They rendered the corresponding PDF pages to images and aligned each page image with the markup that belongs to it. Splitting a continuous markup document into the correct per-page segments is the hard part of this alignment; the authors trained a Bag-of-Words classifier (an SVM over TF-IDF features) to match markup spans to page boundaries, and only kept pages where the split was judged reliable, which accepted roughly 47% of candidate pages.[2]
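
The page-splitting idea can be illustrated with a much simpler stand-in than the classifier described above. The sketch below does not train an SVM or use TF-IDF weighting: it assigns each markup paragraph to the PDF page whose raw word counts are most similar (cosine similarity). All function names and the toy texts are hypothetical and exist only for this example.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_pages(paragraphs, page_texts):
    """Return, for each markup paragraph, the index of the best-matching page.

    Ties go to the earlier page. page_texts must not be empty.
    """
    pages = [bag_of_words(text) for text in page_texts]
    assignments = []
    for paragraph in paragraphs:
        words = bag_of_words(paragraph)
        scores = [cosine(words, page) for page in pages]
        assignments.append(max(range(len(pages)), key=scores.__getitem__))
    return assignments


# Toy data: text extracted from two PDF pages, and two markup paragraphs.
page_texts = [
    "we introduce a swin encoder for page images",
    "results show low edit distance on tables",
]
paragraphs = [
    "The Swin encoder reads page images",
    "Edit distance on tables is low",
]
pages_for_paragraphs = assign_pages(paragraphs, page_texts)
```

Once every paragraph has a page index, the point where the index changes marks a candidate page break in the markup. A filtering step like the one the authors describe would then discard pages where the assignments look inconsistent.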

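The full training corpus drew on three sources.[2] Shares are computed against the reported total; the three per-source counts as listed add up to 8,494,841 pages, more than that total, so the shares sum to slightly over 100%.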

| Source | Pages | Share |
|---|---|---|
| arXiv | 7,511,745 | ~91.5% |
| PubMed Central (PMC) | 536,319 | ~6.5% |
| Industry Documents Library (IDL) | 446,777 | ~5.4% |
| Total | 8,204,754 | 100% |

The arXiv portion was derived from about 1.75 million articles.[2] To make the model robust to the appearance of real scanned and printed documents, training images were augmented with transformations such as erosion, dilation, Gaussian noise, Gaussian blur, bitmap conversion, image compression, grid distortion, and elastic transforms.[2]

Training used the AdamW optimizer with an initial learning rate of 5e-5, decayed to 7.5e-6, an effective batch size of 192, and 3 epochs.[2]
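
A decay from 5e-5 down to a floor of 7.5e-6 can be sketched as a stepwise exponential schedule. The start and end rates come from the text above. The decay factor `gamma` and the interval `every` are parameters of this illustration and are not stated in this article, so treat them as assumptions rather than the exact training recipe.

```python
def learning_rate(update, lr_init=5e-5, lr_end=7.5e-6, gamma=0.9996, every=15):
    """Stepwise exponential decay from lr_init, clamped at lr_end.

    gamma and every are illustrative assumptions for this sketch.
    """
    if update < 0:
        raise ValueError("update must be non-negative")
    decays = update // every  # how many times the rate has been reduced so far
    return max(lr_end, lr_init * gamma ** decays)


# Sample the schedule at a few optimizer updates.
schedule = [learning_rate(step) for step in (0, 15, 10_000, 10_000_000)]
```

The rate stays at its initial value for the first `every` updates, shrinks by a factor of `gamma` at each interval after that, and never drops below the floor.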

## Evaluation

Nougat was evaluated on a held-out set of arXiv pages using edit distance (lower is better), BLEU, METEOR, and precision/recall/F1, reported overall and broken out by plain text, mathematical expressions, and tables. The authors compared against extracting the embedded text directly from the PDF and against GROBID, a widely used tool for parsing scholarly PDFs, including a variant of GROBID augmented with a LaTeX-OCR module for equations.[2]

Overall results ("All" content) were as follows:[2]

| Method | Edit distance | BLEU | METEOR | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| PDF text extraction | 0.255 | 65.8 | 82.1 | 77.1 | 81.4 | 79.2 |
| GROBID | 0.312 | 55.6 | 71.9 | 74.0 | 72.1 | 73.0 |
| GROBID + LaTeX OCR | 0.363 | 57.4 | 69.2 | 82.1 | 70.5 | 75.9 |
| Nougat small (250M) | 0.073 | 88.9 | 92.8 | 93.6 | 92.2 | 92.9 |
| Nougat base (350M) | 0.071 | 89.1 | 93.0 | 93.5 | 92.8 | 93.1 |

Broken down by content type, the base model scored an edit distance of 0.058 on plain text, 0.128 on math, and 0.211 on tables, with corresponding F1 scores of 95.7, 76.5, and 78.0.[2] As expected, prose is recognized most accurately, while mathematics and tables, which require both correct symbols and correct structure, are harder.
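
Two of these metrics are simple enough to sketch directly. The code below is an illustration, not the paper's evaluation script: it computes a character-level Levenshtein distance and normalizes it by the length of the longer string (an assumption of this sketch), and it computes F1 as the harmonic mean of precision and recall.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction, reference):
    """Edit distance scaled to [0, 1]; 0 means an exact match."""
    longest = max(len(prediction), len(reference))
    return edit_distance(prediction, reference) / longest if longest else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Nougat base, "All" row of the table above: precision 93.5, recall 92.8.
base_f1 = f1_score(93.5, 92.8)
```

Rounded to one decimal place, `base_f1` is 93.1, which matches the F1 reported for the base model in the table.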

## Limitations

The paper is candid about failure modes.[2]

- **Repetition.** Like other autoregressive transformer decoders, the model can fall into repetition loops, degenerating into repeated tokens or lines. This affected about 1.5% of test pages. The authors mitigated it during training by randomly perturbing tokens, and at inference by detecting anomalous patterns in the logit distribution over a sliding window so generation can be stopped.[2]
- **Language.** The training data is overwhelmingly English, and the model tends to break down (often into immediate repetition) on documents in other languages.[2]
- **Speed.** Generating markup token by token is slower than rule-based parsers; the authors report processing on the order of seconds per batch of pages on a single GPU, far slower than GROBID's throughput, though Nougat recovers content that rule-based tools cannot.[2]
- **Page independence.** Each page is processed on its own, so information that spans page boundaries, such as a table or equation split across two pages, is not always handled consistently.[2]
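
The inference-time repetition check can be illustrated with a simplified sliding-window test. This sketch assumes that a collapse into repetition shows up as unusually low variance in the largest logit of each generated token. The window size, the threshold, and the function name `first_collapse` are illustrative choices for this example, not the paper's exact criterion.

```python
from statistics import pvariance


def first_collapse(max_logits, window=15, threshold=0.5):
    """Return the start index of the first low-variance window, or None.

    max_logits holds the largest logit of each generated token, in order.
    window and threshold are illustrative values for this sketch.
    """
    for start in range(len(max_logits) - window + 1):
        if pvariance(max_logits[start:start + window]) < threshold:
            return start
    return None


# Healthy generation: the top logit keeps fluctuating from token to token.
healthy = [10.0 + (2 * i % 5) for i in range(40)]

# Collapsed generation: after 20 tokens the top logit becomes constant.
collapsed = healthy[:20] + [12.0] * 30

healthy_hit = first_collapse(healthy)
collapsed_hit = first_collapse(collapsed)
```

A decoding loop could run such a check periodically and stop generating for the page once a collapse is flagged, rather than emitting repeated lines until the maximum sequence length is reached.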

## Release and usage

Meta released Nougat under an open license: the code is under the MIT license, and the model weights are under a Creative Commons Attribution-NonCommercial (CC-BY-NC) license, restricting the weights to non-commercial use.[1] The reference implementation provides a command-line tool that takes a PDF and writes Mathpix Markdown output, and a small API server. Checkpoints download automatically on first run.[1] The model was also integrated into the Hugging Face Transformers library, which broadened its reach beyond the original repository.[3]
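
Using the model through Hugging Face Transformers might look like the sketch below. It assumes `transformers`, `torch`, and `Pillow` are installed and that the `facebook/nougat-small` checkpoint is downloaded from the Hugging Face Hub on first use; the post-processing step may need additional packages. The function name `transcribe_page` and the `max_new_tokens` default are choices made for this example.

```python
def transcribe_page(image_path, checkpoint="facebook/nougat-small", max_new_tokens=1024):
    """Convert one page image to markup text with a Nougat checkpoint.

    Imports are local so that defining the function has no heavy dependencies.
    """
    from PIL import Image
    from transformers import NougatProcessor, VisionEncoderDecoderModel

    processor = NougatProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # The processor resizes and pads the page to the encoder's input size.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding; max_new_tokens caps the output length,
    # so long pages may be truncated at this illustrative default.
    outputs = model.generate(
        pixel_values,
        min_length=1,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )
    text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return processor.post_process_generation(text, fix_markdown=False)
```

Calling `transcribe_page("page.png")` on a rendered page image (the path is a placeholder) would return the markup as a string. Because each call handles a single page, a multi-page PDF has to be rendered to images and processed page by page. The non-commercial license on the weights applies regardless of which library loads them.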

## See also

- [Donut](/wiki/donut_model)
- [Galactica](/wiki/galactica)
- [optical character recognition](/wiki/optical_character_recognition)
- [vision transformer](/wiki/vision_transformer)

## References

1. Meta AI / facebookresearch. "Nougat" (GitHub repository). https://github.com/facebookresearch/nougat
2. Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. "Nougat: Neural Optical Understanding for Academic Documents." arXiv:2308.13418, 25 August 2023. https://arxiv.org/abs/2308.13418
3. Hugging Face. "Nougat" (Transformers documentation). https://huggingface.co/docs/transformers/en/model_doc/nougat

