LayoutLM
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 3,940 words
LayoutLM is a family of pre-trained multimodal models developed by Microsoft Research for document AI, the task of automatically reading and understanding visually rich documents such as forms, invoices, receipts, contracts, and scanned reports. Unlike ordinary text models that see only a flat stream of words, LayoutLM jointly models the words on a page together with where those words sit in two-dimensional space, and in later versions with the pixels of the page image as well. This combination of text, layout, and image lets the model reason about documents the way a human reader does, using the position of a label relative to its value, the structure of a table, or the visual style of a heading to decide what each piece of text means.
The family currently has three members. The original LayoutLM was introduced in late 2019 and showed that adding two-dimensional position information to a BERT-style encoder dramatically improved performance on document tasks. LayoutLMv2, released at the end of 2020, folded the page image directly into pre-training and added a spatial-aware attention mechanism. LayoutLMv3, released in 2022, simplified the architecture by treating image patches the same way it treats text tokens and unified the training objectives across both modalities. All three are built on the transformer architecture, are distributed through Hugging Face under model names like microsoft/layoutlm-base-uncased and microsoft/layoutlmv3-base, and are designed to be adapted to a specific document task through fine-tuning.
Most business information does not arrive as clean machine-readable text. It arrives as PDFs, scanned paper, photographs of receipts, and faxed forms. Extracting structured data from these documents has traditionally required either brittle template matching or expensive manual data entry. The goal of document AI is to automate this extraction, and the central difficulty is that the meaning of text on a page depends heavily on layout. The number "$4,200" means something different when it sits next to the word "Total" than when it sits in a column labeled "Salary." A model that reads only the linear sequence of words loses exactly the spatial cues that make such distinctions possible.
LayoutLM addresses this by extending the masked-language-model pre-training paradigm popularized by BERT into two dimensions. The pipeline begins with optical character recognition, which converts a document image into a list of recognized words, each accompanied by a bounding box giving its pixel coordinates on the page. LayoutLM consumes both the words and their bounding boxes, so the model learns not just what the text says but where it appears. Because the model is pre-trained on millions of documents before it is ever shown a labeled example, it acquires a general sense of how documents are structured, and that general knowledge transfers efficiently to specific tasks with comparatively little labeled data.
The three versions of the model represent a clear research trajectory. Version 1 demonstrated that text plus layout beats text alone. Version 2 demonstrated that bringing the image into pre-training, rather than only into fine-tuning, helps further, and that attention can be made aware of spatial relationships. Version 3 demonstrated that the heavy convolutional image processing used in version 2 was not necessary, and that a single clean architecture with aligned text and image objectives could match or beat its predecessors while being simpler to train and use. Each version was at or near the state of the art on the standard document-understanding benchmarks when it was released, and the family has become a default starting point for practitioners building form and receipt extraction systems.
LayoutLM combines up to three kinds of information about a document: the text, the layout, and the image. Each is encoded as embeddings that are summed or concatenated and then processed by a transformer encoder. Understanding how each modality is represented makes the differences between the versions easier to follow.
The text modality works essentially as it does in BERT. The recognized words are split into subword tokens using a WordPiece tokenizer, special tokens such as [CLS] and [SEP] are added, and each token receives a learned word embedding. Because the underlying text encoder in the first version is initialized from BERT, LayoutLM inherits a strong language prior from the start and only needs to learn how to use the additional layout signal. This is one reason the model trains efficiently: it does not have to relearn language from scratch.
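The subword step can be illustrated with a minimal sketch of greedy longest-match WordPiece splitting. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and exist only for this illustration; a real checkpoint ships its own vocabulary of tens of thousands of entries. The `word_ids` list records which OCR word each token came from, which is what later lets every subword inherit its word's bounding box.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                break
            end -= 1
        if end == start:  # no vocabulary entry covers this position
            return [unk]
        pieces.append(candidate)
        start = end
    return pieces


def encode(words, vocab):
    """Wrap the subword tokens in [CLS] ... [SEP] and track their source word."""
    tokens, word_ids = ["[CLS]"], [None]
    for index, word in enumerate(words):
        for piece in wordpiece(word.lower(), vocab):
            tokens.append(piece)
            word_ids.append(index)
    tokens.append("[SEP]")
    word_ids.append(None)
    return tokens, word_ids


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"invoice", "total", "pay", "##able", "##s", "[UNK]"}
tokens, word_ids = encode(["Invoice", "Payable", "Totals"], TOY_VOCAB)
print(tokens)    # ['[CLS]', 'invoice', 'pay', '##able', 'total', '##s', '[SEP]']
print(word_ids)  # [None, 0, 1, 1, 2, 2, None]
```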
The layout modality is the defining contribution of LayoutLM. After OCR, every token has a bounding box, conventionally written as four coordinates (x0, y0, x1, y1) that give the top-left and bottom-right corners of the box, normalized to a 0 to 1000 scale relative to the page size. LayoutLM converts these coordinates into 2D position embeddings. Rather than learn a separate embedding for every possible four-tuple, the model uses shared lookup tables: it looks up x0 and x1 in an embedding table for the horizontal axis, and y0 and y1 in a table for the vertical axis, so coordinates that share an axis share parameters. The width and height of the box can also be embedded. These spatial embeddings are added to the token embeddings, which means that two identical words sitting in different places on the page get different representations. This is what allows the model to learn relationships like "the value usually appears to the right of or below its label."
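A minimal sketch of this encoding follows. It normalizes a pixel-space box to the 0 to 1000 scale and sums four lookups from two shared tables with the token vector. The embedding size of 8, the random table values, and all function names are assumptions chosen for readability; in the real model the tables are learned and the hidden size is 768 or 1,024.

```python
import random


def normalize_box(box, page_width, page_height):
    """Scale pixel coordinates (x0, y0, x1, y1) to integers in 0..1000."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    return [max(0, min(1000, value)) for value in scaled]


DIM = 8  # illustrative embedding size, far smaller than the real model's
rng = random.Random(0)


def make_table(rows, dim):
    """Stand-in for a learned embedding table (random values here)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


X_TABLE = make_table(1001, DIM)  # shared by x0 and x1
Y_TABLE = make_table(1001, DIM)  # shared by y0 and y1


def layout_embedding(norm_box):
    """Sum the four coordinate lookups into one 2D position vector."""
    x0, y0, x1, y1 = norm_box
    rows = [X_TABLE[x0], Y_TABLE[y0], X_TABLE[x1], Y_TABLE[y1]]
    return [sum(values) for values in zip(*rows)]


def input_embedding(token_vector, norm_box):
    """Add the layout vector to the token's word embedding."""
    return [t + l for t, l in zip(token_vector, layout_embedding(norm_box))]


box = normalize_box((150, 200, 450, 260), page_width=1500, page_height=2000)
print(box)  # [100, 100, 300, 130]

word_vector = [0.5] * DIM  # the same word embedding used at two positions
near_top = input_embedding(word_vector, [100, 100, 300, 130])
near_bottom = input_embedding(word_vector, [100, 700, 300, 730])
```

Because the two boxes differ only in their vertical coordinates, `near_top` and `near_bottom` differ only through the `Y_TABLE` lookups, which is the mechanism that gives identical words different representations at different positions.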
The image modality evolved substantially across versions, and this is the clearest axis along which the family differs.
In the first LayoutLM, visual information is handled lightly and only at the fine-tuning stage, not during pre-training. After the transformer produces a representation for each token, image region features extracted by a Faster R-CNN object detector are added for downstream tasks. Pre-training itself used only text and layout.
In LayoutLMv2, the image becomes a first-class citizen of pre-training. The model is a two-stream multimodal transformer: the page image is resized and passed through a convolutional visual backbone, specifically a ResNeXt-101 with a Feature Pyramid Network whose weights come from a Mask R-CNN trained on the PubLayNet document-layout corpus. The resulting feature map is flattened into a sequence of visual tokens that are concatenated with the text tokens and fed through the same encoder. Version 2 also introduces a spatial-aware self-attention mechanism, which adds a learned bias to the attention scores based on the relative one-dimensional and two-dimensional distance between tokens, so a token can attend more strongly to its spatial neighbors.
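The effect of a spatial bias on attention can be sketched as follows. Each raw attention score receives three additive terms, indexed by the relative sequence offset and the relative x and y offsets between two tokens, before the softmax. This is a simplification under stated assumptions: the bias tables here are hand-set to penalize distance, whereas the model learns them, and plain clipping stands in for whatever bucketing the real implementation applies. All names are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def clip(value, limit):
    return max(-limit, min(limit, value))


LIMIT_1D, LIMIT_2D = 4, 1000
# Hand-set stand-ins for learned bias tables: larger distance, lower bias.
BIAS_1D = [-0.1 * abs(d) for d in range(-LIMIT_1D, LIMIT_1D + 1)]
BIAS_X = [-0.002 * abs(d) for d in range(-LIMIT_2D, LIMIT_2D + 1)]
BIAS_Y = list(BIAS_X)


def spatial_aware_weights(scores, xs, ys):
    """Add 1D and 2D relative-position biases to the scores, then normalize."""
    size = len(scores)
    weights = []
    for i in range(size):
        row = []
        for j in range(size):
            bias = (
                BIAS_1D[clip(j - i, LIMIT_1D) + LIMIT_1D]
                + BIAS_X[clip(xs[j] - xs[i], LIMIT_2D) + LIMIT_2D]
                + BIAS_Y[clip(ys[j] - ys[i], LIMIT_2D) + LIMIT_2D]
            )
            row.append(scores[i][j] + bias)
        weights.append(softmax(row))
    return weights


# Three tokens with identical content scores; only position differs.
scores = [[0.0] * 3 for _ in range(3)]
xs = [100, 120, 900]  # token 1 sits next to token 0, token 2 is far away
ys = [100, 100, 800]
weights = spatial_aware_weights(scores, xs, ys)
```

With equal content scores, token 0 places more weight on its neighbor (token 1) than on the distant token 2, purely because of the position-dependent bias.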
In LayoutLMv3, the convolutional backbone is removed entirely. Following the approach used by Vision Transformers and BEiT, the page image is simply cut into fixed-size 16-by-16 patches, each patch is flattened and passed through a single linear projection to become an image token, and those tokens are fed to the transformer alongside the text tokens. This eliminates the dependency on a pre-trained convolutional backbone or object detector such as ResNeXt or Faster R-CNN, sheds the parameters those components consumed, and makes the architecture markedly simpler and faster to run. The price is that the image patches are aligned to text through a dedicated pre-training objective rather than through detected regions.
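The patch-and-project step is simple enough to sketch without any library. The example below cuts a small grid of pixel values into non-overlapping 16-by-16 patches, flattens each one, and applies a single linear map. It assumes a grayscale 32-by-48 image and an output size of 4 purely to keep the numbers small; the weights are random stand-ins for learned parameters, and the function names are hypothetical.

```python
import random


def split_into_patches(image, patch_size):
    """Cut an H x W grid of pixel values into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


def linear_projection(patches, weights, bias):
    """Map each flattened patch to one vector: out[d] = weights[d] . patch + bias[d]."""
    return [
        [sum(w * p for w, p in zip(row, patch)) + b for row, b in zip(weights, bias)]
        for patch in patches
    ]


PATCH, DIM = 16, 4  # DIM is illustrative only
rng = random.Random(0)
image = [[rng.random() for _ in range(48)] for _ in range(32)]  # 32 rows x 48 columns
weights = [[rng.uniform(-0.05, 0.05) for _ in range(PATCH * PATCH)] for _ in range(DIM)]
bias = [0.0] * DIM

patches = split_into_patches(image, PATCH)
image_tokens = linear_projection(patches, weights, bias)
print(len(patches), len(patches[0]), len(image_tokens[0]))  # 6 256 4
```

The same arithmetic scales directly: a 224-by-224 input, for instance, would yield 14 x 14 = 196 patches.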
The pre-training objectives are where the modalities are stitched together, and they also track the family's evolution.
LayoutLM (v1) used two objectives. The Masked Visual-Language Model (MVLM) masks some input tokens but keeps their 2D position embeddings, then asks the model to predict the masked words from both the surrounding text and the layout, which forces it to use spatial context. A Multi-label Document Classification (MDC) objective used the document category tags available in the pre-training corpus to encourage better document-level representations.
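The key property of MVLM, that the word is hidden while its position stays visible, can be shown in a few lines. This sketch assumes a simple scheme in which each selected token is replaced by `[MASK]`; the masking rate is left as a parameter, and any finer replacement policy is out of scope here. The function name and sample data are hypothetical.

```python
import random

SPECIAL = ("[CLS]", "[SEP]")


def mask_for_mvlm(tokens, boxes, mask_prob, rng):
    """Hide some tokens but return every bounding box unchanged."""
    masked_tokens, labels = [], []
    for token in tokens:
        if token not in SPECIAL and rng.random() < mask_prob:
            masked_tokens.append("[MASK]")
            labels.append(token)   # the word the model must recover
        else:
            masked_tokens.append(token)
            labels.append(None)    # position not scored
    return masked_tokens, list(boxes), labels


tokens = ["[CLS]", "total", "$4,200", "salary", "$3,000", "[SEP]"]
boxes = [
    [0, 0, 0, 0],
    [100, 800, 180, 820], [200, 800, 290, 820],
    [100, 300, 190, 320], [200, 300, 290, 320],
    [1000, 1000, 1000, 1000],
]
masked_tokens, kept_boxes, labels = mask_for_mvlm(
    tokens, boxes, mask_prob=0.5, rng=random.Random(0)
)
```

Whatever subset ends up masked, `kept_boxes` equals `boxes`, so the model still knows where each hidden word sat on the page and must use that spatial context to predict it.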
LayoutLMv2 kept a masked visual-language objective and added two cross-modal tasks. Text-Image Alignment (TIA) randomly covers some lines in the image and asks the model, for each text token, whether the image region covering it has been covered, which teaches fine-grained correspondence between words and pixels. Text-Image Matching (TIM) is a document-level task that asks whether a given image and text actually come from the same document.
LayoutLMv3 unified the objectives so that text and image are treated symmetrically. Masked Language Modeling (MLM) masks roughly 30 percent of text tokens using span masking, with span lengths drawn from a Poisson distribution, and reconstructs them. Masked Image Modeling (MIM) masks roughly 40 percent of image patches using blockwise masking and reconstructs them not as raw pixels but as discrete visual tokens drawn from an 8,192-entry codebook produced by an image tokenizer adapted from DiT and the discrete VAE approach of BEiT. Word-Patch Alignment (WPA) ties the two together by predicting, for each unmasked text token, whether its corresponding image patch has been masked. The symmetry of masking on both sides is the central idea behind version 3 and the reason its title emphasizes unified text and image masking.
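Two pieces of this scheme lend themselves to a short sketch: drawing Poisson-length spans until roughly 30 percent of text tokens are masked, and deriving Word-Patch Alignment labels from the two masks. The Poisson rate of 3, the helper names, and the token-to-patch mapping are assumptions made for illustration; blockwise patch masking is replaced here by an explicit set of masked patch indices.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def span_mask(num_tokens, ratio, rate, rng):
    """Mask contiguous spans until the target share of tokens is covered."""
    budget = int(round(num_tokens * ratio))
    masked = set()
    while len(masked) < budget:
        length = max(1, sample_poisson(rate, rng))
        start = rng.randrange(num_tokens)
        for position in range(start, min(num_tokens, start + length)):
            if len(masked) < budget:
                masked.add(position)
    return masked


def wpa_labels(token_to_patch, masked_tokens, masked_patches):
    """Label each unmasked text token by whether its image patch was masked."""
    labels = {}
    for token, patch in enumerate(token_to_patch):
        if token in masked_tokens:
            continue  # masked text tokens are left out of this objective
        labels[token] = "unaligned" if patch in masked_patches else "aligned"
    return labels


# rate=3.0 is an assumed span-length parameter, not taken from the text above.
text_mask = span_mask(num_tokens=100, ratio=0.30, rate=3.0, rng=random.Random(1))
print(len(text_mask))  # 30

labels = wpa_labels(
    token_to_patch=[0, 0, 1, 2, 3],  # hypothetical: which patch covers each token
    masked_tokens={1},
    masked_patches={2},
)
print(labels)  # {0: 'aligned', 2: 'aligned', 3: 'unaligned', 4: 'aligned'}
```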
The three releases share a common philosophy but differ in architecture, the way the image is used, training objectives, size, and licensing. The table below summarizes the differences using the publicly reported figures. Parameter counts refer to the BASE configurations unless the LARGE figure is noted; benchmark scores are the strongest reported for each version, generally from the LARGE model.
| Aspect | LayoutLM (v1) | LayoutLMv2 | LayoutLMv3 |
|---|---|---|---|
| Released / venue | Dec 2019 (arXiv); KDD 2020 | Dec 2020 (arXiv); ACL 2021 | Apr 2022 (arXiv); ACM Multimedia 2022 |
| Lead authors | Yiheng Xu, Minghao Li, Lei Cui et al. | Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui et al. | Yupan Huang, Tengchao Lv, Lei Cui et al. |
| Modalities | Text + layout (image only at fine-tuning) | Text + layout + image (image in pre-training) | Text + layout + image (unified) |
| Image encoding | Faster R-CNN region features, added downstream | ResNeXt-101 FPN convolutional backbone (Mask R-CNN on PubLayNet) | Linear projection of 16x16 image patches, no CNN |
| Attention | Standard transformer self-attention | Spatial-aware self-attention (relative-position bias) | Standard transformer self-attention |
| Pre-training objectives | MVLM, Multi-label Document Classification | Masked visual-language, Text-Image Alignment, Text-Image Matching | Masked Language Modeling, Masked Image Modeling, Word-Patch Alignment |
| BASE parameters | about 113M | about 200M | about 133M |
| LARGE parameters | about 343M | about 426M | about 368M |
| Pre-training data | IIT-CDIP (over 6M docs, over 11M images) | IIT-CDIP scanned documents | IIT-CDIP, about 11M document images |
| License | MIT | Non-commercial research license | CC BY-NC-SA 4.0 (non-commercial) |
The original LayoutLM was submitted to arXiv on December 31, 2019 (arXiv:1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou, and was published at KDD 2020. The paper described itself as the first time text and layout were jointly learned in a single framework for document-level pre-training. The BASE model is a 12-layer transformer with 768 hidden dimensions and 12 attention heads, totaling about 113 million parameters; the LARGE model is a 24-layer transformer with 1,024 hidden dimensions and 16 heads, about 343 million parameters. It was pre-trained on the IIT-CDIP Test Collection 1.0, which contains more than 6 million documents and more than 11 million scanned images, for two epochs, using Tesseract to obtain words and their 2D positions. The Hugging Face checkpoints microsoft/layoutlm-base-uncased and microsoft/layoutlm-large-uncased are released under the permissive MIT license, which makes version 1 the most freely usable member of the family for commercial work.
LayoutLMv2 was submitted to arXiv on December 29, 2020 (arXiv:2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, and a larger team, and was published at ACL 2021. Its headline changes were bringing the page image into pre-training through a ResNeXt-101 FPN visual backbone and introducing spatial-aware self-attention. The BASE configuration has roughly 200 million parameters and the LARGE configuration roughly 426 million, reflecting the added cost of the convolutional image branch. Version 2 produced large jumps on visual and question-answering tasks; its biggest gain over version 1 was on document visual question answering, where the ability to attend to the image proved especially valuable. A multilingual sibling, LayoutXLM, applied the same recipe to many languages for cross-lingual document understanding.
LayoutLMv3 was submitted to arXiv on April 18, 2022 (arXiv:2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei, and was published at ACM Multimedia 2022. Its contribution was simplification and unification: replacing the convolutional backbone with linear image-patch embeddings and masking both text and image during pre-training with the aligned MLM, MIM, and WPA objectives. The BASE model has about 133 million parameters and the LARGE model about 368 million, so version 3 is smaller than version 2 despite matching or beating it, because it no longer carries a separate detection backbone. It is described as a general-purpose model for both text-centric tasks, such as form and receipt understanding and document question answering, and image-centric tasks, such as document classification and layout analysis, the latter being something the earlier versions did not target directly. The catch for many users is licensing: microsoft/layoutlmv3-base and microsoft/layoutlmv3-large are released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), which forbids commercial use, in contrast to version 1's MIT terms.
The LayoutLM family is evaluated on a small set of public benchmarks that between them cover the main document-understanding tasks: information extraction from forms and receipts, document classification, and visual question answering. These datasets are worth knowing because progress in the field is largely measured against them.
FUNSD (Form Understanding in Noisy Scanned Documents) is a set of 199 annotated forms containing more than 30,000 words, sampled and cleaned from the form category of the larger RVL-CDIP collection. The task is to label tokens as headers, questions, answers, or other, and to link questions to their answers, so it tests both entity labeling and relational structure. Performance is reported as entity-level F1.
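Entity-level F1 counts a prediction as correct only when the whole entity matches, not token by token. A minimal sketch, assuming each entity is represented as a hypothetical `(label, start, end)` tuple and that a match requires all three fields to agree:

```python
def entity_f1(gold, predicted):
    """Harmonic mean of precision and recall over exact entity matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [("question", 0, 2), ("answer", 2, 5), ("header", 7, 9)]
predicted = [("question", 0, 2), ("answer", 2, 4), ("header", 7, 9)]
score = entity_f1(gold, predicted)
```

Here the predicted answer span ends one token early, so it counts as both a false positive and a false negative. Two of three entities match on each side, giving precision, recall, and F1 of 2/3.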
CORD (Consolidated Receipt Dataset) is a collection of about 1,000 Indonesian receipts, split into 800 for training, 100 for validation, and 100 for testing, with fields such as menu items, prices, and totals annotated for extraction. It is the standard receipt-understanding benchmark and is also scored with F1.
RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) consists of 400,000 grayscale document images evenly divided into 16 classes such as letter, email, form, invoice, and scientific report, with 25,000 images per class and a 320,000 / 40,000 / 40,000 train, validation, test split. The low-resolution, noisy 1980s and 1990s scans make it a realistic test of whole-document classification, scored as accuracy.
DocVQA (Document Visual Question Answering) contains about 50,000 questions posed over more than 12,000 document images, where a model must read the document and answer a natural-language question about its contents. It is scored with ANLS (Average Normalized Levenshtein Similarity), which rewards answers that are close to the reference string.
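ANLS can be computed with nothing more than an edit-distance routine. The sketch below follows the commonly cited definition, in which each answer scores one minus its normalized Levenshtein distance to the closest reference and any answer whose normalized distance reaches a 0.5 threshold scores zero. The lowercasing and whitespace stripping are assumptions about preprocessing, and the sample answers are invented.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best thresholded similarity to any reference."""
    total = 0.0
    for prediction, accepted in zip(predictions, references):
        best = 0.0
        for reference in accepted:
            p, r = prediction.strip().lower(), reference.strip().lower()
            longest = max(len(p), len(r))
            distance = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - distance if distance < threshold else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


predictions = ["$4,200", "acme corp", "1999"]
references = [["$4,200"], ["Acme Corp."], ["2024"]]
score = anls(predictions, references)
```

The three answers score 1.0 (exact), 0.9 (one missing character out of ten), and 0.0 (over the threshold), for a mean of about 0.633. The partial credit for the second answer is what makes the metric tolerant of small OCR or formatting slips.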
The reported results show steady improvement across the family. On FUNSD, the F1 climbed from roughly 79.3 for LayoutLM to about 84.2 for LayoutLMv2 and about 92.1 for LayoutLMv3. On CORD, scores rose from about 95.0 to 96.0 to 97.5. On RVL-CDIP, classification accuracy moved from about 94.4 percent to 95.6 percent to 95.9 percent. The most striking jump was on DocVQA, where the ANLS rose from about 72.9 for the first version to about 86.7 for LayoutLMv2, reflecting how much that task benefits from reading the image. LayoutLMv3 additionally reported strong document-layout analysis on PubLayNet, around 95.1 mean average precision, a task the earlier versions did not target. These numbers placed the models at or near the top of the relevant leaderboards at the time of each release and made the family a common baseline in subsequent document-AI research.
Because LayoutLM is a pre-trained backbone rather than a finished product, its uses are defined by what people fine-tune it to do. The dominant applications fall into a few categories.
The largest is information extraction from business documents. Companies fine-tune LayoutLM to pull structured fields out of invoices (vendor, invoice number, line items, totals), receipts (merchant, date, amount, tax) for expense management, purchase orders, bills of lading, insurance claims, and tax forms. Because the model understands layout, it generalizes across slightly different templates better than rigid rule-based extractors, which is valuable when an organization receives documents from many different senders.
A second major use is form understanding, which goes beyond pulling fixed fields to recovering the structure of a form: identifying which text is a question or label, which is the corresponding answer, and how key-value pairs are grouped. This is the task FUNSD measures, and it underlies tools that turn paper or PDF forms into structured records.
A third use is document classification and routing. Organizations that receive large volumes of mixed documents use a fine-tuned model to sort each incoming page into a category, such as contract, invoice, resume, or correspondence, so it can be routed to the right workflow. RVL-CDIP is the benchmark proxy for this task.
A fourth use, enabled strongly by versions 2 and 3, is document visual question answering, where a user asks a plain-language question about a document and the system answers by reading it. This supports search and assistant features over document collections. LayoutLMv3 also supports document layout analysis, detecting the regions of a page such as text blocks, tables, figures, and titles, which is useful as a pre-processing step for downstream parsing and for converting documents into structured or accessible formats.
In practice these systems are deployed as part of a pipeline. An OCR engine such as Tesseract or a cloud OCR service first produces words and bounding boxes, the fine-tuned LayoutLM model then labels or extracts, and post-processing assembles the output into the schema the business needs. The models are small enough by modern standards, on the order of a hundred million to a few hundred million parameters, that they can be served cost-effectively, which is part of why they remain popular even as much larger general-purpose multimodal systems have appeared.
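The post-processing stage often amounts to turning per-word labels back into fields. The sketch below assumes the fine-tuned model emits BIO-style tags (`B-` opens an entity, `I-` continues it, `O` is outside any entity), a common convention for token classification that the description above does not mandate. The label names and sample words are hypothetical.

```python
def group_entities(words, tags):
    """Merge consecutive B-/I- tagged words into (label, text) fields."""
    fields = []
    label, parts = None, []
    for word, tag in zip(words, tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:  # close the entity in progress
                fields.append((label, " ".join(parts)))
            # A B- tag, or an I- tag with no matching entity open, starts a new one.
            label, parts = (name, []) if prefix in ("B", "I") else (None, [])
        if label is not None:
            parts.append(word)
    if label is not None:
        fields.append((label, " ".join(parts)))
    return fields


words = ["Vendor:", "Acme", "Corp", "Total", "$4,200"]
tags = ["O", "B-VENDOR", "I-VENDOR", "O", "B-TOTAL"]
fields = group_entities(words, tags)
print(fields)  # [('VENDOR', 'Acme Corp'), ('TOTAL', '$4,200')]
```

A production pipeline would typically add validation on top, such as parsing amounts and dates or rejecting low-confidence fields, before writing to the target schema.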
LayoutLM has real constraints that shape where it is appropriate. The most important is its dependence on OCR. None of the three versions turns pixels into words by itself; each relies on an external OCR step to supply text and bounding boxes, so OCR errors propagate directly into the model's input. On low-quality scans, handwriting, unusual fonts, or languages the OCR engine handles poorly, accuracy suffers regardless of how good LayoutLM is, and the quality of the bounding boxes matters as much as the recognized text.
A second limitation is input length. Like other BERT-style encoders, LayoutLM has a fixed maximum sequence length, typically 512 tokens, so long multi-page documents must be split into chunks or windowed, which can break relationships that span a page boundary and complicates handling of long contracts or reports.
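One common workaround is a sliding window with overlap, so that a relationship broken at one window edge is likely to appear intact in the neighboring window. The sketch below is one such scheme under stated assumptions: a 512-token limit with two positions reserved for `[CLS]` and `[SEP]`, and an overlap of 128 tokens, which is an arbitrary illustrative choice. The function name is hypothetical.

```python
def sliding_windows(tokens, boxes, max_len=512, overlap=128):
    """Split parallel token and box lists into overlapping windows."""
    body = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be smaller than the window body")
    step = body - overlap
    windows, start = [], 0
    while True:
        end = min(start + body, len(tokens))
        # Keep the offset so predictions can be mapped back to the document.
        windows.append((start, tokens[start:end], boxes[start:end]))
        if end == len(tokens):
            return windows
        start += step


tokens = [f"tok{i}" for i in range(1200)]
boxes = [[0, 0, 0, 0] for _ in tokens]
windows = sliding_windows(tokens, boxes)
print([(offset, len(chunk)) for offset, chunk, _ in windows])
# [(0, 510), (382, 510), (764, 436)]
```

Tokens that fall in an overlap region receive predictions from two windows, so a deployment needs a merge rule, for example keeping the prediction from the window in which the token sits farther from an edge.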
A third is that the models are encoders specialized for understanding and extraction, not generators. They classify, label, and extract; they do not write free-form summaries or carry on a dialogue. For generative document tasks, practitioners reach for different architectures. The base checkpoints are also only pre-trained and must be fine-tuned on labeled data before they are useful for a specific task, which means a deployment still requires an annotated dataset, although usually a far smaller one than training from scratch would demand.
The benchmarks also carry caveats. RVL-CDIP and FUNSD are dominated by English documents from particular eras and domains, so strong benchmark numbers do not guarantee equally strong results on, say, modern multilingual invoices; LayoutXLM exists partly to address the multilingual gap. CORD is specific to receipts and is relatively small.
Licensing is a practical limitation that often surprises new users. The three versions are not licensed the same way. The original LayoutLM is released under the permissive MIT license, so it can be used commercially. LayoutLMv3, however, is released under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), a non-commercial license, and LayoutLMv2 and LayoutXLM are likewise distributed under a non-commercial research license. This means that the best-performing and most modern member of the family cannot legally be used in a commercial product under its released weights, and teams building commercial systems must either fall back to version 1, negotiate separate terms, or train their own model. Anyone adopting LayoutLM should check the license on the specific checkpoint they intend to use before building on it.
LayoutLM emerged from Microsoft Research Asia as part of a broader effort to bring the pre-training revolution that BERT had brought to natural language into the world of documents. The first paper appeared on arXiv on December 31, 2019, and was presented at KDD 2020. Its core insight, that adding two-dimensional position to a masked language model substantially improves document understanding, was simple in hindsight but had not been demonstrated at scale before, and it established the template the whole family would follow: OCR to get words and boxes, encode text and layout jointly, pre-train on the large IIT-CDIP collection of scanned documents, then fine-tune on the target task.
The second version followed quickly, submitted to arXiv on December 29, 2020, and presented at ACL 2021. It reflected lessons from the first: the image should be part of pre-training rather than an afterthought at fine-tuning, and attention itself should be made spatially aware. The resulting two-stream model with its ResNeXt-FPN backbone pushed the benchmarks higher, especially on visually demanding tasks like DocVQA, and the contemporaneous LayoutXLM extended the recipe to many languages, broadening the family's reach beyond English.
The third version, submitted to arXiv on April 18, 2022, and presented at ACM Multimedia 2022, came from a partly different author group and reflected the influence of Vision Transformers and BEiT, which had shown that images could be tokenized into patches and masked just like text. By adopting linear patch embeddings and unifying the masking objectives across both modalities, LayoutLMv3 discarded the heavy convolutional machinery of version 2 while matching or exceeding its accuracy, and it explicitly targeted both text-centric and image-centric tasks within one model. Throughout this progression the models were released openly on Hugging Face, where they have accumulated millions of downloads and a large ecosystem of fine-tuned derivatives and demonstration spaces. They sit within Microsoft's wider Document AI program alongside related models such as DiT for document image classification, TrOCR for transformer-based text recognition, and MarkupLM for HTML and XML documents. Together these established the modern playbook for visually rich document understanding, and LayoutLM remains one of the most widely used and cited starting points in the field.