Document Question Answering Models
Last reviewed
May 13, 2026
Sources
46 citations
Review status
Source-backed
Revision
v2 · 5,021 words
See also: Multimodal Models and Tasks
Document question answering models (DocQA, sometimes called DocVQA for document visual question answering) are machine learning systems that take a document image or PDF together with a natural language question and return an answer grounded in the document. The task combines optical character recognition, layout analysis, and reading comprehension into a single end-to-end problem. A working DocQA model has to read printed or handwritten text, understand the spatial arrangement of headers, tables, columns, and figures, and then perform the language reasoning needed to map the question to an answer span or a generated string.[1][2]
DocQA emerged as a distinct benchmark task in 2020 with the release of the DocVQA dataset by Mathew, Karatzas, and Jawahar at IIIT Hyderabad and the Computer Vision Center in Barcelona. The dataset, presented at WACV 2021, contained roughly 12,000 document images drawn from the UCSF Industry Documents Library and about 50,000 questions written by human annotators. Within two years, the same group released InfographicVQA, and the Robust Reading Competition added a multi-page split, expanding the task beyond single scanned pages.[1][3] By 2025, document understanding had become one of the standard ways to benchmark frontier vision-language models, with Qwen2.5-VL topping the DocVQA leaderboard at 96.4% ANLS and commercial APIs such as Mistral OCR, Google Document AI, and Anthropic Claude all offering production document pipelines.[4][5][6]
Document question answering is the task of answering a natural language question about the contents of a document presented as an image or a digitally rendered PDF. The defining property is that the model must read the document from pixels rather than from a clean text stream. Real documents include form fields, scanned receipts, tax forms, academic papers with equations and figures, financial filings with multi-column layouts, slide decks, mobile app screenshots, and infographics. Each of these formats encodes information not only in the text but also in the position of that text on the page.[1][2][7]
The DocVQA paper formalized four subtask categories that have shaped the literature since. Layout-based questions require finding text in a particular region or column. Table-based questions require reading rows and columns and combining cells. Free-text questions ask about paragraphs with little visual structure. Form-based questions ask the model to identify the value next to a printed key.[1] Subsequent benchmarks added new dimensions: chart reasoning in ChartQA, dense data-rich layouts in InfographicVQA, multi-page reasoning in MP-DocVQA and DUDE, slide decks in SlideVQA, and mobile screens in ScreenQA.[3][8][9][10][11]
DocQA pipelines tend to fall into one of three families. The first uses an OCR engine to extract text and bounding boxes, then feeds those plus the image into a layout-aware transformer such as LayoutLM, LayoutLMv2, or LayoutLMv3. The second skips OCR and trains an end-to-end image-to-text model like Donut, Pix2Struct, or UDOP. The third uses a general-purpose vision-language model such as GPT-4o, Gemini, or Claude and treats DocQA as one application of the broader VLM. Hybrid systems are common in production, where an OCR foundation model like Nougat, GOT-OCR2.0, or Mistral OCR produces clean markdown that is then fed to a large language model for question answering.[12][13][14][6][15]
The table below lists the main public datasets used to train and evaluate document question answering systems. Each emphasizes a different document type or reasoning skill.
| Dataset | Year | Documents | Questions | Focus |
|---|---|---|---|---|
| AI2D | 2016 | 4,903 diagrams | About 15,000 | Primary-school science diagrams with parse graphs[16] |
| FUNSD | 2019 | 199 forms | Entity and link annotations | Form understanding in noisy scanned documents[17] |
| SROIE | 2019 | 1,000 receipts | Key information extraction | OCR and field extraction on scanned receipts[18] |
| CORD | 2019 | About 11,000 receipts | Multi-level labels | Indonesian receipts for post-OCR semantic parsing[19] |
| RVL-CDIP | 2015 | 400,000 pages | 16-class labels | Document image classification[20] |
| PubLayNet | 2019 | Over 1 million pages | Layout segmentation | Page layout analysis from PubMed Central PDFs[21] |
| DocLayNet | 2022 | 80,863 pages | 11-class layout boxes | Hand-annotated layout across six document categories[22] |
| DocVQA | 2021 | About 12,000 images | About 50,000 questions | Single-page document VQA on industry documents[1] |
| VisualMRC | 2021 | Over 10,000 webpages | Over 30,000 abstractive QA pairs | Generative reading comprehension on web documents[23] |
| InfographicVQA | 2022 | About 5,485 infographics | About 30,000 questions | Numeric and graphical reasoning on infographics[3] |
| ChartQA | 2022 | 9,600 charts | 32,719 questions | Question answering over charts with visual and logical reasoning[8] |
| MP-DocVQA | 2023 | Multi-page documents | About 46,000 questions | Multi-page document VQA, up to 20 pages per document[9] |
| SlideVQA | 2023 | 2,619 slide decks | 14,484 questions | Multi-image reasoning across slide presentations[10] |
| ScreenQA | 2022 | About 35,000 screenshots | About 86,000 questions | Mobile app screen understanding[11] |
DocVQA was the first large-scale benchmark dedicated to question answering over document images. The questions were collected through Amazon Mechanical Turk on 12,767 images sampled from the UCSF Industry Documents Library, which holds millions of scanned reports, memos, letters, and forms from the tobacco, drug, and chemical industries. The dataset is divided into train, validation, and test splits, with a public leaderboard hosted by the Robust Reading Competition portal at the Computer Vision Center in Barcelona. Human performance on DocVQA was reported as 94.36% accuracy, which corresponds to an Average Normalized Levenshtein Similarity (ANLS) of about 0.98, and modern frontier models have closed in on or matched that level.[1][24]
InfographicVQA was published at WACV 2022 by Mathew, Bagal, Tito, Karatzas, Valveny, and Jawahar. The dataset contains around 30,000 questions over 5,485 infographics scraped from the web, with an emphasis on arithmetic and data-visualization reasoning. Most answers are short numeric strings, and the dataset is harder than DocVQA because the documents mix text, charts, icons, and design elements.[3] ChartQA, published in Findings of ACL 2022 by Masry and colleagues at York University, focused specifically on charts. It includes 9,608 human-written questions and 23,111 machine-generated ones, all paired with the underlying chart data tables so that models can be evaluated on both visual and table-grounded reasoning.[8]
MP-DocVQA, introduced by Tito, Karatzas, and Valveny in Pattern Recognition 2023, extended DocVQA to documents of up to 20 pages and required models to answer questions and identify the supporting page. The accompanying Hi-VT5 baseline used a hierarchical encoder to summarize each page before generating an answer.[9] SlideVQA, presented at AAAI 2023 by Tanaka and colleagues at NTT, asked questions over decks containing many slides and required single-hop, multi-hop, and numerical reasoning across them.[10] ScreenQA, released by Google Research in 2022 over the Rico mobile UI dataset, contained 86,000 question-answer pairs grounded in 35,000 mobile app screenshots and pushed DocQA toward UI understanding.[11]
The layout and classification datasets are not strictly DocQA, but most DocQA models depend on layout pretraining or use the datasets as transfer-learning targets. FUNSD was released by Jaume, Ekenel, and Thiran at ICDAR-OST 2019. It is a small set of 199 noisy scanned forms drawn from RVL-CDIP, with text bounding boxes, entity labels, and link annotations between fields.[17] CORD, released by NAVER CLOVA AI in 2019, includes about 11,000 Indonesian receipts with hierarchical labels for OCR and parsing.[19] SROIE was the ICDAR 2019 Scanned Receipts OCR and Information Extraction challenge with 1,000 annotated receipts.[18] RVL-CDIP, introduced by Harley, Ufkes, and Derpanis at ICDAR 2015, contains 400,000 grayscale document images across 16 categories and remains a standard pretraining and evaluation set for document classifiers.[20] PubLayNet, from IBM, was built by automatically matching PubMed Central XML to PDF rendering and contains over one million pages with bounding-box layout annotations.[21] DocLayNet, also from IBM in 2022, is a smaller but human-annotated set of 80,863 pages spanning finance, science, patents, tenders, law texts, and manuals.[22]
The LayoutLM family from Microsoft Research Asia is the most influential line of OCR-aware document understanding models. Each generation extended pretraining to richer multimodal signals, and the family powered most of the strong DocVQA results published between 2020 and 2023.
LayoutLM was introduced by Xu, Li, Cui, Huang, Wei, and Zhou at KDD 2020. The model added 2D position embeddings derived from OCR bounding boxes to a BERT backbone, so the same transformer could attend to text and to where that text sits on the page. The authors reported new state-of-the-art results on form understanding (FUNSD F1 increased from 70.72 to 79.27), receipt understanding (SROIE F1 from 94.02 to 95.24), and RVL-CDIP document classification (from 93.07 to 94.42).[12][25]
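The 2D position idea can be illustrated with a small sketch. It assumes the common LayoutLM convention of rescaling OCR box coordinates to an integer grid from 0 to 1000, and it uses tiny random lookup tables in place of learned embeddings. The names `normalize_bbox`, `layout_embedding`, and `HIDDEN` are illustrative and do not come from the paper or any library.

```python
import random

HIDDEN = 8  # toy embedding width; real models use hundreds of dimensions
rng = random.Random(0)


def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Rescale a pixel box (x0, y0, x1, y1) to an integer 0..scale grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    ]


def make_table(rows):
    """Random stand-in for a learned embedding table."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(rows)]


# One table per axis: x0 and x1 share x_table, y0 and y1 share y_table.
x_table = make_table(1001)
y_table = make_table(1001)


def layout_embedding(norm_bbox):
    """Sum the four coordinate embeddings into one layout vector."""
    x0, y0, x1, y1 = norm_bbox
    rows = (x_table[x0], y_table[y0], x_table[x1], y_table[y1])
    return [sum(values) for values in zip(*rows)]


# A word box on an 850 x 1100 pixel scan.
box = normalize_bbox((85, 110, 425, 132), page_width=850, page_height=1100)
vector = layout_embedding(box)  # would be added to the token embedding
```

In a full model this layout vector is added to the word and sequence-position embeddings before the first transformer layer, so two identical words at different places on the page get different inputs.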
LayoutLMv2, published at ACL 2021, added a visual encoder so that page pixels were fused with text and layout in a single two-stream transformer. The pretraining tasks expanded to include masked visual-language modeling, text-image alignment, and text-image matching, and the self-attention was made spatially aware so that the model could reason about relative positions between text blocks. LayoutLMv2 reached state of the art on FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, and DocVQA at the time of release.[26][27] LayoutXLM, released alongside LayoutLMv2, extended the same approach to multilingual documents.
LayoutLMv3 was published at ACM Multimedia 2022 by Huang, Lv, Cui, Lu, and Wei. The model used unified text and image masking, removing the dependence on a separately trained CNN backbone. LayoutLMv3 worked well on both text-centric tasks like form and receipt understanding and on image-centric tasks like document image classification and layout analysis, and the unified architecture made fine-tuning simpler than for v2.[28]
LiLT (Language-independent Layout Transformer) was introduced at ACL 2022 by Wang, Jin, and Ding. LiLT decoupled the textual and layout streams so that a single pretrained layout model could be paired with any monolingual or multilingual text encoder at fine-tuning time. The result was strong cross-lingual transfer: after pretraining on English documents, the model could be fine-tuned on FUNSD, XFUND, and EPHOIE, spanning seven other languages, with competitive performance.[29] ERNIE-Layout, from Baidu, was published in Findings of EMNLP 2022 by Peng and colleagues. It added spatial-aware disentangled attention, a reading-order prediction task, and a replaced-regions prediction task, and set new state-of-the-art results on key information extraction, document classification, and DocVQA at the time.[30]
DocFormer, introduced by Appalaraju, Jasani, Kota, Xie, and Manmatha at ICCV 2021, was an end-to-end encoder-only transformer with a CNN backbone for vision. Its multi-modal self-attention layer fused text, vision, and spatial features, and the authors reported strong results on FUNSD, CORD, RVL-CDIP, and DocVQA with a smaller parameter count than comparable models.[31] StrucTexT, from Baidu, and its successor StrucTexTv2 explored OCR-aware pretraining with masked image and language modeling tasks. StrucTexTv2 used only image input and avoided OCR pre-processing at inference time, making it a bridge between the OCR-aware LayoutLM family and the OCR-free models discussed below.[32][33]
A second line of work removes OCR from the pipeline entirely and trains a vision encoder to read pixels directly. The motivations are clear: OCR errors propagate to the rest of the pipeline, OCR engines need separate training for new languages, and OCR adds latency and cost at inference time.[13]
Donut (Document Understanding Transformer) was introduced at ECCV 2022 by Kim, Hong, Yim, Nam, Park, Yim, Hwang, Yun, Han, and Park at NAVER CLOVA AI. The architecture is a Swin Transformer vision encoder paired with a BART decoder that emits structured JSON or text answers. Donut is pretrained on synthetic documents generated by SynthDoG (Synthetic Document Generator) in multiple languages, then fine-tuned on tasks like CORD, RVL-CDIP, DocVQA, and TicketCorpus. The original paper showed Donut matching or beating OCR-dependent baselines such as LayoutLMv2 on document classification and parsing, and reaching competitive results on DocVQA, without ever running an OCR engine and at higher inference speed.[13]
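To make the "structured JSON" output concrete, the sketch below converts a tagged decoder string into a nested dictionary. It assumes a simplified Donut-style convention in which each field is wrapped in `<s_name>` and `</s_name>` tags. The function name `tokens_to_json` and the sample string are illustrative, and the sketch ignores repeated fields and list separators that a real decoder output can contain.

```python
import re

# Matches one '<s_name>...</s_name>' field; \1 forces the closing tag to match.
FIELD = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def tokens_to_json(sequence: str) -> dict:
    """Turn tagged decoder output into a nested dict (simplified sketch)."""
    parsed = {}
    for match in FIELD.finditer(sequence):
        name, inner = match.group(1), match.group(2)
        # A field that contains further tags is a nested object; otherwise a leaf.
        parsed[name] = tokens_to_json(inner) if "<s_" in inner else inner.strip()
    return parsed


# Hypothetical decoder output for a one-item receipt.
decoded = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>"
    "<s_total>4.50</s_total>"
)
receipt = tokens_to_json(decoded)
```

Because the schema lives in the token vocabulary, the same decoder can serve parsing, classification, and question answering simply by changing the prompt and the set of field tags.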
Pix2Struct was published at ICML 2023 as an oral presentation by Lee, Joshi, Turc, Hu, Liu, Eisenschlos, Khandelwal, Shaw, Chang, and Toutanova at Google Research. The model is pretrained by learning to parse masked screenshots of web pages into simplified HTML, an objective the authors argued subsumes OCR, language modeling, and image captioning. Pix2Struct reached state-of-the-art results on six of nine benchmarks across illustrations, user interfaces, natural images, and documents, with the largest improvements (between 1 and 44 points) coming on low-resource domains.[14]
UDOP (Unifying Vision, Text, and Layout for Universal Document Processing) was introduced as a CVPR 2023 Highlight by Tang and colleagues at Microsoft and UNC Chapel Hill. UDOP used a single Vision-Text-Layout Transformer with a prompt-based sequence generation scheme, supporting both document understanding and document generation. It also learned to generate document images from text and layout, enabling neural document editing for the first time, and it set state of the art on eight document AI tasks and topped the Document Understanding Benchmark leaderboard.[15]
General vision-language models have closed the gap with specialized DocQA models and now top most public leaderboards. The shift accelerated in 2024 as labs began including more document and OCR data in their pretraining mix.
mPLUG-DocOwl, from Alibaba DAMO Academy and Renmin University, is a multimodal large language model targeted specifically at documents. mPLUG-DocOwl 1.5 (March 2024) introduced unified structure learning across documents, webpages, tables, charts, and natural images, reaching 82.2 ANLS on DocVQA, 50.7 on InfoVQA, and 70.2 on ChartQA at 8B parameters. mPLUG-DocOwl 2, released in late 2024, focused on multi-page documents and encoded each page in only 324 tokens, allowing longer documents to fit in the context window.[34]
Alibaba Qwen2-VL achieved 97.25% on the DocVQA test set on release in 2024. Qwen2.5-VL, released in early 2025, kept the top score across most document benchmarks, with Qwen2.5-VL 72B Instruct reaching 96.4% on DocVQA. The Qwen team highlighted structured extraction from invoices, forms, HTML tables, and chemical formulas as a primary use case, alongside strong InfoVQA and ChartQA numbers.[4]
InternVL from OpenGVLab is an open-source vision-language family that has tracked the frontier on document benchmarks. InternVL2 achieved state of the art on DocVQA and InfoVQA among open-source models, and InternVL2.5, released in late 2024, doubled the dataset size while tightening filtering and reached parity with GPT-4o and Claude 3.5 Sonnet on document understanding. InternVL3, released in April 2025, refined the test-time recipe further.[35]
LLaVA-NeXT (Jan 2024) added document and chart fine-tuning data and made DocVQA a default benchmark. Idefics2 (April 2024, Hugging Face) used Mistral-7B and SigLIP, processed images in native aspect ratio, and improved OCR and document understanding over Idefics. Idefics3 (August 2024) swapped in Llama 3 and removed the perceiver, with further gains on OCR and document tasks. Microsoft Phi-3-Vision (May 2024) brought document understanding to a 4B-parameter model that could run on a single consumer GPU.[36][37]
OpenAI's GPT-4V (Sep 2023), GPT-4o (May 2024), and their successors; Google's Gemini 1.5 (Feb 2024), Gemini 2.0 (Dec 2024), and Gemini 2.5; and Anthropic's Claude 3, Claude 3.5, and their successors all support document images and PDFs as input. These are not specialized DocQA models, but they are evaluated on DocVQA, ChartQA, InfoVQA, AI2D, and OCRBench in their technical reports and have steadily approached human performance on most of these benchmarks.[5][6][38]
A related thread treats OCR itself as the primary task and lets a downstream language model answer questions on the extracted markdown. This split has become standard in production document pipelines because the OCR output can be cached, audited, and fed to multiple downstream consumers.
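The caching benefit of this split can be sketched as a content-addressed store: the extracted markdown is keyed by a hash of the document bytes, so the expensive OCR call runs once per distinct file. This is a minimal sketch under stated assumptions. `cached_markdown` and `fake_ocr` are hypothetical names, and `fake_ocr` stands in for whatever OCR model or API a real pipeline would call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_markdown(document: bytes, cache_dir: Path,
                    run_ocr: Callable[[bytes], str]) -> str:
    """Return OCR markdown for a document, reusing a cached copy if present."""
    key = hashlib.sha256(document).hexdigest()  # same bytes -> same cache entry
    cache_file = cache_dir / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = run_ocr(document)  # the only expensive step
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown


calls = []  # records how many times the OCR stand-in actually ran


def fake_ocr(document: bytes) -> str:
    """Hypothetical stand-in for a real OCR model or API call."""
    calls.append(len(document))
    return "# Invoice\n\nTotal: 42.00"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_markdown(b"%PDF-demo", Path(tmp), fake_ocr)
    second = cached_markdown(b"%PDF-demo", Path(tmp), fake_ocr)  # cache hit
```

The stored markdown files also serve as the audit trail: any downstream answer can be traced back to the exact text the language model was given.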
Nougat (Neural Optical Understanding for Academic Documents) was released by Meta AI in August 2023, authored by Blecher, Cucurull, Scialom, and Stojnic. Nougat shares Donut's architecture (Swin Transformer encoder, mBART decoder) but is trained specifically to convert academic PDFs into LaTeX-flavored markdown. It handles mathematical equations, tables, and reading order, and was trained on papers from arXiv and PubMed Central.[39]
GOT-OCR2.0 (General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model) was released in September 2024 by Wei, Liu, and colleagues. The model has 580 million parameters, combines a high-compression vision encoder with a long-context decoder, and supports plain text, formatted markdown, math (LaTeX), geometric shapes (TikZ), molecules (SMILES), and sheet music output. GOT-OCR2.0 trended at number one on Hugging Face on release and treats all artificial optical signals (text, equations, formulas, tables, charts, sheet music, and even geometric shapes) as a unified character category.[40]
olmOCR was released by the Allen Institute for AI in February 2025. It is a 7B parameter vision-language model fine-tuned on a 260,000-page corpus paired with GPT-4o outputs, designed to convert PDFs and images into clean Markdown that preserves reading order, equations, tables, and handwriting. Allen AI reported a cost of about $190 per million pages, roughly 1/32 the cost of equivalent processing through the GPT-4o batch API. olmOCR 2 (October 2025) reached 82.4 on olmOCR-Bench, almost four points above the previous release, using unit-test rewards during training.[41]
Mistral OCR was launched by Mistral AI on March 6, 2025. It uses Mistral language models to interpret the layout and content of extracted OCR output and supports PDFs and images, returning interleaved text and embedded images as markdown. The API was priced at 1,000 pages per dollar with double that rate in batch mode. Mistral reported its OCR outperforming Google Document AI, Azure OCR, Gemini 1.5 and 2.0, and GPT-4o on internal benchmarks at launch.[6] Mistral OCR 3, released in December 2025, was a smaller version targeted at structured document AI at scale.
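The pricing figures quoted for olmOCR and Mistral OCR can be put on a common scale of dollars per million pages. The arithmetic below uses only the numbers stated in the text; the implied GPT-4o batch figure is derived from the "1/32" ratio and is not a published price. The comparison is rough, since a vendor-reported self-hosted cost and an API list price are not measured the same way.

```python
PAGES = 1_000_000


def cost_from_rate(pages_per_dollar: float, pages: int = PAGES) -> float:
    """Dollars needed to process `pages` at a given pages-per-dollar rate."""
    return pages / pages_per_dollar


olmocr = 190.0                        # reported cost per million pages
gpt4o_batch_implied = olmocr * 32     # implied by the "roughly 1/32" claim
mistral_standard = cost_from_rate(1_000)  # 1,000 pages per dollar
mistral_batch = cost_from_rate(2_000)     # double the rate in batch mode

for name, dollars in [
    ("olmOCR (reported)", olmocr),
    ("Mistral OCR batch", mistral_batch),
    ("Mistral OCR standard", mistral_standard),
    ("GPT-4o batch (implied)", gpt4o_batch_implied),
]:
    print(f"{name:<24} ${dollars:>8,.0f} per million pages")
```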
IBM open-sourced Docling in July 2024. Docling is a Python toolkit that converts PDFs and other formats into JSON and Markdown suitable for retrieval-augmented generation, using DocLayNet for layout analysis and TableFormer for table structure recognition. The toolkit reached more than 30,000 GitHub stars within months of release and was hosted under the LF AI & Data Foundation. IBM reported that running specialized vision models in place of OCR could reduce errors and cut processing time by up to 30 times.[42]
The canonical metric for DocVQA and related single-page tasks is Average Normalized Levenshtein Similarity, abbreviated ANLS. The metric was proposed by Biten and colleagues at ICCV 2019 for scene-text VQA and adopted for DocVQA at WACV 2021. For each predicted answer, the metric computes the Levenshtein edit distance to the ground truth, normalizes it by the length of the longer string, and scores the answer as one minus that normalized distance. If the normalized distance reaches the threshold of 0.5, the score is set to zero, and when a question has several accepted answers, the best-matching one counts. The threshold is designed to distinguish answers that were correctly chosen but slightly miscopied from answers that were simply wrong. ANLS is case-insensitive but space-sensitive, ranges from 0 to 1, and is averaged across all questions.[24][43]
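A minimal sketch of the metric, following the standard ANLS definition: similarity is one minus the edit distance divided by the longer string's length, cut to zero once that normalized distance reaches 0.5, with the best score taken over all accepted answers. The function names are illustrative, and official evaluation scripts may apply additional text normalization that this sketch omits.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_score(prediction: str, references: list[str],
               threshold: float = 0.5) -> float:
    """Score one question: best similarity over all accepted answers."""
    pred = prediction.lower()  # case-insensitive; whitespace is left intact
    best = 0.0
    for reference in references:
        ref = reference.lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best


def anls(predictions: list[str], references: list[list[str]],
         threshold: float = 0.5) -> float:
    """Average the per-question scores over the whole evaluation set."""
    scores = [anls_score(p, r, threshold) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, predicting "1995" against a ground truth of "1996" has one edit over four characters, so it scores 0.75, while a completely different string such as "Globex" against "Acme Corp" falls past the threshold and scores 0.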
Other DocQA benchmarks use related metrics. ChartQA reports relaxed accuracy that allows numeric tolerance. InfographicVQA uses ANLS but reports separate scores for question types. SlideVQA, MP-DocVQA, and DUDE report ANLS plus retrieval metrics for the evidence page. Generative QA datasets like VisualMRC report ROUGE-L and BLEU alongside exact match.[3][8][10][23]
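Relaxed accuracy can be sketched as follows. The 5% relative tolerance used as the default is the value commonly associated with ChartQA evaluation, but it is not stated in the text above, so it is exposed as a parameter. The helper names and the number-parsing rules (stripping a trailing percent sign and thousands separators) are assumptions of this sketch.

```python
def _to_float(text: str):
    """Parse a numeric answer, returning None if it is not a number."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; text must match exactly."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    # Non-numeric answers fall back to case-insensitive exact match.
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions: list[str], targets: list[str],
                     tolerance: float = 0.05) -> float:
    """Fraction of questions whose prediction passes the relaxed match."""
    hits = [relaxed_match(p, t, tolerance) for p, t in zip(predictions, targets)]
    return sum(hits) / len(hits) if hits else 0.0
```

The tolerance exists because values read off a chart axis are inherently approximate, so a prediction of 102 for a true value of 100 is treated as correct while 110 is not.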
Three hyperscaler platforms dominate enterprise document AI in 2025, supported by a growing field of specialized vendors. The table below summarizes the main offerings.
| Platform | Vendor | Capabilities |
|---|---|---|
| Amazon Textract | AWS | Text Detection, Document Analysis, and Analyze Expense APIs covering text, tables, key-value pairs, and forms[44] |
| Azure AI Document Intelligence | Microsoft Azure | OCR, generic layout, prebuilt and custom neural or template models, with on-premises layout containers[44] |
| Google Document AI | Google Cloud | Processor ecosystem covering invoices, receipts, IDs, and forms, with layout preservation across digital and scanned files[44] |
| Hyperscience | Hyperscience | Intelligent document processing focused on human-in-the-loop accuracy |
| Rossum | Rossum | Cognitive data capture aimed at invoices and trade documents |
| Ephesoft | Ephesoft (Tungsten Automation) | Document capture and classification across regulated industries |
| Unstructured.io | Unstructured | Open-source and hosted pipelines that turn documents into LLM-ready chunks |
| LandingAI | LandingAI | Agentic document extraction; reported 99.16% on DocVQA in 2025 internal tests[4] |
Amazon Textract is strongest for AWS-first teams using the AnalyzeExpense path inside existing pipelines. Azure AI Document Intelligence (formerly Form Recognizer) tends to handle irregular or older invoices well and integrates with Microsoft Foundry. Google Document AI offers the largest processor catalog and preserves layout on both digital and scanned files. Independent benchmarks have produced mixed results across the three, with Azure outperforming AWS on irregular invoices and Google's invoice parser showing weaker line-item extraction in one 2025 study.[44]
Long-context vision-language models have shifted document QA from a per-page problem to a per-document or per-corpus problem. Gemini 1.5 Pro supported up to 10 million tokens of multimodal context in research evaluations and 1 million tokens in production at launch in February 2024, and Google's API documentation noted that a single request could process up to 1,000 PDF pages if the total stayed inside the context window. Google publicly demonstrated reasoning over 402 pages of Apollo 11 transcripts as part of the launch.[45]
Claude added PDF support starting with Claude 3.5 Sonnet in late 2024. The PDF endpoint accepts documents up to 32 MB and 100 pages, with each page consuming 1,500 to 3,000 tokens depending on content density. Claude can read both the text and the embedded images in a PDF, extract tables, generate structured JSON or Excel outputs, and cite page numbers in its answers. PDF support is available on the Claude API, on Amazon Bedrock, on Google Vertex AI, and on Microsoft Foundry.[5][46]
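The limits quoted above translate into a simple budget check. The sketch below uses only the figures from the text (32 MB, 100 pages, 1,500 to 3,000 tokens per page) and treats "32 MB" as 32 × 1024 × 1024 bytes, which is an assumption. The function name is illustrative and is not part of any SDK.

```python
MAX_PAGES = 100
MAX_BYTES = 32 * 1024 * 1024      # assumes binary megabytes
TOKENS_PER_PAGE = (1_500, 3_000)  # low and high estimates per page


def estimate_pdf_tokens(page_count: int, size_bytes: int) -> tuple[int, int]:
    """Return a (low, high) token estimate, or raise if the PDF is too large."""
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")
    low, high = TOKENS_PER_PAGE
    return page_count * low, page_count * high


low, high = estimate_pdf_tokens(page_count=40, size_bytes=5_000_000)
print(f"A 40-page PDF needs roughly {low:,} to {high:,} tokens")
```

At the 100-page ceiling the same estimate gives 150,000 to 300,000 tokens, which is why dense documents near the limit may need to be split before submission.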
GPT-4o, GPT-4.1, and later OpenAI models accept PDFs through the Files API and answer questions about them. Long-context document QA workloads are also commonly handled by chunking the document and using retrieval-augmented generation over an indexed corpus, with a vision-language model providing the per-page understanding.
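The chunk-and-retrieve pattern can be illustrated with a toy page retriever. This sketch scores each page by word overlap with the question and returns the best page numbers, which would then be passed to the model. Production systems typically use dense embeddings and a vector index instead of word overlap, and all names here are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_pages(question: str, pages: list[str], k: int = 2) -> list[int]:
    """Return up to k 1-based page numbers ranked by word overlap."""
    query = Counter(tokenize(question))
    scored = []
    for number, text in enumerate(pages, start=1):
        page = Counter(tokenize(text))
        overlap = sum(min(count, page[word]) for word, count in query.items())
        scored.append((overlap, number))
    # Highest overlap first; ties broken by earlier page.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [number for overlap, number in scored[:k] if overlap > 0]


pages = [
    "Revenue grew 12% in 2023.",
    "The board met in March.",
    "Total revenue for 2023 was $4.2 billion.",
]
selected = top_pages("What was total revenue in 2023?", pages)
```

Only the selected pages are sent to the vision-language model, which keeps the prompt within the context window regardless of the total document length.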
Document question answering is used across regulated industries where information lives in PDFs, scans, and faxed forms. Financial document analysis applies it to 10-K filings, earnings releases, and prospectuses. Insurance claims teams use DocQA to read claim forms, medical records, police reports, and receipts, then route them to the correct workflow. Healthcare extracts structured data from lab reports and discharge summaries. Legal teams pull clauses, dates, and parties out of contracts for due diligence and discovery. Receipt and expense processing automates expense reports through OCR plus key-value extraction. Scientific literature pipelines parse arXiv and journal PDFs into markdown for retrieval-augmented generation using Nougat or olmOCR. Government and tax filing assistants read IRS and HMRC forms, and customer support teams ingest product manuals into chat assistants.
Even the strongest 2025 DocQA models share a set of persistent failure modes. Numeric reasoning over charts and tables remains brittle, with models often hallucinating values that look plausible but are not in the source document. Multi-page reasoning is still weaker than single-page reasoning, especially when the answer requires combining evidence from non-adjacent pages. Handwriting accuracy varies sharply across writers, languages, and scan quality. Documents in low-resource scripts (Arabic, Tamil, Burmese, and many African scripts) are underrepresented in the public training data, and accuracy drops significantly outside Latin-script English. Tables with complex merged cells or rotated text are a recurring problem for both OCR engines and OCR-free models.[2][9][10]
Privacy and confidentiality are also active concerns. Many enterprise documents contain personally identifiable information, financial data, or legally privileged content, and sending them to a third-party API raises both regulatory and audit issues. The shift toward smaller, on-device or on-premises DocQA models (Docling, olmOCR, Mistral OCR on-premises, Azure Document Intelligence containers) is partly driven by these concerns.[42][44][6]
Finally, evaluation is itself a limitation. ANLS is forgiving of small spelling errors but cannot distinguish between an answer that is correct in spirit and one that is wrong but lexically close. Several recent papers have argued for more semantically aware metrics and richer evaluation, including the ANLS* extension proposed in 2024 for generative large language models.[43]