Document Question Answering Models

Updated Jun 9, 2026

See also: Multimodal Models and Tasks

Document question answering models (DocQA, sometimes called DocVQA for document visual question answering) are machine learning systems that take a document image or PDF together with a natural language question and return an answer grounded in the document. The task combines optical character recognition, layout analysis, and reading comprehension into a single end-to-end problem, making it a visually grounded specialization of the broader family of question answering models. A working DocQA model has to read printed or handwritten text, understand the spatial arrangement of headers, tables, columns, and figures, and then perform the language reasoning needed to map the question to an answer span or a generated string.^[1]^[2]

DocQA emerged as a distinct benchmark in 2020 with the release of the DocVQA dataset by Mathew, Karatzas, and Jawahar at IIIT Hyderabad and the Computer Vision Center in Barcelona. The dataset, presented at WACV 2021, contained roughly 12,000 document images drawn from the UCSF Industry Documents Library and about 50,000 questions written by human annotators. Within two years, the same group released InfographicVQA and the Robust Reading Competition added a multi-page split, expanding the task beyond single scanned pages.^[1]^[3] By 2026, document understanding had become one of the standard ways to benchmark frontier vision-language models. As of May 2026 the public DocVQA test leaderboard is led by Qwen3-VL 235B-A22B Instruct at 97.1% ANLS, ahead of Qwen3-VL 32B Instruct at 96.9% and Qwen2-VL 72B Instruct at 96.5%, while commercial APIs such as Mistral OCR, Google Document AI, and Anthropic Claude all offer production document pipelines.^[5]^[6]^[48]^[49]

Overview

Document question answering is the task of answering a natural language question about the contents of a document presented as an image or a digitally rendered PDF. The defining property is that the model must read the document from pixels rather than from a clean text stream. Real documents include form fields, scanned receipts, tax forms, academic papers with equations and figures, financial filings with multi-column layouts, slide decks, mobile app screenshots, and infographics. Each of these formats encodes information not only in the text but also in the position of that text on the page.^[1]^[2]^[7]

The DocVQA paper formalized four subtask categories that have shaped the literature since. Layout-based questions require finding text in a particular region or column. Table-based questions require reading rows and columns and combining cells. Free-text questions ask about paragraphs with little visual structure. Form-based questions ask the model to identify the value next to a printed key.^[1] Subsequent benchmarks added new dimensions: chart reasoning in ChartQA, dense data-rich layouts in InfographicVQA, multi-page reasoning in MP-DocVQA and DUDE, slide decks in SlideVQA, and mobile screens in ScreenQA.^[3]^[8]^[9]^[10]^[11]

DocQA pipelines tend to fall into one of three families. The first uses an OCR engine to extract text and bounding boxes, then feeds those plus the image into a layout-aware transformer such as LayoutLM, LayoutLMv2, or LayoutLMv3. The second skips OCR and trains an end-to-end image-to-text model like Donut, Pix2Struct, or UDOP. The third uses a general-purpose vision-language model such as GPT-4o, Gemini, or Claude and treats DocQA as one application of the broader VLM. Hybrid systems are common in production, where an OCR foundation model like Nougat, GOT-OCR2.0, or Mistral OCR produces clean markdown that is then fed to a large language model for question answering.^[12]^[13]^[14]^[6]^[15]

Datasets

The table below lists the main public datasets used to train and evaluate document question answering systems. Each emphasizes a different document type or reasoning skill.

Dataset	Year	Documents	Questions	Focus
AI2D	2016	4,903 diagrams	About 15,000	Primary-school science diagrams with parse graphs^[16]
FUNSD	2019	199 forms	Entity and link annotations	Form understanding in noisy scanned documents^[17]
SROIE	2019	1,000 receipts	Key information extraction	OCR and field extraction on scanned receipts^[18]
CORD	2019	About 11,000 receipts	Multi-level labels	Indonesian receipts for post-OCR semantic parsing^[19]
RVL-CDIP	2015	400,000 pages	16-class labels	Document image classification^[20]
PubLayNet	2019	Over 1 million pages	Layout segmentation	Page layout analysis from PubMed Central PDFs^[21]
DocLayNet	2022	80,863 pages	11-class layout boxes	Hand-annotated layout across six document categories^[22]
DocVQA	2021	About 12,000 images	About 50,000 questions	Single-page document VQA on industry documents^[1]
VisualMRC	2021	Over 10,000 webpages	Over 30,000 abstractive QA pairs	Generative reading comprehension on web documents^[23]
InfographicVQA	2022	5,485 infographics	About 30,000 questions	Numeric and graphical reasoning on infographics^[3]
ChartQA	2022	9,600 charts	32,719 questions	Question answering over charts with visual and logical reasoning^[8]
MP-DocVQA	2023	Multi-page documents	About 46,000 questions	Multi-page document VQA, up to 20 pages per document^[9]
DUDE	2023	5,019 documents	41,541 questions	Multi-domain, multi-page (5.72 pages average), with extractive, abstractive, list, and not-answerable questions^[47]
SlideVQA	2023	2,619 slide decks	14,484 questions	Multi-image reasoning across slide presentations^[10]
ScreenQA	2022	About 35,000 screenshots	About 86,000 questions	Mobile app screen understanding^[11]

DocVQA

DocVQA was the first large-scale benchmark dedicated to question answering over document images. The questions were collected through Amazon Mechanical Turk on 12,767 images sampled from the UCSF Industry Documents Library, which holds millions of scanned reports, memos, letters, and forms from the tobacco, drug, and chemical industries. The dataset is divided into train, validation, and test splits, with a public leaderboard hosted by the Robust Reading Competition portal at the Computer Vision Center in Barcelona. Human performance on DocVQA is 94.36% as measured by Average Normalized Levenshtein Similarity (ANLS), and modern frontier models have closed in on or matched that score.^[1]^[24]

InfographicVQA and ChartQA

InfographicVQA was published at WACV 2022 by Mathew, Bagal, Tito, Karatzas, Valveny, and Jawahar. The dataset contains around 30,000 questions over 5,485 infographics scraped from the web, with an emphasis on arithmetic and data-visualization reasoning. Most answers are short numeric strings, and the dataset is harder than DocVQA because the documents mix text, charts, icons, and design elements.^[3] ChartQA, published in Findings of ACL 2022 by Masry and colleagues at York University, focused specifically on charts. It includes 9,608 human-written questions and 23,111 machine-generated ones, all paired with the underlying chart data tables so that models can be evaluated on both visual and table-grounded reasoning.^[8]

Multi-page and multi-image datasets

MP-DocVQA, introduced by Tito, Karatzas, and Valveny in Pattern Recognition 2023, extended DocVQA to documents of up to 20 pages and required models to answer questions and identify the supporting page. The accompanying Hi-VT5 baseline used a hierarchical encoder to summarize each page before generating an answer.^[9] SlideVQA, presented at AAAI 2023 by Tanaka and colleagues at NTT, asked questions over decks containing many slides and required single-hop, multi-hop, and numerical reasoning across them.^[10] ScreenQA, released by Google Research in 2022 over the Rico mobile UI dataset, contained 86,000 question-answer pairs grounded in 35,000 mobile app screenshots and pushed DocQA toward UI understanding.^[11]

Layout and classification datasets

The layout and classification datasets are not strictly DocQA, but most DocQA models depend on layout pretraining or use the datasets as transfer-learning targets. FUNSD was released by Jaume, Ekenel, and Thiran at ICDAR-OST 2019. It is a small set of 199 noisy scanned forms drawn from RVL-CDIP, with text bounding boxes, entity labels, and link annotations between fields.^[17] CORD, released by NAVER CLOVA AI in 2019, includes about 11,000 Indonesian receipts with hierarchical labels for OCR and parsing.^[19] SROIE was the ICDAR 2019 Scanned Receipts OCR and Information Extraction challenge with 1,000 annotated receipts.^[18] RVL-CDIP, introduced by Harley, Ufkes, and Derpanis at ICDAR 2015, contains 400,000 grayscale document images across 16 categories and remains a standard pretraining and evaluation set for document classifiers.^[20] PubLayNet, from IBM, was built by automatically matching PubMed Central XML to PDF rendering and contains over one million pages with bounding-box layout annotations.^[21] DocLayNet, also from IBM in 2022, is a smaller but human-annotated set of 80,863 pages spanning finance, science, patents, tenders, law texts, and manuals.^[22]

Specialized LayoutLM family

The LayoutLM family from Microsoft Research Asia is the most influential line of OCR-aware document understanding models. Each generation extended pretraining to richer multimodal signals, and the family powered most of the strong DocVQA results published between 2020 and 2023.

LayoutLM

LayoutLM was introduced by Xu, Li, Cui, Huang, Wei, and Zhou at KDD 2020. The model added 2D position embeddings derived from OCR bounding boxes to a BERT backbone, so the same transformer could attend to text and to where that text sits on the page. The authors reported new state-of-the-art results on form understanding (FUNSD F1 increased from 70.72 to 79.27), receipt understanding (SROIE F1 from 94.02 to 95.24), and RVL-CDIP document classification (from 93.07 to 94.42).^[12]^[25]
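The 2D position input described above can be illustrated with a small sketch. It assumes, as public LayoutLM tooling commonly does, that OCR boxes are rescaled to a 0 to 1000 integer grid before being embedded; the `OcrWord` type and both function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR token with its pixel-space box (hypothetical helper type)."""
    text: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


def normalize_box(box, page_width, page_height, scale=1000):
    """Rescale a pixel box to an integer grid so pages of any size share one coordinate space."""
    x0, y0, x1, y1 = box

    def clamp(value):
        # Keep coordinates inside [0, scale] even if OCR boxes spill past the page edge.
        return max(0, min(scale, int(value)))

    return (
        clamp(scale * x0 / page_width),
        clamp(scale * y0 / page_height),
        clamp(scale * x1 / page_width),
        clamp(scale * y1 / page_height),
    )


def to_layout_inputs(words, page_width, page_height):
    """Split OCR output into the parallel text and box lists a layout-aware encoder consumes."""
    texts = [word.text for word in words]
    boxes = [normalize_box(word.box, page_width, page_height) for word in words]
    return texts, boxes
```

Each normalized coordinate can then index a learned embedding table, which is how the same transformer sees both the word and its position on the page.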

LayoutLMv2 and LayoutXLM

LayoutLMv2, published at ACL 2021, added a visual encoder so that page pixels were fused with text and layout in a single two-stream transformer. The pretraining tasks expanded to include masked visual-language modeling, text-image alignment, and text-image matching, and the self-attention was made spatially aware so that the model could reason about relative positions between text blocks. LayoutLMv2 reached state of the art on FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, and DocVQA at the time of release.^[26]^[27] LayoutXLM, released alongside LayoutLMv2, extended the same approach to multilingual documents.

LayoutLMv3

LayoutLMv3 was published at ACM Multimedia 2022 by Huang, Lv, Cui, Lu, and Wei. The model used unified text and image masking, removing the dependence on a separately trained CNN backbone. LayoutLMv3 worked well on both text-centric tasks like form and receipt understanding and on image-centric tasks like document image classification and layout analysis, and the unified architecture made fine-tuning simpler than for v2.^[28]

LiLT and ERNIE-Layout

LiLT (Language-independent Layout Transformer) was introduced at ACL 2022 by Wang, Jin, and Ding. LiLT decoupled the textual and layout streams so that a single layout pretrained model could be paired with any monolingual or multilingual text encoder at fine-tuning time. The result was strong cross-lingual transfer: after pretraining on English documents only, the model could be fine-tuned on benchmarks such as FUNSD, XFUND, and EPHOIE, spanning seven other languages, with competitive performance.^[29] ERNIE-Layout, from Baidu, was published at Findings of EMNLP 2022 by Peng and colleagues. It added a spatial-aware disentangled attention, a reading-order prediction task, and a replaced-regions prediction task, and set new state-of-the-art results on key information extraction, document classification, and DocVQA at the time.^[30]

DocFormer and StrucTexT

DocFormer, introduced by Appalaraju, Jasani, Kota, Xie, and Manmatha at ICCV 2021, was an end-to-end encoder-only transformer with a CNN backbone for vision. Its multi-modal self-attention layer fused text, vision, and spatial features, and the authors reported strong results on FUNSD, CORD, RVL-CDIP, and DocVQA with a smaller parameter count than comparable models.^[31] StrucTexT, from Baidu, and its successor StrucTexTv2 explored OCR-aware pretraining with masked image and language modeling tasks. StrucTexTv2 used only image input and avoided OCR pre-processing at inference time, making it a bridge between the OCR-aware LayoutLM family and the OCR-free models discussed below.^[32]^[33]

OCR-free models

A second line of work removes OCR from the pipeline entirely and trains a vision encoder to read pixels directly. The motivations are clear: OCR errors propagate to the rest of the pipeline, OCR engines need separate training for new languages, and OCR adds latency and cost at inference time.^[13]

Donut

Donut (Document Understanding Transformer) was introduced at ECCV 2022 by Kim, Hong, Yim, Nam, Park, Yim, Hwang, Yun, Han, and Park at NAVER CLOVA AI. The architecture is a Swin Transformer vision encoder paired with a BART decoder that emits structured JSON or text answers. Donut is pretrained on synthetic documents generated by SynthDoG (Synthetic Document Generator) in multiple languages, then fine-tuned on tasks like CORD, RVL-CDIP, DocVQA, and TicketCorpus. The original paper showed Donut matching or beating LayoutLMv2 on document classification, parsing, and DocVQA without ever running an OCR engine, and at higher inference speed.^[13]
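To make the structured output concrete, here is a minimal sketch of turning a decoder string into a nested dictionary. It assumes fields are wrapped in `<s_key>...</s_key>` tags, which is how the public Donut code serializes them to the best of this sketch's knowledge; the sample string, the function name, and the simplified handling of repeated keys are illustrative assumptions, not the library's own parser.

```python
import re

# Matches one <s_key>...</s_key> field; the backreference ties the closing tag to its opener.
FIELD_PATTERN = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def tokens_to_json(sequence: str) -> dict:
    """Convert a tag-delimited decoder string into a nested dict (simplified sketch)."""
    result = {}
    for match in FIELD_PATTERN.finditer(sequence):
        key, inner = match.group(1), match.group(2)
        # Recurse when the field contains nested fields, otherwise keep the text.
        value = tokens_to_json(inner) if "<s_" in inner else inner.strip()
        if key in result:
            # Repeated keys are collected into a list (a simplification for this sketch).
            existing = result[key]
            result[key] = existing + [value] if isinstance(existing, list) else [existing, value]
        else:
            result[key] = value
    return result


# Hypothetical receipt output, made up for illustration.
sample = "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>"
parsed = tokens_to_json(sample)
```

Because the decoder writes the fields itself, no separate OCR step or post-OCR key-value matcher is needed; a field nested inside another field of the same name would need a proper parser rather than this regex.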

Pix2Struct

Pix2Struct was published at ICML 2023 as an oral presentation by Lee, Joshi, Turc, Hu, Liu, Eisenschlos, Khandelwal, Shaw, Chang, and Toutanova at Google Research. The model is pretrained by learning to parse masked screenshots of web pages into simplified HTML, an objective the authors argued subsumes OCR, language modeling, and image captioning. Pix2Struct reached state-of-the-art results on six of nine benchmarks across illustrations, user interfaces, natural images, and documents, with the largest improvements (between 1 and 44 points) coming on low-resource domains.^[14]

UDOP

UDOP (Unifying Vision, Text, and Layout for Universal Document Processing) was introduced as a CVPR 2023 Highlight by Tang and colleagues at Microsoft and UNC Chapel Hill. UDOP used a single Vision-Text-Layout Transformer with a prompt-based sequence generation scheme, supporting both document understanding and document generation. It also learned to generate document images from text and layout, enabling neural document editing for the first time, and it set state of the art on eight document AI tasks and topped the Document Understanding Benchmark leaderboard.^[15]

General VLMs adapted to documents

General vision-language models have closed the gap with specialized DocQA models and now top most public leaderboards. The shift accelerated in 2024 as labs began including more document and OCR data in their pretraining mix.

mPLUG-DocOwl

mPLUG-DocOwl, from Alibaba DAMO Academy and Renmin University, is a multimodal large language model targeted specifically at documents. mPLUG-DocOwl 1.5 (March 2024) introduced unified structure learning across documents, webpages, tables, charts, and natural images, reaching 82.2 ANLS on DocVQA, 50.7 on InfoVQA, and 70.2 on ChartQA at 8B parameters. mPLUG-DocOwl 2, released in late 2024, focused on multi-page documents and encoded each page in only 324 tokens, allowing longer documents to fit in the context window.^[34]

Qwen2-VL, Qwen2.5-VL, and Qwen3-VL

The Alibaba Qwen-VL line has held the top of the public document leaderboards since 2024. Qwen2-VL achieved 96.5% ANLS on the DocVQA test set on release in 2024. Qwen2.5-VL, released in early 2025, kept the lead across most document benchmarks, with Qwen2.5-VL 72B Instruct reaching 96.4% on DocVQA. The Qwen team highlighted structured extraction from invoices, forms, HTML tables, and chemical formulas as a primary use case, alongside strong InfoVQA and ChartQA numbers.^[4] Qwen3-VL, with a technical report released in November 2025, spans dense models from 2B to 32B parameters and mixture-of-experts variants at 30B-A3B and 235B-A22B, natively supports interleaved contexts of up to 256K tokens, and ships in both Instruct and reasoning-oriented Thinking editions. Qwen3-VL 235B-A22B Instruct led the DocVQA test leaderboard at 97.1% ANLS as of May 2026, and even the 8B Instruct model reached 96.1%.^[48]^[49]

InternVL

InternVL from OpenGVLab is an open-source vision-language family that has tracked the frontier on document benchmarks. InternVL2 achieved state of the art on DocVQA and InfoVQA among open-source models, and InternVL2.5, released in late 2024, doubled the dataset size while tightening filtering and reached parity with GPT-4o and Claude 3.5 Sonnet on document understanding. InternVL3, released in April 2025, refined the training and test-time recipe further, with InternVL3-8B scoring 92.7 and InternVL3-2B scoring 88.3 on DocVQA, and the report noting that the series outperforms the Qwen2.5-VL line on several multimodal benchmarks while a gap remains on others.^[35]^[50]

LLaVA, Idefics, Phi-Vision

LLaVA-NeXT (Jan 2024) added document and chart fine-tuning data and made DocVQA a default benchmark. Idefics2 (April 2024, Hugging Face) used Mistral-7B and SigLIP, processed images in native aspect ratio, and improved OCR and document understanding over Idefics. Idefics3 (August 2024) swapped in Llama 3 and removed the perceiver, with further gains on OCR and document tasks. Microsoft Phi-3-Vision (May 2024) brought document understanding to a 4B-parameter model that could run on a single consumer GPU.^[36]^[37]

Frontier proprietary VLMs

OpenAI's GPT-4V (Sep 2023), GPT-4o (May 2024), and their successors; Google's Gemini 1.5 (Feb 2024), Gemini 2.0 (Dec 2024), and Gemini 2.5; and Anthropic's Claude 3, Claude 3.5, and their successors all support document images and PDFs as input. These are not specialized DocQA models, but their technical reports evaluate them on DocVQA, ChartQA, InfoVQA, AI2D, and OCRBench, and they have steadily approached human performance on most of these benchmarks.^[5]^[6]^[38]

OCR foundation models

A related thread treats OCR itself as the primary task and lets a downstream language model answer questions on the extracted markdown. This split has become standard in production document pipelines because the OCR output can be cached, audited, and fed to multiple downstream consumers.

Nougat

Nougat (Neural Optical Understanding for Academic Documents) was released by Meta AI in August 2023, authored by Blecher, Cucurull, Scialom, and Stojnic. Nougat shares Donut's architecture (Swin Transformer encoder, mBART decoder) but is trained specifically to convert academic PDFs into LaTeX-flavored markdown. It handles mathematical equations, tables, and reading order, and was trained on papers from arXiv and PubMed Central.^[39]

GOT-OCR2.0

GOT-OCR2.0 (General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model) was released in September 2024 by Wei, Liu, and colleagues. The model has 580 million parameters, combines a high-compression vision encoder with a long-context decoder, and supports plain text, formatted markdown, math formulas (LaTeX), geometric shapes (TikZ), molecules (SMILES), and sheet music output. GOT-OCR2.0 trended at number one on Hugging Face on release and treats all artificial optical signals (text, equations, formulas, tables, charts, sheet music, and even geometric shapes) as a unified character category.^[40]

olmOCR

olmOCR was released by the Allen Institute for AI in February 2025. It is a 7B parameter vision-language model fine-tuned on a 260,000-page corpus paired with GPT-4o outputs, designed to convert PDFs and images into clean Markdown that preserves reading order, equations, tables, and handwriting. Allen AI reported a cost of about $190 per million pages, roughly 1/32 the cost of equivalent processing through the GPT-4o batch API. olmOCR 2 (October 2025) reached 82.4 on olmOCR-Bench, almost four points above the previous release, using unit-test rewards during training.^[41]

Mistral OCR

Mistral OCR was launched by Mistral AI on March 6, 2025. It uses Mistral language models to interpret the layout and content of extracted OCR output and supports PDFs and images, returning interleaved text and embedded images as markdown. The API was priced at 1,000 pages per dollar with double that rate in batch mode. Mistral reported its OCR outperforming Google Document AI, Azure OCR, Gemini 1.5 and 2.0, and GPT-4o on internal benchmarks at launch.^[6] Mistral OCR 3, released in December 2025, was a smaller version targeted at structured document AI at scale.
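The pricing figures quoted for olmOCR and Mistral OCR use different units, so a short calculation helps put them on a common per-million-pages footing. The sketch below uses only the numbers reported in the text; the GPT-4o batch figure is derived from the stated 1/32 ratio rather than taken from a price list, and all names are illustrative.

```python
def cost_per_million_pages(pages_per_dollar: float) -> float:
    """Convert a pages-per-dollar rate into US dollars per one million pages."""
    return 1_000_000 / pages_per_dollar


OLMOCR_USD_PER_MILLION = 190.0  # figure reported by Allen AI

costs = {
    "olmOCR (reported)": OLMOCR_USD_PER_MILLION,
    # Implied by "roughly 1/32 the cost", so treat it as approximate.
    "GPT-4o batch (implied)": OLMOCR_USD_PER_MILLION * 32,
    "Mistral OCR (standard)": cost_per_million_pages(1_000),
    # Batch mode doubles the pages-per-dollar rate.
    "Mistral OCR (batch)": cost_per_million_pages(2_000),
}

for name, usd in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${usd:,.0f} per million pages")
```

These are launch-time figures as quoted; the olmOCR number presumably covers only the compute for running the model and excludes engineering and operations overhead, so the comparison is indicative rather than exact.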

Docling and DocLayNet

IBM open-sourced Docling in July 2024. Docling is a Python toolkit that converts PDFs and other formats into JSON and Markdown suitable for retrieval-augmented generation, using DocLayNet for layout analysis and TableFormer for table structure recognition. The toolkit reached more than 30,000 GitHub stars within months of release and was hosted under the LF AI & Data Foundation. IBM reported that running specialized vision models in place of OCR could reduce errors and cut processing time by up to 30 times.^[42]

DeepSeek-OCR

DeepSeek-OCR was released by DeepSeek in October 2025, authored by Wei, Sun, and Li. The model pairs a DeepEncoder vision encoder, designed to keep activations low under high-resolution input while compressing aggressively, with a DeepSeek3B-MoE-A570M decoder. Its central idea, "contexts optical compression," renders text into a 2D visual form so that a long document can be represented in far fewer vision tokens than the equivalent text tokens. DeepSeek reported about 97% OCR precision at a compression ratio under 10 times and roughly 60% at 20 times, and on OmniDocBench it surpassed GOT-OCR2.0 using only 100 vision tokens per page and exceeded MinerU2.0 while using fewer than 800. The approach is positioned both as a cheaper OCR engine and as a possible route to longer effective context for large language models.^[51]
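The compression figures above are easier to read with the arithmetic spelled out. This sketch assumes the compression ratio means text tokens divided by vision tokens for the same content; the function names are illustrative and not part of any DeepSeek API.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of text tokens to the vision tokens used to represent the same page."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def text_tokens_covered(vision_tokens: int, ratio: float) -> int:
    """How many text tokens a vision-token budget stands in for at a given ratio."""
    return int(vision_tokens * ratio)


# A page holding 1,000 text tokens rendered into 100 vision tokens sits at a 10x ratio,
# the edge of the range where the report cites about 97% precision.
ratio = compression_ratio(1_000, 100)
covered = text_tokens_covered(800, 10)
```

At the reported 20x operating point the same 100 vision tokens would stand in for about 2,000 text tokens, at the cost of precision falling to roughly 60%.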

Compact open document-parsing models

A wave of small open document-parsing VLMs arrived in late 2025, narrowing the cost gap with the hyperscaler APIs. Baidu released PaddleOCR-VL, a 0.9B-parameter model that pairs a PP-DocLayoutV2 layout stage with a compact recognition VLM and targets multilingual parsing of text, tables, formulas, and charts.^[52] Community models such as dots.ocr reached table and formula recognition quality competitive with much larger systems on OmniDocBench, the CVPR 2025 document-parsing benchmark that has become a default evaluation alongside DocVQA.^[53] These models are typically run as the OCR front end of a retrieval-augmented generation or DocQA pipeline rather than answering questions directly.

Evaluation metrics

The canonical metric for DocVQA and related single-page tasks is Average Normalized Levenshtein Similarity, abbreviated ANLS. The metric was proposed by Biten and colleagues at ICCV 2019 for scene-text VQA and adopted for DocVQA at WACV 2021. For each predicted answer, the metric computes the edit distance to the ground truth, normalized by the length of the longer string, and scores the answer as one minus that distance. When the normalized distance reaches 0.5 or more, the score is set to zero. The threshold is designed to distinguish answers that were correctly chosen but slightly miscopied from answers that were simply wrong. ANLS is case-insensitive but space-sensitive, ranges from 0 to 1, and is averaged across all questions.^[24]^[43]
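The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the official evaluation script: it lowercases both strings, takes the best score over all accepted ground-truth answers for a question, and applies the 0.5 threshold to the normalized distance. Official scripts may normalize whitespace differently, and the function names here are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_single(prediction: str, gold_answers: list[str], threshold: float = 0.5) -> float:
    """Score one prediction against every accepted answer and keep the best."""
    pred = prediction.lower()
    best = 0.0
    for gold in gold_answers:
        ref = gold.lower()
        longest = max(len(pred), len(ref))
        distance = levenshtein(pred, ref) / longest if longest else 0.0
        # Answers too far from the reference get no partial credit.
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def anls(predictions: list[str], references: list[list[str]], threshold: float = 0.5) -> float:
    """Average the per-question scores over the whole evaluation set."""
    scores = [anls_single(p, g, threshold) for p, g in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, predicting "helo" for the answer "hello" is one edit over five characters, so it keeps partial credit, while predicting "cat" for "dog" scores zero.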

Other DocQA benchmarks use related metrics. ChartQA reports relaxed accuracy that allows numeric tolerance. InfographicVQA uses ANLS but reports separate scores for question types. SlideVQA, MP-DocVQA, and DUDE report ANLS plus retrieval metrics for the evidence page. Generative QA datasets like VisualMRC report ROUGE-L and BLEU alongside exact match.^[3]^[8]^[10]^[23] For end-to-end document parsing rather than question answering, OmniDocBench (CVPR 2025) measures edit distance and table structure scores across layout, text, formula, and table extraction, and has become a default benchmark for the 2025 generation of OCR models.^[53]
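The relaxed accuracy used for ChartQA can be sketched as follows. The 5% default tolerance reflects common practice for this benchmark but is not stated in the text above, so treat it as an assumption; the number-parsing rules and the function name are likewise illustrative.

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; other answers need an exact match."""

    def to_float(text: str):
        # Strip thousands separators and a trailing percent sign before parsing.
        cleaned = text.strip().replace(",", "").rstrip("%")
        try:
            return float(cleaned)
        except ValueError:
            return None

    pred_num, target_num = to_float(prediction), to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    # Non-numeric answers fall back to a case-insensitive exact comparison.
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of questions whose prediction passes the relaxed match."""
    if not targets:
        return 0.0
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

The tolerance exists because values read off a chart axis are rarely exact, so a reading of 104 for a true value of 100 counts as correct while 110 does not.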

Commercial and enterprise platforms

Three hyperscaler platforms dominate enterprise document AI in 2025, supported by a growing field of specialized vendors. The table below summarizes the main offerings.

Platform	Vendor	Capabilities
Amazon Textract	AWS	Text Detection, Document Analysis, and Analyze Expense APIs covering text, tables, key-value pairs, and forms^[44]
Azure AI Document Intelligence	Microsoft Azure	OCR, generic layout, prebuilt and custom neural or template models, with on-premises layout containers^[44]
Google Document AI	Google Cloud	Processor ecosystem covering invoices, receipts, IDs, and forms, with layout preservation across digital and scanned files^[44]
Hyperscience	Hyperscience	Intelligent document processing focused on human-in-the-loop accuracy
Rossum	Rossum	Cognitive data capture aimed at invoices and trade documents
Ephesoft	Ephesoft (Tungsten Automation)	Document capture and classification across regulated industries
Unstructured.io	Unstructured	Open-source and hosted pipelines that turn documents into LLM-ready chunks
LandingAI	LandingAI	Agentic document extraction; reported 99.16% on DocVQA in 2025 internal tests^[4]

Amazon Textract is strongest for AWS-first teams using the AnalyzeExpense path inside existing pipelines. Azure AI Document Intelligence (formerly Form Recognizer) tends to handle irregular or older invoices well and integrates with Microsoft Foundry. Google Document AI offers the largest processor catalog and preserves layout on both digital and scanned files. Independent benchmarks have produced mixed results across the three, with Azure outperforming AWS on irregular invoices and Google's invoice parser showing weaker line-item extraction in one 2025 study.^[44]

Long-context VLMs for documents

Long-context vision-language models have shifted document QA from a per-page problem to a per-document or per-corpus problem. Gemini 1.5 Pro supported up to 10 million tokens of multimodal context in research evaluations and 1 million tokens in production at launch in February 2024, and Google's API documentation noted that a single request could process up to 1,000 PDF pages if the total stayed inside the context window. Google publicly demonstrated reasoning over 402 pages of Apollo 11 transcripts as part of the launch.^[45]

Claude added PDF support starting with Claude 3.5 Sonnet in late 2024. The PDF endpoint accepts documents up to 32 MB and 100 pages, with each page consuming 1,500 to 3,000 tokens depending on content density. Claude can read both the text and the embedded images in a PDF, extract tables, generate structured JSON or Excel outputs, and cite page numbers in its answers. PDF support is available on the Claude API, on Amazon Bedrock, on Google Vertex AI, and on Microsoft Foundry.^[5]^[46]
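A quick pre-flight check can be built from the limits quoted above. The sketch below only does arithmetic on the stated figures and does not call any API; reading 32 MB as 32 × 1024 × 1024 bytes is an assumption, and the constant and function names are hypothetical.

```python
MAX_PDF_BYTES = 32 * 1024 * 1024   # 32 MB, interpreted here as MiB (assumption)
MAX_PDF_PAGES = 100
TOKENS_PER_PAGE = (1_500, 3_000)   # low and high estimates from the text


def estimate_pdf_tokens(page_count: int) -> tuple[int, int]:
    """Return a (low, high) token estimate for a PDF of the given length."""
    low, high = TOKENS_PER_PAGE
    return page_count * low, page_count * high


def check_pdf_limits(size_bytes: int, page_count: int) -> list[str]:
    """List the quoted limits a PDF would exceed; an empty list means it fits."""
    problems = []
    if size_bytes > MAX_PDF_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_PDF_BYTES}-byte limit")
    if page_count > MAX_PDF_PAGES:
        problems.append(f"document has {page_count} pages, above the {MAX_PDF_PAGES}-page limit")
    return problems


low, high = estimate_pdf_tokens(40)
```

A 40-page report would therefore consume somewhere between 60,000 and 120,000 input tokens before the question itself is counted, which is worth knowing before deciding whether to send a whole document or split it.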

GPT-4o, GPT-4.1, and later OpenAI models accept PDFs through the Files API and answer questions about them. Long-context document QA workloads are also commonly handled by chunking the document and using retrieval-augmented generation over an indexed corpus, with a vision-language model providing the per-page understanding.
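The chunk-and-retrieve pattern can be illustrated with a deliberately simple sketch that treats each page as one chunk and ranks pages by word overlap with the question. Production systems typically use embedding search instead; the lexical scoring, the function names, and the sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_pages(question: str, pages: list[str], top_k: int = 3) -> list[int]:
    """Return 1-based page numbers ordered by word overlap with the question."""
    question_counts = Counter(tokenize(question))
    scored = []
    for page_number, text in enumerate(pages, start=1):
        page_counts = Counter(tokenize(text))
        overlap = sum(min(count, page_counts[token]) for token, count in question_counts.items())
        scored.append((overlap, page_number))
    # Highest overlap first; ties resolve to the earlier page.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page_number for overlap, page_number in scored[:top_k] if overlap > 0]


# Hypothetical page texts, for example produced by an OCR front end.
pages = [
    "Cover page and table of contents",
    "Total revenue in 2023 was 4.2 million dollars",
    "Appendix with legal notices",
]
selected = rank_pages("What was the total revenue in 2023?", pages, top_k=1)
```

Only the selected pages are then passed to the vision-language model, which keeps the request inside the context window and lets the answer cite the page it came from.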

Use cases

Document question answering is used across regulated industries where information lives in PDFs, scans, and faxed forms. Financial document analysis applies it to 10-K filings, earnings releases, and prospectuses. Insurance claims teams use DocQA to read claim forms, medical records, police reports, and receipts, then route them to the correct workflow. Healthcare extracts structured data from lab reports and discharge summaries. Legal teams pull clauses, dates, and parties out of contracts for due diligence and discovery. Receipt and expense processing automates expense reports through OCR plus key-value extraction. Scientific literature pipelines parse arXiv and journal PDFs into markdown for retrieval-augmented generation using Nougat or olmOCR. Government and tax filing assistants read IRS and HMRC forms, and customer support teams ingest product manuals into chat assistants.

Limitations

Even the strongest 2025 DocQA models share a set of persistent failure modes. Numeric reasoning over charts and tables remains brittle, with models often hallucinating values that look plausible but are not in the source document. Multi-page reasoning is still weaker than single-page reasoning, especially when the answer requires combining evidence from non-adjacent pages. Handwriting accuracy varies sharply across writers, languages, and scan quality. Documents in low-resource scripts (Arabic, Tamil, Burmese, and many African scripts) are underrepresented in the public training data, and accuracy drops significantly outside Latin-script English. Tables with complex merged cells or rotated text are a recurring problem for both OCR engines and OCR-free models.^[2]^[9]^[10]

Privacy and confidentiality are also active concerns. Many enterprise documents contain personally identifiable information, financial data, or legally privileged content, and sending them to a third-party API raises both regulatory and audit issues. The shift toward smaller, on-device or on-premises DocQA models (Docling, olmOCR, Mistral OCR on-premises, Azure Document Intelligence containers) is partly driven by these concerns.^[42]^[44]^[6]

Finally, evaluation is itself a limitation. ANLS is forgiving of small spelling errors but cannot distinguish between an answer that is correct in spirit and one that is wrong but lexically close. Several recent papers have argued for more semantically aware metrics and richer evaluation, including the ANLS* extension proposed in 2024 for generative large language models.^[43]

References

[1] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. "DocVQA: A Dataset for VQA on Document Images." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2021. https://openaccess.thecvf.com/content/WACV2021/papers/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.pdf
[2] Robust Reading Competition. "Document Visual Question Answering Challenge." Computer Vision Center, Universitat Autònoma de Barcelona. https://rrc.cvc.uab.es/?ch=17
[3] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. "InfographicVQA." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022. https://openaccess.thecvf.com/content/WACV2022/papers/Mathew_InfographicVQA_WACV_2022_paper.pdf
[4] "DocVQA Benchmark Leaderboard." llm-stats.com. https://llm-stats.com/benchmarks/docvqa
[5] Anthropic. "PDF support." Claude API Documentation. https://platform.claude.com/docs/en/build-with-claude/pdf-support
[6] Mistral AI. "Mistral OCR." March 7, 2025. https://mistral.ai/news/mistral-ocr
[7] Document Visual Question Answering 2020 Challenge. https://www.docvqa.org/challenges/challenge-2020
[8] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. "ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning." Findings of the Association for Computational Linguistics: ACL 2022. https://aclanthology.org/2022.findings-acl.177/
[9] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. "Hierarchical multimodal transformers for Multi-Page DocVQA." Pattern Recognition 2023, arXiv:2212.05935. https://arxiv.org/abs/2212.05935
[10] Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. "SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images." Proceedings of the AAAI Conference on Artificial Intelligence 2023. https://arxiv.org/abs/2301.04883
[11] Yu-Chung Hsiao, Fedir Zubach, Maria Wang, and Jindong Chen. "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots." arXiv:2209.08199, 2022. https://arxiv.org/abs/2209.08199
[12] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." Proceedings of KDD 2020. https://arxiv.org/abs/1912.13318
[13] Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. "OCR-Free Document Understanding Transformer." Proceedings of ECCV 2022. https://arxiv.org/abs/2111.15664
[14] Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding." Proceedings of ICML 2023. https://arxiv.org/abs/2210.03347
[15] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. "Unifying Vision, Text, and Layout for Universal Document Processing." CVPR 2023 Highlight. https://arxiv.org/abs/2212.02623
[16] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. "A Diagram is Worth a Dozen Images." Proceedings of ECCV 2016, pp. 235-251. https://prior.allenai.org/projects/diagram-understanding
[17] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. "FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents." ICDAR-OST 2019. https://arxiv.org/abs/1905.13538
[18] ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE). https://github.com/zzzDavid/ICDAR-2019-SROIE
[19] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. "CORD: A Consolidated Receipt Dataset for Post-OCR Parsing." Workshop on Document Intelligence at NeurIPS 2019. https://github.com/clovaai/cord
Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval." ICDAR 2015. https://adamharley.com/icdar15/ ↩
Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. "PubLayNet: Largest Dataset Ever for Document Layout Analysis." ICDAR 2019 Best Paper. https://github.com/ibm-aur-nlp/PubLayNet ↩
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. "DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation." KDD 2022. https://arxiv.org/abs/2206.01062 ↩
Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. "VisualMRC: Machine Reading Comprehension on Document Images." Proceedings of AAAI 2021. https://arxiv.org/abs/2101.11272 ↩
Ali Furkan Biten et al. "Scene Text Visual Question Answering." ICCV 2019. (origin of the ANLS metric used by DocVQA) ↩
Microsoft Research. "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." https://www.microsoft.com/en-us/research/publication/layoutlm-pre-training-of-text-and-layout-for-document-image-understanding/ ↩
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. "LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding." ACL 2021. https://aclanthology.org/2021.acl-long.201/ ↩
Yang Xu et al. "LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding." arXiv:2012.14740, 2020. https://arxiv.org/abs/2012.14740 ↩
Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking." ACM Multimedia 2022. https://arxiv.org/abs/2204.08387 ↩
Jiapeng Wang, Lianwen Jin, and Kai Ding. "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding." ACL 2022. https://aclanthology.org/2022.acl-long.534/ ↩
Qiming Peng et al. "ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding." Findings of EMNLP 2022. https://arxiv.org/abs/2210.06155 ↩
Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. "DocFormer: End-to-End Transformer for Document Understanding." ICCV 2021. https://arxiv.org/abs/2106.11539 ↩
Yulin Li et al. "StrucTexT: Structured Text Understanding with Multi-Modal Transformers." ACM MM 2021. https://arxiv.org/abs/2108.02923 ↩
Yuechen Yu et al. "StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training." ICLR 2023. https://arxiv.org/abs/2303.00289 ↩
Jiabo Ye et al. "mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding." arXiv:2307.02499; mPLUG-DocOwl 1.5 and 2 follow-ups, 2024. https://github.com/X-PLUG/mPLUG-DocOwl ↩
InternVL2.5 Blog. OpenGVLab. December 5, 2024. https://internvl.github.io/blog/2024-12-05-InternVL-2.5/ ↩
Hugging Face. "Introducing Idefics2: A Powerful 8B Vision-Language Model for the community." April 2024. https://huggingface.co/blog/idefics2 ↩
Hugging Face. "Idefics3 model card." August 2024. https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3 ↩
Google. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530, 2024. https://arxiv.org/abs/2403.05530 ↩
Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. "Nougat: Neural Optical Understanding for Academic Documents." Meta AI, August 2023. https://arxiv.org/abs/2308.13418 ↩
Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, and Xiangyu Zhang. "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model." arXiv:2409.01704, September 2024. https://arxiv.org/abs/2409.01704 ↩
Jake Poznanski et al. "olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models." Allen Institute for AI, February 2025. https://olmocr.allenai.org/papers/olmocr.pdf ↩
IBM Research. "Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion." 2024. https://research.ibm.com/blog/docling-generative-AI ↩
David Peer, Philemon Schöpf, Volckmar Nebendahl, Alexander Rietzler, and Sebastian Stabinger. "ANLS* -- A Universal Document Processing Metric for Generative Large Language Models." arXiv:2402.03848, 2024. https://arxiv.org/abs/2402.03848 ↩
Robert Vamosi. "Review: Document parsing in AWS, Azure, and Google Cloud." InfoWorld, 2025. https://www.infoworld.com/article/2271149/review-document-parsing-in-aws-azure-and-google-cloud.html ↩
Google DeepMind. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530, March 2024. https://arxiv.org/abs/2403.05530 ↩
Anthropic. "Claude 3.5 Sonnet Model Card Addendum." 2024. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf ↩
Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, and Tomasz Stanisławek. "Document Understanding Dataset and Evaluation (DUDE)." ICCV 2023, arXiv:2305.08455. https://arxiv.org/abs/2305.08455 Accessed 2026-05-31. ↩
Qwen Team, Alibaba. "Qwen3-VL Technical Report." arXiv:2511.21631, November 2025. https://arxiv.org/abs/2511.21631 Accessed 2026-05-31. ↩
"DocVQA test Benchmark Leaderboard." llm-stats.com. https://llm-stats.com/benchmarks/docvqatest Accessed 2026-05-31. ↩
Jinguo Zhu, Weiyun Wang, Zhe Chen, et al. "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models." arXiv:2504.10479, April 2025. https://arxiv.org/abs/2504.10479 Accessed 2026-05-31. ↩
Haoran Wei, Yaofeng Sun, and Yukun Li. "DeepSeek-OCR: Contexts Optical Compression." DeepSeek, arXiv:2510.18234, October 2025. https://arxiv.org/abs/2510.18234 Accessed 2026-05-31. ↩
PaddlePaddle Team, Baidu. "PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model." arXiv:2510.14528, October 2025. https://arxiv.org/abs/2510.14528 Accessed 2026-05-31. ↩
Linke Ouyang, Yuan Qu, Hongbin Zhou, et al. "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025. https://github.com/opendatalab/OmniDocBench Accessed 2026-05-31. ↩
