Table Question Answering Models
Last reviewed
May 31, 2026
Sources
22 citations
Review status
Source-backed
Revision
v3 · 4,657 words
Table question answering models (TableQA models) are machine learning systems that answer natural language questions over structured tabular data such as spreadsheets, database tables, and HTML tables extracted from web pages. The task sits at the intersection of natural language processing, structured data analysis, and reasoning, and it has produced a distinct family of models that differ from open-domain reading comprehension systems because tables impose schema, ordering, and numerical structure that text-only models often handle poorly.
Two dominant paradigms have shaped the field. The first is semantic parsing, in which a model translates a question into an executable logical form (typically SQL, lambda calculus, or a custom domain-specific language) that runs against the table to produce an answer. The second is end-to-end neural prediction, in which a transformer encodes the linearized table together with the question and either selects cells directly, applies aggregation operators, or generates a free-form answer. A third hybrid approach has emerged with large language models: models generate code (SQL or Python) in-context, execute it through a tool, and use the result to answer.
TableQA is hard for several reasons. Tables mix heterogeneous data types (numbers, dates, free text, units). Cell content is often abbreviated or noisy. Real-world tables can have hundreds of rows that exceed standard context windows. Reasoning frequently requires arithmetic, comparison, aggregation across rows, or joins across multiple tables. Finally, schema understanding for unseen databases requires generalization rather than memorization, which is the main difficulty in benchmarks like Spider.
Early work on question answering over structured data focused on natural language interfaces to databases (NLIDB) and rule-based parsers. These systems used manually crafted grammars and ontologies to map user queries to database queries, and while they could be precise, they required expensive expert labor to build and broke down on language they had not been explicitly programmed to handle. The modern academic line begins with the 2015 paper by Panupong Pasupat and Percy Liang at Stanford, which introduced the WikiTableQuestions dataset of 22,033 complex questions over Wikipedia tables and a compositional semantic parser called Sempre. Their work established WikiTableQuestions (often abbreviated WTQ) as a primary benchmark and showed that compositional logical forms could be learned from question-answer pairs even without annotated programs.
The same period produced WikiSQL (Zhong, Xiong, and Socher 2017), a much larger collection of 80,654 question-SQL pairs over 24,241 Wikipedia tables that constrained queries to a single table with a simple SELECT-WHERE form. Its scale made neural training practical, but its constrained query structure made it a relatively easy benchmark once strong models appeared. Tao Yu and colleagues at Yale released Spider in 2018, a cross-domain text-to-SQL benchmark with 10,181 questions and 5,693 complex SQL queries spanning 200 databases across 138 domains, where train and test sets contain different schemas to test generalization. Spider quickly became the standard benchmark for cross-domain text-to-SQL and was followed by the multi-turn variants SParC and CoSQL.
The arrival of BERT in 2018 triggered a wave of pretrained table encoders. Google released TAPAS (Herzig et al. 2020, arXiv 2004.02349), which extended BERT with column, row, and rank embeddings so a single transformer could read the table and the question jointly. Facebook AI Research published TaBERT (Yin, Neubig, Yih, and Riedel 2020, arXiv 2005.08314), pretrained on 26 million tables paired with surrounding text. Microsoft followed with TaPEx (Liu et al. 2021, arXiv 2107.07653), which pretrained a BART encoder-decoder by teaching it to act as a neural SQL executor on synthetic queries, raising WikiSQL accuracy to 89.5%.
The 2019 to 2021 period also produced a cluster of task-specific benchmarks for fact verification (TabFact), hybrid reasoning (HybridQA, TAT-QA, FinQA), and free-form generation (FeTaQA), broadening the community's view of what TableQA should cover beyond single-cell lookup and simple aggregation.
From 2022 onward the dominant story has been the rise of in-context learning over tables. Wenhu Chen showed that GPT-3 with a single demonstration could match or beat specialized table models on WikiTableQuestions, FeTaQA, TabFact, and FEVEROUS (arXiv 2210.06710). Binder (Cheng et al. 2022) bound a language model to SQL or Python as a tool, achieving state-of-the-art on WikiTableQuestions and TabFact without supervised training. By 2024 systems like Chain-of-Table (Wang et al. 2024) treated the table itself as an evolving reasoning state, while open models such as TableLlama (Zhang et al. 2023) and StructLM (Zhuang et al. 2024) packaged generalist table abilities into instruction-tuned Llama and Mistral families.
Before any model processes a table, the table must be converted into a form the model can consume. This representation problem is central to TableQA and different approaches make very different choices.
The simplest method serializes the table as a sequence of tokens. A common format places the header row first (with a special separator), followed by each data row, using pipe characters or newline characters to delimit cells. For example: header: Name | Country | GDP row 1: United States | USA | 25.46T. This format is easy to implement and is compatible with any language model, but it loses all structural information beyond sequential position. As table size grows, the flat sequence can quickly exceed context window limits, and the model must learn to infer column and row boundaries from delimiter tokens alone.
Many LLM-based TableQA systems default to some variant of flat linearization because it requires no architectural changes. TaPEx uses a cleaned version in which column headers are prefixed to each row.
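As a minimal sketch of flat linearization, the snippet below reproduces the header-then-rows format from the example above. The function name `linearize_table` and the `header:` / `row N:` prefixes are illustrative assumptions, not the exact format of any particular system.

```python
def linearize_table(header, rows, cell_sep=" | ", row_sep="\n"):
    """Serialize a table as one string: a header line, then one line per data row."""
    lines = ["header: " + cell_sep.join(header)]
    for i, row in enumerate(rows, start=1):
        # Cast every cell to str so numbers, dates, and text serialize uniformly.
        lines.append(f"row {i}: " + cell_sep.join(str(cell) for cell in row))
    return row_sep.join(lines)


header = ["Name", "Country", "GDP"]
rows = [["United States", "USA", "25.46T"]]
flat = linearize_table(header, rows)
print(flat)
```

The output length grows linearly with the number of cells, which is why this format runs into context limits on large tables.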
A more principled approach, pioneered by TAPAS, adds extra learned embeddings on top of the standard wordpiece or token embeddings. Each token in the table receives embeddings for: (1) its column index, (2) its row index, (3) a segment type that separates question tokens from table tokens, and (4) a rank embedding that encodes the numeric rank of the cell value within its column. These additional signals let the transformer learn attention patterns that respect table structure rather than relying on sequential position alone. The downside is that this architecture is harder to extend to arbitrary table sizes: the original TAPAS was capped at 512 tokens, which excluded many real-world tables.
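The structural signals can be illustrated by computing the integer ids that such embeddings would be looked up from. This is a rough sketch, not the TAPAS tokenizer: whitespace tokenization, the function name `structural_ids`, and the convention that id 0 is reserved (header row, non-numeric rank) are all assumptions made for the example.

```python
def structural_ids(header, rows):
    """Return one (token, column_id, row_id, rank_id) tuple per whitespace token.

    Column ids start at 1, the header is row 0, and rank_id is 0 for cells in
    columns that are not fully numeric.
    """
    n_cols = len(header)
    ranks = [[0] * n_cols for _ in rows]
    for c in range(n_cols):
        try:
            values = [float(row[c]) for row in rows]
        except (ValueError, TypeError):
            continue  # leave rank 0 for non-numeric columns
        ordered = sorted(set(values))
        for r, v in enumerate(values):
            ranks[r][c] = ordered.index(v) + 1  # 1 = smallest value in the column

    out = []
    for c, name in enumerate(header):
        for tok in name.split():
            out.append((tok, c + 1, 0, 0))
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            for tok in str(cell).split():
                out.append((tok, c + 1, r + 1, ranks[r][c]))
    return out


ids = structural_ids(
    ["City", "Population"],
    [["Paris", "2100000"], ["New York", "8400000"]],
)
for item in ids:
    print(item)
```

In a real model each id would index a separate embedding table, and the resulting vectors would be summed with the token and position embeddings.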
TaBERT addresses large tables differently. Instead of trying to fit the whole table into one sequence, it builds a content snapshot: for each question, it selects the most relevant rows (the original paper ranks rows by n-gram overlap with the question), then encodes each row together with its column headers as a separate sequence. Cross-row self-attention connects the per-row encodings into a joint representation. This approach scales better to tables with many rows, but it requires a row-selection step and cannot reason over rows that were left out of the snapshot.
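The row-selection step can be sketched as follows. This is a simplified illustration that scores rows by the number of n-grams they share with the question; using n-gram overlap as the relevance measure, the whitespace tokenization, and the helper names are assumptions of this sketch, and the encoding step is omitted.

```python
def ngrams(tokens, n_max=3):
    """All n-grams of length 1..n_max as a set of tuples."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def content_snapshot(question, rows, k=3):
    """Keep the k rows sharing the most n-grams with the question, in table order."""
    q_grams = ngrams(question.lower().split())

    def overlap(row):
        row_tokens = " ".join(str(cell) for cell in row).lower().split()
        return len(q_grams & ngrams(row_tokens))

    # sorted() is stable, so rows with equal overlap keep their original order.
    ranked = sorted(range(len(rows)), key=lambda i: -overlap(rows[i]))
    return [rows[i] for i in sorted(ranked[:k])]


rows = [["1930", "Uruguay"], ["1950", "Brazil"], ["2014", "Brazil"], ["2018", "Russia"]]
snapshot = content_snapshot("which year did brazil host the world cup", rows, k=2)
print(snapshot)
```

Rows outside the snapshot are invisible to the encoder, which is the source of the missed cross-row reasoning described above.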
MATE (Multi-view Attention for Table transformer Efficiency, Eisenschlos et al. 2021) and TUTA (Wang et al. 2021) extend the encoding approach to very large tables by using sparse attention patterns that connect each cell mainly to its own row and column rather than to every token in the table. This reduces memory and compute from quadratic to near-linear in table size, enabling processing of tables with thousands of rows at the cost of some long-range attention capacity.
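The row-and-column attention pattern can be visualized with a small boolean mask. This sketch is only meant to show which token pairs are connected: it materializes a dense mask, which real sparse-attention implementations avoid, and the function name `row_column_mask` is hypothetical.

```python
def row_column_mask(cell_positions):
    """cell_positions[i] is the (row, col) of token i.

    mask[i][j] is True when token i may attend to token j, i.e. when the two
    tokens share a row or a column.
    """
    n = len(cell_positions)
    return [
        [
            cell_positions[i][0] == cell_positions[j][0]
            or cell_positions[i][1] == cell_positions[j][1]
            for j in range(n)
        ]
        for i in range(n)
    ]


# One token per cell in a 3 x 3 table.
positions = [(r, c) for r in range(3) for c in range(3)]
mask = row_column_mask(positions)
allowed_per_token = [sum(row) for row in mask]
print(allowed_per_token)
```

For a table with R rows and C columns and one token per cell, each token attends to R + C - 1 tokens instead of R * C, which is where the savings come from.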
Some systems pass tables to models in their original HTML or in GitHub-flavored Markdown format, relying on the model's pretraining exposure to recognize these formats. HTML encoding preserves colspan and rowspan attributes (useful for merged cells) but adds considerable token overhead. For spreadsheet-specific systems, formats that encode cell coordinates (A1, B2, etc.) directly can be useful. The question of which serialization format is best remains open and likely depends on model pretraining distribution and table type.
Classical semantic parsing pipelines map a question to a logical form (lambda-DCS, FunQL, or SQL) and execute it. The parser is trained with weak supervision: given a question and the final denotation (the answer), search through candidate programs that produce the correct value. Sempre, DynSP, and Coarse2Fine fall in this category. Logical forms are interpretable and composable, but search is expensive and spurious programs (programs that produce the right answer for the wrong reason) are a chronic problem. A spurious program gets the correct denotation on the training example but does not generalize: on a new question with similar wording but different data, it fails. Filtering spurious programs requires either annotated programs or careful augmentation of the search procedure.
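The spurious-program problem is easy to reproduce on a toy example. In the sketch below, three hypothetical candidate programs all return the gold denotation on the training table, yet only one of them encodes the intended question ("how many gold medals did Norway win?"). The table contents and program names are invented for illustration.

```python
def lookup(table, key_col, key, value_col):
    """Return value_col from the first row whose key_col equals key."""
    return next(row[value_col] for row in table if row[key_col] == key)


candidates = {
    "gold_of_norway": lambda t: lookup(t, "nation", "Norway", "gold"),
    "rank_of_norway": lambda t: lookup(t, "nation", "Norway", "rank"),
    "count_rows": lambda t: len(t),
}

train_table = [
    {"nation": "Germany", "gold": 5, "rank": 1},
    {"nation": "Norway", "gold": 2, "rank": 2},
]
gold_answer = 2

# Weak supervision only sees the denotation, so every program that returns 2 survives.
consistent = [name for name, prog in candidates.items() if prog(train_table) == gold_answer]
print(consistent)

# On a different table the spurious programs diverge from the intended one.
new_table = [
    {"nation": "Norway", "gold": 4, "rank": 1},
    {"nation": "Germany", "gold": 3, "rank": 2},
    {"nation": "Canada", "gold": 1, "rank": 3},
]
new_outputs = {name: candidates[name](new_table) for name in consistent}
print(new_outputs)
```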
A close cousin of general semantic parsing, text-to-SQL targets SQL specifically as the output language. Because SQL is broadly used in industry, text-to-SQL has become the dominant practical formulation. Early neural systems like Seq2SQL added reinforcement learning with execution rewards. Later architectures (IRNet, RAT-SQL, PICARD) added schema linking, relation-aware attention, and constrained decoding to handle cross-domain Spider data. Schema linking is the key challenge in cross-domain text-to-SQL: the model must correctly identify which tables and columns the question is about, even when column names in the database schema do not match vocabulary in the question. PICARD addresses this with a constrained decoder that prevents invalid SQL from being generated at each step. LLM-prompted text-to-SQL (DIN-SQL, MAC-SQL, DAIL-SQL) now leads most leaderboards by decomposing the problem into sub-steps: schema filtering, SQL generation, and self-correction.
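A naive form of schema linking can be sketched with plain string matching. This is far simpler than the learned linking in the systems named above, and the function name `link_schema` and the example schema are assumptions; it is included mainly to show where exact matching breaks down.

```python
import re


def link_schema(question, schema):
    """Return (table, column) pairs whose column-name parts all occur in the question.

    schema maps each table name to a list of column names.
    """
    q_tokens = set(re.findall(r"[a-z0-9]+", question.lower()))
    links = []
    for table, columns in schema.items():
        for col in columns:
            parts = re.findall(r"[a-z0-9]+", col.lower())
            if parts and all(p in q_tokens for p in parts):
                links.append((table, col))
    return links


schema = {
    "employees": ["name", "salary", "dept_id"],
    "departments": ["dept_id", "dept_name"],
}
links = link_schema("List the name and salary of every employee", schema)
print(links)

# "department" in the question does not match "dept_name" in the schema.
missed = link_schema("Which department pays the most?", schema)
print(missed)
```

The second call returns no links even though the question is clearly about the `departments` table, which is the vocabulary mismatch that learned schema linking is meant to bridge.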
A simple but surprisingly competitive approach serializes the table as a string (rows separated by newlines, cells separated by pipes) and feeds it to a seq2seq model along with the question. The model generates the answer directly. TaPEx and many LLM prompting approaches use variants of this format. The main weakness is loss of structural cues for large tables; the main strength is that it inherits the full reasoning power of the underlying generative model. Chen (2022) demonstrated that GPT-3-style models pretrained on web text had enough implicit table knowledge from HTML and Wikipedia exposure to perform usefully under this format.
TAPAS, TaBERT, MATE, TUTA, ReasTAP, and PASTA modify the transformer architecture itself to respect table structure. Typical additions include separate token-type embeddings for headers and cells, row and column position embeddings, and structural attention biases. Pretraining objectives include masked cell value prediction, denotation pretraining, and synthetic SQL execution. These models excel at cell selection and aggregation answers but tend to be capped by their pretraining table size (commonly 512 or 1024 tokens). TAPAS in particular produces a probability distribution over table cells that can be combined with a learnable aggregation operator (SUM, COUNT, AVERAGE, NONE) to produce answers without ever generating free text.
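The cell-selection-plus-aggregation output can be sketched as a small post-processing step. The per-cell probabilities below are made up, the 0.5 selection threshold is an assumption of this sketch, and the function name `answer_from_selection` is hypothetical; a trained model would supply both the probabilities and the predicted operator.

```python
def answer_from_selection(cell_probs, cell_values, operator, threshold=0.5):
    """Turn per-cell selection probabilities and an operator into an answer."""
    selected = [v for p, v in zip(cell_probs, cell_values) if p > threshold]
    if operator == "NONE":
        return selected  # plain cell selection, no aggregation
    if operator == "COUNT":
        return len(selected)
    numbers = [float(v) for v in selected]
    if operator == "SUM":
        return sum(numbers)
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers) if numbers else None
    raise ValueError(f"unknown aggregation operator: {operator}")


probs = [0.9, 0.1, 0.8]
values = ["10", "20", "30"]
for op in ("NONE", "COUNT", "SUM", "AVERAGE"):
    print(op, answer_from_selection(probs, values, op))
```

Because the answer is computed from selected cells, no free text is generated at any point.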
Modern systems treat the LLM as a program writer, not the executor. The LLM produces SQL, Python (pandas), or a domain-specific program, the program runs in a sandbox, and the result is returned to the model for final formatting. Binder, Lever, ReAcTable, and the OpenAI Code Interpreter follow this pattern. Code execution sidesteps hallucinated cell values and gives exact arithmetic, which is why it dominates 2024 leaderboards on numerically heavy datasets. The Binder approach is notable because it adds model-defined API functions: the LLM can call a function (defined on the fly in the prompt) that the system resolves back to the LLM itself, enabling hybrid SQL-plus-language reasoning without a custom architecture.
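The program-writer pattern can be sketched with Python's built-in `sqlite3` module. The function `fake_llm_write_sql` is a hypothetical stand-in that returns a fixed query where a real system would call a language model, and an in-memory SQLite database is used here for convenience rather than as a genuine security sandbox.

```python
import sqlite3


def fake_llm_write_sql(question, schema_description):
    """Hypothetical stand-in for an LLM call; a real system would prompt a model."""
    return "SELECT SUM(revenue) FROM sales WHERE region = 'EMEA'"


def answer_with_sql(question, rows):
    """Load the table, let the 'model' write SQL, and let the database compute."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        sql = fake_llm_write_sql(question, "sales(region, quarter, revenue)")
        result = conn.execute(sql).fetchall()
    finally:
        conn.close()
    return sql, result


rows = [("EMEA", "Q1", 120.5), ("EMEA", "Q2", 99.5), ("APAC", "Q1", 80.0)]
sql, result = answer_with_sql("What was total EMEA revenue?", rows)
print(sql)
print(result)
```

The arithmetic is done by the database engine, so the final number cannot be a hallucinated cell value, although the query itself can still be wrong.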
Chain-of-Table extends chain-of-thought prompting by letting the model perform table operations (select columns, group, filter, add column) step by step. Each operation transforms the table, so intermediate state is grounded in actual data rather than text reasoning that might drift from the table. Unlike code execution, which outsources computation to a deterministic interpreter, Chain-of-Table keeps all steps inside the model's context window as successive table states, enabling transparent inspection of the reasoning chain.
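The idea of a table that evolves through a chain of operations can be sketched as below. The operation names, the toy table, and the hard-coded chain are illustrative assumptions; in the prompting setup described above, the model would choose each operation and its arguments after seeing the current table state.

```python
def filter_rows(table, column, value):
    return [row for row in table if row[column] == value]


def sort_by(table, column, descending=False):
    return sorted(table, key=lambda row: row[column], reverse=descending)


def select_columns(table, columns):
    return [{c: row[c] for c in columns} for row in table]


OPS = {"filter_rows": filter_rows, "sort_by": sort_by, "select_columns": select_columns}


def run_chain(table, chain):
    """Apply each (operation, kwargs) step and keep every intermediate table state."""
    states = [table]
    for op_name, kwargs in chain:
        states.append(OPS[op_name](states[-1], **kwargs))
    return states


table = [
    {"name": "Rider A", "country": "ESP", "time": 310},
    {"name": "Rider B", "country": "SLO", "time": 305},
    {"name": "Rider C", "country": "ESP", "time": 300},
]
# Question: "Which Spanish rider had the fastest time?"
chain = [
    ("filter_rows", {"column": "country", "value": "ESP"}),
    ("sort_by", {"column": "time"}),
    ("select_columns", {"columns": ["name"]}),
]
states = run_chain(table, chain)
for state in states:
    print(state)
```

Keeping every intermediate state makes the reasoning chain inspectable: each step can be checked against the table it was applied to.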
For enterprise databases with hundreds of tables, an initial retrieval step narrows the relevant schema before any SQL generation. Retrieval-augmented generation pipelines over data warehouses combine embedding-based table retrieval with LLM-based query generation. HybridQA and TAT-QA require models to reason over tables linked to external text passages, and this multi-hop setting is increasingly important in practice: a financial analyst's question may require combining a table of quarterly revenues with the text of a press release.
| Model | Year | Organization | Approach | Notes |
|---|---|---|---|---|
| Sempre / DynSP | 2015 to 2017 | Stanford, Microsoft | Semantic parsing to lambda-DCS | Established WikiTableQuestions baseline |
| Seq2SQL | 2017 | Salesforce Research | Seq2seq with RL on execution | First neural WikiSQL baseline at 59.4% execution accuracy |
| TAPAS | April 2020 | Google Research | BERT plus row/column/rank embeddings | Direct cell selection plus aggregation; SQA 67.2% |
| TaBERT | May 2020 | Facebook AI Research | Content snapshots, joint text-table pretraining | Pretrained on 26M tables; strong on WikiTableQuestions |
| TaPEx | July 2021 | Microsoft Research Asia | BART pretrained as neural SQL executor | WikiSQL 89.5%, WikiTableQuestions 57.5% |
| MATE / TUTA | 2021 | Google, Microsoft | Sparse attention for very large tables | Targets thousands of rows |
| ReasTAP | March 2022 | MIT, Tencent | Reasoning-pretrained encoder-decoder | Synthetic reasoning tasks injected at pretraining |
| PASTA | April 2022 | Microsoft, ISCAS | Sentence-table cloze pretraining for fact verification | Strong on TabFact |
| OmniTab | July 2022 | Microsoft, CMU | Joint natural and synthetic data pretraining | New state-of-the-art on WikiTableQuestions; 16.2% gain in 128-shot |
| Binder | October 2022 | Yale, University of Washington, Allen AI | Codex prompts SQL or Python with model-bound APIs | Training-free; state-of-the-art on WikiTQ and TabFact |
| Lever | February 2023 | Meta AI, Yale | Verifier reranks generated programs | 4.6% to 10.9% gains on TableQA code generation |
| TableLlama | November 2023 | Ohio State University | Llama-2 7B fine-tuned with LongLoRA | First open generalist table model; 8K context |
| Chain-of-Table | January 2024 | Google, UC San Diego | Iterative table operation prompting | State-of-the-art on WikiTQ, FeTaQA, TabFact |
| StructLM | February 2024 | University of Waterloo, Vector | Mistral and CodeLlama tuned on 1.1M structured tasks | Beats TableLlama by 35% on out-of-domain tasks |
| Benchmark | Year | Creators | Size | What it tests |
|---|---|---|---|---|
| WikiTableQuestions (WTQ) | 2015 | Pasupat and Liang, Stanford | 22,033 questions, 2,108 Wikipedia tables | Compositional reasoning, denotation accuracy |
| WikiSQL | 2017 | Zhong, Xiong, Socher (Salesforce) | 80,654 question-SQL pairs, 24,241 tables | Single-table SELECT-WHERE SQL |
| SQA | 2017 | Iyyer, Yih, Chang (Microsoft) | 6,066 question sequences (17,553 questions) | Conversational table QA |
| Spider | 2018 | Yu et al., Yale | 10,181 questions, 5,693 SQL, 200 DBs, 138 domains | Cross-domain text-to-SQL generalization |
| SParC | 2019 | Yu et al., Yale | 4,298 question sequences over Spider DBs | Context-dependent multi-turn SQL |
| CoSQL | 2019 | Yu et al., Yale | 3,007 dialogues, 30k turns | Conversational text-to-SQL |
| TabFact | 2019 | Chen et al., UCSB | 118,000 statements, 16,000 tables | Binary entailment over tables |
| HybridQA | 2020 | Chen et al., UCSB | 70,000+ questions | Multi-hop reasoning over tables plus linked text |
| FeTaQA | 2021 | Nan et al., Yale and Salesforce | 10,000 free-form question and answer pairs | Generation of explanatory answers |
| TAT-QA | 2021 | Zhu et al., NUS | 16,552 questions on financial reports | Numerical reasoning over hybrid text and tables |
| FinQA | 2021 | Chen et al., UCSB | 8,281 expert-written question and answer pairs | Multi-step numerical reasoning in finance |
| BIRD | 2023 | Li et al., HKU and Alibaba | 12,751 questions, 95 large databases | Big-bench text-to-SQL with messy real data |
The most common evaluation metrics depend on the answer type. Denotation accuracy checks whether the predicted answer set equals the gold answer set; it is the default for WikiTableQuestions and SQA. The term "denotation" refers to the value that a logical form evaluates to, so this metric is independent of the program the model used to produce the answer. Exact match accuracy compares predicted strings to gold strings, often after normalization (lowercasing, removing articles, collapsing whitespace). Text-to-SQL benchmarks apply the same idea at the query level: WikiSQL reports logical form accuracy, and Spider reports exact set match, which compares the predicted and gold queries clause by clause.
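Denotation accuracy can be sketched as a set comparison after light normalization. The normalization rules here (lowercasing, stripping whitespace, parsing numbers with thousands separators) are assumptions of this sketch; official benchmark evaluators apply their own, more detailed rules.

```python
def normalize(value):
    """Lowercase and strip text; parse numbers so '1,200' and 1200 compare equal."""
    text = str(value).strip().lower()
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


def denotation_match(predicted, gold):
    """True when the predicted and gold answer sets are equal after normalization."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}


def denotation_accuracy(predictions, golds):
    """Fraction of questions whose predicted answer set matches the gold set."""
    matches = sum(denotation_match(p, g) for p, g in zip(predictions, golds))
    return matches / len(golds)


predictions = [["1,200"], ["Rome", "Paris"], ["Berlin"]]
golds = [[1200], ["paris", "rome "], ["Madrid"]]
accuracy = denotation_accuracy(predictions, golds)
print(accuracy)
```

Because only the final values are compared, two systems that reach the same answer through different programs receive the same score.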
Execution accuracy runs both the gold SQL and the predicted SQL against the database and checks whether the result sets match, which is more lenient than exact-match because semantically equivalent SQL with different syntax counts as correct. For example, SELECT name FROM t WHERE salary > 50000 and SELECT name FROM t WHERE 50000 < salary would differ under exact match but agree under execution accuracy. This metric has become preferred for text-to-SQL research because it rewards correct answers rather than correct syntax.
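The difference between the two metrics can be checked directly with Python's built-in `sqlite3` module. This is a minimal sketch: the table `t` and its two rows are invented, the string normalization is deliberately crude, and results are compared as unordered multisets, which is one reasonable convention rather than the rule of any specific benchmark.

```python
import sqlite3
from collections import Counter


def exact_match(pred_sql, gold_sql):
    """String equality after lowercasing and collapsing whitespace."""
    def norm(sql):
        return " ".join(sql.lower().split())
    return norm(pred_sql) == norm(gold_sql)


def execution_match(conn, pred_sql, gold_sql):
    """True when both queries return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("Ana", 60000), ("Bo", 40000)])

gold = "SELECT name FROM t WHERE salary > 50000"
pred = "SELECT name FROM t WHERE 50000 < salary"
em = exact_match(pred, gold)
ex = execution_match(conn, pred, gold)
print(em, ex)
```

Execution accuracy can also be too lenient: two different queries may agree on one database by coincidence, so some evaluations run the queries against several database instances.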
Classification accuracy is used for TabFact and other verification framings, while token-level F1 is reported alongside exact match on hybrid benchmarks such as HybridQA and TAT-QA. For free-form benchmarks like FeTaQA, evaluators report BLEU, ROUGE, and increasingly LLM-as-judge scores; the field has not converged on a single standard for generative TableQA. The BIRD benchmark introduced an additional metric called valid efficiency score (VES), which penalizes SQL queries that return correct results but use unnecessarily slow execution plans, reflecting real-world performance concerns.
A persistent criticism of all these metrics is that they measure the final answer but not the quality of the reasoning process. A model can achieve high execution accuracy on WikiSQL by memorizing surface patterns between questions and SQL templates while failing on any unseen table structure.
The choice between semantic parsing and end-to-end neural prediction involves meaningful tradeoffs that have shaped research strategy over the past decade.
Semantic parsing to SQL or logical forms produces outputs that are interpretable and executable, making errors diagnosable. An analyst can read the generated SQL and identify why a wrong answer was produced. Programs also scale: once the right SQL is generated, the database engine executes it deterministically and handles arbitrarily large tables. The main weaknesses are brittleness to paraphrase, exposure to spurious programs during training, and the difficulty of handling questions that require linguistic understanding beyond what SQL can express (such as FeTaQA's explanatory answers).
End-to-end neural approaches like TAPAS bypass the logical form entirely and directly select cells or generate answers token by token. They generalize well to questions with ambiguous scope and handle free-form natural language more gracefully. Their weakness is that they cannot guarantee arithmetic correctness on multi-step calculations, cannot operate on tables larger than their context window, and produce answers that are opaque (there is no program to inspect). They are also difficult to apply in production where the output must interact with a live database.
Code execution with LLMs gives the best of both worlds in many settings: the generated SQL or Python is inspectable, execution is exact, and the underlying model can handle complex linguistic phenomena in the question. The remaining weakness is that LLMs can generate plausible-looking but incorrect programs, especially for complex multi-table queries or queries requiring domain knowledge not in the prompt.
The arrival of large general-purpose language models changed the competitive landscape of TableQA more rapidly than almost any other NLP task. Several properties explain why tables proved to be an unexpectedly tractable domain for LLMs.
First, LLMs pretrained on web text encounter many tables during training: Wikipedia infoboxes and data tables, HTML tables from news sites, CSV files in GitHub, and spreadsheet data in Stack Overflow answers. This implicit exposure gives them a non-trivial prior on tabular structure even without explicit table pretraining.
Second, the code generation abilities of modern LLMs map directly onto text-to-SQL. A model trained to produce Python and JavaScript code has already learned the kind of structured, syntactically constrained generation that SQL requires. Several studies have shown that code-specialized models (Codex, CodeLlama, DeepSeek-Coder) outperform general-purpose models on text-to-SQL even when the general-purpose models have more parameters.
Third, the reasoning capabilities of large models, especially those with chain-of-thought training, matter for the multi-step arithmetic and comparison operations that tables frequently demand. TAT-QA and FinQA require calculations like: "what is the percentage change in operating income from Q1 to Q2?" which require finding two cell values, computing their difference, dividing by the original, and expressing the result as a percentage. These steps are within reach of a capable LLM reasoning over a serialized table, though reliability degrades as the number of steps grows.
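The percentage-change calculation described above decomposes into a lookup followed by arithmetic. The sketch below uses an invented one-row table and a hypothetical helper name; it shows the steps a model must get right, not how any benchmark system implements them.

```python
def percentage_change(table, metric, start_col, end_col):
    """Find the metric's row, then compute 100 * (end - start) / start."""
    row = next(r for r in table if r["metric"] == metric)  # step 1: locate the row
    start, end = row[start_col], row[end_col]               # step 2: read two cells
    return 100 * (end - start) / start                      # steps 3-4: difference, ratio


# Illustrative figures only.
table = [{"metric": "Operating income", "Q1": 200.0, "Q2": 230.0}]
change = percentage_change(table, "Operating income", "Q1", "Q2")
print(f"{change:.1f}%")
```

Each step is a separate opportunity for error when the model does the work in free text, which is why reliability drops as chains get longer.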
The practical regime in 2024 and 2025 is one where frontier LLMs (GPT-4o, Claude, Gemini) combined with tool use dominate accuracy leaderboards on nearly all TableQA benchmarks. Specialized table models remain relevant in latency-sensitive, cost-constrained, or on-device settings, and for narrow tasks where a small fine-tuned encoder can be deployed efficiently. Open-weight instruction-tuned models like StructLM and TableLlama offer intermediate options: they are table-aware, small enough for local deployment, and stronger on structured tasks than same-size general-purpose models.
For retrieval-augmented generation over large data warehouses, the typical production pattern has three stages: (1) embed all table schemas and retrieve the top-k relevant tables by cosine similarity to the question, (2) have the LLM generate SQL over those k tables, (3) execute the SQL and have a second LLM pass turn the result set into a natural language answer. This pattern, sometimes called text-to-SQL-to-text, is the basis of enterprise analytics products from Snowflake, Databricks, and independent startups.
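The three stages can be wired together in a few lines. Everything model-related in this sketch is a stand-in: `embed` is a bag-of-words counter rather than a learned embedding, and `generate_sql` and `summarize` are hypothetical stubs where a production system would call an LLM. The schema descriptions and data are invented.

```python
import math
import re
import sqlite3
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_tables(question, schemas, k=1):
    """Stage 1: rank schema descriptions by cosine similarity to the question."""
    q = embed(question)
    ranked = sorted(schemas, key=lambda name: cosine(q, embed(schemas[name])), reverse=True)
    return ranked[:k]


def generate_sql(question, tables):
    """Stage 2: hypothetical LLM stub that returns a fixed query."""
    return ("SELECT department, AVG(salary) FROM employees "
            "GROUP BY department ORDER BY department")


def summarize(question, result):
    """Stage 3b: hypothetical second LLM pass, replaced by string formatting."""
    return "; ".join(f"{dept}: {avg:.0f}" for dept, avg in result)


def answer_question(question, schemas, conn, k=1):
    tables = retrieve_tables(question, schemas, k)
    sql = generate_sql(question, tables)
    result = conn.execute(sql).fetchall()  # Stage 3a: the database executes the SQL
    return summarize(question, result)


schemas = {
    "orders": "orders order_id customer_id total date",
    "employees": "employees name salary department",
    "products": "products product_id title price category",
}
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", 100, "eng"), ("Bo", 80, "eng"), ("Cy", 50, "ops")],
)
question = "What is the average salary by department?"
top_tables = retrieve_tables(question, schemas, k=1)
answer = answer_question(question, schemas, conn)
print(top_tables, answer)
```

Retrieval errors in stage 1 are unrecoverable downstream, so production systems often retrieve more tables than strictly needed and let the SQL-generation step ignore the extras.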
Through 2023 and 2024, frontier general-purpose LLMs (GPT-4, Claude, and Gemini) combined with code execution tools set the top scores on WikiTableQuestions, TabFact, and BIRD.
Use cases span business intelligence (natural language analytics on Snowflake or BigQuery), spreadsheet copilots (Excel Copilot, Google Sheets' Gemini integration), financial document question answering (10-K filings, earnings reports, with FinQA-style numerical reasoning), scientific table analysis on arXiv and PubMed, government open-data interfaces, and customer-facing dashboard chatbots. Enterprise text-to-SQL has matured into a product category, with both startups (Vanna, Defog, Wherobots) and incumbents (Databricks AI/BI Genie, Snowflake Cortex) shipping commercial systems.
Spreadsheet copilots represent a distinct application category from database-oriented text-to-SQL. Spreadsheets mix data and computation in a single object (formulas can reference other cells, worksheets can have multi-row merged headers, and the same "table" may contain subtotals interleaved with data rows). Models must handle these departures from relational normal form, which most academic benchmarks do not test. Microsoft's Copilot for Excel and Google's Gemini for Sheets both rely on a combination of LLM-generated formulas, Python code execution, and natural language generation to answer questions and perform data transformations over spreadsheet content.
Large tables remain a persistent problem. Most table-specific transformers cap at 512 or 1024 tokens, and even long-context LLMs degrade on retrieval inside multi-thousand-row tables. Strategies include row sampling, column selection, and chunking, but each introduces information loss. Even when tables fit within the context window, "lost in the middle" effects, where models attend poorly to rows in the middle of a long input, can degrade accuracy.
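Chunking, one of the strategies mentioned above, can be sketched as splitting the rows into fixed-size groups that each carry a copy of the header. The function name `chunk_rows` and the row-count budget are assumptions; real systems usually budget by tokens rather than rows.

```python
def chunk_rows(header, rows, max_rows):
    """Split rows into consecutive chunks of at most max_rows, each with the header."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [(header, rows[i:i + max_rows]) for i in range(0, len(rows), max_rows)]


header = ["id", "value"]
rows = [[i, i * 10] for i in range(5)]
chunks = chunk_rows(header, rows, max_rows=2)
for chunk_header, chunk in chunks:
    print(chunk_header, chunk)
```

The information loss is visible here: a question that compares a row in the first chunk with a row in the last cannot be answered from any single chunk, so per-chunk answers must be merged in a later step.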
Numerical reasoning is reliable only when the model writes and executes code; raw token generation still produces arithmetic errors on multi-step problems. Multi-table joins and schema understanding for unseen databases are the central difficulties in Spider and BIRD; even GPT-4 with prompting strategies falls well below human performance on BIRD's harder splits. BIRD specifically benchmarks on messy real-world databases with inconsistent naming conventions, NULL values, and encoding issues, and the gap to human performance on it is substantially larger than on cleaner academic benchmarks.
Hallucinated cell values appear when LLMs generate answers without executing code, especially for low-frequency entities. A model asked "what was the GDP of Turkmenistan in 2018?" may confidently generate a plausible-sounding number rather than looking it up in the table, particularly if the table is long or the entity is rare.
Evaluation gaps for free-form answers (FeTaQA) push the community toward LLM-as-judge methods, which themselves have biases toward longer or more fluent answers regardless of factual content. BLEU and ROUGE are poor proxies for correctness on generated numerical explanations.
Real spreadsheets often violate the relational assumption with merged cells, multi-row headers, and embedded subtotals, which most academic models do not handle. The gap between benchmark tables (clean Wikipedia infoboxes, carefully curated financial filings) and production spreadsheets (years of accretion, inconsistent formatting, mixed languages) is the main source of performance degradation in deployed systems.
Ambiguity in natural language questions causes additional failures. A question like "which countries have the highest income?" could mean the top-1 country, the top-k countries, all countries above a threshold, or the country in the most recent year if the table is time-series. Humans resolve this ambiguity by asking clarifying questions, but most TableQA systems generate a single answer without flagging uncertainty. Generating clarifications or confidence estimates is an active research direction.
Finally, multilingual table QA remains significantly behind English performance. Most large table datasets are in English, and models fine-tuned on English table data generalize poorly to Japanese, Chinese, Arabic, or low-resource language tables where the column headers, cell values, and questions are in a non-English language.
See also: natural language processing models, question answering systems, text to SQL models, foundation model.