Table Question Answering Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,442 words
Table question answering models (TableQA models) are machine learning systems that answer natural language questions over structured tabular data such as spreadsheets, database tables, and HTML tables extracted from web pages. The task sits at the intersection of natural language processing, structured data analysis, and reasoning, and it has produced a distinct family of models that differ from open-domain reading comprehension systems because tables impose schema, ordering, and numerical structure that text-only models often handle poorly.
Two dominant paradigms have shaped the field. The first is semantic parsing, in which a model translates a question into an executable logical form (typically SQL, lambda calculus, or a custom domain-specific language) that runs against the table to produce an answer. The second is end-to-end neural prediction, in which a transformer encodes the linearized table together with the question and either selects cells directly, applies aggregation operators, or generates a free-form answer. A third hybrid approach has emerged with large language models: models generate code (SQL or Python) in-context, execute it through a tool, and use the result to answer.
TableQA is hard for several reasons. Tables mix heterogeneous data types (numbers, dates, free text, units). Cell content is often abbreviated or noisy. Real-world tables can have hundreds of rows that exceed standard context windows. Reasoning frequently requires arithmetic, comparison, aggregation across rows, or joins across multiple tables. Finally, schema understanding for unseen databases requires generalization rather than memorization, which is the main difficulty in benchmarks like Spider.
Early work on question answering over structured data focused on natural language interfaces to databases (NLIDB) and rule-based parsers. The modern academic line begins with the 2015 paper by Panupong Pasupat and Percy Liang at Stanford, which introduced the WikiTableQuestions dataset of 22,033 complex questions over Wikipedia tables together with a compositional semantic parser built on the Sempre toolkit. Their work established WikiTableQuestions (often abbreviated WTQ) as a primary benchmark and showed that compositional logical forms could be learned from question-answer pairs even without annotated programs.
The same period produced WikiSQL (Zhong, Xiong, and Socher 2017), a much larger collection of 80,654 question-SQL pairs over 24,241 Wikipedia tables that constrained queries to a single table with a simple SELECT-WHERE form. Tao Yu and colleagues at Yale released Spider in 2018, a cross-domain text-to-SQL benchmark with 10,181 questions and 5,693 complex SQL queries spanning 200 databases across 138 domains, where train and test sets contain different schemas to test generalization. Spider quickly became the standard benchmark for enterprise text-to-SQL and was followed by multi-turn variants SParC and CoSQL.
The arrival of BERT in 2018 triggered a wave of pretrained table encoders. Google released TaPas (Herzig et al. 2020, arXiv 2004.02349), which extended BERT with column, row, and rank embeddings so a single transformer could read the table and the question jointly. Facebook AI Research published TaBERT (Yin, Neubig, Yih, and Riedel 2020, arXiv 2005.08314), pretrained on 26 million tables paired with surrounding text. Microsoft followed with TaPEx (Liu et al. 2021, arXiv 2107.07653), which pretrained a BART encoder-decoder by teaching it to act as a neural SQL executor on synthetic queries, raising WikiSQL accuracy to 89.5%.
From 2022 onward the dominant story has been the rise of in-context learning over tables. Wenhu Chen showed that GPT-3 with a single demonstration could match or beat specialized table models on WikiTableQuestions, FeTaQA, TabFact, and FEVEROUS (arXiv 2210.06710). Binder (Cheng et al. 2022) bound a language model to SQL or Python as a tool, achieving state-of-the-art on WikiTableQuestions and TabFact without supervised training. By 2024 systems like Chain-of-Table (Wang et al. 2024) treated the table itself as an evolving reasoning state, while open models such as TableLlama (Zhang et al. 2023) and StructLM (Zhuang et al. 2024) packaged generalist table abilities into instruction-tuned Llama and Mistral families.
Classical semantic parsing pipelines map a question to a logical form (lambda-DCS, FunQL, or SQL) and execute it. The parser is trained with weak supervision: given a question and the final denotation (the answer), search through candidate programs that produce the correct value. Sempre, DynSP, and Coarse2Fine fall in this category. Logical forms are interpretable and composable, but search is expensive and spurious programs (programs that produce the right answer for the wrong reason) are a chronic problem.
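The weak-supervision search described above can be illustrated with a toy example. This is a minimal sketch, not the search procedure of Sempre or any other cited system: the table, the tiny program space, and the function name `search_consistent_programs` are all assumptions made for illustration. It enumerates candidate programs and keeps those whose denotation equals the gold answer, which also shows how a spurious program can slip through.

```python
from itertools import product

# Toy table (illustrative data): each row is a dict keyed by column name.
TABLE = [
    {"city": "Oslo", "year": 2019, "medals": 3},
    {"city": "Rome", "year": 2020, "medals": 5},
    {"city": "Lima", "year": 2021, "medals": 5},
]

# A tiny, hypothetical program space: aggregate(column) over a row filter.
AGGREGATES = {"max": max, "min": min, "sum": sum, "count": len}
FILTERS = {
    "all rows": lambda row: True,
    "year >= 2020": lambda row: row["year"] >= 2020,
}


def search_consistent_programs(table, gold_answer):
    """Return every candidate program whose denotation equals gold_answer."""
    consistent = []
    for (agg_name, agg), column, (flt_name, flt) in product(
        AGGREGATES.items(), ["year", "medals"], FILTERS.items()
    ):
        values = [row[column] for row in table if flt(row)]
        if values and agg(values) == gold_answer:
            consistent.append(f"{agg_name}({column}) where {flt_name}")
    return consistent


# Question: "What is the highest medal count?" with gold answer 5.
programs = search_consistent_programs(TABLE, 5)
for program in programs:
    print(program)
```

Three programs survive the search. Two of them compute a maximum, but `min(medals) where year >= 2020` also returns 5 on this table, so it is a spurious program: correct denotation, wrong reasoning.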
A close cousin of general semantic parsing, text-to-SQL targets SQL specifically as the output language. Because SQL is broadly used in industry, text-to-SQL has become the dominant practical formulation. Early neural systems like Seq2SQL added reinforcement learning with execution rewards. Later architectures (IRNet, RAT-SQL, PICARD) added schema linking, relation-aware attention, and constrained decoding to handle cross-domain Spider data. LLM-prompted text-to-SQL (DIN-SQL, MAC-SQL, DAIL-SQL) now leads most leaderboards.
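As a rough illustration of the text-to-SQL setup, the sketch below serializes a database schema into a prompt and executes a predicted query with Python's built-in `sqlite3` module. The `singer` table, the prompt layout, and the hard-coded `predicted_sql` string are assumptions; a real system would obtain the SQL from a trained parser or an LLM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "Spain", 31), ("Ben", "France", 45), ("Cleo", "Spain", 27)],
)


def serialize_schema(connection):
    """Render each table as 'name(col1, col2, ...)' for inclusion in a prompt."""
    lines = []
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        columns = connection.execute(f"PRAGMA table_info({table_name})").fetchall()
        column_names = ", ".join(column[1] for column in columns)
        lines.append(f"{table_name}({column_names})")
    return "\n".join(lines)


question = "How many singers are from Spain?"
prompt = f"Schema:\n{serialize_schema(conn)}\nQuestion: {question}\nSQL:"

# Hypothetical model output, hard-coded here in place of a parser or LLM call.
predicted_sql = "SELECT COUNT(*) FROM singer WHERE country = 'Spain'"

answer = conn.execute(predicted_sql).fetchone()[0]
print(prompt)
print(answer)
```

Because the schema is passed in at inference time, the same code path works for a database the model has never seen, which is the generalization setting Spider is designed to test.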
A simple but surprisingly competitive approach serializes the table as a string (rows separated by newlines, cells separated by pipes) and feeds it to a seq2seq model along with the question. The model generates the answer directly. TaPEx and many LLM prompting approaches use variants of this format. The main weakness is loss of structural cues for large tables; the main strength is that it inherits the full reasoning power of the underlying generative model.
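The serialization step can be sketched in a few lines. The delimiters and the `question:` / `table:` prefixes below are illustrative assumptions rather than the exact input format of any particular model.

```python
def linearize_table(header, rows, question):
    """Flatten a table into pipe-delimited lines and prepend the question."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return f"question: {question}\ntable:\n" + "\n".join(lines)


header = ["Player", "Team", "Goals"]
rows = [["Kim", "Reds", 12], ["Lee", "Blues", 9]]
text = linearize_table(header, rows, "Who scored the most goals?")
print(text)
```

The resulting string is passed to the seq2seq model as ordinary text, so the only structural signal the model receives is the position of the delimiters.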
TaPas, TaBERT, MATE, TUTA, ReasTAP, and PASTA modify the transformer architecture itself to respect table structure. Typical additions include separate token-type embeddings for headers and cells, row and column position embeddings, and structural attention biases. Pretraining objectives include masked cell value prediction, denotation pretraining, and synthetic SQL execution. These models excel at cell selection and aggregation answers but tend to be limited by their maximum input length (commonly 512 or 1024 tokens).
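To make the idea of row and column position embeddings concrete, the sketch below computes the integer ids that such embeddings would be looked up from. It is a simplified illustration under stated assumptions: whitespace tokenization stands in for a real subword tokenizer, and the id scheme (0 for question tokens, 1-based indices for cells) is one plausible convention, not the exact layout of any cited model.

```python
def structural_ids(question, header, rows):
    """Tag each whitespace token with segment, column, and row ids.

    Question tokens get segment 0 and row/column 0; header tokens get row 0;
    cell tokens get 1-based row and column indices.
    """
    tokens, segment_ids, column_ids, row_ids = [], [], [], []

    def add(text, segment, column, row):
        for token in str(text).split():
            tokens.append(token)
            segment_ids.append(segment)
            column_ids.append(column)
            row_ids.append(row)

    add(question, 0, 0, 0)
    for col_index, name in enumerate(header, start=1):
        add(name, 1, col_index, 0)
    for row_index, row in enumerate(rows, start=1):
        for col_index, cell in enumerate(row, start=1):
            add(cell, 1, col_index, row_index)
    return tokens, segment_ids, column_ids, row_ids


tokens, segment_ids, column_ids, row_ids = structural_ids(
    "total goals ?", ["Player", "Goals"], [["Kim", 12], ["Lee", 9]]
)
for item in zip(tokens, segment_ids, column_ids, row_ids):
    print(item)
```

In a model of this family, each id sequence would index its own learned embedding table, and the resulting vectors would be added to the token embeddings before the first transformer layer.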
Modern systems treat the LLM as a program writer, not the executor. The LLM produces SQL, Python (pandas), or a domain-specific program, the program runs in a sandbox, and the result is returned to the model for final formatting. Binder, Lever, ReAcTable, and the OpenAI Code Interpreter follow this pattern. Code execution sidesteps hallucinated cell values and gives exact arithmetic, which is why it dominates 2024 leaderboards on numerically heavy datasets.
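The generate, execute, and format loop might look roughly like the following. This is a sketch under loud assumptions: `generate_program` is a hypothetical stand-in for an LLM call and returns a fixed string, the `sales` table is invented, and the `SELECT`-only check is a toy guard, not a real sandbox.

```python
import sqlite3


def generate_program(question, schema):
    """Hypothetical stand-in for an LLM call; returns a fixed SQL string."""
    return "SELECT SUM(amount) FROM sales WHERE region = 'EU'"


def run_readonly(connection, sql):
    """Execute a single SELECT statement and return all rows."""
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed in this sketch")
    return connection.execute(sql).fetchall()


def answer_question(connection, question, schema):
    """Generate a program, execute it, and format the executed result."""
    sql = generate_program(question, schema)
    try:
        rows = run_readonly(connection, sql)
    except (sqlite3.Error, ValueError) as error:
        return f"execution failed: {error}"
    # A real system would hand the rows back to the model for final wording.
    return f"{question} -> {rows[0][0]}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 120.5), ("EU", 79.5), ("US", 300.0)],
)
result = answer_question(conn, "Total EU sales?", "sales(region, amount)")
print(result)
```

The sum is computed by the database engine rather than by token generation, which is the property that makes this pattern attractive for arithmetic-heavy questions.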
Chain-of-Table extends chain-of-thought prompting by letting the model perform table operations (select columns, group, filter, add column) step by step. Each operation transforms the table, so intermediate state is grounded in actual data rather than text reasoning that might drift from the table.
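The step-by-step transformation can be sketched with plain Python functions over a list of row dictionaries. The operation names, the toy table, and the fixed `plan` are illustrative assumptions; in the actual method the language model proposes each operation after inspecting the current intermediate table, rather than following a plan written in advance.

```python
def select_columns(table, columns):
    """Keep only the named columns in every row."""
    return [{name: row[name] for name in columns} for row in table]


def filter_rows(table, column, value):
    """Keep rows whose column equals the given value."""
    return [row for row in table if row[column] == value]


def group_count(table, column):
    """Count rows per distinct value of the column."""
    counts = {}
    for row in table:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return [{column: key, "count": count} for key, count in counts.items()]


OPERATIONS = {
    "select_columns": select_columns,
    "filter_rows": filter_rows,
    "group_count": group_count,
}

table = [
    {"rider": "Ito", "country": "JPN", "stage": 1},
    {"rider": "Roy", "country": "FRA", "stage": 1},
    {"rider": "Abe", "country": "JPN", "stage": 2},
    {"rider": "Sato", "country": "JPN", "stage": 1},
]

# Hypothetical plan for "Which country had the most riders in stage 1?".
plan = [
    ("filter_rows", ("stage", 1)),
    ("select_columns", (["country"],)),
    ("group_count", ("country",)),
]

state = table
for name, args in plan:
    state = OPERATIONS[name](state, *args)
    print(name, "->", state)
```

Each printed intermediate table is what the model would see before choosing its next operation, so every reasoning step is checked against real cell values.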
| Model | Year | Organization | Approach | Notes |
|---|---|---|---|---|
| Sempre / DynSP | 2015 to 2017 | Stanford, Microsoft | Semantic parsing to lambda-DCS | Established WikiTableQuestions baseline |
| Seq2SQL | 2017 | Salesforce Research | Seq2seq with RL on execution | First neural WikiSQL baseline at 59.4% execution accuracy |
| TaPas | April 2020 | Google Research | BERT plus row/column/rank embeddings | Direct cell selection plus aggregation; SQA 67.2% |
| TaBERT | May 2020 | Facebook AI Research | Content snapshots, joint text-table pretraining | Pretrained on 26M tables; strong on WikiTableQuestions |
| TaPEx | July 2021 | Microsoft Research Asia | BART pretrained as neural SQL executor | WikiSQL 89.5%, WikiTableQuestions 57.5% |
| MATE / TUTA | 2021 | Google, Microsoft | Sparse attention for very large tables | Targets thousands of rows |
| ReasTAP | March 2022 | MIT, Tencent | Reasoning-pretrained encoder-decoder | Synthetic reasoning tasks injected at pretraining |
| PASTA | April 2022 | Microsoft, ISCAS | Sentence-table cloze pretraining for fact verification | Strong on TabFact |
| OmniTab | July 2022 | Microsoft, CMU | Joint natural and synthetic data pretraining | New state-of-the-art on WikiTableQuestions; 16.2% gain in 128-shot |
| Binder | October 2022 | Yale, University of Washington, Allen AI | Codex prompts SQL or Python with model-bound APIs | Training-free; state-of-the-art on WikiTQ and TabFact |
| Lever | February 2023 | Meta AI, Yale | Verifier reranks generated programs | 4.6% to 10.9% gains on TableQA code generation |
| TableLlama | November 2023 | Ohio State University | Llama-2 7B fine-tuned with LongLoRA | First open generalist table model; 8K context |
| Chain-of-Table | January 2024 | Google, UC San Diego | Iterative table operation prompting | State-of-the-art on WikiTQ, FeTaQA, TabFact |
| StructLM | February 2024 | University of Waterloo, Vector | Mistral and CodeLlama tuned on 1.1M structured tasks | Beats TableLlama by 35% on out-of-domain tasks |

| Benchmark | Year | Creators | Size | What it tests |
|---|---|---|---|---|
| WikiTableQuestions (WTQ) | 2015 | Pasupat and Liang, Stanford | 22,033 questions, 2,108 Wikipedia tables | Compositional reasoning, denotation accuracy |
| WikiSQL | 2017 | Zhong, Xiong, Socher (Salesforce) | 80,654 question-SQL pairs, 24,241 tables | Single-table SELECT-WHERE SQL |
| SQA | 2017 | Iyyer, Yih, Chang (Microsoft) | 6,066 question sequences (17,553 questions) | Conversational table QA |
| Spider | 2018 | Yu et al., Yale | 10,181 questions, 5,693 SQL, 200 DBs, 138 domains | Cross-domain text-to-SQL generalization |
| SParC | 2019 | Yu et al., Yale | 4,298 question sequences over Spider DBs | Context-dependent multi-turn SQL |
| CoSQL | 2019 | Yu et al., Yale | 3,007 dialogues, 30k turns | Conversational text-to-SQL |
| TabFact | 2019 | Chen et al., UCSB | 118,000 statements, 16,000 tables | Binary entailment over tables |
| HybridQA | 2020 | Chen et al., UCSB | 70,000+ questions | Multi-hop reasoning over tables plus linked text |
| FeTaQA | 2021 | Nan et al., Yale and Salesforce | 10,000 free-form question and answer pairs | Generation of explanatory answers |
| TAT-QA | 2021 | Zhu et al., NUS | 16,552 questions on financial reports | Numerical reasoning over hybrid text and tables |
| FinQA | 2021 | Chen et al., UCSB | 8,281 expert-written question and answer pairs | Multi-step numerical reasoning in finance |
| BIRD | 2023 | Li et al., HKU and Alibaba | 12,751 questions, 95 large databases | Large-scale text-to-SQL over messy real-world data |
The most common evaluation metrics depend on the answer type. Denotation accuracy checks whether the predicted answer set equals the gold answer set; it is the default for WikiTableQuestions and SQA. Exact match accuracy compares the predicted query to the gold query, often after normalization (Spider compares clause-level components rather than raw strings), and is used for WikiSQL and Spider. Execution accuracy runs both the gold SQL and the predicted SQL against the database and checks whether the result sets match, which is more lenient than exact match because semantically equivalent SQL with different syntax counts as correct. Classification accuracy is the usual metric for TabFact and other binary entailment framings, while F1 appears alongside exact match on hybrid benchmarks such as TAT-QA. For free-form benchmarks like FeTaQA, evaluators report BLEU, ROUGE, and increasingly LLM-as-judge scores; the field has not converged on a single standard for generative TableQA.
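The difference between these metrics is easiest to see on a small example. The sketch below is a simplified illustration, not any benchmark's official scorer: the normalization rules, the toy table `t`, and the use of raw string equality for exact match are all assumptions, and row order is ignored when comparing execution results.

```python
import sqlite3
from collections import Counter


def normalize(value):
    """Lowercase strings and render integral floats without a trailing '.0'."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().lower()


def denotation_match(predicted, gold):
    """True when both answer collections hold the same normalized values."""
    return {normalize(v) for v in predicted} == {normalize(v) for v in gold}


def execution_match(connection, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows."""
    try:
        predicted_rows = connection.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold_rows = connection.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 3), ("c", 3)])

gold_sql = "SELECT name FROM t WHERE score = 3"
predicted_sql = "SELECT name FROM t WHERE score > 2"

exact = predicted_sql == gold_sql
execution = execution_match(conn, predicted_sql, gold_sql)
denotation = denotation_match(["B", "c"], ["b", "C"])
print(exact, execution, denotation)
```

The predicted query fails string-level exact match but passes execution match on this database. That leniency cuts both ways: two queries can agree on one database instance and still differ in meaning.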
Through 2023 and 2024, frontier general-purpose LLMs (GPT-4, Claude, and Gemini) combined with code execution tools have set the top scores on WikiTableQuestions, TabFact, and BIRD. Specialized table models still serve where latency, cost, or on-device constraints rule out cloud LLMs, or where the task is narrow enough that a small fine-tuned encoder is adequate. Practical deployments often combine retrieval-augmented generation over a data warehouse with text-to-SQL: retrieval narrows the relevant tables, an LLM writes SQL, the database executes it, and a second LLM call summarizes the result.
Use cases span business intelligence (natural language analytics on Snowflake or BigQuery), spreadsheet copilots (Excel, Google Sheets), financial document question answering (10-K filings, earnings reports, with FinQA-style numerical reasoning), scientific table analysis on arXiv and PubMed, government open-data interfaces, and customer-facing dashboard chatbots. Enterprise text-to-SQL has matured into a product category, with both startups (Vanna, Defog, Wherobots) and incumbents (Databricks AI/BI Genie, Snowflake Cortex) shipping commercial systems.
Large tables remain a persistent problem. Most table-specific transformers cap at 512 or 1024 tokens, and even long-context LLMs degrade on retrieval inside multi-thousand-row tables. Strategies include row sampling, column selection, and chunking, but each introduces information loss. Numerical reasoning is reliable only when the model writes and executes code; raw token generation still produces arithmetic errors on multi-step problems. Multi-table joins and schema understanding for unseen databases are the central difficulties in Spider and BIRD; even GPT-4 with prompting strategies falls well below human performance on BIRD's harder splits. Hallucinated cell values appear when LLMs generate answers without executing code, especially for low-frequency entities. Evaluation gaps for free-form answers (FeTaQA) push the community toward LLM-as-judge methods, which themselves have biases. Finally, real spreadsheets often violate the relational assumption with merged cells, multi-row headers, and embedded subtotals, which most academic models do not handle.
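Two of the large-table strategies mentioned above, column selection and chunking, can be sketched as follows. Both functions are illustrative assumptions: the lexical matching rule is deliberately naive, and production systems typically use learned retrievers or embeddings instead.

```python
def select_columns(header, rows, question):
    """Keep columns whose header appears as a word in the question."""
    words = set(question.lower().replace("?", "").split())
    keep = [i for i, name in enumerate(header) if name.lower() in words]
    if not keep:
        keep = list(range(len(header)))  # fall back to all columns
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


def chunk_table(header, rows, max_rows):
    """Split rows into consecutive chunks, repeating the header in each."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [
        [header] + rows[start:start + max_rows]
        for start in range(0, len(rows), max_rows)
    ]


header = ["city", "population", "area", "mayor"]
rows = [[f"city{i}", 1000 * i, 10 * i, f"mayor{i}"] for i in range(1, 8)]

small_header, small_rows = select_columns(
    header, rows, "Which city has the largest population?"
)
chunks = chunk_table(small_header, small_rows, max_rows=3)
print(small_header)
print(len(chunks))
```

The information loss is visible even here: a question whose answer depends on comparing rows in different chunks cannot be answered from any single chunk, so a second aggregation step over per-chunk results is needed.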
See also: natural language processing models, question answering systems, text to SQL models, foundation model.