Table Question Answering Models

Updated Jul 16, 2026

Table question answering models (TableQA models) are machine learning systems that answer natural language questions over structured tabular data such as spreadsheets, database tables, and HTML tables extracted from web pages. The task sits at the intersection of natural language processing, structured data analysis, and reasoning, and it has produced a distinct family of models that differ from open-domain reading comprehension systems because tables impose schema, ordering, and numerical structure that text-only models often handle poorly.

Two dominant paradigms have shaped the field. The first is semantic parsing, in which a model translates a question into an executable logical form (typically SQL, lambda calculus, or a custom domain-specific language) that runs against the table to produce an answer. The second is end-to-end neural prediction, in which a transformer encodes the linearized table together with the question and either selects cells directly, applies aggregation operators, or generates a free-form answer. A third hybrid approach has emerged with large language models: models generate code (SQL or Python) in-context, execute it through a tool, and use the result to answer.

TableQA is hard for several reasons. Tables mix heterogeneous data types (numbers, dates, free text, units). Cell content is often abbreviated or noisy. Real-world tables can have hundreds of rows that exceed standard context windows. Reasoning frequently requires arithmetic, comparison, aggregation across rows, or joins across multiple tables. Finally, schema understanding for unseen databases requires generalization rather than memorization, which is the main difficulty in benchmarks like Spider.^[3]

Brief history

Early work on question answering over structured data focused on natural language interfaces to databases (NLIDB) and rule-based parsers. These systems used manually crafted grammars and ontologies to map user queries to database queries, and while they could be precise, they required expensive expert labor to build and broke down on language they had not been explicitly programmed to handle. The modern academic line begins with the 2015 paper by Panupong Pasupat and Percy Liang at Stanford, which introduced the WikiTableQuestions dataset of 22,033 complex questions over Wikipedia tables and a compositional semantic parser called Sempre.^[1] Their work established WikiTableQuestions (often abbreviated WTQ) as a primary benchmark and showed that compositional logical forms could be learned from question-answer pairs even without annotated programs.^[1]

The same period produced WikiSQL (Zhong, Xiong, and Socher 2017), a much larger collection of 80,654 question-SQL pairs over 24,241 Wikipedia tables that constrained queries to a single table with a simple SELECT-WHERE form.^[2] Its scale enabled better neural training but its constrained query structure made it a relatively easy benchmark once strong models appeared. Tao Yu and colleagues at Yale released Spider in 2018, a cross-domain text-to-SQL benchmark with 10,181 questions and 5,693 complex SQL queries spanning 200 databases across 138 domains, where train and test sets contain different schemas to test generalization.^[3] Spider quickly became the standard benchmark for enterprise text-to-SQL and was followed by multi-turn variants SParC and CoSQL.^[21]^[22]

The arrival of BERT in 2018 triggered a wave of pretrained table encoders. Google released TAPAS (Herzig et al. 2020, arXiv 2004.02349), which extended BERT with column, row, and rank embeddings so a single transformer could read the table and the question jointly.^[5] Facebook AI Research published TaBERT (Yin, Neubig, Yih, and Riedel 2020, arXiv 2005.08314), pretrained on 26 million tables paired with surrounding text.^[6] Microsoft followed with TaPEx (Liu et al. 2021, arXiv 2107.07653), which pretrained a BART encoder-decoder by teaching it to act as a neural SQL executor on synthetic queries, raising WikiSQL denotation accuracy to 89.5%.^[11]

The 2019 to 2021 period also produced a cluster of task-specific benchmarks for fact verification (TabFact), hybrid reasoning (HybridQA, TAT-QA, FinQA), and free-form generation (FeTaQA), broadening the community's view of what TableQA should cover beyond single-cell lookup and simple aggregation.^[4]^[7]^[8]^[9]^[10]

From 2022 onward the dominant story has been the rise of in-context learning over tables. Wenhu Chen showed that GPT-3 with a single demonstration could match or beat specialized table models on WikiTableQuestions, FeTaQA, TabFact, and FEVEROUS (arXiv 2210.06710).^[15] Binder (Cheng et al. 2022) bound a language model to SQL or Python as a tool, achieving state-of-the-art on WikiTableQuestions and TabFact without supervised training.^[14] By 2024 systems like Chain-of-Table (Wang et al. 2024) treated the table itself as an evolving reasoning state,^[19] while open models such as TableLlama (Zhang et al. 2023) and StructLM (Zhuang et al. 2024) packaged generalist table abilities into instruction-tuned Llama and Mistral families.^[18]^[20]

Table encoding and serialization

Before any model processes a table, the table must be converted into a form the model can consume. This representation problem is central to TableQA and different approaches make very different choices.

Flat linearization

The simplest method serializes the table as a sequence of tokens. A common format places the header row first (with a special separator), followed by each data row, using pipe characters or newline characters to delimit cells. For example:

header: Name | Country | GDP
row 1: United States | USA | 25.46T

This format is easy to implement and is compatible with any language model, but it loses all structural information beyond sequential position. As table size grows, the flat sequence can quickly exceed context window limits, and the model must learn to infer column and row boundaries from delimiter tokens alone.

Many LLM-based TableQA systems default to some variant of flat linearization because it requires no architectural changes. TaPEx uses a cleaned version in which a single header segment is followed by index-tagged rows.^[11]
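
The flat format described above can be sketched in a few lines. This is a minimal illustration, not the serializer of any particular model; the function name `linearize_table` and the exact `header:` / `row N:` labels are assumptions chosen to match the example in this section.

```python
def linearize_table(header, rows):
    """Flatten a table into one string: a header segment, then indexed rows."""
    parts = ["header: " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        # Cast every cell to str so numbers and text share one format.
        parts.append(f"row {i}: " + " | ".join(str(cell) for cell in row))
    return "\n".join(parts)


header = ["Name", "Country", "GDP"]
rows = [["United States", "USA", "25.46T"]]
serialized = linearize_table(header, rows)
print(serialized)
```

The resulting string is simply concatenated with the question before it is passed to the model, which is why sequence length grows linearly with the number of cells.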

Structured token-type embeddings (TAPAS approach)

A more principled approach, pioneered by TAPAS, adds extra learned embeddings that are summed with the standard wordpiece and position embeddings. Each table token receives embeddings for: (1) its column index, (2) its row index, (3) a segment ID that separates question tokens from table tokens, and (4) a rank embedding that encodes the numeric rank of the cell value within its column.^[5] These additional signals let the transformer learn attention patterns that respect table structure instead of relying on sequential position alone. The downside is that this architecture is harder to extend to arbitrary table sizes: the original TAPAS was capped at 512 tokens, which excluded many real-world tables.^[5]
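
To make the extra index signals concrete, the sketch below computes column, row, and rank IDs for every token of a small table. It is an illustration under stated assumptions, not the TAPAS tokenizer: whitespace splitting stands in for wordpieces, the header is treated as row 0, columns are 1-indexed, and rank 0 means "not numeric". The function name `table_token_ids` is hypothetical.

```python
def table_token_ids(header, rows):
    """Return (token, column_id, row_id, rank_id) tuples for every table token."""
    # Rank numeric cells within each column (1 = smallest value).
    ranks = {}
    for c in range(len(header)):
        numeric = []
        for r, row in enumerate(rows, start=1):
            try:
                numeric.append((float(row[c]), r))
            except ValueError:
                pass
        if len(numeric) == len(rows):  # rank only fully numeric columns
            distinct = sorted({value for value, _ in numeric})
            for value, r in numeric:
                ranks[(r, c)] = distinct.index(value) + 1

    out = []
    for r, row in enumerate([header] + rows):  # header becomes row 0
        for c, cell in enumerate(row):
            for tok in str(cell).split():
                out.append((tok, c + 1, r, ranks.get((r, c), 0)))
    return out


header = ["City", "Population"]
rows = [["Oslo", "700000"], ["Bergen", "290000"]]
ids = table_token_ids(header, rows)
for item in ids:
    print(item)
```

In a real model each of these integer IDs would index its own embedding table, and the looked-up vectors would be added to the token embedding before the first transformer layer.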

Content snapshot encoding (TaBERT approach)

TaBERT addresses large tables differently. Instead of trying to fit the whole table into one sequence, it builds a content snapshot: for each question, it selects the few rows most relevant to the question (ranked by n-gram overlap in the original paper), then encodes each row together with its column headers as a separate sequence.^[6] Vertical self-attention layers then connect the per-row encodings into a joint representation.^[6] This approach scales better to tables with many rows, but it requires a row-selection step and can miss cross-row reasoning over rows that were not selected.
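
The row-selection step can be sketched as follows. This is a simplified illustration that assumes word n-gram overlap as the relevance measure; the helper names `ngrams` and `content_snapshot`, the tie-breaking rule, and the `header: value` output format are assumptions of this sketch rather than details taken from TaBERT's code.

```python
import re


def ngrams(text, n_max=3):
    """All word n-grams of length 1..n_max, lowercased."""
    toks = re.findall(r"\w+", text.lower())
    return {tuple(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)}


def content_snapshot(question, header, rows, k=2):
    """Pick the k rows with the highest n-gram overlap with the question."""
    q = ngrams(question)
    scored = []
    for idx, row in enumerate(rows):
        overlap = len(q & ngrams(" ".join(str(c) for c in row)))
        scored.append((overlap, -idx, row))  # -idx: earlier rows win ties
    top = sorted(scored, reverse=True)[:k]
    # One sequence per selected row, pairing each cell with its column header.
    return [" | ".join(f"{h}: {c}" for h, c in zip(header, row))
            for _, _, row in top]


header = ["Country", "Capital", "Population"]
rows = [["France", "Paris", "68M"],
        ["Norway", "Oslo", "5.5M"],
        ["Japan", "Tokyo", "125M"]]
snapshot = content_snapshot("What is the capital of Norway?", header, rows, k=1)
print(snapshot)
```

A question whose answer depends on a row with no lexical overlap (for example "which country has the largest population?") would not be served well by this selection, which is the cross-row weakness noted above.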

Sparse attention for very large tables

MATE (Multi-view Attention for Table Transformer Efficiency, Eisenschlos et al. 2021) and TUTA (Wang et al. 2021) extend the encoding approach to very large tables by using sparse attention patterns that connect each cell mainly to its own row and column rather than to every token in the table.^[12] This reduces memory and compute from quadratic to near-linear in table size, enabling processing of tables with thousands of rows at the cost of some long-range attention capacity.^[12]

Markdown, HTML, and domain-specific formats

Some systems pass tables to models in their original HTML or in GitHub-flavored Markdown format, relying on the model's pretraining exposure to recognize these formats. HTML encoding preserves colspan and rowspan attributes (useful for merged cells) but adds considerable token overhead. For spreadsheet-specific systems, formats that encode cell coordinates (A1, B2, etc.) directly can be useful. The question of which serialization format is best remains open and likely depends on model pretraining distribution and table type.

Methods

Semantic parsing to logical forms

Classical semantic parsing pipelines map a question to a logical form (lambda-DCS, FunQL, or SQL) and execute it. The parser is trained with weak supervision: given a question and the final denotation (the answer), search through candidate programs that produce the correct value.^[1] Sempre, DynSP, and Coarse2Fine fall in this category. Logical forms are interpretable and composable, but search is expensive and spurious programs (programs that produce the right answer for the wrong reason) are a chronic problem.^[1] A spurious program gets the correct denotation on the training example but does not generalize: on a new question with similar wording but different data, it fails. Filtering spurious programs requires either annotated programs or careful augmentation of the search procedure.
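
The spurious-program problem is easy to reproduce on a toy table. In the sketch below, candidate programs are plain Python functions and the search simply keeps every candidate whose denotation matches the gold answer; the table, the candidate set, and the function name `consistent_programs` are all invented for illustration and do not correspond to Sempre or any other parser.

```python
table = [
    {"city": "Oslo", "country": "Norway", "population": 700},
    {"city": "Bergen", "country": "Norway", "population": 290},
    {"city": "Lyon", "country": "France", "population": 520},
]

# Hypothetical candidate programs: name -> function(table) -> denotation.
candidates = {
    "count(country == Norway)": lambda t: sum(r["country"] == "Norway" for r in t),
    "count(population > 500)": lambda t: sum(r["population"] > 500 for r in t),
    "max(population)": lambda t: max(r["population"] for r in t),
}


def consistent_programs(table, answer):
    """Weak supervision: keep every candidate whose denotation equals the answer."""
    return [name for name, prog in candidates.items() if prog(table) == answer]


# Question: "How many cities are in Norway?"  Gold answer: 2
found = consistent_programs(table, 2)
print(found)

# On a slightly different table the spurious program diverges.
new_table = table + [{"city": "Paris", "country": "France", "population": 2100}]
print({name: candidates[name](new_table) for name in found})
```

Both counting programs return 2 on the training table, so the answer alone cannot tell them apart; only the first one still answers the question once a fourth row is added.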

Text-to-SQL

A close cousin of general semantic parsing, text-to-SQL targets SQL specifically as the output language. Because SQL is widely used in industry, text-to-SQL has become the dominant practical formulation. Early neural systems like Seq2SQL added reinforcement learning with execution rewards.^[2] Later architectures (IRNet, RAT-SQL, PICARD) added schema linking, relation-aware attention, and constrained decoding to handle cross-domain Spider data. Schema linking is the key challenge in cross-domain text-to-SQL: the model must correctly identify which tables and columns the question is about, even when column names in the database schema do not match vocabulary in the question. PICARD tackles a separate problem, output validity: it constrains decoding so that token sequences that cannot be completed into valid SQL are rejected at each step. LLM-prompted text-to-SQL (DIN-SQL, MAC-SQL, DAIL-SQL) now leads most leaderboards by decomposing the problem into sub-steps: schema filtering, SQL generation, and self-correction.
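
A deliberately naive schema linker shows why vocabulary mismatch matters. The sketch below links a column only when every word of its name appears in the question; the schema, the questions, and the function name `link_schema` are hypothetical, and real systems such as RAT-SQL use learned representations rather than string matching.

```python
import re


def link_schema(question, schema):
    """Return (table, column) pairs whose name words all appear in the question."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    links = []
    for table, columns in schema.items():
        for col in columns:
            col_words = set(col.lower().split("_"))
            if col_words <= q_words:
                links.append((table, col))
    return links


schema = {
    "employees": ["name", "salary", "dept_id"],
    "departments": ["dept_id", "dept_name"],
}

easy = link_schema("List the name and salary of each employee", schema)
hard = link_schema("How much does each worker earn?", schema)
print(easy)
print(hard)
```

The second question is about the same `salary` column, but because it says "earn" instead of "salary" the string matcher finds nothing, which is the gap that learned schema linking is meant to close.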

Table-as-text linearization

A simple but surprisingly competitive approach serializes the table as a string (rows separated by newlines, cells separated by pipes) and feeds it to a seq2seq model along with the question. The model generates the answer directly. TaPEx and many LLM prompting approaches use variants of this format.^[11] The main weakness is loss of structural cues for large tables; the main strength is that it inherits the full reasoning power of the underlying generative model. Chen (2022) showed that GPT-3-style models pretrained on web text can perform usefully under this format with as little as a single demonstration, which suggests they absorb implicit table knowledge from HTML and Wikipedia exposure.^[15]

Table-specific transformers

TAPAS, TaBERT, MATE, TUTA, ReasTAP, and PASTA modify the transformer architecture itself to respect table structure. Typical additions include separate token-type embeddings for headers and cells, row and column position embeddings, and structural attention biases. Pretraining objectives include masked cell value prediction, denotation pretraining, and synthetic SQL execution. These models excel at cell selection and aggregation answers but tend to be capped by their pretraining table size (commonly 512 or 1024 tokens). TAPAS in particular produces a probability distribution over table cells that can be combined with a learnable aggregation operator (SUM, COUNT, AVERAGE, NONE) to produce answers without ever generating free text.^[5]
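
The cell-selection-plus-aggregation output can be sketched as a small decoding function. The probabilities below are invented; in a real model they would come from the cell-selection and aggregation heads. The 0.5 threshold, the `None` result for an empty selection, and the function name `predict_answer` are assumptions of this sketch.

```python
def predict_answer(cells, cell_probs, op_probs, threshold=0.5):
    """Select cells above the threshold, then apply the most likely operator."""
    selected = [c for c, p in zip(cells, cell_probs) if p > threshold]
    op = max(op_probs, key=op_probs.get)
    if op == "NONE":
        return selected               # the answer is the selected cells themselves
    if op == "COUNT":
        return len(selected)
    values = [float(c) for c in selected]
    if not values:
        return None                   # nothing selected, nothing to aggregate
    if op == "SUM":
        return sum(values)
    if op == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"unknown operator: {op}")


cells = ["120", "80", "200"]
cell_probs = [0.9, 0.1, 0.8]          # made-up selection probabilities
op_probs = {"NONE": 0.05, "SUM": 0.85, "COUNT": 0.05, "AVERAGE": 0.05}
total = predict_answer(cells, cell_probs, op_probs)
print(total)
```

Because the arithmetic is applied outside the network to the selected cell values, the final number is exact whenever the selection and the operator are right.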

Code execution with LLMs

Modern systems treat the LLM as a program writer, not the executor. The LLM produces SQL, Python (pandas), or a domain-specific program, the program runs in a sandbox, and the result is returned to the model for final formatting. Binder, Lever, ReAcTable, and the OpenAI Code Interpreter follow this pattern. Code execution sidesteps hallucinated cell values and gives exact arithmetic, which is why it dominates 2024 leaderboards on numerically heavy datasets. The Binder approach is notable because it adds model-defined API functions: the LLM can call a function (defined on the fly in the prompt) that the system resolves back to the LLM itself, enabling hybrid SQL-plus-language reasoning without a custom architecture.^[14]
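
The write-then-execute loop can be sketched with the standard-library `sqlite3` module standing in for the sandbox. The function `fake_llm_generate_sql` is a hard-coded placeholder for a model call, and the `sales` table and its rows are invented; nothing here reflects the internals of Binder, Lever, or any named product.

```python
import sqlite3


def fake_llm_generate_sql(question, schema):
    """Stand-in for an LLM call; a real system would prompt a model here."""
    return "SELECT SUM(revenue) FROM sales WHERE region = 'EU'"


def answer_with_sql(question, rows):
    conn = sqlite3.connect(":memory:")   # in-memory database acts as the sandbox
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    sql = fake_llm_generate_sql(question, "sales(region, quarter, revenue)")
    try:
        result = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # A real loop could feed this error back to the model for a retry.
        result = f"execution failed: {exc}"
    finally:
        conn.close()
    return sql, result


rows = [("EU", "Q1", 10.5), ("EU", "Q2", 12.0), ("US", "Q1", 20.0)]
sql, result = answer_with_sql("What was total EU revenue?", rows)
print(sql)
print(result)
```

The model never has to reproduce a cell value or add numbers itself; it only has to write a query that the database engine evaluates.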

Chain-of-table reasoning

Chain-of-Table extends chain-of-thought prompting by letting the model perform table operations (select columns, group, filter, add column) step by step.^[19] Each operation transforms the table, so intermediate state is grounded in actual data rather than text reasoning that might drift from the table.^[19] Unlike code execution, which outsources computation to a deterministic interpreter, Chain-of-Table keeps all steps inside the model's context window as successive table states, enabling transparent inspection of the reasoning chain.
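
The idea of a table that evolves through a chain of operations can be sketched as below. The operation set, the example table, and the hard-coded chain are assumptions for illustration; in the actual method the language model chooses each operation and its arguments, which this sketch does not attempt to reproduce.

```python
def select_columns(table, cols):
    return [{c: row[c] for c in cols} for row in table]


def filter_rows(table, col, value):
    return [row for row in table if row[col] == value]


def group_count(table, col):
    counts = {}
    for row in table:
        counts[row[col]] = counts.get(row[col], 0) + 1
    return [{col: key, "count": n} for key, n in counts.items()]


OPS = {"select_columns": select_columns,
       "filter_rows": filter_rows,
       "group_count": group_count}


def run_chain(table, chain):
    """Apply each (op_name, args) step and keep every intermediate table."""
    states = [table]
    for op_name, args in chain:
        states.append(OPS[op_name](states[-1], *args))
    return states


# Question: "Which country has the most riders in the top 3?"
table = [{"rank": 1, "rider": "A", "country": "ESP"},
         {"rank": 2, "rider": "B", "country": "ITA"},
         {"rank": 3, "rider": "C", "country": "ESP"}]
chain = [("select_columns", (["country"],)), ("group_count", ("country",))]
states = run_chain(table, chain)
for state in states:
    print(state)
```

Keeping every intermediate table in `states` is what makes the reasoning inspectable: each step can be checked against the data it was derived from.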

Retrieval-augmented and multi-table approaches

For enterprise databases with hundreds of tables, an initial retrieval step narrows the relevant schema before any SQL generation. Retrieval-augmented generation pipelines over data warehouses combine embedding-based table retrieval with LLM-based query generation. HybridQA and TAT-QA require models to reason over tables linked to external text passages,^[7]^[9] and this multi-hop setting is increasingly important in practice: a financial analyst's question may require combining a table of quarterly revenues with the text of a press release.

Notable models

Model	Year	Organization	Approach	Notes
Sempre / DynSP	2015 to 2017	Stanford, Microsoft	Semantic parsing to lambda-DCS	Established WikiTableQuestions baseline^[1]
Seq2SQL	2017	Salesforce Research	Seq2seq with RL on execution	First neural WikiSQL baseline at 59.4% execution accuracy^[2]
TAPAS	April 2020	Google Research	BERT plus row/column/rank embeddings	Direct cell selection plus aggregation; SQA 67.2%^[5]
TaBERT	May 2020	Facebook AI Research	Content snapshots, joint text-table pretraining	Pretrained on 26M tables; strong on WikiTableQuestions^[6]
TaPEx	July 2021	Microsoft Research Asia	BART pretrained as neural SQL executor	WikiSQL 89.5%, WikiTableQuestions 57.5%^[11]
MATE / TUTA	2021	Google, Microsoft	Sparse attention for very large tables	Targets thousands of rows^[12]
ReasTAP	March 2022	MIT, Tencent	Reasoning-pretrained encoder-decoder	Synthetic reasoning tasks injected at pretraining
PASTA	April 2022	Microsoft, ISCAS	Sentence-table cloze pretraining for fact verification	Strong on TabFact
OmniTab	July 2022	Microsoft, CMU	Joint natural and synthetic data pretraining	New state-of-the-art on WikiTableQuestions; 16.2% gain in 128-shot^[13]
Binder	October 2022	Yale, University of Washington, Allen AI	Codex prompts SQL or Python with model-bound APIs	Training-free; state-of-the-art on WikiTQ and TabFact^[14]
Lever	February 2023	Meta AI, Yale	Verifier reranks generated programs	4.6% to 10.9% gains on TableQA code generation^[16]
TableLlama	November 2023	Ohio State University	Llama-2 7B fine-tuned with LongLoRA	First open generalist table model; 8K context^[18]
Chain-of-Table	January 2024	Google, UC San Diego	Iterative table operation prompting	State-of-the-art on WikiTQ, FeTaQA, TabFact^[19]
StructLM	February 2024	University of Waterloo, Vector	Mistral and CodeLlama tuned on 1.1M structured tasks	Beats TableLlama by 35% on out-of-domain tasks^[20]

Benchmarks

Benchmark	Year	Creators	Size	What it tests
WikiTableQuestions (WTQ)	2015	Pasupat and Liang, Stanford	22,033 questions, 2,108 Wikipedia tables	Compositional reasoning, denotation accuracy^[1]
WikiSQL	2017	Zhong, Xiong, Socher (Salesforce)	80,654 question-SQL pairs, 24,241 tables	Single-table SELECT-WHERE SQL^[2]
SQA	2017	Iyyer, Yih, Chang (Microsoft)	6,066 question sequences (17,553 questions)	Conversational table QA
Spider	2018	Yu et al., Yale	10,181 questions, 5,693 SQL, 200 DBs, 138 domains	Cross-domain text-to-SQL generalization^[3]
SParC	2019	Yu et al., Yale	4,298 question sequences over Spider DBs	Context-dependent multi-turn SQL^[21]
CoSQL	2019	Yu et al., Yale	3,007 dialogues, 30k turns	Conversational text-to-SQL^[22]
TabFact	2019	Chen et al., UCSB	118,000 statements, 16,000 tables	Binary entailment over tables^[4]
HybridQA	2020	Chen et al., UCSB	70,000+ questions	Multi-hop reasoning over tables plus linked text^[7]
FeTaQA	2021	Nan et al., Yale and Salesforce	10,000 free-form question and answer pairs	Generation of explanatory answers^[8]
TAT-QA	2021	Zhu et al., NUS	16,552 questions on financial reports	Numerical reasoning over hybrid text and tables^[9]
FinQA	2021	Chen et al., UCSB	8,281 expert-written question and answer pairs	Multi-step numerical reasoning in finance^[10]
BIRD	2023	Li et al., HKU and Alibaba	12,751 questions, 95 large databases	Big-bench text-to-SQL with messy real data^[17]

Metrics

The most common evaluation metrics depend on the answer type. Denotation accuracy checks whether the predicted answer set equals the gold answer set; it is the default for WikiTableQuestions and SQA.^[1] The term "denotation" refers to the value that a logical form evaluates to, so this metric is independent of the program the model used to produce the answer. Exact match accuracy compares the predicted output to the gold output as strings or structures, often after normalization (lowercasing, collapsing whitespace). At the query level it appears as logical form accuracy on WikiSQL and as exact set match on Spider, where SQL clauses are compared as sets of components so that harmless ordering differences are not penalized.^[2]^[3]
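
Denotation accuracy can be sketched as a set comparison after light normalization. The normalization rules here (lowercasing, whitespace collapsing, stripping thousands separators from numbers) are simplifying assumptions and are not the official WikiTableQuestions evaluator; the function names are hypothetical.

```python
def normalize(value):
    """Lowercase, collapse whitespace, and canonicalize numbers."""
    text = " ".join(str(value).lower().split())
    try:
        return str(float(text.replace(",", "")))
    except ValueError:
        return text


def denotation_match(predicted, gold):
    """True when the two answer collections are equal as sets after normalization."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}


def denotation_accuracy(pairs):
    return sum(denotation_match(p, g) for p, g in pairs) / len(pairs)


pairs = [
    (["Oslo"], ["oslo"]),            # case difference only: match
    (["1,200"], [1200]),             # same number, different formatting: match
    (["Paris", "Lyon"], ["Lyon"]),   # extra predicted item: no match
]
accuracy = denotation_accuracy(pairs)
print(round(accuracy, 3))
```

Note that the comparison never looks at how an answer was produced, which is exactly why a spurious program can still score as correct.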

Execution accuracy runs both the gold SQL and the predicted SQL against the database and checks whether the result sets match, which is more lenient than exact-match because semantically equivalent SQL with different syntax counts as correct.^[2] For example, SELECT name FROM t WHERE salary > 50000 and SELECT name FROM t WHERE 50000 < salary would differ under exact match but agree under execution accuracy. This metric has become preferred for text-to-SQL research because it rewards correct answers rather than correct syntax.
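
The two queries from the example above can be compared under both metrics using the standard-library `sqlite3` module. The table contents and the function name `execution_match` are invented for illustration, and ignoring row order is an assumption that would be wrong for queries with `ORDER BY`.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Compare result multisets, ignoring row order."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False          # SQL that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(pred, key=repr) == sorted(gold, key=repr)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("Ann", 60000), ("Bob", 40000), ("Cy", 75000)])

predicted = "SELECT name FROM t WHERE 50000 < salary"
gold = "SELECT name FROM t WHERE salary > 50000"

string_match = predicted == gold
exec_match = execution_match(conn, predicted, gold)
print("exact string match:", string_match)
print("execution match:", exec_match)
```

A known caveat of execution accuracy is that two different queries can agree by coincidence on one database state, so some evaluations run queries against several database instances.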

Plain classification accuracy is the standard metric for TabFact and other binary entailment framings,^[4] while token-level F1 is commonly reported alongside exact match on span-style benchmarks such as HybridQA and TAT-QA.^[7]^[9] For free-form benchmarks like FeTaQA, evaluators report BLEU, ROUGE, and increasingly LLM-as-judge scores; the field has not converged on a single standard for generative TableQA. The BIRD benchmark introduced an additional metric called valid efficiency score (VES), which penalizes SQL queries that return correct results but use unnecessarily slow execution plans, reflecting real-world performance concerns.^[17]

A persistent criticism of all these metrics is that they measure the final answer but not the quality of the reasoning process. A model can achieve high execution accuracy on WikiSQL by memorizing surface patterns between questions and SQL templates while failing on any unseen table structure.

Semantic parsing versus end-to-end approaches: tradeoffs

The choice between semantic parsing and end-to-end neural prediction involves meaningful tradeoffs that have shaped research strategy over the past decade.

Semantic parsing to SQL or logical forms produces outputs that are interpretable and executable, making errors diagnosable. An analyst can read the generated SQL and identify why a wrong answer was produced. Programs are also composable and scalable: once the right SQL is generated, it runs on any engine that supports the dialect and handles arbitrarily large result sets. The main weaknesses are brittleness to paraphrase, exposure to spurious programs during training, and the difficulty of handling questions that require linguistic understanding beyond what SQL can express (such as FeTaQA's explanatory answers).

End-to-end neural approaches like TAPAS bypass the logical form entirely and directly select cells or generate answers token by token.^[5] They generalize well to questions with ambiguous scope and handle free-form natural language more gracefully. Their weakness is that they cannot guarantee arithmetic correctness on multi-step calculations, cannot operate on tables larger than their context window, and produce answers that are opaque (there is no program to inspect). They are also difficult to apply in production where the output must interact with a live database.

Code execution with LLMs gives the best of both worlds in many settings: the generated SQL or Python is inspectable, execution is exact, and the underlying model can handle complex linguistic phenomena in the question. The remaining weakness is that LLMs can generate plausible-looking but incorrect programs, especially for complex multi-table queries or queries requiring domain knowledge not in the prompt.

LLM-based table reasoning

The arrival of large general-purpose language models changed the competitive landscape of TableQA quickly. Several properties explain why tables proved to be an unexpectedly tractable domain for LLMs.

First, LLMs pretrained on web text encounter many tables during training: Wikipedia infoboxes and data tables, HTML tables from news sites, CSV files in GitHub, and spreadsheet data in Stack Overflow answers. This implicit exposure gives them a non-trivial prior on tabular structure even without explicit table pretraining.

Second, the code generation abilities of modern LLMs map directly onto text-to-SQL. A model trained to produce Python and JavaScript code has already learned the kind of structured, syntactically constrained generation that SQL requires. Several studies have shown that code-specialized models (Codex, CodeLlama, DeepSeek-Coder) outperform general-purpose models on text-to-SQL even when the general-purpose models have more parameters.

Third, the reasoning capabilities of large models, especially those with chain-of-thought training, matter for the multi-step arithmetic and comparison operations that tables frequently demand. TAT-QA and FinQA require calculations like: "what is the percentage change in operating income from Q1 to Q2?" which require finding two cell values, computing their difference, dividing by the original, and expressing the result as a percentage.^[9]^[10] These steps are within reach of a capable LLM reasoning over a serialized table, though reliability degrades as the number of steps grows.
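
The percentage-change calculation described above breaks down into the four steps listed: look up two cells, subtract, divide by the original value, and scale to a percentage. The figures below are hypothetical and the function name `percentage_change` is an assumption of this sketch.

```python
def percentage_change(old, new):
    """(new - old) / old, expressed as a percentage."""
    if old == 0:
        raise ValueError("percentage change is undefined when the base value is zero")
    return (new - old) / old * 100


# Hypothetical operating income by quarter, as looked up from a table.
operating_income = {"Q1": 120.0, "Q2": 150.0}

difference = operating_income["Q2"] - operating_income["Q1"]
change = percentage_change(operating_income["Q1"], operating_income["Q2"])
print(f"difference: {difference}, change: {change}%")
```

Each step is trivial on its own; errors in practice tend to come from picking the wrong cells or dividing by the wrong base, which is why executing the arithmetic in code rather than in generated text helps.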

The practical regime in 2024 and 2025 is one where frontier LLMs (GPT-4o, Claude, Gemini) combined with tool use dominate accuracy leaderboards on nearly all TableQA benchmarks. Specialized table models remain relevant in latency-sensitive, cost-constrained, or on-device settings, and for narrow tasks where a small fine-tuned encoder can be deployed efficiently. Open-weight instruction-tuned models like StructLM and TableLlama offer intermediate options: they are table-aware, small enough for local deployment, and stronger on structured tasks than same-size general-purpose models.^[18]^[20]

For retrieval-augmented generation over large data warehouses, the typical production pattern has three stages: (1) embed all table schemas and retrieve the top-k relevant tables by cosine similarity to the question, (2) have the LLM generate SQL over those k tables, (3) execute the SQL and have a second LLM pass turn the result set into a natural language answer. This pattern, sometimes called text-to-SQL-to-text, is the basis of enterprise analytics products from Snowflake, Databricks, and independent startups.
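
Stage (1) of this pattern can be sketched with a toy bag-of-words vector in place of a learned embedding. The schema descriptions, the `embed` stand-in, and the function names are assumptions for illustration; a production system would call an embedding model and a vector index instead.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_tables(question, schemas, k=2):
    """Return the names of the k schemas most similar to the question."""
    q = embed(question)
    ranked = sorted(schemas, key=lambda name: cosine(q, embed(schemas[name])),
                    reverse=True)
    return ranked[:k]


# Hypothetical schema descriptions (table name followed by column words).
schemas = {
    "orders": "orders order id customer id order date total amount",
    "customers": "customers customer id name email country",
    "inventory": "inventory product id warehouse stock level",
}
top = retrieve_tables("total order amount by customer country", schemas, k=2)
print(top)
```

Only the retrieved schemas would then be placed in the prompt for stage (2), which keeps the SQL-generation prompt small even when the warehouse holds hundreds of tables.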

Modern era and applications

Through 2023 and 2024, frontier general-purpose LLMs (GPT-4, Claude, and Gemini) combined with code execution tools have set the top scores on WikiTableQuestions, TabFact, and BIRD. Specialized table models still serve where latency, cost, or on-device constraints rule out cloud LLMs, or where the task is narrow enough that a small fine-tuned encoder is adequate. Practical deployments often combine retrieval-augmented generation over a data warehouse with text-to-SQL: retrieval narrows the relevant tables, an LLM writes SQL, the database executes it, and a second LLM call summarizes the result.

Use cases span business intelligence (natural language analytics on Snowflake or BigQuery), spreadsheet copilots (Excel Copilot, Google Sheets' Gemini integration), financial document question answering (10-K filings, earnings reports, with FinQA-style numerical reasoning), scientific table analysis on arXiv and PubMed, government open-data interfaces, and customer-facing dashboard chatbots. Enterprise text-to-SQL has matured into a product category, with both startups (Vanna, Defog, Wherobots) and incumbents (Databricks AI/BI Genie, Snowflake Cortex) shipping commercial systems.

Spreadsheet copilots represent a distinct application category from database-oriented text-to-SQL. Spreadsheets mix data and computation in a single object (formulas can reference other cells, worksheets can have multi-row merged headers, and the same "table" may contain subtotals interleaved with data rows). Models must handle these departures from relational normal form, which most academic benchmarks do not test. Microsoft's Copilot for Excel and Google's Gemini for Sheets both rely on a combination of LLM-generated formulas, Python code execution, and natural language generation to answer questions and perform data transformations over spreadsheet content.

Limitations

Large tables remain a persistent problem. Most table-specific transformers cap at 512 or 1024 tokens, and even long-context LLMs degrade on retrieval inside multi-thousand-row tables. Strategies include row sampling, column selection, and chunking, but each introduces information loss. Even when tables fit within the context window, "lost in the middle" effects, where models attend poorly to rows in the middle of a long input, can degrade accuracy.

Numerical reasoning is reliable only when the model writes and executes code; raw token generation still produces arithmetic errors on multi-step problems. Multi-table joins and schema understanding for unseen databases are the central difficulties in Spider and BIRD; even GPT-4 with prompting strategies falls well below human performance on BIRD's harder splits.^[17] BIRD specifically uses messy real-world databases with inconsistent naming conventions, NULL values, and encoding issues, and the gap between model and human performance on it is substantially larger than on cleaner academic benchmarks.^[17]

Hallucinated cell values appear when LLMs generate answers without executing code, especially for low-frequency entities. A model asked "what was the GDP of Turkmenistan in 2018?" may confidently generate a plausible-sounding number rather than looking it up in the table, particularly if the table is long or the entity is rare.

Evaluation gaps for free-form answers (FeTaQA) push the community toward LLM-as-judge methods, which themselves have biases toward longer or more fluent answers regardless of factual content. BLEU and ROUGE are poor proxies for correctness on generated numerical explanations.

Real spreadsheets often violate the relational assumption with merged cells, multi-row headers, and embedded subtotals, which most academic models do not handle. The gap between benchmark tables (clean Wikipedia infoboxes, carefully curated financial filings) and production spreadsheets (years of accretion, inconsistent formatting, mixed languages) is the main source of performance degradation in deployed systems.

Ambiguity in natural language questions causes additional failures. A question like "which countries have the highest income?" could mean the top-1 country, the top-k countries, all countries above a threshold, or the country in the most recent year if the table is time-series. Humans resolve this ambiguity by asking clarifying questions, but most TableQA systems generate a single answer without flagging uncertainty. Generating clarifications or confidence estimates is an active research direction.

Finally, multilingual table QA remains significantly behind English performance. Most large table datasets are in English, and models fine-tuned on English table data generalize poorly to tables whose column headers, cell values, and questions are in Japanese, Chinese, Arabic, or low-resource languages.

References

1. Pasupat, P. and Liang, P. (2015). Compositional Semantic Parsing on Semi-Structured Tables. arXiv:1508.00305. https://arxiv.org/abs/1508.00305
2. Zhong, V., Xiong, C., and Socher, R. (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning (introduces WikiSQL). arXiv:1709.00103. https://arxiv.org/abs/1709.00103
3. Yu, T. et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887. https://arxiv.org/abs/1809.08887
4. Chen, W. et al. (2019). TabFact: A Large-scale Dataset for Table-based Fact Verification. arXiv:1909.02164. https://arxiv.org/abs/1909.02164
5. Herzig, J. et al. (2020). TaPas: Weakly Supervised Table Parsing via Pre-training. arXiv:2004.02349. https://arxiv.org/abs/2004.02349
6. Yin, P., Neubig, G., Yih, W. and Riedel, S. (2020). TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. arXiv:2005.08314. https://arxiv.org/abs/2005.08314
7. Chen, W. et al. (2020). HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data. arXiv:2004.07347. https://arxiv.org/abs/2004.07347
8. Nan, L. et al. (2021). FeTaQA: Free-form Table Question Answering. arXiv:2104.00369. https://arxiv.org/abs/2104.00369
9. Zhu, F. et al. (2021). TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. arXiv:2105.07624. https://arxiv.org/abs/2105.07624
10. Chen, Z. et al. (2021). FinQA: A Dataset of Numerical Reasoning over Financial Data. arXiv:2109.00122. https://arxiv.org/abs/2109.00122
11. Liu, Q. et al. (2021). TAPEX: Table Pre-training via Learning a Neural SQL Executor. arXiv:2107.07653. https://arxiv.org/abs/2107.07653
12. Eisenschlos, J. et al. (2021). MATE: Multi-view Attention for Table Transformer Efficiency. arXiv:2109.04312. https://arxiv.org/abs/2109.04312
13. Jiang, Z. et al. (2022). OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. arXiv:2207.03637. https://arxiv.org/abs/2207.03637
14. Cheng, Z. et al. (2022). Binding Language Models in Symbolic Languages (Binder). arXiv:2210.02875. https://arxiv.org/abs/2210.02875
15. Chen, W. (2022). Large Language Models are few(1)-shot Table Reasoners. arXiv:2210.06710. https://arxiv.org/abs/2210.06710
16. Ni, A. et al. (2023). LEVER: Learning to Verify Language-to-Code Generation with Execution. arXiv:2302.08468. https://arxiv.org/abs/2302.08468
17. Li, J. et al. (2023). Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD). arXiv:2305.03111. https://arxiv.org/abs/2305.03111
18. Zhang, T., Yue, X., Li, Y. and Sun, H. (2023). TableLlama: Towards Open Large Generalist Models for Tables. arXiv:2311.09206. https://arxiv.org/abs/2311.09206
19. Wang, Z. et al. (2024). Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. arXiv:2401.04398. https://arxiv.org/abs/2401.04398
20. Zhuang, A. et al. (2024). StructLM: Towards Building Generalist Models for Structured Knowledge Grounding. arXiv:2402.16671. https://arxiv.org/abs/2402.16671
21. Yu, T. et al. (2019). SParC: Cross-Domain Semantic Parsing in Context. arXiv:1906.02285. https://arxiv.org/abs/1906.02285
22. Yu, T. et al. (2019). CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. arXiv:1909.05378. https://arxiv.org/abs/1909.05378
