Spider 2.0

Updated May 31, 2026

Spider 2.0 is a benchmark for evaluating large language models on real-world enterprise text-to-SQL workflows. Introduced by Fangyu Lei and colleagues in the paper "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows," it was accepted as an oral presentation at the International Conference on Learning Representations (ICLR) in 2025. ^[1]^[2]^[7] The benchmark contains 632 problems built from production data applications, and it asks a model to translate a natural language request into SQL that runs against databases hosted on cloud warehouses such as BigQuery and Snowflake. ^[1] Where earlier text-to-SQL datasets had largely been solved, Spider 2.0 was designed to be hard: the strongest baseline reported in the paper, the OpenAI o1-preview model wrapped in a code agent framework, solves only 21.3% of the tasks, against 91.2% for the same framework on the original Spider and 73.0% on BIRD. ^[1]

Background and motivation

Text-to-SQL is the task of mapping a question in plain language to an executable SQL query, so that a person who does not write SQL can still ask questions of a relational database. The original Spider dataset, released by Tao Yu and collaborators at Yale in 2018, became the field's reference point. It paired 10,181 questions with 5,693 unique SQL queries across 200 databases spanning 138 domains, and its key idea was cross-domain generalization: the databases in the test set never appear in training, so a model has to adapt to schemas it has not seen. ^[3]^[8]

For several years Spider drove steady progress, and by the time large language models arrived the leaderboard had effectively saturated. Execution accuracy on Spider climbed past 90%, and reported systems pushed toward the high 90s, which left little signal for distinguishing one strong model from another. The 2023 BIRD benchmark from Jinyang Li and colleagues raised the bar with 95 larger and messier databases totaling about 33.4 GB across 37 professional domains, adding dirty values and questions that need outside knowledge. ^[4]^[9] BIRD is harder than Spider, yet it still presents one question that maps to one query against a single database, which is not how analysts actually work inside a company.

The gap that Spider 2.0 targets is the distance between those academic setups and a real data stack. Inside an enterprise, a single business question often turns into a multi-step workflow: you locate the right tables among thousands of columns, join across several of them, apply warehouse-specific functions, and sometimes read internal documentation or transformation code before you can write anything that returns the correct answer. The authors built Spider 2.0 to measure that work directly, and the result is a benchmark on which even frontier models start from a low base. ^[1]

Dataset and workflow design

Spider 2.0 holds 632 text-to-SQL workflow problems drawn from real data applications. The databases are large by benchmark standards, often containing over 1,000 columns each, and they live on local or cloud systems including BigQuery, Snowflake, DuckDB, and SQLite. ^[1] That scale matters, because a model can no longer fit a whole schema into a short prompt and reason over it in one pass. It has to search.
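
The search step can be pictured with a toy retriever. The following is a minimal sketch, assuming a hypothetical three-table schema and a simple token-overlap score; the names `SCHEMA`, `tokenize`, and `rank_columns` are illustrative and are not taken from the paper or its baselines, which may retrieve schema elements quite differently.

```python
import re

# Hypothetical schema (table -> columns), invented for illustration only.
SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "total_amount"],
    "customers": ["customer_id", "signup_date", "country"],
    "web_events": ["event_id", "session_id", "page_url", "event_timestamp"],
}


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def rank_columns(question, schema, top_k=3):
    """Return up to top_k 'table.column' names sharing the most tokens with the question."""
    question_tokens = tokenize(question)
    scored = []
    for table, columns in schema.items():
        for column in columns:
            overlap = len(question_tokens & tokenize(f"{table} {column}"))
            if overlap:
                scored.append((overlap, f"{table}.{column}"))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


print(rank_columns("total amount per customer country", SCHEMA))
```

With three tables the whole schema would fit in a prompt anyway; the point of the sketch is only that once a database has more than 1,000 columns, some filtering step of this kind has to run before the model writes SQL.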

The paper describes the central difficulty plainly: solving a Spider 2.0 problem "frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases." ^[1] A model is expected to process long contexts, perform multi-step reasoning, and produce SQL that often combines several queries with diverse operations. Reference solutions frequently exceed 100 lines, which is far longer than the single-statement answers typical of Spider 1.0. ^[1] Dialect also becomes a real variable. A function that exists in BigQuery may be named differently or absent in Snowflake, so the model has to know, or look up, how a given warehouse expresses a particular operation.
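
To make the dialect point concrete, here is a small sketch of how the same operation, the number of days between two dates, is spelled in three of the dialects the benchmark covers. The lookup table and the helper `days_between` are hypothetical names introduced for this illustration, and the templates cover only the simplest form of each function.

```python
# Illustrative templates for "whole days between two dates".
# The argument order differs between dialects, which is easy to get wrong.
DAYS_BETWEEN = {
    "bigquery": "DATE_DIFF({end}, {start}, DAY)",
    "snowflake": "DATEDIFF(day, {start}, {end})",
    "sqlite": "CAST(julianday({end}) - julianday({start}) AS INTEGER)",
}


def days_between(dialect, start, end):
    """Render the day-difference expression for one dialect (hypothetical helper)."""
    try:
        template = DAYS_BETWEEN[dialect.lower()]
    except KeyError:
        raise ValueError(f"no template for dialect: {dialect}") from None
    return template.format(start=start, end=end)


for name in DAYS_BETWEEN:
    print(name, "->", days_between(name, "order_date", "ship_date"))
```

A model that memorizes one spelling and reuses it on another warehouse produces SQL that either fails to run or, worse, runs with the arguments reversed.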

The dataset was assembled by hand. Eight authors with strong SQL expertise built the problems through a multi-step pipeline that covered data collection, SQL rewriting, context setup, task instruction writing, evaluation design, and quality control. ^[1]^[5] Correctness is measured by execution accuracy: a generated query counts as correct when running it against the database produces results that match the expected output, rather than when its text matches a gold query. This matters because there are usually many SQL strings that return the same correct table, and execution-based scoring credits all of them. ^[6]
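
Execution-based scoring can be sketched in a few lines. The example below is an illustration under stated assumptions, not the benchmark's official evaluator: it uses an in-memory SQLite table invented for the demo, compares results as unordered multisets of rows, and treats any query that fails to run as incorrect. The function names `execution_match` and `make_demo_db` are hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows (row order ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that does not run cannot be correct.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def make_demo_db():
    """Build a tiny in-memory table; the data is made up for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("west", 20), ("east", 5)],
    )
    return conn


conn = make_demo_db()
gold = "SELECT region, SUM(amount) FROM sales GROUP BY region"
# Different text, same result table: aliases and an ORDER BY do not change the rows.
alternative = (
    "SELECT s.region, SUM(s.amount) AS total "
    "FROM sales AS s GROUP BY s.region ORDER BY total DESC"
)
wrong = "SELECT region, MAX(amount) FROM sales GROUP BY region"

print(execution_match(conn, alternative, gold))  # True
print(execution_match(conn, wrong, gold))        # False
```

A string comparison would reject `alternative` even though it returns the same table, which is the case execution accuracy is meant to credit.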

Task settings and variants

Spider 2.0 ships in more than one configuration so that researchers can trade off realism against cost and convenience. Alongside the full benchmark there are three derived settings, each targeting a different part of the problem. ^[5]^[6]

Setting	Examples	Databases / dialects	What it tests
Spider 2.0 (full)	632 problems total	BigQuery, Snowflake, DuckDB, SQLite	End-to-end agent workflow over real warehouses, including metadata and documentation search
Spider 2.0-Lite	547	BigQuery (214), Snowflake (198), SQLite (135)	Self-contained text-in, text-out SQL generation with prepared metadata and documentation
Spider 2.0-Snow	547	Snowflake only	Snowflake-hosted SQL generation, runnable at no cost to users
Spider 2.0-DBT	68	DuckDB with dbt	Repository-level code agent task, not a single SQL query

Spider 2.0-Lite is the most widely used research setting. It is a text-in, text-out task: the prompt already includes well-prepared database metadata and documentation, so a system reads that context and returns SQL without having to drive a live connection. That makes iteration faster and cheaper, and its 547 examples split across BigQuery, Snowflake, and SQLite let a model be tested for dialect handling. ^[5]^[6] Spider 2.0-Snow re-hosts a comparable set of 547 examples on Snowflake alone, and the authors arranged for it to run without charge so that cost is not a barrier to evaluation. ^[5]^[6]

The Spider 2.0-DBT setting is the most agentic. Its 68 tasks use DuckDB together with dbt, a popular data transformation framework, and they ask a model to operate over a project repository rather than emit one statement. Solving them looks less like answering a question and more like a software engineering job inside a data pipeline. ^[5]^[6] The full Spider 2.0 setting combines this kind of repository-level work with the warehouse-connected workflows, which is why it is the hardest of the four and the one the headline numbers refer to.

Headline results and the gap from Spider 1.0

The central finding is a steep drop in performance when models move from the older benchmarks to Spider 2.0. The paper evaluates the OpenAI o1-preview model inside a code agent framework, and that combination solves 21.3% of the full Spider 2.0 tasks. The same framework reaches 91.2% on the original Spider and 73.0% on BIRD, so the new benchmark cuts a strong system's success rate by roughly three quarters relative to Spider 1.0. ^[1]

Looking at base models without the agent scaffolding sharpens the picture further. o1-preview on its own solves about 17.1% of Spider 2.0 problems, and GPT-4o reaches only 10.1%, compared with 86.6% for GPT-4o on Spider 1.0. ^[5]^[6] The table below collects these figures.

System	Spider 2.0	Spider 1.0	BIRD
o1-preview + code agent framework	21.3%	91.2%	73.0%
o1-preview (model)	17.1%	not reported	not reported
GPT-4o (model)	10.1%	86.6%	not reported
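
The size of the drop follows directly from the agent-framework row of the table. The short calculation below is a worked check rather than a new result; `relative_drop` is a helper name introduced here for illustration.

```python
# Agent-framework scores from the table above, in percent.
SCORES = {"spider2": 21.3, "spider1": 91.2, "bird": 73.0}


def relative_drop(old, new):
    """Fraction of the old score that is lost when moving to the new score."""
    return (old - new) / old


drop_vs_spider1 = relative_drop(SCORES["spider1"], SCORES["spider2"])
drop_vs_bird = relative_drop(SCORES["bird"], SCORES["spider2"])

print(f"relative drop vs Spider 1.0: {drop_vs_spider1:.1%}")  # 76.6%
print(f"relative drop vs BIRD: {drop_vs_bird:.1%}")           # 70.8%
```

The first figure is the "roughly three quarters" quoted for Spider 1.0; measured against BIRD the same system loses about 71% of its success rate.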

For reference, human experts reach 92.96% execution accuracy on BIRD, and the best systems there sit near 75%, so even BIRD is not solved. ^[4] Spider 2.0 sits well below all of these, which is the point. A benchmark where frontier models clear most items leaves little room to measure progress, while one where they clear only a fifth gives researchers a long runway.

Public leaderboards for the Lite and Snow settings have since recorded higher scores from specialized agent systems. ReFoRCE, an agent that compresses schema information, refines its own SQL, and explores columns from execution feedback, reported state-of-the-art scores of 31.26% on Spider 2.0-Snow and 30.35% on Spider 2.0-Lite, which suggests that purpose-built scaffolding can close part of the gap even while general-purpose models stay low. ^[6]^[10]

What it reveals about LLMs on data tasks

Spider 2.0 is informative less because of any single score and more because of where models break. The tasks isolate several capabilities at once. A system has to retrieve the relevant slice of a huge schema, hold a long and heterogeneous context, reason across multiple steps, respect the quirks of a specific SQL dialect, and in the agentic settings read documentation or code to figure out what a column or a transformation means. Failures tend to come from the workflow around the query as much as from the query syntax itself.

That profile places Spider 2.0 close to the broader effort to evaluate LLMs as agents on code generation and software tasks. Writing a 100-line, multi-statement SQL program against an unfamiliar warehouse resembles writing code in a large repository: the model must explore, gather context, and compose a correct artifact, not just complete a short snippet. The benchmark therefore reads as a data-engineering analogue to repository-level coding evaluations, and the low scores echo a recurring pattern where models do well on tidy, self-contained problems and struggle once the environment grows large and underspecified. ^[1]

The practical reading for anyone deploying text-to-SQL is cautious. High accuracy on Spider 1.0 does not transfer to a production warehouse with thousands of columns and dialect-specific functions, and a system that looks reliable on academic data can fail on the messy, multi-step questions a business actually asks. Spider 2.0 gives that gap a number. ^[1]

Limitations

The benchmark has natural constraints. With 632 problems it is far smaller than Spider 1.0 or BIRD, a direct consequence of how labor-intensive each enterprise-grade task is to construct and verify, so per-category sample sizes are modest. Running the full and Snow settings depends on live connections to cloud warehouses, which adds operational overhead and, for some configurations, cost, even though the authors arranged the Snow setting to run for free. Execution accuracy, while more faithful than text matching, cuts both ways: it can reject a query that is correct in a way the reference output does not anticipate, and it can accept a wrong query whose result happens to match on a particular dataset. As with any static benchmark, there is also the longer-term risk that its data leaks into training corpora, which would inflate scores without reflecting genuine capability. Finally, the headline figures come from a specific set of models and an agent framework available at the time of the paper, so they describe a moving target rather than a fixed ceiling. ^[1]^[6]
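
The coincidental-match risk is easy to reproduce. The sketch below uses an in-memory SQLite table with made-up rows; the table, the two queries, and the helper `rows` are hypothetical and serve only to show how a query with a missing filter can pass an execution check on one snapshot of the data and fail on the next.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 40), (2, "paid", 60)],
)

intended = "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
missing_filter = "SELECT SUM(amount) FROM orders"  # logically wrong: ignores status


def rows(sql):
    """Run a query and return its rows in sorted order for comparison."""
    return sorted(conn.execute(sql).fetchall())


# Every row happens to be 'paid', so the wrong query matches by coincidence.
same_before = rows(intended) == rows(missing_filter)

# One refunded order is enough to expose the difference.
conn.execute("INSERT INTO orders VALUES (3, 'refunded', 25)")
same_after = rows(intended) == rows(missing_filter)

print(same_before, same_after)  # True False
```

Evaluating against data that exercises the relevant edge cases reduces this risk, though no single test database can rule it out entirely.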

References

1. Lei, Fangyu; Chen, Jixuan; Ye, Yuxiao; et al. "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows." arXiv preprint arXiv:2411.07763, 2024. https://arxiv.org/abs/2411.07763
2. Lei, Fangyu; et al. "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows." OpenReview, ICLR 2025 Oral. https://openreview.net/forum?id=XmProj9cPs
3. Yu, Tao; Zhang, Rui; Yang, Kai; et al. "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task." Proceedings of EMNLP, 2018. https://arxiv.org/abs/1809.08887
4. Li, Jinyang; Hui, Binyuan; Qu, Ge; et al. "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD)." Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2305.03111
5. Spider 2.0 project team. "Spider 2.0 (xlang-ai/Spider2)." GitHub repository, 2024. https://github.com/xlang-ai/Spider2
6. Spider 2.0 project team. "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows." Project website and leaderboard. https://spider2-sql.github.io/
7. International Conference on Learning Representations. "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows (Oral)." ICLR 2025 virtual session. https://iclr.cc/virtual/2025/oral/31826
8. Yu, Tao; et al. "Spider: Yale Semantic Parsing and Text-to-SQL Challenge." Project website. https://yale-lily.github.io/spider
9. BIRD-Bench team. "BIRD: A BIg Bench for Large-Scale Database Grounded Text-to-SQL Evaluation." Project website. https://bird-bench.github.io/
10. Deng, Minghang; Ramachandran, Ashwin; Xu, Canwen; et al. "ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration." arXiv preprint arXiv:2502.00675, 2025. https://arxiv.org/abs/2502.00675
