Spider 2.0
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,005 words
Spider 2.0 is a benchmark for evaluating large language models on real-world enterprise text-to-SQL workflows. Introduced by Fangyu Lei and colleagues in the paper "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows," it was accepted as an oral presentation at the International Conference on Learning Representations (ICLR) in 2025. [1][2][7] The benchmark contains 632 problems built from production data applications, and it asks a model to translate a natural language request into SQL that runs against databases hosted on cloud warehouses such as BigQuery and Snowflake. [1] Where earlier text-to-SQL datasets had largely been solved, Spider 2.0 was designed to be hard: the strongest baseline reported in the paper, the OpenAI o1-preview model wrapped in a code agent framework, solves only 21.3% of the tasks, against 91.2% for the same framework on the original Spider and 73.0% on BIRD. [1]
Text-to-SQL is the task of mapping a question in plain language to an executable SQL query, so that a person who does not write SQL can still ask questions of a relational database. The original Spider dataset, released by Tao Yu and collaborators at Yale in 2018, became the field's reference point. It paired 10,181 questions with 5,693 unique SQL queries across 200 databases spanning 138 domains, and its key idea was cross-domain generalization: the databases in the test set never appear in training, so a model has to adapt to schemas it has not seen. [3][8]
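The cross-domain idea can be illustrated with a database-level split. The snippet below is a minimal sketch, not the procedure the Spider authors used: the `split_by_database` helper, the `db_id` field name, the 20% test fraction, and the toy examples are all assumptions made for illustration.

```python
import random


def split_by_database(examples, test_fraction=0.2, seed=0):
    """Split examples so that no database appears on both sides."""
    db_ids = sorted({ex["db_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(db_ids)
    # Hold out whole databases, never individual questions.
    n_test = max(1, round(len(db_ids) * test_fraction))
    held_out = set(db_ids[:n_test])
    train = [ex for ex in examples if ex["db_id"] not in held_out]
    test = [ex for ex in examples if ex["db_id"] in held_out]
    return train, test


# Hypothetical examples; the questions and database names are made up.
examples = [
    {"db_id": "concerts", "question": "How many singers are there?"},
    {"db_id": "concerts", "question": "List all stadium names."},
    {"db_id": "flights", "question": "Which airline has the most routes?"},
    {"db_id": "pets", "question": "What is the average pet age?"},
    {"db_id": "schools", "question": "How many teachers are there?"},
    {"db_id": "shops", "question": "Which shop has the most employees?"},
]

train, test = split_by_database(examples)
train_dbs = {ex["db_id"] for ex in train}
test_dbs = {ex["db_id"] for ex in test}
print(train_dbs.isdisjoint(test_dbs))
```

Because the split is made over database identifiers, every test question targets a schema that contributed nothing to training.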
For several years Spider drove steady progress, and by the time large language models arrived the leaderboard had effectively saturated. Execution accuracy on Spider climbed past 90%, and reported systems pushed toward the high 90s, which left little signal for distinguishing one strong model from another. The 2023 BIRD benchmark from Jinyang Li and colleagues raised the bar with 95 larger and messier databases totaling about 33.4 GB across 37 professional domains, adding dirty values and questions that need outside knowledge. [4][9] BIRD is harder than Spider, yet it still presents one question that maps to one query against a single database, which is not how analysts actually work inside a company.
The gap that Spider 2.0 targets is the distance between those academic setups and a real data stack. Inside an enterprise, a single business question often turns into a multi-step workflow: an analyst locates the right tables among thousands of columns, joins across several of them, applies warehouse-specific functions, and sometimes reads internal documentation or transformation code before writing anything that returns the correct answer. The authors built Spider 2.0 to measure that work directly, and the result is a benchmark on which even frontier models start from a low base. [1]
Spider 2.0 holds 632 text-to-SQL workflow problems drawn from real data applications. The databases are large by benchmark standards, often containing over 1,000 columns each, and they live on local or cloud systems including BigQuery, Snowflake, DuckDB, and SQLite. [1] That scale matters, because a model can no longer fit a whole schema into a short prompt and reason over it in one pass. It has to search.
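What searching a schema can look like is sketched below with a deliberately simple keyword-overlap ranker. This is an illustration under stated assumptions, not a method from the paper: the `rank_columns` helper, the stopword list, and the four-column toy schema are hypothetical, and practical systems typically rely on stronger retrieval such as embeddings or execution-guided exploration.

```python
import re

# Small hypothetical stopword list so that filler words do not count as matches.
STOPWORDS = {"the", "of", "in", "is", "what", "per", "a", "an", "when", "was"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return {t for t in tokens if t and t not in STOPWORDS}


def rank_columns(question, columns, top_k=3):
    """Return up to top_k (table, column) pairs that share words with the question."""
    q_tokens = tokenize(question)
    scored = []
    for col in columns:
        text = f"{col['table']} {col['name']} {col['description']}"
        score = len(q_tokens & tokenize(text))
        scored.append((score, col["table"], col["name"]))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(table, name) for score, table, name in scored[:top_k] if score > 0]


# Hypothetical schema metadata, standing in for a warehouse with 1,000+ columns.
columns = [
    {"table": "orders", "name": "order_total", "description": "total revenue of the order in USD"},
    {"table": "orders", "name": "created_at", "description": "timestamp when the order was placed"},
    {"table": "customers", "name": "country", "description": "country of the customer"},
    {"table": "web_events", "name": "session_id", "description": "browser session identifier"},
]

question = "What is the total revenue per customer country?"
print(rank_columns(question, columns))
# [('customers', 'country'), ('orders', 'order_total')]
```

Only the retrieved columns would then be placed in the prompt, which keeps the context short at the price of possibly missing a column whose name shares no words with the question.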
The paper describes the central difficulty plainly: solving a Spider 2.0 problem "frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases." [1] A model is expected to process long contexts, perform multi-step reasoning, and produce SQL that often combines several queries with diverse operations. Reference solutions frequently exceed 100 lines, which is far longer than the single-statement answers typical of Spider 1.0. [1] Dialect also becomes a real variable. A function that exists in BigQuery may be named differently or absent in Snowflake, so the model has to know, or look up, how a given warehouse expresses a particular operation.
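The dialect issue can be made concrete with one operation, the number of days between two dates, which each engine spells differently. The sketch below is illustrative: the `days_between_sql` helper and the column names are hypothetical, the BigQuery and Snowflake strings follow those engines' documented `DATE_DIFF` and `DATEDIFF` functions but are not executed here, and only the SQLite variant is actually run.

```python
import sqlite3


def days_between_sql(dialect, start_expr, end_expr):
    """Return a SQL expression for the whole days from start_expr to end_expr."""
    if dialect == "bigquery":
        # BigQuery: DATE_DIFF(later, earlier, part)
        return f"DATE_DIFF({end_expr}, {start_expr}, DAY)"
    if dialect == "snowflake":
        # Snowflake: DATEDIFF(part, earlier, later)
        return f"DATEDIFF(day, {start_expr}, {end_expr})"
    if dialect == "sqlite":
        # SQLite has no date-difference function; subtract Julian day numbers.
        return f"CAST(julianday({end_expr}) - julianday({start_expr}) AS INTEGER)"
    raise ValueError(f"unsupported dialect: {dialect}")


for dialect in ("bigquery", "snowflake", "sqlite"):
    print(dialect, "->", days_between_sql(dialect, "order_date", "ship_date"))

# Only the SQLite expression is executed, using literal dates.
conn = sqlite3.connect(":memory:")
expr = days_between_sql("sqlite", "'2024-01-01'", "'2024-01-31'")
sqlite_days = conn.execute(f"SELECT {expr}").fetchone()[0]
print(sqlite_days)  # 30
```

The argument order is reversed between the BigQuery and Snowflake forms, which is the kind of detail a model has to know or look up in dialect documentation.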
The dataset was assembled by hand. Eight authors with strong SQL expertise built the problems through a multi-step pipeline that covered data collection, SQL rewriting, context setup, task instruction writing, evaluation design, and quality control. [1][5] Correctness is measured by execution accuracy: a generated query counts as correct when running it against the database produces results that match the expected output, rather than when its text matches a gold query. This matters because there are usually many SQL strings that return the same correct table, and execution-based scoring credits all of them. [6]
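Execution-based scoring can be sketched in a few lines. The example below is a simplified illustration, not the benchmark's official evaluator: the `execution_match` helper, the in-memory SQLite database, and the three queries are hypothetical, and the comparison treats results as order-insensitive multisets of rows, which is one possible matching rule among several.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to run cannot match the expected output.
        return False
    # Counter compares the results as multisets, so duplicate rows still count.
    return Counter(predicted_rows) == Counter(gold_rows)


# Hypothetical toy database, not taken from the benchmark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("alice", 5), ("bob", 7)],
)

gold = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
# Different text, same result table: counted as correct.
rewritten = (
    "SELECT o.customer, SUM(o.amount) AS total FROM orders AS o "
    "GROUP BY o.customer ORDER BY total DESC"
)
# Runs without error but aggregates with MAX instead of SUM: counted as wrong.
wrong = "SELECT customer, MAX(amount) FROM orders GROUP BY customer"

print(execution_match(conn, rewritten, gold))  # True
print(execution_match(conn, wrong, gold))      # False
```

The rewritten query shares almost no text with the gold query yet is scored as correct, while the syntactically valid `MAX` variant is rejected because its result table differs.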
Spider 2.0 ships in more than one configuration so that researchers can trade off realism against cost and convenience. The four settings target different parts of the problem. [5][6]
| Setting | Examples | Databases / dialects | What it tests |
|---|---|---|---|
| Spider 2.0 (full) | 632 problems total | BigQuery, Snowflake, DuckDB, SQLite | End-to-end agent workflow over real warehouses, including metadata and documentation search |
| Spider 2.0-Lite | 547 | BigQuery (214), Snowflake (198), SQLite (135) | Self-contained text-in, text-out SQL generation with prepared metadata and documentation |
| Spider 2.0-Snow | 547 | Snowflake only | Snowflake-hosted SQL generation, runnable at no cost to users |
| Spider 2.0-DBT | 68 | DuckDB with dbt | Repository-level code agent task, not a single SQL query |
Spider 2.0-Lite is the most widely used research setting. It is a text-in, text-out task: the prompt already includes well-prepared database metadata and documentation, so a system reads that context and returns SQL without having to drive a live connection. That makes iteration faster and cheaper, and its 547 examples split across BigQuery, Snowflake, and SQLite let a model be tested for dialect handling. [5][6] Spider 2.0-Snow re-hosts a comparable set of 547 examples on Snowflake alone, and the authors arranged for it to run without charge so that cost is not a barrier to evaluation. [5][6]
The Spider 2.0-DBT setting is the most agentic. Its 68 tasks use DuckDB together with dbt, a popular data transformation framework, and they ask a model to operate over a project repository rather than emit one statement. Solving them looks less like answering a question and more like a software engineering job inside a data pipeline. [5][6] The full Spider 2.0 setting combines this kind of repository-level work with the warehouse-connected workflows, which is why it is the hardest of the four and the one the headline numbers refer to.
The central finding is a steep drop in performance when models move from the older benchmarks to Spider 2.0. The paper evaluates the OpenAI o1-preview model inside a code agent framework, and that combination solves 21.3% of the full Spider 2.0 tasks. The same framework reaches 91.2% on the original Spider and 73.0% on BIRD, so the new benchmark cuts a strong system's success rate by roughly three quarters relative to Spider 1.0. [1]
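The "roughly three quarters" figure follows from simple arithmetic on the reported scores. The snippet below is a small worked check; the `relative_drop` helper is a hypothetical name, and the inputs are the percentages quoted above.

```python
def relative_drop(old, new):
    """Fraction of the old success rate that is lost when moving to the new one."""
    return (old - new) / old


# Success rates (in percent) of the o1-preview code agent, as quoted above.
older_benchmarks = {"Spider 1.0": 91.2, "BIRD": 73.0}
spider2 = 21.3

drops = {
    name: round(100 * relative_drop(score, spider2), 1)
    for name, score in older_benchmarks.items()
}
for name, drop in drops.items():
    print(f"{name}: {drop}% relative drop")
```

Relative to Spider 1.0 the drop is (91.2 - 21.3) / 91.2, or about 77%, and relative to BIRD it is (73.0 - 21.3) / 73.0, or about 71%.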
Looking at base models without the agent scaffolding sharpens the picture further. o1-preview on its own solves about 17.1% of Spider 2.0 problems, and GPT-4o reaches only 10.1%, compared with 86.6% for GPT-4o on Spider 1.0. [5][6] The table below collects these figures.
| System | Spider 2.0 | Spider 1.0 | BIRD |
|---|---|---|---|
| o1-preview + code agent framework | 21.3% | 91.2% | 73.0% |
| o1-preview (model) | 17.1% | not reported | not reported |
| GPT-4o (model) | 10.1% | 86.6% | not reported |
For reference, human experts reach about 92.96% execution accuracy on BIRD, and the best systems there sit near 75%, so even BIRD is not solved. [4] Spider 2.0 sits well below all of these, which is the point. A benchmark where frontier models clear most items leaves little room to measure progress, and a benchmark where they clear only a fifth gives researchers a long runway. Public leaderboards for the Lite and Snow settings have since recorded higher scores from specialized agent systems. ReFoRCE, an agent that compresses schema information, refines its own SQL, and explores columns from execution feedback, reported state-of-the-art scores of 31.26 on Spider 2.0-Snow and 30.35 on Spider 2.0-Lite, which suggests that purpose-built scaffolding can close part of the gap even while general-purpose models stay low. [6][10]
Spider 2.0 is informative less because of any single score and more because of where models break. The tasks isolate several capabilities at once. A system has to retrieve the relevant slice of a huge schema, hold a long and heterogeneous context, reason across multiple steps, respect the quirks of a specific SQL dialect, and in the agentic settings read documentation or code to figure out what a column or a transformation means. Failures tend to come from the workflow around the query as much as from the query syntax itself.
That profile places Spider 2.0 close to the broader effort to evaluate LLMs as agents on code generation and software tasks. Writing a 100-line, multi-statement SQL program against an unfamiliar warehouse resembles writing code in a large repository: the model must explore, gather context, and compose a correct artifact, not just complete a short snippet. The benchmark therefore reads as a data-engineering analogue to repository-level coding evaluations, and the low scores echo a recurring pattern where models do well on tidy, self-contained problems and struggle once the environment grows large and underspecified. [1]
The practical reading for anyone deploying text-to-SQL is cautious. High accuracy on Spider 1.0 does not transfer to a production warehouse with thousands of columns and dialect-specific functions, and a system that looks reliable on academic data can fail on the messy, multi-step questions a business actually asks. Spider 2.0 gives that gap a number. [1]
The benchmark has natural constraints. With 632 problems it is far smaller than Spider 1.0 or BIRD, a direct consequence of how labor-intensive each enterprise-grade task is to construct and verify, so per-category sample sizes are modest. Running the full and Snow settings depends on live connections to cloud warehouses, which adds operational overhead and, for some configurations, cost, even though the authors arranged the Snow setting to run for free. Execution accuracy, while more faithful than text matching, still rewards only answers that reproduce the expected result and can miss queries that are correct in a way the reference does not anticipate, or that happen to match by coincidence on a particular dataset. As with any static benchmark, there is also the longer-term risk that its data leaks into training corpora, which would inflate scores without reflecting genuine capability. Finally, the headline figures come from a specific set of models and an agent framework available at the time of the paper, so they describe a moving target rather than a fixed ceiling. [1][6]