ZebraLogic is a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs). Developed by researchers at the Allen Institute for AI (AI2), the University of Washington, and Stanford University, ZebraLogic uses logic grid puzzles derived from constraint satisfaction problems (CSPs) to systematically test how well language models handle deductive reasoning at varying levels of difficulty. The benchmark consists of 1,000 programmatically generated puzzles ranging in size from 2x2 to 6x6 grids, and its primary finding is the "curse of complexity," a sharp decline in model accuracy as puzzle complexity increases that persists regardless of model size or additional inference-time computation.
The paper introducing ZebraLogic was published in February 2025 and accepted at the 42nd International Conference on Machine Learning (ICML 2025). It was authored by Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi.
ZebraLogic takes its name from the Zebra Puzzle, a classic logic puzzle also known as Einstein's Riddle. The original puzzle was first published in Life International magazine on December 17, 1962, and has been popularly (though without evidence) attributed to Albert Einstein or Lewis Carroll. The puzzle presents a set of clues about houses in a row, each with different attributes (such as color, nationality of the owner, pet, drink, and cigarette brand), and challenges the solver to determine which house has which attributes using only deductive logic.
Zebra puzzles belong to a broader class of logic grid puzzles, which are a well-known type of constraint satisfaction problem. These puzzles require solvers to use elimination, cross-referencing, and logical deduction to arrive at a unique solution. Logic grid puzzles have appeared on standardized tests such as the Law School Admission Test (LSAT) and the GRE, where they assess analytical reasoning skills.
Logical reasoning is widely considered a fundamental capability for artificial intelligence, with direct relevance to real-world tasks such as planning, scheduling, resource allocation, and decision-making. Despite steady improvements in LLM performance across many benchmarks, systematic evaluation of pure logical reasoning has remained challenging. Many existing benchmarks conflate logical reasoning with world knowledge, reading comprehension, or mathematical computation, making it difficult to isolate a model's deductive capabilities.
ZebraLogic was designed to address this gap. By using synthetically generated logic grid puzzles with controllable complexity, the benchmark provides a clean test of formal logical reasoning. The puzzles require no domain-specific knowledge; they test only whether a model can follow a chain of logical deductions from a set of constraints to a unique solution.
Each ZebraLogic puzzle is set up as follows: there are N houses arranged in a row (numbered 1 to N from left to right), and each house has M features (such as name, car model, pet, favorite drink, and so on). Each feature has N distinct possible values, and every house must be assigned exactly one value per feature, with no two houses sharing the same value for any feature. Given a set of natural-language clues, the solver must deduce the unique correct assignment of all values to all houses.
For example, a simple 2x3 puzzle might read:
There are 2 houses, each with 3 features: Name (Arnold, Eric), Car Model (Ford F-150, Tesla Model 3), and Animal (cat, horse).
Clue 1: Eric is directly left of the person who owns a Tesla Model 3.
Clue 2: The person who keeps horses is in the first house.
The unique solution would be: House 1 has Eric, Ford F-150, and horse; House 2 has Arnold, Tesla Model 3, and cat.
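This example is small enough to check by brute force: enumerate every permutation of each feature over the two houses and keep the assignments that satisfy both clues. This is a minimal illustrative sketch, not the benchmark's own tooling:

```python
from itertools import permutations, product

names = ("Arnold", "Eric")
cars = ("Ford F-150", "Tesla Model 3")
animals = ("cat", "horse")

solutions = []
# Enumerate all (2!)^3 = 8 candidate assignments: one permutation per feature,
# where index 0 is House 1 and index 1 is House 2.
for n, c, a in product(permutations(names), permutations(cars), permutations(animals)):
    clue1 = n.index("Eric") + 1 == c.index("Tesla Model 3")  # Eric directly left of Tesla owner
    clue2 = a.index("horse") == 0                            # horse keeper in the first house
    if clue1 and clue2:
        solutions.append((n, c, a))

print(solutions)  # the single valid assignment
```

Only one of the eight candidates survives both clues, matching the solution stated above.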
The ZebraLogic benchmark contains 1,000 puzzles in total, with grid sizes ranging from 2x2 (2 houses, 2 features) to 6x6 (6 houses, 6 features). For each of the 25 possible NxM configurations (N and M each ranging from 2 to 6), there are 40 puzzles. The puzzles were generated programmatically using a principled algorithm that ensures each puzzle has exactly one valid solution.
The benchmark uses seven distinct types of logical clues, each expressed as a natural-language sentence constructed from predefined templates:
| Clue Type | Description | Example |
|---|---|---|
| FoundAt | A specific value is assigned to a specific house | "The tea drinker lives in House 3" |
| NotAt | A specific value is NOT at a specific house | "The musician does not live in House 2" |
| SameHouse | Two values from different features share the same house | "The musician drinks tea" |
| DirectLeft / DirectRight | One value is in the house immediately left or right of another | "The green house is directly left of the white house" |
| SideBySide | Two values are in adjacent houses | "The coffee drinker and tea drinker are neighbors" |
| LeftOf / RightOf | One value is somewhere to the left or right of another | "Arnold lives somewhere to the left of Eric" |
| OneBetween / TwoBetween | Exactly one or two houses separate two values | "There is one house between the musician and the painter" |
These clue types were chosen to capture the range of constraint relationships typically found in logic grid puzzles, from simple direct assignments to more complex spatial relationships.
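Each clue type corresponds to a simple predicate over an assignment mapping values to house positions. The function names below are illustrative, not the benchmark's internal API; the mirrored variants (DirectRight, RightOf, TwoBetween) follow the same pattern:

```python
# Each predicate takes `pos`, a dict mapping attribute values to house
# numbers (1-based, left to right).

def found_at(pos, v, house):   return pos[v] == house            # FoundAt
def not_at(pos, v, house):     return pos[v] != house            # NotAt
def same_house(pos, a, b):     return pos[a] == pos[b]           # SameHouse
def direct_left(pos, a, b):    return pos[a] + 1 == pos[b]       # DirectLeft
def side_by_side(pos, a, b):   return abs(pos[a] - pos[b]) == 1  # SideBySide
def left_of(pos, a, b):        return pos[a] < pos[b]            # LeftOf
def one_between(pos, a, b):    return abs(pos[a] - pos[b]) == 2  # OneBetween

pos = {"tea": 3, "musician": 3, "coffee": 2, "green": 1, "white": 2}
assert found_at(pos, "tea", 3)             # "The tea drinker lives in House 3"
assert same_house(pos, "musician", "tea")  # "The musician drinks tea"
assert direct_left(pos, "green", "white")  # green house directly left of white house
assert side_by_side(pos, "coffee", "tea")  # coffee and tea drinkers are neighbors
```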
The puzzle generation process first samples a complete solution grid at random, then enumerates candidate clues (instantiated from the templates) that hold in that solution, and finally removes redundant clues one at a time, checking with a solver that the remaining clue set still admits exactly one solution.
This generation approach guarantees that every puzzle in the benchmark has exactly one valid solution and that the clue set is minimal or near-minimal, preventing puzzles from being trivially solvable through redundant information.
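A minimal sketch of such a generate-and-prune loop, under the assumptions just described: it uses only two simplified clue templates (FoundAt and DirectLeft) and a brute-force uniqueness check, whereas the actual generator uses all seven natural-language types and a real solver:

```python
import random
from itertools import permutations, product

def count_solutions(features, clue_fns):
    """Count assignments (one permutation per feature) satisfying all clues."""
    count = 0
    for cand in product(*(permutations(f) for f in features)):
        pos = {v: i + 1 for feat in cand for i, v in enumerate(feat)}
        if all(fn(pos) for fn in clue_fns):
            count += 1
    return count

def generate_puzzle(features, rng):
    # Step 1: sample a random complete solution grid.
    solution = [rng.sample(list(f), len(f)) for f in features]
    pos = {v: i + 1 for feat in solution for i, v in enumerate(feat)}
    values = [v for feat in features for v in feat]
    # Step 2: enumerate candidate clues that are true of the solution.
    clues = [(f"{v} is in house {pos[v]}", lambda p, v=v, h=pos[v]: p[v] == h)
             for v in values]
    clues += [(f"{a} is directly left of {b}",
               lambda p, a=a, b=b: p[a] + 1 == p[b])
              for a, b in product(values, values)
              if a != b and pos[a] + 1 == pos[b]]
    # Step 3: greedily drop clues while the solution stays unique.
    rng.shuffle(clues)
    kept = list(clues)
    for clue in clues:
        trial = [c for c in kept if c is not clue]
        if count_solutions(features, [fn for _, fn in trial]) == 1:
            kept = trial
    return kept, solution

features = [("Arnold", "Eric"), ("cat", "horse")]
kept, solution = generate_puzzle(features, random.Random(0))
print([text for text, _ in kept], solution)
```

Because every FoundAt clue for the sampled grid is included initially, the clue set starts out uniquely satisfiable, and the pruning loop preserves that invariant.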
Formally, each puzzle is represented as a constraint satisfaction problem. The variables x_{a,k} encode the assignment of attribute a to house k. The constraints require that the values of each feature form a permutation over the N houses (every value assigned to exactly one house, and no two houses sharing a value), and that every natural-language clue, translated into its formal relation, is satisfied.
The problem of solving a general logic grid puzzle has been proven NP-complete through a reduction from the Quasigroup Completion Problem; consequently, no polynomial-time algorithm that solves all instances is known, and none exists unless P = NP.
A key contribution of ZebraLogic is the introduction of two quantifiable complexity metrics that allow researchers to measure and compare puzzle difficulty systematically.
The search space of a puzzle is defined as the total number of possible configurations, calculated as (N!)^M for an NxM grid. This measures the raw combinatorial difficulty of the puzzle before any clues are applied.
The benchmark categorizes puzzles into four difficulty tiers based on search space:
| Category | Search Space Size | Example Grid Sizes |
|---|---|---|
| Small | Less than 10^3 | 2x2, 2x3, 2x4, 2x5, 2x6, 3x2 |
| Medium | 10^3 to 10^6 | 3x3, 3x4, 4x2, 4x3 |
| Large | 10^6 to 10^10 | 3x5, 3x6, 4x4, 5x2, 5x3 |
| X-Large | 10^10 or greater | 4x5, 4x6, 5x4, 5x5, 5x6, 6x2 through 6x6 |
For instance, a 3x4 grid has (3!)^4 = 1,296 possible configurations (Medium), while a 5x5 grid has (5!)^5 = 24,883,200,000, placing it firmly in the X-Large category.
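The search-space formula and the tier thresholds from the table translate directly into code:

```python
from math import factorial

def search_space(n, m):
    """Total number of candidate grids for an N-house, M-feature puzzle: (N!)^M."""
    return factorial(n) ** m

def tier(size):
    """Difficulty tier boundaries used by the benchmark."""
    if size < 10**3:
        return "Small"
    if size < 10**6:
        return "Medium"
    if size < 10**10:
        return "Large"
    return "X-Large"

assert search_space(3, 4) == 1_296
assert search_space(5, 5) == 24_883_200_000
print(tier(search_space(3, 4)), tier(search_space(5, 5)))
```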
The second complexity metric leverages the Z3 SMT (Satisfiability Modulo Theories) solver. Z3 uses a Conflict-Driven Clause Learning (CDCL) algorithm to solve constraint problems. During the solving process, a "conflict" occurs when the solver must backtrack because its current partial assignment leads to a contradiction.
The number of conflicts Z3 encounters while solving a puzzle serves as a measure of the puzzle's inherent logical difficulty, independent of the raw search space size. Puzzles with zero Z3 conflicts can be solved entirely through forward chaining (applying clues one by one without needing to guess and backtrack), while puzzles with many conflicts require extensive backtracking and hypothesis testing.
This metric is particularly insightful because it captures the reasoning depth required by a puzzle, not just its combinatorial breadth. Two puzzles with the same grid size can have very different Z3 conflict counts depending on how their clues interact.
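Z3 reports conflict counts through its statistics interface. As a self-contained stand-in, the sketch below counts backtracks in a naive propagate-and-search solver; it captures the same intuition (zero backtracks means pure forward chaining suffices) without Z3's CDCL machinery, so the numbers are an analogue, not Z3's actual metric:

```python
from itertools import permutations

def solve_counting_backtracks(features, clues):
    """features: list of value tuples; clues: (predicate, values_it_mentions)
    pairs over a partial assignment {value: house}. Returns (solution, backtracks)."""
    backtracks = 0

    def consistent(pos):
        # Only clues whose mentioned values are all placed can be checked.
        return all(fn(pos) for fn, needed in clues
                   if all(v in pos for v in needed))

    def search(i, pos):
        nonlocal backtracks
        if i == len(features):
            return dict(pos)
        for perm in permutations(features[i]):
            trial = dict(pos, **{v: h + 1 for h, v in enumerate(perm)})
            if consistent(trial):
                found = search(i + 1, trial)
                if found is not None:
                    return found
            backtracks += 1  # this branch was abandoned
        return None

    return search(0, {}), backtracks

features = [("Arnold", "Eric"), ("cat", "horse")]
clues = [
    (lambda p: p["Eric"] == 1, ("Eric",)),                   # FoundAt
    (lambda p: p["horse"] == p["Eric"], ("horse", "Eric")),  # SameHouse
]
solution, conflicts = solve_counting_backtracks(features, clues)
print(solution, conflicts)
```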
LLMs are evaluated on ZebraLogic using a one-shot prompting approach. Each model receives a single demonstration example consisting of a worked puzzle, its reasoning process, and the final answer in a fixed JSON format.
The model is then given a new puzzle and instructed to first output its reasoning process and then present its answer in the same JSON format as the demonstration.
ZebraLogic uses two complementary accuracy metrics: puzzle-level accuracy, which counts a puzzle as solved only if the entire grid is correct, and cell-wise accuracy, which gives partial credit for each correctly filled cell.
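A minimal implementation of the two metrics, representing each grid as a dict from (house, feature) cells to values (our own representation, chosen for illustration):

```python
def puzzle_accuracy(preds, golds):
    """Strict metric: a puzzle counts only if every cell matches the gold grid."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def cell_accuracy(preds, golds):
    """Partial-credit metric: fraction of individual cells filled correctly."""
    correct = total = 0
    for p, g in zip(preds, golds):
        for cell, value in g.items():
            correct += (p.get(cell) == value)
            total += 1
    return correct / total

gold = [{(1, "Name"): "Eric", (2, "Name"): "Arnold"}]
pred = [{(1, "Name"): "Eric", (2, "Name"): "Eric"}]  # one of two cells wrong
print(puzzle_accuracy(pred, gold), cell_accuracy(pred, gold))
```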
In addition to the search-space-based categories, puzzles are also classified as "Easy" or "Hard" based on the logarithmic random-guessing probability. Puzzles of size 3x3 or smaller are classified as Easy, while larger puzzles are classified as Hard. This binary classification provides a simple way to compare model performance across difficulty levels.
The central finding of ZebraLogic is what the authors call the "curse of complexity." As puzzle complexity increases, whether measured by search space size or Z3 conflict count, model accuracy drops sharply and consistently across all tested models. This decline is not gradual; there is a threshold beyond which performance collapses dramatically.
Even the strongest models exhibit this collapse: o1 solves 97.2% of Small puzzles but only 42.5% of X-Large ones, and every non-reasoning model falls to 4.0% or below on Large puzzles and to near zero on X-Large puzzles.
This finding has significant implications. It suggests that current LLM architectures face fundamental limitations in handling complex logical reasoning, and that simply scaling up model parameters is insufficient to overcome these limitations.
The paper identifies several specific reasoning capabilities that LLMs lack for solving complex logic puzzles, among them counterfactual (what-if) reasoning, backtracking from failed hypotheses, and structured tracking of partial solution states.
These skill gaps help explain why scaling model size alone does not solve the problem: the missing capabilities are architectural rather than parametric.
The following table summarizes puzzle-level accuracy for all models evaluated in the paper, broken down by search-space difficulty category:
| Model | Overall | Small | Medium | Large | X-Large | Cell-Wise |
|---|---|---|---|---|---|---|
| o1 | 81.0% | 97.2% | 92.1% | 78.0% | 42.5% | 78.7% |
| DeepSeek-R1 | 78.7% | 98.4% | 95.7% | 73.5% | 28.5% | 80.5% |
| o1-preview | 71.4% | 98.1% | 88.2% | 59.5% | 17.0% | 75.1% |
| o1-mini | 59.7% | 87.5% | 76.8% | 39.0% | 12.0% | 70.3% |
| Claude 3.5 Sonnet | 36.2% | 84.7% | 28.9% | 4.0% | 1.0% | 54.3% |
| Llama 3.1 405B | 32.6% | 81.3% | 22.5% | 1.5% | 0.0% | 45.8% |
| GPT-4o | 31.7% | 80.0% | 19.6% | 2.5% | 0.5% | 50.3% |
| Gemini 1.5 Pro | 30.5% | 75.3% | 20.7% | 3.0% | 0.0% | 50.8% |
| Mistral Large 2 | 29.0% | 75.9% | 15.0% | 2.5% | 0.0% | 47.6% |
| Qwen 2.5 72B | 26.6% | 72.5% | 12.1% | 0.0% | 0.0% | 40.9% |
| Gemini 1.5 Flash | 25.0% | 65.0% | 13.6% | 2.0% | 0.0% | 43.6% |
| Llama 3.1 70B | 24.9% | 67.8% | 10.4% | 1.5% | 0.0% | 28.0% |
| DeepSeek v2.5 | 22.1% | 62.2% | 7.9% | 0.0% | 0.0% | 38.0% |
| GPT-4o mini | 20.1% | 58.8% | 4.6% | 0.0% | 0.0% | 41.3% |
| Gemma 2 27B | 16.3% | 46.6% | 5.0% | 0.0% | 0.0% | 41.2% |
| Llama 3.1 8B | 12.8% | 39.4% | 0.7% | 0.0% | 0.0% | 13.7% |
| Phi 3.5 4B | 6.4% | 19.4% | 0.7% | 0.0% | 0.0% | 6.0% |
One of the clearest patterns in the results is the substantial gap between reasoning-focused models (such as o1 and DeepSeek-R1) and standard instruction-tuned models. The o1 model achieved 81.0% overall accuracy, more than double the best non-reasoning model (Claude 3.5 Sonnet at 36.2%). This gap widens dramatically for harder puzzles: on X-Large puzzles, o1 achieved 42.5% while Claude 3.5 Sonnet managed only 1.0%.
This performance difference is closely linked to the amount of reasoning computation each model performs. The o1 family generates approximately 10 times more reasoning tokens than standard models:
| Model Family | Average Reasoning Tokens |
|---|---|
| o1-mini | 5,144.6 (hidden CoT) |
| o1-preview | 5,346.3 (hidden CoT) |
| GPT-4o | 543.7 (visible) |
| GPT-4o mini | 502.9 (visible) |
The paper found that hidden reasoning token usage grows at a rate of roughly 400 tokens per Z3 conflict for puzzles with fewer than 20 conflicts. Beyond 30 conflicts, token usage plateaus, suggesting the model reaches a point where additional thinking does not translate into better solutions.
Models with 7 to 10 billion parameters face severe limitations on ZebraLogic. Even on Easy puzzles (3x3 and smaller), these models achieve modest accuracy. On Hard puzzles, accuracy drops to less than 1%. The Phi 3.5 model with 4 billion parameters solves only 6.4% of all puzzles overall, illustrating that logical reasoning of this kind requires a minimum scale of model capacity.
The paper also investigated the effect of decoding strategy on performance. Most models perform better with greedy decoding (temperature = 0) compared to sampling (temperature = 0.5). One notable exception was Gemini 1.5 Pro, which showed slight improvement with sampling, while Gemini 1.5 Flash showed degradation, an unexpected divergence between two models from the same family.
The ZebraLogic paper explores several strategies aimed at improving LLM performance on logical reasoning tasks.
Best-of-N sampling generates multiple candidate solutions and selects the best one. The paper tested this approach with GPT-4o:
| Method | Overall Accuracy | Medium | Large |
|---|---|---|---|
| Baseline (greedy) | 31.7% | 19.6% | 2.5% |
| BoN Oracle (N=32) | 60.3% | 81.1% | 28.0% |
| BoN Oracle (N=128) | 69.1% | 92.9% | 49.0% |
| Majority Voting (N=32) | 38.0% | 34.3% | 7.0% |
| Reward Model (N=32) | 33.9% | 28.9% | 4.5% |
The Oracle results (where the correct answer is known and used to select the best candidate) show that there is significant untapped potential: with 128 samples, GPT-4o could in theory achieve 69.1% accuracy, more than double its baseline. However, practical selection methods such as majority voting and reward model scoring capture only a small fraction of this potential, yielding improvements of just 2 to 6 percentage points over baseline.
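The three selection rules compared in the table can be sketched as follows; the `score` argument stands in for a hypothetical reward model, and the answers are placeholder strings:

```python
from collections import Counter

def bon_oracle(candidates, gold):
    """Upper bound: succeeds whenever ANY of the N candidates is correct."""
    return gold if gold in candidates else candidates[0]

def bon_majority(candidates):
    """Pick the answer proposed most often across samples."""
    return Counter(candidates).most_common(1)[0][0]

def bon_reward(candidates, score):
    """Pick the candidate a (hypothetical) reward model scores highest."""
    return max(candidates, key=score)

samples = ["A", "B", "B", "C"]
assert bon_oracle(samples, "C") == "C"  # oracle recovers a rare correct answer
assert bon_majority(samples) == "B"     # majority voting picks the mode
```

The gap between the oracle and the other two rules comes entirely from the selection step: the correct answer is often in the pool, but neither frequency nor a learned score reliably identifies it.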
Self-verification asks the model to check its own solution and attempt corrections. Results with GPT-4o:
| Approach | Overall Accuracy | Change |
|---|---|---|
| Baseline | 31.7% | -- |
| Self-Verify (Oracle) | 34.8% | +3.1% |
| Self-Verify | 33.0% | +1.3% |
| Self-Verify (2x iteration) | 32.1% | +0.4% |
Self-verification produces only marginal improvements. Even with oracle knowledge of which solutions are wrong, the model can only fix a small fraction of its errors. Iterating the verification process twice actually degrades performance compared to a single verification pass, likely because the model introduces new errors while attempting corrections.
The most promising improvement strategy explored in the paper combines backtracking mechanisms with extended reasoning steps. Rather than generating a single forward pass of reasoning, this approach encourages the model to propose a hypothesis, propagate its consequences against the constraints, and revert to an earlier decision point when a contradiction is reached.
This approach mirrors how the Z3 solver itself operates (using conflict-driven clause learning with backtracking) and aligns more closely with how humans solve complex logic puzzles. The o1 family of models, which uses hidden chain-of-thought reasoning with what appears to be an internal backtracking mechanism, achieves significantly better results than models that reason in a strictly forward manner. This suggests that training models to perform explicit backtracking during reasoning is a more effective path forward than simply scaling model parameters or generating more samples.
The relationship between reasoning effort and puzzle complexity provides additional insight into LLM behavior. For the o1 models, the number of hidden chain-of-thought tokens scales roughly linearly with Z3 conflict count for puzzles with fewer than 20 conflicts, at a rate of approximately 400 tokens per conflict. Beyond 30 conflicts, token usage plateaus, indicating that the model has reached the limits of its reasoning capacity and is no longer able to productively extend its thinking.
This token-scaling behavior mirrors a pattern seen in human problem solving: people can effectively increase their effort on moderately difficult problems, but beyond a certain difficulty threshold, additional time spent does not lead to progress. The plateau in reasoning tokens may correspond to the model "giving up" or cycling through unproductive reasoning paths.
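As a rough summary of this relationship, token usage can be modeled as piecewise linear with a hard plateau. The paper reports only the ~400 tokens-per-conflict rate below 20 conflicts and the plateau beyond ~30; the exact transition point in the sketch below is our assumption:

```python
def estimated_reasoning_tokens(conflicts, tokens_per_conflict=400, plateau=30):
    # Linear growth up to the plateau; the behavior between 20 and 30
    # conflicts is not specified in the paper, so capping at 30 is an
    # illustrative assumption.
    return tokens_per_conflict * min(conflicts, plateau)
```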
To provide context for the LLM results, the authors also collected data on human solving times for puzzles of various sizes:
| Puzzle Size | Average Human Solving Time |
|---|---|
| 2x2 | Approximately 15 seconds |
| 3x3 | Approximately 1 minute 30 seconds |
| 4x4 | 10 to 15 minutes |
While direct accuracy comparisons between humans and LLMs are not straightforward (humans can take as long as they need and typically achieve very high accuracy given sufficient time), these timings illustrate the exponential growth in difficulty that logic grid puzzles exhibit as size increases. The jump from 15 seconds to 15 minutes between 2x2 and 4x4 puzzles reflects the same combinatorial explosion captured by the search space metric.
ZebraLogic maintains a public leaderboard hosted on Hugging Face Spaces, allowing researchers and practitioners to submit and compare model results. The dataset is publicly available as the allenai/ZebraLogicBench dataset on Hugging Face, and the evaluation code is released through the ZeroEval framework on GitHub.
Since the original paper, newer models have been evaluated on the benchmark. As of early 2026, several models have surpassed the original o1 scores:
| Rank | Model | Score |
|---|---|---|
| 1 | Qwen3 VL 235B A22B Thinking | 0.973 |
| 2 | LongCat-Flash-Thinking (Meituan) | 0.955 |
| 3 | Qwen3 235B A22B Instruct | 0.950 |
| 4 | LongCat-Flash-Chat (Meituan) | 0.893 |
| 5 | Kimi K2 Instruct (Moonshot AI) | 0.890 |
| 6 | MiniMax M1 80K | 0.868 |
| 7 | MiniMax M1 40K | 0.801 |
These updated results suggest that newer generations of reasoning-focused models have made significant progress on the benchmark, with the top models approaching near-perfect accuracy. However, it should be noted that these are self-reported results and have not all been independently verified.
ZebraLogic's findings have several important implications for the field of large language models:
Scaling is not sufficient. The curse of complexity shows that making models bigger does not reliably improve logical reasoning beyond a certain problem difficulty. This challenges the prevailing assumption that scale is the primary driver of capability improvement.
Architectural innovation is needed. The specific reasoning skill gaps identified (counterfactual thinking, backtracking, structured state tracking) point toward concrete architectural changes that might improve logical reasoning, such as explicit working memory mechanisms or built-in search procedures.
Inference-time computation matters, but has limits. Reasoning models like o1 demonstrate that spending more compute at inference time can substantially help, but even this approach plateaus on the hardest puzzles.
Practical selection methods lag far behind theoretical potential. The gap between oracle and practical Best-of-N results indicates that better solution verification and selection mechanisms could yield significant improvements without changing the underlying model.
The logical reasoning limitations exposed by ZebraLogic are relevant beyond puzzle solving. Constraint satisfaction problems underlie many practical applications, including planning, scheduling, resource allocation, and decision-making.
If LLMs cannot reliably solve structured logical problems in a controlled benchmark setting, their reliability in these real-world applications warrants careful scrutiny.
ZebraLogic complements other reasoning benchmarks in the AI evaluation landscape, many of which mix logical reasoning with world knowledge, reading comprehension, or mathematical computation.
ZebraLogic's unique contribution is its focus on pure deductive reasoning with precisely controllable difficulty, making it especially useful for studying how reasoning capability scales (or fails to scale) with problem complexity.
The ZebraLogic paper points to several promising directions for future research, including training models to perform explicit backtracking during reasoning, developing stronger verification and candidate-selection mechanisms, and designing architectures with better structured state tracking.
| Resource | Location |
|---|---|
| Paper (arXiv) | arxiv.org/abs/2502.01100 |
| ICML 2025 Proceedings | PMLR Volume 267, pages 37889-37905 |
| Leaderboard | huggingface.co/spaces/allenai/ZebraLogic |
| Dataset | huggingface.co/datasets/allenai/ZebraLogicBench |
| Evaluation Code | github.com/yuchenlin/ZeroEval |