CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark designed to measure how well large language models can reason about, understand, and mentally execute short Python programs. Introduced in January 2024 by researchers at MIT CSAIL and Meta AI, CRUXEval consists of 800 Python functions paired with input-output examples, giving rise to two complementary tasks: predicting the output of a function given its input (CRUXEval-O) and predicting the input that would produce a given output (CRUXEval-I). The benchmark was published at the 41st International Conference on Machine Learning (ICML 2024) and has since become an influential tool for evaluating code reasoning capabilities beyond simple code generation.
CRUXEval addresses a gap in the evaluation landscape for code LLMs. While benchmarks such as HumanEval and MBPP measure a model's ability to generate code from natural language descriptions, CRUXEval tests whether models can actually understand what code does when it runs. The results reveal that strong performance on code generation benchmarks does not necessarily translate to strong code reasoning, highlighting fundamental limitations in how current models process program semantics.
By late 2023, large language models had achieved impressive scores on code generation benchmarks. Models like GPT-4, Code Llama, and various fine-tuned derivatives were posting increasingly high pass rates on HumanEval and MBPP. However, these benchmarks primarily test whether a model can translate a natural language specification into working code. They do not directly assess whether the model understands what the generated code actually does during execution.
This distinction matters because code understanding is a prerequisite for many real-world programming tasks: debugging, code review, program analysis, and reasoning about edge cases all require the ability to mentally simulate program execution. A model that can generate syntactically correct code but cannot predict what that code does when run on a specific input has a shallow understanding of programming.
Prior execution-based benchmarks had various limitations. Some relied on complex algorithmic problems that conflate reasoning difficulty with domain knowledge. Others used programs that were too long or computationally intensive to expect a model (or a human) to trace through reliably. CRUXEval was designed to fill this gap with simple, short programs that a competent programmer could reason about without extraordinary effort.
The CRUXEval authors set out to create a benchmark with several key properties. First, the programs should be short (3 to 13 lines) and involve only basic Python operations on strings, lists, and dictionaries, so that any university-level computer science student could trace through them in under a minute. Second, the benchmark should avoid complex arithmetic, floating-point operations, and reliance on external libraries, isolating pure code reasoning from mathematical or domain-specific knowledge. Third, the benchmark needed to be large enough to produce statistically meaningful comparisons between models (the authors determined that 800 samples suffice for significance at the 0.05 level) while remaining small enough to run evaluations efficiently.
The CRUXEval dataset was constructed using a semi-automated pipeline built around Code Llama 34B as the program generator. The process began by selecting 69 standard library functions from Python's built-in string (47 functions), dictionary (11 functions), and list (11 functions) types. These served as seeds for generating diverse programs.
The authors created 25 different few-shot prompt combinations to guide Code Llama 34B in generating candidate functions. Using a temperature of 1.0, the model generated approximately 102,000 candidate functions across the three data types: 46% involving strings, 27% involving dictionaries, and 27% involving lists. For each generated function, the model also produced candidate inputs, and the outputs were obtained by actually executing the functions on those inputs, yielding 489,306 input-output pairs in total.
The raw generated programs went through a rigorous multi-stage filtering process to ensure quality and appropriateness. Compile-time filters removed programs that failed to parse or compile. Runtime filters eliminated programs that raised exceptions, timed out, or behaved nondeterministically. Code quality filters excluded programs that violated the benchmark's design criteria, such as excessive length, floating-point arithmetic, or reliance on external libraries.
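A minimal sketch of the first two filter stages, assuming a function named `f` and a two-second timeout (the exact criteria and threshold here are illustrative, not the paper's published values):

```python
import signal

def compiles(src: str) -> bool:
    """Compile-time filter: the candidate must parse as valid Python."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def runs_cleanly(src: str, call: str, timeout_s: int = 2) -> bool:
    """Runtime filter: evaluating the call must finish quickly and
    without raising (Unix-only timeout via SIGALRM)."""
    def _alarm(signum, frame):
        raise TimeoutError
    env: dict = {}
    signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout_s)
    try:
        exec(src, env)        # define the candidate function
        eval(call, env)       # run it on the candidate input
        return True
    except Exception:
        return False
    finally:
        signal.alarm(0)       # cancel the pending alarm
```

A determinism check (running the call twice and comparing results) would follow the same pattern.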
After filtering, the authors used bootstrap sampling to select the final 800 samples, ensuring the dataset was both manageable in size and statistically robust for detecting performance differences between models. The authors noted encountering a "duplication bottleneck" during generation: after approximately 6,000 generations per prompt pair, only about 5,000 unique programs remained, necessitating the use of many different prompt combinations.
CRUXEval defines two complementary evaluation tasks, each testing a different direction of code reasoning.
In the output prediction task, the model receives a Python function along with a specific input and must predict what the function returns when executed on that input. This task tests forward reasoning: given a program and its starting state, can the model simulate the execution and arrive at the correct final state?
For example, a model might be given a function that manipulates a string through several operations (slicing, concatenation, replacement) and an input string, and must determine the exact output after all operations are applied.
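A hypothetical problem in this style (written for illustration, not drawn from the actual dataset) makes the task concrete:

```python
def f(text):
    # hypothetical CRUXEval-style function: only basic string operations
    result = text.replace("a", "b")
    result = result[::-1]
    return result + result[:2]

# CRUXEval-O asks the model to predict f("banana") without running it:
# replace -> "bbnbnb", reverse -> "bnbnbb", append first two chars.
print(f("banana"))  # bnbnbbbn
```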
In the input prediction task, the model receives a Python function and the expected output, and must determine an input that would cause the function to produce that output. This task tests backward reasoning (or inverse reasoning): given the end state, can the model work backwards through the program logic to find a valid starting state?
Input prediction is generally considered more challenging because it requires the model to reason about the program in reverse, and multiple valid inputs may exist for a given output. The evaluation accepts any input that produces the correct output when the function is actually executed.
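Because grading is by execution, any input that reproduces the target output is accepted. A hypothetical example (again, not from the dataset):

```python
def f(lst):
    # hypothetical CRUXEval-style function
    lst.append(len(lst))
    return lst

# CRUXEval-I: given the target output [7, 7, 2], the model must propose
# an input; [7, 7] works because appending len([7, 7]) == 2 reproduces it.
candidate = [7, 7]
assert f(candidate) == [7, 7, 2]
```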
CRUXEval uses pass@k metrics, consistent with established code evaluation practice. The primary metric is pass@1, which measures the probability that a single generation from the model is correct. The benchmark also reports pass@5, indicating the probability that at least one of five generated samples is correct. For pass@1 evaluation, models are sampled with a temperature of 0.2, while pass@5 uses a temperature of 0.8 to encourage more diverse outputs.
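pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021), which CRUXEval follows; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per problem, 3 of them correct:
print(pass_at_k(10, 3, 1))  # 0.3
print(pass_at_k(10, 3, 5))
```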
Correctness is verified by execution: for output prediction, the model's predicted output is compared against the actual output; for input prediction, the model's predicted input is fed into the function and the resulting output is compared against the expected output.
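A sketch of this execution-based grading, under assumptions about the format (the function is named `f` and values are exchanged as Python literals):

```python
def check(code: str, task: str, prediction: str, expected_repr: str) -> bool:
    """Grade a prediction by execution (format details are assumptions).
    Output prediction: the predicted literal must equal the true output.
    Input prediction: running f on the predicted input must reproduce it."""
    env: dict = {}
    exec(code, env)                  # load the benchmark function
    expected = eval(expected_repr)
    if task == "output":
        return eval(prediction) == expected
    return env["f"](eval(prediction)) == expected

code = "def f(s):\n    return s + s[0]"
assert check(code, "output", "'abca'", "'abca'")  # correct output guess
assert check(code, "input", "'abc'", "'abca'")    # valid input guess
```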
The CRUXEval authors evaluated a broad range of proprietary and open-source models. The results below, drawn from the official CRUXEval leaderboard, reveal significant performance stratification.
The following table shows pass@1 scores for direct prediction (without chain-of-thought prompting), sampled at temperature 0.2.
| Model | Parameters | CRUXEval-I (%) | CRUXEval-O (%) |
|---|---|---|---|
| GPT-4-0613 | Undisclosed | 69.8 | 68.7 |
| GPT-4-turbo-2024-04-09 | Undisclosed | 68.5 | 67.7 |
| Claude 3 Opus | Undisclosed | 64.2 | 65.8 |
| GPT-4o | Undisclosed | 65.1 | 70.0 |
| GPT-3.5-turbo-0613 | Undisclosed | 49.0 | 49.4 |
| CodeTulu-2-34B | 34B | 49.3 | 45.8 |
| StarCoder2-15B | 15B | 48.1 | 47.1 |
| DeepSeek Coder 33B (Instruct) | 33B | 46.5 | 49.9 |
| DeepSeek Coder 33B (Base) | 33B | 46.5 | 48.6 |
| Code Llama 34B | 34B | 47.2 | 42.4 |
| Phind CodeLlama 34B v2 | 34B | 47.2 | 39.7 |
| Code Llama Python 34B | 34B | 43.9 | 41.4 |
| WizardCoder 34B | 34B | 42.7 | 43.4 |
| Magicoder DS 6.7B | 6.7B | 41.7 | 44.4 |
| DeepSeek Coder 6.7B (Base) | 6.7B | 41.9 | 43.5 |
| Code Llama 13B | 13B | 42.5 | 39.7 |
| Code Llama Python 13B | 13B | 39.7 | 39.8 |
| Mixtral 8x7B | 46.7B (MoE) | 39.3 | 40.5 |
| Mistral 7B | 7B | 35.0 | 34.3 |
| DeepSeek Coder 6.7B (Instruct) | 6.7B | 37.4 | 41.2 |
| Code Llama Python 7B | 7B | 37.3 | 35.9 |
| WizardCoder 13B | 13B | 36.5 | 41.3 |
| Code Llama 7B | 7B | 35.9 | 34.2 |
| StarCoder2-7B | 7B | 34.6 | 36.0 |
| StableCode 3B | 3B | 33.5 | 26.7 |
| StarCoder2-3B | 3B | 32.7 | 34.2 |
| Phi-2 | 2.7B | 31.6 | 33.5 |
| StarCoderBase 16B | 15.5B | 31.3 | 34.2 |
| StarCoderBase 7B | 7B | 29.7 | 32.2 |
| DeepSeek Coder 1.3B (Base) | 1.3B | 27.8 | 31.0 |
| DeepSeek Coder 1.3B (Instruct) | 1.3B | 27.2 | 28.7 |
| Phi-1.5 | 1.3B | 23.2 | 27.5 |
| Phi-1 | 1.3B | 13.1 | 21.7 |
Chain-of-thought (CoT) prompting, where models are asked to reason step-by-step before giving a final answer, produced notable improvements for some models.
| Model | CRUXEval-I (%) | CRUXEval-O (%) |
|---|---|---|
| GPT-4-turbo + CoT | 75.7 | 82.0 |
| GPT-4o + CoT | 75.6 | 76.0 |
| GPT-4-0613 + CoT | 75.5 | 77.1 |
| Claude 3 Opus + CoT | 73.4 | 82.0 |
| GPT-3.5-turbo + CoT | 50.3 | 59.0 |
| Code Llama 34B + CoT | 50.1 | 43.6 |
| Code Llama 13B + CoT | 47.4 | 36.0 |
| Code Llama 7B + CoT | 40.4 | 29.9 |
GPT-4 and Claude 3 Opus benefited substantially from CoT, with output prediction improving by over 14 percentage points for GPT-4-turbo (from 67.7% to 82.0%). In contrast, smaller open-source models showed mixed results with CoT. Code Llama 34B improved on input prediction (from 47.2% to 50.1%) but gained only marginally on output prediction (from 42.4% to 43.6%). Strikingly, Code Llama 7B and 13B both performed worse on output prediction with CoT than without it (29.9% versus 34.2%, and 36.0% versus 39.7%, respectively), suggesting that CoT can hurt performance when a model's reasoning capability is insufficient to produce reliable step-by-step traces.
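The official CoT prompts live in the CRUXEval repository; purely as a rough illustration (all wording below is hypothetical and differs from the real templates), a CoT-style output-prediction prompt might be assembled like this:

```python
def cot_output_prompt(code: str, input_repr: str) -> str:
    # Hypothetical template; the official CRUXEval prompts differ in
    # wording, few-shot examples, and answer-extraction format.
    return (
        "Here is a Python function and an input. Simulate the execution "
        "step by step, then give the final output.\n\n"
        f"{code}\n\n"
        f"Call: f({input_repr})\n"
        "Reason line by line, then finish with a line of the form "
        "'Output: <value>'."
    )

print(cot_output_prompt("def f(x):\n    return x * 2", "'ab'"))
```

The direct (non-CoT) setting simply omits the step-by-step instruction and asks for the answer immediately.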
Pass@5 results (sampled at temperature 0.8) show how much headroom remains when models are allowed multiple attempts.
| Model | CRUXEval-I pass@5 (%) | CRUXEval-O pass@5 (%) |
|---|---|---|
| GPT-4-0613 + CoT | 88.9 | 88.2 |
| GPT-4-0613 | 76.8 | 73.0 |
| GPT-3.5-turbo + CoT | 74.9 | 76.7 |
| Code Llama 34B + CoT | 73.8 | 69.4 |
| Code Llama 13B + CoT | 68.4 | 61.8 |
| Code Llama 34B | 66.6 | 55.9 |
| StarCoder2-15B | 66.9 | 59.5 |
| DeepSeek Coder 33B (Base) | 64.9 | 61.6 |
| GPT-3.5-turbo | 63.2 | 59.3 |
| Code Llama 7B + CoT | 62.8 | 55.4 |
The gap between pass@1 and pass@5 is particularly wide for CoT models, indicating that a model whose individual reasoning attempts are unreliable can often still find the right answer given multiple tries.
One of CRUXEval's most striking findings is the disconnect between code generation performance and code reasoning ability. Several models that had been fine-tuned or distilled to achieve high HumanEval scores showed little to no improvement on CRUXEval relative to their base models.
WizardCoder 34B, for example, outperforms Code Llama 34B on HumanEval by over 20 percentage points. Yet on CRUXEval, WizardCoder 34B scored 42.7% on input prediction, well below Code Llama 34B's 47.2%, and 43.4% on output prediction, only a point above its base model's 42.4%. Similarly, Phind CodeLlama 34B v2, another high-scoring HumanEval model, posted a CRUXEval-O score of just 39.7%, several points below its Code Llama 34B base model.
The Phi-1 model presents perhaps the most dramatic example: it achieves competitive HumanEval performance despite having only 1.3 billion parameters, but scores just 13.1% on CRUXEval-I and 21.7% on CRUXEval-O, placing it at the bottom of the leaderboard. This suggests that the training process for Phi-1, which focused heavily on textbook-quality code generation data, did not cultivate execution reasoning capabilities.
For base models (those not fine-tuned specifically for code generation tasks), such as StarCoder, Mistral, Code Llama, and DeepSeek Base, there is a positive correlation between HumanEval scores and CRUXEval scores. This indicates that general code capability does contribute to code reasoning, but that targeted fine-tuning for generation benchmarks can improve generation without improving (or even while degrading) reasoning.
Even GPT-4, the best-performing model in the original evaluation, exhibits systematic failures on CRUXEval. Using CoT prompting, GPT-4 failed all ten attempts on 54 output prediction problems and 65 input prediction problems, with 22 problems unsolved in both directions.
The authors identified several categories of failure:
String manipulation errors. GPT-4 sometimes demonstrates the correct high-level approach to a string problem but fails during the actual character-by-character concatenation or replacement steps. The model understands the algorithm but cannot reliably execute it.
Variable name confusion. In some cases, GPT-4 appears to be misled by semantically meaningful variable names, making assumptions about what a variable represents rather than following the actual computation. For instance, a variable named "prefix" might lead the model to assume certain behavior that does not match the function's logic.
Loop simulation failures. Problems requiring the model to simulate 20 or more loop iterations consistently cause errors. The model loses track of intermediate state, particularly when counters or accumulators are involved. Counting to approximately 30 or tracking multiple variables across many iterations proves unreliable.
Simple logical errors. GPT-4 occasionally produces incorrect True/False predictions on basic conditional expressions, suggesting that even simple comparison operations are not always handled correctly during mental simulation.
Function semantics misunderstanding. Some failures involve incorrect assumptions about Python built-in methods, such as str.removeprefix() or string doubling behavior, where the model applies an incorrect mental model of what the function does.
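The loop-simulation failure mode in particular can be seen in miniature in a program like the following (a hypothetical example in the benchmark's style, not taken from the dataset): tracking two interacting variables across 30 iterations is exactly the kind of state-keeping the analysis found unreliable.

```python
def f(n):
    # Two interacting accumulators: 'flips' changes the sign with which
    # each index contributes to 'total', so a mental trace must carry
    # both variables through every iteration.
    total, flips = 0, 0
    for i in range(n):
        if i % 3 == 0:
            flips += 1
        total += i if flips % 2 == 0 else -i
    return total, flips

print(f(30))  # trivial to verify by running, hard to track mentally
```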
Across models, CRUXEval-I and CRUXEval-O performance are strongly correlated. Models that perform well on output prediction generally also perform well on input prediction, and vice versa. This suggests that the underlying capability being measured, the ability to reason about code execution, is a unified skill rather than two independent abilities.
However, the correlation is not perfect. GPT-4 shows a weaker input-output correlation compared to open-source models, and CoT prompting affects the two tasks differently (output prediction gains tend to be larger than input prediction gains). These asymmetries likely reflect the different cognitive demands of forward versus backward reasoning through code.
The authors tested whether replacing meaningful variable names with generic names (anonymization) affected model performance. The impact was minimal, with performance differences staying within 3 percentage points. This finding suggests that models are not primarily relying on variable name semantics to solve CRUXEval problems, but are instead engaging with the actual program logic (though the GPT-4 failure analysis shows that variable names can occasionally mislead).
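A minimal sketch of how such anonymization can be done with Python's `ast` module (the authors' exact procedure may differ; this version renames parameters and locally assigned variables while leaving the function name and builtins untouched):

```python
import ast

class Anonymize(ast.NodeTransformer):
    """Rename parameters and assigned variables to var_0, var_1, ..."""
    def __init__(self):
        self.mapping = {}

    def visit_arg(self, node):
        # Function parameters always get fresh generic names.
        if node.arg not in self.mapping:
            self.mapping[node.arg] = f"var_{len(self.mapping)}"
        node.arg = self.mapping[node.arg]
        return node

    def visit_Name(self, node):
        # Register names on assignment; rewrite every known name.
        if isinstance(node.ctx, ast.Store) and node.id not in self.mapping:
            self.mapping[node.id] = f"var_{len(self.mapping)}"
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

src = "def f(prefix):\n    result = prefix.strip()\n    return result"
tree = Anonymize().visit(ast.parse(src))
print(ast.unparse(tree))  # def f(var_0): var_1 = var_0.strip(); ...
```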
Performance varied substantially across different Python built-in methods. For Code Llama 34B, common operations like list.append() and str.index() were among the easiest, while less common methods like str.rsplit(), str.maketrans(), and str.rfind() were among the hardest. The authors hypothesize that this difficulty gradient reflects the frequency of these methods in pretraining data: models perform better on operations they have seen more often during training.
The CRUXEval authors conducted fine-tuning experiments to investigate whether targeted training could improve code reasoning performance. They fine-tuned Code Llama 34B on approximately 140,000 samples of Python functions generated using the same pipeline as the benchmark, with inputs, outputs, and assertion-style formatting.
To ensure fair evaluation, the authors applied decontamination at two levels. Weak decontamination removed only exact matches where both the function and its input-output pairs appeared in the training set. Strong decontamination removed any training sample whose function matched one in the benchmark, regardless of whether the input-output pairs differed.
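The two levels can be sketched as follows, assuming each sample is a (function code, input, output) triple (the real data layout may differ):

```python
def decontaminate(train, bench, strong: bool):
    """Weak: drop a training sample only if its exact
    (code, input, output) triple appears in the benchmark.
    Strong: drop it whenever the function code alone matches a
    benchmark function, even with different input-output pairs."""
    if strong:
        bench_fns = {code for code, _, _ in bench}
        return [s for s in train if s[0] not in bench_fns]
    bench_set = set(bench)
    return [s for s in train if s not in bench_set]

train = [
    ("def f(x): return x", "1", "1"),      # exact benchmark match
    ("def f(x): return x", "2", "2"),      # same function, new I/O pair
    ("def g(x): return x + 1", "1", "2"),  # unrelated function
]
bench = [("def f(x): return x", "1", "1")]

assert len(decontaminate(train, bench, strong=False)) == 2
assert len(decontaminate(train, bench, strong=True)) == 1
```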
Despite training on a large corpus of programs structurally similar to the benchmark, test accuracy plateaued relatively quickly. The model's performance on CRUXEval improved only modestly and hit a ceiling, suggesting that code execution reasoning is difficult to learn through standard supervised fine-tuning on input-output assertion data. The format of the fine-tuning data proved important: training assertions needed to match the evaluation format for the fine-tuning to be effective.
This plateauing result carries broader implications. It suggests that code reasoning may require training approaches beyond simple next-token prediction on code-execution examples, potentially involving program synthesis, symbolic execution traces, or other forms of structured reasoning data.
The authors performed statistical analysis to justify the 800-sample benchmark size. Through bootstrap analysis, they determined that sampling noise (approximately 1.5% per model) dominates generation noise (approximately 0.2%), and that 800 samples provide sufficient statistical power to detect meaningful performance differences between models at the 0.05 significance level. The bootstrap analysis confirmed that comparisons such as Code Llama 34B versus Code Llama 13B are statistically significant.
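A generic percentile-bootstrap sketch of this kind of analysis (the authors' exact procedure may differ), treating a model's score as the mean of 800 binary per-problem outcomes:

```python
import random

def bootstrap_ci(scores, iters=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n  # resample with replacement
        for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# e.g. 800 binary outcomes at 47% accuracy: the 95% CI spans roughly
# +/- 3.5 points, so differences of that size are hard to distinguish.
scores = [1] * 376 + [0] * 424
print(bootstrap_ci(scores))
```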
The 800 benchmark samples span programs of 3 to 13 lines in length, with character counts ranging from 75 to 300. The median bytecode operation count is approximately 60 to 70 steps. The dataset covers operations across Python's string, dictionary, and list types, reflecting the proportions in the generation phase (46% string, 27% dictionary, 27% list).
In August 2024, researchers from the Chinese Academy of Sciences, Hong Kong University of Science and Technology, and Huazhong University of Science and Technology introduced CRUXEval-X, a multilingual extension of the original benchmark. Published at ACL 2025, CRUXEval-X expands the evaluation from Python alone to 19 programming languages: C++, C#, D, Go, Java, JavaScript, Julia, Lua, Perl, PHP, Python, R, Racket, Ruby, Rust, Scala, Shell, Swift, and TypeScript.
CRUXEval-X contains at least 600 subjects per language and approximately 19,000 test instances in total, with 500 aligned test cases spanning all 19 languages. The benchmark was constructed through a fully automated process using iterative generation and repair techniques guided by test execution feedback.
Evaluations on CRUXEval-X found that performance on language pairs is correlated, but that models trained predominantly on Python lose substantial accuracy when reasoning about code in other languages, revealing a language bias in current code LLMs.
CRUXEval occupies a distinct niche in the code evaluation landscape:
| Benchmark | Primary Task | Language | Evaluation Type |
|---|---|---|---|
| HumanEval | Code generation from docstrings | Python | Functional correctness |
| MBPP | Code generation from descriptions | Python | Functional correctness |
| SWE-bench | Real-world software engineering | Python | Patch correctness |
| BigCodeBench | Complex code generation | Python | Functional correctness |
| CRUXEval | Code reasoning and execution | Python | Input/output prediction |
| CRUXEval-X | Code reasoning and execution | 19 languages | Input/output prediction |
While code generation benchmarks test whether models can write code, CRUXEval tests whether models understand code. This complementary perspective provides insights that generation benchmarks alone cannot capture.
CRUXEval has contributed several important insights to the field of AI and code intelligence.
First, the benchmark provides concrete evidence that code generation ability and code understanding are partially independent capabilities. This finding has implications for how code LLMs are trained and evaluated: optimizing for generation benchmarks alone may produce models with shallow code understanding, a concern for applications like automated debugging, code review, and program verification.
Second, the systematic failure analysis of GPT-4 on simple programs highlights that even frontier models have fundamental limitations in simulating program execution. These failures, which include losing track of state during loops and misunderstanding basic string operations, suggest that current transformer-based architectures may struggle with the sequential, stateful nature of program execution.
Third, the fine-tuning experiments demonstrate that code reasoning is resistant to simple supervised training approaches, pointing toward the need for more sophisticated training methodologies that incorporate explicit reasoning traces, symbolic execution data, or curriculum-based learning strategies.
The benchmark's open availability (MIT license), small size, and fast evaluation time have made it widely adopted for comparing code models, and it is regularly cited alongside HumanEval and MBPP in model release announcements and technical reports.
CRUXEval is fully open source; the code, dataset, and an official leaderboard are publicly available.
The dataset is stored in JSONL format, with each entry containing three fields: the function code, the input, and the expected output. The repository includes evaluation scripts, sample model generations, and a quickstart Jupyter notebook.
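Loading and sanity-checking an entry is straightforward; the snippet below uses a hypothetical line with assumed key names (`code`, `input`, `output`) and literal-formatted values, which may differ from the actual files:

```python
import json

# Hypothetical JSONL entry in the dataset's described layout.
line = json.dumps({
    "code": "def f(s):\n    return s[::-1]",
    "input": "'abc'",
    "output": "'cba'",
})

sample = json.loads(line)
env: dict = {}
exec(sample["code"], env)  # define f from the stored source
# Verify the stored input-output pair by actually executing the function.
assert env["f"](eval(sample["input"])) == eval(sample["output"])
```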