CRUXEval

AI Benchmarks AI Code Generation Machine Learning Natural Language Processing

19 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

7 citations

Revision

v4 · 3,735 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark designed to measure how well large language models can reason about, understand, and mentally execute short Python programs. Introduced in January 2024 by researchers at MIT CSAIL and Meta AI, CRUXEval consists of 800 Python functions paired with input-output examples, giving rise to two complementary tasks: predicting the output of a function given its input (CRUXEval-O) and predicting the input that would produce a given output (CRUXEval-I).^[1] The benchmark was published at the 41st International Conference on Machine Learning (ICML 2024) and has since become an influential tool for evaluating code reasoning capabilities beyond simple code generation.^[1]

CRUXEval addresses a gap in the evaluation landscape for code LLMs. While benchmarks such as HumanEval and MBPP measure a model's ability to generate code from natural language descriptions, CRUXEval tests whether models can actually understand what code does when it runs. The results reveal that strong performance on code generation benchmarks does not necessarily translate to strong code reasoning, highlighting fundamental limitations in how current models process program semantics.

Background and Motivation

The Gap Between Code Generation and Code Understanding

By late 2023, large language models had achieved impressive scores on code generation benchmarks. Models like GPT-4, Code Llama, and various fine-tuned derivatives were posting increasingly high pass rates on HumanEval and MBPP. However, these benchmarks primarily test whether a model can translate a natural language specification into working code. They do not directly assess whether the model understands what the generated code actually does during execution.

This distinction matters because code understanding is a prerequisite for many real-world programming tasks: debugging, code review, program analysis, and reasoning about edge cases all require the ability to mentally simulate program execution. A model that can generate syntactically correct code but cannot predict what that code does when run on a specific input has a shallow understanding of programming.

Prior execution-based benchmarks had various limitations. Some relied on complex algorithmic problems that conflate reasoning difficulty with domain knowledge. Others used programs that were too long or computationally intensive to expect a model (or a human) to trace through reliably. CRUXEval was designed to fill this gap with simple, short programs that a competent programmer could reason about without extraordinary effort.

Design Philosophy

The CRUXEval authors set out to create a benchmark with several key properties. First, the programs should be short (3 to 13 lines) and involve only basic Python operations on strings, lists, and dictionaries, so that any university-level computer science student could trace through them in under a minute. Second, the benchmark should avoid complex arithmetic, floating-point operations, and reliance on external libraries, isolating pure code reasoning from mathematical or domain-specific knowledge. Third, the benchmark needed to be large enough to produce statistically meaningful comparisons between models (the authors determined that 800 samples suffice for significance at the 0.05 level) while remaining small enough to run evaluations efficiently.^[1]

Benchmark Construction

Program Generation Pipeline

The CRUXEval dataset was constructed using a semi-automated pipeline built around Code Llama 34B^[4] as the program generator. The process began by selecting 69 standard library functions from Python's built-in string (47 functions), dictionary (11 functions), and list (11 functions) types.^[1] These served as seeds for generating diverse programs.

The authors created 25 different few-shot prompt combinations to guide Code Llama 34B in generating candidate functions. Using a temperature of 1.0, the model generated approximately 102,000 candidate functions across the three data types: 46% involving strings, 27% involving dictionaries, and 27% involving lists. For each generated function, the model also produced candidate inputs, and the outputs were obtained by actually executing the functions on those inputs, yielding 489,306 input-output pairs in total.^[1]

Filtering Criteria

The raw generated programs went through a rigorous multi-stage filtering process to ensure quality and appropriateness.

Compile-time filters removed programs that:

Contained syntax errors
Did not use all declared function arguments
Fell outside the 75 to 300 character length range
Did not follow the required assertion format for evaluation

Runtime filters eliminated programs that:

Used floating-point operations or true division (to avoid precision issues)
Involved integer operations where all arguments exceeded 3 (to keep arithmetic simple)
Involved string or list operations where all arguments had length greater than 3
Failed to complete execution within 2 seconds
Raised uncaught exceptions during execution

Code quality filters excluded programs that:

Imported external modules (os, random, etc.)
Were non-deterministic (involving random number generation or set ordering)
Had side effects (calls to input(), modifications to builtins, etc.)

After filtering, the authors used bootstrap sampling to select the final 800 samples, ensuring the dataset was both manageable in size and statistically robust for detecting performance differences between models. The authors noted encountering a "duplication bottleneck" during generation: after approximately 6,000 generations per prompt pair, only about 5,000 unique programs remained, necessitating the use of many different prompt combinations.^[1]

Tasks

CRUXEval defines two complementary evaluation tasks, each testing a different direction of code reasoning.

CRUXEval-O: Output Prediction

In the output prediction task, the model receives a Python function along with a specific input and must predict what the function returns when executed on that input. This task tests forward reasoning: given a program and its starting state, can the model simulate the execution and arrive at the correct final state?

For example, a model might be given a function that manipulates a string through several operations (slicing, concatenation, replacement) and an input string, and must determine the exact output after all operations are applied.

CRUXEval-I: Input Prediction

In the input prediction task, the model receives a Python function and the expected output, and must determine an input that would cause the function to produce that output. This task tests backward reasoning (or inverse reasoning): given the end state, can the model work backwards through the program logic to find a valid starting state?

Input prediction is generally considered more challenging because it requires the model to reason about the program in reverse, and multiple valid inputs may exist for a given output. The evaluation accepts any input that produces the correct output when the function is actually executed.

Evaluation Metrics

CRUXEval uses pass@k metrics, consistent with established code evaluation practice. The primary metric is pass@1, which measures the probability that a single generation from the model is correct. The benchmark also reports pass@5, indicating the probability that at least one of five generated samples is correct. For pass@1 evaluation, models are sampled with a temperature of 0.2, while pass@5 uses a temperature of 0.8 to encourage more diverse outputs.^[1]

Correctness is verified by execution: for output prediction, the model's predicted output is compared against the actual output; for input prediction, the model's predicted input is fed into the function and the resulting output is compared against the expected output.

Model Results

The CRUXEval authors evaluated 20 code models spanning both proprietary and open-source systems. The results, drawn from the official CRUXEval leaderboard, reveal significant performance stratification.^[1]

Direct Prediction Results (Pass@1)

The following table shows pass@1 scores for direct prediction (without chain-of-thought prompting), sampled at temperature 0.2.^[1]

Model	Parameters	CRUXEval-I (%)	CRUXEval-O (%)
GPT-4-0613	Undisclosed	69.8	68.7
GPT-4-turbo-2024-04-09	Undisclosed	68.5	67.7
Claude 3 Opus	Undisclosed	64.2	65.8
GPT-4o	Undisclosed	65.1	70.0
GPT-3.5-turbo-0613	Undisclosed	49.0	49.4
CodeTulu-2-34B	34B	49.3	45.8
StarCoder2-15B	15B	48.1	47.1
DeepSeek Coder 33B (Instruct)	33B	46.5	49.9
DeepSeek Coder 33B (Base)	33B	46.5	48.6
Code Llama 34B	34B	47.2	42.4
Phind CodeLlama 34B v2	34B	47.2	39.7
Code Llama Python 34B	34B	43.9	41.4
WizardCoder 34B	34B	42.7	43.4
Magicoder DS 6.7B	6.7B	41.7	44.4
DeepSeek Coder 6.7B (Base)	6.7B	41.9	43.5
Code Llama 13B	13B	42.5	39.7
Code Llama Python 13B	13B	39.7	39.8
Mixtral 8x7B	46.7B (MoE)	39.3	40.5
Mistral 7B	7B	35.0	34.3
DeepSeek Coder 6.7B (Instruct)	6.7B	37.4	41.2
Code Llama Python 7B	7B	37.3	35.9
WizardCoder 13B	13B	36.5	41.3
Code Llama 7B	7B	35.9	34.2
StarCoder2-7B	7B	34.6	36.0
StableCode 3B	3B	33.5	26.7
StarCoder2-3B	3B	32.7	34.2
Phi-2	2.7B	31.6	33.5
StarCoderBase 16B	15.5B	31.3	34.2
StarCoderBase 7B	7B	29.7	32.2
DeepSeek Coder 1.3B (Base)	1.3B	27.8	31.0
DeepSeek Coder 1.3B (Instruct)	1.3B	27.2	28.7
Phi-1.5	1.3B	23.2	27.5
Phi-1	1.3B	13.1	21.7

Chain-of-Thought Results (Pass@1)

Chain-of-thought (CoT) prompting, where models are asked to reason step-by-step before giving a final answer, produced notable improvements for some models.

Model	CRUXEval-I (%)	CRUXEval-O (%)
GPT-4-turbo + CoT	75.7	82.0
GPT-4o + CoT	75.6	76.0
GPT-4-0613 + CoT	75.5	77.1
Claude 3 Opus + CoT	73.4	82.0
GPT-3.5-turbo + CoT	50.3	59.0
Code Llama 34B + CoT	50.1	43.6
Code Llama 13B + CoT	47.4	36.0
Code Llama 7B + CoT	40.4	29.9

GPT-4 and Claude 3 Opus benefited substantially from CoT, with output prediction improving by over 13 percentage points for GPT-4-turbo (from 67.7% to 82.0%).^[1] In contrast, smaller open-source models showed mixed results with CoT. Code Llama 34B improved on input prediction (from 47.2% to 50.1%) but actually declined on output prediction (from 42.4% to 43.6%). Strikingly, Code Llama 7B and 13B both performed worse on output prediction with CoT than without it, suggesting that CoT can hurt performance when the model's reasoning capability is insufficient to produce reliable step-by-step traces.

Pass@5 Scores

Pass@5 results (sampled at temperature 0.8) show how much headroom remains when models are allowed multiple attempts.

Model	CRUXEval-I pass@5 (%)	CRUXEval-O pass@5 (%)
GPT-4-0613 + CoT	88.9	88.2
GPT-4-0613	76.8	73.0
GPT-3.5-turbo + CoT	74.9	76.7
Code Llama 34B + CoT	73.8	69.4
Code Llama 13B + CoT	68.4	61.8
Code Llama 34B	66.6	55.9
StarCoder2-15B	66.9	59.5
DeepSeek Coder 33B (Base)	64.9	61.6
GPT-3.5-turbo	63.2	59.3
Code Llama 7B + CoT	62.8	55.4

The gap between pass@1 and pass@5 is particularly wide for CoT models, indicating that even when a model sometimes fails to reason correctly, it can often find the right answer with additional attempts.^[1]

Key Findings

Code Generation Does Not Imply Code Understanding

One of CRUXEval's most striking findings is the disconnect between code generation performance and code reasoning ability. Several models that had been fine-tuned or distilled to achieve high HumanEval scores showed little to no improvement on CRUXEval relative to their base models.

WizardCoder 34B, for example, outperforms Code Llama 34B on HumanEval by over 20 percentage points. Yet on CRUXEval, WizardCoder 34B scored 42.7% on input prediction and 43.4% on output prediction, actually performing slightly below Code Llama 34B's 47.2% and 42.4%.^[1] Similarly, Phind CodeLlama 34B v2, another high-scoring HumanEval model, posted a CRUXEval-O score of just 39.7%, several points below its Code Llama 34B base model.

The Phi-1 model presents perhaps the most dramatic example: it achieves competitive HumanEval performance despite having only 1.3 billion parameters, but scores just 13.1% on CRUXEval-I and 21.7% on CRUXEval-O, placing it at the bottom of the leaderboard.^[1] This suggests that the training process for Phi-1, which focused heavily on textbook-quality code generation data, did not cultivate execution reasoning capabilities.

For base models (those not fine-tuned specifically for code generation tasks), such as StarCoder, Mistral, Code Llama, and DeepSeek Base, there is a positive correlation between HumanEval scores and CRUXEval scores. This indicates that general code capability does contribute to code reasoning, but that targeted fine-tuning for generation benchmarks can improve generation without improving (or even while degrading) reasoning.

GPT-4 Failure Analysis

Even GPT-4, the best-performing model in the original evaluation, exhibits systematic failures on CRUXEval. Using CoT prompting, GPT-4 scored 0 out of 10 attempts on 54 output prediction problems and 65 input prediction problems, with 22 problems proving unsolvable for both tasks simultaneously.^[1]

The authors identified several categories of failure:

String manipulation errors. GPT-4 sometimes demonstrates the correct high-level approach to a string problem but fails during the actual character-by-character concatenation or replacement steps. The model understands the algorithm but cannot reliably execute it.

Variable name confusion. In some cases, GPT-4 appears to be misled by semantically meaningful variable names, making assumptions about what a variable represents rather than following the actual computation. For instance, a variable named "prefix" might lead the model to assume certain behavior that does not match the function's logic.

Loop simulation failures. Problems requiring the model to simulate 20 or more loop iterations consistently cause errors. The model loses track of intermediate state, particularly when counters or accumulators are involved. Counting to approximately 30 or tracking multiple variables across many iterations proves unreliable.

Simple logical errors. GPT-4 occasionally produces incorrect True/False predictions on basic conditional expressions, suggesting that even simple comparison operations are not always handled correctly during mental simulation.

Function semantics misunderstanding. Some failures involve incorrect assumptions about Python built-in methods, such as str.removeprefix() or string doubling behavior, where the model applies an incorrect mental model of what the function does.

Input vs. Output Prediction Correlation

Across models, CRUXEval-I and CRUXEval-O performance are strongly correlated. Models that perform well on output prediction generally also perform well on input prediction, and vice versa. This suggests that the underlying capability being measured, the ability to reason about code execution, is a unified skill rather than two independent abilities.

However, the correlation is not perfect. GPT-4 shows a weaker input-output correlation compared to open-source models, and CoT prompting affects the two tasks differently (output prediction gains tend to be larger than input prediction gains). These asymmetries likely reflect the different cognitive demands of forward versus backward reasoning through code.

Variable Anonymization Has Minimal Impact

The authors tested whether replacing meaningful variable names with generic names (anonymization) affected model performance. The impact was minimal, with performance differences staying within 3 percentage points.^[1] This finding suggests that models are not primarily relying on variable name semantics to solve CRUXEval problems, but are instead engaging with the actual program logic (though the GPT-4 failure analysis shows that variable names can occasionally mislead).

Method-Level Difficulty Variation

Performance varied substantially across different Python built-in methods. For Code Llama 34B, common operations like list.append() and str.index() were among the easiest, while less common methods like str.rsplit(), str.maketrans(), and str.rfind() were among the hardest.^[1] The authors hypothesize that this difficulty gradient reflects the frequency of these methods in pretraining data: models perform better on operations they have seen more often during training.

Fine-Tuning Experiments

The CRUXEval authors conducted fine-tuning experiments to investigate whether targeted training could improve code reasoning performance. They fine-tuned Code Llama 34B on approximately 140,000 samples of Python functions generated using the same pipeline as the benchmark, with inputs, outputs, and assertion-style formatting.^[1]

Decontamination

To ensure fair evaluation, the authors applied decontamination at two levels. Weak decontamination removed only exact matches where both the function and its input-output pairs appeared in the training set. Strong decontamination removed any training sample whose function matched one in the benchmark, regardless of whether the input-output pairs differed.

Results and Plateauing

Despite training on a large corpus of programs structurally similar to the benchmark, test accuracy plateaued relatively quickly. The model's performance on CRUXEval improved only modestly and hit a ceiling, suggesting that code execution reasoning is difficult to learn through standard supervised fine-tuning on input-output assertion data.^[1] The format of the fine-tuning data proved important: training assertions needed to match the evaluation format for the fine-tuning to be effective.

This plateauing result carries broader implications. It suggests that code reasoning may require training approaches beyond simple next-token prediction on code-execution examples, potentially involving program synthesis, symbolic execution traces, or other forms of structured reasoning data.

Statistical Properties

Sample Size Justification

The authors performed statistical analysis to justify the 800-sample benchmark size. Through bootstrap analysis, they determined that sampling noise (approximately 1.5% per model) dominates generation noise (approximately 0.2%), and that 800 samples provide sufficient statistical power to detect meaningful performance differences between models at the 0.05 significance level. The bootstrap analysis confirmed that comparisons such as Code Llama 34B versus Code Llama 13B are statistically significant.^[1]

Dataset Composition

The 800 benchmark samples span programs of 3 to 13 lines in length, with character counts ranging from 75 to 300. The median bytecode operation count is approximately 60 to 70 steps. The dataset covers operations across Python's string, dictionary, and list types, reflecting the proportions in the generation phase (46% string, 27% dictionary, 27% list).^[1]

CRUXEval-X: Multilingual Extension

In August 2024, researchers from the Chinese Academy of Sciences, Hong Kong University of Science and Technology, and Huazhong University of Science and Technology introduced CRUXEval-X, a multilingual extension of the original benchmark. Published at ACL 2025, CRUXEval-X expands the evaluation from Python alone to 19 programming languages: C++, C#, D, Go, Java, JavaScript, Julia, Lua, Perl, PHP, Python, R, Racket, Ruby, Rust, Scala, Shell, Swift, and TypeScript.^[5]

CRUXEval-X contains at least 600 subjects per language and approximately 19,000 test instances in total, with 500 aligned test cases spanning all 19 languages. The benchmark was constructed through a fully automated process using iterative generation and repair techniques guided by test execution feedback.^[5]

Key findings from CRUXEval-X include:

Input and output reasoning capabilities remain comparable across languages for most models
JavaScript and TypeScript show the strongest cross-language correlation (0.87 to 0.91), likely due to their syntactic similarity
Racket consistently yields the weakest correlation with other languages
Models trained primarily on Python showed unexpected cross-language transfer, with Phi-1.5 reaching 21.7% on input reasoning across non-Python languages^[5]
Output string length negatively correlates with LLM reasoning accuracy, consistent with the hypothesis that longer traces are harder to simulate mentally

Comparison with Other Code Benchmarks

CRUXEval occupies a distinct niche in the code evaluation landscape:

Benchmark	Primary Task	Language	Evaluation Type
HumanEval^[2]	Code generation from docstrings	Python	Functional correctness
MBPP^[3]	Code generation from descriptions	Python	Functional correctness
SWE-bench	Real-world software engineering	Python	Patch correctness
BigCodeBench	Complex code generation	Python	Functional correctness
CRUXEval^[1]	Code reasoning and execution	Python	Input/output prediction
CRUXEval-X^[5]	Code reasoning and execution	19 languages	Input/output prediction

While code generation benchmarks test whether models can write code, CRUXEval tests whether models understand code. This complementary perspective provides insights that generation benchmarks alone cannot capture.

Significance and Impact

CRUXEval has contributed several important insights to the field of AI and code intelligence.

First, the benchmark provides concrete evidence that code generation ability and code understanding are partially independent capabilities. This finding has implications for how code LLMs are trained and evaluated: optimizing for generation benchmarks alone may produce models with shallow code understanding, a concern for applications like automated debugging, code review, and program verification.

Second, the systematic failure analysis of GPT-4 on simple programs highlights that even frontier models have fundamental limitations in simulating program execution. These failures, which include losing track of state during loops and misunderstanding basic string operations, suggest that current transformer-based architectures may struggle with the sequential, stateful nature of program execution.

Third, the fine-tuning experiments demonstrate that code reasoning is resistant to simple supervised training approaches, pointing toward the need for more sophisticated training methodologies that incorporate explicit reasoning traces, symbolic execution data, or curriculum-based learning strategies.

The benchmark's open availability (MIT license), small size, and fast evaluation time have made it widely adopted for comparing code models, and it is regularly cited alongside HumanEval and MBPP in model release announcements and technical reports.

Dataset and Resources

CRUXEval is fully open source and available through multiple channels:

GitHub repository: facebookresearch/cruxeval (MIT License)
HuggingFace Datasets: cruxeval-org/cruxeval
Official website: crux-eval.github.io
Leaderboard: crux-eval.github.io/leaderboard.html

The dataset is stored in JSONL format, with each entry containing three fields: the function code, the input, and the expected output. The repository includes evaluation scripts, sample model generations, and a quickstart Jupyter notebook.

References

Gu, A., Roziere, B., Leather, H., Solar-Lezama, A., Synnaeve, G., & Wang, S. I. (2024). CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. *Proceedings of the 41st International Conference on Machine Learning (ICML)*, PMLR 235, 16568-16621. ↩
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. *arXiv preprint arXiv:2107.03374*. (HumanEval) ↩
Austin, J., Odena, A., Nye, M., et al. (2021). Program Synthesis with Large Language Models. *arXiv preprint arXiv:2108.07732*. (MBPP) ↩
Roziere, B., Gehring, J., Gloeckle, F., et al. (2023). Code Llama: Open Foundation Models for Code. *arXiv preprint arXiv:2308.12950*. ↩
Xu, R., Cao, J., Lu, Y., et al. (2024). CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution. *Proceedings of ACL 2025*. arXiv:2408.13001. ↩
OpenAI (2023). GPT-4 Technical Report. *arXiv preprint arXiv:2303.08774*.
Guo, D., Zhu, Q., Yang, D., et al. (2024). DeepSeek-Coder: When the Large Language Model Meets Programming. *arXiv preprint arXiv:2401.14196*.

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

What links here

Software Development