RepoBench

Updated Jun 8, 2026

Overview

RepoBench is an AI benchmark for repository-level code auto-completion, introduced in the 2023 paper "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" by Tianyang Liu, Canwen Xu, and Julian McAuley of the University of California, San Diego. The paper was first posted to arXiv on June 5, 2023, and was published at the International Conference on Learning Representations (ICLR) in 2024 ^[1]^[2]. Unlike earlier code-completion benchmarks that evaluate a model on a single isolated file, RepoBench measures how well a system uses cross-file context drawn from an entire software repository, the setting that matches how developers actually write code ^[1].

The benchmark is organized as three interconnected tasks that decompose the repository-level completion problem: RepoBench-R for retrieving the most relevant cross-file code snippets, RepoBench-C for completing the next line of code given both cross-file and in-file context, and RepoBench-P for the end-to-end pipeline that combines retrieval with completion ^[1]. RepoBench covers two programming languages, Python and Java, with data sourced from public GitHub repositories ^[1]. It has become a standard reference for evaluating long-context code models and retrieval-augmented code completion, and one of its tasks was incorporated into the widely used long-context evaluation suite LongBench ^[3].

Motivation: repository-level context

Most early benchmarks for AI code generation, such as HumanEval and MBPP, present a model with a single, self-contained function signature and docstring and ask it to produce a correct implementation. These benchmarks measure functional correctness on isolated problems but do not capture the dependencies that pervade real software, where a single line may reference classes, functions, and constants defined in other files of the same project ^[1].

RepoBench was designed to fill this gap. In real development, an auto-completion system frequently needs to predict code that depends on imported modules and on definitions located elsewhere in the repository. Producing the correct completion therefore requires two distinct capabilities: locating the relevant cross-file context, and then generating code conditioned on that context together with the in-file code already written ^[1]. By separating these capabilities into individual tasks and also evaluating them jointly, RepoBench provides a more faithful picture of how a model would perform inside a real-world coding assistant than a single-file benchmark can. This focus on cross-file dependencies and large surrounding context also makes RepoBench a natural testbed for models with a long context window and for retrieval-augmented generation over code ^[1]^[3].

The three tasks

RepoBench is structured around three tasks, labeled R, C, and P, that can be used independently or chained together.

Task	Full name	Input	Output	Primary metrics
RepoBench-R	Retrieval	A query (the in-file context) plus a set of candidate cross-file snippets	The most relevant snippet(s)	acc@k (accuracy at k)
RepoBench-C	Code Completion	Cross-file context plus in-file context	The next line of code	Exact Match, Edit Similarity
RepoBench-P	Pipeline	A repository snapshot	The next line of code, end to end	Exact Match, Edit Similarity

RepoBench-R (Retrieval)

RepoBench-R evaluates a system's ability to retrieve the cross-file code snippet most relevant to the line being completed. The task is split into an "easy" subset and a "hard" subset based on the number of candidate snippets: the easy subset presents 5 to 9 candidates and is scored with acc@1 and acc@3, while the hard subset presents 10 or more candidates and is scored with acc@1, acc@3, and acc@5 ^[1]. Accuracy at k (acc@k) is the fraction of queries for which the ground-truth snippet appears among the top k retrieved candidates. This task isolates the retrieval component so that retrieval methods, from sparse lexical matching to dense neural encoders, can be compared directly ^[1].
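The acc@k definition above can be illustrated with a minimal sketch. The function name `acc_at_k` and the toy rankings are hypothetical and are not taken from the RepoBench codebase; each ranking is assumed to list candidate indices best-first.

```python
from typing import Sequence


def acc_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold candidate index appears in the top-k ranking."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not ranked:
        return 0.0
    hits = sum(1 for ranking, g in zip(ranked, gold) if g in ranking[:k])
    return hits / len(ranked)


# Toy data (hypothetical): three queries, five candidates each, ranked best-first.
rankings = [[2, 0, 1, 4, 3], [1, 3, 0, 2, 4], [4, 2, 3, 0, 1]]
gold_ids = [2, 0, 1]

for k in (1, 3, 5):
    print(f"acc@{k} = {acc_at_k(rankings, gold_ids, k):.3f}")
```

On this toy data the gold snippet is ranked first for one query, within the top three for two, and within the top five for all three, so the score can only stay the same or rise as k grows.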

RepoBench-C (Code Completion)

RepoBench-C evaluates next-line prediction when the model is given both cross-file context and the preceding in-file code. The masked target lines are drawn from three settings depending on how they relate to cross-file modules: Cross-File-First (XF-F), where the line is the first use of a cross-file module in the file; Cross-File-Random (XF-R), where it is a later use after the module has already appeared in the file; and In-File (IF), where the line has no cross-file dependency ^[1]. Completions are scored with two metrics: Exact Match (EM), the fraction of predictions that match the reference line exactly, and Edit Similarity (ES), a Levenshtein-distance-based measure of character-level closeness that gives partial credit ^[1]. Because the target is a single next line rather than a full executable function, RepoBench-C does not provide unit tests and does not compute an execution-based pass@k metric ^[1].
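The two completion metrics can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's reference scorer: Edit Similarity is computed here as one minus the Levenshtein distance divided by the longer string's length, and both strings are stripped of surrounding whitespace first. The official implementation may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(pred: str, ref: str) -> bool:
    """True when the predicted line equals the reference (whitespace-stripped; an assumption)."""
    return pred.strip() == ref.strip()


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after stripping (assumed normalization)."""
    pred, ref = pred.strip(), ref.strip()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


reference = "return self.cache.get(key, None)"
prediction = "return self.cache.get(key)"
print(exact_match(prediction, reference))      # not an exact match
print(edit_similarity(prediction, reference))  # partial credit for a near miss
```

The example shows why the two metrics are reported together: a prediction that differs only by a missing argument scores zero on Exact Match but still earns most of the credit on Edit Similarity.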

To probe long-context behavior, RepoBench-C is released in length-bucketed variants. The 2k variant restricts prompts so they fit models with a 2,048-token context, and the 8k variant targets models with an 8,192-token context ^[1]. The authors evaluated this task primarily in a zero-shot setting to examine how well models handle long-range repository context ^[1].
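Length bucketing of this kind can be sketched as below. The cutoffs simply reuse the 2,048 and 8,192 context sizes mentioned above; the release may reserve room for generated tokens and use tighter thresholds, so these values are an assumption. The `count_tokens` callable is a hypothetical stand-in for a real tokenizer, and the whitespace counter in the usage lines is only for demonstration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed upper bounds per bucket, taken from the context sizes named in the text.
BUCKETS: Tuple[Tuple[str, int], ...] = (("2k", 2048), ("8k", 8192))


def assign_bucket(
    prompt: str,
    count_tokens: Callable[[str], int],
    buckets: Sequence[Tuple[str, int]] = BUCKETS,
) -> Optional[str]:
    """Return the smallest bucket whose limit fits the prompt, or None if it fits none."""
    n_tokens = count_tokens(prompt)
    for name, limit in buckets:
        if n_tokens <= limit:
            return name
    return None


def whitespace_count(text: str) -> int:
    """Toy token counter used only for this demonstration."""
    return len(text.split())


print(assign_bucket("def add(a, b): return a + b", whitespace_count))
print(assign_bucket(" ".join(["tok"] * 3000), whitespace_count))
print(assign_bucket(" ".join(["tok"] * 9000), whitespace_count))
```

Passing the token counter in as a parameter keeps the sketch independent of any particular tokenizer library.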

RepoBench-P (Pipeline)

RepoBench-P is the end-to-end task that combines retrieval and completion: the system must retrieve relevant cross-file snippets and then use them, together with in-file context, to predict the next line. The paper studies several context-construction strategies to disentangle the contributions of retrieval, including a gold (oracle) setting that supplies only the correct snippet, settings that mix the gold snippet with distractor candidates placed at the head or tail of the prompt, a retrieval setting that ranks candidates with a neural retriever such as UniXcoder, a random-snippet baseline, and a no-retrieval baseline ^[1]. RepoBench-P is scored with the same Exact Match and Edit Similarity metrics as RepoBench-C ^[1]. This task most closely simulates a deployed completion system, where retrieval quality and prompt assembly directly affect the final prediction.
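The context-construction strategies described above can be sketched as a single selection function. Everything here is illustrative: the strategy names, `select_snippets`, and `build_prompt` are hypothetical, a simple Jaccard token overlap stands in for a neural retriever such as UniXcoder, and the random baseline is assumed to sample from all candidates including the gold one.

```python
import random
import re
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a lexical stand-in for a learned retriever."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_snippets(strategy: str, query: str, gold: str,
                    distractors: Sequence[str], top_k: int = 2, seed: int = 0) -> List[str]:
    """Choose and order cross-file snippets for the prompt under one strategy."""
    pool = list(distractors) + [gold]
    if strategy == "none":          # no-retrieval baseline
        return []
    if strategy == "gold_only":     # oracle: only the correct snippet
        return [gold]
    if strategy == "gold_head":     # gold first, distractors after
        return [gold] + list(distractors)
    if strategy == "gold_tail":     # distractors first, gold last
        return list(distractors) + [gold]
    if strategy == "random":        # random-snippet baseline (assumed to sample the full pool)
        return random.Random(seed).sample(pool, min(top_k, len(pool)))
    if strategy == "retrieval":     # rank candidates by similarity to the query
        return sorted(pool, key=lambda s: jaccard(query, s), reverse=True)[:top_k]
    raise ValueError(f"unknown strategy: {strategy}")


def build_prompt(snippets: Sequence[str], in_file: str) -> str:
    """Cross-file snippets first, then the in-file code to be continued."""
    return "\n\n".join(list(snippets) + [in_file])


query = "cfg = load_config(path)"
gold = "def load_config(path):\n    return json.load(open(path))"
distractors = ["def save_report(rows):\n    pass", "class Timer:\n    pass"]

for name in ("none", "gold_only", "gold_head", "gold_tail", "retrieval"):
    chosen = select_snippets(name, query, gold, distractors)
    print(name, len(chosen), len(build_prompt(chosen, query)))
```

Holding the completion model fixed and varying only the strategy is what lets the contribution of retrieval and snippet placement be measured separately from generation quality.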

Data and metrics

RepoBench is built from public GitHub repositories in Python and Java. Cross-file dependencies are identified by parsing source files with the tree-sitter library to extract import statements and to trace which lines depend on modules defined in other files ^[1]. The original release drew training data from the github-code dataset and constructed its test set from Python and Java repositories created after the cutoff of common pretraining corpora, in a window during 2023, to reduce the chance that test code had been memorized during model pretraining ^[1].
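The idea of tracing cross-file dependencies from import statements can be illustrated with a small sketch. It uses Python's standard `ast` module as a stand-in for tree-sitter, handles only simple `import X` and `from X import Y` forms, and treats "first use" per imported name, which is one reading of the Cross-File-First and Cross-File-Random settings rather than the benchmark's exact rule. The helper names and the sample file are hypothetical.

```python
import ast
from typing import Dict, Set


def cross_file_names(source: str, repo_modules: Set[str]) -> Dict[str, str]:
    """Map each locally bound name to the repository module it was imported from."""
    names: Dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module in repo_modules:
            for alias in node.names:
                names[alias.asname or alias.name] = node.module
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in repo_modules:
                    names[alias.asname or alias.name] = alias.name
    return names


def label_lines(source: str, repo_modules: Set[str]) -> Dict[int, str]:
    """Label lines that use cross-file names: 'XF-F' on first use, 'XF-R' afterwards.

    Lines missing from the result have no cross-file dependency (the In-File case).
    """
    imported = cross_file_names(source, repo_modules)
    uses: Dict[int, Set[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in imported:
            uses.setdefault(node.lineno, set()).add(node.id)
    labels: Dict[int, str] = {}
    seen: Set[str] = set()
    for lineno in sorted(uses):
        labels[lineno] = "XF-F" if uses[lineno] - seen else "XF-R"
        seen |= uses[lineno]
    return labels


SOURCE = '''\
import os
from utils import slugify
from models import User

def make(name):
    user = User(name)
    user.slug = slugify(name)
    backup = User(name)
    return os.path.join(user.slug, backup.slug)
'''

# "utils" and "models" are assumed to be modules defined inside the repository.
print(label_lines(SOURCE, {"utils", "models"}))
```

In the sample, `os` is a standard-library import, so the final `return` line is treated as in-file even though it calls an imported module.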

Because the data range matters for measuring contamination, the benchmark has been refreshed over time. The maintained RepoBench v1.1 release recollects code from GitHub between October 6, 2023, and December 31, 2023, and applies a deduplication step against The Stack v2 (based on file content) to further mitigate data leakage into models trained on that corpus ^[4]. The v1.1 data is distributed on the Hugging Face Hub as separate Python and Java datasets, with examples bucketed by prompt length using OpenAI's tokenizer into levels such as 2k, 8k, and longer, extending up to a 128k bucket for very-long-context evaluation ^[4].
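Content-based deduplication of the kind described above can be sketched with exact-match hashing. This is an assumption about the general approach, not the release's actual procedure, which may normalize files or match differently; the function names and the toy files are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Set


def content_hash(text: str) -> str:
    """Stable fingerprint of a file's exact contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_reference_index(reference_files: Iterable[str]) -> Set[str]:
    """Hash every file of the reference corpus (for example, a pretraining dataset)."""
    return {content_hash(text) for text in reference_files}


def drop_seen_files(candidates: Dict[str, str], reference_hashes: Set[str]) -> Dict[str, str]:
    """Keep only candidate files whose content does not appear in the reference corpus."""
    return {
        path: text
        for path, text in candidates.items()
        if content_hash(text) not in reference_hashes
    }


# Toy data: one candidate file duplicates a file already in the reference corpus.
reference = build_reference_index(["def add(a, b):\n    return a + b\n"])
candidates = {
    "pkg/math.py": "def add(a, b):\n    return a + b\n",
    "pkg/text.py": "def shout(s):\n    return s.upper()\n",
}
print(sorted(drop_seen_files(candidates, reference)))
```

Exact hashing catches only byte-identical files, so lightly edited copies would survive this filter; near-duplicate detection would need a fuzzier comparison.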

The metric design reflects RepoBench's line-level, non-executable nature. Retrieval is measured with accuracy at k, and completion is measured with Exact Match and Edit Similarity rather than test execution ^[1]^[4]. This choice keeps evaluation fast and language-agnostic but means RepoBench measures textual fidelity to a reference line rather than runtime correctness, a deliberate trade-off the authors note when contrasting it with execution-based benchmarks ^[1].

Use and adoption

In their original experiments, the authors evaluated a range of code language models, including the 175B-parameter Codex model (code-davinci-002), the CodeGen family across sizes from 350M to 16.1B parameters, and StarCoder, finding that performance on the cross-file settings lagged the in-file setting and that retrieval quality materially affected pipeline results ^[1]. These baselines established RepoBench as a way to quantify how well models exploit repository context rather than only local context.

Since publication, RepoBench has been adopted as a standard yardstick for repository-level and long-context code models, and results on it are reported in numerous model releases and follow-up papers, including the technical reports for IBM's Granite Code models and their long-context extensions ^[5]. Its most visible reuse is in LongBench, the bilingual long-context understanding benchmark, which includes a code-completion task derived from RepoBench. LongBench uses the challenging Cross-File-First setting in an oracle-filled configuration: each prompt concatenates randomly drawn cross-file snippets, including the gold snippet, and completions are scored with Edit Similarity ^[3]. Through LongBench, RepoBench-style evaluation reaches a broad audience comparing the long-context abilities of general-purpose large language models, not just dedicated code models.

Relationship to other code benchmarks

RepoBench sits within a family of code-evaluation benchmarks that differ in granularity and in whether they execute the generated code.

HumanEval and MBPP evaluate single, self-contained functions for functional correctness using pass@k against hidden unit tests. They test isolated reasoning, not cross-file context, which is precisely the limitation RepoBench addresses ^[1].
CrossCodeEval is a closely related, concurrently developed benchmark that also introduces cross-file context into code completion. It spans four languages (Python, Java, TypeScript, and C#) with around 10,000 examples and, like RepoBench, scores predictions with Exact Match and Edit Similarity rather than execution ^[6]. RepoBench and CrossCodeEval are frequently cited together as the two reference benchmarks for retrieval-augmented, cross-file completion.
SWE-bench is also repository-level but operates at a different granularity and evaluation mode: it tasks a model with resolving real GitHub issues by producing a patch to a full codebase, and it verifies correctness by running the repository's test suite. SWE-bench thus measures end-to-end software-engineering ability with execution-based grading, whereas RepoBench measures next-line completion with textual-similarity metrics ^[1].

In short, RepoBench occupies the middle of this spectrum. It is broader in scope than single-function benchmarks like HumanEval because it requires cross-file context, but it is lighter-weight than execution-based, issue-resolution benchmarks like SWE-bench because it targets line-level completion scored by Exact Match and Edit Similarity. That positioning, together with its explicit retrieval, completion, and pipeline decomposition, is why it remains a common reference for long-context code models and retrieval-augmented code completion through 2025 and 2026 ^[1]^[3]^[5].

References

[1] Tianyang Liu, Canwen Xu, Julian McAuley. "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems." arXiv:2306.03091, June 2023. https://arxiv.org/abs/2306.03091
[2] "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems." ICLR 2024 (poster). https://iclr.cc/virtual/2024/poster/17776
[3] THUDM. "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding." https://github.com/THUDM/LongBench
[4] Leolty. "RepoBench data (v1.1) README." GitHub. https://github.com/Leolty/repobench/blob/main/data/README.md
[5] Mayank Mishra et al. "Granite Code Models: A Family of Open Foundation Models for Code Intelligence." arXiv:2405.04324, 2024. https://arxiv.org/abs/2405.04324
[6] Yangruibo Ding et al. "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion." NeurIPS 2023. https://arxiv.org/abs/2310.11248
