RepoBench
Last reviewed
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,741 words
RepoBench is an AI benchmark for repository-level code auto-completion, introduced in the 2023 paper "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems" by Tianyang Liu, Canwen Xu, and Julian McAuley of the University of California, San Diego. The paper was first posted to arXiv on June 5, 2023, and was published at the International Conference on Learning Representations (ICLR) in 2024 [1][2]. Unlike earlier code-completion benchmarks that evaluate a model on a single isolated file, RepoBench measures how well a system uses cross-file context drawn from an entire software repository, the setting that matches how developers actually write code [1].
The benchmark is organized as three interconnected tasks that decompose the repository-level completion problem: RepoBench-R for retrieving the most relevant cross-file code snippets, RepoBench-C for completing the next line of code given both cross-file and in-file context, and RepoBench-P for the end-to-end pipeline that combines retrieval with completion [1]. RepoBench covers two programming languages, Python and Java, with data sourced from public GitHub repositories [1]. It has become a standard reference for evaluating long-context code models and retrieval-augmented code completion, and one of its tasks was incorporated into the widely used long-context evaluation suite LongBench [3].
Most early benchmarks for AI code generation, such as HumanEval and MBPP, present a model with a single, self-contained function signature and docstring and ask it to produce a correct implementation. These benchmarks measure functional correctness on isolated problems but do not capture the dependencies that pervade real software, where a single line may reference classes, functions, and constants defined in other files of the same project [1].
RepoBench was designed to fill this gap. In real development, an auto-completion system frequently needs to predict code that depends on imported modules and on definitions located elsewhere in the repository. Producing the correct completion therefore requires two distinct capabilities: locating the relevant cross-file context, and then generating code conditioned on that context together with the in-file code already written [1]. By separating these capabilities into individual tasks and also evaluating them jointly, RepoBench provides a more faithful picture of how a model would perform inside a real-world coding assistant than a single-file benchmark can. This focus on cross-file dependencies and large surrounding context also makes RepoBench a natural testbed for models with a long context window and for retrieval-augmented generation over code [1][3].
RepoBench is structured around three tasks, labeled R, C, and P, that can be used independently or chained together.
| Task | Full name | Input | Output | Primary metrics |
|---|---|---|---|---|
| RepoBench-R | Retrieval | A query (the in-file context) plus a set of candidate cross-file snippets | The most relevant snippet(s) | acc@k (accuracy at k) |
| RepoBench-C | Code Completion | Cross-file context plus in-file context | The next line of code | Exact Match, Edit Similarity |
| RepoBench-P | Pipeline | A repository snapshot | The next line of code, end to end | Exact Match, Edit Similarity |
RepoBench-R evaluates a system's ability to retrieve the cross-file code snippet most relevant to the line being completed. The task is split into an "easy" subset and a "hard" subset based on the number of candidate snippets: the easy subset presents roughly 5 to 9 candidates and is scored with acc@1 and acc@3, while the hard subset presents 10 or more candidates and is scored with acc@1, acc@3, and acc@5 [1]. Accuracy at k (acc@k) measures whether the correct, ground-truth snippet appears among the top k retrieved candidates. This task isolates the retrieval component so that retrieval methods, from sparse lexical matching to dense neural encoders, can be compared directly [1].
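The retrieval scoring described above can be illustrated with a minimal sketch. It uses Jaccard token overlap as one example of sparse lexical matching; the function names, the tokenization rule, and the sample snippets are illustrative assumptions, not the benchmark's official evaluation code.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of lowercase identifier-like tokens (an assumed rule)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two code strings, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [jaccard(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def acc_at_k(rankings: list[list[int]], gold: list[int], k: int) -> float:
    """Fraction of examples whose gold snippet index is among the top k ranked."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


# Hypothetical example: the first candidate defines the function the query calls.
query = "total = compute_tax(order.amount)"
candidates = [
    "def compute_tax(amount): return amount * 0.2",
    "class Logger: pass",
    "def send_email(to, body): ...",
]
ranking = rank_candidates(query, candidates)
```

Under this setup a dense neural encoder would only need to replace `jaccard` with a similarity over embeddings; the `acc_at_k` computation stays the same.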
RepoBench-C evaluates next-line prediction when the model is given both cross-file context and the preceding in-file code. The masked target lines are drawn from three settings depending on how they relate to cross-file modules: Cross-File-First (XF-F), where the line is the first use of a cross-file module in the file; Cross-File-Random (XF-R), where it is a later use after the module has already appeared in the file; and In-File (IF), where the line has no cross-file dependency [1]. Completions are scored with two metrics: Exact Match (EM), the fraction of predictions that match the reference line exactly, and Edit Similarity (ES), a Levenshtein-distance-based measure of character-level closeness that gives partial credit [1]. Because the target is a single next line rather than a full executable function, RepoBench-C does not provide unit tests and does not compute an execution-based pass@k metric [1].
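The two completion metrics can be sketched as follows. This is a minimal illustration that assumes surrounding whitespace is stripped before comparison and that Edit Similarity is normalized as one minus the Levenshtein distance divided by the longer string's length; the official scoring script may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def exact_match(pred: str, ref: str) -> bool:
    """True when the predicted line equals the reference (whitespace-stripped, an assumption)."""
    return pred.strip() == ref.strip()


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; the normalization used here is an assumption."""
    pred, ref = pred.strip(), ref.strip()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def score(preds: list[str], refs: list[str]) -> dict[str, float]:
    """Average EM and ES over a set of predicted and reference lines."""
    n = len(refs)
    return {
        "EM": sum(exact_match(p, r) for p, r in zip(preds, refs)) / n,
        "ES": sum(edit_similarity(p, r) for p, r in zip(preds, refs)) / n,
    }
```

A prediction that differs from the reference by a single character scores zero on Exact Match but close to one on Edit Similarity, which is the partial credit the second metric is meant to provide.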
To probe long-context behavior, RepoBench-C is released in length-bucketed variants. The 2k variant restricts prompts so they fit models with a 2,048-token context, and the 8k variant targets models with an 8,192-token context [1]. The authors evaluated this task primarily in a zero-shot setting to examine how well models handle long-range repository context [1].
RepoBench-P is the end-to-end task that combines retrieval and completion: the system must retrieve relevant cross-file snippets and then use them, together with in-file context, to predict the next line. The paper studies several context-construction strategies to disentangle the contributions of retrieval, including a gold (oracle) setting that supplies only the correct snippet, settings that mix the gold snippet with distractor candidates placed at the head or tail of the prompt, a retrieval setting that ranks candidates with a neural retriever such as UniXcoder, a random-snippet baseline, and a no-retrieval baseline [1]. RepoBench-P is scored with the same Exact Match and Edit Similarity metrics as RepoBench-C [1]. This task most closely simulates a deployed completion system, where retrieval quality and prompt assembly directly affect the final prediction.
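The context-construction strategies can be sketched as a small prompt-assembly helper. The strategy names, the reading of "head" and "tail" as the position of the gold snippet, the shuffled-pool interpretation of the random baseline, and the commented-out snippet format are all assumptions made for illustration; the retriever-ranked setting is omitted because it depends on an external model.

```python
import random

# Hypothetical strategy labels, not names taken from the paper.
STRATEGIES = ("none", "gold_only", "gold_head", "gold_tail", "random")


def build_context(strategy: str, gold: str, distractors: list[str], seed: int = 0) -> list[str]:
    """Return the ordered list of cross-file snippets to place before the in-file code."""
    distractors = list(distractors)
    if strategy == "none":        # no-retrieval baseline
        return []
    if strategy == "gold_only":   # oracle: only the correct snippet
        return [gold]
    if strategy == "gold_head":   # gold snippet first, distractors after
        return [gold] + distractors
    if strategy == "gold_tail":   # distractors first, gold snippet last
        return distractors + [gold]
    if strategy == "random":      # order carries no relevance signal
        pool = [gold] + distractors
        random.Random(seed).shuffle(pool)
        return pool
    raise ValueError(f"unknown strategy: {strategy!r}")


def build_prompt(snippets: list[str], in_file_code: str) -> str:
    """Prepend each snippet as a commented block, then append the in-file code."""
    commented = ["\n".join("# " + line for line in s.splitlines()) for s in snippets]
    return "\n\n".join(commented + [in_file_code])


prompt = build_prompt(
    build_context("gold_tail", "def compute_tax(amount): ...", ["class Logger: ..."]),
    "total = ",
)
```

Keeping context selection separate from prompt formatting makes it straightforward to hold the completion model fixed while varying only the retrieval strategy, which is the comparison this task is designed to support.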
RepoBench is built from public GitHub repositories in Python and Java. Cross-file dependencies are identified by parsing source files with the tree-sitter library to extract import statements and to trace which lines depend on modules defined in other files [1]. The original release drew training data from the github-code dataset and constructed its test set from Python and Java repositories created after the cutoff of common pretraining corpora, in a window during 2023, to reduce the chance that test code had been memorized during model pretraining [1].
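The dependency-tracing idea can be illustrated for Python sources with a short sketch. It deliberately swaps in Python's built-in `ast` module instead of tree-sitter so that it runs without extra dependencies, and the function names and sample file are hypothetical; it is not the benchmark's construction code.

```python
import ast


def imported_names(source: str) -> dict[str, str]:
    """Map each name bound by an import statement to the module it comes from."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                bound = alias.asname or alias.name.split(".")[0]
                names[bound] = alias.name
        elif isinstance(node, ast.ImportFrom):
            module = "." * node.level + (node.module or "")
            for alias in node.names:
                names[alias.asname or alias.name] = module
    return names


def lines_using_imports(source: str) -> dict[int, set[str]]:
    """Map line numbers to the imported names referenced on that line."""
    names = imported_names(source)
    usage: dict[int, set[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in names:
            usage.setdefault(node.lineno, set()).add(node.id)
    return usage


# Hypothetical file contents used only for illustration.
source = "\n".join([
    "from utils.tax import compute_tax",
    "import logging",
    "",
    "def checkout(order):",
    "    logging.info('checkout')",
    "    return compute_tax(order.amount)",
])
usage = lines_using_imports(source)
```

A full pipeline would additionally need to resolve each module name to a file inside the repository, to separate cross-file imports from standard-library or third-party ones; that step is left out here.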
Because the date range of the collected code matters for measuring contamination, the benchmark has been refreshed over time. The maintained RepoBench v1.1 release recollects code from GitHub between October 6, 2023, and December 31, 2023, and applies a deduplication step against The Stack v2 (based on file content) to further mitigate data leakage into models trained on that corpus [4]. The v1.1 data is distributed on the Hugging Face Hub as separate Python and Java datasets, with examples bucketed by prompt length using OpenAI's tokenizer into levels such as 2k, 8k, and longer, extending up to a 128k bucket for very-long-context evaluation [4].
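Length bucketing of this kind can be sketched in a few lines. The bucket boundaries below are hypothetical placeholders (the real release defines more levels, and its exact cutoffs are not stated here), and a whitespace split stands in for OpenAI's tokenizer so the example has no external dependencies.

```python
from bisect import bisect_left

# Hypothetical upper bounds in tokens; illustrative only, not the release's cutoffs.
BUCKETS = [("2k", 2048), ("8k", 8192), ("128k", 131072)]


def count_tokens(text: str) -> int:
    """Stand-in token counter; a real setup would call an actual tokenizer."""
    return len(text.split())


def assign_bucket(prompt: str, count=count_tokens):
    """Return the label of the smallest bucket that fits the prompt, or None if it fits none."""
    n = count(prompt)
    limits = [limit for _, limit in BUCKETS]
    idx = bisect_left(limits, n)
    return BUCKETS[idx][0] if idx < len(BUCKETS) else None
```

Passing the counter as a parameter means the same bucketing logic can be reused with whichever tokenizer matches the model under evaluation.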
The metric design reflects RepoBench's line-level, non-executable nature. Retrieval is measured with accuracy at k, and completion is measured with Exact Match and Edit Similarity rather than test execution [1][4]. This choice keeps evaluation fast and language-agnostic but means RepoBench measures textual fidelity to a reference line rather than runtime correctness, a deliberate trade-off the authors note when contrasting it with execution-based benchmarks [1].
In their original experiments, the authors evaluated a range of code language models, including the 175B-parameter Codex model (code-davinci-002), the CodeGen family across sizes from 350M to 16.1B parameters, and StarCoder, finding that performance on the cross-file settings lagged the in-file setting and that retrieval quality materially affected pipeline results [1]. These baselines established RepoBench as a way to quantify how well models exploit repository context rather than only local context.
Since publication, RepoBench has been adopted as a standard yardstick for repository-level and long-context code models, and results on it are reported in numerous model releases and follow-up papers, including the technical reports for IBM's Granite Code models and their long-context extensions [5]. Its most visible second life is inside LongBench, the bilingual long-context understanding benchmark, which includes a code-completion task derived from RepoBench: LongBench uses the challenging Cross-File-First setting in an oracle-filled configuration, concatenating randomly drawn cross-file snippets (including the gold snippet) and scoring completions with Edit Similarity [3]. Through LongBench, RepoBench-style evaluation reaches a broad audience comparing the long-context abilities of general-purpose large language models, not just dedicated code models.
RepoBench sits within a family of code-evaluation benchmarks that differ in granularity and in whether they execute the generated code.
In short, RepoBench occupies the middle of this spectrum. It is broader in scope than single-function benchmarks like HumanEval because it requires cross-file context, but it is lighter-weight than execution-based, issue-resolution benchmarks like SWE-bench because it targets line-level completion scored by Exact Match and Edit Similarity. That positioning, together with its explicit retrieval, completion, and pipeline decomposition, is why it remains a common reference for long-context code models and retrieval-augmented code completion through 2025 and 2026 [1][3][5].