# repobench_benchmarking_repositorylevel_code_autocompletion_systems__fcfe7c0e.pdf

Published as a conference paper at ICLR 2024

## REPOBENCH: BENCHMARKING REPOSITORY-LEVEL CODE AUTO-COMPLETION SYSTEMS

Tianyang Liu, Canwen Xu, Julian McAuley
University of California San Diego
{til040, cxu, jmcauley}@ucsd.edu

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with the potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench supports both Python and Java and consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and to encourage continuous improvement in auto-completion systems. RepoBench is actively maintained with the latest code, serving as a live benchmark publicly available at https://github.com/Leolty/repobench.

## 1 INTRODUCTION

Large language models (LLMs; Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023; OpenAI, 2023; Chiang et al., 2023; Xu et al., 2023) have been instrumental in paving new avenues for innovative applications across diverse domains, with programming being a notably attractive and promising domain (Chen et al., 2021; van Dam et al., 2023; Austin et al., 2021; Wang et al., 2023). In particular, the rise and application of code auto-completion systems like GitHub's Copilot (https://github.com/features/copilot), driven by OpenAI's Codex (Chen et al., 2021), have the potential to substantially change the manner in which we interact with code. These changes facilitate coding for beginners and improve the efficiency of the coding process for experienced developers.

A variety of code auto-completion models (Chen et al., 2021; Guo et al., 2022; Fried et al., 2022; Nijkamp et al., 2022; Li et al., 2023; Allal et al., 2023; Guo et al., 2023; 2024) have emerged in recent years, each boasting unique capabilities and performance characteristics. This emergence of models emphasizes the increasing importance of AI in the realm of programming, leading to a more diversified and competitive landscape. However, current evaluation datasets and benchmarks (Lu et al., 2021; Raychev et al., 2016a; Allamanis & Sutton, 2013) predominantly focus on completion tasks within the scope of a single file. This focus fails to reflect the complexity and intricacies of real-world programming scenarios, where developers frequently work on multi-file projects, often navigating through and understanding code spanning several repositories. Recognizing the need for a more comprehensive evaluation, we introduce RepoBench, a new benchmark for evaluating the effectiveness of repository-level code auto-completion systems.
Specifically, RepoBench offers three distinct evaluation sub-tasks, each emphasizing a unique aspect of a fully functioning code auto-completion system: (1) The Retrieval Task (RepoBench-R), which tests the system's ability to retrieve the most relevant code snippets, thereby providing the necessary context for the prediction of the next line of code. (2) The Code Completion Task (RepoBench-C), where the task is to predict the next line of code given a pre-defined context. The context can involve content from different files (cross-file context) and within the file (in-file context), with a moderate length setting that can fit most models. (3) The End-to-End Pipeline Task (RepoBench-P), which is designed to simulate the complete process of a code auto-completion system like GitHub Copilot: first retrieving relevant snippets and then completing the code by predicting the next line. In this scenario, the system may encounter a large set of potential snippets for retrieval, resulting in longer and broader contexts. The system therefore needs to select efficiently among numerous candidates to facilitate code completion while ensuring that the extensive context remains within its processing capabilities.

To summarize, the primary contributions of our work are as follows:

- We present RepoBench, a benchmark tailored for evaluating repository-level code auto-completion systems. This benchmark comprises three interconnected tasks: RepoBench-R for code retrieval, RepoBench-C for code completion, and RepoBench-P, which integrates both aspects to reflect the entire workflow of an auto-completion system, offering a balanced assessment.
- We conduct a series of experiments on RepoBench, analyzing the efficacy of various retrieval methods and code completion models of different magnitudes, and assessing their combined performance in a full pipeline, providing insights for future research and development.
- Our results underscore the significance of code models that can manage extended contexts and maintain generalizability in real-world coding environments.

## 2 RELATED WORK

**LLMs for Code Completion** Code completion, also referred to as auto-completion or intelligent code completion, is an essential feature provided by many modern Integrated Development Environments (IDEs) and code editors. It aids programmers in writing code more efficiently by predicting and automatically completing the next line or multiple next lines. The inception of language models (LMs) in code completion can be traced back to the usage of n-gram based LMs (Tu et al., 2014; Hindle et al., 2016), RNN models (White et al., 2015), and probabilistic grammar models (Bielik et al., 2016; Raychev et al., 2016b; Hellendoorn & Devanbu, 2017), which laid the foundation for the subsequent introduction of more advanced LMs in this field. With the advent of transformer-based models (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2018; 2019; Brown et al., 2020), decoder-only models trained on large-scale code datasets have been proposed to foster advancements in code completion. For instance, GPT-C (Svyatkovskiy et al., 2020a) and CodeGPT (Lu et al., 2021), following the underlying architecture of GPT-style models, are pre-trained on vast amounts of code.
UniXcoder (Guo et al., 2022) and CugLM (Liu et al., 2020) incorporate multitask learning strategies and leverage code structures to enhance pretraining. More recent LLMs, including Codex (Chen et al., 2021), PolyCoder (Xu et al., 2022), CodeGen (Nijkamp et al., 2022), InCoder (Fried et al., 2022), CodeGeeX (Zheng et al., 2023), SantaCoder (Allal et al., 2023), StarCoder (Li et al., 2023; Lozhkov et al., 2024), LongCoder (Guo et al., 2023), Code Llama (Rozière et al., 2024), and DeepSeek-Coder (Guo et al., 2024), employ billions of parameters and excel in code generation tasks, benefiting from large-scale, high-quality code corpora. The scope of code completion has expanded with works like RLPG (Shrivastava et al., 2022), CoCoMIC (Ding et al., 2022), and RepoCoder (Zhang et al., 2023), emphasizing the integration of in-file and cross-file contexts and the importance of specialized benchmarks for evaluating repository-level code auto-completion systems.

**Code Completion Datasets** The task of code completion serves as a foundation for programming language models and plays a pivotal role in intelligent code completion systems. While public benchmarks like CodeXGLUE (Lu et al., 2021), with the PY150 (Raychev et al., 2016a) and GitHub Java Corpus (Allamanis & Sutton, 2013) datasets, play a key role in evaluating models within single-file contexts, they may not fully encapsulate the intricacies of real-world coding projects, which often entail cross-file interactions. To address this, Ding et al. (2022) proposed CoCoMIC, a model for cross-file completion and a code completion dataset with retrieved cross-file context. Different from the CoCoMIC data, our benchmark extends beyond code completion and includes evaluation of retrieval and pipeline construction, and thus better captures the complexity of such cross-file code completion systems. RepoEval by Zhang et al. (2023) serves as a project-oriented benchmark, focusing on 16 selected Python repositories to simulate real-world coding environments. However, its limitation arises from being integrated into the training data of StarCoder. RepoBench not only spans a wider range of repositories across Python and Java, but also offers a segmented evaluation into retrieval, completion, and end-to-end tasks.

Transitioning from file-based to repository-level code completion not only offers a more realistic representation of practical coding scenarios but also serves as a platform for evaluating the transfer learning capabilities of language models, as most models are not initially pre-trained with cross-file contexts included. This shift also introduces the challenge of handling longer prompts, a situation less common in single-file contexts and a known limitation of many Transformer-based models. Recent research on long-range transformers (Zaheer et al., 2020) has shown promise in handling long sequences, with notable contributions from initial works like Longformer (Beltagy et al., 2020) and Reformer (Kitaev et al., 2020), as well as more recent advancements like CoLT5 (Ainslie et al., 2023), Unlimiformer (Bertsch et al., 2023), and Claude-100k (PBC, 2023), which have demonstrated their potential in effectively processing and generating code with much more cross-file context included.

## 3 THE REPOBENCH DATASET

RepoBench is a live benchmark for code auto-completion, with a commitment to continuously incorporate the latest data for model evaluation.
This paper introduces the construction and findings of RepoBench's inaugural iteration (v1.0).

### 3.1 DATA SOURCES

**Github-Code Dataset:** The first source of RepoBench is the github-code dataset (https://huggingface.co/datasets/codeparrot/github-code), which consists of a vast collection of code files sourced from GitHub repositories under open-source licenses with a data cutoff date of March 16, 2022. Specifically, we aggregate files based on their repository name, as the github-code dataset is originally stored at the file level. Given that the code in this dataset has been widely utilized for training various models (Li et al., 2023; Nijkamp et al., 2022), we primarily use this dataset for constructing our training data. The use of this data for training specifically addresses the adoption of patterns that concatenate cross-file context and in-file context for next-line prediction. Fine-tuning on this dataset is optional, as sufficiently robust models may already exhibit this generalizability.

**Newly Crawled GitHub Data:** To mitigate impacts regarding data leakage and memorization, we augment the dataset by incorporating the most recent, non-forked GitHub repositories that are permitted under their respective licenses. Specifically, we use GitHub's official API to crawl Python and Java repositories created after February 9, 2023, which aligns with the newest knowledge cutoff date of The Stack (Kocetkov et al., 2022), and before August 3, 2023. This newly-crawled data serves exclusively as our test set for evaluation.

**Continuous Updates:** In response to the rapid advancement of code LLMs and their training datasets, RepoBench is committed to a regimen of continuous updates, to ensure that it keeps pace with the latest developments and avoids potential data leakage, which could compromise the integrity of model evaluations. As of this writing, RepoBench v1.1 is already available. Detailed discussions on RepoBench v1.1 can be found in Appendix C.

### 3.2 DATA PROCESSING

The data processing procedure for this study involves multiple steps. For the training data sourced from github-code, repositories with a number of Python or Java files between 32 and 128 are selected. This range is chosen to ensure adequate cross-file dependency while avoiding excessive complexity and keeping the data volume within a reasonable range. For the newly crawled test data, we do not set file-number constraints, to ensure a thorough evaluation. To identify cross-file dependencies and their usage, we use tree-sitter (https://tree-sitter.github.io/tree-sitter/) to parse each file. This parsing is primarily directed at import statements, enabling us to identify all cross-file modules and the lines utilizing these modules (termed cross-file lines). Further, we track the corresponding code snippets that define these imported modules.

Figure 1: Construction of a prompt for repository-level cross-file code completion. The commented cross-file context (path + snippet), parsed from import statements using tree-sitter, is concatenated with the in-file context (path + import statements + preceding lines), which is cropped to a maximum of 30 lines in RepoBench, to form the input prompt, with the objective of predicting the next line. Note that for clarity, certain lines of code are omitted in this figure, which is an abbreviated and simplified version derived from a real example. Refer to Appendix A for a detailed ablation study on prompt construction.
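To make the parsing step in Section 3.2 concrete, the sketch below extracts imported names from a Python file and the in-file lines that use them, which is roughly what is needed to identify cross-file lines. It is a minimal illustration only: RepoBench uses tree-sitter (which also covers Java), whereas this sketch relies on Python's built-in `ast` module and a simple substring heuristic, and the function name `find_cross_file_lines` is ours, not part of the benchmark's code.

```python
import ast

def find_cross_file_lines(source: str):
    """Return (imported_names, cross_file_lines) for a Python file.

    A line is counted as a "cross-file line" if it mentions a name brought in
    by an import statement; the first such line per name roughly corresponds
    to the XF-F setting described in Section 3.3.
    """
    tree = ast.parse(source)
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported[alias.asname or alias.name.split(".")[0]] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno

    cross_file_lines = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if lineno in imported.values():
            continue  # skip the import statements themselves
        if any(name in line for name in imported):
            cross_file_lines.append((lineno, line))
    return imported, cross_file_lines

if __name__ == "__main__":
    demo = (
        "import numpy as np\n"
        "from utils.metrics import edit_similarity\n"
        "\n"
        "def score(a, b):\n"
        "    return edit_similarity(a, b) * np.ones(1)\n"
    )
    names, usages = find_cross_file_lines(demo)
    print(names)   # {'np': 1, 'edit_similarity': 2}
    print(usages)  # [(5, '    return edit_similarity(a, b) * np.ones(1)')]
```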
After processing the data, our dataset comprises 10,345 Python and 14,956 Java historical repositories, which serve as training data and are available for optional fine-tuning. Additionally, we have 1,075 Python and 594 Java new repositories from GitHub designated as test data for evaluation.

### 3.3 TASK CONSTRUCTION

**Task Settings** To effectively evaluate next-line prediction in auto-completion systems, we define three settings:

- **Cross-File-First (XF-F):** This is the most challenging setting, where we mask the first appearance of a cross-file line within a file. In this setting, there is no prior usage of the module in the in-file context to aid the prediction, thereby requiring the system to handle long-range cross-file context for better accuracy.
- **Cross-File-Random (XF-R):** In this setting, we mask a random, non-first occurrence of a cross-file line. Unlike the XF-F setting, the prior in-file usage of the module may serve as a hint for the prediction.
- **In-File (IF):** In this setting, we mask an in-file line that does not involve any cross-file modules. This setting serves as a robustness test to ensure that the incorporation of cross-file context does not greatly affect the accuracy of predictions.

Note that RepoBench-R (Retrieval) is designed with only the XF-F and XF-R settings, as IF does not involve retrieval and thus cannot be evaluated in this task, while both RepoBench-C (Code Completion) and RepoBench-P (Pipeline) involve all three settings: XF-F, XF-R, and IF.

Table 1: Overview of the test data in RepoBench for Python and Java across three tasks: RepoBench-R, RepoBench-C and RepoBench-P, including details on the number of data points for different settings (XF-F, XF-R, IF), as well as the mean number of candidates and tokens in each subset.

| Lang. | Task | Subset | XF-F | XF-R | IF | Mean Candidates | Mean Tokens |
|---|---|---|---|---|---|---|---|
| Python | RepoBench-R | Easy | 12,000 | 6,000 | - | 6.7 | - |
| Python | RepoBench-R | Hard | 12,000 | 6,000 | - | 17.8 | - |
| Python | RepoBench-C | 2k | 12,000 | 5,000 | 7,000 | - | 1,035 |
| Python | RepoBench-C | 8k | 18,000 | 7,500 | 10,500 | - | 3,967 |
| Python | RepoBench-P | - | 10,867 | 4,652 | 6,399 | 24 | 44,028 |
| Java | RepoBench-R | Easy | 12,000 | 6,000 | - | 6.8 | - |
| Java | RepoBench-R | Hard | 12,000 | 6,000 | - | 25.5 | - |
| Java | RepoBench-C | 2k | 12,000 | 5,000 | 7,000 | - | 1,093 |
| Java | RepoBench-C | 8k | 18,000 | 7,500 | 10,500 | - | 4,179 |
| Java | RepoBench-P | - | 10,599 | 4,459 | 6,196 | 26 | 139,406 |

**RepoBench-R** RepoBench-R targets the retrieval component of a repository-level auto-completion system, focusing on extracting the most relevant code snippet from a project repository for next-line code prediction. In RepoBench-R, every snippet parsed from import statements is treated as a potential candidate for next-line prediction, where only one gold snippet is the optimal context for prediction. This task considers scenarios with 5 or more candidate snippets; specifically, we categorize them into two subsets: those with 5-9 candidates as the easy subset, and those with 10 or more candidates as the hard subset. As demonstrated in Table 1, both the easy and hard subsets contain 12,000 samples for the XF-F setting, whereas for the XF-R setting, each subset consists of 6,000 samples. We also provide training data for optional usage; further details can also be found in Appendix B. For evaluative purposes, the Accuracy@k (acc@k) metric is employed to assess retrieval performance. The easy subset is evaluated using acc@1 and acc@3, while the hard subset is examined through acc@1, acc@3, and acc@5 metrics.
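As a reference for the retrieval metric, the following minimal sketch shows one way Accuracy@k could be computed from a retriever's ranked candidate lists. The helper name `accuracy_at_k` and the toy data are ours and are not taken from the RepoBench implementation.

```python
from typing import List

def accuracy_at_k(ranked_indices: List[List[int]], gold_indices: List[int], k: int) -> float:
    """Fraction of examples whose gold snippet appears among the top-k retrieved candidates.

    ranked_indices[i] is the candidate ordering (best first) produced by a retriever
    for example i; gold_indices[i] is the index of the gold snippet for that example.
    """
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_indices, gold_indices))
    return hits / len(gold_indices)

# Toy usage: two examples, with the gold snippet retrieved at rank 1 and rank 3 respectively.
rankings = [[2, 0, 1], [4, 1, 3, 0, 2]]
golds = [2, 3]
print(accuracy_at_k(rankings, golds, k=1))  # 0.5
print(accuracy_at_k(rankings, golds, k=3))  # 1.0
```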
**RepoBench-C** RepoBench-C focuses solely on the prediction of the next line of code, given in-file context (including several preceding lines and import statements) and cross-file context. In RepoBench-C, as shown in Figure 1, the prompt is created by combining all the parsed snippets as cross-file context with an in-file context. The in-file context includes import statements and several preceding lines of code, with a maximum limit of 30 lines. To address the varying context lengths of existing models, RepoBench-C is divided into two subsets: RepoBench-C-2k and RepoBench-C-8k. RepoBench-C-2k, designed for models with a token limit of 2,048, holds prompts that do not exceed 1,925 tokens. RepoBench-C-8k is architected with a higher threshold, encompassing up to 7,685 tokens, apt for models with an 8,192-token limit (e.g., StarCoder (Li et al., 2023)) or an 8,000-token limit (e.g., Codex (Chen et al., 2021)). RepoBench-C is designed primarily for 0-shot learning, in order to examine the model's ability to handle long-range contexts. Despite this, we also provide a large amount of training data to allow fine-tuning, thereby enhancing the transfer capabilities of relatively smaller models. For the test set, we allocate more data under the XF-F setting than under the XF-R and IF settings. Details of this data are provided in Table 1. For evaluation metrics, we adopt Exact Match (EM) and Edit Similarity (Edit Sim) (Svyatkovskiy et al., 2020b) following previous work (Lu et al., 2021), and extend our evaluation with CodeBLEU (Ren et al., 2020) to evaluate the accuracy of the predicted code line.

**RepoBench-P** RepoBench-P evaluates the model's performance by combining RepoBench-R and RepoBench-C: retrieval of relevant snippets and next-line code prediction, presenting a challenging pipeline task. This task mirrors complex real-world scenarios that a practical auto-completion system would face, assessing the model's comprehensive performance and flexibility. In RepoBench-P, each setting (XF-F, XF-R, and IF) requires the model to first identify the most pertinent snippets and then employ these snippets as cross-file context, in conjunction with the in-file context, to predict the subsequent line. Rather than specifying a maximum token limit, we define a minimum token threshold: 12,000 tokens for Python and 24,000 for Java, and the gold snippet retrieval process requires a minimum of 10 candidates. Due to the substantial amount of data resulting from these constraints, we opt to down-sample to ensure parity between the Java and Python datasets. Details of this data are provided in Table 1. For evaluating the predicted next line, we also use the Exact Match, Edit Similarity and CodeBLEU metrics, in line with the RepoBench-C setting.

## 4 EXPERIMENTS

### 4.1 REPOBENCH-R

The primary objective of the retrieval task in RepoBench-R is to identify the most relevant code snippets to predict the next line given an in-file context. The process generally involves cropping certain lines of the in-file code before the line to be predicted, followed by the calculation of the degree of relevance (we use the term similarity uniformly) between the cropped code and each candidate snippet.
Formally, the general method for retrieval can be formulated as follows:

$$k\text{-}\mathop{\mathrm{argmax}}_{i \in \{1,\dots,n\}} \; f\big(C[-m{:}],\, S_i\big)$$

where $C$ denotes the in-file code, $S_i$ refers to the $i$-th candidate snippet, $n$ is the total number of candidate snippets, $m$ is the number of lines of the in-file code retained, $k$ represents the top $k$ candidates to be retrieved, and $f$ is the function computing the similarity (or other score) between the cropped in-file code and the candidate snippets.

**Baseline** In our baseline approach, three strategies are employed for the retrieval task: (1) Random Retrieval retrieves code snippets in a random manner, serving as a lower-bound benchmark against which we can compare the effectiveness of the other retrieval methods. To ensure the stability and reliability of our results, this random process is repeated 100 times and the outcomes are averaged. (2) Lexical Retrieval uses Jaccard Similarity and Edit Similarity to assess the relevance between the cropped code from the in-file context and the candidate code snippets. (3) Semantic Retrieval applies encoder models, including CodeBERT (Feng et al., 2020) based on BERT (Devlin et al., 2019), UniXcoder (Guo et al., 2022) based on UniLM (Dong et al., 2019) (we use the encoder-only mode), and InstructOR (Su et al., 2023), to obtain code embeddings. We also include some other encoder-decoder or decoder-only models, including CodeGPT (Lu et al., 2021), CodeT5+ (Wang et al., 2023) and CodeGen (Nijkamp et al., 2022). Cosine Similarity is employed to measure the semantic similarity between the cropped code and the candidate snippets. In the baseline, we crop m = 3 lines from the in-file code as specified in the general method (C[-m:]), i.e., the last three lines before the line to predict (refer to Appendix D for the ablation study on the number of lines kept). All computations for determining the similarity score are executed at the token level.

**Results and Analysis** Table 2 presents a detailed comparison of different retrieval strategies in RepoBench-R. We have the following observations:

1. **InstructOR outperforms other retrieval models, followed by UniXcoder:** InstructOR consistently outperforms other retrieval models across tasks, with UniXcoder achieving comparable results despite having only about 1/10 of InstructOR's parameters. This performance of UniXcoder can be partially attributed to its unique approach that includes multi-modal data representation learning and the utilization of both multi-modal contrastive learning (MCL) (Gao et al., 2021) and cross-modal generation (CMG) tasks.
2. **Jaccard similarity offers a competitive lexical retrieval alternative:** Within the lexical retrieval category, Jaccard similarity has shown to be competitive, offering a viable and lightweight alternative to semantic methods.
3. **Python code retrieval tasks yield higher accuracy than Java:** Tasks involving Python code retrieval tend to yield higher accuracy than those with Java, which we partially hypothesize is due to the common Python practice of defining function arguments in close proximity to their calls, thereby providing valuable contextual cues for retrieval. In contrast, Java's extensive use of class structures may introduce additional complexities into the retrieval process.
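The lexical (Jaccard) baseline reported in Table 2 below can be illustrated with a minimal sketch: crop the last m = 3 in-file lines, tokenize, and rank candidate snippets by Jaccard similarity. The tokenizer here is a naive identifier/number split and the helper names (`retrieve_topk`, `jaccard`) are ours, so this is an approximation of the baseline rather than the authors' exact implementation.

```python
import re

def tokenize(code: str) -> set:
    """Naive token set: identifiers and numbers; the paper's exact tokenizer may differ."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def retrieve_topk(in_file_code: str, candidates: list, m: int = 3, k: int = 3) -> list:
    """Rank candidate snippets by Jaccard similarity against the last m in-file lines."""
    query = tokenize("\n".join(in_file_code.splitlines()[-m:]))
    scored = [(jaccard(query, tokenize(snippet)), idx) for idx, snippet in enumerate(candidates)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy usage with two hypothetical candidate snippets.
context = "model = AutoModel.load(cfg)\ntokens = tokenizer(batch)\nlogits = model.forward(tokens)"
snippets = ["def forward(self, tokens):\n    return self.head(tokens)",
            "def save_checkpoint(path):\n    torch.save(state, path)"]
print(retrieve_topk(context, snippets, m=3, k=1))  # [0]
```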
Table 2: Baseline weighted average results of RepoBench-R on Python and Java retrieval tasks for the Easy and Hard subsets. The models used are codebert-base for CodeBERT, unixcoder-base for UniXcoder, CodeGPT-small-py and CodeGPT-small-java for CodeGPT in Python and Java respectively, codegen-350M-mono and codegen-350M-multi for CodeGen in Python and Java respectively, codet5p-220m for CodeT5+, and instructor-xl for InstructOR.

| Language | Retrieval | Model | Params. | Easy acc@1 | Easy acc@3 | Hard acc@1 | Hard acc@3 | Hard acc@5 |
|---|---|---|---|---|---|---|---|---|
| Python | Random | - | - | 15.66 | 46.96 | 6.43 | 19.31 | 32.12 |
| Python | Lexical | Jaccard | - | 21.97 | 53.75 | 10.47 | 25.93 | 40.01 |
| Python | Lexical | Edit | - | 18.69 | 50.98 | 7.83 | 21.77 | 36.49 |
| Python | Semantic | CodeGPT | 124M | 16.18 | 47.05 | 8.27 | 22.79 | 36.35 |
| Python | Semantic | UniXcoder | 125M | 27.09 | 60.42 | 18.48 | 39.69 | 54.00 |
| Python | Semantic | CodeBERT | 125M | 16.94 | 48.27 | 6.72 | 19.89 | 33.05 |
| Python | Semantic | CodeT5+ | 220M | 18.32 | 50.95 | 8.58 | 23.03 | 36.24 |
| Python | Semantic | CodeGen | 350M | 21.03 | 54.27 | 13.20 | 31.64 | 46.22 |
| Python | Semantic | InstructOR | 1.5B | 28.22 | 62.77 | 19.10 | 39.91 | 53.54 |
| Java | Random | - | - | 15.36 | 46.03 | 5.61 | 16.89 | 28.16 |
| Java | Lexical | Jaccard | - | 16.58 | 48.49 | 7.84 | 20.83 | 32.80 |
| Java | Lexical | Edit | - | 15.19 | 45.92 | 6.09 | 17.63 | 28.69 |
| Java | Semantic | CodeGPT | 124M | 16.46 | 48.46 | 7.87 | 22.97 | 37.53 |
| Java | Semantic | CodeBERT | 125M | 15.68 | 46.51 | 6.02 | 17.52 | 28.80 |
| Java | Semantic | UniXcoder | 125M | 19.61 | 52.96 | 12.23 | 28.74 | 41.88 |
| Java | Semantic | CodeT5+ | 220M | 16.12 | 47.67 | 6.46 | 18.50 | 30.50 |
| Java | Semantic | CodeGen | 350M | 20.09 | 52.60 | 11.09 | 29.32 | 44.22 |
| Java | Semantic | InstructOR | 1.5B | 19.94 | 53.91 | 13.07 | 31.28 | 44.52 |

### 4.2 REPOBENCH-C

The code completion task, RepoBench-C, aims to predict the next line of code based on a given in-file context ($C_{\mathrm{in}}$), consisting of import statements and the preceding lines before the target line, as well as a cross-file context ($C_x$), comprising snippets from other files parsed from import statements. This task commonly uses autoregressive language models trained on code for prediction. The formal expression of this task can be illustrated as follows:

$$P(Y \mid C_x, C_{\mathrm{in}}) = \prod_{i=1}^{|Y|} P\big(y_i \mid C_x, C_{\mathrm{in}}, y_{<i}\big)$$

where $Y = (y_1, \dots, y_{|Y|})$ denotes the tokens of the target next line.
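Under this formulation, a minimal sketch of next-line prediction with an off-the-shelf causal LM (via Hugging Face `generate`) and line-level scoring might look as follows. The checkpoint name is a placeholder (in practice one would use a code LLM such as StarCoder), the prompt strings are invented for illustration following the Figure 1 format, and `difflib`'s ratio is only a proxy for the Edit Similarity metric used in the paper; the helper names are ours.

```python
import difflib
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint; swap in a code LLM such as StarCoder

def predict_next_line(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding of the continuation, keeping only the first generated line."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    return completion.split("\n")[0]

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip() == gold.strip()

def edit_similarity(pred: str, gold: str) -> float:
    """Character-level similarity in [0, 1]; a rough proxy for the Edit Sim metric."""
    return difflib.SequenceMatcher(None, pred.strip(), gold.strip()).ratio()

if __name__ == "__main__":
    cross_file = "# path/to/other_file.py\n# def edit_similarity(a, b): ...\n"
    in_file = "# path/to/current_file.py\nfrom utils import edit_similarity\n\nscore = edit_"
    pred = predict_next_line(cross_file + in_file)
    gold = "similarity(prediction, reference)"
    print(pred, exact_match(pred, gold), round(edit_similarity(pred, gold), 2))
```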