# InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Yutong Wu1,2, Di Huang1, Wenxuan Shi1,2, Wei Wang3, Yewen Pu4, Lingzhe Gao3, Shihao Liu3, Ziyuan Nan1,2, Kaizhao Yuan1,2, Rui Zhang1, Xishan Zhang1, Zidong Du1, Qi Guo1, Dawei Yin3, Xing Hu1, Yunji Chen1,2*

1SKL of Processors, Institute of Computing Technology, CAS  2University of Chinese Academy of Sciences  3Baidu Inc., Beijing, China  4Autodesk Research
wuyutong22s@ict.ac.cn

Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its own instruction-tuning dataset. We make two observations: (1) a code snippet can serve as the response to different instructions; (2) instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions for the code responses in its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., Code Llama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000, and MultiPL-E), showing that it consistently improves the base models.

Code: https://github.com/wyt2000/InverseCoder
Extended version: https://arxiv.org/abs/2407.05700

## 1 Introduction

Code generation, which aims to produce code that satisfies the user's intent from inputs/outputs or natural language, has long been a significant challenge in computer science.
Recently, closed-source LLMs like GPT-3.5 and GPT-4 (OpenAI 2023) have enabled the generation of general-purpose code (like Python) from natural language, making them broadly applicable in programming assistance (Microsoft 2023), computer vision (Surís, Menon, and Vondrick 2023; Gupta and Kembhavi 2023), science (Nejjar et al. 2023), and embodied intelligence (Liang et al. 2023; Ma et al. 2023; Tang, Key, and Ellis 2024; Wang et al. 2023). To develop high-performance open-source models, researchers have leveraged these closed-source LLMs to generate datasets of instructions and code, then distilled these datasets into smaller, open-source code LLMs via instruction tuning (Luo et al. 2023; Wei et al. 2023; Yu et al. 2023; Song et al. 2024). For example, Code Alpaca (Chaudhary 2023) was fine-tuned on 20K instruction-code pairs generated by GPT-3.5 with SELF-INSTRUCT (Wang et al. 2022). Luo et al. (2023) used Evol-Instruct (Xu et al. 2023), a method that creates a diverse set of instruction data for code generation from GPT-3.5 via evolution heuristics. OSS-INSTRUCT (Wei et al. 2023) first creates coding problems from source code snippets, then queries strong LLMs for the corresponding solutions. Fine-tuned with 75K GPT-3.5 OSS-INSTRUCT data and 110K GPT-4 Evol-Instruct data (i.e., evol-codealpaca-v1) (theblackcat102 2023), the Magicoder-S series achieves state-of-the-art (SOTA) results among open-source code models. These approaches have one thing in common: they rely heavily on generating data by querying stronger closed-source LLMs (e.g., GPT-4), which incurs significant additional expense. It is therefore crucial to develop a self-improvement method for open-source models that does not rely on stronger guidance.

*Corresponding author.
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
This paper explores how to improve an instruction-tuned code LLM by querying itself (rather than querying a closed-source LLM). We make the following two observations:

1. A single code snippet can serve as a valid response to multiple instructions.
2. Instruction-tuned code LLMs perform better at translating code into instructions than translating instructions into code (see Section 3).

The first observation suggests that an instruction-tuned LLM can generate a new instruction for each response code in its training dataset, thereby expanding the original dataset. The second observation confirms that generating data in this direction (Code-to-NL) is more effective than NL-to-Code. We therefore develop Inverse-Instruct, a simple yet effective instruction tuning approach based on self-generating instructions from code snippets (Figure 1). Inverse-Instruct starts with an instruction-code corpus and a code LLM fine-tuned on it. We first clean and extract code snippets from the corpus, then let the code LLM translate these code snippets into new instructions. Next, we use the code LLM to evaluate and filter consistent instruction-code pairs from the newly generated data. Finally, the filtered dataset is combined with the original instruction dataset to fine-tune a new model.
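The three-stage loop just described can be sketched end to end. This is a minimal illustration rather than the paper's implementation: `clean`, `summarize`, and `yes_prob` are hypothetical callables standing in for the preprocessing filter and the two uses of the fine-tuned model M (summarization and self-evaluation).

```python
from typing import Callable, Optional


def inverse_instruct_round(
    dataset: list,                                  # (instruction, response) pairs
    clean: Callable[[str], Optional[str]],          # preprocessing: response -> code or None
    summarize: Callable[[str], str],                # code LLM M: code -> candidate instruction
    yes_prob: Callable[[str, str], float],          # code LLM M: (instruction, code) -> P("YES")
    k: int = 10,
) -> list:
    """One round of Inverse-Instruct: filter clean code snippets, generate k
    candidate instructions per snippet, keep the highest-scoring candidate,
    and return the original pairs plus the new (instruction, code) pairs."""
    new_pairs = []
    for _, response in dataset:
        code = clean(response)                      # stage 1: code preprocessing
        if code is None:
            continue                                # no clean snippet recoverable
        candidates = [summarize(code) for _ in range(k)]              # stage 2
        best = max(candidates, key=lambda ins: yes_prob(ins, code))   # stage 3
        new_pairs.append((best, code))
    return dataset + new_pairs
```

The augmented list that this returns plays the role of the combined dataset on which the base model is fine-tuned again.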
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Figure 1: The overview of Inverse-Instruct. Inverse-Instruct utilizes the model's own capability in code summarization to generate an inverse instruction dataset which can further enhance the model's performance. Inverse-Instruct consists of three steps: code preprocessing, code summarization, and self-evaluation & data selection.

The main differences between Inverse-Instruct and previous data generation methods are discussed in Section 2.2. Using Inverse-Instruct, we develop InverseCoder, a series of fine-tuned code LLMs that achieve SOTA results. We evaluated InverseCoder on a wide range of benchmarks (Section 6), including HumanEval(+) (Chen et al. 2021; Liu et al. 2023), MBPP(+) (Austin et al. 2021; Liu et al. 2023), MultiPL-E (Cassano et al. 2023), and DS-1000 (Lai et al. 2023). Results show that the InverseCoder series surpasses the base models by exploiting the base models' own capability. Specifically, InverseCoder-DS-6.7B achieves 76.8% on HumanEval+, 69.0% on MBPP+, 62.6% on MultiPL-E, and 44.2% on DS-1000, which are SOTA results across the four benchmarks among fully open-source (both model and dataset) models with only 6.7B parameters.
Our key contributions are: introducing Inverse-Instruct, an effective self-improvement instruction tuning approach for code LLMs; and presenting a series of code LLMs named InverseCoder, which achieve SOTA or comparable results on a wide range of benchmarks. The paper is organized as follows: Section 2 introduces related work. Section 3 presents the evidence for our observations. Sections 4 and 5 provide a detailed explanation of our approach (i.e., Inverse-Instruct). Section 6 presents the experiments for our models (i.e., InverseCoder). Section 7 concludes with a summary.

## 2 Related Work

### 2.1 LLMs for Code Generation

After being pre-trained on large amounts of code, LLMs have demonstrated impressive code generation capabilities, and AI code assistants have recently become one of the most important applications of LLMs. Technology companies such as OpenAI and Google have developed and released many closed-source large language models, including Codex (Chen et al. 2021), GPT-4 (OpenAI 2023), PaLM (Chowdhery et al. 2022), and Gemini (Team et al. 2023), which have achieved outstanding performance on code generation benchmarks. In addition to closed-source models, there are open-source models whose weights and training data are available to the public, such as CodeGen (Nijkamp et al. 2022), CodeGeeX (Zheng et al. 2023), AlphaCode (Li et al. 2022), the CodeT5 series (Wang et al. 2021), the StarCoder series (Li et al. 2023; Lozhkov et al. 2024), Code Llama (Rozière et al. 2023), DeepSeek-Coder (Guo et al. 2024), and CodeQwen (Bai et al. 2023). These open-source code models have shown notable advancements on code-related tasks, but a gap remains compared to the most advanced code LLMs.

### 2.2 Instruction-Tuned Code LLMs

Instruction tuning is a method for further enhancing the instruction-following capability of pre-trained LLMs. It has been widely applied to general-purpose LLMs including T5 (Raffel et al. 2020) and FLAN (Wei et al.
2021). For code LLMs, OctoPack (Muennighoff et al. 2023) and PIE (Shypula et al. 2024) extracted high-quality data from human-written instructions and code; fine-tuning with these data significantly enhanced the program generation capabilities of the base models. However, obtaining high-quality human-written instruction datasets is usually laborious. Researchers have therefore employed stronger closed-source LLMs to generate both new instructions and responses for instruction tuning. Specifically, Code Alpaca (Chaudhary 2023) sampled tasks from a seed task pool and prompted a stronger LLM to generate instruction-tuning data based on the seed tasks. WizardCoder (Luo et al. 2023) prompted a stronger LLM to generate more complex instructions and the corresponding responses. Magicoder (Wei et al. 2023) used a stronger LLM to create problems and code solutions based on open-source code snippets, as seed code snippets offer controllability and diversity to the generation. WaveCoder (Yu et al. 2023) used a stronger LLM to both generate and discriminate instruction-response pairs for different coding tasks (e.g., code summarization and code repair). AlchemistCoder (Song et al. 2024) employed a stronger LLM to add more detail to existing instructions.

| Generation Method | WC-CL-7B | WC-DS-6.7B |
|---|---|---|
| NL → Code | 62.4 | 70.2 |
| Code → NL → GPT-4 → Code | 74.3 | 79.0 |
| Code → NL → Humans → Code | 86.7 | 80.0 |

Table 1: Pass@1 (%) results on MBPP+ in the observation-checking experiments. The abbreviations WC-CL-7B and WC-DS-6.7B refer to the instruction-tuned models WizardCoder-GPT4-CL and WizardCoder-GPT4-DS. Line 1 evaluates NL-to-Code for instruction-tuned open models. Lines 2 and 3 evaluate Code-to-NL by leveraging GPT-4 and humans to convert the NL back into equivalent code, then assessing its correctness against the original code.

The main differences between our method and the aforementioned related works are:
- We focus on the self-improvement of open-source code models rather than relying on stronger guidance (such as human annotation or advanced LLMs like GPT-4).
- We generate new data by converting the code in existing datasets into instructions, rather than generating code from instructions.

## 3 Sanity Check: Code-to-NL vs. NL-to-Code

In this section, we validate through an experiment our observation that instruction-tuned code LLMs perform better at translating code into instructions (Code-to-NL) than translating instructions into code (NL-to-Code). We first select a manually written set of correctly matched NL-code pairs {(x, y)} with unit tests and prompt a fine-tuned code LLM to convert each x into new code y′ and each y into new NL x′ separately. We then use the following metrics to quantify the model's performance on the two tasks:

- For NL-to-Code, we use unit tests to evaluate the functional correctness of the generated code y′ against the original code y.
- For Code-to-NL, we convert the generated NL x′ into an equivalent code snippet ŷ using humans and a stronger code LLM, then measure the functional correctness of ŷ with unit tests.

Specifically, we use the problem-answer pairs with unit tests from MBPP+ (Liu et al. 2023), a basic Python generation benchmark, as the matched NL-code pairs {(x, y)}. For NL-to-Code, we took all 378 problems in the benchmark for evaluation. For Code-to-NL, we first selected 30 problems for humans to write code equivalent to the generated NL, and then employed GPT-4 to perform this task for all problems; we removed the problems for which GPT-4 was unable to produce executable code. We evaluate two instruction fine-tuned code LLMs (WizardCoder-GPT4-CL and WizardCoder-GPT4-DS, which are instruction-tuned on the 110K GPT-4 dataset evol-codealpaca-v1). The results are shown in Table 1. From the table, we conclude that Code → NL is better than NL → Code, showing that code LLMs perform better at code summarization than at code generation.
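The Code-to-NL arm of this protocol is a round-trip check, which can be expressed as a small harness. This is a sketch under stated assumptions: `summarize` stands in for the model under test, `regenerate` for the stronger converter (GPT-4 or a human), and each pair carries its own unit-test predicate in place of the MBPP+ harness.

```python
from typing import Callable


def code_to_nl_pass_rate(
    pairs: list,                         # (nl, code, unit_tests) triples
    summarize: Callable[[str], str],     # model under test: code -> NL
    regenerate: Callable[[str], str],    # stronger converter: NL -> code
) -> float:
    """Fraction of problems whose summary survives the round trip
    code -> x' -> y_hat, judged by each problem's own unit tests."""
    n_pass = sum(
        unit_tests(regenerate(summarize(code)))  # run tests on the regenerated code
        for _nl, code, unit_tests in pairs
    )
    return n_pass / len(pairs)
```

The NL-to-Code arm is the same loop with `regenerate(summarize(code))` replaced by a single model call on the original NL.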
## 4 Inverse-Instruct: Data Augmentation via Code Summarization

In this section, we introduce Inverse-Instruct, a data augmentation method that obtains more instruction data through the model's own capabilities. The overall illustration of Inverse-Instruct is shown in Figure 1. Inverse-Instruct is founded on two observations: (1) the same code can serve as a response to different instructions, which expands the dataset effectively; (2) converting formal language (i.e., code) into informal language (i.e., natural language) is generally more straightforward than the reverse.

The data generation process has three stages: (1) code preprocessing, (2) code summarization, and (3) self-evaluation and data selection. In code preprocessing, we filter clean code snippets {y′_i} from an off-the-shelf instruction tuning dataset {(x_i, y_i)} (e.g., evol-codealpaca-v1). Subsequently, in code summarization, we prompt an instruction fine-tuned code LLM M (e.g., WizardCoder-GPT4-CL) to convert the clean code snippets {y′_i} into multiple new instructions {x′_ij}. Then, in self-evaluation and data selection, we use the same code LLM M to select the best instruction x′_i from {x′_ij}. The selected instructions {x′_i} are combined with the original code snippets {y′_i} to construct a new instruction tuning dataset {(x′_i, y′_i)}. Finally, we fine-tune the base code LLM on {(x′_i, y′_i)} ∪ {(x_i, y_i)} to obtain a stronger code LLM (i.e., InverseCoder). The three steps are detailed below.

### 4.1 Code Preprocessing

The first step is to preprocess the existing code data to get clean code snippets {y′_i}. This is necessary because the Code-to-NL capability of code LLMs can only be fully utilized on clean code, whereas the response data {y_i} in the original dataset typically contain a lot of noise, such as natural language surrounding the code.
We select data with code snippets {y′_i} from the original responses {y_i} in two steps:

1. Filtering responses. We first collect responses that contain the code block marker (i.e., triple backticks), which indicates that there is a code snippet in the response. The remaining data might contain clean code without any code markers, so we additionally collect responses that pass a syntax check.
2. Extracting code. After filtering responses with code snippets, we remove the natural language surrounding the code to make it easier for the model to summarize. If a response contains multiple pieces of code, we keep only the first, since the following parts are usually test cases or usage examples.

At the end of code preprocessing, we obtain clean code snippets {y′_i} for summarization.

### 4.2 Code Summarization

After filtering, we employ the code LLM M to generate a number of corresponding instructions {x′_ij} for each code snippet in {y′_i} by summarizing its functionality. During summarization, we randomly choose different instruction prefixes for the prompt to enhance the diversity of the instructions. In this way, we obtain new pairs of natural language and code {(x′_ij, y′_i)}.

### 4.3 Self-Evaluation and Data Selection

We noticed that the code LLM M might make mistakes during code summarization. Therefore, we use M itself to evaluate {(x′_ij, y′_i)} and select the most appropriate instruction. Data selection is typically performed by powerful LLMs such as GPT-4, because these models possess excellent instruction-following capabilities that enable them to understand complex filtering rules (Wang et al. 2024). However, the instruction-following capabilities of open code LLMs are often weaker, making effective selection difficult (see the comparison experiments in Section 6.5). Inspired by AutoMathText (Zhang et al.
2024), we use the pseudo-probability of the YES token given by the code LLM M as an indicator of instruction quality, rather than a score in textual format. Specifically, we concatenate the generated instructions {x′_ij} and the original code snippets {y′_i} as problem-answer pairs {(x′_ij, y′_i)}. Then, we ask M to evaluate the correctness of each answer for the given problem and compute the pseudo-probability of YES from the logits of the first token produced by M. The pseudo-probability is computed as follows (Zhang et al. 2024):

$$\text{LM-Score}(x'_{ij}, y'_i) = \frac{\exp(\mathrm{logit}(\text{YES}))}{\exp(\mathrm{logit}(\text{YES})) + \exp(\mathrm{logit}(\text{NO}))}$$

After evaluation, we select the instruction with the highest score, x′_i, for each response in {y′_i} to obtain a new training dataset {(x′_i, y′_i)}.

## 5 Implementation Details

**The original instruction tuning dataset.** In this work, we mainly use evol-codealpaca-v1 as our original instruction tuning dataset {(x_i, y_i)}; it is widely used for instruction tuning of code LLMs (Wei et al. 2023; Yu et al. 2023; Song et al. 2024) and contains 111,183 instruction-response pairs generated by Evol-Instruct using GPT-4. Following Magicoder (Wei et al. 2023), evol-codealpaca-v1 is decontaminated by removing data that contain docstrings or solutions from HumanEval (Chen et al. 2021), MBPP (Austin et al. 2021), MultiPL-E (Cassano et al. 2023), and DS-1000 (Lai et al. 2023), which are used to evaluate InverseCoder. We apply the same decontamination to the newly generated instruction data {(x′_i, y′_i)}.

**Training for the original code LLM.** We take Code Llama-Python-13B, Code Llama-Python-7B (Rozière et al. 2023), and DeepSeek-Coder-Base-6.7B (Guo et al. 2024) as base models. To obtain the starting code LLM M (hereinafter WizardCoder-GPT4), we fine-tune the base models on evol-codealpaca-v1 for 2 epochs using 8 NVIDIA A100-40GB SXM GPUs. We set the initial learning rate to 5e-5 with 15 warmup steps and a linear learning rate scheduler.
We use Adafactor (Shazeer and Stern 2018) as the optimizer, with a batch size of 512 and a sequence truncation length of 1024.

**Instruction data collection.** We use the vLLM inference framework (Kwon et al. 2023) for code summarization and instruction selection on the same GPUs as training. We generate 10 instructions {x′_ij}, j = 1, ..., 10, for each code snippet in the code summarization stage. For each instruction-response pair, self-evaluation and data selection are conducted by prompting the starting code LLM M with greedy decoding. We choose the instruction with the highest pseudo-probability of YES as the best generated instruction for each response.

**Training for InverseCoder.** Following Magicoder-S (Wei et al. 2023), we first fine-tune the base models on the new dataset {(x′_i, y′_i)} of 90,363 instruction-response pairs (generated by the original code LLM M) for 1 epoch, then continue fine-tuning on the original dataset {(x_i, y_i)} (generated by GPT-4) for 2 epochs to obtain InverseCoder. The hyperparameters are the same as in the training of the original code LLM M. The instruction tuning prompt is aligned with Magicoder-S.

## 6 Experiments

We conduct a series of experiments to investigate the following topics:

1. InverseCoder's performance on benchmarks (Sec. 6.1).
2. Impact of each stage in Inverse-Instruct (Sec. 6.2).
3. Impact of dataset size scaling (Sec. 6.3).
4. Is Inverse-Instruct effective on other datasets (Sec. 6.4)?
5. Comparison with other data selection methods (Sec. 6.5).
6. Does selecting multiple self-generated instructions for each response lead to further improvement (Sec. 6.6)?
7. Can Inverse-Instruct be repeatedly applied to InverseCoder to achieve multi-round optimization (Sec. 6.7)?
8. Can Inverse-Instruct be further optimized by using additional self-generated code as responses (Sec. 6.8)?
### 6.1 Main Results

We train InverseCoder on three base models with different parameter sizes and evaluate them on four benchmarks widely used for code LLMs, covering Python text-to-code generation, multilingual coding, and data-science code generation. The results show that the performance of SOTA code LLMs can be further improved by Inverse-Instruct.

| Model | Common Data | Specific Data |
|---|---|---|
| WizardCoder-GPT-4 | evol-codealpaca-v1 | 0K (baseline) |
| Magicoder-S | evol-codealpaca-v1 | 75K GPT-3.5 |
| WaveCoder-Ultra | evol-codealpaca-v1 | 20K GPT-4 |
| AlchemistCoder | evol-codealpaca-v1 | > 80K GPT-3.5 |
| InverseCoder (ours) | evol-codealpaca-v1 | 90K self-generated |

Table 2: Training data size of different instruction-tuned code LLMs. Note that only InverseCoder is trained on self-generated data, which is easier to obtain at lower cost.

**Baselines.** We compare the performance of our models with a wide range of baselines, including:

1. Base models: the three base models mentioned in Section 5. We compare InverseCoder with them to show the absolute improvement of the whole instruction-tuning process.
2. WizardCoder-GPT4: the starting code LLMs in our data generation process, trained only on the original instruction-tuning dataset (i.e., evol-codealpaca-v1). We compared InverseCoder
with them to show the improvement brought by Inverse-Instruct.

3. Other open-source instruction-tuned code LLMs: instruction-tuned code models from related works, including Magicoder-S (Wei et al. 2023), WaveCoder-Ultra-DS (Yu et al. 2023), and AlchemistCoder (Song et al. 2024). They are trained on additional data generated by stronger closed-source LLMs (e.g., GPT-3.5) in addition to evol-codealpaca-v1. The comparison of training data sizes is shown in Table 2. The actual data consumption of InverseCoder should mainly be measured by the scale of the original training dataset (110K), since the cost of self-generating data is much lower than that of generating data by querying closed-source LLMs (Irugalbandara et al. 2023).
4. Closed-source LLMs: GPT-3.5 (OpenAI 2022) and GPT-4 (OpenAI 2023), to show the gap between InverseCoder and the most advanced closed-source LLMs.

| Model | HumanEval (+) | MBPP (+) |
|---|---|---|
| *(Closed-source models)* | | |
| GPT-4-Turbo (April 2024) | 90.2 (86.6) | 85.7 (73.3) |
| GPT-3.5-Turbo (Nov 2023) | 76.8 (70.7) | 82.5 (69.7) |
| *(Based on Code Llama-Python-13B)* | | |
| Code Llama-Python-13B | 42.7 (38.4) | 63.5 (52.6) |
| WizardCoder-GPT4-CL-13B | 76.8 (70.7) | 73.5 (62.2) |
| InverseCoder-CL-13B (ours) | 79.9 (74.4) | 74.6 (63.0) |
| *(Based on Code Llama-Python-7B)* | | |
| Code Llama-Python-7B | 37.8 (35.4) | 59.5 (46.8) |
| Magicoder-S-CL-7B | 70.7 (67.7) | 70.6 (60.1) |
| AlchemistCoder-CL-7B | 74.4 (68.3) | 68.5 (55.1) |
| WizardCoder-GPT4-CL-7B | 72.6 (68.9) | 69.3 (59.3) |
| InverseCoder-CL-7B (ours) | 76.2 (72.0) | 70.6 (60.1) |
| *(Based on DeepSeek-Coder-6.7B)* | | |
| DeepSeek-Coder-6.7B | 47.6 (39.6) | 72.0 (58.7) |
| Magicoder-S-DS-6.7B | 76.8 (71.3) | 79.4 (69.0) |
| WaveCoder-Ultra-DS-6.7B | 75.0 (69.5) | 74.9 (63.5) |
| AlchemistCoder-DS-6.7B | 79.9 (75.6) | 77.0 (60.2) |
| WizardCoder-GPT4-DS-6.7B | 77.4 (73.2) | 77.8 (67.5) |
| InverseCoder-DS-6.7B (ours) | 79.9 (76.8) | 78.6 (69.0) |

Table 3: Pass@1 (%) results of different LLMs on HumanEval (+) and MBPP (+), computed with greedy decoding. The abbreviations CL and DS refer to the base models Code Llama-Python and DeepSeek-Coder, respectively. Other results are reported consistently from the EvalPlus (Liu et al. 2023) leaderboard (August 2024) and the Magicoder (Wei et al. 2023) paper.

**Inverse-Instruct improves general Python code generation capabilities.** We use HumanEval(+) and MBPP(+) (Liu et al. 2023), the enhanced versions of two Python code generation benchmarks (Chen et al. 2021; Austin et al. 2021), to evaluate the text-to-code capability of InverseCoder. Each benchmark provides a set of tasks with natural language descriptions as prompts for the code LLM to generate function-level code, which is then validated against pre-prepared test cases. We use the pass@1 (Chen et al. 2021) score to compare code generation capability across models. The results in Table 3 demonstrate that InverseCoder achieves a significant improvement over WizardCoder-GPT4 in Python code generation.

| Model | Java | JS | C++ | PHP | Swift | Rust | Avg. |
|---|---|---|---|---|---|---|---|
| *(Based on Code Llama-Python-13B)* | | | | | | | |
| WizardCoder-GPT4* | 55.4 | 64.2 | 55.9 | 52.0 | 49.9 | 53.4 | 55.1 |
| InverseCoder (ours)* | 54.5 | 65.4 | 58.1 | 55.3 | 52.5 | 55.6 | 56.9 |
| *(Based on Code Llama-Python-7B)* | | | | | | | |
| Code Llama-Python | 29.1 | 35.7 | 30.2 | 29.0 | 27.1 | 27.0 | 29.7 |
| Magicoder-S* | 49.8 | 62.6 | 50.2 | 53.3 | 44.9 | 43.8 | 50.8 |
| WizardCoder-GPT4* | 50.4 | 60.7 | 50.6 | 51.6 | 45.6 | 48.2 | 51.2 |
| InverseCoder (ours)* | 48.7 | 61.9 | 52.6 | 55.2 | 53.0 | 46.1 | 52.9 |
| *(Based on DeepSeek-Coder-6.7B)* | | | | | | | |
| Magicoder-S* | 59.6 | 69.8 | 70.0 | 64.4 | 54.4 | 53.6 | 62.0 |
| WizardCoder-GPT4* | 61.4 | 66.4 | 68.7 | 61.8 | 52.6 | 56.1 | 61.2 |
| InverseCoder (ours)* | 60.7 | 70.1 | 70.5 | 63.6 | 53.0 | 57.4 | 62.6 |

Table 4: Pass@1 (%) results of different LLMs on MultiPL-E. Models marked with (*) are evaluated with the same prompt format as in training and the same hyperparameters as Magicoder. Other results are reported consistently from the Magicoder paper.

**The improvement of Inverse-Instruct is reflected across multiple programming languages.**
Besides Python, we evaluate InverseCoder's code generation capabilities in six other mainstream programming languages on the MultiPL-E benchmark (Cassano et al. 2023). Table 4 shows the performance of InverseCoder and other models on MultiPL-E. The results reveal that InverseCoder's ability to generate code in different programming languages is improved over WizardCoder-GPT4.

**Inverse-Instruct also leads to improvements on data science code generation tasks.** To show InverseCoder's capability on complex programming problems from realistic applications, we evaluate it on the DS-1000 benchmark (Lai et al. 2023), which comprises 1000 different data science workflows across seven libraries. Following Wei et al. (2023), we evaluate our models only in completion mode. The results in Table 5 show that the average performance of InverseCoder-CL-13B and InverseCoder-DS-6.7B on data science code generation is enhanced, which implies that Inverse-Instruct can improve the code generation capability of the original model on realistic tasks beyond basic programming problems.

| Model | plt. | np. | pd. | torch | scipy | sklearn | tf. | All |
|---|---|---|---|---|---|---|---|---|
| *(Based on Code Llama-Python-13B)* | | | | | | | | |
| WizardCoder-GPT4 | 56.1 | 52.2 | 30.3 | 43.0 | 25.2 | 49.5 | 40.0 | 42.1 |
| InverseCoder (ours) | 53.0 | 54.3 | 32.1 | 50.9 | 22.5 | 50.5 | 43.8 | 43.1 |
| *(Based on Code Llama-Python-7B)* | | | | | | | | |
| Code Llama-Python | 55.3 | 34.5 | 16.4 | 19.9 | 22.3 | 17.6 | 28.5 | 28.0 |
| WizardCoder | 53.5 | 34.4 | 15.2 | 25.7 | 21.0 | 24.5 | 28.9 | 28.4 |
| Magicoder-S | 55.9 | 40.6 | 28.4 | 40.4 | 28.8 | 35.8 | 37.6 | 37.5 |
| WizardCoder-GPT4 | 51.5 | 46.9 | 29.9 | 43.6 | 34.9 | 41.9 | 39.0 | 40.2 |
| InverseCoder (ours) | 54.2 | 48.6 | 27.4 | 38.0 | 34.0 | 41.9 | 40.3 | 39.9 |
| *(Based on DeepSeek-Coder-6.7B)* | | | | | | | | |
| Magicoder-S | 54.8 | 48.9 | 30.0 | 49.2 | 27.3 | 44.7 | 41.2 | 41.2 |
| WizardCoder-GPT4 | 53.8 | 53.9 | 28.0 | 49.3 | 30.4 | 45.7 | 44.4 | 42.2 |
| InverseCoder (ours) | 55.5 | 53.9 | 32.3 | 56.7 | 30.0 | 50.3 | 33.9 | 44.2 |

Table 5: Pass@1 (%) results on DS-1000, covering seven data science libraries: Matplotlib (plt.), Numpy (np.), Pandas (pd.), Pytorch, Scipy, Sklearn, and Tensorflow (tf.). We evaluate our models with the same prompt and hyperparameters as Magicoder. Other results are reported from the Magicoder paper.

### 6.2 Ablation Study

We conduct a series of ablation experiments to analyze the utility of the code summarization and data selection steps in our method. We use Code Llama-Python-7B as the base model with the same training settings as InverseCoder, and present the results in Table 6.

| Method | HumanEval (+) | MBPP (+) |
|---|---|---|
| Gen. + Eval. | 70.7 (67.1) | 70.9 (60.1) |
| Pre. | 72.6 (68.9) | 69.8 (59.8) |
| Pre. + Sum. | 75.6 (71.3) | 68.0 (58.2) |
| Pre. + Sum. + Eval. (ours) | 76.2 (72.0) | 70.6 (60.1) |

Table 6: Pass@1 (%) results on HumanEval+ and MBPP+ in the ablation studies. Preprocessing (Pre.), Summarization (Sum.), and Evaluation (Eval.) correspond to the three steps of our method. Generation (Gen.) denotes regenerating responses for each instruction.

Figure 2: Impact of data scaling. The dashed line represents HumanEval and the solid line represents HumanEval+. The legend entries Original and Ours represent the original models and the models improved by Inverse-Instruct.

| Model | HumanEval (+) | MBPP (+) |
|---|---|---|
| Magicoder-DS | 66.5 (60.4) | 75.4 (61.9) |
| InverseCoder-DS-OSS | 69.5 (64.0) | 77.0 (66.1) |

Table 7: Performance improvement of Inverse-Instruct when applied to Magicoder-OSS-Instruct-75K.

The ablation experiments cover three aspects:

**Inverse-Instruct outperforms the NL-to-Code data generation method (Gen. + Eval.).** We regenerate 10 responses {y_ij}, j = 1, ..., 10, for each instruction x_i in the original training dataset and apply the same self-evaluation method to select the best responses. The code summarization step provides overall better performance than generating responses from instructions.

**Performance improvement comes not only from the preprocessing step (Pre.).** We apply only preprocessing to the responses in the original dataset {(x_i, y_i)} to obtain a cleaned dataset {(x_i, y′_i)}.
We train models on the cleaned dataset and on the original one to show that the improvement from preprocessing alone is minor.

**The self-evaluation and data selection step also plays a role in Inverse-Instruct (Pre. + Sum.).** To study the role of self-evaluation and data selection, we generate only one instruction for each response in the code summarization step, without any selection. The results show that self-evaluation and selection also contribute to the performance improvement.

### 6.3 Data Scaling

**Inverse-Instruct is effective across different data scales.** We conduct a series of experiments to explore the data scaling behavior of Inverse-Instruct. Specifically, we randomly select 25K, 50K, and 75K instruction-response pairs from the original dataset and train three weaker original models on them. We then apply Inverse-Instruct to these original models. The performance of the models is improved by Inverse-Instruct at all data scales (Figure 2).

### 6.4 Impact of the Original Dataset

**Inverse-Instruct is effective across different original datasets.** We apply Inverse-Instruct to Magicoder-OSS-Instruct-75K (Wei et al. 2023), a smaller dataset generated by GPT-3.5. The results (Table 7) show that performance still improves even with a smaller, lower-quality original dataset, demonstrating the robustness of Inverse-Instruct.

| Data-Selection Method | HumanEval (+) |
|---|---|
| Random Selection | 72.6 (68.3) |
| Textual Score | 73.8 (69.5) |
| Lowest Perplexity | 70.1 (67.7) |
| Highest Perplexity | 70.7 (67.7) |
| YES Pseudo-probability (ours) | 76.2 (72.0) |

Table 8: Comparison of our data selection method with alternatives (for CL-7B).

| Selected Instructions | HumanEval (+) | MBPP (+) |
|---|---|---|
| Top-1 (ours) | 76.2 (72.0) | 70.6 (60.1) |
| Top-3 | 70.1 (67.1) | 68.0 (58.5) |
| Top-5 | 70.1 (65.2) | 61.9 (53.4) |

Table 9: Performance comparison of models (CL-7B) trained with different numbers of selected instructions. Top-k means that for each response we select the instructions with the top-k highest pseudo-probabilities.
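The YES-pseudo-probability criterion used throughout these experiments (Section 4.3) reduces to a two-way softmax over the model's first-token logits. A minimal sketch, with the `logits` mapping standing in for the first-token logits returned by the model M:

```python
def lm_score(logits: dict) -> float:
    """LM-Score from Section 4.3: pseudo-probability of "YES" as a two-way
    softmax over the first-token logits for "YES" and "NO"."""
    yes, no = logits["YES"], logits["NO"]
    m = max(yes, no)  # subtract the max before exponentiating, for stability
    e_yes = 2.718281828459045 ** (yes - m)
    e_no = 2.718281828459045 ** (no - m)
    return e_yes / (e_yes + e_no)
```

For each response, the candidate instruction whose pair scores highest under `lm_score` is the one kept in the augmented dataset.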
### 6.5 Alternative Data Selection Methods

**Our data selection method outperforms the alternatives.** We compare our data selection method, based on the pseudo-probability of YES, with three alternatives:

1. Randomly selecting one instruction from all synthetic candidates for each response.
2. Using a textual score (1-5) provided by the LLM itself as the indicator; if no textual score is given, a default score of 3 is assigned.
3. Using the sentence perplexity of the response code under different instructions as the indicator, selecting the data with the highest and the lowest perplexity, respectively.

The results in Table 8 demonstrate the effectiveness of the pseudo-probability method.

### 6.6 Selecting Multiple Self-Generated Instructions

**Selecting multiple self-generated instructions for a single response harms the model's performance.** We select the top-k scoring instructions for each response. The results in Table 9 indicate that the model's performance declines as the number of selected instructions increases. This suggests that open-source code LLMs are not capable of generating a large number of correct instructions, which is why we select only the best instruction in our method.

| Model | HumanEval (+) | MBPP (+) |
|---|---|---|
| InverseCoder-CL-7B | 76.2 (72.0) | 70.6 (60.1) |
| InverseCoder-CL-7B-V2 | 75.0 (70.1) | 70.6 (60.6) |

Table 10: Performance difference when applying Inverse-Instruct to InverseCoder again. V2 denotes models trained with the data generated by InverseCoder.

### 6.7 Multi-Round Optimization for InverseCoder

**Repeatedly applying Inverse-Instruct to InverseCoder does not significantly improve performance.** We replace the original model with InverseCoder in the pipeline of
Inverse-Instruct and train a new model with the data generated by InverseCoder. The performance results (Table 10) show no significant improvement, which is consistent with the phenomenon of model collapse caused by repeated training on self-generated data (Shumailov et al. 2024).

6.8 Training with Additional Self-Generated Code

Performance cannot be steadily improved when the model is trained with both self-generated instructions and code. We conduct the following two experiments to examine whether training with code generated by the original model provides additional benefits:

1. Code → NL → Code: regenerating new response code for the new instructions obtained by Inverse-Instruct.
2. Code → Code → NL: prompting the original model to generate more complex code and applying Inverse-Instruct to the new code.

The results are shown in Table 11. The unstable performance reveals issues with the quality of the code self-generated by the original models.

Data-Generation Method    HumanEval (+)    MBPP (+)
Code → NL (ours)          76.2 (72.0)      70.6 (60.1)
Code → NL → Code          73.2 (68.9)      67.7 (57.7)
Code → Code → NL          73.2 (68.3)      70.9 (62.2)

Table 11: Comparison of Inverse-Instruct with alternative data generation methods that prompt the original model to generate additional code (for CL-7B).

7 Conclusion

In conclusion, this paper presents a novel approach to enhancing the capabilities of open-source code LLMs by leveraging self-generated data for instruction tuning, rather than relying solely on data from powerful closed-source LLMs like GPT-3.5 and GPT-4. Our proposed method, named Inverse-Instruct, capitalizes on the inherent asymmetry in translating between formal and informal languages. By reversing the conventional process, Inverse-Instruct generates additional natural language instructions from code snippets via summarization and self-evaluation techniques.
The effectiveness of this methodology is demonstrated through the development of InverseCoder, a new series of code LLMs that not only outperform their predecessors on traditional benchmarks but also show significant improvements across diverse coding tasks.

Acknowledgements

We thank Lei Qi for helping us analyze data and convert NL to code in the sanity check experiments (Section 3) during the rebuttal. This work is partially supported by the National Key R&D Program of China (under Grant 2022YFB4501600), the NSF of China (under Grants 61925208, U22A2028, 6240073476, 62222214, 62341411, 62102398, 62102399, 62302478, 62302482, 62302483, 62302480, 62302481, 62402477), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Nos. XDB0660200, XDB0660201, XDB0660202), the CAS Project for Young Scientists in Basic Research (YSBR-029), the Youth Innovation Promotion Association CAS, and the Xplore Prize.

References

Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; Hui, B.; Ji, L.; Li, M.; Lin, J.; Lin, R.; Liu, D.; Liu, G.; Lu, C.; Lu, K.; Ma, J.; Men, R.; Ren, X.; Ren, X.; Tan, C.; Tan, S.; Tu, J.; Wang, P.; Wang, S.; Wang, W.; Wu, S.; Xu, B.; Xu, J.; Yang, A.; Yang, H.; Yang, J.; Yang, S.; Yao, Y.; Yu, B.; Yuan, H.; Yuan, Z.; Zhang, J.; Zhang, X.; Zhang, Y.; Zhang, Z.; Zhou, C.; Zhou, J.; Zhou, X.; and Zhu, T. 2023. Qwen Technical Report. arXiv preprint arXiv:2309.16609.

Cassano, F.; Gouwar, J.; Nguyen, D.; Nguyen, S. D.; Phipps-Costin, L.; Pinckney, D.; Yee, M.-H.; Zi, Y.; Anderson, C. J.; Feldman, M. Q.; Guha, A.; Greenberg, M.; and Jangda, A. 2023. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering, 49: 3675–3691.

Chaudhary, S. 2023.
Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H. P. d. O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; Schuh, P.; Shi, K.; Tsvyashchenko, S.; Maynez, J.; Rao, A.; Barnes, P.; Tay, Y.; Shazeer, N. M.; Prabhakaran, V.; Reif, E.; Du, N.; Hutchinson, B. C.; Pope, R.; Bradbury, J.; Austin, J.; Isard, M.; Gur-Ari, G.; Yin, P.; Duke, T.; Levskaya, A.; Ghemawat, S.; Dev, S.; Michalewski, H.; García, X.; Misra, V.; Robinson, K.; Fedus, L.; Zhou, D.; Ippolito, D.; Luan, D.; Lim, H.; Zoph, B.; Spiridonov, A.; Sepassi, R.; Dohan, D.; Agrawal, S.; Omernick, M.; Dai, A. M.; Pillai, T. S.; Pellat, M.; Lewkowycz, A.; Moreira, E.; Child, R.; Polozov, O.; Lee, K.; Zhou, Z.; Wang, X.; Saeta, B.; Díaz, M.; Firat, O.; Catasta, M.; Wei, J.; Meier-Hellstern, K. S.; Eck, D.; Dean, J.; Petrov, S.; and Fiedel, N. 2022. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res., 24: 240:1–240:113.

Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y. K.; Luo, F.; Xiong, Y.; and Liang, W. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. arXiv, abs/2401.14196.

Gupta, T.; and Kembhavi, A. 2023. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14953–14962.

Irugalbandara, C.; Mahendra, A.; Daynauth, R.; Arachchige, T. K.; Flautner, K.; Tang, L.; Kang, Y.; and Mars, J. 2023. Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI's GPT-4 with Self-Hosted Open Source SLMs in Production. arXiv, abs/2312.14972.
Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C. H.; Gonzalez, J. E.; Zhang, H.; and Stoica, I. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles.

Lai, Y.; Li, C.; Wang, Y.; Zhang, T.; Zhong, R.; Zettlemoyer, L.; Yih, W.-t.; Fried, D.; Wang, S.; and Yu, T. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, 18319–18345. PMLR.

Li, R.; Allal, L. B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; Liu, Q.; Zheltonozhskii, E.; Zhuo, T. Y.; Wang, T.; Dehaene, O.; Davaadorj, M.; Lamy-Poirier, J.; Monteiro, J.; Shliazhko, O.; Gontier, N.; Meade, N.; Zebaze, A.; Yee, M.-H.; Umapathi, L. K.; Zhu, J.; Lipkin, B.; Oblokulov, M.; Wang, Z.; Murthy, R.; Stillerman, J.; Patel, S. S.; Abulkhanov, D.; Zocca, M.; Dey, M.; Zhang, Z.; Fahmy, N.; Bhattacharyya, U.; Yu, W.; Singh, S.; Luccioni, S.; Villegas, P.; Kunakov, M.; Zhdanov, F.; Romero, M.; Lee, T.; Timor, N.; Ding, J.; Schlesinger, C.; Schoelkopf, H.; Ebert, J.; Dao, T.; Mishra, M.; Gu, A.; Robinson, J.; Anderson, C. J.; Dolan-Gavitt, B.; Contractor, D.; Reddy, S.; Fried, D.; Bahdanau, D.; Jernite, Y.; Ferrandis, C. M.; Hughes, S. M.; Wolf, T.; Guha, A.; von Werra, L.; and de Vries, H. 2023. StarCoder: may the source be with you! arXiv, abs/2305.06161.

Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Lago, A. D.; Hubert, T.; Choy, P.; de Masson d'Autume, C.; Babuschkin, I.; Chen, X.; Huang, P.-S.; Welbl, J.; Gowal, S.; Cherepanov, A.; Molloy, J.; Mankowitz, D. J.; Robson, E. S.; Kohli, P.; de Freitas, N.; Kavukcuoglu, K.; and Vinyals, O. 2022. Competition-level code generation with AlphaCode. Science, 378: 1092–1097.

Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; and Zeng, A. 2023.
Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 9493–9500. IEEE.

Liu, J.; Xia, C.; Wang, Y.; and Zhang, L. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv, abs/2305.01210.

Lozhkov, A.; Li, R.; Allal, L. B.; Cassano, F.; Lamy-Poirier, J.; Tazi, N.; Tang, A.; Pykhtar, D.; Liu, J.; Wei, Y.; et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv preprint arXiv:2402.19173.

Luo, Z.; Xu, C.; Zhao, P.; Sun, Q.; Geng, X.; Hu, W.; Tao, C.; Ma, J.; Lin, Q.; and Jiang, D. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568.

Ma, Y. J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931.

Microsoft. 2023. GitHub Copilot: Your AI pair programmer. https://github.com/features/copilot. Accessed: 2024-12-22.

Muennighoff, N.; Liu, Q.; Liu, Q.; Zebaze, A.; Zheng, Q.; Hui, B.; Zhuo, T. Y.; Singh, S.; Tang, X.; von Werra, L.; and Longpre, S. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv, abs/2308.07124.

Nejjar, M.; Zacharias, L.; Stiehle, F.; and Weber, I. 2023. LLMs for Science: Usage for Code Generation and Data Analysis. arXiv preprint arXiv:2311.16733.

Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; and Xiong, C. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue.

OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774.

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020.
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140): 1–67.

Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; Kozhevnikov, A.; Evtimov, I.; Bitton, J.; Bhatt, M. P.; Ferrer, C. C.; Grattafiori, A.; Xiong, W.; Défossez, A.; Copet, J.; Azhar, F.; Touvron, H.; Martin, L.; Usunier, N.; Scialom, T.; and Synnaeve, G. 2023. Code Llama: Open Foundation Models for Code. arXiv, abs/2308.12950.

Shazeer, N. M.; and Stern, M. 2018. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. arXiv, abs/1804.04235.

Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R. J.; and Gal, Y. 2024. AI models collapse when trained on recursively generated data. Nature, 631(8022): 755–759.

Shypula, A.; Madaan, A.; Zeng, Y.; Alon, U.; Gardner, J. R.; Yang, Y.; Hashemi, M.; Neubig, G.; Ranganathan, P.; Bastani, O.; and Yazdanbakhsh, A. 2024. Learning Performance-Improving Code Edits. In The Twelfth International Conference on Learning Representations.

Song, Z.; Wang, Y.; Zhang, W.; Liu, K.; Lyu, C.; Song, D.; Guo, Q.; Yan, H.; Lin, D.; Chen, K.; and Zhao, C. 2024. AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems.

Surís, D.; Menon, S.; and Vondrick, C. 2023. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11888–11898.

Tang, H.; Key, D.; and Ellis, K. 2024. WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment. arXiv preprint arXiv:2402.12275.

Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
theblackcat102. 2023. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1. Accessed: 2024-12-22.

Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.

Wang, J.; Zhang, B.; Du, Q.; Zhang, J.; and Chu, D. 2024. A Survey on Data Selection for LLM Instruction Tuning. arXiv, abs/2402.05123.

Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.

Wang, Y.; Wang, W.; Joty, S.; and Hoi, S. C. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP.

Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Wei, Y.; Wang, Z.; Liu, J.; Ding, Y.; and Zhang, L. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120.

Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; and Jiang, D. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

Yu, Z.; Zhang, X.; Shang, N.; Huang, Y.; Xu, C.; Zhao, Y.; Hu, W.; and Yin, Q. 2023. WaveCoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187.

Zhang, Y.; Luo, Y.; Yuan, Y.; and Yao, A. C.-C. 2024. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. arXiv preprint arXiv:2402.07625.

Zheng, Q.; Xia, X.; Zou, X.; Dong, Y.; Wang, S.; Xue, Y.; Wang, Z.; Shen, L.; Wang, A.; Li, Y.; Su, T.; Yang, Z.; and Tang, J. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X.
arXiv, abs/2303.17568.