Published as a conference paper at ICLR 2025

MCEVAL: MASSIVELY MULTILINGUAL CODE EVALUATION

Linzheng Chai1, Shukai Liu1*, Jian Yang1*, Yuwei Yin2, Ke Jin1, Jiaheng Liu1, Tao Sun1, Ge Zhang3, Changyu Ren1, Hongcheng Guo1, Zekun Wang1, Boyang Wang1, Xianjie Wu1, Bing Wang1, Tongliang Li4, Liqun Yang1, Sufeng Duan5, Zhoujun Li1

1CCSE, Beihang University, 2University of British Columbia, 3University of Waterloo, 4Beijing Information Science and Technology University, 5Shanghai Jiao Tong University

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, comprising a selection of code challenges and corresponding test cases, serve as a standard for evaluating the capability of different LLMs in such tasks. However, most existing benchmarks focus primarily on Python and remain restricted to a limited number of languages, with the other languages translated from the Python samples, which degrades data diversity. To further facilitate research on code LLMs, we propose a massively multilingual code benchmark covering 40 programming languages (MCEVAL) with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation evaluation tasks, along with a finely curated massively multilingual instruction corpus, MCEVAL-INSTRUCT. In addition, we introduce an effective multilingual coder, MCODER, trained on MCEVAL-INSTRUCT to support multilingual program generation. Extensive experimental results on MCEVAL show that a considerable gap remains between open-source models and closed-source LLMs across numerous languages. The instruction corpora and evaluation benchmark are available at https://github.com/MCEVAL/McEval.
1 INTRODUCTION

*Equal contribution. Corresponding Author.

[Figure 1: MCEVAL comprises three tasks, illustrated with a Python find_median function: code generation (given the problem "Find the median number given a list", produce the function), code completion (given the function with a [MASK] span, fill the mask and output the complete function), and code explanation (describe the function in English). The benchmark spans 40 languages, including Rust, Shell, PowerShell, HTML, JavaScript, TypeScript, Perl, CoffeeScript, Erlang, Swift, Visual Basic, Markdown, Ruby, Elisp, and Racket.]

Large language models (LLMs) designed for code, such as Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2023), Code Llama (Rozière et al., 2023), DeepSeek Coder (Guo et al., 2024), and CodeQwen (Hui et al., 2024), excel at code understanding, completion, and generation tasks. Code LLMs with a large number of parameters (e.g., 7B, 13B, or larger) are pre-trained on large-scale code databases with self-supervised autoregressive objectives, followed by instruction tuning (Ouyang et al., 2022) to align them with human preferences and downstream code-related tasks. Most code benchmarks (Chen et al., 2021; Austin et al., 2021; Athiwaratkun et al., 2023) evaluate the performance of code LLMs by assessing their ability to generate executable code from problem descriptions. These assessments gauge how effectively models understand and generate code, thereby helping to facilitate and streamline the programming process for developers. The execution-based method runs generated code against test cases to measure the success rate. Because creating a problem and its corresponding solution requires specialized programming staff, the development of evaluation benchmarks has been largely limited to Python, with a few other languages translated from Python. The community therefore urgently needs a massively multilingual programming benchmark (not derived from HumanEval or MBPP), comprising instruction corpora and an evaluation set, to comprehensively facilitate and evaluate the generation, completion, and understanding capabilities of LLMs.

To facilitate the development of code LLMs, we introduce a complete framework that includes multilingual code instruction corpora, a multilingual coder (MCODER), and a multilingual code evaluation benchmark. First, we propose MCEVAL, the first massively multilingual code evaluation benchmark (hand-written by human annotators) covering 40 languages (16K samples in total) and encompassing multilingual code generation, multilingual code explanation, and multilingual code completion tasks. Then, we create a massively multilingual instruction corpus, MCEVAL-INSTRUCT, covering 40 languages. We initially select and refine high-quality code snippets from various programming languages (PLs) using an LLM. The LLM then generates clear and self-contained instructional content, including problem descriptions and corresponding solutions, based on the refined snippets. To ensure consistency and enhance learning across languages, we introduce cross-lingual code transfer, adapting instructional content to different PLs while increasing sample complexity. Built on open-source models and MCEVAL-INSTRUCT, MCODER serves as a strong baseline for exploring the transferability of LLMs among different PLs.
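The find_median task from Figure 1 can be written as runnable Python. Note that the snippet shown in the figure contains a small bug (it reads `len(numbers)` although the parameter is `nums`), which is corrected in this sketch:

```python
def find_median(nums):
    """Find the median of a list of numbers (the Figure 1 example,
    with the stray `numbers` identifier corrected to `nums`)."""
    nums.sort()
    n = len(nums)
    if n % 2 == 1:
        # Odd length: the middle element is the median.
        m = nums[n // 2]
    else:
        # Even length: average the two middle elements.
        m1 = nums[n // 2 - 1]
        m2 = nums[n // 2]
        m = (m1 + m2) / 2
    return m
```

In the completion variant of this task, the two `m1`/`m2` lines are replaced by [MASK] and the model must restore them; in the explanation variant, the model must describe the full function in English.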
The contributions are summarized as follows: (1) We propose MCEVAL, a truly massively multilingual, multitask code evaluation benchmark with ample test samples (16K), not derived from HumanEval or MBPP, covering 40 languages and encompassing multilingual code generation, multilingual code explanation, and multilingual code completion tasks. (2) We introduce MCEVAL-INSTRUCT, a massively multilingual code instruction corpus built from multilingual code snippets in 40 languages. Based on MCEVAL-INSTRUCT, an effective multilingual coder, MCODER, is used as a strong baseline for MCEVAL. (3) We systematically evaluate the understanding and generation capabilities of 20+ models on MCEVAL and create a leaderboard to evaluate them dynamically on 40 programming languages. Notably, extensive experiments suggest that comprehensive multilingual multitask evaluation can realistically measure the gap between open-source models (e.g., DeepSeek Coder and CodeQwen1.5) and closed-source models (e.g., GPT-3.5 and GPT-4).

2 MULTILINGUAL CODE EVALUATION: MCEVAL

2.1 DATASET STATISTICS

Table 1: MCEVAL dataset statistics.

Questions
- Code Generation: 2,007
- Code Explanation: 2,007
- Code Completion: 12,017
  - Single-Line: 2,998
  - Multi-Line: 2,998
  - Span: 4,014
  - Span (light): 2,007

Total Test Cases: 10,086

Difficulty Level
- Easy: 1,221
- Medium: 401
- Hard: 385

Prompt Length
- maximum: 793 tokens
- minimum: 16 tokens
- average: 173.8 tokens

Solution (Output) Length
- maximum: 666 tokens
- minimum: 4 tokens
- average: 120.9 tokens

MCEVAL comprises three key code-related tasks covering 40 programming languages: multilingual code generation, multilingual code explanation, and multilingual code completion. The code generation and explanation tasks each contain about 2K samples, with nearly 50 samples per language.
The code completion task decomposes into multi-line completion (3K samples), single-line completion (3K samples), span completion (4K samples), and span completion (light) (2K samples) (Bavarian et al., 2022). Table 1 lists the number of questions, test cases, and difficulty levels for the three tasks in MCEVAL, as well as the number of questions in the four sub-tasks of the completion task. Moreover, we counted the token lengths of the prompts and solutions (tokens are calculated with the Llama-3 tokenizer). The span completion (light) task is similar in form to the span completion task; however, in the light variant each problem is paired with all the corresponding code, making it a balanced version of the span completion task (fewer samples for fast inference and the same test size for each programming language). The results of span completion (light) therefore better reflect differences in model performance across languages.

[Figure 2: Data statistics of the MCEVAL benchmark involving 40 programming languages, showing the prompt length and output length (in units of 50 tokens) and the number of test cases for each language.]

Figure 2 plots the prompt length, solution (output) length, and number of test cases for each programming language. In Table 2, we compare MCEVAL with other multilingual benchmarks. Notably, our benchmark significantly supplements current benchmarks in both the variety of programming languages and the number of questions.

Table 2: Comparison between MCEVAL and other multilingual code benchmarks. For MCEVAL, the question count is broken down by task (Generation/Explanation/Completion).

- MultiPL-E (Cassano et al., 2023): 18 languages, translated, ~3,000 questions
- MBXP (Athiwaratkun et al., 2023): 10 languages, translated, 12,425 questions
- HumanEval-X (Zheng et al., 2023b): 5 languages, hand-written, 820 questions
- HumanEval-XL (Peng et al., 2024): 12 languages, hand-written, 22,080 questions
- MCEVAL: 40 languages, hand-written, 16,031 questions (2,007/2,007/12,017)

2.2 HUMAN ANNOTATION & QUALITY CONTROL

To create the massively multilingual code evaluation benchmark MCEVAL, multilingual code samples are annotated through a comprehensive and systematic human annotation procedure, underpinned by rigorously defined guidelines to ensure accuracy and consistency. Initially, 10 software developers with computer science backgrounds and proven proficiency in the respective programming languages are recruited as multilingual programming annotators. Following a detailed training session on the annotation protocol, which emphasizes the importance of context, syntactic correctness, and semantic fidelity across languages, annotators are tasked with creating problem definitions and corresponding solutions. Annotators follow two rules: (1) provide a clear and self-contained problem definition, answer the question (using any tools), and design test cases to evaluate the correctness of the code; (2) classify each sample into a difficulty level (Easy/Medium/Hard) based on algorithmic complexity and functionality. Each sample is independently annotated by at least two annotators to minimize subjective bias and errors. Discrepancies between annotators are resolved through consensus or adjudication by a senior annotator. Finally, three volunteers evaluate the correctness of the benchmark (> 90% accuracy) and correct the errors. (See Appendix A.2 for more details.)

2.3 EVALUATION TASKS

Multilingual Code Generation. Given the k-th programming language $L_k \in \{L_i\}_{i=1}^{K}$, where $K = 40$ is the number of programming languages, we provide the problem description $q^{L_k}$ and example test cases $e^{L_k}$ as input for the code LLM $\mathcal{M}$ to generate the corresponding code $a^{L_k}$.
We obtain the sampled code from the generation distribution $P(a^{L_k} \mid q^{L_k}, e^{L_k}; \mathcal{M})$ of the code LLM $\mathcal{M}$, and then run the test cases against the generated code, whose outputs should equal the expected outputs. The process can be described as:

$$r^{L_k} = \mathbb{I}\big(P(a^{L_k} \mid q^{L_k}, e^{L_k}; \mathcal{M});\, u^{L_k}\big) \qquad (1)$$

where $\mathbb{I}(\cdot)$ is the indicator function obtained by executing the generated code with the given test cases $u^{L_k}$: when the generated code $a^{L_k}$ passes all test cases, the evaluation result is $r = 1$; otherwise $r = 0$.

[Figure 3: Examples of multilingual code generation, explanation, and completion: (1) a Python code generation task (find_min_n_greater_than_k, computing the smallest n for which the harmonic series sum exceeds k) with reference solution and test cases; (2) a Rust code explanation task (separate_paren_groups) asking for a concise English docstring of at most 500 characters; (3) a CPP code completion task (factorial of N modulo 10007) in which masked lines ([MASK], possibly spanning multiple lines) must be filled to complete the function.]

Multilingual Code Explanation.
To evaluate the understanding capability of code LLMs, we adopt two-pass generation (Code-to-Natural-Language and then Natural-Language-to-Code), since text-similarity metrics (e.g., BLEU (Papineni et al., 2002)) are hindered by n-gram text matching and cannot produce an accurate score. We first prompt the code LLM to generate a natural language description $t^{L_k}$ based on the code $a^{L_k}$, and then force the model to restore the original code from $t^{L_k}$. The code sampled from $P(a^{L_k} \mid t^{L_k}; \mathcal{M})$ is used to evaluate the understanding capability as:

$$r = \mathbb{I}\big(P(t^{L_k} \mid a^{L_k}; \mathcal{M})\,P(a^{L_k} \mid t^{L_k}; \mathcal{M});\, u^{L_k}\big) \qquad (2)$$

where $\mathbb{I}(\cdot)$ checks the correctness of the generated code by running it with the test cases.

Multilingual Code Completion. Another important scenario is code completion, where the code LLM produces the middle code $a^{L_k}_m$ based on the prefix code $a^{L_k}_p$ and the suffix code snippet $a^{L_k}_s$. We concatenate $a^{L_k}_p$, $a^{L_k}_m$, and $a^{L_k}_s$ into the complete code for evaluation as:

$$r = \mathbb{I}\big(P(a^{L_k}_m \mid a^{L_k}_p, a^{L_k}_s; \mathcal{M});\, u^{L_k}\big) \qquad (3)$$

where $a^{L_k}_p$, $a^{L_k}_m$, and $a^{L_k}_s$ are concatenated into the complete code to be executed with the test cases $u^{L_k}$.

3.1 MCEVAL-INSTRUCT

Collection from Code Snippets. For a programming language $L_k$ ($L_k \in \{L_i\}_{i=1}^{K}$, where $K$ is the number of programming languages), consider an existing code snippet $c \in D^{L_k}_c$. We prompt the LLM to select high-quality code and refine it into a self-contained snippet using the prompt "{Code Snippet}\nDetermine its educational value for a student whose goal is to learn basic coding concepts.\n\nIf the answer is YES, please refine the code with clear variable definitions, comments, and a docstring." We thereby obtain the multilingual refined code snippets.
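The selection/refinement prompt can be assembled as in the sketch below. The template text follows the prompt quoted above; the function name is ours:

```python
def build_selection_prompt(code_snippet: str) -> str:
    """Assemble the code-selection/refinement prompt used to filter and
    clean raw code snippets for MCEVAL-INSTRUCT (illustrative helper)."""
    return (
        f"{code_snippet}\n"
        "Determine its educational value for a student whose goal is "
        "to learn basic coding concepts.\n\n"
        "If the answer is YES, please refine the code with clear "
        "variable definitions, comments, and a docstring."
    )
```

The resulting string is sent to the LLM once per raw snippet; snippets judged not educational are discarded, and the rest are replaced by their refined versions.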
(More details can be found in Appendix A.3.)

[Figure 4: Overview of the MCEVAL-INSTRUCT pipeline. Step 1: Code Collection. Step 2: Code Selection and Refinement, with the instruction "{Code Snippet} Determine its educational value for a student whose goal is to learn basic coding concepts. If the answer is YES, please refine the code with clear variable definitions, comments, and docstring." Step 3: Code Corpora, with a question-creation instruction: "You are an expert in programming, especially in designing high-quality {language} question and answer based on the given code snippet. Guidelines: the question and answer must be completely self-contained and clear; the difficulty of the code can be taken a step further, and the docstring describes the problem description. Given Code snippet: {code} ... Created Question ... Created Solution ..." Step 4: Cross-lingual Enhancement, with the instruction "Question and Answer of Language {src}: {question}\n\n{content} Please draw inspiration from the given question and response to create a new question and answer in language {tgt}". Step 5: Instruction Tuning (e.g., on DeepSeek-Coder, over C++, C#, Python, Java, ...). Step 6: Massively Multilingual Evaluation over code generation, code completion, and code explanation.]

[Figure 10: Examples of multilingual generation. Each sample mainly consists of an instruction part (including the function name, function description, and function call examples), a reference solution, and test cases. Left: an example in Lisp (create-largest-number). Middle: a file-processing programming task in AWK; during evaluation, the file-processing result of the generated code is compared with the reference answer. Right: an example in R (longest_increasing_subsequence).]
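The file-processing check described in the Figure 10 caption (comparing the output of a generated AWK command against a reference command) can be sketched as follows; the function name is ours, while the use of `subprocess.check_output` with `shell=True` mirrors the evaluation snippet shown in the figure:

```python
import subprocess

def outputs_match(candidate_cmd: str, reference_cmd: str) -> bool:
    """Run the generated shell/AWK command and the reference command,
    then compare their file-processing results (stdout) for equality."""
    generated = subprocess.check_output(candidate_cmd, shell=True, text=True)
    reference = subprocess.check_output(reference_cmd, shell=True, text=True)
    return generated == reference
```

In the benchmark, both commands read the same input file (e.g., data/AWK/contribution.txt), so equal stdout means the generated command performed the required transformation correctly.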
[Figure 11: Examples of multilingual explanation. Each sample mainly consists of an instruction part (including a complete function) and a reference explanation, where the model must provide a concise natural language description (docstring) of the code in English using at most 500 characters. Left: an example in Kotlin (findPrimePairs). Middle: an example in Lua (addDigits). Right: an example in HTML (a two-row, two-column table with bold header text).]

[Figure 12: Examples of multilingual completion. Each sample mainly consists of an instruction part (including an incomplete function with [MASK] spans, where multiple lines may be masked out), a reference complete code solution, and test cases. Left: a span completion example in C++ (extraNumber). Middle: a single-line completion example in Rust (below_zero). Right: a multi-line completion example in Shell (calculate_max_pens).]
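Before execution, a [MASK]-style completion (as in the Figure 12 examples) must be spliced back into the incomplete function. A minimal sketch, where `fill_masked` is our illustrative helper rather than the benchmark's actual harness:

```python
def fill_masked(incomplete: str, predictions: list, mask: str = "[MASK]") -> str:
    """Replace each [MASK] occurrence in the incomplete code with the
    model's predicted line, preserving the surrounding indentation."""
    filled, preds = [], iter(predictions)
    for line in incomplete.splitlines():
        # Only consume a prediction when the line actually carries a mask.
        filled.append(line.replace(mask, next(preds)) if mask in line else line)
    return "\n".join(filled)
```

The filled function is then concatenated with its test cases and executed, exactly as in the generation task.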
[Figure 13: Example of markup language (JSON) generation task evaluation. The candidate JSON is parsed and all subcomponents are compared against the reference solution; if all subcomponents match exactly (regardless of key order), the test passes.]

A.5 EVALUATION

For programming languages other than markup languages, we use an execution-based correctness metric by running the code with the provided test cases. For markup languages, we use the Exact Match metric for evaluation. Taking JSON as an example, we parse all subcomponents of the JSON; if the model output is exactly the same as the subcomponents of the reference solution, the generation result is considered correct. An example for JSON is shown in Figure 13. We adopt the greedy Pass@1 (%) metric (Kulal et al., 2019; Chen et al., 2021) for our evaluations. For closed-source models, we generate answers through the official API service. For open-source models, we prioritize vLLM (Kwon et al., 2023) for faster inference when the model is supported by vLLM; otherwise, we perform inference with the Distributed Data Parallel (DDP) module from PyTorch. For the code generation and code completion tasks, we extract the functional part of the code from the model outputs and combine it with the corresponding test cases to form compilable and executable code. For the code explanation task, we adopt the two-pass generation approach (Code-to-Natural-Language and Natural-Language-to-Code); the extraction and execution process for this task is consistent with the previous two tasks. We conduct all evaluations in a Docker environment. Detailed information on the code compilation and execution environments is displayed in Table 6. We have uploaded the Docker image to Docker Hub to facilitate the reproduction of results and the evaluation of new models.

Table 6: Runtime environments for different programming languages.

- AWK: GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
- C: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- C#: dotnet 8.0.100
- CPP: g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- CoffeeScript: CoffeeScript version 1.12.7
- Common Lisp: SBCL 1.4.5.debian
- Dart: Dart SDK version 3.3.1 (stable)
- Elixir: elixir 1.3.3
- Emacs Lisp: GNU Emacs 25.2.2
- Erlang: Erlang/OTP 20 [erts-9.2]
- F#: dotnet 8.0.100
- Fortran: GNU Fortran (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
- Go: go version go1.18.4 linux/amd64
- Groovy: Groovy Version 4.0.16, JVM 17.0.9 (Oracle Corporation), OS: Linux
- HTML: -
- Haskell: The Glorious Glasgow Haskell Compilation System, version 9.4.7
- JSON: -
- Java: javac 11.0.19
- JavaScript: Node.js v16.14.0
- Julia: julia v1.9.4
- Kotlin: kotlinc-jvm 1.9.21 (JRE 17.0.9+11-LTS-201)
- Lua: Lua 5.4.6 Copyright (C) 1994-2023 Lua.org, PUC-Rio
- Markdown: -
- PHP: PHP 7.2.24-0ubuntu0.18.04.17 (cli) (built: Feb 23 2023 13:29:25) (NTS)
- Pascal: Free Pascal Compiler version 3.2.2 [2021/05/16] for x86_64
- Perl: perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi
- PowerShell: PowerShell 7.4.0
- Python: Python 3.8.12
- R: R version 3.4.4
- Racket: Racket v6.11
- Ruby: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux-gnu]
- Rust: rustc 1.74.0 (79e9716c9 2023-11-13)
- Scala: Scala code runner version 3.3.1, Copyright 2002-2023, LAMP/EPFL
- Scheme: Racket v6.11
- Shell: GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
- Swift: Swift version 5.9.2 (swift-5.9.2-RELEASE)
- Tcl: tclsh 8.6.11
- TypeScript: tsc Version 5.3.3
- Vim Script: VIM - Vi IMproved 9.0 (2022 Jun 28, compiled Dec 20 2023 18:57:50)
- Visual Basic: dotnet 8.0.100
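The two checks above can be sketched in Python: `exec_pass` illustrates the execution-based metric for a Python candidate, and `json_exact_match` the Exact Match metric for JSON. Both function names are ours, and the real harness runs per-language toolchains inside the Docker environments listed in Table 6:

```python
import json
import os
import subprocess
import sys
import tempfile

def exec_pass(candidate: str, tests: str, timeout: int = 10) -> int:
    """Execution-based check: run candidate code together with its test
    cases; return 1 if the program exits cleanly (all asserts pass), else 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return int(result.returncode == 0)
    except subprocess.TimeoutExpired:
        return 0
    finally:
        os.remove(path)

def json_exact_match(candidate: str, reference: str) -> bool:
    """Exact Match check for markup: parse both sides and compare all
    subcomponents structurally (object key order does not matter)."""
    try:
        return json.loads(candidate) == json.loads(reference)
    except json.JSONDecodeError:
        return False
```

Parsing before comparison is what makes the Figure 13 example pass: the candidate lists the "url" key before "name", but the parsed subcomponents are identical.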
Figure 14: Pass@1 (%) scores of code LLMs under 10B parameters (Qwen2.5-Coder-7B-Instruct, OpenCoder-8B-Instruct, Magicoder-S-DS-6.7B, DS-Coder-V1-6.7B-Instruct, MCODER-7B (ours), CodeQwen1.5-7B-Chat, CodeGemma-7B-It, and CodeLlama-7B-Instruct) for multilingual code generation tasks on MCEVAL. AVG represents the average score over all languages.

Figure 15: Pass@1 (%) scores of code LLMs with 10B to 40B parameters (Qwen2.5-Coder-32B-Instruct, DS-Coder-V2-Lite-Instruct, Codestral-22B-v0.1, DS-Coder-V1-33B-Instruct, WizardCoder-Python-34B, and CodeLlama-34B-Instruct) for multilingual code generation tasks on MCEVAL. AVG represents the average score over all languages.

A.6 OPTIMIZATION DETAILS

All MCODER models are fine-tuned on 8 NVIDIA A800-80GB GPUs. The models are trained for 2 epochs with a cosine learning-rate scheduler, starting at a learning rate of 2e-5 with a 3% warmup phase; training a model takes about 5 hours. We use AdamW (Loshchilov & Hutter, 2017) as the optimizer with a batch size of 512 and a sequence truncation length of 4096. We use PyTorch's Fully Sharded Data Parallel (FSDP) for distributed training, combined with gradient checkpointing and gradient accumulation to save memory and enable the larger batch size.

A.7 EXTRA RESULTS

A.8 PROGRAMMING CLASSIFICATION

As shown in Table 7 and Table 8, we comprehensively report the code generation performance of the evaluated models across various programming paradigms and application scenarios.
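The schedule described in A.6 (a 3% warmup into cosine decay from a peak of 2e-5) can be sketched as a small helper. This is our illustration of the stated hyperparameters, not the authors' training code; we assume linear warmup and decay to zero, the common defaults for cosine schedulers.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-5, warmup_frac: float = 0.03) -> float:
    """Learning rate with linear warmup followed by cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
assert lr_at_step(0, total) < lr_at_step(29, total)   # still warming up
assert abs(lr_at_step(30, total) - 2e-5) < 1e-12      # peak right after warmup
```

With 2 epochs over the instruction corpus at an effective batch size of 512, `total_steps` would be set from the dataset size; gradient accumulation changes only how each step is computed, not the schedule.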
Figure 16: Pass@1 (%) scores of closed-source and 200B+ code LLMs (claude-3-5-sonnet-20240620, claude-3-5-sonnet-20241022, gpt-4o-2024-08-06, DS-Coder-V2-Instruct, and gpt-4o-mini-2024-07-18) for multilingual code generation tasks on MCEVAL. AVG represents the average score over all languages.

Figure 17: Pass@1 (%) scores of different code LLMs (gpt-4o-2024-05-13, gpt-4-turbo-231106, gpt-3.5-turbo-240125, DS-Coder-V1-6.7B-Instruct, Magicoder-S-DS-6.7B, CodeQwen1.5-7B-Chat, MCODER (ours), and CodeLlama-13B-Instruct) for multilingual code explanation tasks on MCEVAL. AVG represents the average score over all languages.

A.9 MCODER RESULTS

In Table 9, we report additional MCODER Pass@1 (%) results on multilingual code generation tasks. In addition to CodeQwen-1.5, we also select DeepSeek-Coder-1.5-base as a base model for fine-tuning, and we evaluate both base models for comparison.

A.10 PARALLEL QUESTIONS ACROSS LANGUAGES & PROGRAMMING GRAMMAR

Due to the large number of languages, it is difficult to guarantee fully parallel problem annotation. For most languages, we follow the characteristics of the language and annotate problems independently; structured languages such as Markdown and HTML, in particular, require independent annotation. For some similar languages, such as TypeScript and JavaScript, we use parallel annotation on part of the data. As shown in Figure 19, we also analyze the programming languages in MCEVAL from the representation perspective.
We use CodeBERT (Feng et al., 2020) to extract code representations from the code snippets in MCEVAL and visualize them with t-SNE (Van der Maaten & Hinton, 2008) and hierarchical clustering (Murtagh & Contreras, 2012). The figure clearly shows that languages with similar syntax have closely related representations: functional languages cluster near Common Lisp, while C, C++, and Java, as well as the scripting languages, form their own groups of high grammatical similarity.

Figure 18: Pass@1 (%) scores of different models (gpt-4-turbo-231106, Magicoder-S-DS-6.7B, DS-Coder-V1-6.7B-Instruct, CodeQwen1.5-7B-Chat, MCODER (ours), CodeLlama-Instruct-13B, CodeLlama-Instruct-7B, and WizardCoder-V1.0-15B) for multilingual code completion tasks on MCEVAL under four settings: (a) single-line completion, (b) multi-line completion, (c) span completion, and (d) span completion light. Avg represents the average score over all languages.
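The qualitative effect behind the CodeBERT + t-SNE analysis (syntactically similar languages map to nearby representations) can be seen even with a much cruder featurization. The sketch below is our lightweight stand-in, not the paper's pipeline: it replaces learned embeddings with character-trigram bags and cosine similarity, and the snippets and helper names are our own.

```python
from collections import Counter
import math

def trigram_vector(code: str) -> Counter:
    """Map a code snippet to a bag of character trigrams."""
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Tiny snippets: TypeScript and JavaScript share C-style syntax,
# while Scheme uses prefix s-expressions.
snippets = {
    "javascript": "function add(a, b) { return a + b; }",
    "typescript": "function add(a: number, b: number): number { return a + b; }",
    "scheme": "(define (add a b) (+ a b))",
}
vecs = {lang: trigram_vector(src) for lang, src in snippets.items()}

# Syntactically similar languages end up with more similar vectors.
assert cosine(vecs["javascript"], vecs["typescript"]) > \
       cosine(vecs["javascript"], vecs["scheme"])
```

The learned CodeBERT representations capture far more than surface trigrams, but the clustering tendency they expose in Figure 19 follows the same intuition.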
Table 7: Pass@1 (%) results of code generation performance across various programming paradigms.

Method                           Procedural  Object-Oriented  Multi-Paradigm  Functional  Markup
GPT-4o (240517)                  58.0        79.8             65.9            67.0        46.0
GPT-4 Turbo (231106)             56.7        78.7             65.2            59.3        46.7
GPT-3.5-Turbo (240125)           38.7        66.8             57.6            44.3        39.3
CodeGemma-7b-it                  19.3        46.6             34.0            16.3        34.0
CodeLlama-13b-Instruct           21.3        32.0             27.0            32.3        28.0
CodeLlama-34b-Instruct           27.3        33.6             28.0            30.0        30.7
CodeLlama-7b                     20.3        28.1             23.4            26.7        30.7
CodeQwen-1.5-7b-Chat             41.3        57.3             46.3            41.0        37.3
CodeShell-7b-chat                16.0        24.1             25.7            14.0        34.7
Codestral-22B-v0.1               40.0        67.6             54.1            39.7        40.7
DeepSeek-Coder-33b-instruct      52.7        62.8             56.3            52.0        34.7
DeepSeek-Coder-1.5-7b-instruct   39.0        51.8             48.8            41.0        40.0
Magicoder-S-DS-6.7B              45.7        58.5             49.4            49.0        32.0
Llama-3-8B-Instruct              27.3        44.7             38.0            32.0        33.3
Nxcode-CQ-7B-orpo                40.7        54.9             45.5            41.3        36.7
OCTOCODER                        20.7        28.9             21.9            25.0        25.3
OpenCodeInterpreter-DS-6.7B      40.7        57.7             46.4            42.0        42.0
Phi-3-medium-4k-instruct         32.3        43.1             36.6            26.7        35.3
Qwen1.5-72B-Chat                 38.3        37.2             36.2            29.3        39.3
WizardCoder-15B-V1.0             19.0        31.6             34.2            24.0        6.7
WizardCoder-Python-34B           27.7        43.9             38.2            33.7        36.0
MCODER                           41.3        57.3             47.4            42.3        43.3

Table 8: Pass@1 (%) results of code generation performance across various application scenarios.

Method                           Mobile  Cross  Desktop  Frontend  Backend  Scientific  General  Content  Education  Scripts  Editor
GPT-4o (240517)                  84.0    68.3   75.0     66.7      64.6     71.6        57.6     46.0     72.7       65.7     52.0
GPT-4 Turbo (231106)             81.0    64.4   74.0     64.0      66.6     66.8        57.6     46.7     60.7       65.7     50.0
GPT-3.5 (240125)                 60.0    56.7   71.0     63.3      57.5     55.6        50.4     39.3     45.3       50.0     25.0
CodeGemma-7b-it                  45.0    40.4   43.0     34.7      37.7     21.6        24.8     34.0     22.0       29.7     13.0
CodeLlama-13b                    30.0    15.4   39.0     34.7      28.0     23.2        34.8     28.0     24.0       27.7     13.0
CodeLlama-34b-Instruct           33.0    17.3   38.0     38.0      27.2     24.0        32.8     30.7     26.7       31.7     19.0
CodeLlama-7b-Instruct            24.0    12.5   37.0     29.3      22.7     20.8        29.2     30.7     19.3       27.0     14.0
CodeQwen-1.5-7b                  55.0    44.2   59.0     56.7      48.7     47.6        46.8     37.3     42.7       40.0     20.0
CodeShell-7b-chat                23.0    14.4   26.0     40.7      26.1     17.2        21.2     34.7     13.3       22.7     8.0
Codestral-22B-v0.1               68.0    58.7   64.0     57.3      55.0     54.0        44.8     40.7     30.0       53.3     28.0
DeepSeek-Coder-33b-instruct      63.0    50.0   57.0     68.0      60.6     54.8        54.0     34.7     56.7       52.7     35.0
DeepSeek-Coder-1.5-7b-instruct   40.0    42.3   59.0     62.0      52.7     40.8        50.0     40.0     34.7       46.0     22.0
Magicoder-S-DS-6.7B              49.0    43.3   64.0     60.7      50.4     49.6        52.4     32.0     48.7       49.7     24.0
Llama-3-8B-Instruct              41.0    30.8   48.0     50.7      40.5     30.0        37.2     33.3     34.0       33.0     15.0
Nxcode-CQ-7B-orpo                54.0    40.4   55.0     53.3      48.4     48.0        46.8     36.7     42.7       39.7     20.0
OCTOCODER                        22.0    20.2   33.0     28.7      21.8     16.4        27.2     25.3     16.0       29.0     14.0
OpenCodeInterpreter-DS-6.7B      47.0    42.3   64.0     58.0      45.9     47.6        46.4     42.0     43.3       44.0     24.0
Phi-3-medium-4k-instruct         40.0    26.9   48.0     48.7      30.3     39.2        31.6     35.3     33.3       39.0     13.0
Qwen1.5-72B-Chat                 30.0    29.8   43.0     44.7      36.3     30.4        38.0     39.3     32.7       40.0     21.0
WizardCoder-15B-V1.0             28.0    24.0   36.0     48.0      37.1     29.2        27.2     6.7      20.7       26.3     9.0
WizardCoder-Python-34B           42.0    28.8   46.0     42.0      44.2     32.8        38.0     36.0     32.7       32.3     19.0
MCODER                           56.0    38.5   60.0     57.3      50.4     48.0        46.0     43.3     39.3       44.3     25.0

We selected training data from several languages in MCEVAL-INSTRUCT that exhibit significant grammatical differences (approximately 10K Python samples and 1K samples per other language) and fine-tuned the model; the results are shown in Table 10. When trained using only Python data, the performance on Python and AWK improved, but the scores for TypeScript and JavaScript dropped to 0. Upon inspection, we found that the generated code for these two languages contained syntax errors (too little data can make model training unstable). When training on a mixture of several languages, Python performance decreased slightly compared to using only Python data, while Scheme performance improved significantly. Furthermore, the syntax generation for TypeScript and JavaScript returned to normal, even without adding any JavaScript data, since TypeScript and JavaScript share similar syntax.
However, there was no significant improvement compared to the base model. Thus, fine-tuning multilingual code models presents significant challenges: similar languages can provide mutual benefits, while languages with greater differences may negatively impact each other's performance.

Table 9: Additional MCODER Pass@1 (%) results on multilingual code generation tasks. "Avgall" represents the average Pass@1 score across all programming languages in MCEVAL. MCODER-DS indicates that the fine-tuned base model is DeepSeek-Coder-1.5-7b-base.

Method                   Size  AWK   C     C++   C#    Clisp Coffee Dart  Elisp Elixir Erlang Fortran F#   Go    Groovy Haskell Html  Java  JS    Json  Julia
DeepSeek-Coder-1.5-base  7B    30.0  36.0  38.0  40.0  40.0  58.0   0.0   18.0  2.0    14.0   50.0    44.0 48.0  26.0   2.0     4.0   49.1  32.0  16.0  34.0
CodeQwen-1.5             7B    38.0  40.0  46.0  42.0  28.0  56.0   2.0   14.0  14.0   0.0    40.0    46.0 44.0  32.0   2.0     0.0   47.2  52.0  30.0  60.0
CodeQwen-1.5-Python      7B    42.0  48.0  48.0  12.0  52.0  68.0   23.5  22.0  40.0   62.0   50.0    42.0 48.0  68.0   56.0    32.0  67.9  52.0  52.0  54.0
MCODER-DS                7B    34.0  46.0  50.0  26.0  30.0  72.0   19.6  6.0   26.0   24.0   58.0    30.0 48.0  12.0   26.0    28.0  67.9  48.0  62.0  48.0
MCODER                   7B    40.0  44.0  52.0  62.0  46.0  66.0   21.6  30.0  44.0   52.0   56.0    44.0 48.0  70.0   32.0    34.0  54.7  54.0  66.0  56.0

Method                   Kotlin Lua   MD    Pascal Perl  PHP   Power Python R     Racket Ruby  Rust  Scala Scheme Shell Swift Tcl   TS    VB    VimL  Avgall
DeepSeek-Coder-1.5-base  42.0   20.0  0.0   24.0   24.0  36.0  42.0  54.0   24.0  20.0   38.0  39.6  44.0  20.0   18.0  32.0  10.0  44.0  22.0  22.0  28.9
CodeQwen-1.5             58.0   50.0  0.0   14.0   20.0  10.0  48.0  38.0   30.0  24.0   36.0  52.8  32.0  34.0   42.0  46.0  30.0  52.0  54.0  22.0  33.2
CodeQwen-1.5-Python      46.0   46.0  24.0  42.0   36.0  36.0  54.0  44.0   36.0  40.0   46.0  52.8  58.0  40.0   42.0  62.0  48.0  52.0  58.0  18.0  45.5
MCODER-DS                36.0   42.0  22.0  34.0   8.0   34.0  46.0  42.0   22.0  40.0   56.0  45.3  48.0  30.0   38.0  48.0  34.0  46.0  50.0  28.0  37.8
MCODER                   48.0   52.0  30.0  42.0   36.0  32.0  54.0  44.0   40.0  36.0   48.0  52.8  58.0  44.0   46.0  64.0  38.0  52.0  58.0  20.0  46.7

Figure 19: Analysis from the representation perspective on MCEVAL: (1) representation visualization based on t-SNE; (2) representation visualization based on hierarchical clustering. Languages with similar syntax have closely related representations, e.g., JavaScript with TypeScript; Common Lisp with Racket and Scheme; and C with C++, C#, Java, and Groovy.

Table 10: Preliminary exploration of the impact of fine-tuning across different languages on model performance.

Setting                               Python  Scheme  TypeScript  JavaScript  AWK
CodeQwen1.5-base                      38.0    34.0    52.0        52.0        38.0
+ Python                              48.0    12.0    0.0         0.0         40.0
+ Python & Scheme & TypeScript & AWK  44.0    38.0    50.0        48.0        42.0

A.11 DETAILED RELATED WORK

Code Large Language Models. In recent years, numerous large language models (LLMs) have been developed specifically for code-related tasks. In the field of software engineering, code LLMs (Feng et al., 2020; Chen et al., 2021; Scao et al., 2022; Li et al., 2022; Allal et al., 2023; Fried et al., 2022; Wang et al., 2021; Zheng et al., 2024; Guo et al., 2024) are pre-trained on billions of code snippets; examples include StarCoder (Li et al., 2023; Lozhkov et al., 2024), CodeLlama (Rozière et al., 2023), DeepSeek-Coder (Guo et al., 2024), and CodeQwen (Bai et al., 2023). The development and refinement of code LLMs have been pivotal in automating software development tasks, providing code suggestions, and supporting code generation and translation.
To improve the performance of code generation, researchers have used optimized prompts (Liu et al., 2023a; Reynolds & McDonell, 2021; Zan et al., 2023; Beurer-Kellner et al., 2023), incorporated test cases (Chen et al., 2023), and introduced collaborative roles (Dong et al., 2023). There are also related studies applying large language models to other code tasks, such as dynamic programming (Dagan et al., 2023), compiler optimization (Cummins et al., 2023), multilingual prompts (Di et al., 2023), and Program of Thoughts (Chen et al., 2022).

Code Evaluation. In the domain of code evaluation, a rich tapestry of benchmarks (Zheng et al., 2023b; Yu et al., 2024; Yin et al., 2023; Peng et al., 2024; Khan et al., 2023; Orlanski et al., 2023) has been woven to address the challenges of accurately assessing code quality, functionality, and efficiency, such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and their upgraded version EvalPlus (Liu et al., 2023b). Studies have explored a variety of approaches, ranging from static analysis techniques that examine code without executing it (e.g., exact match (EM) and edit similarity (ES)) to dynamic methods that execute code in controlled environments (e.g., Pass@k). Current benchmarks support evaluating code models on a range of task types, such as code understanding, function calling (Zhuo et al., 2024), code repair (Lin et al., 2017; Tian et al., 2024; Jimenez et al., 2023; Zhang et al., 2023; Prenner & Robbes, 2023; He et al., 2022), and code translation (Yan et al., 2023). Recently, many works (Wei et al., 2023; Zhuo et al., 2024) have leveraged LLMs to construct large-scale evaluation datasets and instruction-tuning corpora, further enhancing the evaluation and performance of code models. In our work, we use a similar approach to construct an instruction dataset and propose the cross-lingual code transfer method to expand the number of covered languages to 40.
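The dynamic Pass@k metric mentioned above has a standard unbiased estimator (Chen et al., 2021); a minimal sketch follows. With `n` samples generated per problem of which `c` pass all tests, it estimates the probability that at least one of `k` drawn samples is correct. MCEVAL's greedy Pass@1 is the degenerate case n = k = 1, where the score is simply whether the single greedy sample passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all test cases
    k: budget of samples considered
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    # 1 - probability that all k drawn samples are failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to c/n = 0.3.
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-12
```

Averaging `pass_at_k` over all problems yields the benchmark-level Pass@k (%) score.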
Some recent works address multilingual scenarios (Cassano et al., 2023; Wang et al., 2023; Athiwaratkun et al., 2023; Zheng et al., 2023a; Peng et al., 2024; Zheng et al., 2023b) by extending the existing Python-only HumanEval or MBPP benchmarks, such as MultiPL-E (Cassano et al., 2023) and MBXP (Athiwaratkun et al., 2023); however, these efforts remain limited by the number of covered languages and by data leakage concerns (Li et al., 2023; Jain et al., 2024).