# Multi-Turn Code Generation Through Single-Step Rewards

Arnav Kumar Jain*1,2, Gonzalo Gonzalez-Pumariega*3, Wayne Chen3, Alexander M. Rush3, Wenting Zhao3, Sanjiban Choudhury3

*Equal contribution. 1Mila Quebec AI Institute, 2Université de Montréal, 3Cornell University. Correspondence to: Arnav, Gonzalo.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, µCODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. µCODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices for the reward models and policy, and show the efficacy of µCODE at utilizing execution feedback. Our code is available here.

1. Introduction

Software engineers often iteratively refine their code based on execution errors. A common strategy for machine code generation is thus to repair code using execution feedback at test time (Chen et al., 2024; Wang et al., 2024b; Zhao et al., 2024). However, prompting alone is insufficient, as it cannot teach a model how to recover from all possible errors within a limited context. We need to train models that learn from execution feedback during training.

Existing approaches fall into either single-turn or multi-turn settings. In the single-turn setting, methods either train without execution feedback (Zelikman et al., 2022) or perform one-step corrections (Welleck et al., 2023; Ni et al., 2024). However, these struggle to iteratively correct errors over multiple turns. Multi-turn approaches, on the other hand, rely on complex reinforcement learning (RL) (Gehring et al., 2024a; Kumar et al., 2024b; Zhou et al., 2024) to optimize long-term rewards. While effective in principle, these methods suffer from sparse learning signals, which makes learning inefficient.

Our key insight is that code generation is a one-step recoverable Markov Decision Process (MDP): the correct code can be recovered from any intermediate state in a single step. This allows us to greedily maximize a one-step reward instead of relying on complex multi-step reward optimization. As a result, the problem reduces from reinforcement learning, which requires exploration and credit assignment, to imitation learning, where the model simply learns to mimic correct code, leading to a more stable and efficient training process.

We propose µCODE, a simple and scalable approach for multi-turn code generation from execution feedback. During training, µCODE follows an expert iteration (Anthony et al., 2017) framework with a local search expert, enabling iterative improvement of both the generator and the expert. The process begins by rolling out the current code generator to collect interaction data with execution feedback. A single-step verifier is then trained on this data and used to guide a local search expert in refining the code and generating training labels. Finally, the generator is fine-tuned on these labels. Given recent trends of test-time scaling for generating high-quality solutions (Brown et al., 2024; Snell et al., 2024; Wu et al., 2024), µCODE also uses the learned verifier for inference-time scaling.
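The training loop described above (roll out the generator, train a single-step verifier, relabel with a verifier-guided local search expert, fine-tune the generator) can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's implementation: the "program space" is the integers 0..9, the hidden correct solution is a single value, the generator is a categorical weight vector, and all function names are hypothetical.

```python
import random

random.seed(0)

# Toy stand-ins (all names and structures hypothetical, not the paper's code):
# the "program space" is the integers 0..9, and the hidden correct solution is 7.
TARGET = 7

def execute(code):
    """Execution feedback: did the candidate pass the public test?"""
    return code == TARGET

def rollout(weights, turns=3):
    """Roll out the generator for a few turns, collecting (candidate, passed) pairs."""
    history = []
    for _ in range(turns):
        candidate = random.choices(range(10), weights=weights)[0]
        passed = execute(candidate)
        history.append((candidate, passed))
        if passed:  # stop early once the public test passes
            break
    return history

def train_verifier(dataset):
    """Single-step verifier: score each candidate by its observed outcomes."""
    scores = {c: 0.0 for c in range(10)}
    for candidate, passed in dataset:
        scores[candidate] += 1.0 if passed else -1.0
    return scores

def local_search(scores):
    """Local-search expert: return the candidate the verifier ranks highest
    (one-step recoverability: a fix is reachable in a single turn)."""
    return max(range(10), key=lambda c: scores[c])

# Expert iteration: rollout -> train verifier -> relabel with the expert
# -> "fine-tune" the generator toward the expert's label.
weights = [1.0] * 10  # uniform initial generator
for _ in range(5):
    data = [pair for _ in range(50) for pair in rollout(weights)]
    verifier = train_verifier(data)
    expert_label = local_search(verifier)
    weights[expert_label] += 10.0  # imitation update toward the expert label

best = max(range(10), key=lambda c: weights[c])
```

The key structural point the sketch mirrors is that the generator never needs multi-step credit assignment: the verifier scores single candidates, the expert picks a one-step fix, and the generator simply imitates the expert's labels.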
Here, µCODE samples N trajectories; at each step, µCODE picks the best code solution ranked by the learned verifier.

The key contributions of this work are as follows:

1. A novel framework, µCODE, for training code generators and verifiers through multi-turn execution feedback. We add a theoretical analysis of performance bounds using the one-step recoverability property of this task.
2. We propose a multi-turn Best-of-N (BoN) approach for inference-time scaling and present the benefits of a learned verifier for selecting the code solution at each turn.
3. Our approach µCODE outperforms leading multi-turn baselines.

Figure 1. (a) We define the task of multi-turn code generation where, for an initial problem x (e.g., "Check if the given string is a palindrome"), the generator π_θ provides a solution y_1. This solution is evaluated with the public test to get execution feedback o_1. At turn t, the generator is conditioned on the history to generate solution y_t ∼ π_θ(· | x, y_1, o_1, ..., y_{t−1}, o_{t−1}). (b) Training loop: roll out the generator π_θ, aggregate data 𝒟, train the verifier, relabel with local search, and train the generator. Code generation is a one-step recoverable MDP.

- You must only return a single code block since we only parse the first code block.
- Do not include any tests in your code - we will run the suite and return any error feedback.
- Include relevant import statements.

Human Eval Prompt Example

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer
    to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```

Human Eval Test Example

```python
def check(has_close_elements):
    assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False
    assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True

check(has_close_elements)
```

MBPP Prompt Example

Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].

MBPP Test Example

```python
assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8
assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12
assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16
```

Code Contests Prompt Example

Provide a Python solution for the following competitive programming question: Mr. Chanek has an array a of n integers. The prettiness value of a is denoted as:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \gcd(a_i, a_j) \cdot \gcd(i, j)$$

where gcd(x, y) denotes the greatest common divisor (GCD) of integers x and y. In other words, the prettiness value of an array a is the total sum of gcd(a_i, a_j) · gcd(i, j) for all pairs (i, j). Help Mr. Chanek find the prettiness value of a, and output the result modulo 10^9 + 7! The first line contains an integer n (2 ≤ n ≤ 10^5). The second line contains n integers a_1, a_2, ..., a_n (1 ≤ a_i ≤ 10^5). Output an integer denoting the prettiness value of a modulo 10^9 + 7.

```
5
3 6 2 1 4
```

Your code should be enclosed in triple backticks like so: `` ```python YOUR CODE HERE ``` ``. Use the backticks for your code only.

Code Contests Test Example

Input is fed through stdin and output is checked against stdout:

```json
{"input": ["5\n54883 59286 71521 84428 60278\n", "2\n83160 83160\n"],
 "output": ["1027150\n", "415800\n"]}
```

C.1.2. FEEDBACK PROMPT

Immediately below is the prompt template for how we provide feedback in multi-step methods.
The feedback consists only of executor error traces, and we provide an example from Human Eval.

Multi-Step Feedback Prompt

```
{feedback}
```

Human Eval Multi-Step Feedback Prompt

```
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    assert has_close_elements([1.0, 2.0, 3.0], 0.5) == False
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```

C.2. Public-Private Tests

We choose a public-private test split for Human Eval and MBPP to ensure that naively passing the public tests does not guarantee private-test success. For Human Eval, we use a single test from the code prompt's docstring as the public test, and the remaining tests, along with the official test suite, as private tests. For ease of parsing, we utilize a processed version of Human Eval, HumanEvalPack (Muennighoff et al., 2023). For MBPP, we use a single test from the official test suite as the public test, and the remaining tests and any challenge test list tests as private tests.
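The MBPP split described above can be sketched as follows. This is a hypothetical helper, not the paper's code; the field names `test_list` and `challenge_test_list` follow the public MBPP dataset schema, and the example record reuses the min_cost tests from Appendix C.1.

```python
def split_tests(record):
    """Split an MBPP-style record into public/private tests:
    one public test, the rest (plus any challenge tests) private."""
    tests = record["test_list"]
    public = tests[:1]
    private = tests[1:] + record.get("challenge_test_list", [])
    return public, private

# Example record (fields follow the MBPP schema; tests from the example above).
record = {
    "test_list": [
        "assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8",
        "assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12",
        "assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16",
    ],
    "challenge_test_list": [],
}
public, private = split_tests(record)
```

Holding out all but one test keeps the property the section aims for: a solution overfit to the single public test still has to generalize to the unseen private tests.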