# Learning to Complete Code with Sketches

Published as a conference paper at ICLR 2022

Daya Guo, School of Computer Science and Engineering, Sun Yat-sen University, China. guody5@mail2.sysu.edu.cn
Alexey Svyatkovskiy, Microsoft, Redmond, WA, USA. alsvyatk@microsoft.com
Jian Yin, School of Computer Science and Engineering, Sun Yat-sen University, China. issjyin@mail.sysu.edu.cn
Nan Duan, Microsoft Research, Beijing, China. nanduan@microsoft.com
Marc Brockschmidt, Miltiadis Allamanis, Microsoft Research, Cambridge, UK. {mabrocks,miallama}@microsoft.com

ABSTRACT

Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with holes inserted in places where a model is uncertain. We develop GRAMMFORMER, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models. We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric, REGEXACC, that measures success at generating completions that match long outputs with as few holes as possible. In our experiments, GRAMMFORMER generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.

1 INTRODUCTION

Recent high-capacity language models (LMs) have shown that machine learning models are able to generate coherent, realistic text, but it is often hard to guide them towards a specific goal, especially when describing the intent is complex or more costly than manually generating the target output. One such scenario is LMs of source code (LMCs). Since Hindle et al. (2012), increasingly sophisticated LMCs have been built, including transformer-based ones, such as those of Svyatkovskiy et al. (2020), Feng et al. (2020) and Chen et al. (2021), and various similar unpublished models such as TabNine and Source AI. These models generate full sequences of code tokens left-to-right, with any prefix acting as the (partial) user intent.

While LMs generate realistic-looking outputs, they are known to occasionally hallucinate (Puduppully et al., 2019; Malmi et al., 2019; Maynez et al., 2020; Liu et al., 2021), i.e., generate plausible but incorrect content. This is particularly problematic when generating source code, where small mistakes can lead to erroneous code that is very hard to debug or that introduces vulnerabilities (Pearce et al., 2021).

In this work, we investigate models that can decline to make predictions in places where there is high uncertainty (e.g., where the user should choose a name), but continue generating around these holes, written ▢ in the following. For example, in Fig. 1 (left) a developer has typed some code and is about to type the next line. A likely completion is to consume more command line arguments, but their name is unclear from the context.

[Figure 1] Code context (cursor at line 3):
  1 import sys
  2 target = sys.argv[1]
  3 ▌
Ground truth: ID = sys.argv[2]. Suggested code completions: L→R: target = target.replace("\\", "/"); L→R+▢: target = ▢; L→R+▢: print(target); Copilot: (no suggestion); GRAMMFORMER: ▢ = sys.argv[2].
Figure 1: A sample snippet (left; abbreviated from Fig. 12 in Appx. A). A developer has just typed the code and their cursor (in blue) is at line 3. Code completions provided by a number of models are shown on the right, where L→R is a standard LMC and GRAMMFORMER is our new model.
A traditional generative model (e.g., Fig. 1, top right) may choose to provide a completion that exists in the training data, but is not clearly called for here. On the other hand, a model able to explicitly mark where it is uncertain (Fig. 1, bottom right) makes it clear to a user where further input is required.

However, creating such models is not trivial. A simple first attempt may be to use a standard LMC, but output a hole token whenever the model is uncertain about the next output token. However, continuing after the ▢ then becomes infeasible, as the LMC was not trained on such data. Hence, a suitable training dataset and objective need to be devised. As no large datasets with holes exist, we instead choose to use a reinforcement learning approach in which our reward function encourages the model to make long predictions with as few ▢ tokens as possible, but to avoid making incorrect predictions. We found that standard left-to-right sequence models perform poorly on this task. Hence, we developed GRAMMFORMER, a model that constructs suggestions by generating a (partial) syntax tree, but which has the option of leaving non-terminals in its output.

Contributions (1) We present GRAMMFORMER, a transformer-based model that generates code based on the programming language grammar and can predict hole tokens ▢ rather than output tokens it is uncertain about. (2) We develop REGEXACC, a metric that evaluates the quality of predictions with holes. (3) We evaluate GRAMMFORMER on Python and C# code and show that GRAMMFORMER makes longer and more precise statement-level sketch completions compared to baselines.

Our aim is to predict code completions as sketches, a mix of actual tokens and holes ▢, which are meant to signify that the model is unable to make a useful prediction within the given context and further user input is required. Formally, we consider models that take a context sequence x of tokens as input and have to produce an output sequence y; intuitively, x is what the user typed so far, and y is the suggestion presented to the user. In our setting, y is a sketch, a mix of tokens from the programming language and the special token ▢ signifying a hole that could be filled by an arbitrary sequence of tokens. For example, t = foo(▢) is a sketch corresponding to assigning the return value of function foo to variable t, but leaves the arguments of the function call undetermined.

Metric A good sketch is one that (a) can be completed into the correct output and (b) is as precise as possible. To measure how successful a method is in doing so, we define a new metric REGEXACC. For (a), we use toRegex(ŷ) to turn a predicted code sketch ŷ into a regular expression by replacing all holes ▢ with the wildcard matching any non-empty sequence (.+ in Perl Compatible Regular Expression syntax). If the regex matches the ground truth, matches(·, ·) returns a score of 1, otherwise it returns 0. To implement (b), we scale this result by the proportion of terminal tokens predicted, defining nTokens(ŷ) as the function that returns the number of non-hole symbols in ŷ. More formally, assume an output sketch ŷ and a ground-truth sequence y⋆, where y⋆ does not contain any ▢ tokens. REGEXACC is then defined as

$$\text{REGEXACC}(\hat{y}, y^\star) = \text{matches}(\text{toRegex}(\hat{y}), y^\star) \cdot \frac{\text{nTokens}(\hat{y})}{\text{nTokens}(y^\star)}.$$
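As a concrete illustration, REGEXACC can be computed with a small Python sketch like the one below, assuming sketches and ground truths are given as token lists. The hole marker "<hole>" and the whitespace-lenient joining of tokens are our own illustrative choices, not the paper's exact implementation.

```python
import re

HOLE = "<hole>"  # stand-in for the paper's hole token

def to_regex(sketch_tokens):
    """Turn a sketch (list of tokens, some of which are holes) into a regex."""
    parts = []
    for tok in sketch_tokens:
        if tok == HOLE:
            parts.append(".+")            # wildcard matching any non-empty sequence
        else:
            parts.append(re.escape(tok))  # terminal tokens must match literally
    return r"\s*".join(parts)             # be lenient about whitespace between tokens

def n_tokens(tokens):
    """Number of non-hole symbols in a (sketch) token sequence."""
    return sum(1 for tok in tokens if tok != HOLE)

def regex_acc(sketch_tokens, ground_truth_tokens):
    """REGEXACC: 1 if the sketch's regex matches the ground truth, scaled by
    the fraction of terminal tokens the sketch commits to; 0 otherwise."""
    pattern = to_regex(sketch_tokens)
    target = " ".join(ground_truth_tokens)
    matches = 1.0 if re.fullmatch(pattern, target) else 0.0
    return matches * n_tokens(sketch_tokens) / max(1, n_tokens(ground_truth_tokens))

# Example: the sketch "t = foo(<hole>)" against the ground truth "t = foo(a, b)"
print(regex_acc(["t", "=", "foo", "(", HOLE, ")"],
                ["t", "=", "foo", "(", "a", ",", "b", ")"]))  # -> 0.625
```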
Beyond REGEXACC, we also consider ROUGE (Lin, 2004), since a sketch can be thought of as a form of summary of the target text. For this, we use a helper function ERASEHOLES(ŷ) that simply drops all ▢ tokens, and then consider ROUGE_F1(ERASEHOLES(ŷ), y⋆). ROUGE is more lenient to errors than REGEXACC and gives partial credit to non-matching but plausible sketches.

[Figure 2] Generation trace:
  x(0): r = ⟨Expr⟩                                              i(0) = 3
  x(1): r = ⟨Expr⟩ * ⟨ParenthesizedExpr⟩                        i(1) = 5
  x(2): r = ⟨Expr⟩ * ( ⟨Expr⟩ )                                 i(2) = 6
  x(3): r = ⟨Expr⟩ * ( ⟨Expr⟩ - ⟨Expr⟩ )                        i(3) = 8
  x(4): r = ⟨Expr⟩ * ( ⟨Expr⟩ - ⟨Identifier⟩ ( ⟨ArgList⟩ ) )    i(4) = 8
  x(5): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( ⟨ArgList⟩ ) )             i(5) = 10
  x(6): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( ⟨Identifier⟩ ) )          i(6) = 10
  x(7): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( args ) )                  i(7) = 3
  x(8): r = ⟨Identifier⟩ * ( ⟨Expr⟩ - foo ( args ) )            i(8) = 3
  x(9): r = x * ( ⟨Expr⟩ - foo ( args ) )                       i(9) = ⊥
Figure 2: Progress of grammar-based code generation of the sketch r = x * (▢ - foo(args)) by GRAMMFORMER. Each line represents consecutive x(t) in Alg. 1. Non-terminals are shown in angle brackets; the non-terminal at position i(t) is selected by Ps and its expansion is generated by Pe, yielding the next line. Fig. 5 and Fig. 6 in Appx. A show real example generation sequences from our datasets.

2.1 LINEAR CODE SKETCH GENERATION

First, we consider the idea of generating code sketches using a standard generative model for language. To this end, we simply extend the vocabulary with the special ▢ token. An obvious problem is that while we have plenty of training data for a standard generative model, we do not have training data for outputs y that contain the ▢ token. Consequently, we cannot train the model in a fully supervised fashion, and instead turn to reinforcement learning. Concretely, we devise a reward function r(·) that averages REGEXACC and ROUGE, i.e., for a predicted output sketch ŷ and a ground truth output (without ▢ tokens) y⋆, we define

$$r(\hat{y}, y^\star) = \tfrac{1}{2}\left(\text{REGEXACC}(\hat{y}, y^\star) + \text{ROUGE}_{F1}(\text{ERASEHOLES}(\hat{y}), y^\star)\right). \tag{1}$$

Using the combination of ROUGE (which does not consider holes) and REGEXACC is crucial here, as ROUGE is much smoother compared to REGEXACC, which is 0 for all but very few predictions; this allows us to measure partial improvement.

We use our reward function from Eq. 1 to evaluate the quality of the output of the full model and compute a loss. Inspired by Paulus et al. (2017), we use self-critical policy gradient training (Rennie et al., 2017) and for a prediction ŷ we minimise

$$\mathcal{L}(x, y^\star) = \left(r(\hat{y}, y^\star) - \bar{r}(x)\right) \cdot \mathcal{L}_{gen}(x, \hat{y}). \tag{2}$$

Here, $\bar{r}(x)$ is the reward achieved by the prediction of the model snapshot that achieved the best score so far, and $\mathcal{L}_{gen}$ is the loss of the generative model. Intuitively, this objective rewards models that improve upon the previous best policy with respect to r. To model this in practice, we use a standard encoder/decoder Transformer model (Vaswani et al., 2017; Radford et al., 2019), translating the context x into the output y using separate encoder and decoder models. We additionally also consider the language modelling case, i.e., a model that conditioned on x predicts token y0, conditioned on x, y0 predicts token y1, etc.
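To make Eqs. 1 and 2 concrete, the following sketch shows one way the reward and the self-critical loss could be computed, reusing regex_acc from the earlier sketch. Here rouge_f1 is an assumed helper returning an F1 score in [0, 1], and model.sample_sketch / model.sequence_nll are placeholder names for the policy's sampling and negative-log-likelihood routines, not the paper's API.

```python
HOLE = "<hole>"  # stand-in for the paper's hole token

def erase_holes(tokens):
    """ERASEHOLES: drop all hole tokens from a sketch."""
    return [t for t in tokens if t != HOLE]

def reward(sketch, ground_truth, rouge_f1):
    """Eq. 1: average of REGEXACC and ROUGE-F1 (ROUGE computed on the hole-free sketch)."""
    return 0.5 * (regex_acc(sketch, ground_truth)
                  + rouge_f1(erase_holes(sketch), ground_truth))

def self_critical_loss(model, x, ground_truth, baseline_sketch, rouge_f1):
    """Eq. 2: scale the generation loss of a sampled sketch by how much its reward
    improves over the reward of the best-snapshot (baseline) prediction."""
    sampled = model.sample_sketch(x)                  # \hat{y}, sampled from the current policy
    advantage = (reward(sampled, ground_truth, rouge_f1)
                 - reward(baseline_sketch, ground_truth, rouge_f1))
    nll = model.sequence_nll(x, sampled)              # L_gen: negative log-likelihood of the sample
    return advantage * nll                            # minimised by gradient descent
```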
Pretraining In practice, we found that directly training a sequence model to maximise Eq. 1 is very slow and does not converge to a useful model. Instead, we heuristically generate a dataset suitable for supervised pretraining: we replace random AST non-terminals of the target output by ▢ and generate target sequences, which contain terminals and zero or more ▢ tokens. We then pretrain the model on this dataset to convergence, and then fine-tune it using the reward of Eq. 1.

2.2 GRAMMAR-BASED CODE SKETCH GENERATION

In experiments, we found the simple extended sequence model from above not to perform well; in particular, ▢ tokens would not replace semantically meaningful subsequences (e.g., szconv.▢ does not contain a left parenthesis and requires the user to fill it in). To resolve this, we developed GRAMMFORMER, a grammar-guided model. It generates code by following the structure of the context-free grammar (CFG) defining the programming language syntax, iteratively expanding non-terminal symbols. Crucially, it can choose not to expand some non-terminal symbols, which can then be presented as ▢ to users. In traditional grammar-based generation of text (Cohen et al., 2012) or code (Maddison & Tarlow, 2014; Yin & Neubig, 2017; Allamanis & Sutton, 2014; Bielik et al., 2016), the CFG is followed by sequentially expanding the left-most, bottom-most non-terminal symbol, using one of the production rules of the grammar. GRAMMFORMER changes this and instead selects the non-terminal symbol to expand, if any. An example generation is shown in Fig. 2.

Probabilistic Model A CFG is defined as a tuple (Σ, N, S, R), where Σ is a set of terminal symbols, N is a set of non-terminal symbols, S ∈ N is the root symbol and R is a set of production rules. We denote non-terminals as ⟨NonTerminalName⟩. GRAMMFORMER can be viewed as a sequence-to-sequence model transforming x = x0, x1, ..., xn into a new sequence in which one non-terminal symbol xi has been replaced by a sequence of new symbols, according to a production rule of the grammar. Examples of such sequences and rewrites are shown in Fig. 2. GRAMMFORMER does this rewriting in two steps. First, a non-terminal selector model Ps selects a non-terminal in x to expand, and then the non-terminal expansion model Pe determines how to expand it. To define Ps, let N(x) = {i | xi ∈ N} ∪ {⊥} denote the set of non-terminal positions in x, together with a special stop-expansion symbol ⊥. Conditioned on x, Ps produces a probability distribution over N(x). In turn, Pe is conditioned on x and a position i ∈ N(x) and models a probability distribution over expansion sequences u ∈ (Σ ∪ N)*. Note that factorising GRAMMFORMER into two models Ps and Pe is an important modelling decision: how to best expand a non-terminal is entirely separated from predicting whether a hole should be introduced. These two concepts are intermixed in standard (sequence) decoders. In practice, we define both models using neural architectures with partially shared parameters, as discussed below.

Algorithm 1 GRAMMFORMER generative process, given an input sequence x(0).
  for t = 0, 1, 2, ... do
    i(t) ∼ Ps(i | x(t), N(x(t)))   ▷ sample a non-terminal position from N(x(t)) to expand
    if i(t) = ⊥ then break         ▷ stop generation if x(t) contains no non-terminals or none was selected by Ps
    u(t) ∼ Pe(u | x(t), i(t))      ▷ sample an expansion of the non-terminal at position i(t)
    x(t+1) ← x(t) with the non-terminal at position i(t) replaced by u(t)   ▷ create x(t+1)
  return NONTERMINALSTOHOLES(x(t))   ▷ convert remaining non-terminals to holes ▢ and return
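To make the control flow of the listing concrete, it can be rendered as a short Python sketch. This is illustrative only: sample_position and sample_expansion stand in for Ps and Pe, the NonTerminal wrapper and the max_steps safeguard are our additions, and HOLE is the hole token from the earlier sketches.

```python
from dataclasses import dataclass

HOLE = "<hole>"  # stand-in for the paper's hole token

@dataclass(frozen=True)
class NonTerminal:
    """A grammar non-terminal such as <Expr> still awaiting expansion."""
    name: str

STOP = None  # stands for the special stop-expansion symbol

def generate_sketch(x, sample_position, sample_expansion, max_steps=100):
    """Alg. 1 (sketch): repeatedly pick a non-terminal with Ps and expand it
    with Pe; any non-terminal left when Ps chooses to stop becomes a hole."""
    x = list(x)
    for _ in range(max_steps):
        candidates = [i for i, s in enumerate(x) if isinstance(s, NonTerminal)]
        if not candidates:
            break                                    # no non-terminals left to expand
        i = sample_position(x, candidates)           # ~ Ps(i | x, N(x)); may return STOP
        if i is STOP:
            break                                    # Ps chose the stop-expansion symbol
        expansion = sample_expansion(x, i)           # ~ Pe(u | x, i): a list of symbols
        x = x[:i] + list(expansion) + x[i + 1:]      # splice the expansion in place of the non-terminal
    # NONTERMINALSTOHOLES: unexpanded non-terminals are surfaced as holes
    return [HOLE if isinstance(s, NonTerminal) else s for s in x]
```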
Alg. 1 shows a high-level description of GRAMMFORMER, in which Ps and Pe are used repeatedly to select and expand non-terminals (not necessarily the left-most one), until none are left or Ps indicates that expansion should stop. Here, NONTERMINALSTOHOLES(·) replaces all remaining non-terminal symbols with a hole ▢. Note that, first, GRAMMFORMER is not context-free, taking into account the whole input sequence when expanding a non-terminal. Second, in contrast to many grammar-based methods (Yin & Neubig, 2017; Bielik et al., 2016), any non-terminal can be expanded at each step. Finally, Pe is not directly constrained to follow the production rule set R, but can generate any sequence. In practice, it learns to follow the rules of R from the data, but this flexibility is important for handling string literals and argument tuples of variable length.

Neural Model To implement Ps and Pe, we use a shared encoder module that computes a representation of the input sequence x = x0, ..., xn as vectors e0, ..., en with ei ∈ ℝ^D, where D is a hyperparameter. Our encoder module is a Transformer (Vaswani et al., 2017), given the impressive results of transformer-based models in NLP and code (Feng et al., 2020). Other architectures (RNNs, 1D-CNNs, Transformer variants) would be suitable, but we leave their study for future work. Ps is implemented similarly to a pointer network on top of this encoder module, i.e.,

$$P_s(i \mid x) = \operatorname{softmax}_{i \in N(x)}\left(f(e_i)\right),$$

where f is a learnable feed-forward neural network. For our purposes, we define $e_\bot$ as the representation of the special start symbol [CLS] used in our Transformer encoder. The expansion model Pe follows a standard autoregressive decoder formulation, i.e.,

$$P_e(u \mid x, i) = \prod_{j=1}^{|u|} P_{dec}\left(u_j \mid e_0, \ldots, e_n, i, u_{<j}\right).$$
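As an illustration of the pointer-network-style selector, here is a minimal PyTorch sketch of Ps. It assumes the encoder states e_0, ..., e_n are already computed for one sequence, that index 0 holds the [CLS] state standing in for the stop symbol, and that f is a small feed-forward scorer; module and argument names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class NonTerminalSelector(nn.Module):
    """Pointer-network-style P_s: a softmax over the encoder states of the
    candidate non-terminal positions (plus the stop-expansion symbol)."""

    def __init__(self, d_model: int):
        super().__init__()
        # f: a learnable feed-forward scorer mapping each state to a logit
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                               nn.Linear(d_model, 1))

    def forward(self, encoder_states: torch.Tensor, candidate_positions: list[int]):
        """encoder_states: (n+1, d_model) vectors e_0..e_n for one sequence;
        candidate_positions: indices of non-terminals in N(x), with index 0
        (the [CLS] state) used as the stop symbol. Returns a distribution
        over [stop] + candidate_positions."""
        idx = torch.tensor([0] + candidate_positions)      # stop symbol first
        logits = self.f(encoder_states[idx]).squeeze(-1)   # shape: (|N(x)| + 1,)
        return torch.softmax(logits, dim=-1)

# Usage: probs[0] is the probability of stopping; probs[k] is the probability
# of expanding the non-terminal at candidate_positions[k - 1].
# selector = NonTerminalSelector(d_model=512)
# probs = selector(encoder_states, candidate_positions=[3, 7])
```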