# Learning to Complete Code with Sketches

Published as a conference paper at ICLR 2022

Daya Guo, School of Computer Science and Engineering, Sun Yat-sen University, China. guody5@mail2.sysu.edu.cn
Alexey Svyatkovskiy, Microsoft, Redmond, WA, USA. alsvyatk@microsoft.com
Jian Yin, School of Computer Science and Engineering, Sun Yat-sen University, China. issjyin@mail.sysu.edu.cn
Nan Duan, Microsoft Research, Beijing, China. nanduan@microsoft.com
Marc Brockschmidt, Miltiadis Allamanis, Microsoft Research, Cambridge, UK. {mabrocks,miallama}@microsoft.com

ABSTRACT

Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with holes inserted in places where a model is uncertain. We develop GRAMMFORMER, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models. We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric, REGEXACC, that measures success at generating completions that match long outputs with as few holes as possible. In our experiments, GRAMMFORMER generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.

1 INTRODUCTION

Recent high-capacity language models (LMs) have shown that machine learning models are able to generate coherent, realistic text, but it is often hard to guide them towards a specific goal, especially when describing the intent is complex or more costly than manually generating the target output. One such scenario is LMs of source code (LMCs). Since Hindle et al. (2012), increasingly sophisticated LMCs have been built, including transformer-based ones, such as those of Svyatkovskiy et al. (2020), Feng et al. (2020) and Chen et al. (2021), and various similar unpublished models such as TabNine and Source AI. These models generate full sequences of code tokens left-to-right, with any prefix acting as the (partial) user intent.

While LMs generate realistic-looking outputs, they are known to occasionally hallucinate (Puduppully et al., 2019; Malmi et al., 2019; Maynez et al., 2020; Liu et al., 2021), i.e., generate plausible but incorrect content. This is particularly problematic when generating source code, where small mistakes can lead to erroneous code that is very hard to debug or that introduces vulnerabilities (Pearce et al., 2021).

In this work, we investigate models that can decline to make predictions in places where there is high uncertainty (e.g., where the user should choose a name), but continue generating around these holes, written ▢ in the following. For example, in Fig. 1 (left) a developer has typed some code and is about to type the next line. A likely completion is to consume more command line arguments, but their name is unclear from the context.

[Figure 1] Code context (cursor at line 3):
  1 import sys
  2 target = sys.argv[1]
  3 ▌
Ground truth: ID = sys.argv[2]. Suggested code completions: L→R: target = target.replace("\\", "/"); L→R+▢: target = ▢; L→R+▢: print(target); Copilot: (no suggestion); GRAMMFORMER: ▢ = sys.argv[2].
Figure 1: A sample snippet (left; abbreviated from Fig. 12 in Appx. A). A developer has just typed the code and their cursor (in blue) is at line 3. Code completions provided by a number of models are shown on the right, where L→R is a standard LMC and GRAMMFORMER is our new model.
A traditional generative model (e.g., Fig. 1, top right) may choose to provide a completion that exists in the training data, but is not clearly called for here. On the other hand, a model able to explicitly mark where it is uncertain (Fig. 1, bottom right) makes it clear to a user where further input is required.

However, creating such models is not trivial. A simple first attempt may be to use a standard LMC, but output a hole token whenever the model is uncertain about the next output token. However, continuing after the ▢ then becomes infeasible, as the LMC was not trained on such data. Hence, a suitable training dataset and objective need to be devised. As no large datasets with holes exist, we instead choose to use a reinforcement learning approach in which our reward function encourages the model to make long predictions with as few ▢ tokens as possible, but to avoid making incorrect predictions. We found that standard left-to-right sequence models perform poorly on this task. Hence, we developed GRAMMFORMER, a model that constructs suggestions by generating a (partial) syntax tree, but which has the option of leaving non-terminals in its output.

Contributions (1) We present GRAMMFORMER, a transformer-based model that generates code based on the programming language grammar and can predict hole tokens ▢ rather than output tokens it is uncertain about. (2) We develop REGEXACC, a metric that evaluates the quality of predictions with holes. (3) We evaluate GRAMMFORMER on Python and C# code and show that GRAMMFORMER makes longer and more precise statement-level sketch completions compared to baselines.

Our aim is to predict code completions as sketches, a mix of actual tokens and holes ▢, which are meant to signify that the model is unable to make a useful prediction within the given context and further user input is required. Formally, we consider models that take a context sequence x of tokens as input and have to produce an output sequence y; intuitively, x is what the user typed so far, and y is the suggestion presented to the user. In our setting, y is a sketch, a mix of tokens from the programming language and the special token ▢ signifying a hole that could be filled by an arbitrary sequence of tokens. For example, t = foo(▢) is a sketch corresponding to assigning the return value of function foo to variable t, but leaves the arguments of the function call undetermined.

Metric A good sketch is one that (a) can be completed into the correct output and (b) is as precise as possible. To measure how successful a method is in doing so, we define a new metric REGEXACC. For (a), we use toRegex(ŷ) to turn a predicted code sketch ŷ into a regular expression by replacing all holes ▢ with the wildcard matching any non-empty sequence (.+ in Perl Compatible Regular Expression syntax). If the regex matches the ground truth, matches(·, ·) returns a score of 1, otherwise it returns 0. To implement (b), we scale this result by the proportion of terminal tokens predicted, defining nTokens(ŷ) as the function that returns the number of non-hole symbols in ŷ. More formally, assume an output sketch ŷ and a ground-truth sequence y⋆, where y⋆ does not contain any ▢ tokens. REGEXACC is then defined as

$$\text{REGEXACC}(\hat{y}, y^\star) = \text{matches}(\text{toRegex}(\hat{y}), y^\star) \cdot \frac{\text{nTokens}(\hat{y})}{\text{nTokens}(y^\star)}.$$
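As a concrete illustration, REGEXACC can be computed with a small Python sketch like the one below, assuming sketches and ground truths are given as token lists. The hole marker "<hole>" and the whitespace-lenient joining of tokens are our own illustrative choices, not the paper's exact implementation.

```python
import re

HOLE = "<hole>"  # stand-in for the paper's hole token

def to_regex(sketch_tokens):
    """Turn a sketch (list of tokens, some of which are holes) into a regex."""
    parts = []
    for tok in sketch_tokens:
        if tok == HOLE:
            parts.append(".+")            # wildcard matching any non-empty sequence
        else:
            parts.append(re.escape(tok))  # terminal tokens must match literally
    return r"\s*".join(parts)             # be lenient about whitespace between tokens

def n_tokens(tokens):
    """Number of non-hole symbols in a (sketch) token sequence."""
    return sum(1 for tok in tokens if tok != HOLE)

def regex_acc(sketch_tokens, ground_truth_tokens):
    """REGEXACC: 1 if the sketch's regex matches the ground truth, scaled by
    the fraction of terminal tokens the sketch commits to; 0 otherwise."""
    pattern = to_regex(sketch_tokens)
    target = " ".join(ground_truth_tokens)
    matches = 1.0 if re.fullmatch(pattern, target) else 0.0
    return matches * n_tokens(sketch_tokens) / max(1, n_tokens(ground_truth_tokens))

# Example: the sketch "t = foo(<hole>)" against the ground truth "t = foo(a, b)"
print(regex_acc(["t", "=", "foo", "(", HOLE, ")"],
                ["t", "=", "foo", "(", "a", ",", "b", ")"]))  # -> 0.625
```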
Beyond REGEXACC, we also consider ROUGE (Lin, 2004), since a sketch can be thought of as a form of summary of the target text. For this, we use a helper function ERASEHOLES(ŷ) that simply drops all ▢ tokens, and then consider ROUGE_F1(ERASEHOLES(ŷ), y⋆). ROUGE is more lenient to errors than REGEXACC and gives partial credit to non-matching but plausible sketches.

[Figure 2] Generation trace:
  x(0): r = ⟨Expr⟩                                              i(0) = 3
  x(1): r = ⟨Expr⟩ * ⟨ParenthesizedExpr⟩                        i(1) = 5
  x(2): r = ⟨Expr⟩ * ( ⟨Expr⟩ )                                 i(2) = 6
  x(3): r = ⟨Expr⟩ * ( ⟨Expr⟩ - ⟨Expr⟩ )                        i(3) = 8
  x(4): r = ⟨Expr⟩ * ( ⟨Expr⟩ - ⟨Identifier⟩ ( ⟨ArgList⟩ ) )    i(4) = 8
  x(5): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( ⟨ArgList⟩ ) )             i(5) = 10
  x(6): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( ⟨Identifier⟩ ) )          i(6) = 10
  x(7): r = ⟨Expr⟩ * ( ⟨Expr⟩ - foo ( args ) )                  i(7) = 3
  x(8): r = ⟨Identifier⟩ * ( ⟨Expr⟩ - foo ( args ) )            i(8) = 3
  x(9): r = x * ( ⟨Expr⟩ - foo ( args ) )                       i(9) = ⊥
Figure 2: Progress of grammar-based code generation of the sketch r = x * (▢ - foo(args)) by GRAMMFORMER. Each line represents consecutive x(t) in Alg. 1. Non-terminals are shown in angle brackets; the non-terminal at position i(t) is selected by Ps and its expansion is generated by Pe, yielding the next line. Fig. 5 and Fig. 6 in Appx. A show real example generation sequences from our datasets.

2.1 LINEAR CODE SKETCH GENERATION

First, we consider the idea of generating code sketches using a standard generative model for language. To this end, we simply extend the vocabulary with the special ▢ token. An obvious problem is that while we have plenty of training data for a standard generative model, we do not have training data for outputs y that contain the ▢ token. Consequently, we cannot train the model in a fully supervised fashion, and instead turn to reinforcement learning. Concretely, we devise a reward function r(·) that averages REGEXACC and ROUGE, i.e., for a predicted output sketch ŷ and a ground truth output (without ▢ tokens) y⋆, we define

$$r(\hat{y}, y^\star) = \tfrac{1}{2}\left(\text{REGEXACC}(\hat{y}, y^\star) + \text{ROUGE}_{F1}(\text{ERASEHOLES}(\hat{y}), y^\star)\right). \tag{1}$$

Using the combination of ROUGE (which does not consider holes) and REGEXACC is crucial here, as ROUGE is much smoother compared to REGEXACC, which is 0 for all but very few predictions; this allows us to measure partial improvement.

We use our reward function from Eq. 1 to evaluate the quality of the output of the full model and compute a loss. Inspired by Paulus et al. (2017), we use self-critical policy gradient training (Rennie et al., 2017) and for a prediction ŷ we minimise

$$\mathcal{L}(x, y^\star) = \left(r(\hat{y}, y^\star) - \bar{r}(x)\right) \cdot \mathcal{L}_{gen}(x, \hat{y}). \tag{2}$$

Here, $\bar{r}(x)$ is the reward achieved by the prediction of the model snapshot that achieved the best score so far, and $\mathcal{L}_{gen}$ is the loss of the generative model. Intuitively, this objective rewards models that improve upon the previous best policy with respect to r. To model this in practice, we use a standard encoder/decoder Transformer model (Vaswani et al., 2017; Radford et al., 2019), translating the context x into the output y using separate encoder and decoder models. We additionally also consider the language modelling case, i.e., a model that conditioned on x predicts token y0, conditioned on x, y0 predicts token y1, etc.
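To make Eqs. 1 and 2 concrete, the following sketch shows one way the reward and the self-critical loss could be computed, reusing regex_acc from the earlier sketch. Here rouge_f1 is an assumed helper returning an F1 score in [0, 1], and model.sample_sketch / model.sequence_nll are placeholder names for the policy's sampling and negative-log-likelihood routines, not the paper's API.

```python
HOLE = "<hole>"  # stand-in for the paper's hole token

def erase_holes(tokens):
    """ERASEHOLES: drop all hole tokens from a sketch."""
    return [t for t in tokens if t != HOLE]

def reward(sketch, ground_truth, rouge_f1):
    """Eq. 1: average of REGEXACC and ROUGE-F1 (ROUGE computed on the hole-free sketch)."""
    return 0.5 * (regex_acc(sketch, ground_truth)
                  + rouge_f1(erase_holes(sketch), ground_truth))

def self_critical_loss(model, x, ground_truth, baseline_sketch, rouge_f1):
    """Eq. 2: scale the generation loss of a sampled sketch by how much its reward
    improves over the reward of the best-snapshot (baseline) prediction."""
    sampled = model.sample_sketch(x)                  # \hat{y}, sampled from the current policy
    advantage = (reward(sampled, ground_truth, rouge_f1)
                 - reward(baseline_sketch, ground_truth, rouge_f1))
    nll = model.sequence_nll(x, sampled)              # L_gen: negative log-likelihood of the sample
    return advantage * nll                            # minimised by gradient descent
```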
Pretraining In practice, we found that directly training a sequence model to maximise Eq. 1 is very slow and does not converge to a useful model. Instead, we heuristically generate a dataset suitable for supervised pretraining: we replace random AST non-terminals of the target output by ▢ and generate target sequences, which contain terminals and zero or more ▢ tokens. We then pretrain the model on this dataset to convergence, and then fine-tune it using the reward of Eq. 1.

2.2 GRAMMAR-BASED CODE SKETCH GENERATION

In experiments, we found the simple extended sequence model from above not to perform well; in particular, ▢ tokens would not replace semantically meaningful subsequences (e.g., szconv.▢ does not contain a left parenthesis and requires the user to fill it in). To resolve this, we developed GRAMMFORMER, a grammar-guided model. It generates code by following the structure of the context-free grammar (CFG) defining the programming language syntax, iteratively expanding non-terminal symbols. Crucially, it can choose not to expand some non-terminal symbols, which can then be presented as ▢ to users. In traditional grammar-based generation of text (Cohen et al., 2012) or code (Maddison & Tarlow, 2014; Yin & Neubig, 2017; Allamanis & Sutton, 2014; Bielik et al., 2016), the CFG is followed by sequentially expanding the left-most, bottom-most non-terminal symbol, using one of the production rules of the grammar. GRAMMFORMER changes this and instead selects the non-terminal symbol to expand, if any. An example generation is shown in Fig. 2.

Probabilistic Model A CFG is defined as a tuple (Σ, N, S, R), where Σ is a set of terminal symbols, N is a set of non-terminal symbols, S ∈ N is the root symbol and R is a set of production rules. We denote non-terminals as ⟨NonTerminalName⟩. GRAMMFORMER can be viewed as a sequence-to-sequence model transforming x = x0, x1, ..., xn into a new sequence in which one non-terminal symbol xi has been replaced by a sequence of new symbols, according to a production rule of the grammar. Examples of such sequences and rewrites are shown in Fig. 2. GRAMMFORMER does this rewriting in two steps. First, a non-terminal selector model Ps selects a non-terminal in x to expand, and then the non-terminal expansion model Pe determines how to expand it. To define Ps, let N(x) = {i | xi ∈ N} ∪ {⊥} denote the set of non-terminal positions in x, together with a special stop-expansion symbol ⊥. Conditioned on x, Ps produces a probability distribution over N(x). In turn, Pe is conditioned on x and a position i ∈ N(x) and models a probability distribution over expansion sequences u ∈ (Σ ∪ N)*. Note that factorising GRAMMFORMER into two models Ps and Pe is an important modelling decision: how to best expand a non-terminal is entirely separated from predicting whether a hole should be introduced. These two concepts are intermixed in standard (sequence) decoders. In practice, we define both models using neural architectures with partially shared parameters, as discussed below.

Algorithm 1 GRAMMFORMER generative process, given an input sequence x(0).
  for t = 0, 1, 2, ... do
    i(t) ∼ Ps(i | x(t), N(x(t)))   ▷ sample a non-terminal position from N(x(t)) to expand
    if i(t) = ⊥ then break         ▷ stop generation if x(t) contains no non-terminals or none was selected by Ps
    u(t) ∼ Pe(u | x(t), i(t))      ▷ sample an expansion of the non-terminal at position i(t)
    x(t+1) ← x(t) with the non-terminal at position i(t) replaced by u(t)   ▷ create x(t+1)
  return NONTERMINALSTOHOLES(x(t))   ▷ convert remaining non-terminals to holes ▢ and return
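To make the control flow of the listing concrete, it can be rendered as a short Python sketch. This is illustrative only: sample_position and sample_expansion stand in for Ps and Pe, the NonTerminal wrapper and the max_steps safeguard are our additions, and HOLE is the hole token from the earlier sketches.

```python
from dataclasses import dataclass

HOLE = "<hole>"  # stand-in for the paper's hole token

@dataclass(frozen=True)
class NonTerminal:
    """A grammar non-terminal such as <Expr> still awaiting expansion."""
    name: str

STOP = None  # stands for the special stop-expansion symbol

def generate_sketch(x, sample_position, sample_expansion, max_steps=100):
    """Alg. 1 (sketch): repeatedly pick a non-terminal with Ps and expand it
    with Pe; any non-terminal left when Ps chooses to stop becomes a hole."""
    x = list(x)
    for _ in range(max_steps):
        candidates = [i for i, s in enumerate(x) if isinstance(s, NonTerminal)]
        if not candidates:
            break                                    # no non-terminals left to expand
        i = sample_position(x, candidates)           # ~ Ps(i | x, N(x)); may return STOP
        if i is STOP:
            break                                    # Ps chose the stop-expansion symbol
        expansion = sample_expansion(x, i)           # ~ Pe(u | x, i): a list of symbols
        x = x[:i] + list(expansion) + x[i + 1:]      # splice the expansion in place of the non-terminal
    # NONTERMINALSTOHOLES: unexpanded non-terminals are surfaced as holes
    return [HOLE if isinstance(s, NonTerminal) else s for s in x]
```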
Alg. 1 shows a high-level description of GRAMMFORMER, in which Ps and Pe are used repeatedly to select and expand non-terminals (not necessarily the left-most one), until none are left or Ps indicates that expansion should stop. Here, NONTERMINALSTOHOLES(·) replaces all remaining non-terminal symbols with a hole ▢. Note that, first, GRAMMFORMER is not context-free, taking into account the whole input sequence when expanding a non-terminal. Second, in contrast to many grammar-based methods (Yin & Neubig, 2017; Bielik et al., 2016), any non-terminal can be expanded at each step. Finally, Pe is not directly constrained to follow the production rule set R, but can generate any sequence. In practice, it learns to follow the rules of R from the data, but this flexibility is important for handling string literals and argument tuples of variable length.

Neural Model To implement Ps and Pe, we use a shared encoder module that computes a representation of the input sequence x = x0, ..., xn as vectors e0, ..., en with ei ∈ ℝ^D, where D is a hyperparameter. Our encoder module is a Transformer (Vaswani et al., 2017), given the impressive results of transformer-based models in NLP and code (Feng et al., 2020). Other architectures (RNNs, 1D-CNNs, Transformer variants) would be suitable, but we leave their study for future work. Ps is implemented similarly to a pointer network on top of this encoder module, i.e.,

$$P_s(i \mid x) = \operatorname{softmax}_{i \in N(x)}\left(f(e_i)\right),$$

where f is a learnable feed-forward neural network. For our purposes, we define $e_\bot$ as the representation of the special start symbol [CLS] used in our Transformer encoder. The expansion model Pe follows a standard autoregressive decoder formulation, i.e.,

$$P_e(u \mid x, i) = \prod_{j=1}^{|u|} P_{dec}\left(u_j \mid e_0, \ldots, e_n, i, u_{<j}\right).$$
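As an illustration of the pointer-network-style selector, here is a minimal PyTorch sketch of Ps. It assumes the encoder states e_0, ..., e_n are already computed for one sequence, that index 0 holds the [CLS] state standing in for the stop symbol, and that f is a small feed-forward scorer; module and argument names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class NonTerminalSelector(nn.Module):
    """Pointer-network-style P_s: a softmax over the encoder states of the
    candidate non-terminal positions (plus the stop-expansion symbol)."""

    def __init__(self, d_model: int):
        super().__init__()
        # f: a learnable feed-forward scorer mapping each state to a logit
        self.f = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                               nn.Linear(d_model, 1))

    def forward(self, encoder_states: torch.Tensor, candidate_positions: list[int]):
        """encoder_states: (n+1, d_model) vectors e_0..e_n for one sequence;
        candidate_positions: indices of non-terminals in N(x), with index 0
        (the [CLS] state) used as the stop symbol. Returns a distribution
        over [stop] + candidate_positions."""
        idx = torch.tensor([0] + candidate_positions)      # stop symbol first
        logits = self.f(encoder_states[idx]).squeeze(-1)   # shape: (|N(x)| + 1,)
        return torch.softmax(logits, dim=-1)

# Usage: probs[0] is the probability of stopping; probs[k] is the probability
# of expanding the non-terminal at candidate_positions[k - 1].
# selector = NonTerminalSelector(d_model=512)
# probs = selector(encoder_states, candidate_positions=[3, 7])
```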