# Structural Language Models of Code

Uri Alon¹, Roy Sadaka¹, Omer Levy²³, Eran Yahav¹

¹Technion, Israel  ²Tel Aviv University  ³Facebook AI Research

Abstract

We address the problem of any-code completion: generating a missing piece of source code in a given program without any restriction on the vocabulary or structure. We introduce a new approach to any-code completion that leverages the strict syntax of programming languages to model a code snippet as a tree: structural language modeling (SLM). SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous techniques that have severely restricted the kinds of expressions that can be generated in this task, our approach can generate arbitrary code in any programming language. Our model significantly outperforms both seq2seq and a variety of structured approaches in generating Java and C# code. Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/. An online demo is available at http://AnyCodeGen.org.

1. Introduction

Code completion is the problem of generating code given its surrounding code as context. In its most general form, this problem is extremely challenging, as it requires reasoning over an unbounded number of syntactic structures and user-defined symbols. Previous approaches have avoided this issue by limiting the generation problem: program synthesis approaches are often tailored to domain-specific languages (Gulwani, 2011; Polozov & Gulwani, 2015; Devlin et al., 2017; Ellis et al., 2019), while other recent approaches generate code in general languages like Java and C#, but severely restrict the syntax, vocabulary, domain, or nature of the generated programs (Murali et al., 2018; Brockschmidt et al., 2019; Young et al., 2019).

We introduce the task of any-code completion: generating code in a general-purpose programming language without any restriction on its vocabulary or structure. Specifically, we focus on generating code in context: given a program P and some part of the program p, the task is to predict p from the rest of the program P⁻ = P \ p. Any-code completion thus generalizes the restricted completion task of Brockschmidt et al. (2019), in which the target code contained only primitive types (e.g., int and string) and excluded user-defined functions. Figure 1 shows two any-code completion examples.

In related tasks such as semantic parsing (Dong & Lapata, 2018; Yu et al., 2018; Iyer et al., 2019), natural-language-to-code (Allamanis et al., 2015; Iyer et al., 2018), and edit-to-code (Yin et al., 2019; Zhao et al., 2019), models must use separate encoders and decoders because of the different modalities of the input (e.g., natural language text) and the output (code). In contrast, we leverage the fact that our input and output are of the same modality (code), and pursue better generalization by modeling them jointly.

We present a new approach that explicitly models the source and the target code as the same tree: structural language modeling (SLM).
SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node, generalizing over traditional language models that consider sequences of words. While prior work uses AST paths to read programs (Alon et al., 2019b), we generate code by predicting the next node along the set of paths, generating the target AST node-by-node.

We evaluate SLMs on Java any-code completion, achieving a new state of the art: exact-match accuracy@1 of 18.04% and accuracy@5 of 24.83% (previous SOTA: 16.93% and 23.17%). SLMs also outperform existing models in the restricted completion task of Brockschmidt et al. (2019) in C# by a wide margin: 37.61% accuracy@1 compared to 26.42%. Our ablation study reveals the importance of jointly modeling the source and target code, rather than separating encoders from decoders. Finally, we discuss the theoretical advantages of SLMs, and show how they generalize many previous structural approaches for code generation. An interactive demo of our model is presented at http://AnyCodeGen.org.

Figure 1. Examples from the Java (left) and C# (right) test sets. The highlighted expression in each example is the target p, which our models correctly generated from the rest of the snippet. Additional and larger examples can be found in the supplementary material.

2. Code Generation as Structural Language Modeling

We model the task of any-code completion by computing the probability of a program Pr(P), similar to how a language model computes the probability of a natural language sentence. While language models typically assume a sequence as their input, our input is an abstract syntax tree A_P. We thus introduce a structural language modeling approach (SLM). The intuition behind this idea is that a language model could generalize better by modeling the tree rather than the sequential form of the program. Further, learning from the AST allows a model to save learning capacity, instead of having to re-learn known syntactic patterns from the text.

We first show a chain-rule decomposition of the tree's probability Pr(A_P) into a product of conditional node probabilities, and then describe our path-based model for computing the individual conditional probabilities. We explain how to construct a tree from local node predictions, and finally discuss how our approach differs from previous work on production-based tree generation.

Representing Code as a Tree. A program P is a sequence of tokens that can be unambiguously mapped to an abstract syntax tree (AST) A_P, where every node represents an element in the language (e.g., conditions, loops, variable declarations) from a set T. Each AST leaf (terminal) has an associated user-defined value v ∈ V. Nonterminal nodes can have a varying number of children.

Decomposing the Probability of a Tree. Given a tree A_P, we first traverse the tree depth-first to induce an ordering over its nodes a_0, ..., a_{|A_P|} ∈ A_P. We decompose the probability of the tree Pr(A_P) using the chain rule, akin to the standard approach in language modeling:

$$\Pr(\mathcal{A}_P) = \prod_t \Pr(a_t \mid a_{<t}),$$

where a_{<t} denotes the nodes that were traversed before a_t.
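To make the decomposition concrete, the following is a minimal sketch of how a depth-first traversal turns a tree into a sequence of (context, next-node) prediction steps whose conditional probabilities multiply to Pr(A_P). The Node class and field names are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str                      # nonterminal type from T, or a terminal's kind
    value: Optional[str] = None    # user-defined value v in V (leaves only)
    children: List["Node"] = field(default_factory=list)

def prediction_steps(root: Node):
    """Depth-first traversal yielding (nodes generated so far, next node).

    Multiplying Pr(next | generated so far) over all steps gives Pr(A_P),
    mirroring the chain-rule decomposition above.
    """
    generated = []                     # plays the role of a_{<t}
    def dfs(node: Node):
        yield list(generated), node    # predict a_t given a_{<t}
        generated.append(node)
        for child in node.children:
            yield from dfs(child)
    yield from dfs(root)

# Tiny example: the AST of the expression `x > 1`
tree = Node("Greater", children=[Node("Name", "x"), Node("IntLiteral", "1")])
for context, target in prediction_steps(tree):
    print(len(context), target.kind, target.value)
```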
Figure 2. The subtree representing x > 1 is generated given its surrounding tree. At each step, the model generates the next node (denoted by "?") of path1, path2, and path3 using the root path R. Dashed lines denote the AST structure; solid lines denote AST paths. Most AST paths are omitted from the figure for clarity.

We follow Alon et al. (2018) and use the set of paths from every leaf to a_t, together with the path from the root to a_t. Intuitively, each path captures the effect of a different, possibly distant, program element on a_t, along with the syntactic relationship between them. For example, in Figure 1 (left), the three paths originating from Path[] ret inform the model about the existence of ret, which is an array of type Path. Thus, when completing ret[i] = ..., the completion should be a Path object. Other paths inform the model that the target is inside a for loop, iterated stats.length times. Considering the information flowing from all paths, our model correctly generates stats[i].getPath().

We denote the (candidate) node at time t as a_t, its (given) parent, which is currently being expanded, as π(a_t), and the set of all paths as S_t:

$$S_t = \{\ell \rightsquigarrow \pi(a_t) \mid \ell \in \mathrm{leaves}(a_{<t})\} \cup \{a_0 \rightsquigarrow \pi(a_t)\},$$

where ℓ ⇝ π(a_t) is the path in the tree from the leaf ℓ to the currently expanded node, and a_0 ⇝ π(a_t) is the path from the root (R in Figure 2). Figure 2 illustrates the generation process and some of the paths used to compute the probability at each step. At each step, the probability of the next node is computed given the paths S_t from the root and from every given leaf up to the current node to expand. Figure 2(d) shows how, after the terminal node with the value x is given, path3, which originates from this leaf, is also used to compute the probability of the next nodes.

Our path-based approach generalizes previous approaches such as parent feeding and previous action encoding (Yin & Neubig, 2017), context nodes (Bielik et al., 2016), and some of the graph edges of Brockschmidt et al. (2019). See Section 8 for further discussion.

Figure 3. Augmenting the AST with EOS_node and EOS_tok nodes.

Generating Trees. In sequence generation, the length of the target sequence is controlled by generating an EOS token to stop. When generating trees, we require a more sophisticated mechanism to control arity and depth. We augment A_P in two ways to allow node-by-node generation. First, we add a special EOS_node node to every nonterminal to control for arity: generating this node indicates that the parent node has no more children. Second, we end each subtoken sequence with a special EOS_tok node to control for depth during generation; we decompose each terminal node n_v into a sequence of terminal nodes T_v by splitting the node's value v into subtokens based on camel notation. For example, if v = toLowerCase, then T_v = to → lower → case → EOS_tok. Figure 3 shows an example of both EOS_node and EOS_tok in action.

Node Trees vs. Production Trees. While we predict a single node at each step, previous work (Iyer et al., 2018; 2019) predicts a grammar production rule. This representation decomposes the code in a way that often forces the model to predict with partial information. For instance, consider generating the expression str.Substring(3). The model of Brockschmidt et al. (2019) would first predict the rule Expr → Expr.Substring(Expr), and only then expand Expr → str and Expr → 3. That is, the model needs to predict the method name (Substring) before the invoking object (str). Further, the Substring method can take either one or two arguments, forcing the model to choose between the one- and two-argument rules in advance. Node generation, however, allows us to predict the presence of a function call and only then to predict its object and method name, rather than predicting these a priori.
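To illustrate the tree-generation mechanism described above (the EOS_node and EOS_tok augmentation and camel-case subtoken splitting), here is a minimal sketch. The helper names and the regular expression are my own illustration under these assumptions, not the paper's code.

```python
import re

EOS_TOK = "<EOS_tok>"    # ends a terminal's subtoken sequence (controls depth)
EOS_NODE = "<EOS_node>"  # ends a nonterminal's child list (controls arity)

def split_subtokens(value: str):
    """Decompose a terminal value into camel-case subtokens, ending with EOS_tok."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", value)
    return [p.lower() for p in parts] + [EOS_TOK]

def children_with_stop(children):
    """Yield a nonterminal's children followed by EOS_node, so the generation
    process knows when to stop producing siblings."""
    yield from children
    yield EOS_NODE

print(split_subtokens("toLowerCase"))                    # ['to', 'lower', 'case', '<EOS_tok>']
print(list(children_with_stop(["Name", "IntLiteral"])))  # ['Name', 'IntLiteral', '<EOS_node>']
```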
3. Model Architecture

In the previous section, we described how we can generate code given the probabilities Pr(a_t | a_{<t}).

For example, x > 1 and y > 2 would not count as identical under exact match, but would count as tree-match identical, because both express that an identifier is greater than an integer (NAME > INT). The tree@k metric is interesting because it allows us to tease apart the model's syntactic errors from incorrect subtoken predictions.
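To make the exact-match versus tree-match comparison concrete, the following is a minimal sketch of my own (not the evaluation code used in the paper) that abstracts identifier and literal leaf values before comparing ASTs, so that x > 1 and y > 2 compare equal under tree-match but not under exact match. The node kinds are assumed labels for illustration.

```python
# An AST node is a (kind, value, children) triple; identifier and literal
# leaves are abstracted to their kind (e.g. NAME, INT) before comparison.
ABSTRACT_KINDS = {"NAME", "INT", "STRING"}   # assumed leaf kinds, for illustration only

def skeleton(node):
    kind, value, children = node
    kept_value = None if kind in ABSTRACT_KINDS else value
    return (kind, kept_value, tuple(skeleton(c) for c in children))

def exact_match(a, b):
    return a == b

def tree_match(a, b):
    return skeleton(a) == skeleton(b)

x_gt_1 = ("Greater", None, [("NAME", "x", []), ("INT", "1", [])])
y_gt_2 = ("Greater", None, [("NAME", "y", []), ("INT", "2", [])])

print(exact_match(x_gt_1, y_gt_2))  # False: identifier and literal values differ
print(tree_match(x_gt_1, y_gt_2))   # True: both have the shape NAME > INT
```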
4.2. Baselines

We compare our model to a variety of original implementations and adaptations of existing models. We put significant effort into performing a fair comparison, including adding a copy mechanism to the NMT baselines and using subtokenization as in our model. We adapt strong baselines from the literature to our task, even if they were designed for different tasks such as NL→code and code→NL. We re-train all of the following baselines on the same datasets as our models.

NMT. We use standard autoregressive sequence-to-sequence NMT baselines, in which we subtokenize the given code snippet, replace the target in the source with a special PRED symbol, and train the network to predict the target as a sequence of subtokens. Transformer_base+copy (Vaswani et al., 2017) uses the OpenNMT implementation (Klein et al., 2017) with a copy mechanism (Gu et al., 2016). Transformer_small+copy uses d_model=256, d_ff=1024, and 4 self-attention heads per layer. BiLSTM→LSTM+copy is a 2-layer bidirectional LSTM encoder-decoder with d=512 and attention. seq2tree+copy follows Aharoni & Goldberg (2017) and learns to generate the linearized, subtokenized target AST.

Java-specific Baselines. We use the original implementation of Iyer et al. (2018), and also their seq2prod baseline, which is a re-implementation of Yin & Neubig (2017); these are designed for NL→code tasks, so we feed the code context as the NL input. The model of Iyer et al. (2018) is designed to receive the available variables and their types as additional input; we do not feed types. While these models could also be applied to other languages, their implementation only supports Java.

| Model | acc@1 | acc@5 | tree@1 | tree@5 |
|---|---|---|---|---|
| code2seq (Alon et al., 2019a) | 10.68 | 15.56 | 30.46 | 43.94 |
| Iyer et al. (2018) | 5.94 | 9.19 | 25.54 | 36.75 |
| seq2prod (Yin & Neubig, 2017) | 8.05 | 11.82 | 30.77 | 41.73 |
| Transformer_small (Vaswani et al., 2017) + copy | 14.23 | 21.35 | 31.83 | 47.40 |
| Transformer_base (Vaswani et al., 2017) + copy | 16.65 | 24.05 | 34.68 | 50.52 |
| BiLSTM→LSTM (Luong et al., 2015) + copy | 16.93 | 23.17 | 34.29 | 49.72 |
| seq2tree (Aharoni & Goldberg, 2017) + copy | 16.81 | 23.04 | 38.14 | 52.36 |
| SLM (this work) | 18.04 | 24.83 | 39.10 | 55.32 |

Table 1. Results on any-code completion in Java.

C#-specific Baselines. We compare our model to the graph-based GNN→NAG model, using the implementation of Brockschmidt et al. (2019). Bielik et al. (2016) kindly trained and tested their non-neural PHOG model on our C# dataset. We note that PHOG does not have an explicit copy mechanism and considers only context to the left of the target code, while we also consider context to the right; extending PHOG could potentially improve its results.

In both Java and C#, we compare to code2seq (Alon et al., 2019a), which is a strong code→NL model. We train it to generate the target code as a sequence of subtokens.

4.3. Implementation and Hyperparameter Settings

Architecture. We use embeddings of size 512, 2 layers of LSTMs with 256 units, and 4 transformer layers with 8 attention heads. We kept a small subtoken vocabulary of size 1000 to encourage the model to learn to copy; larger vocabularies did not show an improvement. This results in a very lightweight model of only 15M parameters, which is close to Transformer_small (11.8M parameters). In comparison, Transformer_base has more than 45M parameters (3× as many as our model).

Training. We train the model end-to-end on a single V100 GPU, using cross-entropy loss and the Adam optimizer (Kingma & Ba, 2015), with an initial learning rate of 10^-4 decayed by a factor of 0.95 every 20k steps. We bucket examples based on the number of predictions in the target subtree (nodes + subtokens + EOS), and vary the batch size such that each batch contains about 512 targets. We train the model to prefer copying entire tokens rather than copying subtokens, when possible, by optimizing for the entire token as the true label. We apply a dropout of 0.25 in the Transformer layers, and a recurrent dropout of 0.5 in the LSTMs.

Inference. We perform beam search with a width of 5 and optimize for accuracy@1.

5. Results

Any-Code Completion: Java. Table 1 shows that our SLM achieves over 1.1% and 0.78% better acc@1 and acc@5 (respectively) than the two strongest baselines. The improvement over Transformer_small, which is closer to our model in the number of parameters, is even higher: over 3.8% and 3.4% in acc@1 and acc@5.

The NMT baselines performed better than the code-specific baselines. We hypothesize that the reason is that the NMT baselines are more generic, while the code-specific baselines are designed for different tasks: seq2prod is designed for tasks that involve generating code given natural language input; Iyer et al. (2018) additionally expects all member methods, fields, and their types as input; code2seq is designed to generate sequences rather than code, and does not have a copy mechanism. An approximation of code2seq with a copy mechanism is presented in Section 6.

Interestingly, the syntactically-informed seq2tree baseline achieved the highest tree@k among the baselines, while our model achieved higher acc@k and tree@k. This shows that leveraging the syntax can benefit NMT models as well.

| Model | acc@1 | acc@5 | tree@1 | tree@5 |
|---|---|---|---|---|
| GNN→NAG | 15.19 | 27.05 | 26.48 | 40.09 |
| code2seq | 6.20 | 10.05 | 21.97 | 30.89 |
| seq2seq+copy | 26.42 | 37.94 | 34.10 | 49.23 |
| seq2tree+copy | 22.29 | 35.86 | 31.85 | 48.53 |
| PHOG | 7.40 | 12.00 | - | - |
| SLM (this work) | 37.61 | 45.51 | 51.10 | 59.82 |

Table 2. Results on restricted completion in C#.

Restricted Completion: C#. Table 2 shows the results for the restricted completion task in C#, where seq2seq+copy is the BiLSTM→LSTM+copy model, which performed best among the Java baselines. We first observe that the seq2seq+copy and seq2tree+copy baselines outperform the GNN→NAG model of Brockschmidt et al. (2019), who introduced this task.
Although Brockschmidt et al. (2019) did compare to a seq2seq baseline, their GNN→NAG model could copy symbols from the context, but their baseline did not. To conduct a fair comparison with our SLM model, we equipped the seq2seq and seq2tree baselines with a copy mechanism. Even though the seq2seq+copy and seq2tree+copy baselines perform substantially better than the previous state of the art in this setting, our SLM model goes beyond them, achieving significant gains over all models.

The superiority of our model over GNN→NAG may also be related to the GNN bottleneck (Alon & Yahav, 2020), which hinders GNNs from propagating long-range messages. In contrast, propagating long-range messages using paths is natural for our model.

6. Ablation Study

To understand the importance of the various components and design decisions in our model, we conducted an extensive ablation study.

Paths→Seq follows code2seq (Alon et al., 2019a) and separates the model into an encoder and a decoder, where the decoder generates the target code as a sequence of subtokens. The main difference from code2seq is that Paths→Seq includes a copy mechanism, as in our model.

Seq→Path follows Rabinovich et al. (2017) and separates our model into an encoder and a decoder (including a copy mechanism), where the encoder encodes the context as a sequence of subtokens using a BiLSTM, and the decoder generates the missing subtree using the root path and the index of the generated child.

Paths→Paths is similar to our SLM model, except that it uses a separate encoder and decoder with untied weights, unlike our SLM model, which models the source and the target jointly.

No Root Attention uses max pooling instead of attention when aggregating multiple paths (see Section 3.2). The index-informed path from the root to the target's parent (R in Figure 2) is concatenated with the result, instead of being used as the attention query.

No Copy replaces the copy mechanism with a much larger vocabulary (25k subtokens instead of 1k).

| Ablation | acc@1 | acc@5 | tree@1 | tree@5 |
|---|---|---|---|---|
| Paths→Seq | 12.95 | 18.52 | 33.44 | 43.43 |
| Seq→Path | 12.12 | 17.12 | 28.68 | 43.99 |
| Paths→Paths | 17.63 | 24.62 | 37.78 | 53.98 |
| No Root Att | 14.43 | 18.48 | 28.20 | 35.65 |
| No Copy | 10.72 | 15.70 | 30.61 | 44.35 |
| SLM (original) | 18.04 | 24.83 | 39.10 | 55.32 |

Table 3. Ablations on any-code completion in Java.

Results. Table 3 shows the results of these alternatives. Since our SLM model performs better than Paths→Paths, this ablation shows the importance of jointly modeling the context and the target subtree through parameter tying. Each of Paths→Paths and the seq2seq baselines (Table 1) performs better than Paths→Seq and Seq→Path; this shows the importance of using the same type of encoder and decoder for any-code completion, rather than combining an "optimal encoder" with an "optimal decoder". While this distinction between encoder and decoder types might be necessary for semantic parsing (Rabinovich et al., 2017; Dong & Lapata, 2018), NL→code (Yin & Neubig, 2017), and code→NL (Alon et al., 2019a; Fernandes et al., 2019) tasks because of the different modalities of the input and the output, this discrepancy may hurt generalization when the output is essentially a missing part of the input's AST. Paths→Paths performs better than the seq2seq baselines (Table 1), showing the advantage of using paths over textual sequences, even without parameter tying.

No Root Attention degrades acc@1 and acc@5 by 3.6% to 6.3%. This shows that dynamically attending to the context paths given the current root path is crucial. Not using a copy mechanism results in a degradation of 7.3% to 9.1%. Programs use symbols and identifiers repetitively, so the ability to copy symbols from the context is crucial for this task. For this reason, we included a copy mechanism in all NMT baselines in Section 4.
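Since the copy mechanism accounts for such a large gap, the following is a generic pointer-style copy sketch in the spirit of Gu et al. (2016): the final distribution mixes a vocabulary distribution with attention weights over context subtokens. This is a simplified illustration with made-up names and a single shared out-of-vocabulary slot, not the SLM implementation.

```python
import numpy as np

def copy_augmented_distribution(p_vocab, attention, source_subtokens, vocab, p_gen):
    """Mix a vocabulary (generation) distribution with a copy distribution
    induced by attention over the context subtokens.

    p_vocab:          (|V|,) generation probabilities over the subtoken vocabulary
    attention:        (src_len,) attention weights over context positions (sum to 1)
    source_subtokens: the subtoken at each context position
    p_gen:            probability of generating from the vocabulary vs. copying
    """
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    oov_slot = len(vocab)                # shared slot for out-of-vocabulary copies (simplification)
    p_final = np.append(p_gen * p_vocab, 0.0)
    p_copy = np.zeros(len(vocab) + 1)
    for weight, tok in zip(attention, source_subtokens):
        p_copy[token_to_id.get(tok, oov_slot)] += weight
    return p_final + (1.0 - p_gen) * p_copy

vocab = ["get", "path", "length", "<unk>"]
p_vocab = np.array([0.5, 0.2, 0.2, 0.1])
attention = np.array([0.7, 0.3])         # attends mostly to an out-of-vocabulary context token
print(copy_augmented_distribution(p_vocab, attention, ["stats", "length"], vocab, p_gen=0.6))
```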
7. Qualitative Analysis

Our main results (Table 1 and Table 2) reveal a gap between acc@k and tree@k: when ignoring identifier values and comparing only the tree structure, accuracy is significantly higher across all models. While our SLM model performs better than all baselines in acc@k, it also shows greater potential for improvement in its tree@k results, which are much higher than the baselines'. We thus focus on studying the cases where the tree was predicted correctly, but the model failed to generate the code exactly, including names.

Figure 4. Examples of cases where the top candidate is a tree-match but only the second candidate is an exact match; predictions that are logically equivalent to the ground truth are also marked. Additional (and larger) examples, along with the predictions of the baselines, are shown in the supplementary material.

Figure 4(a) shows an example of this case: the ground truth has a structure of the form NAME.NAME() > INT. Our model predicts value.length() > 0 (a tree-match) as its first candidate and value.length() > 55 (the ground truth) as its second. Null-checking a string is often followed by checking that it is also not empty, making the first candidate a reasonable prediction as well. Figure 4(b) shows another example: in this case, the ground truth thisValue == thatValue ? 0 : 1 was predicted correctly only as the second candidate. Nevertheless, the top-3 candidates are tree-matches, since all of them are of the form NAME == NAME ? INT : INT. Interestingly, the fifth candidate, (thisValue == thatValue) ? 0 : 1, is logically equivalent to the ground truth.

In both examples, our model's top candidate differs from the ground truth by a single identifier or literal: in Figure 4(a) the model predicted 0 instead of 55; in Figure 4(b) the model predicted thisValue instead of thatValue. Such single-subtoken errors are responsible for 30% of the cases where the model's top prediction is a tree-match but not an exact match. Single-token (whole identifier or literal) mismatches are responsible for 74% of these cases. Thus, improving our model's ability to predict the right names has the potential to further enhance our gains. Detailed results of allowing such mistakes in our model and in the baselines can be found in the supplementary material.

An additional possible post-filtering step could filter out candidates that do not compile.
In Figure 5, the first, third, and fourth candidates do not compile, because the this.currentAttempt object does not have getCount, get, or getTime methods. If the model's predictions had been considered in the context of the entire project, including its dependencies, these candidates could have been filtered out, and the (correct) fifth candidate would have been ranked second. We leave compiler-guided code generation to future work.

Additional examples can be found in the supplementary material and in our interactive demo at http://AnyCodeGen.org.

8. Related Work

Generalizing Previous Approaches. Our approach frames code generation as predicting the next node in all partial AST paths. This simple framing generalizes most previous work, without hand-crafted edges and special actions: models that use information about ancestor nodes only (Rabinovich et al., 2017), as well as the parent feeding of Yin & Neubig (2017), are generalized by our model, since all paths that go into a node a_t pass through its parent, and the path from the root is the attention query. The previous action encoding of Yin & Neubig (2017) is also a special case of our approach, because S_t contains the paths starting from the previously expanded leaves of A_P into the currently expanded node π(a_t), such as path3 in Figure 2(e). The context node of PHOG (Bielik et al., 2016) is just one of the previously-traversed leaf nodes in a_{<t}