# learning_to_represent_edits__8b1c8f0b.pdf

Published as a conference paper at ICLR 2019

LEARNING TO REPRESENT EDITS

Pengcheng Yin*, Graham Neubig
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA 15213, USA
{pcyin,gneubig}@cs.cmu.edu

Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt
Microsoft Research
Cambridge, CB1 2FB, United Kingdom
{miallama,mabrocks,algaunt}@microsoft.com

ABSTRACT

We introduce the problem of learning distributed representations of edits. By combining a neural editor with an edit encoder, our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.

## 1 INTRODUCTION

One great advantage of electronic storage of documents is the ease with which we can edit them, and edits are performed in a wide variety of contexts. For example, right before a conference deadline, papers worldwide are finalized and polished, often involving common fixes for grammar, clarity and style. Would it be possible to automatically extract rules from these common edits? Similarly, program source code is constantly changed to implement new features, follow best practices and fix bugs. With the widespread deployment of (implicit) version control systems, these edits are quickly archived, creating a major data stream that we can learn from. In this work, we study the problem of learning distributed representations of edits. We only look at small edits with simple semantics that are more likely to appear often and do not consider larger edits; i.e., we consider *add definite articles* rather than *rewrite act 2, scene 3*.
Concretely, we focus on two questions: (i) Can we group semantically equivalent edits together, so that we can automatically recognize common edit patterns? (ii) Can we automatically transfer edits from one context to another? A solution to the first question would yield a practical tool for copy editors and programmers alike, automatically identifying the most common changes. By leveraging tools from program synthesis, such groups of edits could be turned into interpretable rules and scripts (Rolim et al., 2017). When there is no simple hard rule explaining how to apply an edit, an answer to the second question would be of great use, e.g., to automatically rewrite natural language following some stylistic rule.

We propose to handle edit data in an autoencoder-style framework, in which an edit encoder f is trained to compute a representation of an edit x− → x+, and a neural editor α is trained to construct x+ from the edit representation and x−. This framework ensures that the edit representation is semantically meaningful, and a sufficiently strong neural editor allows this representation to not be specific to the changed element. We experiment with various neural architectures that can learn to represent and apply edits, and hope to direct the attention of the research community to this new and interesting data source, leading to better datasets and stronger models.

Briefly, the contributions of our paper are: (a) in Sect. 2, we present a new and important machine learning task of learning representations of edits; (b) we present a family of models that capture the structure of edits and compute efficient representations in Sect. 3; (c) we create a new source code edit dataset, and release the data extraction code at https://github.com/Microsoft/msrc-dpu-learning-to-represent-edits and the data at http://www.cs.cmu.edu/~pengchey/githubedits.zip; (d) we perform a set of experiments on the learned edit representations in Sect. 4 for natural language text and source code and present promising empirical evidence that our models succeed in capturing the semantics of edits.

*Work done as an intern at Microsoft Research, Cambridge, UK.

Edit 1:
x−: var greenList = trivia == null ? null : trivia.Select(t => t.UnderlyingNode);
x+: var greenList = trivia?.Select(t => t.UnderlyingNode);

Edit applied to a new input:
x−′: me = ((ue != null) ? ue.Operand : null) as MemberExpression;
x+′: me = ue?.Operand as MemberExpression;

Figure 1: Given an edit of x− to x+, f computes an edit representation vector f(x−, x+) ∈ R^n. Using that representation vector, the neural editor α(x−′, f(x−, x+)) applies the same edit to a new x−′. The code snippets shown here are real code change examples from the roslyn open-source compiler project.

In this work, we are interested in learning to represent and apply edits on discrete sequential or structured data, such as text or source code parse trees.¹ Figure 1 gives a graphical overview of the task, described precisely below.

**Edit Representation** Given a dataset of edits {x−(i) → x+(i)}, i = 1 . . . N, where x−(i) is the original version of some object and x+(i) its edited form (see the upper half of Figure 1 for an example), our goal is to learn a representation function f that maps an edit operation x− → x+ to a real-valued edit representation f(x−, x+) ∈ R^n. A desired quality of f is for the computed edit representations to have the property that semantically similar edits have nearby representations in R^n. Having distributed representations also allows other interesting downstream tasks, e.g., unsupervised clustering and visualization of similar edits from large-scale data (e.g., the GitHub commit stream), which would be useful for developing human-assistance toolkits for discovering and extracting emerging edit patterns (e.g., new bug fixes or emerging best practices of coding).

**Neural Editor** Given an edit representation function f, we want to learn to apply edits in a new context. This can be achieved by learning a neural editor α that accepts an edit representation f(x−, x+) and a new input x−′ and generates x+′.² This is illustrated in the lower half of Figure 1.

We cast the edit representation problem as an autoencoding task, where we aim to minimize the reconstruction error of α for the edited version x+ given the edit representation f(x−, x+) and the original version x−. By limiting the capacity of f's output and allowing the model to freely use information about x−, we introduce a bottleneck that forces the overall framework to not simply treat f(x−, x+) as an encoder of x+. The main difference from traditional autoencoders is that in our setup, an optimal solution requires re-using as much information as possible from x− to make the most of the capacity of f.

¹Existing editing systems, e.g., the grammar checker in text editors and the code refactoring module in IDEs, are powered by domain-specific, manually crafted rules, while we aim for a data-driven, domain-agnostic approach.

²We leave the problem of identifying which edit representation f(x−, x+) to apply to x−′ as interesting future work.

Figure 2: (a) Graph representation of the statement u = x + x. Rectangular (resp. rounded) nodes denote tokens (resp. non-terminals). (b) Sequence of tree decoding steps yielding x + x - 23, where x + x is copied (using the TREECP action) from the context graph in (a).
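The encoder/editor interface just described can be made concrete with a deliberately simple, purely symbolic stand-in (this is an illustrative sketch, not the paper's neural model: `edit_encoder`, `editor`, and the anchor-token heuristic are all assumptions of this toy). Here the "representation" of an edit is a list of (anchor, old-span, new-span) triples extracted with `difflib`, and the "editor" re-applies them to a new input:

```python
import difflib

def edit_encoder(x_minus, x_plus):
    # Toy stand-in for the edit encoder f: instead of a vector in R^n,
    # summarize the edit as (anchor, old_span, new_span) triples, where
    # the anchor is the token just before the change in x_minus.
    sm = difflib.SequenceMatcher(a=x_minus, b=x_plus, autojunk=False)
    rep = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        anchor = x_minus[i1 - 1] if i1 > 0 else None
        rep.append((anchor, tuple(x_minus[i1:i2]), tuple(x_plus[j1:j2])))
    return rep

def editor(x_new, rep):
    # Toy stand-in for the neural editor: re-apply each recorded change
    # wherever its anchor and old span occur in the new input.
    out = list(x_new)
    for anchor, old, new in rep:
        n = len(old)
        for j in range(len(out) - n + 1):
            anchored = anchor is None or (j > 0 and out[j - 1] == anchor)
            if anchored and tuple(out[j:j + n]) == old:
                out[j:j + n] = list(new)
                break
    return out
```

For a grammar-fix edit such as "was send" → "was sent", this toy transfers the change to a new sentence; the neural f and α of the paper replace this brittle exact-match anchoring with learned representations that can generalize across surface forms.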
Formally, given a probabilistic editor function Pα such as a neural network and a dataset {x−(i) → x+(i)}, i = 1 . . . N, we seek to minimize the negative log-likelihood loss L = −Σ_i log Pα(x+(i) | x−(i), f(x−(i), x+(i))). Note that this loss function can be interpreted in two ways: (1) as a conditional autoencoder that encodes the salient information of an edit, given x−, and (2) as an encoder-decoder model that encodes x− and decodes x+ conditioned on the edit representation f(x−, x+). In the rest of this section, we discuss our methods to model Pα and f as neural networks.

### 3.1 NEURAL EDITOR

As discussed above, α should use as much information as possible from x−, and hence, an encoder-decoder architecture with the ability to copy from the input is most appropriate. As we are primarily interested in edits on text and source code in this work, we explored two architectures: a sequence-to-sequence model for text, and a graph-to-tree model for source code, whose known semantics we can leverage both on the encoder as well as on the decoder side. Other classes of edits, for example image manipulation, would most likely be better served by convolutional neural models.

**Sequence-to-Sequence Neural Editor** First, we consider a standard sequence-to-sequence model with attention (over the tokens of x−). The architecture of our sequence-to-sequence model is similar to that of Luong et al. (2015), with the difference that we use a bidirectional LSTM in the encoder and a token-level copying mechanism (Vinyals et al., 2015) that directly copies tokens into the decoded sequence. Whereas in standard sequence-to-sequence models the decoder is initialized with the representation computed by the encoder, we initialize it with the concatenation of the encoder output and the edit representation. We also feed the edit representation as input to the decoder LSTM at each decoding time step. This allows the LSTM decoder to take the edit representation into consideration while generating the output sequence.
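The conditioning scheme above can be sketched numerically as follows. This is a minimal, non-authoritative illustration: it uses a plain tanh RNN cell instead of the paper's LSTM, toy sizes, and made-up parameter names; the softmax output head standing in for Pα is likewise a generic assumption.

```python
import math
import random

random.seed(0)
H, E, N, V = 4, 3, 2, 5  # hidden, token-embedding, edit-rep, vocabulary sizes

def mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(w * x for w, x in zip(row, v)) for row in M]

# Hypothetical decoder parameters (tanh RNN cell for brevity; the paper uses an LSTM).
W_init = mat(H, H + N)  # projects [encoder output ; edit representation]
W_in = mat(H, E + N)    # each step's input is [token embedding ; edit representation]
W_h = mat(H, H)
W_out = mat(V, H)       # output projection for the softmax over tokens

def init_decoder_state(encoder_state, edit_rep):
    # Initialize the decoder from the concatenation of the encoder
    # output and the edit representation f(x-, x+).
    return [math.tanh(u) for u in matvec(W_init, encoder_state + edit_rep)]

def decoder_step(state, token_emb, edit_rep):
    # The edit representation is fed as input at every decoding time step.
    pre = matvec(W_in, token_emb + edit_rep)
    rec = matvec(W_h, state)
    return [math.tanh(a + b) for a, b in zip(pre, rec)]

def token_nll(state, target_id):
    # One term of the loss -sum_i log P(x+ | x-, f(x-, x+)): a softmax
    # over the vocabulary, then the negative log-probability of the target.
    logits = matvec(W_out, state)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[target_id] / sum(exps))
```

Summing `token_nll` over the gold tokens of x+ (teacher forcing) gives the per-example training loss; in the real model the copy mechanism mixes this vocabulary softmax with pointer probabilities over the tokens of x−.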
**Graph-to-Tree Neural Editor** Our second model aims to take advantage of the additional structure of x− and x+. To achieve this, we combine a graph-based encoder with a tree-based decoder. We use T(x) to denote a tree representation of an element, e.g., the abstract syntax tree (AST) of a fragment of source code. We extend T(x) into a graph form G(x) by encoding additional relationships (e.g., the "next token" relationship between terminal nodes; see Figure 2(a)). To encode the elements of G(x−) into vector representations, we use a gated graph neural network (GGNN) (Li et al., 2015). Similarly to recurrent neural networks for sequences (such as biRNNs), GGNNs compute a representation for each node in the graph, which can be used in the attention mechanisms of a decoder. Additionally, we use them to obtain a representation of the full input x−, by computing their weighted average following the strategy of Gilmer et al. (2017) (i.e., computing a score for each node, normalizing scores with a softmax, and using the resulting values as weights). Our tree decoder follows the semantic parsing model of Yin & Neubig (2018), which sequentially generates a tree T(x+) as a series of expansion actions a1 . . . aN. The probability of taking an action is modeled as p(at | a