# graph_edit_networks__bdcf6d9e.pdf

Published as a conference paper at ICLR 2021

GRAPH EDIT NETWORKS

Benjamin Paassen, The University of Sydney, benjamin.paassen@sydney.edu.au
Daniele Grattarola, Università della Svizzera italiana, daniele.grattarola@usi.ch
Daniele Zambon, Università della Svizzera italiana, daniele.zambon@usi.ch
Cesare Alippi, Università della Svizzera italiana and Politecnico di Milano, cesare.alippi@usi.ch
Barbara Hammer, Bielefeld University, bhammer@techfak.uni-bielefeld.de

ABSTRACT

While graph neural networks have made impressive progress in classification and regression, few approaches to date perform time series prediction on graphs, and those that do are mostly limited to edge changes. We suggest that graph edits are a more natural interface for graph-to-graph learning. In particular, graph edits are general enough to describe any graph-to-graph change, not only edge changes; they are sparse, making them easier to understand for humans and more efficient computationally; and they are local, avoiding the need for pooling layers in graph neural networks. In this paper, we propose a novel output layer - the graph edit network - which takes node embeddings as input and generates a sequence of graph edits that transform the input graph to the output graph. We prove that a mapping between the node sets of two graphs is sufficient to construct training data for a graph edit network and that an optimal mapping yields edit scripts that are almost as short as the graph edit distance between the graphs. We further provide a proof-of-concept empirical evaluation on several graph dynamical systems, which are difficult to learn for baselines from the literature.

1 INTRODUCTION

Recent advances in graph representation learning have mostly focused on tasks of classification or regression, i.e. tasks with graph-structured input but numeric output (Battaglia et al., 2018; Kipf & Welling, 2016a; Veličković et al., 2018). By contrast, few approaches to date can transform a graph-structured input to a graph-structured output (Hajiramezanali et al., 2019; Paaßen et al., 2018; Zambon et al., 2019). This lacuna is crucial because time series prediction on graphs requires graph-structured output, namely the next graph in a time series. Applications of time series prediction on graphs include epidemiological models (Keeling & Eames, 2005), social (Liben-Nowell & Kleinberg, 2007; Masuda & Holme, 2019), telecommunications (Nanavati et al., 2006), traffic (Cui et al., 2019), citation (Shibata et al., 2012), and financial transaction networks (Chan & Olmsted, 2017), as well as student solutions in intelligent tutoring systems (Paaßen et al., 2018). In each of these settings, predicting the changes in graphs can deepen the understanding of the domain and provide useful knowledge for designing interventions.

Currently, methods for time series prediction on graphs are limited to the dynamics of the node attributes (Yu et al., 2018) or changes in connectivity (Goyal et al., 2020; Hajiramezanali et al., 2019), but do not cover changes in the node set. Fortunately, there exists a rich research tradition of edit distances (e.g. Levenshtein, 1965; Zhang & Shasha, 1989; Sanfeliu & Fu, 1983) which can describe any change between two graphs.
Further, edits are sparse and have a simple semantic (delete, insert, relabel), which makes them easier to interpret for human observers and makes them computationally more efficient (linear instead of quadratic) compared to a dense representation. Finally, edits are local, enabling us to make edit decisions at each node instead of coordinating information across the entire graph.

In this work, we connect graph neural networks to edit distances by developing a simple, linear output layer that maps node embeddings to graph edits. We call our output layer the graph edit network (GEN). We also develop a general training and inference scheme to transform any graph $G_t$ to its successor $G_{t+1}$ using only local binary edit decisions and a regression for node attributes. Theoretically, we prove that a) a mapping between the nodes of $G_t$ and $G_{t+1}$ is sufficient to construct training data for the GEN, b) this construction yields almost no overhead compared to directly transforming the mapping to graph edits, and c) provided that the mapping between $G_t$ and $G_{t+1}$ is optimal and the GEN can perfectly reproduce the training data, the edit script is almost as short as the graph edit distance (Sanfeliu & Fu, 1983). In addition to this core theoretical contribution, we provide a proof-of-concept of our model by demonstrating that GENs can learn a variety of dynamical systems on graphs which are more difficult to handle for baseline systems from the literature. We also show that the sparsity of edits enables GENs to scale up to realistic graphs with thousands of nodes.

2 BACKGROUND

Graph Neural Networks: Graph neural networks (GNNs) compute representations of nodes in a graph by aggregating information of neighboring nodes (Bacciu et al., 2020; Defferrard et al., 2016; Kipf & Welling, 2016a; Micheli, 2009; Scarselli et al., 2009). In particular, the representation $\phi^l(v) \in \mathbb{R}^{n_l}$ of node $v$ in layer $l$ is computed as follows:

$$\phi^l(v) = f^l_{\mathrm{merge}}\Big(\phi^{l-1}(v),\; f^l_{\mathrm{aggr}}\big(\{\phi^{l-1}(u) \,|\, u \in \mathcal{N}(v)\}\big)\Big) \qquad (1)$$

where $\mathcal{N}(v)$ is some neighborhood of $v$ in the graph and $f^l_{\mathrm{merge}}$ as well as $f^l_{\mathrm{aggr}}$ are functions that aggregate the information of their arguments, returning a single vector (Xu et al., 2019). The representation in the 0th layer is usually defined as the initial node attributes or a constant vector for all nodes (Kipf & Welling, 2016a). Recently, many implementations of $f^l_{\mathrm{merge}}$, $f^l_{\mathrm{aggr}}$, and neighborhood $\mathcal{N}$ have been suggested, such as a weighted sum via the graph Laplacian (Kipf & Welling, 2016a), recurrent neural networks (Hamilton et al., 2017), or attention mechanisms (Veličković et al., 2018). Our approach is agnostic to the choice of graph neural network. We merely require some vectorial embedding for each node in the input graph.
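To make Equation 1 concrete, the sketch below implements a single message passing layer with sum aggregation and concatenation merge, the configuration used later in the experiments. It is a minimal, hedged illustration for dense adjacency matrices; the class and parameter names are ours, not the authors' reference code.

```python
import torch

class MessagePassingLayer(torch.nn.Module):
    """One layer of Equation 1: f_aggr is a sum over neighbours, f_merge is a
    linear map of the concatenated vectors followed by a ReLU (the ReLU is an
    assumption for illustration)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(2 * in_dim, out_dim)

    def forward(self, Phi, A):
        # Phi: (N, in_dim) node embeddings of the previous layer
        # A:   (N, N) adjacency matrix (dense, for simplicity)
        aggregated = A @ Phi                          # f_aggr: sum over neighbours
        merged = torch.cat([Phi, aggregated], dim=1)  # concatenation merge
        return torch.relu(self.linear(merged))

# Example: one layer applied to a two-node graph yields node embeddings Phi.
A = torch.tensor([[0., 1.], [1., 0.]])
X = torch.randn(2, 4)
Phi = MessagePassingLayer(4, 64)(X, A)
```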
Graph Generators: Multiple works in recent years have proposed recurrent models to generate graphs (Bacciu et al., 2019; Li et al., 2018; You et al., 2018a;b; Zhang et al., 2019). Roughly speaking, these recurrent models first output a node, and then all connections of this node to previous nodes until a special end-of-sentence token is produced. While such a scheme does enable time series prediction, it only works for insertions, i.e. starting at an empty graph and inserting nodes and edges over time. If one wishes to account for general graph changes, one first has to encode a graph into a vector and then decode from this vector the graph in the next time step, similar to molecular design (Jin et al., 2019; Fu et al., 2020). However, such models have to generate the next graph from scratch and can not exploit the sparsity and interpretability of edits, as we suggest.

Link Prediction: Link prediction is complementary to graph generation. It assumes a constant number of nodes, but changing connectivity between them (Liben-Nowell & Kleinberg, 2007; Richard et al., 2014; Shibata et al., 2012). Typical link prediction approaches compute node features first, followed by an affinity index between nodes based on their features. Finally, edges with low index are predicted to vanish, while edges with high index are predicted to appear. For example, Goyal et al. (2020) combine dense and recurrent blocks to build an autoencoder for link prediction, while Hajiramezanali et al. (2019) combine a GNN and an RNN to obtain a spatio-temporal variational graph autoencoder. In GENs, we predict edge changes with a similar scheme, using a graph neural network to obtain the node features and then mapping these node features to the graph changes. However, in contrast to prior work, we do not predict the next adjacency matrix but only the change in adjacencies, which is a much sparser signal, reducing the time complexity from quadratic to linear. Additionally, GENs can not only handle edge changes, but also node changes.

We note that our current paper is limited to a Markovian setting, i.e. we do not consider the past for computing node representations. This limitation could be addressed by combining our output layer with EvolveGCN (Pareja et al., 2020), which uses a recurrent net to predict the weights of a graph neural net, thus being able to handle changes in the node set.

Dynamic Attributes: Recently, graph neural networks have been extended to predict changes to node attributes, while the nodes and edges remain fixed (Cui et al., 2019; Seo et al., 2018), which is particularly useful for traffic networks. GENs are complementary to these works, in that we consider the more general case of graph topology changes.

Time Series Prediction on Graphs: To our knowledge, only very few works to date have addressed the most general case of time series of graphs, where both nodes and edges are permitted to change. In particular, Paaßen et al. (2018) suggest several kernel-based time series prediction methods for graphs. However, their scheme is limited to predictions in the kernel space, and mapping a prediction back to a graph requires solving an inverse kernel problem, relying on approximations that impact accuracy (Paaßen et al., 2018). Zambon et al. (2019) embed the time series into a vector space using a GNN and use a recurrent neural network to predict the next time step. To obtain the corresponding graph, a multi-layer perceptron is used to compute the adjacency matrix and node features from the predicted embedding. Besides being computationally expensive, this dense decoder also assumes a fixed order of the nodes.

Graph Edits: The basis for our approach is graph edits, which are functions that describe changes in graphs (Sanfeliu & Fu, 1983). Formally, we first define an attributed, directed graph as a triple $G = (V, E, X)$, where $V = \{1, \ldots, N\}$ is a finite set of node indices, $E \subseteq V \times V$ is a set of edges, and $X \in \mathbb{R}^{N \times n}$ is a matrix of node attributes for some $n \in \mathbb{N}$. We define the nodes as indices for notational simplicity, but we do not assume any specific order, i.e. we treat isomorphic graphs as the same. Now, let $\mathcal{G}$ be the set of all possible attributed directed graphs. We define a graph edit as some function $\delta: \mathcal{G} \to \mathcal{G}$. In particular, we consider the graph edits of Sanfeliu & Fu (1983), namely node deletions $\mathrm{del}_i$, which delete the $i$th node from a graph, node replacements $\mathrm{rep}_{i,x}$, which set the attribute of node $i$ to $x$, node insertions $\mathrm{ins}_x$, which add a new node with attribute $x$ to a graph, edge deletions $\mathrm{edel}_{i,j}$, which delete the edge $(i, j)$ from a graph, and edge insertions $\mathrm{eins}_{i,j}$, which insert the edge $(i, j)$ into a graph. We then define an edit script $\bar\delta$ as a finite sequence $\bar\delta = \delta_1, \ldots, \delta_T$ of graph edits and we define the application of $\bar\delta$ as the composition of all edits, i.e. $\bar\delta(G) := \delta_T \circ \ldots \circ \delta_1(G)$. Finally, we define the graph edit distance $d_{\mathrm{GED}}(G, G')$ between two graphs $G$ and $G'$ as the length of the shortest script $\bar\delta$ such that $\bar\delta(G) \cong G'$, where $\cong$ means isomorphic. The GED is well-defined and a proper metric, i.e. a script connecting any two graphs always exists, the GED between two isomorphic graphs is zero, the GED is symmetric, and it conforms to the triangular inequality (Abu-Aisheh et al., 2015; Sanfeliu & Fu, 1983).

While prior work has already attempted to approximate the graph edit distance with graph neural nets (Bai et al., 2019; Li et al., 2019), our work is, to our knowledge, the first to produce actual graph edits as network output, and to avoid graph pooling layers.
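The following sketch spells out these edit types on a minimal graph representation (node count, edge set, attribute list, 0-based indices) and shows how a script is applied by composition. The representation and function names are our own illustration, not the paper's implementation.

```python
# Minimal sketch of the graph edits described above, on a graph represented
# as a tuple (num_nodes, edges, attrs) with 0-based node indices.

def node_del(G, i):
    """del_i: remove node i and its incident edges; higher indices shift down."""
    n, E, X = G
    shift = lambda k: k - 1 if k > i else k
    E2 = {(shift(a), shift(b)) for (a, b) in E if i not in (a, b)}
    return (n - 1, E2, X[:i] + X[i + 1:])

def node_rep(G, i, x):
    """rep_{i,x}: set the attribute of node i to x."""
    n, E, X = G
    return (n, E, X[:i] + [x] + X[i + 1:])

def node_ins(G, x):
    """ins_x: append a new node with attribute x."""
    n, E, X = G
    return (n + 1, E, X + [x])

def edge_del(G, i, j):
    """edel_{i,j}: delete the edge (i, j)."""
    n, E, X = G
    return (n, E - {(i, j)}, X)

def edge_ins(G, i, j):
    """eins_{i,j}: insert the edge (i, j)."""
    n, E, X = G
    return (n, E | {(i, j)}, X)

def apply_script(G, script):
    """Apply a sequence of (edit, args) pairs left to right (composition)."""
    for edit, args in script:
        G = edit(G, *args)
    return G

# Example: insert a node and connect it to node 0.
G = (1, set(), [0.0])
G = apply_script(G, [(node_ins, (0.0,)), (edge_ins, (0, 1))])
```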
3 GRAPH EDIT NETWORKS

Let $G_1, G_2, \ldots, G_T$ be a time series of graphs. Our goal is to develop a neural network that takes a graph $G_t$ as input and outputs graph edits that transform $G_t$ to $G_{t+1}$. For simplicity we make a Markov assumption, i.e. $G_t$ is assumed to be sufficient to predict $G_{t+1}$ (future research could address this limitation, e.g. by applying EvolveGCN of Pareja et al., 2020).

Now, let $G_t = (V, E, X)$ be an attributed graph with $N$ nodes. Our proposed processing pipeline has three steps. First, we use some graph neural network (refer to Equation 1) to compute a matrix of node embeddings $\Phi \in \mathbb{R}^{N \times n}$. Second, we use a linear layer to compute numerical edit scores that express which nodes and edges should be deleted, inserted, and relabeled, respectively. Third, we translate these scores via Algorithm 1 to an edit script $\bar\delta$ and apply this script to the input graph to obtain the output graph $\bar\delta(G_t)$. This pipeline is also illustrated in Figure 1. In the remainder of this section, we describe the graph edit network layer (Section 3.1), our training scheme (Section 3.2), our inference scheme (Section 3.3), and finally our theoretical results (Section 3.4).

Figure 1: An illustration of the processing pipeline. An input graph $G$ is processed by a message passing network (Equation 1). The output layer is a GEN and produces node scores $\nu$ and edge scores $E$. Algorithm 1 translates the node and edge scores into an edit script $\bar\delta$ which, when applied to the input graph $G$, constructs the predicted graph $\bar\delta(G)$.

3.1 GRAPH EDIT NETWORK LAYER

Our proposed graph edit network (GEN) is a linear layer to compute edit scores that express which nodes and edges should be deleted, inserted, or relabeled. The input of our GEN is a matrix $\Phi \in \mathbb{R}^{N \times n}$ of node embeddings as returned by a graph neural network (refer to Equation 1). We then compute node edit scores $\nu \in \mathbb{R}^N$, edge filter scores $e^+ \in \mathbb{R}^N$ as well as $e^- \in \mathbb{R}^N$, and new node attributes $Y$ via linear maps from $\Phi$. After this is done, we consider only those pairs $(i, j)$ where $e^+_i > 0$ and $e^-_j > 0$ and compute an edge edit score $\epsilon_{i,j}$ via another linear layer that receives $\phi_i$, $\phi_j$, and the inner product $\phi_i^T \phi_j$ as inputs. The interpretation of these scores is that $\nu_i$ should be positive if a new node connected to $i$ is inserted and negative if node $i$ is deleted; $e^+_i$, $e^-_j$, as well as $\epsilon_{i,j}$ should be positive if edge $(i, j)$ is inserted; and $e^+_i$ and $e^-_j$ should be positive but $\epsilon_{i,j}$ negative if edge $(i, j)$ is deleted.
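A sketch of this scoring layer in PyTorch is given below: four linear maps from the node embeddings plus one pairwise scorer over the filtered pairs. The module and variable names are illustrative assumptions; only the inputs and outputs follow the description above.

```python
import torch

class GENLayer(torch.nn.Module):
    """Sketch of the GEN output layer: linear maps from node embeddings Phi to
    node edit scores nu, edge filter scores e^+/e^-, new attributes Y, and an
    edge edit score for every filtered pair (i, j)."""

    def __init__(self, emb_dim, attr_dim):
        super().__init__()
        self.node_score = torch.nn.Linear(emb_dim, 1)          # nu
        self.out_filter = torch.nn.Linear(emb_dim, 1)          # e^+
        self.in_filter = torch.nn.Linear(emb_dim, 1)           # e^-
        self.attributes = torch.nn.Linear(emb_dim, attr_dim)   # Y
        # the edge scorer receives [phi_i; phi_j; phi_i^T phi_j]
        self.edge_score = torch.nn.Linear(2 * emb_dim + 1, 1)

    def forward(self, Phi):
        nu = self.node_score(Phi).squeeze(-1)
        e_plus = self.out_filter(Phi).squeeze(-1)
        e_minus = self.in_filter(Phi).squeeze(-1)
        Y = self.attributes(Phi)
        # only score pairs (i, j) with e_plus[i] > 0 and e_minus[j] > 0
        rows = torch.where(e_plus > 0)[0].tolist()
        cols = torch.where(e_minus > 0)[0].tolist()
        edge_scores = {}
        for i in rows:
            for j in cols:
                pair = torch.cat([Phi[i], Phi[j], (Phi[i] @ Phi[j]).view(1)])
                edge_scores[(i, j)] = self.edge_score(pair)
        return nu, e_plus, e_minus, Y, edge_scores
```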
Note that we only compute the edge edit score $\epsilon_{i,j}$ for pairs $(i, j)$ where $e^+_i > 0$ and $e^-_j > 0$. Thus, if edge changes concern only a number of nodes in $O(\sqrt{N})$, the GEN layer operates in linear instead of quadratic time. We can also enforce linear time by setting all $e^+_i$ and $e^-_j$ to zero that are not in the top $R$ for some $R$ that is either constant or in $O(\sqrt{N})$.

From scores to edits: Next, we translate these scores into edits. The formal translation scheme is given in Algorithm 1. Roughly speaking, we delete any node $i$ where the node edit score $\nu_i$ is smaller than $-\tfrac{1}{2}$, we insert a new node with attribute $y_i$, connected to $i$, whenever $\nu_i$ is larger than $+\tfrac{1}{2}$, and we replace the attribute $x_i$ with $y_i$ otherwise. For edges, we delete any edge $(i, j)$ where $\epsilon_{i,j} < -\tfrac{1}{2}$, where $\nu_i < -\tfrac{1}{2}$, or where $\nu_j < -\tfrac{1}{2}$, and we insert any edge where $\epsilon_{i,j} > +\tfrac{1}{2}$.

The complexity of Algorithm 1 is as follows. In line 2, we first construct $|E \cap \{(i, j) \,|\, \nu_i < -\tfrac{1}{2} \text{ or } \nu_j < -\tfrac{1}{2}\}|$ edits, which is bounded by $|E|$, which in turn is in $O(N)$ for a sparse graph and $O(N^2)$ for a dense graph. Lines 3-8 perform $|\{(i, j) \,|\, e^+_i > 0, e^-_j > 0\}|$ iterations, which can be bounded by some constant $R^2$ via the edge filtering trick above. Lines 9-15 iterate over all nodes several times, which is in $O(N)$. The space complexity is the same since we add one edit each iteration, which needs to be stored. The overall time and space complexity is thus $O(N)$ for sparse graphs and $O(N^2)$ for dense graphs.

Note the special case of node insertions in this scheme. As with other edits, we make the decision to insert a new node locally at each node instead of globally for the graph. This relieves the need to aggregate information across the entire graph. Further, by connecting a new node directly with an existing one, we ensure that any two new nodes can be distinguished purely based on their graph connectivity, without relying on auxiliary information.
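The sketch below mirrors this translation (the formal version is Algorithm 1, shown next), reusing the edit functions from the earlier sketch. Scores are assumed to be plain Python numbers and the score dictionary only contains pairs that passed the edge filter; these are our own simplifications.

```python
def translate(G, nu, Y, e_plus, e_minus, eps):
    """Translate GEN scores into an edit script (sketch of Algorithm 1).
    G = (n, E, X); nu: node edit scores; Y: new attributes;
    eps: dict from filtered pairs (i, j) to edge edit scores."""
    n, E, X = G
    script = []
    # line 2: delete all edges incident to nodes that will be deleted
    for (i, j) in sorted(E):
        if nu[i] < -0.5 or nu[j] < -0.5:
            script.append((edge_del, (i, j)))
    # lines 3-8: edge edits among filtered, surviving pairs
    for (i, j), s in sorted(eps.items(), key=lambda kv: kv[0]):
        if nu[i] >= -0.5 and nu[j] >= -0.5:
            if (i, j) in E and s < -0.5:
                script.append((edge_del, (i, j)))
            elif (i, j) not in E and s > 0.5:
                script.append((edge_ins, (i, j)))
    # line 9: attribute replacements for kept nodes
    for i in range(n):
        if abs(nu[i]) <= 0.5 and X[i] != Y[i]:
            script.append((node_rep, (i, Y[i])))
    # lines 10-14: insert a new node with attribute Y[i], connected to i
    k = 0
    for i in range(n):
        if nu[i] > 0.5:
            script.append((node_ins, (Y[i],)))
            script.append((edge_ins, (i, n + k)))
            k += 1
    # line 15: deletions in descending order so indices stay valid
    for i in sorted(range(n), reverse=True):
        if nu[i] < -0.5:
            script.append((node_del, (i,)))
    return script
```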
Algorithm 1: The scheme to translate the outputs of the GEN layer $\nu_i$, $y_i$, $e^+_i$, $e^-_i$, and $\epsilon_{i,j}$ to graph edits.

1: function TRANSLATE(graph $G = (V, E, X)$, node edit scores $\nu \in \mathbb{R}^N$, attributes $Y \in \mathbb{R}^{N \times n}$, edge filter scores $e^+, e^- \in \mathbb{R}^N$, and edge edit scores $E \in \mathbb{R}^{N \times N}$)
2:   Initialize script $\bar\delta$ with $\mathrm{edel}_{i,j}$ for all $(i, j) \in E$ with $\nu_i < -\tfrac{1}{2}$ or $\nu_j < -\tfrac{1}{2}$ in lexicographic order.
3:   for $i$ with $e^+_i > 0$ and $\nu_i \geq -\tfrac{1}{2}$ do
4:     for $j$ with $e^-_j > 0$ and $\nu_j \geq -\tfrac{1}{2}$ do
5:       Append $\mathrm{edel}_{i,j}$ to $\bar\delta$ if $(i, j) \in E$ and $\epsilon_{i,j} < -\tfrac{1}{2}$.
6:       Append $\mathrm{eins}_{i,j}$ to $\bar\delta$ if $(i, j) \notin E$ and $\epsilon_{i,j} > +\tfrac{1}{2}$.
7:     end for
8:   end for
9:   Append $\mathrm{rep}_{i, y_i}$ to $\bar\delta$ for all $i$ with $|\nu_i| \leq \tfrac{1}{2}$ and $x_i \neq y_i$.
10:  $k \leftarrow 1$.
11:  for $i$ with $\nu_i > +\tfrac{1}{2}$ do
12:    Append $\mathrm{ins}_{y_i}$ and $\mathrm{eins}_{i, N+k}$ to $\bar\delta$.
13:    $k \leftarrow k + 1$.
14:  end for
15:  Append $\mathrm{del}_i$ to $\bar\delta$ for all $i$ with $\nu_i < -\tfrac{1}{2}$ in descending order.
16:  return $\bar\delta$.
17: end function

3.2 TRAINING

The key challenge in training GENs is to identify which scores the network should produce such that the GEN transforms the input graph $G_t$ into its desired successor $G_{t+1}$. In other words, we require a teaching signal consisting of ground truth scores $(\hat\nu, \hat Y, \hat e^+, \hat e^-, \hat E)$, such that the edit script $\bar\delta$ returned by Algorithm 1 yields $\bar\delta(G_t) = G_{t+1}$.

Unfortunately, such a one-step teaching signal is sometimes insufficient. Consider the two example graphs $G_t = (\{1\}, \emptyset, (0))$ and $G_{t+1} = (\{1, 2, 3\}, \{(1, 2), (1, 3)\}, (0, 0, 0)^T)$. In this case, there exists no one-step teaching signal that transforms $G_t$ to $G_{t+1}$ because we can only insert as many new nodes as already exist. However, it is possible to set up a two-step teaching signal $(\hat\nu_1, \hat Y_1, \hat e^+_1, \hat e^-_1, \hat E_1), (\hat\nu_2, \hat Y_2, \hat e^+_2, \hat e^-_2, \hat E_2)$ with $\hat\nu_1 = (1)$, $\hat\nu_2 = (1, 0)^T$, and all other scores set to zero. When plugging these values into Algorithm 1 we obtain the edits $\bar\delta_1 = \mathrm{ins}, \mathrm{eins}_{1,2}$ and $\bar\delta_2 = \mathrm{ins}, \mathrm{eins}_{1,3}$, such that the concatenation $\bar\delta = \bar\delta_1, \bar\delta_2$ does indeed yield $\bar\delta(G_t) = G_{t+1}$. In general, Theorem 2 (below) shows that a teaching signal with $K + 1$ steps suffices to transform any input graph $G_t$ into any output graph $G_{t+1}$, where $K$ is the number of insertions necessary to transform $G_t$ into $G_{t+1}$ and the first $K$ steps only perform these insertions.

Provided that a teaching signal exists, our training procedure should ensure that the actual edit scores $(\tilde\nu_1, \tilde Y_1, \tilde e^+_1, \tilde e^-_1, \tilde E_1), \ldots, (\tilde\nu_{K+1}, \tilde Y_{K+1}, \tilde e^+_{K+1}, \tilde e^-_{K+1}, \tilde E_{K+1})$ result in the same edits as the teaching signal when being plugged into Algorithm 1. To do so, we treat every edit decision in Algorithm 1 as a binary classification and punish different decisions with a classification loss, such as the hinge loss (Zhao et al., 2017) or crossentropy (refer to Appendices B and C for more details on loss functions). In particular, in the first $K$ steps of a teaching signal, we have a classification loss for the decision of inserting a node ($\nu_{k,i} > +\tfrac{1}{2}$) or not ($\nu_{k,i} \leq +\tfrac{1}{2}$), plus a loss punishing deviations between predicted attributes $\tilde y_{k,i}$ and desired attributes $\hat y_{k,i}$ for all $i$ with $\hat\nu_{k,i} > +\tfrac{1}{2}$. For step $K + 1$, the same idea applies. For each node we have a binary classification loss for the decision of deleting a node $i$ ($\nu_{K+1,i} < -\tfrac{1}{2}$) or not; plus the regression loss between $\tilde y_{K+1,i}$ and $\hat y_{K+1,i}$ for non-deleted nodes $i$; plus the classification loss regarding whether to change the outgoing edges of node $i$ ($e^+_{K+1,i} > 0$) or not; plus the classification loss regarding whether to change the incoming edges of node $i$ ($e^-_{K+1,i} > 0$) or not; plus the classification loss regarding whether to delete an existing edge $(i, j)$ ($\epsilon_{i,j} < -1$) or not; plus the classification loss regarding whether to insert a non-existing edge $(i, j)$ ($\epsilon_{i,j} > 1$) or not. The sum of all these losses forms our training loss for a single graph pair $(G_t, G_{t+1})$. Because this loss is differentiable, we can train our neural net end-to-end by performing a gradient descent scheme on this loss, e.g. using Adam (Kingma & Ba, 2015).

Importantly, edge filtering needs to be adjusted to the training data. In particular, if we impose a limit on the maximum number of nodes with $e^+_i > 0$ or $e^-_i > 0$, this limit should be higher than the maximum number of such nodes in the teaching signals for the training data; otherwise, the training can never achieve zero loss. If the limit is high enough, however, the training procedure is the same with or without the limit. The limit only impacts inference.
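To illustrate how these decisions turn into a differentiable loss, the sketch below implements hinge-style penalties for the final teaching step (the one without insertions). The paper's exact hinge and crossentropy losses are given in Appendices B and C, which are not part of this excerpt, so this is a simplified stand-in: the margin choice, the `teach` container (tensors for scores and attributes, a dict for edge targets), and the squared-error attribute regression are our own assumptions.

```python
import torch

def above(score, threshold):
    """Hinge penalty enforcing the decision 'score > threshold'."""
    return torch.relu(threshold - score)

def below(score, threshold):
    """Hinge penalty enforcing the decision 'score < threshold'."""
    return torch.relu(score - threshold)

def final_step_loss(nu, Y, e_plus, e_minus, eps, edges, teach):
    """One binary penalty per edit decision plus an attribute regression loss.
    `teach` holds the target scores from the teaching signal; `edges` is the
    edge set of G_t; `eps` only contains filtered pairs."""
    loss = torch.zeros(())
    for i in range(len(nu)):
        if teach.nu[i] < -0.5:                    # node i should be deleted
            loss = loss + below(nu[i], -0.5)
        else:                                     # node i should be kept
            loss = loss + above(nu[i], -0.5)
            loss = loss + (Y[i] - teach.Y[i]).pow(2).sum()
        loss = loss + (above(e_plus[i], 0.0) if teach.e_plus[i] > 0
                       else below(e_plus[i], 0.0))
        loss = loss + (above(e_minus[i], 0.0) if teach.e_minus[i] > 0
                       else below(e_minus[i], 0.0))
    for (i, j), score in eps.items():
        target = teach.eps.get((i, j), 0.0)
        if (i, j) in edges:                       # existing edge: delete or keep?
            loss = loss + (below(score, -1.0) if target < 0 else above(score, -1.0))
        else:                                     # non-edge: insert or not?
            loss = loss + (above(score, 1.0) if target > 0 else below(score, 1.0))
    return loss
```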
3.3 INFERENCE

Once training is complete, we wish to use our GEN for inference. In particular, given a previously unseen graph $G_t$, we want to know its successor $G_{t+1}$. Because $G_{t+1}$ is unknown, we can not construct a teaching signal. Instead, we plug the current graph $G_t$ into our graph neural net to compute node features $\Phi$, then use the GEN to compute scores $(\tilde\nu, \tilde Y, \tilde e^+, \tilde e^-, \tilde E)$, and plug these into Algorithm 1 to retrieve an edit script $\bar\delta$, which then yields our predicted graph $\tilde G_{t+1} = \bar\delta(G_t)$ (also refer to Figure 1). This is the simple one-step scheme that we will also use in our graph dynamical system experiments.

In case our training data requires multi-step teaching signals, our inference also needs multiple steps. In particular, we use the following scheme: 1) Color all nodes red. 2) If no red nodes are left, go to 4. Otherwise, re-compute the node features $\phi_i$, the node edit score $\nu_i$, and the attributes $y_i$ for all red nodes $i$. 3) For all $i$ with $\nu_i > +\tfrac{1}{2}$, insert a new node with attributes $y_i$, draw an edge from $i$ to the new node, and color the new node red. Color all nodes $i$ with $\nu_i \leq +\tfrac{1}{2}$ blue. Then go to 2. 4) Compute the node features $\Phi$ and the edit scores $(\tilde\nu, \tilde Y, \tilde e^+, \tilde e^-, \tilde E)$ for all nodes. Set all $\nu_i > \tfrac{1}{2}$ to zero. Then, call Algorithm 1 and apply the resulting script to the graph.

In general, it may be necessary to provide state information (e.g. via a recurrent neural net) to ensure that the network can distinguish whether it still needs to insert nodes or not. In our experiments, however, it is sufficient to either summarize all edits in a single step or to supply the network with a binary node attribute that flags whether we are still in insertion mode or not.
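For concreteness, the following sketch runs the four-step scheme above on the minimal graph representation of the earlier sketches, reusing `translate()` and the edit functions. `model(G)` is a hypothetical wrapper that runs the message passing layers plus the GEN layer on a graph tuple and returns the five score blocks; the paper's pipeline organizes these pieces differently, so treat this purely as an illustration.

```python
def multi_step_inference(G, model, max_rounds=100):
    """Insert nodes at 'red' nodes until no more insertions are predicted,
    then translate the remaining scores into one final edit script."""
    red = set(range(G[0]))                        # 1) colour all nodes red
    for _ in range(max_rounds):                   # safeguard against non-termination
        if not red:                               # 2) no red nodes left
            break
        # (for simplicity we recompute all scores; the paper recomputes red nodes only)
        nu, _, _, Y, _ = model(G)
        new_red = set()
        for i in sorted(red):                     # 3) insert wherever nu_i > 1/2
            if nu[i] > 0.5:
                new_index = G[0]
                G = apply_script(G, [(node_ins, (Y[i],)), (edge_ins, (i, new_index))])
                new_red.add(new_index)            # the inserted node becomes red
        red = new_red
    nu, e_plus, e_minus, Y, eps = model(G)        # 4) final pass over all nodes
    nu = [0.0 if v > 0.5 else v for v in nu]      # suppress further insertions
    return apply_script(G, translate(G, nu, Y, e_plus, e_minus, eps))
```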
3.4 THEORY

In this section, we wish to answer two questions. First, how to construct a teaching signal for a graph edit network? Second, how short can the output edit script of a GEN get, provided that it still transforms graph $G_t$ to graph $G_{t+1}$? The answer to both questions lies in making the theoretical connection to graph edit distances more explicit. To do so, we first introduce a key concept that will help us, namely that of a graph mapping.

Definition 1 (Graph Mapping). Let $G = (V = \{1, \ldots, M\}, E, X)$ and $G' = (V' = \{1, \ldots, N\}, E', X')$ be two non-empty graphs, i.e. $M, N > 0$. Then, we define a graph mapping $\psi$ between $G$ and $G'$ as a bijective mapping $\psi: \{1, \ldots, M + N\} \to \{1, \ldots, M + N\}$ with the additional restriction that for the set $\mathrm{Ins}_\psi := \{j \leq N \,|\, \psi^{-1}(j) > M\}$ we obtain $\psi^{-1}(\mathrm{Ins}_\psi) = \{M + 1, \ldots, M + |\mathrm{Ins}_\psi|\}$.

Graph mappings are useful because they are intimately connected to edit scripts. In particular, we re-state a result from the literature that any edit script converts to a graph mapping and back and that this conversion never increases the length of the script.

Theorem 1 (Script to mapping). Let $G$ and $G'$ be any two graphs. There exist two polynomial algorithms to translate any edit script $\bar\delta$ with $\bar\delta(G) \cong G'$ to a graph mapping $\psi_{\bar\delta}$ between $G$ and $G'$, and to translate any graph mapping $\psi$ between $G$ and $G'$ into an edit script $\bar\delta_\psi$ with $\bar\delta_\psi(G) \cong G'$, such that for any $\bar\delta$ with $\bar\delta(G) \cong G'$ we obtain $|\bar\delta_{\psi_{\bar\delta}}| \leq |\bar\delta|$.

One proof is contained in Bougleux et al. (2017). We provide a more extensive version in Appendix A.1, which also gives more details on the structure of $\psi_{\bar\delta}$ and $\bar\delta_{\psi_{\bar\delta}}$.

Next, we show that a graph mapping can also be used to construct a teaching signal for a GEN. Even better, the conversion from mapping to teaching signal to edit script yields almost as short scripts as the direct conversion from mappings to edit scripts. Indeed, we can provide a sharp bound for the overhead in terms of the connected components in the target graph.

Theorem 2 (Mapping to teaching signal). Let $G$ and $G'$ be any two graphs with $M$ and $N$ nodes, respectively. There exists an $O(M^2 + N^2)$ algorithm (namely Algorithm 2 in the appendix) that translates any graph mapping $\psi$ with $|\mathrm{Ins}_\psi| < N$ between $G$ and $G'$ into a teaching signal $(\hat\nu_1, \hat Y_1, \hat e^+_1, \hat e^-_1, \hat E_1), \ldots, (\hat\nu_{K+1}, \hat Y_{K+1}, \hat e^+_{K+1}, \hat e^-_{K+1}, \hat E_{K+1})$ such that the output of Algorithm 1 is a script $\bar\delta$, where the following holds: 1) $\bar\delta(G) \cong G'$; 2) $|\bar\delta| \leq |\bar\delta_\psi| + 2 \cdot (C - 1)$, where $C$ is the number of connected components in $G'$ (this bound is sharp); 3) the first $K$ steps contain only insertions, with $\hat\nu_{k,i} = 0$ implying $\hat\nu_{k+1,i} = \ldots = \hat\nu_{K+1,i} = 0$; 4) the last step contains no insertions; 5) $K \leq |\mathrm{Ins}_\psi|$ (this bound is sharp).

Refer to Appendix A.2 for the proof.

So this result tells us that we obtain a (sparse) teaching signal if we have a good graph mapping $\psi$. But how to obtain $\psi$? In many cases, we can exploit domain knowledge. For example, if we know that the input graphs are trees, we can use the polynomial tree edit distance algorithm to infer mappings that correspond to short edit scripts (Zhang & Shasha, 1989); or, if node IDs are given (like user IDs in social networks), we can set $\psi$ such that it maintains IDs. In our experiments, we follow such domain-specific schemes to construct the mappings $\psi$. But there is also a general strategy connected to the graph edit distance. While computing the graph edit distance itself is NP-hard (Bougleux et al., 2017), one can achieve a good approximation by constructing a graph mapping via the Hungarian algorithm and then converting this mapping to an edit script via Theorem 1 (Riesen & Bunke, 2009; Abu-Aisheh et al., 2017; Blumenthal et al., 2020). For our purposes, we can simply apply such an approximator and then use the graph mapping to construct our teaching signal. Importantly, if the approximator happens to find the optimal mapping and if our graph edit network is powerful enough to achieve a zero loss on one graph tuple $(G_t, G_{t+1})$, then the output of our graph edit network is close to the graph edit distance.

Corollary 1 (Near-Optimality of Graph Edit Network Architecture). Let $G$ and $G'$ be any two non-empty graphs and let $C$ be the number of connected components in $G'$. Further, let $\psi$ be a graph mapping between $G$ and $G'$ such that $|\bar\delta_\psi| = d_{\mathrm{GED}}(G, G')$. Finally, let $f$ be a graph edit network that reproduces the teaching signal, i.e. $f(G) = \bar\delta$ with $\bar\delta$ being the same script as the result of Theorem 2 for $\psi$. Then, it holds: $|\bar\delta| \leq d_{\mathrm{GED}}(G, G') + 2 \cdot (C - 1)$.

As mentioned before, this corollary only holds if the mapping is optimal and if the GEN can reproduce the teaching signal. The latter only holds if the node features $\Phi$ are rich enough to make each edit classification problem linearly separable (because any edit decision is a linear binary classification). Typically, this fails if a graph neural net can not distinguish two nodes that would need to be treated differently. For example, in an unlabeled ring graph $G = (\{1, \ldots, N\}, \{(1, 2), \ldots, (N, 1)\})$, Equation 1 assigns the same node embedding to all nodes and, hence, the GEN returns the same edits. In such cases, distinguishing information must be integrated via an alternative architecture or via node attributes (which is the strategy we take in the experiments). For further work on the expressiveness of graph neural nets we point the reader to Xu et al. (2019).

To summarize: the proposed way to use a GEN is 1) to gather a training time series of graphs $G_1, \ldots, G_{T+1}$; 2) to set up reference graph mappings $\psi_t$ between $G_t$ and $G_{t+1}$ for all $t \in \{1, \ldots, T\}$, e.g. via graph edit distance approximators; 3) to compute teaching signals via Theorem 2; 4) to initialize an appropriately powerful graph neural net with a final GEN layer and train it to reproduce the teaching signals on the training data; 5) to use the trained GEN in inference.
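As a simplified illustration of step 3, the sketch below derives a one-step teaching signal from a node mapping for the easy case without node insertions; the general multi-step construction is Algorithm 2 in the paper's appendix, which is not reproduced here. The function, its arguments, and the encoding of targets as ±1 are our own assumptions.

```python
# `mapping[i]` gives the node of G_next that node i of G is mapped to, or
# None if node i is to be deleted. Graphs use the (n, E, X) representation
# from the earlier sketches. Pairs whose edges stay unchanged are omitted.

def one_step_teaching_signal(G, G_next, mapping):
    n, E, X = G
    n2, E2, X2 = G_next
    nu_hat = [0.0] * n
    Y_hat = list(X)
    e_plus_hat = [0.0] * n
    e_minus_hat = [0.0] * n
    eps_hat = {}
    for i in range(n):
        if mapping[i] is None:
            nu_hat[i] = -1.0                    # node i should be deleted
        else:
            Y_hat[i] = X2[mapping[i]]           # target attribute of the mapped node
    for i in range(n):
        for j in range(n):
            if i == j or mapping[i] is None or mapping[j] is None:
                continue
            has_edge = (i, j) in E
            wants_edge = (mapping[i], mapping[j]) in E2
            if has_edge != wants_edge:          # this pair needs an edge edit
                e_plus_hat[i] = 1.0
                e_minus_hat[j] = 1.0
                eps_hat[(i, j)] = 1.0 if wants_edge else -1.0
    return nu_hat, Y_hat, e_plus_hat, e_minus_hat, eps_hat
```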
4 EXPERIMENTS

Our experimental evaluation displays the capability of GENs on a set of graph dynamical systems in comparison to baselines from the literature. Experiments are reported in three groups and cover graph dynamical systems, tree dynamical systems, and a social network dataset. All experiments require the possibility of changing nodes, and almost all require additional edge changes. The data is discussed in more detail in Appendix D. We perform all experiments on a consumer grade laptop with a core i7 CPU. All experimental code is available at https://gitlab.com/bpaassen/graph-edit-networks.

First, we consider the following three graph dynamical systems.

Edit Cycles: A manually defined dataset of cycles in the set of undirected graphs with up to four nodes. The teaching protocol is hand-crafted to perform optimal edits between each graph and its successor. To sample a time series we let the cycle run for 4-12 time steps at random. The initial node features $\phi^0$ were set to zero.

Degree Rules: A dynamical system on undirected graphs of arbitrary size with the following rules. First, delete every node with a degree larger than 3. Second, connect nodes that share at least one common neighbor. Third, insert a new node at any node with a degree lower than 3. We used the node index in one-hot coding as initial node features $\phi^0$. For every connected component in the input graph, this dynamical system provably converges to a 4-clique. We started with a random undirected adjacency matrix of size $8 \times 8$ and let the system run until convergence. Refer to Figure 2 for two example time series.

Figure 2: Two graph time series from the degree rules dataset. Blue arrows indicate graph dynamics, labelled with the teaching signal.
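For reference, the degree-rules dynamics just described can be simulated with a few lines of Python. This is an illustrative reimplementation of the dataset's ground-truth dynamics (undirected edges as two-element frozensets), not the authors' generator.

```python
import itertools

def degree_rules_step(E, n):
    """One step of the degree-rules dynamics on an undirected graph with
    nodes 0..n-1 and edge set E of two-element frozensets."""
    deg = {i: sum(1 for e in E if i in e) for i in range(n)}
    # Rule 1: delete every node with a degree larger than 3.
    keep = [i for i in range(n) if deg[i] <= 3]
    relabel = {old: new for new, old in enumerate(keep)}
    E2 = {frozenset((relabel[a], relabel[b]))
          for a, b in (tuple(e) for e in E)
          if a in relabel and b in relabel}
    n2 = len(keep)
    # Rule 2: connect nodes that share at least one common neighbour.
    neigh = {i: {j for e in E2 if i in e for j in e if j != i} for i in range(n2)}
    for i, j in itertools.combinations(range(n2), 2):
        if neigh[i] & neigh[j]:
            E2.add(frozenset((i, j)))
    # Rule 3: insert a new node at every node with a degree lower than 3.
    deg2 = {i: sum(1 for e in E2 if i in e) for i in range(n2)}
    for i in range(n2):
        if deg2[i] < 3:
            E2.add(frozenset((i, n2)))
            n2 += 1
    return E2, n2
```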
Game of Life: We simulated one of five oscillatory shapes in Conway's game of life (Gardner, 1970), namely blinker, glider, beacon, toad, and clock, for 10 time steps, placed on a random location on a $10 \times 10$ grid and additionally activated 10% of the grid cells at random. We represented the grid as a graph with 100 nodes and represented the 8-neighborhood via the adjacency matrix. The desired edits were node deletions for all nodes that should switch from alive to dead and node insertions for all nodes that should switch from dead to alive. As node features we used the alive-state of the node. Note that a single-step teaching signal is sufficient in all cases.

We compare our graph edit network against variational graph autoencoders (VGAE) and variational graph recurrent nets (VGRNN), which predict the adjacency matrix of the next graph via an outer product of the node features, similar to our edge prediction scheme (Kipf & Welling, 2016b; Hajiramezanali et al., 2019). Note that neither net can predict node changes directly, but we predict a node deletion whenever all edges of a node are deleted. To train our GEN model, we apply both the hinge loss (GEN; Appendix B) as well as the crossentropy loss (XE-GEN; Appendix C). For all models, we use two graph neural network layers with 64 neurons each, sum as aggregation function and concatenation as merge function (refer to Equation 1). We train all networks with an Adam optimizer in PyTorch using a learning rate of $10^{-3}$, stopping training after 30,000 time series or if the loss dropped below $10^{-3}$. After training, we evaluated the predictive performance on 10 additional time series. We repeated each experiment five times.

The recall and precision for all edit types, all models, and all datasets is shown in Table 1.

Table 1: The average precision and recall values (± std.) across five repeats for all edit types on the graph dynamical systems.

| dataset | model | node ins. recall | node ins. precision | node del. recall | node del. precision | edge ins. recall | edge ins. precision | edge del. recall | edge del. precision |
|---|---|---|---|---|---|---|---|---|---|
| edit cycles | VGAE | 0.62 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.69 ± 0.1 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| edit cycles | VGRNN | 0.64 ± 0.0 | 1.00 ± 0.0 | 0.63 ± 0.0 | 1.00 ± 0.0 | 0.95 ± 0.0 | 0.06 ± 0.0 | 1.00 ± 0.0 | 0.71 ± 0.1 |
| edit cycles | XE-GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| edit cycles | GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| degree rules | VGAE | 0.15 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.96 ± 0.0 | 0.88 ± 0.0 | 0.97 ± 0.1 | 1.00 ± 0.0 | 0.97 ± 0.1 |
| degree rules | VGRNN | 0.14 ± 0.0 | 1.00 ± 0.0 | 0.72 ± 0.0 | 1.00 ± 0.0 | 0.56 ± 0.0 | 0.21 ± 0.0 | 1.00 ± 0.0 | 0.02 ± 0.0 |
| degree rules | XE-GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.97 ± 0.0 | 0.99 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| degree rules | GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.97 ± 0.1 | 0.99 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| game of life | VGAE | 0.27 ± 0.1 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.03 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| game of life | VGRNN | 0.31 ± 0.1 | 1.00 ± 0.0 | 0.32 ± 0.1 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.00 ± 0.0 | 1.00 ± 0.0 | 0.01 ± 0.0 |
| game of life | XE-GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 0.98 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |
| game of life | GEN | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 | 1.00 ± 0.0 |

Node insertion precision, node deletion recall, and edge deletion recall were consistently at 100% for the GEN models, and edge insertion precision as well as edge deletion precision were very close to 100%. This is the case for both the hinge and the crossentropy loss. Unsurprisingly, VGAE and VGRNN perform poorly on node edits, for which they are not designed, but also underperform on some edge edit scores, especially VGRNN on edge edit precision. We note that our datasets are strictly Markovian, such that the added recurrent capability of VGRNNs may not provide much value and that, indeed, the difficulty of maintaining the state across node set changes may hurt VGRNN in these cases.

Next, we consider the following two tree dynamical systems, where node labels are represented via one-hot coding.

Boolean Formulae: Random Boolean formulae with up to three binary operators, which get simplified with fixed rewriting rules until the formula can not be simplified anymore.

Peano Addition: Random additions of single-digit integers with at most three + operators, which get re-written using Peano's definition of addition, i.e. $m + \mathrm{succ}(n) = \mathrm{succ}(m) + n$ and $m + 0 = m$.

The purpose of these datasets is to test whether our GEN is able to correctly predict node attributes. We compare against a Gaussian process approach for tree time series prediction suggested by Paaßen et al. (2018), which uses Gaussian process regression with an RBF kernel on the tree edit distance to predict the next tree in kernel space and then solves a kernel pre-image problem to find the actual next tree (Paaßen et al., 2018). We use the same GEN hyperparameters as for the graph dynamical systems.
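The Peano rewriting can be made concrete with the small sketch below, which represents trees as nested tuples and applies the two stated rules until a normal form is reached. The term encoding is an assumption for illustration, not the dataset code.

```python
# Terms are nested tuples: 0, ('succ', t), or ('+', s, t).

def num(k):
    """Encode a non-negative integer as a Peano numeral."""
    return 0 if k == 0 else ('succ', num(k - 1))

def rewrite_once(term):
    """Apply one rule at the outermost applicable position; return (term, changed)."""
    if isinstance(term, tuple) and term[0] == '+':
        m, n = term[1], term[2]
        if n == 0:                                    # m + 0 -> m
            return m, True
        if isinstance(n, tuple) and n[0] == 'succ':   # m + succ(n) -> succ(m) + n
            return ('+', ('succ', m), n[1]), True
    if isinstance(term, tuple):                       # otherwise recurse into children
        for pos in range(1, len(term)):
            child, changed = rewrite_once(term[pos])
            if changed:
                return term[:pos] + (child,) + term[pos + 1:], True
    return term, False

def time_series(term):
    """All intermediate trees until no rule applies anymore."""
    series, changed = [term], True
    while changed:
        term, changed = rewrite_once(term)
        if changed:
            series.append(term)
    return series

series = time_series(('+', num(2), num(3)))  # 2 + 3, rewritten down to the numeral for 5
```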
We observe that the GEN is able to achieve perfect predictive accuracy for both datasets on all five repeats of the experiment. By contrast, the Gaussian process prediction scheme yields an RMSE (as measured by the tree edit distance) of 2.39 (std.: 0.24) on Boolean and 4.54 (std.: 0.77) on Peano, indicating that the datasets are not trivial to predict.

We evaluated the runtime of GENs on a variation of the HEP-Th paper dataset of Leskovec et al. (2007). In particular, we considered all authors as nodes and included an edge if two authors submitted a joint HEP-Th paper within the last $\tau$ months before some month $t$. We removed authors without any papers from the graph and resolved duplicates. We varied $t$ over the entire range between January 1992 and April 2003 in the HEP-Th dataset and $\tau \in \{1, \ldots, 12\}$, thus obtaining 1554 different graphs of sizes in the range $[100, 2786]$. For all these graphs, we measured the time needed to compute a forward and backward pass with a GEN without edge filtering (i.e. $e^+ = e^- = \vec{1}$ in all cases) and with edge filtering, respectively. For forward computations, both variants scaled sub-quadratically with empiric exponents of 1.65 and 1.37, respectively. For the backward pass, GENs without edge filtering scaled with an exponent of 4.11, whereas GENs with edge filtering scaled roughly linearly (exponent 0.93), yielding faster times by several orders of magnitude. We emphasize that edge filtering can not be made arbitrarily strict. As discussed in Section 3.2, the limit on the number of edited nodes must be adjusted to the training data; otherwise accuracy will suffer. Still, the precise value of the limit only influences a constant factor in the linear efficiency.

5 CONCLUSION

We introduced the graph edit network, a novel output layer for graph neural networks to predict graph edits. In contrast to prior work, graph edits cover all possible graph-to-graph transformations, including changes in the node set. Importantly, graph edits are sparse, reducing the time complexity from quadratic to linear and facilitating interpretation. Further, graph edits can be locally decided at each node, avoiding the need for pooling layers. In addition to the novel output layer, we also provided a training scheme which only requires a mapping between nodes as input and then returns target values for all outputs of the linear layer, turning the training into a combination of binary classification and regression tasks that can be solved with established methods. We further showed that, if the input node mapping is optimal and the graph neural network layers are expressive enough, a graph edit network can achieve edit scripts almost as short as the graph edit distance. Empirically, we evaluated graph edit networks against variational graph autoencoders, variational graph recurrent nets, and kernel time series prediction on three graph and two tree dynamical systems, proving the concept of graph edit networks. We hope that our work is the starting point for exciting further research in the field of graph-to-graph learning and time series prediction on graphs, especially in combination with recurrent graph neural networks to go beyond the Markovian setting.

ACKNOWLEDGMENTS

Funding by the German Research Foundation (DFG) under grant number PA 3460/1-1 as well as the Bielefeld Young Researchers Fund is gratefully acknowledged. D.Z. and D.G.
gratefully acknowledge the support of the Swiss National Science Foundation under grant 200021/172671. Zeina Abu-Aisheh, Romain Raveaux, Jean-Yves Ramel, and Patrick Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In Ana Fred, Maria De Marsico, and Mario Figueiredo (eds.), Proceedings of the 4th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2015), pp. 271 278, 2015. doi:10.5220/0005209202710278. Zeina Abu-Aisheh, Benoit Gaüzere, Sébastien Bougleux, Jean-Yves Ramel, Luc Brun, Romain Raveaux, Pierre Héroux, and Sébastien Adam. Graph edit distance contest: Results and future challenges. Pattern Recognition Letters, 100:96 103, 2017. doi:10.1016/j.patrec.2017.10.007. Davide Bacciu, Alessio Micheli, and Marco Podda. Graph generation by sequential edge prediction. In Michel Verleysen (ed.), Proceedings of the 27th European Symposium on Artificial Neural Networks (ESANN 2019), pp. 95 100, 2019. URL http://www.elen.ucl.ac.be/Proceedings/ esann/esannpdf/es2019-107.pdf. Davide Bacciu, Federico Errica, Alessio Micheli, and Marco Podda. A gentle introduction to deep learning for graphs. Neural Networks, 129:203 221, 2020. doi:10.1016/j.neunet.2020.06.006. Yunsheng Bai, Hao Ding, Song Bian, Ting Chen, Yizhou Sun, and Wei Wang. Simgnn: A neural network approach to fast graph similarity computation. In J. Shane Culpepper, Alistair Moffat, Paul Bennett, Kristina Lerman, and Joel Mackenzie (eds.), Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM 2019), pp. 384 392, 2019. doi:10.1145/3289600.3290967. Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. ar Xiv preprint ar Xiv:1806.01261, 2018. URL https://arxiv.org/abs/1806.01261. David B. Blumenthal, Nicolas Boria, Johann Gamper, Sébastien Bougleux, and Luc Brun. Comparing heuristics for graph edit distance computation. The VLDB Journal, 29:419 458, 2020. doi:10.1007/s00778-019-00544-1. Sebastien Bougleux, Luc Brun, Vincenzo Carletti, Pasquale Foggia, Benoit Gaüzère, and Mario Vento. Graph edit distance as a quadratic assignment problem. Pattern Recognition Letters, 87: 38 46, 2017. doi:10.1016/j.patrec.2016.10.001. Wren Chan and Aspen Olmsted. Ethereum transaction graph analysis. In Ella Pereira and Michael Mackey (eds.), Proceedings of the 12th International Conference for Internet Technology and Secured Transactions (ICITST 2017), pp. 498 500, 2017. doi:10.23919/ICITST.2017.8356459. Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems, 21(11):4883 4894, 2019. doi:10.1109/TITS.2019.2950416. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Proceedings of the 20th International Conference on Advances in Neural Information Processing Systems (NIPS 2016), pp. 3844 3852, 2016. URL https://papers.nips.cc/paper/ 6081-convolutional-neural-networks-on-graphs-with-fast-localized-spectral-filterin Published as a conference paper at ICLR 2021 Tianfan Fu, Cao Xiao, and Jimeng Sun. 
Core: Automatic molecule optimization using copy & refine strategy. In Francesca Rossi, Vincent Conitzer, and Fei Sha (eds.), Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 638 645, 2020. doi:10.1609/aaai.v34i01.5404. Martin Gardner. The fantastic combinations of John Conway s new solitaire game "life". Scientific American, 223:120 123, 1970. URL https://www.scientificamerican.com/ article/mathematical-games-1970-10/. Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. Dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems, 187:104816, 2020. doi:10.1016/j.knosys.2019.06.024. Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. Variational graph recurrent neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché Buc, E. Fox, and R. Garnett (eds.), Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems (Neur IPS 2019), pp. 10700 10710, 2019. URL http://papers.nips.cc/paper/ 9254-variational-graph-recurrent-neural-networks. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Proceedings of the 30th Conference on Advances in Neural Information Processing Systems (NIPS 2017), pp. 1024 1034, 2017. URL http://papers.nips.cc/paper/ 6703-inductive-representation-learning-on-large-graphs. Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-tograph translation for molecule optimization. In Tara Sainath, Alexander Rush, Sergey Levine, Karen Livescu, and Shakir Mohamed (eds.), Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019. URL https://openreview.net/forum? id=B1x JAs A5F7. Matt J. Keeling and Ken T.D. Eames. Networks and epidemic models. Journal of the Royal Society Interface, 2(4):295 307, 2005. doi:doi.org/10.1098/rsif.2005.0051. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio, Yann Le Cun, Brian Kingsbury, Samy Bengio, Nando de Freitas, and Hugo Larochelle (eds.), Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015. URL https://arxiv.org/abs/1412.6980. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Hugo Larochelle, Samy Bengio, Brian Kingsbury, Yoshua Bengio, and Yann Le Cun (eds.), Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), 2016a. URL https://arxiv.org/abs/1609.02907. Thomas N. Kipf and Max Welling. Variational graph auto-encoders. In Yarin Gal, Christos Louizos, Zoubin Ghahramani, Kevin Murphy, and Max Welling (eds.), Proceedings of the NIPS 2016 Workshop on Bayesian Deep Learning, 2016b. URL http://bayesiandeeplearning. org/2016/papers/BDL_16.pdf. Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data, 1(1):2 es, March 2007. doi:10.1145/1217299.1217301. Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707 710, 1965. Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. ar Xiv, 2018. URL https://arxiv.org/abs/1803.03324. 
Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Published as a conference paper at ICLR 2021 Learning (ICML 2019), pp. 3835 3845, 2019. URL http://proceedings.mlr.press/ v97/li19d.html. David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019 1031, 2007. doi:10.1002/asi.20591. Naoki Masuda and Petter Holme. Detecting sequences of system states in temporal networks. Scientific reports, 9(1):1 11, 2019. doi:10.1038/s41598-018-37534-2. Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks, 20(3):498 511, 2009. doi:10.1109/TNN.2008.2010350. Amit A. Nanavati, Siva Gurumurthy, Gautam Das, Dipanjan Chakraborty, Koustuv Dasgupta, Sougata Mukherjea, and Anupam Joshi. On the structural properties of massive telecom call graphs: Findings and implications. In Philip S. Yu, Vassilis Tsotras, and Edward Fox (eds.), Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM 06), pp. 435 444, 2006. doi:10.1145/1183614.1183678. Benjamin Paaßen, Christina Göpfert, and Barbara Hammer. Time series prediction for graphs in kernel and dissimilarity spaces. Neural Processing Letters, 48(2):669 689, 2018. doi:10.1007/s11063017-9684-5. Benjamin Paaßen, Barbara Hammer, Thomas Price, Tiffany Barnes, Sebastian Gross, and Niels Pinkwart. The continuous hint factory - providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Datamining, 10(1):1 35, 2018. URL https://jedm. educationaldatamining.org/index.php/JEDM/article/view/158. Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B Schardl, and Charles E Leiserson. Evolve GCN: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pp. 5363 5370, 2020. doi:https://doi.org/10.1609/aaai.v34i04.5984. Emile Richard, Stéphane Gaïffas, and Nicolas Vayatis. Link prediction in graphs with autoregressive features. Journal of Machine Learning Research, 15(1):565 593, 2014. URL http://www. jmlr.org/papers/v15/richard14a.html. Kaspar Riesen and Horst Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing, 27(7):950 959, 2009. doi:10.1016/j.imavis.2008.04.004. 7th IAPR-TC15 Workshop on Graph-based Representations (Gb R 2007). Alberto Sanfeliu and King-Sun Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(3):353 362, 1983. doi:10.1109/TSMC.1983.6313167. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61 80, 2009. doi:10.1109/TNN.2008.2005605. Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. In Jun Wang, Long Cheng, Andrew Leung, and Seiichi Ozawa (eds.), Proceedings of the 25th International Conference on Neural Information Processing (ICONIP 2018), pp. 362 373. Springer, 2018. 
doi:10.1007/978-3-030-04167-0_33. Naoki Shibata, Yuya Kajikawa, and Ichiro Sakata. Link prediction in citation networks. Journal of the American Society for Information Science and Technology, 63(1):78 85, 2012. doi:10.1002/asi.21664. Johan A K Suykens, Tony Van Gestel, Jos De Brabanter, Bart De Moor, and Joos Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002. doi:10.1142/5089. Published as a conference paper at ICLR 2021 Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Yoshua Bengio, Yann Le Cun, Tara Sainath, Ian Murray, Marc Aurelio Ranzato, and Oriol Vinyals (eds.), Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018. URL https://openreview.net/forum? id=r JXMpik CZ. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Tara Sainath, Alexander Rush, Sergey Levine, Karen Livescu, and Shakir Mohamed (eds.), Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019. URL https://arxiv.org/abs/1810.00826. Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 6410 6421. Curran Associates, Inc., 2018a. URL http://papers.nips.cc/paper/ 7877-graph-convolutional-policy-network-for-goal-directed-molecular-graph-generati pdf. Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graph RNN: Generating realistic graphs with deep auto-regressive models. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 5708 5717, 2018b. URL http://proceedings.mlr.press/v80/you18a.html. Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Jérôme Lang (ed.), Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 3634 3640, 2018. URL https: //www.ijcai.org/Proceedings/2018/0505.pdf. Daniele Zambon, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Autoregressive models for sequences of graphs. In Chrisina Jayne, Zoltan Somogyvari, Plamen Angelov, and Manuel Roveri (eds.), Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN 2019), pp. 1 8, 2019. doi:10.1109/IJCNN.2019.8852131. Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245 1262, 1989. doi:10.1137/0218082. Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, and Yixin Chen. D-VAE: A variational autoencoder for directed acyclic graphs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché Buc, E. Fox, and R. Garnett (eds.), Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems (Neur IPS 2019), pp. 1586 1598, 2019. URL http://papers.nips.cc/paper/ 8437-d-vae-a-variational-autoencoder-for-directed-acyclic-graphs. Junbo Zhao, Michael Mathieu, and Yann Le Cun. Energy-based generative adversarial network. 
In Yoshua Bengio, Yann Le Cun, Marc Aurelio Ranzato, Hugo Larochelle, Oriol Vinyals, and Tara Sainath (eds.), Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017. URL https://arxiv.org/abs/1609.03126. Published as a conference paper at ICLR 2021 A.1 PROOF OF THEOREM 1 We structure our proof into three lemmas. Lemma 1 translates a graph mapping into a script, Lemma 2 translates a script to a graph mapping, and Lemma 3 shows that the length never gets longer by doing so. This argument is analogous to Proposition 1 by Bougleux et al. (2017), who show a one-to-one correspondence between bijective mappings and a reduced set of edit scripts, called restricted paths. In this work, we aim to connect the notion of graph mappings to teaching signals for graph edit networks. To make this connection, we require stronger statements about the structure of such graph mappings and a different order of operations compared to Bougleux et al. (2017), such that we provide a full argument here, without relying on Bougleux et al. (2017) directly. First, we define the corresponding script to a graph mapping. Definition 2. Let G = (V = {1, . . . , M}, E, X) and H = (V = {1, . . . , N}, E , X ) be two nonempty graphs, i.e. M, N > 0, and let ψ be a graph mapping between G and H. We then define the script δψ corresponding to ψ between G and H as follows. First, we construct a replacement repi, x ψ(i) for all i (in ascending order) where i M, ψ(i) N and xi = x ψ(i). Let δrep ψ be the script of all these replacements. Next, we construct an insertion insx ψ(i) for all i (in ascending order) where i > M and ψ(i) N. Let δins ψ be the script of all these insertions. Next, we construct an edge insertion einsi,j for all (i, j) (in lexicographically ascending order) where (i, j) / E but (ψ(i), ψ(j)) E . Let δeins ψ be the script of all these edge insertions. Next, we construct an edge deletion edeli,j for all (i, j) (in lexicographically ascending order) where (i, j) E but (ψ(i), ψ(j)) / E . Let δedel ψ be the script of all these edge deletions. Finally, we construct a node deletion deli for all i M (in descending order) where ψ(i) > N. Let δdel ψ be the script of all these node deletions. We then define the overall script δψ as the concatenation δψ = δrep ψ , δins ψ , δeins ψ , δedel ψ , δdel ψ . We next show that the scripts resulting from graph mappings do what they should, i.e. they to indeed convert G to a graph that is isomorphic to H. Lemma 1. Let G = (V = {1, . . . , M}, E, X) and H = (V = {1, . . . , N}, E , X ) be two nonempty graphs, i.e. M, N > 0, let ψ be a graph mapping between G and H, and let δψ be the corresponding script. Then, H := δψ(G) is isomorphic to H. Proof. In particular, by applying δrep ψ and δins ψ , to G, we obtain a node set V and attributes X with the following properties. First, consider U := {i V |ψ(i) N}. Due to our construction and Definition 1 of a graph mapping, it holds U = {i M|ψ(i) N} {M < i M + |Insψ|} and ψ(U) = {j N|ψ 1(j) M} Insψ = V . Further, for all i U we have by construction xi = x ψ(i), because either i M and xi = x ψ(i), in which case this holds trivially, or i M and xi = x ψ(i), in which case a replacement was applied which ensured the condition, or M < i M + |Insψ|, in which case an insertion was applied that ensured the condition. Next, we consider the graph G = ( V , E, X) = δrep ψ , δins ψ , δeins ψ , δedel ψ (G), i.e. the graph that results from additionally applying all edge insertions and edge deletions. 
For this graph we can show that the subgraph restricted to the node set U is isomorphic to H with the isomorphism ψ restricted to U. In particular, we have already shown that ψ(U) = V and that xi = x ψ(i). It remains to show that for all i, j U it holds: (i, j) E if and only if (ψ(i), ψ(j)) E . This, however, is exactly the condition enforced by the edge deletion/insertion construction above. Accordingly, the subgraph restricted to the nodes U is indeed isomorphic to H. Additionally, we also obtain that for any node i / U, there exists no edge (i, j) or (j, i) in E, because these would have been deleted by δedel ψ . Finally, δdel ψ deletes all nodes i / U (in descending order to prevent interference in node ordering) and the remaining graph H = δ(G) is isomorphic to H, because it is exactly the subgraph we have analyzed before. This concludes the proof. In a next step, we define the conversion back from a script to a graph mapping. Published as a conference paper at ICLR 2021 Definition 3. Let G = (V = {1, . . . , M}, E, X) and H = (V = {1, . . . , N}, E , X ) be two non-empty graphs, i.e. M, N > 0. Further, let H = ( V = {1, . . . , N}, E , X ) be a graph that is isomorphic to H via isomorphism ψ : {1, . . . , N} {1, . . . , N} from V to V , i.e. for all i V : x i = x ψ(i), and (i, j) E if and only if ( ψ(i), ψ(j)) E . Finally, let δ = δ1, . . . , δT be an edit script, such that δ(G) = H. Then, we define the mapping ψ δ : {1, . . . , M +N} {1, . . . , M +N} recursively as follows. First, if T = 0 (i.e. δ is empty), we set ψ δ = ψ, extended by the identity on the inputs N +1, . . . , M + N. Next, if T > 0, let ψ := ψδ2,...,δT and consider the first edit δ1. If δ1(G) = G, i.e. the edit has no effect, we set ψ δ = ψ . If δ1 = delj for some j, we set ψ δ(i) = ψ (i) for all i < j, ψ δ(i) = ψ (i 1) for all i > j, and ψ δ(i) = M + N. If δ1 = ins x for some x, let i := ψ 1(M + N + 1) and distinguish four cases. First, if i M and ψ (M + 1) N, we set ψ δ(i) = ψ (i) for all i = i and ψ δ(i ) = ψ (M + N + 1). Second, if i M and ψ (M +1) > N, we set ψ δ(i) = ψ (i) for all i = i with i M, ψ δ(i ) = ψ (M +1), and ψ δ(i) = ψ (i + 1) for all i > M. Third, if i = M + 1, we set ψ δ(i) = ψ (i) for all i M and ψ δ(i) = ψ (i + 1) for all i > M. Fourth, if i > M + 1, we set ψ δ(i) = ψ , restricted to {1, . . . , M + N}. Finally, if δ1 is any other edit, we set ψ δ = ψ . We will later prove in Corollary 2 that the resulting mapping ψ δ is indeed a graph mapping. Refer to Figure 3 for an example of the construction. In particular, we consider the script δ = del2, ins x1, ins x2, del3, transforming the graph G = (V = {1, 2, 3}, E = , X) to the graph H = ( V = {1, 2, 3}, E = , X ), which in turn is isomorphic to the graph H = (V = {1, 2, 3}, E = , X ) for some attributes X, X , X , and for the isomorphism ψ between H and H with ψ(1) = 3, ψ(2) = 1, and ψ(3) = 2. Now, we construct the mapping ψ δ recursively from the last edit to the first. The initial mapping for the empty script is ψϵ : {1, . . . , 3 + 3} {1, . . . , 3 + 3} with ψϵ(i) = ψ(i) for i 3 and ψϵ(i) = i for i > 3. Next, we incorporate the last edit of the script del3. Because this is a deletion, the graph before the deletion must have had one node more. Accordingly, ψdel3 is now defined on the domain and image {1, . . . , 4 + 3}. In more detail, following Definition 3 we obtain the mapping ψdel3(1) = ψϵ(1) = 3, ψdel3(2) = ψϵ(2) = 1, ψdel3(3) = 4 + 3 = 7, ψdel3(4) = ψϵ(4 1) = 2, ψdel3(5) = ψϵ(5 1) = 4, ψdel3(6) = ψϵ(6 1) = 5, and ψdel3(7) = ψϵ(7 1) = 6. 
In the figure, this mapping is obtained by following all arrows from the graph right of H to their end point. Next, we incorporate the edit ins x2. Because we consider an insertion, the graph before the insertion had one node less. Accordingly, the mapping ψins x2,del3 is now defined on the domain and image {1, . . . , 3 + 3}; it is crucial that we remove the entry 3 + 3 + 1 = 7 from the codomain of our mapping. The pre-image of 7 is i = ψ 1 del3(7) = 3. Further, we need to consider whether the inserted node 3 + 1 = 4 gets deleted later in the script. Because ψdel3(3 + 1) = 2 3, we know that this is not the case. Accordingly, the first of the four cases in Definition 3 applies and we obtain ψins x2,del3(i) = ψdel3(i) for all i = i = 3 and ψins x2,del3(3) = ψdel3(7) = 6. Note that this is still a proper graph mapping, i.e. the bijectivity is never violated and the only node j Insψins x2 ,del3 is 2 with the pre-image ψins x2,del3(2) = 4 = 3 + 1, as required. Now, we incorporate another insertion ins x1. We observe that the newly inserted node 2 + 1 = 3 will get deleted later, since we have ψins x2,del3(2 + 1) = 6 > 3. Furthermore, the newly inserted node 3 is at the same time the pre-image of entry 2 + 3 + 1 = 6, which we need to delete later. Accordingly, the third case in Definition 3 applies and we obtain ψins x1,ins x2,del3(i) = ψins x2,del3(i) for i 2 and ψins x1,ins x2,del3(i) = ψins x2,del3(i + 1) for i > 2. Because we do not refer to ψins x2,del3(2 + 1) = 6, we do not leave the desired range {1, . . . , 2 + 3}. Also note that Insψins x1 ,ins x2 ,del3 still contains only the node 2, which has the pre-image ψ 1 ins x1,ins x2,del3(2) = 3 = 2 + 1 as required. Published as a conference paper at ICLR 2021 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 Figure 3: An illustration of the conversion of a script δ = del2, ins x1, ins x2, del3 to a graph mapping ψ δ. We start with an initial isomorphism ψ between the graph H and the graph H = δ(G) (left) and then adjust the mapping from left to right until we arrive at a final graph map between G and H (right). The mapping is obtained by following the arrows from G to H. For simplicity we assume that the graphs have no edges. Also note that we include the virtual nodes M + 1, . . . , M + N in the illustration. Finally, we incorporate the deletion del2. As stated in Definition 3, we thus obtain ψ δ(i) = ψins x1,ins x2,del3(i) for i < 2, ψ δ(2) = 3 + 3 = 6, and ψ δ(i) = ψins x1,ins x2,del3(i 1) for i > 2. The final mapping is thus ψ δ(1) = 3, ψ δ(2) = 6, ψ δ(3) = 1, ψ δ(4) = 2, ψ δ(5) = 4, and ψ δ(6) = 5. In Figure 3, we obtain this mapping by starting at G and following the arrows to their end points. Our next step is to show that the mapping constructed from a script is indeed a graph mapping and that additional structure applies, which we can exploit later on. Lemma 2. Let G, H, H be graphs as in the previous definition with isomorphism ψ between H and H, let δ = δ1, . . . , δT be a script, such that δ(G) = H, let ψ δ be the corresponding mapping, and let ψ : {1, . . . , M + N} {1, . . . , M + N} be defined as ψ(i) = ψ 1(ψ δ(i)) if ψ δ(i) N and ψ(i) = ψ δ(i) otherwise, i.e. the mapping composed with ψ 1. Further, let Delψ := {i M|ψ(i) > N}. Then, ψ is bijective. Further, for all i / Delψ, ψ is monotonously increasing (constraint 1), and the image of ψ on Delψ is {M + N + 1 |Delψ|, . . . , M + N} (constraint 2). Proof. We perform this proof via induction over the length T of the script δ = δ1, . . . , δT . 
If T = 0, ψ is the identity, which is obviously bijective, yields Delψ = (such that constraint 2 is fulfilled), and is monotonously increasing (such that constraint 1 is fulfilled). Now, if T > 1, let ψ be defined as ψ above, but between the graphs δ1(G) and H. Accordingly, ψ is a bjective mapping from and to {1, . . . , M + N} by induction, where M is the size of the node set of δ1(G). Further, by induction we also know that ψ is monotonously increasing on all inputs except on Delψ := {i M |ψ (i) > N} (constraint 1) and that the image of ψ on Delψ is precisely {M + N + 1 |Delψ |, . . . , M + N} (constraint 2). Now, consider the first edit δ1. If δ1 leaves the node set as-is, we obtain M = M , ψ = ψ , and both constraints hold by induction. If δ1 = delj for some j M, we obtain M = M 1. Further, by the definition of ψ δ, we obtain ψ(i) = ψ (i) for i < j, ψ(i) = ψ (i 1) for i > j, and ψ(j) = M + N. Accordingly, ψ maps the set {1, . . . , j 1} {j + 1, . . . , M + N} bijectively to {1, . . . , M + N 1} and the setting ψ(j) = M + N completes the bijective map. Further, j Delψ and is mapped to M + N, conforming to constraint 2. Finally, for all other inputs both constraints hold by induction. If δ1 = ins x for some x, we obtain M = M + 1 and thus ψ is a bijective map from and to {1, . . . , M + N + 1}. Now, let i = ψ 1(M + N + 1) and distinguish four cases. Published as a conference paper at ICLR 2021 First, if i M and ψ (M + 1) N we obtain ψ(i) = ψ (i) for all i = i and ψ(i ) = ψ (M + N +1). First, observe that ψ (M +N +1) < M +N +1, otherwise ψ (M +N +1) = M +N +1 and i = M + N + 1 > M, which is a contradiction. Accordingly, ψ is a bijective map from and to {1, . . . , M + N} because by definition ψ maps the set {1, . . . , i 1} {i + 1, . . . , M + N} to {1, . . . , M +N}\{ψ (M +N +1)} and the setting ψ(i ) = ψ (M +N +1) completes the bijective map. Furthermore, we obtain ψ (M + N + 1) > N. Otherwise, the monotonicity of ψ would require all values M + 2, . . . , M + N to be mapped to values N as well. However, there exist only N such values and one is already blocked by ψ (M + 1), which is a contradiction. Accordingly, ψ (M + N + 1) > N and thus i Delψ. Both constraints hold by induction. Second, if i M and ψ (M + 1) > N we obtain ψ δ(i) = ψ (i) for all i = i with i M, ψ δ(i ) = ψ (M + 1), and ψ δ(i) = ψ (i + 1) for all i > M. Accordingly, ψ is a bijective map from and to {1, . . . , M + N} because by definition ψ maps the set {1, . . . , i 1, i + 1 . . . , M + N} to the image of ψ for the inputs {1, . . . , i 1, i + 1 . . . , M, M + 2, . . . , M + N + 1}, which is exactly {1, . . . , M + N} \ {ψ (M + 1)}. Setting ψ δ(i ) = ψ (M + 1) then completes the bijective map. Further, on all inputs except i both constraints still hold due to induction. For i , we obtain i Delψ because ψ(i ) = ψ (M + 1) > N and because M + 1 Delψ constraint 2 also holds for i due to induction. Third, if i = M + 1, we obtain ψ δ(i) = ψ (i) for all i M, ψ δ(i) = ψ (i + 1) for all i > M. This is a bijective map from and to {1, . . . , M + N} because by definition ψ maps {i, . . . , M + N} to the image of ψ for {1, . . . , M, M + 2, . . . , M + N + 1}, which is exactly {1, . . . , M + N}. Further, the constraints hold due to induction. Fourth, if i > M + 1, we obtain ψ δ(i) = ψ restricted to {1, . . . , M + N}. In this case, we observe that i must be M + N + 1. Otherwise, we would obtain ψ (i + 1) < M + N + 1 = ψ (i ), which contradicts monotonicity of ψ . Accordingly, restricting ψ to {1, . . . 
, M + N} does indeed yield a bijective map to {1, . . . , M + N} which conforms to both constraints due to induction. Because this covers all possible edits, this concludes our proof. Corollary 2. Let G, H, H, ψ, δ, ψ δ, and ψ be defined as in the previous Lemma. then, ψ δ is a graph mapping between G and H. Proof. First, observe that ψ δ(i) = ψ(ψ(i)) for all i with ψ δ(i) N, and ψ δ(i) = ψ(i) otherwise. Further, observe that ψ δ must thus be bijective. Otherwise, ψ or ψ could not be bijective. Next, assume that there exists some j Insψ δ such that ψ 1 δ (j) > M + |Insψ δ|. Then, because ψ δ is bijective, there also exists some i with M < i M + |Insψ δ| with ψ δ(i) / Insψ δ. This implies that ψ δ(i) > N j, even though ψ 1 δ (j) > i, which contradicts the monotonicity constraint on ψ. Accordingly, ψ δ is indeed a graph mapping between G and H. Lemma 3. Let G = (V = {1, . . . , M}, E, X) and H = (V = {1, . . . , N}, E , X ) be two nonempty graphs, i.e. M, N > 0. Further, let H = ( V = {1, . . . , N}, E , X ) be a graph that is isomorphic to H via isomorphism ψ : {1, . . . , N} {1, . . . , N}, and let δ be an edit script with δ(G) = H. Finally, let ψ δ be the corresponding graph mapping between G and H and let δψ δ be the edit script corresponding to that graph mapping. Then, it holds: | δψ δ| | δ|. Proof. We again prove this claim via induction over the length T of the script δ = δ1, . . . , δT . First, if T = 0, G = δ(G) = H. Accordingly, ψ δ is ψ (extended with the identity on M +1, . . . , M + N). It then follows that δψ δ is empty. In particular, there can not exist any i M with xi = x ψ δ(i) = x ψ(i), otherwise ψ would not be an isomorphism between H and H. Accordingly, δrep ψ δ is empty. Further, there can not exist any i > M with ψ δ(i) N, because ψ δ(i) = i > M = N. Accordingly, δins ψ δ is empty. Next, for all i, j M it holds (i, j) E (ψ δ(i), ψ δ(j)) = ( ψ(i), ψ(j)) E , otherwise ψ would not be an isomorphism between H and H; and for all i, j > M = N it holds (i, j) / E and (ψ δ(i), ψ δ(j)) = (i, j) / E per definition. Accordingly, both δeins ψ δ and δedel ψ δ are empty. Finally, there can not exist an i M with ψ δ(i) = ψ(i) > N because the image of ψ is Published as a conference paper at ICLR 2021 {1, . . . , N}. Accordingly, δdel ψ δ is empty and thus δψ δ = δrep ψ δ δins ψ δ δeins ψ δ δedel ψ δ δdel ψ δ is overall empty as well, i.e. | δψ δ| = 0 0 = | δ|, as claimed. Now, consider the case T > 0, let ψ := ψδ2,...,δT , and let G = ( V , E, X) := δ1(G). Further, let δψ be the script that ψ is transformed into based on G and H. By induction, | δψ | T 1. We now consider the first edit δ1. If δ1 has no effect on G, we obtain ψ δ = ψ and by definition δψ δ = δψ , which in turn yields | δψ δ| = | δψ | T 1 < T as claimed. If δ1 = repi, x for some i M and some x, we obtain ψ δ = ψ , V = V , E = E, xj = xj for j = i, and xi = x = xi. Now, distinguish two cases. If x = x ψ (i), i.e. xi = x ψ (i) = xi, δψ δ contains by construction an edit repi, x and otherwise the same edits as δψ , such that we obtain | δψ δ| = | δψ | + 1 T 1 + 1 = T as claimed. If x = x ψ (i), i.e. δ1 sets xi to a value which is temporary and not equal to the final value of x ψ(i), we obtain δψ δ = δψ , because δψ by construction already contains an edit repi, x ψ (i). Accordingly, we obtain | δψ δ| = | δψ | T 1 < T as claimed. If δ1 = ins x for some x, we obtain V = V {M + 1}, E = E, xi = xi for i M, and x M+1 = x. Now, let let i := ψ 1(M + N + 1) and distinguish four cases. 
First, if i M and ψ (M + 1) N, we obtain ψ δ(i) = ψ (i) for i = i and ψ δ(i ) = ψ (M + N + 1). As argued in the proof of the previous lemma, ψ (M + N + 1) > N due to monotonicity of ψ . Accordingly, i is deleted both in δψ and in δψ δ. Further, M + 1 is inserted in δψ δ but not in δψ , while all other edits remain the same such that we obtain | δψ δ| = | δψ | + 1 T 1 + 1 = T as claimed. Second, if i M and ψ (M + 1) > N, we obtain ψ δ(i) = ψ (i) for all i with i M and i = i , ψ δ(i ) = ψ (M + 1), and ψ δ(i) = ψ (i + 1) for all i > M. Because ψ (M + 1) > N, i is deleted both in δψ and in δψ δ. Further, M + 1 is deleted in δψ but not in δψ δ, while all other edits remain the same, such that we obtain | δψ δ| = | δψ | 1 T 2 < T as claimed. Third, if i = M + 1, we obtain ψ δ(i) = ψ (i) for all i M and ψ δ(i) = ψ (i + 1) otherwise. Again, M + 1 is deleted in δψ but not in δψ δ, while all other edits remain the same, such that we obtain | δψ δ| = | δψ | 1 T 2 < T as claimed. Fourth, if i > M + 1, we obtain ψ δ = ψ restricted to {1, . . . , M + N}. Further, i = M + N + 1 as argued in the proof of the previous lemma. Accordingly, M + 1 is inserted in δψ δ but not in δψ , while all other edits remain the same, such that we obtain | δψ δ| = | δψ | + 1 T 1 + 1 = T as claimed. Next, if δ1 = einsi,j, we obtain ψ δ = ψ as well as V = V , X = X , and E = E {(i, j)}. Now, distinguish two cases. If (ψ (i), ψ (j)) E , δψ δ inserts (i, j) whereas δψ does not, while all other edits remain the same, such that we obtain | δψ δ| = | δψ |+1 T 1+1 = T as claimed. Conversely, if (ψ (i), ψ (j)) / E , δψ deletes (i, j) whereas δψ δ does not, while all other edits remain the same, such that we obtain | δψ δ| = | δψ | 1 T 2 < T as claimed. Next, if δ1 = edeli,j, we obtain ψ δ = ψ as well as V = V , X = X , and E = E \ {(i, j)}. Now, distinguish two cases. If (ψ (i), ψ (j)) / E , δψ δ deletes (i, j) whereas δψ does not, while all other edits remain the same, such that we obtain | δψ δ| = | δψ |+1 T 1+1 = T as claimed. Conversely, if (ψ (i), ψ (j)) E , δψ inserts (i, j) whereas δψ δ does not, while all other edits remain the same, such that we obtain | δψ δ| = | δψ | 1 T 2 < T as claimed. Finally, if δ1 = deli, we obtain ψ δ(j) = ψ (j) for all j < i, ψ δ(j) = ψ (j 1) for all j > i, and ψ δ(j) = M + N, as well as V = {1, . . . , M 1}, xj = xj for all j V , and E = E. Accordingly, δψ δ deletes i whereas δψ does not, while all other edits remain the same, such that we obtain | δψ δ| = | δψ | + 1 T 1 + 1 = T as claimed. Because this covers all possible edits, this concludes our proof. Published as a conference paper at ICLR 2021 We note again that the entire argument up to this point is analogous to Proposition 1 of Bougleux et al. (2017). However, in contrast to Bougleux et al. (2017), we do not rely on a notion of restricted edit scripts, instead comparing to all possible edit scripts, and we provide stronger results regarding the structure of our graph mappings via Lemma 2. We also slightly change the order of operations for consistency with Algorithm 1. A.2 PROOF OF THEOREM 2 The heart of our proof is Algorithm 2, which translates a graph mapping to a corresponding teaching signal for a graph edit network. We prove the correctness of this algorithm with the following Lemma. Refer to Figure 4 for a graphical intuition. 
In particular we consider in this example the graphs G = ({1, 2, 3}, {(1, 2), (1, 3), (3, 1)}, X) and H = ({1, 2, 3, 4, 5, 6}, {(1, 2), (2, 3), (3, 1), (4, 5), (5, 4)}, X ) as well as the graph mapping ψ with ψ(1) = 3, ψ(2) = 2, and ψ(3) = 1. Note that H is not fully connected. In particular, we have the connected components {1, 2, 3}, {4, 5}, and {6}. In itself, this would not be a problem, but note that Algorithm 1 can only insert a new node that is connected to an existing node. Accordingly, we can not insert node without any connection to the pre-existing graph. However, this would be required for the insertion of nodes 4, 5, and 6. To manage this, Algorithm 2 first constructs auxiliary edges (in lines 2-5) connecting the new nodes to the already existing part of the graph. In our example, these auxiliary edges are (3, 4) and (3, 6). Then, the algorithm computes (in lines 6-9) the shortest path to any inserted node from any already existing node, yielding the paths π4 = (3, 4), π5 = (3, 4, 5), and π6 = (3, 6). Line 13 then re-defines the shortest paths as a shortest-path tree with children ch(3) = (4, 6) and ch(4) = (5). Lines 14-22 then take care of performing the actual node insertions, where insertions are put into different steps of the teaching signal if they can not be performed at the same time. In particular, we need to use multiple steps whenever a single node has multiple children in the shortest-path tree and whenever a node that does not yet exist needs to make an insertion. Both is covered by the depth-first-search in lines 14-22. In particular, we first insert parents, then children, and we insert children in succession. In our example, this results in node 4 being inserted in the first step (by node 1) and nodes 5 and 6 being inserted in the second step (by nodes 1 and 4, respectively). Then, line 24 initializes the third and final step of our teaching signal which performs all remaining edits. In particular, line 25 sets up all replacements (irrelevant in this case), line 26 all edge insertions ((2, 1), (3, 2), and (5, 4) in this case), line 27 all edge deletions ((1, 2), (3, 1), (1, 4), and (1, 6) in this case), and line 28 all node deletions (none in this case). Lines 29-30 ensure that the edge filter scores e+ K+1 and e K+1 are consistent with the edge edit scores EK+1. Finally, we return the result. The resulting script is shown in the bottom of the figure. Note that this script is not as short as possible because it contains four additional auxiliary edits, namely the edge insertions (1, 4) and (1, 6), which are then deleted in the end. These additional edges occur because we need to insert new nodes connected to existing nodes. Accordingly, the disconnected components C1 and C2 from H are first connected and then disconnected at the end of the script, yielding the desired graph. We first consider the time complexity of the algorithm. First, note that computing connected components in line 2 is possible in O(N). Similarly, the minimum computations in lines 3-4 are possible in O(N). Constructing the auxiliary graph in line 5 may need O(N 2) for a dense graph. Because we have unit edge weights for the shortest path computation, a simple breadth-first-search suffices, making the computation in line 7 O(N). Because we need to repeat this computation for each insertion, lines 6-9 are overall in O(N 2). Lines 10-12 are slightly implementation dependent. In principle, constructing Ek for each step would require O((M + N)2). 
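To make the construction just described more tangible, the following is a minimal Python sketch of the preprocessing in lines 2-13 of Algorithm 2, i.e. attaching insertion-only connected components to an existing node and building the shortest-path trees. The function and variable names are our own illustration, not the reference implementation, and the graph is assumed to be given as an undirected adjacency dictionary.

from collections import deque

def insertion_schedule(adj, inserted, psi_of_r):
    # adj: dict mapping every node of H to the set of its neighbours (undirected view)
    # inserted: set of inserted nodes Ins_psi; psi_of_r: psi(r), an existing node of H
    aux = {u: set(nbrs) for u, nbrs in adj.items()}
    # lines 2-5: attach components that consist only of inserted nodes to psi(r)
    seen = set()
    for u in sorted(inserted):
        if u in seen:
            continue
        comp, queue = {u}, deque([u])
        while queue:
            v = queue.popleft()
            for w in aux[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        if comp <= inserted:          # no pre-existing node in this component
            c = min(comp)             # representative c_l of the component
            aux[psi_of_r].add(c)      # auxiliary edge (psi(r), c_l)
            aux[c].add(psi_of_r)
    # lines 6-9: multi-source BFS from all existing nodes (unit edge weights)
    dist, parent, queue = {}, {}, deque()
    for u in aux:
        if u not in inserted:
            dist[u] = 0
            queue.append(u)
    while queue:
        v = queue.popleft()
        for w in aux[v]:
            if w not in dist:
                dist[w], parent[w] = dist[v] + 1, v
                queue.append(w)
    # line 13: children of each node in the shortest-path trees
    children = {}
    for u in sorted(inserted):
        children.setdefault(parent[u], []).append(u)
    return children

The returned children map corresponds to ch(·) in line 13 of Algorithm 2; the depth-first scheduling of lines 14-22 can then be run on top of it.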
But because this matrix is always zero, we can construct it once and re-use it for every step. All other operations are linear, yielding O((M + N)2) overall. The computation of the children in the shortest-path tree in lines 13 is linear in the number of insertions, which is in O(N). Lines 14-22 perform a depth first search through the shortest-path trees, which is in O(N) again. This also includes recording the inserted edges for line 23. Published as a conference paper at ICLR 2021 shortest paths (lines 6-9): π4 = (3, 4), π5 = (3, 4, 5), π6 = (3, 6) Shortest path tree (line 13): ch(3) = (4, 6), ch(4) = (5) insertions (lines 14-22): ν1,1 = +1, ν2,1 = +1, ν2,4 = +1 edge insertions (line 26): ϵ3,2,1 = ϵ3,3,2 = ϵ3,5,4 = +1 edge deletions (line 27): ϵ3,1,2 = ϵ3,3,1 = ϵ3,1,4 = ϵ3,1,6 = 1 script: δ = ins, eins1,4, ins, eins4,5, ins, eins1,6, eins2,1, eins3,2, eins5,4, edel1,2, edel1,4, edel1,6, edel3,1 Figure 4: A graphical illustration of the construction of a teaching signal via Algorithm 2 for the inputs G = ({1, 2, 3}, {(1, 2), (1, 3), (3, 1)}, X), H = ({1, 2, 3, 4, 5, 6}, {(1, 2), (2, 3), (3, 1), (4, 5), (5, 4)}, X ), and ψ with ψ(1) = 3, ψ(2) = 2, and ψ(3) = 1. The mapping ψ is shown in blue, the auxiliary construction of H (lines 2-5 in the algorithm) in red and dashed. After constructing H, we first compute shortest paths from 3 to each node in C1 and C2 (top right). This defines the shortest path tree with children ch(3) = (4, 6) and ch(4) = 5. Accordingly, node ψ 1(3) = 1 needs to insert two nodes (4 and 6), and node ψ 1(4) = 4 needs to insert one node (5). Since this is not possible in one step, the first step inserts only node 4 (ν1,1 = +1) and the second step nodes 5 and 6 (ν2,1 = +1, ν2,4 = +1). The remaining edit scores ensure that the correct edges are inserted and deleted, yielding the final script at the bottom. This is exactly the script generated by ψ, except for four superfluous auxiliary edits, namely eins1,4, eins1,6, edel1,4, and edel1,6. Published as a conference paper at ICLR 2021 Algorithm 2 An algorithm to turn a graph mapping ψ between two graphs G and H into a teaching signal for a graph edit network, such that the teaching signal yields the same edit script as ψ, up to 2 (C 1) auxiliary edits, where C is the number of connected components in H. 1: function MAP-TO-SIGNAL(Two non-empty graphs G = (V = {1, . . . , M}, E, X) and H = (V = {1, . . . , N}, E , X ), a graph mapping ψ between them) 2: Compute the connected components of H C1, . . . , CL where Cl Insψ. 3: Set cl minj Cl j for all l {1, . . . , L}. 4: Set r mini M:ψ(i) N i. 5: Construct H = (V , E {(ψ(r), c1), . . . , (ψ(r), c L)}), X ). 6: for j Insψ do 7: Compute a shortest path πj 1, . . . , πj Rj from the closest node 8: πj 1 {1, . . . , N} \ Insψ to πj Rj = j in H. 9: end for 10: for k {1, . . . , |Insψ|} do 11: Initialize νk 0, Yk X, e k 1, e+ k 1, Ek 0. 12: end for 13: Let ch(j) be the children of node j in the shortest path trees from roots πj 1 to inserted nodes. 14: Initialize a stack S with entries (πj 1, 0) for all j Insψ. Set K 0. 15: while S is not empty do 16: Pop (j, k) from S. Set c 1. Set i ψ 1(j). 17: for j ch(j) do 18: Set νk+c,i +1. yk+c,i x j . 19: Add (j , k + c) to S. 20: c c + 1. K max{K, k + c}. 21: end for 22: end while 23: Let E be the set of inserted edges (i, j) due to line 18. 24: Initialize νK+1 0, YK+1 0, e K+1 1, e+ K+1 1, EK+1 0, 25: Set y K+1,i x ψ(i) if ψ(i) N. performs replacements 26: Set ϵK+1,i,j +1 if (i, j) / E E and (ψ(i), ψ(j)) E . 
performs edge insertions 27: Set ϵK+1,i,j 1 if (i, j) E E and (ψ(i), ψ(j)) / E . performs edge deletions 28: Set νK+1,i 1 if ψ(i) > N. performs node deletions 29: Set e+ K+1,i +1 if ϵK+1,i,j = 0 for any j. 30: Set e K+1,j +1 if ϵK+1,i,j = 0 for any i. 31: return ( ν1, X1, e+ 1 , e 1 , E1), . . . , ( νK+1, XK+1, e+ K+1, e K+1, EK+1). 32: end function Published as a conference paper at ICLR 2021 Lines 24-30 require only matrix and set lookups, making them overall O((M + N)2). In summary, we obtain the claimed efficiency class of O(M 2 + N 2). Now, let G and H be any two non-empty graphs, let C be the number of connected components in H, and let ψ be a graph mapping between them with |Insψ| < N, i.e. at least one node is not inserted. Further, let s = (ˆν1, ˆ X1, ˆe+ 1 , ˆe 1 , ˆE1), . . . , (ˆνK+1, ˆ XK+1, ˆe+ K+1, ˆe K+1, ˆEK+1) be the teaching signal returned by Algorithm 2 for mapping ψ, and let δ be the output of Algorithm 1 for input s. Then, we wish to prove the following properties for δ. 1. δ(G) = H. 2. | δ| | δψ| + 2 (C 1). 3. The first K steps of s contain only insertions with νk,i = 0 νk+1,i = . . . νK+1,i = 0. 4. The K + 1th step contains no insertions. 5. K |Insψ|. First, observe that lines 1-3 of Algorithm 2 are guaranteed to work because H is non-empty and thus at least one connected component exists, and that each connected component is per definition non-empty. Further, line 4 works because we restricted ψ such that at least one i exists with ψ(i) N, which means that the minimum is well-defined. Accordingly, the graph H in line 5 is well-defined as well. This preparation was necessary to ensure that the shortest-path computations in lines 6-9 are valid. In particular, assume that there exists some j Insψ such that j is disconnected in H from any node i / Insψ. Now, distinguish two cases. First, if j lies in a connected component of H which contains some node i with i / Insψ, then j is connected to i, which directly yields a contradiction. Otherwise, j lies in one of the connected components Cl. Accordingly, j is connected to cl. However, then j is also connected to ψ(r) in H by construction, where r / Insψ, which is also a contradiction. Thus, the shortest path required in lines 6-9 is well-defined. Next, observe that any shortest path computation gives rise to shortest path trees as required by line 13. If multiple shortest paths overlap, we obtain proper trees, otherwise trivial trees which are just copies of the shortest paths. Lines 14 to 22 now insert all nodes in the shortest paths trees according to a particular ordering, namely that any parent is inserted at least one step before its child and each child is inserted one step before its right sibling. By using this ordering we ensure that no node makes more than one insertion per step, that whenever a node stops to do insertions, it never needs to start again, and that all nodes are reached (because the depth-first-search enumerates all nodes in the shortest path trees). Accordingly, note that the K first steps of the generates script, when plugged into Algorithm 1, yield exactly the insertions in δins ψ as given in Definition 2 (up to reordering), in addition to edge insertions between an inserted node and its predecessor in the shortest paths tree. These inserted edges are stored in E in line 23. The remainder of the algorithm takes care of all remaining edits. 
In particular, line 24 initializes a trivial teaching signal with zero node edit scores, zero attribute matrix, negative edge filter scores, and zero edit score matrix. Then, line 25 sets all attributes to the target attributes, which yields all desired replacements of already existing nodes and leaves the attributes of inserted nodes as they were. In other words, line 25 yields, via Algorithm 1, exactly the script δrep ψ as given in Definition 2. Line 26 sets edge insertion scores ϵK+1,i,j = 1 for all edges (i, j) such that (ψ(i), ψ(j)) is in E but not already in E E . In effect, after line 26 we have accumulated all edge insertions in δeins ψ as given in Definition 2, plus exactly L edge insertions einsr,ψ 1(cl). This is the case because the shortest path to any node cl is necessarily πcl 1 = r and πcl 2 = cl. If any other, equally short path would exist, cl would be in a connected component with some node i / Insψ, which is a contradiction. Accordingly, lines 14-22 have ensured that node r inserts node ψ 1(cl) at some point. Next, line 27 sets edge deletion scores ϵK+1,i,j = 1 for all edges (i, j) that are in E E but where (ψ(i), ψ(j)) is not in E anymore. Note that there is no overlap between these edges and the edges considered in line 26, such that we can handle both cases within the same step. Also note Published as a conference paper at ICLR 2021 that this step yields all edge deletions from from δedel ψ as given in Definition 2, plus exactly L edge deletions edelr,ψ 1(cl) due to our previous argument. Finally, line 28 sets node edit scores νK+1,i = 1 for all nodes i with ψ(i) > N. This yields exactly the node deletions from δdel ψ as given in Definition 2. Lines 29-30 merely ensure that the edge filter scores are consistent with Ek. In summary, when we plug the teaching signal returned by Algorithm 2 into Algorithm 1, the resulting script δ contains exactly the same edits as δψ from Definition 2, up to reordering inserted nodes, plus L edge insertions einsr,ψ 1(cl) and L edge deletions edelr,ψ 1(cl). Because the edge deletions remove precisely the edges which have been additionally inserted before, we obtain δ(G) = δψ(G). Further, Lemma 1 shows that δψ(G) = H, which implies property 1 above. Next, observe that L is exactly the number of connected components of H which are subsets of Insψ. Because we required that ψ does not insert all nodes of H, there must exist at least one connected component of H which is not a subset of Insψ. Accordingly, L is upper-bounded by C 1 and we obtain the bound required in the second property. For the third property, observe that the first K steps do indeed only insert nodes. Further, a node i only stops insertions when all its children in the shortest path tree are inserted. Once that is done, it does not start inserting anymore, implying the desired property. The fourth property holds trivially because lines 24-30 never set a node score to a value > 1 Finally, the fifth property holds because each insertion step in lines 14-22 contains at least one insertion. This is because K is set via line 20 exactly to a step in which the last insertion occurs and in insertion occuring in step K implies that either a left sibling insertion or the parent insertion had to occur in step K 1. However, if at least one insertion occurs in each step, there can be at most K |Insψ| steps, as claimed. It only remains to show that the bounds in properties 2 and 5 are sharp. To do so, consider the example graphs G = ({1}, , 0) and H = ({1, . . . , C}, , 0). 
Note that all nodes in H are isolated, such that they form trivial connected components. We note in passing that it is also possible to construct an example with non-trivial connected components, e.g. by constructing H as C copies of G, where G can be arbitrarily shaped. Here, we consider the simplest case. Now, let ψ be any mapping between G and H such that |Ins_ψ| < C, in other words, ψ(1) ≤ C. Then, line 2 of Algorithm 2 yields C_l = {l} for l < ψ(1) and C_l = {l + 1} for l ≥ ψ(1), with l ∈ {1, . . . , C − 1}, and c_l becomes the only element of C_l in line 3. Line 4 yields r = 1. Accordingly, H̃ = ({1, . . . , C}, {(ψ(1), c_1), . . . , (ψ(1), c_{C−1})}, X′). Following the arguments above, the script resulting via Algorithm 1 from the output of Algorithm 2 then includes one edge insertion and one edge deletion for each edge in H̃, in addition to the C − 1 insertions that are contained in δ_ψ. Accordingly, we obtain |δ| = |δ_ψ| + 2 · (C − 1), which is precisely the bound. Further, because node 1 can only perform a single insertion per step, we obtain K = C − 1 = |Ins_ψ|.

B AN EXAMPLE LOSS FUNCTION

For training, we propose a simple loss inspired by the margin hinge loss of the support vector machine (Suykens et al., 2002). Recall that the (squared) SVM hinge loss is given as ℓ_SVM(w, {(x, ŷ)}) = Σ_{(x,ŷ)} [−ŷ · wᵀx + 1]²_+, where ŷ ∈ {−1, +1} is the desired label and y = wᵀx is the predicted label.

In more detail, let A be the adjacency matrix of the input graph, let ν, X̄, e⁺, e⁻, E be the GEN layer output, and let ν̂, X̂, ê⁺, ê⁻, Ê be the scores of our teaching signal. Then, we define the loss ℓ_GEN as:

ℓ_GEN = Σ_{i=1}^{N} ( ℓ_node(ν_i, ν̂_i) + ℓ_filter(e⁺_i, ê⁺_i) + ℓ_filter(e⁻_i, ê⁻_i) ) + Σ_{i: ν̂_i ≥ −1/2} ℓ_attr(x̄_i, x̂_i) + Σ_{(i,j): ê⁺_i > 0, ê⁻_j > 0} ℓ_edge(ϵ_{ij}, ϵ̂_{ij}, a_{ij}),   (2)

where ℓ_node(ν, ν̂) is defined as [ν + 1]²_+ for ν̂ < −1/2, as [−ν + 1]²_+ for ν̂ > +1/2, and as [|ν| − 1/4]²_+ otherwise, with [x]_+ = max{0, x} being the rectified linear unit; where ℓ_filter(e, ê) is defined as M · [e + 1/2]²_+ if ê ≤ 0 and as M · [−e + 1/2]²_+ otherwise; where ℓ_attr is the squared Euclidean distance for continuous attributes and the crossentropy for discrete attributes; and where ℓ_edge(ϵ, ϵ̂, a) is defined as [ϵ + 1]²_+ if a = 1 and ϵ̂ < −1/2, as [−ϵ]²_+ if a = 1 and ϵ̂ ≥ −1/2, as [ϵ]²_+ if a = 0 and ϵ̂ ≤ 1/2, and as [−ϵ + 1]²_+ if a = 0 and ϵ̂ > 1/2.

We can show that this loss is zero if and only if the GEN output respects the margins of the teaching signal, in which case the edit scripts of the teaching signal and the GEN output are equal.

Theorem 3. Let A ∈ {0, 1}^{N×N} be an adjacency matrix and let ν, X̄, e⁺, e⁻, E as well as ν̂, X̂, ê⁺, ê⁻, Ê be two sets of edit scores. Then, loss (2) is zero if and only if the following conditions hold.

1. For all i: if ν̂_i > +1/2, then ν_i is at least 1,
2. for all i: if |ν̂_i| ≤ 1/2, then |ν_i| is at most 1/4,
3. for all i: if ν̂_i < −1/2, then ν_i is at most −1,
4. for all i: if ν̂_i ≥ −1/2, then ℓ_attr(x̄_i, x̂_i) = 0,
5. for all i: if ê⁺_i > 0, then e⁺_i is at least 1/2,
6. for all i: if ê⁺_i ≤ 0, then e⁺_i is at most −1/2,
7. for all i: if ê⁻_i > 0, then e⁻_i is at least 1/2,
8. for all i: if ê⁻_i ≤ 0, then e⁻_i is at most −1/2,
9. for all i, j: if ê⁺_i > 0, ê⁻_j > 0, ϵ̂_{i,j} > +1/2, and a_{i,j} = 0, then ϵ_{i,j} is at least 1,
10. for all i, j: if ê⁺_i > 0, ê⁻_j > 0, ϵ̂_{i,j} ≤ 1/2, and a_{i,j} = 0, then ϵ_{i,j} is at most 0,
11. for all i, j: if ê⁺_i > 0, ê⁻_j > 0, ϵ̂_{i,j} ≥ −1/2, and a_{i,j} = 1, then ϵ_{i,j} is at least 0, and
12. for all i, j: if ê⁺_i > 0, ê⁻_j > 0, ϵ̂_{i,j} < −1/2, and a_{i,j} = 1, then ϵ_{i,j} is at most −1.

Further, if the loss is zero, the edit scripts resulting from Algorithm 1 for both sets of edits are equal.

Proof.
We first re-frame all conditions in terms of the component losses ℓ_node, ℓ_attr, ℓ_filter, and ℓ_edge.

1. In this case, the node loss is ℓ_node(ν_i, ν̂_i) = [−ν_i + 1]²_+. Accordingly, the loss is zero iff −ν_i + 1 ≤ 0, i.e. ν_i ≥ 1, as claimed.
2. In this case, the node loss is ℓ_node(ν_i, ν̂_i) = [|ν_i| − 1/4]²_+. Accordingly, the loss is zero iff |ν_i| ≤ 1/4, as claimed.
3. In this case, the node loss is ℓ_node(ν_i, ν̂_i) = [ν_i + 1]²_+. Accordingly, the loss is zero iff ν_i + 1 ≤ 0, i.e. ν_i ≤ −1, as claimed.
4. This condition directly refers to the attribute loss and is obvious.
5. In this case, the filter loss is ℓ_filter(e⁺_i, ê⁺_i) = M · [−e⁺_i + 1/2]²_+. Accordingly, the loss is zero iff e⁺_i ≥ 1/2, as claimed.
6. In this case, the filter loss is ℓ_filter(e⁺_i, ê⁺_i) = M · [e⁺_i + 1/2]²_+. Accordingly, the loss is zero iff e⁺_i ≤ −1/2, as claimed.
7. In this case, the filter loss is ℓ_filter(e⁻_i, ê⁻_i) = M · [−e⁻_i + 1/2]²_+. Accordingly, the loss is zero iff e⁻_i ≥ 1/2, as claimed.
8. In this case, the filter loss is ℓ_filter(e⁻_i, ê⁻_i) = M · [e⁻_i + 1/2]²_+. Accordingly, the loss is zero iff e⁻_i ≤ −1/2, as claimed.
9. In this case, the edge loss is ℓ_edge(ϵ_{i,j}, ϵ̂_{i,j}, a_{i,j}) = [−ϵ_{i,j} + 1]²_+. Accordingly, the loss is zero iff −ϵ_{i,j} + 1 ≤ 0, i.e. ϵ_{i,j} ≥ 1, as claimed.
10. In this case, the edge loss is ℓ_edge(ϵ_{i,j}, ϵ̂_{i,j}, a_{i,j}) = [ϵ_{i,j}]²_+. Accordingly, the loss is zero iff ϵ_{i,j} ≤ 0, as claimed.
11. In this case, the edge loss is ℓ_edge(ϵ_{i,j}, ϵ̂_{i,j}, a_{i,j}) = [−ϵ_{i,j}]²_+. Accordingly, the loss is zero iff −ϵ_{i,j} ≤ 0, i.e. ϵ_{i,j} ≥ 0, as claimed.
12. In this case, the edge loss is ℓ_edge(ϵ_{i,j}, ϵ̂_{i,j}, a_{i,j}) = [ϵ_{i,j} + 1]²_+. Accordingly, the loss is zero iff ϵ_{i,j} + 1 ≤ 0, i.e. ϵ_{i,j} ≤ −1, as claimed.

Accordingly, we have shown that all component losses are zero if and only if the aforementioned margin conditions hold. Since the entire loss (2) is a sum of these component losses and no contribution can be negative, the entire sum is zero if and only if all contributions are zero, which proves the claim. Also refer to Figure 5 for a graphical illustration of the component losses ℓ_node and ℓ_edge.

Figure 5: An illustration of the loss ℓ_node (left) and the loss ℓ_edge (right) for different teacher and adjacency matrix inputs.

This directly yields the second part of Theorem 3. In particular, if condition 3 holds, the same edge deletions are appended in line 2 of Algorithm 1. Due to conditions 5-8, the loops in lines 3-8 iterate over the same elements (i, j). Further, due to conditions 11-12, Algorithm 1 appends the same edge deletions in line 5, and due to conditions 9-10, Algorithm 1 appends the same edge insertions in line 6. The replacements in line 9 are the same due to conditions 2 and 4, as well as the fact that ℓ_attr being zero implies equal inputs by definition. The loop in lines 11-14 now iterates over the same elements due to condition 1, and the same node/edge insertions are appended in line 12 due to condition 4 and the fact that ℓ_attr being zero implies equal inputs by definition. Finally, the same node deletions are generated in line 15 due to condition 3. Thus, the output of Algorithm 1 is the same for both inputs.

Figure 6: Left: The node edit probabilities p(ins|ν) (blue), p(del|ν) (orange), and p(rep|ν) (red) for β = 10 and varying ν. Right: The crossentropy loss (solid) and the GEN loss times β (dashed) for varying ν and for ν̂ = +1 (blue), ν̂ = −1 (orange), and ν̂ = 0 (red).
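To make the piecewise component losses of Appendix B concrete, here is a minimal plain-Python sketch for scalar scores; the function names and the scalar (non-batched) treatment are our own simplification rather than the reference implementation.

def relu(x):
    # rectified linear unit [x]_+
    return max(x, 0.0)

def loss_node(nu, nu_hat):
    # squared hinge on the node edit score with the margins of Appendix B
    if nu_hat < -0.5:                  # deletion desired: push nu <= -1
        return relu(nu + 1.0) ** 2
    if nu_hat > 0.5:                   # insertion desired: push nu >= 1
        return relu(-nu + 1.0) ** 2
    return relu(abs(nu) - 0.25) ** 2   # keep the node: push |nu| <= 1/4

def loss_filter(e, e_hat, M):
    # edge filter score, scaled by M as in the definition above
    if e_hat <= 0.0:                   # filter should stay closed: push e <= -1/2
        return M * relu(e + 0.5) ** 2
    return M * relu(-e + 0.5) ** 2     # filter should open: push e >= 1/2

def loss_edge(eps, eps_hat, a):
    # edge edit score, conditioned on whether the edge exists (a = 1) or not (a = 0)
    if a == 1:
        if eps_hat < -0.5:             # delete the edge: push eps <= -1
            return relu(eps + 1.0) ** 2
        return relu(-eps) ** 2         # keep the edge: push eps >= 0
    if eps_hat > 0.5:                  # insert the edge: push eps >= 1
        return relu(-eps + 1.0) ** 2
    return relu(eps) ** 2              # keep the non-edge: push eps <= 0

Summing loss_node and the two filter terms over all nodes, the attribute loss over non-deleted nodes, and loss_edge over the pairs passed by the teacher's filters reproduces the structure of loss (2).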
C PROBABILISTIC INTERPRETATION OF THE GEN

Given the node edit scores ν_i, the edge filter scores e⁺_i and e⁻_i, as well as the edge edit scores ϵ_{i,j} as returned by a GEN layer, we can define the following probabilities over edits. The probability of a node insertion at node i is the logistic distribution p(ins|ν_i) = Z_β / (1 + exp[−β · (ν_i − 1/2)]); the probability of a deletion of node i is the logistic distribution p(del|ν_i) = Z_β / (1 + exp[−β · (−ν_i − 1/2)]); and the probability of a replacement is p(rep|ν_i) = 1 − p(ins|ν_i) − p(del|ν_i). In all cases, β > 0 is a hyper-parameter regulating the slope of the distribution with respect to ν_i, and Z_β = (1 + e^β)/(2 + e^β) is chosen to ensure that p(ins|1/2) = p(rep|1/2) and p(del|−1/2) = p(rep|−1/2), i.e. +1/2 is the decision boundary between insertion and replacement and −1/2 is the decision boundary between deletion and replacement, just as in Algorithm 1. For sufficiently high β, e.g. β ≥ 10, we obtain p(ins|ν_i) ≈ 0 for ν_i < 0 and p(del|ν_i) ≈ 0 for ν_i > 0, such that the distribution becomes equivalent to two binary logistic distributions "glued together". Also refer to Figure 6 (left).

A straightforward loss based on this probability distribution is the crossentropy between p and the strict teaching distribution q(ins|ν̂_i) = 1 if ν̂_i > 1/2, q(del|ν̂_i) = 1 if ν̂_i < −1/2, and q(rep|ν̂_i) = 1 otherwise. The loss for varying ν and ν̂ and β = 10 is shown in Figure 6 (right). As we can see, the behavior is qualitatively very similar to the GEN loss from Equation 2, shown in dashed lines. The most notable differences are that the crossentropy behaves linearly for sufficiently extreme ν, whereas the GEN loss behaves quadratically, and that the GEN loss is strictly zero when the margin is not violated, whereas the crossentropy remains strictly larger than zero. This can lead to empirical problems during learning, as we see in the experiments. The quadratic behavior emphasizes large margin violations whereas small margin violations are less emphasized compared to the crossentropy loss, which yields slightly favorable properties in our experiments.

To sample edge edits we perform a three-step process. First, for each node we sample whether outgoing edges or incoming edges are edited with the logistic distributions p(edit|e⁺_i) = 1/(1 + exp[−β · e⁺_i]) and p(edit|e⁻_i) = 1/(1 + exp[−β · e⁻_i]). Then, we only consider edges (i, j) where an edit has been sampled for both the outgoing side of i and the incoming side of j, and we sample an edge insertion with probability p(eins|ϵ_{i,j}, a_{i,j}) = 1/(1 + exp[−β · (ϵ_{i,j} − 1/2)]) if a_{i,j} = 0 and p(eins|ϵ_{i,j}, a_{i,j}) = 0 if a_{i,j} = 1, and we sample an edge deletion with probability p(edel|ϵ_{i,j}, a_{i,j}) = 1/(1 + exp[−β · (−ϵ_{i,j} − 1/2)]) if a_{i,j} = 1 and p(edel|ϵ_{i,j}, a_{i,j}) = 0 if a_{i,j} = 0. Also refer to Figure 7 (left). The crossentropy loss compared to the strict distribution q(eins|ϵ̂_{i,j}) = 1 if ϵ̂_{i,j} > 1/2, q(edel|ϵ̂_{i,j}) = 1 if ϵ̂_{i,j} < −1/2, and q(eins|ϵ̂_{i,j}) = q(edel|ϵ̂_{i,j}) = 0 otherwise, is shown in Figure 7 (right). Again, we observe that the loss behaves qualitatively very similarly to the GEN edge loss, with the striking difference being the quadratic shape.

Figure 7: Left top: The edge insertion probability p(eins|ϵ, a) if the edge is not present. Right top: The crossentropy loss (solid) and the GEN edge loss times β (dashed) for ϵ̂ = +1 (blue) and ϵ̂ = 0 (red) if the edge is not present. Left bottom: The edge deletion probability p(edel|ϵ, a) if the edge is present. Right bottom: The crossentropy loss (solid) and the GEN edge loss times β (dashed) for ϵ̂ = −1 (orange) and ϵ̂ = 0 (red) if the edge is present.
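As a concrete illustration of the node and edge sampling described in this appendix (for scalar scores; the function names and the use of numpy's default random generator are our own choices), one could write:

import numpy as np

def node_edit_probs(nu, beta=10.0):
    # insertion / deletion / replacement probabilities of Appendix C
    Z = (1.0 + np.exp(beta)) / (2.0 + np.exp(beta))
    p_ins = Z / (1.0 + np.exp(-beta * (nu - 0.5)))
    p_del = Z / (1.0 + np.exp(-beta * (-nu - 0.5)))
    return p_ins, p_del, 1.0 - p_ins - p_del

def sample_edge_edit(eps, a, e_plus_i, e_minus_j, beta=10.0, rng=None):
    # three-step process: sample both edge filters, then the edge edit itself
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() >= 1.0 / (1.0 + np.exp(-beta * e_plus_i)):
        return None                    # outgoing side of node i is not edited
    if rng.random() >= 1.0 / (1.0 + np.exp(-beta * e_minus_j)):
        return None                    # incoming side of node j is not edited
    if a == 0:                         # edge absent: only an insertion is possible
        p = 1.0 / (1.0 + np.exp(-beta * (eps - 0.5)))
        return "eins" if rng.random() < p else None
    p = 1.0 / (1.0 + np.exp(-beta * (-eps - 0.5)))
    return "edel" if rng.random() < p else None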
Table 2: Summary statistics for all synthetic datasets, in particular the number of unique graphs, the average time series length T (± standard deviation), and the average graph size N (± standard deviation). The number of unique undirected graphs with 8 nodes for degree rules was computed using the nauty-geng function; all averages were obtained via 1000 samples from the dataset.

                 edit cycles   degree rules   game of life   Boolean       Peano
unique graphs    9             12346          2100           10788         34353
T                3 ± 0.82      10.46 ± 1.91   10 ± 0.00      1.51 ± 0.99   11.08 ± 11.63
N                3.22 ± 1.03   7.33 ± 1.98    100 ± 0.00     4.84 ± 2.52   7.82 ± 3.89

Figure 8: An illustration of the edit cycles dataset, where each graph is shown in black, whereas colored arrows illustrate the system dynamics. Each colored arrow is labelled with the edit scores necessary to move from one graph to the next.

D DATA IN DETAIL

In this section, we discuss the graph dynamical systems of Section 4 in more detail and also discuss how graph edit networks can realize a solution. The summary statistics are shown in Table 2.

Edit Cycles: The edit cycles dataset is illustrated in Figure 8. Colored, thick arrows indicate the simulated system dynamics. Each arrow is labelled with the corresponding teaching protocol. The reference solution to this task is to uniquely identify at each node to which graph it belongs and then decide on the edit. This requires a number of layers equal to the graph diameter, in this case two graph neural network layers.

Degree Rules: The degree rules dataset is illustrated for two example graphs in Figure 2. The teaching protocol implements the rules of the dataset as follows. First, for any node i with degree larger than 3 we expect ν̂_i < −1/2, i.e. the node should get deleted. Second, for any pair of nodes i and j which share at least one neighbor, we expect ϵ̂_{i,j} = ϵ̂_{j,i} = 1, i.e. an edge should be inserted between the nodes. Third, for any node i with degree smaller than 3 we expect ν̂_i > 1/2, i.e. a node should get inserted. To generate the actual dynamics, we limit the process to apply only one rule per connected component. This is because simultaneous application of all rules leads to degeneracies. For example, a 5-clique would be deleted entirely. Instead, we use the following ordering: We first apply rule 1 to the node with highest degree. If rule 1 applies to no node, we apply rule 2 to the node pair with lowest sum of degrees. If rule 2 does not apply, we apply rule 3 to the node with lowest degree. If multiple nodes have the same degree, nodes with lower index take precedence. If no rule applies, we perform no edit and the process has converged. Next, we show that each connected component converges to a 4-clique in this scheme. Note that rule 2 ensures that every connected component becomes fully connected, provided that rule 1 does not interfere. Rule 1 interferes if a node degree rises above 3, which will be the case whenever a connected component contains more than 4 nodes. Accordingly, every connected component with more than 4 nodes must converge to a 4-clique. For smaller connected components, rule 2 first ensures that the component becomes a k-clique for k < 4.
Then, rule 2 does not apply anymore but rule 3 applies because all nodes in the component have degree less than 3. Once a new node has been added, rule 2 applies again, and so on, until the component is a 4-clique. Finally, note that a 4-clique is stable because no rule applies anymore. The reference solution for this task is a one-layer net with n+2 neurons. Further, we assign a one-hot coding of the node index as attribute. Note that the ordering does not matter, merely that every node has a unique one-hot code. Now, notice that we can compute the degree of node i via the expression P j N(i) 1T xj. This is the case because xj is a one-hot coding of the index j and, hence, the expression 1T xj is always 1, such that the sum just counts the size of the neighborhood of i, which is exactly the degree. Accordingly, we need to apply rule 1 if this expression is higher than 3, and we need to apply rule 3 if this expression is less than 3. Rule 2 is slightly more complicated. We wish to apply rule 2 whenever nodes i and j share at least one neighbor, i.e. if N(i) N(j) = . Because the node attributes are one-hot codings of the node index, we can represent the neighborhoods by the sum of the one-hot codings in the neighborhood and the set intersection by their inner product. More precisely, we obtain for any two nodes i and j: X k N(i) xk T X l N(j) xl = |N(i) N(j)|. Based on this result, we can identify the need to apply any of the three rules by setting f 1 aggr and f 1 merge from Equation 1 as follows. f 1 aggr(N) = X j N xj, and f 1 merge( x, y) = Re LU y 1 β ( 1T y 3) 1 β ( 1T y 2) which yields the following representation for all nodes i: φ1(i) = Re LU P j N(i) xj 1 β (|N(i)| 3) 1 β (|N(i)| 3) Here, β is defined as β := 2 max{3, ˆm}, where ˆm is the maximum degree in the dataset. Then, we can compute the node edit score as νi = β (φ1(i)n+2 φ1(i)n+1), which is 1 if and only if the degree of i is smaller than 3 (rule 3), which is 1 if and only if the degree of i is larger than 3 (rule 1), and which is zero otherwise. Further, we can set the edge edit score as ϵi,j = φ1(i)T φ1(j), which is 1 if |N(i) N(j)| 1 and which is between 0 and 1 2 otherwise because in that case φ1(i)T φ1(j) = 1 β2 Re LU(|N(i)| 3) Re LU(|N(j)| 3) + Re LU(3 |N(i)|) Re LU(3 |N(j)|) β2 (2 max{3, ˆm}2) 1 Still, note that this is relatively difficult to learn for a neural net because parameters need to be set to relatively large values (exceeding the maximum degree). Hence the long experimental learning times. Game of life: Conway s game of life is a 2D cellular automaton with the following rules. Cells can be either alive (xt i = 1) or dead (xt i = 0) at time t. The state in time t + 1 is given by the equation ( 1 if 5 xt i + 2 P j N(i) xt j 7 0 otherwise where N(i) is the 8-neighborhood of i in the grid. Figure 9 shows the development of an example grid. Our teaching protocol imposes ˆνt i = xt+1 i xt i for all nodes i and all times t 1, i.e. the node edit score νt i should correspond to the change in aliveness . Note that our graph connects all neighboring nodes in the grid and that the aliveness xt i is the only node attribute. Published as a conference paper at ICLR 2021 Figure 9: An example 6 6 grid evolving over time according to the rules of Conway s game of life with alive cells marked blue and dead cells white. Note that even a single alive cell close to a glider shape breaks the shape and leads to the entire board becoming empty over time. 
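For illustration, the update rule above and the resulting teaching signal ν̂ᵗ_i = x^{t+1}_i − xᵗ_i can be computed as in the following minimal Python sketch, assuming the grid graph is given as an explicit 8-neighbourhood list; the names are illustrative only.

def game_of_life_step(x, neighbours):
    # x: list of 0/1 aliveness values, one per grid cell
    # neighbours: list of lists, the 8-neighbourhood of each cell in the grid
    nu_hat, x_next = [], []
    for i, nbrs in enumerate(neighbours):
        y = x[i] + 2 * sum(x[j] for j in nbrs)
        alive = 1 if 5 <= y <= 7 else 0      # compact form of Conway's rules
        x_next.append(alive)
        nu_hat.append(alive - x[i])          # teaching signal: change in aliveness
    return x_next, nu_hat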
Our reference solution is a one-layer graph neural network with the following setting of f 1 aggr and f 1 merge from Equation 1. f 1 aggr(N) = X j N xj and f 1 merge(x, y) = Re LU x + 2y 4 x + 2y 5 x + 2y 7 x + 2y 8 x Now, let yi := xi + 2 P j N(i) xj. Then, we obtain the node representation φ1(i) = Re LU(yi 4, yi 5, yi 7, yi 8, xi)T . With this representation, we obtain the following result for φ1 1(i) φ1 2(i) φ1 3(i) + φ1 4(i). φ1 1(i) φ1 2(i) φ1 3(i) + φ1 4(i) = 0 0 0 + 0 = 0 if yi 4 yi 4 (yi 5) 0 + 0 = 1 if 5 yi 7 yi 4 (yi 5) (yi 7) + yi 8 = 0 if yi 8 In other words, we obtain φ1 1(i) φ1 2(i) φ1 3(i) + φ1 4(i) = xt+1 i . Accordingly, we can simply set νi = φ1 1(i) φ1 2(i) φ1 3(i) + φ1 4(i) φ1 5(i) and obtain νi = xt+1 i xt i as desired. Boolean Formulae: We first generate a random Boolean formula using the following stochastic process: Let Σ = { , , x, y, x, y, , , root} be our alphabet of symbols. Then, we initialize a counter ops 0, a graph G = ({1}, , X) with x1 = root, and a stack with 1 as single element. Next, as long as the stack is not empty, we pop the top node i from the stack and sample a new symbol from Σ with probabilities P( ) = P( ) = 0.3, P(x) = P(y) = P( x) = P( y) = 0.1, and P(root) = P( ) = P( ) = 0. Then, we add a new node M + 1 to the graph with x M+1 being the one-hot-code of the samples symbol, and add the edge (i, M + 1). Further, if we have sampled or , we push i onto the stack two times and increment the ops counter. If it reaches 3, we adjust the sampling probabilities to P(x) = P(y) = P( x) = P( y) = 1 4 and P( ) = P( ) = P(root) = P( ) = P( ) = 0. Once the initial graph is sampled, the teaching protocol implements eight simplification rules for Boolean formulae. In particular, we implement the following rules. 1. F for any formula F. Accordingly, if for two nodes i, j we have xi = , xj = and (i, j) E, we impose ˆνk = 1 for any node k = j with (i, k) E. As a minor side note, we mention that we consider tree edits according to Zhang & Shasha (1989) for this dataset instead of graph edits, i.e. when we delete node i, the children of i automatically get connected to the parent of i. If F is a leaf, we also impose ˆνi = 1. 2. F F for any F. Accordingly, if for two nodes i, j we have xi = , xj = and (i, j) E, we impose ˆνi = ˆνj = 1. 3. F for any F. Accordingly, we apply the same scheme as in rule 1, just with the condition xi = and xj = . 4. F F for any F. Accordingly, if for two nodes i, j we have xi = , xj = and (i, j) E, we impose ˆνi = ˆνj = 1. 5. x x x and y y y. Accordingly, if we find three nodes i, j, k, such that xi = , xj = xk {x, y}, and (i, j), (i, k) E, we impose ˆνi = ˆνj = 1. Published as a conference paper at ICLR 2021 ˆν4 = ˆν5 = 1 ˆν4 = ˆν5 = 1 ˆx3 = ˆν1 = ˆν3 = 1 Figure 10: An example time series from the Boolean dataset. The initial formula on the left is simplified until no simplification rule applies anymore. Each arrow indicates one simplification step. Superscripts indicate the index of each node in the graph and blue arrow labels indicate the teaching protocol. 6. x x x and y y y. Accordingly, if we find three nodes i, j, k, such that xi = , xj = xk {x, y}, and (i, j), (i, k) E, we impose ˆνi = ˆνj = 1. 7. x x and y y . Accordingly, if we find three nodes i, j, k, such that xi = , (i, j), (i, k) E, and xj = x as well as xk = x, or xj = y as well as xk = y, we impose ˆνj = ˆνk = 1 and ˆyi = . 8. x x and y y . 
Accordingly, if we find three nodes i, j, k, such that xi = , (i, j), (i, k) E, and xj = x as well as xk = x, or xj = y as well as xk = y, we impose ˆνj = ˆνk = 1 and ˆyi = . The next step in the dynamic is the graph that results from applying all edits imposed by the teaching protocol. The dynamic process ends once no rules apply anymore. An example simplification process is shown in Figure 10. The reference solution is a graph neural network with two layers, where the first layer represents whether certain conditions for the eight rules apply and the second layer implements conjunctions between these conditions. For simplicity, we only consider rule 2 here but all other rules can be implemented with a similar scheme. In particular, let φ(x) be the one-hot coding of the symbol x Σ. Then, we use the following setting of f 1 aggr, f 1 merge, f 2 aggr, and f 2 merge from Equation 1. f 1 aggr(N +, N ) = P and f 1 merge( x, y z φ( )T x φ( )T y φ( )T x φ( )T z f 2 aggr(N +, N ) = 0 and f 2 merge( φ1, 0) = Re LU φ1 1 + φ1 2 1 φ1 3 + φ1 4 1 Note that we distinguish the neighborhood N + of children and N of parents. The resulting node representation in layer 2 is as follows. φ2 1(i) = 1 if xi = and i has a child j with xj = . Otherwise, φ2 1(i) = 0. Further, φ2 2(i) = 1 if xi = and i has a parent j with xj = . Otherwise, φ2 2(i) = 0. Accordingly, we can implement rule 2 by setting νi = φ2 1(i) φ2 2(i), which deletes both the parent and the child if rule 2 applies. Peano addition: With the same scheme as for the Boolean dataset we first generate a random addition with at most 3 plus operators. However, for the peano dataset we have the alphabet Σ = {+, 0, 1, . . . , 9, succ, root} and the sampling probabilities P(+) = 1 2 and P(1) = . . . = P(9) = 1 18, which are changed to P(1) = . . . = P(9) = 1 9 once 3 plus operators are generated. Once the initial graph is generated, we apply Peano s addition axiom until the addition is resolved. In more detail, we apply the following rules. 1. F + 0 = F for any formula F. Accordingly, if we find two nodes i, j with xi = + and xj = 0 such that j is the second child of i, we impose ˆνi = ˆνj = 1. Published as a conference paper at ICLR 2021 ˆν1 = +1, ˆx3 = 0 ˆν1 = +1, ˆν3 = 1 rule 1 ˆν1 = ˆν4 = 1 rule 4 ˆν2 = 1, ˆx3 = 6 Figure 11: An example time series from the Peano dataset. The initial formula on the left is evaluated until only a single number - namely the result of the addition - is left. Exponents indicate the index of each node in the graph and blue arrow labels indicate the teaching protocol. 2. F + succ(G) = succ(F) + G for any formulae F and G. Accordingly, if we find two nodes i, j with xi = + and xj = succ such that j is the second child of i, we impose ˆνi = +1, ˆyi = succ, and ˆνj = 1. Again, notice that we use tree edits (Zhang & Shasha, 1989) instead of graph edits in this case, such that an insertion at node i automatically establishes an edge between i and the newly inserted node as well as between the newly inserted node and the first child of i, while cutting the direct connection between i and its first child, i.e. the new child is inserted between i and its current first child. 3. F + n = F + succ(n 1) for any number n {1, . . . , 9}. Accordingly, if we find two nodes i, j with xi = + and xj {1, . . . , 9}, such that j is the second child of i, we impose ˆνi = +1, ˆyi = succ, and ˆyj = n 1 in one-hot coding. 4. succ(n) = n + 1 for any number n {0, . . . , 9}. If n = 9, we define n + 1 = 0 to remain in the alphabet. 
Accordingly, if we find two nodes i, j with x_i = succ, x_j ∈ {0, . . . , 9}, and (i, j) ∈ E, we impose ν̂_i = −1 and ŷ_j = n + 1. The next step in the dynamic is the graph that results from applying all edits imposed by the teaching protocol. The dynamic process ends once no rules apply anymore. An example time series is shown in Figure 11.

Note that, for the Peano dataset, the order of children is important. Accordingly, we assume that there is an auxiliary attribute that codes whether the node is a second child (where the attribute has value 1) or not (value 0). Further, let φ(x) denote the one-hot coding of symbol x ∈ Σ and let φ_II denote the vector that is zero except for a one at the auxiliary attribute. Then, we use the following setting of f¹_aggr, f¹_merge, f²_aggr, and f²_merge from Equation 1:

f¹_aggr(N⁺, N⁻) = 0, and f¹_merge(x, 0) = ReLU( (φ(+)ᵀx, φ(succ)ᵀx + φ_IIᵀx − 1) ),
f²_aggr(N⁺, N⁻) = ( Σ_{j ∈ N⁺} φ¹(j), Σ_{j ∈ N⁻} φ¹(j) ) = (y, z), and f²_merge(x, (y, z)) = ReLU( (x₁ + y₂ − 1, x₂ + z₁ − 1) ).

Again, we distinguish between the neighborhood N⁺ of children and N⁻ of parents. The resulting node representation in layer 2 is as follows. φ²₁(i) = 1 if x_i = + and i has a second child j with x_j = succ, otherwise φ²₁(i) = 0. Further, φ²₂(i) = 1 if x_i = succ and i has a parent j with x_j = +. Accordingly, we can implement rule 2 by setting ν_i = φ²₁(i) − φ²₂(i) and y_i = φ(succ) if φ²₁(i) = 1.

Table 3: The average precision and recall values (± std.) across five repeats for all edit types on the max degree dataset.

model    node insertion (recall / precision)   node deletion (recall / precision)   edge insertion (recall / precision)   edge deletion (recall / precision)
VGAE     1.00 ± 0.0 / 1.00 ± 0.0               0.35 ± 0.2 / 0.51 ± 0.2              1.00 ± 0.0 / 0.83 ± 0.1               1.00 ± 0.0 / 0.60 ± 0.3
XE-GEN   1.00 ± 0.0 / 1.00 ± 0.0               0.49 ± 0.3 / 1.00 ± 0.0              1.00 ± 0.0 / 1.00 ± 0.0               1.00 ± 0.0 / 1.00 ± 0.0
GEN      1.00 ± 0.0 / 1.00 ± 0.0               0.99 ± 0.0 / 1.00 ± 0.0              1.00 ± 0.0 / 1.00 ± 0.0               1.00 ± 0.0 / 1.00 ± 0.0

E MAX DEGREE EXPERIMENT

In addition to the graph dynamical systems presented in the main part of the paper, we also investigated whether our network is able to identify the node with maximum degree in a graph and delete it. In particular, we generated random graphs from the Barabási-Albert model with a random number of nodes sampled uniformly from {4, 5, 6, 7, 8} and 3 new edges per iteration. We then added a special node to the graph that is connected to all other nodes to facilitate communication across the graph. In this setup, identifying the node with maximum degree should be possible in three graph neural network layers: The first layer computes the degree for each node, the second layer communicates the degree information to the special node, and the third layer compares the degree of every node to the information stored at the special node, thus identifying the node with maximum degree. Accordingly, we used graph neural networks with 3 layers and 64 neurons in each layer. All other training hyperparameters were chosen as in the other experiments. The results are shown in Table 3. We observe that VGAE does not manage to identify the node with maximum degree, whereas the GEN with hinge loss does. Interestingly, the crossentropy loss GEN is not able to identify the node either (as is visible in the 0.49 value for node deletion recall).
This may be because the crossentropy loss converges to zero more slowly, such that nodes that are already correctly classified can still influence the gradient, whereas the hinge loss sets the gradient for nodes that are a margin of safety away from the decision boundary to zero (refer to Section B).

F ADDITIONAL EXPERIMENTS ON SYNTAX TREES

We additionally evaluated graph edit networks on six datasets of Python syntax trees, where students iteratively solve a Python programming task, in particular implementing the mathematical function f(x) = x⁴ − x² + x − 4 in Python (fun), plotting the function (plt), implementing its gradient (grad), implementing gradient descent on it (desc), and finding optimal starting values (fin), as well as writing a sorting program (pysort). While the latter dataset is synthetic, the former five are recordings of fifteen students. In all cases, our task is to predict the next state of the student's program, represented by its abstract syntax tree. For all datasets we used a graph edit network with four layers, 128 neurons per layer, residual connections, and tanh nonlinearity, which we optimized using Adam with a learning rate of 10⁻³ and weight decay of 10⁻³ for 10k epochs. In each epoch we computed the loss for one randomly sampled time series. To obtain statistics we performed a 5-fold crossvalidation on all datasets.

Table 4 shows the average root mean square tree edit distance between the predicted tree and the actual next syntax tree on all datasets, for graph edit networks (GENs, last row), as well as two baselines, namely the constant prediction (const.), i.e. predicting no change, and the Gaussian process scheme (GP) of Paaßen et al. (2018). We observe that both the synthetic syntax trees and the actual student data are hard to predict, as neither GP nor GEN outperforms the constant baseline. Further, we observe that GEN performs slightly better than GP on plt, grad, and desc, clearly better on pysort (by more than four standard deviations), and slightly worse on fun and fin. Overall, we cannot observe a clear advantage of GEN on these data in terms of RMSE, which is likely due to the small set of training data (only 15 different time series). Still, in terms of inference time GENs (below 100 ms on all datasets) clearly outperform the expensive kernel-to-tree translation approach of Paaßen et al. (2018) (ranging from below 100 ms to over 2 minutes per prediction on pysort).

Table 4: The mean RMSEs (± std.) for the constant baseline, Gaussian process regression (GP) of Paaßen et al. (2018), and graph edit networks on all syntax tree datasets.

model    fun           plt          grad         desc          fin          pysort
const.   7.95 ± 1.1    3.78 ± 1.2   6.85 ± 2.3   3.97 ± 0.6    1.45 ± 0.4   13.45 ± 1.3
GP       9.69 ± 1.4    6.61 ± 2.6   8.70 ± 1.7   13.97 ± 3.8   1.32 ± 0.5   26.53 ± 1.8
GEN      11.68 ± 1.4   5.66 ± 2.1   8.23 ± 2.1   9.10 ± 3.4    1.70 ± 1.0   18.89 ± 1.2

G PROMOTING SHORTER EDIT PATHS IN GEN'S TRAINING

Denote with C(del), C(ins), C(edel), and C(eins) the edit costs associated with node deletion, node insertion, edge deletion, and edge insertion. Define

ℓ_sp(G) = ( ‖(1_{n×n} − E) ⊙ A‖² + ‖A(1_n − ν)‖² ) · C(edel) + ( ‖E ⊙ (1_{n×n} − A)‖² + ‖ν‖² ) · C(eins) + ‖1_n − ν‖² · C(del) + ‖ν‖² · C(ins),

where 1_n and 1_{n×n} denote the n-dimensional all-ones vector and the n × n all-ones matrix, respectively, and ⊙ is the element-wise product. Assuming the edge and node scores have been passed through a squashing function into (0, 1), adding the term ℓ_sp to the GEN's training loss will penalize costly edit paths, hence promoting shorter ones.
The total cost derived by constructing edits from E is, in fact, given by

|{edel_{ij} ∈ E}| · C(edel) + |{eins_{ij} ∈ E}| · C(eins)
= |{(i, j) ∈ [n]² : ϵ_{ij} < 1/2, ν_i, ν_j > 1/2, a(G)_{ij} = 1}| · C(edel) + |{(i, j) ∈ [n]² : ϵ_{ij} > 1/2, ν_i, ν_j > 1/2, a(G)_{ij} = 0}| · C(eins).

We can rewrite it in matrix form. Define for convenience E⁻ = (E < 1/2), E⁺ = (E > 1/2), ν⁻ = (ν < 1/2), and ν⁺ = (ν > 1/2), where the operators > and < are applied component-wise and return 1 if "true" and 0 if "false". The matrix form becomes

‖E⁻ ⊙ (1_n − ν⁻)(1_n − ν⁻)ᵀ ⊙ A‖₀ · C(edel) + ‖E⁺ ⊙ (1_n − ν⁻)(1_n − ν⁻)ᵀ ⊙ (1_{n×n} − A)‖₀ · C(eins),

where ⊙ is the element-wise product. Conversely, the cost associated with the vector ν is

Σ_{v: ν⁻_v = 1} ( |{u ∈ [n] : a(G)_{uv} = 1}| · C(edel) + C(del) ) + Σ_{v: ν⁺_v = 1} ( C(eins) + C(ins) )
= 1_nᵀ A(G) ν⁻ · C(edel) + 1_nᵀ ν⁻ · C(del) + 1_nᵀ ν⁺ · (C(eins) + C(ins)).

Assuming that both edge and node scores are bounded by 1, adding the following term to the loss will promote shorter edit paths:

ℓ_sp(G) = ‖(1_{n×n} − E) ⊙ A‖² · C(edel) + ‖E ⊙ (1_{n×n} − A)‖² · C(eins) + ‖A(G)(1_n − ν)‖² · C(edel) + ‖1_n − ν‖² · C(del) + ‖ν‖² · (C(eins) + C(ins)).
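A minimal numpy sketch of this penalty, following the reconstruction above (reading ‖·‖² as a squared Euclidean/Frobenius norm and ⊙ as the element-wise product; the function signature is our own illustration):

import numpy as np

def sparsity_penalty(E, nu, A, c_del, c_ins, c_edel, c_eins):
    # E: (n, n) edge scores in (0, 1); nu: (n,) node scores in (0, 1);
    # A: (n, n) binary adjacency matrix of the input graph G
    edge_del = np.sum(((1.0 - E) * A) ** 2) * c_edel       # existing edges scored for deletion
    edge_ins = np.sum((E * (1.0 - A)) ** 2) * c_eins       # non-edges scored for insertion
    node_del = np.sum((A @ (1.0 - nu)) ** 2) * c_edel \
             + np.sum((1.0 - nu) ** 2) * c_del             # deleting a node also deletes its edges
    node_ins = np.sum(nu ** 2) * (c_eins + c_ins)          # an inserted node comes with an edge insertion
    return edge_del + edge_ins + node_del + node_ins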