# Optimizing over trained GNNs via symmetry breaking

Shiqiang Zhang (Imperial College London, UK), Juan S. Campos (Imperial College London, UK), Christian Feldmann (BASF SE, Ludwigshafen, Germany), David Walz (BASF SE, Ludwigshafen, Germany), Frederik Sandfort (BASF SE, Ludwigshafen, Germany), Miriam Mathea (BASF SE, Ludwigshafen, Germany), Calvin Tsay (Imperial College London, UK), Ruth Misener (Imperial College London, UK)

Corresponding author: s.zhang21@imperial.ac.uk. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Optimization over trained machine learning models has applications including: verification, minimizing neural acquisition functions, and integrating a trained surrogate into a larger decision-making problem. This paper formulates and solves optimization problems constrained by trained graph neural networks (GNNs). To circumvent the symmetry issue caused by graph isomorphism, we propose two types of symmetry-breaking constraints: one indexing a node 0 and one indexing the remaining nodes by lexicographically ordering their neighbor sets. To guarantee that adding these constraints will not remove all symmetric solutions, we construct a graph indexing algorithm and prove that the resulting graph indexing satisfies the proposed symmetry-breaking constraints. For the classical GNN architectures considered in this paper, optimizing over a GNN with a fixed graph is equivalent to optimizing over a dense neural network. Thus, we study the case where the input graph is not fixed, implying that each edge is a decision variable, and develop two mixed-integer optimization formulations. To test our symmetry-breaking strategies and optimization formulations, we consider an application in molecular design.

1 Introduction

Graph neural networks (GNNs) [1-3] are designed to operate on graph-structured data. By passing messages between nodes via edges (or vice versa), GNNs can efficiently capture and aggregate local information within graphs. GNN architectures, including spectral approaches [4-9] and spatial approaches [10-18], have been proposed based on various motivations. Due to their ability to learn from non-Euclidean data structures, GNNs have recently been applied to many graph-specific tasks, demonstrating impressive performance. In drug discovery, for example, GNNs can predict molecular properties given the graph representation of molecules [16, 19-23]. Other applications using GNNs include graph clustering [24, 25], text classification [26, 27], and social recommendation [28, 29]. Although GNNs are powerful tools for these "forward" prediction tasks, few works discuss the "backward" (or inverse) problem defined on trained GNNs. Specifically, many applications motivate the inverse problem of using machine learning (ML) to predict the inputs corresponding to some output specification(s). For example, computer-aided molecular design (CAMD) [30-32] aims to design the optimal molecule(s) for a given target based on a trained ML model [33-36].

Figure 1: Illustration of forward and backward problems defined on trained GNNs. Prediction (forward): what are the properties of a given molecule? Optimization (backward): what is the optimal molecule with the desired properties? GNN-predicted properties include quantum mechanics, physico-chemical properties, bioactivity, and (eco-)toxicity.
Several GNN-based approaches have been proposed for CAMD [37-40]; however, these works still use GNNs as "forward" functions to make predictions over graph-domain inputs. Therefore, both the graph structure of the inputs and the inner structure of the trained GNNs are ignored. Figure 1 conceptually depicts the difference between forward and backward/inverse problems.

Mathematical optimization over trained ML models is an area of increasing interest. In this paradigm, a trained ML model is translated into an optimization formulation, thereby enabling decision-making problems over model predictions. For continuous, differentiable models, these problems can be solved via gradient-based methods [41-44]. Mixed-integer programming (MIP) has also been proposed, mainly to support non-smooth models such as ReLU neural networks [45-50], their ensembles [51], and tree ensembles [52-54]. Optimization over trained ReLU neural networks has been especially prominent [55], finding applications such as verification [56-64], reinforcement learning [65-67], compression [68], and black-box optimization [69]. There is no reason to exclude GNNs from this rich field. Wu et al. [70] verified a GNN-based job scheduler, where MIP is used to obtain tighter bounds in a forward-backward abstraction refinement framework. McDonald [71] built MIP formulations for GCN [6] and GraphSAGE [13] and tested their performance on a CAMD problem; to the best of our knowledge, this is the first work to directly apply MIP to GNNs.

Optimization over trained GNNs involves two significant challenges. One is the symmetry issue arising from the permutation invariance of GNNs (i.e., isomorphic graphs share the same GNN output(s)). For forward problems, this invariance is desirable since it admits any graph representation for the inputs. From an optimization perspective, however, symmetric solutions can significantly enlarge the feasible domain and slow down a branch-and-bound (B&B) search. One way to break symmetry in integer programs is by adding constraints that remove symmetric solutions [72-78]. The involvement of graphs adds complexity to the problem not only in terms of symmetry, but also in terms of implementation. Due to the complexity and variety of GNN architectures, a general framework is needed, and this framework should be compatible with symmetry-breaking techniques.

This paper first defines optimization problems on trained GNNs. To handle the innate symmetry, we propose two sets of constraints that remove most isomorphic solutions (i.e., different indexings of the same abstract graph). To guarantee that these constraints do not exclude all possible symmetries (i.e., any abstract graph should have at least one feasible indexing), we design an algorithm to index any undirected graph and prove that the resulting indexing is feasible under the proposed constraints. To encode GNNs, we build a general framework for MIP on GNNs based on the open-source Optimization and Machine Learning Toolkit (OMLT) [79]. Many classic GNN architectures are supported after proper transformations. Two formulations are provided and compared on a CAMD problem. Numerical results show the substantial improvement obtained by breaking symmetry.

Paper structure: Section 2 introduces the problem definition, proposes the symmetry-breaking constraints, and gives theoretical guarantees. Section 3 discusses the application, implementation details, and numerical results. Section 4 concludes and discusses future work.
2 Methodology

2.1 Definition of optimization over trained GNNs

We consider optimization over a trained GNN on a given dataset $\mathcal{D} = \{(X_i, A_i), y_i\}_{i=1}^M$. Each sample in the dataset consists of an input graph (with features $X_i$ and adjacency matrix $A_i$) and an output property $y_i$. The target of training is to approximate the properties, that is:

$$\mathrm{GNN}(X_i, A_i) \approx y_i, \quad 1 \le i \le M.$$

When optimizing over a trained GNN, the goal is to find the input with the optimal property:

$$(X^*, A^*) = \arg\min_{(X, A)} \ \mathrm{GNN}(X, A) \quad \text{s.t.} \quad f_j(X, A) \le 0,\ j \in \mathcal{J}, \qquad g_k(X, A) = 0,\ k \in \mathcal{K}, \tag{OPT}$$

where $f_j, g_k$ are problem-specific constraints and $\mathcal{J}, \mathcal{K}$ are index sets. Note that optimality of (OPT) is defined on the trained GNN instead of the true properties. To simplify our presentation, we focus on undirected graphs with node features, noting that directed graphs can be naturally transformed into undirected graphs. Likewise, using edge features does not influence our analyses. We discuss the practical impacts of such modifications later.

2.2 Symmetry handling

For any input $(X, A)$, assume that there are $N$ nodes, giving an $N \times N$ adjacency matrix $A$ and a feature matrix $X = (X_0, X_1, \ldots, X_{N-1})^T \in \mathbb{R}^{N \times F}$, where $X_i \in \mathbb{R}^F$ contains the features of the $i$-th node. For any permutation $\gamma$ over the set $[N] := \{0, 1, 2, \ldots, N-1\}$, let:

$$A'_{u,v} = A_{\gamma(u), \gamma(v)}, \quad X'_v = X_{\gamma(v)}, \quad \forall u, v \in [N].$$

Then the permuted input $(X', A')$ has an isomorphic graph structure, and therefore the same output, due to the permutation invariance of GNNs (i.e., $\mathrm{GNN}(X, A) = \mathrm{GNN}(X', A')$). In other words, the symmetries result from different indexings of the same graph. In general, there exist $N!$ ways to index $N$ nodes, each of which corresponds to one solution of (OPT). McDonald [71] proposed a set of constraints to force molecules to be connected, which can also be used to break symmetry in our general setting. Mathematically, they can be written as:

$$\forall v \in [N] \setminus \{0\}, \ \exists u < v, \ \text{s.t.} \ A_{u,v} = 1. \tag{S1}$$

Constraints (S1) require each node (except node 0) to be linked with a node with a smaller index. Even though constraints (S1) help break symmetry, there can still exist many symmetric solutions. To resolve, or at least relieve, this issue, we need to construct more symmetry-breaking constraints.

Breaking symmetry at the feature level. We can first design some constraints on the features to define a starting point for the graph (i.e., assigning index 0 to a chosen node). Note that multiple nodes may share the same features. Specifically, we define an arbitrary function $h : \mathbb{R}^F \to \mathbb{R}$ to assign a hierarchy to each node and force node 0 to have the minimal function value:

$$h(X_0) \le h(X_v), \quad \forall v \in [N] \setminus \{0\}. \tag{S2}$$

The choice of $h$ can be application-specific, for example, as in Section 3.1. The key is to define $h$ such that not too many nodes share the same minimal function value.

Breaking symmetry at the graph level. It is unlikely that constraints on features alone will break much of the symmetry. After adding constraints (S1) and (S2), any neighbor of node 0 could be indexed 1, then any neighbor of node 0 or 1 could be indexed 2, and so on. We want to limit the number of possible indexings. A natural idea is to account for the neighbors of each node: the neighbor set of a node with smaller index should also have smaller lexicographical order. Before further discussion, we need to define the lexicographical order. Let $S(M, L)$ be the set of all non-decreasing sequences with $L$ integer elements in $[0, M]$ (i.e., $S(M, L) = \{(a_1, a_2, \ldots, a_L) \mid 0 \le a_1 \le a_2 \le \cdots \le a_L \le M\}$). For any $a \in S(M, L)$, denote its lexicographical order by $\mathrm{LO}(a)$. Then for any $a, b \in S(M, L)$, we have:

$$\mathrm{LO}(a) < \mathrm{LO}(b) \iff \exists l \in \{1, 2, \ldots, L\}, \ \text{s.t.} \ a_i = b_i, \ \forall 1 \le i < l, \ \text{and} \ a_l < b_l. \tag{LO}$$

For any multiset $A$ with no more than $L$ integer elements in $[0, M-1]$, we first sort all elements of $A$ in non-decreasing order, then append elements with value $M$ until the length equals $L$. In that way, we build an injective mapping from $A$ to a sequence $a \in S(M, L)$. Define the lexicographical order of $A$ by $\mathrm{LO}(a)$; for simplicity, we use $\mathrm{LO}(A)$ to denote the lexicographical order of $A$.
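Concretely, this definition can be realized by sorting and padding: once a multiset is mapped to its padded, sorted sequence, ordinary tuple comparison coincides with LO. The short Python sketch below is only illustrative; the helper name `lo_key` is ours, not part of the paper's released code.

```python
def lo_key(multiset, M, L):
    """Map a multiset with at most L integer elements in [0, M-1] to its sorted
    sequence in S(M, L), padded with M; tuple comparison then realizes LO(.)."""
    seq = sorted(multiset)
    return tuple(seq + [M] * (L - len(seq)))

# Neighbor sets of a 6-node graph (M = N = 6, L = N - 1 = 5):
print(lo_key({0, 1, 4}, 6, 5) < lo_key({0, 1, 5}, 6, 5))  # True: (0,1,4,6,6) < (0,1,5,6,6)
print(lo_key({0, 1}, 6, 5) < lo_key({0, 1, 4}, 6, 5))      # False: the shorter set pads
                                                           # with M = 6, and 6 > 4
```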
Algorithm 1: Indexing algorithm

Input: $G = (V, E)$ with node set $V = \{v_0, v_1, \ldots, v_{N-1}\}$ ($N := |V|$); denote the neighbor set of node $v$ by $N(v)$, $v \in V$.
1. $I(v_0) \leftarrow 0$  (assume that $v_0$ is indexed 0)
2. $s \leftarrow 1$  (index for the next node)
3. $V_1^1 \leftarrow \{v_0\}$  (initialize the set of indexed nodes)
4. while $s < N$ do
5.   $V_2^s \leftarrow V \setminus V_1^s$  (set of unindexed nodes)
6.   $N^s(v) \leftarrow \{I(u) \mid u \in N(v) \cap V_1^s\}, \ \forall v \in V_2^s$  (obtain all indexed neighbors)
7.   $\mathrm{rank}^s(v) \leftarrow |\{\mathrm{LO}(N^s(u)) < \mathrm{LO}(N^s(v)) \mid u \in V_2^s\}|, \ \forall v \in V_2^s$  (assign a rank to each unindexed node)
8.   $I^s(v) \leftarrow I(v)$ if $v \in V_1^s$, and $I^s(v) \leftarrow \mathrm{rank}^s(v) + s$ if $v \in V_2^s$  (assign temporary indexes)
9.   $N_t^s(v) \leftarrow \{I^s(u) \mid u \in N(v)\}, \ \forall v \in V_2^s$  (define temporary neighbor sets based on $I^s$)
10.  $v_s \leftarrow \arg\min_{v \in V_2^s} \mathrm{LO}(N_t^s(v))$  (the neighbors of $v_s$ have minimal order; if multiple nodes share the same minimal order, arbitrarily choose one)
11.  $I(v_s) \leftarrow s$  (assign index $s$ to node $v_s$)
12.  $V_1^{s+1} \leftarrow V_1^s \cup \{v_s\}$  (add $v_s$ to the set of indexed nodes)
13.  $s \leftarrow s + 1$
Output: $I(v), \ \forall v \in V$  (resulting indexing)

Remark: The definition of $\mathrm{LO}(\cdot)$ and Section 2.3 reuse $A$ to denote a multiset. This does not cause ambiguity since multisets are only introduced in the theoretical part, where the adjacency matrix is not involved. For brevity, we use capital letters $A, B, \ldots$ to denote multisets and lowercase letters $a, b, \ldots$ to denote the corresponding sequences in $S(M, L)$. Since we only need lexicographical orders for the neighbor sets of the nodes of the graph, let $M = N$ and $L = N - 1$.

With the definition of $\mathrm{LO}(\cdot)$, we can represent the aforementioned idea by:

$$\mathrm{LO}(N(v) \setminus \{v+1\}) \le \mathrm{LO}(N(v+1) \setminus \{v\}), \quad \forall v \in [N-1] \setminus \{0\}, \tag{S3}$$

where $N(v)$ is the neighbor set of node $v$. Note that the possible edge between nodes $v$ and $v+1$ is ignored in (S3) to include the case where they are linked and share all neighbors with indexes less than $v$. In this case, $\mathrm{LO}(N(v+1))$ is necessarily smaller than $\mathrm{LO}(N(v))$ since $v \in N(v+1)$. Constraints (S3) exclude many ways to index the rest of the nodes; for example, they reduce the number of ways to index the graph in Figure 2 from 120 to 4. But we still need to ensure that there exists at least one feasible indexing for any graph after applying constraints (S3). Algorithm 1 constructs such an indexing and Section 2.3 provides a theoretical guarantee of its feasibility.

2.3 Theoretical guarantee

This section proves that Algorithm 1 always provides a feasible indexing. We first give some properties that are used to derive an important lemma.

Property 1. For any $s = 1, 2, \ldots, N-1$: $I^s(v) < s$ if $v \in V_1^s$, and $I^s(v) \ge s$ if $v \in V_2^s$.

Figure 2: Consider a graph having 6 nodes with different features. Without any constraints, there are $6! = 720$ ways to index it. Utilizing (S1) to force connectivity results in 636 ways. Using (S2) to choose $v_0$ to be indexed 0, there are $5! = 120$ ways. Finally, applying (S3) to order the rest of the nodes, there are only 4 ways, 2 of which can be derived from Algorithm 1. For details on using Algorithm 1 to index this graph, see Appendix A.2.

Property 2. For any $s_1, s_2 = 1, 2, \ldots, N-1$ with $s_1 \le s_2$: $N^{s_1}(v) = N^{s_2}(v) \cap [s_1], \ \forall v \in V_2^{s_2}$.

Property 3. Given any multisets $A, B$ with no more than $L$ integer elements in $[0, M-1]$, we have: $\mathrm{LO}(A) \le \mathrm{LO}(B) \implies \mathrm{LO}(A \cap [m]) \le \mathrm{LO}(B \cap [m])$ for $m = 1, 2, \ldots, M$.
Using these properties, we can prove Lemma 1, which shows that if the final index of node $u$ is smaller than that of $v$, then at each iteration the temporary index assigned to $u$ is not greater than that of $v$.

Lemma 1. For any two nodes $u$ and $v$: $I(u) < I(v) \implies I^s(u) \le I^s(v), \ \forall s = 1, 2, \ldots, N-1$.

Now we can prove Theorem 1.

Theorem 1. Given any undirected graph $G = (V, E)$ with one node indexed 0, the indexing yielded by Algorithm 1 satisfies (S3).

Proof. Assume that constraints (S3) are not satisfied and that the minimal index violating (S3) is $s$. Denote the nodes with indexes $s$ and $s+1$ by $u$ and $v$, respectively (i.e., $I(u) = s$, $I(v) = s+1$). Let $N(u) \setminus \{v\} := \{u_1, u_2, \ldots, u_m\}$ be all neighbors of $u$ except $v$, where $I(u_i) < I(u_{i+1})$, $i = 1, 2, \ldots, m-1$. Similarly, let $N(v) \setminus \{u\} := \{v_1, v_2, \ldots, v_n\}$ be all neighbors of $v$ except $u$, where $I(v_j) < I(v_{j+1})$, $j = 1, 2, \ldots, n-1$. Denote the sequences in $S(N, N-1)$ corresponding to the sets $\{I(u_i) \mid 1 \le i \le m\}$ and $\{I(v_j) \mid 1 \le j \le n\}$ by $a = (a_1, a_2, \ldots, a_{N-1})$ and $b = (b_1, b_2, \ldots, b_{N-1})$. By the definition of $\mathrm{LO}(\cdot)$:

$$a_i = I(u_i) \ \text{for} \ 1 \le i \le m, \quad a_i = N \ \text{for} \ m < i < N; \qquad b_j = I(v_j) \ \text{for} \ 1 \le j \le n, \quad b_j = N \ \text{for} \ n < j < N.$$

Since nodes $u$ and $v$ violate constraints (S3), there exists a position $1 \le k \le N-1$ satisfying $a_i = b_i$ for $1 \le i < k$ and $a_k > b_k$, from which we know that nodes $u$ and $v$ share their first $k-1$ neighbors (i.e., $u_i = v_i$, $1 \le i < k$). From $b_k < a_k \le N$ we know that node $v$ definitely has a $k$-th neighbor $v_k$. Also, note that $v_k$ is not a neighbor of node $u$; otherwise, we would have $u_k = v_k$ and then $a_k = I(u_k) = I(v_k) = b_k$.

Case 1: $a_k = N$, that is, node $u$ has only $k-1$ neighbors. In this case, node $v$ has all neighbors of node $u$ as well as node $v_k$. Therefore, we have $\mathrm{LO}(N_t^s(u)) > \mathrm{LO}(N_t^s(v))$, which contradicts the fact that node $u$ is chosen to be indexed $s$ at the $s$-th iteration of Algorithm 1.

Case 2: $a_k < N$, that is, node $u$ has a $k$-th neighbor $u_k$. Since $I(u_k) = a_k > b_k = I(v_k)$, we can apply Lemma 1 to nodes $u_k$ and $v_k$ at the $(s+1)$-th iteration to obtain

$$I^{s+1}(u_k) \ge I^{s+1}(v_k). \tag{$\ge$}$$

Similarly, applying Lemma 1 to all neighbors of nodes $u$ and $v$ at the $s$-th iteration gives $I^s(u_i) \le I^s(u_{i+1})$, $i = 1, 2, \ldots, m-1$, and $I^s(v_j) \le I^s(v_{j+1})$, $j = 1, 2, \ldots, n-1$. Given that $a_k = I(u_k)$ is the $k$-th smallest number in $a$, we conclude that $I^s(u_k)$ equals the $k$-th smallest number in $N_t^s(u)$. Likewise, $I^s(v_k)$ equals the $k$-th smallest number in $N_t^s(v)$. Meanwhile, $I^s(u_i) = I^s(v_i)$ since $u_i = v_i$, $1 \le i < k$. After comparing the lexicographical orders of $N_t^s(u)$ and $N_t^s(v)$ (which have the same $k-1$ smallest elements, and $I^s(u_k)$ and $I^s(v_k)$ as their $k$-th smallest elements, respectively), node $u$ is chosen. Therefore, we have $I^s(u_k) \le I^s(v_k)$, from which we know that $\mathrm{LO}(N^s(u_k)) \le \mathrm{LO}(N^s(v_k))$. At the $(s+1)$-th iteration, node $u_k$ has one more indexed neighbor (i.e., node $u$ with index $s$), while node $v_k$ has no new indexed neighbor. Thus we have

$$\mathrm{LO}(N^{s+1}(u_k)) = \mathrm{LO}(N^s(u_k) \cup \{s\}) < \mathrm{LO}(N^s(u_k)) \le \mathrm{LO}(N^s(v_k)) = \mathrm{LO}(N^{s+1}(v_k)),$$

which yields $I^{s+1}(u_k) < I^{s+1}(v_k)$. (<)

The contradiction between ($\ge$) and (<) completes the proof.

Given any undirected graph with node features, after using (S2) to choose node 0, Theorem 1 guarantees that there exists at least one indexing satisfying (S3). However, when applying (S1) to force a connected graph, we also need to show the compatibility between (S1) and (S3), as stated in Lemma 2. Appendix A provides proofs of the properties and lemmas.

Lemma 2. For any undirected, connected graph $G = (V, E)$, if one indexing of $G$ satisfies (S3), then it satisfies (S1).
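As a concrete companion to Theorem 1, the sketch below reimplements Algorithm 1 and a check of constraints (S3) in plain Python. The names `lo_key`, `index_graph`, and `satisfies_S3` are ours (with `lo_key` repeated from the Section 2.2 sketch so the block is self-contained); this is an illustrative reimplementation, not the released GNN_MIP_CAMD code.

```python
def lo_key(multiset, M, L):
    """Sorted sequence in S(M, L), padded with M; tuple comparison realizes LO(.)."""
    seq = sorted(multiset)
    return tuple(seq + [M] * (L - len(seq)))

def index_graph(neighbors, v0):
    """Sketch of Algorithm 1. `neighbors` maps each node to its set of neighbors;
    `v0` is the node assigned index 0 (chosen, e.g., via (S2))."""
    N = len(neighbors)
    I = {v0: 0}                                            # final indexes I(v)
    for s in range(1, N):
        unindexed = [v for v in neighbors if v not in I]
        # N^s(v): indexes of already-indexed neighbors of each unindexed node
        key = {v: lo_key([I[u] for u in neighbors[v] if u in I], N, N - 1)
               for v in unindexed}
        # rank by LO of N^s(.); ties share a rank (only the relative order matters)
        rank = {v: sum(key[u] < key[v] for u in unindexed) for v in unindexed}
        temp = {**I, **{v: s + rank[v] for v in unindexed}}  # temporary indexes I^s
        # N_t^s(v): temporary indexes of *all* neighbors; pick the node of minimal LO
        chosen = min(unindexed,
                     key=lambda v: lo_key([temp[u] for u in neighbors[v]], N, N - 1))
        I[chosen] = s
    return I

def satisfies_S3(neighbors, I):
    """Check constraints (S3) for an indexing I of the graph `neighbors`."""
    N = len(neighbors)
    nbr = {I[v]: {I[u] for u in neighbors[v]} for v in neighbors}
    return all(lo_key(nbr[v] - {v + 1}, N, N - 1) <= lo_key(nbr[v + 1] - {v}, N, N - 1)
               for v in range(1, N - 1))

# The six-node graph of Appendix A.2, with v0 fixed as node 0:
neighbors = {0: {1, 2, 3, 4, 5}, 1: {0, 2, 3, 4}, 2: {0, 1, 5},
             3: {0, 1, 4}, 4: {0, 1, 3}, 5: {0, 2}}
I = index_graph(neighbors, 0)       # e.g. {0: 0, 1: 1, 3: 2, 4: 3, 2: 4, 5: 5}
assert satisfies_S3(neighbors, I)   # the guarantee of Theorem 1
```

On this example the sketch reproduces one of the two indexings that Appendix A.2 derives from Algorithm 1.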
Note that constraints (S1)-(S3) are not limited to optimizing over trained GNNs. In fact, they can be employed in generic graph-based search problems, as long as there exists a symmetry issue caused by graph isomorphism; here, GNNs can be generalized to any permutation-invariant function defined over graphs. Although the next sections apply MIP formulations to demonstrate the symmetry-breaking constraints, we could alternatively use a constraint programming [80] or satisfiability [81] paradigm. For example, we could have encoded the Elgabou and Frisch [82] constraints in Reluplex [83].

2.4 Connection & Difference to the symmetry-breaking literature

Typically, MIP solvers detect symmetry using graph automorphism; for example, SCIP [84] uses BLISS [85, 86], and both Gurobi [87] and SCIP break symmetry using orbital fixing [88] and orbital pruning [89]. When a MIP solver detects symmetry, the only graph available is the graph formed by the variables and constraints in the MIP. Our symmetries come from alternative indexings of abstract nodes, and each indexing results in an isomorphic graph. In the MIP, however, each indexing corresponds to an element of a much larger symmetry group defined on all variables, including node features (O(NF)), the adjacency matrix (O(N^2)), the model (e.g., a GNN), and problem-specific features. For instance, in the first row of Table 6, the input graph has N = 4 nodes, but the MIP has 616 + 411 variables. We only need to consider a permutation group with N! = 24 elements; however, because the MIP solver does not have access to the input graph structure, it needs to consider all possible automorphic graphs with 616 + 411 nodes. By adding constraints (S1)-(S3), there is no need to search the symmetry group over all variables for the much smaller subgroup corresponding to the permutation group defined on the abstract nodes.

The closest setting to ours is distributing m different jobs to n identical machines and then minimizing the total cost. A binary variable A_{i,j} denotes whether job i is assigned to machine j. The requirement is that each job can only be assigned to one machine (but each machine can be assigned multiple jobs). Symmetries come from all permutations of the machines. This setting appears in noise dosage problems [90, 91], packing and partitioning orbitopes [92, 93], and scheduling problems [94]. However, requiring that the sum of each row of A_{i,j} equals 1 simplifies the problem: by forcing decreasing lexicographical orders for all columns, the symmetry issue is handled well. Constraints (S3) can be regarded as a non-trivial generalization of these constraints from a bipartite graph to an arbitrary undirected graph: following Algorithm 1 produces the same indexing documented in [90-94].

2.5 Mixed-integer formulations for optimizing over trained GNNs

As mentioned before, there are many variants of GNNs [4-18] with different theoretical and practical motivations. Although it is possible to build a MIP for a specific GNN from scratch (e.g., [71]), a general definition that supports multiple architectures is preferable.

2.5.1 Definition of GNNs

We define a GNN with L layers as follows:

$$\mathrm{GNN} : \underbrace{\mathbb{R}^{d_0} \times \cdots \times \mathbb{R}^{d_0}}_{|V| \text{ times}} \to \underbrace{\mathbb{R}^{d_L} \times \cdots \times \mathbb{R}^{d_L}}_{|V| \text{ times}},$$

where $V$ is the set of nodes of the input graph. Let $x_v^{(0)} \in \mathbb{R}^{d_0}$ be the input features for node $v$. Then the $l$-th layer ($l = 1, 2, \ldots, L$) is defined by:

$$x_v^{(l)} = \sigma\Big( \sum_{u \in N(v) \cup \{v\}} w_{u \to v}^{(l)} x_u^{(l-1)} + b_v^{(l)} \Big), \tag{$*$}$$

where $N(v)$ is the set of all neighbors of $v$, and $\sigma$ could be the identity or any activation function.
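To make the generic layer ($*$) concrete, the following NumPy sketch implements one layer with sum aggregation and, for brevity, weights shared across node pairs ($w_{u \to v}^{(l)} = W^{(l)}$, an assumption on our part, as in GraphSAGE-style layers without normalization). It also shows why, for a fixed adjacency matrix, the layer is just an ordinary dense transformation of the stacked node features.

```python
import numpy as np

def gnn_layer(X, A, W, b, sigma=lambda t: np.maximum(t, 0.0)):
    """One layer of form (*): x_v' = sigma(sum_{u in N(v) ∪ {v}} W x_u + b).
    X: (N, d_in) node features, A: (N, N) 0/1 adjacency, W: (d_in, d_out), b: (d_out,)."""
    A_hat = A + np.eye(len(A))          # add self-loops: aggregate over N(v) ∪ {v}
    return sigma(A_hat @ X @ W + b)     # sum aggregation, affine map, activation

# With a *fixed* graph, A_hat is a constant matrix, so the layer is a dense layer in
# the stacked features; only when A is a decision variable do the bilinear / big-M
# encodings of Section 2.5.2 become necessary.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]], dtype=float)
W, b = rng.normal(size=(3, 2)), np.zeros(2)
print(gnn_layer(X, A, W, b).shape)      # (4, 2)
```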
With linear aggregation functions such as sum and mean, many classic GNN architectures can be rewritten in the form ($*$), for example: Spectral Network [4], ChebNet [5], GCN [6], Neural FPs [10], DCNN [11], PATCHY-SAN [12], GraphSAGE [13], and MPNN [16].

2.5.2 Mixed-integer optimization formulations for non-fixed graphs

If the graph structure of the inputs is given and fixed, then ($*$) is equivalent to a fully connected layer, whose MIP formulations are already well established [46, 49]. But if the graph structure is not fixed (i.e., the elements of the adjacency matrix are also decision variables), two issues arise in ($*$): (1) $N(v)$ is not well-defined; (2) $w_{u \to v}^{(l)}$ and $b_v^{(l)}$ may not be fixed and may contain the graph's information. Assuming that the weights and biases are constant, we can build two MIP formulations to handle the first issue.

The first formulation comes from observing that the existence of an edge from node $u$ to node $v$ determines the contribution from $x_u^{(l-1)}$ to $x_v^{(l)}$. Adding binary variables $e_{u \to v}$ for all $u, v \in V$, we can then formulate the GNN with bilinear constraints:

$$x_v^{(l)} = \sigma\Big( \sum_{u \in V} e_{u \to v}\, w_{u \to v}^{(l)} x_u^{(l-1)} + b_v^{(l)} \Big), \quad \forall v \in V. \tag{bilinear}$$

Introducing this nonlinearity results in a mixed-integer quadratically constrained optimization problem (MIQCP), which can be handled by state-of-the-art solvers such as Gurobi [87].

The second formulation generalizes the big-M formulation for GraphSAGE in [71] to all GNN architectures satisfying ($*$) and the assumption of constant weights and biases. Instead of using binary variables to directly control the existence of contributions between nodes, auxiliary variables $z_{u \to v}^{(l-1)}$ are introduced to represent the contribution from node $u$ to node $v$ in the $l$-th layer:

$$x_v^{(l)} = \sigma\Big( \sum_{u \in V} w_{u \to v}^{(l)} z_{u \to v}^{(l-1)} + b_v^{(l)} \Big), \quad \forall v \in V, \tag{big-M}$$

$$z_{u \to v}^{(l-1)} = \begin{cases} 0, & e_{u \to v} = 0, \\ x_u^{(l-1)}, & e_{u \to v} = 1. \end{cases}$$

Assuming that each feature is bounded, the definition of $z_{u \to v}^{(l-1)}$ can be reformulated using big-M:

$$x_u^{(l-1)} - M_u^{(l-1)} (1 - e_{u \to v}) \le z_{u \to v}^{(l-1)} \le x_u^{(l-1)} + M_u^{(l-1)} (1 - e_{u \to v}),$$
$$-M_u^{(l-1)} e_{u \to v} \le z_{u \to v}^{(l-1)} \le M_u^{(l-1)} e_{u \to v},$$

where $|x_u^{(l-1)}| \le M_u^{(l-1)}$ and $e_{u \to v} \in \{0, 1\}$. By adding extra continuous variables and constraints, as well as utilizing the bounds on all features, the big-M formulation replaces the bilinear constraints with linear constraints. Section 3 numerically compares these two formulations.

3 Computational experiments

We performed all experiments on a 3.2 GHz Intel Core i7-8700 CPU with 16 GB memory. GNNs are implemented and trained in PyG [95]. MIP formulations for GNNs and CAMD are implemented based on OMLT [79] and are optimized by Gurobi 10.0.1 [87] with the default relative MIP optimality gap (i.e., $10^{-4}$). The code is available at https://github.com/cog-imperial/GNN_MIP_CAMD. These are all engineering choices, and we could have, for example, extended software other than OMLT to translate the trained GNNs into an optimization solver [96, 97].

3.1 Mixed-integer optimization formulation for molecular design

MIP formulations are well established in the CAMD literature [98-103]. The basic idea is to represent a molecule as a graph, create variables for each atom (or group of atoms), and use constraints to preserve the graph structure and satisfy chemical requirements. A score function is usually given as the optimization target. Instead of using knowledge from experts or statistics to build the score function, GNNs (or other ML models) can be trained on datasets to replace score functions. Here we provide a general and flexible framework following the MIP formulation for CAMD in [71].
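Before turning to the molecular-design constraints, the following gurobipy sketch illustrates how the big-M encoding of Section 2.5.2 can be written in a modeling layer for one layer of form ($*$). It is a minimal sketch under simplifying assumptions (weights shared across node pairs, a single feature bound M, identity activation) and uses our own variable names; it is not OMLT's API or the released code.

```python
import gurobipy as gp
from gurobipy import GRB

N, F_in, F_out, M = 3, 2, 2, 10.0            # tiny sizes; assume |x_in[u, f]| <= M
W = [[1.0, -0.5], [0.25, 0.75]]              # shared F_out x F_in weights (constants)
b = [0.1, -0.2]

m = gp.Model("bigM_layer_sketch")
x_in = m.addVars(N, F_in, lb=-M, ub=M, name="x_in")     # previous-layer features
e = m.addVars(N, N, vtype=GRB.BINARY, name="e")         # adjacency decision variables
z = m.addVars(N, N, F_in, lb=-M, ub=M, name="z")        # contribution of u to v
x_out = m.addVars(N, F_out, lb=-GRB.INFINITY, name="x_out")

for u in range(N):
    for v in range(N):
        for f in range(F_in):
            # z[u,v,f] = x_in[u,f] if e[u,v] = 1, else 0  (big-M linearization)
            m.addConstr(z[u, v, f] <= x_in[u, f] + M * (1 - e[u, v]))
            m.addConstr(z[u, v, f] >= x_in[u, f] - M * (1 - e[u, v]))
            m.addConstr(z[u, v, f] <= M * e[u, v])
            m.addConstr(z[u, v, f] >= -M * e[u, v])

for v in range(N):
    for g in range(F_out):
        # identity activation for brevity; a ReLU would use the standard encodings [46, 49]
        m.addConstr(
            x_out[v, g] == gp.quicksum(W[g][f] * z[u, v, f]
                                       for u in range(N) for f in range(F_in)) + b[g]
        )
```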
Moreover, by adding meaningful bounds and breaking symmetry, our formulation can generate more reasonable molecules with less redundancy. Due to space limitations, we briefly introduce the formulation here (see Appendix B for the full MIP formulation). To design a molecule with N atoms, we define N·F binary variables X_{v,f}, v ∈ [N], f ∈ [F], to represent F features for N atoms. Hydrogen atoms are not counted since they can be implicitly considered as node features. Features include the type of atom, the number of neighbors, the number of associated hydrogen atoms, and the types of adjacent bonds. For the graph structure of molecules, we add three sets of binary variables A_{u,v}, DB_{u,v}, TB_{u,v} to denote the type of bond (i.e., any bond, double bond, and triple bond) between atoms u and v, where A is the adjacency matrix. To design reasonable molecules, constraints (C1)-(C21) handle structural feasibility following [71]. Additionally, we propose new constraints to bound the number of each type of atom (C22), double bonds (C23), triple bonds (C24), and rings (C25). In our experiments, we calculate these bounds based on each dataset itself so that every molecule in the dataset satisfies the bounds (see Appendix C.1 for details). By setting proper bounds, we can control the composition of the molecule and avoid extreme cases such as all atoms being set to oxygen, or a molecule with too many rings or double/triple bonds. In short, constraints (C22)-(C25) provide space for chemical expertise and practical requirements. Furthermore, our formulation can easily be applied to datasets with different types of atoms by only changing the parameters in Appendix B.1. Moreover, the group representation of molecules [98, 102] is also compatible with this framework; the advantage is that all constraints can be reused without any modification.

Among constraints (C1)-(C25) (shown in Appendix B.3), constraints (C5) are the realization of (S1). Except for (C5), these structural constraints are independent of the graph indexing. Therefore, we can compatibly implement constraints (S2) and (S3) to break symmetry. Corresponding to (S2), we add the following constraints over the features:

$$\sum_{f \in [F]} 2^{F-f-1} X_{0,f} \le \sum_{f \in [F]} 2^{F-f-1} X_{v,f}, \quad \forall v \in [N] \setminus \{0\}, \tag{C26}$$

where the coefficients $2^{F-f-1}$, $f \in [F]$, help build a bijection $h$ between all possible feature vectors and all integers in $[0, 2^F - 1]$. These coefficients are also called a "universal ordering vector" in [72, 78]. At the graph level, constraints (S3) can be equivalently rewritten as:

$$\sum_{u \ne v, v+1} 2^{N-u-1} A_{u,v} \ \ge \ \sum_{u \ne v, v+1} 2^{N-u-1} A_{u,v+1}, \quad \forall v \in [N-1] \setminus \{0\}. \tag{C27}$$

Similarly, the coefficients $2^{N-u-1}$, $u \in [N]$, build a bijective mapping between all possible sets of neighbors and all integers in $[0, 2^N - 1]$; a larger weighted sum corresponds to a lexicographically smaller neighbor set, so (C27) enforces (S3). For illustration purposes, we can view CAMD as two separate challenges: the first uses structural constraints to design reasonable molecules (including all symmetric solutions), and the second uses symmetry-breaking constraints to remove symmetric solutions. Note that the diversity of solutions does not change after breaking symmetry, because each molecule corresponds to at least one solution (guaranteed by Section 2.3).

3.2 Counting feasible structures: performance of symmetry-breaking

We choose two datasets, QM7 [104, 105] and QM9 [106, 107], from the CAMD literature to test the proposed methods; see Appendix C.1 for more information about QM7 and QM9. To test the efficiency of our symmetry-breaking techniques, we build a MIP formulation for CAMD and count all feasible structures for different N.
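The symmetry-breaking constraints being compared in this counting experiment are exactly (C26) and (C27) from Section 3.1. As a small illustration of how they enter the MIP, here is a gurobipy sketch with our own variable names, using the paper's feature count F = 16 and a small N; it is not the released code.

```python
import gurobipy as gp
from gurobipy import GRB

N, F = 4, 16
m = gp.Model("camd_symmetry_sketch")
X = m.addVars(N, F, vtype=GRB.BINARY, name="X")   # atom features X[v, f]
A = m.addVars(N, N, vtype=GRB.BINARY, name="A")   # adjacency (A[v, v] = node exists)

# (C26): node 0 attains the minimal feature score h(X_v) = sum_f 2^(F-f-1) * X[v, f]
h = [gp.quicksum(2 ** (F - f - 1) * X[v, f] for f in range(F)) for v in range(N)]
for v in range(1, N):
    m.addConstr(h[0] <= h[v], name=f"C26_{v}")

# (C27): linear encoding of (S3); a larger weighted neighbor sum corresponds to a
# lexicographically smaller neighbor set, and the possible v -- v+1 edge is ignored
for v in range(1, N - 1):
    lhs = gp.quicksum(2 ** (N - u - 1) * A[u, v] for u in range(N) if u not in (v, v + 1))
    rhs = gp.quicksum(2 ** (N - u - 1) * A[u, v + 1] for u in range(N) if u not in (v, v + 1))
    m.addConstr(lhs >= rhs, name=f"C27_{v}")
```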
By setting PoolSearchMode=2 and PoolSolutions=10^9, Gurobi can find many (up to 10^9) feasible solutions to fill the solution pool. Table 1 shows the performance of our symmetry-breaking constraints (S2) and (S3) compared to the baseline of (S1). Without adding (S1), we would need to count every graph with compatible features; even ignoring features, the baseline would be $2^{N(N-1)/2}$. This number becomes much larger once features are introduced, which makes it meaningless as a baseline, so we use (S1) as the baseline.

Table 1: Numbers of feasible solutions. The time limit is 48 hours. At least 2.5 × 10^6 solutions are found for each time out (t.o.). For each dataset, the last column reports the percentage of symmetric solutions removed after adding (S2) and (S3) to the baseline of (S1); a higher percentage means more symmetry is broken.

| N | QM7 (S1) | QM7 (S1)-(S2) | QM7 (S1)-(S3) | QM7 (%) | QM9 (S1) | QM9 (S1)-(S2) | QM9 (S1)-(S3) | QM9 (%) |
|---|----------|---------------|---------------|---------|----------|---------------|---------------|---------|
| 2 | 17 | 10 | 10 | 41 | 15 | 9 | 9 | 40 |
| 3 | 112 | 37 | 37 | 67 | 175 | 54 | 54 | 69 |
| 4 | 3,323 | 726 | 416 | 87 | 4,536 | 1,077 | 631 | 86 |
| 5 | 67,020 | 11,747 | 3,003 | 96 | 117,188 | 21,441 | 5,860 | 95 |
| 6 | t.o. | 443,757 | 50,951 | 98 | t.o. | 527,816 | 59,492 | 98 |
| 7 | t.o. | t.o. | 504,952 | - | t.o. | t.o. | 776,567 | - |
| 8 | t.o. | t.o. | t.o. | - | t.o. | t.o. | t.o. | - |

3.3 Optimizing over trained GNNs for molecular design

For each dataset, we train a GNN with two GraphSAGE layers followed by an add pooling layer and three dense layers. Details about their structures and the training process are given in Appendix C.2. To account for statistical variability, we train 10 models with different random seeds and use the 5 models with the smallest losses for optimization. Given N (∈ {4, 5, 6, 7, 8}) and a formulation (∈ {bilinear, bilinear+BrS, big-M, big-M+BrS}), where "+BrS" means adding the symmetry-breaking constraints (C26) and (C27), we solve the corresponding optimization problem 10 times with different random seeds in Gurobi. This means that there are 50 runs for each N and formulation. Figure 3 shows the significant improvement from symmetry-breaking. Considering the average solving time, big-M performs better. For the bilinear constraints, Gurobi 10.0.1 appears to transform the bilinear constraints into linear constraints. Appendix C.3 shows that, after Gurobi's presolve stage, the big-M formulation has more continuous variables but fewer binary variables compared to the bilinear formulation. The smaller number of binary variables after presolve may explain the better big-M performance.

Figure 3: These graphs report the time (time_tol) to achieve a relative MIP optimality gap of 10^{-4}, averaged over the number of successful runs among 50 runs. We plot a bar for each formulation at different values of N. The left- and right-hand graphs correspond to datasets QM7 and QM9, respectively. The number of runs that achieve optimality is shown above each bar. Results with fewer than 20 successful runs do not have a bar because so few runs terminated. The time limit is 10 hours.

In addition to the average solving time, we compare the performance of the two formulations with symmetry breaking in each run. In 341 out of 500 runs (i.e., 50 runs for each dataset ∈ {QM7, QM9} and each N ∈ {4, 5, 6, 7, 8}), the big-M formulation achieves optimality faster than the bilinear formulation. Additionally, Gurobi uses much of the solving time to improve the bounds. Tables 6 and 7 in Appendix C.3 report time_opt, the first time at which the optimal solution is found. In 292 runs, the big-M formulation finds the optimal solution earlier than the bilinear formulation.
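For reference, the solver settings described in Sections 3.2 and 3.3 correspond to the following gurobipy parameters. This is only a sketch of how they can be set; the released code may configure them differently.

```python
import gurobipy as gp

m = gp.Model("camd")                 # model built from the CAMD + GNN formulation
m.Params.PoolSearchMode = 2          # systematically enumerate feasible solutions (Sec. 3.2)
m.Params.PoolSolutions = 10 ** 9     # keep up to 1e9 solutions in the pool
m.Params.MIPGap = 1e-4               # default relative optimality gap used in Sec. 3.3
m.Params.TimeLimit = 48 * 3600       # 48 h for counting; 10 h for the runs in Figure 3
```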
4 Discussion & Conclusion We introduce optimization over trained GNNs and propose constraints to break symmetry. We prove that there exists at least one indexing (resulting from Algorithm 1) satisfying these constraints. Numerical results show the significant improvement after breaking symmetry. These constraints are not limited to the problem (i.e., optimizing trained GNNs), technique (i.e., MIP), and application (i.e., CAMD) used in this work. For example, one can incorporate them into genetic algorithms instead of MIP, or replace the GNN by artificially-designed score functions or other ML models. In other graph-based decision-making problems, as long as the symmetry issue caused by graph isomorphism exists, these constraints could be used to break symmetry. Moreover, the proposed frameworks for building MIP for GNNs as well as CAMD provide generality and flexibility for more problems. Limitations. Assuming constant weights and biases excludes some GNN types. This limitation is an artifact of our framework and it is possible to build MIP formulations for some other architectures. Using MIP may limit GNN size, for instance, edge features may enlarge the optimization problem. Future work. One direction is to make the MIP-GNN framework more general, such as adding edge features, supporting more GNN architectures, and developing more efficient formulations. Another direction is using the symmetry-breaking constraints in graph-based applications beyond trained GNNs because we can consider any function that is invariant to graph isomorphism. To facilitate further CAMD applications, more features such as aromacity and formal charge could be incorporated. Also, optimizing lead structures in a defined chemical space is potential to design large molecules. Acknowledgements This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/W003317/1 and EP/T001577/1], an Imperial College Hans Rausing Ph D Scholarship to SZ, a BASF/RAEng Research Chair in Data-Driven Optimisation to RM, and an Imperial College Research Fellowship to CT. We really appreciate the comments and suggestions from reviewers and meta-reviewers during the peer review process, which were very helpful to clarify and improve this paper. We also thank Guoliang Wang for the discussion about graph indexing. [1] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57 81, 2020. [2] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4 24, 2020. [3] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy. Journal of Machine Learning Research, 23(89):1 64, 2022. [4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In ICLR, 2014. [5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016. [6] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. [7] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In AAAI, 2018. [8] Chenyi Zhuang and Qiang Ma. 
Dual graph convolutional networks for graph-based semisupervised classification. In WWW, 2018. [9] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. Graph wavelet neural network. In ICLR, 2019. [10] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015. [11] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, 2016. [12] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, 2016. [13] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017. [14] Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2017. [15] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017. [16] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017. [17] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In Knowledge Discovery & Data Mining, 2018. [18] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Ga AN: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018. [19] Hiroyuki Shindo and Yuji Matsumoto. Gated graph recursive neural networks for molecular property prediction. ar Xiv preprint ar Xiv:1909.00259, 2019. [20] Xiaofeng Wang, Zhen Li, Mingjian Jiang, Shuang Wang, Shugang Zhang, and Zhiqiang Wei. Molecule property prediction based on spatial graph embedding. Journal of Chemical Information and Modeling, 59(9):3817 3828, 2019. [21] Youjun Xu, Jianfeng Pei, and Luhua Lai. Deep learning based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. Journal of Chemical Information and Modeling, 57(11):2672 2685, 2017. [22] Michael Withnall, Edvard Lindelöf, Ola Engkvist, and Hongming Chen. Building attention and edge message passing neural networks for bioactivity and physical chemical property prediction. Journal of Cheminformatics, 12(1):1 18, 2020. [23] Artur M. Schweidtmann, Jan G. Rittig, Andrea König, Martin Grohe, Alexander Mitsos, and Manuel Dahmen. Graph neural networks for prediction of fuel ignition quality. Energy & Fuels, 34(9):11395 11407, 2020. [24] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In NIPS, 2018. [25] Xiaotong Zhang, Han Liu, Qimai Li, and Xiao-Ming Wu. Attributed graph clustering via adaptive graph convolution. In IJCAI, 2019. [26] Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In WWW, 2018. [27] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In AAAI, 2019. [28] Qitian Wu, Hengrui Zhang, Xiaofeng Gao, Peng He, Paul Weng, Han Gao, and Guihai Chen. Dual graph attention networks for deep latent representation of multifaceted social effects in recommender systems. In WWW, 2019. 
[29] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In WWW, 2019. [30] Rafiqul Gani. Computer-aided methods and tools for chemical product design. Chemical Engineering Research and Design, 82(11):1494 1504, 2004. [31] Lik Yin Ng, Fah Keen Chong, and Nishanth G. Chemmangattuvalappil. Challenges and opportunities in computer aided molecular design. Computer Aided Chemical Engineering, 34:25 34, 2014. [32] Nick D. Austin, Nikolaos V. Sahinidis, and Daniel W. Trahan. Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques. Chemical Engineering Research and Design, 116:2 26, 2016. [33] Daniel C. Elton, Zois Boukouvalas, Mark D. Fuge, and Peter W. Chung. Deep learning for molecular design a review of the state of the art. Molecular Systems Design & Engineering, 4(4):828 849, 2019. [34] Abdulelah S. Alshehri, Rafiqul Gani, and Fengqi You. Deep learning and knowledge-based methods for computer-aided molecular design toward a unified approach: State-of-the-art and future directions. Computers & Chemical Engineering, 141:107005, 2020. [35] Faezeh Faez, Yassaman Ommi, Mahdieh Soleymani Baghshah, and Hamid R. Rabiee. Deep graph generators: A survey. IEEE Access, 9:106675 106702, 2021. [36] Wenhao Gao, Tianfan Fu, Jimeng Sun, and Connor W. Coley. Sample efficiency matters: A benchmark for practical molecular optimization. In NIPS Track Datasets and Benchmarks, 2022. [37] Xiaolin Xia, Jianxing Hu, Yanxing Wang, Liangren Zhang, and Zhenming Liu. Graph-based generative models for de Novo drug design. Drug Discovery Today: Technologies, 32:45 53, 2019. [38] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370 3388, 2019. [39] Jiacheng Xiong, Zhaoping Xiong, Kaixian Chen, Hualiang Jiang, and Mingyue Zheng. Graph neural networks for automated de novo drug design. Drug Discovery Today, 26(6):1382 1393, 2021. [40] Jan G. Rittig, Martin Ritzert, Artur M. Schweidtmann, Stefanie Winkler, Jana M. Weber, Philipp Morsch, Karl Alexander Heufer, Martin Grohe, Alexander Mitsos, and Manuel Dahmen. Graph machine learning for design of high-octane fuels. AICh E Journal, page e17971, 2022. [41] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. [42] Ga Wu, Buser Say, and Scott Sanner. Scalable planning with deep neural network learned transition models. Journal of Artificial Intelligence Research, 68:571 606, 2020. [43] Rudy R. Bunel, Oliver Hinder, Srinadh Bhojanapalli, and Krishnamurthy Dvijotham. An efficient nonconvex reformulation of stagewise convex optimization problems. NIPS, 2020. [44] Blanka Horvath, Aitor Muguruza, and Mehdi Tomas. Deep learning volatility: A deep neural network perspective on pricing and calibration in (rough) volatility models. Quantitative Finance, 21(1):11 27, 2021. [45] Francesco Croce, Maksym Andriushchenko, and Matthias Hein. Provable robustness of relu networks via maximization of linear regions. In AISTATS, 2019. [46] Ross Anderson, Joey Huchette, Will Ma, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, 183(1-2):3 39, 2020. 
[47] Matteo Fischetti and Jason Jo. Deep neural networks and mixed integer linear optimization. Constraints, 23(3):296 309, 2018. [48] Alessio Lomuscio and Lalit Maganti. An approach to reachability analysis for feed-forward relu neural networks. ar Xiv preprint ar Xiv:1706.07351, 2017. [49] Calvin Tsay, Jan Kronqvist, Alexander Thebelt, and Ruth Misener. Partition-based formulations for mixed-integer optimization of trained Re LU neural networks. In NIPS, 2021. [50] Alessandro De Palma, Harkirat S. Behl, Rudy Bunel, Philip Torr, and M. Pawan Kumar. Scaling the convex barrier with active sets. In ICLR, 2021. [51] Keliang Wang, Leonardo Lozano, Carlos Cardonha, and David Bergman. Optimizing over an ensemble of trained neural networks. INFORMS Journal on Computing, 2023. [52] Velibor V. Miši c. Optimization of tree ensembles. Operations Research, 68(5):1605 1624, 2020. [53] Miten Mistry, Dimitrios Letsios, Gerhard Krennrich, Robert M. Lee, and Ruth Misener. Mixedinteger convex nonlinear optimization with gradient-boosted trees embedded. INFORMS Journal on Computing, 33(3):1103 1119, 2021. [54] Alexander Thebelt, Jan Kronqvist, Miten Mistry, Robert M. Lee, Nathan Sudermann-Merx, and Ruth Misener. Entmoot: A framework for optimization over ensemble tree models. Computers & Chemical Engineering, 151:107343, 2021. [55] Joey Huchette, Gonzalo Muñoz, Thiago Serra, and Calvin Tsay. When deep learning meets polyhedral theory: A survey. ar Xiv preprint ar Xiv:2305.00241, 2023. [56] Chih-Hong Cheng, Georg Nührenberg, and Harald Ruess. Maximum resilience of artificial neural networks. In Automated Technology for Verification and Analysis, 2017. [57] Rudy R. Bunel, Ilker Turkaslan, Philip Torr, Pushmeet Kohli, and Pawan K. Mudigonda. A unified view of piecewise linear neural network verification. NIPS, 2018. [58] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In NIPS, 2018. [59] Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In ICLR, 2019. [60] Elena Botoeva, Panagiotis Kouvaros, Jan Kronqvist, Alessio Lomuscio, and Ruth Misener. Efficient verification of Re LU-based neural networks via dependency analysis. In AAAI, 2020. [61] Rudy Bunel, P Mudigonda, Ilker Turkaslan, Philip Torr, Jingyue Lu, and Pushmeet Kohli. Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research, 21(2020), 2020. [62] Rudy Bunel, Alessandro De Palma, Alban Desmaison, Krishnamurthy Dvijotham, Pushmeet Kohli, Philip Torr, and M. Pawan Kumar. Lagrangian decomposition for neural network verification. In UAI, 2020. [63] Kaidi Xu, Huan Zhang, Shiqi Wang, Yihan Wang, Suman Jana, Xue Lin, and Cho-Jui Hsieh. Fast and complete: Enabling complete neural network verification with rapid and massively parallel incomplete verifiers. In ICLR, 2021. [64] Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J. Zico Kolter. Beta-crown: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In NIPS, 2021. [65] Buser Say, Ga Wu, Yu Qing Zhou, and Scott Sanner. Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming. In IJCAI, 2017. [66] Arthur Delarue, Ross Anderson, and Christian Tjandraatmadja. Reinforcement learning with combinatorial actions: An application to vehicle routing. In NIPS, 2020. 
[67] Moonkyung Ryu, Yinlam Chow, Ross Anderson, Christian Tjandraatmadja, and Craig Boutilier. CAQL: Continuous action Q-learning. In ICLR, 2020. [68] Thiago Serra, Xin Yu, Abhinav Kumar, and Srikumar Ramalingam. Scaling up exact neural network compression by Re LU stability. In NIPS, 2021. [69] Theodore P. Papalexopoulos, Christian Tjandraatmadja, Ross Anderson, Juan Pablo Vielma, and David Belanger. Constrained discrete black-box optimization using mixed-integer programming. In ICML, 2022. [70] Haoze Wu, Clark Barrett, Mahmood Sharif, Nina Narodytska, and Gagandeep Singh. Scalable verification of GNN-based job schedulers. Proceedings of the ACM on Programming Languages, 6(OOPSLA2):1036 1065, 2022. [71] Tom Mc Donald. Mixed integer (non-) linear programming formulations of graph neural networks. Master Thesis, 2022. [72] Eric J. Friedman. Fundamental domains for integer programs with symmetries. In Combinatorial Optimization and Applications, 2007. [73] Leo Liberti. Automatic generation of symmetry-breaking constraints. Lecture Notes in Computer Science, 5165:328 338, 2008. [74] François Margot. Symmetry in integer linear programming. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art, 2009. [75] Leo Liberti. Reformulations in mathematical programming: Automatic symmetry detection and exploitation. Mathematical Programming, 131:273 304, 2012. [76] Leo Liberti and James Ostrowski. Stabilizer-based symmetry breaking constraints for mathematical programs. Journal of Global Optimization, 60:183 194, 2014. [77] Gustavo Dias and Leo Liberti. Orbital independence in symmetric mathematical programs. In Combinatorial Optimization and Applications, 2015. [78] Christopher Hojny and Marc E. Pfetsch. Polytopes associated with symmetry handling. Mathematical Programming, 175(1-2):197 240, 2019. [79] Francesco Ceccon, Jordan Jalving, Joshua Haddad, Alexander Thebelt, Calvin Tsay, Carl D. Laird, and Ruth Misener. OMLT: Optimization & machine learning toolkit. Journal of Machine Learning Research, 23(1):15829 15836, 2022. [80] Alan Frisch, Brahim Hnich, Zeynep Kiziltan, Ian Miguel, and Toby Walsh. Global constraints for lexicographic orderings. In Principles and Practice of Constraint Programming-CP 2002, pages 93 108, 2002. [81] Joao Marques-Silva, Josep Argelich, Ana Graça, and Inês Lynce. Boolean lexicographic optimization: algorithms & applications. Annals of Mathematics and Artificial Intelligence, 62:317 343, 2011. [82] Hani A. Elgabou and Alan M. Frisch. Encoding the lexicographic ordering constraint in sat modulo theories. In Thirteenth International Workshop on Constraint Modelling and Reformulation, 2014. [83] Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification, 2017. [84] Tobias Achterberg. SCIP: solving constraint integer programs. Mathematical Programming Computation, 1:1 41, 2009. [85] Tommi Junttila and Petteri Kaski. Engineering an efficient canonical labeling tool for large and sparse graphs. In ALENEX, 2007. [86] Tommi Junttila and Petteri Kaski. Conflict propagation and component recursion for canonical labeling. In TAPAS, 2011. [87] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www. gurobi.com. [88] Volker Kaibel, Matthias Peinhardt, and Marc E. Pfetsch. Orbitopal fixing. Discrete Optimization, 8(4):595 610, 2011. [89] François Margot. Pruning by isomorphism in branch-and-cut. 
Mathematical Programming, 94:71 90, 2002. [90] Hanif D. Sherali and J. Cole Smith. Improving discrete model representations via symmetry considerations. Management Science, 47(10):1396 1407, 2001. [91] Ahmed Ghoniem and Hanif D. Sherali. Defeating symmetry in combinatorial optimization via objective perturbations and hierarchical constraints. IIE Transactions, 43(8):575 588, 2011. [92] Volker Kaibel and Marc Pfetsch. Packing and partitioning orbitopes. Mathematical Programming, 114(1):1 36, 2008. [93] Yuri Faenza and Volker Kaibel. Extended formulations for packing and partitioning orbitopes. Mathematics of Operations Research, 34(3):686 697, 2009. [94] James Ostrowski, Miguel F. Anjos, and Anthony Vannelli. Symmetry in scheduling problems. Citeseer, 2010. [95] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with Py Torch Geometric. In ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds, 2019. [96] David Bergman, Teng Huang, Philip Brooks, Andrea Lodi, and Arvind U. Raghunathan. Janos: An integrated predictive and prescriptive modeling framework. INFORMS Journal on Computing, 34(2):807 816, 2022. [97] Laurens Lueg, Bjarne Grimstad, Alexander Mitsos, and Artur M. Schweidtmann. relu MIP: Open source tool for MILP optimization of Re LU neural networks, 2021. URL https: //github.com/Chem Eng AI/Re LU_ANN_MILP. [98] O. Odele and Sandro Macchietto. Computer aided molecular design: A novel method for optimal solvent selection. Fluid Phase Equilibria, 82:47 54, 1993. [99] Nachiket Churi and Luke E.K. Achenie. Novel mathematical programming model for computer aided molecular design. Industrial & Engineering Chemistry Research, 35(10):3788 3794, 1996. [100] Kyle V. Camarda and Costas D. Maranas. Optimization in polymer design using connectivity indices. Industrial & Engineering Chemistry Research, 38(5):1884 1892, 1999. [101] Manish Sinha, Luke E.K. Achenie, and Gennadi M. Ostrovsky. Environmentally benign solvent design by global optimization. Computers & Chemical Engineering, 23(10):1381 1394, 1999. [102] Nikolaos V. Sahinidis, Mohit Tawarmalani, and Minrui Yu. Design of alternative refrigerants via global optimization. AICh E Journal, 49(7):1761 1775, 2003. [103] Lei Zhang, Stefano Cignitti, and Rafiqul Gani. Generic mathematical programming formulation and solution for computer-aided molecular design. Computers & Chemical Engineering, 78: 79 84, 2015. [104] Lorenz C. Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13. Journal of the American Chemical Society, 131(25):8732 8733, 2009. [105] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301, 2012. [106] Lars Ruddigkeit, Ruud Van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of Chemical Information and Modeling, 52(11):2864 2875, 2012. [107] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1 7, 2014. A Theoretical guarantee: supplementary A.1 Proofs of properties and lemma Property 1. For any s = 1, 2, . . . , N 1, Is(v) < s, v V s 1 s, v V s 2 . Proof. At the s-th iteration, nodes in V s 1 have been indexed by 0, 1, . . . , s 1. 
Therefore, if v V s 1 , then Is(v) = I(v) < s. If v V s 2 , since Is(v) is the sum of s and ranks(v) (which is non-negative), then we have Is(v) s. Property 2. For any s1, s2 = 1, 2, . . . , N 1, s1 s2 N s1(v) = N s2(v) [s1], v V s2 2 Proof. First, N s1( ) is well-defined on V s2 2 since V s2 2 V s1 2 . By definitions of N s( ) and V s 1 , for any v V s2 2 we have: N s1(v) ={I(u) | u N(v) V s1 1 } ={I(u) | u N(v) V s1 1 V s2 1 } ={I(u) | u (N(v) V s2 1 ) V s1 1 } ={I(u) | u N(v) V s2 1 , I(u) < s1} ={I(u) | u N(v) V s2 1 } [s1] =N s2(v) [s1] where the second equation uses the fact that V s1 1 V s2 1 , and the fourth equation holds since u V s1 1 I(u) < s1 (using Property 1). Property 3. Given any multisets A, B with no more than L integer elements in [0, M 1], we have: LO(A) LO(B) LO(A [m]) LO(B [m]), m = 1, 2, . . . , M Proof. By the definition of lexicographical order for multisets, denote the corresponding sequence to A, B, A [m], B [m] by a, b, am, bm S(M, L), respectively. Then it is equivalent to show that: LO(a) LO(b) LO(am) LO(bm), m = 1, 2, . . . , M Let a = (a1, a2, . . . , a L) and b = (b1, b2, . . . , b L). If LO(a) = LO(b), then a = b and am = bm. Thus LO(am) = LO(bm). Otherwise, if LO(a) < LO(b), then there exists 1 l L such that: ai = bi, 1 i < l al < bl If m al, then am = bm, which means LO(am) = LO(bm). Otherwise, if m > al, then am and bm share the same first l 1 elements. But the l-th element of am is al, while the l-th element of bm is either bl or M. In both cases, we have LO(am) < LO(bm). Lemma 1. For any two nodes u and v, I(u) < I(v) Is(u) Is(v), s = 1, 2, . . . , N 1 Proof. Case 1: If I(u) < I(v) < s, then: Is(u) = I(u) < I(v) = Is(v) Case 2: If I(u) < s I(v), then Is(u) = I(u) and: Is(v) s > I(u) = Is(u) where Property 1 is used. Case 3: If s I(u) < I(v), at I(u)-th iteration, u is chosen to be indexed I(u). Thus: LO(N I(u) t (u)) LO(N I(u) t (v)) According to the definition of N I(u)( ) and N I(u) t ( ), we have: N I(u)(u) = N I(u) t (u) [I(u)], N I(u)(v) = N I(u) t (v) [I(u)] Using Property 3 (with A = N I(u) t (u), B = N I(u) t (v), m = I(u)) yields: LO(N I(u)(u)) LO(N I(u)(v)) Apply Property 2 (with s1 = s, s2 = I(u)) for u and v, we have: N s(u) = N I(u)(u) [s], N s(v) = N I(u)(v) [s] Using Property 3 again (with A = N I(u)(u), B = N I(u)(v), m = s) gives: LO(N s(u)) LO(N s(v)) Recall the definition of Is( ), we have: Is(u) Is(v) Lemma 2. For any undirected, connected graph G = (V, E), if one indexing of G sastifies (S3), then it satisfies (S1). Proof. Node 0 itself is a connected graph. Assume that the subgraph induced by nodes {0, 1, . . . , v} is connected, it suffices to show that the subgraph induced by nodes {0, 1, . . . , v + 1} is connected. Equivalently, we need to prove that there exists u < v + 1 such that Au,v+1 = 1. Assume that Au,v+1 = 0, u < v + 1. Since G is connected, there exists v > v + 1 such that: u < v + 1, s.t. Au,v = 1 Then we know: N(v + 1) [v + 1] = , N(v ) [v + 1] = Recall the definition of LO( ), we obtain that: LO(N(v + 1) [v + 1]) > LO(N(v ) [v + 1]) (>) Since the indexing satisfies (S3), then we have: LO(N(w)\{w + 1}) LO(N(w + 1)\{w}), v < w < v Applying Property 3: LO((N(w)\{w + 1}) [v + 1]) LO((N(w + 1)\{w}) [v + 1]), v < w < v Note that w > v. Therefore, LO(N(w) [v + 1]) LO(N(w + 1) [v + 1]), v < w < v Choosing w = v + 1 gives: LO(N(v + 1) [v + 1]) LO(N(v ) [v + 1]) ( ) The contradiction between (>) and ( ) completes the proof. 
A.2 Example for applying Algorithm 1 Given a graph with N = 6 nodes as shown in Figure 4, where v0 is already indexed 0. We provide details of Algorithm 1 by indexing the rest 5 nodes step by step for further illustration. Before indexing, we first calculate the neighbor sets for each node: N(v0) = {v1, v2, v3, v4, v5}, N(v1) = {v0, v2, v3, v4}, N(v2) = {v0, v1, v5} N(v3) = {v0, v1, v4}, N(v4) = {v0, v1, v3}, N(v5) = {v0, v2} s = 1 : V 1 1 = {v0}, V 1 2 = {v1, v2, v3, v4, v5} v1 v2 v3 v4 v5 Figure 4: V 1 1 = {v0} Obtain indexed neighbors to each unindexed node: N 1(v1) = {0}, N 1(v2) = {0}, N 1(v3) = {0}, N 1(v4) = {0}, N 1(v5) = {0} Rank all unindexed nodes: rank(v1) = rank(v2) = rank(v3) = rank(v4) = rank(v5) = 0 Assign a temporary index to each node based on previous indexes (for indexed nodes) and ranks (for unindexed nodes): I1(v0) = 0, I1(v1) = 1, I1(v2) = 1, I1(v3) = 1, I1(v4) = 1, I1(v5) = 1 After having indexes for all nodes, define temporary neighbor sets: N 1 t (v1) = {0, 1, 1, 1}, N 1 t (v2) = {0, 1, 1}, N 1 t (v3) = {0, 1, 1}, N 1 t (v4) = {0, 1, 1}, N 1 t (v5) = {0, 1} Based on the temporary neighbor sets, v1 is chosen to be indexed 1 (i.e., I(v1) = 1). s = 2 : V 2 1 = {v0, v1}, V 2 2 = {v2, v3, v4, v5} v2 v3 v4 v5 Figure 5: V 2 1 = {v0, v1} Obtain indexed neighbors to each unindexed node: N 2(v2) = {0, 1}, N 2(v3) = {0, 1}, N 2(v4) = {0, 1}, N 2(v5) = {0} Rank all unindexed nodes: rank(v2) = rank(v3) = rank(v4) = 0, rank(v5) = 1 Assign a temporary index to each node based on previous indexes (for indexed nodes) and ranks (for unindexed nodes): I2(v2) = 2, I2(v3) = 2, I2(v4) = 2, I2(v5) = 3 After having indexes for all nodes, define temporary neighbor sets: N 2 t (v2) = {0, 1, 3}, N 2 t (v3) = {0, 1, 2}, N 2 t (v4) = {0, 1, 2}, N 2 t (v5) = {0, 2} Based on the temporary neighbor sets, both v3 and v4 can be chosen to be indexed 2. Without loss of generality, let I(v3) = 2. Note: This step explains why temporary indexes and neighbor sets should be added into Algorithm 1. Otherwise, v2 is also valid to be index 2, following which v3, v4, v5 will be indexed 3, 4, 5. Then the neighbor set for I(v2) = 2 is {0, 1, 5} while the neighbor set for I(v3) = 3 is {0, 1, 4}, which violates constraints (S3). s = 3 : V 3 1 = {v0, v1, v3}, V 3 2 = {v2, v4, v5} Figure 6: V 3 1 = {v0, v1, v3} Obtain indexed neighbors to each unindexed node: N 3(v2) = {0, 1}, N 3(v4) = {0, 1, 2}, N 3(v5) = {0} Rank all unindexed nodes: rank(v2) = 1, rank(v4) = 0, rank(v5) = 2 Assign a temporary index to each node based on previous indexes (for indexed nodes) and ranks (for unindexed nodes): I3(v2) = 4, I3(v4) = 3, I3(v5) = 5 After having indexes for all nodes, define temporary neighbor sets: N 3 t (v2) = {0, 1, 5}, N 3 t (v4) = {0, 1, 2}, N 3 t (v5) = {0, 4} Based on the temporary neighbor sets, v4 is chosen to be indexed 3 (i.e., I(v4) = 3). s = 4 : V 4 1 = {v0, v1, v3, v4}, V 4 2 = {v2, v5} Figure 7: V 4 1 = {v0, v1, v3, v4} Obtain indexed neighbors to each unindexed node: N 4(v2) = {0, 1}, N 4(v5) = {0} Rank all unindexed nodes: rank(v2) = 0, rank(v5) = 1 Assign a temporary index to each node based on previous indexes (for indexed nodes) and ranks (for unindexed nodes): I4(v2) = 4, I4(v5) = 5 After having indexes for all nodes, define temporary neighbor sets: N 4 t (v2) = {0, 1, 5}, N 4 t (v5) = {0, 4} Based on the temporary neighbor sets, v2 is chosen to be indexed 4 (i.e., I(v2) = 4). 
s = 5 : V 5 1 = {v0, v1, v2, v3, v4}, V 4 2 = {v5} Figure 8: V 5 1 = {v0, v1, v2, v3, v4} Since there is only node v5 unindexed, without running the algorithm we still know that v5 is chosen to be indexed 5 (i.e., I(v5) = 5). Figure 9: V 6 1 = {v0, v1, v2, v3, v4, v5} B MIP formulation for CAMD: details In this part, we provide details of the MIP formulation for CAMD, including parameters, variables and constraints. Especially, we use atom set {C, N, O, S} as an example to help illuminate. The number of nodes N exclude hydrogen atoms since they are implicitly represented by N h features. B.1 Parameters Table 2: List of parameters Parameter Description Value N number of nodes F number of features 16 N t number of atom types 4 N n number of neighbors 5 N h number of hydrogen 5 It index for N t {0, 1, 2, 3} In index for N n {4, 5, 6, 7, 8} Ih index for N h {9, 10, 11, 12, 13} Idb index for double bond 14 Itb index for triple bond 15 Atom atom types {C, N, O, S} Cov covalence of atom {4, 3, 2, 2} B.2 Variables Table 3: List of variables for atom features Xv,f Type Xv,f #Neighbors Xv,f #Hydrogen 0 C 4 0 9 0 1 N 5 1 10 1 2 O 6 2 11 2 3 S 7 3 12 3 8 4 13 4 For the existence and types of bonds, we add the following extra features and variables: Xv,Idb, v [N]: if atom v is included in at least one double bond. Xv,Itb, v [N]: if atom v is included in at least one triple bond. Au,v, u = v [N]: if there is one bond between atom u and v. Av,v, v [N]: if node v exists. DBu,v, u = v [N]: if there is one double bond between atom u and v. TBu,v, u = v [N]: if there is one triple bond between atom u and v. Remark: Variables Av,v admits us to design molecules with at most N atoms, which brings flexibility in real-world applications. In our experiments, however, we fix them to consider possible structures with exact N atoms. Therefore, constraints (C2) are replaced by: Av,v = 1, v [N] Without this replacement, we can add an extra term on (C26) to exclude non-existing nodes and only index existing node: X f [F ] 2F f 1 X0,f X f [F ] 2F f 1 Xv,f + 2F (1 Av,v), v [N]\{0} B.3 Constraints There are at least two atoms and one bond between them: A0,0 = A1,1 = A0,1 = 1 (C1) Atoms with smaller indexes exist: Av,v Av+1,v+1, v [N 1] (C2) The adjacency matrix A is symmetric: Au,v = Av,u, u, v [N], u < v (C3) Atom v is linked to other atoms if and only if it exists: (N 1) Av,v X u =v Au,v, v [N] (C4) Note: N 1 here is a big-M constant. If atom v exists, then it has to be linked with at least one atom with smaller index: u