# KerGM: Kernelized Graph Matching

Zhen Zhang¹, Yijian Xiang¹, Lingfei Wu², Bing Xue¹, Arye Nehorai¹
¹Washington University in St. Louis, ²IBM Research
¹{zhen.zhang, yijian.xiang, xuebing, nehorai}@wustl.edu, ²lwu@email.wm.edu

Abstract

Graph matching plays a central role in such fields as computer vision, pattern recognition, and bioinformatics. Graph matching problems can be cast as two types of quadratic assignment problems (QAPs): Koopmans-Beckmann's QAP or Lawler's QAP. In our paper, we provide a unifying view of these two problems by introducing new rules for array operations in Hilbert spaces. Consequently, Lawler's QAP can be considered as the Koopmans-Beckmann's alignment between two arrays in reproducing kernel Hilbert spaces (RKHS), making it possible to solve the problem efficiently without computing a huge affinity matrix. Furthermore, we develop the entropy-regularized Frank-Wolfe (EnFW) algorithm for optimizing QAPs, which has the same convergence rate as the original FW algorithm while dramatically reducing the computational burden of each outer iteration. We conduct extensive experiments to evaluate our approach, and show that our algorithm significantly outperforms the state of the art in both matching accuracy and scalability.

1 Introduction

Graph matching (GM), which aims at finding the optimal correspondence between the nodes of two given graphs, is a longstanding problem due to its nonconvex objective function and binary constraints. It arises in many applications, ranging from recognizing actions [3, 13] to identifying functional orthologs of proteins [11, 41]. Typically, GM problems are formulated as one of two kinds of quadratic assignment problems (QAPs): Koopmans-Beckmann's QAP [18] or Lawler's QAP [22]. Koopmans-Beckmann's QAP is the structural alignment between two weighted adjacency matrices, which, as a result, can be written as the standard Frobenius inner product between two $n \times n$ matrices, where $n$ denotes the number of nodes. However, Koopmans-Beckmann's QAP cannot incorporate complex edge attribute information, which is usually of great importance in characterizing the relation between nodes. Lawler's QAP can tackle this issue, because it attempts to maximize an overall similarity that encodes the attribute information well. However, the key concern with Lawler's QAP is that it needs to estimate the $n^2 \times n^2$ pairwise affinity matrix, limiting its application to very small graphs.

In our work, we derive an equivalent formulation of Lawler's QAP, based on the very mild assumption that edge affinities are characterized by kernels [15, 34]. After introducing new rules for array operations in Hilbert spaces, named $\mathcal{H}$-operations, we rewrite Lawler's QAP as the Koopmans-Beckmann's alignment between two arrays in a reproducing kernel Hilbert space (RKHS), which allows us to solve it without computing the huge affinity matrix. Taking advantage of the $\mathcal{H}$-operations, we develop a path-following strategy for mitigating the local-maxima issue of QAPs. In addition to the kernelized graph matching (KerGM) formulation, we propose a numerical optimization algorithm, the entropy-regularized Frank-Wolfe (EnFW) algorithm, for solving large-scale QAPs. EnFW has the same convergence rate as the original Frank-Wolfe algorithm, with far less computational burden per iteration. Extensive experimental results show that our KerGM, together with the EnFW algorithm, achieves superior performance in both matching accuracy and scalability.
Related Work: Over the past forty years, a myriad of graph matching algorithms have been proposed [8], most of which focus on solving QAPs. Early work [2, 14, 21] approximated the quadratic term with a linear one, which can then be handled by standard linear programming solvers. In [36], several convex relaxation methods are proposed and compared. It is known that convex relaxations achieve global convergence, but they usually perform poorly because the final projection step separates the solution from the original QAP. Concave relaxations [29, 28] avoid this problem, since their outputs are already permutation matrices. However, concave programming [4] is NP-hard, which limits its applications. In [45], a seminal work termed the "path-following algorithm" was proposed, which leverages both of the above relaxations by iteratively solving a series of optimization problems that gradually change from convex to concave. In [27, 38, 39, 44], the path-following strategy was further extended and improved. However, all the above algorithms, when applied to Lawler's QAP, need to compute the $n^2 \times n^2$ affinity matrix. To tackle this challenge, the authors of [48] elegantly factorized the affinity matrix into the Kronecker product of smaller matrices. However, that approach still cannot be applied to large dense graphs, since it scales cubically with the number of edges. Beyond solving the QAP, there are interesting works on graph matching from other perspectives, such as probabilistic matching [46], hypergraph matching [24], and multigraph matching [42]. We refer to [43] for a survey of recent advances.

Organization: In Section 2, we introduce the background, including Koopmans-Beckmann's and Lawler's QAPs, and kernel functions with their reproducing kernel Hilbert spaces. In Section 3, we present the proposed rules for array operations in Hilbert spaces. Section 4 and Section 5 form the core of our work, where we develop the kernelized graph matching formulation, together with the entropy-regularized Frank-Wolfe optimization algorithm. In Section 6, we report the experimental results. In the supplementary material, we provide proofs of all mathematical results in the paper, along with further technical discussions and more experimental results.

2 Background

2.1 Quadratic Assignment Problems for Graph Matching

Let $G = \{A, V, P, E, Q\}$ be an undirected, attributed graph of $n$ nodes and $m$ edges, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $V = \{v_i\}_{i=1}^n$ and $P = [\vec p_1, \vec p_2, \ldots, \vec p_n] \in \mathbb{R}^{d_N \times n}$ are the node set and node attributes matrix, respectively, and $E = \{e_{ij} \mid v_i \text{ and } v_j \text{ are connected}\}$ and $Q = [\vec q_{ij} \mid e_{ij} \in E] \in \mathbb{R}^{d_E \times m}$ are the edge set and edge attributes matrix, respectively. Given two graphs $G_1 = \{A_1, V_1, P_1, E_1, Q_1\}$ and $G_2 = \{A_2, V_2, P_2, E_2, Q_2\}$ of $n$ nodes¹, the GM problem aims to find a correspondence between the nodes in $V_1$ and $V_2$ which is optimal in some sense.

For Koopmans-Beckmann's QAP [18], optimality refers to maximizing the Frobenius inner product between the two adjacency matrices after permutation, i.e.,

$$\max \ \langle A_1 X, X A_2 \rangle_F \quad \text{s.t. } X \in \mathcal{P} = \{X \in \{0,1\}^{n \times n} \mid X\mathbf{1} = \mathbf{1},\ X^T\mathbf{1} = \mathbf{1}\}, \tag{1}$$

where $\langle A, B \rangle_F = \mathrm{tr}(A^T B)$ is the Frobenius inner product. The issue with (1) is that it ignores complex edge attributes, which are usually of particular importance in characterizing graphs.

¹We assume $G_1$ and $G_2$ have the same number of nodes; if not, we add dummy nodes.
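To make the Koopmans-Beckmann objective concrete, here is a minimal NumPy sketch (our experiments use Matlab; this is an illustrative translation, with all names our own) that evaluates the objective of (1) and checks that an isomorphic copy attains the maximal score:

```python
import numpy as np

def kb_objective(A1, A2, X):
    """Koopmans-Beckmann objective <A1 X, X A2>_F from (1)."""
    return np.sum((A1 @ X) * (X @ A2))      # equals tr((A1 X)^T (X A2))

rng = np.random.default_rng(0)
n = 6
A1 = rng.random((n, n)); A1 = (A1 + A1.T) / 2   # symmetric weighted adjacency
X = np.eye(n)[:, rng.permutation(n)]            # a permutation matrix in P
A2 = X.T @ A1 @ X                               # isomorphic copy: A1 X = X A2
assert np.isclose(kb_objective(A1, A2, X), np.sum(A1 * A1))  # ||A1||_F^2 attained
```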
For Lawler's QAP [22], optimality refers to maximizing the similarity between the node attribute sets and between the edge attribute sets, i.e.,

$$\max \ \sum_{v^1_i \in V_1,\, v^2_a \in V_2} k_N(\vec p^{\,1}_i, \vec p^{\,2}_a)\, X_{ia} \;+\; \sum_{e^1_{ij} \in E_1,\, e^2_{ab} \in E_2} k_E(\vec q^{\,1}_{ij}, \vec q^{\,2}_{ab})\, X_{ia} X_{jb} \quad \text{s.t. } X \in \mathcal{P}, \tag{2}$$

where $k_N$ and $k_E$ are the node and edge similarity measures, respectively. Furthermore, (2) can be rewritten in the compact form

$$\max \ \langle K^N, X \rangle_F + \mathrm{vec}(X)^T K\, \mathrm{vec}(X) \quad \text{s.t. } X \in \mathcal{P}, \tag{3}$$

where $K^N \in \mathbb{R}^{n \times n}$ is the node affinity matrix, and $K$ is an $n^2 \times n^2$ matrix defined such that $K_{ia,jb} = k_E(\vec q^{\,1}_{ij}, \vec q^{\,2}_{ab})$ if $i \ne j$, $a \ne b$, $e^1_{ij} \in E_1$, and $e^2_{ab} \in E_2$, and $K_{ia,jb} = 0$ otherwise. It is well known that Koopmans-Beckmann's QAP is a special case of Lawler's QAP, obtained by setting $K = A_2 \otimes A_1$ and $K^N = 0_{n \times n}$. The issue with (3) is that the size of $K$ scales quadruply with respect to $n$, which precludes its application to large graphs. In our work, we show that Lawler's QAP can be written in Koopmans-Beckmann's form, which avoids computing $K$.

2.2 Kernels and reproducing kernel Hilbert spaces

Given any set $\mathcal{X}$, a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function for quantitatively measuring the affinity between objects in $\mathcal{X}$. It satisfies that there exist a Hilbert space, $\mathcal{H}$, and an (implicit) feature map $\psi : \mathcal{X} \to \mathcal{H}$, such that $k(q_1, q_2) = \langle \psi(q_1), \psi(q_2) \rangle_{\mathcal{H}}$, $\forall q_1, q_2 \in \mathcal{X}$. The space $\mathcal{H}$ is the reproducing kernel Hilbert space associated with $k$. Note that if $\mathcal{X}$ is a Euclidean space, i.e., $\mathcal{X} = \mathbb{R}^d$, many similarity measures are kernels, such as $\exp(-\|q_1 - q_2\|_2^2)$, $\exp(-\|q_1 - q_2\|_2)$, and $\langle q_1, q_2 \rangle$, $\forall q_1, q_2 \in \mathbb{R}^d$.

3 $\mathcal{H}$-operations for arrays in Hilbert spaces

Let $\mathcal{H}$ be any Hilbert space, equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ taking values in $\mathbb{R}$. Let $\mathcal{H}^{n \times n}$ be the set of all $n \times n$ arrays in $\mathcal{H}$, and let $\Psi, \Xi \in \mathcal{H}^{n \times n}$, i.e., $\Psi_{ij}, \Xi_{ij} \in \mathcal{H}$ for $i, j = 1, 2, \ldots, n$. Analogous to matrix operations in Euclidean spaces, we define the following addition, transpose, and multiplication rules ($\mathcal{H}$-operations): $\forall X \in \mathbb{R}^{n \times n}$,

1. $\Psi + \Xi,\ \Psi^T \in \mathcal{H}^{n \times n}$, where $[\Psi + \Xi]_{ij} \triangleq \Psi_{ij} + \Xi_{ij} \in \mathcal{H}$ and $[\Psi^T]_{ij} \triangleq \Psi_{ji} \in \mathcal{H}$.
2. $\Psi * \Xi \in \mathbb{R}^{n \times n}$, where $[\Psi * \Xi]_{ij} \triangleq \sum_{k=1}^n \langle \Psi_{ik}, \Xi_{kj} \rangle_{\mathcal{H}} \in \mathbb{R}$.
3. $\Psi * X,\ X * \Psi \in \mathcal{H}^{n \times n}$, where $[\Psi * X]_{ij} \triangleq \sum_{k=1}^n \Psi_{ik} X_{kj} = \sum_{k=1}^n X_{kj} \Psi_{ik} \in \mathcal{H}$ and $[X * \Psi]_{ij} \triangleq \sum_{k=1}^n X_{ik} \Psi_{kj} \in \mathcal{H}$.

Note that if $\mathcal{H} = \mathbb{R}$, all the above degenerate to the usual operations for matrices in Euclidean spaces. In Fig. 1, we visualize the operation $\Psi * X$, where we let $\mathcal{H} = \mathbb{R}^d$, let $\Psi$ be a $3 \times 3$ array in $\mathbb{R}^d$, and let $X$ be a $3 \times 3$ permutation matrix. It is easy to see that $\Psi * X$ is just $\Psi$ after column permutation.

[Figure 1: Visualization of the operation $\Psi * X$.]

As stated in the following corollary, the multiplication satisfies the combination law.

Corollary 1. $\forall X, Y \in \mathbb{R}^{n \times n}$, $\Psi * X * Y = \Psi * (XY)$, and $Y * (X * \Psi) = (YX) * \Psi$.

Based on the $\mathcal{H}$-operations, we can construct the Frobenius inner product on $\mathcal{H}^{n \times n}$.

Proposition 1. Define the function $\langle \cdot, \cdot \rangle_{F_{\mathcal{H}}} : \mathcal{H}^{n \times n} \times \mathcal{H}^{n \times n} \to \mathbb{R}$ such that $\langle \Psi, \Xi \rangle_{F_{\mathcal{H}}} \triangleq \mathrm{tr}(\Psi^T * \Xi) = \sum_{i,j=1}^n \langle \Psi_{ij}, \Xi_{ij} \rangle_{\mathcal{H}}$, $\forall \Psi, \Xi \in \mathcal{H}^{n \times n}$. Then $\langle \cdot, \cdot \rangle_{F_{\mathcal{H}}}$ is an inner product on $\mathcal{H}^{n \times n}$.

As an immediate result, the function $\|\cdot\|_{F_{\mathcal{H}}} : \mathcal{H}^{n \times n} \to \mathbb{R}$, defined such that $\|\Psi\|_{F_{\mathcal{H}}} = \sqrt{\langle \Psi, \Psi \rangle_{F_{\mathcal{H}}}}$, is the Frobenius norm on $\mathcal{H}^{n \times n}$. Next, we introduce two properties of $\langle \cdot, \cdot \rangle_{F_{\mathcal{H}}}$, which play important roles in developing the convex-concave relaxation of Lawler's graph matching problem.

Corollary 2. $\langle \Psi * X, \Xi \rangle_{F_{\mathcal{H}}} = \langle \Psi, \Xi * X^T \rangle_{F_{\mathcal{H}}}$ and $\langle X * \Psi, \Xi \rangle_{F_{\mathcal{H}}} = \langle \Psi, X^T * \Xi \rangle_{F_{\mathcal{H}}}$.
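As a sanity check on these rules, the following NumPy sketch (ours, not from the paper) represents an array over $\mathcal{H} = \mathbb{R}^d$ as an $(n, n, d)$ tensor and verifies Proposition 1, Corollary 1, Corollary 2, and the column-permutation behavior of $\Psi * X$ shown in Fig. 1:

```python
import numpy as np

# An n x n array over H = R^d is stored as a tensor of shape (n, n, d).
def h_mul(Psi, Xi):      # rule 2: [Psi * Xi]_ij = sum_k <Psi_ik, Xi_kj>_H
    return np.einsum('ikd,kjd->ij', Psi, Xi)

def h_mul_r(Psi, X):     # rule 3: [Psi * X]_ij = sum_k X_kj Psi_ik
    return np.einsum('ikd,kj->ijd', Psi, X)

def h_inner(Psi, Xi):    # Proposition 1: <Psi, Xi>_{F_H} = sum_ij <Psi_ij, Xi_ij>_H
    return np.einsum('ijd,ijd->', Psi, Xi)

rng = np.random.default_rng(0)
n, d = 4, 3
Psi, Xi = rng.random((n, n, d)), rng.random((n, n, d))
perm = rng.permutation(n)
X = np.eye(n)[:, perm]                        # permutation matrix
Y = np.eye(n)[:, rng.permutation(n)]
PsiT = Psi.transpose(1, 0, 2)                 # rule 1: [Psi^T]_ij = Psi_ji

assert np.allclose(h_mul_r(Psi, X), Psi[:, perm, :])           # column permutation (Fig. 1)
assert np.allclose(h_mul_r(h_mul_r(Psi, X), Y),
                   h_mul_r(Psi, X @ Y))                        # Corollary 1
assert np.isclose(h_inner(Psi, Xi), np.trace(h_mul(PsiT, Xi))) # Proposition 1
assert np.isclose(h_inner(h_mul_r(Psi, X), Xi),
                  h_inner(Psi, h_mul_r(Xi, X.T)))              # Corollary 2
```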
4 Kernelized graph matching

Before deriving kernelized graph matching, we first state an assumption.

Assumption 1. We assume that the edge affinity function $k_E : \mathbb{R}^{d_E} \times \mathbb{R}^{d_E} \to \mathbb{R}$ is a kernel. That is, there exist both an RKHS, $\mathcal{H}$, and an (implicit) feature map, $\psi : \mathbb{R}^{d_E} \to \mathcal{H}$, such that $k_E(\vec q_1, \vec q_2) = \langle \psi(\vec q_1), \psi(\vec q_2) \rangle_{\mathcal{H}}$, $\forall \vec q_1, \vec q_2 \in \mathbb{R}^{d_E}$.

Note that Assumption 1 is rather mild, since kernel functions are powerful and popular tools for quantifying the similarity between attributes [47, 19]. For any graph $G = \{A, V, P, E, Q\}$, we can construct an array $\Psi \in \mathcal{H}^{n \times n}$:

$$\Psi_{ij} = \begin{cases} \psi(\vec q_{ij}) \in \mathcal{H}, & \text{if } (v_i, v_j) \in E, \\ 0_{\mathcal{H}} \in \mathcal{H}, & \text{otherwise}, \end{cases} \quad \text{where } 0_{\mathcal{H}} \text{ is the zero vector in } \mathcal{H}. \tag{4}$$

Given two graphs $G_1$ and $G_2$, let $\Psi^{(1)}$ and $\Psi^{(2)}$ be the corresponding Hilbert arrays. Then the edge similarity term in Lawler's QAP (see (2)) can be written as

$$\sum_{e^1_{ij} \in E_1,\, e^2_{ab} \in E_2} k_E(\vec q^{\,1}_{ij}, \vec q^{\,2}_{ab})\, X_{ia} X_{jb} = \sum_{i,b} \Big\langle \sum_{j=1}^n \Psi^{(1)}_{ij} X_{jb},\ \sum_{a=1}^n X_{ia} \Psi^{(2)}_{ab} \Big\rangle_{\mathcal{H}} = \big\langle \Psi^{(1)} * X,\ X * \Psi^{(2)} \big\rangle_{F_{\mathcal{H}}},$$

which shares a similar form with (1) and can be considered as the Koopmans-Beckmann's alignment between the Hilbert arrays $\Psi^{(1)}$ and $\Psi^{(2)}$: the right-hand side is just the Frobenius inner product between two Hilbert arrays after permutation. Adding the node affinity term, we write Lawler's QAP as²

$$\min \ J_{gm}(X) = -\langle K^N, X \rangle_F - \langle \Psi^{(1)} * X, X * \Psi^{(2)} \rangle_{F_{\mathcal{H}}} \quad \text{s.t. } X \in \mathcal{P}. \tag{5}$$

²For convenience in developing the path-following strategy, we write it in minimization form.

4.1 Convex and concave relaxations

The form (5) suggests an intuitive way to develop convex and concave relaxations. We first introduce the auxiliary function $J_{aux}(X) = \frac{1}{2}\langle \Psi^{(1)} * X, \Psi^{(1)} * X \rangle_{F_{\mathcal{H}}} + \frac{1}{2}\langle X * \Psi^{(2)}, X * \Psi^{(2)} \rangle_{F_{\mathcal{H}}}$. Applying Corollaries 1 and 2, for any $X \in \mathcal{P}$, which satisfies $XX^T = X^T X = I$, we have

$$J_{aux}(X) = \tfrac{1}{2}\langle \Psi^{(1)}, \Psi^{(1)} * (XX^T) \rangle_{F_{\mathcal{H}}} + \tfrac{1}{2}\langle \Psi^{(2)}, (X^T X) * \Psi^{(2)} \rangle_{F_{\mathcal{H}}} = \tfrac{1}{2}\|\Psi^{(1)}\|^2_{F_{\mathcal{H}}} + \tfrac{1}{2}\|\Psi^{(2)}\|^2_{F_{\mathcal{H}}},$$

which is always a constant. Introducing $J_{aux}(X)$ into (5), we obtain convex and concave relaxations:

$$J_{vex}(X) = J_{gm}(X) + J_{aux}(X) = -\langle K^N, X \rangle_F + \tfrac{1}{2}\big\|\Psi^{(1)} * X - X * \Psi^{(2)}\big\|^2_{F_{\mathcal{H}}}, \tag{6}$$

$$J_{cav}(X) = J_{gm}(X) - J_{aux}(X) = -\langle K^N, X \rangle_F - \tfrac{1}{2}\big\|\Psi^{(1)} * X + X * \Psi^{(2)}\big\|^2_{F_{\mathcal{H}}}. \tag{7}$$

The convexity of $J_{vex}(X)$ is easy to establish, because the composition of the squared norm $\|\cdot\|^2_{F_{\mathcal{H}}}$ with the linear transformation $\Psi^{(1)} * X - X * \Psi^{(2)}$ is convex. The concavity of $J_{cav}(X)$ follows by a similar argument. It is interesting to see that the term $\frac{1}{2}\|\Psi^{(1)} * X - X * \Psi^{(2)}\|_{F_{\mathcal{H}}}$ in (6) is just the distance between Hilbert arrays. If we set the map $\psi(x) = x$, then the convex relaxation of (1) is recovered (see [1]).

Path-following strategy: Leveraging these two relaxations [45], we minimize $J_{gm}$ by successively optimizing a series of subproblems parameterized by $\alpha \in [0, 1]$:

$$\min \ J_\alpha(X) = (1-\alpha) J_{vex}(X) + \alpha J_{cav}(X) \quad \text{s.t. } X \in \mathcal{D} = \{X \in \mathbb{R}^{n \times n}_{+} \mid X\mathbf{1} = \mathbf{1},\ X^T\mathbf{1} = \mathbf{1}\}, \tag{8}$$

where $\mathcal{D}$ is the doubly stochastic relaxation of the permutation matrix set $\mathcal{P}$. We start at $\alpha = 0$ and find the unique minimum. Then we gradually increase $\alpha$ until $\alpha = 1$; that is, we optimize $J_{\alpha + \Delta\alpha}$ with the local minimizer of $J_\alpha$ as the initial point (a minimal code sketch of this outer loop is given at the end of this subsection). Finally, we output the local minimizer of $J_{\alpha=1}$. We refer to [45], [48], and [39] for detailed descriptions and improvements.

Gradient computation: If we use first-order optimization methods, we need only the gradients:

$$\nabla J_\alpha(X) = (1 - 2\alpha)\big[(\Psi^{(1)} * \Psi^{(1)}) X + X (\Psi^{(2)} * \Psi^{(2)})\big] - 2 (\Psi^{(1)} * X) * \Psi^{(2)} - K^N, \tag{9}$$

where for $i, j = 1, \ldots, n$, $[\Psi^{(1)} * \Psi^{(1)}]_{ij} = \sum_{k:\, e^1_{ik}, e^1_{kj} \in E_1} k_E(\vec q^{\,1}_{ik}, \vec q^{\,1}_{kj})$; for $a, b = 1, \ldots, n$, $[\Psi^{(2)} * \Psi^{(2)}]_{ab} = \sum_{c:\, e^2_{ac}, e^2_{cb} \in E_2} k_E(\vec q^{\,2}_{ac}, \vec q^{\,2}_{cb})$; and for $i, a = 1, \ldots, n$, $[(\Psi^{(1)} * X) * \Psi^{(2)}]_{ia} = \sum_{e^1_{ik} \in E_1,\, e^2_{ca} \in E_2} X_{kc}\, k_E(\vec q^{\,1}_{ik}, \vec q^{\,2}_{ca})$. In the supplementary material, we provide compact matrix multiplication forms for computing (9).
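A minimal sketch of the path-following outer loop, assuming a hypothetical first-order inner solver `inner_solver` (e.g., the EnFW algorithm of Section 5); the names and defaults here are ours, not the paper's:

```python
import numpy as np

def path_following(grad_J, inner_solver, n, alphas=np.linspace(0.0, 1.0, 11)):
    """Path-following strategy around (8): solve a sequence of problems
    min J_alpha over D, warm-starting each one at the previous minimizer.
    grad_J(X, alpha) is a user-supplied gradient oracle, cf. (9)."""
    X = np.full((n, n), 1.0 / n)     # barycenter of D as the alpha = 0 start
    for alpha in alphas:
        grad = lambda Z, a=alpha: grad_J(Z, a)   # gradient of J_alpha
        X = inner_solver(grad, X)    # local minimizer of J_alpha, warm-started
    return X                         # local minimizer of J_{alpha = 1}
```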
4.2 Approximate explicit feature maps

The discussion above significantly reduces the space cost of Lawler's QAP by avoiding the computation of the affinity matrix $K \in \mathbb{R}^{n^2 \times n^2}$. However, the time cost of computing the gradient with (9) is $O(n^4)$, which can be further reduced by employing approximate explicit feature maps [33, 40]. For the kernel $k_E : \mathbb{R}^{d_E} \times \mathbb{R}^{d_E} \to \mathbb{R}$, we may find an explicit feature map $\hat\psi : \mathbb{R}^{d_E} \to \mathbb{R}^D$ such that

$$\forall \vec q_1, \vec q_2 \in \mathbb{R}^{d_E}, \quad \langle \hat\psi(\vec q_1), \hat\psi(\vec q_2) \rangle = \hat k_E(\vec q_1, \vec q_2) \approx k_E(\vec q_1, \vec q_2). \tag{10}$$

For example, if $k_E(\vec q_1, \vec q_2) = \exp(-\gamma \|\vec q_1 - \vec q_2\|_2^2)$, then $\hat\psi$ is the Fourier random feature map [33]:

$$\hat\psi(\vec q) = \sqrt{\tfrac{2}{D}}\,\big[\cos(\omega_1^T \vec q + b_1), \ldots, \cos(\omega_D^T \vec q + b_D)\big]^T, \quad \text{where } \omega_i \sim \mathcal{N}(\vec 0, 2\gamma I) \text{ and } b_i \sim U[0, 2\pi]. \tag{11}$$

Note that in practice, the performance of $\hat\psi$ is good enough for relatively small values of $D$ [47]. By virtue of explicit feature maps, we obtain a new graph representation $\hat\Psi \in (\mathbb{R}^D)^{n \times n}$:

$$\hat\Psi_{ij} = \begin{cases} \hat\psi(\vec q_{ij}) \in \mathbb{R}^D, & \text{if } (v_i, v_j) \in E, \\ \vec 0 \in \mathbb{R}^D, & \text{otherwise}, \end{cases} \quad \text{where } \vec 0 \text{ is the zero vector in } \mathbb{R}^D. \tag{12}$$

Its space cost is $O(Dn^2)$. Computing the gradient-related terms $\hat\Psi^{(\cdot)} * \hat\Psi^{(\cdot)}$, $(\cdot) = (1), (2)$, and $(\hat\Psi^{(1)} * X) * \hat\Psi^{(2)}$ in (9) now becomes rather simple. We first slice $\hat\Psi^{(\cdot)}$ into $D$ matrices $\hat\Psi^{(\cdot)}(:, :, i) \in \mathbb{R}^{n \times n}$, $i = 1, 2, \ldots, D$. Then it can easily be shown that

$$\hat\Psi^{(\cdot)} * \hat\Psi^{(\cdot)} = \sum_{i=1}^D \hat\Psi^{(\cdot)}(:, :, i)\, \hat\Psi^{(\cdot)}(:, :, i), \quad \text{and} \quad (\hat\Psi^{(1)} * X) * \hat\Psi^{(2)} = \sum_{i=1}^D \hat\Psi^{(1)}(:, :, i)\, X\, \hat\Psi^{(2)}(:, :, i), \tag{13}$$

whose first and second terms involve $D$ and $2D$ matrix multiplications of size $n \times n$, respectively. Hence, the time complexity is reduced to $O(Dn^3)$. Moreover, gradient computation with (13) is highly parallelizable, which also contributes to scalability.
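The following NumPy sketch implements (11)-(13); it assumes the standard Rahimi-Recht sampling for the Gaussian kernel $\exp(-\gamma\|\cdot\|_2^2)$, and all function names are ours:

```python
import numpy as np

def fourier_features(Q, gamma, D, rng):
    """Random Fourier map (11) for k_E(q1, q2) = exp(-gamma * ||q1 - q2||^2).
    Q: (m, d_E) stacked edge attributes -> (m, D) explicit features."""
    d_E = Q.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d_E, D))  # omega_i ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # b_i ~ U[0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(Q @ W + b)

def grad_J_alpha(Psi1, Psi2, KN, X, alpha):
    """Gradient (9) computed with the D slice products of (13).
    Psi1, Psi2: (n, n, D) arrays built as in (12); X, KN: (n, n)."""
    S1 = np.einsum('ikd,kjd->ij', Psi1, Psi1)       # Psi1 * Psi1
    S2 = np.einsum('ikd,kjd->ij', Psi2, Psi2)       # Psi2 * Psi2
    C = np.einsum('ikd,kl,ljd->ij', Psi1, X, Psi2)  # (Psi1 * X) * Psi2
    return (1.0 - 2.0 * alpha) * (S1 @ X + X @ S2) - 2.0 * C - KN
```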
5 Entropy-regularized Frank-Wolfe optimization algorithm

The state-of-the-art method for optimizing problem (8) is the Frank-Wolfe algorithm [29, 25, 37, 49], each iteration of which solves a linear program to obtain the optimal direction $Y$, i.e.,

$$Y = \operatorname*{argmin}_{Y \in \mathcal{D}} \ \langle \nabla J_\alpha(X), Y \rangle_F, \tag{14}$$

which is usually solved by the Hungarian algorithm [20]. Optimizing $J_\alpha$ may require calling the Hungarian algorithm many times, which is quite time-consuming for large graphs.

[Figure 2: Runtime comparison of the Hungarian and Sinkhorn-Knopp algorithms for $n$ from 500 to 4000.]

In this work, instead of minimizing $J_\alpha(X)$ in (8), we consider the following problem:

$$\min_X \ F_\alpha(X) = J_\alpha(X) + \lambda H(X) \quad \text{s.t. } X \in \mathcal{D}_n, \tag{15}$$

where $\mathcal{D}_n = \{X \in \mathbb{R}^{n \times n}_{+} \mid X\mathbf{1} = \frac{1}{n}\mathbf{1},\ X^T\mathbf{1} = \frac{1}{n}\mathbf{1}\}$, $H(X) = \sum_{i,j=1}^n X_{ij} \log X_{ij}$ is the negative entropy, and the node affinity matrix $K^N$ in $J_\alpha(X)$ (see (5) and (8)) is normalized as $K^N \leftarrow \frac{1}{n} K^N$ to balance the node and edge affinity terms. The observation is that if $\lambda$ is set small enough, the solution of (15), after being multiplied by $n$, approximates that of the original QAP (8) as closely as possible. We design the entropy-regularized Frank-Wolfe algorithm ("EnFW" for short) for optimizing (15); in each outer iteration we solve the following nonlinear problem:

$$\min \ \langle \nabla J_\alpha(X), Y \rangle_F + \lambda H(Y) \quad \text{s.t. } Y \in \mathcal{D}_n. \tag{16}$$

Note that (16) can be solved extremely efficiently by the Sinkhorn-Knopp algorithm [10]. Theoretically, the Sinkhorn-Knopp algorithm converges at a linear rate, i.e., $0 < \limsup \|Y_{k+1} - Y^*\| / \|Y_k - Y^*\| < 1$. An empirical comparison of the runtimes of the two algorithms is shown in Fig. 2, where we can see that the Sinkhorn-Knopp algorithm for solving (16) is much faster than the Hungarian algorithm for solving (14).

The EnFW algorithm description: We first give the necessary definitions. Write the quadratic expansion $J_\alpha(X + s(Y - X)) = J_\alpha(X) + s \langle \nabla J_\alpha(X), Y - X \rangle_F + \frac{1}{2} \mathrm{vec}(Y - X)^T \nabla^2 J_\alpha(X)\, \mathrm{vec}(Y - X)\, s^2$. We define the coefficient of the quadratic term as

$$Q(X, Y) \triangleq \tfrac{1}{2}\, \mathrm{vec}(Y - X)^T \nabla^2 J_\alpha(X)\, \mathrm{vec}(Y - X), \tag{17}$$

which, because $J_\alpha$ is a quadratic function (so $\nabla J_\alpha$ is affine and $\nabla^2 J_\alpha$ is constant), can be evaluated as $\frac{1}{2}\langle \nabla J_\alpha(Y - X) - \nabla J_\alpha(0),\ Y - X \rangle_F$ without forming the Hessian. Next, similar to the original FW algorithm, we define the nonnegative gap function $g(X)$ as

$$g(X) \triangleq \langle \nabla J_\alpha(X), X \rangle_F + \lambda H(X) - \min_{Y \in \mathcal{D}_n} \big[\langle \nabla J_\alpha(X), Y \rangle_F + \lambda H(Y)\big]. \tag{18}$$

Proposition 2. If $X^*$ is an optimal solution of (15), then $g(X^*) = 0$.

Therefore, the gap function characterizes the necessary condition for optimal solutions. Note that for any $X \in \mathcal{D}_n$, if $g(X) = 0$, we say that $X$ is a first-order stationary point. With the definitions of $Q(X, Y)$ and $g(X)$ in place, we detail the EnFW procedure in Algorithm 1.

Algorithm 1 The EnFW optimization algorithm for minimizing $F_\alpha$ (15)
1: Initialize $X_0 \in \mathcal{D}_n$
2: while not converged do
3:   Compute the gradient $\nabla J_\alpha(X_t)$ based on (9) or (13)
4:   Obtain the optimal direction $Y_t$ by solving (16), i.e., $Y_t = \operatorname{argmin}_{Y \in \mathcal{D}_n} \langle \nabla J_\alpha(X_t), Y \rangle_F + \lambda H(Y)$
5:   Compute $G_t = g(X_t)$ and $Q_t = Q(X_t, Y_t)$
6:   Determine the stepsize $s_t$: if $Q_t \le 0$, set $s_t = 1$; else set $s_t = \min\{G_t / (2 Q_t), 1\}$
7:   Update $X_{t+1} = X_t + s_t (Y_t - X_t)$
8: end while
9: Output the solution $X^*_\alpha$

After obtaining the optimal solution path $X^*_\alpha$, $\alpha = 0 : \Delta\alpha : 1$, we discretize $n X^*_{\alpha=1}$ by the Hungarian [20] or the greedy discretization algorithm [5] to obtain the binary matching matrix. We next highlight the differences between the EnFW algorithm and the original FW algorithm: (i) we find the optimal direction by solving the nonlinear convex problem (16) with the efficient Sinkhorn-Knopp algorithm, instead of solving the linear problem (14); (ii) we give an explicit formula for computing the stepsize $s$, instead of performing a line search on $[0, 1]$ to optimize $F_\alpha(X + s(Y - X))$ or estimating the Lipschitz constant of $\nabla F_\alpha$ [32]. A code sketch of one EnFW iteration, including the Sinkhorn-Knopp inner solve, is given at the end of this section.

5.1 Convergence analysis

In this part, we present the convergence properties of the proposed EnFW algorithm, including the sequentially decreasing property of the objective function and the convergence rates.

Theorem 1. The generated objective value sequence $\{F_\alpha(X_t)\}_{t=0}^\infty$ converges decreasingly. The generated point sequence $\{X_t\}_{t=0}^\infty \subseteq \mathcal{D}_n \subseteq \mathbb{R}^{n \times n}$ weakly converges to a first-order stationary point at the rate $O(\frac{1}{\sqrt{t+1}})$, i.e.,

$$\min_{1 \le t \le T} g(X_t) \le \frac{2 \max\{\Delta_0,\ \sqrt{L \Delta_0 / n}\}}{\sqrt{T+1}}, \tag{19}$$

where $\Delta_0 = F_\alpha(X_0) - \min_{X \in \mathcal{D}_n} F_\alpha(X)$, and $L$ is the largest absolute eigenvalue of $\nabla^2 J_\alpha(X)$.

If $J_\alpha(X)$ is convex, which happens when $\alpha$ is small (see (8)), we have the tighter bound $O(\frac{1}{T+1})$.

Theorem 2. If $J_\alpha(X)$ is convex, then $F_\alpha(X_T) - F_\alpha(X^*) \le \frac{4L}{n(T+1)}$.

Note that in both cases, convex and nonconvex, EnFW achieves the same convergence rate (up to a constant factor) as the original FW algorithm (see [17] and [32]). Thanks to the efficiency of the Sinkhorn-Knopp algorithm, each iteration takes much less time. Therefore, our optimization algorithm is more computationally efficient than the original FW algorithm.
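A compact sketch of one EnFW outer iteration, with the entropic subproblem (16) solved by Sinkhorn-Knopp. This is our own illustrative rendering with simplified numerics (hypothetical names), not the paper's Matlab implementation; it exploits the fact that $\nabla J_\alpha$ is affine to evaluate $Q(X, Y)$ as a gradient difference:

```python
import numpy as np

def sinkhorn(G, lam, n_iter=200):
    """Solve (16): min <G, Y>_F + lam * H(Y) over D_n (all marginals 1/n),
    via Sinkhorn-Knopp scaling of K = exp(-G / lam)."""
    n = G.shape[0]
    K = np.exp(-(G - G.min()) / lam)   # constant shift does not change the minimizer
    r = np.full(n, 1.0 / n)            # target row and column sums of D_n
    u = np.ones(n)
    for _ in range(n_iter):
        v = r / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

def enfw_step(X, grad_J, lam):
    """One iteration of Algorithm 1 with the closed-form stepsize."""
    negH = lambda Z: np.sum(Z * np.log(np.clip(Z, 1e-300, None)))  # H(Z)
    G = grad_J(X)
    Y = sinkhorn(G, lam)                                           # direction, cf. (16)
    Gt = (np.sum(G * X) + lam * negH(X)) - (np.sum(G * Y) + lam * negH(Y))  # gap (18)
    D = Y - X
    # Q(X, Y) from (17): since grad_J is affine, grad_J(D) - grad_J(0) is the
    # Hessian applied to vec(D).
    Qt = 0.5 * np.sum((grad_J(D) - grad_J(np.zeros_like(D))) * D)
    s = 1.0 if Qt <= 0 else min(Gt / (2.0 * Qt), 1.0)              # stepsize, line 6
    return X + s * D
```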
6 Experiments

In this section, we conduct extensive experiments to demonstrate the matching performance and scalability of our kernelized graph matching framework. We implement all the algorithms in Matlab on an Intel i7-7820HQ 2.90 GHz CPU with 64 GB RAM.

Notation: We use KerGM-I to denote our algorithm with exact edge affinity kernels, and KerGM-II to denote it with approximate explicit feature maps.

Baseline methods: We compare our algorithm with many state-of-the-art graph (network) matching algorithms: (i) integer projected fixed point method (IPFP) [25], (ii) spectral matching with affine constraints (SMAC) [9], (iii) probabilistic graph matching (PM) [46], (iv) re-weighted random walk matching (RRWM) [5], (v) factorized graph matching (FGM) [48], (vi) branching path following for graph matching (BPFG) [39], (vii) graduated assignment graph matching (GAGM) [14], (viii) global network alignment using multiscale spectral signatures (GHOST) [31], (ix) triangle alignment (TAME) [30], and (x) maximizing accuracy in global network alignment (MAGNA) [35]. Note that GHOST, TAME, and MAGNA are popular protein-protein interaction (PPI) network aligners.

Settings: For all the baseline methods, we used the parameters recommended in the public code. For our method, unless otherwise specified, we set the regularization parameter (see (15)) $\lambda = 0.005$ and the path-following parameters $\alpha = 0 : 0.1 : 1$. We use the Hungarian algorithm for the final discretization. We refer to the supplementary material for other implementation details.

6.1 Synthetic datasets

We evaluate the algorithms on synthetic Erdős-Rényi [12] random graphs, following the experimental protocol in [14, 48, 5]. For each trial, we generate two graphs: the reference graph $G_1$ and the perturbed graph $G_2$, each of which has $n_{in}$ inlier nodes and $n_{out}$ outlier nodes. Each edge in $G_1$ is randomly generated with probability $\rho \in [0, 1]$. The edges $e^1_{ij} \in E_1$ are associated with edge attributes $q^1_{ij} \sim U[0, 1]$. The corresponding edge $e^2_{p(i)p(j)} \in E_2$ has the attribute $q^2_{p(i)p(j)} = q^1_{ij} + \epsilon$, where $p$ is a permutation map for the inlier nodes and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise (a generation sketch follows below). For the baseline methods, the edge affinity between $q^1_{ij}$ and $q^2_{ab}$ is computed as $k_E(q^1_{ij}, q^2_{ab}) = \exp(-(q^1_{ij} - q^2_{ab})^2 / 0.15)$. For our method, we use the Fourier random features (11) to approximate the Gaussian kernel, and represent each graph by an $(n_{in} + n_{out}) \times (n_{in} + n_{out})$ array in $\mathbb{R}^D$. We set the parameter $\gamma = 5$ and the dimension $D = 20$.

Comparing matching accuracy: We perform the comparison under three parameter settings, in all of which $n_{in} = 50$. Note that, unlike the standard protocol where $n_{in} = 20$ [48], we use relatively large graphs to highlight the advantage of KerGM-II. (i) We vary the number of outlier nodes, $n_{out}$, from 0 to 50 while fixing the noise, $\sigma = 0$, and the edge density, $\rho = 1$. (ii) We vary $\sigma$ from 0 to 0.2 while fixing $n_{out} = 0$ and $\rho = 1$. (iii) We vary $\rho$ from 0.3 to 1 while fixing $n_{out} = 5$ and $\sigma = 0.1$. For all cases in these settings, we repeat the experiments 100 times and report the average accuracy and standard error in Fig. 3(a). Clearly, KerGM-II outperforms all the baseline methods with statistical significance.

Comparing scalability: To fairly compare the scalability of the algorithms, we consider exact matching between fully connected graphs, i.e., $n_{out} = 0$, $\sigma = 0$, and $\rho = 1$. We vary the number of nodes, $n (= n_{in})$, from 50 to 2000, and report the CPU time of each algorithm in Fig. 3(b). All the baseline methods can handle only graphs with fewer than 200 nodes because of the expensive space cost of the matrix $K$ (see (3)). In contrast, KerGM-II can solve Lawler's graph matching problem with 2000 nodes in a reasonable time.
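A sketch of the synthetic pair generation described above (simplified to inliers only; the names are ours):

```python
import numpy as np

def make_pair(n, rho, sigma, rng):
    """Generate (G1, G2) following the Section 6.1 protocol, inliers only.
    Returns adjacency and edge-attribute matrices plus the ground-truth map p."""
    A1 = np.triu(rng.random((n, n)) < rho, 1)
    A1 = A1 | A1.T                                 # Erdos-Renyi edges with prob rho
    Q1 = np.triu(rng.random((n, n)), 1)
    Q1 = (Q1 + Q1.T) * A1                          # attributes q1_ij ~ U[0, 1]
    p = rng.permutation(n)                         # ground-truth correspondence
    A2 = A1[np.ix_(p, p)]
    E = np.triu(rng.normal(0.0, sigma, (n, n)), 1)
    Q2 = (Q1[np.ix_(p, p)] + E + E.T) * A2         # q2 = q1 + eps, eps ~ N(0, sigma^2)
    return A1, Q1, A2, Q2, p

# Baseline-style edge affinity for a candidate edge pair:
k_E = lambda q1, q2: np.exp(-(q1 - q2) ** 2 / 0.15)
```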
Analyzing parameter sensitivity: To analyze the parameter sensitivity of KerGM-II, we vary the regularization parameter, $\lambda$, and the dimension, $D$, of the Fourier random features. We conduct large subgraph matching experiments with $n_{in} = 500$, $n_{out} = 0 : 100 : 500$, $\rho = 1$, and $\sigma = 0$. We repeat the experiments 50 times and report the average accuracies and standard errors. In Fig. 4, we show the results for different $\lambda$ and different $D$. We can see that (i) smaller $\lambda$ leads to better performance, which is easily understood because the entropy regularizer perturbs the original optimal solution, and (ii) the dimension $D$ does not much affect KerGM-II, which implies that in practice we can use a relatively small $D$ to reduce the time and space complexity.

[Figure 3: Comparison of graph matching on synthetic datasets: (a) accuracy vs. the number of outliers (0 to 50), noise (0 to 0.2), and edge density (0.3 to 1); (b) CPU time vs. the number of nodes (50 to 2000).]

[Figure 4: (a) Parameter sensitivity study of the regularizer $\lambda$ ($\lambda = 0.005$ to $0.05$, at $D = 20$). (b) Parameter sensitivity study of the dimension, $D$, of the random Fourier features ($D = 10, 20, 50$).]

6.2 Image datasets

The CMU House Sequence dataset has 111 frames of a house, each of which has 30 labeled landmarks. We follow the experimental protocol in [48, 39]. We match all image pairs spaced by 0:10:90 frames. We consider two node settings: $(n_1, n_2) = (30, 30)$ and $(n_1, n_2) = (20, 30)$. We build graphs by connecting landmarks via Delaunay triangulation [23] (a construction sketch follows at the end of this subsection). The edge attributes are the pairwise distances between nodes. For all methods, we compute the edge affinity as $k_E(q^1_{ij}, q^2_{ab}) = \exp(-(q^1_{ij} - q^2_{ab})^2 / 2500)$. In Fig. 5, we report the average matching accuracy and the objective value ratio of (3) for every gap. On this dataset, KerGM-I and FGM achieve the best performance, and are slightly better than BPFG when outliers exist, i.e., $(n_1, n_2) = (20, 30)$.

[Figure 5: Comparison of graph matching on the CMU house dataset: accuracy and objective ratio vs. frame gap (10 to 100), for $(n_1, n_2) = (30, 30)$ and $(20, 30)$.]

The Pascal dataset [26] has 20 pairs of motorbike images and 30 pairs of car images. For each pair, the detected feature points and manually labeled correspondences are provided. Following [48, 39], we randomly select 0:2:20 outliers from the background to compare the methods. For each node $v_i$, its attribute $p_i$ is the orientation of the normal vector, at that point, to the contour from which the point was sampled. Nodes are connected by Delaunay triangulation [23]. For each edge $e_{ij}$, its attribute is $\vec q_{ij} = [d_{ij}, \theta_{ij}]^T$, where $d_{ij}$ is the distance between $v_i$ and $v_j$, and $\theta_{ij}$ is the absolute angle between the edge and the horizontal line. For all methods, the node affinity is computed as $k_N(p_i, p_j) = \exp(-|p_i - p_j|)$, and the edge affinity as $k_E(\vec q^{\,1}_{ij}, \vec q^{\,2}_{ab}) = \exp(-|d^1_{ij} - d^2_{ab}|/2 - |\theta^1_{ij} - \theta^2_{ab}|/2)$. Fig. 6(a) shows a matching result of KerGM-I.

[Figure 6: (a) A matching example for a pair of motorbike images generated by KerGM-I, where green and red lines indicate correct and incorrect matches, respectively. (b) Comparison of graph matching on the Pascal dataset: accuracy and running time vs. the number of outliers (0 to 20).]
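A sketch of the Delaunay graph construction used for both image datasets (our own code, built on scipy.spatial.Delaunay):

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_graph(points):
    """Connect 2D landmarks by Delaunay triangulation and attach pairwise
    distances as edge attributes (cf. Section 6.2)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    A = np.zeros((n, n), dtype=bool)
    for tri in Delaunay(pts).simplices:        # each triangle contributes 3 edges
        for i in range(3):
            a, b = tri[i], tri[(i + 1) % 3]
            A[a, b] = A[b, a] = True
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return A, D * A                             # adjacency, distance attributes

# Edge affinity used on the CMU house dataset:
k_E = lambda q1, q2: np.exp(-(q1 - q2) ** 2 / 2500.0)
```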
In Fig. 6(b), we report the matching accuracies and CPU running times. In terms of matching accuracy, KerGM-I, BPFG, and FGM consistently outperform the other methods. As the number of outliers increases, KerGM-I and BPFG perform slightly better than FGM. However, in terms of running time, BPFG is much more expensive than the others.

6.3 The protein-protein interaction network dataset

The S. cerevisiae (yeast) PPI network dataset [7] is widely used to evaluate PPI network aligners because it has known true node correspondences. It consists of an unweighted high-confidence PPI network with 1004 proteins (nodes) and 8323 PPIs (edges), and five noisy PPI networks generated by adding 5%, 10%, 15%, 20%, and 25% low-confidence PPIs. We match the high-confidence network with each noisy network. To apply KerGM, we generate edge attributes from the heat diffusion matrix [16, 6], $H_t = \exp(-tL) = \sum_{i=1}^n \exp(-\lambda_i t)\, \vec u_i \vec u_i^T \in \mathbb{R}^{n \times n}$, where $L$ is the normalized Laplacian matrix [6] and $\{(\lambda_i, \vec u_i)\}_{i=1}^n$ are the eigenpairs of $L$. The edge attribute vector is assigned as $\vec q_{ij} = [H_5(i,j), H_{10}(i,j), H_{15}(i,j), H_{20}(i,j)]^T \in \mathbb{R}^4$ (a computation sketch follows at the end of this subsection). We use the Fourier random features (11), with $D = 50$ and $\gamma = 200$. We compare KerGM-II³ with the state-of-the-art PPI aligners TAME, GHOST, and MAGNA. In Fig. 7, we report the matching accuracies. Clearly, KerGM-II significantly outperforms the baselines. In particular, when the noise level is 20% or 25%, KerGM-II's accuracies are more than 50 percentage points higher than those of the other algorithms.

[Figure 7: Node accuracy on PPI networks at noise levels 5% to 25%.]

³To the best of our knowledge, KerGM is the first method that uses Lawler's graph matching formulation to solve the PPI network alignment problem.
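A sketch of the heat-diffusion edge attributes (ours; it assumes a dense eigendecomposition, which is feasible at $n \approx 1000$):

```python
import numpy as np

def heat_diffusion_attrs(A, ts=(5, 10, 15, 20)):
    """Edge attributes q_ij = [H_t(i, j)]_{t in ts}, with H_t = exp(-t L)
    and L the normalized Laplacian (cf. Section 6.3)."""
    A = np.asarray(A, dtype=float)
    d = 1.0 / np.sqrt(np.maximum(A.sum(1), 1e-12))
    L = np.eye(len(A)) - d[:, None] * A * d[None, :]   # normalized Laplacian
    lam, U = np.linalg.eigh(L)                         # eigenpairs of L
    # H_t = sum_i exp(-lam_i * t) u_i u_i^T, stacked over t: shape (n, n, len(ts))
    return np.stack([(U * np.exp(-lam * t)) @ U.T for t in ts], axis=-1)
```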
7 Conclusion

In this work, based on a mild assumption regarding edge affinities, we presented KerGM, a unifying framework for Koopmans-Beckmann's and Lawler's QAPs, within which both QAPs can be considered as alignments between arrays in an RKHS. We then derived convex and concave relaxations and the corresponding path-following strategy. To make KerGM more scalable to large graphs, we developed the computationally efficient entropy-regularized Frank-Wolfe optimization algorithm. KerGM achieved promising performance on both image and biology datasets. Thanks to its scalability, we believe KerGM can be useful for many real-world applications.

8 Acknowledgment

This work was supported in part by the AFOSR grant FA9550-16-1-0386.

References

[1] Yonathan Aflalo, Alexander Bronstein, and Ron Kimmel. On convex relaxation of graph isomorphism. Proceedings of the National Academy of Sciences, 112(10):2942–2947, 2015.
[2] H. A. Almohamad and Salih O. Duffuaa. A linear programming approach for the weighted graph matching problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5):522–525, 1993.
[3] William Brendel and Sinisa Todorovic. Learning spatiotemporal graphs of human activities. In 2011 International Conference on Computer Vision, pages 778–785. IEEE, 2011.
[4] Altannar Chinchuluun, Enkhbat Rentsen, and Panos M. Pardalos. A numerical method for concave programming problems. In Continuous Optimization, pages 251–273. Springer, 2005.
[5] Minsu Cho, Jungmin Lee, and Kyoung Mu Lee. Reweighted random walks for graph matching. In European Conference on Computer Vision, pages 492–505. Springer, 2010.
[6] Fan R. K. Chung. Spectral Graph Theory. Number 92. American Mathematical Society, 1997.
[7] Sean R. Collins, Patrick Kemmeren, Xue-Chu Zhao, Jack F. Greenblatt, Forrest Spencer, Frank C. P. Holstege, Jonathan S. Weissman, and Nevan J. Krogan. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & Cellular Proteomics, 6(3):439–450, 2007.
[8] Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.
[9] Timothee Cour, Praveen Srinivasan, and Jianbo Shi. Balanced graph matching. In Advances in Neural Information Processing Systems, pages 313–320, 2007.
[10] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[11] Ahed Elmsallati, Connor Clark, and Jugal Kalita. Global alignment of protein-protein interaction networks: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13(4):689–705, 2015.
[12] Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae Debrecen, 6:290–297, 1959.
[13] Utkarsh Gaur, Yingying Zhu, Bi Song, and A. Roy-Chowdhury. A string of feature graphs model for recognition of complex activities in natural videos. In 2011 International Conference on Computer Vision, pages 2595–2602. IEEE, 2011.
[14] Steven Gold and Anand Rangarajan. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, 1996.
[15] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
[16] Nan Hu, Raif M. Rustamov, and Leonidas Guibas. Stable and informative spectral signatures for graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2305–2312, 2014.
[17] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435, 2013.
[18] Tjalling C. Koopmans and Martin Beckmann. Assignment problems and the location of economic activities. Econometrica, pages 53–76, 1957.
[19] Nils M. Kriege and Petra Mutzel. Subgraph matching kernels for attributed graphs. In ICML, 2012.
[20] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[21] Yam Kushinsky, Haggai Maron, Nadav Dym, and Yaron Lipman. Sinkhorn algorithm for lifted assignment problems. SIAM Journal on Imaging Sciences, 12(2):716–735, 2019.
[22] Eugene L. Lawler. The quadratic assignment problem. Management Science, 9(4):586–599, 1963.
[23] D. T. Lee and B. J. Schachter. Two algorithms for constructing a Delaunay triangulation. International Journal of Computer & Information Sciences, 9(3):219–242, 1980.
[24] Jungmin Lee, Minsu Cho, and Kyoung Mu Lee. Hyper-graph matching via reweighted random walks. In CVPR 2011, pages 1633–1640. IEEE, 2011.
[25] Marius Leordeanu, Martial Hebert, and Rahul Sukthankar. An integer projected fixed point method for graph matching and MAP inference. In Advances in Neural Information Processing Systems, pages 1114–1122, 2009.
[26] Marius Leordeanu, Rahul Sukthankar, and Martial Hebert. Unsupervised learning for graph matching. International Journal of Computer Vision, 96(1):28–45, 2012.
[27] Zhi-Yong Liu and Hong Qiao. GNCCP: Graduated nonconvexity and concavity procedure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1258–1267, 2014.
[28] João Maciel and João P. Costeira. A global solution to sparse correspondence problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):187–199, 2003.
[29] Haggai Maron and Yaron Lipman. (Probably) concave graph matching. In Advances in Neural Information Processing Systems, pages 408–418, 2018.
[30] Shahin Mohammadi, David F. Gleich, Tamara G. Kolda, and Ananth Grama. Triangular alignment (TAME): A tensor-based approach for higher-order network alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 14(6):1446–1458, 2017.
[31] Rob Patro and Carl Kingsford. Global network alignment using multiscale spectral signatures. Bioinformatics, 28(23):3105–3114, 2012.
[32] Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi. Step-size adaptivity in projection-free optimization. arXiv preprint arXiv:1806.05123, 2018.
[33] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[34] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
[35] Vikram Saraph and Tijana Milenković. MAGNA: Maximizing accuracy in global network alignment. Bioinformatics, 30(20):2931–2940, 2014.
[36] Christian Schellewald, Stefan Roth, and Christoph Schnörr. Evaluation of convex optimization techniques for the weighted graph-matching problem in computer vision. In Joint Pattern Recognition Symposium, pages 361–368. Springer, 2001.
[37] Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Louis J. Podrazik, Steven G. Kratzer, Eric T. Harley, Donniell E. Fishkind, R. Jacob Vogelstein, and Carey E. Priebe. Fast approximate quadratic programming for graph matching. PLOS ONE, 10(4):e0121002, 2015.
[38] Tao Wang, Haibin Ling, Congyan Lang, and Songhe Feng. Graph matching with adaptive and branching path following. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2853–2867, 2018.
[39] Tao Wang, Haibin Ling, Congyan Lang, and Jun Wu. Branching path following for graph matching. In European Conference on Computer Vision, pages 508–523. Springer, 2016.
[40] Lingfei Wu, Ian E. H. Yen, Jie Chen, and Rui Yan. Revisiting random binning features: Fast convergence and strong parallelizability. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1265–1274. ACM, 2016.
[41] Lingfei Wu, Ian En-Hsu Yen, Zhen Zhang, Kun Xu, Liang Zhao, Xi Peng, Yinglong Xia, and Charu Aggarwal. Scalable global alignment graph kernel using random features: From node embedding to graph embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1418–1428, 2019.
[42] Junchi Yan, Jun Wang, Hongyuan Zha, Xiaokang Yang, and Stephen Chu. Consistency-driven alternating optimization for multigraph matching: A unified approach. IEEE Transactions on Image Processing, 24(3):994–1009, 2015.
[43] Junchi Yan, Xu-Cheng Yin, Weiyao Lin, Cheng Deng, Hongyuan Zha, and Xiaokang Yang. A short survey of recent advances in graph matching. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, pages 167–174. ACM, 2016.
[44] Tianshu Yu, Junchi Yan, Yilin Wang, Wei Liu, et al. Generalizing graph matching beyond quadratic assignment model. In Advances in Neural Information Processing Systems, pages 853–863, 2018.
[45] Mikhail Zaslavskiy, Francis Bach, and Jean-Philippe Vert. A path following algorithm for the graph matching problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2227–2242, 2009.
[46] Ron Zass and Amnon Shashua. Probabilistic graph and hypergraph matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[47] Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai. RetGK: Graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems, pages 3964–3974, 2018.
[48] Feng Zhou and Fernando De la Torre. Factorized graph matching. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 127–134. IEEE, 2012.
[49] Feng Zhou and Fernando De la Torre. Factorized graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1774–1789, 2015.