# Cross-Graph Learning of Multi-Relational Associations

Hanxiao Liu (HANXIAOL@CS.CMU.EDU), Yiming Yang (YIMING@CS.CMU.EDU)
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

## Abstract

Cross-graph Relational Learning (CGRL) refers to the problem of predicting the strengths or labels of multi-relational tuples of heterogeneous object types, through joint inference over multiple graphs which specify the internal connections among each type of objects. CGRL is an open challenge in machine learning because the number of possible tuples is daunting when the individual graphs are large, and because the labeled training instances are extremely sparse, as is typical. Existing methods such as tensor factorization or tensor-kernel machines do not work well, due to the lack of a convex formulation for the optimization of CGRL models, the poor scalability of the algorithms in handling combinatorial numbers of tuples, and/or the non-transductive nature of the learning methods, which limits their ability to leverage unlabeled data in training. This paper proposes a novel framework which formulates CGRL as a convex optimization problem, enables transductive learning using both labeled and unlabeled tuples, and offers a scalable algorithm that guarantees the optimal solution and enjoys a linear time complexity with respect to the sizes of the input graphs. In our experiments with a subset of DBLP publication records and an Enzyme multi-source dataset, the proposed method successfully scaled to the large cross-graph inference problem, and significantly outperformed other representative approaches.

## 1. Introduction

Many important problems in multi-source relational learning can be cast as joint learning over multiple graphs describing how heterogeneous types of objects interact with each other. In literature data analysis, for example, publication records provide rich information about how authors collaborate with each other in a co-authoring graph, how papers are linked in citation networks, how keywords are related via an ontology, and so on. The challenging question is how to combine such heterogeneous information in the individual graphs for the labeling or scoring of multi-relational associations in tuples like (author, paper, keyword), given some observed instances of such tuples as the labeled training set. Automated labeling or scoring of unobserved tuples allows us to discover who has been active in the literature on which areas of research, and to predict who will become influential in which areas in the future. In protein data analysis, as another example, a graph of proteins with pairwise sequence similarities is often studied jointly with a graph of chemical compounds with their structural similarities, for the discovery of interesting patterns in (compound, protein) pairs. We call the prediction problem in both examples cross-graph learning of multi-relational associations, or simply cross-graph relational learning (CGRL), where the multi-relational associations are defined by tuples of heterogeneous types of objects, and each object type has its own graph with a type-specific relational structure as part of the provided data.
The task is to predict the labels or the scores of unobserved multi-relational tuples, conditioned on a relatively small set of labeled instances.

CGRL is an open challenge in machine learning for several reasons. Firstly, the number of multi-relational tuples grows combinatorially in the number of individual graphs and the number of nodes in each graph. How to make cross-graph inference computationally tractable for large graphs is a tough challenge. Secondly, how to combine the internal structures or relations of individual graphs for joint inference in a principled manner is an open question. Thirdly, supervised information (labeled instances) is typically extremely sparse in CGRL due to the very large number of possible combinations of heterogeneous objects in the individual graphs. Consequently, the success of cross-graph learning crucially depends on effectively leveraging the massively available unlabeled tuples (and the latent relations among them) in addition to the labeled training data. In other words, how to make the learning transductive is crucial for the true success of CGRL. Research on transductive CGRL has been quite limited, to our knowledge.

Existing approaches in CGRL or CGRL-related areas can be outlined as those using tensors or graph-regularized tensors, and kernel machines that combine multiple kernels.

Tensor methods have been commonly used for combining multi-source evidence of the interactions among multiple types of objects (Nickel et al., 2011; Rendle et al., 2009; Kolda & Bader, 2009), as the combined evidence can be naturally represented as tuples. However, most tensor methods do not explicitly model the internal graph structure for each type of objects, although some of them implicitly leverage such information via graph-based regularization terms in their objective function that encourage similar objects within each graph to share similar latent factors (Narita et al., 2012; Cai et al., 2011). A major weakness of such tensor methods is the lack of convexity in their models, which leads to ill-posed optimization problems, particularly in high-order scenarios. It has also been observed that tensor factorization models suffer from the label-sparsity issue, which is typically severe in CGRL.

Kernel machines have been widely studied for supervised classification, where a kernel matrix corresponds to a similarity graph among a single type of objects. Multiple kernels can be combined, for example, by taking the tensor product of the individual kernel matrices, which results in a kernel matrix over cross-graph multi-relational tuples. The idea has been explored in relational learning combined with SVMs (Ben-Hur & Noble, 2005), perceptrons (Basilico & Hofmann, 2004) or Gaussian processes (Yu & Chu, 2008) for two types of objects, and is generalizable to the multi-type scenario of CGRL. Although generic, the complexity of such kernel-based methods grows exponentially in the number of individual kernels (graphs) and the size of each individual graph. As a result, kernel machines suffer from poor scalability in general. In addition, kernel machines are purely supervised (not transductive), i.e., they cannot leverage the massive number of available non-observed tuples induced from the individual graphs and the latent connections among them. These limitations make existing kernel methods less suitable for solving the CGRL problem at large scale and under severely data-sparse conditions.
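To make the scalability issue concrete, the following minimal numpy sketch (ours, not code from the paper; the toy sizes and random kernels are assumptions) combines two single-type kernels into a pairwise kernel by taking their Kronecker (tensor) product. Even for modest graphs the combined kernel has $(n_1 n_2)^2$ entries, and adding a third graph raises this to $(n_1 n_2 n_3)^2$, which is why such kernel machines scale poorly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PSD kernels over two object types (e.g. compounds and proteins).
K1 = rng.random((50, 50)); K1 = K1 @ K1.T     # 50 x 50
K2 = rng.random((80, 80)); K2 = K2 @ K2.T     # 80 x 80

# Kernel over (compound, protein) pairs via the Kronecker product of the two kernels.
K_pairs = np.kron(K1, K2)
print(K_pairs.shape)                          # (4000, 4000): already 16M entries
```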
In this paper, we propose a novel framework for CGRL which can be characterized as follows: (i) it uses graph products to map heterogeneous sources of information and the link structures of individual graphs onto a single homogeneous graph; (ii) it provides a convex formulation and approximation of the CGRL problem that ensure robust optimization and efficient computation; and (iii) it enables transductive learning in the form of label propagation over the induced homogeneous graph, so that the massively available non-observed tuples and the latent connections among them can play an important role in effectively addressing the label-sparsity issue.

The proposed framework is most closely related to (Liu & Yang, 2015), where the authors formulated graph products for learning the edges of a bipartite graph. Our new framework is fundamentally different in two aspects. First, our new formulation and algorithms allow the number of individual graphs to be greater than two, while the method in (Liu & Yang, 2015) is only applicable to two graphs. Secondly, the algorithms in (Liu & Yang, 2015) suffer from cubic complexity in the graph sizes (quadratic when using a non-convex approximation), while our new algorithm enjoys both the convexity of the formulation and a low time complexity which is linear in the graph sizes. Our method also shares its high-level goal with Statistical Relational Learning (SRL) (Getoor, 2007) and Inductive Logic Programming (ILP) (Lavrac & Dzeroski, 1994) in terms of multi-relational learning. However, both our problem setting and our formulation differ substantially from existing SRL/ILP approaches, which focus on first-order logic and/or probabilistic reasoning over graphical models.

The paper is organized as follows: Section 2 shows how cross-graph multi-relations can be embedded into the vertex space of a homogeneous graph. Section 3 describes how efficient label propagation among multi-relations can be carried out in such a space via approximation. We discuss our optimization algorithm in Section 4 and provide empirical evaluations on real-world datasets in Section 5.

## 2. The Proposed Method

We introduce our notation in 2.1 and the notion of graph product (GP) in 2.2. We then narrow down to a specific GP family with desirable computational properties in 2.3, and finally propose our GP-based optimization objective in 2.4.

### 2.1. Notations

We are given $J$ heterogeneous graphs, where the $j$-th graph contains $n_j$ vertices and is associated with an adjacency matrix $G^{(j)} \in \mathbb{R}^{n_j \times n_j}$. We use $i_j$ to index the $i_j$-th vertex of graph $j$, and use a tuple $(i_1, \ldots, i_J)$ to index each multi-relation across the $J$ graphs. The system predictions over all $\prod_{j=1}^J n_j$ possible multi-relations are summarized in an order-$J$ tensor $f \in \mathbb{R}^{n_1 \times \cdots \times n_J}$, where $f_{i_1, i_2, \ldots, i_J}$ corresponds to the prediction about tuple $(i_1, \ldots, i_J)$. Denote by $\otimes$ the Kronecker (tensor) product. We use $\bigotimes_{j=1}^J x_j$ (or simply $\bigotimes_j x_j$) as the shorthand for $x_1 \otimes \cdots \otimes x_J$. Denote by $\times_j$ the $j$-mode product between tensors. We refer the reader to (Kolda & Bader, 2009) for a thorough introduction to the tensor mode product.

Figure 1. Product of $G^{(1)}$, $G^{(2)}$ and $G^{(3)}$. Vertices in the product graph $P(G^{(1)}, G^{(2)}, G^{(3)})$ correspond to multi-relations across the original graphs. For instance, vertex 3.II.B in $P$ corresponds to the multi-relation (3, II, B) across $G^{(1)}$, $G^{(2)}$ and $G^{(3)}$.
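To make the tensor notation concrete, here is a small illustration (our own toy example; the random adjacency matrices and the helper name `mode_product` are ours) of the order-$J$ prediction tensor and the $j$-mode product used throughout the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = [4, 3, 5]                                   # n_j: number of vertices in each of J = 3 graphs
G = []                                          # symmetric adjacency matrices G^(j)
for nj in n:
    A = rng.random((nj, nj))
    G.append((A + A.T) / 2)
f = rng.standard_normal(tuple(n))               # order-J prediction tensor, f[i1, i2, i3]

def mode_product(T, M, j):
    """j-mode product T x_j M: (T x_j M)[..., k, ...] = sum_i M[k, i] * T[..., i, ...]."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, j)), 0, j)

# Multiplying mode 1 by a 2 x n_2 matrix replaces that dimension (3 -> 2).
M = rng.standard_normal((2, n[1]))
print(mode_product(f, M, 1).shape)              # (4, 2, 5)
```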
### 2.2. Graph Product

In a nutshell, a graph product (GP)¹ is a mapping from each cross-graph multi-relation to a vertex in a new graph $P$, whose edges encode similarities among the multi-relations (illustrated in Fig. 1). A desirable property of GP is that it provides a natural reduction from the original multi-relational learning problem over heterogeneous information sources (Task 1) to an equivalent graph-based learning problem over a homogeneous graph (Task 2).

Task 1. Given $J$ graphs $G^{(1)}, \ldots, G^{(J)}$ with a small set of labeled multi-relations $O = \{(i_1, \ldots, i_J)\}$, predict the labels of the unlabeled multi-relations.

Task 2. Given the product graph $P(G^{(1)}, \ldots, G^{(J)})$ with a small set of labeled vertices $O = \{(i_1, \ldots, i_J)\}$, predict the labels of its unlabeled vertices.

¹ While the traditional GP only applies to two graphs, we generalize it to the case of multiple graphs (Section 2.3).

### 2.3. Spectral Graph Product

We define a parametric family of GP operators named the spectral graph product (SGP), which is of particular interest as it subsumes the well-known Tensor GP and Cartesian GP (Table 1), is well behaved (Theorem 1) and allows efficient optimization routines (Section 3). Let $\lambda^{(j)}_{i_j}$ and $v^{(j)}_{i_j}$ be the $i_j$-th eigenvalue and eigenvector of graph $j$, respectively. We construct the SGP by defining the eigensystem of its adjacency matrix based on the $J$ given heterogeneous eigensystems of $G^{(1)}, \ldots, G^{(J)}$.

Definition 1. The SGP of $G^{(1)}, \ldots, G^{(J)}$ is a graph consisting of $\prod_j n_j$ vertices, with its adjacency matrix $P_\kappa := P_\kappa\big(G^{(1)}, \ldots, G^{(J)}\big)$ defined by the following eigensystem
$$\Big\{ \kappa\big(\lambda^{(1)}_{i_1}, \ldots, \lambda^{(J)}_{i_J}\big),\ \bigotimes_j v^{(j)}_{i_j} \Big\}_{i_1, \ldots, i_J} \qquad (1)$$
where $\kappa$ is a pre-specified nonnegative, nondecreasing function over $\lambda^{(j)}_{i_j}$, $j = 1, 2, \ldots, J$. In other words, the $(i_1, \ldots, i_J)$-th eigenvalue of $P_\kappa$ is defined by coupling $\lambda^{(1)}_{i_1}, \ldots, \lambda^{(J)}_{i_J}$ through the function $\kappa$, and the $(i_1, \ldots, i_J)$-th eigenvector of $P_\kappa$ is defined by coupling $v^{(1)}_{i_1}, \ldots, v^{(J)}_{i_J}$ via the tensor (outer) product.

Remark 1. If each individual $\{v^{(j)}_{i_j}\}_{i_j=1}^{n_j}$ forms an orthogonal basis of $\mathbb{R}^{n_j}$, $j \in \{1, \ldots, J\}$, then $\{\bigotimes_j v^{(j)}_{i_j}\}_{i_1, \ldots, i_J}$ forms an orthogonal basis of $\mathbb{R}^{\prod_{j=1}^J n_j}$.

In the following example we introduce two special kinds of SGPs, assuming $J = 2$ for brevity. Higher-order cases are summarized in Table 1.

Example 1. The Tensor GP defines $\kappa(\lambda_{i_1}, \lambda_{i_2}) = \lambda_{i_1}\lambda_{i_2}$, and is equivalent to the Kronecker product:
$$P_{\mathrm{Tensor}}\big(G^{(1)}, G^{(2)}\big) = \sum_{i_1, i_2} (\lambda_{i_1}\lambda_{i_2})\, \big(v^{(1)}_{i_1} \otimes v^{(2)}_{i_2}\big)\big(v^{(1)}_{i_1} \otimes v^{(2)}_{i_2}\big)^\top = G^{(1)} \otimes G^{(2)}.$$
The Cartesian GP defines $\kappa(\lambda_{i_1}, \lambda_{i_2}) = \lambda_{i_1} + \lambda_{i_2}$, and is equivalent to the Kronecker sum:
$$P_{\mathrm{Cartesian}}\big(G^{(1)}, G^{(2)}\big) = \sum_{i_1, i_2} (\lambda_{i_1} + \lambda_{i_2})\, \big(v^{(1)}_{i_1} \otimes v^{(2)}_{i_2}\big)\big(v^{(1)}_{i_1} \otimes v^{(2)}_{i_2}\big)^\top = G^{(1)} \oplus G^{(2)}.$$

| SGP Type | $\kappa\big(\lambda^{(1)}_{i_1}, \ldots, \lambda^{(J)}_{i_J}\big)$ | $[P_\kappa]_{(i_1, \ldots, i_J), (i'_1, \ldots, i'_J)}$ |
|---|---|---|
| Tensor | $\prod_j \lambda^{(j)}_{i_j}$ | $\prod_j G^{(j)}_{i_j, i'_j}$ |
| Cartesian | $\sum_j \lambda^{(j)}_{i_j}$ | $\sum_j G^{(j)}_{i_j, i'_j} \prod_{j' \neq j} \delta_{i_{j'} = i'_{j'}}$ |

Table 1. Tensor GP and Cartesian GP in higher orders.

While the Tensor GP and the Cartesian GP provide mechanisms to associate multiple graphs in a multiplicative or additive manner, more complex cross-graph association patterns can be modeled by specifying $\kappa$. E.g., $\kappa(\lambda_{i_1}, \lambda_{i_2}, \lambda_{i_3}) = \lambda_{i_1}\lambda_{i_2} + \lambda_{i_2}\lambda_{i_3} + \lambda_{i_3}\lambda_{i_1}$ indicates that pairwise associations are allowed among three graphs, but no triple-wise association is allowed, as the term $\lambda_{i_1}\lambda_{i_2}\lambda_{i_3}$ is not involved. Including higher-order polynomials in $\kappa$ amounts to incorporating higher-order associations among the graphs, which can be achieved by simply exponentiating $\kappa$.
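Definition 1 and Example 1 can be checked numerically. The sketch below (our own illustration, not code from the paper; the small random graphs are assumptions) assembles the SGP adjacency matrix from the per-graph eigensystems and verifies that the Tensor GP recovers the Kronecker product and the Cartesian GP recovers the Kronecker sum.

```python
import numpy as np

rng = np.random.default_rng(0)

def sym_adj(n):
    A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0)
    return A

G1, G2 = sym_adj(4), sym_adj(3)
lam1, V1 = np.linalg.eigh(G1)                    # eigenpairs of each individual graph
lam2, V2 = np.linalg.eigh(G2)

def sgp(kappa):
    """Adjacency of the spectral graph product, assembled as in Definition 1."""
    P = np.zeros((G1.shape[0] * G2.shape[0],) * 2)
    for i in range(len(lam1)):
        for j in range(len(lam2)):
            v = np.kron(V1[:, i], V2[:, j])      # coupled eigenvector of P_kappa
            P += kappa(lam1[i], lam2[j]) * np.outer(v, v)
    return P

# Tensor GP == Kronecker product, Cartesian GP == Kronecker sum (Example 1).
assert np.allclose(sgp(lambda x, y: x * y), np.kron(G1, G2))
assert np.allclose(sgp(lambda x, y: x + y),
                   np.kron(G1, np.eye(3)) + np.kron(np.eye(4), G2))
```

For $J > 2$ the same construction applies, with the coupled eigenvector being the Kronecker product of one eigenvector per graph.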
Since what the product graph $P$ offers is essentially a similarity measure among multi-relations, shuffling the order of the input graphs $G^{(1)}, \ldots, G^{(J)}$ should not affect $P$'s topological structure. For SGP, this property is guaranteed by the following theorem:

Theorem 1 (The Commutative Property). SGP is commutative (up to graph isomorphism) if $\kappa$ is commutative.

We omit the proof. The theorem suggests that the SGP family is well behaved as long as $\kappa$ is commutative, which is true for both the Tensor and the Cartesian GP, as both multiplication and addition are order-insensitive.

### 2.4. Optimization Objective

It is often more convenient to equivalently write the tensor $f$ as a multi-linear map. E.g., when $J = 2$, the tensor (matrix) $f \in \mathbb{R}^{n_1 \times n_2}$ defines a bilinear map from $\mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ to $\mathbb{R}$ via $f(x_1, x_2) := x_1^\top f x_2$, and we have $f_{i_1, i_2} = f(e_{i_1}, e_{i_2})$. The equivalence is analogous in higher-order cases, where $f$ defines a multi-linear map from $\mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_J}$ to $\mathbb{R}$.

To carry out transductive learning over $P_\kappa$ (Task 2), we inject the structure of the product graph into $f$ via a Gaussian random field prior (Zhu et al., 2003). The negative log-likelihood of the prior, $-\log p(f \mid P_\kappa)$, is the same (up to a constant) as the following squared semi-norm:
$$\|f\|^2_{P_\kappa} = \operatorname{vec}(f)^\top P_\kappa^{-1} \operatorname{vec}(f) \qquad (2)$$
$$= \sum_{i_1, i_2, \ldots, i_J} \frac{\big[f\big(v^{(1)}_{i_1}, \ldots, v^{(J)}_{i_J}\big)\big]^2}{\kappa\big(\lambda^{(1)}_{i_1}, \ldots, \lambda^{(J)}_{i_J}\big)} \qquad (3)$$
Our optimization objective is therefore defined as
$$\min_{f \in \mathbb{R}^{n_1 \times \cdots \times n_J}} \; \ell_O(f) + \frac{\gamma}{2}\,\|f\|^2_{P_\kappa} \qquad (4)$$
where $\ell_O(\cdot)$ is a loss function to be defined later (Section 4), $O$ is the set of training tuples, and $\gamma$ is a tuning parameter controlling the strength of the graph regularization.

## 3. Convex Approximation

The computational bottleneck of optimization (4) lies in evaluating $\|f\|^2_{P_\kappa}$ and its first-order derivative, due to the extremely large size of $P_\kappa$. In Section 3.1 we first identify the computational bottleneck of the exact formulation, based on which we propose our convex approximation scheme in Section 3.2; the approximation reduces the time complexity of evaluating the semi-norm $\|f\|^2_{P_\kappa}$ from $O\big((\sum_j n_j)\prod_j n_j\big)$ to $O\big(\prod_j d_j\big)$, where $d_j \ll n_j$ for $j = 1, \ldots, J$.

### 3.1. Complexity of the Exact Formulation

The brute-force evaluation of $\|f\|^2_{P_\kappa}$ according to (3) costs $O\big((\prod_j n_j)^2\big)$, as one has to evaluate $O\big(\prod_j n_j\big)$ terms inside the summation, where each term costs $O\big(\prod_j n_j\big)$. However, redundancies exist, and the minimum complexity of the exact evaluation is given as follows.

Proposition 1. The exact evaluation of the semi-norm $\|f\|_{P_\kappa}$ takes $O\big((\sum_j n_j)\prod_j n_j\big)$ flops.

Proof. Notice that the collection of all numerators in (3), namely $\big\{f\big(v^{(1)}_{i_1}, \ldots, v^{(J)}_{i_J}\big)\big\}_{i_1, \ldots, i_J}$, is a tensor in $\mathbb{R}^{n_1 \times \cdots \times n_J}$ that can be precomputed via
$$f \times_1 V^{(1)\top} \times_2 V^{(2)\top} \cdots \times_J V^{(J)\top} \qquad (5)$$
where $\times_j$ stands for the $j$-mode product between a tensor in $\mathbb{R}^{n_1 \times \cdots \times n_j \times \cdots \times n_J}$ and a matrix $V^{(j)} \in \mathbb{R}^{n_j \times n_j}$. The conclusion follows as the $j$-th mode product in (5) takes $O\big(n_j \prod_j n_j\big)$ flops, and one has to do this for each $j = 1, \ldots, J$. When $J = 2$, (5) reduces to the multiplication of three matrices, $V^{(1)\top} f V^{(2)}$, at a complexity of $O\big((n_1 + n_2)\, n_1 n_2\big)$.
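The following is a small numerical check of Proposition 1 for $J = 2$ (our own sketch; the normalized random graphs and the choice of $\kappa$ are illustrative assumptions): evaluating Eq. (3) term by term and evaluating it through the precomputed matrix of Eq. (5) give the same value, but the latter needs only $O\big((n_1 + n_2)\,n_1 n_2\big)$ flops.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_adj(n):
    """A random symmetrically normalized adjacency matrix (eigenvalues in [-1, 1])."""
    A = rng.random((n, n)); A = (A + A.T) / 2
    D = np.diag(1 / np.sqrt(A.sum(1)))
    return D @ A @ D

n1, n2 = 6, 5
G1, G2 = norm_adj(n1), norm_adj(n2)
l1, V1 = np.linalg.eigh(G1)
l2, V2 = np.linalg.eigh(G2)
kappa = lambda x, y: np.exp(x + y)               # any positive coupling works for the check

f = rng.standard_normal((n1, n2))

# Brute force, Eq. (3): n1*n2 terms, each costing O(n1*n2).
brute = sum((V1[:, i] @ f @ V2[:, j]) ** 2 / kappa(l1[i], l2[j])
            for i in range(n1) for j in range(n2))

# Proposition 1: precompute all numerators at once as V1^T f V2, then one cheap sum.
F = V1.T @ f @ V2                                # Eq. (5) for J = 2
fast = np.sum(F ** 2 / kappa(l1[:, None], l2[None, :]))

assert np.isclose(brute, fast)
```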
### 3.2. Approximation via Tucker Form

Equation (5) implies that the key to complexity reduction is to reduce the cost of the $j$-mode multiplications with the $V^{(j)}$'s. Such a multiplication costs $O\big(n_j \prod_j n_j\big)$ in general, but can be carried out more efficiently if $f$ is structured. Our solution is twofold. First, we include only the top $d_j$ eigenvectors in $V^{(j)}$ for each graph $G^{(j)}$, where $d_j \ll n_j$; hence each $V^{(j)}$ becomes a thin matrix in $\mathbb{R}^{n_j \times d_j}$. Second, we restrict the tensor $f$ to lie within the linear span of the top $\prod_{j=1}^J d_j$ eigenvectors of the product graph $P_\kappa$:
$$f = \sum_{k_1, \ldots, k_J = 1}^{d_1, \ldots, d_J} \alpha_{k_1, \ldots, k_J} \bigotimes_j v^{(j)}_{k_j} \qquad (6)$$
$$= \alpha \times_1 V^{(1)} \times_2 V^{(2)} \times_3 \cdots \times_J V^{(J)} \qquad (7)$$
The tensor of combination coefficients $\alpha \in \mathbb{R}^{d_1 \times \cdots \times d_J}$ is known as the core tensor of the Tucker decomposition. In the case where $J = 2$, the above is equivalent to saying that $f \in \mathbb{R}^{n_1 \times n_2}$ is a low-rank matrix parametrized by $\alpha \in \mathbb{R}^{d_1 \times d_2}$ such that $f = \sum_{k_1, k_2} \alpha_{k_1, k_2} v^{(1)}_{k_1} v^{(2)\top}_{k_2} = V^{(1)} \alpha V^{(2)\top}$.

Combining (6) with the orthogonality of the eigenvectors leads to the fact that $f\big(v^{(1)}_{k_1}, \ldots, v^{(J)}_{k_J}\big) = \alpha_{k_1, \ldots, k_J}$. To see this for $J = 2$, notice that $f\big(v^{(1)}_{k_1}, v^{(2)}_{k_2}\big) = v^{(1)\top}_{k_1} V^{(1)} \alpha V^{(2)\top} v^{(2)}_{k_2} = e_{k_1}^\top \alpha\, e_{k_2} = \alpha_{k_1, k_2}$. Therefore the semi-norm in (2) can be simplified as
$$\|f\|^2_{P_\kappa} = \|\alpha\|^2_{P_\kappa} = \sum_{k_1, \ldots, k_J = 1}^{d_1, \ldots, d_J} \frac{\alpha^2_{k_1, \ldots, k_J}}{\kappa\big(\lambda^{(1)}_{k_1}, \ldots, \lambda^{(J)}_{k_J}\big)} \qquad (8)$$
Comparing (8) with (3), the number of terms inside the summation is reduced from $O\big(\prod_j n_j\big)$ to $O\big(\prod_j d_j\big)$, with $d_j \ll n_j$. In addition, the cost of evaluating each term inside the summation is reduced from $O\big(\prod_j n_j\big)$ to $O(1)$.

Figure 2. An illustration of the eigenvectors of $G^{(1)}$, $G^{(2)}$ and $P(G^{(1)}, G^{(2)})$. The leading nontrivial eigenvectors of $G^{(1)}$ and $G^{(2)}$ are denoted by blue and red curves, respectively. The induced leading nontrivial eigenvectors of $P(G^{(1)}, G^{(2)})$ are illustrated in 3D. If $G^{(1)}$ and $G^{(2)}$ are symmetrically normalized, their eigenvectors (corresponding to eigenvectors of the graph Laplacian) will be ordered by smoothness w.r.t. the graph structures. As a result, the eigenvectors of $P(G^{(1)}, G^{(2)})$ will also be ordered by smoothness.

Denoting by $V^{(j)}_{i_j} \in \mathbb{R}^{d_j}$ the $i_j$-th row of $V^{(j)}$, we obtain the following optimization by replacing $f$ with $\alpha$ in (4):
$$\min_{\alpha \in \mathbb{R}^{d_1 \times \cdots \times d_J}} \; \ell_O(f) + \frac{\gamma}{2}\,\|\alpha\|^2_{P_\kappa} \quad \text{s.t.} \quad f = \alpha \times_1 V^{(1)} \times_2 \cdots \times_J V^{(J)} \qquad (9)$$
The optimization above has an intuitive interpretation. In principle, it is natural to emphasize bases of $f$ that are smooth w.r.t. the manifold structure of $P_\kappa$, and to de-emphasize those that are nonsmooth, in order to obtain a parsimonious hypothesis with strong generalization ability. We claim this is exactly the role of the regularizer (8). To see this, note that any nonsmooth basis $\bigotimes_j v^{(j)}_{k_j}$ of $P_\kappa$ is likely to be associated with a small eigenvalue $\kappa\big(\lambda^{(1)}_{k_1}, \ldots, \lambda^{(J)}_{k_J}\big)$ (illustrated in Fig. 2). The conclusion follows by noticing that $\alpha_{k_1, \ldots, k_J}$ is essentially the activation strength of $\bigotimes_j v^{(j)}_{k_j}$ in $f$ (implied by (6)), and that (8) gives any $\alpha_{k_1, \ldots, k_J}$ associated with a small $\kappa\big(\lambda^{(1)}_{k_1}, \ldots, \lambda^{(J)}_{k_J}\big)$ a stronger penalty.

(9) is a convex optimization problem over $\alpha$ for any convex $\ell_O(\cdot)$. Spectral approximation techniques for graph-based learning have been found successful in standard classification tasks (Fergus et al., 2009), which are special cases of our framework with $J = 1$. We introduce this technique for multi-relational learning, where it is particularly desirable as the complexity reduction is much more significant in high-order cases ($J \geq 2$).

While $f$ in (6) is assumed to be in the Tucker form, other low-rank tensor representation schemes are potentially applicable. E.g., the Candecomp/Parafac (CP) form further restricts $\alpha$ to be diagonal, which is more aggressive but substantially less expressive. The Tensor-Train decomposition (Oseledets, 2011) offers an alternative representation scheme in between Tucker and CP, but the resulting optimization problem suffers from non-convexity.
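The sketch below (ours; the dimensions, random graphs and choice of $\kappa$ are illustrative assumptions) shows the Tucker-form parametrization of Eq. (6) for $J = 3$: only the top-$d_j$ eigenpairs of each graph are kept, the regularizer of Eq. (8) becomes an elementwise sum over the small core tensor, and a single prediction $f_{i_1, i_2, i_3}$ is a contraction of the core with one row of each $V^{(j)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = [40, 30, 20], [5, 4, 3]                  # graph sizes n_j and kept ranks d_j << n_j
kappa = lambda x, y, z: np.exp(x*y + y*z + x*z) # the "Exponential" coupling used in Section 5

V, lam = [], []
for nj, dj in zip(n, d):
    A = rng.random((nj, nj)); A = (A + A.T) / 2
    Dm = np.diag(1 / np.sqrt(A.sum(1)))
    G = Dm @ A @ Dm                             # symmetric normalization (eigenvalues in [-1, 1])
    w, U = np.linalg.eigh(G)
    lam.append(w[-dj:]); V.append(U[:, -dj:])   # top-d_j eigenpairs (thin V^(j))

# kappa over the retained spectrum, shape (d1, d2, d3).
K = np.array([[[kappa(a, b, c) for c in lam[2]] for b in lam[1]] for a in lam[0]])

alpha = rng.standard_normal(tuple(d))           # Tucker core tensor, Eq. (6)
reg = np.sum(alpha ** 2 / K)                    # spectral regularizer, Eq. (8)

# One prediction f_{i1,i2,i3} = alpha x_1 V^(1)_{i1} x_2 V^(2)_{i2} x_3 V^(3)_{i3}.
i1, i2, i3 = 7, 3, 11
f_i = np.einsum('abc,a,b,c->', alpha, V[0][i1], V[1][i2], V[2][i3])
print(reg, f_i)
```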
## 4. Optimization

Let $(x)_+ = \max(0, 1 - x)$ be the shorthand for the hinge loss. We define $\ell_O(f)$ to be the ranking $\ell_2$-hinge loss
$$\ell_O(f) = \frac{1}{|O|\,|\bar{O}|} \sum_{(i_1, \ldots, i_J) \in O} \; \sum_{(i'_1, \ldots, i'_J) \in \bar{O}} \big(f_{i_1 \ldots i_J} - f_{i'_1 \ldots i'_J}\big)_+^2 \qquad (10)$$
where $\bar{O}$ is the complement of $O$ w.r.t. all possible multi-relations. Eq. (10) encourages the valid tuples in our training set $O$ to be ranked higher than the corrupted ones in $\bar{O}$, and is known to be a surrogate of AUC.

We use stochastic gradient descent for optimization, as $|\bar{O}|$ is usually large. In each iteration, a random valid multi-relation $(i_1, \ldots, i_J)$ is uniformly drawn from $O$ and a random corrupted multi-relation $(i'_1, \ldots, i'_J)$ is uniformly drawn from $\bar{O}$. The associated noisy gradient is computed as
$$\partial\alpha = 2\,(\delta - 1)\left(\frac{\partial f_{i_1, \ldots, i_J}}{\partial \alpha} - \frac{\partial f_{i'_1, \ldots, i'_J}}{\partial \alpha}\right)\mathbb{1}[\delta < 1] + \gamma\, \alpha \oslash \kappa \qquad (11)$$
where $\delta := f_{i_1, \ldots, i_J} - f_{i'_1, \ldots, i'_J}$, where we abuse notation by defining the tensor $\kappa \in \mathbb{R}^{d_1 \times \cdots \times d_J}$ with $\kappa_{k_1, \ldots, k_J} := \kappa\big(\lambda^{(1)}_{k_1}, \ldots, \lambda^{(J)}_{k_J}\big)$, and where $\oslash$ denotes the element-wise division between tensors. The gradient w.r.t. $\alpha$ in (11) is
$$\frac{\partial f_{i_1, \ldots, i_J}}{\partial \alpha} = \frac{\partial \big(\alpha \times_1 V^{(1)}_{i_1} \times_2 \cdots \times_J V^{(J)}_{i_J}\big)}{\partial \alpha} = \bigotimes_j V^{(j)}_{i_j} \in \mathbb{R}^{d_1 \times \cdots \times d_J} \qquad (13)$$
Each SGD iteration costs $O\big(\prod_j d_j\big)$ flops, which is independent of $n_1, n_2, \ldots, n_J$. After obtaining the solution $\hat{\alpha}(\kappa)$ of optimization (9) for any given SGP $P_\kappa$, our final predictions $\hat{f}(\kappa)$ can be recovered via (6).

Following (Duchi et al., 2011), we allow adaptive step sizes for each element of $\alpha$. That is, in the $t$-th iteration we use $\eta^{(t)}_{k_1, \ldots, k_J} = \eta_0 \big/ \sqrt{\sum_{\tau=1}^{t} \big(\partial\alpha^{(\tau)}_{k_1, \ldots, k_J}\big)^2}$ as the step size for $\alpha_{k_1, \ldots, k_J}$, where the $\partial\alpha^{(\tau)}_{k_1, \ldots, k_J}$ are the historical gradients associated with $\alpha_{k_1, \ldots, k_J}$ and $\eta_0$ is the initial step size (set to 1). The strategy is particularly efficient with highly redundant gradients, which is our case: according to (11) and (13), the gradient is a regularized rank-2 tensor.

In practice (especially for large $J$), the computational cost of the tensor operations involving $\bigotimes_{j=1}^J V^{(j)}_{i_j} \in \mathbb{R}^{d_1 \times \cdots \times d_J}$ is not negligible even if $d_1, d_2, \ldots, d_J$ are small. Fortunately, such medium-sized tensor operations in our algorithm are highly parallelizable on a GPU. The pseudocode for our optimization algorithm is summarized in Alg. 1.

Algorithm 1: Transductive Learning over Product Graph (TOP)

    foreach j in {1, ..., J} do
        {v^(j)_k, λ^(j)_k}_{k=1..d_j} ← APPROX_EIGEN(G^(j))
    foreach (k_1, ..., k_J) in [d_1] × ... × [d_J] do
        κ_{k_1,...,k_J} ← κ(λ^(1)_{k_1}, ..., λ^(J)_{k_J})
    α ← 0, Z ← 0
    while not converged do
        (i_1, ..., i_J) ~ uniform(O);  (i'_1, ..., i'_J) ~ uniform(Ō)
        f_{i_1,...,i_J}   ← α ×_1 V^(1)_{i_1} ×_2 ... ×_J V^(J)_{i_J}
        f_{i'_1,...,i'_J} ← α ×_1 V^(1)_{i'_1} ×_2 ... ×_J V^(J)_{i'_J}
        δ ← f_{i_1,...,i_J} − f_{i'_1,...,i'_J}
        if δ < 1 then
            ∂α ← 2(δ − 1) (⊗_j V^(j)_{i_j} − ⊗_j V^(j)_{i'_j}) + γ α ⊘ κ    // Eq. (11)
            Z ← Z + (∂α)²
            α ← α − η_0 Z^{−1/2} ⊙ ∂α
    return α
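For concreteness, here is a minimal Python sketch of one SGD step mirroring Algorithm 1 for $J = 2$ (our own illustration, not the authors' code; the eigen-feature stand-ins, the assumed-positive $\kappa$ tensor and the hyperparameter values are assumptions, and the $2(\delta - 1)$ scaling follows from Eq. (10) as reconstructed above).

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 5, 4
V = [rng.standard_normal((30, d1)), rng.standard_normal((20, d2))]  # thin eigenvector stand-ins
K = np.abs(rng.standard_normal((d1, d2))) + 0.5                     # elementwise kappa (assumed > 0)
alpha, Z = np.zeros((d1, d2)), np.zeros((d1, d2))

def sgd_step(alpha, Z, V, K, pos, neg, gamma=0.1, eta0=1.0):
    """One TOP update for J = 2: ranking squared hinge + AdaGrad-style scaling."""
    (i1, i2), (j1, j2) = pos, neg
    r1, r2 = V[0][i1], V[1][i2]                  # eigen-features of the observed tuple
    s1, s2 = V[0][j1], V[1][j2]                  # eigen-features of the corrupted tuple
    delta = r1 @ alpha @ r2 - s1 @ alpha @ s2    # f_pos - f_neg
    if delta < 1:                                # the squared hinge of Eq. (10) is active
        grad = 2 * (delta - 1) * (np.outer(r1, r2) - np.outer(s1, s2)) \
               + gamma * alpha / K               # noisy gradient, Eq. (11)
        Z += grad ** 2                           # accumulator for adaptive step sizes
        alpha -= eta0 * grad / np.sqrt(Z + 1e-12)
    return alpha, Z

alpha, Z = sgd_step(alpha, Z, V, K, pos=(3, 7), neg=(11, 2))
```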
## 5. Experiments

### 5.1. Datasets

We evaluate our method on real-world data in two different domains: the Enzyme dataset (Yamanishi et al., 2008) for compound-protein interaction and the DBLP dataset of scientific publication records. Fig. 3 illustrates their heterogeneous objects and relational structures.

Figure 3. The heterogeneous types of objects (the circles) and the relational structures in the Enzyme (left) and DBLP (right) datasets. The blue edges represent the within-graph relations and the red edges represent the cross-graph interactions. The corresponding tuples in Enzyme are of the form (Compound, Protein), and in DBLP of the form (Author, Paper, Venue).

The Enzyme dataset has been used for modeling and predicting drug-target interactions; it contains a graph of 445 chemical compounds (drugs) and a graph of 664 proteins (targets). The prediction task is to label the unknown compound-protein interactions based on both graph structures and a small set of 2,926 known interactions. The graph of compounds is constructed based on the SIMCOMP score (Hattori et al., 2003), and the graph of proteins is constructed based on the normalized Smith-Waterman score (Smith & Waterman, 1981). While both graphs are provided in dense form, we converted them into sparse kNN graphs where each vertex is connected with its top 1% neighbors.

As for the DBLP dataset, we use a subset of 34,340 DBLP publication records in the domain of Artificial Intelligence (Tang et al., 2008), from which 3 graphs are constructed as follows:

- For the author graph ($G^{(1)}$) we draw an edge between two authors if they have coauthored an overlapping set of papers, and remove the isolated authors using a DFS algorithm. We then obtain a symmetric kNN graph by connecting each author with her top 0.5% nearest neighbors, using the count of co-authored papers as the proximity measure. The resulting graph has 5,517 vertices with 17 links per vertex on average.
- For the paper graph ($G^{(2)}$) we connect two papers if both of them cite another paper, or are cited by another paper. As with $G^{(1)}$, we remove isolated papers using DFS and construct a symmetric 0.5%-NN graph. To measure the similarity of any given pair of papers, we represent each paper as a bag-of-citations and compute their cosine similarity. The resulting graph has 11,879 vertices and an average degree of 50.
- For the venue graph ($G^{(3)}$) we connect two venues if they share a similar research focus. The venue-venue similarity is measured by the total number of cross-citations in between, normalized by the sizes of the two venues involved. The symmetric venue graph has 22 vertices and an average degree of 7.

Tuples of the form (Author, Paper, Venue) are extracted from the publication records, and there are 15,514 tuples (cross-graph interactions) after preprocessing.
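The dense-to-sparse conversion described above can be sketched as follows. This is our own illustration of the top-k-percent construction (the function name, the random similarity matrix and the tie/asymmetry handling are assumptions), not the authors' preprocessing code; here an edge is kept if either endpoint selects it.

```python
import numpy as np

def knn_graph(S, frac=0.01):
    """Sparsify a dense similarity matrix S into a symmetric kNN adjacency matrix,
    keeping each vertex's top `frac` fraction of neighbors (1% for Enzyme, 0.5% for DBLP)."""
    S = np.array(S, dtype=float, copy=True)
    n = S.shape[0]
    k = max(1, int(round(frac * n)))
    np.fill_diagonal(S, -np.inf)                 # exclude self-similarity
    A = np.zeros_like(S)
    for i in range(n):
        nbrs = np.argpartition(-S[i], k - 1)[:k] # indices of the k most similar vertices
        A[i, nbrs] = S[i, nbrs]
    return np.maximum(A, A.T)                    # symmetrize: keep an edge if either side keeps it

# Example: a random dense similarity matrix for 445 "compounds".
rng = np.random.default_rng(0)
S = rng.random((445, 445)); S = (S + S.T) / 2
G = knn_graph(S, frac=0.01)                      # each vertex keeps roughly its top 1% neighbors
```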
### 5.2. Methods for Comparison

Transductive Learning over Product Graph (TOP). The proposed method. We explore the following choices of $\kappa$ for parametrizing the spectral graph product:

| Name | $\kappa(x, y)$ ($J = 2$) | $\kappa(x, y, z)$ ($J = 3$) |
|---|---|---|
| Tensor | $xy$ | $xyz$ |
| Cartesian | $x + y$ | $x + y + z$ |
| Exponential | $e^{x+y}$ | $e^{xy + yz + xz}$ |

Tensor Factorization (TF) and Graph-regularized TF (GRTF). In TF we factorize $f \in \mathbb{R}^{n_1 \times \cdots \times n_J}$ into a set of dimensionality-reduced latent factors: a core tensor $C \in \mathbb{R}^{d_1 \times \cdots \times d_J}$ and matrices $U_1 \in \mathbb{R}^{n_1 \times d_1}, \ldots, U_J \in \mathbb{R}^{n_J \times d_J}$. In GRTF, we further enhance traditional TF by adding graph regularizations to the objective function, which force the model to be aware of the context information in the $G^{(j)}$'s (Narita et al., 2012; Cai et al., 2011).

One-class Nearest Neighbor (NN). We score each tuple $(i_1, \ldots, i_J)$ in the test set with $\hat{f}(i_1, \ldots, i_J) = \max_{(i'_1, \ldots, i'_J) \in O} \prod_{j=1}^J G^{(j)}_{i_j, i'_j}$. That is, we assume the tuple-tuple similarity can be factorized as the product of vertex-level similarities across the different graphs. We experimented with several other similarity measures and empirically found that the multiplicative similarity leads to the best overall performance. Note that NN does not rely on the presence of any negative examples.

Ranking Support Vector Machines (RSVM) (Joachims, 2002). For the task of completing the missing paper in (Author, ?, Venue), we use a learning-to-rank strategy by treating (Author, Venue) as the query and Paper as the document to be retrieved. The query feature is constructed by concatenating the eigen-features of Author and Venue, where we define the eigen-feature of vertex $i_j$ in graph $j$ as $V^{(j)}_{i_j} \in \mathbb{R}^{d_j}$. The feature for each query-document pair is obtained by taking the tensor product of the query feature and the document eigen-feature.

Low-rank Tensor Kernel Machines (LTKM). Traditional tensor-based kernel construction methods for tuples suffer from poor scalability. We therefore speed them up by replacing each individual kernel with its low-rank approximation before taking the tensor product, leading to a low-rank kernel over tuples which allows more efficient optimization routines.

For fair comparison, the loss functions for TF, GRTF, RSVM and LTKM are set to be exactly the same as that of TOP, i.e., Eq. (10). All algorithms are trained using mini-batched stochastic gradient descent. We use the same eigensystems (eigenvectors and eigenvalues) of the $G^{(j)}$'s as the input for TOP, RSVM and LTKM. The number of top eigenvalues/eigenvectors $d_j$ for graph $j$ is chosen such that $\lambda^{(j)}_1, \ldots, \lambda^{(j)}_{d_j}$ approximately cover 80% of the total spectral energy of $G^{(j)}$. With respect to this criterion, we choose $d_1 = 1{,}281$, $d_2 = 2{,}170$, $d_3 = 6$ for DBLP, and $d_1 = 150$, $d_2 = 159$ for Enzyme.

### 5.3. Experiment Setups

For both datasets, we randomly sample one third of the known interactions for training (denoted by $O$), one third for validation, and use the remaining ones for testing. Known interactions in the test set, denoted by $T$, are treated as positive examples. All tuples not in $T$, denoted by $\bar{T}$, are treated as negative. Tuples that are already in $O$ are removed from $\bar{T}$ to avoid misleading results (Bordes et al., 2013).

We measure algorithm performance on Enzyme based on the quality of the inferred target proteins given each compound, namely by the ability to complete (Compound, ?). For DBLP, the performance is measured by the quality of the inferred papers given author and venue, namely by the ability to complete (Author, ?, Venue). We use Mean Average Precision (MAP), Area Under the Curve (AUC) and Hits at Top 5 (Hits@5) as our evaluation metrics.

### 5.4. Results

Fig. 4 compares the results of TOP with various parameterizations of the spectral graph product (SGP). Among those, the Exponential $\kappa$ works better on average.

Figure 4. Performance (AUC, MAP, Hits@5) of TOP with different SGPs (Tensor, Cartesian, Exponential).

Figs. 5 and 6 show the main results, comparing TOP (with the Exponential $\kappa$) with the other representative baselines. Clearly, TOP outperforms all the other methods on both datasets in all the evaluation metrics of MAP², AUC and Hits@5. (² MAP scores for random guessing are 0.014 on Enzyme and 0.00072 on DBLP, respectively.)

Figure 5. Test-set performance of different methods on Enzyme, for training sizes of 12.5%, 25%, 50% and 100%.

Figure 6. Test-set performance of different methods on DBLP, for training sizes of 12.5%, 25%, 50% and 100%.

Fig. 7 shows the performance curves of TOP on Enzyme over different model sizes (obtained by varying the $d_j$'s). With a relatively small model size compared with using the full spectrum, TOP's performance converges to the optimal point.

Figure 7. Performance of TOP (MAP, AUC, Hits@5; performance preservation in %) vs. model size (%) on Enzyme.
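For concreteness, the sketch below shows one way to compute the three reported metrics for a single completion query (e.g., ranking all candidate papers for one (Author, ?, Venue) pair). The exact evaluation protocol (tie handling, macro- vs. micro-averaging over queries) is not spelled out in the text, so treat this as an assumption-laden illustration rather than the authors' evaluation code; MAP would be the mean of the per-query average precision.

```python
import numpy as np

def ranking_metrics(scores, positives, k=5):
    """AUC, average precision and Hits@k for one query.
    scores:    predicted score per candidate tuple.
    positives: boolean mask marking the true (test-set) tuples among the candidates."""
    positives = np.asarray(positives, dtype=bool)
    n_pos, n_neg = positives.sum(), (~positives).sum()
    # AUC via the rank-sum (Mann-Whitney) formula, ignoring ties.
    ranks = np.empty(len(scores)); ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    auc = (ranks[positives].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    # Average precision: mean of precision@r over the ranks r of the positives.
    rel = positives[np.argsort(-scores)]
    ap = np.mean(np.cumsum(rel)[rel] / (np.flatnonzero(rel) + 1))
    # Hits@k: whether any true tuple appears among the top-k ranked candidates.
    hits = bool(rel[:k].any())
    return auc, ap, hits

rng = np.random.default_rng(0)
scores = rng.standard_normal(1000)
positives = np.zeros(1000, dtype=bool); positives[rng.choice(1000, 3, replace=False)] = True
print(ranking_metrics(scores, positives))
```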
## 6. Concluding Remarks

The paper presents a novel convex optimization framework for transductive CGRL and a scalable algorithmic solution with a guaranteed global optimum and a time complexity that does not depend on the sizes of the input graphs. Our experiments on multi-graph datasets provide strong evidence for the superior power of the proposed approach in modeling cross-graph inference and in large-scale optimization.

## Acknowledgements

We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grants IIS-1216282, IIS-1350364, and IIS-1546329.

## References

Basilico, Justin and Hofmann, Thomas. Unifying collaborative and content-based filtering. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 9. ACM, 2004.

Ben-Hur, Asa and Noble, William Stafford. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(suppl 1):i38–i46, 2005.

Bordes, Antoine, Usunier, Nicolas, Garcia-Duran, Alberto, Weston, Jason, and Yakhnenko, Oksana. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.

Cai, Deng, He, Xiaofei, Han, Jiawei, and Huang, Thomas S. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Fergus, Rob, Weiss, Yair, and Torralba, Antonio. Semi-supervised learning in gigantic image collections. In Advances in Neural Information Processing Systems, pp. 522–530, 2009.

Getoor, Lise. Introduction to Statistical Relational Learning. MIT Press, 2007.

Hattori, Masahiro, Okuno, Yasushi, Goto, Susumu, and Kanehisa, Minoru. Heuristics for chemical compound matching. Genome Informatics, 14:144–153, 2003.

Joachims, Thorsten. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. ACM, 2002.

Kolda, Tamara G and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Lavrac, Nada and Dzeroski, Saso. Inductive logic programming. In WLP, pp. 146–160. Springer, 1994.

Liu, Hanxiao and Yang, Yiming. Bipartite edge prediction via transductive learning over product graphs. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1880–1888, 2015.

Narita, Atsuhiro, Hayashi, Kohei, Tomioka, Ryota, and Kashima, Hisashi. Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery, 25(2):298–324, 2012.

Nickel, Maximilian, Tresp, Volker, and Kriegel, Hans-Peter. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 809–816, 2011.

Oseledets, Ivan V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

Rendle, Steffen, Balby Marinho, Leandro, Nanopoulos, Alexandros, and Schmidt-Thieme, Lars. Learning optimal ranking with tensor factorization for tag recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736. ACM, 2009.
Smith, Temple F and Waterman, Michael S. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.

Tang, Jie, Zhang, Jing, Yao, Limin, Li, Juanzi, Zhang, Li, and Su, Zhong. ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM, 2008.

Yamanishi, Yoshihiro, Araki, Michihiro, Gutteridge, Alex, Honda, Wataru, and Kanehisa, Minoru. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):i232–i240, 2008.

Yu, Kai and Chu, Wei. Gaussian process models for link analysis and transfer learning. In Advances in Neural Information Processing Systems, pp. 1657–1664, 2008.

Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pp. 912–919, 2003.