# On Multi-Relational Link Prediction with Bilinear Models

Yanjie Wang (University of Mannheim, Germany) ywang@uni-mannheim.de
Rainer Gemulla (University of Mannheim, Germany) rgemulla@uni-mannheim.de
Hui Li (The University of Hong Kong) hli2@cs.hku.hk

We study bilinear embedding models for the task of multi-relational link prediction and knowledge graph completion. Bilinear models belong to the most basic models for this task: they are comparably efficient to train and use, and they can provide good prediction performance. The main goal of this paper is to explore the expressiveness of and the connections between various bilinear models proposed in the literature. In particular, a substantial number of models can be represented as bilinear models with certain additional constraints enforced on the embeddings. We explore whether or not these constraints lead to universal models, which can in principle represent every set of relations, and whether or not there are subsumption relationships between various models. We report results of an independent experimental study that evaluates recent bilinear models in a common experimental setup. Finally, we provide evidence that relation-level ensembles of multiple bilinear models can achieve state-of-the-art prediction performance.

1 Introduction

Multi-relational link prediction is the task of predicting missing links in an edge-labeled graph. We focus on, and use the terminology of, knowledge base completion throughout. Large-scale knowledge bases (KBs) such as DBpedia (Lehmann et al. 2015) or YAGO (Rebele et al. 2016) contain millions of entities and facts, but they are nevertheless far from being complete (Nickel et al. 2016). Given a set of entities (vertices) and relations (edge labels) that hold between these entities, the goal of multi-relational link prediction (Bordes et al. 2013) is to determine whether or not some entity $e_1$ links to some entity $e_2$ via a relation $R$, i.e., whether the fact $R(e_1, e_2)$ is true.

Embedding models have recently received considerable attention for knowledge base completion tasks (Bordes et al. 2013; Nickel, Rosasco, and Poggio 2016; Trouillon et al. 2016a). Such models embed both entities and relations into a low-dimensional latent space such that the structure of the knowledge base is (largely) maintained. The embeddings are subsequently used to predict missing facts or to detect erroneous facts.

The perhaps most basic class of embedding models is given by bilinear models. Such models predict a score for each fact $R(e_1, e_2)$ by computing a weighted sum of the pairwise interactions between the entity embeddings of $e_1$ and $e_2$, where the weights depend on $R$. The scores are used to rank (pairs of) entities according to their predicted truthfulness. Bilinear models are comparably efficient to train and use, and they can provide good prediction performance (Trouillon and Nickel 2017). A large number of bilinear models has been proposed in the literature, including RESCAL (Nickel, Tresp, and Kriegel 2011), TransE (Bordes et al. 2013), DISTMULT (Yang et al. 2014), HolE (Nickel, Rosasco, and Poggio 2016), and ComplEx (Trouillon et al. 2016a). There is, however, little work on the expressiveness of and the connections between these models.
In this paper, we argue that all of the aforementioned models can be seen as bilinear models subject to certain constraints. We study whether and under which conditions each model is universal in that it can represent every possible set of relation instances (or, more precisely, entity rankings). We also explore the size of the embeddings needed for universality and derive upper bounds on the embedding size needed to obtain embeddings consistent with a given dataset. We establish a number of subsumption relationships between various models by giving explicit constructions of how to transform instances of one model into instances of another model (sometimes with a different embedding size). A summary of our results is given in Tab. 1.

We report on an independent experimental study that compared various bilinear models on standard datasets in a common experimental setup. We found that the relative performance among the models is highly relation-dependent. We thus propose a simple relation-level ensemble of multiple bilinear models, which according to our experiments significantly and consistently improved prediction performance over individual models. In fact, we found that the ensemble performed competitively with state-of-the-art embedding approaches, whether bilinear or not.

2 Multi-Relational Link Prediction

Let $E$ and $R$ be a set of entities and relation names, resp. A knowledge base $K \subseteq E \times R \times E$ is a collection of triples $(i, k, j)$, where $i$, $j$, and $k$ refer to subject, object, and relation, resp. We denote by $K = |R| \geq 1$ and $N = |E| \geq 2$ the number of relations and entities, resp. We represent knowledge base $K$ via a binary tensor $X \in \{0,1\}^{N \times N \times K}$, where $x_{ijk} = 1$ if and only if $(i,k,j) \in K$. By convention, vectors $a_i$ refer to rows of matrix $A$ (as a column vector) and scalars $a_{ij}$ to individual entries. Given dimensionalities $r$ and $r'$, we denote by $e_{i,r}$ the $i$-th standard basis vector of $\mathbb{R}^r$, by $0_r$ the zero vector, and by $0_{r \times r'}$ the zero matrix of the respective shape. Finally, let $\operatorname{diag}(\cdot)$ refer to a block-diagonal matrix built from the arguments (a vector or a list of matrices).

2.1 Preliminaries

A score-based ranking model is a model $m$ that associates a score $s^m_k(i,j) \in \mathbb{R}$ with each subject-relation-object triple. Denote by $S^m_k \in \mathbb{R}^{N \times N}$ the corresponding scoring matrix for relation $k$, i.e., $[S^m_k]_{ij} = s^m_k(i,j)$. Denote by $S^m \in \mathbb{R}^{N \times N \times K}$ the scoring tensor of $m$, i.e., the tensor with frontal slices $S^m_{(k)} = S^m_k$.

We are ultimately interested in rankings, not in scores. In particular, score-based models are used to rank (pairs of) entities by their predicted truthfulness, given a query of form $R(i, ?)$, $R(?, j)$, or $R(?, ?)$. Generally, a result with a higher score is considered more likely to be correct. We say that an $N \times N$ matrix is a ranking matrix if all its entries are in $\{1, 2, \ldots, N^2\}$ and, whenever there is an entry with value $s > 1$, there is at least one other entry with value $s - 1$. Denote by $\pi(S)$ the unique ranking matrix associated with scoring matrix $S$, where $\pi_{ij}(S) \stackrel{\text{def}}{=} [\pi(S)]_{ij}$ is the dense rank of $s_{ij}$ in the multiset of the entries of $S$. For every pair of tuples $(i,j)$ and $(i',j')$, we have $s_{ij} \geq s_{i'j'} \iff \pi_{ij}(S) \leq \pi_{i'j'}(S)$. For example,

$$\pi\begin{pmatrix} 0.2 & 2.4 & 1 \\ -1 & 4 & 2 \\ -3 & 0.2 & 0 \end{pmatrix} = \begin{pmatrix} 5 & 2 & 4 \\ 7 & 1 & 3 \\ 8 & 5 & 6 \end{pmatrix}.$$

In a slight abuse of notation, we overload $\pi$ to also apply to tensors, sets of matrices, and sets of tensors. In particular, the ranking tensor $\pi(S)$ for a scoring tensor $S$ is the $N \times N \times K$ tensor produced from $S$ by replacing every frontal slice $S_{(k)}$ with $\pi(S_{(k)})$.
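For illustration, the dense-rank operator can be computed directly. The following minimal Python sketch (ours, purely illustrative) reproduces the example above:

```python
import numpy as np

def ranking_matrix(S):
    """Dense-rank the entries of a scoring matrix: the largest score gets
    rank 1, ties share a rank, and no rank is skipped."""
    distinct = np.unique(S)[::-1]                 # distinct scores, descending
    rank_of = {s: r + 1 for r, s in enumerate(distinct)}
    return np.vectorize(rank_of.get)(S)

S = np.array([[ 0.2, 2.4, 1.0],
              [-1.0, 4.0, 2.0],
              [-3.0, 0.2, 0.0]])
print(ranking_matrix(S))
# [[5 2 4]
#  [7 1 3]
#  [8 5 6]]
```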
Moreover, for any set $X$, set $\pi(X) = \{\, \pi(x) : x \in X \,\}$. Observe that $\pi(\mathbb{R}^{N \times N})$ corresponds to the set of all possible ranking matrices, $\pi(\mathbb{R}^{N \times N \times K})$ to the set of all possible ranking tensors, and that $\pi(P) = P$ for any ranking matrix (or ranking tensor) $P$.

2.2 Bilinear Models

Bilinear models are models whose scoring function $s_k(i,j)$ has form $a_i^T R_k a_j$, where $a_i, a_j \in \mathbb{R}^r$ and $R_k \in \mathbb{R}^{r \times r}$ are model parameters, referred to as the embeddings of entities $i$ and $j$ as well as of relation $k$, resp. We refer to $r \in \mathbb{N}$ as the size of the model. In this paper, we consider bilinear models as well as models that can be represented as bilinear models with an at most linear increase in model size. Although some of the models considered here may not look bilinear at first glance, we show that they are closely related to bilinear models. Throughout, we denote by $M^t$ (resp. $M^t_r$) the set of all models of type $t$ (resp. of type $t$ and size $r$).

RESCAL (Nickel, Tresp, and Kriegel 2011). An unconstrained bilinear model. Each model $m \in M^{\text{RESCAL}}_r$ is parameterized by an entity matrix $A \in \mathbb{R}^{N \times r}$ and $K$ relation matrices $R_1, \ldots, R_K \in \mathbb{R}^{r \times r}$. We have $s^m_k(i,j) = a_i^T R_k a_j$. RESCAL can be seen as an extension of the low-rank matrix factorization methods prominent in recommender systems to more than one relation.

DISTMULT (Yang et al. 2014). Each model $m \in M^{\text{DISTMULT}}_r$ is parameterized by an entity matrix $A \in \mathbb{R}^{N \times r}$ and a relation matrix $R \in \mathbb{R}^{K \times r}$. We have $s^m_k(i,j) = a_i^T \operatorname{diag}(r_k)\, a_j$. DISTMULT can be seen as a variant of RESCAL that puts a diagonality constraint on the relation matrices. Due to this constraint, it can only model symmetric relations. The model is equivalent to the INDSCAL tensor decomposition (Carroll and Chang 1970).

HolE (Nickel, Rosasco, and Poggio 2016). Each model $m \in M^{\text{HolE}}_r$ is parameterized by an entity matrix $A \in \mathbb{R}^{N \times r}$ and a relation matrix $R \in \mathbb{R}^{K \times r}$. We have $s^m_k(i,j) = r_k^T (a_i \star a_j)$, where $\star$ refers to the circular correlation between $a_i$ and $a_j$, i.e.,
$$(a_i \star a_j)_k = \sum_{t=1}^{r} a_{it}\, a_{j,((k+t-2 \bmod r)+1)}.$$
The idea of using circular correlation relates to associative memory (Nickel, Rosasco, and Poggio 2016). Hayashi and Shimbo (2017) provide an alternative viewpoint in terms of ComplEx, discussed next.

ComplEx (Trouillon et al. 2016a). Each model $m \in M^{\text{ComplEx}}_r$ is parameterized by an entity matrix $A \in \mathbb{C}^{N \times r}$ and a relation matrix $R \in \mathbb{C}^{K \times r}$. We have $s^m_k(i,j) = \operatorname{Re}(a_i^T \operatorname{diag}(r_k)\, \overline{a_j})$, where $\overline{a_j}$ denotes the complex conjugate of $a_j$ and $\operatorname{Re}(\cdot)$ extracts the real part of a complex number ($a_i^T \operatorname{diag}(r_k)\, \overline{a_j}$ is not necessarily real). ComplEx is superficially related to DISTMULT but uses complex-valued parameter matrices.

TransE (Bordes et al. 2013). Each model $m \in M^{\text{TransE}}_r$ is parameterized by an entity matrix $A \in \mathbb{R}^{N \times r}$ and a relation matrix $R \in \mathbb{R}^{K \times r}$. We have¹ $s^m_k(i,j) = -\|a_i + r_k - a_j\|_2^2$. In contrast to the models presented above, TransE is a translation-based model, not a factorization-based model. The use of translations, i.e., differences between entity embeddings, is inspired by word2vec's word analogy results (Mikolov et al. 2013). Note that TransE can also be used with the L1 norm instead of L2; we focus on the L2 variant given above throughout.

¹This definition differs from the original definition of TransE in that we negate all scores in order to rank larger scores higher.

3 Subsumption and Expressiveness

For a given class $M^t_r$ of models, denote by $\mathcal{M}^t_r = \{\, S^m : m \in M^t_r \,\}$ the set of scoring tensors that the model class can represent. Let $\mathcal{M}^t = \bigcup_{r \in \mathbb{N}^+} \mathcal{M}^t_r$. Note that $\pi(\mathcal{M}^t_r)$ and $\pi(\mathcal{M}^t)$ denote the sets of ranking tensors that can be represented by $M^t_r$ and $M^t$, respectively.
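Before turning to subsumption, the five scoring functions of Sec. 2.2 can be summarized in code. The following Python sketch (ours; a direct transcription of the definitions above, with numpy standing in for an actual training framework) computes the score of a single triple under each model:

```python
import numpy as np

def score_rescal(a_i, R_k, a_j):      # a_i, a_j: shape (r,); R_k: shape (r, r)
    return a_i @ R_k @ a_j

def score_distmult(a_i, r_k, a_j):    # R_k = diag(r_k)
    return np.sum(a_i * r_k * a_j)

def score_hole(a_i, r_k, a_j):        # circular correlation via the FFT
    corr = np.fft.irfft(np.conj(np.fft.rfft(a_i)) * np.fft.rfft(a_j), n=len(a_i))
    return r_k @ corr

def score_complex(a_i, r_k, a_j):     # complex inputs; conjugate on a_j
    return np.real(np.sum(a_i * r_k * np.conj(a_j)))

def score_transe(a_i, r_k, a_j):      # negated distance, so larger is better
    return -np.sum((a_i + r_k - a_j) ** 2)
```

Note that `score_distmult` is symmetric in `a_i` and `a_j`, whereas the conjugate in `score_complex` breaks this symmetry; this distinction reappears throughout the analysis below.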
3.1 Subsumption

We first explore subsumption relationships between different model classes as well as the size of the entity representations needed for a subsumption to hold. We assume throughout that the number $N \geq 2$ of entities and the number $K \geq 1$ of relations are arbitrary but fixed.

We say that class $M^{t_2}$ subsumes class $M^{t_1}$ whenever $\pi(\mathcal{M}^{t_1}) \subseteq \pi(\mathcal{M}^{t_2})$. In other words, $M^{t_2}$ is at least as expressive in terms of rankings as $M^{t_1}$. If $\pi(\mathcal{M}^{t_1}) \subsetneq \pi(\mathcal{M}^{t_2})$, we say that $M^{t_2}$ strictly subsumes $M^{t_1}$, indicating that $M^{t_2}$ is strictly more expressive than $M^{t_1}$. Note that it is good when $M^{t_2}$ is more expressive than $M^{t_1}$ because $M^{t_2}$ can in principle express more rankings. It can also be problematic, however, because efficient training and the avoidance of overfitting become more challenging. We first show subsumption by specifying an explicit model transformation, then strictness via a counterexample.

Theorem 1. For all $r \in \mathbb{N}^+$, $M^{\text{RESCAL}}_{2r+1}$ subsumes $M^{\text{TransE}}_r$.

Proof. Fix some $r \in \mathbb{N}^+$. Pick any TransE model $m_T \in M^{\text{TransE}}_r$, denote by $A \in \mathbb{R}^{N \times r}$ and $R \in \mathbb{R}^{K \times r}$ the corresponding parameter matrices, and by $S^{m_T}$ the scoring tensor. We show that $\pi(S^{m_T}) \in \pi(\mathcal{M}^{\text{RESCAL}}_{2r+1})$. We do this by explicitly constructing a corresponding RESCAL model $m_R \in M^{\text{RESCAL}}_{2r+1}$ by specifying its parameters $A' \in \mathbb{R}^{N \times (2r+1)}$ and $R'_k \in \mathbb{R}^{(2r+1) \times (2r+1)}$. Setting

$$a'_i = \begin{pmatrix} \mathbf{1}_r^T & a_i^T & a_i^T a_i \end{pmatrix}^T, \qquad R'_k = \begin{pmatrix} 0_{r \times r} & 2\operatorname{diag}(r_k) & -e_{1,r} \\ -2\operatorname{diag}(r_k) & 2 I_{r \times r} & 0_{r \times 1} \\ -e_{1,r}^T & 0_{1 \times r} & 0 \end{pmatrix} \tag{1}$$

(where $\mathbf{1}_r$ denotes the $r$-dimensional all-ones vector), we can now verify² that $s^{m_R}_k(i,j) \geq s^{m_R}_k(i',j') \iff s^{m_T}_k(i,j) \geq s^{m_T}_k(i',j')$; indeed, $s^{m_R}_k(i,j) = s^{m_T}_k(i,j) + \|r_k\|_2^2$, i.e., the scores agree up to a constant per relation. This implies that $m_T$ and $m_R$ agree on the ranking for each relation, i.e., $\pi(S^{m_T}) = \pi(S^{m_R})$. Since $m_R \in M^{\text{RESCAL}}_{2r+1}$, we obtain $\pi(S^{m_T}) \in \pi(\mathcal{M}^{\text{RESCAL}}_{2r+1})$ as claimed.

²A more detailed derivation can be found in the online appendix.

The proof above shows that TransE can be viewed as a bilinear model with the constraints specified in Eq. (1).

Theorem 2. $M^{\text{TransE}}$ does not subsume $M^{\text{RESCAL}}_r$ for any $r \geq 2$.

Note that the theorem implies that there are RESCAL models with $r = 2$ that cannot be expressed with any TransE model, no matter how large its size.

Proof. Fix some $r \geq 2$ and consider the RESCAL model $m_R \in M^{\text{RESCAL}}_r$ specified by parameters

$$a_i = \begin{cases} e_{i,r} & \text{for } i \in \{1,2\} \\ 0_r & \text{otherwise,} \end{cases} \qquad R_k = \begin{cases} \begin{pmatrix} 1 & 1 & 0_{1 \times (r-2)} \\ 1 & 0 & 0_{1 \times (r-2)} \\ 0_{(r-2) \times 1} & 0_{(r-2) \times 1} & 0_{(r-2) \times (r-2)} \end{pmatrix} & \text{for } k = 1 \\ 0_{r \times r} & \text{otherwise.} \end{cases}$$

We have $s^{m_R}_1(1,1) = 1$ and $s^{m_R}_1(2,2) = 0$. Thus $s^{m_R}_1(1,1) \neq s^{m_R}_1(2,2)$ and consequently $\pi_{11}(S^{m_R}_{(1)}) \neq \pi_{22}(S^{m_R}_{(1)})$. Now pick any TransE model $m_T \in M^{\text{TransE}}$, denote by $A$ and $R$ its parameters, and observe that $s^{m_T}_k(1,1) = s^{m_T}_k(2,2) = -\|r_k\|_2^2$. Thus $\pi_{11}(S^{m_T}_{(1)}) = \pi_{22}(S^{m_T}_{(1)})$. Since this holds for any TransE model, we conclude that $\pi(S^{m_R}) \notin \pi(\mathcal{M}^{\text{TransE}})$.

Nickel, Rosasco, and Poggio (2016) argued that HolE can be viewed as a compressed version of RESCAL and implicitly established the subsumption relationship to RESCAL. We present their argument formally below.

Theorem 3. $M^{\text{RESCAL}}_r$ subsumes $M^{\text{HolE}}_r$.

Proof. From the definition of HolE, we rewrite
$$r_k^T (a_i \star a_j) = \sum_{t=1}^{r} r_{kt} \sum_{u=1}^{r} a_{iu}\, a_{j,((t+u-2 \bmod r)+1)} = a_i^T R_k a_j, \quad \text{where } R_k = \begin{pmatrix} r_{k1} & r_{k2} & \cdots & r_{kr} \\ r_{kr} & r_{k1} & \cdots & r_{k(r-1)} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k2} & r_{k3} & \cdots & r_{k1} \end{pmatrix}.$$

Recently, Hayashi and Shimbo (2017) proved that $\mathcal{M}^{\text{HolE}}_{2r+1} \supseteq \mathcal{M}^{\text{ComplEx}}_r$ and $\mathcal{M}^{\text{HolE}}_r \subseteq \mathcal{M}^{\text{ComplEx}}_r$. Putting this together with Th. 3, we obtain:

Corollary 1. $M^{\text{RESCAL}}_{2r+1}$ subsumes $M^{\text{ComplEx}}_r$.

Finally, since DISTMULT differs from RESCAL only in that DISTMULT adds a diagonality constraint, we directly obtain:

Theorem 4. $M^{\text{RESCAL}}_r$ subsumes $M^{\text{DISTMULT}}_r$.
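The construction in Eq. (1) is easy to check numerically. The following Python sketch (ours; sizes are arbitrary) builds the RESCAL parameters from random TransE parameters and verifies that the two scoring matrices differ only by the per-relation constant $\|r_k\|_2^2$, so all rankings agree:

```python
import numpy as np

r, N = 4, 6
rng = np.random.default_rng(1)
A = rng.normal(size=(N, r))          # TransE entity embeddings
r_k = rng.normal(size=r)             # one TransE relation embedding

# RESCAL entity matrix per Eq. (1): rows (1_r^T, a_i^T, a_i^T a_i)
A2 = np.hstack([np.ones((N, r)), A, np.sum(A * A, axis=1, keepdims=True)])

# RESCAL relation matrix per Eq. (1)
e1 = np.zeros(r); e1[0] = 1.0
R2 = np.zeros((2 * r + 1, 2 * r + 1))
R2[:r, r:2 * r] = 2 * np.diag(r_k)    # block (1,2):  2 diag(r_k)
R2[:r, -1] = -e1                      # block (1,3): -e_{1,r}
R2[r:2 * r, :r] = -2 * np.diag(r_k)   # block (2,1): -2 diag(r_k)
R2[r:2 * r, r:2 * r] = 2 * np.eye(r)  # block (2,2):  2 I
R2[-1, :r] = -e1                      # block (3,1): -e_{1,r}^T

s_transe = -np.sum((A[:, None, :] + r_k - A[None, :, :]) ** 2, axis=-1)
s_rescal = A2 @ R2 @ A2.T
print(np.allclose(s_rescal, s_transe + r_k @ r_k))   # True
```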
3.2 Universality

We say that class $M^t$ is universal if $\pi(\mathcal{M}^t) = \pi(\mathbb{R}^{N \times N \times K})$, i.e., if every ranking tensor can be expressed. As with subsumption, universality by no means implies that a model class is suitable for use in practice. If a model class is not universal, however, care must be taken because certain relations cannot be modeled. A direct consequence of Th. 2 is:

Corollary 2. $M^{\text{TransE}}$ is not universal.

We establish the universality of RESCAL, HolE, and ComplEx next.

Table 1: Summary of our main results. Each row corresponds to a model class; all conditions are sufficient. The last five columns give the size at which the row model subsumes the column model of size $r$. "?" means that no bound other than the universal bound is known.

| Model | # Parameters | Universal when size ≥ | Consistent with $B$ when size ≥ | RESCAL$_r$ | HolE$_r$ | ComplEx$_r$ | DISTMULT$_r$ | TransE$_r$ |
|---|---|---|---|---|---|---|---|---|
| RESCAL | $Nr + Kr^2$ | $N$ | $\min\{N, 2\sum_k \operatorname{rrank}(B_k)\}$ | $r$ | $r$ | $2r+1$ | $r$ | $2r+1$ |
| HolE | $Nr + Kr$ | $2KN + 1$ | $2\min\{KN, 2\sum_k \operatorname{rrank}(B_k)\} + 1$ | ? | $r$ | $2r+1$ | $2r+1$ | ? |
| ComplEx | $2Nr + 2Kr$ | $KN$ | $\min\{KN, 2\sum_k \operatorname{rrank}(B_k)\}$ | ? | $r$ | $r$ | $r$ | ? |
| DISTMULT | $Nr + Kr$ | No | No | No | No | No | $r$ | No |
| TransE | $Nr + Kr$ | No | No | No | No | No | No | $r$ |

Theorem 5. $M^{\text{RESCAL}}_N$ is universal.

Proof. Pick any ranking tensor $P \in \pi(\mathbb{R}^{N \times N \times K})$. Consider the model $m \in M^{\text{RESCAL}}_N$ with parameterization $A = I_N$ and $R_k = P_{(k)}$. Then $S^m_k = A R_k A^T = P_{(k)}$ and thus $S^m = P$. Using the fact that $\pi(P) = P$, we conclude that $P \in \pi(\mathcal{M}^{\text{RESCAL}}_N)$.

Note that models in $M^{\text{RESCAL}}_N$ have very large embeddings. It is more involved to establish whether or not $M^{\text{RESCAL}}_r$ is universal for some $r < N$. We approach this question below and show that $r$ needs to be linear in $N$ to obtain universality.

Theorem 6. $M^{\text{RESCAL}}_{N/32-1}$ is not universal.

The proof (given below) makes use of the notion of rounding rank (Neumann, Gemulla, and Miettinen 2016). Given a rounding threshold $\tau \in \mathbb{R}$, denote by $\operatorname{round}_\tau(x) = \mathbb{I}(x \geq \tau)$ the rounding function (1 if $x \geq \tau$, else 0). We apply $\operatorname{round}_\tau$ to matrices and tensors by rounding elementwise. In particular, when $A \in \mathbb{R}^{m \times n}$ is any real-valued matrix, then $\operatorname{round}_\tau(A)$ is the $m \times n$ binary matrix with $[\operatorname{round}_\tau(A)]_{ij} = \operatorname{round}_\tau(a_{ij})$. We assume $\tau = 1/2$ unless explicitly stated otherwise and write $\operatorname{round}$ for $\operatorname{round}_{1/2}$.

Definition 1. For $\tau \in \mathbb{R}$, the rounding rank w.r.t. $\tau$ of a binary matrix $B \in \{0,1\}^{m \times n}$ is given by
$$\operatorname{rrank}_\tau(B) = \min\{\, \operatorname{rank}(A) : A \in \mathbb{R}^{m \times n},\ \operatorname{round}_\tau(A) = B \,\}.$$

Given a Boolean matrix $B$, say that a scoring matrix $S$ is consistent with $B$ if
$$b_{ij} = 1 \text{ and } b_{i'j'} = 0 \implies \pi_{ij}(S) < \pi_{i'j'}(S). \tag{2}$$

The rounding rank can be interpreted as the minimum rank of a scoring matrix that is consistent with $B$. Neumann, Gemulla, and Miettinen (2016) proved that the rounding rank differs by at most 1 for different choices of $\tau$ and that it is connected to the sign rank (Alon, Moran, and Yehudayoff 2016). The rounding rank can be much smaller than the matrix rank in practice, which partially explains the success of bilinear models.

Proof (of Th. 6). Alon, Frankl, and Rödl (1985) showed that for every $N$ there exist Boolean matrices in $\{0,1\}^{N \times N}$ with rounding rank at least $N/32$. Pick any such matrix $B$. The proof is by contradiction. Consider $K = 1$ and suppose there exists a scoring matrix $S \in \mathcal{M}^{\text{RESCAL}}_{N/32-1}$ that satisfies Eq. (2), i.e., $s_{ij} > s_{i'j'}$ whenever $b_{ij} = 1$ and $b_{i'j'} = 0$ (a rounding threshold can then be chosen between the smallest score of a 1-entry and the largest score of a 0-entry). Observe that $S$ has rank at most $N/32 - 1$ because it is defined by a product involving an $(N/32-1) \times (N/32-1)$ matrix. But this implies that $\operatorname{rrank}(B) \leq N/32 - 1$, a contradiction.

Note that the proof implies that there exist ranking tensors with just two distinct ranks that cannot be expressed by $M^{\text{RESCAL}}_{N/32-1}$. Since RESCAL is an unconstrained bilinear model, we can generalize to other model classes.

Corollary 3. No model class that only contains bilinear models of size less than $N/32$ is universal.
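To see how large the gap between rank and rounding rank can be, consider the upper-triangular all-ones matrix: its matrix rank is $N$, but it is the rounding of a rank-2 matrix. The following Python sketch (our own example, not taken from the analysis above) verifies this:

```python
import numpy as np

N = 8
B = np.triu(np.ones((N, N), dtype=int))     # b_ij = 1 iff i <= j; rank N

i = np.arange(N)
L = np.column_stack([np.ones(N), -i.astype(float)])    # L_i = (1, -i)
Q = np.column_stack([0.5 + 0.1 * i, np.full(N, 0.1)])  # Q_j = (0.5 + 0.1 j, 0.1)
S = L @ Q.T                                 # s_ij = 0.5 + 0.1 (j - i), rank 2

print(np.linalg.matrix_rank(B))                        # 8
print(np.array_equal((S >= 0.5).astype(int), B))       # True: round(S) = B
```

Hence $\operatorname{rrank}(B) \leq 2$ although $\operatorname{rank}(B) = N$.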
Theorem 7. $M^{\text{ComplEx}}_{KN}$ and $M^{\text{HolE}}_{2KN+1}$ are universal.

Proof. Pick any scoring tensor $S \in \mathbb{R}^{N \times N \times K}$. Trouillon et al. (2016b) showed that for every $N \times N$ real matrix, and thus for every scoring matrix $S_k$, there exist $A_k, D_k \in \mathbb{C}^{N \times N}$, where $A_k$ is unitary and $D_k$ diagonal, such that $S_k = \operatorname{Re}(A_k D_k A_k^*)$. Now consider the ComplEx model with
$$A = (A_1\ A_2\ \cdots\ A_K), \qquad R_k = \operatorname{diag}(0_{N \times N}, \ldots, D_k, \ldots, 0_{N \times N}),$$
where $D_k$ appears in the $k$-th diagonal block. We can verify that $S_k = \operatorname{Re}(A R_k A^*)$ for each $k$. Thus $S \in \mathcal{M}^{\text{ComplEx}}_{KN}$ and it follows that $M^{\text{ComplEx}}_{KN}$ is universal. The universality of $M^{\text{HolE}}_{2KN+1}$ follows from the fact that $\mathcal{M}^{\text{HolE}}_{2r+1} \supseteq \mathcal{M}^{\text{ComplEx}}_r$ for every $r$ (Hayashi and Shimbo 2017).

Finally, since DISTMULT's relation matrices are diagonal and thus symmetric, DISTMULT cannot model asymmetric relations.

Theorem 8. $M^{\text{DISTMULT}}$ is not universal.

3.3 Consistent Ranking

Suppose we are given an $N \times N \times K$ Boolean tensor $B$ and we look for a ranking tensor $P$ that is consistent with $B$ in each frontal slice, i.e., $p_{ijk} < p_{i'j'k}$ whenever $b_{ijk} = 1$ and $b_{i'j'k} = 0$. In this section, we establish upper bounds on the size³ that various bilinear models need to express a ranking that is consistent with $B$, i.e., which ranks 1s above 0s. Here we think of $B$ as the correct completed KB; there is no hope for a model class not consistent with $B$ to recover the correct KB.

³The expressive power of the models considered here is non-decreasing as their size grows.

Note that even if a model class is not universal, it may still contain consistent models for all Boolean tensors. This is not the case for DISTMULT and TransE, however. In particular, since DISTMULT produces symmetric scoring matrices, DISTMULT does not contain models consistent with any Boolean tensor that has an asymmetric frontal slice. For TransE, the proof of Th. 2 implies that TransE does not contain models consistent with Boolean tensors that have both 0s and 1s on the main diagonal of any of their frontal slices.

Theorem 9. There exist Boolean tensors $B$ such that no ranking tensor in $\pi(\mathcal{M}^{\text{DISTMULT}})$ is consistent with $B$.

Theorem 10. There exist Boolean tensors $B$ such that no ranking tensor in $\pi(\mathcal{M}^{\text{TransE}})$ is consistent with $B$.

For RESCAL, which is universal, we can make use of the rounding-rank decomposition to obtain a tighter bound than the one implied by its universality.

Theorem 11. For any Boolean tensor $B$, $\pi(\mathcal{M}^{\text{RESCAL}}_r)$ contains a ranking tensor consistent with $B$ if
$$r \geq \min\Big\{\, N,\ 2\sum_{k=1}^{K} \operatorname{rrank}(B_k) \,\Big\}.$$

Proof. The case $r \geq N$ follows from Th. 5. Denote by $r_k$ the rounding rank of slice $B_k$ of $B$; we explicitly construct a consistent RESCAL model with $r = 2\sum_k r_k$ (as asserted). To do so, pick any $L_k, Q_k \in \mathbb{R}^{N \times r_k}$ that form a rounding-rank decomposition of $B_k$, i.e., for which $B_k = \operatorname{round}(L_k Q_k^T)$. (By the definition of rounding rank, such matrices always exist.) Now set
$$a_i^T = \big(\, [L_1]_{i:}^T\ [Q_1]_{i:}^T\ \cdots\ [L_K]_{i:}^T\ [Q_K]_{i:}^T \,\big), \qquad M_k = \begin{pmatrix} 0_{r_k \times r_k} & I_{r_k \times r_k} \\ 0_{r_k \times r_k} & 0_{r_k \times r_k} \end{pmatrix}, \qquad R_k = \operatorname{diag}(0_{2r_1 \times 2r_1}, \ldots, M_k, \ldots, 0_{2r_K \times 2r_K}).$$
We can now verify that $\operatorname{round}(A R_k A^T) = B_k$, which implies consistency.

Theorem 12. For any Boolean tensor $B$, $\pi(\mathcal{M}^{\text{ComplEx}}_r)$ contains a ranking tensor consistent with $B$ if
$$r \geq \min\Big\{\, KN,\ 2\sum_{k=1}^{K} \operatorname{rrank}(B_k) \,\Big\}.$$

Proof. The case $r \geq KN$ follows directly from Th. 7. To obtain $r \geq 2\sum_{k=1}^{K} \operatorname{rrank}(B_k)$, define $r_k$, $L_k$, and $Q_k$ as in the proof of Th. 11, and set $S_k = L_k Q_k^T$. Then there exist matrices $A_k \in \mathbb{C}^{N \times 2r_k}$ and diagonal $D_k \in \mathbb{C}^{2r_k \times 2r_k}$ such that $S_k = \operatorname{Re}(A_k D_k A_k^*)$ (Trouillon et al. 2016b). Now define
$$A = (A_1\ A_2\ \cdots\ A_K), \qquad R_k = \operatorname{diag}(0_{2r_1 \times 2r_1}, \ldots, D_k, \ldots, 0_{2r_K \times 2r_K})$$
and observe that $S_k = \operatorname{Re}(A R_k A^*)$.
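The block construction in the proof of Th. 11 can also be checked mechanically. The following Python sketch (ours; it assumes rounding-rank factorizations $L_k, Q_k$ are given rather than computing them) assembles the RESCAL parameters and verifies that rounding recovers every slice:

```python
import numpy as np

def rescal_from_rounding_factors(Ls, Qs):
    """Build A and R_1..R_K per the proof of Th. 11 from factors
    satisfying B_k = round(L_k Q_k^T)."""
    ranks = [L.shape[1] for L in Ls]
    A = np.hstack([np.hstack([L, Q]) for L, Q in zip(Ls, Qs)])
    r, off, Rs = sum(2 * rk for rk in ranks), 0, []
    for rk in ranks:
        Rk = np.zeros((r, r))
        Rk[off:off + rk, off + rk:off + 2 * rk] = np.eye(rk)   # block M_k
        Rs.append(Rk)
        off += 2 * rk
    return A, Rs

# Toy factors (illustrative): two slices with known rank-2 factorizations.
rng = np.random.default_rng(2)
Ls = [rng.normal(size=(5, 2)) for _ in range(2)]
Qs = [rng.normal(size=(5, 2)) for _ in range(2)]
Bs = [(L @ Q.T >= 0.5).astype(int) for L, Q in zip(Ls, Qs)]

A, Rs = rescal_from_rounding_factors(Ls, Qs)
print(all(np.array_equal((A @ Rk @ A.T >= 0.5).astype(int), Bk)
          for Rk, Bk in zip(Rs, Bs)))                          # True
```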
As a corollary of Th. 12, we have:

Corollary 4. For any Boolean tensor $B$, $\pi(\mathcal{M}^{\text{HolE}}_r)$ contains a ranking tensor consistent with $B$ if
$$r \geq 2\min\Big\{\, KN,\ 2\sum_{k=1}^{K} \operatorname{rrank}(B_k) \,\Big\} + 1.$$

4 Training and Relation-Level Ensemble

We have seen that various prior models can be interpreted as bilinear models subject to certain constraints; the models thus differ in their expressiveness. So far, we did not touch on how to select a suitable model for a given dataset and from a given model class. In this section, we briefly discuss model training in a margin-based framework. We then propose a simple relation-level ensemble that combines multiple individual models. The rationale behind using an ensemble is that whether a model class can represent a relation well, or be trained well on it, depends on properties of that relation. The ensemble thus aims to pick the best model (or a combination of models) for each relation.

4.1 Margin-Based Training

We assume throughout that we are given a set of positive triples $T^+ \subseteq E \times R \times E$, but no negative evidence. This is a common scenario in practice. To deal with the absence of negative evidence, ranking-based frameworks aim to produce a model that ranks triples in $T^+$ higher than other triples. A common approach (Bordes et al. 2013; Nickel, Rosasco, and Poggio 2016) is to define a set of negative triples for each positive triple $(i,k,j) \in T^+$ by perturbing subject or object:
$$T^-_{(i,k,j)} = \{\, (i',k,j) \mid i' \in E,\ (i',k,j) \notin T^+ \,\} \cup \{\, (i,k,j') \mid j' \in E,\ (i,k,j') \notin T^+ \,\}.$$
This approach corresponds to a local closed-world assumption (Dong et al. 2014).

We now briefly summarize a common margin-based framework for training (Bordes et al. 2013). There are a number of alternatives, including logistic loss (Riedel et al. 2013) and negative log-likelihood (Trouillon et al. 2016a). Margin-based frameworks often lead to faster training times in practice because they focus on informative pairs of positive and negative triples, i.e., they ignore parts of the data that are already more or less well-represented by the model. In particular, we minimize
$$\sum_{(i^+,k,j^+) \in T^+}\ \sum_{(i^-,k,j^-) \in T^-_{(i^+,k,j^+)}} \frac{\big[f(i^-,k,j^-) + \gamma - f(i^+,k,j^+)\big]_+}{\big|T^-_{(i^+,k,j^+)}\big|},$$
where $\gamma > 0$ is a margin hyperparameter, $[x]_+ = \max(0,x)$, and $f$ depends on the model being trained. For all models but HolE, we set $f(i,k,j) = s^m_k(i,j)$. For HolE, we set $f(i,k,j) = \sigma(s^m_k(i,j))$, where $\sigma$ denotes the logistic function, as suggested by the authors. In our experimental study, we also consider an additional L2 regularization term over the model parameters. The models can be fit using stochastic gradient descent (SGD) as in (Bordes et al. 2013; Lin et al. 2015b). The computational cost per SGD step is $O(r^2)$ for RESCAL, $O(r \log r)$ for HolE, and $O(r)$ for all other models.
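For concreteness, the following Python sketch shows one margin-based SGD step for DISTMULT with a single sampled negative per positive (our simplification; the study below uses AdaGrad, C++ implementations, and additional regularization):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, r, gamma, eta = 100, 7, 20, 1.0, 0.1
A = rng.normal(scale=0.1, size=(N, r))        # entity embeddings
R = rng.normal(scale=0.1, size=(K, r))        # relation embeddings

def score(i, k, j):                           # DISTMULT: a_i^T diag(r_k) a_j
    return np.sum(A[i] * R[k] * A[j])

def sgd_step(pos, positives):
    i, k, j = pos
    # sample a negative by perturbing subject or object (local closed world)
    ni, nk, nj = ((rng.integers(N), k, j) if rng.random() < 0.5
                  else (i, k, rng.integers(N)))
    if (ni, nk, nj) in positives:             # skip false negatives
        return
    if score(ni, nk, nj) + gamma - score(i, k, j) <= 0:
        return                                # pair separated by the margin
    # hinge active: move the positive score up and the negative score down
    gi, gj, gk = R[k] * A[j], A[i] * R[k], A[i] * A[j]
    gni, gnj, gnk = R[nk] * A[nj], A[ni] * R[nk], A[ni] * A[nj]
    A[i] += eta * gi;   A[j] += eta * gj;   R[k] += eta * gk
    A[ni] -= eta * gni; A[nj] -= eta * gnj; R[nk] -= eta * gnk

# one training pass:  for pos in positives: sgd_step(pos, positives)
```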
4.2 Relation-Level Ensemble

The simplest way of combining multiple models is to construct an ensemble at the model level (Krompaß and Tresp 2015). Our experimental study suggests, however, that the relative performance of different models is relation-dependent. A more promising approach is therefore to combine models at the relation level. To the best of our knowledge, this simple approach has not been explored previously. Our ensemble is based on stacking: a meta-learner is used to combine the ranking matrices produced by the individual models such that some accuracy measure is maximized. Here we use logistic regression. To do so, we construct for each relation a dataset that contains all of its positive triples as well as an equal number of negative triples, obtained by randomly perturbing each positive triple following the same strategy as in training the individual models. For logistic regression, we use rescaled scores of the individual models as features and the positive/negative class label as response variable. Rescaling accounts for the differing score ranges of the models; we rescale each feature linearly into the range [0, 1] (Han, Kamber, and Pei 2011, Sec. 3.5.2).
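A minimal sketch of this stacking procedure for a single relation is given below (ours; scikit-learn's LogisticRegression stands in for LIBLINEAR, and `base_scores` is a hypothetical list of per-model scoring functions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_relation_ensemble(pos_pairs, neg_pairs, base_scores):
    """Fit the relation-level meta-learner for one relation k.
    pos_pairs/neg_pairs: lists of (i, j); base_scores: one callable per
    base model, returning that model's score for the triple (i, k, j)."""
    pairs = list(pos_pairs) + list(neg_pairs)
    X = np.array([[s(i, j) for s in base_scores] for (i, j) in pairs])
    lo, hi = X.min(axis=0), X.max(axis=0)           # per-model score range
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # rescale into [0, 1]
    y = np.array([1] * len(pos_pairs) + [0] * len(neg_pairs))
    meta = LogisticRegression(solver="liblinear").fit(X, y)
    return meta, lo, hi
```

At prediction time, a triple's base-model scores are rescaled with the stored `lo`/`hi`, and the meta-learner's decision value serves as the combined score.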
5 Experiments

We conducted an experimental study on two real-world datasets that are commonly used in prior work on KB completion. The primary goal of our study was to provide independent evidence for the performance of various bilinear models under the margin-based ranking framework. We also evaluated relation-level ensembles of such models and compared the results to prior results reported in the literature (for bilinear and other models).

5.1 Experimental Setup

All datasets, experimental results, and source code will be made publicly available.⁴

⁴http://dws.informatik.uni-mannheim.de/en/resources/software/tf/

Data. We used the WN18 (Bordes et al. 2014) and FB15K (Bordes et al. 2013) datasets, which were extracted from WordNet (Miller 1995) and Freebase (Bollacker et al. 2008), respectively. WordNet contains words and their relationships; Freebase contains various facts across a large number of relations. The two datasets are presplit into a training set, a validation set, and a test set. Table 2 summarizes the key statistics.

Table 2: Dataset statistics

| Dataset | # Ent. | # Rel. | # Train. | # Valid. | # Test |
|---|---|---|---|---|---|
| WN18 | 40,943 | 18 | 141,442 | 5,000 | 5,000 |
| FB15K | 14,951 | 1,345 | 483,142 | 50,000 | 59,071 |

Methods and training. We considered RESCAL (R), HolE (H), and TransE (T) in our experimental study. We reimplemented each method in C++, partly using the Intel Math Kernel Library. We trained each model in the margin-based ranking framework using AdaGrad (Duchi, Hazan, and Singer 2011). In each step, we sampled a positive triple at random and obtained a negative triple by randomly perturbing subject or object. Sampling was done without replacement and we did not use mini-batches. When using the same hyperparameters, our implementation provided similar or better fits than the original implementations provided by the authors. Note that our study was limited in that we considered only one particular training method; no conclusions can be drawn about other training methods. We focused on margin-based ranking because it led to much faster training times, making this study more feasible. We used LIBLINEAR for logistic regression.

Evaluation. We evaluated model performance for the tasks of entity ranking and triple classification on the test data. In entity ranking, we rank entities for queries of the form $R(?, e)$ or $R(e, ?)$. Our evaluation closely follows Bordes et al. (2013), and we report mean reciprocal rank (MRR), HITS@10, and mean rank (MR) in the filtered setting, i.e., predictions that correspond to triples in the training or validation datasets were discarded. In triple classification, we are given a triple $(i,k,j)$ and are asked to classify it as positive or negative; we proceed as Socher et al. (2013) to produce the set of triples to classify. To perform classification, we determined a score threshold $\sigma_k$ for each relation and model; scores larger than $\sigma_k$ were classified as positive, else as negative. We used optimal thresholds with respect to the validation set.

Model selection. Each of the models has a number of hyperparameters. For all models, we trained solely on the training data and used the validation data solely to tune hyperparameters; the test data was not touched for model selection. We considered the following hyperparameter settings: size $r \in \{100, 200\}$, learning rate $\eta \in \{0.01, 0.1, 1\}$, weights of L2 regularization $\lambda_e, \lambda_r \in \{0, 0.1, 0.01\}$ for entity and relation parameters, resp., and margin hyperparameter $\gamma \in \{1, 2, 4, 8\}$ for RESCAL, $\gamma \in \{0.2, 0.5, 0.7\}$ for HolE, and $\gamma \in \{0.2, 0.5, 0.7, 1.0, 1.5\}$ for TransE.⁵ We performed an exhaustive grid search, using 50 epochs (passes over the dataset; 2,000 for TransE) per hyperparameter setting and model. We then retrained the best-performing setting (w.r.t. HITS@10 on the validation data) for each model on the training data for up to 2,000 epochs. Tab. 3 reports the hyperparameters ultimately selected.

⁵We used smaller margins than the ones suggested for TransE with L1 distance (Lin et al. 2015b; Wang et al. 2014; Lin et al. 2015a). By doing this, we obtained prediction performance comparable to TransE-L1.

Table 3: Hyperparameter settings used in our study

| Dataset | Model | r | γ | η | λe | λr |
|---|---|---|---|---|---|---|
| WN18 | RESCAL | 200 | 1.0 | 0.10 | 0.10 | 0.01 |
| WN18 | HolE | 200 | 0.2 | 0.10 | 0.01 | 0.00 |
| WN18 | TransE | 200 | 0.5 | 0.01 | - | - |
| FB15K | RESCAL | 200 | 4.0 | 0.10 | 0.10 | 0.01 |
| FB15K | HolE | 200 | 0.2 | 0.10 | 0.01 | 0.01 |
| FB15K | TransE | 200 | 0.2 | 0.01 | - | - |
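Before turning to the results, we sketch the filtered entity-ranking protocol in Python (ours; object queries only, subject queries are symmetric). Here `known` is assumed to hold the positive triples of the training and validation sets, matching the filtering described above:

```python
import numpy as np

def filtered_object_rank(i, k, j, score_fn, num_entities, known):
    """Rank of the true object j for the query R_k(i, ?), after filtering
    out other known-true candidate objects."""
    scores = [score_fn(i, k, cand) for cand in range(num_entities)]
    target = scores[j]
    better = sum(1 for cand in range(num_entities)
                 if scores[cand] > target and cand != j
                 and (i, k, cand) not in known)
    return 1 + better

def mrr_and_hits10(test_triples, score_fn, num_entities, known):
    ranks = np.array([filtered_object_rank(i, k, j, score_fn,
                                           num_entities, known)
                      for (i, k, j) in test_triples], dtype=float)
    return (1.0 / ranks).mean(), (ranks <= 10).mean()
```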
5.2 Results

Entity ranking. Our results are summarized in Tab. 4. Detailed results can be found in Tab. 6, where we measured HITS@10 per relation category and per argument to be predicted as in Bordes et al. (2013).

Table 4: Entity ranking results of our experimental study. Best-performing entries marked in bold.

| Model | WN18 HITS@10 (%) | WN18 MRR (%) | WN18 MR | FB15K HITS@10 (%) | FB15K MRR (%) | FB15K MR |
|---|---|---|---|---|---|---|
| HolE (Nickel, Rosasco, and Poggio 2016) | 94.1 | 93.8 | 819 | 72.6 | 50.2 | 331 |
| TransE (Bordes et al. 2013) | 94.5 | 43.9 | **474** | 79.5 | 34.4 | 76 |
| RESCAL (Nickel, Tresp, and Kriegel 2011) | 87.8 | 79.9 | 905 | 59.6 | 38.1 | 247 |
| RESCAL + TransE | 94.8 | 87.3 | 510 | 79.7 | 51.1 | 61 |
| RESCAL + HolE | 94.4 | **94.0** | 743 | 79.1 | 57.5 | 165 |
| HolE + TransE | 94.9 | 93.8 | 507 | 84.6 | 61.0 | 67 |
| RESCAL + HolE + TransE | **95.0** | **94.0** | 507 | **85.1** | **62.8** | **52** |

For the individual models, our results indicate that model performance depends on the relation category. No single model always performed best across all categories. HolE and TransE generally performed better than RESCAL; here constraints help. The relation-level ensembles generally improved performance w.r.t. HITS@10 and MRR. Performance w.r.t. MR was not improved, however, mainly because this metric is sensitive to low-ranked triples (which existed in the HolE and RESCAL predictions). Note that adding RESCAL to the ensemble was helpful. Finally, the ensemble of RESCAL, TransE, and HolE performed best w.r.t. HITS@10 on all relation categories and for both datasets.

Table 6: Detailed entity ranking results (FB15K, HITS@10). The left four columns refer to predicting the subject, the right four to predicting the object, per relation category.

| Model | 1:1 (subj) | 1:N (subj) | N:1 (subj) | N:N (subj) | 1:1 (obj) | 1:N (obj) | N:1 (obj) | N:N (obj) |
|---|---|---|---|---|---|---|---|---|
| TransE | 75.8 | 91.9 | 41.4 | 82.2 | 75.5 | 51.1 | 91.9 | 84.7 |
| HolE | 80.4 | 69.5 | 44.7 | 77.4 | 79.0 | 57.8 | 59.1 | 79.0 |
| RESCAL | 43.1 | 75.7 | 17.7 | 62.0 | 42.4 | 21.3 | 79.2 | 65.8 |
| R+H+T | 87.5 | 94.3 | 55.2 | 86.7 | 87.0 | 65.0 | 93.3 | 89.4 |

In Tab. 5, we compare to some recent results reported in the literature. Note that for some of these models the training methods differ from the one used in our study, and that some models are not bilinear. Nevertheless, a direct comparison indicates that a relation-level ensemble of multiple bilinear models is competitive with the state-of-the-art.

Table 5: Entity ranking results as reported in the literature (not reproduced here; partly with different training methods, partly non-bilinear models). Entries marked "-" were not reported. Entries better than any result in our study are marked in bold.

| Model | WN18 HITS@10 (%) | WN18 MRR (%) | WN18 MR | FB15K HITS@10 (%) | FB15K MRR (%) | FB15K MR |
|---|---|---|---|---|---|---|
| Gaifman (Niepert 2016) | 93.9 | - | **352** | 84.2 | - | 75 |
| ComplEx (Trouillon et al. 2016a), r=150/200 | 94.7 | **94.1** | - | 84.0 | **69.2** | - |
| DISTMULT (Trouillon et al. 2016a), r=150/200 | 93.6 | 82.2 | 902 | 82.4 | **65.4** | 97 |
| R-GCN+DISTMULT (Schlichtkrull et al. 2017), r=200 | **96.4** | 81.9 | - | 84.2 | **69.6** | - |
| ANALOGY (Liu, Wu, and Yang 2017), r=200 | 94.7 | **94.2** | - | **85.4** | **72.5** | - |

Triple classification. Tab. 7 summarizes the accuracy of each individual model and of various relation-level ensembles for triple classification on FB15K. The results are generally in line with those for entity ranking. A notable exception is that RESCAL outperforms HolE here; we conjecture that this is due to HolE's high MR on this dataset.

Table 7: Triple classification results (FB15K)

| Model | T | H | R | R+T | R+H | H+T | R+H+T |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 96.2 | 93.7 | 94.6 | 96.7 | 95.8 | 96.5 | 96.9 |

6 Related Work

We focus on recent embedding models that solely use the KB as input. There are a number of methods that modify TransE in one way or another: TransH (Wang et al. 2014) and TransR (Lin et al. 2015b) improve support for symmetric and many-to-one relations, TransG (Xiao et al. 2015) refines relation embeddings via semantic components, and PTransE (Lin et al. 2015a) adds multiple-step relation paths. Gaifman (Niepert 2016) exploits structural features in the form of Horn clauses to construct embeddings. Socher et al. (2013) combined neural networks with tensors. Schlichtkrull et al. (2017) model relational data with graph convolutional networks. ANALOGY (Liu, Wu, and Yang 2017) is a recent bilinear model that constrains relation embeddings to be real normal matrices. Finally, Nickel, Jiang, and Tresp (2014) provided a rank bound for exact recovery of a Boolean tensor with RESCAL. Our results differ in that we consider consistency, not exact recovery.

7 Conclusion

We studied the expressive power of and subsumption relationships between recent bilinear embedding models for knowledge graphs. We introduced the concepts of universality and consistency, which capture different aspects of model expressiveness, and provided bounds on the model sizes needed for universality or for consistency with a given dataset. We argued that relation-level ensembles are beneficial for multi-relational learning. Finally, we conducted an independent experimental study that compared various bilinear models in a common setup. Future work includes tightening the bounds provided here, studying which relation types can be represented by which models, and exploring the relationships between additional models. We also expect an in-depth study of model performance with various alternative datasets and training methods to be insightful.

References

Alon, N.; Frankl, P.; and Rödl, V. 1985. Geometrical realization of set systems and probabilistic communication complexity. In FOCS, 277–280.

Alon, N.; Moran, S.; and Yehudayoff, A. 2016. Sign rank versus VC dimension. In COLT, 47–80.

Bollacker, K. D.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, 1247–1250.

Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In NIPS, 2787–2795.

Bordes, A.; Glorot, X.; Weston, J.; and Bengio, Y. 2014. A semantic matching energy function for learning with multi-relational data: Application to word-sense disambiguation. Machine Learning 94(2):233–259.
Carroll, J. D., and Chang, J.-J. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika 35(3):283–319.

Dong, X.; Gabrilovich, E.; Heitz, G.; Horn, W.; Lao, N.; Murphy, K.; Strohmann, T.; Sun, S.; and Zhang, W. 2014. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In KDD, 601–610.

Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.

Han, J.; Kamber, M.; and Pei, J. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann.

Hayashi, K., and Shimbo, M. 2017. On the equivalence of holographic and complex embeddings for link prediction. In ACL (2), 554–559.

Krompaß, D., and Tresp, V. 2015. Ensemble solutions for link-prediction in knowledge graphs. In LD4KD.

Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; and Bizer, C. 2015. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2):167–195.

Lin, Y.; Liu, Z.; Luan, H.; Sun, M.; Rao, S.; and Liu, S. 2015a. Modeling relation paths for representation learning of knowledge bases. In EMNLP, 705–714.

Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015b. Learning entity and relation embeddings for knowledge graph completion. In AAAI, 2181–2187.

Liu, H.; Wu, Y.; and Yang, Y. 2017. Analogical inference for multi-relational embeddings. In ICML, volume 70, 2168–2178.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

Miller, G. A. 1995. WordNet: A lexical database for English. Communications of the ACM 38(11):39–41.

Neumann, S.; Gemulla, R.; and Miettinen, P. 2016. What you will gain by rounding: Theory and algorithms for rounding rank. In ICDM, 380–389.

Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1):11–33.

Nickel, M.; Jiang, X.; and Tresp, V. 2014. Reducing the rank of relational factorization models by including observable patterns. In NIPS, 1179–1187.

Nickel, M.; Rosasco, L.; and Poggio, T. A. 2016. Holographic embeddings of knowledge graphs. In AAAI, 1955–1961.

Nickel, M.; Tresp, V.; and Kriegel, H. 2011. A three-way model for collective learning on multi-relational data. In ICML, 809–816.

Niepert, M. 2016. Discriminative Gaifman models. In NIPS, 3405–3413.

Rebele, T.; Suchanek, F. M.; Hoffart, J.; Biega, J.; Kuzey, E.; and Weikum, G. 2016. YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames. In ISWC, LNCS 9982, 177–185.

Riedel, S.; Yao, L.; McCallum, A.; and Marlin, B. M. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL, 74–84.

Schlichtkrull, M. S.; Kipf, T. N.; Bloem, P.; van den Berg, R.; Titov, I.; and Welling, M. 2017. Modeling relational data with graph convolutional networks. CoRR abs/1703.06103.

Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. Y. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 926–934.

Trouillon, T., and Nickel, M. 2017. Complex and holographic embeddings of knowledge graphs: A comparison. CoRR abs/1707.01475.
Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016a. Complex embeddings for simple link prediction. In ICML, 2071–2080.

Trouillon, T.; Dance, C. R.; Gaussier, É.; and Bouchard, G. 2016b. Decomposing real square matrices via unitary diagonalization. CoRR.

Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI, 1112–1119.

Xiao, H.; Huang, M.; Hao, Y.; and Zhu, X. 2015. TransG: A generative mixture model for knowledge graph embedding. CoRR abs/1509.05488.

Yang, B.; Yih, W.; He, X.; Gao, J.; and Deng, L. 2014. Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575.