# Quaternion Knowledge Graph Embeddings

Shuai Zhang, Yi Tay^ψ, Lina Yao, Qi Liu^φ
University of New South Wales; ^ψ Nanyang Technological University; ^φ University of Oxford

In this work, we move beyond the traditional complex-valued representations, introducing more expressive hypercomplex representations to model entities and relations for knowledge graph embeddings. More specifically, quaternion embeddings, hypercomplex-valued embeddings with three imaginary components, are utilized to represent entities. Relations are modelled as rotations in the quaternion space. The advantages of the proposed approach are: (1) latent inter-dependencies (between all components) are aptly captured with the Hamilton product, encouraging a more compact interaction between entities and relations; (2) quaternions enable expressive rotation in four-dimensional space and have more degrees of freedom than rotation in the complex plane; (3) the proposed framework is a generalization of ComplEx on hypercomplex space while offering better geometrical interpretations, concurrently satisfying the key desiderata of relational representation learning (i.e., modeling symmetry, antisymmetry, and inversion). Experimental results demonstrate that our method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks.

## 1 Introduction

Knowledge graphs (KGs) live at the heart of many semantic applications (e.g., question answering, search, and natural language processing). KGs enable not only powerful relational reasoning but also the ability to learn structural representations. Reasoning with KGs has been an extremely productive research direction, with many innovations leading to improvements in many downstream applications. However, real-world KGs are usually incomplete. As such, completing KGs and predicting missing links between entities have gained growing interest. Learning low-dimensional representations of entities and relations for KGs is an effective solution to this task.

Learning KG embeddings in the complex space C has been proven to be a highly effective inductive bias, largely owing to its intrinsic asymmetrical properties. This is demonstrated by the ComplEx embedding method, which infers new relational triplets with the asymmetrical Hermitian product.

In this paper, we move beyond complex representations, exploring the hypercomplex space for learning KG embeddings. More concretely, quaternion embeddings are utilized to represent entities and relations. Each quaternion embedding is a vector in the hypercomplex space H with three imaginary components i, j, k, as opposed to the standard complex space C with a single imaginary component i. We propose a new scoring function in which the head entity Q_h is rotated by the relational quaternion embedding through the Hamilton product. This is followed by a quaternion inner product with the tail entity Q_t.

There are numerous benefits to this formulation. (1) The Hamilton operator provides a greater extent of expressiveness compared to the complex Hermitian operator and the inner product in Euclidean space. The Hamilton operator forges inter-latent interactions between all of r, i, j, k, resulting in a highly expressive model. (2) Quaternion representations are highly desirable for parameterizing smooth rotation and spatial transformations in vector space.
They are generally considered robust to shear/scaling noise and perturbations (i.e., numerically stable rotations) and avoid the problem of gimbal lock. Moreover, quaternion rotations have two planes of rotation (a plane of rotation is an abstract object used to describe or visualize rotations in space), while complex rotations only work on a single plane, giving the model more degrees of freedom. (3) Our QuatE framework subsumes the ComplEx method, concurrently inheriting its attractive properties such as its ability to model symmetry, antisymmetry, and inversion. (4) Our model can maintain equal or even fewer parameters, while outperforming previous work. Experimental results demonstrate that our method achieves state-of-the-art performance on four well-established knowledge graph completion benchmarks (WN18, FB15K, WN18RR, and FB15K-237).

## 2 Related Work

Knowledge graph embeddings have attracted intense research focus in recent years, and a myriad of embedding methodologies have been proposed. We roughly divide previous work into translational models and semantic matching models based on the scoring function, i.e., the composition over head and tail entities and relations.

Translational methods popularized by TransE [Bordes et al., 2013] are widely used embedding methods, which interpret relation vectors as translations in vector space, i.e., head + relation ≈ tail. A number of models aiming to improve TransE were proposed subsequently. TransH [Wang et al., 2014] introduces relation-specific hyperplanes with a normal vector. TransR [Lin et al., 2015] further introduces relation-specific spaces by modelling entities and relations in distinct spaces with a shared projection matrix. TransD [Ji et al., 2015] uses independent projection vectors for each entity and relation and can reduce the amount of computation compared to TransR. TorusE [Ebisu and Ichise, 2018] defines embeddings and the distance function on a compact Lie group, the torus, and shows better accuracy and scalability. The recent state-of-the-art, RotatE [Sun et al., 2019], proposes a rotation-based translational method with complex-valued embeddings.

On the other hand, semantic matching models include bilinear models such as RESCAL [Nickel et al., 2011], DistMult [Yang et al., 2014], HolE [Nickel et al., 2016], and ComplEx [Trouillon et al., 2016], as well as neural-network-based models. These methods measure plausibility by matching latent semantics of entities and relations. In RESCAL, each relation is represented with a square matrix, while DistMult replaces it with a diagonal matrix in order to reduce the complexity. SimplE [Kazemi and Poole, 2018] is also a simple yet effective bilinear approach for knowledge graph embedding. HolE explores holographic reduced representations and makes use of circular correlation to capture rich interactions between entities. ComplEx embeds entities and relations in the complex space and utilizes the Hermitian product to model antisymmetric patterns, which has been shown to be immensely helpful in learning KG representations. The scoring function of ComplEx is isomorphic to that of HolE [Trouillon and Nickel, 2017]. Neural-network-based methods have also been adopted; e.g., Neural Tensor Network [Socher et al., 2013] and ER-MLP [Dong et al., 2014] are two representative neural-network-based methodologies. More recently, convolutional neural networks [Dettmers et al., 2018], graph convolutional networks [Schlichtkrull et al., 2018], and deep memory networks [Wang et al., 2018] also show promising performance on this task.
Different from previous work, QuatE takes advantage of the properties of quaternion representations (e.g., their geometrical meaning and rich representation capability) to enable rich and expressive semantic matching between head and tail entities, assisted by relational rotation quaternions. Our framework subsumes DistMult and ComplEx, with the capability to generalize to more advanced hypercomplex spaces. QuatE utilizes the concept of geometric rotation. Unlike RotatE, which has only one plane of rotation, there are two planes of rotation in QuatE. QuatE is a semantic matching model, while RotatE is a translational model. We also point out that the composition property introduced in TransE and RotatE can have detrimental effects on the KG embedding task.

Quaternions are a hypercomplex number system first described by Hamilton [Hamilton, 1844], with applications in a wide variety of areas including astronautics, robotics, computer visualisation, animation and special effects in movies, and navigation. Lately, quaternions have attracted attention in the field of machine learning. Quaternion recurrent neural networks (QRNNs) obtain better performance with a smaller number of free parameters than traditional RNNs on the phoneme recognition task. Quaternion representations are also useful for enhancing the performance of convolutional neural networks on multiple tasks such as automatic speech recognition [Parcollet et al.] and image classification [Gaudet and Maida, 2018, Parcollet et al., 2018a]. The quaternion multilayer perceptron [Parcollet et al., 2016] and quaternion autoencoders [Parcollet et al., 2017] also outperform the standard MLP and autoencoder. In a nutshell, the major motivation behind these models is that quaternions enable neural networks to encode latent inter- and intra-dependencies between multidimensional input features, thus leading to more compact interactions and better representation capability.

## 3 Hamilton's Quaternions

A quaternion [Hamilton, 1844] is a representative of the hypercomplex number system, extending the traditional complex number system to four-dimensional space. A quaternion Q consists of one real component and three imaginary components, defined as $Q = a + b\,i + c\,j + d\,k$, where $a, b, c, d$ are real numbers and $i, j, k$ are imaginary units. $i$, $j$, and $k$ are square roots of $-1$, satisfying Hamilton's rules: $i^2 = j^2 = k^2 = ijk = -1$. More useful relations can be derived from these rules, such as $ij = k$, $ji = -k$, $jk = i$, $ki = j$, $kj = -i$, and $ik = -j$. Figure 1(b) shows the product of the quaternion imaginary units. Apparently, multiplication between imaginary units is non-commutative. Some widely used operations of quaternion algebra H are introduced as follows:

Conjugate: The conjugate of a quaternion $Q$ is defined as $\bar{Q} = a - b\,i - c\,j - d\,k$.

Norm: The norm of a quaternion is defined as $|Q| = \sqrt{a^2 + b^2 + c^2 + d^2}$.
Inner Product: The quaternion inner product between $Q_1 = a_1 + b_1 i + c_1 j + d_1 k$ and $Q_2 = a_2 + b_2 i + c_2 j + d_2 k$ is obtained by taking the inner products between the corresponding scalar and imaginary components and summing up the four inner products:

$$Q_1 \cdot Q_2 = \langle a_1, a_2 \rangle + \langle b_1, b_2 \rangle + \langle c_1, c_2 \rangle + \langle d_1, d_2 \rangle \qquad (1)$$

Hamilton Product (Quaternion Multiplication): The Hamilton product is composed of all the standard multiplications of factors in quaternions and follows the distributive law, defined as:

$$Q_1 \otimes Q_2 = (a_1 a_2 - b_1 b_2 - c_1 c_2 - d_1 d_2) + (a_1 b_2 + b_1 a_2 + c_1 d_2 - d_1 c_2)\,i + (a_1 c_2 - b_1 d_2 + c_1 a_2 + d_1 b_2)\,j + (a_1 d_2 + b_1 c_2 - c_1 b_2 + d_1 a_2)\,k, \qquad (2)$$

which yields another quaternion. The Hamilton product is not commutative. Spatial rotations can be modelled with the quaternion Hamilton product (a short code sketch of these operations is given below, after the embedding definitions). Multiplying a quaternion $Q_1$ by another quaternion $Q_2$ has the effect of scaling $Q_1$ by the magnitude of $Q_2$, followed by a special type of rotation in four dimensions. As such, we can also rewrite the above equation as:

$$Q_1 \otimes Q_2 = |Q_2| \left( Q_1 \otimes \frac{Q_2}{|Q_2|} \right) \qquad (3)$$

### 4.1 Quaternion Representations for Knowledge Graph Embeddings

Suppose that we have a knowledge graph G consisting of N entities and M relations. $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations, respectively. The training set consists of triplets $(h, r, t)$, where $h, t \in \mathcal{E}$ and $r \in \mathcal{R}$. We use $\Omega$ and $\bar{\Omega} = \mathcal{E} \times \mathcal{R} \times \mathcal{E} - \Omega$ to denote the set of observed triplets and the set of unobserved triplets, respectively. $Y_{hrt} \in \{-1, 1\}$ represents the corresponding label of the triplet $(h, r, t)$. The goal of knowledge graph embedding is to embed entities and relations into a continuous low-dimensional space, while preserving graph relations and semantics.

In this paper, we propose learning effective representations for entities and relations with quaternions. We leverage the expressive rotational capability of quaternions. Unlike RotatE, which has only one plane of rotation (i.e., the complex plane, shown in Figure 1(a)), QuatE has two planes of rotation. Compared to Euler angles, quaternions can avoid the problem of gimbal lock (loss of one degree of freedom). Quaternions are also more efficient and numerically stable than rotation matrices.

The proposed method can be summarized in two steps: (1) rotate the head quaternion using the unit relation quaternion; (2) take the quaternion inner product between the rotated head quaternion and the tail quaternion to score each triplet. If a triplet exists in the KG, the model will rotate the head entity with the relation to make the angle between head and tail entities smaller so that the product is maximized. Otherwise, we can make the head and tail entities orthogonal so that their product becomes zero.

Figure 1: (a) Complex plane; (b) Quaternion units product; (c) stereographically projected hypersphere in 3D space. The purple dot indicates the position of the unit quaternion.

Quaternion Embeddings of Knowledge Graphs. More specifically, we use a quaternion matrix $\mathbf{Q} \in \mathbb{H}^{N \times k}$ to denote the entity embeddings and $\mathbf{W} \in \mathbb{H}^{M \times k}$ to denote the relation embeddings, where $k$ is the dimension of the embeddings. Given a triplet $(h, r, t)$, the head entity $h$ and the tail entity $t$ correspond to $Q_h = \{a_h + b_h i + c_h j + d_h k : a_h, b_h, c_h, d_h \in \mathbb{R}^k\}$ and $Q_t = \{a_t + b_t i + c_t j + d_t k : a_t, b_t, c_t, d_t \in \mathbb{R}^k\}$, respectively, while the relation $r$ is represented by $W_r = \{a_r + b_r i + c_r j + d_r k : a_r, b_r, c_r, d_r \in \mathbb{R}^k\}$.
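To make the quaternion algebra above concrete, here is a minimal NumPy sketch of the three operations the model relies on (Hamilton product, inner product, and normalization), with each quaternion stored as four coefficient vectors of dimension k. The function names and layout are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def hamilton_product(q1, q2):
    """Hamilton product (Eq. 2), applied element-wise to coefficient vectors.

    q1, q2: tuples (a, b, c, d) of shape-(k,) arrays, representing
    a + b i + c j + d k for each of the k embedding dimensions.
    """
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    a = a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2
    b = a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2
    c = a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2
    d = a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2
    return (a, b, c, d)

def inner_product(q1, q2):
    """Quaternion inner product (Eq. 1): sum of component-wise dot products."""
    return sum(np.dot(x, y) for x, y in zip(q1, q2))

def normalize(q):
    """Scale a quaternion to unit norm (Eq. 4), element-wise over the k dimensions."""
    a, b, c, d = q
    n = np.sqrt(a ** 2 + b ** 2 + c ** 2 + d ** 2)
    return (a / n, b / n, c / n, d / n)
```

Because the cross terms change sign when the arguments are swapped (mirroring $ij = k$ versus $ji = -k$), `hamilton_product(q1, q2)` and `hamilton_product(q2, q1)` generally differ; this non-commutativity is what later allows antisymmetric relations to be modelled.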
Hamilton-Product-Based Relational Rotation. We first normalize the relation quaternion $W_r$ to a unit quaternion $W_r^{\triangleleft} = p + q\,i + u\,j + v\,k$ to eliminate the scaling effect by dividing $W_r$ by its norm:

$$W_r^{\triangleleft}(p, q, u, v) = \frac{W_r}{|W_r|} = \frac{a_r + b_r i + c_r j + d_r k}{\sqrt{a_r^2 + b_r^2 + c_r^2 + d_r^2}} \qquad (4)$$

We visualize a unit quaternion in Figure 1(c) by projecting it into a 3D space. We keep the unit hypersphere which passes through $i, j, k$ in place. The unit quaternion can be projected inside, on, or outside the unit hypersphere depending on the value of the real part. Secondly, we rotate the head entity $Q_h$ by taking the Hamilton product between it and $W_r^{\triangleleft}$:

$$Q_h'(a_h', b_h', c_h', d_h') = Q_h \otimes W_r^{\triangleleft} = (a_h \circ p - b_h \circ q - c_h \circ u - d_h \circ v) + (a_h \circ q + b_h \circ p + c_h \circ v - d_h \circ u)\,i + (a_h \circ u - b_h \circ v + c_h \circ p + d_h \circ q)\,j + (a_h \circ v + b_h \circ u - c_h \circ q + d_h \circ p)\,k \qquad (5)$$

where $\circ$ denotes the element-wise multiplication between two vectors. Right-multiplication by a unit quaternion is a right-isoclinic rotation of the quaternion $Q_h$. We can also swap $Q_h$ and $W_r^{\triangleleft}$ and perform a left-isoclinic rotation, which does not fundamentally change the geometrical meaning. An isoclinic rotation is a special case of a double plane rotation where the angles for the two planes are equal.

Scoring Function and Loss. We apply the quaternion inner product as the scoring function:

$$\phi(h, r, t) = Q_h' \cdot Q_t = \langle a_h', a_t \rangle + \langle b_h', b_t \rangle + \langle c_h', c_t \rangle + \langle d_h', d_t \rangle \qquad (6)$$

Following Trouillon et al. [2016], we formulate the task as a classification problem, and the model parameters are learned by minimizing the following regularized logistic loss:

$$L(\mathbf{Q}, \mathbf{W}) = \sum_{r(h,t) \in \Omega \cup \bar{\Omega}} \log(1 + \exp(-Y_{hrt}\,\phi(h, r, t))) + \lambda_1 \|\mathbf{Q}\|_2^2 + \lambda_2 \|\mathbf{W}\|_2^2 \qquad (7)$$

Here we use the $\ell_2$ norm with regularization rates $\lambda_1$ and $\lambda_2$ to regularize $\mathbf{Q}$ and $\mathbf{W}$, respectively. $\bar{\Omega}$ is sampled from the unobserved set using negative sampling strategies such as uniform sampling, Bernoulli sampling [Wang et al., 2014], and adversarial sampling [Sun et al., 2019]. Note that the loss function is in Euclidean space, as we take the summation of all components when computing the scoring function in Equation (6). We utilise Adagrad [Duchi et al., 2011] for optimization.

Table 1: Scoring functions of state-of-the-art knowledge graph embedding models, along with their parameters and time complexity. $\star$ denotes the circular correlation operation; $\circ$ denotes the Hadamard (element-wise) product; $\otimes$ denotes the Hamilton product.

| Model | Scoring Function | Parameters | $\mathcal{O}_{time}$ |
|---|---|---|---|
| TransE | $\|Q_h + W_r - Q_t\|$ | $Q_h, W_r, Q_t \in \mathbb{R}^k$ | $O(k)$ |
| HolE | $\langle W_r, Q_h \star Q_t \rangle$ | $Q_h, W_r, Q_t \in \mathbb{R}^k$ | $O(k \log k)$ |
| DistMult | $\langle W_r, Q_h, Q_t \rangle$ | $Q_h, W_r, Q_t \in \mathbb{R}^k$ | $O(k)$ |
| ComplEx | $\mathrm{Re}(\langle W_r, Q_h, \bar{Q}_t \rangle)$ | $Q_h, W_r, Q_t \in \mathbb{C}^k$ | $O(k)$ |
| RotatE | $\|Q_h \circ W_r - Q_t\|$ | $Q_h, W_r, Q_t \in \mathbb{C}^k$, $|W_{ri}| = 1$ | $O(k)$ |
| TorusE | $\min_{(x, y) \in ([Q_h] + [W_r]) \times [Q_t]} \|x - y\|$ | $[Q_h], [W_r], [Q_t] \in \mathbb{T}^k$ | $O(k)$ |
| QuatE | $Q_h \otimes W_r^{\triangleleft} \cdot Q_t$ | $Q_h, W_r, Q_t \in \mathbb{H}^k$ | $O(k)$ |

Initialization. For parameter initialization, we can adopt the initialization algorithm in [Parcollet et al., 2018b] tailored for quaternion-valued networks to improve training efficiency and convergence [Glorot and Bengio, 2010]. The initialization of entities and relations follows the rule:

$$w_{real} = \varphi \cos(\theta), \quad w_i = \varphi\, Q_{img_i} \sin(\theta), \quad w_j = \varphi\, Q_{img_j} \sin(\theta), \quad w_k = \varphi\, Q_{img_k} \sin(\theta), \qquad (8)$$

where $w_{real}, w_i, w_j, w_k$ denote the scalar and imaginary coefficients, respectively. $\theta$ is randomly generated from the interval $[-\pi, \pi]$. $Q_{img}$ is a normalized quaternion whose scalar part is zero. $\varphi$ is randomly generated from the interval $[-\frac{1}{\sqrt{2k}}, \frac{1}{\sqrt{2k}}]$, reminiscent of He initialization [He et al., 2015]. This initialization method is optional.
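Before moving on, the scoring pipeline of this subsection (normalize the relation, rotate the head with the Hamilton product, take the quaternion inner product with the tail) can be sketched directly on top of the helper functions from the previous snippet. This is a simplified single-triplet illustration, not the authors' implementation; the batched PyTorch version follows the same steps.

```python
import numpy as np

def quate_score(head, relation, tail):
    """QuatE score phi(h, r, t), Eqs. (4)-(6), for a single triplet.

    head, relation, tail: tuples (a, b, c, d) of shape-(k,) coefficient arrays.
    """
    w_unit = normalize(relation)               # unit relation quaternion, Eq. (4)
    head_rot = hamilton_product(head, w_unit)  # rotate the head entity, Eq. (5)
    return inner_product(head_rot, tail)       # inner product with the tail, Eq. (6)

def logistic_term(score, label):
    """One summand of the regularized logistic loss in Eq. (7), label in {-1, +1}.

    The l2 penalties lambda_1 ||Q||^2 + lambda_2 ||W||^2 are applied once over all
    embedding matrices, so they are omitted from this per-triplet term.
    """
    return np.log1p(np.exp(-label * score))
```

For an observed triplet (label +1), minimizing this loss pushes the score up, geometrically aligning the rotated head with the tail; for a sampled negative (label -1), it pushes the rotated head and tail towards orthogonality, as described above.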
### 4.2 Discussion

Table 1 summarizes several popular knowledge graph embedding models, including their scoring functions, parameters, and time complexities. TransE, HolE, and DistMult use Euclidean embeddings, while ComplEx and RotatE operate in the complex space. In contrast, our model operates in the quaternion space.

Capability in Modeling Symmetry, Antisymmetry and Inversion. The flexibility and representational power of quaternions enable us to model the major relation patterns with ease. Similar to ComplEx, our model can model both symmetric ($r(x, y) \Rightarrow r(y, x)$) and antisymmetric ($r(x, y) \Rightarrow \neg r(y, x)$) relations. The symmetry property of QuatE can be proved by setting the imaginary parts of $W_r$ to zero. One can easily check that the scoring function is antisymmetric when the imaginary parts are nonzero. As for the inversion pattern ($r_1(x, y) \Rightarrow r_2(y, x)$), we can utilize the conjugation of quaternions. Conjugation is an involution and is its own inverse. One can easily check that:

$$Q_h \otimes W_r^{\triangleleft} \cdot Q_t = Q_t \otimes \bar{W}_r^{\triangleleft} \cdot Q_h \qquad (9)$$

The detailed proofs of antisymmetry and inversion can be found in the appendix.

Composition patterns are commonplace in knowledge graphs [Lao et al., 2011, Neelakantan et al., 2015]. Both TransE and RotatE have fixed composition methods [Sun et al., 2019]. TransE composes two relations using addition ($r_1 + r_2$), and RotatE uses the Hadamard product ($r_1 \circ r_2$). We argue that it is unreasonable to fix the composition patterns, as there might exist multiple composition patterns even within a single knowledge graph. For example, suppose there are three persons $x, y, z$. If $y$ is the elder sister (denoted as $r_1$) of $x$ and $z$ is the elder brother (denoted as $r_2$) of $y$, we can easily infer that $z$ is the elder brother of $x$. The relation between $z$ and $x$ is $r_2$, not $r_1 + r_2$ or $r_1 \circ r_2$, violating the two composition methods of TransE and RotatE. In QuatE, the composition patterns are not fixed. The relation between $z$ and $x$ is not only determined by the relations $r_1$ and $r_2$ but is also simultaneously influenced by the entity embeddings.

Connection to DistMult and ComplEx. Quaternions have more degrees of freedom compared to complex numbers. Here we show that the QuatE framework can be seen as a generalization of ComplEx. If we set the coefficients of the imaginary units $j$ and $k$ to zero, we get complex embeddings as in ComplEx, and the Hamilton product degrades to complex number multiplication. If we further remove the normalization of the relational quaternion, we obtain the following equation:

$$\phi(h, r, t) = Q_h \otimes W_r \cdot Q_t = (a_h + b_h i) \otimes (a_r + b_r i) \cdot (a_t + b_t i) = [(a_h \circ a_r - b_h \circ b_r) + (a_h \circ b_r + b_h \circ a_r)\,i] \cdot (a_t + b_t i) = \langle a_r, a_h, a_t \rangle + \langle a_r, b_h, b_t \rangle + \langle b_r, a_h, b_t \rangle - \langle b_r, b_h, a_t \rangle \qquad (10)$$

where $\langle a, b, c \rangle = \sum_k a_k b_k c_k$ denotes the standard component-wise multi-linear dot product. Equation (10) recovers the form of ComplEx. This framework brings another mathematical interpretation of ComplEx instead of just taking the real part of the Hermitian product. Another interesting finding is that the Hermitian product is not necessary to formulate the scoring function of ComplEx. If we remove the imaginary parts of all quaternions and remove the normalization step, the scoring function becomes $\phi(h, r, t) = \langle a_h, a_r, a_t \rangle$, degrading to DistMult in this case.
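The degeneration to ComplEx described above is easy to verify numerically: with the j and k coefficients set to zero and the relation normalization removed, the QuatE score agrees with the ComplEx score written component-wise as in Equation (10). A small sketch reusing the earlier helper functions (illustrative only):

```python
import numpy as np

k = 8
rng = np.random.default_rng(0)
zeros = np.zeros(k)

# Quaternions with zero j, k parts behave like complex vectors a + b i.
ah, bh = rng.normal(size=(2, k)); Qh = (ah, bh, zeros, zeros)
ar, br = rng.normal(size=(2, k)); Wr = (ar, br, zeros, zeros)
at, bt = rng.normal(size=(2, k)); Qt = (at, bt, zeros, zeros)

# QuatE score without relation normalization (left-hand side of Eq. 10).
quate = inner_product(hamilton_product(Qh, Wr), Qt)

# ComplEx score Re(<w_r, q_h, conj(q_t)>), written component-wise (right-hand side).
complex_score = ((ar * ah * at).sum() + (ar * bh * bt).sum()
                 + (br * ah * bt).sum() - (br * bh * at).sum())

assert np.allclose(quate, complex_score)
```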
Table 2: Statistics of the datasets used in this paper.

| Dataset | N | M | #training | #validation | #test | avg. #degree |
|---|---|---|---|---|---|---|
| WN18 | 40943 | 18 | 141442 | 5000 | 5000 | 3.45 |
| WN18RR | 40943 | 11 | 86835 | 3034 | 3134 | 2.19 |
| FB15K | 14951 | 1345 | 483142 | 50000 | 59071 | 32.31 |
| FB15K-237 | 14541 | 237 | 272115 | 17535 | 20466 | 18.71 |

## 5 Experiments and Results

### 5.1 Experimental Setup

Datasets Description: We conducted experiments on four widely used benchmarks, WN18, FB15K, WN18RR, and FB15K-237, whose statistics are summarized in Table 2. WN18 [Bordes et al., 2013] is extracted from WordNet (https://wordnet.princeton.edu/), a lexical database for the English language in which words are interlinked by means of conceptual-semantic and lexical relations. WN18RR [Dettmers et al., 2018] is a subset of WN18, with inverse relations removed. FB15K [Bordes et al., 2013] contains relation triples from Freebase, a large tuple database with structured general human knowledge. FB15K-237 [Toutanova and Chen, 2015] is a subset of FB15K, with inverse relations removed.

Evaluation Protocol: Three popular evaluation metrics are used: Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hit ratio with cut-off values n = 1, 3, 10. MR measures the average rank of all correct entities, with a lower value representing better performance. MRR is the average inverse rank of the correct entities. Hit@n measures the proportion of correct entities within the top n ranked entities. Following Bordes et al. [2013], filtered results are reported to avoid a possibly flawed evaluation.

Baselines: We compared QuatE with a number of strong baselines. For translational distance models, we report TransE [Bordes et al., 2013] and two recent extensions, TorusE [Ebisu and Ichise, 2018] and RotatE [Sun et al., 2019]. For semantic matching models, we report DistMult [Yang et al., 2014], HolE [Nickel et al., 2016], ComplEx [Trouillon et al., 2016], SimplE [Kazemi and Poole, 2018], ConvE [Dettmers et al., 2018], R-GCN [Schlichtkrull et al., 2018], and NKGE (ConvE based) [Wang et al., 2018].

Implementation Details: We implemented our model using PyTorch (https://pytorch.org/) and tested it on a single GPU. The hyper-parameters are determined by grid search. The best models are selected by early stopping on the validation set. In general, the embedding size k is tuned amongst {50, 100, 200, 250, 300}. The regularization rates λ1 and λ2 are searched in {0, 0.01, 0.05, 0.1, 0.2}. The learning rate is fixed to 0.1 without further tuning. The number of negatives (#neg) per training sample is selected from {1, 5, 10, 20}. We create 10 batches for all the datasets. For most baselines, we report the results given in the original papers; exceptions are provided with references. For RotatE (without self-adversarial negative sampling), we use the best hyper-parameter settings provided in the paper to reproduce the results. We also report the results of RotatE with self-adversarial negative sampling and denote it as a-RotatE. Note that we report three versions of QuatE: QuatE without type constraints, QuatE with N3 regularization and reciprocal learning, and QuatE with type constraints. Self-adversarial negative sampling [Sun et al., 2019] is not used for QuatE. All hyper-parameters of QuatE are provided in the appendix.
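For reference, the sketch below shows how the three reported metrics are computed from filtered ranks (the rank of the correct entity after all other true triplets have been removed from the candidate list). It is a generic illustration of the evaluation protocol, not the exact evaluation script behind the numbers in Tables 3 and 4.

```python
import numpy as np

def ranking_metrics(filtered_ranks, hits_at=(1, 3, 10)):
    """Compute MR, MRR, and Hit@n from filtered ranks of the correct entities.

    filtered_ranks: 1-D array of ranks (1 = best), in the "filtered" setting
    of Bordes et al. [2013].
    """
    ranks = np.asarray(filtered_ranks, dtype=float)
    metrics = {
        "MR": ranks.mean(),           # lower is better
        "MRR": (1.0 / ranks).mean(),  # higher is better
    }
    for n in hits_at:
        metrics[f"Hit@{n}"] = (ranks <= n).mean()
    return metrics

# Example: five test queries with filtered ranks of the correct entity.
print(ranking_metrics([1, 2, 1, 15, 4]))
```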
Table 3: Link prediction results on WN18 and FB15K. Some baseline results are taken from [Nickel et al., 2016], [Kadlec et al., 2017], and [Sun et al., 2019]. a-RotatE denotes RotatE with self-adversarial negative sampling. QuatE1: without type constraints; QuatE2: with N3 regularization and reciprocal learning; QuatE3: with type constraints.

| Model | WN18 MR | WN18 MRR | WN18 Hit@10 | WN18 Hit@3 | WN18 Hit@1 | FB15K MR | FB15K MRR | FB15K Hit@10 | FB15K Hit@3 | FB15K Hit@1 |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE | - | 0.495 | 0.943 | 0.888 | 0.113 | - | 0.463 | 0.749 | 0.578 | 0.297 |
| DistMult | 655 | 0.797 | 0.946 | - | - | 42.2 | 0.798 | 0.893 | - | - |
| HolE | - | 0.938 | 0.949 | 0.945 | 0.930 | - | 0.524 | 0.739 | 0.759 | 0.599 |
| ComplEx | - | 0.941 | 0.947 | 0.945 | 0.936 | - | 0.692 | 0.840 | 0.759 | 0.599 |
| ConvE | 374 | 0.943 | 0.956 | 0.946 | 0.935 | 51 | 0.657 | 0.831 | 0.723 | 0.558 |
| R-GCN+ | - | 0.819 | 0.964 | 0.929 | 0.697 | - | 0.696 | 0.842 | 0.760 | 0.601 |
| SimplE | - | 0.942 | 0.947 | 0.944 | 0.939 | - | 0.727 | 0.838 | 0.773 | 0.660 |
| NKGE | 336 | 0.947 | 0.957 | 0.949 | 0.942 | 56 | 0.73 | 0.871 | 0.790 | 0.650 |
| TorusE | - | 0.947 | 0.954 | 0.950 | 0.943 | - | 0.733 | 0.832 | 0.771 | 0.674 |
| RotatE | 184 | 0.947 | 0.961 | 0.953 | 0.938 | 32 | 0.699 | 0.872 | 0.788 | 0.585 |
| a-RotatE | 309 | 0.949 | 0.959 | 0.952 | 0.944 | 40 | 0.797 | 0.884 | 0.830 | 0.746 |
| QuatE1 | 388 | 0.949 | 0.960 | 0.954 | 0.941 | 41 | 0.770 | 0.878 | 0.821 | 0.700 |
| QuatE2 | - | 0.950 | 0.962 | 0.954 | 0.944 | - | 0.833 | 0.900 | 0.859 | 0.800 |
| QuatE3 | 162 | 0.950 | 0.959 | 0.954 | 0.945 | 17 | 0.782 | 0.900 | 0.835 | 0.711 |

Table 4: Link prediction results on WN18RR and FB15K-237. Some baseline results are taken from [Nguyen et al., 2017], [Dettmers et al., 2018], and [Sun et al., 2019].

| Model | WN18RR MR | WN18RR MRR | WN18RR Hit@10 | WN18RR Hit@3 | WN18RR Hit@1 | FB15K-237 MR | FB15K-237 MRR | FB15K-237 Hit@10 | FB15K-237 Hit@3 | FB15K-237 Hit@1 |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE | 3384 | 0.226 | 0.501 | - | - | 357 | 0.294 | 0.465 | - | - |
| DistMult | 5110 | 0.43 | 0.49 | 0.44 | 0.39 | 254 | 0.241 | 0.419 | 0.263 | 0.155 |
| ComplEx | 5261 | 0.44 | 0.51 | 0.46 | 0.41 | 339 | 0.247 | 0.428 | 0.275 | 0.158 |
| ConvE | 4187 | 0.43 | 0.52 | 0.44 | 0.40 | 244 | 0.325 | 0.501 | 0.356 | 0.237 |
| R-GCN+ | - | - | - | - | - | - | 0.249 | 0.417 | 0.264 | 0.151 |
| NKGE | 4170 | 0.45 | 0.526 | 0.465 | 0.421 | 237 | 0.33 | 0.510 | 0.365 | 0.241 |
| RotatE | 3277 | 0.470 | 0.565 | 0.488 | 0.422 | 185 | 0.297 | 0.480 | 0.328 | 0.205 |
| a-RotatE | 3340 | 0.476 | 0.571 | 0.492 | 0.428 | 177 | 0.338 | 0.533 | 0.375 | 0.241 |
| QuatE1 | 3472 | 0.481 | 0.564 | 0.500 | 0.436 | 176 | 0.311 | 0.495 | 0.342 | 0.221 |
| QuatE2 | - | 0.482 | 0.572 | 0.499 | 0.436 | - | 0.366 | 0.556 | 0.401 | 0.271 |
| QuatE3 | 2314 | 0.488 | 0.582 | 0.508 | 0.438 | 87 | 0.348 | 0.550 | 0.382 | 0.248 |

### 5.2 Results

Table 5: MRR for the models tested on each relation of WN18RR.

| Relation Name | RotatE | QuatE3 |
|---|---|---|
| hypernym | 0.148 | 0.173 |
| derivationally_related_form | 0.947 | 0.953 |
| instance_hypernym | 0.318 | 0.364 |
| also_see | 0.585 | 0.629 |
| member_meronym | 0.232 | 0.232 |
| synset_domain_topic_of | 0.341 | 0.468 |
| has_part | 0.184 | 0.233 |
| member_of_domain_usage | 0.318 | 0.441 |
| member_of_domain_region | 0.200 | 0.193 |
| verb_group | 0.943 | 0.924 |
| similar_to | 1.000 | 1.000 |

The empirical results on the four datasets are reported in Table 3 and Table 4. QuatE performs extremely competitively compared to the existing state-of-the-art models across all metrics. As a quaternion-valued method, QuatE outperforms the two representative complex-valued models, ComplEx and RotatE. The performance gains over RotatE also confirm the advantages of quaternion rotation over rotation in the complex plane.

On the WN18 dataset, QuatE outperforms all the baselines on all metrics except Hit@10. R-GCN+ achieves a high value on Hit@10, yet is surpassed by most models on the other four metrics. The four recent models NKGE, TorusE, RotatE, and a-RotatE achieve comparable results. QuatE also achieves the best results on the FB15K dataset, while the second best results are scattered amongst RotatE, a-RotatE, and DistMult. We are well aware of the good results of DistMult reported in [Kadlec et al., 2017], yet they used a very large negative sampling size (i.e., 1000, 2000). The results also demonstrate that QuatE can effectively capture the symmetry, antisymmetry, and inversion patterns, since these account for a large portion of the relations in these two datasets.
Table 7: Analysis of different variants of the scoring function, using the same hyper-parameter settings as QuatE3.

| Scoring Function | WN18 MRR | WN18 Hit@10 | FB15K MRR | FB15K Hit@10 | WN18RR MRR | WN18RR Hit@10 | FB15K-237 MRR | FB15K-237 Hit@10 |
|---|---|---|---|---|---|---|---|---|
| $Q_h \otimes W_r \cdot Q_t$ | 0.936 | 0.951 | 0.686 | 0.866 | 0.415 | 0.482 | 0.272 | 0.463 |
| $W_r \cdot (Q_h \otimes Q_t)$ | 0.784 | 0.945 | 0.599 | 0.809 | 0.401 | 0.471 | 0.263 | 0.446 |
| $(Q_h \otimes W_r^{\triangleleft}) \cdot (Q_t \otimes V_r^{\triangleleft})$ | 0.947 | 0.958 | 0.787 | 0.889 | 0.477 | 0.563 | 0.344 | 0.539 |

As shown in Table 4, QuatE achieves a large performance gain over existing state-of-the-art models on the two datasets from which trivial inverse relations have been removed. On WN18RR, which contains a number of symmetric relations, a-RotatE is the second best, while the other baselines are relatively weaker. The key competitors on FB15K-237, where a large number of composition patterns exist, are NKGE and a-RotatE. Table 5 summarizes the MRR for each relation on WN18RR, confirming the superior representation capability of quaternions in modelling different types of relations. Methods with fixed composition patterns such as TransE and RotatE are relatively weak at times.

We can also apply the N3 regularization and reciprocal learning approaches [Lacroix et al., 2018] to QuatE. The results are shown in Table 3 and Table 4 as QuatE2. It is observed that using N3 and reciprocal learning boosts performance greatly, especially on FB15K and FB15K-237. We found that the N3 regularization method can reduce the norm of relation and entity embeddings, so we do not apply relation normalization here. However, as with the method in [Lacroix et al., 2018], QuatE2 requires a large embedding dimension.

### 5.3 Model Analysis

Table 6: Comparison of the number of free parameters.

| Dataset | TorusE ($\mathbb{T}^k$) | RotatE ($\mathbb{C}^k$) | QuatE1 ($\mathbb{H}^k$) |
|---|---|---|---|
| WN18 | 409.61M | 40.95M | 49.15M (+20.0%) |
| FB15K | 162.96M | 31.25M | 26.08M (-16.5%) |
| WN18RR | - | 40.95M | 16.38M (-60.0%) |
| FB15K-237 | - | 29.32M | 5.82M (-80.1%) |

Number of Free Parameters Comparison. Table 6 compares the number of free parameters of QuatE1 with two recent competitive baselines, RotatE and TorusE. Note that QuatE3 uses almost the same number of free parameters as QuatE1. TorusE uses a very large embedding dimension of 10000 for both WN18 and FB15K. This number is even close to the number of entities in FB15K, which we think is not preferable, since our original intention is to embed entities and relations into a lower-dimensional space. QuatE largely reduces the parameter size of its complex-valued counterpart RotatE. This is more significant on the datasets without trivial inverse relations, saving up to 80% of the parameters while maintaining superior performance.

Ablation Study on Quaternion Normalization. We remove the normalization step in QuatE and use the original relation quaternion $W_r$ to rotate the head entity. From Table 7, we clearly observe that normalizing the relation to a unit quaternion is a critical step for embedding performance. This is likely because the scaling effects of non-unit quaternions are detrimental.

Hamilton Products between Head and Tail Entities. We reformulate the scoring function of QuatE following the original formulation of ComplEx: we take the Hamilton product between the head and tail quaternions and treat the relation quaternion as a weight, i.e., $\phi(h, r, t) = W_r \cdot (Q_h \otimes Q_t)$. As a result, the geometric property of relational rotation is lost, which leads to poor performance, as shown in Table 7.

Additional Rotational Quaternion for Tail Entity.
We hypothesize that adding an additional relation quaternion for the tail entity might give the model more representation capability. We therefore revise the scoring function to $(Q_h \otimes W_r^{\triangleleft}) \cdot (Q_t \otimes V_r^{\triangleleft})$, where $V_r$ represents the rotational quaternion for the tail entity. From Table 7, we observe that this variant achieves competitive results without extensive tuning. However, it might cause some loss of efficiency.

## 6 Conclusion

In this paper, we design a new knowledge graph embedding model which operates on the quaternion space with well-defined mathematical and physical meaning. Our model is advantageous with its capability in modelling several key relation patterns, its expressiveness with higher degrees of freedom, and its good generalization. Empirical evaluations on four well-established datasets show that QuatE achieves an overall state-of-the-art performance, outperforming multiple recent strong baselines, with even fewer free parameters.

## Acknowledgments

This research was partially supported by grant ONRG NICOP N62909-19-1-2009.

## References

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787-2795, 2013.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601-610. ACM, 2014.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Takuma Ebisu and Ryutaro Ichise. TorusE: Knowledge graph embedding on a Lie group. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Chase J Gaudet and Anthony S Maida. Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2018.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.

William Rowan Hamilton. LXXVIII. On quaternions; or on a new system of imaginaries in algebra: To the editors of the Philosophical Magazine and Journal. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 25(169):489-495, 1844.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 687-696, 2015.

Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. ACL 2017, page 69, 2017.

Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs.
In Advances in Neural Information Processing Systems, pages 4289-4300, 2018.

Timothee Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, pages 2869-2878, 2018.

Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529-539. Association for Computational Linguistics, 2011.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662, 2015.

Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. A novel embedding model for knowledge base completion based on convolutional neural network. arXiv preprint arXiv:1712.02121, 2017.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809-816, 2011.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

T. Parcollet, M. Morchid, P. Bousquet, R. Dufour, G. Linarès, and R. De Mori. Quaternion neural networks for spoken language understanding. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 362-368, Dec 2016. doi: 10.1109/SLT.2016.7846290.

Titouan Parcollet, Ying Zhang, Mohamed Morchid, Chiheb Trabelsi, Georges Linarès, Renato De Mori, and Yoshua Bengio. Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion denoising encoder-decoder for theme identification of telephone conversations. In INTERSPEECH, 2017.

Titouan Parcollet, Mohamed Morchid, and Georges Linarès. Quaternion convolutional neural networks for heterogeneous image processing. CoRR, abs/1811.02656, 2018a. URL http://arxiv.org/abs/1811.02656.

Titouan Parcollet, Mirco Ravanelli, Mohamed Morchid, Georges Linarès, Chiheb Trabelsi, Renato De Mori, and Yoshua Bengio. Quaternion recurrent neural networks. In International Conference on Learning Representations, abs/1806.04418, 2018b.

Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593-607. Springer, 2018.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926-934, 2013.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In The Seventh International Conference on Learning Representations, 2019.

Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57-66, 2015.

Théo Trouillon and Maximilian Nickel. Complex and holographic embeddings of knowledge graphs: a comparison. arXiv preprint arXiv:1707.01475, 2017.
Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071-2080, 2016.

Kai Wang, Yu Liu, Xiujuan Xu, and Dan Lin. Knowledge graph embedding with entity neighbors and deep memory network. arXiv preprint arXiv:1808.03752, 2018.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.