Quantum Embedding of Knowledge for Reasoning

Dinesh Garg¹, Shajith Ikbal¹, Santosh K. Srivastava¹, Harit Vishwakarma², Hima Karanam¹, L Venkata Subramaniam¹
¹IBM Research AI, India   ²Dept. of Computer Sciences, University of Wisconsin-Madison, USA
{garg.dinesh, shajmoha, sasriva5}@in.ibm.com, hvishwakarma@cs.wisc.edu, {hkaranam, lvsubram}@in.ibm.com

The first two authors contributed equally. This work was done when the author was with IBM Research, New Delhi, India.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Statistical Relational Learning (SRL) methods are the most widely used techniques to generate distributional representations of symbolic Knowledge Bases (KBs). These methods embed any given KB into a vector space by exploiting statistical similarities among its entities and predicates, but without any guarantee of preserving the underlying logical structure of the KB. This, in turn, results in poor performance on logical reasoning tasks that are solved using such distributional representations. We present a novel approach called Embed2Reason (E2R) that embeds a symbolic KB into a vector space in a logical-structure-preserving manner. This approach is inspired by the theory of Quantum Logic. Such an embedding allows answering membership-based complex logical reasoning queries with impressive accuracy improvements over popular SRL baselines.

1 Introduction

We consider the problem of embedding a given symbolic Knowledge Base (KB) into a vector space in a way that preserves the logical structure. Such embeddings are popularly known as distributional representations (e.g., Word2Vec [1] and GloVe [2]) and are meant to be leveraged by non-symbolic (e.g., neural and vector-based) methods to accomplish various tasks (e.g., KB completion/link prediction, and logical reasoning under a noisy/incomplete KB) on which symbolic methods struggle.

Figure 1 (best viewed in color) depicts a toy example of the kind of symbolic KB that we wish to embed into a vector space. The red-colored, oval-shaped nodes of the left tree denote a hierarchy (aka ontology) of unary predicates (aka concepts), and the blue-colored, rectangle-shaped nodes denote the memberships of entities to various concepts. Similarly, the right tree denotes an ontology of binary predicates (aka relations), with ordered pairs of entities as their members. In the right tree, having (Bob, Alice) as a child node of the relation Father_of means Bob is Father_of Alice. In such ontologies, a predicate node logically implies (aka entails) any of its ancestral predicate nodes. For example, cardiologist ⊑ physician ⊑ doctor, and similarly, Mother_of ⊑ Parent_of ⊑ Blood_relation. In the framework of mathematical logic, such a logical structure can be expressed via a subset of Description Logic (DL) statements. In DL parlance, the logical structure among predicates (red oval nodes) is commonly known as the T-Box, whereas the logical structure connecting entities (or entity tuples) to the predicates is commonly known as the A-Box [3].

Figure 1: A toy example of a knowledge base. The left (right) figure depicts the unary (binary) predicate ontology.

As far as vector embeddings of symbolic KBs are concerned, the last couple of years have witnessed a surge of newer methods whose end goal is KB completion or logical inferencing to answer reasoning queries [4, 5, 6, 7, 8, 9, 10, 11, 12]. While the details differ, the high-level idea behind all these methods is as follows: ingest a given symbolic KB and embed it into a vector space by using a neural network, followed by conversion of logical queries into score-based algebraic operations
over the vector space. Unlike pure symbolic logic frameworks [13, 14, 3], these neural methods are robust against input noise, but at the same time they suffer from a major drawback: there is no guarantee that the embeddings maintain the sanctity of the logical structure present in the input A-Box and T-Box. Therefore, unlike pure symbolic reasoners, a machine that uses such embeddings for complex reasoning tasks struggles with accuracy. The embeddings produced by some of these neural methods [4, 5, 6, 7, 8] capture, at best, the logical structure of just the A-Box but not the T-Box. Some other methods [9, 10, 11], which make provision to maintain the T-Box structure, suffer from the fact that the resulting logical structure in the embedding is not sound by design (unlike ours, which is based on Quantum Logic), and hence they often fail to achieve high accuracy on complex reasoning tasks.

In this paper, we propose a novel approach, called Embed2Reason (E2R), that can embed a given symbolic KB (such as the one shown in Figure 1) into a finite dimensional vector space. E2R is inspired by the theory of Quantum Logic (QL), which guides us in constraining the embedding in such a way that the set of all logical propositions in the input A-Box and T-Box becomes isomorphic to a lattice structure over a set of subspaces of the vector space. Such an embedding satisfies the axioms of Quantum Logic [15, 16] and allows one to perform logical operations (e.g., negation, conjunction, disjunction, and implication) directly over the vectors, in a manner similar to Boolean logic except that the distributive law does not hold [17]. We call such an embedding a quantum embedding. We formulate an unconstrained optimization program that captures all such quantum logical (as well as regularity) constraints. We solve this program via Stochastic Gradient Descent (SGD), and the resulting embedding maintains the input logical structure with good accuracy. Next, we show that these quantum embeddings can solve complex deductive as well as predictive reasoning tasks (formally defined later) with much superior performance compared to other kinds of embeddings. Specifically, quantum embeddings show better performance on both link prediction and reasoning tasks. In our experiments, we found that on the FB15K dataset, quantum embeddings obtained via the proposed E2R method exhibit a 57% improvement on MRR and a 95% improvement on HITS@1. Further, on the LUBM1U dataset, E2R exhibits a 76% improvement on MRR and a 139% improvement on HITS@1 relative to the closest competitor.

2 Preliminaries

Description Logic (DL) Syntaxes. Description Logics (DLs) are a family of logics that are fragments of First Order Logic (FOL) [3, 18]. In this paper, we confine ourselves to the simplest form of DL, namely Attributive Language with Complements (ALC).
A triple (N_O, N_C, N_R) is called the signature of the DL, where N_O = {O_1, O_2, ..., O_{|N_O|}} denotes a finite set of entities (aka objects or elements), shown as blue rectangular nodes in the left tree of Figure 1; N_C = {C_1, C_2, ..., C_{|N_C|}} denotes a finite set of unary predicates (aka (atomic) concepts, classes, or types), shown as red oval nodes in the left tree of Figure 1; and N_R = {R_1, R_2, ..., R_{|N_R|}} denotes a finite set of binary predicates (aka (atomic) relations or roles), shown as red oval nodes in the right tree of Figure 1. Complex logical statements in DL are formed by applying one or more of the following constructors to one or more of the atomic concepts: negation (¬) of a concept, equality (=), intersection (⊓), union (⊔), and logical inclusion (⊑) between two concepts. Also, there are two special concepts called ⊤ (every concept) and ⊥ (empty concept). Finally, the operators universal restriction (∀R.C) and existential restriction (∃R.C) correspond to the concepts whose members are given by the sets {O ∈ N_O | for every O′ with R(O, O′) we must have O′ ∈ C} and {O ∈ N_O | there exists an O′ with R(O, O′) and O′ ∈ C}, respectively.

Formal Concept Analysis (FCA) and Distributional Representation. FCA [19] is a branch of lattice theory [20] focusing on a special type of lattice that is obtained from a set of entities, a set of attributes, and a binary relation called the context, which determines whether an entity has an attribute. Although not directly related, FCA has many similarities with DL and can be exploited to obtain a naive and partial solution to the central question of this paper. FCA theory allows embedding the entities of any DL into a binary space, say {0, 1}^d. Such an embedding can also be used as a device to perform Boolean logical operations on unary predicates. The supplementary material provides further details.
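To make this concrete, here is a minimal sketch of the idea (the entities, attributes, and incidence matrix below are hypothetical and not taken from the paper): each entity is embedded as the 0/1 row of a binary context matrix, and Boolean operations over unary predicates reduce to bitwise operations over these vectors.

```python
import numpy as np

# Hypothetical formal context: rows are entities, columns are binary attributes.
# Each entity's embedding in {0,1}^d is simply its row of this incidence matrix.
entities = ["Alice", "Bob", "Sam", "Alex"]
attributes = ["physician", "surgeon", "cardiologist"]
context = np.array([
    [1, 0, 1],   # Alice is a physician and a cardiologist
    [1, 1, 0],   # Bob is a physician and a surgeon
    [0, 0, 0],   # Sam has none of these attributes
    [1, 0, 0],   # Alex is a physician
], dtype=bool)

# A unary predicate corresponds to a column; Boolean logic over predicates then
# reduces to bitwise operations over these binary vectors.
physician, surgeon, cardiologist = context.T
members = physician & ~surgeon                       # "physician AND NOT surgeon"
print([e for e, m in zip(entities, members) if m])   # ['Alice', 'Alex']
```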
3 Quantum Embeddings

Although FCA based embedding partially preserves the logical structure of a DL, it has several limitations: (i) FCA makes the closed-world assumption: all missing membership assertions are treated as non-memberships. In the real world, this assumption is rarely true; in fact, we don't even know how many other atomic concepts/relations are missing from the given KB. (ii) In many real-world setups, an entity may have soft/fuzzy membership in certain concepts; for example, an apple may have both red and green color, each to a varying extent. (iii) Lastly, FCA embeds only unary predicates, whereas we want to embed both unary and binary predicates. These limitations of FCA based embeddings motivated us to explore the use of Quantum Logic. (A similar motivation was advocated by [21] in the context of designing a document retrieval system.)

Originally, QL [22, 17] was proposed by [15] to model the quantum behavior of subatomic particles. We, however, leverage it here to embed the A-Box and T-Box of a DL while preserving the logical structure. In QL, every predicate (unary or binary, atomic or compound) is represented by a linear subspace of a complex vector space Σ, where Σ = C^d for some integer d. (In general, QL allows the embedding space to be any finite or infinite dimensional Hilbert space.) All the entities and entity pairs are denoted by (complex) vectors, and they lie in the predicate subspaces to which they belong (see Figure 2 for a graphical illustration of the idea). The axes of such an embedding space represent latent semantic attributes of the entities and entity pairs. The set of all linear subspaces of Σ, known as the Grassmannian, can be put in correspondence with the set of all possible predicates; infinitely many of them are not even supplied as part of a typical input KB. The subspaces of Σ naturally have a partial order induced over them by the set-theoretic inclusion operation. That is, for any two subspaces S_i, S_j ⊆ Σ, we say S_i ⊑ S_j iff S_i ⊆ S_j, or equivalently, S_i is also a subspace of S_j. Because the origin is a zero-dimensional subspace common to every subspace, by letting ⊥ = {0} and ⊤ = Σ, the resulting partial order over the subspaces becomes a bounded lattice. The quantum logical operations on this lattice are defined by its inventors [15] as follows.

S_i ⊓ S_j = S_i ∧ S_j := S_i ∩ S_j. That is, the logical AND, or intersection, of two subspaces is the largest common subspace.

S_i ⊔ S_j = S_i ∨ S_j := S_i + S_j. Here the + sign means the vector sum and not the set-theoretic union. That is, the logical OR, or union, of two subspaces is the smallest subspace encompassing both of them. This is where QL differs from Boolean logic. This axiom also results in the non-distributive nature of QL: the distributive law S_i ⊓ (S_j ⊔ S_k) = (S_i ⊓ S_j) ⊔ (S_i ⊓ S_k) no longer holds in QL. However, we have proven (refer to the supplementary material) that by restricting each predicate subspace to be parallel to the axes, we can avoid this limitation of QL.

¬S := S^⊥, where S^⊥ is known as the orthogonal complement of the subspace S. Every vector in S^⊥ is perpendicular to every vector in S, and S + S^⊥ = Σ. This means the logical NOT, or negation, of any subspace is given by the set of all vectors that are normal to this subspace.

Figure 2 depicts an illustrative example of a Quantum Logic based embedding of the concept ontology into, say, a Σ = C^3 space. The embedding of the relation hierarchy would be similar, except that an ordered entity pair such as (Bob, Alice) would be mapped to the vector V_Bob + ιV_Alice, where V_Bob + ι0 and V_Alice + ι0 are the embedding vectors of the entities Bob and Alice, respectively. In other words, the imaginary component of an entity is zero, but for a pair of entities it is non-zero. The relations would be mapped to different linear subspaces. In the next section, we present the E2R learning scheme, which automatically maps these predicates and (pairs of) entities to subspaces and vectors, respectively, in the space Σ = C^d for some integer d.

Figure 2: An illustrative example of a quantum embedding.
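The lattice operations above, and the failure of the distributive law for oblique subspaces, can be checked numerically. The following is a small sketch under our own conventions (generic real subspaces stand in for subspaces of Σ, and the helper functions are ours, not part of E2R):

```python
import numpy as np

def span_basis(M, tol=1e-10):
    """Orthonormal basis of the column space of M (an empty matrix is the zero subspace)."""
    if M.shape[1] == 0:
        return M
    u, s, _ = np.linalg.svd(M, full_matrices=False)
    return u[:, s > tol]

def complement(S, tol=1e-10):
    """Orthogonal complement: all vectors perpendicular to S (null space of S^T)."""
    n = S.shape[0]
    if S.shape[1] == 0:
        return np.eye(n)
    _, s, vt = np.linalg.svd(S.T)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

def join(A, B):
    """Quantum OR: S_i + S_j, the smallest subspace containing both."""
    return span_basis(np.hstack([A, B]))

def meet(A, B):
    """Quantum AND: the intersection of S_i and S_j, computed as (A_perp + B_perp)_perp."""
    return complement(join(complement(A), complement(B)))

# Three lines (1-D subspaces) in R^2: the two coordinate axes and an oblique line.
x_axis  = np.array([[1.0], [0.0]])
y_axis  = np.array([[0.0], [1.0]])
oblique = np.array([[1.0], [1.0]]) / np.sqrt(2.0)

# Distributivity fails for the oblique subspace:
# oblique AND (x OR y) is the oblique line itself (dimension 1), whereas
# (oblique AND x) OR (oblique AND y) collapses to the origin (dimension 0).
lhs = meet(oblique, join(x_axis, y_axis))
rhs = join(meet(oblique, x_axis), meet(oblique, y_axis))
print(lhs.shape[1], rhs.shape[1])   # 1 0
```

For axis-parallel subspaces (spans of coordinate axes), both sides always agree, which is the property E2R exploits below.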
4 Embed2Reason

E2R writes down a quantum logical constraint for every logical structure present in the A-Box/T-Box so as to ensure its preservation during embedding. The overall problem thus becomes a constraint satisfaction problem, which we convert into an unconstrained optimization problem via loss functions. In E2R, we assume the underlying field of the embedding space C^d is R. Under this assumption: (i) the vector space Σ = C^d becomes isomorphic to the Euclidean space R^{2d}; that is, there exists a bijective map T : C^d → R^{2d} which preserves vector addition and scalar multiplication [23]; (ii) the subspace corresponding to any single axis of C^d can be viewed as the span of two subspaces, its real part and its imaginary part, which in turn implies that the set of vectors {x + ι0 | x ∈ R} forms a valid subspace of C^d; (iii) any axis-parallel subspace of C^d can be mapped to a unique axis-parallel subspace of R^{2d} and vice versa. Due to these properties, we can simulate C^d via R^{2d} on a digital computer.

Specifically, any axis-parallel subspace S of C^d can be simulated in the space R^{2d} by means of a 2d-dimensional 0/1 indicator vector y with the following property: the coordinates y[i] and y[i + d] are 1 (or 0) depending on whether the real and imaginary part, respectively, of the i-th axis (1 ≤ i ≤ d) of C^d is part (or not part) of the subspace S. To this end, we make the simplifying assumption that E2R always outputs an embedding in which the subspace corresponding to any predicate is parallel to the axes of the space C^d. We call such an embedding an axis-parallel embedding. The axis-parallel assumption alleviates the limitation of QL regarding the distributive law (refer to the supplementary material). Lastly, because the underlying field is R, we use the following valid inner product formula [23]: ⟨x, y⟩ = [(y^∗x)^∗ (y^∗x)]^{1/2} for x, y ∈ C^d, where ^∗ denotes the complex conjugate operation. This inner product is essentially the length of the complex number obtained by taking the standard inner product between the two complex vectors [23]. With this setup, we now describe our E2R technique.

Entities, Pairs of Entities, and Predicates. E2R maps individual entities O_i, O_j ∈ N_O to unit-length vectors (under the ℓ2 norm) of the form x_i = (x_i^R ; 0) and x_j = (x_j^R ; 0), respectively, and the ordered pair (O_i, O_j) to the vector x_{ij} = (x_i^R ; x_j^R). This can be ensured automatically by provisioning the following loss term, for each entity O_i, in the optimization program of E2R:

$$L_{O_i} = \left\| x_i \odot \begin{pmatrix} \mathbf{0}_d \\ \mathbf{1}_d \end{pmatrix} \right\|^2 \tag{1}$$

where 0_d and 1_d are d-dimensional vectors of all 0s and all 1s, and ⊙ denotes element-wise multiplication. Observe that L_{O_i} = 0 is both a necessary and a sufficient condition for x_i to have the form (x_i^R ; 0); therefore, including L_{O_i} as a loss term pushes its value towards zero.

Recall that QL theory requires entities (and pairs of entities) to lie in the subspaces of the concepts (and relations) to which they belong. The above representation conventions therefore immediately entail that the subspace corresponding to any unary predicate, say C_i, must be given by an indicator vector y_i with the property that y_i[t] = 0 for all (d + 1) ≤ t ≤ 2d. Similarly, the subspace corresponding to any binary predicate, say R_i, must be given by an indicator vector z_i for which there exist t_1 (1 ≤ t_1 ≤ d) and t_2 (d + 1 ≤ t_2 ≤ 2d) such that z_i[t_1] = z_i[t_2] = 1. These constraints can be implemented via the following loss terms:

$$L_{C_i} = \left\| y_i \odot \begin{pmatrix} \mathbf{0}_d \\ \mathbf{1}_d \end{pmatrix} \right\|^2; \qquad L_{R_i} = \left[\min\!\left(0,\; z_i^\top \begin{pmatrix} \mathbf{0}_d \\ \mathbf{1}_d \end{pmatrix} - 1\right)\right]^2 + \left[\min\!\left(0,\; z_i^\top \begin{pmatrix} \mathbf{1}_d \\ \mathbf{0}_d \end{pmatrix} - 1\right)\right]^2 \tag{2}$$

It is noteworthy that the above constraints would not make sense unless we ensure that the indicator vectors y_i and z_i are binary vectors. Such combinatorial constraints can be approximately enforced by provisioning the following two losses:

$$L_{y_i} = \left\| y_i \odot \bar{y}_i \right\|^2; \qquad L_{z_i} = \left\| z_i \odot \bar{z}_i \right\|^2 \tag{3}$$

where ȳ_i (or z̄_i) is the bit-flipped version of y_i (or z_i), given by ȳ_i = 1_{2d} − y_i (or z̄_i = 1_{2d} − z_i). The dimensions of the subspaces for C_i and R_i, and the sets of corresponding basis vectors (given by the indicator vectors y_i and z_i), are learnable parameters and are learned automatically.
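A minimal PyTorch sketch of this parameterization and of the regularity losses (1)-(3); the variable names and the tiny dimension are ours, purely for illustration (the paper uses d = 100):

```python
import torch

d = 4
# Learnable parameters: an entity vector in R^{2d} (real half followed by imaginary
# half), and soft 0/1 indicator vectors for one concept and one relation subspace.
x = torch.randn(2 * d, requires_grad=True)   # entity vector x_i
y = torch.rand(2 * d, requires_grad=True)    # concept-subspace indicator y_i
z = torch.rand(2 * d, requires_grad=True)    # relation-subspace indicator z_i

imag_mask = torch.cat([torch.zeros(d), torch.ones(d)])   # (0_d ; 1_d)
real_mask = 1.0 - imag_mask                              # (1_d ; 0_d)

# Eq. (1): an entity vector must have a zero imaginary half.
L_entity = (x * imag_mask).pow(2).sum()

# Eq. (2): a concept subspace uses only real axes; a relation subspace must use at
# least one real axis and at least one imaginary axis.
L_concept = (y * imag_mask).pow(2).sum()
L_relation = torch.clamp(z @ imag_mask - 1.0, max=0.0).pow(2) \
           + torch.clamp(z @ real_mask - 1.0, max=0.0).pow(2)

# Eq. (3): push every indicator entry towards 0 or 1.
L_binary = (y * (1.0 - y)).pow(2).sum() + (z * (1.0 - z)).pow(2).sum()
```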
Membership. For any given O_i ∈ C_j or R_i(O_p, O_q), E2R aspires to ensure that the vectors x_i or x_pq lie in the subspaces corresponding to C_j and R_i, respectively. A natural way to model such constraints is to define the residual length of the projection as a loss metric. The same idea is also endorsed by Quantum Logic, where the projection length is indeed used as the probability of such a membership assertion. This can be ensured by means of the following loss terms, which essentially project the vectors x_i and x_pq onto the subspaces y_j and z_i in an orthogonal manner and enforce zero residual components of these projections. These loss terms also take care of the unit lengths of the entity vectors. Here, inside the block matrices, 0_d and I_d represent the d × d all-zeros matrix and the d × d identity matrix, respectively.

$$L_{(O_i \in C_j)} = \left\| \bar{y}_j \odot x_i \right\|^2 + \left\| \left(\mathbf{1}_{2d} - \begin{bmatrix} \mathbf{0}_d & I_d \\ I_d & \mathbf{0}_d \end{bmatrix} \bar{y}_j\right) \odot x_i \right\|^2 + \left(1 - x_i^\top x_i\right)^2$$

$$L_{R_i(O_p, O_q)} = \left\| \bar{z}_i \odot x_{pq} \right\|^2 + \left\| \left(\mathbf{1}_{2d} - \begin{bmatrix} \mathbf{0}_d & I_d \\ I_d & \mathbf{0}_d \end{bmatrix} \bar{z}_i\right) \odot x_{pq} \right\|^2 + \left\| \begin{pmatrix} \mathbf{1}_d \\ \mathbf{0}_d \end{pmatrix} \odot x_{pq} - x_p \right\|^2 + \left\| \begin{pmatrix} \mathbf{0}_d \\ \mathbf{1}_d \end{pmatrix} \odot x_{pq} - \begin{bmatrix} \mathbf{0}_d & I_d \\ I_d & \mathbf{0}_d \end{bmatrix} x_q \right\|^2 \tag{4}$$

The last term in the expression of L_{(O_i ∈ C_j)} enforces the unit-length constraint for each entity vector. The last two terms in the expression of L_{R_i(O_p,O_q)} enforce the structure of x_pq to be equal to (x_p^R ; x_q^R) (as defined earlier). The second term in each of the above losses represents the squared length of the complex part of the projected vector; this second term is zero for L_{(O_i ∈ C_j)} due to the assumptions about the structures of x_i and y_j.

Logical Inclusion. As per the axioms of QL, for any given C_i ⊑ C_j (or R_i ⊑ R_j), the subspace corresponding to C_i (or R_i) must be a subset of the subspace corresponding to C_j (or R_j). Under the axis-parallel assumption, such a constraint can be implemented via the following loss terms:

$$L_{C_i \sqsubseteq C_j} = \left\| y_i \odot \bar{y}_j \right\|^2; \qquad L_{R_i \sqsubseteq R_j} = \left\| z_i \odot \bar{z}_j \right\|^2 \tag{5}$$

where y_i, y_j, z_i, z_j are the indicator vectors of the subspaces corresponding to C_i, C_j, R_i, R_j, respectively. The rationale behind the above loss terms is as follows. If C_i ⊑ C_j, then a necessary condition is to have a 1 in the vector y_j at every position where there is a 1 in the vector y_i, which means we must have y_i^⊤ ȳ_j = 0. Similarly, one can argue that y_i^⊤ ȳ_j = 0 is also a sufficient condition for C_i ⊑ C_j. The argument for R_i ⊑ R_j is similar.

Logical Conjunction (Disjunction). For any given logical conjunction (disjunction) of the form C_i = C_j ⊓ C_k (C_i = C_j ⊔ C_k), and similarly for R_i, R_j, R_k, we can provision the following loss terms, whose value being zero provides necessary and sufficient conditions for satisfying the QL axiom. Here, max(·) denotes the component-wise max operation. The extension to multiple concepts (or predicates), such as C_i = C_j ⊓ C_k ⊓ C_ℓ (C_i = C_j ⊔ C_k ⊔ C_ℓ), is straightforward.

$$L_{(C_i = C_j \sqcap C_k)} = \left\| y_i - y_j \odot y_k \right\|^2; \qquad L_{(R_i = R_j \sqcap R_k)} = \left\| z_i - (z_j \odot z_k) \right\|^2 \tag{6}$$

$$L_{(C_i = C_j \sqcup C_k)} = \left\| y_i - \max(y_j, y_k) \right\|^2; \qquad L_{(R_i = R_j \sqcup R_k)} = \left\| z_i - \max(z_j, z_k) \right\|^2 \tag{7}$$

Logical Negation. The necessary and sufficient conditions are

$$L_{(C_i = \neg C_j)} = \left(y_i^\top y_j\right)^2 + \left(\bar{y}_i^\top \bar{y}_j\right)^2; \qquad L_{(R_i = \neg R_j)} = \left(z_i^\top z_j\right)^2 + \left(\bar{z}_i^\top \bar{z}_j\right)^2 \tag{8}$$

Universal Type Restriction. For each given universal type restriction ∀R_i.C_j, we need to ensure that for any O_i, O_j ∈ N_O having the relation R_i(O_i, O_j), we must have O_j ∈ C_j. A necessary and sufficient condition for this constraint is as follows: whenever we have O_j ∈ C_k, where C_k is a non-descendant of C_j in the concept ontology, we must not have R_i(O_i, O_j). The loss term for this is given as follows. For all C_k ∈ D_{C_j}, where D_{C_j} denotes the set of concepts that are non-descendants of C_j, we have the loss term

$$L_{(\forall R_i . C_j)}(y_k) = \left( y_k^\top \begin{bmatrix} \mathbf{0}_d & I_d \\ \mathbf{0}_d & \mathbf{0}_d \end{bmatrix} z_i \right)^2 \tag{9}$$

The set D_{C_j} can be shrunk considerably by considering only the following concepts in the concept hierarchy: (i) any child C_k of the root concept that is a non-ancestor of C_j (the children of such a C_k are automatically taken care of by the quantum logical constraints related to inclusion), and (ii) any concept C_k that is either a sibling of C_j or a child of an ancestor of C_j but is not itself an ancestor.
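Under the axis-parallel convention, these A-Box and T-Box losses reduce to element-wise operations on the indicator vectors. The following PyTorch sketch mirrors our reading of Equations (4)-(7); the function names are ours, and only the concept cases are shown:

```python
import torch

def swap_halves(v, d):
    """Multiply by the block matrix [[0, I], [I, 0]]: exchange real and imaginary halves."""
    return torch.cat([v[d:], v[:d]])

def membership_loss(x_i, y_j, d):
    """O_i in C_j (Eq. 4, concept case): residual of the orthogonal projection of x_i
    onto the axis-parallel subspace indicated by y_j, a penalty on the complex part of
    the projection, and a unit-length term."""
    y_bar = 1.0 - y_j
    residual = (y_bar * x_i).pow(2).sum()
    complex_leak = ((1.0 - swap_halves(y_bar, d)) * x_i).pow(2).sum()
    unit_len = (1.0 - x_i @ x_i).pow(2)
    return residual + complex_leak + unit_len

def inclusion_loss(y_i, y_j):
    """C_i included in C_j (Eq. 5): every axis used by C_i must also be used by C_j."""
    return (y_i * (1.0 - y_j)).pow(2).sum()

def conjunction_loss(y_i, y_j, y_k):
    """C_i = C_j AND C_k (Eq. 6): intersection of axis-parallel subspaces is the
    element-wise product of their indicator vectors."""
    return (y_i - y_j * y_k).pow(2).sum()

def disjunction_loss(y_i, y_j, y_k):
    """C_i = C_j OR C_k (Eq. 7): the vector sum corresponds to the element-wise max."""
    return (y_i - torch.maximum(y_j, y_k)).pow(2).sum()
```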
By putting all these losses together, we get the following unconstrained non-convex optimization problem as the E2R learning problem; we call it the E2R model. Here, L_{E2R} is the sum of all the loss terms (each averaged over its corresponding logical assertions) defined in Equations (1) through (9):

$$\underset{\{x_i\},\,\{y_i\},\,\{z_i\}}{\text{minimize}} \quad L_{E2R} \tag{10}$$

Choosing the Embedding Dimension d. Under the axis-parallel assumption, we have a maximum of 2^{2d} different axis-parallel subspaces when we embed into C^d (the field being R). Therefore, if we have |N_C| and |N_R| many unique concepts and relations, then we must have d > log_2 √(|N_C| + |N_R|). This is just a trivial lower bound; in practice, we require a much higher d to accommodate all the constraints (especially the regularity constraints given below). If we allow oblique subspaces, we can embed into a much smaller dimensional space. However, we still prefer axis-parallel embeddings because the distributive law, which in general does not hold for quantum embeddings, surprisingly holds under axis-parallel quantum embeddings (proof given in the supplementary material).

The Problem of Subspace Collapse. The E2R formulation (10) is susceptible to one serious problem, namely subspace collapse. Imagine there are no logical negation or universal type restriction statements in the input KB. In such a case, the loss terms (8) and (9) would be missing, and there would be a degenerate solution, namely all the subspaces (i.e., all y_i and z_i) being the same, because that would drive all the loss terms related to the A-Box and T-Box assertions, namely (4) through (7), to zero. We can avoid this by including two kinds of regularity constraints: (i) for any two predicates (unary/binary) that are siblings of each other, we encourage their subspaces to be as orthogonal as possible; (ii) for any two predicates (unary/binary) that have a (parent, child) relationship in the T-Box hierarchy, we encourage a certain minimum gap between the dimensions of their subspaces. Both constraints can be achieved via the following losses:

$$\Omega_{orth} = \frac{1}{N_{sib}} \sum_{(C_i \,\text{sib}\, C_j)} \left(y_i^\top y_j\right)^2 + \frac{1}{M_{sib}} \sum_{(R_i \,\text{sib}\, R_j)} \left(z_i^\top z_j\right)^2 \tag{11}$$

$$\Omega_{sep} = \frac{1}{N} \sum_{C_i \sqsubseteq C_j} \left(\sqrt{d} - (y_j - y_i)^\top \mathbf{1}_{2d}\right)^2 + \frac{1}{M} \sum_{R_i \sqsubseteq R_j} \left(\sqrt{d} - (z_j - z_i)^\top \mathbf{1}_{2d}\right)^2 \tag{12}$$

The pairs (C_i sib C_j) and (R_i sib R_j) denote pairs of sibling concepts/relations; N_sib and M_sib denote the counts of such sibling pairs. N and M denote the counts of the respective logical inclusion assertions. The choice of √d is flexible and can be replaced with any appropriate positive constant ≥ 1. This choice allows us to accommodate roughly √d levels of the inclusion hierarchy, which usually suffices.

Instead of using the regularization losses (11) and (12), one can achieve a similar effect via an alternative loss term which is simpler and effective in practice (we witnessed this in our experiments). This alternative loss term works as follows: take each valid membership assertion in the given A-Box; replace one of its entities with an invalid (aka negatively sampled) entity; and take the negation of the corresponding membership loss of this newly constructed fake assertion (aka negative sample). In fact, this way of regularizing helps us achieve good convergence during model training. Note that we use only a tiny fraction of invalid entities, just enough to avoid the degenerate solution. This should not be confused with the closed-world assumption, where one uses a huge number of negative assertions.
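A self-contained sketch of one such training setup with a single (hypothetical) membership assertion and one negatively sampled entity. The paper subtracts the fake assertion's membership loss; here we clip it with a hypothetical margin so that the toy objective stays bounded:

```python
import torch

d = 4                                              # illustrative; the paper uses d = 100
x_true = torch.randn(2 * d, requires_grad=True)    # entity from a valid assertion
x_fake = torch.randn(2 * d, requires_grad=True)    # negatively sampled entity
y_conc = torch.rand(2 * d, requires_grad=True)     # indicator of the asserted concept

def residual(x, y):
    """Squared length of the component of x outside the subspace indicated by y."""
    return ((1.0 - y) * x).pow(2).sum()

margin = 1.0                                       # hypothetical hyper-parameter (ours)
optimizer = torch.optim.Adam([x_true, x_fake, y_conc], lr=1e-3)

for step in range(200):
    optimizer.zero_grad()
    # The valid membership is pulled into its subspace; the corrupted one is pushed
    # out, which is what prevents all subspaces from collapsing onto each other.
    loss = residual(x_true, y_conc) + torch.clamp(margin - residual(x_fake, y_conc), min=0.0)
    loss.backward()
    optimizer.step()
```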
Reasoning Tasks. While the applicability of quantum embeddings is quite broad, in what follows we highlight a few reasoning tasks that we think are important and apt for quantum embeddings. Each of these tasks could be deductive or predictive in nature, and our approach need not even know which.

1. Membership [Find all the entities belonging to the concept C, where C could be a complex logical formula.] For this, we convert the formula C into an appropriate subspace S, followed by finding all those entities whose length of orthogonal projection onto S is more than some threshold. The projection length is used to assign the confidence score/probability of the answer being correct.

2. Property Testing [Does entity O_i belong to the concept C, where C could be a complex logical formula?] The previous trick can be applied here as well; the answer is probabilistic and can be made deterministic by setting an appropriate threshold.

3. Property Listing [Given entity O_i, list all the atomic concepts to which it belongs.] Find a set, say B_i, of all those standard basis vectors which are non-orthogonal to the embedding vector x_i of O_i. Find all the atomic concepts C_j whose subspaces lie within the span of B_i. Each such C_j is an answer, with confidence/probability proportional to the projection length.

5 Experiments

Datasets and Setup. We evaluated the performance of E2R on two different kinds of tasks: (i) the link prediction task and (ii) the reasoning task. For link prediction, we chose the FB15K and WN18 datasets because they are standard in the literature [5, 6, 24]. FB15K is a subset of the Freebase dataset, whereas WN18 is a subset of the WordNet dataset featuring lexical relations between words. The train and test sets of these datasets are respectively used for training and testing our proposed model. To evaluate the reasoning capabilities, we chose the LUBM (Lehigh University Benchmark) dataset (http://swat.cse.lehigh.edu/projects/lubm/), which consists of a university domain ontology and customizable, repeatable synthetic data. We generated data for one university (LUBM1U) for the evaluation. We used all the 69628 triples produced by the LUBM generator, along with the ontology in triple form (unary and binary predicates), as our training set. It is important to mention that the majority of the A-Box assertions in the LUBM dataset concern the bottom-most predicates (concepts and relations) in the hierarchy. Starting with these assertions, we custom designed eight new test queries that comprise higher-level predicates in the LUBM hierarchy, so that answering such queries explicitly requires deductive reasoning. These test queries involved deducing the members of the following unary predicates: Professor, Faculty, Person, Student, Course, Organization; and the following binary predicates: MemberOf, WorksFor.

We implemented the E2R model using PyTorch. We used SGD (Stochastic Gradient Descent) with the ADAM optimizer [25] to solve the proposed E2R formulation (10), together with the regularization terms described in Section 4. In all our experiments we used d = 100 for the E2R model; E2R did not show significant sensitivity to the value of d in our experimental range of 100 to 300. As specified in Section 4, the use of negative membership candidates during training helped in achieving good convergence. Using more than one negative candidate per positive membership candidate achieved better convergence; we used 3 different negative entities per positive entity in our experimental setup. For the baseline approaches, we used the optimal parameter setup obtained through grid search; for example, we used embedding dimension d = 100. Note that the datasets FB15K and WN18 do not contain a T-Box.
Therefore, we discarded the loss terms corresponding to T-Box assertions and retained the loss terms corresponding to A-Box assertions while training E2R on these datasets. Similarly, while training the baseline approaches on the LUBM1U dataset, we simply added the T-Box assertions as additional triples, because these approaches have no provision for handling T-Box assertions in an explicit manner. Our experiments were performed on a Tesla K80 GPU machine.

Baselines, Evaluation Metrics, and Results. We used TransE [5] and ComplEx [24] as baselines to illustrate the effectiveness of our proposed model. TransE is a simple but effective model, and ComplEx is one of the current state-of-the-art approaches. We used the OpenKE (https://github.com/thunlp/OpenKE) implementation of these approaches for our evaluation. We compared E2R with these approaches both for the link prediction task (using the FB15K and WN18 datasets) and the reasoning task (using the LUBM1U dataset). Tuning of the hyper-parameters for the baseline approaches was performed on the test set for the FB15K and WN18 datasets but on the training set for LUBM1U. For E2R, the tuning was always done on the training set.

Table 1: Datasets and experimental results. Acronyms used are TE := TransE, CE := ComplEx. The LUBM1U dataset has 69628 triples in its A-Box, and 43 concepts and 25 relations in its T-Box.

          MEAN RANK                 MRR                 HITS@1 (%)           HITS@10 (%)
Data      E2R     TE      CE        E2R   TE    CE      E2R   TE     CE      E2R   TE    CE
FB15K     72.0    68.4    114.0     0.96  0.49  0.61    96.4  34.8   49.8    96.4  76.7  81.2
WN18      5780.2  409.9   468.1     0.71  0.63  0.90    71.1  41.0   87.4    71.1  93.2  95.25
LUBM1U    220.1   1292.6  5742.9    0.46  0.26  0.12    45.4  18.97  12.5    45.4  49.1  12.59

For all the evaluations, we used four standard metrics, namely MEAN RANK (lower is better), MRR (Mean Reciprocal Rank), HITS@1, and HITS@10 [5]. These metrics capture how well an approach is able to extract the correct entities for a given membership query; they improve as the correct answer entities move up in the ranked list of predicted answer entities. Additionally, while computing these metrics, we considered only the filtered case (as advocated in [5]), which avoids conflicts among multiple correct answers in terms of occupying the top rank. For the reasoning task on the LUBM1U dataset, where our goal is to evaluate the inferred memberships of concepts and relations, we computed all the metrics at the query level. The query-level metric is an average over instance-level metrics, where the instances are the true members of the query; we further averaged these metrics over all the queries. Note that we did not use any symbolic reasoner as a baseline, because our objective is not to compete with them but to demonstrate logical reasoning capability in vector spaces. In E2R, the predicted candidate ranked list for any query is generated simply by projecting each entity onto the respective subspace and assigning the length of its residual component as its non-fitment score. For TransE/ComplEx, a membership query rank-orders the candidate triples based on the triple scores obtained through the corresponding model.
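For concreteness, here is a small sketch (the helper names and the example ranks are ours) of how E2R's non-fitment score yields a ranked candidate list, and how the reported MRR and HITS@k follow from the filtered ranks of the true answers:

```python
import torch

def nonfitment(x, y):
    """Length of the residual left after projecting entity vector x onto the
    axis-parallel subspace indicated by y (smaller = better fit)."""
    return ((1.0 - y) * x).norm().item()

def rank_entities(entity_vecs, y_query):
    """Rank all candidate entities for a membership query by non-fitment score."""
    scored = sorted(entity_vecs.items(), key=lambda kv: nonfitment(kv[1], y_query))
    return [name for name, _ in scored]

def mrr_and_hits(ranks, k=10):
    """MRR and HITS@k from the (filtered) 1-based ranks of the true answer instances."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits

# e.g. true answers ranked at positions 1, 3, and 120 give
# mrr_and_hits([1, 3, 120], k=10) == (0.447..., 0.666...)
```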
Insights. Table 1 shows the experimental results. It is evident from the results that E2R outperforms both TransE and ComplEx on the MRR and HITS@1 metrics for all the tasks and datasets, except for WN18. Interestingly, E2R achieves exceptional accuracy improvements over the baseline approaches for the link prediction task on the FB15K dataset as well as on the LUBM1U dataset for the reasoning task. This illustrates the effectiveness of E2R in learning and representing the logical structure of the input KB, and in leveraging the same for the reasoning task. On the other hand, the baseline approaches, which are fundamentally designed to address the link prediction task, find it hard to achieve the same. As far as the WN18 dataset is concerned, it is a collection of binary relation instances where most of the relations satisfy the transitivity property (e.g., hypernym) and have inverse relations (e.g., hypernym/hyponym). The baseline methods, which are primarily distance based, can probably capture transitivity/inversion properties a bit better, whereas E2R somewhat struggles because it is not distance based. Lastly, E2R consistently maintains a high HITS@1 score and, moreover, HITS@1 = HITS@10 for all the datasets. This implies that our algorithm often ranks a ground-truth entity either at Rank 1 or at quite a low rank. We believe this is an artifact of our approach trying to assign different subspaces to different predicates and pushing non-members away from them. This phenomenon also results in a poor (average) MEAN RANK score despite the high HITS@1 score. Such a behavior of E2R makes it practically interesting, because the Rank-1 candidate can be chosen with more confidence.

6 Related Work

[Quantum Logic] Widdows [26, 21] advocated using the QL framework to fix the problem of word-meaning disambiguation in keyword-based search engines. The conceptual framework suggested in that work and its follow-ups, however, does not prescribe any recipe to embed knowledge into a vector space satisfying the QL axioms. Our work aims to bridge this gap.

[Deep Nets for Word Embeddings] In recent times, deep-net based techniques have emerged that can embed word phrases [2, 27] and natural language sentences [28, 29] into a vector space. Such embeddings have offered a remarkable performance boost for several downstream tasks, including question answering [30, 31]. However, a key concern about these techniques is their black-box nature, which makes the underlying models unexplainable and non-interpretable.

[Statistical Relational Learning (SRL) and KG Embeddings] For the majority of SRL techniques, the goal is to embed a Knowledge Graph (KG) into a vector space, followed by the discovery of new triples or links [7, 32, 12]. The prediction of links is done via a scoring function that exploits statistical measures of similarity between the joined entities. While effective for KG completion tasks, this simple form of reasoning based on similarities among entities faces difficulties when handling complex reasoning queries [11]. One striking difference between E2R and KG embedding approaches is that the latter are not capable of handling an ontology of relations/concepts, unlike E2R.

[Hierarchy Aware KB Embedding] Recently, there have been attempts to embed ontological hierarchies [33, 34, 35, 36, 10] as well as to infer ontological hierarchies from embeddings [37, 38]. The work in [33] is less about embedding a KB and more about inducing an ontology. [34] embeds both entities and predicates in the form of vectors, which is very different from our philosophy. Unlike us, in [35], the training set upfront includes the assertions inferred on a select set of LUBM predicates (using a symbolic reasoner), because their aim is to compete with symbolic reasoners (in speed and memory) during inference. [36] falls under the category of distance-translation based embeddings (e.g.,
TransE), except that it uses type information to improve the quality of the entity and relation embeddings; our HITS@10 score outperforms theirs on FB15K. In [10], entities are embedded based on type similarities as well as binary relations, but the ontology of the binary relations themselves is not explicitly preserved. In addition to ontology embedding, there has also been some work on embedding the logical structure [39, 40, 41] of a KB into a vector space so as to enable reasoning. Unlike the QL axioms in our approach, the underlying principles in all of these previous works do not offer any guarantee of preserving the logical structure in the embedding space.

7 Conclusions

A novel approach, namely Embed2Reason, has been proposed, which can embed a DL based symbolic KB into a vector space while preserving the structure of all logical propositions. The idea is inspired by the axioms of Quantum Logic. The key advantage is the ability to answer complex reasoning queries (both deductive and predictive) accurately using a distributional representation of the KB. The experimental results illustrate the effectiveness of E2R relative to standard baselines. Future directions include extensions to more expressive logics as well as natural language KBs.

Acknowledgments

We would like to thank Mukuntha N.S. for pointing out corrections at a few places.

References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[3] Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.
[4] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[5] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
[6] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112–1119, 2014.
[7] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
[8] William L. Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding logical queries on knowledge graphs. In NeurIPS, 2018.
[9] Gonzalo I. Diaz, Achille Fokoue, and Mohammad Sadoghi. EmbedS: Scalable, ontology-aware graph embeddings. In EDBT, pages 433–436, 2018.
[10] Shoaib Jameel and Steven Schockaert. Entity embeddings with conceptual subspaces as a basis for plausible reasoning. In ECAI, 2016.
[11] Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In NIPS, pages 3791–3803, 2017.
[12] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. JMLR, 18:109:1–109:67, 2017.
[13] Alain Colmerauer and Philippe Roussel. The birth of Prolog. In History of Programming Languages Conference (HOPL-II), Preprints, pages 37–52, 1993.
[14] Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, 1991.
[15] Garrett Birkhoff and John von Neumann. The logic of quantum mechanics. The Annals of Mathematics, 37(4):823–843, 1936.
[16] V. S. Varadarajan. Geometry of Quantum Theory. Springer-Verlag, 1985.
[17] M. L. Dalla Chiara and R. Giuntini. Quantum Logic. 2008.
[18] Ronald J. Brachman and Hector J. Levesque. Knowledge Representation and Reasoning. Elsevier, 2004.
[19] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999.
[20] Garrett Birkhoff. Lattice Theory, volume 25 of AMS Colloquium Publications. American Mathematical Society, 3rd edition, 1965.
[21] Dominic Widdows. Geometry and Meaning, volume 172 of CSLI Lecture Notes. CSLI Publications, 2004.
[22] David W. Cohen. An Introduction to Hilbert Space and Quantum Logic. Springer-Verlag, 1989.
[23] Kenneth Hoffman and Ray Kunze. Linear Algebra. Prentice Hall India, 2nd edition, 2015.
[24] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.
[25] Diederik Kingma and Jimmy Ba. ADAM: A method for stochastic optimization. In ICLR, 2015.
[26] Dominic Widdows. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In ACL, pages 136–143, 2003.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
[28] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175, 2018.
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[30] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. CoRR, abs/1808.07042, 2018.
[31] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392, 2016.
[32] Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning. Adaptive Computation and Machine Learning. MIT Press, 2007.
[33] Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. Inductive reasoning about ontologies using conceptual spaces. In AAAI, 2017.
[34] Gonzalo I. Diaz, Achille Fokoue, and Mohammad Sadoghi. EmbedS: Scalable, ontology-aware graph embeddings. In EDBT, pages 433–436, 2018.
[35] Patrick Hohenecker and Thomas Lukasiewicz. Deep learning for ontology reasoning. CoRR, abs/1705.10342, 2017.
[36] Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphs with hierarchical types. In IJCAI, pages 2965–2971, 2016.
[37] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NIPS, pages 6338–6347, 2017.
[38] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, pages 3776–3785, 2018.
[39] Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into embeddings for relation extraction. In NAACL-HLT, pages 1119–1129, 2015.
[40] Tim Rocktäschel, Matko Bosnjak, Sameer Singh, and Sebastian Riedel. Low-dimensional embeddings of logic. In Workshop on Semantic Parsing, co-located with ACL, 2014.
[41] Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel. Lifted rule injection for relation embeddings. In EMNLP, pages 1389–1399, 2016.