# provenance_for_the_description_logic_elhr__f1e53b25.pdf

Provenance for the Description Logic ELHr

Camille Bourgaux1 , Ana Ozaki2,4 , Rafael Pe naloza3 and Livia Predoiu4

1DI ENS, ENS, CNRS, PSL University & Inria, Paris, France 2University of Bergen, Norway 3University of Milano-Bicocca, Italy 4Free University of Bozen-Bolzano, Italy camille.bourgaux@ens.fr, ana.ozaki@uib.no, rafael.penaloza@unimib.it, livia.predoiu@unibz.it

We address the problem of handling provenance information in ELHr ontologies. We consider a setting recently introduced for ontology-based data access, based on semirings and extending classical data provenance, in which ontology axioms are annotated with provenance tokens. A consequence inherits the provenance of the axioms involved in deriving it, yielding a provenance polynomial as annotation. We analyse the semantics for the ELHr case and show that the presence of conjunctions poses various difﬁculties for handling provenance, some of which are mitigated by assuming multiplicative idempotency of the semiring. Under this assumption, we study three problems: ontology completion with provenance, computing the set of relevant axioms for a consequence, and query answering.

1 Introduction Description logics (DLs) are a well-known family of ﬁrstorder logic fragments in which conceptual knowledge about a particular domain and facts about speciﬁc individuals are expressed in an ontology, using unary and binary predicates called concepts and roles [Baader et al., 2007a]. Important reasoning tasks performed over DL ontologies are axiom entailment, i.e. deciding whether a given DL axiom follows from the ontology; and query answering. Since scalability is crucial when using large ontologies, DLs with favorable computational properties have been investigated. In particular, the EL language and some of its extensions allow for axiom entailment in polynomial time, and conjunctive query entailment in NP [Baader et al., 2005; Baader et al., 2008a]. Many real-world ontologies, including SNOMED CT, use languages from the EL family, which underlies the OWL 2 EL proﬁle of the Semantic Web standard ontology language. In many settings it is crucial to know how a consequence e.g. an axiom or a query has been derived from the ontology. In the database community, provenance has been studied for nearly 30 years [Buneman, 2013] and gained traction when the connection to semirings, so called provenance semirings [Green et al., 2007; Green and Tannen, 2017] was discovered. Provenance semirings serve as an abstract algebraic tool to record and track provenance information; that is, to keep track

of the speciﬁc database tuples used for deriving the query, and of the way they have been processed in the derivation. Besides explaining a query answer, provenance has many applications like: computing the probability or the degree of conﬁdence of an answer, counting the different ways of producing an answer, handling authorship, data clearance, or user preferences [Senellart, 2017; Suciu et al., 2011; Lukasiewicz et al., 2014; Ives et al., 2008]. Semiring provenance has drawn interest beyond relational databases (e.g. [Buneman and Kostylev, 2010; Zimmermann et al., 2012; Deutch et al., 2014; Ramusat et al., 2018; Dannert and Gr adel, 2019]), and in particular has recently been considered for ontology-based data access, a setting where a database is enriched with a DL-Lite R ontology and mappings between them [Calvanese et al., 2019]. In the latter, the ontology axioms are annotated with provenance variables. Queries are then annotated with provenance polynomials that express their provenance information. Example 1. Consider the facts mayor(Venice, Brugnaro) and mayor(Venice, Orsoni), stating that Venice has mayors Brugnaro and Orsoni, annotated respectively with provenance information v1 and v2, and the DL axiom ran(mayor) Mayor, expressing that the range of the role mayor is the concept Mayor, annotated with v3. The query x.Mayor(x) asks if there is someone who is a mayor. The answer is yes and it can be derived using ran(mayor) Mayor together with any of the two facts, interpreting x by Brugnaro or Orsoni. This is expressed by the provenance polynomial v1 v3 + v2 v3. Intuitively, expresses the joint use of axioms in a derivation path of the query, and + the alternative derivations. We adapt the provenance semantics of Calvanese et al. for the ELHr variant of EL, extending it to those ELHr axioms that do not occur in DL-Lite R. It turns out that handling the conjunction allowed in ELHr axioms is not trivial. To obtain models from which we can derive meaningful provenanceannotated consequences, we adopt -idempotent semirings and a syntactic restriction on ELHr (preserving the expressivity of full ELHr when annotations are not considered). After introducing the basic deﬁnitions and the semantics for DL ontologies and queries annotated with provenance information, we present a completion algorithm and show that it solves annotated axiom entailment and instance queries in ELHr in polynomial time in the size of the ontology and polynomial space in the size of the provenance polynomial. We then show that we can compute the set of relevant provenance variables

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

for an entailment in polynomial time. Finally, we investigate conjunctive query answering. Note that the query answering methods developed by Calvanese et al. cannot be extended to ELHr since they rely on the FO-rewritability of conjunctive queries in DL-Lite R, a property that does not hold for ELHr [Bienvenu et al., 2013]. Therefore, we adapt the combined approach for query answering in EL [Lutz et al., 2009] to provenance-annotated ELHr ontologies. Detailed proofs are available in [Bourgaux et al., 2020].

2 Provenance for ELHr

We ﬁrst introduce our framework for provenance for ELHr

ontology and discuss our design choices.

2.1 Basic Notions In the database setting, commutative semirings have proven to be convenient for representing various kinds of provenance information [Green et al., 2007; Green and Tannen, 2017]. In a commutative semiring (K, +, , 0, 1), the product and the addition + are commutative and associative binary operators over K, and distributes over +. Given a countably inﬁnite set NV of variables that are used to annotate the database tuples, a provenance semiring is a semiring over a space of annotations, or provenance expressions, with variables from NV. Green and Tannen present a hierarchy of expressiveness for provenance annotations [2017]. The most expressive form of annotations is provided by the provenance polynomials semiring N[NV] = (N[NV], +, , 0, 1) of polynomials with coefﬁcients from N and variables from NV, and the usual operations. The semiring N[NV] is universal, i.e., for any other commutative semiring K = (K, +, , 0, 1), any function ν : NV K can be extended to a semiring homomorphism h : N[NV] K, allowing the computations for K to factor through the computations for N[NV] [Green, 2011]. Hence, provenance polynomials provide the most informative provenance annotations and correspond to so-called how-provenance [Cheney et al., 2009]. Less general provenance semirings are obtained by restricting the operations + and to idempotence and/or absorption [Green and Tannen, 2017]. In this work, we focus on -idempotent semirings, i.e., for every v K, v v = v. This corresponds to the Trio semiring Trio(NV), deﬁned in [Green, 2011] as the quotient semiring of N[NV] by the equivalence kernel trio of the function trio : N[NV] N[NV] that drops exponents. An annotation is a polynomial p that is understood to represent its equivalence class p/ trio. Trio(NV) encompasses in the hierarchy the well-known why-provenance semiring Why(NV) obtained by restricting + to be idempotent as well, where an annotation corresponds to the set of sets of tuples used to derive the result [Cheney et al., 2009]. We use the following notation. A monomial is a ﬁnite product of variables in NV. Let NM be the set of monomials, and NP the set of all ﬁnite sums of monomials, i.e., NP contains polynomials of the form P

1 ji mi vi,ji, with vi,ji NV; n, mi > 0. By distributivity, every polynomial can be written into this form. The representative [m] of a monomial m is the product of the variables occuring in m, in lexicographic order. Two monomials which are equivalent

w.r.t. trio (i.e. are syntactically equal modulo commutativity, associativity and -idempotency) have the same representative, e.g., v u and u v u have representative u v. N[M] denotes the set {[m] | m NM}. As ontology language we use a syntactic restriction of ELHr. Consider three mutually disjoint countable sets of concept NC, role NR, and individual names NI, disjoint from NV. ELHr general concept inclusions (GCIs) are expressions of the form C D, built according to the grammar rules

C ::= A | R.C | C C | D ::= A | R,

where R NR, A NC. Role inclusions (RIs) and range restrictions (RRs) are expressions of the form R S and ran(R) A, respectively, with R, S NR and A NC. An assertion is an expression of the form A(a) or R(a, b), with A NC, R NR, and a, b NI. An axiom is a GCI, RI, RR, or assertion. An ELHr ontology is a ﬁnite set of ELHr

axioms. ELHr usually allows GCIs of the form C C, but these can be translated into our format by exhaustively applying the rules: (i) replace C C1 C2 by C C1 and C C2, (ii) replace C1 R.C2 by C1 S, S R and ran(S) C2 where S is a fresh role name. The reason for syntactically restricting ELHr is that conjunctions or qualiﬁed restrictions of a role on the right-hand side of GCIs lead to counter-intuitive behavior when adding provenance annotations. We discuss this later in this section.

2.2 Annotated Ontologies

Provenance information is stored as annotations. An annotated axiom has the form (α, m) with α an axiom and m NM. An annotated ELHr ontology O is a ﬁnite set of annotated ELHr

axioms of the form (α, v) with v NV {1}. We denote by ind(O) the set of individual names occurring in O. The semantics of annotated ontologies extends the classical notion of interpretations to track provenance. An annotated interpretation is a triple I = ( I, I m, I) where I, I m are non-empty disjoint sets (the domain and domain of monomials of I, respectively), and I maps

every a NI to a I I; every A NC to AI I I m; every R NR to RI I I I m; and every m, n NM to m I, n I I m s.t. m I = n I iff m trio n.1

We extend I to complex ELHr expressions as usual:

( )I = I {1I};

( R)I = {(d, m I) | e I s.t. (d, e, m I) RI};

(C D)I = {(d, (m n)I) | (d, m I) CI, (d, n I) DI};

(ran(R))I = {(e, m I) | d I s.t. (d, e, m I) RI};

( R.C)I = {(d, (m n)I) | e I s.t.

(d, e, m I) RI, (e, n I) CI}.

The annotated interpretation I satisﬁes:

1or iff m = n if we consider N[NV] instead of Trio(NV)

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

(R S, m), if, for all n NM, (d, e, n I) RI

implies (d, e, (m n)I) SI; (C D, m), if, for all n NM, (d, n I) CI

implies (d, (m n)I) DI; (A(a), m), if (a I, m I) AI; and (R(a, b), m), if (a I, b I, m I) RI.

I is a model of the annotated ontology O, denoted I |= O, if it satisﬁes all annotated axioms in O. O entails (α, m), denoted O |= (α, m), if I |= (α, m) for every model I of O.

Remark. While it may appear counter-intuitive at ﬁrst sight that CI differs from (C C)I, this is in line with the intuition behind the provenance of a conjunction. In the database setting, the Trio-provenance of tuple (a) being an answer to query yz.R(x, y) R(x, z) over {R(a, b), R(a, c)} is also different from that of (a) being an answer to y.R(x, y). Example 2 illustrates the semantics and some differences with the DL-Lite R case from [Calvanese et al., 2019].

Example 2. Consider the following annotated ontology.

O = {(mayor(Venice, Orsoni), v1), (predecessor(Brugnaro, Orsoni), v2), ( predecessor.Mayor Mayor, v3), (ran(mayor) Mayor, v4)}.

Let I be s.t. I = {Brugnaro, Orsoni, Venice}, I m = N[M], individual names are interpreted by themselves, monomials by their representatives and

mayor I = {(Venice, Orsoni, v1)},

predecessor I = {(Brugnaro, Orsoni, v2)},

Mayor I = {(Orsoni, v1 v4), (Brugnaro, v1 v2 v3 v4)}.

I |= O by the semantics of annotated ELHr. Moreover, it can be veriﬁed that if I |= (α, m), then O |= (α, m). Note that O entails (Mayor(Brugnaro), v1 v2 v3 v4) whose provenance monomial contains v1 and v2, witnessing that the two assertions of O have been used to derive Mayor(Brugnaro). Combining two assertions to derive another one is not possible in DL-Lite R. The rewriting-based approach by Calvanese et al. cannot be applied here as predecessor.Mayor Mayor leads to inﬁnitely many rewritings.

2.3 Discussion on Framework Restrictions Example 2 shows that conjunction and qualiﬁed role restriction lead to a behavior different from DL-Lite R. They are also the reason for some features of our setting. First, the next example illustrates the -idempotency impact for the EL family.

Example 3. Let O = {(A B1, v1), (A B2, v2), (B1 B2 C, v3)}. If I is a model of O and (e, n I) AI, then (e, (n v1)I) BI 1 and (e, (n v2)I) BI 2 so (e, (n v1 n v2)I) (B1 B2)I, i.e. (e, (n v1 v2)I) (B1 B2)I by -idempotency, which implies (e, (n v1 v2 v3)I) CI. Thus O |= (A C, v1 v2 v3). This intuitive entailment is lost if is not idempotent. Indeed, assume that is not idempotent and let I be the interpretation deﬁned as

follows (where I = {e} and I m contains all monomials with variables in lexicographic order).

AI ={(e, u)} BI 1 = {(e, u v1)} BI 2 = {(e, u v2)}

CI ={(e, u u v1 v2 v3)}.

I is a model of O such that I |= (A C, v1 v2 v3). A downside of -idempotency is a loss of the expressive power of provenance, neglecting the number of times an axiom is used in a derivation. Let O = {(A B, v1), (B A, v2)}. With -idempotency, O |= (A B, vk 1 vl 2) for k 1 and l 0 because for k, l 1, vk 1 vl 2 is interpreted by (v1 v2)I in any interpretation I. In contrast, if is not idempotent, we only obtain O |= (A B, vk+1 1 vk 2) for k 0 (in particular O |= (A B, v1 v2)), which is a more informative result. Some useful semirings are not -idempotent; e.g. the Viterbi semiring ([0, 1] , max, , 0, 1), where is the usual product over real numbers, which is applied for representing conﬁdence scores. We limit ourselves to -idempotent semirings because we are interested in computing provenance not only for assertions or queries, but also for GCIs. In particular, when a non-annotated ontology entails the GCI C D, we want the annotated version of the ontology to entail (C D, m) for some monomial m. The non-idempotent case could be relevant when one is not concerned with provenance for GCI entailment, and is left as future work. Many useful semirings are -idempotent. Examples of these are: the Boolean semiring, used for probabilistic query answering in databases; the security semiring, used to determine the minimal level of clearance required to get the consequence; and the fuzzy semiring which allows to determine the truth degree of the consequence (see e.g. [Senellart, 2017] for details on these semirings and more examples). Second, let us explain the restrictions on the form of the right-hand side of the GCIs. Example 4 illustrates the case of conjunctions. Qualiﬁed role restrictions lead to the same kind of behavior (they can be seen as implicit conjunctions). Example 4. Let O = {(A B C, v), (A(a), u)}. All the following interpretations which interpret a by itself and monomials by their representatives are models of O:

AI1 = {(a, u)}, BI1 = {(a, u v)}, CI1 = {(a, u v)} AI2 = {(a, u)}, BI2 = {(a, u)}, CI2 = {(a, v)} AI3 = {(a, u)}, BI3 = {(a, 1)}, CI3 = {(a, u v)}

Since the semantics does not provide a unique way to split the monomial u v between the two elements of the conjunction, O |= (B(a), m) for any m NM, and in particular, O |= (B(a), u v). It is arguably counter-intuitive since we intuitively know that a is in A with provenance u and that A is a subclass of the intersection of B and C with provenance v. Partially normalizing the ontology before annotating it, or more speciﬁcally, replacing e.g. annotated GCIs of the form (C C1 C2, v) by (C C1, v) and (C C2, v), may be acceptable in most cases, even if the rewritten ontology leads to additional arguably natural consequences compared to the original one. For instance, even if O |= (A B, v) in Example 4, in many cases a user would accept to change the GCI of O to (A B, v) and (A C, v) as it may reﬂect

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

the original intention of the GCI since {A B, A C} and A B C are semantically equivalent. One could argue that it would be better to deﬁne the semantics so that only I1 was a model of O in Example 4, instead of restricting the language as we do. We explain next why this is not so simple. One possibility is to change the deﬁnition of satisfaction of a GCI by an interpretation such that I |= (A B C, m) iff for every (d, n I) AI, then (d, (m n)I) BI and (d, (m n)I) CI, and similarly for qualiﬁed role restrictions. This approach leads to a counter-intuitive behavior. For instance if O = {(A(a), u), (B(a), v)}, then O |= (A B A B, 1), since (a, (u v)I) (A B)I for every model I of O, but there is a model I of O such that (a, (u v)I) / AI (and (a, (u v)I) / BI). In contrast, our deﬁnition of satisfaction ensures that for every interpretation I and concept C, I |= (C C, 1). Another possibility is to modify the interpretation of conjunctions and qualiﬁed role restrictions such that (C D)I = {(d, m I) | (d, m I) CI, (d, m I) DI} and ( R.C)I = {(d, m I) | e I s.t. (d, e, m I) RI, (e, m I) CI}. In this case, we lose even basic entailments from annotated ABoxes; e.g., {(A(a), u), (B(a), v)} |= ((A B)(a), u v). We also lose the entailment of the GCI from Example 3. Hence, restricting the syntax to prevent conjunctions on the right and deﬁning the semantics as usual in DLs seems to be the most natural way of handling provenance in DL languages with conjunction. Since EL ontologies are often already expressed in normal form, the main restriction in our language is the avoidance of qualiﬁed existential restrictions on the right-hand side.

2.4 Annotated Queries Following Calvanese et al. [2019], we extend DL conjunctive queries with binary and ternary predicates, where the last term of the tuple is used for provenance information. Recall that by the semantics of annotated ontologies, tuples can only contain monomials. A Boolean conjunctive query (BCQ) q is a sentence x.ϕ( x, a), where ϕ is a conjunction of (unique) atoms of the form A(t1, t), R(t1, t2, t); ti is an individual name from a, or a variable from x; and t (the last term of the tuple) is a variable from x that does not occur anywhere else in q (Calvanese et al. call such a query standard). We use P( t, t) to refer to an atom which is either A(t1, t) or R(t1, t2, t), and P( t, t) q if P( t, t) occurs in q. A match of the BCQ q = x.ϕ( x, a) in the annotated interpretation I is a function π : x a I I m, such that π(b) = b I for all b a, and π( t, t) P I for every P( t, t) q, where π( t, t) is a shorthand for (π(t1), π(t)) or (π(t1), π(t2), π(t)) depending on the arity of P. I satisﬁes the BCQ q, written I |= q, if there is a match of q in I. A BCQ q is entailed by an annotated ontology O, denoted O |= q, if every model of O satisﬁes q. For a BCQ q and an interpretation I, νI(q) denotes the set of all matches of q in I. The provenance of q on I is the expression prov I(q) := P π νI(q)[Q P ( t,t) q π (t)]

where π(t) is the last element of the tuple π( t, t) P I; and π (t) is the only m N[M] s.t. m I = π(t). For p NP,

we write p prov I(q) if p is a sum of monomials and for each occurrence of a monomial in p we ﬁnd an occurrence of its representative in prov I(q). I satisﬁes q with provenance p NP, denoted I |= (q, p), if I |= q and p prov I(q). O |= (q, p), if O |= q and p prov I(q), for all I |= O. We call (q, p) an annotated query.

Remark. When O contains only assertions (no GCIs, RIs, and RRs), we can compare the provenance annotations we obtain to the database case. Similarly to Trio-provenance, the sums of monomials distinguish different ways the query atoms can be mapped to annotated interpretations. For example, given O = {(R(a, b), v1), (R(b, a), v2)} and query q = xytt .R(x, y, t) R(y, x, t ), it holds that O |= (q, v1 v2 + v1 v2). The provenance annotation v1 v2 + v1 v2 distinguishes among two derivations using the same axioms, contrary to the why-provenance v1 v2. Note that given an axiom α and O that may contain GCIs, RIs, and RRs, the sum over all monomials m such that O |= (α, m) represents all possible derivations of α, in the why-provenance spirit. The size |X| of an annotated ontology, a polynomial or a BCQ X is the length of the string representing X, where elements of NC, NR, NI and NV in X are of length one. We often omit annotated and refer to ontologies, queries, assertions, etc. when it is clear from the context.

3 Reasoning with Annotated ELHr Ontologies We present a completion algorithm for deriving basic entailments from an ELHr ontology. As usual with completion algorithms, we restrict to ontologies in normal form. The annotated ELHr ontology O is in normal form if for every GCI (α, v) O, α is of the form A B, A A B, A R, or R.A B, with A, A NC { }, B NC. Every annotated ELHr ontology can be transformed, in polynomial time, into an ontology in normal form which entails the same axioms over the ontology signature, using the following rules where b C, b D / NC { } and A is a fresh concept name:

NF1 : (C b D E, v) ( b D A, 1), (C A E, v) NF2 : ( R. b C D, v) ( b C A, 1), ( R.A D, v) NF3 : ( b C R, v) ( b C A, 1), (A R, v).

Theorem 5. Let O be an annotated ELHr ontology, α an axiom, and m a monomial. Let O be obtained by applying exhaustively Rules NF1-NF3 to O.

If O |= (α, m), then O |= (α, m).

If O |= (α, m) and every concept name occurring in α occurs in O, then O |= (α, m). Before describing the reasoning algorithm in detail, we present an important property of entailment; namely, that all entailment problems can be polynomially reduced to each other. This allows us to focus on only one problem. In particular, we focus on entailment of annotated assertions.

Theorem 6. Let O be an annotated ontology, and (α, m) an annotated GCI, RR, or RI. One can construct in polynomial time an ontology O and an annotated assertion (β, n) such that O |= (α, m) iff O |= (β, n). Conversely, if (α, m) is an

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

if then (if Φ / O)

CR0 X NC NR { } occurs in O add Φ = (X X, 1) to O CR1 (R1 R2, m1), (R2 R3, m2) O add Φ = (R1 R3, [m1 m2]) to O CR2 (R S, m1), (ran(S) A, m2) O add Φ = (ran(R) A, [m1 m2]) to O CR3 (A R, m1), (R S, m2) O add Φ = (A S, [m1 m2]) to O CR4 (A B, m1), (B C, m2) O add Φ = (A C, [m1 m2]) to O CR5 (A B, m1), (B R, m2) O add Φ = (A R, [m1 m2]) to O CR6 (A B1, m1), (A B2, m2), (B1 B2 C, m3) O add Φ = (A C, [m1 m2 m3]) to O CR7 (ran(R) B1, m1), (ran(R) B2, m2), (B1 C1, m3), (B2 C2, m4), (C1 C2 C, m5) O add Φ = (ran(R) C, [m1 m2 m3 m4 m5]) to O CR8 (A B C, m1), ( B, m2) O add Φ = (A C, [m1 m2]) to O CR9 (A S, m1), (ran(S) B, m2), (B C, m3), (S R, m4), ( R.C D, m5) O add Φ = (A D, [m1 m2 m3 m4 m5]) to O CR10 (A R, m1), ( B, m2), ( R.B C, m3) O add Φ = (A C, [m1 m2 m3]) to O

CR11 a ind(O) add Φ = ( (a), 1) to O CR12 (R(a, b), m1), (R S, m2) O add Φ = (S(a, b), [m1 m2]) to O CR13 (A(a), m1), (A B, m2) O add Φ = (B(a), [m1 m2]) to O CR14 (A1(a), m1), (A2(a), m2), (A1 A2 B, m3) O add Φ = (B(a), [m1 m2 m3]) to O CR15 (R(a, b), m1), (A(b), m2), ( R.A B, m3) O add Φ = (B(a), [m1 m2 m3]) to O CR16 (R(a, b), m1), (ran(R) A, m2) O add Φ = (A(b), [m1 m2]) to O

Table 1: Completion rules. A, . . . , D NC { }, R, S, Ri NR, m, mi NM.

annotated concept (resp. role) assertion, one can construct in polynomial time an ontology O and two annotated concept (resp. role) inclusions (β, n), (γ, n) such that O |= (α, m) iff O |= (β, n) or O |= (γ, n). We adapt the classical EL completion rules to handle annotated ELHr ontologies in normal form. The algorithm starts with the original ontology O, and extends it through an iterative application of the rules from Table 1 until O becomes saturated; i.e., no more rules are applicable. We cannot use the rules of [Baader et al., 2008a] which eliminate range restrictions by adding GCIs with qualiﬁed role restrictions on the right so we designed rules for ELHr. A rule application may add axioms annotated with monomials, and other assertions ( (a), 1), which are not foreseen in the deﬁnition of annotated ontologies. Still, -idempotency ensures that all monomials have at most |O| factors. To show that the completion algorithm is sound and complete for deciding assertion entailment, we prove a stronger result. The k-saturation of O is the saturated ontology Ok obtained from O through the completion algorithm restricted to monomials of length at most k. We show that Ok sufﬁces for deciding entailment of annotated assertions (α, m) where m is a monomial of length at most k. Theorem 7. If Ok is the k-saturation of O, then 1. Ok is computable in polynomial time w.r.t. the size of O, and in exponential time w.r.t. k, 2. for every assertion α and monomial mk with at most k variables, O |= (α, mk) iff (α, [mk]) Ok. This theorem states that to decide whether an assertion (α, m) is entailed by O, one just needs to ﬁnd the k-saturation of O, where k is the number of variables in m, and then check whether (α, [m]) Ok. Due to the ﬁrst point of Theorem 7 and Theorem 6, we obtain the following corollary. Corollary 8. For every axiom α, O |= (α, m) is decidable in polynomial time in |O| and in exponential time in |m|. In general there is no need to interrupt the completion algorithm; the ontology saturated without restricting the monomial length can be used to decide all relevant entailments regardless of the length of the monomial. Using Ok is merely an optimisation when one is only interested in a short monomial.

While the polynomial time upper bound w.r.t. the ontology size is positive, and in line with the complexity of the EL family, the exponential time bound on the monomial size does not scale well for entailments with larger monomials. Recall that these bounds are based on the number of annotated axioms generated by the completion rules. The following example illustrates the potential exponential blow-up.

Example 9. Consider O = {(A Ai, vi), (Ai B, ui) | 0 i n} {(B A, u)}. If O is the result of applying the completion algorithm to O, then for every S {1, . . . , n}, (B A, [u Πi Sui vi]) O .

Following Hutschenreiter and Pe naloza [2017], we can see the completion algorithm as an automaton. More precisely, given O and (α, m), we can construct a tree automaton A, whose states correspond exactly to all the elements in Ok, such that (α, [m]) Ok iff A accepts at least one tree. Brieﬂy, A is constructed by reading the rule applications backwards, allowing transitions from the consequence to the premises of the rule; see [Hutschenreiter and Pe naloza, 2017] for details. The number of states in A is exactly the cardinality of Ok and hence potentially exponential on k. However, the size of each state is bounded polynomially on k; the arity of the automaton is bounded by the maximum number of premises in a rule, in this case 5; and one can bound polynomially on k the number of different states that may appear in any successful run of A. Thus, A satisﬁes the conditions for a PSpace emptiness test [Baader et al., 2008b], which yields the following result.

Proposition 10. For every axiom α, O |= (α, m) is decidable in polynomial space in |m|.

Interestingly, these results allow us to bound the full complexity of answering instance queries (IQ) of the form C(a) where C is an ELHr concept and a NI.

Theorem 11. Let O be an ontology, C(a) an IQ and m NM. O |= (C(a), m) is decidable in polynomial time in |O| and |C(a)|, and polynomial space in |m|.

4 Computing Relevant Provenance Variables

An interesting question is whether a given variable appears in the provenance of a query q; i.e., whether a given axiom occurs

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

in some derivation of q. Formally, v NV is relevant for q (w.r.t. ontology O) iff m NM s.t. O |= (q, v m). For IQs and ELHr this problem can be solved in polynomial time, via an algorithm computing all the relevant variables for all queries of the form A(a), with a NI, A NC. We modify the completion algorithm (Section 3) to combine all monomials from a derivation, instead of storing them separately. As in Section 3, the algorithm assumes normal form and keeps as data structure a set S of annotated axioms (α, m), where α uses the vocabulary of O, and m NM. S is initialised as the original ontology where annotations of the same axiom are merged into a single monomial:

S := {(α, [Πv Vαv]) | (α, u) O, Vα = {v | (α, v) O}},

and extended by exhaustively applying the rules in Table 1, where rule applications change S into

S (α, m) := S {(α, m)} if there is no (α, n) S S \ {(α, n)} {(α, [m n])} if (α, n) S;

i.e., add the axiom α with an associated monomial if it does not yet appear in S, and modify the monomial associated to α to include new variables otherwise. To ensure termination, a rule is only applied if it modiﬁes S. The rules are applied until no new rule is applicable; i.e., S is saturated.

Example 12. The relevance algorithm on the ontology of Example 9, yields the saturated set S = {(A A, m), (B B, m), (A B, m), (B A, m)} {(Ai B, m), (B Ai, m), (Ai A, m), (A Ai, m) | (1 i n} {(Ai Aj, m) | 1 i, j n} with m = u Πn i=1ui Πn i=1vi.

Each rule application either adds a new axiom, or adds to the label of an existing axiom more variables. As the number of concept and role names, and variables appearing in S is linear on O, at most polynomially many rules are applied, each requiring polynomial time; i.e, the algorithm is polynomial.

Lemma 13. If S is the saturated set obtained from O, a NI, A NC, and v NV, then v is relevant for A(a) iff v occurs in m for some (A(a), m) S.

The algorithm decides relevance for assertion entailment in ELHr, yielding a polynomial-time upper bound for this problem. As in Section 3, axioms and IQs can be handled in polynomial time as well.

Theorem 14. Relevance for axiom and IQ entailment in ELHr can be decided in polynomial time.

This result shows that if we only need to know which axioms are used to derive an axiom or an IQ, the complexity is the same as reasoning in ELHr without provenance. This contrasts with axiom pinpointing : the task of ﬁnding the axioms responsible for a consequence to follow, in the sense of belonging to some minimal subontology entailing it (a Min A). Deciding whether an axiom belongs to a Min A is NP-hard for Horn-EL [Pe naloza and Sertkaya, 2010]. Relevance is easier in our context since provenance does not require minimality: if O = {(A B, v1), (B C, v2), (C B, v3)}, v2 and v3 are relevant for A B, but the only Min A is {A B} so other axioms are not relevant for axiom pinpointing.

Provenance relevance is related to lean kernels (LKs) [Pe naloza et al., 2017], which approximate the union of Min As. The LK of a consequence c is the set of axioms appearing in at least one proof of c in a given inference method, generalizing the notion from propositional logic, where an LK is the set of clauses appearing in a resolution proof for unsatisﬁability. The sets of variables computed by our algorithm are the sets of axioms used in the derivations by the completion algorithm, which is a consequence-based method for ELHr. Thus they correspond to LKs for the associated axioms and our algorithm is an alternative way of computing LKs in ELHr.

5 Query Answering with Provenance Even if ELHr is expressive enough to reduce entailment of rooted tree-shaped BCQs to assertion entailment, the methods presented in Section 3 do not apply to other kinds of BCQs. Example 15. For O={(R(a, a), u1), (A(a), u2), (A R, v1), (ran(R) A, v2)} and q = xyztt t .R(x, x, t) R(x, y, t ) R(z, y, t ), O |= (q, u1) but O |= (q, u2 v1 v2): O has a model I with RI = {(a, a, u1), (a, b1, u2 v1), (a, c1, u1 v1 v2)} {(bi, bi+1, u2 v1 v2) | i 1} {(ci, ci+1, u1 v1 v2) | i 1}. We adapt the combined approach by Lutz et al. [2009] to trace provenance. Assume that queries contain only individual names occurring in the ontology O. The combined approach builds a canonical model for O and shows that every query q can be rewritten into a query q that holds in this canonical model iff O |= q. We ﬁrst deﬁne the canonical model IO of an ontology O annotated with provenance information. Assume that O is in normal form; mon(O) denotes the set of monomial representatives built using variables of NV occurring in O, and rol(O) is the set of role names occurring in O. Also assume that ( ) if there is B NC, R NR, and n NM such that O |= (ran(R) B, n), then (ran(R) B, [n]) O. This simpliﬁes the presentation of the construction of the canonical model. Let aux(O) := {dm R | R rol(O), m mon(O)}. Assume that ind(O) aux(O) = . We deﬁne the domain of IO and the domain of monomials of IO as follows:

IO := ind(O) aux(O) IO m := N[M]

We deﬁne the interpretation function of IO as the union of Ii O, i 0. The function I0 O sets a I0 O = a for all a ind(O) (for a NI \ ind(O) the mapping a IO is irrelevant), m I0 O = [m] for all m NM, and for all A NC and all R NR,

AI0 O := {(a, [m]) | O |= (A(a), m)}

RI0 O := {(a, b, [m]) | O |= (R(a, b), m)}.

If Ii O is deﬁned, we deﬁne Ii+1 O by choosing an annotated axiom α O and applying one of the following rules in a fair way (i.e., every applicable rule is eventually applied). R1 α = (C A, m): if there is d IO and n mon(O) s.t. (d, [n]) CIi O, then add (d, [m n]) to AIi O. R2 α = (C R, m): if there is n mon(O), d IO s.t. (d, [n]) CIi O, then add (d, d[m n] R , [m n]) to RIi O.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

R3 α = (R S, m): if there are d, d IO, n mon(O) s.t. (d, d , [n]) RIi O, then add (d, d , [m n]) to SIi O. Example 16. For our running example, IO is as follows:

AIO = {(a, u2), (a, u1 v2), (du2 v1 R , u2 v1 v2), (du1 v1 v2 R , u1 v1 v2), (du2 v1 v2 R , u2 v1 v2)} RIO = {(a, a, u1), (a, du2 v1 R , u2 v1), (a, du1 v1 v2 R , u1 v1 v2), (du2 v1 R , du2 v1 v2 R , u2 v1 v2), (du1 v1 v2 R , du1 v1 v2 R , u1 v1 v2), (du2 v1 v2 R , du2 v1 v2 R , u2 v1 v2)}.

Proposition 17 formalises the fact that IO is a model of O. Proposition 17. IO is a model of O. We deﬁne the rewriting q of a query q, closely following Lutz et al. [2009]. It contains an additional predicate Aux, always interpreted as ( IO \ ind(O)) {1IO} in IO. Let q be the smallest transitive relation over terms of q, term(q), that includes identity relation, and satisﬁes the closure condition ( ) R1(t1, t2, t), R2(t 1, t 2, t ) q, t2 q t 2 = t1 q t 1. Clearly, the relation q is computable in polynomial time in the size of q. Deﬁne for any equivalence class χ of q, the set

pre(χ) = {t1 | R NR s.t. R(t1, t2, t) q and t2 χ}.

We deﬁne the sets Cyc and Fork= whose main purpose in the translation is to prevent spurious matches (e.g., with cycles) of a query in the anonymous part of the canonical model. Fork= is the set of pairs (pre(χ), χ) with pre(χ) of cardinality at least two. Cyc is the set of variables x in term(q) such that there are R0(t0 1, t0 2, t0), . . ., Rm(tm 1 , tm 2 , tm), . . ., Rn(tn 1, tn 2, tn) in q with n, m 0, x q tj 1 for some j n, ti 2 q ti+1 1 for all i < n, and tn 2 q tm 1 . Fork=, and Cyc can also be computed in polynomial time in the size of q. For each equivalence class χ of q, we choose a representative tχ χ. For q = x.ψ, the rewritten query q is deﬁned as x.(ψ ϕ1 ϕ2), where

x Cyc Aux(x, 1)

({t1,...,tk},χ) Fork= (Aux(tχ, 1)

1 i<k ti = ti+1).

Example 18. The rewriting q of q in Example 16 is xyztt t .(R(x, x, t) R(x, y, t ) R(z, y, t ) Aux(x, 1) (Aux(y, 1) x = z)). ϕ1 prevents mapping x to some dm R , avoiding the R-loops in the anonymous part of IO to satisfy R(x, x, t). ϕ2 enforces that if y is mapped in the anonymous part, then x and z are mapped to the same object, which avoids R-loops in the anonymous part of IO. Our construction differs from the original rewriting of Lutz et al. [2009]. In particular, in their rewriting there is a formula ϕ3, which is not necessary in our case. Intuitively, this is because we keep the information of the role name used to connect an element of aux(O) to the rest of the model. Theorem 19 establishes that q is as required.

Theorem 19. Let O be an ontology in normal form and (q, p) be an annotated query. Then, O |= (q, p) iff IO |= (q , p). Although the domain of monomials is inﬁnite, since only elements of mon(O) are relevant, an exponential size structure representing IO is sufﬁcient to check whether IO |= (q , p). The size of the resulting structure is exponential in |O| and can be constructed in exponential time using the completion algorithm (Theorem 7) to check entailment of assertions and RRs. Corollary 20. Let O be an ontology, q a BCQ and p NP. O |= (q, p) is decidable in exponential time in |O| + |(q, p)|.

6 Discussion and Conclusions We study the problem of computing the provenance of an axiom or a BCQ entailment from ELHr ontologies. In particular, entailment of annotated axioms or IQs for a ﬁxed monomial size is tractable, and the set of relevant provenance variables can be computed in polynomial time. For the more challenging problem of CQ answering, we adapt the combined approach.

Related work. Explaining inferences in DLs has been studied mostly focusing on explaining axiom entailment, in particular concept subsumption, through axiom pinpointing [Schlobach and Cornet, 2003; Kalyanpur et al., 2007; Baader et al., 2007b]. Few approaches address query answer explanation for DL-Lite or existential rules [Borgida et al., 2008; Croce and Lenzerini, 2018; Ceylan et al., 2019; Bienvenu et al., 2019]. However, current explanation services in DLs provide minimal explanations, which is crucially different to provenance, since provenance takes into account all derivations (cf. discussion in Section 4). Closest to our work is provenance for OBDA [Calvanese et al., 2019]. However, the challenges in enriching the EL family with provenance were not investigated. We also study additional problems such as axiom entailment and relevance. Dannert and Gr adel [2019] consider provenance in the DL ALC. The setting is not the same as ours since they only consider annotated assertions (not annotated GCIs), do not study BCQs, and the semantics is different as well. There are several proposals for handling provenance in RDF(S), most notably an algebraic deductive system for annotated RDFS [Buneman and Kostylev, 2010]. The approach by Bourgaux and Ozaki [2019] for attributed DL-Lite fundamentally differs by using GCIs and RIs to express constraints on provenance.

Acknowledgments This work was supported by Camille Bourgaux s CNRS PEPS grant; contract ANR-18-CE23-0003 (CQFD); the University of Bergen; the Free University of Bozen-Bolzano projects PROVDL and FO2S; and the Italian PRIN project HOPE.

References [Baader et al., 2005] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In IJCAI, 2005. [Baader et al., 2007a] Franz Baader, Diego Calvanese, Deborah Mc Guinness, Daniele Nardi, and Peter Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, second edition, 2007.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

[Baader et al., 2007b] Franz Baader, Rafael Pe naloza, and Boontawee Suntisrivaraporn. Pinpointing in the description logic EL+. In KI, 2007. [Baader et al., 2008a] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope further. In OWLED, 2008. [Baader et al., 2008b] Franz Baader, Jan Hladik, and Rafael Pe naloza. Automata can show PSpace results for description logics. Inf. Comput., 206(9-10), 2008. [Bienvenu et al., 2013] Meghyn Bienvenu, Carsten Lutz, and Frank Wolter. First-Order Rewritability of Atomic Queries in Horn Description Logics. In IJCAI, 2013. [Bienvenu et al., 2019] Meghyn Bienvenu, Camille Bourgaux, and Franc ois Goasdou e. Computing and explaining query answers over inconsistent DL-Lite knowledge bases. J. Artif. Intell. Res., 64, 2019. [Borgida et al., 2008] Alexander Borgida, Diego Calvanese, and Mariano Rodriguez-Muro. Explanation in the DL-Lite family of description logics. In OTM, 2008. [Bourgaux and Ozaki, 2019] Camille Bourgaux and Ana Ozaki. Querying attributed DL-Lite ontologies using provenance semirings. In AAAI, 2019. [Bourgaux et al., 2020] Camille Bourgaux, Ana Ozaki, Rafael Pe naloza, and Livia Predoiu. Provenance for the description logic ELHr, 2020. ar Xiv:2001.07541 [cs.LO]. [Buneman and Kostylev, 2010] Peter Buneman and Egor V. Kostylev. Annotation algebras for RDFS data. In SWPM@ISWC, 2010. [Buneman, 2013] Peter Buneman. The providence of provenance. In BNCOD, 2013. [Calvanese et al., 2019] Diego Calvanese, Davide Lanti, Ana Ozaki, Rafael Pe naloza, and Guohui Xiao. Enriching ontology-based data access with provenance. In IJCAI, 2019. [Ceylan et al., 2019] Ismail Ilkan Ceylan, Thomas Lukasiewicz, Enrico Malizia, and Andrius Vaicenavicius. Explanations for query answers under existential rules. In IJCAI, 2019. [Cheney et al., 2009] James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4), 2009. [Croce and Lenzerini, 2018] Federico Croce and Maurizio Lenzerini. A framework for explaining query answers in DL-Lite. In EKAW, 2018. [Dannert and Gr adel, 2019] Katrin M. Dannert and Erich Gr adel. Provenance analysis: A perspective for description logics? In Description Logic, Theory Combination, and All That, 2019. [Deutch et al., 2014] Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. Circuits for datalog provenance. In ICDT, 2014. [Green and Tannen, 2017] Todd J. Green and Val Tannen. The semiring framework for database provenance. In PODS, 2017.

[Green et al., 2007] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, 2007. [Green, 2011] Todd J. Green. Containment of conjunctive queries on annotated relations. Theory of Computing Systems, 49, 2011. [Hutschenreiter and Pe naloza, 2017] Lisa Hutschenreiter and Rafael Pe naloza. An automata view to goal-directed methods. In LATA, 2017. [Ives et al., 2008] Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, and Fernando C. N. Pereira. The ORCHESTRA collaborative data sharing system. Sigmod Record, 37, 2008. [Kalyanpur et al., 2007] Aditya Kalyanpur, Bijan Parsia, Matthew Horridge, and Evren Sirin. Finding all justiﬁcations of OWL DL entailments. In ISWC, 2007. [Lukasiewicz et al., 2014] Thomas Lukasiewicz, Maria Vanina Mart ınez, Cristian Molinaro, Livia Predoiu, and Gerardo I. Simari. Answering ontological ranking queries based on subjective reports. In SUM, 2014. [Lutz et al., 2009] Carsten Lutz, David Toman, and Frank Wolter. Conjunctive query answering in the description logic EL using a relational database system. In IJCAI, 2009. [Pe naloza and Sertkaya, 2010] Rafael Pe naloza and Baris Sertkaya. On the complexity of axiom pinpointing in the EL family of description logics. In KR, 2010. [Pe naloza et al., 2017] Rafael Pe naloza, Carlos Menc ıa, Alexey Ignatiev, and Jo ao Marques-Silva. Lean kernels in description logics. In ESWC, 2017. [Ramusat et al., 2018] Yann Ramusat, Silviu Maniu, and Pierre Senellart. Semiring provenance over graph databases. In Ta PP, 2018. [Schlobach and Cornet, 2003] Stefan Schlobach and Ronald Cornet. Non-standard reasoning services for the debugging of description logic terminologies. In IJCAI, 2003. [Senellart, 2017] Pierre Senellart. Provenance and probabilities in relational databases. SIGMOD Record, 46(4), 2017. [Suciu et al., 2011] Dan Suciu, Dan Olteanu, Christopher R e, and Christoph Koch. Probabilistic Databases. Synthesis Lectures on Data Management, Morgan and Claypool Publishers, 2011. [Zimmermann et al., 2012] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated semantic web data. J. Web Sem., 11, 2012.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)