# enriching_ontologybased_data_access_with_provenance__a28bab9b.pdf

Enriching Ontology-based Data Access with Provenance

Diego Calvanese1 , Davide Lanti1 , Ana Ozaki1 , Rafael Pe naloza2 and Guohui Xiao1

1KRDB Research Centre, Free University of Bozen-Bolzano, Italy 2University of Milano-Bicocca, Italy

Ontology-based data access (OBDA) is a popular paradigm for querying heterogeneous data sources by connecting them through mappings to an ontology. In OBDA, it is often difﬁcult to reconstruct why a tuple occurs in the answer of a query. We address this challenge by enriching OBDA with provenance semirings, taking inspiration from database theory. In particular, we investigate the problems of (i) deciding whether a provenance annotated OBDA instance entails a provenance annotated conjunctive query, and (ii) computing a polynomial representing the provenance of a query entailed by a provenance annotated OBDA instance. Differently from pure databases, in our case these polynomials may be inﬁnite. To regain ﬁniteness, we consider idempotent semirings, and study the complexity in the case of DL-Lite R ontologies. We implement Task (ii) in a state-of-the-art OBDA system and show the practical feasibility of the approach through an extensive evaluation against two popular benchmarks.

1 Introduction Ontology-based data access (OBDA) [Xiao et al., 2018] is by now a popular paradigm which has been developed in recent years to overcome the difﬁculties in accessing and integrating legacy data sources. In OBDA, users are provided with a highlevel conceptual view of the data in the form of an ontology that encodes relevant domain knowledge. The concepts and roles of the ontology are associated via declarative mappings to SQL queries over the underlying relational data sources. Hence, user queries formulated over the ontology can be automatically rewritten, taking into account both ontology axioms and mappings, into SQL queries over the sources. When issuing a query, in many settings it is crucial to know not only its result but also how it was produced, how many different ways there are to derive it, or how dependent it is on certain parts of the data [Senellart, 2017; Zimmermann et al., 2012; Buneman and Kostylev, 2010]. To address these issues, which are of importance already for plain relational database management systems (RDBMSs), provenance semirings [Green et al., 2007; Green and Tannen, 2017] were introduced as an abstract tool to record and track

provenance information; that is, to keep track of the speciﬁc database tuples that are responsible for deriving an answer tuple, and of additional information associated to them. In OBDA, determining provenance is made even more challenging by the fact that answers are affected by implicit consequences derived through ontology axioms, and by the use of mappings. Such elements come indirectly into play in query rewriting, hence provenance information must be reconstructed from the rewritten queries used in the answering process [Borgida et al., 2008]. In this work, we start from the semiring approach introduced for RDBMSs, and extend it to the full-ﬂedged OBDA setting. To do so, we assume that not only database tuples are annotated with a label representing provenance information (e.g., the data source or the relation in which the tuple is stored), but also mappings and ontology axioms. Then, our task is to derive which combinations of these labels lead to the answer of a query. Such information is expressed through a provenance polynomial, as illustrated in the following example. Example 1. Let Mayors[Person, City] be a database relation with the tuples (Renier, Venice) and (Brugnaro, Venice), annotated with (sources) p and q, respectively. Assume two mappings City(Y ) Mayors(X, Y ) and head Gov(X, Y ) Mayors(X, Y ), annotated with m and n, respectively. The mappings and the database induce (i) two times the DL assertion City(Venice), one annotated with p m and one with q m, (ii) the DL assertion head Gov(Renier, Venice), annotated with p n, and (iii) the assertion head Gov(Brugnaro, Venice), annotated with q n. Now consider the inclusion head Gov Mayor annotated with s. The answer true to the Boolean conjunctive query x.(Mayor(x)) can be derived using this inclusion and any of the last two DL assertions. This information can be expressed through the provenance polynomial ((p n) + (q n)) s. In our OBDA setting, concept and role inclusions of the ontology affect query results, as illustrated in Example 1. By annotating the inclusions and the mappings, in addition to the tuples, we can distinguish which inclusions and mappings were involved in the derivation of a query result. This differs from the approach proposed for attributed DL-Lite R [Bourgaux and Ozaki, 2019], where the inclusions are used to express constraints on the provenance information. We investigate the problems of (i) deciding whether a provenance annotated OBDA instance entails a provenance anno-

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

tated conjunctive query (CQ), and (ii) computing a provenance polynomial of a CQ entailed by a provenance annotated OBDA instance. Differently from plain databases, in our case these polynomials may be inﬁnite. To regain ﬁniteness, we consider idempotent semirings, and study the complexity for DL-Lite R ontologies [Calvanese et al., 2007]. We implement task (ii) in the state-of-the-art OBDA system Ontop [Calvanese et al., 2017], and show the practical feasibility of our approach through a detailed evaluation against two popular benchmarks. An extended version of this work is available as a technical report [Calvanese et al., 2019].

2 Basic Deﬁnitions

We represent the provenance information via a positive algebra provenance semiring (or provenance semiring for short), originally introduced for databases [Green et al., 2007]. Given a countably inﬁnite set NV of variables, the provenance semiring is the algebra K = (N[NV], +, , 0, 1), where N[NV] denotes the space of polynomials with coefﬁcients in N and variables in NV, the product and the addition + are two commutative and associative binary operators over N[NV], and distributes over +. A monomial from K is a ﬁnite product of variables in NV. NM and NP denote the sets of all monomials from K, and of all ﬁnite sums of monomials in NM, respectively; i.e., NP contains only polynomials of the form P 1 i n Q 1 ji mi ai,ji, with ai,ji NV, and n, mi > 0. Since all coefﬁcients are in N, they disappear in this expanded form; e.g., 2a is a + a. A polynomial in expanded form is a ﬁnite sum of monomials, each formed by a ﬁnite product of variables. By distributivity, every polynomial can be equivalently rewritten in expanded form; however, the expanded form of a polynomial may become exponentially larger. By our deﬁnitions, NV NM NP.

2.1 Annotated OBDA The provenance information of each axiom in an ontology, each mapping, and each tuple in a data source, is stored as an annotation. For this paper, we consider the standard OBDA setting with ontologies written in DL-Lite R [Calvanese et al., 2007], standard relational databases as data sources, and mappings given by GAV rules. Consider three mutually disjoint countable sets of concept names NC, role names NR, and individual names NI. Assume that these sets are also disjoint from NV. DL-Lite R role and concept inclusions are expressions of the form S T and B C, respectively, where S, T are role expressions and B, C are concept expressions built through the grammar rules

S ::= R | R , T ::= S | S, B ::= A | S, C ::= B | B,

with R NR and A NC. A DL-Lite R axiom is a DL-Lite R role or concept inclusion. An annotated DL-Lite R ontology is a ﬁnite set of annotated axioms of the form (α, p), where α is a DL-Lite R axiom and p NM. A schema S is a ﬁnite set of predicate symbols disjoint from NC NR with ar(P) the arity of P S. An annotated data instance D over S maps every P S to a ﬁnite subset P D of NI ar(P ) NV. An annotated mapping is a ﬁnite set of annotated rules (ρ, p), where ρ is a (GAV) rule and p NV.

A rule ρ is of the form E( x) ϕ( x, y, z), with E NC NR and ϕ( x, y, z) a conjunction of atoms P( t, t), with P S, t an ar(P)-tuple of terms in x y, and t z. We restrict ϕ to a conjunction of atoms for simplicity of our theoretical development, also in line with the idea that semirings capture the provenance of positive queries [Green et al., 2007]. See Sec. 5 for handling arbitrary OBDA mappings in our implementation. An annotated OBDA speciﬁcation P is a triple (O, M, S), where O is an ontology with annotated axioms, S is a data source schema whose signature is disjoint from the signature of O, and M is a set of annotated mappings, connecting S to O [Xiao et al., 2018]. The pair (P, D) of an annotated OBDA speciﬁcation P and an annotated data instance D is an annotated OBDA instance. In OBDA, data sources and mappings induce virtual assertions. In annotated OBDA, virtual assertions are annotated with the provenance information of the mapping and of matching tuples in the data instance. Formally, an annotated assertion (E( a), p) is an expression of the form (A(a), p) or (R(a, b), p), with A NC, R NR, a, b NI, and p NM. We write ϕ(µ( x, y, z)) D if µ is a function mapping x, y to NI, z to NV, and (µ( t, t)) P D, for every atom P( t, t) in ϕ( x, y, z). Given an annotated mapping M and data instance D, the set M(D) of annotated assertions

(E(µ( x)), p Q z z µ(z)), satisfying

(E( x) ϕ( x, y, z), p) M and ϕ(µ( x, y, z)) D is the set of virtual annotated assertions for M over D. The semantics of annotated OBDA instances is based on interpretations over the signature of the ontology, extending classical DL-Lite R interpretations to track provenance, when relevant. An annotated interpretation is a triple I = ( I, I m, I) where I and I m are non-empty disjoint sets (called the domain of I and the domain of monomials of I, respectively), and I is the annotated interpretation function mapping every a NI to some a I I; every p, q NM to some p I, q I I m s.t. p I = q I iff the monomials p and q are mathematically equal (modulo associativity and commutativity, e.g., (p q)I = (q p)I by commutativity); every A NC to some AI I I m; and every R NR to some RI I I I m. We extend I to further DL-Lite R expressions as natural:

(R )I = {(e, d, p I) | (d, e, p I) RI} , ( S)I = ( I I I m) \ SI, ( S)I = {(d, p I) | e I : (d, e, p I) SI} , and ( B)I = ( I I m) \ BI.

The annotated interpretation I satisﬁes:

(A(a), p), if (a I, p I) AI; (R(a, b), p), if (a I, b I, p I) RI; (B C, p), if, for all q NM, (d, q I) BI

implies that (d, (q p)I) CI; and (S T, p), if, for all q NM, (d, e, q I) SI

implies that (d, e, (q p)I) T I.

I satisﬁes an annotated ontology O, in symbols I |= O, if it satisﬁes all annotated axioms in O. I satisﬁes an annotated OBDA instance ((O, M, S), D) if I |= O and I |= M(D).

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Example 2. Consider the OBDA instance of Example 1 and an annotated interpretation I with I = {Renier, Venice, Brugnaro}, I m containing p n, q n, p m, q m, p n s, q n s, with such individuals and monomials interpreted by themselves, and

head Gov I = {(R, V, p n), (B, V, q n)}, Mayor I = {(R, p n s), (B, q n s)}, City I = {(V, p m), (V, q m)}.

I is a model of such OBDA instance, where R, V, and B stand for Renier, Venice, and Brugnaro, respectively.

Following the database approach [Green et al., 2007; Green and Tannen, 2017], we annotate facts in interpretations with provenance information. However, in Green et al. s setting, the database is the (only) interpretation, while in our case we adopt the open world assumption (as in OBDA), so the semantics is based on multiple interpretations. Our semantics ensures that, if we have a tuple (d, p I) CI and (C D) is annotated with n, then (d, (p n)I) DI. So derivations are also represented in interpretations, and thus can be entailed. Each derivation is independent of the others. Regarding the semantics of negation, we point out that, at the level of an interpretation, the lack of provenance information is a support for the negation of a fact. This apparent counterintuitive behaviour does not hold in all interpretations, hence it does not manifest in the entailments. In fact, our focus in this paper is query entailment (deﬁned next), negations are only deﬁned to comply with the usual syntax and semantics of DL-Lite R. They do not affect query results, as in DL-Lite R.

2.2 Annotated Queries We extend the notion of conjunctive queries in DLs by allowing binary and ternary predicates, where the last term of a tuple may contain provenance information represented as a monomial (by deﬁnition of the semantics of annotated OBDA instances, the last element of a tuple can only contain monomials, not sums). More speciﬁcally, a Boolean conjunctive query (BCQ) q is a sentence x.ϕ( x, a, p), where ϕ is a conjunction of (non-repeating) atoms of the form A(t1, t), R(t1, t2, t), and ti is either an individual name from a, or a variable from x, and t (the last term of each tuple) is either an element of NM in the list p or a variable from x. We often write P( t, t) to refer to an atom which can be either A(t1, t) or R(t1, t2, t) and P( t, t) q if P( t, t) is an atom occurring in q. A match of the BCQ q = x.ϕ( x, a, p) in the annotated interpretation I is a function π : x a p I I m, such that π(b) = b I, for all b a p, and π( t, t) P I, for every P( t, t) q. I satisﬁes the BCQ q, written I |= q, if there is a match of q in I. A BCQ is entailed by an annotated OBDA instance if it is satisﬁed by every model of it. For a BCQ q and an interpretation I, νI(q) denotes the set of all matches of q in I. The provenance of q on I, denoted prov I(q), is the (potentially inﬁnite) expression: P π νI(q) Q P ( t,t) q π (t)

where π(t) is the last element of the tuple π( t, t) P I; and π (t) is any v NM s.t. v I = π(t). For p NP, we write

p prov I(q) if p is a sum of monomials and for each occurrence of a monomial in p we ﬁnd an occurrence of it in prov I(q). I satisﬁes q with provenance p NP, written I |= (q, p), if I |= q and p prov I(q). The annotated OBDA instance (P, D) entails q, (P, D) |= q, if for all annotated interpretations I, if I |= (P, D) then I |= q; and (P, D) |= (q, p), if (P, D) |= q and p prov I(q), for all I satisfying (P, D). In our syntax, the atoms of the queries contain an additional parameter which may either be a variable or a monomial. As a result, one can ﬁlter query results based on provenance information by specifying constraints in the last parameter of the atoms, which was not possible in the original approach by Green et al. [Green et al., 2007; Green and Tannen, 2017]. For example, xy.A(x, p) B(x, y) can be used to specify that we are only interested in matches of the query where the ﬁrst atom is associated with a particular provenance. Variables can also be repeated, e.g. xy.A(x, y) B(x, y). One can fall back to the original setting from databases, where no constraints are imposed, by simply associating the last term of each atom with a fresh variable (see standard queries in Section 4). The size |X| of an annotated OBDA instance, a polynomial or a BCQ X is the length of the string that represents X. We assume a binary encoding of elements of NC,NR,NI and NP occurring in X. We may omit annotated in front of terms such as OBDA, queries, inclusions, assertions, and others, whenever this is clear from the context.

2.3 Reasoning Problems Annotating OBDA instances with provenance information does not impact consistency checking. That is, an annotated OBDA instance is satisﬁable precisely when the OBDA instance that results from removing the annotations is satiaﬁable. We thus focus on the problem of query entailment w.r.t. a provenance polynomial: given an (annotated) OBDA instance (P, D), a query q and a polynomial p NP decide if (P, D) |= (q, p). Another important and related problem is to compute the provenance of a query: given an OBDA instance (P, D) and a query q, compute the set of all p NP such that (P, D) |= (q, p). In our formalism, the latter problem depends on whether there is a ﬁnite set of polynomials which we can compute. As shown next, in DL-Lite R the set of provenance polynomials may be inﬁnite. Example 3. Consider an OBDA instance (P, D) as in Ex. 1, but where now O of P contains also (Mayor head Gov, t). For all i N, (P, D) |= (Mayor(Renier), p n si+1 ti). Indeed, for any model I of (P, D), (Renier, (p n s)I) Mayor I implies (a, (p n s t)I) ( head Gov)I, which implies (Renier, (p n s2 t)I) Mayor I, and so on.

In Section 3 we consider the problem of query entailment w.r.t. a provenance polynomial. Note that in Example 3, if the semiring is multiplicatively idempotent (i.e., s s = s), the set of provenance polynomials is ﬁnite: the only polynomial is p n s t. This is not a coincidence; under multiplicativeidempotency, the set of provenance polynomials is always ﬁnite. The following proposition states that multiplicativeidempotency is indeed sufﬁcient to guarantee a ﬁnite set of polynomials.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Proposition 1. Under multiplicative idempotency, for any satisﬁable OBDA instance (P, D) and BCQ q, the set of polynomials p NP such that (P, D) |= (q, p) is ﬁnite.

In Section 4 we study idempotent semirings and consider the problem of computing the provenance of a query.

3 Provenance Annotated Query Entailment

We establish complexity results for the problem of deciding whether an OBDA instance entails a (provenance annotated) query. For clarity of presentation, we split our proof in two parts. We ﬁrst show that for an OBDA instance (P, D) and a query (q, p), there is an OBDA instance (Pm, Dm) and a set Tr(qm, pm) of (non-annotated) queries such that (P, D) |= (q, p) iff (Pm, Dm) entails some q Tr(qm, pm). Moreover, the sizes of (Pm, Dm) and q are polynomial in the sizes of (P, D) and (q, p). Then, we adapt the query rewriting algorithm Perfect Ref [Calvanese et al., 2007] to decide whether (Pm, Dm) |= q .

3.1 Characterization Lemma 1 states that, given an OBDA instance (P, D) and a query (q, p), there is an OBDA instance (Pm, Dm) and a query (qm, pm) that can be used to decide (P, D) |= (q, p) and, moreover, all monomials in pm are mathematically distinct (modulo associativity, commutativity, and distributivity).

Lemma 1. Given a satisﬁable OBDA instance (P, D) and a query (q, p), there are (Pm, Dm) and (qm, pm) such that

any two monomials p1, p2 appearing in pm are mathematically distinct; (P, D) |= (q, p) iff (Pm, Dm) |= (qm, pm); and |(Pm, Dm)| + |(qm, pm)| is polynomially bounded by |(P, D)| + |(q, p)|.

We show that, given (Pm, Dm) and (qm, pm) as in Lemma 1, (qm, pm) can be translated into a set of queries such that (Pm, Dm) entails (qm, pm) iff it entails at least one of these queries. We ﬁrst deﬁne the translation of a BCQ where all terms are variables (no individual names and no polynomials), and then adapt the translation for the general case. Given the BCQ qm = x. ϕ( x) with k atoms and pm NP with n monomials, deﬁne Tr(qm, pm) as the set of all BCQs:

y. V 1 i n ϕi( xi), (1)

where y = x1, . . . , xn and each qi = xi. ϕi( xi) is a copy of q in which we replace each variable x x by a fresh variable xi xi. We check whether we can ﬁnd the monomials of the polynomial in these matches by replacing the last variable in each j-th atom of qi by a monomial pi,j NM built from symbols occurring in pm such that Q 1 j k pi,j = pi for some pi NP, with 1 i n; and P 1 i n pi = p. The translation of a BCQ with individual names is similar, except that we must add such individual names in each copy of the query; that is, we would replace the corresponding variable in the translation with the individual name occurring in the query. Theorem 1 formalises the correctness of our translation, where we write (P, D) |= Tr(q, p) to express that there is q Tr(q, p) such that (P, D) |= q .

Example 4. Consider the query

q = xyzw.(head Gov(x, y, z) City(y, w))

and the polynomial p = (s t) + (s r). Then,

x1y1x2y2.(head Gov(x1, y1, s) City(y1, t) head Gov(x2, y2, s) City(y2, r))

is in Tr(q, p).

Theorem 1. Let (P, D) be an OBDA instance, q a BCQ and p NP a polynomial formed of mathematically distinct monomials. (P, D) |= (q, p) iff (P, D) |= Tr(q, p).

Without assuming that p NP is formed of mathematically distinct monomials, we would need to add inequalities to the queries in Tr(q, p) (there is no way to distinguish Tr(q, p + p) from Tr(q, p)). By Lemma 1, given the OBDA instance (P, D) and query (q, p), there are (Pm, Dm) and (qm, pm), satisfying the assumption of Theorem 1, which we can use to decide whether (P, D) |= (q, p). This is crucial for query entailment since entailment of conjunctive queries with inequalities in DL-Lite R is undecidable [Guti errez-Basulto et al., 2015].

3.2 Query Rewriting

We adapt the classical query rewriting algorithm Perfect Ref [Calvanese et al., 2007] to decide whether (P, D) |= q , for q Tr(q, p), where (P, D) and (q, p) are as in Theorem 1. When possible, we use the deﬁnitions and terminology from [Calvanese et al., 2007, Sec. 5.1], adapting some of them to our setting if needed. For simplicity, for each role R occurring in an OBDA instance ((O, M, S), D), we add to O the annotated role inclusions (R R, p R) and (R R , p R), where R is a fresh role name and p R, p R are fresh variables of a provenance semiring. We assume w.l.o.g. that inverse roles only occur in such role inclusions by replacing other occurrences of R with R. The symbol denotes non-distinguished non-shared variables. A positive inclusion I is a provenance annotated role or concept inclusion without negations. I is applicable to A(x, p) if I is annotated with v occurring in p and it has A in its right-hand side. A positive inclusion I is applicable to R(x, y, p) if (i) x = , I is annotated with v occurring in p, and the right-hand side of I is R, or (ii) I is a role inclusion annotated with v occurring in p and its right-hand side is R or R . Given p NM and v NV occurring in p, we denote by p|v the result of removing one occurrence of v from p.

Deﬁnition 1. Let g be an atom and I a positive inclusion applicable to g. The atom obtained from g by applying I, denoted by gr(g, I), is deﬁned as follows:

gr(A(x, p), (A1 A, v)) = A1(x, p|v); gr(A(x, p), ( R A, v)) = R(x, , p|v); gr(R(x, , p), (A R, v)) = A(x, p|v); gr(R(x, , p), ( R1 R, v)) = R1(x, , p|v); gr(R(x, y, p), (R1 R, v)) = R1(x, y, p|v); gr(g, I) = R1(y, x, p|v), if g = R(x, y, p) and either I = (R1 R , v) or I = (R 1 R, v).

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Algorithm 1 Perfect Ref Input: a BCQ q, a set of positive inclusions OT Output: a set of BCQs PR 1: PR := {q} 2: repeat 3: PR := PR 4: for all q PR , all g, g1, g2 q and all I OT do 5: if {q[g/gr(g, I)]} / PR and I OT is applicable to g q then 6: PR := PR {q[g/gr(g, I)]} 7: if there are g1, g2 q such that g1 and g2 unify then 8: PR := PR {τ(reduce(q, g1, g2))} 9: until PR = PR 10: return PR

We use Perfect Ref (Algorithm 1) originally presented in [Calvanese et al., 2007], except that the applicability of a positive inclusion I to an atom g is as previously described and gr(g, I) follows Deﬁnition 1. Let q[g/g ] denote the BCQ obtained from q by replacing the atom g with a new atom g ; let τ be a function that takes as input a BCQ q and returns a new BCQ obtained by replacing each occurrence of an unbound variable in q with the symbol ; and let reduce be a function that takes as input a BCQ q and two atoms g1, g2 and returns the result of applying to q the most general uniﬁer of g1 and g2 (unifying mathematically equal terms). Perfect Ref(q, OT ) is the output of the algorithm Perfect Ref over q (with a monomial in NM in the last parameter of each atom) and a set OT of positive inclusions of an OBDA instance ((O, M, S), D).

Example 5. Consider an OBDA instance ((O, M, S), D) as in Ex. 1. We call Algorithm 1 with OT and the query q = x.Mayor(x, p n s) as input. Since I is applicable to g = Mayor(x, p n s), in Line 6, Alg. 1 adds to PR the result of replacing g by gr(g, I) = head Gov(x, , p n) in q. Hence, q = x, y. head Gov(x, y, p n) Perfect Ref(q, OT ). Indeed q is a rewriting of q.

Our next theorem states the correctness of Algorithm 1.

Theorem 2. Let q be a BCQ and OT the set of positive inclusions of an OBDA speciﬁcation P = (O, M, S). Given q and OT as input, Algorithm 1 terminates and outputs a set of BCQs PR such that, for all data instances D where (P, D) is satisﬁable, (P, D) |= q iff there is q PR such that (( , M, S), D) |= q .

Termination of our modiﬁed version of Perfect Ref is analogous to [Calvanese et al., 2007, Lemma 34], except that now the number of terms is exponential in the size of monomials occurring in the query, and thus in the size of the query. This is due to Deﬁnition 1, where we break the monomial into a smaller one. Our modiﬁcation does not change the upper bounds obtained with the algorithm, since for data complexity the query is not part of the input and the upper bound for combined complexity, which we establish in Theorem 3, is obtained by a non-deterministic version of the algorithm.

Theorem 3. Answering provenance annotated queries w.r.t. OBDA instances is NP-complete (combined complexity).

4 Computing the Provenance of a Query We now consider the problem of computing the provenance of a query. To avoid the case of an inﬁnite provenance, we focus on the special case where the provenance semiring is fully idempotent, which is a sufﬁcient condition for ﬁnite provenance (Proposition 1). The semiring is fully idempotent if for every polynomial p NP, p p = p and p+p = p. This is the case, e.g., if the provenance refers to the name of the source of the knowledge; having several times the same name does not affect the result. Alternatively, one can model access rights and observe whether certain pieces of knowledge are needed for the entailment of a query w.r.t. an OBDA instance. For fully idempotent semirings, the task corresponds to computing relevant monomials. More precisely, in this special case we want to compute all monomials p such that (P, D) |= (q, p). The provenance of the query w.r.t. the OBDA instance is the addition of all these monomials. This deﬁnition is equivalent to the general one since the semiring is idempotent: repetitions of a monomial do not affect the result, and repetitions of a variable within a monomial can be removed. If the semiring is only multiplicatively idempotent, then computing monomials does not sufﬁce, as some of them may appear several times. However, the problem is still simpliﬁed to ﬁnd the (ﬁnite) number of repeated monomials to be observed. In general, the query polynomial may be composed of exponentially many monomials, even if the query is a simple one of the form x.A(a, x), with A NC. Proposition 2. There exists an OBDA instance (P, D) and a simple query q such that the provenance polynomial of q w.r.t. (P, D) is formed of exponentially many monomials.

For some queries, provenance cannot be expressed by a provenance polynomial of polynomial length in the size of the ontology, even if an expanded form is not required. This follows from known results in monotone complexity [Karchmer and Wigderson, 1990]: there is no monotone Boolean formula (i.e., propositional formula using only the connectives and ) of polynomial length expressing all the simple paths between two nodes in a graph. This holds already for complete graphs. Graphs can be described in DL-Lite R (and simpler logics) using basic inclusion axioms, and monotone Boolean formulas are provenance polynomials over an idempotent semiring, where the and serve as product and addition. Hence we have the following result [Pe naloza, 2009]. Proposition 3. There exist an OBDA instance (P, D) and a query q such that the provenance of q w.r.t. (P, D) cannot be represented in polynomial space. This holds even for idempotent semirings, and if every axiom has a unique label.

On the other hand, if every axiom is labeled with a unique variable, then the provenance polynomial for instance queries can be computed efﬁciently, whenever its length does not increase greatly; that is, it can be computed in polynomial time in the size of the input and the output. The proof of this claim follows the same ideas from [Pe naloza and Sertkaya, 2017], based on the fact that all the relevant monomials from the provenance are enumerable with polynomial delay. Lemma 2. The provenance p of an instance query w.r.t. an OBDA instance (P, D) can be computed in polynomial time

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Algorithm 2 Compute Prov Input: a BCQ q0, an OBDA instance ((O, M, S), D) Output: the provenance p of q w.r.t. ((O, M, S), D) 1: PR := Perfect Ref (q 0, OT ), 2: for all q PR do 3: for all matches π of q y in IM(D) do 4: PR := PR {q y,π} 5: PR := PR \ {q} 6: return p := P q P R Q P ( t,t) q t

in the size of (P, D) and of the polynomial p.

We give an algorithm for computing the provenance of a BCQ w.r.t. an OBDA instance. We focus on BCQs that do not have monomials in the last term of the atom. A BCQ q = x.ϕ( x, a) is standard if, for all P( t, t) q, t is a fresh variable in x. Algorithm 2 computes the provenance of a standard BCQ w.r.t. an OBDA instance. We adopt the same notation used for describing Perfect Ref [Calvanese et al., 2007] (also used in Section 3). Perfect Ref is a variant of Perfect Ref (Algorithm 1), where the notions of applicability of an inclusion I w.r.t. an atom g and the deﬁnition of gr(g, I) are as follows. I is applicable to an atom A(x, p) if I has A in its right-hand side. A positive inclusion I is applicable to an atom R(x, y, p) if (i) x = , and the right-hand side of I is R, or (ii) the right-hand side of I is either R or R . Given p NM and v NV, we deﬁne pv as p v if v does not occur in p, and we deﬁne pv as p, otherwise. E.g., vwv = vw. Deﬁnition 2. Let g be an atom and I a positive inclusion applicable to g. The atom obtained from g by applying I, denoted by gr(g, I), is deﬁned as follows: gr(A(x, p), (A1 A, v)) = A1(x, pv); gr(A(x, p), ( R A, v)) = R(x, , pv); gr(R(x, , p), (A R, v)) = A(x, pv); gr(R(x, , p), ( R1 R, v)) = R1(x, , pv); gr(R(x, y, p), (R1 R, v)) = R1(x, y, pv); gr(g, I) = R1(y, x, pv), if g = R(x, y, p) and either I = (R1 R , v) or I = (R 1 R, v).

For standard BCQs, Algorithm 2 is sound and complete. Termination of Algorithm 2 is an easy consequence of termination of Perfect Ref. The main difference between Algorithm 2 and Algorithm 1 (Section 3) is that here we assume that a standard BCQ is given (without any provenance information) and we aim at computing its provenance. Instead of removing variables of the semiring while applying positive inclusions (Deﬁnition 1), we add the variables of the semiring whenever the associated positive inclusion is applied (Deﬁnition 2). In Line 1, we write q to denote the result of replacing each t in P( t, t) q by , where is a fresh symbol from NV. This transformation ensures that in Deﬁnition 2 the last term is always an element of NM. In Line 3, we denote by q y the result of replacing, for each P( t, t) q, the last term t by a fresh variable from y (i.e., q y is a standard BCQ). We perform another transformation in Line 4, denoted by q y,π, which is the result of replacing, for each P( t, t) q, the symbol in t by u NM such that u I = π(y) (if there are multiple

mathematically equal such u, we simply choose u arbitrarily), where y is the last term of the corresponding atom in q y (that is, P( t, y) q y). Observe that π is a match of q y in IM(D). Example 6. Assume Algorithm 2 receives as input the standard query q0 = xz.Mayor(x, z) and an OBDA instance ((O, M, S), D) with O = {( head Gov Mayor, s)} and

M(D) = {(head Gov(Renier, Venice), u), (head Gov(Brugnaro, Venice), v)}.

In Line 1, Algorithm 2 calls Perfect Ref , deﬁned as a variant of Perfect Ref (Algorithm 1), where the notions of applicability of an inclusion I w.r.t. an atom g and the deﬁnition of gr(g, I) are as in Section 4. The return of Perfect Ref is PR = { x.Mayor(x, ), xz.head Gov(x, z, s)}. Then, for all q PR and all matches π of q y in IM(D) (if they exist) the algorithm adds q y,π to PR. In this example, assume q = xz.head Gov(x, z, s). We have two matches of q y = xzy.head Gov(x, z, y) in M(D), one mapping y to u (call this match π) and the other mapping y to v (call it π ). So, q y,π = xz.head Gov(x, z, u s) and q y,π = xz.head Gov(x, z, v s). In Line 5, Algorithm 2 removes q 0 from PR. Finally, in Line 6, it returns the polynomial u s + v s.

Theorem 4. Let q be a standard BCQ and (P, D) an OBDA instance. Given q and (P, D) as input to Algorithm 2, it outputs the provenance of q w.r.t. (P, D).

The upper bounds from the previous section for the general case obviously apply in the restricted idempotent case as well.

5 Evaluation To evaluate the feasibility of our approach, we implemented a prototype system (Onto Prov) that extends the state-of-the-art OBDA system Ontop [Calvanese et al., 2017] with the support for provenance. Ontop supports SPARQL query answering over ontologies in OWL 2 QL, the W3C standard corresponding to DL-Lite R [Motik et al., 2012]. The algorithm of Ontop has two stages, an ofﬂine stage, which classiﬁes the ontology and saturates the input set of mappings, and an online stage, which rewrites the input queries according to the saturated set of mappings. Onto Prov enriches these steps by taking into account provenance information, and relies on Prov SQL [Senellart et al., 2018] to handle provenance from the database and queries in the mappings that go beyond the CQ fragment. We compare Ontop v3.0.0-beta-3 and Onto Prov over the BSBM [Bizer and Schultz, 2009] and the NPD [Lanti et al., 2015] benchmarks. Experiments were run on a server with 2 Intel Xeon X5690 Processors (24 logical cores at 3.47 GHz), 106 GB of RAM and ﬁve 1 TB 15K RPM HDs. As RDBMS we have used Postgre SQL 11.2.

5.1 Evaluation with the BSBM Benchmark The BSBM benchmark is designed to test the different features of SPARQL. It provides a baseline for our tests, since it comes with an empty ontology and therefore it does not require ontological reasoning. In this experiment we restrict to a set of parametric queries (called here query mix) in the benchmark

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Dataset mix Time Ontop mix Time Onto Prov

bsbm10k 2.0s 3.2s bsbm1M 326s 364s

Table 1: BSBM Experiment

Ontop Onto Prov Q #unf t t10 #unf #uinst tinst tinst10 #prov #prov10

1 0 29.5 172.4 16 16 5.2 1.7 1524 49 2 0 .5 3.1 32 32 .3 .4 28 60 3 24 5.0 51.1 16 16 248.0 296.5 84 153 4 0 3.2 24.3 16 16 437.6 297.3 90 53 5 0 .1 .2 0 0 .1 1.5 1 1 6 13 107.5 804.3 369 369 1439.3 tout 426 tout 7 0 .4 .2 0 0 .1 .3 1 1 9 15 6.8 53.4 64 55 .3 .9 5 5 10 1 .4 25.4 4 4 .8 11.3 1 7 11 6 53.6 760.8 184 184 1342.6 tout 474 tout 12 8 69.1 1215.5 185 185 1248.1 tout 476 tout 31 21 60.3 633.0 248 239 1.9 5.3 120 60

Table 2: NPD Experiment (times in seconds)

(9 in total) that are supported by our theoretical framework. Table 1 compares the average time (over three test runs) to evaluate the query mix with both Ontop and Onto Prov, on two datasets containing 10k and 1M products (resp., bsbm10k and bsbm1M). Evaluation times for both systems are very close. Hence, without a complex ontology or complex mappings, the overhead for computing provenance is rather small.

5.2 Evaluation with the NPD Benchmark As opposed to BSBM, the NPD Benchmark is speciﬁcally tailored to OBDA systems; it comes with a complex ontology, complex mappings, and queries of various kinds. We restrict to 12 user queries that are supported by our framework. We use the dataset NPD, containing real-world data about the oil extraction domain, and the dataset NPD10, which is 10 times the size of NPD and is generated by a data scaler [Lanti et al., 2019]. Differently from the BSBM benchmark, in NPD we observed many timeouts (set to 40 minutes) when running the benchmark queries with Onto Prov. This is due to the fact that, in NPD, the optimizations performed by Ontop over the query unfoldings are crucial for getting reasonably compact SQL queries. Such optimizations, however, need to be disabled in Onto Prov to guarantee completeness. In fact, we are interested in all the possible ways to derive a result, and cannot identify and discard redundant derivations. For a broader discussion about these aspects, please refer to the additional material. We assume that a user of Onto Prov is more interested in understanding the reason for a speciﬁc answer tuple, rather than getting in bulk all possible explanations for all possible answer tuples. To simulate such user interaction, in our tests we have instantiated the NPD queries with answer tuples, and have run the obtained instantiated queries (which are, in fact, BCQs) over Onto Prov. Table 2 contains the aggregate results of our runs. For each of our tests, we performed 5 test runs. The columns #unf and #uinst denote the number of times a UNION operator appears in the unfolding of an NPD query and an instantiated query, respectively. This measure gives an

idea on the complexity of the unfolding, and we can observe that the unfoldings produced by Onto Prov are much more complex than those produced by Ontop. As argued above, this is because Onto Prov disallows some optimizations. Columns t and t10 denote the average execution times of the queries over the datasets NPD and NPD10, respectively, and for instantiated queries these values are respectively denoted by tinst and tinst10. The execution times for Onto Prov are generally much higher than for Ontop. We attribute this to the increased complexity of the unfoldings. Columns #prov and #prov10 denote the number of results for the instantiated queries, respectively over NPD and NPD10. These numbers can be interpreted as the number of possible ways an answer tuple can be derived, and give an indication on the complexity of the benchmark itself. For instance, for query 1 over the NPD dataset there are on average 1524 explanations for a single answer tuple. This test shows that the approach is feasible even with complex ontologies and mappings, but also that more work is needed in order to devise optimization techniques dedicated to a setting with provenance.

6 Conclusions and Discussion

We investigated the problem of dealing with provenance within OBDA, based on the provenance semiring approach introduced for databases. In our case, every element of an OBDA instance is annotated with provenance information. We showed that query rewriting techniques can be applied to deal with provenance as well. An evaluation based on a prototypical implementation shows that our methods are feasible in practice. A key difference between the problem of provenance computation (or its decision version) and that of axiom pinpointing [Schlobach and Cornet, 2003; Kalyanpur et al., 2007; Baader et al., 2007] and query explanation [Calvanese et al., 2013; Croce and Lenzerini, 2018; Bienvenu et al., 2019] is that axiom pinpointing and query explanation focus on tracing the minimal causes of a consequence (or the lack of it). In contrast, all possible derivations are relevant for provenance, independently of whether a cause is minimal or not. As future work, we plan to investigate provenance with the monus operator. We will also study the provenance of SPARQL query answering [Geerts et al., 2013] in OBDA. Our implementation computes the provenance of a query assuming that the semiring is multiplicatively idempotent. While this assumption is useful to identify which parts of the knowledge base contribute to the query result, it restricts the applicability of our approach to other settings, in particular, to the numerical ones. For capturing probabilities, it is important to distinguish repetitions, so (multiplicative) idempotency is not suitable. In our setting, dropping the idempotency condition leads to cases where the polynomial can be inﬁnite. It would be interesting to investigate whether the polynomial can be ﬁnitely represented, so that its computation could be applied in a numerical setting.

Acknowledgements

This research has been supported by the European Region Tyrol-South Tyrol-Trentino (Euregio IPN12 KAOS) and by the Free University of Bozen-Bolzano through the projects STy Lo La, OBATS, QUADRO, PROVDL, and FO2S.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

References [Baader et al., 2007] Franz Baader, Rafael Pe naloza, and Boontawee Suntisrivaraporn. Pinpointing in the description logic EL+. In KI, pages 52 67, 2007. [Bienvenu et al., 2019] Meghyn Bienvenu, Camille Bourgaux, and Franc ois Goasdou e. Computing and explaining query answers over inconsistent DL-Lite knowledge bases. Journal of Artiﬁcial Intelligence Research, 64:563 644, 2019. [Bizer and Schultz, 2009] Christian Bizer and Andreas Schultz. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems, 5(2):1 24, 2009. [Borgida et al., 2008] Alexander Borgida, Diego Calvanese, and Mariano Rodriguez-Muro. Explanation in the DL-Lite family of description logics. In ODBASE, volume 5332 of LNCS, pages 1440 1457. Springer, 2008. [Bourgaux and Ozaki, 2019] Camille Bourgaux and Ana Ozaki. Querying attributed DL-Lite ontologies using provenance semirings. In AAAI, 2019. [Buneman and Kostylev, 2010] Peter Buneman and Egor V. Kostylev. Annotation algebras for RDFS data. In Proc. of the 2nd Int. Workshop on the Role of Semantic Web in Provenance Management (SWPM@ISWC), 2010. [Calvanese et al., 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efﬁcient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning, 39(3):385 429, 2007. [Calvanese et al., 2013] Diego Calvanese, Magdalena Ortiz, Mantas Simkus, and Giorgio Stefanoni. Reasoning about explanations for negative query answers in DL-Lite. Journal of Artiﬁcial Intelligence Research, 48:635 669, 2013. [Calvanese et al., 2017] Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic Web Journal, 8(3):471 487, 2017. [Calvanese et al., 2019] Diego Calvanese, Davide Lanti, Ana Ozaki, Rafael Pe naloza, and Guohui Xiao. Enriching ontology-based data access with provenance (Extended version). Co RR Technical Report ar Xiv:1906.00179, ar Xiv.org, 2019. Available at http://arxiv.org/abs/1906. 00179. [Croce and Lenzerini, 2018] Federico Croce and Maurizio Lenzerini. A framework for explaining query answers in DL-Lite. In EKAW, pages 83 97, 2018. [Geerts et al., 2013] Floris Geerts, Grigoris Karvounarakis, Vassilis Christophides, and Irini Fundulaki. Algebraic structures for capturing the provenance of SPARQL queries. In ICDT, pages 153 164. ACM, 2013. [Green and Tannen, 2017] Todd J. Green and Val Tannen. The semiring framework for database provenance. In PODS, pages 93 99. ACM, 2017.

[Green et al., 2007] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31 40, 2007. [Guti errez-Basulto et al., 2015] V ıctor Guti errez-Basulto, Yazm ın Ib a nez-Garc ıa, Roman Kontchakov, and Egor V. Kostylev. Queries with negation and inequalities over lightweight ontologies. Journal of Web Semantics, 35(P4):184 202, 2015. [Kalyanpur et al., 2007] Aditya Kalyanpur, Bijan Parsia, Matthew Horridge, and Evren Sirin. Finding all justiﬁcations of OWL DL entailments. In ISWC, volume 4825 of LNCS, pages 267 280. Springer, 2007. [Karchmer and Wigderson, 1990] M. Karchmer and A. Wigderson. Monotone circuits for connectivity require super-logarithmic depth. SIAM Journal on Discrete Mathematics, 3(2):255 265, 1990. [Lanti et al., 2015] Davide Lanti, Martin Rezk, Guohui Xiao, and Diego Calvanese. The NPD benchmark: Reality check for OBDA systems. In EDBT, pages 617 628. Open Proceedings.org, 2015. [Lanti et al., 2019] Davide Lanti, Guohui Xiao, and Diego Calvanese. VIG: Data scaling for OBDA benchmarks. Semantic Web Journal, 10(2):413 433, 2019. [Motik et al., 2012] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, and Carsten Lutz. OWL 2 Web Ontology Language proﬁles. W3C Recommendation, World Wide Web Consortium, December 2012. Available at http://www.w3.org/TR/owl2-proﬁles/. [Pe naloza and Sertkaya, 2017] Rafael Pe naloza and Barıs Sertkaya. Understanding the complexity of axiom pinpointing in lightweight description logics. Artiﬁcial Intelligence, 250:80 104, 2017. [Pe naloza, 2009] Rafael Pe naloza. Axiom Pinpointing in Description Logics and Beyond. Ph D thesis, Dresden University of Technology, 2009. [Schlobach and Cornet, 2003] Stefan Schlobach and Ronald Cornet. Non-standard reasoning services for the debugging of description logic terminologies. In IJCAI, 2003. [Senellart et al., 2018] Pierre Senellart, Louis Jachiet, Silviu Maniu, and Yann Ramusat. Prov SQL: Provenance and probability management in Postgre SQL. Proceedings of the VLDB Endowment, 11(12):2034 2037, 2018. [Senellart, 2017] Pierre Senellart. Provenance and probabilities in relational databases. SIGMOD Record, 46(4):5 15, 2017. [Xiao et al., 2018] Guohui Xiao, Diego Calvanese, Roman Kontchakov, Domenico Lembo, Antonella Poggi, Riccardo Rosati, and Michael Zakharyaschev. Ontology-based data access: A survey. In IJCAI, pages 5511 5519, 2018. [Zimmermann et al., 2012] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics, 11:72 95, 2012.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)