# the_bag_semantics_of_ontologybased_data_access__a872e723.pdf

The Bag Semantics of Ontology-Based Data Access

Charalampos Nikolaou, Egor V. Kostylev, George Konstantinidis, Mark Kaminski, Bernardo Cuenca Grau, and Ian Horrocks Department of Computer Science, University of Oxford, UK

Ontology-based data access (OBDA) is a popular approach for integrating and querying multiple data sources by means of a shared ontology. The ontology is linked to the sources using mappings, which assign views over the data to ontology predicates. Motivated by the need for OBDA systems supporting database-style aggregate queries, we propose a bag semantics for OBDA, where duplicate tuples in the views deﬁned by the mappings are retained, as is the case in standard databases. We show that bag semantics makes conjunctive query answering in OBDA CONP-hard in data complexity. To regain tractability, we consider a rather general class of queries and show its rewritability to a generalisation of the relational calculus to bags.

1 Introduction Ontology-based data access (OBDA) is an increasingly popular approach to enable uniform access to multiple data sources with diverging schemas [Poggi et al., 2008]. In OBDA, an ontology provides a unifying conceptual model for the data sources together with domain knowledge. The ontology is linked to each source by global-asview (GAV) mappings [Lenzerini, 2002], which assign views over the data to ontology predicates. Users access the data by means of queries formulated using the vocabulary of the ontology; query answering amounts to computing the certain answers to the query over the union of ontology and the materialisation of the views deﬁned by the mappings. The formalism of choice for representing ontologies in OBDA is the description logic DL-Lite R [Calvanese et al., 2007], which underpins OWL 2 QL [Motik et al., 2012]. DL-Lite R was designed to ensure that queries against the ontology are ﬁrstorder rewritable; that is, they can be reformulated as a set of relational queries over the sources [Calvanese et al., 2007]. Example 1. A company stores data about departments and their employees in several databases. The sales department uses the schema Sal Employee(id, name, salary, loc, mngr),

This work was supported by the Royal Society under a University Research Fellowship, the EPSRC projects ED3 and DBOnto, and the Research Council of Norway via the Sirius SFI.

where attributes id, name, salary, loc, and mngr stand for employee ID within the department, their name, salary, location, and name of their manager. In turn, the IT department stores data using the schema ITEmployee(id, surname, salary, city), where managers are not speciﬁed. To integrate employee data, the company relies on an ontology with TBox Tex, which deﬁnes unary predicates such as Sal Emp, ITEmp, and Mngr, and binary predicates such as has Mngr relating employees to their managers. The following mappings determine the extension of the predicates based on the data, where each atti represents the attributes occurring only in the source:

Sal Employee(name, att1) Sal Emp(name), Sal Employee(name, mngr, att2) has Mngr(name, mngr), Sal Employee(mngr, att3) Mngr(mngr), ITEmployee(surname, att4) ITEmp(surname).

TBox Tex speciﬁes the meaning of its vocabulary using inclusions (i) Sal Emp Emp and ITEmp Emp, which say that both sales and IT employees are company employees; (ii) has Mngr Mngr, specifying the range of the has Mngr relation, and (iii) Emp has Mngr, requiring that employees have a (maybe unspeciﬁed) manager. Such inclusions inﬂuence query answering: when asking for the names of all company employees, the system will retrieve all relevant sales and IT employees; this is achieved via query rewriting, where the query is reformulated as the union of queries over the sales and IT databases. OBDA has received a great deal of attention in recent years. Researchers have studied the limits of ﬁrst-order rewritability in ontology languages [Calvanese et al., 2007; Artale et al., 2009], established bounds on the size of rewritings [Gottlob et al., 2014; Kikot et al., 2014], developed optimisation techniques [Kontchakov et al., 2014], and implemented systems well-suited for real-world applications [Calvanese et al., 2017; Calvanese et al., 2011]. An important observation about the conventional semantics of OBDA is that it is set-based: the materialisation of the views deﬁned by the mappings is formalised as a virtual ABox consisting of a set of facts over the ontology predicates. This treatment is, however, in contrast with the semantics of database views, which is based on bags (multisets) and where duplicate tuples are retained by default. The distinction between set and bag semantics in databases is very signiﬁcant in practice; in particular, it inﬂuences the evaluation of aggre-

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

gate queries, which combine various aggregation functions such as Min, Max, Sum, Count or Avg with the grouping functionality provided in SQL by the Group By construct. Example 2. Consider the query asking for the number of employees named Lee. Assume there are two different employees named Lee, which are represented as different tuples in the sales database (e.g., tuples with the same employee name, but different ID). Under the conventional semantics of OBDA, the virtual ABox would contain a single fact Sal Emp(Lee); hence, the query would wrongly return one, even under the semantics for counting aggregate queries in [Calvanese et al., 2008; Kostylev and Reutter, 2015]. The correct count can be obtained by considering the extension of Sal Emp as a bag with multiple occurrences of Lee. The goal of this paper is to propose and study a bag semantics for OBDA which is compatible with the semantics of standard databases and can provide a suitable foundation for the future study of aggregate queries. We focus on conjunctive query (CQ) answering over DL-Lite R ontologies under bag semantics, and our main contributions are as follows. 1. We propose the ontology language DL-Litebag R and its restriction DL-Litebag core, where ABoxes consist of a bag of facts, thus providing a faithful representation of the views deﬁned by OBDA mappings. We deﬁne the semantics of query answering in this setting and show that it is compatible with the conventional set-based semantics. 2. We show that, in contrast to the set case, ontologies may not have a universal model (i.e., a single model over which all CQs can be correctly evaluated), and bag query answering becomes CONP-hard in data complexity even if we restrict ourselves to DL-Litebag core ontologies. 3. To regain tractability, we study the class of rooted CQs [Bienvenu et al., 2012], where each connected component of the query graph is required to contain an individual or an answer variable. This is a very general class, which arguably captures most practical OBDA queries. We show that rooted CQs over DL-Litebag core ontologies not only admit a universal model and enjoy favourable computational properties, but also allow for rewritings that can be directly evaluated over the bag ABox of the ontology. For the proofs of all results we refer to [Nikolaou et al., 2017].

2 Preliminaries Syntax of Ontologies We ﬁx a vocabulary consisting of countably inﬁnite and pairwise disjoint sets of individuals I (i.e., constants), variables X, atomic concepts C (unary predicates) and atomic roles R (binary predicates). A role is an atomic role P R or its inverse P . A concept is an atomic concept in C or an expression R, where R is a role. An inclusion is an expression of the form S1 S2 with S1 and S2 either both concepts or both roles. A disjointness axiom is an expression of the form Disj(S1, S2) with S1 and S2 either both concepts or both roles. A concept assertion is of the form A(a) with a I and A C. A role assertion is of the form P(a, b) with a, b I and P R. A DL-Lite R TBox is a ﬁnite set of inclusions and disjointness axioms. An ABox is a ﬁnite set of concept and role assertions. A DL-Lite R ontology is a pair T , A with T a DL-Lite R TBox and A an

ABox. The ontology language DL-Litecore restricts DL-Lite R by disallowing inclusions and disjointness axioms for roles. Semantics of Ontologies An interpretation I is a pair I, I , where the domain I is a non-empty set, and the interpretation function I maps each a I to a I I such that a I = b I for all a, b I,1 each A C to a subset AI of I and each P R to a subset P I of I I. The interpretation function extends to concepts and roles as follows: (R )I = {(u, v) | (v, u) RI} and ( R)I = {u I | (u, v) RI for some v I}. An interpretation I satisﬁes ABox A if a I AI for all A(a) A and (a I, b I) P I for all P(a, b) A; I satisﬁes TBox T if SI 1 SI 2 for all S1 S2 in T and SI 1 SI 2 = for all Disj(S1, S2) in T ; I is a model of ontology T , A if it satisﬁes T and A. An ontology is satisﬁable if it has a model. Queries A conjunctive query (CQ) q(x) with answer variables x is a formula y. φ(x, y), where x, y are (possibly empty) repetition-free tuples of variables and φ(x, y) is a conjunction of atoms of the form A(t), P(t1, t2) or z = t, where A C, P R, z x y, and t, t1, t2 x y I. If x is inessential, then we write q instead of q(x). If x is the empty tuple , then q is Boolean. A union of CQs (UCQ) is a disjunction of CQs with the same answer variables. The equality atoms in a CQ q(x) = y. φ(x, y) yield an equivalence relation on terms x y I, and we write t for the equivalence class of a term t. The Gaifman graph of q(x) has a node t for each t x y I in φ, and an edge { t1, t2} for each atom in φ over t1 and t2. We assume that all CQs are safe: for each z x y, the class z contains a term mentioned in an atom of φ(x, y) that is not an equality. The certain answers q K to a (U)CQ q(x) over a DL-Lite R ontology K are the set of all tuples a of individuals such that q(a) holds in every model of K. A class of queries Q1 is rewritable to a class Q2 for an ontology language O if for any q1 Q1 and TBox T in O, there is q2 Q2 such that, for any ABox A in O with T , A satisﬁable, q T ,A 1 equals the answers to q2 in (the least model of) A. Checking a q T ,A for a tuple a, (U)CQ q, and DL-Lite R ontology T , A is an NP-complete problem with AC0 data complexity (i.e., when T and q are ﬁxed) [Calvanese et al., 2007]. The latter follows from the rewritability of UCQs to themselves for DL-Lite R. Bags A bag over a set M is a function Ω: M N 0 , where N 0 is the set of nonnegative integers and inﬁnity. The value Ω(c) is the multiplicity of c in M. A bag Ωis ﬁnite if there are ﬁnitely many c M with Ω(c) > 0 and there is no c with Ω(c) = . The empty bag over M is the bag such that (c) = 0 for all c M. Given bags Ω1 and Ω2 over M, let Ω1 Ω2 if Ω1(c) Ω2(c) for each c M. The intersection , max union , arithmetic union , and difference are the binary operations deﬁned for bags Ω1 and Ω2 over the same set M as follows: for every c M, (Ω1 Ω2)(c) = min{Ω1(c), Ω2(c)}, (Ω1 Ω2)(c) = max{Ω1(c), Ω2(c)}, (Ω1 Ω2)(c) = Ω1(c) + Ω2(c), and (Ω1 Ω2)(c) = max{0, Ω1(c) Ω2(c)}; difference is welldeﬁned only when Ω2 is ﬁnite.

1We adopt the unique name assumption for convenience; dropping it does not affect results (modulo minor changes of deﬁnitions).

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

3 DL-Lite R with Bag Semantics In this section we present a bag semantics for DL-Lite R ontologies, deﬁne the associated query answering problem, and establish its intractability in data complexity. We formalise ABoxes as bags of facts (rather than sets) in order to faithfully represent the materialised views over source data deﬁned by OBDA mappings. Deﬁnition 3. A bag ABox is a ﬁnite bag over the set of concept and role assertions. A DL-Litebag R ontology is a pair T , A of a DL-Lite R TBox T and a bag ABox A; the ontology is DL-Litebag core if T is a DL-Litecore TBox.

The semantics of DL-Litebag R is based on bag interpretations I, with atomic concepts and roles mapped to bags of domain elements and pairs of elements, respectively, and where the interpretation function is extended to complex concepts and roles in the natural way; in particular, a concept P is interpreted as the bag projection of P I to the ﬁrst component, where each occurrence of a pair (u, v) in P I contributes to the multiplicity of domain element u in ( P)I. Deﬁnition 4. A bag interpretation I is a pair I, I deﬁned the same as in the set case with the exception that AI and P I are bags (not sets) over I and I I, respectively. The interpretation function extends to concepts and roles as follows: (P )I maps each (u, v) I I to P I(v, u), and ( R)I maps each u I to P v I RI(u, v). The deﬁnition of semantics of ontologies is as expected. Deﬁnition 5. A bag interpretation I = I, I satisﬁes a bag ABox A if A(A(a)) AI(a I) for each concept assertion A(a) in A and A(P(a, b)) P I(a I, b I) for each role assertion P(a, b). Satisfaction of T is deﬁned as in the set case, except that and are applied to bags instead of sets. Bag interpretation I is a bag model of the DL-Litebag R ontology T , A , written I |=b T , A , if it satisﬁes both T and A. The ontology is satisﬁable if it has a bag model.

Example 6. Let Kex = Tex, Aex be a DL-Litebag R ontology with Tex as in Example 1 and Aex has Sal Emp(Lee) with multiplicity 3, ITEmp(Lee) and has Mngr(Lee, Hill) both with multiplicity 2 (and all other assertions with multiplicity 0). Let Iex be the bag interpretation mapping individuals to themselves and with the following non-zero values:

Sal Emp Iex(Lee) = Emp Iex(Lee) = 3, ITEmp Iex(Lee) = 2, has Mngr Iex(Lee, Hill) = 2, has Mngr Iex(Lee, w) = 1, Mngr Iex(Hill) = 2, Mngr Iex(w) = 1,

where w is a fresh element. We can check that Iex |=b Kex. We now deﬁne the notion of query answering under bag semantics. We ﬁrst deﬁne the answers q I of a CQ q(x) over a bag interpretation I. Intuitively, q I is a bag of tuples of individuals such that each valid embedding λ of the body of q into I contributes separately to the multiplicity of the tuple λ(x) in q I; in turn, the contribution of each speciﬁc λ is the product of the multiplicities of the images of the query atoms under λ. The latter is in accordance with the interpretation of joins in the bag relational algebra and SQL, where the multiplicity of a tuple in a join is the product of the multiplicities of the joined tuples (e.g., see [Garc ıa-Molina et al., 2009]).

Deﬁnition 7. Let q(x) = y. φ(x, y) be a CQ. The bag answers q I to q over a bag interpretation I = I, I are deﬁned as the bag over tuples of individuals from I of the same size as x such that, for every such tuple a,

S(t) in φ(x,y) SI(λ(t)),

where Λ is the set of all valuations λ : x y I I such that λ(x) = a I, λ(a) = a I for each a I, and λ(z) = λ(t) for each z = t in φ(x, y). If q is Boolean then q I are deﬁned only for the empty tuple . Also, conjunction φ(x, y) may contain repeated atoms, and hence can be seen as a bag of atoms; while repeated atoms are redundant in the set case, they are essential in the bag setting [Chaudhuri and Vardi, 1993] and thus the deﬁnition of q I(a) treats each copy of a query atom S(t) separately. The following deﬁnition of certain answers, capturing open-world query answering, is a reformulation of the deﬁnition in [Kostylev and Reutter, 2015] for counting queries. It is a natural extension of the set notion to bags: a query answer is certain for a given multiplicity if it occurs with at least that multiplicity in every bag model of the ontology. Deﬁnition 8. The bag certain answers q K to a query q over a DL-Litebag R ontology K are the bag T I|=b K q I.

We study the problem BAGCERT[Q, O] of checking, given a query q from a class of CQs Q, ontology K = T , A from an ontology language O, tuple a over I, and number k N 0 , whether q K(a) k; data complexity of BAGCERT is studied under the assumption that T and q are ﬁxed. Following [Grumbach and Milo, 1996], we assume that the multiplicities of assertions in A and k (if not inﬁnity) are given in unary. Example 9. Let qex(x) = y. has Mngr(x, y) and Kex be as in Example 6. Then q Kex ex (Lee) = 3. Indeed, on the one hand, q Iex ex (Lee) = 3 for Iex in Example 6. On the other, for any bag model I of Kex, q I ex(Lee) = Σu Ihas Mngr I(Lee I, u) 3, because Aex(Sal Emp(Lee)) = 3 and Tex contains inclusions Sal Emp Emp and Emp has Mngr . The bag semantics can be seen as a generalisation of the set semantics of DL-Lite: ﬁrst, satisﬁability under bag semantics reduces to the set case; second, certain answers under bag and set semantics coincide if multiplicities are ignored. Proposition 10. Let T , A be a DL-Lite R ontology and T , A be a DL-Litebag R ontology with the same TBox such that {S(t) | A (S(t)) 1} = A. Then, the following holds: 1. T , A is satisﬁable if and only if T , A is satisﬁable; 2. for each CQ q and tuple a of individuals from I, a q T ,A if and only if q T ,A (a) 1. An important property of satisﬁable DL-Lite R ontologies K is the existence of so called universal models for CQs, that is, models I such that the certain answers to every CQ q over K can be obtained by evaluating q over I [Calvanese et al., 2007]. This notion extends naturally to bags.

Deﬁnition 11. A bag model I of a DL-Litebag R ontology K is universal for a class of queries Q if q K = q I for any q Q. Unfortunately, in contrast to the set case, even DL-Litebag core ontologies may not admit a universal bag model for all CQs.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Proposition 12. There exists a satisﬁable DL-Litebag core ontology that has no universal bag model for the class of all CQs. The lack of a universal model suggests that CQ answering under bag semantics is harder than in the set case. Indeed, this problem is CONP-hard in data complexity, which is in stark contrast to the AC0 upper bound in the set case. Theorem 13. BAGCERT[CQs, DL-Litebag core] is CONP-hard in data complexity.

4 Universal Models for Rooted Queries Theorem 13 suggests that bag semantics is generally not wellsuited for OBDA. Our approach to overcome this negative result is to consider a restricted class of CQs, introduced in the context of query optimisation in DLs [Bienvenu et al., 2012], called rooted: in a rooted CQ, each existential variable is connected in the Gaifman graph to an individual or an answer variable. Rooted CQs capture most practical queries; for example, they include all connected non-Boolean CQs. Deﬁnition 14. A CQ q(x) is rooted if each connected component of its Gaifman graph has a node with a term in x I. In contrast to arbitrary CQs, any satisﬁable DL-Litebag core ontology admits a universal bag model for rooted CQs. Although we deﬁne such a model, called canonical, in a fully declarative way, it can be intuitively seen as the result of applying a variant of the restricted chase procedure [Cal ı et al., 2013] extended to bags. Starting from the ABox, the procedure successively repairs violations of T by extending the interpretation of concepts and roles in a minimal way. To formalise canonical models, we need two auxiliary notions. First, the concept closure ccl T [u, I] of an element u I in a bag interpretation I = I, I over a TBox T is the bag of concepts such that, for any concept C, ccl T [u, I](C) is the maximum value of CI 0 (u) amongst all concepts C0 satisfying T |= C0 C. Second, the union I J of bag interpretations I = I, I and J = J , J with a I = a J for all a I is the bag interpretation I J , I J with a I J = a I for a I and SI J = SI SJ for S C R. Deﬁnition 15. The canonical bag model C(K) of a DL-Litebag core ontology K = T , A is the bag interpretation S i 0 Ci(K) with the bag interpretations Ci(K) = Ci(K), Ci(K) deﬁned as follows: - C0(K) = I, a C0(K) = a for each a I, and SC0(K)(a) = A(S(a)) for each S C R and individuals a; - for each i > 0, Ci(K) is Ci 1(K) {w1 u,R, . . . , wδ u,R | u Ci 1(K), R a role, δ = ccl T [u, Ci 1(K)]( R) ( R)Ci 1(K)(u)},

where wj u,R are fresh domain elements, called anonymous, a Ci(K) = a for all a I, and, for all A C, P R, and elements u, v,

ACi(K)(u) = ccl T [u, Ci 1(K)](A), if u Ci 1(K), 0, otherwise,

P Ci(K)(u, v) =

P Ci 1(K)(u, v), if u, v Ci 1(K), 1, if v = wj u,P or u = wj v,P , 0, otherwise.

It is easily seen that C(K) satisﬁes K whenever K is satisﬁable. We next show that it is universal for rooted CQs. Theorem 16. The canonical bag model C(K) of a satisﬁable DL-Litebag core ontology K is universal for rooted CQs. Example 17. Consider an ontology Kr = Tr, Ar with

Tr = {Emp has Mngr, has Mngr Mngr}, Ar(Emp(Lee)) = Ar(Mngr(Hill)) = 1.

The canonical model C(Kr) interprets (all with multiplicity 1) Emp by Lee, Mngr by Hill and w1 Lee,has Mngr, and has Mngr by (Lee, w1 Lee,has Mngr). Note that C(Kr) is not universal for

all CQs: for instance, q C(Kr) nr ( ) = 2 for non-rooted qnr = y. Mngr(y), but q Inr nr ( ) = 1 for the model Inr interpreting Emp by Lee, has Mngr by (Lee, Hill), and Mngr by Hill. We conclude this section by showing an important property of rooted CQs, which justiﬁes their favourable computational properties. As in the set case for arbitrary CQs, given a satisﬁable DL-Litebag core ontology K and a rooted CQ q, q K can be computed over a small sub-interpretation of C(K).

Theorem 18. Let K be a satisﬁable DL-Litebag core ontology with C(K) = S i 0 Ci(K) and q be a rooted CQ having n atoms. Then, q C(K) = q Cn(K).

5 Rewritability of Rooted Queries Rewritability is key for OBDA, and we next establish to what extent rooted CQs over bag semantics are rewritable. The ﬁrst idea would be to use the analogy with the set case and rewrite to unions of CQs. There are two corresponding operations for bags: max union and arithmetic union . So we may consider max unions qmax = q1(x) qn(x) or arithmetic unions qar = q1(x) qn(x) of CQs qi(x), 1 i n, with the following semantics, for any interpretation I: q I max = q I 1 q I n and q I ar = q I 1 q I n, respectively. Our ﬁrst result is negative: rewriting to either of these classes is not possible even for DL-Litebag core. Proposition 19. The class of rooted CQs is rewritable neither to max nor to arithmetic unions of CQs for DL-Litebag core. Next we show that rooted queries are rewritable to BALG1 ε-queries: the class directly corresponding to the algebra BALG1 ε for bags [Grumbach et al., 1996; Grumbach and Milo, 1996; Libkin and Wong, 1997]. Since BALG1 ε LOGSPACE [Grumbach and Milo, 1996], where BALG1 ε is the complexity class for BALG1 ε algebra evaluation, rewritability to BALG1 ε-queries is highly desirable. Intuitively, in addition to projection , join , and unions and , BALG1 ε also allows for difference \. Domaindependent queries, inexpressible in algebraic query languages, are precluded by restrictions on the use of variables. Deﬁnition 20. A BALG1 ε-query q(x) with answer variables x is one of the following, where qi are BALG1 ε-queries: - S(t), for S C R, t tuple over x I mentioning all x; - q1(x1) q2(x2), for x = x1 x2; - q0(x0) (x = t), for x x0, t X I, x = x0 ({t}\I); - y. q0(x, y); q1(x) q2(x); q1(x) q2(x); q1(x) \ q2(x).

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

The semantics of BALG1 ε-queries is deﬁned as follows. Deﬁnition 21. The bag answers q I to a BALG1 ε-query q(x) over a bag interpretation I = I, I is the bag of tuples over I of the same size as x inductively deﬁned as follows, for each tuple a and the corresponding mapping λ such that λ(x) = a I and λ(a) = a I for all a I: - SI(λ(t)), if q(x) = S(t); - q I 1 (λ(x1)) q I 2 (λ(x2)), if q(x) = q1(x1) q2(x2); - q I 0 (λ(x0)), if q(x) = q0(x0) (x = t) and λ(x) = λ(t); - 0, if q(x) = q0(x0) (x = t) and λ(x) = λ(t); - P λ : y I q I 0 (a I, λ (y)), if q(x) = y. q0(x, y); - (q I 1 op q I 2 )(a I) if q(x) = q1(x) op q2(x), where op is , , or and op is , , or \, respectively. The data complexity of BALG1 ε-query evaluation is obtained by showing that BALG1 ε-queries can be be mapped to the BALG1 ε algebra of [Grumbach and Milo, 1996]. Proposition 22. Given a ﬁxed BALG1 ε-query q(x), the problem of checking whether q C( ,A )(a) k for a bag ABox A, tuple a, and k N 0 is AC0 reducible to BALG1 ε. Our rewriting algorithm is inspired by the algorithm in [Kikot et al., 2012] for the set case of DL-Lite R. Before going into details, we provide a high-level description. The key observation is that the set of valuations of a CQ q(x) = y. φ(x, y) over the bag canonical model C(K) can be partitioned into subsets, each of which is characterised by variables z y that are sent to anonymous elements of C(K). Hence, we can rewrite q(x) for each of these subsets separately and then take an arithmetic union of the resulting queries, provided these queries are guaranteed to give the same answers as the corresponding subsets of valuations. Our rewriting proceeds along the following steps. Step 1. First, each z is checked for realisability, that is, whether the subquery induced by z can indeed be folded into the anonymous forest-shaped part of C(K). This can be done without the ABox, looking only at the atoms of q that link z to other terms of q (these linking atoms exist because q is rooted). Non-realisable z can be disregarded. Step 2. For every realisable z, CQ q(x) is replaced (for this z in the arithmetic union) by a CQ qz(x) obtained from q by replacing each maximal connected component of the subquery induced by z by just one linking atom. This transformation is equivalence-preserving, because the anonymous part of C(K) does not involve multiplicities other than 0 and 1. Step 3. Finally, each resulting qz(x) is rewritten to a BALG1 εquery qz(x) by chasing back each unary atom and each binary atom mentioning a variable in z with the TBox; for the binary atoms it is also guaranteed, by means of difference, that the variable in z is indeed mapped to the anonymous part, thus avoiding double-counting in the arithmetic union. For the rest of this section, let us ﬁx a rooted CQ q(x) = y. φ(x, y) and a DL-Litebag core TBox T . We start by formalising Step 1. Deﬁnition 23. Given an ontology K with a TBox T and variables z y, let [q, z]C(K) be the bag of tuples over I such that, for each tuple a of individuals,

[q, z]C(K)(a) = X

S(t) in φ(x,y) SC(K)(λ(t)),

where Λz is the set of valuations λ : x y I C(K) such that λ(x) = a, λ(a) = a for each a I, λ(x) = λ(t) for each x = t in φ(x, y), λ(z) is an anonymous element for each z z, and λ(y) I for each y y \ z. Hence, the bag answers to q can be partitioned as follows:

z y[q, z]C(K). (1)

Variables z y are equality-consistent if φ(x, y) has no equality z = t with z z and t / z. If z is not equalityconsistent, then [q, z]C(K) = and these z can be disregarded in (1). Next, we show which other z can be ignored. Deﬁnition 24. Given equality-consistent z y, variables z z are maximally connected in the anonymous part (maconnected) if z z for the equivalence class z of any z z and the equivalence classes z are a maximal subset of z connected in the Gaifman graph of q via nodes in z. Next we introduce several notations for ma-connected z z with equality-consistent z y. First, let φz be the subconjunction of φ(x, y) that consists of all atoms mentioning at least one variable in z (these sub-conjunctions are disjoint for different z ). Second, since q is rooted, φz contains an atom αz of the form P(t, z) or P(z, t) with z z and t / z (note that this deﬁnition may be non-deterministic). Third, let qa z () = x . z . φz V t tz (t = a) V z z (z = a),

where tz are all such terms t, a is an individual in tz if it exists or a fresh individual otherwise, and x = tz X, (this deﬁnition may also be non-deterministic because of a). Notice that qa z is a Boolean CQ with possible equalities of individuals and inequalities, and we can deﬁne the bag answers of such a query q over a bag interpretation I in the same way as for usual CQs in Deﬁnition 7 with the extra requirement that each contributing valuation λ should satisfy λ(x) = λ(t) for each inequality x = t of q (and equalities of individuals are handled as usual equalities). Deﬁnition 25. Given equality-consistent variables z y, ma-connected z z are realisable by TBox T if

(qa z )C( T ,A )( ) 1, where, for a fresh individual b, A is the bag ABox having either only the assertion P(a, b) (with multiplicity 1), when αz = P(t, z), or only P(b, a), when αz = P(z, t). This deﬁnition does not depend on the choice of αz and a. Indeed, if there are two atoms P1(t1, z1) and P2(t2, z2) satisfying the deﬁnition of αz , then either P1 = P2 and both pairs (t1, z1) and (t2, z2) are mapped by a valuation of qa z to the same tuple, or z are not realisable regardless of the choice of αz . Similarly, if tz contains two individuals a, a , then qa z has the equality a = a , and hence z are not realisable regardless of this choice. Intuitively, z are realisable if their corresponding subquery qa z is satisﬁed by the tree-shaped model induced by the TBox from a connection αz of z and the rest of the query. This deﬁnition does not essentially involve multiplicities, because all tuples of anonymous elements in the canonical model have multiplicity at most 1, and, hence, if qa z matches a part of the canonical model, it does so in a unique way. Thus, checking realisability is decidable using standard set-based techniques.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

Deﬁnition 26. Variables z y are realisable by TBox T if they are equality-consistent and each non-empty maconnected subset of z is realisable by T . We proceed to Step 2. For realisable z y, let qz(x) be the CQ y . φz(x, y ) such that φz(x, y ) is obtained from φ(x, y) by replacing φz , for each ma-connected z z, with

αz V y tz X, t tz (y = t),

where tz is as in qa z , and y is the subset of y remaining in φz. In other words, qz contains, for each z , just one atom αz and equalities identifying tz instead of conjunction φz in q. The following lemma justiﬁes Steps 1 and 2. It says that in partitioning (1) we only need to iterate over tuples z that are realisable by T and can also replace q with qz for each z. Lemma 27. For any ontology K with TBox T and z y with qz(x) = y . φz(x, y ), 1. if z is realisable by T then [q, z]C(K) = [qz, z y ]C(K); 2. if z is not realisable by T then [q, z]C(K) = . For Step 3, it sufﬁces to rewrite each CQ qz(x) = y . φz(x, y ) to a BALG1 ε-query qz(x) = yz. ψz(x, yz), for yz = y \ z, which is guaranteed to give [qz, z y ]C(K) as the bag answers on the ABox in any ontology K with TBox T . To this end, we use the following notation: for t X I, let ζA(t) = A(t) for A C, while ζ P (t) = y. P(t, y) and ζ P (t) = y. P(y, t) for P R, where y is a variable different from t. Then, formula ψz(x, yz) is obtained from φz(x, y ) by replacing all atoms mentioning a term t I x yz or a variable z z as follows: - each A(t) with W T |=C A ζC(t); - each P(t, z) with W T |=C P ζC(t) \ ζ P (t); - each P(z, t) with W T |=C P ζC(t) \ ζ P (t). Note that φz(x, y ) does not contain any atoms of the form A(z) for z z, so ψz(x, yz) does not mention variables z. Also, atoms over roles without variables z stay intact, because T contains no role inclusions. Finally, the rewriting of q(x) over T is the BALG1 ε-query

q(x) = W z realisable by T qz(x).

Example 28. Consider TBox Tr from Example 17 and the rooted CQ qr(x) = y. has Mngr(x, y) Mngr(y). The query qr(x) = qr (x) qr y(x), where qr (x) and qr y(x) are

y. has Mngr(x, y) Mngr(y) z. has Mngr(z, y) and Emp(x) y. has Mngr(x, y) \ y. has Mngr(x, y),

is a rewriting of qr over Tr, since and y are realisable. The following theorem establishes the correctness of our approach and leads to the main rewritability result.

Theorem 29. For any rooted CQ q and DL-Litebag core ontology K = T , A we have that q C(K) = q C( ,A ).

Corollary 30. The class of rooted CQs is rewritable to BALG1 ε-queries for DL-Litebag core. We conclude this section by establishing the complexity of rooted query answering. The bounds follow as an easy consequence of Theorem 18, Proposition 22, and Corollary 30.

Theorem 31. BAGCERT[rooted CQs, DL-Litebag core] is NPcomplete and in LOGSPACE in data complexity. However, the next theorem implies that rooted queries are not BALG1 ε-rewritable for unrestricted DL-Litebag R TBoxes.

Theorem 32. BAGCERT[rooted CQs, DL-Litebag R ] is CONPhard in data complexity.

6 Related work Query answering under bag semantics has received signiﬁcant attention in the database literature [Libkin and Wong, 1994; Grumbach et al., 1996; Grumbach and Milo, 1996; Libkin and Wong, 1997]. These works study the relative expressive power of bag algebra primitives, the relationship with set-based algebras, and establish the data complexity of query answering. Such problems have also been recently studied in the setting of Semantic Web and SPARQL 1.1 in [Kaminski et al., 2016; Angles and Gutierrez, 2016]. Bag semantics in the context of Description Logics has been studied in [Jiang, 2010], where the author proposes a bag semantics for ALC and provides a tableaux algorithm. In contrast to our work, their results are restricted to ontology satisﬁability and do not encompass CQ answering. CQ answering under bag semantics is closely related to answering Count aggregate queries. The semantics of aggregate queries for database settings with incomplete information, such as inconsistent databases and data exchange, have been studied in [Arenas et al., 2003; Libkin, 2006; Afrati and Kolaitis, 2008]. As pointed out in [Kostylev and Reutter, 2015], these techniques are not directly applicable to ontologies. The practical solution in [Calvanese et al., 2008] is to give epistemic semantics to aggregate queries, where the query is evaluated over ABox facts entailed by the ontology; thus, the anonymous part of the ontology models is essentially ignored, and the semantics easily leads to counterintuitive answers. To remedy these issues, [Kostylev and Reutter, 2015] propose a certain answer semantics for Count aggregate queries over ontologies and prove tight complexity bounds for DL-Lite R and DL-Litecore. Similarly to our work, their semantics is open-world and considers all models of the ontology for query evaluation, which leads to more intuitive answers. The main difference resides in the deﬁnition of the ontology language, where they consider set ABoxes and adopt conventional set-based semantics for TBox axioms. Although DL-Litebag R is closely related to the logic in [Kostylev and Reutter, 2015], the two settings do not coincide even for set ABoxes. For example, if A comprises only assertions R(a, b) and R(a, c) and T comprises axiom R B, then the query over T , A that counts the number of individuals a in concept B returns 1 in the setting of [Kostylev and Reutter, 2015], while the corresponding DL-Litebag R query returns 2.

7 Conclusion and Future Work We have studied OBDA under bag semantics and identiﬁed a general class of rewritable queries over DL-Litebag core ontologies. As our framework covers already the class of Count aggregate queries, in future work we plan to extend it to capture further aggregate functions and more expressive ontologies.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)

References [Afrati and Kolaitis, 2008] Foto N. Afrati and Phokion G. Kolaitis. Answering aggregate queries in data exchange. In PODS, 2008. [Angles and Gutierrez, 2016] Renzo Angles and Claudio Gutierrez. The multiset semantics of SPARQL patterns. In ISWC, 2016. [Arenas et al., 2003] Marcelo Arenas, Leopoldo E. Bertossi, Jan Chomicki, Xin He, Vijay Raghavan, and Jeremy P. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 296(3):405 434, 2003. [Artale et al., 2009] Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev. The DL-Lite family and relations. J. Artif. Intell. Res. (JAIR), 36:1 69, 2009. [Bienvenu et al., 2012] Meghyn Bienvenu, Carsten Lutz, and Frank Wolter. Query containment in description logics reconsidered. In KR, 2012. [Cal ı et al., 2013] Andrea Cal ı, Georg Gottlob, and Michael Kifer. Taming the inﬁnite chase: Query answering under expressive relational constraints. J. Artif. Intell. Res. (JAIR), 48:115 174, 2013. [Calvanese et al., 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efﬁcient query answering in description logics: The DL-Lite family. J. Autom. Reasoning, 39(3):385 429, 2007. [Calvanese et al., 2008] Diego Calvanese, Evgeny Kharlamov, Werner Nutt, and Camilo Thorne. Aggregate queries over ontologies. In Proceedings of the 2nd International Workshop on Ontologies and Information Systems for the Semantic Web, ONISW, 2008. [Calvanese et al., 2011] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi, and Domenico Fabio Savo. The MASTRO system for ontology-based data access. Semantic Web, 2(1):43 53, 2011. [Calvanese et al., 2017] Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. Ontop: Answering SPARQL queries over relational databases. Semantic Web, 8(3):471 487, 2017. [Chaudhuri and Vardi, 1993] Surajit Chaudhuri and Moshe Y. Vardi. Optimization of Real conjunctive queries. In PODS, 1993. [Garc ıa-Molina et al., 2009] Hector Garc ıa-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database Systems: The Complete Book. Pearson Education, 2nd edition, 2009. [Gottlob et al., 2014] Georg Gottlob, Stanislav Kikot, Roman Kontchakov, Vladimir V. Podolskii, Thomas Schwentick, and Michael Zakharyaschev. The price of query rewriting in ontology-based data access. Artif. Intell., 213:42 59, 2014.

[Grumbach and Milo, 1996] St ephane Grumbach and Tova Milo. Towards tractable algebras for bags. J. Comput. Syst. Sci., 52(3):570 588, 1996. [Grumbach et al., 1996] St ephane Grumbach, Leonid Libkin, Tova Milo, and Limsoon Wong. Query languages for bags: expressive power and complexity. SIGACT News, 27(2):30 44, 1996. [Jiang, 2010] Yuncheng Jiang. Description logics over multisets. In 6th International Workshop on Uncertainty Reasoning for the Semantic Web (URSW), 2010. [Kaminski et al., 2016] Mark Kaminski, Egor V. Kostylev, and Bernardo Cuenca Grau. Semantics and expressive power of subqueries and aggregates in SPARQL 1.1. In WWW, 2016. [Kikot et al., 2012] Stanislav Kikot, Roman Kontchakov, and Michael Zakharyaschev. Conjunctive query answering with OWL 2 QL. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference, KR, 2012. [Kikot et al., 2014] Stanislav Kikot, Roman Kontchakov, Vladimir V. Podolskii, and Michael Zakharyaschev. On the succinctness of query rewriting over shallow ontologies. In LICS, 2014. [Kontchakov et al., 2014] Roman Kontchakov, Martin Rezk, Mariano Rodriguez-Muro, Guohui Xiao, and Michael Zakharyaschev. Answering SPARQL queries over databases under OWL 2 QL entailment regime. In ISWC, 2014. [Kostylev and Reutter, 2015] Egor V. Kostylev and Juan L. Reutter. Complexity of answering counting aggregate queries over DL-Lite. J. Web Sem., 33:94 111, 2015. [Lenzerini, 2002] Maurizio Lenzerini. Data integration: A theoretical perspective. In PODS, 2002. [Libkin and Wong, 1994] Leonid Libkin and Limsoon Wong. New techniques for studying set languages, bag languages and aggregate functions. In PODS, 1994. [Libkin and Wong, 1997] Leonid Libkin and Limsoon Wong. Query languages for bags and aggregate functions. JCSS, 55(2):241 272, 1997. [Libkin, 2006] Leonid Libkin. Data exchange and incomplete information. In PODS, 2006. [Motik et al., 2012] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, and Carsten Lutz. OWL 2 Web Ontology Language Proﬁles (Second Edition). W3C Recommendation, 2012. [Nikolaou et al., 2017] Charalampos Nikolaou, Egor V. Kostylev, George Konstantinidis, Mark Kaminski, Bernardo Cuenca Grau, and Ian Horrocks. The Bag Semantics of Ontology-Based Data Access. Co RR, abs/1705.07105, 2017. [Poggi et al., 2008] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. J. Data Semantics, 10:133 173, 2008.

Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence (IJCAI-17)