# answering_counting_queries_over_dllite_ontologies__ad52dfa1.pdf

Answering Counting Queries over DL-Lite Ontologies

Meghyn Bienvenu1 , Quentin Mani ere1 and Micha el Thomazo2

1University of Bordeaux, CNRS, Bordeaux INP, La BRI, Talence, France 2Inria, DI ENS, ENS, CNRS, University PSL, Paris, France {meghyn.bienvenu, quentin.maniere}@u-bordeaux.fr, michael.thomazo@inria.fr

Ontology-mediated query answering (OMQA) is a promising approach to data access and integration that has been actively studied in the knowledge representation and database communities for more than a decade. The vast majority of work on OMQA focuses on conjunctive queries, whereas more expressive queries that feature counting or other forms of aggregation remain largely unexplored. In this paper, we introduce a general form of counting query, relate it to previous proposals, and study the complexity of answering such queries in the presence of DL-Lite ontologies. As it follows from existing work that query answering is intractable and often of high complexity, we consider some practically relevant restrictions, for which we establish improved complexity bounds.

1 Introduction Ontology-mediated query answering (OMQA) utilizes ontologies to provide a convenient vocabulary for query formulation and to capture domain knowledge that is exploited during the querying process to obtain more complete sets of answers [Poggi et al., 2008; Bienvenu and Ortiz, 2015; Xiao et al., 2018]. Much of the work on OMQA considers ontologies formulated using description logics (DLs), a family of knowledge representation languages that provide the logical foundations of the OWL web ontology language. Particular attention has been to the DL-Lite family of DLs [Calvanese et al., 2007], which were developed with OMQA in mind and enjoy favorable computational properties. The vast majority of work on OMQA supposes that user queries are given as conjunctive queries (CQs). However, there are many other kinds of database queries, beyond plain CQs, that are relevant in practice. This motivates research into the feasibility of adopting other database query languages for OMQA. While enriching CQs with either negated atoms or inequalities has been shown to lead to undecidability even in very restricted settings [Guti errez-Basulto et al., 2015], the situation is more positive for navigational queries (like regular path queries), which can be adopted without losing decidability, sometimes even retaining tractable data complexity [Bienvenu et al., 2015b].

Aggregate queries, which use numeric operators (e.g. count, sum, max) to summarize selected parts of a dataset, constitute another prominent class of database queries. Although such queries are widely used for data analysis, they have been little explored in context of OMQA. This may be partly due to the fact that it is not at all obvious how to deﬁne the semantics of such queries in the OMQA setting. A ﬁrst exploration of aggregate queries in OMQA was conducted by Calvanese et al. (2008). They argued that the most straightforward adaptation of classical certain answer semantics to aggregate queries was unsatisfactory, as often values would differ from model to model, leading to no certain answers. For this reason, an epistemic semantics was proposed, in which variables involved in the aggregation are required to match to data constants. However, as discussed in [Kostylev and Reutter, 2015], this semantics can also give unintuitive results by ignoring ways of mapping aggregate variables to anonymous elements inferred due the ontology axioms. For instance, if no children of alex are listed in the data, then a query that asks to return the number of children will yield 0 under epistemic semantics, even if it can be inferred (e.g. due to a family tax beneﬁt) that there must be at least 3 children. This led Kostylev and Reutter to deﬁne an alternative semantics for two kinds of counting queries (inspired by the COUNT and COUNT DISTINCT in SQL) which adopts a form of certain answer semantics but considers lower and upper bounds on the count value across different models. For the two considered logics (DL-Litecore and DL-Lite R), only the lower bounds on the count value are non-trivial, and a complexity analysis shows that they are challenging to identify: co NP-data complexity for both logics, and Πp 2-hard (resp. co NEXP-hard) in combined complexity for DL-Litecore (resp. DL-Lite R). Several question were left unanswered by their work, including the exact combined complexity, the difﬁculty of recognizing the optimal lower bound, and the impact of allowing multiple aggregation variables. This paper returns to the issue of handling counting queries in OMQA and makes several important contributions:

1. We propose a new notion of counting CQ that generalizes the two forms of queries from [Kostylev and Reutter, 2015] and allows arbitrarily many counting variables. 2. We show that existing complexity results for DL-Litecore and DL-Lite R KBs continue to hold for our more general notion of counting CQ, and provide an improved co NEXP

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

upper bound for the relevant case of ﬁnite-depth TBoxes. 3. We consider the impact of restricting the query structure, focusing on the class of rooted queries, in which every query variable must be connected to an answer variable or individual in the query graph. A recent result, obtained as part of a study of bag semantics for OMQA, identiﬁed a case in which rootedness leads to tractable data complexity for counting queries [Nikolaou et al., 2019]. This motivates us to perform a more thorough investigation of rooted counting queries, which yields several improvements upon existing complexity bounds. 4. We prove that the problem of identifying the best certain interval is DP-complete in data complexity. Our results close some questions that were left open by the work of Kostylev and Reutter and pave the way for further study of counting and aggregate queries in the OMQA setting. An appendix with full proofs can be found in the long version of this paper, available on ar Xiv.

2 Preliminaries We recall the basics of description logics (DLs), focusing on DL-Lite, see [Baader et al., 2017] for more details. Syntax and Semantics. A description logic vocabulary consists of a set NC of atomic concepts (unary predicates), a set NR of atomic roles (binary predicates), and a set NI of individual names (constants). By role, we mean either an atomic role P NR or an inverse role P (where P NR). We let N R denote the set NR {P | P NR} of roles and use the notation R to mean P if R = P NR and P if R = P . A DL knowledge base (KB) is a pair K = (T , A), consisting of an ABox A that contains facts about particular individuals and a TBox T that expresses general knowledge about the domain. Formally, an ABox is a ﬁnite set of concept assertions A(b), with A NC and b NI, and role assertions P(a, b), with P NR and a, b NI. We use Ind(A) to denote the set of individuals in A. A TBox is a ﬁnite set of axioms, whose syntax depends on the particular DL. In DL-Litecore, axioms take the form of concept inclusions B1 ( )B2, where each Bi is either A (for A NC) or R (with R N R ). DL-Lite R TBoxes additionally allow role inclusions R1 ( )R2, where R1, R2 N R . Example 1. Our example KB talks about leading (Lead In) and supporting actors (Supp In) in movies:

Aact = {Acts In(doona, cloud), Supp In(berry, cloud), Supp In(hanks, cloud), Supp In(hanks, catch)} Tact = {Lead In Acts In, Supp In Acts In,

Supp In Lead In }

An interpretation takes the form I = ( I, I), where I is a non-empty set (the domain of I), and .I is a function that maps each A NC to a subset AI I, each P NR to a binary relation P I I I, and each a NI to an element a I of I. We make the unique names assumption (UNA) by requiring that a I = b I for every a, b NI with a = b. The function I naturally extends to complex concepts and roles: ( R)I = {d | d : (d, d ) RI}, (P )I =

{(d1, d2) | (d2, d1) P I}, ( B)I = I \ BI, ( R)I = I I \ RI. A (concept or role) inclusion F G is satisﬁed in I if F I GI; assertion A(b) is satisﬁed in I if b I AI; P(a, b) is satisﬁed in I if (a I, b I) P I. We call I a model of K, written I |= K, if it satisﬁes all inclusions and assertions in K. A KB is satisﬁable if has at least one model. Queries. We recall that a conjunctive query (CQ) takes the form y ψ(x, y), where x and y are tuples of variables drawn from an inﬁnite set of variables V, and ψ is a conjunction of atoms, which can be either concept atoms A(t1) or role atoms P(t1, t2), where A NC, P NR, and terms ti are drawn from NI x y. Consider an interpretation I and CQ q = y ψ(x, y) with |x| = n. A tuple α I n is an answer to q in I, written I |= q(α), if there is a homomorphism of q into I, i.e., a function σ that maps the terms of q to elements of I such that (i) σ(a) = a I for a NI, (ii) σ(t) AI for every atom A(t) of q, and (iii) (σ(t1), σ(t2)) P I for every atom P(t1, t2) of q. A tuple a Ind(A)n is a certain answer to q w.r.t. the KB K iff I |= q(a I) for every model I of K. Canonical Model. We recall the deﬁnition of the canonical model CK of a DL-Lite R KB K = (T , A). The domain of CK consists of Ind(A) and all words of the form a R1 . . . Rn, with a Ind(A), Ri N R , and n 1, such that: K |= R1(a) and there is no R1(a, b) A; for 1 i < n, T |= R i Ri+1 and R i = Ri+1.

We interpret individuals as themselves (a CK = a) and atomic concepts and roles as follows:

ACK = {a Ind(A) | K |= A(a)}

{a R1 . . . Rn CK \ Ind(A) | T |= R n A}

P CK = {(a, b) | P(a, b) A} {(e1, e2) | e2 = e1R and T |= R P}

{(e2, e1) | e2 = e1R and T |= R P }

The term canonical model is motivated by the following well-known property of CK (see e.g. [Calvanese et al., 2007]): Lemma 1. Let K be a satisﬁable DL-Lite R KB. Then CK |= K and if I |= K, there is a homomorphism of CK into I. A useful corollary is that the certain answers to a CQ q w.r.t. K are the tuples from Ind(A) that are answers to q in CK. Note that CK may be inﬁnite. The depth of a TBox T is deﬁned as the maximal length of any word that appears in the domain of CK for any KB K whose TBox is T . If this number is ﬁnite, we say that T is a ﬁnite-depth TBox; such TBoxes can be identiﬁed in polynomial time [Bienvenu et al., 2015a].

3 Counting Queries We now introduce our formalization of counting queries. In addition to the set V of (classical) variables, we assume a second inﬁnite set of counting variables Vc, disjoint from V. Deﬁnition 1. A counting conjunctive query (CCQ) q takes the form q(x) = y z ψ(x, y, z), where x y V, z Vc, and ψ is a conjunction of concept and role atoms whose terms are drawn from NI x y z. We call x (resp. y, resp. z) the answer (resp. existential, resp. counting) variables of q.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

We ﬁrst deﬁne the semantics of counting queries on a single interpretation I, by considering those pairs (a, n) such that n is the number of possible ways to map z into I when x is mapped to a. Such pairs are called the answers to q in I. Deﬁnition 2. A match of a CCQ q(x) = y z ψ(x, y, z) in I is a homomorphism1 from q into I. If a match σ maps x to a, then the restriction of σ to z is called a counting match (cmatch) of q(a) in I. The set of answers to q in I, denoted q I, contains all pairs (a, q I a), where q I a is the number of distinct c-matches of q(a) in I. As has been previously noted (see e.g. [Kostylev and Reutter, 2015]), the exact count values of the answers in q I are usually too speciﬁc to hold across models. Considering bounds on the exact value provides more insight, while still allowing unnamed elements to be counted. This motivates the following notion of answer interval. Deﬁnition 3. The set [q]I of answer intervals for a CCQ q in I contains all pairs (a, [m, M]) with a Ind|x| and m, M integers such that m q I a M. The set [q]K of certain (counting) answers to q w.r.t. KB K is obtained by considering those answer intervals that hold in all models of K: [q]K = T I|=K [q]I.

Note that (a, [m, M]) [q]K does not imply that for any n [m, M] there exists a model I in which (a, n) q I. Deﬁnition 1 is a proper generalization of the two forms of counting query considered by Kostylev and Reutter. Reusing their notations, a Cntd()-query q(x, Cntd(z)) = y ψ(x, y, z) corresponds to the CCQ q(x) = ψ(x, y, z), while a Count()-query q(x, Count()) = y ψ(x, y) corresponds to the CCQ q(x) = ψ(x, , ˆy) (with ˆy a tuple of variables from Vc in bijection with y). We will use the term exhaustive to refer to the latter CCQs, i.e. those in which every non-answer variable is a counting variable. Example 2. Reconsider the KB Kact = (Tact, Aact). We can use CCQs to count the pairs of actors (leading role, supporting role) having acted together (q1), return movies together with a count of their supporting actors (q2), and count the number of actors having acted with Tom Hanks (q3):

q1 = y z1 z2 Lead In(z1, y) Supp In(z2, y) q2(x) = z Supp In(z, x) q3 = y z Acts In(hanks, y) Acts In(z, y)

According to our semantics, we have: ( , [2, + ]) [q1]Kact, since z2 can be mapped to either berry or hanks, and z1 mapped to the lead actor (which must exist due to Tact). As the lead actors of the two ﬁlms could be the same, ( , [3, + ]) [q1]Kact. (cloud, [2, + ]) [q2]Kact, mapping z to berry and hanks. ( , 5) q CKact 3 , since in CKact, we can map z to a named actor or the two elements standing in for the lead actors. ( , [5, + ]) / [q3]Kact, since the lead actors could possibly be the same or one of the named actors. The latter two points show that the canonical model does not yield the minimal number of matches.

1The notion of homomorphism of a CCQ is deﬁned in the same way as for CQs, simply treating variables from Vc like those in V.

Data Combined

DL-Litecore co NP-c Πp 2-h, PP-h & in co NEXP

DL-Lite R co NP-c co NEXP-h & in co N2EXP co NEXP-c (T of ﬁnite depth)

Table 1: Data and combined complexity of CCQ answering

4 General Counting CQs We shall consider the following CCQ answering decision problem: given a KB K = (T , A), CCQ q, and candidate answer (a, [m, M]), decide whether (a, [m, M]) [q]K. As ontology languages, we will consider DL-Lite R (which underlies OWL 2 QL) and its sublogic DL-Litecore. We know from [Kostylev and Reutter, 2015] that in these DLs, the least upper bound M can take one of three values (0, 1, or + ) and is easily computed. The argument2 transfers to our more general notion of CCQ. We can therefore restrict our attention to identifying certain answers of the form (a, [m, + ]). We will consider the two usual complexity measures: combined complexity which is in terms of the size of the whole input (T , A, q, a, m), and data complexity which is only in terms of the size of A and m (T and q are treated as ﬁxed). We will assume that m is given in binary.

4.1 General Case Table 1 displays complexity results for answering general CCQs over DL-Litecore and DL-Lite R TBoxes (we use -c and -h as abbreviations for -complete and -hard ). With the exception of the PP-hardness result (discussed in Section 6.1), the lower bounds are inherited from [Kostylev and Reutter, 2015]. We will thus concentrate on the upper bounds from Table 1, which are obtained by generalizing and clarifying the constructions of Kostylev and Reutter. We give an overview of the proof both to give the ﬂavor of the techniques involved and to enable us to discuss the necessary adaptations used to prove later results. The proof constructs a decision procedure for the complementary problem of deciding whether (a, [m, + ]) [q]K. The latter holds iff there exists a countermodel, i.e., a model of K with fewer than m c-matches of q(a). The main ingredient of the proof is the following theorem, which shows that it is sufﬁcient to consider countermodels of bounded size. Theorem 1. For every DL-Lite R (resp. DL-Litecore) KB K = (T , A) and CCQ q, if there is a model of K with fewer than m c-matches of q(a), then there exists one of size3

O(|A||T ||q|+1) (resp. O(|A||q|)). With Theorem 1 in hand, we can easily deﬁne nondeterministic procedures that witness the complexity upper bounds from Table 1: simply guess an interpretation of polynomial / exponential / double-exponential size (depending on the case) and verify whether it is a countermodel. The proof of Theorem 1 starts with an arbitrary countermodel I and modiﬁes it in order to make it smaller, being

2Brieﬂy, the upper bound is 0 if the tuple is not a certain answer; otherwise, it is either 1 if z = , else + . 3As usual, |T | (resp. |A|, |q|) denotes the size of T (resp. A, q).

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

careful not to introduce any new c-matches of q(a). We ﬁrst identify a relevant subset of the domain of I, consisting of the interpretations of all individual names from A and the images of all c-matches of q(a). We then deﬁne a new interpretation that intuitively preserves and replaces the rest of I with parts of the canonical model, to introduce a more regular structure. Formally, we ﬁx a homomorphism f of CK into I (see Lemma 1) and consider the following mapping f : CK CK from [Kostylev and Reutter, 2015]:

f (d) = f(d) if f(d) d otherwise

We deﬁne the interleaving4 I of I as the image of CK by f , i.e., with domain f ( CK) and interpretation function f CK. It is not difﬁcult to prove that the interleaving I is a model of K. Moreover, by exhibiting a homomorphism ρ from I to I, we can translate matches of I into matches in I. As the images of c-matches of q(a) are contained in , which is left unchanged in I , the homomorphism ρ is in fact a one-toone mapping of c-matches of q(a) in I to those in I. This shows that I is also a countermodel. The interleaving I may be arbitrarily large, even inﬁnite. To reduce its size, an equivalence relation is introduced, and elements from I \ that belong to the same equivalence class are merged (elements from are retained). In the case of DL-Lite R, there can be double-exponentially many equivalence classes, as elements are grouped based upon the properties of their |q|-neighborhoods, while for DL-Litecore, we can use a more reﬁned relation with only exponentially many classes. This means that the resulting models are either of singleor double-exponential size w.r.t. combined complexity, depending on the chosen DL. A crucial ﬁnal step is to show that the merging of elements does not introduce any new c-matches of q(a), so the resulting model is still a countermodel. This part of the argument, only sketched in [Kostylev and Reutter, 2015], requires a detailed and technical analysis of the construction to ensure that this property holds for our more general class of CCQs. We show that this is indeed the case, which answers a question left open by Kostylev and Reutter about counting CQs with both existential variables and multiple counting variables.

4.2 Case of Finite-Depth TBoxes We give an improved upper bound for ﬁnite-depth TBoxes (which arguably cover many practical ontologies [Grau et al., 2013]), pinpointing the exact combined complexity. Theorem 2. For ﬁnite-depth DL-Lite R TBoxes, CCQ answering is co NEXP-complete w.r.t. combined complexity.

Proof sketch. Fix a KB K = (T , A). If T has ﬁnite depth, then CK contains at most |Ind(A)| |T ||T | elements, which implies that, for every model I of K, the interleaving of I is ﬁnite and of single exponential size in |K|. Since the interleaving of a countermodel is itself a countermodel, this shows that the smallest countermodel is of single-exponential size, from which derives the improved co NEXP upper bound.

4We have slightly modiﬁed the deﬁnition of interleaving to correct a small bug in the deﬁnition from [Kostylev and Reutter, 2015].

We note that the proofs of the co NP and Πp 2 lower bounds listed in Table 1 already use ﬁnite-depth TBoxes.

5 Rooted Counting CQs

We next explore whether structural restrictions on CCQs allow us to obtain lower complexity. As the lower bounds from [Kostylev and Reutter, 2015] use disconnected counting variables, a natural idea is to consider the subclass of rooted queries that were introduced in [Bienvenu et al., 2012] and are believed to capture a large portion of real-world CQs. Rooted CCQs can be deﬁned analogously to rooted CQs. The deﬁnition utilizes the notion of a Gaifman graph of a CCQ, whose vertices are the query terms, and which has an undirected edge {t1, t2} iff t1, t2 co-occur in a role atom.

Deﬁnition 4. A CCQ q(x) := y zψ(x, y, z) is rooted if every connected component of the Gaifman graph of q contains at least one answer variable or individual name.

Example queries q2 and q3 are rooted, while q1 is not. Rootedness has been shown to lower the complexity of reasoning in several settings. Most relevant to us is a recent result by Nikolaou et al. (2019) that rooted CQ answering under bag semantics5 has tractable data complexity in DL-Litecore, and furthermore, the same holds for rooted versions of the Count()-queries of Kostylev and Reutter under suitable restrictions on the TBox. These techniques can be adapted to show tractability for arbitrary DL-Litecore TBoxes:

Theorem 3. (Implicit in [Nikolaou et al., 2019; Cima et al., 2019]) In DL-Litecore, exhaustive rooted CCQ answering is TC0-complete6 w.r.t. data complexity.

Proof sketch. Nikolaou et al. prove that answering rooted CQs under bag semantics can be done via a rewriting to BCALC, whose evaluation problem is known to be in TC0

due to [Libkin, 2001], see [Cima et al., 2019] for discussion. Moreover, they further show that for a syntactically restricted class of DL-Litecore TBoxes, it is possible to reduce exhaustive rooted CCQ answering to rooted CQ answering under bag semantics. To obtain TC0 membership for unrestricted TBoxes, the BCALC rewriting can be adapted to set-based rather than bag interpretations. In the long version, we provide an alternative self-contained proof which directly constructs a family of TC0 circuits. A matching lower bound has not been stated, but can be shown by a simple reduction (using an empty TBox) from the TC0-complete problem that asks, given a binary string s and number k, whether the number of 1-bits in s exceeds k [Aehlig et al., 2007].

The preceding result naturally leads us to ask whether rootedness also bring beneﬁts for general CCQs. Unfortunately, we show that restricting to rooted CCQs (without exhaustiveness) does not allow us to escape existing hardness results:

5Bag semantics, which underly practical database systems, interprets relations using multisets rather than sets [Albert, 1991]. 6We recall that TC0 is a circuit complexity class deﬁned similarly to AC0 but additionally allowing threshhold gates. It is known that AC0 TC0 NC1 Log Space PTime.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

Theorem 4. In DL-Litecore, rooted CCQ answering is co NPcomplete w.r.t. data complexity.

Proof sketch. The proof borrows some ideas from the proofs of Lemmas 12 and 16 from [Kostylev and Reutter, 2015]. It proceeds by reduction from the well-known co NP-complete 3COL problem: given an undirected graph G = (V, E), return yes iff G has no 3-coloring, i.e., a mapping from V to {red, green, blue} such that adjacent vertices map to different colors (equivalently: there is no monochromatic edge). The reduction uses atomic roles Edge and Vertex to encode the graph and Has Col to assign colors. The TBox Tcol has a single axiom: Vertex Has Col. The ABox AG contains an individual v for each vertex v V and an assertion Edge(u, v) for each edge {u, v} E. All vertices are connected to a special root individual a: Vertex(a, u), for each u V. The three colors are represented by individuals r, g and b. To ensure that the query has matches in every model, we include a dummy vertex individual av and the following assertions: Vertex(a, av), Edge(av, av), Has Col(av, r), Has Col(av, g), and Has Col(av, b). The query q is the conjunction of the two subqueries:

qedge = yc z1 z2 Vertex(a, z1) Vertex(a, z2) Edge(z1, z2) Has Col(z1, yc) Has Col(z2, yc)

qcol = y z Vertex(a, y) Has Col(y, z)

serving respectively to detect monochromatic edges and to check whether any additional colors have been introduced. By construction, there are at least 3 c-matches for q( ) in any model of the KB Kcol = (Tcol, AG). Moreover, it can be veriﬁed that ( , [4, + ]) is a certain answer to q w.r.t. Kcol iff G is not 3-colorable.

Theorem 5. In DL-Lite R, rooted CCQ answering is co NEXP-hard w.r.t. combined complexity.

Proof sketch. The proof adapts a reduction from the exponential grid tiling problem (Lemma 18 from [Kostylev and Reutter, 2015]), the key difference being the use of existential query variables to access (and count) the colors and bits.

6 Exhaustive Rooted Counting CQs

We have seen in Section 5 that the rootedness restriction is not by itself sufﬁcient to lower the complexity of CCQ answering, whereas imposing both rootedness and exhaustiveness can sometimes yield better results. This motivates us to take a closer look at the case of exhaustive rooted CCQs. The emerging complexity landscape is summarized in Table 2. Note that exhaustive CCQs constitute a very natural form of counting query, which ask for the number of different query matches for a given answer tuple. The query q2 from Example 2 is an exhaustive rooted CCQ.

6.1 Exhaustive Rooted CCQs in DL-Litecore We ﬁrst consider DL-Litecore KBs and pinpoint the precise combined complexity, which had not yet been considered. An essential ingredient is the following result that shows that it is possible to focus on query matches in the canonical

Data Combined

DL-Litecore TC0-c PP-c DL-Lite R co NP-c Πp 2-h, PP-h & in co NEXP

Table 2: Complexity results for exhaustive rooted CCQs

model. It can be obtained by adapting a similar result about canonical bag interpretations [Nikolaou et al., 2019].

Theorem 6. For every DL-Litecore KB K and exhaustive rooted CCQ q, it holds that [q]K = [q]CK.

Proof sketch. Exploiting the structure of DL-Litecore canonical models, one can show that if σ1, σ2 are distinct matches of an exhaustive rooted CCQ q in CK, then there exists a variable v such that σ1(v) = σ2(v) and σ1(v), σ2(v) Ind(A). It follows that if we take an arbitrary model I of K, and let f be a homomorphism of CK into I, then f injectively maps query matches in CK to query matches in I.

We will also use the next lemma, implicit in [Bienvenu et al., 2013], constraining the possible images of matches in CK:

Lemma 2. For every DL-Litecore TBox T and CCQ q, we can construct in polynomial time a set of words Γq,T such that for every KB K = (T , A), match σ of q in CK, and variable v of q: σ(v) = aw for some a Ind(A) and w Γq,T .

We are now ready to show that the problem is PP-complete in combined complexity, and hence in PSpace.

Theorem 7. In DL-Litecore, exhaustive rooted CCQ answering is PP-complete w.r.t. combined complexity.

Proof sketch. The class PP contains all decision problems for which there exists a non-deterministic Turing machine (TM) such that, when the input is a yes instance, then at least half of the computation paths accept, while on no instances, less than half of the computation paths accept. The lower bound is obtained by a reduction from the following PP-complete problem [Bailey et al., 2007]: given a propositional formula ψ in CNF and number n, decide whether ψ has at least n satisfying assignments. We sketch the TM used to show PP membership, which takes as input a DL-Litecore KB K = (T , A), an exhaustive rooted CCQ q(x), and candidate answer (a, [m, + ]):

Phase 1. The TM constructs the set Γq,T from Lemma 2.

Phase 2. The TM guesses a mapping σ of the variables in q to elements from {aw | a Ind(A), w Γq,T }. It then compares m with the number C = |Γq,T ||q| of possible mappings and proceeds accordingly:

2 + 1, the TM guesses an integer i with 0 i 2m 3 and accepts iff σ is a c-match of q(a) and i < C; if m < C

2 + 1, the TM guesses an integer i with 0 i 2C 2m + 1 and accepts iff σ is c-match for q(a) or i < C 2m + 2.

The guessed integer and comparisons ensure a suitable number of accepting paths. It can be veriﬁed that at least half of the paths are accepting iff (a, [m, + ]) [q]CK.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

6.2 Exhaustive Rooted CCQs in DL-Lite R We now turn to DL-Lite R KBs. Our ﬁrst result is negative: exhaustive rooted CCQs do not enjoy lower data complexity. This is shown by another reduction from 3COL which involves ideas from our proof of Theorem 4 and the proof of Lemma 16 from [Kostylev and Reutter, 2015]. Theorem 8. In DL-Lite R, exhaustive rooted CCQ answering is co NP-complete w.r.t. data complexity. More positively, we can show an improved co NEXP upper bound in combined complexity for exhaustive rooted CCQs. We brieﬂy sketch the proof, which involves highly non-trivial modiﬁcations to the argument used for general CCQs. We ﬁrst introduce a more reﬁned notion of interleaving, which replaces the mapping f by the following mapping f :

f (a) = f(a)

f (ωR) = f(ωR) if f (ω), f(ωR) f (ω)R otherwise

It is possible to prove that when q is an exhaustive rooted CCQ, this modiﬁed interleaving yields a countermodel. Moreover, it has a very particular structure, essentially corresponding to the canonical model of the restriction of f(CK) to (viewed as an ABox). Importantly, this means that instead of guessing a whole countermodel, it sufﬁces to guess an initial, exponential-size portion (the |q|-neighborhood of ), providing the basis for a co NEXP decision procedure. Theorem 9. In DL-Lite R, exhaustive rooted CCQ answering is in co NEXP w.r.t. combined complexity.

7 Best Certain Answers The deﬁnition of certain answers implies that if (a, [m, M]) [q]K, then we also have (a, [m , M ]) [q]K for every m m and M M. It is naturally of interest to focus on certain answers providing the best bounds, i.e., those of the form (a, [min I|=K q I a, max I|=K q I a]). In this section, we show that the problem of identifying the best lower bound (min I|=K q I a) is DP-complete in data complexity. It is easily seen that checking whether m is such an optimal bound can be done in DP, by making a call to a co NP oracle (is (a, [m, + ]) [q]K?) and an NP oracle (is (a, [m+1, + ]) / [q]K?). The DP-hardness of this problem was left as an open question by Kostylev and Reutter. Theorem 10. The following problem is DP-hard in data complexity: given a DL-Litecore KB K = (T , A), rooted CCQ q, tuple a, and number m, decide whether m = min I|=K q I a.

Proof sketch. We give a reduction from the following problem (DP-complete due to [Garey et al., 1976]): given planar graphs G1 and G2, decide if G1 3COL and G2 / 3COL. Let the TBox Tcol and ABoxes AG1, AG2 be deﬁned as in the proof of Theorem 4. Rename the individuals to ensure Ind(AG1) Ind(AG2) = , then set K = (Tcol, AG1 AG2). Let qcolor 1 and qedge 1 (resp. qcolor 2 and qedge 2 ) be deﬁned as before, but using disjoint variables and the root individual from the AG1 (resp. AG2). The challenge is to make sure that we can determine the 3-colorability status of the two graphs solely by looking at the number of c-matches of the query. To

be able to distinguish G1 from G2, we introduce an asymmetry by duplicating the color counter query for G1, i.e., create a copy qcolor 0 of qcolor 1 that uses fresh variables but the same root individual. We then take the query

q() := qcolor 0 qcolor 1 qedge 1 qcolor 2 qedge 2 .

We claim (a , [36, + ]) [q]K iff G1 3COL and G2 3COL. This is proven by a case analysis, summarized here:

G1 3COL G1 / 3COL

G2 3COL 27 (= 3 3 3) 48 (= 4 4 3) G2 / 3COL 36 (= 3 3 4) 64 (= 4 4 4)

Each of the four cells displays the least value of m such that (a , [m, + ]) [q]K, under different assumptions on the 3colorability of G1 and G2. To establish these values, one must ﬁrst prove that every model has at least this many c-matches, and then exhibit a model that realizes the exact number. For the latter, we utilize our assumption that the graphs are planar, hence 4-colorable [Gonthier, 2008], which we use to show that the minimal number of c-matches is realized in a model that encodes proper 3or 4-colorings of the graphs.

The preceding reduction can be adapted to show DPhardness also for the two kinds of CCQs from [Kostylev and Reutter, 2015], but without the rootedness restriction.

8 Conclusion & Future Work We have revisited the issue of counting queries in OMQA and advanced our understanding of the complexity landscape, both by extending existing results to a more general notion of counting CQ and by exploring when structural restrictions on the ontology and query can lead to improved complexity. There are several natural avenues for future study. A ﬁrst challenging problem is to provide a full classiﬁcation of the data complexity of ontology-mediated queries (i.e. queryontology pairs), in order to identify further tractable cases. It would also be relevant to extend the complexity study to DLs with functional roles or quantiﬁed number restrictions, which would allow for non-trivial upper bounds on the number of matches. Tackling general CCQs for such DLs will likely require wholly different techniques from the model manipulations used in Section 4. However, a recent result by Cima et al. (2019) shows that the canonical model property (Theorem 6) holds also for DL-Lite F (which extends DL-Litecore with functional roles), and hence both TC0 data complexity (Theorem 3) and our PP-completeness result (Theorem 7) for exhaustive rooted CCQs transfer to DL-Lite F. Much remains to be explored for queries involving other kinds of aggregate functions (min, max, sum, average), which manipulate data values. Recent studies of bag semantics for OMQA [Nikolaou et al., 2019; Cima et al., 2019] and databases with incomplete information [Hernich and Kolaitis, 2017; Console et al., 2017] provide important formal foundations for supporting such queries.

Acknowledgements This work was partially supported by ANR project CQFD (ANR-18-CE23-0003).

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)

[Aehlig et al., 2007] Klaus Aehlig, Stephen Cook, and Phuong Nguyen. Relativizing Small Complexity Classes and Their Theories. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.

[Albert, 1991] Joseph Albert. Algebraic properties of bag data types. In Proceedings of the 17th International Conference on Very Large Data Bases (VLDB), pages 211 219, 1991.

[Baader et al., 2017] Franz Baader, Ian Horrocks, Carsten Lutz, and Ulrike Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.

[Bailey et al., 2007] Delbert D. Bailey, V ıctor Dalmau, and Phokion G. Kolaitis. Phase transitions of PP-complete satisﬁability problems. Discrete Applied Mathematics, 155(12):1627 1639, 2007.

[Bienvenu and Ortiz, 2015] Meghyn Bienvenu and Magdalena Ortiz. Ontology-mediated query answering with data-tractable description logics. In Tutorial Lectures of the 11th Reasoning Web International Summer School, pages 218 307, 2015.

[Bienvenu et al., 2012] Meghyn Bienvenu, Carsten Lutz, and Frank Wolter. Query containment in description logics reconsidered. In Proceedings of the 13th International Conference on the Principles of Knowledge Representation and Reasoning (KR), 2012.

[Bienvenu et al., 2013] Meghyn Bienvenu, Magdalena Ortiz, Mantas Simkus, and Guohui Xiao. Tractable queries for lightweight description logics. In Proceedings of the 23rd International Joint Conference on Artiﬁcial Intelligence (IJCAI), pages 768 774, 2013.

[Bienvenu et al., 2015a] Meghyn Bienvenu, Stanislav Kikot, and Vladimir V. Podolskii. Tree-like queries in OWL 2 QL: Succinctness and complexity results. In Proceedings of the 30th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 317 328, 2015.

[Bienvenu et al., 2015b] Meghyn Bienvenu, Magdalena Ortiz, and Mantas Simkus. Regular path queries in lightweight description logics: Complexity and algorithms. Journal of Artiﬁcial Intelligence Research (JAIR), 53:315 374, 2015.

[Calvanese et al., 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efﬁcient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning (JAR), 39(3):385 429, 2007.

[Calvanese et al., 2008] Diego Calvanese, Evgeny Kharlamov, Werner Nutt, and Camilo Thorne. Aggregate queries over ontologies. In Proceedings of the 2nd International Workshop on Ontologies and Information Systems for the Semantic Web (ONISW), pages 97 104, 2008.

[Cima et al., 2019] Gianluca Cima, Charalampos Nikolaou, Egor V. Kostylev, Mark Kaminski, Bernardo Cuenca Grau,

and Ian Horrocks. Bag semantics of dl-lite with functionality axioms. In Proceedings of the 18th International Semantic Web Conference (ISWC), pages 128 144, 2019. [Console et al., 2017] Marco Console, Paolo Guagliardo, and Leonid Libkin. On querying incomplete information in databases under bag semantics. In Carles Sierra, editor, Proceedings of the 26th International Joint Conference on Artiﬁcial Intelligence (IJCAI), pages 993 999, 2017. [Garey et al., 1976] M.R. Garey, D.S. Johnson, and L. Stockmeyer. Some simpliﬁed NP-complete graph problems. Theoretical Computer Science, 1(3):237 267, 1976. [Gonthier, 2008] Georges Gonthier. Formal proof The four-color theorem. Notices of the American Mathematical Society, 55(11):1382 1393, 2008. [Grau et al., 2013] Bernardo Cuenca Grau, Ian Horrocks, Markus Kr otzsch, Clemens Kupke, Despoina Magka, Boris Motik, and Zhe Wang. Acyclicity notions for existential rules and their application to query answering in ontologies. Journal of Artiﬁcial Intelligence Research (JAIR), 47:741 808, 2013. [Guti errez-Basulto et al., 2015] V ıctor Guti errez-Basulto, Yazmin Ang elica Ib a nez-Garc ıa, Roman Kontchakov, and Egor V. Kostylev. Queries with negation and inequalities over lightweight ontologies. Journal of Web Semantics (JWS), 35:184 202, 2015. [Hernich and Kolaitis, 2017] Andr e Hernich and Phokion G. Kolaitis. Foundations of information integration under bag semantics. In Proceedings of the 32nd Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1 12, 2017. [Kostylev and Reutter, 2015] Egor V. Kostylev and Juan L. Reutter. Complexity of answering counting aggregate queries over DL-Lite. Journal of Web Semantics (JWS), 33(1):94 111, 2015. [Libkin, 2001] Leonid Libkin. Expressive power of SQL. In Proceedings of the 8th International Conference on Database Theory (ICDT), pages 1 21, 2001. [Nikolaou et al., 2019] Charalampos Nikolaou, Egor V. Kostylev, George Konstantinidis, Mark Kaminski, Bernardo Cuenca Grau, and Ian Horrocks. Foundations of ontology-based data access under bag semantics. Artiﬁcial Intelligence (AIJ), 274:91 132, 2019. [Poggi et al., 2008] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. Journal of Data Semantics, 10:133 173, 2008. [Xiao et al., 2018] Guohui Xiao, Diego Calvanese, Roman Kontchakov, Domenico Lembo, Antonella Poggi, Riccardo Rosati, and Michael Zakharyaschev. Ontologybased data access: A survey. In Proceedings of the 27th International Joint Conference on Artiﬁcial Intelligence (IJCAI), pages 5511 5519, 2018.

Proceedings of the Twenty-Ninth International Joint Conference on Artiﬁcial Intelligence (IJCAI-20)