# ontologymediated_queries_distributing_over_components__772f9115.pdf

Ontology-Mediated Queries Distributing Over Components

Gerald Berger, Andreas Pieris Institute of Information Systems, TU Wien, Austria

{gberger,pieris}@dbai.tuwien.ac.at

Ontology-based data access is concerned with the problem of querying incomplete data sources in the presence of an ontology. A key notion in this setting is that of ontology-mediated query, which is a database query coupled with an ontology. An interesting issue is whether the answer to an ontologymediated query can be computed by parallelizing it over the connected components of the database, i.e., whether the query distributes over components. This allows us to evaluate the query in a distributed and coordination-free manner. We investigate distribution over components for classes of ontologymediated queries where the database query is a conjunctive query and the ontology is formulated using existential rules. For each such class, we syntactically characterize its fragment that distributes over components, and we study the problem of deciding whether a query distributes over components.

1 Introduction

Ontology-based data access (OBDA) has recently emerged as a promising application of knowledge representation and reasoning technologies in information management systems. The aim of OBDA is to facilitate access to data that is significantly heterogeneous and incomplete. This is achieved via an ontology that provides a uniﬁed conceptual view of the data, and makes it accessible via database queries formulated solely in the vocabulary of the ontology. The actual database query and the ontology can be seen as two components of one composite query, called ontology-mediated query [Bienvenu et al., 2014]. OBDA can then be realized as the problem of answering ontology-mediated queries.

A challenging and important topic for declarative database query languages, and in particular (extensions of) Datalog, is coordination-free evaluation. This has its roots in declarative networking [Loo et al., 2009], an approach where distributed computations are programmed using Datalog-based formalisms. In this setting, programs (or queries) are speciﬁed over a global schema, and are executed by multiple computing nodes over which the database is distributed. These nodes can perform local computations, and also communicate

asynchronously with each other via messages. The model assumes that messages can never be lost but can be arbitrarily delayed. An intrinsic source of inefﬁciency in such systems are the global barriers raised by the need for synchronization in computing the result of queries. This has motivated a fruitful line of research for isolating classes of queries that can be evaluated in a coordination-free manner [Zinn et al., 2012; Ameloot et al., 2013; 2014; 2015; Alvaro et al., 2014].

It is quite natural to ask wether the results on coordinationfree evaluation for declarative database query languages can be transferred to ontology-mediated queries. Undoubtedly, a positive answer to this question, apart from possible applications to declarative networking, will signiﬁcantly contribute towards more efﬁcient procedures for OBDA, since it will enable distributed ontology-mediated query evaluation in a coordination-free way. The goal of the present work is to initiate the study of coordination-free evaluation for ontologymediated queries, and give strong indications that the answer to the above challenging question is afﬁrmative. Distribution Over Components. More concretely, we focus on the question whether the answer to an ontology-mediated query can be computed by parallelizing it over the (maximally connected) components of the database, i.e., whether the query distributes over components. In other words, given an ontology-mediated query Q, the question is whether for every database D the answer to Q over D, denoted Q(D), coincides with S

1 i n Q(Di), where D1, . . . , Dn are the components of D. In this case, Q(D) can be computed without any communication over a network using a distribution where every computing node is assigned some of the connected components of the database, and every component is assigned to at least one computing node.

The notion of distribution over components has been introduced in [Ameloot et al., 2014], and explicitly considered for Datalog queries in [Ameloot et al., 2015]. It has been shown that connected Datalog, that is, the fragment of Datalog where all rule-bodies are connected, provides an effective syntax for Datalog queries that distribute over components, while the problem of deciding whether a Datalog query distributes over components is undecidable. Aims and Objectives. As said, this work is concerned about ontology-mediated queries and their distribution over components. We focus on ontology-mediated queries where the database query is a conjunctive query and the ontology is for-

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

mulated using existential rules (a.k.a. tuple-generating dependencies or Datalog rules). It is widely accepted that existential rules form a natural and convenient way for modeling ontologies. This has led to a ﬂurry of activity for designing ontology languages based on existential rules; see e.g., [Baget et al., 2011; Cal ı et al., 2012a; 2012b; Thomazo et al., 2012; Leone et al., 2012]. We consider several ontology-mediated query languages, where the ontology is modeled using either arbitrary existential rules (without syntactic restrictions), or one of the basic decidable classes of existential rules, namely guarded, linear, or sticky (more details are given is Section 2), and we perform similar analysis as the one in [Ameloot et al., 2015] for Datalog queries. In particular, we would like to understand whether connectedness is the right notion for characterizing the fragment of the query languages in question that distributes over components, and understand the complexity of checking distribution over components. Our Results. Our results can be summarized as follows:

I We show that for all the ontology-mediated query languages Q in question, the fragment of Q that distributes over components has the same expressive power as the fragment of Q where both the existential rules and the conjunctive query are connected (Theorem 2). For the language where the ontology is expressed using arbitrary existential rules, similar ideas as the ones in [Ameloot et al., 2015] for Datalog queries can be applied. However, the situation changes once we focus on the languages based on guarded, linear, or sticky existential rules, since these languages are not powerful enough to express transitivity axioms, which can be used to compute the connected components of the input database.

I The problem of deciding distribution over components is undecidable for all the ontology-mediated query languages that capture Datalog; implicit in [Ameloot et al., 2015]. Interestingly, we show that for the languages based on guarded, linear, or sticky existential rules, the problem is decidable (Theorems 15, 16 and 18).1 Furthermore, for the languages based on linear and sticky rules, we obtain precise complexity results that range from P

2 up to co NEXPTIME. I It turned out that our techniques and results have interesting consequences to ontology-mediated query languages based on Description Logics (Corollary 8, and Theorem 19), and to central database query languages, i.e., (unions of) conjunctive queries, and non-recursive Datalog queries (Corollary 11, and Theorems 20 and 21).

2 Preliminaries Instances and Queries. Let C, N, and V be pairwise disjoint countably inﬁnite sets of constants, (labeled) nulls and (regular) variables (used in queries and dependencies), respectively. A schema S is a ﬁnite set of relation symbols (or predicates) with associated arity. We write R/n to denote that R has arity n. A term is either a constant, null or variable. An atom over S is an expression R( t), where R is a relation symbol in S of arity n > 0 and t is an n-tuple of terms. A fact is an atom whose arguments consist only of constants. An

1For guarded-based queries, this holds under the assumption that the conjunctive query is answer-guarded, i.e., it has an atom that contains all the answer variables.

instance over S is a (possibly inﬁnite) set of atoms over S that contain constants and nulls, while a database over S is a ﬁnite set of facts over S. The active domain of an instance I, denoted adom(I), is the set of all terms occurring in I.

A query over S is a mapping q that maps every database D over S to a set of answers q(D) adom(D)n, where n 0 is the arity of q. If n = 0, then q is a Boolean query, and we write q(D) = 1 if () 2 q(D), and q(D) = 0 otherwise.

The usual way of specifying queries is by means of (fragments of) ﬁrst-order logic. A conjunctive query (CQ) q over S is a conjunction of atoms of the form 9 y φ( x, y), where x [ y are variables of V, that uses only predicates from S. The free variables of a CQ are called answer variables. The evaluation of CQs over instances is deﬁned in terms of homomorphisms; we assume the reader is familiar with the standard notion of homomorphism. Let I be an instance over S. The evaluation of q over I, denoted q(I), is the set of all tuples h( x) of constants such that h is a homomorphism from q to I. Each schema S and CQ q = 9 y φ(x1, . . . , xn, y) give rise to the nary query qφ,S deﬁned by setting, for every database D over S, qφ,S(D) = { c 2 adom(D)n | c 2 q(D)}. Let CQ be the class of all queries deﬁnable by some CQ. Analogously, we deﬁne UCQ as the class of all queries deﬁnable by some union of conjunctive queries (UCQ), that is, a disjunction of CQs with the same answer variables.

Two queries q1 and q2 over S are equivalent, written q1 q2, if, for every database D over S, q1(D) = q2(D). A query language Q2 is at least as expressive as query language Q1, written Q1 Q2, if, for every q1 2 Q1 over schema S, there is q2 2 Q2 over S such that q1 q2. Q1 and Q2 have the same expressive power, written Q1 = Q2, if Q1 Q2 Q1.

Tgds for Specifying Ontologies. An ontology language is a fragment of ﬁrst-order logic. Here, we focus on languages that are based on tuple-generating dependencies. A tuplegenerating dependency (tgd) is a ﬁrst-order sentence of the form 8 x8 y

φ( x, y) ! 9 z ( x, z)

, where both φ and are conjunctions of atoms without nulls and constants. For simplicity, we write this tgd as φ( x, y) ! 9 z ( x, z), and use comma instead of for conjoining atoms. We call φ and the body and head of the tgd, respectively, and write sch( ) for the set of predicates occurring in . An instance I satisﬁes the above tgd if the following holds: For every homomorphism h such that h(φ( x, y)) I, there is a homomorphism h0 that extends h such that h0( ( x, z)) I. I satisﬁes a set of tgds, denoted I |= , if I satisﬁes every tgd in . Let TGD be the class of all (ﬁnite) sets of tgds.

Ontology-Mediated Queries. An ontology-mediated query over S is a triple (S, , q), where 2 TGD, q 2 CQ, and q is over S [ sch( ). We call S the data schema. In general, ontology-mediated queries are deﬁned for arbitrary ontology languages (not only TGD), and arbitrary query languages (not only CQ). In this work, we focus on tgd-based ontology languages and queries deﬁnable via CQs. S is explicitly mentioned in the speciﬁcation of the ontology-mediated query to highlight that it is interpreted as a query over S. The semantics of an ontology-mediated query is given in terms of certain answers. Let (S, , q) be an ontology-mediated query, where n is the arity of q. The certain answers to q w.r.t. a

database D over S and is the set of tuples certq, (D) = { c 2 adom(D)n | c 2 q(I), 8I s.t. I D and I |= }. Recall that certq, (D) coincides with the evaluation of q over the canonical instance of D and that can be constructed by the chase; see e.g., [Cal ı et al., 2012b]. The chase adds new atoms to D as dictated by until the ﬁnal result, written chase(D, ), satisﬁes . When an existentially quantiﬁed variable must be satisﬁed the chase invents a new null. Ontology-Mediated Query Languages. Every query Q = (S, , q) can be semantically interpreted as a query q Q over S by setting q Q(D) = certq, (D), for every S-database D.2 Thus, we obtain a new query language, denoted (TGD, CQ), deﬁned as the class of queries q Q, where Q is an ontologymediated query. However, (TGD, CQ) is undecidable since, given a database D over S, 2 TGD, an n-ary query q 2 CQ over S[sch( ), and a tuple c 2 Cn, the problem of deciding if c 2 certq, (D) is undecidable [Beeri and Vardi, 1981]. This has led to a ﬂurry of activity for identifying decidable syntactic restrictions. Such a restriction deﬁnes a subclass C of tgds, i.e., C TGD, which in turn gives rise to the decidable query language (C, CQ). The two main paradigms for ontological reasoning are guardedness and stickiness: Guardedness: A tgd is guarded if its body contains an atom, called guard, that contains all the body-variables [Cal ı et al., 2013]. Let G be the class of sets of guarded tgds. A key subclass of guarded tgds is the class of linear tgds, that is, tgds whose body consists of a single atom [Cal ı et al., 2012a]. Let L be the class of sets of linear tgds. Stickiness: The goal of stickiness is to express joins that are not expressible via guarded tgds. The key property underlying this condition is as follows: During the chase, terms that are associated with variables that appear more than once in the body of a tgd (i.e., join variables) are always propagated (or stick ) to the inferred atoms. Due to lack of space we omit the deﬁnition that can be found in [Cal ı et al., 2012b]. Let S be the class of sticky sets of tgds. Weak Versions: Both G and S have a weak version: Weaklyguarded [Cal ı et al., 2013] and weakly-sticky [Cal ı et al., 2012b], respectively. These are highly expressive classes obtained by relaxing guardedness and stickiness so that only those positions that receive nulls during the chase are taken into account. We write WG and WS for the class of weaklyguarded and weakly-sticky sets of tgds, respectively.

We refer to the query languages deﬁned above as ontologymediated query languages.

3 Distribution of Queries and Connectedness Our goal is to syntactically characterize the expressive power of the fragment of the query languages in question that guarantees distribution over components. Let us recall the notion of distribution over components [Ameloot et al., 2014]. A set of atoms A is connected if for all a, b 2 adom(A), there exists a sequence 1, . . . , n of atoms in A such that a 2 adom( 1), b 2 adom( n), and adom( i) \ adom( i+1) 6= ?, for each i 2 [n 1].3 We call B A a component of A if

2For brevity, we use Q both for (S, , q) and the mapping q Q. 3Henceforth, we write [k] for {1, . . . , k}.

(i) B is connected, and (ii) for every 2 A \ B, B [ { } is not connected.4 Let co(A) be the set of components of A.

Deﬁnition 1 A query q over S distributes over components if q(D) = S

D02co(D) q(D0), for every database D over S.

Roughly, the centralized answer to q w.r.t. D is precisely obtained when we parallelize q over the components of D. Let DIST be the class of queries that distribute over components. We proceed to characterize the expressive power of the fragment of (C, CQ), where C is one of the classes of tgds introduced above. This is done via connectedness. A tgd is connected if the set of atoms occurring in its body is connected, while a set of tgds is called connected if every tgd in is connected. Given a class C of tgds, we write con C for the class of all (ﬁnite) sets of tgds that are connected and fall in C. Similarly, we write con CQ for the class of all queries deﬁnable by some CQ that is connected.

The main result of this section states that every query that is expressible by one of the ontology-mediated query languages Q in question and distributes over components is equivalent to a connected query that falls in Q, and vice versa. Formally:

Theorem 2 For C 2 {TGD, WG, G, L, WS, S}, (C, CQ) \ DIST = (con C, con CQ).

Notice that guarded and linear tgds are, by deﬁnition, connected. This implies that, for C 2 {G, L}, the above theorem can be equivalently stated as (C, CQ)\DIST = (C, con CQ). We present Theorem 2 in this way for the sake of uniformity. Let us now discuss the key ideas underlying Theorem 2.

For the ( ) direction we show, no matter which ontologymediated query language we consider, that connectedness ensures distribution over components. This is a consequence of the fact that, given a query Q = (S, , q) 2 (con C, con CQ), for every database D over S with co(D) = {D1, . . . , Dm}, chase(D, ) can be partitioned into {I1, . . . , Im} such that, for each i 2 [m], the atoms of Ii depend only on Di. Thus:

Proposition 3 For each C TGD, (con C, con CQ) (C, CQ) \ DIST.

We proceed with the ( ) direction. It turned out that there is no single reduction (at least no obvious one) from queries that distribute over components to connected queries that is generic enough to deal with all the query languages in question. Nevertheless, our languages can be classiﬁed into three groups according to certain properties that are useful for our purposes, and then treat each group separately. These groups and the underlying properties are as follows:

(A) (TGD, CQ), (WG, CQ) and (WS, CQ); these languages

are powerful enough to express (a limited form of) transitivity and cartesian products. (B) (G, CQ), (L, CQ); guarded and linear tgds are, by deﬁ-

nition, connected. (C) (S, CQ); UCQ is at least as expressive as (S, CQ), and

(con S, con CQ) is at least as expressive as con UCQ.5

4For technical clarity, the notion of component is deﬁned only for sets of atoms that do not contain nullary atoms.

5We write con UCQ for the class of all queries deﬁnable by a union of connected CQs.

We consider each one of the above groups, and show that distribution over components implies (semantic) connectedness.

Group A of Query Languages Our main technical tool is the so-called connecting operator, based on a construction given in [Ameloot et al., 2015] for showing that distribution over components implies connectedness when we focus on Datalog queries. By applying the connecting operator to a query Q = (S, , q) 2 (TGD, CQ), we obtain the query Qc = (S, c( ), c(q)) deﬁned as follows. The set c( ) consists of the tgds:

R(x1, . . . , xn) ! con(xi, xj),

for each R/n 2 S and i, j 2 [n], where con is a new binary predicate not in , and

con(x, y) ! con(y, x) con(x, y), con(y, z) ! con(x, z),

stating that con is symmetric and transitive; con(a, b) states that the constants a and b belong to the same component of the input database. In addition, we have the tgd

con(x1, v), R(x1, . . . , xn) ! R?(v, x1, . . . , xn),

where v 62 {x1, . . . , xm}, that annotates every atom of a certain component of the input database with all the constants in this component. Finally, for each σ 2 , we add to c( ) the connected version of σ obtained by replacing each atom R( x) in σ with R?(vσ, x), where vσ is a new variable. Analogously, c(q) is deﬁned as the connected version of q. It can be shown that Q 2 DIST implies Q Qc. Then:

Proposition 4 For each C TGD that is closed under connecting,6 (C, CQ) \ DIST (con C, con CQ).

Interestingly, TGD, WG, and WS are closed under connecting. Hence, Proposition 4 implies that distribution over components implies connectedness for the languages of Group A.

Group B of Query Languages It is clear that G and L are not closed under connecting, and therefore we need to follow a different approach. To this end, we exploit the fact that guarded and linear tgds are, by deﬁnition, connected. In particular, we establish a general result for subclasses of con TGD:

Proposition 5 For each C con TGD, (C, CQ) \ DIST (C, con CQ).

It is clear that the above result immediately shows that distribution over components implies connectedness for the languages of Group B. We proceed to explain the key ideas underlying Proposition 5. Fix an arbitrary class C con TGD, and a query Q = (S, , q) 2 (C, CQ) \ DIST, where q/n is deﬁned by the CQ 9 y φ( x, y). Observe that if Q is unsatisﬁable, i.e., there is no database D over S such that Q(D) 6= ?, then the claim holds trivially; simply choose an arbitrary unsatisﬁable query from (C, con TGD). The interesting case is when Q is satisﬁable. Let {φ1, . . . , φk} be the components of q. We assume that k 2 since for k = 1 the claim holds trivially. Our goal is to show that there exists i 2 [k] such

6This means that, for each set 2 C, c( ) 2 C.

that Qi = (S, , qi), where qi is the CQ obtained from q by keeping only the component φi, is equivalent to Q. Since Qi 2 (C, con CQ) the claim follows. We proceed by distinguishing the two cases where q is either non-Boolean (i.e., n > 0) or Boolean (i.e., n = 0).

Case 1 - Non-Boolean: Interestingly, we can show that all the answer variables of q occur in a single component. In fact, if an answer variable occurs in more than one component of q, then Q is unsatisﬁable, which contradicts the fact that Q is satisﬁable; this exploits the fact that is connected. Therefore, we can refer to the component q x of q that contains the answer variables of q. Notably, the query obtained from Q by replacing q with q x is equivalent to Q:

Lemma 6 Q Q x, where Q x = (S, , q x).

It is clear that Q x 2 (C, con CQ), and the claim when q is non-Boolean follows by Lemma 6.

Case 2 - Boolean: Let us now consider the Boolean case. Although we cannot refer to the component q x of q, we can show that there exists a component that gives rise to a query equivalent to Q. Given a query ˆQ = (ˆS, ˆ , ˆq) 2 (C, CQ) \ DIST, where ˆq is Boolean and 1, . . . , m are its components, let

ˆQi be the query obtained from ˆQ by keeping only the component i of ˆq, and ˆQ

i the query obtained from ˆQ by removing the component i. We can show that, for each i 2 [m],

i ˆQi.7 Thus, starting from Q, and repeatedly applying this result, we will ﬁnd an integer i 2 [k] such that Qi Q

i . It is not difﬁcult to show that Q Qi. Thus:

Lemma 7 There exists i 2 [k] such that Q Qi, where Qi = (S, , qi).

It is clear that the query Qi provided by Lemma 7 belongs to (C, con CQ), and the claim when q is Boolean follows.

Description Logics. Before we proceed to the third group of query languages, we would like to brieﬂy discuss an interesting consequence of Proposition 5 to ontology-mediated query languages based on Description Logics (DLs). As for classes of tgds, a DL L gives rise to the query language (L, CQ). One of the most prominent and well-studied DLs is EL [Baader, 2003]. It is known that an EL-TBox can be rewritten as a set of guarded tgds [Cal ı et al., 2012a]. This fact, together with Propositions 3 and 5, allow us to show that:

Corollary 8 (EL, CQ) \ DIST = (EL, con CQ).

Group C of Query Languages S is not closed under connecting, and S 6 con TGD. Thus, we need to apply different techniques for establishing the desired result. Our main technical tool is as follows:

Proposition 9 For each C TGD such that

1. (C, CQ) UCQ; 2. con UCQ (con C, con CQ),

it holds that (C, CQ) \ DIST (con C, con CQ).

7As usual, Q1 Q2 means that for every source database D, Q1(D) Q2(D).

To establish Proposition 9, it sufﬁces to show the following lemma, which can be done by exploiting similar ideas as the ones developed above for proving Proposition 5:

Lemma 10 UCQ \ DIST con UCQ.

Having the above proposition in place, to establish the desired result for (S, CQ) it remains to show that (S, CQ) UCQ and con UCQ (con S, con CQ). The former is implicit in [Gottlob et al., 2014], where it is shown that evaluation for (S, CQ) is UCQ-rewritable. The latter is shown by exploiting the fact that every CQ of the form 9 y φ( x, y) can be converted into the sticky tgd φ( x, y) ! ans( x, y); this tgd is sticky since all the body-variables are propagated to the head. Database Query Languages. We conclude this section by discussing some interesting consequences to central database query languages, in particular CQ, UCQ, and NRDat.8 First, notice that Proposition 3 can be established for CQ and UCQ. This fact, together with Lemma 10, which holds also for CQ, implies that CQ \ DIST (UCQ \ DIST) has the same expressive power as con CQ (con UCQ). Regarding NRDat, we can show that the class of tgds that corresponds to non-recursive Datalog programs enjoys the two properties stated in Proposition 9. Hence, Propositions 3 and 9 imply that:

Corollary 11 For each language Q 2 {CQ, UCQ, NRDat}, Q \ DIST = con Q.

4 Deciding Distribution Over Components The question that comes up is whether distribution over components for the query languages considered so far can be decided. This problem, dubbed Dist(Q) with Q being a query language, accepts as input a query Q 2 Q, and asks whether Q 2 DIST. Notice that Theorem 2 only says that Dist(Q) is equivalent to the problem of checking whether a query is semantically (not syntactically) connected, and thus it does not provide a decision procedure for our problem.

We start our analysis be recalling a recent negative result: Dist(Dat) is undecidable [Ameloot et al., 2015], where Dat is the class of all queries deﬁnable by some Datalog query. Thus, for every C TGD that captures Datalog programs, we immediately obtain that Dist(C, CQ) is undecidable.9 TGD, WG and WS are such classes, and hence:

Theorem 12 For Q 2 {(TGD, CQ), (WG, CQ), (WS, CQ)}, Dist(Q) is undecidable.

This negative result rules out weakly-guarded and weaklysticky sets of tgds. What about their non-weak versions? Do they ensure the decidability of Dist? This will be the subject of the remainder of this section.

Distribution and Guardedness We show that our problem is decidable if we focus on the answer-guarded fragment of (G, CQ). A CQ q is answerguarded if it includes an atom that contains all the answer variables [B ar any et al., 2013]. Let AGCQ be the class of all queries deﬁnable by some answer-guarded CQ. Notice that AGCQ is a broad class of queries, which includes all Boolean

8All queries deﬁnable by some non-recursive Datalog query. 9For brevity, we write Dist(C, CQ) instead of Dist((C, CQ)).

CQs. Our goal is to show that Dist(G, AGCQ) is decidable. Fix a query Q = (S, , q) 2 (G, AGCQ), where {q1, . . . , qk} are the components of q. Recall that Qi is the query obtained from Q by keeping only the component qi of q. By exploiting Proposition 3, and Lemmas 6 and 7, we can show that:

Lemma 13 Q 2 DIST iff one of the following holds:

1. Q is unsatisﬁable; 2. There exists i 2 [k] such that Qi 2 (G, AGCQ) and

Checking whether Q is unsatisﬁable can be reduced to the containment problem for (G, AGCQ). Thus, Lemma 13 provides an algorithm for Dist(G, AGCQ) under the assumption that Cont(G, AGCQ) is decidable.10 It remains to show that:

Proposition 14 Cont(G, AGCQ) is decidable.

The above result is shown by a reduction to Cont(GDat), where GDat is the class of queries deﬁnable by some guarded Datalog query [B ar any et al., 2012]. Guarded Datalog requires an atom with an extensional predicate to contain all the variables in the head. Cont(GDat) is in 2EXPTIME as it can be reduced to the satisﬁability problem for guarded negation ﬁxed point logic [B ar any et al., 2015]. Our reduction exploits a construction given in [B ar any et al., 2013] for rewriting a query of Q = (S, , q) 2 (G, AGCQ) into an equivalent Datalog query QDat = ( , Ans), where each rule of has a guard atom (possibly with an intensional predicate) that contains all the head-variables. This construction assumes that q is answer-guarded; this is the reason why we focus on answerguarded queries. From the above discussion, we get that:

Theorem 15 Dist(G, AGCQ) is decidable.

The decision procedure underlying Theorem 15 shows that Dist(G, AGCQ) is elementary. However, its exact complexity is unknown. Moreover, the decidability of Dist(G, CQ), i.e., when the CQ is not answer-guarded, is still open.

Distribution and Linearity It is easy to verify that the proof of Theorem 15 applies even if we focus on linear tgds; thus, Dist(L, AGCQ) is decidable. Interestingly, by exploiting the fact that evaluation for (L, CQ) is UCQ-rewritable [Gottlob et al., 2014], we can signiﬁcantly strengthen the above result. We can show that:

Theorem 16 Dist(L, CQ) is PSPACE-complete, and P

2 - complete if the arity of the schema is ﬁxed.

Fix a query Q = (S, , q) 2 (L, CQ). By applying query rewriting techniques, we can construct Qr 2 UCQ such that Q Qr [Gottlob et al., 2014]. Clearly, Q 2 DIST iff Qr 2 DIST. Hence, it would be quite beneﬁcial to ﬁrst understand the problem Dist(UCQ). One may be tempted to think that a UCQ distributes over components iff each of its disjuncts does. It is easy to show that this is not the case; consider, e.g., the UCQ 9x9y (R(x) S(y)) _ 9x R(x). The reason for this is the fact that some disjuncts that do not distribute over components maybe subsumed by other ones that distribute over components. To formalize this we need an

10For a query language Q, Cont(Q) accepts as input two queries Q1, Q2 2 Q of the same arity, and asks whether Q1 Q2.

auxiliary notation. For a CQ q( x) = 9 y (Vn

i=1 Ri( x, y)), let D[q] the database {Ri(c( x), c( y))}i2[n] obtained after replacing each variable v in q with a fresh constant c(v). Then:

Lemma 17 For Q( x) 2 UCQ, Q 2 DIST iff for each q 2 Q, there exists D0 2 co(D[q]) such that c( x) 2 Q(D0).

Let Q = (S, , q( x)) 2 (L, CQ), and let Qr( x) 2 UCQ be the rewriting obtained by applying the rewriting algorithm in [Gottlob et al., 2014]; recall that Q Qr. This algorithm constructs Qr starting from q and exhaustively applying two steps: rewriting and minimization. The rewriting step resolves an atom in the disjunct q0 under consideration using a linear tgd of , while the minimization step is responsible for unifying some atoms of q0. Lemma 17 implies that Q 62 DIST iff there exists q0 2 Qr such that, for every D0 2 co(D[q0]), c( x) 2 Q(D0). This suggests the following decision procedure for the complement of Dist(L, CQ):

1. Construct a CQ q0( x) 2 Qr by nondeterministically ap-

plying rewriting and minimization steps; 2. For each D0 2 co(D), if c( x) 2 Q(D0), then Reject; 3. Accept.

Each step of the above procedure uses polynomial space. This is shown by exploiting the fact that each q0 2 Qr cannot have more atoms than the original CQ q due to the linearity of tgds, and also the fact that evaluation for (L, CQ) is in PSPACE; the latter is implicit in [Johnson and Klug, 1984]. In case the arity of the schema is ﬁxed, q0 can be nondeterministically constructed in polynomial time, while evaluation for (L, CQ) is in NP; implicit in [Johnson and Klug, 1984]. Thus, the above procedure gives a P

2 upper bound. Hence, Dist(L, CQ) is in PSPACE, and in P

2 if the arity of the schema is ﬁxed. The lower bounds are established by a reduction from a restricted version of Cont(L, CQ). We deﬁne Rest Cont(C, CQ) to be the problem of checking Q1 Q2, where Q1 = (S [ {P}, , q1) and Q2 = (S [ {P}, , q2) with P 62 S being a unary predicate not in , q1, q2, given: (i) a set 2 C of connected tgds, (ii) a Boolean connected CQ q1 with at least one variable vq1 that occurs only in atoms with a predicate of S, and (iii) a Boolean CQ q2 that uses only predicates of S. It holds that Q1 Q2 iff (S[{P}, , q1 P(vq1) q2) 2 DIST, which in turn implies that we have a logspace reduction from Rest Cont(L, CQ) to Dist(L, CQ). Therefore, to obtain the desired lower bounds for Dist(L, CQ), it remains to show that Rest Cont(L, CQ) is PSPACE-hard, and P

2 -hard for ﬁxed arity. The former can be shown by a reduction from query answering under linear tgds, while the latter is implicit in [Bienvenu et al., 2012] (see the proof of Theorem 9). In fact, the P

2 -hardness holds even for tgds of the form P(x) ! R(x).

Distribution and Stickiness We proceed to show that Dist(S, CQ) is decidable, and pinpoint the complexity of the problem:

Theorem 18 Dist(S, CQ) is co NEXPTIME-complete,11 and P

2 -complete if the arity of the schema is ﬁxed.

Evaluation for (S, CQ) is UCQ-rewritable [Gottlob et al., 2014]. This allows us to exploit the algorithm devised above

11The lower bound assumes databases with at least two constants.

for solving the complement of Dist(L, CQ). However, since sticky tgds may have more than one atom in their body, the space we need during the execution of the algorithm is not polynomial anymore; actually, we need exponential space. Nevertheless, it can be shown that for (S, CQ) this algorithm terminates after exponentially many steps, which implies a NEXPTIME upper bound for the complement of our problem. This exploits the fact that query evaluation for (S, CQ) is in EXPTIME [Cal ı et al., 2012b]. In case the arity of the schema is ﬁxed, the same analysis as for (L, CQ) applies, and thus we obtain a P

2 upper bound for the complement of our problem. The fact that evaluation for (S, CQ) in case of ﬁxed arity is in NP has been shown in [Lukasiewicz et al., 2015].

To show co NEXPTIME-hardness for Dist(S, CQ), we ﬁrst establish the same for Cont(con S, con CQ) by giving a reduction from the so-called Exponential Tiling Problem [Johnson, 1990]; for technical reasons, we assume databases with at least two constants. The desired co NEXPTIME-hardness is obtained by reducing Cont(con S, con CQ) to Dist(S, con CQ). The P

2 -hardness in case the arity of the schema is ﬁxed is obtained in the same way as for (L, CQ).

Description Logics and Database Query Languages Interestingly, by exploiting the ideas developed above for understanding the complexity of Dist, we can show that:

Theorem 19 Dist(EL, CQ) is EXPTIME-complete.

For the upper bound, we relax Lemma 13 in order to deal with arbitrary (beyond answer-guarded) CQs, and then use the fact that Cont(EL, CQ) is in EXPTIME [Bienvenu et al., 2012]. The lower bound exploits the problem Rest Cont+ obtained from Rest Cont dy dropping the third condition in the deﬁnition of an instance of Rest Cont. Rest Cont+(EL, CQ) is known to be EXPTIME-hard [Bienvenu et al., 2012], which is not the case for Rest Cont(EL, CQ).

We conclude with the complexity of Dist when we focus on central database query languages. We show that:

Theorem 20 Dist(CQ) and Dist(UCQ) are NP-complete, even if the arity of the schema is ﬁxed.

By exploiting Lemma 17, we can design a guess-and-check algorithm for Dist(UCQ) that runs in polynomial time; the NP-hardness is shown by reduction from Cont(CQ). Finally, by giving a proof similar to the one for Theorem 18, we show:

Theorem 21 Dist(NRDat) is co NEXPTIME-complete, even if the arity of the schema is ﬁxed.11

5 Conclusions

Here are some interesting open problems that we are planning to tackle: (i) The complexity of deciding distribution over components for guarded-based queries is still open, (ii) distribution over components in the presence of equality and denial constraints is not well understood, and (iii) we do not know how distribution over components behaves in the case of non-monotonic queries.

Acknowledgements. Austrian Science Fund (FWF), projects P25207-N23, Y698 and W1255-N23, and Vienna Science and Technology Fund (WWTF), project ICT12-015.

[Alvaro et al., 2014] Peter Alvaro, Neil Conway, Joseph M.

Hellerstein, and David Maier. Blazes: Coordination analysis for distributed programs. In ICDE, pages 52 63, 2014.

[Ameloot et al., 2013] Tom J. Ameloot, Frank Neven, and

Jan Van den Bussche. Relational transducers for declarative networking. J. ACM, 60(2):15, 2013.

[Ameloot et al., 2014] Tom J. Ameloot, Bas Ketsman, Frank

Neven, and Daniel Zinn. Weaker forms of monotonicity for declarative networking: A more ﬁne-grained answer to the CALM-conjecture. In PODS, pages 64 75, 2014.

[Ameloot et al., 2015] Tom J. Ameloot, Bas Ketsman, Frank

Neven, and Daniel Zinn. Datalog queries distributing over components. In ICDT, pages 308 323, 2015.

[Baader, 2003] Franz Baader. Least common subsumers and

most speciﬁc concepts in a description logic with existential restrictions and terminological cycles. In IJCAI, pages 319 324, 2003.

[Baget et al., 2011] Jean-Franc ois Baget, Michel Lecl ere,

Marie-Laure Mugnier, and Eric Salvat. On rules with existential variables: Walking the decidability line. Artif. Intell., 175(9-10):1620 1654, 2011.

[B ar any et al., 2012] Vince B ar any, Balder ten Cate, and

Martin Otto. Queries with guarded negation. PVLDB, 5(11):1328 1339, 2012.

[B ar any et al., 2013] Vince B ar any, Michael Benedikt, and

Balder ten Cate. Rewriting guarded negation queries. In MFCS, pages 98 110, 2013.

[B ar any et al., 2015] Vince B ar any, Balder ten Cate, and Luc

Segouﬁn. Guarded negation. J. ACM, 62(3):22, 2015.

[Beeri and Vardi, 1981] Catriel Beeri and Moshe Y. Vardi.

The implication problem for data dependencies. In ICALP, pages 73 85, 1981.

[Bienvenu et al., 2012] Meghyn Bienvenu, Carsten Lutz, and Frank Wolter. Query containment in description logics reconsidered. In KR, 2012.

[Bienvenu et al., 2014] Meghyn Bienvenu, Balder ten Cate,

Carsten Lutz, and Frank Wolter. Ontology-based data access: A study through disjunctive datalog, csp, and MMSNP. ACM Trans. Database Syst., 39(4):33:1 33:44, 2014.

[Cal ı et al., 2012a] Andrea Cal ı, Georg Gottlob, and Thomas

Lukasiewicz. A general Datalog-based framework for tractable query answering over ontologies. J. Web Sem., 14:57 83, 2012.

[Cal ı et al., 2012b] Andrea Cal ı, Georg Gottlob, and An-

dreas Pieris. Towards more expressive ontology languages: The query answering problem. Artif. Intell., 193:87 128, 2012.

[Cal ı et al., 2013] Andrea Cal ı, Georg Gottlob, and Michael

Kifer. Taming the inﬁnite chase: Query answering under expressive relational constraints. J. Artif. Intell. Res., 48:115 174, 2013.

[Gottlob et al., 2014] Georg Gottlob, Giorgio Orsi, and An-

dreas Pieris. Query rewriting and optimization for ontological databases. ACM Trans. Database Syst., 39(3):25:1 25:46, 2014. [Johnson and Klug, 1984] David S. Johnson and Anthony C.

Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci., 28(1):167 189, 1984. [Johnson, 1990] David S. Johnson. A catalog of complexity

classes. In Handbook of Theoretical Computer Science, pages 67 161. 1990. [Leone et al., 2012] Nicola Leone, Marco Manna, Giorgio

Terracina, and Pierfrancesco Veltri. Efﬁciently computable Datalog9 programs. In KR, 2012. [Loo et al., 2009] Boon Thau Loo, Tyson Condie, Minos N.

Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica. Declarative networking. Commun. ACM, 52(11):87 95, 2009. [Lukasiewicz et al., 2015] Thomas Lukasiewicz, Maria Van-

ina Martinez, Andreas Pieris, and Gerardo I. Simari. From classical to consistent query answering under existential rules. In AAAI, pages 1546 1552, 2015. [Thomazo et al., 2012] Micha el Thomazo, Jean-Franc ois Baget, Marie-Laure Mugnier, and Sebastian Rudolph. A generic querying algorithm for greedy sets of existential rules. In KR, 2012. [Zinn et al., 2012] Daniel Zinn, Todd J. Green, and Bertram

Lud ascher. Win-move is coordination-free (sometimes). In ICDT, pages 99 113, 2012.