# explainable_certain_answers__0383adbf.pdf

Explainable Certain Answers

Giovanni Amendola1, Leonid Libkin2

1 Mathematics and Computer Science, University of Calabria 2 School of Informatics, University of Edinburgh amendola@mat.unical.it, libkin@inf.ed.ac.uk

When a dataset is not fully speciﬁed and can represent many possible worlds, one commonly answers queries by computing certain answers to them. A natural way of deﬁning certainty is to say that an answer is certain if it is consistent with query answers in all possible worlds, and is furthermore the most informative answer with this property. However, the existence and complexity of such answers is not yet well understood even for relational databases. Thus in applications one tends to use different notions, essentially the intersection of query answers in possible worlds. However, justiﬁcation of such notions has long been questioned. This leads to two problems: are certain answers based on informativeness feasible in applications? and can a clean justiﬁcation be provided for intersectionbased notions? Our goal is to answer both. For the former, we show that such answers may not exist, or be very large, even in simple cases of querying incomplete data. For the latter, we add the concept of explanations to the notion of informativeness: it shows not only that one object is more informative than the other, but also says why this is so. This leads to a modiﬁed notion of certainty: explainable certain answers. We present a general framework for reasoning about them, and show that for open and closed world relational databases, they are precisely the common intersection-based notions of certainty.

1 Introduction Problems where AI and data management techniques meet are characterized by the need to compute answers to queries over many possible worlds, such as instances of a global schema in data integration [Cal ı et al., 2003; Lenzerini, 2002] or solutions in data exchange [Arenas et al., 2014; Fagin et al., 2005] or results of chasing a database with ontology constraints [Calvanese et al., 2007; Cal ı et al., 2012]. This is due to the fact that datasets one uses for query answering typically contain null values (i.e., unknown individuals), and thus they represent potentially many possible complete worlds (i.e., datasets without null values). Query answering

that takes all of them into account is based on the notion of certain answers. Intuitively, certain answers are those that are true in all possible worlds. However, deﬁning this notion formally is not completely straightforward. There are two different approaches. One of them, applicable when databases are relations (sets or multisets of tuples), simply looks for tuples that are present in all answers [Imielinski and Lipski, 1984], i.e., the intersection of query answers in all possible worlds. We refer to this notion as intersection-based. In fact one often uses its slight generalization that also allows tuples with null values in the output [Lipski, 1984]. This is what one typically ﬁnds in the literature on querying incomplete information, or its applications such as data integration, data exchange, or ontology-based data access (OBDA). While this notion looks rather natural, it comes with little justiﬁcation, which led to an attempt [Libkin, 2016] to provide a systematic approach to deﬁning certain answers; it generalized earlier approaches that utilized techniques from programming semantics [Buneman et al., 1991; Rounds, 1991]. Suppose we have an incomplete object x, and let Jx K, the semantics of x, be the set of all of its possible worlds. To answer a query Q on x, we need to extract certain information from Qp Jx Kq t Qpyq | y P Jx Ku, where Qpyq is the answer to Q on y. The key element of this approach is the assumption that we have an ordering on query answers: a ď a1 means that a1 is at least as informative as a (i.e., a1 contains at least all information from a). This deﬁnes the notion of a candidate answer, which is an object a such that a ď Qpyq for all y P Jx K. In other words, a candidate answer is no more informative than any of query answers in possible worlds. A certain answer is the maximal candidate answer, i.e., the most informative answer consistent with answers in all possible worlds (formally, it is the greatest lower bound of the set Qp Jx Kq). We refer to this notion as informativeness-based. To compare this approach with the intersection-based approach for the relational data model, we need to explain what the incomplete objects are and how the informativeness ordering is interpreted. Most commonly, incompleteness is represented by marked nulls [Imielinski and Lipski, 1984] (this is in fact the model typically used in applications such as data integration and exchange and OBDA). In databases with marked nulls, some of the entries are nulls that represent currently unknown values. Possible worlds are obtained by as-

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

signing values to nulls; e.g., by means of a function v on nulls whose values are constants that can appear in complete databases. Queries over relational databases produce relations as well. Let R and R1 be two relations that can appear as a query answers, and assume that R1 has no nulls, i.e., it is a query answer on a complete database. Then R1 is at least as informative as R if vp Rq Ď R1 for some valuation v of nulls in R. That is, R1 provides additional information by instantiating nulls in R and adding some tuples. With these notions in place, we can now see how the two approaches to deﬁning certainty compare. Example 1. Consider an incomplete database D t Rp1, 1q, Rp2, 1q, Rp2, Kq, Sp K, 1qu with three tuples in relation R and one tuple in relation S, where K denotes a null value. Let Qpx, yq Rpx, yq Spx, yq be the query computing the difference of R and S. Possible worlds of D could be of three kinds: D1 t Rp1, 1q, Rp2, 1q, Sp1, 1qu, if K is mapped to 1; D2 t Rp1, 1q, Rp2, 1q, Rp2, 2q, Sp2, 1qu, if K is mapped to 2, and Dc t Rp1, 1q, Rp2, 1q, Rp2, cq, Spc, 1qu, if K is mapped to some other value c 1, 2. This gives us the following sets of answers Qp D1q tp2, 1qu, Qp D2q tp1, 1q, p2, 2qu, and Qp Dcq tp1, 1q, p2, 1q, p2, cqu, for c 1, 2. There are no tuples present in all these answers, and thus the standard intersection-based approach will return the empty set. Its generalization that permits nulls in answers will return the single tuple p2, Kq. Indeed, if K is mapped to 1, then p2, 1q P Qp D1q; if K is mapped to 2, then p2, 2q P Qp D2q; and if K is mapped to c 1, 2, then p2, cq P Qp Dcq. The informativeness-based notion of certain answer results in the set A tp K1, 1q, p2, K2qu. Indeed, one can see that for each c, there is v such that vp Aq Ď Qp Dcq, and it is not hard to show that A is the greatest lower bound of all Qp Dcqs.

Thus, the informativeness-based notion of certainty may be different from the intersection-based notion. We do not yet understand the informativeness-based notion well, however. For example, we do not know whether the certain answer deﬁned as the greatest lower bound always exists, nor do we know how large it can be when it does exist. The latter question was partially answered in [Arenas et al., 2017]: even for unions of conjunctive queries with inequalities, it can be of super-polynomial size under the open-world interpretation of incomplete databases. We complete the picture. First we show that for open-world, even for ﬁrst-order queries, such certain answers need not exist. For closed-world semantics, certain answers exist for queries expressible in any reasonable logical language, but could still be of super-polynomial size. So suddenly the informativeness-based approach to deﬁning certain answers faces a very serious obstacle of either non-existence or infeasible sizes of answers. The reason for this bad behavior lies in the excessive permissiveness of the deﬁnition of certain answers with respect

to information content. If y P Jx K, there must be a speciﬁc explanation as to why y is a possible world. However, we only require a ď Qpyq if a is candidate answer: that is, such an answer is less informative than Qpyq but for reason that may have nothing to do with the fact that y is a possible world for x. This may lead to bizarre candidate answers. Indeed, returning to Example 1, consider the tuple p1, K1q. It matches Qp D1q, obtained by interpreting K as 1, when K1 is mapped to 2, and it matches Qp D2q, obtained by interpreting K as 2, when K1 is mapped to 1. Thus, there is no connection between how a possible world is obtained, and why the tuple is in the answer to Q in that possible world. Instead of the very permissive deﬁnition of candidate and certain answers, it seems much more natural to require, for a candidate answer a, that a ď Qpyq for the same reason that y is in Jx K: that is, whatever explanation is given to y P Jx K, it should also explain why a ď Qpyq. This idea gives us the notion of explainable candidate answers, and maximal such answer is the explainable certain answer. Our main contribution is to extend the theory of informativeness-based notion of certainty to include explanations of why tuples appear in the answer, and to relate explainable certain answers to the classical intersection-based approach. As the ﬁrst step, we develop a general, datamodelindependent, framework to reason about explanations applied to incomplete objects, and to query answers. Essentially, explanations are viewed as operations that can be applied to objects to make them more informative; such explanations compose, and naturally lead to notions of explainable candidate and certain answers. We then see how these notions look in the familiar relational model, under both openand closed-world semantics of incompleteness. Our main result in this case is that for both semantics, explainable certain answers always exist, are of at most polynomial size in the size of the input, and are, essentially, the intersection-based answers (to be more precise, their slight extension that allows nulls in the output: the one that produced the answer p2, Kq in Example 1). Thus, the standard notion so common in data management and reasoning literature is well justiﬁed after all, by applying the concept of explanations to both database objects and query answers. Organization. Basic deﬁnitions are in Section 2. Existence and size of answers as greatest lower bounds is studied in Section 3. Section 4 compares the intersectionand informativeness-based notions of certainty. Section 5 introduces the framework of explanations, and Section 6 shows what explainable certain answers are for common semantics.

2 Preliminaries

Incomplete databases. An incomplete database D represents many possible worlds, or complete databases, known as its semantics JDK. As a concrete model of incompleteness we look at relational databases with marked nulls that dominate applications such as data integration, OBDA, and data exchange [Arenas et al., 2014; Lenzerini, 2002; Bienvenu and Ortiz, 2015]. Entries in databases come from the sets of constants and nulls, denoted by Const and Null. We use the symbol K for nulls. A database D is a set of relations over Const Y Null;

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

a k-ary relation is a ﬁnite subset of p Const Y Nullqk. Sets of constants and nulls that occur in D are denoted by Constp Dq and Nullp Dq. Its domain is domp Dq Constp Dq Y Nullp Dq. A complete database has no nulls. A homomorphism h : D Ñ D1 is a mapping from domp Dq to domp D1q such that hpcq c for c P Constp Dq, and for each tuple a in relation R of D, the tuple hp aq is in relation R of D1. The image of h is denoted by hp Dq. A homomorphism v is a valuation if its range contains only constants, i.e., vp Kq P Const for each null (we shall be using v for valuations, and h for arbitrary homomorphisms). Two most common semantics of incompleteness, under openand closed-world assumptions (OWA and CWA) are given as follows [Abiteboul et al., 1995; Imielinski and Lipski, 1984]: JDKOWA t D1 complete | D valuation v : D Ñ D1u JDKCWA t D1 complete | D valuation v : D1 vp Dqu That is, under CWA, one just replaces nulls with constants, and under OWA, one can add extra tuples, cf. [Reiter, 1977].

Query answering. A relational query Q of arity k takes a complete database D and returns a set of k-tuples over Constp Dq. If such a query Q is asked on an incomplete database D, to answer it under semantics J K, one has to analyze Qp JDKq t Qp D1q | D1 P JDKu. The deﬁnition one sees most commonly in the literature is simply Ş Qp JDKq Şt Qp D1q | D1 P JDKu. We shall consider a slightly more permissive deﬁnition: l OWAp Q, Dq t a | @v : D Ñ D1 vp aq P Qp D1qu l CWAp Q, Dq t a | @v : vp aq P Qpvp Dqqu where v ranges over valuations on D, and a is a tuple over constants and nulls. The latter comes from [Lipski, 1984], and has often been used as a substitute for intersectionbased deﬁnition since null-free tuples in l CWAp Q, Dq and Ş Qp JDKCWAq are exactly the same (and likewise for OWA). For Boolean (true/false) queries, where the standard representations of true and false are tpqu and H (i.e., the set containing the empty tuple pq, and the empty set), lp Q, Dq and Ş Qp JDKq are the same for both OWA and CWA semantics. Otherwise l OWAp Q, Dq and l CWAp Q, Dq may produce informative tuples with nulls. For instance, if Q returns a relation R in D, then l OWAp Q, Dq l CWAp Q, Dq R, while both Ş Qp JDKOWAq and Ş Qp JDKCWAq contain only null-free tuples in R.

Query languages. As our basic query language, we shall use ﬁrst-order predicate logic (FO) over the relational vocabulary. Its formulae are built up from relational atoms Rp xq and equality atoms x y, using Boolean connectives , _, , and quantiﬁers D, @. The D, -fragment of FO is known as conjunctive queries. Their unions are denoted by UCQ (unions of conjunctive queries); in terms of their expressiveness, they correspond to the D, , _-fragment of FO. If inequality atoms x y are allowed in them, we have the class of unions of conjunctive queries with inequalities, denoted by UCQ .

3 Query Answers as Greatest Lower Bounds We now outline a general approach to viewing certain query answers as greatest lower bounds in orderings induced by in-

formation content, and then see what it yields for answering relational queries, under both OWA and CWA semantics.

A General Framework We review the basics of the framework of [Libkin, 2016]. A database domain is a triple x D, C, J Ky where D is a set of database objects (e.g., all databases of a given schema), C Ď D is a set of complete objects, and J K is the semantic function that assigns complete objects (possible worlds) to an incomplete object, i.e., Jx K Ď C. With such a domain, we associate its information ordering given by x ď y iff Jy K Ď Jx K. The intuition is that the more possible worlds there are for x, the less information about x we have. Given two database domains x D1, C1, J K1y and x D, C, J Ky, a query is a mapping Q : C1 Ñ C, i.e., a mapping from complete objects (query inputs) to complete objects (query answers). This is consistent with the standard query languages. To extend Q to arbitrary elements of D in a correct way, we need to extract certain information from Qp Jx Kq. We say that a is a candidate answer to Q on x if a ď Qpcq for every c P Jx K, and a is a certain answer to Q on x if it is maximal among candidate answers. In other words, the certain answer is the greatest lower bound of Qp Jx Kq in ordering ď, denoted by glbďQp Jx Kq. Depending on a concrete ordering ď on D, certain answers may or may not exist. If certain answers do exist, then Jy K1 Ď Jx K1 implies glbďQp Jx Kq ď glbďQp Jy Kq. That is, more informative query inputs yield more informative query answers.

The Framework for Relational Databases When elements of database domains are relational databases, for query inputs we use either OWA or CWA semantics. For the semantics of answers, it is common to use OWA. Indeed, a partial query answer can be improved in two ways: either by ﬁnding more tuples, or by instantiating nulls with values. This is precisely what the OWA semantics does. The ordering it generates, D Ď D1 iff JD1KOWA Ď JDKOWA, has a nice description: D Ď D1 iff there is a homomorphism h : D Ñ D1, see [Gheerbrant et al., 2014]. This is a well-known and studied relation [Hell and Neˇsetˇril, 2004], and we now recall some basic notions related to it. It is a preorder, that is, reﬂexive and transitive. If both D Ď D1 and D1 Ď D hold, we say that D and D1 are homomorphically equivalent and write D D1; note that is an equivalence relation. A relational structure is a core if it has no proper substructure that is homomorphically equivalent to it. A substructure D1 of D is called a core of D if it is a core and D D1. It is known that up to isomorphism, there is only one core of D, and thus one speaks of the core of D, denoted by corep Dq. Any two homomorphically equivalent structures will have the same core (up to isomorphism). The certain answer to a relational query Q on an incomplete database D is glbĎQp JDKq, where J K is either J KOWA or J KCWA, and the greatest lower bound is taken with respect to Ď. Since Ď is a preorder, glbĎ is deﬁned up to homomorphic equivalence, and one typically takes the core of these homomorphically equivalent structures to represent glbĎ. For structures D1 and D2, their greatest lower bound is any structure homomorphically equivalent to D1 ˆD2 [Hell and Neˇsetˇril, 2004]. Thus, there is the smallest such structure,

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

namely corep D1 ˆ D2q. If homomorphisms are required to preserve constants, the deﬁnition of the product D1ˆD2 needs a small adjustment: if ai P domp Diq, for i 1, 2, then we replace the pair pa1, a2q in the domain of D1 ˆD2 by a1 if a1 a2 P Const and by a fresh null otherwise [ten Cate and Dalmau, 2015; Libkin, 2011]). While glbĎ exists for ﬁnite sets, our goal is to ﬁnd it for inﬁnite sets Qp JDKq. That is what we analyze now.

Certain Answers for UCQs. In the case of unions of conjunctive queries, there is an easy description of these greatest lower bounds. By Qp Dq we mean the na ıve evaluation of Q on D that treats nulls simply as new constant values. Then, if Q P UCQ,

glbĎQp JDKOWAq glbĎQp JDKCWAq Qp Dq ,

see [Imielinski and Lipski, 1984; Gheerbrant et al., 2014; Libkin, 2016]. Thus, it is of interest to us what happens for more complex queries, e.g., UCQ or FO queries.

Bounds under OWA. Now we have a relational query Q and we want to compute glbĎQp JDKOWAq for a database D. It was shown already in [Arenas et al., 2017] that its size could be exponential even for UCQ queries. That is, glbĎQp JDKOWAq always exists if Q P UCQ , and, moreover, there exists a sequence Dn, n P N, of databases of increasing size and a UCQ query Q such that the size of glbĎQp JDn KOWAq is 2Op}Dn}q, where the size }D} of a database is measured as the number of tuples in it. When we move to arbitrary FO queries, the situation is much worse.

Theorem 1. There is an FO query Q and a database D such that glbĎQp JDKOWAq does not exist.

Proof idea. Let D contain a unary relation U with a single element, and an empty binary relation R. Let q state that for each element of R, its indegree and outdegree are 1, and let Q return R if q is true, and tpx, xq | x P Uu otherwise. A database D1 in JDKOWA contains a graph R and a set U; then Q returns R if it is a disjoint union of directed cycles, and single loops on elements of U otherwise. Let Cn be a directed cycle of length n. From the above and the fact that Cn Ď Cn \ Cm (where \ denotes disjoint union) it is easy to derive that glbĎQp JDKOWAq glbĎt Cn | n ą 0u. Since Cmn Ď Cm, we then get that the latter equals glbĎt Cn! | n ą 0u. Now one can use techniques from [Hell and Neˇsetˇril, 2004] to show that such a greatest lower bound does not exist.

Bounds under CWA. For CWA, we can recover the existence of greatest lower bounds for a large class of queries, but we still have very high bounds on their sizes. Recall that a query Q is generic [Abiteboul et al., 1995] if Qpπp Dqq πp Qp Dqq for every permutation of the domain (i.e., for a bijection π : domp Dq Ñ domp Dq). All queries in FO and many other logics over the relational vocabulary are such.

Theorem 2. If Q is a generic query, then glbĎQp JDKCWAq always exists, and its size is bounded by 22Op}D}q.

Proof idea. Given a database D, a valuation v is admissible if its range is contained in Constp Dq together with | Nullp Dq| 1 new constants not present in D. Let Ap Q, Dq t Qpvp Dqq | v admissible valuationu. Note that Ap Q, Dq Ď Qp JDKCWAq, and Ap Q, Dq is a ﬁnite set. The key lemma is to show that glbĎp JDKCWAq glbĎAp Q, Dq. To construct glbĎAp Q, Dq, we have to ﬁnd the core of the product of the elements of Ap Q, Dq; since the number of admissible valuations is exponential, the product size is at most doubly exponential.

As for the lower bound, we can adapt the construction of [Arenas et al., 2017] to show the following: Proposition 1. There exists a family of incomplete databases Dn, for n P N, of increasing size, and a UCQ query Q such that the size of glbĎQp JDn KCWAq is 2Op}Dn}q. This still leaves a gap between the exponential lower bound and the double exponential upper bound. To give an idea why it is very hard to close the gap, note that lower bound proofs are based on constructing, for each n, a polynomialsize family of cores of size up to n that are closed under product; this results in an exponential size greatest lower bound. For a double-exponential bound, we would need to ﬁnd an exponential-size family of cores closed under product. However the existence of such a family is not known. A double exponential lower bound can be produced for monadic second-order logic, MSO (the fragment of secondorder where all second-order quantiﬁcation is over sets, and both ﬁrstand second-order free variables are allowed). Proposition 2. There is a family of incomplete databases Dn, for n P N, of increasing size, and an MSO query Q such that the size of glbĎQp JDn KCWAq is 22Ωp}Dn}q. Proof idea. We ﬁrst show how with an MSO query, we can generate, on JDn KCWA, outputs which are directed cycles of lengths up to 2n, where Dn has Opnq tuples. Thus, glbĎQp JDn KCWAq must contain corepŚ iď2n Ciq, which is known to be CN, where N is the least common multiplier of numbers up to 2n. Thus, N ě śtp ď 2n | p primeu, and ln N ě θp2nq, where θpxq is the sum of ln p over prime p ď x. It is known that θpxq ě x for sufﬁciently large x, see [Rosser and Schoenfeld, 1962], and thus N ě e2n, from which the bound is easily derived. l

4 Comparing Certainty Notions We have deﬁned certain answers as greatest lower bounds glbĎQp JDK q when is OWA or CWA. On the other hand, we have standard notions of certain answers in the literature, namely l CWAp Q, Dq and l OWAp Q, Dq. How do they compare? We know that l CWAp Q, Dq and l OWAp Q, Dq are candidate answers, under the appropriate semantics; that is, l p Q, Dq Ď Qp D1q for each D1 P JDK . However, they are not in general informativeness-based certain answers: we already saw, in Example 1, that glbĎQp JDKCWAq and l CWAp Q, Dq can be different. In that example, l CWAp Q, Dq tp2, Kqu, while glbĎQp JDKCWAq tp K1, 1q, p2, K2qu. A similar example can be constructed under OWA as well.

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

In fact Example 1 showed that the intersection-based notion can miss some of the answers justiﬁed under the informativeness-based notion. On the other hand, the informativeness-based notion of certainty does not behave well computationally: it may not exist, or may be prohibitively large. To reconcile good properties of the two notions, it would thus be nice to produce proper justiﬁcations for answers of the form l p Q, Dq. To do so, we observe the following: in Example 1, each of the possible worlds for D is explained by means of a valuation v. If we take, in that example, the tuple a p K1, 1q which belongs to glbĎQp JDKCWAq but not to l CWAp Q, Dq, we shall always have some valuation v1 so that v1p aq P Qpvp Dqq, but we cannot ensure that v1 always equals v. In other words, we cannot ensure that the same explanation accounts for a possible world for an incomplete database, and for the tuple being an answer in that possible world. This is an informal description of why the notions of l p Q, Dq are justiﬁed: they appear to be the most informative answers under the assumption that the same explanation accounts for a possible world and the query answer in that possible world. We next formalize the notion of explanations.

5 Explanations: The Framework

We now present an abstract framework to reason about explanations applied to incomplete database objects, while providing analogies with the common OWA and CWA relational semantics. This is done to keep the intuition clear; detailed treatment of these semantics will be presented in Section 6. Assume that we have a set D of database objects, as in Section 3. On this set we have an oblivious preorder ĺε; intuitively, x ĺε y means that we know that y is at least as informative as x without having to provide additional information. For example, under OWA, this will be the subset ordering D Ď D1 (tuples can be added freely), and under CWA, it is just the identity D D1. Next, we need actions; these are functions on objects that return objects. Some actions with special properties will be viewed as explanations of incomplete information. The set of actions Ωis a monoid: actions can be composed, and there is the identity (unit) action. This is captured by:

Deﬁnition 1 (Action Structure). An action structure A consists of:

a set D of database objects with an oblivious preorder ĺε; a monoid Ωwith an associative composition operation denoted by ωω1 and unit ε (i.e., εω ωε ω); and a function D ˆ ΩÑ D, whose result on x P D and ω P Ω is denoted by xω, such that xε x and xωω1 pxωqω1, and furthermore x ĺε y implies xω ĺε yω for all x, y P D and ω, ω1 P Ω.

The intuition, in the case of relational databases, is as follows. Actions are functions that can modify database entries, i.e., they are functions f : Const Y Null Ñ Const Y Null. They can provide us with additional knowledge (e.g., by replacing a null with a constant), or to the contrary decrease our knowledge (by doing the reverse). The unit of the monoid of

functions is the identity function. Then xω is the result of applying an action to an object. In case of relational databases, if we have a database D and a function f as above, the result of applying this action is simply fp Dq. The last condition says that if x is obliviously at most as informative as y, then changing information in both x and y in the same way does not change that. We extend ĺε to all elements of Ωby deﬁning x ĺω y iff xω ĺ y. It says that by applying the action ω, object x is at most as informative as y. Lemma 1. x ĺω y and y ĺω1 z imply x ĺωω1 z. Explanations are special kinds of actions. A relational intuition for actions is replacing some values in relational databases by others. When we replace nulls by other values, this explains why one object is more informative than another. In particular, we can replace all nulls by constants, resulting in a complete object, to which no further explanations apply. The following deﬁnition captures this distinction between arbitrary actions and explanations. Deﬁnition 2 (Explanation Structure). An explanation structure E is a restriction of an action structure A to a submonoid Σ of Ω, whose elements are called explanations, so that for each x P D, there is σ P Σ, called x-complete, such that pxσqσ1 xσ, for all σ1 P Σ. If ε is x-complete, then x is called a complete object, and the set of such objects is denoted by C. If σ is x-complete, then no other explanation changes xσ, i.e., σ already explained all the incompleteness of x. A complete object in C is one for which ε is x-complete, i.e., there is no incompleteness to explain, and xσ x for all σ P Σ. In the relational case, explanations are maps f : Const Y Null Ñ Const Y Null that are the identity on constants (i.e., homomorphisms). For a relational database D, an explanation is D-complete if it is a valuation on D, i.e., it maps D to a complete database. Complete databases, under this deﬁnition, will be those without nulls. Given an explanation structure E, we deﬁne the semantics of an object x P D with respect to an explanation σ P Σ as the set of all complete objects which are at least as informative as x with the help of explanation σ, i.e.,

Jx KE σ tc P C | x ĺσ cu.

The semantics of an object is the set of complete objects that can be explained to be at least as informative as x:

σPΣ Jx KE σ .

We also write x ĺ y if x ĺσ y for some σ P Σ. Below we summarize basic properties of these notions. Proposition 3. ĺ is a preorder.

If xσ is a complete object, then xσ P Jx KE σ.

If x ĺσ y, then Jy KE σ1 Ď Jx KE σσ1, for each σ1 P Σ.

Semantics and Ordering It is standard to deﬁne information ordering on objects by comparing their semantics, i.e., x ĺ y iff Jy K Ď Jx K. Here

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

we took a different path and deﬁned both the semantics and the ordering based on the actions of explanations on objects. Does the connection between them continue to hold? The answer is positive in the presence of canonical explanations. These are analogous, in the case of relational databases, to replacing null values with new distinct constants. These are important for deﬁning na ıve evaluation of queries, when nulls are treated as constants. Crucially such explanations are invertible: one can change these new constants back into nulls, and apply this inverse to other objects that use the same constants. The following deﬁnition captures this intuition. First, we say that ω P Ωis explainable on x if xω xσ for some σ P Σ. That is, the action of ω may not always be an explanation, but on x it is. Deﬁnition 3 (Canonical explanation). For an object y P D, a y-complete τ P Σ is a canonical explanation for y if two conditions hold. (1) There is τ 1 P Ωsuch that yττ 1 y. (2) If σ P Σ is a y-complete explanation, and x P D satisﬁes x ĺσ1 yσ for some σ1 P Σ, then σ1τ 1 is explainable on x. Theorem 3. Given an explanation structure E, let y P D be such that there is a canonical explanation for it. Then, for x P D, we have x ĺ y if and only if Jy KE Ď Jx KE.

Explainable Answers to Queries Let D and D1 be two sets of database objects, and let Σ be a monoid of explanations that gives rise to explanation structures E and E1 on these sets. We use notations with 1 for all the concepts in E1, i.e., C1 for complete objects, ĺ1 σ and ĺ1. Deﬁnition 4 (Explainable answers). A query Q is a map from C to C1. Given x P D and a P D1, we say that a is an explainable candidate answer to Q on x if for every c P Jx Kσ we have a ĺ1 σ Qpcq. A maximal, with respect to ĺ1, explainable candidate answer is called an explainable certain answer. That is, for explainable candidate answers, the same explanation has to account for c being a possible world of x, and for a being at most as informative as the answer in that world. Certain answers, as before, are maximal candidate answers. These behave in the expected way. Theorem 4. Let Q : C Ñ C1 be a query. Assume that x ĺσ y for x, y P D. If a is an explainable candidate answer to Q on x, then aσ is an explainable candidate answer to Q on y. If glbĺ1Qp Jx Kq exists, then it is an explainable certain answer to Q on x. If x ĺ y and a, b are explainable certain answers to Q on x and y respectively, then a ĺ1 b. These say that query answers are still more informative on more informative inputs, but extra information in query answers is not arbitrary: instead it is added according to the way in which information was added to the inputs.

6 Explainable Certain Answers in Open and Closed Worlds We now look at action and explanation structures for the common OWA and CWA semantics. These are easy to construct.

Their elements are relational databases, complete objects are databases without nulls, actions are functions on database entries, and explanations are those actions that preserve constants (i.e., they either assign a constant to a null, or equate two nulls). The only difference between them is the oblivious ordering, which is subset for OWA (to account for adding tuples), and identity for CWA. We then prove that in both of these explanation structures, l OWA and l CWA are exactly the explainable certain answers.

6.1 Explainable Answers under OWA Let ρ be a relational vocabulary. We deﬁne action and explanation structures AOWApρq and EOWApρq for relational databases over ρ. In AOWApρq: the domain Dpρq is the set of all databases of vocabulary ρ over Const Y Null; the oblivious preorder ĺε is the subset relation Ď; the monoid Ωcontains all functions f : Const Y Null Ñ Const Y Null, with the identity function as its unit element, and composition of functions as its binary operation. The relation D ĺf D1 is thus given by fp Dq Ď D1. For the explanation structure EOWApρq, the submonoid of Ω is H, the set of all homomorphisms, which are maps h P Ω such that hpcq c for every c P Const. Recall that a homomorphism h is called a valuation if hp Nullq Ď Const. Let J KEOWApρq be the semantics deﬁned by EOWApρq, and let ĺ be the associated preorder, as shown in the previous section. The theorem below summarizes properties of the structures, the preorder, and the semantics. Theorem 5. For each relational vocabulary ρ: AOWApρq is an action structure and EOWApρq is an explanation structure; every valuation v P H is a complete explanation; complete objects are exactly ρ-databases without nulls;

J KEOWApρq coincides with J KOWA; the preorder ĺ is Ď (the existence of a homomorphism); for D P Dpρq, canonical explanations are valuations that are bijections with the range disjoint from Constp Dq. The existence of canonical explanations implies that D ĺ D1 iff JD1KEOWApρq Ď JDKEOWApρq, which also follows from [Gheerbrant et al., 2014]. A relational k-ary query Q takes a complete database from Dpρq and produces a complete database in Dpρ1q, where ρ1 consists of a single k-ary relation A (for answer). To deﬁne the notion of explainable candidate and certain answers on arbitrary databases, we need explanation structures on Dpρq and Dpρ1q. As mentioned earlier, query answers are always interpreted in EOWApρ1q. When query inputs are interpreted in EOWApρq, we can view Q as a map from complete objects in EOWApρq to complete objects in EOWApρ1q. This, by Deﬁnition 4, gives us the notions of explainable candidate and certain answers under OWA. The result below provides their precise characterization. Theorem 6. If Q is a k-ary relational query deﬁned on complete databases, and D is an arbitrary database, then

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

the explainable certain answer under OWA is l OWAp Q, Dq; explainable candidate answers under OWA are exactly the subsets of l OWAp Q, Dq.

This justiﬁes the use of l OWA and its approximations for ﬁnding certain answers under the OWA semantics.

6.2 Explainable Answers under CWA We deﬁne action and explanation structures ACWApρq and ECWApρq just as we did for the OWA case, with a single change: now the oblivious preorder, which we denote by Ĳε to distinguish it from the OWA case, is the identity (i.e., only D Ĳε D is true). This reﬂects the fact that under CWA, no information is added to an incomplete database except by applying a map to its nulls. This extends to relations Ĳh for for h P H, where D Ĳh D1 iff D1 hp Dq. Let Ĳ be the union of all such relations, and let J KECWApρq be the semantics associated with ECWApρq. The following is an analog of Theorem 5 for CWA.

Theorem 7. For each relational vocabulary ρ:

ACWApρq is an action structure and ECWApρq is an explanation structure; every valuation h P H is a complete explanation; complete objects are exactly ρ-databases without nulls;

J KECWApρq coincides with J KCWA; D Ĳ D1 iff hp Dq D1 for some homomorphism h.

Let Q be a query from complete databases in Dpρq to complete databases in Dpρ1q, where again ρ1 consists of a single k-ary relation A. We consider inputs as elements of the explanation structure ECWApρq and outputs, as before, as elements of the explanation structure EOWApρ1q. Then again Deﬁnition 4 gives us notions of explainable candidate and certain answers, this time under CWA. They can be characterized as follows.

Theorem 8. If Q is a k-ary relational query deﬁned on complete databases, and D is an arbitrary database, then

the explainable certain answer under CWA is l CWAp Q, Dq; explainable candidate answers under CWA are exactly the subsets of l CWAp Q, Dq.

Thus, similarly to the OWA case, this result justiﬁes the use of l CWA for ﬁnding certain answers under the CWA semantics.

6.3 Justiﬁcation for Intersection-Based Certain Answers Theorems 6 and 8 allow us to provide justiﬁcations for the commonly used intersection-based certain answers, i.e., l X p Q, Dq Şt Qp D1q | D1 P JDK u, where is OWA or CWA. We say that a tuple a is an explainable answer to Q on D under semantics if it belongs to the explainable certain answer under . Then:

Corollary 1. Let Q be a k-ary relational query, D a database, and c a k-tuple over Constp Dq. Then c is an explainable answer to Q on D under iff c P l X p Q, Dq, where is OWA or CWA.

Thus, l X p Q, Dq is the maximal explainable candidate answer containing only constant tuples, under both OWA and

CWA. Also, Corollary 1 implies that for Boolean queries, l X p Q, Dq is precisely the explainable certain answer, under both OWA and CWA, thereby providing justiﬁcation for the notions of certainty most commonly found in query answering literature.

7 Conclusions

Our goal was to bridge the gap between two existing ways of deﬁning certainty of query answers over incomplete data. We have done it by introducing the notion of explanations in semantics of incompleteness, and deﬁning certainty by saying that it is the same explanation that has to account for a possible world and the query answer in it. This way, we justiﬁed the most commonly used (although hitherto without proper justiﬁcation) notion of certainty in the literature. As the outcome of this work, we believe that the right notions of certainty to use, in the case of relational data, are intersection-based; that is l CWA or l OWA, depending on the semantics of data. Now these notions come with a proper justiﬁcation. If one is only interested in tuples of constants, the classical deﬁnition of Şt Qp D1q | D1 P JDK u is the right notion. The informativeness-based deﬁnition is still more general and can capture answers that intersection-based notions would miss, but pragmatically one should use intersectionbased notions due to their computational properties. In the future, we plan to extend the ideas of this paper in three ways. One extension is to other semantics of incompleteness (e.g., weaker notions of CWA, cf. [Reiter, 1980; Minker, 1982]). Another direction is to look for notions of explanations as they directly arise in applications such as data integration, data exchange, OBDA, and others. Finally, we would like to consider other data models, in particular graph models and RDF, where the study of querying incomplete data (such as blank nodes) and certain answers has recently received attention [Ahmetaj et al., 2015; Gheerbrant and Fontaine, 2014; Hogan et al., 2014; Nikolaou and Koubarakis, 2016].

Acknowledgments

Part of this work was done while the ﬁrst author was visiting the University of Edinburgh. The ﬁrst author was supported by MISE under projects PIUCultura and S2BDW, and by the EU Horizon 2020 programme under project MIREL, grant agreement 690974. The second author is supported by EPSRC grants M025268 and N023056. We are grateful to anonymous referees for their comments.

[Abiteboul et al., 1995] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

[Ahmetaj et al., 2015] Shqiponja Ahmetaj, Wolfgang Fischl, Reinhard Pichler, Mantas Simkus, and Sebastian Skritek. Towards reconciling SPARQL and certain answers. In WWW, pages 23 33, 2015.

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)

[Arenas et al., 2014] Marcelo Arenas, Pablo Barcel o, Leonid Libkin, and Filip Murlak. Foundations of Data Exchange. Cambridge University Press, 2014.

[Arenas et al., 2017] Marcelo Arenas, Elena Botoeva, Egor V. Kostylev, and Vladislav Ryzhikov. A note on computing certain answers to queries over incomplete databases. In AMW, 2017.

[Bienvenu and Ortiz, 2015] Meghyn Bienvenu and Magdalena Ortiz. Ontology-mediated query answering with data-tractable description logics. In Reasoning Web, pages 218 307, 2015.

[Buneman et al., 1991] Peter Buneman, Achim Jung, and Atsushi Ohori. Using powerdomains to generalize relational databases. Theoretical Computer Science, 91(1):23 55, 1991.

[Cal ı et al., 2003] Andrea Cal ı, Domenico Lembo, and Riccardo Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, pages 16 21, 2003.

[Cal ı et al., 2012] Andrea Cal ı, Georg Gottlob, and Thomas Lukasiewicz. A general datalog-based framework for tractable query answering over ontologies. J. Web Sem., 14:57 83, 2012.

[Calvanese et al., 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efﬁcient query answering in description logics: The DL-Lite family. J. Autom. Reasoning, 39(3):385 429, 2007.

[Fagin et al., 2005] Ronald Fagin, Phokion G. Kolaitis, Renee J. Miller, and Lucian Popa. Data exchange: Semantics and query answering. Theoretical Computer Science, 336:89 124, 2005.

[Gheerbrant and Fontaine, 2014] Am elie Gheerbrant and Ga elle Fontaine. Querying incomplete graphs with data. In AMW, 2014.

[Gheerbrant et al., 2014] Am elie Gheerbrant, Leonid Libkin, and Cristina Sirangelo. Na ıve evaluation of queries over incomplete databases. ACM Trans. Database Syst., 39(4):31:1 31:42, 2014.

[Hell and Neˇsetˇril, 2004] Pavol Hell and Jaroslav Neˇsetˇril. Graphs and Homomorphisms. Oxford University Press, 2004.

[Hogan et al., 2014] Aidan Hogan, Marcelo Arenas, Alejandro Mallea, and Axel Polleres. Everything you always wanted to know about blank nodes. J. Web Sem., 27:42 69, 2014.

[Imielinski and Lipski, 1984] Tomasz Imielinski and Witold Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761 791, 1984.

[Lenzerini, 2002] Maurizio Lenzerini. Data integration: a theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS), pages 233 246, 2002.

[Libkin, 2011] Leonid Libkin. Incomplete information and certain answers in general data models. In ACM Symposium on Principles of Database Systems (PODS), pages 59 70, 2011. [Libkin, 2016] Leonid Libkin. Certain answers as objects and knowledge. Artiﬁcial Intelligence, 232:1 19, 2016. [Lipski, 1984] Witold Lipski. On relational algebra with marked nulls. In PODS, pages 201 203, 1984. [Minker, 1982] Jack Minker. On indeﬁnite databases and the closed world assumption. In CADE, pages 292 308, 1982. [Nikolaou and Koubarakis, 2016] Charalampos Nikolaou and Manolis Koubarakis. Querying incomplete information in RDF with SPARQL. Artiﬁcial Intelligence, 237:138 171, 2016. [Reiter, 1977] Raymond Reiter. On closed world data bases. In Logic and Data Bases, pages 55 76, 1977. [Reiter, 1980] Raymond Reiter. Equality and domain closure in ﬁrst-order databases. Journal of the ACM, 27(2):235 249, 1980. [Rosser and Schoenfeld, 1962] J. B. Rosser and L. Schoenfeld. Approximate formulas for some functions of prime numbers. Illinois J. Math., 6(2):64 94, 1962. [Rounds, 1991] Bill Rounds. Situation-theoretic aspects of databases. In Situation Theory and Applications, volume 26 of CSLI, pages 229 256. 1991. [ten Cate and Dalmau, 2015] Balder ten Cate and V ıctor Dalmau. The product homomorphism problem and applications. In ICDT, pages 161 176, 2015.

Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI-18)