# semantic_characterization_of_data_services_through_ontologies__1176287a.pdf

Semantic Characterization of Data Services through Ontologies

Gianluca Cima1 , Maurizio Lenzerini1 and Antonella Poggi1,2

1Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza Universit a di Roma 2Dipartimento di Lettere e Culture Moderne, Sapienza Universit a di Roma {cima, lenzerini, poggi}@diag.uniroma1.it

We study the problem of associating formal semantic descriptions to data services. We base our proposal on the Ontology-based Data Access paradigm, where a domain ontology is used to provide a semantic layer mapped to the data sources of an organization. The basic idea is to explain the semantics of a data service in terms of a query over the ontology. We illustrate a formal framework for this problem, based on the notion of source-toontology (s-to-o) rewriting, which comes in three variants, called sound, complete and perfect, respectively. We present a thorough complexity analysis of two computational problems, namely veriﬁcation (checking whether a query is an s-to-o rewriting of a given data service), and computation (computing an s-to-o rewriting of a data service).

1 Introduction The architecture of many modern Information Systems is based on data services [Zheng et al., 2013], i.e., services deployed on top of data stores, other services, and/or applications to encapsulate a wide range of data-centric operations. Data services are also used to handle the programming logic for data virtualization in a cloud-hosted data storage infrastructure, so as to delegate most administrative tasks to the cloud infrastructure, and effectively realizing the idea of Data-As-A-Service. Furthermore, since big data, which is now imperative in many contexts, may be obtuse, disorganized, and may not make much sense to most potential users, in order to get value from them, it is reasonable to resort to data services built on top of massive amount of raw data. In order to realize the promises of data services, in particular to foster their reuse, it is of vital importance to well document and clearly specify their semantics. While most current techniques manually associate APIs (Application Programming Interface) to data services, and describe their intended meaning with ad-hoc methods, often using natural language or complex metadata [Carey et al., 2012], we propose a new approach, whose goal is to automatically associate formal semantic descriptions to data services. We base our proposal on the Ontology-Based Data Access (OBDA) paradigm [Poggi et al., 2008]. An OBDA speciﬁcation consists of an ontology

expressed in Description Logic (DL) [Baader et al., 2003], the schema of the data sources forming the information system, and a mapping between the source schema and the ontology. The ontology is a formal representation of the underlying domain, and the mapping speciﬁes the relationship between the data at the sources and the elements in the ontology. The semantics of data services can be thus expressed using the elements of the domain ontology, which is assumed to be familiar to the consumer of data services. But how can we automatically produce a semantic characterization of a data service, having an OBDA speciﬁcation available? The idea is to exploit a new reasoning task over the OBDA speciﬁcation, that works as follows: we express the data service in terms of a query over the sources, and we aim at automatically deriving the query over the ontology that best describes the data service, given the mapping. The following example illustrates this idea.

Example 1. Let Σ = O, S, M be as follows: O = { Erasmus Student Student, Math Student Student, Student Professor } S = { s1, s2, s3, s4, s5 }

{ {(x) | s1(x)} {(x) | Student(x)}, {(x) | s2(x)} {(x) | Student(x)}, {(x) | s3(x)} {(x) | Professor(x)}, {(x) | s1(x), s4(x, y)} {(x) | Erasmus Student(x)}, {(x) | s1(x), s5(x, y)} {(x) | Math Student(x)} } One can verify that the query over O that best describes the data service q S = {(x) | s1(x)} {(x) | s2(x)} in terms of O is q O = {(x) | Student(x)}.

Most of (if not all) the literature about managing data sources through an ontology [Lenzerini, 2018; Xiao et al., 2018; Ortiz, 2018; Bienvenu, 2016] deals with user queries expressed over the ontology, and studies the problem of ﬁnding an ontology-to-source rewriting, i.e., a query over the source schema that, once executed over the data, provides the answers to the original query. Here, the problem is reversed, because we start with a source query and we aim at deriving a corresponding query over the ontology, called a source-toontology rewriting (s-to-o rewriting for short). Thus, we deal with a sort of reverse engineering problem, which is novel in the investigation of both OBDA and data integration. The notions introduced in this paper are relevant in a plethora of scenarios. For the sake of brevity, we mention only three of them. Following the ideas in [Cima, 2017], it

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

can be shown that our notions of s-to-o rewriting can be used to provide the semantics of open datasets and open APIs published by organizations, which is a key aspect for unchaining all the potentials of open data. In [Lutz et al., 2018], the concept of realization of source queries, similar to one of the notions studied here, is used for checking whether the mapping provides the right coverage for expressing the relevant data services at the ontology level. Our notions are also useful for a semantic-based approach to source proﬁling [Abedjan et al., 2017], in particular for describing the structure and the content of a data source in terms of the business vocabulary. The contributions provided by this paper can be summarized as follows. We propose a formal framework for the problem of semantically characterizing a data service through an ontology (Section 3). We introduce the notions of perfect, sound, and complete s-to-o rewritings, and we deﬁne two basic reasoning tasks, namely veriﬁcation and computation. The former checks whether a given query is an s-to-o rewriting of a data service, whereas the latter computes one such rewriting. We show that, although the ideal notion is the one of perfect s-to-o rewriting, there are cases where, with the given mapping, no query over the ontology can precisely characterize the data service at hand. Thus, we introduce maximally sound and minimally complete s-to-o rewritings, which intuitively aim at approximating the perfect s-to-o rewriting of a data service at best, with the goal of either precision (sound rewriting), or recall (complete rewriting). We study both the veriﬁcation, and the computation problem for complete (Section 4), sound (Section 5) and perfect (Section 6) s-to-o rewritings in one of the most popular OBDA setting considered in the literature, namely where the ontology language is DL-Lite R, each mapping assertion maps a conjunctive query (CQ) over the source to a CQ over the ontology, and both the data service and the s-to-o rewriting are expressed as unions of CQs. For perfect and complete s-to-o rewritings we present algorithms for veriﬁcation and computation, and characterize the complexity of both tasks. For the case of sound s-to-o rewritings, we do the same for veriﬁcation, and then we precisely determine the cases where a maximally sound rewriting is not guaranteed to exist. We single out a restricted setting for OBDA speciﬁcations that is still meaningful from the point of view of expressive power, and guarantees the existence of maximally sound s-to-o rewritings (Section 7). For such restricted setting, we provide algorithms and complexity results for veriﬁcation and computation of maximally sound s-to-o rewritings. Preliminaries (Section 2), and conclusions (Section 8) complete the paper. To the best of our knowledge, the problem studied in this paper has been (partially) addressed only in [Cima, 2017; Lutz et al., 2018]. The former provides complexity upper bounds for complete s-to-o rewritings, and the latter focuses on both DL-Lite R and the EL family of ontology languages, and studies perfect s-to-o rewritings only, under a slightly different semantics with respect to the one proposed here.

2 Preliminaries

We assume basic knowledge about databases [Abiteboul et al., 1995] and DLs [Baader et al., 2003]. In what follows, we

use σ(x) to denotes the size of x. A database schema is a set of predicate symbols, each with a speciﬁc arity, and a set of integrity constraints. Given a schema S, an S-database D is a set of facts P( t) satisfying all integrity constraints in S, where P is a predicate in S of arity n, and t is an n-uple of constants, each taken from a denumerable inﬁnite set of symbols, where each such symbol is called an S-constant, or simply constant. In its general form, an L-query q over a schema S is a function in a certain class L that can be evaluated over an Sdatabase D to return a set of answers q D, each answer being a tuple of constants. A conjunctive query (CQ) q over a schema S is an expression of the form { t | φ( t, y)}, also denoted q( t), where t is a tuple of terms, each term being either a constant or a variable, y is a tuple of variables not appearing in t, called the existential variables of q, and φ( t, y) is either (in this case we also say that the whole q is ), or a ﬁnite conjunction of atoms of the form P(t1, ..., tn), where P is an n-ary predicate symbol of S and each tj is either a constant, or a variable in t y. We call φ and t the body and target list of q, respectively, and we sanction that every variable in t appears in φ. If t is empty, then the query is a boolean query. A union of CQs (UCQ) is a union of a ﬁnite set of conjunctive queries (called its disjuncts) with same arity. If not otherwise stated, we implicitly assume that all CQs of a UCQ have the same target list1. If q1, q2 are two queries with the same arity over S, q1 is contained in q2, denoted as q1 q2 if for every S-database D, q D 1 q D 2 . Containment of CQs and UCQs is characterized in terms of homomorphism [Chandra and Merlin, 1977; Sagiv and Yannakakis, 1980]. In what follows, we also consider CQs with no existential variables occurring more than once (CQJFE), and unions thereof (UCQJFE). We consider ontologies expressed in DL-Lite R, the member of the DL-Lite family [Calvanese et al., 2007] that underpins OWL 2 QL [Motik et al., 2012], i.e., the proﬁle of OWL 2 especially designed for the OBDA scenarios. In DL-Lite R axioms have the following forms: B1 B2 R1 R2 (concept/role inclusion) B1 B2 R1 R2 (concept/role disjointness) where B1, B2 are basic concepts, i.e., expressions of the form A, P, or P , with A and P atomic concept (atomic concepts include the universal concept ) and atomic role, respectively, and R1 and R2 basic roles, i.e., expressions of the form P, or P . We will also consider one sublanguage of DL-Lite R, namely DL-Lite RDFS, where both disjointness axioms, and concepts of the forms P or P in the right-hand side of the inclusion axioms, are ruled out. An OBDA speciﬁcation Σ is a triple O, S, M [Poggi et al., 2008], where O is a DL ontology, S is a database schema, called the source schema, and M is a set of mapping assertions (or simply mappings) relating S to O, i.e., assertions of the form q S q O, where q S and q O are CQs of the same arity over S and O, respectively. Mappings of the above form are called GLAV mappings. Special cases are GAV (Global-as-View) and LAV (Local-as-

1All the results of this paper can be generalized to the case of UCQs whose disjuncts may have different target lists.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Views) mappings [Doan et al., 2012]: in a GAV (resp., LAV) mapping, q O (resp., q S) is simply an atom without existential variables. A GAV mapping is called pure if q O does not have constants or repeated variables, i.e., it is either of the form A(x), or R(x, y), with x, y different variables. An interpretation B for O is a model for Σ relative to an S-database D if (i) it is a model of O, and (ii) for every mapping q S q O in M, we have q D S q B O, where q B O denotes the answers of q O over the interpretation B (seen as a database). The set of models for Σ relative to D is denoted by Mod D(Σ), and D is said to be consistent with Σ if Mod D(Σ) = . If q S is a CQ, we denote by M(q S) the query obtained by applying the chase [Maier et al., 1979] w.r.t. M to the socalled freezing of q S, with the proviso that M(q S) is if q S is , and if its chase is empty. Note that the freezing of q S is the database obtained by seeing the atoms of q S as facts. Given an OBDA speciﬁcation Σ = O, S, M , a query q O over O, and an S-database D, the certain answers of q O w.r.t. Σ and D are the set of tuples t of S-constants such that t q B O, for every B Mod D(Σ). An O-to-S Σ-rewriting of a query q O over O is a query q S over S of the same arity as q O such that for every S-database D, q D S is a subset of the certain answers of q O w.r.t. Σ and D. A perfect O-to-S Σrewriting of q O, denoted by certq O,Σ, is a query over S of the same arity as q O such that for every S-database D, cert D q O,Σ coincides with the certain answers of q O w.r.t. Σ and D. We say that query q1 over O is equivalent w.r.t. Σ to query q2 over O if certq1,Σ certq2,Σ. It is easy to see that, with a slight modiﬁcation for taking care of the presence of and in queries, we can use the results in [Levy et al., 1995] to show that, if O has no axiom, and q O is a UCQ over O, one can compute a UCQ over S, denoted by REWM(q O), that is equivalent to certq O,Σ. We conclude the section with some observations about the case where the ontology O in Σ is a DL-Lite R ontology. In this case, it is well-known (see, e.g., [Calvanese et al., 2009]) that an S-database D is consistent with Σ if and only if cert D VO,Σp = , where Σp is obtained from Σ by eliminating the disjointness axioms from O, and VO is the Oviolation query, i.e., the boolean UCQ obtained by including a CQ qd for each disjointness axiom, where qd has the form {()|B1(x) B2(x)} (resp., {()|R1(x, y) R2(x, y)}) for the axiom B1 B2 (resp., R1 R2). Also, we denote by Vt1,...,tn O the UCQ over O with target list (t1, . . . , tn) obtained by adding (t1) . . . (tn) (written also as (t1, . . . , tn)) to each of the disjuncts of VO. If q O is a (U)CQ-query over O, we denote by Perf Refq O,O the UCQ computed by executing the algorithm Perfect Ref described in [Calvanese et al., 2007] on q O and O (again, slightly modiﬁed to take care of and ), and by Perf Refq O,Σ the UCQ REWM(Perf Refq O,O). Note that Perfect Ref ignores the disjointness axioms in O, and if D is consistent with Σ, then Perf Ref D q O,Σ computes exactly cert D q O,Σ. From these observations, and exploiting the results in [Calvanese et al., 2012; Levy et al., 1995], it is possible to prove that for an OBDA speciﬁcation Σ = O, S, M , if O is a DL-Lite R ontology, and q O( t) is a (U)CQ-query over O,

then certq O,Σ Perf Refq O,Σ Perf Ref V t O,Σ.

3 Framework We implicitly refer to an OBDA speciﬁcation Σ = O, S, M . Intuitively, given a data service expressed as a query q S over S, we aim at ﬁnding the query over O that precisely characterizes q S w.r.t. Σ. Since the evaluation of queries over O is based on certain answers, this means that we aim at ﬁnding a query over O whose certain answers w.r.t. Σ and D exactly capture the answers of q S w.r.t. D, for every S-database D. So, we are naturally led to the notion of perfect s-to-o rewriting. In what follows, q S refers to a query over S, and q O to a query over O of the same arity. Deﬁnition 2. q O is a perfect S-to-O Σ-rewriting of q S if for every S-database D, Mod D(Σ) = implies q D S = cert D q O,Σ. The above notion is similar, but not equivalent, to the notion of realization in [Lutz et al., 2018]. Indeed, while the latter sanctions that q D S = cert D q O,Σ for all S-databases, in our notion the condition is limited to the S-databases that are consistent with Σ. The difference between the two notions is highlighted by the following example. Example 3. Refer to Example 1, and consider again the query q S = {(x) | s1(x)} {(x) | s2(x)} over S, and the query q O = {(x) | Student(x)} over O. For the S-database D = {s1(a), s3(a), s4(a, b)}, we have that q D S = { a }, while, since D is inconsistent with Σ, cert D q O,Σ contains all S-constants of D (including, for example, the tuple b ). It follows that q O is not a realization of q S in Σ, whereas, since q D S = cert D q O,Σ for every S-database D consistent with Σ, q O is a perfect S-to-O Σ-rewriting of q S. As noted in [Cima, 2017; Lutz et al., 2018] and illustrated in the next example, perfect s-to-o rewritings may not exist. Example 4. Refer again to Example 1, and consider the data service expressed as the source query q S = {(x) | s1(x)}. By inspecting the mappings, one can see that, since the certain answers of Student include the values stored both in s1 and in s2, such concept is too general for exactly characterizing q S. On the other hand, both Erasmus Student and Math Student are too speciﬁc, and therefore we can conclude that no perfect S-to-O Σ-rewriting of q S exists. In order to cope with the situations illustrated in the example, we introduce the notions of sound and complete s-too rewritings, which, intuitively, provide sound and complete approximations of perfect rewritings, respectively. Deﬁnition 5. q O is a sound (respectively, complete) S-to-O Σ-rewriting of q S if for every S-database D, Mod D(Σ) = implies cert D q O,Σ q D S (resp., q D S cert D q O,Σ).

Example 6. We refer to Example 4, and observe that {(x) | Erasmus Student(x) Math Student(x)} is a sound S-to-O Σrewriting of q S = {(x) | s1(x)}, whereas {(x) | Student(x)} is a complete S-to-O Σ-rewriting of q S. Obviously, q O is a perfect S-to-O Σ-rewriting of q S if and only if q O is a sound and complete S-to-O Σ-rewriting of q S. There are also interesting relationships between the notions of S-to-O Σ-rewritings introduced here and the usual notions of rewritings studied in OBDA.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Proposition 7. q O is a complete S-to-O Σ-rewriting of q S if and only if q S is an O-to-S Σ-rewriting of q O. If q S is a perfect O-to-S Σ-rewriting of q O, then q O is a perfect S-to O Σ-rewriting of q S.

It is easy to see that different sound or complete s-to-o rewritings of q S may exist, and therefore it is reasonable to look for the best approximations of q S, at least relative to a certain class of queries.

Deﬁnition 8. q O L is an L-maximally sound (respectively, L-minimally complete) S-to-O Σ-rewriting of q S if q O is a sound (respectively, complete) S-to-O Σ-rewriting of q S, and no q L exists such that (i) q is a sound (resp., complete) S-to-O Σ-rewriting of q S, (ii) certq O,Σ certq ,Σ (resp., certq ,Σ certq O,Σ), and (iii) there exists an S-database D s.t. cert D q O,Σ cert D q ,Σ (resp., cert D q ,Σ cert D q O,Σ).

Example 9. We refer again to Example 4, and observe that while {(x) | Student(x)} is a minimally complete S-to-O Σrewriting of q S = {(x) | s1(x)} in the class of UCQs, both {(x) | Erasmus Student(x)}, and {(x) | Math Student(x)} are maximally sound S-to-O Σ-rewritings of q S in the class of CQs, while q O = {(x) | Erasmus Student(x)} {(x) | Math Student(x)} is so in the class of UCQs.

Given the general framework presented so far, it is natural to consider the following two basic computational problems, for classes LS and LO of queries:

Veriﬁcation: given Σ = O, S, M , q S LS over S and q O LO over O of the same arity as q S, verify whether q O is a sound (resp., complete, perfect) S-to-O Σ-rewritings of q S.

Computation: given Σ = O, S, M , and q S LS over S compute any LO-maximally sound (resp., LOminimally complete, perfect) S-to-O Σ-rewriting of q S, if it exists.

In the rest of this paper, if not otherwise stated, we refer to the most common setting studied in OBDA, i.e., where (i) the ontology is expressed in DL-Lite R, (ii) S is a relational database schema without integrity constraints, and (iii) both LO and LS denote the class of UCQs. Interestingly, in this case, we have the following.

Proposition 10. If q1 and q2 are UCQ-minimally complete (resp., UCQ-maximally sound) S-to-O Σ-rewritings of q S, then they are equivalent w.r.t. Σ.

4 Complete Source-to-Ontology Rewritings

In this section, we study both the veriﬁcation and the computation problem for complete s-to-o rewritings. We ﬁrst address the veriﬁcation problem. Suppose we want to check whether q O is a complete S-to-O Σ-rewriting of q S. Obviously, if q S is contained in Perf Refq O,Σ, then for every S-database D consistent with Σ, we have that q D S cert D q O,Σ and therefore the answer is positive. However, if q S is not contained in Perf Refq O,Σ, it might be that q O is still a complete S-to-O Σ-rewriting of q S, in particular in the case where the non-emptiness of q S in D reveals the presence of inconsistencies. From this observation, we derive the following.

Algorithm 1 Compute complete s-to-o rewritings

Input: Σ = O, S, M , q S( t) = q1 S( t) . . . qn S( t) over S Output: q O( t) over O

1: return q O = Wn i=1{ t | M(qi S) ( t)}

Proposition 11. q O is a complete S-to-O Σ-rewriting of q S( t) if and only if q S Perf Refq O,Σ Perf Ref V t O,Σ.

The following theorem characterizes the complexity of veriﬁcation for complete s-to-o rewritings. Theorem 12. The veriﬁcation problem for complete s-to-o rewritings is NP-complete. Proof sketch. As for the upper bound, we show how to check the containment q S( t) Perf Refq O,Σ Perf Ref V t O,Σ in NP. For every disjunct q of q S, (i) we guess a query q over O with the same arity of q O and size at most the maximum between σ(q O) and σ(V t O), a sequence ρ of ontology axioms, a query q over S of size σ(M) σ(q ) and a function φ from the variables of q to the variables q, and (ii) we check in PTIME whether we can rewrite either q O or V t O into q through ρ, q is in REWM(q ), and φ is a homomorphism from q to q. As for the lower bound, the proof of NP-hardness is by a LOGSPACE reduction from the 3-COLOURABILITY problem, which is NPcomplete [Garey et al., 1976].

We point out that the result of NP-hardness holds even when O is empty, M is both a pure GAV mapping and a LAV mapping, and q S, q O are boolean CQs. We now deal with the computation problem. In particular, we propose Algorithm 1 for computing UCQ-minimally complete s-to-o rewritings. Intuitively, the algorithm computes the output query as union of CQs obtained by simply applying the mapping M to each CQ qi S in q S, using to bind the variables that are not involved in the application of M to qi S. Theorem 13. Algorithm 1 computes the UCQ-minimally complete S-to-O Σ-rewriting of q S. The algorithm shows that the UCQ-minimally complete Sto-O Σ-rewriting of q S always exists. Moreover, if q S is a CQ, then it is a CQ. Finally, we observe that the complexity of Algorithm 1 does not depend on O and is in PTIME in σ(q S). Moreover, it is in EXPTIME in σ(M), since it essentially applies the chase using the queries in the mapping. It can be shown that an algorithm for computing the UCQ-minimally complete s-to-o rewritings that is PTIME in the size of all inputs would imply a PTIME algorithm for CQ containment. So, assuming PTIME = NP, the computation problem cannot be solved in PTIME.

5 Sound Source-to-Ontology Rewritings We now turn to both the veriﬁcation and the computation problem for sound s-to-o rewritings. As for veriﬁcation, we remind the reader that, for an Sdatabase D consistent with Σ, Perf Ref D q O,Σ computes exactly

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

cert D q O,Σ. So, intuitively, checking whether q O is a sound Sto-O Σ-rewriting of q S means checking whether for all Sdatabases D, either Mod D(Σ) = or Perf Ref D q O,Σ q D S . This observation leads to the following characterization.

Proposition 14. q O( t) is a sound S-to-O Σ-rewriting of q S if and only if Perf Refq O,Σ q S Perf Ref V t O,Σ.

The following theorem characterizes the complexity of the veriﬁcation problem for sound s-to-o rewritings.

Theorem 15. The veriﬁcation problem for sound s-to-o rewritings is Πp 2-complete.

Proof sketch. As for the upper bound, we show that checking Perf Refq O( t),Σ q S Perf Ref V t O,Σ can be done in Σp 2: we guess a CQ q1 over S whose size is at most σ(M) σ(q O), we check in PTIME whether q1 is a disjunct of Perf Refq O,Σ, and then we use an NP oracle to check q1 q S Perf Ref V t O,Σ. As for the lower bound, the proof of Πp 2-hardness is by a LOGSPACE reduction from the -CNF problem.

We point out that the result of Πp 2-hardness holds even when O is empty, M is both a GAV mapping and a LAV mapping, and q S, q O are boolean CQs. We now tackle the computation problem. Our main result is that there are many cases where a UCQ-maximally sound s-to-o rewriting of a query is not guaranteed to exist. To illustrate the result, we introduce a speciﬁc setting for OBDA speciﬁcations, that we call restricted, obtained from the general one by: (i) limiting the ontology language to DL-Lite RDFS, (ii) limiting the mapping to pure GAV, and (iii) limiting q S to UCQJFEs. We now show that, surprisingly, as soon as we try to extend such setting, we lose the guarantee of the existence of s-to-o rewritings that are maximally sound in the class of UCQs.

Theorem 16. UCQ-maximally sound s-to-o rewritings of a query q S may not exist if we extend the restricted setting with each of the following features: 1. disjointness axioms in the ontology; 2. inclusion axioms with R as right-hand side in the ontology; 3. LAV mapping assertions, even without joins involving existential variables in the right-hand side; 4. non-pure GAV mapping assertions; 5. q S in a fragment of CQs going beyond CQJFEs.

Proof sketch. We present the proof for case 5. Consider the OBDA speciﬁcation Σ = O, S, M , where O has no axiom, and M consists of the following pure GAV mappings:

{(x, y) | s1(y, y) s3(x)} {(x, y) | R(x, y)} {(x, y) | s1(x, y)} {(x, y) | R(x, y)}

and let q S be the query {() | s1(x, y) s1(y, z)}. Observe that q O = {() | R(x, y) R(y, z)} is a complete S-to-O Σ-rewriting of q S, but is not sound, because the query q S = {() | s1(x, y) s1(z, z) s3(y)} is a disjunct of Perf Refq O,Σ such that q S q S. Conversely, one can verify that each of the following queries is a sound S-to-O Σ-rewriting of q S: q0 = {() | R(x, y) R(y, y)}, q1 = {() | R(x, y) R(y, z1) R(z1, y)},

q2 = {() | R(x, y) R(y, z1) R(z1, z2) R(z2, y)}, . . . More precisely, if we deﬁne qn to be {() | R(x, y) R(y, z1) R(z1, z2) . . . R(zn 1, zn) R(zn, y)}, for n 2,, then it can be shown that every qn is a sound S-to-O Σrewriting of q S, and for no pair (i, j), with i = j, i, j 0, certqi,Σ certqj,Σ. It follows that the inﬁnite union of q0, q1, and all qn s can be shown to be the maximally sound Sto-O Σ-rewriting of q S in the class of positive queries.

It remains to study sound s-to-o rewritings in the restricted setting. We do so in Section 7.

6 Perfect Source-to-Ontology Rewritings

Both the veriﬁcation and the computation problem for perfect s-to-o rewritings can be addressed by combining the techniques illustrated in the previous sections. As for veriﬁcation, we can check whether q O( t) is a perfect S-to-O Σ-rewriting of q S( t ) by checking both Perf Refq O,Σ q S Perf Ref V t O,Σ and q S Perf Refq O,Σ Perf Ref V t O ,Σ. As for computation, we can ﬁrst compute the query q O that is the UCQ-minimally complete S-to-O Σ-rewriting of q S, and then check whether q O is a sound S-to-O Σ-rewriting of q S. If the answer is positive, we return q O, otherwise we report that no perfect rewriting exists. From the above observation we derive the following: (i) all complexity results illustrated for the case of sound s-to-o rewritings hold for perfect rewritings as well, and (ii) if q S is a CQ, then either its perfect S-to-O Σ-rewriting does not exist, or it is a CQ as well. Finally, we brieﬂy discuss the case of perfect rewritings under the semantics used in [Lutz et al., 2018], that imposes the condition q D S = cert D q O,Σ for all S-databases. From the results presented in the previous sections, it follows that q O( t) is a perfect S-to-O Σ-rewriting of q S( t ) under such semantics if and only if q S Perf Refq O,Σ Perf Ref V t O,Σ. This allows us to easily derive algorithms and complexity bounds for both veriﬁcation and computation in this case, too.

7 Restricted Setting

We now deal with the restricted setting mentioned at the end of Section 5. Before delving into the technical part, we observe that, despite its limitations, the expressive power of this setting is sufﬁcient for several meaningful applications. Indeed, several popular ontologies are expressible in DL-Lite RDFS (e.g., Dublin Core [Weibel et al., 1998] and SKOS [Miles and Bechhofer, 2009]), and the form of pure GAV mapping is exactly the one originally deﬁned in the literature of data integration. Moreover, UCQJFEs captures data services expressible in the famous USPJ (Union, Select, Project, Join) fragment of Relational Algebra [Codd, 1970], with the only limitation that joining variables cannot be projected out. Note that such fragment is the one needed for all tasks related to source proﬁling [Abedjan et al., 2017].

Deﬁnition 17. Let q1( t) and q2( t) be two CQs, and let F1 = S(t1,1, . . . , t1,n) and F2 = S(t2,1, . . . , t2,n) be two atoms of q1 and q2, respectively. We say that F1 instantiates F2, if for

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

every i = 1, . . . , n, we have that if t2,i is a term of t or a constant, then t2,i = t1,i. Clearly, given atoms F1 in q1 and F2 in q2, checking whether F1 instantiates F2 can be done in PTIME. Based on this observation, the following lemma shows that checking whether a UCQ is contained in a UCQJFE is tractable. Lemma 18. Given a UCQ q1 and a UCQJFE q2 of the same arity, checking whether q1 q2 can be done in PTIME. We are now ready to characterize the complexity of the veriﬁcation problem in the restricted setting. Theorem 19. The veriﬁcation problem for sound s-to-o rewritings in the restricted setting is co NP-complete; it is in PTIME if q S is a CQJFE. Proof sketch. The co NP upper bound is obtained by noticing that we have to guess a disjunct q1 of Perf Refq O,Σ and then check whether q1 is not contained in q S, which, by virtue of Lemma 18, can be done in PTIME. The co NP lower bound is shown by a reduction from VALIDITY. To show that veriﬁcation is in PTIME when q S is a CQJFE, we notice that, for the characteristics on the OBDA setting, for every q( t) in q O, Perf Refq,Σ = V α q Perf Refqα,Σ, where qα is the query with body α and target list the tuple of variables that occur both in t and α. Hence, veriﬁcation can be solved by checking, for every atom F in q S and every query q in q O, whether there is an atom G in q such that all disjuncts of Perf Refq G,Σ contain at least one atom that instantiates F. Clearly, this can be done in PTIME w.r.t. σ(q S), σ(M), and σ(q O).

We observe that the co NP upper bound holds even when O is expressed in the fragment of DL-Lite R that does not admit disjointness axioms, and M is GLAV, while the co NPhardness holds even when O is empty, M is a set of both pure GAV and LAV mappings, q O is a CQ, and both q S and q O have no existential variables. We now address the computation problem by providing an algorithm to compute maximally sound s-to-o rewritings, thus proving that in the restricted setting, for each query q S, the UCQ-maximally sound s-to-o rewriting of q S always exists. Let γ(M) be the number of mapping assertions in M and η(q S) the number of distinct atoms appearing in q S. Moreover, let bound(q S) = 1+γ(M)+γ(M)2+. . .+γ(M)η(q S) if q S is a UCQJFE, and bound(q S) = η(q S) if q S is a CQJFE. The following lemma, showing that we can limit our attention to queries with at most bound(q S) atoms when we search for the maximally sound S-to-O Σ-rewriting of q S, allows us to immediately derive Algorithm 2.

Lemma 20. If a CQ q O( t) is a sound S-to-O Σ-rewriting of q S, then there exists a CQ q O( t) which is a sound S-to-O Σ-rewriting of q S whose body is the conjunction of m atoms appearing in q O, where m bound(q S). Note that the disjuncts of query q O computed by Algorithm 2 do not have necessarily the same target list. Theorem 21. Algorithm 2 computes the UCQ-maximally sound S-to-O Σ-rewriting of q S. Proof sketch. It is immediate to verify that the query returned by the algorithm is a sound S-to-O Σ-rewriting of q S.

Algorithm 2 Compute sound s-to-o rewritings Input: Σ = O, S, M , q S (U)CQJFE over S Output: q O over O 1: q O := . 2: for each q over O with at most bound(q S) atoms, involving only constants from q S and M do 3: if q is a sound S-to-O Σ rewriting of q S then 4: q O := q O q. 5: end if 6: end for 7: return q O

To show that it is the maximally sound S-to-O Σ-rewriting of q S, we proceed by contradiction, i.e., by assuming that there exists a CQ q such that Perf Refq ,Σ q S and Perf Refq ,Σ Perf Refq O,Σ, for every disjunct q O of q O, where q O is the query computed by Algorithm 2. Let q be obtained by substituting in q each constant neither in q S nor in M (if any) with a new fresh existential variable. It can be shown that in the restricted case, q would be such that Perf Refq ,Σ q S. Also, let m be the number of atoms of q . If m > bound(q S), by Lemma 20, there exists a query q that is a sound S-to-O rewriting of q S whose body is the conjunction of m atoms appearing in q , where m bound(q S). But then, since q possibly contains only constants in q S or M and since m bound(q S), by construction, q would be a disjunct of q O and we get a contradiction. Finally, if m bound(q S), we obtain a contradiction, by a similar argument.

It can be shown that Algorithm 2 (i) computes the unique (up to equivalence w.r.t. Σ) maximally sound S-to-O Σrewriting of q S in the class of monotone queries, (ii) is PTIME in σ(O) and σ(M), and EXPTIME in η(q S). Finally, we can show that (i) assuming PTIME = NP, the computation problem cannot be solved in PTIME, and (ii) there are cases where the number of atoms of the maximally sound Sto-O Σ-rewriting of q S is necessarily exponential w.r.t. η(q S).

8 Conclusion

We have presented a framework for semantically characterizing data services through ontologies, and carried out a comprehensive analysis for the most common OBDA setting, including a restricted setting, still useful in practice. We plan to continue this work along several directions. For example, in the unrestricted setting, we will study the problem of checking for the existence of a UCQ-maximally sound source-toontology rewriting of a query, and we aim at singling out the minimal class LO of queries that guarantees the existence of an LO-maximally sound source-to-ontology rewriting of a query q S. Also, we will extend our analysis to OBDA settings going beyond the one based on DL-Lite R, e.g., by considering DL-Lite A and the EL family as ontology languages.

Acknowlegments

Work supported by Sapienza under the project PRE-O-PRE and by MIUR, under the SIR project MODEUS - grant n. RBSI14TQHQ, and under the PRIN 2017 project HOPE .

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

References [Abedjan et al., 2017] Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Data proﬁling: A tutorial. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1747 1751, 2017. [Abiteboul et al., 1995] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison Wesley Publ. Co., 1995. [Baader et al., 2003] Franz Baader, Diego Calvanese, Deborah Mc Guinness, Daniele Nardi, and Peter F. Patel Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. [Bienvenu, 2016] Meghyn Bienvenu. Ontology-mediated query answering: Harnessing knowledge to get more from data. In Proc. of the 25th Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pages 4058 4061, 2016. [Calvanese et al., 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efﬁcient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385 429, 2007. [Calvanese et al., 2009] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodr ıguez-Muro, and Riccardo Rosati. Ontologies and databases: The DL-Lite approach. In Sergio Tessaris and Enrico Franconi, editors, Reasoning Web. Semantic Technologies for Informations Systems 5th Int. Summer School Tutorial Lectures (RW), volume 5689 of Lecture Notes in Computer Science, pages 255 356. Springer, 2009. [Calvanese et al., 2012] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Viewbased query answering in description logics: Semantics and complexity. J. of Computer and System Sciences, 78:26 46, 2012. [Carey et al., 2012] Michael J. Carey, Nicola Onose, and Michalis Petropoulos. Data services. Communications of the ACM, 55(6):86 97, 2012. [Chandra and Merlin, 1977] Ashok K. Chandra and Philip M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In Proc. of the 9th ACM Symp. on Theory of Computing (STOC), pages 77 90, 1977. [Cima, 2017] Gianluca Cima. Preliminary results on ontology-based open data publishing. In Proc. of the 30th Int. Workshop on Description Logic (DL), volume 1879 of CEUR Electronic Workshop Proceedings, http: //ceur-ws.org/, 2017. [Codd, 1970] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377 387, 1970. [Doan et al., 2012] An Hai Doan, Alon Y. Halevy, and Zachary G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.

[Garey et al., 1976] Michael R. Garey, David S. Johnson, and Larry J. Stockmeyer. Some simpliﬁed NP-complete graph problems. Theoretical Computer Science, 1(3):237 267, 1976. [Lenzerini, 2018] Maurizio Lenzerini. Managing data through the lens of an ontology. AI Magazine, 39(2):65 74, 2018. [Levy et al., 1995] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In Proc. of the 14th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS), pages 95 104, 1995. [Lutz et al., 2018] Carsten Lutz, Johannes Marti, and Leif Sabellek. Query expressibility and veriﬁcation in ontology-based data access. In Proc. of the 16th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR), pages 389 398, 2018. [Maier et al., 1979] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. Testing implications of data dependencies. ACM Trans. on Database Systems, 4(4):455 469, 1979. [Miles and Bechhofer, 2009] Alistair Miles and Sean Bechhofer. SKOS Simple Knowledge Organization System. W3C Recommendation, World Wide Web Consortium, 2009. Available at http://www.w3.org/TR/skos-reference. [Motik et al., 2012] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, and Carsten Lutz. OWL 2 Web Ontology Language proﬁles (second edition). W3C Recommendation, World Wide Web Consortium, 2012. Available at http://www.w3.org/TR/owl2-proﬁles/. [Ortiz, 2018] Magdalena Ortiz. Improving data management using domain knowledge. In Proc. of the 27th Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pages 5709 5713, 2018. [Poggi et al., 2008] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. J. on Data Semantics, X:133 173, 2008. [Sagiv and Yannakakis, 1980] Yehoshua Sagiv and Mihalis Yannakakis. Equivalences among relational expressions with the union and difference operators. J. of the ACM, 27(4):633 655, 1980. [Weibel et al., 1998] Stuart Weibel, John A. Kunze, Carl Lagoze, and Misha Wolf. Dublin core metadata for resource discovery. Request for Comments, 2413:1 8, 1998. [Xiao et al., 2018] Guohui Xiao, Diego Calvanese, Roman Kontchakov, Domenico Lembo, Antonella Poggi, Riccardo Rosati, and Michael Zakharyaschev. Ontologybased data access: A survey. In Proc. of the 27th Int. Joint Conf. on Artiﬁcial Intelligence (IJCAI), pages 5511 5519, 2018. [Zheng et al., 2013] Zibin Zheng, Jieming Zhu, and Michael R. Lyu. Service-generated big data and big data-as-a-service: An overview. In Proc. of the 2013 IEEE Int. Conf. on Big Data, pages 403 410, 2013.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)