# monotone_abstractions_in_ontologybased_data_management__a7c9d1b4.pdf

Monotone Abstractions in Ontology-Based Data Management

Gianluca Cima1, Marco Console2, Maurizio Lenzerini2, Antonella Poggi2

1CNRS & University of Bordeaux 2Sapienza University of Rome gianluca.cima@u-bordeaux.fr, {console, lenzerini, poggi}@diag.uniroma1.it

In Ontology-Based Data Management (OBDM), an abstraction of a source query q is a query over the ontology capturing the semantics of q in terms of the concepts and the relations available in the ontology. Since a perfect characterization of a source query may not exist, the notions of best sound and complete approximations of an abstraction have been introduced and studied in the typical OBDM context, i.e., in the case where the ontology is expressed in DL-Lite, and source queries are expressed as unions of conjunctive queries (UCQs). Interestingly, if we restrict our attention to abstractions expressed as UCQs, even best approximations of abstractions are not guaranteed to exist. Thus, a natural question to ask is whether such limitations affect even larger classes of queries. In this paper, we answer this fundamental question for an essential class of queries, namely the class of monotone queries. We deﬁne a monotone query language based on disjunctive Datalog enriched with an epistemic operator, and show that its expressive power sufﬁces for expressing the best approximations of monotone abstractions of UCQs.

1 Introduction

In Ontology-Based Data Management (OBDM) (Poggi et al. 2008), an ontology, i.e., a formal, logic-based representation of a domain of interest, is used to provide a high-level conceptual tool for accessing and managing the data sources of an information system. Suitable mappings declaratively specify the relationship between the data at the sources and the elements in the ontology, and this enables the user to carry out many relevant tasks on data through the lens of the ontology (Lenzerini 2018; De Giacomo et al. 2018). In the last decade, there have been extensive investigations on the problem of answering queries expressed over the ontology by means of algorithms that, taking into account both the ontology and the mapping, produce suitable queries that extract the correct data from the sources. The problem is complicated by the need of reasoning about all components of the system, so as not to miss any answer logically implied by the axioms expressing both the ontology and the mapping (see the surveys in (Bienvenu 2016; Xiao et al. 2018; Ortiz 2018)).

Copyright c 2022, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

Recent papers (Cima 2017; Lutz, Marti, and Sabellek 2018; Cima, Lenzerini, and Poggi 2019, 2020b; Cima et al. 2021) address a different issue in OBDM: starting from a query q S expressed over the sources, the goal is to ﬁnd a so-called abstraction of q S (Cima 2022), i.e., an ontologybased characterization of q S expressed in terms of the ontology elements, and whose answers coincide with the answer to the original query, modulo the ontology and the mapping. We encountered the need of abstraction during a joint project with a public statistical research institute. The institute s departments must publish subsets of the data they gather in the form of semantically described linked open data. To compute the content of the datasets the departments execute suitable queries over the data sources mapped to a shared ontology. Notably, when the dataset is published, it must be documented through a SPARQL query expressed in terms of the ontology. This task is currently done manually. The notion of abstraction perfectly captures this scenario and provides the formal tool for automating the process: given the query over the sources computing the content of the dataset, the abstraction of such query with respect to the mapping and the ontology is exactly the SPARQL query to be associated to the open dataset. Besides the above use case, abstraction can be the appropriate tool for checking the quality of mapping in capturing relevant source queries (Lutz, Marti, and Sabellek 2018), or deriving ontology-mediated queries from examples (Ortiz 2019; Cima, Croce, and Lenzerini 2021).

The ﬁrst investigations on abstraction appear in (Cima 2017; Lutz, Marti, and Sabellek 2018). Both papers point out that the perfect abstraction of a union of conjunctive queries (UCQ) expressed over the data source not always exists, and present algorithms for computing such abstraction in the case where it both exists, and can be expressed as a UCQ over the ontology. In (Cima 2017; Cima, Lenzerini, and Poggi 2019) the notion of (sound and complete) approximations of the perfect abstraction is introduced, exactly to cope with situations in which perfectness cannot be achieved. Moreover, both papers make it clear that, for a given class of queries C, one is probably interested in two speciﬁc forms of approximations, called C-minimally complete and C-maximally sound abstractions. Based on these notions, (Cima, Lenzerini, and Poggi 2019) presents a thorough analysis of the veriﬁcation problem (check whether a given query is a complete or sound abstraction) and the com-

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

putation problem of UCQ-minimally complete and UCQmaximally sound abstractions of UCQ source queries in OBDM systems based on DL-Lite. In (Cima, Lenzerini, and Poggi 2020b) the computation problem is studied in the context of a speciﬁc class of non-monotone queries for expressing abstractions, and it is shown that this class can provide abstractions that are better than the one in the UCQ class. Thus, with the exception of (Cima, Lenzerini, and Poggi 2020b), all the results on abstractions have been obtained under the assumptions that abstractions are expressed as UCQs over the ontology, and many of them originate from the observations that best approximations in the UCQ class are not guaranteed to exist. Thus, a natural issue to investigate is whether such limitations affect even larger classes of queries for expressing abstractions. The main goal of this paper is to address the following question: do approximations of perfect abstraction that are best in a fundamental class of queries, namely the class of monotone queries, always exist? Obviously, a related goal is to derive algorithms for computing approximations of abstractions that are best in the class of monotone queries, if they exist. The class of monotone queries includes queries expressible in First Order Logic (FOL) and is therefore extremely important. In this paper we answer positively to the above-mentioned fundamental question. More speciﬁcally, we provide the following contributions. 1. We present a general framework for abstraction in OBDM, based on the deﬁnition of queries as functions from the logical models of OBDM systems to sets of tuples. This is important for the goal of making our investigation, and in particular the notion of monotone queries, to be independent from any particular syntactic form. The usual concept of queries whose semantics is based on certain answers is a special case of our general notion. 2. The framework includes a new monotone query language, called Datalog K, based on disjunctive Datalog enriched with inequalities and an epistemic operator. 3. We consider a scenario where the OBDM speciﬁcation J is based on DL-Lite RDFS (i.e., the fragment of RDFS expressible in Description Logic), and present algorithms that, given a source query q S expressed as UCQ, compute the best (sound or complete) approximations of the J-abstraction of q S in the class of monotone queries, expressed in Datalog K. 4. The above algorithms provide the proof that, in the considered scenario, for any UCQ q S, the best (sound or complete) approximations of the J-abstraction of q S in the class of monotone queries always exists. As a consequence, if the perfect abstraction exists and is in the class of monotone queries, then it can be expressed in Datalog K. The paper is organized as follows. In Section 2 we recall some preliminary notions. In Section 3 we describe the formal framework for abstraction, including Datalog K. Starting from Section 4, we focus on the above-mentioned scenario, and present the algorithms for computing the best complete (Section 4) and the best sound (Section 5) abstractions of source UCQs. In Section 6 we discuss perfect mono-

tone abstractions. Finally, Section 7 concludes the paper by mentioning future works in the context of abstraction.

2 Preliminaries A relational schema R is a ﬁnite set of predicate symbols (each one with a ﬁxed arity) and an R-database D is a ﬁnite set of facts using predicate symbols in R and constants from a countable inﬁnite set Const. A query q of arity n over a schema R is a function mapping each R-database to n-tuples of constants occurring in Const. We denote by q D the set of tuples obtained by applying q to the R-database D. For two queries q1 and q2 over a schema S, we write q1 R q2 if q D 1 q D 2 for each R-database D. Let Var be a countable inﬁnite set of symbols, called variables, disjoint from Const. A relational atom over R is an expression constituted by a predicate symbol in R applied to as many arguments (constants or variables) as its arity. We also consider inequality atoms, where the predicate symbol is inequality ( =). For a set of atoms C, we let Dom(C) = Var(C) Const(C), where Var(C) and Const(C) are the set of all variables and all constants, respectively, occurring in C. A conjunctive query (CQ) q over a schema R is a query expressible as { x | y.φ( x, y)}, where x is a tuple of variables, called the distinguished variables of q, y is a tuple of variables, called the existential variables of q, and φ( x, y) is the body of q, i.e., a ﬁnite conjunction of relational atoms over R containing all the distinguished and existential variables of q. The arity of q is the arity of x. A union of conjunctive queries (UCQ) is a ﬁnite union of CQs with same arity, called its disjuncts. A (U)CQ with inequality ((U)CQ =) is simply a (U)CQ allowing inequality atoms in its body. Let C1 and C2 be two sets of atoms. A homomorphism h from C1 to C2 (written h : C1 C2) is a mapping from Dom(C1) to Dom(C2) such that: (i) h(α) C2, for each relational atom α C1; (ii) h(c) = c, for each constant c Const(C1); and (iii) h(x) = h(y), for each inequality atom x = y occurring in C1. Given a signature of unary and binary predicates, a Description Logic (DL) ontology is simply a TBox (set of axioms) expressed in a speciﬁc DL (Baader et al. 2003). We assume a special unary predicate whose extension in every interpretation coincides with the entire domain. In the technical sections, we are interested in ontologies expressed in DL-Lite RDFS (Cuenca Grau 2004; Rosati 2007; Cima, Lenzerini, and Poggi 2020a). Such language corresponds to the fragment of RDFS (Brickley and Guha 2014) expressible in DL-Lite. Speciﬁcally, a DL-Lite RDFS ontology is a ﬁnite set of assertions of the form B A or R1 R2, where B denotes a basic concept, i.e., an expression of the form A, P, or P , with A and P denoting an atomic concept and an atomic role, respectively, and R1 and R2 denote basic roles, i.e., expressions of the form P, or P . An OBDM speciﬁcation J is a triple O, S, M , where O is a DL ontology, S is a relational schema, called the source schema of J, and M is a Global-and-Local-As-View (GLAV) mapping, i.e., a set of GLAV mapping assertions of the form:

x. y.φ( x, y) z.ψ( x, z),

where x are called the frontier variables of m, φ( x, y), called the body of m, is a conjunction of relational atoms over S, and ψ( x, z), called the head of m, is a conjunction of relational atoms over O. An OBDM system is a pair Σ = J, D , where J = O, S, M is an OBDM speciﬁcation, and D is an Sdatabase. For a mapping M relating a source schema S to an ontology O and an S-database D, we denote by M(D) the set of atoms over O obtained by chasing D with respect to M as described in (Fagin et al. 2005). The semantics of an OBDM system is given in terms of interpretations for O following the so-called standard names assumption (Calvanese et al. 2007a). This means that the domain of every interpretation is Const, with every constant in the signature interpreted as itself. Note that this also implies the unique name assumption. We say that an interpretation I for O is a model of an OBDM system Σ = O, S, M , D if (i) I |= O, and (ii) (D, I) |= M, where (D, I) is the interpretation obtained from I by adding the set of facts corresponding to D. We denote by mod( J, D ) the set of models of an OBDM system J, D , and when mod( J, D ) = we say that D is consistent with J. We next provide an example of OBDM speciﬁcation.

Example 1. Let J = O, S, M be an OBDM speciﬁcation such that O = , S = {s1/2, s2/1, s3/1} and M = {m1, m2, m3, m4}, where:

m1 : x1, x2.s1(x1, x2) E(x1, x2) m2 : x.s1(x, x) SN(x) m3 : x.s2(x) z.E(x, z) m4 : x.s3(x) E(x, x)

The mapping M of J establishes how predicates in the schema S relate to the ontology predicates E (which stands for edge) and SN (which stands for special node).

A query q of arity n over the signature of an ontology O is a function whose domain is the power set of the set of all the interpretations for the ontology, and whose range is the power set of the set of all n-tuples of constants in Const. A query q over J = O, S, M is a query over the signature of the ontology O that, for every S-database D, associates to mod( J, D ) a set of tuples of constants. The answers to q with respect to the OBDM system J, D , denoted as q J,D, is obtained by applying the function q to the set of models mod( J, D ). Although our framework is general enough to accept any kind of function as query, a very common method for deﬁning q J,D when q is expressed as a FOL query is by sanctioning that the function q applied to mod( J, D ) returns the set of certain answers. A certain answer is a tuple of constants in Dom(J, D, q) that, once assigned to the distinguished variables of q, satisﬁes of q in every model in mod( J, D ). Here, Dom(J, D, q) is the set of all constants occurring in J, D, and q. Given a set of constants Λ Const, we denote by q J,D |Λ the answers to q w.r.t. J, D restricted to the constants of Λ, i.e., q J,D |Λ = q J,D Λn. The class of monotone queries is particularly important in this paper. In the context of an OBDM speciﬁcation J = O, S, M , a query q over J is monotone if, for every

pairs of S-databases D, D , mod( J, D ) mod( J, D ) implies q J,D

Λ(D,D ) q J,D Λ(D,D ), where Λ(D, D ) = Dom(J, D, q) Dom(J, D , q). This captures the usual intuition of monotonicity of queries in logic: when the knowledge possessed by the formal system increases (i.e., the set of models decreases), the answers to a query increase. Note that the restriction on the constants occurring in both the active domains of D and D (other than in J and q) is imposed for ensuring that the property of monotonicity looks only for those constants occurring in both databases. We use MJ to denote the class of all monotone queries in the context of J, and when J is understood, we simply use M. Note that, for each OBDM speciﬁcation J, MJ includes the whole class of FOL queries evaluated under the certain answer semantics, that is arguably the most studied class in both the KR and the DB literature. For two queries q1 and q2 over J, we write q1 J q2 if q J,D 1 q J,D 2 for each S-database D, and write q1 J q2 if q1 J q2 and q J,D 1 q J,D 2 for at least one S-database D. We say that q1 and q2 are J-equivalent if both q1 J q2 and q2 J q1 hold.

3 Framework In what follows, we implicitly refer to an OBDM speciﬁcation J = O, S, M , and when we denote a query by q O (resp., q S) we mean that the query is a query for J (resp., a source query), i.e., is over the signature of the ontology O (resp., the schema S). We follow (Cima, Lenzerini, and Poggi 2019) for the basic deﬁnitions related to abstraction. We say that q O is a perfect J-abstraction of q S if q J,D O = q D S , for each S-database D consistent with J. As the condition for an ontology query to be a perfect abstraction of a source query is a strong one, it might be very well the case that a perfect abstraction of a source query does not exist. It is then reasonable to consider weaker notions, such as sound or complete approximations, of perfectness. We say that q O is a complete (resp. sound) J-abstraction of q S if q D S q J,D O (resp. q J,D O q D S ), for each S-database D consistent with J. Obviously, we might be interested in complete or sound abstractions that approximate q S at best, at least in the context of a speciﬁc class of queries. If LO is a class of queries, we say that a query q O LO is an LO-minimally complete (resp., LO-maximally sound) J-abstraction of q S if q O is a complete (resp., sound) J-abstraction of q S and there is no query q O LO such that q O is a complete (resp., sound) J-abstraction of q S and q O J q O (resp., q O J q O). We now introduce a new language, called Datalog K, for expressing queries over OBDM speciﬁcations. The language is based on disjunctive Datalog, and is used in this paper for expressing abstractions. The basic component of a Datalog K query is a rule. Assume two disjoint and countably inﬁnite sets of predicates E and P, called extensional and intensional, respectively. A Datalog K rule has one of the following forms: The typical form of disjunctive Datalog, i.e., γ( x) α1( x1) . . . αn( xn) (1)

where γ( x) is a conjunction of relational atoms on the predicates of P with x as variables, and for i = 1, . . . , n, αi( xi) is a single relational atom whose predicate is in P such that xi x, A new form speciﬁed as follows

K( z.φ( x, z) ξ( x)) ψ1( x1) . . . ψn( xn) (2)

where φ( x, z) is a conjunction of relational atoms on the predicates of E, ξ( x) is a conjunction of inequality atoms involving only variables from x and for j = 1, . . . , n, ψj( xj) = yj.γj( xj, yj), where γj( xj, yj) is a conjunction of relational atoms on P such that xj x.

An n-ary Datalog K query q O over an OBDM speciﬁcation J is a ﬁnite set of Datalog K rules whose extensional predicates coincide with the alphabet of O, and whose intensional predicates include a special n-ary predicate Ans. The semantics of q O is provided relative to an OBDM system. Given an OBDM system J, D , an interpretation for q O is a pair I = (mod( J, D ), f), where f is a ﬁrst-order interpretation (with domain Const) for the predicates in P. As usual, we may also see f as the set of facts {p( c) | c pf}. We now deﬁne when I satisﬁes a Datalog K rule. I satisﬁes a rule of the form (1) if the ﬁrst-order formula x.γ( x) α1( x1) . . . αn( xn) is true in f, I satisﬁes a rule of the form (2) if for all tuples c of constants, the fact that c is a certain answers to the FOL query z.φ( x, z) ξ( x) w.r.t. J, D implies that f satisﬁes the formula yj.γj( cj, yj), for some j = 1, . . . , n. An interpretation I for q O is called a model of q O if all the rules of q O are satisﬁed by I. Finally, we deﬁne the notion of answers to an n-ary Datalog K query q O w.r.t. an OBDM system J, D , denoted by q J,D O , as follows: { c Dom(J, D, q O)n | c Ansf for each model (mod( J, D ), f) of q O}. Example 2. Consider the OBDM speciﬁcation J illustrated in Example 1. The following Datalog K query q O over J returns all the pairs (v1, v2) of special nodes that are known to be distinct and such that there is a path from v1 to v2 passing only for nodes known to be special:

K(E(x1, x2) SN(x1) SN(x2)) T1(x1, x2) K(SN(x1) SN(x2) x1 = x2) T2(x1, x2) T1(x1, y) T1(y, x2) T1(x1, x2) T1(x1, x2) T2(x1, x2) Ans(x1, x2)

The following proposition shows that Datalog K is a monotone query language. Proposition 1. Every Datalog K query over J is in MJ, for every OBDM speciﬁcation J. The semantics should make it clear that K is the knowledge operator in the S5 epistemic logic: the formula KA should be read as A is known (i.e., logically implied) by the system (Levesque and Lakemeyer 2001). Therefore, when accessing the information modeled by J, D , a Datalog K query extracts what is known by the system, and this characteristic is crucial for not falling into undecidability resulting from using Datalog rules jointly with Description Logics

(see (Levy and Rousset 1998; Calvanese and Rosati 2003)), as stated in the following proposition.

Proposition 2. Let Σ be an OBDM system. Answering Datalog K queries w.r.t. Σ is decidable if and only if answering CQs w.r.t. Σ is decidable1.

Although our framework is general enough to consider any DL for expressing ontologies and any query language for expressing source queries, in the rest of this paper we will carry out our investigation in the following setting: (i) ontologies are expressed in DL-Lite RDFS, and (ii) source queries are expressed as UCQs. At this point, one may wonder whether Datalog K is the right language to express monotone abstractions in this setting. While a thorough analysis of the language is outside the scope of the present paper, the following proposition provides a positive answer to this question, at least from the computational point of view.

Proposition 3. In our setting, (i) answering Datalog K queries is in co NP in data complexity, and (ii) there exists an OBDM speciﬁcation J and a CQ q S such that, given an S-database D, answering the M-maximally sound Jabstraction of q S is co NP-hard in data complexity.

To ease the presentation, from now on we assume that mappings and source queries do not mention constants. However, all our results can be straightforwardly adapted to the case where constants are allowed.

4 Minimally Complete Abstractions We now prove that M-minimally complete J-abstractions of UCQs always exist and can be expressed in Datalog K. Actually, a fragment of Datalog K sufﬁces for this purpose, namely the one using only rules of form (2) without disjunction. We start with some useful subroutines. Given a CQ q S = { x | y.φ( x, y)}, the subroutine Saturate Q(q S) computes a UCQ = in the following way: for each possible uniﬁer µ on the variables in x y such that µ(x) x for each x x, Saturate Q(q S) contains a query obtained from µ(q S) by adding the inequality atom t1 = t2 for each pair of distinct variables t1, t2 occurring in µ(q S). For a UCQ q S, we denote by Saturate Q(q S) the UCQ = obtained by applying Saturate Q(q) to each disjunct q of q S. We write each CQ = q generated by Saturate Q(q S) as q = { x | y.φ( x, y) ξ( x, y)}, where φ( x, y) and ξ( x, y) are the conjunctions of the relational atoms over S and of the inequality atoms, respectively, occurring in the body of q. Moreover, for an OBDM speciﬁcation J = O, S, M and a CQ = q = { x | y.φ( x, y) ξ( x, y)} over S, we denote by rq the following Datalog K rule of form (2):

rq = K( z. ( x) M(q) ξ( x, Y)) Ans( x),

where (i) ( x) denotes (x1) . . . (xn) for x = (x1, . . . , xn), (ii) M(q) is computed by simply ignoring the inequality atoms and chasing the set of relational atoms occurring in the body of q; (iii) Y y is the subset of the

1With answering we implicitly refer to the associated recognition problem, i.e., check whether a tuple is in the answer to a query.

existential variables of q occurring in M(q); (iv) z are the fresh variables introduced when computing M(q); and (v) ξ( x, Y) is the conjunction of the inequality atoms obtained from ξ( x, y) by removing all those atoms of the form y = t and t = y in which y is an existential variable occurring in y but not in Y (i.e., not in M(q)) and t is any other possible variable. Observe that the epistemic operator is exploited to bind the existential variables coming from q. This is achieved by pushing the subset Y of the existential variables y of q occurring in M(q) inside the K operator. We are now ready to present the algorithm M-Min Complete for computing the M-minimally complete J-abstractions. Given an OBDM speciﬁcation J = O, S, M and a UCQ q S over S such that Saturate Q(q S) = q1 . . . qn, M-Min Complete(J, q S) outputs the Datalog K query q O = {rq1, . . . , rqn} over J. Example 3. Consider the OBDM speciﬁcation J illustrated in Example 1 and the CQ q S = {(x) | y.s1(x, y)} over S. One can verify that M-Min Complete(J, q S) returns the following Datalog K query q O over J asking for all those nodes v such that either v is connected to a node v known to be different from v or v is a special node with a self-loop:

K(E(x, y) x = y) Ans(x) K(E(x, x) SN(x)) Ans(x)

Note that q O is a better complete approximation than the query {(x) | y.E(x, y)}, which is the UCQ-minimally complete J-abstraction of q S (Cima, Lenzerini, and Poggi 2019). Theorem 1. M-Min Complete(J, q S) terminates and returns the unique (up to J-equivalence) M-minimally complete J-abstraction of q S. Furthermore, we observe that the result of M-Min Complete(J, q S) is independent from the assertions occurring in the ontology of the OBDM speciﬁcation J. Similar results can be obtained for OBDM speciﬁcations based on more expressive Horn DL ontologies. Before concluding this section, we point out that answering Datalog K queries returned by M-Min Complete(J, q S) w.r.t. OBDM systems of our setting is FOL-rewritable, as the following proposition shows. Proposition 4. Let q O be a Datalog K with only rules of the form K( z.φ( x, z) ξ( x)) Ans( xa) over an OBDM speciﬁcation J. It is possible to compute a UCQ = qr over S such that q J,D O = q D r , for each S-database D.

5 Maximally Sound Abstractions We present the algorithm M-Max Sound for the computation of M-maximally sound J-abstractions of UCQs. The output of M-Max Sound is a Datalog K query, thus proving that such abstractions always exist and can be expressed in Datalog K. To present our algorithm, we ﬁrst need to introduce the notion of inverse mappings, which extends the one introduced in (Duschka and Genesereth 1998) for the problem of rewriting queries using disjunctive views2.

2We refer to (Cima et al. 2021) for the relation between abstraction and the problem of rewriting queries using disjunctive views.

To ease the presentation of our techniques, we now introduce a technical tool to encode a DL-Lite RDFS ontology into a set of GLAV mapping assertions. Given a conjunction of atoms ψ over a DL-Lite RDFS ontology O, we denote by O(ψ) the conjunction of atoms over O obtained by chasing the atoms in ψ w.r.t. O as described in (Calvanese et al. 2007b). Note that, since O is a DL-Lite RDFS ontology, O(ψ) is always ﬁnite and do not introduce further existential variables. Starting from an OBDM speciﬁcation J = O, S, M , we can generate an equivalent OBDM speciﬁcation J = , S, M as follows. For each mapping assertion m M of the form x. y.φ( x, y) z.ψ( x, z), the mapping M contains the mapping assertion x. y.φ( x, y) z.O(ψ( x, z)). From now on, without loss of generality, we assume OBDM speciﬁcations with an empty set of ontological assertions. Furthermore, in each mapping M relating a source schema S to an ontology O, again without loss of generality we assume that M contains the assertion x1, . . . , xn.s(x1, x2, . . . , xn) (x1) (x2) . . . (xn) for each n-ary predicate s S.

5.1 Inverse Mapping Our algorithm for the computation of M-maximally sound abstractions is based on the notion of inverse mappings. Intuitively, given a GLAV mapping M, the inverse of m M is a Datalog K rule that characterizes those that may give rise to an homomorphic image of the head of m when chased with M. To compute inverse mappings, we will deﬁne the algorithm Inv Map. In order to deﬁne such algorithm, we need to introduce several additional notions. Splitting. Given a set of relational atoms A and a set of variables U occurring in A, we deﬁne the U-graph GU A of A as follows: (i) each atom α in A is a node of GU A, and (ii) there is an edge between nodes α1, α2 A if and only if there exists a variable z U such that z occurs in both α1 and α2. The splitting of A around U (written Split(A, U)) is the set of all those C A such that the atoms in C form a connected component of GU A, i.e., C is a maximal subset of A s.t. there exists a path from α to α in GU A, for each α, α C. Whenever necessary, we will assume that sets in Split(A, U) are presented as a tuple. Saturate. Given a GLAV mapping M, the algorithm Saturate M(M) produces an equivalent set of rules with explicit inequalities on the frontier variables. Given a GLAV mapping assertion m = x. y.φ( x, y) z.ψ( x, z), the algorithm Saturate M(m) computes a set of rules in the following way: for each possible uniﬁer µ between the variables in x, Saturate M(m) contains a rule obtained from µ(m) by adding the inequality atom x1 = x2 in the body of µ(m), for each pair of distinct variables x1, x2 µ( x). For a GLAV mapping M, we denote by Saturate M(M) the set of rules obtained by applying Saturate M(M) to each mapping assertion m of M. It is easy to see that (D, I) |= M if and only if (D, I) |= Saturate M(M). We call each assertion in Saturate M(M) a saturated GLAV mapping assertion and Saturate M(M) a saturated GLAV mapping. We write each rule m Saturate M(M) as φ( x, y) ξ( x) z.ψ( x, z), where φ( x, y) and ψ( x, z) are conjunctions of relational

atoms over S and over O, respectively, whereas ξ( x) is a conjunction of inequality atoms. We write r-body(m) for the set of relational atoms in φ( x, y), and, as in GLAV mappings, head(m) for the set of relational atoms in ψ( x, z). Generators and Supports. Assume a saturated GLAV mapping M from a source schema S to an ontology O. Without loss of generality, we assume that mapping assertions in M use distinct variables. Given a set of atoms A over O, a generator of A w.r.t. M is a pair g = h, m , where h is a homomorphism from A to the head of m and m M. A generator g is V -distinguished, for some V Var(A), if h(v) is a frontier variable of m, for each v V , and h(u) is not a frontier variable of m, for each u V . We write Gens(A, M, V ) for the set of V -distinguished generators of A w.r.t. M. Intuitively, a generator g = h, m of A w.r.t. M represents a possible way to obtain a homomorphic copy B M(D) of A by applying m to an S-database D. If such generator is V -distinguished, then the only constants occurring in B are images of the variables in V . Example 4. Let M = {m A, m B, m C, m D}, where:

m A : s1(x1, x2) x1 = x2 Z.P1(x1, Z) P2(Z, x2),

m B : s(x3, x4) x3 = x4 Z .P1(x3, Z ) P2(Z , x4), m C : s3(x5, y1) s4(y1, y2) P1(x5, x5), m D : s5(x6, x7) x6 = x7 P1(x6, x7) P2(x7, x6).

For the set of atoms A1 = {P1(x1, Z), P2(Z, x2)}, we have that Gens(A1, M, {x1, x2}) = {g1, g2} with g1 = h1, m A and g2 = h2, m B where h1 is the identity function, whereas h2(x1) = x3, h2(x2) = x4, and h2(Z) = Z . For the set of atoms A2 = {P1(x1, Z)}, we have that Gens(A2, M, {x1, Z}) = {g3, g4} with g3 = h3, m C and g4 = h4, m D where h3(x1) = x5 and h3(Z) = x5, whereas h4(x1) = x6 and h4(Z) = x7. For the set of atoms A3 = {P2(Z, x2)}, we have that Gens(A3, M, {Z, x2}) = {g5} with g5 = h5, m D where h5(Z) = x7 and h5(x2) = x6. We now extend the notion of generators to the case of multiple applications of mapping assertions. Given a tuple of sets of atoms A = A1, . . . , An and a set of variables V Var(A1) . . . Var(An), the V -distinguished generators of A w.r.t. M are all the tuples G = g1, . . . , gn such that gi Gens(Ai, M, V ), for each i = 1, . . . , n. With a slight abuse of notation, we use Gens(A, M, V ) to denote the set of all tuples of V -distinguished generators of elements of A w.r.t. M and we call generator of A every such tuple. Intuitively, each G Gens(A, M, V ) represents a possible way to obtain a homomorphic copy of A using a single assertion of M for each A A. Example 5. Refer to Example 4. For A = A1 and A = A2, A3 , we have Gens(A, M, {x1, x2}) = {G1, G2} and Gens(A , M, {x1, x2, Z}) = {G3, G4}, resp., where G1 = g1 , G2 = g2 , G3 = g3, g5 , and G4 = g4, g5 . We now characterize those databases D for which M(D) contains an homomorphic copy of A generated using the the mappings constituting some G Gens(A, M, V ). To this end, we deﬁne a formula σG that we call the support of G Gens(A, M, V ). Given G = g1, . . . , gn in

Gens(A, M, V ) with gi = hi, mi for each i = 1, . . . , n, let ρ1, . . . , ρn be a tuple of renaming for the variables of M whose images are pairwise disjoint. For i = 1, . . . , n, we deﬁne βi = ρi(r-body(mi)) and σG = V i ρi(r-body(mi)). Intuitively, each βi represents a set of relational source atoms that is needed to generate an homomorphic image of hi(Ai) using mi. Example 6. Refer to Example 5. We have σG1 = s1(x 1, x 2), σG2 = s(x 3, x 4), σG3 = s3(x 5, y 1) s4(y 1, y 2) s5(x 6, x 7), and σG4 = s5(x 6, x 7) s5(x 6, x 7). In order to connect together the different hi(Ai) to form an image of A, we need to impose further restrictions on the variables of σG. These restrictions will take the form of equalities. Let ηG denote binary relation over Var(σG) deﬁned as follows: for each i, j = 1, . . . , n, and v V , (ρi(hi(v)), ρj(hj(v))) ηG. Moreover, given a set x V , let η[ x] denote the binary relation over Var(σG) x deﬁned as follows: for each i = 1, . . . , n, and x x, (x, ρi(hi(x))) η[ x]. We use η[ x] G to denote the reﬂexive and transitive closure of ηG η[ x].

Example 7. Refer to Example 6. We have that η[{x1,x2}] G1 ,

η[{x1,x2}] G2 , η[{x1,x2}] G3 , and η[{x1,x2}] G4 are the reﬂexive and transitive closure of {(x1, x 1), (x2, x 2)}, {(x1, x 3), (x2, x 4)}, {(x 5, x 7), (x1, x 5), (x2, x 6)}, {(x 7, x 7), (x1, x 6), (x2, x 6)}, respectively.

As customary, for each equivalence class E of η[ x] G , we select an element s(E) E as representative; if E contains elements of x, we require s(E) x. Moreover, given v Var(σG), we use v G, x to denote the equivalence class of η[ x] G that contains v. Finally, we deﬁne the x-support σ x G of G as the formula w.η[ x] G (σG), where η[ x] G (σG) is obtained from σG by replacing each variable v Var(σG) with s(v G, x) and w are the variables of η[ x] G (σG) not occurring in x. Example 8. Refer to the previous example. We have that σ{x1,x2} G1 = s1(x1, x2), σ{x1,x2} G2 = s(x1, x2), σ{x1,x2} G3 =

y 1, y 2.s3(x1, y 1) s4(y 1, y 2) s5(x2, x1), and σ{x1,x2} G4 = x 7.s5(x1, x 7) s5(x2, x 7).

The Inv Map Algorithm. We are now ready to formally deﬁne the Inv Map algorithm. In the algorithm, we use Partitions( z), where z is a set of variables, to denote the set of all 2-tuples of pairwise disjoint sets U, V such that U V = z. Intuitively, given an OBDM speciﬁcation J = O, S, M , Inv Map(J) returns a set of Datalog K rules that represents a sound inverse of M w.r.t. O, i.e., a set of formulae that, given an S-database D, characterize those databases D such that mod(J, D) mod(J, D ). Equivalently, given an S-database D, Inv Map(J) characterizes all those databases D such that M(D ) is an homomorphic copy of M(D). In order to deﬁne a sound inverse of M w.r.t. O, Inv Map(J) relies on the notions of supports and generators. More speciﬁcally, for every saturated mapping assertion φ( x, y) ξ( x) z.ψ( x, z), Inv Map(J) contains

Algorithm 1: Inv Map

Input: OBDM speciﬁcation , S, M Output: Set R of Datalog K rules 1: M = Saturate M(M) 2: R = 3: for all m : φ( x, y) ξ( x) z.ψ( x, z) M do 4: Γ = 5: for all U, V Partitions( z) do 6: Let Sψ = Split(ψ, U) 7: for all G Gens(Sψ, M , V x) do

8: if (x, x ) η[ x] G for each x, x x s.t. x = x then 9: Let σ x G be the x-support of G 10: Γ = Γ {σ x G} 11: end if 12: end for 13: end for 14: inv(m) = K( z.ψ( x, z) ξ( x)) W γ Γ γ 15: R = R {inv(m)} 16: end for 17: return R

a Datalog K rule of the form K( z.ψ( x, z) ξ( x)) W γ Γ γ, where Γ is the set of all x-supports of the generators of all possible splittings of ψ. Intuitively, the latter guarantees that, if (mod( J, D ), D ) satisﬁes Inv Map(J), for every atom α M(D), then there exists an assertion m M such that h(body(m)) D and ℓ(α) h(head(m)), where h and ℓare homomorphisms.

Example 9. Consider the OBDM speciﬁcation J illustrated in Example 1. One can verify that Inv Map(J) returns the following set R of Datalog K rules:

K(E(x1, x2) x1 = x2) s1(x1, x2) K(E(x, x)) s3(x) s1(x, x) K( z.E(x, z)) s2(x) y.s1(x, y) s3(x) K(SN(x)) s1(x, x)

In the reminder of this section, we introduce two crucial properties of the Inv Map algorithm. Assume an OBDM speciﬁcation J and an S-database D. One can show that (mod( J, D ), D) satisﬁes Inv Map(J).

Lemma 1. For every S-database D, (mod( J, D ), D) |= Inv Map(J).

As it will turn out, Lemma 1 guarantees that our approximation is sound. Moreover, one can show that, whenever (mod( J, D ), D ) |= Inv Map(J), we have that M(D) can be homomorphically mapped into M(D ). Informally, this property guarantees that our approximation is maximal.

Lemma 2. For every pair of S-databases D, D such that (mod( J, D ), D ) |= Inv Map(J), there exists a homomorphism from M(D) to M(D ).

5.2 Maximally Sound Abstraction Algorithm With the Inv Map algorithm and its properties at hand, we are now ready to focus on the problem of computing the M-maximally sound J-abstractions. First, let us introduce

some additional notations. For an OBDM speciﬁcation J = O, S, M , the standard predicate renaming of J, denoted by ren J, is the renaming such that ren J(s) = s , for every s S, where s is an IDB predicate with ar(s) = ar(s ). Furthermore, with a slight abuse of notation, given a set of formulae Ψ over S O, we write ren J(Ψ) for the set of formulae obtained from Ψ by replacing each occurrence of s in Ψ with ren J(s), for each s S. We can now present the algorithm M-Max Sound for computing the M-maximally sound J-abstractions. Given an OBDM speciﬁcation J = O, S, M and a UCQ q S = q1 S . . . qn S over S such that qi S = { xi | yi.φi( xi, yi)} for each i = 1, . . . , n, M-Max Sound(J, q S) simply outputs the Datalog K query q O = ren J(Ψ), where Ψ = Inv Map(J) {φ1( x1, y1) Ans( x1)} . . . {φn( xn, yn) Ans( xn)}. We conclude this section by establishing termination and correctness of the M-Max Sound algorithm.

Theorem 2. M-Max Sound(J, q S) terminates and returns the unique (up to J-equivalence) M-maximally sound Jabstraction of q S.

6 Perfect Abstractions

It follows from Theorem 1 that either the perfect abstraction of a source UCQ can be expressed in Datalog K, or it cannot be expressed as a monotone query (if it exists at all). We now present an algorithm that, given an OBDM speciﬁcation J and a source UCQ q S, returns the perfect J-abstraction of q S, if and only if it exists and is in M. To this aim, we make use of Proposition 4, and refer to the qr deﬁned in that proposition as the rewriting of q O w.r.t. J. Our algorithm, that we call M-Perfect, goes as follows. Given an OBDM speciﬁcation J = O, S, M and a UCQ q S over S as input: if q O = M-Min Complete(J, q S) is such that qr S q S; then return q O; otherwise, report no perfect J-abstraction of q S is in M .

Example 10. In Example 3, M-Perfect(J, q S) returns q O, which is the perfect J-abstraction of q S.

We conclude this section by establishing termination and correctness of the M-Perfect algorithm.

Theorem 3. M-Perfect(J, q S) terminates and returns the unique (up to J-equivalence) perfect J-abstraction of q S if and only if such an abstraction can be expressed in M.

7 Conclusion

We presented a thorough study of monotone abstractions of UCQs in OBDM systems. We proved that best approximations of such abstractions always exist and introduced a query language, Datalog K, that captures them. Directions for future work are many. In the context of monotone abstractions, we would like to investigate the case of more expressive ontology languages, e.g., DL-Lite R, as well as more expressive source query languages, e.g., unions of conjunctive queries with inequalities and disjunctive Datalog. Finally, the problem of checking whether given best approximations are expressible in simpler and more user friendly languages remains open.

Acknowledgements

This work has been partially supported by the ANR AI Chair INTENDED (ANR-19-CHIA-0014), by MIUR under the PRIN 2017 project HOPE (prot. 2017MMJJRE), and by the EU under the H2020-EU.2.1.1 project TAILOR, grant id. 952215.

Baader, F.; Calvanese, D.; Mc Guinness, D.; Nardi, D.; and Patel-Schneider, P. F., eds. 2003. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.

Bienvenu, M. 2016. Ontology-Mediated Query Answering: Harnessing Knowledge to Get More from Data. In Proceedings of the Twenty-Fifth International Joint Conference on Artiﬁcial Intelligence (IJCAI 2016), 4058 4061.

Brickley, D.; and Guha, R. V. 2014. RDF Schema 1.1. W3C Recommendation, World Wide Web Consortium. Available at https://www.w3.org/TR/2014/REC-rdf-schema20140225/.

Calvanese, D.; De Giacomo, G.; Lembo, D.; Lenzerini, M.; and Rosati, R. 2007a. EQL-Lite: Effective First-Order Query Processing in Description Logics. In Proceedings of the Twentieth International Joint Conference on Artiﬁcial Intelligence (IJCAI 2007), 274 279.

Calvanese, D.; De Giacomo, G.; Lembo, D.; Lenzerini, M.; and Rosati, R. 2007b. Tractable Reasoning and Efﬁcient Query Answering in Description Logics: The DL-Lite Family. Journal of Automated Reasoning, 39(3): 385 429.

Calvanese, D.; and Rosati, R. 2003. Answering recursive queries under keys and foreign keys is undecidable. In Proceedings of the Tenth International Workshop on Knowledge Representation meets Databases (KRDB 2003), volume 79 of CEUR Electronic Workshop Proceedings.

Cima, G. 2017. Preliminary Results on Ontology-based Open Data Publishing. In Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017), volume 1879 of CEUR Electronic Workshop Proceedings.

Cima, G. 2022. Abstraction in Ontology-based Data Management, volume 348 of Frontiers in Artiﬁcial Intelligence and Applications. IOS Press.

Cima, G.; Console, M.; Lenzerini, M.; and Poggi, A. 2021. Abstraction in Data Integration. In Proceedings of the Thirty-Sixth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2021), 1 11.

Cima, G.; Croce, F.; and Lenzerini, M. 2021. Query Deﬁnability and Its Approximations in Ontology-based Data Management. In Proceedings of the Thirtieth International Conference on Information and Knowledge Management (CIKM 2021), 271 280.

Cima, G.; Lenzerini, M.; and Poggi, A. 2019. Semantic Characterization of Data Services through Ontologies. In Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI 2019), 1647 1653.

Cima, G.; Lenzerini, M.; and Poggi, A. 2020a. Answering Conjunctive Queries with Inequalities in DL-Lite R. In Proceedings of the Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI 2020), 2782 2789. Cima, G.; Lenzerini, M.; and Poggi, A. 2020b. Non Monotonic Ontology-based Abstractions of Data Services. In Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 243 252. Cuenca Grau, B. 2004. A possible simpliﬁcation of the semantic web architecture. In Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), 704 713. De Giacomo, G.; Lembo, D.; Lenzerini, M.; Poggi, A.; and Rosati, R. 2018. Using Ontologies for Semantic Data Integration. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, 187 202. Duschka, O. M.; and Genesereth, M. R. 1998. Query Planning with Disjunctive Sources. In Proceedings of the AAAI98 Workshop on AI and Information Integration. Fagin, R.; Kolaitis, P. G.; Miller, R. J.; and Popa, L. 2005. Data Exchange: Semantics and Query Answering. Theoretical Computer Science, 336(1): 89 124. Lenzerini, M. 2018. Managing Data through the Lens of an Ontology. AI Magazine, 39(2): 65 74. Levesque, H. J.; and Lakemeyer, G. 2001. The Logic of Knowledge Bases. The MIT Press. Levy, A. Y.; and Rousset, M.-C. 1998. Combining Horn Rules and Description Logics in CARIN. Artiﬁcial Intelligence, 104(1 2): 165 209. Lutz, C.; Marti, J.; and Sabellek, L. 2018. Query Expressibility and Veriﬁcation in Ontology-based Data Access. In Proceedings of the Sixteenth International Conference on the Principles of Knowledge Representation and Reasoning (KR 2018), 389 398. Ortiz, M. 2018. Improving Data Management using Domain Knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI 2018), 5709 5713. Ortiz, M. 2019. Ontology-Mediated Queries from Examples: a Glimpse at the DL-Lite Case. In Proceedings of the Fifth Global Conference on Artiﬁcial Intelligence (GCAI 2019), volume 65 of EPi C Series in Computing, 1 14. Poggi, A.; Lembo, D.; Calvanese, D.; De Giacomo, G.; Lenzerini, M.; and Rosati, R. 2008. Linking Data to Ontologies. Journal on Data Semantics, X: 133 173. Rosati, R. 2007. The Limits of Querying Ontologies. In Proceedings of the Eleventh International Conference on Database Theory (ICDT 2007), volume 4353 of Lecture Notes in Computer Science, 164 178. Springer. Xiao, G.; Calvanese, D.; Kontchakov, R.; Lembo, D.; Poggi, A.; Rosati, R.; and Zakharyaschev, M. 2018. Ontology Based Data Access: A Survey. In Proceedings of the Twenty Seventh International Joint Conference on Artiﬁcial Intelligence (IJCAI 2018), 5511 5519.