# modular_materialisation_of_datalog_programs__7a8545ba.pdf

The Thirty-Third AAAI Conference on Artiﬁcial Intelligence (AAAI-19)

Modular Materialisation of Datalog Programs

Pan Hu, Boris Motik, Ian Horrocks Department of Computer Science, University of Oxford Oxford, United Kingdom ﬁrstname.lastname@cs.ox.ac.uk

The semina ıve algorithm can be used to materialise all consequences of a datalog program, and it also forms the basis for algorithms that incrementally update a materialisation as the input facts change. Certain (combinations of) rules, however, can be handled much more efﬁciently using custom algorithms. To integrate such algorithms into a general reasoning approach that can handle arbitrary rules, we propose a modular framework for computing and maintaining a materialisation. We split a datalog program into modules that can be handled using specialised algorithms, and we handle the remaining rules using the semina ıve algorithm. We also present two algorithms for computing the transitive and the symmetric transitive closure of a relation that can be used within our framework. Finally, we show empirically that our framework can handle arbitrary datalog programs while outperforming existing approaches, often by orders of magnitude.

1 Introduction Datalog (Abiteboul, Hull, and Vianu 1995) is a prominent rule language whose popularity is mainly due to its ability to express recursive deﬁnitions such as transitive closure. Datalog captures OWL 2 RL (Motik et al. 2009) ontologies with SWRL rules (Horrocks et al. 2004), so it supports query answering on the Semantic Web. It has been implemented in many systems, including but not limited to Web PIE (Urbani et al. 2012), VLog (Urbani, Jacobs, and Kr otzsch 2016), Oracle s RDF Store (Wu et al. 2008), OWLIM (Bishop et al. 2011a), and RDFox (Nenov et al. 2015). Datalog reasoning is often realised by precomputing and storing all consequences of a datalog program and a set of facts; this process and its output are both called materialisation. A materialisation must be updated when the input facts change, but doing so from scratch can be inefﬁcient if changes are small. To minimise the overall work, incremental maintenance algorithms have been developed. These include the well-known Delete/Rederive (DRed) (Gupta, Mumick, and Subrahmanian 1993; Staudt and Jarke 1996) and Counting (Gupta, Mumick, and Subrahmanian 1993) algorithms, and the more recent Backward/Forward (B/F) (Motik et al. 2015), DRedc, and B/Fc (Hu, Motik, and Horrocks 2018b) algorithms.

Copyright c 2019, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

Materialisation and all aforementioned incremental algorithms compute rule consequences using semina ıve evaluation (Abiteboul, Hull, and Vianu 1995). The main beneﬁt of this approach is that each applicable inference is performed exactly once. However, all consequences of certain rules or rule combinations can actually be computed without considering every applicable inference. For example, consider applying a program that axiomatises a relation R as symmetric and transitive to input facts that describe a connected graph consisting of n vertices. In Section 3 we show that computing all consequences using semina ıve evaluation involves O(n3) rule applications, whereas a custom algorithm can achieve the same goal using only O(n2) steps. Since incremental maintenance algorithms are based on the semina ıve algorithm, they can suffer from similar deﬁciencies. Approaches that can maintain the closure of speciﬁc datalog programs have already been considered in the literature. For example, maintaining transitive closure of a graph has been studied extensively (Ibaraki and Katoh 1983; La Poutre and van Leeuwen 1987; King 1999; Demetrescu and Italiano 2000). Subercaze et al. (2016) presented an algorithm for the materialisation of the transitive and symmetric properties in RDFS-Plus. Dong, Su, and Topor (1995) showed that insertions into a transitively closed relation can be maintained by evaluating four nonrecursive ﬁrst-order queries. However, these approaches can only handle datalog programs for which they have been speciﬁcally developed that is, the programs are not allowed to contain any additional rules. The presence of other rules introduces additional complexity since updates computed by specialised algorithms must be propagated to the remaining rules and vice versa. Moreover, many of these approaches cannot handle deletion of input facts, which is a key problem in incremental reasoning. Thus, it is currently not clear whether and how customised algorithms can be used in general-purpose datalog systems that must handle arbitrary datalog rules and support incremental additions and deletions. To address these issues, in this paper we present a modular framework for materialisation computation and incremental materialisation maintenance that can integrate specialised reasoning algorithms with the semina ıve evaluation. The framework partitions the rules of a datalog program into disjoint subsets called modules. For each module, four pluggable functions are used to compute certain consequences

of the module s rules; there are no restrictions on how these functions are implemented, as long as their outputs satisfy certain conditions. Moreover, if no specialised algorithm for a module is available, the four functions can be implemented using semina ıve evaluation. Thus, our framework can efﬁciently handle certain combinations of rules, but it can also handle arbitrary rules while avoiding repeated inferences. We then examine a module that axiomatises the transitive closure, and a module that axiomatises the symmetric transitive closure. These modules capture node reachability in directed and undirected graphs, respectively, both of which frequently occur in practice and are thus highly relevant. We present the functions necessary to integrate these modules into our framework and show that they satisfy the properties needed for correctness. We also discuss the kinds of input that are likely to beneﬁt from modular reasoning. We have implemented our algorithms and compared them on several real-life and synthetic datasets. Our experiments illustrate the potential beneﬁts of the proposed solution: our approach often outperforms state-of-the-art algorithms, sometimes by orders of magnitude. Our system and test data are available online.1 All proofs of our results are given in a technical report (Hu, Motik, and Horrocks 2018a).

2 Preliminaries We now introduce datalog with stratiﬁed negation. A term is a constant or a variable. An atom has the form P(t1, . . . , tk), where P is a k-ary predicate with k 0, and each ti, 1 i k, is a term. A fact is a variable-free atom, and a dataset is a ﬁnite set of facts. A rule r has the form

B1 Bm not Bm+1 not Bn H,

where 0 m n, and Bi and H are atoms. For r a rule, h(r) = H is the head, b+(r) = {B1, . . . , Bm} is the set of positive body atoms, and b (r) = {Bm+1, . . . , Bn} is the set of negative body atoms. Each rule r must be safe that is, each variable occurring in r must occur in at least one positive body atom. A program is a ﬁnite set of rules. A stratiﬁcation λ of a program Π maps each predicate occurring in Π to a positive integer such that, for each rule r Π with predicate P in its head, λ(P) λ(R) (resp. λ(P) > λ(R)) holds for each predicate R occurring in b+(r) (resp. b (r)). Such r is recursive w.r.t. λ if λ(P) = λ(R) holds for some predicate R occurring in b+(r); otherwise, r is nonrecursive w.r.t. λ. Program Π is stratiﬁable if a stratiﬁcation λ of Π exists. For s an integer, the stratum s of Π is the program Πs containing each rule r Π whose head predicate P satisﬁes λ(P) = s. Moreover, let Πs r and Πs nr be the recursive and the nonrecursive subsets, respectively, of Πs. Finally, let Os = {P(c1, . . . , cn) | λ(P) = s and ci are constants}. A substitution σ is a mapping of ﬁnitely many variables to constants. For α a term, an atom, a rule, or a set thereof, ασ is the result of replacing each occurrence of a variable x in α with σ(x), provided that the latter is deﬁned. If r is a rule and σ is a substitution mapping all variables of r to constants, then rule rσ is an instance of r. For I

1http://krr-nas.cs.ox.ac.uk/2018/modular/

Algorithm 1 MAT(Π, λ, E)

1: I = 2: for each stratum index s with 1 s S do 3: = (E Os) Πs nr I

4: while = do 5: I = I 6: = Πs r I \ I

a dataset, we deﬁne the set Π I of all facts obtained by applying a program Π to I as

r Π {h(rσ) | b+(rσ) I and b (rσ) I = }.

Let E be a dataset (called explicit facts) and let λ be a stratiﬁcation of Π with maximum stratum index S. Then, let I0 = E; for each s 1, let Is 0 = Is 1 , let

Is i = Is i 1 Πs Is i 1 for i > 0, and let Is = [

Set IS is the materialisation of Π w.r.t. E and λ. It is known that IS does not depend on λ, so we write it as mat(Π, E).

3 Motivation In this section we show how custom algorithms can handle certain rule combinations much more efﬁciently than semina ıve evaluation. We consider here only materialisation, but similar observations apply to incremental maintenance algorithms as most of them use variants of semina ıve evaluation.

3.1 Semina ıve Evaluation The semina ıve algorithm (Abiteboul, Hull, and Vianu 1995) takes as input a set of explicit facts E, a program Π, and a stratiﬁcation λ of Π, and it computes mat(Π, E). To apply each rule instance at most once, in each round of rule application it identiﬁes the newly applicable rule instances (i.e., instances that depend on a fact derived in the previous round) as shown in Algorithm 1. For each stratum, the algorithm initialises , the set of newly derived facts, by combining the explicit facts in the current stratum (E Os) with the facts derivable from previous strata via nonrecursive rules (Πs nr I ). Then, in lines 4 6 it iteratively computes all consequences of . To this end, in line 6 it uses operator Π I , which extends Π I to allow identifying newly applicable rule instances. Speciﬁcally, given datasets I and I, operator Π I returns a set containing h(rσ) for each rule r Π and substitution σ such that b+(rσ) I and b (rσ) I = hold (i.e., rule instance rσ is applicable to I), but also b+(rσ) = holds (i.e., a positive body atom of rσ occurs in the set of facts derived in the previous round of rule application). It is not hard to see that the algorithm computes I = mat(Π, E), and that it considers each rule instance rσ at most once.

3.2 Problems with the Semina ıve Evaluation Although semina ıve evaluation does not repeat derivations, it always considers each applicable rule instance. However,

facts are often derived via multiple, distinct rule instances; this is particularly common with recursive rules, but it can also occur with nonrecursive rules only. We are unaware of a general technique that can prevent such derivations. We next present two programs for which materialisation can be computed without considering all applicable rule instances, thus showing how semina ıve evaluation can be suboptimal. Example 1. Let Π be the program containing rule (1) and let E = {R(ci, ci+1) | 0 i n}. R(x, y) R(y, z) R(x, z) (1) Clearly, I = mat(Π, E) = {R(ci, cj) | 0 i < j n}, so each rule instance of the form R(ci, cj) R(cj, ck) R(ci, ck) (2) with 1 i < j < k n is applicable to I. Algorithm 1 considers all of these O(n3) rule instances. We next present an outline of an approach that is still cubic in general, but on this speciﬁc input runs in O(n2) time. The key is to distinguish the set X of external facts given to Π as input from the internal facts derived by Π. We can transitively close R by iteratively considering pairs of facts R(u, v) X and R(v, w). That is, we require the ﬁrst fact to be in X, but place no restriction on the second fact. (We could have equivalently required the second fact to be in X.) In our example, we have X = E, so the algorithm considers only rule instances of the form R(ci, ci+1) R(ci+1, ck) R(ci, ck) (3)

for 0 i < k n, of which there are O(n2) many. Intuitively, this is analogous to replacing the predicate R in all explicit facts with X, and using a linear rule X(x, y) R(y, z) R(x, z) (4) instead of rule (1). In our approach, however, other rules can derive R-facts so the set X is not ﬁxed; thus, rule (1) cannot be simply replaced with (4). Our approach simulates such linearisation, and it can be expected to perform well whenever the other rules derive fewer facts than rule (1). Example 2. Let Π consist of rules (1) and (5), and let E = {R(ci, ci+1) | 1 i < n} {R(cn, c1)}. R(x, y) R(y, x) (5) Now I = mat(Π, E) = {R(ci, cj) | 1 i, j n}, so each instance of the form (2) with 1 i, j, k n is applicable to I. Algorithm 1 considers all of these O(n3) rule instances. However, we can view any relation R as an undirected graph with n vertices. To compute the symmetric transitive closure of R, we ﬁrst compute the connected components of R, and, for each connected component C, we enumerate all u, v C and derive R(u, v). The ﬁrst step is linear in the size of R and the second step requires O(n2) time, so the algorithm runs in O(n2) time on any R.

4 Framework In this section we present a general framework for materialisation and incremental reasoning that can avoid the deﬁciencies outlined in Section 3 for certain rule combinations. Our framework focuses on recursive rules only: nonrecursive rules Πs nr are evaluated just once in each stratum,

which is usually efﬁcient. In contrast, the recursive part Πs r of each stratum Πs must be evaluated iteratively, which is a common source of inefﬁciency. Thus, our framework splits Πs r into n(s) mutually disjoint, nonempty programs Πs,i r , 1 i n(s), called modules. (We let n(s) = 0 if Πs r = .) Our notion of modules should not be confused with ontology modules: the latter are subsets of an ontology that are semantically independent from each other in a well-deﬁned way, whereas our modules are just arbitrary program subsets. Each module is handled using plugin functions that compute certain consequences of Πs,i r . These functions can be implemented as desired, as long as their results satisfy certain properties that guarantee correctness. We present our framework in two steps: in Section 4.1 we consider materialisation, and in Section 4.2 we focus on incremental reasoning. Then, in Sections 5 and 6 we discuss how to realise these plugin functions for certain common modules. Before proceeding, we generalise operator Π I as follows. Given datasets Ip, In, p, and n where p Ip and n In = , let

Π Ip, In p, n = S

r Π{h(rσ) |

b+(rσ) Ip and b (rσ) In = , and b+(rσ) p = or b (rσ) n = }.

When the condition in the last line is not required, we simply write Π Ip, In . Moreover, we omit In when Ip = In, and we omit n when n = . Intuitively, this operator computes the consequences of Π by evaluating the positive and the negative body atoms in Ip and In, respectively, while ensuring in each derivation that either a positive or a negative body atom is true in p or n, respectively. Our incremental algorithm uses this operator to identify the consequences of Π that are affected by the changes to the facts matching the positive and the negative body atoms of the rules in Π. For example, if the facts in p are added to (resp. removed from) the materialisation, then Π Ip p contains the consequences of the rule instances that start (resp. cease) to be applicable because a positive body atom matches to a fact in p. The set n is used to analogously capture the consequences of the negative body atoms of the rules in Π.

4.1 Computing the Materialisation Our modular materialisation algorithm uses a plugin function Adds,i for each module Πs,i r . The function takes as arguments datasets Ip, In, and such that Ip, and it closes Ip with all consequences of Πs,i r that depend on . Each invocation of these functions must satisfy the following properties in order to guarantee correctness of our algorithm.

Deﬁnition 3. Function Add captures a datalog program Π on datasets Ip, In, and with Ip if the result of Add[Ip, In, ] is the smallest dataset J that satisﬁes Π Ip J, In J Ip J.

For brevity, in the rest of the paper we often say just Add captures Π without specifying the datasets whenever the latter are clear from the context. In the absence of a customised algorithm, Add can always be realised using the semina ıve evaluation strategy as follows:

Algorithm 2 MAT-MOD(Π, λ, E)

7: I = 8: for each stratum index s with 1 s S do 9: 1 = = n(s) = 10: = (E Os) Πs nr I

11: while = do 12: I = I 13: for each i with 1 i n(s) do 14: i = Adds,i[I, I, \ i] 15: = 1 n(s)

let 0 = and J0 = , for i starting with 0 onwards, if i = , stop and return Ji; otherwise, let i+1 = Π Ip Ji, In i \ (Ip Ji) and Ji+1 = Ji i+1 and proceed to i + 1. However, a custom implementation of Add will typically not examine all rule instances from the above computation in order to optimise reasoning with certain modules. Algorithm 2 formalises our modular approach to datalog materialisation. It takes as input a program Π, a stratiﬁcation λ of Π, and a set of explicit facts E, and it computes mat(Π, E). The algorithm s structure is similar to Algorithm 1. For each stratum of Π, both algorithms ﬁrst apply the nonrecursive rules, and then they apply the recursive rules iteratively up to a ﬁxpoint. The main difference is that, given a set of facts derived from the previous iteration, Algorithm 2 computes the consequences of for each module independently using Adds,i (line 14); note that each i is closed under Πs,i r , which is key to the performance of our approach. The algorithm then combines the consequences of all modules (line 15) before proceeding to the next iteration. If each Adds,i function is implemented using semina ıve evaluation as described earlier, then the algorithm does not consider a rule instance more than once. This is achieved by passing \ i to Adds,i in line 14: only facts derived by other modules in the previous iteration are considered new for Adds,i, which is possible since the facts in i have been produced by the ith module in the previous iteration. Theorem 4 captures these properties formally. Theorem 4. Algorithm 2 computes I as mat(Π, E) if function Adds,i captures Πs,i r in each of its calls. Moreover, if all Adds,i use the semina ıve strategy, each applicable rule instance is considered at most once.

4.2 Incremental Updates Our modular incremental materialisation maintenance algorithm is based on the DRedc algorithm by Hu, Motik, and Horrocks (2018b), which is a variant of the well-known DRed algorithm (Gupta, Mumick, and Subrahmanian 1993). For each fact, DRedc maintains two counters that track the number of nonrecursive and recursive derivations of the fact. The algorithm proceeds in three steps. During the deletion phase, DRedc iteratively computes the consequences of the deleted facts, similar to DRed, while adjusting the counters accordingly. However, to optimise overdeletion, deletion propagation stops on facts with a nonzero nonrecursive

counter. In the one-step rederivation phase, DRedc identiﬁes the facts that were overdeleted but can be rederived from the remaining facts in one step by simply checking the recursive counters: if the counter is nonzero, then the corresponding fact is rederived. In the insertion phase, DRedc computes the consequences of the rederived and the inserted facts using semina ıve evaluation, which we have already discussed. Our modular incremental algorithm handles nonrecursive rules in the same way as DRedc. Thus, the nonrecursive counters, which record the number of nonrecursive derivations of each fact, can be maintained globally just as in DRedc. In contrast, as discussed in Section 3, custom algorithms for recursive modules will usually not consider all applicable rule instances, so counters of recursive derivations cannot be maintained globally. Nevertheless, certain modules can maintain recursive counters internally (e.g., the module based on the semina ıve evaluation can do so). In addition to function Adds,i from Section 4.1, our modular incremental reasoning algorithm uses three further functions: Diﬀs,i, Dels,i, and Reds,i. Deﬁnition 5 captures the requirements on Diﬀs,i. Intuitively, Diﬀs,i[Ip, p, n] identiﬁes the consequences of Πs,i r affected by the addition of the facts in p and removal of the the facts n, respectively, with both sets containing facts from earlier strata.

Deﬁnition 5. Function Diﬀcaptures a datalog program Π on datasets Ip, p, and n where p Ip, n Ip = , and both p and n do not contain predicates occurring in rule heads in Π if Diﬀ[Ip, p, n] = Π Ip p, n .

Function Dels,i captures overdeletion: if the facts in are deleted, then Dels,i[Ip, In, , Cnr] returns the consequences of Πs,i r that must be overdeleted as well. The function can use the nonrecursive counters Cnr in order to stop overdeletion as in DRedc. We do not specify exactly what the functions must return: as we discuss in Section 6, computing the smallest set that needs to be overdeleted might require considering all rule instances as in DRedc, which would miss the point of modular reasoning. Instead, we specify the required output in terms of a lower bound Jl and an upper bound Ju. Intuitively, Jl and Ju contain facts that would be overdeleted in DRedc and DRed, respectively.

Deﬁnition 6. Function Del captures a datalog program Π on datasets Ip, In, with Ip, and a mapping Cnr of facts to integers if Jl Del[Ip, In, , Cnr] Ju where

the lower bound Jl is the smallest dataset such that, for each F Π Ip, In Jl , either F Jl or Cnr(F) > 0 holds, and the upper bound Ju is the smallest dataset that satisﬁes Π Ip, In Ju Ju.

Finally, function Reds,i captures rederivation: if facts in are overdeleted, then Reds,i[Ip, In, ] returns all facts in that can be rederived from Ip \ and Πs,i r in one or more steps. This is different from DRed and DRedc, which both perform only one-step rederivation. This change is important in our framework because, as we shall see in Section 6, Reds,i provides the opportunity for a module to adjust its internal data structures after deletion.

Deﬁnition 7. Function Reds,i captures a datalog program Π on datasets Ip, In, with Ip if the result of Red[Ip, In, ] is the smallest dataset J that satisﬁes Π (Ip \ ) J, In J. Algorithm 3 formalises our modular approach to incremental maintenance. The algorithm takes as input a program Π, a stratiﬁcation λ of Π, a set of explicit facts E, the materialisation I = mat(Π, E), the sets of facts E and E+ to delete from and add to E, and a map Cnr that records the number of nonrecursive derivations of each fact. The algorithm updates I to mat(Π, (E \ E ) E+). We next describe the two main steps of the algorithm. In the overdeletion phase, the algorithm ﬁrst initialises the set of facts to delete as the union of the explicitly deleted facts (E Os) and the facts affected by changes in previous strata (lines 22 and 23). Then, in lines 26 30 the algorithm computes all consequences of . In each iteration, function Dels,i is called for each module to identify the consequences of Πs,i r that must be overdeleted due to the deletion of (line 28). As in Algorithm 2, the third argument of Dels,i

is \ i, which guarantees that the function will not be applied to its own consequences. In the second step, the algorithm ﬁrst identiﬁes the rederivable facts by calling Reds,i for each module (lines 33 35). Then, the consequences of the rederived facts, the explicitly added facts (E+ Os), and the facts added due to changes in previous strata are computed in the loop of lines 36 40 analogously to Algorithm 2. Although Reds,i

rederives facts in one or more steps as opposed to the onestep rederivation in DRed and DRedc, this extra effort is not repeated during insertion since Adds,i is not applied to the consequences of module i. Theorem 8 states that the algorithm is correct. Theorem 8. Algorithm 3 updates I from mat(Π, E) to mat(Π, (E \ E ) E+) if functions Adds,i, Dels,i, Diﬀs,i, and Reds,i capture Πs,i r in all of their calls.

5 Transitive Closure We now consider a module consisting of a single rule (1) axiomatising a relation R as transitive. Following the ideas from Example 1, we distinguish the internal facts produced by rule (1) from the external facts produced by other rules. We keep track of the latter in a global set XR that is initialised to the empty set. A key invariant of our approach is that each fact R(a0, an) is produced by a chain {R(a0, a1), . . . , R(an 1, an)} XR of external facts. Thus, we can transitively close R by considering pairs of R-facts where at least one of them is contained in XR, which can greatly reduce the number of inferences. A similar effect could be achieved by rewriting the input program: we introduce a fresh predicate XR, and we replace by XR each occurrence of R in the head of a rule, as well as one of the two occurrences of R in the body of rule (1). Such an approach, however, introduces the facts containing the auxiliary predicate XR into the materialisation and thus reveals implementation details to the users. Moreover, the rederivation step can be realised very efﬁciently in our approach.

Algorithm 3 DREDc-MOD(Π, λ, E, I, E , E+, Cnr)

16: D = A = , E = (E E) \ E+, E+ = E+ \ E 17: for each stratum index s with 1 s S do 18: OVERDELETE 19: REDERIVE-INSERT 20: E = (E \ E ) E+, I = (I \ D) A

21: procedure OVERDELETE 22: 1 = = n(s) = 23: = (E Os) Πs nr I D \A, A\D and update Cnr 24: for each i with 1 i n(s) do 25: = Diﬀs,i[I, D \ A, A \ D] 26: while = do 27: for each i with 1 i n(s) do 28: i = Dels,i[I \ (D \ A), I A, \ i, Cnr] 29: D = D 30: = 1 n(s)

31: procedure REDERIVE-INSERT 32: = (E+ Os) Πs nr (I \ D) A A \ D, D \ A

and update Cnr 33: for each i with 1 i n(s) do 34: i = Reds,i[I, (I \ D) A, D \ A] 35: = i Diﬀs,i[(I \ D) A, A \ D, D \ A] 36: while = do 37: A = A 38: for each i with 1 i n(s) do 39: i = Adds,i[(I \ D) A, (I \ D) A, \ i] 40: = 1 n(s)

Based on the above idea, function Addtc(R), shown in Algorithm 4, essentially implements semina ıve evaluation for rule (4): the loops in lines 42 43 and 44 47 handle the two delta rules derived from (4). For Diﬀtc(R), note that sets A \ D and D \ A in lines 25 and 35 of Algorithm 3 always contain facts with predicates that do not occur in Πs,i r in rule heads; thus, since R occurs in the head of rule (1), these sets contain facts whose predicate is different from R, so Diﬀtc(R) can simply return the empty set. Function Deltc(R), shown in Algorithm 5, implements semina ıve evaluation for rule (4) analogously to Addtc(R). The main difference is that only facts whose nonrecursive counter is zero are overdeleted, which mimics overdeletion in DRedc. As a result, not all facts processed in lines 50 and 53 are added to J so, to avoid repeatedly considering such facts, the algorithm maintains the set S of seen facts. Finally, function Redtc(R), shown in Algorithm 6, identiﬁes for each source vertex u all vertices reachable by the external facts in XR. Theorem 9. In each call in Algorithms 2 and 3, functions Addtc(R), Deltc(R), Diﬀtc(R), and Redtc(R) capture a datalog program that axiomatises relation R as transitive.

6 Symmetric Transitive Closure We now consider a module consisting of two rules, (1) and (5), axiomatising a relation R as transitive and symmetric. As in Example 2, we can view relation R as an undirected graph. To compute the materialisation, we extract the set CR

Algorithm 4 Addtc(R)[Ip, In, ]

41: J = , Q = , XR = XR 42: for each R(u, v) and each R(v, w) Ip \ do 43: add R(u, w) to Q and J 44: while Q = do 45: remove an arbitrarily chosen fact R(v, w) from Q 46: for each R(u, v) XR such that R(u, w) Ip J do 47: add R(u, w) to Q and J 48: return J

Algorithm 5 Deltc(R)[Ip, In, , Cnr]

49: J = , Q = S = , XR = XR \ 50: for each R(u, v) and each R(v, w) Ip \ S do 51: add R(u, w) to Q and S 52: if Cnr(R(u, w)) = 0 then add R(u, w) to J 53: while Q = do 54: remove an arbitrarily chosen fact R(v, w) from Q 55: for each R(u, v) XR such that R(u, w) Ip \ S do 56: add R(u, w) to Q and S 57: if Cnr(R(u, w)) = 0 then add R(u, w) to J 58: return J \

Algorithm 6 Redtc(R)[Ip, In, ]

59: J = 60: for each u such that there exist v with R(u, v) do 61: for each w reachable from u via R facts in XR do 62: add R(u, w) to J 63: return J

of connected components that is, each U CR is a set of mutually connected vertices in the symmetric transitive closure of R; ﬁnally, we derive R(u, v) for all u and v in each component U CR. Set CR is global and is initially empty. Based on this idea, function Addstc(R), shown in Algorithm 7, uses an auxiliary function CLOSEEDGES to incrementally update the set CR by processing each fact R(u, v) in lines 67 75: if either u or v does not occur in a component in CR, then the respective component is created in CR (lines 69 and 71); and if u and v belong to distinct components U and V , then U and V are merged into a single component and all R-facts connecting U and V are added (lines 72 75). For the same reasons as in Section 5, function Diﬀtc(R) can simply return the empty set. Function Delstc(R), shown in Algorithm 8, simply overdeletes all facts R(u , v ) whose nonrecursive counter is zero and where both u and v belong to a component U containing both vertices of a fact R(u, v) in . Those facts R(u , v ) for which the nonrecursive counter is nonzero will hold after overdeletion, so they are kept in an initially empty global set YR so that they can be used for rederivation later. Finally, function Redstc(R), shown in Algorithm 9, simply closes the set YR in the same way as during addition, and it empties the set YR. While this creates a dependency between Delstc(R) and Redstc(R), the order in which these functions are called in Algorithm 3 ensures that the set YR is maintained correctly.

Algorithm 7 Addstc(R)[Ip, In, ]

64: return CLOSEEDGES( ) \ Ip

65: function CLOSEEDGES( ) 66: J = 67: for each R(u, v) do 68: if no U CR exists such that u U then 69: add {u} to CR, and R(u, u) to J 70: if no V CR exists such that v V then 71: add {v} to CR, and R(v, v) to J 72: if u and v belong to distinct U, V CR, resp. then 73: remove U and V from CR, and add U V to CR 74: for each u U and each v V do 75: add R(u , v ) and R(v , u ) to J 76: return J

Algorithm 8 Delstc(R)[Ip, In, , Cnr]

77: J = 78: for each U CR where R(u, v) s.t. {u, v} U do 79: for each u U and each v U do 80: if Cnr(R(u , v )) = 0 then add R(u , v ) to J 81: else add R(u , v ) to YR 82: remove U from CR 83: return J \

Algorithm 9 Redstc(R)[Ip, In, ]

84: J = CLOSEEDGES(YR) 85: YR = 86: return J

Theorem 10. In each call in Algorithms 2 and 3, functions Addstc(R), Delstc(R), Diﬀstc(R), and Redstc(R) capture a datalog program that axiomatises R as symmetric transitive.

7 Evaluation We have implemented our modular materialisation and incremental maintenance algorithms, as well as the semina ıve materialisation and the DRedc algorithms, and we have compared their performance empirically.

Test Benchmarks We used the following real-world and synthetic benchmarks in our tests. LUBM (Guo, Pan, and Heﬂin 2005) is a well-known benchmark that models individuals and organisations in a university domain. Claros describes archeological artefacts. We used the LUBM and Claros datasets with the lower bound extended (-LE) programs by Motik et al. (2014); roughly speaking, we converted a subset of the accompanying OWL ontologies into datalog and manually extended them with several difﬁcult rules . DBpedia (Lehmann et al. 2015) contains structured information extracted from Wikipedia. DBpedia represents Wikipedia categories using the SKOS vocabulary (Miles and Bechhofer 2009), which deﬁnes several transitive properties. We used the datalog subset of the SKOS RDF schema. Moreover, the materialisation of DBpedia-SKOS is too large to ﬁt into the memory of our test server, so we used a random sample of the DBpedia dataset consisting of ﬁve mil-

Benchmark |E| |I| S |Πnr| |Πr| |TC| |STC| Mat-Mod Mat Claros-LE 18.8 M 533.3 M 11 1031 306 27 2 733.55 3593.32 LUBM-LE 133.6 M 332.6 M 5 85 22 1 2 291.90 1100.22 DBpedia-SKOS 5.0 M 97.0 M 5 26 15 2 1 103.23 3623.37 DAG-R 0.1 M 22.9 M 1 1 1 1 0 29.60 3238.86

Table 1: Running times for materialisation computation (seconds)

Benchmark Small Deletions Small Insertions Large Deletions DRedc-Mod DRedc DRedc-Mod DRedc DRedc-Mod DRedc

Claros-LE 0.93 1035.28 0.17 0.80 314.33 3616.93 LUBM-LE 0.32 3.87 0.01 0.01 182.93 1369.77 DBpedia-SKOS 21.77 691.32 0.20 2.78 111.28 3826.87 DAG-R 64.92 3005.11 14.56 116.78 62.48 4316.71

Table 2: Running times for incremental maintenance (seconds)

lion facts. Finally, DAG-R is a synthetic benchmark consisting of a randomly generated dataset containing a directed acyclic graph with 10k nodes and 100k edges, and a program that axiomatises the path relation as transitive. Table 1 shows the numbers of explicit facts (|E|), derived facts (|I|), strata (S), nonrecursive rules (|Πnr|), recursive rules (|Πr|), transitivity modules (|TC|), and symmetric transitive modules (|STC|) for each benchmark.

Test Setup and Results We conducted all experiments on a Dell Power Edge R730 server with 512 GB RAM and two Intel Xeon E5-2640 2.6 GHz processors running Fedora 27, kernel version 4.17.6. For each benchmark, we loaded the test data into our system and then compared the performance of our modular algorithms with the semina ıve and DRedc algorithms using the following methodology. We ﬁrst computed the materialisation and measured the wall-clock time. The results are shown in Table 1. We then conducted two groups of incremental reasoning tests. In the ﬁrst group, we tested the performance of our incremental algorithms on small changes. To this end, we used uniform sampling to select ten subsets Ei E, 1 i 10, each consisting of 1000 facts from the input dataset. We deleted and then reinserted Ei for each i while measuring the wall-clock times, and then we computed the average times for deletion and insertion over the ten samples. The results are shown in the Small Deletions and Small Insertions columns of Table 2, respectively. In the second group, we tested the performance of incremental algorithms on large deletions. To this end, we used uniform sampling to select a subset E E containing 25% of the explicit facts, and we measured the wall-clock time needed to delete E from the materialisation. The results are shown in the Large Deletions column of Table 2. We did not consider large insertions because our algorithms handle insertion in the same way as materialisation, so the relative performance of our algorithms should be similar to the performance of materialisation shown in Table 1.

Discussion Mat-Mod signiﬁcantly outperformed Mat on all test inputs. For example, Mat-Mod was several times faster than Mat on Claros-LE and LUBM-LE. The programs

of both benchmarks contain transitivity and symmetric transitivity modules, which are efﬁciently handled by our custom algorithm. The performance improvement is even more signiﬁcant for DBpedia-SKOS and DAG-R: Mat-Mod is more than 30 times faster than Mat on DBpedia-SKOS, and the difference reaches two orders of magnitude on DAGR. In fact, DBpedia contains long chains/cycles over the skos:broader relation (Bishop et al. 2011b), which is axiomatised as transitive in SKOS. Mat-Mod outperforms Mat in this case since our custom algorithm for transitivity skips a large number of rule instances. The same observation explains the superior performance of DAG-R. Similarly, DRedc-Mod considerably outperformed DRedc on small deletions: the performance speedup ranges from around ten times on LUBM-LE to three orders of magnitude on Claros-LE. The program of Claros-LE contains a symmetric transitive closure module for the predicate related Places, and the materialisation contains large cliques of constants connected to each other via this predicate. Thus, when a related Places(a,b) fact is deleted, DRedc can end up considering up to n3 rule instances where n is the number of constants in the clique containing a and b. In contrast, our custom algorithm for this module maintains a connected component for the clique and requires only up to n2 steps. It is worth noticing that, while DRedc-Mod signiﬁcantly outperforms DRedc on DAG-R, the incremental update times for small deletion were larger than both the update times for large deletions and even for the initial materialisation. This is because deleting one thousand edges from the graph ( Small Deletion ) caused a large part of the materialisation to be overdeleted and rederived again. In contrast, when 25% of the explicit facts are deleted ( Large Deletion ), a larger propertion of the materialisation is overdeleted, but only a few facts are rederived. For DRedc the situation is similar, but rederivation in DRedc beneﬁts from a global recursive counter (at the expense of considering each applicable rule instance), which makes small deletion still faster than large deletion and initial materialisation. Finally, as shown in Table 2, DRedc-Mod scaled well and maintained its advantage over DRedc on large deletions. Incremental insertions are in general easier to handle than

deletions since during insertion the algorithms can rely on the whole materialisation to prune the propagation of facts whereas during deletion the algorithms can only rely on the nonrecursive counters of facts to do the same. This is clearly reﬂected in Table 2. Nevertheless, in our tests for small insertions, DRedc-Mod was several times faster than DRedc in all cases but LUBM-LE, for which both algorithms updated the materialisation instantateously.

8 Conclusion

We have proposed a modular framework for the computation and maintenance of datalog materialisations. The framework allows integrating custom algorithms for speciﬁc types of rules with standard datalog reasoning methods. Moreover, we have presented such custom algorithms for programs axiomatising the transitive and the symmetric transitive closure of a relation. Finally, we have shown empirically that our algorithms typically signiﬁcantly outperform then existing ones, sometimes by orders of magnitude. In future, we plan to extend our framework also to the B/Fc algorithm, which eliminates overdeletion by eagerly checking alternative derivations. This could potentially be useful in cases such as DBpedia-SKOS and DAG-R, where overdeletion is a major source of inefﬁciency.

Acknowledgements

We thank David Tena Cucala for his help with obtaining the SKOS benchmark. This work was supported by the EPSRC projects Ana LOG, DBOnto, and ED3.

Abiteboul, S.; Hull, R.; and Vianu, V. 1995. Foundations of Databases. Addison-Wesley. Bishop, B.; Kiryakov, A.; Ognyanoff, D.; Peikov, I.; Tashev, Z.; and Velkov, R. 2011a. OWLIM: A family of scalable semantic repositories. SWJ 2(1):33 42. Bishop, B.; Kiryakov, A.; Ognyanov, D.; Peikov, I.; Tashev, Z.; and Velkov, R. 2011b. Factforge: A fast track to the web of data. Semantic Web 2(2):157 166. Demetrescu, C., and Italiano, G. F. 2000. Fully dynamic transitive closure: breaking through the o (n/sup 2/) barrier. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, 381 389. IEEE. Dong, G.; Su, J.; and Topor, R. W. 1995. Nonrecursive Incremental Evaluation of Datalog Queries. Annals of Mathematics and Artiﬁcial Intelligence 14(2 4):187 223. Guo, Y.; Pan, Z.; and Heﬂin, J. 2005. Lubm: A benchmark for owl knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3(2-3):158 182. Gupta, A.; Mumick, I. S.; and Subrahmanian, V. S. 1993. Maintaining Views Incrementally. In SIGMOD. ACM. Horrocks, I.; Patel-Schneider, P. F.; Boley, H.; Tabet, S.; Grosof, B.; Dean, M.; et al. 2004. SWRL: A Semantic Web Rule Language Combining OWL and Rule ML. W3C Member Submission.

Hu, P.; Motik, B.; and Horrocks, I. 2018a. Modular Materialisation of Datalog Programs. Co RR abs/1811.02304. Hu, P.; Motik, B.; and Horrocks, I. 2018b. Optimised maintenance of datalog materialisations. In AAAI, 1871 1879. Ibaraki, T., and Katoh, N. 1983. On-line computation of transitive closures of graphs. Information Processing Letters 16(2):95 97. King, V. 1999. Fully dynamic algorithms for maintaining all-pairs shortest paths and transitive closure in digraphs. In Foundations of Computer Science, 1999. 40th Annual Symposium on, 81 89. IEEE. La Poutre, J. A., and van Leeuwen, J. 1987. Maintenance of transitive closures and transitive reductions of graphs. In International Workshop on Graph-Theoretic Concepts in Computer Science, 106 120. Springer. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P. N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S.; et al. 2015. Dbpedia a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6(2):167 195. Miles, A., and Bechhofer, S. 2009. Skos simple knowledge organization system reference. W3C recommendation 18:W3C. Motik, B.; Patel-Schneider, P.; Parsia, B.; Bock, C.; Fokoue, A.; Haase, P.; Hoekstra, R.; Horrocks, I.; Ruttenberg, A.; Sattler, U.; et al. 2009. OWL 2 Web Ontology Language: Structural Speciﬁcation and Functional-Style Syntax. W3C. Motik, B.; Nenov, Y.; Piro, R.; Horrocks, I.; and Olteanu, D. 2014. Parallel materialisation of datalog programs in centralised, main-memory rdf systems. In AAAI, 129 137. Motik, B.; Nenov, Y.; Piro, R.; and Horrocks, I. 2015. Incremental Update of Datalog Materialisation: the Backward/Forward Algorithm. In AAAI, 1560 1568. Nenov, Y.; Piro, R.; Motik, B.; Horrocks, I.; Wu, Z.; and Banerjee, J. 2015. RDFox: A Highly-Scalable RDF Store. In ISWC, 3 20. Staudt, M., and Jarke, M. 1996. Incremental Maintenance of Externally Materialized Views. In VLDB, 75 86. Subercaze, J.; Gravier, C.; Chevalier, J.; and Laforest, F. 2016. Inferray: fast in-memory RDF inference. PVLDB 9(6):468 479. Urbani, J.; Kotoulas, S.; Maassen, J.; Van Harmelen, F.; and Bal, H. 2012. Web PIE: A Web-scale Parallel Inference Engine using Map Reduce. JWS 10:59 75. Urbani, J.; Jacobs, C. J.; and Kr otzsch, M. 2016. Column Oriented Datalog Materialization for Large Knowledge Graphs. In AAAI, 258 264. Wu, Z.; Eadon, G.; Das, S.; Chong, E. I.; Kolovski, V.; Annamalai, M.; and Srinivasan, J. 2008. Implementing an Inference Engine for RDFS/OWL Constructs and User Deﬁned Rules in Oracle. In ICDE, 1239 1248. IEEE.