# Stream Reasoning in Temporal Datalog

Alessandro Ronca, Mark Kaminski, Bernardo Cuenca Grau, Boris Motik, Ian Horrocks
Department of Computer Science, University of Oxford, UK
{alessandro.ronca, mark.kaminski, bernardo.cuenca.grau, boris.motik, ian.horrocks}@cs.ox.ac.uk

In recent years, there has been an increasing interest in extending traditional stream processing engines with logical, rule-based reasoning capabilities. This poses significant theoretical and practical challenges, since rules can derive new information and propagate it both towards past and future time points; as a result, streamed query answers can depend on data that has not yet been received, as well as on data that arrived far in the past. Stream reasoning algorithms, however, must be able to stream out query answers as soon as possible, and can only keep a limited number of previous input facts in memory. In this paper, we propose novel reasoning problems to deal with these challenges, and study their computational properties on Datalog extended with a temporal sort and the successor function, a core rule-based language for stream reasoning applications.

## 1 Introduction

Query processing over data streams is a key aspect of Big Data applications. For instance, algorithmic trading relies on real-time analysis of stock tickers and financial news items (Nuti et al. 2011); oil and gas companies continuously monitor and analyse data coming from their wellsites in order to detect equipment malfunction and predict maintenance needs (Cosad et al. 2009); network providers perform real-time analysis of network flow data to identify traffic anomalies and DoS attacks (Münz and Carle 2007).

In stream processing, an input data stream is seen as an unbounded, append-only relation of timestamped tuples, where timestamps are either added by the external device that issued the tuple or by the stream management system receiving it (Babu and Widom 2001; Babcock et al. 2002). The analysis of the input stream is performed using a standing query, the answers to which are also issued as a stream. Most applications of stream processing require near real-time analysis using limited resources, which poses significant challenges to stream management systems. On the one hand, systems must be able to compute query answers over the partial data received so far as if the entire (infinite) stream had been available; furthermore, they must stream query answers out with the minimum possible delay. On the other hand, due to memory limitations, systems can only keep a limited history of previously received input facts in memory to perform computations. These challenges have been addressed by extending traditional database query languages with window constructs, which declaratively specify the finite part of the input stream relevant to the answers at the current time (Arasu, Babu, and Widom 2006).

In recent years, there has been an increasing interest in extending traditional stream management systems with logical, rule-based reasoning capabilities (Barbieri et al. 2010; Calbimonte, Corcho, and Gray 2010; Anicic et al. 2011; Le-Phuoc et al. 2011; Zaniolo 2012; Özçep, Möller, and Neuenstadt 2014; Beck et al. 2015; Dao-Tran, Beck, and Eiter 2015).
Rules can be very useful in stream processing applications for capturing complex analysis tasks in a declarative way, as well as for representing background knowledge about the application domain.

Example 1. Consider a number of wind turbines scattered throughout the North Sea. Each turbine is equipped with a sensor, which continuously records temperature levels of key devices within the turbine and sends those readings to a data centre monitoring the functioning of the turbines. Temperature levels are streamed by sensors using a ternary predicate Temp, whose arguments identify the device, the temperature level, and the time of the reading. A monitoring task in the data centre is to track the activation of cooling measures in each turbine, record temperature-induced malfunctions and shutdowns, and identify parts at risk of future malfunction. This task is captured by the following set of rules:

    Temp(x, high, t) → Flag(x, t)                       (1)
    Flag(x, t) ∧ Flag(x, t + 1) → Cool(x, t + 1)        (2)
    Cool(x, t) ∧ Flag(x, t + 1) → Shdn(x, t + 1)        (3)
    Shdn(x, t) → Malfunc(x, t - 2)                      (4)
    Shdn(x, t) ∧ Near(x, y) → AtRisk(y, t)              (5)
    AtRisk(x, t) → AtRisk(x, t + 1)                     (6)

Rule (1) flags a device whenever a high temperature reading is received. Rule (2) says that two consecutive flags on a device trigger cooling measures. Rule (3) says that an additional consecutive flag after activating cooling measures triggers a pre-emptive shutdown. By Rule (4), a shutdown is due to a malfunction that occurred when the first flag leading to shutdown was detected. Finally, Rules (5) and (6) identify devices located near a shutdown device as being at risk and propagate risk recursively into the future.
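To make the interplay between past and future propagation concrete, the following minimal Python sketch chains rules (1)-(6) by brute force over a handful of Temp readings. The tuple encoding of facts, the fixpoint loop, and the finite horizon that cuts off the recursive rule (6) are assumptions made purely for this illustration; they are not part of the formal language introduced below.

```python
# Illustrative sketch only: a naive bottom-up evaluation of rules (1)-(6) over a
# small finite set of Temp readings. The tuple encoding and the finite horizon
# are assumptions made for the sake of the example.

def derive(facts, horizon=6):
    """Apply rules (1)-(6) to a set of facts until no new fact is derived."""
    facts = set(facts)
    while True:
        new = set()
        for (pred, *args) in facts:
            if pred == "Temp" and args[1] == "high":          # rule (1)
                new.add(("Flag", args[0], args[2]))
        for (_, x, t) in [f for f in facts if f[0] == "Flag"]:
            if ("Flag", x, t + 1) in facts:                   # rule (2)
                new.add(("Cool", x, t + 1))
        for (_, x, t) in [f for f in facts if f[0] == "Cool"]:
            if ("Flag", x, t + 1) in facts:                   # rule (3)
                new.add(("Shdn", x, t + 1))
        for (_, x, t) in [f for f in facts if f[0] == "Shdn"]:
            new.add(("Malfunc", x, t - 2))                    # rule (4): derives into the past
            for (_, a, b) in [f for f in facts if f[0] == "Near"]:
                if a == x:
                    new.add(("AtRisk", b, t))                 # rule (5)
        for (_, x, t) in [f for f in facts if f[0] == "AtRisk"]:
            if t + 1 <= horizon:
                new.add(("AtRisk", x, t + 1))                 # rule (6): derives into the future
        if new <= facts:
            return facts
        facts |= new

stream = {("Temp", "d1", "high", 0), ("Temp", "d1", "high", 1),
          ("Temp", "d1", "high", 2), ("Near", "d1", "d2")}
for fact in sorted(derive(stream), key=str):
    print(fact)
# Among the output: ("Shdn", "d1", 2), ("Malfunc", "d1", 0) -- an answer at a
# *past* time point -- and ("AtRisk", "d2", t) for every t from 2 up to the horizon.
```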
The power and flexibility provided by rules poses additional challenges. As seen in our example, rules can derive information and propagate it both towards past and future time points. As a result, query answers can depend on data that has not yet been received (thus preventing the system from streaming out answers as soon as new input arrives), as well as on data that arrived far in the past (thus forcing the system to keep in memory a potentially large input history).

Towards developing a solid foundation for rule-based stream reasoning, we propose in Section 3 a suite of decision problems that can be exploited by a stream reasoning algorithm to deal with the aforementioned challenges. The definitive time point (DTP) problem is to check whether query answers to be issued at a given time τout will remain unaffected by any future input data given the current history; if so, τout is definitive and answers at τout can be safely output by the algorithm. The forgetting problem is to determine whether facts received at a given previous time point and recorded in the current history can be forgotten, in that they cannot affect future query answers. Forgetting allows the algorithm to maintain as small a history as possible. The delay problem is to check, given a time gap d, whether time point τin - d is definitive for each time point τin at which new input facts are received and each history up to τin. Delay can thus be seen as a data-independent variant of DTP: the delay d can be computed offline before receiving any data, and the algorithm can then safely output answers at τin - d as data at τin is being received. The window size problem is a data-independent variant of forgetting. The task is to determine, given a window size s, whether all history facts at time points up to τin - s can be forgotten for each time τin at which new input facts are received and each history up to τin. A stream reasoning algorithm can compute s in an offline phase and then, in the online phase, immediately delete all history facts older than s time points as new data arrives.

In Section 4, we proceed to the study of the computational properties of the aforementioned problems. For this, we consider as query language temporal Datalog: negation-free Datalog with a special temporal sort to which the successor function (or, equivalently, addition by a constant) is applicable (Chomicki and Imieliński 1988). This is a core temporal rule-based language, which captures other prominent temporal languages (Abadi and Manna 1989; Baudinet, Chomicki, and Wolper 1993) and forms the basis of more expressive formalisms for stream reasoning recently proposed in the literature (Zaniolo 2012; Beck et al. 2015). We show in Section 4.1 that DTP is PSPACE-complete in data complexity and becomes tractable for nonrecursive queries under very mild additional restrictions; thus, DTP is no harder than query evaluation (Chomicki and Imieliński 1988). In Section 4.2, we show that forgetting is undecidable; however, quite surprisingly, the aforementioned restrictions to nonrecursive queries allow us to regain not only decidability, but also tractability in data complexity. In Section 4.3, we turn our attention to data-independent problems. We show that both delay and window size are undecidable in general and become co-NEXP-complete for nonrecursive queries.

Our results show that, although stream reasoning problems are either intractable in data complexity or undecidable in general, they become feasible in practice for nonrecursive queries under very mild additional restrictions. On the one hand, the data-dependent problems (DTP and forgetting) become tractable in data complexity (a very important requirement for achieving near real-time computation in practice); on the other hand, although the data-independent problems (delay and window size) remain intractable, these are one-time problems which only need to be solved once prior to receiving any input data. The proofs of all results are given in an extended version of this paper (Ronca et al. 2017).

## 2 Preliminaries

Syntax. A vocabulary consists of predicates, constants and variables, where constants are partitioned into objects and integer time points, and variables are partitioned into object variables and time variables. An object term is an object or an object variable. A time term is either a time point, a time variable, or an expression of the form t + k where t is a time variable, k is an integer number, and + is the standard integer addition function. The offset Δ(s) of a time term s equals zero if s is a time variable or a time point, and it equals k if s is of the form s = t + k. Predicates are partitioned into extensional (EDB) and intensional (IDB), and they come with a nonnegative integer arity n, where each position 1 ≤ i ≤ n is of either object or time sort. A predicate is rigid if all its positions are of object sort, and it is temporal if the last position is of time sort and all other positions are of object sort. An atom is an expression P(t1, . . . , tn) where P is a predicate and each ti is a term of the required sort. A rigid atom (respectively, temporal, IDB, EDB) is an atom involving a rigid predicate (respectively, temporal, IDB, EDB).
A rule r is of the form ∧i αi → α, where α and each αi are rigid or temporal atoms, and α is IDB whenever ∧i αi is non-empty. Atom head(r) = α is the head of r, and body(r) = ∧i αi is the body of r. Rules are assumed to be safe: each head variable must occur in the body. An instance r′ of r is obtained by applying a substitution to r. A program Π is a finite set of rules. Predicate P is Π-dependent on a predicate P′ if there is a rule of Π with P in the head and P′ in the body. The rank rank(P, Π) of P w.r.t. Π is 0 if P does not occur in head position in Π, and otherwise it is the maximum of the values rank(P′) + 1 for P′ a predicate such that P is Π-dependent on P′. We write rank(P) for rank(P, Π) if Π is clear from the context. The rank rank(Π) of Π is the maximum rank of a predicate in Π. A query is a pair Q = ⟨PQ, ΠQ⟩ where ΠQ is a program and PQ is an IDB predicate in ΠQ; query Q is temporal (rigid) if PQ is a temporal (rigid) predicate.

A term, atom, rule, or program is ground if it contains no variables. A fact α is a ground, function-free rigid or temporal atom; every fact α corresponds to a rule of the form ⊤ → α, where ⊤ denotes the empty conjunction, so we use α and ⊤ → α interchangeably. A dataset D is a program consisting of EDB facts. The τ-segment D[τ] of dataset D is the subset of D containing all rigid facts and all temporal facts with time argument τ′ > τ. A program Π (respectively, query Q) is: Datalog if no temporal predicate occurs in Π (in ΠQ); and nonrecursive if the directed graph induced by the Π-dependencies (ΠQ-dependencies) is acyclic.

Semantics and standard reasoning. Rules are interpreted in the standard way as universally quantified first-order sentences. A Herbrand interpretation H is a (possibly infinite) set of facts. Interpretation H satisfies a rigid atom α if α ∈ H, and it satisfies a temporal atom α if evaluating the addition function in α yields a fact in H. The notion of satisfaction is extended to conjunctions of ground atoms, rules and programs in the standard way. If H |= Π, then H is a model of Π. Program Π entails a fact α, written Π |= α, if H |= Π implies H |= α. The answers to a query Q over a dataset D, written Q(D), are the tuples a of constants such that ΠQ ∪ D |= PQ(a). If Q is a temporal query, we denote with Q(D, τ) the subset of answers in Q(D) referring to time point τ. Given an input query Q, dataset D and tuple a, the query evaluation problem is to check whether a is an answer to Q over D; the data complexity of query evaluation is the complexity when Q is considered fixed. Finally, a query Q1 is contained in a query Q2, written Q1 ⊑ Q2, if Q1(D) ⊆ Q2(D) for every dataset D. Given input queries Q1 and Q2, the query containment problem is to check whether Q1 is contained in Q2.

Complexity. Query evaluation is PSPACE-complete in data complexity assuming that numbers are coded in unary (Chomicki and Imieliński 1988). Data complexity drops to the circuit class AC0 for nonrecursive programs. By standard results in nontemporal Datalog, containment of temporal queries is undecidable (Shmueli 1993). Furthermore, it is co-NEXP-hard for nonrecursive queries (Benedikt and Gottlob 2010).
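Since Π-dependencies, nonrecursiveness and rank are purely syntactic notions, they are straightforward to compute. The following sketch is an illustration only; the encoding of rules as head/body predicate pairs is an assumption. It derives the dependency graph, checks acyclicity, and computes ranks for rules (1)-(4) of Example 1.

```python
# Minimal sketch: Π-dependencies, nonrecursiveness and rank for a program given
# as (head_predicate, [body_predicates]) pairs; this encoding is an assumption
# made purely for illustration.
from functools import lru_cache

rules = [("Flag", ["Temp"]),            # rule (1) of Example 1
         ("Cool", ["Flag", "Flag"]),    # rule (2)
         ("Shdn", ["Cool", "Flag"]),    # rule (3)
         ("Malfunc", ["Shdn"])]         # rule (4)

# P is Π-dependent on P' if some rule has P in the head and P' in the body.
deps = {}
for head, body in rules:
    deps.setdefault(head, set()).update(body)

def nonrecursive(deps):
    """True iff the dependency graph is acyclic (depth-first cycle check)."""
    seen, on_path = set(), set()
    def visit(p):
        if p in on_path:
            return False
        if p in seen:
            return True
        seen.add(p); on_path.add(p)
        ok = all(visit(q) for q in deps.get(p, ()))
        on_path.discard(p)
        return ok
    return all(visit(p) for p in deps)

@lru_cache(maxsize=None)
def rank(p):
    """rank(P) = 0 if P never occurs in a head; otherwise 1 + max over its dependencies."""
    if p not in deps:
        return 0
    return 1 + max((rank(q) for q in deps[p]), default=-1)

print(nonrecursive(deps))          # True: rules (1)-(4) are nonrecursive
print({p: rank(p) for p in ["Temp", "Flag", "Cool", "Shdn", "Malfunc"]})
# {'Temp': 0, 'Flag': 1, 'Cool': 2, 'Shdn': 3, 'Malfunc': 4}
```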
## 3 Stream Reasoning Problems

A stream reasoning algorithm receives as input a query Q and a stream of temporal EDB facts, and produces as output a stream of answers to Q. Both input facts and query answers are processed by increasing value of their timestamps, where τin and τout represent the current times at which input facts are received and query answers are being streamed out, respectively. Answers at τout are only output when the algorithm can determine that they cannot be affected by future input facts. In turn, input facts received so far are kept in a history dataset D, since future query answers can be influenced by facts received at an earlier time; practical systems, however, have limited memory, and hence the algorithm must also forget facts in the history as soon as it can determine that they will not influence future query answers. Algorithms 1 and 2 provide two different realisations of such a stream reasoning algorithm, which we refer to as online and offline, respectively.

    Algorithm 1: Online Stream Reasoning Algorithm
    Parameters: Temporal query Q
    1   D := ∅, τin := 0, τout := 0, τmem := 0
    2   loop
    3     receive facts U that hold at τin and set D := D ∪ U
    4     while τout ≤ τin and τout is definitive for Q, D, τin do
    5       stream output Q(D, τout)
    6       τout := τout + 1
    7     end while
    8     while τmem < τout and τmem is forgettable for Q, D, τin, τout do
    9       forget all temporal facts in D holding at τmem
    10      τmem := τmem + 1
    11    end while
    12    τin := τin + 1

The online algorithm (see Algorithm 1) decides which answers to stream and which history facts to forget on the fly as new input data arrives. The algorithm records the latest time point τout for which answers have not yet been streamed; as τin increases and new data arrives, the algorithm checks (lines 4-7) whether answers at τout can now be streamed and, if so, it continues incrementing τout until it finds a time point for which answers cannot be provided yet. This process relies on deciding whether the considered τout are definitive, that is, whether the answers to Q at τout for the history D will remain stable even if D were extended with an unknown (and thus arbitrary) set U of future input facts.

Definition 1. A τin-history D is a dataset consisting of rigid facts and temporal facts with time argument at most τin. A τin-update U is a dataset consisting of temporal facts with time argument strictly greater than τin.

Definition 2. An instance I of the Definitive Time Point (DTP) problem is a tuple ⟨Q, D, τin, τout⟩, with Q a temporal query, D a τin-history and τout ≤ τin. DTP holds for I iff Q(D, τout) = Q(D ∪ U, τout) for each τin-update U.

Example 2. Consider Example 1, and suppose we are interested in determining the time points at which a turbine malfunctions. Thus, let the query Q have output predicate Malfunc and include rules (1)-(4) together with the rule Temp(x, na, t) → Malfunc(x, t), which defines an invalid reading as a malfunction. For a history D consisting of the fact Temp(a, high, 0), we have that DTP(Q, D, 0, 0) is false, since Q(D, 0) is empty and Q(D ∪ U, 0) is not if the update U contains Temp(a, high, 1) and Temp(a, high, 2). For a history D′ consisting of the fact Temp(a, na, 0), we have that DTP(Q, D′, 0, 0) is true, since Q(D′, 0) already includes the only possible answer.

Algorithm 1 also records the latest time point τmem for which history facts have not yet been forgotten. As τin increases, the algorithm checks in lines 8-11 whether all history facts at time τmem can now be forgotten and, if so, it continues incrementing τmem until it finds a point where this is no longer possible. For this, the algorithm decides whether the relevant τmem are forgettable, in the sense that no future answer to Q can be affected by the history facts at τmem.

Definition 3. An instance I of FORGET is a tuple of the form ⟨Q, D, τin, τout, τmem⟩, with Q a temporal query, D a τin-history, and τmem ≤ τout ≤ τin. FORGET holds for I iff Q(D ∪ U, τ) = Q(D[τmem] ∪ U, τ) for each τin-update U and each time point τ ≥ τout.

Example 3. Consider the query Q with output predicate Shdn and rules (1)-(3) from Example 1. For a history D consisting of facts Temp(a, high, 0) and Temp(a, low, 1), we have that FORGET(Q, D, 1, 1, 1) is true, since Q(D[1] ∪ U, 1) is empty for every 1-update U. For a history D′ containing facts Temp(a, high, 0) and Temp(a, high, 1), we have that FORGET(Q, D′, 1, 1, 1) is false, since Q(D′[1] ∪ U, 2) is empty but Q(D′ ∪ U, 2) is not for the 1-update containing Temp(a, high, 2).
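The following Python-style skeleton shows one way Algorithm 1 might be organised around the two decision problems; it is a sketch under the assumption that receive, emit, evaluate, is_definitive and is_forgettable are supplied externally. The last two are exactly the DTP and FORGET tests of Definitions 2 and 3, whose complexity is studied in Section 4.

```python
# Skeleton of the online algorithm (Algorithm 1), for illustration only.
# All five callables are assumed stubs supplied by the surrounding system;
# is_definitive and is_forgettable correspond to DTP and FORGET, and facts
# are assumed to expose a .time attribute.

def online_stream_reasoning(Q, receive, emit, evaluate, is_definitive, is_forgettable):
    D = set()                        # history dataset
    t_in = t_out = t_mem = 0
    while True:                      # line 2: one iteration per input time point
        D |= receive(t_in)           # line 3: facts holding at t_in
        # lines 4-7: stream out every time point that has become definitive
        while t_out <= t_in and is_definitive(Q, D, t_in, t_out):
            emit(t_out, evaluate(Q, D, t_out))       # Q(D, t_out)
            t_out += 1
        # lines 8-11: forget every time point that has become forgettable
        while t_mem < t_out and is_forgettable(Q, D, t_in, t_out, t_mem):
            D = {f for f in D if f.time != t_mem}    # drop temporal facts at t_mem
            t_mem += 1
        t_in += 1                    # line 12
```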
The offline algorithm (Algorithm 2) precomputes the minimum delay d and window size s for the standing query Q in a way that is independent from the input data stream. Intuitively, d represents the smallest time gap needed to ensure that, for any input stream and any time point τin, the time point τout = τin - d is definitive; in other words, that it is always safe to stream answers with a delay d relative to the currently processed input facts.

    Algorithm 2: Offline Stream Reasoning Algorithm
    Parameters: Temporal query Q
    1   D := ∅, τin := 0
    2   compute minimal delay d and minimal window size s for Q
    3   loop
    4     receive facts U that hold at τin and set D := D ∪ U
    5     if τin - d ≥ 0 then stream output Q(D, τin - d)
    6     forget all temporal facts in D holding at τin - s
    7     τin := τin + 1

Definition 4. An instance I of DELAY is a pair ⟨Q, d⟩, with Q a temporal query and d a nonnegative integer. DELAY holds for I iff Q(D, τin - d) = Q(D ∪ U, τin - d) for each time point τin, each τin-history D, and each τin-update U.

Example 4. In Example 1, 0 is a valid delay for the AtRisk and Shdn queries, and so is 2 for the Malfunc query.

In turn, s represents the size of the smallest time interval for which the history needs to be kept; in other words, for any input stream and any time point τin, it is safe to forget all history facts with timestamp smaller than τin - s.

Definition 5. An instance I of WINDOW is a triple ⟨Q, d, s⟩, with Q a temporal query and d and s nonnegative integers. WINDOW holds for I iff Q(D ∪ U, τout) = Q(D[τin - s] ∪ U, τout) for all time points τin and τout with τout > τin - d, each τin-history D, and each τin-update U.

Example 5. Consider Example 1. Assuming that we want to evaluate queries with delay 0, a valid window size for the Shdn query is 2, whereas the AtRisk query has no valid window size (or, equivalently, the query requires a window of infinite size), since answers for that query can depend on facts arbitrarily far in the past.

Once the delay d and window size s have been determined, they remain fixed during the execution of the algorithm: indeed, as τin increases and new data arrives in each iteration of the main loop, Algorithm 2 simply streams query answers at τin - d and forgets all history facts at τin - s. This is in contrast to the online approach, where the algorithm had to decide in each iteration of the main loop which answers to stream and which facts to forget.
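Under the same illustrative assumptions as the previous sketch, the offline loop reduces to fixed arithmetic on τin once d and s are known; computing these two constants is the subject of Section 4.3.

```python
# Skeleton of the offline algorithm (Algorithm 2), for illustration only.
# compute_delay_and_window, receive, emit and evaluate are assumed stubs;
# once d and s are fixed, each iteration is constant bookkeeping on the history.

def offline_stream_reasoning(Q, receive, emit, evaluate, compute_delay_and_window):
    d, s = compute_delay_and_window(Q)   # one-time, data-independent (Section 4.3)
    D = set()
    t_in = 0
    while True:
        D |= receive(t_in)
        if t_in - d >= 0:
            emit(t_in - d, evaluate(Q, D, t_in - d))   # safe if DELAY holds for (Q, d)
        D = {f for f in D if f.time != t_in - s}       # window: drop facts at t_in - s
        t_in += 1
```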
## 4 Complexity of Stream Reasoning

We now start our investigation of the computational properties of the stream reasoning problems introduced in Section 3. For all problems, we consider both the general case applicable to arbitrary inputs and the restricted setting where the input queries are nonrecursive. All our results assume that numbers are coded in unary.

### 4.1 Definitive Time Point

Let I = ⟨Q, D, τin, τout⟩ be a fixed, but arbitrary, instance of DTP and denote with ΩI the set consisting of all objects in ΠQ ∪ D and a fresh object oI unique to I. As stated in Definition 2, DTP holds for I if and only if the query answers at time τout over the history D coincide with the answers at the same time point but over D extended with an arbitrary τin-update U. Note that, in addition to new facts over existing objects in D and Q, the update U may also include facts about new objects. The following proposition shows that, to decide DTP, it suffices to consider updates involving only objects from ΩI. Intuitively, updates containing fresh objects can be homomorphically embedded into updates over ΩI by mapping all fresh objects to oI.

Proposition 1. DTP holds for I iff o ∈ Q(D ∪ U, τout) implies o ∈ Q(D, τout) for every tuple o over ΩI and every τin-update U involving only objects in ΩI.

The general case. We next show that DTP is decidable and provide tight complexity bounds. Our upper bounds are obtained by showing that, to decide DTP, it suffices to consider a single critical update and a slight modification of the query, which we refer to as the critical query. Intuitively, the critical update is a dataset that contains all possible facts at the next time point τin + 1 involving EDB predicates from Q and objects in ΩI. In turn, the critical query extends Q with rules that propagate all facts in the critical update recursively into the future. The intention is that the answers to the critical query over D extended with the critical update will capture the answers to Q over D extended with any arbitrary future update. In the following definition, we use ψ to denote the renaming mapping each temporal EDB predicate to a fresh temporal IDB predicate of the same arity.

Definition 6. Let A be a fresh unary temporal EDB predicate. The critical update ΥI for I is the τin-update containing the fact A(τin + 1), and all facts P(o, τin + 1) for each temporal EDB predicate P in ΠQ and each tuple o over ΩI. Let V be a fresh unary temporal IDB predicate. The critical query ΘI for I is the query where PΘI = PQ and ΠΘI is obtained from ψ(ΠQ) by adding the rule A(t) → V(t), the rule V(t) → V(t + 1), and the following rules for each temporal EDB predicate P occurring in ΠQ, where P′ = ψ(P):

    P(x, t) → P′(x, t)
    V(t + 1) ∧ P′(x, t) → P′(x, t + 1)

The construction of the critical query and update ensures, on the one hand, that Q(D, τout) = ΘI(D, τout) and, on the other hand, that Q(D ∪ U, τout) ⊆ ΘI(D ∪ ΥI, τout) for each τin-update U involving only objects in ΩI. We can exploit these properties, together with Proposition 1, to show that DTP can be decided by checking whether, at τout, the answers to the critical query over D remain the same if D is extended with the critical update.

Lemma 1. DTP holds for I iff o ∈ ΘI(D ∪ ΥI, τout) implies o ∈ ΘI(D, τout) for every tuple o over ΩI.

It follows from Lemma 1 that, to decide DTP, we need to perform two temporal query evaluation tests for each candidate tuple o. Since temporal query evaluation is feasible in PSPACE in data complexity, so is DTP, because the number of candidate tuples o is polynomial if Q is fixed. Furthermore, query evaluation is reducible to DTP, and hence the aforementioned PSPACE upper bound in data complexity is tight.

Theorem 1. DTP is PSPACE-complete in data complexity.
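As an illustration of Definition 6, the sketch below generates the critical update and the additional rules of the critical query for a given set of temporal EDB predicates and objects; the string encoding of rules and the use of a primed name for ψ(P) are assumptions made for readability.

```python
# Illustrative generator for the critical update Υ_I and the extra rules of the
# critical query Θ_I (Definition 6). Predicates are given as (name, object_arity)
# pairs; rules are produced as plain strings purely for readability.
from itertools import product

def critical_update(edb_predicates, objects, tau_in):
    facts = {f"A({tau_in + 1})"}                       # the marker fact A(τin + 1)
    for name, arity in edb_predicates:
        for tup in product(objects, repeat=arity):     # every tuple over Ω_I
            args = ", ".join(tup + (str(tau_in + 1),))
            facts.add(f"{name}({args})")
    return facts

def critical_query_rules(edb_predicates):
    rules = ["A(t) -> V(t)", "V(t) -> V(t + 1)"]       # V marks τin+1 and every later point
    for name, arity in edb_predicates:
        xs = ", ".join(f"x{i}" for i in range(1, arity + 1))
        rules.append(f"{name}({xs}, t) -> {name}'({xs}, t)")
        rules.append(f"V(t + 1), {name}'({xs}, t) -> {name}'({xs}, t + 1)")
    return rules

# Example 2's query has the single temporal EDB predicate Temp with two object positions.
print(sorted(critical_update([("Temp", 2)], ["a", "oI"], tau_in=0)))
for r in critical_query_rules([("Temp", 2)]):
    print(r)
```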
Nonrecursive queries. We next show that DTP becomes tractable in data complexity for nonrecursive queries. In the remainder of this section, we fix an arbitrary instance I = ⟨Q, D, τin, τout⟩ of DTP, where Q is nonrecursive. We assume w.l.o.g. that Q does not contain rigid atoms: each such atom P(c) can be replaced with a temporal atom of the form, e.g., P′(c, 0). We make the additional technical assumption that each rule in Q is restricted as follows.

Definition 7. A rule is connected if it contains at most one temporal variable, which occurs in the head whenever it occurs in the body. A query is connected if so are its rules.

Restricting our arguments to connected queries allows us to considerably simplify definitions and proofs.

We start with the observation that the critical query of I always includes recursive rules that propagate information arbitrarily far into the future (see Definition 6). As a result, our general algorithm for DTP does not immediately provide an improved upper bound in the nonrecursive case. The need for such recursive rules, however, is motivated by the fact that the answers to a recursive query Q at time τout may depend on facts at time points τ arbitrarily far from τout; in other words, there is no bound b for I such that |τ - τout| ≤ b in every derivation of query answers at τout involving input facts at τ. If Q is nonrecursive, however, such a bound is guaranteed to exist and can be established based on the following notion of program radius. Intuitively, query answers at τout ≤ τin can only be influenced by future facts whose timestamp is located within the interval [τin, τout + rad(ΠQ)].

Definition 8. Let r be a connected rule mentioning a time variable. The radius rad(r) of r is the maximum of the values |Δ(s) - Δ(s′)| for s the time argument in the head of r and s′ the time argument in a body atom of r. The radius rad(Π) of a connected program Π is given by the number of rules in Π multiplied by the maximum radius of a rule in Π.

Thus, to show tractability of DTP, we identify a polynomially bounded number of critical time points using the radius of ΠQ and argue that we can dispense with the aforementioned recursive rules by constructing a critical update for I that includes all facts over these time points. Since ΠQ may contain explicit time points in rules, these also need to be taken into account when defining the relevant critical time points and the corresponding critical update.

Definition 9. Let τ0 be the maximum value between τout and the largest time point occurring in ΠQ. A time point τ is critical for I if τin < τ ≤ τ0 + rad(ΠQ). The bounded critical update ΥbI of I consists of each fact P(o, τ) with P a temporal EDB predicate in ΠQ, o a tuple over ΩI, and τ a critical time point.

The following lemma justifies the key property of the critical update ΥbI, namely that the answers to Q over D ∪ ΥbI capture those over D extended with any future update.

Lemma 2. Let a be the maximum radius of a rule in ΠQ and let T consist of τout and the time points in ΠQ. If ΠQ ∪ D ∪ U |= P(o, τ) for a τin-update U involving only objects in ΩI and a predicate P in ΠQ, and |τ - τ′| ≤ a · (rank(ΠQ) - rank(P)) for some τ′ ∈ T, then ΠQ ∪ D ∪ ΥbI |= P(o, τ).

We are now ready to establish the analogue of Lemma 1 in the nonrecursive case.

Lemma 3. DTP holds for I iff o ∈ Q(D ∪ ΥbI, τout) implies o ∈ Q(D, τout) for every tuple o over ΩI.

Tractability of DTP then follows from Lemma 3 and the tractability of query evaluation for nonrecursive programs.

Theorem 2. DTP is in P in data complexity if restricted to nonrecursive connected queries.
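The radius of Definition 8 and the critical time points of Definition 9 are simple arithmetic on the offsets of time terms. A small sketch, assuming rules are given just by the offsets of their head and body time arguments:

```python
# Illustrative computation of rule and program radii (Definition 8) and of the
# critical time points of Definition 9. A connected rule is encoded here simply
# by the offsets Δ of its head and body time arguments, which is an assumption
# made for the sake of the example.

def rule_radius(head_offset, body_offsets):
    """rad(r): maximum |Δ(head) - Δ(body atom)| over the body atoms of r."""
    return max(abs(head_offset - b) for b in body_offsets)

def program_radius(rules):
    """rad(Π): number of rules times the maximum rule radius."""
    return len(rules) * max(rule_radius(h, bs) for h, bs in rules)

def critical_time_points(tau_in, tau_out, explicit_time_points, rad_pi):
    """Definition 9: τ is critical if τin < τ ≤ τ0 + rad(ΠQ)."""
    tau0 = max([tau_out, *explicit_time_points])
    return list(range(tau_in + 1, tau0 + rad_pi + 1))

# Rules (1)-(4) of Example 1 as (head offset, [body offsets]) pairs:
rules = [(0, [0]),        # Temp(x, high, t) -> Flag(x, t)
         (1, [0, 1]),     # Flag(x, t), Flag(x, t+1) -> Cool(x, t+1)
         (1, [0, 1]),     # Cool(x, t), Flag(x, t+1) -> Shdn(x, t+1)
         (-2, [0])]       # Shdn(x, t) -> Malfunc(x, t-2)

print(program_radius(rules))                                    # 4 rules * max radius 2 = 8
print(critical_time_points(0, 0, [], program_radius(rules)))    # [1, 2, ..., 8]
```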
### 4.2 Forgetting

We now move on to the forgetting problem as given in Definition 3. Unfortunately, in contrast to DTP, forgetting is undecidable. This follows by a reduction from containment of nontemporal Datalog queries, a well-known undecidable problem (Shmueli 1993).

Theorem 3. FORGET is undecidable.

In the remainder of this section we show that, by restricting ourselves to nonrecursive input queries, we can regain not only decidability of forgetting, but also tractability in data complexity. Let I = ⟨Q, D, τin, τout, τmem⟩ be an arbitrary instance of FORGET where Q is nonrecursive. We adopt the same technical assumptions as in Section 4.1 for DTP in the nonrecursive case. Additionally, we assume that Q does not contain explicit time points in rules; note that the rules in our running example satisfy this restriction. We believe that dropping this assumption does not affect tractability, but we leave this question open for future work.

By Definition 3, to decide FORGET for I, we must check whether the answers Q(D ∪ U, τ) are included in Q(D[τmem] ∪ U, τ) for every τin-update U and τ ≥ τout. Similarly to the case of DTP, we identify two time intervals of polynomial size in the data, and show that it suffices to consider only updates U over the first interval and only time points τ over the second interval. In contrast to DTP, however, we need to potentially consider all possible such updates and cannot restrict ourselves to a single critical one. Note, however, that checking the aforementioned inclusion of query answers for all relevant updates and time points would lead to an exponential blowup in data complexity. To overcome this, we define instead nonrecursive queries Q1 and Q2 such that the desired condition holds if and only if Q1 ⊑ Q2, where Q1 and Q2 contain a (fixed) rule set derived from Q and a portion of the history D. Then, we show that checking such containment where only the data-dependent rules are considered part of the input is feasible in polynomial time.

We start by identifying the set of relevant time points for query answers and updates.

Definition 10. A time point τ is output-relevant for I if τout ≤ τ ≤ τmem + rad(Q). In turn, it is update-relevant for I if it satisfies τin < τ ≤ τmem + 2 · rad(Q).

Intuitively, answers at time points bigger than τmem + rad(Q) cannot be affected by history facts that hold before τmem; thus, we do not need to consider answers at time points that are not output-relevant. In turn, output-relevant time points cannot depend on facts in an update after τmem + 2 · rad(Q); thus, it suffices to consider updates containing only facts that hold at update-relevant time points. We next construct the aforementioned queries Q1 and Q2 using the identified update-relevant and output-relevant time points.

Definition 11. Let B be a fresh temporal IDB unary predicate and let D0 consist of all facts B(τ) for each update-relevant time point τ. For ψ as in Definition 6, let Π be the smallest program containing: (i) each rule in ψ(ΠQ) having a predicate different from PQ in the head; (ii) each rule obtained by grounding the time argument to an output-relevant time point in a rule in ψ(ΠQ) having PQ as head predicate; and (iii) the rule P(x, t) ∧ B(t) → P′(x, t) for each temporal EDB predicate P occurring in ΠQ, where P′ = ψ(P). We now let Q1 = ⟨PQ, Π1⟩ and Q2 = ⟨PQ, Π2⟩, where Π1 = Π ∪ D0 ∪ ψ(D) and Π2 = Π ∪ D0 ∪ ψ(D[τmem]).

Intuitively, the facts about B are used to tag the update-relevant time points; the rule P(x, t) ∧ B(t) → P′(x, t), when applied to the history D and any update U, will project U to the relevant time points and filter out all facts in D. Finally, the rules in (i) and (ii) allow us to derive the same consequences (modulo predicate renaming) as ΠQ, but only over the output-relevant time points.
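For a concrete feel of Definition 10 and of the tagging facts D0 in Definition 11, the following sketch computes the two intervals and the B facts for the instance of Example 3, assuming rad(ΠQ) has already been computed as in the earlier radius sketch.

```python
# Illustrative computation of the output-relevant and update-relevant time points
# of Definition 10 and of the tagging facts D0 of Definition 11. rad_q stands for
# rad(ΠQ); the string encoding of the B facts is an assumption made for readability.

def output_relevant(tau_out, tau_mem, rad_q):
    """τ with τout ≤ τ ≤ τmem + rad(Q)."""
    return list(range(tau_out, tau_mem + rad_q + 1))

def update_relevant(tau_in, tau_mem, rad_q):
    """τ with τin < τ ≤ τmem + 2·rad(Q)."""
    return list(range(tau_in + 1, tau_mem + 2 * rad_q + 1))

def d0_facts(tau_in, tau_mem, rad_q):
    return [f"B({tau})" for tau in update_relevant(tau_in, tau_mem, rad_q)]

# Instance of Example 3: τin = τout = τmem = 1, and rad(Q) = 3 for rules (1)-(3)
# (three rules, maximum rule radius 1).
print(output_relevant(1, 1, 3))   # [1, 2, 3, 4]
print(update_relevant(1, 1, 3))   # [2, 3, 4, 5, 6, 7]
print(d0_facts(1, 1, 3))          # ['B(2)', 'B(3)', ..., 'B(7)']
```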
We can now establish the correctness of our approach.

Lemma 4. FORGET holds for I iff Q1 ⊑ Q2, where Q1 and Q2 are as given in Definition 11.

Theorem 4. FORGET is in P in data complexity if restricted to nonrecursive connected queries whose rules contain no time points.

This concludes our discussion of the data-dependent problems motivated by our online stream reasoning algorithm (recall Algorithm 1). In the following section, we turn our attention to the data-independent problems motivated by our offline approach (Algorithm 2).

### 4.3 Data-Independent Problems

Query containment can be reduced to both DELAY and WINDOW using a variant of the reduction we used in Section 4.2 for the forgetting problem. As a result, we can show undecidability of both of our data-independent reasoning problems in the general case.

Theorem 5. DELAY is undecidable.

Theorem 6. WINDOW is undecidable.

Furthermore, our reductions from query containment preserve the shape of the queries, and hence they also provide a co-NEXP lower bound for both problems in the nonrecursive case (Benedikt and Gottlob 2010). In the remainder of this section, we show that this bound is tight. The co-NEXP upper bounds are obtained via reductions from DELAY and WINDOW into query containment for temporal Datalog, which we detail in the remainder of this section. Similarly to previous sections, we assume that queries are connected and also that they do not contain explicit time points or objects; the latter restriction is consistent with the purity assumption in (Benedikt and Gottlob 2010). Note, however, that the upper bound in (Benedikt and Gottlob 2010) for query containment only holds for standard nonrecursive Datalog; therefore, we first establish that this upper bound extends to the temporal case. Intuitively, temporal queries can be transformed into nontemporal ones by grounding them to a finite number of relevant time points based on their (finite) radius.

Lemma 5. Query containment restricted to nonrecursive queries that are connected and constant-free is in co-NEXP.

We now proceed to discussing our reductions from DELAY and WINDOW into temporal query containment. Consider a fixed, but arbitrary, instance I = ⟨Q, d⟩ of DELAY. We construct queries Q1 and Q2 providing the basis for our reduction. Let A and B be fresh unary temporal predicates, where A is EDB and B is IDB. Furthermore, let G be a fresh temporal IDB predicate of the same arity as PQ. Let Π1 extend ΠQ with the following rule:

    PQ(x, t) ∧ A(t) → G(x, t)                           (7)

and let Q1 = ⟨G, Π1⟩. Intuitively, Q1 restricts the answers to Q to time points where A holds. Let Q2 = ⟨G, Π2⟩, where Π2 is now the program obtained from ψ(ΠQ) by adding the previous rule (7) and the following rules for each k satisfying -rad(Q) ≤ k ≤ d and each temporal EDB predicate P in ΠQ, where P′ = ψ(P):

    A(t) → B(t + k)                                     (8)
    P(x, t) ∧ B(t) → P′(x, t)                           (9)

Intuitively, Q2 further restricts the answers to Q1 at any time τout to those that can be derived using facts in the interval [τout - rad(Q), τout + d]. It then follows that Q1 ⊑ Q2 if and only if DELAY holds for I. Furthermore, the construction of Q1 and Q2 is feasible in LOGSPACE.

Theorem 7. DELAY restricted to nonrecursive queries that are connected and constant-free is co-NEXP-complete.
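The construction of Q2 in the DELAY reduction only instantiates rule (8) for the offsets -rad(Q) ≤ k ≤ d and rule (9) once per temporal EDB predicate, which is why it is feasible in LOGSPACE. A hedged sketch of the rule generation, again emitting rules as strings for readability:

```python
# Illustrative generator for the extra rules of Q2 in the DELAY reduction:
# one copy of rule (8) per offset k with -rad(Q) <= k <= d, and one copy of
# rule (9) per temporal EDB predicate. The string encoding is an assumption.

def delay_reduction_rules(edb_predicates, rad_q, d):
    rules = []
    for k in range(-rad_q, d + 1):                     # rule (8), one per offset k
        rules.append(f"A(t) -> B(t + {k})" if k >= 0 else f"A(t) -> B(t - {-k})")
    for name, arity in edb_predicates:                 # rule (9), one per EDB predicate
        xs = ", ".join(f"x{i}" for i in range(1, arity + 1))
        rules.append(f"{name}({xs}, t), B(t) -> {name}'({xs}, t)")
    return rules

# Shape of the construction for one EDB predicate and small illustrative values.
for r in delay_reduction_rules([("Temp", 2)], rad_q=2, d=2):
    print(r)
# A(t) -> B(t - 2), A(t) -> B(t - 1), A(t) -> B(t + 0), A(t) -> B(t + 1),
# A(t) -> B(t + 2), Temp(x1, x2, t), B(t) -> Temp'(x1, x2, t)
```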
We conclude by providing the upper bound for WINDOW. For this, consider an arbitrary instance I = ⟨Q, d, s⟩. Similarly to the case of DELAY, we construct queries Q1 and Q2 for which containment holds iff WINDOW holds for I. Let In, Out and B be fresh unary temporal predicates, where In is EDB and Out, B are IDB. Furthermore, as before, let G be a fresh temporal IDB predicate of the same arity as PQ. For j a nonnegative integer, let Πj be the program extending ψ(ΠQ) with the following rules for each k satisfying -d < k ≤ -s + rad(Q), each ℓ satisfying -j < ℓ ≤ -s + 2 · rad(Q), and each temporal EDB predicate P in ΠQ, where P′ = ψ(P):

    In(t) → Out(t + k)                                  (10)
    PQ(x, t) ∧ Out(t) → G(x, t)                         (11)
    In(t) → B(t + ℓ)                                    (12)
    P(x, t) ∧ B(t) → P′(x, t)                           (13)

We define Q1 = ⟨G, Πd+rad(Q)⟩ and Q2 = ⟨G, Πs⟩. Intuitively, given a dataset for which In holds for a set of time points T, query Q1 captures the answers to Q at time points τout within the interval [τ - d, τ - s + rad(Q)] for some τ ∈ T; in turn, for each such interval [τ - d, τ - s + rad(Q)], query Q2 further restricts the answers to Q1 to those that depend on input facts holding after τ - s. Since these queries can again be constructed in LOGSPACE, we obtain the desired upper bound.

Theorem 8. WINDOW restricted to nonrecursive queries that are connected and constant-free is co-NEXP-complete.

## 5 Related Work

The main challenges posed by stream processing and the basic architecture of a stream management system were first discussed in (Babu and Widom 2001; Babcock et al. 2002). Arasu, Babu, and Widom (2006) proposed the CQL query language, which extends SQL with a notion of window, a mechanism that allows one to reduce stream processing to traditional query evaluation. Since then, there have been numerous extensions and variants of CQL, which include a number of stream query languages for the Semantic Web (Barbieri et al. 2009; Le-Phuoc et al. 2011; 2013; Dell'Aglio et al. 2015).

In recent years, there have been several proposals for a general-purpose rule-based language in the context of stream reasoning. Streamlog (Zaniolo 2012) is a temporal Datalog language, which differs from the language considered in our paper in that it provides nonmonotonic negation and restricts the syntax so that only facts over time points explicitly present in the data can be derived. Furthermore, the focus in (Zaniolo 2012) is on dealing with so-called blocking queries, which are those whose answers may depend on input facts arbitrarily far in the future; for this, a syntactic fragment of the language is provided that precludes blocking queries. LARS is a temporal rule-based language featuring window constructs and negation interpreted according to the stable model semantics (Beck et al. 2015; Beck, Dao-Tran, and Eiter 2015; Beck, Dao-Tran, and Eiter 2016). The semantics of LARS is rather different from that of temporal Datalog; in particular, the number of time points in a model is considered as part of the input to query evaluation, and hence is restricted to be finite; furthermore, the notion of window is built into LARS.

Stream reasoning has been studied in the context of RDF Schema (Barbieri et al. 2010) and ontology-based data access (Calbimonte, Corcho, and Gray 2010; Özçep, Möller, and Neuenstadt 2014). In these works, the input data is assumed to arrive as a stream, but the ontology language is assumed to be nontemporal.
Stream reasoning has also been considered in the unrelated context of complex event processing (Anicic et al. 2011; Dao-Tran and Le-Phuoc 2015).

There have been a number of proposals for rule-based languages in the context of temporal reasoning; here, the focus is on query evaluation over static temporal data, rather than on reasoning problems that are specific to stream processing. Our temporal Datalog language is a notational variant of Datalog1S, the core language for temporal deductive databases (Chomicki and Imieliński 1988; Chomicki and Imieliński 1989; Chomicki 1990). Templog is an extension of Datalog with modal temporal operators (Abadi and Manna 1989), which was shown to be captured by Datalog1S (Baudinet, Chomicki, and Wolper 1993). Datalog was extended with integer periodicity and gap-order constraints in (Toman and Chomicki 1998); such constraints allow for the representation of infinite periodic phenomena. Finally, DatalogMTL is a recent Datalog extension based on metric temporal logic (Brandt et al. 2017). In the setting of database constraint checking, a problem related to our window problem was considered by Chomicki (1995), who obtained some positive results for queries formulated in temporal first-order logic.

## 6 Conclusion and Future Work

In this paper, we have proposed novel decision problems relevant to the design of stream reasoning algorithms, and have studied their computational properties for temporal Datalog. These problems capture the key challenges behind rule-based stream reasoning, where rules can propagate information both to past and future time points. Our results suggest that rule-based stream reasoning is feasible in practice for nonrecursive temporal Datalog queries. Our problems are, however, either intractable in data complexity or undecidable in the general case.

We have made several mild technical assumptions in our upper bounds for nonrecursive queries, which we plan to lift in future work. Furthermore, we have assumed throughout the paper that numbers in the input are encoded in unary; we are currently looking into the impact of binary encoding on the complexity of our problems. Finally, we are planning to study extensions of nonrecursive temporal Datalog for which decidability of all our problems can be ensured.

## Acknowledgments

This research was supported by the SIRIUS Centre for Scalable Data Access in the Oil and Gas Domain, the Royal Society, and the EPSRC projects DBOnto, MaSI3, and ED3.

## References

Abadi, M., and Manna, Z. 1989. Temporal logic programming. J. Symb. Comput. 8(3):277-295.
Anicic, D.; Fodor, P.; Rudolph, S.; and Stojanovic, N. 2011. EP-SPARQL: a unified language for event processing and stream reasoning. In WWW, 635-644.
Arasu, A.; Babu, S.; and Widom, J. 2006. The CQL continuous query language: Semantic foundations and query execution. VLDB J. 15(2):121-142.
Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; and Widom, J. 2002. Models and issues in data stream systems. In PODS, 1-16.
Babu, S., and Widom, J. 2001. Continuous queries over data streams. SIGMOD Rec. 30(3):109-120.
Barbieri, D. F.; Braga, D.; Ceri, S.; Della Valle, E.; and Grossniklaus, M. 2009. C-SPARQL: SPARQL for continuous querying. In WWW, 1061-1062.
Barbieri, D. F.; Braga, D.; Ceri, S.; Valle, E. D.; and Grossniklaus, M. 2010. Incremental reasoning on streams and rich background knowledge. In ESWC, 1-15.
Baudinet, M.; Chomicki, J.; and Wolper, P. 1993. Temporal deductive databases. In Tansel, A. U.; Clifford, J.; Gadia, S.; Jajodia, S.; Segev, A.; and Snodgrass, R., eds., Temporal Databases. Benjamin/Cummings. 294-320.
Beck, H.; Dao-Tran, M.; Eiter, T.; and Fink, M. 2015. LARS: A logic-based framework for analyzing reasoning over streams. In AAAI, 1431-1438.
Beck, H.; Dao-Tran, M.; and Eiter, T. 2015. Answer update for rule-based stream reasoning. In IJCAI, 2741-2747.
Beck, H.; Dao-Tran, M.; and Eiter, T. 2016. Equivalent stream reasoning programs. In IJCAI, 929-935.
Benedikt, M., and Gottlob, G. 2010. The impact of virtual views on containment. PVLDB 3(1-2):297-308.
Brandt, S.; Kalayci, E. G.; Kontchakov, R.; Ryzhikov, V.; Xiao, G.; and Zakharyaschev, M. 2017. Ontology-based data access with a Horn fragment of metric temporal logic. In AAAI, 1070-1076.
Calbimonte, J.-P.; Corcho, O.; and Gray, A. J. 2010. Enabling ontology-based access to streaming data sources. In ISWC, 96-111.
Chomicki, J., and Imieliński, T. 1988. Temporal deductive databases and infinite objects. In PODS, 61-73.
Chomicki, J., and Imieliński, T. 1989. Relational specifications of infinite query answers. In SIGMOD, 174-183.
Chomicki, J. 1990. Polynomial time query processing in temporal deductive databases. In PODS, 379-391.
Chomicki, J. 1995. Efficient checking of temporal integrity constraints using bounded history encoding. ACM Trans. Database Syst. 20(2):149-186.
Cosad, C.; Dufrene, K.; Heidenreich, K.; McMillon, M.; Jermieson, A.; O'Keefe, M.; and Simpson, L. 2009. Wellsite support from afar. Oilfield Review 21(2):48-58.
Dao-Tran, M., and Le-Phuoc, D. 2015. Towards enriching CQELS with complex event processing and path navigation. In HiDeSt@KI, 2-14.
Dao-Tran, M.; Beck, H.; and Eiter, T. 2015. Towards comparing RDF stream processing semantics. In HiDeSt@KI, 15-27.
Dell'Aglio, D.; Calbimonte, J.; Valle, E. D.; and Corcho, O. 2015. Towards a unified language for RDF stream query processing. In ESWC (Satellite Events), 353-363.
Le-Phuoc, D.; Dao-Tran, M.; Parreira, J. X.; and Hauswirth, M. 2011. A native and adaptive approach for unified processing of linked streams and linked data. In ISWC, 370-388.
Le-Phuoc, D.; Quoc, H. N. M.; Le Van, C.; and Hauswirth, M. 2013. Elastic and scalable processing of linked stream data in the cloud. In ISWC, 280-297.
Münz, G., and Carle, G. 2007. Real-time analysis of flow data for network attack detection. In IM, 100-108.
Nuti, G.; Mirghaemi, M.; Treleaven, P.; and Yingsaeree, C. 2011. Algorithmic trading. IEEE Computer 44(11):61-69.
Özçep, Ö. L.; Möller, R.; and Neuenstadt, C. 2014. A stream-temporal query language for ontology based data access. In KI, 183-194.
Ronca, A.; Kaminski, M.; Cuenca Grau, B.; Motik, B.; and Horrocks, I. 2017. Stream reasoning in temporal Datalog. CoRR abs/1711.04013.
Shmueli, O. 1993. Equivalence of Datalog queries is undecidable. J. Log. Program. 15(3):231-241.
Toman, D., and Chomicki, J. 1998. Datalog with integer periodicity constraints. J. Log. Program. 35(3):263-290.
Zaniolo, C. 2012. Logical foundations of continuous query languages for data streams. In Datalog 2.0, 177-189.