# Fully-Dynamic Decision Trees

Marco Bressan¹, Gabriel Damay², Mauro Sozio²
¹University of Milan, ²Institut Polytechnique de Paris, Télécom Paris
marco.bressan@unimi.it, gabriel.damay@telecom-paris.fr, sozio@telecom-paris.fr

We develop the first fully dynamic algorithm that maintains a decision tree over an arbitrary sequence of insertions and deletions of labeled examples. Given $\epsilon > 0$, our algorithm guarantees that, at every point in time, every node of the decision tree uses a split with Gini gain within an additive $\epsilon$ of the optimum. For real-valued features the algorithm has an amortized running time per insertion/deletion of $O(d \log^3 n / \epsilon^2)$, which improves to $O(d \log^2 n / \epsilon)$ for binary or categorical features, while it uses space $O(nd)$, where $n$ is the maximum number of examples at any point in time and $d$ is the number of features. Our algorithm is nearly optimal, as we show that any algorithm with similar guarantees requires amortized running time $\Omega(d)$ and space $\tilde{\Omega}(nd)$. We complement our theoretical results with an extensive experimental evaluation on real-world data, showing the effectiveness of our algorithm.

## 1 Introduction

Decision trees are a cornerstone of machine learning, and an essential tool in any machine learning library. Given a feature domain $\mathcal{X}$ and a label domain $\mathcal{Y}$, a decision tree is a function $f : \mathcal{X} \to \mathcal{Y}$ that assigns to each $x \in \mathcal{X}$ a label $y \in \mathcal{Y}$ by traversing a tree $T$ from its root node to a leaf. At each node of the tree, an example $x$ is evaluated by some rule that determines which successor should receive $x$; for instance, a common rule is a simple threshold on some feature. Every leaf is associated with a label, which is the result of the prediction when such a leaf is reached. The problem of constructing an optimal decision tree is NP-hard w.r.t. several natural objective functions (Shalev-Shwartz and Ben-David 2014). This has led to the introduction of several heuristic approaches, such as ID3, C4.5, C5.0 and CART, which have proven very effective and are now considered state of the art. Typically, those approaches proceed in a greedy fashion by selecting for each node a feature and a splitting value (we shall call such a pair a split) that optimize some measure of improvement such as the Gini gain or the information gain. This is repeated until a stopping condition is met, such as the tree reaching a certain height or the number of examples at every leaf falling below some threshold.

Recently, there have been significant efforts to adapt machine learning algorithms to a fully dynamic setting, where the algorithm is asked to process an arbitrary list of insertions or deletions. Insertions are typically the result of new data being collected or revealed, while deletions can be the result of noise removal, removal of personal data for privacy concerns, data becoming obsolete, etc. It might not be desirable to make any assumption on such a list of update operations, which motivates the design of fully dynamic algorithms. Most works in fully-dynamic machine learning algorithms have focused on unsupervised tasks such as clustering (Cohen-Addad et al. 2019; Henzinger and Kale 2020; Bateni et al. 2021; Chan, Guerqin, and Sozio 2018) or graph mining (Sawlani and Wang 2020; Bhattacharya et al. 2015; Epasto, Lattanzi, and Sozio 2015; De Stefani et al. 2017).
For decision trees, however, only incremental algorithms are known, which handle insertions but not deletions (these algorithms are also called online decision tree algorithms). The state of the art in this case is given by Hoeffding Trees (Domingos and Hulten 2000) and their evolutions such as EFDT or HAT (see (Manapragada et al. 2022) for a survey). Not only are these algorithms incapable of handling deletions, but no good bound on their amortized cost is known, and their guarantees hold only if the insertions are i.i.d. from some distribution. Our work represents one of the first studies on fully-dynamic supervised machine learning, which has been mostly unexplored so far, to the best of our knowledge.

Defining what kind of decision tree a dynamic algorithm should maintain requires some care. The first natural attempt is to maintain the very same decision tree that the greedy approaches above (ID3, C4.5, etc.) would produce from the current set of examples. Thus, at any time, every node should use a split with maximal gain with respect to the set of examples held by its subtree. The problem with this goal is that, at some point, the gains of the best and second-best splits may differ by just $O(1/n)$, where $n$ is the total number of examples. In that case, $O(1)$ updates can turn the second-best split into the best one, possibly forcing a reconstruction of the whole tree. The same happens if one wants splits within multiplicative factors of the best one, as the latter may be in $O(1/n)$. Hence, in those cases it is unclear whether there is an efficient fully-dynamic algorithm. The next natural goal is maintaining a tree with $\epsilon$-optimal splits, that is, within an additive $\epsilon$ of the best ones. We shall call such a decision tree $\epsilon$-feasible. Aiming at an $\epsilon$-feasible tree is reasonable, since excessively small gains are statistically not significant (a gain of, say, $10^{-4}$ is likely the result of noise) and thus approximating large gains is enough. Indeed, algorithms such as EFDT or HAT try to maintain precisely an $\epsilon$-feasible tree. However, their approach is based on computing exactly the Gini gains. For real-valued features, doing that for every possible split would lead to an $O(n)$ amortized cost. Hence, they resort to a heuristic, which comes at the price of worse results.

In this work we develop an efficient fully-dynamic algorithm for maintaining an $\epsilon$-feasible decision tree. Our first observation is that, in order to change by an additive term $\epsilon$ the gain of a given split on a sequence of examples $S$, one must make $\Omega(\epsilon|S|)$ insertions or deletions. Thus, one could rebuild a subtree after $\Theta(\epsilon|S|)$ updates, without even tracking the gains, with the number of updates covering the rebuilding cost. Intuitively speaking, this is one of the arguments we use in our amortized cost analysis, which is pretty standard. However, this yields amortized time bounds that are quadratic in the height $h$ of the tree, because a sequence of updates can force a cascade of rebuilds on $\Theta(h)$ subtrees each having height $\Theta(h)$. We show how to bypass this obstacle and save a factor of $h$ in the amortized cost with a proactive strategy that rebuilds subtrees slightly larger than necessary. Through a careful amortized analysis based on a few charging arguments, this yields our fully-dynamic algorithm for real-valued features. For categorical features, our algorithm can be improved via a faster tree reconstruction subroutine.
Moreover, our algorithm can satisfy constraints more general than just $\epsilon$-feasibility, including pruning at a certain height or guaranteeing splits only if enough examples are available. Finally, we prove that our algorithms are nearly optimal: no algorithm can beat their time or space usage by more than $\operatorname{polylog}(nd)$ factors, even if one looks at algorithms attaining considerably weaker guarantees. Our contributions can be summarized as follows:

- We present FUDYADT, a deterministic algorithm for maintaining an $\epsilon$-feasible decision tree under an arbitrary sequence of insertions and deletions. It uses $O(nd)$ space, while it has $O(d \log^3 n / \epsilon^2)$ amortized running time for real-valued features and $O(d \log^2 n / \epsilon)$ for categorical ones.
- We prove a lower bound of $\tilde{\Omega}(nd)$ on the space requirements and of $\Omega(d)$ on the amortized running time of any fully dynamic algorithm, even in easier settings. This makes FUDYADT optimal up to $\operatorname{polylog}(nd)$ factors.
- We conduct an extensive experimental evaluation on real-world data, evaluating FUDYADT's speed and accuracy against state-of-the-art tools such as EFDT and HAT.

Related Work. The works closest to ours are those in the incremental setting. Here, the algorithm receives a stream of examples from a distribution, and has to perform well when compared to the offline tree built on the entire sequence. In this setting, Hoeffding trees (Domingos and Hulten 2000) emerged as one of the most effective approaches, inspiring several variants, even ones capable of handling concept drifts (Hulten, Spencer, and Domingos 2001; Gama, Rocha, and Medas 2003; Manapragada, Webb, and Salehi 2018; Das et al. 2019; Sun et al. 2020; Haug, Broelemann, and Kasneci 2022; Jin and Agrawal 2003; Rutkowski et al. 2013); see (Manapragada et al. 2022) for a survey. These algorithms crucially rely on the examples being i.i.d., which allows them to compute good splits with high probability via concentration bounds (whence the name). Moreover, those algorithms cannot handle real-valued features efficiently, since on those features they would update $\Theta(n)$ counters at each time step, even when only insertions are allowed. Our algorithms instead efficiently handle arbitrary sequences of insertions and deletions of examples with real-valued features.

We observe that there are general techniques to turn offline data structures into dynamic ones, see (Bentley and Saxe 1980). Those techniques, however, work only for problems that have a special decomposability property: loosely speaking, the answer to a query (e.g., find $\min(X)$ for some set $X$) must be quickly computable from the answers to subqueries (e.g., $\min(A \cup B) = \min(\min(A), \min(B))$). In our case, a query corresponds to the label predicted by the tree for a given $x$. Unfortunately, our problem is far from decomposable and it does not seem solvable via such techniques.

## 2 Preliminaries

All missing proofs can be found in the full version (Bressan, Damay, and Sozio 2022). We denote the feature and label domains respectively by $\mathcal{X}$ and $\mathcal{Y}$; by default $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$. We denote by $(x, y) \in \mathcal{X} \times \mathcal{Y}$ a labeled example, by $x_j$ the value of its $j$-th feature, and by $S$ a multiset of labeled examples. We may treat $S$ as a sequence; this will be clear from the context. We assume examples can be stored in $O(d)$ bits, while the $x_j$'s can be accessed in time $O(1)$. We let $S[\,\cdot\,]$ be the subset of $S$ matching a condition; e.g., $S[x_j \le t] = \{(x, y) \in S : x_j \le t\}$. A split is a pair $(j, t) \in [d] \times \mathbb{R}$. We use the bold font for vectors (e.g., $\mathbf{x}$). We use the Gini gain to measure split quality.
Definition 2.1. The Gini index of $S$ is $g(S) = 2\, p_S (1 - p_S)$, where $p_S = \frac{1}{|S|} \sum_{(x,y) \in S} y$. The Gini gain of $(j, t)$ on $S$ is:

$$G(S, j, t) = g(S) - \left( \frac{|S^-|}{|S|}\, g(S^-) + \frac{|S^+|}{|S|}\, g(S^+) \right) \qquad (1)$$

where $S^- = S[x_j \le t]$ and $S^+ = S[x_j > t]$. When $|S| = 0$ we define $g(S) = 0$ and $G(S, j, t) = 0$ for all $(j, t) \in [d] \times \mathbb{R}$.

For all $j \in [d]$ let $G(S, j) = \max_{t \in \mathbb{R}} G(S, j, t)$. Hence, $\arg\max_j G(S, j)$ is a feature with maximum Gini gain over $S$. Finally, we let $G(S) = \max_{j \in [d]} G(S, j)$. We rely on the following smoothness properties of the Gini index and the Gini gain. Given two multisets/sequences $S, S'$, their edit distance $\Delta(S, S')$ is the minimum number of insertions and deletions to obtain $S'$ from $S$, and their relative edit distance is $\delta(S, S') = \Delta(S, S') / \max(|S|, |S'|)$.

Lemma 2.2. Let $S, S'$ be multisets of labeled examples. Then:
1. $|G(S, j, t) - G(S', j, t)| \le 12\, \delta(S, S')$ for all $(j, t) \in [d] \times \mathbb{R}$
2. $|g(S) - g(S')| \le 2.5\, \delta(S, S')$.
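To make Definition 2.1 concrete, the following is a minimal C++ sketch, not taken from the paper's implementation, that computes the Gini index and the Gini gain of a given split $(j, t)$ directly from equation (1); the `Example` struct and the function names are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// An example with d real-valued features and a binary label.
struct Example {
    std::vector<double> x;  // feature vector
    int y;                  // label in {0, 1}
};

// Gini index g(S) = 2 * p_S * (1 - p_S), with g(S) = 0 when S is empty.
double gini_index(const std::vector<Example>& S) {
    if (S.empty()) return 0.0;
    double positives = 0.0;
    for (const auto& e : S) positives += e.y;
    double p = positives / S.size();
    return 2.0 * p * (1.0 - p);
}

// Gini gain of the split (j, t) on S, following equation (1):
// G(S, j, t) = g(S) - (|S-|/|S|) g(S-) - (|S+|/|S|) g(S+),
// where S- = S[x_j <= t] and S+ = S[x_j > t].
double gini_gain(const std::vector<Example>& S, std::size_t j, double t) {
    if (S.empty()) return 0.0;
    std::vector<Example> Sminus, Splus;
    for (const auto& e : S)
        (e.x[j] <= t ? Sminus : Splus).push_back(e);
    double n = static_cast<double>(S.size());
    return gini_index(S)
         - (Sminus.size() / n) * gini_index(Sminus)
         - (Splus.size()  / n) * gini_index(Splus);
}
```

This brute-force version materializes $S^-$ and $S^+$ in time $O(|S|)$ per candidate split; the BUILD routine of Section 3 instead evaluates all thresholds of a feature with a single sorted sweep.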
Decision trees. A decision tree is a triple $(T, \Sigma, L)$, where $T = (V, A)$ is a directed binary tree rooted at $r(T)$, and $\Sigma$ and $L$ are functions that assign splits to internal nodes and labels to the leaves. More formally, $\Sigma = \{\sigma_v : v \in V(T)\}$ where $\sigma_v = (j_v, t_v) \in [d] \times \mathbb{R}$ for every internal node $v$ of $T$, while $L = \{L_v : v \in V(T)\}$ where $L_v \in \{0, 1\}$ for every leaf $v$ of $T$. For any $x \in \mathcal{X}$ and any internal vertex $v$ of $T$ let $\mathrm{succ}(v, x)$ be the left child of $v$ if $x_j \le t$ and the right child of $v$ otherwise, where $\sigma_v = (j, t)$. For any $v \in V(T)$ let $P_v = (v_0, \ldots, v_\ell)$ be the unique path from $v_0 = r(T)$ to $v_\ell = v$. For a multiset $S$, denote by $S_v$ the set of examples $x \in S$ such that $\mathrm{succ}(v_i, x) = v_{i+1}$ for all $i = 0, \ldots, \ell - 1$; this is the subset of $S$ associated to $v$. For every $x \in \mathcal{X}$ let $v(x)$ be the leaf $x$ is associated to. The labeling given by $T$ is the function $T : \mathcal{X} \to \mathcal{Y}$ such that $T(x) = L_{v(x)}$ for every $x \in \mathcal{X}$. We denote by $T_v$ the subtree of $T$ rooted at $v$ and by $(T, \Sigma, L)_v$ the decision subtree rooted at $v$.

Algorithms. A fully-dynamic decision tree algorithm $\mathcal{A}$ is defined as follows. The input of $\mathcal{A}$ is an update sequence $U$ of requests of three types: insertion, INS$(x, y)$; deletion, DEL$(x, y)$; labeling, LAB$(x)$. Each such sequence $U$ induces an active multiset of labeled examples $S$ obtained by inserting/deleting the examples following the order of the sequence. Suppose $\mathcal{A}$ has processed an update sequence $U$. We say $\mathcal{A}$ is coherent with a decision tree $T$ if, for every $x \in \mathcal{X}$, any further request LAB$(x)$ makes $\mathcal{A}$ output $T(x)$. The query time of $\mathcal{A}$ is the worst-case time it takes $\mathcal{A}$ to output $T(x)$. Our goal is to construct a fully dynamic algorithm $\mathcal{A}$ that has low query time and, at every point in time, is coherent with a decision tree $T$ that is $\epsilon$-feasible with respect to the current active set $S$ (see below).

## 3 A Fully Dynamic Decision Tree Algorithm

This section presents FUDYADT (Fully Dynamic Amortized Decision Tree). As argued in Section 1, one of our goals is to ensure that every node of the tree uses a split whose gain is within an additive $\epsilon$ of the maximum. FUDYADT satisfies a stricter guarantee, called $\epsilon$-feasibility, which also allows pruning the tree at some height or at leaves with few examples.

Definition 3.1. Let $k, h \in \mathbb{N}$, and let $\epsilon = (\alpha, \beta)$ where $\alpha, \beta \in (0, 1]$. A decision tree $(T, \Sigma, L)$ is $\epsilon$-feasible, with pruning thresholds $(k, h)$, w.r.t. a multiset $S$ of labeled examples if for every $v \in V(T)$:
1. if $|S_v| \le k$ or $g(S_v) = 0$ or $\mathrm{depth}_T(v) = h$ then $v$ is a leaf, else if $g(S_v) \ge \alpha$ then $v$ is an internal node
2. if $\sigma_v = (j, a)$ then $G(S_v, j, a) \ge G(S_v, j', a') - \beta$ for all $(j', a') \in [d] \times \mathbb{R}$
3. if $v$ is a leaf then $L_v$ is a majority label of $S_v$

For any fixed pruning thresholds $k, h$ we say that a fully dynamic algorithm $\mathcal{A}$ is $\epsilon$-feasible if, at any point in time, $\mathcal{A}$ is coherent with a decision tree $(T, \Sigma, L)$ that is $\epsilon$-feasible with respect to the current active set. When $k = 1$, $h = \infty$ and $\alpha = \beta$, $\epsilon$-feasibility reduces to the following condition: if $g(S_v) = 0$ then $v$ is a leaf, and if $g(S_v) \ge \alpha$ then $v$ is internal and uses an $\alpha$-optimal split. This is the $\epsilon$-optimality condition of Section 1 used by incremental algorithms such as Hoeffding trees and EFDT. We prove:

Theorem 3.2. Let $\mathcal{X} = \mathbb{R}^d$, let $k, h$ be positive integers, let $\alpha, \beta \in (0, 1]$, and let $0 < \epsilon < \min\!\big(\frac{1}{k+1}, \frac{\alpha}{5}, \frac{\beta}{12.5}\big)$. There is a deterministic $(\alpha, \beta)$-feasible fully dynamic decision tree algorithm with pruning thresholds $k, h$ that has query time $O(\bar h)$, uses space $O(nd)$, and has amortized running time per update $O(d \bar h \log^2 n / \epsilon) = O(d \log^3 n / \epsilon^2)$, where $\bar h \le h$ and $n$ are respectively the maximum height of the tree and the maximum size of the active set at any time.

    Algorithm 1 FUDYADT.UPDATE
    1: procedure UPDATE((T, Σ, L), (x, y), o)
    2:   let P_{v(x)} = v_{κ_1}, ..., v_{κ_ℓ}, with v_{κ_1} = r(T) and v_{κ_ℓ} = v(x)
    3:   update D_{v_{κ_ℓ}} and D^L_{v_{κ_ℓ}} according to (x, y), o
    4:   L_{v_{κ_ℓ}} ← any majority label in D^L_{v_{κ_ℓ}}
    5:   for i = 1, ..., ℓ do
    6:     c(v_{κ_i}) ← c(v_{κ_i}) + 1
    7:     if c(v_{κ_i}) > ε · s(v_{κ_i}) then
    8:       ŝ ← 2^⌈log s(v_{κ_i})⌉
    9:       j ← min{ j ∈ {1, ..., i} : s(v_{κ_j}) ≤ ŝ }
    10:      (T', Σ', L') ← BUILD(S_{v_{κ_j}}, j)
    11:      (T, Σ, L)_{v_{κ_j}} ← (T', Σ', L')
    12: return

Theorem 3.2 can be improved for categorical features, that is, when $\mathcal{X} = A^d$ for some fixed finite set $A$; this includes the case of binary features, $A = \{0, 1\}$.

Theorem 3.3. If $\mathcal{X} = A^d$ for a finite set $A$ then the amortized time bound of Theorem 3.2 can be improved to $O(d \log^2 n / \epsilon)$.

The rest of this section describes FUDYADT and proves Theorem 3.2, except for the $\epsilon$-feasibility part, which is proven in the full version, as is Theorem 3.3. Before moving on, let us give some intuition on FUDYADT. The algorithm consists of the two routines UPDATE and BUILD below. Those routines maintain a decision tree $T$ and, for each leaf $v$ of $T$, dictionaries $D_v$ and $D^L_v$ storing respectively the multiset $S_v$ associated to $v$ and the frequency histogram of the labels of $S_v$. At every insertion or deletion of an example $(x, y)$, UPDATE computes the leaf $v = v(x)$ where $x$ ends up, and updates $D_v$ and $D^L_v$ consequently, see line 3. Then, UPDATE checks if any subtree should be rebuilt. To this end, for every vertex $u \in V(T)$ it maintains two counters, $s(u)$ and $c(u)$, storing respectively the size of the multiset on which the subtree $T_u$ was rebuilt the last time and the number of updates that reached $u$ since that time. As soon as $c(u) > \epsilon\, s(u)$ for some $u \in V(T)$, UPDATE invokes BUILD to rebuild an appropriately chosen supertree of $T_u$.

Let us move to the bounds of Theorem 3.2. Proving those bounds requires some care in charging the cost of rebuilding the subtrees to the update requests. To this end, we need the following two simple results proven in the full version. From now on, by time $t$ we mean the $t$-th invocation of UPDATE.

Lemma 3.4. Let $(T, \Sigma, L)$ be the result of $t \ge 0$ invocations of UPDATE. Then $(1 - \epsilon)\, s_t(v) \le |S^t_v| \le (1 + \epsilon)\, s_t(v)$ for every $v \in V(T)$, where $s_t(v)$ is the value of $s(v)$ at time $t$.

Lemma 3.5. Let $(T, \Sigma, L)$ be a decision tree built on a sequence $S$. If every $v \in V(T)$ uses a split with gain at least $\gamma > 0$ w.r.t. $S_v$, then $T$ has height $O(\log |S| / \gamma)$.
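Before turning to BUILD, here is a hedged C++ sketch of the counter bookkeeping of Algorithm 1 (lines 5-11) that decides when and where to rebuild. The `Node` fields and the `rebuild_subtree` stub are hypothetical stand-ins, the power-of-two threshold follows line 8 as reconstructed above, and the splits and leaf dictionaries $D_v$, $D^L_v$ are omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical node of the maintained tree (splits and leaf dictionaries omitted).
struct Node {
    long long s = 0;  // size of the example multiset when this subtree was last (re)built
    long long c = 0;  // number of updates that reached this node since then
};

// Stand-in for Algorithm 2 (BUILD): rebuild the subtree rooted at path[j]
// from the examples currently stored below it.
void rebuild_subtree(std::vector<Node*>& path, std::size_t j) {
    // ... gather the subtree's examples and call BUILD; omitted in this sketch ...
    path[j]->c = 0;  // BUILD would also reset s(path[j]) to the new multiset size
}

// Sketch of the rebuild-triggering loop of UPDATE (lines 5-11 of Algorithm 1),
// run after the leaf dictionaries have been updated for the current request.
// `path` holds the nodes from the root r(T) down to the leaf v(x).
void maybe_rebuild(std::vector<Node*>& path, double eps) {
    for (std::size_t i = 0; i < path.size(); ++i) {
        Node* v = path[i];
        v->c += 1;                                        // line 6
        if (v->c > eps * v->s) {                          // line 7
            // Proactive choice (lines 8-9): rebuild the ancestor closest to the
            // root whose recorded size is at most the next power of two >= s(v).
            double s_hat = std::pow(2.0, std::ceil(std::log2(std::max<double>(1.0, (double)v->s))));
            std::size_t j = i;
            for (std::size_t a = 0; a <= i; ++a) {
                if ((double)path[a]->s <= s_hat) { j = a; break; }
            }
            rebuild_subtree(path, j);                     // lines 10-11
            return;  // the nodes below the rebuilt ancestor have been replaced
        }
    }
}
```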
    Algorithm 2 FUDYADT.BUILD
    1: procedure BUILD(S, η)
    2:   r ← new vertex, c(r) ← 0, s(r) ← |S|
    3:   if |S| ≤ k or g(S) ≤ α/2 or η = h then
    4:     store S in a dynamic dictionary D_r
    5:     and its labels in a dynamic dictionary D^L_r
    6:     (T, Σ, L) ← decision tree with T = ({r}, ∅)
    7:     L_r ← any majority label in D^L_r
    8:   else if g(S) > α/2 then
    9:     (j, a) ← arg max{ G(S, ĵ, â) : (ĵ, â) ∈ [d] × ℝ }
    10:    T_1 ← BUILD(S[x_j ≤ a], η + 1)
    11:    T_2 ← BUILD(S[x_j > a], η + 1)
    12:    (T, Σ, L) ← decision tree with root r, T_1, T_2 as left and right subtrees, and split σ_r = (j, a)
    13: return (T, Σ, L)

We can now prove:

Lemma 3.6. BUILD and UPDATE can be implemented so that any $T$ invocations of UPDATE take time $O(T\, d\, \bar h\, (\log n)^2 / \epsilon) = O(T\, d\, (\log n)^3 / \epsilon^2)$.

Proof. First we describe the data structures and the time taken by the basic operations of BUILD and UPDATE. Using a self-balancing tree for $D_v$ we ensure search, insert, update, and deletion in time $O(d \log N)$, and enumeration in time $O(dN)$, where $N$ is the number of distinct entries in the data structure (recall that every element takes $O(d)$ bits). The same holds for $D^L_v$, which has at most 2 distinct entries. Thus the block at line 3 of BUILD$(S, \eta)$ runs in time $O(d|S| \log |S|)$.

If instead the condition at line 3 fails, then BUILD must compute $(j, a)$. To this end one proceeds as follows. First, for each $j \in [d]$ one computes the projection $S|_j$ of $S$ on the $j$-th feature (keeping the label as well). Then one sorts $S|_j$ according to the feature values in time $O(|S| \log |S|)$. Next, one scans $S|_j$ and finds the threshold $t$ for which a split on $j$ yields maximum gain, in time $O(|S|)$: one just needs to keep label counts for the subsequence formed by the first $i$ examples in $S|_j$, so that the gain a split at that point would yield can be computed in time $O(1)$ from the counts of the first $(i-1)$ examples. Summarizing, one can compute the optimal split $(j, a)$ in time $O(d|S| \log |S|)$; see the sketch after this proof. Since $|S[x_j \le a]| + |S[x_j > a]| = |S|$, it follows that BUILD$(S, \eta)$ always runs in time $O(d|S| \log |S| \,(h + \log |S|))$, which is in $O(d|S| \log n \,(h + \log n))$ since $|S| \le n$ by definition of $n$.

For UPDATE, computing $v_{\kappa_1}, \ldots, v_{\kappa_\ell}$ takes time $O(h)$, while performing any INS or DEL operation on $S_{v_{\kappa_\ell}}$ takes time $O(d \log |S^t|) = O(d \log n)$. Finally, computing the input $S_{v_{\kappa_j}}$ of BUILD takes time $O(|S_{v_{\kappa_j}}|)$ by visiting $T_{v_{\kappa_j}}$ and listing the data structures at its leaves.

Now, we bound the total time taken by $T$ successive invocations of UPDATE. Let $B = \{t \in [T] : \text{BUILD is invoked at time } t\}$. For every $t \in B$ let $b(t)$ be such that $v_{b(t)}$ is the vertex $v_{\kappa_j}$ on which BUILD is invoked. The total running time $\mathrm{cost}(T)$ of the $T$ invocations satisfies:

$$\mathrm{cost}(T) \le \sum_{t=1}^{T} O(h + d \log n) + \sum_{t \in B} O\big( (h + \log n)\, \log n \; d\, |S^t_{v_{b(t)}}| \big) \qquad (3)$$

The first term contributes $O(T(h + d \log n))$. We now bound the second term. For every $t \in B$ consider the $t$-th execution of UPDATE. Let $s_t(v)$ be the value of $s(v)$ right before BUILD is invoked, and $v_{i(t)}$ be the vertex that satisfies the condition at line 7 of UPDATE. Note that $v_{i(t)}$ is by construction a descendant of $v_{b(t)}$. Finally, for $v \in \{v_{b(t)}, v_{i(t)}\}$ let $c_t(v)$ and $s_t(v)$ be the values of $c(v)$ and $s(v)$ right after line 6 is executed with $v_{\kappa_i} = v$. Then:

$$\sum_{t \in B} |S^t_{v_{b(t)}}| \;\le\; 2 \sum_{t \in B} s_t(v_{b(t)}) \qquad (4)$$
$$\le\; 4 \sum_{t \in B} s_t(v_{i(t)}) \qquad (5)$$
$$\le\; \frac{4}{\epsilon} \sum_{t \in B} c_t(v_{i(t)}) \qquad (6)$$
$$\le\; \frac{4}{\epsilon} \sum_{t \in B} c_t(v_{b(t)}) \qquad (7)$$

where (4) follows from Lemma 3.4 noting that $\epsilon \le 1$, (5) and (6) follow respectively from lines 8-9 and line 7 of UPDATE, and (7) follows from the fact that $c_t(v) \le c_t(u)$ if $v$ is a descendant of $u$.

Now observe that at every time $t$ at most $h$ counters $c(v)$ are increased by one unit; therefore,

$$\sum_{t \in B} c_t(v_{b(t)}) \;\le\; T\, h \qquad (8)$$

We conclude that $\sum_{t \in B} |S^t_{v_{b(t)}}| \le T \frac{4h}{\epsilon}$. Plugging this bound in (3) and noting that the second sum dominates, we obtain:

$$\mathrm{cost}(T) = O\Big( T\, (h + \log n)\, \log n \; d\, \frac{h}{\epsilon} \Big) \qquad (9)$$

Next we prove a second bound on $\mathrm{cost}(T)$; the final bound comes from taking the minimum. Consider the $t$-th execution of UPDATE, let $(x, y)$ be the example that is inserted or removed, and recall that $P_{v(x)}$ is the path from the root of the tree to the leaf $v(x)$ determined by $x$. Let $C_t = \{v_{k_1}, \ldots, v_{k_M}\}$ be the set of all vertices of $P_{v(x)}$ such that BUILD$(S^{t_i}_{v_{k_i}}, \cdot)$ is executed at some time $t_i \le t$. If $C_t \ne \emptyset$ then we call $C_t$ a charging set. We wish to bound the maximum size of $C_t$, which might be seen as the number of BUILD operations performed per update. We shall prove the following two properties:

P1: for all $t \in [T]$, $|C_t| \le \lceil \log n \rceil$
P2: for all $t \in B$, $c_t(v_{b(t)}) \le \sum_{\tau=1}^{t} \mathbf{1}_{v_{b(t)} \in C_\tau}$

We start with P1. We argue that there cannot be distinct nodes $v_{k_i}$ and $v_{k_j}$ in $C_t$ such that $\lceil \log s_\tau(v_{k_i}) \rceil = \lceil \log s_\tau(v_{k_j}) \rceil$ for any $\tau \in [t_i, t_j]$. Suppose $\tau = t_i$; $v_{k_i}$ cannot be an ancestor of $v_{k_j}$, for otherwise $v_{k_j}$ would not be connected to the root node at time $t_j$ and BUILD$(S^{t_j}_{v_{k_j}}, \cdot)$ would not be executed. If $v_{k_j}$ is an ancestor of $v_{k_i}$, UPDATE would not have performed BUILD$(S^{t_i}_{v_{k_i}}, \cdot)$ at time $t_i$, since there is at least one other node ($v_{k_j}$) which is closer to the root and that would have been selected instead (line 9 of UPDATE). The claim holds for every $\tau$, as $s_\tau(v_{k_j}) = s_{t_i}(v_{k_j})$ for every $\tau \in [t_i, t_j]$. Therefore $|C_t| \le \lceil \log n \rceil$.

For P2 we proceed as follows. Let $t_0 < t$ be such that $c(v_{b(t)})$ is set to 0 by BUILD (i.e., the point in time when $v_{b(t)}$ was created by BUILD, line 2), and let $q = c_t(v_{b(t)})$. By construction of UPDATE, there are $q$ distinct times $t_0 + 1 \le \tau_1 < \cdots < \tau_q \le t$ such that, for every $i \in [q]$, we have $c_{\tau_i}(v_{b(t)}) = c_{\tau_i - 1}(v_{b(t)}) + 1$ and $v_{b(t)} \in C_{\tau_i}$, proving P2.

We obtain the following chain of inequalities:

$$\sum_{t \in B} c_t(v_{b(t)}) \;\le\; \sum_{t \in B} \sum_{\tau=1}^{t} \mathbf{1}_{v_{b(t)} \in C_\tau} \;\le\; \sum_{\tau=1}^{T} |C_\tau| \qquad (10)$$

where the inequalities follow respectively from P2 and from the fact that $v_{b(t)} \ne v_{b(t')}$ for $t \ne t'$. Using P1 and $B \subseteq [T]$, we conclude that $\sum_{t \in B} c_t(v_{b(t)}) = O(T \log n)$. Plugging this bound into (3) yields:

$$\mathrm{cost}(T) = O\Big( T\, (h + \log n)\, \log n \; d\, \frac{\log n}{\epsilon} \Big) \qquad (11)$$

Taking the minimum of (9) and (11) yields that $\mathrm{cost}(T)$ is in

$$O\Big( T\, (h + \log n)\, \log n \; d\, \frac{\min(h, \log n)}{\epsilon} \Big)$$

As $(x + y) \min(x, y) \le 2xy$ for $x, y \ge 0$, we conclude that $\mathrm{cost}(T) = O(T\, d\, h\, (\log n)^2 / \epsilon)$. Noting that $h \in O(\log n / \epsilon)$ by Lemma 3.5 concludes the proof.
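The split-selection step at line 9 of BUILD, as described in the proof of Lemma 3.6, can be sketched in C++ as follows: sort each feature projection, then sweep thresholds with running label counts. This is an illustrative re-implementation, not the paper's released code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// (value, label) projection of S on one feature, as in the proof of Lemma 3.6.
using Projection = std::vector<std::pair<double, int>>;

static double gini_from_counts(long long ones, long long total) {
    if (total == 0) return 0.0;
    double p = static_cast<double>(ones) / total;
    return 2.0 * p * (1.0 - p);
}

// Returns a split (j, t) maximizing the Gini gain over S, where S is given
// feature-major: features[j] is the projection S|_j. Runs in O(d |S| log |S|).
std::pair<std::size_t, double> best_split(const std::vector<Projection>& features) {
    std::size_t best_j = 0;
    double best_t = 0.0, best_gain = -1.0;
    for (std::size_t j = 0; j < features.size(); ++j) {
        Projection proj = features[j];
        std::sort(proj.begin(), proj.end());                 // O(|S| log |S|)
        long long n = proj.size(), total_ones = 0;
        for (const auto& e : proj) total_ones += e.second;
        double g_S = gini_from_counts(total_ones, n);
        long long left_n = 0, left_ones = 0;
        for (long long i = 0; i + 1 < n; ++i) {              // sweep thresholds
            left_n += 1;
            left_ones += proj[i].second;
            if (proj[i].first == proj[i + 1].first) continue; // not a valid threshold
            long long right_n = n - left_n, right_ones = total_ones - left_ones;
            double gain = g_S
                        - (double)left_n / n * gini_from_counts(left_ones, left_n)
                        - (double)right_n / n * gini_from_counts(right_ones, right_n);
            if (gain > best_gain) {
                best_gain = gain;
                best_j = j;
                best_t = proj[i].first;                       // x_j <= t goes left
            }
        }
    }
    return {best_j, best_t};
}
```

Sorting dominates the cost, matching the $O(d|S| \log |S|)$ bound used in the proof; every candidate threshold is scored in $O(1)$ from the running counts.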
## 4 Lower Bounds

This section proves lower bounds on the space and amortized running time used by any fully dynamic algorithm for our problem. These bounds hold even under significant relaxations of both the input access model and the constraints of Section 3. Notably, they hold for randomized algorithms that can fail with constant probability under inputs provided by oblivious adversaries.

To state our bounds we need some more definitions. A label $y$ is $\epsilon$-feasible for $x \in \mathcal{X}$ w.r.t. $S$ if there is a decision tree $(T, \Sigma, L)$ that is $\epsilon$-feasible w.r.t. $S$ such that $T(x) = y$. Note that there might be multiple $\epsilon$-feasible labels for $x$. A decision tree algorithm $\mathcal{A}$ is weakly $(\epsilon, \delta)$-feasible w.r.t. $S$ if for every $x \in \mathcal{X}$ there is a decision tree $(T_x, \Sigma_x, L_x)$ that is $\epsilon$-feasible w.r.t. $S$ and such that $\Pr(\mathcal{A}_S(x) = T_x(x)) \ge \delta$. Note that $(\epsilon, \delta)$-feasibility is much weaker than $\epsilon$-feasibility: not only does it allow the algorithm to fail, but it does not even require it to be coherent with any given $\epsilon$-feasible tree.
Theorem 4.1. Let $k', h' \ge 1$, and let $\epsilon = (\alpha, \beta)$ with $0 \le \alpha \le 1$ and $0 \le \beta < \frac{1}{24}$. Any weakly $(\epsilon, \frac{3}{4})$-feasible fully dynamic algorithm with pruning thresholds $k', h'$ uses space $\Omega\big(\frac{nd}{k' \log n}\big)$, where $d$ is the number of features and $n$ is the maximum size of the active set at any point in time.

Proof. We reduce from the following classic two-party communication problem called INDEX. Alice is given a string $x \in \{0, 1\}^N$ and Bob is given an integer $i \in [N]$. Alice is allowed to send one message $M \in \{0, 1\}^*$ to Bob, who, after receiving $M$, outputs a single bit. The goal of Bob is to output precisely $x_i$. It is well known that for Bob to succeed with probability greater than $\frac{3}{4}$ we must have $|M| = \Omega(N)$, see (Henzinger and Kale 2020).

We reduce INDEX to the construction of an $(\epsilon, \frac{3}{4})$-feasible fully dynamic algorithm. For some positive integers $N, D$, Alice is given an arbitrary string in $\{0, 1\}^{ND}$ representing a matrix $A \in \{0, 1\}^{N \times D}$. Bob is given a pair $(\kappa, \ell) \in [N] \times [D]$ and must output $A_{\kappa\ell}$. By the lower bound above, Alice must send to Bob $\Omega(ND)$ bits in order for Bob to succeed with probability greater than $\frac{3}{4}$.

The reduction is as follows. Let $k = k'$. First, Alice computes the following sequence $S$ of $|S| = 2k(N + D)$ examples. Let $\bar D := \lceil \log(N + D) \rceil + 1$, and for all $i \in [N + D]$ let $b_i \in \{0, 1\}^{\bar D}$ be the binary representation of $i$. For simplicity and w.l.o.g. we assume $k$ to be an even integer. For every $i \in [N + D]$ Alice constructs $2k$ examples $(x^1_i, y^1_i), \ldots, (x^{2k}_i, y^{2k}_i)$, with $\mathcal{X} = \{0, 1\}^{D + \bar D}$ and $\mathcal{Y} = \{0, 1\}$, as follows. For every $i \in [N + D]$ and every $h \in [2k]$, the last $\bar D$ bits of $x^h_i$ correspond to the string $b_i$ (i.e., $x^h_{ij} = b_{ij}$ for all $j \in [D + 1, D + \bar D]$), and $y^h_i = \mathbf{1}_{h > k}$. The remaining bits of $x^h_i$ are defined as follows. If $i \in [N]$, then for all $j \in [D]$:

$$x^h_{ij} := \begin{cases} 1 - A_{ij}, & h \in [k] \\ A_{ij}, & h \in [k + 1, 2k] \end{cases}$$

while if $i \in [N + 1, N + D]$, then for all $j \in [D]$ and $h \in [2k]$:

$$x^h_{ij} := \begin{cases} 1, & j \in [D] \setminus \{i - N\} \text{ and } h \bmod 2 = 0 \\ 0, & \text{otherwise} \end{cases}$$

Let $\mathcal{A}$ be any $(\epsilon, \frac{3}{4})$-feasible fully dynamic algorithm with $\beta < \frac{1}{24}$. Alice asks $\mathcal{A}$ to add every element of $S$, then she sends a snapshot of its memory to Bob, who resumes the execution of $\mathcal{A}$. Next, Bob asks $\mathcal{A}$ to perform DEL$(x, y)$ for every $y \in \{0, 1\}$ and every $x \in \{0, 1\}^{D + \bar D}$ terminating with the $\bar D$-bit binary string $b_i$, for all $i \in [N + D] \setminus \{\kappa, N + \ell\}$. Finally, Bob asks $\mathcal{A}$ to label $\mathbf{1}$ (the all-ones example), and outputs the answer.

First, we claim that Bob outputs $A_{\kappa\ell}$. To prove this, note that the active set received by $\mathcal{A}$ is:

$$\hat S = \{(x^h_\kappa, y^h_\kappa) : h \in [2k]\} \cup \{(x^h_{N+\ell}, y^h_{N+\ell}) : h \in [2k]\}$$

We prove that any decision tree that is $\epsilon$-feasible with respect to $\hat S$ labels the example $\mathbf{1}$ with $A_{\kappa\ell}$. To this end we show that in any such tree, (i) the root splits on feature $\ell$, (ii) the child $v$ of the root corresponding to feature $\ell$ equal to 1 is a leaf with label $A_{\kappa\ell}$.

For (i), we prove that $\hat S$ does not meet any stopping condition, and that $\ell$ is the only $\beta$-optimal feature. The claim on the stopping condition is immediate. For the optimality of $\ell$, we claim that:

$$G(\hat S, j) = \begin{cases} \frac{1}{6}, & j = \ell \\ \frac{1}{8}, & j \in [D] \setminus \{\ell\} \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$

To begin, note that $g(\hat S) = \frac{1}{2}$. Let $\hat S_1$ and $\hat S_2$ be the two subsequences obtained by splitting $\hat S$ on $j$. When $j = \ell$, $\hat S_1$ contains $k$ examples with identical labels, so $g(\hat S_1) = 0$, while $\hat S_2$ contains $3k$ examples of which $2k$ have identical label, so $g(\hat S_2) = \frac{4}{9}$. Thus $G(\hat S, \ell) = \frac{1}{6}$. When $j \in [D] \setminus \{\ell\}$, both $\hat S_1$ and $\hat S_2$ contain $k$ examples of $\{(x^h_\kappa, y^h_\kappa) : h \in [2k]\}$ with identical label, as well as $k$ examples of $\{(x^h_{N+\ell}, y^h_{N+\ell}) : h \in [2k]\}$ with $\frac{k}{2}$ labels equal to 0 and $\frac{k}{2}$ labels equal to 1. Thus, in both $\hat S_1$ and $\hat S_2$ one label occurs precisely on a fraction $\frac{3}{4}$ of the examples.
Hence $g(\hat S_1) = g(\hat S_2) = 2 \cdot \frac{3}{4} \cdot \frac{1}{4} = \frac{3}{8}$, and $G(\hat S, j) = \frac{1}{2} - \frac{3}{8} = \frac{1}{8}$. In every other case, either $\hat S_1 = \hat S$, or $\hat S_1 = \{(x^h_\kappa, y^h_\kappa) : h \in [2k]\}$ and $\hat S_2 = \{(x^h_{N+\ell}, y^h_{N+\ell}) : h \in [2k]\}$, which implies $g(\hat S_1) = g(\hat S_2) = \frac{1}{2}$ and $G(\hat S, j) = 0$. Since $\beta < \frac{1}{24} = \frac{1}{6} - \frac{1}{8}$, we conclude that $\ell$ is the only $\beta$-optimal feature, as desired.

For (ii), note that the subsequence of $\hat S$ having the $\ell$-th feature set to 1 has all labels equal to $A_{\kappa\ell}$. This implies that the corresponding child $v$ of the root is a leaf, since it meets at least one of the stopping conditions, and that it assigns label $A_{\kappa\ell}$. This proves that Bob returns $A_{\kappa\ell}$.

To prove the space lower bound, note that $S$ consists of $n = 2k(N + D)$ examples, each of which can be encoded in $d = D + \bar D = O(D + \log(D + N))$ bits. For $D = O(N)$, this yields $n = O(kN)$ and $d = O(D \log N)$, and therefore $ND = \Omega\big(\frac{nd}{k \log n}\big)$. Recalling that Alice must send $\Omega(ND)$ bits to Bob concludes the proof.

We conclude this section with a lower bound on the running time of any fully dynamic algorithm. Clearly, if the model requires the algorithm to read every labeled example $(x, y)$ upon arrival, then a lower bound of $\Omega(nd)$ is trivial. However, we show that an $\Omega(nd)$ bound holds even if we do not require the algorithm to read the examples; instead, at any point in time we allow the algorithm to access in time $O(1)$ the $j$-th feature of any example in the current active set. We call this the matrix access model. Again, we prove the bound for weakly $(\epsilon, \delta)$-feasible algorithms.

Theorem 4.2. Let $k, h \ge 1$ and $\alpha, \beta \in [0, \frac{1}{2})$. For arbitrarily large $n$ and $d$ there exist sequences of $n$ INS and DEL operations over $\{0, 1\}^d \times \{0, 1\}$ such that, in the matrix access model, any weakly $(\epsilon, \frac{2}{3})$-feasible fully dynamic algorithm has expected running time $\Omega(nd)$.

## 5 Experiments

We compare FUDYADT against two state-of-the-art algorithms for incremental decision tree learning, EFDT (Manapragada, Webb, and Salehi 2018) and HAT (Bifet and Gavaldà 2009), using the MOA software (Bifet et al. 2010). Similarly to FUDYADT, EFDT and HAT aim at keeping $\epsilon$-optimal splits, which they do with high probability when the examples are i.i.d. from a distribution. Due to space limitations, the results for HAT and other experiments can be found in the full version of the paper (Bressan, Damay, and Sozio 2022).

Settings. We implemented FUDYADT in C++ (the code is available at https://github.com/GDamay/dynamic-tree). We conducted all experiments on an Ubuntu 20.04.2 LTS server equipped with 144 Intel(R) Xeon(R) Gold 6154 @ 3.00GHz CPUs and 264 GB of RAM. We observe that the algorithms have not been implemented in the same programming language, which limits the relevance of the runtime comparison.

Datasets. Our datasets are shown in Table 1. We have chosen them among standard datasets for classification; some of them, such as INSECTS, feature the so-called concept drift. Not all datasets have binary labels. For the INSECTS datasets, we assigned label 1 to the union of male classes. For every other dataset, we assigned label 1 to the majority class.

| Dataset | d | # of examples | 1-class |
|---|---|---|---|
| Electricity | 8 | 45 311 | UP |
| Forest Covertype | 54 | 581 011 | 2 |
| INSECTS v1-v5 | 33 | 24 150 - 79 986 | *-male |
| KDDCUP99 | 41 | 494 021 | smurf. |
| NOAA Weather | 8 | 18 159 | 1 |
| Poker | 10 | 829 201 | 0 |

Table 1: Datasets statistics.

Input models. We consider three input models. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be the sequence of examples as given by the dataset at hand (typically in chronological order). The simplest model is when only insertions are allowed, a.k.a. the incremental model. Formally, at every $t \in [T]$ the algorithm receives INS$(x_t, y_t)$. This model is supported by all algorithms (FUDYADT, EFDT, HAT), hence we use it to compare them against each other. The next two models involve deletions and thus are supported only by FUDYADT. The first one is the sliding window model (SW): given an integer $W \ge 1$, for all $t \in [T]$ the algorithm receives INS$(x_t, y_t)$, preceded by DEL$(x_{t-W+1}, y_{t-W+1})$ if $t \ge W$. The second one is the random update model (RU): for all $t \in [T]$, with probability $1/2$ the algorithm receives INS$(x_t, y_t)$ and with probability $1/2$ it receives DEL$(x, y)$ where $(x, y)$ is chosen uniformly at random from the active set $S_t$.
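For concreteness, the following small C++ driver shows how the SW and RU update streams described above can be generated and fed to a fully dynamic algorithm. The `DynamicTree` interface and the `Example` type are placeholders (not the paper's API), and the exact indexing conventions, such as whether $x_t$ is skipped on a DEL step in RU, are assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Example { std::vector<double> x; int y; };

// Placeholder interface; ins/del correspond to the INS/DEL requests of Section 2.
struct DynamicTree {
    void ins(const Example&) { /* FUDYADT.UPDATE with an insertion */ }
    void del(const Example&) { /* FUDYADT.UPDATE with a deletion  */ }
};

// Sliding-window model (SW): insert example t and, once the window is full,
// delete the example that falls out of the last-W window.
void run_sliding_window(DynamicTree& tree, const std::vector<Example>& stream, std::size_t W) {
    for (std::size_t t = 0; t < stream.size(); ++t) {
        if (t >= W) tree.del(stream[t - W]);
        tree.ins(stream[t]);
    }
}

// Random-update model (RU): at each step, with probability 1/2 insert the next
// example, otherwise delete a uniformly random element of the active set.
void run_random_update(DynamicTree& tree, const std::vector<Example>& stream, unsigned seed = 0) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution coin(0.5);
    std::vector<Example> active;
    for (const Example& e : stream) {
        if (active.empty() || coin(rng)) {
            tree.ins(e);
            active.push_back(e);
        } else {
            std::uniform_int_distribution<std::size_t> pick(0, active.size() - 1);
            std::size_t i = pick(rng);
            tree.del(active[i]);
            active.erase(active.begin() + i);
        }
    }
}
```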
Metrics. As is customary in the literature, we evaluate how well each algorithm predicts the label of the next example before seeing it. Formally, if $(x_t, y_t)$ is the $t$-th example appearing in the input sequence, then we compute $\hat y_t = \mathrm{LAB}(x_t)$ before the algorithm sees $(x_t, y_t)$. We then compute the F1-score of the label sequence $\hat y = (\hat y_t)_{t \ge 1}$ against the ground truth $y = (y_t)_{t \ge 1}$,

$$F_1(y, \hat y) = \frac{2\, P(y, \hat y)\, R(y, \hat y)}{P(y, \hat y) + R(y, \hat y)} \qquad (14)$$

where $P(y, \hat y)$ and $R(y, \hat y)$ are respectively the precision and recall of $\hat y$ against $y$. The F1-score is in $[0, 1]$, with higher values denoting better results.
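A minimal C++ helper for equation (14), assuming binary labels with 1 as the positive class; it computes precision, recall, and the F1-score of the prequential predictions $\hat y$ against the ground truth $y$. Variable names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// F1-score of predictions y_hat against ground truth y (labels in {0, 1}),
// as in equation (14); returns 0 when there are no true positives.
double f1_score(const std::vector<int>& y, const std::vector<int>& y_hat) {
    long long tp = 0, fp = 0, fn = 0;
    for (std::size_t t = 0; t < y.size(); ++t) {
        if (y_hat[t] == 1 && y[t] == 1) ++tp;
        else if (y_hat[t] == 1 && y[t] == 0) ++fp;
        else if (y_hat[t] == 0 && y[t] == 1) ++fn;
    }
    if (tp == 0) return 0.0;
    double precision = static_cast<double>(tp) / (tp + fp);
    double recall    = static_cast<double>(tp) / (tp + fn);
    return 2.0 * precision * recall / (precision + recall);
}
```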
Parameters. For FUDYADT, we let $\alpha = 0$, $\beta = 0$, $k = 1$, $h \in \{5, 10\}$, and we manually set $\epsilon \in [0, 2]$. Note that this breaks the condition of Theorem 3.2; it allows us to test the effect of $\epsilon$ without fine-tuning the other parameters. The parameters of EFDT and HAT are set to the original values specified by the authors; we only vary the so-called grace period in $\{100, 500, 1000\}$ to find the value yielding the highest F1-score. For the SW model we use $W \in \{100, 1000\}$. In all our experiments, we first build a decision tree for the first $W$ examples, then we apply the models above to the remaining sequence. Several parameter configurations show similar trends. We only report the most interesting results.

FUDYADT versus EFDT. We compare the F1-scores of EFDT and FUDYADT when allowed the same amortized time. To this end we tuned FUDYADT's $\epsilon$ to make its running time very close to (and never exceeding) that of EFDT. The results are shown in Table 2; remarkably, FUDYADT consistently outperforms EFDT in terms of F1-score. One of the possible reasons is that FUDYADT can guarantee to be relatively close to the optimal Gini gain, even without computing it explicitly. In contrast, EFDT resorts to an approximation which might be relatively poor, given that maintaining an optimal Gini gain is expensive.

| | EFDT RT | EFDT F1 | FUDYADT RT | FUDYADT F1 | ϵ |
|---|---|---|---|---|---|
| Electricity | 1.65 | 83.64 | 1.53 | 90.33 | 0.15 |
| Forest Covertype | 42.47 | 83.64 | 42.37 | 90.33 | 0.29 |
| INSECTS v1 | 4.85 | 88.96 | 4.51 | 92.17 | 1.00 |
| INSECTS v2 | 3.13 | 87.40 | 3.09 | 92.53 | 0.92 |
| INSECTS v3 | 7.54 | 92.51 | 7.43 | 94.76 | 1.00 |
| INSECTS v4 | 6.30 | 91.15 | 6.02 | 91.91 | 0.95 |
| INSECTS v5 | 6.84 | 89.85 | 6.77 | 93.34 | 1.00 |
| KDDCUP99 | 17.72 | 97.98 | 17.43 | 99.91 | 0.17 |
| NOAA Weather | 0.73 | 80.78 | 0.73 | 81.43 | 0.36 |
| Poker | 16.26 | 79.69 | 16.14 | 86.07 | 1.03 |

Table 2: Running time in seconds (labeled RT) and F1-score (labeled F1) of EFDT and FUDYADT in the incremental model. The last column shows the value of ϵ in UPDATE.

FUDYADT on SW and RU. Next, we studied the performance of FUDYADT in the SW and RU models (recall that EFDT/HAT do not work here). For the SW model, we set $h = 10$, $k = 1$, $\alpha = 0$, and $W = 100$ for Electricity and $W = 1000$ otherwise. Figure 1 shows the F1-score as a function of $\epsilon$ (subfigures a-c) and the average time per update in milliseconds, in logarithmic scale, as a function of $\epsilon$ (subfigures d-f). The smaller $\epsilon$ is, the more often subtrees are recomputed, yielding a higher F1-score and a higher amortized running time. This behavior is clear in the Electricity and Poker datasets, where from $\epsilon = 0$ to $\epsilon = 1$ the F1-score decreases by roughly 0.1 and the running time decreases by three orders of magnitude. A good tradeoff could be $\epsilon = 0.1$, where the F1-score is close to that of $\epsilon = 0$ but with an amortized running time per update smaller by orders of magnitude (about 0.5 ms). For INSECTS the F1-score is much more stable. All other datasets and parameter settings yielded very similar qualitative behaviors.

Figure 1: Performance of FUDYADT in the SW model on the Electricity, INSECTS and Poker datasets (top to bottom), in terms of F1-score (left) and amortized milliseconds per update (right) as a function of ϵ.

Figure 2 shows the average running time for the RU model, showing similar trends to Figure 1.

Figure 2: Amortized running time per update (in milliseconds) of FUDYADT in the RU model on the Electricity (left) and INSECTS (right) datasets. The Poker dataset yields similar results.

## 6 Conclusions and Future Work

We developed the first fully dynamic algorithm for maintaining $\epsilon$-feasible decision trees, and we proved it to be nearly optimal in terms of space and amortized time. Our work shows that many well-known decision tree algorithms, whether offline like CART or incremental like EFDT, can be made fully dynamic with a small loss in the quality of the decision tree and a small overhead in the amortized running time. Our work leaves open the natural question of whether these results can be strengthened from amortized to worst-case. We believe this is an exciting direction for future research in fully-dynamic supervised machine learning.

## Acknowledgements

The work of Gabriel Damay and Mauro Sozio was partially supported by the French National Agency (ANR) under project APY (ANR-20-CE38-0011), while it has been carried out partially in the frame of a cooperation between Huawei Technologies France SASU and Telecom Paris (Grant no. YBN2018125164). Marco Bressan has been supported in part by the EU Horizon 2020 ICT-48 research and innovation action under grant agreement 951847, project ELISE (European Learning and Intelligent Systems Excellence), and by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR PE-AI scheme (M4C2, investment 1.3, line on Artificial Intelligence). We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

## References

Bateni, M.; Esfandiari, H.; Jayaram, R.; and Mirrokni, V. S. 2021. Optimal Fully Dynamic k-Centers Clustering. CoRR, abs/2112.07050.

Bentley, J. L.; and Saxe, J. B. 1980. Decomposable searching problems I. Static-to-dynamic transformation. Journal of Algorithms, 1(4): 301-358.

Bhattacharya, S.; Henzinger, M.; Nanongkai, D.; and Tsourakakis, C. E. 2015. Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams. In Proc. of ACM STOC.

Bifet, A.; and Gavaldà, R. 2009. Adaptive Learning from Evolving Data Streams. In Proc. of IDA, 249-260. Berlin, Heidelberg: Springer-Verlag.

Bifet, A.; Holmes, G.; Kirkby, R.; and Pfahringer, B. 2010. MOA: Massive Online Analysis. J. Mach. Learn. Res., 11: 1601-1604.

Bressan, M.; Damay, G.; and Sozio, M. 2022. Fully-Dynamic Decision Trees. CoRR, abs/2212.00778.

Chan, T. H.; Guerqin, A.; and Sozio, M. 2018. Fully Dynamic k-Center Clustering. In Proc. of WWW.
Cohen-Addad, V.; Hjuler, N.; Parotsidis, N.; Saulpic, D.; and Schwiegelshohn, C. 2019. Fully Dynamic Consistent Facility Location. In Proc. of NeurIPS.

Das, A.; Wang, J.; Gandhi, S. M.; Lee, J.; Wang, W.; and Zaniolo, C. 2019. Learn Smart with Less: Building Better Online Decision Trees with Fewer Training Examples. In Proc. of IJCAI, 2209-2215.

De Stefani, L.; Epasto, A.; Riondato, M.; and Upfal, E. 2017. TRIÈST: Counting Local and Global Triangles in Fully Dynamic Streams with Fixed Memory Size. ACM Trans. Knowl. Discov. Data.

Domingos, P.; and Hulten, G. 2000. Mining High-Speed Data Streams. In Proc. of ACM KDD, 71-80.

Epasto, A.; Lattanzi, S.; and Sozio, M. 2015. Efficient Densest Subgraph Computation in Evolving Graphs. In Proc. of WWW.

Gama, J.; Rocha, R.; and Medas, P. 2003. Accurate Decision Trees for Mining High-Speed Data Streams. In Proc. of ACM KDD, 523-528.

Haug, J.; Broelemann, K.; and Kasneci, G. 2022. Dynamic Model Tree for Interpretable Data Stream Learning. In Proc. of IEEE ICDE, 2562-2574.

Henzinger, M.; and Kale, S. 2020. Fully-Dynamic Coresets. In Proc. of ESA, volume 173 of LIPIcs, 57:1-57:21.

Hulten, G.; Spencer, L.; and Domingos, P. 2001. Mining Time-Changing Data Streams. In Proc. of ACM KDD, 97-106.

Jin, R.; and Agrawal, G. 2003. Efficient Decision Tree Construction on Streaming Data. In Proc. of ACM KDD, 571-576.

Manapragada, C.; Gomes, H. M.; Salehi, M.; Bifet, A.; and Webb, G. I. 2022. An eager splitting strategy for online decision trees in ensembles. Data Mining and Knowledge Discovery, 36(2): 566-619.

Manapragada, C.; Webb, G. I.; and Salehi, M. 2018. Extremely Fast Decision Tree. In Proc. of ACM KDD, 1953-1962.

Rutkowski, L.; Pietruczuk, L.; Duda, P.; and Jaworski, M. 2013. Decision Trees for Mining Data Streams Based on the McDiarmid's Bound. IEEE Transactions on Knowledge and Data Engineering, 25(6): 1272-1279.

Sawlani, S.; and Wang, J. 2020. Near-optimal fully dynamic densest subgraph. In Proc. of ACM STOC, 181-193.

Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. USA: Cambridge University Press.

Sun, J.; Jia, H.; Hu, B.; Huang, X.; Zhang, H.; Wan, H.; and Zhao, X. 2020. Speeding up Very Fast Decision Tree with Low Computational Cost. In Proc. of IJCAI, 1272-1278.