# DriftSurf: Stable-State / Reactive-State Learning under Concept Drift

Ashraf Tahmasbi*1, Ellango Jothimurugesan*2, Srikanta Tirthapura1,3, Phillip B. Gibbons2

*Equal contribution. 1Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, USA. 2Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. 3Apple Inc., Cupertino, California, USA. Correspondence to: Ashraf Tahmasbi, Ellango Jothimurugesan. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate and require training a new model. We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process. The advantage of our approach is that we can use aggressive drift detection in the stable state to achieve a high detection rate, but mitigate the false positive rate of standalone drift detection via a reactive state that reacts quickly to true drifts while eliminating most false positives. The algorithm is generic in its base learner and can be applied across a variety of supervised learning problems. Our theoretical analysis shows that the risk of the algorithm is (i) statistically better than standalone drift detection and (ii) competitive to an algorithm with oracle knowledge of when (abrupt) drifts occur. Experiments on synthetic and real datasets with concept drifts confirm our theoretical analysis.

1. Introduction

Learning from streaming data is an ongoing process in which a model is continuously updated as new training data arrive. We focus on the problem of concept drift, which refers to an unexpected change in the distribution of data over time. The objective is high prediction accuracy at each time step on test data from the current distribution. To achieve this goal, a learning algorithm should adapt quickly whenever drift occurs by focusing on the most recent data points that represent the new concept, while also, in the absence of drift, optimizing over all the past data points from the current distribution (for statistical accuracy). The latter has greater importance in the setting we consider, where data points may be stored and revisited to achieve accuracy greater than what can be obtained in a single pass. Moreover, computational efficiency of the learning algorithm is critical to keep pace with the continuous arrival of new data.

In a survey from Gama et al. (Gama et al., 2014), concept drift between time steps $t_0$ and $t_1$ is defined as a change in the joint distribution of examples: $p_{t_0}(X, y) \neq p_{t_1}(X, y)$. Gama et al. categorize drifts in several ways, distinguishing between real drift, which is a change in $p(y|X)$, and virtual drift (also known as covariate drift), which is a change only in $p(X)$ but not in $p(y|X)$. Drift is also categorized as either abrupt, when the change happens across one time step, or gradual, when there is a transition period between the two concepts. A learning algorithm that reacts (well) to concept drift is referred to as an adaptive algorithm. In contrast, an oblivious algorithm, which optimizes the empirical risk over all data points observed so far under the assumption that the data are i.i.d., performs poorly in the presence of drift.
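To make the setting concrete, the following sketch (our own illustration; every name and parameter here is hypothetical, not from the paper) simulates a stream of batches with an abrupt real drift: $p(X)$ is unchanged throughout, while $p(y|X)$ flips at a chosen time step.

```python
import numpy as np

def stream_with_abrupt_drift(m=100, T=100, t_drift=50, d=2, seed=0):
    """Yield batches X_t of m labeled points; p(y|X) changes at t_drift.

    Before the drift, labels follow sign(w1 . x); afterward, sign(w2 . x).
    p(X) is unchanged, so this is a 'real' drift in Gama et al.'s taxonomy.
    """
    rng = np.random.default_rng(seed)
    w1 = np.ones(d)    # concept before the drift
    w2 = -np.ones(d)   # concept after the drift (labels flip)
    for t in range(1, T + 1):
        X = rng.normal(size=(m, d))
        w = w1 if t < t_drift else w2
        y = np.sign(X @ w)
        yield t, X, y
```

An oblivious learner trained on the union of all batches of such a stream mixes the two concepts, which is exactly the failure mode the adaptive algorithms discussed below are designed to avoid.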
One major class of adaptive algorithms is drift detection, which includes DDM (Gama et al., 2004), EDDM (Baena-García et al., 2006), ADWIN (Bifet & Gavaldà, 2007), PERM (Harel et al., 2014), FHDDM (Pesaranghader & Viktor, 2016), and MDDM (Pesaranghader et al., 2018). Drift detection tests commonly work by tracking the prediction accuracy of a model over time, and signal that a drift has occurred whenever the accuracy degrades by more than a significant threshold (a minimal version of such a test is sketched below). After a drift is signaled, the previously-learned model can be discarded and replaced with a model trained solely on the data going forward.

There are several key challenges with using drift detection. Different tests are preferred depending on whether a drift is abrupt or gradual, and most drift detection tests have a user-defined parameter that governs a trade-off between detection accuracy and speed (Gama et al., 2014); choosing the right test and the right parameters is hard when the types of drift that will occur are not known in advance. There is also a significant cost in prediction accuracy when a false positive results in the discarding of a long-trained model and of data that are still relevant. Furthermore, even when drift is accurately detected, not all drifts require restarting with a new model. Drift detection can trigger following a virtual drift when the model misclassifies data points drawn from a previously unobserved region of the feature space, but the older data still have valid labels and should be retained. We have also encountered real drifts in our experimental study where a model with high parameter dimension can adapt to simultaneously fit data from both the old and new concepts, and it is more efficient to continue updating the original model rather than starting from scratch.

Our contribution is DriftSurf, an adaptive algorithm that helps overcome these drift detection challenges. DriftSurf works by incorporating drift detection into a broader two-state process. The algorithm starts with a single model beginning in the stable state and transitions to the reactive state based on a drift detection trigger, and then starts a second model. During the reactive state, the model used for prediction is greedily chosen as the best performer over data from the immediately previous time step (each time step corresponds to a batch of arriving data points). At the end of the reactive state, the algorithm transitions back to the stable state, keeping the model that was the best performer during the reactive state.

DriftSurf's primary advantage over standalone drift detection is that most false positives will be caught by the reactive state and lead to continued use of the original long-trained model and all the relevant past data; indeed, our theoretical analysis shows that DriftSurf is statistically better than standalone drift detection. Other advantages include (i) when restarting with a new model does not lead to better post-drift performance, the original model will continue to be used; and (ii) switching to the new model for predictions happens only when it begins outperforming the old model, accounting for the potentially lower accuracy of the new model as it warms up.
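For reference, here is a minimal accuracy-degradation test of the kind described above (a simplified sketch of our own, not the actual DDM/MDDM statistics): it remembers the best batch error rate observed for the current model and signals a drift when the latest batch error exceeds it by a threshold.

```python
class SimpleDriftDetector:
    """Signal drift when the batch error degrades past a threshold.

    A simplified stand-in for accuracy-based tests such as DDM/MDDM;
    the exact threshold semantics here are our own.
    """
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.best_error = float("inf")

    def update(self, batch_error):
        # Track the best (lowest) error rate observed for this model.
        self.best_error = min(self.best_error, batch_error)
        # Signal drift if the current error degrades past best + threshold.
        return batch_error > self.best_error + self.threshold
```

Used standalone, every signal resets the model, so the threshold directly trades detection speed against the false positives discussed above; DriftSurf instead treats such a signal only as a trigger to enter the reactive state.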
Meanwhile, the addition of this stable-state/reactive-state process does not unduly delay the time to recover from a drift, because the switch to a new model happens greedily within one time step of it outperforming the old model (as opposed to switching only at the end of the reactive state).

We present a theoretical analysis of DriftSurf, showing that it is risk-competitive with Aware, an adaptive algorithm that has oracle access to when a drift occurs and at each time step maintains a model trained over the set of all data since the previous drift. We also provide experimental comparisons of DriftSurf to Aware and two adaptive learning algorithms: a state-of-the-art drift-detection-based method, MDDM, and a state-of-the-art ensemble method, AUE (Brzezinski & Stefanowski, 2013). Our results on 10 synthetic and real-world datasets with concept drifts show that DriftSurf generally outperforms both MDDM and AUE.

2. Related Work

Most adaptive learning algorithms can be classified into three major categories: window-based methods, drift detection, and ensembles. Window-based methods, which include the family of FLORA algorithms (Widmer & Kubat, 1996), train models over a sliding window of the recent data in the stream. Alternatively, older data can be forgotten gradually by weighting the data points according to their age with either linear (Koychev, 2000) or exponential (Hentschel et al., 2019; Klinkenberg, 2004) decay. Window-based methods are guaranteed to adapt to drifts, but at a cost in accuracy in the absence of drift.

The aforementioned drift detection methods can be further classified as either detecting degradation in prediction accuracy with respect to a given model, which includes all of the tests mentioned in Section 1, or detecting change in the underlying data distribution, which includes tests given by (Kifer et al., 2004; Sebastião & Gama, 2007); the connection between the two approaches is made in (Hinder et al., 2020). In this paper, we focus on the subset of concept drifts that are performance-degrading, and that can be detected by the first class of these drift detection methods. As observed in (Harel et al., 2014), under this narrower focus, the problem of drift detection has lower sample and computational complexity when the feature space is high-dimensional. Furthermore, this approach ignores drifts that do not require adaptation, such as changes only in features that are weakly correlated with the label. Tests for drift detection may also be combined, known as hierarchical change detection (Alippi et al., 2016), in which a slow but accurate second test is used to validate change detected by the first test. The two-state process of DriftSurf has a similar pattern, but differs in that DriftSurf's reactive state is based on the performance of a newly created model, which has the advantage of not prolonging the time to recover from a drift because the new model is available to use immediately.

Finally, there are ensemble methods, such as DWM (Kolter & Maloof, 2007), Learn++.NSE (Elwell & Polikar, 2011), AUE (Brzezinski & Stefanowski, 2013), DWMIL (Lu et al., 2017), DTEL (Sun et al., 2018), Diversity Pool (Chiu & Minku, 2018), and Condor (Zhao et al., 2020). An ensemble is a collection of individual models, often referred to as experts, that differ in the subset of the stream they are trained over. Ensembles adapt to drift by including both older experts that perform best in the absence of drift and newer experts that perform best after drifts.
The predictions of each individual expert are typically combined using a weighted vote, where the weights depend on each expert's recent prediction accuracy. Strictly speaking, DriftSurf is an ensemble method, but it differs from traditional ensembles by maintaining at most two models, with only one model used to make a prediction at any time step. The advantage of DriftSurf is its efficiency, as the maintenance of each additional model in an ensemble comes at either a cost in additional training time, or a cost in the accuracy of each individual model if the available training time is divided among them.

The ensemble algorithm most similar to ours is from (Bach & Maloof, 2008), which also maintains just two models: a long-lived model that is best-suited in the stationary case, and a newer model trained over a sliding window that is best-suited in the case of drift. Their algorithm differs from DriftSurf in that instead of using a drift detection test to switch, they are essentially always in what we call the reactive state of our algorithm, where they choose to switch to a new model whenever its performance is better over a window of recent data points. Their algorithm has no theoretical guarantee, and without the stable-state/reactive-state process of our algorithm, there is no control over false switching to the newer model in the stationary case.

3. Model and Preliminaries

We consider a data stream setting in which the training data points arrive over time. For $t = 1, 2, \ldots$, let $X_t$ be the set of labeled data points arriving at time step $t$. We consider a constant arrival rate $m = |X_t|$ for all $t$. (Our discussion and results can be readily extended to Poisson and other arrival distributions.) Let $S_{t_1,t_2} = \bigcup_{t=t_1}^{t_2-1} X_t$ be a segment of the stream of points arriving in time steps $t_1$ through $t_2 - 1$, and let $n_{t_1,t_2} = m(t_2 - t_1)$ be the number of data points in $S_{t_1,t_2}$. Each $X_t$ consists of data points drawn from a distribution $I_t$ not known to the learning algorithm. In the stationary case, $I_t = I_{t-1}$; otherwise, a concept drift has occurred at time $t$.

We seek an adaptive learning algorithm $A$ with high prediction accuracy at each time step. At time $t$, $A$ has access to all the data points so far, $S_{1,t}$, and a constant number of processing steps (e.g., gradient computations) to output a model $w_t$ from a class of functions $F$ that map an unlabeled data point to a predicted label. Note this setting differs from the traditional online learning setting, as we are not limited in memory and allow for the reuse of relevant older data points in the stationary case to achieve higher accuracy than what can be achieved in a single pass.

To achieve high prediction accuracy at time $t$, we want to minimize the expected risk over the distribution $I_t$. The expected risk of a function $w$ over a distribution $I$ is $R_I(w) = \mathbb{E}_{x \sim I}[f_x(w)]$, where $f_x(w)$ is the loss of function $w$ on input $x$. Thus, the objective at each time $t$ is $\min_{w_t \in F} \mathbb{E}_{x \sim I_t}[f_x(w_t)]$.

Given a stream segment $S_{t_1,t_2}$ of training data points, the best we can do when the data are all drawn from the same distribution is to minimize the empirical risk over $S_{t_1,t_2}$. The empirical risk of a function $w$ over a sample $S$ of $n$ elements is $R_S(w) = \frac{1}{n}\sum_{x \in S} f_x(w)$. The optimizer of the empirical risk is denoted as $w_S^*$, defined as $w_S^* = \arg\min_{w \in F} R_S(w)$. The optimal empirical risk is $R_S^* = R_S(w_S^*)$.
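In code, the empirical risk over a stream segment is just the average loss on its points. A minimal sketch follows (logistic loss is our arbitrary choice for concreteness; the paper's setup is generic in $f_x$):

```python
import numpy as np

def logistic_loss(w, x, y):
    """Loss f_x(w) of model w on one labeled example (x, y), y in {-1, +1}.

    Computes log(1 + exp(-y * w.x)) stably via logaddexp.
    """
    return np.logaddexp(0.0, -y * (x @ w))

def empirical_risk(w, S):
    """R_S(w): the average loss of model w over the sample S of (x, y) pairs."""
    return float(np.mean([logistic_loss(w, x, y) for x, y in S]))
```

The empirical risk minimizer $w_S^*$ is then whatever optimizer we run to (approximate) convergence on `empirical_risk`; the gap between an algorithm's risk and $R_S^*$ is the sub-optimality studied next.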
Table 1: Commonly used symbols

- $X_t$: data points arriving at time step $t$
- $m = \lvert X_t \rvert$: number of points arriving at each time step
- $R_S$: empirical risk over the set of points $S$
- $H$: statistical error bound, $H(n) = h n^{-\alpha}$
- $h$: constant factor in the statistical error bound
- $\alpha$: exponent in the statistical error bound
- $W$: length of the windows $W_1$ and $W_2$
- $r$: length of the reactive state
- $\delta$: threshold in condition (2) to enter the reactive state
- $\delta'$: threshold in condition (3) to switch the model
- $\Delta$: magnitude of a drift

In order to quantify the error in the expected risk from empirical risk minimization, we use a uniform convergence bound (Boucheron et al., 2005; Bousquet & Bottou, 2007). We assume the expected risk over a distribution $I$ and the empirical risk over a sample $S$ of size $n$ drawn from $I$ are related through the following bound:

$$\mathbb{E}\Big[\sup_{w \in F} |R_I(w) - R_S(w)|\Big] \le H(n)/2 \qquad (1)$$

where $H(n) = h n^{-\alpha}$, for a constant $h$ and $1/2 \le \alpha \le 1$. From this relation, $H(n)$ is an upper bound on the statistical error (also known as the estimation error) over a sample of size $n$ (Bousquet & Bottou, 2007).

Let $w$ be the solution learned by an algorithm $A$ over stream segment $S = S_{t_1,t_2}$. Following prior work (Bousquet & Bottou, 2007; Jothimurugesan et al., 2018), we define the difference between $A$'s empirical risk and the optimal empirical risk over this stream segment as its sub-optimality: $\mathrm{SUBOPT}_S(A) := R_S(w) - R_S(w_S^*)$. Based on (Bousquet & Bottou, 2007), in the stationary case, achieving a sub-optimality on the order of $H(n_{t_1,t_2})$ over stream segment $S_{t_1,t_2}$ asymptotically minimizes the total (statistical + optimization) error for $F$. However, suppose a concept drift occurs at time $t_d$ such that $t_1 < t_d < t_2$. We could still define the empirical risk and sub-optimality of an algorithm $A$ over stream segment $S_{t_1,t_2}$, but balancing the sub-optimality with $H(n_{t_1,t_2})$ does not necessarily minimize the total error. Algorithm $A$ needs to first recover from the drift such that the predictive model is trained only over data points drawn from the new distribution. We define recovery time as follows: the recovery time of an algorithm $A$ is the time it takes after a drift for $A$ to provide a solution $w$ that is maintained solely over data points drawn from the new distribution.

Let $t_{d_1}, t_{d_2}, \ldots$ be the sequence of time steps at which a drift occurs, and define $t_{d_0} = 1$. The goals for an adaptive learning algorithm $A$ are (G1) to have a small recovery time $r_i$ at each $t_{d_i}$, and (G2) to achieve sub-optimality on the order of $H(n_{t_{d_i},t})$ over every stream segment $S_{t_{d_i},t}$ for $t_{d_i} + r_i < t < t_{d_{i+1}}$ (i.e., during the stationary, recovered periods between drifts). In Section 5, we formalize the latter as $A$ being risk-competitive with an oracle algorithm Aware; it implies that $A$ is asymptotically optimal in terms of its total error, despite concept drifts. Table 1 summarizes the symbols commonly used throughout the rest of the paper.

4. DriftSurf: Adaptive Learning over Streaming Data in the Presence of Drift

We present our algorithm DriftSurf for adaptively learning from streaming data that may experience drift. Incremental learning algorithms work by repeatedly sampling a data point from a training set $S$ and using the corresponding gradient to determine an update direction. This set $S$ expands as new data points arrive.
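A generic Update step of this kind might look as follows (a plain-SGD sketch of our own; the paper's experiments use STRSAGA, a variance-reduced update process, in this role):

```python
import numpy as np

def update(w, S, X_t, steps=50, lr=0.1, rng=None):
    """Expand S with the new batch X_t, then take `steps` gradient steps
    on points sampled from S; a stand-in for the paper's Update process.

    S is a list of (x, y) pairs with y in {-1, +1}; w is a weight vector.
    """
    rng = rng or np.random.default_rng()
    S.extend(X_t)
    for _ in range(steps):
        x, y = S[rng.integers(len(S))]
        # Gradient of the logistic loss log(1 + exp(-y * w.x)) at (x, y).
        g = -y * x / (1.0 + np.exp(y * (x @ w)))
        w = w - lr * g
    return w
```

The constant `steps` budget per time step reflects the setting of Section 3, where each algorithm gets a constant number of processing steps per arriving batch.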
In the presence of a drift from distribution $I_1$ to $I_2$, without a strategy to remove from $S$ the data points from $I_1$, the model trains over a mixture of data points from $I_1$ and $I_2$, often resulting in poor prediction accuracy on $I_2$. One systematic approach to mitigating this problem would be to use a sliding-window-based set $S$ from which further sampling is conducted. Old data points are removed when they fall out of the sliding window (regardless of whether they are from the current or an old distribution). However, the problem with this approach is that the sub-optimality of the model trained over $S$ suffers from the limited size of $S$. Using larger window sizes helps with achieving a better sub-optimality, but increases the recovery time. Smaller window sizes, on the other hand, provide better recovery time, but the sub-optimality of the algorithm over $S$ increases. An ideal algorithm manages the set $S$ such that it contains as many data points as possible from the current distribution, and resets it whenever a (significant) drift happens, so that it contains only data points from the new distribution.

As noted in Section 1, prior work (Baena-García et al., 2006; Bifet & Gavaldà, 2007; Gama et al., 2004; Harel et al., 2014; Pesaranghader & Viktor, 2016; Pesaranghader et al., 2018) has sought to achieve this ideal algorithm by developing better and better drift detection tests, but with limited success due to the challenges of balancing detection accuracy and speed, and the high cost of false positives. Instead, we couple aggressive drift detection with a stable-state/reactive-state process that mitigates the shortcomings of prior approaches. Unlike prior drift detection approaches, DriftSurf views performance degradation as only a sign of a potential drift: the final decision about resetting $S$ and the predictive model will not be made until the end of the reactive state, when more evidence has been gathered and a higher-confidence decision can be made.

Our algorithm, DriftSurf, is depicted in Algorithm 1, which is executed when DriftSurf is in the stable state, and Algorithm 2, which is executed when DriftSurf is in the reactive state.

Algorithm 1 DriftSurf-Stable-State: Processing a set of training points $X_t$ arriving in time step $t$ during a stable state

  // $w_{t-1}(S)$, $w'_{t-1}(S')$ are respectively the parameters (stream segments for training) of the predictive and reactive models. Every $W$ time steps, starting with the creation of the current predictive model, we start a new window of size $W$.
  // $w_{b_1}$, $w_{b_2}$ are the models with the best observed risks $R_{b_1}$, $R_{b_2}$ in the two most-recent windows $W_1$, $W_2$.
  if condition (2) holds then {enter the reactive state}
    state ← reactive
    $T \leftarrow \emptyset$ {$T$ is a segment arriving during the last $r/2$ time steps of the reactive state}
    $w'_{t-1} \leftarrow w_0$, $S' \leftarrow \emptyset$ {randomly initialize a new reactive model}
    $i \leftarrow 0$ {time steps in the current reactive state}
    execute Algorithm 2 on $X_t$
  else
    $w_t \leftarrow$ Update($w_{t-1}$, $S$, $X_t$) {update $w$, $S$}
  end if

Algorithm 2 DriftSurf-Reactive-State: Processing a set of training points $X_t$ arriving in time step $t$ during a reactive state

  // $w_{t-1}$, $S$, $w'_{t-1}$, $S'$, $w_{b_1}$, $w_{b_2}$, $R_{b_1}$, $R_{b_2}$ are as defined in Algorithm 1, except that $W_1$, $W_2$ are the two most-recent windows started before the current reactive state.
  if condition (2) does NOT hold then {early exit}
    state ← stable
    execute Algorithm 1 on $X_t$
  else
    $i \leftarrow i + 1$
    $w_t \leftarrow$ Update($w_{t-1}$, $S$, $X_t$) {update $w$, $S$}
    $w'_t \leftarrow$ Update($w'_{t-1}$, $S'$, $X_t$) {update $w'$, $S'$}
    if $i = r/2$ then
      $w'_f \leftarrow w'_t$ {take a snapshot of the reactive model}
    else if $r/2 < i \le r$ then
      add $X_t$ to $T$
    end if
    if $i = r$ then {exit the reactive state}
      state ← stable
      if condition (3) holds then
        $w_t \leftarrow w'_t$, $S \leftarrow S'$ {change the predictive model}
      end if
    else if $R_{X_t}(w'_t) < R_{X_t}(w_t)$ then
      use $w'_t$ instead of $w_t$ for predictions at the next time step {greedy policy}
    end if
  end if

The algorithm starts in the stable state, and the steps are shown for processing the batch of points arriving at time step $t$. When in the stable state, there is a single model, $w_{t-1}$, called the predictive model. Our test for entering the reactive state is based on dividing the time steps since the creation of that model into windows of size $W$. DriftSurf enters the reactive state at the sign of a drift, given by the following condition:

$$R_{X_t}(w_b) > R_b + \delta, \quad \text{where } b = \arg\min_{b \in \{b_1, b_2\}} R_b \qquad (2)$$

where $\delta$ is a predetermined threshold that represents the tolerance in performance degradation (the selection of $\delta$ is discussed in Section 6), and $w_{b_1}$ ($w_{b_2}$) are the parameters of the predictive model that provided the best-observed risk $R_{b_1}$ ($R_{b_2}$) over the most-recent window $W_1$ (second most-recent window $W_2$); e.g., $R_{b_1} = R_{X_{b_1+1}}(w_{b_1}) = \min_{j \in W_1} R_{X_j}(w_{j-1})$. Although most drift detection techniques rely on their predictive model to detect a drift, we keep a snapshot of the predictive model that provided the best-observed risk over two jumping windows of up to $W$ time steps because: (i) having a frozen model that does not train over the most recent data increases the chance of detecting slow, gradual drifts; (ii) each frozen model is at most $2W$ time steps old, which makes it reflective of the current predictive model; and (iii) the older of the models reflects the best over $W$ steps, while the younger of the models is guaranteed to have at least $W$ steps for which it can be used in drift detection tests; both are key factors in obtaining our theoretical analysis.

If condition (2) does not hold, DriftSurf assumes there was no drift in the underlying distribution and remains in the stable state. It calls Update, an update process that expands $S$ to include the newly arrived set of data points $X_t$ and then updates the (predictive) model parameters using $S$ for incremental training (examples in Appendix A). Otherwise, DriftSurf enters the reactive state, adds a new model $w'_{t-1}$, called the reactive model, with randomly initialized parameters, and initializes its sample set $S'$ to be empty. To save space, the growing sample set $S'$ can be represented by pointers into $S$.

If, at time step $t$, DriftSurf is in the reactive state (including the time step at which it has just entered the reactive state), then per Algorithm 2, DriftSurf checks that condition (2) still holds (to handle a corner case discussed below), adds $X_t$ to $S$ and $S'$, the sample sets of the predictive and reactive models, and updates $w_{t-1}$ and $w'_{t-1}$. During the reactive state, DriftSurf uses for prediction at $t$ whichever of the models $w$ or $w'$ performed the best in the previous time step $t - 1$. This greedy heuristic yields better performance during the reactive state by switching to the newly added model sooner in the presence of drift. Upon exiting the reactive state (when $i = r$), DriftSurf chooses the predictive model to use for the subsequent stable state.
It switches to the reactive model $w'$ if condition (3) holds:

$$R_T(w'_f) < R_T(w_b) - \delta', \quad \text{where } b = \arg\min_{b \in \{b_1, b_2\}} R_b \qquad (3)$$

where $w'_f$ is the snapshot of the reactive model (at $i = r/2$), $w_b$ is the snapshot of the predictive model with the best-observed performance over the last two windows, and $\delta'$ is set to be much smaller than $\delta$ (our experiments use $\delta' = \delta/2$). This condition checks their performance over the test set of data points $T$ that arrived during the last $r/2$ time steps of the reactive state (note that neither $w'_f$ nor $w_b$ has been trained over this test set). This provides an unbiased test to decide on switching the model. Otherwise, DriftSurf continues with the prior predictive model.

Handling a corner case. Consider the case that a drift happens while DriftSurf is in the reactive state (due to an earlier false positive on entering the reactive state). In this case, no matter what predictive model DriftSurf chooses at the end of the reactive state, both the current predictive and reactive models are trained over a mixture of data points from both the old and new distributions. This will decrease the chance of recovering from the actual drift. To avoid this problem, DriftSurf keeps checking condition (2) and drops out of the reactive state if it fails to hold (because the failure indicates a false positive). Then the next time the condition holds, a fresh reactive state is started. This way, the new reactive model will be trained solely on the new distribution.

Algorithms 1 and 2 are generic in the individual base learner. For the experimental evaluation in Section 6, we focus on base learners where the update process is STRSAGA (Jothimurugesan et al., 2018), a variance-reduced SGD for streaming data. Compared to SGD, STRSAGA has a faster convergence rate and better performance under different arrival distributions. The time and space complexity of DriftSurf is within a constant factor of the individual base learner.
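Putting Algorithms 1 and 2 together, the control loop can be rendered compactly. The following is an illustrative sketch of our own, not the authors' implementation: it simplifies the two jumping windows $W_1$, $W_2$ into keeping the two best frozen snapshots, and it takes `update` and `risk` functions (such as the sketches above) as pluggable arguments, assuming `update` returns a fresh model object.

```python
STABLE, REACTIVE = "stable", "reactive"

class DriftSurfSketch:
    """Condensed two-state loop in the spirit of Algorithms 1-2 (a sketch)."""

    def __init__(self, new_model, update, risk, r=4, delta=0.1):
        self.new_model, self.update, self.risk = new_model, update, risk
        self.r, self.delta, self.delta_p = r, delta, delta / 2  # delta' = delta/2
        self.w, self.S = new_model(), []         # predictive model, sample set
        self.state, self.snapshots = STABLE, []  # snapshots: (frozen model, R_b)

    def _condition2(self, X_t):
        # Condition (2): the best frozen model's risk on the arriving batch
        # degrades past its best observed risk R_b by more than delta.
        if not self.snapshots:
            return False
        w_b, R_b = min(self.snapshots, key=lambda s: s[1])
        return self.risk(w_b, X_t) > R_b + self.delta

    def process(self, X_t):
        """Process one arriving batch; return the model used for prediction."""
        if self.state == STABLE:
            if self._condition2(X_t):            # enter the reactive state
                self.state, self.i, self.T = REACTIVE, 0, []
                self.w2, self.S2 = self.new_model(), []   # reactive model
            else:
                # Freeze-and-remember (simplified windows): record the model
                # and its risk on X_t before training on X_t; keep best two.
                self.snapshots.append((self.w, self.risk(self.w, X_t)))
                self.snapshots = sorted(self.snapshots, key=lambda s: s[1])[:2]
                self.w = self.update(self.w, self.S, X_t)
                return self.w
        if not self._condition2(X_t):            # early exit (corner case)
            self.state = STABLE
            return self.process(X_t)
        self.i += 1
        r_old, r_new = self.risk(self.w, X_t), self.risk(self.w2, X_t)
        self.w = self.update(self.w, self.S, X_t)
        self.w2 = self.update(self.w2, self.S2, X_t)
        if self.i == self.r // 2:
            self.w2_frozen = self.w2             # snapshot at i = r/2
        elif self.i > self.r // 2:
            self.T.append(X_t)                   # held-out segment T
        if self.i == self.r:                     # exit the reactive state
            self.state = STABLE
            def risk_T(w):
                return sum(self.risk(w, X) for X in self.T) / len(self.T)
            w_b, _ = min(self.snapshots, key=lambda s: s[1])
            if risk_T(self.w2_frozen) < risk_T(w_b) - self.delta_p:
                self.w, self.S = self.w2, self.S2    # condition (3): switch
                self.snapshots = []                  # windows restart
            return self.w
        # Greedy policy: predict with whichever model performed better on
        # this batch before training (approximating the previous-step rule).
        return self.w2 if r_new < r_old else self.w
```

Keeping at most two models is what keeps the time and space cost within a constant factor of the base learner, in contrast to ensembles that maintain many experts.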
5. Analysis of DriftSurf

In this section, we show that DriftSurf achieves goals G1 and G2 from Section 3. As in prior work (Bousquet & Bottou, 2007; Jothimurugesan et al., 2018), we assume that $H(n) = hn^{-\alpha}$, for a constant $h$ and $1/2 \le \alpha \le 1$, is an upper bound on the statistical error over a set of $n$ data points all drawn from the same distribution.

Aware is an adaptive learning algorithm with oracle knowledge of when drifts occur. At each drift, the algorithm restarts the predictive model at a random initial point and trains it over data points that arrive after the drift. The main obstacle for other adaptive learning algorithms to compete with Aware is that they are not told exactly when drifts occur. We assume that Aware and DriftSurf use base learners that efficiently learn to within statistical accuracy:

Assumption 1. Let $t_0$ be the time the base learner $B$ was initialized. At each time step $t$, $\mathbb{E}[\mathrm{SUBOPT}_{S_{t_0,t}}(B)] \le H(n_{t_0,t})$.

As an example, a base learner that uses STRSAGA as the update process satisfies Assumption 1 by Lemma 3 in (Jothimurugesan et al., 2018). We use STRSAGA in the bulk of our experimental evaluation.

As a means of achieving goal G2 (sub-optimality on the order of $H(n_{t_d,t})$ after a drift at time $t_d$), we will show that the empirical risk of DriftSurf after a drift is close to the risk of Aware, where "close" is defined formally in terms of our notion of risk-competitiveness in Definition 1.

Definition 1. For $c \ge 1$, an adaptive learning algorithm $A$ is said to be $c$-risk-competitive to Aware at time step $t > t_d$ if $\mathbb{E}[\mathrm{SUBOPT}_{S_{t_d,t}}(A)] \le c \cdot H(n_{t_d,t})$, where $t_d$ is the time step of the most recent drift and $n_{t_d,t} = |S_{t_d,t}|$.

We will analyze the risk-competitiveness of DriftSurf in a stationary environment and after a drift. Additionally, we will provide a high-probability analysis of the recovery time after a drift (goal G1). Let $t_{d_1}, t_{d_2}, \ldots$ be the sequence of time steps at which a drift occurs. We assume that each drift at $t_{d_i}$ is abrupt and that it satisfies the following assumption of sustained performance degradation.

Assumption 2. For the drift at time $t_{d_i}$, and for both frozen models $w_b \in \{w_{b_1}, w_{b_2}\}$ stored at $t_{d_i}$, we have $R_{X_t}(w_{t-1}) > R_b$ for each time $t_{d_i} < t < t_{d_{i+1}}$ as long as DriftSurf has not recovered. Furthermore, we denote by $\Delta$ the magnitude of the drift, where $\Delta = \min_{w_b}(R_J(w_b) - R_I(w_b))$, with $I$ denoting the distribution at the time $t_{d_i} - 1$ before the drift, and $J$ denoting the distribution at $t_{d_i}$.

Typically in drift detection, the magnitude of a drift is defined as the difference in the expected risks over the old and new distributions with respect to the current predictive model. But that definition results in a moving target after the drift but before replacement of the model, as the model gets updated with new data and possibly slowly converges on the new distribution, making the drift harder to detect. Instead, in our approach in DriftSurf, detection is done on frozen models snapshotted prior to the drift, and we accordingly define the drift magnitude with respect to the frozen models. The implication of Assumption 2 is that after a drift, the current predictive model being continually updated with the new data does not automatically adapt to the drift for at least $W$ time steps and actually needs to be replaced.

Finally, we assume that all loss functions $f_x$ are bounded in $[0, 1]$; that the optimal expected risk $R^*_{I_t} = \inf_{w \in F} R_{I_t}(w) = 0$ for each distribution $I_t$; that the batch size $m > 16/\delta'$; that each drift magnitude $\Delta > \delta$; that $2W$ is upper bounded by both $\exp(\frac{1}{2}m\delta^2)$ and $\exp(\frac{1}{2}m(\Delta - \delta)^2)$ for each drift magnitude $\Delta$; and that for each frozen model $w_b$ that yielded a minimal observed risk $R_b$, its expected risk is at least as good as its expectation.

5.1. Stationary Environment

We will show that DriftSurf is competitive to Aware in the stationary environment during the time $1 < t < t_{d_1}$ before any drift happens. By Assumption 1, the expected sub-optimalities of Aware and DriftSurf are (respectively) bounded by $H(n_{1,t})$ and $H(n_{t_e,t})$, where $t_e$ is the time that the current predictive model of DriftSurf was initialized. To prove DriftSurf is risk-competitive to Aware, we need to show that $n_{t_e,t}$, the size of the predictive model's sample set, is close to $n_{1,t}$. To achieve this, we first give a constant upper bound $p_s$ on the probability of entering the reactive state:

Lemma 1. In the stationary environment for $1 < t < t_{d_1}$, the probability of entering the reactive state is upper bounded by $p_s = 2\exp(-\frac{1}{8}m\delta^2)$.

In the proof (Appendix B.1), we use sub-Gaussian concentration of the empirical risk under a bounded loss function. Furthermore, if DriftSurf enters the reactive state in the stationary case, Lemma 2 asymptotically bounds the probability of switching to the reactive model by a quantity $q_s(\beta)$ that approaches 0, where $\beta$ is the age of the frozen model $w_b$ used in condition (3).
Lemma 2. In the stationary environment for $1 < t < t_{d_1}$, if DriftSurf enters the reactive state, the probability of switching to the reactive model at the end of the reactive state is bounded by $q_s = c_1/\beta^2$ for $\beta > c_2$, where $\beta$ is the number of time steps between the initialization of the model $w_b$ and the time it was frozen, and the constants are $c_1 = (2h/m^{\alpha})^{mr\delta'/4}$ and $c_2 = \frac{1}{m}(2h/\delta')^{1/\alpha}$.

In the proof (Appendix B.1), we use the convergence of the base learner and Bennett's inequality. As the probability of falsely switching to the reactive model goes to 0, DriftSurf is increasingly likely to hold onto the predictive model. Using the above results, we bound the size of the predictive model's sample set to at least half the size of Aware's sample set, with high probability.

Corollary 1. With probability $1 - \epsilon$, the size of the sample set $S$ for the predictive model in the stable state is larger than $\frac{1}{2}n_{1,t}$ at any time step $2W + c_4/(\epsilon - c_3) \le t < t_{d_1}$, where $n_{1,t}$ is the total number of data points that arrived until time $t$, and the constants are $c_3 = c_1((c_2 + W)^2 - 1/c_2^2)p_s$ and $c_4 = (2c_3 - 8)c_1p_s^2 + 6c_1p_s$ (where $c_1$ and $c_2$ are the constants in Lemma 2).

Based on the result of Corollary 1, we show that the predictive model of DriftSurf in the stable state is $\frac{41}{7^{\alpha}}$-risk-competitive with Aware with probability $1 - \epsilon$, at any time step $2W + c_4/(\epsilon - c_3) \le t < t_{d_1}$. This is a special case of the forthcoming Theorem 1 in Section 5.2.

In addition, it follows from Lemma 1 and Corollary 1 that DriftSurf maintains an asymptotically larger expected number of samples compared to the standalone drift detection algorithm that resets the model whenever condition (2) holds (this algorithm is DriftSurf without the reactive state).

Lemma 3. In the stationary environment for $1 < t < t_{d_1}$, let $\beta$ be the age of the predictive model in DriftSurf and let $\gamma$ be the age of the model of standalone drift detection. For $(2W + \frac{2c_4}{1 - 2c_3}) < t < t_{d_1}$, $\mathbb{E}[\beta] > t/4$ (where $c_3$ and $c_4$ are the constants in Corollary 1). Meanwhile, even as $t \to \infty$ (in the absence of drifts), $\mathbb{E}[\gamma] < 1/p_s + o(1)$.

When each model is trained to statistical accuracy (Assumption 1), the total (statistical + optimization) error bound is asymptotically limited by the statistical error for the number of samples maintained. Hence, DriftSurf is statistically better than standalone drift detection in a stationary environment.

5.2. In the Presence of Abrupt Drifts

Consider an abrupt drift that occurs at time $t_{d_i}$, and let $\Delta$ be its magnitude. Suppose the drift occurs while DriftSurf is in the stable state. (The case of a drift occurring while DriftSurf is in the reactive state is handled in Appendix B.2.) We show that DriftSurf has a bounded recovery time (goal G1). In order to do so, we first give a lower bound $p_d$ on the probability of entering the reactive state:

Lemma 4. For $t_{d_i} < t < t_{d_i} + W$, the probability of entering the reactive state while DriftSurf has not yet recovered is lower bounded by $p_d = 1 - 2\exp(-\frac{1}{8}m(\Delta - \delta)^2)$.

Next, we give a lower bound $q_d$ on the probability of switching to the reactive model at the end of the reactive state:

Lemma 5. For $t_{d_i} < t < t_{d_i} + W$, the probability of switching to the reactive model at the end of the reactive state while DriftSurf has not yet recovered is lower bounded by $q_d = 1 - 2\exp(-C^2)$, where $C = (\Delta - \delta')\sqrt{mr/2} - 2^{\alpha+1}h/(mr)^{\alpha - 1/2}$, subject to $C > 0$.
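For a rough sense of scale of these bounds, take example values of our own (not from the paper): batch size $m = 1000$, detection threshold $\delta = 0.2$, and drift magnitude $\Delta = 0.5$. Then

```latex
p_s = 2\exp\!\left(-\tfrac{1}{8}\,m\,\delta^2\right) = 2e^{-5} \approx 0.013,
\qquad
p_d = 1 - 2\exp\!\left(-\tfrac{1}{8}\,m\,(\Delta-\delta)^2\right) = 1 - 2e^{-11.25} \approx 0.99997.
```

That is, under these hypothetical parameters, a false entry into the reactive state occurs in roughly 1% of stable-state time steps, while a drift of this magnitude triggers the reactive state at any given post-drift step with near certainty.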
The proofs of the preceding two lemmas are similar to their stationary counterparts due to the use of frozen models: for the $W$ time steps after the drift, by Assumption 2, the previous frozen models will not be displaced by a newer model that has been partially trained over data after the drift. Following from Lemmas 4 and 5, the recovery time of DriftSurf is bounded by $W$ with probability $1 - \epsilon_r$, where $\epsilon_r$ is parameterized by $p_d$ and $q_d$, as shown in Lemma 11 in Appendix B.2.

We next show the risk-competitiveness of DriftSurf after recovery (goal G2). The time period after recovery until the next drift is a stationary environment for DriftSurf, in which each model is trained solely over points drawn from a single distribution, allowing for an analysis similar to the stationary environment before any drifts occurred.

Theorem 1. With probability $1 - \epsilon$, the predictive model of DriftSurf in the stable state is $\frac{41}{7^{\alpha}}$-risk-competitive with Aware at any time step $t_{d_i} + 3W + c_4/(\epsilon_s - c_3) \le t < t_{d_{i+1}}$, where $t_{d_i}$ is the time step of the most recent drift and $\epsilon = \epsilon_s + \epsilon_r$ (where $c_3$, $c_4$ are the constants in Corollary 1).

At a high level, $\epsilon_r$ and $\epsilon_s$, respectively, capture the error rates of false negatives in drift detection and of false positives in the stationary period afterward. The full proof is in Appendix B.2.

6. Experimental Results

In this section, we present experimental results on datasets with drifts that (i) empirically confirm the advantage of DriftSurf's stable-state/reactive-state approach over Standard Drift Detection (Standard DD), (ii) empirically confirm the risk-competitiveness of DriftSurf with Aware, and (iii) show the effectiveness of DriftSurf via comparison to two state-of-the-art adaptive learning algorithms, the drift-detection-based method MDDM and the ensemble method AUE. Both Standard DD and MDDM are standalone drift detection algorithms, with the key difference being that Standard DD's drift detector matches the test used by DriftSurf to enter the reactive state, enabling us to quantify the gains of having a reactive state. More details on these algorithms, and additional algorithm comparisons, are provided in Appendix C.1.

We use five synthetic, two semi-synthetic, and three real datasets for binary classification, chosen to include all such datasets that the authors of MDDM and AUE use in their evaluations. These datasets include both abrupt and gradual drifts. Drifts in the semi-synthetic datasets are generated by rotating data points or changing the labels of real-world datasets that originally do not contain any drift. We divide each dataset into equally-sized batches that arrive over the course of the stream. More detail on the datasets is provided in Appendix C.2.

In our experiments, a batch of data points arrives at each time step. We first evaluate the performance of each algorithm by measuring the misclassification rate over this batch, and then each algorithm gains access to the labeled data to update its model(s); i.e., test-then-train. The base learner in each algorithm is a logistic regression model with STRSAGA as the update process. More details on this base learner, hyperparameter settings, and additional base learners, are provided in Appendix C.3.
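The test-then-train protocol itself is simple to express in code; the following harness is our own sketch, with `learner.predict` and `learner.train` as assumed interfaces rather than the paper's API.

```python
def prequential_eval(stream, learner):
    """Test-then-train: score each arriving batch before training on it.

    `stream` yields (t, X, y) batches; returns the per-step
    misclassification rates.
    """
    errors = []
    for t, X, y in stream:
        y_hat = learner.predict(X)   # test first, labels unseen
        errors.append(float((y_hat != y).mean()))
        learner.train(X, y)          # then reveal labels and train
    return errors
```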
All reported misclassification rates represent the median over five trials.

[Figure 1: Misclassification rate over time for (a) Cover Type, (b) Power Supply, and (c) Electricity, comparing DriftSurf, MDDM, AUE, and Standard DD.]

[Figure 2: Cover Type (update steps divided among each model); misclassification rate over time for DriftSurf, MDDM, AUE, and Standard DD.]

[Figure 3: All datasets; misclassification rates of DriftSurf and Standard DD under varying threshold $\delta$ (0.00 to 0.25).]

[Figure 4: RCV1; misclassification rate over time for DriftSurf and DriftSurf (no-greedy).]

We present the misclassification rates at each time step on the Cover Type, Power Supply, and Electricity datasets (see Appendix D.1 for other datasets) in Figure 1. A drift occurs at times 30 and 60 in Cover Type, at times 17, 47, and 76 in Power Supply, and at time 20 in Electricity. We observe that DriftSurf outperforms MDDM because false positives in drift detection lead to unnecessary resetting of the predictive model in MDDM, while DriftSurf avoids the performance loss by catching most false positives via the reactive state and returning to the older model. Cover Type and Electricity were especially problematic for MDDM, which continually signaled a drift. We also observe that DriftSurf adapts faster than AUE on Cover Type and Electricity. This is because after an abrupt drift, the predictions of DriftSurf are solely from the new model, while for AUE, the predictions are a weighted average of each expert in the ensemble. Immediately after a drift, the older, inaccurate experts of AUE have reduced, but non-zero, weights that negatively impact the accuracy. In particular, on Cover Type, we observe that the recovery time of DriftSurf is within one reactive state.

Standard DD also suffers from false-positive drift detection, especially on Power Supply and Electricity. However, it outperforms all the other algorithms on Cover Type: it detects the drifts at the right moment and resets its predictive model. Following the greedy approach during the reactive state allows DriftSurf to converge to its newly created model with only a one-time-step lag.

Table 2: Average misclassification rate (equal number of update steps for each model)

| Dataset | AUE | MDDM | Standard DD | DriftSurf | Aware |
|---|---|---|---|---|---|
| SEA0 | 0.093 | 0.086 | 0.097 | 0.086 | 0.137 |
| SEA20 | 0.245 | 0.289 | 0.249 | 0.243 | 0.264 |
| SEA-GRADUAL | 0.162 | 0.165 | 0.160 | 0.159 | 0.177 |
| HYPER-SLOW | 0.112 | 0.116 | 0.116 | 0.118 | 0.110 |
| HYPER-FAST | 0.179 | 0.163 | 0.168 | 0.173 | 0.191 |
| SINE1 | 0.212 | 0.176 | 0.184 | 0.187 | 0.171 |
| MIXED | 0.209 | 0.204 | 0.204 | 0.204 | 0.192 |
| CIRCLES | 0.379 | 0.372 | 0.377 | 0.371 | 0.368 |
| RCV1 | 0.167 | 0.125 | 0.126 | 0.125 | 0.121 |
| COVERTYPE | 0.279 | 0.311 | 0.267 | 0.268 | 0.267 |
| AIRLINE | 0.333 | 0.345 | 0.338 | 0.334 | 0.338 |
| ELECTRICITY | 0.296 | 0.344 | 0.320 | 0.290 | 0.315 |
| POWERSUPPLY | 0.301 | 0.322 | 0.308 | 0.292 | 0.309 |

Table 2 summarizes the results for all the datasets in terms of the total average of the misclassification rate over time. In the first two rows, we observe the stability of DriftSurf in the presence of 20% additive noise in the synthetic SEA dataset, again demonstrating the benefit of the reactive state, while MDDM's performance suffers due to the increased false positives.
We also observe that DriftSurf performs well on datasets with gradual drifts, such as SEA-gradual and Circles, where the stable-state/reactive-state approach is more accurate at identifying when to switch the model, compared to MDDM and Standard DD, respectively. Overall, DriftSurf is the best performer on a majority of the datasets in Table 2.

For some datasets (Airline, Hyper-slow), AUE outperforms DriftSurf. A factor is the different computational power (e.g., number of gradient computations per time step) used by each algorithm. AUE maintains an ensemble of ten experts, while DriftSurf maintains just one (except during the reactive state, when it maintains two), and so AUE uses at least five (up to ten) times the computation of DriftSurf. To account for the varying computational efficiency of each algorithm, we conducted another experiment where the available computational power for each algorithm is divided equally among all of its models. (A different variation on AUE that is instead limited to maintaining only two experts is also studied in Appendix D.2.) The misclassification rates for each dataset are presented in Table 3, where we observe that DriftSurf dominates AUE across all datasets. The Cover Type dataset is visualized in Figure 2 (compare to Figure 1a for equal computational power given to each model), where we observe a significant penalty to the accuracy of AUE because of the constrained training time per model.

Table 3: Average misclassification rate (update steps divided among each model)

| Dataset | AUE | MDDM | Standard DD | DriftSurf | Aware |
|---|---|---|---|---|---|
| SEA0 | 0.201 | 0.089 | 0.097 | 0.094 | 0.133 |
| SEA20 | 0.291 | 0.283 | 0.253 | 0.249 | 0.266 |
| SEA-GRADUAL | 0.240 | 0.172 | 0.161 | 0.160 | 0.174 |
| HYPER-SLOW | 0.191 | 0.116 | 0.117 | 0.130 | 0.117 |
| HYPER-FAST | 0.278 | 0.164 | 0.168 | 0.188 | 0.191 |
| SINE1 | 0.309 | 0.178 | 0.180 | 0.209 | 0.168 |
| MIXED | 0.259 | 0.204 | 0.204 | 0.204 | 0.191 |
| CIRCLES | 0.401 | 0.372 | 0.380 | 0.369 | 0.368 |
| RCV1 | 0.403 | 0.131 | 0.128 | 0.143 | 0.120 |
| COVERTYPE | 0.317 | 0.313 | 0.267 | 0.271 | 0.267 |
| AIRLINE | 0.369 | 0.351 | 0.338 | 0.348 | 0.338 |
| ELECTRICITY | 0.364 | 0.339 | 0.319 | 0.308 | 0.311 |
| POWERSUPPLY | 0.313 | 0.309 | 0.311 | 0.307 | 0.311 |

Another advantage of the stable-state/reactive-state approach of DriftSurf is its robustness in the setting of the threshold $\delta$. In general, drift detection tests have a threshold that poses a trade-off between false positive and false negative rates (for Standard DD, Lemmas 1 and 4 in Section 5), which can be difficult to tune without knowing the frequency and magnitude of drifts in advance. Across a range of $\delta$, Figure 3 shows the misclassification rates for DriftSurf compared to Standard DD, averaged across the datasets in Table 2 (see Appendix D.3 for results per dataset). We observe that the performance of DriftSurf is resilient to the choice of $\delta$. We also confirm that lower values of $\delta$, corresponding to aggressive drift detection in the stable state, allow DriftSurf to detect subtle drifts while not sacrificing performance, because the reactive state eliminates most false positives.

We also study the impact of the design choice in DriftSurf of using greedy prediction during the reactive state. While in the reactive state, the predictive model used at one time step is the model that had the better performance in the previous time step, and then at the end of the reactive state, the decision is made whether or not to use the reactive model going forward.
The natural alternative choice is that switching to the new reactive model can happen only at the end of the reactive state; we call this DriftSurf (no-greedy). The comparison of these two choices is visualized on the RCV1 dataset in Figure 4, where we observe the delayed switch of DriftSurf (no-greedy) to the new model following the drifts at times 30 and 60. The full results for each dataset are presented in Appendix D.4, where we observe that DriftSurf performs equal to or better than DriftSurf (no-greedy) on 11 of the 13 datasets in Table 2, and, averaging over all the datasets, has a misclassification rate of 0.221 compared to 0.229.

Appendices D.5-D.8 contain additional experimental results. In Appendix D.5, we report the results for single-pass SGD and an oblivious algorithm (STRSAGA with no adaptation to drift), which are generally worse across each dataset. Appendix D.6 includes results for each algorithm when SGD is used as the update process instead of STRSAGA. We observe that using SGD results in lower accuracy for each algorithm, and also that, relatively, AUE gains an edge because its ensemble of ten experts mitigates the higher-variance updates of SGD. Appendix D.7 studies base learners beyond logistic regression, showing the advantage of DriftSurf's stable-state/reactive-state approach on both Hoeffding Trees and Naive Bayes classifiers. Finally, Appendix D.8 reports additional numerical results on the recovery time of each algorithm.

7. Conclusion

We presented DriftSurf, an adaptive algorithm for learning from streaming data that contains concept drifts. Our risk-competitive theoretical analysis showed that DriftSurf has high accuracy competitive with Aware both in a stationary environment and in the presence of abrupt drifts. Further analysis showed that DriftSurf's reactive-state approach provides statistically better learning than standalone drift detection. Our experimental results confirmed our theoretical analysis and also showed high accuracy in the presence of abrupt and gradual drifts, generally outperforming the state-of-the-art algorithms MDDM and AUE. Furthermore, DriftSurf maintains at most two models while achieving high accuracy, and therefore its computational efficiency is significantly better than that of an ensemble method like AUE.

Acknowledgments. This research was supported in part by NSF grants CCF-1725663 and CCF-1725702 and by the Parallel Data Lab (PDL) Consortium.

References

Alippi, C., Boracchi, G., and Roveri, M. Hierarchical change-detection tests. IEEE Trans. Neural Netw. Learn. Syst., 28(2):246-258, 2016.

Bach, S. H. and Maloof, M. A. Paired learners for concept drift. In ICDM, pp. 23-32, 2008.

Baena-García, M., del Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., and Morales-Bueno, R. Early drift detection method. In StreamKDD, pp. 77-86, 2006.

Bifet, A. and Gavaldà, R. Learning from time-changing data with adaptive windowing. In ICDM, pp. 443-448, 2007.

Bifet, A. and Gavaldà, R. Adaptive learning from evolving data streams. In International Symposium on Intelligent Data Analysis, pp. 249-260, 2009.

Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. MOA: Massive online analysis. JMLR, 11:1601-1604, 2010.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

Bousquet, O. and Bottou, L. The tradeoffs of large scale learning. In NIPS, pp. 161-168, 2007.
Brzezinski, D. and Stefanowski, J. Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learn. Syst., 25(1):81-94, 2013.

Chiu, C. W. and Minku, L. L. Diversity-based pool of models for dealing with recurring concepts. In IJCNN, pp. 1-8, 2018.

Daneshmand, H., Lucchi, A., and Hofmann, T. Starting small: Learning with adaptive sample sizes. In ICML, pp. 1463-1471, 2016.

Dau, H. A., Bagnall, A., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A., and Keogh, E. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293-1305, 2019.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Elwell, R. and Polikar, R. Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw., 22(10):1517-1531, 2011.

Gama, J., Medas, P., Castillo, G., and Rodrigues, P. Learning with drift detection. In Advances in Artificial Intelligence - SBIA, pp. 286-295, 2004.

Gama, J., Žliobaite, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44, 2014.

Harel, M., Crammer, K., El-Yaniv, R., and Mannor, S. Concept drift detection through resampling. In ICML, pp. 1009-1017, 2014.

Harries, M. Splice-2 comparative evaluation: Electricity pricing. Technical report, University of New South Wales, 1999.

Hentschel, B., Haas, P. J., and Tian, Y. Online model management via temporally biased sampling. ACM SIGMOD Record, 48(1):69-76, 2019.

Hinder, F., Artelt, A., and Hammer, B. Towards non-parametric drift detection via dynamic adapting window independence drift detection (DAWIDD). In ICML, pp. 4249-4259, 2020.

Ikonomovska, E. Airline dataset. URL http://kt.ijs.si/elena_ikonomovska/data.html. (Accessed on 02/06/2020).

Janson, S. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1-6, 2018.

Jothimurugesan, E., Tahmasbi, A., Gibbons, P. B., and Tirthapura, S. Variance-reduced stochastic gradient descent on streaming data. In NeurIPS, pp. 9906-9915, 2018.

Kifer, D., Ben-David, S., and Gehrke, J. Detecting change in data streams. In VLDB, pp. 180-191, 2004.

Klinkenberg, R. Learning drifting concepts: Example selection vs. example weighting. IDA, 8(3):281-300, 2004.

Kolter, J. Z. and Maloof, M. A. Dynamic weighted majority: An ensemble method for drifting concepts. JMLR, 8:2755-2790, 2007.

Koychev, I. Gradual forgetting for adaptation to concept drift. In ECAI Workshop on Current Issues in Spatio-Temporal Reasoning, 2000.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.

Lu, Y., Cheung, Y.-m., and Tang, Y. Y. Dynamic weighted majority for incremental learning of imbalanced data streams with concept drift. In IJCAI, pp. 2393-2399, 2017.

Montiel, J., Read, J., Bifet, A., and Abdessalem, T. Scikit-multiflow: A multi-output streaming framework. JMLR, 19(72):1-5, 2018.

Pesaranghader, A. and Viktor, H. L. Fast Hoeffding drift detection method for evolving data streams. In ECML-PKDD, pp. 96-111, 2016.

Pesaranghader, A., Viktor, H. L., and Paquet, E. A framework for classification in data streams using multi-strategy learning. In ICDS, pp. 341-355, 2016.

Pesaranghader, A., Viktor, H. L., and Paquet, E. McDiarmid drift detection methods for evolving data streams. In IJCNN, pp. 1-9, 2018.
Sebastião, R. and Gama, J. Change detection in learning histograms from data streams. In PAI, pp. 112-123, 2007.

Shaker, A. and Hüllermeier, E. Recovery analysis for adaptive learning from non-stationary data streams: Experimental design and case study. Neurocomputing, 150:250-264, 2015.

Sun, Y., Tang, K., Zhu, Z., and Yao, X. Concept drift adaptation by exploiting historical knowledge. IEEE Trans. Neural Netw. Learn. Syst., 29(10):4822-4832, 2018.

Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.

Zhao, P., Cai, L.-W., and Zhou, Z.-H. Handling concept drift via model reuse. Machine Learning, 109:533-568, 2020.