# Weighted Linear Bandits for Non-Stationary Environments

Yoan Russac (CNRS, Inria, ENS, Université PSL), yoan.russac@ens.fr
Claire Vernade (DeepMind), vernade@google.com
Olivier Cappé (CNRS, Inria, ENS, Université PSL), olivier.cappe@cnrs.fr

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order $d^{2/3}B_T^{1/3}T^{2/3}$, where $B_T$ is a measure of non-stationarity ($d$ and $T$ being, respectively, the dimension and the horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Multi-armed bandits offer a class of models to address sequential learning tasks that involve exploration-exploitation trade-offs. In this work we are interested in structured bandit models, known as stochastic linear bandits, in which linear regression is used to predict rewards [1, 2, 22]. A typical application of bandit algorithms based on the linear model is online recommendation, where actions are items to be, for instance, efficiently arranged on personalized web pages to maximize some conversion rate. However, it is unlikely that customers' preferences remain stable, and the collected data becomes progressively obsolete as interest in the items evolves. Hence, it is essential to design adaptive bandit agents rather than restarting the learning from scratch on a regular basis.

In this work, we consider the use of weighted least-squares as an efficient method to progressively forget past interactions. Thus, we address sequential learning problems in which the parameter of the linear bandit evolves with time. Our first contribution consists in extending existing deviation inequalities to sequential weighted least-squares. Our result applies to a large variety of bandit problems and is of independent interest. In particular, it extends the recent analysis of heteroscedastic environments by [18]. It can also be useful to deal with class-imbalance situations or, as is our focus here, with non-stationary environments.

As a second major contribution, we apply our results to propose D-LinUCB, an adaptive linear bandit algorithm based on carefully designed exponential weights. D-LinUCB can be implemented fully recursively, without requiring the storage of past actions, with a numerical complexity that is comparable to that of LinUCB. To characterize the performance of the algorithm, we provide a unified regret analysis for abruptly-changing or slowly-varying environments.

The setting and notations are presented below and we state our main deviation result in Section 2.
Section 3 is dedicated to non-stationary linear bandits: we describe our algorithm and provide regret upper bounds in abruptly-changing and slowly-varying environments. We complete this theoretical study with a set of experiments in Section 4.

1.1 Model and Notations

The setting we consider in this paper is a non-stationary variant of the stochastic linear bandit problem considered in [1, 22], where, at each round $t \ge 1$, the learner receives a finite set of feasible actions $\mathcal{A}_t \subset \mathbb{R}^d$, chooses an action $A_t \in \mathcal{A}_t$ and receives a reward $X_t$ such that
$$X_t = \langle A_t, \theta^\star_t\rangle + \eta_t, \qquad (1)$$
where $\theta^\star_t \in \mathbb{R}^d$ is an unknown parameter and $\eta_t$ is, conditionally on the past, a $\sigma$-subgaussian random noise. The action set $\mathcal{A}_t$ may be arbitrary but its elements are assumed to be bounded, in the sense that $\|a\|_2 \le L$ for all $a \in \mathcal{A}_t$. The time-varying parameter is also assumed to be bounded: $\|\theta^\star_t\|_2 \le S$ for all $t$. We further assume that $|\langle a, \theta^\star_t\rangle| \le 1$ for all $t$ and all $a \in \mathcal{A}_t$ (obviously, this could be guaranteed by assuming that $L = S = 1$, but we indicate the dependence in $L$ and $S$ in order to facilitate the interpretation of some results). For a positive definite matrix $M$ and a vector $x$, we denote by $\|x\|_M$ the norm $\sqrt{x^\top M x}$.

The goal of the learner is to minimize the expected dynamic regret defined as
$$R_T = \mathbb{E}\Big[\sum_{t=1}^T \max_{a \in \mathcal{A}_t}\langle a, \theta^\star_t\rangle - X_t\Big] = \mathbb{E}\Big[\sum_{t=1}^T \max_{a \in \mathcal{A}_t}\langle a - A_t, \theta^\star_t\rangle\Big]. \qquad (2)$$

Even in the stationary case, i.e., when $\theta^\star_t \equiv \theta^\star$, there is in general no single fixed best action in this model. When making stronger structural assumptions on $\mathcal{A}_t$, one recovers specific instances that have also been studied in the literature. In particular, the canonical basis of $\mathbb{R}^d$, $\mathcal{A}_t = \{e_1, \ldots, e_d\}$, yields the familiar non-contextual multi-armed bandit model [20]. Another variant, studied by [15] and others, is obtained when $\mathcal{A}_t = \{e_1 \otimes a_t, \ldots, e_k \otimes a_t\}$, where $\otimes$ denotes the Kronecker product and $a_t$ is a time-varying context vector shared by the $k$ actions.
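For concreteness, the following minimal sketch simulates the interaction protocol (1) and accumulates the dynamic regret (2) for a naive uniformly random policy. The rotating parameter sequence, the random action sets and the baseline policy are illustrative choices, not part of the model specification:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, K, sigma = 2, 1000, 10, 0.1   # dimension, horizon, actions per round, noise level (illustrative)

def theta_star(t):
    """A slowly rotating parameter on the unit circle (one possible non-stationary sequence)."""
    angle = 0.5 * np.pi * t / T
    return np.array([np.cos(angle), np.sin(angle)])

regret = 0.0
for t in range(1, T + 1):
    actions = rng.normal(size=(K, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)    # enforce ||a||_2 <= L with L = 1
    theta = theta_star(t)
    chosen = rng.integers(K)                                     # placeholder policy: uniform play
    reward = actions[chosen] @ theta + sigma * rng.normal()      # model (1): X_t = <A_t, theta*_t> + eta_t
    # a learning algorithm would use `reward` as feedback here
    regret += np.max(actions @ theta) - actions[chosen] @ theta  # per-round term of the dynamic regret (2)

print(f"dynamic regret of the uniform policy: {regret:.1f}")
```

Any bandit policy can be plugged in by replacing the uniform choice of `chosen` with a data-driven selection rule.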
1.2 Related Work

There is an important literature on online learning in changing environments. For the sake of conciseness, we restrict the discussion to works that specifically consider the stochastic linear bandit model in (1), including its restriction to the simpler (non-stationary) multi-armed bandit model. Note that there is also a rich line of works that consider possibly non-linear contextual models in the case where one can make probabilistic assumptions on the contexts [10, 23].

Controlling the regret with respect to the non-stationary optimal action defined in (2) depends on the assumptions that are made on the time variations of $\theta^\star_t$. A generic way of quantifying them is through a variation bound $B_T = \sum_{s=1}^{T-1}\|\theta^\star_s - \theta^\star_{s+1}\|_2$ [4, 6, 11], similar to the penalty used in the group fused Lasso [8]. The main advantage of using the variation budget is that it includes both slowly-varying and abruptly-changing environments. For $K$-armed bandits with known $B_T$, [4–6] achieve the tight dynamic regret bound of $O(K^{1/3}B_T^{1/3}T^{2/3})$. For linear bandits, [11, 12] propose an algorithm based on the use of a sliding window and provide a $O(d^{2/3}B_T^{1/3}T^{2/3})$ dynamic regret bound; since this contribution is close to ours, we discuss it further in Section 3.2.

A more specific non-stationary setting arises when the number of changes in the parameter is bounded by $\Gamma_T$, as in traditional change-point models. The problem is usually referred to as switching bandits or abruptly-changing environments. It is, for instance, the setting considered in the work by Garivier and Moulines [14], who analyzed the dynamic regret of UCB strategies based on either a sliding window or exponential discounting. For both policies, they prove upper bounds on the regret in $O(\sqrt{\Gamma_T T})$ when $\Gamma_T$ is known. They also provide a lower bound in a specific non-stationary setting, showing that $R(T) = \Omega(\sqrt{T})$. The algorithmic ideas can be traced back to [19]. [28] shows that a horizon-independent version of the sliding-window algorithm can also be analyzed in a slowly-varying setting. [17] analyze windowing and discounting approaches to address dynamic pricing guided by a (time-varying) linear regression model. Discount factors have also been used with Thompson sampling in dynamic environments, as in [16, 26].

In abruptly-changing environments, an alternative approach relies on change-point detection [3, 7, 9, 29, 30]. A regret bound of order $\log(T)$, up to a factor depending on the smallest gap $\epsilon$ that can be detected by the algorithm, is proven by [30]; this gap has to be given as prior knowledge. [9] proves a minimax bound in $O(\sqrt{\Gamma_T K T})$ if $\Gamma_T$ is known. [7] achieves a rate of $O(\sqrt{\Gamma_T K T})$ without any prior knowledge of the gaps or of $\Gamma_T$. In the contextual case, [29] builds on the same idea: they use a pool of LinUCB learners called slave models as experts and they add a new model when no existing slave is able to give good predictions, that is, when a change is detected. A limitation of such an approach, however, is that it cannot adapt to some slowly-varying environments, as will be illustrated in Section 4. From a practical viewpoint, the methods based either on a sliding window or on change-point detection require the storage of past actions, whereas those based on discount factors can be implemented fully recursively. Finally, non-stationarity may also arise in more specific scenarios connected, for instance, to the decaying attention of the users, as investigated in [21, 24, 27].

In the following, we consider the general case where the parameters satisfy the variation bound, i.e., $\sum_{t=1}^{T-1}\|\theta^\star_t - \theta^\star_{t+1}\|_2 \le B_T$, and we propose an algorithm based on discounted linear regression.

2 Confidence Bounds for Weighted Linear Bandits

In this section, we consider the concentration of the weighted regularized least-squares estimator, when used with general weights and regularization parameters. To the best of our knowledge, there is no such result in the literature for sequential learning, i.e., when the current regressor may depend on the random outcomes observed in the past. The particular case considered in Lemma 5 of [18] (heteroscedastic noise with optimal weights) stays very close to the unweighted case and we show below how to extend this result. We believe that this new bound is of interest beyond the specific model considered in this paper.

For the sake of clarity, we first focus on the case of regression models with a fixed parameter, i.e., $\theta^\star_t = \theta^\star$ for all $t$. First consider a deterministic sequence of regularization parameters $(\lambda_t)_{t\ge1}$. The reason why these should be non-constant for weighted least-squares will appear clearly in Section 3. Next, define $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ to be the filtration associated with the random observations. We assume that both the actions $A_t$ and the positive weights $w_t$ are predictable, that is, they are $\mathcal{F}_{t-1}$-measurable. Denoting by
$$\hat\theta_t = \arg\min_{\theta\in\mathbb{R}^d}\ \sum_{s=1}^t w_s\big(X_s - \langle A_s, \theta\rangle\big)^2 + \lambda_t\|\theta\|_2^2$$
the regularized weighted least-squares estimator of $\theta^\star$ at time $t$, one has
$$\hat\theta_t = V_t^{-1}\sum_{s=1}^t w_s A_s X_s \quad\text{where}\quad V_t = \sum_{s=1}^t w_s A_s A_s^\top + \lambda_t I_d, \qquad (3)$$
and $I_d$ denotes the $d$-dimensional identity matrix. We further consider an arbitrary sequence of positive parameters $(\mu_t)_{t\ge1}$ and define the matrix
$$\widetilde V_t = \sum_{s=1}^t w_s^2 A_s A_s^\top + \mu_t I_d. \qquad (4)$$
The matrix $\widetilde V_t$ is strongly connected to the variance of the estimator $\hat\theta_t$, which involves the squares of the weights $(w_s^2)_{s\ge1}$. For the time being, $\mu_t$ is arbitrary and will be set as a function of $\lambda_t$ in order to optimize the deviation inequality.
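To fix ideas, here is a minimal batch implementation of (3) and (4) for arbitrary weights and regularization parameters (an illustrative sketch; the recursive, discounted form actually used later appears in Algorithm 1 of Section 3):

```python
import numpy as np

def weighted_ls_estimate(actions, rewards, weights, lambda_t, mu_t):
    """Weighted regularized least-squares estimator (3) and the matrices V_t and V~_t of (4).

    actions: (t, d) array of A_1..A_t, rewards: (t,) array of X_1..X_t,
    weights: (t,) array of positive weights w_1..w_t (assumed predictable),
    lambda_t, mu_t: current regularization parameters.
    """
    d = actions.shape[1]
    gram = (weights[:, None] * actions).T @ actions           # sum_s w_s   A_s A_s^T
    gram_sq = (weights[:, None] ** 2 * actions).T @ actions   # sum_s w_s^2 A_s A_s^T
    V = gram + lambda_t * np.eye(d)
    V_tilde = gram_sq + mu_t * np.eye(d)
    b = actions.T @ (weights * rewards)                       # sum_s w_s A_s X_s
    theta_hat = np.linalg.solve(V, b)
    return theta_hat, V, V_tilde
```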
We then have the following maximal deviation inequality.

Theorem 1. For any $\mathcal{F}_t$-predictable sequences of actions $(A_t)_{t\ge1}$ and positive weights $(w_t)_{t\ge1}$, and for all $\delta > 0$, with probability at least $1-\delta$, for all $t \ge 1$,
$$\big\|\hat\theta_t - \theta^\star\big\|_{V_t\widetilde V_t^{-1}V_t} \;\le\; \frac{\lambda_t}{\sqrt{\mu_t}}\,S + \sigma\sqrt{2\log(1/\delta) + d\log\Big(1 + \frac{L^2\sum_{s=1}^t w_s^2}{d\,\mu_t}\Big)}.$$

The proof of this theorem is deferred to the appendix and combines an argument using the method of mixtures and the use of a proper stopping time. The standard result used for least-squares [20, Chapter 20] is recovered by taking $\mu_t = \lambda_t$ and $w_t = 1$ (note that $\widetilde V_t$ is then equal to $V_t$). When the weights are not all equal to 1, the appearance of the matrix $\widetilde V_t$ is a consequence of the fact that the variance terms are proportional to the squared weights $w_t^2$, while the least-squares estimator itself is defined with the weights $w_t$. In the weighted case, the matrix $V_t\widetilde V_t^{-1}V_t$ must be used to define the confidence ellipsoid.

An important property of the least-squares estimator is to be scale-invariant, in the sense that multiplying all weights $(w_s)_{1\le s\le t}$ and the regularization parameter $\lambda_t$ by a constant leaves the estimator $\hat\theta_t$ unchanged. In Theorem 1, the only choice of the sequence $(\mu_t)_{t\ge1}$ that is compatible with this scale-invariance property is to take $\mu_t$ proportional to $\lambda_t^2$: the matrix $V_t\widetilde V_t^{-1}V_t$ then becomes scale-invariant (i.e., unchanged by the transformation $w_s \mapsto \alpha w_s$) and so does the upper bound on $\|\hat\theta_t - \theta^\star\|_{V_t\widetilde V_t^{-1}V_t}$ in Theorem 1. In the following, we will stick to this choice, while particularizing the choice of the weights $w_t$ to allow for non-stationary models.

It is possible to extend this result to heteroscedastic noise, when $\eta_t$ is $\sigma_t$-subgaussian and $\sigma_t$ is $\mathcal{F}_{t-1}$-measurable, by defining $\widetilde V_t$ as $\sum_{s=1}^t w_s^2\sigma_s^2 A_s A_s^\top + \mu_t I_d$. In the next section, we will also use an extension of Theorem 1 to the non-stationary model presented in (1). In this case, Theorem 1 holds with $\theta^\star$ replaced by $V_t^{-1}\big(\sum_{s=1}^t w_s A_s A_s^\top\,\theta^\star_s + \lambda_t\,\theta^\star_r\big)$, where $r$ is an arbitrary time index (Proposition 3 in the appendix). The fact that $r$ can be chosen freely is a consequence of the assumption that the sequence of $L_2$-norms of the parameters $(\theta^\star_t)_{t\ge1}$ is bounded by $S$.
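To make the scale-invariance claim above concrete, here is the short verification (our computation): under the rescaling $w_s \mapsto \alpha w_s$ and $\lambda_t \mapsto \alpha\lambda_t$, and hence $\mu_t \mapsto \alpha^2\mu_t$ when $\mu_t \propto \lambda_t^2$,
$$V_t \mapsto \alpha V_t,\qquad \widetilde V_t \mapsto \alpha^2\widetilde V_t,\qquad V_t\widetilde V_t^{-1}V_t \mapsto \alpha V_t\,(\alpha^2\widetilde V_t)^{-1}\,\alpha V_t = V_t\widetilde V_t^{-1}V_t,$$
$$\hat\theta_t \mapsto (\alpha V_t)^{-1}\sum_{s=1}^t \alpha w_s A_s X_s = \hat\theta_t,\qquad \frac{\lambda_t}{\sqrt{\mu_t}} \mapsto \frac{\alpha\lambda_t}{\sqrt{\alpha^2\mu_t}} = \frac{\lambda_t}{\sqrt{\mu_t}},\qquad \frac{\sum_{s=1}^t w_s^2}{\mu_t} \mapsto \frac{\alpha^2\sum_{s=1}^t w_s^2}{\alpha^2\mu_t} = \frac{\sum_{s=1}^t w_s^2}{\mu_t},$$
so that both sides of the inequality in Theorem 1 are left unchanged.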
3 Application to Non-stationary Linear Bandits

In this section, we consider the non-stationary model defined in (1) and propose a bandit algorithm in Section 3.1, called Discounted Linear Upper Confidence Bound (D-LinUCB), that relies on weighted least-squares to adapt to changes in the parameters $\theta^\star_t$. Analyzing the performance of D-LinUCB in Section 3.2, we show that it achieves reliable performance both for abruptly-changing and for slowly-drifting parameters.

3.1 The D-LinUCB Algorithm

Being adaptive to parameter changes implies reducing the influence of observations that are far back in the past, which suggests using weights $w_t$ that increase with time. In doing so, there are two important caveats to consider. First, this can only be effective if the sequence of weights grows sufficiently fast (see the analysis in the next section). We thus consider exponentially increasing weights of the form $w_t = \gamma^{-t}$, where $0 < \gamma < 1$ is the discount factor. Next, due to the absence of assumptions on the action sets $\mathcal{A}_t$, the regularization is instrumental in obtaining guarantees of the form given in Theorem 1. In fact, if $w_t = \gamma^{-t}$ while $\lambda_t$ does not increase sufficiently fast, the term $\log\big(1 + L^2\sum_{s=1}^t w_s^2/(d\mu_t)\big)$ will eventually dominate the radius of the confidence region, since we choose $\mu_t$ proportional to $\lambda_t^2$. This occurs because there is no guarantee that the algorithm will persistently select actions $A_t$ that span the entire space. With this in mind, we consider an increasing regularization factor of the form $\lambda_t = \gamma^{-t}\lambda$, where $\lambda > 0$ is a hyperparameter.

Note that, due to the scale-invariance property of the weighted least-squares estimator, we can equivalently consider that, at time $t$, we are given time-dependent weights $w_{t,s} = \gamma^{t-s}$, for $1 \le s \le t$, and that $\hat\theta_t$ is defined as
$$\arg\min_{\theta\in\mathbb{R}^d}\ \sum_{s=1}^t \gamma^{t-s}\big(X_s - \langle A_s, \theta\rangle\big)^2 + \lambda\|\theta\|_2^2.$$
For numerical stability reasons, this form is preferable and is used in the statement of Algorithm 1. In the analysis of Section 3.2, however, we revert to the standard form of the weights, which is required to apply the concentration result of Section 2. We are now ready to describe D-LinUCB in Algorithm 1.

Algorithm 1: D-LinUCB
Input: probability $\delta$, subgaussianity constant $\sigma$, dimension $d$, regularization $\lambda$, upper bound for actions $L$, upper bound for parameters $S$, discount factor $\gamma$.
Initialization: $b = 0_{\mathbb{R}^d}$, $V = \lambda I_d$, $\widetilde V = \lambda I_d$, $\hat\theta = 0_{\mathbb{R}^d}$
for $t \ge 1$ do
    Receive $\mathcal{A}_t$, compute $\beta_{t-1} = \sqrt{\lambda}\,S + \sigma\sqrt{2\log(1/\delta) + d\log\Big(1 + \frac{L^2(1-\gamma^{2(t-1)})}{d\lambda(1-\gamma^2)}\Big)}$
    for $a \in \mathcal{A}_t$ do
        Compute $\mathrm{UCB}(a) = a^\top\hat\theta + \beta_{t-1}\sqrt{a^\top V^{-1}\widetilde V V^{-1}a}$
    $A_t = \arg\max_a \mathrm{UCB}(a)$
    Play action $A_t$ and receive reward $X_t$
    Updating phase: $V = \gamma V + A_tA_t^\top + (1-\gamma)\lambda I_d$, $\widetilde V = \gamma^2\widetilde V + A_tA_t^\top + (1-\gamma^2)\lambda I_d$,
        $b = \gamma b + X_tA_t$, $\hat\theta = V^{-1}b$
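The following Python sketch mirrors Algorithm 1 in its numerically stable, discounted form (an illustrative implementation of the pseudo-code above, not the authors' reference code; the default hyperparameter values are arbitrary):

```python
import numpy as np

class DLinUCB:
    """Sketch of Algorithm 1 (D-LinUCB) in its discounted, fully recursive form."""

    def __init__(self, d, delta=0.05, sigma=1.0, lam=1.0, L=1.0, S=1.0, gamma=0.99):
        self.d, self.delta, self.sigma = d, delta, sigma
        self.lam, self.L, self.S, self.gamma = lam, L, S, gamma
        self.t = 0                          # number of completed rounds, i.e. t - 1 at selection time
        self.b = np.zeros(d)
        self.V = lam * np.eye(d)
        self.V_tilde = lam * np.eye(d)
        self.theta_hat = np.zeros(d)

    def _beta(self):
        # beta_{t-1} = sqrt(lam) S + sigma sqrt(2 log(1/delta)
        #              + d log(1 + L^2 (1 - gamma^{2(t-1)}) / (d lam (1 - gamma^2))))
        g = self.gamma
        log_term = np.log(1 + self.L**2 * (1 - g**(2 * self.t)) / (self.d * self.lam * (1 - g**2)))
        return np.sqrt(self.lam) * self.S + self.sigma * np.sqrt(2 * np.log(1 / self.delta) + self.d * log_term)

    def select(self, actions):
        """actions: (K, d) array of the feasible action set A_t; returns the index of the optimistic action."""
        V_inv = np.linalg.inv(self.V)
        M = V_inv @ self.V_tilde @ V_inv                           # V^{-1} V~ V^{-1}
        widths = np.sqrt(np.einsum("kd,de,ke->k", actions, M, actions))
        ucb = actions @ self.theta_hat + self._beta() * widths
        return int(np.argmax(ucb))

    def update(self, action, reward):
        """Updating phase of Algorithm 1, after playing `action` and observing `reward`."""
        g = self.gamma
        self.V = g * self.V + np.outer(action, action) + (1 - g) * self.lam * np.eye(self.d)
        self.V_tilde = g**2 * self.V_tilde + np.outer(action, action) + (1 - g**2) * self.lam * np.eye(self.d)
        self.b = g * self.b + reward * action
        self.theta_hat = np.linalg.solve(self.V, self.b)
        self.t += 1
```

Plugged into the simulation sketch of Section 1.1, one would call `select` on the current action set and `update` with the chosen action and the observed reward at every round.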
3.2 Analysis

As discussed previously, we consider weights of the form $w_t = \gamma^{-t}$ (where $0 < \gamma < 1$) in the D-LinUCB algorithm. In accordance with the discussion at the end of Section 2, Algorithm 1 uses $\mu_t = \gamma^{-2t}\lambda$ as the parameter defining the confidence ellipsoid around $\hat\theta_{t-1}$. The confidence ellipsoid $\mathcal{C}_t$ is defined as
$$\mathcal{C}_t = \Big\{\theta : \big\|\theta - \hat\theta_{t-1}\big\|_{V_{t-1}\widetilde V_{t-1}^{-1}V_{t-1}} \le \beta_{t-1}\Big\},\ \text{where}\ \beta_{t-1} = \sqrt{\lambda}\,S + \sigma\sqrt{2\log(1/\delta) + d\log\Big(1 + \frac{L^2(1-\gamma^{2(t-1)})}{d\lambda(1-\gamma^2)}\Big)}. \qquad (5)$$
Using standard algebraic calculations, together with the remark above about scale-invariance, it is easily checked that at time $t$ Algorithm 1 selects the action $A_t$ that maximizes $\langle a, \theta\rangle$ over $a \in \mathcal{A}_t$ and $\theta \in \mathcal{C}_t$. The following theorem bounds the regret resulting from Algorithm 1.

Theorem 2. Assuming that $\sum_{s=1}^{T-1}\|\theta^\star_s - \theta^\star_{s+1}\|_2 \le B_T$, the regret of the D-LinUCB algorithm is bounded for all $\gamma \in (0,1)$ and integer $D \ge 1$, with probability at least $1-\delta$, by
$$R_T \le 2LDB_T + \frac{4L^3S}{\lambda}\,\frac{\gamma^D}{1-\gamma}\,T + 2\sqrt{2}\,\beta_T\sqrt{dT}\sqrt{T\log(1/\gamma) + \log\Big(1 + \frac{L^2}{d\lambda(1-\gamma)}\Big)}. \qquad (6)$$

The first two terms of the r.h.s. of (6) are the result of the bias due to the non-stationary environment. The last term is the consequence of the high-probability bound established in the previous section and of an adaptation of the technique used in [1]. We give the complete proof of this result in the appendix. The high-level idea of the proof is to isolate bias and variance terms. However, in contrast with the stationary case, the confidence ellipsoid $\mathcal{C}_t$ does not necessarily contain (with high probability) the actual parameter value $\theta^\star_t$, due to the (unknown) bias arising from the time variations of the parameter. We thus define
$$\bar\theta_t = V_{t-1}^{-1}\Big(\sum_{s=1}^{t-1}\gamma^{-s}A_sA_s^\top\,\theta^\star_s + \lambda\gamma^{-(t-1)}\theta^\star_t\Big),$$
which is an action-dependent analogue of the parameter value $\theta^\star$ in the stationary setting (although it is a random quantity). As mentioned in Section 2, $\bar\theta_t$ does belong to $\mathcal{C}_t$ with probability at least $1-\delta$ (see Proposition 3 in the appendix). The regret may then be split, with probability at least $1-\delta$, as
$$R_T \le 2L\sum_{t=1}^T\big\|\theta^\star_t - \bar\theta_t\big\|_2 + \sum_{t=1}^T\big\langle A_t, \tilde\theta_t - \bar\theta_t\big\rangle,\quad\text{where}\quad (A_t, \tilde\theta_t) = \arg\max_{(a,\theta)\,\in\,\mathcal{A}_t\times\mathcal{C}_t}\langle a, \theta\rangle.$$
The rightmost term can be handled by proceeding as in the case of stationary linear bandits, thanks to the deviation inequality obtained in Section 2. The first term of the r.h.s. can be bounded deterministically, from the assumption made on $\sum_{s=1}^{T-1}\|\theta^\star_s - \theta^\star_{s+1}\|_2$. In doing so, we introduce the analysis parameter $D$ that, roughly speaking, corresponds to the window length equivalent to a particular choice of discount factor $\gamma$: the bias resulting from observations that are less than $D$ time steps apart may be bounded in terms of $D$, while the remaining ones are bounded globally by the second term of the r.h.s. of (6). This sketch of proof is substantially different from the arguments used by [11] to analyze their sliding-window algorithm (called SW-LinUCB); we refer to the appendix for a more detailed analysis of these differences. Interestingly, the regret bound of Theorem 2 holds despite the fact that the true parameter $\theta^\star_t$ may not be contained in the confidence ellipsoid $\mathcal{C}_t$, in contrast to the proof of [14].

It can be checked that, as $T$ tends to infinity, the optimal choice of the analysis parameter $D$ is to take $D = \log(T)/(1-\gamma)$. Further assuming that one may tune $\gamma$ as a function of the horizon $T$ and the variation upper bound $B_T$ yields the following result.

Corollary 1. By choosing $\gamma = 1 - (B_T/(dT))^{2/3}$, the regret of the D-LinUCB algorithm is asymptotically upper bounded with high probability by a term $O(d^{2/3}B_T^{1/3}T^{2/3})$ as $T \to \infty$.

This result is favorable as it corresponds to the same order as the lower bound established by [4]. More precisely, the case investigated by [4] corresponds to a non-contextual model with a number of changes that grows with the horizon. On the other hand, the guarantee of Corollary 1 requires a horizon-dependent tuning of the discount factor $\gamma$, which opens interesting research issues (see also [11]).
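To see informally why $D = \log(T)/(1-\gamma)$ is the relevant choice and how the tuning of Corollary 1 emerges from (6), here is a back-of-the-envelope computation (ours, ignoring constants and logarithmic factors). Since $\log(1/\gamma) \ge 1-\gamma$,
$$\gamma^{D} = e^{-D\log(1/\gamma)} \le e^{-D(1-\gamma)} = e^{-\log T} = \frac{1}{T},$$
so the second term of (6) is $O\big(1/(1-\gamma)\big)$, while the first term grows as $B_T/(1-\gamma)$ and, using $\log(1/\gamma) \approx 1-\gamma$ for $\gamma$ close to 1, the last term scales as $\beta_T\sqrt{dT}\sqrt{T(1-\gamma)}$ with $\beta_T = \tilde O(\sqrt{d})$. Setting $1-\gamma = (B_T/(dT))^{2/3}$ equalizes the two dominant contributions, both becoming of order $d^{2/3}B_T^{1/3}T^{2/3}$, which is the rate stated in Corollary 1.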
4 Experiments

This section is devoted to the evaluation of the empirical performance of D-LinUCB. We first consider two simulated low-dimensional environments that illustrate the behavior of the algorithms when confronted with either abrupt changes or slow variations of the parameters. The analysis of the previous section suggests that D-LinUCB should behave properly in both situations. We then consider a more realistic scenario in Section 4.2, where the contexts are high-dimensional and extracted from a dataset of actual user interactions with a web service.

For benchmarking purposes, we compare D-LinUCB to the Dynamic Linear Upper Confidence Bound (dLinUCB) algorithm proposed by [29] and to the Sliding Window Linear UCB (SW-LinUCB) of [11]. The principle of the dLinUCB algorithm is that a master bandit algorithm is in charge of choosing the best LinUCB slave bandit for making the recommendation. Each slave model is built to run in one of the different environments. The choice of the slave model is based on a lower confidence bound on the so-called badness of the different models. The badness is defined as the number of times the expected reward was found to be far enough from the actually observed reward over the last $\tau$ steps, where $\tau$ is a parameter of the algorithm. When a slave is chosen, the action proposed to a user is the result of the LinUCB algorithm associated with this slave. Once the action is played, all the slave models that were good enough are updated and the models whose badness was too high are deleted from the pool of slave models. If none of the slaves was found to be sufficiently good, a new slave is added to the pool.

The other algorithm that we use for comparison is SW-LinUCB, as presented in [11]. Rather than using exponentially increasing weights, a hard threshold is adopted: only the actions and rewards included in the $l$-length sliding window are used to estimate the linear regression coefficients. We expect D-LinUCB and SW-LinUCB to behave similarly, as both can be shown to enjoy the same sort of regret guarantees (see appendix). In the case of abrupt changes, we also compare these algorithms to the Oracle Restart LinUCB (LinUCB-OR) strategy, which knows the change-points and simply restarts a new instance of the LinUCB algorithm after each change. The regret of this strategy may be seen as an empirical lower bound on the optimal behavior of an online learning algorithm in abruptly-changing environments.

In the following figures, the vertical red dashed lines correspond to the change-points (in abrupt-change scenarios). They are represented to ease the understanding but, except for LinUCB-OR, they are of course unknown to the learning algorithms. When applicable, the blue dashed lines correspond to the average detection time of the breakpoints by the dLinUCB algorithm. For D-LinUCB, the discount parameter is chosen as $\gamma = 1 - (B_T/(dT))^{2/3}$. For SW-LinUCB, the window length is set to $l = (dT/B_T)^{2/3}$, where $d = 2$ in the experiment. These values are theoretically supposed to minimize the asymptotic regret. For the Dynamic Linear UCB algorithm, the badness is estimated over $\tau = 200$ steps, as in the experimental section of [29].
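As a concrete illustration of these tuning rules (the numerical values are ours: $d = 2$, $T = 6000$, and the variation budget implied by the parameter jumps of the first scenario of Section 4.1 below, $B_T = 2 + \sqrt{2} + 2 \approx 5.41$):

```python
import numpy as np

def tuned_gamma(B_T, d, T):
    """Discount factor suggested by Corollary 1: gamma = 1 - (B_T / (d T))^(2/3)."""
    return 1 - (B_T / (d * T)) ** (2 / 3)

def tuned_window(B_T, d, T):
    """Window length used for SW-LinUCB: l = (d T / B_T)^(2/3)."""
    return int(round((d * T / B_T) ** (2 / 3)))

d, T = 2, 6000
B_T = 2 + np.sqrt(2) + 2            # jumps (1,0) -> (-1,0) -> (0,1) -> (0,-1), see Section 4.1
print(tuned_gamma(B_T, d, T))       # ~0.994
print(tuned_window(B_T, d, T))      # ~170
```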
4.1 Synthetic data in abruptly-changing or slowly-varying scenarios

Figure 1: Performance of the algorithms in the abruptly-changing environment (left) and in the slowly-varying environment (right). The upper plots correspond to the estimated parameter and the lower ones to the accumulated regret, both averaged over N = 100 independent experiments.

In this first experiment, we observe the empirical performance of all algorithms in an abruptly-changing environment of dimension 2 with 3 breakpoints. The number of rounds is set to $T = 6000$. The light blue triangles correspond to the different positions of the true unknown parameter $\theta^\star_t$: before $t = 1000$, $\theta^\star_t = (1, 0)$; for $t \in [1000, 2000]$, $\theta^\star_t = (-1, 0)$; for $t \in [2000, 3000]$, $\theta^\star_t = (0, 1)$; and, finally, for $t > 3000$, $\theta^\star_t = (0, -1)$. This corresponds to a hard problem, as the sequence of parameters is widely spread in the unit ball. Indeed, it forces the algorithms to adapt to big changes, which typically requires a longer adaptation phase. On the other hand, it makes the detection of changes easier, which is an advantage for dLinUCB. In the second half of the experiment (when $t \ge 3000$) there is no change; even so, LinUCB struggles to catch up and suffers linear regret for long periods after the last change-point.

The results of our simulations are shown in the left column of Figure 1. The top row shows a 2-dimensional scatter plot of the estimates $\hat\theta_t$ of the unknown parameter every 1000 steps, averaged over 100 independent experiments. The bottom row corresponds to the regret averaged over 100 independent experiments, with the upper and lower 5% quantiles. In this environment, with 1-subgaussian random noise, dLinUCB struggles to detect the change-points: over the 100 experiments, the first change-point was detected in 95% of the runs, the second was never detected and the third only in 6% of the runs, thus limiting the effectiveness of the dLinUCB approach. When decreasing the variance of the noise, the performance of dLinUCB improves and gets closer to that of the oracle restart strategy LinUCB-OR. It is worth noting that, for both SW-LinUCB and D-LinUCB, the estimator $\hat\theta_t$ adapts to the non-stationarity and is able to follow $\theta^\star_t$ (with some delay), as shown on the scatter plot. Predictably, LinUCB-OR achieves the best performance by restarting exactly whenever a change-point occurs.

The second experiment corresponds to a slowly-changing environment, in which it is easier for LinUCB to keep up with the adaptive policies. Here, the parameter $\theta^\star_t$ starts at $(1, 0)$ and moves continuously counter-clockwise on the unit circle up to the position $(0, 1)$ in 3000 steps. We then have a steady period of 3000 steps. For this sequence of parameters, $B_T = \sum_{t=1}^{T-1}\|\theta^\star_t - \theta^\star_{t+1}\|_2 = 1.57$. The results are reported in the right column of Figure 1. Unsurprisingly, dLinUCB does not detect any change and thus displays the same performance as LinUCB. SW-LinUCB and D-LinUCB behave similarly and are both robust to such an evolution of the regression parameters. The performance of LinUCB-OR is not reported here, as restarting becomes ineffective when the changes are too frequent (here, during the first 3000 time steps, there is a change at every single step). The scatter plot also gives interesting information: $\hat\theta_t$ tracks $\theta^\star_t$ quite effectively for both SW-LinUCB and D-LinUCB, but the two other algorithms lag behind. LinUCB would eventually catch up if the length of the stationary period became larger.
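The value $B_T \approx 1.57$ quoted above can be recovered numerically; the small check below assumes the parameter moves by a constant angle at each of the first 3000 steps, which is one natural reading of the description (the steady period contributes nothing to $B_T$):

```python
import numpy as np

steps = 3000
angles = np.linspace(0, np.pi / 2, steps + 1)                # quarter turn from (1, 0) to (0, 1)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # theta*_t on the unit circle
B_T = np.sum(np.linalg.norm(np.diff(thetas, axis=0), axis=1))
print(B_T)   # ~1.5708, i.e. the arc length pi/2
```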
4.2 Simulation based on a real dataset

Figure 2: Behavior of the different algorithms on large-dimensional data.

D-LinUCB also performs well in high-dimensional space ($d = 50$). For this experiment, a dataset providing a sample of 30 days of Criteo live traffic data [13] was used. It contains banners that were displayed to different users, along with contextual variables, including the information of whether the banner was clicked or not. We kept the categorical variables cat1 to cat9, together with the variable campaign, which is a unique identifier of each campaign. Beforehand, these contexts have been one-hot encoded and 50 of the resulting features have been selected using a singular value decomposition. The parameter $\theta^\star$ is obtained by linear regression. The rewards are then simulated using the regression model with an additional Gaussian noise of variance $\sigma^2 = 0.15$. At each time step, the different algorithms have the choice between two 50-dimensional contexts drawn at random from two separate pools of 10000 contexts corresponding, respectively, to clicked and not-clicked banners. The non-stationarity is created by switching 60% of the coordinates of $\theta^\star$ to $-\theta^\star$ at time $t = 4000$, corresponding to a partial class inversion. The cumulative dynamic regret is then averaged over 100 independent replications.
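As an illustration, here is a sketch of the preprocessing and reward-simulation pipeline just described. The file name and the click column name are assumptions, as are all implementation details; only the overall pipeline (one-hot encoding, SVD down to 50 features, linear regression for $\theta^\star$, sign switch of 60% of the coordinates at $t = 4000$, Gaussian noise of variance 0.15) follows the description above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.read_csv("criteo_attribution_sample.csv")        # hypothetical file name for the Criteo sample of [13]
cat_cols = [f"cat{i}" for i in range(1, 10)] + ["campaign"]

X = OneHotEncoder(handle_unknown="ignore").fit_transform(df[cat_cols])
X50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)   # keep 50 features via SVD
theta = LinearRegression().fit(X50, df["click"]).coef_                 # theta* by linear regression ("click" assumed)

# two pools of 10000 contexts: clicked and not-clicked banners
clicked = X50[df["click"].to_numpy() == 1][:10000]
not_clicked = X50[df["click"].to_numpy() == 0][:10000]

# non-stationarity: at t = 4000, 60% of the coordinates of theta* are switched to their opposite
theta_after = theta.copy()
theta_after[rng.choice(50, size=30, replace=False)] *= -1

sigma = np.sqrt(0.15)

def action_set(t):
    """At each step, one context drawn at random from each pool."""
    return np.stack([clicked[rng.integers(len(clicked))],
                     not_clicked[rng.integers(len(not_clicked))]])

def simulated_reward(action, t):
    """Reward from the regression model plus Gaussian noise of variance 0.15."""
    cur = theta if t < 4000 else theta_after
    return action @ cur + sigma * rng.normal()
```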
The results are shown in Figure 2. In the first stationary period, LinUCB and dLinUCB perform better than the adaptive policies by using all the available data, whereas the adaptive policies only use the most recent events. After the breakpoint, LinUCB suffers a large regret, as the algorithm fails to adapt to the new environment. In this experiment, dLinUCB does not detect the change-point systematically and performs similarly to LinUCB on average; it can still outperform the adaptive policies from time to time, when the breakpoint is detected, as can be seen with the 5% quantile. D-LinUCB and SW-LinUCB adapt more quickly to the change-point and perform significantly better than the non-adaptive policies after the breakpoint. Of course, the oracle policy LinUCB-OR is the best-performing policy. The take-away message is that there is no free lunch: in a stationary period, by using only the most recent events, SW-LinUCB and D-LinUCB do not perform as well as a policy that uses all the available information. Nevertheless, after a breakpoint, the recovery is much faster with the adaptive policies.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[3] P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning 14, 2018.
[4] O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.
[5] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
[6] O. Besbes, Y. Gur, and A. Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Available at SSRN 2436629, 2018.
[7] L. Besson and E. Kaufmann. The generalized likelihood ratio test meets klUCB: an improved algorithm for piece-wise non-stationary bandits. arXiv preprint arXiv:1902.01575, 2019.
[8] K. Bleakley and J.-P. Vert. The group fused Lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199, 2011.
[9] Y. Cao, W. Zheng, B. Kveton, and Y. Xie. Nearly optimal adaptive procedure for piecewise-stationary bandit: a change-point detection approach. arXiv preprint arXiv:1802.03692, 2018.
[10] Y. Chen, C.-W. Lee, H. Luo, and C.-Y. Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. arXiv preprint arXiv:1902.00980, 2019.
[11] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024, 2018.
[12] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Hedging the drift: Learning to optimize under non-stationarity. arXiv preprint arXiv:1903.01461, 2019.
[13] E. Diemert, J. Meynet, P. Galland, and D. Lefortier. Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the AdKDD and TargetAd Workshop, KDD, Halifax, NS, Canada, August 14, 2017. ACM, 2017.
[14] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.
[15] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
[16] N. Gupta, O.-C. Granmo, and A. Agrawala. Thompson sampling for dynamic multi-armed bandits. In 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 1. IEEE, 2011.
[17] N. B. Keskin and A. Zeevi. Chasing demand: Learning and earning in a changing environment. Mathematics of Operations Research, 42(2):277–307, 2017.
[18] J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. arXiv preprint arXiv:1801.09667, 2018.
[19] L. Kocsis and C. Szepesvári. Discounted UCB. In 2nd PASCAL Challenges Workshop, 2006.
[20] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.
[21] N. Levine, K. Crammer, and S. Mannor. Rotting bandits. In Advances in Neural Information Processing Systems, pages 3074–3083, 2017.
[22] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
[23] H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. arXiv preprint arXiv:1708.01799, 2017.
[24] Y. Mintz, A. Aswani, P. Kaminsky, E. Flowers, and Y. Fukuoka. Non-stationary bandits with habituation and recovery dynamics. arXiv preprint arXiv:1707.08423, 2017.
[25] V. H. Peña, T. L. Lai, and Q.-M. Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
[26] V. Raj and S. Kalyani. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727, 2017.
[27] J. Seznec, A. Locatelli, A. Carpentier, A. Lazaric, and M. Valko. Rotting bandits are no harder than stochastic ones. arXiv preprint arXiv:1811.11043, 2018.
[28] L. Wei and V. Srivastava. On abruptly-changing and slowly-varying multiarmed bandit problems. In 2018 Annual American Control Conference (ACC), pages 6291–6296. IEEE, 2018.
[29] Q. Wu, N. Iyer, and H. Wang. Learning contextual bandits in a non-stationary environment. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, pages 495–504, New York, NY, USA, 2018. ACM.
[30] J. Y. Yu and S. Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1177–1184. ACM, 2009.