# Sliding-Window Thompson Sampling for Non-Stationary Settings

Journal of Artificial Intelligence Research 68 (2020) 311-364. Submitted 03/2019; published 05/2020.

Francesco Trovò francesco1.trovo@polimi.it
Stefano Paladino stefano.paladino@polimi.it
Marcello Restelli marcello.restelli@polimi.it
Nicola Gatti nicola.gatti@polimi.it
Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria, Piazza Leonardo da Vinci 32, Milano, 20133, Italy

Abstract

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings, which are very common in real-world applications, have received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both forms of non-stationarity are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are taken separately, as previously studied in the literature.

1. Introduction

The Multi-Armed Bandit (MAB) setting, introduced by Auer, Cesa-Bianchi, and Fischer (2002), models the sequential decision-making problem, addressing the well-known exploration-exploitation trade-off. In this setting, at each round of a finite time horizon, the learner selects an action from a finite set of actions (also known as arms), and she only observes the reward of the chosen action. The goal of the learner is to play the optimal arm, maximizing the expected reward while minimizing the loss incurred during the learning process. This loss is usually referred to as regret, defined as the difference between the expected reward collected by a clairvoyant algorithm, selecting the optimal arm over the whole time horizon, and the expected reward achieved by the MAB algorithm in use. We focus on the Non-Stationary stochastic MAB (NS-MAB) setting, where, differently from the classical stochastic MAB setting, the expected reward of each arm may change over time, thus potentially changing the optimal arm.

Non-stationary behaviours are common in real-world applications. We recall that the original motivation for MAB settings argued by Thompson (1933) was the study of clinical trials, where different treatments are available, and a learner aims at selecting the treatment to use for the next patient. Although the clinical trial scenario was assumed stationary over time in its original formulation, it may not be in the real world.
Indeed, in a scenario in which the trial takes place over periods, the disease to defeat may mutate. Thus, as showed by Gorre, Mohammed, Ellwood, Hsu, Paquette, Rao, and Sawyers (2001), a treatment that initially was optimal might subsequently slowly decrease its effectiveness, and another treatment, which initially was ineffective, might become the best option. Similarly, non-stationarity plays a prominent role in Internet economics. For instance, in optimal pricing problems, a non-stationarity may be due to a new product invading the market. For instance, Eliashberg and Jeuland (1986) show that the price maximizing the expected profit of a product already present in the market may change abruptly when a newer product enters. Furthermore, in untruthful auction mechanisms for search advertising where advertisers try to learn the best bid to obtain their ad displayed in some profitable slot, non-stationarity may be due to the arrival and departure of advertisers, which change the profitability of the slots, as studied by Kitts and Leblanc (2004).1 Finally, Lai, El Gamal, Jiang, and Poor (2011) study a cognitive medium radio access problem, in which a user aims to opportunistically exploit the availability of an empty channel in a multiple channel system. The reward expresses the binary/fractional availability of the channel, whose distribution is unknown to the users. In particular, the probability that a given channel is available to a user changes over time as the other users behavior changes. As stressed by Hartland, Gelly, Baskiotis, Teytaud, and Sebag (2006), general-purpose classical MAB algorithms are not suitable when tackling NS-MAB settings, their regret bounds not holding anymore. In non-stationary settings, two main approaches are studied: passive, e.g., the works by Combes and Proutiere (2014) and Garivier and Moulines (2008), and active, e.g., the works by Liu, Lee, and Shroff(2018), Besson and Kaufmann (2019), Auer, Gajane, and Ortner (2019). The former ones can deal with non-stationarity without detecting explicitly that a change of the reward distributions occurred, while the latter ones exploit techniques coming from the change detection field to deal with reward distributions varying over time. In our work, we focus on passive approaches, since they require fewer assumptions on the change characteristics than those required by active approaches. For instance, as argued by Liu et al. (2018), active approaches commonly require the knowledge of the minimum magnitude of the change, to avoid excessively long delays in its detection. However, in practice, this knowledge is rarely available to the learner, making active approaches less appealing for real-world applications. Among the techniques following the passive approach, some frequentist algorithms have been proposed showing order optimal theoretical guarantees. We mention the works by Besbes, Gur, and Zeevi (2014), Combes and Proutiere (2014), Garivier and Moulines (2008), Kocsis and Szepesv ari (2006), and Wei, Hong, and Lu (2016). To the best of our knowledge, 1. Generalized Second Price (GSP) is an untruthful auction mechanism used by Google and BING. Examples of problems on the publisher side are discussed by Farina and Gatti (2017) and Gatti, Lazaric, Rocco, and Trov o (2015), while examples of problems on the advertiser side are discussed by Nuara, Trov o, Gatti, and Restelli (2018), Gasparini, Nuara, Trov o, Gatti, and Restelli (2018), and Nuara, Sosio, Trov o, Zaccardi, Gatti, and Restelli (2019). 
Sliding-Window Thompson Sampling for Non-Stationary Settings all the known Bayesian methods are only based on heuristics, e.g., the works by Granmo and Berg (2010) and Mellor and Shapiro (2013), while we recall that, in stationary settings, some Bayesian algorithms with theoretical guarantees are known, e.g., Thompson Sampling (TS) by Thompson (1933), and these algorithms empirically outperform frequentist algorithms in most scenarios. Some examples are also discussed by Chapelle and Li (2011), Granmo (2010), Kaufmann, Korda, and Munos (2012b), May, Korda, Lee, and Leslie (2012), Paladino, Trov o, Restelli, and Gatti (2017). Notably, heuristic algorithms might outperform algorithms with theoretical guarantees in specific settings, but, on the other side, they may provide an arbitrarily large regret in others. In the present paper, we provide a Bayesian MAB algorithm for non-stationary settings with theoretical guarantees in terms of dynamic pseudo-regret. Remarkably, our algorithm tackles in a unified fashion two forms of non-stationarity the abruptly changing one and the smoothly changing one that have been studied separately so far in the literature. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. More precisely, our original contributions are as follows: we design a novel Bayesian MAB algorithm, named Sliding-Window Thompson Sampling (SW-TS), working when two different forms of non-stationarity coexist; we derive some an bound over the dynamic pseudo-regret for SW-TS of order O(N 1+α 2 ) when abruptly changing non-stationarity is present, and an upper bound of order O(Nβ) when smoothly changing non-stationarity is present.2 Parameters α and β provide a measure of the number of abrupt changes and the number of rounds the expected rewards of the arms are close enough, respectively, over a horizon of N rounds. Finally, we derive a bound on the dynamic pseudo-regret in the setting in which both the non-stationarity forms are present; we empirically show the superior performance of SW-TS over state-of-the-art frequentist MAB passive algorithms even when the forms of non-stationarity are taken separately. Finally, we provide a sensitivity analysis of SW-TS for the parameters α and β. 2. Related Works Non-stationary MAB settings have received attention in the scientific community only in the last few years. When rewards may change arbitrarily over time, the problem of NSMAB is intractable, i.e., one can only derive trivial bounds on the dynamic pseudo-regret. For this reason, the literature mainly focuses on non-stationary MAB settings with some specific structure in the attempt to design algorithms with better regret bounds. Garivier and Moulines (2008) study abruptly changing MAB settings and present the SW-UCB algorithm achieving an O( N) bound on the dynamic pseudo-regret. The same setting is tackled by Allesiardo, F eraud, and Maillard (2017), who present the SER4 algorithm, which empirically outperforms the SW-UCB algorithm. Combes and Proutiere (2014) present SW-KL-UCB, which is a policy working in a smoothly changing MAB setting. In this 2. With the notation O( ) we disregard logarithmic terms in the computation of the order. 
non-stationary setting, the authors provide a bound of O(σ^{1/4} N) on the dynamic pseudo-regret, where σ is the Lipschitz constant of the process. This result implies that the per-round pseudo-regret of SW-KL-UCB vanishes as the speed at which the expected rewards evolve decreases to 0. Besbes et al. (2014) study a non-stationary MAB setting under the assumption that the total variation of the expected rewards over the time horizon is bounded by an a priori fixed budget. They provide a distribution-independent lower bound. Furthermore, they propose the REXP3 algorithm, a near-optimal frequentist algorithm with a dynamic pseudo-regret of order O(N^{2/3}). Slivkins and Upfal (2008) focus on the dynamic bandit setting, a special case of the restless bandits, in which the reward distribution of the arms changes at each round according to Brownian motion. The authors propose algorithms that minimize the per-round pseudo-regret over an infinite time horizon. We also mention the work by Trovò, Paladino, Restelli, and Gatti (2018), who provide some bandit algorithms for dynamic pricing in non-stationary settings. Finally, the problem of non-stationarity with bounded per-round variation is tackled using contextual bandit techniques by Slivkins (2011), who designs the Contextual Zooming algorithm, and by Luo, Wei, Agarwal, and Langford (2018), who use a variant of the classic EXP4 algorithm.

The MAB literature also provides some works that exploit MAB techniques as heuristics in application scenarios without providing theoretical guarantees. To cite a few, Granmo and Berg (2010) propose a Bayesian algorithm for the specific case of non-stationary bandit settings with normally distributed rewards. Mellor and Shapiro (2013) analyze an NS-MAB where the probabilities according to which the expected values of the arms change are a priori fixed, and propose the CTS algorithm, which combines Thompson Sampling with a change-point detection mechanism. St-Pierre and Jialin (2014) present an evolutionary algorithm to deal with generic non-stationary environments, which empirically outperforms classical solutions.

Other settings, closely related to the MAB one, are also studied in the presence of non-stationarity. For instance, Wei et al. (2016) present a study of the regret in the case of non-stationary stochastic experts, providing an upper bound of order O(N^{1/3}) under the assumptions of a constant number of switches and limited variance of the expected rewards over time.

3. Problem Formulation

We model our problem as a stochastic NS-MAB setting, in which, at each round t over a finite horizon N, the learner selects an arm a_{i_t} among a finite set of K arms A := {a_1, . . . , a_K}. At each round t, the learner observes a realization of the reward x_{i_t,t} obtained from the chosen arm a_{i_t}. The reward for each arm a_i at round t is modeled by a sequence of independent random variables X_{i,t} from a distribution unknown to the learner. We denote by μ_{i,t} := E[X_{i,t}] the expected value of the reward of arm a_i at round t. As is customary in the MAB literature, here we consider Bernoulli distributed rewards, i.e., X_{i,t} ∼ Be(μ_{i,t}).³ A policy U is a function U(h_t) = a_{i_t} that chooses the arm a_{i_t} to play at round t according to the history h_t, defined as the sequence of past plays and obtained rewards.

3. The extension to other bounded distributions is straightforward. Bernoulli variables are considered here for the sake of simplicity.
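To make the interaction protocol above concrete, the following is a minimal Python sketch of a non-stationary Bernoulli bandit environment and of the policy interaction loop; the class and function names are illustrative and not part of the paper's notation.

```python
import numpy as np

class NSBernoulliBandit:
    """Non-stationary Bernoulli bandit: mu[t, i] is the expected reward of
    arm a_i at round t (unknown to the learner)."""

    def __init__(self, mu, rng=None):
        self.mu = np.asarray(mu, dtype=float)   # shape (N, K)
        self.N, self.K = self.mu.shape
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, t, i):
        # Observed reward x_{i_t,t}, a draw from Be(mu_{i,t}).
        return float(self.rng.random() < self.mu[t, i])


def run(policy, env):
    """Interaction loop: at each round the policy U maps the history h_t to an arm."""
    history, regret = [], 0.0
    for t in range(env.N):
        i = policy.select_arm(t, history)            # i_t = U(h_t)
        x = env.pull(t, i)
        history.append((i, x))
        policy.update(t, i, x)
        regret += env.mu[t].max() - env.mu[t, i]     # per-round dynamic pseudo-regret
    return regret
```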
The goal of the learner is to design a policy U that minimizes the loss w.r.t. the optimal decision in terms of reward. This loss, usually referred to as cumulative dynamic pseudo-regret, is defined as:

R_N(U) := E[ Σ_{t=1}^{N} ( μ_{i*_t,t} − μ_{i_t,t} ) ],    (1)

where μ_{i*_t,t} = max_{i ∈ {1,...,K}} μ_{i,t} is the expected reward of the optimal arm a_{i*_t} at round t, and E[·] is the expectation w.r.t. the stochasticity of the policy. Differently from the classical (stationary) stochastic MAB setting, where an arm (unique unless degenerate) is optimal over the whole time horizon (a_{i*_t} = a_{i*} for all t), in the NS-MAB setting the arms that are optimal might change over time. We recall that, when the expected values of the arms can change without any restriction, the NS-MAB setting admits only trivial bounds on the dynamic pseudo-regret R_N(U). One focus of MAB research is the design of algorithms that guarantee sublinear pseudo-regret, i.e., R_N(U) = O(N^ω) with 0 ≤ ω < 1. When this is not possible, an alternative performance metric is the average pseudo-regret, defined as:

AR_N(U) := lim sup_{N→∞} R_N(U) / N.

In what follows, we will discuss two different settings where the evolution over time of the reward distributions of the arms is constrained to change according to specific schemes.

3.1 Abruptly Changing Setting

The Abruptly Changing MAB (AC-MAB) setting is introduced, for the first time, by Garivier and Moulines (2008). In this scenario, the reward distributions are constant during sequences of rounds, namely phases, and change at unknown rounds, namely breakpoints. Thus, the expected value μ_{i,t} of the reward of arm a_i at round t only changes at the beginning of each phase and, therefore, the best arm a_{i*_t} remains constant during the phase. The change of the expected rewards at the breakpoints may be arbitrary and is unknown. Let us define a breakpoint as a round b ∈ {1, . . . , N} s.t. ∃ i : μ_{i,b−1} ≠ μ_{i,b}, i.e., a round b in which the expected reward of at least one arm a_i changes w.r.t. the one at round b − 1. In an AC-MAB setting with horizon N, we have a set of breakpoints B := {b_1, . . . , b_{B_N}} of cardinality B_N (for the sake of notation we define b_0 = 1), which determines a set of phases {F_1, . . . , F_{B_N}}, where each phase is the set of rounds between two consecutive breakpoints, namely, F_φ = {t ∈ {1, . . . , N} s.t. b_{φ−1} ≤ t < b_φ}. In order to have sublinear dynamic pseudo-regret, we upper bound the number of breakpoints B_N over the time horizon N. We do that by making the following assumption.

Assumption 1. There exists α ∈ [0, 1], independent of N, s.t. the number of breakpoints B_N is of order O(N^α). That is, there exists B ∈ R^+ such that: B_N ≤ B N^α.

During phase F_φ of an AC-MAB setting, with abuse of notation, we denote with μ_{i,φ} the expected value of the reward of arm a_i, where a_{i*_φ} is the optimal arm and μ_{i*,φ} is the corresponding expected reward. By defining the length of a phase as N_φ := |F_φ|, a more compact formulation of the dynamic pseudo-regret of a generic policy U over an AC-MAB is available:

R_N(U) = Σ_{φ=1}^{B_N} Σ_{i=1}^{K} Δ_{i,φ} E[T_i(F_φ)],

where T_i(F_φ) = Σ_{t ∈ F_φ} 1{i_t = i} is the number of times arm a_i has been pulled during phase F_φ, Δ_{i,φ} := μ_{i*,φ} − μ_{i,φ} is the difference between the expected reward μ_{i*,φ} of the optimal arm a_{i*_φ} of phase F_φ and the expected reward μ_{i,φ} of arm a_i, and E[·] is the expectation w.r.t.
the stochasticity of the policy.⁴ This alternative formulation highlights that the dynamic pseudo-regret in this setting can be decomposed over the different phases such that, in each phase, the dynamic pseudo-regret takes the form of the classic expected pseudo-regret.

3.2 Smoothly Changing Setting

The Smoothly Changing MAB (SC-MAB) setting we study is similar to the one studied by Combes and Proutiere (2014), where the expected value μ_{i,t} of each arm varies by no more than σ at each round, and the evolution of the dynamics is unknown to the learner. More formally, we make the following Lipschitz assumption.

Assumption 2. There exists σ > 0 such that |μ_{i,t} − μ_{i,t'}| ≤ σ|t − t'| for all t, t' ∈ {1, . . . , N} and all i ∈ {1, . . . , K}.

Furthermore, in such a setting, a suboptimal arm a_i might be arbitrarily close to the optimal one a_{i*_t} in terms of expected reward. Identifying the best arm among those with similar expected rewards is known to be hard, as shown by Lai and Robbins (1985). Indeed, it is known that a learner needs a number of rounds of the order of 1/(μ_{i*_t,t} − μ_{i,t})² to distinguish arm a_i from the optimal one. Thus, to prevent the dynamic pseudo-regret from depending linearly on the horizon N, we also assume that the separation between the expected rewards of two arms is arbitrarily small only for a limited number of rounds. More formally, for 0 < ε < 1, we define:

F_{ε,N} := {t ∈ {1, . . . , N} s.t. ∃ i ≠ j, |μ_{i,t} − μ_{j,t}| < ε}

and we assume the following.

Assumption 3. There exist β ∈ [0, 1], F ∈ R^+, and ε_0 ∈ (0, 1), all independent of N, s.t. for all ε < ε_0 it holds: |F_{ε,N}| ≤ F N^β.

We remark that the assumption used by Combes and Proutiere (2014) is a particular case of the above assumption with β = 1.

4. From now on, we denote with |·| the cardinality operator and with 1{·} the indicator function of a generic event.

Table 1: Summary of the notation used in the paper.

Parameter | Description
A = {a_1, . . . , a_K} | Set of the K available arms a_i.
X_{i,t} | Random variable corresponding to the reward of arm a_i at round t.
μ_{i,t} | Expected value of arm a_i at round t.
a_{i*_t} | Optimal arm at round t, i.e., the one providing the largest expected reward μ_{i*_t,t}.
U(h_t) | Policy selecting an arm in A for round t, given history h_t.
B := {b_1, . . . , b_{B_N}} | Set of the B_N breakpoints b_φ.
F_φ | Set of rounds between two consecutive breakpoints b_{φ−1} and b_φ.
T_i(F_φ) | Number of times arm a_i is pulled during phase F_φ.
Δ_{i,φ} := μ_{i*,φ} − μ_{i,φ} | Difference between the expected reward μ_{i*,φ} of the optimal arm a_{i*_φ} of phase F_φ and the expected reward μ_{i,φ} of arm a_i.
F_{ε,N} | Set of rounds, over a time horizon of N, in which at least two arms have expected rewards closer than ε.

3.3 Abruptly and Smoothly Changing Setting

Finally, in a quite straightforward way, it is possible to also study a scenario, from now on addressed as the Abruptly and Smoothly Changing MAB (ASC-MAB) setting, in which the two forms of non-stationarity introduced above (abrupt changes and smooth ones) simultaneously occur over a finite time period. In this setting, in addition to Assumptions 1 and 3, we also require the following assumption.

Assumption 4. There exist σ > 0 and a set of phases {F_1, . . . , F_{B_N}} such that, for each F_φ with φ ∈ {1, . . . , B_N}, it holds: |μ_{i,t} − μ_{i,t'}| ≤ σ|t − t'| for all i ∈ {1, . . . , K} and for all t, t' ∈ F_φ, i.e., the expected value of the reward function is Lipschitz continuous w.r.t. the rounds belonging to a single phase.

This newly defined assumption is the natural extension of Assumption 2 to this new setting, in which the smoothness assumption may be violated at the breakpoints.⁵

5. A summary of the notation defined in the previous section and used in the following sections is provided in Table 1.
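To close the section, a small numerical check of Assumption 3 (and, with obvious modifications, of Assumptions 2 and 4) can be implemented as follows for simulated settings, where the trajectory μ_{i,t} is known; the function name is illustrative.

```python
import numpy as np

def count_close_rounds(mu, eps):
    """|F_{eps,N}|: number of rounds t at which some pair of arms i != j
    has |mu_{i,t} - mu_{j,t}| < eps."""
    mu = np.asarray(mu, dtype=float)            # shape (N, K)
    sorted_mu = np.sort(mu, axis=1)
    # A pair closer than eps exists iff some adjacent pair (after sorting) is.
    min_gap = np.diff(sorted_mu, axis=1).min(axis=1)
    return int(np.sum(min_gap < eps))

# Assumption 3 requires |F_{eps,N}| <= F * N**beta for every eps < eps_0, i.e.,
# count_close_rounds(mu, eps) <= F * mu.shape[0] ** beta for candidate F and beta.
```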
4. Sliding-Window Thompson Sampling Algorithm

We propose an algorithm that exploits a Sliding-Window (SW) approach to forget past information during learning, since such information could bias the estimation process. More precisely, we use a sliding window of length τ ∈ N such that the algorithm, at every round t, takes into account only the rewards obtained in the last τ rounds. Based on these realizations, we apply a TS-based algorithm to decide which arm to pull in the next round. In particular, the expected value of each arm is coupled with a posterior distribution from which we draw samples, and the arm with the highest sample is the next arm to play. At first, we describe the algorithm and provide theoretical results about the finite-time analysis of its dynamic pseudo-regret for Bernoulli distributed rewards, separately for the AC-MAB and SC-MAB settings. After that, we study the ASC-MAB setting, in which both forms of non-stationarity (abrupt and smooth changes) are present at the same time.⁶

6. We report the proofs of our theoretical results in Appendix B, Appendix C, and Appendix D, respectively.

Algorithm 1 SW-TS
1: Input: {π_{i,0}}_i prior distributions, N time horizon, A arm set, τ sliding window size
2: for t ∈ {1, . . . , N} do
3:   for i ∈ {1, . . . , K} do
4:     Compute π_{i,t} = Beta(S_{i,t,τ} + 1, T_{i,t,τ} − S_{i,t,τ} + 1)
5:     Sample ϑ_{i,t} from π_{i,t}
6:   Play arm a_{i_t} s.t. i_t = arg max_{i ∈ {1,...,K}} ϑ_{i,t} and observe x_{i_t,t}

4.1 The Algorithm

The pseudocode of SW-TS for Bernoulli distributed rewards is presented in Algorithm 1. Assume that, for each arm a_i, we have a prior π_{i,0} on the expected reward μ_{i,t}, and let π_{i,t} be the posterior distribution for the parameter μ_{i,t} after t rounds. If we do not have further information on the expected value of the arm, we use an uninformative prior, i.e., π_{i,0} := Beta(1, 1), where we denote with Beta(a, b) the Beta distribution with parameters a and b. The posterior of the expected reward of arm a_i at round t is π_{i,t} := Beta(S_{i,t,τ} + 1, T_{i,t,τ} − S_{i,t,τ} + 1), where T_{i,t,τ} := Σ_{s=max{t−τ+1,1}}^{t} 1{i_s = i} is the number of times arm a_i has been selected in the last min{t, τ} rounds, and S_{i,t,τ} := Σ_{s=max{t−τ+1,1}}^{t} x_{i,s} 1{i_s = i} is the cumulative reward collected by arm a_i in the last min{t, τ} rounds.⁷ Once the distributions π_{i,t} have been computed (Line 4), we draw from each of them a random sample ϑ_{i,t}, also known as a Thompson sample (Line 5). Finally, we select the arm a_i with the highest sample ϑ_{i,t} for this round (Line 6). The extension of the SW-TS algorithm to the case where the rewards X_{i,t} come from other bounded distributions is similar to what is proposed for the classical TS algorithm, as shown by Chapelle and Li (2011) and Agrawal and Goyal (2012), given that conjugate prior/posterior distributions for the expected rewards of the arms are available.

7. To avoid an excessively cumbersome notation, we omit the subscript τ in all the terms depending on the choice of τ, e.g., π_{i,t}.

4.2 Finite-Time Analysis in the Abruptly Changing Setting

We provide a finite-time analysis of the dynamic pseudo-regret achieved by the SW-TS algorithm in the AC-MAB setting introduced in Section 3.1.
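Before stating the regret guarantees, the following is a compact Python sketch of Algorithm 1 for Bernoulli rewards. The sliding window is stored as a deque; the Beta(S + 1, T − S + 1) posterior and the uninformative Beta(1, 1) prior follow the description above, while the class interface (select_arm/update) is an assumption made for illustration.

```python
from collections import deque
import numpy as np

class SWThompsonSampling:
    """Sliding-Window Thompson Sampling (Algorithm 1) for Bernoulli rewards."""

    def __init__(self, K, tau, rng=None):
        self.K = K                          # number of arms
        self.window = deque(maxlen=tau)     # last tau pairs (i_s, x_{i_s,s})
        self.rng = rng if rng is not None else np.random.default_rng()

    def select_arm(self, t=None, history=None):
        # T_{i,t,tau}: pulls of arm i within the window; S_{i,t,tau}: successes.
        T = np.zeros(self.K)
        S = np.zeros(self.K)
        for i, x in self.window:
            T[i] += 1
            S[i] += x
        # pi_{i,t} = Beta(S + 1, T - S + 1); with an empty window this is Beta(1, 1).
        theta = self.rng.beta(S + 1, T - S + 1)      # Thompson samples theta_{i,t}
        return int(np.argmax(theta))

    def update(self, t, i, x):
        # Appending to a full deque discards the observation older than tau rounds.
        self.window.append((i, x))
```

Combined with the environment sketch of Section 3, run(SWThompsonSampling(K, tau), NSBernoulliBandit(mu)) gives the realized dynamic pseudo-regret of a single run; averaging over independent runs gives an estimate of R_N(U).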
Sliding-Window Thompson Sampling for Non-Stationary Settings Theorem 1. If the SW-TS policy is run over an AC-MAB setting with Xi,t Be(µi,t), for every τ N, the dynamic pseudo-regret after N rounds is at most: 2 i,φ + log τ + 5 + 19 log τ where B and α are defined in Assumption 1 and i,φ := µi ,φ µi,φ is the difference between the expected reward µi ,φ of the best arm ai φ and the expected reward µi,φ of arm ai during phase Fφ. By defining: i := min φ {1,...,BN} i,φ1{i = i φ}, for all i {1, . . . , K}, i.e., the minimum over all the phases Fφ of the difference of the expected rewards i,φ, the dynamic pseudo-regret can be written as: RN(U) τKBNα + N i + log τ + 5 + 19 log τ From this general result we derive two corollaries for the cases in which we have a number of breakpoints BN which is either sublinear or linear w.r.t. the time horizon N. Corollary 1. If the SW-TS policy is run over an AC-MAB setting in which Assumption 1 holds with α [0, 1) and using a sliding window τ N 1 α 2 , the dynamic pseudo-regret is: RN(U) = O(N 1+α Notice that the size of the sliding window prescribed by Corollary 1 decreases as the parameter α increases, meaning that, in settings in which we have a large number of breakpoints, we should use a short sliding window. In particular, if Assumption 1 holds for α = 0, meaning that the number of breakpoints is constant w.r.t. the time horizon, and we use a sliding window τ N, the order of the dynamic pseudo-regret is O( N). Interestingly, even in the basic setting with a single breakpoint (α = 0 and BN = 1) and two arms, the sliding window approach outperforms classical MAB algorithms for stationary settings, e.g., UCB1. Indeed, MAB algorithms for stationary settings would suffer from Ω( N) dynamic pseudo-regret in the second phase, in addition to the regret due to the first phase.8 Conversely, if Assumption 1 holds with α = 1, and consequently with B < 1, the above bound would provide a linear upper bound on the dynamic pseudo-regret over the time horizon. In this case, the interest is in bounding the average pseudo-regret, as stated in the following corollary. Corollary 2. If the SW-TS policy is run over an AC-MAB setting in which Assumption 1 holds with α = 1, 0 < B < 1, and using a sliding window τ q B) B , the average pseudo-regret is: ARN(U) = O( p 8. For instance, be given two arms a1, a2 and a breakpoint b1 = N/2 in which the expected values of the arms switch (e.g., a1 is better than a2 before N/2 and worse after). After N/2, O( N) pulls of a1 are required before the upper confidence bound of a2 is higher than that one of a1. It is easy to see that, in this situation, the algorithm suffers from a dynamic pseudo-regret of at least Ω( Trov o, Paladino, Restelli, & Gatti Finally, we remark that the bound provided in Theorem 1 has the order of O(log N) when we have a single phase and τ = T, i.e., in such a setting we have Nφ = T, BN = 0 and Assumption 1 holds when B = 0. Such a result is consistent with the upper bounds on the expected pseudo-regret of Thompson Sampling algorithm provided by Kaufmann et al. (2012b) and Agrawal and Goyal (2012). 4.3 Finite-Time Analysis in the Smoothly Changing Setting We provide a finite-time analysis of the dynamic pseudo-regret achieved by the SW-TS algorithm, in the SC-MAB setting introduced in Section 3.2. Theorem 2. If the SW-TS policy is run over a SC-MAB setting with Xi,t Be(µi,t), Lipschitz constant σ > 0 and there exists 0 (0, 1) as in Assumption 3, for any τ N s.t. 
2στ < 0, the dynamic pseudo-regret after N rounds is at most: RN(U) F Nβ + NK 52 log τ ( 2στ)2 + log τ + 5 + 19 log τ + K 52 log τ ( 2στ)2 + 3 + 19 log τ The dependence of the dynamic pseudo-regret on the factor N τ is similar to the result obtained by Combes and Proutiere (2014) and Garivier and Moulines (2008) for frequentist algorithms. In what follows, depending on the value of the parameter β in Assumption 3, we can show that either the dynamic pseudo-regret of the SW-TS algorithm is sublinear in the time horizon N or it is linear. Depending on the characteristic of the specific SC-MAB setting, we can provide different results in terms of dynamic pseudo-regret once we fix the sliding window size. More specifically, one of the parameter characterizing the dynamic pseudo-regret is P, i.e., the number of times that the expected rewards of a couple of arms switch over the time horizon, or formally: P = |{t {1, . . . , N 1} s.t. i = j(µi,t µj,t)(µi,t+1 µj,t+1) < 0}|. If we do not have any switch between the expected rewards of the arms, we can ensure the following. Corollary 3. If the SW-TS policy is run over an SC-MAB setting with no switches between expected rewards of the arms (P = 0), in which Assumption 3 holds with β [1 log N 2σ , 1], and using a sliding window τ := N1 β, for each 0 the dynamic pseudo-regret is at most: RN(U) = O(Nβ). Notice that, as the parameter β increases, the sliding window size prescribed by Corollary 3 reduces. Intuitively, this is due to the fact that settings with a large value of β present a large number of rounds in which two arms are hard to be distinguished, i.e., the difference between their expected rewards is smaller than . This fact, in its turn, also shortens the phases in which the bandit algorithm is capable of properly operating and implies the use of a shorter sliding window. In particular, it is easy to prove that, if Assumption 3 holds Sliding-Window Thompson Sampling for Non-Stationary Settings with β = 0, meaning that we have σ 2N , we would have an SC-MAB setting in which the expected rewards do not switch over time. Therefore, using the prescribed time window of τ = N provides a logarithmic dynamic pseudo-regret. Conversely, if we have at least one switch between the expected reward of the arms, the result on the dynamic pseudo-regret upper bound requires a further condition on the values of β to hold. More formally, we have the following: Corollary 4. If the SW-TS policy is run over an SC-MAB setting with P N switches between expected rewards of the arms, in which Assumption 3 holds with: where max{a, b} denotes the maximum between a and b, and using a sliding window τ := N1 β, F is defined in Assumption 3, for each 0 the dynamic pseudo-regret is at most: RN(U) = O(Nβ). If Assumption 1 holds with β = 1, the above corollaries provide an upper bound on the dynamic pseudo-regret, which is linear in the time horizon N, similarly to what provided by Combes and Proutiere (2014). Nonetheless, we can bound the average pseudo-regret as follows. Corollary 5. If the SW-TS policy is run over an SC-MAB setting in which Assumption 3 holds with β = 1 and using a sliding window τ σ 3 4 , the average pseudo-regret is: ARN(U) = O(σ 1 2 ). Finally, if we have a stationary environment, i.e., σ = 0, we are able to find s.t. |F ,N| = 0 and, therefore, we have that β = 0. Choosing a sliding window τ = N, Theorem 2 provides an upper bound over the expected pseudo-regret of order O(log N), as it happens in the classical MAB literature, as discussed by Auer et al. 
(2002) and Auer et al. (2012b). 4.4 Finite-Time Analysis in the Abruptly and Smoothly Changing Setting In this section, we provide theoretical guarantees of SW-TS in both the AC-MAB and SC-MAB settings. The main result follows. Theorem 3. If the SW-TS policy is run over an ASC-MAB setting with Xi,t Be(µi,t), Lipschitz constant σ > 0 as in Assumption 4 and there exists 0 (0, 1) as in Assumption 3, for any τ N s.t. 2στ < 0, the dynamic pseudo-regret after N rounds is at most: RN(U) F Nβ + τBNα + NK 52 log τ ( 2στ)2 + log τ + 5 + 19 log τ where B and α are defined in Assumption 1 and F and β are defined in Assumption 3. Trov o, Paladino, Restelli, & Gatti Similarly to what has been done in the other two scenarios, we provide the order of the derived upper bounds when the sliding window length τ has been set properly, depending on the values of the parameters α and β and the number of times the expected rewards switch P. Corollary 6. If the SW-TS policy is run over an ASC-MAB setting with no switches between expected rewards of the arms (P = 0) and Assumption 1 and Assumption 3 hold with α (1 2 log N 2σ , 1) and β (0, 1), respectively, for each 0, using a sliding window of τ := N 1 α 2 , the dynamic pseudo-regret is at most: 2 O Nβ if β > 1+α Conversely, in the case we have some switches between the expected rewards over time, we require further conditions on the parameters α and β. Corollary 7. If the SW-TS policy is run over an ASC-MAB setting with P N switches between expected rewards of the arms, and Assumption 1 and Assumption 3 hold with α (1 2 log N 2σ , 1) and β (0, 1), respectively, for each 0, using a sliding window of τ := N 1 α 2 , if β + α P holds, the dynamic pseudo-regret is at most: 2 O Nβ if β > 1+α Intuitively, the results in Corollaries 6 and 7 state that, depending on which form of nonstationarity dominates, we have two different orders of dynamic pseudo-regret dependent on either α or β. Similarly to the other two settings (AC-MAB and SC-MAB), if α = 1 and β = 1, Theorem 3 provides an upper bound over the dynamic pseudo-regret over the time horizon N. Conversely, it is possible to bound the average pseudo-regret as follows: Corollary 8. If the SW-TS policy is run over an SC-MAB setting in which Assumption 1 holds with α = 1, Assumption 3 holds with β = 1, and using a sliding window τ B 1 4 , the average pseudo-regret is: ARN(U) = O(B 1 2 σ 1 2 ). The asymptotic order of SW-TS in the ASC-MAB setting upper bound reduces to the one of Theorem 2 in the case we have B = 0, i.e., we are in an SC-MAB setting. If we apply the bound in Theorem 3 for the AC-MAB setting, in which we have σ = 0 and by fixing = mini i we have |F ,N| = 0, thus F = 0, we obtain a slightly less accurate bound in terms of than the one provided in Theorem 1. Nonetheless, the bound presents the same order in terms of N and τ. Finally, if we are in a stationary setting, i.e., F = B = 0, the bound over the expected pseudo-regret reduces to the order of the one provided by the Thompson Sampling algorithm, analyzed by Kaufmann et al. (2012b) and Agrawal and Goyal (2012). Sliding-Window Thompson Sampling for Non-Stationary Settings TS SER4 REXP3 SW-UCB SW-KL-UCB SW-TS 0 2 4 6 8 10 Figure 1: AC-MAB: average dynamic pseudo-regret Rt(U) and 95% confidence intervals. Settings with K = 10 arms and time horizon N = 104 (a), and with N = 105 (b). 5. Experimental Evaluation We experimentally evaluate our algorithm w.r.t. 
the state-of-the-art passive algorithms with theoretical guarantees in terms of pseudo-regret performance in the AC-MAB, SC-MAB, and ASC-MAB settings. In particular, we compare SW-TS with Thompson Sampling (TS) by Thompson (1933) to evaluate the improvement obtained thanks to the employment of a sliding window τ. Furthermore, we compare SW-TS with REXP3 by Besbes et al. (2014), SW-UCB by Garivier and Moulines (2008), SW-KL-UCB by Combes and Proutiere (2014) and SER4 by Allesiardo et al. (2017) to evaluate the improvement obtained thanks to the adoption of Bayesian methods vs. frequentist ones in non-stationary settings.9 The figures of merit we consider are the dynamic pseudo-regret RN(U), as defined in Equation (1), and the corresponding 95% confidence intervals (reported in the figures as semi-transparent areas) computed over 100 independent runs, if not specified otherwise. Finally, we perform a sensitivity analysis on the two parameters whose knowledge is required by the SW-TS algorithm, specifically α and β. 5.1 Abruptly Changing MAB Setting Experimental Setting We use a time horizon N {104, 105, 106} and a number of arms K {5, 10, 20, 30}. We split the time horizon N into four phases of equal length. The expected value µi,φ is chosen randomly for every arm ai during each phase. In particular, after every breakpoint, the expected value µi,φ of each arm ai is drawn from a uniform probability distribution over [0, 1], thus assuring that there is never the same optimal arm in two different phases, i.e., ai φ = ai φ , φ, φ with φ = φ . For the sake of comparison, we choose a sliding window τ = 4 p N log(N) as is discussed by Garivier and Moulines (2008). We generate 10 configurations for every combination of N and K as discussed above, and 9. If not specified otherwise, the parameters of REXP3 and SER4 are set as in Corollary 3 provided by Allesiardo et al. (2017) and Theorem 2 provided by Besbes et al. (2014), respectively, since these values allow the two algorithms to have sublinear regret. Trov o, Paladino, Restelli, & Gatti Table 2: AC-MAB: average dynamic pseudo-regret RN(U) and 95% confidence intervals. Best results on average have been highlighted in boldface. N 104 105 106 TS 1317 52.89 12857 425.68 114476 4836.98 SER4 2494 37.63 25601 513.99 238034 4323.34 REXP3 1451 13.70 8448 55.21 42561 212.75 SW-UCB 824 66.80 5687 814.94 32939 7587.28 SW-KL-UCB 344 7.57 1570 31.51 6248 145.51 SW-TS 437 13.37 1467 30.45 4904 39.00 TS 1251 26.90 10927 315.30 98312 4168.24 SER4 3151 34.63 31454 499.91 279232 6504.65 REXP3 1913 17.85 12170 108.38 61978 345.25 SW-UCB 1116 68.46 8143 872.37 49537 6191.14 SW-KL-UCB 469 7.98 2197 45.54 8601 162.32 SW-TS 470 8.82 1632 32.85 5493 92.67 TS 1130 30.91 8864 139.77 69919 2447.98 SER4 3684 26.76 33293 167.89 293844 3038.42 REXP3 2480 17.27 16134 93.65 83042 337.96 SW-UCB 1405 57.44 11789 503.34 68751 6651.74 SW-KL-UCB 652 6.70 3086 48.22 11921 315.74 SW-TS 536 10.26 1858 26.82 6156 149.64 TS 1016 35.55 7714 170.92 61979 2001.15 SER4 3922 19.23 33622 212.29 285382 1727.97 REXP3 2712 22.37 18432 100.09 96851 378.67 SW-UCB 1566 60.42 12271 804.93 82006 8424.70 SW-KL-UCB 770 19.79 3858 84.94 15287 233.75 SW-TS 575 12.20 2067 35.65 7123 96.46 we provide the results averaged over the configurations and over 100 independent trials for each configuration. Results The numerical results in terms of RN(U) are reported in Table 2. For every combination of N and K, we highlight in bold the minimum value of RN(U) achieved. 
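For reference, the abruptly changing configurations behind these numbers can be generated as in the following sketch, which reproduces the setting described above (four equal-length phases, expected values redrawn uniformly on [0, 1] at each breakpoint); the explicit rejection step that enforces a different optimal arm in consecutive phases is an assumption about how that property is obtained.

```python
import numpy as np

def make_ac_mab(N, K, n_phases=4, rng=None):
    """Expected rewards mu[t, i] of an AC-MAB: constant within each of n_phases
    equal-length phases, redrawn uniformly on [0, 1] at every breakpoint."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = np.empty((N, K))
    bounds = np.linspace(0, N, n_phases + 1).astype(int)   # breakpoints
    prev_best = -1
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        while True:
            phase_mu = rng.uniform(0.0, 1.0, size=K)
            best = int(np.argmax(phase_mu))
            if best != prev_best:        # different optimal arm in consecutive phases
                break
        prev_best = best
        mu[lo:hi] = phase_mu
    return mu

# Sliding window used for the comparison in Section 5.1:
# tau = int(4 * np.sqrt(N * np.log(N)))   (Garivier & Moulines, 2008)
```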
SW-TS outperforms the other algorithms in all the configurations except for the setting with N = 104 and K = 5 where SW-KL-UCB outperforms SW-TS. In the setting with N = 104 and K = 10 there is no statistical evidence to determine which algorithm is the best between SW-TS and SW-KL-UCB since the 95% confidence intervals overlap. Sliding-Window Thompson Sampling for Non-Stationary Settings 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 a1 a2 a3 a4 a5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 Figure 2: SC-MAB: examples of evolution of the expected reward µi,t of the arms over time. Settings with N = 104, K = 5 (a), and N = 105, K = 10 (b). In Figure 1, we report the results for the settings with K = 10 as t varies. It can be observed that, with N = 104 (Figure 1a), the performance of SW-TS and SW-KL-UCB are similar. However, the regret obtained by the algorithms is almost linear, suggesting that the algorithms are not able to learn since the problem is excessively hard. With a longer time horizon of N = 105 (Figure 1b), the sliding window τ becomes larger (we recall that we use a τ depending on N), as well as the length of the phases and, thus, SW-TS outperforms SW-KL-UCB. The SW-TS suffers from a larger regret when we enter a new phase, e.g., around t = 5 104, but once the sliding window discards the samples coming from the previous phase SW-TS can learn faster than other algorithms, which is exemplified by the lower slope of the regret between t = 6 104 and t = 7 104. 5.2 Smoothly Changing MAB Setting Experimental Setting We use a time horizon N {104, 105, 106} and a number of arms K {5, 10, 20, 30}. We replicate the experimental setting of Combes and Proutiere (2014), where the expected value µi,t of arm ai changes according to the following function: w(t) = 1 + (K 1)(1 + sin(tσ)) Examples of the evolution over time of the expected reward of the arms are presented in Figure 2. We use two different sliding window lengths. In the first case, we use the values of the parameters τ and σ prescribed by our theoretical results. In particular, with the above experimental setting, Assumption 3 is satisfied for every value of N {104, 105, 106} when β = 1 2 and σ = 0.0001.10 Such a value of β leads to τ = N. In the second case, we 10. Details on the conditions for which Assumption 3 is satisfied are provided in Appendix E. Trov o, Paladino, Restelli, & Gatti Table 3: SC-MAB with τ = N: average dynamic pseudo-regret RN(U) and 95% confidence intervals. Best results on average have been highlighted in boldface. 
N 104 105 106 TS 218 41.94 11995 562.73 161933 3767.54 SER4 1787 6.61 22398 61.49 212095 309.93 REXP3 957 16.58 11141 58.95 111403 169.79 SW-UCB 1624 62.40 6560 49.31 34615 160.23 SW-KLUCB 987 13.79 7407 50.99 40190 165.42 SW-TS 608 15.43 3330 40.60 16403 117.46 TS 520 42.19 13253 579.05 169850 4434.77 SER4 2206 14.66 26094 145.80 242464 498.22 REXP3 1253 19.36 14391 63.09 144581 199.42 SW-UCB 3424 105.37 36622 314.36 80256 4375.85 SW-KLUCB 1289 12.56 11009 50.77 64518 179.97 SW-TS 922 16.18 5529 49.82 28258 149.82 TS 549 26.74 12843 390.73 173140 2772.27 SER4 2361 32.85 27286 259.41 258380 1039.26 REXP3 1470 16.91 17334 67.77 174065 219.54 SW-UCB 4466 202.74 45089 353.93 448649 155.59 SW-KLUCB 1330 12.17 13630 38.60 89442 157.88 SW-TS 1180 15.48 7971 49.70 44186 154.32 TS 581 26.29 12483 297.63 172305 2205.39 SER4 2480 23.17 27872 354.54 279190 1680.75 REXP3 1607 14.59 18854 59.28 189734 178.99 SW-UCB 4348 329.65 47586 911.96 462611 443.20 SW-KLUCB 1638 11.92 14603 37.31 102707 126.91 SW-TS 1339 11.49 9595 39.76 54298 124.20 use the value of the parameter τ used by Combes and Proutiere (2014) to provide a direct comparison between the SW-TS and SW-KLUCB algorithms. Thus, we set τ = σ 4 5 . Let us notice that such a choice is not optimal for SW-TS according to our theoretical results. Furthermore, we do not set σ equal to a fixed value, but we evaluate the performance of the algorithms as σ varies. In both cases, we average the results over 100 independent trials for every combination of N, K and σ. Results First, we analyze the results for the settings with τ = N. The numerical results in terms of RN(U) are reported in Table 3. The SW-TS algorithm outperforms all the other Sliding-Window Thompson Sampling for Non-Stationary Settings TS SER4 REXP3 SW-UCB SW-KL-UCB SW-TS 0 2 4 6 8 10 Figure 3: SC-MAB: average dynamic pseudo-regret Rt(U) and 95% confidence intervals. Settings with K = 5 arms and time horizon N = 104 (a), and N = 105 (b). ones, except for the case with N = 104, in which SW-TS achieves the best performance w.r.t. the other algorithms using a sliding window, but it is not able to outperform TS. The reason behind this behaviour lies in the fact that, with such a small value of σ, the optimal arm remains the same until round t = 5 103 and it is not convenient to use a sliding window approach. Conversely, if we have longer time horizons, the optimal arm changes more often. In particular, with N = 105, we have 14 changes of the optimal arm, and the performance of TS becomes the worst. In Figure 3a, we report the dynamic pseudo-regret Rt(U) of the analysed algorithms as t varies in the case with N = 104. It can be observed that, when the optimal arm changes, there is a worsening in the dynamic pseudo-regret performance of TS. However, no sliding window algorithm can achieve its performance. Conversely, as it can be observed in Figure 3b, when N = 105, TS is outperformed by almost all the other algorithms, and this is because the optimal arm changes multiple times. Even if TS and SW-TS share similar behaviours in the very first rounds, the use of a sliding window allows SW-TS to provide the best performance on a longer time horizon. Second, we analyze the results for the settings with τ = σ 4 5 . The results in terms of dynamic pseudo-regret RN(U) are reported in Table 4 for the experiments with σ = 10 3. We observe that SW-TS outperforms all the other algorithms, providing in every setting the minimum RN(U) (highlighted in bold). 
In Figure 4, we report the dynamic pseudo-regret Rt(U) as t varies in the setting with N = 104, K = 10 and σ = 10 3. It can be observed that, in the first 4 103 rounds, TS outperforms SW-KL-UCB, but, subsequently, thanks to the use of a sliding window, SW-KL-UCB forgets the past and improves its performance. Instead, SW-TS outperforms all the other algorithms for the whole time horizon. Notably, REXP3 achieves performance similar to the one of TS, while TS outperforms SW-UCB even if this latter algorithm employs a sliding window approach. Finally, in Figure 5, we report the dynamic pseudo-regret RN(U) as σ varies in the setting with N = 104. We observe that SW-TS outperforms all the other algorithms for each value of σ. As the number of arms K increases from K = 5 to K = 30, the performance of REXP3 and SW-KL-UCB gets worse and, with σ = 0.01, TS (without sliding window) outperforms both REXP3 and SW-KL-UCB. The setting with the other values of N, K and σ, are not reported since they Trov o, Paladino, Restelli, & Gatti Table 4: SC-MAB with τ = σ 4 5 : average dynamic pseudo-regret RN(U) and 95% confidence intervals. Setting with σ = 10 3. Best results on average have been highlighted in boldface. N 104 105 106 TS 1374 41.16 22965 186.32 211996 209.56 REXP3 1094 6.49 11915 21.08 118999 96.21 SW-UCB 677 8.24 7533 127.33 92532 4929.63 SW-KL-UCB 752 7.18 8249 35.44 82469 317.70 SW-TS 423 8.14 4629 33.14 46291 269.55 TS 1419 36.45 25969 276.69 266539 210.95 REXP3 1426 7.70 15505 23.81 155050 92.10 SW-UCB 3247 27.03 41238 43.84 422468 38.42 SW-KL-UCB 1093 7.75 11973 31.21 119631 238.20 SW-TS 690 8.03 7664 30.12 76583 199.37 TS 1442 29.13 26623 226.21 292824 228.08 REXP3 1727 6.89 18863 22.90 188579 65.92 SW-UCB 3987 41.67 45247 38.81 458502 41.66 SW-KL-UCB 1321 7.02 14507 26.76 145215 134.47 SW-TS 944 6.95 10413 24.59 104074 88.18 TS 1401 26.42 26887 201.09 302460 219.72 REXP3 1887 6.92 20529 24.08 205269 93.17 SW-UCB 4144 71.81 46490 69.65 470566 70.75 SW-KL-UCB 1383 8.81 15121 25.55 151357 90.92 SW-TS 1104 7.30 12107 21.06 120868 71.62 provide results in line with what has been presented before, in which SW-TS outperforms state-of-the-art algorithms. 5.3 Abruptly and Smoothly Changing MAB Setting Experimental Setting We use a time horizon N {104, 105, 106} and a number of arms K {5, 10, 20, 30}. For each setting, we generate the expected reward of the arms µi,t as follows. We split the time horizon N into four phases Fφ of equal length and the set the expected value µi,t of arm ai according to the following function: Sliding-Window Thompson Sampling for Non-Stationary Settings TS SER4 REXP3 SW-UCB SW-KL-UCB SW-TS Figure 4: SC-MAB with τ = σ 4 5 : average dynamic pseudo-regret Rt(U) and 95% confidence intervals. Setting with number of arms K = 10, time horizon N = 104, and σ = 10 3. TS SER4 REXP3 SW-UCB SW-KL-UCB SW-TS Figure 5: SC-MAB with τ = σ 4 5 : average dynamic pseudo-regret RN(U) and 95% confidence as the value of σ varies. Setting with number of arms K = 5 (a) and K = 30 (b), time horizon N = 104. w(t) = 1 + (K 1)(1 + sin(tσ + N(φ 1) where φ {1, . . . , 4} represents the index of the phase. Note that the presence of the second term in the argument of the sine induces an abrupt change at the beginning of each phase by shifting the argument of the sine by an amount proportional to the time horizon N: after the first breakpoint, we shift of an number of rounds of N 4 ; after the second one, N 2 of N; after the third one, 3N 4 . 
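The smoothly changing rewards with the per-phase shift described above can be generated as in the following sketch. Since the printed form of w(t) is incomplete, the per-arm phase offsets and the scaling of the rewards to [0, 1] used here are assumptions; only the sinusoidal evolution with rate σ and the shift of the sine argument at each breakpoint follow the description.

```python
import numpy as np

def make_asc_mab(N, K, sigma, n_phases=4):
    """Expected rewards mu[t, i] evolving sinusoidally with rate sigma, with the
    sine argument shifted at every breakpoint (the abrupt component).
    The per-arm offsets, spreading the arms over one period, are an assumption."""
    t = np.arange(N)
    phase_idx = (t * n_phases) // N                  # phase index 0, ..., n_phases - 1
    shift = N * phase_idx / n_phases                 # shifts of 0, N/4, N/2, 3N/4 rounds
    mu = np.empty((N, K))
    for i in range(K):
        offset = 2.0 * np.pi * i / K                 # assumed arm-specific offset
        mu[:, i] = 0.5 * (1.0 + np.sin(sigma * (t + shift) + offset))
    return mu

def make_sc_mab(N, K, sigma):
    """Purely smoothly changing rewards: the same generator with a single phase."""
    return make_asc_mab(N, K, sigma, n_phases=1)
```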
As in Section 5.2, we run a set of experiments using both a sliding window τ = N and σ = 0.0001, and a sliding window τ = σ 4 5 and σ {0.001, 0.002, . . . , 0.01}. In both cases, we average the results over 100 independent trials for every combination of N, K and σ. Trov o, Paladino, Restelli, & Gatti Table 5: ASC-MAB with τ = N: average dynamic pseudo-regret RN(U) and 95% confidence intervals. Setting with σ = 10 4. Best results on average have been highlighted in boldface. N 104 105 106 TS 416 42.09 11598 294.55 155795 3325.74 SER4 2136 10.24 22894 63.04 211904 386.48 REXP3 984 18.45 11428 55.26 111657 166.40 SW-UCB 1424 80.93 6565 58.97 34832 162.74 SW-KLUCB 991 16.66 7367 51.78 40395 171.51 SW-TS 587 15.28 3376 51.61 16546 143.17 TS 513 33.02 13322 399.70 166233 4166.45 SER4 2677 28.61 26590 163.43 243132 652.63 REXP3 1391 20.59 14652 65.30 144851 231.14 SW-UCB 3807 146.66 51669 13.16 82394 5545.59 SW-KLUCB 1434 15.94 10994 42.74 64783 166.04 SW-TS 967 15.41 5512 47.74 28539 143.63 TS 598 30.29 13099 243.57 172319 3124.95 SER4 2832 56.99 28280 242.18 258039 1055.98 REXP3 1664 18.84 17838 62.91 173867 198.21 SW-UCB 5085 229.91 56948 35.71 450598 200.35 SW-KLUCB 1544 12.00 13718 40.59 89463 139.87 SW-TS 1280 12.89 8017 52.56 44306 115.80 TS 589 21.37 12854 288.97 171534 2879.85 SER4 2929 48.64 29651 457.64 278093 1590.27 REXP3 1833 18.58 19585 53.51 189882 189.03 SW-UCB 4819 418.89 59078 38.20 464238 565.19 SW-KLUCB 1882 14.16 14718 33.40 102637 138.04 SW-TS 1471 13.06 9612 41.48 54363 134.34 Results The results for both settings are similar to the ones presented in Section 5.2, suggesting that the abrupt changes not affect the dynamic pseudo-regret if the expected values of the arms are smoothly changing. The numerical results in terms of dynamic pseudo-regret RN(U) with τ = N are reported in Table 5 for the experiments with σ = 10 4. We observe that SW-TS outperforms all the other algorithms except for the case with N = 104, in which TS is the algorithm with the lowest dynamic pseudo-regret. The reason behind this behaviour lies in the fact that, Sliding-Window Thompson Sampling for Non-Stationary Settings Table 6: ASC-MAB with τ = σ 4 5 : average dynamic pseudo-regret RN(U) and 95% confidence intervals. Setting with σ = 10 3. Best results on average have been highlighted in boldface. N 104 105 106 TS 1252 39.02 23163 305.47 211871 380.91 SER4 2253 9.44 23133 55.87 226316 294.62 REXP3 1661 14.75 17517 42.07 175080 115.16 SW-UCB 684 15.74 7451 74.99 90968 7174.24 SW-KLUCB 754 9.62 8232 50.02 83436 359.15 SW-TS 470 11.90 4645 54.85 47197 298.63 TS 1370 39.51 26078 384.97 266289 283.49 SER4 2745 23.31 29277 99.78 299947 456.21 REXP3 2063 12.45 21719 49.39 217066 140.36 SW-UCB 4736 62.98 41371 76.94 423199 99.35 SW-KLUCB 1090 11.15 11958 48.21 120388 218.01 SW-TS 728 11.76 7630 47.36 77302 207.19 TS 1394 41.43 26573 341.85 292356 328.06 SER4 3107 21.22 34016 46.53 341214 164.83 REXP3 2389 14.57 25122 46.73 251113 143.90 SW-UCB 5309 4.98 45305 77.67 459211 75.73 SW-KLUCB 1345 9.73 14479 36.84 145437 133.78 SW-TS 980 9.92 10420 34.53 104146 107.82 TS 1401 27.51 26713 326.58 301882 317.67 SER4 3252 13.63 35227 22.46 352542 78.03 REXP3 2558 12.68 26911 40.75 268738 126.13 SW-UCB 5513 4.21 46429 132.47 471276 0.00 SW-KLUCB 1418 11.11 15161 37.74 151120 143.85 SW-TS 1136 10.84 12102 32.18 120625 115.00 with such a small value for σ, the optimal arm remains the same until round t = 5 103, and, therefore, it is not convenient to use a sliding window approach. 
Conversely, with longer time horizons, the optimal arm changes multiple times, and the performance of TS gets worse. The results in terms of RN(U) with τ = σ 4 5 are reported in Table 6 for the experiments with σ = 10 3, and confirm the superior performance showed by the SW-TS algorithms in the SC-MAB setting. Trov o, Paladino, Restelli, & Gatti 6. Sensitivity Analysis In this section, we present the results of the sensitivity analysis for parameter α in the AC-MAB when the sliding window is set as prescribed by Corollary 1, and for parameter β in the SC-MAB when the sliding window is set as prescribed by Corollary 3. 6.1 Abruptly Changing MAB Setting Experimental Setting We compare the performance of the SW-TS algorithm using different sliding windows τ = N 1 α 2 with α A, where A = { 1, 0.95, . . . , 0.95, 1}. We use the same set of arms K {5, 10, 20, 30} previously used in Section 5.1, and a more fine-grained set of time horizons N {104, 105, 2 105, 3 105, . . . , 9 105, 106}. The different configurations of expected values µi,φ for each arm ai are generated as described in Section 5.1. The results are averaged over these configurations and over 100 independent trials for each configuration. In addition to the sensitivity analysis w.r.t. α, we also show the change in terms of cumulative per-round pseudo-regret ˆRN := RN(U)/N as the value of α varies. Results In Figure 6, we report the values of α (denoted with α and using the solid red line), minimizing the dynamic pseudo-regret RN(U), as the length of the time horizon N varies. We also report the value of α (denoted with α0 and using the solid blue line) prescribed by Corollary 1; in this case, α0 = 0. For a better comprehension of how the dynamic pseudo-regret increases as α gets far from the optimal α , we also plot the values of α corresponding to an increase of 50%, 100% and 200% of the dynamic pseudo-regret RN(U) w.r.t. the one achieved with α . These curves are denoted with α50%, α100% and α200%, respectively, and are reported using dashed lines (notice that we have two curves for every α%, one for α larger than α and another for α smaller than α ; the curves above α = 1 and below α = 1 are omitted). It can be observed that, when using α0, the increase in terms of dynamic pseudo-regret is always lower than 50% w.r.t the dynamic pseudoregret achieved with α , α0 being always between the two curves corresponding to α50%. Moreover, α always corresponds to a negative value, suggesting that, on average, a sliding window τ longer than the one obtained with α0 is preferable in these settings. In Figure 7, for every number K of arms, we show the curves of the per-round pseudoregret ˆRN for every different value of the time horizon N as the value of α varies. The minimum of each curve corresponds to α for the considered time horizon N. It can also be observed that the minimum per-round pseudo-regret is always achieved with almost the same value of α 0.4. Moreover, ˆRN grows faster moving from α to larger values of α rather than to smaller values of α, thus suggesting that underestimating the value of α to use in the algorithm, and therefore overestimating the length of the sliding window, is safer than overestimating it. In Figure 8, for every value of the time horizon N, we show the curves of the perround pseudo-regret ˆRN for every different number K of arms as the value of α varies. The minimum of each curve corresponds to α for the considered number of arms K. We observe that the value of α decreases as the number K of arms increases. 
Intuitively, this behavior is due to the fact that, when SW-TS has more arms to play, it also needs to collect Sliding-Window Thompson Sampling for Non-Stationary Settings 0 2 105 4 105 6 105 8 105 1 106 1 α0 α α50% α100% α200% 0 2 105 4 105 6 105 8 105 1 106 1 α0 α α50% α100% α200% 0 2 105 4 105 6 105 8 105 1 106 1 α0 α α50% α100% α200% 0 2 105 4 105 6 105 8 105 1 106 1 α0 α α50% α100% α200% Figure 6: AC-MAB: values of α providing the best dynamic pseudo-regret (α ), the best dynamic pseudo-regret increased by 50% (α50%), by 100% (α100%), and by 200% (α200%) as the time horizon N varies; α0 is the value prescribed by Corollary 1. more samples for each arm in order to identify which is the optimal arm and, consequently, a larger sliding window τ is preferable. Trov o, Paladino, Restelli, & Gatti 1 0.5 0 0.5 1 0 1 0.5 0 0.5 1 0 1 0.5 0 0.5 1 0 1 0.5 0 0.5 1 0 Figure 7: AC-MAB: per-round pseudo-regret ˆRN as the value of α A varies for every value of the time horizon N and every different number of arms K. Sliding-Window Thompson Sampling for Non-Stationary Settings 1 0.5 0 0.5 1 0 K = 5 K = 10 K = 20 K = 30 1 0.5 0 0.5 1 0 K = 5 K = 10 K = 20 K = 30 1 0.5 0 0.5 1 0 K = 5 K = 10 K = 20 K = 30 1 0.5 0 0.5 1 0 N = 1000000 K = 5 K = 10 K = 20 K = 30 Figure 8: AC-MAB: Per-round pseudo-regret ˆRN as the value of α A varies for every number of arms K and every different length of the time horizon N. Trov o, Paladino, Restelli, & Gatti 6.2 Smoothly Changing MAB Setting Experimental Setting We compare the performance of the SW-TS algorithm using different sliding windows τ = N 1 β 2 with β = B, where B = { 1, 0.95, . . . , 0.95, 1}. We use the same set of arms K {5, 10, 20, 30} previously used in Section 5.1, and a time horizon with a length in N {104, 105, 106}. The expected value of the rewards µi,t for arm ai changes as described in Section 5.2. Results In Figure 9, we report the values of β (denoted with β and using the solid red line), minimizing the dynamic pseudo-regret RN(U), as the length of the time horizon N varies. We also report the value of β (denoted with β0 and using the solid blue line) prescribed by Corollary 3; in this case, β0 = 0. For a better comprehension of how the dynamic pseudo-regret increases as β gets far from the optimal β , we also plot the values of β corresponding to an increase of 50%, 100% and 200% of the dynamic pseudo-regret RN(U) w.r.t. the one achieved with β . These curves are denoted with β50%, β100% and β200% and are reported using dashed lines. It can be observed that, when using β0, the increase in terms of dynamic pseudo-regret is always lower than the 50% w.r.t. the dynamic pseudo-regret achieved with β , β0 being always between the two curves corresponding to β50%. As in the analysis for the parameter α, we notice that β always corresponds to a negative value, suggesting that a sliding window τ longer than the one obtained with β0 is preferable. In Figure 10, for every number K of arms, we show the curves of the per-round pseudoregret ˆRN for every different value of the time horizon N as the value of β varies. The minimum of the curves corresponds to β for the considered time horizon N. It can also be observed that the minimum per-round pseudo-regret is always achieved with a value of β 0.2. Moreover, ˆRN grows faster moving from β toward larger values of β rather than towards smaller values of β. Such behavior suggests that, in practice, using a sliding window τ slightly longer than the one of the optimal β is preferable w.r.t. a slightly smaller one. 
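The mapping from the sensitivity grids to window lengths can be reproduced as in the following sketch; it assumes the exponent form τ = N^{(1−γ)/2} for both grids, with γ standing for α (grid A of Section 6.1) or β (grid B of this section), and the helper name is illustrative.

```python
import numpy as np

def window_from_exponent(N, gamma):
    """Sliding window tau = N ** ((1 - gamma) / 2), with gamma standing for
    alpha (AC-MAB grid) or beta (SC-MAB grid)."""
    return max(1, int(round(N ** ((1.0 - gamma) / 2.0))))

grid = np.round(np.arange(-1.0, 1.0 + 1e-9, 0.05), 2)   # {-1, -0.95, ..., 0.95, 1}
for N in (10**4, 10**5, 10**6):
    taus = [window_from_exponent(N, g) for g in grid]
    # gamma = -1 gives tau = N (longest memory), gamma = 1 gives tau = 1 (no memory).
    print(N, taus[0], taus[len(grid) // 2], taus[-1])
```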
In Figure 11, for every value of the time horizon N, we show the curves of the per-round pseudo-regret ˆR_N for every number K of arms as the value of β varies. The minimum of each curve corresponds to β* for the considered number of arms K. We observe that the larger the number of arms K, the lower the value of β*, which is in line with what has been observed for the parameter α.

Figure 9: SC-MAB: values of β providing the best dynamic pseudo-regret (β*), the best dynamic pseudo-regret increased by 50% (β_50%), by 100% (β_100%), and by 200% (β_200%) as the time horizon N varies; β_0 is the value prescribed by Corollary 3.

Figure 10: SC-MAB: per-round pseudo-regret ˆR_N as the value of β ∈ B varies, for every value of the time horizon N and every number of arms K.

Figure 11: SC-MAB: per-round pseudo-regret ˆR_N as the value of β ∈ B varies, for every number of arms K and every length of the time horizon N.

7. Conclusions and Future Work

In this paper, we study the non-stationary Multi-Armed Bandit problem, investigating settings in which two different forms of non-stationarity are present: the abruptly changing one, namely AC-MAB, in which the expected reward of each arm remains constant during sequences of rounds and changes at unknown rounds, and the smoothly changing one, namely SC-MAB, in which the expected reward of every arm continuously changes over rounds in a smooth way. We also study the setting in which both forms of non-stationarity coexist, namely ASC-MAB. We propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), which chooses the next arm to play on the basis of the information collected in the last τ rounds. We derive upper bounds on the dynamic pseudo-regret of SW-TS when it is applied to each of the aforementioned settings. Finally, we present a thorough experimental evaluation of the performance of the SW-TS algorithm separately for the AC-MAB, SC-MAB, and ASC-MAB settings. We show that our algorithm significantly outperforms state-of-the-art approaches for NS-MAB problems in terms of dynamic pseudo-regret, achieving the lowest pseudo-regret in almost all the configurations we tested.

Future developments of this work may consider an analysis of our algorithm in MAB problems with structure, e.g., the Unimodal MAB, in which the arm space presents a graph structure, the case of a continuous decision space as studied by Nuara, Trovò, Crippa, Gatti, and Restelli (2020), and adversarial settings as studied by Marchesi, Trovò, and Gatti (2020) and Bisi, Nittis, Trovò, Restelli, and Gatti (2017). Moreover, another research line we intend to explore is the derivation of bounds depending on the characteristics of the abrupt changes, such as their magnitude and their distribution over the time horizon.

References

Agrawal, S., & Goyal, N. (2012). Analysis of Thompson Sampling for the multi-armed bandit problem.
In Proceedings of the Conference on Learning Theory (COLT), pp. 39.1–39.26.

Allesiardo, R., Féraud, R., & Maillard, O.-A. (2017). The non-stationary stochastic multi-armed bandit problem. International Journal of Data Science and Analytics, 3, 1–17.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235–256.

Auer, P., Gajane, P., & Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the Conference on Learning Theory (COLT), pp. 138–158.

Besbes, O., Gur, Y., & Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), pp. 199–207.

Besson, L., & Kaufmann, E. (2019). The generalized likelihood ratio test meets KLUCB: an improved algorithm for piece-wise non-stationary bandits. arXiv preprint arXiv:1902.01575.

Bisi, L., Nittis, G. D., Trovò, F., Restelli, M., & Gatti, N. (2017). Regret minimization algorithms for the followers' behaviour identification in leadership games. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson Sampling. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), pp. 2249–2257.

Combes, R., & Proutiere, A. (2014). Unimodal bandits: regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pp. 521–529.

Eliashberg, J., & Jeuland, A. P. (1986). The impact of competitive entry in a developing market upon dynamic pricing strategies. Marketing Science, 5(1), 20–36.

Farina, G., & Gatti, N. (2017). Adopting the cascade model in ad auctions: Efficiency bounds and truthful algorithmic mechanisms. Journal of Artificial Intelligence Research, 59, 265–310.

Garivier, A., & Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.

Garivier, A., & Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pp. 174–188.

Gasparini, M., Nuara, A., Trovò, F., Gatti, N., & Restelli, M. (2018). Targeting optimization for internet advertising by learning from logged bandit feedback. In International Joint Conference on Neural Networks (IJCNN), pp. 1–8.

Gatti, N., Lazaric, A., Rocco, M., & Trovò, F. (2015). Truthful learning mechanisms for multi-slot sponsored search auctions with externalities. Artificial Intelligence, 227, 93–139.

Gorre, M. E., Mohammed, M., Ellwood, K., Hsu, N., Paquette, R., Rao, P. N., & Sawyers, C. L. (2001). Clinical resistance to STI-571 cancer therapy caused by BCR-ABL gene mutation or amplification. Science, 293(5531), 876–880.

Granmo, O.-C. (2010). Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2), 207–234.

Granmo, O.-C., & Berg, S. (2010). Solving non-stationary bandit problems by random sampling from sibling Kalman filters. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA/AIE), pp. 199–208.

Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., & Sebag, M. (2006).
Multi-armed bandit, dynamic environments and meta-bandits. Working paper.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.

Kaufmann, E., Cappé, O., & Garivier, A. (2012a). On Bayesian upper confidence bounds for bandit problems. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 592–600.

Kaufmann, E., Korda, N., & Munos, R. (2012b). Thompson Sampling: an asymptotically optimal finite-time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), pp. 199–213.

Kitts, B., & Leblanc, B. (2004). Optimal bidding on keyword auctions. Electronic Markets, 14(3), 186–201.

Kocsis, L., & Szepesvári, C. (2006). Discounted UCB. In PASCAL Challenges Workshop, pp. 784–791.

Lai, L., El Gamal, H., Jiang, H., & Poor, H. V. (2011). Cognitive medium access: Exploration, exploitation, and competition. IEEE Transactions on Mobile Computing, 10(2), 239–253.

Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.

Liu, F., Lee, J., & Shroff, N. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence.

Luo, H., Wei, C.-Y., Agarwal, A., & Langford, J. (2018). Efficient contextual bandits in non-stationary worlds. In Proceedings of the Conference on Learning Theory (COLT), pp. 1739–1776.

Marchesi, A., Trovò, F., & Gatti, N. (2020). Learning probably approximately correct maximin strategies in simulation-based games with infinite strategy spaces. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).

May, B. C., Korda, N., Lee, A., & Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13(Jun), 2069–2106.

Mellor, J., & Shapiro, J. (2013). Thompson Sampling in switching environments with Bayesian online change point detection. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 442–450.

Nuara, A., Sosio, N., Trovò, F., Zaccardi, M. C., Gatti, N., & Restelli, M. (2019). Dealing with interdependencies and uncertainty in multi-channel advertising campaigns optimization. In Proceedings of the World Wide Web Conference (WWW), pp. 1376–1386.

Nuara, A., Trovò, F., Crippa, D., Gatti, N., & Restelli, M. (2020). Driving exploration by maximum distribution in Gaussian process bandits. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS).

Nuara, A., Trovò, F., Gatti, N., & Restelli, M. (2018). A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2379–2386.

Paladino, S., Trovò, F., Restelli, M., & Gatti, N. (2017). Unimodal Thompson Sampling for graph-structured arms. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2457–2463.

Slivkins, A. (2011). Contextual bandits with similarity information. In Proceedings of the Conference on Learning Theory (COLT), pp. 679–702.

Slivkins, A., & Upfal, E. (2008). Adapting to a changing environment: the Brownian restless bandits. In Proceedings of the Conference on Learning Theory (COLT), pp. 343–354.

St-Pierre, D. L., & Jialin, L. (2014).
Differential evolution algorithm applied to non-stationary bandit problem. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), pp. 2397–2403.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3), 285–294.

Trovò, F., Paladino, S., Restelli, M., & Gatti, N. (2018). Improving multi-armed bandit algorithms in online pricing settings. International Journal of Approximate Reasoning, 98, 196–235.

Wei, C.-Y., Hong, Y.-T., & Lu, C.-J. (2016). Tracking the best expert in non-stationary stochastic environments. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), pp. 3972–3980.

Appendix A. Preliminaries

In this section, we introduce some lemmas that we use in the proofs of our main theorems. At first, we recall the link, shown by Agrawal and Goyal (2012) and cited by Kaufmann et al. (2012b), between Beta and Bernoulli distributions, usually addressed in the literature as the Beta-Binomial trick.

Lemma 1 (Agrawal & Goyal, 2012). Let us denote with F^Beta_{a,b} the Cumulative Distribution Function (CDF) of a Beta distribution Beta(a, b) with parameters a and b, and with F^B_{n,µ} the CDF of a random variable with Binomial distribution Bi(n, µ) with parameters n and µ. It is true that:

F^Beta_{a,b}(y) = 1 − F^B_{a+b−1,y}(a − 1).

We introduce the following lemma, which we use below to bound the number of times a Thompson sample ϑ_{i,t} is drawn from a high quantile of the Beta distribution.

Lemma 2. Consider a random variable Beta with Beta distribution Beta(S + 1, T − S + 1), where S := Σ_{s=1}^{T} X_s is the sum of T ∈ N Bernoulli trials X_s ∼ Be(µ) with the same parameter µ ∈ [0, 1]. Consider a finite integer τ ∈ N, τ > T, a parameter ε ≥ 1/2, the UCB-like bound u_T := S/T + sqrt(ε log τ / T), and the quantile q_T := Q(1 − 1/τ), where Q(α) is the α-quantile of the random variable Beta. We have that q_T ≤ u_T.

Proof. Here, we use the inequalities provided by Kaufmann, Cappé, and Garivier (2012a) to bound the (1 − 1/τ)-quantile of a Beta distribution with the KL-divergence and elaborate over them. Consider the event that the variable Beta ∼ Beta(S + 1, T − S + 1) is greater than the UCB-like bound u_T. We have:

P(Beta ≥ u_T) = 1 − F^Beta_{S+1,T−S+1}(u_T) (3)
= F^B_{T+1,u_T}(S) (4)
= 1 − F^B_{T+1,1−u_T}(T − S + 1) = P(Bi_{T+1,1−u_T} > T − S + 1) (5)
≤ exp(−T KL((T − S + 1)/(T + 1), 1 − u_T)) (6)
≤ exp(−2T ((T − S + 1)/(T + 1) − (1 − u_T))²) (7)
≤ exp(−2ε log τ) = 1/τ^{2ε}, (9)

where Bi_{n,µ} is a random variable with Binomial distribution Bi(n, µ) with parameters n and µ, KL(·, ·) is the Kullback-Leibler divergence, the equalities in Equation (3) follow from Lemma 1, Equation (5) follows from the properties of the Binomial distribution, Equation (6) follows from the Sanov inequality, and Equation (7) follows from the Pinsker inequality. The quantile Q(1 − 1/τ) satisfies, by definition, the property P(Beta ≥ q_T) = 1/τ. Since 1/τ ≥ 1/τ^{2ε} for ε ≥ 1/2, it follows that q_T ≤ u_T.

Finally, we introduce the following lemma, whose proof is provided independently in the appendices of the works by Garivier and Moulines (2011) (Lemma 1) and Combes and Proutiere (2014) (Lemma 4.1).

Lemma 3 (Garivier & Moulines, 2011; Combes & Proutiere, 2014). Let A ⊆ N and a(t) = Σ_{t'=t−τ}^{t−1} 1{t' ∈ A}; then, for every positive integer τ and every s ∈ N, we have:

Σ_{t=1}^{N} 1{t ∈ A, a(t) ≤ s} ≤ s ⌈N/τ⌉.
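The Beta–Binomial identity of Lemma 1 is easy to verify numerically; the snippet below is an illustrative check (the specific values of a, b, and y are arbitrary) comparing the two sides with SciPy.

```python
from scipy.stats import beta, binom

# Lemma 1: F^Beta_{a,b}(y) = 1 - F^B_{a+b-1, y}(a - 1), for integer a, b.
for a, b, y in [(3, 5, 0.4), (1, 1, 0.7), (10, 2, 0.85)]:
    lhs = beta.cdf(y, a, b)                      # CDF of Beta(a, b) at y
    rhs = 1.0 - binom.cdf(a - 1, a + b - 1, y)   # 1 - CDF of Bi(a+b-1, y) at a-1
    print(f"a={a}, b={b}, y={y}: lhs={lhs:.12f} rhs={rhs:.12f}")
```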
Appendix B. Abruptly Changing Setting: Proofs

Theorem 1. If the SW-TS policy is run over an AC-MAB setting with X_{i,t} ∼ Be(µ_{i,t}), for every τ ∈ N, the dynamic pseudo-regret after N rounds is at most:

R_N(U) ≤ τKB_N + Σ_{φ=1}^{B_N} Σ_{i=1}^{K} ∆_{i,φ} [(N_φ/τ)(52 log τ/∆²_{i,φ} + log τ + 5 + 19 log τ) + 52 log τ/∆²_{i,φ} + 3 + 19 log τ],

where B and α are defined in Assumption 1 and ∆_{i,φ} := µ_{i*_φ,φ} − µ_{i,φ} is the difference between the expected reward µ_{i*_φ,φ} of the best arm a_{i*_φ} and the expected reward µ_{i,φ} of arm a_i during phase F_φ. By defining:

∆_i := min_{φ∈{1,...,B_N}: i≠i*_φ} ∆_{i,φ}, for all i ∈ {1, . . . , K},

i.e., the minimum over all the phases F_φ in which arm a_i is not optimal of the differences of the expected rewards ∆_{i,φ}, the dynamic pseudo-regret can be written as:

R_N(U) ≤ τKBN^α + (N/τ) Σ_{i=1}^{K} ∆_i (52 log τ/∆²_i + log τ + 5 + 19 log τ).

Proof. We adapt, to the AC-MAB setting, the proof provided by Kaufmann et al. (2012b) to bound the expected pseudo-regret of the classical Thompson Sampling algorithm. In particular, in what follows, we point out the places where our proof differs from the one by Kaufmann et al. (2012b). Let us define the effective phase F'_φ := {t ∈ N s.t. b_{φ−1} + τ ≤ t < b_φ} and denote with T_i(F'_φ) := Σ_{t∈F'_φ} 1{i_t = i, i ≠ i*_φ} the number of times a suboptimal arm a_i ≠ a_{i*_φ} is played during the effective phase F'_φ. During every generic effective phase F'_φ, the MAB setting is stationary. Moreover, by the definition of the effective phase F'_φ, we have:

E[T_i(F_φ)] ≤ τ + E[T_i(F'_φ)], (10)

where we recall that T_i(F_φ) is the number of times arm a_i is pulled during phase F_φ. At first, we bound the expected number of times the algorithm chooses a suboptimal arm in a generic effective phase F'_φ. The rationale with which we bound E[T_i(F'_φ)] is to decompose it into two terms: the first one is conditioned on the event that the reward of the optimal arm a_{i*_φ} is underestimated, while the second one is conditioned on the event that the reward of the optimal arm a_{i*_φ} is not underestimated, but the suboptimal arm a_i is played. Hence, we have:

E[T_i(F'_φ)] = Σ_{t∈F'_φ} E[1{i_t = i}] (11)
≤ Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ})) + Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} > µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), i_t = i) (12)
≤ Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ})) =: R_A
 + Σ_{t∈F'_φ} P(u_{T_{i,t,τ}} > µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), i_t = i) =: R_B
 + Σ_{t∈F'_φ} P(ϑ_{i,t} ≥ q_{T_{i,t,τ}}) =: R_C, (14)

where, in bounding Equation (12), we use the property that the Thompson sample ϑ_{i_t,t} = ϑ_{i,t} chosen at round t is larger than the one of the optimal arm ϑ_{i*_φ,t} (i.e., ϑ_{i,t} ≥ ϑ_{i*_φ,t}), q_{T_{i,t,τ}} is the quantile of order 1 − 1/τ of the Beta distribution corresponding to the expected value µ_{i,t} of arm a_i, and we use Lemma 2, applied to the rewards of arm a_i with T = T_{i,t,τ} and ε = 2, to bound the second term in Equation (14).

Let us focus on R_A. In the analysis by Kaufmann et al. (2012b), the probability that the optimal arm has been pulled fewer than t^b times in the past (for a properly defined constant b ∈ (0, 1)) is bounded by a constant (Proposition 1 by Kaufmann et al., 2012b). In the setting we study, such a result does not hold, as the number of samples used in the posterior distribution π_{i,t} of the expected reward µ_{i,t} does not increase indefinitely over time, due to the sliding-window approach that limits the number of samples to at most τ. Thus, we bound the probability that an arm is pulled fewer than n_A times by using Lemma 3 with A = {t | i_t = i}, t ∈ F'_φ, and, consequently, a(t) = T_{i,t,τ}. We have:

Σ_{t∈F'_φ} E[1{i_t = i, T_{i,t,τ} ≤ n_A}] ≤ n_A ⌈|F'_φ|/τ⌉,

where |F'_φ| = N_φ − τ ≤ N_φ, which holds for all i ∈ {1, . . . , K}.
Thus, by choosing n_A = ⌈19 log τ⌉, we have:

R_A ≤ Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_A) + Σ_{t∈F'_φ} P(T_{i*_φ,t,τ} ≤ n_A) (19)
≤ Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_A) + Σ_{t∈F'_φ} E[1{T_{i*_φ,t,τ} ≤ n_A}] (20)
≤ Σ_{t∈F'_φ} P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_A) + n_A ⌈N_φ/τ⌉, (21)

where we use Lemma 3 to bound Equation (21). Let us define:
– {U_t}_{t∈F'_φ}, a set of i.i.d. uniform random variables over Ω = [0, 1];
– S_{i,t,τ} := Σ_{h=t−τ+1}^{t} X_{i,h} 1{i_h = i}, the sum of the rewards received by arm a_i in the last τ rounds (with abuse of notation w.r.t. the main body of the paper);
– Σ_{i,t,τ,s} := Σ_{h=t−τ+1}^{t−τ+s} X_{i,h} 1{i_h = i}, the sum of the first s rewards over the last τ rounds of arm a_i.

Recalling that T_{i,t,τ} := Σ_{h=max{t−τ+1,1}}^{t} 1{i_h = i}, we have:

P(ϑ_{i*_φ,t} ≤ µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_A)
= P(U_t ≤ F^Beta_{S_{i*_φ,t,τ}+1, T_{i*_φ,t,τ}−S_{i*_φ,t,τ}+1}(µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ})), T_{i*_φ,t,τ} > n_A) (23)
= P(U_t ≤ 1 − F^B_{T_{i*_φ,t,τ}+1, µ_{i*_φ,t}−sqrt(5 log τ / T_{i*_φ,t,τ})}(S_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_A) (24)
= P(F^B_{T_{i*_φ,t,τ}+1, µ_{i*_φ,t}−sqrt(5 log τ / T_{i*_φ,t,τ})}(S_{i*_φ,t,τ}) ≤ U_t, T_{i*_φ,t,τ} > n_A) (25)
≤ Σ_{s=n_A}^{τ} P(F^B_{s+1, µ_{i*_φ,t}−sqrt(5 log τ / s)}(Σ_{i*_φ,t,τ,s}) ≤ U_t) (26)
= Σ_{s=n_A}^{τ} P(Σ_{i*_φ,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ_{i*_φ,t}−sqrt(5 log τ / s)}(U_t)), (27)

where we use Lemma 1 to derive Equation (24), we use the fact that U_t ∼ 1 − U_t to derive Equation (25), and we use the union bound over the possible values s of T_{i*_φ,t,τ} to derive Equation (27). Note that (F^B)^{−1}_{s+1, µ_{i*_φ,t}−sqrt(5 log τ/s)}(·) denotes the generalized inverse CDF of a Binomial distribution Bi(s + 1, µ_{i*_φ,t} − sqrt(5 log τ/s)). We also remark that the term (F^B)^{−1}_{s+1, µ_{i*_φ,t}−sqrt(5 log τ/s)}(U_t) is independent of Σ_{i*_φ,t,τ,s} ∼ Bi(s, µ_{i*_φ,t}). Similarly to what is done by Kaufmann et al. (2012b), we define, for a chosen s, two i.i.d. sequences of Bernoulli random variables {X_{1,l}}_{l=1}^{s+1} and {X_{2,l}}_{l=1}^{s} of size s + 1 and s, respectively:

X_{1,l} ∼ Be(µ_{i*_φ,t} − sqrt(5 log τ / s)),  X_{2,l} ∼ Be(µ_{i*_φ,t}), (30)

whose summations correspond to the r.h.s. and l.h.s. of the inequality appearing in the probability in Equation (27), respectively. Let {Z_l}_{l=1}^{s} be another i.i.d. sequence of random variables, with Z_l := X_{2,l} − X_{1,l} and E[Z_l] = sqrt(5 log τ / s).11 We get:

P(Σ_{i*_φ,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ_{i*_φ,t}−sqrt(5 log τ/s)}(U_t)) ≤ P(Σ_{l=1}^{s} Z_l − X_{1,s+1} ≤ 0) ≤ P(Σ_{l=1}^{s} (Z_l − E[Z_l]) ≤ −(sqrt(5s log τ) − 1)), (35)

where we use that s > n_A implies sqrt(5s log τ) − 1 > sqrt(4s log τ) to bound Equation (35). We apply the Hoeffding's inequality provided by Hoeffding (1963) to the bounded martingale difference sequence {Z_l}_{l=1}^{s} (having support of measure 2) and we get:

P(Σ_{i*_φ,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ_{i*_φ,t}−sqrt(5 log τ/s)}(U_t)) ≤ exp(−(sqrt(4s log τ))²/(2s)) = e^{−2 log τ},

so that Σ_{s=n_A}^{τ} e^{−2 log τ} ≤ 1/τ. Finally, we get:

R_A ≤ 19 (N_φ/τ) log τ + 2 N_φ/τ + 19 log τ + 1.

11. We assume here that µ_{i*_φ,t} − sqrt(5 log τ/s) ≥ 0, i.e., that the sequence {X_{1,l}}_{l=1}^{s+1} is well defined. In the case this condition does not hold, we have R_A = 0, since the event that the Thompson sample ϑ_{i*_φ,t} < 0 has zero probability.

Let us focus on R_B. Differently from what is done by Kaufmann et al. (2012b), we use the Hoeffding's inequality to bound each probability term appearing in R_B. Let us define ˆµ_{i,t,τ} := Σ_{s=t−τ+1}^{t} X_{i,s} 1{i_s = i} / T_{i,t,τ}, i.e., the estimator of the expected value µ_{i,φ} of the rewards of arm a_i computed over the last τ rounds, and choose n_B = ⌈20 log τ / ∆²_{i,φ}⌉ and n'_B = ⌈32 log τ / ∆²_{i,φ}⌉; we recall that ∆_{i,φ} := µ_{i*_φ,t} − µ_{i,t} with t ∈ F'_φ. If the two properties T_{i*_φ,t,τ} > n_B and T_{i,t,τ} > n'_B hold, we have the following:

R_B = Σ_{t∈F'_φ} P(u_{T_{i,t,τ}} > µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), i_t = i) (42)
≤ Σ_{t∈F'_φ} P(T_{i*_φ,t,τ} ≤ n_B) + Σ_{t∈F'_φ} P(T_{i,t,τ} ≤ n'_B)
 + Σ_{t∈F'_φ} P(u_{T_{i,t,τ}} > µ_{i*_φ,t} − sqrt(5 log τ / T_{i*_φ,t,τ}), T_{i*_φ,t,τ} > n_B, T_{i,t,τ} > n'_B) (43)
≤ (n_B + n'_B) ⌈N_φ/τ⌉ + Σ_{t∈F'_φ} P(ˆµ_{i,t,τ} > µ_{i,t} + ∆_{i,φ}/4, T_{i,t,τ} > n'_B), (46)

where Equation (42) is derived from the definition of u_{T_{i,t,τ}}, and the first two summations in Equation (43) are bounded through Lemma 3.
From Corollary 21 by Garivier and Moulines (2008), we have, for all η > 0:

P(ˆµ_{i,t,τ} > µ_{i,t} + ∆_{i,φ}/4, T_{i,t,τ} > n'_B) ≤ ⌈log τ / log(1 + η)⌉ exp(−12 log τ (1 − η²/16)),

and, since η = 4/sqrt(12), the bound can be written as:

P(ˆµ_{i,t,τ} > µ_{i,t} + ∆_{i,φ}/4, T_{i,t,τ} > n'_B) ≤ (1 + log τ)/τ.

Hence, we can write:

R_B ≤ (N_φ/τ) 52 log τ/∆²_{i,φ} + 2N_φ/τ + (N_φ/τ) log τ + 52 log τ/∆²_{i,φ} + 2.

Let us focus on R_C. The term R_C is upper bounded by:

R_C = Σ_{t∈F'_φ} P(ϑ_{i,t} ≥ q_{T_{i,t,τ}}) = Σ_{t∈F'_φ} 1/τ ≤ N_φ/τ,

using the definition of the quantiles q_{T_{i,t,τ}}.

Pseudo-regret. Since Σ_{φ=1}^{B_N} N_φ = N and recalling that, if t ∈ F_φ ⊇ F'_φ, we have µ_{i,t} = µ_{i,φ}, the dynamic pseudo-regret over all the phases can be written as:

R_N(U) = E[Σ_{t=1}^{N} (µ_{i*,t} − µ_{i_t,t})] = Σ_{φ=1}^{B_N} Σ_{i=1}^{K} (µ_{i*,φ} − µ_{i,φ}) E[T_i(F_φ)] = Σ_{φ=1}^{B_N} Σ_{i=1}^{K} ∆_{i,φ} E[T_i(F_φ)]
≤ Σ_{φ=1}^{B_N} Σ_{i=1}^{K} ∆_{i,φ} (τ + E[T_i(F'_φ)]) ≤ Σ_{φ=1}^{B_N} Σ_{i=1}^{K} ∆_{i,φ} (τ + R_A + R_B + R_C)
≤ τKB_N + Σ_{φ=1}^{B_N} Σ_{i=1}^{K} ∆_{i,φ} [(N_φ/τ)(52 log τ/∆²_{i,φ} + log τ + 5 + 19 log τ) + 52 log τ/∆²_{i,φ} + 3 + 19 log τ],

where B_N is the number of breakpoints before N. By defining:

∆_i := min_{φ∈{1,...,B_N}: i≠i*_φ} ∆_{i,φ}, for all i ∈ {1, . . . , K}, (62)

i.e., the minimum, over all the phases F_φ in which the arm a_i is not optimal, of the difference between the expected reward µ_{i*_φ,φ} of the best arm a_{i*_φ} and the expected reward µ_{i,φ} of arm a_i, the dynamic pseudo-regret can be written as:

R_N(U) ≤ τKBN^α + (N/τ) Σ_{i=1}^{K} ∆_i (52 log τ/∆²_i + log τ + 5 + 19 log τ),

which concludes the proof.

Appendix C. Smoothly Changing Setting: Proofs

Theorem 2. If the SW-TS policy is run over an SC-MAB setting with X_{i,t} ∼ Be(µ_{i,t}) and Lipschitz constant σ > 0, and there exists ∆_0 ∈ (0, 1) as in Assumption 3, for any τ ∈ N and any ∆ s.t. 2στ < ∆ ≤ ∆_0, the dynamic pseudo-regret after N rounds is at most:

R_N(U) ≤ F∆N^β + (NK/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ) + K(52 log τ/(∆ − 2στ)² + 3 + 19 log τ).

Proof. Let us define:
– F_{∆,N} := {t ∈ {1, . . . , N} s.t. ∃ i ≠ j, |µ_{i,t} − µ_{j,t}| < ∆}, i.e., the set of the rounds in which there are two arms whose expected values differ by less than ∆;
– F^C_{∆,N} := {1, . . . , N} \ F_{∆,N}, i.e., the set of the rounds in which the expected rewards of the arms are well separated (|µ_{i,t} − µ_{j,t}| ≥ ∆, ∀ i ≠ j);
– T_i(F_{∆,N}) := Σ_{t∈F_{∆,N}} 1{i_t = i, i ≠ i*_t}, i.e., the number of rounds arm a_i is played when it is not optimal during rounds t ∈ F_{∆,N};
– T_i(F^C_{∆,N}) := Σ_{t∈F^C_{∆,N}} 1{i_t = i, i ≠ i*_t}, i.e., the number of rounds arm a_i is played when it is not optimal during rounds t ∈ F^C_{∆,N}.

For every ∆ s.t. 2στ < ∆ ≤ ∆_0, we have:

R_N(U) = E[Σ_{t=1}^{N} (µ_{i*_t,t} − µ_{i_t,t})] ≤ Σ_{i=1}^{K} Σ_{t=1}^{N} E[1{i_t = i, i ≠ i*_t}] (64)
≤ Σ_{i=1}^{K} E[T_i(F_{∆,N})] + Σ_{i=1}^{K} E[T_i(F^C_{∆,N})]. (66)

The first term in Equation (66), i.e., E[T_i(F_{∆,N})], can be directly bounded by using Assumption 3. Instead, bounding the second term, i.e., E[T_i(F^C_{∆,N})], requires a more complex procedure. In our proof, we adapt the line of reasoning provided by Kaufmann et al. (2012b) to our setting. Indeed, we remark that the result by Kaufmann et al. (2012b) cannot be directly applied, as the reward distributions vary at every round.12 We define two events: in the first one, the optimal arm a_{i*_t} is underestimated; in the second one, the optimal arm a_{i*_t} is not underestimated, but the suboptimal arm a_i is played. Hence, we have:

E[T_i(F^C_{∆,N})] ≤ Σ_{t∈F^C_{∆,N}} P(ϑ_{i*_t,t} ≤ µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}))
 + Σ_{t∈F^C_{∆,N}} P(ϑ_{i*_t,t} > µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), i_t = i, ϑ_{i,t} < q_{T_{i,t,τ}}) (67)

12. For the sake of concision, we omit the derivations that are the same as those discussed in the proof of Theorem 1.
 + Σ_{t∈F^C_{∆,N}} P(ϑ_{i,t} ≥ q_{T_{i,t,τ}}) (68)
≤ Σ_{t∈F^C_{∆,N}} P(ϑ_{i*_t,t} ≤ µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ})) =: R_A
 + Σ_{t∈F^C_{∆,N}} P(u_{T_{i,t,τ}} > µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), i_t = i) =: R_B
 + Σ_{t∈F^C_{∆,N}} P(ϑ_{i,t} ≥ q_{T_{i,t,τ}}) =: R_C, (69)

where we use Lemma 2, applied to the rewards of arm a_i with T = T_{i,t,τ} and ε = 2, to bound the expression in Equation (69). Let us focus on R_A. By applying Lemma 3 and defining n_A = ⌈19 log τ⌉, we have:

R_A ≤ Σ_{t∈F^C_{∆,N}} P(ϑ_{i*_t,t} ≤ µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), T_{i*_t,t,τ} > n_A) + Σ_{t∈F^C_{∆,N}} P(T_{i*_t,t,τ} ≤ n_A) (71)
≤ Σ_{t∈F^C_{∆,N}} P(ϑ_{i*_t,t} ≤ µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), T_{i*_t,t,τ} > n_A) + n_A ⌈N/τ⌉. (73)

While in the proof of Theorem 1, as well as in the proof by Kaufmann et al. (2012b), the expected values of the rewards are constant over the last τ rounds, in the case we study here such a property does not hold. In what follows, we define a set of auxiliary random variables S̃_{i,t,τ} whose expected values are constant over the last τ rounds and smaller than the one of the optimal arm. Then, using the random variables S̃_{i,t,τ}, we can apply Lemma 1 to transform the Beta distribution estimating the expected reward of an arm into a Binomial random variable, which, finally, allows us to bound the quantity R_A. Let us define:
– {U_t}_{t∈F^C_{∆,N}}, a sequence of i.i.d. uniform random variables over Ω = [0, 1];
– S_{i,t,τ} := Σ_{s=t−τ+1}^{t} 1{i_s = i} X_{i,s}, i.e., the number of successes of arm a_i at round t in the previous τ rounds (with abuse of notation w.r.t. the main body of the paper);
– X̃_{i,s} := X_{i,s} + µ_{i,t} − µ_{i,s} − στ, s ∈ {t − τ + 1, . . . , t}, i.e., a set of auxiliary variables having X̃_{i,s} ≤ X_{i,s} (since |µ_{i,t} − µ_{i,s}| ≤ στ) and µ̃_{i,t} := E[X̃_{i,s}] = µ_{i,t} − στ;
– S̃_{i,t,τ} := Σ_{s=t−τ+1}^{t} 1{i_s = i} X̃_{i,s}, the number of successes of an arm a_i having rewards X̃_{i,s} at round s in the rounds {t − τ + 1, . . . , t};
– Σ̃_{i,t,τ,s} := Σ_{h=t−τ+1}^{t−τ+s} 1{i_h = i} X̃_{i,h}, i.e., the sum of the random variables X̃_{i,t−τ+1}, . . . , X̃_{i,t−τ+s}.

Note that the arm a_{i*_t}, whose expected reward is µ_{i*_t,t}, is optimal since we are focusing on rounds t ∈ F^C_{∆,N}. Hence, we have:

P(ϑ_{i*_t,t} ≤ µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), T_{i*_t,t,τ} > n_A)
= P(U_t ≤ F^Beta_{S_{i*_t,t,τ}+1, T_{i*_t,t,τ}−S_{i*_t,t,τ}+1}(µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ})), T_{i*_t,t,τ} > n_A)
= P(U_t ≤ 1 − F^B_{T_{i*_t,t,τ}+1, µ̃_{i*_t,t}−sqrt(5 log τ / T_{i*_t,t,τ})}(S_{i*_t,t,τ}), T_{i*_t,t,τ} > n_A) (76)
= P(F^B_{T_{i*_t,t,τ}+1, µ̃_{i*_t,t}−sqrt(5 log τ / T_{i*_t,t,τ})}(S_{i*_t,t,τ}) ≤ U_t, T_{i*_t,t,τ} > n_A) (77)
≤ P(F^B_{T_{i*_t,t,τ}+1, µ̃_{i*_t,t}−sqrt(5 log τ / T_{i*_t,t,τ})}(S̃_{i*_t,t,τ}) ≤ U_t, T_{i*_t,t,τ} > n_A) (78)
≤ Σ_{s=n_A}^{τ} P(Σ̃_{i*_t,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ̃_{i*_t,t}−sqrt(5 log τ / s)}(U_t)), (80)

where we use Lemma 1 to derive Equation (76), Equation (77) follows from U_t ∼ 1 − U_t, and we bound Equation (78) by exploiting that S̃_{i,t,τ} ≤ S_{i,t,τ}, ∀i, which follows from the definition of S̃_{i,t,τ}. Note that (F^B)^{−1}_{s+1, µ̃_{i*_t,t}−sqrt(5 log τ/s)}(U_t) is independent of Σ̃_{i*_t,t,τ,s}. Consider, for a chosen s, two i.i.d. sequences of random variables {X_{1,l}}_{l=1}^{s+1} and {X_{2,l}}_{l=1}^{s} of size s + 1 and s, respectively:

X_{1,l} ∼ Be(µ̃_{i*_t,t} − sqrt(5 log τ / s)), (82)
X_{2,l} ∼ D, (83)

whose summations correspond to the r.h.s. and l.h.s. of the inequality that is the argument of the probability operator in Equation (80), respectively. In Equation (82), we denote with Be(µ) a Bernoulli distribution with mean µ, and, in Equation (83), we denote with D a discrete distribution defined over Ω = {µ_{i*_t,t} − µ_{i*_t,s} − στ, 1 + µ_{i*_t,t} − µ_{i*_t,s} − στ} and with expected value equal to µ̃_{i*_t,t}. Let {Z_l}_{l=1}^{s} be another i.i.d. sequence of random variables, with Z_l := X_{2,l} − X_{1,l}, having support of measure 2 and E[Z_l] = sqrt(5 log τ / s).13 We get:

P(Σ̃_{i*_t,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ̃_{i*_t,t}−sqrt(5 log τ/s)}(U_t)) ≤ P(Σ_{l=1}^{s} Z_l − X_{1,s+1} ≤ 0) ≤ P(Σ_{l=1}^{s} (Z_l − E[Z_l]) ≤ −(sqrt(5s log τ) − 1)),

13. Similarly to what we do in the proof of Theorem 1, here we focus only on the case in which the sequence {X_{1,l}}_{l=1}^{s+1} is well defined.
where we use the property that s > n_A implies sqrt(5s log τ) − 1 > sqrt(4s log τ). We apply the Hoeffding's inequality to the bounded martingale difference sequence {Z_l}_{l=1}^{s} and we get:

P(Σ̃_{i*_t,t,τ,s} ≤ (F^B)^{−1}_{s+1, µ̃_{i*_t,t}−sqrt(5 log τ/s)}(U_t)) ≤ exp(−(sqrt(4s log τ))²/(2s)) = e^{−2 log τ}. (90)

Finally, we get:

R_A ≤ 19 (N/τ) log τ + 2N/τ + 19 log τ + 1. (92)

Let us focus on R_B. Let us define ˆµ_{i,t,τ} := Σ_{s=t−τ+1}^{t} X_{i,s} 1{i_s = i} / T_{i,t,τ}, i.e., the estimator of the expected value of the rewards of arm a_i computed over the last τ rounds, and its expected value µ̄_{i,t,τ} := Σ_{s=t−τ+1}^{t} µ_{i,s} 1{i_s = i} / T_{i,t,τ}. Note that µ̄_{i,t,τ} ≥ µ_{i,t} − στ due to Assumption 2. We can rewrite the term R_B and apply Lemma 3 with n_B = ⌈20 log τ / (∆ − 2στ)²⌉ and n'_B = ⌈32 log τ / (∆ − 2στ)²⌉ as follows:

R_B = Σ_{t∈F^C_{∆,N}} P(u_{T_{i,t,τ}} > µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), i_t = i) (93)
≤ Σ_{t∈F^C_{∆,N}} P(T_{i*_t,t,τ} ≤ n_B) + Σ_{t∈F^C_{∆,N}} P(T_{i,t,τ} ≤ n'_B)
 + Σ_{t∈F^C_{∆,N}} P(u_{T_{i,t,τ}} > µ_{i*_t,t} − στ − sqrt(5 log τ / T_{i*_t,t,τ}), T_{i*_t,t,τ} > n_B, T_{i,t,τ} > n'_B) (95)
≤ (N/τ) 52 log τ / (∆ − 2στ)² + 52 log τ / (∆ − 2στ)² + 2
 + Σ_{t∈F^C_{∆,N}} P(ˆµ_{i,t,τ} > µ̄_{i,t,τ} + (∆_{i,t} − 2στ)/4, T_{i,t,τ} > n'_B) (99)
≤ (N/τ) 52 log τ / (∆ − 2στ)² + 52 log τ / (∆ − 2στ)² + 2
 + Σ_{t∈F^C_{∆,N}} P(ˆµ_{i,t,τ} > µ̄_{i,t,τ} + (∆ − 2στ)/4, T_{i,t,τ} > n'_B), (100)

where we use the property ∆_{i,t} > ∆, ∀t ∈ F^C_{∆,N}, to bound Equation (99). By using Corollary 21 by Garivier and Moulines (2008), we have, for every η > 0:

P(ˆµ_{i,t,τ} > µ̄_{i,t,τ} + (∆ − 2στ)/4, T_{i,t,τ} > n'_B) ≤ ⌈log τ / log(1 + η)⌉ exp(−12 log τ (1 − η²/16)), (101)

thus, by using η = 4/sqrt(12), we have:

P(ˆµ_{i,t,τ} > µ̄_{i,t,τ} + (∆ − 2στ)/4, T_{i,t,τ} > n'_B) ≤ (1 + log τ)/τ. (102)

Hence, we can write:

R_B ≤ (N/τ) 52 log τ / (∆ − 2στ)² + 2N/τ + (N/τ) log τ + 52 log τ / (∆ − 2στ)² + 2. (103)

Let us focus on R_C. The term R_C is upper bounded by:

R_C = Σ_{t∈F^C_{∆,N}} P(ϑ_{i,t} ≥ q_{T_{i,t,τ}}) = Σ_{t∈F^C_{∆,N}} 1/τ ≤ N/τ. (104)

Pseudo-regret. Summing all the derived bounds, the dynamic pseudo-regret becomes:

R_N(U) = E[Σ_{t=1}^{N} (µ_{i*_t,t} − µ_{i_t,t})] ≤ Σ_{i=1}^{K} E[T_i(F_{∆,N})] + Σ_{i=1}^{K} E[T_i(F^C_{∆,N})] (105)
≤ |F_{∆,N}| + K (R_A + R_B + R_C) (106)
≤ F∆N^β + K (19 (N/τ) log τ + 2N/τ + 19 log τ + 1)
 + K ((N/τ) 52 log τ/(∆ − 2στ)² + 2N/τ + (N/τ) log τ + 52 log τ/(∆ − 2στ)² + 2) + K N/τ (107)
= F∆N^β + (NK/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ) + K (52 log τ/(∆ − 2στ)² + 3 + 19 log τ), (108)

where we use the property Σ_{i=1}^{K} E[T_i(F_{∆,N})] ≤ |F_{∆,N}| ≤ F∆N^β, which holds as Assumption 3 is satisfied.

Corollary 3. If the SW-TS policy is run over an SC-MAB setting with no switches between expected rewards of the arms (P = 0), in which Assumption 3 holds with β ∈ [1 − log_N(∆_0/(2σ)), 1], and using a sliding window τ := N^{1−β}, for each ∆ ≤ ∆_0 the dynamic pseudo-regret is at most:

R_N(U) = O(N^β).

Proof. It is easy to derive that, if we choose a sliding window τ := N^{1−β}, we minimize the asymptotic dynamic pseudo-regret w.r.t. N. By substituting τ in the expression of the dynamic pseudo-regret bound of Theorem 2, we obtain the result stated in Corollary 3. However, the result stated in Theorem 2 holds only if the condition 2στ ≤ ∆ ≤ ∆_0 is satisfied. Since we set τ = N^{1−β}, we have that:

2σN^{1−β} ≤ ∆_0 ⇔ β ≥ 1 − log_N(∆_0/(2σ)),

which concludes the proof.

Corollary 4. If the SW-TS policy is run over an SC-MAB setting with P ∈ N switches between expected rewards of the arms, in which Assumption 3 holds with:

β ∈ [max{1 − log_N(∆_0/(2σ)), (1 + log_N(4P/(F∆)))/2}, 1],

where max{a, b} denotes the maximum between a and b and F is defined in Assumption 3, and using a sliding window τ := N^{1−β}, for each ∆ ≤ ∆_0 the dynamic pseudo-regret is at most:

R_N(U) = O(N^β).

Proof.
If two arms switch their average rewards over time, i.e., there exists t s.t. (µ_{i,t} − µ_{j,t})(µ_{i,t+1} − µ_{j,t+1}) ≤ 0, then the set F_{∆,N} is nonempty. In particular, for every switch, we have at least 2∆/σ rounds during which the difference of the average rewards is smaller than ∆. This is because a variation of the expected reward by ∆ occurs in at least ∆/σ rounds before the switch and ∆/σ rounds after it.14 Specifically, given ∆ and having P switches over the time horizon N, the number of rounds belonging to F_{∆,N} is at least:

2P∆/σ ≤ |F_{∆,N}| ≤ F∆N^β, (111)

where the second inequality comes from Assumption 3, thus leading to:

σ ≥ 2P/(FN^β). (112)

From the assumption that ∆ ≥ 2στ and given that the length of the sliding window is τ = N^{1−β}, we have:

σ ≤ ∆/(2N^{1−β}). (113)

To make both Equations (112) and (113) hold at the same time, we need to have an SC-MAB problem s.t.:

2P/(FN^β) ≤ ∆/(2N^{1−β}) (114)
F∆N^β ≥ 4PN^{1−β} (115)
N^{2β−1} ≥ 4P/(F∆). (116)

Given that Inequality (116) holds and that the condition β ∈ [1 − log_N(∆_0/(2σ)), 1], derived in Corollary 3, is satisfied, we obtain the following range for β:

β ∈ [max{1 − log_N(∆_0/(2σ)), (1 + log_N(4P/(F∆)))/2}, 1].

14. We assume, for the sake of simplicity, that ∆ is a multiple of σ.

Corollary 5. If the SW-TS policy is run over an SC-MAB setting in which Assumption 3 holds with β = 1, and using a sliding window τ = Cσ^{−1/2} for a constant C > 0, the average pseudo-regret is:

AR_N(U) = O(σ^{1/2}).

Proof. Defining ε := ∆ − 2στ (ε ∈ (0, 1]), independent of σ and τ, the dynamic pseudo-regret of the SW-TS algorithm is bounded by:

AR_N(U) = R_N(U)/N ≤ F(∆ − 2στ) + 2Fστ + (K/τ)(52 log τ/ε² + log τ + 5 + 19 log τ) + (K/N) o(log τ) (121)
= Fε + 2Fστ + (K/τ)(52 log τ/ε² + log τ + 5 + 19 log τ) + (K/N) o(log τ). (122)

By substituting τ = Cσ^{−1/2}, we have:

AR_N(U) ≤ Fε + 2FCσ^{1/2} + (58Kσ^{1/2}/(Cε²)) log(Cσ^{−1/2}) + o(σ^{1/2}), (123)

which, as N → +∞, provides the inequality given in the corollary statement.

Appendix D. Abrupt and Smoothly Changing Setting: Proofs

Theorem 3. If the SW-TS policy is run over an ASC-MAB setting with X_{i,t} ∼ Be(µ_{i,t}), Lipschitz constant σ > 0 as in Assumption 4, and there exists ∆_0 ∈ (0, 1) as in Assumption 3, for any τ ∈ N and any ∆ s.t. 2στ < ∆ ≤ ∆_0, the dynamic pseudo-regret after N rounds is at most:

R_N(U) ≤ F∆N^β + τBN^α + (NK/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ),

where B and α are defined in Assumption 1 and F and β are defined in Assumption 3.

Proof. Let us define:
– F_{∆,N} := {t ∈ {1, . . . , N} s.t. ∃ i ≠ j, |µ_{i,t} − µ_{j,t}| < ∆}, i.e., the set of the rounds in which there are two arms whose expected values differ by less than ∆;
– F^C_{∆,N} := {τ, . . . , N} \ F_{∆,N}, i.e., the set of the rounds t ≥ τ in which the expected rewards of the arms are well separated (|µ_{i,t} − µ_{j,t}| ≥ ∆, ∀ i ≠ j);
– F^C_{∆,φ} := {b_{φ−1}, . . . , b_φ} \ F_{∆,N}, i.e., the set of the rounds of phase F_φ in which the expected rewards of the arms are well separated;
– F'^C_{∆,φ} := {b_{φ−1} + τ, . . . , b_φ} \ F_{∆,N}, i.e., the set of the rounds of phase F_φ, except the first τ rounds, in which the expected rewards of the arms are well separated;
– T_i(F) := Σ_{t∈F} 1{i_t = i, i ≠ i*_t}, i.e., the number of rounds the arm a_i is played when it is not optimal during rounds t ∈ F.

For every ∆ s.t. 2στ < ∆ ≤ ∆_0, we have that:

R_N(U) = E[Σ_{t=1}^{N} (µ_{i*_t,t} − µ_{i_t,t})] ≤ Σ_{i=1}^{K} Σ_{t=1}^{N} E[1{i_t = i, i ≠ i*_t}]
= Σ_{i=1}^{K} E[T_i(F_{∆,N})] + Σ_{i=1}^{K} E[T_i(F^C_{∆,N})]
≤ Σ_{i=1}^{K} E[T_i(F_{∆,N})] + Σ_{i=1}^{K} Σ_{φ=1}^{B_N} E[T_i(F^C_{∆,φ})]
≤ Σ_{i=1}^{K} E[T_i(F_{∆,N})] + τB_N + Σ_{i=1}^{K} Σ_{φ=1}^{B_N} E[T_i(F'^C_{∆,φ})]. (127)

The first term in Equation (127), i.e., E[T_i(F_{∆,N})], can be directly bounded by using Assumption 3.
Each element E[T_i(F'^C_{∆,φ})] of the summation in the third term of Equation (127) can be bounded as we do for the second term of Equation (66) in the proof of Theorem 2, using an appropriate time horizon of length N_φ − τ for phase F_φ. This follows since, when we focus on rounds belonging to a single phase, the setting is an SC-MAB. Formally, we have:

E[T_i(F'^C_{∆,φ})] ≤ ((N_φ − τ)/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ)
≤ (N_φ/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ). (129)

Here, we omit the details of the derivation of the last equation: they are the same as in the proof of Theorem 2, except for the fact that the time horizon we use here is N_φ − τ instead of N and that the quantities ⌈(N_φ − τ)/τ⌉ can be upper bounded by N_φ/τ. This leads to Equation (129), which, differently from Equation (110), lacks some of the terms that do not depend on N. Finally, we have:

R_N(U) ≤ Σ_{i=1}^{K} E[T_i(F_{∆,N})] + τB_N + Σ_{i=1}^{K} Σ_{φ=1}^{B_N} E[T_i(F'^C_{∆,φ})] (130)
≤ F∆N^β + τBN^α + Σ_{φ=1}^{B_N} K (N_φ/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ) (131)
≤ F∆N^β + τBN^α + (NK/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ), (132)

where we use Assumption 1 and Assumption 3 in Equation (131), and we use the property Σ_{φ=1}^{B_N} N_φ = N in Equation (132). This concludes the proof.

Corollary 6. If the SW-TS policy is run over an ASC-MAB setting with no switches between expected rewards of the arms (P = 0), and Assumption 1 and Assumption 3 hold with α ∈ (1 − 2 log_N(∆_0/(2σ)), 1) and β ∈ (0, 1), respectively, for each ∆ ≤ ∆_0, using a sliding window of τ := N^{(1−α)/2}, the dynamic pseudo-regret is at most:

R_N(U) = O(N^{(1+α)/2}) if β ≤ (1 + α)/2,  R_N(U) = O(N^β) if β > (1 + α)/2.

Proof. The problem of choosing the proper order of the sliding window reduces here to minimizing the order in N of each of the first three terms of the dynamic pseudo-regret bound in Theorem 3. More specifically, the dynamic pseudo-regret can be bounded by:

R_N(U) ≤ C_1 N^β + C_2 τ N^α + C_3 N/τ,

where C_1, C_2, and C_3 are proper constants. Using a sliding window τ := N^γ, with γ ∈ (0, 1), the minimization of the dynamic pseudo-regret order can be obtained by choosing the value of γ that minimizes max{β, α + γ, 1 − γ}. If β ≤ (1 + α)/2, the minimum is obtained with γ = (1 − α)/2 and, thus, the overall dynamic pseudo-regret is of order O(N^{(1+α)/2}). If β > (1 + α)/2, the minimization problem does not admit a single solution. However, a solution is γ = (1 − α)/2, thus providing an overall dynamic pseudo-regret of order O(N^β). In this case, the condition on the sliding window 2στ ≤ ∆_0 is:

2σN^{(1−α)/2} ≤ ∆_0 ⇔ α ≥ 1 − 2 log_N(∆_0/(2σ)),

which concludes the proof.

Corollary 7. If the SW-TS policy is run over an ASC-MAB setting with P ∈ N switches between expected rewards of the arms, Assumption 1 and Assumption 3 hold with α ∈ (1 − 2 log_N(∆_0/(2σ)), 1) and β ∈ (0, 1), respectively, for each ∆ ≤ ∆_0, using a sliding window of τ := N^{(1−α)/2}, if β ≥ (1 − α)/2 + log_N(4P/(F∆)) holds, the dynamic pseudo-regret is at most:

R_N(U) = O(N^{(1+α)/2}) if β ≤ (1 + α)/2,  R_N(U) = O(N^β) if β > (1 + α)/2.

Proof. The analysis is similar to the one provided for Corollary 6. In addition to the properties we require in Corollary 6, we need to enforce another condition on the parameters α and β, due to the number of switches between the expected rewards of the arms. More specifically, as shown in Equation (111), we have:

σ ≥ 2P/(FN^β), (135)

and, since we use a sliding window of τ = N^{(1−α)/2}, we require that:

σ ≤ ∆/(2N^{(1−α)/2}). (136)

Combining the previous inequalities, we have:

2P/(FN^β) ≤ ∆/(2N^{(1−α)/2}) ⇔ F∆N^β ≥ 4PN^{(1−α)/2} ⇔ N^{β−(1−α)/2} ≥ 4P/(F∆), (137)

which concludes the proof.

Corollary 8. If the SW-TS policy is run over an ASC-MAB setting in which Assumption 1 holds with α = 1, Assumption 3 holds with β = 1, and using a sliding window τ = CB^{−1/2}σ^{−1/2} for a constant C > 0, the average pseudo-regret is:

AR_N(U) = O(B^{1/2}σ^{1/2}).

Proof.
As in Corollary 5, let us define ε := ∆ − 2στ and, using the dynamic pseudo-regret bound in Theorem 3, we have:

AR_N(U) = R_N(U)/N ≤ F(∆ − 2στ) + 2Fστ + τB + (K/τ)(52 log τ/(∆ − 2στ)² + log τ + 5 + 19 log τ)
= Fε + 2Fστ + τB + (K/τ)(52 log τ/ε² + log τ + 5 + 19 log τ).

Using a sliding window τ = CB^{−1/2}σ^{−1/2}, we have:

AR_N(U) ≤ Fε + 2FCB^{−1/2}σ^{1/2} + (52KB^{1/2}σ^{1/2}/(Cε²)) log(CB^{−1/2}σ^{−1/2}) + o(B^{1/2}σ^{1/2}), (141)

which concludes the proof.

Appendix E. Smoothly Changing Setting: Satisfaction of Assumption 3 in the Experimental Setting of Section 5.2

We want to show that the number of rounds in which, for at least one pair i, j ∈ {1, . . . , K} with i ≠ j, the inequality |µ_{i,t} − µ_{j,t}| < ∆ holds is upper bounded by F∆, where F is defined in Assumption 3. Let us set ∆_0 = 1/3, which leads to ∆ ≤ 1/3. The evolution of the expected values of the arms over time in the SC-MAB we evaluate in Section 5 is the following:

µ_{i,t} = |1 − (i/(2(K − 1))) (1 + sin(tσ))|, i ∈ {1, . . . , K}.

If we are in phase F_{∆,N}, there exists a pair of indexes i and j, i ≠ j, such that:

|µ_{i,t} − µ_{j,t}| = | |1 − (i/(2(K − 1)))(1 + sin(tσ))| − |1 − (j/(2(K − 1)))(1 + sin(tσ))| | < ∆.

In the following, we distinguish the analysis into two cases: the one in which the arguments of the absolute values have the same sign, formally t s.t. (1 − (j/(2(K − 1)))(1 + sin(tσ)))(1 − (i/(2(K − 1)))(1 + sin(tσ))) > 0, and the one in which they have opposite signs, formally t s.t. (1 − (j/(2(K − 1)))(1 + sin(tσ)))(1 − (i/(2(K − 1)))(1 + sin(tσ))) < 0.

Case 1. Let us consider the case in which both 1 − (j/(2(K − 1)))(1 + sin(tσ)) > 0 and 1 − (i/(2(K − 1)))(1 + sin(tσ)) > 0. The same reasoning holds in the case in which both terms are negative, by inverting the roles of i and j. In the former case, the inequality becomes:

−K∆ < i − j < K∆.

If i > j, the inequality −K∆ < i − j is always satisfied, while we need to examine whether i − j < K∆ holds or not. The worst case is the one with i = K and j = 1, which gives K − 1 < K∆; this is always false for K > 2, since ∆ ≤ ∆_0 = 1/3. Thus, the set of rounds falling in this case is empty. If i < j, the inequality i − j < K∆ is always satisfied, while we need to verify −K∆ < i − j. The worst case is the one with i = 1 and j = K; thus, the same reasoning made for the previous case holds, and we have an empty set of rounds.

Case 2. Also in this case, we analyse only the case in which 1 − (j/(2(K − 1)))(1 + sin(tσ)) > 0 and 1 − (i/(2(K − 1)))(1 + sin(tσ)) < 0, the opposite one being analogous. The initial condition can be rewritten as a condition on sin(tσ) of the form a < sin(tσ) < b, for suitable quantities a and b depending on ∆, i, j, and K, which in turn corresponds to:

(1/σ) arcsin(a) < t < (1/σ) arcsin(b).

We are interested in the number of rounds for which these inequalities hold, i.e., the cardinality |{t : (1/σ) arcsin(a) < t < (1/σ) arcsin(b)}|, where |·| is the cardinality operator. By relying on the inequalities arcsin(x) ≤ 2x for x ≥ 0 and arcsin(x) ≥ 2x for x ≤ 0, we have that, if a ≤ 0 and b ≥ 0, we can write:

(1/σ)(arcsin(b) − arcsin(a)) ≤ (2/σ)(b − a) = 4K∆/(σ(K − 1)),

thus Assumption 3 is satisfied with F := 4K/(σ(K − 1)). Finally, we have to show that a ≤ 0 and b ≥ 0. Let us start with a ≤ 0. The values of the indexes minimizing a are i = 1 and j = 2; consequently, the resulting condition is satisfied, since ∆ ≤ ∆_0 = 1/3, for K ≥ 3. Let us now consider b ≥ 0.
Also in this case, the choice of i = 1 and j = 2 provides the most restrictive condition, and the resulting inequality holds under the same condition as the one obtained in the a ≤ 0 derivation.
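The counting argument above can also be checked empirically. The sketch below assumes a sinusoidal evolution of the expected rewards of the same flavour as the one analysed in this appendix (the exact dynamics of Section 5.2 may differ in constants) and simply counts the rounds in which two arms have expected rewards closer than ∆, i.e., an estimate of |F_{∆,N}|, for a few values of ∆, to be compared against the F∆N^β behaviour required by Assumption 3.

```python
import numpy as np

K, sigma, N = 5, 1e-3, 200_000

def mu(i, t):
    # Assumed sinusoidal dynamics, in the spirit of the SC-MAB setting above
    # (illustrative; not necessarily the exact formula of Section 5.2).
    return 0.5 * (i / (K - 1)) * (1.0 + np.sin(sigma * t))

t = np.arange(N)
means = np.stack([mu(i, t) for i in range(K)])            # shape (K, N)

for delta in (0.05, 0.1, 0.2):
    # Round t belongs to F_{delta,N} if some pair of arms is delta-close.
    diffs = np.abs(means[:, None, :] - means[None, :, :])  # shape (K, K, N)
    iu = np.triu_indices(K, k=1)
    close = (diffs[iu[0], iu[1], :] < delta).any(axis=0)
    print(f"delta={delta}: |F_delta_N| ~ {int(close.sum())} rounds out of {N}")
```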