# Smooth Non-stationary Bandits

Su Jia¹, Qian Xie¹, Nathan Kallus¹, Peter I. Frazier¹

¹Cornell University, Ithaca, New York, USA. Correspondence to: Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time, and they guarantee Θ(T^{2/3}) regret. However, in practice environments often change smoothly, so such algorithms may incur higher-than-necessary regret in these settings and do not leverage information on the rate of change. We study a non-stationary two-armed bandit problem where we assume that an arm's mean reward is a β-Hölder function over (normalized) time, meaning it is (β − 1)-times Lipschitz-continuously differentiable. We show the first separation between the smooth and non-smooth regimes by presenting a policy with O(T^{3/5}) regret for β = 2. We complement this result with an Ω(T^{(β+1)/(2β+1)}) lower bound for any integer β ≥ 1, which matches our upper bound for β = 2.

1. Introduction

As a fundamental variant of the MAB problem, non-stationary bandits provide a middle ground between stochastic bandits (Lai et al., 1985) and adversarial bandits (Auer et al., 2002). In the standard non-stationary model (Besbes et al., 2014), the mean reward function is adversarially chosen in advance, and rewards are realized stochastically. The adversary is confined by a total variation budget V: the mean reward function r_a(t) of every arm a is required to be Lipschitz and have total variation Σ_{t=1}^{T−1} |r_a(t) − r_a(t+1)| ≤ V. The problem under this framework is well understood. In particular, an optimal regret bound O(V^{1/3} T^{2/3}) is known (Besbes et al., 2014), even when V is unknown (Cheung et al., 2019). This model may nonetheless be overly pessimistic for some applications, as it allows the adversary to instantaneously shock the reward function's slope. In many applications, however, the underlying environment changes in a smooth manner: temperature, demand for seasonal products, and economic factors, to name a few. This motivates us to consider adversaries constrained to choose reward functions that are smooth in time. We model the level of smoothness by borrowing a standard concept from non-parametric statistics, the Hölder class. As formally defined in Section 2, a function is β-Hölder if its first (β − 1) derivatives exist and are Lipschitz. In particular, for β = 1 our model recovers the model of Besbes et al. (2014) with V = O(1), which has an optimal Θ(T^{2/3}) regret. By setting β > 1, we constrain the adversary more than in past literature. This motivates the following question:

Can we break the O(T^{2/3}) regret bound (i.e., the optimal bound for β = 1) under smooth non-stationarity (i.e., β ≥ 2)?

In this work, we provide an affirmative answer by showing an O(T^{3/5}) upper bound for β = 2. A natural idea would be to predict the derivative of the reward function and then make decisions based on the predicted trend. Surprisingly, our algorithm, which achieves the optimal regret, does not use any derivative information.
Moreover, we show that this bound is in fact nearly optimal (unless stated otherwise, "nearly optimal" means optimal up to logarithmic factors): for any integer β ≥ 1, we show that every policy has worst-case regret Ω(T^{(β+1)/(2β+1)}).

1.1. Related Work

Past multi-armed bandit literature has recognized the importance of non-stationarity by considering several aspects: contextual information (Luo et al., 2018; Russac et al., 2019); uncertainty in the number of changes (Auer et al., 2019; Chen et al., 2019); and Bayesian prior information (Trovò et al., 2020). However, these papers make no smoothness assumptions. (The smoothly changing setting in Trovò et al. (2020) assumes Lipschitz reward functions that may be non-differentiable, and is hence different from our work.) There is another line of related work that considers reward functions generated by a stochastic process. For example, when the reward function is drawn from a known reflected Brownian motion with variance σ^2, the optimal regret is known to be O(kσ^2). Other evolution models include other Gaussian processes (Grünewälder et al., 2010) and discrete Markov chains (Zhou et al., 2021). Beyond bandits, model smoothness and the Hölder class are often studied in non-parametric statistics (Györfi et al., 2002; Tsybakov, 2004). In the MAB literature, Hölder smoothness has been considered in contextual bandits; see, e.g., Hu et al. (2020) and Gur et al. (2022).

1.2. Our Contributions

As our first and most important contribution, we present the first separation for non-stationary bandits between the smooth (β ≥ 2) and non-smooth (i.e., classical, or β = 1) regimes: we develop a policy that we show achieves O(T^{3/5}) regret when β = 2. This policy and regret bound also apply to instances with β > 2, since a β-Hölder function with β > 2 is also 2-Hölder. This is asymptotically lower than the optimal Θ(T^{2/3}) regret bound for the classical β = 1 setting (Besbes et al., 2014), and shows that smoothness can be exploited to reduce regret. On the technical level, our analysis relies on an amortization argument which, as a byproduct, can also be applied to give an alternative proof of the O(T^{2/3}) upper bound when β = 1.

As our second contribution, we provide an Ω(T^{(β+1)/(2β+1)}) lower bound on regret that is valid for every integer β ≥ 1. This shows that our upper bound for β = 2 is tight up to logarithmic factors and that our proposed policy is nearly optimal. If our lower bound is tight for general β, then as β → ∞ it would suggest that it is possible to achieve T^{1/2 + O(1/β)} regret, which would almost match that of classical stationary bandits. If true, this would show that smoothness can be an effective replacement for stationarity as a way to achieve low regret.

Our third contribution is introducing the notion of smoothness in non-stationary online models, by borrowing a standard concept, Hölder smoothness, from non-parametric statistics. Our model fills the gap between the adversarial and model-based non-stationary models, and may open up a new direction in online decision making.

2. Formulation

Consider the following non-stationary bandit model. For each arm a ∈ {0, 1} and time t = 1, ..., T up to horizon T, let Z^t_a be an independent random variable taking values in [−1, 1]. An instance is given by the distributions of these variables. The reward function is r_a(t) := E[Z^t_a]. Its dependence on t makes this bandit model non-stationary.
We consider two cases: in the one-armed case r_0(t) = 0 for all t, while in the two-armed case we make no such restriction. We discuss generalizing to more arms in Section 7. In each round t ∈ [T] (for any integer T, we write [T] = {1, 2, ..., T}), the learner selects an arm A_t, whereupon they observe and receive a reward Y_t := Z^t_{A_t}. Based on the observed past rewards, they then select the next arm A_{t+1}, with the goal of maximizing the cumulative sum of rewards. This decision-making process is called a policy.

The problem would be trivial if the r_a(·)'s were known: in each round t the optimal policy chooses any a*_t ∈ argmax_{a ∈ {0,1}} r_a(t) and collects reward r*(t) = max_{a ∈ {0,1}} r_a(t) in expectation. When the r_a(·)'s are unknown, the policy needs to learn the reward functions on the fly. A standard metric for assessing a policy A is the regret, defined as the difference between the expected total reward of A and the optimum given knowledge of the r_a(t)'s.

Definition 2.1 (Regret). The regret of a policy A under instance r = {r_a(t)} is defined as Reg(A, r) = E[Σ_{t=1}^{T} (r*(t) − Z^t_{A_t})]. For a family F of instances, the worst-case regret of A is max_{r ∈ F} Reg(A, r). The minimax regret is the minimum achievable worst-case regret among all policies. (Our notion of regret is sometimes called the dynamic regret, since the arm in the benchmark may change over time. Distinct from this, there is a substantial literature where the regret is measured against the best fixed arm, e.g., in adversarial bandits.)

To define smoothness for a bandit instance, which is a collection of functions on a discrete domain [T], we first define smoothness for functions on a continuous domain.

Definition 2.2 (Hölder Class). For an integer β ≥ 1 and L > 0, we say a function f : [0, 1] → R is β-Hölder and write f ∈ Σ(β, L) if (i) f is (β − 1)-times differentiable, and (ii) f^{(β−1)} and f are both L-Lipschitz.

For example, consider the values β = 1 and β = 2, which are the most important for our analysis. One can easily verify that f ∈ Σ(1, L) if and only if f is L-Lipschitz, and that f ∈ Σ(2, L) if and only if f is differentiable and f, f′ are both L-Lipschitz. A bandit instance is β-Hölder if its reward function can be embedded into a β-Hölder function in the following sense.

Definition 2.3 (Smooth Non-stationary Bandit Instance). A non-stationary bandit instance r = {r_a(t)} is called β-Hölder if for each arm a there exists a function µ_a ∈ Σ(β, L) with domain [0, 1] such that r_a(t) = µ_a(t/T) for any t ∈ [T]. We denote by Σ_k(β, L) the family of all β-Hölder instances with k arms.

For example, in the one-armed case, consider r_1(t) = f(t/T) for f(x) = |x − 1/2|. This instance is 1-Hölder but not 2-Hölder. In fact, for any differentiable function g with r_1(t) = g(t/T), the derivative g′ must change from −1 to +1 inside an interval of length O(1/T). This means the Lipschitz constant of g′ cannot be bounded by any absolute constant. We emphasize that in Definition 2.3, µ_a is defined on the normalized time scale [0, 1] while r_a is defined on {1, ..., T}. To avoid confusion, we will use x for the [0, 1] scale and t for the {1, ..., T} scale.
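To make the formulation concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the function names, the particular µ_a, and the {±1} reward model are assumptions) that discretizes smooth mean-reward functions into an instance r_a(t) = µ_a(t/T) and estimates the dynamic regret of an arbitrary policy by simulation. For efficiency it accumulates the expected per-round regret r*(t) − r_{A_t}(t) rather than the noisy realized differences.

```python
import numpy as np

def make_instance(T, mu0, mu1):
    """Discretize smooth mean-reward functions on [0,1] to r_a(t) = mu_a(t/T)."""
    x = np.arange(1, T + 1) / T
    return np.vstack([mu0(x), mu1(x)])            # shape (2, T): row a holds r_a(t)

def simulate_regret(policy, r, rng, n_rep=20):
    """Monte-Carlo estimate of Reg(A, r) = E[sum_t r*(t) - Z^t_{A_t}]."""
    T = r.shape[1]
    r_star = r.max(axis=0)
    total = 0.0
    for _ in range(n_rep):
        history = []
        for t in range(T):
            a = policy(t, history)                                # arm in {0, 1}
            z = 1.0 if rng.random() < (1 + r[a, t]) / 2 else -1.0  # {+-1} reward, mean r_a(t)
            history.append((a, z))
            total += r_star[t] - r[a, t]                          # expected per-round regret
    return total / n_rep

# Example: a 2-Holder instance (both mu and mu' are Lipschitz on [0, 1]).
T = 10_000
r = make_instance(T, mu0=lambda x: 0.0 * x, mu1=lambda x: 0.3 * np.sin(2 * np.pi * x))
rng = np.random.default_rng(0)
print(simulate_regret(lambda t, h: t % 2, r, rng))                # a naive alternating policy
```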
Finally, we note that Manegueu et al. (2021) also considered Hölder-continuous rewards satisfying |µ_a(t) − µ_a(t′)| ≤ |t − t′|^β for any t, t′ ∈ [0, 1] (their α is our β); see their page 11. This is only meaningful for β ≤ 1, since otherwise the function is necessarily constant. To generalize to β > 1, our Hölder class requires differentiability and Hölder-continuity of the highest derivative. In this sense, Manegueu et al. (2021) considered β < 1, Besbes et al. (2014) considered β = 1, and we are the first to consider β > 1, which is our main contribution. The leap from β = 1 to β > 1 is more challenging than from β = 1 to β < 1: the former must leverage derivatives, whereas the latter can still work with the functions themselves while tracking general exponents β < 1 in arguments similar to the case β = 1.

3. Lower Bounds

We show an Ω(T^{(β+1)/(2β+1)}) lower bound on regret for the β-Hölder family of bandit instances. To show this bound, we construct a family F_β of β-Hölder instances and prove that every policy suffers the claimed regret on it. We describe this family at a high level using Figure 1.

Figure 1. Illustration of the family F_β of reward functions. Shown are snapshots of the curves on the two epochs [x_j, x_{j+1}] and [x_{j+1}, x_{j+2}]. For any combination of red or blue curves, the change at any endpoint is smooth: both red and blue have zero derivative at every x_j.

Fix some δ > 0 and partition [0, 1] into epochs [x_j, x_{j+1}] with x_j = 6jδ for j = 0, 1, ..., m − 1, where m = 1/(6δ). For each reward function in this family, its restriction to each epoch is either constant, on the order of δ^β (colored red), or a bowl-shaped curve (colored blue). The family F_β consists of all 2^m such choices. A bowl curve on [x_j, x_{j+1}] is (β − 1)-times differentiable, with vanishing derivatives at x_j and x_{j+1}, which enables a smooth concatenation with a constant curve or another bowl curve. As shown in Figure 1, over the two epochs [x_j, x_{j+1}] and [x_{j+1}, x_{j+2}], an instance can correspond to each of the 2^2 = 4 combinations of constant and bowl-shaped curves: a constant curve (all red); the curve constituted by two bowl-shaped curves, one on each epoch (all blue); and the two combinations of constant and bowl-shaped curves (one red then blue; the other blue then red). We formally define bowl-shaped curves and the family of instances F_β in Section 3.1, then use it to show our lower bound in Section 3.2.

3.1. Constructing the Family F_β

Before constructing our family of instances F_β for the lower bound, the first question is: do such bowl-shaped curves even exist? More precisely, is there a function which (i) has vanishing derivatives up to order (β − 1) at the endpoints of an interval of length δ, and (ii) has maximum height Ω(δ^β)? We answer this affirmatively via an explicit construction.

Proposition 3.1. Fix an integer β ≥ 1. There exists a family {g_ε} of (β − 1)-times continuously differentiable functions, with g_ε defined on [0, ε], satisfying all of the following:
(i) (vanishing derivatives at the boundary) g_ε^{(j)}(0) = g_ε^{(j)}(ε) = 0 for any j = 1, ..., β − 1;
(ii) (monotonicity) g′_ε ≥ 0;
(iii) (polynomial growth) g_ε(ε) = Θ(ε^β) as ε → 0+; equivalently, g_ε(ε) = C_β(ε) ε^β where C_β(ε) = Θ(1);
(iv) (Lipschitz derivative) g_ε^{(β−1)} is 1-Lipschitz.

The proof of the above is rather technical and we defer the details to Appendix A. As Figure 1 illustrates, a bowl curve is obtained by connecting two rotated copies of g_{2δ} with a constant function.

Definition 3.2 (Bowl-Shaped Curves). Fix an integer β ≥ 1 and let δ = δ(β, T) = (2^{2(β+1)} C_β^2 T)^{−1/(2β+1)}. For j = 0, ..., m − 1 and x ∈ [0, 1], define µ_{j+1,r}(x) = C_β(2δ) (2δ)^β · 1_{[x_j, x_{j+1})}(x), and
µ_{j+1,b}(x) =
  −g_{2δ}(x − x_j) + (1/2) C_β(2δ) (2δ)^β,     if x ∈ [x_j, x_j + 2δ),
  −(1/2) C_β(2δ) (2δ)^β,                        if x ∈ [x_j + 2δ, x_j + 4δ),
  g_{2δ}(x − x_j − 4δ) − (1/2) C_β(2δ) (2δ)^β,  if x ∈ [x_j + 4δ, x_{j+1}),
  0,                                             otherwise.
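For intuition, the following sketch (our own illustration; it assumes the pyramid-based construction of Appendix A, under which C_2(ε) = 1/4) builds the β = 2 version of these curves numerically: g_ε is the integral of a width-ε pyramid, which gives g′_ε(0) = g′_ε(ε) = 0, a 1-Lipschitz derivative, and g_ε(ε) = ε²/4 = Θ(ε²); a blue epoch then glues a descending copy, a constant stretch, and an ascending copy of g_{2δ}.

```python
import numpy as np

def pyramid(x, w):
    """Tent of width w and slope +/-1; for beta = 2 this is g_eps'."""
    return np.minimum(x, w - x).clip(min=0.0)

def g(x, eps):
    """g_eps(x) = integral_0^x pyramid(s, eps) ds, in closed form; g_eps(eps) = eps^2/4."""
    x = np.clip(x, 0.0, eps)
    up = np.minimum(x, eps / 2) ** 2 / 2
    down = np.where(x > eps / 2, (eps / 2) * (x - eps / 2) - (x - eps / 2) ** 2 / 2, 0.0)
    return up + down

def bowl_epoch(x, x_j, delta):
    """Blue curve of Definition 3.2 on [x_j, x_j + 6*delta] for beta = 2."""
    h = g(np.array([2 * delta]), 2 * delta)[0]        # = C_2(2d) (2d)^2 = delta^2
    y = np.zeros_like(x)
    m1 = (x >= x_j) & (x < x_j + 2 * delta)
    m2 = (x >= x_j + 2 * delta) & (x < x_j + 4 * delta)
    m3 = (x >= x_j + 4 * delta) & (x < x_j + 6 * delta)
    y[m1] = -g(x[m1] - x_j, 2 * delta) + h / 2        # descend
    y[m2] = -h / 2                                    # flat stretch (arm 0 optimal here)
    y[m3] = g(x[m3] - x_j - 4 * delta, 2 * delta) - h / 2   # ascend back
    return y

delta = 0.05
xs_loc = np.linspace(0, 2 * delta, 500)
assert np.allclose(np.gradient(g(xs_loc, 2 * delta), xs_loc), pyramid(xs_loc, 2 * delta), atol=1e-3)
xs = np.linspace(0, 6 * delta, 1000, endpoint=False)
ys = bowl_epoch(xs, 0.0, delta)
print(ys[0], ys.min(), ys[-1])   # starts/ends near +delta^2/2, dips to -delta^2/2
```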
Each reward function in the family is specified by a binary vector that encodes its color in each epoch. The colors will be chosen adversarially by our lower bound in response to an algorithm's choices.

Definition 3.3 (The Family F_β). For any v = (v_1, ..., v_m) ∈ {r, b}^m, let µ_v(x) = Σ_{j=1}^{m} µ_{j, v_j}(x). We define F_β = {µ_v : v ∈ {r, b}^m}.

One can easily verify that for every v ∈ {r, b}^m the function µ_v is β-Hölder. In particular, by property (i) in Proposition 3.1, if x is a multiple of 2δ, then for any 1 ≤ i ≤ β − 1 the left and right order-i derivatives at x are both 0, ensuring a smooth transition. The construction of bowl-shaped curves illustrates the role of β: for a fixed δ, the total variation of the bowl-shaped curve is only O(δ^β). In other words, the more smoothness we ask for, the less drastically the bowl curve can vary, reducing regret. This suggests that the lower bound and the optimal regret should decrease in β.

3.2. The Lower Bound

We are now ready to state the lower bound. Note that our lower bound applies even if we restrict Z^t_a ∈ {±1}.

Theorem 3.4 (Lower Bound for Integer β). For any integer β ≥ 1 and any policy A, there exists v ∈ {r, b}^m such that Reg(A, v) > (1/24) · 2^{−β} · C_β^{−2β/(2β+1)} · T^{(β+1)/(2β+1)}.

As the key step, observe that for any x in the middle stretch [x_j + 2δ, x_j + 4δ] of an epoch, arm 0 is optimal under the blue curve and arm 1 is optimal under the red curve. We show that the probability of choosing the wrong arm in those rounds, summed over the two instances, is at least 1/2 under an adversarially chosen instance. This is true regardless of the algorithm and the shape of the reward function in the previous epochs. Formally, consider two instances that are identical up to the (j − 1)-st epoch. For any prefix u ∈ {r, b}^{j−1} and color χ ∈ {r, b}, we consider the instances µ_{u⊕r} and µ_{u⊕b}, where ⊕ denotes vector concatenation. Write t_j = x_j T. We show the following in Appendix A.3.

Lemma 3.5 (Likely to Select a Wrong Arm). For any t in the j-th epoch [t_{j−1}, t_j], any prefix u ∈ {r, b}^{j−1} and any policy A, we have P_{u⊕b}[A_t = 1] + P_{u⊕r}[A_t = 0] ≥ 1/2.

Observe that if the policy chooses a wrong arm at a time in the middle stretch of an epoch, then an Ω(δ^β) regret is incurred, leading to a high regret in this epoch. We show that for every policy, there is an instance where Ω(δ^β) regret is incurred in every epoch. Indeed, in F_β, the shape of the reward function in previous epochs imposes no constraints on its shape in future epochs. Thus, given a policy, the adversary can choose the reward curve's shape in the next epoch (encoded red or bowl), whichever leads to higher regret. In other words, we effectively have m separate instances, each with time horizon 6δT. We complete the proof by lower bounding the regret in each epoch separately using Lemma 3.5.

Proof of Theorem 3.4. Consider the regret on epoch j under instance v,
Reg_j(A, v) := E[Σ_{t=t_{j−1}}^{t_j} (r*_v(t) − Z^t_{A_t})], where r*_v(t) = max{0, µ_v(t/T)}.
Note that Reg_j(A, v) depends solely on the first j epochs, so we only need to specify the first j entries of v. Under this notation, by Lemma 3.5, for any prefix u ∈ {r, b}^{j−1},
Reg_j(A, u⊕r) + Reg_j(A, u⊕b) ≥ Σ_{t=t_{j−1}+2δT}^{t_{j−1}+4δT} (P_{u⊕b}[A_t = 1] + P_{u⊕r}[A_t = 0]) · δ^β ≥ 2δT · (1/2) · δ^β = δ^{β+1} T.
Thus for some color v_j ∈ {r, b}, we have Reg_j(A, u⊕v_j) ≥ (1/2) δ^{β+1} T. This inequality holds for any epoch j and prefix u ∈ {r, b}^{j−1}, so we can inductively construct a sequence v ∈ {r, b}^m with Reg_j(A, v[j]) ≥ (1/2) δ^{β+1} T for each j ∈ [m], where v[j] = (v_1, ..., v_j). Summing over j ∈ [m], we conclude that
Reg(A, v) = Σ_{j=1}^{m} Reg_j(A, v[j]) ≥ m · (1/2) δ^{β+1} T = (1/12) δ^β T.
Substituting δ = (2^{2(β+1)} C_β^2 T)^{−1/(2β+1)}, we obtain Reg(A, v) > (1/24) · 2^{−β} · C_β^{−2β/(2β+1)} · T^{(β+1)/(2β+1)}.
4. Algorithm and Analysis in the One-Armed Case

We begin with the one-armed setting, i.e., r_0 ≡ 0, and consider a single arm with unknown reward function r_1(t) = µ_1(t/T) where µ_1 ∈ Σ(β, L). We can then suppress the subscripts and write µ(t) = µ_1(t) and r(t) = r_1(t). The generalization to the two-armed version is straightforward; see Section 5.

4.1. The Budgeted Exploration Algorithm

Consider the following Budgeted Exploration (BE) policy, formally defined in Algorithm 1. The policy is specified by two parameters, the exploration budget B ≥ 1 and the epoch size Δ ∈ (0, 1). The policy partitions the normalized time scale [0, 1] into epochs [x_i, x_{i+1}] where x_i = iΔ for each i = 0, ..., Δ^{−1} − 1 (for simplicity we assume Δ^{−1} is an integer; this assumption is not essential), or equivalently, partitions {1, ..., T} into epochs [t_i, t_{i+1}] where t_i = x_i T. In each epoch, the algorithm pulls the changing arm (i.e., arm 1) from the start of the epoch until either (i) the epoch ends, or (ii) the budget B runs out, whereupon the algorithm selects the static arm (i.e., arm 0) in all remaining rounds of the epoch.

Algorithm 1 Budgeted Exploration Policy BE(B, Δ)
1: for i = 1, ..., Δ^{−1} do
2:   Select arm 1 from round t_i until round t_i + S̃_i, where S̃_i = min{S̄_i, ΔT − 1} and S̄_i := min{s ≥ 0 : Σ_{t=t_i}^{t_i+s} Z^t_1 ≤ −B}.
3:   Then select arm 0 from round t_i + S̃_i + 1 until t_{i+1}.
4: end for

We show that in the one-armed case, for suitable B and Δ, the BE algorithm achieves optimal regret bounds for β = 1, 2, both matching the lower bounds in Section 3.

Theorem 4.1 (Optimal Regret Bound, β = 1). For some B = O(T^{1/3}) and Δ = O(T^{−1/3}), we have Reg(BE(B, Δ), Σ_2(1, L)) = O(L^{1/3} T^{2/3} log^{1/3} T).

Remarkably, the following upper bound provides the first separation between smooth (β ≥ 2) and non-smooth (β = 1) bandits (Besbes et al., 2015).

Theorem 4.2 (Optimal Regret Bound, β = 2). For some B = O(T^{2/5}) and Δ = O(T^{−1/5}), we have Reg(BE(B, Δ), Σ_2(2, L)) = O(L^{1/5} T^{3/5} log^{2/5} T).

Although these bounds are achieved by the same algorithm, their analyses involve different rationales. Specifically, for β = 2 the analysis must utilize smoothness to reduce the regret below what is achievable when β = 1. We therefore prove them separately in Sections 4.3 and 4.4.

4.2. Preliminaries

Before presenting the proofs of Theorems 4.1 and 4.2, we first state and prove tools used in both proofs. We will focus on bounding the regret conditional on the following clean event, which will be shown to occur with high probability. Loosely, this is the event that the rewards over all sufficiently large time intervals obey Hoeffding's inequality.

Definition 4.3 (Clean Event). For any arm a and rounds t, t′ ∈ [T], consider the event C^{t,t′}_a := { |Σ_{s=t}^{t′} (Z^s_a − r_a(s))| ≤ √(6 log T · (t′ − t)) }. We define the clean event as C = ∩_{a,t,t′} C^{t,t′}_a, where the intersection is over all arms a and all t, t′ with t′ − t ≥ 2 log T.

We next show that C occurs with high probability.

Lemma 4.4 (The Clean Event Occurs w.h.p.). For any T, it holds that P[C] ≥ 1 − T^{−1}.

Proof. By Hoeffding's inequality (Vershynin, 2018, Theorem 2.2.6), for any 1 ≤ t ≤ t′ ≤ T with t′ − t ≥ 2 log T, taking δ = √(6 log T · (t′ − t + 1)), we have
P[¬C^{t,t′}_a] ≤ exp(−δ^2 / (2(t′ − t + 1))) = exp(−3 log T) = T^{−3}.
There are at most T^2 combinations of t, t′, so by the union bound, P[¬C] ≤ Σ_{a,t,t′} P[¬C^{t,t′}_a] ≤ T^{−1}.

4.3. Proof of Theorem 4.1

Before delving into our main focus, the β = 2 case, we analyze the β = 1 case as a warm-up. We will show a stronger statement than Theorem 4.1.

Proposition 4.5. Suppose 6ΔT log T ≤ B^2. Then Reg(BE(Δ, B), Σ_1(1, L)) ≤ Δ^{−1} (1 + 2LΔ^2 T + B).

We then obtain Theorem 4.1 by selecting Δ = L^{−2/3} T^{−1/3} log^{1/3} T and B = L^{−1/3} T^{1/3} log^{2/3} T.
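The epoch loop of Algorithm 1 is short enough to state in code. The sketch below is a minimal rendering of the one-armed policy (the reward model, the test function and the parameter values are our own illustrative assumptions, not the paper's experiments): within each epoch the changing arm is pulled until its running reward sum drops to −B or the epoch ends, after which the static arm is played for the remainder of the epoch.

```python
import numpy as np

def budgeted_exploration(r1, B, delta, rng):
    """One-armed BE(B, Delta): r1[t] is the mean of arm 1, arm 0 has mean 0.

    Returns the expected regret sum_t max(0, r1[t]) - r1[t] * 1{A_t = 1}."""
    T = len(r1)
    epoch_len = max(1, int(round(delta * T)))
    regret = 0.0
    for start in range(0, T, epoch_len):
        running, exploring = 0.0, True
        for t in range(start, min(start + epoch_len, T)):
            if exploring:
                z = 1.0 if rng.random() < (1 + r1[t]) / 2 else -1.0  # {+-1} reward, mean r1[t]
                running += z
                regret += max(0.0, r1[t]) - r1[t]
                if running <= -B:                                    # budget exhausted
                    exploring = False
            else:
                regret += max(0.0, r1[t])                            # playing the static arm 0
    return regret

# Illustrative run with the beta = 2 parameter scaling of Theorem 4.2.
T = 200_000
x = np.arange(1, T + 1) / T
r1 = 0.3 * np.sin(2 * np.pi * x)                 # a 2-Holder mean reward crossing zero
rng = np.random.default_rng(1)
print(budgeted_exploration(r1, B=T ** 0.4, delta=T ** -0.2, rng=rng))
```

The stopping rule mirrors line 2 of Algorithm 1; per Theorem 4.2, with (B, Δ) = (T^{2/5}, T^{−1/5}) the printed regret should grow roughly like T^{3/5} as T increases.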
At a high level, on each epoch the function µ either (i) is always positive, (ii) is always negative, or (iii) has a crossing, i.e., µ(x̄) = 0 for some x̄ in the epoch. We will bound the regret for these three types of epochs, referred to as positive, negative and crossing epochs, in Lemma 4.7, Lemma 4.8 and Lemma 4.9 respectively, and then combine them to complete the proof.

We first consider a positive epoch i. In this case, the optimal arm is arm 1, which coincides with the choice of the BE policy before the epoch's stopping rule is triggered. Thus, there is no regret in this epoch before time t_i + S̃_i. This can be shown formally by rephrasing Wald's classical identity as follows. Recall that (Z^t_1) are the rewards of arm 1.

Lemma 4.6 (Wald's Identity, Rephrased). For any epoch i, we have E[Σ_{t=t_i}^{t_i+S̃_i} Z^t_1] = E[Σ_{t=t_i}^{t_i+S̃_i} r(t)].

As a result, we only need to bound the regret incurred after t_i + S̃_i. We will do this by bounding the probability that the budget is ever exhausted, i.e., S̃_i < ΔT − 1, as detailed in the next lemma. Recall that R_t = max{0, r(t)} − Z^t_{A_t} is the regret in round t.

Lemma 4.7 (Regret on Positive Epochs). Suppose µ(x) > 0 for all x ∈ [x_i, x_{i+1}]. Then, whenever B^2 ≥ 6ΔT log T, the regret of BE(B, Δ) on epoch i satisfies E[Σ_{t=t_i}^{t_{i+1}} R_t] ≤ 1.

The above essentially follows from the definition of the clean event. When C occurs, the cumulative reward over the first s rounds in this epoch lies within an interval of width w(s) ≲ √s around its mean, which is positive (when illustrating high-level ideas, we use A ≲ B to denote A = O(B)). Further, by the assumed relation between B and ΔT, we have w(s) ≤ B whenever s ≤ ΔT. We defer the proof to Appendix B.2.

Now we turn to the negative epochs.

Lemma 4.8 (Regret on Negative Epochs). If µ(x) < 0 for all x ∈ [x_i, x_{i+1}], then the regret on epoch i satisfies E[Σ_{t=t_i}^{t_{i+1}} R_t] ≤ B + 1.

We defer the details to Appendix B.3. The above follows from the definition of the stopping time S̃_i. In fact, if the process never stops before the end of the epoch, then the cumulative reward is above −B. If it does stop at some s < ΔT, then the cumulative reward is just below −B, and hence above −(B + 1) since |Z^t_1| ≤ 1.

Finally we consider crossing epochs. The following essentially follows from the Lipschitzness of µ.

Lemma 4.9 (Regret on Crossing Epochs). Let j be a crossing epoch, i.e., µ(x̄) = 0 for some x̄ ∈ [x_j, x_{j+1}]. Then, the regret in this epoch satisfies E[Σ_{t=t_j}^{t_{j+1}} R_t] ≤ 2LΔ^2 T.

Proof. By Lipschitzness of µ, we have |µ(x)| = |µ(x) − µ(x̄)| ≤ LΔ whenever x_j ≤ x ≤ x_{j+1}. In the original time scale, this means |r(t)| ≤ LΔ whenever t_j ≤ t ≤ t_{j+1}, and hence Σ_{t=t_j}^{t_{j+1}} |r(t)| ≤ ΔT · LΔ = LΔ^2 T. To connect this with the regret, observe that E[Σ_{t=t_j}^{t_{j+1}} R_t] = E[Σ_{t=t_j}^{t_{j+1}} (max{0, r(t)} − Y_t)]. Moreover, for each t we have |E[Z^t_{A_t}]| ≤ |r(t)|, so E[Σ_{t=t_j}^{t_{j+1}} R_t] ≤ 2 Σ_{t=t_j}^{t_{j+1}} |r(t)| ≤ 2LΔ^2 T.

We are now ready to show the O(T^{2/3}) upper bound. To lighten notation, we will subsequently write R[i] = E[Σ_{t=t_i}^{t_{i+1}} R_t] for the regret on epoch i.

Proof of Proposition 4.5. Let J_+, J_−, J_x ⊆ {1, ..., Δ^{−1}} be the subsets of positive, negative and crossing epochs. Note that J_+ ∪ J_− ∪ J_x = {1, ..., Δ^{−1}}, so the total regret can be decomposed as
Σ_{i=1}^{Δ^{−1}} R[i] = Σ_{i ∈ J_−} R[i] + Σ_{i ∈ J_+} R[i] + Σ_{i ∈ J_x} R[i].   (1)
By Lemma 4.7, Lemma 4.8 and Lemma 4.9, whenever 6ΔT log T ≤ B^2, we have (1) ≤ |J_+| · 1 + |J_−| · (B + 1) + |J_x| · 2LΔ^2 T ≤ Δ^{−1} (1 + B + 2LΔ^2 T).
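To make the decomposition (1) concrete, the following small sketch (our own illustration; the particular µ, L and parameter choices are assumptions) classifies the epochs of a smooth µ as positive, negative or crossing and tallies the corresponding per-epoch bounds from Lemmas 4.7–4.9.

```python
import numpy as np

def epoch_bound_beta1(mu, L, B, delta, T):
    """Sum the per-epoch regret bounds of Lemmas 4.7-4.9 for the beta = 1 analysis."""
    n_epochs = int(round(1 / delta))
    total = 0.0
    counts = {"positive": 0, "negative": 0, "crossing": 0}
    for i in range(n_epochs):
        xs = np.linspace(i * delta, (i + 1) * delta, 200)
        vals = mu(xs)
        if np.all(vals > 0):
            counts["positive"] += 1; total += 1.0                      # Lemma 4.7
        elif np.all(vals < 0):
            counts["negative"] += 1; total += B + 1.0                  # Lemma 4.8
        else:
            counts["crossing"] += 1; total += 2 * L * delta ** 2 * T   # Lemma 4.9
    return counts, total

T, L = 100_000, 2 * np.pi * 0.3                  # L = Lipschitz constant of mu below
mu = lambda x: 0.3 * np.sin(2 * np.pi * x)
delta, B = T ** (-1 / 3), T ** (1 / 3)
print(epoch_bound_beta1(mu, L, B, delta, T))     # compare with the bound in Proposition 4.5
```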
4.4. Proof of Theorem 4.2

We next show the O(T^{3/5}) regret bound for β = 2. We will show the following bound for generic policy parameters, which implies Theorem 4.2 by choosing Δ = L^{−2/5} T^{−1/5} log^{1/5} T and B = L^{−1/5} T^{2/5} log^{3/5} T.

Proposition 4.10. Suppose 6ΔT log T ≤ B^2. Then Reg(BE(Δ, B), Σ_1(2, L)) ≤ 2Δ^{−1} (LΔ^3 T + B).

The proof strategy is similar to the β = 1 case. Specifically, note that Lemma 4.7 and Lemma 4.8 do not rely on the smoothness of µ, so they also hold for β = 2. As the key difference, however, we will derive a more fine-grained regret bound for the crossing epochs. To state this result, we classify the epochs based on whether µ′ ever vanishes on them.

Definition 4.11 (Stationary Points and Stationary Epochs). A point s ∈ [0, 1] is said to be stationary if µ′(s) = 0. An epoch is said to be stationary (and non-stationary otherwise) if it contains a stationary point.

As the key step, we need the following regret bound on the crossing epochs, which resembles Lemma 4.9 but is more refined, as it crucially relies on the fact that β = 2.

Lemma 4.12 (Key Lemma: Regret on Crossing Epochs). Let j be a crossing epoch and j + ℓ a stationary epoch, with ℓ possibly negative or 0. Moreover, suppose every epoch between them is non-stationary, i.e., epoch i is non-stationary whenever (j + ℓ − i)(j − i) < 0. Then, the regret in epoch j satisfies E[Σ_{t=t_j}^{t_{j+1}} R_t] ≤ 2L(|ℓ| + 1) Δ^3 T.

Proof. We assume ℓ ≥ 0 w.l.o.g. As the key observation, we first claim that |µ′| = O((ℓ + 1)Δ) on epoch j. In fact, since epoch j + ℓ is stationary, there exists s ∈ [x_{j+ℓ}, x_{j+ℓ+1}] ⊆ [0, 1] with µ′(s) = 0. Moreover, since µ′ is Lipschitz, for any x ∈ [x_j, x_{j+1}] we have |µ′(x)| = |µ′(x) − µ′(s)| ≤ L|x − s| ≤ L(ℓ + 1)Δ.

We next claim that |r(t)| = O((ℓ + 1)Δ^2) on epoch j. In fact, let x̄ ∈ [x_j, x_{j+1}] be any crossing, i.e., µ(x̄) = 0. Then for any x ∈ [x_j, x_{j+1}], by the mean value theorem there is some ζ between x and x̄, i.e., (ζ − x)(ζ − x̄) ≤ 0, such that |µ(x)| = |µ(x) − µ(x̄)| = |µ′(ζ)| · |x̄ − x|. By the previous claim, i.e., |µ′| ≤ L(ℓ + 1)Δ on epoch j, this implies |µ(x)| ≤ L(ℓ + 1)Δ · |x̄ − x| ≤ L(ℓ + 1)Δ^2. Translating this to the original time scale, we have |r(t)| ≤ L(ℓ + 1)Δ^2 for any t ∈ [t_j, t_{j+1}], and thus the claim holds. Finally, summing over t, we obtain
Σ_{t=t_j}^{t_{j+1}} |r(t)| ≤ L(ℓ + 1) Δ^3 T.   (2)
We use this to bound the regret. Note that E[Σ_{t=t_j}^{t_{j+1}} R_t] = E[Σ_{t=t_j}^{t_{j+1}} (max{0, µ(t/T)} − Y_t)].   (3)
Observe that for each t we have |E[Y_t]| ≤ |r(t)|, so (3) is bounded by 2 Σ_{t=t_j}^{t_{j+1}} |r(t)|. Combining this with (2), we conclude that (3) ≤ 2L(ℓ + 1) Δ^3 T.

The above lemma says that for the adversary to generate high regret on a crossing epoch, the nearest stationary point must be proportionally far away. This motivates us to consider an amortization scheme: bucket the epochs 1, ..., Δ^{−1} into contiguous blocks (called cycles) separated at stationary epochs, and then show that in each cycle the regret of each epoch is low on average. This proof strategy, however, assumes there is at least one stationary point. Thus, as the final building block, we need to handle the corner case where µ′ has no zero. We prove the following in Appendix B.

Lemma 4.13 (Corner Case). Suppose µ′(x) ≠ 0 for all x ∈ [0, 1]. Then Reg(BE(B, Δ), µ) ≤ LΔ^2 T + (B + 1) Δ^{−1}.

Now we are ready to prove the main upper bound.

Proof of Proposition 4.10. If there is no stationary point, i.e., µ′(x) ≠ 0 for all x ∈ [0, 1], then the desired bound follows immediately from Lemma 4.13. Otherwise, index the stationary epochs as s_1 < ... < s_n. Crucially, observe that for any j there is at most one crossing epoch between s_j and s_{j+1}, say i_x, if it exists.
Then by Lemma 4.12, R[i_x] ≤ 2L(|i_x − s_j| + 1) Δ^3 T ≤ 2L(s_{j+1} − s_j) Δ^3 T. Combining the above with Lemma 4.7 and Lemma 4.8, the total expected regret over the epochs from s_j to s_{j+1} satisfies
Σ_{s_j ≤ i < s_{j+1}} R[i] ≤ (s_{j+1} − s_j)(B + 1) + 2L(s_{j+1} − s_j) Δ^3 T.
Summing this over the cycles (the epochs before s_1 and after s_n are handled analogously, as in Lemma 4.13) and noting that the total number of epochs is Δ^{−1}, we obtain
Reg(BE(Δ, B), µ) ≤ Δ^{−1}(B + 1) + 2LΔ^{−1} Δ^3 T ≤ 2Δ^{−1}(LΔ^3 T + B),
as desired.

5. The Two-Armed Case

We now remove the restriction r_0 ≡ 0 and allow both arms' mean rewards to vary. The BE policy extends naturally: within each epoch, the two arms are alternated until the gap between their cumulative rewards exceeds the budget B, after which the leading arm is played for the remainder of the epoch; see Algorithm 2.

Algorithm 2 BE(B, Δ) Policy, Two-Armed Case
1: for i = 1, ..., Δ^{−1} do
2:   Select arm 0 in rounds t_i + 2k and arm 1 in rounds t_i + 2k + 1 for k = 0, 1, ..., S̃_i, where S̃_i = min{S̄_i, (ΔT − 1)/2} and S̄_i := min{k ≥ 0 : |Σ_{k′=0}^{k} (Z^{t_i+2k′}_0 − Z^{t_i+2k′+1}_1)| > B}.
3:   Let Â be the arm with the higher cumulative reward over rounds t_i to t_i + 2S̃_i + 1. Select Â from round t_i + 2(S̃_i + 1) until t_{i+1}.
4: end for

The analysis parallels the one-armed case. Consider any epoch i and budget B > 0. First, we show that the regret by time 2S̃_i is no greater than B + 1, using the definition of S̃_i and the boundedness of the rewards, as in the proof of Lemma 4.8. Then, we show that the expected regret after 2S̃_i is O(1), as in the proof of Lemma 4.7. This can be done by considering the event that the better arm, say arm 0, ever has cumulative reward lower than that of arm 1 by B. Concentration bounds show that this event occurs with low probability. Moreover, due to the boundedness of the rewards, when this event occurs the regret is also bounded, leading to O(1) regret.

6. Experiments

We implemented our algorithm in simulations on synthetic data in the one-armed setting. We consider the BE policy with parameters chosen to be optimal for the non-smooth and smooth environments respectively. Formally, we consider the policy BE(B, Δ) where the tuple (B, Δ) is chosen to be (T^{1/3}, T^{−1/3}) for non-smooth and (T^{2/5}, T^{−1/5}) for smooth non-stationary environments. We consider random trigonometric reward functions whose amplitudes, frequencies and phase shifts are randomly drawn. Specifically, in each instance we have r_0(t) = A and r_1(t) = A sin(2πνt/T + φ) + A, where ν ~ U[2.5, 5], A ~ N(0.25ν^{−2}, 0.001) and φ ~ U[0, 2π]. An astute reader may have noticed that A depends on ν when we generate the instances. This choice is actually quite natural. Indeed, consider µ(x) = A sin(2πνx + φ) + A. Note that µ′′ has a ν^2 A term, so by choosing A to scale as ν^{−2}, |µ′′| becomes bounded by an absolute constant and hence µ ∈ Σ(2, L) for some L = O(1).

We visualize the regret of the two policies via a log-log plot with time horizons T = 2^j for j = 20, 21, ..., 26; see Figure 2. Theoretically, the slope of a log-log curve should equal the exponent of the cumulative regret: if the cumulative regret is c·T^d, then the log-regret is log c + d log T.

Figure 2. Log-log plot of the averaged regret incurred by the policies BE(B = T^{1/3}, Δ = T^{−1/3}) and BE(B = T^{2/5}, Δ = T^{−1/5}) as a function of the time horizon T, with linear-relationship estimates. The averaged regrets are calculated across randomly generated instances with sinusoidal mean rewards.

Our simulations show that under smooth non-stationarity, the T^{3/5}-regret policy outperforms the T^{2/3}-regret policy. Moreover, the log-log curves of the two policies (in the order listed in the caption of Figure 2) have slopes 0.70 and 0.62 respectively, which are close to their theoretical values.
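A compact reproduction of this experimental setup is sketched below (our own code, not the authors' implementation; we interpret 0.001 as the variance of A, and the least-squares slope on the log-log points is an assumed, standard estimator). It reuses a minimal one-armed BE loop and compares the two parameterizations on smaller horizons than the paper's 2^20–2^26.

```python
import numpy as np

def be_regret(r1, B, delta, rng):
    """Expected regret of one-armed BE(B, Delta) on mean-reward sequence r1."""
    T, epoch = len(r1), max(1, int(round(delta * len(r1))))
    reg = 0.0
    for s in range(0, T, epoch):
        run, exploring = 0.0, True
        for t in range(s, min(s + epoch, T)):
            if exploring:
                run += 1.0 if rng.random() < (1 + r1[t]) / 2 else -1.0
                reg += max(0.0, r1[t]) - r1[t]
                exploring = run > -B                       # stop exploring once sum <= -B
            else:
                reg += max(0.0, r1[t])
    return reg

def random_sinusoid(T, rng):
    """r1(t) = A sin(2 pi nu t/T + phi) + A with nu ~ U[2.5,5], A ~ N(0.25 nu^-2, 0.001)."""
    nu = rng.uniform(2.5, 5.0)
    A = rng.normal(0.25 / nu ** 2, np.sqrt(0.001))
    phi = rng.uniform(0, 2 * np.pi)
    t = np.arange(1, T + 1)
    return A * np.sin(2 * np.pi * nu * t / T + phi) + A

rng = np.random.default_rng(0)
horizons = [2 ** j for j in range(14, 19)]                 # smaller than the paper's 2^20..2^26
for name, b_exp, d_exp in [("beta=1 tuning", 1 / 3, -1 / 3), ("beta=2 tuning", 2 / 5, -1 / 5)]:
    logs = []
    for T in horizons:
        reg = np.mean([be_regret(random_sinusoid(T, rng), T ** b_exp, T ** d_exp, rng)
                       for _ in range(5)])
        logs.append((np.log(T), np.log(max(reg, 1e-9))))
    slope = np.polyfit(*zip(*logs), 1)[0]                  # log-log slope ~ regret exponent
    print(name, round(slope, 2))
```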
7. Conclusions and Future Directions

In this paper, we presented smoothly-varying non-stationary bandits and demonstrated the first separation between the smooth and non-smooth cases by showing that we can break the Θ(T^{2/3}) regret barrier for Lipschitz variation if we further assume Lipschitz-continuous differentiability, attaining O(T^{3/5}) regret. We showed this upper bound is tight by establishing a lower bound of Ω(T^{(β+1)/(2β+1)}) for β-Hölder smooth variation with any integer β. One important direction is handling more than two arms. Algorithm 2 can be straightforwardly adapted to more arms: perform successive elimination in each epoch, until either the exploration budget is used up or the epoch ends. But it is not clear whether the resulting regret has sublinear dependence on k. We conjecture that the lower bounds can be matched for every integer β ≥ 3, but this remains open. If this is true, it means that as smoothness increases we can obtain regret that approaches the optimal O(√T) regret of stationary bandits, since O(T^{(β+1)/(2β+1)}) = T^{1/2 + O(1/β)}.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1846210. Peter Frazier was supported by AFOSR FA9550-19-1-0283 and FA9550-20-1-0351.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Auer, P., Gajane, P., and Ortner, R. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pp. 138–158. PMLR, 2019.

Besbes, O., Gur, Y., and Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 27, 2014.

Besbes, O., Gur, Y., and Zeevi, A. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.

Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pp. 696–726. PMLR, 2019.

Cheung, W. C., Simchi-Levi, D., and Zhu, R. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1079–1087. PMLR, 2019.

Grünewälder, S., Audibert, J.-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for Gaussian process bandit problems. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 273–280. JMLR Workshop and Conference Proceedings, 2010.

Gur, Y., Momeni, A., and Wager, S. Smoothness-adaptive contextual bandits. Operations Research, 70(6):3198–3216, 2022.

Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression, volume 1. Springer, 2002.

Hu, Y., Kallus, N., and Mao, X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Conference on Learning Theory, pp. 2007–2010. PMLR, 2020.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Luo, H., Wei, C.-Y., Agarwal, A., and Langford, J. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, pp. 1739–1776. PMLR, 2018.

Manegueu, A. G., Carpentier, A., and Yu, Y. Generalized non-stationary bandits. arXiv preprint arXiv:2102.00725, 2021.

Russac, Y., Vernade, C., and Cappé, O. Weighted linear bandits for non-stationary environments. Advances in Neural Information Processing Systems, 32, 2019.

Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019.

Trovò, F., Paladino, S., Restelli, M., and Gatti, N. Sliding-window Thompson sampling for non-stationary settings. Journal of Artificial Intelligence Research, 68:311–364, 2020.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer, 2009. URL https://doi.org/10.1007/b13794.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
Zhou, X., Xiong, Y., Chen, N., and Gao, X. Regime switching bandits. Advances in Neural Information Processing Systems, 34:4542–4554, 2021.

A. Omitted Proofs in Section 3

In this section we provide details for constructing the family used in the lower bound proof.

A.1. Preliminaries: the Flock Transformation and the Pyramid

We need the notion of flocks to construct bowl-shaped curves. Pictorially, the flock transformation of a given function h (called the base function) is a sequence of copies of h placed side by side, each weighted by a constant. For example, the topmost subfigure in Figure 3 is a flock transformation of the pyramid-shaped base function.

Definition A.1 (Flock Transformation). For any base function h : [0, w] → R and weight vector v ∈ R^ℓ, the v-flock is the function F_v[h] : [0, ℓw] → R given by F_v[h](x) = v_i · h(x − (i − 1)w) for x ∈ [(i − 1)w, iw), i = 1, ..., ℓ.

We now specify the family of functions g_ε for constructing a bowl-shaped curve. We will set the highest derivative, i.e., g_ε^{(β−1)}, to be a flock transformation of the pyramid function whose weight vector is chosen from among the following neutralizing vectors.

Definition A.2 (Neutralizing Vectors). Let ν_0 = (1) and let ⊕ denote vector concatenation. For each integer k ≥ 1, recursively define the k-th neutralizing vector as ν_k = ν_{k−1} ⊕ (−ν_{k−1}) ∈ {±1}^{2^k}. A flock corresponding to a neutralizing vector is called a neutral flock.

As the name suggests, these vectors have the property that the sum of their entries over any dyadic interval is 0: Σ_{s=j·2^i+1}^{(j+1)·2^i} (ν_k)_s = 0 for any ν_k and integers 1 ≤ i ≤ k, j ≥ 0.

In our construction we set the highest derivative of g_ε to be a neutral flock of pyramid functions, defined as follows.

Definition A.3 (Pyramid Function). For any w > 0, define the w-pyramid function as Λ_w(x) = x · 1_{[0, w/2)}(x) + (w − x) · 1_{[w/2, w]}(x).

A.2. Proof of Proposition 3.1

We first describe the high-level idea. We will prove Proposition 3.1 constructively by considering g = g_ε obtained by iteratively integrating a pyramid flock (β − 1) times. Figure 3 illustrates this idea for β = 4, with ε = 4w. Specifically, we start by setting the highest derivative, i.e., g^{(3)}, to be a neutral flock of pyramids; see the topmost subfigure. Then, as shown in the other two subfigures, the lower-order derivatives all vanish at the boundary points, i.e., at 0 and 4w. Moreover, since g^{(3)} is bounded by the function f(x) = x, we can verify that g^{(i)}(ε) = O(ε^{4−i}) for i = 0, ..., 3; in particular, g(ε) = O(ε^4) as desired.

To formally define the family {g_ε}, we need the following.

Definition A.4 (Anti-derivative). Consider any Lebesgue integrable function f : R≥0 → R. We define Φ^0[f] = f and, for any integer ℓ ≥ 1, the level-ℓ anti-derivative Φ^ℓ[f] : R → R is given by Φ^ℓ[f](x) = ∫_0^x ∫_0^{x_ℓ} ⋯ ∫_0^{x_2} f(x_1) dx_1 ⋯ dx_ℓ.

Now consider g_ε = Φ^{β−1}[F_{ν_{β−2}}[Λ_w]] with w = w(ε) = 2^{−(β−2)} ε. Note that the (β − 1)-st derivative of g_ε is F_{ν_{β−2}}[Λ_w], which is 1-Lipschitz, so property (iv) holds trivially. Moreover, observe that 2^{β−2} w = ε, and hence g_ε is supported on [0, ε] as desired. We next formally verify that {g_ε} has the other properties claimed in Proposition 3.1. The next result, which connects the notions of anti-derivatives, flock transformations and neutralizing vectors, says that the anti-derivative of a neutral flock is still a neutral flock.

Figure 3. Illustration of g_ε for β = 4: g^{(3)} is a pyramid flock. The function g^{(2)}(x) is defined as the integral of g^{(3)} from 0 to x; similarly, g^{(1)}(x) is the integral of g^{(2)} from 0 to x.
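The construction is easy to check numerically. The sketch below (our own illustration, not code from the paper) builds g_ε for β = 4 exactly as described: the third derivative is a ν₂-flock of pyramids of width w = ε/4, and g is obtained by integrating it three times. The printed values confirm that g′ and g″ (approximately) vanish at both endpoints and that g_ε(ε)/ε⁴ is a fixed constant, i.e., g_ε(ε) = Θ(ε⁴).

```python
import numpy as np

def neutralizing(k):
    """nu_k in {+-1}^{2^k}: nu_0 = (1), nu_k = nu_{k-1} concat (-nu_{k-1})."""
    v = np.array([1.0])
    for _ in range(k):
        v = np.concatenate([v, -v])
    return v

def flock_of_pyramids(xs, v, w):
    """F_v[pyramid_w] on [0, len(v)*w]: side-by-side +-weighted tents of width w."""
    i = np.minimum((xs // w).astype(int), len(v) - 1)
    local = xs - i * w
    return v[i] * np.minimum(local, w - local).clip(min=0.0)

def g_eps(eps, beta=4, n=40001):
    """g_eps = Phi^{beta-1}[F_{nu_{beta-2}}[pyramid_w]] with 2^{beta-2} w = eps."""
    w = eps / 2 ** (beta - 2)
    xs = np.linspace(0.0, eps, n)
    f = flock_of_pyramids(xs, neutralizing(beta - 2), w)     # this is g^{(beta-1)}
    for _ in range(beta - 1):                                # integrate beta-1 times
        f = np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) / 2 * np.diff(xs))])
    return xs, f

for eps in [0.1, 0.2, 0.4]:
    xs, g = g_eps(eps)
    d1 = np.gradient(g, xs)
    d2 = np.gradient(d1, xs)
    print(f"eps={eps}: g(eps)/eps^4={g[-1] / eps ** 4:.4f}  "
          f"g'(ends)=({d1[0]:.1e},{d1[-1]:.1e})  g''(ends)=({d2[0]:.1e},{d2[-1]:.1e})")
```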
Proposition A.5 (Anti-derivative Preserves Neutrality). Let h_0 : [0, w] → R be any base function and let j ≥ 0 be any integer. Then, for any integer ℓ ≥ 0, there exists a base function h_ℓ : [0, 2^ℓ w] → R such that Φ^ℓ[F_{ν_{ℓ+j}}[h_0]] = F_{ν_j}[h_ℓ]; more precisely, h_ℓ = Φ^ℓ[F_{ν_ℓ}[h_0]].

To see the intuition, consider again β = 4 (readers may refer again to Figure 3) and the pyramid base function h_0 = Λ_w. In this case, the highest derivative is given by a ν_2-flock of pyramids, i.e.,
g^{(3)} = F_{ν_2}[h_0].   (4)
Now consider g^{(2)}. On one hand, as illustrated in the middle subfigure of Figure 3, g^{(2)} is a ν_1-flock under the base function h_1 = Φ^1[F_{ν_1}[Λ_w]], i.e.,
g^{(2)} = F_{ν_1}[h_1].   (5)
On the other hand, by (4) we also have g^{(2)} = Φ^1[g^{(3)}] = Φ^1[F_{ν_2}[h_0]]. Combining with (5), we have Φ^1[F_{ν_2}[h_0]] = F_{ν_1}[h_1], as claimed for ℓ = j = 1.

Before formally showing this, we first state some basic properties of the flock transform. We use ∘ to denote the composition of mappings.

Lemma A.6 (Algebra of the Flock Transform). For any integers i, j, k ≥ 0, it holds that (i) (Associative Law) (F_{ν_i} ∘ F_{ν_j}) ∘ F_{ν_k} = F_{ν_i} ∘ (F_{ν_j} ∘ F_{ν_k}); (ii) (Additive Law) F_{ν_i} ∘ F_{ν_j} = F_{ν_{i+j}} = F_{ν_j} ∘ F_{ν_i}.

The proof of the above is rather straightforward and we leave it to the reader. We are now ready to prove Proposition A.5.

Proof of Proposition A.5. We proceed by induction on j. W.l.o.g. assume w = 1. The base case j = 0 is trivially true, since F_{ν_0} is the identity mapping. Now consider j ≥ 1. As the induction hypothesis, suppose the claim holds, i.e., Φ^ℓ ∘ F_{ν_{ℓ+i}} = F_{ν_i} ∘ Φ^ℓ ∘ F_{ν_ℓ} for i = 0, ..., j − 1, and denote by IH(i) this identity. Then, for any base function h, it holds that
Φ^ℓ ∘ F_{ν_{ℓ+j}}[h] = Φ^ℓ ∘ F_{ν_{ℓ+j−1}}[F_{ν_1}[h]]   (by Lemma A.6)
 = F_{ν_{j−1}} ∘ Φ^ℓ ∘ F_{ν_ℓ}[F_{ν_1}[h]]   (by IH(j − 1))
 = F_{ν_{j−1}} ∘ Φ^ℓ ∘ F_{ν_{ℓ+1}}[h]   (by Lemma A.6)
 = F_{ν_{j−1}} ∘ F_{ν_1} ∘ Φ^ℓ ∘ F_{ν_ℓ}[h]   (by IH(1))
 = F_{ν_j} ∘ Φ^ℓ ∘ F_{ν_ℓ}[h],   (by Lemma A.6)
and the induction is complete.

We use Proposition A.5 to verify properties (i) and (ii) in Proposition 3.1. To verify that the derivatives vanish at the endpoints, we need the following property of the neutralizing vectors.

Lemma A.7 (Symmetric Area Property). For any base function h : [0, w] → R and integer k ≥ 1, the ν_k-flock satisfies ∫_0^{2^k w} F_{ν_k}[h](x) dx = 0.

Proof. We proceed by induction on k. For k = 1 this is obviously true. As the induction hypothesis, suppose the claim holds for some k ≥ 1. Recall that by definition ν_{k+1} = ν_k ⊕ (−ν_k), so
∫_0^{2^{k+1} w} F_{ν_{k+1}}[h](x) dx = ∫_0^{2^k w} F_{ν_k}[h](x) dx + ∫_{2^k w}^{2^{k+1} w} F_{−ν_k}[h](x − 2^k w) dx = ∫_0^{2^k w} F_{ν_k}[h](x) dx − ∫_0^{2^k w} F_{ν_k}[h](x) dx = 0.

Proposition A.8 (Vanishing Derivatives at the Endpoints). Let h : [0, w] → R be any base function, let H = F_{ν_ℓ}[h], and let g = Φ^ℓ[H]. Then the function g : [0, 2^ℓ w] → R is ℓ-times continuously differentiable with g^{(j)}(0) = g^{(j)}(2^ℓ w) = 0 for any j = 1, ..., ℓ.

Proof. By Proposition A.5, for any j there exists a base function h_j : [0, 2^j w] → R such that g^{(ℓ−j)} = Φ^j[H] = F_{ν_{ℓ−j}}[h_j]. By Lemma A.7, it follows that
g^{(ℓ−j−1)}(2^ℓ w) = g^{(ℓ−j−1)}(2^ℓ w) − g^{(ℓ−j−1)}(0) = ∫_0^{2^ℓ w} g^{(ℓ−j)}(x) dx = 0,
where we used that g^{(ℓ−j−1)}(0) = 0 since every anti-derivative vanishes at 0.

Proposition 3.1 then follows immediately, since we have verified properties (i)–(iv).

A.3. Proof of Lemma 3.5

We first introduce some standard concepts and tools. For simplicity, for any instance µ : [0, 1] → R, let (Z^t_µ)_{t ∈ [T]} be the reward vector under this instance.

Definition A.9 (KL-Divergence). Let X, Y ∈ {±1}^n be random vectors, specified by probability mass functions f_X, f_Y : {±1}^n → [0, 1]. The Kullback–Leibler divergence (or KL-divergence) between them is defined as KL(X, Y) = Σ_{v ∈ {±1}^n} f_X(v) log (f_X(v) / f_Y(v)).
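As a quick sanity check on how this divergence is used below, the following sketch (our own, not from the paper) computes the KL-divergence between two {±1}-valued rewards with means r₁ and r₂ and compares it against the quadratic bound (4/3)(r₁ − r₂)² that Lemma A.10 establishes for |r₂| ≤ 1/2.

```python
import numpy as np

def kl_pm1(r1, r2):
    """KL between {+-1}-valued variables with means r1, r2 (i.e., Bernoulli((r+1)/2))."""
    p, q = (r1 + 1) / 2, (r2 + 1) / 2
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    r1 = rng.uniform(-0.99, 0.99)
    r2 = rng.uniform(-0.5, 0.5)                  # Lemma A.10 assumes |r2| <= 1/2
    ratio = kl_pm1(r1, r2) / max((4 / 3) * (r1 - r2) ** 2, 1e-12)
    worst = max(worst, ratio)
print("max KL / ((4/3)(r1-r2)^2) over samples:", round(worst, 3))   # stays <= 1
```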
We show that at any time, the KL-divergence of the two reward variables is on the order of the squared difference of their means.

Lemma A.10 (Bounding the KL-Divergence). For i = 1, 2, suppose the random variable Z_i takes values in {±1} and has mean r_i. Then, when |r_2| ≤ 1/2, we have KL(Z_1, Z_2) ≤ (4/3)(r_1 − r_2)^2.

Proof. By definition, we can write Z_i = 2X_i − 1 where X_i ~ Ber((r_i + 1)/2) for i = 1, 2. Writing p = (r_1 + 1)/2 and q = (r_2 + 1)/2, and using ln x ≤ x − 1, we have
KL(Z_1, Z_2) = KL(X_1, X_2) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)) ≤ (p − q)^2 / (q(1 − q)) = (r_1 − r_2)^2 / (1 − r_2^2) ≤ (4/3)(r_1 − r_2)^2,
where the last inequality uses |r_2| ≤ 1/2.

The following says that two instances with small KL-divergence are hard to distinguish.

Theorem A.11 (Pinsker's Inequality). Let X, Y ∈ {±1}^n be two random vectors. For any event E (in this work, by "event" we mean a Borel set), we have 2(P[Y ∈ E] − P[X ∈ E])^2 ≤ KL(X, Y).

The chain rule characterizes the KL-divergence of random vectors with independent components, to which we will later apply Pinsker's inequality. The following can be found as Theorem 2.4(b) in Slivkins et al. (2019).

Theorem A.12 (Chain Rule for Product Distributions). Suppose X_1, ..., X_n, Y_1, ..., Y_n ∈ {±1} are independent. Consider X = (X_i) and Y = (Y_i). Then KL(X, Y) = Σ_{t=1}^{n} KL(X_t, Y_t).

Proof of Lemma 3.5. For any color χ ∈ {r, b}, denote by Z_χ = (Z^s_χ)_{s=1}^{t−1} the reward vector under instance u⊕χ. Consider the event E_t := {A_t = 0}. By Pinsker's inequality (Theorem A.11) and the chain rule (Theorem A.12),
|P[Z_r ∈ E_t] − P[Z_b ∈ E_t]|^2 ≤ (1/2) KL(Z_r, Z_b) = (1/2) Σ_{s=1}^{t−1} KL(Z^s_r, Z^s_b) = 0 + (1/2) Σ_{s=t_{j−1}}^{t−1} KL(Z^s_r, Z^s_b).   (6)
By the construction of F_β and Lemma A.10, for t_{j−1} < s ≤ t we have KL(Z^s_r, Z^s_b) ≤ (4/3) ((1/2) C_β (2δ)^β)^2, and thus
(6) ≤ (2/3) ((1/2) C_β (2δ)^β)^2 (t − t_{j−1}) ≤ (2^{2β} C_β^2 / 6) δ^{2β} · 6δT = 2^{2β} C_β^2 δ^{2β+1} T,   (7)
where the last inequality follows since t − t_{j−1} ≤ t_j − t_{j−1} = 6δT. Finally, recall that δ = (2^{2(β+1)} C_β^2 T)^{−1/(2β+1)}, so (7) gives |P[Z_r ∈ E_t] − P[Z_b ∈ E_t]|^2 ≤ 1/4. Therefore,
P[Z_r ∈ E_t] + P[Z_b ∈ E_t^c] = P[Z_r ∈ E_t] + 1 − P[Z_b ∈ E_t] ≥ 1 − |P[Z_r ∈ E_t] − P[Z_b ∈ E_t]| ≥ 1/2,
i.e., P_{u⊕b}[A_t = 1] + P_{u⊕r}[A_t = 0] ≥ 1/2.

B. Omitted Proofs in Section 4

B.1. Proof of Lemma 4.6

For simplicity, fix i and write S = S̃_i. Observe that
E[Σ_{t=t_i}^{t_i+S} Z^t_1] = E[Σ_{t=t_i}^{t_i+S} (Z^t_1 − r(t))] + E[Σ_{t=t_i}^{t_i+S} r(t)].   (8)
Note that the Z^t_1 − r(t)'s are independent with mean 0, and S is a stopping time, so the partial sum M_s := Σ_{t=t_i}^{t_i+s} (Z^t_1 − r(t)) is a martingale (w.r.t. the filtration induced by (Z^t_1)). By Wald's stopping-time theorem, E[M_S] = 0. Therefore, (8) becomes 0 + E[Σ_{t=t_i}^{t_i+S} r(t)].

B.2. Proof of Lemma 4.7

Write S = S̃_i. Since µ > 0 on this epoch, the optimal policy always chooses arm 1 in this epoch. Recall that at time t_i + S the BE algorithm switches to arm 0, the sub-optimal arm. We can thus write the regret as
E[Σ_{t=t_i}^{t_{i+1}} R_t] = E[Σ_{t=t_i}^{t_i+S} (r(t) − Z^t_1)] + E[Σ_{t=t_i+S+1}^{t_{i+1}} (r(t) − Z^t_0)] = E[Σ_{t=t_i+S+1}^{t_{i+1}} r(t)],   (9)
where the last identity follows from Lemma 4.6 and the fact that the Z^t_0's have mean 0 and are independent of S. Further, since t_{i+1} − t_i = ΔT and |r| ≤ 1, we have
(9) ≤ P[S = ΔT − 1] · 0 + P[S < ΔT − 1] · ΔT = P[S < ΔT − 1] · ΔT.
We conclude the proof by bounding P[S < ΔT − 1]. Consider the event {S = s} for some s < ΔT − 1. We claim that this event cannot occur conditional on the clean event C. In fact, if {S = s} occurs, then Σ_{t=t_i}^{t_i+s} Z^t_1 ≤ −B. However, conditional on C, we have
|Σ_{t=t_i}^{t_i+s} (Z^t_1 − r(t))| ≤ √(6 s log T) ≤ √(6 ΔT log T),
and more explicitly,
Σ_{t=t_i}^{t_i+s} Z^t_1 ≥ Σ_{t=t_i}^{t_i+s} r(t) − √(6 ΔT log T) > 0 − B = −B,
where the last inequality follows since 6ΔT log T ≤ B^2 and r(t) > 0 on this epoch. It follows that P[{S = s} ∩ C] = 0 for any s < ΔT − 1, and hence
P[S < ΔT − 1] = P[{S < ΔT − 1} ∩ C̄] ≤ P[C̄] ≤ T^{−1}.
Therefore, (9) ≤ P[S < ΔT − 1] · ΔT ≤ T^{−1} · ΔT ≤ 1.
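Lemma 4.6 is also easy to test by simulation. The sketch below (our own check; the reward sequence and parameters are arbitrary assumptions) draws the BE stopping rule S̄ = min{s : Σ_{t≤s} Z_t ≤ −B} within one epoch and compares the Monte-Carlo averages of Σ_{t≤S} Z_t and Σ_{t≤S} r(t), which Wald's identity says coincide in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, reps = 2_000, 10.0, 20_000                       # epoch length, budget, replications
r = 0.02 * np.sin(np.linspace(0, 3 * np.pi, n))        # an arbitrary bounded mean-reward sequence
rc = np.cumsum(r)

lhs = rhs = 0.0
for _ in range(reps):
    z = np.where(rng.random(n) < (1 + r) / 2, 1.0, -1.0)   # {+-1} rewards with means r
    csum = np.cumsum(z)
    hit = np.flatnonzero(csum <= -B)
    S = int(hit[0]) if hit.size else n - 1                 # stopping index, capped at epoch end
    lhs += csum[S]                                         # realized rewards up to S
    rhs += rc[S]                                           # means up to S
print(lhs / reps, rhs / reps)   # Wald: the two averages agree up to Monte-Carlo error
```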
B.3. Proof of Lemma 4.8

Write S = S̃_i. In this case the optimal arm is arm 0, so we can write the regret as
E[Σ_{t=t_i}^{t_{i+1}} R_t] = E[Σ_{t=t_i}^{t_i+S} (−Z^t_1)] + E[Σ_{t=t_i+S+1}^{t_{i+1}} (−Z^t_0)].   (10)
Note that Z^t_0 has mean 0 and is independent of S, so the second expectation is 0. To analyze the first term, for any s ≥ 0 define X_s := Σ_{t=t_i}^{t_i+s} Z^t_1. By the definition of S, on the event {S = s} we have X_{s−1} > −B. Since we assumed each reward distribution to be [−1, 1]-valued, this implies X_s > −(B + 1). Therefore,
E[−X_S] = −Σ_{s≥0} E[X_s · 1{S = s}] < Σ_{s≥0} P[S = s] · (B + 1) = B + 1,
where the first identity follows from Lebesgue's dominated convergence theorem and the boundedness of X_s for each fixed s. The claimed bound follows by combining the above with (10).

B.4. Proof of Lemma 4.13

Since µ′ never vanishes, µ is strictly monotone and hence has at most one crossing on its domain [0, 1]. In the trivial case where there is no crossing, the upper bound follows immediately by applying Lemma 4.7 or Lemma 4.8 to each epoch. Now suppose there is exactly one crossing, say x̄ ∈ [x_{i_0}, x_{i_0+1}] for some integer i_0. Then, for any x ∈ [x_{i_0}, x_{i_0+1}], by Lipschitzness of µ we have |µ(x)| = |µ(x) − µ(x̄)| ≤ L|x − x̄| ≤ LΔ. Translating this to the original time scale, |r(t)| ≤ LΔ whenever t_{i_0} ≤ t ≤ t_{i_0+1}. Therefore,
Σ_{t=t_{i_0}}^{t_{i_0+1}} |r(t)| ≤ LΔ · (t_{i_0+1} − t_{i_0}) = LΔ^2 T.   (11)
Since the expected per-round regret is at most |r(t)|, (11) gives R[i_0] ≤ LΔ^2 T. Meanwhile, since every epoch i ≠ i_0 is either positive or negative, by Lemma 4.7 and Lemma 4.8 we have R[i] ≤ B + 1. Combining, the total regret is bounded as
Σ_{i=1}^{Δ^{−1}} R[i] = R[i_0] + Σ_{i ≠ i_0} R[i] ≤ LΔ^2 T + Δ^{−1}(B + 1).