# Tracking Most Significant Shifts in Infinite-Armed Bandits

Joe Suk¹, Jung-hun Kim²

¹Columbia University, ²CREST, ENSAE Paris. Correspondence to: Joe Suk.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a reservoir distribution. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were only achieved using parameter knowledge of the non-stationarity and were only known for certain regimes of regularity of the reservoir. This work shows the first parameter-free optimal regret bounds while also relaxing these distributional assumptions. We also study a natural notion of significant shift for this problem, inspired by recent developments in finite-armed MAB (Suk and Kpotufe, 2022). We show that tighter regret bounds in terms of significant shifts can be adaptively attained. Our enhanced rates only depend on the rotting non-stationarity and thus exhibit an interesting phenomenon for this problem: rising non-stationarity does not factor into the difficulty of non-stationarity.

1. Introduction

We study the multi-armed bandit (MAB) problem, where an agent sequentially plays arms from a set $A$, based on partial and random feedback for previously played arms, called rewards. The agent's goal is to maximize earned rewards. Much of the classical literature (see Bubeck and Cesa-Bianchi, 2012; Slivkins, 2019; Lattimore and Szepesvári, 2020, for surveys) focuses on finite-armed bandits where $A = [K]$ for some fixed $K \in \mathbb{N}$. The theory here then typically assumes a large time horizon of play $T$ relative to $K$. However, in practice, the number of arms can be prohibitively large, as is the case in recommendation engines or adaptive drug design, motivating the so-called many-armed, or infinite-armed, model.

At the same time, another practical reality is that of changing reward distributions, or non-stationarity. While there has been a surge of works here (Kocsis and Szepesvári, 2006; Yu and Mannor, 2009; Garivier and Moulines, 2011; Mellor and Shapiro, 2013; Liu et al., 2018; Auer et al., 2019; Chen et al., 2019; Cao et al., 2019; Manegueu et al., 2021; Wei and Luo, 2021; Suk and Kpotufe, 2022; Jia et al., 2023; Abbasi-Yadkori et al., 2023; Suk, 2024), most works here again focus on the finite-armed problem.

This work studies infinite-armed bandits where mean rewards of actions are initially drawn from a reservoir distribution and evolve over time under rested non-stationarity. While much of the existing literature on this topic focuses on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), the more challenging adversarial or non-stationary scenario has only recently been explored in the context of rotting (i.e., decreasing) rewards (Kim et al., 2022; 2024). Kim et al. (2024)'s state-of-the-art algorithm for rotting bandits relies on prior knowledge of non-stationarity parameters and further regularity assumptions on the reservoir distribution to attain optimal regret bounds.
However, these assumptions can be impractical in real-world applications. This work studies a broader non-stationary model where rewards are decided by an adaptive adversary, and aims to derive regret bounds without requiring algorithmic knowledge of the non-stationarity. Additionally, we go beyond the task of attaining optimal regret bounds as posed by Kim et al. (2024) and show enhanced regret bounds which can be tighter (i.e., possibly near-stationary rates) despite large non-stationarity, thus more properly capturing the theoretical limits of learnability in changing environments. Such insights are inspired by similar developments in the finite-armed analogue (Suk and Kpotufe, 2022), where substantial changes in best arm, called significant shifts, can be detected, leading to more conservative procedures which do not overestimate the severity of non-stationarity.

1.1. More on Related Works

The most relevant works are Kim et al. (2022; 2024). The first of these works studies a simpler model where the reservoir distribution is uniform on $[0,1]$ and there is a fixed upper bound $\rho$ on the magnitude of round-to-round non-stationarity. Kim et al. (2022) show the minimax regret rate is $\rho^{1/3} T + \sqrt{T}$ and derive a matching upper bound, up to log terms, but using algorithmic knowledge of $\rho$, as well as a suboptimal regret upper bound without knowledge of $\rho$. Kim et al. (2024) study a general setting where the reservoir distribution for the initial mean reward $\mu_0(a)$ of arm $a$ satisfies $\mathbb{P}(\mu_0(a) > 1-x) = \Theta(x^\beta)$ for all $x \in (0,1)$. They study a more general setting of 1-sub-Gaussian random rewards vs. our setting of $[0,1]$-bounded rewards. They show a regret lower bound of order $\min\{V^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}},\, L^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}}\}$ in terms of the number of changes $L$ in rewards or the total variation $V$ (quantifying the magnitude of changes over time). With knowledge of $L$ and $V$, they show a matching regret upper bound for $\beta \geq 1$ and a worse $\min\{V^{1/3} T^{2/3}, \sqrt{LT}\}$ bound for $\beta < 1$. We suspect this latter bound for $\beta < 1$ is in fact tight due to earlier works (Carpentier and Valko, 2015, Theorem 1) showing $T^{-1/2}$ lower bounds on the final-round regret. Interestingly, in our $[0,1]$-bounded rewards setting, this phenomenon does not occur and the minimax regret behaves the same for $\beta < 1$ and $\beta \geq 1$. Finally, Kim et al. (2024) also have suboptimal regret upper bounds without knowledge of $L$ or $V$.

We also note that non-stationarity is well-studied in more structured many- or infinite-armed bandit settings such as (generalized) linear (Cheung et al., 2019; Russac et al., 2020; Zhao et al., 2020; Kim and Tewari, 2020; Wei and Luo, 2021), kernel (Hong et al., 2023; Iwazaki and Takeno, 2024), or convex bandits (Wang, 2022). By contrast, our infinite-armed setting does not assume any metric structure on the arm space, and so these works are not easily comparable to this paper. We also note there are no known results on more nuanced measures of non-stationarity, like the significant shifts, even for such structured settings.

1.2. Contributions

We show the first optimal and adaptive (a.k.a. parameter-free) regret upper bounds for non-stationary infinite-armed bandits (Corollary 5). In fact, our bounds are expressed in terms of tighter and more optimistic measures of non-stationarity (Subsection 2.2 and Definition 3) new to this work. This resolves open questions of Kim et al. (2022; 2024). We note our procedures are substantially different from Kim et al. (2022; 2024), who rely on explore-then-exploit strategies, whereas we revisit the subsampling approach of Bayati et al. (2020), which partially reduces the problem to analyzing finite-armed bandits.
Along the way, we develop the first high-probability regret bounds for the infinite-armed setting. Our regret upper bound in Corollary 5 also relaxes distributional assumptions on the reservoir distribution, not requiring an upper bound on the masses of randomly sampled rewards. Both such generalizations were unknown in prior works even in the stationary setting. To our knowledge, our work is the first to develop adaptive dynamic regret bounds of the style $V^{1/3} T^{2/3} \wedge \sqrt{LT}$ with bandit feedback and adaptively adversarial changes. Notably, such results are yet unknown in the finite-armed setting. Note our setting is not directly comparable to finite-armed bandits as we assume a known optimal reward value. We validate our findings via experiments on synthetic data, showing our procedures perform better than the previous art for rotting infinite-armed bandits.

2.1. Non-Stationary Infinite-Armed Bandits

We consider a multi-armed bandit with infinite armset $A$. At each round $t$, the agent plays an arm $a_t$, choosing either to newly sample $a_t$ from $A$ or to play an already sampled arm among the previously chosen arms $\{a_1, \ldots, a_{t-1}\}$. When the agent samples arm $a_t \in A$ at round $t$, it observes a random reward $Y_t(a_t) \in [0,1]$ with mean $\mu_t(a_t) \in [0,1]$, whose value is randomly drawn from a mean reservoir distribution. The reward of this chosen arm in the subsequent round is then decided according to an adaptive adversary with access to prior decisions $\{a_s\}_{s \leq t}$ and observations $\{Y_s(a_s)\}_{s \leq t}$. As in prior works (Berry et al., 1997; Wang et al., 2008; Carpentier and Valko, 2015; Bayati et al., 2020; Kim et al., 2024), we assume the reservoir distribution is $\beta$-regular, parametrized by a shape parameter $\beta > 0$.

Assumption 1 ($\beta$-Regular Reservoir Distribution). We assume a $\beta$-regular reservoir distribution for some $\beta > 0$: there exist constants $\kappa_1, \kappa_2 > 0$ such that for all $x \in [0,1]$:
$$\kappa_1 x^\beta \leq \mathbb{P}(\mu_0(a) > 1 - x) \leq \kappa_2 x^\beta.$$

Remark 1. Prior works on non-stationary bandits (e.g., Besbes et al., 2019) typically allow for unplayed arms' rewards to change each round, which is equivalent to our setting with a different accounting of changes, as changes in yet unplayed arms do not affect performance.

Let $\delta_t(a) := 1 - \mu_t(a)$ be the gap of arm $a$ at round $t$. Then, the cumulative regret is $R_T := \sum_{t=1}^T \delta_t(a_t)$. Let $\delta_t(a, a') := \mu_t(a) - \mu_t(a')$ be the relative regret of arm $a'$ to $a$ at round $t$.

2.2. Non-Stationarity Measures

Let $V := \sum_{t=2}^T |\mu_{t-1}(a_{t-1}) - \mu_t(a_{t-1})|$ denote the realized total variation, which measures the total variation of changes in mean rewards through the sequence of played arms. In Section 5, we also consider the total realized rotting variation $V_R := \sum_{t=2}^T (\mu_{t-1}(a_{t-1}) - \mu_t(a_{t-1}))_+$, which sums the magnitudes of rotting in rewards over time. Let $L := \sum_{t=2}^T \mathbf{1}\{\mu_t(a_{t-1}) \neq \mu_{t-1}(a_{t-1})\}$ denote the realized count of changes, and $L_R := \sum_{t=2}^T \mathbf{1}\{\mu_t(a_{t-1}) < \mu_{t-1}(a_{t-1})\}$ the realized number of rotting changes.

By contrast, the prior works Kim et al. (2022; 2024) consider a priori upper bounds on $V, L$ (i.e., the adversary is constrained to incur non-stationarity of at most size $V$ or count $L$), and so only show expected regret bounds in terms of such bounds. Our work establishes stronger high-probability regret bounds in terms of our tighter realized values of $V, L$.
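To make the bookkeeping above concrete, the following is a minimal Python sketch (ours, not from the paper) computing the realized measures $V$, $V_R$, $L$, $L_R$ from a logged trajectory; the array layout and the toy trajectory are illustrative assumptions.

```python
import numpy as np

def realized_nonstationarity(mu, arms):
    """Compute realized V, V_R, L, L_R (Subsection 2.2) from a logged trajectory.

    mu[t, a] : mean reward of sampled arm a at round t (T x num_sampled_arms)
    arms[t]  : index of the arm played at round t
    Each round-t term compares the mean of the previously played arm a_{t-1}
    at rounds t-1 and t.
    """
    T = len(arms)
    V = V_R = 0.0
    L = L_R = 0
    for t in range(1, T):
        prev_arm = arms[t - 1]
        diff = mu[t - 1, prev_arm] - mu[t, prev_arm]  # positive if the arm rotted
        V += abs(diff)
        V_R += max(diff, 0.0)
        L += int(diff != 0.0)
        L_R += int(diff > 0.0)
    return V, V_R, L, L_R

# Tiny illustrative example: 3 sampled arms over 5 rounds.
mu = np.array([[0.9, 0.5, 0.7],
               [0.8, 0.5, 0.7],   # arm 0 (played at t=0) dropped by 0.1
               [0.8, 0.5, 0.7],
               [0.8, 0.6, 0.7],   # arm 1 (played at t=2) rose by 0.1
               [0.8, 0.6, 0.7]])
arms = [0, 0, 1, 1, 2]
print(realized_nonstationarity(mu, arms))  # approximately V=0.2, V_R=0.1, L=2, L_R=1
```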
Remark 2. We note that our measures of non-stationarity $V, V_R, L, L_R$ depend on the agent's decisions and so may not be directly comparable for different algorithms and/or adversaries. For an oblivious adversary, these quantities can be upper bounded by worst-case analogues which do not depend on the agent's decisions. Even in this more limited setting, optimal and adaptive regret upper bounds were previously unknown.

3. Regret Lower Bounds

Kim et al. (2024, Theorems 4.1 and 4.2) show regret lower bounds of order $(L_R+1)^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}}$ and $V_R^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}}$ for the rotting infinite-armed bandit problem, which is a subcase of our non-stationary setup. Our regret upper bound in Corollary 5 matches these lower bounds up to log factors, without algorithmic knowledge of $L_R, V_R$.

4. A Blackbox for Optimally Tracking Unknown Non-Stationarity

4.1. Intuition for Subsampling

A key idea, used for the stationary problem in Wang et al. (2008); Bayati et al. (2020), is that of subsampling a fixed set of arms from the reservoir. The main algorithmic design principle is to run a finite-armed MAB algorithm over this subsample. The choice of subsample size is key here and exhibits its own exploration-exploitation tradeoff, appearing through a natural regret decomposition with respect to the subsample, which we'll denote by $A_0 \subseteq A$:
$$R_T = \underbrace{\sum_{t=1}^T \min_{a \in A_0} \delta_t(a)}_{\text{Regret of best subsampled arm}} + \underbrace{\sum_{t=1}^T \max_{a \in A_0} \delta_t(a, a_t)}_{\text{Regret to best subsampled arm}}. \quad (1)$$
Suppose there are $K := |A_0|$ subsampled arms. One can show (e.g., Lemma 11) that a size-$K$ subsample of a $\beta$-regular reservoir contains, with high probability, an arm with gap $O(K^{-1/\beta})$. Thus, the first sum in (1) is at most $T \cdot K^{-1/\beta}$. Then, plugging in the classical gap-dependent regret bounds for finite-armed MAB, the second sum in (1) is $O\big(\sum_{i=2}^K \frac{1}{\Delta_{(i)}}\big)$, where $\Delta_{(i)}$ is the (random) $i$-th smallest gap to the best subsampled arm. It is further shown (Bayati et al., 2020, Section A.2) that this gap-dependent quantity scales like $O(K)$ in expectation, by carefully integrating $\frac{1}{\Delta_{(i)}}$ over the randomness of the reservoir. Then, choosing $K$ to balance the bounds $T \cdot K^{-1/\beta}$ and $K$ on (1) yields an optimal choice of $K \asymp T^{\frac{\beta}{\beta+1}}$, giving a regret bound of $T^{\frac{\beta}{\beta+1}}$, which is in fact minimax.

Key Challenges: Extending this strategy to the non-stationary problem, it's natural to ask if we can follow an analogous strategy by reducing to $K$-armed non-stationary bandits. However, this poses fundamental difficulties:

(a) As our goal in the non-stationary problem is to achieve adaptive regret bounds, without parameter knowledge, a naive approach is to reduce to adaptive $K$-armed non-stationary MAB guarantees (Auer et al., 2019; Wei and Luo, 2021; Suk and Kpotufe, 2022; Abbasi-Yadkori et al., 2023). However, these guarantees only hold for an oblivious adversary, and so are inapplicable to our problem. Furthermore, these algorithms only give worst-case rates of the form $\sqrt{LKT}$ in terms of $L$ changes. In fact, it's known in this literature that no algorithm can adaptively secure tighter gap-dependent rates over unknown changepoints (Garivier and Moulines, 2011, Theorem 13). As the gap-dependent regret bound is crucial to optimally balancing exploration and exploitation in our subsampling strategy, we see this approach can only hope to achieve suboptimal rates.

(b) Secondly, we observe that upon experiencing changes, one may have to re-sample arms from the reservoir distribution, as the regret of the best subsampled arm $\min_{a \in A_0} \delta_t(a)$ can itself become large over time. Thus, we require a more refined subsampling strategy which works in tandem with non-stationarity detection.
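As a numerical sanity check of the subsample-size tradeoff described at the start of this subsection, the following Python sketch (ours, not from the paper) draws subsamples of varying size $K$ from a $\beta$-regular reservoir with $\mathbb{P}(\mu_0(a) > 1-x) = x^\beta$ and compares proxies for the two terms of (1) around the balancing choice $K \approx T^{\beta/(\beta+1)}$; the reservoir and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reservoir_gaps(K, beta):
    """Initial gaps delta_0(a) = 1 - mu_0(a) for K arms drawn from a
    beta-regular reservoir with P(mu_0(a) > 1 - x) = x^beta,
    i.e. delta_0(a) = U^{1/beta} for U ~ Uniform[0, 1]."""
    return rng.uniform(size=K) ** (1.0 / beta)

T, beta = 100_000, 1.0
K_star = int(np.ceil(T ** (beta / (beta + 1))))  # balancing choice ~ T^{beta/(beta+1)}

for K in [K_star // 10, K_star, K_star * 10]:
    best_gaps = [sample_reservoir_gaps(K, beta).min() for _ in range(50)]
    term1 = T * np.mean(best_gaps)   # ~ T * K^{-1/beta}: regret of best subsampled arm
    term2 = K                        # ~ gap-dependent regret to the best subsampled arm
    print(f"K={K:6d}  T*(min gap) ~ {term1:9.1f}  vs  K = {term2:6d}")
# The two terms are of the same order only near K = K_star ~ T^{beta/(beta+1)}.
```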
4.2. Our New Approach: Regret Tracking

We handle both of the above issues with the new idea of tracking the empirical regret $\hat{\delta}_t(a_t) := 1 - Y_t(a_t)$ as a proxy for tracking non-stationarity. The key observation is that the empirical cumulative regret of played actions up to round $t$, $\sum_{s=1}^t \hat{\delta}_s(a_s)$, concentrates around $\sum_{s=1}^t \delta_s(a_s)$ at fast logarithmic rates by Freedman's inequality and a self-bounding argument for $[0,1]$-valued random variables. This means, so long as $\sum_{s=1}^t 1 - Y_s(a_s) \lesssim t^{\frac{\beta}{\beta+1}}$, our regret will be safe up to the minimax stationary regret rate. On the other hand, if $\sum_{s=1}^t 1 - Y_s(a_s) \gg t^{\frac{\beta}{\beta+1}}$, then the agent must be experiencing large regret, which means some non-stationarity has occurred if the agent otherwise plays optimally for stationary environments.

Thus, at a high level, our procedure (Algorithm 1) restarts the subsampling strategy outlined in Subsection 4.1 upon detecting large empirical regret. Setting up relevant terminology, an episode is the set of rounds between consecutive restarts and, within each episode, we further employ doubling epochs, termed blocks, to account for unknown changepoints and durations of play. Within each block, we run the subsampling strategy for a fixed time horizon as a blackbox. The blackbox takes as input a finite-armed MAB base algorithm, parametrized by Base-Alg$(t, A_0)$ for inputs horizon $t$ and subsampled set of arms $A_0 \subseteq A$. Our only requirement of the base algorithm is that it attains a gap-dependent regret bound in so-called mildly corrupt environments, defined below. It's straightforward to show this is satisfied by classical stochastic MAB algorithms such as UCB (Lai and Robbins, 1985) (proof in Section C).

Definition 1. We say a finite-armed non-stationary bandit environment $\{\mu_t(a)\}_{t \in [T], a \in A_0}$ over horizon $T$ with armset $A_0$ is $\alpha$-mildly corrupt for $\alpha > 0$ if there exists a reference reward profile $\{\mu(a)\}_{a \in A_0}$ such that
$$\forall t \in [T],\ a \in A_0: \quad |\mu_t(a) - \mu(a)| \leq \alpha.$$

Next, in stating the requirement of our base algorithm, we use $\delta_t(a, a') := \mu_t(a) - \mu_t(a')$ to denote the gap of arm $a'$ to $a$ in the context of a finite-armed bandit $\{\mu_t(a)\}_{t \in [T], a \in A_0}$.

Assumption 2. Let $\{\mu_t(a)\}_{t \in [T], a \in A_0}$ be an $\alpha$-mildly corrupt $T$-round finite-armed bandit with reference reward profile $\{\mu(a)\}_{a \in A_0}$. Let $\Delta_{(2)} \leq \Delta_{(3)} \leq \cdots \leq \Delta_{(|A_0|)}$ be the ordered gaps induced by the reference reward profile. Then, Base-Alg$(T, A_0)$ attains, with probability at least $1 - 1/T$, for all $t \in [T]$, a $t$-round static regret bound of (for $C_0$ free of $t, T, A_0$):
$$\max_{a \in A_0} \sum_{s=1}^t \delta_s(a, a_s) \leq 6 t \alpha + C_0 \sum_{i=2}^{|A_0|} \frac{\log(T)}{\Delta_{(i)}} \mathbf{1}\{\Delta_{(i)} \geq 4\alpha\}.$$

Remark 3. Our Assumption 2 is at first glance similar to Assumption 1 of Wei and Luo (2021), but it is in fact stronger in requiring gap-dependent bounds in environments with small variation, as opposed to their requirement of $O(\sqrt{T})$ regret in such environments (Wang, 2022, Lemma 3).

Remark 4. Bandit algorithms attaining state-of-the-art regret bounds in stochastic regimes with adversarial corruption (Lykouris et al., 2018; Gupta et al., 2019; Zimmert and Seldin, 2019; Ito, 2021; Ito and Takemura, 2023; Dann et al., 2023) satisfy Assumption 2.
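To illustrate Definition 1, here is a small Python helper (an illustrative sketch, not part of the paper's algorithms) computing the smallest $\alpha$ for which a finite-armed reward sequence is $\alpha$-mildly corrupt: taking the reference reward $\mu(a)$ to be the midpoint of each arm's range of means minimizes the worst-case deviation.

```python
import numpy as np

def smallest_corruption_level(mu):
    """mu[t, a]: mean reward of arm a at round t over a finite armset.

    Returns the smallest alpha such that the environment is alpha-mildly
    corrupt (Definition 1): for each arm, the best constant reference value
    is the midpoint of its range of means, giving deviation (max_t - min_t)/2;
    the environment's alpha is the worst such deviation over arms.
    """
    per_arm_alpha = (mu.max(axis=0) - mu.min(axis=0)) / 2.0
    return per_arm_alpha.max()

# Example: arm 0 drifts by 0.1 over five rounds, arm 1 is constant.
mu = np.array([[0.70, 0.40],
               [0.68, 0.40],
               [0.66, 0.40],
               [0.64, 0.40],
               [0.60, 0.40]])
print(smallest_corruption_level(mu))  # 0.05: this environment is 0.05-mildly corrupt
```

Assumption 2 then asks the base algorithm to retain its gap-dependent guarantee whenever this $\alpha$ is small.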
Algorithm 1: Blackbox Non-Stationary Algorithm
1. Input: finite-armed MAB algorithm Base-Alg satisfying Assumption 2; subsampling rates $S_m$.
2. Initialize: episode count $\ell \leftarrow 1$, starting time $t_1^1 \leftarrow 1$.
3. for $m = 1, 2, \ldots, \log(T)$ do
4. &nbsp;&nbsp;Subsample $S_m \wedge 2^m$ arms $A_m \subseteq A$.
5. &nbsp;&nbsp;Initiate a new instance of Base-Alg$(2^m, A_m)$.
6. &nbsp;&nbsp;for $t = t_\ell^m, \ldots, (t_\ell^m + 2^m - 1) \wedge T$ do
7. &nbsp;&nbsp;&nbsp;&nbsp;Play arm $a_t$ (receiving reward $Y_t(a_t)$) as chosen by Base-Alg$(2^m, A_m)$.
8. &nbsp;&nbsp;&nbsp;&nbsp;Changepoint Test: if $\sum_{s=t_\ell^m}^{t} \hat{\delta}_s(a_s) \geq C_1 (|A_m| \vee 2^{m/2}) \log^3(T)$ then
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Restart: $t_{\ell+1}^1 \leftarrow t+1$, $\ell \leftarrow \ell+1$.
10. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Return to Line 3 (restart from $m = 1$).
11. &nbsp;&nbsp;&nbsp;&nbsp;else if $t = t_\ell^m + 2^m - 1$ then
12. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$t_\ell^{m+1} \leftarrow t+1$ (start of the $(m+1)$-th block in the $\ell$-th episode).

4.3. Blackbox Regret Upper Bound

The main result of this section is that Algorithm 1 attains the optimal regret in terms of the number of changes $L$ and total variation $V$ when $\beta \geq 1$, and matches the state-of-the-art regret bounds with known $L, V$ for $\beta < 1$ (Kim et al., 2024).

Theorem 2. Under Assumption 1 with $\beta \geq 1$, Algorithm 1 with $S_m := \lceil 2^{m \frac{\beta}{\beta+1}} \rceil$ satisfies, w.p. at least $1 - O(1/T)$:
$$R_T \leq \tilde{O}\left( (L+1)^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}} \wedge \left( V^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}} + T^{\frac{\beta}{\beta+1}} \right) \right).$$
If $\beta < 1$, Algorithm 1 with $S_m := \lceil 2^{m \beta/2} \rceil$ satisfies, w.p. at least $1 - O(1/T)$:
$$R_T \leq \tilde{O}\left( \sqrt{(L+1) T} \wedge \left( V^{1/3} T^{2/3} + \sqrt{T} \right) \right).$$

Proof. (Outline) We give an outline of the proof, with full details deferred to Section A. We also focus on the setting of $\beta \geq 1$, with (minor) modifications of the argument for $\beta < 1$ discussed in Subsection A.6. Let $t_\ell := t_\ell^1$ be the start of the $\ell$-th episode $[t_\ell, t_{\ell+1})$. Let $\hat{L}$ be the (random) number of restarts triggered over $T$ rounds. Let $m_\ell$ be the index of the last block in the $\ell$-th episode.

Converting Empirical Regret Bound to Per-Episode Regret Bound. Following the discussion of Subsection 4.2, we first use concentration to upper bound the per-block regret $\sum_{s=t_\ell^m}^{t_\ell^{m+1}-1} \delta_s(a_s)$ on each block, and to also lower bound it on blocks concluding with a restart. We first have, by Freedman's inequality (Lemma 7) and AM-GM, with high probability, for all subintervals $[s_1, s_2]$ of rounds:
$$\left| \sum_{s=s_1}^{s_2} \delta_s(a_s) - \hat{\delta}_s(a_s) \right| \lesssim \sqrt{\log(T) \sum_{s=s_1}^{s_2} \delta_s(a_s)} + \log(T) \lesssim \frac{1}{2} \sum_{s=s_1}^{s_2} \delta_s(a_s) + \log(T). \quad (2)$$
This allows us to upper bound the regret on each block $[t_\ell^m, t_\ell^{m+1})$ by $O(S_m)$, using the bound on the empirical regret $\sum_{s=t_\ell^m}^{t_\ell^{m+1}-1} \hat{\delta}_s(a_s)$ from Line 8 of Algorithm 1. Then, summing $S_m \asymp 2^{m \frac{\beta}{\beta+1}}$ over blocks and episodes yields a total regret bound of $\tilde{O}\big( \sum_{\ell=1}^{\hat{L}} (t_{\ell+1} - t_\ell)^{\frac{\beta}{\beta+1}} \big)$.

Bounding the Variation Over Each Episode. It now remains to relate $\sum_{\ell=1}^{\hat{L}} (t_{\ell+1} - t_\ell)^{\frac{\beta}{\beta+1}}$ to the total count $L$ of changes and total variation $V$. To this end, we show there is a minimal amount of variation in each episode $[t_\ell, t_{\ell+1})$, which will allow us to conclude the total regret bound using arguments similar to prior works (Suk and Kpotufe, 2022, Corollary 2; Chen et al., 2019, Lemma 5). We first introduce a regret decomposition, alluded to earlier in (1), based on the "best initially subsampled arm", $\hat{a}_{\ell, m_\ell} := \arg\max_{a \in A_{m_\ell}} \mu_0(a)$, where we use $\mu_0(a)$ to denote the initial reward of arm $a$ when it's first sampled. The regret in the last block $[t_\ell^{m_\ell}, t_{\ell+1})$ of the $\ell$-th episode is:
$$\sum_{t=t_\ell^{m_\ell}}^{t_{\ell+1}-1} \delta_t(a_t) = \underbrace{\sum_{t=t_\ell^{m_\ell}}^{t_{\ell+1}-1} \delta_t(\hat{a}_{\ell,m_\ell})}_{(A)} + \underbrace{\sum_{t=t_\ell^{m_\ell}}^{t_{\ell+1}-1} \delta_t(\hat{a}_{\ell,m_\ell}, a_t)}_{(B)}.$$
Then, from the above and concentration, one of two possible cases must hold if a restart occurs on round $t_{\ell+1}$: either (A) or (B) is $\Omega(S_{m_\ell})$. In either case, we claim large variation occurs over the episode:
$$\sum_{t=t_\ell+1}^{t_{\ell+1}} |\mu_t(a_{t-1}) - \mu_{t-1}(a_{t-1})| \gtrsim (t_{\ell+1} - t_\ell)^{-\frac{1}{\beta+1}}. \quad (3)$$

Regret of Best Subsampled Arm is Large. In the case where (A) is $\Omega(S_{m_\ell})$, we know, due to our subsampling rate $S_{m_\ell}$, that $\hat{a}_{\ell,m_\ell}$ will w.h.p. have an initial gap of $\tilde{O}(2^{-\frac{m_\ell}{\beta+1}})$ (Lemma 11). On the other hand, (A) being large means there's a round $t' \in [t_\ell^{m_\ell}, t_{\ell+1})$ such that $\delta_{t'}(\hat{a}_{\ell,m_\ell}) \gtrsim S_{m_\ell} / (t_{\ell+1} - t_\ell^{m_\ell}) \gtrsim 2^{-\frac{m_\ell}{\beta+1}}$. Thus, from $2^{-\frac{m_\ell}{\beta+1}} \gtrsim (t_{\ell+1} - t_\ell)^{-\frac{1}{\beta+1}}$, (3) holds.
Regret of Base is Large. Now, if (B) is $\Omega(S_{m_\ell})$ but (A) is $o(S_{m_\ell})$, suppose for contradiction that (3) is reversed. Then, this means the finite MAB environment experienced by the base algorithm is $\tilde{O}\big((t_{\ell+1} - t_\ell)^{-\frac{1}{\beta+1}}\big)$-mildly corrupt (Definition 1). Thus, Assumption 2 bounds the regret of the base, which in turn bounds the per-block regret above:
$$\sum_{t=t_\ell^{m_\ell}}^{t_{\ell+1}-1} \delta_t(\hat{a}_{\ell,m_\ell}, a_t) \lesssim (t_{\ell+1} - t_\ell^{m_\ell})^{\frac{\beta}{\beta+1}} \log^3(T) + \sum_{i=2}^{S_{m_\ell}} \frac{\log(T)}{\Delta_{(i)}} \mathbf{1}\{\Delta_{(i)} \geq 4\alpha\}, \quad (4)$$
where $\{\Delta_{(i)}\}_{i=1}^{S_{m_\ell}}$ are the ordered initial gaps to $\hat{a}_{\ell,m_\ell}$ of the arms in $A_{m_\ell}$ and $\alpha := \log^3(T)\,(t_{\ell+1} - t_\ell^{m_\ell})^{-\frac{1}{\beta+1}}$. We then bound the RHS of (4) by $O(S_{m_\ell})$ to contradict our premise that (B) is $\Omega(S_{m_\ell})$.

In Bayati et al. (2020, Lemma D.4), bounding (4) by $O(S_{m_\ell})$ was done in expectation by carefully integrating the densities of each random variable $\frac{1}{\Delta_{(i)}}$. But this requires additional regularity conditions on an assumed density for the reservoir distribution when $\beta \neq 1$ (op. cit. Section 6.1). In contrast, we bound (4) in high probability using a novel binning argument on the values of the $\Delta_{(i)}$'s and then using Freedman-type concentration on the number of subsampled arms within each bin. Details of this can be found in Subsection A.4.

4.4. Comparison with MASTER of Wei and Luo (2021)

While our blackbox (Algorithm 1) bears similarities to the MASTER algorithm of Wei and Luo (2021), we highlight that Algorithm 1's design is much simpler in not requiring randomized multi-scale scheduling of re-exploration for the sake of changepoint detection. The method of detecting changes is also different. MASTER detects changes by comparing the UCB indices of base algorithms scheduled at different times. In contrast, we only track the estimated cumulative regret in each block to track changes. This results in a simpler regret analysis where we don't need to bound the regret incurred while waiting for a re-exploration phase to be triggered.
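For concreteness, below is a compact Python sketch of the blackbox procedure of Algorithm 1 (our simplification, not the authors' reference code): doubling blocks with subsample size $S_m$, a UCB base run over the subsample, and a restart whenever the cumulative empirical regret $\sum_s (1 - Y_s)$ in the current block exceeds the changepoint-test threshold. The reservoir, the rotting adversary, and the constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

class RottingReservoirEnv:
    """Arms drawn from a beta-regular reservoir (P(mu_0 > 1-x) = x^beta);
    each time an arm is played, its mean rots by `rot` (illustrative adversary)."""
    def __init__(self, beta=1.0, rot=1e-3):
        self.beta, self.rot, self.mu = beta, rot, []

    def new_arm(self):
        self.mu.append(1.0 - rng.uniform() ** (1.0 / self.beta))
        return len(self.mu) - 1

    def play(self, a):
        reward = float(rng.random() < self.mu[a])      # Bernoulli(mu), in [0, 1]
        self.mu[a] = max(0.0, self.mu[a] - self.rot)   # rested rotting after play
        return reward

def blackbox(env, T, beta=1.0, C1=1.0):
    """Sketch of Algorithm 1 with a UCB base over the subsample A_m.
    Threshold constants are illustrative and would be tuned in practice."""
    t = 0
    while t < T:                                   # each pass of this loop is one episode
        restart = False
        m = 1
        while t < T and not restart:               # doubling blocks within the episode
            horizon = 2 ** m
            S_m = min(int(np.ceil(horizon ** (beta / (beta + 1)))), horizon)
            arms = [env.new_arm() for _ in range(S_m)]
            pulls, means = np.zeros(S_m), np.zeros(S_m)
            emp_regret = 0.0
            for s in range(min(horizon, T - t)):
                if s < S_m:                        # pull each subsampled arm once first
                    i = s
                else:                              # then play the UCB index
                    i = int(np.argmax(means + np.sqrt(2 * np.log(T) / pulls)))
                y = env.play(arms[i])
                pulls[i] += 1
                means[i] += (y - means[i]) / pulls[i]
                emp_regret += 1.0 - y
                t += 1
                # Changepoint test: restart if empirical regret in the block is too large
                if emp_regret >= C1 * max(S_m, horizon ** 0.5) * np.log(T) ** 3:
                    restart = True
                    break
            m += 1

T = 20_000
env = RottingReservoirEnv(beta=1.0, rot=1e-3)
blackbox(env, T)
print(f"sampled {len(env.mu)} arms over {T} rounds")
```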
5. Tracking Significant Shifts

While Algorithm 1 is a flexible blackbox which can make use of any reasonable finite-armed MAB algorithm, the regret bound of Theorem 2 is suboptimal for $\beta < 1$. Furthermore, we aim to show regret upper bounds in terms of the tighter rotting measures of non-stationarity $L_R$ and $V_R$ (cf. Subsection 2.2). In fact, we go beyond this aim and define a new measure of non-stationarity which precisely tracks the rotting changes which are most severe to performance.

5.1. Defining a Significant Shift

Recent works on non-stationary finite-armed MAB achieve regret bounds in terms of more nuanced non-stationarity measures, which only track the most significant switches in best arm, or those truly necessitating re-exploration (Suk and Kpotufe, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023; Suk and Kpotufe, 2023; Suk, 2024). Here, we develop and study an analogous notion for the infinite-armed bandit.

First, we note in the infinite-armed bandit problem there's no single "best arm", as the armspace is infinite and, almost surely, no sampled arm will have the optimal reward value of 1. Yet, the core intuition behind the notion of significant shift for finite MAB (Suk and Kpotufe, 2022, Definition 1) remains, by defining a significant shift in terms of the regret experienced by each arm. First, we say an arm $a$ is safe on an interval $[s_1, s_2]$ of rounds if:
$$\sum_{s=s_1}^{s_2} 1 - \mu_s(a) \leq \kappa_1^{-1} (s_2 - s_1 + 1)^{\frac{\beta}{\beta+1}}. \quad (5)$$
Then, a significant shift is roughly defined as occurring when every arm in some set of arms $\hat{\mathcal{A}}$ is unsafe and violates (5). If we let $\hat{\mathcal{A}} = A$ be the full reservoir set of arms, then this proposed definition may not be meaningful, as there could always exist an unsampled arm which is safe in terms of regret, but unknown to the agent.

For the subsampling strategy discussed in Subsection 4.1, we argue it is sensible to let $\hat{\mathcal{A}} = A_0$, the set of subsampled arms. Based on the discussion of Section 4, we see that a subsample of size $\Omega(t^{\frac{\beta}{\beta+1}})$ contains w.h.p. a safe arm with gap $O(t^{-\frac{1}{\beta+1}})$. This motivates the following definition of significant shift, which is defined for any agent (regardless of whether a subsampling strategy is used).

Definition 3 (Significant Shift). We say a bandit environment over $t$ rounds is safe if there exists an arm $a$ among the first $t^{\frac{\beta}{\beta+1}}$ arms sampled by the agent such that (5) holds for all intervals of rounds $[s_1, s_2] \subseteq [t]$. Let $\tau_0 := 1$. Define the $(i+1)$-th significant shift $\tau_{i+1}$, given $\tau_i$, as the first round $t > \tau_i$ such that the bandit environment over rounds $[\tau_i, t]$ is not safe. Let $\tilde{L}$ be the largest significant shift index over $T$ rounds and, by convention, let $\tau_{\tilde{L}+1} := T+1$.

5.2. Comparing Significant Shifts with $L_R$, $V_R$

We first note that the significant shift, like $L_R$ and $V_R$, is an agent-based measure of non-stationarity and so depends on the agent's past decisions. Next, we caution that a significant shift does not always measure non-stationarity and can in fact be triggered in stationary environments for a bad algorithm. For example, consider an agent sampling exactly one arm from the reservoir and committing to it. Then, due to the large variance of a single sample, such an algorithm incurs constant regret with constant probability. In this case, a significant shift is triggered even in a stationary environment, as none of the subsample is safe. Thus, in this example, the significant shift tracks the agent's suboptimality.

In spite of this example, we argue that Definition 3 is still a meaningful measure of non-stationarity for all algorithms of theoretical interest, and indeed learning significant shifts via Algorithm 2 will allow us to attain optimal rates w.r.t. $L_R, V_R$ (Corollary 5). Note that any algorithm attaining the optimal high-probability regret of $\tilde{O}(T^{\frac{\beta}{\beta+1}})$ in stationary environments must sample at least $T^{\frac{\beta}{\beta+1}}$ arms. Intuitively, this is because securing an arm with gap at most $\delta := T^{-\frac{1}{\beta+1}}$ (which occurs for a given arm with probability $\mathbb{P}(\mu_0(a) > 1 - \delta) = \Theta(\delta^\beta)$) requires $\Omega(\delta^{-\beta})$ trials. Indeed, this claim is also seen in the proofs of the stationary regret lower bounds (e.g., Wang et al., 2008, Theorem 3). At the same time, any sample of $\Omega(T^{\frac{\beta}{\beta+1}})$ arms from the reservoir will contain an arm with gap $O(\delta)$ w.h.p. (Lemma 11), which will maintain (5) over all subintervals. Then, a significant shift will not be triggered unless this best subsampled arm's gap increases. Thus, any algorithm attaining optimal regret in stationary environments will satisfy w.h.p. $\tilde{L} \leq L$. Additionally, the only way an arm with initial gap $O(\delta)$ becomes unsafe and violates (5) is if there's large rotting total variation (i.e., $V_R$ is bounded below). This means $\tilde{L} \leq L_R$, and we'll see in the next subsection that learning significant shifts in fact allows us to recover the optimal rate $V_R^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}}$ in terms of $V_R$ (Corollary 5).

5.3. Regret Upper Bounds

Here, we derive a tight and adaptive regret bound in terms of the significant shifts. In particular, we show a regret bound of $\tilde{O}\big( (\tilde{L}+1)^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}} \wedge (V_R^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}} + T^{\frac{\beta}{\beta+1}}) \big)$ in terms of $\tilde{L}$ significant shifts and $V_R$ total rotting variation. A preliminary task here is as follows:

Goal 1. Show a regret bound of $\tilde{O}(t^{\frac{\beta}{\beta+1}})$ in $t$-round safe environments.

Given a base procedure achieving said goal, we can then use a restart strategy similar to Algorithm 1, where we restart the base upon detecting a significant shift.
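To make the safety condition (5) and Definition 3 concrete, the following Python sketch (illustrative, not from the paper) checks condition (5) for the arms sampled so far and reports whether a significant shift has occurred by round $t$; the quadratic-time scan over intervals is naive and only for exposition.

```python
import numpy as np

def is_safe_arm(gaps, beta, kappa1=1.0):
    """Check condition (5): for every interval [s1, s2],
    sum_{s in [s1, s2]} (1 - mu_s(a)) <= kappa1^{-1} (s2 - s1 + 1)^{beta/(beta+1)}.
    `gaps[s]` holds 1 - mu_s(a). Naive O(t^2) scan over intervals."""
    t = len(gaps)
    prefix = np.concatenate([[0.0], np.cumsum(gaps)])
    for s1 in range(t):
        for s2 in range(s1, t):
            width = s2 - s1 + 1
            if prefix[s2 + 1] - prefix[s1] > (1.0 / kappa1) * width ** (beta / (beta + 1)):
                return False
    return True

def significant_shift_occurred(gap_matrix, beta, kappa1=1.0):
    """gap_matrix[s, i]: gap 1 - mu_s(a_i) of the i-th sampled arm at round s
    (columns ordered by sampling time). Per Definition 3, the environment over
    t rounds is safe if some arm among the first ceil(t^{beta/(beta+1)}) sampled
    arms satisfies (5) on all subintervals of [t]."""
    t, n_sampled = gap_matrix.shape
    n_check = min(n_sampled, int(np.ceil(t ** (beta / (beta + 1)))))
    return not any(is_safe_arm(gap_matrix[:, i], beta, kappa1) for i in range(n_check))

# Example: while one early-sampled arm keeps a small gap, no significant shift occurs.
t, beta = 64, 1.0
gap_matrix = np.full((t, 10), 0.5)
gap_matrix[:, 0] = 0.05           # first sampled arm stays near-optimal
print(significant_shift_occurred(gap_matrix, beta))  # False
gap_matrix[:, 0] = 0.5            # ...once it rots too, every checked arm is unsafe
print(significant_shift_occurred(gap_matrix, beta))  # True
```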
For the finite $K$-armed setting, the analogous preliminary claim (Suk, 2024, Theorem 11) is to show a regret bound of $\tilde{O}(\sqrt{KT})$ in safe environments. This is achieved using a randomized variant of successive elimination (Even-Dar et al., 2006). However, once again the infinite-armed setting is more nuanced, and so this claim cannot directly be applied. Indeed, setting $K := T^{\frac{\beta}{\beta+1}}$ results in a suboptimal rate of $T^{\frac{2\beta+1}{2\beta+2}}$. The fundamental issue here is that the $\sqrt{KT}$ rate captures a worst-case variance of estimating bounded rewards. To get around this, we do a more refined variance-aware regret analysis relying on self-bounding techniques similar to those used to show Theorem 2 in Section 4.

To further emphasize the more challenging nature of showing Goal 1, we notice that the regret analysis of our blackbox procedure Algorithm 1 in Subsection 4.3 crucially relies on bounding the logarithmic gap-dependent regret rate of Assumption 2 over periods of small total variation (used to show (3)). However, such a gap-dependent regret rate is ill-defined when the gaps $\Delta_i$ can be changing substantially over time (so as to violate Assumption 2), which can happen in safe environments while we expect stationary regret rates. Nevertheless, we show that a different per-arm regret analysis gets around this difficulty.

Our procedure (Algorithm 2) is a restarting randomized elimination using the same doubling block scheme as Algorithm 1. Going into more detail, we note that, by uniformly exploring a candidate armset $G_t$ at round $t$, we can maintain importance-weighted estimates of the gaps of each arm $a \in G_t$:
$$\hat{\delta}_t^{IW}(a) := \frac{(1 - Y_t(a_t)) \, \mathbf{1}\{a_t = a\}}{\mathbb{P}(a_t = a \mid \mathcal{H}_{t-1})},$$
where $\mathcal{H}_{t-1}$ is the $\sigma$-algebra generated by decisions and observations up to round $t-1$. Next, we note, by Freedman's inequality, that the estimation error of the cumulative estimate $\sum_{t \in I} \hat{\delta}_t^{IW}(a)$ over an interval $I$ of rounds scales like $\sqrt{\sum_{t \in I} \delta_t(a) |G_t|}$. The inclusion of the $\delta_t(a)$ term inside the square root is crucial here and, using self-bounding arguments, yields tighter concentration bounds of order $\tilde{O}(\max_{t \in I} |G_t|)$, which we can use as a threshold for fast variance-based elimination.

Theorem 4. Let $\hat{L}$ be the number of episodes $[t_\ell, t_{\ell+1})$ elapsed in Algorithm 2 over $T$ rounds. Algorithm 2 satisfies, w.p. at least $1 - 1/T$:
$$R_T \leq \tilde{O}\left( \sum_{\ell=1}^{\hat{L}} (t_{\ell+1} - t_\ell)^{\frac{\beta}{\beta+1}} \right).$$

Algorithm 2: Restarting Subsampling Elimination
1. Initialize: episode count $\ell \leftarrow 1$, start $t_1^1 \leftarrow 1$.
2. for $m = 1, 2, \ldots, \log(T)$ do
3. &nbsp;&nbsp;Subsample $\lceil 2^{(m+1)\frac{\beta}{\beta+1}} \log(T) \rceil \wedge 2^m$ arms $A_m$ and let $G_{t_\ell^m} \leftarrow A_m$.
4. &nbsp;&nbsp;for $t = t_\ell^m, \ldots, (t_\ell^m + 2^m - 1) \wedge T$ do
5. &nbsp;&nbsp;&nbsp;&nbsp;Play arm $a_t \sim \mathrm{Unif}\{G_t\}$ and observe reward $Y_t(a_t)$.
6. &nbsp;&nbsp;&nbsp;&nbsp;Eliminate arms: $G_{t+1} \leftarrow \big\{ a \in G_t : \sum_{s=t_\ell^m}^{t} \hat{\delta}_s^{IW}(a) \leq C_2 |A_m| \log(T) \big\}$.
7. &nbsp;&nbsp;&nbsp;&nbsp;Restart Test: if $G_{t+1} = \emptyset$ then
8. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Restart: $t_{\ell+1}^1 \leftarrow t+1$, $\ell \leftarrow \ell+1$.
9. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Return to Line 2 (restart from $m = 1$).
10. &nbsp;&nbsp;&nbsp;&nbsp;else if $t = t_\ell^m + 2^m - 1$ then
11. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$t_\ell^{m+1} \leftarrow t+1$ (start of the $(m+1)$-th block in the $\ell$-th episode).

Proof. (Outline) We give a proof outline here; full details are found in Section B. It suffices to bound the regret on block $[t_\ell^m, t_\ell^{m+1})$ by $O(S_m)$. To show this, we do a variance-aware version of the per-arm regret analysis of Section B.1 in Suk and Kpotufe (2022). We first transform the regret according to its conditional expectation. Note that, from the uniform sampling strategy, $\mathbb{E}[\delta_t(a_t) \mid \mathcal{H}_{t-1}] = \sum_{a \in G_t} \frac{\delta_t(a)}{|G_t|}$. Next, note that $\mathrm{Var}[\delta_t(a_t) \mid \mathcal{H}_{t-1}] \leq \mathbb{E}[\delta_t^2(a_t) \mid \mathcal{H}_{t-1}] \leq \mathbb{E}[\delta_t(a_t) \mid \mathcal{H}_{t-1}]$.
Then, using Freedman's inequality (Lemma 7) with the above, we have, for all subintervals $I \subseteq [T]$, with probability at least $1 - 1/T$:
$$\sum_{t \in I} \delta_t(a_t) \leq \sum_{t \in I} \mathbb{E}[\delta_t(a_t) \mid \mathcal{H}_{t-1}] + \sqrt{\log(T) \sum_{t \in I} \mathbb{E}[\delta_t(a_t) \mid \mathcal{H}_{t-1}]} + \log(T) \lesssim \sum_{t \in I} \sum_{a \in G_t} \frac{\delta_t(a)}{|G_t|} + \log(T),$$
where the second inequality is from AM-GM. In light of the above, it remains to bound $\sum_{a \in A_m} \sum_{t=t_\ell^m}^{t_a} \frac{\delta_t(a)}{|G_t|}$, where $t_a$ is the last round in block $[t_\ell^m, t_\ell^{m+1})$ that $a$ is retained.

We next again use Freedman's inequality (Lemma 7) and self-bounding to relate $\sum_{t \in I} \delta_t(a)$ to $\sum_{t \in I} \hat{\delta}_t^{IW}(a)$. We have, w.p. at least $1 - 1/T$:
$$\left| \sum_{t=t_\ell^m}^{t_a} \delta_t(a) - \hat{\delta}_t^{IW}(a) \right| \lesssim \max_{t \in [t_\ell^m, t_a]} |G_t| \log(T) + \sqrt{\log(T) \sum_{t=t_\ell^m}^{t_a} \delta_t(a) |G_t|} \leq \frac{1}{2} \sum_{t=t_\ell^m}^{t_a} \delta_t(a) + C \max_{t \in [t_\ell^m, t_a]} |G_t| \log(T), \quad (6)$$
where again we use AM-GM in the second inequality. Next,
$$\sum_{t=t_\ell^m}^{t_a} \frac{\delta_t(a)}{|G_t|} \leq \max_{s \in [t_\ell^m, t_a]} \frac{1}{|G_s|} \sum_{t=t_\ell^m}^{t_a} \delta_t(a).$$
Moving the $\frac{1}{2} \sum_{t} \delta_t(a)$ to the other side in (6), we get
$$\sum_{t=t_\ell^m}^{t_a} \delta_t(a) \lesssim \sum_{t=t_\ell^m}^{t_a} \hat{\delta}_t^{IW}(a) + \log(T) \max_{t \in [t_\ell^m, t_a]} |G_t|.$$
Combining the above two displays with the elimination rule (Line 6) of Algorithm 2, we have:
$$\sum_{t=t_\ell^m}^{t_a} \frac{\delta_t(a)}{|G_t|} \lesssim \max_{s \in [t_\ell^m, t_a - 1]} \frac{S_m}{|G_s|} \log(T).$$
Now, summing $\max_{s \in [t_\ell^m, t_a - 1]} |G_s|^{-1}$ over arms $a \in A_m$ yields another $\log(S_m)$ factor, while summing over blocks $m \in [m_\ell]$ and episodes $\ell \in [\hat{L}]$ finishes the proof.

We next show the regret bound of Theorem 4 in fact recovers the minimax regret rates in terms of $\tilde{L}$ and $V_R$.

Corollary 5. Algorithm 2 satisfies, w.p. at least $1 - 1/T$:
$$R_T \leq \tilde{O}\left( (\tilde{L}+1)^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}} \wedge \left( V_R^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}} + T^{\frac{\beta}{\beta+1}} \right) \right).$$

Proof. (Sketch) Using similar arguments to the proof of Theorem 2, within each block the best initial arm $\hat{a}_{\ell,m}$ has gap $\tilde{O}(2^{-\frac{m}{\beta+1}})$. On the other hand, within a block $[t_\ell^{m_\ell}, t_{\ell+1})$ ending in a restart, this best initial arm is eliminated, meaning there exists a round $t \in [t_\ell^{m_\ell}, t_{\ell+1})$ such that $\delta_t(\hat{a}_{\ell,m_\ell}) \gtrsim (t_{\ell+1} - t_\ell^{m_\ell})^{-\frac{1}{\beta+1}}$. Thus, the gap of $\hat{a}_{\ell,m_\ell}$ must have increased by at least this amount, which gives a lower bound on the per-episode total variation of $\Omega\big((t_{\ell+1} - t_\ell)^{-\frac{1}{\beta+1}}\big)$. Then, by similar arguments to the proof of Theorem 2, we deduce the total regret bound.

6. Comparing Blackbox vs. Elimination

Corollary 5 and the nearly matching lower bounds of Section 3 show that only rotting non-stationarity ($\tilde{L} \leq L_R$ and $V_R$) factors into the difficulty of non-stationarity. In other words, rising non-stationarity is benign for non-stationary infinite-armed bandits. Intuitively, this is because our problem assumes knowledge of both the top reward value and the upper bound on rewards, which coincide and are equal to 1. Hence, arms with rising rewards require less exploration.

Interestingly, unlike the case with Algorithm 1, the elimination algorithm's bound in Corollary 5 does not require the upper bound on the masses of the reservoir in Assumption 1; that is, there is no dependence on $\kappa_2$ in the regret upper bounds of Corollary 5. This is new to this work and was not known even in the previous stationary regret bounds (Wang et al., 2008; Bayati et al., 2020; Kim et al., 2024). This suggests our regret analysis is simpler and, indeed, we only require that the initial best arm in the subsample has small gap (which only requires the lower bound on the masses of the reservoir, as we see in Lemma 11). By contrast, the regret analysis of Theorem 2 (more similar to those of the aforementioned works) uses the upper bound on tail probabilities scaling with $\kappa_2$ to bound the regret of the finite-armed MAB base algorithm. Such a step is avoided in the regret analysis of Theorem 4 by estimating the regret of each arm separately. On the other hand, the blackbox can be seen as more extensible as it allows for a wide range of finite-armed MAB algorithms to be used as a base. We finally note Algorithm 2 can essentially be reformulated as an instantiation of the blackbox with a successive elimination base algorithm.
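Below is a minimal Python sketch of the restarting subsampling elimination procedure (Algorithm 2). It is our own simplified rendering, with an illustrative rotting environment and constants, and (as in the paper's experiments, cf. Remark 5) without the $\log(T)$ factor in the subsampling rate.

```python
import numpy as np

rng = np.random.default_rng(2)

class RottingReservoirEnv:
    """Same illustrative environment as in the Section 4 sketch: arms drawn from a
    beta-regular reservoir; each played arm rots by `rot`."""
    def __init__(self, beta=1.0, rot=1e-3):
        self.beta, self.rot, self.mu = beta, rot, []

    def new_arm(self):
        self.mu.append(1.0 - rng.uniform() ** (1.0 / self.beta))
        return len(self.mu) - 1

    def play(self, a):
        reward = float(rng.random() < self.mu[a])
        self.mu[a] = max(0.0, self.mu[a] - self.rot)
        return reward

def restarting_elimination(env, T, beta=1.0, C2=1.0):
    """Sketch of Algorithm 2: uniform play over the active set G_t, elimination by
    importance-weighted cumulative gap estimates, restart when G_t empties."""
    t = 0
    while t < T:                                    # episodes
        restart = False
        m = 1
        while t < T and not restart:                # doubling blocks
            horizon = 2 ** m
            S_m = min(int(np.ceil((2 * horizon) ** (beta / (beta + 1)))), horizon)
            arms = [env.new_arm() for _ in range(S_m)]
            active = list(range(S_m))
            iw_regret = np.zeros(S_m)               # cumulative IW gap estimate per arm
            for _ in range(min(horizon, T - t)):
                i = active[rng.integers(len(active))]      # a_t ~ Unif(G_t)
                y = env.play(arms[i])
                iw_regret[i] += (1.0 - y) * len(active)    # (1 - Y_t) / P(a_t = i | H_{t-1})
                t += 1
                threshold = C2 * S_m * np.log(T)
                active = [a for a in active if iw_regret[a] <= threshold]
                if not active:                      # restart test: G_{t+1} is empty
                    restart = True
                    break
            m += 1

T = 20_000
env = RottingReservoirEnv(beta=1.0, rot=1e-3)
restarting_elimination(env, T)
print(f"sampled {len(env.mu)} arms over {T} rounds")
```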
7. Experiments

Here, we demonstrate the performance of Algorithms 1 and 2 on synthetic datasets. For the Base-Alg in Algorithm 1 we use UCB (Auer et al., 2002). As benchmarks, we implement SSUCB (Bayati et al., 2020) and a variant of it using a sliding window of size $\sqrt{T}$ (SSUCB-SW), as well as AUCBT-ASW (Kim et al., 2024), which achieves a suboptimal regret bound of $\tilde{O}\big( \min\{V^{\frac{1}{\beta+2}} T^{\frac{\beta+1}{\beta+2}}, (L+1)^{\frac{1}{\beta+1}} T^{\frac{\beta}{\beta+1}}\} + \min\{T^{\frac{2\beta+1}{2\beta+2}}, T^{\frac{3}{4}}\} \big)$ in rested rotting setups.

Comparing Total Variation-based Regret Bounds. To ensure a fair comparison between algorithms, we consider a rotting scenario where the mean reward of each selected arm decreases at a rate of $O(1/t)$ at time $t$. In this environment, for all our algorithms, it can be shown that $L = \Omega(T)$, $V = O(1)$. For the case of $\beta = 1$, such that initial mean rewards follow a uniform distribution on $[0,1]$ (Figure 1(b)), both our algorithms outperform the benchmarks, with the elimination one achieving the best performance. These results validate the insights of Section 6. Specifically, our algorithms have a regret bound of $\tilde{O}(T^{2/3})$ (Theorem 2 and Corollary 5) vs. AUCBT-ASW's bound of $\tilde{O}(T^{3/4})$ (Kim et al., 2024). In Figure 1, we observe that our algorithms outperform the benchmarks across various $\beta$ values, aligning with theoretical results. Furthermore, the performance gap between the elimination and blackbox algorithms increases as $\beta$ decreases, which is also consistent with our theoretical results.

[Figure 1: Experimental results showing regret of algorithms. Panels for $\beta \in \{0.6, 0.8, 1\}$ over 100,000 rounds; curves compare SSUCB-SW, SSUCB, AUCBT-AW, Blackbox (Algorithm 1), and Elimination (Algorithm 2).]

Comparing Regret Bounds based on Number of Changes. Figure 2 covers the piecewise stationary setting, where we use a rotting rate of $\rho_t = 1$ at $L = \sqrt{T}$ different rounds with $\beta = 1$. The prior art AUCBT-ASW (Kim et al., 2024) has a regret bound of $\sqrt{LT} = T^{3/4}$, yet we see from the plot that our procedures have empirically better regret owing to the small number of significant shifts. Surprisingly, we see the blackbox algorithm has comparable performance to our elimination procedure, suggesting that the blackbox algorithm may also be capable of adapting to significant shifts.

[Figure 2: Experimental results showing regret of algorithms in the piecewise stationary setting; same algorithms as in Figure 1.]

Remark 5. Our implementation of Algorithm 2 does not include the $\log(T)$ factor in the subsampling rate of Algorithm 2, as this led to more stable experimental results. One can show this only changes the bound of Corollary 5 up to a $\log^{2/\beta}(T)$ factor, which is not large for $\beta \in \{0.6, 0.8, 1\}$.

8. Conclusion and Future Directions

We have shown the first optimal and adaptive dynamic regret bounds for infinite-armed non-stationary bandits with arms drawn from a reservoir distribution. For future work, it would be interesting to see if our techniques can be extended to non-stationary linear bandits, and to design more practical/computationally efficient procedures for detecting changes.
Impact Statement As our contribution here is primarily theoretical, advancing the state-of-art in infinite-armed non-stationary bandits, we do not foresee any major societal impact concerns. Yasin Abbasi-Yadkori, András György, and Nevena Lazi c. A new look at dynamic regret for non-stationary stochastic bandits. Journal of Machine Learning Research, 24(288): 1 37, 2023. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002. Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. Conference on Learning Theory, pages 138 158, 2019. Mohsen Bayati, Nima Hamidi, Ramesh Johari, and Khashayar Khosravi. Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1713 1723. Curran Associates, Inc., 2020. Donald A. Berry, Robert W. Chen, Alan Zame, David C. Heath, and Larry A. Shepp. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103 2116, 1997. doi: 10.1214/aos/1069362389. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem Tracking Significant Shifts in Infinite Armed Bandits with non-stationary rewards. Stochastic Systems, 9(4): 319 337, 2019. Thomas Bonald and Alexandre Proutiere. Two-target algorithms for infinite-armed bandits with bernoulli rewards. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. Sébastien Bubeck and Nicoló Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1), 2012. Thomas Kleine Buening and Aadirupa Saha. Anaconda: An improved dynamic regret algorithm for adaptive nonstationary dueling bandits. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 3854 3878. PMLR, 25 27 Apr 2023. Yang Cao, Zheng Wen, Branislav Kveton, and Yao Xie. Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. ICML, 2015. Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen Yu Wei. A new algorithm for non-stationary contextual bandits: efficient, optimal, and parameter-free. In 32nd Annual Conference on Learning Theory, 2019. Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1079 1087. PMLR, 2019. Chris Dann, Chen-Yu Wei, and Julian Zimmert. A blackbox approach to best of both worlds in bandits and beyond. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 5503 5570. PMLR, 12 15 Jul 2023. Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 2006. Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, pages 174 188. ALT 2011, Springer, 2011. Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. Proceedings of the International Conference on Computational Learning Theory (COLT), 2019. Kihyuk Hong, Yuhang Li, and Ambuj Tewari. An optimization-based algorithm for non-stationary kernel bandits without prior knowledge. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 3048 3085. PMLR, 25 27 Apr 2023. Shinji Ito. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 2552 2583. PMLR, 15 19 Aug 2021. Shinji Ito and Kei Takemura. Best-of-three-worlds linear bandit algorithm with variance-adaptive regret bounds. In Gergely Neu and Lorenzo Rosasco, editors, Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 2653 2677. PMLR, 12 15 Jul 2023. Shogo Iwazaki and Shion Takeno. Near-optimal algorithm for non-stationary kernelized bandits. ar Xiv preprint ar Xiv:2410.16052, 2024. S. Jia, Qian Xie, Nathan Kallus, and P. Frazier. Smooth non-stationary bandits. In International Conference on Machine Learning, 2023. Baekjin Kim and Ambuj Tewari. Randomized exploration for non-stationary stochastic linear bandit. Uncertainty in Artificial Intelligence, 2020. Jung-Hun Kim, Milan Vojnovic, and Se-Young Yun. Rotting infinitely many-armed bandits. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11229 11254. PMLR, 17 23 Jul 2022. Jung-hun Kim, Milan Vojnovic, and Se-Young Yun. An adaptive approach for infinitely many-armed bandits under generalized rotting constraints. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Tracking Significant Shifts in Infinite Armed Bandits Levente Kocsis and Csaba Szepesvári. Discounted ucb. 2nd PASCAL Challenges Workshop, 2006. T.L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math, 1985. Tor Lattimore and Csaba Szepesvári. Bandit algoritms. Cambridge University Press, 2020. Fang Liu, Joohyun Lee, and Ness Shroff. A changedetection based framework for piecewise-stationary multiarmed bandit problem. Proceedings of the AAAI Conference on Artificial Intelligence, 2018. Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 114 122, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355599. doi: 10.1145/3188745.3188918. Anne Gael Manegueu, Alexandra Carpentier, and Yi Yu. Generalized non-stationary bandits. 
ar Xiv preprint: ar Xiv:2102.00725, 2021. Joseph Mellor and Jonathan Shapiro. Thompson sampling in switching environments with bayesian online change detection. In Carlos M. Carvalho and Pradeep Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 442 450, Scottsdale, Arizona, USA, 29 Apr 01 May 2013. PMLR. Yoan Russac, Olivier Cappé, and Aurélien Garivier. Algorithms for non-stationary generalized linear bandits. ar Xiv preprint ar Xiv:2003.10113, 2020. Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2019. ISSN 1935-8237. doi: 10.1561/ 2200000068. Joe Suk. Adaptive smooth non-stationary bandits. ar Xiv preprint ar Xiv:2407.08654, 2024. Joe Suk and Arpit Agarwal. When can we track significant preference shifts in dueling bandits? Advances in Neural Information Processing Systems (Neur IPS), 2023. Joe Suk and Samory Kpotufe. Tracking most significant arm switches in bandits. Conference on Learning Theory (COLT), 2022. Joe Suk and Samory Kpotufe. Tracking most significant shifts in nonparametric contextual bandits. Advances in Neural Information Processing Systems (Neur IPS), 2023. Yining Wang. Technical note on adaptivity in nonstationary stochastic optimization with bandit feedback. Operations Research, 2022. Yizao Wang, Jean-yves Audibert, and Rémi Munos. Algorithms for infinitely many-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008. Chen-Yu Wei and Haipeng Luo. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. Proceedings of the 32nd International Conference on Learning Theory, 2021. Jia Yuan Yu and Shie Mannor. Piecewise-stationary bandit problems with side observations. Proceedings of the 26th Annual international Conference on Machine Learning, pages 1177 1184, 2009. Peng Zhao, Lijun Zhang, Yuan Jiang, and Zhi-Hua Zhou. A simple approach for non-stationary linear bandits. International Conference on Artificial Intelligence and Statistics (AISTATS), 2020. Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 3285 3312. PMLR, 02 05 Jul 2022. Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 467 475. PMLR, 16 18 Apr 2019. Tracking Significant Shifts in Infinite Armed Bandits A. Blackbox Algorithm Regret Analysis (Details for the Proof of Theorem 2) Here, we present the details of the proof of Theorem 2 following the outline of Section 4. Again, we focus on the case of β 1 and discuss modifications of the argument required for β < 1 in Subsection A.6. A.1. Preliminaries Let c0, c1, c2, . . . denote constants not depending on T, κ1, or κ2 (Assumption 1). In what follows, all logarithms will be base 2. We ll also assume WLOG that log(T) κ 1 β 1 κ2, as otherwise we can bound the regret by a constant only depending on κ1 and κ2. 
Next, we establish a basic fact about the block structure of Algorithm 1. Fact 6. Recall from Section 4 that mℓis the index of the last block in the ℓ-th episode. Then, for any episode [tℓ, tℓ+1) terminating in a restart (via Algorithm 1 of Algorithm 1), we have Smℓ= l 2mℓ β β+1 m 2mℓ, (7) (tℓ+1 tmℓ ℓ) 1 β+1 C1 2 1 β+1 log3(T) (8) Proof. By Algorithm 1 of Algorithm 1, we must have 2mℓ tℓ+1 tmℓ ℓ + 1 1 Ys(as) C1 |Amℓ| log3(T). Now, recall the definition of Sm so that |Amℓ| := l 2m β β+1 m 2m. We first note that we cannot have |Amℓ| = 2mℓsince the above display would then become 1 C1 log3(T). Thus tℓ+1 tmℓ ℓ + 1 C1 2mℓ β β+1 log3(T) C1 (tℓ+1 tmℓ ℓ + 1) β β+1 log3(T). Rearranging this becomes (tℓ+1 tmℓ ℓ +1) 1 β+1 C1 log3(T). Further, bounding tℓ+1 tmℓ ℓ +1 2 (tℓ+1 tmℓ ℓ) finishes the proof. A.2. Using Concentration to Bound Per-Block Regret (Proof of Theorem 8) We first present Freedman s inequality which is used in Section 4 to bound the regret using the empirical regret bound of Algorithm 1 of Algorithm 1. Lemma 7 (Strengthened Freedman s Inequality, Theorem 9 (Zimmert and Lattimore, 2022)). Let X1, X2, . . . , XT be a martingale difference sequence with respect to a filtration F1 F2 FT such that E[Xt | Ft] = 0 and assume E[|Xt| | Ft] < a.s.. Then, with probability at least 1 δ, Vt log 2 max{Ut, VT } + 2UT log 2 max{UT , VT } where VT = PT t=1 E[X2 t | Ft], and UT = max{1, maxs [T ] Xs}. We next use this to relate the per-episode regret to the empirical bounds of Algorithm 1. Lemma 8. Let E1 be the event that (a) for all blocks [tm ℓ, tm+1 ℓ ), s=tm ℓ δs(as) < 3C1 |Am| log3(T). (9) Tracking Significant Shifts in Infinite Armed Bandits and (b) for the last block [tmℓ ℓ, tℓ+1) of episodes [tℓ, tℓ+1) concluding in a restart, we have: 2 |Amℓ| log3(T). (10) Then, E1 occurs with probability at least 1 1/T. Proof. First, fix an interval of rounds [s1, s2]. Then, we note that s=s1 1 Ys(as) E[1 Ys(as) | Hs 1], is a martingale difference sequence with respect to the natural filtration {Ht}t of σ-algebras Ht, generated by the observations and decisions (of both decision-maker and adversary) up to round t 1. Now, since the choice of arm at at round t is fixed conditional on Ht 1, we have E[Yt(at) | Ht 1] = µt(at). We also have since Yt(at) [0, 1]: Var(1 Yt(at) | Ht 1) E[(1 Yt(at))2 | Ht 1] E[1 Yt(at) | Ht 1] = δt(at). Then, by Freedman s inequality (Theorem 7) we have with probability at least 1 1/T 3: for all choices of intervals [s1, s2] [T]: s=s1 (1 Ys(as)) (1 µs(as)) v u u tlog (2T 3) s=s1 δs(as) + 2 log 2T 3 s=s1 δs(as) + 13 2 log(2T 3), (11) where the second inequality is by AM-GM. Going forward, suppose the above concentration holds for all subintervals of rounds [s1, s2] [T]. At the same time, on each m-th block, Algorithm 1 of Algorithm 1 gives us an empirical regret upper bound since the changepoint test is not triggered until possibly the last round of the block. Specifically, the inequality in Algorithm 1 of Algorithm 1 must be reversed for the second-to-last round of the block or s=tm ℓ 1 Ys(as) < C1 (|Am| 2m/2) log3(T), where by convention we let tmℓ+1 ℓ := tℓ+1. Thus, s=tm ℓ 1 Ys(as) < C1 (|Am| 2m/2) log3(T) + 1. (12) Combining the above with (11), we conclude s=tm ℓ δs(as) < 2C1 (|Am| 2m/2) log3(T) + 2 + 13 log(2T 3) 3C1 (|Am| 2m/2) log3(T). where the last inequality holds for C1 large enough. 
Tracking Significant Shifts in Infinite Armed Bandits At the same time, for constant C1 in Algorithm 1 of Algorithm 1 chosen large enough, we have that, for the last block [tmℓ ℓ, tℓ+1) of an episode concluding in a restart, we must have 1 Ys(as) C1 (|Am| 2m/2) log3(T) = 3 (|Amℓ| 2m/2) log3(T) 13 3 log(2T 3) C1 2 (|Amℓ| 2m/2) log3(T), where the first inequality in the second line above comes from combining the first line with (11) and the last inequality holds for C1 large enough. Finally, we note that for β 1, we have |Am| = Sm 2m = l 2m β β+1 m 2m 2m/2. Thus, |Am| 2m/2 = |Am| in all our above inequalities. A.3. Summing Regret Over Blocks Next, we sum the per-block regret bound of Theorem 8 over blocks m and episodes ℓto obtain a total regret bound. Lemma 9. Under event E1, we have t=1 δt(at) c0 log3(T) X ℓ [ˆL] (tℓ+1 tℓ) We first decompose the regret along episodes and blocks contained therein: t=1 1 µt(at) = t=tm ℓ 1 µt(at). Then, on event E1, we have summing (9): t=1 1 µt(at) X m [mℓ] 3C1 |Am| log3(T) m [mℓ] 2m β β+1 log3(T) ℓ [ˆL] (tℓ+1 tℓ) β β+1 log3(T). where in the last line we sum the geometric series over m and the fact that 2mℓ β β+1 (tℓ+1 tℓ) A.4. Showing there is Large Variation in Each Episode Next, following the proof outline of Section 4, our goal is to show there is a minimal amount of variation in each episode. Lemma 10. ℓ [ˆL 1], w.p. 1 4/T: t=tℓ |µt(at 1) µt 1(at 1)| log3(T) (tℓ+1 tℓ) 1 β+1 . Proof. As outlined in the proof outline of Section 4, in light of the regret lower bound (10) of Theorem 8, we consider two Tracking Significant Shifts in Infinite Armed Bandits different cases which we recall below: 1 µt(ˆaℓ,mℓ) C1 4 |Amℓ| log3(T) (A) 1 µt(ˆaℓ,mℓ) C1 4 |Amℓ| log3(T) and µt(ˆaℓ,mℓ) µt(at) C1 4 |Amℓ| log3(T) (B) Now, on event E1, due to (10), one of (A) or (B) must hold. Our goal is to show that in either case, large variation must have elapsed over the episode. Best Initial Arm has Large Dynamic Regret. We first consider the former case (A). First, we establish a lemma asserting the best initially subsampled arm has small initial gap with high probability. Lemma 11. (Proof in Subsection A.7) Recall that µ0(a) is the initial mean reward of arm a. Let ˆaℓ,mℓdenote the best initial arm among the arms sampled in the last block of episode [tℓ, tℓ+1) or ˆaℓ,mℓ:= arg maxa Amℓµ0(a). Let E3 be the event that ℓ [ˆL] : 1 µ0(ˆaℓ,mℓ) log3(T) 2mℓ 1 β+1 . Then, E3 occurs with probability at least 1 1/T. Now, from (A) (with large enough C1 > 16), we also know that there exists a round t [tmℓ ℓ, tℓ+1) such that 1 µt (ˆaℓ,mℓ) > C1 |Amℓ| log3(T) 4(tℓ+1 tmℓ ℓ) 16 2mℓ β β+1 log3(T) 4 2 2mℓ 2 log3(T) 2mℓ 1 β+1 . This means the mean reward of arm ˆaℓ,mℓmust have moved by amount at least 2 mℓ 1 β+1 log3(T) over the course of the block, implying µ0(ˆaℓ,mℓ) µt (ˆaℓ,mℓ) log3(T) 2mℓ 1 β+1 log3(T) (tℓ+1 tℓ) 1 β+1 . Since the adversary can only modify the rewards of an arm at after the round t it is played, the movement must have occurred on rounds where ˆaℓ,mℓwas chosen by the agent. Thus, t=tℓ |µt(at 1) µt 1(at 1)| t=tℓ |µt(ˆaℓ,mℓ) µt 1(ˆaℓ,mℓ)| µ0(ˆaℓ,mℓ) µt (ˆaℓ,mℓ) log3(T) (tℓ+1 tℓ) 1 β+1 . Regret of Base Algorithm is Large. Now, we consider the other case (B), where the regret of the subsampled bandit environment over the rounds [tmℓ ℓ, tℓ+1) is large, while the dynamic regret of the best initial arm is small. 
In this case, we want to show a contradiction: that if the total variation over the episode is small, then the finite MAB environment experienced by the base algorithm must be mildly corrupt (Theorem 1) which means Assumption 2 can be used to bound the regret of the last block. This will yield a contradiction since by virtue of Algorithm 1 being triggered, we know the regret of the last block be large. Suppose for contradiction that t=tℓ |µt(at 1) µt 1(at 1)| < log3(T) (tℓ+1 tℓ) 1 β+1 . (13) Then, since our adaptive adversary only changes the rewards of arms on the rounds they are played, we have the finite-armed MAB environment experienced by the base is α-mildly corrupt (Theorem 1) with respect to reference reward profile {µ(a)}a Amℓand α := log3(T ) (tℓ+1 t mℓ ℓ ) 1 β+1 . Tracking Significant Shifts in Infinite Armed Bandits This means we can employ Assumption 2 to bound the regret. In particular, our end goal here is to show that t=tm ℓ µt(ˆaℓ,mℓ) µt(at) < C1 4 |Amℓ| log3(T), (14) which will contradict (B) and imply (13) is true. First, by Assumption 2 with α = log3(T ) (tℓ+1 t mℓ ℓ ) 1 β+1 , we have µt(ˆaℓ,mℓ) µt(at) C0 (tℓ+1 tmℓ ℓ) 1 β+1 + (tℓ+1 tmℓ ℓ) β β+1 log3(T) To ease notation, we now parametrize the arms in Amℓ\{ˆaℓ,mℓ} via {2, . . . , |Amℓ|} and let i be the initial gap of arm i to ˆaℓ,mℓ: i := µ0(ˆaℓ,mℓ) µ0(i), Now, we will partition the values of i based on a dyadic grid. Let i=2 1{ i [2 (j+1) δ0(ˆaℓ,mℓ), 2 j δ0(ˆaℓ,mℓ))}, where j ranges from 0 to (tℓ+1 tmℓ ℓ) 42β log3β(T) Next, the following lemma bounds Nj,ℓin high probability using Freedman s inequality (Theorem 7). Lemma 12. (Proof in Subsection A.8) Let E4 be the event that the following hold: j {0, . . . , J}, ℓ {1, . . . , ˆL 1} : Nj,ℓ 3κ2 2 |Amℓ| 2 jβ + 13 2 log(2T 3) (15) 4 log(T) tℓ+1 tmℓ ℓ 1 β+1 i < 2 (J+1) ) 2 |Amℓ| 2 (J+1)β + 13 2 log(2T 3) (16) Then, E4 occurs with probability at least 1 1/T. The next lemma relates the cutoff i 4α to the quantity 2 (J+2) showing that every gap i such that i 4α lies in one of the bins of the dyadic grid. Lemma 13. (Proof in Subsection A.9) We have 2 (J+2) δ0(ˆaℓ,mℓ) > 4α. Now, Theorem 13 implies 2 (J+1) δ0(ˆaℓ,mℓ) > 4α. Thus, every gap i such that i/4 log3(T ) (tℓ+1 t mℓ ℓ ) 1 β+1 lies in some interval [2 (j+1) δ0(ˆaℓ,mℓ), 2 j δ0(ˆaℓ,mℓ)) for some j {0, . . . , J} or lies in the interval [4α, 2 (J+1) δ0(ˆaℓ,mℓ)). Thus, we bound j=0 Nj,ℓ (2 (j+1) δ0(ˆaℓ,mℓ)) 1 log(T) + log(T) α 1 |Amℓ| X i=2 1{α i < 2 (J+1) δ0(ˆaℓ,mℓ)} (17) Tracking Significant Shifts in Infinite Armed Bandits Now, by Theorem 13, we have for all j {0, . . . , J}: 2 (j+1) δ0(ˆaℓ,mℓ) > 2 (j+2). Combining the above with Theorem 12, we have (17) is order log(T) α 1 κ2|Amℓ| 2 (J+1)β + log(T) + log(T) j=0 κ2 |Amℓ| 2j(1 β)+2 + 2(j+1) log(T). 2 (J+1)β 2 β 1 2β 42β log3β(T) (tℓ+1 tmℓ ℓ) 42β log3β(T) 2β(tℓ+1 tmℓ ℓ) κ2 log(T) α 1 2 (J+1)β c3κ2 (tℓ+1 tmℓ ℓ) 1 β β+1 log3β 3(T) c4κ2 log3(1 β)(T) log3β 3(T) = c5 log(T), where the second inequality follows from (8) of Theorem 6 and β 1, and the third inequality follows from log(T) κ2 by choosing T sufficiently large. Thus, log(T) α 1 κ2|Amℓ| 2 (J+1)β + log(T) c6|Amℓ| log(T). j=0 2j(1 β)+2 4(J + 1) 2J(1 β) log2(T) 21 β (tℓ+1 tmℓ ℓ) 421 β log2(1 β)(T) (tℓ+1 tmℓ ℓ) 1 β+1 42β log3(T) 8 log2(T) (1 + log(T)) c7 log3(T), where the first inequality follows from κ2 log(T), and the third inequality follows from β 1 and (8) of Theorem 6. Thus, (17) is at most order |Amℓ| log3(T). Thus, choosing C1 large enough in Algorithm 1 of Algorithm 1, we have that (14) holds. 
Following earlier discussion, this means the subsampled bandit environment over the last block [tm ℓ, tℓ+1) is not mildly corrupt (Theorem 1). Then, for any µ( ) : Amℓ [0, 1], there always exists t [tm ℓ, tℓ+1) and a Amℓsuch that |µt (a ) µ(a )| > log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 . This means that since the adversary can only modify the rewards of arms at after the round t, we have t=tℓ |µt(at 1) µt 1(at 1)| |µt(at 1) µt 1(at 1)| |µt(a ) µt 1(a )| |µt (a ) µ0(a )| (tℓ+1 tmℓ ℓ) 1 β+1 (tℓ+1 tℓ) 1 β+1 . Tracking Significant Shifts in Infinite Armed Bandits A.5. Relating Episodes to Non-Stationarity Measures Now, we show how to derive ℓ=1 (tℓ+1 tℓ) β β+1 log3(T) (L + 1) 1 β+1 T β β+1 (V 1 β+2 T β+1 β+2 + T β β+1 ) log3(T), from the fact that, as we have shown by Theorem 10, for all episodes ℓ [ˆL 1] t=tℓ |µt(at 1) µt 1(at 1)| log3(T) (tℓ+1 tℓ) 1 β+1 . We first show the bound in terms of total variation V . Let V[s1,s2) := Ps2 1 t=s1 |µt(at 1) µt 1(at 1)|. Then, by using Hölder s inequality: ℓ=1 (tℓ+1 tℓ) β β+1 log3(T) T β β+1 log3(T) + (tℓ+1 tℓ) 1 β+1 ℓ=1 (tℓ+1 tℓ) log3 3 β+2 (T) β β+1 log3(T) + ℓ=1 V[tℓ,tℓ+1) β+1 β+2 log β β+1 log3(T) + V 1 β+2 T β+1 β+2 log To show the bound in terms of L, we use Jensen s inequality on the function x 7 x β β+1 combined with the fact that the number of episodes ˆL L + 1 by virtue of Theorem 10. A.6. A Sketch of Modifications Required for Proving Theorem 2 for β < 1 Next, we describe how to show a suboptimal regret bound of order PˆL ℓ=1 tℓ+1 tℓin the setting of β < 1. From here, using the steps of Subsection A.5 with β = 1, it is straightforward to show a regret bound of p (L + 1)T (V 1/3T 2/3+ T). For the sake of redundancy, we only give here a sketch of the modifications to the argument required. We ll first describe at a high level the difficulty of showing a PˆL ℓ=1(tℓ+1 tℓ) β β+1 regret bound using the same arguments of the previous sections for β < 1. In fact, the only place where β 1 was used is in the final step where we bound (17). In particular, when bounding the sum PJ j=0 2j(1 β)+2 4(J + 1) 2J(1 β), for β 1, 2J(1 β) is a constant. However, for β < 1, we may incur an extra (tℓ+1 tmℓ ℓ) 1 β β+1 term in the final regret bound due to this term. Thus, this argument would only yield a suboptimal regret bound. Interestingly, Bayati et al. (2020, cf. E.1.2) and Kim et al. (2024) also face this difficulty in bounding similar terms which leads to a suboptimal rate. However, we can still attain a tℓ+1 tℓper-episode regret bound using an altered subsampling rate Sm := 2m β/2 2m which effectively targets a tℓ+1 tℓregret rate. Going into more detail, a first key fact, as observed in Bayati et al. (2020), is that subsampling Sm arms ensures an arm with gap O((tℓ+1 tℓ) 1/2). This can be seen analogously to the proof of Theorem 11 in Subsection A.7 where letting β = 1 and |Amℓ| 2mℓ β/2 in Subsection A.7 establishes that the best initial arm has gap at most log3(T) 2 mℓ β/2. Then, the key fact to show will be an analogue of Theorem 10: with probability at least 1 4/T, for all ℓ [ˆL 1]: t=tℓ |µt(at 1) µt 1(at 1)| log3(T) tℓ+1 tℓ . (18) Tracking Significant Shifts in Infinite Armed Bandits To show this, we ll essentially repeat the arguments of Subsection A.4 except specializing β = 1 to target a bound scaling like tℓ+1 tℓ. Note that using the scaling |Am| 2m/2 in the regret threshold for the changepoint detection test (Algorithm 1 of Algorithm 1) is crucial here as |Am| 2m β/2 2m/2 in the case of β < 1. 
First, if the dynamic regret of the best initial arm ˆaℓ,mℓis larger than C1 4 2mℓ/2 log3(T), then using the previously established fact that δ0(ˆaℓ,mℓ) log3(T) 2 mℓ β/2, we have that (18) holds. Next, following the argument structure of Subsection A.4, suppose Ptℓ+1 1 t=t mℓ ℓ δt(ˆaℓ,mℓ) C1 4 2mℓ/2 log3(T) but Ptℓ+1 1 t=t mℓ ℓ µt(ˆaℓ,mℓ) µt(at) C1 4 2mℓ/2 log3(T). Then, we invoke Assumption 2 with α = log3(T) (tℓ+1 tmℓ ℓ) 1/2 and use the same dyadic gridding argument with J := 1 log (tℓ+1 tmℓ ℓ)1/2 Then, observe the bounds α 1 2 (J+1)β (tℓ+1 tmℓ ℓ)1/2 log3(T) log3(T) (tℓ+1 tmℓ ℓ)1/2 O(1) J 2J(1 β) |Amℓ| 2mℓ β/2 (tℓ+1 tmℓ ℓ)(1 β)/2 2mℓ/2. Thus, we can show Ptℓ+1 t=t mℓ ℓ µt(ˆaℓ,mℓ) µt(at) < C1 4 2mℓ/2 log3(T) if (18) does not hold, which contradicts our earlier supposition. This means (18) holds. A.7. Proof of Theorem 11 Let µ0(a) denote the initial mean reward of arm a. By Assumption 1, we have P µ0(ˆaℓ,mℓ) 1 log3(T) = P max a Amℓ µ0(a) 1 log3(T) 1 P µ0(a) > 1 log3(T) 1 κ1 log3β(T) exp |Amℓ| log(T) where the fourth line follows from assuming WLOG that log(T) κ 1 2β 1 and β 1, and the last inequality from |Amℓ| 21+mℓ β β+1 . A.8. Proof of Theorem 12 By Freedman s inequality (Theorem 7), we have with probability at least 1/T 3: |Nj,ℓ E[Nj,ℓ]| 3 q E[Nj,ℓ] log(2T 3) + 2 log(2T 3) Using AM-GM, this yields, Nj,ℓ 3E[Nj,ℓ] 2 log(2T 3). P(2 (j+1) δ0(ˆaℓ,mℓ) i < 2 j δ0(ˆaℓ,mℓ)) = P(2 (j+1) δ0(i) < 2 j) P(1 2 j < µ0(i)) κ2 2 jβ. Tracking Significant Shifts in Infinite Armed Bandits Thus, E[Nj,ℓ] (|Amℓ| 1) κ2 2 jβ. This shows (15). (16) is showed in a nearly identical manner. A.9. Proof of Theorem 13 Recall α := log3(T ) (tℓ+1 t mℓ ℓ ) 1 β+1 and let δ0 := δ0(ˆaℓ,mℓ) to ease notation. First, suppose J = 1. Then, by Theorem 11 2 3 δ0 > 4α = 1 8 > log3(T) 2mℓ 1 β+1 + 4 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 . Now, since β 1 and 2mℓ tℓ+1 tmℓ ℓ, it suffices to show 1 8 > 5 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 (tℓ+1 tmℓ ℓ) 1 β+1 > 40 log3(T). However, this last inequality is true from (8) of Theorem 6 for sufficiently large C1. Now, suppose J > 1. Rearranging the desired inequality we have (δ0 + 4α) 1 > 2J+2 = 1 4(δ0 + 4α) > 4 2 (tℓ+1 tmℓ ℓ) 1 β+1 From (8) of Theorem 6, we have for sufficiently large C1: 2 (tℓ+1 tmℓ ℓ) 1 β+1 42 log3(T) > 4. Thus, in light of Theorem 11 and the definition of α we want to show (tℓ+1 tmℓ ℓ) 1 β+1 > 4 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 + 16 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 , which is always true. B. Elimination Algorithm Regret Analysis (Proofs of Theorem 4 and Theorem 5) Following the notation of Section C, we let c0, c1, c2, . . . denote constants not depending on T , κ1, or κ2 (Assumption 1). B.1. Details for the Proof of Theorem 4 As in the proof of Theorem 2 in Section A, we assume WLOG that log(T) κ (1 2β) 1 or else we can bound the regret by a constant only depending on κ1. Following the proof outline of Subsection 5.3, we have by Freedman s inequality (Theorem 7) with probability at least 1 1/T: t=tm ℓ δt(at) v u u u tlog(T) t=tm ℓ E[δt(at) | Ht 1] |Gt| + c9 log(T), Tracking Significant Shifts in Infinite Armed Bandits where the second inequality uses AM-GM. Now, recalling that ta is the last round in block [tm ℓ, tm+1 ℓ ) that arm a [K] is retained, we have: |Gt| max s [tm ℓ,ta] 1 |Gs| t=tm ℓ δt(a). (19) We next again use Freedman s inequality (Theorem 7) to relate P t I δt(a) to P t I ˆδIW t (a). For the variance bound, we have E[(δt(a) ˆδIW t (a))2|Ht 1] E[(ˆδIW t )2(a)|Ht 1] = E[|Gt|2 (1 Yt(at))2 1{at = a} | Ht 1] |Gt|2 E[(1 Yt(a)) | Ht 1] E[1{at = a} | Ht 1] = |Gt|2 δt(a) 1 |Gt| = δt(a) |Gt|. 
We also have maxs I |δs(a) ˆδIW s (a)| maxs I |Gs| and E[ˆδIW t (a) | Ht 1] = δt(a). From the above and Freedman s inequality, we can show that w.p. at least 1 1/T, for any interval I [T] on which arm a is retained: t=tm ℓ δt(a) ˆδIW t (a) max t [tm ℓ,ta] |Gt| log(T) + v u u tlog(T) t=tm ℓ δt(a) |Gt| t=tm ℓ δt(a) + c11 max t [tm ℓ,ta] |Gt| log(T), (20) where again we use AM-GM in the second inequality. Moving the 1 t δt(a) to the other side, we get t=tm ℓ δt(a) 2 ˆδIW t (a) + c12 log(T) max t [tm ℓ,ta] |Gt|. Plugging the above into (19), we have |Gt| max s [tm ℓ,ta] c13 |Gs| ˆδIW t (a) + log(T) max t [tm ℓ,ta] |Gt| c14 max s [tm ℓ,ta] |Am| |Gs| log(T), where the second inequality is from the elimination guarantee (Algorithm 2 of Algorithm 2) and maxt [tm ℓ,ta] |Gt| = |Am|. We next have X a Am max s [tm ℓ,ta 1] 1 |Gs| = X 1 |Gta 1| 1{ta < tm+1 ℓ 1} + 1 |Gtm+1 ℓ 1| 1{ta = tm+1 ℓ 1} i + |Gtm+1 ℓ 1| |Gtm+1 ℓ 1| 1 + log(|Am|), Now, combining the above with our previous bound and summing over arms a, we have: |Gt| 1{a Gt} c15|Am| log2(T). Tracking Significant Shifts in Infinite Armed Bandits We know that for any m = mℓ, we have |Am| = l 2(m+1) β β+1 log(T) m . We also have |Amℓ| 2|Amℓ 1| = 2 l (2(tmℓ+1 ℓ tmℓ ℓ)) β β+1 log(T) m . Then, summing |Am| log2(T) over m [mℓ] and ℓ [ˆL] gives us the regret bound of order PˆL ℓ=1(tℓ+1 tℓ) β β+1 log3(T). B.2. Details for the Proof of Theorem 5 We define event E5 such that, for any I [T], t I δt(a) ˆδIW t (a) t I δt(a) + c16log(T) max t I |Gt|, (21) which holds with probability at least 1 1/T, by the same reasoning as (20) We consider episode ℓ [ˆL] that terminates in a restart and let mℓbe the index of the last block in ℓ-th episode and tℓbe the start time of the ℓ-th episode. Bounding the Number of Episodes. We claim the number of episodes is at most the number of significant shifts or ˆL L + 1. In particular, we prove that a significant shift must have occurred over some block in each episode concluding with a restart. Let ta be the time when arm a is eliminated. For the mℓ-th block [tmℓ ℓ, tℓ+1), on E5, for each arm a Amℓ we have by (21) and Algorithm 2 of Algorithm 2 with C2 > 0 large enough:: ˆδIW(a) c18 log(T) max t [t mℓ ℓ ,ta] |Gt| c17|Amℓ| log(T). Then, for all a Amℓwhere |Amℓ| = l 2(m+1) β β+1 log(T) m (tℓ+1 tmℓ ℓ) β β+1 log(T) (ta tmℓ ℓ) β β+1 log(T), we have δt(a) > 3(tℓ+1 tmℓ ℓ) β β+1 log3(T) 3(ta tmℓ ℓ) β β+1 log3(T), (22) which implies that (5) is violated for log(T) κ 1 1 , meaning arm a is unsafe. By Theorem 3, a significant shift must have occurred within the block [tmℓ ℓ, tℓ+1). Now, by considering that the ˆL-th episode may end by reaching the horizon T rather than restarting, which does not ensure a significant shift, we conclude that ˆL L + 1. Therefore, by using Jensen s inequality, we have w.p. at least 1 T 1, ℓ=1 (tℓ+1 tℓ) β β+1 ˆL 1 β+1 T β β+1 ( L + 1) 1 β+1 T β β+1 . (23) Bounding the Per-Episode Variation. Next, we show that, on event E5 where (21) holds, the total rotting variation over episode [tℓ, tℓ+1) is at least t=tℓ (µt 1(at 1) µt(at 1))+ log3(T) (tℓ+1 tℓ) 1 β+1 . From (22), we have for all a Amℓ, there exists t [tmℓ ℓ, ta] such that δt (a) 3 log3(T) (ta tmℓ ℓ) 1 β+1 3 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 . (24) Tracking Significant Shifts in Infinite Armed Bandits Recall µ0(a) is the initial mean reward of arm a. Let ˆaℓ,m be the best arm in terms of initial reward value µ0( ) among the arms Am subsampled for block [tm ℓ, tm+1 ℓ ). 
We have P µ0(ˆaℓ,mℓ) < 1 2 log3(T) = P max a Amℓ µ0(a) < 1 2 log3(T) 1 κ1 2 log3β(T) |Amℓ| 2κ1 log3β(T) where the last inequality follows from log(T) κ 1 2β 1 and |Amℓ| 2mℓ β β+1 log(T). Then, we define E6 to be the corresponding high-probability event { ℓ [ˆL] : 1 µ0(ˆaℓ,mℓ) 2 log3(T) 2 mℓ 1 β+1 }, which holds with at least probability of 1 1/T. On E6, we have µ0(ˆaℓ,mℓ) 1 2 log3(T) 2mℓ 1 β+1 1 2 log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 . (25) From (24) and (25), we can observe that the mean reward of arm ˆaℓ,mℓmust have moved by amount at least (tℓ+1 tmℓ ℓ) 1 β+1 log3(T) (tℓ+1 tℓ) 1 β+1 , over the course of the block, implying µ0(ˆaℓ,mℓ) µt (ˆaℓ,mℓ) log3(T) (tℓ+1 tmℓ ℓ) 1 β+1 log3(T) (tℓ+1 tℓ) 1 β+1 . Since the adversary can only modify the rewards of an arm at after the round t it is played, the movement must have occurred on rounds where ˆaℓ,mℓwas chosen by the agent. Thus, t=tℓ (µt 1(at 1) µt(at 1))+ t=tℓ (µt 1(ˆaℓ,mℓ) µt(ˆaℓ,mℓ))+ µ0(ˆaℓ,mℓ) µt (ˆaℓ,mℓ) log3(T) (tℓ+1 tℓ) 1 β+1 . Next, this lower bound on the per-episode variation gives us, in an identical manner Subsection A.5, a cumulative regret bound: ˆL X ℓ=1 (tℓ+1 tℓ) β β+1 log3(T) log3(T) V β+1 β+2 + T C. Verifying Assumption 2 for UCB Here, we show the UCB algorithm satisfies Assumption 2. The proof will mostly follow the standard proof for showing the classsical logarithmic regret bound (e.g. Lattimore and Szepesvári, 2020, Section 7.1), with some small modifications to account for the mild corruption (Theorem 1). We first present a variant of the UCB1 Algorithm of Auer et al. (2002). Theorem 14. Algorithm 3 satisfies Assumption 2. Tracking Significant Shifts in Infinite Armed Bandits Algorithm 3: Variant of UCB1 (Algoritm 3 of Lattimore and Szepesvári (2020)) 1 Input: number of arms K, horizon T. 2 for t = 1, . . . , T do 3 Let Nt 1(a) := Pt 1 s=1 1{as = a}. 4 Let UCBa(t 1) := ( Nt 1(a) = 0 1 Nt 1(a) Pt 1 s=1 1{as = a} Ys(a) + q 2 log(T ) Nt 1(a) otherwise . 5 Play arm at = arg maxa [K] UCBa(t 1). Proof. Our goal is to establish a high-probability regret bound over t T rounds. First, using Theorem 1, we may write the regret as s=1 µs(a) µs(as) max a A0 s=1 µ(a) µ(as) + 2t α. Recall Nt(a) := Pt s=1 1{as = a}. Let a denote the gap of arm a A0. Then, we can decompose the regret based on whether the gap of each arm is large or small and write: s=1 µ(a) µ(as) X a A0 a Nt(a) 1 { a > 4α} + 4t α. (26) It then remains to bound Nt(a) w.h.p for each a A0 such that a > α. WLOG, suppose arm a = 1 is the best arm among the arms in A0, which we ll index by the set [|A0|]. We claim with probability at least 1 1/T, for t such that Nt(a) > l 32 log(T ) UCBa(t) < UCB1(t). (27) This will allow us to conclude by bounding Nt(a) l 32 log(T ) Let ˆµt(a) := Pt s=1 Ys(a) 1{as = a} and let µt(a) := Pt s=1 µs(a) 1{as = a}. Then, by Azuma-Hoeffding inequality and a union bound we have s [t] : |ˆµs(a) µs(a)| Ns(a) log(2T 2) Going forward, suppose the above concentration bound holds. Now, to show the claim, we first note for t such that Nt(a) > l 32 log(T ) m and a = 1 such that a > 4α: UCBa(t) = ˆµt(a) + = µ(a) + 3 a Tracking Significant Shifts in Infinite Armed Bandits where the third inequality is by Theorem 1. Meanwhile, UCB1(t) = ˆµt(1) + Thus, claim (27) is shown. Remark 6. Although not shown here for the sake of redundancy, a very similar regret analysis as Theorem 14 shows the Successive Elimination algorithm (Even-Dar et al., 2006) also satisfies Assumption 2.