# adversarial_bandits_against_arbitrary_strategies__b2526369.pdf Published in Transactions on Machine Learning Research (10/2025) Adversarial Bandits Against Arbitrary Strategies Jung-hun Kim junghun.kim@ensae.fr CREST, ENSAE, IP Paris Fair Play joint team, France Se-Young Yun yunseyoung@kaist.ac.kr KAIST AI, South Korea Reviewed on Open Review: https: // openreview. net/ forum? id= x4Qr Oh8u Cs We study the adversarial bandit problem against arbitrary strategies, where the difficulty is captured by an unknown parameter S, which is the number of switches in the best arm in hindsight. To handle this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with simple OMD, achieving O(S1/2K1/3T 2/3), in which T 2/3 comes from the variance of loss estimators. To mitigate the impact of the variance, we propose using adaptive learning rates for OMD and achieve O(min{ SKTρ, S KT}), where ρ is a variance term for loss estimators. 1 Introduction The bandit problem is a fundamental framework in sequential decision-making that addresses the explorationexploitation trade-off. At each time step, an agent selects an action (referred to as an arm) and observes a corresponding reward or loss. In applications such as recommendation systems, arms may represent items presented to users, and user preferences can evolve over time. This dynamic nature can be modeled by allowing the identity of the best arm to change, leading to the notion of switching best arms. To capture such evolving environments, it is natural to consider performance against a changing sequence of optimal arms rather than a single fixed one. The problem of competing against switching arms has been extensively studied. In the expert setting with full-information feedback (Cesa-Bianchi et al., 1997), several algorithms (Daniely et al., 2015; Jun et al., 2017) have been developed that achieve a near-optimal regret bound of O( ST) for S-switch regret (formally defined later), without requiring prior knowledge of the number of switches S. In contrast, the bandit setting presents a greater challenge, as the agent only observes the feedback for the selected arm rather than the full loss vector, making the problem significantly harder than in the full-information case. In the stochastic bandit setting where each arm s reward distribution may change over time, commonly referred to as the non-stationary bandit problem, several studies have addressed the challenge of switching environments (Garivier & Moulines, 2008; Auer et al., 2019; Russac et al., 2019; Suk & Kpotufe, 2022). Notably, Auer et al. (2019) and Suk & Kpotufe (2022) achieved near-optimal regret bounds of O( SKT) without requiring prior knowledge of the number of switches S. However, these methods rely on stochastic assumptions and are not applicable in the adversarial setting, where losses can be chosen arbitrarily. For the adversarial bandit setting, EXP3.S algorithm (Auer et al., 2002) achieves a regret bound of O( SKT), but assumes that S is known in advance. When S is unknown, the Bandit-over-Bandit (BOB) approach has been proposed, achieving a regret bound of O( SKT + T 3/4) (Cheung et al., 2019; Foster et al., 2020). More recently, Luo et al. (2022) studied the problem of switching adversarial linear bandits and achieved a regret of O( d ST) under the assumption that S is known. In this paper, we study adversarial bandit problems against arbitrarily switching arms. Crucially, we allow the number of switches S to be unknown to the agent, thereby targeting arbitrary strategies without prior Published in Transactions on Machine Learning Research (10/2025) knowledge of their complexity. To address this setting, we adopt the master-base framework combined with the online mirror descent (OMD) method, inspired by Agarwal et al. (2017); Pacchiano et al. (2020); Luo et al. (2022). We begin by analyzing a master-base algorithm that employs OMD with a negative entropy regularizer, and show that it achieves a regret bound of O(S1/2K1/3T 2/3). However, this approach relies on a fixed learning rate, which limits its ability to adapt to the variance of the loss estimators, leading to a suboptimal regret term proportional to T 2/3. Building on this analysis, we propose using adaptive learning rates within the OMD framework to better control the variance of loss estimators. This refinement leads to an improved regret bound of O(min{ SKTρ, S KT}) with respect to T, where ρ captures the variance associated with a comparator strategy. Crucially, rather than employing the standard negative entropy regularizer, we adopt a log-barrier regularizer, which enables tighter control over worst-case scenarios in terms of the variance term ρ. 2 Problem Statement We now formalize the problem setting. Let A = [K] denote the set of K arms, and let lt [0, 1]K be the loss vector at time t, where lt(a) denotes the loss incurred by arm a [K] at time t. The environment is adversarial, generating an arbitrary sequence of loss vectors l1, l2, . . . , l T [0, 1]K over a time horizon of T rounds. At each round t [T], the agent selects an arm at [K] and observes only the loss lt(at) of the chosen arm. Our goal is to minimize the S-switch regret, which measures the performance gap between the agent and the best sequence of actions that switches arms at most S times. Formally, let κ = {κ1, κ2, . . . , κT } [K]T denote a sequence of comparator actions. For a positive integer S < T, define the set of sequences with at most S switches as t=1 1{κt = κt+1} S Then, the S-switch regret is defined by RS(T) = max κ BS t=1 E[lt(at)] lt(κt). We consider the setting where the number of switches S is unknown to the agent (i.e., not provided in advance). Our goal is to design algorithms that perform well against any sequence of actions, without relying on prior knowledge of S. This requires the development of universal algorithms that achieve tight regret bounds uniformly over all values of S [T 1], where S characterizes the hardness of the problem (Auer et al., 2002). Notably, this setting generalizes non-stationary stochastic bandit problems in which the switching parameter is unknown (Auer et al., 2019; Chen et al., 2019). 3 Algorithms and Regret Analysis To address the problem of adversarial bandits with an unknown switching budget S, we propose algorithms based on the master-base framework combined with the online mirror descent (OMD) method. 3.1 Master-Base Framework In the master-base framework, a master algorithm selects one among several base algorithms at each round, and the selected base then chooses an arm to play. Since the true switch parameter S is not known in advance, we instantiate each base algorithm with a different candidate value of S from a predefined set. Let H denote the set of candidate values for S, defined as: H = {T 0, T 1 log T , T 2 log T , . . . , T}. Published in Transactions on Machine Learning Research (10/2025) Each base is associated with a candidate h H and tunes its learning rate accordingly. Let H = |H| = O(log T) be the number of base algorithms. When there is no ambiguity, we refer to a base instantiated with parameter h H simply as base h. Define h H as the largest candidate not exceeding the true (unknown) value of S: h = max{h H : h S}. By construction, this ensures which guarantees that h provides a near-optimal approximation of S. 3.2 Online Mirror Descent (OMD) We now present the Online Mirror Descent (OMD) method (Lattimore & Szepesvári, 2020), which serves as the fundamental update rule for the master and each base algorithm in our framework. OMD generalizes classic online gradient descent by incorporating a flexible geometry through a regularizer. Let F : Rd R be a convex and differentiable regularizer function. This regularizer defines the Bregman divergence between two points p, q Rd: DF (p, q) = F(p) F(q) F(q), p q . In our context, the decision variable is a probability distribution pt over d arms, i.e., pt Pd where Pd denotes the d-dimensional probability simplex. At each round t, given a loss vector l Rd, OMD computes the next distribution by solving the following optimization problem: pt+1 = arg min p Pd n p, l + DF (p, pt) o . (1) This formulation reflects the trade-off between exploiting current loss information (via the linear term) and staying close to the previous distribution (via the Bregman divergence). In practice, the update of equation 1 is often implemented in two steps for computational convenience: pt+1 = arg min p Rd n p, l + DF (p, pt) o , pt+1 = arg min p Pd DF (p, pt+1). Here, the first step performs an unconstrained update in the dual space defined by F, while the second step projects the intermediate point back onto the simplex, ensuring that the updated distribution is a valid probability vector. The choice of regularizer F determines the specific algorithm instance; it typically includes a learning rate parameter that controls the step size and will be specified in subsequent sections. Finally, in the bandit setting, since the learner does not observe the full loss vector lt but only the loss of the selected arm, the true loss must be replaced with an appropriate unbiased estimator. This ensures that OMD remains applicable even with partial feedback. 3.3 Master-Base OMD We employ an OMD framework with a hierarchical structure consisting of a master and multiple bases. We first present a simple Master-Base OMD algorithm (Algorithm 1) that employs the negative entropy regularizer: Fη(p) = (1/η) i=1 (p(i) log p(i) p(i)), where p Rd, p(i) denotes the i-index entry for p, and η is a learning rate. This regularizer is commonly used in adversarial bandit algorithms, including the well-known EXP3 algorithm (Auer et al., 2002). In Algorithm 1, at each round t, the master algorithm selects a base ht H according to a probability distribution pt. The selected base ht then chooses an arm at [K] using its internal distribution pt,ht Published in Transactions on Machine Learning Research (10/2025) and observes the incurred loss lt(at). From this partial feedback, we construct unbiased estimators: l t(h) estimates the loss for each base h H, and l t,h(a) estimates the loss for each arm a [K] under base h. Using these estimators, the algorithm updates both the master distribution pt+1 and each base s distribution pt+1,h via OMD. To control variance in the estimator l t(h) = lt(at)1(h = ht)/pt(h), we define the master s update domain as a clipped probability simplex Pα H = PH [α, 1]H for some α > 0. This clipping ensures that each base is selected with non-negligible probability, preventing high variance in the importance-weighted loss estimates. Within this domain, the update of pt+1 is then computed using the negative entropy regularizer with learning rate η. Each base h H maintains its own arm selection distribution pt,h. The update for this distribution is also based on the negative entropy regularizer but with a learning rate that adapts to the candidate switch parameter h: ξ(h) = h1/2 K1/3T 2/3 . This choice of ξ(h) helps base h adapt to environments with up to h switches in the best arm sequence. The domain for updating pt,h is Pβ K = PK [β, 1]K for β > 0. Unlike α, which controls variance in master-level estimation, β acts as a regularization mechanism to stabilize learning under the switching best arms in hindsight. Algorithm 1 Master-base OMD Input: T, K, H. Initialization: α = K1/3/(T 1/3H1/2), β = 1/(KT), η = 1/ TH, ξ(h) = h1/2/(K1/3T 2/3), p1(h) = 1/H, p1,h(a) = 1/K for h H and a [K]. for t = 1, . . . , T do Select a base and an arm: Draw ht probabilities {pt(h)}h H. Draw at,ht probabilities {pt,ht(a)}a [K]. Pull at = at,ht and Receive lt(at,ht) [0, 1]. Obtain loss estimators: l t(ht) = lt(at,ht) pt(ht) and l t(h) = 0 for h H/{ht}. l t,ht(at,ht) = l t(ht) pt,ht(at,ht) and l t,h(a) = 0 for h H/{ht}, a [K]/{at,ht}. Update distributions: pt+1 = arg minp Pα H p, l t + DFη(p, pt) pt+1,h = arg minp Pβ K p, l t,h + DFξ(h)(p, pt,h) for all h H end for Now we provide a regret bound for the algorithm in the following theorem. Theorem 3.1. For any switch number S [T 1], Algorithm 1 achieves a regret bound of RS(T) = O(S1/2K1/3T 2/3) Proof Sketch. Here, we provide a proof sketch, and the full version is provided in Appendix A.1. In our proof, we decompose the regret into two parts: one is the regret from the master selecting a base at each time, and the other is the regret from the base selecting an arm at each time. Let ts be the time when the s-th switch of the best arm happens and t S+1 1 = T, t0 = 1. Also let ts+1 ts = Ts. For any ts for Published in Transactions on Machine Learning Research (10/2025) all s [0, S], the S-switch regret can be expressed as t=1 E [lt(at)] s=0 min ks [K] t=ts lt(ks) t=1 E [lt(at,ht)] t=1 E lt(at,h ) + t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks), (2) in which the first two terms are closely related with the regret from the master algorithm against the near optimal base h , and the remaining terms are related with the regret from h base algorithm against the best arms in hindsight. We note that the algorithm does not need to know h in prior and h is brought here only for regret analysis. Regret from the near-optimal base. First we provide a bound for the following regret from base h : t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks), where the first term is the loss from the base and the second one is the loss from the optimal arm in hindsight. Let k s = arg mink [K] Pts+1 1 t=ts lt(k) and ej,K denote the unit vector with 1 at j-index and 0 at the rest of K 1 indices. Then, we have t=ts E lt(at,h ) lt(k s) βTs(K 1) + max p Pβ K E t=ts pt,h p, l t,h where the first term in the last inequality is obtained from the clipped domain Pβ K and the second term is obtained from the unbiased estimator l t,h such that E[l t,h |Ft 1] = E[lt|Ft 1] where Ft 1 denotes the natural filtration generated by the history up to round t 1. We can observe that the clipped domain controls the distance between the initial distribution at ts and the best arm unit vector for the time steps over [ts, ts+1 1]. Let pt+1,h = arg min p RK p, l t,h + DFξ(h )(p, pt,h ). Then, by solving the optimization problem, for all k [K] we can obtain pt+1,h (k) = pt,h (k) exp( ξ(h )l t,h (k)). For the second term of the last inequality in equation 3, we have for any p Pβ K, t=ts pt,h p, l t,h DFξ(h )(p, pts,h ) + t=ts DFξ(h )(pt,h , pt+1,h ). (4) The first term is for the initial point diameter at time ts and the second term is for the divergence of the updated policy. Using the definition of the Bregman divergence and the fact that pts,h (k) β, the initial point diameter term can be shown to be bounded as follows: DFξ(h )(p, pts,h ) log(1/β) ξ(h ) . (5) Next, for the updated policy divergence term, using pt+1,h (k) = pt,h (k) exp( ξ(h )l t,h (k)) for all k [K], we have t=ts E h DFξ(h )(pt,h , pt+1,h ) i ξ(h )KTs Published in Transactions on Machine Learning Research (10/2025) Then from equation 3, equation 4, equation 5, and equation 6, by summing up over s [S], we have t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks) βT(K 1) + S log(1/β) ξ(h ) + ξ(h )KT Next, we provide a bound for the following regret from the master: t=1 E [lt(at,ht)] t=1 E lt(at,h ) . Let pt+1 = arg minp RH p, l t + DFη(p, pt) and eh,H denote the unit vector with 1 at base h-index and 0 at the rest of H 1 indices. For ease of presentation, we define lt(h) = lt(at,h). Then, we have t=1 E lt(at,ht) lt(at,h ) αT(H 1) + max p Pα H E t=1 pt p, lt Regret from the master. For bounding the second term in equation 8, which arises from the master, we use the following: for any p Pα H t=1 pt p, lt Fη(p) Fη(p1) + t=1 DFη(pt, pt+1). (9) From equation 8 and equation 9, we have t=1 E [lt(at,ht)] t=1 E lt(at,h ) αT(H 1) + max p Pα H E Fη(p) Fη(p1) + t=1 DFη(pt, pt+1) αT(H 1) + log(H) Overall Regret. Therefore, putting equation 2, equation 7, and equation 10 altogether, we have t=1 E [lt(at)] s=0 min 1 ks K t=Ts lt(ks) αTH + log(H) 2 + βT(K 1) + S log(1/β) ξ(h ) + ξ(h )KT = O(S1/2T 2/3K1/3), where α = K1/3/(T 1/3H1/2), β = 1/(KT), η = 1/ TH, ξ(h ) = (h ) 1/2/(K1/3T 2/3), h = Θ(S), and H = log(T). This completes the proof. Compared to the prior parameter-free algorithm based on the Bandit-over-Bandit (BOB) approach (Cheung et al., 2019), which incurs a suboptimal dependence on the time horizon T, specifically, a regret term of order T 3/4, our algorithm achieves a tighter regret bound in terms of T. In particular, when T = ω(S6K4), Algorithm 1 achieves a tighter regret bound than that of BOB. However, the regret bound achieved by Algorithm 1 remains of order O(T 2/3), rather than the optimal O( T), due to the high variance in the loss estimators. This variance arises from the double sampling process first selecting a base, then an arm at each round. To address this issue, we next propose an improved algorithm that leverages adaptive learning rates to better control the variance of the estimators. Published in Transactions on Machine Learning Research (10/2025) 3.4 Master-Base OMD with Adaptive Learning Rates We now propose Algorithm 2, which incorporates adaptive learning rates to better control the variance of the loss estimators. We begin by describing the base algorithm. Each base employs the negative entropy regularizer, but with a time-varying adaptive learning rate ξt(h), defined as: Fξt(h)(p) = 1 ξt(h) i=1 (p(i) log p(i) p(i)), where p Rd is a probability distribution over arms. The learning rate ξt(h) is dynamically adjusted at each round t based on the variance of the loss estimators, and is given by: h/(KTρt(h)), where ρt(h) is a variance threshold term that will be defined later. This formulation ensures that when the variance of the estimators is small, a larger learning rate is used allowing for more aggressive updates while high variance naturally leads to more conservative updates. Notably, this adaptive base algorithm is effectively combined with the master employing log-barrier regularization to control the regret due to variance, resulting in equation 15, which introduces a novel integration of adaptive learning and log-barrier regularization for variance control. For the master algorithm, inspired by the approach in Agarwal et al. (2017), we employ a log-barrier regularizer with increasing learning rates. This design introduces a negative bias term that effectively offsets the variance arising from the base algorithms, particularly by addressing the worst-case scenario in terms of the variance threshold ρt(h ). The log-barrier regularizer is defined as: where p Pd is the master distribution, and ηt = (ηt(1), . . . , ηt(d)) denotes the vector of learning rates at time t. We describe the learning rate update procedures for both the master and the base algorithms in Algorithm 2; all other components remain identical to those in Algorithm 1. The variance of the loss estimator l t+1(h) = lt(at+1,h)1(ht+1 = h)/pt+1(h) for base h is given by 1/pt+1(h). If this variance exceeds the threshold ρt(h), i.e., 1/pt+1(h) > ρt(h), the master increases the learning rate as: ηt+1(h) = γηt(h) for some fixed γ > 1. Simultaneously, the variance threshold is updated to: ρt+1(h) = 2/pt+1(h) which is also used to adaptively tune the base learning rate ξt(h) as described earlier. If the variance does not exceed the threshold, both ηt(h) and ρt(h) remain unchanged from the previous time step. In the following theorem, we provide a regret bound of Algorithm 2. Theorem 3.2. For any switch number S [T 1] and any ρ E[ρT (h )], Algorithm 2 achieves a regret bound of RS(T) = O min np Proof Sketch. Here, we provide a proof sketch, and the full version is provided in Appendix A.2. As in the Theorem 3.1, we decompose the regret into two parts: one is the regret from the master selecting a base at each time, and the other is the regret from the base selecting an arm at each time. Let ts be the time when the s-th switch of the best arm happens and t S+1 1 = T, t0 = 1. Also let ts+1 ts = Ts. For any ts for all s [0, S], the S-switch regret can be expressed as t=1 E [lt(at)] s=0 min ks [K] t=ts lt(ks) t=1 E [lt(at,ht)] t=1 E lt(at,h ) + t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks). (11) Published in Transactions on Machine Learning Research (10/2025) Algorithm 2 Master-base OMD with adaptive learning rates Input: T, K, H Initialization: α = 1/(TH), β = 1/(TK), γ = e 1 log T , η = p H/T, ρ1(h) = 2H, η1(h) = η, p1(h) = 1/H, p1,h(a) = 1/K for h H and a [K]. for t = 1, . . . , T do Select a base and an arm: Draw ht probabilities {pt(h)}h H. Draw at,ht probabilities {pt,ht(a)}a [K]. Pull at = at,ht and Receive lt(at,ht) [0, 1]. Update loss estimators: l t(ht) = lt(at,ht) pt(ht) and l t(h) = 0 for h H/{ht}. l t,ht(at,ht) = l t(ht) pt,ht(at,ht) and l t,h(a) = 0 for h H/{ht}, a [K]/{at,ht}. Update distributions: pt+1 = arg minp Pα H p, l t + DGηt(p, pt) pt+1,h = arg minp Pβ K p, l t,h + DFξt(h)(p, pt,h) for h H Update learning rates: For h H If 1 pt+1(h) > ρt(h), then ρt+1(h) = 2 pt+1(h), ηt+1(h) = γηt(h). Else, ρt+1(h) = ρt(h), ηt+1(h) = ηt(h). end for Regret from the near-optimal base. First we provide a bound for the following regret from base h . From equation 3, we can obtain t=ts E lt(at,h ) t=ts lt(ks) βTs K + E t=ts pt,h p, l t,h Then for the second term of the last inequality in equation 12, we provide the following lemma. Lemma 3.3. For any p Pβ K we can show that t=ts E h pt,h p, l t,h i E Then from equation 8 and Lemma 3.3, we have t=1 E lt(at,h ) s=0 min 1 ks K t=Ts lt(ks) βT(K 1) + E 2S log(1/β) Regret from the master. Next, we provide a bound for the regret from the master in the following: t=1 E [lt(at,ht)] t=1 E lt(at,h ) O H log(T) η + Tη E ρT (h ) + αT(H 1). (14) The negative bias term in equation 14 is derived from the log-barrier regularizer and increasing learning rates ηt(h). This term is critical to bound the worst case regret which will be shown soon. Also, H log(T)/η is obtained from H log(1/(Hα))/η considering the clipped domain. Published in Transactions on Machine Learning Research (10/2025) Overall Regret. Then, putting equation 11, equation 13 and equation 14 altogether, we have t=1 E [lt(at)] s=0 min 1 ks K t=Ts lt(ks) = O E q SKTρT (h ) E Then, we can obtain RS(T) = O min np KT) is obtained from the worst case of ρT (h ). The worst case can be found by considering a maximum value of the concave bound of the last equality in equation 15 with variable ρT (h ) > 0 such that ρT (h ) = Θ(S). This concludes the proof. We now provide a comparison of regret bounds with an existing approach. our regret bound depends on ρ, which captures the variance of the loss estimators l t(h ) over the time horizon t [T]. While the bound is variance-dependent, it is noteworthy that in the worst case, it is always upper bounded by O(S KT). Importantly, Algorithm 2 achieves a tight dependence of O( T) in the regret bound, in contrast to Algorithm 1 and BOB (Cheung et al., 2019; Foster et al., 2020), which incur suboptimal T 2/3 and T 3/4 term, respectively. Therefore, when T is sufficiently large, Algorithm 2 yields a strictly better regret guarantee than Algorithm 1 and BOB. We also note that the variance term ρ is commonly observed in Luo et al. (2022), but we control the worst case without information of S using an adaptive learning rate. Remark 3.4. Since we incorporate ρt(h) into the adaptive learning rate for each base h as ξt(h) = p h/KTρt(h), we can optimize the regret bound to depend on p ρT (h ), as demonstrated in Lemma 3.3. Notably, this adaptive base algorithm is effectively combined with the master employing log-barrier regularization Agarwal et al. (2017) to control the regret due to variance, resulting p SρT (h ) ρT (h ). This integration is the main reason why Algorithm 2 can achieve a order of T, even in the worst case, without the knowledge of S. Remark 3.5. Our algorithms are designed to perform well across different regimes, particularly with respect to T and S. To recall the regret bounds, Algorithm 1 achieves a regret of O(S1/2K1/3T 2/3), while Algorithm 2 achieves O(min{ SKTρ, S KT}). This indicates that Algorithm 1 is advantageous when S is large, whereas Algorithm 2 is preferable for larger T due to its use of an adaptive learning rate that accounts for the variance of the loss estimator. Remark 3.6 (Implementation). For implementation of our algorithms, we briefly describe how to update the policy pt using OMD. Let bls(a) denote a loss estimator for ls(a) for action a [d] and s [T]. For the negative-entropy regularizer, by solving the optimization in equation 1 (see Lattimore & Szepesvári (2020)), we obtain pt(a) = exp η Pt 1 s=1 bls(a) b [d] exp η Pt 1 s=1 bls(b) . For the log-barrier regularizer, the solution is pt(a) = (η Pt 1 s=1 bls(a) + Z) 1, where Z is the normalization constant ensuring P a pt(a) = 1 (Luo et al., 2022). When the feasible set is the clipped simplex Pϵ/d d = {p d : p(a) ϵ/d} for 0 < ϵ < 1, a computationally simple way to enforce feasibility is to mix with the uniform distribution: pt(a) (1 ϵ) pt(a) + ϵ/d for all a [d]. This uniform-mixing trick, following Auer et al. (2002), guarantees pt Pϵ/d d to ensure a bounded variance of the loss estimator. We emphasize that this is an implementation convenience rather than the exact Bregman projection onto Pϵ/d d ; a rigorous regret analysis under this specific implementation is left for future work. Remark 3.7. The regret bounds in Theorems 3.1 and 3.2 also extend to non-stationary stochastic bandit problems with unknown switching parameters, where the reward distributions may change over time. This generalization is possible because the adversarial bandit setting encompasses the stochastic setting as a special case. Published in Transactions on Machine Learning Research (10/2025) 4 Conclusion In this paper, we studied the adversarial bandit problem with S-switch regret, where the agent competes against any sequence of arms that switches at most S times, without prior knowledge of S. To address this challenge, we proposed two algorithms based on the master-base framework integrated with the Online Mirror Descent (OMD) method. First, we introduced Algorithm 1, which employs a simple OMD update with a fixed learning rate and achieves a regret bound of O(S1/2K1/3T 2/3). To further improve performance with respect to T, we proposed Algorithm 2, which incorporates adaptive learning rates to control the variance of the loss estimators. This leads to an improved regret bound of O(min{ SKTρ, S KT}) where ρ captures the variance of the estimators associated with the near-optimal base. Acknowledgements We thank the Action Editor and the reviewers for their constructive and insightful feedback. J. Kim acknowledges the support of ANR through the PEPR IA FOUNDRY project (ANR-23-PEIA-0003) and the Doom project (ANR-23-CE23-0002), as well as the ERC through the Ocean project (ERC-2022-SYG-OCEAN101071601). Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12 38. PMLR, 2017. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pp. 138 158, 2019. Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P Helmbold, Robert E Schapire, and Manfred K Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427 485, 1997. Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pp. 696 726. PMLR, 2019. Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1079 1087, 2019. Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pp. 1405 1411, 2015. Dylan J Foster, Akshay Krishnamurthy, and Haipeng Luo. Open problem: Model selection for contextual bandits. In Conference on Learning Theory, pp. 3842 3846. PMLR, 2020. Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems, 2008. Kwang-Sung Jun, Francesco Orabona, Stephen Wright, and Rebecca Willett. Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pp. 943 951. PMLR, 2017. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits: A case study on switching regret for linear bandits. ar Xiv preprint ar Xiv:2202.06151, 2022. Published in Transactions on Machine Learning Research (10/2025) Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. ar Xiv preprint ar Xiv:2003.01704, 2020. Yoan Russac, Claire Vernade, and Olivier Cappé. Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pp. 12017 12026, 2019. Joe Suk and Samory Kpotufe. Tracking most significant arm switches in bandits. In Conference on Learning Theory, pp. 2160 2182. PMLR, 2022. A.1 Proof of Theorem 3.1 Let ts be the time when the s-th switch of the best arm happens and t S+1 1 = T, t0 = 1. Also let ts+1 ts = Ts. For any ts for all s [0, S], the S-switch regret can be expressed as t=1 E [lt(at)] s=0 min ks [K] t=ts lt(ks) t=1 E [lt(at,ht)] t=1 E lt(at,h ) + t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks), (16) in which the first two terms are closely related with the regret from the master algorithm against the near optimal base h , and the remaining terms are related with the regret from h base algorithm against the best arms in hindsight. We note that the algorithm does not need to know h in prior and h is brought here only for regret analysis. First we provide a bound for the following regret from base h : t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks). Let k s = arg mink [K] Pts+1 1 t=ts lt(k) and ej,K denote the unit vector with 1 at j-index and 0 at the rest of K 1 indices. Then, we have t=ts E lt(at,h ) lt(k s) t=ts E pt,h ek s,K, lt max p Pβ K E t=ts p ek s,K, lt + t=ts pt,h p, lt βTs(K 1) + max p Pβ K E t=ts pt,h p, l t,h where the first term in the last inequality is obtained from the clipped domain Pβ K and the second term is obtained from the unbiased estimator l t,h such that E[l t,h |Ft 1] = E[lt|Ft 1] where Ft 1 denotes the natural filtration generated by the history up to round t 1. We can observe that the clipped domain Published in Transactions on Machine Learning Research (10/2025) controls the distance between the initial distribution at ts and the best arm unit vector for the time steps over [ts, ts+1 1]. Let pt+1,h = arg min p RK p, l t,h + DFξ(h )(p, pt,h ). Then, by solving the optimization problem, for all k [K] we can obtain pt+1,h (k) = pt,h (k) exp( ξ(h )l t,h (k)). For the second term of the last inequality in equation 17, we provide a lemma in the following. Lemma A.1 (Theorem 28.4 and Eq. 28.11 in Lattimore & Szepesvári (2020)). For any p Pβ K we have t=ts pt,h p, l t,h DFξ(h )(p, pts,h ) + t=ts DFξ(h )(pt,h , pt+1,h ). Proof. For completeness, we provide a proof for this lemma. Fix any p Pβ K. By the first-order optimality condition of the unconstrained mirror-descent step, l t,h + Fξ(h )(pt+1,h ) Fξ(h )(pt,h ), p pt+1,h 0. This gives pt+1,h p, l t,h Fξ(h )(pt+1,h ) Fξ(h )(pt,h ), p pt+1,h . Using the definition of Bregman divergence, Fξ(h )(pt+1,h ) Fξ(h )(pt,h ), p pt+1,h = DFξ(h )(p, pt,h ) DFξ(h )(p, pt+1,h ) DFξ(h )(pt+1,h , pt,h ), pt+1,h p, l t,h DFξ(h )(p, pt,h ) DFξ(h )(p, pt+1,h ) DFξ(h )(pt+1,h , pt,h ). (18) We now decompose pt,h p, l t,h = pt,h pt+1,h , l t,h + pt+1,h p, l t,h . Combining with equation 18 yields pt,h p, l t,h pt,h pt+1,h , l t,h DFξ(h )(pt+1,h , pt,h ) + DFξ(h )(p, pt,h ) DFξ(h )(p, pt+1,h ). (19) Recall the unconstrained mirror step pt+1,h = arg min u n l t,h , u + DFξ(h )(u, pt,h ) o . By the first-order optimality condition, l t,h + Fξ(h )( pt+1,h ) Fξ(h )(pt,h ) = 0. (20) Taking the inner product of equation 20 with pt,h pt+1,h yields pt,h pt+1,h , l t,h = pt,h pt+1,h , Fξ(h )(pt,h ) Fξ(h )( pt+1,h ) . (21) From the above, by using the definition of Bregman divergences, we have pt,h pt+1,h , l t,h = pt,h pt+1,h , Fξ(h )(pt,h ) Fξ(h )( pt+1,h ) = DFξ(h )(pt+1,h , pt,h ) + DFξ(h )(pt,h , pt+1,h ) DFξ(h )(pt+1,h , pt+1,h ) DFξ(h )(pt+1,h , pt,h ) + DFξ(h )(pt,h , pt+1,h ). (22) Published in Transactions on Machine Learning Research (10/2025) Then from equation 19 and equation 22, we have pt,h p, l t,h DFξ(h )(p, pt,h ) DFξ(h )(p, pt+1,h ) + DFξ(h )(pt,h , pt+1,h ). Summing over t = ts, . . . , ts+1 1 gives a telescoping series in the middle terms: t=ts pt,h p, l t,h DFξ(h )(p, pts,h ) DFξ(h )(p, pts+1,h ) + t=ts DFξ(h )(pt,h , pt+1,h ). Since DFξ(h )(p, pts+1,h ) 0 by the nonnegativity of Bregman divergences, the term can be safely dropped. Therefore, ts+1 1 X t=ts pt,h p, l t,h DFξ(h )(p, pts,h ) + t=ts DFξ(h )(pt,h , pt+1,h ), which concludes the proof. In Lemma A.1, the first term is for the initial point diameter at time ts and the second term is for the divergence of the updated policy. Using the definition of the Bregman divergence and the fact that pts,h (k) β, the initial point diameter term can be shown to be bounded as follows: DFξ(h )(p, pts,h ) 1 ξ(h ) k [K] p(k) log(1/pts,h (k)) ξ(h ) . (23) Next, for the updated policy divergence term, using pt+1,h (k) = pt,h (k) exp( ξ(h )l t,h (k)) for all k [K], we have t=ts E h DFξ(h )(pt,h , pt+1,h ) i k=1 E 1 ξ(h )pt,h (k) exp( ξ(h )l t,h (k)) 1 + ξ(h )l t,h (k) k=1 E ξ(h ) 2 pt,h (k)l t,h (k)2 k=1 E ξ(h ) where the first inequality comes from exp( x) 1 x + x2/2 for all x 0, the second inequality comes from E[l t,h (k)2 | pt,h (k), pt(h )] 1/(pt(h )pt,h (k)), and the last inequality is obtained from pt(h ) α from the clipped domain. We can observe that the clipped domain controls the variance of estimators. Then from equation 17, Lemma A.1, equation 23, and equation 24, by summing up over s [S], we have t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks) βT(K 1) + S log(1/β) ξ(h ) + ξ(h )KT Next, we provide a bound for the following regret from the master: t=1 E [lt(at,ht)] t=1 E lt(at,h ) . Published in Transactions on Machine Learning Research (10/2025) Let pt+1 = arg minp RH p, l t + DFη(p, pt) and eh,H denote the unit vector with 1 at base h-index and 0 at the rest of H 1 indices. For ease of presentation, we define lt(h) = lt(at,h) for h [H]. Then, we have t=1 E lt(at,ht) lt(at,h ) = t=1 E pt eh ,H, lt max p Pα H E t=1 p eh ,H, lt + t=1 pt p, lt αT(H 1) + max p Pα H E t=1 pt p, lt For bounding the second term in equation 26, we use the following lemma. Lemma A.2 (Theorem 28.4 and Eq. 28.11 in Lattimore & Szepesvári (2020)). For any p Pα H we have t=1 pt p, lt DFη(p, p1) + t=1 DFη(pt, pt+1) Proof. For completeness, we provide a proof for this lemma. Fix any p Pα H. By the first-order optimality condition of the unconstrained mirror-descent step, l t + Fη(pt+1) Fη(pt), p pt+1 0. This gives pt+1 p, l t Fη(pt+1) Fη(pt), p pt+1 . Using the three-point identity for Bregman divergences, Fη(pt+1) Fη(pt), p pt+1 = DFη(p, pt) DFη(p, pt+1) DFη(pt+1, pt), we obtain pt+1 p, l t DFη(p, pt) DFη(p, pt+1) DFη(pt+1, pt). (27) We now decompose pt p, l t = pt pt+1, l t + pt+1 p, l t . Combining with equation 27 yields pt p, l t pt pt+1, l t DFη(pt+1, pt) + DFη(p, pt) DFη(p, pt+1). (28) Recall the unconstrained mirror step pt+1 = arg min u n l t, u + DFη(u, pt) o . By the first-order optimality condition, l t + Fη( pt+1) Fη(pt) = 0. (29) Taking the inner product of equation 29 with pt pt+1 yields pt pt+1, l t = pt pt+1, Fη(pt) Fη( pt+1) . (30) From the above, by using the definition of Bregman divergences, we have pt pt+1, l t = pt pt+1, Fη(pt) Fη( pt+1) = DFη(pt+1, pt) + DFη(pt, pt+1) DFη(pt+1, pt+1) DFη(pt+1, pt) + DFη(pt, pt+1). (31) Published in Transactions on Machine Learning Research (10/2025) Then from equation 28 and equation 31, we have pt p, l t DFη(p, pt) DFη(p, pt+1) + DFη(pt, pt+1). Summing over t = 1, . . . , T gives a telescoping series in the middle terms: t=1 pt p, l t DFη(p, p1) DFη(p, p T +1) + t=1 DFη(pt, pt+1). Since DFη(p, p T +1) 0 by the nonnegativity of Bregman divergences, the term can be safely dropped. Therefore, T X t=1 pt p, l t DFη(p, p1) + t=1 DFη(pt, pt+1), which concludes the proof with E[PT t=1 pt p, l t ] = E[PT t=1 pt p, lt ]. From equation 26 and Lemma A.2, we have t=1 E [lt(at,ht)] t=1 E lt(at,h ) αT(H 1) + max p Pα H E DFη(p, p1) + t=1 DFη(pt, pt+1) αT(H 1) + log(H) where the last inequality is obtained from the fact that DFη(p, p1) = 1 h [H] p(h) log( p(h) p1(h)) log(H) t=1 DFη(pt, pt+1) h [H] pt(h)(exp( ηl t(h)) 1 + ηl t(h)) h [H] pt(h)l t(h)2 Therefore, putting equation 16, equation 25, and equation 32 altogether, we have t=1 E [lt(at)] s=0 min 1 ks K t=Ts lt(ks) αTH + log(H) 2 + βT(K 1) + S log(1/β) ξ(h ) + ξ(h )KT = O(S1/2T 2/3K1/3), where α = K1/3/(T 1/3H1/2), β = 1/(KT), η = 1/ TH, ξ(h ) = (h ) 1/2/(K1/3T 2/3), h = Θ(S), and H = log(T). This concludes the proof. Published in Transactions on Machine Learning Research (10/2025) A.2 Proof of Theorem 3.2 Let ts be the time when the s-th switch of the best arm happens and t S+1 1 = T, t0 = 1. Also let ts+1 ts = Ts. For any ts for all s [0, S], the S-switch regret can be expressed as t=1 E [lt(at)] s=0 min ks [K] t=ts lt(ks) t=1 E [lt(at,ht)] t=1 E lt(at,h ) + t=1 E lt(at,h ) s=0 min ks [K] t=ts lt(ks), (33) in which the first two terms are closely related with the regret from the master algorithm against the near optimal base h , and the remaining terms are related with the regret from h base algorithm against the best arms in hindsight. First we provide a bound for the following regret from base h . From equation 17, we can obtain t=ts E lt(at,h ) min ks [K] t=ts lt(ks) βTs K + max p Pβ K E t=ts pt,h p, l t,h Then for the second term of the last inequality in equation 34, we provide a following lemma. Lemma A.3 (Restatement of Lemma 3.3). For any p Pβ K we can show that t=ts E h pt,h p, l t,h i E Proof. For ease of presentation, we define the negative entropy regularizer without a learning rate as i=1 (p(i) log p(i) p(i)) and define learning rate ξ0(h ) = . From the first-order optimality condition for pt+1,h and using the definition of the Bregman divergence, pt+1,h p, l t,h 1 ξt(h ) p pt+1,h , F(pt+1,h ) F(pt,h ) = 1 ξt(h ) DF (p, pt,h ) DF (p, pt+1,h ) DF (pt+1,h , pt,h ) . (35) Also, we have pt,h pt+1,h , l t,h = 1 ξt(h ) pt,h pt+1,h , F(pt,h ) F( pt+1,h ) = 1 ξt(h )(DF (pt+1,h , pt,h ) + DF (pt,h , pt+1,h ) DF (pt+1,h , pt+1,h )) 1 ξt(h )(DF (pt+1,h , pt,h ) + DF (pt,h , pt+1,h )). (36) Published in Transactions on Machine Learning Research (10/2025) Then, we can obtain t=ts pt,h p, l t,h t=ts pt,h pt+1,h , l t,h 1 ξt(h ) DF (p, pt,h ) DF (p, pt+1,h ) DF (pt+1,h , pt,h ) t=ts pt,h pt+1,h , l t,h + t=ts+1 DF (p, pt,h ) 1 ξt(h ) 1 ξt 1(h ) + 1 ξts(h )DF (p, pts,h ) 1 ξts+1 1(h)DF (p, pts+1,h ) 1 ξt(h )DF (pt+1,h , pt,h ) t=ts pt,h pt+1,h , l t,h + log(1/β) 1 ξt(h ) 1 ξt 1(h ) + 1 ξts(h )DF (p, pts,h ) 1 ξts+1 1(h)DF (p, pts+1,h ) 1 ξt(h )DF (pt+1,h , pt,h ) DF (pt,h , pt+1,h ) = 2 log(1/β) DF (pt,h , pt+1,h ) ξt(h ) , (37) where the first inequality is obtained from equation 35, the second last inequality is obtained from DF (p, pt,h ) log(1/β) and 1/ξt(h ) 1/ξt 1(h ) from non-decreasing ρt(h ), and the last inequality is obtained from equation 36, DF (p, pts,h ) log(1/β), and ξs(h ) ξT (h ) for s [T] from non-decreasing ρs(h ). For the second term in the last inequality in equation 37, using pt+1,h (k) = pt,h (k) exp( ξ(h )l t,h (k)) for all k [K], we have t=ts E DF (pt,h , pt+1,h ) k=1 E 1 ξt(h )pt,h (k) exp( ξt(h )l t,h (k)) 1 + ξt(h )l t,h (k) i k=1 E ξt(h ) 2 pt,h (k)l t,h (k)2 k=1 E ξt(h ) k=1 E ξt(h )ρt(h ) T E ρT (h )1/2 Published in Transactions on Machine Learning Research (10/2025) where the first inequality comes from exp( x) 1 x+x2/2 for all x 0, the second inequality comes from E[l t,h (k)2 | pt,h (k), pt(h )] 1/(pt(h )pt,h (k)), and the third inequality is obtained from 1/pt(h ) ρt(h ). Then from equation 34 and Lemma A.3, we have t=1 E lt(at,h ) s=0 min 1 ks K t=Ts lt(ks) βT(K 1) + E 2S log(1/β) Next, we provide a bound for the regret from the master in the following lemma. Lemma A.4 (Lemma 13 in Agarwal et al. (2017)). t=1 E [lt(at,ht)] t=1 E lt(at,h ) O H log(T) η + Tη E ρT (h ) Proof. For ease of presentation, define lt(h) = lt(at,h) for h [H]. Then t=1 E lt(at,ht) lt(at,h ) = t=1 E pt eh ,H, lt max p Pα H E t=1 p eh ,H, lt + t=1 pt p, lt αT(H 1) + max p Pα H E t=1 pt p, lt = αT(H 1) + max p Pα H E t=1 pt p, l t Bounding the mirror-descent term. We next bound E P t pt p, l t using the OMD analysis of Agarwal et al. (2017, Lemma 13). The master update is pt+1 = arg min p Pα H p, l t + DGηt(p, pt) , where Gηt(p) = PH h=1 1 ηt(h) p(h) log p(h) and DGηt is the corresponding Bregman divergence. Applying the standard mirror-descent inequality, for any p Pα H, pt p, lt DGηt(p, pt) DGηt(p, pt+1) + h=1 ηt(h) pt(h)2 l t(h)2. (42) Summing over t = 1, . . . , T gives t=1 pt p, lt DGηt(p, pt) DGηt(p, pt+1) + h=1 ηt(h) pt(h)2 l t(h)2. (43) Bounding the potential differences. We first control the telescoping term in equation 43. Since Published in Transactions on Machine Learning Research (10/2025) DGηt( , ) 0 and the learning rates ηt(h) are nondecreasing in t for each h, we have DGηt(p, pt) DGηt(p, pt+1) DGη1(p, p1) + 1 ηt+1(h) 1 ηt(h) h p(h) pt+1(h) , (44) where h(y) = y 1 log y 0 is the log-barrier Bregman core. For the initial term, using that Gη1 is (scaled) negative entropy on the clipped simplex Pα H with α = 1/(TH), we obtain the standard bound max p Pα H DGη1(p, p1) = O H log(1/α) = O H log T Adaptive-rate gain (negative correction). By the adaptive schedule in Algorithm 2, if 1 pt+1(h) > ρt(h), then ρt+1(h) = 2 pt+1(h) and ηt+1(h) = γ ηt(h) with γ = e1/ log T , while otherwise ηt+1(h) = ηt(h). As in (Agarwal et al., 2017, Lemma 13), this implies that whenever the coordinate h is assigned too little probability, the factor 1 ηt+1(h ) 1 ηt(h ) is negative of order 1/(η log T), and it multiplies the nonnegative barrier increment h p(h ) pt+1(h ) where h(y) = y 1 log(y). Aggregating these events over t = 1, . . . , T 1 yields 1 ηt+1(h) 1 ηt(h) h p(h) pt+1(h) 1 ηt+1(h ) 1 ηt(h ) h p(h ) pt+1(h ) ρT (h ) 40 η log T , (46) where ρT (h ) is the final density parameter maintained by the schedule. Combining equation 44, equation 45, and equation 46, and for p Pα H, we obtain DGηt(p, pt) DGηt(p, pt+1) O H log T ρT (h ) 40 η log T . (47) Bounding the variance term. It remains to bound PT t=1 PH h=1 ηt(h) pt(h)2 l t(h)2. Recall that l t(h) [0, 1/pt(h)] and only the sampled coordinate can be nonzero. Since each increase at least doubles the density ρt(h) and ρt(h) 2TH from pt(h) α = 1/TH, the number of entire updates for each h is at most C1 log(HT) for a constant C1 > 0. This implies that ηt(h) ηT (h) ηγC1 log(2HT ) ηe C2 for a constant C2 > 0. Therefore h=1 ηt(h) pt(h)2 l t(h)2 = t=1 ηt(ht) pt(ht)2 l t(ht)2 TηT (h) = O(Tη). (48) Putting the pieces together. Apply equation 47 and equation 48 to equation 43, then maximize over p Pα H and take expectations. Combining with equation 41 yields t=1 E[lt(at,ht)] t=1 E lt(at,h ) O H log T + O(Tη) E ρT (h ) which is the desired bound. The negative bias term in Lemma A.4 is derived from the log-barrier regularizer and increasing learning rates ηt(h). This term is critical to bound the worst case regret which will be shown soon. Also, H log(T)/η is obtained from H log(1/(Hα))/η considering the clipped domain. Then, putting equation 33 and Lemmas A.3 Published in Transactions on Machine Learning Research (10/2025) and A.4 altogether, we have t=1 E [lt(at)] s=0 min 1 ks K t=Ts lt(ks) η + Tη E ρT (h ) + αT(H 1) + βT(K 1) 2S log(1/β) SKTρT (h ) E where α = 1/(TH), β = 1/(TK), η = p H/T, H = log(T), and h = Θ(S). Then we can obtain RS(T) = O min np KT) is obtained from the worst case of ρT (h ). The worst case can be found by considering a maximum value of the concave bound of the last equality in equation 49 with variable ρT (h ) > 0 such that ρT (h ) = Θ(S). This concludes the proof.