# Best of Both Worlds Model Selection

Aldo Pacchiano (Microsoft Research, NYC, apacchiano@microsoft.com), Christoph Dann (Google, NYC, cdann@cdann.net), Claudio Gentile (Google, NYC, cgentile@google.com)

**Abstract.** We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta-algorithm plays each base learner according to a schedule that keeps the base learners' candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL [3] algorithm for adversarial environments, but with the additional benefit of achieving high-probability regret bounds. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best-of-both-worlds (stochastic and adversarial) guarantees while performing model selection in contextual bandit scenarios.

## 1 Introduction

A fundamental challenge in sequential decision-making is the ability of the learning agent to adapt to the unknown properties of the environment it interacts with. While adversarial environments may require some caution, we would also like to leverage situations where more benign scenarios may be disclosed as a result of this interaction. In the literature on multi-armed bandits, this adaptation capability has often taken two forms:

1. Best-of-both-worlds guarantees, which were pioneered by [10] and subsequently studied by a number of authors [e.g., 29, 5, 28, 30, 33, 21]. Here, the goal is to design algorithms achieving both stochastic and adversarial environment guarantees simultaneously, without knowing the type of environment in advance. Similar in spirit is the stream of literature on stochastic rewards with adversarial corruptions [23, 17, 34, 21, 31], where an adversary is assumed to corrupt the stochastic rewards observed by the algorithm, and the regret guarantees are expected to degrade gracefully with the total amount of corruption, without knowing this amount in advance.

2. Model selection guarantees, which were initiated by [3] and have since been the subject of intense investigation [e.g., 15, 1, 27, 4, 16, 12, 8, 14, 22, 26, 13]. Here, we assume we have access to a pool of M base bandit algorithms, each operating, say, within a different class of models or under different assumptions on the environment, and the aim is to design a bandit meta-algorithm that learns to simulate the best base algorithm in hindsight, without knowing in advance which one will be best for the environment at hand. This approach has often been used in the literature to capture bandit model selection problems. In fact, a natural instantiation of this framework is when we have a sequence of nested policy classes, and the goal is to single out the best policy within this nested family, by paying as price only the complexity of the policy class the optimal policy falls into.
In some sense, Item 2 is more general than Item 1, since one may attempt to achieve best-of-both-worlds performance by pooling a stochastic bandit algorithm with an adversarial bandit one, and expect the meta-algorithm on top of them to eventually learn to follow one of the two. Similarly, in the setting of stochastic rewards with adversarial corruptions, one can pool together stochastic algorithms operating with increasing guessed levels of corruption, and let the meta-algorithm learn to single out the one corresponding to the true corruption level (recent relevant papers, working with model mis-specification instead of reward corruption, include [27, 14]).

In this paper, we combine the two items above into a bandit algorithm that exhibits both model selection and best-of-both-worlds regret guarantees simultaneously. Our framework encompasses in particular a well-known linear bandit model selection scenario, where the action set $\mathcal{A}$ is a (finite but large) subset of $\mathbb{R}^{d_M}$, for some maximal dimension $d_M > 0$, and we deal with a hierarchy of possible dimensions $d_1 < \ldots < d_M$. At time $t$, the learner plays an action $a_t \in \mathcal{A}$ and receives as reward either $r_t = a_t^\top \omega_t$, where $\omega_t \in \mathbb{R}^{d_M}$ is an adversarially-generated reward vector (adversarial setting), or $r_t = a_t^\top \omega + \text{white noise}$, where $\omega \in \mathbb{R}^{d_M}$ is a fixed but unknown reward vector (stochastic setting). Associated with the dimensions $d_1 < \ldots < d_M$ are a nested family of policy classes $\Pi_1 \subseteq \ldots \subseteq \Pi_M$ and $M$ base algorithms $A_1, \ldots, A_M$, where the $i$-th algorithm $A_i$ is a linear bandit algorithm that works under the assumption that only the first $d_i$ components of $\omega_t$ (or $\omega$) are nonzero. Hence $A_i$ operates with dimension $d_i$, in that all policies $\pi \in \Pi_i$ are probability distributions over the action set $\mathcal{A}_i$, obtained as projections of the probability distributions in $\Pi_M$, where $\mathcal{A}_i$ is the projection of the full-dimensional action set $\mathcal{A}$ onto its first $d_i$ dimensions. Notice that this implies that the policy classes are nested: $\Pi_1 \subseteq \Pi_2 \subseteq \ldots \subseteq \Pi_M$.

If only the first $d_{i^\star}$ dimensions ($i^\star$ being unknown to the learner) of each $\omega_t$ (adversarial setting) or of $\omega$ (stochastic setting) are nonzero, we design an algorithm whose regret bound scales as $\mathrm{poly}(d_{i^\star})\sqrt{T}$ in the adversarial case and, simultaneously, as $\mathrm{poly}(d_M)\log T$ if the environment happens to be stochastic. Specific consideration is given to the dependence on $d_{i^\star}$. We show through a lower bounding argument that in the stochastic case a guarantee of the form $d_M \log T$, that is, where $d_M$ replaces the more desirable factor $d_{i^\star}$, is inevitable if we want to insist on obtaining a $\log T$-like result. Thus our bounds reflect the best achievable regret rates, depending on $d_M$ in the stochastic case and on $\mathrm{poly}(d_{i^\star})$ in the adversarial case.

In order to achieve these best-of-both-worlds results, one cannot easily rely on general corralling techniques like the one in [3], since the granularity offered by adversarial aggregation algorithms is no better than $\sqrt{T}$, which is not adequate for stochastic settings; hence our choice of the model selection technique known as regret balancing. Yet, even in this case, the literature does not provide off-the-shelf solutions: we first need to extend regret balancing from stochastic rewards [13, 26, 1] and corrupted stochastic rewards [31] to adversarial rewards. This involves adding extra actions to the action space at a given level of the hierarchy so as to enable the base learner operating at that level to compete with higher levels.
This seems to be needed because, in the adversarial case, base algorithms may even incur negative regret, which makes it hard to compare the relative performance of algorithms operating at different levels during the regret balancing operations. Specific technical hurdles arise in the linear bandit case, where mis-specification may cause low-level base learners to behave in a maliciously erratic way. In this case, mis-specification tests have to be designed with care so as to ensure that base learners that are mis-specified but not yet eliminated incur manageable regret. At the same time, these tests should also detect as soon as possible whether the environment is stochastic, through ad hoc regret balancing gap estimation procedures. For our results to hold, some technical conditions are required on the base algorithms, such as a notion of (high-probability) stability and a notion of action space extendability, formally defined in Section 2. We show that these conditions are fulfilled by known algorithms, like an anytime variant of the GeometricHedge.P algorithm in [7] and a high-probability variant of the EXP4 algorithm in [6].

**Related work.** Among the references we mentioned above, those most relevant to our work, as directly related to model selection and best-of-both-worlds guarantees, are perhaps [4, 13, 21, 31]. In both [4] and [13], model selection regret guarantees for stochastic contextual bandits are given which take the form $\mathrm{poly}(d_{i^\star})\sqrt{T}$. Yet, no combination of best-of-both-worlds and model selection results is contained in those papers. Closer in spirit are [21] and [31]. In [21], the authors consider model selection problems on top of adversarially corrupted stochastic linear bandit problems, where the total level of corruption $C$ is unknown in advance. It is worth stressing that this is a model selection problem which is substantially different from (and actually easier than, see Section 4) ours, as the selection applies to $C$ instead of the complexity term $R(\Pi_i)$. In fact, the model selection procedure in [21] looks very different from ours, and can be roughly seen as a robust version of a doubling trick applied to $C$. Our algorithm is more general than theirs, as it applies to settings beyond linear bandits, and the same is true of our best-of-both-worlds procedures. In [31], the authors also consider RL settings, and more general forms of corruption than [21]. Like ours, their model selection algorithm follows the idea of regret balancing and elimination of [1, 13]. Yet, importantly, our work applies to the fully adversarial setting with $\sqrt{T}$ regret, while the corruption-robust approach of [31] suffers regret linear in $T$ if the world is fully adversarial ($C = \Omega(T)$). To enable this sublinear regret, several important technical challenges needed to be solved that are not present in the corruption setting. Most importantly, the corruption setting can, for the most part, be handled similarly to the stochastic setting. For instance, since [31] always measures regret w.r.t. the uncorrupted environment, by definition no learner can achieve negative regret, which entails that that setting does not require linking base learners as we do here (see Section 2). On the lower bound side, relevant papers include [32, 24], where the authors investigate the Pareto frontier of model selection for (contextual) stochastic bandits.
It is shown there in particular that a regret upper bound of the form $\sqrt{d_{i^\star} T}$ cannot be achieved. However, these papers do not explicitly cover gap-dependent regret guarantees. A more thorough discussion of our contributions is postponed to Section 3, after introducing our main notation and assumptions.

## 2 Problem Setting and Assumptions

We start off by defining the adversarial scenario. An adversarial contextual bandit problem is a repeated game between a learner $A$ and an environment $B$. We consider the general decision-making scenario where the learner $A$ has at its disposal a class of policies $\Pi_A$ made up of functions $\pi$ of the form $\pi: \mathcal{X} \to \Delta_{\mathcal{A}}$, where $\Delta_{\mathcal{A}}$ denotes the set of probability distributions over the action set $\mathcal{A}$. At each round $t$, the interaction between $A$ and $B$ is as follows:

1. Learner $A$ selects a policy $\pi_t \in \Pi_A$.
2. Simultaneously, the environment $B$ selects a context $x_t \in \mathcal{X}$ and a reward function $r_t: \mathcal{A} \times \mathcal{X} \to [0, 1]$, and reveals $x_t$ to $A$.
3. Learner $A$ takes an action $a_t \sim \pi_t(\cdot \mid x_t)$ and observes reward $o_t = r_t(a_t, x_t)$.

The regret $\mathrm{Reg}_A(\mathcal{T}, \Pi')$ of algorithm $A$ in rounds $\mathcal{T} \subseteq \mathbb{N}$ against a policy class $\Pi'$ is the difference between the learner's accumulated reward in $\mathcal{T}$ and the performance of the best fixed policy from $\Pi'$:
$$\mathrm{Reg}_A(\mathcal{T}, \Pi') = \max_{\pi \in \Pi'} \sum_{\ell \in \mathcal{T}} \big[\mathbb{E}_{a \sim \pi}[r_\ell(a, x_\ell)] - r_\ell(a_\ell, x_\ell)\big]\,,$$
where $\mathbb{E}_{a \sim \pi}[r_\ell(a, x_\ell)] = \sum_{a \in \mathcal{A}} \pi(a \mid x_\ell)\, r_\ell(a, x_\ell)$ denotes the expectation over the action drawn from the given policy $\pi$, which may itself be a random quantity. For ease of notation, we will omit the comparator policy class when $\Pi' = \Pi_A$ is the policy class of $A$, and replace $\mathcal{T}$ by $t$ when $\mathcal{T} = [t] := \{1, \ldots, t\}$. Hence, $\mathrm{Reg}_A(t)$ is the regret of $A$ against its own policy class up to round $t$.

The stochastic scenario we consider is similar to the above, except in the way contexts $x_t$ and reward values $r_t(a_t, x_t)$ are generated. Specifically, $B$ generates contexts $x_t$ in an i.i.d. fashion according to a fixed (but arbitrary and unknown) distribution $\mathcal{D}$ over $\mathcal{X}$, while the reward $r_t(a_t, x_t)$ is such that for all fixed $(a, x) \in \mathcal{A} \times \mathcal{X}$ we have $\mathbb{E}[r_t(a, x)] = r(a, x)$, for some fixed function $r(\cdot, \cdot)$ in some known class of reward functions. In this case, we measure performance through the pseudo-regret
$$\mathrm{PseudoReg}_A(\mathcal{T}, \Pi') = \max_{\pi \in \Pi'} \sum_{\ell \in \mathcal{T}} \big[\mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi}[r(a, x)] - \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi_\ell}[r(a, x)]\big]\,,$$
where $\mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi}$ is the expectation over contexts $x$ and the action $a \sim \pi(\cdot \mid x)$ drawn from a policy $\pi$ that may itself be a random quantity. We say an environment $B$ is stochastic with gap $\Delta > 0$ if there is a policy $\pi^\star \in \Pi$ such that
$$\mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi^\star}[r(a, x)] \ \ge\ \max_{\pi \in \Pi \setminus \{\pi^\star\}} \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi}[r(a, x)] + \Delta\,.$$

A notable example of the above is the following linear bandit scenario. In the adversarial case ("adversarial linear bandits") the action space $\mathcal{A}$ is a subset of $\mathbb{R}^d$, for some dimension $d$, the context $x_t$ is irrelevant and, upon playing action $a_t \in \mathcal{A}$, the environment generates a reward which is a linear function of the action, $r_t(a_t, x_t) = r_t(a_t) = a_t^\top \omega_t$, where $\omega_t$ is chosen adversarially at every round within some known class of $d$-dimensional vectors. In the stochastic case ("stochastic linear bandits"), the only difference is that we simply have $r_t(a_t, x_t) = r_t(a_t) = a_t^\top \omega + \epsilon_t$, where $\omega$ is a fixed and unknown vector in the class of $d$-dimensional vectors and $\epsilon_t$ is sub-Gaussian noise. Moreover, we have gap $\Delta$ if there is $\pi^\star \in \Pi$ such that $\mathbb{E}_{a \sim \pi^\star}[a^\top \omega] \ge \max_{\pi \in \Pi \setminus \{\pi^\star\}} \mathbb{E}_{a \sim \pi}[a^\top \omega] + \Delta$.

Many algorithms $A$ in the literature enjoy a regret bound against their policy class $\Pi_A$ that holds in all environments $B$ satisfying certain conditions. If this regret bound holds, we say that $A$ is adapted to this environment.
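To make the interaction protocol and the regret definition above concrete, here is a minimal simulation sketch. The toy reward function, the uniform learner, and the one-hot comparator policies are illustrative choices of ours, not objects from the paper:

```python
import numpy as np

# Minimal sketch of the adversarial contextual bandit protocol and of the
# regret Reg_A([T], Pi') against a comparator class Pi' (illustrative only).

rng = np.random.default_rng(0)
n_actions, n_contexts, n_rounds = 5, 3, 1000

def toy_reward(t, a, x):
    # r_t(a, x) in [0, 1], fixed by the environment before the learner acts
    return 0.5 + 0.45 * np.sin(0.01 * t + a + x)

# Pi': one deterministic comparator policy per action (each row is one-hot)
comparators = np.eye(n_actions)

learner_total, comparator_totals = 0.0, np.zeros(n_actions)
for t in range(1, n_rounds + 1):
    x_t = rng.integers(n_contexts)                  # environment reveals context x_t
    pi_t = np.full(n_actions, 1.0 / n_actions)      # learner's policy pi_t(.|x_t), uniform here
    a_t = rng.choice(n_actions, p=pi_t)             # a_t ~ pi_t(.|x_t)
    learner_total += toy_reward(t, a_t, x_t)        # observed reward o_t = r_t(a_t, x_t)
    all_rewards = np.array([toy_reward(t, a, x_t) for a in range(n_actions)])
    comparator_totals += comparators @ all_rewards  # E_{a~pi}[r_t(a, x_t)] for each pi in Pi'

regret = comparator_totals.max() - learner_total    # Reg_A([T], Pi')
print(f"regret against the best fixed comparator: {regret:.1f}")
```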
Here, we allow algorithms to also compete against policy classes $\Pi'$ other than $\Pi_A$, and focus on algorithms that come with regret bounds of the following form.

**Definition 1 (Adapted).** We call $A$ adapted to environment $B$ and policy class $\Pi'$ if, with probability at least $1 - \delta$, the following bound holds simultaneously for all $t \in \mathbb{N}$ (here and throughout, the $O$-notation only hides absolute constants):
$$\mathrm{Reg}_A(t, \Pi') = O\Big(R(\Pi_A)\,\sqrt{t \ln\tfrac{t}{\delta}}\Big)\,.$$
The term $R(\Pi_A)$ is a (known) measure of complexity of the policy class $\Pi_A$ used by $A$.

Examples of algorithms satisfying Definition 1 include a version of the GeometricHedge algorithm in linear bandits [7] and the EXP4 algorithm [6] in contextual bandits with finite action sets, which operates over policies mapping contexts to probability distributions over actions. We will discuss them in detail below. Before introducing the model selection questions addressed in this work, we present two stronger versions of the above definition, which we call high-probability stability and extendability. These will be useful for model selection among multiple learners. As we will show with examples later, these stricter conditions can be established for several common settings under the same conditions for which adaptivity from Definition 1 is guaranteed.

**Additional conditions.** The first condition, high-probability stability (or h-stability), generalizes adaptivity to the case where $A$ only observes a certain noisy version of the reward. To state the condition formally, consider a more general interaction protocol between $A$ and the environment $B$, where Step 3 above is replaced by

3a. Let $b_t \sim \mathrm{Bernoulli}(\rho)$ for a fixed and known $\rho \in (0, 1]$. Learner $A$ takes an action $a_t \sim \pi_t(\cdot \mid x_t)$ and observes an importance-sampled version of the reward, $o_t = b_t\, r_t(a_t, x_t)/\rho$.

This encompasses the original protocol with $\rho = 1$ as a special case. An algorithm is h-stable if it maintains its regret guarantee up to a $1/\sqrt{\rho}$ penalty in this more general interaction protocol:

**Definition 2 (h-stability).** An algorithm $A$ is high-probability stable (h-stable) in an environment $B$ against policy class $\Pi'$ if, for any constant $\rho \in (0, 1]$, it satisfies a regret bound
$$\mathrm{Reg}_A(t, \Pi') = O\Big(\tfrac{R(\Pi_A)}{\sqrt{\rho}}\,\sqrt{t \ln\tfrac{t}{\delta}}\Big)$$
that holds with probability at least $1 - \delta$ simultaneously for all $t \in \mathbb{N}$.

Note that $A$ is adapted to $B$ and $\Pi'$ if it is h-stable for $\rho = 1$. Compared to the notion of stability proposed in [3], the one in Definition 2 is more demanding, in that it requires the importance-weighted regret bound to hold with high probability, rather than in expectation. Since we aim for high-probability regret bounds in adversarial and stochastic settings, this stronger notion is natural.

The second condition, extendability, generalizes h-stability to environments with additional actions, as specified next. Let $B$ be the original environment with action set $\mathcal{A}$ and let $\mathcal{A}_k = \mathcal{A} \cup \{a'_1, a'_2, \ldots, a'_k\}$ be the action set extended by $k$ extra special actions $a'_1, a'_2, \ldots, a'_k$. Further, let $\Pi'_k = \{\pi: \mathcal{X} \to \Delta_{\mathcal{A}_k}\}$ be an extended version of the original policy set $\Pi' = \{\pi: \mathcal{X} \to \Delta_{\mathcal{A}}\}$ that contains all policies of the original policy set and the single-action policies $\mathbf{1}_{\{a'_i\}}$ that always choose a certain special action $a'_i$, i.e., $\Pi'_k \supseteq \Pi' \cup \{\mathbf{1}_{\{a'_1\}}, \mathbf{1}_{\{a'_2\}}, \ldots, \mathbf{1}_{\{a'_k\}}\}$ (policies $\pi \in \Pi'$ are naturally extended from $\mathcal{A}$ to $\mathcal{A}_k$ by assigning probability $0$ to all special actions $a'_i$). We further allow the environment on the extended action space $\mathcal{A}_k$ to choose any values for the rewards $r_t(a'_i, x_t)$, possibly depending on the entire history and on all rewards $r_t(a, x_t)$ assigned to other actions $a \neq a'_i$ in that round. We denote the set of all such extended environments by $\mathcal{B}_k(B, \mathcal{A}_k)$.
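Before turning to extendability, the importance-sampled feedback of Step 3a (used in Definition 2) can be illustrated with a small wrapper; the `propose`/`update` interface of the base learner is a hypothetical one introduced only for this sketch:

```python
import numpy as np

class ImportanceSampledFeedback:
    """Illustrative wrapper for Step 3a: the wrapped learner only observes
    o_t = b_t * r_t(a_t, x_t) / rho with b_t ~ Bernoulli(rho).
    `base` is any object with propose(x) -> action and update(x, a, o) methods
    (a hypothetical interface, not an API from the paper)."""

    def __init__(self, base, rho, seed=0):
        assert 0.0 < rho <= 1.0
        self.base, self.rho = base, rho
        self.rng = np.random.default_rng(seed)

    def propose(self, x):
        return self.base.propose(x)

    def observe(self, x, a, reward):
        b = self.rng.random() < self.rho      # b_t ~ Bernoulli(rho)
        o = (b * reward) / self.rho           # unbiased: E[o_t] = r_t(a_t, x_t)
        self.base.update(x, a, o)

# An h-stable learner keeps its high-probability regret bound under this kind
# of feedback, paying only an extra 1/sqrt(rho) factor (Definition 2).
```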
An algorithm $A$ is now extendable if we can run it with the extended policy set $\Pi_k$ and it competes well against $\Pi'$ and the policies $\mathbf{1}_{\{a'_i\}}$ in all such extended environments.

**Definition 3 (Extendability).** Consider an algorithm $A$ with policy set $\Pi_A \subseteq \{\pi: \mathcal{X} \to \Delta_{\mathcal{A}}\}$ in environment $B$. We call $A$ $k$-extendable in $B$ against $\Pi'$ if there is an extended policy set $\Pi_k \supseteq \Pi_A \cup \{\mathbf{1}_{\{a'_1\}}, \ldots, \mathbf{1}_{\{a'_k\}}\}$ such that $A$, equipped with the extended policy set $\Pi_k$, is h-stable against $\Pi' \cup \{\mathbf{1}_{\{a'_1\}}, \ldots, \mathbf{1}_{\{a'_k\}}\}$ in all environments $B' \in \mathcal{B}_k(B, \mathcal{A}_k)$ that extend $B$ from action space $\mathcal{A}$ to action space $\mathcal{A}_k = \mathcal{A} \cup \{a'_1, \ldots, a'_k\}$.

One relevant example of an h-stable and extendable algorithm working in the adversarial linear bandit scenario is an anytime variant of the GeometricHedge algorithm from [7] with exploration ruled by John's ellipsoid (e.g., [9]); see Appendix B. Another example is a high-probability variant of the EXP4 algorithm from [6], which operates with finite sets of policies.

## 2.1 Model selection and best-of-both-worlds regret

Our model selection for best-of-both-worlds regret guarantees can be described as follows. We have a nested family of policy classes $\Pi_1 \subseteq \ldots \subseteq \Pi_M$, with (known) complexities $R(\Pi_1) \le \ldots \le R(\Pi_M)$. A meta-learning algorithm $\mathcal{M}$ has access to $M$ base algorithms $A_1, \ldots, A_M$, algorithm $A_i$ operating with policy class $\Pi_i$. We sometimes refer to these algorithms as base learners. Let $i^\star$ be the smallest index of a base learner that competes against the largest policy class $\Pi_M$ in the following sense:
$$i^\star = \min\big\{i \in [M]: A_i \text{ is } (M - i)\text{-extendable and h-stable in } B \text{ against } \Pi_M\big\}\,.$$
The goal of model selection is to devise a meta-algorithm $\mathcal{M}$ that has access to the base learners $A_1, \ldots, A_M$ and achieves, with probability $1 - \delta$, a regret bound of the form
$$\mathrm{Reg}_{\mathcal{M}}(t, \Pi_M) = O\Big(\mathrm{poly}\big(M, R(\Pi_{i^\star}), \ln R(\Pi_M)\big)\,\sqrt{t \ln\tfrac{t}{\delta}}\Big)$$
holding for all $t$, whenever $A_M$ is h-stable and the environment $B$ is adversarial (here, $\mathrm{poly}(a, b, c)$ denotes a polynomial function of the three arguments separately). Simultaneously, if $B$ is stochastic with gap $\Delta$, then we must have
$$\mathrm{PseudoReg}_{\mathcal{M}}(t, \Pi_M) = O\Big(\mathrm{poly}\big(M, R(\Pi_M)\big)\,\tfrac{\ln\frac{t}{\delta}}{\Delta}\Big)\,.$$
Notice that the above requirement on the pseudo-regret in stochastic environments only asks for a dependence on $R(\Pi_M)$, instead of $R(\Pi_{i^\star})$. This is motivated by the fact that in stochastic environments with gap $\Delta$ it is generally impossible to obtain model selection guarantees of the form $R(\Pi_{i^\star})\log t$; see Appendix A for a proof of this claim.

## 3 Summary of Our Contributions and Discussion

Our contributions can be summarized as follows.

(i) We introduce an algorithm, called Arbe (Algorithm 1), for high-probability model selection. This is the first high-probability model selection result for adversarial contextual bandit algorithms. Arbe satisfies a high-probability guarantee as long as each of the base algorithms also satisfies one. Our algorithm takes inspiration from the balancing and elimination techniques in [13], which have been designed for stochastic contextual bandit (and RL) scenarios. Yet, as mentioned above, several technical hurdles had to be overcome in the algorithm's design to make it usable with adversarial base learners. We believe the model selection rates that our algorithm achieves are minimax optimal in their dependency on $R(\Pi_{i^\star})$. Recall that, as shown by the lower bounds of [24], it is not possible to achieve a model selection rate scaling linearly with $R(\Pi_{i^\star})$; the best one can hope for is a quadratic dependence on $R(\Pi_{i^\star})$.
In the setting of adversarial linear bandits, this turns into a rate with a multiplier of the form $d_{i^\star}\log(|\mathcal{A}|)$ instead of $\sqrt{d_{i^\star}\log(|\mathcal{A}|)}$. In particular, when $|\mathcal{A}| = \Omega(2^{d_{i^\star}})$ or $|\mathcal{A}| = \infty$, this multiplier becomes $d_{i^\star}^2$ (although we do not explicitly work out the details here, applying our results to the infinite-arm case for adversarial linear bandits can be accomplished by a standard covering argument at each dimension $d_i$; this turns the factor $\log(|\mathcal{A}|)$ into one of the form $d_i\log T$). In the simpler case of EXP4 base learners, this multiplier takes the form $\log(|\Pi_{i^\star}|)$ instead of $\sqrt{\log(|\Pi_{i^\star}|)}$.

(ii) We introduce the first algorithm for best-of-both-worlds model selection. By leveraging the high-probability guarantees of Arbe, we develop the first best-of-both-worlds model selection algorithm (Arbe-Gap + Arbe-GapExploit, Algorithm 2 + Algorithm 4) that retains model selection rates when the environment is adversarial and obtains logarithmic rates when the environment is stochastic with a gap. The logarithmic gap-dependent rate of this algorithm exhibits an optimal quadratic dependence on the complexity $R(\Pi_M)$ of the largest policy class (see also Item (iii) below). Our algorithm is quite complex and requires a couple of main innovations: first, a careful design of a gap identification subroutine aimed at identifying a candidate optimal policy and a gap estimator (Arbe-Gap) and, second, a very precise schedule of play for exploiting this knowledge and testing its truthfulness (Arbe-GapExploit).

(iii) As already mentioned, we show via a lower bound for stochastic environments that, in the presence of a gap, perfect model selection between multiple logarithmic-rate learners is impossible. This can be found in Appendix A. A dependence on the complexity $R(\Pi_M)$ of the largest class is inevitable, and this dependence must be quadratic. Our algorithms (Arbe-Gap + Arbe-GapExploit) achieve exactly this dependence when the environment is stochastic and has a gap (Theorem 5).

## 4 Adversarial Model Selection Using Regret Balancing

We now introduce our algorithm for model selection in adversarial bandit problems with high-probability regret guarantees. The algorithm is shown as Algorithm 1 and follows the regret balancing principle. This principle has been applied successfully to model selection in bandit and RL problems with stochastic rewards [13, 26, 1] and with corrupted stochastic rewards [31]. Our work is the first to extend this approach to adversarial rewards.

**Algorithm 1: Arbe($\delta$, $s = 1$, $t_0 = 0$) (Adversarial Regret Balancing and Elimination)**

1. Input: initial time $t_0$, index $s$ of smallest active base learner, failure probability $\delta$
2. Initialize base learners $A_s, A_{s+1}, \ldots, A_M$ with extended policy classes $\tilde\Pi_s, \tilde\Pi_{s+1}, \ldots, \tilde\Pi_M$
3. Set sampling probabilities $\rho_i = \frac{R(\tilde\Pi_i)^{-2}}{\sum_{j=s}^{M} R(\tilde\Pi_j)^{-2}}$ for all $i \in \{s, \ldots, M\}$, and $\rho_i = 0$ for $i < s$
4. for round $t = t_0 + 1, t_0 + 2, \ldots$ do
5. Sample base learner index $b_t \sim \mathrm{Categorical}(\rho_1, \ldots, \rho_M)$
6. Get context $x_t$ and compute $a^i_t \sim \pi^i_t(\cdot \mid x_t)$, the action each base learner $A_i$ proposes for $x_t$
7. Play action $a_t = a^{b_t}_t$ (resolve linked actions if necessary) and receive reward $r_t(a_t, x_t)$
8. Update all base learners $A_i$ with reward $\frac{\mathbf{1}\{b_t = i\}\, r_t(a_t, x_t)}{\rho_i}$
9. Compute the importance-weighted reward estimates $$\mathrm{CRew}_i(t_0, t) = \sum_{\ell = t_0 + 1}^{t} \frac{\mathbf{1}\{b_\ell = i\}\, r_\ell(a_\ell, x_\ell)}{\rho_i} \tag{1}$$ together with associated high-probability confidence widths $D_i(t_0, t)$, and test for all $i, j \in \{s, \ldots, M\}$ with $i < j$ whether $$\mathrm{CRew}_j(t_0, t) > \mathrm{CRew}_i(t_0, t) + D_i(t_0, t) + D_j(t_0, t) + R(\tilde\Pi_i)\sqrt{t - t_0}\,. \tag{2}$$
10. if the test triggers for $A_i$ then restart the algorithm by running Arbe($\delta$, $i + 1$, $t$)
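The following sketch mirrors the structure of Algorithm 1: sampling probabilities $\rho_i \propto R(\tilde\Pi_i)^{-2}$, importance-weighted updates of all base learners, and a pairwise elimination test. The learner/environment interface and the confidence widths `D` are simplified placeholders rather than the exact quantities of Equations 1 and 2:

```python
import numpy as np

def arbe_epoch(learners, complexities, env, delta, t0=0, horizon=10_000, seed=0):
    """One epoch of an Arbe-style regret-balancing loop (illustrative sketch).
    learners[i] exposes propose(x) and update(x, a, o); complexities[i] plays the
    role of R(Pi_i~); env exposes context(t) and reward(t, a, x)."""
    rng = np.random.default_rng(seed)
    M = len(learners)
    inv_sq = np.array([c ** -2.0 for c in complexities])
    rho = inv_sq / inv_sq.sum()                       # rho_i proportional to R(Pi_i~)^{-2}
    crew = np.zeros(M)                                # importance-weighted reward estimates

    for t in range(t0 + 1, t0 + horizon + 1):
        x = env.context(t)
        proposals = [A.propose(x) for A in learners]
        b = rng.choice(M, p=rho)                      # sample base learner index b_t
        r = env.reward(t, proposals[b], x)            # play its action, observe reward
        for i, A in enumerate(learners):              # update ALL learners (importance-weighted)
            A.update(x, proposals[i], (r / rho[i]) if i == b else 0.0)
        crew += np.where(np.arange(M) == b, r / rho, 0.0)

        n = t - t0
        D = np.sqrt(n / rho * np.log(max(n, 2) / delta))   # placeholder confidence widths
        for i in range(M):
            for j in range(i + 1, M):
                # misspecification test in the spirit of Equation 2
                if crew[j] > crew[i] + D[i] + D[j] + complexities[i] * np.sqrt(n):
                    return i + 1, t                   # new smallest active index, restart time
    return None, t0 + horizon
```

A triggered test corresponds to Line 10 of Algorithm 1: the epoch ends and the procedure restarts with base learners $A_{i+1}, \ldots, A_M$ only.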
**Regret Balancing.** In each round, we choose the index $b_t$ of a base learner by sampling from a categorical distribution with probabilities $\rho_s, \rho_{s+1}, \ldots, \rho_M$ that remain fixed throughout the epoch. The policy of learner $A_{b_t}$ is then used to sample the action $a_t$ that is passed to the environment. After receiving the reward $r_t(a_t, x_t)$, Algorithm 1 updates each base learner $A_i$ with $r_t(a_t, x_t)$ importance-weighted by the probability that the learner was selected, i.e., with $\frac{\mathbf{1}\{b_t = i\}\, r_t(a_t, x_t)}{\rho_i}$. Thus, we update all base learners. This is closer to the Corral algorithm [3] than to the regret balancing approaches for the stochastic setting, which only update the selected learner $A_{b_t}$.

If the probabilities are set to $\rho_i \propto \frac{1}{R(\tilde\Pi_i)^2}$, then after $t$ rounds each learner $A_i$ is followed roughly $\rho_i t$ times. To see why the regrets of all learners are balanced in this case if they are h-stable against $\Pi_M$, consider the following. Denote by $\mathrm{reg}_t(a) = \mathbb{E}_{a' \sim \pi^\star}[r_t(a', x_t)] - r_t(a, x_t)$ the regret in round $t$ of action $a$, where $\pi^\star \in \arg\max_{\pi \in \Pi_M} \sum_{t=1}^{T} \mathbb{E}_{a \sim \pi}[r_t(a, x_t)]$ is the best policy after all $T$ rounds. The regret in the rounds where $A_i$ was selected can be bounded using standard concentration arguments as
$$\sum_{t=1}^{T} \mathbf{1}\{b_t = i\}\, \mathrm{reg}_t(a_t) \approx \rho_i \sum_{t=1}^{T} \mathrm{reg}_t(a^i_t) \le \rho_i \cdot O\Big(\tfrac{R(\tilde\Pi_i)}{\sqrt{\rho_i}}\sqrt{T}\Big) = O\big(R(\tilde\Pi_i)\sqrt{\rho_i}\,\sqrt{T}\big)\,,$$
where the final inequality holds because $A_i$ is h-stable. From the definition of $\rho_i$ we have $R(\tilde\Pi_i)\sqrt{\rho_i} = \big(\sum_{j=s}^{M} R(\tilde\Pi_j)^{-2}\big)^{-1/2} \le R(\tilde\Pi_s)$, which shows that the total regret in those rounds is bounded by $O(R(\tilde\Pi_s)\sqrt{T})$. Thus, if the base learners are all h-stable, then the regret incurred by each of them is comparable to the regret incurred by $A_s$, the learner with the smallest complexity $R(\tilde\Pi_s)$.

**Eliminating Base Learners.** If a learner $A_i$ is not h-stable and may have linear regret, then the probabilistic schedule above, which plays this learner roughly $\rho_i t$ times, yields linear regret. We therefore monitor the performance of each learner and terminate the epoch whenever a learner performs significantly worse than expected, and thus cannot be h-stable in the environment. To identify such cases, we compare estimates of the rewards of all pairs of base learners as follows. For each learner $A_i$, the quantity $\mathrm{CRew}_i(t_0, t)$ in Equation 1 is an unbiased estimate of the learner's reward sequence $\sum_{\ell = t_0 + 1}^{t} r_\ell(a^i_\ell, x_\ell)$ (see the appendix for details about how these estimates are computed). Further, using the confidence bounds $D_i(t_0, t)$, we can construct a confidence interval $\big[\mathrm{CRew}_i(t_0, t) - D_i(t_0, t),\ \mathrm{CRew}_i(t_0, t) + D_i(t_0, t)\big]$ for this reward sequence. If the confidence intervals of two learners with indices $i < j$ are more than the h-stable regret bound of $A_i$ apart (see Equation 2), then we deem all learners with index up to $i$ not h-stable and restart the algorithm with a reduced set of base learners.

This kind of elimination condition has already been used in stochastic environments [27, 13, 31], but it requires substantially more care in an adversarial setting. In settings with stochastic rewards (even with corruption, e.g., [31]), no learner can achieve rewards that are significantly higher than those of the optimal policy. In contrast, in the adversarial setting, it is possible to have negative regret against any fixed policy. Thus, if we were to use the elimination condition in Equation 2 with our base learners as is, then we might eliminate an h-stable base learner $A_i$ when another base learner $A_j$ has negative regret against the best fixed policy $\pi^\star$. This could in turn lead to undesirable regret in subsequent epochs, when only learners with $R(\Pi_i) > R(\Pi_{i^\star})$ are left.
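As a quick numerical sanity check of the balancing identity used above, with arbitrary example complexities:

```python
import numpy as np

R = np.array([2.0, 4.0, 8.0, 16.0])      # example complexities R(Pi_s~) <= ... <= R(Pi_M~)
rho = R ** -2.0 / np.sum(R ** -2.0)      # rho_i proportional to R(Pi_i~)^{-2}
balanced = R * np.sqrt(rho)              # R(Pi_i~) * sqrt(rho_i)
print(balanced)                          # identical for every i ...
assert np.allclose(balanced, np.sum(R ** -2.0) ** -0.5)
assert np.all(balanced <= R[0] + 1e-12)  # ... and never larger than R(Pi_s~)
```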
**Linking Base Learner Performances.** To address the negative-regret issue just described, we link the performance of base learners. Instead of instantiating each learner $A_i$ with its original policy set $\Pi_i$, we apply it to an extended problem with $M - i$ additional actions $\tilde a_{i+1}, \tilde a_{i+2}, \ldots, \tilde a_M$ and an extended policy set $\tilde\Pi_i$ that includes all original policies, along with the policies that only choose one of the additional actions $\tilde a_j$, that is, $\tilde\Pi_i \supseteq \Pi_i \cup \{\mathbf{1}_{\{\tilde a_{i+1}\}}, \ldots, \mathbf{1}_{\{\tilde a_M\}}\}$. Whenever a base learner $A_i$ chooses one of the additional actions $\tilde a_j$, the action proposed by $A_j$ is followed. Essentially, running each base learner with such an extended action set allows it to choose to follow the actions proposed by any learner with higher index in the hierarchy.

The benefit of linking base learners this way is that each base learner now not only competes against the best fixed policy in its own set, but also against all learners above it in the hierarchy. Thus, if $A_i$ is h-stable and extendable (see Definition 3), then it satisfies a regret bound of the form
$$\sum_{\ell = t_0 + 1}^{t} \Big[\mathbb{E}_{a \sim \pi^j_\ell}[r_\ell(a, x_\ell)] - \mathbb{E}_{a \sim \pi^i_\ell}[r_\ell(a, x_\ell)]\Big] = O\big(R(\tilde\Pi_i)\sqrt{t - t_0}\big) \tag{3}$$
against any learner $A_j$ with $j > i$ (here we naturally extend the domain of $r_\ell$ to linked actions as $r_\ell(\tilde a_i, x_\ell) = r_\ell(a^i_\ell, x_\ell)$ for all $i \in [M]$). As a result, since the LHS of Equation 3 is approximately $\mathrm{CRew}_j(t_0, t) - \mathrm{CRew}_i(t_0, t) - D_i(t_0, t) - D_j(t_0, t)$, the test in Equation 2 cannot trigger for such an $A_i$, and only base learners that are not h-stable or not extendable can be eliminated. In Appendix C, we show that our regret balancing algorithm achieves the following regret bound:

**Theorem 4.** Consider a run of Algorithm 1 as Arbe($\delta$, 1, 0) with $M$ base algorithms with $1 \le R(\tilde\Pi_1) \le \ldots \le R(\tilde\Pi_M)$, where $\tilde\Pi_i$ is the extended version of policy class $\Pi_i$ with $M - i$ additional actions. Then, with probability at least $1 - \mathrm{poly}(M)\delta$, the regret $\mathrm{Reg}(t, \Pi_M)$, simultaneously for all rounds $t$, is bounded by a quantity of order $R(\tilde\Pi_{i^\star})^2\sqrt{t}$, up to factors of $M$ and logarithmic terms (the exact expression is given in Appendix C), where $i^\star$ is the smallest index of a base algorithm that is h-stable.

In most settings, the complexity $R(\tilde\Pi_{i^\star}) \ge M$ and the first term of the bound dominates. This regret recovers the expected-regret guarantees of Corral [2] when used with learning rates that do not require knowledge of $i^\star$, but as stronger high-probability bounds. To the best of our knowledge, Theorem 4 is the first high-probability regret bound for adversarial model selection. To illustrate our result, consider the common problem of model selection with nested model classes of dimensions $d_i = 2^i$ for $i \in [M]$, discussed in the introduction. We can use GeometricHedge.P as base learners, which are h-stable and extendable if adapted (see Appendix B), and the complexity of the extended policy classes, $R(\tilde\Pi_i) = R(\Pi_i) + M - i \lesssim d_i + M$, is not much larger than that of the original policy classes. Arbe with such base learners achieves a regret bound of order $O(d_{i^\star}^2\sqrt{t})$, up to factors of $M$ and log-factors which are typically small.

## 5 Adversarial Model Selection with Best of Both Worlds Guarantees

In this section, we present our algorithm for adversarial model selection that preserves a logarithmic regret guarantee in case it interacts with a stochastic environment with gap $\Delta > 0$. In order to achieve such a best-of-both-worlds guarantee, we combine our adversarial regret balancing technique with the algorithmic strategy of Bubeck and Slivkins [11]. The main result for our algorithm is:

**Theorem 5.** Consider a run of Algorithm 2 with inputs $t_0 = 0$, an arbitrary policy $\hat\pi \in \Pi_M$, and $M$ base learners $A_1, \ldots, A_M$. Then, with probability at least $1 - \mathrm{poly}(M)\delta$, the following two conditions hold simultaneously for all $t \ge M^2$.
In any adversarial or stochastic environment $B$, the regret is bounded as
$$\mathrm{Reg}(t, \Pi_M) = O\Big(M + \ln(t) + R(\tilde\Pi_{i^\star})\sqrt{t\,(\ln(t) + i^\star)\,\ln t}\Big)\,.$$
If $B$ is stochastic and there is a unique policy with gap $\Delta > 0$, then the pseudo-regret is bounded as
$$\mathrm{PseudoReg}_{\mathcal{M}}(t, \Pi_M) = O\bigg(\frac{\ln\frac{1}{\delta}}{\Delta} + \frac{R(\tilde\Pi_{i^\star})^2\, R(\tilde\Pi_M)^2}{R(\tilde\Pi_1)^2}\cdot\frac{M\,2^{i^\star}}{\Delta}\,\ln^2\frac{M R(\tilde\Pi_M)\, t}{\delta}\bigg)\,. \tag{5}$$

The algorithm has the same regret bound as Arbe up to a $\ln t$ factor. However, in addition, it also maintains poly-logarithmic pseudo-regret if the environment is stochastic and the best policy exhibits a positive gap. Our pseudo-regret bound depends polynomially on $R(\Pi_M)$, in contrast to the $R(\Pi_{i^\star})$ dependency of the adversarial regret rate. This may seem undesirable but, as mentioned in Section 2.1, model selection in stochastic environments with $\mathrm{poly}(R(\Pi_{i^\star}), \ln R(\Pi_M))\ln(t)$ regret is impossible (see Appendix A), let alone while maintaining a best-of-both-worlds guarantee with $\sqrt{T}$ regret in adversarial environments. Hence, our algorithm achieves the best kind of guarantee we can hope for, up to improvements in the order of the polynomial dependencies.

Our algorithm proceeds in two distinct phases. The first, Arbe-Gap, shown in Algorithm 2, is designed to identify a suitable candidate for the optimal policy and to estimate its gap. If a policy emerges that performs significantly better than any other policy, the algorithm enters the second phase, Arbe-GapExploit, which hones in on this policy by playing it most of the time while monitoring its regret in case the environment turns out to be adversarial after all (in which case we simply run Arbe). We now describe both phases.

## 5.1 First Phase: Candidate Policy Identification and Gap Estimation

The goal of this phase is to always maintain the desired adversarial model selection regret guarantee and, simultaneously, to identify the gap between the best policy and the rest if one exists. To achieve the first goal, we employ the regret balancing and elimination technique of Arbe; see Lines 3-11 of Algorithm 2, which are virtually identical to Algorithm 1.

**Algorithm 2: Arbe-Gap($\delta$, $t_0$, $\hat\pi$, $n$, $(A_i)_{i=s}^{M}$)**

1. Input: failure probability $\delta$, timestep $t_0$, focus policy $\hat\pi$, number of (re)starts $n$, learners $(A_i)_{i=s}^{M}$
2. Set $\Pi_{M+1} = \Pi_M \setminus \{\hat\pi\}$ and let $A_{M+1}$ be a copy of $A_M$ with policy class $\Pi_{M+1}$
3. Initialize base learners $A_s, A_{s+1}, \ldots, A_{M+1}$ with extended policy classes $\tilde\Pi_s, \tilde\Pi_{s+1}, \ldots, \tilde\Pi_{M+1}$
4. Set sampling probabilities $\rho_i = \frac{R(\tilde\Pi_i)^{-2}}{\sum_{j=s}^{M+1} R(\tilde\Pi_j)^{-2}}$ for all $i \in \{s, \ldots, M+1\}$, and $\rho_i = 0$ for $i < s$
5. for round $t = t_0 + 1, t_0 + 2, \ldots$ do
6. Sample base learner index $b_t \sim \mathrm{Categorical}(\rho_1, \ldots, \rho_{M+1})$
7. Get context $x_t$ and compute $a^i_t \sim \pi^i_t(\cdot \mid x_t)$, the action each base learner $A_i$ proposes for $x_t$
8. Play action $a_t = a^{b_t}_t$ (resolve linked actions if necessary) and receive reward $r_t(a_t, x_t)$
9. Update all base learners $A_i$ with reward $\frac{\mathbf{1}\{b_t = i\}\, r_t(a_t, x_t)}{\rho_i}$
10. if Equation 2 holds between $A_i$, $A_j$ for some $s \le i < j \le M + 1$ then
11. restart the algorithm by running Arbe-Gap($\delta$, $t$, $\hat\pi$, $n + 1$, $(A_{i+1}, \ldots, A_M)$)
12. (Gap test: is $A_M$ better than $A_{M+1}$?) Set $W(t_0, t) = \Theta\Big(\sqrt{\tfrac{R(\tilde\Pi_M)^2}{\rho_M (t - t_0)}\,\ln\tfrac{n(t - t_0)}{\delta}} + \tfrac{\ln n\,\ln(t - t_0)}{t - t_0}\Big)$
13. Set $\hat\Delta_t = \frac{\mathrm{CRew}_M(t_0, t) - \mathrm{CRew}_{M+1}(t_0, t)}{t - t_0} - W(t_0, t)$
14. if $\hat\Delta_t \ge 2\,W(t_0, t)$ then
15. run Arbe-GapExploit (Appendix D) with inputs $\delta$, $t_0 = t$, $A_M$, $\hat\pi$, and $\hat\Delta = \hat\Delta_t$
16. run Arbe with inputs $t_0$, $s$, $\delta$ (once Arbe-GapExploit returns)
17. (New candidate policy test) if a policy $\pi \in \Pi_M \setminus \{\hat\pi\}$ has been selected in more than $3t/4$ of all $t \ge 9$ rounds then
18. restart the algorithm by running Arbe-Gap($\delta$, $t$, $\pi$, $n + 1$, $(A_i)_{i=s}^{M}$)

For the second goal, determining the gap, Arbe-Gap maintains a candidate $\hat\pi \in \Pi_M$ for the optimal policy and estimates its gap as follows.
The learner hierarchy $A_1, \ldots, A_M$ is augmented at the top with an additional learner $A_{M+1}$ operating on the policy class $\Pi_{M+1} = \Pi_M \setminus \{\hat\pi\}$. Since $A_{M+1}$ is identical to $A_M$ except that it does not have access to $\hat\pi$, we can obtain a gap estimate for $\hat\pi$ by monitoring the difference between the reward estimates $\mathrm{CRew}_{M+1}$ and $\mathrm{CRew}_M$ of $A_{M+1}$ and $A_M$. In fact, $\hat\Delta_t$ in Line 13 is a lower confidence bound on the difference between the best policies in $\Pi_M$ and $\Pi_M \setminus \{\hat\pi\}$ and, thus, on the gap of $\hat\pi$. We test at every round whether $\hat\Delta_t$ exceeds its confidence width $2W(t_0, t)$. If this test triggers, then $\hat\pi$ must have a positive gap $\Delta$ of order $W(t_0, t)$ and, further, $\hat\Delta_t$ must be a multiplicative estimate of $\Delta$, that is, $\hat\Delta_t \le \Delta \le 2\hat\Delta_t$. The latter holds because $\hat\Delta_t + 2W(t_0, t)$ is an upper confidence bound on $\Delta$ and the test condition implies $2\hat\Delta_t \ge \hat\Delta_t + 2W(t_0, t) \ge \Delta$. Since in this case we have determined that $\hat\pi$ is optimal with a gap of order $\hat\Delta_t$, we can move on to the second phase, Arbe-GapExploit, discussed later.

Assume the candidate policy $\hat\pi$ is indeed optimal and exhibits a positive gap $\Delta$. Since $\hat\Delta_t$ concentrates around $\Delta$ at a rate of $W(t_0, t) \approx \frac{\mathrm{poly}(R(\tilde\Pi_M))}{\sqrt{t - t_0}}$, the condition of the gap test in Line 14 must trigger after at most $t - t_0 \approx \frac{\mathrm{poly}(R(\tilde\Pi_M))}{\Delta^2}$ rounds. Finally, since Arbe-Gap always maintains the $\mathrm{poly}(R(\tilde\Pi_{i^\star}))\sqrt{t - t_0}$ adversarial regret rate, the total pseudo-regret incurred until the test triggers is of order $\frac{\mathrm{poly}(R(\tilde\Pi_M))}{\Delta}$, leading to the second term in the bound in Equation 5.

With the techniques above, we can reliably detect a positive gap if the candidate policy $\hat\pi$ exhibits one. It remains to identify a suitable candidate $\hat\pi$ for the optimal policy $\pi^\star$ in stochastic environments. To do so, we use the following two observations: Arbe-Gap maintains a $\mathrm{poly}(R(\tilde\Pi_{i^\star}))\sqrt{t}$ adversarial regret rate overall, and each policy but $\pi^\star$ incurs on average at least a regret of $\Delta$ per round. Hence, in order to maintain that regret, Arbe-Gap must select $\pi^\star$ in the majority of all rounds once $t = \omega\big(\frac{\mathrm{poly}(R(\tilde\Pi_{i^\star}))}{\Delta^2}\big)$; otherwise the regret grows as $\Omega(t\Delta) = \omega\big(\mathrm{poly}(R(\tilde\Pi_{i^\star}))\sqrt{t}\big)$, violating the adversarial rate. To leverage this observation and identify a suitable candidate policy $\hat\pi$, Line 17 of Algorithm 2 always checks whether there is a policy other than the current candidate policy that has been selected in at least 3/4 of all rounds (due to the linking of learners, there is a slight ambiguity in defining the selected policy per round; we determine the selected policy as the policy chosen by the learner that eventually picked $a_t$ after resolving linked actions). If so, the algorithm is restarted with this policy as candidate $\hat\pi$. One can show that there are at most $O(\ln t)$ restarts due to candidate policy switches, which only increases the adversarial regret rate of Arbe-Gap by a factor of $O(\sqrt{\ln t})$ compared to Arbe's. Our candidate policy selection approach is similar to that of Wei et al. [31] for the corrupted reward setting, but they require each adapted base learner to achieve a logarithmic regret rate in the first place. Instead, our approach only requires an adversarial regret rate from the base learner.
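A minimal sketch of the gap test of Lines 12-14 of Algorithm 2, with the constants inside $W$ simplified and the reward-estimate bookkeeping left abstract:

```python
import numpy as np

def gap_test(crew_M, crew_M1, t0, t, rho_M, R_M, n_restarts, delta):
    """Illustrative version of the gap test of Arbe-Gap (constants simplified).
    crew_M / crew_M1 are the importance-weighted cumulative reward estimates of
    A_M (which has access to the candidate policy pi_hat) and A_{M+1} (which does not)."""
    n = t - t0
    W = np.sqrt(R_M ** 2 / (rho_M * n) * np.log(n_restarts * n / delta))  # confidence width
    gap_lcb = (crew_M - crew_M1) / n - W     # lower confidence bound on the gap of pi_hat
    if gap_lcb >= 2 * W:
        # pi_hat has a positive gap and gap_lcb is a 2-multiplicative estimate of it
        # (gap_lcb <= Delta <= 2 * gap_lcb), so the exploitation phase can start.
        return True, gap_lcb
    return False, None
```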
## 5.2 Second Phase: Exploitation

Since each base learner only needs to satisfy a $\sqrt{T}$ regret rate even in stochastic environments, we generally cannot hope to recover logarithmic pseudo-regret by only selecting among them. For logarithmic pseudo-regret, we need to ensure that the given policy $\hat\pi$ with gap estimate $\hat\Delta$ is played sufficiently often. However, we also need to monitor its regret against all other policies in case the environment turns out to be adversarial. If at any point $\hat\pi$ fails to maintain a gap of order $\hat\Delta$, we can conclude that the environment is adversarial. Then Arbe-GapExploit returns and we simply play an instance of Arbe (Line 16 of Algorithm 2).

We present a brief summary of the main intuition behind our exploitation phase here, and defer a longer discussion and the detailed pseudo-code to Appendix D. In each round, we play policy $\hat\pi$ with probability approximately $1 - \frac{\mathrm{poly}(R(\Pi_M))}{\hat\Delta^2 t}$ and, with the remaining probability $\frac{\mathrm{poly}(R(\Pi_M))}{\hat\Delta^2 t}$, a version of base learner $A_M$ with policy class $\Pi_M \setminus \{\hat\pi\}$. If the environment is indeed stochastic, then $\hat\pi = \pi^\star$ does not incur any pseudo-regret, and the total pseudo-regret in the remaining $t' \approx \sum_{\ell \le t} \frac{\mathrm{poly}(R(\Pi_M))}{\hat\Delta^2 \ell} \approx \frac{\mathrm{poly}(R(\Pi_M))}{\hat\Delta^2}\ln t$ rounds can be bounded as $\Delta t' + \mathrm{Reg}_{A_M}(t', \Pi_M \setminus \{\hat\pi\}) \lesssim \frac{\mathrm{poly}(R(\Pi_M))}{\Delta}\ln(t)$. The first term is the regret of the best policy in $\Pi_M \setminus \{\hat\pi\}$ and the second term is the regret of $A_M$ against that policy. Since $A_M$ is h-stable on $\Pi_M \setminus \{\hat\pi\}$, its regret against that policy is at most $\mathrm{poly}(R(\Pi_M))\sqrt{t'}$, and we get the desired pseudo-regret.

To ensure good regret in the adversarial case, we need to detect quickly enough when $\hat\pi$ no longer exhibits a performance gap, and fall back to a fully adversarial algorithm. Similarly to $\hat\Delta_t$ in the first phase, we use a lower confidence bound on the average performance gap and continuously test whether it falls below $\hat\Delta/2$. This simple approach would give $\sqrt{t}$ adversarial regret, but may exhibit a $\mathrm{poly}(R(\Pi_M))$ dependency. Fortunately, we can avoid this dependency and retain the desired model selection regret rates by extending the intuition above to also test an upper bound on the gap. For details, see Appendix D.

## 6 Conclusions

We have described and analyzed a novel model selection scheme for bandit algorithms that benefits from best-of-both-worlds high-probability regret guarantees. Though not restricted to linear bandits, our machinery can be specifically applied to adversarial/stochastic linear bandit tasks, where model selection is performed on the unknown dimensionality of the linear reward function. This has required extending the regret balancing technique for model selection from stochastic to adversarial rewards, together with a very careful handling of the associated mis-specification tests. The base learners aggregated by our meta-algorithm have to satisfy an anytime high-probability regret guarantee in the adversarial case, along with regret stability and action space extendability properties, which we have shown are satisfied by (variants of) known algorithms in the bandit literature. Our best-of-both-worlds model selection regret guarantees cannot in general be improved, specifically in stochastic environments with gaps, where it is generally impossible to obtain $\log t$-like model selection bounds that only depend on the complexity of $\Pi_{i^\star}$. On the other hand, it would be nice to see in Theorem 5 a better polynomial dependence on $M$ and $R(\Pi_M)$. Also, it might be possible to adapt Arbe-Gap to work with adversarially corrupted stochastic rewards while doing model selection on the complexity of the models. We leave this as a future research direction.

## References

[1] Y. Abbasi-Yadkori, A. Pacchiano, and M. Phan. Regret balancing for bandit and RL model selection. arXiv preprint arXiv:2006.05491, 2020.

[2] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.

[3] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms.
In Conference on Learning Theory, pages 12-38. PMLR, 2017.

[4] R. Arora, T. V. Marinov, and M. Mohri. Corralling stochastic bandit algorithms. arXiv preprint arXiv:2006.09255, 2020.

[5] P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, 2016.

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 2003.

[7] P. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 335-342. Omnipress, 2008.

[8] A. F. Bibaut, A. Chambaz, and M. J. van der Laan. Rate-adaptive model selection over a collection of black-box contextual bandit algorithms. arXiv preprint arXiv:2006.03632, 2020.

[9] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

[10] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.

[11] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, JMLR Workshop and Conference Proceedings, 2012.

[12] N. Chatterji, V. Muthukumar, and P. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 1844-1854, 2020.

[13] A. Cutkosky, C. Dann, A. Das, C. Gentile, A. Pacchiano, and M. Purohit. Dynamic balancing for model selection in bandits and RL. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2276-2285. PMLR, 2021.

[14] D. Foster, C. Gentile, M. Mohri, and J. Zimmert. Adapting to misspecification in contextual bandits. In Advances in Neural Information Processing Systems, 2020.

[15] D. J. Foster, A. Krishnamurthy, and H. Luo. Model selection for contextual bandits. In Advances in Neural Information Processing Systems, pages 14741-14752, 2019.

[16] A. Ghosh, A. Sankararaman, and K. Ramchandran. Problem-complexity adaptive model selection for stochastic linear bandits. arXiv preprint arXiv:2006.02612, 2020.

[17] A. Gupta, T. Koren, and K. Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, 2019.

[18] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Uniform, nonparametric, nonasymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

[19] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257-317, 2020.

[20] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, 2018.

[21] C.-W. Lee, H. Luo, C.-Y. Wei, M. Zhang, and X. Zhang. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 6142-6151. PMLR, 2021.

[22] J. Lee, A. Pacchiano, V. Muthukumar, W. Kong, and E. Brunskill. Online model selection for reinforcement learning with function approximation.
In International Conference on Artificial Intelligence and Statistics, pages 3340-3348. PMLR, 2021.

[23] T. Lykouris, V. Mirrokni, and R. Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.

[24] T. Marinov and J. Zimmert. The Pareto frontier of model selection for general contextual bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[25] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. arXiv preprint arXiv:1506.03271, 2015.

[26] A. Pacchiano, C. Dann, C. Gentile, and P. Bartlett. Regret bound balancing and elimination for model selection in bandits and RL. arXiv preprint arXiv:2012.13045, 2020.

[27] A. Pacchiano, M. Phan, Y. Abbasi-Yadkori, A. Rao, J. Zimmert, T. Lattimore, and C. Szepesvári. Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704, 2020.

[28] Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, 2017.

[29] Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, 2014.

[30] C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, 2018.

[31] C.-Y. Wei, C. Dann, and J. Zimmert. A model selection approach for corruption robust reinforcement learning. arXiv preprint arXiv:2110.03580, 2021.

[32] Y. Zhu and R. Nowak. Pareto optimal model selection in linear bandits, 2021.

[33] J. Zimmert and Y. Seldin. An optimal algorithm for stochastic and adversarial bandits. In 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[34] J. Zimmert and Y. Seldin. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28), 2021.