# metalearning_adversarial_bandit_algorithms__ab76af50.pdf

Meta-Learning Adversarial Bandit Algorithms

Mikhail Khodak Carnegie Mellon University khodak@cmu.edu

Ilya Osadchiy Technion - Israel Institute of Technology osadchiy.ilya@gmail.com

Keegan Harris CMU

Maria-Florina Balcan CMU Kfir Y. Levy Technion Ron Meir Technion Zhiwei Steven Wu CMU

We study online meta-learning with bandit feedback, with the goal of improving performance across multiple tasks if they are similar according to some natural similarity measure. As the first to target the adversarial online-within-online partial-information setting, we design meta-algorithms that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners initialize and set hyperparameters of the Tsallis-entropy generalization of Exp3, with the task-averaged regret improving if the entropy of the optima-in-hindsight is small. For BLO, we learn to initialize and tune online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with an action spacedependent measure they induce. Our guarantees rely on proving that unregularized follow-the-leader combined with two levels of low-dimensional hyperparameter tuning is enough to learn a sequence of affine functions of non-Lipschitz and sometimes non-convex Bregman divergences bounding the regret of OMD.

1 Introduction

Learning-to-learn [51] is an important area of research that studies how to improve the performance of a learning algorithm by meta-learning its parameters e.g. initializations, step-sizes, and/or representations across many similar tasks. The goal is to encode information from previous tasks in order to achieve better performance on future ones. Meta-learning has seen a great deal of experimental work [24, 49], practical impact [21, 29], and theoretical effort [11, 18, 22, 45, 20]. One important setting is online-within-online meta-learning [19, 31], where the learner performs a sequence of tasks, each of which has a sequence of rounds. Past work has studied the full-information setting, where the loss for every arm is revealed after each round. This assumption is not realistic in many applications, e.g. recommender systems and experimental design, where often partial or bandit feedback only the loss of the action taken is revealed. Such feedback can be stochastic, e.g. the losses are i.i.d. from some distribution, or adversarial, i.e. chosen by an adversary. We establish the first formal guarantees for online-within-online meta-learning with adversarial bandit feedback.

As with past full-information meta-learning results, our goal when faced with a sequence of bandit tasks will be to achieve low regret on average across them. Specifically, our task-averaged regret should (a) be no worse than that of algorithms for the single-task setting, e.g. if the tasks are not very similar, and should (b) be much better on tasks that are closely related, e.g. if the same small set of arms do well on all of them. We show that a natural way to achieve both is to initialize and tune online mirror descent (OMD), an algorithm associted with a strictly convex regularizer whose hyperparam-

Denotes equal contribution.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

eters have a significant impact on performance. Our approach works because it can learn the best hyperparameters in hindsight across tasks, which will recover OMD s worst-case optimal performance if the tasks are dissimilar but will take advantage of more optimistic settings if they are related. As generalized distances, the regularizers also induce interpretable measures of similarity between tasks.

1.1 Main contributions

We design a meta-algorithm (Algorithm 1) for learning variants of OMD specifically those with entropic or self-concordant regularizers that are used for adversarial bandits. This meta-algorithm combines three full-information algorithms follow-the-leader (FTL), exponentially weighted online optimization (EWOO), and multiplicative weights (MW) to set the initialization, step-size, and regularizer-specific parameters, respectively. It works by optimizing a sequence of functions that each upper-bound the regret of OMD on a single task (Theorem 2.1), resulting in (a) interesting notions of task-similarity because these functions depend on generalized notions of distances (Bregman divergences) and (b) adaptivity, i.e not needing to know how similar the tasks are beforehand.

Our first application is to OMD with the Tsallis regularizer [3], a relative of Exp3 [6] that is optimal for adversarial MAB. We bound the task-averaged regret by the Tsallis entropy of the estimated optimain-hindsight (Corollary 3.1), which we further extend to that of the true optima by assuming a gap between the best and second-best arms (Corollary 3.2). Both results are the first known consequences of the online learnability of Bregman divergences that are non-convex in their second arguments [31], while the latter is obtained by showing that the loss estimators of a modified algorithm identify the optimal arm w.h.p. As an example, our average m-round regret across T tasks under the gap assumption is

o T (poly(m)) + 2 min β (0,1]

Hβdβm/β + o( m) (1)

where d is the number of actions and Hβ is the Tsallis entropy [52, 3] of the distribution of the optimal actions (β = 1 recovers the Shannon entropy).1 This entropy is low if all tasks are usually solved by the same few arms, making it a natural task-similarity notion. For example, if only s d of the arms are ever optimal then Hβ = O(s), so using β = 1/ log d in (1) yields an asymptotic task-averaged regret of O( sm log d), dropping fast terms. For s = Od(1) this beats the minimax optimal rate of Θ(

dm) [5]. On the other hand, since H1/2 = O(

d), the same bound recovers this rate in the worst-case of dissimilar tasks.

Lastly, we adapt our meta-algorithm to the adversarial BLO problem by setting the regularizer to be a self-concordant barrier function, as in Abernethy et al. [2]. Our bounds yield notions of task-similarity that depend on the constraints of the action space, e.g. over the sphere the measure is the closeness of the average of the estimated optima to the sphere s surface (Corollary 4.1). We also instantiate BLO on the bandit shortest-path problem (Corollary 4.2) [50, 30].

1.2 Related work

While we are the first to consider meta-learning under adversarial bandit feedback, many have studied meta-learning in various stochastic bandit settings [9, 34, 47, 48, 35, 13, 15, 41, 10]. The latter three study stochastic bandits under various task-generation assumptions, e.g. Azizi et al. [10] is in a batchwithin-online setting where the optimal arms are adversarial. In contrast, we make no distributional assumptions either within or without. Apart from this difference, the results of Azizi et al. [10] are the ones our MAB results are most easily compared to, which we do in detail in Section 3. Notably, they assume that only s d of the d arms are ever optimal across T tasks and show (roughly speaking) O( sm) asymptotic regret; we instead focus on an entropic notion of task-similarity that achieves the same asymptotic regret when specialized to their s d. However, avoiding their explicit assumption has certain advantages, e.g. robustness in the presence of o(T) outlier tasks (c.f. Section 3.3).

A setting that bears some similarity to online-within-online bandits is that of switching bandits [6], and more generally online learning with dynamic comparators [4, 27, 38, 7, 55]. In such problems, instead of using a static best arm as the comparator we use a piecewise constant sequence of arms, with a limited number of arm switches. The key difference between such work and ours is our assumption that task-boundaries are known; this makes the other setting more general. However, while e.g. Exp3.S [6] can indeed be applied to online meta-learning, guarantees derived

1We use On( ) (and on( )) to denote terms with constant (and sub-constant) dependence on n.

from switching costs cannot improve upon just running Tsallis-INF on each task [39, Table 1]. Furthermore, these approaches usually quantify difficulty by the number of switches, whereas we focus on task-similarity. While there exists stochastic-setting work that measures difficulty using a notion of average change in distribution across rounds [53], it does not lead to improved performance if this average change is Ω(T), as is the case in e.g. the s-sparse setting discussed above.

There has been a variety of work on full-information online-within-online meta-learning [32, 12], including tuning OMD [31, 19]. Doing so for bandit algorithms has many additional challenges, including (1) their inherent and high-variance stochasticity, (2) the use of non-Lipschitz and even unbounded regularizers, and (3) the lack of access to task-optima in order to adapt to deterministic, algorithm-independent task-similarity measures. Theoretically our analysis draws on the average regret-upper-bound analysis (ARUBA) framework [31], which observes that OMD can be tuned by targeting its upper bounds, which are affine functions of Bregman divergences, and provide online learning tools for doing so. Our core structural result shows that the distance generating functions ψθ of these Bregman divergences can be tuned without interfering with meta-learning the initialization and step-size; tuning θ is critical for adapting to settings such as that of a small set of optimal arms in MAB. Doing so depends on several refinements of the original approach, including bounding the task-averaged-regret via the spectral norm of 2ψθ and expressing the loss of the meta-comparator using only ψθ, rather than via its Bregman divergence as in prior work. Finally, applying our structural result requires setting-specific analysis, e.g. to show regularity w.r.t. θ or to obtain MAB guarantees in terms of the entropy of the true optimal arms. The latter is especially difficult, as Khodak et al. [31] define task-similarity via full information upper bounds, and involves applying tools from the bestarm-identification literature [1] to show that a constrained variant of Exp3 finds the optimal arm w.h.p.

2 Learning the regularizers of bandit algorithms

We consider the problem of meta-learning over bandit tasks t = 1, . . . , T over some fixed set K Rd, a (possibly improper) subset of which is the action space A. On each round i = 1, . . . , m of task t we play action xt,i A and receive feedback ℓt,i(xt,i) for some function ℓt,i : A 7 [ 1, 1]. Note that all functions we consider will be linear and so we will also write ℓt,i(x) = ℓt,i, x . Additionally, we assume the adversary is oblivious within-task, i.e. it chooses losses ℓt,1, . . . , ℓt,m at time t. We will also denote x(a) to be the a-th element of the vector x Rd, K to be the interior of K, K its boundary, and to be the simplex on d elements. Finally, note that all proofs can be found in the Appendix.

In online learning, the goal on a single task t is to play actions xt,1, . . . xt,m that minimize the regret Pm i=1 ℓt,i(xt,i) ℓt,i( xt), where xt arg minx K Pm i=1 ℓt,i(x). Lifting this to the meta-learning setting, our goal as in past work [31, 19] will be to minimize the task-averaged regret: 1

T PT t=1 Pm i=1 ℓt,i(xi,t) ℓt,i( xt). In particular, we want to use multi-task data to improve average performance as the number of tasks T . For example, we wish to attain a task-averaged regret bound of the form o T (poly(m)) + O(V m) + o( m), where V R 0 is a measure of task-similarity that is small if the tasks are similar but still yields the worst-case single-task performance O(

dm) for MAB and O(d m) for BLO if they are not.

2.1 Online mirror descent as a base-learner

In meta-learning we are commonly interested in learning a within-task algorithm or base-learner, a parameterized method that we run on each task t. A popular approach is to learn the initialization and other parameters of a gradient-based method such as gradient descent [24, 44, 36]. If the task optima are close, the best initialization should perform well after only a few steps on a new task. We take a similar approach applied to online mirror descent, a generalization of gradient descent to non-Euclidean geometries [14]. Given a strictly convex regularizer ψ : K 7 R, step-size η > 0, and initialization xt,1 K , OMD has the iteration

xt,i+1 = arg min x K B(x||xt,1) + η X

j i ℓt,j(xt,j), x (2)

where B(x||y) = ψ(x) ψ(y) ψ(y), x y is the Bregman divergence of ψ. OMD recovers online gradient descent when ψ(x) = 1 2 x 2 2, in which case B(x||y) = 1 2 x y 2 2; another example is exponentiated gradient, for which ψ(p) = p, log p is the negative Shannon entropy on probability vectors p and B is the KL-divergence [46]. An important property of B is that the sum over functions B(xt|| ) is minimized at the mean x of the points x1, . . . , x T .

Algorithm 1: Tunes OMDη,θ with regularizer ψθ : K 7 R and step-size η > 0, which when run over loss estimators ˆℓt,1, . . . , ˆℓt,m, yielding task-optima ˆxt = arg minx K Pm i=1 ˆℓt,i, x .

Input: compact set K Rd, initialization x1 K, ordered subset Θk R also used to index interval bounds η, η Rk 0 and hyperparameters α Rk 0, scalar hyperparameters ρ > 0 and λ 0, learners OMDη,θ : K 7 Rd, projections cθ : K 7 Kθ for θ Θk do

w1(θ) 1 and η1(θ) η(θ)+η(θ)

2 // initialize MW and EWOO

for task t = 1, . . . , T do

sample θt from Θk w.p. exp(wt) // sample from MW distribution ˆxt OMDηt(θt),θt(cθt(xt)) // run bandit OMD within-task xt+1 1

t Pt s=1 ˆxs // FTL update of initialization for θ Θk do

R η(θ) η(θ) v exp( α(θ) Pt s=1 U (ρ) s (xs,v,θ))dv R η(θ) η(θ) exp α(θ) Pt s=1 U (ρ) s (xs,v,θ) dv // EWOO step-size update

wt+1(θ) wt(θ) λUt(xt, ηt(θ), θ) // MW update of tuning parameter

OMD on loss estimators ˆℓt,i constructed via partial feedback forms an important class of bandit methods [6, 2, 3]. Their regularizers ψ are often non-Lipschitz, e.g. the negative entropy, or even unbounded, e.g. the log-barrier. Thus full-information results for tuning OMD, e.g. by Khodak et al. [31] and Denevi et al. [19], do not suffice. We do adapt the former s approach of online learning a sequence Ut(x, η, θ) of affine functions of Bregman divergences from initializations x to known points in K. We are interested in them because the regret of OMD w.r.t. a comparator y is bounded by B(y||x)/η + O(ηm) [46, 25]. In our case the comparator is based on the estimated optimum ˆxt arg minx K ˆℓt, x , where ˆℓt = Pm i=1 ˆℓt,i, resulting from running OMD on task t using initialization x K and hyperparameters η and θ, which we denote OMDη,θ(x). Unlike full-information meta-learning, we use a parameter ε > 0 to constrain this optimum to lie in a subset Kε K . Formally, we fix a point x1 K to be the center e.g. x1 = 1d/d when K is the d-simplex and define the projection cε(x) = x1 + x x1

1+ε mapping from K to Kε. For example, c ε 1 ε (x) = (1 ε)x + ε1d/d on the simplex. This projection allows us to handle regularizers ψ that diverge near the boundary, but also introduces ε-dependent error terms. In the BLO case it also forces us to tune ε itself, as initializing too close to the boundary leads to unbounded regret while initializing too far away does not take advantage of task-similarity. Thus the general upper bounds of interest are the following functions of the initialization x, the step-size η > 0, and a third parameter θ that is either β or ε, depending on the setting (MAB or BLO):

Ut(x, η, θ) = Bθ(cθ(ˆxt)||x)

η + ηg(θ)m + f(θ)m (3)

Here Bθ is the Bregman divergence of ψθ while g(θ) 1 and f(θ) 0 are tunable constants. We overload θ to be either β or ε for notational simplicity, as we will not tune them simultaneously; if θ = β (for MAB) then cθ(x) = x1 + x x1

1+ε for fixed ε, while if θ = ε (for BLO) then Bθ is the Bregman divergence of a fixed ψ. The reason to optimize this sequence of upper bounds Ut is because they directly bound the task-averaged regret while being no worse than the worst-case single-task regret. Furthermore, an average over Bregman divergences is minimized at the average ˆ x = 1

T PT t=1 ˆxt, where it attains the value ˆV 2 θ = 1

T PT t=1 ψθ(cθ(ˆxt)) ψθ(cθ(ˆ x)) (c.f. Claim A.1). We will show that this quantity leads to intuitive and interpretable notions of task-similarity in all the applications we study.

2.2 A meta-algorithm for tuning bandit algorithms

To learn these functions Ut(x, η, θ) and thus to meta-learn OMDη,θ(x) our meta-algorithm sets x to be the projection cθ of the mean of the estimated optima i.e. follow-the-leader (FTL) over the Bregman divergences in (3) while simultaneously setting η via EWOO and θ via discrete multiplicative weights (MW). We choose FTL, EWOO, and MW because each is well-suited to the way Ut depends on x, η, and θ, respectively. First, the only effect of x on Ut is via the Bregman

divergence Bθ(cθ(ˆxt)||x), over which FTL attains logarithmic regret [31]. For η, Ut is exp-concave on η > 0 so long as the first term is nonzero, but it is also non-Lipschitz; the EWOO algorithm is one of the few methods with logarithmic regret on exp-concave losses without a dependence on the Lipschitz constant [26], and we ensure the first term is nonzero by regularizing the upper bounds as follows for some ρ > 0 and D2 θ = maxx,y Kθ Bθ(x||y):

U (ρ) t (x, η, θ) = Bθ(cθ(ˆxt)||x) + ρ2D2 θ η + ηg(θ)m + f(θ)m (4)

Note that this function is fully defined after obtaining ˆxt by running OMD on task t, which allows us to use full-information MW to tune θ across the grid Θk. Showing low regret w.r.t. any θ Θ Θk then just requires sufficiently large k and Lipschitzness of Ut w.r.t. θ. Combining all three algorithms together thus yields the guarantee in Theorem 2.1, which is our main structural result. It implies a generic approach for obtaining meta-learning algorithms by (1) bounding the task-averaged regret by an average of functions of the form Ut, (2) applying the theorem to obtain a new bound o T (1) + minθ,η ˆV 2 θ η + ηg(θ)m + f(θ)m, and (3) bounding the estimated task-similarity ˆV 2 θ by an interpretable quantity. Crucially, since we can choose any η > 0, the asymptotic regret is always as good as the worst-case guarantee for running the base-learner separately on each task. Theorem 2.1 (c.f. Thm. A.1). Suppose x1 = arg minx K ψθ(x) θ and let D, M, F, and S be maxima over θ of Dθ, Dθ p

g(θ)m, f(θ), and 2ψθ 2, respectively. For each ρ (0, 1) we can set η, η, α, and λ s.t. the expected average of the losses Ut(cθt(xt), ηt(θt), θt) of Algorithm 1 is at most

min θ Θ,η>0 E ˆV 2 θ η +ηg(θ)m+f(θ)m+ O

ρ2T +min ρ2D2

Here ˆV 2 θ = 1 T PT t=1 ψθ(cθ(ˆxt)) ψθ(cθ(ˆ x)) and Lη bounds the Lipschitz constant w.r.t. θ at ˆV 2 θ /η + ηg(θ)m + f(θ)m. The same bound plus (M/ρ + Fm) q

δ holds w.p. 1 δ.

We keep details of the dependence on S and other constants as they are important in applying this result, but in most cases setting ρ = 1 4

T yields O(T 3 4 ) regret. While a slow rate, the losses Ut are non-Lipschitz and non-convex in-general, and learning them allows us to tune θ over user-specified intervals and η over all positive numbers, which will be crucial later. At the same time, this tuning is what leads to the slow rate, as without tuning (k = 1, Lη = 0) the same ρ yields O(

T) regret. Lastly, while we focus on learning guarantees, we note that Algorithm 1 is reasonably efficient, requiring a 2k single-dimensional integrals per task; this is discussed in more detail in Section A.3.

3 Multi-armed bandits

We now turn to our first application: the multi-armed bandit problem, where at each round i of task t we take action at,i [d] and observe loss ℓt,i(at,i) [0, 1]. As we are sampling actions from distributions x K = on the k-simplex, the inner product ℓt,i, xt,i is the expected loss and the optimal arm at on task t can be encoded as a vector xt s.t. xt(a) = 1a= at.

We use as a base-learner a generalization of Exp3 that uses the negative Tsallis entropy

ψβ(p) = 1 Pd a=1 pβ(a) 1 β for some β (0, 1] as the regularizer; this improves regret from Exp3 s

O( dm log d) to the optimal O(

dm) [3]. Note that ψβ is the Shannon entropy in the limit β 1 and its Bregman divergence Bβ(x|| ) is non-convex in the second argument. As the Tsallis entropy is non-Lipschitz at the simplex boundary, which is where the estimated and true optima ˆxt and xt lie, we will project them using c ε 1 ε (x) = (1 ε)x + ε1d/d to the set K ε 1 ε = {x : mina x(a) ε/d}. We denote the resulting vectors using the superscript (ε),

e.g. ˆx(ε) t = c ε 1 ε (ˆxt), and also use (ε) = K ε 1 ε to denote the constrained simplex. For MAB we also study two base-learners: (1) implicit exploration and (2) guaranteed exploration. The former uses low-variance loss under-estimators ˆℓt,i(a) = ℓt,i(a)1at,i=a

xt,i(a)+γ for γ > 0, where xt,i(a) is the probability of sampling a on task t round i, to enable high probability bounds [43]. On the other hand, guaranteed exploration uses unbiased loss estimators (i.e. γ = 0) but constrains the action space to (ε), which we will use to adapt to a task-similarity determined by the true optima-in-hindsight.

3.1 Adapting to low estimated entropy with high probability using implicit exploration

In our first setting, the base-learner runs OMDηt,βt(xt,1) on γ-regularized estimators with Tsallis regularizer ψβt, step-size ηt, and initialization xt,1 (ε). Standard OMD analysis combined with implicit exploration analysis [43] shows (44) that the task-averaged regret is bounded w.h.p. by

(ε + γd)m + O

Bβt(ˆx(ε) t ||xt,1)

ηt + ηtdβtm

The summands have the desired form of Ut(xt,1, ηt, βt), so by Theorem 2.1 we can bound their average by

min β [β,β],η>0

ˆV 2 β η + ηdβm

ηT + ρ + 1 ρ

where ˆV 2 β = 1 T PT t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) is the average difference in Tsallis entropies between the (ε-constrained) estimated optima ˆxt and their empirical distribution ˆ x = 1

T PT t=1 ˆxt, while Lη

is the Lipschitz constant of ˆV 2 β η + ηdβm

β w.r.t. β [β, β]. The specific instantiation of Algorithm 1 that (7) holds for is to do the following at each time t:

1. sample βt via the MW distribution exp(wt) over the discretization Θk of [β, β] [0, 1]

2. run OMDηt,βt using the initialization xt,1 = 1 t 1 X

s<t ˆx(ε) t = ε

s<t ˆxt (FTL)

3. update EWOO at each β Θk with loss Bβ(ˆx(ε) t ||xt,1)+ρ2D2 β η + ηdβm

β , where D2 β = d1 β 1

4. update pt+1 using multiplicative weights with expert losses Bβ(ˆx(ε) t ||xt,1) η + ηdβm

The final guarantee for this procedure, given in full in Theorem B.1, follows by two properties of the Tsallis entropy ψβ: (1) its Lipschitzness w.r.t. β [0, 1] (c.f. Lem B.1) and (2) the fact that ˆV 2 β is bounded by the entropy ˆHβ = ψβ(ˆ x) of the empirical distribution of estimated optima (c.f. Lem B.2), which yields our first notion of task-similarity: multi-armed bandit tasks are similar if the empirical distribution of their (estimated) optimal arms has low entropy.

We exemplify the implications of Theorem B.1 in Corollary 3.1, where we consider three regimes of the lower bound β on the entropy parameter: β = 1, i.e. always using Exp3; β = 1/2, which corresponds to the optimal worst-case setting [3]; and β = 1/ log d, below which the OMD regret-upper-bound always worsens (and so it does not make sense to try β < 1/ log d).

Corollary 3.1 (c.f. Cors. B.1, B.2, and B.3). Suppose β = 1 and we set the initialization, step-size, and entropy parameter of Tsallis OMD with implicit exploration via Algorithm 1 as in Theorem B.1.

1. If β = 1 and T d2

m we can ensure 1

i=1 ℓt,i(xt,i) ℓt,i( xt) 2

p ˆH1dm+ O d 2 3 m 2 3 3

2. If β = 1

2 and T d5/2

m we can set k = 4

T and ensure w.h.p. that task-averaged regret is

ˆHβdβm/β + O d5/7m5/7

T 2/7 + d m

3. If β = 1 log d and T d3

m we can set k = 4

T and ensure w.h.p. that task-averaged regret is

min β (0,1] 2 q

ˆHβdβm/β + O d3/4m3/4 + d m

In all three settings, as T the regret scales directly with the entropy of the estimated optima-in-hindsight, which is small if most tasks are estimated to be solved by one of a few arms and large if all arms are used roughly equally. Corollary 3.1 demonstrates the importance of tuning β: even if tasks are dissimilar, we asymptotically recover the worst-case optimal guarantee O(

dm) in cases two and three because the entropy is at most d1 β

1 β . On the other hand, if a constant s d actions are always minimizers, i.e. the empirical distribution ˆ x is s-sparse, then the last bound (10)

implies that Algorithm 1 can achieve task-averaged regret o T (md) + O( sm log d). At the same time, this tuning is costly, with the last two results having an extra O d m

term because of it.

Furthermore, the bound of β = 1

2 has a slightly better dependence on d, m, and T compared to that

of β = 1 log d due to the d

ε 2 β term in the bound (7) returned for MAB by our structural result.

We can compare the s-sparse result to Azizi et al. [10], who achieve task-averaged regret O(m/ 3

T + sm log T) for stochastic MAB. Despite our adversarial setting and no stipulations on how tasks are related, our bounds are asymptotically comparable if the estimated and true optima are roughly equivalent (ignoring their O( log T)-factor), as we also have O( sm) average regret as T . Their rate in the number of tasks is better, but at a cost of runtime exponential in s. Apart from generality, we believe a great strength of our results is their adaptiveness; unlike Azizi et al. [10], we do not need to know how many optimal arms there are to adapt to there being few of them.

3.2 Adapting to the entropy of the true optima-in-hindsight using guaranteed exploration

While the entropy of estimated optima-in-hindsight may be useful in some cases where we wish to actually compute the task-similarity, it is otherwise generally more desirable to adapt to an intrinsic and algorithm-independent measure, e.g. the entropy of the true optima-in-hindsight. However, doing so is difficult without further assumptions, as the optima are both hard to identify and the measure itself may not be fully defined in case of ties. Thus in this section we focus on the setting where we have a nonzero performance gap > 0 between the best and second-best arms: Assumption 3.1. For some > 0 and all tasks t [T], 1

m Pm i=1 ℓt,i(a) ℓt,i( at) a = at.

This assumption is common in the best-arm identification literature [28, 1], which we adapt to show that the estimated optimal arms match the true optima, and thus so do their entropies. To do so, we switch to unbiased loss estimators, i.e. γ = 0, and control their variance by lower-bounding the probability of selecting an arm to be at least ε

d; this can alternatively be expressed as running OMD using the regularizer ψβ + I (ε), where for any C Rd the function IC(x) = 0 if x C and otherwise. Guaranteed exploration allows us extend the analysis of Abbasi-Yadkori et al. [1] to show that the estimated arm is optimal w.h.p.: Lemma 3.1 (c.f. Lem C.1). Suppose for ε > 0 and any β (0, 1] we run OMD on task t [T] with regularizer ψβ + I (ε). If m = Ω( d ε 2 ) then ˆxt = xt w.p. 1 d exp( Ω(ε 2m/d)).

However, the constraint that the probabilities are at least ε

d does lead to εm additional error on each task, with the upper bound on the task-averaged expected regret becoming

i=1 ℓt,i(at,i) ℓt,i( at) εm + 1

EBβt(ˆx(ε) t ||xt,1) ηt + ηtdβtm

Moreover, we will no longer set ε = o T (1), as this would require m to be increasing in T for the bestarm identification result of Lemma C.1 to hold. Thus, unlike in the previous section, our results will contain fast terms terms in the task-averaged regret that are o( m) but not decreasing in T nor affected by the task-similarity. They will still improve upon the Ω(

dm) MAB lower bound if tasks are similar, but the task-averaged regret will not converge to zero as T if the tasks are identical.

Nevertheless, the tuning-dependent component of the upper bounds in (11) has the appropriate form for our structural result in fact we can use the same meta-algorithm (8) as for implicit exploration and so we can again apply Theorem 2.1 to get a bound on the task-averaged regret in terms of the average difference ˆV 2 β = 1 T PT t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) of the entropies of the

ε-constrained estimated task-optima ˆx(ε) t and their mean ˆ x (ε). The easiest way to apply Lemma C.1 to bound ˆV 2 β in terms of Hβ = 1

T PT t=1 ψβ( xt) ψβ( x) is via union bound on all T tasks to show that ˆxt = xt t w.p. 1 d T exp( Ω(ε 2m/d)); however, setting a constant failure probability leads to m growing, albeit only logarithmically, in T. Instead, by analyzing the worst-case best-arm identification probabilities, we show in Lemma C.2 that the expectation of ˆV 2 β is bounded by

Hβ + 3β (d/ε)1 β 1

1 β exp 3ε 2m

28d without resorting to m = ωT (1). Assuming m 75d

ε 2 log d ε 2 is enough (69) to bound the second term by 56 dm. Then the final result (c.f. Thm. C.1) bounds the expected task-averaged regret as follows (ignoring terms that become o T (1) after setting ρ and k):

εm + min β [β,β],η>0 hβ( )

β for hβ( ) =

md if m 75d

ε 2 log d ε 2 d1 β 1

1 β otherwise (12)

If the gap is known and sufficiently large, then we can set ε = Θ( d 2m) to obtain an asymptotic task-averaged regret that scales only with the entropy Hβ and a fast term that is logarithmic in m:

Corollary 3.2 (c.f. Cor. C.3). Suppose we set the initialization, step-size, and entropy parameter of Tsallis OMD with guaranteed exploration via Algorithm 1 as specified in Theorem C.1. If [β, β] = [ 1 log d, 1] and m 75d

2 log d 2 , then setting ε = Θ d 2m , ρ = 1 3

m T , and k = 3

d2m T ensures that the expected task-averaged regret is at most

min β (0,1] 2 q

Hβdβm/β + O

d 2 + d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d 4m3

Knowing the gap is a strong assumption, as ideally we could set ε without it. Note that if ε = Ω( 1

mp ) for some p (0, 1) then the condition m 75d ε 2 log d ε 2 only fails if m poly( 1

), i.e. for gap decreasing in m. We can use this together with the fact that minimizing over η and β in our bound allows us to replace them with any value, even a gap-dependent one, to derive a gap-independent setting of ε that ensures a task-similarity-adaptive bound when is not too small and falls back to the worst-case optimal guarantee otherwise. Specifically, for indicator ι = 1m 75d

ε 2 log d ε 2 , setting

hβ( ) dβm/β in (12) and using β = 1

2 if the condition ι fails yields asymptotic regret at most

εm + min β (0,1] O

min β (0,1]

(14) Thus setting ε = Θ(

d/m 2 3 ) yields the desired dependence on the entropy Hβ and a fast term in m:

Corollary 3.3 (c.f. Cor. C.4). In the setting of Corollary 3.2 but with m = Ω(d 3 4 ) and unknown , using ε = Θ(

d/m 2 3 ) ensures expected task-averaged regret at most

min β (0,1] 2 q

Hβdβm/β + O

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d2m 7 3 T

While not logarithmic, the gap-dependent term is still o( m), and moreover the asymptotic regret is no worse than the worst-case optimal O(

dm). Note that the latter is only needed if = o(1/ 6 m).

The main improvement in this section is in using the entropy of the true optima, which can be much smaller than that of the estimated optima if there are a few good arms but large noise. Our use of the gap assumption for this seems difficult to avoid for this notion of task-similarity. We can also compare to Corollary 3.1 (10), which did not require > 0 and had no fast terms but had a worse rate in T; in contrast, the O( 1 3

T ) rates above match that of the closest stochastic bandit result [10]. As before, for s d good arms we obtain O( sm log d) asymptotic regret, assuming the gap is not too small. Finally, we can also compare to the classic shifting regret bound for Exp3.S [6], which translated to task-averaged regret is O( p

dm log(dm T)). This is worse than even running OMD separately on each task, albeit under weaker assumptions (not knowing task boundaries). It also cannot take advantage of repeated optimal arms, e.g. the case of s d good arms.

3.3 Adapting to entropic task similarity implies robustness to outliers

While we considered mainly the s-sparse setting as a way of exemplifying our results and comparing to other work such as Azizi et al. [10], the fact that our approach can adapt to the Tsallis entropy minβ Hβ of the optimal arms implies meaningful guarantees for any low-entropy distribution over the optimal arms, not just sparsely-supported ones. One way to illustrate the importance of this is through an analysis of robustness to outlier tasks. Specifically, suppose that the s-sparsity assumption that optima at lie in a subset of [T] of size s d only holds for all but O(T p) of the tasks t [T], where p [0, 1). Then the best we can do using an asymptotic bound of O( sm) e.g. that of Azizi et al.

[10] in the stochastic case or from naively applying minβ (0,1] Hβdβm/β esm log d to any of our previous results is to substitute s + T p instead of s, which will only improve over the single-task bound if d = ω(T p), i.e. in the regime where the number of arms increases with the number of tasks.

However, our notion of task-similarity allows us to do much better, as we can show (c.f. Prop. D.1) that in the same setting Hβ = O(s + d1 β

T β(1 p) ) for any β [ 1 log d, 1

2]. Substituting this result into e.g. Corollary 3.3 yields the same asymptotic result of O( sm log d), although the rate in T is a very slow O(

1 p 2 log d ). This demonstrates how our entropic notion of task-similarity simultaneously yields strong results in the s-sparse setting and is meaningful in more general settings.

4 Bandit linear optimization

Our last application is bandit linear optimization, in which at task t round i we play xt,i K in some convex K Rd and observe loss ℓt,i, xt,i [ 1, 1]. We will again use a variant of mirror descent, using a self-concordant barrier for ψ and the specialized loss estimators of Abernethy et al. [2, Alg. 1]. More information on such regularizers can be found in the literature on interior point methods [42]. We pick this class of algorithms because of their optimal dependence on the number of rounds and their applicability to any convex domain K via specific barriers ψ, which will yield interesting notions of task-similarity. Our ability to handle non-smooth regularizers via the structural result (Thm. 2.1) is even more important here, as barriers are infinite at the boundaries. Indeed, we will not learn a β parameterizing the regularizer and instead focus on tuning a boundary offset ε > 0. Here we make use of notation from Section 2, where cε maps points in K to a subset Kε defined by the Minkowski function (c.f. Def. E.1) centered at x1 = arg minx K ψ(x).

From Abernethy et al. [2] we have an upper bound on the expected task-averaged regret of their algorithm run from initializations xt,1 K with step-sizes ηt > 0 and offsets εt > 0:

i=1 ℓt,i, xt,i xt 1

EB(cεt(ˆxt)||xt,1)

ηt + (32ηtd2 + εt)m (16)

We can show (88) that D2 ε = maxx,y Kε B(x||y) 9ν 3 2 K S1

ε , where ν is the self-concordance constant of ψ and S1 = 2ψ(x1) 2 is the spectral norm of its Hessian at the center x1 of K. Restricting to tuning ε [ 1

m, 1] which is enough to obtain constant task-averaged regret above if the estimated optima ˆxt are identical we can now apply Algorithm 1 via the following instantiation:

1. sample εt via the MW distribution exp(wt) over the discretization Θk of [ 1

2. run OMDηt,εt using the initialization xt,1 = 1 t 1 X

s<t cεt(ˆxt) = x1 +

s<t ˆxt x1 (1+εt)(t 1) (FTL)

3. update EWOO at each ε Θk with loss B(cε(ˆxt)||xt,1)+ρ2D2 ε η + 32ηd2 for D2 ε = 9ν 3 2 K S1

4. update pt+1 using multiplicative weights with expert losses B(cε(ˆxt)||xt,1)

η + εm (17)

Note the similarity to the MAB case (8), with the difference being the upper bound passed to EWOO and MW. Our structural result bounds the expected task-averaged regret as follows (c.f. Thm. E.1):

E min ε [ 1

ˆV 2 ε η + (32ηd2 + ε)m + O

k +m min ρ2

(18) For ρ = o T (1) and k = ωT (1) this becomes o T (poly(m)) + E minε [ 1

m ,1],η>0 ˆV 2 ε η + 32ηd2m + εm,

where ˆV 2 ε = 1 T PT t=1 ψ(cε(ˆxt) ψ(cε(ˆ xt). Then by tuning η we get an asymptotic (T ) regret of 4d ˆVε

2m + εm for any ε [ 1

m, 1]. Our analysis removes the explicit dependence on ν that appears in the single-task regret [2]; as an example, ν equals the number of inequalities defining a polytope K, as in the bandit shortest-path application below.

The remaining challenge is to interpret ˆV 2 ε , which as we did for MAB we do via specific examples, in this case concrete action domains K. Our first example is for BLO over the unit sphere K = {x Rd : x 2 1} using the appropriate log-barrier regularizer ψ(x) = log(1 x 2 2):

Corollary 4.1 (c.f. Cor. E.1). For BLO on the sphere, Algorithm 1 has expected task-averaged regret

+ min ε [ 1

2m log 1 + 1 E ˆ x 2 2 2ε + ε2

The bound above is decreasing in E ˆ x 2 2, the expected squared norm of the average of the estimated optima ˆxt. We thus say that bandit linear optimization tasks over the sphere are similar if the norm of the empirical mean of their (estimated) optima is large. This makes intuitive sense: if the tasks optima are uniformly distributed, we should expect E ˆ x 2 2 to be small, even decreasing in d. On the other hand, in the degenerate case where the estimated optima ˆxt are the same across all tasks t [T], we have E ˆ x 2 2 = 1, so the asymptotic task-averaged regret is 1 because we can use ε = 1

m. Perhaps slightly more realistically, if it is 1 mp -away from 1 for some power p 1

2 then setting ε = 1 m can remove the logarithmic dependence on m. These two regimes illustrate the importance of tuning ε.

As a last application, we apply our meta-BLO result to the shortest-path problem in online optimization [50, 30]. In its bandit variant [8, 17], at each step i = 1, . . . , m the player must choose a path pi from a fixed source u V to a fixed sink v V in a directed graph G(V, E). At the same time the adversary chooses edge-weights ℓi R|E| and the player suffers the sum P

e pt ℓi(e) of the weights in their chosen path pt. This can be relaxed as BLO over vectors x in a set K [0, 1]|E| defined by a set C of O(|E|) linear constraints (a, b) a, x b enforcing flows from u to v; u to v paths can be sampled from any x K in an unbiased manner [2, Proposition 1]. On a single-instance, applying the BLO method of Abernethy et al. [2] ensures O(|E| 3 2 m) regret on this problem.

In the multi-instance setting, comprising a sequence t = 1, . . . , T of shortest path instances with m adversarial edge-weight vectors ℓt,i each, we can attempt to achieve better performance by tuning the same method across instances. Notably, we can view this as the problem of learning predictions [33] in the algorithms with predictions paradigm from beyond-worst-case analysis [40], with the OMD initialization on each instance being effectively a prediction of its optimal path. Our meta-learner then has the following average performance across bandit shortest-path instances: Corollary 4.2 (c.f. Cor. E.2). For multi-task bandit online shortest path, Algorithm 1 with regularizer ψ(x) = P

a,b C log(b a, x ) attains the following expected average regret across instances

T 3 4 + |E| 5 2 m 5 6

+ min ε [ 1

m ,1] 4|E|E

vu u u t2m X

1 T PT t=1 b a, cε(ˆxt)

Tq QT t=1 b a, cε(ˆxt)

Here the asymptotic regret scales with the sum across all constraints a, b C of the log of the ratio between the arithmetic and geometric means across tasks of the distances b a, cε(ˆxt) from the estimated optimum flow cε(ˆxt) to the constraint boundary. As it is difficult to separate the effect of the offset ε, we do not state an explicit task-similarity measure like in our previous settings. Nevertheless, since the arithmetic and geometric means are equal exactly when all entries are equal and otherwise the former is larger the bound does show that regret is small when the estimated optimal flows ˆxt for each task are at similar distances from the constraints, i.e. the boundaries of the polytope. Indeed, just as on the sphere, if the estimated optima are all the same then setting ε = 1

m again yields constant averaged regret.

5 Conclusion and limitations

We develop and apply a meta-algorithm for learning to initialize and tune bandit algorithms, obtaining task-averaged regret guarantees for both multi-armed and linear bandits that depend on natural, setting-specific notions of task similarity. For MAB, we meta-learn the initialization, step-size, and entropy parameter of Tsallis-entropic OMD and show good performance if the entropy of the optimal arms is small. For BLO, we use OMD with self-concordant regularizers and meta-learn the initialization, step-size, and boundary-offset, yielding interesting domain-specific task-similarity measures. Some natural directions for future work involve overcoming some limitations of our results: can we adapt to a notion of task-similarity that depends on the true optima without assuming a gap for MAB, or at all for BLO? Alternatively, can we design meta-learning algorithms that adapt to both stochastic and adversarial bandits, i.e. a best-of-both-worlds guarantee? Beyond this, one could explore other partial information settings, such as contextual bandits or bandit convex optimization.

Acknowledgments

We thank our anonymous reviewers for helpful suggestions, especially concerning the analysis of robustness to outliers. This material is based on work supported in part by the National Science Foundation under grants CCF-1910321, FAI-1939606, IIS-1901403, SCC-1952085, and SES-1919453; the Defense Advanced Research Projects Agency under cooperative agreement HR00112020003; a Simons Investigator Award; an AWS Machine Learning Research Award; an Amazon Research Award; a Bloomberg Research Grant; a Microsoft Research Faculty Fellowship; a Google Faculty Research Award; a J.P. Morgan Faculty Award; a Facebook Research Award; a Mozilla Research Grant; a Facebook Ph D Fellowship; and an NDSEG Fellowship. KYL is supported by the Israel Science Foundation (grant No. 447/20) and the Technion Center for Machine Learning and Intelligent Systems (MLIS). The work of RM was partially supported by the Israel Science Foundation grant number 1693/22.

[1] Yasin Abbasi-Yadkori, Peter Bartlett, Victor Gabillon, Alan Malek, and Michal Valko. Best of both worlds: Stochastic & adversarial best-arm identification. In Proceedings of the 31st Conference On Learning Theory, 2018.

[2] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the International Conference on Computational Learning Theory, 2008.

[3] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, 2015.

[4] Oren Anava and Zohar S. Karnin. Multi-armed bandits: Competing with optimal sequences. In Advances in Neural Information Processing Systems, 2016.

[5] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In Proceedings of the International Conference on Computational Learning Theory, 2011.

[6] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32:48 77, 2002.

[7] Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the 32nd Annual Conference on Learning Theory, 2019.

[8] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, 2004.

[9] Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill, et al. Sequential transfer in multi-armed bandit with finite set of models. In Advances in Neural Information Processing Systems, 2013.

[10] Mohammad Javad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, and Mohammad Ghavamzadeh. Non-stationary bandits and meta-learning with a small set of optimal arms. ar Xiv, 2022.

[11] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.

[12] Maria-Florina Balcan, Mikhail Khodak, Dravyansh Sharma, and Ameet Talwalkar. Learning-tolearn non-convex piecewise-Lipschitz functions. In Advances in Neural Information Processing Systems, 2021.

[13] Soumya Basu, Branislav Kveton, Manzil Zaheer, and Csaba Szepesvári. No regrets for learning the prior in bandits. In Advances in Neural Information Processing Systems, 2021.

[14] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167 175, 2003.

[15] Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. In Proceedings of the 37th International Conference on Machine Learning, 2020.

[16] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[17] Varsha Dani, Thomas Hayes, and Sham Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008.

[18] Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning to learning around a common mean. In Advances in Neural Information Processing Systems, 2018.

[19] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems, 2019.

[20] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. In Proceedings of the 9th International Conference on Learning Representations, 2021.

[21] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, 2017.

[22] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradientbased model-agnostic meta-learning algorithms. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.

[23] Xiequan Fan, Ion Grama, and Quansheng Liu. Hoeffding s inequality for supermartingales. Stochastic Processes and their Applications, 122(10):3545 3559, 2012.

[24] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[25] Elad Hazan. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pages 157 325. now Publishers Inc., 2015.

[26] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169 192, 2007.

[27] Ali Jadbabaie, Alexander Rakhlin, and Shahin Shahrampour. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[28] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[29] Yihan Jiang, Jakub Koneˇcný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. ar Xiv, 2019.

[30] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291 307, 2005.

[31] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based metalearning methods. In Advances in Neural Information Processing Systems, 2019.

[32] Mikhail Khodak, Renbo Tu, Tian Li, Liam Li, Maria-Florina Balcan, Virginia Smith, and Ameet Talwalkar. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. In Advances in Neural Information Processing Systems, 2021.

[33] Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar, and Sergei Vassilvitskii. Learning predictions for algorithms with predictions. In Advances in Neural Information Processing Systems, 2022.

[34] Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvári, and Craig Boutilier. Meta-learning bandit policies by gradient ascent. ar Xiv, 2020.

[35] Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-Wei Hsu, Martin Mladenov, Craig Boutilier, and Csaba Szepesvári. Meta-Thompson sampling. In Proceedings of the 38th International Conference on Machine Learning, 2021.

[36] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learning quickly for few-shot learning. ar Xiv, 2017.

[37] Haipeng Luo. CSCI 699 Lecture 13. 2017. URL https://haipeng-luo.net/ courses/CSCI699/lecture13.pdf.

[38] Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In Proceedings of the 31st Annual Conference on Learning Theory, 2018.

[39] Teodor Marinov and Julian Zimmert. The Pareto frontier of model selection for general contextual bandits. In Advances in Neural Information Processing Systems, 2021.

[40] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. In Tim Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, Cambridge, UK, 2021.

[41] Ahmadreza Moradipari, Mohammad Ghavamzadeh, Taha Rajabzadeh, Christos Thrampoulidis, and Mahnoosh Alizadeh. Multi-environment meta-learning in stochastic linear bandits. In Proceedings of the 2022 IEEE International Symposium on Information Theory, 2022.

[42] Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied and Numerical Mathematics, 1994.

[43] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, 2015.

[44] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. ar Xiv, 2018.

[45] Nikunj Saunshi, Yi Zhang, Mikhail Khodak, and Sanjeev Arora. A sample complexity separation between non-convex and convex meta-learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.

[46] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2011.

[47] Amr Sharaf and Hal Daumé III. Meta-learning effective exploration strategies for contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

[48] Max Simchowitz, Christopher Tosh, Akshay Krishnamurthy, Daniel J. Hsu, Thodoris Lykouris, Miro Dudik, and Robert E. Schapire. Bayesian decision-making under misspecified priors with applications to meta-learning. In Advances in Neural Information Processing Systems, 2021.

[49] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

[50] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773 818, 2003.

[51] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.

[52] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479 487, 1988.

[53] Chen-Yu Wei and Haipeng Luo. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. In Proceedings of the 34th Annual Conference on Learning Theory, 2021.

[54] Takuya Yamano. Some properties of q-logarithm and q-exponential functions in Tsallis statistics. Physica A: Statistical Mechanics and its Applications, 305:486 496, 2002.

[55] Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. Journal of Machine Learning Research, 22(125):1 45, 2021.

A Structural results

A.1 Properties of the Bregman divergence

Lemma A.1. Let ψ : C 7 R be a strictly convex function with maxx C 2ψ(x) 2 S over a convex set C Rd over size maxx C x 2 K, and let B( || ) be the Bregman divergence generated by ψ. Then for any points x1, . . . , x T C the actions y1 = arg minx C ψ(x) and yt = 1 t 1 P

s<t xs have regret

t=1 B(xt||yt) B(xt||y T +1)

2t 1 8SK2(1 + log T) (21)

Proof. Note that

y B(x||y) = ψ(y) y ψ(y), x + y ψ(y), y = diag( 2ψ(y))(y x) (22)

so B(xt||y) is 2SK-Lipschitz w.r.t. the Euclidean norm. Applying Khodak et al. [31, Prop. B.1] yields the result (note that its assumption of strong convexity of the regularizer can be replaced with strict convexity without changing the proof or result).

Claim A.1. Let ψ : K 7 R be a strictly-convex function with Bregman divergence B( || ) over a convex set K Rd containing points x1, . . . , x T . Then their mean x = 1

T PT t=1 xt satisfies

t=1 B(xt|| x) =

t=1 ψ(xt) ψ( x) (23)

t=1 B(xt|| x) =

t=1 ψ(xt) ψ( x) ψ( x), xt x

t=1 ψ(xt) ψ( x) ψ( x),

t=1 ψ(xt) ψ( x)

A.2 Tuning the step-size

Lemma A.2. Let ℓ1, . . . , ℓT : R>0 7 R>0 be a sequence of functions of form ℓt(x) = B2 t x + G2x for adversarially chosen Bt [0, D] and some G > 0. Then for any ρ 0, the actions of EWOO [26, Fig. 4] with parameter 2ρ2

DG run on the modified losses B2 t +ρ2D2

x + G2x over the domain h ρD

1 + ρ2 i achieves regret w.r.t. any x > 0 of

t=1 ℓt(x) ℓt(x) min ρ2D2

x , ρDG T + DG(1 + log(T + 1))

Proof. By Khodak et al. [31, Prop. C.1] the modified functions are 2ρ2

DG-exp-concave. Then Khodak et al. [31, Cor. C.2] with Bt set to Bt

G, αt = G2, and ε = ρD

G yields the result.

Lemma A.3. For ˆx1, . . . , ˆx T K consider a sequence of functions of form

Ut(x, η) = B(cε(ˆxt)||x)

η + ηG2m (26)

where B is the Bregman divergence of a strictly convex d.g.f. ψ : K 7 R and where x1 = arg minx K ψ(x) defines the projection cε(x) = x1 + x x1

1+ε for some ε > 0 . Suppose we play

xt+1 cε 1 t Pt s=1 ˆxs and set ηt using the actions of EWOO [26, Fig. 4] with parameter 2ρ2

some ρ, Dε > 0 s.t. B(cε(ˆxt)||x) D2 ε x Kε on the functions B(cε(ˆxt)||xt)+ρ2D2 ε η +ηG2m over

the domain ρDε G m, Dε

, with η1 being at the midpoint of the domain. Then Ut(xt, ηt)

DεG m 1 ρ + p

1 + ρ2 t [T] and

t=1 Ut(xt, ηt) min η>0,x K

B(cε(ˆxt)||x)

+ min ρ2D2 ε η , ρDεG T + DεG(1 + log(T + 1))

2ρ2 + 8SεK2(1 + log T)

for K = maxx K x 2 and Sε = maxx Kε 2ψ(x) 2.

Proof. The first claim follows by directly substituting the worst-case values of η into Ut(x, η). For the second, apply Lemma A.2 followed by Lemma A.1:

t=1 Ut(xt, ηt)

B(cε(ˆxt)||xt)

ηt + ηt G2m

min η>0 min ρ2D2 ε η , ρDεG T + DεG(1 + log(T + 1))

B(cε(ˆxt)||x)

min η>0 min ρ2D2 ε η , ρDεG T + DεG(1 + log(T + 1))

2ρ2 + 8SεK2(1 + log T)

B(cε(ˆxt)||x)

Conclude by noting that the sum of Bregman divergence to cε(ˆxt) is minimized on their convex hull, a subset of Kε.

A.3 Computational and space complexity

Algorithm 1 implicitly maintains a separate copy of FTL for each hyperparameter in the continuous space of EWOO and the grid Θk over the domain of θ, but explicitly just needs to average the estimated task-optima ˆxt; this is due to the mean-as-minimizer property of Bregman divergences and the linearity of cε. Thus the memory it uses is O(d+k), where k is size of the discretization of Θ and should be viewed as sublinear in T, e.g. for MAB with implicit exploration and BLO k = O( 4

T). Computationally, at each timestep t and for each grid point we must compute two single-dimensional integrals; the integrands are sums of upper bounds that just need to be incremented once per round, leading to a total per-iteration complexity of O(k) (ignoring the running of OMD). Although outside the scope of this work, it may be possible to avoid integration by tuning η with MW as well, rather than EWOO, but likely at the cost of worse regret because it would not take advantage of the exp-concavity of U (ρ) t .

A.4 Main structural result

Theorem A.1. Consider a family of strictly convex functions ψθ : K 7 R parameterized by θ lying in an interval Θ R of radius RΘ that are all minimized at the same x1 K , and for ˆx1, . . . , ˆx T K consider a sequence of functions of form Ut(x, η, θ) (3), as well as the associated regularized upper bounds U (ρ) t (4). Define the maximum divergence D = maxθ Θ Dθ, radius K = maxx K x 2,

and Lη the Lipschitz constant w.r.t. θ Θ of ˆV 2 θ η +ηg(θ)m+f(θ)m. Then Algorithm 1 with Θk Θ the uniform discretization of Θ s.t. maxθ Θ minθ Θk |θ θ | RΘ

k , ρ (0, 1), η(θ) = ρDθ

η(θ) = Dθ q

1+ρ2 g(θ)m, α(θ) = 2ρ2

g(θ)m, and λ = M 1 ρ + p

1 + ρ2 + Fm 1 q

2T leads to

a sequence (xt, ηt(θt), θt) s.t. E PT t=1 Ut(xt, ηt(θt), θt) is bounded by

E min θ Θ,η>0 8SK2(1 + log T)

ˆV 2 θ η + ηg(θ)m + f(θ)m + LηRΘ

k + min ρ2D2

T log k + M(1 + log(T + 1))

and PT t=1 Ut(xt, ηt(θt), θt) is bounded w.p. 1 δ1k>1 by

min θ Θ,η>0 8SK2(1 + log T)

ˆV 2 θ η + ηg(θ)m + f(θ)m + LηRΘ

k + min ρ2D2

T log k + 1k>1

+ M(1 + log(T + 1))

Proof. In the following proof, we first consider online learning Ut( , , θ) for fixed θ Θk. To tune η, we online learn the one-dimensional losses Bθ(cθ(ˆxt)||cθ(xt))/η + ηg(θ), where cθ(ˆxt) is the (ηt(θ)-independent) action of FTL at time t. As discussed, the corresponding regularized losses U (ρ) t are exp-concave, and so running EWOO yields O M/ρ2 + min ρ2D2/η, ρM T regret w.r.t. the original sequence [31, Cor. C.2]. At the same time, we show that FTL has logarithmic regret on the sequence Bθ(cθ(ˆxt)|| ) that scales with the spectral norm S of 2ψθ (c.f. Lem. A.1), and that the average loss of the optimal comparator is ˆV 2 θ (c.f. Claim A.1). Thus, since we only care about a fixed comparator η, dividing by ηT yields the first and last terms (5). We run a copy of these algorithms for each θ Θk; since their losses are bounded by O(M/ρ + Fm), textbook results for MW yield O( T log k) regret w.r.t. θ Θk, which we then extend to Θ Θk using Lη-Lipschitzness.

Formally, we have that

t=1 Ut(xt, ηt(θt), θt)

Bθt(cθt(ˆxt)||xt)

ηt(θt) + ηt(θt)g(θ)m + f(θ)m

2T log k + E min θ Θk

Bθ(cθ(ˆxt)||xt)

ηt(θ) + ηt(θ)g(θ)m + f(θ)m

T log k + E min θ Θk,η>0,x K

Bθ(cθ(ˆxt)||x)

η + ηg(θ)m + f(θ)m

+ min ρ2D2 θ η , ρDθ p

g(θ)m T + Dθ p

g(θ)m(1 + log(T + 1))

2ρ2 + 8SK2(1 + log T)

where the first inequality is the regret of multiplicative weights with step-size λ [46, Cor. 2.14] and the second is by applying Lemma A.3 for each θ. We then simplify and apply the definition of ˆV 2 θ via Claim A.1 and conclude by applying Lipschitzness w.r.t. θ:

t=1 Ut(xt, ηt(θt), θt)

T log k + E min θ Θk,η>0

η + ηg(θ)m T + f(θ)m T

η , ρM T + M(1 + log(T + 1))

2ρ2 + 8SK2(1 + log T)

E min θ Θ,η>0 8SK2(1 + log T)

ˆV 2 θ η + ηg(θ)m + f(θ)m + LηRΘ

k + min ρ2D2

T log k + M(1 + log(T + 1))

The w.h.p. guarantee follows by Cesa-Bianchi and Lugosi [16, Lem. 4.1].

B Implicit exploration

B.1 Properties of the Tsallis entropy

Lemma B.1. For any ε (0, 1] and x s.t. x(a) ε d a [d] the β-Tsallis entropy

Hβ(x) = 1 Pd a=1 xβ(a) 1 β is d log d

ε-Lipschitz w.r.t. β [0, 1].

Proof. Let logβ x = x1 β 1

1 β be the β-logarithm function and note that by Yamano [54, Equation 6] we have logβ x log x = (1 β)( b logβ x + logβ x log x) 0 β [0, 1]. Then we have for β [0, 1) that

| βHβ(x)| =

Hβ(x) Pd a=1 xβ(a) log x(a) 1 β

a=1 xβ(a)(logβ x(a) log x(a))

a=1 xβ(a)(logβ x(a) log x(a))

a=1 (logβ x(a) log x(a)) 1 1 β

a=1 logβ x(a) log x(a) d 1 β (logβ d ε log d

where the fourth inequality follows by Hölder s inequality, the fifth by subadditivity of xa for a (0, 1], the sixth by the fact that x(logβ x log x) = x β 1/x 0 β, x [0, 1), and

the last line by substituting β = 0 since β logβ x log x

1 β = 2(x xβ) (1 β)(xβ+x) log x

xβ(1 β)3 0 β

[0, 1), x (0, 1/d]. For β = 1, applying L Hôpital s rule yields

lim β 1 βHβ(x) = 1

a=1 xβ(a) log2 x(a)(1 (1 β) log x(a)) = 1

a=1 x(a) log2 x(a)

(34) which is bounded on [ 2d/e2, 0].

Lemma B.2. Consider x1, . . . , x T s.t. xt(at) = 1 for some at [d], and let x = 1

T PT t=1 xt be their average. For any ε (0, 1] and β (0, 1] we have that for every t [T]

Hβ( x(ε)) Hβ(x(ε) t ) Hβ( x) (35)

where recall that x(ε) = c ε 1 ε (x) = 1d/d + (1 ε)(x 1d/d) = (1 ε)x + ε

Proof. Assume w.l.o.g. that x(1) x(2) . . . x(d) and at = 1, so that x(ε) t = e(ε) 1 . We take the derivative

εHβ (1 ε) x + ε

d1d εHβ e(ε) 1

1 ((1 ε) x(a) + ε/d)1 β 1 (ε/d)1 β

1 ((1 ε) + ε/d)1 β 1 ((1 ε) x(d) + ε/d)1 β

a=1 x(a) 1 ((1 ε) x(d) + ε/d)1 β 1 ((1 ε) x(a) + ε/d)1 β

By the assumption that x(a) is non-decreasing in a, each of the summands above become non-positive. So for ε (0, 1] the derivative is non-positive, and for ε 0+ it goes to . Thus the l.h.s. of the bound is monotonically non-increasing in ε for all ε [0, 1]. The result then follows from the fact

that for ε = 0 we have Hβ (1 ε) x + ε

d1d Hβ e(ε) 1 = Hβ( x).

B.2 Implicit exploration bounds

Lemma B.3. Suppose we play OMDβ,η with regularizer ψβ the negative Tsallis entropy and initialization x1 on the sequence of linear loss functions ℓ1, . . . , ℓT [0, 1]d. Then for any x we have T X

t=1 ℓt, xt x Bβ(x||x1)

a=1 x2 β t (a)ℓ2 t(a) (37)

Proof. Note that the following proof follows parts of the course notes by Luo [37], which we reproduce for completeness. The OMD update at each step t involves the following two steps: set yt+1 s.t. ψβ(yt+1) = ψβ(xt) ηℓt and then set xt+1 = arg minx Bβ(x, yt+1) [25, Algorithm 14]. Note that by Hazan [25, Equation 5.3] and nonnegativity of the Bregman divergence we have T X

t=1 ℓt, xt x Bβ(x||x1)

t=1 Bβ(xt||yt+1) (38)

To bound the second term, note that when ψβ is the negative Tsallis entropy we have

Bβ(xt||yt+1)

yβ t+1(a) xβ t (a) + β

y1 β t+1 (a) (xt(a) yt+1(a)

(1 β)yβ t+1(a) xβ t (a) + β

x1 β t (a) + 1 β

yβ t+1(a) xβ t (a) + ηxt(a)ℓt(a)

Plugging the following result, which follows from (1 + x)α 1 + αx + α(α 1)x2 x 0, α < 0, into the above yields the desired bound.

yβ t+1(a) = xβ t (a)

yβ 1 t+1 (a)

! β β 1 = xβ t (a) 1 + 1 β

β ηx1 β t (a)ℓt(a) β β 1

xβ t (a) 1 ηx1 β t (a)ℓt(a) + η2

β x2 2β t (a)ℓt(a)2

= xβ t (a) ηxt(a)ℓt(a) + η2

β x2 β t (a)ℓt(a)2

Theorem B.1. In Algorithm 1, let OMDη,β be online mirror descent with the Tsallis entropy regularizer ψβ over γ-offset loss estimators, Θk is a subset of [β, β] [ 1 log d, 1], and

Ut(x, η, β) = Bβ(ˆx(ε) t ||x) η + ηdβm

where ˆx(ε) t = (1 ε)ˆxt + ε1d/d. Note that U (ρ) t (x, η, β) = Ut(x, η, β) + ρ2(d1 β 1)

η(1 β) . Then there exists settings of η, η, α, λ s.t. for all ε, ρ, γ (0, 1) we have w.p. 1 δ that

i=1 ℓt,i(at,i) ℓt,i( at)

(ε + γd)m T + 2 + q

δ + 1 + log(T + 1)

+ min β [β,β],η>0

ε 2 β (1 + log T)

β + Lη(β β)

2k + d min ρ2

for Lη = log d

ε η + ηm log2 d d.

Proof. In this setting we have g(β) = dβ/β, f(β) = 0, D2 β = d1 β 1

d/2, M = d m, F = 0, S = (d/ε)2 β, and K = 1. We have that

i=1 ℓt,i(at,i) ℓt,i( at)

i=1 ˆℓt,i, xt,i ℓt,i( at) + γ

a=1 ˆℓt,i(a)

Bβt(ˆx(ε) t ||xt,1)

i=1 ˆℓt,i, ˆx(ε) t ℓt,i( at) + ηt

a=1 x2 βt t,i (a)ˆℓ2 t,i(a) + γ

a=1 ˆℓt,i(a)

Bβt(ˆx(ε) t ||xt,1)

i=1 ˆℓt,i, ˆx(ε) t ℓt,i, x(ε) t

a=1 x1 βt t,i (a)ˆℓt,i(a) + γ

a=1 ˆℓt,i(a)

where the equality follows similarly to Luo [37] since ˆℓt,i, xt,i = ℓt,i(at,i) γ Pd a=1 ˆℓt,i(a), the first inequality follows by Lemma B.3 and the second by Hölder s inequality and the definitions of

ˆℓt,i and ˆx(ε) t,i . We next apply the optimality of ˆat for Pm i=1 ˆℓt,i to get

i=1 ℓt,i(at,i) ℓt,i( at)

Bβt(ˆx(ε) t ||xt,1)

i=1 ˆℓt,i( at) ℓt,i( at) + ε

a=1 ˆℓt,i(a) ℓt,i(a)

a=1 x1 βt t,i (a)ˆℓt,i(a) + γ

a=1 ˆℓt,i(a)

εm T + 1 + ε

Bβt(ˆx(ε) t ||xt,1)

a=1 x1 βt t,i (a)ℓt,i(a) + γ

a=1 ℓt,i(a)

εm T + 2 + q

δ + γdm T +

Bβt(ˆx(ε) t ||xt,1)

ηt + ηtdβtm

βt (44) where the the second inequality follows by Neu [43, Lemma 1] applied to each of the last four terms and the fifth by the definition of ℓt,i and using maxβ [ 1 log d ,1] η(β) q

d em log d. Substituting into

Theorem A.1 and simplifying yields the result except with ˆV 2 β = 1

T PT t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) in place of ˆHβ, but the former is bounded by the latter by Lemma B.2.

Corollary B.1. Let β = β = 1. Then w.h.p. we can ensure task-averaged regret at most

d m + d 2 3 m 2 3

so long as m T d2 or alternatively ensure

d 3 4 m 3 4 + d m

dm log d + O

so long as m T d.

Proof. Applying Theorem B.1, simplifying, and dividing by T yields task-averaged regret at most

(ε + γd)m + 2 + q

em γT log 5

δ + 1 + log(T + 1)

2ρ2T + min ρ2

+ min η>0 8d(1 + log T)

dm T . Then set ε = 3q

d2 m T and ρ = 1 3

T , and use η = q

ˆ H1 dm + 1 3

dm T to get the first result.

Otherwise, set ε = q

d m T and ρ = 1 4

T , and use the better of η = q

ˆ H1 dm + 1 4

dm T and η = q

dm to get the second.

Corollary B.2. Let β = 1 2 and β = 1 and assume m T d 5 2 . Then w.h.p. we can ensure task-averaged regret at most

ˆHβdβm/β + O

d 5 7 m 5 7

T 2 7 + d m

using k = l 4

Proof. Applying Theorem B.1, simplifying, and dividing by T yields task-averaged regret at most

(ε + γd)m + 2 + q

em γT log 5

δ T + 1 + log(T + 1)

+ min β [β,β],η>0 8d 3 2 (1 + log T)

ε η + ηm log2 d

dm T , ε = d 5 7

(m T ) 2 7 , ρ = 1 4

T , and use η = q

β ˆ Hβ mdβ + 1

(dm T ) 2 7 to get the result.

Corollary B.3. Let β = 1 log d and β = 1 and assume m T d3. Then w.h.p. we can ensure task-averaged regret at most

min β (0,1] 2 q

ˆHβdβm/β + O

d 3 4 m 3 4 + d m

using k = l 4

Proof. Applying Theorem B.1, dividing by T, and simplifying yields

(ε + γd)m + 2 + q

em γT log 5

δ T + 1 + log(T + 1)

+ min β [β,β],η>0 8d2(1 + log T)

ε η + η log2 d

Note that ˆHβ and dβ

β are both decreasing on β < 1 log d, so β in the chosen interval is optimal over all

β (0, 1]. Set γ = 1

dm T , ε = d 3 4 4

m T , ρ = 1 4

T , and use η = q

β ˆ Hβ mdβ + 1 4

dm T to get the result.

C Guaranteed exploration

C.1 Best-arm identification

Lemma C.1. Suppose for ε > 0 we run OMD on task t [T] with initialization xt,1 (ε), regularizer ψβt +I (ε) for some βt (0, 1], and unbiased loss estimators (γ = 0). If Assumption 3.1

holds and m > 28d log d

3ε 2 then ˆxt = xt w.p. 1 dκ, where κ = exp 3ε 2m

Proof. We extend the proof by Abbasi-Yadkori et al. [1, Appendices B and F] to arbitrary lower bounds ε/d on the probability. First, since 0 ˆℓt,i(a) d

εℓt,i(a) we have that

ε 1 ℓt,i(a) ˆℓt,i(a) ℓt,i(a) d

ε 1 ℓt,i(a) d

and so |ˆℓt,i(a) ℓt,i(a)| d ε. Therefore since the variance of the estimated losses is a scaled Bernoulli we have that

Var(ˆℓt,i(a) ℓt,i(a)) = Var(ˆℓt,i(a)) = xt,i(a)(1 xt,i(a)) ℓt,i(a)

2 ℓ2 t,i(a) xt,i(a) d

We can thus apply a martingale concentration inequality of Fan et al. [23, Corollary 2.1] to the martingale difference sequence (MDS) ε

d(ˆℓt,i(a) ℓt,i(a)) [ ε

d, 1] to obtain

i=1 ˆℓt,i(a) ℓt,i(a) m a

i=1 ˆℓt,i(a) ℓt,i(a) εm a

max j [m] ε d

i=j ˆℓt,i(a) ℓt,i(a) εm a

min m(1 + ε/d)2, 4(εm/d + εm a

4(εm/d + εm a

= exp 3εm 2 a 4d(6 + a)

exp 3εm 2 a 28d

where a = 1 m |Pm i=1 ℓt,i(a) mina =a Pm i=1 ℓt,i(a )| is the per-arm loss gap in the last step we apply a 1. For the symmetric MDS ε

d ℓt,i(a) ˆℓt,i(a) 1 we have

i=1 ˆℓt,i(a) ℓt,i(a) m a

i=1 ℓt,i(a) ˆℓt,i(a) m a

exp 3εm 2 a/d 4(6 + ε a/d)

exp 3εm 2 a 28d

We can then conclude that Pr (ˆxt = xt)

i=1 ˆℓt,i(a)

i=1 ℓt,i( at)

i=1 ˆℓt,i( at)

i=1 ℓt,i( at) + m at

i=1 ˆℓt,i(a)

i=1 ℓt,i(a) m a

i=1 ˆℓt,i( at)

i=1 ℓt,i( at) + m at

i=1 ˆℓt,i(a)

i=1 ℓt,i(a) m a

3εm 2 at 28d

a = at exp 3εm 2 a 28d

d exp 3εm 2

(56) where the second-to-last line follows by substituting the bounds (54) and (55) into the left and right terms, respectively.

Lemma C.2. Suppose on each task t [T] we run OMD as in Lemma C.1. Then for any β (0, 1]

T E PT t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) ψβ( x) + 3dκβ

Proof. We consider the expected divergence of the best initialization under the worst-case distribution of best arm estimation, which satisfies Lemma C.1 and (56). We have by Claim A.1 and the mean-as-minimizer property of Bregman divergences that

t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) = E min x (ε) 1 T

t=1 Bβ ˆx(ε) t ||x

min x (ε) E 1

t=1 Bβ ˆx(ε) t ||x

= min x (ε) 1 T

a=1 P(a = ˆat)Bβ e(ε) a ||x

max pt , t [T ] pt(a) 2κ, t [T ],a = at 1 dκ pt(a), t [T ],a= at

min x (ε) 1 T

a=1 pt(a)Bβ e(ε) a ||x

(57) To simplify the last expression, we define p = 1

T PT t=1 pt and again apply the (weighted) mean-asminimizer property, followed by Claim A.1:

min x (ε) 1 T

a=1 pt(a)Bβ e(ε) a ||x = min x (ε)

a=1 p(a)Bβ e(ε) a ||x =

a=1 Bβ e(ε) a || p(ε)

= ψβ(e(ε) 1 ) ψβ( p(ε)) (58) By substituting into the previous inequality, we can bound the expected divergence for the worst-case pt as follows:

t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) ψβ e(ε) 1 + max pt , t [T ] pt(a) 2κ, t [T ],a = at 1 dκ pt(a), t [T ],a= at

ψβ e(ε) 1 + max PT t=1 Pd a=1 pt(a)=T PT t=1 pt(a) (1 dκ) x(a)T, a PT t=1 pt(a) (2κ(1 x(a))T + x(a)T ), a

= ψβ e(ε) 1 min p p(a) (1 dκ) x(a), a p(a) 2κ+(1 2κ) x(a), a

We use the shorthand h(x) = ψβ (1 ε)x + ε

d1d . We have

x(a) (ψβ(x)) = x(a)

b=1 x(b)β 1

b=1 x(b)β + βd1 β(1

b=1 x(b)) 1

= β 1 β x(a)β 1 d1 β

and therefore

h(x) = max a=1,...,d

x(a)ψβ (1 ε)x + ε

β 1 β max a=1,...,d

((1 ε)x(a) + ε/d)β 1 d1 β

Finally, by convexity of h we have

min p p(a) (1 dκ) x(a), a p(a) 2κ+(1 2κ) x(a), a

h( p) h( x) h( x) max p p(a) (1 dκ) x(a), a p(a) 2κ+(1 2κ) x(a), a

h( x) 3dκ h( x)

h( x) 3dκβ logβ

so we can substitute into (59) to get

t=1 ψβ(ˆx(ε) t ) ψβ(ˆ x (ε)) ψβ( x (ε)) + 3dκβ

Applying Lemma B.2 completes the proof.

C.2 Guaranteed exploration bounds

Lemma C.3. Suppose we play OMDβ,η with initialization x1 (ε), regularizer ψβ +I (ε) for some β (0, 1], and unbiased loss estimators (γ = 0) on the sequence of loss functions ℓ1, . . . , ℓT [0, 1]d. Then for any a [d] we have expected regret

t=1 ℓt(at) ℓt( a) EBβ(ˆx(ε)||x1)

β + εm (64)

for ˆx the estimated optimum of the loss estimators ˆℓ1, . . . , ˆℓT .

t=1 ℓt(at) ℓt( a) = E

t=1 ℓt(at) ℓt, x

t=1 ℓt(at) ℓt, x(ε) + εm

t=1 ˆℓt(at) ˆℓt, x(ε) + εm

t=1 ˆℓt(at) ˆℓt, ˆx(ε) + εm

Bβ(ˆx(ε)||x1)

a=1 ˆℓ2 t(a)x2 β t (a)

EBβ(ˆx(ε)||x1)

where the second inequality follows by optimality of ˆx for the estimated losses ˆℓt, the third by Lemma B.3 constrained to (ε), and the fourth similarly to Theorem B.1 (note both are also effectively shown in Luo [37]).

Theorem C.1. In Algorithm 1, let OMDη,β be online mirror descent with the regularizer ψβ + I (ε) over unbiased (γ = 0) loss estimators, Θk is a subset of [β, β] [ 1 log d, 1], and

Ut(x, η, β) = Bβ(ˆx(ε) t ||x) η + ηdβm

where ˆx(ε) t = (1 ε)ˆxt + ε1d/d. Note that U (ρ) t (x, η, β) = Ut(x, η, β) + ρ2(d1 β 1)

η(1 β) . Then under Assumption 3.1 there exists settings of η, η, α, λ s.t. for all ε, ρ (0, 1) we have that

i=1 ℓt,i(at,i) ℓt,i( at)

T + 1 + log(T + 1)

+ min β [β,β],η>0

ε 2 β (1 + log T)

β + Lη(β β)

2k + d min ρ2

for Lη = log d

ε η + ηm log2 d d and hβ( ) = (Hβ + 56 dm)ι + d1 β 1

1 β (1 ι ) for ι = 1m 75d

ε 2 log d ε 2 .

Proof. By Lemma C.3 we have

i=1 ℓt,i(at,i) ℓt,i( at) εm T + E

Bβt(ˆx(ε) t ||xt,1)

ηt + ηtdβtm

Since we have the same environment-dependent quantities as in Theorem B.1, we can substitute the above bound into Theorem A.1 and then apply the Lemma C.2 bound

E ˆV 2 β Hβ + 3dκβ

ε exp 3ε 2m

= Hβ + 3ε 2

d2 exp 4 log d ε 2 3ε 2m

Hβ + 3ε 2/d2

28d 4 log d ε 2

where the last line follows by assuming m 75d

ε 2 log d ε 2 . If this condition does not hold, then we apply the default bound of E ˆV 2 β = 1

T PT t=1 ψβ(ˆxt) ψβ(ˆ x) d1 β 1

Corollary C.1. Let β = β = 1. Then for known and assuming m 75d

2 log d 2 we can ensure expected task-averaged regret at most

H1dm + 56 + 75d

d 3 2 m 3 4

where W is the Lambert W-function, while for unknown we can ensure expected task-averaged regret at most

H1dm + 56 + 3

50dm log d log d2m2

150 6 log d + O

d 3 2 m 3 4

T + d 4 3 m 5 3 T

so long as m2 150d log d.

Proof. Applying Theorem C.1 and simplifying yields

εm + 8d m(1 + log(T + 1))

16ρ2T + min η>0 8d(1 + log T)

εηT + h1( )

η + ηdm + dρ2

Then substitute η = q

dm and set ρ = 4q

1 d T m and ε = 75d 2m W( m

75) (for known ) or

m2 (otherwise).

Corollary C.2. Let β = 1

2 and β = 1. Then for known and assuming m 75d

2 log d 2 we can ensure task-averaged regret at most

(Hβm + 56/d)dβ/β + 75d

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d 3m 5 2 T

using k = 3

d2m T , while for unknown we can ensure expected task-averaged regret at most

(Hβm + 56/d)dβ/β + 3

50d2m log dm2

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d 3 2 m2

(74) so long as m 5d

Proof. Applying Theorem C.1 and simplifying yields

T + 1 + log(T + 1)

+ min β [β,β],η>0 8d 3 2 (1 + log T)

ε 3 2 ηT + hβ( )

ε η + ηm log2 d

Then substitute η = q

hβ( ) dβm/β and set ρ = 3q

m T and ε = 75d 2m W( m

75) (for known ) or

m2 (otherwise).

Corollary C.3. Let β = 1 log d and β = 1. Then for known and assuming m 75d

2 log d 2 we can ensure task-averaged regret at most

min β (0,1] 2 q

(Hβm + 56/d)dβ/β + 75d

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d 4m3

using k = 3

d2m T , while for unknown we can ensure expected task-averaged regret at most

min β (0,1] 2 q

(Hβm + 56/d)dβ/β + 3

50d2m log dm2

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d 5 3 m 7 3 T

(77) so long as m 5d

Proof. Applying Theorem C.1 and simplifying yields

T + 1 + log(T + 1)

+ min β [β,β],η>0 8d2(1 + log T)

ε2ηT + hβ( )

ε η + ηm log2 d

Then substitute η = q

hβ( ) dβm/β and set ρ = 3q

m T and ε = 75d 2m W( m

75) (for known ) or

m2 (otherwise).

Corollary C.4. Let β = 1 log d and β = 1. Then for unknown and assuming m max{d 3 4 , 56} we can ensure task-averaged regret at most

min β (0,1] min

β + 21d 3 4 3 m

d 4 3 m 2 3

T + d 5 3 m 5 6

T 2 3 + d2m 7 3 T

(79) using k = 3

Proof. Applying Theorem C.1 and simplifying yields

T + 1 + log(T + 1)

+ min β [β,β],η>0 8d2(1 + log T)

ε2ηT + hβ( )

ε η + ηm log2 d

Then substitute η = q

hβ( ) dβm/β and set ρ = 3q

m T and ε =

D Robustness to outliers

Proposition D.1. Suppose there exists a constant p [0, 1] and a subset S [T] of size s such that at S for all but O(T p) MAB tasks t [T]. Then if β [ 1 log d, 1

2] we have Hβ = O(s + d1 β

T β(1 p) ).

Proof. Define the vector e S [0, 1]d s.t. e S[a] = 1a S. Then by Claim A.1 and the mean-asminimizer property of Bregman divergences we have

Hβ = ψβ( x)

t=1 ψβ( xt) ψβ( x)

t=1 Bβ( xt|| x)

= min x d 1 T

t=1 Bβ( xt|| x)

min δ (0,1) 1 T

= min δ (0,1) 1 T

β xβ t[a] + β( xt[a] 1 δ

= min δ (0,1) 1 T

β xβ t[a] 1 β + β xβ t[a] (1 β)( 1 δ

min δ (0,1) s1 β + δβd1 β + β (1 β)T

1a= at (1 β)( 1 δ

min δ (0,1) s1 β

1 β + δβd1 β + O

= O s + d1 β

where the last line follows by considering δ = 1/T 1 p.

E Online learning with self-concordant barrier regularizers

E.1 General results

Lemma E.1. Let K Rd be a convex set and ψ : K 7 Rd be a self-concordant barrier. Suppose ℓ1, . . . , ℓT are a sequence of loss functions satisfying | ℓt, x | 1 x K. Then if we run OMD with step-size η > 0 as in Abernethy et al. [2, Alg. 1] on the sequence of estimators ˆℓt our estimated regret w.r.t. any x Kε for ε > 0 will satisfy

t=1 ˆℓt, xt x B(x||x1)

η + 32d2ηT (82)

Proof. The result follows from Abernethy et al. [2] by stopping the derivation on the second inequality below Equation 10.

Definition E.1. For any convex set K and any point y K, πy(x) = inf t 0,y+ x y

t K t is the

Minkowski function with pole y.

Lemma E.2. For any x K Rd and ψ : K 7 R a ν-self-concordant regularizer with minimum x1 K , the quantity ψ(cε(x)) is ν

2-Lipschitz w.r.t. ε [0, 1].

Proof. Consider any ε, ε [0, 1] s.t. ε ε (0, 1

2] Note that for t = ε ε

1+ε we have

cε (x) + cε (x) cε(x)

t = x1 + x x1

1 + ε + x1 + x x1

1+ε x1 x x1

1+ε t = x K (83)

so πcε (x)(cε(x)) ε ε

1+ε ε ε. Therefore by Nesterov and Nemirovskii [42, Prop. 2.3.2] we have

ψ(cε(x)) ψ(cε (x)) ν log 1 1 πcε (x)(cε(x))

ν log 1 1 + ε ε

where for the last inequality we used log(1 x) x

2 for x [0, 1

2]. The case of ε ε (0, 1] follows by considering ε = ε +ε

2 and applying the above twice.

Theorem E.1. In Algorithm 1, let OMDη,ε be online mirror descent over loss estimators specified in Abernethy et al. [2] with a ν-self-concordant barrier regularizer ψ : K 7 R that satisfies ν 1 and 2ψ(x1) 2 = S1 1. Let Θk be a subset of [ 1

Ut(x, η, ε) = B(cε(ˆx)||x)

η + 32ηd2 + εm (85)

Note that U (ρ) t (x, η, ε) = Ut(x, η, ε) + 9ν 3 2 ρ2Km S1

η . Then there exists settings of η, η, α, λ s.t. for all ε, ρ (0, 1) we have expected task averaged regret at most

E min ε [ 1

m ,1],η>0 512ν2K2S1m2(1 + log T)

ˆV 2 ε η + 32ηd2m + εm + ν

+ 3ν 3 4 m min

( 3ρ2ν 3 4 K S1

T log k + 1 + log(T + 1)

Proof. Let ε = 1 m. For any ε [ε, 1] and x K we have πx1(cε(x)) 1 1+ε, so by Nesterov and Nemirovskii [42, Prop. 2.3.2] we have

2ψ(cε(x)) 2 1 + 3ν 1 πx1(cε(x))

2 2ψ(x1) 2 64ν2S1

Thus S = maxx,y K,ε [ε,1] 2ψ(cε(x)) 2 = 64ν2S1

ε2 and also

D2 ε = max x,y K B(cε(x)||cε(y))

= max x,y K ψ(cε(x)) ψ(cε(y)) ψ(cε(y)), x y

max x,y K ν log 1 1 πx1(cε(x))

ν 2ψ(cε(y)) 2 x y 2

ε + 8ν 3 2 K S1

9ν 3 2 K S1

where the first inequality follows by Nesterov and Nemirovskii [42, Prop. 2.3.2] and the definition of a self-concordant barrier [2, Def. 5]. In addition, we have g(ε) = 32d2, f(ε) = ε, M = 12d p

4 ν3S1, and F = 1. We have

i=1 ℓt,i, xt,i xt E

i=1 ℓt,i, xt,i cεt( xt)

i=1 ˆℓt,i, xt,i cεt( xt)

i=1 ˆℓt,i, xt,i cεt(ˆxt)

EB(cεt(ˆxt||xt,1)

ηt + (32ηtd2 + εt)m

where the first inequality follows by Abernethy et al. [2, Lem. 8], the second by Abernethy et al. [2, Lem. 3], the third by optimality of ˆxt, and the fourth by Lemma E.1. Substituting into Theorem A.1 and simplifying yields the result.

E.2 Specialization to the unit sphere

Corollary E.1. Let K be the unit sphere with the self-concordant barrier ψ(x) = log(1 x 2 2). Then Algorithm 1 attains expected task-averaged regret bounded by

+ min ε [ 1

2m log 1 + 1 E ˆ x 2 2 2ε + ε2

using k = l

Proof. Using the fact the ν = 1 and K = S1 = 2, we apply Theorem E.1 and simplify to obtain

E min ε [ 1

ˆV 2 ε η + 32ηd2m + εm + O m2

k + m min ρ2

η , dρ + dm

(91) Then substitute η = ˆVε 4

2dm + m d 4

T , set ρ = 1 4

T , and note that

v u u u tlog

1 cε(ˆ x) 2 2

Tq QT t=1 1 cε(ˆxt) 2 2

log 1 (1 + ε) 2 ˆ x 2 2 1 (1 + ε) 2

log 1 + 1 E ˆ x 2 2 2ε + ε2

where we use the fact that ˆxt 2 = 1 and the inequality is Jensen s.

E.3 Specialization to polytopes, specifically the bandit online shortest-path problem

Corollary E.2. Let K = {x [0, 1]|E| : a, x b (a, b) C} be the set of flows from u to v on a graph G(V, E), where C R|E| R is a set of O(|E|) linear constraints. Suppose we see T instances of the bandit online shortest path problem with m timesteps each. Then sampling from probability distributions over paths from u to v returned by running Algorithm 1 with regularizer ψ(x) = P a,b C log(b a, x ) attains the following expected average regret across instances

T 3 4 + |E| 5 2 m 5 6

+ min ε [ 1

m ,1] 4|E|E

v u u u t2m X

1 T PT t=1 b a, cε(ˆxt)

Tq QT t=1 b a, cε(ˆxt)

(93) using k = l

Proof. Using the fact that d = |E|, ν = O(|E|), K = p

|E|, and S1 P

a,b C aa T 2 ( a,1|E|/|E| b)2 =

O(|E|3), we apply Theorem E.1 and simplify to obtain

E min ε [ 1

ˆV 2 ε η + 32η|E|2m + εm

( ρ2|E| 7 2 η , ρ|E| 11

Then substitute η = ˆVε 4

2dm + |E|2 m

T , set ρ = 4q

T 6 m, and note that

b a, cε(ˆ x)

Tq QT t=1 b a, cε(ˆxt)

1 T PT t=1 b a, cε(ˆxt)

Tq QT t=1 b a, cε(ˆxt)