# Meta-Learning for Simple Regret Minimization

Javad Azizi¹, Branislav Kveton², Mohammad Ghavamzadeh³, Sumeet Katariya²
¹University of Southern California, ²Amazon, ³Google Research
azizim@usc.edu, bkveton@amazon.com, ghavamza@google.com, katsumee@amazon.com

We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters, and its meta simple regret over $m$ bandit tasks with horizon $n$ is a mere $O(m/\sqrt{n})$. On the other hand, the meta simple regret of the frequentist algorithm is $O(\sqrt{m}\,n + m/\sqrt{n})$. While its regret is worse, the frequentist algorithm is more general because it does not need a prior distribution over the meta-parameters and can be analyzed in more settings. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general, and we complement our theory by evaluating them empirically in several environments.

## 1 Introduction

We study the problem of simple regret minimization (SRM) in a fixed-horizon (budget) setting (Audibert and Bubeck 2010; Kaufmann, Cappé, and Garivier 2016). The learning agent interacts sequentially with $m$ such tasks, where each task has a horizon of $n$ rounds. The tasks are sampled i.i.d. from a prior distribution $P_*$, which makes them similar. We study a meta-learning (Thrun 1996, 1998; Baxter 1998, 2000) variant of the problem, where the prior distribution $P_*$ is unknown, and the learning agent aims to learn it to reduce its regret on future tasks.

This problem is motivated by practical applications, such as online advertising, recommender systems, hyper-parameter tuning, and drug repurposing (Hoffman, Shahriari, and Freitas 2014; Mason et al. 2020; Réda, Kaufmann, and Delahaye-Duriez 2021; Alieva, Cutkosky, and Das 2021), where bandit models are popular due to their simplicity and efficient algorithms. These applications include a test phase separated from the commercialization phase, and one aims at minimizing the regret of the commercialized product (simple regret) rather than the cumulative regret in the test phase (Audibert and Bubeck 2010). In all of them, the exploration phase is limited by a fixed horizon: the budget for estimating click rates on ads is limited, and a hyper-parameter tuning task has only a limited amount of resources (Alieva, Cutkosky, and Das 2021).

Meta-learning can result in more efficient exploration when the learning agent solves similar tasks over time. To understand the benefits of meta-learning, consider the following example. Repeated A/B tests are conducted on a website to improve customer engagement. Suppose that the designers always propose a variety of website designs to test. However, dark designs tend to perform better than light ones, and thus a lot of customer traffic is repeatedly wasted to rediscover the same pattern. One solution to reducing this waste is for the designers to stop proposing light designs. However, these designs are sometimes better.
A more principled solution is to automatically adapt the prior $P_*$ used in the A/B tests, so that dark designs are promoted unless the evidence suggests otherwise. This is the key idea of the solution proposed in this work.

We make the following contributions. First, we propose a general meta-learning framework for fixed-horizon SRM in Section 2. While several recent papers studied this problem in the cumulative regret setting (Bastani, Simchi-Levi, and Zhu 2019; Cella, Lazaric, and Pontil 2020; Kveton et al. 2021; Basu et al. 2021; Simchowitz et al. 2021), this work is the first application of meta-learning to SRM. We develop general Bayesian and frequentist algorithms for this problem in Sections 3 and 4. Second, we show that our Bayesian algorithm, which has access to a prior over the meta-parameters of $P_*$, has meta simple regret $O(m/\sqrt{n})$ over $m$ bandit tasks with horizon $n$. Our frequentist algorithm is more general because it does not need a prior distribution over the meta-parameters. However, we show that its meta simple regret is $O(\sqrt{m}\,n + m/\sqrt{n})$, and thus worse than that of the Bayesian algorithm. In Section 4.2, we present a lower bound showing that this is unimprovable in general. Third, we instantiate both algorithms in multi-armed and linear bandits in Section 5. These instances highlight the trade-off between the Bayesian and frequentist approaches: provably lower regret versus more generality. Finally, we complement our theory with experiments (Section 7), which show the benefits of meta-learning and confirm that the Bayesian approaches are superior whenever they are implementable.

Some of our contributions are of independent interest. For instance, our analysis of the meta SRM algorithms is based on a general reduction from cumulative regret minimization in Section 3.1, which yields novel and easily implementable algorithms for Bayesian and frequentist SRM based on Thompson sampling (TS) and upper confidence bounds (UCBs) (Lu and Van Roy 2019). To the best of our knowledge, only Komiyama et al. (2021) studied Bayesian SRM before (Section 6). In Section 5.2, we also extend the analysis of frequentist meta-learning in Simchowitz et al. (2021) to structured bandit problems.

## 2 Problem Setup

In meta SRM, we consider $m$ bandit problems with arm set $\mathcal{A}$ that appear sequentially, each played for $n$ rounds. At the beginning of each task (bandit problem) $s \in [m]$, the mean rewards of its arms $\mu_s \in \mathbb{R}^{\mathcal{A}}$ are sampled i.i.d. from a prior distribution $P_*$. We define $[m] = \{1, 2, \dots, m\}$ for any integer $m$. We apply a base SRM algorithm, $\mathtt{alg}$, to task $s$ and denote this instance by $\mathtt{alg}_s$. The algorithm interacts with task $s$ for $n$ rounds. In round $t \in [n]$ of task $s$, $\mathtt{alg}_s$ pulls an arm $A_{s,t} \in \mathcal{A}$ and observes its reward $Y_{s,t}(A_{s,t})$, where $\mathbb{E}[Y_{s,t}(a)] = \mu_s(a)$. We assume that $Y_{s,t}(a) \sim \nu(a; \mu_s)$, where $\nu(\cdot; \mu_s)$ is the reward distribution of the arms with parameter (mean) $\mu_s$. After the $n$ rounds, the algorithm returns arm $\hat{A}_{\mathtt{alg}_s}$, or simply $\hat{A}_s$, as the best arm. Let $A_{s,*} = \arg\max_{a \in \mathcal{A}} \mu_s(a)$ be the best arm in task $s$. We define the per-task simple regret for task $s$ as

$$\mathrm{SR}_s(n, P_*) = \mathbb{E}_{\mu_s \sim P_*}\big[\mathbb{E}_{\mu_s}[\Delta_s]\big], \qquad (1)$$

where $\Delta_s = \mu_s(A_{s,*}) - \mu_s(\hat{A}_s)$. The outer expectation is w.r.t. the randomness of the task instance, and the inner one is w.r.t. the randomness of the rewards and the algorithm. This is the common frequentist simple regret averaged over instances drawn from $P_*$. In the frequentist setting, we assume that $P_*$ is unknown but fixed, and define the frequentist meta simple regret as

$$\mathrm{SR}(m, n, P_*) = \sum_{s=1}^{m} \mathrm{SR}_s(n, P_*). \qquad (2)$$
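For concreteness, here is a small Monte-Carlo sketch of how the quantities in Eqs. (1) and (2) can be estimated in simulation. The Gaussian prior and the uniform-pulling base algorithm below are illustrative placeholders, not the algorithms proposed in this paper.

```python
# Monte-Carlo sketch of the per-task simple regret (Eq. 1) and the frequentist
# meta simple regret (Eq. 2). All names and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, n, m, n_runs = 5, 100, 20, 50                     # arms, horizon, tasks, simulation runs
theta_star, sigma0, sigma = np.zeros(K), 0.1, 1.0    # P_* = N(theta_*, sigma0^2 I), reward noise

def uniform_base_alg(mu_s):
    """Placeholder base algorithm: pull arms uniformly at random for n rounds,
    then return the arm with the highest empirical mean."""
    arms = rng.integers(K, size=n)
    rewards = mu_s[arms] + sigma * rng.standard_normal(n)
    means = [rewards[arms == a].mean() if np.any(arms == a) else -np.inf for a in range(K)]
    return int(np.argmax(means))

total = 0.0
for _ in range(n_runs):
    for s in range(m):
        mu_s = theta_star + sigma0 * rng.standard_normal(K)   # mu_s ~ P_*
        a_hat = uniform_base_alg(mu_s)
        total += mu_s.max() - mu_s[a_hat]                     # Delta_s from Eq. (1)
print("Monte-Carlo estimate of SR(m, n, P_*):", total / n_runs)   # Eq. (2)
```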
In the Bayesian setting, we still assume that $P_*$ is unknown. However, we know that it is sampled from a known meta-prior $Q$. We define the Bayesian meta simple regret as

$$\mathrm{BSR}(m, n) = \mathbb{E}_{P_* \sim Q}[\mathrm{SR}(m, n, P_*)]. \qquad (3)$$

## 3 Bayesian Meta-SRM

In this section, we present our Bayesian meta SRM algorithm (B-meta SRM), whose pseudo-code is in Algorithm 1. The key idea is to deploy $\mathtt{alg}$ for each task with an adaptively refined prior learned from past interactions, which we call an uncertainty-adjusted prior, $P_s(\mu)$. This is an approximation to $P_*$: it is the posterior density of $\mu_s$ given the history up to task $s$. At the beginning of task $s$, B-meta SRM instantiates $\mathtt{alg}$ with $P_s$, denoted by $\mathtt{alg}_s = \mathtt{alg}(P_s)$, and uses it to solve task $s$. The base algorithm $\mathtt{alg}$ is Thompson sampling (TS) or Bayesian UCB (Bayes UCB) (Lu and Van Roy 2019). During its execution, $\mathtt{alg}_s$ keeps updating its posterior over $\mu_s$ as $P_{s,t}(\mu_s) \propto L_{s,t}(\mu_s) P_s(\mu_s)$, where $L_{s,t}(\mu_s) = \prod_{\ell=1}^{t} \mathbb{P}(Y_{s,\ell} \mid A_{s,\ell}, \mu_s)$ is the likelihood of the observations in task $s$ up to round $t$ under task parameter $\mu_s$. TS pulls the arms proportionally to the probability of being the best w.r.t. the posterior. More precisely, it samples $\tilde{\mu}_{s,t} \sim P_{s,t}$ and then pulls arm $A_{s,t} \in \arg\max_{a \in \mathcal{A}} \tilde{\mu}_{s,t}(a)$. Bayes UCB is the same, except that it pulls the arm with the largest Bayesian upper confidence bound (see Appendix C and Eq. (12) for details).

The critical step is how $P_s$ is updated. Let $\theta_*$ be the parameter of $P_*$. At task $s$, B-meta SRM maintains a posterior density over the parameter $\theta_*$, called the meta-posterior $Q_s(\theta)$, and uses it to compute $P_s(\mu)$. We use the following recursive rule from Proposition 1 of Basu et al. (2021) to update $Q_s$ and $P_s$.

**Proposition 1.** Let $L_{s-1}(\cdot) = L_{s-1,n}(\cdot)$ be the likelihood of the observations right before the start of task $s$, and let $P_\theta$ be the prior distribution parameterized by $\theta$. Then $\mathtt{alg}$ computes $Q_s$ and $P_s$ as

$$Q_s(\theta) \propto \int_{\mu} L_{s-1}(\mu) P_\theta(\mu)\, \mathrm{d}\kappa_2(\mu)\; Q_{s-1}(\theta), \quad \forall\theta, \qquad (4)$$

$$P_s(\mu) = \int_{\theta} P_\theta(\mu) Q_s(\theta)\, \mathrm{d}\kappa_1(\theta), \quad \forall\mu, \qquad (5)$$

where $\kappa_1$ and $\kappa_2$ are the probability measures of $\theta$ and $\mu$, respectively.

We initialize Eq. (4) with $L_0 = 1$ and $Q_0 = Q$, where $Q$ is the meta-prior. Note that this update rule is computationally efficient for a Gaussian prior with a Gaussian meta-prior, but not for many other distributions. This computational issue can limit the applicability of our Bayesian algorithm.

When task $s$ ends, $\mathtt{alg}_s$ returns the best arm $\hat{A}_{\mathtt{alg}_s}$ by sampling from the distribution

$$\hat{A}_{\mathtt{alg}_s} \sim \rho_s, \qquad \rho_s(a) := \frac{N_{a,s}}{n}, \qquad (6)$$

where $N_{a,s} := |\{t \in [n] : A_{s,t} = a\}|$ is the number of rounds in which arm $a$ is pulled. That is, the algorithm chooses an arm with probability proportional to its number of pulls. This decision rule facilitates the analysis of our algorithms based on a reduction from cumulative to simple regret. We develop this reduction in Section 3.1 and show that the per-task simple regret is essentially the cumulative regret divided by $n$. This yields novel algorithms for Bayesian and frequentist SRM with guarantees.

### 3.1 Cumulative to Simple Regret Reduction

Fix task $s$ and consider an algorithm that pulls a sequence of arms $(A_{s,t})_{t \in [n]}$. Let its per-task cumulative regret with prior $P$ be $R_s(n, P) := \mathbb{E}_{\mu_s \sim P}\,\mathbb{E}_{\mu_s}\big[n\mu_s(A_{s,*}) - \sum_{t=1}^{n} \mu_s(A_{s,t})\big]$, where the inner expectation is taken over the randomness in the rewards and the algorithm. Now suppose that at the end of the task, we choose arm $a$ with probability $\rho_s(a)$ and declare it to be the best arm $\hat{A}_s$. Then the per-task simple regret of this procedure is bounded as follows.

**Proposition 2 (Cumulative to Simple Regret).** For task $s$ with $n$ rounds, if we return an arm with probability proportional to its number of pulls as the best arm, the per-task simple regret with prior $P$ is $\mathrm{SR}_s(n, P) = R_s(n, P)/n$.
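As a small numerical illustration of the decision rule in Eq. (6) and of Proposition 2, consider the following sketch; the pull counts are arbitrary and nothing here depends on the specific base algorithm.

```python
# Minimal check of Eq. (6) and Proposition 2: recommending an arm with
# probability proportional to its pull counts makes the expected simple regret
# equal to the cumulative regret divided by n.
import numpy as np

def recommend(pull_counts, rng):
    """Sample the recommended arm A_hat ~ rho_s, where rho_s(a) = N_{a,s} / n."""
    rho = np.asarray(pull_counts, dtype=float)
    rho /= rho.sum()                      # rho.sum() equals the horizon n
    return int(rng.choice(len(rho), p=rho))

rng = np.random.default_rng(1)
mu_s = np.array([0.2, 0.5, 0.9])          # mean rewards of a fixed task
counts = np.array([10, 30, 60])           # N_{a,s} after n = 100 rounds of some policy
a_hat = recommend(counts, rng)

# Proposition 2: E_rho[mu_s(A*) - mu_s(A_hat)] = sum_a (N_{a,s}/n)(mu_s(A*) - mu_s(a))
#              = (cumulative regret of the pulls) / n.
cumulative_regret = (mu_s.max() - mu_s) @ counts
expected_simple_regret = (mu_s.max() - mu_s) @ (counts / counts.sum())
assert np.isclose(expected_simple_regret, cumulative_regret / counts.sum())
```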
**Algorithm 1: Bayesian Meta-SRM (B-meta SRM)**

  Input: meta-prior $Q$, base algorithm $\mathtt{alg}$
  Initialize: meta-posterior $Q_0 \leftarrow Q$
  for $s = 1, \dots, m$ do
    Receive the current task $s$, $\mu_s \sim P_*$
    Compute the meta-posterior $Q_s$ using Eq. (4)
    Compute the uncertainty-adjusted prior $P_s$ using Eq. (5)
    Instantiate $\mathtt{alg}$ for task $s$, $\mathtt{alg}_s \leftarrow \mathtt{alg}(P_s)$
    Run $\mathtt{alg}_s$ for $n$ rounds
    Return the best arm $\hat{A}_{\mathtt{alg}_s} \sim \rho_s$ using Eq. (6)
  end for

We prove this proposition in Appendix B using the linearity of expectation and the properties of $\rho_s$. Note that Proposition 2 applies to both the frequentist and Bayesian meta simple regret. This is because the former is a summation of $\mathrm{SR}_s$ over tasks, and the latter is obtained by taking an expectation of the former over $P_*$.

### 3.2 Bayesian Regret Analysis

Our analysis of B-meta SRM is based on results in Basu et al. (2021) and Lu and Van Roy (2019), combined with Section 3.1. Specifically, let $\Gamma_{s,t}$ be an information-theoretic constant, independent of $m$ and $n$, that bounds the instantaneous regret of the algorithm at round $t$ of task $s$. We defer its precise definition to Appendix C, as it is only used in the proofs. The following generic bound on the Bayesian meta simple regret of B-meta SRM holds.

**Theorem 3 (Information-Theoretic Bayesian Bound).** Let $\{\Gamma_s\}_{s \in [m]}$ and $\Gamma$ be non-negative constants such that $\Gamma_{s,t} \le \Gamma_s \le \Gamma$ holds for all $s \in [m]$ and $t \in [n]$ almost surely. Then the Bayesian meta simple regret (Eq. 3) of B-meta SRM satisfies

$$\mathrm{BSR}(m, n) \le \Gamma\sqrt{\frac{m}{n}\, I(\theta_*; \tau_{1:m})} + \sum_{s=1}^{m} \Gamma_s \sqrt{\frac{1}{n}\, I(\mu_s; \tau_s \mid \theta_*, \tau_{1:s-1})}, \qquad (7)$$

where $\tau_{1:s} = \bigcup_{\ell=1}^{s}(A_{\ell,1}, Y_{\ell,1}, \dots, A_{\ell,n}, Y_{\ell,n})$ is the trajectory up to task $s$, $\tau_s$ is defined similarly for the history of task $s$ only, and $I(\cdot\,;\cdot)$ and $I(\cdot\,;\cdot \mid \cdot)$ denote mutual information and conditional mutual information, respectively.

The proof is in Appendix C. It builds on the analysis in Basu et al. (2021) and uses our reduction in Section 3.1, which readily applies to the Bayesian meta simple regret by linearity of expectation. The first term in Eq. (7) is the price for learning the prior parameter $\theta_*$ and the second one is the price for learning the mean rewards of the tasks $(\mu_s)_{s \in [m]}$ given a known $\theta_*$. It has been shown in many settings that the mutual information terms grow slowly with $m$ and $n$ (Lu and Van Roy 2019; Basu et al. 2021), and thus the first term is $O(\sqrt{m/n})$ and negligible. The second term is $O(m/\sqrt{n})$, since we solve $m$ independent problems, each with $O(1/\sqrt{n})$ simple regret. In Section 5.2, we discuss a bandit environment where $\Gamma_{s,t}$ and $\beta_{s,t}$ are such that the last term of the bound is comparable to the rest. This holds in several other environments discussed in Lu and Van Roy (2019), Basu et al. (2021), and Liu et al. (2022).

## 4 Frequentist Meta-SRM

In this section, we present our frequentist meta SRM algorithm (f-meta SRM), whose pseudo-code is in Algorithm 2.

**Algorithm 2: Frequentist Meta-SRM (f-meta SRM)**

  Input: exploration strategy $\mathtt{explore}$, base algorithm $\mathtt{alg}$
  Initialize: $\tau_1 \leftarrow \emptyset$
  for $s = 1, \dots, m$ do
    Receive the current task $s$, $\mu_s \sim P_*$
    Explore the arms using $\mathtt{explore}$
    Append the explored arms and their observations to $\tau_s$
    Compute $\hat{\theta}_s$ from $\tau_s$ as an estimate of $\theta_*$
    Instantiate $\mathtt{alg}$ for task $s$, $\mathtt{alg}_s \leftarrow \mathtt{alg}(\hat{\theta}_s)$
    Run $\mathtt{alg}_s$ for the rest of the $n$ rounds
    Return the best arm $\hat{A}_{\mathtt{alg}_s} \sim \rho_s$ using Eq. (6)
    $\tau_{s+1} \leftarrow \tau_s$
  end for

Similarly to B-meta SRM, f-meta SRM uses TS or UCB as its base algorithm $\mathtt{alg}$. However, it directly estimates the prior parameter, instead of maintaining a meta-posterior. At the beginning of task $s \in [m]$, f-meta SRM explores the arms for a number of rounds using an exploration strategy denoted by $\mathtt{explore}$.
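A schematic Python sketch of the Algorithm 2 loop follows. The `explore`, `estimate_prior`, and `base_alg` callables and the `task.pull(arm)` interface are illustrative placeholders rather than the paper's implementation; only the control flow mirrors the pseudo-code above.

```python
# Skeleton of the frequentist meta-learning loop (Algorithm 2), with
# problem-specific components left abstract. Assumes n exceeds the number of
# exploration rounds in each task.
import numpy as np

def f_meta_srm(tasks, n, explore, estimate_prior, base_alg, rng):
    """Run the meta-learning loop over a sequence of task environments and
    return the recommended arm for each task."""
    history = []                # tau_s: exploration data pooled across tasks
    recommendations = []
    for task in tasks:
        explore_data, t_used = explore(task)       # forced exploration rounds
        history.extend(explore_data)
        theta_hat = estimate_prior(history)        # frequentist estimate of theta_*
        alg = base_alg(theta_hat)                  # alg_s = alg(theta_hat_s)
        counts = np.zeros(alg.num_arms)
        for _ in range(n - t_used):                # remaining rounds of the task
            a = alg.act()
            counts[a] += 1
            alg.update(a, task.pull(a))
        probs = counts / counts.sum()              # rho_s over the base algorithm's pulls
        recommendations.append(int(rng.choice(len(counts), p=probs)))   # Eq. (6)
    return recommendations
```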
This strategy depends on the problem class, and we specify it for two classes in Section 5. f-meta SRM uses the samples collected in the exploration phases of all tasks up to task $s$, $\tau_s$, to update its estimate of the prior parameter, $\hat{\theta}_s$. Then it instantiates the base algorithm with this estimate, denoted by $\mathtt{alg}_s = \mathtt{alg}(\hat{\theta}_s)$, and uses $\mathtt{alg}_s$ for the rest of the rounds of task $s$. Here $\mathtt{alg}(\theta) := \mathtt{alg}(P_\theta)$ is the base algorithm $\mathtt{alg}$ instantiated with prior parameter $\theta$ (note that this is a slightly different parameterization of $\mathtt{alg}$ than in Section 3). When task $s$ ends, $\mathtt{alg}_s$ returns the best arm $\hat{A}_{\mathtt{alg}_s}$ by sampling from the probability distribution $\rho_s$ defined in Eq. (6).

While B-meta SRM uses a Bayesian posterior to maintain its estimate of $\theta_*$, f-meta SRM relies on a frequentist approach. Therefore, it applies to settings where computing the posterior is not computationally feasible. Moreover, we can analyze f-meta SRM in general settings beyond Gaussian bandits.

### 4.1 Frequentist Regret Analysis

In this section, we prove an upper bound on the frequentist meta simple regret (Eq. 2) of f-meta SRM with a TS base algorithm $\mathtt{alg}$. To start, we bound the per-task simple regret of $\mathtt{alg}$ relative to an oracle that knows $\theta_*$. More precisely, this is the difference between the mean rewards of the arms returned by $\mathtt{alg}$ instantiated with some prior parameter $\theta$ and with the true prior parameter $\theta_*$. The total variation (TV) distance between two distributions $P$ and $P'$ over the same probability space $(\Omega, \mathcal{F})$, where $\Omega$ is the sample space and $\mathcal{F}$ is the sigma-algebra, is defined as $\mathrm{TV}(P \,\|\, P') := \sup_{E \in \mathcal{F}} |P(E) - P'(E)|$. We use the TV distance to measure the distance between the estimated and true priors. We fix task $s$ and drop the sub-indexing by $s$. In the following, we bound the per-task simple regret of $\mathtt{alg}(\theta)$ relative to the oracle $\mathtt{alg}(\theta_*)$.

**Theorem 4.** Suppose $P_{\theta_*}$ is the true prior of the tasks and satisfies $P_{\theta_*}(\mathrm{diam}(\mu) \le B) = 1$, where $\mathrm{diam}(\mu) := \sup_{a \in \mathcal{A}} \mu(a) - \inf_{a \in \mathcal{A}} \mu(a)$. Let $\theta$ be a prior parameter such that $\mathrm{TV}(P_{\theta_*} \,\|\, P_\theta) = \epsilon$. Also, let $\hat{A}_{\mathtt{alg}(\theta_*)}$ and $\hat{A}_{\mathtt{alg}(\theta)}$ be the arms returned by $\mathtt{alg}(\theta_*)$ and $\mathtt{alg}(\theta)$, respectively. Then

$$\mathbb{E}_{\mu \sim P_{\theta_*}}\,\mathbb{E}\big[\mu(\hat{A}_{\mathtt{alg}(\theta_*)}) - \mu(\hat{A}_{\mathtt{alg}(\theta)})\big] \le 2n\epsilon B. \qquad (8)$$

Moreover, if the prior is coordinate-wise $\sigma_0^2$-sub-Gaussian (Definition 14 in Appendix E), then the RHS of Eq. (8) can be replaced by $2n\epsilon\Big(\mathrm{diam}(\mathbb{E}_{\theta_*}[\mu]) + \sigma_0\big(8 + 5\sqrt{\log\tfrac{|\mathcal{A}|}{\min(1, 2n\epsilon)}}\big)\Big)$, where $\mathbb{E}_{\theta_*}[\mu]$ is the expectation of the mean rewards of the arms, $\mu$, under the true prior $\theta_*$.

The proof, in Appendix E, uses the fact that TS is a 1-Monte-Carlo algorithm, as defined by Simchowitz et al. (2021). It builds on the cumulative regret analysis of Simchowitz et al. (2021) and extends it to simple regret. We again use our reduction from Section 3.1, which shows how it can be applied in a frequentist setting. Theorem 4 shows that an $\epsilon$ prior misspecification leads to an $O(n\epsilon)$ simple regret cost in f-meta SRM. The constant terms in the bounds depend on the prior distribution. In particular, for a bounded prior, they reflect the variability (diameter) of the expected mean rewards of the arms. Moreover, under a sub-Gaussian prior, the bound depends logarithmically on the number of arms $|\mathcal{A}|$ and sub-linearly on the prior variance proxy $\sigma_0^2$. Next, we bound the frequentist meta simple regret (Eq. 2) of f-meta SRM.

**Corollary 4.1 (Meta Simple Regret of f-meta SRM).** Let the $\mathtt{explore}$ strategy in Algorithm 2 be such that $\epsilon_s = \mathrm{TV}(P_{\theta_*} \,\|\, P_{\hat{\theta}_s}) = O(1/\sqrt{s})$ for each task $s \in [m]$. Then the frequentist meta simple regret of f-meta SRM is bounded as

$$\mathrm{SR}(m, n, P_{\theta_*}) = O\big(2\sqrt{m}\,n B + m\sqrt{|\mathcal{A}|/n}\big). \qquad (9)$$
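Before discussing the proof, here is a small worked instance of the misspecification level $\epsilon$ in Theorem 4 for a one-dimensional Gaussian prior; it is only an illustration and not part of the paper's analysis.

```python
# For two priors N(theta, s0^2) and N(theta_hat, s0^2) with the same variance,
# the total variation distance has the closed form
#   TV = erf(|theta - theta_hat| / (2 * sqrt(2) * s0)),
# which plays the role of epsilon in Theorem 4. The numbers below are arbitrary.
import math

def tv_gaussian_same_var(theta, theta_hat, s0):
    return math.erf(abs(theta - theta_hat) / (2.0 * math.sqrt(2.0) * s0))

theta_star, theta_hat, s0 = 0.0, 0.05, 0.5
eps = tv_gaussian_same_var(theta_star, theta_hat, s0)
n, B = 100, 1.0                            # horizon and a bound on diam(mu)
print(f"epsilon = {eps:.4f}; Theorem 4 relative-regret bound 2*n*eps*B = {2 * n * eps * B:.2f}")
```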
The proof is in Appendix E and decomposes the frequentist meta simple regret into two terms: (i) the per-task simple regret of $\mathtt{alg}(\hat{\theta}_s)$ relative to the oracle $\mathtt{alg}(\theta_*)$ in task $s$, which we bound in Theorem 4, and (ii) the meta simple regret of the oracle $\mathtt{alg}(\theta_*)$, which we bound using our cumulative-to-simple regret reduction (Section 3.1). The $O(\sqrt{m}\,n)$ term is the price of estimating the prior parameter, because it is the per-task simple regret relative to the oracle. The $O(m\sqrt{|\mathcal{A}|/n})$ term is the meta simple regret of the oracle over $m$ tasks. Compared to our bound in Theorem 3, B-meta SRM has a lower regret of $O(\sqrt{m/n} + m/\sqrt{n}) = O(m/\sqrt{n})$. More precisely, only the price for learning the prior differs, as both bounds contain $O(m/\sqrt{n})$ terms. Note that despite its smaller regret bound, B-meta SRM may not be computationally feasible for arbitrary distributions and priors, while f-meta SRM is, since it directly estimates the prior parameter using frequentist techniques.

### 4.2 Lower Bound

In this section, we prove a lower bound on the relative per-task simple regret of a $\gamma$-shot TS algorithm, i.e., a TS algorithm that takes $\gamma \in \mathbb{N}$ samples (instead of 1) from the posterior in each round. This lower bound complements our upper bound in Theorem 4 and shows that Eq. (8) is near-optimal. The proof of our lower bound builds on a cumulative regret lower bound in Theorem 3.3 of Simchowitz et al. (2021) and extends it to simple regret. We present the proof in Appendix E.2.

**Theorem 5 (Lower Bound).** Let $\mathrm{TS}_\gamma(\theta)$ be a $\gamma$-shot TS algorithm instantiated with prior parameter $\theta$, and let $P_\theta$ and $P_{\theta_*}$ be two task priors. Let $\mu \in [0, 1]^{\mathcal{A}}$ and fix a tolerance $\eta \in (0, \tfrac{1}{4})$. Then there exists a universal constant $c_0$ such that, for any large enough horizon $n$, number of arms $|\mathcal{A}| = O(n^2)$, and small enough misspecification error $\epsilon$ (the exact conditions, which depend on $c_0$, $\eta$, and $\gamma$, are given in Appendix E.2), we have $\mathrm{TV}(P_{\theta_*} \,\|\, P_\theta) = \epsilon$ and the gap between the per-task simple regrets of $\mathrm{TS}_\gamma(\theta)$ and $\mathrm{TS}_\gamma(\theta_*)$, $\mathbb{E}[\mu(\hat{A}_{\mathrm{TS}_\gamma(\theta_*)})] - \mathbb{E}[\mu(\hat{A}_{\mathrm{TS}_\gamma(\theta)})]$, is lower bounded by a term of order $n\epsilon$; the precise statement is in Appendix E.2.

This lower bound holds for any setting with a large enough $n$, $|\mathcal{A}| = O(n^2)$ arms, and a small prior misspecification error $\epsilon = O(1/n^2)$, which makes it relatively general.

## 5 Meta-Learning Examples

In this section, we apply our algorithms to specific priors and reward distributions. The two main examples are Bernoulli bandits and linear (contextual) Gaussian bandits. We analyze f-meta SRM in an explore-then-commit fashion, where f-meta SRM estimates the prior using $\mathtt{explore}$ in the first $m_0$ tasks and then commits to it. This is without loss of generality and only for simplicity.

### 5.1 Bernoulli Bandits

We start with a Bernoulli multi-armed bandit (MAB) problem, as TS was first analyzed in this setting (Agrawal and Goyal 2012). Consider Bernoulli rewards with beta priors for $\mathcal{A} = [K]$ arms. In particular, assume that the prior is $P_* = \bigotimes_{a \in \mathcal{A}} \mathrm{Beta}(\alpha_{*,a}, \beta_{*,a})$. Therefore, $\alpha_{*,a}$ and $\beta_{*,a}$ are the prior parameters of arm $a$, and the arm mean $\mu_s(a)$ is the probability of getting reward 1 when arm $a$ is pulled. $\mathrm{Beta}(\alpha, \beta)$ is the beta distribution supported on $(0, 1)$ with parameters $\alpha > 0$ and $\beta > 0$.

B-meta SRM does not have a computationally tractable meta-prior in this setting (Basu et al. 2021). We can address this in practice by discretization and using TS, as described in Section 3.4 of Basu et al. (2021). However, a theoretical analysis for this case does not exist, because there is no computationally tractable conjugate meta-prior for a product of beta distributions. It is challenging to generalize our Bayesian approach to this class of distributions, as it requires more than the standard notion of conjugacy.
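For concreteness, here is a minimal sketch of the base algorithm $\mathtt{alg}(P_\theta)$ in this Bernoulli setting: Thompson sampling with independent $\mathrm{Beta}(\alpha_a, \beta_a)$ priors per arm, combined with the recommendation rule in Eq. (6). The prior parameters are inputs here; the `pull` interface and the toy instance are illustrative, not from the paper.

```python
# Beta-Bernoulli Thompson sampling for a single task, returning the recommended
# arm A_hat ~ rho_s (Eq. 6). The prior parameters (alpha, beta) are given; how
# f-meta SRM estimates them is described next.
import numpy as np

def beta_bernoulli_ts(pull, alpha, beta, n, rng):
    """pull(a) returns a Bernoulli reward; alpha, beta are per-arm prior parameters."""
    alpha, beta = np.array(alpha, float), np.array(beta, float)
    counts = np.zeros(len(alpha))
    for _ in range(n):
        a = int(np.argmax(rng.beta(alpha, beta)))   # sample mu ~ posterior, act greedily
        r = pull(a)
        alpha[a] += r                               # conjugate Beta-Bernoulli update
        beta[a] += 1 - r
        counts[a] += 1
    return int(rng.choice(len(counts), p=counts / n))   # A_hat ~ rho_s, Eq. (6)

rng = np.random.default_rng(2)
mu = np.array([0.3, 0.5, 0.7])                      # a fixed task instance
a_hat = beta_bernoulli_ts(lambda a: rng.binomial(1, mu[a]), [1, 1, 1], [1, 1, 1], 200, rng)
print("recommended arm:", a_hat)
```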
By contrast, f-meta SRM directly estimates the beta prior parameters, $(\alpha_{*,a})_{a \in \mathcal{A}}$ and $(\beta_{*,a})_{a \in \mathcal{A}}$, from the observed Bernoulli rewards as follows. The algorithm explores only in $m_0 \le m$ tasks: $\mathtt{explore}$ pulls arm 1 in the first $t_0$ rounds of the first $m_0/K$ tasks, arm 2 in the first $t_0$ rounds of the next $m_0/K$ tasks, and so on for arms 3 to $K$. In other words, $\mathtt{explore}$ pulls arm $a \in [K]$ in the first $t_0$ rounds of the $a$-th batch of $m_0/K$ tasks. Let $X_s$ denote the cumulative reward collected in the first $t_0$ rounds of task $s$. Then the random variables $X_1, \dots, X_{m_0/K}$ are i.i.d. draws from a beta-binomial distribution (BBD) with parameters $(\alpha_{*,1}, \beta_{*,1}, t_0)$, where $t_0$ is the number of trials of the binomial component. Similarly, $X_{(m_0/K)+1}, \dots, X_{2m_0/K}$ are i.i.d. draws from a BBD with parameters $(\alpha_{*,2}, \beta_{*,2}, t_0)$. In general, $X_{(a-1)(m_0/K)+1}, \dots, X_{a m_0/K}$ are i.i.d. draws from a BBD with parameters $(\alpha_{*,a}, \beta_{*,a}, t_0)$. Knowing this, it is easy to estimate the prior parameters of each arm using the method of moments (Tripathi, Gupta, and Gurland 1994). The detailed calculations are in Appendix D. We prove the following result in Appendix E.3.

**Corollary 5.1 (Frequentist Meta Simple Regret, Bernoulli).** Let $\mathtt{alg}$ be a TS algorithm that uses the method of moments detailed in Appendix D to estimate the prior parameters with $m_0 \ge \frac{C|\mathcal{A}|^2 \log(|\mathcal{A}|/\delta)}{\epsilon^2}$ exploration tasks (explore-then-commit). Then, for $m \ge m_0$, the frequentist meta simple regret of f-meta SRM satisfies $\mathrm{SR}(m, n, P_{\theta_*}) = O\big(2mn\epsilon + m\sqrt{|\mathcal{A}|/n} + m_0\big)$ with probability at least $1 - \delta$.

With a small enough $\epsilon$, the bound shows an $O(m\sqrt{|\mathcal{A}|/n})$ scaling, which we conjecture is the best that an oracle that knows the correct prior of each task could do in expectation. The bound appears to be sublinear in $n$ only if $\epsilon = O(1/n^{3/2})$. However, since $\epsilon \propto m_0^{-1/2}$ and $\sum_{z=1}^{m} z^{-1/2} = O(\sqrt{m})$, if the exploration continues in all tasks, the regret bound above simplifies to $O\big(\sqrt{m}\,n + m\sqrt{|\mathcal{A}|/n}\big)$.

### 5.2 Linear Gaussian Bandits

In this section, we consider linear contextual bandits. Suppose that each arm $a \in \mathcal{A}$ is a vector in $\mathbb{R}^d$ and $|\mathcal{A}| = K$. Also, assume $\nu_s(a; \mu_s) = \mathcal{N}(a^\top \mu_s, \sigma^2)$, i.e., with a slight abuse of notation, $\mu_s(a) = a^\top \mu_s$, where $\mu_s$ is the parameter of the linear model. A conjugate prior for this problem class is $P_* = \mathcal{N}(\theta_*, \Sigma_0)$, where $\Sigma_0 \in \mathbb{R}^{d \times d}$ is known and we learn $\theta_* \in \mathbb{R}^d$. In the Bayesian setting, we assume that the meta-prior is $Q = \mathcal{N}(\psi_q, \Sigma_q)$, where $\psi_q \in \mathbb{R}^d$ and $\Sigma_q \in \mathbb{R}^{d \times d}$ are both known. In this case, the meta-posterior is $Q_s = \mathcal{N}(\hat{\theta}_s, \hat{\Sigma}_s)$, where $\hat{\theta}_s \in \mathbb{R}^d$ and $\hat{\Sigma}_s \in \mathbb{R}^{d \times d}$ are computed as

$$\hat{\theta}_s = \hat{\Sigma}_s\Big(\Sigma_q^{-1}\psi_q + \sum_{\ell=1}^{s-1}\big(\Sigma_0 + \sigma^2 V_\ell^{-1}\big)^{-1} V_\ell^{-1} B_\ell\Big), \qquad \hat{\Sigma}_s^{-1} = \Sigma_q^{-1} + \sum_{\ell=1}^{s-1}\big(\Sigma_0 + \sigma^2 V_\ell^{-1}\big)^{-1},$$

where $V_\ell = \sum_{t=1}^{n} A_{\ell,t} A_{\ell,t}^\top$ is the outer-product matrix of the feature vectors of the arms pulled in task $\ell$ and $B_\ell = \sum_{t=1}^{n} A_{\ell,t} Y_{\ell,t}(A_{\ell,t})$ is their sum weighted by the rewards (see Lemma 7 of Kveton et al. (2021) for more details). By Proposition 1, we can compute the task prior for task $s$ as $P_s = \mathcal{N}(\hat{\theta}_s, \hat{\Sigma}_s + \Sigma_0)$. When $K = d$ and $\mathcal{A}$ is the standard Euclidean basis of $\mathbb{R}^d$, the linear bandit reduces to a $K$-armed bandit. Assuming that $\max_{a \in \mathcal{A}} \|a\| \le 1$ by a scaling argument, the following result holds by an application of our reduction in Section 3.1; we prove it in Appendix C.1. For a matrix $A \in \mathbb{R}^{d \times d}$, let $\lambda_1(A)$ denote its largest eigenvalue.

**Corollary 5.2 (Bayesian Meta Simple Regret, Linear Bandits).** For any $\delta \in (0, 1]$, the Bayesian meta simple regret of B-meta SRM in the setting of Section 5.2 with a TS base algorithm $\mathtt{alg}$ is bounded as $\mathrm{BSR}(m, n) \le c_1\sqrt{dm/n} + (m + c_2)\,\mathrm{SR}_\delta(n) + c_3\,dm/n$, where $c_1 = O(\sqrt{\log(K/\delta)\log m})$, $c_2 = O(\log m)$, and $c_3$ is a constant in $m$ and $n$.
Also, $\mathrm{SR}_\delta(n)$ is the per-task simple regret, bounded as $\mathrm{SR}_\delta(n) \le c_4\sqrt{d/n} + 2\delta\lambda_1(\Sigma_0)$, where $c_4 = O(\sqrt{\log(K/\delta)}\log n)$. The first term in the regret bound is $O(\sqrt{dm/n})$ and represents the price of learning $\theta_*$. The second term is the simple regret of $m$ tasks when $\theta_*$ is known and is $O(m\sqrt{d/n})$. The last term is the price of the forced exploration and is negligible, $O(m/n)$. Following the analysis in Basu et al. (2021), we prove a similar bound for B-meta SRM with a Bayes UCB base algorithm in Appendix C.3.

In the frequentist setting, we simplify the setting to $P_* = \mathcal{N}(\theta_*, \sigma_0^2 I_d)$. The case of a general covariance matrix for the Gaussian MAB is dealt with in Simchowitz et al. (2021). We extend the meta-learning results of Simchowitz et al. (2021) to linear bandits. Our estimator of $\theta_*$, namely $\hat{\theta}_s$, is such that $\mathrm{TV}(P_{\theta_*}\,\|\,P_{\hat{\theta}_s})$ is bounded based on all the observations up to task $s$. We show that for any $\epsilon, \delta \in (0, 1)$, with probability at least $1 - \delta$ over the realizations of the tasks and the internal randomization of the meta-learner, $\hat{\theta}$ is close to $\theta_*$ in TV distance. The key idea of the analysis is bounding the regret relative to an oracle: we use Theorem 4 to bound the regret of f-meta SRM relative to an oracle $\mathtt{alg}(\theta_*)$ that knows the correct prior. Our analysis and estimator also apply to sub-Gaussian distributions, but we stick to linear Gaussian bandits for readability.

Without loss of generality, let $a_1, \dots, a_d$ be a basis for $\mathcal{A}$ such that $\mathrm{Span}(\{a_1, \dots, a_d\}) = \mathbb{R}^d$. As in Section 5.1, we only need to explore the basis. The exploration strategy, $\mathtt{explore}$ in Algorithm 2, pulls the basis arms $a_1, \dots, a_d$ in the first $m_0 \le m$ tasks. Then the least-squares estimate of $\theta_*$ is

$$\hat{\theta} := V_{m_0}^{-1}\sum_{s=1}^{m_0}\sum_{i=1}^{d} a_i\, y_{s,i}, \qquad (10)$$

where $V_{m_0} := m_0\sum_{i=1}^{d} a_i a_i^\top$ is the outer-product matrix of the basis. This gives an unbiased estimate of $\theta_*$. Then we can guarantee the performance of $\mathtt{explore}$ as follows.

**Theorem 6 (Linear Bandits Frequentist Estimator).** In the setting of Section 5.2, for any $\epsilon$ and $\delta \in (2e^{-d}, 1)$, if $n \ge d$ and $m_0 \ge \Big(\frac{d\log(2/\delta)\sum_{i=1}^{d}\sigma_i^2}{2\sigma_0\,\lambda_d^4\big(\sum_{i=1}^{d} a_i a_i^\top\big)\,\epsilon^4}\Big)^{1/3}$, then $\mathrm{TV}(P_{\theta_*}\,\|\,P_{\hat{\theta}}) \le \epsilon$ with probability at least $1 - \delta$.

We prove this by bounding the TV distance between the estimated and correct priors using Pinsker's inequality. The KL-divergence between the correct prior and the prior with parameter $\hat{\theta}$ then boils down to $\|\theta_* - \hat{\theta}\|^2$, which is bounded by Bernstein's inequality (see Appendix E.4 for the proof). Now it is easy to bound the frequentist meta simple regret of f-meta SRM using the sub-Gaussian version of Corollary 4.1 in Appendix E. We prove the following result in Appendix E.4 by decomposing the simple regret into the regret of the base algorithm relative to the oracle.

**Corollary 5.3 (Frequentist Meta Simple Regret, Linear Bandits).** In Algorithm 2, let $\mathtt{alg}$ be a TS algorithm and use Eq. (10) to estimate the prior parameters with $m_0^3 \ge \frac{d\log(2/\delta)\sum_{i=1}^{d}\sigma_i^2}{2\sigma_0\,\lambda_d^4\big(\sum_{i=1}^{d} a_i a_i^\top\big)\,\epsilon^4}$. Then the frequentist meta simple regret of Algorithm 2 is $O\Big(2m^{1/4} n\,\mathrm{diam}(\mathbb{E}_{\theta_*}[\mu]) + m d^{3/2}\sqrt{\log(K)/n}\Big)$ with probability at least $1 - \delta$.

This bound is $O(m^{1/4} n \|\theta_*\|_\infty + m d^{3/2}/\sqrt{n})$, where $\|\cdot\|_\infty$ is the infinity norm. The first term is the price of estimating the prior, and the second one is the standard frequentist regret of linear TS over $m$ tasks divided by $n$, $O(m d^{3/2}/\sqrt{n})$. Compared to Corollary 5.2, the above regret bound is looser.

## 6 Related Work

To the best of our knowledge, there is no prior work on meta-learning for SRM. We build on several recent works on meta-learning for cumulative regret minimization (Bastani, Simchi-Levi, and Zhu 2019; Cella, Lazaric, and Pontil 2020; Kveton et al.
2021; Basu et al. 2021; Simchowitz et al. 2021). Broadly speaking, these works either study a Bayesian setting (Kveton et al. 2021; Basu et al. 2021; Hong et al. 2022), where the learning agent has access to a prior distribution over the meta-parameters of the unknown prior $P_*$; or a frequentist setting (Bastani, Simchi-Levi, and Zhu 2019; Cella, Lazaric, and Pontil 2020; Simchowitz et al. 2021), where the meta-parameters of $P_*$ are estimated using frequentist estimators. We study both the Bayesian and frequentist settings. Our findings are similar to those of prior works: the Bayesian methods have provably lower regret, but are also less general when insisting on an exact implementation.

Meta-learning is an established field of machine learning (Thrun 1996, 1998; Baxter 1998, 2000; Finn, Xu, and Levine 2018) and also has a long history in multi-armed bandits (Azar, Lazaric, and Brunskill 2013; Gentile, Li, and Zappella 2014; Deshmukh, Dogan, and Scott 2017). Tuning of bandit algorithms is known to reduce regret (Vermorel and Mohri 2005; Maes, Wehenkel, and Ernst 2012; Kuleshov and Precup 2014; Hsu et al. 2019) and can be viewed as meta-learning; however, it lacks theory. Several papers have tried to learn a bandit algorithm using policy gradients (Duan et al. 2016; Boutilier et al. 2020; Kveton et al. 2020; Yang and Toni 2020; Min, Moallemi, and Russo 2020). These works focus on offline optimization against a known prior $P_*$ and are in the cumulative regret setting.

Our SRM setting is also related to fixed-budget best-arm identification (BAI) (Gabillon, Ghavamzadeh, and Lazaric 2012; Alieva, Cutkosky, and Das 2021; Azizi, Kveton, and Ghavamzadeh 2022). In BAI, the goal is to control the probability of choosing a suboptimal arm. The two objectives are related because the simple regret can be bounded by the probability of choosing a suboptimal arm multiplied by the maximum gap. While SRM has a long history (Audibert and Bubeck 2010; Kaufmann, Cappé, and Garivier 2016), prior works on Bayesian SRM are limited. Russo (2020) proposed a TS algorithm for BAI; however, its analysis and regret bound are frequentist. The first work on Bayesian SRM is Komiyama et al. (2021). Beyond establishing a lower bound, they proposed a Bayesian algorithm that minimizes the (Bayesian) per-task simple regret in Eq. (1). This algorithm does not use the prior $P_*$ and is conservative. As a side contribution of our work, we establish Bayesian per-task simple regret bounds for posterior-based algorithms in this setting.

## 7 Experiments

In this section, we empirically compare our algorithms by their average meta simple regret over 100 simulation runs. In each run, the prior is sampled i.i.d. from a fixed meta-prior, and the algorithms then run on tasks sampled i.i.d. from the prior. Therefore, the average simple regret is a finite-sample approximation of the Bayesian meta simple regret. Alternatively, we evaluate the algorithms based on their frequentist regret in Appendix F. We also experiment with a real-world dataset in Appendix F.1.

We evaluate three variants of our algorithms with TS as $\mathtt{alg}$: (1) f-meta SRM (Algorithm 2) as a frequentist Meta-TS; we tune $m_0$ and report the point-wise best performance for each task. (2) B-meta SRM (Algorithm 1) as a Bayesian meta-learning algorithm. (3) Mis B-meta SRM, which is the same as B-meta SRM except that the meta-prior mean is perturbed by uniform noise from $[-50, 50]$. This is to show how a major meta-prior misspecification affects our Bayesian algorithm. The actual meta-prior is $\mathcal{N}(0, \Sigma_q)$.
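A minimal sketch of how one simulation run is set up follows, assuming the Gaussian MAB hyperparameters of Section 7.1 below ($\sigma = 10$, $\Sigma_0 = 0.1^2 I$, $\Sigma_q = I$); the algorithm implementations are omitted, and the uniform $[-50, 50]$ perturbation is the misspecified meta-prior mean used by Mis B-meta SRM.

```python
# Setup of one simulation run for the Gaussian MAB experiments (illustrative
# sketch, not the paper's code): sample theta_* from the meta-prior, task means
# from the prior, and define the reward model.
import numpy as np

rng = np.random.default_rng(3)
K, m, n, sigma, s0 = 4, 200, 100, 10.0, 0.1

theta_star = rng.standard_normal(K)                          # theta_* ~ Q = N(0_K, I_K)
mis_meta_mean = rng.uniform(-50.0, 50.0, size=K)             # perturbed meta-prior mean
task_means = theta_star + s0 * rng.standard_normal((m, K))   # mu_s ~ N(theta_*, 0.1^2 I_K)

def pull(s, a):
    """Reward of arm a in task s: a draw from N(mu_s(a), sigma^2)."""
    return task_means[s, a] + sigma * rng.standard_normal()
```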
We run experiments with Gaussian rewards, and thus the following baselines are used in both the MAB and linear bandit experiments. The first baseline is Oracle TS, which is TS with the correct prior $\mathcal{N}(\theta_*, \Sigma_0)$; because of that, it performs the best in hindsight. The second baseline is agnostic TS, which ignores the structure of the problem. We implement it with the prior $\mathcal{N}(0_K, \Sigma_q + \Sigma_0)$, since $\mu_s$ can be viewed as a sample from this prior when the task structure is ignored. Note that $\Sigma_q$ is the meta-prior covariance from Section 5.2. The next set of baselines are state-of-the-art BAI algorithms. As mentioned in Section 6, the goal of BAI is not SRM, but it is closely related: a BAI algorithm is expected to have a small simple regret on a single task. Therefore, if our algorithms outperform them, the gain must be due to meta-learning. We include sequential halving (SH) and its linear variant (Lin-SH), which are special cases of GSE (Azizi, Kveton, and Ghavamzadeh 2022), as state-of-the-art fixed-budget BAI algorithms. We also include LinGapE (Xu, Honda, and Sugiyama 2018), as it shows superior SRM performance compared to Lin-SH. All experiments have $m = 200$ tasks with $n = 100$ rounds each. Appendix F describes the experimental setup in more detail and also includes additional results.

Figure 1: Learning curves for the Gaussian MAB experiments ($K \in \{4, 6, 8\}$, $n = 100$): simple regret versus the number of tasks for SH, Oracle TS, TS, B-meta SRM, Mis B-meta SRM, and f-meta SRM. The error bars are standard deviations from 100 runs.

Figure 2: Learning curves for the linear Gaussian bandit experiments ($(K, d) \in \{(20, 4), (40, 8), (80, 16)\}$, $n = 100$): simple regret versus the number of tasks for Oracle TS, TS, f-meta SRM, Lin-SH, LinGapE, B-meta SRM, and Mis B-meta SRM. The error bars are standard deviations from 100 runs.

### 7.1 Gaussian MAB

We start our experiments with a Gaussian bandit. Specifically, we assume that $\mathcal{A} = [K]$ are $K$ arms with a Gaussian reward distribution $\nu_s(a; \mu_s) = \mathcal{N}(\mu_s(a), 10^2)$, so $\sigma = 10$. The mean reward is sampled as $\mu_s \sim P_{\theta_*} = \mathcal{N}(\theta_*, 0.1^2 I_K)$, so $\Sigma_0 = 0.1^2 I_K$. The prior parameter is sampled from the meta-prior as $\theta_* \sim Q = \mathcal{N}(0_K, I_K)$, i.e., $\Sigma_q = I_K$. Fig. 1 shows the results for various values of $K$. We clearly observe that the meta-learning algorithms adapt to the task prior and outperform TS. Both f-meta SRM and B-meta SRM perform similarly close to Oracle TS, which confirms the negligible cost of learning the prior, as expected from our bounds. We also note that f-meta SRM outperforms Mis B-meta SRM, which highlights the reliance of the Bayesian algorithm on a good meta-prior. SH matches the performance of the meta-learning algorithms when $K = 4$. However, as the task becomes harder ($K > 4$), it underperforms our algorithms significantly. For smaller $K$, the tasks share less information, and thus meta-learning does not improve learning as much.

### 7.2 Linear Gaussian Bandits

Now consider a linear bandit (Section 5.2) in $d$ dimensions with $K = 5d$ arms, where the arms are sampled uniformly from the unit sphere. The reward of arm $a$ is distributed as $\mathcal{N}(a^\top \mu_s, 10^2)$, so $\sigma = 10$, and $\mu_s$ is sampled from $P_* = \mathcal{N}(\theta_*, 0.1^2 I_d)$, so $\Sigma_0 = 0.1^2 I_d$. The prior parameter $\theta_*$ is sampled from the meta-prior $Q = \mathcal{N}(0_d, I_d)$, so $\Sigma_q = I_d$. Fig. 2 shows experiments for various values of $d$.
As expected, a larger $d$ increases the regret of all algorithms. Compared to Section 7.1, the problem of learning the prior is more difficult, and the gap between B-meta SRM and Oracle TS increases. f-meta SRM also outperforms TS, but it has a much higher regret than B-meta SRM. While Mis B-meta SRM under-performs f-meta SRM in the MAB tasks, it performs closer to B-meta SRM in this experiment. The BAI algorithms, Lin-SH and LinGapE, under-perform our meta-learning algorithms and are closer to TS than in Fig. 1. The value of knowledge transfer in the linear setting is higher, since the linear model parameter is shared by many arms. Our linear bandit experiment confirms the applicability of our algorithms to structured problems, which shows potential for solving real-world problems. Specifically, the success of Mis B-meta SRM confirms the robustness of B-meta SRM to misspecification.

## 8 Conclusions and Future Work

We develop a meta-learning framework for SRM, where the agent improves by repeatedly interacting with similar tasks. We propose two algorithms: a Bayesian algorithm that maintains a distribution over the task parameters and a frequentist one that estimates the task parameters using frequentist methods. The Bayesian algorithm has superior regret guarantees, while the frequentist one can be applied to a larger family of problems. This work lays foundations for Bayesian SRM and readily extends to reinforcement learning (RL). For instance, we can extend our framework to other task structures, such as parallel or arbitrarily ordered tasks (Wan, Ge, and Song 2021; Hong et al. 2022). Our Bayesian algorithm easily extends to RL in tabular and factored MDPs (Lu and Van Roy 2019), and the frequentist algorithm applies to POMDPs (Simchowitz et al. 2021).

## References

Abeille, M.; and Lazaric, A. 2017. Linear Thompson Sampling Revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics.

Agrawal, S.; and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, 39.1-39.26. JMLR Workshop and Conference Proceedings.

Agrawal, S.; and Goyal, N. 2013. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, 99-107. PMLR.

Alieva, A.; Cutkosky, A.; and Das, A. 2021. Robust Pure Exploration in Linear Bandits with Limited Budget. In Meila, M.; and Zhang, T., eds., Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 187-195. PMLR.

Audibert, J.-Y.; and Bubeck, S. 2010. Best Arm Identification in Multi-Armed Bandits. In COLT 2010 - 23rd Conference on Learning Theory. Haifa, Israel.

Azar, M. G.; Lazaric, A.; and Brunskill, E. 2013. Sequential Transfer in Multi-Armed Bandit with Finite Set of Models. In Advances in Neural Information Processing Systems 26, 2220-2228.

Azizi, M.; Kveton, B.; and Ghavamzadeh, M. 2022. Fixed-budget best-arm identification in structured bandits. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2798-2804.

Bastani, H.; Simchi-Levi, D.; and Zhu, R. 2019. Meta Dynamic Pricing: Transfer Learning Across Experiments. Management Science.

Basu, S.; Kveton, B.; Zaheer, M.; and Szepesvári, C. 2021. No Regrets for Learning the Prior in Bandits. Advances in Neural Information Processing Systems, 34.

Baxter, J. 1998. Theoretical Models of Learning to Learn. In Learning to Learn, 71-94. Springer.

Baxter, J. 2000. A Model of Inductive Bias Learning.
Journal of Artificial Intelligence Research, 12: 149-198.

Boutilier, C.; Hsu, C.-W.; Kveton, B.; Mladenov, M.; Szepesvari, C.; and Zaheer, M. 2020. Differentiable Meta-Learning of Bandit Policies. In Advances in Neural Information Processing Systems 33.

Cella, L.; Lazaric, A.; and Pontil, M. 2020. Meta-Learning with Stochastic Linear Bandits. In Proceedings of the 37th International Conference on Machine Learning.

Deshmukh, A. A.; Dogan, U.; and Scott, C. 2017. Multi-Task Learning for Contextual Bandits. In Advances in Neural Information Processing Systems 30, 4848-4856.

Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.

Finn, C.; Xu, K.; and Levine, S. 2018. Probabilistic Model-Agnostic Meta-Learning. In Advances in Neural Information Processing Systems 31, 9537-9548.

Gabillon, V.; Ghavamzadeh, M.; and Lazaric, A. 2012. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems, volume 25, 3212-3220. Curran Associates, Inc.

Gentile, C.; Li, S.; and Zappella, G. 2014. Online Clustering of Bandits. In Proceedings of the 31st International Conference on Machine Learning, 757-765.

Hoffman, M.; Shahriari, B.; and Freitas, N. 2014. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, 365-374. PMLR.

Hong, J.; Kveton, B.; Zaheer, M.; and Ghavamzadeh, M. 2022. Hierarchical Bayesian bandits. In International Conference on Artificial Intelligence and Statistics, 7724-7741. PMLR.

Hsu, C.-W.; Kveton, B.; Meshi, O.; Mladenov, M.; and Szepesvari, C. 2019. Empirical Bayes regret minimization. arXiv preprint arXiv:1904.02664.

Kaufmann, E.; Cappé, O.; and Garivier, A. 2016. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models. JMLR, 17: 1:1-1:42.

Komiyama, J.; Ariu, K.; Kato, M.; and Qin, C. 2021. Optimal Simple Regret in Bayesian Best Arm Identification. arXiv preprint arXiv:2111.09885.

Kuleshov, V.; and Precup, D. 2014. Algorithms for Multi-Armed Bandit Problems. CoRR, abs/1402.6028.

Kveton, B.; Mladenov, M.; Hsu, C.-W.; Zaheer, M.; Szepesvari, C.; and Boutilier, C. 2020. Differentiable meta-learning in contextual bandits. arXiv e-prints, arXiv:2006.

Kveton, B.; Hsu, C.-W.; Boutilier, C.; Szepesvari, C.; Zaheer, M.; Mladenov, M.; and Konobeev, M. 2021. Meta-Thompson Sampling. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 5884-5893.

Lattimore, T.; and Szepesvári, C. 2020. Bandit Algorithms. Cambridge University Press.

Liu, Y.; Devraj, A. M.; Van Roy, B.; and Xu, K. 2022. Gaussian Imagination in Bandit Learning. arXiv preprint arXiv:2201.01902.

Lu, X.; and Van Roy, B. 2019. Information-Theoretic Confidence Bounds for Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 32.

Maes, F.; Wehenkel, L.; and Ernst, D. 2012. Meta-Learning of Exploration/Exploitation Strategies: The Multi-Armed Bandit Case. In Proceedings of the 4th International Conference on Agents and Artificial Intelligence, 100-115.

Mason, B.; Jain, L.; Tripathy, A.; and Nowak, R. 2020. Finding All ϵ-Good Arms in Stochastic Bandits. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.
F.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 20707-20718. Curran Associates, Inc.

Min, S.; Moallemi, C. C.; and Russo, D. J. 2020. Policy gradient optimization of Thompson sampling policies. arXiv preprint arXiv:2006.16507.

Réda, C.; Kaufmann, E.; and Delahaye-Duriez, A. 2021. Top-m identification for linear bandits. In International Conference on Artificial Intelligence and Statistics, 1108-1116. PMLR.

Russo, D. 2020. Simple Bayesian Algorithms for Best-Arm Identification. Operations Research, 68(6): 1625-1647.

Simchowitz, M.; Tosh, C.; Krishnamurthy, A.; Hsu, D. J.; Lykouris, T.; Dudik, M.; and Schapire, R. E. 2021. Bayesian decision-making under misspecified priors with applications to meta-learning. Advances in Neural Information Processing Systems, 34.

Thrun, S. 1996. Explanation-Based Neural Network Learning - A Lifelong Learning Approach. Ph.D. thesis, University of Bonn.

Thrun, S. 1998. Lifelong Learning Algorithms. In Learning to Learn, 181-209. Springer.

Tripathi, R. C.; Gupta, R. C.; and Gurland, J. 1994. Estimation of parameters in the beta binomial model. Annals of the Institute of Statistical Mathematics, 46(2): 317-331.

Vermorel, J.; and Mohri, M. 2005. Multi-Armed Bandit Algorithms and Empirical Evaluation. In Proceedings of the 16th European Conference on Machine Learning, 437-448.

Vershynin, R. 2018. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press.

Wan, R.; Ge, L.; and Song, R. 2021. Metadata-based Multi-Task Bandits with Bayesian Hierarchical Models. In Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems.

Xu, L.; Honda, J.; and Sugiyama, M. 2018. Fully adaptive algorithm for pure exploration in linear bandits. arXiv:1710.05552.

Yang, K.; and Toni, L. 2020. Differentiable linear bandit algorithm. arXiv preprint arXiv:2006.03000.