# Meta-learning with Stochastic Linear Bandits

Leonardo Cella 1,2, Alessandro Lazaric 3, Massimiliano Pontil 2

1 University of Milan, 2 Istituto Italiano di Tecnologia, 3 Facebook AI Research. Correspondence to: Leonardo Cella. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

We investigate meta-learning procedures in the setting of stochastic linear bandit tasks. The goal is to select a learning algorithm which works well on average over a class of bandit tasks sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show, both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.

1. Introduction

The multi-armed bandit (MAB) (Lattimore & Szepesvári, 2020; Auer et al., 2002; Siegmund, 2003; Robbins, 1952; Cesa-Bianchi, 2016; Bubeck et al., 2012) is a simple framework formalizing online learning under partial feedback. In the last decades it has received increasing attention due to its wide practical importance and the theoretical challenges in designing principled and efficient learning algorithms. Applications range from recommender systems (Li et al., 2010; Cella & Cesa-Bianchi, 2019; Bogers, 2010), to clinical trials (Villar et al., 2015), and to adaptive routing (Awerbuch & Kleinberg, 2008), among others.

In this paper we are concerned with linear bandits (Abbasi-Yadkori et al., 2011; Chu et al., 2011; Auer, 2003), a consolidated MAB setting in which each arm is associated with a vector of features and the arm payoff is modeled by an (unknown) linear regression of the arm feature vector. Our study builds upon the OFUL algorithm introduced in (Abbasi-Yadkori et al., 2011), which in turn improved the theoretical analysis initially investigated in (Chu et al., 2011; Auer, 2003). Nonetheless, OFUL may still require a long exploration phase in order to estimate the unknown regression vector well. An appealing way to address this bottleneck is to leverage already completed tasks, transferring the previously collected experience to speed up the learning process. This framework finds its most common application in the recommendation domain, where we wish to recommend content to a new user by matching their preferences. Our objective is to rely on past interactions, corresponding to the navigation of different users, to speed up the learning process.

Previous Work. During the past decade there have been numerous theoretical investigations of transfer learning, with particular attention to the problems of multi-task learning (MTL) (Ando & Zhang, 2005; Maurer & Pontil, 2013; Maurer et al., 2013; 2016; Cavallanti et al., 2010) and learning-to-learn (LTL) or meta-learning (Baxter, 2000; Alquier et al., 2017; Denevi et al., 2018a;b; 2019; Pentina & Urner, 2016).
The main difference between these two settings is that MTL aims to solve the problem of learning well on a prescribed set of tasks (the learned model is tested on the same tasks used during training), whereas LTL studies the problem of selecting a learning algorithm that works well on tasks from a common environment (i.e. sampled from a prescribed distribution), relying on already completed tasks from the same environment (Pentina & Urner, 2016; Balcan et al., 2019; Denevi et al., 2018a; 2019). In either case, the base tasks considered have always been supervised learning ones. Recently, the MTL setting has been extended to classes of bandit tasks, with encouraging empirical and theoretical results (Azar et al., 2013; Calandriello et al., 2014; Zhang & Bareinboim, 2017; Deshmukh et al., 2017; Liu et al., 2018), as has the case where tasks belong to a (social) graph, a setting usually referred to as collaborative linear bandits (Cesa-Bianchi et al., 2013; Soare et al., 2014; Gentile et al., 2014; 2017). Differently from these works, the principal goal of this paper is to bring the meta-learning framework, which has been successfully considered within the supervised setting, to the setting of linear stochastic bandits.

Contributions. Our contribution is threefold. First, in Section 3 we introduce a variant of the OFUL algorithm in which the regularization term is modified by introducing a bias vector, and we analyze the impact of the bias in terms of regret minimization. Second, and more importantly, in Sections 4 and 5 we propose two alternative approaches to estimate the bias within the meta-learning setting. We establish theoretical results on the regret of these methods, highlighting that, when the task-distribution has small variance and the number of tasks grows, the proposed meta-learning methods yield a substantial benefit over the standard OFUL algorithm. Finally, in Section 6 we compare the proposed methods experimentally against the standard OFUL algorithm on both synthetic and real data.

2. Learning Foundations

In this section we briefly recall the standard stochastic linear bandit framework and then present the LTL setting we consider.

2.1. Linear Stochastic Bandits

Let T be a positive integer and let [T] = {1, . . . , T}. A linear stochastic MAB is defined by a sequence of T interactions between the agent and the environment. At each round $t \in [T]$, the learner is given a decision set $D_t \subset \mathbb{R}^d$ from which it has to pick an arm $x_t \in D_t$. Subsequently, it observes the corresponding reward $y_t = x_t^\top w_* + \eta_t$, which is defined by a linear relation with respect to an unknown parameter $w_* \in \mathbb{R}^d$, corrupted by a sub-Gaussian random noise term $\eta_t$. Thanks to the knowledge of the true parameter $w_*$, at each round t the optimal policy picks the arm $x^*_t = \operatorname{arg\,max}_{x \in D_t} x^\top w_*$, maximizing the instantaneous reward. The learning objective is to maximize the cumulative reward or, equivalently, to minimize the pseudo-regret

$$R(T, w_*) = \sum_{t=1}^{T} (x^*_t - x_t)^\top w_*.$$

As learning algorithm we consider OFUL (Abbasi-Yadkori et al., 2011). At each round $t \in [T]$, it estimates $w_*$ by ridge regression over the observed arm-reward pairs, that is,

$$\widehat{w}^{\lambda}_{t+1} = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \|X_t w - y_t\|_2^2 + \lambda \|w\|_2^2, \tag{1}$$

where $X_t$ is the matrix whose rows are $x_1^\top, \dots, x_t^\top$, $y_t = (y_1, \dots, y_t)^\top$, and $I$ denotes the $d \times d$ identity matrix used below.
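The estimate in Equation (1) has the usual ridge-regression closed form. The following minimal sketch (an illustration under our own naming, not code from the paper) computes it with NumPy, with `lam` playing the role of $\lambda$.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form solution of Equation (1): w = (lam*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X       # regularized design matrix V^lam_t
    return np.linalg.solve(V, X.T @ y)  # ridge-regression estimate

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X = rng.normal(size=(30, 5))                # arms pulled so far (rows)
y = X @ w_true + 0.1 * rng.normal(size=30)  # noisy linear rewards
print(ridge_estimate(X, y, lam=1.0))
```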
A key insight behind OFUL is to maintain online a confidence set $C_t$ containing the true parameter $w_*$ with high probability and centered at $\widehat{w}^{\lambda}_t$. According to Theorem 2 of (Abbasi-Yadkori et al., 2011), assuming $\|w_*\|_2 \le S$ and $\|x\|_2 \le L$ for all $x \in \cup_{s=1}^{t} D_s$, then for any $\delta > 0$, with probability at least $1 - \delta$, for all $t \ge 0$, $w_*$ lies in

$$C_t(\delta) = \Big\{ w \in \mathbb{R}^d : \big\|\widehat{w}^{\lambda}_t - w\big\|_{V^{\lambda}_t} \le \sqrt{d \log\!\Big(\frac{1 + t L^2/\lambda}{\delta}\Big)} + \lambda^{\frac{1}{2}} S =: \beta^{\lambda}_t(\delta) \Big\}, \tag{2}$$

where $V^{\lambda}_t = \lambda I + X_t^{\top} X_t$. According to the optimism-in-the-face-of-uncertainty principle, at each round t OFUL picks the arm $x_t$ by solving the following optimization problem:

$$x_t = \operatorname*{arg\,max}_{x \in D_t} \; \max_{\widetilde{w} \in C_{t-1}(\delta)} x^{\top} \widetilde{w}. \tag{3}$$

As proved in Lemma 5 of (Kuzborskij et al., 2019), this corresponds to picking

$$x_t \in \operatorname*{arg\,max}_{x \in D_t} \Big\{ x^{\top} \widehat{w}^{\lambda}_{t-1} + \beta^{\lambda}_{t-1}(\delta)\, \|x\|_{(V^{\lambda}_{t-1})^{-1}} \Big\}. \tag{4}$$

Finally, with probability at least $1 - \delta$, OFUL satisfies (see Theorem 3 of (Abbasi-Yadkori et al., 2011)), for a constant $C > 0$,

$$R(T, w_*) \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \Big( \lambda^{\frac{1}{2}} S + \sqrt{2 \log(1/\delta) + d \log(1 + TL/(\lambda d))} \Big).$$

We can now formally introduce the LTL framework we consider for the family of tasks analyzed in this work: biased regularized linear stochastic bandits.

2.2. LTL with Linear Stochastic Bandits

We assume that each learning task $w \in \mathbb{R}^d$, representing a linear bandit, is sampled from a task-distribution $\rho$ with bounded support in $\mathbb{R}^d$. The objective is to design a meta-learning algorithm which is well suited to the environment. Specifically, we assume to receive a sequence of tasks $w_1, \dots, w_N, \dots$ which are independently sampled from the task-distribution (environment) $\rho$. Due to the interactive nature of the bandit setting, we do not have any prior information about a new task; we collect information about it during the interaction with the environment. After completing the j-th task, we store the whole interaction in a dataset $Z_j$ formed by the T entries $(x_{j,t}, y_{j,t})_{t=1}^{T}$. Clearly, the dataset entries are not i.i.d. samples from a given distribution; rather, each dataset $Z_j$ records the learning policy in terms of the arms $x_{j,t}$ picked from the decision sets $D_{j,t}$ and the corresponding rewards $y_{j,t}$ while facing the task specified by the unknown vector $w_j$. Starting from these datasets, we wish to design an algorithm A which suffers low regret on a new task $w_{N+1} \sim \rho$. This amounts to requiring that A, trained over the N datasets, has small transfer regret

$$R(T, \rho) = \mathbb{E}_{w \sim \rho} \big[ \mathbb{E}\, R(T, w) \big],$$

where the inner expectation is with respect to the reward realizations, due to their noisy components.

3. Biased Regularized OFUL

We now introduce BIAS-OFUL, a biased version of OFUL, which is instrumental for our meta-learning setting. Although not feasible as stated, it serves as a basis to study the theoretical properties of meta-learning with stochastic linear bandit tasks. In Section 6 we present a more practical version of it.

Regularized Confidence Sets. The idea of introducing a bias in a specific family of learning algorithms is not new in the LTL literature (Denevi et al., 2018a; 2019; 2018b). Inspired by (Denevi et al., 2019), we modify the regularization in the computation of the confidence-set centroid $\widehat{w}^{\lambda}_t$: the regularization is now defined as a squared Euclidean distance to a bias parameter $h \in \mathbb{R}^d$. Given a fixed vector h, at each round $t \in [T]$ BIAS-OFUL estimates the regularized centroid of the confidence ellipsoid as

$$\widehat{w}^{h}_t = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \|X_t w - Y_t\|_2^2 + \lambda \|w - h\|_2^2,$$

whose solution is given by

$$\widehat{w}^{h}_t = \big(V^{\lambda}_t\big)^{-1} X_t^{\top} (Y_t - X_t h) + h. \tag{5}$$

This result follows directly from standard ridge regression by using the substitution $v = w - h$.
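For concreteness, here is a minimal sketch of the two computations above, under our own naming: `biased_center` implements Equation (5) and `select_arm` applies the optimistic rule of Equation (4) with the biased center; the confidence radius `beta` is passed in, since its biased form is only derived in Theorem 1 below.

```python
import numpy as np

def biased_center(X, Y, h, lam):
    """Equation (5): ridge regression shifted by the bias vector h."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    return np.linalg.solve(V, X.T @ (Y - X @ h)) + h

def select_arm(arms, w_hat, V, beta):
    """Optimistic arm choice in the spirit of Equation (4):
    maximize x^T w_hat + beta * ||x||_{V^{-1}} over the decision set."""
    V_inv = np.linalg.inv(V)
    ucb = arms @ w_hat + beta * np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    return arms[np.argmax(ucb)]

# Usage on toy data: with h = 0 this reduces to the standard OFUL update.
rng = np.random.default_rng(1)
d, lam = 4, 1.0
X = rng.normal(size=(10, d)); Y = rng.normal(size=10)
h = np.zeros(d)
w_hat = biased_center(X, Y, h, lam)
arms = rng.normal(size=(5, d))  # current decision set D_t
x_t = select_arm(arms, w_hat, lam * np.eye(d) + X.T @ X, beta=1.0)
```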
As mentioned in the previous section, at each round t OFUL also keeps updated a confidence set $C_t$ (see Equation 2), centered at $\widehat{w}^{\lambda}_t$, which contains $w_*$ with high probability. We now derive a confidence set for the biased regularized estimate $\widehat{w}^{h}_t$, assuming that we have access to an oracle to compute the distance $\|h - w_*\|_2$. This seems quite restrictive; however, later in the paper we show how, by leveraging similar related tasks, we can exploit this bound and take advantage of the biased version of OFUL without having to know the above distance a priori.

Theorem 1. Assume $\|h\|_2 \le S$, $\|w_*\|_2 \le S$ and $\|x\|_2 \le L$ for all $x \in \cup_{s=1}^{t} D_s$. Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $t \ge 0$, $w_*$ lies in the set

$$\Big\{ w \in \mathbb{R}^d : \big\|\widehat{w}^{h}_t - w\big\|_{V^{\lambda}_t} \le \lambda^{\frac{1}{2}} \|h - w_*\|_2 + \sqrt{2 \log\!\Big(\frac{\det(V^{\lambda}_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\Big)} \Big\}.$$

The proof can be found in the appendix. We now study the impact of the bias h in terms of regret.

3.1. Regret Analysis with Fixed Bias

Given the confidence set defined in Theorem 1 and the optimism principle, translated into selecting the next arm according to Equation 4, we can analyze the expected pseudo-regret depending on the value of h.

Lemma 1. (BIAS-OFUL Expected Regret) Under the same assumptions of Theorem 1, if in addition $x^\top w_* \in [-1, 1]$ for all t and all $x \in D_t$, and considering $\lambda \ge 1$, we have

$$\overline{R}(T, w_*) = \mathbb{E}\big[R(T, w_*)\big] \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \Big( \lambda^{\frac{1}{2}} \|w_* - h\|_2 + \sqrt{d \log\big(T + T^2 L/(\lambda d)\big)} \Big),$$

where the expectation is with respect to the reward generation and $C > 0$ is a constant factor.

We now analyze the regret for two different values of h. In particular, we wish to highlight how setting a good bias can speed up learning with respect to the standard OFUL approach (Abbasi-Yadkori et al., 2011).

Corollary 1. Under the conditions of Lemma 1, the following bounds on the expected regret of BIAS-OFUL hold:

(i) Independent Task Learning (ITL), given by setting h = 0, satisfies the expected regret bound

$$\overline{R}(T, w_*) \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \Big( \lambda^{\frac{1}{2}} \|w_*\|_2 + \sqrt{d \log\big(T + T^2 L/(\lambda d)\big)} \Big),$$

which, for any constant $\lambda \ge 1$, is of order $O(d\sqrt{T})$.

(ii) The Oracle, given by setting $h = w_*$, satisfies

$$\overline{R}(T, w_*) \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \, \sqrt{d \log\big(T + T^2 L/(\lambda d)\big)},$$

which tends to 0 as $\lambda \to \infty$.

The proofs can be found in the supplementary material. The main intuition is that, as long as we can set $h = w_*$, the larger the regularization parameter $\lambda$ is, the more the Oracle policy tends to select the arm based only on $w_*$, thereby becoming equivalent to the optimal policy.

3.2. Transfer Regret Analysis with Fixed Bias

Following the above analysis for the single-task case, we now study the impact of the bias in the transfer regret bound. To this end, we introduce the variance and the mean absolute distance of a bias vector h relative to the environment of tasks,

$$\mathrm{Var}_h = \mathbb{E}_{w \sim \rho} \|w - h\|_2^2, \qquad \mathrm{Mar}_h = \mathbb{E}_{w \sim \rho} \|w - h\|_2,$$

and we observe that $\bar{w} = \mathbb{E}_{w \sim \rho}\, w = \operatorname{arg\,min}_{h \in \mathbb{R}^d} \mathrm{Var}_h$ and $m = \operatorname{arg\,min}_{h \in \mathbb{R}^d} \mathrm{Mar}_h$. With this in hand, we can now analyze how the transfer regret can be upper bounded as a function of the introduced terms.

Lemma 2. (Transfer Regret Bound) Under the same conditions of Theorem 1 and Lemma 1, the expected transfer regret of BIAS-OFUL can be upper bounded as

$$R(T, \rho) \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \Big( \lambda^{\frac{1}{2}} \mathrm{Mar}_h + \sqrt{d \log\big(T + T^2 L/(\lambda d)\big)} \Big) \le C \sqrt{T d \log\!\Big(1 + \frac{TL}{\lambda d}\Big)} \Big( \lambda^{\frac{1}{2}} \sqrt{\mathrm{Var}_h} + \sqrt{d \log\big(T + T^2 L/(\lambda d)\big)} \Big).$$

Proof. The first inequality follows by taking the expectation with respect to the task-distribution $\rho$ of the bound in Lemma 1, while the second follows by applying Jensen's inequality.
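To get a feel for Lemma 2, the following small computation (an illustration using the bound as reconstructed above, with arbitrary constants and problem sizes of our own choosing) evaluates its right-hand side for decreasing values of $\mathrm{Var}_h$, with $\lambda$ set to $1/(T\,\mathrm{Var}_h)$, the choice analyzed next: the smaller the task variance around the bias, the smaller the bound.

```python
import numpy as np

def lemma2_bound(T, d, L, lam, var_h, C=1.0):
    """Right-hand side of the second inequality in Lemma 2 (as reconstructed)."""
    return C * np.sqrt(T * d * np.log(1 + T * L / (lam * d))) * (
        np.sqrt(lam * var_h) + np.sqrt(d * np.log(T + T**2 * L / (lam * d))))

T, d, L = 1000, 10, 1.0
for var_h in [1.0, 0.1, 0.01]:
    lam = 1.0 / (T * var_h)  # regularization choice used in Corollary 2
    print(f"Var_h={var_h:5.2f}  transfer-regret bound={lemma2_bound(T, d, L, lam, var_h):10.1f}")
```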
We can now replicate what we have done in Corollary 1 and consider the transfer regret bound for two different values of the hyperparameter h. The main difference is that here there is no a-priori correct value for h, as it depends on the task-distribution $\rho$.

Corollary 2. Under the same assumptions of Theorem 1 and Lemma 1, and setting $\lambda = 1/(T\,\mathrm{Var}_h)$, the following bounds on the transfer regret hold:

(i) Independent Task Learning (ITL), given by setting the bias hyperparameter h equal to 0, satisfies

$$R(T, \rho) \le C \sqrt{T d \log\big(1 + T^2 L\, \mathrm{Var}_0\big)} \Big( \frac{1}{\sqrt{T}} + \sqrt{d \log\big(T + T^3 L\, \mathrm{Var}_0\big)} \Big).$$

(ii) The Oracle, given by setting the bias hyperparameter h equal to the mean task $\bar{w}$, satisfies

$$R(T, \rho) \le C \sqrt{T d \log\big(1 + T^2 L\, \mathrm{Var}_\rho\big)} \Big( \frac{1}{\sqrt{T}} + \sqrt{d \log\big(T + T^3 L\, \mathrm{Var}_\rho\big)} \Big),$$

where $\mathrm{Var}_\rho = \mathrm{Var}_{\bar{w}}$.

Proof. These results follow directly from Lemma 2. We picked $\lambda = 1/(T\,\mathrm{Var}_h)$ in order to highlight the multiplicative term $\log(1 + T^2 L\, \mathrm{Var}_h)$, which tends to zero as the variance $\mathrm{Var}_h$ goes to zero.

Therefore, running BIAS-OFUL with bias h equal to $\bar{w}$ brings a substantial benefit with respect to the unbiased case whenever the second moment of the task-distribution $\rho$ is much bigger than its variance. Specifically, we introduce the following assumption.

Assumption 1. (Low Biased Variance)

$$\mathrm{Var}_\rho = \mathbb{E}_{w \sim \rho} \|w - \bar{w}\|_2^2 \;\ll\; \mathbb{E}_{w \sim \rho} \|w\|_2^2 = \mathrm{Var}_0. \tag{7}$$

Notice also that the choice $\lambda = 1/(T\,\mathrm{Var}_h)$ implies that, as $\mathrm{Var}_{\bar{w}}$ tends to 0, the regret upper bound of the oracle case tends to zero too, reflecting the result of Corollary 1. More generally, when the environment (i.e. the task-distribution $\rho$) satisfies Assumption 1, leveraging task similarity gives a substantial benefit compared to learning each task separately. Since in practice the mean task parameter $\bar{w}$ is unknown, in the following sections we propose two alternative approaches to estimate $\bar{w}$.

Algorithm 1 Within-Task Algorithm: BIAS-OFUL
Require: $\lambda > 0$, $\widehat{h}_0 \in \mathbb{R}^d$
1: $\widehat{w}^{h}_0 = \widehat{h}_0$, $V^{-1}_0 = \frac{1}{\lambda} I$.
2: for t = 1 to T do
3:   GET decision set $D_t$
4:   SELECT $x_t \in D_t$ with bias $h = \widehat{h}^{\lambda}_{j,t}$
5:   OBSERVE reward $y_t$
6:   UPDATE $V_t = V_{t-1} + x_t x_t^\top$
7:   UPDATE $\widehat{h}_t$ according to the meta-algorithm
8:   UPDATE $\widehat{w}^{h}_t$ using Equation 5
9: end for

Algorithm 2 Meta-Algorithm: Estimating $\widehat{h}^{\lambda}$
1: for j = 1 to N do
2:   SAMPLE new task $w_j \sim \rho$
3:   SET $\widehat{h}^{\lambda}_{j,0}$
4:   RUN Algorithm 1 with parameter $\widehat{h}^{\lambda}_{j,0}$
5: end for

4. A High Variance Solution

In this section we present our first meta-learning method. We begin by introducing some additional notation. We let $x^{h}_{j,t}$ be the arm pulled by the BIAS-OFUL algorithm (Algorithm 1) at round t of the j-th task. We denote by $V_{j,T} = \sum_{s=1}^{T} x^{h}_{j,s} (x^{h}_{j,s})^\top$ the design matrix computed with the T arms picked during the j-th task. For each terminated task $j \in [N]$ we also define $b_{j,T} = X_{j,T}^\top Y_{j,T}$. Finally, we introduce the mean estimation error $\epsilon_{N,t}(\rho) = \|\bar{w} - \widehat{h}^{\lambda}_{N,t}\|_2$, which is the error of our estimate $\widehat{h}^{\lambda}_{N,t}$ with respect to the true mean task $\bar{w}$ at round t of the (N+1)-th task.

4.1. Averaging the Estimated Task Parameters

An intuitive way to bound the estimation error $\epsilon_{N,t}$ is to simply average the estimated task parameters $\widehat{w}^{\lambda}_j$, computed according to Equation 1 on the datasets $Z_j$ without considering any bias:

$$\widehat{h}^{\lambda}_{N,t+1} = \frac{1}{NT + t} \Big( \sum_{j=1}^{N} T\, \widehat{w}^{\lambda}_{j,T} + t\, \widehat{w}^{\lambda}_{N+1,t} \Big). \tag{8}$$
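A minimal sketch of this averaging estimator, under our own naming: each completed task contributes its (unbiased) per-task ridge estimate with weight T, and the current task contributes its running estimate with weight t.

```python
import numpy as np

def averaged_bias(past_estimates, T, current_estimate, t):
    """Equation (8): weighted average of the N completed-task ridge estimates
    (each computed as in Equation (1) and weighted by T) and the running
    estimate of the current task (weighted by t)."""
    N = len(past_estimates)
    total = T * np.sum(past_estimates, axis=0) + t * current_estimate
    return total / (N * T + t)

# Toy usage: three completed tasks of length T=50, current task at round t=10.
rng = np.random.default_rng(2)
past = [rng.normal(size=4) for _ in range(3)]  # w_hat_{j,T}, j = 1..N
h_hat = averaged_bias(past, T=50, current_estimate=rng.normal(size=4), t=10)
```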
By adopting this approach, we obtain the following bound on the transfer regret.

Theorem 2. (Transfer Regret Bound) Let the assumptions of Lemma 2 hold and let $\widehat{h}^{\lambda}_{N,t}$ be defined as in Equation (8). Then it holds that

$$R(T, \rho) \le C d \sqrt{T \log\Big(1 + T^2 L \big(\mathrm{Var}_{\bar{w}} + \epsilon_{N,T}(\rho)\big)\Big)},$$

where the mean estimation error can be bounded as

$$\epsilon_{N,T}(\rho) \le H_\rho(N+1, \bar{w}) + \max_{j=1,\dots,N} \frac{\beta^{\lambda}_j(1/T)}{\lambda_{\min}^{1/2}(V^{\lambda}_{j,T})}.$$

Here, $\beta^{\lambda}_j(1/T)$ refers to the confidence radius computed by OFUL (see Equation 2) and $H_\rho(N+1, \bar{w}) = \|\bar{w} - \bar{h}_{N,t}\|_2$ with

$$\bar{h}_{N,t+1} = \frac{1}{NT + t} \Big( \sum_{j=1}^{N} T\, w_j + t\, w_{N+1} \Big).$$

Proof. We follow the reasoning of Corollary 2, this time setting $h = \widehat{h}^{\lambda}_{N,T}$, and then observe that

$$\epsilon_{N,T}(\rho) = \|\bar{w} - \widehat{h}^{\lambda}_{N,T}\|_2 \le \|\bar{w} - \bar{h}_{N,T}\|_2 + \|\bar{h}_{N,T} - \widehat{h}^{\lambda}_{N,T}\|_2 = H_\rho(N+1, \bar{w}) + \|\bar{h}_{N,T} - \widehat{h}^{\lambda}_{N,T}\|_2$$
$$\le H_\rho(N+1, \bar{w}) + \max_{1 \le j \le N+1} \|w_j - \widehat{w}^{\lambda}_{j,T}\|_2 \le H_\rho(N+1, \bar{w}) + \max_{1 \le j \le N+1} \frac{\|w_j - \widehat{w}^{\lambda}_{j,T}\|_{V^{\lambda}_{j,T}}}{\lambda_{\min}^{1/2}(V^{\lambda}_{j,T})} \le H_\rho(N+1, \bar{w}) + \max_{1 \le j \le N+1} \frac{\beta^{\lambda}_j(1/T)}{\lambda_{\min}^{1/2}(V^{\lambda}_{j,T})}.$$

The term $H_\rho(N+1, \bar{w})$ denotes the estimation error of the empirical mean computed from the N+1 task vectors $(w_j)_{j=1}^{N+1}$ relative to the true mean $\bar{w}$. Since the $w_j$ are independent random d-dimensional vectors drawn from $\rho$, we can apply the following vectorial version of Bennett's inequality (see, e.g., Smale & Zhou, 2007, Lemma 2).

Lemma 3. Let $w_1, \dots, w_N$ be N independent random vectors with values in $\mathbb{R}^d$ sampled from the task-distribution $\rho$. Assuming that $\|w_j\| \le S$ for all $j \in [N]$, then for any $0 < \delta < 1$ it holds, with probability at least $1 - \delta$, that

$$H_\rho(N, \bar{w}) \le \frac{2 S \log(2/\delta)}{N} + \sqrt{\frac{2\, \mathrm{Var}_\rho \log(2/\delta)}{N}}.$$

The above lemma says that the error $H_\rho(N, \bar{w})$ goes to zero as N grows to infinity. Therefore the estimation error $\epsilon_{N,t}(\rho)$ is dominated by the variance term $\max_{1 \le j \le N} \beta^{\lambda}_j(1/T)\, \lambda_{\min}^{-1/2}(V^{\lambda}_{j,T})$, associated with the worst past task. Relying on linear regression results (Lai & Wei, 1982), we have that $\lambda_{\min}(V_j) \gtrsim \log T$. Moreover, as $\lambda_{\min}(V^{\lambda}_j) = \lambda + \lambda_{\min}(V_j)$, we observe an increasing sensitivity of the incurred variance to the $\lambda$ parameter for small values of T. Finally, according to our choice $\lambda = 1/(T\, \mathrm{Var}_{\widehat{h}^{\lambda}})$, the suffered variance increases with the variance of our estimator. The latter in turn increases with the variance of the distribution $\rho$, which corresponds to the case in which Assumption 1 tends to be violated.

5. A High Bias Solution

In this section we present an alternative estimator of the true mean $\bar{w}$, inspired by the existing multi-task bandit literature (Gentile et al., 2014; 2017; Soare et al., 2014). This estimator jointly exploits all the samples associated with the past tasks $Z_1, \dots, Z_N$, with the aim of reducing the variance. This is unlike the previous estimator, which considers the ridge-regression estimates $\widehat{w}^{\lambda}_1, \dots, \widehat{w}^{\lambda}_N$ separately in Equation 8. As we will see, this approach reduces the variance but introduces an extra bias.

Before presenting this second approach we need some more notation. We let $\widetilde{V}_{N,t} = \sum_{j=1}^{N} V_{j,T} + V_{N+1,t}$ be the global design matrix combining the design matrices of the past tasks $V_{1,T}, \dots, V_{N,T}$ and the current design matrix $V_{N+1,t}$. Analogously, $\widetilde{b}_{N,t} = \sum_{j=1}^{N} b_{j,T} + b_{N+1,t}$ is the global counterpart of $b_{j,t}$. We denote by $|A| = \sup\{\|Ax\| : x \in \mathbb{R}^d, \|x\| = 1\}$ the norm of a matrix A induced by the norm $\|\cdot\|$ which, if not specified, is the Euclidean norm. Finally, we denote by $\sigma_{\max}(A)$ the largest singular value of the matrix A.

5.1. Global Ridge Regression

In order to reduce the variance, our second approach estimates, at each round t of the newly sampled task N+1, the mean task $\bar{w}$ by a global ridge regression computed over all the available samples:

$$\widehat{h}^{\lambda}_{N,t} = \big(\widetilde{V}^{\lambda}_{N,t-1}\big)^{-1} \widetilde{b}_{N,t-1}. \tag{9}$$
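A minimal sketch of this global estimator, under our own naming: the per-task statistics of all completed tasks and of the current task are pooled into a single regularized design matrix and response vector before solving one ridge-regression problem.

```python
import numpy as np

def global_ridge_bias(past_V, past_b, V_cur, b_cur, lam):
    """Equation (9): pool the per-task statistics V_{j,T} = X_j^T X_j and
    b_{j,T} = X_j^T Y_j of all completed tasks with those of the current
    task, then solve a single regularized least-squares problem."""
    d = V_cur.shape[0]
    V_tilde = lam * np.eye(d) + sum(past_V) + V_cur
    b_tilde = sum(past_b) + b_cur
    return np.linalg.solve(V_tilde, b_tilde)

# Toy usage with two completed tasks and a partially observed current task.
rng = np.random.default_rng(3)
d = 4
Xs = [rng.normal(size=(50, d)) for _ in range(2)]
Ys = [X @ rng.normal(size=d) + 0.1 * rng.normal(size=50) for X in Xs]
past_V = [X.T @ X for X in Xs]
past_b = [X.T @ Y for X, Y in zip(Xs, Ys)]
Xc = rng.normal(size=(10, d)); Yc = rng.normal(size=10)
h_hat = global_ridge_bias(past_V, past_b, Xc.T @ Xc, Xc.T @ Yc, lam=1.0)
```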
Our next result provides a bound on the transfer regret of this strategy.

Theorem 3. (Transfer Regret Bound) Let the assumptions of Lemma 2 hold and let $\widehat{h}^{\lambda}_{N,t}$ be defined as in Equation (9). Then the following upper bound holds:

$$R(T, \rho) \le C d \sqrt{T \log\Big(1 + T^2 L \big(\mathrm{Var}_{\bar{w}} + \epsilon_{N,t}(\rho)\big)\Big)},$$

where the mean estimation error can be bounded as

$$\epsilon_{N,T}(\rho) \le \frac{\lambda^{1/2} S}{\sqrt{\lambda + \nu_{\min}}} + \sqrt{\frac{2}{\lambda + \nu_{\min}} \log\big(T(1 + NTL^2)\big)} + 2(N+1) \max_{1 \le j \le N+1} \widetilde{H}(N+1, w_j) + H_\rho(N+1, \bar{w}),$$

with $\nu_{\min} = \lambda_{\min}(\widetilde{V}_{N,T})$, and where we introduced

$$\widetilde{H}(N, w_j) = H_\rho(j, w_j)\, \sigma_{\max}\big(V_{j,T}\, \widetilde{V}^{-1}_{N,T}\big),$$

a weighted form of the estimation error $H_\rho(j, w_j)$ towards the current task vector $w_j$, where the weights are defined in terms of the task misalignment $\sigma_{\max}(V_{j,T} \widetilde{V}^{-1}_{N,T})$.

The proof is presented in Section D of the appendix. The previous variance term $\beta^{\lambda}_j(1/T)/\lambda_{\min}^{1/2}(V^{\lambda}_{j,T})$ has now been replaced by $\beta^{\lambda}(1/(NT))/\sqrt{\lambda + \nu_{\min}}$. It is easy to observe that $\nu_{\min} \ge \frac{N}{d} \lambda_{\min}(V_j)$ for every $j \in [N]$, which leads to a reduction of the variance by a factor $d/N$, vanishing as N goes to infinity. This gain does not come for free: the approach introduces a potentially high bias, $2(N+1) \max_{j=1,\dots,N+1} \widetilde{H}(N+1, w_j)$, which increases with the task misalignment $\sigma_{\max}(V_{j,T} \widetilde{V}^{-1}_{N,T})$.

5.2. Tasks Misalignment

We now analyze the task misalignment factors appearing in Theorem 3, namely the quantities $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t})$ and $\widetilde{H}(N, w_j)$. For this purpose, we consider two opposite environments of tasks.

In the first case we assume that all the task parameters are equal to each other and far from the zero d-dimensional vector. This scenario, which corresponds to placing all the mass of the task-distribution $\rho$ on a single task parameter w, is clearly in agreement with Assumption 1. We expect this to be the most favorable scenario, since after completing a task we face exactly the same task again and again. In this case, independently of the covariance matrices, whose construction also depends on the decision sets available in the different tasks, it is simple to observe that we do not suffer any bias, that is, $\widetilde{H}(N, w_j) = 0$ for every $j \in [N]$, as all the task parameters are equal to each other.

The second environment is characterized by a task-distribution $\rho$ that is uniform on finitely many orthogonal tasks. For instance, this is the scenario when $\rho$ is uniformly distributed over the scaled standard basis vectors $\{(S, 0, \dots, 0), \dots, (0, \dots, 0, S)\} \subset \mathbb{R}^d$. Differently from the previous scenario, here after completing a task we will most likely face an orthogonal task. It is quite natural to see that this is the most unfavorable case and to expect no transfer between tasks. This is confirmed by the regret bound through the misalignment expressed by the covariance matrices $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t})$. Indeed, since we can have at most d misaligned arms, we obtain the upper bound $d/N$ on the term $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t})$. Based on these observations, we conclude that the larger the cardinality of the set of basis directions induced by the distribution $\rho$, the larger the number of completed tasks required to obtain proper transfer.

We now focus on an intermediate case satisfying Assumption 1. In order to control the term $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t})$, and to allow aligned matrices to be generated when dealing with similar tasks, we introduce an additional mild assumption.

Assumption 2. (Shared Induced Basis) The decision sets are shared among all the tasks, and tasks sampled according to Assumption 1 induce covariance matrices, generated by running the BIAS-OFUL algorithm (Algorithm 1), that share the same basis:

$$V_i = P \Sigma_i P^{\top}, \quad i \in [N]. \tag{10}$$

This assumption is quite mild, as it just states that similar tasks share the same pulled arms, with no restrictions on the pulling frequency.
This is the case, for example, when the decision set is fixed across rounds and tasks, that is, $D_{j,t} = D$ for all $j \in [N]$ and $t \in [T]$, and consists of d orthogonal arms. If Assumption 2 is satisfied, then we obtain the bound $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t}) \le 1$. Furthermore, if we denote by M the number of tasks necessary for the BIAS-OFUL policy to reach a stationary behavior in terms of covariance matrices, then $\sigma_{\max}(V_{j,t} \widetilde{V}^{-1}_{N,t}) \le 1/(N - M)$.

5.3. Smallest Global Eigenvalue $\nu_{\min}$

It only remains to analyze the term $\nu_{\min}$. We observe that it satisfies the lower bound

$$\nu_{\min} = \lambda_{\min}\Big(\sum_{j=1}^{N+1} V_{j,T}\Big) \ge \sum_{j=1}^{N+1} \lambda_{\min}(V_{j,T}) \gtrsim (N+1) \log T,$$

where in the last step we relied on a linear regression result from (Lai & Wei, 1982), which shows that the condition $O(\lambda_{\min}) = \log(\lambda_{\max})$ is required to guarantee asymptotic consistency, itself necessary to have sublinear anytime regret. Since $\min_{j \in [N]} \lambda_{\max}(V_j) = O(T)$, this condition implies $\min_{j \in [N]} \lambda_{\min}(V_j) \gtrsim \log T$.

6. Experiments

In this section we empirically test the effectiveness of the proposed approaches. The theoretical results stated that the method presented in Section 4 does not introduce any bias but may incur an additional variance, depending on the variance of the task-distribution $\mathrm{Var}_\rho$. On the contrary, the solution proposed in Section 5, which jointly uses all the observed samples, reduces the variance (at least) by a factor $d/N$, at the price of an extra bias term.

As mentioned in Section 3, the parameter $w_*$ associated with each single task is unknown, so we cannot compute the gap $\|\widehat{h}^{\lambda} - w_*\|_2$ defining the term $\beta^{h}_t(1/T)$. The main issue is that, according to Equation 4, the algorithm seems to need its exact value in order to pick the next arm. However, we can simply split the norm and rely on the assumption $\|w_*\|_2 \le S$, so as to remove the dependency on $w_*$. Indeed, it is important to emphasize that the real knowledge transfer happens through $\widehat{w}^{h}$, see Equation 5. This can be seen by observing that the gap $\|\widehat{h}^{\lambda} - w_*\|$ affects all the available arms equally.

6.1. Experimental Results

In all the presented experiments the policy OPT knows the parameter $w_j$ associated with task j and picks the next arm as $x_{j,t} = \operatorname{arg\,max}_{x \in D_{j,t}} x^\top w_j$. The policies AVG-OFUL and RR-OFUL implement Algorithms 1 and 2 and estimate $\widehat{h}$ as per Equations 8 and 9, respectively. The Oracle policy knows the mean task parameter $\bar{w}$ and uses it as the bias h in BIAS-OFUL (Corollary 2 (ii)). Analogously, the ITL policy consists of BIAS-OFUL with bias set equal to 0, see Corollary 2 (i). The regularization hyperparameter $\lambda$ was selected over a logarithmic scale. We start with a pair of synthetic experiments showing how the hyperparameter $\lambda$ affects the performance, and then present experiments on two real datasets. We denote by K the size of the decision set D.

Figure 1. Cumulative reward measured after N = 10 tasks and averaged over 10 independent test tasks, with $\lambda = 1$.

Figure 2. Cumulative reward measured after N = 10 tasks and averaged over 10 independent test tasks, with $\lambda = 100$.

Synthetic Data. Similarly to what was done in (Denevi et al., 2019), we first generated an environment of tasks in which running the Oracle policy is expected to outperform the ITL approach. In agreement with Assumption 1, we sample the task vectors from a distribution whose variance is much smaller than its second moment.
That is, each task parameter $w_j$ is sampled from a Gaussian distribution with mean $\bar{w}$ given by the all-ones vector in $\mathbb{R}^d$, and $\mathrm{Var}_\rho = 1$. As for the decision set, we first generate a random square matrix P of size d and then compute its QR factorization P = QR, where Q has orthonormal columns and R is upper triangular. We then associate to each base arm the direction given by a column of Q. This guarantees arms that are almost orthogonal to each other. Finally, at each round $t \in [T]$ the decision set $D_t$ is initialized as a set of K random vectors that are first shifted towards their respective base-arm directions and then normalized. Notice that this generation mechanism avoids any inductive bias between the task vectors and the arm vectors, as they are actually independent. Each task consists of T = 50 rounds, with K = 5 arms of dimension d = 20. In order to generate the rewards, we first compute the inner product between the user (task) vector and the arm (input) vector, shift the resulting output to the interval [0, 1], and then add Gaussian noise N(0.5, 1). Finally, we assign reward 1 to the arm with the maximum final reward and 0 to the others.

In Figures 1 and 2 we report the results generated with $\lambda = 1$ and $\lambda = 100$, respectively. It is easy to observe that the stronger the regularization, the closer AVG-OFUL gets to the Oracle. Conversely, RR-OFUL is penalized as $\lambda$ increases, due to its bias.

Last.fm Data. The first dataset we consider is extracted from the music streaming service Last.fm (Cantador et al., 2011). It contains 1892 users and 17632 artists, together with information about the artists listened to by each user, which we used to define the payoff function. We first removed from the set of items those with fewer than 30 ratings and then repeated the same procedure for the users. This operation yields a user-rating matrix of size 741 x 538. Starting from this reduced matrix, we derived the arm and user vectors by computing an SVD decomposition, keeping only the first d = 10 features associated with the users and the items. In order to consider tasks satisfying Assumption 1, we randomly pick a user and compute the set of its N = 20 most similar users according to the l2-distance between their vectors. Each task lasts T = 5 rounds and consists of K = 5 arms. At each round t, the decision set consists of one arm whose rating was at least 4 and K - 1 arms whose ratings were at most 3. The rewards were then generated analogously to the synthetic case. The Oracle policy knows $\bar{w}$, computed as the average of the N = 20 considered user vectors. In Figures 3 and 4 we display the cumulative regret suffered with respect to the optimal policy, which during each task $j \in [N]$ knows the true user parameter $w_j$. The vertical yellow lines indicate the end of each task. From the presented results we observe that both proposed policies, AVG-OFUL and RR-OFUL, outperform the ITL approach, while the Oracle policy is consistent with Corollary 2 and Assumption 1.
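As an illustration of the described preprocessing (a sketch with hypothetical variable names, not the authors' code), the d = 10 user and item features can be obtained from a truncated SVD of the filtered rating matrix:

```python
import numpy as np

def svd_features(ratings, d=10):
    """Derive d-dimensional user and item vectors from a (users x items)
    rating matrix via a rank-d truncated SVD, as described for Last.fm.
    Splitting sqrt(s) between the two factors is one common convention."""
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    user_vecs = U[:, :d] * np.sqrt(s[:d])     # one row per user
    item_vecs = Vt[:d, :].T * np.sqrt(s[:d])  # one row per item (arm)
    return user_vecs, item_vecs

# Toy usage: a random stand-in for the filtered 741 x 538 rating matrix.
rng = np.random.default_rng(4)
ratings = rng.integers(0, 6, size=(100, 80)).astype(float)
users, items = svd_features(ratings, d=10)

# Similar users (candidate tasks) can then be selected by l2-distance.
anchor = users[0]
closest = np.argsort(np.linalg.norm(users - anchor, axis=1))[1:21]
```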
Movielens. Here we consider the Movielens data (Harper & Konstan, 2015), which contains 1M anonymous ratings of approximately 3900 movies made by 6040 users. As before, we first removed from the set of movies those with fewer than 500 ratings, and from the set of users those with fewer than 200 rated movies. This preprocessing yields a user-rating matrix of size 847 x 618. Unlike the Last.fm case, here adopting SVD to generate the arm/user vectors does not seem appropriate: by inspecting the retrieved singular values, we could not find a subspace providing a good approximation of the real ratings unless we keep all the latent features. Therefore, in order to find a set of similar users, we obtained better results by using the KMeans clustering algorithm on the user vectors. The results displayed in Figure 4 were generated by running KMeans with C = 20 clusters on user vectors of size d = 10. We then filtered out clusterings with a silhouette value lower than 0.15 and, within each retained clustering, discarded clusters with fewer than 20 users. Furthermore, in order to make the tasks simpler, we reduced the variance of the noisy components affecting the rewards to 0.1. The difficulty in finding a valid set of similar tasks yields a high task misalignment, which is confirmed by the fact that the best performance occurs for small values of $\lambda$; indeed, Figure 4 considers $\lambda = 1$. Here the AVG-OFUL policy behaves almost identically to the ITL approach; conversely, the task misalignment degrades the performance of the RR-OFUL policy, confirming its higher sensitivity to task dissimilarity (see Theorem 3).

Figure 3. Empirical transfer regret associated with Last.fm.

Figure 4. Empirical transfer regret associated with Movielens.

7. Conclusions and Future Work

In this work we studied a meta-learning framework with stochastic linear bandit tasks. We first introduced a novel regularized version of OFUL, where the regularization depends on the Euclidean distance to a bias vector. We showed that setting the bias appropriately leads to a substantial improvement compared to learning each task in isolation. This observation motivated two alternative approaches to estimate the bias: while the first may suffer a potentially high variance, the second might incur a strong bias. In the future, it would be valuable to investigate the existence of unbiased estimators which do not suffer any variance. Furthermore, while in our analysis we set $\lambda = 1/(T\,\mathrm{Var}_h)$, it would also be interesting to learn its value as part of the learning problem. Experimentally, we observed that when Assumption 1 is satisfied, adopting the unbiased estimator yields better results than the second, biased one. Another direction for future research is to extend other meta-learning approaches, such as those based on feature sharing, to the bandit setting. Finally, a problem which remains to be studied is the combination of meta-learning with non-stochastic bandits.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pp. 2312-2320, USA, 2011. Curran Associates Inc. ISBN 978-1-61839-599-3. URL http://dl.acm.org/citation.cfm?id=2986459.2986717.
Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 261-269. PMLR, 2017. URL http://proceedings.mlr.press/v54/alquier17a.html.

Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6:1817-1853, 2005. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1046920.1194905.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397-422, 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944941.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235-256, 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352.

Awerbuch, B. and Kleinberg, R. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97-114, 2008.

Azar, M. G., Lazaric, A., and Brunskill, E. Sequential transfer in multi-armed bandit with finite set of models. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pp. 2220-2228, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999792.2999860.

Balcan, M.-F., Khodak, M., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pp. 424-433, 2019.

Baxter, J. A model of inductive bias learning. J. Artif. Int. Res., 12(1):149-198, 2000. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=1622248.1622254.

Bogers, T. Movie recommendation using random walks over the contextual graph. In Proceedings of the 2nd International Workshop on Context-Aware Recommender Systems, 2010.

Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Calandriello, D., Lazaric, A., and Restelli, M. Sparse multi-task reinforcement learning. In Advances in Neural Information Processing Systems 26, Montreal, Canada, 2014. URL https://hal.inria.fr/hal-01073513.

Cantador, I., Brusilovsky, P., and Kuflik, T. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM.

Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. J. Mach. Learn. Res., 11:2901-2934, 2010. ISSN 1532-4435.

Cella, L. and Cesa-Bianchi, N. Stochastic bandits with delay-dependent payoffs. arXiv preprint arXiv:1910.02757, 2019.

Cesa-Bianchi, N. Multi-armed Bandit Problem, pp. 1356-1359. Springer New York, New York, NY, 2016. ISBN 978-1-4939-2864-4. doi: 10.1007/978-1-4939-2864-4_768. URL https://doi.org/10.1007/978-1-4939-2864-4_768.

Cesa-Bianchi, N., Gentile, C., and Zappella, G. A gang of bandits. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pp. 737-745, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999611.2999694.
Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Gordon, G., Dunson, D., and Dudík, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 208-214. PMLR, 2011. URL http://proceedings.mlr.press/v15/chu11a.html.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10169-10179. Curran Associates, Inc., 2018a.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018b.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1566-1575. PMLR, 2019. URL http://proceedings.mlr.press/v97/denevi19a.html.

Deshmukh, A. A., Dogan, U., and Scott, C. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems, pp. 4848-4856, 2017.

Gentile, C., Li, S., and Zappella, G. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. II-757-II-765. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3044977.

Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., and Etrue, E. On context-dependent clustering of bandits. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1253-1262. PMLR, 2017. URL http://proceedings.mlr.press/v70/gentile17a.html.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), 2015. ISSN 2160-6455. doi: 10.1145/2827872.

Kuzborskij, I., Cella, L., and Cesa-Bianchi, N. Efficient linear bandits through matrix sketching. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 177-185. PMLR, 2019. URL http://proceedings.mlr.press/v89/kuzborskij19a.html.

Lai, T. L. and Wei, C. Z. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10:154-166, 1982.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670, 2010.

Liu, B., Wei, Y., Zhang, Y., Yan, Z., and Yang, Q. Transferable contextual bandit for cross-domain recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Maurer, A. and Pontil, M. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pp. 55-76, 2013.

Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, pp. II-343-II-351. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3042932.
Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. J. Mach. Learn. Res., 17(1):2853-2884, 2016. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2946645.3007034.

Pentina, A. and Urner, R. Lifelong learning with weighted majority votes. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3612-3620. Curran Associates, Inc., 2016.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.

Siegmund, D. Herbert Robbins and sequential analysis. Annals of Statistics, pp. 349-365, 2003.

Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153-172, 2007.

Soare, M., Alsharif, O., Lazaric, A., and Pineau, J. Multi-task linear bandits. In NIPS'14 Workshop on Transfer and Multi-task Learning, 2014.

Villar, S. S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.

Zhang, J. and Bareinboim, E. Transfer learning in multi-armed bandits: A causal approach. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 1340-1346. AAAI Press, 2017. ISBN 978-0-9992411-0-3.