# Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

Giulia Denevi 1,2, Carlo Ciliberto 3,4, Riccardo Grazzi 1,4, Massimiliano Pontil 1,4

1 Istituto Italiano di Tecnologia, Genoa, Italy; 2 University of Genoa, Genoa, Italy; 3 Imperial College London, London, United Kingdom; 4 University College London, London, United Kingdom. Correspondence to: Giulia Denevi.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We study the problem of learning-to-learn: inferring a learning algorithm that works well on a family of tasks sampled from an unknown distribution. As the class of algorithms we consider Stochastic Gradient Descent (SGD) on the true risk regularized by the square Euclidean distance from a bias vector. We present an average excess risk bound for such a learning algorithm that quantifies the potential benefit of using a bias vector with respect to the unbiased case. We then propose a novel meta-algorithm to estimate the bias term online from a sequence of observed tasks. The small memory footprint and low time complexity of our approach make it appealing in practice, while our theoretical analysis provides guarantees on the generalization properties of the meta-algorithm on new tasks. A key feature of our results is that, when the number of tasks grows and their variance is relatively small, our learning-to-learn approach has a significant advantage over learning each task in isolation by standard SGD without a bias term. Numerical experiments demonstrate the effectiveness of our approach in practice.

1. Introduction

The problem of learning-to-learn (LTL) (Baxter, 2000; Thrun & Pratt, 1998) is receiving increasing attention in recent years, due to its practical importance (Finn et al., 2017; Franceschi et al., 2018; Ravi & Larochelle, 2017) and the theoretical challenge of statistically principled and efficient solutions (Alquier et al., 2017; Balcan et al., 2015; Maurer et al., 2016; Pentina & Lampert, 2014; Denevi et al., 2018a;b; Gupta & Roughgarden, 2017). The principal aim of LTL is to design a meta-learning algorithm to select a supervised learning algorithm that is well suited to learn tasks from a prescribed family. To highlight the difference between the meta-learning algorithm and the learning algorithm, throughout the paper we will refer to the latter as the inner or within-task algorithm.

The meta-algorithm is trained from a sequence of datasets associated with different learning tasks sampled from a meta-distribution (also called environment in the literature). The performance of the selected inner algorithm is measured by the transfer risk (Baxter, 2000; Maurer, 2005), that is, the average risk of the algorithm, trained on a random dataset, over the tasks of the same environment. A key insight is that, when the learning tasks share specific similarities, the LTL framework provides a means to leverage such similarities and select an inner algorithm of low transfer risk.

In this work, we consider environments of linear regression or binary classification tasks and we assume that the associated weight vectors are all close to a common vector. Because of the increasing interest in low computational complexity procedures, we focus on the family of within-task algorithms given by Stochastic Gradient Descent (SGD) working on the regularized true risk.
Specifically, motivated by the above assumption on the environment, we consider as regularizer the square distance of the weight vector to a bias vector, playing the role of a common mean among the tasks. Knowledge of this common mean can substantially facilitate the inner algorithm, and the main goal of this paper is to design a meta-algorithm to learn a good bias that is supported by both computational and statistical guarantees.

Contributions. The first contribution of this work is to show that, when the variance of the tasks' weight vectors sampled from the environment is small, SGD regularized with the right bias yields a model with smaller error than its unbiased counterpart. The latter approach does not exploit the relatedness among the tasks, and it corresponds to learning the tasks in isolation, also known as independent task learning (ITL). The second and principal contribution of this work is to propose a meta-algorithm that estimates the bias term, so that the transfer risk of the corresponding SGD algorithm is as small as possible. We consider the setting in which we receive in input a sequence of datasets and we propose an online meta-algorithm which efficiently updates the bias term used by the inner SGD algorithm. Our meta-algorithm consists in applying a (meta) SGD algorithm to a proxy of the transfer risk, given by the expected minimum regularized empirical risk of a task. We provide a bound on the statistical performance of the biased inner SGD algorithm found by our procedure. It establishes that, when the number of observed tasks grows and the variance of the tasks' weight vectors is significantly smaller than their second moment, then running the inner SGD algorithm with the estimated bias brings an improvement in comparison to learning the tasks in isolation with no bias. The bound is coherent with state-of-the-art LTL analyses for other families of algorithms, but it applies for the first time to a fully online meta-algorithm. Our results hold for Lipschitz loss functions both in the regression and in the binary classification setting. Our proof technique combines ideas from online learning, stochastic and convex optimization, with tools from LTL. A key insight in our approach is to exploit the inner SGD algorithm to compute an approximate subgradient of the surrogate objective, in such a way that the degree of approximation can be controlled without affecting the overall performance or the computational cost of the meta-algorithm.

Paper Organization. We start by recalling in Sec. 2 the basic concepts of LTL. In Sec. 3 we cast the problem of choosing the right bias term in SGD on the regularized objective in the LTL framework. Thanks to this formulation, in Sec. 4 we characterize the situations in which SGD with the right bias term is beneficial in comparison to SGD with no bias. In Sec. 5 we propose an online meta-algorithm to estimate the bias vector from a sequence of datasets and we analyze its statistical properties. In Sec. 6 we report on the empirical performance of the proposed approach, while in Sec. 7 we discuss future research directions.

Previous Work. Online LTL (Alquier et al., 2017; Denevi et al., 2018a;b; Pentina & Urner, 2016) has received limited attention and is less developed than standard LTL approaches, in which the data are processed in one batch as opposed to incrementally, see for instance (Baxter, 2000; Maurer, 2009; Maurer et al., 2013; 2016; Pentina & Lampert, 2014).
The idea of introducing a bias in the learning algorithm is not new, see e.g. (Denevi et al., 2018b; Kuzborskij & Orabona, 2017; Pentina & Lampert, 2014) and Sec. 3. In this work, we consider the family of inner SGD algorithms with biased regularization and we develop a theoretically grounded meta-learning algorithm to find the bias. Differently from other online methods (Alquier et al., 2017; Denevi et al., 2018a), our approach does not need to keep previous training points in memory and it runs online both across and within the tasks. As a result, low space and time complexity are both strengths of our method. We finally point out the recent related work by Khodak et al. (2019) and Finn et al. (2019), which was brought to our attention after completion of the present work.

2. Preliminaries

In this section, we recall the standard supervised (i.e. single-task) learning setting and the learning-to-learn setting. We first introduce some notation used throughout this work.

Let Z = X × Y be the data space, where X ⊆ ℝ^d and Y ⊆ ℝ (regression) or Y = {−1, +1} (binary classification). We consider linear supervised learning tasks µ, namely distributions over Z, parametrized by a weight vector w ∈ ℝ^d. We measure the performance by a loss function ℓ : Y × Y → ℝ₊ such that, for any y ∈ Y, ℓ(·, y) is convex and closed. Finally, for any positive k ∈ ℕ, we let [k] = {1, . . . , k} and we denote by ⟨·, ·⟩ and ‖·‖ the standard inner product and Euclidean norm. In the rest of this work, when specified, we make the following assumptions.

Assumption 1 (Bounded Inputs). Let X ⊆ B(0, R), where B(0, R) = {x ∈ ℝ^d : ‖x‖ ≤ R}, for some radius R ≥ 0.

Assumption 2 (Lipschitz Loss). Let ℓ(·, y) be L-Lipschitz for any y ∈ Y.

For example, for any y, ŷ ∈ Y, the absolute loss ℓ(ŷ, y) = |ŷ − y| and the hinge loss ℓ(ŷ, y) = max{0, 1 − y ŷ} are both 1-Lipschitz. We now briefly recall the main notion of single-task learning.

2.1. Single-Task Learning

In standard linear supervised learning, the goal is to learn a linear functional relation f_w : X → Y, f_w(·) = ⟨·, w⟩, between the input space X and the output space Y. This target can be reformulated as that of finding a weight vector w_µ minimizing the expected risk (or true risk)

R_µ(w) = E_{(x,y)∼µ} ℓ(⟨x, w⟩, y)    (1)

over the entire space ℝ^d. The expected risk measures the prediction error that the weight vector w incurs on average with respect to points sampled from the distribution µ. In practice, the task µ is unknown and only partially observed by a corresponding dataset of n i.i.d. points Z_n = (z_i)_{i=1}^n ∼ µ^n, where, for every i ∈ [n], z_i = (x_i, y_i) ∈ Z. In the sequel, we often use the more compact notation Z_n = (X_n, y_n), where X_n ∈ ℝ^{n×d} is the matrix containing the input vectors x_i as rows and y_n ∈ ℝ^n is the vector with entries given by the labels y_i. A learning algorithm is a function A : ∪_{n∈ℕ} Z^n → ℝ^d that, given such a training dataset Z_n ∈ Z^n, returns a good estimator, that is, in our case, a weight vector A(Z_n) ∈ ℝ^d whose expected risk is small and tends to the minimum of Eq. (1) as n increases.

2.2. Learning-to-Learn (LTL)

In the LTL framework, we assume that each learning task µ we observe is sampled from an environment ρ, that is, a (meta-)distribution on the set of probability distributions on Z. The goal is to select a learning algorithm (hence the name learning-to-learn) that is well suited to the environment. Specifically, we consider the following setting.
We receive a stream of tasks µ_1, . . . , µ_T, which are independently sampled from the environment ρ and only partially observed by corresponding i.i.d. datasets Z_n^(1), . . . , Z_n^(T), . . ., each formed by n datapoints. Starting from these datasets, we wish to learn an algorithm A such that, when we apply it on a new dataset (composed by n points) sampled from a new task µ ∼ ρ, the corresponding true risk is low. We reformulate this target into requiring that the algorithm A, trained with n points¹ over the environment ρ, has small transfer risk

E_n(A) = E_{µ∼ρ} E_{Z_n∼µ^n} R_µ(A(Z_n)).    (2)

The transfer risk measures the expected true risk that the inner algorithm A, trained on the dataset Z_n, incurs on average with respect to the distribution of tasks µ sampled from ρ. Therefore, the process of learning a learning algorithm is a meta-learning one, in that the inner learning algorithm is applied to tasks from the environment and is chosen from a sequence of training tasks (datasets) in an attempt to minimize the transfer risk.

¹ In order to simplify the presentation, we assume that all datasets are composed by the same number of points n. The general setting can be addressed by introducing the slightly different notion of transfer risk E(A) = E_{(n,µ)} E_{Z_n∼µ^n} R_µ(A(Z_n)).

3. SGD on the Biased Regularized Risk

In this section, we introduce the LTL framework for the family of within-task algorithms we analyze in this work. We consider a family of learning algorithms A_h parametrized by a bias vector h ∈ ℝ^d. The idea of introducing a bias in a specific family of learning algorithms is not new in the LTL literature, see e.g. (Denevi et al., 2018b; Kuzborskij & Orabona, 2017; Pentina & Lampert, 2014) and references therein. A natural choice is given by regularized empirical risk minimization with a bias vector in the square norm regularizer, which we simply refer to as ERM throughout, namely

A_h^ERM(Z_n) = w_h(Z_n) = argmin_{w∈ℝ^d} R_{Z_n,h}(w),    (3)

where, for any w, h ∈ ℝ^d and λ > 0, we have defined the empirical error and its biased regularized version as

R_{Z_n}(w) = (1/n) Σ_{i=1}^n ℓ(⟨x_i, w⟩, y_i),    R_{Z_n,h}(w) = R_{Z_n}(w) + (λ/2) ‖w − h‖².

Intuitively, if the weight vectors w_µ of the tasks sampled from ρ are close to each other, then running ERM with h = m = E_{µ∼ρ} w_µ should have a smaller transfer risk than running ERM with, for instance, h = 0. We make this statement precise in Sec. 4.

Recently, a number of papers have considered how to learn a good bias h in a LTL setting, see e.g. (Pentina & Lampert, 2014; Denevi et al., 2018b). However, one drawback of these works is that they assume the ERM solution to be known exactly, without leveraging the interplay between the optimization and the generalization error. Furthermore, in LTL settings, data naturally arise in an online manner, both between and within tasks. Hence, an ideal LTL approach should focus on inner algorithms processing one single data point at a time. Motivated by the above reasoning, in this work we propose to analyze an online learning algorithm that is computationally and memory efficient while retaining (on average with respect to the sampling of the data) the same statistical guarantees as the more expensive ERM estimator.

Algorithm 1 Within-Task Algorithm: SGD on the Biased Regularized True Risk
Input: λ > 0 regularization parameter, h bias, µ task
Initialization: w_h^(1) = h
For k = 1 to n:
  Receive (x_k, y_k) ∼ µ
  Build ℓ_{k,h}(·) = ℓ_k(⟨x_k, ·⟩) + (λ/2) ‖· − h‖²
  Define γ_k = 1/(kλ)
  Pick u'_k ∈ ∂ℓ_k(⟨x_k, w_h^(k)⟩)
  Define s_k = x_k u'_k + λ (w_h^(k) − h) ∈ ∂ℓ_{k,h}(w_h^(k))
  Update w_h^(k+1) = w_h^(k) − γ_k s_k
Return (w_h^(k))_{k=1}^{n+1} and w̄_h = (1/n) Σ_{k=1}^n w_h^(k)
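To make the within-task routine concrete, the following Python sketch implements the steps of Alg. 1 for the absolute loss, a 1-Lipschitz example covered by Asm. 2. The function name inner_sgd, the streaming interface and the choice of loss are illustrative assumptions, not part of the paper.

```python
import numpy as np

def inner_sgd(stream, h, lam, n):
    """Sketch of Alg. 1: SGD on the biased regularized true risk (absolute loss).

    stream -- iterator yielding n pairs (x_k, y_k) sampled i.i.d. from the task
    h      -- bias vector, numpy array of shape (d,)
    lam    -- regularization parameter lambda > 0
    n      -- number of within-task steps

    Returns the averaged iterate w_bar (the output A_h(Z_n)) and the last
    iterate w^(n+1), which is reused later to approximate the meta-gradient.
    """
    w = h.copy()                          # initialization w^(1) = h
    w_sum = np.zeros_like(h)
    for k in range(1, n + 1):
        x, y = next(stream)
        gamma = 1.0 / (k * lam)           # step size gamma_k = 1 / (k * lambda)
        u_prime = np.sign(x @ w - y)      # subgradient of u -> |u - y| at u = <x, w>
        s = u_prime * x + lam * (w - h)   # s_k in the subdifferential of the regularized loss
        w_sum += w                        # accumulate w^(k) before the update
        w = w - gamma * s                 # w^(k+1) = w^(k) - gamma_k * s_k
    return w_sum / n, w                   # (average of the first n iterates, last iterate)
```

When the routine is applied to a dataset Z_n ∼ µ^n, the stream is simply one pass over its n points, which is what makes Alg. 1 at the same time an online subgradient method on R_{Z_n,h} and SGD on the regularized true risk, as discussed next.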
Specifically, for a training dataset Z_n ∼ µ^n, a regularization parameter λ > 0 and a bias vector h ∈ ℝ^d, we consider the learning algorithm defined as

A_h(Z_n) = w̄_h(Z_n),    (5)

where w̄_h(Z_n) is the average of the first n iterations of Alg. 1, in which, for any k ∈ [n], we have introduced the notation ℓ_k(·) = ℓ(·, y_k). Alg. 1 coincides with the online subgradient algorithm applied to the strongly convex function R_{Z_n,h}. Moreover, thanks to the assumption that Z_n ∼ µ^n, Alg. 1 is equivalent to SGD applied to the regularized true risk

R_{µ,h}(w) = R_µ(w) + (λ/2) ‖w − h‖².    (6)

Relying on a standard online-to-batch analysis, see e.g. (Cesa-Bianchi et al., 2004; Hazan, 2016) and references therein, it is easy to link the true error of such an algorithm with the minimum of the regularized empirical risk, that is, R_{Z_n,h}(w_h(Z_n)). This fact is reported in the proposition below and it will be often used in our subsequent statistical analysis. We give a proof in App. F for completeness.

Proposition 1. Let Asm. 1 and Asm. 2 hold and let w̄_h be the output of Alg. 1. Then, we have that

E_{Z_n∼µ^n} [ R_{µ,h}(w̄_h(Z_n)) − R_{Z_n,h}(w_h(Z_n)) ] ≤ c_{n,λ},    c_{n,λ} = 2R²L² (log n + 1) / (λn).

We remark that, at this level of the analysis, one may avoid the logarithmic factor in the above bound, see e.g. (Shamir & Zhang, 2013; Rakhlin et al., 2012; Lacoste-Julien et al., 2012). However, in order not to complicate our presentation and proofs, we avoid this refinement of the analysis.

In the next section we study the impact of the bias vector on the statistical performance of the inner algorithm. Specifically, we investigate circumstances under which there is an advantage in perturbing the regularization in the objective used by the algorithm with an appropriate ideal bias term h, as opposed to fixing h = 0. Throughout the paper, we refer to the choice h = 0 as independent task learning (ITL), although, strictly speaking, when h is fixed in advance, SGD is applied on each task independently regardless of the value of h. Then, in Sec. 5 we address the question of estimating this appropriate bias from the data.

4. The Advantage of the Right Bias Term

In this section, we study the statistical performance of the model w̄_h returned by Alg. 1, on average with respect to the tasks sampled from the environment ρ, for different choices of the bias vector h. To present our observations, we require, for any µ ∼ ρ, that the corresponding true risk admits minimizers, and we denote by w_µ the minimum norm minimizer². With these ingredients, we introduce the oracle

E = E_{µ∼ρ} R_µ(w_µ),

representing the averaged minimum error over the environment of tasks, and, for a candidate bias h, we give a bound on the quantity E_n(w̄_h) − E. This gap coincides with the averaged excess risk of Alg. 1 with bias h over the environment of tasks, that is,

E_n(w̄_h) − E = E_{µ∼ρ} E_{Z_n∼µ^n} [ R_µ(w̄_h(Z_n)) − R_µ(w_µ) ].

² This choice is made in order to simplify our presentation. However, our analysis holds for different choices of a minimizer w_µ, which may potentially improve our bounds.

Hence, this quantity is an indicator of the performance of the bias h with respect to our environment. In the rest of this section, we study the above gap for a bias h which is fixed and does not depend on the data. For this purpose, we introduce the notation

Var_h² = E_{µ∼ρ} ‖w_µ − h‖²

and we observe that m = E_{µ∼ρ} w_µ = argmin_{h∈ℝ^d} Var_h².

Theorem 2 (Excess Transfer Risk Bound for a Fixed Bias h). Let Asm. 1 and Asm. 2 hold and let w̄_h be the output of Alg. 1 with regularization parameter

λ = (2RL / Var_h) √( (log n + 1) / n ).    (10)

Then, the following bound holds:

E_n(w̄_h) − E ≤ Var_h 2RL √( (log n + 1) / n ).
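The proof below bounds the excess risk by the sum of an optimization term, c_{n,λ} from Prop. 1, and a bias term, (λ/2)Var_h², obtained after averaging over the environment. The choice of λ in Eq. (10) balances the two; the following display sketches that calculation, with the constants inherited from the statements above.

```latex
% Balancing the two error sources appearing in the proof of Thm. 2:
%   term A (optimization):  c_{n,\lambda} = \frac{2R^2L^2(\log n + 1)}{\lambda n},
%   term B (bias, averaged over \mu \sim \rho):  \frac{\lambda}{2}\,\mathrm{Var}_h^2 .
\min_{\lambda > 0}\;\frac{2R^2L^2(\log n + 1)}{\lambda n} + \frac{\lambda}{2}\,\mathrm{Var}_h^2
  \;=\; 2RL\,\mathrm{Var}_h\sqrt{\frac{\log n + 1}{n}},
\qquad
\lambda^{\star} \;=\; \frac{2RL}{\mathrm{Var}_h}\sqrt{\frac{\log n + 1}{n}} .
```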
Proof. For µ ∼ ρ, consider the following decomposition:

R_µ(w̄_h(Z_n)) − R_µ(w_µ) = A + B,    (12)

where A and B are respectively defined by

A = R_µ(w̄_h(Z_n)) − R_{Z_n,h}(w_h(Z_n)),    B = R_{Z_n,h}(w_h(Z_n)) − R_µ(w_µ).

In order to bound the term A, we use Prop. 1. Regarding the term B, we exploit the definition of the ERM algorithm and the fact that, since w_µ does not depend on Z_n, R_{µ,h}(w_µ) = E_{Z_n∼µ^n} R_{Z_n,h}(w_µ). Consequently, we can upper bound the expectation of the term B as

E_{Z_n∼µ^n} [B] = E_{Z_n∼µ^n} [ R_{Z_n,h}(w_h(Z_n)) ] − R_{µ,h}(w_µ) + (λ/2) ‖w_µ − h‖²
             = E_{Z_n∼µ^n} [ R_{Z_n,h}(w_h(Z_n)) − R_{Z_n,h}(w_µ) ] + (λ/2) ‖w_µ − h‖²
             ≤ (λ/2) ‖w_µ − h‖².    (14)

The desired statement follows by combining the above bounds on the two terms, taking the average with respect to µ ∼ ρ and optimizing over λ.

Thm. 2 shows that the strength of the regularization that one should use in the within-task algorithm Alg. 1 is inversely proportional to both the variance of the bias h and the number of points in the datasets. This is exactly in line with the LTL aim: when solving each task is difficult, knowing a priori a good bias can bring a substantial benefit over learning with no bias. To further investigate this point, in the following corollary we specialize Thm. 2 to two particular choices of the bias h. The first choice is h = 0, which coincides, as remarked earlier, with learning each task independently, while the second choice considers an ideal bias: assuming that the transfer risk admits a minimizer, we set h = h_n ∈ argmin_{h∈ℝ^d} E_n(w̄_h).

Corollary 3 (Excess Transfer Risk Bound for ITL and the Oracle). Let Asm. 1 and Asm. 2 hold.

1. Independent Task Learning. Let w̄_0 be the output of Alg. 1 with bias h = 0 and regularization parameter as in Eq. (10) with h = 0. Then,

E_n(w̄_0) − E ≤ Var_0 2RL √( (log n + 1) / n ).

2. The Oracle. Let w̄_{h_n} be the output of Alg. 1 with bias h = h_n and regularization parameter as in Eq. (10) with h = m. Then,

E_n(w̄_{h_n}) − E ≤ Var_m 2RL √( (log n + 1) / n ).

Proof. The first statement follows directly from the application of Thm. 2 with h = 0. The second statement is a direct consequence of the definition of h_n, which implies E_n(w̄_{h_n}) − E ≤ E_n(w̄_m) − E, and of the application of Thm. 2 with h = m to the second term.

From the previous bounds we can observe that using the bias h = h_n in the regularizer brings a substantial benefit with respect to the unbiased case when the number of points n in each dataset is not very large (hence learning each task is quite difficult) and the variance of the tasks' weight vectors sampled from the environment is much smaller than their second moment, i.e. when

Var_m² = E_{µ∼ρ} ‖w_µ − m‖² ≪ E_{µ∼ρ} ‖w_µ‖² = Var_0².

Driven by this observation, when the environment of tasks satisfies the above characteristics, we would like to take advantage of this task similarity. But, since in practice we are not able to explicitly compute h_n, in the following we propose an efficient online LTL approach to estimate the bias directly from the observed sequence of datasets.

5. Estimating the Bias

In this section, we study the problem of designing an estimator for the bias vector that is computed incrementally from a set of T observed tasks.

5.1. The Meta-Objective

Since direct optimization of the transfer risk is not feasible, a standard strategy used in LTL consists in introducing a proxy objective that is easier to handle, see e.g. (Maurer, 2005; 2009; Maurer et al., 2013; 2016; Denevi et al., 2018a;b). In this paper, motivated by Prop. 1, according to which the true error of Alg. 1 is, in expectation, within c_{n,λ} of the minimum of the regularized empirical risk, we substitute in the definition of the transfer risk the true risk of the algorithm R_µ with the minimum of the regularized empirical risk

L_{Z_n}(h) = min_{w∈ℝ^d} R_{Z_n,h}(w) = R_{Z_n,h}(w_h(Z_n)).    (15)
This leads us to the following proxy for the transfer risk:

Ê_n(h) = E_{µ∼ρ} E_{Z_n∼µ^n} L_{Z_n}(h).    (16)

Some remarks about this choice are in order. First, convexity is usually a rare property in LTL. In our case, as described in the following proposition, the definition of the function L_{Z_n} as the partial minimum of a jointly convex function ensures convexity and other nice properties, such as differentiability and a closed expression for its gradient.

Proposition 4 (Properties of L_{Z_n}). The function L_{Z_n} in Eq. (15) is convex and λ-smooth over ℝ^d. Moreover, for any h ∈ ℝ^d, its gradient is given by the formula

∇L_{Z_n}(h) = λ ( h − w_h(Z_n) ),    (17)

where w_h(Z_n) is the ERM algorithm in Eq. (3). Finally, when Asm. 1 and Asm. 2 hold, L_{Z_n} is LR-Lipschitz.

The above statement is a known result in the optimization community, see e.g. (Bauschke & Combettes, 2011, Prop. 12.29) and App. C for more details. In order to minimize the proxy objective in Eq. (16), one standard choice made in stochastic optimization, and also adopted in this work, is to use first-order methods, which require the computation of an unbiased estimate of the gradient of the stochastic objective. In our case, according to the above proposition, this step would require computing the minimizer of the regularized empirical problem in Eq. (15) exactly. A key observation of our work is to show below that we can easily design a satisfactory approximation (see the last paragraph in Sec. 5) of this gradient, simply by substituting the minimizer w_h(Z_n) in the expression of the gradient in Eq. (17) with the last iterate w_h^(n+1)(Z_n) of Alg. 1. An important aspect to stress here is the fact that this strategy does not require any additional computational effort. Formally, this reasoning is linked to the concept of ε-subgradient of a function. We recall that, for a given convex, proper and closed function f and for a given point ĥ ∈ Dom(f) in its domain, u is an ε-subgradient of f at ĥ if, for any h, f(h) ≥ f(ĥ) + ⟨u, h − ĥ⟩ − ε.

Proposition 5 (An ε-Subgradient for L_{Z_n}). Let w_h^(n+1)(Z_n) be the last iterate of Alg. 1. Then, under Asm. 1 and Asm. 2, the vector

∇̂L_{Z_n}(h) = λ ( h − w_h^(n+1)(Z_n) )

is an ε-subgradient of L_{Z_n} at point h, where ε is controlled by the accuracy with which the last iterate of Alg. 1 solves the regularized empirical problem in Eq. (15). Moreover, introducing Δ_{Z_n}(h) = ∇L_{Z_n}(h) − ∇̂L_{Z_n}(h), the expected norm of Δ_{Z_n}(h) is controlled accordingly.

The above result is a key tool in our analysis. The proof requires some preliminaries on the ε-subdifferential of a function (see App. A) and the introduction of the dual formulation of both the within-task learning problem and Alg. 1 (see App. B and App. E, respectively). With these two ingredients, the proof of the statement is deduced in App. E.3 by the application of a more general result reported in App. D, describing how an ε-minimizer of the dual of the within-task learning problem can be exploited in order to build an ε-subgradient of the meta-objective function L_{Z_n}. We stress that this result could be applied to a more general class of algorithms, going beyond Alg. 1 considered here.

5.2. The Meta-Algorithm to Estimate the Bias h

In order to estimate the bias h from the data, we apply SGD to the stochastic function Ê_n introduced in Eq. (16). More precisely, in our setting, the sampling of a meta-point corresponds to the incremental sampling of a dataset from the environment³. We refer to Alg. 2 for more details. In particular, we propose to take the estimator h̄_T obtained by averaging the iterates returned by Alg. 2.

³ More precisely, we first sample a distribution µ ∼ ρ and then a dataset Z_n ∼ µ^n.
An important feature to stress here is the fact that the meta-algorithm uses ε-subgradients of the function L_{Z_n}, which are computed as described above. Specifically, for any t ∈ [T], we define

∇̂L_{Z_n^(t)}(h^(t)) = λ ( h^(t) − w_{h^(t)}^(n+1)(Z_n^(t)) ),    (21)

where w_{h^(t)}^(n+1)(Z_n^(t)) is the last iterate of Alg. 1 applied with the current bias h^(t) and the dataset Z_n^(t).

Algorithm 2 Meta-Algorithm: SGD on Ê_n with ε-Subgradients
Input: γ > 0 step size, λ > 0 inner regularization parameter, ρ meta-distribution
Initialization: h^(1) = 0 ∈ ℝ^d
For t = 1 to T:
  Receive µ_t ∼ ρ, Z_n^(t) ∼ µ_t^n
  Run the inner algorithm Alg. 1 and approximate the gradient ∇^(t) by ∇̂^(t) as in Eq. (21)
  Update h^(t+1) = h^(t) − γ ∇̂^(t)
Return (h^(t))_{t=1}^{T+1} and h̄_T = (1/T) Σ_{t=1}^T h^(t)

To simplify the presentation, throughout this work we use the short-hand notation L_t(·) = L_{Z_n^(t)}(·), ∇^(t) = ∇L_t(h^(t)), ∇̂^(t) = ∇̂L_t(h^(t)).

Some technical observations follow. First, we stress that Alg. 2 processes one single instance at a time, without the need to store previously encountered data points, neither across the tasks nor within them. Second, the implementation of Alg. 2 does not require computing the meta-objective L_{Z_n}, which would increase the computational effort of the entire scheme. The rest of this section is devoted to the statistical analysis of Alg. 2.
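A compact sketch of the meta-procedure follows. It wraps the inner routine inner_sgd from the earlier sketch and uses its last iterate to form the approximate meta-gradient of Eq. (21). The sampling helper sample_task and the fixed step size are illustrative assumptions, not the tuned choices analyzed below.

```python
import numpy as np

def meta_sgd(sample_task, d, lam, gamma, n, T):
    """Sketch of Alg. 2: SGD on the proxy objective with epsilon-subgradients.

    sample_task -- callable returning a fresh stream of n points from a new task
    d           -- input dimension
    lam, gamma  -- inner regularization parameter and meta step size
    n, T        -- points per task and number of observed tasks (meta-iterations)
    """
    h = np.zeros(d)                                # h^(1) = 0
    h_sum = np.zeros(d)
    for t in range(T):
        stream = sample_task()                     # sample mu_t ~ rho, then Z_n^(t) ~ mu_t^n
        _, w_last = inner_sgd(stream, h, lam, n)   # run Alg. 1 with the current bias h^(t)
        meta_grad = lam * (h - w_last)             # approximate meta-gradient, Eq. (21)
        h_sum += h                                 # accumulate h^(t) before the update
        h = h - gamma * meta_grad                  # h^(t+1) = h^(t) - gamma * meta_grad
    return h_sum / T                               # averaged iterate h_bar_T
```

The returned average h̄_T is then passed as the bias of the inner routine on any new task, which is exactly the estimator analyzed in the next subsection.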
5.3. Statistical Analysis of the Meta-Algorithm

In the following theorem we study the statistical performance of the bias h̄_T returned by Alg. 2. More precisely, we bound the excess transfer risk of the inner SGD algorithm run with the bias term learned by the meta-algorithm.

Theorem 6 (Excess Transfer Risk Bound for the Bias h̄_T Estimated by Alg. 2). Let Asm. 1 and Asm. 2 hold, let h̄_T be the output of Alg. 2 with a suitably tuned step size γ, and let w̄_{h̄_T} be the output of Alg. 1 with bias h = h̄_T and a suitably tuned regularization parameter λ. Then, the excess transfer risk E_n(w̄_{h̄_T}) − E is bounded, in expectation, by a term of order Var_m √((log n + 1)/n) plus a term vanishing as the number of observed tasks T grows, where the expectation above is with respect to the sampling of the datasets Z_n^(1), . . . , Z_n^(T) from the environment ρ.

Proof. We consider the following decomposition:

E_n(w̄_{h̄_T}) − E ≤ A + B + C,    (24)

where

A = E_n(w̄_{h̄_T}) − Ê_n(h̄_T),    B = Ê_n(h̄_T) − Ê_n(m),    C = Ê_n(m) − E.

Now, in order to bound the term A, noting that

A = E_{µ∼ρ} E_{Z_n∼µ^n} [ R_µ(w̄_{h̄_T}(Z_n)) − R_{Z_n,h̄_T}(w_{h̄_T}(Z_n)) ],

we use Prop. 1 with h = h̄_T and then we take the average with respect to µ ∼ ρ. As regards the term C, we apply the inequality given in Eq. (14) with h = m and we again average with respect to µ ∼ ρ. Finally, the term B is the convergence rate of Alg. 2 and its study requires analyzing the error that we introduce in the meta-gradients, quantified by Prop. 5. The bound we use for this term is described in Prop. 22 (see App. G) with ĥ = m. The result now follows by combining the bounds on the three terms and optimizing over λ.

The bound stated in Thm. 6 with respect to the mean m also holds for a generic bias vector h ∈ ℝ^d. In particular, the choice h = 0 and the corresponding step size γ describe the setting in which the meta-algorithm returns the ITL estimator h = 0. In such a case, we recover the rate in Cor. 3 for ITL (up to a constant 2). In addition, the above bound is coherent with the state-of-the-art LTL bounds given in other papers studying other variants of Ivanov- or Tikhonov-regularized empirical risk minimization algorithms, see e.g. (Maurer, 2005; 2009; Maurer et al., 2013; 2016). Specifically, in our case, the leading term of the bound scales with Var_m, which reflects the advantage of exploiting the relatedness among the tasks sampled from the environment ρ.

More precisely, in Sec. 4 we noticed that, if the variance of the weight vectors of the tasks sampled from our environment is significantly smaller than their second moment, running Alg. 1 with the ideal bias h = h_n on a future task brings a significant improvement in comparison to the unbiased case. One natural question arising at this point of the presentation is whether, under the same conditions on the environment, the same improvement is obtained by running Alg. 1 with the bias vector h = h̄_T returned by our online meta-algorithm in Alg. 2. Looking at the bound in Thm. 6, we can say that, when the number of training tasks T used to estimate the bias h̄_T is sufficiently large, the above question has a positive answer and our LTL approach is effective.

In order to have a more precise benchmark for the biased setting considered in this work, in App. H we repeat the statistical study described in the paper for the more expensive ERM algorithm defined in Eq. (3). In this case, we assume to have an oracle providing us with this exact estimator, ignoring any computational costs. As before, we perform the analysis both for a fixed bias and for the bias estimated from the data via Alg. 2 (in this case, Alg. 2 is assumed to run with exact meta-gradients). Looking at the results reported in App. H, we immediately see that, up to constants and logarithmic factors, the LTL bounds we have stated in the paper for the low-complexity SGD family are equivalent to those in App. H for the more expensive ERM family. All the above facts justify the informal statement given before Prop. 5, according to which the trick used to compute the approximation of the meta-gradient by using the last iterate of the inner algorithm not only requires no additional effort, but is also accurate enough from the statistical viewpoint, matching a state-of-the-art bound for more expensive within-task algorithms based on ERM.

6. Experiments

In this section, we test the effectiveness of the LTL approach proposed in this paper on synthetic and real data⁴. In all experiments, the regularization parameter λ and the step size γ were tuned by validation, see App. I for more details.

⁴ Code available at https://github.com/prolearner/onlineLTL

Synthetic Data. We considered two different settings: regression with the absolute loss and binary classification with the hinge loss. In both cases, we generated an environment of tasks in which SGD with the right bias is expected to bring a substantial benefit in comparison to the unbiased case. Motivated by our observations in Sec. 4, we generated linear tasks with weight vectors characterized by a variance which is significantly smaller than their second moment. Specifically, for each task µ, we created a weight vector w_µ from a Gaussian distribution with mean m given by the vector in ℝ^d with all components equal to 4 and standard deviation Var_m = 1. Each task corresponds to a dataset (x_i, y_i)_{i=1}^n, x_i ∈ ℝ^d, with n = 10 and d = 30. In the regression case, the inputs were uniformly sampled on the unit sphere and the labels were generated as y = ⟨x, w_µ⟩ + ε, with ε sampled from a zero-mean Gaussian distribution whose standard deviation was chosen to have a signal-to-noise ratio equal to 10 for each task. In the classification case, the inputs were uniformly sampled on the unit sphere, excluding those points with margin |⟨x, w_µ⟩| smaller than 0.5, and the binary labels were generated by a logistic model, P(y = 1) = ( 1 + 10 exp(−⟨x, w_µ⟩) )^{−1}.
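For concreteness, the snippet below generates one synthetic regression task following the recipe above: Gaussian task vectors around a common mean, inputs on the unit sphere, additive Gaussian noise. The helper name and the noise calibration shown are illustrative assumptions; the noise scale is a simple stand-in for the signal-to-noise tuning described in the text.

```python
import numpy as np

def sample_regression_task(d=30, n=10, mean_val=4.0, task_std=1.0, rng=None):
    """Generate one task of the synthetic regression environment of Sec. 6 (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    m = mean_val * np.ones(d)                            # common mean of the task vectors
    w_task = m + task_std * rng.standard_normal(d)       # task weight vector w_mu
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)        # inputs uniformly on the unit sphere
    noise_std = np.linalg.norm(X @ w_task) / (np.sqrt(n) * 10.0)  # rough SNR-10 proxy (assumption)
    y = X @ w_task + noise_std * rng.standard_normal(n)
    return list(zip(X, y)), w_task                       # (x_i, y_i) pairs plus the ground truth
```

A stream built as iter(sample_regression_task()[0]) can be fed directly to inner_sgd from the earlier sketch, or wrapped in a closure and passed to meta_sgd as sample_task.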
In Fig. 1 we report the performance of Alg. 1 with different choices of the bias: h = h̄_T (our LTL estimator resulting from Alg. 2), h = 0 (ITL) and h = m, a reasonable approximation of the oracle minimizing the transfer risk. The plots confirm our theoretical findings: estimating the bias with our LTL approach leads to a substantial benefit with respect to the unbiased case as the number of observed training tasks increases.

Figure 1. Synthetic Data. Test performance of the different bias choices as the number of training tasks increases. (Top) Regression with absolute loss. (Bottom) Classification with hinge loss. The results are averaged over 10 independent runs (dataset generations).

Real Data. We ran experiments on the computer survey data from (Lenk et al., 1996), in which 180 people (tasks) rated the likelihood of purchasing one of 20 different personal computers (n = 8). The input represents 13 different computer characteristics (price, CPU, RAM, etc.), while the output is an integer rating from 0 to 10. Similarly to the synthetic data experiments, we consider a regression setting with the absolute loss and a classification setting; in the latter case, each task is to predict whether the rating is above 5. We compare the LTL bias with ITL. The results are reported in Fig. 2 and are in line with those obtained in the synthetic experiments, indicating that the biased LTL framework proposed in this work is effective for this dataset. Moreover, the results for regression are also in line with what was observed in the multitask setting with variance regularization (Mc Donald et al., 2016). The classification setting has not been used before and has been created ad hoc for our purpose. In this case we observe an increased variance, probably due to the datasets being highly unbalanced.

Figure 2. Real Data. Test performance of the different bias choices as the number of training tasks increases. (Top) Lenk Dataset Regression. (Bottom) Lenk Dataset Classification. The results are averaged over 30 independent runs (dataset generations).

In order to investigate the impact of passing through the data only once in the different steps of our method, we conducted additional experiments. The results, presented in App. J, indicate that the single-pass strategy is competitive with respect to the more expensive ERM.

7. Conclusion and Future Work

We have studied the performance of Stochastic Gradient Descent on the true risk regularized by the square Euclidean distance to a bias vector, over a class of tasks. Drawing upon a learning-to-learn framework, we have shown that, when the variance of the tasks is relatively small, the introduction of an appropriate bias vector may bring a substantial benefit in comparison to the standard unbiased version, corresponding to learning the tasks independently. We have then proposed an efficient online meta-learning algorithm to estimate this bias and we have theoretically shown that the bias returned by our method can bring a comparable benefit. In the future, it would be interesting to investigate other kinds of relatedness among the tasks and to extend our analysis to other classes of loss functions, as well as to a Hilbert space setting.
Finally, another valuable research direction is to derive fully data-dependent bounds, in which the hyperparameters are self-tuned during the learning process, see e.g. (Zhuang et al., 2019).

References

Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 261-269, 2017.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pp. 191-210, 2015.

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Borwein, J. and Zhu, Q. Techniques of Variational Analysis. Springer, 2005.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018a.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pp. 10190-10200, 2018b.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126-1135. PMLR, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, PMLR 80, pp. 1568-1577, 2018.

Gupta, R. and Roughgarden, T. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992-1017, 2017.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2016.

Hiriart-Urruty, J.-B. and Lemarechal, C. Convex Analysis and Minimization Algorithms: Advanced Theory and Bundle Methods. Springer, 2010.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171-195, 2017.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

Lenk, P. J., De Sarbo, W. S., Green, P. E., and Young, M. R. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173-191, 1996.

Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967-994, 2005.
Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.

Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884, 2016.

Mc Donald, A. M., Pontil, M., and Stamos, D. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1-38, 2016.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991-999, 2014.

Pentina, A. and Urner, R. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pp. 3612-3620, 2016.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pp. 1571-1578. Citeseer, 2012.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, S. and Kakade, S. M. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pp. 1457-1464, 2009.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71-79, 2013.

Thrun, S. and Pratt, L. Learning to Learn. Springer, 1998.

Zhuang, Z., Cutkosky, A., and Orabona, F. Surrogate losses for online learning of stepsizes in stochastic non-convex optimization. arXiv preprint arXiv:1901.09068, 2019.