# Online-Within-Online Meta-Learning

Giulia Denevi¹,², Dimitris Stamos³, Carlo Ciliberto³,⁴ and Massimiliano Pontil¹,³
¹Istituto Italiano di Tecnologia (Italy), ²University of Genoa (Italy), ³University College London (UK), ⁴Imperial College London (UK)
giulia.denevi@iit.it, c.ciliberto@imperial.ac.uk, {d.stamos.12,m.pontil}@ucl.ac.uk

We study the problem of learning a series of tasks in a fully online Meta-Learning setting. The goal is to exploit similarities among the tasks to incrementally adapt an inner online algorithm, in order to incur a low averaged cumulative error over the tasks. We focus on a family of inner algorithms based on a parametrized variant of online Mirror Descent. The inner algorithm is incrementally adapted by an online Mirror Descent meta-algorithm using the corresponding within-task minimum regularized empirical risk as the meta-loss. In order to keep the process fully online, we approximate the meta-subgradients by the online inner algorithm. An upper bound on the approximation error allows us to derive a cumulative error bound for the proposed method. Our analysis can also be converted to the statistical setting by online-to-batch arguments. We instantiate two examples of the framework in which the meta-parameter is either a common bias vector or a feature map. Finally, preliminary numerical experiments confirm our theoretical findings.

1 Introduction

Humans can quickly adapt knowledge gained when learning past tasks in order to solve new tasks from just a handful of examples. In contrast, learning systems are still rather limited when it comes to transferring knowledge over a sequence of learning problems. Overcoming this limitation can have a broad impact in artificial intelligence, as it can save the expensive preparation of large training samples, often humanly annotated, needed by current machine learning methods. As a result, Meta-Learning is receiving increasing attention, both from an applied [15, 32] and a theoretical perspective [5, 40, 17].

Until very recently, Meta-Learning was mainly studied in the batch statistical setting, where data are assumed to be independently sampled from some distribution and are processed in one batch, see [6, 23, 24, 25, 26, 29]. Only recently has interest grown in more efficient methods, combining ideas from Online Learning and Meta-Learning, see [1, 12, 13, 30, 3, 21, 16, 8, 11]. In this setting, which is sometimes referred to as Lifelong Learning, the tasks are observed sequentially via corresponding sets of training examples, and the broad goal is to exploit similarities across the tasks to incrementally adapt an inner (within-task) algorithm to such a sequence. There are different ways to deal with Meta-Learning in an online framework: the so-called Online-Within-Batch (OWB) framework, where the tasks are processed online but the data within each task are processed in one batch, see [1, 12, 13, 16, 8, 3, 21], or the so-called Online-Within-Online (OWO) framework, where data are processed sequentially both within and across the tasks, see [1, 3, 21, 16, 11]. Previous work mainly analyzed specific settings; see the technical discussion in App. A. The main goal of this work is to propose an OWO Meta-Learning approach that can be adapted to a broad family of algorithms. We consider a general class of inner algorithms based on primal-dual Online Learning [37, 33, 38, 36, 35].
In particular, we discuss in detail the case of online Mirror Descent on a regularized variant of the empirical risk. The regularizer belongs to a general family of strongly convex functions parametrized by a meta-parameter. The inner algorithm is adapted by a meta-algorithm, which also consists in applying online Mirror Descent, on a meta-objective given by the within-task minimum regularized empirical risk. The interplay between the meta-algorithm and the inner algorithm plays a key role in our analysis. The latter is used to compute a good approximation of the meta-subgradient, which is supplied to the former. A key novelty of our analysis is to show that, exploiting a closed form expression of the error on the meta-subgradients, we can automatically derive a cumulative error bound for the entire procedure. Our analysis also holds for more aggressive primal-dual online updates and it can be adapted to the statistical setting by online-to-batch arguments.

Contributions. Our contribution is threefold. First, we derive an efficient and theoretically grounded OWO Meta-Learning framework inspired by Multi-Task Learning (MTL). Our framework applies to a wide class of within-task algorithms and task relationships. Second, we establish how our analysis can be converted to the statistical setting. Finally, we show how our general analysis can be directly applied to two important families of inner algorithms in which the meta-parameter is either a bias vector or a feature map shared across the tasks.

Paper organization. We start by introducing in Sec. 2 our OWO Meta-Learning setting. In Sec. 3 we recall some background material from primal-dual Online Learning. In Sec. 4 we outline the proposed method and give a cumulative error bound for it. In Sec. 5 we show how the above analysis can be used to derive guarantees for our method in the statistical setting. In Sec. 6 we specialize our framework to two important examples in which the tasks share a common bias vector or feature map. Finally, in Sec. 7 we report preliminary experiments with our method and in Sec. 8 we draw conclusions. Technical proofs are postponed to the appendix.

2 The OWO Meta-Learning Setting

In this section we introduce the OWO Meta-Learning problem. We consider a learner facing a sequence of online tasks. Corresponding to each task, there is an input space $\mathcal{X}$, an output space $\mathcal{Y}$ and a dataset $Z = (z_i)_{i=1}^{n} = (x_i, y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$, which is observed sequentially. Online Learning aims to design an algorithm that makes predictions through time from past information. More precisely, at each step $i \in \{1, \dots, n\}$: (a) a datapoint $z_i = (x_i, y_i)$ is observed, (b) the algorithm outputs a label $\hat y_i$, (c) the learner incurs the error $\ell_i(\hat y_i)$, where $\ell_i(\cdot) = \ell(\cdot, y_i)$ for a loss function $\ell$. To simplify our presentation, throughout we let $\mathcal{X} \subseteq \mathbb{R}^d$, $\mathcal{Y} \subseteq \mathbb{R}$ and we consider algorithms that perform linear predictions of the form $\hat y_i = \langle x_i, w_i \rangle$, where $(w_i)_{i=1}^{n}$ is a sequence of weight vectors updated by the algorithm and $\langle \cdot, \cdot \rangle$ denotes the standard inner product in $\mathbb{R}^d$. The goal is to bound the cumulative error of the algorithm, i.e. $\mathcal{E}_{\mathrm{inner}}(Z) = \sum_{i=1}^{n} \ell_i(\langle x_i, w_i \rangle)$, with respect to (w.r.t.) the same quantity incurred by a vector $\hat w \in \mathbb{R}^d$ fixed in hindsight, i.e. $\sum_{i=1}^{n} \ell_i(\langle x_i, \hat w \rangle)$.
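For illustration, the following minimal Python sketch instantiates the single-task online protocol above with a concrete inner algorithm of our choosing (online subgradient descent on the absolute loss with a $1/\sqrt{i}$ step size); the function name and these particular choices are our assumptions, not part of the framework.

```python
import numpy as np

def online_protocol(Z, step=0.1):
    """Single-task online protocol of Sec. 2: observe, predict, suffer, update.

    Z: list of (x, y) pairs with x in R^d. Uses online subgradient descent on
    the absolute loss as an illustrative inner algorithm, and returns the
    cumulative error E_inner(Z)."""
    d = len(Z[0][0])
    w = np.zeros(d)                          # w_1
    cumulative_error = 0.0
    for i, (x, y) in enumerate(Z, start=1):
        y_hat = np.dot(x, w)                 # (b) output a label
        cumulative_error += abs(y_hat - y)   # (c) suffer the error
        g = np.sign(y_hat - y) * x           # subgradient of |<x,w> - y| in w
        w = w - (step / np.sqrt(i)) * g      # update the weight vector
    return cumulative_error
```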
In the OWO Meta-Learning setting, we have a family of inner online algorithms identified by a meta-parameter $\theta$ belonging to a prescribed set $\Theta$, and the goal is to adapt $\theta$ to a sequence of learning tasks, in an online fashion. Throughout this work, $\Theta$ will be a closed, convex, non-empty subset of a Euclidean space $\mathcal{M}$. The broad goal is to transfer information gained when learning previous tasks in order to help learning future tasks. For this purpose, we propose a Meta-Learning procedure, acting across the tasks, which modifies the inner algorithm one task after another. More precisely, we let $T$ be the number of tasks and, for each task $t \in \{1, \dots, T\}$, we let $Z_t = (x_{t,i}, y_{t,i})_{i=1}^{n}$ be the corresponding data sequence.¹ At each time $t$: (a) the meta-learner incrementally receives a task dataset $Z_t$, (b) it runs the inner online algorithm with meta-parameter $\theta_t$ on $Z_t$, returning the predictor vectors $(w_{t,i})_{i=1}^{n}$, (c) it incrementally incurs the errors $\ell_{t,i}(\langle x_{t,i}, w_{t,i} \rangle)$, where $\ell_{t,i}(\cdot) = \ell(\cdot, y_{t,i})$, (d) the meta-parameter (and consequently, the inner algorithm) is updated into $\theta_{t+1}$. Denoting by $\mathcal{E}_{\mathrm{inner}}(Z_t, \theta_t)$ the cumulative error of the inner algorithm with meta-parameter $\theta_t$ on the dataset $Z_t$, i.e.

$$\mathcal{E}_{\mathrm{inner}}(Z_t, \theta_t) = \sum_{i=1}^{n} \ell_{t,i}(\langle x_{t,i}, w_{t,i} \rangle), \qquad (1)$$

the goal is to bound the error accumulated across the tasks w.r.t. the same quantity incurred by a sequence of task vectors $(\hat w_t)_{t=1}^{T}$ fixed in hindsight, i.e. $\sum_{t=1}^{T} \sum_{i=1}^{n} \ell_{t,i}(\langle x_{t,i}, \hat w_t \rangle)$.

¹Throughout the paper we use the double subscript notation $\cdot_{t,i}$ to denote the outer (task) index $t$ and the inner (within-task) index $i$.

The setting we consider in this paper is inspired by previous work on Multi-Task Learning, such as [2, 10, 18]. To describe it, we use extended real-valued functions and, for any data sequence $Z$ and meta-parameter $\theta \in \Theta$, we define the within-task minimum regularized empirical risk

$$L_Z(\theta) = \min_{w \in \mathbb{R}^d} \; R_Z(w) + \lambda f(w, \theta), \qquad R_Z(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\langle x_i, w \rangle), \qquad (2)$$

where $\lambda > 0$ is a regularization parameter and $f$ is an appropriate complexity term ensuring the existence and uniqueness of the above minimizer $\hat w_\theta$. Assuming the entire sequence $(Z_t)_{t=1}^{T}$ available in hindsight, and introducing the notation $L_t = L_{Z_t}$, many MTL methods read as follows:

$$\min_{\theta \in \Theta} \; \sum_{t=1}^{T} L_t(\theta) + \gamma F(\theta), \qquad (3)$$

where $\gamma > 0$ is a meta-regularization parameter and $F$ is an appropriate meta-regularizer ensuring that the above minimum is attained. We stress that in our OWO Meta-Learning setting the data are received sequentially, both within and across the tasks. The above formulation inspires us to take a within-task online algorithm that mimics well the (batch) objective in Eq. (2) and to define as meta-objectives for the online meta-algorithm the functions $(L_t)_{t=1}^{T}$. Obviously, in this setting, the meta-objectives (and consequently their subgradients used by the meta-algorithm) are computed only up to an approximation error, depending on the specific properties of the inner algorithm we are using. We will show how to control and exploit this approximation error in the analysis.

In the sequel, for a Euclidean space $\mathcal{V}$, we let $\Gamma_0(\mathcal{V})$ be the set of proper, closed and convex functions over $\mathcal{V}$ and, for any $f \in \Gamma_0(\mathcal{V})$, we denote by $\mathrm{Dom}\, f$ its domain (we refer to App. B and [31] for notions of convex analysis). In this work we make the following standard assumptions, in which we introduce two norms $\|\cdot\|_\theta$ and $|||\cdot|||$ that will be specified in the two applications below.

Assumption 1 (Loss and regularizer). Let $\ell(\cdot, y)$ be a convex and closed real-valued function for any $y \in \mathcal{Y}$ and let $f \in \Gamma_0(\mathbb{R}^d \times \mathcal{M})$ be such that, for any $\theta \in \Theta$, $f(\cdot, \theta)$ is 1-strongly convex w.r.t. a norm $\|\cdot\|_\theta$ over $\mathbb{R}^d$, $\inf_{w \in \mathbb{R}^d} f(w, \theta) = 0$ and, for any $\theta \notin \Theta$, $\mathrm{Dom}\, f(\cdot, \theta) = \emptyset$.

Assumption 2 (Meta-regularizer). Let $F$ be a closed and 1-strongly convex function w.r.t. a norm $|||\cdot|||$ over $\mathcal{M}$ such that $\inf_{\theta \in \mathcal{M}} F(\theta) = 0$ and $\mathrm{Dom}\, F = \Theta$.

Notice that the norm w.r.t. which the function $f(\cdot, \theta)$ is assumed to be strongly convex may vary with $\theta$. Moreover, under Asm. 1, $\mathrm{Dom}\, L_Z = \Theta$ and, since $L_Z$ is defined as the partial minimum of a function in $\Gamma_0(\mathbb{R}^d \times \mathcal{M})$, $L_Z \in \Gamma_0(\mathcal{M})$. This property supports the choice of this function as the meta-objective for our meta-algorithm. Finally, by Lemma 29 in App. B, Asm. 1 and Asm. 2 ensure the existence and uniqueness of the minimizers in Eq. (2) and Eq. (3).

We conclude this section by giving two examples included in the framework above. The first is inspired by the MTL variance regularizer in [14], while the second, which can be easily extended to more general MTL regularizers such as those in [2, 10, 27, 28], relates to the MTL trace norm regularizer. As we will see in the following, in the first example the task predictors are encouraged to stay close to a common bias vector; in the second, they are encouraged to lie in the range of a low-rank feature map. In order to describe these examples we require some additional notation. We let $\|\cdot\|_2$, $\|\cdot\|_F$, $\|\cdot\|_{\mathrm{Tr}}$, $\|\cdot\|_\infty$ be the Euclidean, Frobenius, trace and operator norm, respectively. We also let $\cdot^\dagger$ be the pseudo-inverse, $\mathrm{Tr}(\cdot)$ the trace, $\mathrm{Ran}(\cdot)$ the range and $\mathbf{S}^d$ (resp. $\mathbf{S}^d_+$) the set of symmetric (resp. positive semi-definite) matrices in $\mathbb{R}^{d \times d}$. Finally, $\iota_S$ denotes the indicator function of the set $S$, taking value $0$ when the argument belongs to $S$ and $+\infty$ otherwise.

Example 1 (Bias). We choose $\mathcal{M} = \Theta = \mathbb{R}^d$, $F(\theta) = \frac{1}{2}\|\theta\|_2^2$, satisfying Asm. 2 with $|||\cdot||| = \|\cdot\|_2$, and $f(w, \theta) = \frac{1}{2}\|w - \theta\|_2^2$, satisfying Asm. 1 with $\|\cdot\|_\theta = \|\cdot\|_2$ for every $\theta \in \mathbb{R}^d$.

Example 2 (Feature Map). We choose $\mathcal{M} = \mathbf{S}^d$ and $\Theta = S$, where $S = \{\theta \in \mathbf{S}^d_+ : \mathrm{Tr}(\theta) \le 1\}$. For a fixed $\theta_0 \in S$, we set $F(\theta) = \frac{1}{2}\|\theta - \theta_0\|_F^2 + \iota_S(\theta)$, satisfying Asm. 2 with $|||\cdot||| = \|\cdot\|_F$, and $f(w, \theta) = \frac{1}{2}\langle w, \theta^\dagger w \rangle + \iota_{\mathrm{Ran}(\theta)}(w) + \iota_S(\theta)$, satisfying Asm. 1 with $\|w\|_\theta = \sqrt{\langle w, \theta^\dagger w \rangle}$ for any $\theta \in S$.

We will return to these examples in Sec. 6, specializing our method and our analysis to these settings.
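To make the two examples concrete, the sketch below evaluates $f(\cdot, \theta)$ and $F$ for both, following the formulas above; the function names and the numerical tolerance are ours, and the range constraint of Ex. 2 is checked explicitly via the pseudo-inverse.

```python
import numpy as np

def f_bias(w, theta):
    """Ex. 1: f(w, theta) = (1/2) ||w - theta||_2^2."""
    return 0.5 * np.sum((w - theta) ** 2)

def F_bias(theta):
    """Ex. 1: F(theta) = (1/2) ||theta||_2^2."""
    return 0.5 * np.sum(theta ** 2)

def f_feature(w, theta, tol=1e-10):
    """Ex. 2: f(w, theta) = (1/2) <w, theta^+ w> if w in Ran(theta), else +inf."""
    theta_pinv = np.linalg.pinv(theta)
    # range constraint: w must lie in Ran(theta)
    if np.linalg.norm(theta @ theta_pinv @ w - w) > tol:
        return np.inf
    return 0.5 * w @ theta_pinv @ w

def F_feature(theta, theta0, tol=1e-10):
    """Ex. 2: F(theta) = (1/2) ||theta - theta0||_F^2 + indicator of S."""
    eigs = np.linalg.eigvalsh(theta)
    in_S = eigs.min() >= -tol and np.trace(theta) <= 1 + tol  # theta PSD, Tr <= 1
    return 0.5 * np.sum((theta - theta0) ** 2) if in_S else np.inf
```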
3 Preliminaries: primal-dual Online Learning

Our OWO Meta-Learning method consists in the application of two nested primal-dual online algorithms, one operating within the tasks and another across the tasks. In particular, even though our analysis holds also for more aggressive schemes, in this work we consider the online Mirror Descent algorithm. In this section we briefly recall some material from the primal-dual interpretation of this algorithm that will be used in our subsequent analysis. The material of this section is an adaptation from [37, 33, 38, 36, 35]; we refer to App. C for a more detailed presentation.

Algorithm 1 (Primal-dual online algorithm: lazy online Mirror Descent)
Input: $(g_m)_{m=1}^{M}$, $r$ as described in the text
Initialization: $\alpha_1 = ()$, $v_1 = \nabla r^*(0) \in \mathrm{Dom}\, r$
For $m = 1$ to $M$:
    Receive $g_m$, $A_m$, $c_{m+1}$, $\epsilon_m$
    Suffer $g_m(A_m v_m)$ and compute $\alpha'_m \in \partial_{\epsilon_m} g_m(A_m v_m)$
    Update $\alpha_{m+1} = (\alpha_m, \alpha'_m)$
    Define $v_{m+1} = \nabla r^*\Big(-\frac{1}{c_{m+1}}\sum_{j=1}^{m} A_j^* \alpha'_j\Big)$
Return $(\alpha_m)_{m=1}^{M+1}$, $(v_m)_{m=1}^{M+1}$

Online Mirror Descent on a (primal) problem can be derived from the following primal-dual framework, in which we introduce an appropriate dual algorithm. Specifically, at each iteration $m \in \{1, \dots, M\}$, we consider the instantaneous primal optimization problem

$$\hat P_{m+1} = \inf_{v \in \mathcal{V}} P_{m+1}(v), \qquad P_{m+1}(v) = \sum_{j=1}^{m} g_j(A_j v) + c_m r(v), \qquad (4)$$

where $\mathcal{V}$ is a Euclidean space, $c_m > 0$, $r \in \Gamma_0(\mathcal{V})$ is a 1-strongly convex function w.r.t. a norm $\|\cdot\|$ over $\mathcal{V}$ (with dual norm $\|\cdot\|_*$) such that $\inf_{v \in \mathcal{V}} r(v) = 0$ and, for any $j \in \{1, \dots, M\}$, letting $\mathcal{V}_j$ be a Euclidean space, $g_j \in \Gamma_0(\mathcal{V}_j)$ and $A_j : \mathcal{V} \to \mathcal{V}_j$ is a linear operator with adjoint $A_j^*$. As explained in App. C, the corresponding dual problem is given by

$$\hat D_{m+1} = \inf_{\alpha \in \mathcal{V}_1 \times \cdots \times \mathcal{V}_m} D_{m+1}(\alpha), \qquad D_{m+1}(\alpha) = \sum_{j=1}^{m} g_j^*(\alpha_j) + c_m r^*\Big(-\frac{1}{c_m}\sum_{j=1}^{m} A_j^* \alpha_j\Big),$$

where $g_j^*$ and $r^*$ are respectively the conjugate functions of $g_j$ and $r$. After this, we define the dual scheme in which the dual variable $\alpha_{m+1}$ is updated by a greedy coordinate descent approach on the dual, setting $\alpha_{m+1} = (\alpha_m, \alpha'_m)$, where $\alpha'_m \in \partial_{\epsilon_m} g_m(A_m v_m)$ is an $\epsilon_m$-subgradient of $g_m$ at $A_m v_m$ and $v_m$ is the current primal iterate. The primal variable is then updated from the dual one by a variant of the Karush-Kuhn-Tucker (KKT) conditions, ensuring that it belongs to $\mathrm{Dom}\, r$; see Alg. 1. In this paper, following [36], we refer to such a scheme as lazy online Mirror Descent. However, the term linearized Follow-The-Regularized-Leader is historically more accurate. We recall also that such a scheme includes many well-known algorithms, when one properly specifies the complexity term $r$. The behavior of Alg. 1 is analyzed in the next result, which will be a key tool for our analysis.

Theorem 1 (Dual optimality gap for Alg. 1). Let $(v_m)_{m=1}^{M}$ be the primal iterates returned by Alg. 1 when applied to the generic problem in Eq. (4) and let $\Delta_{\mathrm{Dual}} = D_{M+1}(\alpha_{M+1}) - \hat D_{M+1}$ be the corresponding (non-negative) dual optimality gap at the last dual iterate $\alpha_{M+1}$ of the algorithm.

1. If, for any $m \in \{1, \dots, M\}$, $c_{m+1} \ge c_m$, then

$$\sum_{m=1}^{M} g_m(A_m v_m) + \Delta_{\mathrm{Dual}} \;\le\; \hat P_{M+1} + \frac{1}{2}\sum_{m=1}^{M} \frac{\|A_m^* \alpha'_m\|_*^2}{c_m} + \sum_{m=1}^{M} \epsilon_m.$$

2. If, for any $m \in \{1, \dots, M\}$, $c_m = \sum_{j=1}^{m} \lambda_j$ for some $\lambda_j > 0$, then

$$\sum_{m=1}^{M} \big( g_m(A_m v_m) + \lambda_m r(v_m) \big) + \Delta_{\mathrm{Dual}} \;\le\; \hat P_{M+1} + \frac{1}{2}\sum_{m=1}^{M} \frac{\|A_m^* \alpha'_m\|_*^2}{c_m} + \sum_{m=1}^{M} \epsilon_m.$$

The first (resp. second) inequality in Thm. 1 links the dual optimality gap of the last dual iterate generated by Alg. 1 with the (resp. regularized) cumulative error of the corresponding primal iterates. Note that this result can be readily used to bound the cumulative error (resp. its regularized version) of Alg. 1 by the batch regularized comparator $\hat P_{M+1}$ and additional terms. In the following section we will make use of the above theorem in order to analyze our OWO Meta-Learning method.
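For intuition, here is a sketch of Alg. 1 in the Euclidean case $r = \frac{1}{2}\|\cdot\|_2^2$, where $\nabla r^*$ is the identity, so the lazy primal update is just a scaled sum of past dual coordinates. The interface (lists of loss, subgradient and operator callables) is our own illustrative convention, and exact subgradients are assumed ($\epsilon_m = 0$).

```python
import numpy as np

def lazy_omd(losses, subgrads, A_ops, c, v_dim):
    """Sketch of Alg. 1 with r = (1/2)||.||_2^2, so grad r*(u) = u.

    losses[m](z): evaluates g_m at z = A_m v_m; subgrads[m](z): returns an
    element of the subdifferential of g_m at z; A_ops[m]: matrix representing
    A_m; c[m]: the scaling c_{m+1} used in the lazy update at step m."""
    M = len(losses)
    v = np.zeros(v_dim)               # v_1 = grad r*(0) = 0
    dual_sum = np.zeros(v_dim)        # running sum of A_j^* alpha'_j
    total = 0.0
    for m in range(M):
        z = A_ops[m] @ v
        total += losses[m](z)               # suffer g_m(A_m v_m)
        alpha = subgrads[m](z)              # alpha'_m: dual coordinate
        dual_sum += A_ops[m].T @ alpha      # accumulate A_m^* alpha'_m
        v = -dual_sum / c[m]                # v_{m+1} = grad r*(-sum / c_{m+1})
    return total, v
```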
4 The Proposed Method

Algorithm 2 (Within-task algorithm)
Input: $\lambda > 0$, $\theta \in \Theta$, $Z = (z_i)_{i=1}^{n}$
Initialization: $s_{\theta,1} = ()$, $w_{\theta,1} = \nabla f(\cdot, \theta)^*(0)$
For $i = 1$ to $n$:
    Receive the datapoint $z_i = (x_i, y_i)$
    Compute $s'_{\theta,i} \in \partial \ell_i(\langle x_i, w_{\theta,i} \rangle) \subseteq \mathbb{R}$
    Define $(s_{\theta,i+1})_i = s'_{\theta,i}$, $\gamma_i = \lambda(i+1)$
    Update $w_{\theta,i+1} = \nabla f(\cdot, \theta)^*\Big(-\frac{1}{\gamma_i}\sum_{j=1}^{i} x_j s'_{\theta,j}\Big)$
Return $(w_{\theta,i})_{i=1}^{n+1}$, $\bar w_\theta = \frac{1}{n}\sum_{i=1}^{n} w_{\theta,i}$, $s_{\theta,n+1}$

Algorithm 3 (Meta-algorithm)
Input: $\gamma > 0$, $(Z_t)_{t=1}^{T}$
Initialization: $\theta_1 = \nabla F^*(0)$
For $t = 1$ to $T$:
    Receive incrementally the dataset $Z_t$
    Run Alg. 2 with $\theta_t$ over $Z_t$ and compute $s_{\theta_t, n+1}$
    Compute $\nabla'_t$ as in Prop. 3 using $s_{\theta_t, n+1}$
    Update $\theta_{t+1} = \nabla F^*\Big(-\frac{1}{\gamma}\sum_{j=1}^{t} \nabla'_j\Big)$
Return $(\theta_t)_{t=1}^{T+1}$

In this section we present the proposed OWO Meta-Learning method and establish a (regularized) cumulative error bound for it. As anticipated in Sec. 2, the method consists in applying Alg. 1 both to the (non-normalized) within-task problem in Eq. (2) and to the across-tasks problem in Eq. (3), corresponding, as we will show in the following, to Alg. 2 and Alg. 3, respectively. In order to analyze our method, we start by studying the behavior of the inner Alg. 2.
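Before analyzing Alg. 2, we sketch it for the bias regularizer of Ex. 1, where $\nabla f(\cdot, \theta)^*(u) = \theta + u$, so the lazy update becomes a biased running average of loss subgradients. The absolute loss and the function name are our illustrative assumptions.

```python
import numpy as np

def inner_bias(Z, theta, lam):
    """Sketch of Alg. 2 for Ex. 1 (bias), with the absolute loss.

    For f(w, theta) = (1/2)||w - theta||_2^2 one has
    grad f(., theta)*(u) = theta + u, so the lazy update reads
    w_{i+1} = theta - (1/gamma_i) * sum_{j<=i} x_j s'_j."""
    d = len(Z[0][0])
    w = theta.copy()                  # w_{theta,1} = grad f(., theta)*(0) = theta
    dual_sum = np.zeros(d)            # running sum of x_j s'_{theta,j}
    iterates, s_dual = [], []
    for i, (x, y) in enumerate(Z, start=1):
        iterates.append(w.copy())     # store w_{theta,i}
        s = np.sign(np.dot(x, w) - y)       # s'_{theta,i} in subdiff of |.|
        s_dual.append(s)
        dual_sum += s * x
        w = theta - dual_sum / (lam * (i + 1))   # lazy step, gamma_i = lam*(i+1)
    w_bar = np.mean(iterates, axis=0)            # average iterate
    return iterates, w_bar, np.array(s_dual)     # iterates, bar w, s_{theta,n+1}
```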
Proposition 2 (Dual optimality gap for the inner Alg. 2). Let Asm. 1 hold. Then, Alg. 2 coincides with Alg. 1 applied to the non-normalized within-task problem in Eq. (2). As a consequence, introducing the regularized cumulative error of the iterates generated by Alg. 2,

$$\mathcal{E}^{\mathrm{reg}}_{\mathrm{inner}}(Z, \theta) = \sum_{i=1}^{n} \Big( \ell_i(\langle x_i, w_{\theta,i} \rangle) + \lambda f(w_{\theta,i}, \theta) \Big),$$

where $w_{\theta,i} \in \mathrm{Dom}\, f(\cdot, \theta)$ for any $i \in \{1, \dots, n\}$, the following upper bound holds for the associated dual optimality gap $\Delta_{\mathrm{Dual}}$ introduced in Thm. 1:

$$\Delta_{\mathrm{Dual}} \;\le\; n L_Z(\theta) - \mathcal{E}^{\mathrm{reg}}_{\mathrm{inner}}(Z, \theta) + \epsilon_\theta, \qquad \epsilon_\theta = \frac{1}{2\lambda}\sum_{i=1}^{n} \frac{\|x_i s'_{\theta,i}\|_{\theta,*}^2}{i}.$$

Proof. The inner Alg. 2 coincides with Alg. 1 applied to the non-normalized within-task problem in Eq. (2), once one makes the identifications $s'_{\theta,i}$ for the (exact) subgradients and realizes that the non-normalized within-task problem in Eq. (2) is of the form in Eq. (4) with $m \equiv i$, $M \equiv n$, $v \equiv w$, $\mathcal{V} \equiv \mathbb{R}^d$, $g_j \equiv \ell_i$, $A_j \equiv x_i^\top$, $c_m \equiv m\lambda$, $r(\cdot) \equiv f(\cdot, \theta)$. The bound in the statement then directly derives from the second point of Thm. 1.

Since $\Delta_{\mathrm{Dual}} \ge 0$, by rearranging the terms and normalizing by the number of points $n$, the above result tells us that, when the terms $\|x_i s'_{\theta,i}\|_{\theta,*}$ are bounded, for an appropriate choice of $\lambda$, the inner algorithm attempts to mimic the function $L_Z$ in Eq. (2) as the number of points $n$ increases.

The method we propose in this work relies on the application of Alg. 1 also to the meta-problem in Eq. (3) as the tasks are sequentially observed, using the functions $(L_t)_{t=1}^{T}$ as meta-objectives. A key difficulty here is that the meta-objective is defined via the inner batch problem in Eq. (2); hence it is not available exactly, but only approximately approached by the within-task online algorithm. From a practical point of view, this means that, differently from the inner algorithm, the resulting meta-algorithm has to deal with an error on the meta-subgradients at each iteration. Our next result describes how, leveraging the dual optimality gap for the inner Alg. 2, we can compute an $\epsilon$-subgradient of the meta-objective, where $\epsilon$ is (up to normalization) the value $\epsilon_\theta$ stated in Prop. 2. This will allow us to develop an efficient method which is computationally appealing and fully online.

Proposition 3 (Computation of an $\epsilon$-subgradient of $L_Z$). Let Asm. 1 hold and let $s_{\theta,n+1}$ be the output of Alg. 2 with $\theta \in \Theta$ over the dataset $Z$. Let $\nabla_\theta \in \partial\{-D_{n+1}(s_{\theta,n+1}, \cdot)\}(\theta)$, where

$$D_{n+1}(s, \theta) = \sum_{i=1}^{n} \ell_i^*(s_i) + \lambda n\, f^*\Big(-\frac{1}{\lambda n}\sum_{i=1}^{n} x_i s_i, \;\theta\Big)$$

is the dual of the non-normalized Eq. (2). Then, $\nabla'_\theta = \nabla_\theta / n \in \partial_{\epsilon_\theta / n}\, L_Z(\theta)$, with $\epsilon_\theta$ as in Prop. 2.

The proof of the above statement is reported in App. D. It is based on rewriting the meta-objective as $L_Z(\theta) = \frac{1}{n}\max_{s \in \mathbb{R}^n}\{-D_{n+1}(s, \theta)\}$ (by strong duality, see Lemma 34 in App. D) and it essentially exploits Prop. 2, according to which the last dual iterate $s_{\theta,n+1}$ returned by Alg. 2 is an $\epsilon_\theta$-maximizer of the dual objective $-D_{n+1}(\cdot, \theta)$. We remark that the procedure described above to compute an $\epsilon$-subgradient has already been used in our work [11] for the statistical setting in Ex. 1. Here, with a different proof technique, we show that it extends also to more general inner regularizers.
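As a concrete instance of Prop. 3, consider Ex. 1. There $f^*(u, \theta) = \frac{1}{2}\|u\|_2^2 + \langle u, \theta\rangle$, and a short computation (ours) gives $\partial_\theta\{-D_{n+1}(s, \cdot)\}(\theta) = \sum_{i} x_i s_i$, hence $\nabla'_\theta = \frac{1}{n}\sum_{i} x_i s'_{\theta,i}$: the approximate meta-subgradient is just the average of the within-task dual coordinates. The sketch below combines this with the meta-update of Alg. 3 for $F = \frac{1}{2}\|\cdot\|_2^2$ (so $\nabla F^*$ is the identity), reusing `inner_bias` from the sketch above; the constant scaling $\gamma$ follows our reading of the identification $c_m \equiv \gamma$ in the proof of Thm. 4.

```python
import numpy as np

def meta_bias(tasks, lam, gamma, d):
    """Sketch of Alg. 3 for Ex. 1 (bias), with F(theta) = (1/2)||theta||_2^2.

    Uses Prop. 3 specialized to Ex. 1 (our derivation): the approximate
    meta-subgradient of L_{Z_t} at theta_t is (1/n) sum_i x_i s'_i."""
    theta = np.zeros(d)               # theta_1 = grad F*(0) = 0
    meta_sum = np.zeros(d)            # running sum of meta-subgradients
    thetas = [theta.copy()]
    for Z in tasks:                   # tasks arrive one after another
        _, _, s_dual = inner_bias(Z, theta, lam)   # run Alg. 2 (sketch above)
        X = np.array([x for x, _ in Z])
        grad = X.T @ s_dual / len(Z)               # nabla'_t = (1/n) sum x_i s'_i
        meta_sum += grad
        theta = -meta_sum / gamma                  # theta_{t+1} = grad F*(-sum/gamma)
        thetas.append(theta.copy())
    return thetas                     # (theta_t)_{t=1}^{T+1}
```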
Leveraging the form of the error on the meta-subgradients in Prop. 3, we now show how we can automatically deduce a (regularized) cumulative error bound for the entire procedure.

Theorem 4 (Cumulative error bound). Let Asm. 1 and Asm. 2 hold. Then, Alg. 3 coincides with Alg. 1 applied to the across-tasks problem in Eq. (3). As a consequence, introducing the regularized cumulative error for the iterates generated by the combination of Alg. 2 and Alg. 3,

$$\mathcal{E}^{\mathrm{reg}}_{\mathrm{inner}}(Z_t, \theta_t) = \sum_{i=1}^{n} \Big( \ell_{t,i}(\langle x_{t,i}, w_{t,i} \rangle) + \lambda f(w_{t,i}, \theta_t) \Big),$$

where $\theta_t \in \Theta$ for any $t \in \{1, \dots, T\}$, for any sequence of vectors $(\hat w_t)_{t=1}^{T}$ in $\mathbb{R}^d$ and any $\theta \in \Theta$ such that $f(\hat w_t, \theta) < +\infty$ for any $t \in \{1, \dots, T\}$, the following upper bound holds:

$$\frac{1}{nT}\sum_{t=1}^{T} \mathcal{E}^{\mathrm{reg}}_{\mathrm{inner}}(Z_t, \theta_t) \;\le\; \frac{1}{T}\sum_{t=1}^{T} R_{Z_t}(\hat w_t) + \frac{\lambda}{T}\sum_{t=1}^{T} f(\hat w_t, \theta) + \frac{1}{2\lambda n T}\sum_{t=1}^{T}\sum_{i=1}^{n} \frac{\|x_{t,i}\, s'_{t,i}\|_{\theta_t,*}^2}{i} + \frac{\gamma F(\theta)}{T} + \frac{1}{2\gamma T}\sum_{t=1}^{T} |||\nabla'_t|||_*^2.$$

Proof. The meta-algorithm in Alg. 3 coincides with Alg. 1 applied to the across-tasks problem in Eq. (3), once one makes the identifications $\nabla'_t$ for the (approximated) subgradients and realizes that the across-tasks problem in Eq. (3) is of the form in Eq. (4) with $m \equiv t$, $M \equiv T$, $v \equiv \theta$, $\mathcal{V} \equiv \mathcal{M}$, $g_j \equiv L_t$, $A_j \equiv I$, $c_m \equiv \gamma$, $r \equiv F$. As a consequence, denoting by $\Delta_{\mathrm{Dual}}$ the associated dual optimality gap introduced in Thm. 1, specializing the first point of Thm. 1 to this setting and exploiting the fact that $\Delta_{\mathrm{Dual}} \ge 0$, we get

$$\sum_{t=1}^{T} L_t(\theta_t) \;\le\; \min_{\theta \in \Theta}\Big\{\sum_{t=1}^{T} L_t(\theta) + \gamma F(\theta)\Big\} + \frac{1}{2\gamma}\sum_{t=1}^{T} |||\nabla'_t|||_*^2 + \sum_{t=1}^{T} \frac{\epsilon_{\theta_t}}{n}. \qquad (10)$$

Substituting the closed form of $\epsilon_{\theta_t}$ from Prop. 2 (applied to the task $t$) into Eq. (10), one immediately observes that the term $\sum_{t=1}^{T} L_t(\theta_t)$ cancels. The desired statement then directly follows by rearranging the remaining terms, using the definition of $(L_t)_{t=1}^{T}$ and normalizing by the number of points $n$.

When the inputs are bounded and both the inner loss and the meta-objective are Lipschitz w.r.t. the associated norms (as we will see for Ex. 1), the terms $\|x_{t,i} s'_{t,i}\|_{\theta_t,*}$ and $|||\nabla'_t|||_*$ can be upper bounded by a constant. In this case, for an appropriate choice of $\lambda$ and $\gamma$, we recover a reasonable rate $O(1/\sqrt{n}) + O(1/\sqrt{T})$. However, when the bounds on these terms hide a dependency on $\lambda$ or $n$ (as we will see for Ex. 2), the bound must be analyzed accordingly.

5 Adaptation to the statistical setting

In this section we present guarantees for our method in the statistical setting. Following the framework outlined in [6, 23, 26], we assume that, for any $t \in \{1, \dots, T\}$, the within-task dataset $Z_t$ is an independently identically distributed (i.i.d.) sample from a distribution (task) $\mu_t$, and in turn the tasks $(\mu_t)_{t=1}^{T}$ are an i.i.d. sample from a meta-distribution $\rho$. The estimator we consider here is $\bar w_{\bar\theta} = \frac{1}{n}\sum_{i=1}^{n} w_{\bar\theta,i}$, the average of the iterates resulting from applying Alg. 2 to a test dataset $Z$ with meta-parameter $\bar\theta = \frac{1}{T}\sum_{t=1}^{T} \theta_t$, the average of the meta-parameters returned by our online meta-algorithm in Alg. 3 applied to the training datasets $(Z_t)_{t=1}^{T}$. We wish to study the performance of such an estimator in expectation w.r.t. the tasks sampled from the environment $\rho$. Formally, for any $\mu \sim \rho$, we require that the corresponding true risk $R_\mu(w) = \mathbb{E}_{(x,y)\sim\mu}\, \ell(\langle x, w \rangle, y)$ admits minimizers over the entire space $\mathbb{R}^d$, and we denote by $w_\mu$ the minimum norm one. With these ingredients, we introduce the oracle $\mathcal{E}^* = \mathbb{E}_{\mu\sim\rho}\, R_\mu(w_\mu)$, representing the expected minimum error over the environment of tasks, and, introducing the transfer risk of the estimator $\bar w_{\bar\theta}$,

$$\mathcal{E}_{\mathrm{stat}}(\bar w_{\bar\theta}) = \mathbb{E}_{\mu\sim\rho}\, \mathbb{E}_{Z\sim\mu^n}\, R_\mu(\bar w_{\bar\theta}(Z)), \qquad (11)$$

we give a bound on it w.r.t. the oracle $\mathcal{E}^*$. This is described in the following theorem.

Theorem 5 (Transfer risk bound). Let the same assumptions as in Thm. 4 hold in the i.i.d. statistical setting. Then, introducing the regularized transfer risk of the average $\bar w_{\bar\theta}$ of the iterates resulting from the combination of Alg. 2 and Alg. 3,

$$\mathcal{E}^{\mathrm{reg}}_{\mathrm{stat}}(\bar w_{\bar\theta}) = \mathbb{E}_{\mu\sim\rho}\, \mathbb{E}_{Z\sim\mu^n}\, \big[ R_\mu(\bar w_{\bar\theta}(Z)) + \lambda f(\bar w_{\bar\theta}(Z), \bar\theta) \big],$$

for any $\theta \in \Theta$ such that $\mathbb{E}_{\mu\sim\rho}\, f(w_\mu, \theta) < +\infty$, the following upper bound holds in expectation w.r.t. the sampling of the datasets $(Z_t)_{t=1}^{T}$:

$$\mathcal{E}^{\mathrm{reg}}_{\mathrm{stat}}(\bar w_{\bar\theta}) \;\le\; \mathcal{E}^* + \lambda\, \mathbb{E}_{\mu\sim\rho}\, f(w_\mu, \theta) + \frac{1}{2\lambda n T}\, \mathbb{E}\sum_{t=1}^{T}\sum_{i=1}^{n} \frac{\|x_{t,i}\, s'_{t,i}\|_{\theta_t,*}^2}{i} + \frac{\gamma F(\theta)}{T} + \frac{1}{2\gamma T}\, \mathbb{E}\sum_{t=1}^{T} |||\nabla'_t|||_*^2 + \mathbb{E}_{\mu\sim\rho}\, \mathbb{E}_{Z\sim\mu^n}\, \frac{1}{2\lambda n}\sum_{i=1}^{n} \frac{\|x_i\, s'_{\bar\theta,i}\|_{\bar\theta,*}^2}{i}.$$

The proof of the statement above is reported in App. E. It exploits the regularized cumulative error bound given in Thm. 4 for our Meta-Learning procedure and two nested online-to-batch conversion steps [9, 22], one within-task and one across-tasks. The bound above is composed of the expectation of the terms appearing in Thm. 4, plus an additional term. The latter comes from the online-to-batch conversion and, as we will see in the sequel, it does not affect the general behavior of the bound. Finally, we observe that, differently from [1, Thm. 6.1] and [3, Thm. 3.3], the theorem above holds for the average of the meta-parameters $(\theta_t)_{t=1}^{T}$ returned by our meta-algorithm (not for a meta-parameter randomly sampled from the pool) and, consequently, it does not require their memorization or the introduction of additional randomization into the process.
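For Ex. 1, the estimator of Sec. 5 can be assembled from the two bias-case sketches above (our naming): average the meta-parameters over the training tasks, then run the inner algorithm on a test dataset and average its iterates.

```python
import numpy as np

def statistical_estimator(train_tasks, test_Z, lam, gamma, d):
    """Sketch of the estimator in Sec. 5 for Ex. 1 (bias)."""
    thetas = meta_bias(train_tasks, lam, gamma, d)    # (theta_t)_{t=1}^{T+1}
    theta_bar = np.mean(thetas[:-1], axis=0)          # bar theta: average of theta_1..theta_T
    _, w_bar, _ = inner_bias(test_Z, theta_bar, lam)  # bar w: average within-task iterate
    return w_bar
```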
In the following section we show that specializing Thm. 4 and Thm. 5 to Ex. 1 and Ex. 2 yields meaningful bounds.

6 Examples

In this section we specialize our framework to Ex. 1 and Ex. 2 outlined at the end of Sec. 2. In order to do this, we require the following assumption, which is for instance satisfied by the absolute loss $\ell(\hat y, y) = |\hat y - y|$ and the hinge loss $\ell(\hat y, y) = \max\{0, 1 - y\hat y\}$, where $y, \hat y \in \mathcal{Y}$.

Assumption 3 (Lipschitz Loss). Let $\ell(\cdot, y)$ be $L$-Lipschitz for any $y \in \mathcal{Y}$.

Below, for any task $t \in \{1, \dots, T\}$, we introduce the input covariance matrices $C_t = \frac{1}{n}\sum_{i=1}^{n} x_{t,i}x_{t,i}^\top$ and $\hat C_t = \sum_{i=1}^{n} \frac{x_{t,i}x_{t,i}^\top}{i}$, together with $C^{\mathrm{tot}} = \frac{1}{T}\sum_{t=1}^{T} C_t$ and $\hat C^{\mathrm{tot}} = \frac{1}{T}\sum_{t=1}^{T} \hat C_t$. We also use the notation $\|C^{\mathrm{tot}}\|_{\infty,a} = \frac{1}{T}\sum_{t=1}^{T} \|C_t\|_\infty^a$ with $a = 1, 2$ and, in the statistical setting, we let $C_\rho = \mathbb{E}_{\mu\sim\rho}\, \mathbb{E}_{(x,y)\sim\mu}\, xx^\top$.

Bias. In App. G we report the adaptation of our method in Alg. 2 and Alg. 3 (cf. Alg. 5 and Alg. 6) and we specialize Thm. 4 and Thm. 5 (cf. Cor. 40 and Cor. 42) to Ex. 1. In such a case, the resulting inner algorithm coincides with online Subgradient Descent on the regularized empirical risk and, similarly, the resulting meta-algorithm coincides with online Subgradient Descent (with approximated subgradients) on the meta-objectives $(L_t)_{t=1}^{T}$. We thus recover the method in [11], with a slightly different choice of the inner algorithm step size. Our results (see App. G.4.2) are in line with [11]: in Cor. 42 we present the same bound with slightly worse constants.

Feature map. In App. H.1 we report the adaptation of our method in Alg. 2 and Alg. 3 (cf. Alg. 7 and Alg. 8) to Ex. 2. In this case, the resulting inner algorithm coincides with a pre-conditioned variant of online Subgradient Descent on the regularized empirical risk, and the resulting meta-algorithm coincides with a lazy variant of online Subgradient Descent (with approximate subgradients) on the meta-objectives $(L_t)_{t=1}^{T}$, projected onto the set $S$. The meta-algorithm we retrieve is a slightly different version of the algorithm we propose in [12] for an OWB statistical framework; see also the sketch of the inner step below.
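For Ex. 2, $\nabla f(\cdot, \theta)^*(u) = \theta u$, so the inner step of Alg. 2 becomes the $\theta$-preconditioned lazy update sketched below (a sketch under our assumptions: absolute loss, our function name; the corresponding meta-update additionally requires a projection onto $S$, which we omit here).

```python
import numpy as np

def inner_feature(Z, theta, lam):
    """Sketch of Alg. 2 for Ex. 2 (feature map), with the absolute loss.

    For f(w, theta) = (1/2)<w, theta^+ w> + indicator of Ran(theta), one has
    grad f(., theta)*(u) = theta @ u, giving the preconditioned lazy step
    w_{i+1} = -(1/gamma_i) * theta @ sum_{j<=i} x_j s'_j."""
    d = len(Z[0][0])
    w = np.zeros(d)                   # w_{theta,1} = theta @ 0 = 0
    dual_sum = np.zeros(d)            # running sum of x_j s'_{theta,j}
    iterates, s_dual = [], []
    for i, (x, y) in enumerate(Z, start=1):
        iterates.append(w.copy())
        s = np.sign(np.dot(x, w) - y)        # s'_{theta,i} in subdiff of |.|
        s_dual.append(s)
        dual_sum += s * x
        w = -(theta @ dual_sum) / (lam * (i + 1))   # theta-preconditioned step
    return iterates, np.mean(iterates, axis=0), np.array(s_dual)
```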
Our next result specializes the cumulative error bound in Thm. 4 to Ex. 2. The proof is in App. H.2.

Corollary 6 (Cumulative error bound, feature map). Let Asm. 3 hold, consider the setting in Thm. 4 applied to Ex. 2 and let $\hat C^{\mathrm{tot}}_{1:T} = \frac{1}{T}\sum_{t=1}^{T} \theta_t \hat C_t$. Then, for any sequence of vectors $(\hat w_t)_{t=1}^{T}$ in $\mathbb{R}^d$, introducing $\hat B = \frac{1}{T}\sum_{t=1}^{T} \hat w_t \hat w_t^\top$, for any $\theta \in S$ such that $\mathrm{Ran}(\hat B) \subseteq \mathrm{Ran}(\theta)$, the following bound holds for our method with an appropriate choice of hyper-parameters:

$$\frac{1}{nT}\sum_{t=1}^{T} \mathcal{E}_{\mathrm{inner}}(Z_t, \theta_t) \;\le\; \frac{1}{T}\sum_{t=1}^{T} R_{Z_t}(\hat w_t) + L\sqrt{\frac{\mathrm{Tr}(\theta^\dagger \hat B)\, \mathrm{Tr}(\hat C^{\mathrm{tot}}_{1:T})}{n}} + O\Big(\frac{\|\theta - \theta_0\|_F + 1}{T^{1/4}}\Big).$$

The next result specializes the transfer risk bound in Thm. 5 to Ex. 2. The proof is in App. H.3.

Corollary 7 (Transfer risk bound, feature map). Let Asm. 3 hold and consider the setting in Thm. 5 applied to Ex. 2. Then, in expectation w.r.t. the sampling of the datasets $(Z_t)_{t=1}^{T}$, introducing $B_\rho = \mathbb{E}_{\mu\sim\rho}\, w_\mu w_\mu^\top$, for any $\theta \in S$ such that $\mathrm{Ran}(B_\rho) \subseteq \mathrm{Ran}(\theta)$, the following bound holds for our method with an appropriate choice of hyper-parameters:

$$\mathcal{E}_{\mathrm{stat}}(\bar w_{\bar\theta}) \;\le\; \mathcal{E}^* + L\sqrt{\frac{\mathrm{Tr}(\theta^\dagger B_\rho)\, \mathrm{Tr}(\theta\, C_\rho)}{n}} + O\Big(\big(\|\theta - \theta_0\|_F + 1\big)\, \frac{\mathbb{E}\,\|C^{\mathrm{tot}}\|_{\infty,2}}{T^{1/4}}\Big).$$

We now analyze the statistical setting. Following [12, 26, 25], we study whether, as the number of tasks grows, our method mimics the performance of the inner algorithm with the best feature map in hindsight (oracle, see App. H.4.1) for any task. We note that, once the meta-parameter $\theta$ in the statement (and hence the hyper-parameters) is fixed in an appropriate way, the above bound in Cor. 7 becomes comparable to the bound for the best feature map in hindsight; see the discussion in App. H.4.2. Hence, we recover the same conclusion: there is an advantage in using the feature map found by our Meta-Learning method w.r.t. solving each task independently when $\|C_\rho\|_\infty$ is small (the inputs are high-dimensional, for instance) and $B_\rho$ is low-rank (the tasks share a low dimensional representation). In addition, note that the bound in Cor. 7 converges, as the number of tasks grows, to the oracle at a rate of $O(T^{-1/4})$, whereas the corresponding bounds for the bias example (cf. Cor. 40 and Cor. 42 in App. G) yield the faster $O(T^{-1/2})$ rate, suggesting that feature learning is a more difficult problem than bias learning. Regarding the non-statistical setting, the bound in Cor. 6 is less clear to interpret, because of the presence of the modified version of the input covariance matrix $\hat C^{\mathrm{tot}}_{1:T}$. Future work may investigate this point, which could be either an artifact of our analysis or due to some intrinsic characteristic of the problem we are considering.

7 Experiments

We present preliminary experiments with our OWO Meta-Learning method (ONL-ONL)² in the statistical setting of Ex. 2. In all experiments, the hyper-parameters $\lambda$ and $\gamma$ were chosen by a meta-validation procedure (see App. I for more details) and we fixed $\theta_0 = I/d$ for the meta-algorithm in Alg. 8. We compared ONL-ONL to: the modified batch-online (BAT-ONL) variant, where the meta-subgradients in the meta-training phase are computed with higher accuracy by a convex solver (such as CVX); Independent-Task Learning (ITL), i.e. running the inner Alg. 7 with the feature map $\theta = I/d$ for each task; and, in the synthetic data experiment, the Oracle, i.e. running the inner Alg. 7 with the best feature map in hindsight for each task, see App. H.4.1.

²The code is available at https://github.com/dstamos/Adversarial-LTL

Synthetic data. We considered the regression setting with the absolute loss function. We generated $T_{\mathrm{tot}} = 3600$ tasks. For each task, the corresponding dataset $(x_i, y_i)_{i=1}^{n_{\mathrm{tot}}}$ of $n_{\mathrm{tot}} = 80$ points was generated according to the linear equation $y = \langle x, w_\mu \rangle + \epsilon$, with $x$ sampled uniformly on the unit sphere in $\mathbb{R}^d$, $d = 20$, and $\epsilon$ sampled from a Gaussian distribution $G(0, 0.2)$. The task predictors were generated as $w_\mu = P \tilde w_\mu$, with the components of $\tilde w_\mu \in \mathbb{R}^{d/5}$ sampled from $G(0, 1)$ and $w_\mu$ then normalized to have unit norm, where $P \in \mathbb{R}^{d \times d/5}$ is a matrix with orthonormal columns. In this setting, the operator norm of the input covariance matrix $C_\rho$ is small (equal to $1/d$) and the weight vector covariance matrix $B_\rho$ is low-rank, a favorable setting for our method according to Cor. 7. Looking at the results in Fig. 1 (Left), we can state that, in this setting, our method outperforms ITL and tends to the Oracle as the number of training tasks increases. Moreover, the performance of ONL-ONL and BAT-ONL is comparable, suggesting that our approximation of the meta-subgradients is an effective way to keep the entire process fully online.

Figure 1: Synthetic data (Left) and Movielens-100k dataset (Right). Performance of different methods as the number of training tasks increases. The results are averaged over 10 runs/splits of the data.

Figure 2: Mini-Wiki dataset (Left) and Jester-1 dataset (Right). Performance of different methods as the number of training tasks increases. The results are averaged over 10 splits of the data.
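For reference, a minimal sketch of the synthetic task generator described above; the function names are ours, and we take $0.2$ as the standard deviation of the noise, which is an assumption, since $G(0, 0.2)$ leaves it ambiguous.

```python
import numpy as np

def make_environment(d=20, rng=None):
    """Sketch of the synthetic environment: tasks share a fixed matrix P with
    orthonormal columns, w_mu = P w~ normalized to unit norm, and each point
    obeys y = <x, w_mu> + eps with x uniform on the unit sphere."""
    if rng is None:
        rng = np.random.default_rng(0)
    P, _ = np.linalg.qr(rng.standard_normal((d, d // 5)))   # shared subspace

    def sample_task(n=80, noise=0.2):
        w_mu = P @ rng.standard_normal(d // 5)               # w~ components ~ G(0, 1)
        w_mu /= np.linalg.norm(w_mu)                         # unit-norm predictor
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-sphere inputs
        y = X @ w_mu + noise * rng.standard_normal(n)        # eps ~ G(0, 0.2)
        return list(zip(X, y))

    return sample_task
```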
Real data. We further validated the proposed method on three real datasets: 1) the Movielens-100k dataset³, containing the ratings of different users for different movies; 2) the Mini-Wiki dataset from [3], containing sentences from Wikipedia pages; and 3) the Jester-1 dataset⁴, containing the ratings of different users for different jokes. For the Movielens-100k and Jester-1 datasets we considered each user as a task and each movie/joke as a point. Specifically, we cast each task as a regression problem where the labels are the ratings of the users and the raw features are simply the index of the movie/joke (i.e. a matrix completion setting where the input dimension $d$ coincides with the number of points). For the Mini-Wiki dataset we cast each task as a multi-class classification problem where the labels are the Wikipedia pages and the features are vectors of dimension $d = 50$. After processing the data, we ended up with a total number of $T_{\mathrm{tot}} = 939, 813, 5700$ tasks and $n_{\mathrm{tot}} = 939, 128, 100$ points per task for the Movielens-100k, Mini-Wiki and Jester-1 datasets, respectively. In the above formulation of the problem for the Movielens-100k and Jester-1 datasets, the ITL algorithm is not able to predict any rating for the movies/jokes without observed ratings. For this reason, in order to evaluate the performance of the Meta-Learning methods ONL-ONL and BAT-ONL, we introduced a more challenging baseline for this particular formulation of the problem in which, for the movies/jokes without any observed rating, we predicted the average of the ratings of all the observed users at the end of the entire sequence of tasks. We denote this method by BAT. In Fig. 1 (Right) and Fig. 2 we report the performance of the methods, using the absolute loss for the Movielens-100k and Jester-1 datasets and the multi-class hinge loss for the Mini-Wiki dataset. The results are consistent with the synthetic experiments above, showing the effectiveness of our method also in real-life scenarios. We note also that the online Meta-Learning methods outperform the BAT method as the number of training tasks increases.

³https://grouplens.org/datasets/movielens/
⁴http://goldberg.berkeley.edu/jester-data/

8 Conclusion

We presented a fully online Meta-Learning method stemming from primal-dual Online Learning. Our method can be adapted to a wide class of learning algorithms and it covers various types of task relatedness. By means of a new analysis technique we derived a cumulative error bound for our method, from which it is also possible to obtain guarantees in the statistical setting. We illustrated our framework with two important examples, bias and feature learning, improving upon state-of-the-art results. To conclude, we believe that the generality of our framework and our method of proof could be a valuable starting point for future theoretical investigations of Meta-Learning.

Acknowledgments

This work was supported in part by EPSRC Grant N. EP/P009069/1.

References

[1] P. Alquier, T. T. Mai, and M. Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261–269, 2017.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[3] M.-F. Balcan, M. Khodak, and A. Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433, 2019.
[4] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.
[5] J. Baxter. Theoretical models of learning to learn. In Learning to Learn, pages 71–94. 1998.
[6] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] B. Bullins, E. Hazan, A. Kalai, and R. Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pages 235–246, 2019.
[9] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[10] C. Ciliberto, Y. Mroueh, T. Poggio, and L. Rosasco. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning, pages 1548–1557, 2015.
[11] G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.
[12] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[13] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, pages 10190–10200, 2018.
[14] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[15] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR, 2017.
[16] C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930, 2019.
[17] R. Gupta and T. Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing, 46(3):992–1017, 2017.
[18] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.
[19] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms: Advanced Theory and Bundle Methods. Springer, 2010.
[20] S. Kakade, S. Shalev-Shwartz, and A. Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2009.
[21] M. Khodak, M.-F. Balcan, and A. Talwalkar. Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717, 2019.
[22] N. Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269–284, 1989.
[23] A. Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.
[24] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.
[25] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
[26] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
[27] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1–38, 2016.
[28] C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455–489, 2013.
[29] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991–999, 2014.
[30] A. Pentina and R. Urner. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pages 3612–3620, 2016.
[31] J. Peypouquet. Convex Optimization in Normed Spaces: Theory, Methods and Examples. Springer, 2015.
[32] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.
[33] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
[34] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[35] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[36] S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, pages 1457–1464, 2009.
[37] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems, pages 1265–1272, 2007.
[38] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.
[39] P. Stange. On the efficient update of the singular value decomposition. In PAMM: Proceedings in Applied Mathematics and Mechanics, volume 8, pages 10827–10828. Wiley Online Library, 2008.
[40] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.
[41] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, and H. Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.