Conditional Meta-Learning of Linear Representations

Giulia Denevi, Leonardo Labs (Italy), giulia.denevi.ext@leonardo.com
Massimiliano Pontil, Istituto Italiano di Tecnologia (Italy) & University College London (UK), massimiliano.pontil@iit.it
Carlo Ciliberto, University College London (UK) & Istituto Italiano di Tecnologia (Italy), c.ciliberto@ucl.ac.uk

Standard meta-learning for representation learning aims to find a common representation to be shared across multiple tasks. The effectiveness of these methods is often limited when the nuances of the task distribution cannot be captured by a single representation. In this work we overcome this issue by inferring a conditioning function, mapping the tasks' side information (such as the tasks' training dataset itself) into a representation tailored to the task at hand. We study environments in which our conditional strategy outperforms standard meta-learning, such as those in which tasks can be organized in separate clusters according to the representation they share. We then propose a meta-algorithm capable of leveraging this advantage in practice. In the unconditional setting, our method yields a new estimator enjoying faster learning rates and requiring fewer hyper-parameters to tune than current state-of-the-art methods. Our results are supported by preliminary experiments.

1 Introduction

Learning a shared representation among a class of machine learning problems is a well-established approach used both in multi-task learning [3, 20, 11] and meta-learning [18, 15, 5, 17, 35, 24, 30, 9, 7]. The idea behind this methodology is to consider two nested problems: at the within-task level, an empirical risk minimization is performed on each task, using inputs transformed by the current representation; at the outer-task (meta-) level, this representation is updated taking into account the errors of the within-task algorithm on previous tasks.
Such a technique was shown to be advantageous in contrast to solving each task independently when the tasks share a low-dimensional representation, see e.g. [27, 25, 15, 24, 35, 5, 22, 9]. However, in real-world applications we often deal with heterogeneous classes of learning tasks, which may overall be only loosely related. Consequently, the tasks' commonalities may not be captured well by a single representation shared among all the tasks. This is for instance the case in which the tasks can be organized in different groups (clusters), where only tasks belonging to the same cluster share the same low-dimensional representation. In order to overcome this issue, previous authors developed non-convex methods (or convex relaxations) attempting to cluster the tasks, see e.g. [4, 26, 2, 20, 28, 40, 38, 31]. In this work, we follow the recent literature on heterogeneous meta-learning [37, 36, 32, 21, 10, 39, 14, 7] and propose a so-called conditional meta-learning approach for meta-learning a representation. Our algorithm learns a conditioning function mapping available task side information into a linear representation that is tuned to the task at hand. Our approach borrows from [14], where the authors proposed a conditional meta-learning approach for fine tuning and biased regularization. In those cases, however, the tasks' target vectors are assumed to be all close to a common bias vector, rather than sharing the same low-dimensional linear representation as explored in this work. As we explain in the following, working with the representation setting requires a significant contribution with respect to the bias one: it calls for a new formulation of the problem, a different meta-objective and a different interpretation of the results.

(Work done while the first author was with Istituto Italiano di Tecnologia (Italy). 36th Conference on Neural Information Processing Systems (NeurIPS 2022).)
In addition, the representation setting is known to be a more relevant and effective framework in many scenarios in comparison to the bias one (see e.g. [28]). In this work, we propose for the first time an online conditional method for linear representation learning with strong theoretical guarantees. In particular, we show that the method is advantageous over standard (unconditional) representation learning methods used in meta-learning when the environment of observed tasks is heterogeneous.

Contributions and organization. The contributions of this work are the following. First, in Sec. 2, we design a conditional meta-learning approach to infer a linear representation that is tuned to the task at hand. Second, in Sec. 3, we formally characterize circumstances under which our conditional framework brings an advantage with respect to the standard unconditional approach. In particular, we argue that this is the case when the tasks are organized in different clusters according to the support pattern or linear representation their target vectors share. Third, in Sec. 4, we design a convex meta-algorithm providing a comparable gain as the number of tasks it observes increases. In the unconditional setting, the proposed method recovers faster rates and requires tuning one fewer hyper-parameter than the state-of-the-art unconditional methods. Finally, in Sec. 5, we present numerical experiments supporting our theoretical claims. We conclude our work in Sec. 6 and postpone the missing proofs to the supplementary material.

2 Conditional representation learning

In this section we introduce our conditional meta-learning setting for representation learning. Then, we proceed to identify the differences with respect to the standard unconditional counterpart. We begin our overview by first introducing the class of inner learning algorithms we use in this work.

Within-task algorithms.
We consider the standard linear supervised learning setting over Z = X × Y, with X ⊆ R^d and Y ⊆ R input and output spaces, respectively. We denote by P(Z) the set of probability distributions (tasks) over Z. For any task µ ∈ P(Z) and a given loss function ℓ : R × R → R, we aim at finding a weight vector w_µ ∈ R^d minimizing the expected risk

min_{w ∈ R^d} R_µ(w),   R_µ(w) = E_{(x,y)∼µ} ℓ(⟨x, w⟩, y),   (1)

where ⟨·, ·⟩ represents the Euclidean inner product in R^d. In practice, µ is only partially observed through a dataset Z = (x_i, y_i)_{i=1}^n ∼ µ^n, namely, a collection of n independently and identically distributed (i.i.d.) points sampled from µ. Thus, the goal becomes to use a learning algorithm in order to estimate a candidate weight vector with a small expected risk, converging to the ideal R_µ(w_µ) as the sample size n grows. Specifically, in this work we will consider as candidate estimators the family of regularized empirical risk minimizers for linear feature learning [3]. Formally, denoting by D = ∪_{n∈N} Z^n the space of all datasets on Z, for a given θ ∈ Θ_in = S^d_+, the set of positive semi-definite d × d matrices, we will consider the following learning algorithms A(θ, ·) : D → R^d:

A(θ, Z) = argmin_{w ∈ Ran(θ) ⊆ R^d} R_{Z,θ}(w),   R_{Z,θ}(w) = (1/n) Σ_{i=1}^n ℓ(⟨x_i, w⟩, y_i) + (1/2) ⟨w, θ† w⟩,   (2)

where Ran(θ) denotes the range of θ and θ† denotes the pseudoinverse of θ. Throughout this work we will denote by R_Z(·) = (1/n) Σ_{i=1}^n ℓ(⟨x_i, ·⟩, y_i) the empirical risk associated to Z. Here, θ plays the role of a linear feature representation that is learned during the meta-learning process (see [3]).

Remark 1 (Within-task regularization parameter). Differently from previous work, see e.g. [15], we do not impose any constraint on the trace of θ (e.g. Tr(θ) ≤ 1). This allows us to absorb into θ the regularization parameter λ typically used to weight the regularizer. As we will discuss later, this choice reduces the number of hyper-parameters to tune and allows us to enjoy faster learning rates.

Remark 2 (Online variant of Eq. (2)).
Paying additional negligible logarithmic factors, our analysis and results extend also to the setting in which the minimizer in Eq. (2) is replaced by a pre-conditioned variant of online gradient descent on R_{Z,θ}, with starting point w_0 = 0 and appropriate step sizes (γ_i)_{i=1}^n:

A(θ, Z) = (1/n) Σ_{i=1}^n w_i,   w_{i+1} = w_i − γ_i θ p_i,   p_i = s_i x_i + θ† w_i,   s_i ∈ ∂ℓ(·, y_i)(⟨x_i, w_i⟩).   (3)

Unconditional Meta-Learning. The standard unconditional meta-learning setting assumes there exists a meta-distribution ρ ∈ P(M), also called environment in [6], over a family M ⊆ P(Z) of distributions (tasks) µ, and it aims at selecting an inner algorithm in the family above that is well suited to solve tasks µ sampled from ρ. This target can be reformulated as finding a linear representation θ ∈ Θ such that the corresponding algorithm A(θ, ·) minimizes the transfer risk

min_{θ ∈ Θ} E(θ),   E(θ) = E_{µ∼ρ} E_{Z∼µ^n} R_µ(A(θ, Z)).   (4)

In practice, this stochastic problem is usually tackled by iteratively sampling a task µ ∼ ρ and a corresponding dataset Z ∼ µ^n, and then performing a step of stochastic gradient descent on an empirical approximation of Eq. (4) computed from Z. This approach has proven effective, for instance, when the tasks of the environment share a simple common linear representation, see e.g. [18, 5, 22, 15, 16, 12, 17, 9]. However, when a single linear representation is not sufficient for the entire environment of tasks (e.g. multiple clusters), this homogeneous approach is expected to fail. In order to overcome this limitation, some recent works have adopted the following conditional approach to the problem, see e.g. [37, 36, 32, 21, 10, 39, 14].

Conditional Meta-Learning. Analogously to [14], we assume that any task µ comes with additional side information s ∈ S. In such a case, we consider the environment as a distribution ρ ∈ P(M, S) over the set M of tasks and the set S of possible side information. Moreover, as usual, we decompose ρ as ρ(·|s) ρ_S(·) and as ρ(·|µ) ρ_M(·), in terms of the conditional and marginal distributions with respect to S and M.
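As an illustration, the within-task estimator of Eq. (2) admits a closed form when the loss is the squared loss, and the online variant of Eq. (3) can be sketched for the absolute loss. The sketch below assumes the regularizer (1/2)⟨w, θ†w⟩ and an illustrative step-size schedule γ_i = γ/√i; these constants are assumptions of this sketch, not the paper's exact choices.

```python
import numpy as np

def within_task_batch(theta, X, y):
    """Eq. (2) for the squared loss: minimize (1/n)||Xw - y||^2 + (1/2)<w, theta^† w>
    over w in Ran(theta). Substituting w = theta^{1/2} v gives ridge regression in v;
    the push-through identity then yields the closed form below."""
    n = X.shape[0]
    K = X @ theta @ X.T + (n / 2.0) * np.eye(n)
    # w = theta X^T (X theta X^T + (n/2) I_n)^{-1} y, automatically in Ran(theta)
    return theta @ X.T @ np.linalg.solve(K, y)

def within_task_online(theta, X, y, gamma=0.5):
    """Eq. (3) for the absolute loss: pre-conditioned online subgradient descent
    from w_0 = 0, returning the average iterate."""
    n, d = X.shape
    theta_pinv = np.linalg.pinv(theta)
    w, avg = np.zeros(d), np.zeros(d)
    for i in range(n):
        s = np.sign(X[i] @ w - y[i])      # s_i in the subdifferential of |.|
        p = s * X[i] + theta_pinv @ w     # p_i = s_i x_i + theta^† w_i
        w = w - gamma / np.sqrt(i + 1) * (theta @ p)   # pre-conditioned step
        avg += w
    return avg / n
```

With theta = I_d the batch solver reduces to standard ridge regression with parameter n/2, and with a rank-deficient theta the estimate automatically lies in Ran(theta).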
For instance, the side information s could contain descriptive features of the associated task, for example attributes in collaborative filtering [1] or additional information about the users in recommendation systems [19]. Moreover, s could be formed by a portion of the dataset sampled from µ (see [37, 14]). Conditional meta-learning leverages this additional side information in order to adapt (or condition) the linear representation θ ∈ Θ to the associated task at hand, by learning a linear-representation-valued function τ solving the problem

min_{τ ∈ T} E(τ),   E(τ) = E_{(µ,s)∼ρ} E_{Z∼µ^n} R_µ(A(τ(s), Z))   (5)

over the space T of measurable functions τ : S → Θ. Notice that we retrieve the unconditional meta-learning problem in Eq. (4) if we restrict Eq. (5) to the set of functions T_const = {τ | τ(·) ≡ θ, θ ∈ Θ}, mapping all the side information into the same constant linear representation. In the next section, we investigate the theoretical advantages of adopting such a conditional perspective and, then, we introduce a convex meta-algorithm to tackle Eq. (5).

3 The advantage of conditional representation learning

In order to characterize the behavior of the optimal solution of Eq. (5) and to investigate the potential advantage of conditional meta-learning, we analyze the generalization properties of a given conditioning function τ. Formally, we compare the error E(τ) with the optimal minimum risk

E∗ = E_{µ∼ρ} R_µ(w_µ),   w_µ = argmin_{w ∈ R^d} R_µ(w).   (6)

In order to do this, we first need to introduce the following standard assumptions used also in previous literature. Throughout this work we will denote by ·^⊤ the standard transposition operation.

Assumption 1. Let ℓ be a convex and L-Lipschitz loss function in its first argument. Additionally, there exists R > 0 such that ‖x‖ ≤ R for any x ∈ X.

Theorem 1 (Excess risk with generic conditioning function τ). Let Asm. 1 hold.
For any s ∈ S, introduce the conditional covariance matrices W(s) = E_{µ∼ρ(·|s)} w_µ w_µ^⊤ and C(s) = E_{µ∼ρ(·|s)} E_{x∼µ_X} xx^⊤, where µ_X denotes the inputs' marginal distribution of the task µ. Let τ ∈ T be such that Ran(W(s)) ⊆ Ran(τ(s)) for any s ∈ S and let A(τ(s), ·) be the associated inner algorithm from Eq. (2). Then,

E(τ) − E∗ ≤ (1/2) E_{s∼ρ_S} Tr(τ(s)† W(s)) + (2L²/n) E_{s∼ρ_S} Tr(τ(s) C(s)).   (7)

Proof. For any (µ, s) ∼ ρ, consider the decomposition

E(τ) − E∗ ≤ E_{(µ,s)∼ρ} [B_{µ,s} + C_{µ,s}],
B_{µ,s} = E_{Z∼µ^n} [R_µ(A(τ(s), Z)) − R_Z(A(τ(s), Z))],
C_{µ,s} = E_{Z∼µ^n} [R_Z(A(τ(s), Z))] − R_µ(w_µ).

B_{µ,s} is the generalization error of the inner algorithm A(τ(s), ·) on the task µ. Hence, applying stability arguments (see Prop. 6 in App. A), we can write B_{µ,s} ≤ 2L² Tr(τ(s) E_{x∼µ_X} xx^⊤) n^{−1}. Regarding the term C_{µ,s}, for any conditioning function τ such that w_µ ∈ Ran(τ(s)), we can write C_{µ,s} ≤ E_{Z∼µ^n}[R_{Z,τ(s)}(w_µ)] − R_µ(w_µ) = (1/2) ⟨w_µ, τ(s)† w_µ⟩, where the inequality exploits the definition of the algorithm in Eq. (2) as the minimum of the regularized empirical risk. The desired statement follows by combining the two bounds above and rewriting E_{(µ,s)∼ρ} = E_{s∼ρ_S} E_{µ∼ρ(·|s)}.

Thm. 1 suggests that the conditioning function minimizing the right-hand side of Eq. (7) is a good candidate to solve the meta-learning problem. The following result explores this question by showing that such a minimizer admits a closed-form solution. The proof is reported in App. B. In the following, we will denote by ‖·‖_F and ‖·‖_* the Frobenius and trace norm of a matrix, respectively.

Proposition 2 (Best conditioning function in hindsight). The minimizer and the minimum of the bound presented in Thm. 1 over the set {τ ∈ T | Ran(W(s)) ⊆ Ran(τ(s)), ρ_S-almost surely} are, respectively,

τ∗(s) = (2L)^{−1} n^{1/2} C(s)^{−1/2} (C(s)^{1/2} W(s) C(s)^{1/2})^{1/2} C(s)^{−1/2},   2L n^{−1/2} E_{s∼ρ_S} ‖W(s)^{1/2} C(s)^{1/2}‖_*.   (8)

We observe that, in comparison to [14], the numerator term in the bound above describes a different kind of task-similarity assumption: the conditional variance term E_{(µ,s)∼ρ} ‖w_µ − E_{µ∼ρ(·|s)} w_µ‖² present in [14] is now replaced by the trace norm term E_{s∼ρ_S} ‖W(s)^{1/2} C(s)^{1/2}‖_* above.
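As a numerical sketch, the closed form of Prop. 2 can be computed with standard eigendecompositions. The pseudoinverse roots below (for possibly singular C(s)) are an implementation choice of this sketch.

```python
import numpy as np

def psd_power(A, power):
    """Matrix power of a symmetric PSD matrix via eigendecomposition;
    negative powers are taken on the nonzero eigenvalues (pseudoinverse)."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)
    pow_vals = np.where(vals > 1e-12, vals ** power, 0.0)
    return (vecs * pow_vals) @ vecs.T

def best_representation(W, C, L, n):
    """theta*(s) = (2L)^{-1} n^{1/2} C^{-1/2} (C^{1/2} W C^{1/2})^{1/2} C^{-1/2} (Prop. 2)."""
    C_half = psd_power(C, 0.5)
    C_inv_half = psd_power(C, -0.5)
    mid = psd_power(C_half @ W @ C_half, 0.5)
    return np.sqrt(n) / (2.0 * L) * C_inv_half @ mid @ C_inv_half
```

When C = I_d the formula reduces to θ∗ = (2L)^{−1} √n W^{1/2}, i.e. a representation whose eigenvalues grow with the spread of the task vectors along each direction.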
Additionally, the bound above allows us to quantify the benefits of adopting the conditional feature learning strategy.

Conditional vs. unconditional Meta-Learning. Applying Prop. 2 to T_const, we obtain the optimal (constant) meta-parameter and the corresponding excess risk bound for unconditional meta-learning

θ∗ = (2L)^{−1} n^{1/2} C^{−1/2} (C^{1/2} W C^{1/2})^{1/2} C^{−1/2},   2L n^{−1/2} ‖W^{1/2} C^{1/2}‖_*,   (9)

with unconditional covariance matrices W = E_{µ∼ρ} w_µ w_µ^⊤ and C = E_{µ∼ρ} E_{x∼µ_X} xx^⊤. We observe that in the previous literature [13, 15] the authors restricted the unconditional problem to the smaller class of linear representations Θ̂ = {θ ∈ S^d_+ : Ran(W) ⊆ Ran(θ), Tr(θ) ≤ 1} and considered as the best unconditional representation the matrix minimizing only a part of the previous bound, namely,

θ̂ = argmin_{θ ∈ Θ̂} Tr(θ† W).   (10)

On the other hand, the unconditional oracle we introduce above in Eq. (9) allows us to recover a tighter bound, which matches the best performance between independent task learning (ITL) and the oracle considered in previous literature [15]. Indeed, by exploiting the duality between the trace norm ‖·‖_* and the operator norm ‖·‖_∞ of a matrix, the right-hand side of Eq. (9) can be upper bounded by the minimum between the bound for independent task learning and the bound obtained by previous authors for the unconditional oracle. Notice that the unconditional quantity in Eq. (9) is always larger than the conditional quantity in Eq. (8), since Eq. (9) coincides with the minimum over a smaller class of functions. In order to quantify the gap between these two quantities, namely, the advantage of the conditional approach over the unconditional one, we have to compare the term ‖W^{1/2} C^{1/2}‖_* with the term E_{s∼ρ_S} ‖C(s)^{1/2} W(s)^{1/2}‖_*. We report below a setting, illustrative for many real-world scenarios, in which such a gap in performance is significant.

Example 1 (Clusters). Let S = R^q be the side information space, for some integer q > 0.
Let ρ be such that the side information marginal distribution ρ_S is given by a uniform mixture of m uniform distributions. More precisely, let ρ_S = (1/m) Σ_{i=1}^m ρ_S^{(i)}, with ρ_S^{(i)} the uniform distribution on the ball of radius 1/2 centered at a_i ∈ S, characterizing cluster i. For a given side information s, a task µ ∼ ρ(·|s) is sampled such that: 1) its inputs' marginal µ_X is a distribution with constant covariance matrix C(s) = E_{µ∼ρ(·|s)} E_{x∼µ_X} xx^⊤ = C, for some C ∈ S^d_+; 2) w_µ is sampled from a distribution with conditional covariance matrix W(s) = E_{µ∼ρ(·|s)} w_µ w_µ^⊤, with W(s) such that (C^{1/2} W(s) C^{1/2})(C^{1/2} W(p) C^{1/2}) = 0 if s ≠ p. Then,

E_{s∼ρ_S} ‖C(s)^{1/2} W(s)^{1/2}‖_* ≤ (1/√m) ‖C^{1/2} W^{1/2}‖_*.

The inequality above tells us that, in the setting of Ex. 1, the conditional approach gains a √m factor in comparison to the unconditional approach. Therefore, the larger the number of clusters, the more pronounced the advantage of the conditional approach with respect to the unconditional one. The √m gain factor follows from the fact that the weight vectors w_µ sampled from the different clusters share disjoint supports (they share orthogonal representations). This allows us to rewrite the overall clusters' weight-vector covariance as the average of the intra-cluster weight-vector covariances. The √m term comes from this rewriting and the quadratic behavior of the covariance matrix. We refer to App. C for more details and the deduction. We also observe that a particular case of the setting above is that in which q = 1 and the side information is a noisy observation of the index of the cluster the task belongs to. In our experiments, in Sec. 5, we consider a more interesting and realistic variant of the setting above, in which we use as a task's side information a training dataset sampled from the task itself. In the next section, we introduce a convex meta-algorithm mimicking this advantage also in practice.
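The √m gain of Ex. 1 can be checked numerically. The sketch below builds m clusters whose weight covariances live on disjoint coordinate blocks and takes C = I_d (an assumption of this illustration), so that the trace norms reduce to traces of matrix square roots:

```python
import numpy as np

def trace_norm(A):
    """Trace (nuclear) norm: sum of singular values."""
    return np.linalg.svd(A, compute_uv=False).sum()

def sqrt_psd(A):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.clip(vals, 0.0, None) ** 0.5) @ vecs.T

# m clusters, each with a rank-r weight covariance on its own coordinate block.
m, r = 4, 2
d = m * r
rng = np.random.default_rng(0)
Ws = []
for i in range(m):
    B = rng.standard_normal((r, r))
    Wi = np.zeros((d, d))
    Wi[i*r:(i+1)*r, i*r:(i+1)*r] = B @ B.T   # cluster i's covariance block
    Ws.append(Wi)
W = sum(Ws) / m                               # unconditional covariance

conditional = np.mean([trace_norm(sqrt_psd(Wi)) for Wi in Ws])
unconditional = trace_norm(sqrt_psd(W))
print(unconditional / conditional)            # -> 2.0 (= sqrt(m))
```

Because the blocks are orthogonal, W^{1/2} = (1/√m) Σ_i W_i^{1/2}, which is exactly where the √m separation between the two quantities comes from.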
4 Conditional representation Meta-Learning algorithm

To tackle conditional meta-learning in practice, we consider a parametrization of the conditioning functions with respect to a given feature map Φ : S → R^k (with k ∈ N) on the side information space. In other words, we consider τ : S → S^d_+ of the form

τ(s) = (MΦ(s))^⊤ MΦ(s) + C,   (11)

for some tensor M ∈ R^{p×d×k} (p ∈ N) and matrix C ∈ S^d_+, where MΦ(s) ∈ R^{p×d} denotes the contraction of M with Φ(s) along the last dimension. By construction, the above parametrization guarantees that we learn functions taking values in the set of positive semi-definite matrices. However, directly addressing the meta-learning problem in this form poses two issues: first, dealing with tensorial structures might become computationally challenging in practice; second, the parametrization is quadratic in M and would lead to a non-convex optimization problem. To tackle these issues, the following result shows that we can rewrite the conditioning function in the form of Eq. (11) by using a matrix in S^{dk}_+. This will allow us to implement our method working with matrices in S^{dk}_+, instead of tensors in R^{p×d×k}. Throughout this work, we will denote by ⊗ the Kronecker product.

Proposition 3 (Matricial reformulation of τ_M(s)). Let τ be as in Eq. (11). Then,

τ(s) = (I_d ⊗ Φ(s))^⊤ H_M (I_d ⊗ Φ(s)) + C,   (12)

where I_d is the identity in R^{d×d} and H_M is the matrix in R^{dk×dk} defined by the entries

(H_M)_{(i−1)k+h, (j−1)k+z} = ⟨M(:, i, h), M(:, j, z)⟩,   i, j = 1, …, d,   h, z = 1, …, k.

The arguments above motivate us to consider the following set of conditioning functions:

T_Φ = {τ_{H,C} | τ_{H,C}(s) = (I_d ⊗ Φ(s))^⊤ H (I_d ⊗ Φ(s)) + C, such that H ∈ S^{dk}_+, C ∈ S^d_+}.   (13)

To highlight the dependency of a function τ ∈ T_Φ on its parameters H and C, we write τ = τ_{H,C}. Evidently, T_Φ contains the space of all unconditional estimators T_const. We consider T_Φ equipped with the canonical norm ‖τ_{H,C}‖² = ‖(H, C)‖²_F, where, recall, ‖·‖_F denotes the Frobenius norm. The following two standard assumptions will allow us to design and analyze our method.

Assumption 2.
The optimal function τ∗ belongs to T_Φ, namely there exist H∗ ∈ S^{dk}_+ and C∗ ∈ S^d_+ such that τ∗(·) = τ_{H∗,C∗}(·).

Assumption 3. There exists K > 0 such that ‖Φ(s)‖ ≤ K for any s ∈ S.

Asm. 2, known as the well-specified setting assumption (see e.g. [33]), is a standard assumption in learning theory and it allows us to restrict the conditional meta-learning problem in Eq. (5) to T_Φ, rather than the entire space T of measurable functions. Asm. 3 ensures that the meta-objective is Lipschitz (see below).

The convex surrogate problem. We start by observing that, exploiting the generalization properties of the within-task algorithm (see Prop. 6 in App. A), we can write

E(τ) ≤ E_{(µ,s)∼ρ} E_{Z∼µ^n} F_Z(τ(s)),   F_Z(θ) = R_{Z,θ}(A(θ, Z)) + (2L²/n²) Tr(θ X^⊤X),

where X ∈ R^{n×d} is the matrix with the input vectors (x_i)_{i=1}^n as rows. The inequality above suggests introducing the surrogate problem

min_{τ ∈ T} Ê(τ),   Ê(τ) = E_{(µ,s)∼ρ} E_{Z∼µ^n} F_Z(τ(s)).   (14)

We stress that the surrogate problem we adopt here is different from the one considered in previous work [12, 15, 14, 9], where the authors took as meta-objective only a part of the function above, namely, E_{(µ,s)∼ρ} E_{Z∼µ^n} R_{Z,τ(s)}(A(τ(s), Z)). As we will see in the following, our choice is more appropriate for the problem at hand, since, differently from the meta-objective used in previous literature, it allows us to develop a conditional meta-learning method that is theoretically grounded also for linear representation learning. Exploiting Asm. 2, the surrogate problem in Eq. (14) can be restricted to the class of linear functions T_Φ in Eq. (13) and rewritten more explicitly as

min_{H ∈ S^{dk}_+, C ∈ S^d_+} E_{(µ,s)∼ρ} E_{Z∼µ^n} L(H, C, s, Z),   L(H, C, s, Z) = F_Z(τ_{H,C}(s)).   (15)

In the following proposition we outline some useful properties (such as convexity) of the meta-loss L introduced above, supporting its choice as a surrogate meta-loss.

Proposition 4 (Properties of the surrogate meta-loss L).
For any Z ∈ D and s ∈ S, the function L(·, ·, s, Z) is convex, and one of its subgradients ∇L(·, ·, s, Z)(H, C) = (∇̂_H, ∇̂_C) at (H, C) ∈ S^{dk}_+ × S^d_+ is given by

G = −(1/2) τ_{H,C}(s)† w_{H,C}(s) w_{H,C}(s)^⊤ τ_{H,C}(s)† + (2L²/n²) X^⊤X,
∇̂_H = (I_d ⊗ Φ(s)) G (I_d ⊗ Φ(s))^⊤,   ∇̂_C = G,

where w_{H,C}(s) = A(τ_{H,C}(s), Z). Moreover, by Asm. 1 and Asm. 3,

‖∇L(·, ·, s, Z)(H, C)‖_F ≤ (1 + K²)(LR)² (2^{−1} + 2n^{−1}).

The proof of Prop. 4 is reported in App. D.2. It follows from combining results from [15] with the composition with the linear parametrization of the functions τ_{H,C} ∈ T_Φ.

The conditional Meta-Learning estimator. The meta-learning strategy we propose consists in applying Stochastic Gradient Descent (SGD) to the surrogate problem in Eq. (15). This meta-algorithm is implemented in Alg. 1: we assume to observe a sequence of i.i.d. pairs (Z_t, s_t)_{t=1}^T of training datasets and side information, and at each iteration we update the conditional parameters (H_t, C_t) by performing a step of constant size γ > 0 in the direction of −∇L(·, ·, s_t, Z_t)(H_t, C_t), followed by a projection onto S^{dk}_+ × S^d_+. Finally, we output the conditioning function τ_{H̄,C̄} parametrized by (H̄, C̄), the averages across all the iterates (H_t, C_t)_{t=1}^T. The theorem below analyzes the generalization properties of such a conditioning function.

Algorithm 1 Meta-algorithm, SGD on Eq. (15)
Input: γ > 0 meta-step size, H_0 ∈ S^{dk}_+, C_0 ∈ S^d_+
Initialization: H_1 = H_0 ∈ S^{dk}_+, C_1 = C_0 ∈ S^d_+
For t = 1 to T:
  Receive (µ_t, s_t) ∼ ρ and Z_t ∼ µ_t^n
  Set θ_t = (I_d ⊗ Φ(s_t))^⊤ H_t (I_d ⊗ Φ(s_t)) + C_t and compute w_t = A(θ_t, Z_t) by Eq. (2)
  Compute ∇L(·, ·, s_t, Z_t)(H_t, C_t) as in Prop. 4 with w_t
  Update (H_{t+1}, C_{t+1}) = proj[(H_t, C_t) − γ ∇L(·, ·, s_t, Z_t)(H_t, C_t)]
Return H̄ = (1/T) Σ_{t=1}^T H_t, C̄ = (1/T) Σ_{t=1}^T C_t

Theorem 5 (Excess risk bound for the conditioning function returned by Alg. 1). Let Asm. 1 and Asm. 3 hold. For any s ∈ S, recall the conditional covariance matrices W(s) and C(s) introduced in Thm. 1. Let τ_{H,C} be a fixed function in T_Φ such that Ran(W(s)) ⊆ Ran(τ_{H,C}(s)) for any s ∈ S. Let H̄ and C̄ be the outputs of Alg. 1 applied to a sequence (Z_t, s_t)_{t=1}^T of i.i.d. pairs sampled from ρ, with an appropriate meta-step size γ.
Then, in expectation with respect to the sampling of (Z_t, s_t)_{t=1}^T,

E[E(τ_{H̄,C̄})] − E∗ ≤ (1/2) E_{s∼ρ_S} Tr(τ_{H,C}(s)† W(s)) + (2L²/n) E_{s∼ρ_S} Tr(τ_{H,C}(s) C(s)) + (1 + K²)(LR)² ‖(H − H_0, C − C_0)‖_F T^{−1/2}.

Proof (Sketch). The detailed proof is reported in App. D.4. Exploiting the fact that, for any τ ∈ T, E(τ) ≤ Ê(τ), and adding and subtracting Ê(τ_{H,C}), we can write

E_Z [E(τ_{H̄,C̄})] − E∗ ≤ A(τ_{H̄,C̄}) + B(τ_{H,C}),
A(τ_{H̄,C̄}) = E_Z [Ê(τ_{H̄,C̄}) − Ê(τ_{H,C})],
B(τ_{H,C}) = Ê(τ_{H,C}) − E∗.

The term A(τ_{H̄,C̄}) can be controlled according to the convergence properties of the meta-algorithm in Alg. 1, as described in Prop. 12. Regarding the term B(τ_{H,C}), exploiting the definition of the within-task algorithm in Eq. (2) as a minimum, for any τ ∈ T such that Ran(E_{µ∼ρ(·|s)} w_µ w_µ^⊤) ⊆ Ran(τ(s)) for any s ∈ S, we can write

B(τ) ≤ (1/2) E_{(µ,s)∼ρ} Tr(τ(s)† w_µ w_µ^⊤) + (2L²/n) E_{(µ,s)∼ρ} Tr(τ(s) E_{x∼µ_X} xx^⊤).

The desired statement then derives from combining the two parts above and optimizing with respect to γ.

Remark 3 (Online variant of Eq. (2)). Similarly to the biased regularization framework in [14], for the online inner family in Rem. 2, we approximate the meta-subgradient in Prop. 4 by replacing the batch minimizer A(τ_{H,C}(s), Z) in Eq. (2) with the last iterate of the online algorithm in Eq. (3).

Proposed vs. optimal conditioning function. Specializing the bound in Thm. 5 to the best conditioning function in Prop. 2, thanks to Asm. 2, we get, up to constants, the following bound for our estimator:

E_{s∼ρ_S} ‖W(s)^{1/2} C(s)^{1/2}‖_* n^{−1/2} + ‖(H∗ − H_0, C∗ − C_0)‖_F T^{−1/2}.

From such a bound, we can state that our proposed meta-algorithm achieves performance comparable to the best conditioning function in hindsight when the number of observed tasks is sufficiently large. Moreover, recalling the unconditional oracle θ̂ in Eq. (10) used in previous literature, regarding the second term vanishing with T, we observe that our conditional meta-learning approach incurs a cost of ‖(H∗ − H_0, C∗ − C_0)‖_F T^{−1/2}, as opposed to the cost of ‖θ̂ − θ_0‖ T^{−1/4} associated to state-of-the-art unconditional meta-learning approaches (see [15, 5, 22, 9]).
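Two computational ingredients above can be sketched and checked numerically: the matricial reformulation of Prop. 3 (the tensor and matrix parametrizations produce the same τ(s)), and the Euclidean projection onto the PSD cone used in Alg. 1's update, implemented here by eigenvalue clipping. The tensor contraction convention for MΦ(s) is an assumption of this sketch.

```python
import numpy as np

def psd_project(A):
    """Frobenius-nearest PSD matrix: symmetrize, then clip negative eigenvalues
    (the projection step of Alg. 1)."""
    A = (A + A.T) / 2.0
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

# Check Prop. 3: the tensor form (M Phi)^T (M Phi) + C equals the matrix form
# (I_d kron Phi)^T H_M (I_d kron Phi) + C with H_M built from inner products of M.
rng = np.random.default_rng(0)
p, d, k = 3, 4, 5
M = rng.standard_normal((p, d, k))
phi = rng.standard_normal(k)
C = np.eye(d)

MPhi = np.tensordot(M, phi, axes=([2], [0]))     # contraction, shape (p, d)
tau_tensor = MPhi.T @ MPhi + C

Mr = M.transpose(1, 2, 0).reshape(d * k, p)      # row (i*k + h) holds M[:, i, h]
H_M = Mr @ Mr.T                                  # entries <M(:,i,h), M(:,j,z)>
Kron = np.kron(np.eye(d), phi.reshape(k, 1))     # I_d kron Phi(s), shape (dk, d)
tau_matrix = Kron.T @ H_M @ Kron + C
assert np.allclose(tau_tensor, tau_matrix)
```

The check makes concrete why the reformulation matters: the matrix form is linear in H, so the surrogate meta-loss becomes convex in (H, C), at the price of working in dimension dk.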
Thus, our conditional approach presents a faster convergence rate with respect to T than such unconditional methods, but a complexity term that is expected to be larger, due to the larger complexity of the class of functions we are working with. Such a faster rate with respect to T is essentially due to our formulation of the problem over the entire set of positive semi-definite matrices (with no trace constraint). This in fact allows us to incorporate the within-task regularization parameter λ directly into the linear representation and to regain an order in T that was lost in previous literature when tuning the parameter λ. At the same time, this allows us to develop a method requiring just one hyper-parameter to tune, while previous unconditional approaches require tuning two hyper-parameters.

Comparison to unconditional Meta-Learning. Specializing Thm. 5 to the best unconditional estimator θ∗ we introduced in Eq. (9), the bound for our estimator becomes, up to constants,

‖W^{1/2} C^{1/2}‖_* n^{−1/2} + ‖θ∗ − C_0‖_F T^{−1/2}.

From the bound above, we can conclude that the conditional approach provides, at least, the same guarantees as its unconditional counterpart. Moreover, we stress again that the bound above presents a faster rate with respect to T in comparison to the state-of-the-art unconditional methods.

Figure 1: Mean test error (5 generations) on synthetic data. 2 (Left) and 6 (Right) clusters.

5 Experiments

We now present preliminary experiments in which we compare the proposed conditional meta-learning approach in Alg. 1 (cond.) with its unconditional counterpart (uncond.) and with solving the tasks independently (ITL, namely, running the inner algorithm separately across the tasks with the constant linear representation θ = I_d ∈ S^d_+). We considered regression problems and evaluated the errors by the absolute loss. We implemented the online variant of the within-task algorithm introduced in Eq. (3).
The hyper-parameter γ was chosen by (meta-)cross-validation on separate meta-train, meta-validation and meta-test sets of tasks, denoted T_tr, T_va and T_te, respectively. Each task is provided with a training dataset Z_tr of n_tr points and a test dataset Z_te of n_te points used to evaluate the performance of the within-task algorithm. In App. E we report the details of this process in our experiments.

Synthetic clusters. We considered two variants of the setting described in Ex. 1, with side information corresponding to the training dataset Z_tr associated to each task. In both settings, we sampled T_tot = 900 tasks from a uniform mixture of m clusters. For each task µ, we generated the target vector w_µ ∈ R^d with d = 20 as w_µ = P^{(j_µ)} w̃_µ, where j_µ ∈ {1, …, m} denotes the cluster from which the task µ was sampled, the components of w̃_µ ∈ R^{d/10} are sampled from the Gaussian distribution G(0, 1) and w̃_µ is then normalized to have unit norm, and P^{(j_µ)} ∈ R^{d×d/10} is a matrix with orthonormal columns. We then generated the corresponding dataset (x_i, y_i)_{i=1}^{n_tot} with n_tot = 80 according to the linear equation y = ⟨x, w_µ⟩ + ε, with x sampled uniformly on the unit sphere in R^d and ε sampled from a Gaussian distribution G(0, 0.1). In this setting, the operator norm of the inputs' covariance matrix is small (equal to 1/d) and the weight vectors' covariance matrix of each single cluster is low-rank (its rank is d/10 = 2). We implemented our conditional method using the feature map Φ : D → R^{2d} defined by Φ(Z) = (1/n_tr) Σ_{i=1}^{n_tr} φ(z_i), with φ(z_i) = vec(x_i (y_i, 1)^⊤), where, for any matrix A = [a_1, a_2] ∈ R^{d×2} with columns a_1, a_2 ∈ R^d, vec(A) = (a_1, a_2) ∈ R^{2d} denotes the concatenation of the columns. In Fig. 1, we report the results obtained on an environment of tasks generated as above with m = 2 (Left) and m = 6 (Right) clusters, respectively. As we can see, with two clusters the unconditional approach outperforms ITL (as predicted by previous literature), but the unconditional method is in turn outperformed by our conditional counterpart.
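A minimal sketch of the synthetic task generation and of the mean-embedding feature map Φ(Z) = (1/n) Σ_i vec(x_i (y_i, 1)^⊤) described above, with smaller illustrative dimensions than in the experiments:

```python
import numpy as np

def sample_task(P, rng, n=20, noise=0.1):
    """Sample one task from a cluster with orthonormal projection P (d x r):
    w = P w_tilde with unit-norm w_tilde, inputs uniform on the unit sphere."""
    d, r = P.shape
    w_tilde = rng.standard_normal(r)
    w = P @ (w_tilde / np.linalg.norm(w_tilde))
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform on the sphere
    y = X @ w + noise * rng.standard_normal(n)
    return X, y

def feature_map(X, y):
    """Mean embedding Phi(Z) = (1/n) sum_i vec(x_i (y_i, 1)^T) in R^{2d},
    i.e. the stacked columns (x_i y_i ; x_i), averaged over the dataset."""
    phis = np.stack([np.concatenate([X[i] * y[i], X[i]]) for i in range(len(y))])
    return phis.mean(axis=0)
```

The first d coordinates of Φ(Z) approximate E[y x] = C w_µ, so tasks from the same cluster (whose w_µ share a support) produce nearby embeddings, which is what the conditioning function exploits.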
When the number of clusters rises to six, the performance of unconditional meta-learning degrades to that of ITL, while conditional meta-learning outperforms both methods. Summarizing, the more significant the heterogeneity of the environment (number of clusters), the more pronounced the advantage the conditional approach brings with respect to the unconditional one. This is in line with Ex. 1.

Real datasets. We also tested the performance of the methods on the regression problem on the computer survey data from [23] (see also [28]): T_tot = 180 people (tasks) rated the likelihood of purchasing one of n_tot = 20 computers. The input represents d = 13 computer characteristics and the label is a rating in {0, …, 10}. In this case, we used as side information the training data points Z = (z_i)_{i=1}^{n_tr} and the feature map Φ : D → R^{d+1} defined by Φ(Z) = w_Z, with w_Z the solution of Tikhonov regularization with the squared loss, namely, the vector satisfying (X̂^⊤ X̂ + I_{d+1}) w_Z = X̂^⊤ y, where X̂ ∈ R^{n×(d+1)} is the matrix obtained by adding to the matrix X ∈ R^{n×d} a column of ones at the end.

Figure 2: Mean test error (5 splits) on Lenk (Top-Left), Movielens-100k (Top-Right), Jester-1 (Bottom) datasets.

Fig. 2 (Top-Left) shows that also in this case the unconditional approach outperforms ITL, but the performance of its conditional counterpart is much better. We also tested the performance of the methods on the Movielens-100k and Jester-1 real-world datasets, containing ratings of users (tasks) to movies and jokes (points), respectively. Recommendation settings with d items can be interpreted within the meta-learning setting by considering each data point (x, y) to have input x ∈ R^d equal to the one-hot encoding of the current item to be rated (e.g. a movie or a joke) and label y ∈ R equal to the corresponding score, see e.g. [12] for more details. We restricted the original dataset to the n_tot = 20 most voted movies/jokes (as a consequence, by formulation, d = 20).
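A sketch of the ridge-solution feature map used for the computer survey data, assuming the unit regularization parameter stated above:

```python
import numpy as np

def ridge_feature_map(X, y):
    """Phi(Z) = w_Z in R^{d+1}, with (Xhat^T Xhat + I) w_Z = Xhat^T y and
    Xhat = [X, 1] obtained by appending a column of ones (intercept)."""
    n, d = X.shape
    Xhat = np.hstack([X, np.ones((n, 1))])
    return np.linalg.solve(Xhat.T @ Xhat + np.eye(d + 1), Xhat.T @ y)
```

Each task's training set is thus summarized by a cheap ridge estimate of its own weight vector, so tasks with similar target vectors receive similar side-information embeddings.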
We guaranteed that each user voted at least 5 movies/jokes, which led to a total of T_tot = 400/450 tasks (i.e. users) for the two datasets, respectively. In both cases, we used as side information the training data points Z = (z_i)_{i=1}^{n_tr}. For the Movielens-100k dataset we used the same feature map described for the synthetic clusters experiments in Fig. 1. For the Jester-1 dataset, let M and m denote the maximum and minimum rating value that can be assigned to a joke. We adopted the feature map Φ : D → R^{2d+1} such that, for any dataset Z = (x_i, y_i)_{i=1}^n, Φ(Z) = (vec(Φ̃(Z)); 1), where vec denotes the vectorization operator (mapping a matrix to the vector concatenating all its columns) and Φ̃(Z) ∈ R^{d×2} is given by Φ̃(Z) = [cos(ψ(Z)), sin(ψ(Z))] ⊙ (Σ_{i=1}^n x_i), with ψ(Z) ∈ R^d an angle obtained by rescaling the observed scores through M and m, and with ⊙ denoting the Hadamard (entry-wise) product broadcast across both columns. The rationale behind this feature map is to represent as similar vectors those users with similar scores for the same items. In particular, each item-score pair observed in training is represented as a unit vector in R²_{++}, with the angle depending on the score attributed to that item (the vector is zero if that item was not observed at training time). As can be noticed in Fig. 2 (Top-Right and Bottom), the proposed approach performs significantly better than ITL and its unconditional counterpart also on these two benchmarks. This suggests that groups of users might each rely on similar features (but different from those of other groups) to rate an item in the dataset (respectively a movie or a joke).

Figure 3: Mean test error (5 generations) of our method and [14] on 2 (Left) and 6 (Right) clusters.

Comparison with [14]. We conclude the experimental section by comparing the performance of our method with the conditional meta-learning approach for biased regularization proposed in [14].
In that case, the tasks' target vectors are assumed to be all close to a common bias vector, rather than sharing the same low-dimensional linear representation as in our method. The representation setting is known to be a more relevant and effective framework than the bias one in many scenarios (see e.g. [28]). This is confirmed in Fig. 3, where our conditional representation learning method (cond.) significantly outperforms the one in [14] (cond. B) in the synthetic settings used in Fig. 1, since the task similarity leveraged by [14] is not appropriate in these cases.

6 Conclusion

We proposed a conditional meta-learning approach aiming at learning a function mapping a task's side information into a linear representation that is well suited for the task at hand. We theoretically and experimentally showed that the proposed conditional approach is advantageous with respect to the standard unconditional counterpart when the observed tasks share heterogeneous linear representations. As a consequence of our analysis, we also developed a new unconditional meta-learning variant requiring fewer hyper-parameters to tune and enjoying faster rates with respect to state-of-the-art unconditional approaches. We identify two future directions addressing the limitations of our method. A first question left open is how to design a suitable feature map $\Phi$ when the tasks' training data is used as side information. Following [32, 37], we adopted a mean embedding representation. However, given the importance played by such a feature map in Thm. 5, it will be worth investigating better alternatives. Secondly, it will be valuable to investigate how to predict non-linear conditioning functions (similarly to e.g. [7, 16, 32]) and to use less expensive algorithms to update the positive matrices, such as the Frank-Wolfe algorithm used in [9] for unconditional settings.
In this last case, according to our analysis and applying standard convergence rates for the Frank-Wolfe algorithm, we expect the resulting meta-learning algorithm to incur a slower rate of order $T^{-1/4}$ (instead of $T^{-1/2}$ for the SGD method proposed here) in Thm. 5, trading statistical performance for computational efficiency.

Acknowledgments and Disclosure of Funding

This work was supported in part by SAP SE and EPSRC Grant N. EP/P009069/1. C.C. acknowledges the support of the Royal Society (grant SPREM RGS\R1\201149) and Amazon.com Inc. (Amazon Research Award ARA). G.D. acknowledges Leonardo SpA for funding her participation in the conference.

References

[1] Jacob Abernethy, Francis Bach, Theodoros Evgeniou, and Jean-Philippe Vert. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10(Mar):803-826, 2009.
[2] Andreas Argyriou, Stéphan Clémençon, and Ruocong Zhang. Learning the graph of relations among multiple tasks. 2013.
[3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[4] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 71-85. Springer, 2008.
[5] Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424-433, 2019.
[6] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
[7] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[8] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.
[9] Brian Bullins, Elad Hazan, Adam Kalai, and Roi Livni. Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pages 235-246, 2019.
[10] T Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Weighted message passing and minimum energy flow for heterogeneous stochastic block models with side information. Journal of Machine Learning Research, 21(11):1-34, 2020.
[11] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[12] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566-1575, 2019.
[13] Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
[14] Giulia Denevi, Massimiliano Pontil, and Carlo Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33, 2020.
[15] Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems, pages 13089-13099, 2019.
[16] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126-1135. PMLR, 2017.
[17] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations, 2018.
[18] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920-1930, 2019.
[19] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19, 2015.
[20] Laurent Jacob, Jean-Philippe Vert, and Francis R Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745-752, 2009.
[21] Ghassen Jerfel, Erin Grant, Tom Griffiths, and Katherine A Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. In Advances in Neural Information Processing Systems, pages 9119-9130, 2019.
[22] Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915-5926, 2019.
[23] Peter J Lenk, Wayne S DeSarbo, Paul E Green, and Martin R Young. Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173-191, 1996.
[24] Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.
[25] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
[26] Andreas Maurer and Massimiliano Pontil. Transfer learning in a heterogeneous environment. In 2012 3rd International Workshop on Cognitive Information Processing (CIP), pages 1-6. IEEE, 2012.
[27] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853-2884, 2016.
[28] Andrew M McDonald, Massimiliano Pontil, and Dimitris Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1-38, 2016.
[29] Charles A Micchelli, Jean M Morales, and Massimiliano Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455-489, 2013.
[30] Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991-999, 2014.
[31] Bernardino Romera-Paredes, Hane Aung, Nadia Bianchi-Berthouze, and Massimiliano Pontil. Multilinear multitask learning. In International Conference on Machine Learning, pages 1444-1452. PMLR, 2013.
[32] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2018.
[33] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[34] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[35] Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
[36] Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J Lim. Multimodal model-agnostic meta-learning via task-aware modulation. In Advances in Neural Information Processing Systems, pages 1-12, 2019.
[37] Ruohan Wang, Yiannis Demiris, and Carlo Ciliberto. A structured prediction approach for conditional meta-learning. Advances in Neural Information Processing Systems, 2020.
[38] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: Task imputation via convex optimization. In NIPS, pages 2825-2833. Citeseer, 2014.
[39] Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. Hierarchically structured meta-learning. arXiv preprint arXiv:1905.05301, 2019.
[40] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems, 2011:702, 2011.

Checklist

1.
For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See the abstract and Lines 44-54.
(b) Did you describe the limitations of your work? [Yes] See Sec. 6.
(c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work does not imply any negative societal impact.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Asm. 1, Asm. 2, Asm. 3.
(b) Did you include complete proofs of all theoretical results? [Yes] See Sec. 3, Sec. 4 and the Supplementary material.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Sec. 5, App. E and the Supplementary material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sec. 5, App. E and the Supplementary material.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Sec. 5, App. E and the Supplementary material.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See App. E and the Supplementary material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] In Fig. 3 we compared with the method in [14] and we cited that work.
(b) Did you mention the license of the assets? [Yes] See Fig. 3.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We attached our code in the Supplementary material.
(d) Did you discuss whether and how consent was obtained from people whose data you are using/curating? [N/A] We used benchmark data and we cited the original sources, where one can find all the details.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] The data we are using does not contain personally identifiable information or offensive content.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] We did not use crowdsourcing or conduct research with human subjects.
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] We did not use crowdsourcing or conduct research with human subjects.
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] We did not use crowdsourcing or conduct research with human subjects.