# A Distribution-Dependent Analysis of Meta-Learning

Mikhail Konobeev¹, Ilja Kuzborskij², Csaba Szepesvári¹,²

¹Computing Science Department, University of Alberta, Edmonton, Alberta, Canada. ²DeepMind, London, United Kingdom. Correspondence to: Mikhail Konobeev.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

**Abstract.** A key problem in the theory of meta-learning is to understand how the task distributions influence transfer risk, the expected error of a meta-learner on a new task drawn from the unknown task distribution. In this paper, focusing on fixed-design linear regression with Gaussian noise and a Gaussian task (or parameter) distribution, we give distribution-dependent lower bounds on the transfer risk of any algorithm, while we also show that a novel, weighted version of the so-called biased regularized regression method is able to match these lower bounds up to a fixed constant factor. Notably, the weighting is derived from the covariance of the Gaussian task distribution. Altogether, our results provide a precise characterization of the difficulty of meta-learning in this Gaussian setting. While this problem setting may appear simple, we show that it is rich enough to unify the parameter-sharing and representation-learning streams of meta-learning; in particular, representation learning is obtained as the special case when the covariance matrix of the task distribution is unknown. For this case we propose to adopt the EM method, which is shown to enjoy efficient updates in our case. The paper is completed by an empirical study of EM. In particular, our experimental results show that the EM algorithm can attain the lower bound as the number of tasks grows, while the algorithm is also successful in competing with its alternatives when used in a representation learning context.

## 1. Introduction

In meta-learning, a learner uses data from past tasks in an attempt to speed up learning on future tasks. Whether a speedup is possible depends on whether the new task is similar to the previous ones. In the formal framework of statistical meta-learning of Baxter (2000), the learner is given a sequence of training sets. The data in each set is independently sampled from an unknown distribution specific to the set, or task, while each such task distribution is independently sampled from an unknown meta-distribution, which we shall just call the environment. The learner's transfer risk then is its expected prediction loss on a target task freshly sampled from the environment. Can a learner achieve smaller transfer risk by using data from the possibly unrelated tasks? What are the limits of reducing transfer risk?

As an instructive example, consider a popular approach where each of the $n$ tasks is associated with ground-truth parameters $\theta_i \in \mathbb{R}^d$, each of which is assumed to lie close to an unknown vector $\alpha$ that characterizes the environment. To estimate the unknown parameter vector of the last task, one possibility is to employ biased regularization (Yang et al., 2007; Kuzborskij & Orabona, 2013; Pentina & Lampert, 2014), that is, to solve the optimization problem

$$\min_{\theta \in \mathbb{R}^d} \; \hat{L}_n(\theta) + \lambda \lVert \theta - \hat{\alpha} \rVert^2,$$

where $\hat{L}_n(\cdot)$ is the empirical loss on the $n$th task and $\lambda > 0$ is a regularization parameter that governs the strength of the regularization term that biases the solution towards $\hat{\alpha}$, an estimate of $\alpha$.
Here, $\hat{\alpha}$ could be obtained, for example, by averaging parameters estimated on previous tasks (Denevi et al., 2018). This procedure implements the maxim "learn on a new task, but stay close to what is already learned", which is the basis of many successful meta-learning algorithms, including the above, and MAML (Finn et al., 2017).

Early theoretical work in the area focused on studying the generalization gap, which is the difference between the transfer risk and its empirical counterpart. Maurer (2005) gives an upper bound on this gap for a concrete algorithm which is similar to the biased regularization approach discussed above. While these bounds are reassuring, they need further work to fully quantify the benefit of meta-learning, i.e., the gap between the risk of a standard (non-meta) learner and the transfer risk of a meta-learner. Numerous other works have shown bounds on the generalization gap when using biased regularization, in one-shot learning (Kuzborskij & Orabona, 2016), meta-learning (Pentina & Lampert, 2014), and sequential learning of tasks (Denevi et al., 2018; 2019; Khodak et al., 2019a;b; Finn et al., 2019). While some of these works introduced a dependence on the environment distribution, or on the regularity of the sequence of tasks as appropriate, they still leave open the question of whether the shown dependence is the best possible.

In summary, the main weakness of the cited literature is the lack of (problem-dependent) lower bounds: to be able to separate good meta-learning methods from poor ones, one needs to know the best achievable performance in a given problem setting. In learning theory, the most often used lower bounds are distribution-free, or problem-independent. In the context of meta-learning, the distribution refers to the distribution over the tasks, or the environment. The major limitation of a distribution-free approach is that if the class of environments is sufficiently rich, all that the bound will tell us is that the best standard learner will have similar performance to that of the best meta-learner, since the worst-case environment will be one where the tasks are completely unrelated. As an example, for a linear regression setting with $d$-dimensional parameter vectors, Lucas et al. (2021) give the worst-case lower bound $\Omega\left(d / ((2r)^{-d} M + m)\right)$ for parameter identification, where the error is measured in the squared Euclidean distance. Here, $M$ is the total number of data points in the identically-sized training sets, $m$ is the number of data points in the training set of the target task, and $r \geq 1$ is the radius of the ball that contains the parameter vectors.¹ It follows that as $r \to \infty$, the lower bound reduces to that of linear regression, and we see that any method that ignores the extra tasks is competitive with the best meta-learning method.

The pioneering works of Maurer (2009) and Maurer et al. (2016) avoid this pathology by introducing empirical quantities that capture task relatedness in the context of linear regression with a common low-dimensional representation. The bounds can be refined and the pathological limit can be avoided by restricting the set of environments. This approach is taken by Du et al. (2020) and Tripuraneni et al. (2020), who also consider linear regression where the tasks share a common low-dimensional representation. Their main results show that natural algorithms can take advantage of this extra structure.
In addition, Tripuraneni et al. (2020) also show a lower bound on the transfer risk which is matched by their method up to logarithmic factors and problem-dependent conditioning constants.

¹This result is stated in Theorem 5 of their paper and the setting is meta linear regression. For readability, we dropped some constants, such as the label noise variance, and slightly generalized the cited result by introducing $r$, which is taken to be $r = 1$ in their paper. Indeed, the analysis in the paper is not hard to modify to get the dependence shown on $r$.

**Our contributions.** In the present paper we revisit the framework underlying biased regularized regression. In particular, we propose to study the case when the unknown parameter vectors of the tasks are generated from a normal distribution with some mean and covariance matrix. First, we consider the case when the mean is unknown while the covariance matrix is known. For this case, in the context of fixed-design linear regression, we prove distribution-dependent lower and upper bounds, which essentially match each other. The lower bound is a direct lower limit on the transfer risk of any meta-learning method. The upper bound is proven for a version of weighted biased regularized least-squares regression. Here, the parameters are biased towards the maximum likelihood estimate of the unknown common mean of the task parameter vectors, and the weighting is done with respect to the inverse covariance matrix of the distribution over the task parameter vectors. We show that the maximum likelihood estimator can be efficiently computed, which implies that the entire procedure is efficient. As opposed to the work of Tripuraneni et al. (2020), the gap between our lower and upper bounds is a universal constant, regardless of the other parameters of the meta-learning task. The matching lower and upper bounds together provide a precise and fine-grained characterization of the benefits of meta-learning. Our algorithm shows how one should combine datasets of different cardinalities and suggests specific ways of tuning biased regularized regression based on the noise characteristics of the data and the task structure.

Our lower bounds are based on a rigorously proven novel observation, which may be of interest on its own. According to this observation, any predictor can be treated as a plug-in method that first estimates the unknown task distribution parameters. Hence, to prove a lower bound on the transfer risk, it suffices to do so for plug-in estimators.

In the last part of the paper we consider the case when the covariance matrix of the task parameter vector distribution is unknown. Importantly, this case can be seen as a way of unifying the representation learning stream of meta-learning with the parameter sharing stream. In particular, if the covariance matrix is such that $d - s$ of its eigenvalues tend to zero, while the other $s$ eigenvalues are allowed to take on arbitrarily large values, the problem becomes essentially the same as the representation learning problem of Du et al. (2020) and Tripuraneni et al. (2020). While we provide no theoretical analysis for this case, we give a detailed description of how the Expectation-Maximization (EM) algorithm can be used to tackle this problem. In particular, we show that in this special case the EM algorithm enjoys an efficient implementation: we show how to implement the iterative steps in the loop of the EM algorithm in an efficient way.
The steps of this algorithm are given as closed-form expressions, which are both intuitive and straightforward to implement. We demonstrate the effectiveness of the resulting procedure on a number of synthetic and real benchmarks; Fig. 1 shows an example on a synthetic benchmark problem, comparing our EM algorithm with the earlier cited (unweighted) biased regression procedure. As can be seen from the figure, the EM-based learner is significantly more effective. Further experiments suggest that the EM learner is almost as effective as the optimal weighted biased regularized regression procedure that is given the unknown parameters. We found that the EM learner is also competitive as a representation learning algorithm by comparing it to the algorithm of Tripuraneni et al. (2020) that is based on the method-of-moments technique.

Figure 1. Examples of predictions on the synthetic Fourier meta-learning problem. Training data is shown in bold; small dots show test data. We also show the predictions of two learners (at every input) and the target function. The columns correspond to outputs obtained by training on $n \in \{10, 50, 100\}$ tasks. Our new algorithm, the EM learner, performs quite well.

## 2. Problem Setting

In the statistical approach to meta-learning (Baxter, 1998; 2000), the learner observes a sequence of training tuples $D = (D_i)_{i=1}^n$, distributed according to a random sequence of task distributions $(P_i)_{i=1}^n$, i.e., $D_i \sim P_i$, while the task distributions themselves are sampled independently from each other from a fixed and unknown environment distribution $P$. The focus of this paper is linear regression with a fixed design, and therefore each training tuple $D_i = ((x_{i,1}, Y_{i,1}), \ldots, (x_{i,m_i}, Y_{i,m_i}))$ consists of $m_i$ fixed training inputs from $\mathbb{R}^d$ and corresponding random, real-valued targets satisfying

$$Y_{i,j} = \theta_i^\top x_{i,j} + \varepsilon_{i,j}, \qquad (1)$$

where $\varepsilon_{i,j} \overset{\text{iid}}{\sim} N(0, \sigma^2)$ and $\theta_i \overset{\text{iid}}{\sim} N(\alpha, \Sigma)$, while $(\varepsilon_{i,j})_{i,j}$ and $(\theta_i)_i$ are also independent from each other.² A meta-learning environment in this setting is thus given by $\alpha$ and the noise parameters $(\sigma^2, \Sigma)$. Initially, we will assume that $(\sigma^2, \Sigma)$ is known, while $\alpha$ (just like $(\theta_i)_i$) is unknown. The learner observes $D$ and needs to produce a prediction $\hat{Y}$ of the value

$$Y = \theta_n^\top x + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2)$ and where $x \in \mathbb{R}^d$ is a fixed (nonrandom) point. Our theoretical results trivially extend to the case when the learner needs to produce predictions for a sequence of input points, or for a fixed distribution over these, as often considered in the meta-learning literature (Denevi et al., 2018; Du et al., 2020). The (random) transfer risk of the learner is defined as

$$L(x) = \mathbb{E}\left[(Y - \hat{Y})^2 \mid D\right].$$

²Technically, $P_i$ consists of Dirac deltas on the inputs.

The setting described above coincides with the standard fixed-design linear regression setup for $n = 1$ and $\Sigma \to 0$, for which the behavior of the risk is well understood. In contrast, the question that meta-learning poses is whether, having $n > 1$, one can design a predictor that achieves lower risk compared to the approach that only uses the target data. Naturally, this is of particular interest in the small-sample regime when $m_i \ll n$ for all tasks, that is, when facing scarcity of training data but having many tasks. Broadly speaking, this reduces to understanding the behavior of the risk in terms of the interaction between the number of tasks $n$, their sample sizes $(m_1, \ldots, m_n)$, and the task structure given by the noise parametrization $(\sigma^2, \Sigma)$.
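To make the setup concrete, here is a minimal sketch of sampling an environment and $n$ tasks according to model (1). This is our own illustrative NumPy code, not the authors' implementation, and all names (`sample_tasks` and so on) are hypothetical.

```python
# Illustrative sketch (not the paper's code): sampling n fixed-design tasks
# from model (1): theta_i ~ N(alpha, Sigma), Y_ij = theta_i^T x_ij + eps_ij.
import numpy as np

rng = np.random.default_rng(0)

def sample_tasks(n, m, d, alpha, Sigma, sigma2):
    """Return a list of (X_i, Y_i) pairs, one per task, each with m examples."""
    tasks = []
    for _ in range(n):
        theta = rng.multivariate_normal(alpha, Sigma)   # task parameters
        X = rng.standard_normal((m, d))                 # fixed design matrix X_i
        eps = np.sqrt(sigma2) * rng.standard_normal(m)  # label noise
        tasks.append((X, X @ theta + eps))              # targets via Eq. (1)
    return tasks

d = 5
alpha, Sigma, sigma2 = np.zeros(d), 0.5 * np.eye(d), 1.0
tasks = sample_tasks(n=20, m=10, d=d, alpha=alpha, Sigma=Sigma, sigma2=sigma2)
```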
## 3. Sufficiency of Meta-Mean Prediction

In this section we show that there is no loss of generality in considering plug-in predictors that first predict the unknown meta-mean $\alpha$. We also show that the biased regularized least-squares estimator belongs to this family.

We start with some general remarks and notation. Throughout the rest of the paper, for real symmetric matrices $A$ and $B$, we use $A \succeq B$ to indicate that the matrix $A - B$ is positive semi-definite (PSD). For $x \in \mathbb{R}^d$ and a PSD matrix $A$, we let $\lVert x \rVert_A = \sqrt{x^\top A x}$. We use $\lVert x \rVert$ to denote the 2-norm of $x$.

In the following we use matrix notation aggregating inputs, targets, and parameters over multiple tasks. In particular, let the cumulative sample size of all tasks be $M = m_1 + \cdots + m_n$ and introduce the aggregates

$$X_i = \begin{pmatrix} x_{i,1}^\top \\ \vdots \\ x_{i,m_i}^\top \end{pmatrix} \in \mathbb{R}^{m_i \times d}, \quad Y_i = \begin{pmatrix} Y_{i,1} \\ \vdots \\ Y_{i,m_i} \end{pmatrix} \in \mathbb{R}^{m_i}, \quad X = \begin{pmatrix} X_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & X_n \end{pmatrix} \in \mathbb{R}^{M \times nd}, \quad \Theta = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_n \end{pmatrix} \in \mathbb{R}^{nd},$$

and let $\Psi = (X_1^\top, \ldots, X_n^\top)^\top \in \mathbb{R}^{M \times d}$ stack the task design matrices. The matrix representation allows us to compactly state the regression model simultaneously over all tasks. In particular, with the $M$-dimensional noise vector $\varepsilon \sim N(0, \sigma^2 I)$,

$$Y = X\Theta + \varepsilon, \qquad Y \sim N(\Psi\alpha, K), \qquad (2)$$

where $\alpha$ is the meta-mean of model (1) and $K$ is the marginal covariance matrix defined as $K = X(I \otimes \Sigma)X^\top + \sigma^2 I$, where $\otimes$ stands for the Kronecker product. Note that the above equivalence comes from the straightforward observation that the linear map $X_i$ applied to the Gaussian random variable $\theta_i$ is itself Gaussian with mean $\mathbb{E}[Y_i] = X_i \mathbb{E}[\theta_i] = X_i\alpha$ and covariance $X_i \Sigma X_i^\top + \sigma^2 I$, which follows from the property that for any random vector $\xi$ with covariance matrix $C$ and matrix $A$ of appropriate dimensions, the covariance matrix of $A\xi$ is $ACA^\top$, ultimately giving Eq. (2).

### 3.1. Plug-In Predictors and their Sufficiency

Both our lower and upper bounds will be derived by analyzing a family of plug-in predictors that aim to estimate $\theta_n$ through estimating $\alpha$. As we shall see, weighted biased regularization is also a member of this family. The said family is motivated by applying the well-known bias-variance decomposition to the risk of an arbitrary predictor $A : \operatorname{supp}(P_1) \times \cdots \times \operatorname{supp}(P_n) \times \mathbb{R}^d \to \mathbb{R}$. Namely,

$$L(x) = \mathbb{E}\left[(Y - A(D, x))^2 \mid D\right] = \left(\mathbb{E}[Y \mid D] - A(D, x)\right)^2 + \mathbb{V}[Y \mid D],$$

where we used the law of total expectation and the fact that for any random variable $\xi$, $\mathbb{E}[\xi^2] = \mathbb{E}[\xi]^2 + \mathbb{V}[\xi]$. Since the variance term does not depend on $A$, it follows that the prediction problem reduces to predicting the posterior mean $\mathbb{E}[Y \mid D]$, which, in our setting, can be given in closed form:

**Proposition 3.1.** Let $Y = \theta_n^\top x + \varepsilon$ for $\varepsilon \sim N(0, \sigma^2)$ and some $x \in \mathbb{R}^d$. Then,

$$\mathbb{E}[Y \mid D] = x^\top T\left(\Sigma^{-1}\alpha + \tfrac{1}{\sigma^2} X_n^\top Y_n\right), \quad \text{where } T = \left(\Sigma^{-1} + \tfrac{1}{\sigma^2} X_n^\top X_n\right)^{-1}.$$

*Proof.* See Appendix A.

Since the only unknown parameter here is the meta-mean $\alpha$, we expect that good predictors will simply estimate the meta-mean and use the above formula. That is, these predictors take the form $(D, x) \mapsto x^\top \hat{\theta}_n(\hat{\alpha}(D, x))$, where

$$\hat{\theta}_n(a) = T\left(\Sigma^{-1}a + \tfrac{1}{\sigma^2} X_n^\top Y_n\right), \qquad (3)$$

giving our family of plug-in predictors. In fact, there is no loss of generality in considering only predictors of the above form. Indeed, given some predictor $A$ and $x \neq 0$, we can solve $A(D, x) = x^\top \hat{\theta}_n(\hat{\alpha})$ for $\hat{\alpha}$. One solution is given by $\hat{\alpha}(D, x) = \Sigma T^{-1} x c$, where

$$c = \frac{1}{\lVert x \rVert^2}\left(A(D, x) - \sigma^{-2} x^\top T X_n^\top Y_n\right).$$

Hence, to prove a lower bound for any regressor $A$, it is enough to prove it for algorithms that estimate $\alpha$. One special estimator of $\alpha$ is the Maximum Likelihood Estimator (MLE), which, thanks to (2), can be obtained via $\hat{\alpha}_{\mathrm{MLE}} = \arg\max_{a \in \mathbb{R}^d} \ln p_G(Y; \Psi a, K)$, where $p_G(x; \mu, \Sigma) \propto e^{-\frac{1}{2}\lVert x - \mu \rVert^2_{\Sigma^{-1}}}$ is a Gaussian PDF. Some standard calculations give us

$$\hat{\alpha}_{\mathrm{MLE}} = (\Psi^\top K^{-1} \Psi)^{-1} \Psi^\top K^{-1} Y. \qquad (4)$$
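Because $K$ is block diagonal with blocks $K_i = X_i \Sigma X_i^\top + \sigma^2 I$, the estimator of Eq. (4) can be accumulated task by task without ever forming the $M \times M$ matrix $K$. A minimal sketch, again with our own hypothetical naming:

```python
# Sketch of the MLE of Eq. (4) (our own implementation). K is block diagonal
# with blocks K_i = X_i Sigma X_i^T + sigma^2 I, so Psi^T K^{-1} Psi and
# Psi^T K^{-1} Y accumulate task by task; no M x M matrix is ever formed.
import numpy as np

def mle_meta_mean(tasks, Sigma, sigma2):
    d = Sigma.shape[0]
    A = np.zeros((d, d))  # accumulates Psi^T K^{-1} Psi
    b = np.zeros(d)       # accumulates Psi^T K^{-1} Y
    for X, Y in tasks:
        K_i = X @ Sigma @ X.T + sigma2 * np.eye(X.shape[0])
        XtKi = X.T @ np.linalg.inv(K_i)
        A += XtKi @ X
        b += XtKi @ Y
    return np.linalg.solve(A, b)  # alpha_hat_MLE of Eq. (4)

# Usage with the `tasks` list from the earlier sampling sketch:
# alpha_hat = mle_meta_mean(tasks, Sigma, sigma2)
```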
### 3.2. Weighted Biased Regularization

Biased regularization is a popular transfer learning technique which commonly appears in regularized formulations of empirical risk minimization problems, where one aims at minimizing the empirical risk (such as the mean squared error) while forcing the solution to stay close to some bias variable $b$. Here we consider the Weighted Biased Regularized Least Squares (WBRLS) formulation, defined w.r.t. a bias $b$ and some PSD matrix $\Gamma$:

$$\hat{\theta}_n^{\mathrm{WBRLS}} = \arg\min_{\theta \in \mathbb{R}^d} \; \hat{L}_n(\theta) + \lambda \lVert \theta - b \rVert^2_\Gamma, \quad \text{where } \hat{L}_n(\theta) = \sum_{j=1}^{m_n}\left(Y_{n,j} - \theta^\top x_{n,j}\right)^2.$$

Remarkably, the estimate $\hat{\theta}_n^{\mathrm{WBRLS}}$ produced by WBRLS is equivalent to the estimator $\hat{\theta}_n(\hat{\alpha})$ of Eq. (3) for the choice $b = \hat{\alpha}$, $\Gamma = \Sigma^{-1}$, and $\lambda = \sigma^2$. Thus, WBRLS is a special member of the family chosen in the previous section. To see the equivalence, owing to the convenient least-squares formulation, we observe that

$$\hat{\theta}_n^{\mathrm{WBRLS}} = \left(X_n^\top X_n + \lambda\Gamma\right)^{-1}\left(X_n^\top Y_n + \lambda\Gamma b\right),$$

and from here the equivalence follows by substitution. A natural question commonly arising in such formulations is how to set the bias term $b$. One choice is $b = \hat{\alpha}_{\mathrm{MLE}}$, and in the following we will see that it is an optimal one.
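The claimed equivalence is easy to check numerically. The sketch below (our own code, hypothetical names) compares the plug-in estimator of Eq. (3) with the WBRLS closed form under the stated choice $b = \hat{\alpha}$, $\Gamma = \Sigma^{-1}$, $\lambda = \sigma^2$:

```python
# Sketch verifying that WBRLS with b = alpha_hat, Gamma = Sigma^{-1} and
# lambda = sigma^2 equals the plug-in estimator of Eq. (3). Our own code.
import numpy as np

def plug_in_theta(Xn, Yn, a, Sigma, sigma2):
    Sinv = np.linalg.inv(Sigma)
    T = np.linalg.inv(Sinv + Xn.T @ Xn / sigma2)   # T from Proposition 3.1
    return T @ (Sinv @ a + Xn.T @ Yn / sigma2)     # Eq. (3)

def wbrls(Xn, Yn, b, Gamma, lam):
    # Closed form (X_n^T X_n + lam Gamma)^{-1} (X_n^T Y_n + lam Gamma b)
    return np.linalg.solve(Xn.T @ Xn + lam * Gamma, Xn.T @ Yn + lam * Gamma @ b)

rng = np.random.default_rng(1)
d, m, sigma2 = 4, 8, 1.0
Xn, Yn = rng.standard_normal((m, d)), rng.standard_normal(m)
Sigma = np.eye(d)
alpha_hat = rng.standard_normal(d)
t_plug = plug_in_theta(Xn, Yn, alpha_hat, Sigma, sigma2)
t_wbrls = wbrls(Xn, Yn, alpha_hat, np.linalg.inv(Sigma), sigma2)
assert np.allclose(t_plug, t_wbrls)
```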
## 4. Problem-Dependent Bounds

We now present our main results, which are essentially matching lower and upper bounds. The upper bounds concern the parameter estimator that uses the MLE estimate of $\alpha$, while the lower bounds apply to any method. We also present a more precise lower bound that applies to estimators built on unbiased meta-mean estimators $\hat{\alpha}$. As we shall see, plug-in predictors based on the MLE exactly match this lower bound. The general lower bounds are also quite precise: they differ from this lower bound only by a (relatively small) universal constant. We also give a high-probability variant of the same general lower bound.

**Theorem 4.1.** Let $x \in \mathbb{R}^d$ and consider the linear regression model (1). Let $\hat{\alpha}$ be any unbiased estimator of $\alpha$ based on $D$. Then the transfer risk $L(x)$ of the predictor that predicts $\hat{Y} = x^\top \hat{\theta}_n(\hat{\alpha})$ satisfies

$$\mathbb{E}[L(x)] \geq x^\top M x + x^\top T x + \sigma^2, \qquad (5)$$

where $M = T\Sigma^{-1}(\Psi^\top K^{-1}\Psi)^{-1}\Sigma^{-1}T$. Moreover, for all predictors we have

$$\mathbb{E}[L(x)] \geq \frac{x^\top M x}{16e} + x^\top T x + \sigma^2. \qquad (6)$$

Finally, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all predictors we have

$$L(x) \geq 2\log\frac{1}{4(1-\delta)}\, x^\top M x + x^\top T x + \sigma^2.$$

*Proof.* The main ideas of the proof are in Section 4.2, while the complete proof is given in Appendix B.

Note that the presented bounds are problem-dependent since they depend on the concrete task structure of the environment characterized by $(\Sigma, \sigma^2)$. While the strength of the above bound is its generality, this generality makes the interpretation of the result challenging. We return to the interpretation of this result momentarily, after presenting results on the transfer risk of the plug-in method that uses the (unbiased) MLE meta-mean estimator $\hat{\alpha}_{\mathrm{MLE}}$ defined in Eq. (4).

**Theorem 4.2.** For the estimator $\hat{\theta}_n(\hat{\alpha}_{\mathrm{MLE}})$ and for any $x \in \mathbb{R}^d$ we have

$$\mathbb{E}[L(x)] = x^\top M x + x^\top T x + \sigma^2. \qquad (7)$$

Moreover, for the same estimator, with probability at least $1 - \delta$, $\delta \in (0, 1)$, we have

$$L(x) \leq 2\log\frac{2}{\delta}\, x^\top M x + x^\top T x + \sigma^2.$$

*Proof.* See Appendix C.

Note that Eq. (7) is an equality for the transfer risk and it matches the lower bound available for unbiased estimators. This result, together with our lower bound, shows that (i) the predictor based on $\hat{\alpha}_{\mathrm{MLE}}$ is optimal, with matching constant, within the set of predictors based on unbiased estimators of $\alpha$. It also follows that (ii) apart from a constant factor of $16e$ in the first term of the transfer risk, this predictor is also optimal among all predictors.

### 4.1. Interpretation of the Results

The following two corollaries specialize the lower bound of Theorem 4.1 in a way that will make the results more transparent. Note that while we give these simplified expressions for the lower bound (specifically, Eq. (6)), these expressions also remain essentially true for the upper bound for the MLE estimator, since the two differ only in minor details. The proofs of both corollaries are given in Appendix F.

Both specializations are concerned with the case when the inputs are isotropic, meaning that the input covariance matrix of task $i$ satisfies $X_i^\top X_i = \frac{m_i}{d} I$. In the first result, in addition, we assume a spherical task structure: $\Sigma = \tau^2 I$. Thus, the coordinates of the parameter vectors $\theta_i$ are uncorrelated and share the same variance $\tau^2$.

**Corollary 4.3.** Assume the same as in the case of Eq. (6). In addition, let $\Sigma = \tau^2 I$, suppose that $X_i^\top X_i = \frac{m_i}{d} I$, and let $\lVert x \rVert = 1$. Then, up to the universal constant of Eq. (6),

$$\mathbb{E}[L(x)] - \sigma^2 \gtrsim \frac{d^2 \sigma^4 H_{\tau^2}}{n\,(\tau^2 m_n + d\sigma^2)^2} + \frac{d\tau^2\sigma^2}{\tau^2 m_n + d\sigma^2},$$

where $H_z$ is the harmonic mean of the sequence $(z + d\sigma^2/m_i)_{i=1}^n$.

In the above bound the first term vanishes as more tasks are added ($n$ growing). On the other hand, to decrease the second term, $m_n$ needs to increase. In particular, as $n \to \infty$, we get

$$\mathbb{E}[L(x)] - \sigma^2 \gtrsim \frac{d\sigma^2}{m_n + d\sigma^2/\tau^2},$$

so $d\sigma^2/\tau^2$ can be interpreted as an effective sample size. Thus, while having infinitely many previous tasks has the potential to reduce the loss, the size of this effect is fixed and is related to the noise variance ratios. If $\tau^2 \to 0$, having infinitely many tasks will allow perfect prediction, but for any $\tau^2 > 0$, there is a limit on how much the data of previous tasks can help. Finally, for the case $n = 1$ and $\tau^2 = 0$ we recover the standard lower bound for the linear setting, $\mathbb{E}[L(x)] - \sigma^2 = \Omega(d\sigma^2/m_1)$.

Our next result is concerned with "representation learning", which corresponds to the case when $\Sigma$ is a low-rank PSD matrix.

**Corollary 4.4.** Let the inputs be isotropic as before and $\lVert x \rVert = 1$. Moreover, let $\Sigma$ be a PSD matrix of rank $s \leq d$ with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_s > 0$,³ and suppose that $x^\top P_s P_s^\top x = s/d$, where $P_s = [u_1, \ldots, u_s]$ and $(u_j)_{j=1}^s$ are unit-length eigenvectors of $\Sigma$. Then, up to the universal constant of Eq. (6),

$$\mathbb{E}[L(x)] - \sigma^2 \gtrsim \frac{s d \sigma^4 H_{\lambda_1}}{n\,(\lambda_1 m_n + d\sigma^2)^2} + \frac{s\lambda_s \sigma^2}{\lambda_s m_n + d\sigma^2}.$$

³When $s < d$, we replace $\Sigma^{-1}$ with its pseudo-inverse $\Sigma^{\dagger}$.

Note that the first term on the right-hand side of the last display scales with $sd/n$, where $sd$ is the number of parameters in a matrix that would give the low-dimensional representation, and the second term scales with $s/m_n$ for $m_n \geq d\sigma^2/\lambda_s$. Somewhat surprisingly (given that here $\Sigma$ is known), these essentially match the upper bounds due to Du et al. (2020) and Tripuraneni et al. (2020), implying that their results are unimprovable.
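Since $M$ and $T$ are explicit functions of the designs and $(\Sigma, \sigma^2)$, the exact risk of Theorem 4.2 and the general lower bound of Eq. (6) can be evaluated numerically; values of this kind are reported as the Known Covariance Lower Bound in Section 6. A sketch under our own naming, not the authors' code:

```python
# Sketch (our own code) evaluating x^T M x and x^T T x + sigma^2, i.e. the
# exact transfer risk of Theorem 4.2; dividing the first term by 16e gives
# the general lower bound of Eq. (6).
import numpy as np

def risk_terms(designs, x, Sigma, sigma2):
    """designs: list of X_i, with the target task's design X_n last."""
    d = Sigma.shape[0]
    Sinv = np.linalg.inv(Sigma)
    Xn = designs[-1]
    T = np.linalg.inv(Sinv + Xn.T @ Xn / sigma2)
    F = np.zeros((d, d))              # Fisher information Psi^T K^{-1} Psi
    for X in designs:
        K_i = X @ Sigma @ X.T + sigma2 * np.eye(X.shape[0])
        F += X.T @ np.linalg.solve(K_i, X)
    M = T @ Sinv @ np.linalg.inv(F) @ Sinv @ T
    return x @ M @ x, x @ T @ x + sigma2

rng = np.random.default_rng(2)
d, n, m = 5, 20, 10
designs = [rng.standard_normal((m, d)) for _ in range(n)]
x = np.ones(d) / np.sqrt(d)
first, rest = risk_terms(designs, x, Sigma=np.eye(d), sigma2=1.0)
print("MLE risk:", first + rest, " general lower bound:", first / (16 * np.e) + rest)
```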
### 4.2. Proof Sketches

Our lower and upper bounds on the risk are based on an identity that holds for the transfer risk of plug-in methods. The identity is essentially a bias-variance decomposition.

**Lemma 4.5.** For $\hat{\theta}_n(\hat{\alpha})$ defined in Eq. (3), any task mean estimator $\hat{\alpha}$, and any $x \in \mathbb{R}^d$ we have

$$\mathbb{E}[L(x)] = \mathbb{E}\left[\left(x^\top T\Sigma^{-1}(\alpha - \hat{\alpha})\right)^2\right] + x^\top T x + \sigma^2.$$

For the proof of this lemma we need the following proposition, whose proof is given in Appendix A:

**Proposition 4.6.** Let $Y = \theta_n^\top x + \varepsilon$ for $\varepsilon \sim N(0, \sigma^2)$ and some $x \in \mathbb{R}^d$. Then,

$$\mathbb{E}[Y \mid D] = x^\top T\left(\Sigma^{-1}\alpha + \tfrac{1}{\sigma^2} X_n^\top Y_n\right) \quad \text{and} \quad \mathbb{V}[Y \mid D] = x^\top T x + \sigma^2.$$

*Proof of Lemma 4.5.* Using the law of total expectation and the fact that for a random variable $\xi$ we have $\mathbb{E}[\xi^2] = \mathbb{E}[\xi]^2 + \mathbb{V}[\xi]$,

$$L(x) = \mathbb{E}\left[(Y - \hat{\theta}_n(\hat{\alpha})^\top x)^2 \mid D\right] = \left(\mathbb{E}[Y \mid D] - \hat{\theta}_n(\hat{\alpha})^\top x\right)^2 + \mathbb{V}[Y \mid D] = \left(x^\top T\Sigma^{-1}(\alpha - \hat{\alpha})\right)^2 + x^\top T x + \sigma^2,$$

where the identities for $\mathbb{E}[Y \mid D]$ and $\mathbb{V}[Y \mid D]$ come from Proposition 4.6 and the identity for $\hat{\theta}_n(\hat{\alpha})$ is due to (3). Taking expectations on both sides completes the proof.

Thus, to establish the universal lower bounds we need to lower bound $\mathbb{E}[(x^\top T\Sigma^{-1}(\alpha - \hat{\alpha}))^2]$ for any choice of estimator $\hat{\alpha}$, which in combination with Lemma 4.5 proves Theorem 4.1. Here, relying on the Cramér-Rao inequality (Theorem B.1), we only prove a lower bound for unbiased estimators, while the general case, whose proof uses Le Cam's method, is left to Appendix B.2.

**Lemma 4.7.** For any unbiased estimator $\hat{\alpha}$ of $\alpha$ in Eq. (2) we have

$$\mathbb{E}\left[\left(x^\top T\Sigma^{-1}(\alpha - \hat{\alpha})\right)^2\right] \geq x^\top M x.$$

*Proof.* Recall that according to the equivalence (2), $Y \sim N(\Psi\alpha, K)$ and the unknown parameter is $\alpha$. To compute the Fisher information matrix we first observe that $\nabla_\alpha \ln p_G(Y; \Psi\alpha, K) = \Psi^\top K^{-1}(Y - \Psi\alpha)$ and

$$F = \mathbb{E}\left[\nabla_\alpha \ln p_G(Y; \Psi\alpha, K)\,\nabla_\alpha \ln p_G(Y; \Psi\alpha, K)^\top\right] = \Psi^\top K^{-1}\,\mathbb{E}\left[(Y - \Psi\alpha)(Y - \Psi\alpha)^\top\right] K^{-1}\Psi = \Psi^\top K^{-1}\Psi.$$

Thus, by the Cramér-Rao inequality we have $\mathbb{E}[(\alpha - \hat{\alpha})(\alpha - \hat{\alpha})^\top] \succeq (\Psi^\top K^{-1}\Psi)^{-1}$. Finally, left-multiplying by $x^\top T\Sigma^{-1}$ and right-multiplying by $\Sigma^{-1}Tx$ gives the statement.

## 5. Learning with Unknown Task Structure

So far we have assumed that the parameters $(\sigma^2, \Sigma)$ characterizing the structure of the environment are known, which limits the applicability of the predictor (though it does not limit the lower bound). Staying within our framework, a natural idea is to estimate all the environment parameters $\mathcal{E} = (\alpha, \sigma^2, \Sigma)$ by maximizing the marginal log-likelihood of the data,

$$J(D, \mathcal{E}) = \ln \int_{\mathbb{R}^{nd}} p(D \mid \vartheta)\, \mathrm{d}p(\vartheta \mid \mathcal{E}),$$

over $\mathcal{E}$, where $p(D, \Theta, \mathcal{E})$ stands for the joint distribution in the model (1). The above problem is non-convex. As such, we propose to use the EM procedure (Dempster et al., 1977), which is known to be a reasonable algorithm for similar settings.⁴ EM can be derived as a procedure that maximizes a lower bound on $J(D, \mathcal{E})$: Jensen's inequality gives that for any probability measure $q$ on $\mathbb{R}^{nd}$,

$$J(D, \mathcal{E}) \geq \int \ln\frac{p(\vartheta, D \mid \mathcal{E})}{q(\vartheta)}\, \mathrm{d}q(\vartheta).$$

This is then maximized in $\mathcal{E}$ and $q$ in an alternating fashion: letting $\hat{\mathcal{E}}_t$ be the parameter estimate at step $t$, we maximize the lower bound in $q$ for a fixed $\mathcal{E} = \hat{\mathcal{E}}_t$, and then obtain $\hat{\mathcal{E}}_{t+1}$ by maximizing the lower bound in $\mathcal{E}$ for the fixed previously obtained solution in $q$. Maximization in $q$ gives $q(\vartheta) = p(\vartheta \mid D, \hat{\mathcal{E}}_t)$, while maximization in $\mathcal{E}$ yields

$$\hat{\mathcal{E}}_{t+1} \in \arg\max_{\mathcal{E}} \int \ln\left(p(\vartheta, D \mid \mathcal{E})\right) \mathrm{d}p(\vartheta \mid D, \hat{\mathcal{E}}_t). \qquad (9)$$

After some calculations (cf. Appendix D), this gives Algorithm 1. During the E-step (lines 4-5), the algorithm computes the parameters of the posterior distribution $N(\theta_i \mid \hat{\mu}_{t,i}, \hat{T}_{t,i})$ relying on $\hat{\mathcal{E}}_t$, and during the M-step (lines 7-9) it estimates $\hat{\mathcal{E}}_{t+1}$ based on $(\hat{\mu}_{t,i}, \hat{T}_{t,i})$. We propose to detect convergence (not shown) by checking the relative difference between successive parameter values.

⁴While the marginal distribution is available in analytic form by Eq. (2), we focus on EM because, in preliminary experiments, direct optimization proved to be numerically unstable.

**Algorithm 1.** EM procedure to estimate $(\alpha, \sigma^2, \Sigma)$

Input: initial parameter estimates $\hat{\mathcal{E}}_1 = (\hat{\alpha}_1, \hat{\sigma}^2_1, \hat{\Sigma}_1)$. Output: final parameter estimates $\hat{\mathcal{E}}_t = (\hat{\alpha}_t, \hat{\sigma}^2_t, \hat{\Sigma}_t)$.

1: $\hat{T}_{1,i} \leftarrow 0$, $\hat{\mu}_{1,i} \leftarrow 0$ for all $i \in [n]$
2: repeat
3: &nbsp;&nbsp;for $i = 1, \ldots, n$ do &nbsp;&nbsp;▷ E-step
4: &nbsp;&nbsp;&nbsp;&nbsp;$\hat{T}_{t,i} \leftarrow \left(\hat{\Sigma}_t^{-1} + \hat{\sigma}_t^{-2} X_i^\top X_i\right)^{-1}$
5: &nbsp;&nbsp;&nbsp;&nbsp;$\hat{\mu}_{t,i} \leftarrow \hat{T}_{t,i}\left(\hat{\Sigma}_t^{-1}\hat{\alpha}_t + \hat{\sigma}_t^{-2} X_i^\top Y_i\right)$
6: &nbsp;&nbsp;end for
7: &nbsp;&nbsp;$\hat{\alpha}_{t+1} \leftarrow \frac{1}{n}\sum_{i=1}^n \hat{\mu}_{t,i}$ &nbsp;&nbsp;▷ M-step
8: &nbsp;&nbsp;$\hat{\Sigma}_{t+1} \leftarrow \frac{1}{n}\sum_{i=1}^n \left[\hat{T}_{t,i} + (\hat{\mu}_{t,i} - \hat{\alpha}_{t+1})(\hat{\mu}_{t,i} - \hat{\alpha}_{t+1})^\top\right]$
9: &nbsp;&nbsp;$\hat{\sigma}^2_{t+1} \leftarrow \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\left[\hat{L}_i(\hat{\mu}_{t,i}) + \operatorname{tr}\left(X_i \hat{T}_{t,i} X_i^\top\right)\right]$
10: &nbsp;&nbsp;$t \leftarrow t + 1$
11: until convergence (see discussion)
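For concreteness, a compact sketch of Algorithm 1 follows. This is our own minimal implementation: names are hypothetical and the convergence test is simplified to a fixed iteration budget rather than the relative-difference check described above.

```python
# Sketch of Algorithm 1 (our own minimal implementation). E-step: posterior
# moments (mu_i, T_i) per task; M-step: re-estimate (alpha, Sigma, sigma^2).
import numpy as np

def em_environment(tasks, d, n_iters=100):
    alpha, sigma2, Sigma = np.zeros(d), 1.0, np.eye(d)   # initial estimates
    for _ in range(n_iters):
        Sinv = np.linalg.inv(Sigma)
        mus, Ts = [], []
        for X, Y in tasks:                               # E-step (lines 4-5)
            T_i = np.linalg.inv(Sinv + X.T @ X / sigma2)
            mus.append(T_i @ (Sinv @ alpha + X.T @ Y / sigma2))
            Ts.append(T_i)
        alpha = np.mean(mus, axis=0)                     # M-step (lines 7-9)
        Sigma = np.mean([T + np.outer(mu - alpha, mu - alpha)
                         for T, mu in zip(Ts, mus)], axis=0)
        sigma2 = np.mean([(np.sum((Y - X @ mu) ** 2)
                           + np.trace(X @ T @ X.T)) / len(Y)
                          for (X, Y), mu, T in zip(tasks, mus, Ts)])
    return alpha, sigma2, Sigma

# Usage with a `tasks` list of (X_i, Y_i) pairs as in the earlier sketches:
# alpha_hat, sigma2_hat, Sigma_hat = em_environment(tasks, d=5)
```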
## 6. Experiments

In this section we present experiments⁵ designed to verify three hypotheses: (i) under ideal circumstances, the predictor $x^\top \hat{\theta}_n(\hat{\alpha}_{\mathrm{MLE}})$ is superior to its alternatives, including biased, but unweighted, regression; (ii) the EM algorithm reliably recovers the unknown parameters of the environment and is also suitable for representation learning; (iii) our distribution-dependent lower bound of Eq. (5) is numerically sharp. In addition, we briefly report on experiments with a real-world dataset.

⁵The implementation of all of our experiments and the required dataset are provided in the supplementary material.

**Baselines.** We consider two non-meta-learning baselines: Linear Regression (All), Ordinary Least Squares (OLS) fitted on $D_{\setminus n} = (D_i)_{i=1}^{n-1}$, which excludes the newly observed task; and Linear Regression (Task), OLS fitted on the newly encountered task $D_n$. Next, we consider meta-learning algorithms. We report the performance of the unweighted Biased Regression procedure with the bias set to the least-squares solution $\left(\sum_{i \neq n} X_i^\top X_i\right)^{-1} \sum_{i \neq n} X_i^\top Y_i$ and $\lambda$ found by cross-validation (cf. Appendix E). Note that the bias and the regularization coefficient are found on $D_{\setminus n}$, while $D_n$ is used for the final fitting. EM Learner is the estimator (3) with all environment parameters found by Algorithm 1 on $D_{\setminus n}$. The convergence threshold was set to $10^{-6}$ while the maximum number of iterations was set to $10^3$. Finally, we report numerical values of Eq. (5) as the Known Covariance Lower Bound. For all of the experiments we show averages and standard deviations of the mean test errors computed over 30 independent runs of that experiment.

**Synthetic experiments.** We conduct synthetic experiments on datasets with Fourier-generated features and features sampled from a $d$-dimensional unit sphere. In all of the synthetic experiments we have $\alpha = 0$, $\sigma^2 = 1$, and $\Sigma$ generated by computing $\Sigma = LL^\top + \eta I$, where $L_{ij} = \mathbb{I}\{i \geq j\} Z_{ij}$ with $Z_{ij}$ and $\eta$ sampled from the standard normal distribution. Test error is computed on 100 test tasks using 10 examples for training and 100 examples for testing. For the Fourier features, we sample a value $u \sim U(-5, 5)$ and compute features by evaluating $d = 11$ Fourier basis functions at $u$:

$$x_j = \mathbb{I}\{1 \leq j \leq 5\}\sin\left(\tfrac{j}{5}\pi u\right) + \mathbb{I}\{6 \leq j \leq 10\}\cos\left(\tfrac{j-5}{5}\pi u\right) + \mathbb{I}\{j = 11\},$$

where $\mathbb{I}\{E\} = 1$ if $E$ is true and $\mathbb{I}\{E\} = 0$ otherwise. Examples of these tasks and the results of meta-learning on some of them were shown in Fig. 1. In Fig. 2 we show the test errors of the various meta-learners while varying the number of tasks $n$ and the task sizes $m$. For the spherical data, the same is shown in Fig. 3. Here, we generate $x$ from a $d = 42$ dimensional unit sphere. In both experiments, for a sufficiently large number of training tasks the EM-based learner approaches the optimal estimator even when the number of examples per task is less than the dimensionality of the task. In the context of the Fourier dataset, we also experimented with generating a low-rank $\Sigma$, corresponding to the challenge of learning a low-dimensional representation shared across the tasks. We found that the EM-based meta-learner stays competitive in this setting. To save space, the results are presented in Appendix G.
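For reference, a sketch of this feature map (our own function name, not from the released code):

```python
# Sketch of the d = 11 Fourier feature map described above (our own code).
import numpy as np

def fourier_features(u):
    sins = [np.sin(j * np.pi * u / 5) for j in range(1, 6)]   # j = 1, ..., 5
    coss = [np.cos(j * np.pi * u / 5) for j in range(1, 6)]   # j - 5 = 1, ..., 5
    return np.array(sins + coss + [1.0])                      # constant, j = 11

rng = np.random.default_rng(3)
u = rng.uniform(-5, 5)      # u ~ U(-5, 5)
x = fourier_features(u)     # one feature vector; stack these to form X_i
```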
**Real dataset experiment.** We also conducted experiments on a real-world dataset containing information about students from 139 schools in the years 1985–1987 (Dua & Graff, 2017, School Dataset). We adapt the dataset to a meta-learning problem with the goal of predicting the exam scores of students based on student-specific and school-specific features. After one-hot encoding of the categorical values there are $d = 27$ features for each student. We randomly split the schools into two subsets: the first, consisting of 100 schools, forms $D_{\setminus n}$ (used for training the bias, $\lambda$ selection, and EM). The second subset consists of 39 schools, where each school is further split into 80%/20% for the final training and testing of the meta-learners. Results are given in Fig. 4. We can see that while both Biased Regression and EM Learner outperform regression, their performance is very similar. This could be attributed to the fact that the features mostly contain weakly relevant information, which is confirmed by inspecting the coefficient vector.

Figure 2. Test errors on the Fourier synthetic experiment with changing number of tasks $n$ and number of samples per task $m$. When one of the parameters changes, the other one is set to 10.

Figure 3. Test error on the spherical synthetic experiment with changing number of tasks $n$ and number of samples per task $m$. When one of the parameters changes, the other one is set to 40.

Figure 4. Test error on the School Dataset. Up to 100 schools are used for fitting environment-related parameters (see text for details) and the remaining 39 are used as the target task.

**Representation learning experiments I.** Our next figure (Fig. 5) shows the outcomes of experiments for the Fourier task when $\Sigma$ is low-rank. As can be seen, the EM-based learner excels in exploiting the low-rank structure. For this experiment we have the same setup as for the Fourier experiment, but the covariance matrix $\Sigma$ is generated by computing $\Sigma = LL^\top$, where $L$ is a $d \times r$ matrix with $r = \lfloor d/2 \rfloor = 5$ and elements $L_{i,j} \sim N(0, 1)$. Note that in this case we can write $\theta_i = Bw_i$ for some matrix $B$ of size $d \times r$ and vector $w_i$ of size $r$ sampled from a multivariate normal distribution. Thus, if the matrix $B$ is known or estimated during training, one can project the features $x_{i,j}$ onto a lower-dimensional space by computing $B^\top x_{i,j}$, to speed up the adaptation to new tasks by running least-squares regression to estimate $w_i$ instead of $\theta_i$.

In addition to the baselines described above, we compared EM Learner with two additional baselines: one based on the Method of Moments (MoM) estimator from Tripuraneni et al. (2020) (not shown in the figure), and another which we refer to as Oracle Representation. We omit displaying the error of the method-of-moments estimator since, for features generated as in this experiment, it is not able to estimate the subspace and leads to test errors with values around 60. At the same time, as shown in Fig. 5 (left), we observe that EM Learner can outperform Oracle Representation, which assumes knowledge of the covariance matrix $\Sigma$, from which it computes the subspace matrix $B$ and uses it to obtain a lower-dimensional representation of the features when adapting to a new task via least squares, as described above. This is possible because the coefficients estimated by EM are biased toward $\alpha$, which does not happen with least-squares regression in the lower-dimensional subspace, and this is beneficial, especially when the number of test-task training examples is small.

Figure 5. Test error when the task covariance matrix is low-rank. As usual, on the left the number of tasks is changed; on the right, the number of training data points (per task). When one parameter is varied, the other is set to the value of 10.

**Representation learning experiments II.** To validate our implementation of the MoM estimator of Tripuraneni et al. (2020), and to investigate further whether EM is preferable to the MoM estimator beyond the setting that is ideal for the EM method, we considered the experimental setup of Tripuraneni et al. (2020). To explain the setup, we recall that the MoM estimator computes an estimate $\hat{B}$ of the ground-truth matrix $B$. Tripuraneni et al. (2020) prove results for the max-correlation between $\hat{B}$ and $B$, and also report experimentally measured max-correlation values between the ground truth and the MoM-computed matrix. The max-correlation between matrices $A$ and $A'$ is based on the definition of principal angles and is equal to

$$d_{\max}(A, A') = \sqrt{1 - \cos^2(A, A')}, \quad \text{where } \cos(A, A') = \max_{u \in \operatorname{span}(A):\, \lVert u \rVert = 1}\;\max_{v \in \operatorname{span}(A'):\, \lVert v \rVert = 1} u^\top v.$$

Intuitively, the max-correlation captures how well the subspaces spanned by the matrices $A$ and $A'$ are aligned.
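Under this definition, $\cos(A, A')$ equals the largest singular value of $Q_A^\top Q_{A'}$, where $Q_A$ and $Q_{A'}$ are orthonormal bases of the two column spans, which gives a short implementation; the sketch below is our own, for illustration only.

```python
# Sketch (our own code) of the max-correlation distance defined above:
# cos(A, B) is the largest singular value of Q_A^T Q_B, where Q_A and Q_B
# are orthonormal bases of the column spans of A and B.
import numpy as np

def max_correlation_distance(A, B):
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cos = np.linalg.svd(Qa.T @ Qb, compute_uv=False).max()
    return np.sqrt(max(1.0 - cos ** 2, 0.0))
```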
To compare our EM estimator to MoM, we run the EM estimator as described in Algorithm 1 and, once the final estimate $\hat{\Sigma}$ is obtained, reduce its rank by clipping the eigenvalues $\lambda_{s+1}, \ldots, \lambda_d$ to 0. We follow the experimental setup of Tripuraneni et al. (2020); that is, inputs are generated as $x_i \sim N(0, I_d)$, while the regression model is given by Eq. (1) with $(\sigma^2, \Sigma) = (1, \frac{1}{s} BB^\top)$. Here, the columns of $B \in \mathbb{R}^{d \times s}$ are sampled from a uniform distribution on the unit $d$-sphere. Finally, the number of examples per previously observed task is set as $m_1 = \cdots = m_{n-1} = 5$, the representation rank is $s = 5$, the input dimension is $d = 100$, and the experiment is repeated 30 times. Since we only estimate the subspace matrix, we do not use the data from the test task $(X_n, Y_n)$. We report our results in Fig. 6, plotting the max-correlation between the $\hat{B}$ found by the respective algorithm and $B$, while increasing the number of tasks. We see that the EM learner considerably outperforms MoM Representation in terms of subspace estimation, to the degree captured by the max-correlation. While we suspect that the improvement is due to the joint optimization over the covariance of the environment and the mean of the environment (the bias in biased regularization), a detailed understanding of this effect is left for future work.

Figure 6. Max-correlation $d_{\max}(\hat{B}, B)$ between the estimated matrix $\hat{B}$ (by the respective algorithm) and the ground-truth matrix $B$ with increasing number of tasks $n$.

## 7. Conclusions

While ours is the first work to derive matching, distribution-dependent lower and upper bounds, much work remains to be done: our approach of deriving meta-learning algorithms based on a probabilistic model should be applicable more broadly and could lead to further interesting developments in meta-learning. The most interesting narrower question is to theoretically analyze the EM algorithm. Doing this in the low-rank setting looks particularly interesting. We hope that our paper will inspire other researchers to do further work in this area.
## References

Baxter, J. Theoretical models of learning to learn. In Thrun, S. and Pratt, L. (eds.), Learning to Learn, pp. 71–94. Springer, 1998.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Bretagnolle, J. and Huber, C. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47(2):119–137, 1979.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Conference on Neural Information Processing Systems (NeurIPS), pp. 10169–10179, 2018.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning (ICML), 2019.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv:2002.09434, 2020.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. In International Conference on Machine Learning (ICML), 2019.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning (ICML), pp. 424–433, 2019a.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. Advances in Neural Information Processing Systems, 32:5917–5928, 2019b.

Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In International Conference on Machine Learning (ICML), pp. 942–950, 2013.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, pp. 1–25, 2016. ISSN 1573-0565. doi: 10.1007/s10994-016-5594-4.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2018.

Lucas, J., Ren, M., Kameni, I., Pitassi, T., and Zemel, R. Theoretical bounds on estimation error for meta-learning. In International Conference on Learning Representations (ICLR), 2021.

Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6(Jun):967–994, 2005.

Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, 2009.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 2016.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), pp. 991–999, 2014.

Tripuraneni, N., Jin, C., and Jordan, M. I. Provable meta-learning of linear representations. arXiv:2002.11684, 2020.

Yang, J., Yan, R., and Hauptmann, A. G. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th ACM International Conference on Multimedia, pp. 188–197, 2007.