# Meta-learning for Mixed Linear Regression

Weihao Kong 1, Raghav Somani 1, Zhao Song 2, Sham Kakade 1, Sewoong Oh 1

1 University of Washington, Seattle, Washington, USA. 2 Princeton University / Institute for Advanced Study. Correspondence to: Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, Sewoong Oh.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labelled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experience by exploiting some similarities. We study a fundamental question of interest: when can abundant tasks with small data compensate for a lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of $k$ linear regressions, and identify sufficient conditions for such a graceful exchange to hold; there is little loss in sample complexity even when we only have access to small-data tasks. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small-data tasks with the help of $\widetilde{\Omega}(k^{3/2})$ medium-data tasks, each with $\widetilde{\Omega}(k^{1/2})$ examples.

## 1. Introduction

Recent advances in machine learning highlight successes on a small set of tasks where a large number of labeled examples have been collected and exploited. These include image classification with 1.2 million labeled examples (Deng et al., 2009) and French-English machine translation with 40 million paired sentences (Bojar et al., 2014). For common tasks, however, collecting clean labels is costly, as they require human expertise (as in medical imaging) or physical interactions (as in robotics), for example. Thus collected real-world datasets follow a long-tailed distribution, in which a dominant set of tasks only have a small number of training examples (Wang et al., 2017).

Inspired by human ingenuity in quickly solving novel problems by leveraging prior experience, meta-learning approaches aim to jointly learn from past experience to quickly adapt to new tasks with little available data (Schmidhuber, 1987; Thrun & Pratt, 2012). This has had a significant impact in few-shot supervised learning, where each task is associated with only a few training examples. By leveraging structural similarities among those tasks, one can achieve accuracy far greater than what can be achieved for each task in isolation (Finn et al., 2017; Ravi & Larochelle, 2017; Koch et al., 2015; Oreshkin et al., 2018; Triantafillou et al., 2019; Rusu et al., 2018). The success of such approaches hinges on the following fundamental question: when can we jointly train small-data tasks to achieve the accuracy of large-data tasks?

We investigate this trade-off under a canonical scenario where the tasks are linear regressions in $d$ dimensions and the regression parameters are drawn i.i.d. from a discrete set of support size $k$. Although widely studied, the existing literature addresses the scenario where all tasks have the same fixed number of examples. We defer formal comparisons to Section 6.
On one extreme, when large training data of sample size $\Omega(d)$ is available, each task can easily be learned in isolation; here, $\Omega(k \log k)$ such tasks are sufficient to learn all $k$ regression parameters. This is illustrated by a solid circle in Figure 1. On the other extreme, when each task has only one example, existing approaches require exponentially many tasks (see Table 1). This is illustrated by a solid square.

Several aspects of few-shot supervised learning make training linear models challenging. The number of training examples varies significantly across tasks, all of which are significantly smaller than the dimension of the data $d$. The number of tasks is also limited, which rules out any algorithm with exponential sample complexity. An example distribution of such heterogeneous tasks is illustrated in Figure 1 with a bar graph in blue, where both the solid circle and square are far outside the regime covered by the typical distribution of tasks.

[Figure 1. A realistic pool of meta-learning tasks does not include large-data tasks (circle) or an extremely large number of small-data tasks (square), where existing approaches achieve high accuracy. The horizontal axis denotes the number of examples $t$ per task, and the vertical axis denotes the number of tasks in the pool that have at least $t$ examples. The proposed approach succeeds whenever some point in the light (green) region and some point in the heavy (yellow) region are both covered by the blue bar graph, as in this example. The blue graph summarizes the pool of tasks at hand, illustrating the cumulative count of tasks with at least $t$ examples. We ignore constants and poly-logarithmic factors.]

In this data-scarce regime, we show that we can still efficiently achieve any desired accuracy in estimating the meta-parameters defining the meta-learning problem. This is shown in the informal version of our main result in Corollary 1.1. As long as we have a large enough number of light tasks, each with $t_L = \widetilde{\Omega}(1)$ examples, we can achieve any accuracy with the help of a small number of heavy tasks, each with $t_H = \widetilde{\Omega}(\sqrt{k})$ examples. We only require the total number of examples that we have jointly across all light tasks to be of order $t_L n_L = \widetilde{\Omega}(dk^2)$; the number of light tasks $n_L$ and the number of examples per task $t_L$ trade off gracefully. This is illustrated by the green region in Figure 1. Further, we only need a small number of heavy tasks, with $t_H n_H = \widetilde{\Omega}(k^{3/2})$, shown in the yellow region. As long as the cumulative count of tasks in the blue graph intersects both the light (green) and heavy (yellow) regions, we can recover the meta-parameters accurately.

Corollary 1.1 (Special case of Theorem 1, informal). Given two batches of samples, the first batch with $t_L = \widetilde{\Omega}(1)$ and $t_L n_L = \widetilde{\Omega}(dk^2)$, and the second batch with $t_H = \widetilde{\Omega}(\sqrt{k})$ and $t_H n_H = \widetilde{\Omega}(k^2)$, Algorithm 1 estimates the meta-parameters up to any desired accuracy of $O(1)$ with high probability, under certain assumptions on the meta-parameters.

We design a novel spectral approach inspired by (Vempala & Wang, 2004) that first learns a subspace using the light tasks, and then clusters the heavy tasks in the projected space. To get the desired tight bound on the sample complexity, we improve upon a perturbation bound from (Li & Liang, 2018), and borrow techniques from recent advances in property testing in (Kong et al., 2019).
## 2. Problem formulation and notations

There are two perspectives on approaching meta-learning: optimization-based (Li et al., 2017; Bertinetto et al., 2019; Zhou et al., 2018; Zintgraf et al., 2019; Rajeswaran et al., 2019) and probabilistic (Grant et al., 2018; Finn et al., 2018; Kim et al., 2018; Harrison et al., 2018). Our approach is motivated by the probabilistic view, and we present a brief preliminary in Section 2.1. In Section 2.2, we present a simple but canonical scenario where the tasks are linear regressions, which is the focus of this paper.

### 2.1. Review of the probabilistic view on meta-learning

A standard meta-training setup for few-shot supervised learning assumes that we are given a collection of $n$ meta-training tasks $\{\mathcal{T}_i\}_{i=1}^n$ drawn from some distribution $\mathbb{P}(\mathcal{T})$. Each task is associated with a dataset of size $t_i$, collectively denoted as a meta-training dataset $D_{\text{meta-train}} = \{(x_{i,j}, y_{i,j}) \in \mathbb{R}^d \times \mathbb{R}\}_{i \in [n],\, j \in [t_i]}$. Exploiting some structural similarities in $\mathbb{P}(\mathcal{T})$, the goal is to train a model for a new task $\mathcal{T}_{\text{new}}$, coming from $\mathbb{P}(\mathcal{T})$, from a small training dataset $D = \{(x^{\text{new}}_j, y^{\text{new}}_j)\}_{j \in [\tau]}$.

Each task $\mathcal{T}_i$ is associated with a model parameter $\phi_i$, and the meta-training data is independently drawn from

$(x_{i,j}, y_{i,j}) \sim \mathbb{P}_{\phi_i}(y \mid x)\, \mathbb{P}(x)$ for all $j \in [t_i]$.

The prior distribution of the tasks, and hence of the model parameters, is fully characterized by a meta-parameter $\theta$ such that $\phi_i \sim \mathbb{P}_\theta(\phi)$. Following the definition from (Grant et al., 2018), the meta-learning problem is defined as estimating the most likely meta-parameter given the meta-training data by solving

$\theta^\ast \in \arg\max_\theta\, \log \mathbb{P}(\theta \mid D_{\text{meta-train}})$,  (1)

which is a special case of empirical Bayes methods for learning the prior distribution from data (Carlin & Louis, 2010). Once meta-learning is done, the model parameter of a newly arriving task can be estimated by a maximum a posteriori (MAP) estimator,

$\hat\phi \in \arg\max_\phi\, \log \mathbb{P}(\phi \mid D, \theta^\ast)$,  (2)

or a Bayes-optimal estimator,

$\hat\phi \in \arg\min_{\phi}\, \mathbb{E}_{\phi' \sim \mathbb{P}(\phi' \mid D, \theta^\ast)}\left[\, \ell(\phi, \phi') \,\right]$,  (3)

for a choice of loss function $\ell$. This estimated parameter is then used for predicting the label of a new data point $x$ in task $\mathcal{T}_{\text{new}}$ as

$\hat{y} \in \arg\max_y\, \mathbb{P}_{\hat\phi}(y \mid x)$.  (4)

General notations. We define $[n] := \{1, \ldots, n\}$ for $n \in \mathbb{N}$, $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$ as the standard $\ell_p$-norm, and $B_{p,k}(\mu, r) := \{x \in \mathbb{R}^k \mid \|x - \mu\|_p = r\}$. $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate normal distribution with mean $\mu \in \mathbb{R}^d$ and covariance $\Sigma \in \mathbb{R}^{d \times d}$, and $\mathbb{1}\{E\}$ denotes the indicator of an event $E$.

### 2.2. Linear regression with a discrete prior

In general, the meta-learning problem of (1) is computationally intractable and no statistical guarantees are known. To investigate the trade-offs involved, we assume a simple but canonical scenario where the tasks are linear regressions:

$x_{i,j} \sim \mathbb{P}_x$, $\quad y_{i,j} = \beta_i^\top x_{i,j} + \epsilon_{i,j}$,  (5)

for the $i$-th task and $j$-th example. Each task is associated with a model parameter $\phi_i = (\beta_i \in \mathbb{R}^d, \sigma_i \in \mathbb{R}_+)$. The noise $\epsilon_{i,j}$ is drawn i.i.d. from $\mathbb{P}_{\epsilon_i}$, a centered sub-Gaussian distribution with parameter $\sigma_i^2$. Without loss of generality, we assume that $\mathbb{P}_x$ is an isotropic (i.e. $\mathbb{E}[x_{i,j} x_{i,j}^\top] = I_d$) centered sub-Gaussian distribution. If $\mathbb{P}_x$ is not isotropic, we assume there is a large enough number of $x_{i,j}$'s available for whitening, so that $\mathbb{P}_x$ can be made sufficiently close to isotropic. We do not make any assumption on the prior of the $\phi_i$'s other than that they come from a discrete distribution of support size $k$.
Concretely, the meta-parameter $\theta = \left(W \in \mathbb{R}^{d \times k},\ s \in \mathbb{R}^k_+,\ p \in \mathbb{R}^k_+ \cap B_{1,k}(0,1)\right)$ defines a discrete prior (also known as a mixture of linear experts (Chaganty & Liang, 2013)) on the $\phi_i$'s, where $W = [w_1, \ldots, w_k]$ are the $k$ candidate model parameters, and $s = [s_1, \ldots, s_k]$ are the $k$ candidate noise parameters. The $i$-th task is randomly assigned to one of the $k$ components according to the distribution $p$, denoted by $z_i \sim \mathrm{multinomial}(p)$. The training data is then independently drawn from (5) for each $j \in [t_i]$ with $\beta_i = w_{z_i}$ and $\sigma_i = s_{z_i}$.

We want to characterize the sample complexity of this meta-learning problem. This depends on how complex the ground-truth prior $\theta$ is, which can be measured by the number of components $k$, the separation between the parameters in $W$, the minimum mixing probability $p_{\min}$, and the minimum positive eigenvalue $\lambda_{\min}$ of the matrix $\sum_{j=1}^k p_j w_j w_j^\top$.

Notations. We define $\rho_i := \sqrt{s_{z_i}^2 + \|w_{z_i}\|_2^2}$ as the sub-Gaussian norm of a label $y_{i,j}$ in the $i$-th task, and $\rho^2 := \max_i \rho_i^2$. Without loss of generality, we assume $\rho = 1$, which can always be achieved by scaling the meta-parameters appropriately. We also define $p_{\min} := \min_{j \in [k]} p_j$ and $\Delta := \min_{i, j \in [k],\, i \neq j} \|w_i - w_j\|_2$, and assume $p_{\min}, \Delta > 0$. $\omega \in \mathbb{R}_+$ is such that two $n \times n$ matrices can be multiplied in $O(n^\omega)$ time.

## 3. Algorithm

We propose a novel spectral approach (Algorithm 1) to solve meta-learning for linear regression, consisting of three sub-algorithms: subspace estimation, clustering, and classification. These sub-algorithms require different types of tasks, depending on how many labelled examples are available. Clustering requires heavy tasks, where each task is associated with many labelled examples, but we need only a small number of such tasks. On the other hand, for subspace estimation and classification, light tasks are sufficient, where each task is associated with only a few labelled examples; however, we need a large number of such tasks. In this section, we present the intuition behind our algorithm design and the types of tasks required. Precisely analyzing these requirements is the main contribution of this paper, to be presented in Section 4.

### 3.1. Intuitions behind the algorithm design

We give a sketch of the algorithm below. Each step of meta-learning is spelled out in full detail in Section 5. This provides an estimated meta-parameter $\hat\theta = (\widehat{W}, \hat{s}, \hat{p})$. When a new task arrives, this estimate can be readily applied for prediction, as defined in Definition 4.5.

Algorithm 1: Meta-learning

1. Subspace estimation. Compute a subspace $U$ which approximates $\mathrm{span}\{w_1, \ldots, w_k\}$, via singular value decomposition.
2. Clustering. Project the heavy tasks onto the subspace $U$, perform distance-based clustering into $k$ clusters, and estimate $\tilde{w}_i$ for each cluster.
3. Classification. Perform likelihood-based classification of the light tasks using the $\tilde{w}_i$'s estimated in the clustering step, and compute more refined estimates $(\hat{w}_i, \hat{s}_i, \hat{p}_i)$ of $(w_i, s_i, p_i)$ for $i \in [k]$.
4. Prediction. Perform MAP or Bayes-optimal prediction using the estimated meta-parameter as a prior.

Subspace estimation. The subspace spanned by the regression vectors, $\mathrm{span}\{w_1, \ldots, w_k\}$, can be easily estimated using data from the (possibly) light tasks with only $t_i \geq 2$. Using any two independent examples $(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2})$ from the same task, it holds that $\mathbb{E}\left[y_{i,1} y_{i,2}\, x_{i,1} x_{i,2}^\top\right] = \sum_{j=1}^k p_j w_j w_j^\top$. With a total of $\Omega(d \log d)$ such examples, the matrix $\sum_{j=1}^k p_j w_j w_j^\top$ can be accurately estimated in spectral norm, and so can the column space $\mathrm{span}\{w_1, \ldots, w_k\}$. We call this step subspace estimation.
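To make this moment identity concrete, the following is a minimal sketch (in Python, with hypothetical function and variable names; it is not the authors' reference implementation) of estimating the column space from light tasks using one pair of examples per task:

```python
import numpy as np

def estimate_subspace(X1, y1, X2, y2, k):
    """Estimate span{w_1, ..., w_k} from light tasks with >= 2 examples each.

    X1, X2 : (n, d) arrays, one example per task in each half.
    y1, y2 : (n,) arrays of the corresponding labels.
    Returns U : (d, k) orthonormal basis of the estimated subspace.
    """
    n, d = X1.shape
    # Unbiased moment estimate: E[y1 y2 x1 x2^T] = sum_j p_j w_j w_j^T.
    M = (y1[:, None] * X1).T @ (y2[:, None] * X2) / n
    M = (M + M.T) / 2           # symmetrize the estimate
    # The top-k singular vectors approximately span the regression subspace.
    U, _, _ = np.linalg.svd(M)
    return U[:, :k]
```

Algorithm 2 in Section 5.1 is the detailed version of this step, which splits each task's examples into two halves and symmetrizes the resulting moment estimate in the same way.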
Clustering. Given an accurate estimate of the subspace $\mathrm{span}\{w_1, \ldots, w_k\}$, we can reduce the problem from a $d$-dimensional to a $k$-dimensional regression problem by projecting $x$ onto the subspace $U$. Tasks with $t_i = \Omega(k)$ examples can be trained individually, as the unknown parameter now lies in $\mathbb{R}^k$. The fundamental question we address is: what can we do when $t_i = o(k)$?

We propose clustering such tasks based on estimates of their regression vectors $\beta_i$, and jointly solving a single regression problem for each cluster. To this end, we borrow techniques from recent advances in property estimation for linear regression. In the contextual bandit setting, Kong et al. (2019) proposed an estimator for the correlation between the linear regressors of a pair of datasets. Concretely, given two datasets $\{x_{1,j}, y_{1,j}\}_{j \in [t]}$ and $\{x_{2,j}, y_{2,j}\}_{j \in [t]}$ whose true (unknown) regression vectors are $\beta_1$ and $\beta_2$, one can estimate $\|\beta_1\|_2^2$, $\|\beta_2\|_2^2$, and $\beta_1^\top \beta_2$ accurately with $t = O(\sqrt{d})$. We use this technique to estimate $\|\beta_{i_1} - \beta_{i_2}\|_2^2$, whose value can be used to check whether two tasks $i_1$ and $i_2$ belong to the same cluster. We cluster the tasks with $t_i = \Omega(\sqrt{k})$ into $k$ disjoint clusters. We call this step clustering.

After clustering, the resulting estimates $\tilde{w}_i$ have two sources of error: the error in the subspace estimation, and the error in the parameter estimation for each cluster. If we cluster more heavy tasks, we can reduce the second error but not the first. We could increase the number of samples used in subspace estimation, but there is a more sample-efficient way: classification.

Classification. We start the classification step once each cluster has enough (i.e. $\Omega(k)$) data points to obtain a rough estimate of its corresponding regression vector. In this regime, we have $O(1)$ error in the estimated $\tilde{w}_i$'s. This is sufficient for us to add more data points to grow each of the clusters. When enough data points are accumulated (i.e. $\widetilde{\Omega}(d)$ for each cluster), we can achieve any desired accuracy with this larger set of accurately classified tasks.

This separation of the roles of the three sub-algorithms is critical in achieving the tightest sample complexity. In contrast to the necessary condition $t_i = \Omega(\sqrt{k})$ for the clustering step, we show that one can accurately determine which cluster a new task belongs to with only $t_i = \Omega(\log k)$ examples, once we have a rough initial estimate $\widetilde{W}$ of the parameter $W$. We grow the clusters by adding tasks with a logarithmic number of examples until we have enough data points per cluster to achieve the desired accuracy. We call this step classification. This concludes our algorithm for the parameter estimation (i.e. meta-learning) phase.

## 4. Main results

Suppose we have $n_H$ heavy tasks, each with at least $t_H$ training examples, and $n_L$ light tasks, each with at least $t_L$ training examples. If the heavy tasks are data rich ($t_H \gg d$), we can learn $W$ straightforwardly from a relatively small number of them, i.e. $n_H = \Omega(k \log k)$. If the light tasks are data rich ($t_L \gg k$), they can be straightforwardly clustered in the projected $k$-dimensional subspace. We therefore focus on the following challenging regime of data scarcity.

Assumption 1. The heavy dataset $D_H$ consists of $n_H$ heavy tasks, each with at least $t_H$ samples. The first light dataset $D_{L1}$ consists of $n_{L1}$ light tasks, each with at least $t_{L1}$ samples. The second light dataset $D_{L2}$ consists of $n_{L2}$ light tasks, each with at least $t_{L2}$ samples. We assume $t_{L1}, t_{L2} < k$ and $t_H < d$.
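To make the data model of Section 2.2 and the task pool of Assumption 1 concrete, here is a minimal synthetic-data sketch (hypothetical names; standard Gaussian covariates and noise are one admissible sub-Gaussian choice, and the constants are illustrative only):

```python
import numpy as np

def sample_tasks(W, s, p, n_tasks, t, rng):
    """Draw n_tasks linear-regression tasks from the discrete prior (W, s, p).

    W : (d, k) candidate regression vectors, s : (k,) noise levels,
    p : (k,) mixing weights. Each task gets t examples.
    Returns a list of (X, y, z) triples, where z is the (hidden) component index.
    """
    d, k = W.shape
    tasks = []
    for _ in range(n_tasks):
        z = rng.choice(k, p=p)                  # z_i ~ multinomial(p)
        X = rng.standard_normal((t, d))         # isotropic covariates
        eps = s[z] * rng.standard_normal(t)     # noise with parameter s_{z_i}
        y = X @ W[:, z] + eps                   # y = <w_{z_i}, x> + noise
        tasks.append((X, y, z))
    return tasks

rng = np.random.default_rng(0)
d, k = 256, 32
W = np.linalg.qr(rng.standard_normal((d, k)))[0]    # orthonormal ground truth
heavy = sample_tasks(W, np.ones(k), np.ones(k) / k, n_tasks=200, t=40, rng=rng)   # t_H < d
light = sample_tasks(W, np.ones(k), np.ones(k) / k, n_tasks=20000, t=4, rng=rng)  # t_L < k
```

The two calls at the bottom mimic the regime of Assumption 1, with a small pool of heavy tasks ($t_H < d$) and a large pool of light tasks ($t_L < k$).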
To give a more fine-grained analysis of the sufficient conditions, we assume two types of light tasks are available, with potentially differing sizes (Remark 4.3). In the meta-learning step of Algorithm 1, subspace estimation uses $D_{L1}$, clustering uses $D_H$, and classification uses $D_{L2}$. We provide proofs of the main results in Appendices A, B, and C.

### 4.1. Meta-learning

We characterize a sufficient condition to achieve a target accuracy $\epsilon$ in estimating the meta-parameters $\theta = (W, s, p)$.

Theorem 1 (Meta-learning). For any failure probability $\delta \in (0, 1)$ and accuracy $\epsilon \in (0, 1)$, given three batches of samples under Assumption 1, the meta-learning step of Algorithm 1 estimates the meta-parameters with accuracy

$\|\hat{w}_i - w_i\|_2 \leq \epsilon\, s_i$ and $|\hat{s}_i^2 - s_i^2| \leq \frac{\epsilon}{\sqrt{d}}\, s_i^2$ for all $i \in [k]$,

with probability at least $1 - \delta$, if the following holds. The numbers of tasks satisfy

$n_{L1} = \Omega\!\left(\frac{d \log^3\!\left(d/(p_{\min}\delta)\right)}{t_{L1}} \min\!\left\{\Delta^{-6} p_{\min}^{-2},\ \Delta^{-2}\lambda_{\min}^{-2}\right\}\right)$, $\quad n_H = \Omega\!\left(\frac{\log(k/\delta)}{t_H\, p_{\min}\, \Delta^2}\left(k + \Delta^2\right)\right)$, $\quad n_{L2} = \Omega\!\left(\frac{d \log^2(k/\delta)}{t_{L2}\, p_{\min}\, \epsilon^2}\right)$,

and the numbers of samples per task satisfy

$t_{L1} \geq 2$, $\quad t_{L2} = \Omega\!\left(\log\!\left(kd/(p_{\min}\delta\epsilon)\right)/\Delta^4\right)$, $\quad$ and $\quad t_H = \Omega\!\left(\frac{\sqrt{k}}{\Delta^2}\log\!\left(k/(p_{\min}\delta)\right)\right)$,

where $\lambda_{\min}$ is the smallest non-zero eigenvalue of $M := \sum_{j=1}^k p_j w_j w_j^\top \in \mathbb{R}^{d \times d}$.

In the following remarks, we explain each of the conditions.

Remark 4.1 (Dependency on $D_{L1}$). The total number of samples used in subspace estimation is $n_{L1} t_{L1}$. The sufficient condition scales linearly in $d$, which matches the information-theoretically necessary condition up to logarithmic factors. If the matrix $M$ is well conditioned, for example when the $w_i$'s are all orthogonal to each other, subspace estimation is easy, and $n_{L1} t_{L1}$ scales as $\Delta^{-2}\lambda_{\min}^{-2}$. Otherwise, the problem gets harder, and we need $\Delta^{-6} p_{\min}^{-2}$ samples. Note that in this regime, tensor decomposition approaches often fail to provide any meaningful guarantee (see Table 1). In proving this result, we improve upon a matrix perturbation bound in (Li & Liang, 2018) to shave off a $k^6$ factor in $n_{L1}$ (see Lemma A.12).

Remark 4.2 (Dependency on $D_H$). The clustering step requires $t_H = \widetilde{\Omega}(\sqrt{k})$, which is necessary for distance-based clustering approaches such as single-linkage clustering. From (Kong & Valiant, 2018; Kong et al., 2019) we know that $t = \Theta(\sqrt{k})$ is necessary (and sufficient) even for the simpler testing problem of deciding whether $\beta_1 = \beta_2$ or $\|\beta_1 - \beta_2\|_2^2$ is bounded away from zero, given two labelled datasets with linear models $\beta_1$ and $\beta_2$. Our clustering step is inspired by (Vempala & Wang, 2004) on clustering under Gaussian mixture models, where the algorithm succeeds if $t_H = \widetilde{\Omega}(\Delta^{-2}\sqrt{k})$. Although a straightforward adaptation fails, we match this sufficient condition. We only require the number of heavy samples $n_H t_H$ to be $\widetilde{\Omega}(k/p_{\min})$ up to logarithmic factors, which is information-theoretically necessary.

Remark 4.3 (Gain of using two types of light tasks). To get the tightest guarantee, it is necessary to use a different set of light tasks to perform the final estimation step. First, notice that the first light dataset $D_{L1}$ does not cover the second light dataset, since we need $t_{L2} = \Omega(\log(kd))$, which does not need to hold for the first dataset $D_{L1}$. On the other hand, the second light dataset does not cover the first light dataset in the setting where $\Delta$ or $p_{\min}$ is very small.

Remark 4.4 (Dependency on $D_{L2}$). Classification and prediction use the same routine to classify a given task. Hence, the $\log k$ requirement on $t_{L2}$ is tight, as it matches our lower bound in Proposition 4.6. The extra terms in the log factor come from the union bound over all $n_{L2}$ tasks, to make sure all the tasks are correctly classified.
It is possible to replace this with $\log(1/\epsilon)$ by showing that an $\epsilon$ fraction of incorrectly classified tasks does not change the estimate by more than $\epsilon$. We only require $n_{L2} t_{L2} = \Omega(d/p_{\min})$ up to logarithmic factors, which is information-theoretically necessary.

### 4.2. Prediction

Given an estimated meta-parameter $\hat\theta = (\widehat{W}, \hat{s}, \hat{p})$ and a new dataset $D = \{(x^{\mathrm{new}}_j, y^{\mathrm{new}}_j)\}_{j \in [\tau]}$, we make predictions on the new task, whose parameters are unknown, using two estimators: the MAP estimator and the Bayes-optimal estimator.

Definition 4.5. Define the maximum a posteriori (MAP) estimator as $\hat\beta_{\mathrm{MAP}}(D) := \hat{w}_{\hat{i}}$, where $\hat{i} := \arg\max_{i \in [k]} \log \hat{L}_i$ and

$\hat{L}_i := \exp\!\left(-\sum_{j=1}^{\tau} \frac{\left(y^{\mathrm{new}}_j - \hat{w}_i^\top x^{\mathrm{new}}_j\right)^2}{2 \hat{s}_i^2} - \tau \log \hat{s}_i + \log \hat{p}_i\right)$.

Define the posterior mean estimator as $\hat\beta_{\mathrm{Bayes}}(D) := \frac{\sum_{i=1}^k \hat{L}_i\, \hat{w}_i}{\sum_{i=1}^k \hat{L}_i}$.

If the true prior $\{(w_i, s_i, p_i)\}_{i \in [k]}$ is known, the posterior mean estimator achieves the smallest expected squared $\ell_2$ error $\mathbb{E}_{D, \beta_{\mathrm{new}}}\!\left[\|\hat\beta(D) - \beta_{\mathrm{new}}\|_2^2\right]$; hence, we refer to it as the Bayes-optimal estimator. The MAP estimator maximizes the probability of exact recovery.

Theorem 2 (Prediction). Under the hypotheses of Theorem 1 with $\epsilon \leq \min\{\Delta/10,\ \Delta^2\sqrt{d}/50\}$, the expected prediction errors of both the MAP and the Bayes-optimal estimator $\hat\beta(D)$ are bounded as

$\mathbb{E}\!\left[\left(x^\top \hat\beta(D) - y\right)^2\right] \leq \delta + \left(1 + \epsilon^2\right)\sum_{i=1}^k p_i s_i^2$,  (6)

if $\tau = \Omega\!\left(\log(k/\delta)/\Delta^4\right)$, where the true meta-parameter is $\theta = \{(w_i, s_i, p_i)\}_{i=1}^k$, and the expectation is over the new task with model parameter $\phi_{\mathrm{new}} = (\beta_{\mathrm{new}}, \sigma_{\mathrm{new}}) \sim \mathbb{P}_\theta$, the training dataset $D \sim \mathbb{P}_{\phi_{\mathrm{new}}}$, and the test data $(x, y) \sim \mathbb{P}_{\phi_{\mathrm{new}}}$.

Note that the $\sum_{i=1}^k p_i s_i^2$ term in (6) is due to the noise in $y$, and cannot be avoided by any estimator. With accurate meta-learning, we can achieve a prediction error arbitrarily close to this statistical limit with $\tau = O(\log k)$. Although both predictors achieve the same guarantee, the Bayes-optimal estimator achieves smaller training and test errors in Figure 2, especially in challenging regimes with small data.

[Figure 2. (a) Training error. (b) Prediction error. The Bayes-optimal estimator achieves smaller errors on an example with $k = 32$, $d = 256$, $W^\top W = I_k$, $s = 1_k$, $p = 1_k/k$, and standard Gaussian $\mathbb{P}_x$ and $\mathbb{P}_\epsilon$. The parameters were learnt using the meta-learning part of Algorithm 1, as a continuation of the simulations discussed in Appendix E, where we provide extensive experiments confirming our analyses.]

We show that $\tau = \Omega(\log k)$ training samples are necessary (even if the ground-truth meta-parameter $\theta$ is known) to achieve an error approaching this statistical limit. Let $\Theta_{k, \Delta, \sigma}$ denote the set of all meta-parameters with $k$ components satisfying $\|w_i - w_j\|_2 \geq \Delta$ for $i \neq j \in [k]$ and $s_i = \sigma$ for all $i \in [k]$. The following minimax lower bound shows that there exists a threshold scaling as $O(\log k)$ below which no algorithm can achieve the fundamental limit of $\sigma^2$, which equals $\sum_{i=1}^k p_i s_i^2$ in this minimax setting.

Proposition 4.6 (Lower bound for prediction). For any $\sigma, \Delta > 0$, if $\tau \leq \left(1 + \Delta^2/\sigma^2\right)^{-1}\log(k - 1)$, then

$\inf_{\hat{y}}\ \sup_{\theta \in \Theta_{k, \Delta, \sigma}} \mathbb{E}\!\left[\left(\hat{y}(D, \theta) - y\right)^2\right] = \sigma^2 + \Omega\!\left(\Delta^2\right)$,  (7)

where the infimum is over all measurable functions of the meta-parameter $\theta$ and the training data $D$ of size $\tau$.
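Before turning to the details of the meta-learning algorithm, the two predictors of Definition 4.5 reduce to a softmax over per-component Gaussian log-likelihoods. The following is a minimal sketch (hypothetical names, not the authors' code), assuming the learned meta-parameters $(\widehat{W}, \hat{s}, \hat{p})$ are given as arrays:

```python
import numpy as np

def predict_parameters(X, y, W_hat, s_hat, p_hat):
    """MAP and Bayes-optimal estimates of a new task's regression vector
    (Definition 4.5), given learned meta-parameters (W_hat, s_hat, p_hat).

    X : (tau, d) training covariates of the new task, y : (tau,) labels,
    W_hat : (d, k), s_hat : (k,), p_hat : (k,).
    """
    tau = len(y)
    resid = y[:, None] - X @ W_hat                      # (tau, k) residuals per component
    log_L = (-(resid ** 2).sum(axis=0) / (2 * s_hat ** 2)
             - tau * np.log(s_hat) + np.log(p_hat))     # log L_i of Definition 4.5
    beta_map = W_hat[:, int(np.argmax(log_L))]          # MAP: most likely component
    w = np.exp(log_L - log_L.max())                     # numerically stable softmax weights
    beta_bayes = W_hat @ (w / w.sum())                  # posterior-mean (Bayes-optimal) estimate
    return beta_map, beta_bayes
```

Given either estimate, the label of a new point $x$ is predicted as $x^\top \hat\beta(D)$, as in Theorem 2.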
## 5. Details of the algorithm and the analyses

We explain and analyze each step in Algorithm 1. These analyses imply our main result in meta-learning, which is explicitly written in Appendix A.

### 5.1. Subspace estimation

In the following, we use a $k\text{-SVD}(\cdot, k)$ routine that outputs the top $k$ singular vectors. As $\mathbb{E}[\widehat{M}] = M := \sum_{j=1}^k p_j w_j w_j^\top$, this outputs an estimate of the subspace spanned by the true parameters. We show that as long as $t_{L1} \geq 2$, the accuracy depends only on the total number of examples, and it is sufficient to have $n_{L1} t_{L1} = \widetilde{\Omega}(d)$.

Algorithm 2: Subspace estimation
Input: data $D_{L1} = \{(x_{i,j}, y_{i,j})\}_{i \in [n_{L1}],\, j \in [t_{L1}]}$, $k \in \mathbb{N}$.
1. For all $i \in [n_{L1}]$, compute $\hat\beta^{(1)}_i \leftarrow \frac{2}{t_{L1}} \sum_{j=1}^{t_{L1}/2} y_{i,j} x_{i,j}$ and $\hat\beta^{(2)}_i \leftarrow \frac{2}{t_{L1}} \sum_{j=t_{L1}/2+1}^{t_{L1}} y_{i,j} x_{i,j}$.
2. $\widehat{M} \leftarrow \frac{1}{2 n_{L1}} \sum_{i=1}^{n_{L1}} \left(\hat\beta^{(1)}_i \hat\beta^{(2)\top}_i + \hat\beta^{(2)}_i \hat\beta^{(1)\top}_i\right)$.
3. $U \leftarrow k\text{-SVD}(\widehat{M}, k)$.

The dependency on the accuracy $\epsilon$ changes based on the ground-truth meta-parameters. In the ideal case when $W$ is an orthonormal matrix (with condition number one), the sample complexity is $\widetilde{O}\!\left(d/(p_{\min}^2 \epsilon^2)\right)$. For the worst-case $W$, it is $\widetilde{O}\!\left(d/(p_{\min}^2 \epsilon^6)\right)$.

Lemma 5.1 (Learning the subspace). Suppose Assumption 1 holds, and let $U \in \mathbb{R}^{d \times k}$ be the matrix of the top $k$ eigenvectors of $\widehat{M} \in \mathbb{R}^{d \times d}$. For any failure probability $\delta \in (0, 1)$ and accuracy $\epsilon \in (0, 1)$, if the sample size is large enough such that

$n_{L1} = \Omega\!\left(\frac{d}{t_{L1}} \min\!\left\{\epsilon^{-6} p_{\min}^{-2},\ \epsilon^{-2} \lambda_{\min}^{-2}\right\} \log^3(n_{L1} d / \delta)\right)$, and $2 \leq t_{L1} < d$,

we have

$\left\|\left(U U^\top - I\right) w_i\right\|_2 \leq \epsilon$,  (8)

for all $i \in [k]$ with probability at least $1 - \delta$, where $\lambda_{\min}$ is the smallest non-zero eigenvalue of $M := \sum_{i=1}^k p_i w_i w_i^\top$.

Time complexity: $O\!\left(\left(n_{L1}^{\omega - 1} + n_{L1} t_{L1}\right) d\right)$ for computing $\widehat{M}$, and $O(k d^2)$ for the $k$-SVD (Allen-Zhu & Li, 2016).

### 5.2. Clustering

Once we have the subspace, we can efficiently cluster any task associated with $t_H = \widetilde{\Omega}(\sqrt{k})$ samples. In the following, the matrix $H \in \mathbb{R}^{n_H \times n_H}$ estimates the distances between the parameters in the projected $k$-dimensional space. If there were no error in $U$, then $\mathbb{E}[H_{i,j}] = \Omega(\Delta^2)$ if $i$ and $j$ are from different components, and zero otherwise. Any clustering algorithm can be applied treating $H$ as a distance matrix.

Algorithm 3: Clustering and estimation
Input: data $D_H = \{(x_{i,j}, y_{i,j})\}_{i \in [n_H],\, j \in [t_H]}$, $k, L \in \mathbb{N}$ with $2L \leq t_H$, $U \in \mathbb{R}^{d \times k}$.
1. Split the $t_H$ examples of each task into $2L$ disjoint batches $B_1, \ldots, B_{2L}$ of size $t_H / (2L)$, and compute, for all $\ell \in [2L]$ and $i \in [n_H]$, $\hat\beta^{(\ell)}_i \leftarrow \frac{2L}{t_H} \sum_{j \in B_\ell} y_{i,j} x_{i,j}$.
2. For all $\ell \in [L]$ and $(i, j) \in [n_H] \times [n_H]$, compute $H^{(\ell)}_{i,j} \leftarrow \left(\hat\beta^{(\ell)}_i - \hat\beta^{(\ell)}_j\right)^\top U U^\top \left(\hat\beta^{(\ell + L)}_i - \hat\beta^{(\ell + L)}_j\right)$.
3. For all $(i, j) \in [n_H] \times [n_H]$, compute $H_{i,j} \leftarrow \mathrm{median}\left(\{H^{(\ell)}_{i,j}\}_{\ell \in [L]}\right)$.
4. Cluster $D_H$ using $H$ as a distance matrix, and return its partition $\{C_\ell\}_{\ell \in [k]}$.
5. For all $\ell \in [k]$, compute $\tilde{w}_\ell \leftarrow \left(t_H |C_\ell|\right)^{-1} \sum_{i \in C_\ell,\, j \in [t_H]} y_{i,j}\, U U^\top x_{i,j}$, $\quad \tilde{r}^2_\ell \leftarrow \left(t_H |C_\ell|\right)^{-1} \sum_{i \in C_\ell,\, j \in [t_H]} \left(y_{i,j} - x_{i,j}^\top \tilde{w}_\ell\right)^2$, $\quad \tilde{p}_\ell \leftarrow |C_\ell| / n_H$.
Output: $\left\{C_\ell, \tilde{w}_\ell, \tilde{r}^2_\ell, \tilde{p}_\ell\right\}_{\ell = 1}^{k}$.

This is inspired by (Vempala & Wang, 2004), where clustering of mixtures of Gaussians is studied. One might wonder whether it is possible to apply their clustering approach to the $\hat\beta_i$'s directly. This approach fails, as it crucially relies on the fact that $\|x - \mu\|_2 = \sqrt{k} \pm \widetilde{O}(1)$ with high probability when $x - \mu \sim \mathcal{N}(0, I_k)$. Under our linear regression setting, $\|y x - \beta\|_2$ does not concentrate. We instead propose a median of estimates, to get the desired $t_H = \widetilde{\Omega}(\sqrt{k})$ sufficient condition.

Lemma 5.2 (Clustering and initial parameter estimation). Under Assumption 1, and given an orthonormal matrix $U \in \mathbb{R}^{d \times k}$ satisfying (8) with any $\epsilon \in (0, \Delta/4)$, Algorithm 3 correctly clusters all tasks with $t_H = \Omega\!\left(\Delta^{-2}\sqrt{k}\log(n_H/\delta)\right)$ with probability at least $1 - \delta$, for $\delta \in (0, 1)$. Further, if $n_H = \Omega\!\left(\frac{k \log(k/\delta)}{t_H\, \tilde\epsilon^2\, p_{\min}}\right)$ for any $\tilde\epsilon > 0$, then with probability at least $1 - \delta$,

$\left\|U^\top\left(\tilde{w}_i - w_i\right)\right\|_2^2 \leq \tilde\epsilon$,  (10a)
$\left|\tilde{r}^2_i - r^2_i\right| \leq \frac{\tilde\epsilon}{\sqrt{k}}\, r^2_i$,  (10b)

where $r^2_i := s^2_i + \|\tilde{w}_i - w_i\|_2^2$, for all $i \in [k]$.

Time complexity: It takes $O(n_H d t_H + n_H d k)$ time to compute $\{U^\top \hat\beta^{(\ell)}_i\}_{i \in [n_H],\, \ell \in [2L]}$. Then, using fast matrix multiplication, it takes $O(n_H^2 k^{\omega - 2})$ time to compute the matrix $H$, and the single-linkage clustering algorithm takes $O(n_H^2)$ time (Sibson, 1973).
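A minimal sketch of the median-of-estimates distance statistic in Algorithm 3 is given below (hypothetical names, not the authors' code); any off-the-shelf single-linkage routine can then be run on $H$ treated as a distance matrix:

```python
import numpy as np

def distance_statistic(tasks, U, L):
    """Median-of-estimates distance matrix H for heavy tasks (sketch of Algorithm 3).

    tasks : list of (X, y) with X of shape (t_H, d); U : (d, k) estimated subspace.
    """
    n, t_H = len(tasks), len(tasks[0][1])
    batches = np.array_split(np.arange(t_H), 2 * L)           # 2L disjoint batches of indices
    # B[l, i] = (2L / t_H) * sum_{j in batch l} y_{i,j} x_{i,j}: a crude estimate of beta_i.
    B = np.stack([np.stack([(2 * L / t_H) * (X[c] * y[c, None]).sum(axis=0)
                            for X, y in tasks]) for c in batches])
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d1 = (B[:L, i] - B[:L, j]) @ U                    # projected differences, batches 1..L
            d2 = (B[L:, i] - B[L:, j]) @ U                    # independent batches L+1..2L
            H[i, j] = np.median(np.einsum('lk,lk->l', d1, d2))  # median over the L repetitions
    return H
```

The median over $L$ independent repetitions is what replaces the norm-concentration argument of (Vempala & Wang, 2004), which does not hold for the heavy-tailed single-batch estimates $y x$ of $\beta$.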
Experiments. We set $d = 8k$, $p = 1_k/k$, $s = 1_k$, and $\mathbb{P}_x$ and $\mathbb{P}_\epsilon$ to be standard Gaussian distributions. Given an estimated subspace with estimation error $0.1$, the clustering step is performed with $n_H = \max\{k^{3/2}, 256\}$ tasks for various $t_H$. The minimum $t_H$ such that the clustering accuracy is above 99% for at least a $1 - \delta$ fraction of 10 random trials is denoted by $t_{\min}(1 - \delta)$. Figure 3 and Table 3 illustrate how $t_{\min}(0.5)$ and $t_{\min}(0.9)$ depend on $k$. More experimental results are provided in Appendix E.

[Figure 3. The trends of $t_{\min}(0.9)$ and $t_{\min}(0.5)$ for various $k$ show the $t_H \propto k^{1/2}$ dependence predicted by Lemma 5.2 when $n_H = k^{3/2}$.]

### 5.3. Classification

Once we have $\{\tilde{w}_\ell\}_{\ell=1}^k$ from the clustering step, we can efficiently classify any task with $t_{L2} = \widetilde{\Omega}(\log k)$ samples; an extra $\log n_{L2}$ samples are necessary to apply the union bound. This allows us to use the light samples to refine the clusters estimated with the heavy samples. This separation allows us to achieve the desired sample complexity on light tasks ($t_{L2} = \Omega(\Delta^{-4}\log d)$, $n_{L2} t_{L2} p_{\min} = \widetilde{\Omega}(\epsilon^{-2} d)$) and heavy tasks ($t_H = \widetilde{\Omega}(\Delta^{-2}\sqrt{k})$, $n_H t_H p_{\min} = \widetilde{\Omega}(\Delta^{-2} k)$). In the following, we use a $\mathrm{LeastSquares}(\cdot)$ routine that outputs the least-squares estimate over all the examples in a cluster. Once each cluster has on the order of $d$ samples, we can accurately estimate the meta-parameters.

Algorithm 4: Classification and estimation
Input: data $D_{L2} = \{(x_{i,j}, y_{i,j})\}_{i \in [n_{L2}],\, j \in [t_{L2}]}$, and $\{C_\ell, \tilde{w}_\ell, \tilde{r}^2_\ell\}_{\ell \in [k]}$.
1. For all $i \in [n_{L2}]$, compute $h_i \leftarrow \arg\min_{\ell \in [k]} \left\{\sum_{j \in [t_{L2}]} \frac{\left(y_{i,j} - x_{i,j}^\top \tilde{w}_\ell\right)^2}{2 \tilde{r}^2_\ell} + t_{L2} \log \tilde{r}_\ell\right\}$, and update $C_{h_i} \leftarrow C_{h_i} \cup \{(x_{i,j}, y_{i,j})\}_{j=1}^{t_{L2}}$.
2. For all $\ell \in [k]$, compute $\hat{w}_\ell \leftarrow \mathrm{LeastSquares}(C_\ell)$, $\quad \hat{s}^2_\ell \leftarrow \left(t_{L2}|C_\ell| - d\right)^{-1} \sum_{i \in C_\ell,\, j \in [t_{L2}]} \left(y_{i,j} - x_{i,j}^\top \hat{w}_\ell\right)^2$, $\quad \hat{p}_\ell \leftarrow |C_\ell| / n_{L2}$.
Output: $\left\{C_\ell, \hat{w}_\ell, \hat{s}^2_\ell, \hat{p}_\ell\right\}_{\ell = 1}^{k}$.

Lemma 5.3 (Refined parameter estimation via classification). Under Assumption 1, and given estimated parameters $\tilde{w}_i, \tilde{r}_i$ satisfying $\|\tilde{w}_i - w_i\|_2 \leq \Delta/10$ and $\left(1 - \Delta^2/50\right)\tilde{r}^2_i \leq s^2_i + \|\tilde{w}_i - w_i\|_2^2 \leq \left(1 + \Delta^2/50\right)\tilde{r}^2_i$ for all $i \in [k]$, and $n_{L2}$ tasks with $t_{L2} = \Omega\!\left(\log(k n_{L2}/\delta)/\Delta^4\right)$ examples per task, Algorithm 4 correctly classifies all $n_{L2}$ tasks with probability $1 - \delta$. Further, for any $0 < \epsilon \leq 1$, if $n_{L2} = \Omega\!\left(\frac{d \log^2(k/\delta)}{t_{L2}\, p_{\min}\, \epsilon^2}\right)$, the following hold for all $i \in [k]$:

$\|\hat{w}_i - w_i\|_2 \leq \epsilon\, s_i$,  (12a)
$\left|\hat{s}^2_i - s^2_i\right| \leq \frac{\epsilon}{\sqrt{d}}\, s^2_i$,  (12b)
$\left|\hat{p}_i - p_i\right| \leq \epsilon\, \sqrt{t_{L2}/d}\;\, p_i$.  (12c)

Time complexity: Computing $\{h_i\}_{i \in [n_{L2}]}$ takes $O(n_{L2} t_{L2} d k)$ time, and the least-squares estimation takes $O\!\left(n_{L2} t_{L2} d^{\omega - 1}\right)$ time.
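The classification and refinement step of Algorithm 4 can be sketched as follows (hypothetical names; the sketch starts each cluster empty, omitting the merge with the heavy-task clusters, and assumes every cluster accumulates more than $d$ light-task examples):

```python
import numpy as np

def classify_and_refine(light_tasks, W_tilde, r_tilde):
    """Sketch of Algorithm 4: classify light tasks with the coarse estimates
    (W_tilde, r_tilde) from clustering, then refit each cluster by least squares.

    light_tasks : list of (X, y) with X of shape (t_L2, d); W_tilde : (d, k); r_tilde : (k,).
    """
    d, k = W_tilde.shape
    Xs, ys = [[] for _ in range(k)], [[] for _ in range(k)]
    for X, y in light_tasks:
        t = len(y)
        resid = y[:, None] - X @ W_tilde                          # (t, k) residuals per component
        nll = (resid ** 2).sum(axis=0) / (2 * r_tilde ** 2) + t * np.log(r_tilde)
        h = int(np.argmin(nll))                                   # likelihood-based cluster label
        Xs[h].append(X); ys[h].append(y)
    W_hat, s2_hat, p_hat = np.zeros((d, k)), np.zeros(k), np.zeros(k)
    for l in range(k):
        X_l, y_l = np.vstack(Xs[l]), np.concatenate(ys[l])        # pooled examples of cluster l
        W_hat[:, l], res, *_ = np.linalg.lstsq(X_l, y_l, rcond=None)
        s2_hat[l] = ((y_l - X_l @ W_hat[:, l]) ** 2).sum() / (len(y_l) - d)
        p_hat[l] = len(Xs[l]) / len(light_tasks)                  # empirical mixing weight
    return W_hat, s2_hat, p_hat
```

Pooling all correctly classified light tasks in a cluster is what turns many $\widetilde{\Omega}(\log k)$-example tasks into a single regression problem with on the order of $d$ examples.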
Table 1. Sample complexity of previous work on MLR to achieve a small constant error in parameter recovery for the mixed linear regression problem. We ignore constants and poly-log factors. Let $n$, $d$, and $k$ denote the number of samples, the dimension of the data points, and the number of clusters, respectively. The bounds of (Yi et al., 2016) and (Chaganty & Liang, 2013) involve $\sigma_k$, the $k$-th singular value of some moment matrix. The bound of (Sedghi et al., 2016) involves $s_{\min}$, the $k$-th singular value of the matrix of regression vectors. Note that $1/s_{\min}$ and $1/\sigma_k$ can be infinite even when $\Delta > 0$. The algorithm of (Zhong et al., 2016) requires $\Delta_{\max}/\Delta_{\min} = O(1)$ and some spectral properties.

| Reference | Noise | # Samples $n$ |
| --- | --- | --- |
| (Chaganty & Liang, 2013) | Yes | $d^6\, \mathrm{poly}(k, 1/\sigma_k)$ |
| (Yi et al., 2016) | No | $d\, \mathrm{poly}(k, 1/\Delta, 1/\sigma_k)$ |
| (Zhong et al., 2016) | No | $d\, \exp(k \log(k \log d))$ |
| (Sedghi et al., 2016) | Yes | $d^3\, \mathrm{poly}(k, 1/s_{\min})$ |
| (Li & Liang, 2018) | No | $d\, \mathrm{poly}(k/\Delta) + \exp(k^2 \log(k/\Delta))$ |
| (Chen et al., 2020) | No | $d\, \exp(\sqrt{k})\, \mathrm{poly}(1/\Delta)$ |

## 6. Related Work

Meta-learning of linear models has been studied in two contexts: mixed linear regression and multi-task learning.

Mixed Linear Regression (MLR). When each task has only one example (i.e. $t_i = 1$), the problem has been widely studied. Prior work on MLR is summarized in Table 1. We emphasize that the sample and time complexity of all previous work either has a super-polynomial dependency on $k$ (specifically, at least $\exp(\sqrt{k})$), as in (Zhong et al., 2016; Li & Liang, 2018; Chen et al., 2020), or depends on the inverse of the $k$-th singular value of some moment matrix, as in (Chaganty & Liang, 2013; Yi et al., 2016; Sedghi et al., 2016), which can be infinite. (Chen et al., 2020) cannot achieve vanishing error when there is noise.

Multi-task learning. (Baxter, 2000; Ando & Zhang, 2005; Rish et al., 2008; Orlitsky, 2005) address a similar problem of finding an unknown $k$-dimensional subspace on which all tasks can be accurately solved. The main difference is that all tasks have the same number of examples, and the performance is evaluated on the observed tasks used in training. Typical approaches use the trace norm to encourage low-rank solutions of the matrix $[\hat\beta_1, \ldots, \hat\beta_n] \in \mathbb{R}^{d \times n}$. This is posed as a convex program (Argyriou et al., 2008; Harchaoui et al., 2012; Amit et al., 2007; Pontil & Maurer, 2013). Closer to our work is the streaming setting, where $n$ tasks arrive in an online fashion and one can choose how many examples to collect for each. (Balcan et al., 2015) provides an online algorithm using a memory of size only $O(kn + kd)$, but requires some tasks to have $t_i = \Omega(dk/\epsilon^2)$ examples. In comparison, we only need $t_H = \widetilde{\Omega}(\sqrt{k})$, but use $O(d^2 + kn)$ memory. (Bullins et al., 2019) also uses only a small memory, but requires $\widetilde{\Omega}(d^2)$ total samples to perform the subspace estimation under the setting studied in this paper.

Empirical Bayes / population of parameters. A simple canonical setting of probabilistic meta-learning is when $\mathbb{P}_{\phi_i}$ is a univariate distribution (e.g. Gaussian, Bernoulli) and $\phi_i$ is the parameter of the distribution (e.g. Gaussian mean, success probability). Several related questions have been studied. In some cases, one might be interested in just learning the prior distribution $\mathbb{P}_\theta(\phi)$ or the set of $\phi_i$'s. For example, if we assume each student's score $x_i$ on one particular exam is a binomial random variable with mean $\phi_i$ (the true score), then given the scores of the students in a class, an ETS statistician (Lord, 1969) might want to learn the distribution of the true scores $\phi_i$. Surprisingly, the minimax rate for estimating the prior distribution $\mathbb{P}_\theta(\phi)$ was not known until very recently (Tian et al., 2017; Vinayak et al., 2019), even in the most basic setting where $\mathbb{P}_{\phi_i}(x)$ is binomial. In other cases, similar to the goal of meta-learning, one might want to accurately estimate the parameter $\phi_{\mathrm{new}}$ of a new task given the new data $x_{\mathrm{new}}$, perhaps by leveraging an estimate of the prior $\mathbb{P}_\theta(\phi)$. This has been studied for decades under the empirical Bayes framework in statistics (see, e.g., the book by Efron (2012) for an introduction to the field).

## 7. Discussion

We investigate how we can meta-learn when we have multiple tasks, each with only a small number of labelled examples. This is also known as the few-shot supervised learning setting. When each task is a linear regression, we propose a novel spectral approach and show that we can leverage past experience on small-data tasks to accurately learn the meta-parameters and predict on new tasks. When each task is instead a logistic regression coming from a mixture model, our algorithm can be applied seamlessly; however, the notion of separation $\Delta = \min_{i \neq j} \|w_i - w_j\|_2$ then no longer captures the statistical complexity of the problem.
Identifying the appropriate notion of complexity of the ground-truth meta-parameters is an interesting research question.

The subspace estimation algorithm requires a total of $\widetilde{\Omega}(dk^2)$ examples. It is worth understanding whether this is also necessary. Compared to online algorithms for multi-task learning, whose focus is on using small memory, our approach achieves a better sample complexity at the expense of requiring a larger memory, scaling as $O(d^2 + kn)$; this is larger than the $O(kd + kn)$ memory required by (Balcan et al., 2015). An interesting research direction is to achieve the sample complexity of our approach with only $O(kd + kn)$ memory.

Handling the setting where $\mathbb{P}_x$ has different covariances in different tasks is a challenging problem. There does not seem to exist an unbiased estimator for $W$. Nevertheless, (Li & Liang, 2018) study the $t = 1$ case in this setting and come up with an exponential-time algorithm. Studying this general setting and coming up with a polynomial-time algorithm for meta-learning in a data-constrained setting is an interesting direction.

Our clustering algorithm requires the existence of medium-data tasks with $t_H = \Omega(\sqrt{k})$ examples per task. It is worth investigating whether there exists an algorithm with polynomial time and sample complexity that learns with $t_H = o(\sqrt{k})$. We conjecture that, with the techniques developed in the robust clustering literature (Diakonikolas et al., 2018; Hopkins & Li, 2018; Kothari et al., 2018), it is possible to learn with $t_H = o(\sqrt{k})$ at the expense of a larger $n_H$ and higher computational complexity. From a lower-bound perspective, it is worth understanding the information-theoretic trade-off between $t_H$ and $n_H$ when $t_H = o(\sqrt{k})$.

## 8. Acknowledgement

Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, and the NSF Awards CCF-1637360, CCF-1703574, and CCF-1740551. Sewoong Oh acknowledges funding from the Google Faculty Research Award, Intel Future Wireless Systems Research Award, and NSF awards CCF-1927712, CCF-1705007, IIS-1929955, and CNS-2002664.

## References

Allen-Zhu, Z. and Li, Y. LazySVD: Even faster SVD decomposition yet without agonizing pain. In NIPS. arXiv:1607.03463, 2016.

Amit, Y., Fink, M., Srebro, N., and Ullman, S. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pp. 17-24, 2007.

Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817-1853, 2005.

Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pp. 191-210, 2015.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In ICLR. arXiv preprint arXiv:1805.08136, 2019.

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 12-58, 2014.

Bullins, B., Hazan, E., Kalai, A., and Livni, R.
Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pp. 235-246, 2019.

Carlin, B. P. and Louis, T. A. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC, 2010.

Chaganty, A. T. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), pp. 1040-1048, 2013.

Chen, S., Li, J., and Song, Z. Learning mixtures of linear regressions in subexponential time via Fourier moments. In STOC. https://arxiv.org/pdf/1912.07629.pdf, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Diakonikolas, I., Kane, D. M., and Stewart, A. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1047-1060, 2018.

Efron, B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press, 2012.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1126-1135, 2017.

Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9516-9527, 2018.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., and Malick, J. Large-scale image classification with trace-norm regularization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3386-3393. IEEE, 2012.

Harrison, J., Sharma, A., and Pavone, M. Meta-learning priors for efficient online Bayesian regression. arXiv preprint arXiv:1807.08912, 2018.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Hopkins, S. B. and Li, J. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1021-1034, 2018.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory, pp. 9.1-9.24, 2012.

Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In NeurIPS. arXiv preprint arXiv:1806.03836, 2018.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

Kong, W. and Valiant, G. Estimating learnability in the sublinear data regime. In Advances in Neural Information Processing Systems, pp. 5455-5464, 2018.

Kong, W., Valiant, G., and Brunskill, E. Sublinear optimal policy value estimation in contextual bandits. arXiv preprint arXiv:1912.06111, 2019.

Kothari, P. K., Steinhardt, J., and Steurer, D. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1035-1046, 2018.

Li, Y. and Liang, Y. Learning mixtures of linear regressions with nearly optimal complexity. In COLT. arXiv preprint arXiv:1802.07895, 2018.
Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Lord, F. M. Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem). Psychometrika, 34(3):259-299, 1969.

Oreshkin, B., López, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721-731, 2018.

Orlitsky, A. Supervised dimensionality reduction using mixture models. In Proceedings of the 22nd International Conference on Machine Learning, pp. 768-775, 2005.

Pontil, M. and Maurer, A. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pp. 55-76, 2013.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), pp. 113-124, 2019.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.

Rish, I., Grabarnik, G., Cecchi, G., Pereira, F., and Gordon, G. J. Closed-form supervised dimensionality reduction with generalized linear models. In Proceedings of the 25th International Conference on Machine Learning, pp. 832-839, 2008.

Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Sedghi, H., Janzamin, M., and Anandkumar, A. Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics (AISTATS), pp. 1223-1231, 2016.

Sibson, R. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30-34, 1973.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 2012.

Tian, K., Kong, W., and Valiant, G. Learning populations of parameters. In Advances in Neural Information Processing Systems, pp. 5778-5787, 2017.

Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., and Larochelle, H. Meta-Dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.

Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1-230, 2015.

Vempala, S. and Wang, G. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841-860, 2004.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Vinayak, R. K., Kong, W., Valiant, G., and Kakade, S. M. Maximum likelihood estimation for learning populations of parameters. arXiv preprint arXiv:1902.04553, 2019.

Wang, Y.-X., Ramanan, D., and Hebert, M. Learning to model the tail. In Advances in Neural Information Processing Systems, pp. 7029-7039, 2017.

Yi, X., Caramanis, C., and Sanghavi, S. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016.

Zhong, K., Jain, P., and Dhillon, I. S.
Mixed linear regression with multiple components. In Advances in Neural Information Processing Systems (NIPS), pp. 2190-2198, 2016.

Zhou, F., Wu, B., and Li, Z. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.

Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In International Conference on Machine Learning (ICML), pp. 7693-7702, 2019.