# subspace_learning_for_effective_metalearning__c3e72564.pdf

Subspace Learning for Effective Meta-Learning

Weisen Jiang 1 2 James T. Kwok 2 Yu Zhang 1 3

Meta-learning aims to extract meta-knowledge from historical tasks to accelerate learning on new tasks. Typical meta-learning algorithms like MAML learn a globally-shared meta-model for all tasks. However, when the task environments are complex, task model parameters are diverse and a common meta-model is insufﬁcient to capture all the meta-knowledge. To address this challenge, in this paper, task model parameters are structured into multiple subspaces, and each subspace represents one type of meta-knowledge. We propose an algorithm to learn the meta-parameters (i.e., subspace bases). We theoretically study the generalization properties of the learned subspaces. Experiments on regression and classiﬁcation metalearning datasets verify the effectiveness of the proposed algorithm.

1. Introduction

Humans are capable of learning new tasks from few trials by taking advantage of prior experiences. However, the state-of-the-art performance of deep networks heavily relies on the availability of large amounts of labeled samples. To improve data efﬁciency, meta-learning (or learning-tolearn) (Bengio et al., 1991; Thrun & Pratt, 1998) seeks to design algorithms that extract meta-knowledge from historical tasks to accelerate learning on unseen tasks. Metalearning has been widely used for few-shot learning (Finn et al., 2017; Wang et al., 2020b), neural architecture search (Zoph & Le, 2017; Liu et al., 2018), hyperparameter optimization (Maclaurin et al., 2015; Franceschi et al., 2018), reinforcement learning (Nagabandi et al., 2018; Rakelly et al., 2019), and natural language processing (Gu et al.,

1Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology 2Department of Computer Science and Engineering, Hong Kong University of Science and Technology 3Peng Cheng Laboratory. Correspondence to: Yu Zhang <yu.zhang.ust@gmail.com>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

2018; Obamuyide & Vlachos, 2019).

Typical meta-learning algorithms (Finn et al., 2017; Denevi et al., 2019; Rajeswaran et al., 2019; Zhou et al., 2019) learn a globally-shared meta-model for all tasks. For example, the Model-Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017) learns a meta-initialization such that a good model for an unseen task can be ﬁne-tuned from limited samples by a few gradient updates. However, when the task environments are heterogeneous, task model parameters are diverse and a single meta-model may not be sufﬁcient to capture all the meta-knowledge.

To tackle this issue, a variety of methods have been proposed to learn structured meta-knowledge by exploring the task structure (Jerfel et al., 2019; Yao et al., 2019; 2020; Wang et al., 2020a; Zhou et al., 2021a; Kong et al., 2020; Tripuraneni et al., 2021). For example, Jerfel et al. (2019) formulate the task distribution as a mixture of hierarchical Bayesian models, and update the components (i.e., initializations) using an Expectation Maximization procedure. TSA-MAML (Zhou et al., 2021a) ﬁrst trains task models using vanilla MAML. Tasks are grouped into clusters by k-means clustering, and cluster centroids form group-speciﬁc initializations.

Alternatively, task model parameters can be formulated into a subspace. In the linear regression setting where task model vectors are sampled from a low-dimensional subspace, recent attempts (Kong et al., 2020; Tripuraneni et al., 2021) use a moment-based estimator to recover the subspace based on the property that the column space of the sample covariance matrix recovers the underlying subspace. However, for nonlinear models such as deep networks, this nice property no longer holds and the moment-based methods cannot be generalized.

In this paper, we propose a model-agnostic algorithm called MUSML (MUltiple Subspaces for Meta-Learning). Each subspace represents one type of meta-knowledge, and subspace bases are treated as meta-parameters. For each task, the base learner builds a task model from each subspace. The meta-learner then updates the subspace bases by minimizing a weighted validation loss of the task models. We theoretically establish upper bounds on the population risk, empirical risk and generalization gap. All these bounds depend on the complexity of the subspace mixture (number of component subspaces and subspace dimensionality). Ex-

Subspace Learning for Effective Meta-Learning

periments on various datasets verify the effectiveness of the proposed MUSML.

Our major contributions are four-fold: (i) We formulate task model parameters into a subspace mixture and propose a novel algorithm to learn the subspace bases. (ii) The proposed MUSML is model-agnostic, and can be used on linear and nonlinear models. (iii) We theoretically study the generalization properties of the learned subspaces. (iv) We perform extensive experiments on synthetic and real-world datasets. Results on the synthetic dataset conﬁrm that MUSML is able to discover the underlying subspaces of task model parameters. Results on the real world datasets demonstrate superiority of MUSML over the state-of-the-arts.

Notations. Vectors (e.g., x) and matrices (e.g., X) are denoted by lowercase and uppercase boldface letters, respectively. For a vector x, its ℓ2-norm is x . For a matrix X, its spectral norm is X and its Frobenius norm is X F. Subspaces (e.g., S) are denoted by blackboard boldface letters. U(a, b) is the uniform distribution over the interval [a, b]. N(µ; σ2) is the univariate normal distribution with mean µ and variance σ2, while N(µ; Σ) is a multivariate normal distribution with mean vector µ and covariance matrix Σ. σmin(X) is the smallest singular value of matrix X.

2. Related Work

Meta-Learning. Popular meta-learning algorithms can be roughly divided into three categories: metric-based (Vinyals et al., 2016; Snell et al., 2017; Bertinetto et al., 2018; Oreshkin et al., 2018; Lee et al., 2019), memory-based (Santoro et al., 2016; Munkhdalai & Yu, 2017), and optimizationbased. In this paper, we focus on the last category. Representative algorithms like MAML (Finn et al., 2017) and its variants (such as REPTILE (Nichol et al., 2018), ANIL (Raghu et al., 2020), Proto-MAML (Triantaﬁllou et al., 2020), and BMG (Flennerhag et al., 2022)) ﬁrst learn a meta-initialization from historical tasks. For a new task with limited examples, a good model is then obtained by a few gradient updates. Alternatively, other optimization-based algorithms such as i MAML (Rajeswaran et al., 2019), (Denevi et al., 2018; 2019), Meta-Minibatch Prox (Zhou et al., 2019), and Meta Prox (Jiang et al., 2021)) learn a meta-regularizer to bias risk minimization in the base learners.

Optimization-based meta-learning algorithms are effective when the task models are close together. However, realworld environments are complex, and the task model parameters are usually diverse. A globally-shared meta-model may not be sufﬁcient for all tasks to achieve fast adaption.

To tackle this challenge, Vuorio et al. (2019) and Yao et al. (2019; 2020) build task-speciﬁc initializations by incorporating task representation. Denevi et al. (2020) propose to

learn a meta-regularization conditioned on the task s side information. Since discriminative task representations and additional side information may not be easy to obtain, Jerfel et al. (2019) and Zhou et al. (2021a) cluster tasks into multiple groups and learn group-speciﬁc initializations. Recently, Kong et al. (2020); Saunshi et al. (2020); Tripuraneni et al. (2021) structure task model parameters to a subspace. However, this is limited to the linear regression setting and cannot be extended to nonlinear models such as deep networks.

Learning neural network subspaces. Recent studies (Li et al., 2018; Izmailov et al., 2020; Gressmann et al., 2020; Wortsman et al., 2021) show that deep network can be optimized in a low-dimensional parameter subspace. Wortsman et al. (2021) empirically demonstrates that a learned subspace contains diverse solutions that can be ensembled to boost accuracy. In meta-learning, learning of subspaces is more challenging since the optimization problem is bilevel and the learned subspaces need to generalize well on unseen tasks with limited samples.

3. Methodology

Let p(τ) be a task distribution. Each task τ p(τ) corresponds to a data distribution over (x, y), with input x and label y. In practice, this data distribution is only accessible via a training set Dtr τ = {(xi, yi) : i = 1, . . . , Ntr} and a validation set Dvl τ = {(xi, yi) : i = 1, . . . , Nvl} of i.i.d. samples. Let f( ; w) be a model parameterized by w Rd, and L(D; w) = 1 |D| P

(x,y) D ℓ(f(x; w), y) be its loss on D, where ℓ( , ) is a loss function (e.g., cross entropy or mean squared error). In meta-learning, a collection of tasks sampled from p(τ) are used to learn the meta-parameter. The base learner takes a task τ and meta-parameter to construct the model parameter wτ. The meta-learner then minimizes the validation loss Eτ p(τ)L(Dvl τ ; wτ) w.r.t. the meta-parameter. In this paper, the task model parameters wτ s are assumed to form a subspace mixture, and the component subspace bases are treated as meta-parameters.

3.1. Linear Regression Tasks

We ﬁrst focus on the linear setting, where the model is linear and all task parameters lie in one single (linear) subspace. The following Proposition shows that the underlying subspace can be recovered using a moment-based estimator.

Proposition 3.1. (Kong et al., 2020; Tripuraneni et al., 2021). Assume that p(τ) is a distribution of linear regression tasks. Each task τ is associated with a w τ Rd, and its samples are generated as y = x w τ + ξ, where x N(0, I) and ξ N(0, σ2 ξ) is the noise. Then, Eτ p(τ)E(x,y) τ,(x ,y ) τyy xx = Eτ p(τ)w τw τ .

Proposition 3.1 shows that ˆS 1 |B| P

τ B yτy τxτx τ is

Subspace Learning for Effective Meta-Learning

an unbiased estimator of Eτ p(τ)w τw τ , where (xτ, yτ) and (x τ, y τ) are two samples drawn from τ, and B is a collection of tasks. Hence, the column space of ˆS recovers the column space of Eτ p(τ)w τw τ (i.e., the underlying subspace) when the number of tasks is sufﬁcient.

3.2. Proposed Method

While Proposition 3.1 can be used to recover the column space in a linear meta-learning setting, extension to the nonlinear setting (such as deep networks) is difﬁcult. To address this problem, we propose a model-agnostic algorithm called MUSML (MUltiple Subspaces for Meta-Learning). We assume that the model parameters wτ s lie in K subspaces {S1, . . . , SK}, which can be seen as an approximation to a nonlinear manifold. For simplicity, we assume that all K subspaces have the same dimensionality m (this can be easily extended to the case where the subspaces have different dimensionalities). Let Sk Rd m be a basis of Sk. {S1, . . . , SK} are then the meta-parameters to be learned.

The proposed procedure is shown in Algorithm 1. Given a task τ, the base learner searches for the model parameter wτ over all subspaces with ﬁxed Sk (steps 4-11). In each subspace Sk, we search for the best linear combination v τ,k of the subspace s basis to form wτ as

v τ,k = arg min vτ Rm L(Dtr τ ; Skvτ). (1)

Skv τ,k is then the task model parameter corresponding to the kth subspace. When ℓ(f(x; w), y) is convex in w, it is easy to verify that L(Dtr τ ; Skvτ) is also convex in vτ. Hence, problem (1) can be solved as a convex program (Boyd & Vandenberghe, 2004). However, for nonlinear models such as deep networks, the loss function in (1) is nonconvex, and thus ﬁnding v τ,k is computationally intractable. Instead, we seek an approximate minimizer vτ,k by performing Tin gradient descent steps from an initialization v(0) τ,k, i.e., v(t +1) τ,k = v(t ) τ,k α v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k )

(where α > 0 is the step size), and vτ,k v(Tin) τ,k .

At meta-training, one can assign τ to the subspace with the best training set performance. However, this is inefﬁcient for learning meta-parameters since only one subspace is updated at each step. Similar to DARTS (Liu et al., 2018), we relax the categorical choice to a softmax selection over all candidate subspaces. The relaxed operation is differentiable and all subspace bases can then be updated simultaneously, which accelerates learning. Let oτ,k = L(Dtr τ ; Skvτ,k) be the training loss for task τ when the kth subspace (where k = 1, . . . , K) is used to construct its task model. The metalearner updates {S1, . . . , SK} by performing one gradient

Algorithm 1 MUltiple Subspaces for Meta-Learning (MUSML). Require: stepsize α, {ηt}; number of inner gradient steps Tin, number of subspaces K, subspace dimension m, temperature {γt}; initialization v(0); 1: for t = 0, 1, . . . , T 1 do 2: sample a task τ with Dtr τ and Dvl τ ; 3: base learner: 4: for k = 1, . . . , K do 5: initialize v(0) τ,k = v(0); 6: for t = 0, 1, . . . , Tin 1 do

7: v(t +1) τ,k = v(t ) τ,k α v(t ) τ,k L(Dtr τ ; Sk,tv(t ) τ,k );

8: end for 9: vτ,k v(Tin) τ,k ; 10: oτ,k = L(Dtr τ ; Sk,tvτ,k); 11: end for 12: meta-learner: 13: Lvl = PK k=1 exp( oτ,k/γt) PK k =1 exp( oτ,k /γt)L(Dvl τ ; Sk,tvτ,k);

14: {S1,t+1, . . . , Sk,t+1} = {S1,t, . . . , Sk,t} ηt {S1,t,...,Sk,t}Lvl; 15: end for 16: Return S1,T , . . . , SK,T .

update on the weighted validation loss (steps 13-14):

exp( oτ,k/γ) PK k =1 exp( oτ,k /γ) L(Dvl τ ; Skvτ,k), (2)

where γ > 0 is the temperature. When γ is close to 0, the softmax selection becomes one-hot; whereas when γ increases to , the selection becomes uniform. In practice, we start at a high temperature and anneal to a small but nonzero temperature as in (Jang et al., 2016; Chen et al., 2020; Zhou et al., 2021b). Note that {oτ,k : k = 1, . . . , K} depend on the bases and {S1,...,Sk}oτ,k can be computed by auto-differentiation.

At meta-testing, for each testing task τ , we assign τ to the subspace with the lowest training loss, i.e., wτ = Skτ vτ ,kτ , where kτ arg min1 k K L(Dtr τ ; Skvτ ,k) is the chosen subspace index.

3.3. Analysis

In this section, we study the generalization performance of the learned subspace bases S {S1, . . . , SK} at metatesting. The following assumptions on smoothness and compactness are standard in meta-learning (Bao et al., 2021; Grazzi et al., 2020; Fallah et al., 2020) and bilevel optimization (Franceschi et al., 2018; Bao et al., 2021). The boundedness assumption on the loss function is widely used in analyzing meta-learning algorithms (Maurer & Jaakkola,

Subspace Learning for Effective Meta-Learning

2005; Pentina & Lampert, 2014; Amit & Meir, 2018) and traditional machine learning algorithms (Bousquet & Elisseeff, 2002).

Assumption 3.2. (i) ℓ(f(x; w), y) and wℓ(f(x; w), y) are ϱ-Lipschitz and β-Lipschitz in w, respectively;1 (ii) {vτ,k : τ p(τ), k = 1, . . . , K} and column vectors of Sk (k = 1, . . . , K) are in a compact set, and their ℓ2-norms are upper bounded by a constant ρ > 0. (iii) ℓ( , ) is upper bounded by a constant ν > 0.

Let R(S) Eτ EDtr τ E(x,y) τ ℓ(f(x; Skτ vτ ,kτ ), y)

be the expected population risk, and ˆR(S) Eτ EDtr τ L(Dtr τ ; Skτ vτ ,kτ ) be the expected empirical risk averaged over all tasks. The following Theorem characterizes the generalization gap, i.e., the gap between the population and empirical risks. All proofs are in Appendix B.

Theorem 3.3. With Assumption 3.2, we have

R(S) ˆR(S)+K

ν2 + 12ϱν(1 + mαδ)Tin

where δ = βρ2 > 0.

The proof is based on the connection between generalization and stability (Bousquet & Elisseeff, 2002). The dependence on Ntr in (3) agrees with their Theorem 11, as the stability constant in Algorithm 1 is of the order O(1/Ntr) (Lemma B.2 in Appendix B). From (3), one can observe that increasing the subspace complexity (i.e., m or K) increases the upper bound of R(S) ˆR(S).

Next, we study the expected empirical risk ˆR(S), which requires the following assumption.

Assumption 3.4. (i) w L(Dtr τ ; w) 2 λ(L(Dtr τ ; w) L ) 0 for some λ > 0 and L 0. (ii) min1 k K σmin(Sk) c1(1 c2 p

m/d), where c1, c2 > 0.

The ﬁrst assumption is called the Polyak-Łojasiewicz condition (Polyak, 1963; Lojasiewicz, 1963), and has been used in the non-convex optimization literature (Karimi et al., 2016; Liu et al., 2022). The second assumption on the smallest singular value is a property of random matrices with high probability (Rudelson & Vershynin, 2009).

Theorem 3.5. With Assumptions 3.2 and 3.4, if α < min( 1 mβρ2 , 1

λ, 2 λc2 1 ), we have

4L2 +2ν2 ( m κ1)( m+κ2)φ 2Tin

+ ν K + ν , (4)

1In other words, ℓ(f(x; w), y) ℓ(f(x; w ), y) ϱ w w , and wℓ(f(x; w), y) wℓ(f(x; w ), y) β w w .

where κ1 = (

2/(c2 1αλ) + 1), κ2 = (

2/(c2 1αλ) 1), and φ = c1c2 p

αλ/(2d) are all positive.

The above Theorem shows that increasing K reduces the upper bound of ˆR(S). This is intuitive as we choose the subspace with the lowest training loss during meta-testing.

Combining Theorems 3.3 and 3.5, we obtain the following Corollary. It indicates that the (upper bound on) population risk R(S) may not always decrease as K increases, due to overﬁtting. Corollary 3.6. With Assumptions 3.2 and 3.4, if α < min( 1 mβρ2 , 1

λ, 2 λc2 1 ), we have

ν2 + 12ϱν(1 + mαδ)Tin

2Ntr + ν K + ν

4L2 +2ν2 ( m κ1)( m+κ2)φ 2Tin. (5)

Let w τ arg minwτ E(x,y) τ ℓ(f(x; wτ ), y) be the optimal task model for task τ , and R Eτ E(x,y) τ ℓ(f(x; w τ ), y) be the minimum expected loss averaged over all tasks. The following Theorem provides an upper bound on the expected excess risk (Zhou et al., 2021a) R(S) R , which compares the performance of the learned task model with that of the optimal model. Theorem 3.7. With Assumption 3.2, we have

R(S) R ρ m Eτ EDtr τ vτ ,kτ v τ ,kτ

+ϱEτ EDtr τ dist(w τ , Skτ )+K

ν2+12ϱν(1+mαδ)Tin

where dist(w τ , Skτ ) minw Skτ w w τ is the distance between w τ and Skτ .

From Theorem 3.7, R(S) R is upper-bounded by three terms: (i) The ﬁrst term measures the distance between the approximate minimizer vτ ,kτ and exact minimizer v τ ,kτ ; (ii) The second term arises from the approximation error of w τ using the learned subspaces; (iii) The third term depends on the complexity of subspaces (i.e., m and K). For the centroid-based clustering method in (Zhou et al., 2021a), the upper bound of its expected excess risk contains a term Eτ EDtr τ ωk τ w τ 2, where ωk τ is the centroid of the cluster that τ is assigned to. The distance ωk τ w τ 2 plays the same role as the term dist(w τ , Skτ ) in Theorem 3.7, which measures how far the optimal model w τ is away from the subspaces or clusters.

4. Experiments

In this section, we perform extensive regression and classiﬁcation experiments to demonstrate effectiveness of the

Subspace Learning for Effective Meta-Learning

proposed method.

4.1. Few-shot Regression on Synthetic Data In this experiment, we use a synthetic 1-dimensional data set to examine whether MUSML can discover subspaces that the task model parameters lie in. We use a 5-shot regression setting, with 14, 000 meta-training, 2, 000 metavalidation, and 6, 000 meta-testing tasks. The model for task τ is f(x; wτ) = exp(0.1wτ,1x) + wτ,2| sin(x)|, in which wτ = [wτ,1; wτ,2] is randomly sampled from one of the two subspaces: (i) Line-A: wτ = S1aτ + 0.1ξτ, where S1 = [1; 1], aτ U(1, 5), and ξτ N(0; I); and (ii) Line B: wτ = S2aτ + 0.1ξτ, where S2 = [ 1; 1], aτ U(0, 2), and ξτ N(0; I). The samples of task τ are generated as y = f(x; wτ) + 0.05ξ, where x U( 5, 5) and ξ N(0, 1). The experiment is repeated 10 times with different seeds. Implementation details are in Appendix A.1.1.

The proposed MUSML (with K = 2, m = 1) is compared with the following meta-learning baselines: (i) MAML (Finn et al., 2017), (ii) BMG (Flennerhag et al., 2022), which uses target bootstrap, and structured meta-learning algorithms including (iii) Dirichlet process mixture model (DPMM) (Jerfel et al., 2019), (iv) hierarchically structured meta-learning (HSML) (Yao et al., 2019), (v) automated relational meta-learning (ARML) (Yao et al., 2020) using a graph structure, and (vi) task similarity aware meta-learning (TSA-MAML) (Zhou et al., 2021a) with different numbers of clusters. We use these baselines ofﬁcial implementations (except for DPMM and BMG whose implementations are not publicly available). For performance evaluation, the mean squared error (MSE) on the meta-testing set is used.

Results. Table 1 shows the meta-testing MSE. As can be seen, structured meta-learning methods (DPMM, HSML, ARML, TSA-MAML, and MUSML) are signiﬁcantly better than methods with a globally-shared meta-model (MAML and BMG). In particular, MUSML performs the best. Furthermore, simply increasing the number of clusters in TSAMAML fails to beat MUSML.

Table 1. Meta-testing MSE (with standard deviation) of 5-shot regression on synthetic data. For TSA-MAML, the number in brackets is the number of clusters used.

MAML (Finn et al., 2017) 0.74 0.03 BMG (Flennerhag et al., 2022) 0.67 0.03

DPMM (Jerfel et al., 2019) 0.56 0.09 HSML (Yao et al., 2019) 0.49 0.10 ARML (Yao et al., 2020) 0.60 0.07 TSA-MAML(2) (Zhou et al., 2021a) 0.58 0.10 TSA-MAML(10) (Zhou et al., 2021a) 0.24 0.09 TSA-MAML(20) (Zhou et al., 2021a) 0.12 0.10 TSA-MAML(40) (Zhou et al., 2021a) 0.14 0.09 TSA-MAML(80) (Zhou et al., 2021a) 0.13 0.08 MUSML 0.07 0.01

(a) Tasks from Line-A.

(b) Tasks from Line-B. Figure 1. Visualization of task model parameters.

Figure 1 visualizes the task model parameters obtained by TSA-MAML(2) and MUSML on 100 randomly sampled meta-testing tasks (50 per subspace). As can be seen, MUSML successfully discovers the underlying subspaces, while the centroid-based clustering method TSA-MAML does not. Table 11 of Appendix A.1.2 also shows that MUSML is more accurate in estimating the task model parameters.

4.2. Few-shot Regression on Pose Data

While the synthetic data used in the previous experiment is tailored for the proposed subspace model, in this section, we perform experiments on a real-world pose prediction dataset from (Yin et al., 2020). This is created based on the Pascal 3D data (Xiang et al., 2014). Each object contains 100 samples, where input x is a 128 128 grey-scale image and output y is its orientation relative to a ﬁxed canonical pose. Following (Yin et al., 2020), we adopt a 15-shot regression setting and randomly select 50 objects for metatraining, 15 for meta-validation, and 15 for meta-testing. The experiment is repeated 15 times with different seeds.

The MR-MAML regularization (Yin et al., 2020) is used on all the methods except the vanilla MAML. MUSML uses the same encoder-decoder network in (Yin et al., 2020) as the model f(x; w). Hyperparameters K and m, as well as the number of clusters in TSA-MAML, are chosen from 1 to 5 using the meta-validation set. For performance evaluation, the MSE on the meta-testing set is used.

Table 2 shows the meta-testing MSE. As can be seen, MUSML is again better than the other baselines, conﬁrming the effectiveness of the learned subspaces.

4.3. Few-shot Classiﬁcation

Setup. In this experiment, we use three meta-datasets: (i) Meta-Dataset-BTAF, proposed in (Yao et al., 2019), which consists of four image classiﬁcation datasets: (a) Bird; (b) Texture; (c) Aircraft; and (d) Fungi. Sample images are shown in Figure 2. (ii) Meta-Dataset-ABF, proposed in (Zhou et al., 2021a), which consists of Aircraft, Bird,

Subspace Learning for Effective Meta-Learning

Table 2. Meta-testing MSE (with standard deviation) of 15-shot regression on Pose. Results on MAML and MR-MAML are from (Yin et al., 2020).

MAML (Finn et al., 2017) 5.39 1.31 MR-MAML (Yin et al., 2020) 2.26 0.09 BMG (Flennerhag et al., 2022) 2.16 0.15

DPMM (Jerfel et al., 2019) 1.99 0.08 HSML (Yao et al., 2019) 2.04 0.13 ARML (Yao et al., 2020) 2.21 0.15 TSA-MAML (Zhou et al., 2021a) 1.96 0.07 MUSML 1.83 0.05

Figure 2. Some random images from the meta-testing set of Meta Dataset-BTAF (Top to bottom: Bird, Texture, Aircraft, and Fungi).

and Fungi. (iii) Meta-Dataset-CIO, which consists of three widely-used few-shot datasets: CIFAR-FS (Bertinetto et al., 2018), mini-Image Net (Vinyals et al., 2016), and Omniglot (Lake et al., 2015). We use the meta-training/metavalidation/meta-testing splits in (Yao et al., 2020; Zhou et al., 2021a; Lake et al., 2015). A summary of the datasets is in Table 3.

As for the network architecture, we use the standard Conv4 backbone (Finn et al., 2017; Yao et al., 2020; Zhou et al., 2021a), and a simple prototype classiﬁer with cosine similarity on top (Snell et al., 2017; Gidaris & Komodakis, 2018) as f(x; w). Hyperparameters K and m are chosen from 1 to 5 on the meta-validation set. Implementation details are in Appendix A.2.

MUSML is compared with the following state-of-the-arts in the 5-way 5-shot and 5-way 1-shot settings: (i) meta-learning algorithms with a globally-shared meta-model including MAML, Proto Net, ANIL (Raghu et al., 2020), and BMG; (ii) structured meta-learning algorithms including DPMM, HSML, ARML, TSA-MAML and its variant using Proto Net as the base learner (denoted TSA-Proto Net). The number of clusters in TSA-MAML and TSA-Proto Net are tuned from 1 to 5 on the meta-validation set. For performance evaluation, the classiﬁcation accuracy on the meta-testing set is used. The experiment is repeated 5 times with different seeds.

Table 3. Statistics of the datasets.

#classes (meta-training/meta-validation/meta-testing)

Bird 64/16/20 Texture 30/7/10 Aircraft 64/16/20 Fungi 64/16/20

CIFAR-FS 64/16/20 mini-Image Net 64/16/20 Omniglot 71/15/16

4.3.1. Meta-Dataset-BTAF

Table 4 shows the 5-shot results. As can be seen, MUSML is more accurate than both structured and unstructured metalearning methods, demonstrating the beneﬁt of structuring task model parameters into subspaces. Figure 3 shows the assignment of tasks to the learned subspaces in MUSML. As can be seen, meta-training tasks from the same dataset are always assigned to the same subspace, demonstrating that MUSML can discover the task structure from meta-training tasks. Though the meta-validation and meta-testing classes are not seen during meta-training, most of the corresponding tasks are still assigned to the correct subspaces. The assignment for Texture is slightly worse, as the Texture and Fungi images are more similar to each other (Figure 2).

Table 5 shows the 1-shot results. MUSML, while still the best overall, has a smaller improvement than in the 5-shot setting. This suggests that having more training samples is beneﬁcial for the base learner to choose a proper subspace. The assignment of tasks to the learned subspaces is shown in Figure 9 of Appendix A.3.

1 2 3 4 Subspace index

100% 0% 0% 0%

0% 100% 0% 0%

0% 0% 100% 0%

0% 0% 0% 100%

(a) Meta-training.

1 2 3 4 Subspace index

100% 0% 0% 0%

13% 49% 13% 26%

0% 0% 100% 0%

6% 3% 3% 89%

(b) Meta-validation.

1 2 3 4 Subspace index

96% 1% 3% 0%

7% 64% 4% 24%

0% 0% 100% 0%

10% 3% 4% 83%

(c) Meta-testing. Figure 3. Task assignment to the learned subspaces in 5-way 5shot setting on Meta-Dataset-BTAF (the number of subspaces K selected by the meta-validation set is 4). Darker color indicates higher percentage.

4.3.2. Meta-Dataset-ABF AND Meta-Dataset-CIO

Tables 6 and 7 show the results on Meta-Dataset-ABF and Meta-Dataset-CIO, respectively. Here, we only consider the 5-shot setting, which is more useful for subspace learning. As can be seen, MUSML consistently outperforms centroid-based clustering methods (DPMM, TSA-

Subspace Learning for Effective Meta-Learning

Table 4. 5-way 5-shot accuracy (with 95% conﬁdence interval) on Meta-Dataset-BTAF. Results marked with are from (Yao et al., 2020).

Bird Texture Aircraft Fungi average

MAML (Finn et al., 2017) 68.52 0.73 44.56 0.68 66.18 0.71 51.85 0.85 57.78 Proto Net (Snell et al., 2017) 71.48 0.72 50.36 0.67 71.67 0.69 55.68 0.82 62.29 ANIL (Raghu et al., 2020) 70.67 0.72 44.67 0.95 66.05 1.07 52.89 0.30 58.57 BMG (Flennerhag et al., 2022) 71.56 0.76 49.44 0.73 66.83 0.79 52.56 0.89 60.10

DPMM (Jerfel et al., 2019) 72.22 0.70 49.32 0.68 73.55 0.69 56.82 0.81 63.00 TSA-MAML (Zhou et al., 2021a) 72.31 0.71 49.50 0.68 74.01 0.70 56.95 0.80 63.20 HSML (Yao et al., 2019) 71.68 0.73 48.08 0.69 73.49 0.68 56.32 0.80 62.39 ARML (Yao et al., 2020) 73.68 0.70 49.67 0.67 74.88 0.64 57.55 0.82 63.95 TSA-Proto Net (Zhou et al., 2021a) 73.70 0.73 50.91 0.74 73.55 0.78 56.11 0.82 63.57 MUSML 76.79 0.72 52.41 0.75 77.76 0.82 57.74 0.81 66.18

Table 5. 5-way 1-shot accuracy (with 95% conﬁdence interval) on Meta-Dataset-BTAF. Results marked with are from (Yao et al., 2020).

Bird Texture Aircraft Fungi average

MAML (Finn et al., 2017) 53.94 1.45 31.66 1.31 51.37 1.38 42.12 1.36 44.77 Proto Net (Snell et al., 2017) 60.37 1.31 40.57 0.78 52.83 0.93 44.10 1.36 49.50 ANIL (Raghu et al., 2020) 53.36 1.42 31.91 1.25 52.87 1.34 42.30 1.28 45.11 BMG (Flennerhag et al., 2022) 54.12 1.46 32.19 1.21 52.09 1.35 43.00 1.37 45.35

DPMM (Jerfel et al., 2019) 61.30 1.47 35.21 1.35 57.88 1.37 43.81 1.45 49.55 TSA-MAML (Zhou et al., 2021a) 61.37 1.42 35.41 1.39 58.78 1.37 44.17 1.25 49.93 HSML (Yao et al., 2019) 60.98 1.50 35.01 1.36 57.38 1.40 44.02 1.39 49.35 ARML (Yao et al., 2020) 62.33 1.47 35.65 1.40 58.56 1.41 44.82 1.38 50.34 TSA-Proto Net (Zhou et al., 2021a) 60.41 1.02 40.98 1.20 53.29 0.89 43.91 1.31 49.64 MUSML 60.52 0.33 41.33 1.30 54.69 0.69 45.60 0.43 50.53

1 2 3 Subspace index

mini-Image Net

(a) Meta-training.

1 2 3 Subspace index

mini-Image Net

10% 76% 14%

(b) Meta-validation.

1 2 3 Subspace index

mini-Image Net

15% 71% 14%

(c) Meta-testing. Figure 4. Task assignment to the learned subspaces in 5-way 5-shot setting on Meta-Dataset-CIO (K selected by the meta-validation set is 3). Darker color indicates higher percentage.

MAML, TSA-Proto Net) and structured meta-learning methods (HSML, ARML). MUSML again outperforms methods with a globally-shared meta-model (MAML, Proto Net, ANIL, BMG), conﬁrming the effectiveness of using a subspace mixture. The performance of MUSML on Omniglot is slightly worse in Table 7. This may be due to that Omniglot is a simple dataset and a single meta-model is good enough. As shown in Figure 4, its meta-validation and meta-testing tasks are often assigned to the same subspace.

4.3.3. EFFECTS OF K AND m

In this experiment, we study the effects of K and m on the 5-shot performance of MUSML on Meta-Dataset-BTAF.

1 2 3 4 5 K

Meta-training accuracy (%)

(a) Meta-training.

1 2 3 4 5 K

Meta-validation accuracy (%)

(b) Meta-validation.

1 2 3 4 5 K

Meta-testing accuracy (%)

(c) Meta-testing. Figure 5. 5-way 5-shot classiﬁcation accuracy on Meta-Dataset BTAF with varying K (m is ﬁxed at 2).

Figure 5(a) shows that the meta-training accuracy increases with K. However, a large K = 5 is not advantageous at meta-validation (Figure 5(b)) and meta-testing (Figure 5(c)).

Figures 6(b) and 6(c) show that the meta-validation and meta-testing accuracies of MUSML increase when m increases from 1 to 2, but larger m s (m = 3, 4, 5) lead to worse performance. This is because the obtained task model parameters (W) lie close to the union of 2-dimensional subspaces2, and so a larger m does not improve performance.

2For example, for the W solution obtained with m = 5 on Meta-Dataset-BTAF (under 5-way 5-shot setting), approximation by a rank-2 matrix ˆ W leads to a relative error ( W ˆ W F/ W F) of only 4.1%.

Subspace Learning for Effective Meta-Learning

Table 6. Accuracy (with 95% conﬁdence interval) of 5-way 5-shot classiﬁcation on Meta-Dataset-ABF. Results marked with are from (Zhou et al., 2021a).

Aircraft Bird Fungi average

MAML (Finn et al., 2017) 67.82 0.65 70.55 0.77 53.20 0.82 63.86 Proto Net(Snell et al., 2017) 69.74 0.64 71.46 0.69 55.66 0.68 65.62 ANIL (Raghu et al., 2020) 69.24 0.87 70.34 1.20 53.71 0.67 64.43 BMG (Flennerhag et al., 2022) 69.75 0.72 73.04 0.77 54.61 0.84 65.80

DPMM (Jerfel et al., 2019) 70.22 0.69 73.28 1.33 54.28 1.01 66.26 TSA-MAML (Zhou et al., 2021a) 72.84 0.63 74.80 0.76 56.86 0.67 68.17 HSML (Yao et al., 2019) 69.89 0.90 68.99 1.01 53.63 1.03 64.17 ARML (Yao et al., 2020) 70.20 0.91 69.12 1.01 54.23 1.07 64.52 TSA-Proto Net (Zhou et al., 2021a) 74.42 0.62 75.11 0.72 56.77 0.69 68.77 MUSML 79.88 0.61 75.63 0.73 57.80 0.80 71.10

Table 7. Accuracy (with 95% conﬁdence interval) of 5-way 5-shot classiﬁcation on Meta-Dataset-CIO.

CIFAR-FS mini-Image Net Omniglot average

MAML (Finn et al., 2017) 66.28 1.61 60.20 1.20 96.91 0.39 74.46 Proto Net (Snell et al., 2017) 71.32 1.54 62.90 1.07 95.32 0.25 76.51 ANIL (Raghu et al., 2020) 66.08 0.90 60.62 0.94 97.13 0.13 74.61 BMG (Flennerhag et al., 2022) 70.49 1.22 63.97 1.19 97.92 0.42 77.46

DPMM (Jerfel et al., 2019) 69.84 1.42 62.92 1.28 97.14 0.28 76.63 TSA-MAML (Zhou et al., 2021a) 71.11 1.55 62.57 1.31 96.99 0.31 76.89 HSML (Zhou et al., 2019) 69.24 1.57 62.28 1.23 95.10 0.32 75.54 ARML (Yao et al., 2020) 68.88 1.91 63.26 1.33 96.23 0.31 76.12 TSA-Proto Net (Zhou et al., 2021a) 72.37 1.46 63.23 1.52 96.21 0.33 77.27 MUSML 73.25 1.42 65.12 1.48 95.13 0.28 77.83

1 2 3 4 5 m

Meta-training accuracy (%)

(a) Meta-training.

1 2 3 4 5 m

Meta-validation accuracy (%)

(b) Meta-validation.

1 2 3 4 5 m

Meta-testing accuracy (%)

(c) Meta-testing.

Figure 6. 5-way 5-shot classiﬁcation accuracy on Meta-Dataset BTAF with varying m (K is ﬁxed at 4).

Figure 7 also shows that for the 4 subspaces, the ﬁrst 2 singular values of W are dominant.

To demonstrate the theoretical results in Section 3.3, we further study the effects of K and m on the meta-testing loss. The average training (resp. testing) loss of meta-testing tasks is an empirical estimate of ˆR(S) (resp. R(S)), while their gap measures the generalization performance.

Figure 8(a) shows that, for m 2, increasing K leads to a reduction in the training loss, which is suggested by Theorem 3.5. However, the testing loss does not always decrease when K increases (Figure 8(b)), which also agrees

1 2 3 4 5 singular value index

singular value

subspace 1 subspace 2 subspace 3 subspace 4

Figure 7. Singular values of model parameters of meta-testing tasks under the 5-way 5-shot setting on Meta-Dataset-BTAF (K = 4 and m = 5).

with Corollary 3.6. Figure 8(c) shows that a large K or m may enlarge the generalization gap, which justiﬁes Theorem 3.3. As shown in Figure 8(c), the generalization gap is approximately linear with K, which agrees with the relationship between the upper bound of R(S) ˆR(S) and K in Theorem 3.3.

4.4. Cross-Domain Few-Shot Classiﬁcation

We examine the effectiveness of MUSML on cross-domain few-shot classiﬁcation, which is more challenging as the

Subspace Learning for Effective Meta-Learning

1 2 3 4 5 K

training loss

(a) Training loss.

1 2 3 4 5 K

testing loss

(b) Testing loss.

1 2 3 4 5 K

testing loss training loss

m = 1 m = 2 m = 3 m = 4 m = 5

(c) Generalization gap. Figure 8. Effects of K and m on the training loss, testing loss, and generalization gap (with 95% conﬁdence interval) of meta-testing tasks under the 5-way 5-shot setting on Meta-Dataset-BTAF.

Table 8. Accuracy of cross-domain 5-way 5-shot classiﬁcation (Meta-Dataset-BTAF Meta-Dataset-CIO).

MAML Proto Net ANIL BMG DPMM TSA-MAML HSML ARML TSA-Proto Net MUSML

64.25 66.13% 65.19% 66.98% 66.73% 66.85% 65.18% 65.37% 66.92% 67.41%

Table 9. Accuracy of 5-way 5-shot classiﬁcation on Meta-Dataset-BTAF.

γ 0.0001 0.001 0.01 0.1 1.0 2.0 MUSML

accuracy 51.22% 60.12% 61.15% 63.16% 62.11% 62.02% 66.18%

Table 10. Accuracy of 5-way 5-shot classiﬁcation on meta-datasets.

Meta-Dataset-BTAF Meta-Dataset-ABF Meta-Dataset-CIO

Meta-Curvature (Park & Oliva, 2019) 60.02% 64.51% 76.13% MUSML-Curvature 66.10% 69.23% 77.96%

Meta-SGD (Li et al., 2017) 58.93% 64.19% 75.95% MUSML-SGD 65.72% 69.15% 77.48%

testing domain is unseen at meta-training. We perform 5way 5-shot classiﬁcation, where Meta-Dataset-BTAF is used for meta-training, and Meta-Dataset-CIO for meta-testing. Table 8 shows the meta-testing accuracy. As can be seen, MUSML is also effective on unseen domains.

4.5. Effects of Temperature Scaling Schedule

The temperature schedule used is linear annealing as in Dynamic Convolution (Chen et al., 2020) and Prob Mask (Zhou et al., 2021b). We conduct a 5-way 5-shot experiment on Meta-Dataset-BTAF to evaluate MUSML with a constant temperature. Table 9 reports the meta-testing accuracy. We can see that using a constant γ is inferior.

4.6. Improving Existing Meta-Learning Approaches

As the proposed MUSML is general, a subspace mixture is also useful for other meta-learning approaches. In this experiment, we combine MUSML with Meta-Curvature (Park & Oliva, 2019) and Meta-SGD (Li et al., 2017). Table 10 reports 5-way 5-shot accuracies on meta-datasets. As can

be seen, MUSML is beneﬁcial for both Meta-Curvature and Meta-SGD.

5. Conclusion In this paper, we formulate task model parameters into a subspace mixture and propose a model-agnostic meta-learning algorithm with subspace learning called MUSML. For each task, the base learner builds a task model from each subspace, while the meta-learner updates the meta-parameters by minimizing a weighted validation loss. The generalization performance is theoretically studied. Experimental results on benchmark datasets for classiﬁcation and regression validate the effectiveness of the proposed MUSML.

Acknowledgements This work was supported by NSFC key grant 62136005, NSFC general grant 62076118, and Shenzhen fundamental research program JCYJ20210324105000003. This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 16200021).

Subspace Learning for Effective Meta-Learning

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205 214, 2018.

Bao, F., Wu, G., Li, C., Zhu, J., and Zhang, B. Stability and generalization of bilevel programming in hyperparameter optimization. In Neural Information Processing Systems, 2021.

Bengio, Y., Bengio, S., and Cloutier, J. Learning a synaptic learning rule. In International Joint Conference on Neural Networks, pp. 969 975, 1991.

Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2018.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499 526, 2002.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., and Liu, Z. Dynamic convolution: Attention over convolution kernels. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11030 11039, 2020.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Neural Information Processing Systems, pp. 10169 10179, 2018.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pp. 1566 1575, 2019.

Denevi, G., Pontil, M., and Ciliberto, C. The advantage of conditional meta-learning for biased regularization and ﬁne tuning. In Neural Information Processing Systems, 2020.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic metalearning algorithms. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 1082 1092, 2020.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126 1135, 2017.

Flennerhag, S., Schroecker, Y., Zahavy, T., van Hasselt, H., Silver, D., and Singh, S. Bootstrapped meta-learning. In International Conference on Learning Representations, 2022.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568 1577, 2018.

Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367 4375, 2018.

Grazzi, R., Franceschi, L., Pontil, M., and Salzo, S. On the iteration complexity of hypergradient computation. In International Conference on Machine Learning, pp. 3748 3758, 2020.

Gressmann, F., Eaton-Rosen, Z., and Luschi, C. Improving neural network training in low dimensional random bases. Neural Information Processing Systems, 2020.

Gu, J., Wang, Y., Chen, Y., Li, V. O., and Cho, K. Metalearning for low-resource neural machine translation. In Empirical Methods in Natural Language Processing, pp. 3622 3631, 2018.

Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. Subspace inference for Bayesian deep learning. In Uncertainty in Artiﬁcial Intelligence, pp. 1169 1179, 2020.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016.

Jerfel, G., Grant, E., Grifﬁths, T., and Heller, K. A. Reconciling meta-learning and continual learning with online mixtures of tasks. In Neural Information Processing Systems, pp. 9122 9133, 2019.

Jiang, W., Kwok, J., and Zhang, Y. Effective metaregularization by kernelized proximal regularization. Neural Information Processing Systems, 2021.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795 811, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kong, W., Somani, R., Song, Z., Kakade, S., and Oh, S. Meta-learning for mixed linear regression. In International Conference on Machine Learning, pp. 5394 5404, 2020.

Subspace Learning for Effective Meta-Learning

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332 1338, 2015.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Metalearning with differentiable convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657 10665, 2019.

Li, C., Farkhoor, H., Liu, R., and Yosinski, J. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. Preprint ar Xiv:1707.09835, 2017.

Liu, C., Zhu, L., and Belkin, M. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2018.

Lojasiewicz, S. A topological property of real analytic subsets. Coll. du CNRS, Les equations aux d eriv ees partielles, 117(87-89):2, 1963.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradientbased hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113 2122, 2015.

Maurer, A. and Jaakkola, T. Algorithmic stability and metalearning. Journal of Machine Learning Research, 6(6), 2005.

Munkhdalai, T. and Yu, H. Meta networks. In International Conference on Machine Learning, pp. 2554 2563, 2017.

Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2018.

Nichol, A., Achiam, J., and Schulman, J. On ﬁrst-order meta-learning algorithms. Preprint ar Xiv:1803.02999, 2018.

Obamuyide, A. and Vlachos, A. Model-agnostic metalearning for relation classiﬁcation with limited supervision. In Annual Meeting of the Association for Computational Linguistics, pp. 5873 5879, 2019.

Oreshkin, B., L opez, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Neural Information Processing Systems, pp. 721 731, 2018.

Park, E. and Oliva, J. B. Meta-Curvature. Neural Information Processing Systems, 32, 2019.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991 999, 2014.

Polyak, B. Gradient methods for the minimisation of functionals. Ussr Computational Mathematics and Mathematical Physics, 3:864 878, 1963.

Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2020.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Neural Information Processing Systems, pp. 113 124, 2019.

Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efﬁcient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331 5340, 2019.

Rudelson, M. and Vershynin, R. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707 1739, 2009.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842 1850, 2016.

Saunshi, N., Zhang, Y., Khodak, M., and Arora, S. A sample complexity separation between non-convex and convex meta-learning. In International Conference on Machine Learning, pp. 8512 8521, 2020.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Neural Information Processing Systems, pp. 4077 4087, 2017.

Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to learn, pp. 3 17. Springer, 1998.

Triantaﬁllou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., and Larochelle, H. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020.

Subspace Learning for Effective Meta-Learning

Tripuraneni, N., Jin, C., and Jordan, M. Provable metalearning of linear representations. In International Conference on Machine Learning, pp. 10434 10443, 2021.

Vinyals, O., Blundell, C., Lillicrap, T., and Wierstra, D. Matching networks for one shot learning. In Neural Information Processing Systems, pp. 3630 3638, 2016.

Vuorio, R., Sun, S.-H., Hu, H., and Lim, J. J. Multimodal model-agnostic meta-learning via task-aware modulation. In Neural Information Processing Systems, pp. 1 12, 2019.

Wang, R., Demiris, Y., and Ciliberto, C. Structured prediction for conditional meta-learning. Neural Information Processing Systems, 33:2587 2598, 2020a.

Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1 34, 2020b.

Wortsman, M., Horton, M., Guestrin, C., Farhadi, A., and Rastegari, M. Learning neural network subspaces. In International Conference on Machine Learning, 2021.

Xiang, Y., Mottaghi, R., and Savarese, S. Beyond pascal: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 75 82, 2014.

Yao, H., Wei, Y., Huang, J., and Li, Z. Hierarchically structured meta-learning. In International Conference on Machine Learning, pp. 7045 7054, 2019.

Yao, H., Wu, X., Tao, Z., Li, Y., Ding, B., Li, R., and Li, Z. Automated relational meta-learning. In International Conference on Learning Representations, 2020.

Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. Meta-learning without memorization. In International Conference on Learning Representations, 2020.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efﬁcient meta learning via minibatch proximal update. In Neural Information Processing Systems, pp. 1534 1544, 2019.

Zhou, P., Zou, Y., Yuan, X., Feng, J., Xiong, C., and Hoi, S. Task similarity aware meta learning: Theory-inspired improvement on MAML. In Conference on Uncertainty in Artiﬁcial Intelligence, 2021a.

Zhou, X., Zhang, W., Xu, H., and Zhang, T. Effective sparsiﬁcation of neural networks with global sparsity constraint. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3599 3608, 2021b.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

Subspace Learning for Effective Meta-Learning

A. Experiment Details and Additional Results

A.1. Few-shot Regression on Synthetic Data

A.1.1. IMPLEMENTATION DETAILS

The subspace bases are trained for T = 30, 000 iterations using the Adam optimizer (Kingma & Ba, 2015). For the meta-learner, the initial learning rate is 0.001, which is then reduced by half every 5, 000 iterations. The base learner uses a learning rate of α = 0.05, v(0) = 1

m1, and Tin is 5 (resp. 20) at meta-training (resp. meta-testing). The temperature is γt = max(10 5, 0.5 t/T), a linear annealing schedule as in (Chen et al., 2020; Zhou et al., 2021b). To prevent overﬁtting, we evaluate the meta-validation performance every 2, 000 iterations, and stop training when the meta-validation accuracy has no signiﬁcant improvement for 5 consecutive evaluations.

A.1.2. ADDITIONAL RESULTS

Table 11 shows the average Euclidean distance between the estimated task model parameters and the ground truth. As can be seen, MUSML is more accurate in estimating the task models, conﬁrming the effectiveness of the learned subspaces.

Table 11. Average Euclidean distance (with standard deviation) between the estimated task model parameters and ground-truth in 5-shot setting on synthetic data. For TSA-MAML, the number in brackets is the number of clusters used.

MAML (Finn et al., 2017) 1.69 0.02 BMG (Flennerhag et al., 2022) 1.55 0.03

DPMM (Jerfel et al., 2019) 0.85 0.10 HSML (Yao et al., 2019) 0.80 0.09 ARML (Yao et al., 2020) 0.91 0.11 TSA-MAML(2) (Zhou et al., 2021a) 0.88 0.12 TSA-MAML(10) (Zhou et al., 2021a) 0.47 0.19 TSA-MAML(20) (Zhou et al., 2021a) 0.33 0.18 TSA-MAML(40) (Zhou et al., 2021a) 0.36 0.19 TSA-MAML(80) (Zhou et al., 2021a) 0.36 0.18 MUSML 0.17 0.01

A.2. Few-shot Classiﬁcation

We use the cross-entropy loss for ℓ( , ). The number of parameters in Conv4 is 113, 088. For the base learner, α = 0.01, v(0) = 1 m1, and Tin is set to 5 (resp. 15) at meta-training (resp. meta-validation or meta-testing). We train the subspace bases for T = 100, 000 iterations using the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.001, which is then reduced by half every 5, 000 iterations. The temperature is set to γt = max(10 5, 0.8 t/T), which is again a linear annealing schedule (Chen et al., 2020; Zhou et al., 2021b). To prevent overﬁtting, we evaluate the meta-validation performance every 2, 000 iterations and stop training when the meta-validation accuracy has no signiﬁcant improvement for 5 consecutive evaluations. Hyperparameters K and m are chosen from 1 to 5 on the meta-validation set. In practice, as shown in Section 4.3.3, m can simply be ﬁxed at 2, and K can be chosen from 3 to 4. As the search space of K is small, the additional cost of tuning K and m is small.

A.3. Results on Meta-Dataset-BTAF

Figure 9 shows the assignment of tasks to the learned subspaces in the 5-way 1-shot setting on Meta-Dataset-BTAF.

For notation simplicity, throughout this section, we omit the superscript τ . Let z (x, y) be the samples, ℓ(z; w) ℓ(f(x; w), y), and wℓ(f(z; Sv)) wℓ(f(z; w)) |w=Sv.

We ﬁrst show the stability constant (in Lemma 9 of (Bousquet & Elisseeff, 2002)) in Algorithm 1 is of the order O(1/Ntr).

Let {z i : i = 1, . . . , Ntr} be another Ntr samples from τ. Let Dtr(i) τ be another training set which differs from Dtr τ only in the ith sample (i.e., Dtr(i) τ (Dtr τ {zi}) {z i}). We let vτ,k,i (resp. vτ,k ) be the task model obtained from the base

Subspace Learning for Effective Meta-Learning

1 2 Subspace index

(a) Meta-training.

1 2 Subspace index

(b) Meta-validation.

1 2 Subspace index

(c) Meta-testing.

Figure 9. Task assignment to the learned subspaces in 5-way 1-shot on Meta-Dataset-BTAF (K selected by meta-validation set is 2).

learner when using training set Dtr(i) τ (resp. Dtr τ ).

Lemma B.1 (Lemma 9 of (Bousquet & Elisseeff, 2002)). Let D = {z1, . . . , zm} be a dataset containing m samples. For any learning algorithm A (receives a training set and outputs a learned model) and loss function ℓsuch that 0 ℓ( ) M, we have

ED [Ezℓ(z; A(D)) R(D; A(D))]2 M 2

2m + 3MED,z i|ℓ(zi; A(D)) ℓ(zi; A(D(i)))| (6)

for any i {1, . . . , m}, where D(i) is a dataset obtained by replacing zi with z i, and R(D; A(D)) is the empirical risk.

Lemma B.2. For the base learner in Algorithm 1, we have EDtr τ Ez i τ |ℓ(f(xi; Skvτ,k), yi) ℓ(f(xi; Skvτ,k,i), yi)|

2ϱ(1+αβρ2m)Tin

Proof. Claim 1: For k {1, . . . , K} and i {1, . . . , Ntr}, it holds that vτ,k vτ,k,i 2ϱ(1+αϱρ2m)Tin

By the update rule in the base learner, we have v(t +1) τ,k v(t +1) τ,k,i = v(t ) τ,k α v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k ) v(t ) τ,k,i +

α v(t ) τ,k,i L(Dtr(i) τ ; Skv(t ) τ,k,i) v(t ) τ,k v(t ) τ,k,i +α v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k ) v(t ) τ,k,i L(Dtr(i) τ ; Skv(t ) τ,k,i) . For the second

term, by the chain rule, it follows that v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k ) v(t ) τ,k,i L(Dtr(i) τ ; Skv(t ) τ,k,i)

= S k w L(Dtr τ ; Skv(t ) τ,k ) w L(Dtr(i) τ ; Skv(t ) τ,k,i)

wℓ(f(zj; Skv(t ) τ,k )) wℓ(f(zj; Skv(t ) τ,k,i))

wℓ(f(zi; Skv(t ) τ,k )) wℓ(f(z i; Skv(t ) τ,k,i)) (7)

wℓ(f(zj; Skv(t ) τ,k )) wℓ(f(zj; Skv(t ) τ,k,i))

wℓ(f(zj; Skv(t ) τ,k )) + 1 Ntr

wℓ(f(z i; Skv(t ) τ,k,i))

j =i β Skv(t ) τ,k Skv(t ) τ,k,i + 2ϱ

Ntr β Sk v(t ) τ,k v(t ) τ,k,i + 2ϱ

mρ2β v(t ) τ,k v(t ) τ,k,i + 2ϱρ m

where Eq.(7) uses the norm inequality Ax A x and Sk Sk F, Eq.(8) uses the compactness assumption (thus Sk F ρ m) and the triangle inequality, Eq.(9) uses the Lipschitzness of wℓ(f(x; w), y), Eq.(10) uses the

Subspace Learning for Effective Meta-Learning

Lipschitzness of ℓ(f(x; w), y), and Eq.(11) uses the boundedness of Sk again. Hence, we obtain a recursive inequality

v(t +1) τ,k v(t +1) τ,k,i (1 + αmρ2β) v(t ) τ,k v(t ) τ,k,i + 2αϱρ m

By induction, we obtain a bound for v(Tin) τ,k v(Tin) τ,k,i :

v(Tin) τ,k v(Tin) τ,k,i (1 + αmρ2β) v(0) τ,k v(0) τ,k,i + 2αϱρ m

t =0 (1 + αβρ2m)t 2ϱ(1 + αϱρ2m)Tin

ρβ m Ntr , (13)

where we have used the fact v(0) τ,k = v(0) τ,k,i.

Claim 2: The stability constant of the base learner is 2ϱ(1+αβρ2m)Tin

Next, we analyze the stability constant of the base learner:

EDtr τ Ez i τ |ℓ(f(xi; Skvτ,k), yi) ℓ(f(xi; Skvτ,k,i), yi)|

βEDtr τ Ez i τ Skvτ,k,i Skvτ,k (14)

βEDtr τ Ez i τ Sk F vτ,k,i vτ,k (15)

βρ m EDtr τ Ez i τ vτ,k,i vτ,k (16)

βρ m 2ϱ(1 + αβρ2m)Tin

ρβ m Ntr (17)

=2ϱ(1 + αβρ2m)Tin

where Eq.(14) uses the Lipschitz property of ℓ, Eq.(15) uses the norm inequality, Eq.(16) uses the boundedness of Sk F, Eq.(17) uses the inequality (13). The above equality reveals that the stability constant in Theorem 11 of (Bousquet & Elisseeff, 2002) (β2 there) is 2ϱ(1+αβρ2m)Tin

B.1. Proof of Theorem 3.3

Proof. The proof is based on the connection between generalization and stability (Bousquet & Elisseeff, 2002).

We adopt the notations used in the proof of Lemma B.2. We apply Lemma B.1 to our algorithm and obtain

EDtr τ Ez τℓ(f(x; Skvτ,k), y) L(Dtr τ ; Skvτ,k) 2

2Ntr + 3νEDtr τ Ez i τ |ℓ(f(xi; Skvτ,k), yi) ℓ(f(xi; Skvτ,k,i), yi)|

2Ntr + 6νϱ(1 + αβρ2m)Tin

where (19) uses the equality (18) in Lemma B.2. By the Cauchy-Schwarz inequality, we have

EDtr τ Ez τℓ(f(x; Skvτ,k), y) L(Dtr τ ; Skvτ,k)

2Ntr + 6νϱ(1 + αβρ2m)Tin

Subspace Learning for Effective Meta-Learning

To provide an upper bound of R(S) ˆR(S), we need to address the randomness in kτ:

R(S) ˆR(S) =EτEDtr τ Ez τℓ(f(x; Skτ vτ,kτ ), y) L(Dtr τ ; Skτ vτ,kτ )

k=1 I[kτ =k] Ez τℓ(f(x; Skvτ,k), y) L(Dtr τ ; Skvτ,k)

k=1 EDtr τ I[kτ =k] Ez τℓ(f(x; Skvτ,k), y) L(Dtr τ ; Skvτ,k)

k=1 EDtr τ Ez τℓ(f(x; Skvτ,k), y) L(Dtr τ ; Skvτ,k)

2Ntr + 6νϱ(1 + αβρ2m)Tin

where the ﬁrst inequality is because the empirical loss can be smaller than the population loss, and the last inequality follows from the Eq.(20).

B.2. Proof of Theorem 3.5

Proof of Theorem 3.5. We ﬁrst show that L(D; Skv) satisﬁes the PL inequality in v. By the chain rule, it follows that

v L(D; Skv) 2 = S w L(D; Skv) 2 = w L(D; Skv) SS w L(D; Skv)

λmin SS w L(D; Skv) 2

2 w L(D; Skv) 2 (21)

2 (L(D; Skv) L ), (22)

where Eq.(21) uses Assumption 3.4(ii), and Eq.(22) uses Assumption 3.4(i).

Then, we show that v L(Dtr τ ; Skv) is (βρ2m)-Lipschitz in v. Using the chain rule, we have

v L(Dtr τ ; Skv) v L(Dtr τ ; Skv ) = S k w L(Dtr τ ; Skv)| S k w L(Dtr τ ; Skv )

Sk w L(Dtr τ ; Skv)| w L(Dtr τ ; Skv ) (23)

β Sk Skv Skv (24)

β Sk 2 v v (25)

βρ2m v v , (26)

where Eqs.(23) and (24) use the norm inequality Ax A x , Eq.(24) uses the Assumption 3.2(i) on smoothness, Eq.(24) uses Sk 2 Sk 2 F ρ2m by Assumption 3.2(ii) on compactness.

Using the Taylor expansion, we obtain

L(Dtr τ ; Skv(t +1) τ,k ) = L(Dtr τ ; Skv(t ) τ,k ) + (v(t +1) τ,k v(t ) τ,k ) v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k )

2(v(t +1) τ,k v(t ) τ,k ) 2 v(t ) τ,k L(Dtr τ ; Skξ(t ))(v(t +1) τ,k v(t ) τ,k )

L(Dtr τ ; Skv(t ) τ,k ) α v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k )

2α2βρ2m v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k )

L(Dtr τ ; Skv(t ) τ,k ) α

v(t ) τ,k L(Dtr τ ; Skv(t ) τ,k )

Subspace Learning for Effective Meta-Learning

where ξt [v(t ) τ,k , v(t +1) τ,k ], Eq.(27) uses 2 v L(Dtr τ ; Skv) 2 βρ2m follows from Claim 1, Eq.(28) uses α < 1 βρ2m by assumption. By Assumption 3.4, it follows that

L(Dtr τ ; Skv(t +1) τ,k ) L L(Dtr τ ; Skv(t ) τ,k ) L α

2 (L(D; Skv) L ) (29)

(1 ψ) L(Dtr τ ; Skv(t ) τ,k ) L , (30)

where ψ = αλc2 1 2 1 c2 p m

The above recursive inequality implies L(Dtr τ ; Skvτ,k) L = L(Dtr τ ; Skv(Tin) τ,k ) L

(1 ψ)Tin L(Dtr τ ; Skv(0)) L . As ℓ( , ) is bounded by ν, L(Dtr τ ; Skv(0)) = 1 |Dtr τ | P

(x,y) Dtr τ ℓ(f(x; Skv(0)), y) is

also bounded by ν, then we have L(Dtr τ ; Skvτ,k) L +L(Dtr τ ; Skv(Tin) τ,k ) L +(1 ψ)Tin L(Dtr τ ; Skv(0)) L

L + (1 ψ)Tin (ν L ) L + (1 ψ)Tin ν. Let Lk L(Dtr τ ; Skvτ,k). Next, we consider two events E {min1 k K Lk L + ν K+ν } and E {min1 k K Lk < L + ν K+ν }. We have

EτEDtr τ min1 k K Lk = EτEE min1 k K Lk + EτE E min1 k K Lk Eτ L + (1 ψ)Tin ν P (E) +

Eτ(L + ν K+ν )P E 2L + (1 ψ)Tin ν + ν K+ν q

4L2 + 2ν2 (1 ψ)2Tin + ν K+ν , where we

have used the property P( E) 1. As ψ = αλc2 1 2 1 c2 p m

d 2, it follows that (1 ψ)2Tin = c1c2 p

αλ/(2d) m (

2/(c2 1αλ) + 1) m + (

2/(c2 1αλ) 1) 2Tin = (( m

κ1)( m + κ2)φ)2Tin, where κ1 = (

2/(c2 1αλ) + 1), κ2 = (

2/(c2 1αλ) 1), and φ = c1c2 p

αλ/(2d). Note that κ1, κ2, and φ are all positive. Hence, we obtain EτEDtr τ min1 k K Lk q

4L2 + 2ν2 (( m κ1)( m + κ2)φ)2Tin + ν K+ν and ﬁnish the proof.

B.3. Proof of Theorem 3.7

Proof. By the deﬁnition of excess risk, we have

= Eτ[EDtr τ Ez τℓ(f(x; Skτ vτ,kτ ), y) EDtr τ L(Dtr τ ; Skτ vτ,kτ ) + EDtr τ L(Dtr τ ; Skτ vτ,kτ ) EDtr τ L(Dtr τ ; Skτ v τ,kτ )

+ EDtr τ L(Dtr τ ; Skτ v τ,kτ ) Ez τℓ(z; w τ)] (31)

= EτEDtr τ Ez τℓ(f(x; Skτ vτ,kτ ), y) L(Dtr τ ; Skτ vτ,kτ ) + EτEDtr τ L(Dtr τ ; Skτ vτ,kτ ) L(Dtr τ ; Skτ v τ,kτ )

z Dtr τ ℓ(f(x; Skτ v τ,kτ ), y) Ez τℓ(f(x; w τ), y)

2Ntr + 6νϱ(1 + αβρ2m)Tin

Ntr + EτEDtr τ L(Dtr τ ; Skτ vτ,kτ ) L(Dtr τ ; Skτ v τ,kτ )

+ EτEDtr τ h L(Dtr τ ; Skτ v τ,kτ ) L(Dtr τ ; w τ, Skτ ) i + EτEDtr τ w L(Dtr τ ; ξτ) w τ,S kτ (32)

2Ntr + 6νϱ(1 + αβρ2m)Tin

Ntr + EτEDtr τ L(Dtr τ ; Skτ vτ,kτ ) L(Dtr τ ; Skτ v τ,kτ ) + ϱEτEDtr τ dist(w τ, Skτ ),

where (31) follows by introducing two additional terms (EτEDtr τ L(Dtr τ ; Skτ vτ,kτ ) and EτEDtr τ L(Dtr τ ; Skτ v τ,kτ )), Eq.(32) uses the bound in Theorem 3.3 and the mean value theorem (we decompose w τ = w τ,Skτ +w τ,S kτ and ξτ [w τ,Skτ , w τ]),

and Eq.(33) follows from the Lipschitzness assumption and EDtr τ h L(Dtr τ ; Skτ v τ,kτ ) L(Dtr τ ; w τ, Skτ ) i 0 as v τ,kτ is

Subspace Learning for Effective Meta-Learning

an exact solution of the problem minvτ L(Dtr τ ; Skτ vτ). We conclude that

2Ntr + 6νϱ(1 + αβρ2m)Tin

Ntr + EτEDtr τ L(Dtr τ ; Skτ vτ,kτ ) L(Dtr τ ; Skτ v τ,kτ ) + ϱEτEDtr τ dist(w τ, Skτ )

2Ntr + 6νϱ(1 + αβρ2m)Tin

Ntr + ρ m EτEDtr τ vτ,kτ v τ,kτ + ϱEτEDtr τ dist(w τ, Skτ ),

where the last inequality follows from the Lipschitzness of ℓand Skτ Skτ F ρ m.

B.4. Proof of Proposition 3.1

Proof. This proposition is a property of linear regression tasks and has been mentioned in (Kong et al., 2020; Tripuraneni et al., 2021). We include the proof here for completeness.

By the deﬁnition y = x w + ξ, we have

Eτ p(τ)E(x,y) τ,(x ,y ) τyy xx

=Eτ p(τ)Ex N(0,I),x N(0,I),ξ N(0,σ2 ξ),ξ N(0,σ2 ξ)(x w τ + ξ)(x w τ + ξ )xx

=Eτ p(τ)Ex N(0,I),x N(0,I),ξ N(0,σ2 ξ),ξ N(0,σ2 ξ)(x w τx w τxx + x w τξ xx + ξx w τxx + ξξ xx )

=Eτ p(τ)Ex N(0,I),x N(0,I)x w τx w τxx ,

where the last equality follows from the independence of ξ, ξ , x, and x . Using the independence of x, and x , we obtain Eτ p(τ)Ex N(0,I),x N(0,I)x w τx w τxx = Eτ p(τ)(Ex N(0,I)xx )w τw τ (Ex N(0,I)x x ) = Eτ p(τ)w τw τ .