# hierarchically_structured_metalearning__0acf7c55.pdf

Hierarchically Structured Meta-learning

Huaxiu Yao 1 Ying Wei 2 Junzhou Huang 2 Zhenhui Li 1

In order to learn quickly with few samples, metalearning utilizes prior knowledge learned from previous tasks. However, a critical challenge in meta-learning is task uncertainty and heterogeneity, which can not be handled via globally sharing knowledge among tasks. In this paper, based on gradient-based meta-learning, we propose a hierarchically structured meta-learning (HSML) algorithm that explicitly tailors the transferable knowledge to different clusters of tasks. Inspired by the way human beings organize knowledge, we resort to a hierarchical task clustering structure to cluster tasks. As a result, the proposed approach not only addresses the challenge via the knowledge customization to different clusters of tasks, but also preserves knowledge generalization among a cluster of similar tasks. To tackle the changing of task relationship, in addition, we extend the hierarchical structure to a continual learning environment. The experimental results show that our approach can achieve state-of-the-art performance in both toy-regression and few-shot image classiﬁcation problems.

1. Introduction

Learning quickly with a few samples is one of the key characteristics of human intelligence, while it remains a daunting challenge for artiﬁcial intelligence. Learning to learn (a.k.a., meta-learning) (Braun et al., 2010), as a common practice to address this challenge, leverages the transferable knowledge learned from previous tasks to improve the learning effectiveness in a new task. There have been several lines of meta-learning algorithms, including recurrent network based methods (Ravi & Larochelle, 2016), optimizer based

Part of the work was done when the author interned in Tencent AI Lab. 1College of Information Science and Technology, Pennsylvania State University, PA, USA 2Tencent AI Lab, Shenzhen, China. Correspondence to: Ying Wei <judyweiying@gmail.com>.

Proceedings of the 36 th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

methods (Andrychowicz et al., 2016), nearest neighbours based methods (Snell et al., 2017; Vinyals et al., 2016) and gradient descent based methods (Finn et al., 2017), which instantiate the transferable knowledge as latent representations, an optimizer, a metric space, and parameter initialization, respectively.

Despite their early success in few-shot image classiﬁcation (Ravi & Larochelle, 2016) and machine translation (Gu et al., 2018), most of the existing meta-learning algorithms assume the transferable knowledge to be globally shared across all tasks. As a consequence, they suffer from handling a sequence of tasks originated from different distributions. At the other end of the spectrum, recently, a few research works (Finn et al., 2018; Yoon et al., 2018a; Lee & Choi, 2018) try to ﬁx the problem by tailoring the transferable knowledge to each task. Yet the downside of such methods lies in the impaired knowledge generalization among closely correlated tasks (e.g., the tasks sampled from the same distribution).

Hence we are motivated to pursue a meta-learning framework to effectively balance generalization and customization. The inspiration comes from a hypothesis which has been formulated and tested by psychological researchers (Gershman et al., 2010; 2014). The hypothesis suggests that the key to human beings capability of solving a task with little training data is the way how human beings organize the learned knowledge from tasks. As bits of tasks impinge on us, we human beings cluster the tasks into several states based on task similarity, so that the learning occurs within each cluster instead of across cluster boundaries. Thus, when a new task arrives, it can either quickly take advantage of the knowledge learned within the cluster it belongs to or initiate a new cluster if it is wildly different from any existing clusters.

Inspired by this, we propose a novel meta-learning framework called Hierarchically Structured Meta-Learning (HSML). The key idea of the HSML is to enhance the meta-learning effectiveness by promoting knowledge customization to different clusters of tasks but simultaneously preserving knowledge generalization among a cluster of closely related tasks. In this paper, without loss of generality, we ground HSML on a gradient based meta learning algorithm (Finn et al., 2017) with the transferable knowl-

Hierarchically Structured Meta-learning

task 1 task 2 task 3 task 4

(a) : globally shared

task 1 task 2 task 3 task 4

(b) : task speciﬁc

task 4 task 1 task 3 task 2

Figure 1. Pictorial illustration of the difference between the proposed HSML and the other two representative lines of gradient based meta-learning algorithms.

edge instantiated as parameter initializations. Speciﬁcally, ﬁrst, the HSML resorts to a hierarchical clustering structure to perform soft clustering on tasks. The representation of each task is learned from either of the two proposed candidate aggregators, i.e., pooling autoencoder aggregator and recurrent autoencoder aggregator, and is passed to the hierarchical clustering structure to obtain the clustering result of this task. The sequentially incoming tasks, in turn, update the clustering structure. Especially, if the existing structure does not ﬁt the task, we dynamically expand the structure. Secondly, a globally shared parameter initialization is tailored to each cluster via a parameter gate, to serve as the initializations for all tasks belonging to the cluster.

Again we would highlight the contribution of the proposed HSML: 1) it achieves a better balance between generalization and customization of the transferable knowledge, so that it empirically outperforms state-of-the-art meta-learning algorithms in both toy regression and few-shot image classiﬁcation problems; 2) it is interpretable in terms of the task relationship; 3) it has been theoretically proved to be superior than existing gradient-based meta-learning algorithms.

2. Related Work

Meta-learning, allowing machines to learn new skills or adapt to new environments rapidly with a few training examples, has demonstrated success in both supervised learning such as few-shot image classiﬁcation and reinforcement learning settings. There are four common approaches: 1) use a recurrent neural network equipped with either external or internal memory storing and querying meta-knowledge (Munkhdalai & Yu, 2017; Santoro et al., 2016; Munkhdalai et al., 2018; Mishra et al., 2018); 2) learn a meta-optimizer which can quickly optimize the model parameters (Ravi & Larochelle, 2016; Andrychowicz et al., 2016; Li & Malik, 2016); 3) learn an effective distance metric between examples (Snell et al., 2017; Vinyals et al., 2016; Yang et al., 2018); 4) learn an appropriate initialization from which the model parameters can be updated within a few gradient steps (Finn et al., 2017; 2018; Lee & Choi, 2018).

HSML falls into the fourth aforementioned category named as gradient-based meta-learning. Most of the gradient-based meta-learning algorithms (Finn et al., 2017; Li et al., 2017;

Flennerhag et al., 2018) assume a globally shared initialization across all tasks, as shown in Figure 1a. To accommodate dynamically changing tasks, as illustrated in Figure 1b, recent studies tailor the global shared initialization to each task by taking advantage of probabilistic models (Finn et al., 2018; Grant et al., 2018; Yoon et al., 2018a) and incorporating task-speciﬁc information (Lee & Choi, 2018; Vuorio et al., 2018). However, our proposed HSML outlined in Figure 1c customizes the global shared initialization to each cluster using a hierarchical clustering structure, which enjoys not only knowledge customization but also generalization (e.g., between task 1 and 3). Better yet, HSML right ﬁt to a continual learning scenario with evolving clustering structures.

3. Preliminaries

The Meta-Learning Problem Suppose that a sequence of tasks {T1, ..., TNt} are sampled from an environment which is a probability distribution E on tasks (Baxter, 1998). In each task Ti E, we have a few examples {xi,j, yi,j}ntr j=1 to constitute the training set Dtr Ti and the rest as the test set Dte Ti. Given a base learner f with θ as parameters, the optimal parameters θTi are learned to make accurate predictions, i.e., fθTi (xi,j) yi,j. The effectiveness of such a base learner on Dtr Ti is evaluated by the loss function L(fθTi , Dtr Ti), which equals the mean square error P

(xi,j,yi,j) Dtr Ti fθTi (xi,j)

yi,j 2 2 for regression problems or the cross entropy loss P

(xi,j,yi,j) Dtr Ti log p(yi,j|xi,j, fθTi ) for classiﬁcation problems.

The goal of meta-learning is to learn from previous tasks a well-generalized meta-learner M( ) which can facilitate the training of the base learner in a future task with a few examples. In fulﬁllment of this, meta-learning involves two stages, i.e., meta-training and meta-testing. During metatraining, the parameters of the base learner for all tasks, i.e., {θTi}Nt i=1, and the meta-learner M( ) are optimized alternatingly. In virtue of M, the parameters {θTi}Nt i=1 are learned to minimize the expected empirical loss over training sets of all Nt historical tasks, i.e., min{θTi }Nt i=1 PNt i=1 L(M(fθTi ), Dtr Ti). In turn, a well-generalized M can be obtained by minimizing the expected empirical loss over test sets, i.e., min M PNt i=1 L(M(fθTi ), Dte Ti). When it comes to the metatesting phase, provided with a future task Tt, the learning effectiveness and efﬁciency are improved by applying the meta-learner M and solving minθTt L(M(fθTt ), Dtr Tt). Gradient-based Meta-Learning Here we give an overview of the representative algorithm, model-agnostic metalearning (MAML) (Finn et al., 2017). MAML instantiates the meta-learner M as a well-generalized initialization for the parameters of a base learner from which a few gradient descent steps can be performed to reach the optimal θTi for the task Ti, which means M(fθTi ) = fθ0 α θL(fθ,Dtr Ti ).

Hierarchically Structured Meta-learning

As a result, the optimization of M during meta-training is formulated as (one gradient step as exemplary):

i=1 L(fθ0 α θL(fθ,Dtr Ti ), Dte Ti). (1)

4. Methodology

In this section, we detail the proposed HSML algorithm whose framework is presented in Figure 2. The HSML aims to adapt the transferable knowledge learned from previous tasks, namely the initialization for parameters of the base learner in gradient based meta-learning (θ0 here), to the task Ti in a cluster-speciﬁc manner, so that the optimal parameters θTi can be achieved in as few gradient descent steps as possible. As shown in the part (c) of Figure 2, the possibilities to adaptation are completely dictated by the hierarchical task clustering structure in part (b), and the eventual path for adaptation follows the clustering result on the task Ti (i.e., θB1). By this means, the HSML balances between customization and generalization: the transferable knowledge is adapted to different clusters of tasks, while it is still shared among closely related tasks pertaining to the same cluster. To perform hierarchical clustering on tasks, we learn the representation of a task using the proposed task embedding network, i.e., the part (a). Next we will introduce the three stages, i.e., task representation learning, hierarchical task clustering, and knowledge adaptation, respectively.

4.1. Task Representation Learning

Learning the representation of a task Ti with the whole training set Dtr Ti as input is much more challenging than the common representation learning over examples, which bears a striking similarity to the connection between sentence embeddings and word embeddings in natural language processing. Inspired by common practices in learning sentence embeddings (Conneau et al., 2017), we tackle the challenge by aggregating representations of all examples {xi,j, yi,j}ntr j=1 Dtr Ti. The desiderata of an ideal aggregator include 1) high representational capacity, and 2) permutational invariance to its inputs. In light of these, we propose two candidate aggregators, i.e., pooling autoencoder aggregator (PAA) and recurrent autoencoder aggregator (RAA). Pooling Autoencoder Aggregator To meet the ﬁrst desideratum, foremost, we resort to an autoencoder that learns highly effective representation for each example. The recontruction loss for training the autoencoder is as follows,

Lr(Dtr Ti) =

j=1 FCdec(gi,j) F(xtr i,j, ytr i,j) 2 2, (2)

where gi,j =FCenc(F(xtr i,j, ytr i,j)) is the representation for the j-th example. In order to characterize the joint distribution instead of the marginal distribution only, we use F( , ) to preliminarily embed both features and predictions of an

example. The deﬁnition of F varies from dataset to dataset, which we will detail in supplementary material C. FCenc and FCdec stand for the encoder composed of a stack of fully connected layers and the decoder consisting of two fully connected layers with Re LU activation, respectively. Consequently, the aggregation satisfying the permutational invariance follows,

gi = Poolntr j=1(gi,j), (3) where gi Rd is the desired representation of task Ti. Pool denotes a max or mean pooling operator over examples. Recurrent Autoencoder Aggregator Motivated by recent success of the recurrent embedding aggregation in orderinvariant problems such as graph embedding (Hamilton et al., 2017), we also consider a recurrent autoencoder aggregator which demonstrates more remarkable expressivity especially for a task with few examples. Different from the pooling autoencoder, examples are sequentially fed into the recurrent autoencoder, i.e., F(xtr i,1, ytr i,1) gi,1 gi,ntr di,ntr di,1, (4) where j, gi,j = RNNenc(F(xtr i,j, ytr i,j), gi,j 1) and di,j = RNNdec(di,j+1) represent the learned representation and the reconstruction of the j-th example, respectively. Here RNNenc and RNNdec stand for a recurrent encoder (LSTM or GRU) and a recurrent decoder, respectively. The reconstruction loss is similar to Eqn. (2), except that FCdec(gi,j) is replaced with di,j. Thereupon, the task representation is aggregated over representations of all examples, i.e.,

j (gi,j). (5)

Regrettably, the sequential feeding of examples makes the ﬁnal task representation to be permutation sensitive, which violates the second prerequisite of an ideal aggregator. We address the problem by applying the recurrent aggregator to random permutations of examples (Hamilton et al., 2017).

4.2. Hierarchical Task Clustering

Given the representation of a task, we propose a hierarchical task clustering structure to locate the cluster the task belongs to. Before proceeding to detail the structure, we ﬁrst explicate why the hierarchical clustering is preferred over ﬂat clustering: a single level of task groups is likely insufﬁcient to model complex task relationship in real-world applications; for example, to identify the cross-talks between gene expressions of multiple species, the study (Kim & Xing, 2010) suggests multi-level clustering of such gene interaction.

The hierarchical clustering, following the tradition of clustering, proceeds by alternating between two steps, i.e., assignment step and update step, in a layer-wise manner. Assignment step: Each task receives a cluster assignment score on each hierarchical level, and the assignment that it

Hierarchically Structured Meta-learning

base learner

(a) Task Representation

(b) Hierarchical Task

Clustering (c) Knowledge Adaptation

Figure 2. The framework of the proposed HSML involving three essential stages. (a) Task representation learning: we learn the representation for the task Ti using an autoencoder aggregator (e.g., pooling aggregator, recurrent aggregator). (b) Hierarchical task clustering: provided with the task representation, we learn the soft clustering assignment with this differentiable hierarchical clustering structure. Darker nodes signify more likely assigned clusters (e.g., the cluster 1 in the ﬁrst level and the cluster B in the second level). (c) Knowledge adaptation: we next use a parameter gate to adapt the transferable knowledge (θ0) to a cluster-speciﬁc initialization (θB1) from which only a few gradient descent steps are required to achieve the optimal parameters θTi.

receives in a particular level is a function of its representation in the previous level. Thus, we assign a task represented in the kl-th cluster of the l-th level, i.e., hkl i Rd , to the kl+1-th cluster in the (l+1)-th level. Note that we conduct soft assignment for the following two reasons: (1) task groups have been demonstrated to overlap, since there is always a continuum in the sharing between tasks (Kumar & Daum e III, 2012); (2) the soft instead of hard assignment guarantees the differentiability, so that the full HSML framework can still be trained in an end-to-end fashion. In particular, for each task Ti, the soft-assignment pkl kl+1 i is computed by applying softmax over Euclidean distances between hkl i and the learnable cluster centers {ckl+1}Kl+1

kl+1=1, i.e.,

pkl kl+1 i = exp ( (hkl i ckl+1)/σl 2 2/2) PKl+1 kl+1=1 exp ( (hkl i ckl+1)/σl 2 2/2) , (6)

where σl is a scaling factor in the l-th level and Kl+1 denotes the number of clusters in the (l+1)-th level. Update step: As a result of assignment, the representation of a task in the kl+1-th cluster of the (l+1)-th level, i.e., hkl+1 i , can be updated with the following weighted average,

kl=1 pkl kl+1 i tanh (Wkl+1hkl i + bkl+1), (7)

where Wkl+1 Rd dand bkl+1 Rd are learned to transform from representations of the l-th to those of the (l+1)-th level.

The full pipeline of clustering starts from l = 0, where the initialization for hk0 i equals the task representation gi and K0 = 1, and ends at KL = 1. We would especially discuss the cluster centers. The meta-learning scenario where training tasks come sequentially poses a unique challenge which requires the hierarchical clustering structure to be accordingly online. Therefore, the cluster centers are parameterized and learned as the learning proceeds. Each center is randomly initialized.

4.3. Knowledge Adaptation

The ﬁnal representation h L i , which encrypts the hierarchical clustering result, is believed to be cluster speciﬁc. Previous works (Xu et al., 2015; Lee & Choi, 2018) suggest that similar tasks activate similar meta-parameters (e.g., initialization) while different tasks trigger disparate ones. Inspired by this ﬁnding, we design a cluster-speciﬁc parameter gate,

oi = FCσ Wg(gi h L i ), (8) where the fully connected layer FCσ Wg is parameterized by Wg and activated by a sigmoid function σ. It is worth mentioning here that concatenating the task representation gi together with h L i not only preserves but also reinforces the cluster-speciﬁc property of the parameter gate. Most importantly, the globally transferable knowledge, i.e., the initial parameters θ0, is adapted to the cluster-speciﬁc initial parameters θ0i via the parameter gate, i.e., θ0i =θ0 oi.

Recalling the objectives for a meta-learning algorithm in Section 3, we reach the optimization problem for HSML:

i=1 L(fθ0i α θL(θ,Dtr Ti ), Dte Ti) + ξLr(Dtr Ti), (9)

where L deﬁned in Section 3 measures the empirical risk over Dte Ti and Lr measures the reconstruction error as deﬁned in Eqn. (2). ξ is used to balance the importance of these two items. Θ represents all learnable parameters including the global transferable initialization θ0, the parameters for clustering, and those for knowledge adaptation (i.e., Wg). Continual Adaptation We especially pay attention to the case where a new task does not ﬁt any of the learned task clusters, which implies that additional clusters should be introduced to the hierarchical clustering structure. Incrementally adding model capacity (Yoon et al., 2018b; Daniely et al., 2015), has been the common practice to handle distribution drift without initially introducing excessive parameters. The key lies in the criterion when to expand the clustering structure. Since the loss values of L(fθTi , Dte Ti) ﬂuctuate

Hierarchically Structured Meta-learning

Algorithm 1 Meta-training of HSML

Require: E: distribution over tasks; {K1, , KL}: # of clusters in each layer; α, β: stepsizes; µ: threshold 1: Randomly initialize Θ 2: while not done do 3: if Lnew > µ Lold then 4: Increase the number of clusters 5: end if 6: Sample a batch of tasks Ti E 7: for all Ti do 8: Sample Dtr Ti, Dte Ti from Ti 9: Compute gi in Eqn. (3) or Eqn. (5), h L i in Eqn. (7), and reconstruction error Lr(Dtr Ti) 10: Compute oi in Eqn. (8) and evaluate θL(θ, Dtr Ti) 11: Update parameters with gradient descent (taking one step as an example): θTi =θ0i α θL(θ, Dtr Ti) 12: end for 13: Update Θ Θ β Θ PNt i=1 L(fθTi , Dte Ti)+ξLr(Dtr Ti) 14: Compute Lnew and save Lold for every Q rounds 15: end while

across different tasks during the online meta-training process, setting the loss value as threshold would obviously be futile. Instead, for every Q training tasks, we compute the average loss value L. If the new average value Lnew is more than µ times the previous value Lold (i.e., Lnew > µ Lold), the number of clusters will be increased, and the parameters for new clusters are randomly initialized. The whole algorithm of our proposed model is detailed in Alg. 1.

5. Analysis

The core of HSML is to adapt a globally shared initialization of stochastic gradient descent (SGD) to be cluster speciﬁc via the proposed hierarchical clustering structure. Hence, in this section, we theoretically analyze the advantage of such adaptation in terms of the generalization bound.

For a task Ti E, we assume both training and testing examples are i.i.d. drawn from a distribution Si , i.e., Dtr Ti Si and Dte Ti Si. According to Theorem 2 in (Kuzborskij & Lampert, 2017), a base learner fθTi is ϵ(Si, θ0)-on-average stable if its generalization is bounded by ϵ(Si, θ0), i.e., ESi EfθTi [R(fθTi (Dtr Ti)) ˆRDtr Ti (fθTi (Dtr Ti))] ϵ(Si, θ0). θ0 is

the initialization of SGD to reach θTi, and R( ) and ˆRDtr Ti ( ) denote the expected and empirical risk on Dtr Ti , respectively.

Transferring the globally shared initialization θ0 (i.e., MAML) to the target task Tt is equivalent to transferring a hypothesis fθ0 learned from meta-training tasks like (Kuzborskij & Orabona, 2017). For HSML, the initialization can be represented as θ0t = PK k=1 ˆBkθ0, which we demonstrate in the supplementary material A. In the following two theorems, provided with an initialization θ0t,

we derive according to (Kuzborskij & Lampert, 2017) the generalization bounds of the base learner fθTt when the loss L is convex and non-convex, respectively.

Theorem 1 Assume that L is convex and fθTt optimized using SGD is ϵ St, θ0t -on-average stable. Then ϵ St, θ0t

is bounded by,

ˆRDtr Tt (θ0t) +

Theorem 2 Assume that L [0, 1] is η-smooth and has a ρLipschitz Hessian. The step size at the u-step αu = c/u satisfying c min{ 1

η , 1 4(2η ln U)2 } with total steps U = ntr and

ˆγ = 1 ntr Pntr

j=1 2L(θ0t, (xt,j, yt,j)) 2 + q ˆRDtr Tt (θ0t)

ntr and then ϵ(St, θ0t) is bounded by,

O 1 + 1 cˆγ ˆRDtr Tt (θ0t)

Though some standard base learners (e.g., 4 convolutional layers in few-shot image classiﬁcation (Finn et al., 2017)) with Re LU do not meet the property of Lipschitz Hessian, following (Nguyen & Hein, 2018), a softplus function f(x) = 1

κ log(1 + exp(κx)) can arbitrarily well approximate Re LU by adjusting κ and thus Theorem 2 holds. In both cases, MAML can be regarded as the special case of HSML, i.e., k, ˆBk = I, where I is an identity matrix. Remarkably, by proving { ˆBk}K k=1, s.t., ˆRDtr Tt (θ0t) ˆRDtr Tt (θ0), we conclude that HSML achieves a tighter generalization bound than MAML and thereby is much more favored. Consider the optimization process starting from θ0, through the negative gradient direction, ˆθ0 = (I α L(θ0)(θ0I) 1)θ0 and ˆRDtr Tt (ˆθ0) ˆRDtr Tt (θ0). Thus, we can ﬁnd a PK k=1 ˆBk =

I α L(θ0)(θ0I) 1. We provide more details about analysis in supplementary material A.

6. Experiments

In this section, we evaluate the effectiveness of HSML. The goal of our experimental evaluation is to answer the following questions: (1) Can our approach outperform other meta-learning algorithms in toy regression and few-shot image classiﬁcation tasks? (2) Can our approach discover reasonable task clusters? (3) Can our approach update the clustering structure in the continual learning manner and achieve better performance?

We study these questions on toy regression and few-shot image classiﬁcation problems. For gradient-based metalearning algorithms, we select the following as baselines: (1) globally shared models including MAML (Finn et al., 2017) and Meta-SGD (Li et al., 2017); (2) task speciﬁc models including MT-Net (Lee & Choi, 2018), BMAML (Yoon et al., 2018a) and MUMOMAML (Vuorio et al., 2018).

Hierarchically Structured Meta-learning

The empirical results indicate that recurrent autoencoder aggregator (RAA) is on average better than PAA for task representation, so that RAA is used as the default aggregator. We also provide a comparison of RAA and PAA on fewshot classiﬁcation problem in supplementary material G. All the baselines use the same neural network structure (base learner). For hierarchical task clustering, like (Ying et al., 2018), the number of clusters in a high layer is half of that in its consecutive lower layer. We specify the hyperparameters for meta-training in supplementary material C.

6.1. Toy Regression

Dataset and Experimental Settings In the toy regression problem, different tasks are sampled from different family of functions. In this paper, the underlying family functions are (1) Sinusoids: y(x) = Asin(wx) + b, A U[0.1, 5.0], w U[0.8, 1.2] and b U[0, 2π]; (2) Line: y(x)=Alx + bl, Al U[ 3.0, 3.0] and bl U[ 3.0, 3.0]; (3) Cubic: y(x)=Acx3+bcx2+ccx+dc, Ac U[ 0.1, 0.1], bc U[ 0.2, 0.2], cc U[ 2.0, 2.0] and dc U[ 3.0, 3.0]; (4) Quadratic: y(x)=Aqx2+bqx+cq, Aq U[ 0.2, 0.2], bq U[ 2.0, 2.0] and cq U[ 3.0, 3.0]. U[ , ] represents a uniform distribution. Each individual is randomly sampled from one of the four underlying functions. The input x U[ 5.0, 5.0] for both training and testing tasks. We train all models for 5-shot and 10-shot regression. Mean square error (MSE) is used as evaluation metric. In hierarchical clustering, we set the number of layers to be three with 4, 2, 1 clusters in each layer, respectively. Results of Regression Performance The results of 5-shot and 10-shot regression are shown in Table 1. HSML improves the performance of global models (e.g., MAML) and task speciﬁc models (e.g., MUMOMAML), indicating the effectiveness of task clustering.

Table 1. Performance of MSE 95% conﬁdence intervals on toy regression tasks, averaged over 4,000 tasks. Both 5-shot and 10shot results are reported.

Model 5-shot 10-shot MAML 2.205 0.121 0.761 0.068 Meta-SGD 2.053 0.117 0.836 0.065 MT-Net 2.435 0.130 0.967 0.056 BMAML 2.016 0.109 0.698 0.054 MUMOMAML 1.096 0.085 0.256 0.028

HSML (ours) 0.856 0.073 0.161 0.021

Task Clustering Analysis in Toy Regression In order to show the power of HSML for detecting task clusters, we randomly select six tasks (more results are shown in supplementary material I) of 5-shot regression scenario and show soft-assignment values in Figure 3(a), i.e., the value of {pk0 k1 i | k1} in Eqn. (6). Darker color stands for higher probability. The qualitative results of each task are shown in Figure 3(b). The ground truth underlying functions and the data samples Dtr Ti are shown as red lines and green stars,

Ground Truth MAML Selected Point MUMOMAML HSML

Figure 3. (a) The visualization of soft-assignment in Eqn. (6) of six selected tasks. Darker color represents higher probability. (b) The corresponding ﬁtting curves. The ground truth underlying a function is shown in red line with data samples marked as green stars. C1-4 mean cluster 1-4, respectively.

respectively. Qualitative results of MAML, MUMOMAML (best baseline), HSML are shown in different colors.

As shown in the heatmap, sinusoids and linear with positive slope activate cluster 1 and 3, respectively. Both quadratic 1 and 2 activate cluster 2, while quadratic 1 also activates cluster 1 and quadratic 2 also activates cluster 3. From the qualitative results, we can see the shape of quadratic 2 is similar to that of linear with positive slope, while quadratic 1 has more apparent curvature. Similar ﬁndings also verify in cubic cases. The shape of cubic 2 is very similar to sinusoids, thus cluster 1 is activated. Different from cubic 2, cubic 1 mainly activates cluster 4, whose shape is similar to linear with negative slope. The results indicate that the main cluster criteria of HSML is the shapes of tasks despite the underlying family functions. Furthermore, according to the qualitative results, HSML ﬁts better than other baselines. Results of Continual Adaptation To demonstrate the effectiveness of HSML under the continual learning scenario (HSML-D), we add more underlying functions during metatraining. First, we generate tasks from sinusoids and linear, and quadratic and cubic functions are added after 15,000 and 30,000 training rounds, respectively. For comparison, one baseline is HSML with 2 ﬁxed clusters (HSML-S(2C)), and the other is HSML with 10 ﬁxed clusters with much more representational capability (HSML-S(10C)). The metatraining loss curve and the meta-testing performance (MSE) are shown in Figure 4. We can see that HSML-D outperforms as expected. Especially, HSML-D performs better than HSML-S(10C) which are prone to overﬁt and stuck at local optima at early stages.

6.2. Few-shot Classiﬁcation

Dataset and Experimental Settings In the few-shot classiﬁcation problem, we construct a new benchmark which currently consists of four image classiﬁcation

Hierarchically Structured Meta-learning

Quadratic is added

Cubic is added

Model HSML-S (2C) HSML-S (10C) HSML-D MSE 95% CI 0.933 0.074 0.889 0.071 0.869 0.072

Figure 4. The performance comparison for the 5-shot toy regression problem in the continual adaptation scenario. The curve of MSE in meta-training process is shown in the top ﬁgure and the performance of meta-testing is reported in the bottom table.

datasets: Caltech-UCSD Birds-200-2011 (Bird) (Wah et al., 2011), Describable Textures Dataset (Texture) (Cimpoi et al., 2014), Fine-Grained Visual Classiﬁcation of Aircraft (Aircraft) (Maji et al., 2013), and FGVCx-Fungi (Fungi) (Fun, 2018) (See supplementary material B for detailed descriptions of this benchmark). Similar to the preprocessing of Mini Imagenet (Vinyals et al., 2016), we divide each dataset to meta-training, meta-validation and metatesting classes. Following the protocol in (Finn et al., 2017; Ravi & Larochelle, 2016; Vinyals et al., 2016), we adopt N-way classiﬁcation with K-shot samples. Each task samples classes from one of the four datasets. Compared with previous benchmarks (e.g., Mini Imagenet) that the tasks are constructed within a single dataset, the new benchmark is more heterogeneous and closer to the real-world image classiﬁcation. Like (Finn et al., 2017), the base learner is a standard four-block convolutional architecture. The number of layers in hierarchical clustering structure is set as 3 with 4, 2, 1 clusters in each layer. Note that, in this section, for the tables without conﬁdence interval, we provide the full results in supplementary material F. In addition, we provide the comparison to Mini Imagenet benchmark in supplementary material D. Note that, the sampled tasks from Mini Imagenet do not have obvious heterogeneity and uncertainty. Our approach achieves comparable results among gradient-based meta-learning methods. Results of Classiﬁcation Performance For each dataset, we report the averaged accuracy over 1000 tasks of 5-way 1-shot/5-shot classiﬁcation in Table 2. HSML consistently outperforms the other baselines on each dataset, which demonstrates the power of modeling hierarchical clustering structure. To verify the effectiveness of our proposed three components (i.e., task representation, hierarchical task clustering, knowledge adaptation), we also propose some variants of HSML. The detailed description of these variants and their corresponding results can be found in the supplementary material H, which further enhance the contribution of each component. In addition, we design another challeng-

ing leave-one-out experiment in this benchmark. We use one dataset for meta-testing and the rest three for meta-training. The results are reported in the supplementary material E and the HSML still achieves the best performance. Task Clustering Analysis in Few-shot Classiﬁcation Like the analysis of toy regression, we select four tasks in 5-way 1-shot classiﬁcation and show their soft-assignment in Figure 5 (more results are shown in the supplementary material J). Darker color means higher probability. Furthermore, in Figure 5, we show the learned hierarchical clustering of each task. In each layer, the top activated clusters are shown in darker color and then the activation paths are generated.

C1 C2 C3 C4 Bird

C1 C2 C3 C4 Texture

C1 C2 C3 C4 Aircraft

C1 C2 C3 C4 Fungi

Figure 5. (a) The visualization of soft-assignment in Eqn. (6) of four selected tasks. (b) Learned hierarchical structure of each task. In each layer, top activated cluster is shown in dark color.

From Figure 5, we can see different datasets mainly activate different clusters: bird cluster 2, texture cluster 4, aircraft cluster 1, fungi cluster 3. It is also interesting to ﬁnd the clustering across different tasks via the second largest activated cluster which further promote knowledge transfer between tasks. The correlation may represent the similarity of shape (bird and aircraft), environment (fungi and bird), surface texture (texture and fungi). Note that, aircraft is correlated to texture because the classiﬁcation of aircraft variants is mainly based on their shape and texture. The clustering can be further veriﬁed in the learned activated path. In the second layer, the left node, which may represent the environment, is activated by cluster 2 (activated by bird) and 3 (activated by fungi). The right node that reﬂects surface texture is activated by cluster 1 (activated by aircraft) and 4 (activated by texture). In Figure 6, in addition, we randomly select 1000 tasks from each dataset, and show the t-SNE (Maaten & Hinton, 2008) visualization of the gated weight, i.e., θ0i, in Eqn. (9). Compared with MUMOMAML, the results indicate that our clustering structure are able to identify the tasks in different clusters. Results of Continual Adaptation In few-shot classiﬁcation task, we conduct the experiments for continual adaptation in the 5-way 1-shot scenario. Initially, the tasks are generated from bird and texture datasets. Then, aircraft and fungi datasets are added after approximately meta-training round 15000 and 25000, respectively. We show the average meta-training accuracy curve and meta-testing accuracy in Figure 7, where MUMOMAML, HSML-S(2C) and HSML-

Hierarchically Structured Meta-learning

Table 2. Comparison between HSML and other gradient-based meta-learning methods on the 5-way, 1-shot/5-shot image classiﬁcation problem, averaged over 1000 tasks for each dataset. Accuracy 95% conﬁdence intervals are reported.

Model Bird Texture Aircraft Fungi Average

5-way 1-shot

MAML 53.94 1.45% 31.66 1.31% 51.37 1.38% 42.12 1.36% 44.77% Meta-SGD 55.58 1.43% 32.38 1.32% 52.99 1.36% 41.74 1.34% 45.67% MT-Net 58.72 1.43% 32.80 1.35% 47.72 1.46% 43.11 1.42% 45.59% BMAML 54.89 1.48% 32.53 1.33% 53.63 1.37% 42.50 1.33% 45.89% MUMOMAML 56.82 1.49% 33.81 1.36% 53.14 1.39% 42.22 1.40% 46.50%

HSML (ours) 60.98 1.50% 35.01 1.36% 57.38 1.40% 44.02 1.39% 49.35%

5-way 5-shot

MAML 68.52 0.79% 44.56 0.68% 66.18 0.71% 51.85 0.85% 57.78% Meta-SGD 67.87 0.74% 45.49 0.68% 66.84 0.70% 52.51 0.81% 58.18% MT-Net 69.22 0.75% 46.57 0.70% 63.03 0.69% 53.49 0.83% 58.08% BMAML 69.01 0.74% 46.06 0.69% 65.74 0.67% 52.43 0.84% 58.31% MUMOMAML 70.49 0.76% 45.89 0.69% 67.31 0.68% 53.96 0.82% 59.41%

HSML (ours) 71.68 0.73% 48.08 0.69% 73.49 0.68% 56.32 0.80% 62.39%

Bird Texture

(a) : MUMOMAML

Aircraft Fungi

(b) : HSML (Ours)

Figure 6. t-SNE visualization of gated weight, i.e., θ0i, in Eqn. (9)

S(10C) are used as baselines. As shown in Figure 7, HSMLD consistently achieves better performance.

Fungi is added Aircraft is added

Model Bird Texture Aircraft Fungi MUMOMAML 56.66% 33.68% 45.73% 40.38% HSML-S (2C) 60.77% 33.41% 51.28% 40.78% HSML-S (10C) 59.16% 34.48% 52.30% 40.56% HSML-D 61.16% 34.53% 54.50% 41.66%

Figure 7. The performance comparison for the 5-way 1-shot fewshot classiﬁcation problem in the continual adaptation scenario. The top ﬁgure and bottom table show the meta-training accuracy curves and the meta-testing accuracy, respectively.

Effect of Cluster Numbers We further analyze the effect of cluster numbers. The results are shown in Table 3. The cluster numbers from bottom layer to top layer are saved in a tuple. We can see that too few clusters may not enough to learn the task clustering characteristic (e.g., case (2,2,1)). In this dataset, increasing layers (e.g., case (8,4,4,1)) achieves

similar performance compared with case (4,2,1). However, the former introduces more parameters.

Table 3. Comparison of different cluster numbers. The numbers in ﬁrst column represents the number of clusters from bottom layer to top layer. Accuracy for 5-way 1-shot classiﬁcation are reported.

Num. of Clu. Bird Texture Aircraft Fungi (2, 2, 1) 58.37% 33.18% 56.15% 42.90% (4, 2, 1) 60.98% 35.01% 57.38% 44.02% (6, 3, 1) 60.55% 34.02% 55.79% 43.43% (8, 4, 2, 1) 59.55% 34.74% 57.84% 44.18%

7. Conclusion and Discussion

In this paper, we introduce HSML to improve the metalearning effectiveness, which simultaneously customizing task knowledge and preserving knowledge generalization via hierarchical clustering structure. Compared with several baselines, experiments demonstrated the effectiveness and interpretability of our algorithm in both toy regression and few-shot classiﬁcation problems.

Although our method is widely applicable, there are some limitations and interesting future directions. (1) In this paper, we provide a simple version for continual learning, where tasks from new underlying groups are added continually. However, to construct a more reliable lifelong learning system, it is will be necessary to consider more complex evolution relations between tasks (e.g., relationship forgetting); (2) Another interesting direction is to combining active learning with task relation learning for automatically exploring evolutionary task relations.

Acknowledgements

The work was supported in part by NSF awards #1652525, #1618448, and #1639150. Zhenhui Li would also like to ac-

Hierarchically Structured Meta-learning

knowledge the support from the Haile Family Early Career endowment. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

2018 fgcvx fungi classiﬁcation challenge, 2018. URL https://www.kaggle.com/c/ fungi-challenge-fgvc-2018.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In NIPS, pp. 3981 3989, 2016.

Baxter, J. Theoretical models of learning to learn. In Learning to learn, pp. 71 94. Springer, 1998.

Braun, D. A., Mehring, C., and Wolpert, D. M. Structure learning in action. Behavioural brain research, 206(2): 157 165, 2010.

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A. Describing textures in the wild. In CVPR, 2014.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pp. 670 680, 2017.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In ICML, pp. 1405 1411, 2015.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In ICML, pp. 1126 1135, 2017.

Finn, C., Xu, K., and Levine, S. Probabilistic modelagnostic meta-learning. ar Xiv preprint ar Xiv:1806.02817, 2018.

Flennerhag, S., Moreno, P. G., Lawrence, N. D., and Damianou, A. Transferring knowledge across learning processes. ar Xiv preprint ar Xiv:1812.01054, 2018.

Gershman, S. J., Blei, D. M., and Niv, Y. Context, learning, and extinction. Psychological review, 117(1):197, 2010.

Gershman, S. J., Radulescu, A., Norman, K. A., and Niv, Y. Statistical computations underlying the dynamics of memory updating. PLo S computational biology, 10(11): e1003939, 2014.

Grant, E., Finn, C., Levine, S., Darrell, T., and Grifﬁths, T. Recasting gradient-based meta-learning as hierarchical bayes. ar Xiv preprint ar Xiv:1801.08930, 2018.

Gu, J., Wang, Y., Chen, Y., Cho, K., and Li, V. O. Metalearning for low-resource neural machine translation. ar Xiv preprint ar Xiv:1808.08437, 2018.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, pp. 1024 1034, 2017.

Kim, S. and Xing, E. P. Tree-guided group lasso for multitask regression with structured sparsity. In ICML, pp. 543 550, 2010.

Kumar, A. and Daum e III, H. Learning task grouping and overlap in multi-task learning. In ICML, pp. 1723 1730, 2012.

Kuzborskij, I. and Lampert, C. H. Data-dependent stability of stochastic gradient descent. ar Xiv preprint ar Xiv:1703.01678, 2017.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2): 171 195, 2017.

Lee, Y. and Choi, S. Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, pp. 2933 2942, 2018.

Li, K. and Malik, J. Learning to optimize. ar Xiv preprint ar Xiv:1606.01885, 2016.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few shot learning. ar Xiv preprint ar Xiv:1707.09835, 2017.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. JMLR, 9(Nov):2579 2605, 2008.

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classiﬁcation of aircraft. Technical report, 2013.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. ICLR, 2018.

Munkhdalai, T. and Yu, H. Meta networks. In ICML, pp. 2554 2563, 2017.

Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. In ICML, pp. 3661 3670, 2018.

Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep cnns. In ICML, 2018.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. ICLR, 2016.

Hierarchically Structured Meta-learning

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In ICML, pp. 1842 1850, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NIPS, pp. 4077 4087, 2017.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In NIPS, pp. 3630 3638, 2016.

Vuorio, R., Sun, S.-H., Hu, H., and Lim, J. J. Toward multimodal model-agnostic meta-learning. ar Xiv preprint ar Xiv:1812.07172, 2018.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048 2057, 2015.

Yang, F. S. Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.

Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In NIPS, pp. 4805 4815, 2018.

Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In NIPS, pp. 7343 7353, 2018a.

Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. In ICLR, 2018b.