# Probabilistic Model-Agnostic Meta-Learning

Chelsea Finn, Kelvin Xu, Sergey Levine
UC Berkeley
{cbfinn,kelvinxu,svlevine}@eecs.berkeley.edu

Meta-learning for few-shot learning entails acquiring a prior over previous tasks and experiences, such that new tasks can be learned from small amounts of data. However, a critical challenge in few-shot learning is task ambiguity: even when a powerful prior can be meta-learned from a large number of prior tasks, a small dataset for a new task can simply be too ambiguous to acquire a single model (e.g., a classifier) for that task that is accurate. In this paper, we propose a probabilistic meta-learning algorithm that can sample models for a new task from a model distribution. Our approach extends model-agnostic meta-learning, which adapts to new tasks via gradient descent, to incorporate a parameter distribution that is trained via a variational lower bound. At meta-test time, our algorithm adapts via a simple procedure that injects noise into gradient descent, and at meta-training time, the model is trained such that this stochastic adaptation procedure produces samples from the approximate model posterior. Our experimental results show that our method can sample plausible classifiers and regressors in ambiguous few-shot learning problems. We also show how reasoning about ambiguity can be used for downstream active learning problems.

1 Introduction

Learning from a few examples is a key aspect of human intelligence. One way to make it possible to acquire solutions to complex tasks from only a few examples is to leverage past experience to learn a prior over tasks. The process of learning this prior entails discovering the shared structure across different tasks from the same family, such as commonly occurring visual features or semantic cues.
Structure is useful insofar as it yields efficient learning of new tasks, a mechanism known as learning-to-learn, or meta-learning [3].

However, when the end goal of few-shot meta-learning is to learn solutions to new tasks from small amounts of data, a critical issue that must be dealt with is task ambiguity: even with the best possible prior, there might simply not be enough information in the examples for a new task to resolve that task with high certainty. It is therefore quite desirable to develop few-shot meta-learning methods that can propose multiple potential solutions to an ambiguous few-shot learning problem. Such a method could be used to evaluate uncertainty (by measuring agreement between the samples), perform active learning, or elicit direct human supervision about which sample is preferable. For example, in safety-critical applications, such as few-shot medical image classification, uncertainty is crucial for determining if the learned classifier should be trusted. When learning from such small amounts of data, uncertainty estimation can also help predict if additional data would be beneficial for learning and improving the estimate of the rewards. Finally, while we do not experiment with this in this paper, we expect that modeling this ambiguity will be helpful for reinforcement learning problems, where it can be used to aid in exploration.

First two authors contributed equally. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

While recognizing and accounting for ambiguity is an important aspect of the few-shot learning problem, it is challenging to model when scaling to high-dimensional data, large function approximators, and multimodal task structure. Representing distributions over functions is relatively straightforward when using simple function approximators, such as linear functions, and has been done extensively in early few-shot learning approaches using Bayesian models [39, 7].
But this problem becomes substantially more challenging when reasoning over high-dimensional function approximators such as deep neural networks, since explicitly representing expressive distributions over thousands or millions of parameters is often intractable. As a result, recent more scalable approaches to few-shot learning have focused on acquiring deterministic learning algorithms that disregard ambiguity over the underlying function.

Can we develop an approach that has the benefits of both classes of few-shot learning methods, namely scalability and uncertainty awareness? To do so, we build upon tools in amortized variational inference to develop a probabilistic meta-learning approach. In particular, our method builds on model-agnostic meta-learning (MAML) [9], a few-shot meta-learning algorithm that uses gradient descent to adapt the model at meta-test time to a new few-shot task, and trains the model parameters at meta-training time to enable rapid adaptation, essentially optimizing for a neural network initialization that is well-suited for few-shot learning. MAML can be shown to retain the generality of black-box meta-learners such as RNNs [8], while being applicable to standard neural network architectures. Our approach extends MAML to model a distribution over prior model parameters, which leads to an appealingly simple stochastic adaptation procedure that injects noise into gradient descent at meta-test time. The meta-training procedure then optimizes for this simple inference process to produce samples from an approximate model posterior.

The primary contribution of this paper is a reframing of MAML as a graphical model inference problem, where variational inference can provide us with a principled and natural mechanism for modeling uncertainty.
Our approach enables sampling multiple potential solutions to a few-shot learning problem at meta-test time, and our experiments show that this ability can be used to sample multiple possible regressors for an ambiguous regression problem, as well as multiple possible classifiers for ambiguous few-shot attribute classification tasks. We further show how this capability to represent uncertainty can be used to inform data acquisition in a few-shot active learning problem.

2 Related Work

Hierarchical Bayesian models are a long-standing approach for few-shot learning that naturally allow for the ability to reason about uncertainty over functions [39, 7, 25, 43, 12, 4, 41]. While these approaches have been demonstrated on simple few-shot image classification datasets [24], they have yet to scale to more complex problems, such as the experiments in this paper. A number of works have approached the problem of few-shot learning from a meta-learning perspective [35, 19], including black-box [33, 5, 42] and optimization-based approaches [31, 9]. While these approaches scale to large-scale image datasets [40] and visual reinforcement learning problems [28], they typically lack the ability to reason about uncertainty.

Our work is most related to methods that combine deep networks and probabilistic methods for few-shot learning [6, 15, 23]. One approach that considers hierarchical Bayesian models for few-shot learning is the neural statistician [6], which uses an explicit task variable to model task distributions. Our method is fully model agnostic, and directly samples model weights for each task for any network architecture. Our experiments show that our approach improves on MAML [9], which outperforms the model by Edwards and Storkey [6]. Other work that considers model uncertainty in the few-shot learning setting is the LLAMA method [15], which also builds on the MAML algorithm.
LLAMA makes use of a local Laplace approximation for modeling the task parameters (post-update parameters), which introduces the need to approximate a high-dimensional covariance matrix. We instead propose a method that approximately infers the pre-update parameters, which we make tractable through a choice of approximate posterior parameterized by gradient operations.

Bayesian neural networks [27, 18, 29, 1] have been studied extensively as a way to incorporate uncertainty into deep networks. Although exact inference in Bayesian neural networks is impractical, approximations based on backpropagation and sampling [16, 32, 20, 2] have been effective in incorporating uncertainty into the weights of generic networks. Our approach differs from these methods in that we explicitly train a hierarchical Bayesian model over weights, where a posterior task-specific parameter distribution is inferred at meta-test time conditioned on a learned weight prior and a (few-shot) training set, while conventional Bayesian neural networks directly learn only the posterior weight distribution for a single task. Our method draws on amortized variational inference methods [22, 21, 36] to make this possible, but the key modification is that the model and inference networks share the same parameters. The resulting method corresponds structurally to a Bayesian version of model-agnostic meta-learning [9].

Figure 1: Graphical models corresponding to our approach. The original graphical model (left) is transformed into the center model after performing inference over $\phi_i$. We find it beneficial to introduce additional dependencies of the prior on the training data to compensate for using the MAP estimate to approximate $p(\phi_i)$, as shown on the right.

3 Preliminaries

In the meta-learning problem setting that we consider, the goal is to learn models that can learn new tasks from small amounts of data.
To do so, meta-learning algorithms require a set of meta-training and meta-testing tasks drawn from some distribution $p(\mathcal{T})$. The key assumption of learning-to-learn is that the tasks in this distribution share common structure that can be exploited for faster learning of new tasks. Thus, the goal of the meta-learning process is to discover that structure. In this section, we will introduce notation and overview the model-agnostic meta-learning (MAML) algorithm [9].

Meta-learning algorithms proceed by sampling data from a given task and splitting the sampled data into a set of a few datapoints, $\mathcal{D}^{\text{tr}}$, used for training the model, and a set of datapoints for measuring whether or not training was effective, $\mathcal{D}^{\text{test}}$. This second dataset is used to measure few-shot generalization and drive meta-training of the learning procedure. The MAML algorithm trains for few-shot generalization by optimizing for a set of initial parameters $\theta$ such that one or a few steps of gradient descent on $\mathcal{D}^{\text{tr}}$ achieves good performance on $\mathcal{D}^{\text{test}}$. Specifically, MAML performs the following optimization:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}}_{\mathcal{T}_i}), \mathcal{D}^{\text{test}}_{\mathcal{T}_i}\big) = \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}(\phi_i, \mathcal{D}^{\text{test}}_{\mathcal{T}_i})$$

where $\phi_i$ is used to denote the parameters updated by gradient descent and where the loss corresponds to the negative log-likelihood of the data. In particular, in the case of supervised classification with inputs $\{x_j\}$, their corresponding labels $\{y_j\}$, and a classifier $f_\theta$, we will denote the negative log-likelihood of the data under the classifier as $\mathcal{L}(\theta, \mathcal{D}) = -\sum_{(x_j, y_j) \in \mathcal{D}} \log p(y_j \mid x_j, \theta)$. This corresponds to the cross-entropy loss function.

Our goal is to build a meta-learning method that can handle the uncertainty and ambiguity that occurs when learning from small amounts of data, while scaling to highly expressive function approximators such as neural networks. To do so, we set up a graphical model for the few-shot learning problem.
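As a concrete illustration of the MAML objective from the preliminaries, the following minimal sketch evaluates the inner gradient step and the outer loss for a toy problem. It is not the authors' implementation: the scalar model $f(x) = \theta x$, the mean squared error loss (a Gaussian negative log-likelihood up to constants), the analytic gradient, the two toy tasks, and the names `loss`, `grad`, `adapt`, and `maml_objective` are all assumptions made for illustration.

```python
def loss(theta, data):
    """Mean squared error of the assumed scalar model f(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Analytic gradient of `loss` with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)


def adapt(theta, d_tr, alpha=0.1, steps=1):
    """Inner loop: phi_i = theta - alpha * grad L(theta, D_tr)."""
    phi = theta
    for _ in range(steps):
        phi = phi - alpha * grad(phi, d_tr)
    return phi


def maml_objective(theta, tasks, alpha=0.1):
    """Outer objective: sum over tasks of L(phi_i, D_test)."""
    return sum(loss(adapt(theta, d_tr, alpha), d_test) for d_tr, d_test in tasks)


# Two hypothetical tasks (slopes 2 and -1), each given as (D_tr, D_test).
tasks = [
    ([(1.0, 2.0)], [(2.0, 4.0)]),
    ([(1.0, -1.0)], [(2.0, -2.0)]),
]

print(maml_objective(0.0, tasks))
```

Meta-training would minimize `maml_objective` with respect to `theta`, which requires differentiating through `adapt`; that step is omitted here and would normally be handled by an automatic differentiation framework.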
In particular, we want a hierarchical Bayesian model that includes random variables for the prior distribution over function parameters, $\theta$, the distribution over parameters for a particular task, $\phi_i$, and the task training and test datapoints. This graphical model is illustrated in Figure 1 (left), where tasks are indexed over $i$ and datapoints are indexed over $j$. We will use the shorthand $x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}}$ to denote the sets of datapoints $\{x_{i,j}^{\text{tr}} \mid \forall j\}$, $\{y_{i,j}^{\text{tr}} \mid \forall j\}$, $\{x_{i,j}^{\text{test}} \mid \forall j\}$, $\{y_{i,j}^{\text{test}} \mid \forall j\}$, and $\mathcal{D}_i^{\text{tr}}, \mathcal{D}_i^{\text{test}}$ to denote $\{x_i^{\text{tr}}, y_i^{\text{tr}}\}$ and $\{x_i^{\text{test}}, y_i^{\text{test}}\}$.

4.1 Gradient-Based Meta-Learning with Variational Inference

In the graphical model in Figure 1, the predictions for each task are determined by the task-specific model parameters $\phi_i$. At meta-test time, these parameters are influenced by the prior $p(\phi_i \mid \theta)$, as well as by the observed training data $x^{\text{tr}}, y^{\text{tr}}$. The test inputs $x^{\text{test}}$ are also observed, but the test outputs $y^{\text{test}}$, which need to be predicted, are not observed. Note that $\phi_i$ is thus independent of $x^{\text{test}}$, but not of $x^{\text{tr}}, y^{\text{tr}}$. Therefore, posterior inference over $\phi_i$ must take into account both the evidence (training set) and the prior imposed by $p(\theta)$ and $p(\phi_i \mid \theta)$. Conventional MAML can be interpreted as approximating maximum a posteriori inference under a simplified model where $p(\theta)$ is a delta function, and inference is performed by running gradient descent on $\log p(y^{\text{tr}} \mid x^{\text{tr}}, \phi_i)$ for a fixed number of iterations starting from $\phi_i^0 = \mathbb{E}[\theta]$ [15]. The corresponding distribution $p(\phi_i \mid \theta)$ is approximately Gaussian, with a mean that depends on the step size and number of gradient steps. When $p(\theta)$ is not deterministic, we must make a further approximation to account for the random variable $\theta$.

One way we can do this is by using structured variational inference. In structured variational inference, we approximate the distribution over the hidden variables $\theta$ and $\phi_i$ for each task with some approximate distribution $q_i(\theta, \phi_i)$.
There are two reasonable choices we can make for $q_i(\theta, \phi_i)$. First, we can approximate it as a product of independent marginals, according to $q_i(\theta, \phi_i) = q_i(\theta) q_i(\phi_i)$. However, this approximation does not permit uncertainty to propagate effectively from $\theta$ to $\phi_i$. A more expressive approximation is the structured variational approximation $q_i(\theta, \phi_i) = q_i(\theta) q_i(\phi_i \mid \theta)$. We can further avoid storing a separate variational distribution $q_i(\phi_i \mid \theta)$ and $q_i(\theta)$ for each task $\mathcal{T}_i$ by employing an amortized variational inference technique [22, 21, 36], where we instead set $q_i(\phi_i \mid \theta) = q_\psi(\phi_i \mid \theta, x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$, where $q_\psi$ is defined by some function approximator with parameters $\psi$ that takes $x_i^{\text{tr}}, y_i^{\text{tr}}$ as input, and the same $q_\psi$ is used for all tasks. Similarly, we can define $q_i(\theta)$ as $q_\psi(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$. We can now write down the variational lower bound on the log-likelihood as

$$
\begin{aligned}
\log p(y_i^{\text{test}} \mid x_i^{\text{test}}, x_i^{\text{tr}}, y_i^{\text{tr}}) \geq\; & \mathbb{E}_{\theta, \phi_i \sim q_\psi}\big[\log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \phi_i) + \log p(y_i^{\text{test}} \mid x_i^{\text{test}}, \phi_i) + \log p(\phi_i \mid \theta) + \log p(\theta)\big] \\
& + \mathcal{H}\big(q_\psi(\phi_i \mid \theta, x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})\big) + \mathcal{H}\big(q_\psi(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})\big).
\end{aligned}
$$

The likelihood terms on the first line can be evaluated efficiently: given a sample $\theta, \phi_i \sim q(\theta, \phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$, the training and test likelihoods simply correspond to the loss of the network with parameters $\phi_i$. The prior $p(\theta)$ can be chosen to be Gaussian, with a learned mean and (diagonal) covariance to provide for flexibility to choose the prior parameters. This corresponds to a Bayesian version of the MAML algorithm. We will define these parameters as $\mu_\theta$ and $\sigma_\theta^2$. Lastly, $p(\phi_i \mid \theta)$ must be chosen. This choice is more delicate. One way to ensure a tractable likelihood is to use a Gaussian with mean $\theta$.
This choice is reasonable, because it encourages $\phi_i$ to stay close to the prior parameters $\theta$, but we will see in the next section how a more expressive implicit conditional can be obtained using gradient descent, resulting in a procedure that more closely resembles the original MAML algorithm while still modeling the uncertainty. Lastly, we must choose a form for the inference networks $q_\psi(\phi_i \mid \theta, x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$ and $q_\psi(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$. They must be chosen so that their entropies on the second line of the above equation are tractable. Furthermore, note that both of these distributions model very high-dimensional random variables: a deep neural network can have hundreds of thousands or millions of parameters. So while we can use an arbitrary function approximator, we would like to find a scalable solution.

One convenient solution is to allow $q_\psi$ to reuse the learned mean of the prior $\mu_\theta$. We observe that adapting the parameters with gradient descent is a good way to update them to a given training set $x_i^{\text{tr}}, y_i^{\text{tr}}$ and test set $x_i^{\text{test}}, y_i^{\text{test}}$, a design decision similar to one made by Fortunato et al. [11]. We propose an inference network of the form

$$q_\psi(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}}) = \mathcal{N}\big(\mu_\theta + \gamma_q \nabla_{\mu_\theta} \log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \mu_\theta) + \gamma_q \nabla_{\mu_\theta} \log p(y_i^{\text{test}} \mid x_i^{\text{test}}, \mu_\theta);\; v_q\big),$$

where $v_q$ is a learned (diagonal) covariance, and the mean has an additional parameter beyond $\mu_\theta$, which is a learning rate vector $\gamma_q$ that is pointwise multiplied with the gradient. While this choice may at first seem arbitrary, there is a simple intuition: the inference network should produce a sample of $\theta$ that is close to the posterior $p(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$. A reasonable way to arrive at a value of $\theta$ close to this posterior is to adapt it to both the training set and test set.² Note that this is only done during meta-training.
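The gradient-based inference network above can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the two-parameter linear model $f(x) = \mu_0 x + \mu_1$, the unit-variance Gaussian likelihood, and the names `grad_log_lik`, `q_mean`, and `sample_q` are hypothetical. The mean is $\mu_\theta$ shifted by the elementwise-scaled log-likelihood gradients on both datasets, and a sample is drawn with the reparameterization $\theta = \text{mean} + \sqrt{v_q}\,\epsilon$.

```python
import math
import random


def grad_log_lik(mu, data):
    """Gradient of a unit-variance Gaussian log-likelihood with respect to
    the parameters of the assumed model f(x) = mu[0] * x + mu[1]."""
    g = [0.0, 0.0]
    for x, y in data:
        residual = y - (mu[0] * x + mu[1])
        g[0] += residual * x
        g[1] += residual
    return g


def q_mean(mu, gamma_q, d_tr, d_test):
    """Mean of q: mu + gamma_q * grad log p(D_tr) + gamma_q * grad log p(D_test),
    with gamma_q applied elementwise."""
    g_tr = grad_log_lik(mu, d_tr)
    g_te = grad_log_lik(mu, d_test)
    return [m + g * (a + b) for m, g, a, b in zip(mu, gamma_q, g_tr, g_te)]


def sample_q(mu, gamma_q, v_q, d_tr, d_test, rng):
    """Reparameterized sample: theta = mean + sqrt(v_q) * eps, eps ~ N(0, 1)."""
    mean = q_mean(mu, gamma_q, d_tr, d_test)
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, v_q)]


rng = random.Random(0)
theta = sample_q([0.0, 0.0], [0.1, 0.1], [0.01, 0.01],
                 [(1.0, 2.0)], [(2.0, 4.0)], rng)
print(theta)
```

Because the noise enters only through `eps`, the sample is a deterministic function of `mu`, `gamma_q`, and `v_q` for a fixed draw, which is what allows gradients to flow to those parameters in an autodiff framework.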
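It remains to choose $q_\psi(\phi_i \mid \theta, x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}, y_i^{\text{test}})$, which can also be formulated as a conditional Gaussian with mean given by applying gradient descent. Although this variational distribution is substantially more compact in terms of parameters than a separate neural network, it only provides estimates of the posterior during meta-training. At meta-test time, we must obtain the posterior $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}})$, without access to $y_i^{\text{test}}$. We can train a separate set of inference networks to perform this operation, potentially also using gradient descent within the inference network. However, these networks do not receive any gradient information during meta-training, and may not work well in practice. In the next section we propose an even simpler and more practical approach that uses only a single inference network during meta-training, and none during meta-testing.

² In practice, we can use multiple gradient steps for the mean, but we omit this for notational simplicity.

Algorithm 1 Meta-training (differences from MAML are shown in red in the original)
Require: $p(\mathcal{T})$: distribution over tasks
1: initialize $\Theta := \{\mu_\theta, \sigma_\theta^2, v_q, \gamma_p, \gamma_q\}$
2: while not done do
3:   Sample batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
4:   for all $\mathcal{T}_i$ do
5:     $\mathcal{D}^{\text{tr}}, \mathcal{D}^{\text{test}} = \mathcal{T}_i$
6:     Evaluate $\nabla_{\mu_\theta} \mathcal{L}(\mu_\theta, \mathcal{D}^{\text{test}})$
7:     Sample $\theta \sim q = \mathcal{N}(\mu_\theta - \gamma_q \nabla_{\mu_\theta} \mathcal{L}(\mu_\theta, \mathcal{D}^{\text{test}}), v_q)$
8:     Evaluate $\nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}})$
9:     Compute adapted parameters with gradient descent: $\phi_i = \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}})$
10:    Let $p(\theta \mid \mathcal{D}^{\text{tr}}) = \mathcal{N}(\mu_\theta - \gamma_p \nabla_{\mu_\theta} \mathcal{L}(\mu_\theta, \mathcal{D}^{\text{tr}}), \sigma_\theta^2)$
11:  Compute $\nabla_\Theta \big( \sum_{\mathcal{T}_i} \mathcal{L}(\phi_i, \mathcal{D}^{\text{test}}) + D_{\text{KL}}(q(\theta \mid \mathcal{D}^{\text{test}}) \,\|\, p(\theta \mid \mathcal{D}^{\text{tr}})) \big)$
12:  Update $\Theta$ using Adam

Algorithm 2 Meta-testing
Require: training data $\mathcal{D}^{\text{tr}}_{\mathcal{T}}$ for new task $\mathcal{T}$
Require: learned $\Theta$
1: Sample $\theta$ from the prior $p(\theta \mid \mathcal{D}^{\text{tr}})$
2: Evaluate $\nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}})$
3: Compute adapted parameters with gradient descent: $\phi_i = \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\text{tr}})$

4.2 Probabilistic Model-Agnostic Meta-Learning Approach with Hybrid Inference

To formulate a simpler variational meta-learning procedure, we recall the probabilistic interpretation of MAML: as discussed by Grant et al.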
[15], MAML can be interpreted as approximate inference for the posterior $p(y_i^{\text{test}} \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}})$ according to

$$p(y_i^{\text{test}} \mid x_i^{\text{tr}}, y_i^{\text{tr}}, x_i^{\text{test}}) = \int p(y_i^{\text{test}} \mid x_i^{\text{test}}, \phi_i)\, p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta)\, d\phi_i \approx p(y_i^{\text{test}} \mid x_i^{\text{test}}, \phi_i^\star), \quad (1)$$

where we use the maximum a posteriori (MAP) value $\phi_i^\star$. It can be shown that, for likelihoods that are Gaussian in $\phi_i$, gradient descent for a fixed number of iterations using $x_i^{\text{tr}}, y_i^{\text{tr}}$ corresponds exactly to maximum a posteriori inference under a Gaussian prior $p(\phi_i \mid \theta)$ [34]. In the case of non-Gaussian likelihoods, the equivalence is only locally approximate, and the exact form of the prior $p(\phi_i \mid \theta)$ is intractable. However, in practice this implicit prior can actually be preferable to an explicit (and simple) Gaussian prior, since it incorporates the rich nonlinear structure of the neural network parameter manifold, and produces good performance in practice [9, 15]. We can interpret this MAP approximation as inferring an approximate posterior on $\phi_i$ of the form $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta) \approx \delta(\phi_i = \phi_i^\star)$, where $\phi_i^\star$ is obtained via gradient descent on the training set $x_i^{\text{tr}}, y_i^{\text{tr}}$ starting from $\theta$. Incorporating this approximate inference procedure transforms the graphical model in Figure 1 (a) into the one in Figure 1 (b), where there is now a factor over $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta)$. While this is a crude approximation to the likelihood, it provides us with an empirically effective and simple tool that greatly simplifies the variational inference procedure described in the previous section, in the case where we aim to model a distribution over the global parameters $p(\theta)$. After using gradient descent to estimate $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta)$, the graphical model is transformed into the model shown in the center of Figure 1. Note that, in this new graphical model, the global parameters $\theta$ are independent of $x^{\text{tr}}$ and $y^{\text{tr}}$ and are independent of $x^{\text{test}}$ when $y^{\text{test}}$ is not observed.
Thus, we can now write down a variational lower bound for the logarithm of the approximate likelihood, which is given by

$$\log p(y_i^{\text{test}} \mid x_i^{\text{test}}, x_i^{\text{tr}}, y_i^{\text{tr}}) \geq \mathbb{E}_{\theta \sim q_\psi}\big[\log p(y_i^{\text{test}} \mid x_i^{\text{test}}, \phi_i^\star) + \log p(\theta)\big] + \mathcal{H}\big(q_\psi(\theta \mid x_i^{\text{test}}, y_i^{\text{test}})\big).$$

In this bound, we essentially perform approximate inference via MAP on $\phi_i$ to obtain $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta)$, and use the variational distribution for $\theta$ only. Note that $q_\psi(\theta \mid x_i^{\text{test}}, y_i^{\text{test}})$ is not conditioned on the training set $x_i^{\text{tr}}, y_i^{\text{tr}}$ since $\theta$ is independent of it in the transformed graphical model. Analogously to the previous section, the inference network is given by

$$q_\psi(\theta \mid x_i^{\text{test}}, y_i^{\text{test}}) = \mathcal{N}\big(\mu_\theta + \gamma_q \nabla_{\mu_\theta} \log p(y_i^{\text{test}} \mid x_i^{\text{test}}, \mu_\theta);\; v_q\big).$$

To evaluate the variational lower bound during training, we can use the following procedure: first, we evaluate the mean by starting from $\mu_\theta$ and taking one (or more) gradient steps on $\log p(y_i^{\text{test}} \mid x_i^{\text{test}}, \theta_{\text{current}})$, where $\theta_{\text{current}}$ starts at $\mu_\theta$. We then add noise with variance $v_q$, which is made differentiable via the reparameterization trick [22]. We then take additional gradient steps on the training likelihood $\log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \theta_{\text{current}})$. This accounts for the MAP inference procedure on $\phi_i$. Training of $\mu_\theta$, $\sigma_\theta^2$, and $v_q$ is performed by backpropagating gradients through this entire procedure with respect to the variational lower bound, which includes a term for the likelihood $\log p(y_i^{\text{test}} \mid x_i^{\text{test}}, x_i^{\text{tr}}, y_i^{\text{tr}}, \phi_i^\star)$ and the KL-divergence between the distribution of the sample $\theta \sim q_\psi$ and the prior $p(\theta)$. This meta-training procedure is detailed in Algorithm 1.

At meta-test time, the inference procedure is much simpler. The test labels are not available, so we simply sample $\theta \sim p(\theta)$ and perform MAP inference on $\phi_i$ using the training set, which corresponds to gradient steps on $\log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \theta_{\text{current}})$, where $\theta_{\text{current}}$ starts at the sampled $\theta$. This meta-testing procedure is detailed in Algorithm 2.
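The procedure of this section can be sketched for a single task as follows. This is a minimal illustration under loud assumptions, not the released implementation: it uses a scalar model $f(x) = \theta x$ with a unit-variance Gaussian likelihood, a single gradient step at each stage, the unconditioned prior $p(\theta) = \mathcal{N}(\mu_\theta, \sigma_\theta^2)$, and hypothetical names `nll`, `nll_grad`, `kl_gauss`, `task_objective`, and `meta_test_sample`. It only evaluates the negated lower bound (test negative log-likelihood plus KL term); backpropagation through the procedure is not shown and would need an autodiff framework.

```python
import math
import random


def nll(theta, data):
    """Gaussian negative log-likelihood (unit variance, constants dropped)
    of the assumed scalar model f(x) = theta * x."""
    return sum(0.5 * (y - theta * x) ** 2 for x, y in data)


def nll_grad(theta, data):
    """Analytic gradient of `nll` with respect to theta."""
    return sum(-(y - theta * x) * x for x, y in data)


def kl_gauss(mean_q, var_q, mean_p, var_p):
    """KL(N(mean_q, var_q) || N(mean_p, var_p)) for scalar Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mean_q - mean_p) ** 2) / var_p - 1.0)


def task_objective(mu, var_theta, v_q, gamma_q, alpha, d_tr, d_test, rng):
    """Per-task meta-training loss: L(phi, D_test) + KL(q || p)."""
    mean_q = mu - gamma_q * nll_grad(mu, d_test)            # gradient step on the test set
    theta = mean_q + math.sqrt(v_q) * rng.gauss(0.0, 1.0)   # reparameterized noise
    phi = theta - alpha * nll_grad(theta, d_tr)             # MAP step on the training set
    return nll(phi, d_test) + kl_gauss(mean_q, v_q, mu, var_theta)


def meta_test_sample(mu, var_theta, alpha, d_tr, rng):
    """Meta-test: sample theta from the prior, then adapt on the training set."""
    theta = mu + math.sqrt(var_theta) * rng.gauss(0.0, 1.0)
    return theta - alpha * nll_grad(theta, d_tr)


rng = random.Random(0)
d_tr, d_test = [(1.0, 2.0)], [(2.0, 4.0)]
print(task_objective(0.0, 1.0, 0.1, 0.05, 0.1, d_tr, d_test, rng))
print([meta_test_sample(0.0, 1.0, 0.1, d_tr, rng) for _ in range(3)])
```

Calling `meta_test_sample` repeatedly yields different adapted parameters for the same training set, which is the mechanism that produces multiple candidate models for an ambiguous task.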
4.3 Adding Additional Dependencies

In the transformed graphical model, the training data $x_i^{\text{tr}}, y_i^{\text{tr}}$ and the prior $\theta$ are conditionally independent. However, since we have only a crude approximation to $p(\phi_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}, \theta)$, this independence often does not actually hold. We can allow the model to compensate for this approximation by additionally conditioning the learned prior $p(\theta)$ on the training data. In this case, the learned prior has the form $p(\theta_i \mid x_i^{\text{tr}}, y_i^{\text{tr}})$, where $\theta_i$ is now task-specific, but with global parameters $\mu_\theta$ and $\sigma_\theta^2$. We thus obtain the modified graphical model in Figure 1 (c). Similarly to the inference network $q_\psi$, we parameterize the learned prior as follows:

$$p(\theta_i \mid x_i^{\text{tr}}, y_i^{\text{tr}}) = \mathcal{N}\big(\mu_\theta + \gamma_p \nabla_{\mu_\theta} \log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \mu_\theta);\; \sigma_\theta^2\big).$$

With this new form for the distribution over $\theta$, the variational training objective uses the likelihood term $\log p(\theta_i \mid x_i^{\text{tr}}, y_i^{\text{tr}})$ in place of $\log p(\theta)$, but otherwise is left unchanged. At test time, we sample $\theta \sim p(\theta \mid x_i^{\text{tr}}, y_i^{\text{tr}})$ by first taking gradient steps on $\log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \theta_{\text{current}})$, where $\theta_{\text{current}}$ is initialized at $\mu_\theta$, and then adding noise with variance $\sigma_\theta^2$. Then, we proceed as before, performing MAP inference on $\phi_i$ by taking additional gradient steps on $\log p(y_i^{\text{tr}} \mid x_i^{\text{tr}}, \theta_{\text{current}})$ initialized at the sample $\theta$. In our experiments, we find that this more expressive distribution often leads to better performance.

5 Experiments

The goal of our experimental evaluation is to answer the following questions: (1) can our approach enable sampling from the distribution over potential functions underlying the training data? (2) does our approach improve upon the MAML algorithm when there is ambiguity over the class of functions? and (3) can our approach scale to deep convolutional networks? We study two illustrative toy examples and a realistic ambiguous few-shot image classification problem. For both experimental domains, we compare MAML to our probabilistic approach.
We will refer to our version of MAML as PLATIPUS (Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning), due to its unusual combination of two approximate inference methods: amortized inference and MAP. Both PLATIPUS and MAML use the same neural network architecture and the same number of inner gradient steps. We additionally provide a comparison on the MiniImagenet benchmark and specify the hyperparameters in the supplementary appendix.

Illustrative 5-shot regression. In this 1D regression problem, different tasks correspond to different underlying functions. Half of the functions are sinusoids, and half are lines, such that the task distribution is clearly multimodal. The sinusoids have amplitude and phase uniformly sampled from the ranges $[0.1, 5]$ and $[0, \pi]$, respectively, and the lines have the slope and intercept sampled in the range $[-3, 3]$. The input domain is uniform on $[-5, 5]$, and Gaussian noise with a standard deviation of 0.3 is added to the labels. We trained both MAML and PLATIPUS for 5-shot regression. In Figure 2, we show the qualitative performance of both methods, where the ground truth underlying function is shown in gray and the datapoints in $\mathcal{D}^{\text{tr}}$ are shown as purple triangles. We show the function $f_{\phi_i}$ learned by MAML in black. For PLATIPUS, we sample 10 sets of parameters from $p(\phi_i \mid \theta)$ and plot the resulting functions in different colors. In the top row, we can see that PLATIPUS allows the model to effectively reason over the set of functions underlying the provided datapoints, with increased variance in parts of the function where there is more uncertainty. Further, we see that PLATIPUS is able to capture the multimodal structure, as the curves are all linear or sinusoidal.

A particularly useful application of uncertainty estimates in few-shot learning is estimating when more data would be helpful.
In particular, seeing a large variance in a particular part of the input space suggests that more data would be helpful for learning the function in that part of the input space. On the bottom of Figure 2, we show the results for a single task at meta-test time with increasing numbers of training datapoints. Even though the model was only trained on training set sizes of 5 datapoints, we observe that PLATIPUS is able to effectively reduce its uncertainty as more and more datapoints are available. This suggests that the uncertainty provided by PLATIPUS can be used for approximately gauging when more data would be helpful for learning a new task.

Figure 2: Samples from PLATIPUS trained for 5-shot regression, shown as colored dotted lines. The tasks consist of regressing to sinusoid and linear functions, shown in gray. MAML, shown in black, is a deterministic procedure and hence learns a single function, rather than reasoning about the distribution over potential functions. As seen on the bottom row, even though PLATIPUS is trained for 5-shot regression, it can effectively reason over its uncertainty when provided variable numbers of datapoints at test time (left vs. right).

Figure 3: Qualitative examples from the active learning experiment where the 5 provided datapoints are from a small region of the input space (shown as purple triangles), and the model actively asks for labels for new datapoints (shown as blue circles) by choosing datapoints with the largest variance across samples. The model is able to effectively choose points that lead to accurate predictions with only a few extra datapoints.

Figure 4: Active learning performance on regression after up to 5 selected datapoints. PLATIPUS can use its uncertainty estimation to quickly decrease the error, while selecting datapoints randomly and using MAML leads to slower learning.

Active learning with regression. To further evaluate the benefit of modeling ambiguity, we now consider an active learning experiment.
In particular, the model can choose the datapoints that it wants labels for, with the goal of reaching good performance with a minimal number of additional datapoints. We performed this evaluation in the simple regression setting described previously. Models were given five initial datapoints within a constrained region of the input space. Then, each model selects up to 5 additional datapoints to be labeled. PLATIPUS chose each datapoint sequentially, choosing the point with maximal variance across the sampled regressors; MAML selected datapoints randomly, as it has no mechanism to model ambiguity. As seen in Figure 4, PLATIPUS is able to reduce its regression error to a much greater extent when given one to three additional queries, compared to MAML. We show qualitative results in Figure 3.

Figure 5: Samples from PLATIPUS for 1-shot classification, shown as colored dotted lines. The 2D classification tasks all involve circular decision boundaries of varying size and center, shown in gray. MAML, shown in black, is a deterministic procedure and hence learns a single function, rather than reasoning about the distribution over potential functions.

Illustrative 1-shot 2D classification. Next, we study a simple binary classification task, where there is a particularly large amount of ambiguity surrounding the underlying function: learning to learn from a single positive example. Here, the tasks consist of classifying datapoints in 2D within the range $[0, 5]$ with a circular decision boundary, where points inside the decision boundary are positive and points outside are negative. Different tasks correspond to different locations and radii of the decision boundary, sampled uniformly at random from the ranges $[1.0, 4.0]$ and $[0.1, 2.0]$ respectively. Following Grant et al. [14], we train both MAML and PLATIPUS with $\mathcal{D}^{\text{tr}}$ consisting of a single positive example and $\mathcal{D}^{\text{test}}$ consisting of both positive and negative examples.
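The acquisition rule from the active learning experiment (query the input with maximal variance across the sampled regressors) can be sketched as below. This is an illustrative sketch only: the candidate grid, the representation of sampled regressors as Python callables, and the names `predictive_variance` and `select_query` are assumptions, and the three toy regressors stand in for functions $f_{\phi_i}$ obtained from sampled parameters.

```python
def predictive_variance(models, x):
    """Variance of the predictions at input x across sampled regressors."""
    preds = [f(x) for f in models]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)


def select_query(models, candidates):
    """Return the candidate input where the sampled regressors disagree most."""
    return max(candidates, key=lambda x: predictive_variance(models, x))


# Hypothetical sampled regressors that agree near x = 0 and diverge elsewhere.
models = [lambda x: x, lambda x: -x, lambda x: 0.0]
candidates = [-1.0, 0.0, 3.0]

print(select_query(models, candidates))
```

In a sequential setting, the selected point would be labeled and added to the training set, new regressors would be sampled and adapted, and the rule would be applied again.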
We plot the results using the same scheme as before, except that we plot the decision boundary (rather than the regression function) and visualize the single positive datapoint with a green plus. As seen in Figure 5, PLATIPUS captures a broad distribution over possible decision boundaries, all of which are roughly circular. MAML provides a single decision boundary of average size.

Ambiguous image classification. The ambiguity illustrated in the previous settings is common in real-world tasks where images can share multiple attributes. We study an ambiguous extension to the CelebA attribute classification task. Our meta-training dataset is formed by sampling two attributes at random to form a positive class and taking the same number of random examples without either attribute to form the negative class. To evaluate the ability to capture multiple decision boundaries while simultaneously obtaining good performance, we evaluate our method as follows: we sample from a test set of three attributes and a corresponding set of images with those attributes. Since the tasks involve classifying images that have two attributes, this task is ambiguous, and there are three possible combinations of two attributes that explain the training set. We sample models from our prior as described in Section 4 and assign each of the sampled models to one of the three possible tasks based on its log-likelihood. If each of the three possible tasks is assigned a nonzero number of samples, this means that the model effectively covers all three possible modes that explain the ambiguous training set. We can measure coverage and accuracy from this protocol. The coverage score indicates the average number of tasks (between 1 and 3) that receive at least one sample for each ambiguous training set, and the accuracy score is the average number of correct classifications on these tasks (according to the sampled models assigned to them).
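The coverage score can be sketched as follows. This is a hedged reading of the protocol rather than the evaluation code used for the paper: it assumes each sampled model is assigned to the candidate task on which it attains the highest log-likelihood, and the names `coverage` and `mean_coverage` as well as the toy log-likelihood values are hypothetical.

```python
def coverage(log_likelihoods):
    """Number of candidate tasks that receive at least one sampled model.

    log_likelihoods[s][t] is the log-likelihood of sampled model s on
    candidate task t; each model is assigned to its best task (an assumption).
    """
    assigned = {max(range(len(row)), key=row.__getitem__) for row in log_likelihoods}
    return len(assigned)


def mean_coverage(per_training_set):
    """Average coverage over several ambiguous training sets."""
    return sum(coverage(ll) for ll in per_training_set) / len(per_training_set)


# Three sampled models scored on the three two-attribute tasks (toy values).
sampled = [
    [-1.0, -2.0, -3.0],
    [-3.0, -1.0, -2.0],
    [-1.0, -5.0, -5.0],
]
# A deterministic method contributes a single model per training set.
deterministic = [[-1.0, -2.0, -3.0]]

print(coverage(sampled), coverage(deterministic))
```

With this definition a single deterministic model always yields a coverage of 1, and the maximum attainable value is the number of candidate tasks.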
A highly random method will achieve good coverage but poor accuracy, while a deterministic method will have a coverage of 1. We additionally compute the log-likelihood across the ambiguous tasks, which compares each method's ability to model all of the modes. As is standard in amortized variational inference (e.g., with VAEs), we put a multiplier β in front of the KL-divergence against the prior [17] in Algorithm 1. We find that larger values result in more diverse samples, at a modest cost in performance, and therefore report two different values of β to illustrate this tradeoff.

Our results are summarized in Table 1 and Fig. 6. Our method attains a better log-likelihood than standard MAML, with comparable accuracy. More importantly, deterministic MAML only ever captures one mode for each ambiguous task, where the maximum is three. Our method captures closer to two modes on average. The qualitative analysis in Figure 6 illustrates³ an example ambiguous training set, example images for the three possible two-attribute pairs that can correspond to this training set, and the classifications made by different sampled classifiers trained on the ambiguous training set. Note that the different samples each pay attention to different attributes, indicating that PLATIPUS is effective at capturing the different modes of the task.

³ Additional qualitative results and code can be found at https://sites.google.com/view/probabilistic-maml/

[Figure 6 image omitted; panels (a) and (b), with attribute labels Mouth Open, Young, and Wearing Hat.]

Figure 6: Sampled classifiers for an ambiguous meta-test task. In the meta-test training set (a), PLATIPUS observes five positives that share three attributes, and five negatives. A classifier that uses any two attributes can correctly classify the training set. On the right (b), we show the three possible two-attribute tasks that the training set can correspond to, and illustrate the labels (positive indicated by purple border) predicted by the best sampled classifier for that task. We see that different samples can effectively capture the three possible explanations, with some samples paying attention to hats (2nd and 3rd column) and others not (1st column).

| Ambiguous CelebA (5-shot) | Accuracy | Coverage (max=3) | Average NLL |
| MAML | 89.00 ± 1.78% | 1.00 ± 0.0 | 0.73 ± 0.06 |
| MAML + noise | 84.3 ± 1.60% | 1.89 ± 0.04 | 0.68 ± 0.05 |
| PLATIPUS (ours) (KL weight = 0.05) | 88.34 ± 1.06% | 1.59 ± 0.03 | 0.67 ± 0.05 |
| PLATIPUS (ours) (KL weight = 0.15) | 87.8 ± 1.03% | 1.94 ± 0.04 | 0.56 ± 0.04 |

Table 1: Our method covers almost twice as many tasks compared to MAML, with comparable accuracy. MAML + noise is a method that adds noise to the gradient, but does not perform variational inference. This improves coverage, but results in lower accuracy and a lower average log-likelihood than PLATIPUS. We bold results above the highest confidence interval lower bound.

6 Discussion and Future Work

We introduced an algorithm for few-shot meta-learning that enables simple and effective sampling of models for new tasks at meta-test time. Our algorithm, PLATIPUS, adapts to new tasks by running gradient descent with injected noise. During meta-training, the model parameters are optimized with respect to a variational lower bound on the likelihood for the meta-training tasks, so as to enable this simple adaptation procedure to produce approximate samples from the model posterior when conditioned on a few-shot training set. This approach has a number of benefits. The adaptation procedure is exceedingly simple, and the method can be applied to any standard model architecture. The algorithm introduces a modest number of additional parameters: besides the initial model weights, we must learn a variance on each parameter for the inference network and prior, and the number of parameters scales only linearly with the number of model weights.
Our experimental results show that our method can be used to effectively sample diverse solutions to both regression and classification tasks at meta-test time, including with task families that have multi-modal task distributions. We additionally showed how our approach can be applied in settings where uncertainty can directly guide data acquisition, leading to better few-shot active learning.

Although our approach is simple and broadly applicable, it has potential limitations that could be addressed in future work. First, the current form of the method provides a relatively impoverished estimator of posterior variance, which might be less effective at gauging uncertainty in settings where different tasks have different degrees of ambiguity. In such settings, making the variance estimator dependent on the few-shot training set might produce better results, and investigating how to do this in a parameter efficient manner would be an interesting direction for future work. Another exciting direction for future research would be to study how our approach could be applied in RL settings for acquiring structured, uncertainty-guided exploration strategies in meta-RL problems.

Acknowledgments

We thank Marvin Zhang and Dibya Ghosh for feedback on an early draft of this paper. This research was supported by an NSF Graduate Research Fellowship, NSF IIS-1651843, the Office of Naval Research, and NVIDIA.

References

[1] D. Barber and C. M. Bishop. Ensemble learning for multi-layer networks. In Neural Information Processing Systems (NIPS), 1998.
[2] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[3] D. A. Braun, C. Mehring, and D. M. Wolpert. Structure learning in action. Behavioural brain research, 2010.
[4] H. Daumé III. Bayesian multitask learning with latent hierarchies. In Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
[5] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[6] H. Edwards and A. Storkey. Towards a neural statistician. In International Conference on Learning Representations (ICLR), 2017.
[7] L. Fei-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[8] C. Finn and S. Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
[9] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), 2017.
[10] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
[11] M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
[12] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2008.
[13] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[14] E. Grant, C. Finn, J. Peterson, J. Abbott, S. Levine, T. Darrell, and T. Griffiths. Concept acquisition through meta-learning. In NIPS Workshop on Cognitively Informed Artificial Intelligence, 2017.
[15] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations (ICLR), 2018.
[16] A. Graves. Practical variational inference for neural networks. In Neural Information Processing Systems (NIPS), 2011.
[17] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. Early visual concept learning with unsupervised deep learning. International Conference on Learning Representations (ICLR), 2017.
[18] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational learning theory, 1993.
[19] S. Hochreiter, A. Younger, and P. Conwell. Learning to learn using gradient descent. International Conference on Artificial Neural Networks (ICANN), 2001.
[20] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 2013.
[21] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems (NIPS), 2016.
[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshki, W. Chung, and D. Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.
[24] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
[25] N. D. Lawrence and J. C. Platt. Learning to learn with the informative vector machine. In International Conference on Machine Learning (ICML), page 65, 2004.
[26] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
[27] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 1992.
[28] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. International Conference on Learning Representations, 2018.
[29] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[30] A. Nichol and J. Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.
[31] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.
[32] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[33] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
[34] R. J. Santos. Equivalence of regularization and truncated iteration for general ill-posed problems. Linear Algebra and its Applications, 1996.
[35] J. Schmidhuber. Evolutionary principles in self-referential learning. PhD thesis, Institut für Informatik, Technische Universität München, 1987.
[36] R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon. Amortized inference regularization. arXiv preprint arXiv:1805.08913, 2018.
[37] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In Neural Information Processing Systems (NIPS), 2017.
[38] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. CoRR, abs/1711.06025, 2017. URL http://arxiv.org/abs/1711.06025.
[39] J. B. Tenenbaum. A Bayesian framework for concept learning. PhD thesis, Massachusetts Institute of Technology, 1999.
[40] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS), 2016.
[41] J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer's disease. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[42] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In European Conference on Computer Vision (ECCV), 2016.
[43] K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In International Conference on Machine Learning (ICML), 2005.