# Function Contrastive Learning of Transferable Meta-Representations

Muhammad Waleed Gondal¹, Shruti Joshi¹, Nasim Rahaman¹ ², Stefan Bauer¹ ³, Manuel Wüthrich¹, Bernhard Schölkopf¹

¹Max Planck Institute for Intelligent Systems, Tübingen, Germany. ²Mila, University of Montreal, Montreal, Canada. ³CIFAR Azrieli Global Scholar. Correspondence to: Muhammad Waleed Gondal.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Meta-learning algorithms adapt quickly to new tasks that are drawn from the same task distribution as the training tasks. The mechanism leading to fast adaptation is the conditioning of a downstream predictive model on the inferred representation of the task's underlying data-generating process, or function. This meta-representation, which is computed from a few observed examples of the underlying function, is learned jointly with the predictive model. In this work, we study the implications of this joint training on the transferability of the meta-representations. Our goal is to learn meta-representations that are robust to noise in the data and facilitate solving a wide range of downstream tasks that share the same underlying functions. To this end, we propose a decoupled encoder-decoder approach to supervised meta-learning, where the encoder is trained with a contrastive objective to find a good representation of the underlying function. In particular, our training scheme is driven by the self-supervision signal indicating whether two sets of examples stem from the same function. Our experiments on a number of synthetic and real-world datasets show that the representations we obtain outperform strong baselines in terms of downstream performance and noise robustness, even when these baselines are trained in an end-to-end manner.

1. Introduction

Many supervised learning problems are concerned with approximating a data-generating function $f : \mathcal{X} \to \mathcal{Y}$ given a finite set of $N$ samples, $\{x_i, y_i = f(x_i)\}_{i=1}^{N}$. Expressive models, such as deep neural networks, are known to excel at this function approximation task, but they often rely heavily on the number of samples $N$ being large. This poses further challenges: in many domains of interest, sourcing enough data is a challenging endeavour; further, the process of training such models can be prohibitively slow for many applications. This is exacerbated by the fact that, in the typical setting, each new data-generating function encountered requires that the model be retrained. In other words, the model is not shared between data-generating functions, even when training a model to approximate one function can potentially be beneficial for approximating another function.

To overcome these challenges, a variety of meta-learning methods have been proposed (Vinyals et al., 2016; Snell et al., 2017; Garnelo et al., 2018a; Ravi & Larochelle, 2016; Finn et al., 2017). In the present work, we are interested in a class of models that use encoder-decoder architectures, such as Conditional Neural Processes (CNPs) (Garnelo et al., 2018a) and Generative Query Networks (GQNs) (Eslami et al., 2018). In the first stage, an encoder is used to infer a fixed-dimensional representation of a given function $f$ from just a few input-output examples $O_f = \{(x_i, y_i)\}_i$, the context dataset. We call this the meta-representation of the function, $r = h_\phi(O_f)$, where $h$ is an encoder parameterized by $\phi$.
In the second stage, the meta-representation is used to condition a predictive model in order to solve a downstream prediction task related to that function. For instance, the task may consist of predicting the function value y at unseen locations x, or classifying images after observing only a few pixels (in that case, x is the pixel location and y is the pixel value). This two-stage process has multiple benefits. First, the extraction of prior knowledge about f directly from the training data, in the form of a meta-representation, reduces the need for specifying inductive biases (model architectures, training details, etc.) particular to f. It thus allows learning to be shared between functions, such that a single model can be trained on a distribution over functions. Second, the computation of meta-representations is efficient and can be done online. Third, the computation of meta-representations provides flexibility to solve a variety of downstream tasks concerning a specific function.

However, CNPs optimize the encoder jointly with the decoder on the downstream prediction task, i.e., prediction of function values y at unseen locations x, as illustrated in Figure 1(a). This ties the quality of the meta-representation to the combined encoder and decoder performance on this particular task and thereby makes it susceptible to supervision collapse, i.e., the representations lose any information which is irrelevant for solving the training task but may be necessary for the transfer to new tasks (Doersch et al., 2020). Moreover, many real-world tasks are noisy, and the prediction task might entail reconstructing high-dimensional data, such as images in GQNs (Eslami et al., 2018). The corresponding objective function can cause the model to waste its capacity on reconstructing unimportant features such as static backgrounds and noise, while ignoring visually small but important details in its learned representation (Anand et al., 2019; Kipf et al., 2019). This is crucial for many real-world applications; for instance, in order to manipulate a small object in a complex scene, the model's ability to infer the object's shape carries more importance than inferring its color or reconstructing the static background.

Figure 1. The difference in the training of CNP (Garnelo et al., 2018a) and FCRL for learning meta-representations r of functions. (Left, a) CNP learns the aggregated representation r of the context set by maximizing the conditional likelihood of the target data. (Center, b: FCRL encoder training) Training of the FCRL encoder $h_\phi$ by contrasting context sets of different functions. Note that the target inputs $x_{N+1}$, etc., are not used at this stage. (Right, c: FCRL transfer) Using the pre-trained FCRL encoder $h_\phi$, we train separate decoders $p_\psi$ for each downstream task, shown in grey boxes. The dotted arrows indicate the transfer of inferred meta-representations to the tasks.

In this work, we study the generalization of a function's meta-representations in terms of their transferability to downstream tasks and their robustness to noise. We empirically show that the joint optimization of meta-representations and a prediction task is detrimental to the transferability of meta-representations and makes them vulnerable to noise.
To address this issue, we propose a decoupled encoder-decoder training scheme, wherein the encoder is exclusively trained by a novel contrastive learning framework which we call FCRL (Function Contrastive Representation Learning). Instead of relying on reconstructions, it learns by contrasting sets of input-output pairs sampled from different functions. The key idea is that two sets of samples from the same function should have similar latent representations, while representations of sets of samples from different functions should remain easily distinguishable. FCRL retains the useful properties of meta-representations, such as shared learning and sample efficiency, while improving their transferability to downstream tasks and robustness to noise. Unlike contemporary meta-learning algorithms, meta-representations in FCRL are explicitly optimized over a distribution of functions rather than tasks.

To evaluate the effectiveness of the proposed method, we conduct comprehensive experiments on diverse downstream problems, including classification, regression, parameter identification, scene understanding, scene reconstruction and reinforcement learning. We consider different datasets, ranging from simple 1D and 2D regression to challenging simulated and real-world scenes. In particular, we find that a downstream predictor trained with our (pre-trained) encoder compares favorably to related methods on these tasks, including ones where the predictor is trained jointly with the encoder.

2. Preliminaries

2.1. Problem Setting

Consider a distribution over data-generating functions p(f). Let f be a sample from this distribution, $f \sim p(f)$, where $f : \mathcal{X} \to \mathcal{Y}$ with $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}^{d'}$:

$$y = f(x, \xi); \quad \xi \sim \mathcal{Z} \quad (1)$$

where $\xi$ is sampled from some noise distribution $\mathcal{Z}$. Let $O_f = \{(x_i, y_i)\}_{i=1}^{N}$ be a set of a few observed examples of a function f, referred to as the context set, and consider a set of downstream tasks $\mathcal{T}$. Here, each task $T \in \mathcal{T}$ can be defined as a mapping defined over f. In the case of few-shot regression (see Section 4.1), T maps from f to a predictive model $p_\psi(y|x)$. In the case of parameter identification, T maps from f to some scalar- or vector-valued parameter of f. Our goal is therefore to learn an encoder which maps a context set $O_f$ to a representation of f that can interchangeably be used for multiple downstream tasks $T$ defined on the same function (without requiring retraining).

2.2. Background

In this section, we briefly discuss a class of meta-learning methods that are particularly relevant to our encoder-decoder setting, namely conditional neural processes (CNPs) and generative query networks (GQNs) (Garnelo et al., 2018a;b; Eslami et al., 2018).

Conditional Neural Processes (CNPs). The key proposal in CNPs (applied to few-shot learning) is to express a distribution over predictor functions given a context set. They learn the meta-representations r by jointly training the encoder and decoder, as illustrated in Figure 1(a). To this end, they first encode the context $O_f$ into individual representations $r_i = h_\Phi(x_i, y_i)$ for $i \in [N]$, where $h_\Phi$ is a neural network. The representations are then aggregated via a mean-pooling operation into a fixed-size vector $r = \frac{1}{N}(r_1 + r_2 + \ldots + r_N)$. The idea is that r captures all the relevant information about f from the context set $O_f$; accordingly, the predictive distribution is approximated by maximizing the conditional likelihood of the target distribution $p(y|x, O_f)$, where $y = f(x)$.
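To make the baseline's joint training concrete, the sketch below shows a minimal CNP-style training step in PyTorch: the context set is encoded point-wise, mean-pooled into r, and a decoder conditioned on (x, r) is trained by maximizing a Gaussian log-likelihood on target points, with gradients flowing through both encoder and decoder. Layer sizes and module names are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class CNP(nn.Module):
    """Minimal CNP-style model: encoder and decoder trained jointly (cf. Figure 1a)."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(            # point-wise encoder h_Phi(x_i, y_i)
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(), nn.Linear(128, r_dim))
        self.decoder = nn.Sequential(            # conditional decoder p(y | x, r)
            nn.Linear(x_dim + r_dim, 128), nn.ReLU(), nn.Linear(128, 2 * y_dim))

    def forward(self, x_ctx, y_ctx, x_tgt):
        # x_ctx: [B, N, x_dim], y_ctx: [B, N, y_dim], x_tgt: [B, M, x_dim]
        r = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1)).mean(dim=1)     # [B, r_dim]
        r_rep = r.unsqueeze(1).expand(-1, x_tgt.shape[1], -1)
        mean, log_var = self.decoder(torch.cat([x_tgt, r_rep], dim=-1)).chunk(2, dim=-1)
        return mean, log_var

def cnp_loss(model, x_ctx, y_ctx, x_tgt, y_tgt):
    # Joint objective: maximize the conditional log-likelihood of the targets
    # (negative Gaussian log-likelihood up to a constant).
    mean, log_var = model(x_ctx, y_ctx, x_tgt)
    return (0.5 * (log_var + (y_tgt - mean) ** 2 / log_var.exp())).mean()
```

Because the encoder is updated only through this reconstruction-style objective, the representation r is tied to this one prediction task, which is exactly the coupling FCRL removes.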
Generative Query Networks (GQNs). GQN (Eslami et al., 2018) can be seen as an extension of NPs (Garnelo et al., 2018b) for learning representations of 3D scenes. The context dataset $O_f$ in GQN consists of tuples of camera viewpoints in 3D space ($\mathcal{X}$) and the images taken from those viewpoints ($\mathcal{Y}$). Like NPs, GQNs learn to infer the latent representation of the scene (a function) by conditioning on the aggregated context and maximizing the likelihood of generating the correct image corresponding to a queried viewpoint.

3. Function-Contrastive Representation Learning (FCRL)

We take the perspective here that the sets of context points $O_f$ provide a partial observation of an underlying function f. Our goal is to find an encoder $g_{(\phi,\Phi)}$ which maps such partial observations to low-dimensional representations of the underlying function. The key idea is that a good encoder $g_{(\phi,\Phi)}$ should map different context sets (i.e., partial observations) of the same function close together in the latent space, such that they can easily be identified among context sets of different functions. This motivates the contrastive-learning objective which we detail in the following.

Encoder Structure. Since the inputs to the encoder $g_{(\phi,\Phi)}$ are sets, it needs to be permutation invariant with respect to input order and able to process inputs of varying sizes. We enforce this permutation invariance in $g_{(\phi,\Phi)}$ via sum-decomposition, proposed by Zaheer et al. (2017). We first average-pool the point-wise transformations of $O_f$ to get the encoded representation

$$r_f = \frac{1}{|O_f|} \sum_{(x,y) \in O_f} h_\Phi(x, y) \quad (2)$$

where $h_\Phi(\cdot)$ is the encoder network. We then obtain a nonlinear projection of this encoded representation, $g_{(\phi,\Phi)}(O_f) = \rho_\phi(r_f)$. Note that the function $\rho_\phi$ can be any nonlinear function. We use an MLP with one hidden layer, which also acts as the projection head for learning the representation. Similar to Chen et al. (2020), we found it beneficial to define the contrastive objective on these projected representations $\rho_\phi(r_f)$, rather than directly on the encoded representations $r_f$. More details can be found in our ablation study in Appendix A.1.

Figure 2. Inner workings of the FCRL objective function. We split the context set of each function into J disjoint views, and align the aggregated representations of those views. Shown here is the example of two functions, with two views each, i.e., J = 2.
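As a concrete (hypothetical) illustration of the encoder structure in Equation (2), the sketch below implements the permutation-invariant set encoder $h_\Phi$ with mean pooling, followed by a one-hidden-layer MLP projection head $\rho_\phi$. Layer widths and the representation size are our own assumptions.

```python
import torch
import torch.nn as nn

class FCRLEncoder(nn.Module):
    """Permutation-invariant set encoder g_(phi,Phi) = rho_phi(mean_i h_Phi(x_i, y_i))."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, proj_dim=64):
        super().__init__()
        self.h = nn.Sequential(                   # point-wise encoder h_Phi
            nn.Linear(x_dim + y_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, r_dim))
        self.rho = nn.Sequential(                 # projection head rho_phi (one hidden layer)
            nn.Linear(r_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, proj_dim))

    def encode(self, x, y):
        # x: [B, N, x_dim], y: [B, N, y_dim] -> r_f: [B, r_dim]  (Equation 2)
        return self.h(torch.cat([x, y], dim=-1)).mean(dim=1)

    def forward(self, x, y):
        # Projected representation, used only for the contrastive objective.
        return self.rho(self.encode(x, y))
```

After pre-training, only `encode` (i.e., $h_\Phi$) is kept; the projection head is discarded, as described in Section 4.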
Encoder Training. At training time, we are provided with partial observations $O^{1:K}$ from K functions. Each observation is a set of N i.i.d. (independent and identically distributed) samples $O^k = \{(x^k_i, y^k_i)\}_{i=1}^{N}$. To encourage that different observations of the same function are mapped to similar representations, we now formulate a contrastive learning objective, as illustrated in Figure 2. To apply contrastive learning, we create different views of the same function k by splitting each sample set $O^k$ into J subsets of size N/J. Defining $t_j := \{(j-1)N/J + 1, \ldots, jN/J\}$, we obtain a split of $O^k$ into J disjoint subsets of equal size:

$$O^k = \bigcup_{j=1}^{J} O^k_{t_j} \quad \text{with} \quad O^k_{t_i} \cap O^k_{t_j} = \emptyset \;\; \text{if} \;\; i \neq j \quad (3)$$

where each subset $O^k_{t_j}$ is a partial view of the underlying function k. Here, J is a hyper-parameter: it can vary in the range [2, N] and must divide N, i.e., N mod J = 0. Its value is empirically chosen based on the data domain: for 1D and 2D regression problems (Section 4.1), the number of examples per view (N/J) is relatively large, as a single context point does not provide much information about the underlying function; whereas for scenes (Section 4.3), a few (or even one) images per view (partial observation) may provide enough information. We expand on the appropriate choice of J in our experiments and the respective ablations.

For brevity of notation, we define $v^k_j := g_{(\phi,\Phi)}(O^k_{t_j})$. We now formulate the contrastive learning objective as follows:

$$\mathcal{L} = -\sum_{k=1}^{K} \sum_{1 \le i < j \le J} \log \frac{\exp\!\left(\mathrm{sim}(v^k_j, v^k_i)/\tau\right)}{\sum_{m=1}^{K} \exp\!\left(\mathrm{sim}(v^k_j, v^m_i)/\tau\right)} \quad (4)$$

where $\mathrm{sim}(a, b) := \frac{a \cdot b}{\|a\|\,\|b\|}$ is the cosine similarity measure.

Intuitively, the objective function in Equation (4) encourages the similarity measure $\mathrm{sim}(v^p_{(\cdot)}, v^q_{(\cdot)})$ to act as a discriminatory function, yielding a large value if $v^p_{(\cdot)}$ and $v^q_{(\cdot)}$ are representations of sets of samples drawn from the same function, i.e., if p = q (positives), and a small value otherwise, i.e., if p ≠ q (negatives). The second summation in Equation (4), over 1 ≤ i < j ≤ J, runs over the available pairs of positives, and τ is a temperature parameter which scales the scores returned by the similarity measure. Similar to SimCLR (Chen et al., 2020), we find that temperature adjustment is important for learning good representations and treat it as a hyperparameter (ablation study in Appendix A.1).

We note that the learning objective effectively balances two goals. The first is that of avoiding overfitting (i.e., regularization). It encourages that any two independent samples from the same distribution get mapped to similar points. This is akin to the method of symmetrization by a ghost sample, which is a standard trick in proving learning theory bounds (Vapnik, 1995). Essentially, if two means on different samples are close, then they will also be close to their expectation, i.e., they will not overfit to the data.¹ In spirit, this is also close to the idea of regularization by enforcing stability (i.e., weak dependence on sampling points) (Bousquet & Elisseeff, 2002). The second goal is to preserve contrastive information, ensuring that samples from different distributions get mapped to different points. Both goals are intricately linked in our setting, where the aggregation function is being learnt, since the second component is necessary to prevent the system from trivially meeting the first goal by, say, mapping everything to 0.

¹ This is an example of the more general phenomenon of concentration of measure, applicable not just to means but also to other functions that aggregate samples. For a simple argument, let $\mathbb{E}_{O_i}$ denote the expectation w.r.t. drawing the sample $O_i$, and let g be the mapping function applied to two independent samples $O_1, O_2$ from the same distribution. Then we have
$$|g(O_1) - \mathbb{E}_{O_1}[g(O_1)]| = |g(O_1) - \mathbb{E}_{O_2}[g(O_2)]| = |\mathbb{E}_{O_2}[g(O_1) - g(O_2)]| \le \mathbb{E}_{O_2}\big[|g(O_1) - g(O_2)|\big].$$
The second equality uses independence of the samples $O_1$ and $O_2$, and the last step uses Jensen's inequality. This shows that if in expectation the embeddings of two samples are close (r.h.s.), then each embedding is close to its expectation.
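A minimal sketch of the objective in Equation (4), assuming the `FCRLEncoder` above: each function's context set is split into J views, each view is embedded, and view pairs from the same function are treated as positives against the other K-1 functions in the batch. Tensor shapes and the pairing scheme follow our reading of Equation (4) (with the sum over k replaced by a mean); batching details are illustrative.

```python
import torch
import torch.nn.functional as F

def fcrl_loss(encoder, x, y, J=2, tau=0.1):
    """Contrastive objective in the spirit of Equation (4).
    x: [K, N, x_dim], y: [K, N, y_dim] -- N samples from each of K functions."""
    K, N, _ = x.shape
    assert N % J == 0, "J must divide N"
    # Split each context set into J disjoint views and embed them: v[k, j] = g(O^k_{t_j}).
    xv = x.view(K, J, N // J, -1)
    yv = y.view(K, J, N // J, -1)
    v = torch.stack([encoder(xv[:, j], yv[:, j]) for j in range(J)], dim=1)  # [K, J, D]
    v = F.normalize(v, dim=-1)                    # unit norm, so dot products are cosine similarities

    loss, n_pairs = 0.0, 0
    for i in range(J):
        for j in range(i + 1, J):                 # positive pairs (i, j) with i < j
            logits = v[:, j] @ v[:, i].T / tau    # [K, K]; entry (k, m) = sim(v^k_j, v^m_i) / tau
            targets = torch.arange(K)             # positives lie on the diagonal (m = k)
            loss = loss + F.cross_entropy(logits, targets)
            n_pairs += 1
    return loss / n_pairs
```

The cross-entropy over the K-way logits is exactly the negative log-ratio inside Equation (4), averaged over functions, so negatives come for free from the other functions in the batch.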
3.1. Application to Downstream Tasks

Once representation learning using FCRL is concluded, $h_\Phi$ is fixed and can be used for few-shot downstream prediction tasks T defined on the underlying data-generating functions. To solve a particular downstream task, one may optimize a parametric decoder $p_\psi(\cdot \mid r)$ conditioned on the learned representation r. Specifically, the decoder maps the representations learned in the previous step to the variable of interest in the given task. Depending on the nature of the downstream task, the conditional distributions and the associated objectives can be defined in different ways.

4. Experiments

To illustrate the benefits of learning function representations without an explicit decoder, we consider four different experimental settings. In all experiments that follow, we first learn the encoder and then keep it fixed. Subsequently, we optimize decoders for the specific downstream problems at hand, while keeping the meta-representations from the encoder detached from the computational graph.

Baselines. We compare the downstream predictive performance of FCRL-based representations with the representations learned by the closest task-oriented meta-learning baselines. For a fair comparison, all the baselines and FCRL have the same encoding architecture. For instance, for 1D and 2D regression functions, we consider CNPs and NPs as baselines. We share with these methods an identical way of mapping the context set to its representation, but unlike us, they optimize directly for the performance of the decoder p(y|x) jointly with the said representation. For scene datasets, we use GQN (a variant of NPs) as the baseline, one that explicitly learns to reconstruct scenes using a limited number of context samples (pairs of camera viewpoints and the corresponding images).

4.1. 1D and 2D Functions

In the first set of experiments, we consider two different distributions of functions: a distribution over 1D sinusoidal waves, proposed by Finn et al. (2017), and a comparatively harder distribution in which images are modelled as 2D functions (Garnelo et al., 2018a;b; Gordon et al., 2020; Kim et al., 2019). The representation learning stage for both datasets is similar; however, the downstream tasks differ.

1D Sinusoidal Functions: We consider a dataset of 20,000 training, 1,000 validation and 1,000 test sinusoidal functions. The amplitude and the phase of the functions are sampled uniformly from [0.1, 0.5] and [0, π], respectively. For each function f, the x-coordinates are uniformly sampled from [-5.0, 5.0] and then f is applied to obtain the y-coordinates.

Modeling Images as 2D Functions: In this setting, each image is regarded as a function mapping from 2D pixel coordinates (comprising the function input $x_i$) to the pixel intensities at the corresponding pixel coordinate (comprising the function output $y_i$). We consider images of MNIST digits (LeCun et al., 1998), where $x_i \in [0, 1]^2$ are the normalized pixel coordinates and $y_i \in [0, 1]$ is a grayscale pixel intensity. The training and validation datasets consist of the 60,000 MNIST training and 10,000 test samples, respectively.

4.1.1. REPRESENTATION LEARNING STAGE

We first describe the representation learning stage for both datasets, and then provide results on their respective downstream tasks. For training the encoder $g_{(\phi,\Phi)}$, we have a dataset $O = \{O^k\}_{k=1}^{K}$ at our disposal, where each k corresponds to a function $f^k$ which has been sampled as described above. Each individual sample $O^k = \{(x^k_i, y^k_i)\}_{i=1}^{N}$ from the dataset is itself a set, comprising N input-output pairs from that particular function $f^k$, i.e., $y^k_i = f^k(x^k_i)$. For sinusoidal functions, we fix the maximum number of context points to 20, and the number of examples N is chosen randomly in [2, 20] for each k. For MNIST digits as 2D functions, we allow a maximum of 200 samples per context set, and N is sampled randomly from [2, 200] for each k. The encoder $g_{(\phi,\Phi)}$ is then trained by splitting each context set $O^k$ into J disjoint views. We set J = 2 for the sinusoidal functions and J = 10 for the 2D functions. An ablation study for the choice of J is presented in Appendix A.1.

Figure 3. Qualitative comparison of different methods on the 5-shot regression task on three different sinusoids. The decoder trained with the FCRL representation predicts the correct form of the sinusoids.
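For concreteness, here is a hypothetical data pipeline matching the sinusoid setup described above: it samples K functions with amplitude in [0.1, 0.5] and phase in [0, π], draws N input-output pairs per function with inputs in [-5, 5], and optionally adds Gaussian observation noise as used later in Section 4.2. The exact sinusoid parameterization (A·sin(x - phase)) and fixed N are assumptions; it produces batches in the [K, N, d] layout assumed by the loss sketch above.

```python
import math
import torch

def sample_sinusoid_batch(K=64, N=20, noise_std=0.0):
    """Sample context sets from a distribution over 1D sinusoids, f(x) = A * sin(x - phase).
    Returns x, y of shape [K, N, 1]."""
    amp = torch.rand(K, 1, 1) * 0.4 + 0.1           # amplitude ~ U[0.1, 0.5]
    phase = torch.rand(K, 1, 1) * math.pi           # phase ~ U[0, pi]
    x = torch.rand(K, N, 1) * 10.0 - 5.0            # inputs ~ U[-5, 5]
    y = amp * torch.sin(x - phase)
    if noise_std > 0:                               # optional corruption y = f(x) + xi (Section 4.2)
        y = y + noise_std * torch.randn_like(y)
    return x, y

# Example training step using the sketches above (sizes are illustrative):
# encoder = FCRLEncoder(x_dim=1, y_dim=1)
# x, y = sample_sinusoid_batch(K=64, N=20)
# loss = fcrl_loss(encoder, x, y, J=2, tau=0.1)
```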
4.1.2. DOWNSTREAM TASKS ON 1D SINUSOIDS

After training the encoder $g_{(\phi,\Phi)}$, we discard the projection head $\rho_\phi$ and use the trained encoder $h_\Phi$ to extract the representations. For 1D sinusoids, we define two downstream tasks on the learned representation: $T_{1D} = \{T_{fsr}, T_{fspi}\}$, where $T_{fsr}$ and $T_{fspi}$ are the few-shot regression and few-shot parameter identification tasks, respectively. The decoders for the downstream tasks are trained as follows.

Few-shot Regression (FSR). FSR for 1D functions is a well-studied problem in meta-learning (Garnelo et al., 2018a; Finn et al., 2017; Kim et al., 2019; Xu et al., 2019). For each sampled function $f^k$, we are given a context set $O^k = \{(x^k_i, y^k_i = f^k(x^k_i))\}_{i=1}^{N}$ of size N, which can be utilized to infer the meta-representation $r^k$ of $f^k$ via the pre-trained encoder $h_\Phi$. We are then provided with M additional samples from $f^k$ (not seen by the encoder $h_\Phi$). The goal for a downstream decoder is to predict $y^k_i$, given $x^k_i$ and the meta-representation $r^k$. In other words, the downstream decoder with parameters ψ models the distribution $p_\psi(y^k_i \mid x^k_i, r^k)$. With $D^k = \{(x^k_i, y^k_i)\}_{i=1}^{N+M}$, the decoder is therefore trained to solve the following problem:

$$\max_\psi \; \mathbb{E}_{f^k \sim p(f)} \, \mathbb{E}_{(x^k_i, y^k_i) \sim D^k}\!\left[\log p_\psi(y^k_i \mid x^k_i, r^k)\right] \quad (5)$$

Here, the value of M is sampled randomly from the interval [0, 20 - N]. The decoder $p_\psi$ is an MLP with two hidden layers and is trained with the same training functions as the encoder $h_\Phi$. In addition to the Gaussian mean of $y^k_i$, it also outputs the variance in order to quantify the uncertainty in the point estimates. The qualitative results on test functions shown in Figure 3 demonstrate that our model is able to adapt quickly with as few as 5 context points. In Table 1, we compare our method with CNP and NP quantitatively, and show that the predictions of our method are closer to the ground truth, even though the encoder and decoder in both CNP and NP are explicitly trained to directly maximize the log-likelihood to fit the sinusoid.

Table 1. Mean squared error (MSE) for all the target points in the 5- and 20-shot regression and parameter identification tasks on test sinusoid functions. The reported values are the mean and standard deviation of three independent runs. FCRL performs slightly better than CNP and NP on both tasks.

Few-shot Regression (FSR)

| Models | 5-shot | 20-shot |
|---|---|---|
| NP | 0.310 ± 0.05 | 0.218 ± 0.02 |
| CNP | 0.265 ± 0.03 | 0.149 ± 0.02 |
| FCRL | 0.172 ± 0.04 | 0.100 ± 0.02 |

Few-shot Parameter Identification (FSPI)

| Models | 5-shot | 20-shot |
|---|---|---|
| NP | 0.0087 ± 0.0007 | 0.0037 ± 0.0005 |
| CNP | 0.0096 ± 0.0007 | 0.0049 ± 0.0011 |
| FCRL | 0.0078 ± 0.0004 | 0.0032 ± 0.0002 |
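The following sketch shows how a downstream FSR decoder could be trained on top of the frozen encoder, in the spirit of the objective above: the context set is encoded with `encoder.encode` (no gradient flows back to $h_\Phi$), and a two-hidden-layer MLP predicts a Gaussian mean and log-variance for each query point. Module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class FSRDecoder(nn.Module):
    """Downstream decoder p_psi(y | x, r): two hidden layers, Gaussian mean and log-variance."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * y_dim))

    def forward(self, x_tgt, r):
        r_rep = r.unsqueeze(1).expand(-1, x_tgt.shape[1], -1)
        mean, log_var = self.net(torch.cat([x_tgt, r_rep], dim=-1)).chunk(2, dim=-1)
        return mean, log_var

def fsr_step(encoder, decoder, optimizer, x_ctx, y_ctx, x_tgt, y_tgt):
    with torch.no_grad():                        # the meta-representation stays detached:
        r = encoder.encode(x_ctx, y_ctx)         # no gradient flows from the decoder to h_Phi
    mean, log_var = decoder(x_tgt, r)
    nll = (0.5 * (log_var + (y_tgt - mean) ** 2 / log_var.exp())).mean()
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()
```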
Few-shot Parameter Identification (FSPI). The goal here is to predict the amplitude ($y^k_{amp}$) and phase ($y^k_{phase}$) of the sampled sine wave $f^k$, given a context set $O^k = \{(x^k_i, y^k_i = f^k(x^k_i))\}_{i=1}^{N}$ of N samples. Having encoded the context set $O^k$ into the meta-representation $r^k$ via the pre-trained encoder $h_\Phi$ (following Equation (2)), we train a linear decoder $p_\psi$ on top of the said representation by maximizing the likelihood of the sine-wave parameters. The predictive distribution is $p_\psi(y^k_{amp}, y^k_{phase} \mid r^k)$ and the objective is:

$$\max_\psi \; \mathbb{E}_{f^k \sim p(f)}\!\left[\log p_\psi(y^k_{amp}, y^k_{phase} \mid r^k)\right] \quad (6)$$

Similar to FSR, we use the same training functions to train $p_\psi$ as we did to train the encoder $h_\Phi$. In Table 1, we report the mean squared error for three independent runs, averaged across all the test tasks for 5-shot and 20-shot FSR and FSPI. In both prediction tasks, the decoders trained on FCRL representations outperform CNP and NP. More details on the experiment are given in Appendix F.

4.1.3. DOWNSTREAM TASKS ON 2D FUNCTIONS

Similar to the tasks above, after training the model $g_{(\phi,\Phi)}$, we discard the projection head $\rho_\phi$ and use the trained encoder $h_\Phi$ to extract the representations. For MNIST digits as functions, we formulate two downstream prediction tasks on the learned representations: $T_{2D} = \{T_{fsic}, T_{fscc}\}$, where $T_{fsic}$ corresponds to the few-shot image completion task and $T_{fscc}$ corresponds to the few-shot content classification task. The decoders for the downstream tasks are trained as follows.

Figure 4. (left) Quantitative evaluation of the models in terms of digit classification from a fixed number of context points (varying along the x-axis, with the training-distribution range marked; y-axis: MNIST probe validation accuracy). The error bands show the standard deviation over three runs. FCRL achieves substantially higher accuracy than both baselines for all evaluated numbers of context points. (right) Quantitative comparison of robustness to noise (x-axis: noise standard deviation) on the MNIST content classification downstream task. The representations learned with FCRL are much more robust to noise than those learned with CNPs and NPs.

Few-shot Content Classification (FSCC). To evaluate how much semantic information is captured by the meta-representations, we propose the task of few-shot content classification (FSCC). The goal here is to predict the class of each MNIST image given a context set $O^k = \{(x^k_i, y^k_i)\}_{i=1}^{N}$ comprising a few randomly sampled pixel coordinates $x^k_i$ and the corresponding grayscale intensities $y^k_i$. The lack of explicit spatial structure in the context points makes it a challenging problem. We use the pre-trained encoder $h_\Phi$ to encode $O^k$ into its representation $r^k$, and train a linear decoder on top to classify the class label $y^k_{onehot}$ corresponding to the MNIST image from which $O^k$ is sampled. The decoder $p_\psi$ therefore solves the following classification problem:

$$\max_\psi \; \mathbb{E}_{f^k \sim p(f)}\!\left[\log p_\psi(y^k_{onehot} \mid r^k)\right] \quad (7)$$

We train the decoder with the same functions (images) that were used for training the encoder $h_\Phi$, and subsequently evaluate it on unseen functions from the validation set. Figure 4 (left) shows the performance of decoders applied to representations obtained from FCRL, CNP and NP for varying sizes of the context set $O^k$. We find that FCRL significantly outperforms the baselines at any given number of context points, suggesting that the encoder $h_\Phi$ is able to efficiently extract semantic information in an unsupervised manner. We also observe that it is able to generalize to a larger number of context points than encountered during training (200).
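A linear probe for FSCC in the spirit of Equation (7) could look as follows: the frozen encoder maps each pixel-coordinate/intensity context set to $r^k$, and a single linear layer is trained with a cross-entropy (categorical log-likelihood) objective. The 10-class MNIST setting is from the text; the data-loader interface and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_fscc_probe(encoder, loader, r_dim=128, n_classes=10, epochs=10, lr=1e-3):
    """Train a linear classifier p_psi(label | r^k) on frozen FCRL representations.
    `loader` is assumed to yield (x_ctx, y_ctx, label) with x_ctx: [B, N, 2], y_ctx: [B, N, 1]."""
    probe = nn.Linear(r_dim, n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    encoder.eval()
    for _ in range(epochs):
        for x_ctx, y_ctx, label in loader:
            with torch.no_grad():                  # representations stay detached from the graph
                r = encoder.encode(x_ctx, y_ctx)   # [B, r_dim]
            loss = F.cross_entropy(probe(r), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```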
Few-shot Image Completion (FSIC). This setting is identical to that of few-shot regression (FSR) described in Section 4.1.2, except that M is sampled randomly from [0, 200 - N] and the decoder is a two-layer MLP with two input units (to account for the fact that the input $x^k_i$ is now two-dimensional). Qualitative results of FSIC on test images are shown in Figure 6. It can be seen that the decoder trained on FCRL representations is able to predict the pixel intensities reasonably well, even when the number of context points is as low as 50, or approximately 6% of the image. We compare its performance against CNP, which uses the same parameterization of both the encoder and the decoder. We however note a crucial distinction: in FCRL, the meta-representation (resulting from the encoder) is not optimized for the image completion task. In particular, no gradient flows from the decoder to the encoder, and the former is trained independently of the latter. On the contrary, CNP jointly optimizes both encoder and decoder parameters to solve the image completion task (i.e., to predict the pixel values). Despite this, it appears that the quality of reconstructions from the FCRL decoder matches that from the CNP decoder. We note that the gap between CNP and FCRL is reduced when the training scheme aligns with the downstream task. In FSIC, the downstream task is to obtain a generative model, which is exactly what CNPs are trained for; therefore, CNPs tend to perform on par with FCRL, as shown in Table 2.

Figure 6. Qualitative comparison of CNP- and FCRL-based few-shot image completion. Here, each digit corresponds to one function. The context is shown in the second row, where the target pixels are colored blue. FCRL is comparable with CNPs at predicting the correct form of a digit.

Table 2. MSE for all the target points in the 50- and 200-shot image completion task on MNIST as functions. The reported values are the mean and standard deviation of three independent runs.

Few-shot Image Completion (FSIC)

| Models | 50-shot | 200-shot |
|---|---|---|
| NP | 0.0531 ± 0.0002 | 0.0424 ± 0.0002 |
| CNP | 0.0477 ± 0.0006 | 0.0347 ± 0.0011 |
| FCRL | 0.0481 ± 0.0001 | 0.0355 ± 0.0001 |

4.2. Robustness to Noise Corruptions

In our experiments so far, we have considered the functions to be deterministic. However, in real-world settings, data-generating functions are corrupted with noise. In this section, we assume that they take the form

$$y = f(x, \xi) = f(x) + \xi; \quad \xi \sim \mathcal{N}(0, \sigma) \quad (8)$$

where $\mathcal{N}(0, \sigma)$ is the zero-mean Gaussian distribution with standard deviation σ. We now investigate the robustness of FCRL and the baselines as σ is varied. To this end, we train all the models on the noisy data and evaluate the quality of the learned representations on the few-shot content classification downstream task, as defined above. We find the representations learned by FCRL to be significantly more robust to increasing noise strength (σ) than the baselines, as illustrated in Figure 4 (right). One possible explanation for the susceptibility of CNP and NP to noise is that their representations are learned by reconstructing the outputs, where the signal-to-noise ratio is low. On the other hand, FCRL learns by contrasting sets of examples, extracting only the invariant features and discarding non-correlated noise in the input. Similar results on scene understanding datasets are presented in Appendix B.
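As a sketch of the robustness protocol just described, one trains a separate encoder on data corrupted at each noise level σ and reports the downstream probe accuracy; `make_noisy_mnist_loader` and `train_fcrl` are assumed (hypothetical) helpers, and `train_fscc_probe` refers to the probe sketch above.

```python
import torch

@torch.no_grad()
def probe_accuracy(encoder, probe, loader):
    """Fraction of correctly classified held-out functions (images)."""
    correct, total = 0, 0
    for x_ctx, y_ctx, label in loader:
        pred = probe(encoder.encode(x_ctx, y_ctx)).argmax(dim=-1)
        correct += (pred == label).sum().item()
        total += label.numel()
    return correct / total

# Sweep over noise levels: retrain on corrupted data, then evaluate the FSCC probe.
results = {}
for sigma in [0.0, 0.05, 0.1, 0.2, 0.3, 0.4]:
    train_loader = make_noisy_mnist_loader("train", noise_std=sigma)   # assumed helper
    val_loader = make_noisy_mnist_loader("val", noise_std=sigma)       # assumed helper
    encoder = train_fcrl(train_loader)                                 # assumed helper
    probe = train_fscc_probe(encoder, train_loader)
    results[sigma] = probe_accuracy(encoder, probe, val_loader)
```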
4.3. Representing Scenes as Functions

Like Eslami et al. (2018), we represent scenes as deterministic functions which map camera viewpoints to images. Precisely, each such scene is represented by a function $f^k$, and we consider context sets $O^k = \{(x^k_i, y^k_i)\}_{i=1}^{N}$, where $x_i$ is the 3D camera viewpoint and $y_i$ is the corresponding image taken from that viewpoint. Given this set of viewpoint-image pairs, we apply the proposed method to $O^k$ to obtain a representation of the scene, $r^k$. The usefulness of this representation is then evaluated on three downstream tasks: scene understanding, scene reconstruction and reinforcement learning (RL).

For the first task, our goal is to determine whether the representation $r^k$ contains enough information to infer the underlying factors of variation (Bengio et al., 2013) of a given scene. To this end, we use MPI3D (Gondal et al., 2019), a real-world robotics dataset comprising pairs of images from three camera viewpoints and the corresponding factors of variation (including the position, orientation, size and color of an object in the scene). For the second task, we analyze whether the learned representation $r^k$ can be used to reconstruct the scene from an unseen viewpoint. This objective is similar to what GQNs (Eslami et al., 2018) are originally trained for, making them an ideal baseline for this task. For the last task, our objective is to determine whether $r^k$ contains enough useful information to guide an RL agent towards maximizing its reward. To this end, we create RLScenes, a multi-view robotics dataset based on an open-source physics simulation engine (details in Appendix D). Having trained the encoder on RLScenes, we feed the representation $r^k$ of the scene as input to a control policy rewarded for solving the considered RL task.

Figure 5. t-SNE projections of the meta-representations learned by FCRL on the MPI3D dataset (treating scenes as functions). (top row) Each plot shows the latent structure corresponding to the factors mentioned. (bottom row) Latent structure corresponding to the factor of robot-arm rotation along the first DOF. Each embedding corresponds to the aggregated representation of three views of the same scene, denoted as f_angle. It can be seen that, by learning to distinguish between functions, FCRL captures the underlying semantic structure of the function distribution. Note: each plot is generated by varying the factor of interest and keeping the rest of the factors fixed, except the factors of the first and second degrees of freedom.

Scenes Representation Learning Stage. We use the same setting for learning the representations on both datasets. We fix the maximum number of context views (J) to 3 for the MPI3D dataset and 20 for RLScenes. The number of tuples drawn, N, is then chosen randomly in [2, 3] and [2, 20], respectively.
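For scenes, the point-wise encoder $h_\Phi$ has to consume a (viewpoint, image) tuple rather than a low-dimensional (x, y) pair. A hypothetical variant of the set encoder for this setting is sketched below: a small CNN embeds each image, the embedding is concatenated with the camera viewpoint, and the per-view features are mean-pooled exactly as in Equation (2). The architecture and the 7-dimensional viewpoint encoding are our own illustrative choices, not the ones used in the paper.

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """h_Phi for scenes: embeds (viewpoint, image) tuples and mean-pools over views."""
    def __init__(self, view_dim=7, r_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                     # small image feature extractor (illustrative)
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(128 + view_dim, 256), nn.ReLU(), nn.Linear(256, r_dim))

    def encode(self, viewpoints, images):
        # viewpoints: [B, N, view_dim], images: [B, N, 3, H, W] -> r: [B, r_dim]
        B, N = images.shape[:2]
        feats = self.cnn(images.flatten(0, 1)).view(B, N, -1)          # per-view image features
        per_view = self.head(torch.cat([feats, viewpoints], dim=-1))   # h_Phi(x_i, y_i)
        return per_view.mean(dim=1)                                    # Equation (2): mean-pool views
```

The contrastive objective itself is unchanged: views of the same scene are positives, and other scenes in the batch serve as negatives.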
4.3.1. DOWNSTREAM TASKS ON SCENES

After training the encoder $g_{(\phi,\Phi)}$, we discard the projection head $\rho_\phi$, use the trained encoder $h_\Phi$ to extract the representations $r^k$, and train decoders for the downstream problems.

Scene Understanding on the MPI3D Dataset. In MPI3D, each scene is identified by 6 factors of variation. This allows us to define a set of 6 tasks $T_{mpi3D} = \{T^k_v\}_{v=1}^{6}$, where the task $T^k_v$ maps the scene to a discretized factor of variation $y^k_v$. For each task, we train a linear decoder using the objective in Equation (7), using a single image to infer the representation $r^k$. Figure 7 shows the linear probes' validation performance for six independently trained models on both GQN-learned and FCRL-learned representations; the representations learned by FCRL consistently outperform GQN for identifying all the factors of variation in the scenes.

Figure 7. Quantitative comparison of FCRL and GQN on MPI3D downstream classification tasks (panels include background color, object color, object shape and object size; y-axis: probe validation accuracy). The classifiers trained with FCRL's representations outperform classifiers based on GQN's representations on all the tasks.

Scene Reconstruction on the MPI3D Dataset. Similar to the FSIC task in Section 4.1.3, we train a separate decoder to reconstruct the scene corresponding to an unseen (query) viewpoint $x^k_q$. Conditioned on the inferred representation $r^k$ and the query viewpoint $x^k_q$, the decoder reconstructs the corresponding view of the scene, $y^k_q$. The qualitative comparison in Figure 8 shows that the decoder trained with the FCRL representation is capable of preserving the information required to reconstruct the subtle details in the scene.

Figure 8. Qualitative comparison of FCRL and GQN on the scene reconstruction task (ground-truth views shown for reference). The decoder trained with FCRL's representation performs on par with GQN in terms of reconstructing a scene from unseen viewpoints.

Reinforcement Learning on the RLScenes Dataset. In RLScenes, the goal for the agent (a robotic finger) is to locate the object in the arena, reach it, and stay close to it for the remainder of the episode. We use the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018) to learn an MLP policy for all the joints of the robot, where the policy takes as input the representation $r^k$ (inferred from a single image) and outputs an action. As the baseline, we use an RL policy trained with the GQN representation as input. Figure 9 shows the mean rewards and standard deviations over five runs achieved by both the FCRL- and GQN-based policies. We find that the FCRL agent clearly outperforms the baseline GQN agent, both in terms of the final performance and sample efficiency. In particular, the FCRL agent reaches convergence-level control performance with approximately 2 times fewer interactions with the environment.

Figure 9. Comparison between GQN and FCRL for learning a data-efficient control policy on an object-reaching downstream task. The FCRL-based policy clearly outperforms the GQN-based policy.
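The sketch below illustrates how the frozen scene representation could be plugged into an actor network for SAC-style training; it only shows the observation-to-action path (a squashed Gaussian policy head over the robot's joints), with the full SAC machinery (critics, replay buffer, entropy tuning) omitted. Dimensions, the action size, and the environment interface are assumptions.

```python
import torch
import torch.nn as nn

class FCRLPolicy(nn.Module):
    """SAC-style actor that acts on the frozen FCRL scene representation r^k."""
    def __init__(self, encoder, r_dim=256, action_dim=3, hidden=256):
        super().__init__()
        self.encoder = encoder                       # pre-trained, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.trunk = nn.Sequential(
            nn.Linear(r_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, viewpoint, image):
        # A single (viewpoint, image) pair is encoded into r^k, which is the policy input.
        with torch.no_grad():
            r = self.encoder.encode(viewpoint.unsqueeze(1), image.unsqueeze(1))
        h = self.trunk(r)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5, 2)
        action = torch.tanh(mu + log_std.exp() * torch.randn_like(mu))  # reparameterized, squashed
        return action
```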
5. Related Work

Meta-Learning. Supervised meta-learning can be broadly classified into two main categories. The first category considers the learning algorithm to be an optimizer, and meta-learning is about optimizing that optimizer, e.g., gradient-based methods (Ravi & Larochelle, 2016; Finn et al., 2017; Li et al., 2017; Lee et al., 2019) and metric-learning based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Allen et al., 2019; Qiao et al., 2019). The second category is the family of Neural Processes (NPs) (Garnelo et al., 2018a;b; Kim et al., 2019; Eslami et al., 2018), which draw inspiration from Gaussian Processes (GPs). These methods use data-specific priors in order to adapt to a new task at test time while using only a simple encoder-decoder architecture. However, they approximate the distribution over tasks in terms of their predictive distributions, which does not incentivize NPs to fully utilize the information in the data-specific priors. Our method draws inspiration from this simple, yet elegant framework. However, our proposed method extracts the maximum information from the context, which is shown to be useful for solving not just one task, but multiple downstream tasks.

Self-Supervised Learning. Self-supervised learning methods aim to learn meaningful representations of the data by performing pretext learning tasks (Zhang et al., 2017; Doersch et al., 2015). These methods have recently received considerable attention (Oord et al., 2018; Tian et al., 2019; Hjelm et al., 2018; Bachman et al., 2019; Chen et al., 2020; He et al., 2019), owing their success mainly to noise-contrastive learning objectives (Gutmann & Hyvärinen, 2010). At the same time, different explanations for the success of such methods have recently emerged, from both an empirical perspective (Tian et al., 2020; Tschannen et al., 2019) and a theoretical perspective (Wang & Isola, 2020; Arora et al., 2019). The goal of these methods has mostly been to extract useful, low-dimensional representations of the data while using downstream performance as a proxy to evaluate the quality of the representation. For example, CPC (Oord et al., 2018) proposes an auto-regressive model to obtain a representation of a sample at time t that is then matched with that at time t + k, making it specialized for sequence-valued inputs. On the other hand, Tian et al. (2019), Chen et al. (2020) and He et al. (2019) match the representation of a sample with the representation of its randomly augmented view. In this work, we take inspiration from these methods and propose a self-supervised learning method which meta-learns the representations of functions. However, instead of requiring randomly augmented views or sequentially ordered data points, our self-supervised loss uses partially observed, unordered views sampled from the underlying functions.

In a similar spirit of enriching NPs (Garnelo et al., 2018b) with better approximation capability, Ton et al. (2019) proposed to replace the conditional expectations E(y|x) in NPs with more expressive conditional densities p(y|x) estimated via NCE (Gutmann & Hyvärinen, 2010). In contrast to FCRL, their approach directly estimates the conditional distribution p(y|x) and uses a binary classifier for NCE. However, this estimation is done in a small-data regime where standard conditional density estimation does not work well; their method is therefore practically limited to low-dimensional problems. Recently, Zhang et al. (2020) and Srinivas et al. (2020) have shown the benefits of using self-supervised representations, learned without reconstruction, for reinforcement learning tasks. In this work, we explore the utility of such representations for reinforcement learning tasks defined on scenes (functions).

6. Conclusion

In this work, we proposed a novel self-supervised representation learning algorithm for few-shot learning problems. We deviate from the commonly used, task-specific training routines in meta-learning frameworks and propose to learn the representations of the relevant functions independently of the prediction task. Experiments on various datasets and the related set of downstream few-shot prediction tasks show the effectiveness of our method. The flexibility to reuse the same representation for different task distributions defined over functions brings us one step closer towards learning a generic meta-learning framework.
Using a shared generic representation of the data-generating process, we plan to adapt the proposed framework in order to tackle multiple challenging few-shot problems such as object detection, segmentation, and visual question answering.

Acknowledgments

The authors would like to thank Krikamol Muandet, Luigi Gresele, Ilya Tolstikhin and Simon Buchholz for the helpful discussions and feedback. We thank CIFAR for their support. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1, Project number 390727645.

References

Allen, K. R., Shelhamer, E., Shin, H., and Tenenbaum, J. B. Infinite mixture prototypes for few-shot learning. arXiv preprint arXiv:1902.04552, 2019.

Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M.-A., and Hjelm, R. D. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems, pp. 8766-8779, 2019.

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509-15519, 2019.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Bousquet, O. and Elisseeff, A. Stability and generalization. JMLR, 2:499-526, 2002.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.

Doersch, C., Gupta, A., and Zisserman, A. CrossTransformers: spatially-aware few-shot transfer. arXiv preprint arXiv:2007.11498, 2020.

Eslami, S. A., Rezende, D. J., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. Neural scene representation and rendering. Science, 360(6394):1204-1210, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1126-1135. JMLR.org, 2017.

Garnelo, M., Rosenbaum, D., Maddison, C. J., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.

Gondal, M. W., Wüthrich, M., Miladinovic, D., Locatello, F., Breidt, M., Volchkov, V., Akpo, J., Bachem, O., Schölkopf, B., and Bauer, S. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In Advances in Neural Information Processing Systems, pp. 15740-15751, 2019.

Gordon, J., Bruinsma, W., Foong, A. Y., Requeima, J., Dubois, Y., and Turner, R. E. Convolutional conditional neural processes. 2020.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297-304, 2010.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Joshi, S., Widmaier, F., Agrawal, V., and Wüthrich, M. https://github.com/open-dynamic-robot-initiative/trifinger_simulation, 2020.

Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.

Kipf, T., van der Pol, E., and Welling, M. Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247, 2019.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657-10665, 2019.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Qiao, L., Shi, Y., Li, J., Wang, Y., Huang, T., and Tian, Y. Transductive episodic-wise adaptive metric for few-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3603-3612, 2019.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077-4087, 2017.

Srinivas, A., Laskin, M., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. arXiv preprint arXiv:2004.04136, 2020.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199-1208, 2018.

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.

Ton, J.-F., Chan, L., Teh, Y. W., and Sejdinovic, D. Noise contrastive meta-learning for conditional density estimation using kernel mean embeddings. arXiv preprint arXiv:1906.02236, 2019.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

Vapnik, V. The Nature of Statistical Learning Theory. Springer, NY, 1995.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630-3638, 2016.

Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.

Xu, J., Ton, J.-F., Kim, H., Kosiorek, A. R., and Teh, Y. W. MetaFun: Meta-learning with iterative functional updates. arXiv preprint arXiv:1912.02738, 2019.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391-3401, 2017.

Zhang, A., McAllister, R., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742, 2020.

Zhang, R., Isola, P., and Efros, A. A. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1058-1067, 2017.