# GROUP EQUIVARIANT CONDITIONAL NEURAL PROCESSES

Published as a conference paper at ICLR 2021

Makoto Kawano, The University of Tokyo, Tokyo, Japan. kawano@weblab.t.u-tokyo.ac.jp
Wataru Kumagai, The University of Tokyo / RIKEN AIP, Tokyo, Japan. kumagai@weblab.t.u-tokyo.ac.jp
Akiyoshi Sannai, RIKEN AIP, Tokyo, Japan. akiyoshi.sannai@riken.jp
Yusuke Iwasawa & Yutaka Matsuo, The University of Tokyo, Tokyo, Japan. {iwasawa, matsuo}@weblab.t.u-tokyo.ac.jp

We present the group equivariant conditional neural process (EquivCNP), a meta-learning method that is permutation invariant over data sets, as in conventional conditional neural processes (CNPs), and that is additionally equivariant to transformations of the data space. Incorporating group equivariance, such as rotation and scaling equivariance, provides a way to exploit the symmetry of real-world data. We give a decomposition theorem for permutation-invariant and group-equivariant maps, which leads us to construct EquivCNPs with an infinite-dimensional latent space that handles group symmetries. For a practical implementation, we build the architecture using Lie group convolutional layers. We show that an EquivCNP with translation equivariance achieves performance comparable to conventional CNPs on a 1D regression task. Moreover, we demonstrate that, by selecting an appropriate Lie group equivariance, EquivCNP is capable of zero-shot generalization on an image-completion task.

## 1 INTRODUCTION

Data symmetry has played a significant role in deep neural networks. In particular, convolutional neural networks, which play an important part in the recent achievements of deep learning, have translation equivariance, preserving the symmetry of the translation group. From the same point of view, many studies have aimed to incorporate various group symmetries into neural networks, especially into the convolution operation (Cohen et al., 2019; Defferrard et al., 2019; Finzi et al., 2020). As example applications, some works have introduced Hamiltonian dynamics to solve dynamics-modeling problems (Greydanus et al., 2019; Toth et al., 2019; Zhong et al., 2019). Similarly, Quessard et al. (2020) estimated group actions by assuming symmetry in the latent space inferred by a neural network. Incorporating the data structure (symmetries) into models as an inductive bias can reduce model complexity and improve generalization.

In terms of inductive bias, meta-learning, or learning to learn, provides a way to select an inductive bias from data. Meta-learning uses past experiences to adapt quickly to a new task T sampled from some task distribution p(T). In supervised meta-learning in particular, a task is described as predicting a set of unlabeled data (target points) given a set of labeled data (context points). Various works have proposed supervised meta-learning methods from different perspectives (Andrychowicz et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017; Snell et al., 2017; Santoro et al., 2016; Rusu et al., 2018). In this study, we are interested in neural processes (NPs) (Garnelo et al., 2018a;b), which are meta-learning models with an encoder-decoder architecture (Xu et al., 2019). The encoder is a permutation-invariant function on the context points that maps the contexts into a latent representation.
The decoder is a function that produces the conditional predictive distribution of the targets given the latent representation. The objective of NPs is to learn the encoder and the decoder so that the predictive model generalizes well to new tasks after observing a few points of those tasks. To achieve this objective, an NP must learn the information shared across the training tasks T ∼ p(T): the data knowledge (Lemke et al., 2015). Each task T is represented by one dataset, and multiple datasets are provided for training NPs on a meta-task. For example, consider the meta-task of completing missing pixels in a given image. Often the images within each dataset are taken under the same conditions. While the datasets contain identical subjects (e.g., cars or apples), the size and angle of the subjects may differ between images; that is, the datasets have group symmetry, such as scaling and rotation. Therefore, pre-constraining NPs to be group equivariant is expected to improve their performance on such datasets.

In this paper, we investigate the group equivariance of NPs. Specifically, we try to answer two questions: (1) can NPs represent equivariant functions, and (2) can we explicitly induce group equivariance in NPs? To answer these questions, we introduce a new family of NPs, EquivCNP, and show both theoretically and empirically that EquivCNP is a permutation-invariant and group-equivariant function. Most relevant to EquivCNP, ConvCNP (Gordon et al., 2019) shows, theoretically and experimentally, that using a general convolution operation leads to translation equivariance; however, it does not consider incorporating other groups. First, we introduce a decomposition theorem for permutation-invariant and group-equivariant maps. The theorem suggests that the encoder should map the context points into a latent variable that is a functional representation in order to preserve the data symmetry. We then construct EquivCNP following the theorem. For a practical implementation, we adopt LieConv (Finzi et al., 2020). We tackle a 1D synthetic regression task (Garnelo et al., 2018a;b; Kim et al., 2019; Gordon et al., 2019) to show that EquivCNP with translation equivariance is comparable to conventional NPs. Furthermore, we design a 2D image-completion task to investigate the potential of EquivCNP with several group equivariances. As a result, we demonstrate that EquivCNP enables zero-shot generalization by incorporating not translation but scaling equivariance.

## 2 RELATED WORK

### 2.1 NEURAL NETWORKS WITH GROUP EQUIVARIANCE

Our work builds upon recent advances in group-equivariant convolution operations incorporated into deep neural networks. The first approach is group convolution, introduced by Cohen & Welling (2016), in which standard convolutional kernels are used and either the kernels or the outputs are transformed with respect to the group. Group convolution induces exact equivariance, but only to the action of discrete groups. In contrast, for exact equivariance to continuous groups, some works employ harmonic analysis to find a basis of equivariant functions and then parameterize convolutional kernels in that basis (Weiler & Cesa, 2019).
Although this approach can be applied to general types of data (Anderson et al., 2019; Weiler & Cesa, 2019), its application is limited to compact, unimodular groups. To address these issues, LieConv (Finzi et al., 2020) and other works (Huang et al., 2017; Bekkers, 2019) use Lie groups. Our EquivCNP adopts LieConv to handle group equivariance because of its simplicity of implementation.

There are several works that study deep neural networks using data symmetry. In some of them, in order to solve machine learning problems such as sequence prediction or reinforcement learning, neural networks learn the data symmetry of physical systems directly from noisy observations (Greydanus et al., 2019; Toth et al., 2019; Zhong et al., 2019; Sanchez-Gonzalez et al., 2019). While both these studies and EquivCNP can handle data symmetries, EquivCNP is not limited to specific domains such as physics. Furthermore, Quessard et al. (2020) endow the latent space into which a neural network maps data with group equivariance and estimate the parameters of the data symmetries. In using group equivariance in the latent space, EquivCNP is similar to this study, but it differs in being able to use various group equivariances.

### 2.2 FAMILY OF NEURAL PROCESSES

NPs (Garnelo et al., 2018a;b) are deep generative models for regression functions that map an input x_i ∈ R^{d_x} to an output y_i ∈ R^{d_y}. In particular, given an arbitrary number of observed data points (x_C, y_C) := {(x_i, y_i)}_{i=1}^C, NPs model the conditional distribution of the target values y_T at new, unobserved target points x_T, where (x_T, y_T) := {(x_j, y_j)}_{j=1}^T. Fundamentally, there are two NP variants: deterministic and probabilistic. Deterministic NPs (Garnelo et al., 2018a), known as conditional NPs (CNPs), model the conditional distribution as

p(y_T | x_T, x_C, y_C) := p(y_T | x_T, r_C),

where r is a function that maps a data set (x_C, y_C) into a finite-dimensional vector space in a permutation-invariant way and r_C := r(x_C, y_C) ∈ R^d is the resulting feature vector. The function r can be implemented with DeepSets (Zaheer et al., 2017). The likelihood p(y_T | x_T, r_C) is modeled as a Gaussian distribution factorized across the targets {(x_j, y_j)}_{j=1}^T, whose mean and variance are predicted by passing r_C and x_j through an MLP. The CNP is trained by maximizing this likelihood.

Probabilistic NPs include a latent variable z. The NP infers q(z | r_C) from r_C using the reparametrization trick (Kingma & Welling, 2013) and models the conditional distribution as

p(y_T | x_T, x_C, y_C) := ∫ p(y_T | x_T, r_C, z) q(z | r_C) dz,

and it is trained by maximizing an ELBO:

L(φ, θ) = E_{z ∼ q_φ(z | x_T, y_T)}[log p_θ(y_T | x_T, z)] − KL[q_φ(z | x_T, y_T) ∥ p_θ(z | x_C, y_C)].

NPs have several useful properties: (i) Scalability: the computational cost of NPs scales as O(n + m) for n context and m target points; (ii) Flexibility: NPs can define a conditional distribution over an arbitrary number of target points, conditioned on an arbitrary number of observations; (iii) Permutation invariance: the encoder of NPs uses DeepSets (Zaheer et al., 2017), so the target prediction is invariant to permutations of the context. Thanks to these properties, Galashov et al. (2019) replace Gaussian processes with NPs in Bayesian optimization, contextual multi-armed bandits, and Sim2Real tasks.
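To make the encoder-decoder structure above concrete, the following is a minimal CNP sketch in PyTorch. The layer widths, the mean-pooling aggregation, and the softplus parameterization of the standard deviation are our own illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class MinimalCNP(nn.Module):
    """Sketch of a deterministic (conditional) neural process."""
    def __init__(self, x_dim=1, y_dim=1, r_dim=128):
        super().__init__()
        # Permutation-invariant encoder: embed each (x_i, y_i) pair, then pool.
        self.embed = nn.Sequential(
            nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
            nn.Linear(r_dim, r_dim),
        )
        # Decoder: map (x_target, r_C) to a Gaussian mean and standard deviation.
        self.decode = nn.Sequential(
            nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(),
            nn.Linear(r_dim, 2 * y_dim),
        )

    def forward(self, x_ctx, y_ctx, x_tgt):
        # x_ctx: (C, x_dim), y_ctx: (C, y_dim), x_tgt: (T, x_dim)
        r_c = self.embed(torch.cat([x_ctx, y_ctx], dim=-1)).mean(dim=0)  # (r_dim,)
        r_c = r_c.expand(x_tgt.shape[0], -1)                             # (T, r_dim)
        out = self.decode(torch.cat([x_tgt, r_c], dim=-1))
        mu, log_sigma = out.chunk(2, dim=-1)
        sigma = nn.functional.softplus(log_sigma) + 1e-4                 # positive std
        return mu, sigma

# Training maximizes the Gaussian log-likelihood of the targets:
# torch.distributions.Normal(mu, sigma).log_prob(y_tgt).mean()
```

The mean pooling over embedded pairs is what makes the encoder permutation invariant; any symmetric aggregation (sum, max) would preserve this property.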
While there are many NP variants (Kim et al., 2019; Louizos et al., 2019; Xu et al., 2019) that improve the performance of NPs, they do not yet take group equivariance into account. The model most similar to EquivCNP, ConvCNP (Gordon et al., 2019), incorporates only translation equivariance. In contrast, EquivCNP can incorporate not only translation but also other groups such as rotation and scaling.

## 3 DECOMPOSITION THEOREM

In this section, we consider group convolution. We first prepare some definitions and terminology. Let X and Y ⊆ R be the input space and output space, respectively. We define Z_M = (X × Y)^M as the collection of M input-output pairs, Z_{≤M} = ∪_{n=1}^M Z_n as the collection of at most M pairs, and Z = ∪_{m=1}^∞ Z_m as the collection of finitely many pairs. Let [n] = {1, ..., n} for n ∈ N, and let S_n be the permutation group on [n]. The action of S_n on Z_n is defined by π·Z_n := ((x_{π^{-1}(1)}, y_{π^{-1}(1)}), ..., (x_{π^{-1}(n)}, y_{π^{-1}(n)})), where π ∈ S_n and Z_n ∈ Z_n. We define the multiplicity of Z_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ Z_n by mult(Z_n) := sup{|{i ∈ [n] : x_i = x̂}| : x̂ = x_1, ..., x_n} and the multiplicity of Z′ ⊆ Z by mult(Z′) := sup_{Z_n ∈ Z′} mult(Z_n). A collection Z′ ⊆ Z is then said to have multiplicity K if mult(Z′) = K.

Mathematically, symmetry is described in terms of group actions. The following group-equivariant maps are exactly the maps that preserve symmetry in data.

Definition 1 (Group Equivariance and Invariance). Suppose that a group G acts on sets S and S′. A map Φ : S → S′ is called G-equivariant when Φ(g·s) = g·Φ(s) holds for arbitrary g ∈ G and s ∈ S. In particular, when G acts on S′ trivially (i.e., g·s′ = s′ for g ∈ G and s′ ∈ S′), the G-equivariant map is said to be G-invariant: Φ(g·s) = Φ(s).

We can then derive the following theorem, which decomposes a permutation-invariant and group-equivariant function into two tractable functions. Note that this theorem was proved by Gordon et al. (2019) for the case where G is a translation group.

Theorem 2 (Decomposition Theorem). Let G be a group and let Z′_{≤M} ⊆ Z_{≤M} be topologically closed, permutation-invariant, and G-invariant with multiplicity K. For a function Φ : Z′_{≤M} → C_b(X, Y), the following conditions are equivalent:

(I) Φ is continuous, permutation-invariant, and G-equivariant.

(II) There exist a function space H, a continuous G-equivariant function ρ : H → C_b(X, Y), and a continuous G-invariant interpolating kernel ψ : X² → R such that

Φ(Z) = ρ( Σ_{i=1}^m φ_{K+1}(y_i) ψ_{x_i} ) for Z = ((x_1, y_1), ..., (x_m, y_m)) ∈ Z′_{≤M},

where φ_{K+1} : Y → R^{K+1} is defined by φ_{K+1}(y) := [1, y, y², ..., y^K]^⊤.

Thanks to Theorem 2, we can construct permutation-invariant and group-equivariant NPs whose encoder and decoder forms are determined. In this paper, we call such a Φ an EquivDeepSet.
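As a concrete illustration of condition (II), the sketch below evaluates the functional embedding E(Z) = Σ_i φ_{K+1}(y_i) ψ(·, x_i) on a set of query points, assuming multiplicity K = 1 and a Gaussian RBF kernel for ψ; the length scale and the query grid are arbitrary choices of ours for illustration, not values from the paper.

```python
import numpy as np

def phi(y, K=1):
    """phi_{K+1}(y) = [1, y, y^2, ..., y^K]: a density channel plus value powers."""
    return np.array([y**k for k in range(K + 1)])            # shape (K+1,)

def rbf(t, x, length_scale=0.5):
    """Stationary, non-negative interpolating kernel psi(t, x)."""
    return np.exp(-0.5 * ((t - x) / length_scale) ** 2)

def functional_embedding(x_ctx, y_ctx, t_query, K=1):
    """E(Z)(t) = sum_i phi_{K+1}(y_i) * psi(t, x_i), evaluated at each query point t."""
    E = np.zeros((len(t_query), K + 1))
    for x_i, y_i in zip(x_ctx, y_ctx):
        E += np.outer(rbf(t_query, x_i), phi(y_i, K))        # (T, K+1)
    return E

# Example: three context points, embedding evaluated on a uniform grid.
x_ctx = np.array([-1.0, 0.3, 1.2])
y_ctx = np.array([0.5, -0.7, 1.1])
t_query = np.linspace(-2.0, 2.0, 9)
print(functional_embedding(x_ctx, y_ctx, t_query).shape)     # (9, 2)
```

The first output channel accumulates kernel mass at each query location (a density channel), and the second carries the weighted observations, matching the role of φ described later in Section 4.3.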
## 4 GROUP EQUIVARIANT CONDITIONAL NEURAL PROCESSES

In this section, we present EquivCNP, which is a permutation-invariant and group-equivariant map. EquivCNP models the same conditional distribution as CNPs:

p(Y_T | X_T, D_C) = Π_{n=1}^T p(y_n | Φ_θ(D_C)(x_n)) = Π_{n=1}^T N(y_n; µ_n, Σ_n), with (µ_n, Σ_n) = Φ_θ(D_C)(x_n),

where N denotes the density function of a normal distribution, D_C = (X_C, Y_C) = {(x_i, y_i)}_{i=1}^C is the observed context data, and Φ_θ is an EquivDeepSet. The components of EquivCNP to be determined are ρ, φ, and ψ. The prediction procedure is given in Algorithm 1, and Figure 1 gives an overview of the representation, lifting, discretization, projection, and encoding stages.

Figure 1: Overview of EquivCNP.

Algorithm 1: Prediction of the Group Equivariant Conditional Neural Process

    Input: ρ = LieConv network, RBF kernel ψ, context {(x_i, y_i)}_{i=1}^N, targets {x*_j}_{j=1}^M
    (lower, upper) ← range((x_i)_{i=1}^N ∪ (x*_j)_{j=1}^M)
    (t_k)_{k=1}^T ← uniform_grid(lower, upper; γ)
    // Encoder: embed the context information into the representation h
    h ← Σ_{i=1}^N φ_{K+1}(y_i) ψ([x*_j, t_k], x_i)
    // Decoder
    (µ_j, Σ_j) ← LieConvNet(h)(x*_j)
    Output: {(µ_j, Σ_j)}_{j=1}^M

To describe the construction in more detail, Section 4.1 first introduces the definition of group convolution, and Section 4.2 then explains LieConv (Finzi et al., 2020), which EquivCNP uses to implement group convolution. Finally, we describe the architecture of the proposed EquivCNP in Section 4.3.

### 4.1 GROUP CONVOLUTION

When X is a homogeneous space of a group G, the lift of x ∈ X is the set of group elements that transfer a fixed origin o to x: Lift(x) = {u ∈ G : u·o = x}. That is, each pair of coordinates and features is lifted into K elements[1]: {(x_i, f_i)}_{i=1}^N ↦ {(u_{ik}, f_i)}_{i=1,k=1}^{N,K}. When the group action is transitive, the space on which it acts is a homogeneous space. More generally, however, the action is not transitive, and the total space contains an infinite number of orbits. Consider the quotient space Q = X/G, which consists of the orbits of G in X. Each element q ∈ Q is then a homogeneous space of G. Because many equivariant maps use this information, the total space should be G × X/G, not G. Hence, x ∈ X is lifted to a pair (u, q), where u ∈ G and q ∈ Q. Group convolution generalizes the translation-based convolution used for images to other groups.

Definition 3 (Group Convolution (Kondor & Trivedi, 2018; Cohen et al., 2019)). Let g, f : G × Q → R be functions, and let µ(·) be a Haar measure on G. For any u ∈ G, the convolution of f by g is defined as

h(u, q) = ∫_{G × Q} g(v^{-1}u, q, q′) f(v, q′) dµ(v) dq′.

From this definition, we can verify that group convolution is G-equivariant. Moreover, Cohen et al. (2019) recently showed that any G-equivariant linear map is represented by a group convolution when the action of the group is transitive.

### 4.2 LOCAL GROUP CONVOLUTION

In this study, we use LieConv (Finzi et al., 2020), a convolution that extends group convolution to Lie groups, as our group convolution. LieConv acts on a set {(x_i, f_i)}_{i=1}^N of coordinates x_i ∈ X and values f_i ∈ V in a vector space V. First, the input coordinates x_i are transformed (lifted) into group elements u_i and orbits q_i. Next, we define the convolution range based on an invariant (pseudo-)distance on the group and convolve using a kernel parameterized by a neural network. What matters for inductive bias and computational efficiency is that the range of the convolution is local; that is, if the distance between u_i and u_j is larger than r, then g_θ(u_i, u_j) = 0. First, we define a distance on the Lie group to deal with locality in the matrix group[2]:

d(u, v) := ∥log(u^{-1}v)∥_F,

where log denotes the matrix logarithm and ∥·∥_F denotes the Frobenius norm. Because d(wu, wv) = ∥log(u^{-1}w^{-1}wv)∥_F = d(u, v), this function is left-invariant and is a pseudo-distance.[3] To further account for the orbit q, we extend the distance to

d((u_i, q_i), (v_j, q_j))² = d(u_i, v_j)² + α d_O(q_i, q_j)²,

where d_O(q_i, q_j) := inf_{x_i ∈ q_i, x_j ∈ q_j} d_X(x_i, x_j) and d_X is the distance on X. This extension is not necessarily invariant to transformations of q. Based on this distance, the neighborhood is nbhd(u, q) = {(v, q′) : d((u, q), (v, q′)) < r}.
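The left-invariant pseudo-distance above is straightforward to compute for matrix groups; the snippet below checks it numerically for 2D rotations (SO(2)), purely as an illustration of the definition rather than the paper's implementation.

```python
import numpy as np
from scipy.linalg import logm

def rot(theta):
    """A 2x2 rotation matrix, i.e. an element of SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def lie_distance(u, v):
    """d(u, v) = || log(u^{-1} v) ||_F, the left-invariant pseudo-distance."""
    return np.linalg.norm(logm(np.linalg.inv(u) @ v), ord="fro")

u, v, w = rot(0.2), rot(0.9), rot(-1.3)
print(lie_distance(u, v))            # ~ sqrt(2) * |0.9 - 0.2|
print(lie_distance(w @ u, w @ v))    # identical: left invariance d(wu, wv) = d(u, v)
```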
The radius r should be set through the ratio of the convolution range to the total input extent, because an appropriate absolute value is hard to determine and depends on the group being treated. The local Lie group convolution is therefore

h(u, q) = ∫_{(v, q′) ∈ nbhd(u, q)} g_θ(v^{-1}u, q, q′) f(v, q′) dµ(v) dq′.

The neighborhood radius r corresponds to the inverse of the density channel h^(0) in Gordon et al. (2019).

Discrete approximation. Given lifted input points {(v_j, q_j)}_{j=1}^N and a function value f_j = f(v_j, q_j) at each point, we need to select targets {(u_i, q_i)}_{i=1}^N at which to convolve so that we can approximate the integral above. Because the convolution range is limited to nbhd(u, q), LieConv can approximate the integral by the Monte Carlo method:

h(u, q) = (g ∗̂ f)(u, q) = (1 / |nbhd(u, q)|) Σ_{(v_j, q′_j) ∈ nbhd(u, q)} g(v_j^{-1}u, q, q′_j) f(v_j, q′_j).

The classical convolutional filter kernel g(·) is only valid for discrete values and is not available for continuous group elements. PointConv/LieConv therefore use a multilayer neural network g_θ as the convolutional kernel. However, because neural networks operate naturally on Euclidean space and the group G is not a vector space, we let g_θ act on the Lie algebra g. We therefore restrict attention to Lie groups for which the logarithm map exists for every group element. That is, we let g_θ(u) = (g ∘ exp)_θ(log u) and parameterize (g ∘ exp)_θ by an MLP, with g_θ : g → R^{c_out × c_in}. The discretized convolution is then

h_i = (1 / |nbhd(i)|) Σ_{j ∈ nbhd(i)} g_θ(log(v_j^{-1}u_i), q_i, q_j) f_j,

where the input to the MLP is a_ij = Concat[log(v_j^{-1}u_i), q_i, q_j].

Footnotes:
[1] K is a hyperparameter, and we randomly pick K elements {u_{ik}}_{k=1}^K from the orbit corresponding to x_i.
[2] We assume that we have a finite-dimensional representation.
[3] This is because the triangle inequality is not satisfied.

### 4.3 IMPLEMENTATION

First, we explain the form of φ. Because most real-world data have a single output per input location, we treat the multiplicity of D_C as one, i.e., K = 1, and define φ(y) = [1, y] following Zaheer et al. (2017). The first dimension of the output φ_i indicates whether data located at x_i have been observed, so that the model can distinguish observed data from unobserved data whose value is zero (y_i = 0).

Next, we describe the form of ψ. Following Theorem 2, ψ is required to be a stationary, non-negative, positive-definite kernel. For EquivCNP, we change ψ depending on whether the input data are continuous or discrete. For continuous input data (e.g., 1D regression), we use an RBF kernel for ψ. The RBF kernel has a learnable bandwidth parameter and scale parameter and is optimized jointly with EquivCNP. The functional representation E(Z) is formed by multiplying the kernel ψ with φ. When the inputs are discrete (e.g., images), we use LieConv instead of an RBF kernel.

Finally, we explain the form of ρ. By Theorem 2, ρ needs to be a continuous group-equivariant map between function spaces, so we use LieConv for ρ. In this study, under the separability hypothesis (Kaiser et al., 2017), we implement LieConv separably in the spatial and channel directions to improve computational efficiency. The details are given in Appendix B.
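Before turning to the discretization of E(Z), here is a minimal numerical sketch of the Monte Carlo approximation from Section 4.2, for SO(2) lifts and a toy two-layer kernel network; the layer sizes, the random (untrained) weights, and the omission of orbit coordinates are simplifications of ours, not the paper's configuration.

```python
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(0)
c_in, c_out, hidden = 3, 4, 16

# Toy MLP kernel g_theta: Lie-algebra coordinates -> (c_out x c_in) matrix.
W1, b1 = rng.normal(size=(hidden, 1)), np.zeros(hidden)
W2, b2 = rng.normal(size=(c_out * c_in, hidden)), np.zeros(c_out * c_in)

def g_theta(a):
    h = np.tanh(W1 @ a + b1)
    return (W2 @ h + b2).reshape(c_out, c_in)

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def lie_conv_point(u_i, lifted, feats, radius=0.8):
    """h_i = (1/|nbhd(i)|) * sum_j g_theta(log(v_j^{-1} u_i)) f_j over local neighbors."""
    out, count = np.zeros(c_out), 0
    for v_j, f_j in zip(lifted, feats):
        a_ij = np.real(logm(np.linalg.inv(v_j) @ u_i))       # Lie-algebra element
        if np.linalg.norm(a_ij, ord="fro") < radius:         # locality: nbhd(u)
            out += g_theta(np.array([a_ij[1, 0]])) @ f_j     # so(2) has one coordinate
            count += 1
    return out / max(count, 1)

# Lifted inputs (rotations) with c_in-dimensional features.
angles = rng.uniform(-np.pi / 2, np.pi / 2, size=10)
lifted = [rot(t) for t in angles]
feats = rng.normal(size=(10, c_in))
print(lie_conv_point(rot(0.1), lifted, feats))                # (c_out,) output
```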
EquivCNP requires computing the convolution of E(Z). However, since E(Z) is itself a functional representation, it cannot be computed on a computer as is. To address this issue, we discretize E(Z) over the range of the context and target points. We space lattice points (t_i)_{i=1}^n ⊆ X on a uniform grid over a hypercube covering both the context and target points. Because the conventional convolution used in ConvCNP requires a discrete lattice input space and produces discrete outputs, the outputs then need to be mapped back to continuous functions X → Y. While ConvCNP treats its outputs as weights for evenly spaced basis functions (i.e., an RBF kernel), LieConv does not require the input locations to lie on a lattice and can produce continuous-function outputs directly. Note that the algorithm of EquivCNP can also be made the same as ConvCNP's; it can likewise use evenly spaced basis functions. The obtained functions are used to output the Gaussian predictive mean and variance at the given target points, and we can evaluate EquivCNP by the log-likelihood computed from this mean and variance.

## 5 EXPERIMENT

To investigate the potential of EquivCNP, we pose three questions: (1) Is EquivCNP comparable to conventional NPs such as ConvCNP? (2) Can EquivCNP have group equivariances beyond translation equivariance? (3) Does it preserve those symmetries? For a fair comparison with ConvCNP, the architecture of EquivCNP follows that of ConvCNP; details are given in Appendix C.

### 5.1 1D SYNTHETIC REGRESSION TASK

To answer the first question, we tackle the 1D synthetic regression task used in prior work (Garnelo et al., 2018a;b; Kim et al., 2019). At each iteration, a function f is sampled from a given function distribution, and then the context points D_C and target points D_T are sampled from f. In this experiment, we used Gaussian processes with an RBF kernel, a Matérn-5/2 kernel, and a periodic kernel as the function distributions, and we chose translation equivariance T(1) as the equivariance incorporated into EquivCNP. We compared EquivCNP with a GP (as an oracle), with CNP (Garnelo et al., 2018a) as a baseline, and with ConvCNP. In this task, both contexts and targets are sampled from the range [−2, 2]. Table 1 reports the mean and standard deviation of the log-likelihood over 1000 tasks.

Table 1: Log-likelihood on synthetic 1D regression (mean ± standard deviation over 1000 tasks).

| Model | RBF | Matérn-5/2 | Periodic |
| --- | --- | --- | --- |
| Oracle GP | 3.9335 ± 0.5512 | 3.7676 ± 0.3542 | 1.2194 ± 5.6685 |
| CNP (Garnelo et al., 2018a) | 1.7468 ± 1.5415 | 1.7808 ± 1.3124 | 1.0034 ± 0.5174 |
| ConvCNP (Gordon et al., 2019) | 1.3271 ± 1.0324 | 0.8189 ± 0.9366 | 0.4787 ± 0.5448 |
| EquivCNP (ours) | 1.2930 ± 1.0113 | 0.6616 ± 0.6728 | 0.4037 ± 0.4968 |

Figure 2: Predictive mean and variance of ConvCNP and EquivCNP. The first two columns show predictions of models trained on the RBF kernel and the last two columns show predictions of models trained on the Matérn-5/2 kernel. The target function and sampled data points are the same in the top and bottom rows except for the context. In the top row, the context lies within the vertical dashed lines, i.e., the range sampled during training (black circles). In the bottom row, new context points located outside the training range (white circles) are added.

From Table 1, we can see that EquivCNP with translation equivariance is comparable to ConvCNP across all GP curve datasets. That is, EquivCNP has the model capacity to learn these functions as well as ConvCNP does.
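For reference, the following is a sketch of how one such meta-learning task can be generated: a function is drawn from a GP prior on [−2, 2] and split into context and target sets. The RBF kernel with unit length scale, the jitter, and the noise-free observations are assumptions of ours; the context/target sizes follow the ranges given in Appendix D.1.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def sample_task(rng, n_min=3, n_max=50, x_range=(-2.0, 2.0)):
    """Draw one 1D regression task: a GP sample split into context and target sets."""
    n_ctx = rng.integers(n_min, n_max + 1)
    n_tgt = rng.integers(n_min, n_max + 1)
    x = rng.uniform(*x_range, size=n_ctx + n_tgt)
    cov = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))   # jitter for numerical stability
    y = rng.multivariate_normal(np.zeros(len(x)), cov)
    return (x[:n_ctx], y[:n_ctx]), (x[n_ctx:], y[n_ctx:])   # (context), (target)

rng = np.random.default_rng(0)
(x_c, y_c), (x_t, y_t) = sample_task(rng)
print(x_c.shape, x_t.shape)
```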
We also conducted the extrapolation regression proposed by Gordon et al. (2019), shown in Figure 2. The first two columns show models trained on the RBF kernel and the last two columns show models trained on the Matérn-5/2 kernel. The top row shows the predictive distribution when the observations are given within the training region; in the bottom row, observations come not only from the training region but also from the extrapolation region [−4, 4]. As a result, EquivCNP generalizes to observed data from a range not seen during training. This result was expected, because Gordon et al. (2019) noted that translation equivariance enables models to adapt to this setting.

### 5.2 2D IMAGE-COMPLETION TASK

The image-completion task investigates whether EquivCNP can complete images when given an appropriate group equivariance. Image completion can be regarded as a regression task that predicts the value y*_i at 2D image coordinates x*_i, given the observed pixels D_C = {(x_n, y_n)}_{n=1}^N (y_n ∈ R³ for color images and y_n ∈ R for grayscale images). This framework applies not only to images but also to other real-world applications, such as predicting spatial data (Takeuchi et al., 2018).

To evaluate the effect of EquivCNP with a specific group equivariance, we introduce a new digital-clock-digits dataset, shown in Figure 3. Since previous works use the MNIST dataset for image completion, we also ran the completion task on rotated MNIST, but we could not find a significant difference between the group-equivariant models (the rotated-MNIST results are shown in Appendix E). We think this happens because (1) the original MNIST already contains various data symmetries, including translation, scaling, and rotation, and (2) we cannot specify them precisely. Thus, we provide the digital-clock-digits dataset anew.

In this experiment, we used four kinds of group equivariance: the translation group T(2), the 2D rotation group SO(2), the translation-and-rotation group SE(2), and the rotation-scale group R_{>0} × SO(2). The images are 64 × 64 pixels, and the digits are centered with the same vertical extent. For the test data, we transform the images by scaling within [0.15, 0.5] and rotating within [−90°, +90°]. Image completion with our digits data thus becomes an extrapolation task in the sense that the test data are never seen during training, although the digit shapes are the same in both sets.
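Under the set-based view of image completion described above, an image and a random pixel mask can be converted into a context set of (coordinate, value) pairs and a target set covering every coordinate. The sketch below shows one such conversion; the coordinate normalization to [−1, 1] and the fixed Bernoulli observation rate are illustrative assumptions (Appendix D.2 describes how the rate is actually drawn in the experiments).

```python
import numpy as np

def image_to_sets(image, p_obs=0.25, rng=None):
    """Turn an image (H, W) into a context set of observed pixels and a full target set."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(np.float64)
    coords = 2.0 * coords / np.array([h - 1, w - 1]) - 1.0    # normalize to [-1, 1]^2
    values = image.reshape(-1, 1).astype(np.float64)

    mask = rng.random(len(coords)) < p_obs                    # Bernoulli context mask
    x_ctx, y_ctx = coords[mask], values[mask]                 # observed pixels D_C
    x_tgt = coords                                            # predict every pixel
    return (x_ctx, y_ctx), x_tgt

digit = np.random.default_rng(0).random((64, 64))             # stand-in for a digit image
(x_ctx, y_ctx), x_tgt = image_to_sets(digit, p_obs=0.25)
print(x_ctx.shape, y_ctx.shape, x_tgt.shape)                  # roughly (1024, 2) (1024, 1) (4096, 2)
```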
The log-likelihood of image completion by EquivCNP with each group equivariance is reported in Table 2. The mean and standard deviation of the log-likelihood are calculated over 1000 tasks (i.e., each digit is evaluated under 100 random transformations).

Table 2: Log-likelihood on the 2D image-completion task (mean ± standard deviation).

| Group | Log-likelihood |
| --- | --- |
| T(2) | 1.0998 ± 0.4115 |
| SO(2) | 2.4275 ± 6.8856 |
| R_{>0} × SO(2) | 1.8398 ± 0.5368 |
| SE(2) | 1.1655 ± 0.5420 |

Figure 3: Examples of training data (top) and test data (bottom).

Figure 4: Image-completion results. The top row shows the given observations, and the other rows show the mean of the conditional distribution predicted by EquivCNP with a specific group equivariance: T(2) (corresponding to ConvCNP), SO(2), R_{>0} × SO(2), and SE(2). Each pair of columns shows the same image at two context sampling rates, 25% and 75%. When the digits have the same size as in the training set (i.e., no scaling, so the transformation reduces to SO(2) symmetry), T(2) and SE(2) give good quality, but when the digits are smaller than in the training set, R_{>0} × SO(2) performs well.

As a result, EquivCNP with R_{>0} × SO(2) performed better than the other group equivariances. On the other hand, the model with SO(2) had the worst performance. This is likely because SO(2) alone does not allow EquivCNP to generalize to scaling. In fact, the log-likelihood of SE(2), which is the group equivariance combining translation T(2) and rotation SO(2), is not improved over that of T(2). Figure 4 shows qualitative image-completion results for EquivCNP with each group equivariance. We demonstrate that EquivCNP was able to predict digits smaller than the training digits.[4] While T(2) completes the images most clearly when the digits are large and many observations are given, the other groups also complete the images. The smaller the digits are compared with the training digits, the worse the quality of the T(2) completion becomes, and the more clearly R_{>0} × SO(2) completes the digits. This is because the convolution region of T(2) is invariant to the location, whereas that of R_{>0} × SO(2) adapts to the location. Consequently, for images transformed by scaling, EquivCNP with R_{>0} × SO(2) preserves the scaling group equivariance.

[4] When the scaling factor is 1.0, the transformation reduces to SO(2) symmetry.

## 6 DISCUSSION

We presented a new neural process, EquivCNP, that uses group-equivariant convolutions adopted from LieConv. Given a specific group equivariance, such as translation or rotation, as an inductive bias, EquivCNP performs well on regression tasks. This is because the kernel support changes depending on the specified equivariance. Real-world applications, such as robot-learning tasks (e.g., using a hand-eye camera), are left as future work. We also hope that EquivCNPs will help in learning group equivariance from data (Quessard et al., 2020) in future research.

## REFERENCES

Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. In Advances in Neural Information Processing Systems, pp. 14510-14519, 2019.

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981-3989, 2016.

Erik J Bekkers. B-spline CNNs on Lie groups. arXiv preprint arXiv:1909.12057, 2019.

Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251-1258, 2017.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999, 2016.

Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142-9153, 2019.

Michaël Defferrard, Nathanaël Perraudin, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: towards an equivariant graph-based spherical CNN. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. URL https://arxiv.org/abs/1904.05146.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126-1135. JMLR.org, 2017.
Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. arXiv preprint arXiv:2002.12880, 2020.

Alexandre Galashov, Jonathan Schwarz, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, SM Eslami, and Yee Whye Teh. Meta-learning surrogate models for sequential decision making. arXiv preprint arXiv:1903.11907, 2019.

Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.

Jonathan Gordon, Wessel P Bruinsma, Andrew YK Foong, James Requeima, Yann Dubois, and Richard E Turner. Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556, 2019.

Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pp. 15353-15363, 2019.

Zhiwu Huang, Chengde Wan, Thomas Probst, and Luc Van Gool. Deep learning on Lie groups for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6099-6108, 2017.

Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Christiane Lemke, Marcin Budka, and Bogdan Gabrys. Metalearning: a survey of trends and technologies. Artificial Intelligence Review, 44(1):117-130, 2015.

Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The functional neural process. In Advances in Neural Information Processing Systems, pp. 8743-8754, 2019.

Robin Quessard, Thomas D Barrett, and William R Clements. Learning group structure and disentangled representations of dynamical environments. arXiv preprint arXiv:2002.06991, 2020.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2016.

Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.

Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph networks with ODE integrators. arXiv preprint arXiv:1909.12790, 2019.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842-1850, 2016.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077-4087, 2017.
Koh Takeuchi, Hisashi Kashima, and Naonori Ueda. Angle-based convolution networks for extracting local spatial features. In NeurIPS Workshop on Modeling and Decision-Making in the Spatiotemporal Domain, 2018.

Peter Toth, Danilo Jimenez Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and Irina Higgins. Hamiltonian generative networks. arXiv preprint arXiv:1909.13789, 2019.

Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems, pp. 14334-14345, 2019.

Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621-9630, 2019.

Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam R Kosiorek, and Yee Whye Teh. MetaFun: Meta-learning with iterative functional updates. arXiv preprint arXiv:1912.02738, 2019.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391-3401, 2017.

Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.

## SUPPLEMENTARY MATERIAL

### A. PROOF OF THEOREM 2

First, we prove that (II) implies (I). We define the action of G on the set of univariate maps f : X → R by (g·f)(x) := f(g^{-1}·x), and the action of G on the set of bivariate maps ψ : X² → R by (g·ψ)(x, x′) := ψ(g^{-1}·x, g^{-1}·x′).

Lemma 4. For a map ψ : X² → R^d and a sample x ∈ X, define the sample-dependent function ψ_x : X → R^d by ψ_x(x′) := ψ(x′, x). Then ψ is G-invariant if and only if the map X ∋ x ↦ ψ_x ∈ Map(X, R^d) is G-equivariant.

The lemma is derived as follows:

ψ_{g·x′}(x) = ψ(x, g·x′) = ψ(g^{-1}·x, x′) = ψ_{x′}(g^{-1}·x) = (g·ψ_{x′})(x).

The left-hand side involves the action on the sample space and the right-hand side the action on the function space. Denoting the set of all G-equivariant maps from S to S′ by Equiv(S, S′), the lemma can be written as Inv(X², R^d) = Equiv(X, Map(X, R^d)). Thus, since ψ : X² → R is invariant by (II), x ↦ ψ_x is equivariant. Then, for Z ∈ Z′_{≤M}, the correspondence Z ↦ Σ_{i=1}^m φ_{K+1}(y_i) ψ_{x_i} is also equivariant. Since ρ is equivariant by (II), we obtain (I), because the composition of equivariant maps is equivariant.

Next, we prove that (I) implies (II). We prepare some notation and lemmas. Let ψ be an interpolating continuous kernel that satisfies ψ(x, x′) ≥ 0. Then, for m ∈ N and Z′_m ⊆ (X × Y)^m, define

H_m(Z′_m) := { Σ_{i=1}^m φ_{K+1}(y_i) ψ(·, x_i) : (x_i, y_i)_{i=1}^m ∈ Z′_m } ⊆ H^{K+1},

where H^{K+1} = H ⊕ ··· ⊕ H is the (K+1)-dimensional vector-valued-function Hilbert space constructed from the RKHS H for which ψ is the reproducing kernel, endowed with the inner product ⟨f, g⟩_{H^{K+1}} = Σ_{i=1}^{K+1} ⟨f_i, g_i⟩_H, where ⟨·,·⟩_H is the inner product of the RKHS H. When the permutation group S_m acts on the set (X × Y)^m, the set of equivalence classes of this action is denoted by (X × Y)^m / S_m. For an element Z ∈ (X × Y)^m, the equivalence class under this action is denoted by [Z]. Similarly, for a subset Z′_m ⊆ (X × Y)^m, the set of equivalence classes is denoted by [Z′_m] := {[Z] | Z ∈ Z′_m}. Furthermore, we write [Z′_{≤M}] := ∪_{m=1}^M [Z′_m] and H_{≤M} := ∪_{m=1}^M H_m(Z′_m). Lemma 1 and Lemma 3 in Gordon et al. (2019) provide the following lemma.
Lemma 5. For m ∈ N, let Z′_m ⊆ (X × Y)^m be a set with multiplicity K and let ψ be an interpolating continuous kernel. Then (H_m(Z′_m))_{m=1}^M are pairwise disjoint, and the embedding E is injective and continuous:

E : [Z′_{≤M}] → H_{≤M}, E([Z]) := E_m([Z]) if [Z] ∈ [Z′_m],
E_m : [Z′_m] → H_m(Z′_m), E_m([(x_1, y_1), ..., (x_m, y_m)]) := Σ_{i=1}^m φ_{K+1}(y_i) ψ(·, x_i).

Similarly, Lemma 2 and Lemma 4 in Gordon et al. (2019) provide the following lemma.

Lemma 6. Suppose that Z′_{≤M} is a topologically closed, permutation-invariant set in (X × Y)^{≤M}, and that ψ satisfies (i) ψ ≥ 0, (ii) ψ(x, x) = σ² > 0 for any x, and (iii) ψ(x, x′) → 0 as ∥x − x′∥ → ∞.

Figure 6: The architecture of EquivCNP for the 1D regression task (an MLP with channels 32 → 16 → 8 followed by four (LieConv + ReLU) blocks). ψ is an RBF kernel and φ = [y⁰, y¹, ..., y^K].

### C.2 2D IMAGE-COMPLETION TASK

For the 2D image-completion task, we use LieConv (Conv_θ) instead of an RBF kernel for ψ. This LieConv has 128 channels, an average neighborhood fraction of 1/10, and 121 Monte Carlo samples. After the LieConv for ψ, we use four residual blocks. Each block is composed of two separable LieConv layers and residual connections, as shown in Figure 7. Each residual block has 128 channels, an average neighborhood fraction of 1/15, and 81 Monte Carlo samples.

Figure 7: Residual block built from two separable LieConv layers and residual connections.

We employ the same image-completion procedure as ConvCNP (Gordon et al., 2019):

1. Given an input image I ∈ R^{C×H×W}, where C is the number of color channels and H and W are the height and width, sample a context mask M_c from a Bernoulli distribution and form the context features I ⊙ M_c. M_c acts as the density channel, in the same way that we define φ for the 1D regression task.
2. After lifting the inputs, apply a LieConv to both I ⊙ M_c and M_c to obtain the functional representation E(Z) = Conv_θ([M_c, I ⊙ M_c]) ∈ R^{(128+128)×H×W}.
3. The functional representation E(Z) is then passed through one fully connected (FC) layer followed by four residual blocks: h = ResBlocks(FC(E(Z))) ∈ R^{128×H×W}.
4. Finally, we use one FC layer to obtain the mean and standard-deviation channels, splitting the output R^{2C×H×W} into these two statistics.

### D. EXPERIMENT DETAILS

In this section, we describe the experiments in more detail. Code and dataset are available at https://github.com/makora9143/EquivCNP.

#### D.1 1D SYNTHETIC REGRESSION TASK

The kernels used in Section 5.1 for generating data via Gaussian processes are defined as follows:

RBF: k(x1, x2) = exp(−(x1 − x2)²)
Matérn-5/2: k(x1, x2) = (1 + √5 d + (5/3) d²) exp(−√5 d), with d = ∥x1 − x2∥₂
Periodic: k(x1, x2) = exp(−2 sin²(π ∥x1 − x2∥₂))
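For reference, the three data-generating kernels above can be written as follows; unit length scales and period are assumed wherever the definitions above do not state them explicitly.

```python
import numpy as np

def k_rbf(x1, x2, length_scale=1.0):
    return np.exp(-((x1 - x2) ** 2) / length_scale**2)

def k_matern52(x1, x2, length_scale=1.0):
    d = np.abs(x1 - x2) / length_scale
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def k_periodic(x1, x2, period=1.0, length_scale=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / length_scale**2)

x = np.linspace(-2.0, 2.0, 5)
for k in (k_rbf, k_matern52, k_periodic):
    print(k(x[:, None], x[None, :]).shape)   # (5, 5) Gram matrix
```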
To train all NPs, the GPs generate the context and target points; the numbers of context and target points are each sampled uniformly from [3, 50]. All NPs were trained for 200 epochs with 256 batches per epoch and a batch size of 16. We used the Adam optimizer (Kingma & Ba, 2014) with learning rate 10^-3. The CNP architecture was based on the original code.[5]

[5] https://github.com/deepmind/neural-processes

We visualize the periodic-kernel regression results in Figure 8. We also demonstrate EquivCNP with the algorithm following that of ConvCNP (Gordon et al., 2019), i.e., treating the output of EquivCNP as weights for evenly spaced basis functions (an RBF kernel), in Figure 9. The resulting predictive distribution is much smoother than that of our Algorithm 1, although using the RBF kernel is redundant.

Figure 8: Predictive mean and variance of ConvCNP and EquivCNP for the periodic kernel. The first two columns show results without an outlier observation, and the last two columns show results with an outlier observation.

Figure 9: Predictive mean and variance of EquivCNP using the algorithm proposed by Gordon et al. (2019). The blue line and region correspond to EquivCNP and the green line and region to the Gaussian process; each plot shows a different data sample. Although this algorithm is redundant compared with our proposed Algorithm 1, because an RBF kernel is used to map the LieConv output back to a continuous function, the results are much smoother than those in Figures 2 and 8.

#### D.2 2D IMAGE-COMPLETION TASK

The original digital-clock digit image is shown in Figure 10. We first inverted the black and white colors of the image. Then we cropped the image so that each cropped image contains one digit and resized the crops to 64 × 64. Note that the vertical size of each digit is set to 56 pixels, while the horizontal size is not fixed. All pixel values are divided by 255 to rescale them to the range [0, 1].

Figure 10: The original data used for the 2D image-completion task.

As mentioned in Section C.2, the context points are sampled from a Bernoulli distribution. Its parameter p, the probability that a pixel is observed, is set to n/n_total with n drawn uniformly from U(n_total/100, n_total/2), where n_total is the total number of pixels. The batch size is 4, the number of epochs is 100, and the optimizer is Adam (Kingma & Ba, 2014) with learning rate 5 × 10^-4.

### E. ADDITIONAL COMPLETION TASK: MNIST

We also conducted the image-completion task using rotated MNIST. As noted in Section 5.2, (1) the original MNIST already contains various data symmetries, including translation, scaling, and rotation, and (2) we cannot specify them precisely. Figure 11 shows actual images from the original MNIST dataset. We can confirm that, even though we applied no transformation, the images are already rotated to some extent.

Figure 11: Actual images from the original MNIST.

Moreover, factors other than symmetry, such as personal writing habits, are present. For these reasons, the original MNIST is not well suited to verifying the effectiveness of EquivCNP. The results are shown in Figure 12. In this experiment, the batch size is 16, the number of epochs is 30, and the optimizer is Adam with learning rate 5 × 10^-4. As a result, the model misses the completion when the number of context points is very small. On the other hand, when the number of context points is sufficient, the completion results look good except for the SO(2)-equivariant model.

Figure 12: Image-completion results using rotated MNIST for each group equivariance (e.g., panel (c) shows R_{>0} × SO(2)). In each image, the 1st and 4th columns show context pixels, the 2nd and 5th columns show ground-truth images, and the 3rd and 6th columns show completion results.