
General E(2)-Equivariant Steerable CNNs

Maurice Weiler University of Amsterdam, QUVA Lab m.weiler@uva.nl

Gabriele Cesa University of Amsterdam cesa.gabriele@gmail.com

The big empirical success of group equivariant networks has led in recent years to the sprouting of a great variety of equivariant network architectures. A particular focus has thereby been on rotation and reflection equivariant CNNs for planar images. Here we give a general description of E(2)-equivariant convolutions in the framework of Steerable CNNs. The theory of Steerable CNNs thereby yields constraints on the convolution kernels which depend on group representations describing the transformation laws of feature spaces. We show that these constraints for arbitrary group representations can be reduced to constraints under irreducible representations. A general solution of the kernel space constraint is given for arbitrary representations of the Euclidean group E(2) and its subgroups. We implement a wide range of previously proposed and entirely new equivariant network architectures and extensively compare their performances. E(2)-steerable convolutions are further shown to yield remarkable gains on CIFAR-10, CIFAR-100 and STL-10 when used as a drop-in replacement for non-equivariant convolutions.

1 Introduction

The equivariance of neural networks under symmetry group actions has in recent years proven to be a fruitful prior in network design. By guaranteeing a desired transformation behavior of convolutional features under transformations of the network input, equivariant networks achieve improved generalization capabilities and sample complexities compared to their non-equivariant counterparts. Due to their great practical relevance, a large pool of rotation- and reflection-equivariant models for planar images has been proposed by now. Unfortunately, an empirical survey, reproducing and comparing all these different approaches, is still missing.

An important step in this direction is given by the theory of Steerable CNNs [1, 2, 3, 4, 5] which defines a very general notion of equivariant convolutions on homogeneous spaces. In particular, steerable CNNs describe E(2)-equivariant (i.e. rotation- and reflection-equivariant) convolutions on the image plane $\mathbb{R}^2$. The feature spaces of steerable CNNs are thereby defined as spaces of feature fields, characterized by a group representation which determines their transformation behavior under transformations of the input. In order to preserve the specified transformation law of feature spaces, the convolutional kernels are subject to a linear constraint, depending on the corresponding group representations. While this constraint has been solved for specific groups and representations [1, 2], no general solution strategy has been proposed so far. In this work we give a general strategy which reduces the solution of the kernel space constraint under arbitrary representations to much simpler constraints under single, irreducible representations.

Speciﬁcally for the Euclidean group E(2) and its subgroups, we give a general solution of this kernel space constraint. As a result, we are able to implement a wide range of equivariant models, covering regular GCNNs [6, 7, 8, 9, 10, 11], classical Steerable CNNs [1], Harmonic Networks [12], gated Harmonic Networks [2], Vector Field Networks [13], Scattering Transforms [14, 15, 16, 17, 18] and entirely new architectures, in one uniﬁed framework. In addition, we are able to build hybrid models, mixing different ﬁeld types (representations) of these networks both over layers and within layers.

* Equal contribution, author ordering determined by random number generator. This research has been conducted during an internship at QUVA lab, University of Amsterdam.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We further propose a group restriction operation, allowing for network architectures which are decreasingly equivariant with depth. This is useful e.g. for natural images which show low level features like edges in arbitrary orientations but carry a sense of preferred orientation globally. An adaptive level of equivariance accounts for the resulting loss of symmetry in the hierarchy of features.

Since the theory of steerable CNNs does not give a preference for any choice of group representation or equivariant nonlinearity, we run an extensive benchmark study, comparing different equivariance groups, representations and nonlinearities. We do so on MNIST 12k, rotated MNIST SO(2) and reflected and rotated MNIST O(2) to investigate the influence of the presence or absence of certain symmetries in the dataset. Using our equivariant convolutional layers as a drop-in replacement is shown to yield significant gains over non-equivariant baselines on CIFAR-10, CIFAR-100 and STL-10.

Beyond the applications presented in this paper, our contributions are of relevance for general steerable CNNs on homogeneous spaces [3, 4] and gauge equivariant CNNs on manifolds [5] since these models obey the same kind of kernel constraints. More speciﬁcally, 2-dimensional manifolds, endowed with an orthogonal structure group O(2) (or subgroups thereof), necessitate exactly the kernel constraints solved in this paper. Our results can therefore readily be transferred to e.g. spherical CNNs [19, 5, 20, 21, 22, 23] or more general models of geometric deep learning [24, 25, 26, 27].

2 General E(2)-Equivariant Steerable CNNs

Convolutional neural networks process images by extracting a hierarchy of feature maps from a given input signal. The convolutional weight sharing ensures the inference to be translation-equivariant which means that a translated input signal results in a corresponding translation of the feature maps. However, vanilla CNNs leave the transformation behavior of feature maps under more general transformations, e.g. rotations and reflections, undefined. In this work we devise a general framework for convolutional networks which are equivariant under the Euclidean group E(2), that is, under isometries of the plane $\mathbb{R}^2$. We work in the framework of steerable CNNs [1, 2, 3, 4, 5] which provides a quite general theory for equivariant CNNs on homogeneous spaces, including Euclidean spaces $\mathbb{R}^d$ as a specific instance. Sections 2.2 and 2.3 briefly review the theory of Euclidean steerable CNNs as described in [2]. The following subsections explain our main contributions: a decomposition of the kernel space constraint into irreducible subspaces (2.4), their solution for E(2) and subgroups (2.5), an overview of the group representations used to steer features, their admissible nonlinearities and their use in related work (2.6), the group restriction operation (2.7) and implementation details (2.8).

2.1 Isometries of the Euclidean plane $\mathbb{R}^2$

The Euclidean group E(2) is the group of isometries of the plane $\mathbb{R}^2$, consisting of translations, rotations and reflections. Characteristic patterns in images often occur at arbitrary positions and in arbitrary orientations. The Euclidean group therefore models an important factor of variation of image features. This is especially true for images without a preferred global orientation like satellite imagery or biomedical images but often also applies to low level features of globally oriented images.

One can view the Euclidean group as being constructed from the translation group $(\mathbb{R}^2, +)$ and the orthogonal group $\mathrm{O}(2) = \{O \in \mathbb{R}^{2\times 2} \mid O^T O = \mathrm{id}_{2\times 2}\}$ via the semidirect product operation as $\mathrm{E}(2) = (\mathbb{R}^2, +) \rtimes \mathrm{O}(2)$. The orthogonal group thereby contains all operations leaving the origin invariant, i.e. continuous rotations and reflections. In order to allow for different levels of equivariance and to cover a wide spectrum of related work we consider subgroups of the Euclidean group of the form $(\mathbb{R}^2, +) \rtimes G$, defined by subgroups $G \le \mathrm{O}(2)$. Specifically, $G$ could be either the special orthogonal group SO(2), the group $(\{\pm 1\}, *)$ of the reflections along a given axis, the cyclic groups $C_N$, the dihedral groups $D_N$ or the orthogonal group O(2) itself. While SO(2) describes continuous rotations (without reflections), $C_N$ and $D_N$ contain $N$ discrete rotations by angles that are multiples of $\frac{2\pi}{N}$ and, in the case of $D_N$, reflections. $C_N$ and $D_N$ are therefore discrete subgroups of order $N$ and $2N$, respectively. For an overview of the groups and their interrelations see Table 6 in the Appendix.

Since the groups $(\mathbb{R}^2, +) \rtimes G$ are semidirect products, one can uniquely decompose any of their elements into a product $tg$ where $t \in (\mathbb{R}^2, +)$ and $g \in G$ [3], which we will do in the rest of the paper.
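The decomposition into a product $tg$ can be made concrete with a small sketch. The snippet below assumes NumPy and represents an element $tg$ as a pair of a translation vector and an orthogonal 2x2 matrix; the helper names `rot`, `act`, `compose` and `inverse` are hypothetical and chosen for illustration only.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, an element of SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def act(t, g, x):
    """Action of tg on a point x of the plane: apply g first, then translate by t."""
    return g @ x + t

def compose(t1, g1, t2, g2):
    """Product (t1 g1)(t2 g2) = (t1 + g1 t2)(g1 g2) in the semidirect product."""
    return t1 + g1 @ t2, g1 @ g2

def inverse(t, g):
    """Inverse (tg)^{-1} = (-g^{-1} t) g^{-1}; g is orthogonal, so g^{-1} = g.T."""
    return -(g.T @ t), g.T
```

With these definitions, acting with the inverse of $tg$ on a point $x$ gives $g^{-1}(x - t)$, which is the expression that appears in the transformation laws of feature fields.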

2.2 E(2)-steerable feature fields

Steerable CNNs define feature spaces as spaces of steerable feature fields $f: \mathbb{R}^2 \to \mathbb{R}^c$ which associate a $c$-dimensional feature vector $f(x) \in \mathbb{R}^c$ to each point $x$ of a base space, in our case the plane $\mathbb{R}^2$. In contrast to vanilla CNNs, the feature fields of steerable CNNs are associated with a transformation law which specifies their transformation under actions of E(2) (or subgroups) and therefore endows features with a notion of orientation. Formally, a feature vector $f(x)$ encodes the coefficients of a coordinate independent geometric feature relative to a choice of reference frame or, equivalently, image orientation (see Appendix A).

Figure 1: Transformation behavior of $\rho$-fields. Left: scalar field, $\rho(g) = 1$. Right: vector field, $\rho(g) = g$.

An important example are scalar feature fields $s: \mathbb{R}^2 \to \mathbb{R}$, describing for instance gray-scale images or temperature fields. The Euclidean group acts on scalar fields by moving each pixel to a new position, that is, $s(x) \mapsto s\big((tg)^{-1}x\big) = s\big(g^{-1}(x - t)\big)$ for some $tg \in (\mathbb{R}^2, +) \rtimes G$; see Figure 1, left. Vector fields $v: \mathbb{R}^2 \to \mathbb{R}^2$, like optical flow or gradient images, on the other hand transform as $v(x) \mapsto g \cdot v\big(g^{-1}(x - t)\big)$. In contrast to the case of scalar fields, each vector is therefore not only moved to a new position but additionally changes its orientation via the action of $g \in G$; see Figure 1, right.

The transformation law of a general feature field $f: \mathbb{R}^2 \to \mathbb{R}^c$ is fully characterized by its type $\rho$. Here $\rho: G \to \mathrm{GL}(\mathbb{R}^c)$ is a group representation, specifying how the $c$ channels of each feature vector $f(x)$ mix under transformations. A representation satisfies $\rho(g\tilde{g}) = \rho(g)\rho(\tilde{g})$ and therefore models the group multiplication $g\tilde{g}$ as multiplication of $c \times c$ matrices $\rho(g)$ and $\rho(\tilde{g})$. More specifically, a $\rho$-field transforms under the induced representation$^{1,2}$ $\mathrm{Ind}_G^{(\mathbb{R}^2,+)\rtimes G}\rho$ of $(\mathbb{R}^2, +) \rtimes G$ as

$$f(x) \;\mapsto\; \Big(\Big[\mathrm{Ind}_G^{(\mathbb{R}^2,+)\rtimes G}\rho\Big](tg)\, f\Big)(x) \;:=\; \rho(g)\, f\big(g^{-1}(x - t)\big). \tag{1}$$

As in the examples above, it transforms feature fields by moving the feature vectors from $g^{-1}(x - t)$ to a new position $x$ and acting on them via $\rho(g)$. We thus find scalar fields to correspond to the trivial representation $\rho(g) = 1\ \forall g \in G$ which reflects that the scalar values do not change when being moved. Similarly, a vector field corresponds to the standard representation $\rho(g) = g$ of $G$.
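As an illustration of Eq. (1), the following sketch applies the induced action to a scalar field and to a vector field. It assumes NumPy and models fields as Python functions of a point $x$ rather than as sampled images, which avoids interpolation; all names (`induced_action`, `trivial`, `standard`, `scalar`, `vector`) are hypothetical.

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def induced_action(f, rho, t, g):
    """Return the transformed field x -> rho(g) f(g^{-1}(x - t)), cf. Eq. (1)."""
    def f_new(x):
        return rho(g) @ f(g.T @ (x - t))   # g is orthogonal, so g^{-1} = g.T
    return f_new

def trivial(g):
    """Type of a scalar field: rho(g) = 1."""
    return np.eye(1)

def standard(g):
    """Type of a vector field: rho(g) = g."""
    return g

def scalar(x):
    """Scalar field: a Gaussian blob centred at (1, 0)."""
    return np.array([np.exp(-np.sum((x - np.array([1.0, 0.0])) ** 2))])

def vector(x):
    """Vector field: constant field pointing along the first axis."""
    return np.array([1.0, 0.0])

g, t = rot(np.pi / 2), np.array([0.0, 2.0])
scalar_new = induced_action(scalar, trivial, t, g)   # blob is moved, values unchanged
vector_new = induced_action(vector, standard, t, g)  # vectors are moved and rotated
```

In this example the blob centre moves from $(1, 0)$ to $g(1, 0)^T + t = (0, 3)$, while the constant vector field is additionally rotated to point along the second axis.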

In analogy to the feature spaces of vanilla CNNs comprising multiple channels, the feature spaces of steerable CNNs consist of multiple feature fields $f_i: \mathbb{R}^2 \to \mathbb{R}^{c_i}$, each of which is associated with its own type $\rho_i: G \to \mathrm{GL}(\mathbb{R}^{c_i})$. A stack $f = \bigoplus_i f_i$ of feature fields is then defined to be concatenated from the individual feature fields and transforms under the direct sum $\rho = \bigoplus_i \rho_i$ of the individual representations. A common example for a stack of feature fields are RGB images $f: \mathbb{R}^2 \to \mathbb{R}^3$. Since the color channels transform independently under rotations we identify them as three independent scalar fields. The stacked field representation is thus given by the direct sum $\bigoplus_{i=1}^{3} 1 = \mathrm{id}_{3\times 3}$ of three trivial representations. While the input and output types of steerable CNNs are given by the learning task, the user needs to specify the types $\rho_i$ of intermediate feature fields as hyperparameters, similar to the choice of channels for vanilla CNNs. We discuss different choices of representations in Section 2.6 and investigate them empirically in Section 3.1.
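The direct sum of field types amounts to a block-diagonal matrix. A minimal NumPy sketch, with group elements given as orthogonal 2x2 matrices and the hypothetical helper `direct_sum`, could look as follows.

```python
import numpy as np

def trivial(g):
    """Trivial representation (scalar field): rho(g) = 1."""
    return np.eye(1)

def standard(g):
    """Standard representation (vector field): rho(g) = g."""
    return g

def direct_sum(*reps):
    """Direct sum of representations: block-diagonal stacking of their matrices."""
    def rho(g):
        blocks = [r(g) for r in reps]
        n = sum(b.shape[0] for b in blocks)
        out = np.zeros((n, n))
        i = 0
        for b in blocks:
            k = b.shape[0]
            out[i:i + k, i:i + k] = b   # place each block on the diagonal
            i += k
        return out
    return rho

rgb = direct_sum(trivial, trivial, trivial)   # three scalar fields: always id_{3x3}
mixed = direct_sum(trivial, standard)         # one scalar field and one vector field
```

The RGB type `rgb` returns the 3x3 identity for every group element, in line with the example above, whereas `mixed` leaves the first channel fixed and rotates the remaining two.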

2.3 E(2)-steerable convolutions

In order to preserve the transformation law of steerable feature spaces, each network layer is required to be equivariant under the group actions. As proven for Euclidean groups in [2], the most general equivariant linear map between steerable feature spaces, transforming under $\rho_{in}$ and $\rho_{out}$, is given by convolutions with $G$-steerable kernels$^{3}$ $k: \mathbb{R}^2 \to \mathbb{R}^{c_{out} \times c_{in}}$, satisfying a kernel constraint

$$k(gx) = \rho_{out}(g)\, k(x)\, \rho_{in}(g^{-1}) \qquad \forall g \in G,\ x \in \mathbb{R}^2. \tag{2}$$

Intuitively, this constraint determines the form of the kernel in transformed coordinates $gx$ in terms of the kernel in non-transformed coordinates $x$ and thus its response to transformed input fields. It ensures that the output feature fields transform as specified by $\mathrm{Ind}\,\rho_{out}$ when the input fields are being transformed by $\mathrm{Ind}\,\rho_{in}$; see Appendix G.1 for a proof.

1 Induced representations are the most general transformation laws compatible with convolutions [3, 4].

2 Note that this simple form of the induced representation is a special case for semidirect product groups.

3 As $k: \mathbb{R}^2 \to \mathbb{R}^{c_{out} \times c_{in}}$ returns a matrix of shape $(c_{out}, c_{in})$ for each position $x \in \mathbb{R}^2$, its discretized version can be represented by a tensor of shape $(c_{out}, c_{in}, X, Y)$ as usually done in deep learning frameworks.

Since the kernel constraint is linear, its solutions form a linear subspace of the vector space of unconstrained kernels considered in conventional CNNs. It is thus sufﬁcient to solve for a basis of the G-steerable kernel space in terms of which the equivariant convolutions can be parameterized. The lower dimensionality of the restricted kernel space enhances the parameter efﬁciency of steerable CNNs over conventional CNNs similarly to the increased parameter efﬁciency of CNNs over MLPs.
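The kernel constraint (2) can be checked numerically for a given kernel. The sketch below assumes NumPy and uses an illustrative kernel that is not taken from the paper: $k(x) = e^{-|x|^2} x$, which maps a scalar field ($\rho_{in}$ trivial) to a vector field ($\rho_{out}$ standard). The names `kernel`, `bad_kernel` and `satisfies_constraint` are hypothetical.

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def trivial(g):
    return np.eye(1)

def standard(g):
    return g

def kernel(x):
    """Illustrative O(2)-steerable kernel of shape (c_out, c_in) = (2, 1)."""
    return (np.exp(-x @ x) * x).reshape(2, 1)

def bad_kernel(x):
    """Unconstrained kernel of the same shape that ignores the orientation of x."""
    return np.exp(-x @ x) * np.array([[1.0], [0.0]])

def satisfies_constraint(k, rho_out, rho_in, g, x):
    """Check k(gx) = rho_out(g) k(x) rho_in(g)^{-1} for one g and one x."""
    lhs = k(g @ x)
    rhs = rho_out(g) @ k(x) @ np.linalg.inv(rho_in(g))
    return np.allclose(lhs, rhs)
```

For this choice the constraint holds for rotations and reflections alike, since $k(gx) = e^{-|x|^2} g x = g\,k(x)$, whereas `bad_kernel` violates it for any non-trivial rotation.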

2.4 Irrep decomposition of the kernel constraint

The kernel constraint (2) in principle needs to be solved individually for each pair of input and output types $\rho_{in}$ and $\rho_{out}$ to be used in the network. Here we show how the solution of the kernel constraint for arbitrary representations can be reduced to much simpler constraints under irreducible representations (irreps). Our approach relies on the fact that any representation of a finite or compact group decomposes under a change of basis into a direct sum of irreps, each corresponding to an invariant subspace of the representation space $\mathbb{R}^c$ on which $\rho$ acts. Denoting the change of basis by $Q$, this means that one can always write $\rho = Q^{-1}\big[\bigoplus_{i \in I} \psi_i\big] Q$ where $\psi_i$ are the irreducible representations of $G$ and the index set $I$ encodes the types and multiplicities of irreps present in $\rho$. A decomposition can be found by exploiting basic results of character theory and linear algebra [28].
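As a concrete instance of such a decomposition, the following NumPy sketch block-diagonalizes the regular representation of the cyclic group $C_4$ (a permutation of four channels, see Section 2.6) with a real Fourier-type change of basis $Q$. The specific $Q$ and the names `regular` and `irrep_sum` are assumptions made for this illustration.

```python
import numpy as np

N = 4

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def regular(k):
    """Regular representation of C_N: the rotation r^k maps axis e_j to e_{(j+k) mod N}."""
    P = np.zeros((N, N))
    for j in range(N):
        P[(j + k) % N, j] = 1.0
    return P

# Orthogonal change of basis: constant, cosine, sine and alternating modes.
j = np.arange(N)
Q = np.stack([
    np.ones(N) / 2.0,
    np.cos(2 * np.pi * j / N) / np.sqrt(2.0),
    np.sin(2 * np.pi * j / N) / np.sqrt(2.0),
    (-1.0) ** j / 2.0,
])

def irrep_sum(k):
    """Direct sum of real irreps of C_4: frequency 0 (1x1), frequency 1 (2x2), frequency 2 (1x1)."""
    out = np.zeros((N, N))
    out[0, 0] = 1.0
    out[1:3, 1:3] = rot(2 * np.pi * k / N)
    out[3, 3] = (-1.0) ** k
    return out
```

Here `Q @ regular(k) @ Q.T` equals `irrep_sum(k)` for every $k$, which is the statement $\rho = Q^{-1}\big[\bigoplus_i \psi_i\big] Q$ with $Q^{-1} = Q^T$.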

The decomposition of ρin and ρout in the kernel constraint (2) leads to

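$$k(gx) = Q_{out}^{-1}\Big[\bigoplus_{i \in I_{out}} \psi_i(g)\Big] Q_{out}\; k(x)\; Q_{in}^{-1}\Big[\bigoplus_{j \in I_{in}} \psi_j^{-1}(g)\Big] Q_{in} \qquad \forall g \in G,\ x \in \mathbb{R}^2,$$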

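which, defining a kernel relative to the irrep bases as $\kappa := Q_{out}\, k\, Q_{in}^{-1}$, implies

$$\kappa(gx) = \Big[\bigoplus_{i \in I_{out}} \psi_i(g)\Big]\, \kappa(x)\, \Big[\bigoplus_{j \in I_{in}} \psi_j^{-1}(g)\Big] \qquad \forall g \in G,\ x \in \mathbb{R}^2.$$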

The left and right multiplication with a direct sum of irreps reveals that the constraint decomposes into independent constraints

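$$\kappa^{ij}(gx) = \psi_i(g)\, \kappa^{ij}(x)\, \psi_j^{-1}(g) \qquad \forall g \in G,\ x \in \mathbb{R}^2, \quad \text{where } i \in I_{out},\ j \in I_{in} \tag{3}$$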

on blocks $\kappa^{ij}$ in $\kappa$ corresponding to invariant subspaces of the full space of equivariant kernels; see Appendix H for a visualization. In order to solve for a basis of equivariant kernels satisfying the original constraint (2), it is therefore sufficient to solve the irrep constraints (3) to obtain bases for each block, revert the change of basis and take the union over different blocks. Specifically, given $d_{ij}$-dimensional bases $\{\kappa^{ij}_1, \dots, \kappa^{ij}_{d_{ij}}\}$ for the blocks $\kappa^{ij}$ of $\kappa$, we get a $d = \sum_{ij} d_{ij}$-dimensional basis

$$\{k_1, \dots, k_d\} := \bigcup_{i \in I_{out}} \bigcup_{j \in I_{in}} \Big\{ Q_{out}^{-1}\, \bar{\kappa}^{ij}_1\, Q_{in},\ \dots,\ Q_{out}^{-1}\, \bar{\kappa}^{ij}_{d_{ij}}\, Q_{in} \Big\} \tag{4}$$

of solutions of (2). Here $\bar{\kappa}^{ij}$ denotes a block $\kappa^{ij}$ being filled at the corresponding location of a matrix of the shape of $\kappa$ with all other blocks being set to zero; see Appendix H. The completeness of the basis found this way is guaranteed by construction if the bases for each block $ij$ are complete. Note that while this approach shares some basic ideas with the solution strategy proposed in [2], it is computationally more efficient for large representations; see Appendix J. We want to emphasize that this strategy for reducing the kernel constraint to irreducible representations is not restricted to subgroups of O(2) but applies to steerable CNNs in general.
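The assembly step of Eq. (4) can be sketched as follows. The snippet assumes NumPy and, for brevity, treats each block basis element as a plain matrix (the value of $\kappa^{ij}$ at one fixed point $x$); sampled kernels would carry additional spatial axes. The function name `assemble_basis` and its argument layout are hypothetical.

```python
import numpy as np

def assemble_basis(block_bases, out_dims, in_dims, Q_out, Q_in):
    """Build {Q_out^{-1} kappa_bar Q_in} from per-block bases, cf. Eq. (4).

    block_bases: dict mapping (i, j) to a list of arrays of shape (out_dims[i], in_dims[j])
    out_dims, in_dims: dimensions of the irreps in rho_out and rho_in
    """
    row_off = np.concatenate([[0], np.cumsum(out_dims)])
    col_off = np.concatenate([[0], np.cumsum(in_dims)])
    Q_out_inv = np.linalg.inv(Q_out)
    basis = []
    for (i, j), blocks in block_bases.items():
        for block in blocks:
            # kappa_bar: the block at position (i, j), all other blocks set to zero
            kappa_bar = np.zeros((row_off[-1], col_off[-1]))
            kappa_bar[row_off[i]:row_off[i + 1], col_off[j]:col_off[j + 1]] = block
            basis.append(Q_out_inv @ kappa_bar @ Q_in)   # revert the change of basis
    return basis
```

The number of returned elements is $d = \sum_{ij} d_{ij}$, one for every basis element of every block.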

2.5 General solution of the kernel constraint for O(2) and subgroups

In order to build isometry-equivariant CNNs on $\mathbb{R}^2$ we need to solve the irrep constraints (3) for the specific case of $G$ being O(2) or one of its subgroups. For this purpose note that the action of $G$ on $\mathbb{R}^2$ is norm-preserving, that is, $|g.x| = |x|\ \forall g \in G,\ x \in \mathbb{R}^2$. The constraints (2) and (3) therefore only restrict the angular parts of the kernels but leave their radial parts free. Since furthermore all irreps of $G$ correspond to one unique angular frequency (see Appendix I.2), it is convenient to expand the kernel w.l.o.g. in terms of an (angular) Fourier series

$$\kappa^{ij}_{\alpha\beta}\big(x(r, \phi)\big) = A_{\alpha\beta,0}(r) + \sum_{\mu=1}^{\infty} \Big[ A_{\alpha\beta,\mu}(r) \cos(\mu\phi) + B_{\alpha\beta,\mu}(r) \sin(\mu\phi) \Big] \tag{5}$$

with real-valued, radially dependent coefficients $A_{\alpha\beta,\mu}: \mathbb{R}^+ \to \mathbb{R}$ and $B_{\alpha\beta,\mu}: \mathbb{R}^+ \to \mathbb{R}$ for each matrix entry $\kappa^{ij}_{\alpha\beta}$ of block $\kappa^{ij}$. By inserting this expansion into the irrep constraints (3) and projecting on individual harmonics we obtain constraints on the Fourier coefficients, forcing most of them to be zero. The vector spaces of $G$-steerable kernel blocks $\kappa^{ij}$ satisfying the irrep constraints (3) are then parameterized in terms of the remaining Fourier coefficients. The completeness of this basis follows immediately from the completeness of the Fourier basis. Similar approaches have been followed in simpler settings for the cases of $C_N$ in [7], SO(2) in [12] and SO(3) in [2].

The resulting bases for the angular parts of kernels for each pair of irreducible representations of O(2) are shown in Table 1. It turns out that each basis element is harmonic and associated to one unique angular frequency. Appendix I gives an explicit derivation and the resulting bases for all possible pairs of irreps for all groups $G \le \mathrm{O}(2)$ following the strategy presented in this section. The analytical solutions for SO(2), $(\{\pm 1\}, *)$, $C_N$ and $D_N$ are found in Tables 8, 10, 11 and 12. Since these groups are subgroups of O(2), they enforce a weaker kernel constraint as compared to O(2). As a result, the bases for $G < \mathrm{O}(2)$ are higher dimensional, i.e. they allow for a wider range of kernels. A higher level of equivariance therefore leads on the one hand to a guaranteed behavior of the inference process under transformations and on the other hand to an improved parameter efficiency.

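| $\psi_i$ \ $\psi_j$ | trivial | sign-flip | frequency $n \in \mathbb{N}^+$ |
|---|---|---|---|
| trivial | $[1]$ | $\emptyset$ | $[\sin(n\phi),\ -\cos(n\phi)]$ |
| sign-flip | $\emptyset$ | $[1]$ | $[\cos(n\phi),\ \sin(n\phi)]$ |
| frequency $m \in \mathbb{N}^+$ | $\begin{bmatrix} \sin(m\phi) \\ -\cos(m\phi) \end{bmatrix}$ | $\begin{bmatrix} \cos(m\phi) \\ \sin(m\phi) \end{bmatrix}$ | $\begin{bmatrix} \cos((m-n)\phi) & -\sin((m-n)\phi) \\ \sin((m-n)\phi) & \cos((m-n)\phi) \end{bmatrix},\ \begin{bmatrix} \cos((m+n)\phi) & \sin((m+n)\phi) \\ \sin((m+n)\phi) & -\cos((m+n)\phi) \end{bmatrix}$ |

Table 1: Bases for the angular parts of O(2)-steerable kernels satisfying the irrep constraint (3) for different pairs of input field irreps $\psi_j$ and output field irreps $\psi_i$. The different types of irreps are explained in Appendix I.2.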
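The two basis elements for a pair of frequency irreps (output frequency $m$, input frequency $n$) can be verified numerically against the irrep constraint (3). The sketch below assumes NumPy and, since the definitions of Appendix I.2 are not reproduced here, a particular convention: the frequency-$n$ irrep maps a rotation by $\theta$ to a rotation matrix with angle $n\theta$ and the reflection to $\mathrm{diag}(1, -1)$, and the reflection acts on the plane by $\phi \mapsto -\phi$. Both the convention and all names are assumptions; the remaining entries of the table depend on this choice and are not checked here.

```python
import numpy as np

def rot(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

D = np.diag([1.0, -1.0])   # assumed matrix of the reflection in a frequency irrep

def psi(n, theta, flip):
    """Assumed frequency-n irrep of O(2), evaluated on g = (rotation by theta) o (optional reflection)."""
    return rot(n * theta) @ (D if flip else np.eye(2))

def g_angle(phi, theta, flip):
    """Polar angle of g.x for a point x at angle phi: reflect (phi -> -phi), then rotate by theta."""
    return theta + (-phi if flip else phi)

def basis_minus(m, n, phi):
    """First basis element: angular frequency m - n."""
    return rot((m - n) * phi)

def basis_plus(m, n, phi):
    """Second basis element: angular frequency m + n."""
    return rot((m + n) * phi) @ D

def check(basis, m, n, phi, theta, flip):
    """Test kappa(g.x) = psi_m(g) kappa(x) psi_n(g)^{-1} at one angle phi."""
    lhs = basis(m, n, g_angle(phi, theta, flip))
    rhs = psi(m, theta, flip) @ basis(m, n, phi) @ np.linalg.inv(psi(n, theta, flip))
    return np.allclose(lhs, rhs)
```

Under the stated convention both elements pass the check for rotations and for reflected rotations, while an angular part with any other frequency fails it.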

2.6 Group representations and nonlinearities

A question which so far has been left open is which field types, i.e. which representations $\rho$ of $G$, should be used in practice. Considering only the convolution operation with $G$-steerable kernels for the moment, it turns out that any change of basis $P$ to an equivalent representation $\tilde{\rho} := P^{-1} \rho P$ is irrelevant. To see this, consider the irrep decomposition $\rho = Q^{-1}\big[\bigoplus_{i \in I} \psi_i\big] Q$ used in the solution of the kernel constraint to obtain a basis $\{k_i\}_{i=1}^{d}$ of $G$-steerable kernels as defined by Eq. (4). Any equivalent representation will decompose into $\tilde{\rho} = \tilde{Q}^{-1}\big[\bigoplus_{i \in I} \psi_i\big] \tilde{Q}$ with $\tilde{Q} = QP$ for some $P$ and therefore result in a kernel basis $\{P_{out}^{-1}\, k_i\, P_{in}\}_{i=1}^{d}$ which entirely negates changes of bases between equivalent representations. It would therefore w.l.o.g. suffice to consider direct sums of irreps $\rho = \bigoplus_{i \in I} \psi_i$ as representations only, reducing the question of which representations to choose to the question of which types and multiplicities of irreps to use.

In practice, however, convolution layers are interleaved with other operations which are sensitive to speciﬁc choices of representations. In particular, nonlinearity layers are required to be equivariant under the action of speciﬁc representations. The choice of group representations in steerable CNNs therefore restricts the range of admissible nonlinearities, or, conversely, a choice of nonlinearity allows only for certain representations. In the following we review prominent choices of representations found in the literature in conjunction with their compatible nonlinearities.

All equivariant nonlinearities considered here act spatially localized, that is, on each feature vector $f(x) \in \mathbb{R}^{c_{in}}$ for all $x \in \mathbb{R}^2$ individually. They might produce different types of output fields $\rho_{out}: G \to \mathrm{GL}(\mathbb{R}^{c_{out}})$, that is, $\sigma: \mathbb{R}^{c_{in}} \to \mathbb{R}^{c_{out}},\ f(x) \mapsto \sigma(f(x))$. As proven in Appendix G.2, it is sufficient to require the equivariance of $\sigma$ under the actions of $\rho_{in}$ and $\rho_{out}$, i.e. $\sigma \circ \rho_{in}(g) = \rho_{out}(g) \circ \sigma\ \forall g \in G$, for the nonlinearities to be equivariant under the action of induced representations when being applied to a whole feature field as $\sigma(f)(x) := \sigma(f(x))$.

A general class of representations are unitary representations which preserve the norm of their representation space, that is, they satisfy $|\rho_{unitary}(g) f(x)| = |f(x)|\ \forall g \in G$. As proven in Appendix G.2.2, nonlinearities which solely act on the norm of feature vectors but preserve their orientation are equivariant w.r.t. unitary representations. They can in general be decomposed as $\sigma_{norm}: \mathbb{R}^c \to \mathbb{R}^c,\ f(x) \mapsto \eta(|f(x)|) \frac{f(x)}{|f(x)|}$ for some nonlinear function $\eta: \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ acting on the norm of feature vectors. Norm-ReLUs, defined by $\eta(|f(x)|) = \mathrm{ReLU}(|f(x)| - b)$ where $b \in \mathbb{R}^+$ is a learned bias, were used in [12, 2]. In [29], the authors consider squashing nonlinearities $\eta(|f(x)|) = \frac{|f(x)|^2}{|f(x)|^2 + 1}$. Gated nonlinearities were proposed in [2] as a conditional version of norm nonlinearities. They act by scaling the norm of a feature field by learned sigmoid gates $\frac{1}{1 + e^{-s(x)}}$, parameterized by a scalar feature field $s$. All representations considered in this paper are unitary such that their fields can be acted on by norm-nonlinearities. This applies specifically also to all irreducible representations $\psi_i$ of $G \le \mathrm{O}(2)$ which are discussed in detail in Appendix I.2.
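The norm-based nonlinearities described above can be written down in a few lines. The following NumPy sketch acts on a single feature vector; the function names are hypothetical, and in a trained network the bias `b` and the gate `s` would be learned rather than passed in by hand.

```python
import numpy as np

def norm_relu(f, b):
    """Norm-ReLU: eta(|f|) = ReLU(|f| - b), applied along the direction of f."""
    norm = np.linalg.norm(f)
    if norm == 0.0:
        return np.zeros_like(f)           # direction undefined; eta(0) = 0 for b >= 0
    return max(norm - b, 0.0) * f / norm

def squash(f):
    """Squashing nonlinearity: eta(|f|) = |f|^2 / (|f|^2 + 1)."""
    norm = np.linalg.norm(f)
    return norm / (norm ** 2 + 1.0) * f   # equals eta(|f|) * f / |f|

def gated(f, s):
    """Gated nonlinearity: scale f by the sigmoid of a scalar gate value s."""
    return f / (1.0 + np.exp(-s))
```

Since an orthogonal matrix `g` preserves norms, each of these functions commutes with it, e.g. `norm_relu(g @ f, b)` equals `g @ norm_relu(f, b)`; the gate value is taken from a scalar field and is therefore unaffected by `g`.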

A common choice of representations of finite groups like $C_N$ and $D_N$ are regular representations. Their representation space $\mathbb{R}^{|G|}$ has dimensionality equal to the order of the group, e.g. $\mathbb{R}^N$ for $C_N$ and $\mathbb{R}^{2N}$ for $D_N$. The action of the regular representation is defined by assigning each axis $e_g$ of $\mathbb{R}^{|G|}$ to a group element $g \in G$ and permuting the axes according to $\rho^G_{reg}(\tilde{g})\, e_g := e_{\tilde{g}g}$. Since this action is just permuting channels of $\rho^G_{reg}$-fields, it commutes with pointwise nonlinearities like ReLU; a proof is given in Appendix G.2.3. While regular steerable CNNs were empirically found to perform very well, they lead to high dimensional feature spaces with each individual field consuming $|G|$ channels. Regular steerable CNNs were investigated for planar images in [6, 7, 8, 9, 10, 17, 18, 30], for spherical CNNs in [19, 5] and for volumetric convolutions in [31, 32]. Further, the translation of feature maps of conventional CNNs can be viewed as action of the regular representation of the translation group.
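For the cyclic group $C_N$ the regular representation is a cyclic shift of $N$ channels, which makes the compatibility with pointwise nonlinearities easy to see. A minimal NumPy sketch with hypothetical names:

```python
import numpy as np

def regular(k, N):
    """Regular representation of C_N: the rotation r^k maps axis e_j to e_{(j+k) mod N}."""
    P = np.zeros((N, N))
    for j in range(N):
        P[(j + k) % N, j] = 1.0
    return P

def relu(f):
    """Pointwise ReLU, applied to each channel independently."""
    return np.maximum(f, 0.0)
```

Applying `regular(k, N)` to a feature vector is the same as `np.roll(f, k)`, so it makes no difference whether ReLU is applied before or after the group action. The same does not hold for, say, a rotation matrix acting on a vector field, which mixes channels instead of permuting them.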

Closely related to regular representations are quotient representations. Instead of permuting $|G|$ channels indexed by $G$, they permute $|G|/|H|$ channels indexed by cosets $gH$ in the quotient space $G/H$ of a subgroup $H \le G$. Specifically, they act on axes $e_{gH}$ of $\mathbb{R}^{|G|/|H|}$ as defined by $\rho^{G/H}_{quot}(\tilde{g})\, e_{gH} := e_{\tilde{g}gH}$. As permutation representations, quotient representations allow for pointwise nonlinearities; see Appendix G.2.3. Quotient representations were considered in [1, 11].

Regular and quotient fields can furthermore be acted on by nonlinear pooling operators. Via a group pooling or projection operation $\max: \mathbb{R}^c \to \mathbb{R},\ f(x) \mapsto \max(f(x))$ the works [6, 7, 9, 32, 31] extract the maximum value of a regular or quotient field. The invariance of the maximum operation implies that the resulting features form scalar fields. Since group pooling operations discard information on the feature orientations entirely, vector field nonlinearities $\sigma_{vect}: \mathbb{R}^N \to \mathbb{R}^2$ for regular representations of $C_N$ were proposed in [13]. Vector field nonlinearities do not only keep the maximum response $\max(f(x))$ but also its index $\arg\max(f(x))$. This index corresponds to a rotation angle $\theta = \frac{2\pi}{N} \arg\max(f(x))$ which is used to define a vector field with elements $v(x) = \max(f(x))\,(\cos(\theta), \sin(\theta))^T$. The equivariance of this operation is proven in Appendix G.2.4.
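Both pooling operations can be sketched for a single $C_N$ regular feature vector. The snippet assumes NumPy, models the action of a rotation by $2\pi k/N$ on a regular feature as a cyclic shift of its channels, and uses hypothetical function names.

```python
import numpy as np

def group_pool(f):
    """Group pooling: keep only the maximum response of a regular field."""
    return np.max(f)

def vector_field_nonlinearity(f):
    """Map a C_N regular feature f in R^N to a vector in R^2."""
    N = f.shape[0]
    idx = int(np.argmax(f))            # index of the strongest response
    theta = 2.0 * np.pi * idx / N      # rotation angle associated with that index
    return f[idx] * np.array([np.cos(theta), np.sin(theta)])
```

Shifting the channels by `k` leaves `group_pool` unchanged and rotates the output of `vector_field_nonlinearity` by the angle $2\pi k/N$, provided the maximum is unique.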

2.7 Group restrictions and inductions

The key idea of equivariant networks is to exploit symmetries in the distribution of characteristic patterns in signals. The level of symmetry present in data might thereby vary over length scales. For instance, natural images typically show small features like edges in arbitrary orientations. On a larger length scale, however, the rotational symmetry is broken as manifested in visual patterns exclusively appearing upright but still in different reﬂections. Each individual layer of a convolutional network should therefore be adapted to the symmetries present in the length scale of its ﬁelds of view.

A loss of symmetry can be implemented by restricting the equivariance at a certain depth to a subgroup $(\mathbb{R}^2, +) \rtimes H \le (\mathbb{R}^2, +) \rtimes G$, e.g. from rotations and reflections $G = \mathrm{O}(2)$ to mere reflections $H = (\{\pm 1\}, *)$ in the example above. This requires the feature fields produced by a layer with a higher level of equivariance to be reinterpreted in the following layer as fields transforming under a subgroup. Specifically, a $\rho$-field, transforming according to $\rho: G \to \mathrm{GL}(\mathbb{R}^c)$, needs to be reinterpreted as a $\tilde{\rho}$-field, where $\tilde{\rho}: H \to \mathrm{GL}(\mathbb{R}^c)$ is a representation of the subgroup $H \le G$. This is naturally achieved by using the restricted representation $\tilde{\rho} := \mathrm{Res}^G_H(\rho): H \to \mathrm{GL}(\mathbb{R}^c),\ h \mapsto \rho(h)$, defined by restricting the domain of $\rho$ to $H$. Since a subsequent $H$-steerable convolution layer can map fields of arbitrary representations we can readily process the resulting $\mathrm{Res}^G_H(\rho)$-field further.
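Restricting a representation does not change any matrix; it only shrinks the set of group elements on which the representation is evaluated. The sketch below assumes NumPy, represents elements of O(2) as orthogonal 2x2 matrices, and restricts the standard representation to the reflection group $\{e, s\}$, where the reflection $s$ is chosen (as an assumption) to flip the second axis. All names are hypothetical.

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

FLIP = np.diag([1.0, -1.0])   # assumed reflection s: flips the second axis

def standard(g):
    """Standard representation of O(2): rho(g) = g."""
    return g

def in_reflection_group(g):
    """Membership test for the subgroup H = {e, s}."""
    return np.allclose(g, np.eye(2)) or np.allclose(g, FLIP)

def restrict(rho, in_subgroup):
    """Res^G_H(rho): the same matrices as rho, but only defined on elements of H."""
    def rho_res(h):
        if not in_subgroup(h):
            raise ValueError("element is not in the subgroup H")
        return rho(h)
    return rho_res

rho_res = restrict(standard, in_reflection_group)
```

In this example `rho_res(FLIP)` is $\mathrm{diag}(1, -1)$, so the restricted vector field splits into one channel that is left unchanged by the reflection and one channel that changes its sign.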

2.8 Implementation details

E(2)-steerable CNNs rely on convolutions with O(2)-steerable kernels. Our implementation therefore requires the precomputation of steerable kernel bases according to the analytical solutions in Eq. (4) with arbitrary radial parts. Since the kernel basis is sampled on a discrete pixel grid, care has to be taken that no aliasing artifacts occur. During runtime, the sampled basis is expanded using learned weights. The resulting $G$-steerable kernel is then used in a standard convolution routine. For more details we refer to Appendix C. Our implementation is provided as a PyTorch extension which is available at https://github.com/QUVA-Lab/e2cnn.
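The two stages described above, precomputing a sampled basis and expanding it with weights at runtime, can be illustrated with plain NumPy. This is a simplified sketch and not the API of the released library: it covers only kernels mapping a scalar field to a frequency-1 field under rotations, uses Gaussian rings as an assumed radial profile on a 5x5 grid, and replaces the learned weights by random numbers.

```python
import numpy as np

size = 5
c = size // 2
# Pixel coordinates: X[i, j] = i - c and Y[i, j] = j - c.
X, Y = np.meshgrid(np.arange(size) - c, np.arange(size) - c, indexing="ij")
r = np.hypot(X, Y)
phi = np.arctan2(Y, X)

def ring(r0, sigma=0.6):
    """Gaussian radial profile centred at radius r0 (a free design choice)."""
    return np.exp(-0.5 * ((r - r0) / sigma) ** 2)

def sample_basis(radii=(1.0, 2.0)):
    """Sampled basis of shape (d, c_out=2, c_in=1, size, size) for scalar -> frequency-1 fields."""
    elements = []
    for r0 in radii:
        radial = ring(r0) * (r > 0)                    # frequency 1 must vanish at the origin
        e1 = np.stack([np.cos(phi), np.sin(phi)])      # angular part R(phi) (1, 0)^T
        e2 = np.stack([-np.sin(phi), np.cos(phi)])     # angular part R(phi) (0, 1)^T
        for angular in (e1, e2):
            elements.append((radial * angular)[:, None])   # insert the c_in axis
    return np.stack(elements)

basis = sample_basis()                                          # precomputed once
weights = np.random.default_rng(0).normal(size=basis.shape[0])  # stand-in for learned weights
kernel = np.einsum("d,doixy->oixy", weights, basis)             # expanded at runtime
```

The resulting array has the shape $(c_{out}, c_{in}, X, Y)$ mentioned in footnote 3 and could be passed to a standard convolution routine. On this grid, rotating the sampled kernel by 90 degrees has the same effect as multiplying its output channels with the corresponding 2x2 rotation matrix, for any choice of weights.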

[Figure 2: three panels, one per MNIST variant; x-axis: order N (ticks at 4, 8, 12, 16, 20), y-axis: test error (%).]

Figure 2: Test errors of $C_N$ and $D_N$ regular steerable CNNs for different orders $N$ for all three MNIST variants. Left: All equivariant models improve upon the non-equivariant baseline on MNIST O(2). The error decreases before saturating at around 8 orientations. Since the dataset contains reflected digits, the $D_N$-equivariant models perform better than their $C_N$ counterparts. Middle: Since the intraclass variability of MNIST rot is reduced, the performances of the $C_N$ model and the baseline improve. In contrast, the $D_N$ models are invariant to reflections such that they cannot distinguish between MNIST O(2) and MNIST rot. For $N = 1$ this leads to a worse performance than that of the baseline. Restricted dihedral models, denoted by $D_N|_{C_N}$, make use of the local reflectional symmetries but are not globally invariant. This makes them perform better than the $C_N$ models. Right: On MNIST 12k the globally invariant models $C_N$ and $D_N$ do not yield better results than the baseline; however, the restricted (i.e. non-invariant) models $C_N|_{\{e\}}$ and $D_N|_{\{e\}}$ do. For more details see Appendix D.1.

3 Experiments

Since the framework of general E(2)-equivariant steerable CNNs supports many choices of groups, representations and nonlinearities, we ﬁrst run an extensive benchmark study over the space of supported models in Section 3.1. The insights from these benchmark experiments are then applied to classify CIFAR and STL-10 images in Sections 3.2 and 3.3. All of our experiments are found in a dedicated repository at https://github.com/gabri95/e2cnn_experiments.

3.1 Model benchmarking on transformed MNIST datasets

We ﬁrst perform a comprehensive benchmarking to compare the impact of the different design choices covered in this work. All benchmarked models are evaluated on three different versions of the MNIST dataset, each containing 12000 training and 50000 test images. The digits in the three variations MNIST 12k, MNIST rot and MNIST O(2) are left untransformed, are rotated and are rotated and reﬂected, respectively. These datasets allow us to study the beneﬁt from different levels of G-steerability in the presence or absence of certain symmetries. In order to not disadvantage models with lower levels of equivariance, we train all models using data augmentation by the transformations present in the corresponding dataset.

Representation and nonlinearity benchmarking: Table 7 in the Appendix shows the test errors of 57 different models on the three MNIST variants. The ﬁrst four columns state the equivariance groups, representations, nonlinearities and invariant maps which distinguish the models. The invariant maps of each model are applied after the last convolution layer to produce G-invariant features. Appendix D.1 compares and analyzes all results in detail. In particular, it discusses regular and quotient models, group pooling and vector ﬁeld networks, as well as SO(2) and O(2)-equivariant irrep models. The latter employ new kinds of gated-nonlinearities and norm-nonlinearities and, in the case of O(2), introduce induced representations as new feature types. The results of all models whose feature ﬁelds transform according to regular representations, are summarized in Figure 2.

Group restriction: All transformed MNIST datasets show local rotational and reflectional symmetries but differ in the level of symmetry present at the global scale. While $D_N$ and O(2)-equivariant models exploit these local symmetries, their global invariance leads to a considerable loss of information. On the other hand, models which are equivariant only to the symmetries present at the global scale of the dataset are not able to generalize over all local symmetries. The proposed group restriction operation allows for models which are locally equivariant but are globally invariant only to the level of symmetry present in the data. Table 2 reports the results of models which are restricted at different depths. The overall trend is that a restriction at later stages of the model improves the performance. All restricted models perform significantly better than the invariant models. Figure 2 shows that this behavior is consistent for different orders $N$.

| restriction depth | MNIST rot: group | MNIST rot: test error (%) | MNIST 12k: group | MNIST 12k: test error (%) | MNIST 12k: group | MNIST 12k: test error (%) |
|---|---|---|---|---|---|---|
| (0) | $C_{16}$ | 0.82 ± 0.02 | $\{e\}$ | 0.82 ± 0.01 | $\{e\}$ | 0.82 ± 0.01 |
| 1 | | | | | | 0.80 ± 0.03 |
| 2 | | 0.82 ± 0.03 | | 0.74 ± 0.03 | | 0.77 ± 0.03 |
| 3 | | 0.77 ± 0.03 | | 0.73 ± 0.03 | | 0.76 ± 0.03 |
| 4 | | 0.79 ± 0.03 | | 0.72 ± 0.02 | | 0.77 ± 0.03 |
| 5 | | 0.78 ± 0.04 | | 0.68 ± 0.04 | | 0.75 ± 0.02 |
| no restriction | $D_{16}$ | 1.65 ± 0.02 | $D_{16}$ | 1.68 ± 0.04 | $C_{16}$ | 0.95 ± 0.04 |

Table 2: Effect of the group restriction operation at different depths of the network on MNIST rot and MNIST 12k. All restricted models perform better than non-restricted, and hence globally invariant, models.

| model | group | representation | test error (%) |
|---|---|---|---|
| [6] | $C_4$ | regular/scalar | 3.21 ± 0.0012 |
| [6] | $C_4$ | regular | 2.28 ± 0.0004 |
| [12] | SO(2) | irreducible | 1.69 |
| [33] | - | - | 1.2 |
| [13] | $C_{17}$ | regular/vector | 1.09 |
| Ours | $C_{16}$ | regular | 0.716 ± 0.028 |
| [7] | $C_{16}$ | regular | 0.714 ± 0.022 |
| Ours | $C_{16}$ | quotient | 0.705 ± 0.025 |
| Ours | $D_{16}|_{C_{16}}$ | regular | 0.682 ± 0.022 |

Table 3: Final runs on MNIST rot

| model | CIFAR-10 | CIFAR-100 |
|---|---|---|
| wrn28/10 [34] | 3.87 | 18.80 |
| wrn28/10 $D_1\,D_1\,D_1$ | 3.36 ± 0.08 | 17.97 ± 0.11 |
| wrn28/10* $D_8\,D_4\,D_1$ | 3.28 ± 0.10 | 17.42 ± 0.33 |
| wrn28/10 $C_8\,C_4\,C_1$ | 3.20 ± 0.04 | 16.47 ± 0.22 |
| wrn28/10 $D_8\,D_4\,D_1$ | 3.13 ± 0.17 | 16.76 ± 0.40 |
| wrn28/10 $D_8\,D_4\,D_4$ | 2.91 ± 0.13 | 16.22 ± 0.31 |
| wrn28/10 [35] AA | 2.6 ± 0.1 | 17.1 ± 0.3 |
| wrn28/10* $D_8\,D_4\,D_1$ AA | 2.39 ± 0.11 | 15.55 ± 0.13 |
| wrn28/10 $D_8\,D_4\,D_1$ AA | 2.05 ± 0.03 | 14.30 ± 0.09 |

Table 4: Test errors on CIFAR (AA=autoaugment)

Convergence rate: In our experiments we find that steerable CNNs converge significantly faster than non-equivariant CNNs. Figure 4 in the Appendix shows this behavior for regular CN-steerable CNNs in comparison to a vanilla CNN. The rate of convergence increases with the order N and, as already observed in Figure 2, saturates at approximately N = 8. All models share about the same number of parameters. The faster convergence of equivariant networks is explained by the fact that they generalize over G-transformed images by design, which reduces the amount of intra-class variability they have to learn. Conversely, a conventional CNN has to learn to classify all transformed versions of an image explicitly, which requires an increased batch size or more training iterations. The enhanced data efficiency of E(2)-steerable CNNs thus leads to a reduced training time.
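The statement that equivariant layers generalize over transformed inputs by design can be checked numerically in the simplest case. The sketch below is not our implementation; it assumes the group C4, a single input channel and a single kernel, and uses hypothetical helper names. It correlates an image with the four rotated copies of the kernel and verifies that rotating the input only rotates the feature maps and cyclically shifts their channels.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate_valid(image, kernel):
    """Plain 2D cross-correlation without padding ('valid' mode)."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijab,ab->ij", windows, kernel)

def c4_lifting_conv(image, kernel):
    """Correlate the image with all four 90-degree rotations of one kernel.
    The four output channels form one regular C4 feature field."""
    return np.stack([correlate_valid(image, np.rot90(kernel, r)) for r in range(4)])

rng = np.random.default_rng(0)
image = rng.standard_normal((9, 9))
kernel = rng.standard_normal((3, 3))

out = c4_lifting_conv(image, kernel)
out_rot = c4_lifting_conv(np.rot90(image), kernel)

# Rotating the input rotates every channel and shifts the channel index by one.
expected = np.stack([np.rot90(out[(r - 1) % 4]) for r in range(4)])
assert np.allclose(out_rot, expected)

# A single unconstrained kernel does not transform predictably under rotation.
assert not np.allclose(correlate_valid(np.rot90(image), kernel),
                       np.rot90(correlate_valid(image, kernel)))
```

The response to a rotated image is therefore fully determined by the response to the original image, so the rotated versions do not have to be learned separately.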

Competitive runs: As a final experiment on MNIST rot we replicate the regular C16 model from [7]. It is mostly similar to the models evaluated before but is wider and adds additional fully connected layers; see Table 14 in the Appendix. As reported in Table 3, our reimplementation matches the accuracy of the original model. Replacing the regular feature fields with the quotient representations used in the benchmarking leads to slightly better results. We refer to Appendix F for more insights on the improved performance of the quotient model. A further significant improvement and a new state of the art is achieved by a D16 model which is restricted to C16 in the final layer.

3.2 CIFAR experiments

The statistics of natural images are typically invariant under global translations and reflections but not under global rotations. Here we investigate the benefit of G-steerable convolutions for such images by classifying CIFAR-10 and CIFAR-100. For this purpose we implement several DN- and CN-equivariant versions of Wide ResNet [34]. Different levels of equivariance, stated in the model specifications in Table 4, are used in the three main blocks of the network. Regular representations are used throughout the whole model. For a fair comparison we scale the width of all layers such that the number of parameters of the original wrn28/10 model is preserved. We further add a small model, marked by an additional *, which has about the same number of channels as the non-equivariant wrn28/10. All runs use the same training procedure as reported in [34] and Appendix K.3. We want to emphasize that we perform no further hyperparameter tuning.
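As a rough illustration of the width scaling, assume the idealized weight count of a group convolution between regular feature fields: every pair of input and output fields shares its weights over the group, leaving |G| · k² free parameters per pair. This count is an assumption made for the example; the actual number of basis kernels of a steerable convolution depends on the chosen kernel basis. Under this assumption, dividing the number of fields by √|G| approximately preserves the number of weights. The layer width of 160 channels is hypothetical.

```python
import math

def conv_params(c_in, c_out, k):
    """Weights of a conventional k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def regular_gconv_params(fields_in, fields_out, group_order, k):
    """Idealized weight count of a convolution between regular feature fields:
    group_order * k * k free parameters per pair of fields (an assumption)."""
    return fields_in * fields_out * group_order * k * k

def fields_for_same_params(channels, group_order):
    """Number of regular fields that keeps the weight count of a layer with
    `channels` input and output channels approximately unchanged."""
    return round(channels / math.sqrt(group_order))

channels, k = 160, 3  # hypothetical layer width, not taken from the paper
for name, order in [("D1", 2), ("D4", 8), ("D8", 16)]:  # D_N has 2N elements
    fields = fields_for_same_params(channels, order)
    print(name,
          "fields:", fields,
          "channels:", fields * order,
          "params:", regular_gconv_params(fields, fields, order, k),
          "baseline:", conv_params(channels, channels, k))
```

Keeping the number of channels fixed instead, as in the models marked by *, corresponds to `channels // group_order` fields and, under the same assumption, to roughly a fraction 1/|G| of the weights of the layer.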

The results of the D1 D1 D1 model confirm that incorporating the global symmetries of the data yields a significant boost in accuracy. Interestingly, the C8 C4 C1 model, which is rotation- but not reflection-equivariant, achieves better results, which shows that it is worthwhile to leverage local rotational symmetries. Both symmetries are respected simultaneously by the wrn28/10 D8 D4 D1 model. While this model performs better than the two previous ones on CIFAR-10, it surprisingly yields slightly worse results on CIFAR-100. The best results are obtained by the D8 D4 D4 model, which suggests that rotational symmetries are useful even on a larger scale. The small wrn28/10* D8 D4 D1 model shows a remarkable gain compared to the non-equivariant wrn28/10 baseline despite not being computationally more expensive. To investigate whether equivariance is useful even when a powerful data augmentation policy is available, we further rerun both D8 D4 D1 models with AutoAugment (AA) [35]. As without AA, both equivariant models outperform the baseline by a large margin.

3.3 STL-10 experiments

| model | group | #params | test error (%) |
|---|---|---|---|
| wrn16/8 [36] | - | 11M | 12.74 ± 0.23 |
| wrn16/8* | D1 D1 D1 | 5M | 11.05 ± 0.45 |
| wrn16/8 | D1 D1 D1 | 10M | 11.17 ± 0.60 |
| wrn16/8* | D8 D4 D1 | 4.2M | 10.57 ± 0.70 |
| wrn16/8 | D8 D4 D1 | 12M | 9.80 ± 0.40 |

Table 5: Test errors of different equivariant models on the STL-10 dataset. Models with * preserve the number of channels of the baseline.

In order to test whether the previous results generalize to natural images of higher resolution we run experiments on STL-10 [37]. We adapt the experiments in [36] by replacing the non-equivariant convolutions of their wrn16/8 model with regular DN-steerable convolutions. As in the CIFAR experiments, we adopt the training settings and hyperparameters of [36] without changes. Our four adapted models, reported in Table 5, are equivariant under either the action of D1 in all blocks or the actions of D8, D4 and D1. For both choices we build a large model, preserving the number of parameters of the baseline, and a small model, which preserves its number of channels and thus its computational requirements. All models improve significantly over the baseline. Due to its extended equivariance, the small D8 D4 D1 model performs better than the large D1 D1 D1 model. In comparison to the CIFAR experiments, rotational equivariance gives a larger boost in accuracy since the higher resolution of 96px of STL-10 allows for more detailed local patterns which occur in arbitrary orientations. Appendix D.3 reports the results of a data ablation study. The results validate that the gains from incorporating equivariance are consistent over all training set sizes. More information on the training procedures is given in Appendix K.4.

4 Conclusions

In this work we presented a general theory of E(2)-equivariant steerable CNNs. By analytically solving the kernel constraint for any representation of O(2) or its subgroups we were able to reproduce and compare many different models from previous work. We further proposed a group restriction operation which allows us to adapt the level of equivariance to the symmetries present on the corresponding length scale. When using G-steerable convolutions as a drop-in replacement for conventional convolution layers we obtained significant improvements on CIFAR and STL-10 without additional hyperparameter tuning. While the kernel expansion leads to a small overhead during training, the final kernels can be stored such that at test time steerable CNNs are computationally not more expensive than conventional CNNs of the same width. Due to the enhanced parameter efficiency of equivariant models it is common practice to adapt the model width to match the parameter cost of conventional CNNs. Our results show that even non-scaled models outperform conventional CNNs in accuracy.
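The remark on the kernel expansion can be sketched as follows. The class below is a simplified illustration under stated assumptions, not the implementation of our library: the basis is a random placeholder instead of a solution of the kernel constraint, one weight is shared per basis element, and all names are hypothetical. The kernel is re-expanded from its weights on every call during training and expanded once and stored for inference.

```python
import numpy as np

class ExpandedKernelConv:
    """Convolution kernel parametrized as a linear combination of fixed basis kernels."""

    def __init__(self, basis, rng):
        # basis: (num_basis, c_out, c_in, k, k); a placeholder, not a steerable basis
        self.basis = basis
        self.weights = rng.standard_normal(basis.shape[0])
        self.training = True
        self._cached_kernel = None

    def expand(self):
        """Kernel expansion: weighted sum over the basis elements."""
        return np.einsum("b,boikl->oikl", self.weights, self.basis)

    def eval(self):
        """Switch to inference mode and store the expanded kernel once."""
        self.training = False
        self._cached_kernel = self.expand()

    def kernel(self):
        # Training: the weights change, so the kernel is expanded on every call.
        # Inference: the stored kernel is reused like a conventional kernel.
        return self.expand() if self.training else self._cached_kernel

rng = np.random.default_rng(0)
conv = ExpandedKernelConv(rng.standard_normal((5, 4, 3, 3, 3)), rng)
conv.eval()
assert conv.kernel() is conv.kernel()  # no re-expansion at test time
```

Once stored, the expanded kernel has the same shape as a conventional convolution kernel with the same number of input and output channels, which is why no additional cost arises at test time.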

We believe that equivariant CNNs will in the long term become the default choice for tasks like biomedical imaging, where symmetries are present on a global scale. The impressive results on natural images demonstrate the great potential of applying E(2)-steerable CNNs to more general vision tasks which involve only local symmetries. Future research still needs to investigate the wide range of design choices of steerable CNNs in more depth and collect evidence on whether our findings generalize to different settings. We hope that our library will help equivariant CNNs to be adopted by the community and facilitate further research.

Acknowledgments

We would like to thank Taco Cohen for fruitful discussions on an efficient implementation and helpful feedback on the paper and Daniel Worrall for elaborating on the implementation of Harmonic Networks.

References

[1] Taco S. Cohen and Max Welling. Steerable CNNs. In International Conference on Learning Representations (ICLR), 2017.

[2] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S. Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

[3] Taco S. Cohen, Mario Geiger, and Maurice Weiler. Intertwiners between induced representations (with applications to the theory of equivariant neural networks). arXiv preprint arXiv:1803.10743, 2018.

[4] Taco S. Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. arXiv preprint arXiv:1811.02017, 2018.

[5] Taco S. Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In International Conference on Machine Learning (ICML), 2019.

[6] Taco S. Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016.

[7] Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[8] Emiel Hoogeboom, Jorn W. T. Peters, Taco S. Cohen, and Max Welling. HexaConv. In International Conference on Learning Representations (ICLR), 2018.

[9] Erik J. Bekkers, Maxime W. Lafarge, Mitko Veta, Koen A.J. Eppenhof, Josien P.W. Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018.

[10] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In International Conference on Machine Learning (ICML), 2016.

[11] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning (ICML), 2018.

[12] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[13] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In International Conference on Computer Vision (ICCV), 2017.

[14] Laurent Sifre and Stéphane Mallat. Combined scattering for rotation invariant texture analysis. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), volume 44, pages 68-81, 2012.

[15] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[16] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872-1886, 2013.

[17] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for texture classification. arXiv preprint arXiv:1403.1687, 2014.

[18] Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[19] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations (ICLR), 2018.

[20] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan Nets: A Fully Fourier Space Spherical Convolutional Neural Network. In Conference on Neural Information Processing Systems (NeurIPS), 2018.

[21] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In European Conference on Computer Vision (ECCV), 2018.

[22] Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications. arXiv:1810.12186 [astro-ph], 2018.

[23] Chiyu Jiang, Jingwei Huang, Karthik Kashinath, Prabhat, Philip Marcus, and Matthias Niessner. Spherical CNNs on unstructured grids. In International Conference on Learning Representations (ICLR), 2019.

[24] Adrien Poulenard and Maks Ovsjanikov. Multi-directional geodesic neural networks via equivariant convolution. ACM Transactions on Graphics, 2018.

[25] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In International Conference on Computer Vision Workshop (ICCVW), 2015.

[26] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral Networks and Deep Locally Connected Networks on Graphs. In International Conference on Learning Representations (ICLR), 2014.

[27] Davide Boscaini, Jonathan Masci, Simone Melzi, Michael M. Bronstein, Umberto Castellani, and Pierre Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. Computer Graphics Forum, 2015.

[28] Jean-Pierre Serre. Linear representations of finite groups. 1977.

[29] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Conference on Neural Information Processing Systems (NIPS), 2017.

[30] Nichita Diaconu and Daniel Worrall. Learning to convolve: A generalized weight-tying approach. In International Conference on Machine Learning (ICML), 2019.

[31] Marysia Winkels and Taco S. Cohen. 3D G-CNNs for pulmonary nodule detection. In Conference on Medical Imaging with Deep Learning (MIDL), 2018.

[32] Daniel E. Worrall and Gabriel J. Brostow. CubeNet: Equivariance to 3D rotation and translation. In European Conference on Computer Vision (ECCV), pages 585-602, 2018.

[33] Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, and Marc Pollefeys. Ti-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[34] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference (BMVC), 2016.

[35] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[36] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[37] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[38] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR), 2016.

[39] Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.

[40] Risi Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. arXiv preprint arXiv:1803.01588, 2018.

[41] Brandon Anderson, Truong-Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. arXiv preprint arXiv:1906.04015, 2019.

[42] Diego Marcos, Michele Volpi, and Devis Tuia. Learning rotation invariant convolutional filters for texture classification. In International Conference on Pattern Recognition (ICPR), 2016.

[43] Diego Marcos, Michele Volpi, Benjamin Kellenberger, and Devis Tuia. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS Journal of Photogrammetry and Remote Sensing, 145:96-107, 2018.

[44] Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco S. Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018.

[45] Geoffrey Hinton, Nicholas Frosst, and Sara Sabour. Matrix capsules with EM routing. In International Conference on Learning Representations (ICLR), 2018.