# Efficient Equivariant Network

Lingshen He1, Yuxuan Chen1, Zhengyang Shen3, Yiming Dong1, Yisen Wang1,2, Zhouchen Lin1,2,4

1 Key Laboratory of Machine Perception (MOE), School of Artificial Intelligence, Peking University
2 Institute for Artificial Intelligence, Peking University
3 School of Mathematical Sciences and LMAM, Peking University
4 Pazhou Lab, Guangzhou 510330, China

lingshenhe@pku.edu.cn, yuxuan chen1997@outlook.com, shenzhy@pku.edu.cn, yimingdong ml@outlook.com, yisen.wang@pku.edu.cn, zlin@pku.edu.cn

Abstract

Convolutional neural networks (CNNs) have dominated the field of computer vision and achieved great success due to their built-in translation equivariance. Group equivariant CNNs (G-CNNs), which incorporate more equivariance, can significantly improve the performance of conventional CNNs. However, G-CNNs face two major challenges: the spatial-agnostic problem and expensive computational cost. In this work, we propose a general framework of previous equivariant models, which includes G-CNNs and equivariant self-attention layers as special cases. Under this framework, we explicitly decompose the feature aggregation operation into a kernel generator and an encoder, and decouple the spatial and extra geometric dimensions in the computation. Therefore, our filters are essentially dynamic rather than spatial-agnostic. We further show by complexity analysis that our Equivariant model is parameter Efficient and computationally Efficient, and by experiments that it is also data Efficient, so we call our model E4-Net. Extensive experiments verify that our model can significantly improve previous works with a smaller model size. In particular, when trained on 1/5 of the CIFAR10 training data, our model improves over G-CNNs by more than 5% accuracy, while using only 56% of the parameters and 68% of the FLOPs.

1 Introduction

In the past few years, convolutional neural networks (CNNs) have been widely used and achieved superior results on multiple vision tasks, such as image classification [31, 55, 51, 22], semantic segmentation [3], and object detection [44]. A compelling explanation of the good performance of CNNs is that their built-in parameter sharing scheme brings in translation equivariance: shifting an image and then feeding it through a CNN layer is the same as feeding the original image and then shifting the resulting feature maps. In other words, the translation symmetry is preserved by each layer. Motivated by this, Cohen and Welling [9] proposed Group Equivariant CNNs (G-CNNs), showing how convolutional networks can be generalized to exploit larger groups of symmetries. Following G-CNNs, researchers have designed new neural networks that are equivariant to other transformations such as rotations [9, 61, 24, 49] and scales [65, 53].

However, G-CNNs still have two main drawbacks: 1) in the implementation, G-CNNs introduce extra dimensions to encode new transformations, such as rotations and scales, and thus have a very high computational cost; 2) although G-CNNs achieve group equivariance by sharing kernels, like vanilla CNNs they lack the ability to adapt kernels to diverse feature patterns at different spatial positions, namely, the spatial-agnostic problem [68, 39, 70, 71, 54, 67, 36].

Some previous works focus on solving these two problems. Cheng et al. [4] proposed to decompose the convolutional filters over joint steerable bases to reduce model size.
However, it is essentially a G-CNN, which still has the inherent spatial-agnostic problem. To incorporate dynamic filters, one solution is to introduce an attention mechanism into each convolution layer of G-CNNs without disturbing the inherent equivariance [48, 45]. The cost is that extra parameters are introduced and the space and time complexity increases. Another solution is to replace group convolution layers with standalone self-attention layers that use a specific position embedding to ensure equivariance [47, 26]. However, the self-attention mechanism suffers from quadratic memory and time complexity, because it has to compute an attention score for each pair of inputs.

Actually, Cohen et al. [7], Kondor and Trivedi [29] and Bekkers [1] revealed that an equivariant linear layer is essentially a convolution-like operation. Inspired by this, we further discover that a general feature-extraction layer, either linear or non-linear, is equivariant if and only if the feature aggregation mechanism between each pair of inputs depends only on the relative position of the two inputs. Based on this observation, we propose a generalized framework of previous equivariant models, which includes G-CNNs and equivariant attention networks as special cases.

Under this generalized framework, we design a new equivariant layer to conquer the aforementioned difficulties. Firstly, to avoid quadratic computational complexity, the feature aggregation operator is explicitly decomposed into a kernel generator and an encoder, each of which takes one single feature as input. Since our kernels are calculated from input features, they are essentially dynamic rather than spatial-agnostic. In addition, we decouple the feature aggregation mechanism across spatial and extra geometric dimensions to reduce the inter-channel redundancy in convolution filters [4] and further accelerate computation. Extensive experiments show that our method can process data very efficiently and perform significantly better than previous works at a lower computational cost. As our method is parameter Efficient, computationally Efficient, data Efficient and Equivariant, we name our new layer the E4-layer.

We summarize our main contributions as follows:

- We propose a generalized framework of previous equivariant models, which includes G-CNNs and attention-based equivariant models as special cases.
- Under the generalized framework, we explicitly decompose the feature aggregation operator into a kernel generator and an encoder, and further decouple the spatial and extra geometric dimensions to reduce computation.
- Extensive experiments verify that our method is also data efficient and performs competitively with lower computational cost.

2 Related Work

Vanilla CNNs [34] are naturally translation equivariant. More symmetries have been exploited in networks for different tasks, such as rotations over the plane [9, 66, 35, 12, 61, 59, 49, 37, 52, 4, 41, 2, 10, 58, 21, 24], rotations over 3D space [62, 16, 57, 64, 14, 60, 15, 50, 28, 6], scalings [65, 40, 53, 46], symmetries on manifolds [8, 11], and other general symmetry groups [17, 56, 18]. These works accomplish equivariance by constraining the linear mappings in layers, followed by pointwise non-linearities to enhance their expressive power. In general, researchers [29, 7, 1] pointed out that an equivariant linear mapping can always be written as a convolution-like integral, i.e., G-CNNs in practice. However, this theory is still limited to linear cases.
Since works [68, 39, 70, 71, 54, 67, 36] have pointed out the spatial-agnostic problem of CNNs, and attention mechanisms [25, 63, 43, 13, 20] have achieved impressive results on various vision tasks, researchers have started to consider non-linear equivariant mappings. Romero et al. [48, 45] directly reweighted the convolution kernels with attention weights generated from features and obtained non-linear equivariant models. However, compared with G-CNNs, these methods introduce extra parameters and operations, resulting in an even heavier computational burden. Other works [47, 26, 23] proposed group equivariant self-attention [43, 13]. Fuchs et al. [19] incorporated self-attention into 3D equivariant networks and proposed SE(3)-Transformers. However, since their filters are computed from pairs of inputs, the computational complexity is quadratic.

In this work, we extend the linear equivariant theory to a more general situation that includes non-linear cases. Under this framework, we design a new equivariant layer that addresses both the spatial-agnostic problem of convolution-based equivariant models and the heavy computational cost of most equivariant models.

3 A Unified Framework of Previous Group Equivariant Models

In this section, we first briefly review two representative group equivariant models: the linear model G-CNNs [9], and the non-linear model equivariant self-attention [47, 26]. Then, we propose a general framework of previous equivariant models based on the inner relationship among these specific models.

3.1 Equivariance

Equivariance indicates that the outputs of a mapping transform in a predictable way with the transformation of the inputs. Formally, a group equivariant map $\Psi$ satisfies that $\forall u \in G$,

$$\Psi[T_u[f]] = T'_u[\Psi[f]], \qquad (1)$$

where $G$ is a transformation group, $f$ is an input feature map, and $T_u$ and $T'_u$ are group actions, indicating how the transformation $u$ acts on the input and output features, respectively. Besides, since we hope that two transformations $u, v \in G$ acting on the feature maps successively is equivalent to the composition $uv \in G$ acting on the feature maps directly, we require that $T_u \circ T_v = T_{uv}$, where $uv$ is the group product of $u$ and $v$. The same holds for $T'_u$.

Now we examine the specific form of the transformation group $G$. In this work, we focus on the analysis of 2D images defined on $\mathbb{R}^2$. Consequently, we are most interested in groups of the form $G = \mathbb{R}^2 \rtimes A$, resulting from the semi-direct product ($\rtimes$) between the translation group $\mathbb{R}^2$ and a group $A$ acting on $\mathbb{R}^2$, e.g., rotations, scalings and mirrorings. This family of groups is referred to as affine groups, and their group product rule is

$$uv = (x_u, a_u)(x_v, a_v) = (x_u + a_u x_v, a_u a_v), \qquad (2)$$

where $u = (x_u, a_u)$ and $v = (x_v, a_v)$, in which $x_u, x_v \in \mathbb{R}^2$ and $a_u, a_v \in A$. For ease of implementation, following [9], we take $A$ as the cyclic group $C_4$ or the dihedral group $D_4$, so that $G$ becomes p4 or p4m. As for the group action, we employ the most common regular group action in this work, i.e.,

$$T_u[f](v) = f(u^{-1}v). \qquad (3)$$

Here, we only care about the group action on feature maps defined on $G$, because we always use a lifting operation to lift the input images defined on $\mathbb{R}^2$ to feature maps on $G$, where the equivariance can be preserved properly, as will be shown in Section 3.2.
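To make Eqns. (2) and (3) concrete, the following NumPy sketch (our illustration, not code from the paper) implements the p4 group product and the regular action on a feature map sampled on integer translations, and numerically checks the composition rule $T_u \circ T_v = T_{uv}$; out-of-grid values are zero-padded, so the check is restricted to interior group elements.

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for a * 90 degrees, a in {0, 1, 2, 3}."""
    c = int(np.round(np.cos(a * np.pi / 2)))
    s = int(np.round(np.sin(a * np.pi / 2)))
    return np.array([[c, -s], [s, c]])

def product(u, v):
    """Group product uv = (x_u + a_u x_v, a_u a_v) of Eqn. (2)."""
    (xu, au), (xv, av) = u, v
    x = np.array(xu) + rot(au) @ np.array(xv)
    return (tuple(int(t) for t in x), (au + av) % 4)

def inverse(u):
    """u^{-1} = (-a_u^{-1} x_u, a_u^{-1})."""
    xu, au = u
    x = -(rot((-au) % 4) @ np.array(xu))
    return (tuple(int(t) for t in x), (-au) % 4)

def act(u, f):
    """Regular action T_u[f](v) = f(u^{-1} v) of Eqn. (3); f is a dict over group elements."""
    return {v: f.get(product(inverse(u), v), 0.0) for v in f}

rng = np.random.default_rng(0)
grid = [((i, j), a) for i in range(-8, 9) for j in range(-8, 9) for a in range(4)]
f = {g: rng.standard_normal() for g in grid}

u, v = ((1, 0), 1), ((0, 2), 3)                                     # two p4 elements
uv = product(u, v)
inner = [g for g in grid if max(abs(g[0][0]), abs(g[0][1])) <= 2]   # away from the boundary
assert all(act(u, act(v, f))[g] == act(uv, f)[g] for g in inner)    # T_u T_v = T_{uv}
print("composition rule verified on", len(inner), "group elements")
```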
3.2 Group Equivariant CNNs

Let $f^{(l)}: X \to \mathbb{R}^{C_l}$ and $W: G \to \mathbb{R}^{C_{l+1} \times C_l}$ be the input feature and the convolutional filter of the $l$-th layer, respectively, where $C_l$ denotes the channel number of the $l$-th layer. $X$ is taken as $\mathbb{R}^2$ for the first layer, and as $G$ for the following layers. Then for any $g \in G$, the group convolution [29, 7, 1] of $f^{(l)}$ and $W$ on $G$ at $g$ is given by

$$f^{(l+1)}(g) = \Psi[f^{(l)}](g) = \int_X W(g^{-1}\tilde{g})\, f^{(l)}(\tilde{g})\, d\mu(\tilde{g}), \qquad (4)$$

where $\mu(\cdot)$ is the Haar measure. When $X$ is discrete, Eqn. (4) can be rewritten as

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in X} W(g^{-1}\tilde{g})\, f^{(l)}(\tilde{g}). \qquad (5)$$

G-CNNs essentially generalize the translation equivariance of conventional convolution to a more general group $G$. In fact, the first layer maps the 2D images to a function defined on $G$, while the following layers map one feature map on $G$ to another. As a result, the computational complexities of the first layer and of the following layers are of the order $O(k^2|A|)$ and $O(k^2|A|^2)$, respectively, where $k$ is the kernel size in the spatial space. Hence, G-CNNs have a much larger computational cost when $A$ is large, especially for the intermediate layers. In this work, we employ the first layer of G-CNNs as a lifting operation, and focus on reducing the computation of the latter layers.
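As a reference point for the lifting operation just mentioned, here is a minimal PyTorch sketch (our illustration of the standard construction in [9], not the authors' code) of the first-layer group convolution that lifts an image on $\mathbb{R}^2$ to a feature map on p4 by sharing one base filter across four rotated copies, followed by a numerical check of the resulting equivariance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P4LiftingConv(nn.Module):
    """First-layer group convolution: images on R^2 -> feature maps on p4."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)

    def forward(self, x):                          # x: (B, in_ch, H, W)
        outs = []
        for a in range(4):                         # share one filter over 4 rotations
            w = torch.rot90(self.weight, a, dims=(2, 3))
            outs.append(F.conv2d(x, w, padding=self.weight.shape[-1] // 2))
        return torch.stack(outs, dim=2)            # (B, out_ch, |C4| = 4, H, W)

# Rotating the input image rotates the output spatially and cyclically shifts the
# rotation axis, i.e., the output transforms under the regular p4 action.
layer = P4LiftingConv(3, 8)
x = torch.randn(2, 3, 32, 32)
y = layer(x)
y_rot = layer(torch.rot90(x, 1, dims=(2, 3)))
expected = torch.rot90(y, 1, dims=(3, 4)).roll(1, dims=2)
print(torch.allclose(y_rot, expected, atol=1e-5))   # True (up to float precision)
```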
3.3 Equivariant Attention Networks

Group Equivariant Self-Attention (G-SA) [47, 26] is a representative method of equivariant attention networks, whose form can be simplified as follows:

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in G} \mathrm{Softmax}_{\tilde{g}}\big[h_Q(f^{(l)}(g))^{\mathsf{T}} \big(h_K(f^{(l)}(\tilde{g})) + P_{g^{-1}\tilde{g}}\big)\big]\, h_V(f^{(l)}(\tilde{g})), \qquad (6)$$

where $h_V: \mathbb{R}^{C_l} \to \mathbb{R}^{C_{l+1}}$, and $h_Q, h_K: \mathbb{R}^{C_l} \to \mathbb{R}^{d}$ are the embedding functions of values, queries and keys, respectively, which are neural networks in the most general case. $d$ is the dimension of the low dimensional embeddings, and $P_{g^{-1}\tilde{g}} \in \mathbb{R}^{d}$ encodes the relative position of the query $f^{(l)}(g)$ and the key $f^{(l)}(\tilde{g})$.

3.4 Generalized Equivariant Framework

As more and more group equivariant structures emerge, researchers have started to deduce the most general equivariant structures. To this end, Cohen et al. [7], Kondor and Trivedi [29] and Bekkers [1] proposed a general theory of linear group equivariant structures, which indicates that G-CNNs are the most general equivariant linear layers. Besides, many non-linear equivariant structures have appeared recently, such as equivariant self-attention layers [47, 26]. This motivates us to investigate a more general framework. In all, with only slight modification, most layers in a neural network can be viewed as a kind of aggregation of pair-wise feature interactions, as follows:

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big), \qquad (7)$$

where the feature aggregation operator $H_{g,\tilde{g}}(\cdot, \cdot): \mathbb{R}^{C_l} \times \mathbb{R}^{C_l} \to \mathbb{R}^{C_{l+1}}$ is a mapping indexed by a pair of locations $g$ and $\tilde{g}$, which describes how to aggregate the input feature pair $f(g)$ and $f(\tilde{g})$. In general, the above layer is not equivariant. However, we can find a general constraint on $H_{g,\tilde{g}}$ that makes this layer equivariant over $G$.

Theorem 1. The layer formulated as Eqn. (7) is group equivariant if and only if there is a mapping $\tilde{H}_{\hat{g}}: \mathbb{R}^{C_l} \times \mathbb{R}^{C_l} \to \mathbb{R}^{C_{l+1}}$ indexed by a single group element $\hat{g}$, such that, $\forall f^{(l)}$ and $\forall g \in G$, the layer satisfies

$$\sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big) = \sum_{\tilde{g} \in G} \tilde{H}_{g^{-1}\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big). \qquad (8)$$

Proof. Firstly, $\forall u, g, \tilde{g} \in G$,

$$T_u f^{(l+1)}(g) = f^{(l+1)}(u^{-1}g) = \sum_{\tilde{g} \in G} H_{u^{-1}g,\tilde{g}}\big(f^{(l)}(u^{-1}g), f^{(l)}(\tilde{g})\big).$$

On the other hand,

$$\sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(T_u f^{(l)}(g), T_u f^{(l)}(\tilde{g})\big) = \sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(f^{(l)}(u^{-1}g), f^{(l)}(u^{-1}\tilde{g})\big) = \sum_{\tilde{g} \in G} H_{g,u\tilde{g}}\big(f^{(l)}(u^{-1}g), f^{(l)}(\tilde{g})\big).$$

As equivariance requires $T_u f^{(l+1)}(g) = \sum_{\tilde{g} \in G} H_{g,\tilde{g}}(T_u f^{(l)}(g), T_u f^{(l)}(\tilde{g}))$, we have, $\forall f^{(l)}, g, u$,

$$\sum_{\tilde{g} \in G} H_{g,u\tilde{g}}\big(f^{(l)}(u^{-1}g), f^{(l)}(\tilde{g})\big) = \sum_{\tilde{g} \in G} H_{u^{-1}g,\tilde{g}}\big(f^{(l)}(u^{-1}g), f^{(l)}(\tilde{g})\big).$$

Replacing $g$ by $ug$, we get, $\forall f^{(l)}, g, u$,

$$\sum_{\tilde{g} \in G} H_{ug,u\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big) = \sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big).$$

Then, letting $u = g^{-1}$, we obtain, $\forall f^{(l)}, g$,

$$\sum_{\tilde{g} \in G} H_{e,g^{-1}\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big) = \sum_{\tilde{g} \in G} H_{g,\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big).$$

Denoting $\tilde{H}_{g^{-1}\tilde{g}}(\cdot,\cdot) := H_{e,g^{-1}\tilde{g}}(\cdot,\cdot)$, we get exactly Eqn. (8). The converse direction is obvious by reversing the argument. Q.E.D.

From the theorem, we can get a group equivariant layer:

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in G} \tilde{H}_{g^{-1}\tilde{g}}\big(f^{(l)}(g), f^{(l)}(\tilde{g})\big), \qquad (9)$$

which is also the only equivariant form of Eqn. (7). Actually, the above theorem also reveals the essence of equivariance in previous works: if the relative positions of $(g_1, \tilde{g}_1)$ and $(g_2, \tilde{g}_2)$ are the same, i.e., $g_1^{-1}\tilde{g}_1 = g_2^{-1}\tilde{g}_2 = \hat{g}$, the feature pairs located at the two tuples should be processed equally. In other words, we should employ the same function $\tilde{H}_{\hat{g}}$ to act on these two input feature pairs. From this perspective, we can readily see that both the kernel sharing used in G-CNNs, Eqn. (4), and the relative position encoding adopted in G-SA, Eqn. (6), follow the above rule. According to Theorem 1, designing a group equivariant layer becomes much easier and more flexible than before, as we only need to design a new function $\tilde{H}_{\hat{g}}$. In addition, the new formulation provides a more general perspective on the group equivariant layer, i.e., sharing the parameters of the function $\tilde{H}_{\hat{g}}$, which generalizes the kernel sharing scheme of G-CNNs.
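Theorem 1 and Eqn. (9) can be sanity-checked numerically even with a deliberately non-linear $\tilde{H}$. The sketch below (our illustration; it uses the cyclic translation group $\mathbb{Z}_n$ so that the regular action is an exact circular shift) verifies that a layer whose pairwise aggregation depends only on the relative position commutes with the group action.

```python
import numpy as np

n, C = 12, 5
rng = np.random.default_rng(0)
w = rng.standard_normal((n, C))        # one weight vector per relative offset g_hat

def layer(f):
    """Eqn. (9) on Z_n with a non-linear H~_{g_hat}(x, y) = w[g_hat] * tanh(x * y)."""
    out = np.zeros_like(f)
    for g in range(n):
        for g_hat in range(n):         # g~ = g + g_hat, so the relative position is g_hat
            out[g] += w[g_hat] * np.tanh(f[g] * f[(g + g_hat) % n])
    return out

f = rng.standard_normal((n, C))
u = 3                                   # a translation u in Z_n
T_u = lambda x: np.roll(x, u, axis=0)   # regular action T_u[f](g) = f(g - u)
print(np.allclose(layer(T_u(f)), T_u(layer(f))))   # True: the layer is equivariant
```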
Based on the above understanding, we can see that if we replace the feature vectors on the right-hand side of Eqn. (9) with the local patches at the group elements $g$ and $\tilde{g}$, respectively, the layer is still equivariant.

Proposition 1. The following layer is equivariant:

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in G} \tilde{H}_{g^{-1}\tilde{g}}\big(F_{N_1}(g), F_{N_2}(\tilde{g})\big), \qquad (10)$$

where for $i = 1, 2$, $F_{N_i}(g)$ denotes the local patch at $g$, in which $N_i(g)$ represents $g$'s neighborhood $\{gg' \,|\, g' \in N_i(e)\}$ and $N_i(e)$ is a predefined neighborhood of the identity element $e \in G$.

One remarkable advantage of introducing local patches is that they contain more semantic information than single feature vectors. Note that we acquire the local patches by concatenating features in the neighborhoods of $g$ and $\tilde{g}$ in a predefined order on $N_1(e)$ and $N_2(e)$, respectively, i.e., $f(g')$ is concatenated at the same place in $F_{N_1}(g)$ as $f(g^{-1}g')$ in $F_{N_1}(e)$. We denote the concatenation operator as $\bigcup$, and will discuss the above in detail in Section 4.1, which shows that concatenating features not only makes our framework more flexible, but also helps to reduce the computational burden of our newly proposed equivariant layer.

4 Efficient Equivariant Layer

A straightforward and easy case of Eqn. (10) is to adopt $\tilde{H}_{\hat{g}}$, $\hat{g} \in G$, as a multi-layer perceptron (MLP), where the subscript $\hat{g}$ is used to identify different MLPs. However, in Eqn. (10), we have to compute a mapping from two high dimensional vectors to another high dimensional one for each input pair $g$ and $\tilde{g}$, which is very expensive. A similar issue exists in computing the attention scores in self-attention. To deal with this problem, we decompose $\tilde{H}$ into the following form to reduce the computation: $\forall \hat{g} \in G$,

$$\tilde{H}_{\hat{g}}(x, y) = K_{\hat{g}}(x) \odot V(y), \qquad (11)$$

where $\odot$ denotes the element-wise product, $K_{\hat{g}}: \mathbb{R}^{C_l|N_1(e)|} \to \mathbb{R}^{C_{l+1}}$ is a kernel generator and $V: \mathbb{R}^{C_l|N_2(e)|} \to \mathbb{R}^{C_{l+1}}$ is an encoder. We use $|\cdot|$ to denote the number of elements in a set. Hence, we can compute $K_{\hat{g}}(x)$ and $V(y)$ separately. In addition, to further save computation, we split the kernel into several slices along the channels, such that $K_{\hat{g}}$ is shared across these slices, i.e., $\forall\, 1 \le i, j \le C_{l+1}$, $K^i_{\hat{g}} = K^j_{\hat{g}}$ if $i \equiv j \ (\mathrm{mod}\ s)$, where $s$ is the number of slices, and $i$ and $j$ are channel indices.

Figure 1: An example of our E4-layer on the p4 group (legend: linear transform, element-wise product, spatial-wise aggregation, rotation-wise concatenation). We first concatenate features along the rotation dimension in a predefined order on $C_4$ and then pass them through the MLP and the linear layer to generate the kernel and encode the features, respectively. After this, the element-wise product is carried out to compute $\tilde{H}_{g^{-1}\tilde{g}}$, and finally spatial-wise aggregation is performed to acquire the output. Note that when computing output features at different rotation dimensions, the generated kernel should be rotated by a specific degree to keep the correct relative positions.

The kernel generator $K_{\hat{g}}$ produces an essentially dynamic filter that adapts to the features around $g$, avoiding the spatial-agnostic problem of G-CNNs. Unlike conventional dynamic filters, which are matrices, the output of $K_{\hat{g}}$ is a vector, which can be viewed as a depthwise kernel [5]. This decouples the channel dimension from the spatial dimension during feature aggregation and reduces the computational cost. Position information is implicitly encoded in the organized output form of our kernel generator, rather than using an explicit positional embedding as in the group self-attention layers [26, 47]. In practice, we can view the whole kernel family $\{K_{\hat{g}}\}_{\hat{g} \in G}$ as the output of a single mapping $\tilde{K}: \mathbb{R}^{C_l|N_1(e)|} \to \mathbb{R}^{|G| C_{l+1}}$. We then resize the output of $\tilde{K}$ into a $|G| \times C_{l+1}$ matrix, with different rows representing different $K_{\hat{g}}$. Namely, if we adopt $\tilde{K}$ as an MLP, the computations and parameters of the hidden layer are shared across $K_{\hat{g}}$ for different $\hat{g}$, which is another merit of Eqn. (11). However, there is still a large search space for $\tilde{H}_{\hat{g}}$, as Eqn. (11) is only a special structure of $\tilde{H}_{\hat{g}}$; we leave a more complete study of $\tilde{H}_{\hat{g}}$ for future work.

4.1 Implementation on Affine Group

In this section, we design a very efficient equivariant layer based on Eqn. (11) for the affine group $\mathbb{R}^2 \rtimes A$. The computation of the operator is

$$f^{(l+1)}(g) = \sum_{\tilde{g} \in N(g)} K_{g^{-1}\tilde{g}}\Big(\bigcup_{g' \in N_1(g)} f^{(l)}(g')\Big) \odot V\Big(\bigcup_{\tilde{g}' \in N_2(\tilde{g})} f^{(l)}(\tilde{g}')\Big). \qquad (12)$$

Following the standard practice in computer vision, aggregation is done only on the local neighborhood $N(g)$ of $g$. To save computation, we choose $N(g)$ to be the spatial-wise neighborhood only, i.e., $N(g) = \{g(v, e_A) \,|\, v \in \Omega\}$, where $\Omega \subset \mathbb{R}^2$ and $e_A$ is the identity element of the group $A$. However, aggregating information along the spatial neighborhood only discards the information interaction along $A$, which could lead to a drop in performance [32]. We alleviate this issue by concatenating the feature map along $A$, i.e., we choose $N_1(g)$ and $N_2(g)$ to be $\{g(0, a) \,|\, a \in A\}$. The order of concatenation is predefined on $A$. As will be shown in the later experiments, this concatenation does not introduce much computation but can significantly improve performance. Compared to group convolution, such a design enables us to decouple the feature aggregation across the spatial dimension and the $A$ dimension to further reduce computational cost.

In practice, we adopt $\tilde{K}$ as a two-layer MLP: $\tilde{K}(x) = W_2\,\mathrm{ReLU}(W_1 x)$, where $W_1 \in \mathbb{R}^{(C_l/r) \times C_l|A|}$, $W_2 \in \mathbb{R}^{|\Omega|s \times (C_l/r)}$, $r$ is the reduction ratio which saves both parameters and computation, and $s$ is the number of slices defined before. For 2D images, $\Omega$ is usually adopted as a $k \times k$ square mesh grid, so $|\Omega| = k^2$, where $k$ is the kernel size. We simply adopt the encoder $V$ as a linear transform: $V(y) = W_3 y$, where $W_3 \in \mathbb{R}^{C_{l+1} \times C_l|A|}$. For better illustration, we visualize a concrete layer of Eqn. (12) by choosing $G$ as p4 in Figure 1.
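To make the computation in Eqn. (12) concrete, below is a minimal PyTorch sketch of an E4-layer on p4 (our reconstruction under the assumptions stated in the comments, not the authors' released code; the class name E4LayerP4 is ours). The shared MLP $\tilde{K}$ and the encoder $V$ are realized as 1x1 convolutions, orientation channels are concatenated in an order relative to each output orientation, and the generated $k \times k$ kernel is rotated per output orientation as described in Figure 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class E4LayerP4(nn.Module):
    """f: (B, C_in, 4, H, W) -> (B, C_out, 4, H, W), equivariant to p4."""
    def __init__(self, c_in, c_out, k=5, r=1, s=2):
        super().__init__()
        self.k, self.s, self.c_out = k, s, c_out
        # Kernel generator K~: a two-layer MLP shared over all spatial offsets,
        # K~(x) = W2 ReLU(W1 x), realized with 1x1 convolutions (Section 4.1).
        self.kgen = nn.Sequential(
            nn.Conv2d(c_in * 4, c_in // r, 1), nn.ReLU(),
            nn.Conv2d(c_in // r, k * k * s, 1))
        # Encoder V: a single linear transform W3.
        self.enc = nn.Conv2d(c_in * 4, c_out, 1)

    def forward(self, f):
        B, C, A, H, W = f.shape
        k, s, m = self.k, self.s, self.k // 2
        slice_idx = torch.arange(self.c_out, device=f.device) % s  # slice sharing
        outs = []
        for a0 in range(A):                            # one output orientation at a time
            # Concatenate along the rotation axis in an order relative to a0 (Prop. 1).
            z = torch.roll(f, -a0, dims=2).reshape(B, C * A, H, W)
            val = self.enc(z)                          # V(...): (B, C_out, H, W)
            kern = self.kgen(z).view(B, s, k, k, H, W) # K_v(...) for all offsets v
            # Rotate the generated k x k kernel by a0 steps so that relative
            # positions stay correct for this output orientation (Figure 1).
            kern = torch.rot90(kern, a0, dims=(2, 3)).reshape(B, s, k * k, H, W)
            kern = kern[:, slice_idx]                  # (B, C_out, k*k, H, W)
            # Spatial-wise aggregation with element-wise products over the k x k window.
            patches = F.unfold(val, k, padding=m).view(B, self.c_out, k * k, H, W)
            outs.append((kern * patches).sum(dim=2))
        return torch.stack(outs, dim=2)                # (B, C_out, 4, H, W)

# Quick equivariance check: the regular p4 action on feature maps (spatial rot90 plus
# a cyclic shift of the orientation axis) commutes with the layer.
layer = E4LayerP4(8, 16)
f = torch.randn(2, 8, 4, 20, 20)
act = lambda t: torch.rot90(t, 1, dims=(-2, -1)).roll(1, dims=2)
print(torch.allclose(layer(act(f)), act(layer(f)), atol=1e-4))   # True
```

The explicit loop over the four output orientations is kept for readability; an optimized implementation would batch it, and, as discussed in Section 6, a customized CUDA kernel would be needed to match the wall-clock speed of highly optimized convolutions.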
4.2 Computational Complexity Analysis

In practice, the feature map is defined on discrete mesh grids. We use $h$ and $w$ to denote the height and the width of the mesh grids. As the numbers of input and output channels are usually the same, we assume $C_l = C_{l+1} = c$.

Parameter Analysis. The number of learnable parameters of the E4-layer (12) is $c^2|A|(1 + 1/r) + csk^2/r$. As $s \ll c$, the parameter count is dominated by the first term when $k$ is not too large, so increasing the kernel size does not significantly increase the parameter count, as shown in later experiments. The parameter count of a group convolution layer is $c^2k^2|A|$. Noticing that $(1 + 1/r) \ll k^2$ and $s/r \ll c|A|$, the parameter count of our E4-layer is significantly smaller than that of a group convolution layer.

Time Complexity Analysis. The FLOPs of the E4-layer and of a group convolution layer are $(1 + 1/r)c^2|A|^2hw + (1 + s/r)k^2c|A|hw$ and $k^2c^2|A|^2hw$, respectively. Similarly, as $(1 + 1/r) \ll k^2$ and $(1 + s/r) \ll c|A|$, the FLOPs of the E4-layer are significantly lower than those of a group convolution layer.

It can be observed that both the parameter count and the FLOPs of our E4-layer are composed of two terms, one depending on $k^2$ and the other not relying on $k$, which is a result of disentangling the spatial dimension from both the channels and $A$ during feature aggregation.
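As a quick illustration of these formulas (ours; the configuration values below are representative, not taken from the paper), the snippet evaluates both counts for an E4-layer and a p4 group convolution layer of the same width.

```python
# Plug representative values into the Section 4.2 formulas and compare.
c, A, k, r, s, h, w = 64, 4, 5, 1, 2, 32, 32

e4_params  = c * c * A * (1 + 1 / r) + c * s * k * k / r     # E4-layer parameters
gcv_params = c * c * k * k * A                               # group convolution parameters

e4_flops  = (1 + 1 / r) * c * c * A * A * h * w + (1 + s / r) * k * k * c * A * h * w
gcv_flops = k * k * c * c * A * A * h * w

print(f"params: E4 {e4_params / 1e3:.1f}K vs group conv {gcv_params / 1e3:.1f}K")
print(f"FLOPs : E4 {e4_flops / 1e6:.1f}M vs group conv {gcv_flops / 1e6:.1f}M")
```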
5 Experiments

In this section, we conduct extensive experiments to study and demonstrate the performance of our model. The experimental results show that our model has a greater capacity than the group-convolution-based ones in terms of parameter efficiency, computational efficiency, data efficiency and accuracy. On the MNIST-rot dataset, we study in detail the effect of hyperparameters on the number of parameters, the computation FLOPs and the performance of our model. All the experiments are run on a GeForce RTX 3090 GPU.

5.1 Rotated MNIST

Table 1: Test error on rot-MNIST (with standard deviation under 5 random seed variations).

| Model | Test error (%) | Params | FLOPs |
| --- | --- | --- | --- |
| p4-SA [47] | 2.54 ± 0.052 | 44.67K | 400M |
| p4-CNN [9] | 1.79 ± 0.043 | 77.54K | 46.2M |
| α-p4-CNN [45] | 1.69 ± 0.021 | 73.13K | 27.0M |
| E4-Net (Ours) | 1.29 ± 0.023 | 18.8K | 17M |
| E4-Net (Large) (Ours) | 1.17 ± 0.019 | 41.1K | 36.9M |

The MNIST-rot dataset [33] is the most widely used benchmark for testing equivariant models. It contains 62k randomly rotated 28×28 gray-scale handwritten digits. Images in the dataset are split into 10k for training, 2k for validation and 50k for testing. The random rotation of the digits and the use of only 20 percent of the training data of the standard MNIST dataset increase the difficulty of classification. For a fair comparison, we keep both the training settings and the architectures of our model as close as possible to previous works [9, 47]. In addition, we adopt the p4 group to construct all our models in this section.

In our first experiment, we adopt the E4-Net given in the supplementary material to make a comparison with previous works. This is a very lightweight model which contains only 18.8K learnable parameters. It is composed of one group convolutional layer which lifts the image to the p4 group, six E4-layers and one fully connected layer (a schematic sketch is given at the end of this subsection). Two 2×2 max-pooling layers are inserted after the first and the third E4-layers to downsample the feature maps. The last E4-layer is followed by a global max group pooling layer [9], which takes the maximum response over the entire group, to make the predictions invariant to rotations. Our model is trained using the Adam optimizer [27] for 200 epochs with a batch size of 128. The learning rate is initialized as 0.02 and is reduced by a factor of 10 at the 60th, 120th and 160th epochs. The weight decay is set as 0.0001 and no data augmentation is used during training.

The results are listed in Table 1. Our models significantly outperform G-CNNs [9] using only about 25% of the parameters and 40% of the FLOPs. G-SA [47], a group equivariant stand-alone self-attention model, even performs worse than G-CNNs with a much higher computational cost. The α-p4-CNN model [45] further introduces the attention mechanism into group convolution along both the spatial and channel dimensions to enhance the expressiveness of G-CNNs, while our E4-Net still significantly outperforms it with less computational cost. We also experiment with a larger model to further demonstrate the capacity of our model, which is listed in the last line of Table 1.

Table 2: The effect of concatenation.

| Concatenation | Test error (%) | Params | FLOPs |
| --- | --- | --- | --- |
| None | 4.10 ± 0.085 | 9.9K | 8.9M |
| only K | 1.96 ± 0.045 | 14.4K | 13M |
| only V | 1.52 ± 0.036 | 14.4K | 13M |
| K & V | 1.29 ± 0.023 | 18.8K | 17M |

Ablation Study of Concatenation: In the E4-layer (12), we introduce the concatenation operation to enable the disentanglement of the rotation and spatial information interaction. To study the importance of concatenation, we carry out experiments for the case where neither $K_{\hat{g}}$ nor $V$ in Eqn. (12) uses concatenation, i.e., $N_1(g) = \{g\}$, $N_2(\tilde{g}) = \{\tilde{g}\}$. As shown in the first line of Table 2, this leads to a significant drop in performance. This is because, if the aggregation in Eqn. (12) is done merely over the spatial neighborhoods without concatenation, there is no information interaction along the rotation dimension. We also experiment with the cases using concatenation only in $K_{\hat{g}}$ or only in $V$; both perform better than the case without concatenation but are still inferior to the case with concatenation in both $K_{\hat{g}}$ and $V$. This further illustrates the importance of concatenation along $A$.

Table 3: Hyperparameter analysis.

| Hyperparam | Test error (%) | Params | FLOPs |
| --- | --- | --- | --- |
| s=1 | 1.45 ± 0.022 | 16.3K | 14.9M |
| s=2 | 1.29 ± 0.023 | 18.8K | 17M |
| s=4 | 1.24 ± 0.026 | 23.9K | 21.2M |
| r=1 | 1.29 ± 0.023 | 18.8K | 17M |
| r=2 | 1.33 ± 0.026 | 13.0K | 12M |
| r=4 | 1.37 ± 0.025 | 10.1K | 9.5M |
| k=3 | 1.46 ± 0.031 | 15.6K | 14.3M |
| k=5 | 1.29 ± 0.023 | 18.8K | 17M |
| k=7 | 1.27 ± 0.021 | 23.8K | 21.1M |

Hyperparameter Analysis: We investigate the effect of the various hyperparameters used in the E4-layer. The reduction ratio $r$ and the slice number $s$ in $K_{\hat{g}}$, together with the kernel size $k$, control the computation and parameters of the layer. Based on the baseline model, we vary the three hyperparameters respectively. As shown in Table 3, improvement is observed when decreasing the reduction ratio and increasing the slice number, at the cost of an increasing computational burden. In particular, the improvement of s=4 over s=2 and of r=1 over r=2 is marginal, which we attribute to redundancy in the kernel [4]. In conclusion, appropriately increasing the reduction ratio $r$ and decreasing the slice number $s$ can help to reduce the computational cost while preserving performance. Keeping the other hyperparameters fixed, we also study the effect of the kernel size on our model. In Table 3, the performance peaks when the kernel size equals 7. In general, a larger kernel size leads to improved performance due to a larger receptive field. In addition, as explained in Section 4.2, increasing the kernel size does not dramatically increase the parameters and FLOPs as it does for standard convolution.
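For reference, the lightweight architecture described above could be assembled roughly as follows. This is our sketch (channel widths and the lifting kernel size are placeholders, not the exact configuration from the supplementary material), and it reuses the P4LiftingConv and E4LayerP4 sketches from earlier sections, which are assumed to be in scope.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class E4NetMNIST(nn.Module):
    """Lifting layer + six E4-layers + global max group pooling + one FC layer."""
    def __init__(self, width=16, n_classes=10):
        super().__init__()
        self.lift = P4LiftingConv(1, width, k=3)       # R^2 -> p4 lifting (sketch above)
        self.blocks = nn.ModuleList(
            [E4LayerP4(width, width, k=5, r=1, s=2) for _ in range(6)])
        self.fc = nn.Linear(width, n_classes)

    def forward(self, x):                              # x: (B, 1, 28, 28)
        f = self.lift(x)                               # (B, width, 4, H, W)
        for i, blk in enumerate(self.blocks):
            f = torch.relu(blk(f))
            if i in (0, 2):                            # 2x2 max pooling after E4-layers 1 and 3
                B, C, A, H, W = f.shape
                f = F.max_pool2d(f.reshape(B, C * A, H, W), 2).reshape(B, C, A, H // 2, W // 2)
        f = f.amax(dim=(2, 3, 4))                      # global max over group and space
        return self.fc(f)

model = E4NetMNIST()
print(model(torch.randn(2, 1, 28, 28)).shape)          # torch.Size([2, 10])
```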
5.2 Natural Image Classification

In this section, we evaluate the performance of our model on two common natural image datasets, CIFAR10 and CIFAR100 [30]. The CIFAR10 and CIFAR100 datasets consist of 32×32 images belonging to 10 and 100 classes, respectively. Both datasets contain 50k training images and 10k testing images. Before training, images are normalized according to the channel means and standard deviations.

Table 4: Test error on CIFAR10 and CIFAR100 (with standard deviation under 5 random seed variations).

| Model | CIFAR10 (%) | CIFAR100 (%) | Params | FLOPs |
| --- | --- | --- | --- | --- |
| R18 | 9.7 ± 0.43 | 34 ± 0.76 | 11M | 0.56G |
| p4-R18 | 7.53 ± 0.21 | 27.96 ± 0.56 | 11M | 2.99G |
| p4-E4R18 (Ours) | 6.72 ± 0.14 | 26.59 ± 0.36 | 5.8M | 1.85G |
| p4m-R18 | 5.83 ± 0.17 | 24.95 ± 0.42 | 10.8M | 5.63G |
| p4m-E4R18 (Ours) | 4.96 ± 0.16 | 22.18 ± 0.46 | 6.0M | 3.87G |

In this experiment, we adopt ResNet-18 [22] as the baseline model (abbreviated as R18), which is composed of an initial convolution layer, followed by 4 stages of Res-Blocks and one final classification layer. Following the standard practice in [9], we replace all the conventional convolution layers in R18 with p4 (p4m) convolutions and adjust the width of each layer to keep the number of learnable parameters approximately the same. We denote the resulting models as p4-R18 (p4m-R18). We then replace the second group convolution layer in each Res-Block of p4-R18 (p4m-R18) with our E4-layer, resulting in p4-E4R18 (p4m-E4R18); a schematic sketch of the modified block is given at the end of this subsection. For a fair comparison, all the above models are trained under the same settings. We use stochastic gradient descent with an initial learning rate of 0.1, a Nesterov momentum of 0.9 and a weight decay of 0.0005. The learning rate is reduced by a factor of 5 at the 60th, 120th, and 160th epochs. Models are trained for 200 epochs with a batch size of 128. No data augmentation is used during training, so as to illustrate the data efficiency of our model.

The classification accuracy, parameter counts and FLOPs of all models on CIFAR10 and CIFAR100 are reported in Table 4. We can see that models incorporating more symmetry achieve better performance, i.e., p4m-R18 outperforms p4-R18, which in turn outperforms R18. Our p4 and p4m models significantly outperform their counterparts on both CIFAR10 and CIFAR100. Furthermore, our model decreases the parameter count and FLOPs by 45% and 32%, respectively. Notice that the model size reduction is purely caused by the introduction of our E4-layers, as the topological connections and the width of each layer of the E4 models and their counterparts are the same.

Figure 2: Trend of test error (%) for various training data sizes.

Data Efficiency: To further study the performance of our model, we train all the models listed in Table 4 on CIFAR10 with different sizes of training data. To be specific, we consider 5 settings, where 1k, 2k, 3k, 4k and 5k training images of each class are randomly sampled from the CIFAR10 training set. Testing is still performed on the original test set of CIFAR10. The other training settings are identical to the above. We visualize the results in Figure 2. It is observed that the performance gaps between the p4, p4m and R2 models tend to increase as we reduce the training data. This is mainly because the prior that the label is invariant to rotations becomes more important when training data are fewer. The same trend is observed in the gaps between our models and their counterparts. For instance, the gap between p4m-E4R18 and p4m-R18 is 0.87% when the training data of each class is 5k, while it is enlarged to 5.22% when the training data of each class is reduced to 1k. In particular, we observe that the curve of p4-E4R18 crosses that of p4m-R18, which further indicates that our model is much more data efficient than G-CNNs. As indicated above, the symmetry prior is more important when training data are fewer, and the data efficiency of our model implies that p4-E4R18 and p4m-E4R18 can better exploit the symmetry of the data.
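A schematic sketch of the modified residual block used to build p4-E4R18 is given below. This is our illustration, not the authors' code: the first convolution is a p4 group convolution passed in as `gconv` (a full p4-to-p4 convolution module is not reproduced here), and E4LayerP4 refers to the earlier sketch.

```python
import torch.nn as nn

class E4BasicBlock(nn.Module):
    """Res-Block variant: the second group convolution is replaced by an E4-layer."""
    def __init__(self, gconv, c, k=5, r=1, s=2):
        super().__init__()
        self.conv1 = gconv                        # assumed p4 group convolution, c -> c
        self.bn1 = nn.BatchNorm3d(c)              # input layout (B, C, 4, H, W)
        self.e4 = E4LayerP4(c, c, k=k, r=r, s=s)  # replaces the second group convolution
        self.bn2 = nn.BatchNorm3d(c)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f):
        out = self.relu(self.bn1(self.conv1(f)))
        out = self.bn2(self.e4(out))
        return self.relu(out + f)                 # identity shortcut
```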
6 Limitation and Future Work

From the theoretical perspective, although we extend the general equivariant framework from linear cases to common non-linear cases, there are two limitations on the generalization: 1) we only focus on layers with the pair-wise interactions proposed in Eqn. (7), and higher-order interactions are not included; 2) we only consider regular group actions in this framework, which are a special case of general group actions. We leave extending this equivariant framework to these cases as future work. From the practical perspective, we only give one special implementation of Eqn. (10) based on an intuitive insight, and further exploration of the space of equivariant maps is in demand. An alternative is to exploit searching algorithms from neural architecture search [42, 38, 69] to find a more powerful and efficient model. Besides this, our E4-layer is slower than G-CNNs in wall-clock time despite having fewer FLOPs, because convolutions are optimized by many speedup libraries. Our layer is implemented only in a naive way, that is, using the unfold operation followed by a summation operation for the aggregation step. In the future, we will try to implement a customized CUDA kernel for GPU acceleration to reduce the training and inference time of our model.

7 Conclusions

In this work, we propose a general framework of group equivariant models which delivers a unified understanding of previous group equivariant models. Based on the new understanding, we propose a novel, efficient and powerful group equivariant layer which can serve as a drop-in replacement for convolution layers. Extensive experiments demonstrate that the E4-layer is more powerful, parameter efficient and computationally efficient than group convolution layers and their variants. Through a side-by-side comparison with G-CNNs, we demonstrate that our E4-layer can significantly improve the data efficiency of equivariant models, which shows great potential for reducing the cost of collecting data.

Acknowledgment

Zhouchen Lin was supported by the NSF China (Nos. 61625301 and 61731018), NSFC Tianyuan Fund for Mathematics (No. 12026606) and Project 2020BD006 supported by PKU-Baidu Fund. Yisen Wang is partially supported by the National Natural Science Foundation of China under Grant 62006153, and Project 2020BD006 supported by PKU-Baidu Fund.

References

[1] Erik J Bekkers. B-spline CNNs on Lie groups. In ICLR, 2019.
[2] Erik J Bekkers, Maxime W Lafarge, Mitko Veta, Koen AJ Eppenhof, Josien PW Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In MICCAI, 2018.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 834-848, 2017.
[4] Xiuyuan Cheng, Qiang Qiu, Robert Calderbank, and Guillermo Sapiro. RotDCF: Decomposition of convolutional filters for rotation-equivariant deep networks. In ICLR, 2018.
[5] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[6] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In ICLR, 2018.
[7] Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. In NeurIPS, 2019.
[8] Taco S Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In ICML, 2019.
[9] Taco S Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016.
[10] Taco S Cohen and Max Welling. Steerable CNNs. In ICLR, 2017.
[11] Pim De Haan, Maurice Weiler, Taco S Cohen, and Max Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. In ICLR, 2021.
[12] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, 2016.
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[14] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In ECCV, 2018.
[15] Carlos Esteves, Avneesh Sud, Zhengyi Luo, Kostas Daniilidis, and Ameesh Makadia. Cross-domain 3D equivariant image embeddings. In ICML, 2019.
[16] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant multi-view networks. In ICCV, 2019.
[17] M Finzi, S Stanton, P Izmailov, and AG Wilson. Generalizing convolutional networks for equivariance to Lie groups on arbitrary continuous data. In ICML, 2020.
[18] Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. In ICML, 2021.
[19] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. SE(3)-Transformers: 3D roto-translation equivariant attention networks. In NeurIPS, 2020.
[20] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In ICLR, 2021.
[21] Simon Graham, David Epstein, and Nasir Rajpoot. Dense steerable filter CNNs for exploiting rotational symmetry in histology images. IEEE Transactions on Medical Imaging, 2020.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[23] Lingshen He, Yiming Dong, Yisen Wang, Dacheng Tao, and Zhouchen Lin. Gauge equivariant transformer. In NeurIPS, 2021.
[24] Emiel Hoogeboom, Jorn WT Peters, Taco S Cohen, and Max Welling. HexaConv. In ICLR, 2018.
[25] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[26] Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. LieTransformer: Equivariant self-attention for Lie groups. In ICML, 2021.
[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[28] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan Nets: a fully Fourier space spherical convolutional neural network. In NeurIPS, 2018.
[29] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In ICML, 2018.
[30] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[32] Dmitry Laptev, Nikolay Savinov, Joachim M Buhmann, and Marc Pollefeys. TI-POOLING: Transformation-invariant pooling for feature learning in convolutional neural networks. In CVPR, 2016.
[33] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
[34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[35] Jan Eric Lenssen, Matthias Fey, and Pascal Libuschewski. Group equivariant capsule networks. In NeurIPS, 2018.
[36] Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, and Qifeng Chen. Involution: Inverting the inherence of convolution for visual recognition. In CVPR, 2021.
[37] Junying Li, Zichen Yang, Haifeng Liu, and Deng Cai. Deep rotation equivariant network. Neurocomputing, 2018.
[38] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
[39] Ningning Ma, Xiangyu Zhang, Jiawei Huang, and Jian Sun. WeightNet: Revisiting the design space of weight networks. In ECCV, 2020.
[40] Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in CNNs with vector fields. arXiv preprint arXiv:1807.11783, 2018.
[41] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In ICCV, 2017.
[42] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In ICML, 2018.
[43] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[44] Shaoqing Ren, Kaiming He, Ross B Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[45] David Romero, Erik Bekkers, Jakub Tomczak, and Mark Hoogendoorn. Attentive group equivariant convolutional networks. In ICML, 2020.
[46] David W Romero, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Wavelet networks: Scale equivariant learning from raw waveforms. arXiv preprint arXiv:2006.05259, 2020.
[47] David W Romero and Jean-Baptiste Cordonnier. Group equivariant stand-alone self-attention for vision. In ICLR, 2021.
[48] David W Romero and Mark Hoogendoorn. Co-attentive equivariant neural networks: Focusing equivariance on transformations co-occurring in data. In ICLR, 2020.
[49] Zhengyang Shen, Lingshen He, Zhouchen Lin, and Jinwen Ma. PDO-eConvs: Partial differential operator based equivariant convolutions. In ICML, 2020.
[50] Zhengyang Shen, Tiancheng Shen, Zhouchen Lin, and Jinwen Ma. PDO-eS2CNNs: Partial differential operator based equivariant spherical CNNs. In AAAI, 2021.
[51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[52] Bart Smets, Jim Portegies, Erik Bekkers, and Remco Duits. PDE-based group equivariant convolutional neural networks. arXiv preprint arXiv:2001.09046, 2020.
[53] Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-equivariant steerable networks. In ICLR, 2019.
[54] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In CVPR, 2019.
[55] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[56] Kai Sheng Tai, Peter Bailis, and Gregory Valiant. Equivariant transformer networks. In ICML, 2019.
[57] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. In NeurIPS, 2018.
[58] Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco S Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In MICCAI, 2018.
[59] Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. In NeurIPS, 2019.
[60] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In NeurIPS, 2018.
[61] Maurice Weiler, Fred A Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In CVPR, 2018.
[62] Marysia Winkels and Taco S Cohen. Pulmonary nodule detection in CT scans with equivariant CNNs. Medical Image Analysis, 2019.
[63] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
[64] Daniel Worrall and Gabriel Brostow. CubeNet: Equivariance to 3D rotation and translation. In ECCV, 2018.
[65] Daniel Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In NeurIPS, 2019.
[66] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.
[67] Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, and Xiangyang Ji. Dynamic filtering with large sampling field for convnets. In ECCV, 2018.
[68] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. In NeurIPS, 2019.
[69] Yibo Yang, Hongyang Li, Shan You, Fei Wang, Chen Qian, and Zhouchen Lin. ISTA-NAS: Efficient and consistent neural architecture search by sparse coding. In NeurIPS, 2020.
[70] Yikang Zhang, Jian Zhang, Qiang Wang, and Zhao Zhong. DyNet: Dynamic convolution for accelerating convolutional neural networks. arXiv preprint arXiv:2004.10694, 2020.
[71] Jingkai Zhou, Varun Jampani, Zhixiong Pi, Qiong Liu, and Ming-Hsuan Yang. Decoupled dynamic filter networks. In CVPR, 2021.