Attentive Group Equivariant Convolutional Networks

David W. Romero 1, Erik J. Bekkers 2, Jakub M. Tomczak 1, Mark Hoogendoorn 1

Abstract: Although group convolutional networks are able to learn powerful representations based on symmetry patterns, they lack explicit means to learn meaningful relationships among them (e.g., relative positions and poses). In this paper, we present attentive group equivariant convolutions, a generalization of the group convolution in which attention is applied during the course of convolution to accentuate meaningful symmetry combinations and suppress non-plausible, misleading ones. We indicate that prior work on visual attention can be described as special cases of our proposed framework and show empirically that our attentive group equivariant convolutional networks consistently outperform conventional group convolutional networks on benchmark image datasets. Simultaneously, we provide interpretability to the learned concepts through the visualization of equivariant attention maps.

1. Introduction

Convolutional Neural Networks (CNNs) (LeCun et al., 1989) have shown impressive performance in a wide variety of domains. The development of CNNs, as well as of many other machine learning approaches, has been fueled by intuitions and insights into the composition and modus operandi of multiple biological systems (Wertheimer, 1938; Biederman, 1987; Blake & Lee, 2005; Zhaoping, 2014; Delahunt & Kutz, 2019). Though CNNs have achieved remarkable performance increases on several benchmark problems, their training efficiency as well as their generalization capabilities are still open for improvement. One concept being exploited for this purpose is that of equivariance, again drawing inspiration from human beings. Humans are able to identify familiar objects despite modifications in location, size, viewpoint, lighting conditions and background (Bruce & Humphreys, 1994). In addition, we do not just recognize them, but are also able to describe in detail the type and amount of modification applied to them (von Helmholtz, 1868; Cassirer, 1944; Schmidt et al., 2016). Equivariance is strongly related to the idea of symmetry: since these modifications do not alter the essence of the underlying object, they should be treated (and learned) as a single concept.

1 Vrije Universiteit Amsterdam, 2 University of Amsterdam, The Netherlands. Correspondence to: David W. Romero.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. Meaningful relationships among object symmetries. Though every figure is composed of the same elements, only the outermost examples resemble faces. The relative positions, orientations and scales of the elements in the innermost examples do not match any meaningful face composition and hence should not be labelled as such. Built upon Fig. 1 from Schwarzer (2000).
Recently, several approaches have embraced these ideas to preserve symmetries, including translations (LeCun et al., 1989), planar rotations (Dieleman et al., 2016; Marcos et al., 2017; Worrall et al., 2017; Weiler et al., 2018b; Li et al., 2018; Cheng et al., 2018; Hoogeboom et al., 2018; Bekkers et al., 2018; Veeling et al., 2018; Lenssen et al., 2018; Smets et al., 2020), spherical rotations (Cohen et al., 2018; Worrall & Brostow, 2018; Weiler et al., 2018a; Thomas et al., 2018; Cohen et al., 2019b), scaling (Marcos et al., 2018; Worrall & Welling, 2019; Sosnovik et al., 2020) and general symmetry groups (Cohen & Welling, 2016a; Kondor & Trivedi, 2018; Weiler & Cesa, 2019; Cohen et al., 2019a; Bekkers, 2020; Romero & Hoogendoorn, 2020; Venkataraman et al., 2020). While group convolutional networks are able to learn powerful representations based on symmetry patterns, they lack any explicit means to learn meaningful relationships among them, e.g., relative positions, orientations and scales (Fig. 1).

In this paper, we draw inspiration from another promising development in the machine learning domain driven by neuroscience and psychology (e.g., Pashler (2016)), attention, to learn such relationships. The notion of attention is related to the idea that not all components of an input signal are per se equally relevant for a particular task. As a consequence, given a task and a particular input signal, task-relevant components of the input should be focused on during its analysis, while irrelevant, possibly misleading ones should be suppressed. Attention has been broadly applied to fields ranging from natural language processing (Bahdanau et al., 2014; Cheng et al., 2016; Vaswani et al., 2017) to visual understanding (Xu et al., 2015; Ilse et al., 2018; Park et al., 2018; Woo et al., 2018; Ramachandran et al., 2019; Diaconu & Worrall, 2019; Romero & Hoogendoorn, 2020) and graph analysis (Veličković et al., 2017; Zhang et al., 2020).

Specifically, we present attentive group convolutions, a generalization of the group convolution in which attention is applied during convolution to accentuate meaningful symmetry combinations and suppress non-plausible, possibly misleading ones. We indicate that prior work on visual attention can be described as special cases of our proposed framework and show empirically that our attentive group equivariant convolutional networks consistently outperform conventional group equivariant ones on rot-MNIST and CIFAR-10 for the SE(2) and E(2) groups. In addition, we provide means to interpret the learned concepts through the visualization of the predicted equivariant attention maps.

Contributions:
- We propose a general group-theoretical framework for equivariant visual attention, the attentive group convolution, and show that prior works on visual attention are special cases of our framework.
- We introduce a specific type of network, referred to as attentive group convolutional networks, as an instance of this theoretical framework.
- We show that our attentive group convolutional networks consistently outperform plain group equivariant ones.
- We provide means to interpret the learned concepts via visualization of the predicted equivariant attention maps.

2. Preliminaries

Before describing our approach, we first define crucial prior concepts: (group) convolutions and attention mechanisms.
2.1. Spatial Convolution and Translation Equivariance

Let f, ψ : ℝ^d → ℝ^{N_c} be a vector-valued signal and filter on ℝ^d, such that f = {f_c}_{c=1}^{N_c} and ψ = {ψ_c}_{c=1}^{N_c}. The spatial convolution (⋆_{ℝ^d}) is defined as:

$[f \star_{\mathbb{R}^d} \psi](y) = \sum_{c=1}^{N_c} \int_{\mathbb{R}^d} f_c(x)\, \psi_c(x - y)\, dx$    (1)

Intuitively, Eq. 1 resembles a collection of ℝ^d inner products between the input signal f and y-translated versions of ψ. Since the continuous integration in Eq. 1 is usually performed on signals and filters captured on a discrete grid ℤ^d, the integral on ℝ^d is reduced to a sum on ℤ^d. In our derivations, however, we stick to the continuous case so as to guarantee the validity of our theory for techniques defined on continuous spaces, e.g., steerable and Lie group convolutions (Cohen & Welling, 2016b; Worrall et al., 2017; Bekkers et al., 2018; Weiler et al., 2018b;a; Thomas et al., 2018; Weiler & Cesa, 2019; Bekkers, 2020; Sosnovik et al., 2020).

To study (and generalize) the properties of the convolution, we rewrite Eq. 1 using the translation operator L_y:

$[f \star_{\mathbb{R}^d} \psi](y) = \sum_{c=1}^{N_c} \int_{\mathbb{R}^d} f_c(x)\, L_y[\psi_c](x)\, dx$    (2)

where L_y[ψ_c](x) = ψ_c(x − y). Note that the translation operator L_y is indexed by an amount of translation y. Resultantly, we actually consider a set of operators {L_y}_{y ∈ ℝ^d} indexed by the set of all possible translations y ∈ ℝ^d. A fundamental property of the convolution is that it commutes with translations:

$L_y[f \star_{\mathbb{R}^d} \psi](x) = \big[L_y[f] \star_{\mathbb{R}^d} \psi\big](x), \quad \forall x, y \in \mathbb{R}^d.$    (3)

In other words, convolving a y-translated signal L_y[f] with a filter ψ is equivalent to first convolving the original signal f with the filter ψ, and y-translating the obtained response next. This property is referred to as translation equivariance and, in fact, convolution (and reparametrizations thereof) is the only linear translation equivariant mapping (Kondor & Trivedi, 2018; Cohen et al., 2019a; Bekkers, 2020).
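To make Eq. 3 concrete, the following minimal sketch (an illustration added here, not part of the paper's code) checks translation equivariance numerically for a single-channel signal on a periodic discrete grid, where circular convolution commutes exactly with circular shifts.

```python
import numpy as np

def circular_conv(f, psi):
    # Circular (periodic) convolution via FFT; with periodic boundary handling,
    # translation equivariance holds exactly on the discrete grid.
    return np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(psi)))

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 16))     # input signal on Z^2
psi = rng.standard_normal((16, 16))   # filter on Z^2
shift = (3, 5)                        # translation y

lhs = np.roll(circular_conv(f, psi), shift, axis=(0, 1))  # L_y[f * psi]
rhs = circular_conv(np.roll(f, shift, axis=(0, 1)), psi)  # L_y[f] * psi
print(np.allclose(lhs, rhs))  # True: convolution commutes with translations (Eq. 3)
```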
2.2. Group Convolution and Group Equivariance

The convolution operation can be extended to general transformations by utilizing a larger set of transformations {L_g}_{g ∈ G}, s.t. {L_y}_{y ∈ ℝ^d} ⊆ {L_g}_{g ∈ G}. However, in order to preserve equivariance, we must restrict the class of transformations allowed in {L_g}_{g ∈ G}. To formalize this intuition, we first present some important concepts from group theory.

2.2.1. PRELIMINARIES FROM GROUP THEORY

Groups. A group is a tuple (G, ·) consisting of a set G and a binary operation · : G × G → G, referred to as the group product, that satisfies the following axioms:
- Closure: For all h, g ∈ G, h · g ∈ G.
- Identity: There exists an e ∈ G such that e · g = g · e = g.
- Inverse: For all g ∈ G, there exists an element g⁻¹ ∈ G such that g · g⁻¹ = g⁻¹ · g = e.
- Associativity: For all g, h, k ∈ G, (g · h) · k = g · (h · k).

Group actions. Let G and X be a group and a set, respectively. The (left) group action of G on X is a function ⊙ : G × X → X that satisfies the following axioms:
- Identity: If e is the identity of G, then, for any x ∈ X, e ⊙ x = x.
- Compatibility: For all g, h ∈ G, x ∈ X, g ⊙ (h ⊙ x) = (g · h) ⊙ x.

In other words, the action of G on X describes how the elements x ∈ X are transformed by g ∈ G. For brevity, we omit the operations · and ⊙, and refer to the set G as a group, to elements g · h as gh and to actions (g ⊙ x) as gx.

Semi-direct product and affine groups. In practice, one is mainly interested in the analysis of data (and hence convolutions) defined on ℝ^d. Consequently, groups of the form G = ℝ^d ⋊ H, resulting from the semi-direct product (⋊) between the translation group (ℝ^d, +) and an arbitrary (Lie) group H that acts on ℝ^d (e.g., rotation, scaling, mirroring), are of main interest. This family of groups is referred to as affine groups, and their group product is defined as:

$g_1 g_2 = (x_1, h_1)(x_2, h_2) = (x_1 + h_1 x_2,\ h_1 h_2)$    (4)

where g_1 = (x_1, h_1), g_2 = (x_2, h_2) ∈ G, x_1, x_2 ∈ ℝ^d and h_1, h_2 ∈ H. Some important affine groups are the roto-translation (SE(d) = ℝ^d ⋊ SO(d)), the scale-translation (ℝ^d ⋊ ℝ^+) and the Euclidean (E(d) = ℝ^d ⋊ O(d)) groups.
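As a small illustration of the group product in Eq. 4 (added here for concreteness, not taken from the paper's code), the sketch below instantiates it for the discrete roto-translation group p4 = ℤ² ⋊ C4 of translations and rotations by multiples of 90°, together with the corresponding group inverse.

```python
import numpy as np

# Minimal sketch of the affine group product (Eq. 4) for p4 = Z^2 ⋊ C4.
def rot(k, x):
    """Action of h = r^k (rotation by k*90 degrees) on a point x in Z^2."""
    c, s = int(np.cos(k * np.pi / 2)), int(np.sin(k * np.pi / 2))
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def product(g1, g2):
    """(x1, k1)(x2, k2) = (x1 + h1·x2, k1 + k2 mod 4), as in Eq. 4."""
    (x1, k1), (x2, k2) = g1, g2
    hx2 = rot(k1, x2)
    return ((x1[0] + hx2[0], x1[1] + hx2[1]), (k1 + k2) % 4)

def inverse(g):
    """g^{-1} = (-h^{-1}·x, h^{-1})."""
    x, k = g
    return (rot(-k, (-x[0], -x[1])), (-k) % 4)

g = ((2, 1), 1)
print(product(g, inverse(g)))  # ((0, 0), 0): the identity element
```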
Group representations. Let G be a group and L_2(X) be a space of functions defined on some vector space X. The (left) regular group representation of G on functions f ∈ L_2(X) is a transformation L : G × L_2(X) → L_2(X), (g, f) ↦ L_g[f], that shares the group structure via:

$L_g L_h[f](x) = L_{gh}[f](x)$    (5)

$L_g[f](x) := f(g^{-1} x)$    (6)

for any g, h ∈ G, f ∈ L_2(X), x ∈ X. That is, concatenating two such transformations, parametrized by g and h, is equivalent to one transformation parametrized by gh ∈ G. Intuitively, the representation of G on a function f ∈ L_2(X) describes how the function as a whole, i.e., f(x), x ∈ X, is transformed by the effect of group elements g ∈ G. If the group G is affine, i.e., G = ℝ^d ⋊ H, the (left) group representation L_g can be split as:

$L_g[f](x) = L_y L_h[f](x)$    (7)

with g = (y, h) ∈ G, y ∈ ℝ^d and h ∈ H. This property is key for the efficient implementation of functions on groups.

2.2.2. THE GROUP CONVOLUTION

Let f, ψ : G → ℝ^{N_c} be a vector-valued signal and kernel on G. The group convolution (⋆_G) is defined as:

$[f \star_G \psi](g) = \sum_{c=1}^{N_c} \int_G f_c(\bar{g})\, \psi_c(g^{-1}\bar{g})\, d\bar{g}$    (8)

$= \sum_{c=1}^{N_c} \int_G f_c(\bar{g})\, L_g[\psi_c](\bar{g})\, d\bar{g}$    (9)

Differently from Eq. 2, the domain of the signal f, the filter ψ and the group convolution itself [f ⋆_G ψ] are now defined on the group G.¹ Intuitively, the group convolution resembles a collection of inner products between the input signal f and g-transformed versions of ψ. A key property of the group convolution is that it generalizes equivariance (Eq. 3) to arbitrary groups, i.e., it commutes with g-transformations:

$L_{\bar{g}}[f \star_G \psi](g) = \big[L_{\bar{g}}[f] \star_G \psi\big](g), \quad \forall g, \bar{g} \in G.$    (10)

In other words, group convolving a ḡ-transformed signal L_ḡ[f] with a filter ψ is equivalent to first convolving the original signal f with the filter ψ, and ḡ-transforming the obtained response next. This property is referred to as group equivariance and, just as for spatial convolutions, the group convolution (or reparametrizations thereof) is the only linear G-equivariant map (Kondor & Trivedi, 2018; Cohen et al., 2019a; Bekkers, 2020).

¹ Note that Eq. 2 matches Eq. 9 with the substitution G = ℝ^d. It follows that L_g[f](x) = f(g⁻¹x) = f(x − y), where g⁻¹ = −y is the inverse of g in the translation group (ℝ^d, +) for g = y.

Figure 2. Group convolution on the roto-translation group SE(2) for discrete rotations by 90 degrees (also called the p4 group). The p4 group is defined as H = {e, h, h², h³}, with h depicting a 90° rotation. The group convolution corresponds to |H| = 4 convolutions between the input f and h-transformations of the filter ψ, L_h[ψ], h ∈ H. Each of these convolutions is equal to the sum over group elements h̄ ∈ H and channels c ∈ [N_c] of the spatial channel-wise convolutions f_c ⋆_{ℝ²} L_h[ψ_c] between f and L_h[ψ].

Group convolution on affine groups. For affine groups, the group convolution (Eq. 9) can be decomposed, without modifying its properties, by taking advantage of the group structure and the representation decomposition (Eq. 7) as:

$[f \star_G \psi](g) = \sum_{c=1}^{N_c} \int_H \int_{\mathbb{R}^d} f_c(\bar{x}, \bar{h})\, L_g[\psi_c](\bar{x}, \bar{h})\, d\bar{x}\, d\bar{h}$    (11)

$= \sum_{c=1}^{N_c} \int_H \int_{\mathbb{R}^d} f_c(\bar{x}, \bar{h})\, L_x L_h[\psi_c](\bar{x}, \bar{h})\, d\bar{x}\, d\bar{h}$    (12)

where g = (x, h), ḡ = (x̄, h̄) ∈ G, x, x̄ ∈ ℝ^d and h, h̄ ∈ H. By doing so, the group convolution can be separated into |H| spatial convolutions of the input signal f for each h-transformed filter L_h[ψ] (Fig. 2):

$[f \star_G \psi](x, h) = \sum_{c=1}^{N_c} \int_H \big[ f_c \star_{\mathbb{R}^d} L_h[\psi_c] \big](x, \bar{h})\, d\bar{h}$    (13)

Resultantly, the computational cost of a group convolution is roughly equivalent to that of a spatial convolution with a filter bank of size N_c · |H| (Cohen & Welling, 2016a; Worrall & Welling, 2019; Cohen et al., 2019b).
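The decomposition in Eq. 13 can be illustrated with a short PyTorch sketch (an illustrative approximation under simplifying assumptions, not the authors' implementation): for a single-channel planar input lifted to the p4 group, the group convolution amounts to |H| = 4 spatial convolutions with rotated copies of the filter, and rotating the input permutes and rotates the resulting responses, as guaranteed by Eq. 10.

```python
import torch
import torch.nn.functional as F

# Sketch of a p4 lifting convolution (Eq. 13 with H = C4), single channel.
def p4_lifting_conv(f, psi):
    """f: (B, 1, H, W) planar input; psi: (1, 1, k, k) filter.
    Returns responses of shape (B, 1, 4, H', W'), indexed by rotation h."""
    responses = []
    for k in range(4):
        psi_h = torch.rot90(psi, k, dims=(-2, -1))  # L_h[psi], h = rotation by k*90°
        responses.append(F.conv2d(f, psi_h))        # f ⋆_{R^2} L_h[psi]
    return torch.stack(responses, dim=2)

f = torch.randn(1, 1, 9, 9)
psi = torch.randn(1, 1, 3, 3)
out = p4_lifting_conv(f, psi)
out_rot = p4_lifting_conv(torch.rot90(f, 1, dims=(-2, -1)), psi)
# Equivariance check: rotating the input rotates the spatial axes and cyclically
# shifts the rotation axis of the output (Eq. 10).
print(torch.allclose(torch.roll(torch.rot90(out, 1, dims=(-2, -1)), 1, dims=2),
                     out_rot, atol=1e-5))  # True
```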
2.3. Attention, Self-Attention and Visual Attention

Attention mechanisms find their roots in recurrent neural network (RNN) based machine translation. Let ϕ(·) be an arbitrary non-linear mapping (e.g., a neural network), y = {y_j}_{j=1}^{m} be a sequence of target vectors y_j, and x = {x_i}_{i=1}^{n} be a source sequence whose elements influence the prediction of each value y_j ∈ y. In early models (e.g., Kalchbrenner & Blunsom (2013); Cho et al. (2014)), features in the input sequence are aggregated into a context vector c = Σ_i ϕ(x_i), which is used to augment the hidden state in RNN layers. These models assume that source elements x_i contribute equally to every target element y_j and, hence, that the same context vector c can be utilized for all target positions y_j, which does not generally hold (Fig. 3). Bahdanau et al. (2014) proposed the inclusion of attention coefficients α = {α_{i,j}}, i ∈ [n], j ∈ [m], [n] = {1, ..., n}, with Σ_i α_{i,j} = 1, to modulate the contributions of the source elements x_i as a function of the current target element y_j by means of an adaptive context vector c_j = Σ_i α_{i,j} ϕ(x_i). Thereby, they obtained large improvements both in performance and interpretability.

Figure 3. English to French translation. Brighter depicts stronger influence. Note how relevant parts of the input sentence are highlighted as a function of the current output word during translation. Taken from Bahdanau et al. (2014).

Recently, attention has been extended to several other machine learning tasks (e.g., Vaswani et al. (2017); Veličković et al. (2017); Park et al. (2018)). The main development behind these extensions was self-attention (Cheng et al., 2016), where, in contrast to conventional attention, the target and source sequences are equal, i.e., x = y. Consequently, the attention coefficients α_{i,j} encode correlations among input element pairs (x_i, x_j). For vision tasks, self-attention has been proposed to encode visual co-occurrences in data (Hu et al., 2018; Wang et al., 2018; Park et al., 2018; Woo et al., 2018; Cao et al., 2019; Bello et al., 2019; Ramachandran et al., 2019; Romero & Hoogendoorn, 2020). Unfortunately, its application on visual and, in general, on high-dimensional data is non-trivial.

2.3.1. VISUAL ATTENTION

In the context of visual attention, consider a feature map f : X → ℝ^{N_c} to be the source sequence.² Self-attention then imposes the learning of a total of n² = |X|² attention vectors α_{i,j} ∈ ℝ^{N_c}, which rapidly becomes unfeasible with increasing feature map size. Interestingly, Cao et al. (2019) and Zhu et al. (2019) empirically demonstrated that, for visual data, the attention coefficients {α_{i,j}} are approximately invariant to changes in the target position x_j. Consequently, they proposed to approximate the attention coefficients {α_{i,j}} ∈ ℝ^{|X|² × N_c} by a single vector {α_i} ∈ ℝ^{|X| × N_c} which is independent of the target position x_j. Despite this significant reduction in complexity, the dimensionality of {α_i} is still very large and further simplifications are mandatory.

² In the machine translation context we can think of f as a sequence x = {f(x_i)}_{i=1}^{n}, with n = |X| the number of elements.

To this end, existing works (Hu et al., 2018; Woo et al., 2018) replace the input f with a much smaller vector of input statistics s that summarizes relevant information from f. For instance, the SE-Net (Hu et al., 2018) utilizes global average pooling to produce a vector of channel statistics of f, s^C ∈ ℝ^{N_c}, s^C_c = (1/|ℝ^d|) ∫_{ℝ^d} f_c(x) dx, which is subsequently passed to a small fully-connected network ϕ^C(·) to compute channel attention coefficients α^C = {α^C_c}_{c=1}^{N_c} = ϕ^C(s^C). These attention coefficients are then utilized to modulate the corresponding input channels f_c. Complementary to channel attention akin to that of the SE-Net, Park et al. (2018) utilize a similar strategy for spatial attention. Specifically, they utilize channel average pooling to generate a vector of spatial statistics of f, s^X : ℝ^d → ℝ, s^X(x) = (1/N_c) Σ_{c=1}^{N_c} f_c(x), which is subsequently passed to a small convolutional network ϕ^X(·) to compute spatial attention coefficients α^X = {α^X(x)}_{x ∈ ℝ²} = ϕ^X(s^X). These attention coefficients are then utilized to modulate the corresponding spatial input positions f(x). Recent works include extra statistical information, e.g., max responses (Woo et al., 2018), or replace pooling by convolutions (Cao et al., 2019).
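The channel attention mechanism described above can be summarized in a minimal sketch (assumed hyperparameters and naming; not the SE-Net authors' code): global average pooling produces the channel statistics s^C, a small bottleneck network ϕ^C maps them to coefficients α^C ∈ [0, 1]^{N_c}, and these rescale the channels of f.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze (global average pooling),
    excite (two-layer bottleneck MLP with sigmoid), then modulate channels."""
    def __init__(self, n_channels, reduction=16):
        super().__init__()
        self.phi_c = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction), nn.ReLU(),
            nn.Linear(n_channels // reduction, n_channels), nn.Sigmoid(),
        )

    def forward(self, f):               # f: (B, N_c, H, W)
        s_c = f.mean(dim=(-2, -1))      # channel statistics s^C: (B, N_c)
        alpha_c = self.phi_c(s_c)       # attention coefficients alpha^C: (B, N_c)
        return f * alpha_c[..., None, None]  # modulate the channels of f

f = torch.randn(2, 64, 8, 8)
print(ChannelAttention(64)(f).shape)  # torch.Size([2, 64, 8, 8])
```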
3. Attentive Group Equivariant Convolution

In this section, we propose our generalization of visual self-attention and discuss its properties and relations to prior work. Let f, ψ : G → ℝ^{N_c} be a vector-valued signal and kernel on G, and let α : G × G → [0, 1]^{N_c} be an attention map that takes target and source elements g, ḡ ∈ G, respectively, as input. We define the attentive group convolution (⋆^α_G) as:

$[f \star^{\alpha}_G \psi](g) = \sum_{c=1}^{N_c} \int_G \alpha_c(g, \bar{g})\, f_c(\bar{g})\, L_g[\psi_c](\bar{g})\, d\bar{g}$    (14)

with α = A[f] computed by some attention operator A. As such, the attentive group convolution modulates the contributions of group elements ḡ ∈ G at different channels c ∈ [N_c] during pooling.³ The properties and conditions on A are summarized in Thm. 1. An extensive motivation as well as its proof are provided in the supplementary material.

³ Note that Eq. 14 is equal to Eq. 9 up to a multiplicative factor α_c(g, ḡ) ≤ 1 if α_c(g, ḡ) is constant for every g, ḡ ∈ G, c ∈ [N_c].

Theorem 1. The attentive group convolution is an equivariant operator if and only if the attention operator A satisfies:

$\forall g, \bar{g}, \tilde{g} \in G : \quad A[L_{\tilde{g}} f](g, \bar{g}) = A[f](\tilde{g}^{-1} g,\ \tilde{g}^{-1} \bar{g})$    (15)

If, moreover, the maps generated by A are invariant to one of their arguments, and hence exclusively attend to either the input or the output domain (Sec. 3.4), then A satisfies Eq. 15 iff it is equivariant and, thus, based on group convolutions.

3.1. Tying Together Equivariance and Visual Attention

Interestingly, and perhaps in some cases unaware of it, all of the visual attention approaches outlined in Section 2.3.1, as well as all of those we are aware of (Xu et al., 2015; Hu et al., 2018; Park et al., 2018; Woo et al., 2018; Wang et al., 2018; Ilse et al., 2018; Hu et al., 2019; Ramachandran et al., 2019; Cao et al., 2019; Chen et al., 2019; Bello et al., 2019; Lin et al., 2019; Diaconu & Worrall, 2019; Romero & Hoogendoorn, 2020), exclusively utilize translation (or group) equivariance preserving maps for the generation of the attention coefficients and, hence, altogether constitute group equivariant networks, by which they satisfy Thm. 1. As will be explained in the following sections, all these works resemble special cases of Eq. 14 obtained by substituting G with the corresponding group and modifying the specifications of how α is calculated (Sec. 3.2 - 3.4).

3.1.1. TRANSLATION EQUIVARIANT VISUAL ATTENTION

Since convolutions as well as popular pooling operations are translation equivariant, the visual attention approaches outlined in Sec. 2.3.1 are translation equivariant as well.⁴ One particular case worth emphasising is that of SE-Nets. Here, a fully-connected network ϕ^C, seemingly a non-translation-equivariant map, is used to generate the channel attention coefficients α^C. However, ϕ^C is in fact translation equivariant: recall that ϕ^C receives s^C as input, a signal obtained via global average pooling (a convolution-like operation). Resultantly, s^C can be interpreted as an ℝ^{N_c × 1 × 1} tensor and, hence, applying a fully-connected layer to s^C equals a pointwise convolution between s^C and a filter ψ_fully ∈ ℝ^{N_o × N_c × 1 × 1} with N_o output channels.⁵

⁴ In fact, conventional pooling operations (e.g., max, average) can be written as combinations of convolutions and pointwise non-linearities, which are translation equivariant as well.

⁵ This resembles a depth-wise separable convolution (Chollet, 2017) with the first convolution given by global average pooling.

3.1.2. GROUP EQUIVARIANT VISUAL ATTENTION

To the best of our knowledge, the only work that provides a group theoretical approach towards visual attention is that of Romero & Hoogendoorn (2020). Here, the authors consider affine groups G with elements g = (x, h), x ∈ ℝ^d, h ∈ H, and cyclic permutation groups H. Consequently, they utilize a cyclic permutation equivariant map, ϕ^H(·), to generate attention coefficients α^H(h), h ∈ H, with which the corresponding elements h are modulated. As a result, their proposed attention strategy is H-equivariant. To preserve translation equivariance, and hence G-equivariance, ϕ^H is re-utilized at every spatial position x ∈ ℝ^d. This is equivalent to combining ϕ^H with a pointwise filter on ℝ^d. Romero & Hoogendoorn (2020) found that equivariance to cyclic groups H can only be achieved by constraining ϕ^H to have a circulant structure. This is equivalent to a convolution with a filter ψ whose group representations L_h induce cyclic permutations of itself (Fig. 4) and, hence, resembles a group convolution, by which Thm. 1 is satisfied.

Figure 4. Same colors depict equal weights. The first column of A^C corresponds to ψ and the following ones to L_h[ψ], obtained via cyclic permutations. See how {L_h[ψ]}_{h ∈ H} resembles a circulant matrix. Taken from Romero & Hoogendoorn (2020).

The work of Romero & Hoogendoorn (2020) exclusively performs attention on the h component of the group elements g = (x, h) ∈ G and is only defined for (block) cyclic groups. Consequently, it does not consider spatial relationships during attention (Fig. 1) and is not applicable to general groups. Conversely, our proposed framework allows for simultaneous attention on both components of the group elements g = (x, h) in a G-equivariance preserving manner.
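The circulant-structure argument above can be verified with a small sketch (an added illustration under the stated assumption |H| = 4, not the referenced authors' code): a linear map whose weight matrix is circulant commutes with cyclic shifts of its input, i.e., it acts as a group convolution over the cyclic group H.

```python
import numpy as np

def circulant(w):
    """Circulant matrix whose columns are cyclically shifted copies of w."""
    return np.stack([np.roll(w, k) for k in range(len(w))], axis=1)

w = np.random.randn(4)          # filter psi on H = C4
W = circulant(w)                # circulant weight matrix: phi^H(x) = W @ x
x = np.random.randn(4)          # signal on H

lhs = W @ np.roll(x, 1)         # shift the input, then apply phi^H
rhs = np.roll(W @ x, 1)         # apply phi^H, then shift the output
print(np.allclose(lhs, rhs))    # True: phi^H is H-equivariant
```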
3.2. Efficient Group Equivariant Attention Maps

Attentive group convolutions impose the generation of an additional attention map α : G × G → [0, 1]^{N_c}, which is computationally demanding. To reduce this computational burden, we exploit the fact that visual data is defined on ℝ^d and, hence, relevant groups are affine, to provide an efficient factorization of the attention map α. In Sec. 2.3.1 we indicated that attention coefficients α can be equivariantly factorized into spatial and channel components. We build upon this idea and factorize attention via:

$\alpha_c(g, \bar{g}) := \alpha^X\big((x, h), (\bar{x}, \bar{h})\big)\, \alpha^C_c(h, \bar{h})$

where α^X attends to spatial relations without considering channel characteristics, and α^C attends to patterns along the channel and H axes but ignores spatial patterns. We thus factorize α into a spatial attention map α^X : G × G → [0, 1] and a channel attention map α^C : H × H → [0, 1]^{N_c}.

Findings in the literature have shown that, for visual data, attention maps are almost equivalent for different query positions and, thus, only query-independent dependencies are learnt (Cao et al., 2019; Zhu et al., 2019). Based on this observation, we further simplify α^X to be invariant over spatial positions either at the input or the output space. Since separate convolutional filters ψ could possibly benefit from different attention maps, we omit spatial positions in the input space (see Sec. 3.2.1 for details). In other words, we replace α^X(g, ḡ) with α^X(g, h̄), a spatial-position-invariant attention map over the input space: α^X : G × H → [0, 1]. Conveniently, attention coefficients of type α : ℝ^d × H → [0, 1]^{N_c} can be interpreted as functions on ℝ^d with pointwise visualizations x̄ ↦ α(x̄, h̄) for each h̄ ∈ H. Resultantly, we are able to aid the interpretability of the learned concepts and of the attended symmetries (e.g., Figs. 7, 8, 11).

3.2.1. THE ATTENTION OPERATOR A

Recall that the attention map α is computed via an attention operator A. In the most general case, α and, hence, A is a function of both the input signal f and the filter ψ. In order to define A as such, we generalize the approach of Woo et al. (2018) such that: (1) equivariance to general symmetry groups is preserved and (2) the attention maps depend on the filter ψ as well. Let φ^C : f̃ ↦ s^C = {s^C_avg, s^C_max}, s^C_i : H × H → ℝ^{N_c}, and φ^X : f̃ ↦ s^X = {s^X_avg, s^X_max}, s^X_i : G × G → ℝ, be functions that generate channel (s^C) and spatial (s^X) statistics, respectively, from an intermediary vector-valued signal f̃ : G × G → ℝ^{N_c} containing information from both the input and output spaces. Analogously to Woo et al. (2018), we compute spatial and channel statistics to reduce the dimensionality of the input. However, in contrast to them, we compute these statistics from intermediary convolutional maps f̃ rather than from the input signal f directly.⁶ As a result, we take the influence of the filter ψ into account during the computation of the attention maps.

⁶ This is why the statistics s^C_i, s^X_i receive tuples (h, h̄), (g, h̄), respectively, as input, as opposed to the single-argument inputs that often appear in prior works on visual attention.

Figure 5. Attentive group convolution on the roto-translation group SE(2). In contrast to group convolutions (Fig. 2, Eq. 13), attentive group convolutions utilize channel (α^C) and spatial (α^X) attention to modulate the intermediary convolutional responses [f ⋆_{ℝ²} L_h[ψ]] before pooling over the c and h̄ axes.
Following the simplifications proposed in Sec. 3.2 for α^X, we can further reduce s^X_i and f̃ to functions of the form s^X_i : G × H → ℝ and f̃ : G × H → ℝ^{N_c}, respectively. Consequently, we define:

$\tilde{f} = \{\tilde{f}_c\}_{c=1}^{N_c}, \qquad \tilde{f}_c(x, h, \bar{h}) := \big[ f_c \star_{\mathbb{R}^d} L_h[\psi_c] \big](x, \bar{h}),$    (16)

which is the intermediary result of the convolution between the input f and the h-transformation of the filter ψ, L_h[ψ], before pooling over c and h̄ (Fig. 5, Eq. 13).

Channel Attention. Let ϕ^C : s^C ↦ α^C be a function that generates a channel attention map α^C : H × H → [0, 1]^{N_c} from a vector of channel statistics s^C : H × H → ℝ^{N_c} of the intermediate representation f̃. Our channel attention computation is analogous to that of Woo et al. (2018), based on two fully-connected layers. However, in our case, each linear layer is parametrized by a matrix-valued kernel W_i : H → ℝ^{N_out × N_in}, which we shift via left-regular representations L_h[W_i](h̄) = W_i(h⁻¹h̄) in order to guarantee equivariance (Thm. 1):

$\alpha^C(h, \bar{h}) = \phi^C(s^C)(h, \bar{h})$    (17)
$= \sigma\big( W_2(h^{-1}\bar{h})\, [W_1(h^{-1}\bar{h})\, s^C_{avg}(h, \bar{h})]_+ \; + \; W_2(h^{-1}\bar{h})\, [W_1(h^{-1}\bar{h})\, s^C_{max}(h, \bar{h})]_+ \big)$

with [·]_+ the ReLU function, σ the sigmoid function, r a reduction ratio, and W_1 : H → ℝ^{(N_c/r) × N_c}, W_2 : H → ℝ^{N_c × (N_c/r)} filters defined on H.

Spatial Attention. Let ϕ^X : s^X ↦ α^X be a function that generates a spatial attention map α^X : G × H → [0, 1] from statistics s^X : G × H → ℝ², in which, per input h̄ ∈ H and output g ∈ G, the mean and max values are taken over the channel axis. Similarly to Woo et al. (2018), the spatial attention α^X is then defined as:

$\alpha^X(x, h, \bar{h}) = \phi^X(s^X)(x, h, \bar{h}) = \sigma\big( \big[ s^X \star_{\mathbb{R}^d} L_h[\psi^X] \big](x, \bar{h}) \big)$    (18)

with ψ^X : G → ℝ² a group convolutional filter.

Full Attention. Woo et al. (2018) carried out extensive experiments to find the best performing configuration to combine channel and spatial attention maps for the ℝ^d case, e.g., in parallel, serially starting with channel attention, or serially starting with spatial attention. Based on their results we adopt their best performing configuration, i.e., serially starting with channel attention, for the G case (Fig. 6). Recall that f̃ is the intermediary result of the convolution between the input f and the h-transformation of the filter ψ before pooling over c and h̄. We perform attention on top of f̃ (Fig. 6), where α^C and α^X are computed by Eqs. 17 and 18, respectively. Resultantly, the attentive group convolution is computed as:

$[f \star^{\alpha}_G \psi](x, h) = \sum_{c=1}^{N_c} \int_H \alpha^X(x, h, \bar{h})\, \alpha^C_c(h, \bar{h})\, \tilde{f}_c(x, h, \bar{h})\, d\bar{h}$    (19)
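The pooling step of Eq. 19 can be summarized in a short sketch (shapes and names are illustrative assumptions, not the authors' code), assuming the intermediary responses f̃ and the attention maps have already been computed for a single output channel.

```python
import torch

def attentive_pooling(f_tilde, alpha_c, alpha_x):
    """Eq. 19 as a tensor contraction over the channel (c) and input rotation
    (h_bar) axes, for illustrative tensor shapes:
      f_tilde: (B, N_c, |H|, |H_bar|, H, W)  intermediary responses f~_c(x, h, h_bar)
      alpha_c: (B, N_c, |H|, |H_bar|)        channel attention  alpha^C_c(h, h_bar)
      alpha_x: (B, 1,  |H|, |H_bar|, H, W)   spatial attention  alpha^X(x, h, h_bar)"""
    modulated = alpha_x * alpha_c[..., None, None] * f_tilde
    return modulated.sum(dim=(1, 3))  # pool over c and h_bar -> (B, |H|, H, W)

B, n_c, n_h, H, W = 2, 4, 4, 8, 8
f_tilde = torch.randn(B, n_c, n_h, n_h, H, W)
alpha_c = torch.rand(B, n_c, n_h, n_h)
alpha_x = torch.rand(B, 1, n_h, n_h, H, W)
print(attentive_pooling(f_tilde, alpha_c, alpha_x).shape)  # torch.Size([2, 4, 8, 8])
```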
3.3. The Residual Attention Branch

Based on the findings of He et al. (2016), several visual attention approaches propose to utilize residual blocks with direct connections during the course of attention to facilitate gradient flow (Hu et al., 2018; Park et al., 2018; Woo et al., 2018; Wang et al., 2018; Cao et al., 2019). However, these approaches calculate the final attention map α⁺ as the sum of the direct connection 1 and the attention map α obtained from the attention branch, i.e., α⁺ = 1 + α. Consequently, the obtained attention map α⁺ : ℝ² → [1, 2]^{N_c} is restricted to the interval [1, 2] and the network loses its ability to suppress input components. Inspired by the aforementioned works, we propose to calculate attention in what we call a residual attention branch (Fig. 6). Specifically, we utilize the attention branch to calculate a residual attention map defined as α⁻ = (1 − α), α⁻ : G × G → [0, 1]. Next, we subtract the residual attention map α⁻ from the direct connection 1 to obtain the resultant attention map α⁺, i.e., α⁺ = 1 − α⁻. As a result, we are able to produce attention maps α⁺ that span the [0, 1] interval while preserving the benefits of the direct connections of He et al. (2016).

Figure 6. Sequential channel and spatial attention performed on a residual attention branch (Sec. 3.3).

3.4. The Attentive Group Convolution as a Sequence of Group Convolutions and Pointwise Non-linearities

CNNs are usually organized in layers and, hence, the input f is usually convolved in parallel with a set of N_o filters {ψ_o}_{o=1}^{N_o}. As outlined in the previous section, this implies that the attention maps can change as a function of the current filter ψ_o. One assumption broadly utilized in visual attention is that these maps do not depend on the filters {ψ_o}_{o=1}^{N_o} and, hence, that α is a sole function of the input signal f (Hu et al., 2018; Park et al., 2018; Woo et al., 2018; Diaconu & Worrall, 2019; Romero & Hoogendoorn, 2020). Consequently, the attention coefficients α are reduced from a function α : G × G → [0, 1]^{N_c} (cf. Eq. 14) to a function α : G → [0, 1]^{N_c}. In other words, attention becomes dependent only on ḡ (see Eqs. 17-19) and, thus, the generation of the attention maps α^C, α^X can be shifted to the input feature map f. Resultantly, the attentive group convolution is reduced to a sequence of conventional group convolutions and pointwise non-linearities (Thm. 1), which further reduces the computational cost of attention:

$[f \star^{\alpha}_G \psi] = [(\alpha^X \alpha^C f) \star_G \psi]$    (20)
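A minimal sketch of the simplification in Eq. 20 follows (shapes and names are illustrative assumptions, not the authors' implementation): the channel and spatial attention maps modulate the input feature map directly, after which an ordinary group convolution is applied; the group convolution itself is left abstract here.

```python
import torch

def input_attentive_gconv(f, alpha_c, alpha_x, group_conv):
    """Eq. 20: attention is applied to the input, then a group convolution follows.
      f:       (B, N_c, |H|, H, W)  feature map on G = R^2 x H
      alpha_c: (B, N_c, |H|, 1, 1)  channel attention alpha^C
      alpha_x: (B, 1,  |H|, H, W)   spatial attention alpha^X
      group_conv: any G-equivariant convolution module."""
    return group_conv(alpha_x * alpha_c * f)

# Usage with a stand-in for the group convolution:
f = torch.randn(2, 8, 4, 16, 16)
alpha_c = torch.rand(2, 8, 4, 1, 1)
alpha_x = torch.rand(2, 1, 4, 16, 16)
identity_gconv = lambda z: z  # placeholder for an actual group convolution
print(input_attentive_gconv(f, alpha_c, alpha_x, identity_gconv).shape)
```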
4. Experiments

We validate our approach by exploring the effects of using attentive group convolutions in contrast to conventional ones. We compare the conventional group equivariant p4- and p4m-CNNs of Cohen & Welling (2016a) on the rotated MNIST and CIFAR-10 datasets with their corresponding attentive counterparts, α-p4-CNNs and α-p4m-CNNs, respectively, and the p4- and p4m-DenseNets of Veeling et al. (2018) on the PCam dataset with their corresponding attentive counterparts, α-p4- and α-p4m-DenseNets, respectively. Additionally, we explore the effects of only applying channel attention (e.g., αCH-p4-CNNs), only spatial attention (e.g., αSP-p4-CNNs), and applying attention directly on the input (e.g., αF-p4-CNNs).⁷

⁷ Our code is publicly available at: https://github.com/dwromero/att_gconvs

We notice that the network architectures in Cohen & Welling (2016a) and Romero & Hoogendoorn (2020) used for the CIFAR-10 experiments are only approximately equivariant. This results from using odd-sized convolutional kernels with stride 2 on even-sized feature maps (see Appx. C for a complete discussion). Since this effect distorts the equivariance property of our equivariant attention maps, i.e., they also become only approximately equivariant (Figs. 10, 11), this issue must be fixed. We achieve this by replacing strided convolutions in such regimes with conventional convolutions followed by a max-pooling layer. For all our experiments we replicate as closely as possible the training and evaluation strategies of the corresponding baselines, replace approximately equivariant networks with exactly equivariant ones, and initialize any additional parameters in the same way as the corresponding baseline. Extended implementation details are provided in Appx. B.

4.1. rot-MNIST

The rotated MNIST dataset (Larochelle et al., 2007) contains 62k gray-scale 28x28 handwritten digits uniformly rotated on [0, 2π). The dataset is split into training, validation and test sets of 10k, 2k and 50k images, respectively. We compare p4-CNNs with all the corresponding attention variants previously mentioned. For our attention models, we utilize a filter size of 7 and a reduction ratio r of 2 on the attention branch. Since attentive group convolutions impose the learning of additional parameters, we also instantiate bigger p4-CNNs by increasing the number of channels uniformly at every layer to roughly match the number of parameters of the attentive versions. Furthermore, we compare our results with comparable attentive versions as defined in Romero & Hoogendoorn (2020) (αRH), which perform attention exclusively over the axis of rotations. Our results show that (1) attentive versions consistently outperform non-attentive ones, and that (2) performing attention over the entire group is beneficial in terms of classification accuracy (Tab. 1).

Table 1. Test error rates on rot-MNIST (with standard deviation under 5 random seed variations).
NETWORK          TEST ERROR (%)    PARAM.
p4-CNN           2.048 ± 0.045     24.61K
αRH-p4-CNN       1.980 ± 0.032     24.85K
BIG19-p4-CNN     1.796 ± 0.035     77.54K
α-p4-CNN         1.696 ± 0.021     73.13K
BIG15-p4-CNN     1.848 ± 0.019     50.42K
αCH-p4-CNN       1.825 ± 0.048     48.63K
αSP-p4-CNN       1.761 ± 0.027     49.11K
BIG11-p4-CNN     1.996 ± 0.083     29.05K
αF-p4-CNN        1.795 ± 0.028     29.46K

4.2. CIFAR-10

The CIFAR-10 dataset (Krizhevsky et al., 2009) consists of 60k real-world 32x32 RGB images uniformly drawn from 10 classes. The dataset is split into training, validation and test sets of 40k, 10k and 10k images, respectively. We compare the p4 and p4m versions of the All-CNN (Springenberg et al., 2014) and the ResNet44 (He et al., 2016) in Cohen & Welling (2016a) with attentive variations. For all our attention models, we utilize a filter size of 7 and a reduction ratio r of 16 on the attention branch. Unfortunately, attentive group convolutions impose an unfeasible increment on the memory requirements for this dataset.⁸ Resultantly, we are only able to compare the αF variations of the corresponding networks. Our results show that attentive αF networks consistently outperform non-attentive ones (Tab. 2). Moreover, we demonstrate that our proposed networks focus on relevant parts of the input and that the predicted attention maps behave equivariantly for group symmetries (Figs. 7, 11).

⁸ The α-p4 All-CNN requires approx. 72GB of CUDA memory, as opposed to 5GB for the p4-All-CNN. This is due to the storage of the intermediary convolution responses required for the calculation of the attention weights (Eqs. 17-19).

Table 2. Test error rates on CIFAR10 and augmented CIFAR10+.
NETWORK     TYPE      CIFAR10    CIFAR10+    PARAM.
All-CNN     p4        9.32       8.91        1.37M
            αF-p4     8.80       7.05        1.40M
            p4m       7.61       7.48        1.22M
            αF-p4m    6.93       6.53        1.25M
ResNet44    p4m       15.72      15.40       2.62M
            αF-p4m    10.82      10.12       2.70M

Figure 7. Equivariant attention maps on the roto-translation group SE(2). The predicted attention maps behave equivariantly for group symmetries. The arrows depict the strength of the filter responses at the corresponding orientations throughout the network.
4.3. PCam

The Patch Camelyon dataset (Veeling et al., 2018) consists of 327k 96x96 RGB image patches of tumorous/non-tumorous breast tissue extracted from the Camelyon16 dataset (Bejnordi et al., 2017), where each patch was labelled as tumorous if the central region (32x32) contained at least one tumour pixel, as given by the original annotation in Bejnordi et al. (2017). We compare the p4 and p4m versions of the DenseNet (Huang et al., 2017) in Veeling et al. (2018) with attentive variants. For all our attention models, we utilize a filter size of 7 and a reduction ratio r of 16 on the attention branch. Similarly to the CIFAR-10 case, we restrict our experiments to αF attentive networks due to computational constraints. Our results show that attentive αF networks consistently outperform non-attentive ones (Tab. 3). Interestingly, the αF-p4-DenseNet is already able to outperform the p4m-DenseNet without attention. Surprisingly, our equivariant attention maps reveal that the network learns to focus on the nuclei of the cells and to remove background elements during inference, all of this in a group equivariant way (Fig. 8).

Table 3. Test error rates on PCam.
TYPE       TEST ERROR (%)    PARAM.
Z2         15.93             130.60K
p4         12.45             129.65K
αF-p4      11.34             140.45K
p4m        11.64             124.21K
αF-p4m     10.88             141.22K

Figure 8. Equivariant attention maps on the PCam dataset. The predicted attention maps behave equivariantly for group symmetries. Additionally, the network seems to learn to focus on the nuclei of the cells and remove background elements during training.

5. Discussion and Future Work

Our results show that attentive group convolutions can be utilized as a drop-in replacement for standard and group equivariant convolutions that simultaneously facilitates the interpretability of the network decisions. Similarly to convolutional and group convolutional networks, attentive group convolutional networks also benefit from data augmentation. Interestingly, however, we also see that including additional symmetries reduces the effect of augmentations given by group elements. This finding supports the intuition that symmetry variants of the same concept are learned independently by non-equivariant networks (see Fig. 2 in (Krizhevsky et al., 2012)).

The main shortcoming of our approach is its computational burden. As a result, the application of α-networks is computationally unfeasible for networks with several layers or channels. We believe, however, by extrapolation of our results on rot-MNIST, that further performance improvements are to be expected for α variations, should hardware requirements suffice.

Group convolutional networks have recently proven very successful in medical imaging applications (Bekkers et al., 2018; Winkels & Cohen, 2018; Lafarge et al., 2020). Since explainability plays a crucial role here, we believe that our attentive maps could be of high relevance to aid the explainability of the network decisions. Moreover, since our attention maps are guaranteed to be equivariant to transformations in the considered group, it is ensured that the predicted attention maps will be consistent across group symmetries. We believe this to be of crucial importance for rotation invariant tasks. Illustratively, in contrast to vanilla attentive CNNs, a malignant tissue sample is ensured to generate consistent attention maps regardless of the orientation at which it is provided to the network.
In future work, we want to explore ways to reduce the computational cost of full attention networks. If successful, we consider it feasible to obtain a direct performance boost over our CIFAR-10 and PCam experimental results without extensive additional memory requirements. Furthermore, we want to extend our work to symmetry groups defined in 3D. By doing so, we expect the range of possible applications of our work to extend to several other important domains, such as 3D medical imaging, e.g., CT scans and other voxel-based representations.

6. Conclusion

We introduced attentive group convolutions, a generalization of the group convolution in which attention is utilized to explicitly highlight meaningful relationships among symmetries. We provided a general mathematical framework for group equivariant visual attention and indicated that prior work on visual attention can be described as special cases of the attentive group convolution. Our experimental results indicate that attentive group convolutional networks consistently outperform conventional group convolutional ones and additionally provide equivariant attention maps that behave predictably under symmetries of the group, with which learned concepts can be visualized.

Acknowledgements

We gratefully acknowledge our anonymous reviewers for their helpful and valuable comments, and Hyunjik Kim for valuable remarks to improve the readability of our paper. This work is part of the Efficient Deep Learning (EDL) programme (grant number P16-25), partly funded by the Dutch Research Council (NWO) and Semiotic Labs, and the research programme VENI (grant number 17290), financed by the Dutch Research Council (NWO). This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

References

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bejnordi, B. E., Veta, M., Van Diest, P. J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J. A., Hermsen, M., Manson, Q. F., Balkenhol, M., et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 318(22):2199-2210, 2017.

Bekkers, E. J. B-Spline CNNs on Lie groups. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gBhkBFDH.

Bekkers, E. J., Lafarge, M. W., Veta, M., Eppenhof, K. A., Pluim, J. P., and Duits, R. Roto-translation covariant convolutional networks for medical image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 440-448. Springer, 2018.

Bello, I., Zoph, B., Vaswani, A., Shlens, J., and Le, Q. V. Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925, 2019.

Biederman, I. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115, 1987.

Blake, R. and Lee, S.-H. The role of temporal structure in human vision. Behavioral and Cognitive Neuroscience Reviews, 4(1):21-42, 2005.

Bruce, V. and Humphreys, G. W. Recognizing objects and faces. Visual Cognition, 1(2-3):141-180, 1994.

Cao, Y., Xu, J., Lin, S., Wei, F., and Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.

Cassirer, E. The concept of group and the theory of perception. Philosophy and Phenomenological Research, 5(1):1-36, 1944.
Chen, Y., Rohrbach, M., Yan, Z., Shuicheng, Y., Feng, J., and Kalantidis, Y. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 433-442, 2019.

Cheng, J., Dong, L., and Lapata, M. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.

Cheng, X., Qiu, Q., Calderbank, R., and Sapiro, G. RotDCF: Decomposition of convolutional filters for rotation-equivariant deep networks. arXiv preprint arXiv:1805.06846, 2018.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251-1258, 2017.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. CoRR, abs/1801.10130, 2018. URL http://arxiv.org/abs/1801.10130.

Cohen, T. S., Geiger, M., and Weiler, M. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142-9153, 2019a.

Cohen, T. S., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019b.

Delahunt, C. B. and Kutz, J. N. Insect cyborgs: Bio-mimetic feature generators improve ML accuracy on limited data. 2019.

Diaconu, N. and Worrall, D. E. Affine self convolution. arXiv preprint arXiv:1911.07704, 2019.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hoogeboom, E., Peters, J. W., Cohen, T. S., and Welling, M. HexaConv. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1vuQG-CW.

Hu, H., Zhang, Z., Xie, Z., and Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3464-3473, 2019.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Ilse, M., Tomczak, J. M., and Welling, M. Attention-based deep multiple instance learning. ICML, 2018.

Kalchbrenner, N. and Blunsom, P. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700-1709, 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Lafarge, M. W., Bekkers, E. J., Pluim, J. P. W., Duits, R., and Veta, M. Roto-translation equivariant convolutional networks: Application to histopathology image analysis. arXiv preprint arXiv:2002.08725, 2020.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pp. 473-480. ACM, 2007.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

Lenssen, J. E., Fey, M., and Libuschewski, P. Group equivariant capsule networks. In Advances in Neural Information Processing Systems, pp. 8844-8853, 2018.

Li, J., Yang, Z., Liu, H., and Cai, D. Deep rotation equivariant network. Neurocomputing, 290:26-33, 2018.

Lin, X., Ma, L., Liu, W., and Chang, S.-F. Context-gated convolution, 2019.

Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048-5057, 2017.

Marcos, D., Kellenberger, B., Lobry, S., and Tuia, D. Scale equivariance in CNNs with vector fields. arXiv preprint arXiv:1807.11783, 2018.

Park, J., Woo, S., Lee, J.-Y., and Kweon, I. S. BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.

Pashler, H. Attention. Psychology Press, 2016.

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.

Romero, D. W. and Hoogendoorn, M. Co-attentive equivariant neural networks: Focusing equivariance on transformations co-occurring in data. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1g6ogrtDr.

Schmidt, F., Spröte, P., and Fleming, R. W. Perception of shape and space across rigid transformations. Vision Research, 126:318-329, 2016.

Schwarzer, G. Development of face processing: The effect of face inversion. Child Development, 71(2):391-401, 2000.

Smets, B., Portegies, J., Bekkers, E., and Duits, R. PDE-based group equivariant convolutional neural networks. arXiv preprint arXiv:2001.09046, 2020.

Sosnovik, I., Szmaja, M., and Smeulders, A. Scale-equivariant steerable networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJgpugrKPS.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L., Kohlhoff, K., and Riley, P. Tensor Field Networks: Rotation- and Translation-Equivariant Neural Networks for 3D Point Clouds. arXiv preprint arXiv:1802.08219, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
Veeling, B. S., Linmans, J., Winkens, J., Cohen, T., and Welling, M. Rotation equivariant CNNs for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 210-218. Springer, 2018.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Venkataraman, S. R., Balasubramanian, S., and Sarma, R. R. Building deep equivariant capsule networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgNJgSFPS.

von Helmholtz, H. Über die Tatsachen, die der Geometrie zugrunde liegen. 1868.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, 2018.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems, pp. 14334-14345, 2019.

Weiler, M., Geiger, M., Welling, M., Boomsma, W., and Cohen, T. S. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pp. 10381-10392, 2018a.

Weiler, M., Hamprecht, F. A., and Storath, M. Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849-858, 2018b.

Wertheimer, M. Gestalt theory. 1938.

Winkels, M. and Cohen, T. S. 3D G-CNNs for pulmonary nodule detection. arXiv preprint arXiv:1804.04656, 2018.

Woo, S., Park, J., Lee, J.-Y., and So Kweon, I. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.

Worrall, D. and Brostow, G. CubeNet: Equivariance to 3D rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 567-584, 2018.

Worrall, D. E. and Welling, M. Deep scale-spaces: Equivariance over scale. arXiv preprint arXiv:1905.11697, 2019.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028-5037, 2017.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048-2057, 2015.

Zhang, R., Zou, Y., and Ma, J. Hyper-SAGNN: a self-attention based graph neural network for hypergraphs. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryeHuJBtPH.

Zhaoping, L. The V1 hypothesis: creating a bottom-up saliency map for preattentive selection and segmentation, pp. 189-314. 2014. ISBN 9780199564668. doi: 10.1093/acprof:oso/9780199564668.003.0005.

Zhu, X., Cheng, D., Zhang, Z., Lin, S., and Dai, J. An empirical study of spatial attention mechanisms in deep networks. arXiv preprint arXiv:1904.05873, 2019.