# GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION FOR VISION

Published as a conference paper at ICLR 2021

David W. Romero (Vrije Universiteit Amsterdam, d.w.romeroguzman@vu.nl) and Jean-Baptiste Cordonnier (École Polytechnique Fédérale de Lausanne (EPFL), jean-baptiste.cordonnier@epfl.ch)

## ABSTRACT

We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.

## 1 INTRODUCTION

Recent advances in Natural Language Processing have been largely attributed to the rise of the Transformer (Vaswani et al., 2017). Its key difference with previous methods, e.g., recurrent neural networks and convolutional neural networks (CNNs), is its ability to query information from all the input words simultaneously. This is achieved via the self-attention operation (Bahdanau et al., 2015; Cheng et al., 2016), which computes the similarity between representations of words in the sequence in the form of attention scores. Next, the representation of each word is updated based on the words with the highest attention scores. Inspired by the capacity of transformers to learn meaningful inter-word dependencies, researchers have started applying self-attention to vision tasks. It was first adopted into CNNs via channel-wise attention (Hu et al., 2018) and non-local spatial modeling (Wang et al., 2018). More recently, it has been proposed to replace CNNs with self-attention networks either partially (Bello et al., 2019) or entirely (Ramachandran et al., 2019). Contrary to discrete convolutional kernels, weights in self-attention are not tied to particular positions (Fig. A.1), yet self-attention layers are able to express any convolutional layer (Cordonnier et al., 2020). This flexibility allows leveraging long-range dependencies under a fixed parameter budget.

An arguably orthogonal advancement to deep learning architectures is the incorporation of symmetries into the model itself. Translation equivariance, for instance, is key to the success of CNNs: it describes the property that if a pattern is translated, its numerical descriptors are also translated, but not modified. The seminal work of Cohen & Welling (2016) provides a recipe to extend the translation equivariance of CNNs to other symmetry groups in order to improve generalization and sample efficiency further (see Section 2).

In this work, we introduce group self-attention, a self-attention formulation that grants equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings invariant to the action of the group considered. In addition to the generalization and sample-efficiency improvements provided by group equivariance, group equivariant self-attention networks (GSA-Nets) bring important benefits over group convolutional architectures: (i) Parameter efficiency: contrary to conventional discrete group convolutional kernels, where weights are tied to particular positions of neighborhoods on the group, group equivariant self-attention leverages long-range dependencies on group functions under a fixed parameter budget, yet it is able to express any group convolutional kernel.
This allows for very expressive networks with a low parameter count. (ii) Steerability: since the group acts directly on the positional encoding, GSA-Nets are steerable (Weiler et al., 2018b) by nature. This allows us to go beyond group discretizations that live in the grid without introducing interpolation artifacts.

Contributions:
- We provide an extensive analysis of the equivariance properties of self-attention (Section 4).
- We provide a general formulation to impose group equivariance to self-attention (Section 5).
- We provide instances of self-attentive architectures equivariant to several symmetry groups (Section 6).
- Our results demonstrate consistent improvements of GSA-Nets over non-equivariant ones (Section 6).

Figure 1: Behavior of feature representations in group self-attention networks. An input rotation induces a rotation plus a cyclic permutation of the intermediary feature representations of the network. Additional examples for all the groups used in this work, as well as their usage, are provided in repo/demo/.

## 2 RELATED WORK

Several approaches exist which provide equivariance to various symmetry groups. The translation equivariance of CNNs has been extended to additional symmetries ranging from planar rotations (Dieleman et al., 2016; Marcos et al., 2017; Worrall et al., 2017; Weiler et al., 2018b; Li et al., 2018; Cheng et al., 2018; Hoogeboom et al., 2018; Bekkers et al., 2018; Veeling et al., 2018; Lenssen et al., 2018; Graham et al., 2020) to spherical rotations (Cohen et al., 2018; 2019b; Worrall & Brostow, 2018; Weiler et al., 2018a; Esteves et al., 2019a;b; 2020), scaling (Marcos et al., 2018; Worrall & Welling, 2019; Sosnovik et al., 2020; Romero et al., 2020b) and more general symmetry groups (Cohen & Welling, 2016; Kondor & Trivedi, 2018; Tai et al., 2019; Weiler & Cesa, 2019; Cohen et al., 2019a; Bekkers, 2020; Venkataraman et al., 2020). Importantly, all these approaches utilize discrete convolutional kernels, and thus tie weights to particular positions in the neighborhood on which the kernels are defined. As group neighborhoods are (much) larger than conventional ones, the number of weights discrete group convolutional kernels require increases proportionally. This phenomenon is further exacerbated by attentive group equivariant networks (Romero & Hoogendoorn, 2019; Diaconu & Worrall, 2019; Romero et al., 2020a). Since attention is used to leverage non-local information to aid local operations, non-local neighborhoods are required. However, as attention branches often rely on discrete convolutions, they effectively tie specific weights to particular positions on a large non-local neighborhood on the group. As a result, attention is bound to growth of the model size, and thus to negative statistical efficiency. In contrast, group self-attention is able to attend over arbitrarily large group neighborhoods under a fixed parameter budget. In addition, group self-attention is steerable by nature (Section 5.1), a property primarily exhibited by works carefully designed to that end.

Another way to detach weights from particular positions is to parameterize convolutional kernels as (constrained) neural networks (Thomas et al., 2018; Finzi et al., 2020). Introduced to handle irregularly-sampled data, e.g., point clouds, networks parameterizing convolutional kernels receive relative positions as input and output the kernel values at those positions. In contrast, our mappings change as a function of the input content.
Most relevant to our work are the SE(3)-Transformer and the LieTransformer (Fuchs et al., 2020; Hutchinson et al., 2020). However, while we obtain group equivariance via a generalization of positional encodings, Hutchinson et al. (2020) do so via operations on the Lie algebra, and Fuchs et al. (2020) via irreducible representations. In addition, our work prioritizes applications on visual data and extensively analyzes theoretical aspects and properties of group equivariant self-attention.

## 3 STAND-ALONE SELF-ATTENTION

In this section, we recall the mathematical formulation of self-attention and emphasize the role of the positional encoding. Next, we introduce a functional formulation of self-attention which will allow us to analyze and generalize its equivariance properties.

Definition. Let $X \in \mathbb{R}^{N \times C_{in}}$ be an input matrix consisting of $N$ tokens of $C_{in}$ dimensions each.[^1] A self-attention layer maps an input matrix $X \in \mathbb{R}^{N \times C_{in}}$ to an output matrix $Y \in \mathbb{R}^{N \times C_{out}}$ as:

$$Y = \mathrm{SA}(X) = \mathrm{softmax}(A)\, X W_{val}, \quad (1)$$

with $W_{val} \in \mathbb{R}^{C_{in} \times C_h}$ the value matrix, $A \in \mathbb{R}^{N \times N}$ the attention scores matrix, the softmax taken row-wise, and $\mathrm{softmax}(A)$ the attention probabilities. The matrix $A$ is computed as:

$$A = X W_{qry} (X W_{key})^\top, \quad (2)$$

parameterized by query and key matrices $W_{qry}, W_{key} \in \mathbb{R}^{C_{in} \times C_h}$. In practice, it has been found beneficial to apply multiple self-attention operations, also called heads, in parallel, such that different heads are able to attend to different parts of the input. In this multi-head self-attention formulation, the outputs of $H$ heads of output dimension $C_h$ are concatenated and projected to $C_{out}$ as:

$$\mathrm{MHSA}(X) = \mathrm{concat}_{h \in [H]}\big[\mathrm{SA}^{(h)}(X)\big]\, W_{out} + b_{out}, \quad (3)$$

with a projection matrix $W_{out} \in \mathbb{R}^{H C_h \times C_{out}}$ and a bias term $b_{out} \in \mathbb{R}^{C_{out}}$.

[^1]: We consequently consider an image as a set of $N$ discrete objects $i \in \{1, 2, \dots, N\}$.

### 3.1 THE ROLE OF THE POSITIONAL ENCODING

Note that the self-attention operation defined in Eq. 3 is equivariant to permutations of the input rows of $X$. That is, a permutation of the rows of $X$ produces the same output $Y$ up to this permutation. Hence, self-attention is blind to the order of its inputs, i.e., it is a set operation. Illustratively, an input image is processed as a bag of pixels and the structural content is not considered. To alleviate this limitation, the input representations in self-attention are often enriched with a positional encoding that provides positional information about the set elements.

Absolute positional encoding. Vaswani et al. (2017) introduced a (learnable) positional encoding $P \in \mathbb{R}^{N \times C_{in}}$ for each input position which is added to the inputs when computing the attention scores:

$$A = (X + P)\, W_{qry} \big((X + P)\, W_{key}\big)^\top. \quad (4)$$

More generally, $P$ can be substituted by any function that returns a vector representation of the position and can be incorporated by means of addition or concatenation, e.g., Zhao et al. (2020). This positional encoding injects additional structural information about the tokens into the model, which makes it susceptible to changes in the tokens' positions. Unfortunately, the model must learn to recognize similar patterns at every position independently, as absolute positional encodings are unique to each position. This undesired data inefficiency is addressed by relative positional encodings.
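To make Eqs. 1-4 concrete, the following is a minimal PyTorch sketch of multi-head self-attention with a learnable absolute positional encoding. It is illustrative only and not the implementation used in the paper (see https://github.com/dwromero/g_selfatt); the class name and parameters such as `dim_head` and `num_heads` are our own.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention with a learnable absolute positional
    encoding, following Eqs. 1-4 (a sketch, not the paper's code)."""

    def __init__(self, num_tokens, dim_in, dim_head, num_heads, dim_out):
        super().__init__()
        self.num_heads, self.dim_head = num_heads, dim_head
        # Absolute positional encoding P in R^{N x C_in} (Eq. 4).
        self.pos = nn.Parameter(0.02 * torch.randn(num_tokens, dim_in))
        # Query, key and value matrices W_qry, W_key, W_val for all heads at once.
        self.to_qry = nn.Linear(dim_in, num_heads * dim_head, bias=False)
        self.to_key = nn.Linear(dim_in, num_heads * dim_head, bias=False)
        self.to_val = nn.Linear(dim_in, num_heads * dim_head, bias=False)
        # Output projection W_out and bias b_out (Eq. 3).
        self.to_out = nn.Linear(num_heads * dim_head, dim_out)

    def forward(self, x):                        # x: [batch, N, C_in]
        b, n, _ = x.shape
        x_pos = x + self.pos                     # positions enter queries/keys only (Eq. 4)
        q = self.to_qry(x_pos).view(b, n, self.num_heads, self.dim_head)
        k = self.to_key(x_pos).view(b, n, self.num_heads, self.dim_head)
        v = self.to_val(x).view(b, n, self.num_heads, self.dim_head)
        # Attention scores A = (X+P) W_qry ((X+P) W_key)^T, per head (Eqs. 2, 4).
        scores = torch.einsum("bihd,bjhd->bhij", q, k)
        probs = scores.softmax(dim=-1)           # row-wise softmax over keys j
        out = torch.einsum("bhij,bjhd->bihd", probs, v)     # Eq. 1, per head
        return self.to_out(out.reshape(b, n, -1))           # concat heads + W_out (Eq. 3)


# Example: a 28x28 image treated as a set of N = 784 tokens with C_in = 16 channels.
mhsa = MultiHeadSelfAttention(num_tokens=784, dim_in=16, dim_head=8, num_heads=9, dim_out=32)
y = mhsa(torch.randn(2, 784, 16))                # -> [2, 784, 32]
```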
Relative positional encoding. Introduced by Shaw et al. (2018), relative encodings consider the relative distance between the query token $i$ (the token we compute the representation of) and the key token $j$ (the token we attend to). The calculation of the attention scores (Eq. 2) then becomes:

$$A^{rel}_{i,j} = X_{i,:}\, W_{qry} \big((X_{j,:} + P_{x(j) - x(i)})\, W_{key}\big)^\top, \quad (5)$$

where $P_{x(j)-x(i)} \in \mathbb{R}^{1 \times C_{in}}$ is a vector representation of the relative shift and $x(i)$ is the position of token $i$ as defined in Section 3.2. Consequently, similar patterns can be recognized at arbitrary positions, as relative query-key distances always remain equal.

### 3.2 A FUNCTIONAL FORMULATION TO SELF-ATTENTION

Notation. We denote by $[n]$ the set $\{1, 2, \dots, n\}$. Given a set $S$ and a vector space $\mathcal{V}$, $L_{\mathcal{V}}(S)$ denotes the space of functions $\{f : S \to \mathcal{V}\}$. Square brackets are used when functions are arguments.

Let $S = \{i\}_{i=1}^N$ be a set of $N$ elements. A matrix $X \in \mathbb{R}^{N \times C_{in}}$ can be interpreted as a vector-valued function $f : S \to \mathbb{R}^{C_{in}}$ that maps set elements $i \in S$ to $C_{in}$-dimensional vectors: $f : i \mapsto f(i)$. Consequently, a matrix multiplication $X W_y^\top$ of matrices $X \in \mathbb{R}^{N \times C_{in}}$ and $W_y \in \mathbb{R}^{C_{out} \times C_{in}}$ can be represented as a function $\phi_y : L_{\mathcal{V}_{C_{in}}}(S) \to L_{\mathcal{V}_{C_{out}}}(S)$, $\phi_y : f(i) \mapsto \phi_y(f(i))$, parameterized by $W_y$, between the functional spaces $L_{\mathcal{V}_{C_{in}}}(S) = \{f : S \to \mathbb{R}^{C_{in}}\}$ and $L_{\mathcal{V}_{C_{out}}}(S) = \{f : S \to \mathbb{R}^{C_{out}}\}$. Following this notation, we can represent the position-less attention score calculation (Eq. 2) as:

$$A_{i,j} = \alpha[f](i,j) = \langle \phi_{qry}(f(i)), \phi_{key}(f(j)) \rangle. \quad (6)$$

The function $\alpha[f] : S \times S \to \mathbb{R}$ maps pairs of set elements $i, j \in S$ to the attention score of $j$ relative to $i$. Therefore, self-attention (Eq. 1) can be written as:

$$Y_{i,:} = \zeta[f](i) = \sum_{j \in S} \sigma_j\big(\alpha[f](i,j)\big)\, \phi_{val}(f(j)) = \sum_{j \in S} \sigma_j\big(\langle \phi_{qry}(f(i)), \phi_{key}(f(j)) \rangle\big)\, \phi_{val}(f(j)), \quad (7)$$

where $\sigma_j = \mathrm{softmax}_j$ and $\zeta[f] : S \to \mathbb{R}^{C_h}$. Finally, multi-head self-attention (Eq. 3) can be written as:

$$\mathrm{MHSA}(X)_{i,:} = m[f](i) = \phi_{out}\Big( \bigoplus_{h \in [H]} \zeta^{(h)}[f](i) \Big) = \phi_{out}\Big( \bigoplus_{h \in [H]} \sum_{j \in S} \sigma_j\big(\langle \phi^{(h)}_{qry}(f(i)), \phi^{(h)}_{key}(f(j)) \rangle\big)\, \phi^{(h)}_{val}(f(j)) \Big), \quad (8)$$

where $\bigoplus$ is the functional equivalent of the concatenation operator $\mathrm{concat}$, and $m[f] : S \to \mathbb{R}^{C_{out}}$.

Local self-attention. Recall that $\alpha[f]$ assigns an attention score to every other set element $j \in S$ relative to the query element $i$. The computational cost of self-attention is often reduced by restricting its calculation to a local neighborhood $\mathcal{N}(i)$ around the query token $i$, analogous in nature to the local receptive field of CNNs (Fig. A.1a). Consequently, local self-attention can be written as:

$$m[f](i) = \phi_{out}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big(\langle \phi^{(h)}_{qry}(f(i)), \phi^{(h)}_{key}(f(j)) \rangle\big)\, \phi^{(h)}_{val}(f(j)) \Big). \quad (9)$$

Note that Eq. 9 is equivalent to Eq. 8 for $\mathcal{N}(i) = S$, i.e., when considering global neighborhoods.

Absolute positional encoding. The absolute positional encoding is a function $\rho : S \to \mathbb{R}^{C_{in}}$ that maps set elements $i \in S$ to a vector representation of their position: $\rho : i \mapsto \rho(i)$. Note that this encoding does not depend on functions defined on the set but only on the set itself.[^2] Consequently, absolute position-aware self-attention (Eq. 4) can be written as:

$$m[f, \rho](i) = \phi_{out}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big(\langle \phi^{(h)}_{qry}(f(i) + \rho(i)), \phi^{(h)}_{key}(f(j) + \rho(j)) \rangle\big)\, \phi^{(h)}_{val}(f(j)) \Big). \quad (10)$$

The function $\rho$ can be decomposed into two functions, $\rho_P \circ x$: (i) the position function $x : S \to \mathcal{X}$, which provides the position of set elements in the underlying space $\mathcal{X}$ (e.g., pixel positions), and (ii) the positional encoding $\rho_P : \mathcal{X} \to \mathbb{R}^{C_{in}}$, which provides vector representations of elements in $\mathcal{X}$. This distinction will be of utmost importance when we pinpoint where exactly (group) equivariance must be imposed on the self-attention operation (Sections 4.3, 5).

[^2]: Illustratively, one can think of this as a function returning a vector representation of pixel positions in a grid. Regardless of any transformation performed to the image, the labeling of the grid itself remains exactly equal.

Relative positional encoding. Here, positional information is provided in a relative manner. That is, we now provide vector representations of relative positions $\rho(i,j) = \rho_P(x(j) - x(i))$ among pairs $(i,j)$, $i \in S$, $j \in \mathcal{N}(i)$. Consequently, relative position-aware self-attention (Eq. 5) can be written as:

$$m_r[f, \rho](i) = \phi_{out}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big(\langle \phi^{(h)}_{qry}(f(i)), \phi^{(h)}_{key}(f(j) + \rho(i,j)) \rangle\big)\, \phi^{(h)}_{val}(f(j)) \Big). \quad (11)$$
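As an illustration of Eqs. 5 and 11, the sketch below implements single-head self-attention with a relative positional encoding on a 1D sequence, using a global neighborhood for simplicity; a local neighborhood would simply restrict (e.g., mask) the sum over $j$ to $\mathcal{N}(i)$. The class and parameter names are ours, not the paper's.

```python
import torch
import torch.nn as nn


class RelativeSelfAttention1D(nn.Module):
    """Single-head self-attention with a relative positional encoding on a
    1D sequence (Eqs. 5 and 11 with a global neighborhood). A sketch only."""

    def __init__(self, num_tokens, dim_in, dim_head):
        super().__init__()
        self.num_tokens = num_tokens
        self.to_qry = nn.Linear(dim_in, dim_head, bias=False)
        self.to_key = nn.Linear(dim_in, dim_head, bias=False)
        self.to_val = nn.Linear(dim_in, dim_head, bias=False)
        # One encoding rho_P(x(j) - x(i)) per relative shift in [-(N-1), N-1].
        self.rel_pos = nn.Parameter(0.02 * torch.randn(2 * num_tokens - 1, dim_in))

    def forward(self, x):                                   # x: [batch, N, C_in]
        b, n, _ = x.shape
        q, k, v = self.to_qry(x), self.to_key(x), self.to_val(x)
        # Table of relative shifts j - i, mapped to indices into rel_pos.
        idx = torch.arange(n, device=x.device)
        rel = idx[None, :] - idx[:, None] + (self.num_tokens - 1)   # [N, N]
        p = self.to_key(self.rel_pos)[rel]                  # W_key rho(i, j): [N, N, d]
        # A_ij = <q_i, k_j> + <q_i, W_key rho(i, j)>   (Eq. 5, expanded)
        scores = torch.einsum("bid,bjd->bij", q, k) + torch.einsum("bid,ijd->bij", q, p)
        probs = scores.softmax(dim=-1)
        return torch.einsum("bij,bjd->bid", probs, v)       # Eq. 11 with N(i) = S


sa = RelativeSelfAttention1D(num_tokens=64, dim_in=16, dim_head=8)
print(sa(torch.randn(2, 64, 16)).shape)                     # torch.Size([2, 64, 8])
```

Because only relative shifts index the encoding table, the same pattern produces the same attention scores at every position, which is exactly the data-efficiency argument made above.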
## 4 EQUIVARIANCE ANALYSIS OF SELF-ATTENTION

In this section we analyze the equivariance properties of self-attention. Since the analysis relies largely on group theory, we provide all concepts required for a proper understanding in Appx. C.

### 4.1 GROUP EQUIVARIANCE AND EQUIVARIANCE FOR FUNCTIONS DEFINED ON SETS

First we provide the general definition of group equivariance and then refine it to the relevant groups. Additionally, we define the property of unique equivariance to restrict equivariance to a given group.

Definition 4.1 (Group equivariance). Let $G$ be a group (Def. C.1), $S, S'$ be sets, $\mathcal{V}, \mathcal{V}'$ be vector spaces, and $\mathcal{L}_g[\cdot], \mathcal{L}'_g[\cdot]$ be the induced (left regular) representations (Def. C.4) of $G$ on $L_{\mathcal{V}}(S)$ and $L_{\mathcal{V}'}(S')$, respectively. We say that a map $\phi : L_{\mathcal{V}}(S) \to L_{\mathcal{V}'}(S')$ is equivariant to the action of $G$ (or $G$-equivariant) if it commutes with the action of $G$. That is, if:

$$\phi[\mathcal{L}_g[f]] = \mathcal{L}'_g[\phi[f]], \quad \forall f \in L_{\mathcal{V}}(S),\ \forall g \in G.$$

Example 4.1.1 (Permutation equivariance). Let $S = S' = \{i\}_{i=1}^N$ be a set of $N$ elements, and let $G = \mathrm{S}_N$ be the group of permutations on sets of $N$ elements. A map $\phi : L_{\mathcal{V}}(S) \to L_{\mathcal{V}'}(S)$ is said to be equivariant to the action of $\mathrm{S}_N$ (or permutation equivariant) if:

$$\phi[\mathcal{L}_\pi[f]](i) = \mathcal{L}'_\pi[\phi[f]](i), \quad \forall f \in L_{\mathcal{V}}(S),\ \forall \pi \in \mathrm{S}_N,\ \forall i \in S,$$

where $\mathcal{L}_\pi[f](i) = f(\pi^{-1}(i))$ and $\pi : S \to S$ is a bijection from the set to itself. The element $\pi(i)$ indicates the index to which the $i$-th element of the set is moved as an effect of the permutation $\pi$. In other words, $\phi$ is said to be permutation equivariant if it commutes with permutations $\pi \in \mathrm{S}_N$, that is, if permutations of its argument produce equivalent permutations of its response.

Several of the transformations of interest, e.g., rotations and translations, are not defined on sets. Luckily, as we consider sets gathered from homogeneous spaces $\mathcal{X}$ where these transformations are well-defined, e.g., $\mathbb{R}^2$ for pixels, there exists an injective map $x : S \to \mathcal{X}$, the position function, that associates a position in $\mathcal{X}$ to each set element. In Appx. D we show that the action of $G$ on such a set is well-defined and induces a group representation on functions defined on it. With this in place, we are now able to define equivariance of set functions to groups whose actions are defined on homogeneous spaces.

Definition 4.2 (Equivariance of set functions to groups acting on homogeneous spaces). Let $G$ be a group acting on two homogeneous spaces $\mathcal{X}$ and $\mathcal{X}'$, let $S, S'$ be sets and $\mathcal{V}, \mathcal{V}'$ be vector spaces. Let $x : S \to \mathcal{X}$ and $x' : S' \to \mathcal{X}'$ be injective maps. We say that a map $\phi : L_{\mathcal{V}}(S) \to L_{\mathcal{V}'}(S')$ is equivariant to the action of $G$ (or $G$-equivariant) if it commutes with the action of $G$. That is, if:

$$\phi[\mathcal{L}_g[f]] = \mathcal{L}'_g[\phi[f]], \quad \forall f \in L_{\mathcal{V}}(S),\ \forall g \in G,$$

where $\mathcal{L}_g[f](i) = f(x^{-1}(g^{-1}x(i)))$ and $\mathcal{L}'_g[f](i) = f(x'^{-1}(g^{-1}x'(i)))$ are the induced (left regular) representations of $G$ on $L_{\mathcal{V}}(S)$ and $L_{\mathcal{V}'}(S')$, respectively. In other words, $\phi$ is said to be $G$-equivariant if a transformation $g \in G$ of its argument produces a corresponding transformation of its response.

Example 4.2.1 (Translation equivariance). Let $S, S'$ be sets and let $x : S \to \mathcal{X}$ and $x' : S' \to \mathcal{X}'$ be injective maps from the sets $S, S'$ to the corresponding homogeneous spaces $\mathcal{X}, \mathcal{X}'$ on which they are defined, e.g., $\mathbb{R}^d$ and $G$.
With $(\mathcal{X}, +)$ the translation group acting on $\mathcal{X}$, we say that a map $\phi : L_{\mathcal{V}}(S) \to L_{\mathcal{V}'}(S')$ is equivariant to the action of $(\mathcal{X}, +)$ (or translation equivariant) if:

$$\phi[\mathcal{L}_y[f]](i) = \mathcal{L}'_y[\phi[f]](i), \quad \forall f \in L_{\mathcal{V}}(S),\ \forall y \in \mathcal{X},$$

with $\mathcal{L}_y[f](i) = f(x^{-1}(x(i) - y))$ and $\mathcal{L}'_y[f](i) = f(x'^{-1}(x'(i) - y))$. In other words, $\phi$ is said to be translation equivariant if a translation of its argument produces a corresponding translation of its response.

### 4.2 EQUIVARIANCE PROPERTIES OF SELF-ATTENTION

In this section we analyze the equivariance properties of self-attention. The proofs of all propositions stated in the main text are provided in Appx. G.

Proposition 4.1. The global self-attention formulation without positional encoding (Eqs. 3, 8) is permutation equivariant. That is, it holds that: $m[\mathcal{L}_\pi[f]](i) = \mathcal{L}_\pi[m[f]](i)$.

Note that permutation equivariance only holds for global self-attention. The local variant proposed in Eq. 9 reduces permutation equivariance to the smaller set of permutations under which neighborhoods are conserved, i.e., $\tilde{\mathrm{S}}_N = \{\pi \in \mathrm{S}_N \mid j \in \mathcal{N}(i) \Rightarrow \pi(j) \in \mathcal{N}(i),\ \forall i \in S\}$.

Permutation equivariance induces equivariance to important (sub)groups. Consider the cyclic group of order 4, $\mathbb{Z}_4 = \{e, r, r^2, r^3\}$, which induces planar rotations by 90°.[^3] As every rotation in $\mathbb{Z}_4$ effectively induces a permutation of the token positions, it can be shown that $\mathbb{Z}_4$ is a subgroup of $\mathrm{S}_N$, i.e., $\mathbb{Z}_4 \leq \mathrm{S}_N$. Consequently, maps equivariant to permutations are automatically equivariant to $\mathbb{Z}_4$. However, as the permutation equivariance constraint is harder than that of $\mathbb{Z}_4$-equivariance, imposing $\mathbb{Z}_4$-equivariance as a by-product of permutation equivariance is undesirable in terms of expressivity. Consequently, Ravanbakhsh et al. (2017) introduced the concept of unique $G$-equivariance to describe the family of functions equivariant to $G$ but not equivariant to other groups $G' \supset G$:

[^3]: $e$ represents a 0° rotation, i.e., the identity. The remaining elements $r^j$ represent rotations by $(90 \cdot j)$°.

Definition 4.3 (Unique $G$-equivariance). Let $G$ be a subgroup of $G'$, $G \leq G'$ (Def. C.2). We say that a map $\phi$ is uniquely $G$-equivariant iff it is $G$-equivariant but not $G'$-equivariant for any $G' \supset G$.

In the following sections, we show that we can enforce unique equivariance not only to subgroups of $\mathrm{S}_N$, e.g., $\mathbb{Z}_4$, but also to other interesting groups not contained in $\mathrm{S}_N$, e.g., groups of rotations finer than 90 degrees. This is achieved by enriching set functions with a proper positional encoding.

Proposition 4.2. Absolute position-aware self-attention (Eqs. 4, 10) is neither permutation nor translation equivariant, i.e., $m[\mathcal{L}_\pi[f], \rho](i) \neq \mathcal{L}_\pi[m[f, \rho]](i)$ and $m[\mathcal{L}_y[f], \rho](i) \neq \mathcal{L}_y[m[f, \rho]](i)$.

Though absolute positional encodings do disrupt permutation equivariance, they are unable to provide translation equivariance. We show next that translation equivariance is obtained via relative encodings.

Proposition 4.3. Relative position-aware self-attention (Eq. 11) is translation equivariant. That is, it holds that: $m_r[\mathcal{L}_y[f], \rho](i) = \mathcal{L}_y[m_r[f, \rho]](i)$.
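Prop. 4.3 can also be checked empirically. The helper below compares $m[\mathcal{L}_y[f]]$ with $\mathcal{L}_y[m[f]]$ using cyclic shifts as stand-ins for translations; it is a diagnostic sketch of ours (not from the paper), and exact equality should only be expected for modules without boundary effects, such as the circularly padded convolution used in the example.

```python
import torch


def check_translation_equivariance(module, x, shift=(3, -2), tol=1e-4):
    """Empirically check translation equivariance: translating the input
    should translate the output. Cyclic shifts model translations, so the
    check is exact only for modules free of boundary effects.

    x: input feature map of shape [batch, channels, height, width].
    """
    dims = (-2, -1)
    y = module(x)
    y_of_shifted_input = module(torch.roll(x, shifts=shift, dims=dims))  # m[L_y f]
    shifted_y = torch.roll(y, shifts=shift, dims=dims)                   # L_y m[f]
    err = (y_of_shifted_input - shifted_y).abs().max().item()
    return err < tol, err


# Example with a translation equivariant baseline: a circularly padded convolution.
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular")
ok, err = check_translation_equivariance(conv, torch.randn(2, 3, 32, 32))
print(ok, err)
```

The same routine can be pointed at a relative position-aware self-attention layer to probe Prop. 4.3, keeping in mind that zero padding or non-periodic relative encodings introduce small boundary discrepancies.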
### 4.3 WHERE EXACTLY IS EQUIVARIANCE IMPOSED IN SELF-ATTENTION?

In the previous section we have seen two examples of successfully imposing group equivariance on self-attention. Specifically, we saw that no positional encoding allows for permutation equivariance and that a relative positional encoding allows for translation equivariance. For the latter, as shown in the proof of Prop. 4.3 (Appx. G), this comes from the fact that, for all shifts $y \in \mathcal{X}$,

$$\rho\big(x^{-1}(x(i) + y),\ x^{-1}(x(j) + y)\big) = \rho_P\big(x(j) + y - (x(i) + y)\big) = \rho_P\big(x(j) - x(i)\big) = \rho(i,j). \quad (12)$$

That is, from the fact that the relative positional encoding is invariant to the action of the translation group, i.e., $\mathcal{L}_y[\rho](i,j) = \rho(i,j)$, $\forall y \in \mathcal{X}$. Similarly, the absence of a positional encoding (more precisely, the use of a constant positional encoding) is what allows for permutation equivariance (Prop. 4.1, Appx. G). Specifically, constant positional encodings $\rho_c(i) = c$, $\forall i \in S$, are invariant to the action of the permutation group, i.e., $\mathcal{L}_\pi[\rho_c](i) = \rho_c(i)$, $\forall \pi \in \mathrm{S}_N$. From these observations, we conclude that $G$-equivariance is obtained by providing positional encodings which are invariant to the action of the group $G$, i.e., such that $\mathcal{L}_g[\rho] = \rho$, $\forall g \in G$. Furthermore, unique $G$-equivariance is obtained by providing positional encodings which are invariant to the action of $G$ but not invariant to the action of any other group $G' \supset G$. This is a key insight that allows us to provide (unique) equivariance to arbitrary symmetry groups, which we do next.

## 5 GROUP EQUIVARIANT STAND-ALONE SELF-ATTENTION

In Section 4.3 we concluded that unique $G$-equivariance is induced in self-attention by introducing positional encodings which are invariant to the action of $G$ but not invariant to the action of other groups $G' \supset G$. However, this constraint does not provide any information about the expressivity of the mapping we have just made $G$-equivariant. Let us first illustrate why this is important. Consider the case of imposing rotation and translation equivariance on an encoding defined in $\mathbb{R}^2$. Since translation equivariance is desired, a relative positional encoding is required. For rotation equivariance, we must further impose the positional encoding to be equal for all rotations. That is, $\mathcal{L}_\theta[\rho](i,j) \overset{!}{=} \rho(i,j)$, $\forall \theta \in [0, 2\pi]$, where $\mathcal{L}_\theta[\rho](i,j) = \rho_P(\theta^{-1}x(j) - \theta^{-1}x(i))$, and $\theta^{-1}$ denotes the inverse of a rotation by $\theta$ degrees. This constraint leads to an isotropic positional encoding unable to discriminate among orientations, which in turn enforces rotation invariance instead of rotation equivariance.[^4] This is alleviated by lifting the underlying function on $\mathbb{R}^2$ to a space where rotations are explicitly encoded (Fig. B.1). To this end, one performs self-attention operations for positional encodings $\mathcal{L}_\theta[\rho]$ of varying values $\theta$ and indexes their responses by the corresponding $\theta$ value. Next, as rotations are now explicitly encoded, a positional encoding can be defined in this space which is able to discriminate among rotations (Fig. B.2). This in turn allows for rotation equivariance instead of rotation invariance.

[^4]: This phenomenon arises from the fact that $\mathbb{R}^2$ is a quotient of the roto-translation group. Consequently, imposing group equivariance in the quotient space is equivalent to imposing an additional homomorphism of constant value over its cosets. Conclusively, the resulting map is of constant value over the rotation elements and, thus, is not able to discriminate among them. See Ch. 3.1 of Dummit & Foote (2004) for an intuitive description.

It has been shown both theoretically (Ravanbakhsh, 2020) and empirically (Weiler & Cesa, 2019) that the most expressive class of $G$-equivariant functions is given by functions that follow the regular representation of $G$. In order to obtain feature representations that behave this way, we introduce a lifting self-attention layer (Fig. B.1, Eq. 14) that receives an input function on $\mathbb{R}^d$ and produces a feature representation on $G$. Subsequently, arbitrarily many group self-attention layers (Fig. B.2, Eq. 16), interleaved with optional point-wise non-linearities, can be applied. At the end of the network, a feature representation on $\mathbb{R}^d$ can be recovered by pooling over $H$. In short, we provide a pure self-attention analogue of Cohen & Welling (2016). However, as the group acts directly on the positional encoding, our networks are steerable as well (Weiler et al., 2018b). This allows us to go beyond group discretizations that live in the grid without introducing interpolation artifacts (Section 5.1).
Though theoretically sound, neural architectures using regular representations are unable to handle continuous groups directly in practice. This is a result of the summation over elements $\bar{h} \in H$ in Eq. 15, which becomes an integral for continuous groups. Interestingly, using discrete groups does not seem to be detrimental in practice. Our experiments indicate that performance saturates for fine discrete approximations of the underlying continuous group (Tab. 2). In fact, Weiler & Cesa (2019, Tab. 3) show via extensive experiments that networks using regular representations and fine enough discrete approximations consistently outperform networks handling continuous groups via irreducible representations. We conjecture this is a result of the networks receiving discrete signals as input: as the actions of several group elements fall within the same pixel, no further improvement can be obtained.

### 5.1 GROUP SELF-ATTENTION IS A STEERABLE OPERATION

Convolutional filters are commonly parameterized by weights on a discrete grid, which approximate the function implicitly described by the filter at the grid locations. Unfortunately, for groups whose action does not live in this grid, e.g., 45° rotations, the filter must be interpolated. This is problematic, as these filters are typically small and the resulting interpolation artifacts can be severe (Fig. 2a). Steerable CNNs tackle this problem by parameterizing convolutional filters on a continuous basis on which the action of the group is well-defined, e.g., circular harmonics (Weiler et al., 2018b) or B-splines (Bekkers, 2020). In group self-attention, the action of the group leaves the content of the image intact and only modifies the positional encoding (Figs. B.1, B.2). As the positional encoding lives on a continuous space, it can be transformed at an arbitrary grade of precision without interpolation (Fig. 2b).

Figure 2: Steerability analysis of (a) discrete convolutions and (b) group self-attention.

### 5.2 LIFTING AND GROUP SELF-ATTENTION

Lifting self-attention (Fig. B.1). Let $G = \mathbb{R}^d \rtimes H$ be an affine group (Def. C.3) acting on $\mathbb{R}^d$. The lifting self-attention $m_r^{\uparrow G}[f, \rho] : L_{\mathcal{V}}(\mathbb{R}^d) \to L_{\mathcal{V}'}(G)$ is a map from functions on $\mathbb{R}^d$ to functions on $G$ obtained by modifying the relative positional encoding $\rho(i,j)$ by the action of group elements $h \in H$: $\{\mathcal{L}_h[\rho](i,j)\}_{h \in H}$, $\mathcal{L}_h[\rho](i,j) = \rho_P(h^{-1}x(j) - h^{-1}x(i))$. It corresponds to the concatenation of multiple self-attention operations (Eq. 11), indexed by $h$, with varying positional encodings $\mathcal{L}_h[\rho]$:

$$m_r^{\uparrow G}[f, \rho](i, h) = m_r[f, \mathcal{L}_h[\rho]](i) \quad (13)$$
$$= \phi_{out}\Big( \bigoplus_{h' \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big(\langle \phi^{(h')}_{qry}(f(i)), \phi^{(h')}_{key}(f(j) + \mathcal{L}_h[\rho](i,j)) \rangle\big)\, \phi^{(h')}_{val}(f(j)) \Big), \quad (14)$$

where $h'$ indexes the attention heads. Proposition 5.1. Lifting self-attention is $G$-equivariant. That is, it holds that: $m_r^{\uparrow G}[\mathcal{L}_g[f], \rho](i, h) = \mathcal{L}_g[m_r^{\uparrow G}[f, \rho]](i, h)$.
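The lifting operation of Eqs. 13-14 can be sketched as follows: ordinary relative-position self-attention is repeated with each $h$-transformed positional encoding, and the responses are stacked along a new group axis. This is a single-head, global-neighborhood sketch of ours for the roto-translation group with cyclic rotations; it is not the paper's implementation (see https://github.com/dwromero/g_selfatt). Because the positional encoding is a function of continuous relative coordinates, rotating it requires no interpolation (Section 5.1), and the content scores are computed once and reused for every $h$, which is the efficiency trick mentioned in Section 6.

```python
import math
import torch
import torch.nn as nn


class LiftingSelfAttention2D(nn.Module):
    """Lifting self-attention (Eq. 14) for R^2 x C_n, sketched for a single
    head and a global neighborhood. rho_P is a small MLP on continuous
    relative coordinates. Illustrative only."""

    def __init__(self, dim_in, dim_head, num_rotations=4):
        super().__init__()
        self.num_rotations = num_rotations
        self.to_qry = nn.Linear(dim_in, dim_head, bias=False)
        self.to_key = nn.Linear(dim_in, dim_head, bias=False)
        self.to_val = nn.Linear(dim_in, dim_head, bias=False)
        # Continuous positional encoding rho_P : R^2 -> R^{C_in}.
        self.rho = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, dim_in))

    def forward(self, x, coords):
        # x: [batch, N, C_in] tokens; coords: [N, 2] their pixel positions x(i).
        b, n, _ = x.shape
        q, k, v = self.to_qry(x), self.to_key(x), self.to_val(x)
        rel = coords[None, :, :] - coords[:, None, :]           # x(j) - x(i): [N, N, 2]
        content = torch.einsum("bid,bjd->bij", q, k)            # reused for every h
        outputs = []
        for r in range(self.num_rotations):
            theta = -2.0 * math.pi * r / self.num_rotations     # h^{-1} acting on R^2
            rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                                [math.sin(theta),  math.cos(theta)]])
            pos = self.to_key(self.rho(rel @ rot.T))            # W_key L_h[rho](i, j)
            scores = content + torch.einsum("bid,ijd->bij", q, pos)
            probs = scores.softmax(dim=-1)
            outputs.append(torch.einsum("bij,bjd->bid", probs, v))
        return torch.stack(outputs, dim=2)                      # [batch, N, |H|, dim_head]


# Example: an 8x8 feature map flattened into N = 64 tokens.
ys, xs = torch.meshgrid(torch.arange(8.0), torch.arange(8.0), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
lift = LiftingSelfAttention2D(dim_in=16, dim_head=8, num_rotations=4)
print(lift(torch.randn(2, 64, 16), coords).shape)               # torch.Size([2, 64, 4, 8])
```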
Group self-attention (Fig. B.2). Let $G = \mathbb{R}^d \rtimes H$ be an affine group acting on itself, and let $f(i, \bar{h}) \in L_{\mathcal{V}}(G)$, $i \in S$, $\bar{h} \in H$, be a function defined on a set immersed with the structure of the group $G$. That is, enriched with a positional encoding $\rho((i, \bar{h}), (j, \hat{h})) = \rho_P\big((x(j) - x(i),\ \bar{h}^{-1}\hat{h})\big)$, $i, j \in S$, $\bar{h}, \hat{h} \in H$. The group self-attention $m_r^{G}[f, \rho] : L_{\mathcal{V}}(G) \to L_{\mathcal{V}'}(G)$ is a map from functions on $G$ to functions on $G$ obtained by modifying the group positional encoding by the action of group elements $h \in H$: $\{\mathcal{L}_h[\rho]((i, \bar{h}), (j, \hat{h}))\}_{h \in H}$, $\mathcal{L}_h[\rho]((i, \bar{h}), (j, \hat{h})) = \rho_P\big(h^{-1}(x(j) - x(i)),\ h^{-1}(\bar{h}^{-1}\hat{h})\big)$. It corresponds to the concatenation of multiple self-attention operations (Eq. 11), indexed by $h$, with varying positional encodings $\mathcal{L}_h[\rho]$, followed by a summation over the output domain along $\bar{h}$:

$$m_r^{G}[f, \rho](i, h) = \sum_{\bar{h} \in H} m_r[f, \mathcal{L}_h[\rho]](i, \bar{h}) \quad (15)$$
$$= \phi_{out}\Big( \bigoplus_{h' \in [H]} \sum_{\bar{h} \in H} \sum_{(j, \hat{h}) \in \mathcal{N}(i, \bar{h})} \sigma_{j, \hat{h}}\big(\langle \phi^{(h')}_{qry}(f(i, \bar{h})), \phi^{(h')}_{key}(f(j, \hat{h}) + \mathcal{L}_h[\rho]((i, \bar{h}), (j, \hat{h}))) \rangle\big)\, \phi^{(h')}_{val}(f(j, \hat{h})) \Big), \quad (16)$$

where $h'$ indexes the attention heads. In contrast to vanilla and lifting self-attention, the group self-attention neighborhood $\mathcal{N}(i, \bar{h})$ is now defined on the group. This allows distinguishing across group transformations, e.g., rotations.

Proposition 5.2. Group self-attention is $G$-equivariant. That is, it holds that: $m_r^{G}[\mathcal{L}_g[f], \rho](i, h) = \mathcal{L}_g[m_r^{G}[f, \rho]](i, h)$.

Non-unimodular groups, i.e., groups that modify the volume of the objects they act upon, such as the dilation group, require special treatment. This treatment is provided in Appx. E.

### 5.3 GROUP SELF-ATTENTION IS A GENERALIZATION OF THE GROUP CONVOLUTION

We have demonstrated that it is sufficient to define self-attention as a function on the group $G$ and ensure that $\mathcal{L}_g[\rho] = \rho$, $\forall g \in G$, in order to enforce $G$-equivariance. Interestingly, this observation is in line with the main statement of Kondor & Trivedi (2018) for (group) convolutions: the group convolution on $G$ is the only (unique) $G$-equivariant linear map. In fact, our finding can be formulated as a generalization of Kondor & Trivedi (2018)'s statement: linear mappings on $G$ whose positional encoding is $G$-invariant are $G$-equivariant. This statement is more general than that of Kondor & Trivedi (2018), as it holds for data structures where (group) convolutions are not well-defined, e.g., sets, and it is equivalent to Kondor & Trivedi (2018)'s statement for structures where (group) convolutions are well-defined. It is also congruent with results complementary to Kondor & Trivedi (2018) (Cohen et al., 2019a; Bekkers, 2020), as well as with several works on group equivariance handling set-like structures such as point clouds (Thomas et al., 2018; Defferrard et al., 2020; Finzi et al., 2020; Fuchs et al., 2020) and symmetric sets (Maron et al., 2020).

In addition, we can characterize the expressivity of group self-attention. It holds that (i) group self-attention generalizes the group convolution, and (ii) regular global group self-attention is an equivariant universal approximator. Statement (i) follows from the fact that any convolutional layer can be described as a multi-head self-attention layer provided enough heads (Cordonnier et al., 2020), yet self-attention often uses larger receptive fields. As a result, self-attention is able to describe a larger set of functions than convolutions, e.g., Fig. A.1. Cordonnier et al. (2020)'s statement can be seamlessly extended to group self-attention by incorporating an additional dimension corresponding to $H$ in their derivations, and defining neighborhoods in this new space with a proportionally larger number of heads.
Statement (ii) stems from the finding of Ravanbakhsh (2020) that functions induced by regular group representations are equivariant universal approximators provided full kernels, i.e., global receptive fields. Global receptive fields are required to guarantee that the equivariant map is able to model any dependency among input components. Global receptive fields are readily utilized by our proposed regular global group self-attention and, provided enough heads, one can ensure that any such dependency is properly modelled.

## 6 EXPERIMENTS

We perform experiments on three image benchmark datasets for which particular forms of equivariance are desirable.[^5] We evaluate our approach by contrasting GSA-Nets equivariant to multiple symmetry groups. Additionally, we conduct a study on rotated MNIST to evaluate the performance of GSA-Nets as a function of the neighborhood size. All our networks follow the structure shown in Fig. F.1 and vary only in the number of blocks and channels. We emphasize that both the architecture and the number of parameters of GSA-Nets are left unchanged regardless of the group used. Our results illustrate that GSA-Nets consistently outperform equivalent non-equivariant attention networks. We further compare GSA-Nets with convolutional architectures. Though our approach does not build upon these networks, this comparison provides a fair view of the gap that still exists between self-attention and convolutional architectures in vision tasks, which is also present in their group equivariant counterparts.

[^5]: Our code is publicly available at https://github.com/dwromero/g_selfatt.

Efficient implementation of lifting and group self-attention. Our self-attention implementation takes advantage of the fact that the group action only affects the positional encoding $P$ to reduce the total computational cost of the operation. Specifically, we calculate the self-attention scores w.r.t. the content $X$ once, and reuse them for all transformed versions of the positional encoding $\{\mathcal{L}_h[\rho]\}_{h \in H}$.

Model designation. We refer to translation equivariant self-attention models as Z2_SA. Reflection equivariant models receive the keyword M, e.g., Z2M_SA, and rotation equivariant models the keyword Rn, where n depicts the angle discretization. For example, R8_SA depicts a model equivariant to rotations by 45 degrees. Specific model architectures are provided in Appx. F.

Rot MNIST. The rotated MNIST dataset (Larochelle et al., 2007) is a classification dataset often used as a standard benchmark for rotation equivariance. It consists of 62k gray-scale 28x28 uniformly rotated handwritten digits, divided into training, validation and test sets of 10k, 2k and 50k images. First, we study the effect of the neighborhood size on classification accuracy and convergence time. We train R4_SA networks for 300 epochs with square neighborhoods of varying size (Tab. 1, Fig. 3). Since GSA-Nets optimize where to attend, the complexity of the optimization problem grows as a function of the neighborhood size. Consequently, models with big neighborhoods are expected to converge more slowly. However, as the family of functions describable with big neighborhoods contains those describable with small ones, models with big neighborhoods are expected to be at least as good upon convergence. Our results show that models with small neighborhoods do converge much faster (Fig. 3). However, though some models with large neighborhoods do outperform models with small ones, e.g., 7x7 vs. 3x3, a clear trend is not apparent. We conjecture that 300 epochs are insufficient for all models to converge equally well.
Unfortunately, due to computational constraints, we were not able to perform this experiment for a larger number of epochs. We consider an in-depth study of this behavior an important direction for future work.

Table 1: Accuracy vs. neighborhood size (R4_SA models on rotated MNIST).

| Nbhd. size | Acc. (%) | Train. time / epoch |
|---|---|---|
| 3x3 | 96.56 | 04:53 (1 GPU) |
| 5x5 | 97.49 | 05:34 (1 GPU) |
| 7x7 | 97.33 | 09:03 (1 GPU) |
| 9x9 | 97.42 | 09:16 (1 GPU) |
| 11x11 | 97.17 | 12:09 (1 GPU) |
| 15x15 | 96.89 | 10:27 (2 GPU) |
| 19x19 | 96.86 | 14:27 (2 GPU) |
| 23x23 | 97.05 | 06:13 (3 GPU) |
| 28x28 | 96.81 | 12:12 (4 GPU) |

Figure 3: Test accuracy in the early training stage (legend values: R4_SA_5x5 91.9, R4_SA_9x9 90.9, R4_SA_7x7 90.0, R4_SA_11x11 89.4, R4_SA_15x15 89.1, R4_SA_19x19 86.7, R4_SA_3x3 85.9, R4_SA_23x23 83.0, R4_SA_28x28 81.3).

Next, we compare GSA-Nets equivariant to translation and rotation at different angle discretizations (Tab. 2). Based on the results of the previous study, we select a 5x5 neighborhood, as it provides the best trade-off between accuracy and convergence time. Our results show that finer discretizations lead to better accuracy, but accuracy saturates around R12. We conjecture that this is due to the discrete resolution of the images in the dataset, which causes finer angle discretizations to fall within the same pixel.

Table 2: Classification results. All convolutional architectures use 3x3 filters. Blank entries in the Params. column indicate the same parameter count as the row above (the parameter count of GSA-Nets does not change with the group). Models marked + are from Cohen & Welling (2016); models marked * are from Romero et al. (2020a).

rotMNIST:

| Model | Acc. (%) | Params. |
|---|---|---|
| Z2_SA | 96.37 | 44.67K |
| R4_SA | 97.46 | |
| R8_SA | 97.90 | |
| R12_SA | 97.97 | |
| R16_SA | 97.66 | |
| Z2_CNN+ | 94.97 | 21.75K |
| R4_CNN | 98.21 | 77.54K |
| α-R4_CNN* | 98.31 | 73.13K |

CIFAR-10:

| Model | Acc. (%) | Params. |
|---|---|---|
| Z2_SA | 82.3 | 2.99M |
| Z2M_SA | 83.72 | |
| Z2_CNN+ | 90.56 | 1.37M |

PatchCamelyon:

| Model | Acc. (%) | Params. |
|---|---|---|
| Z2_SA | 83.04 | 205.66K |
| R4_SA | 83.44 | |
| R8_SA | 83.58 | |
| R4M_SA | 84.76 | |
| Z2_CNN | 84.07 | 130.60K |
| R4_CNN | 87.55 | 129.65K |
| R4M_CNN | 88.36 | 124.21K |
| αF-R4_CNN* | 88.66 | 140.45K |
| αF-R4M_CNN* | 89.12 | 141.22K |

CIFAR-10. The CIFAR-10 dataset (Krizhevsky et al., 2009) consists of 60k real-world 32x32 RGB images uniformly drawn from 10 classes, divided into training, validation and test sets of 40k, 10k and 10k images. Since reflection is a symmetry that appears ubiquitously in natural images, we compare GSA-Nets equivariant to translation and reflection on this dataset (Tab. 2). Our results show that reflection equivariance indeed improves the classification performance of the model.

PCam. The PatchCamelyon dataset (Veeling et al., 2018) consists of 327k 96x96 RGB image patches of tumorous/non-tumorous breast tissue extracted from Camelyon16 (Bejnordi et al., 2017). Each patch is labeled as tumorous if its central region (32x32) contains at least one tumour pixel. As cells appear at arbitrary positions and poses, we compare GSA-Nets equivariant to translation, rotation and reflection (Tab. 2). Our results show that incorporating equivariance to reflection in addition to rotation, as well as providing finer group discretizations, improves classification performance.

## 7 DISCUSSION AND FUTURE WORK

Though GSA-Nets perform competitively with G-CNNs on some tasks, G-CNNs still outperform our approach in general. We conjecture that this is due to the harder nature of the optimization problem in GSA-Nets and to the carefully crafted architecture designs, initializations and optimization procedures developed for CNNs over the years.
Though our theoretical results indicate that GSA-Nets can be more expressive than G-CNNs (Section 5.3), further research in terms of design, optimization, stability and generalization is required. These are in fact open questions for self-attention in general (Xiong et al., 2020; Liu et al., 2020; Zhao et al., 2020), and developments in this direction are of utmost importance. The main drawback of our approach is the quadratic memory and time complexity typical of self-attention. This is an active area of research, e.g., Kitaev et al. (2020); Wang et al. (2020); Zaheer et al. (2020); Choromanski et al. (2020), and we believe that efficiency advances to vanilla self-attention can be seamlessly integrated into GSA-Nets. Our theoretical results indicate that GSA-Nets have the potential to become the standard solution for applications exhibiting symmetries, e.g., medical imagery. In addition, as self-attention is a set operation, GSA-Nets provide straightforward solutions for set-like data types, e.g., point clouds, graphs and symmetric sets, which may benefit from additional geometrical information, e.g., Fuchs et al. (2020); Maron et al. (2020). Finally, we hope our theoretical insights serve as a support point to further explore and understand the construction of equivariant maps for graphs and sets, which often come equipped with spatial coordinates: a type of positional encoding.

## ACKNOWLEDGMENTS

We gratefully acknowledge Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, and Hyunjik Kim for useful discussions, and Robert-Jan Bruintjes, Fabian Fuchs, Erik Bekkers, Andreas Loukas, Mark Hoogendoorn and our anonymous reviewers for their valuable comments on early versions of this work, which largely helped us to improve its quality. David W. Romero is financed as part of the Efficient Deep Learning (EDL) programme (grant number P16-25), partly funded by the Dutch Research Council (NWO) and Semiotic Labs. Jean-Baptiste Cordonnier is financed by the Swiss Data Science Center (SDSC). Both authors thankfully acknowledge everyone involved in funding this work. This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

## REFERENCES

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.

Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 318(22):2199-2210, 2017.

Erik J Bekkers. B-spline CNNs on Lie groups. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gBhkBFDH.

Erik J Bekkers, Maxime W Lafarge, Mitko Veta, Koen AJ Eppenhof, Josien PW Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 440-448. Springer, 2018.
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks, 2019. Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. ar Xiv preprint ar Xiv:1601.06733, 2016. Xiuyuan Cheng, Qiang Qiu, Robert Calderbank, and Guillermo Sapiro. Rotdcf: Decomposition of convolutional filters for rotation-equivariant deep networks. ar Xiv preprint ar Xiv:1805.06846, 2018. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. ar Xiv preprint ar Xiv:2009.14794, 2020. Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pp. 2990 2999, 2016. Taco S. Cohen, Mario Geiger, Jonas K ohler, and Max Welling. Spherical cnns. Co RR, abs/1801.10130, 2018. URL http://arxiv.org/abs/1801.10130. Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142 9153, 2019a. Taco S Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral cnn. ar Xiv preprint ar Xiv:1902.04615, 2019b. Published as a conference paper at ICLR 2021 Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between selfattention and convolutional layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJln C1r KPB. Micha el Defferrard, Martino Milani, Fr ed erick Gusset, and Nathana el Perraudin. Deepsphere: a graph-based spherical cnn. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1e3Ol St PB. Nichita Diaconu and Daniel E Worrall. Learning to convolve: A generalized weight-tying approach. 2019. Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. ar Xiv preprint ar Xiv:1602.02660, 2016. David Steven Dummit and Richard M Foote. Abstract algebra, volume 3. Wiley Hoboken, 2004. Carlos Esteves, Avneesh Sud, Zhengyi Luo, Kostas Daniilidis, and Ameesh Makadia. Cross-domain 3d equivariant image embeddings. In International Conference on Machine Learning, pp. 1812 1822. PMLR, 2019a. Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant multiview networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1568 1577, 2019b. Carlos Esteves, Ameesh Makadia, and Kostas Daniilidis. Spin-weighted spherical cnns. Advances in Neural Information Processing Systems, 33, 2020. Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. ar Xiv preprint ar Xiv:2002.12880, 2020. Fabian B Fuchs, Daniel E Worrall, Volker Fischer, and Max Welling. Se (3)-transformers: 3d roto-translation equivariant attention networks. ar Xiv preprint ar Xiv:2006.10503, 2020. Simon Graham, David Epstein, and Nasir Rajpoot. Dense steerable filter cnns for exploiting rotational symmetry in histology images. ar Xiv preprint ar Xiv:2004.03037, 2020. Emiel Hoogeboom, Jorn WT Peters, Taco S Cohen, and Max Welling. Hexaconv. ar Xiv preprint ar Xiv:1803.02108, 2018. Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132 7141, 2018. Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. Lietransformer: Equivariant self-attention for lie groups. ar Xiv preprint ar Xiv:2012.10885, 2020. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention, 2020. Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020. Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. ar Xiv preprint ar Xiv:1802.03690, 2018. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pp. 473 480. ACM, 2007. Jan Eric Lenssen, Matthias Fey, and Pascal Libuschewski. Group equivariant capsule networks. In Advances in Neural Information Processing Systems, pp. 8844 8853, 2018. Published as a conference paper at ICLR 2021 Junying Li, Zichen Yang, Haifeng Liu, and Deng Cai. Deep rotation equivariant network. Neurocomputing, 290:26 33, 2018. Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. ar Xiv preprint ar Xiv:2004.08249, 2020. Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016. Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048 5057, 2017. Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in cnns with vector fields. ar Xiv preprint ar Xiv:1807.11783, 2018. Haggai Maron, Or Litany, Gal Chechik, and Ethan Fetaya. On learning sets of symmetric elements. ar Xiv preprint ar Xiv:2002.08599, 2020. Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. ar Xiv preprint ar Xiv:1710.05941, 2017. Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. ar Xiv preprint ar Xiv:1906.05909, 2019. Siamak Ravanbakhsh. Universal equivariant multilayer perceptrons. ar Xiv preprint ar Xiv:2002.02912, 2020. Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. ar Xiv preprint ar Xiv:1702.08389, 2017. David W Romero and Mark Hoogendoorn. Co-attentive equivariant neural networks: Focusing equivariance on transformations co-occurring in data. ar Xiv preprint ar Xiv:1911.07849, 2019. David W Romero, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Attentive group equivariant convolutional networks. ar Xiv preprint ar Xiv:2002.03830, 2020a. David W Romero, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Wavelet networks: Scale equivariant learning from raw waveforms. ar Xiv preprint ar Xiv:2006.05259, 2020b. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Marilyn A. 
Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pp. 464 468. Association for Computational Linguistics, 2018. doi: 10.18653/v1/ n18-2074. URL https://doi.org/10.18653/v1/n18-2074. Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. Scale-equivariant steerable networks. In International Conference on Learning Representations, 2020. URL https://openreview. net/forum?id=HJgpugr KPS. Kai Sheng Tai, Peter Bailis, and Gregory Valiant. Equivariant transformer networks. ar Xiv preprint ar Xiv:1901.11399, 2019. Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor Field Networks: Rotation-and Translation-Equivariant Neural Networks for 3D Point Clouds. ar Xiv preprint ar Xiv:1802.08219, 2018. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998 6008, 2017. Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In International Conference on Medical image computing and computer-assisted intervention, pp. 210 218. Springer, 2018. Published as a conference paper at ICLR 2021 Sai Raam Venkataraman, S. Balasubramanian, and R. Raghunatha Sarma. Building deep equivariant capsule networks. In International Conference on Learning Representations, 2020. URL https: //openreview.net/forum?id=BJg NJg SFPS. Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. ar Xiv preprint ar Xiv:2006.04768, 2020. Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794 7803, 2018. Maurice Weiler and Gabriele Cesa. General e (2)-equivariant steerable cnns. In Advances in Neural Information Processing Systems, pp. 14334 14345, 2019. Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pp. 10381 10392, 2018a. Maurice Weiler, Fred A Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849 858, 2018b. Daniel Worrall and Gabriel Brostow. Cubenet: Equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 567 584, 2018. Daniel E Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. ar Xiv preprint ar Xiv:1905.11697, 2019. Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028 5037, 2017. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. ar Xiv preprint ar Xiv:2002.04745, 2020. 
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.

Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

## A CONVOLUTION AND SELF-ATTENTION: A GRAPHICAL COMPARISON

Figure A.1: Parameter usage in convolutional kernels (Fig. A.1a) and self-attention (Fig. A.1b). Given a budget of 9 parameters, a convolutional filter ties these parameters to specific positions. Subsequently, these parameters remain static regardless of (i) the query input position and (ii) the input signal itself. Self-attention, on the other hand, does not tie parameters to any specific positions at all. Instead, it compares the representations of all tokens falling in its receptive field. As a result, provided enough heads, self-attention is more general than convolutions, as it can represent any convolutional kernel, e.g., Fig. A.1a, as well as several other functions defined on its receptive field.

## B LIFTING AND GROUP SELF-ATTENTION: A GRAPHICAL DESCRIPTION

Figure B.1: Lifting self-attention on the roto-translation group for discrete rotations by 90 degrees (also called the Z4 group). The Z4 group is defined by $H = \{e, h, h^2, h^3\}$, where $h$ depicts a 90° rotation. The lifting self-attention corresponds to the concatenation of $|H| = 4$ self-attention operations between the input $f$ and $h$-transformed versions of the positional encoding $\mathcal{L}_h[\rho]$, $h \in H$. As a result, the model sees the input $f$ at each of the rotations in the group at once. Since Z4 is a cyclic group, i.e., $h^4 = e$, functions on this group are often represented as responses on a ring (right side of the image). This is a self-attention analogue of the regular lifting group convolution broadly utilized in the group equivariant learning literature, e.g., Cohen & Welling (2016); Romero et al. (2020a).

Figure B.2: Group self-attention on the roto-translation group for discrete rotations by 90 degrees (also called the Z4 group). The Z4 group is defined by $H = \{e, h, h^2, h^3\}$, where $h$ depicts a 90° rotation. Analogous to lifting self-attention (Fig. B.1), group self-attention corresponds to a concatenation of $|H| = 4$ self-attention operations between the input $f$ and $h$-transformed versions of the positional encoding $\mathcal{L}_h[\rho]$, $h \in H$. However, in contrast to lifting self-attention, both $f$ and $\rho$ are now defined on the group $G$. Consequently, an additional sum over $\bar{h}$ is required during the operation (cf. Eq. 16). Since Z4 is a cyclic group, i.e., $h^4 = e$, functions on Z4 are often represented as responses on a ring (right side of the image). This is a self-attention analogue of the regular group convolution broadly utilized in the group equivariant learning literature, e.g., Cohen & Welling (2016); Romero et al. (2020a).

## C CONCEPTS FROM GROUP THEORY

Definition C.1 (Group). A group is an ordered pair $(G, \cdot)$, where $G$ is a set and $\cdot : G \times G \to G$ is a binary operation on $G$, such that (i) the set is closed under this operation, (ii) the operation is associative, i.e., $(g_1 \cdot g_2) \cdot g_3 = g_1 \cdot (g_2 \cdot g_3)$, $\forall g_1, g_2, g_3 \in G$, (iii) there exists an identity element $e \in G$ such that $\forall g \in G$ we have $e \cdot g = g \cdot e = g$, and (iv) for each $g \in G$ there exists an inverse $g^{-1}$ such that $g \cdot g^{-1} = e$.

Definition C.2 (Subgroup). Let $(G, \cdot)$ be a group.
A subset $H$ of $G$ is a subgroup of $G$ if $H$ is nonempty and closed under the group operation and inverses (i.e., $h_1, h_2 \in H$ implies that $h_1^{-1} \in H$ and $h_1 \cdot h_2 \in H$). If $H$ is a subgroup of $G$, we write $H \leq G$.

Definition C.3 (Semi-direct product and affine groups). In practice, one is mainly interested in the analysis of data defined on $\mathbb{R}^d$ and, consequently, in groups of the form $G = \mathbb{R}^d \rtimes H$, resulting from the semi-direct product ($\rtimes$) between the translation group $(\mathbb{R}^d, +)$ and an arbitrary (Lie) group $H$ that acts on $\mathbb{R}^d$, e.g., rotation, scaling, mirroring, etc. This family of groups is referred to as affine groups and their group product is defined as:

$$g_1 \cdot g_2 = (x_1, h_1) \cdot (x_2, h_2) = (x_1 + h_1 \odot x_2,\ h_1 \cdot h_2), \quad (17)$$

with $g_1 = (x_1, h_1),\ g_2 = (x_2, h_2) \in G$, $x_1, x_2 \in \mathbb{R}^d$ and $h_1, h_2 \in H$. The operator $\odot$ denotes the action of $h \in H$ on $x \in \mathbb{R}^d$, and it describes how a vector $x \in \mathbb{R}^d$ is modified by elements $h \in H$.

Definition C.4 (Group representation). Let $G$ be a group and $L_2(\mathcal{X})$ be a space of functions defined on some vector space $\mathcal{X}$. The (left) regular group representation of $G$ is a linear transformation $\mathcal{L} : G \times L_2(\mathcal{X}) \to L_2(\mathcal{X})$, $(g, f) \mapsto \mathcal{L}_g[f] = f(g^{-1} \odot x)$, that shares the group structure via:

$$\mathcal{L}_{g_1}\mathcal{L}_{g_2}[f] = \mathcal{L}_{g_1 \cdot g_2}[f] \quad (18)$$

for any $g_1, g_2 \in G$, $f \in L_2(\mathcal{X})$. That is, concatenating two such transformations, parameterized by $g_1$ and $g_2$, is equivalent to a single transformation parameterized by $g_1 \cdot g_2 \in G$. If the group $G$ is affine, the group representation $\mathcal{L}_g$ can be split as:

$$\mathcal{L}_g[f] = \mathcal{L}_x \mathcal{L}_h[f], \quad (19)$$

with $g = (x, h) \in G$, $x \in \mathbb{R}^d$ and $h \in H$. Intuitively, the representation of $G$ on a function $f$ describes how the function as a whole, i.e., $f(x)$, $\forall x \in \mathcal{X}$, is transformed by the effect of group elements $g \in G$.

## D ACTIONS AND REPRESENTATIONS OF GROUPS ACTING ON HOMOGENEOUS SPACES FOR FUNCTIONS DEFINED ON SETS

In this section we show that the action of a group $G$ acting on a homogeneous space $\mathcal{X}$ is well-defined on sets $S$ gathered from $\mathcal{X}$, and that it induces a group representation on functions defined on $S$. Let $S = \{i\}$ be a set and $\mathcal{X}$ be a homogeneous space on which the action of $G$ is well-defined, i.e., $gx \in \mathcal{X}$, $\forall g \in G$, $\forall x \in \mathcal{X}$. Since $S$ has been gathered from $\mathcal{X}$, there exists an injective map $x : S \to \mathcal{X}$ that maps set elements $i \in S$ to unique elements $x_i \in \mathcal{X}$. That is, there exists a map $x : i \mapsto x_i$ that assigns a unique value $x_i \in \mathcal{X}$ to each $i \in S$, corresponding to the coordinates from which the set element has been gathered. Since the action of $G$ is well-defined on $\mathcal{X}$, it follows that the left regular representation (Def. C.4) $\mathcal{L}_g[f_{\mathcal{X}}](x_i) = f_{\mathcal{X}}(g^{-1}x_i)$ of functions $f_{\mathcal{X}} \in L_{\mathcal{Y}}(\mathcal{X})$ exists and is well-defined. Since $x$ is injective, the left regular representation can be expressed uniquely in terms of set indices as $\mathcal{L}_g[f_{\mathcal{X}}](x_i) = f_{\mathcal{X}}(g^{-1}x(i))$. Furthermore, its inverse $x^{-1} : \mathcal{X} \to S$, $x^{-1} : x_i \mapsto i$, also exists and is well-defined. As a consequence, points $x_i \in \mathcal{X}$ can be expressed uniquely in terms of set indices as $i = x^{-1}(x_i)$, $i \in S$. Consequently, functions $f_{\mathcal{X}} \in L_{\mathcal{Y}}(\mathcal{X})$ can be expressed in terms of functions $f \in L_{\mathcal{Y}}(S)$ by means of the equality $f_{\mathcal{X}}(x_i) = f(x^{-1}(x_i))$. As a result, we see that the group representation $\mathcal{L}_g[f_{\mathcal{X}}]$ can be described in terms of functions $f \in L_{\mathcal{Y}}(S)$ as:

$$\mathcal{L}_g[f_{\mathcal{X}}](x_i) = f_{\mathcal{X}}(g^{-1}x_i) = f_{\mathcal{X}}(g^{-1}x(i)) = f(x^{-1}(g^{-1}x(i))) = \mathcal{L}_g[f](i),$$

with a corresponding group representation on $L_{\mathcal{Y}}(S)$ given by $\mathcal{L}_g[f](i) = f(x^{-1}(g^{-1}x(i)))$, and an action of group elements $g \in G$ on set elements $i$ given by $g \odot i = x^{-1}(g\, x(i))$.
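To make Defs. C.3-C.4 and Eq. 17 concrete, below is a small sketch of the roto-translation group $\mathbb{R}^2 \rtimes SO(2)$: its group product, inverse and action on points in $\mathbb{R}^2$ (the action on positions that Appendix D pulls back to set indices via $x(i)$). The class name and interface are ours and purely illustrative.

```python
import math
import torch


class RotoTranslation2D:
    """Element g = (x, theta) of the affine group R^2 x SO(2) (Def. C.3).
    A sketch used only to sanity-check Eq. 17 and the group action."""

    def __init__(self, translation, angle):
        self.t = torch.as_tensor(translation, dtype=torch.float32)  # x in R^2
        self.angle = angle                                           # h in SO(2)

    def _rot(self, angle):
        c, s = math.cos(angle), math.sin(angle)
        return torch.tensor([[c, -s], [s, c]])

    def __mul__(self, other):
        # Eq. 17: (x1, h1) . (x2, h2) = (x1 + h1 x2, h1 h2).
        return RotoTranslation2D(self.t + self._rot(self.angle) @ other.t,
                                 self.angle + other.angle)

    def inverse(self):
        # g^{-1} = (-h^{-1} x, h^{-1}).
        return RotoTranslation2D(-(self._rot(-self.angle) @ self.t), -self.angle)

    def act(self, p):
        # Action on a point p in R^2: g p = h p + x.
        return self._rot(self.angle) @ torch.as_tensor(p, dtype=torch.float32) + self.t


# Group action axiom underlying Eq. 18: acting with g2 then g1 equals acting with g1 * g2,
# which is what makes L_{g1} L_{g2}[f] = L_{g1 g2}[f] for f(p) transformed via p -> g^{-1} p.
g1 = RotoTranslation2D([1.0, 0.0], math.pi / 2)
g2 = RotoTranslation2D([0.0, 2.0], math.pi)
p = [0.5, -1.0]
print(g1.act(g2.act(p)), (g1 * g2).act(p))   # identical up to float precision
```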
E THE CASE OF NON-UNIMODULAR GROUPS: SELF-ATTENTION ON THE DILATION-TRANSLATION GROUP

The lifting and group self-attention formulations provided in Sec. 5.2 are only valid for unimodular groups, that is, for groups whose action does not change the volume of the objects they act upon, e.g., rotation, mirroring, etc. Non-unimodular groups, however, do modify the volume of the acted-upon objects (Bekkers, 2020). The most relevant non-unimodular group for this work is the dilation group $H = (\mathbb{R}_{>0}, \cdot)$.

To illustrate why this distinction is important, consider the following example. Imagine we have a circle on $\mathbb{R}^2$ of area $\pi r^2$. If we rotate, mirror or translate the circle, its size is kept constant. If we increase its radius by a factor $h \in \mathbb{R}_{>0}$, however, its size increases by $h^2$. Imagine that we have an application for which we would like to recognize this circle regardless of any of these transformations by means of self-attention. For this purpose, we define a neighborhood $\mathcal{N}$ in which the original circle fits perfectly. Since the size of the circle is not modified by any translated, rotated or mirrored version of it, we would still be able to detect the circle regardless of these transformations. If we scale the circle by a factor $h > 1$, however, the circle would fall outside of our neighborhood $\mathcal{N}$ and, hence, we would not be able to recognize it.

A solution to this problem is to scale our neighborhood $\mathcal{N}$ proportionally. That is, if the circle is scaled by a factor $h \in \mathbb{R}_{>0}$, we scale our neighborhood by the same factor: $\mathcal{N} \to h\mathcal{N}$. As a result, the circle falls within the neighborhood for any scale factor $h \in \mathbb{R}_{>0}$. Unfortunately, there is a problem: self-attention utilizes summations over its neighborhood. Since $\sum_{i \in h\mathcal{N}} i > \sum_{i \in \mathcal{N}} i$ for $h > 1$, and $\sum_{i \in h\mathcal{N}} i < \sum_{i \in \mathcal{N}} i$ for $h < 1$, the result of the summations would still differ for different scales. Specifically, this result would always be bigger for larger versions of the neighborhood. This is problematic, as the response produced by the same circle would still differ across scales.

In order to handle this problem, one utilizes a normalization factor proportional to the change of size of the neighborhood considered. This ensures that the responses are equivalent for any scale $h \in \mathbb{R}_{>0}$. That is, one normalizes all summations proportionally to the size of the neighborhood. As a result, we obtain that $\sum_{i \in h_1\mathcal{N}} (h_1^2)^{-1}\, i = \sum_{i \in h_2\mathcal{N}} (h_2^2)^{-1}\, i$, $\forall h_1, h_2 \in \mathbb{R}_{>0}$.⁶

In the example above we have provided an intuitive description of the (left-invariant) Haar measure $d\mu(h)$. As its name indicates, it is a measure defined on the group which is invariant over all group elements $h \in H$. For several unimodular groups, the Haar measure corresponds to the Lebesgue measure, as the volume of the objects the group acts upon is kept equal, i.e., $d\mu(h) = dh$.⁷ For non-unimodular groups, however, the Haar measure requires a normalization factor proportional to the change of volume of these objects. Specifically, the Haar measure corresponds to the Lebesgue measure times a normalization factor $h^{-d}$, where $d$ is the dimensionality of the space $\mathbb{R}^d$ the group acts upon (Bekkers, 2020; Romero et al., 2020b), i.e., $d\mu(h) = \frac{1}{h^d}\, dh$.

In conclusion, in order to obtain group equivariance to non-unimodular groups, the lifting and group self-attention formulations provided in Eqs. 14, 16 must be modified via normalization factors proportional to the group elements $h \in H$.
Specifically, they are redefined as:
$$m^r_{\uparrow G}[f, \rho](i, h) = \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{j \in h\mathcal{N}(i)} \frac{1}{h^d}\, \sigma_j\big( \langle \varphi^{(h')}_{\text{qry}}(f(i)),\ \varphi^{(h')}_{\text{key}}(f(j) + \mathcal{L}_h[\rho](i,j)) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(j)) \Big) \qquad (20)$$
$$m^r_{G}[f, \rho](i, h) = \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\grave{h} \in H} \sum_{(j, \hat{h}) \in h\mathcal{N}(i, \grave{h})} \frac{1}{h^{d+1}}\, \sigma_{j,\hat{h}}\big( \langle \varphi^{(h')}_{\text{qry}}(f(i, \grave{h})),\ \varphi^{(h')}_{\text{key}}(f(j, \hat{h}) + \mathcal{L}_h[\rho]((i, \grave{h}), (j, \hat{h}))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(j, \hat{h})) \Big) \qquad (21)$$
The factor $d+1$ in Eq. 21 results from the fact that the summation is now performed on the group $G = \mathbb{R}^d \rtimes H$, a space of dimensionality $d+1$. An interesting case emerges when global neighborhoods are considered, i.e., such that $\mathcal{N}(i) = \mathcal{S}$, $\forall i \in \mathcal{S}$. Since $h\mathcal{N}(i) = \mathcal{N}(i) = \mathcal{S}$ for any $h > 1$, approximation artifacts are introduced. It is not clear whether it is better to introduce normalization factors in these situations or not. An in-depth investigation of this phenomenon is left for future research.

E.1 CURRENT EMPIRICAL ASPECTS OF SCALE EQUIVARIANT SELF-ATTENTION

Self-attention suffers from quadratic memory and time complexity proportional to the size of the neighborhood considered. This constraint is particularly important for the dilation group, for which these neighborhoods grow as a result of the group action. We envisage two possible solutions to this limitation, which we leave for future research:

The most promising solution is to incorporate recent advances in efficient self-attention into group self-attention, e.g., Kitaev et al. (2020); Wang et al. (2020); Zaheer et al. (2020); Katharopoulos et al. (2020); Choromanski et al. (2020). By reducing the quadratic complexity of self-attention, the current computational constraints of scale equivariant self-attention can be (strongly) reduced. Importantly, the resulting architectures would be comparable to Bekkers (2020); Sosnovik et al. (2020); Romero et al. (2020b) in terms of their functionality and the group discretizations they can manage.

The second option is to draw a self-attention analogue of Worrall & Welling (2019), where scale equivariance is implemented via dilated convolutions. One might consider an analogue of dilated convolutions via sparse dilations of the self-attention neighborhood. As a result, scale equivariance can be implemented while retaining an equal computational cost for all group elements. Importantly, however, this strategy is viable for a dyadic set of scales only, i.e., a set of scales given by $\{2^j\}_{j=0}^{j_{\max}}$, and is thus less general than the scale-equivariant architectures listed above.

F EXPERIMENTAL DETAILS

In this section we provide extended details on our implementation as well as the exact architectures and optimization schemes used in our experiments. All our models follow the structure shown in Fig. F.1 and vary only in the number of blocks and channels. All self-attention operations utilize 9 heads. We use PyTorch for our implementation. Any missing specification can safely be considered to be the PyTorch default value. Our code is publicly available at https://github.com/dwromero/g_selfatt.

⁶ The squared factors $h_1^2$ and $h_2^2$ appear because the neighborhood growth is quadratic in $\mathbb{R}^2$.
⁷ This is why this subtlety is often left out in the group equivariance literature.

Figure F.1: Graphical description of group self-attention networks. Dot-lined blocks depict optional blocks. Linear layers are applied point-wise across the feature map. Swish non-linearities (Ramachandran et al., 2017) and layer normalization (Ba et al., 2016) are used all across the network. The Global Pooling block consists of a max-pool over group elements followed by a spatial mean-pool.
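As a rough illustration of the output stage described in Fig. F.1, the sketch below implements the Global Pooling block (max over group elements followed by a spatial mean) and a point-wise linear layer in PyTorch. The tensor layout [batch, channels, |H|, height, width], the module names and the classifier head are assumptions made for illustration only; they are not taken from the released code at the repository above.

```python
# A rough PyTorch illustration of the output stage of Fig. F.1: a point-wise linear
# layer and the Global Pooling block. Layout and names are assumptions.
import torch
import torch.nn as nn

class PointwiseLinear(nn.Module):
    """Linear layer applied point-wise across the feature map, as in Fig. F.1."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)

    def forward(self, x):                      # x: [B, C_in, |H|, X, Y]
        return self.linear(x.movedim(1, -1)).movedim(-1, 1)

class GlobalPool(nn.Module):
    """Fig. F.1 Global Pooling: max over group elements, then spatial mean."""
    def forward(self, x):                      # x: [B, C, |H|, X, Y]
        x = x.max(dim=2).values                # max-pool over the group axis |H|
        return x.mean(dim=(-2, -1))            # spatial mean-pool -> [B, C]

# Example usage on a dummy feature map over the Z4 group (|H| = 4):
feat = torch.randn(2, 20, 4, 28, 28)                              # 20 channels, as in F.1
feat = torch.nn.functional.silu(PointwiseLinear(20, 20)(feat))    # Swish non-linearity
logits = nn.Linear(20, 10)(GlobalPool()(feat))                    # assumed 10-class head
print(logits.shape)                                               # torch.Size([2, 10])
```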
F.1 ROTATED MNIST

For rotated MNIST we use a group self-attention network composed of 5 attention blocks with 20 channels. We utilize automatic mixed precision during training to reduce memory requirements. The attention dropout rate and value dropout rate are both set to 0.1. We train for 300 epochs and utilize the Adam optimizer, a batch size of 8, weight decay of 0.0001 and a learning rate of 0.001.

F.2 CIFAR-10

For CIFAR-10 we use a group self-attention network composed of 6 attention blocks with 96 channels for the first two blocks and 192 channels for the rest. The attention dropout rate and value dropout rate are both set to 0.1. We use dropout on the input with a rate of 0.3 and additional dropout blocks with a rate of 0.2, followed by spatial max-pooling, after the second and fourth blocks. We did not use automatic mixed precision training for this dataset, as it made all models diverge. We train for 350 epochs and utilize stochastic gradient descent with a momentum of 0.9 and a cosine learning rate scheduler with base learning rate 0.01 (Loshchilov & Hutter, 2016). We utilize a batch size of 24, weight decay of 0.0001 and He initialization.

F.3 PATCHCAMELYON

For PatchCamelyon we use a group self-attention network composed of 4 attention blocks with 12 channels for the first block, 24 channels for the second block, 48 channels for the third and fourth blocks and 96 channels for the last block. The attention dropout rate and value dropout rate are both set to 0.1. We use an additional max-pooling block after the lifting block to reduce memory requirements. We did not use automatic mixed precision training for this dataset, as it made all models diverge. We train for 100 epochs and utilize stochastic gradient descent with a momentum of 0.9 and a cosine learning rate scheduler with base learning rate 0.01 (Loshchilov & Hutter, 2016). We utilize a batch size of 8, weight decay of 0.0001 and He initialization.

Proof of Proposition 4.1. If the self-attention formulation provided in Eqs. 3, 8 is permutation equivariant, then it must hold that $m[\mathcal{L}_\pi[f]](i) = \mathcal{L}_\pi[m[f]](i)$. Consider a permuted input signal $\mathcal{L}_\pi[f](i) = f(\pi^{-1}(i))$. The self-attention operation on $\mathcal{L}_\pi[f]$ is given by:
$$\begin{aligned}
m[\mathcal{L}_\pi[f]](i) &= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{S}} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(\mathcal{L}_\pi[f](i)), \varphi^{(h)}_{\text{key}}(\mathcal{L}_\pi[f](j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(\mathcal{L}_\pi[f](j)) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{S}} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(f(\pi^{-1}(i))), \varphi^{(h)}_{\text{key}}(f(\pi^{-1}(j))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\pi^{-1}(j))) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\pi(\tilde{j}) \in \mathcal{S}} \sigma_{\pi(\tilde{j})}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j})) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\tilde{j} \in \mathcal{S}} \sigma_{\tilde{j}}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j})) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= m[f](\tilde{i}) = m[f](\pi^{-1}(i)) = \mathcal{L}_\pi[m[f]](i)
\end{aligned}$$
Here we have used the substitutions $i = \pi(\tilde{i})$ and $j = \pi(\tilde{j})$. Since the summation is defined over the entire set, we have that $\sum_{\pi(\tilde{j}) \in \mathcal{S}}[\,\cdot\,] = \sum_{\tilde{j} \in \mathcal{S}}[\,\cdot\,]$. Conclusively, we see that $m[\mathcal{L}_\pi[f]](i) = \mathcal{L}_\pi[m[f]](i)$. Hence, permutation equivariance indeed holds.

Proof of Claim 4.2. Permutation equivariance. If the self-attention formulation provided in Eq. 10 is permutation equivariant, then it must hold that $m[\mathcal{L}_\pi[f], \rho](i) = \mathcal{L}_\pi[m[f, \rho]](i)$. Consider a permuted input signal $\mathcal{L}_\pi[f](i) = f(\pi^{-1}(i))$. The self-attention operation on $\mathcal{L}_\pi[f]$ is given by:
$$m[\mathcal{L}_\pi[f], \rho](i) = \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(\mathcal{L}_\pi[f](i) + \rho(i)), \varphi^{(h)}_{\text{key}}(\mathcal{L}_\pi[f](j) + \rho(j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(\mathcal{L}_\pi[f](j)) \Big)$$
As discussed in Sec. 3.2, since there exist permutations in $S_N$ able to send elements $j$ in $\mathcal{N}(i)$ to elements outside of $\mathcal{N}(i)$, it is trivial to show that Eq. 10 is not equivariant to $S_N$.
Consequently, in order to provide a more interesting analysis, we consider global attention here, i.e., the case where $\mathcal{N}(i) = \mathcal{S}$. As shown for Proposition 4.1, this self-attention instantiation is permutation equivariant. Consequently, by considering this particular case, we are able to explicitly analyze the effect of introducing absolute positional encodings into the self-attention formulation. We then have that:
$$\begin{aligned}
m[\mathcal{L}_\pi[f], \rho](i) &= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{S}} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(f(\pi^{-1}(i)) + \rho(i)), \varphi^{(h)}_{\text{key}}(f(\pi^{-1}(j)) + \rho(j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\pi^{-1}(j))) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\pi(\tilde{j}) \in \mathcal{S}} \sigma_{\pi(\tilde{j})}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i}) + \rho(\pi(\tilde{i}))), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(\pi(\tilde{j}))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\tilde{j} \in \mathcal{S}} \sigma_{\tilde{j}}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i}) + \rho(\pi(\tilde{i}))), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(\pi(\tilde{j}))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Here we have used the substitutions $i = \pi(\tilde{i})$ and $j = \pi(\tilde{j})$. Since the summation is defined over the entire set, we have that $\sum_{\pi(\tilde{j}) \in \mathcal{S}}[\,\cdot\,] = \sum_{\tilde{j} \in \mathcal{S}}[\,\cdot\,]$. Since $\rho(\pi(\tilde{i})) \neq \rho(\tilde{i})$ and $\rho(\pi(\tilde{j})) \neq \rho(\tilde{j})$, we are unable to reduce the expression further towards the form of $m[f, \rho](\tilde{i})$. Consequently, we conclude that absolute position-aware self-attention is not permutation equivariant.

Translation equivariance. If the self-attention formulation provided in Eq. 10 is translation equivariant, then it must hold that $m[\mathcal{L}_y[f], \rho](i) = \mathcal{L}_y[m[f, \rho]](i)$. Consider a translated input signal $\mathcal{L}_y[f](i) = f(x^{-1}(x(i) - y))$. The self-attention operation on $\mathcal{L}_y[f]$ is given by:
$$\begin{aligned}
m[\mathcal{L}_y[f], \rho](i) &= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(\mathcal{L}_y[f](i) + \rho(i)), \varphi^{(h)}_{\text{key}}(\mathcal{L}_y[f](j) + \rho(j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(\mathcal{L}_y[f](j)) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(f(x^{-1}(x(i) - y)) + \rho(i)), \varphi^{(h)}_{\text{key}}(f(x^{-1}(x(j) - y)) + \rho(j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(x^{-1}(x(j) - y))) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i}) + \rho(x^{-1}(x(\tilde{i}) + y))), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(x^{-1}(x(\tilde{j}) + y))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i}) + \rho_P(x(\tilde{i}) + y)), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho_P(x(\tilde{j}) + y)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Here, we have used the substitutions $\tilde{i} = x^{-1}(x(i) - y) \Leftrightarrow i = x^{-1}(x(\tilde{i}) + y)$ and $\tilde{j} = x^{-1}(x(j) - y) \Leftrightarrow j = x^{-1}(x(\tilde{j}) + y)$. Since the area of summation remains equal under any translation $y \in \mathbb{R}^d$, we have:
$$\sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} [\,\cdot\,] = \sum_{x^{-1}(x(\tilde{j})) \in \mathcal{N}(x^{-1}(x(\tilde{i})))} [\,\cdot\,] = \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} [\,\cdot\,].$$
Hence, we can further reduce the expression above as:
$$m[\mathcal{L}_y[f], \rho](i) = \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} \sigma_{\tilde{j}}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i}) + \rho_P(x(\tilde{i}) + y)), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho_P(x(\tilde{j}) + y)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)$$
Since $\rho_P(x(\tilde{i}) + y) \neq \rho_P(x(\tilde{i}))$ and $\rho_P(x(\tilde{j}) + y) \neq \rho_P(x(\tilde{j}))$, we are unable to reduce the expression further towards the form of $m[f, \rho](\tilde{i})$. Consequently, we conclude that the absolute positional encoding does not allow for translation equivariance either.

Proof of Claim 4.3. If the self-attention formulation provided in Eq. 11 is translation equivariant, then it must hold that $m^r[\mathcal{L}_y[f], \rho](i) = \mathcal{L}_y[m^r[f, \rho]](i)$. Consider a translated input signal $\mathcal{L}_y[f](i) = f(x^{-1}(x(i) - y))$. The self-attention operation on $\mathcal{L}_y[f]$ is given by:
$$\begin{aligned}
m^r[\mathcal{L}_y[f], \rho](i) &= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(\mathcal{L}_y[f](i)), \varphi^{(h)}_{\text{key}}(\mathcal{L}_y[f](j) + \rho(i,j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(\mathcal{L}_y[f](j)) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h)}_{\text{qry}}(f(x^{-1}(x(i) - y))), \varphi^{(h)}_{\text{key}}(f(x^{-1}(x(j) - y)) + \rho(i,j)) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(x^{-1}(x(j) - y))) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(x^{-1}(x(\tilde{i}) + y), x^{-1}(x(\tilde{j}) + y))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Here, we have used the substitutions $\tilde{i} = x^{-1}(x(i) - y) \Leftrightarrow i = x^{-1}(x(\tilde{i}) + y)$ and $\tilde{j} = x^{-1}(x(j) - y) \Leftrightarrow j = x^{-1}(x(\tilde{j}) + y)$.
By using the definition of $\rho(i,j)$, we can further reduce the expression above as:
$$\begin{aligned}
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho_P(x(\tilde{j}) + y - (x(\tilde{i}) + y))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho_P(x(\tilde{j}) - x(\tilde{i}))) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} \sigma_{x^{-1}(x(\tilde{j}) + y)}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(\tilde{i}, \tilde{j})) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Since the area of summation remains equal under any translation $y \in \mathbb{R}^d$, we have that:
$$\sum_{x^{-1}(x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(x(\tilde{i}) + y))} [\,\cdot\,] = \sum_{x^{-1}(x(\tilde{j})) \in \mathcal{N}(x^{-1}(x(\tilde{i})))} [\,\cdot\,] = \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} [\,\cdot\,].$$
As a result, we can further reduce the expression above as:
$$\begin{aligned}
m^r[\mathcal{L}_y[f], \rho](i) &= \varphi_{\text{out}}\Big( \bigoplus_{h \in [H]} \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} \sigma_{\tilde{j}}\big( \langle \varphi^{(h)}_{\text{qry}}(f(\tilde{i})), \varphi^{(h)}_{\text{key}}(f(\tilde{j}) + \rho(\tilde{i}, \tilde{j})) \rangle \big)\, \varphi^{(h)}_{\text{val}}(f(\tilde{j})) \Big)\\
&= m^r[f, \rho](\tilde{i}) = m^r[f, \rho](x^{-1}(x(i) - y)) = \mathcal{L}_y[m^r[f, \rho]](i)
\end{aligned}$$
We see that indeed $m^r[\mathcal{L}_y[f], \rho](i) = \mathcal{L}_y[m^r[f, \rho]](i)$. Consequently, we conclude that the relative positional encoding allows for translation equivariance. We emphasize that this is a consequence of the fact that $\rho(x^{-1}(x(\tilde{i}) + y), x^{-1}(x(\tilde{j}) + y)) = \rho(\tilde{i}, \tilde{j})$, $\forall y \in \mathbb{R}^d$. In other words, it comes from the fact that the relative positional encoding is invariant to the action of the translation group.

Proof of Claim 5.1. If the lifting self-attention formulation provided in Eq. 14 is $G$-equivariant, then it must hold that $m^r_{\uparrow G}[\mathcal{L}_g[f], \rho](i, h) = \mathcal{L}_g[m^r_{\uparrow G}[f, \rho]](i, h)$. Consider a $g$-transformed input signal $\mathcal{L}_g[f](i) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](i) = f(x^{-1}(\tilde{h}^{-1}(x(i) - y)))$, with $g = (y, \tilde{h})$, $y \in \mathbb{R}^d$, $\tilde{h} \in H$. The lifting group self-attention operation on $\mathcal{L}_g[f]$ is given by:
$$\begin{aligned}
m^r_{\uparrow G}[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) &= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h')}_{\text{qry}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](i)), \varphi^{(h')}_{\text{key}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](j) + \mathcal{L}_h[\rho](i,j)) \rangle \big)\, \varphi^{(h')}_{\text{val}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](j)) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{j \in \mathcal{N}(i)} \sigma_j\big( \langle \varphi^{(h')}_{\text{qry}}(f(x^{-1}(\tilde{h}^{-1}(x(i) - y)))), \varphi^{(h')}_{\text{key}}(f(x^{-1}(\tilde{h}^{-1}(x(j) - y))) + \mathcal{L}_h[\rho](i,j)) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(x^{-1}(\tilde{h}^{-1}(x(j) - y)))) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{x^{-1}(\tilde{h}x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y))} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y)}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i})), \varphi^{(h')}_{\text{key}}(f(\tilde{j}) + \mathcal{L}_h[\rho](x^{-1}(\tilde{h}x(\tilde{i}) + y), x^{-1}(\tilde{h}x(\tilde{j}) + y))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Here we have used the substitutions $\tilde{i} = x^{-1}(\tilde{h}^{-1}(x(i) - y)) \Leftrightarrow i = x^{-1}(\tilde{h}x(\tilde{i}) + y)$ and $\tilde{j} = x^{-1}(\tilde{h}^{-1}(x(j) - y)) \Leftrightarrow j = x^{-1}(\tilde{h}x(\tilde{j}) + y)$. By using the definition of $\rho(i,j)$, we can further reduce the expression above as:
$$\begin{aligned}
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{x^{-1}(\tilde{h}x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y))} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y)}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i})), \varphi^{(h')}_{\text{key}}(f(\tilde{j}) + \rho_P(h^{-1}(\tilde{h}x(\tilde{j}) + y) - h^{-1}(\tilde{h}x(\tilde{i}) + y))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{x^{-1}(\tilde{h}x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y))} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y)}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i})), \varphi^{(h')}_{\text{key}}(f(\tilde{j}) + \rho_P(h^{-1}\tilde{h}(x(\tilde{j}) - x(\tilde{i})))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{x^{-1}(\tilde{h}x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y))} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y)}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i})), \varphi^{(h')}_{\text{key}}(f(\tilde{j}) + \mathcal{L}_{\tilde{h}^{-1}h}[\rho](\tilde{i}, \tilde{j})) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j})) \Big)
\end{aligned}$$
Since, for unimodular groups, the area of summation remains equal for any $g \in G$, we have that:
$$\sum_{x^{-1}(\tilde{h}x(\tilde{j}) + y) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y))} [\,\cdot\,] = \sum_{x^{-1}(\tilde{h}x(\tilde{j})) \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i})))} [\,\cdot\,] = \sum_{x^{-1}(x(\tilde{j})) \in \mathcal{N}(x^{-1}(x(\tilde{i})))} [\,\cdot\,] = \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} [\,\cdot\,].$$
As a result, we can further reduce the expression above as:
$$\begin{aligned}
m^r_{\uparrow G}[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) &= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\tilde{j} \in \mathcal{N}(\tilde{i})} \sigma_{\tilde{j}}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i})), \varphi^{(h')}_{\text{key}}(f(\tilde{j}) + \mathcal{L}_{\tilde{h}^{-1}h}[\rho](\tilde{i}, \tilde{j})) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j})) \Big)\\
&= m^r_{\uparrow G}[f, \rho](\tilde{i}, \tilde{h}^{-1}h) = m^r_{\uparrow G}[f, \rho](x^{-1}(\tilde{h}^{-1}(x(i) - y)), \tilde{h}^{-1}h) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[m^r_{\uparrow G}[f, \rho]](i, h).
\end{aligned}$$
We see indeed that $m^r_{\uparrow G}[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[m^r_{\uparrow G}[f, \rho]](i, h)$. Consequently, we conclude that the lifting group self-attention operation is group equivariant. We emphasize once more that this is a consequence of the fact that $\mathcal{L}_g[\rho](i,j) = \rho(i,j)$, $\forall g \in G$.
In other words, it comes from the fact that the positional encoding used is invariant to the action of elements $g \in G$.

Proof of Claim 5.2. If the group self-attention formulation provided in Eq. 16 is $G$-equivariant, then it must hold that $m^r_G[\mathcal{L}_g[f], \rho](i, h) = \mathcal{L}_g[m^r_G[f, \rho]](i, h)$. Consider a $g$-transformed input signal $\mathcal{L}_g[f](i, h) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](i, h) = f(x^{-1}(\tilde{h}^{-1}(x(i) - y)), \tilde{h}^{-1}h)$, with $g = (y, \tilde{h})$, $y \in \mathbb{R}^d$, $\tilde{h} \in H$. The group self-attention operation on $\mathcal{L}_g[f]$ is given by:
$$\begin{aligned}
m^r_G[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) &= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\grave{h} \in H} \sum_{(j, \hat{h}) \in \mathcal{N}(i, \grave{h})} \sigma_{j, \hat{h}}\big( \langle \varphi^{(h')}_{\text{qry}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](i, \grave{h})), \varphi^{(h')}_{\text{key}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](j, \hat{h}) + \mathcal{L}_h[\rho]((i, \grave{h}), (j, \hat{h}))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f](j, \hat{h})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\grave{h} \in H} \sum_{(j, \hat{h}) \in \mathcal{N}(i, \grave{h})} \sigma_{j, \hat{h}}\big( \langle \varphi^{(h')}_{\text{qry}}(f(x^{-1}(\tilde{h}^{-1}(x(i) - y)), \tilde{h}^{-1}\grave{h})), \varphi^{(h')}_{\text{key}}(f(x^{-1}(\tilde{h}^{-1}(x(j) - y)), \tilde{h}^{-1}\hat{h}) + \mathcal{L}_h[\rho]((i, \grave{h}), (j, \hat{h}))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(x^{-1}(\tilde{h}^{-1}(x(j) - y)), \tilde{h}^{-1}\hat{h})) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\tilde{h}\grave{h}' \in H} \sum_{(x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y),\, \tilde{h}\grave{h}')} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}'}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i}, \grave{h}')), \varphi^{(h')}_{\text{key}}(f(\tilde{j}, \hat{h}') + \mathcal{L}_h[\rho]((x^{-1}(\tilde{h}x(\tilde{i}) + y), \tilde{h}\grave{h}'), (x^{-1}(\tilde{h}x(\tilde{j}) + y), \tilde{h}\hat{h}'))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j}, \hat{h}')) \Big)
\end{aligned}$$
Here we have used the substitutions $\tilde{i} = x^{-1}(\tilde{h}^{-1}(x(i) - y)) \Leftrightarrow i = x^{-1}(\tilde{h}x(\tilde{i}) + y)$, $\grave{h}' = \tilde{h}^{-1}\grave{h} \Leftrightarrow \grave{h} = \tilde{h}\grave{h}'$, and $\tilde{j} = x^{-1}(\tilde{h}^{-1}(x(j) - y)) \Leftrightarrow j = x^{-1}(\tilde{h}x(\tilde{j}) + y)$, $\hat{h}' = \tilde{h}^{-1}\hat{h} \Leftrightarrow \hat{h} = \tilde{h}\hat{h}'$. By using the definition of $\rho((i, \grave{h}), (j, \hat{h}))$, we can further reduce the expression above as:
$$\begin{aligned}
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\tilde{h}\grave{h}' \in H} \sum_{(x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y),\, \tilde{h}\grave{h}')} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}'}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i}, \grave{h}')), \varphi^{(h')}_{\text{key}}(f(\tilde{j}, \hat{h}') + \rho_P(h^{-1}((\tilde{h}x(\tilde{j}) + y) - (\tilde{h}x(\tilde{i}) + y)),\ (\tilde{h}\grave{h}')^{-1}\tilde{h}\hat{h}')) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j}, \hat{h}')) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\tilde{h}\grave{h}' \in H} \sum_{(x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y),\, \tilde{h}\grave{h}')} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}'}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i}, \grave{h}')), \varphi^{(h')}_{\text{key}}(f(\tilde{j}, \hat{h}') + \rho_P(h^{-1}\tilde{h}(x(\tilde{j}) - x(\tilde{i})),\ \grave{h}'^{-1}\hat{h}')) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j}, \hat{h}')) \Big)\\
&= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\tilde{h}\grave{h}' \in H} \sum_{(x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y),\, \tilde{h}\grave{h}')} \sigma_{x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}'}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i}, \grave{h}')), \varphi^{(h')}_{\text{key}}(f(\tilde{j}, \hat{h}') + \mathcal{L}_{\tilde{h}^{-1}h}[\rho]((\tilde{i}, \grave{h}'), (\tilde{j}, \hat{h}'))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j}, \hat{h}')) \Big)
\end{aligned}$$
Furthermore, since for unimodular groups the area of summation remains equal for any transformation $g \in G$, we have that:
$$\sum_{(x^{-1}(\tilde{h}x(\tilde{j}) + y),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i}) + y),\, \tilde{h}\grave{h}')} [\,\cdot\,] = \sum_{(x^{-1}(\tilde{h}x(\tilde{j})),\, \tilde{h}\hat{h}') \in \mathcal{N}(x^{-1}(\tilde{h}x(\tilde{i})),\, \tilde{h}\grave{h}')} [\,\cdot\,] = \sum_{(x^{-1}(x(\tilde{j})),\, \hat{h}') \in \mathcal{N}(x^{-1}(x(\tilde{i})),\, \grave{h}')} [\,\cdot\,] = \sum_{(\tilde{j}, \hat{h}') \in \mathcal{N}(\tilde{i}, \grave{h}')} [\,\cdot\,].$$
Additionally, we have that $\sum_{\tilde{h}\grave{h}' \in H} [\,\cdot\,] = \sum_{\grave{h}' \in H} [\,\cdot\,]$. As a result, we can further reduce the expression above as:
$$\begin{aligned}
m^r_G[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) &= \varphi_{\text{out}}\Big( \bigoplus_{h' \in [H]} \sum_{\grave{h}' \in H} \sum_{(\tilde{j}, \hat{h}') \in \mathcal{N}(\tilde{i}, \grave{h}')} \sigma_{\tilde{j}, \hat{h}'}\big( \langle \varphi^{(h')}_{\text{qry}}(f(\tilde{i}, \grave{h}')), \varphi^{(h')}_{\text{key}}(f(\tilde{j}, \hat{h}') + \mathcal{L}_{\tilde{h}^{-1}h}[\rho]((\tilde{i}, \grave{h}'), (\tilde{j}, \hat{h}'))) \rangle \big)\, \varphi^{(h')}_{\text{val}}(f(\tilde{j}, \hat{h}')) \Big)\\
&= m^r_G[f, \rho](\tilde{i}, \tilde{h}^{-1}h) = m^r_G[f, \rho](x^{-1}(\tilde{h}^{-1}(x(i) - y)), \tilde{h}^{-1}h) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[m^r_G[f, \rho]](i, h).
\end{aligned}$$
We see that indeed $m^r_G[\mathcal{L}_y \mathcal{L}_{\tilde{h}}[f], \rho](i, h) = \mathcal{L}_y \mathcal{L}_{\tilde{h}}[m^r_G[f, \rho]](i, h)$. Consequently, we conclude that the group self-attention operation is group equivariant. We emphasize once more that this is a consequence of the fact that $\mathcal{L}_g[\rho]((i, \grave{h}), (j, \hat{h})) = \rho((i, \grave{h}), (j, \hat{h}))$, $\forall g \in G$. In other words, it comes from the fact that the positional encoding used is invariant to the action of elements $g \in G$.
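The equivariance arguments above can also be checked numerically on a toy instance. The sketch below (not from the paper) verifies the translation-equivariance statement of Claim 4.3 for a single-head self-attention with relative positional encodings on a 1-D cyclic set, where translations are exact circular shifts; the projection matrices and the encoding table `rho_P` are random placeholders rather than learned parameters.

```python
# A small numerical check (not from the paper) of the translation-equivariance
# statement of Claim 4.3, on a toy 1-D cyclic set S = {0, ..., N-1} with a single
# head and a relative positional encoding rho(i, j) = rho_P[(j - i) mod N] added
# to the keys. W_qry, W_key, W_val and rho_P are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, C = 8, 5                                        # set size and channel dimension
W_qry, W_key, W_val = (rng.standard_normal((C, C)) for _ in range(3))
rho_P = rng.standard_normal((N, C))                # one encoding per relative offset

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rel_self_attention(f):
    """Single-head self-attention with relative positional encodings (cf. Eq. 11)."""
    out = np.zeros_like(f)
    for i in range(N):
        q = W_qry @ f[i]
        keys = np.stack([W_key @ (f[j] + rho_P[(j - i) % N]) for j in range(N)])
        att = softmax(keys @ q)                    # sigma_j(<qry, key>)
        out[i] = sum(att[j] * (W_val @ f[j]) for j in range(N))
    return out

f = rng.standard_normal((N, C))
y = 3                                              # a cyclic translation
Ly_f = np.roll(f, y, axis=0)                       # L_y[f](i) = f(i - y)
lhs = rel_self_attention(Ly_f)                     # attend after translating the input
rhs = np.roll(rel_self_attention(f), y, axis=0)    # translate the attended output
print(np.allclose(lhs, rhs))                       # True: m^r[L_y f, rho] = L_y[m^r[f, rho]]
```

Replacing `rho_P[(j - i) % N]` with an absolute encoding `rho_P[j]` makes the check fail, mirroring the negative result of Claim 4.2.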