# Gauge Equivariant Transformer

Lingshen He¹, Yiming Dong¹, Yisen Wang¹,², Dacheng Tao⁴, Zhouchen Lin¹,²,³

¹Key Laboratory of Machine Perception (MOE), School of Artificial Intelligence, Peking University
²Institute for Artificial Intelligence, Peking University
³Pazhou Lab, Guangzhou 510330
⁴JD Explore Academy, JD.com

lingshenhe@pku.edu.cn, yimingdong_ml@outlook.com, yisen.wang@pku.edu.cn, dacheng.tao@gmail.com, zlin@pku.edu.cn

## Abstract

The attention mechanism has shown great performance and efficiency in many deep learning models, in which relative position encoding plays a crucial role. However, when introducing attention to manifolds, there is no canonical local coordinate system to parameterize neighborhoods. To address this issue, we propose an equivariant transformer that is agnostic to the orientation of local coordinate systems (i.e., gauge equivariant) and employs multi-head self-attention to jointly incorporate both position-based and content-based information. To enhance expressive ability, we adopt regular fields of cyclic groups as feature fields in intermediate layers, and propose a novel method to parallel transport the feature vectors in these fields. In addition, we project the position vector of each point onto its local coordinate system to disentangle the orientation of the coordinate system in ambient space (i.e., the global coordinate system), achieving rotation invariance. To the best of our knowledge, we are the first to introduce gauge equivariance to self-attention, and thus name our model Gauge Equivariant Transformer (GET), which can be efficiently implemented on triangle meshes. Extensive experiments show that GET achieves state-of-the-art performance on two common recognition tasks.

## 1 Introduction

Recently, Transformer has dominated the area of Natural Language Processing [48]. Its key advantage over previous methods is its ability to attend to the most relevant part of a given context. This is largely attributed to its self-attention operator, which computes the similarity between representations of words in sequences in the form of attention scores. Because of this superiority, researchers have started to apply Transformer to other learning areas, including Computer Vision [26, 53, 16, 59] and Graphs [49]. In this work, we aim to apply Transformer to manifolds. Unlike regular data, such as images, where each neighbor has a clearly quantified relative position to its center in a canonical coordinate system, irregular data do not have a uniquely defined local coordinate system for the neighbors. This results in the problem of orientation ambiguity, which directly prevents the Transformer from numerically encoding relative position information. Several works have been proposed to deal with the rotation ambiguity problem, among which a promising direction is to exploit gauge equivariance. However, most of them are not invariant to rotations of the global coordinate system, and all of them are built on convolution, i.e., they attend equally to neighboring points and neglect content-based information. It is therefore desirable to propose a gauge equivariant transformer that also supports rotation invariance.
In this paper, we propose the Gauge Equivariant Transformer (GET for short), which employs multi-head self-attention to simultaneously utilize position-based and content-based information, and which is both gauge equivariant and rotation invariant. To achieve rotation invariance, we first project xyz coordinates in the global coordinate system onto a local coordinate frame, and then design equivariant transformers to overcome the orientation ambiguity of local coordinate systems. We adopt the regular field proposed in [13] as the feature field of intermediate layers, since the regular representation commutes with element-wise activation functions. We then propose a novel method to accommodate parallel transport of feature vectors in regular fields for any rotation angle. Since we adopt regular fields in intermediate layers, we make a relaxation: exact equivariance is guaranteed for gauge transformations at angles that are multiples of $2\pi/N$, and an equivariance error bound is derived for all other angles. In experiments, our model shows better performance and greater parameter efficiency than all baseline methods. Our contributions can be summarized as follows:

- We propose GET, which incorporates attention and achieves both gauge equivariance and rotation invariance with superior expressive power. GET is mathematically proven to be exactly equivariant at angles that are multiples of $2\pi/N$ ($N \in \mathbb{N}$), and an equivariance error bound is derived for other angles to guarantee an overall approximate equivariance property.
- We carefully design the model input to ensure that it is independent of the global coordinate system, depending only on the choice of gauge. Our model achieves rotation invariance with the assistance of gauge equivariance.
- We propose a novel method to parallel transport feature vectors in the regular field by extending the representation of a cyclic group to the rotation group of arbitrary angles. Compared to previous methods using truncation or interpolation, our extension preserves more geometric information.
- We elevate model performance by designing a new approach that incorporates Taylor expansion in solving the equivariance constraint, yielding better approximation ability in local neighborhoods.
- We confirm the superiority of our model via extensive experiments. Our model outperforms HSN on the SHREC dataset by 3.1% accuracy, and outperforms MeshCNN on the Human Body Segmentation dataset by 0.3% accuracy with far fewer parameters, presenting state-of-the-art performance.

## 2 Related Work

**Geometric Deep Learning.** Geometric deep learning is an emerging field concerned with adapting neural networks to various data types [7], especially irregular data. For modeling curved surfaces, common methods include view-based methods [46, 61, 51] and volumetric methods [32, 40, 52]. To boost efficiency, some works define convolution directly on point clouds [37, 38], but they are vulnerable to pose changes since the coordinate inputs depend on the global coordinate system. It is thus highly desirable to develop models that rely solely on geometric information of surfaces. Approaches that merely utilize intrinsic information of surfaces are called intrinsic methods. They use local parameterizations to assign each neighboring point a coordinate for information aggregation.
A seminal work is Geodesic CNNs [31], which parameterize each local neighborhood with an exponential map and take the maximum response across multiple choices of local coordinate orientation. Taking the maximum response discards the orientation information of feature maps; as an alternative, aligning the local coordinate system with the principal curvature direction is another way to deal with the ambiguity problem [33, 6]. But this approach can only be applied in limited cases, as the curvature direction may be ill-defined at some points, or even over whole areas, of curved surfaces. MDGCNN [35] and PFCNN [60] describe features by so-called directional functions, but both adopt scalar equivariant kernels, resulting in limited expressive power.

**Equivariant Deep Learning.** The success of CNNs has been attributed to translation equivariance, which has inspired researchers to implement more powerful equivariant models, including equivariance to planar rotation [12, 15, 13, 58, 54, 44, 25], 3D rotation [57, 19, 47, 36, 55, 35, 28, 2, 39], sphere rotation [9, 17, 34, 18], and so on. All the above works concern equivariance on homogeneous spaces [29, 10]. Cohen et al. [11] further extend equivariance to manifolds, identifying a new type of equivariance called gauge equivariance. The models in [56, 14] are successful extensions of gauge equivariant CNNs to mesh surfaces; however, they suffer from changes in the orientation of the global coordinate system. There are also works on equivariant attention. Romero et al. [43] propose co-attentive equivariant networks, which effectively attend to co-occurring transformations. Romero et al. [41] further propose attentive group equivariant convolutional networks. Besides these, transformers have been applied to group equivariant networks: Fuchs et al. [21] via irreducible representations, Hutchinson et al. [27] via Lie algebra, and Romero et al. [42] via a generalization of position encodings. All the models above are equivariant to symmetry groups, while gauge equivariant attention is still lacking.

## 3 Preliminaries

Unlike regular data, in which coordinates (or pixels) are aligned in a global frame, there is no such canonical frame on general manifolds. To begin with, we briefly review and define some mathematical concepts.

### 3.1 Basic Definitions

We restrict our attention to 2D manifolds in 3D Euclidean space. Consider a 2D smooth orientable manifold $M$. For a point $p$ in $M$, denote its tangent plane as $T_pM$. Each point in $T_pM$ can be associated with a coordinate by specifying a coordinate system. Namely, we can parameterize the tangent plane $T_pM$ with a pointwise linear mapping $w_p: \mathbb{R}^2 \to T_pM$, which is defined as the gauge $w$ at point $p$ [11]. The gauge of the manifold $M$ is the set containing the gauges at every point of $M$.

For planar data, a feature map is the set of features located at different positions on a plane. Similarly, a feature field on a surface is a set of geometric quantities at different positions of the surface. Note that these two concepts are similar but not the same. From the perspective of geometric deep learning, a feature map is defined as the numerical values of geometric quantities, which may be gauge dependent, while a feature field refers to the geometric quantities themselves, which are gauge independent. For example, each point of the surface can be assigned a tangent vector as its feature vector, all of which form a feature field.
As shown in Figure 1, the tangent vector $v$ itself is a geometric quantity, which stays the same regardless of the gauge selection but takes different numerical values in different gauges following an underlying rule. We use $f$ to denote a feature field on a manifold; $f_w: M \to \mathbb{R}^n$ denotes the feature map under the gauge $w$, and $f_w(p)$ denotes the feature map evaluated at point $p$.

Different gauges can be linked by gauge transformations. The gauge transformation at point $p$ is a frame transformation $g_p \in SO(2)$, where $SO(2)$ is the special orthogonal group consisting of all 2D rotation matrices. A new gauge $w'_p$ can be produced by applying the gauge transformation $g_p$ to the original gauge $w_p$, i.e., $w'_p = g_p \cdot w_p$. Gauge transformations are usually characterized by group representations. A group representation is a mapping $\rho: G \to GL(n, \mathbb{R})$, where $GL(n, \mathbb{R})$ is the group of invertible $n \times n$ matrices, and $\rho$ satisfies $\rho(g_1)\rho(g_2) = \rho(g_1 g_2)$, where $g_1, g_2 \in G$ are group elements, $g_1 g_2$ is the group product, and $\rho(g_1)\rho(g_2)$ is matrix multiplication. Therefore, after applying the gauge transformation $g_p$, the feature vector value $f_w(p)$ transforms to $f_{w'}(p) = \rho(g_p^{-1}) f_w(p)$. Here $\rho$ is a group representation of $SO(2)$, called the type of the feature vector. If all the feature vectors share the same type $\rho$, the feature field is called a $\rho$-field and $\rho$ is called the representation type of the field. The above definitions also hold at the manifold level, i.e., $f_{w'} = \rho(g^{-1}) f_w$. The notation $k\rho$, where $k$ is a positive integer, refers to the group representation whose output is a block-diagonal matrix with $k$ blocks, each equal to $\rho$. In particular, if the representation of a feature field is $\rho(g) = 1$, the feature field is a scalar field, and the representation is denoted $\rho_0$.

### 3.2 Gauge Equivariance

Consider a function $\varphi$ whose input is a feature map $f_w$, where $f$ is a $\rho_{in}$-field. In order for $\varphi$ to be gauge equivariant, its output $\tilde{f}_w$ should also be a feature map, where $\tilde{f}$ is a $\rho_{out}$-field. When $\varphi$ is a layer of a neural network, gauge equivariance implies that $\varphi$ does not rely on the gauge in the forward process. Suppose that two gauges $w$ and $w'$ are linked by a gauge transformation $g$: $w' = g \cdot w$. We have $f_{w'} = \rho_{in}(g^{-1}) f_w$ since $f$ is a $\rho_{in}$-field. Gauge equivariance means that the outputs $\tilde{f}_w = \varphi[f_w]$ and $\tilde{f}_{w'} = \varphi[f_{w'}]$ are linked by the $\rho_{out}$ representation of the same transformation $g$, i.e., $\tilde{f}_{w'} = \rho_{out}(g^{-1}) \tilde{f}_w$. Finally, we get

$$\rho_{out}(g^{-1})\, \varphi[f_w] = \varphi\big[\rho_{in}(g^{-1}) f_w\big]. \tag{1}$$

To sum up, a function $\varphi$ is gauge equivariant if the above equation holds for any feature field $f$, gauge $w$, and transformation $g$.
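To make these transformation rules concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper): it instantiates the regular representation of the cyclic group $C_N$ that is used later in Section 4.2, checks the homomorphism property $\rho(g_1)\rho(g_2) = \rho(g_1 g_2)$, and shows how a feature vector's numerical value changes under a gauge transformation.

```python
import numpy as np

def rho_reg(N, k):
    # Regular representation of C_N: a cyclic permutation matrix that
    # shifts the N coordinates of a feature vector by k steps.
    return np.roll(np.eye(N), k, axis=0)

N = 5
# Homomorphism property: rho(g1) rho(g2) = rho(g1 g2).
assert np.allclose(rho_reg(N, 2) @ rho_reg(N, 3), rho_reg(N, (2 + 3) % N))

# A gauge transformation g_p changes only the numerical value of the
# feature vector at p, not the geometric quantity itself:
f_w = np.arange(N, dtype=float)    # value of f(p) under the gauge w
f_w_prime = rho_reg(N, -1) @ f_w   # value under w' = g_p . w, i.e. rho(g_p^{-1}) f_w
```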
### 3.3 Riemannian Exponential Map

Transformers require encoding relative positions to propagate information. Note that images also rely on a local point parameterization, one so natural that it goes unnoticed. For general manifolds, it is non-trivial to establish a parameterization criterion, at least in a local frame. Among the many charting-based methods, the most widely used is the Riemannian exponential map $\exp_p: T_pM \to M$ at point $p$, which maps the tangent plane to the surface. For a coordinate vector $v \in T_pM$, the output of the Riemannian exponential map is obtained by moving the point $p$ in the direction $v$ along the geodesic curve for a distance of $\|v\|$. Denoting the arrival point as $q$, we have $\exp_p(v) = q$. Figure 1 visualizes the exponential map as well as some basic definitions introduced in Section 3.1. According to the inverse function theorem, $\exp_p$ is a local diffeomorphism and thus avoids metric distortion at the point $p$. The inverse of the Riemannian exponential map is the logarithmic map $\log_p: M \to T_pM$. Under the gauge $w_p$, every point $q$ in the neighborhood of $p$ is associated with the coordinate $w_p^{-1} \log_p(q)$.

Figure 1: Illustration of basic definitions and the Riemannian exponential map. Here, $w_p$ (black) and $w'_p$ (blue) are two gauges on the tangent plane $T_pM$, linked by the gauge transformation $g_p$. The coordinate of $v$ takes different numerical values under $w_p$ and $w'_p$, as illustrated in the lower part. The exponential map assigns each vector $v$ in $T_pM$ to a corresponding point $q$ on the surface $M$.

### 3.4 Parallel Transport

The self-attention operation is essentially an aggregation of local neighboring features. However, the feature vectors of different points live in different spaces, so they need to be parallel transported to the same feature space before being processed. For a tangent vector $s$ at point $q$, we parallel transport it along the geodesic curve to another point $p$ with respect to the Levi-Civita connection [8], which preserves the norm of the vector. The Levi-Civita connection gives an isometry from $T_qM$ to $T_pM$ and determines the parallel transport of $s$; see Figure 2. In a gauge $w$, the parallel transport of a tangent vector corresponds to a 2D rotation $g^w_{q \to p} \in SO(2)$, which contains the relative orientation of the gauges in the neighborhood. For a general feature vector of type $\rho$, parallel transport can be expressed as $s'_w = \rho(g^w_{q \to p})\, s_w$.

Figure 2: Parallel transport. The tangent vector $s$ is parallel transported from $q$ to $p$, resulting in a new vector $s'$ at point $p$. The numerical value change imposed by parallel transport is jointly determined by the geometric property of the surface, the Levi-Civita connection, and the underlying gauge $w$.

### 3.5 Self-attention

Attention enables a model to selectively concentrate on the most relevant parts based on their content information [48, 53, 4, 22]. Consider a set of tokens $t = \{t_1, t_2, \dots, t_T\}$, where $t_i \in \mathbb{R}^F$. Attention is composed of three parts, namely query, key, and value, denoted by $Q: \mathbb{R}^F \to \mathbb{R}^{F_Q}$, $K: \mathbb{R}^F \to \mathbb{R}^{F_K}$, and $V: \mathbb{R}^F \to \mathbb{R}^{F_V}$, respectively. When $Q$, $K$, and $V$ are computed from the same source, it is called self-attention. When there are multiple sets of $Q$, $K$, and $V$, it becomes multi-head attention. The output of a multi-head self-attention transformer at node $i$ is a linear transformation of the concatenation of the outputs of all heads:

$$\mathrm{MHSA}(t)_i = W_M \bigoplus_h \mathrm{SA}(t)^{(h)}_i, \tag{2}$$

where $\bigoplus$ is the vector concatenation operator. The single-head attention output at head $h$ is

$$\mathrm{SA}(t)^{(h)}_i = \sum_{j=1}^{T} \alpha^{(h)}_{ij}\, V^{(h)}(t_j), \tag{3}$$

where $V^{(h)}$ is the value function at head $h$, and $\alpha^{(h)}_{ij}$ is the attention score computed by

$$\alpha^{(h)}_{ij} = \frac{S\big(K^{(h)}(t_i),\, Q^{(h)}(t_j)\big)}{\sum_{j'=1}^{T} S\big(K^{(h)}(t_i),\, Q^{(h)}(t_{j'})\big)}, \tag{4}$$

where $K^{(h)}$, $Q^{(h)}$, and $S$ are the key function, query function, and score function, respectively.
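For reference, the following NumPy sketch implements Eqns. (2)-(4) for a plain token set. It uses the exponentiated dot product as a common stand-in for the generic score function $S$ (with that choice, the normalization in Eqn. (4) is exactly a softmax); GET's gauge invariant score appears later in Eqn. (20), and all names here are ours.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(t, W_Q, W_K, W_V, W_M):
    """t: (T, F) tokens; W_Q/W_K/W_V: per-head projection lists; W_M: mixing matrix."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = t @ Wq.T, t @ Wk.T, t @ Wv.T
        # Eqn (4) with S(k, q) = exp(k.q / sqrt(d)); scores[i, j] pairs K(t_i)
        # with Q(t_j) as in the paper, normalized over j.
        alpha = softmax(K @ Q.T / np.sqrt(Q.shape[1]), axis=-1)
        heads.append(alpha @ V)                      # Eqn (3): weighted sum of values
    return np.concatenate(heads, axis=-1) @ W_M.T    # Eqn (2): concatenate heads, mix

# Usage: 7 tokens of dimension 4, 2 heads of width 3, output width 5.
rng = np.random.default_rng(0)
t = rng.standard_normal((7, 4))
W_Q = [rng.standard_normal((3, 4)) for _ in range(2)]
W_K = [rng.standard_normal((3, 4)) for _ in range(2)]
W_V = [rng.standard_normal((3, 4)) for _ in range(2)]
W_M = rng.standard_normal((5, 6))
out = multi_head_self_attention(t, W_Q, W_K, W_V, W_M)  # shape (7, 5)
```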
## 4 The Proposed GET

### 4.1 Gauge Equivariant Self-Attention Layers

Suppose that the dimensions of the input feature field $f$ and the output feature field $\tilde{f}$ are $C_{in}$ and $C_{out}$, respectively. We define the gauge equivariant multi-head self-attention output at point $p$ under the gauge $w$ as

$$\tilde{f}_w(p) = \mathrm{MHSA}(f)_w(p) = W_M \bigoplus_h \mathrm{SA}(f)^{(h)}_w(p), \tag{5}$$

where $W_M$ is the linear transformation matrix. At head $h$, the output is defined as

$$\mathrm{SA}(f)^{(h)}_w(p) = \int_{\|u\| < \sigma} \alpha(f)^{(h)}_{p, q_u}\, V^{(h)}_u\big(f'_w(q_u)\big)\, du, \tag{6}$$

where $u = (u_1, u_2)^T \in \mathbb{R}^2$, $q_u = \exp_p w_p(u)$, and $f'_w(q_u)$ is the numerical value of the feature vector parallel transported from point $q_u$ to point $p$ under the gauge $w$. $V_u$ is the value function incorporating the position information $u$ through an encoding matrix $W_V(u) \in \mathbb{R}^{C_{out} \times C_{in}}$, i.e.,

$$f'_w(q_u) = \rho_{in}(g^w_{q_u \to p})\, f_w(q_u), \qquad V_u(f'_w(q_u)) = W_V(u)\, f'_w(q_u). \tag{7}$$

$\alpha$ is the attention score incorporating the content information, computed as

$$\alpha(f)^{(h)}_{p, q_u} = \frac{S\big(K^{(h)}(f_w(p)),\, Q^{(h)}(f'_w(q_u))\big)}{\int_{\|v\| < \sigma} S\big(K^{(h)}(f_w(p)),\, Q^{(h)}(f'_w(q_v))\big)\, dv}. \tag{8}$$

We propose to enforce the value function to be gauge equivariant and the attention score to be gauge invariant, which together make the attention layer gauge equivariant. The details of their construction are presented in Sections 4.3 and 4.4, respectively.

### 4.2 Extension of the Regular Representation

In our model, the feature fields in the intermediate layers are all regular fields (i.e., their type is the regular representation). The regular representation is a special group representation of the cyclic group $C_N$. If we use $\Theta_k$ to denote the rotation matrix with angle $2k\pi/N$, then $C_N$ can be expressed as $C_N = \{\Theta_0, \Theta_1, \dots, \Theta_{N-1}\}$. For $k = 0, 1, \dots, N-1$, the regular representation $\rho^{C_N}_{reg}(\Theta_k)$ is an $N \times N$ cyclic permutation matrix that shifts the coordinates of feature vectors by $k$ steps.

The regular representation provides transformation matrices for rotations by multiples of $2\pi/N$, but feature vectors can undergo any rotation in $SO(2)$ during parallel transport. Figure 3 illustrates this issue with an example in $\mathbb{R}^5$ with respect to $\rho^{C_5}_{reg}$. We propose to extend the regular representation of $C_N$ by finding an orthogonal representation $\rho_N$ of $SO(2)$ that behaves the same as the regular representation on every element of $C_N$, i.e.,

$$\forall\, \Theta \in C_N, \quad \rho_N(\Theta) = \rho^{C_N}_{reg}(\Theta). \tag{9}$$

Figure 3: Illustration of the reason for the extension. $f(q)$ is a feature vector of type $\rho^{C_5}_{reg}$, which takes the numerical value $f_w(q) \in \mathbb{R}^5$ under the gauge $w_q$. Applying a gauge transformation with angle $2\pi/5$ to $w_q$, $f(q)$ takes another value $f_{w'}(q)$, which is a permutation of $f_w(q)$. The problem is what value $f(q)$ takes after it is parallel transported to point $p$.

As $\rho^{C_N}_{reg}$ takes different forms for odd and even $N$, Theorem 1 shows that only odd $N$ is valid in our model.

**Theorem 1** *(i) If $N$ is even, there is no real representation $\rho_N$ of $SO(2)$ that satisfies Eqn. (9). (ii) If $N$ is odd, there is a unique representation $\rho_N$ of $SO(2)$ that satisfies Eqn. (9). (iii) The representation $\rho_N$ in (ii) is an orthogonal representation.*

Here we only show our method for constructing the $\rho_N$ of Theorem 1. According to group representation theory, the regular representation $\rho^{C_N}_{reg}$ can be decomposed into irreducible representations (irreps for short), i.e.,

$$\rho^{C_N}_{reg}(\Theta) = A\, \mathrm{diag}\big(\phi_0(\Theta), \phi_1(\Theta), \dots, \phi_{\frac{N-1}{2}}(\Theta)\big)\, A^{-1}, \tag{10}$$

where $\phi_0, \dots, \phi_{(N-1)/2}$ are the irreps of $C_N$ and $A \in GL(N, \mathbb{R})$. For odd $N$, the irreps of $C_N$ take the following form:

$$\forall\, \Theta \in C_N, \quad \phi_0(\Theta) = 1, \quad \phi_k(\Theta) = \begin{pmatrix} \cos(k\theta) & -\sin(k\theta) \\ \sin(k\theta) & \cos(k\theta) \end{pmatrix}, \tag{11}$$

where $\theta \in [0, 2\pi)$ is the rotation angle of the matrix $\Theta$, i.e.,

$$\Theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \tag{12}$$

and $k = 1, \dots, \frac{N-1}{2}$. We extend the irreps to $SO(2)$ as

$$\forall\, \Theta \in SO(2), \quad \tilde{\phi}_0(\Theta) = 1, \quad \tilde{\phi}_k(\Theta) = \begin{pmatrix} \cos(k\theta) & -\sin(k\theta) \\ \sin(k\theta) & \cos(k\theta) \end{pmatrix}, \tag{13}$$

where $k = 1, \dots, \frac{N-1}{2}$. By substituting the $\phi$'s in Eqn. (10) with the $\tilde{\phi}$'s, we get, for $\Theta \in SO(2)$,

$$\rho_N(\Theta) = A\, \mathrm{diag}\big(\tilde{\phi}_0(\Theta), \tilde{\phi}_1(\Theta), \dots, \tilde{\phi}_{\frac{N-1}{2}}(\Theta)\big)\, A^{-1}. \tag{14}$$

Obviously the representation $\rho_N$ satisfies condition (9). In this way, one can apply $\rho_N(g^w_{q \to p})$ to feature vectors of regular fields during parallel transport.
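This construction can be checked numerically. The sketch below is our own: it takes $A$ to be the real Fourier change of basis, which diagonalizes the cyclic shift into the irrep blocks (any valid choice of $A$ yields the same $\rho_N$, since $\rho_N$ is unique by Theorem 1), builds $\rho_N(\theta)$ from the extended irreps of Eqn. (13), and verifies Eqn. (9) together with the homomorphism property.

```python
import numpy as np

def rot2(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def fourier_basis(N):
    # Real Fourier basis of R^N (N odd): one constant column, then a
    # (cos, sin) column pair for each frequency k = 1, ..., (N-1)/2.
    n = np.arange(N)
    cols = [np.ones(N) / np.sqrt(N)]
    for k in range(1, (N - 1) // 2 + 1):
        cols.append(np.sqrt(2.0 / N) * np.cos(2 * np.pi * k * n / N))
        cols.append(np.sqrt(2.0 / N) * np.sin(2 * np.pi * k * n / N))
    return np.stack(cols, axis=1)  # orthogonal, so A^{-1} = A^T

def rho_N(theta, N):
    # Eqn (14): block-diagonal extended irreps (Eqn (13)) conjugated by A.
    A = fourier_basis(N)
    D = np.zeros((N, N))
    D[0, 0] = 1.0                                  # trivial irrep phi_0
    for k in range(1, (N - 1) // 2 + 1):
        D[2*k-1:2*k+1, 2*k-1:2*k+1] = rot2(k * theta)
    return A @ D @ A.T

N = 5
shift = np.roll(np.eye(N), 1, axis=0)  # regular rep of the generator of C_5
assert np.allclose(rho_N(2 * np.pi / N, N), shift)                # Eqn (9)
assert np.allclose(rho_N(0.3, N) @ rho_N(0.4, N), rho_N(0.7, N))  # homomorphism

# Parallel transport of a regular-field feature by an arbitrary angle:
s_w = np.random.randn(N)
s_transported = rho_N(0.7, N) @ s_w  # no truncation or interpolation needed
```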
### 4.3 Gauge Equivariant Value Function

Inspired by [11], we choose the value function to be the numerical value of the parallel transported feature vector multiplied by the value encoding matrix. For the value function to be gauge equivariant, the necessary and sufficient condition is that

$$W_V(\Theta^{-1} u) = \rho_{out}(\Theta^{-1})\, W_V(u)\, \rho_{in}(\Theta) \tag{15}$$

always holds for any $\Theta \in SO(2)$. We propose a practical method to solve Eqn. (15). We first expand $W_V$ into a Taylor series:

$$W_V(u) = W_0 + W_1 u_1 + W_2 u_2 + W_3 u_1^2 + W_4 u_1 u_2 + W_5 u_2^2 + \cdots, \tag{16}$$

where the $W_i \in \mathbb{R}^{C_{out} \times C_{in}}$ ($i = 0, 1, \dots$) are the Taylor coefficients. Since we adopt the regular representation in this paper, Eqn. (15) only needs to hold for $\Theta \in C_N$. Plugging Eqn. (16) into Eqn. (15) and comparing coefficients, the $W_i$'s need to satisfy, for any $\Theta \in C_N$,

$$W_0 = \rho_{out}(\Theta^{-1})\, W_0\, \rho_{in}(\Theta), \tag{17a}$$
$$\cos(\theta)\, W_1 - \sin(\theta)\, W_2 = \rho_{out}(\Theta^{-1})\, W_1\, \rho_{in}(\Theta), \tag{17b}$$
$$\sin(\theta)\, W_1 + \cos(\theta)\, W_2 = \rho_{out}(\Theta^{-1})\, W_2\, \rho_{in}(\Theta), \tag{17c}$$

and so on for higher orders. To deal with the infinitely many terms in Eqn. (16), we simply truncate the Taylor series. We use the second-order Taylor expansion and omit higher-order terms, i.e.,

$$W_V(u) \approx W_0 + W_1 u_1 + W_2 u_2 + W_3 u_1^2 + W_4 u_1 u_2 + W_5 u_2^2. \tag{18}$$

It is worth emphasizing that truncation does not affect the equivariance property at all, because the equations in (17) only couple coefficients of the same order: Eqn. (17a) is the constraint on $W_0$ at order 0, Eqns. (17b) and (17c) are the constraints on $W_1$ and $W_2$ at order 1, and three more equations in (17) constrain $W_3$, $W_4$, and $W_5$ at order 2. This coupling property allows us not only to solve the equations in (17) in separate groups, but also to truncate Eqn. (16) without affecting equivariance. After truncation, we obtain a set of solution bases of Taylor coefficients $\{\widehat{W}^{(1)}, \dots, \widehat{W}^{(m)}\}$ by solving the first six linear equations in (17), which separate into three independent groups, where $m$ is the dimension of the solution space. Each $\widehat{W}^{(i)}$ is a tuple of six components $\widehat{W}^{(i)}_0, \dots, \widehat{W}^{(i)}_5$. The details of solving the linear equations are provided in the supplementary materials. The equivariant matrix basis $\widehat{W}^{(i)}$ then takes the form

$$\widehat{W}^{(i)}(u) = \widehat{W}^{(i)}_0 + \widehat{W}^{(i)}_1 u_1 + \widehat{W}^{(i)}_2 u_2 + \widehat{W}^{(i)}_3 u_1^2 + \widehat{W}^{(i)}_4 u_1 u_2 + \widehat{W}^{(i)}_5 u_2^2, \tag{19}$$

which satisfies Eqn. (15) for all $u$. Any linear combination $\sum_i c_i \widehat{W}^{(i)}$ still satisfies Eqn. (15), and the $c_i$'s can be set as learnable parameters during training. With $W_V = \sum_i c_i \widehat{W}^{(i)}$, the value function in Eqn. (7) is exactly equivariant to gauge transformations at multiples of $2\pi/N$.

Remarkably, our method of solving the equivariance constraint Eqn. (15) is very general: the solution process applies to any group and any representations $\rho_{in}$ and $\rho_{out}$. In particular, it avoids deriving analytic solutions when the group is complex, e.g., a high-dimensional orthogonal group. In addition, compared to the Fourier series used in [54], a Taylor series is a better approximation in local neighborhoods: the omitted Taylor terms in Eqn. (18) are $O(\sigma^3)$, which is negligible when the radius $\sigma$ is small. GET can therefore achieve the same performance with fewer parameters. We also avoid selecting radial profiles, which would introduce extra hyperparameters.
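The exact solution procedure is given in the supplementary materials; as an illustration of the idea, the sketch below is our own. It assumes $\rho_{in} = \rho_{out} = \rho^{C_N}_{reg}$, vectorizes the order-1 constraints (17b)-(17c), enforces them at the generator of $C_N$ (which suffices, since both sides of the constraint are homomorphisms in $\Theta$), and reads a solution basis off the SVD nullspace.

```python
import numpy as np

def perm(N, k):
    # Regular representation of C_N: cyclic shift by k steps.
    return np.roll(np.eye(N), k, axis=0)

def nullspace(M, tol=1e-10):
    _, s, Vt = np.linalg.svd(M)
    return Vt[np.sum(s > tol):].T  # columns span {v : M v = 0}

def order1_solution_basis(N):
    theta = 2 * np.pi / N
    rho_out_inv = perm(N, -1)      # rho_out(Theta^{-1}) at the generator
    rho_in = perm(N, 1)            # rho_in(Theta) at the generator
    # Row-major vec: vec(A W B) = (A kron B^T) vec(W).
    T = np.kron(rho_out_inv, rho_in.T)
    I = np.eye(N * N)
    # Stack v = [vec(W1); vec(W2)]; Eqns (17b)-(17c) become M v = 0 with
    # M = [[cos(t) I - T, -sin(t) I], [sin(t) I, cos(t) I - T]].
    M = np.block([[np.cos(theta) * I - T, -np.sin(theta) * I],
                  [np.sin(theta) * I, np.cos(theta) * I - T]])
    B = nullspace(M)
    # Each nullspace vector yields one coefficient pair (W1, W2).
    return [(b[:N * N].reshape(N, N), b[N * N:].reshape(N, N)) for b in B.T]

basis = order1_solution_basis(5)   # learnable combinations give W_V's order-1 part
```

The order-0 constraint (17a) and the three order-2 constraints can be handled the same way, each as an independent linear system, which is exactly the decoupling-by-order property noted above.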
### 4.4 Gauge Invariant Attention Score

In implementation, the manifold is discretized to a triangle mesh for computer processing; the discretization details are provided in the supplementary materials. We set the key and query functions to be structurally the same as in Graph Attention Networks [49], i.e.,

$$K^{(h)}(f_w(p)) = W^{(h)}_K f_w(p), \qquad Q^{(h)}(f'_w(q_u)) = W^{(h)}_Q f'_w(q_u),$$

where $W^{(h)}_K \in \mathbb{R}^{N \times C_{in}}$ and $W^{(h)}_Q \in \mathbb{R}^{N \times C_{in}}$. The score function is also structurally similar to [49], taking the form

$$S\big(K(\cdot), Q(\cdot)\big) = P\big(\mathrm{ReLU}(K(\cdot) + Q(\cdot))\big). \tag{20}$$

Here, ReLU is the rectified linear unit acting on each of the $N$ dimensions, and $P: \mathbb{R}^N \to \mathbb{R}$ is the average pooling function. The linear transformation matrices $W_K$ and $W_Q$ are required to satisfy the constraint of Eqn. (17a) on $C_N$ so that $K$ and $Q$ are gauge equivariant. After activation and pooling, the resulting attention score is gauge invariant. With the gauge invariant attention score and the gauge equivariant value function, the single-head attention of Eqn. (6) is gauge equivariant. For the multi-head attention to be gauge equivariant, the transformation matrix $W_M$ also needs to satisfy Eqn. (17a).

### 4.5 Rotation Invariance

The rotation invariance of GET is accomplished by constructing a local coordinate system for every point and making use of the gauge equivariance property. As shown in Figure 4, suppose $x_p$ is the coordinate vector of $p \in M$ in the global coordinate system, $n_p$ is the corresponding normal vector, and the gauge $w_p$ is determined by the principal axes $u_p$ and $v_p$. By projecting the raw data $x_p$ onto the local coordinate system, we get the local coordinate of point $p$:

$$X_p = \big(\langle x_p, u_p \rangle,\ \langle x_p, v_p \rangle,\ \langle x_p, n_p \rangle\big),$$

which depends on $w_p$ but is invariant to the choice of the global coordinate system. The insight is that $X$ is actually a feature map whose corresponding feature field is associated with the representation

$$\rho_{local}(\Theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

If we feed the local coordinates into an $SO(2)$ gauge equivariant model whose outputs are scalar fields, the result is $SO(3)$ rotation invariant.

Figure 4: Local coordinate projection. $x_p$ is the position vector in the global coordinate system, marked in red. For better illustration it is moved to the local coordinate system, marked in blue. In the local coordinate system, $x_p$ is projected onto the directions of $u_p$, $v_p$, and $n_p$, and the lengths of the three directed line segments (in green) form the input $X_p$.
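This invariance is easy to check numerically. In the sketch below (our own illustration with a toy frame), a random proper rotation $R \in SO(3)$ of the global coordinate system rotates the position vector and the local frame together, leaving the projected input $X_p$ unchanged.

```python
import numpy as np

def local_input(x_p, u_p, v_p, n_p):
    # Section 4.5: project the global position onto the local frame.
    return np.array([x_p @ u_p, x_p @ v_p, x_p @ n_p])

rng = np.random.default_rng(0)
x_p = rng.standard_normal(3)   # position of p in global coordinates
u_p, v_p, n_p = np.eye(3)      # a toy local frame (u_p, v_p, n_p)

Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q * np.sign(np.linalg.det(Q))  # random proper rotation in SO(3)

# Rotating the global coordinate system rotates x_p and the frame together,
# so the projected input X_p is unchanged:
assert np.allclose(local_input(x_p, u_p, v_p, n_p),
                   local_input(R @ x_p, R @ u_p, R @ v_p, R @ n_p))
```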
### 4.6 Error Analysis

Following convention, GET stacks multiple self-attention layers with ReLU activation functions. Even when discretized on triangle meshes, GET is still exactly equivariant to gauge transformations at angles that are multiples of $2\pi/N$.

**Theorem 2** *Assume a GET $\psi$ whose input, intermediate, and output feature fields have types $\rho_{local}$, $k_i \rho^{C_N}_{reg}$, and $\rho_0$, respectively, where $k_i$ is the number of regular fields in the $i$-th intermediate feature field. Denote by $f$ the input feature field on a triangle mesh $M$, with the norm of the feature map bounded by a constant $C$. Gauges $w$ and $w'$ are linked by a transformation $g$. Further suppose that $\psi$ is Lipschitz continuous with constant $L$. Then: (i) if $g_p \in C_N$ for every mesh vertex $p \in M$, then $\psi(f_w) = \psi(f_{w'})$; (ii) for general $g_p \in SO(2)$, we have $\|\psi(f_w) - \psi(f_{w'})\| \leq \frac{\pi L C}{N}$.*

Theorem 2 provides a bound for gauge transformations of arbitrary angle. Compared to non-equivariant models, GET decreases the equivariance error by a factor of $1/N$. In experiments, we empirically show that the performance of our model increases as $N$ increases.

## 5 Experiments

We conduct extensive experiments to evaluate the effectiveness of our model. We test its performance on two deformable-domain tasks, and conduct a parameter sensitivity analysis and several ablation studies for a comprehensive evaluation. Note that we use data preprocessing to precompute some useful preliminary values in order to save training time. The details of the preprocessing can be found in the supplementary materials.

### 5.1 Shape Classification

The model used here is lightweight but powerful. The details of the architecture and training settings are provided in the supplementary materials. Under the same setting, we compare our model with HSN [56], MeshCNN [24], GWCNN [20], GI [45], and MDGCNN [35], whose results are cited from [56]. As shown in Table 1, our model achieves state-of-the-art performance on this dataset. GET significantly improves over the previous state-of-the-art model HSN by 3.1% in classification accuracy. This may be attributed to the attention mechanism and the intrinsic rotation invariance of our model, while all the other models are CNNs and directly accept the raw xyz coordinates as input. Among the compared models, HSN is the most parameter efficient, yet our model consumes only about 1/7 of the parameters of HSN (11K vs. 78K).

Table 1: Model results on the SHREC dataset. GET performs best without rotation data augmentation. The models trained without rotation augmentation are intrinsically rotation invariant.

| Model | Rotation Aug. | Acc. (%) |
|---|---|---|
| MDGCNN [35] | ✓ | 82.2 |
| GI [45] | ✓ | 88.6 |
| GWCNN [20] | ✓ | 90.3 |
| MeshCNN [24] | ✗ | 91.0 |
| HSN [56] | ✓ | 96.1 |
| GET (Ours) | ✗ | 99.2 |

### 5.2 Shape Segmentation

A widely used task in 3D shape segmentation is Human Body Segmentation [30], in which the model predicts a body-part annotation for each sampled point. The dataset consists of 370 training models from MIT [50], FAUST [5], Adobe Fuse [1], and SCAPE [3], and 18 test models from SHREC07 [23]. Readers may refer to the supplementary materials for details of the network architecture and hyperparameters. Table 2 reports the percentage of correctly classified vertices across all samples in the test set. The results of the compared models are cited from [56], [60], and [35]. Our model outperforms all of them on the segmentation task. GET consumes only about 1/15 of the parameters of MeshCNN (148K vs. 2.28M) but achieves higher performance.

Table 2: Segmentation results on the Human Body Segmentation dataset. GET performs best even without rotation data augmentation.

| Model | Rotation Aug. | Acc. (%) |
|---|---|---|
| MDGCNN [35] | ✓ | 89.5 |
| PointNet++ [38] | ✓ | 90.8 |
| HSN [56] | ✓ | 91.1 |
| PFCNN [60] | ✗ | 91.5 |
| MeshCNN [24] | ✗ | 92.3 |
| GET (Ours) | ✗ | 92.6 |

### 5.3 Parameter Sensitivity

**Order of the group $C_N$.** The hyperparameter $N$ is a key factor for model equivariance, since it controls both the dimension of the regular field and the number of angles at which our model is exactly equivariant. Theorem 2 also asserts that the equivariance error is reduced by a factor of $1/N$ compared to non-equivariant models. Here we study the effect of $N$ on model accuracy while keeping the number of parameters roughly the same. The results on the Human Body Segmentation dataset for different $N$ are shown in Table 3. Model performance improves considerably as $N$ increases and finally stabilizes.

Table 3: Model accuracy and number of parameters on the Human Body Segmentation task for different $N$.

| $N$ | 3 | 5 | 7 | 9 (chosen) | 11 |
|---|---|---|---|---|---|
| Acc. (%) | 91.2 | 92.0 | 92.4 | 92.6 | 92.5 |
| # Params. | 153K | 149K | 149K | 148K | 156K |

### 5.4 Ablation Study

In this section, we perform a series of ablation studies to analyze the individual parts of our model.
All experiments are carried out on the Human Body Segmentation dataset under the same setting as in Section 5.2. We evaluate the effectiveness of gauge equivariance, attention, the local coordinate input, and the parallel transport method, with the latter two experiments provided in the supplementary materials.

**Gauge Equivariance and Attention.** To confirm the effectiveness of the gauge equivariance property and the attention mechanism, we design two baseline models, one not equivariant and the other based on convolution. For the non-equivariant baseline, we use Graph Attention Networks [49]. For the convolution-based model, we adopt an architecture similar to GET.

Table 4: Model accuracy on the Human Body Segmentation task with two baselines, respectively without gauge equivariance and without attention.

| Model | Gauge Equivariance | Attention | Acc. (%) |
|---|---|---|---|
| GET | ✓ | ✓ | 92.6 |
| Baseline 1 | ✗ | ✓ | 81.1 |
| Baseline 2 | ✓ | ✗ | 92.3 |

Table 4 shows that GET benefits from both gauge equivariance and attention; each property contributes to the superior performance of the model.

## 6 Conclusion

We propose GET, the first model to incorporate attention into gauge equivariant learning. GET introduces a new input that is invariant to rotations of the global coordinate system, employs a new parallel transport approach applicable between any two points, and utilizes Taylor expansion in solving equivariance constraints, achieving better approximation ability. GET achieves state-of-the-art performance on several tasks and is parameter efficient compared with the baselines.

## Acknowledgment

Zhouchen Lin was supported by the NSF China (Nos. 61625301 and 61731018), the NSFC Tianyuan Fund for Mathematics (No. 12026606), and Project 2020BD006 supported by the PKU-Baidu Fund. Yisen Wang is partially supported by the National Natural Science Foundation of China under Grant 62006153 and Project 2020BD006 supported by the PKU-Baidu Fund.

## References

[1] Adobe. Adobe Mixamo 3D characters. http://www.mixamo.com, 2016.
[2] Brandon Anderson, Truong-Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. NeurIPS, 2019.
[3] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape Completion and Animation of People. In SIGGRAPH, 2005.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[5] Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. FAUST: Dataset and Evaluation for 3D Mesh Registration. In CVPR, 2014.
[6] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning Shape Correspondence with Anisotropic Convolutional Neural Networks. In NeurIPS, 2016.
[7] Michael M Bronstein, Joan Bruna, Taco S Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[8] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser, 1992.
[9] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. ICLR, 2018.
[10] Taco S Cohen, Mario Geiger, and Maurice Weiler. A General Theory of Equivariant CNNs on Homogeneous Spaces. NeurIPS, 2019.
[11] Taco S Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge Equivariant Convolutional Networks and the Icosahedral CNN. ICML, 2019.
[12] Taco S Cohen and Max Welling. Group Equivariant Convolutional Networks. In ICML, 2016.
[13] Taco S Cohen and Max Welling. Steerable CNNs. ICLR, 2017.
[14] Pim de Haan, Maurice Weiler, Taco S Cohen, and Max Welling. Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs. ICLR, 2021.
[15] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting Cyclic Symmetry in Convolutional Neural Networks. In ICML, 2016.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[17] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In ECCV, 2018.
[18] Carlos Esteves, Ameesh Makadia, and Kostas Daniilidis. Spin-weighted spherical CNNs. NeurIPS, 2020.
[19] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. Equivariant Multi-view Networks. In ICCV, 2019.
[20] Danielle Ezuz, Justin Solomon, Vladimir G Kim, and Mirela Ben-Chen. GWCNN: A Metric Alignment Layer for Deep Shape Analysis. In Computer Graphics Forum, 2017.
[21] Fabian B Fuchs, Daniel E Worrall, Volker Fischer, and Max Welling. SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. NeurIPS, 2020.
[22] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In ICLR, 2021.
[23] Daniela Giorgi, Silvia Biasotti, and Laura Paraboschi. Shape Retrieval Contest 2007: Watertight Models Track. SHREC competition, 8(7), 2007.
[24] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar Fleishman, and Daniel Cohen-Or. MeshCNN: A Network with an Edge. TOG, 2019.
[25] Lingshen He, Yuxuan Chen, Zhengyang Shen, Yiming Dong, Yisen Wang, and Zhouchen Lin. Efficient equivariant network. In NeurIPS, 2021.
[26] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[27] Michael Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. LieTransformer: Equivariant self-attention for Lie groups. ICML, 2021.
[28] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch–Gordan Nets: a fully Fourier space spherical convolutional neural network. NeurIPS, 2018.
[29] Risi Kondor and Shubhendu Trivedi. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. In ICML, 2018.
[30] Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G Kim, and Yaron Lipman. Convolutional Neural Networks on Surfaces via Seamless Toric Covers. TOG, 2017.
[31] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic Convolutional Neural Networks on Riemannian Manifolds. In ICCV Workshops, 2015.
[32] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-time Object Recognition. In IROS, 2015.
[33] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M Bronstein. Geometric Deep Learning on Graphs and Manifolds using Mixture Model CNNs. In CVPR, 2017.
[34] Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130–146, 2019.
[35] Adrien Poulenard and Maks Ovsjanikov. Multi-directional Geodesic Neural Networks via Equivariant Convolution. TOG, 2018.
[36] Adrien Poulenard, Marie-Julie Rakotosaona, Yann Ponty, and Maks Ovsjanikov. Effective Rotation-invariant Point CNN with Spherical Harmonics Kernels. In 3DV, 2019.
[37] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, 2017.
[38] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NeurIPS, 2017.
[39] Yongming Rao, Jiwen Lu, and Jie Zhou. Spherical fractal convolutional neural networks for point cloud recognition. In CVPR, 2019.
[40] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning Deep 3D Representations at High Resolutions. In CVPR, 2017.
[41] David Romero, Erik Bekkers, Jakub Tomczak, and Mark Hoogendoorn. Attentive group equivariant convolutional networks. In ICML, 2020.
[42] David W Romero and Jean-Baptiste Cordonnier. Group Equivariant Stand-Alone Self-Attention For Vision. ICML, 2020.
[43] David W Romero and Mark Hoogendoorn. Co-attentive equivariant neural networks: Focusing equivariance on transformations co-occurring in data. ICLR, 2020.
[44] Zhengyang Shen, Lingshen He, Zhouchen Lin, and Jinwen Ma. PDO-eConvs: Partial differential operator based equivariant convolutions. In ICML, 2020.
[45] Ayan Sinha, Jing Bai, and Karthik Ramani. Deep Learning 3D Shape Surfaces using Geometry Images. In ECCV, 2016.
[46] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In ICCV, 2015.
[47] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. NeurIPS, 2018.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[49] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. ICLR, 2018.
[50] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated Mesh Animation from Multi-view Silhouettes. In SIGGRAPH, 2008.
[51] Chu Wang, Marcello Pelillo, and Kaleem Siddiqi. Dominant Set Clustering and Pooling for Multi-view 3D Object Recognition. BMVC, 2019.
[52] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. TOG, 2017.
[53] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[54] Maurice Weiler and Gabriele Cesa. General E(2)-equivariant Steerable CNNs. In NeurIPS, 2019.
[55] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. NeurIPS, 2018.
[56] Ruben Wiersma, Elmar Eisemann, and Klaus Hildebrandt. CNNs on Surfaces using Rotation Equivariant Features. TOG, 2020.
[57] Daniel Worrall and Gabriel Brostow. CubeNet: Equivariance to 3D Rotation and Translation. In ECCV, 2018.
[58] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic Networks: Deep Translation and Rotation Equivariance. In CVPR, 2017.
[59] Yufei Xu, Qiming Zhang, Jing Zhang, and Dacheng Tao. ViTAE: Vision transformer advanced by exploring intrinsic inductive bias. arXiv preprint arXiv:2106.03348, 2021.
[60] Yuqi Yang, Shilin Liu, Hao Pan, Yang Liu, and Xin Tong. PFCNN: Convolutional Neural Networks on 3D Surfaces using Parallel Frames. In CVPR, 2020.
[61] Tan Yu, Jingjing Meng, and Junsong Yuan. Multi-view Harmonized Bilinear Network for 3D Object Recognition. In CVPR, 2018.