# Equivariant Neural Tangent Kernels

Philipp Misof ¹, Pan Kessel ², Jan E. Gerken ¹

¹ Department of Mathematical Sciences, Chalmers University of Technology and the University of Gothenburg, SE-412 96 Gothenburg, Sweden. ² Prescient Design, Genentech Roche, Basel, Switzerland.

Little is known about the training dynamics of equivariant neural networks, in particular how they compare to data-augmented training of their non-equivariant counterparts. Recently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically study the training dynamics of wide neural networks. In this work, we take an important step towards a theoretical understanding of training dynamics of equivariant models by deriving neural tangent kernels for a broad class of equivariant architectures based on group convolutions. As a demonstration of the capabilities of our framework, we show an interesting relationship between data augmentation and group convolutional networks. Specifically, we prove that they share the same expected prediction over initializations at all training times and even off the data manifold. In this sense, they have the same training dynamics. We demonstrate in numerical experiments that this still holds approximately for finite-width ensembles. By implementing equivariant NTKs for roto-translations in the plane ($G = C_n \ltimes \mathbb{R}^2$) and 3d rotations ($G = \mathrm{SO}(3)$), we show that equivariant NTKs outperform their non-equivariant counterparts as kernel predictors for histological image classification and quantum mechanical property prediction.

1. Introduction

Equivariant neural networks (Weiler et al., 2023; Gerken et al., 2023) are widely used in many applications of great practical importance, for example in medical image analysis in two and three dimensions (Bekkers et al., 2018; Winkels & Cohen, 2019; Müller et al., 2021; Pang et al., 2023) and in quantum chemistry (Duval et al., 2023; Batzner et al., 2022;
Schütt et al., 2021; Unke et al., 2021). Other application areas include particle physics (Bogatskiy et al., 2020), cosmology (Perraudin et al., 2019) and even fairness in large language models (Basu et al., 2023).

Correspondence to: Jan Gerken. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Recently, there have been a number of works which avoid equivariant architectures but rely on data augmentation to approximately learn equivariance, most notably AlphaFold3 (Abramson et al., 2024). This has the potential advantage that non-equivariant architectures may offer better training dynamics, for example favorable scaling capabilities. There has been a vigorous debate on this subject with some empirical works claiming superiority of equivariant architectures (Gerken et al., 2022; Brehmer et al., 2024) while others suggest the opposite (Wang et al., 2024; Abramson et al., 2024). One obstacle to conclusively settling the matter is that there is no good theoretical understanding of how the equivariant and the purely augmentation-based training dynamics compare.

Motivated by this observation, this paper derives equivariant neural tangent kernel (NTK) theory (Jacot et al., 2018) for group convolutional architectures. The NTK provides a powerful tool to analytically study the training dynamics of neural networks in the large width limit by analyzing the behavior of the kernel, in particular its trace, eigenvalues and other properties (Geiger et al., 2020; Mok et al., 2022; Engel et al., 2024; Tsai et al., 2023). A particularly important feature of the NTK is the fact that in the infinite width limit, it becomes constant throughout training (Jacot et al., 2018). Furthermore, at infinite width, the NTK can be computed by layer-wise recursion relations. These simplifications allow for complete analytic control over the training dynamics.
In particular, the network output for an arbitrary input and at an arbitrary point in training time converges to a Gaussian process over initializations whose mean and covariance functions can be computed analytically. This result has led to a number of theoretical and practical insights (Geiger et al., 2020; Jacot et al., 2020; Yang & Hu, 2021; Yang et al., 2021; Franceschi et al., 2022; Day et al., 2023) into the initialization and training of neural networks.

We derive, for the first time, recursive relations which determine the NTK of equivariant neural networks. In particular, we study the NTK of group convolutional layers (Cohen & Welling, 2016). These layers are in some sense universal. Specifically, they have the unique property that they arise from imposing an equivariance constraint on dense fully connected layers and are therefore the most general linear, equivariant transformations (Kondor & Trivedi, 2018) and have been used in a wide array of applications (Chidester et al., 2019; Celledoni et al., 2021; Moyer et al., 2021).

These NTK recursions allow us to clarify the relation between the training dynamics of pure data augmentation and equivariant architectures in the large width limit. Specifically, non-equivariant architectures trained with full data augmentation converge to certain group convolutional architectures in the infinite width limit. This result holds for any input, in particular off-manifold, and at any training time. Thus, at least in the infinite width limit and in expectation over initializations, the training dynamics of data augmentation are identical to those of certain group convolutional architectures.

NTKs have also been shown to be interesting kernel functions in their own right.
Since they are induced by neural network architectures, they allow transferring the intuition gained in the extensive literature on the design of neural networks to kernel machines and have been shown to outperform more traditional kernel functions (Arora et al., 2019; Li et al., 2019; Lee et al., 2020). In our experiments, we show that group equivariant kernels outperform their non-equivariant counterparts for both regression and classification as well as for discrete roto-translations and continuous rotations.

In summary, our main contributions are:

- We derive layer-wise recursive relations for the neural tangent kernel and neural network Gaussian process kernel of group convolutional layers, the corresponding lifting layers, point-wise nonlinearities and group-pooling layers.
- We specialize our general results to the case of roto-translations in the plane as well as the three-dimensional rotation group $\mathrm{SO}(3)$. We derive and implement the kernel relations for these cases, allowing for efficient computations. The code is provided publicly at https://github.com/PhilippMisofCH/equivariant-ntk.
- We prove that in the infinite width limit, a standard convolutional or fully connected network trained with full data augmentation yields the same expected network function as a corresponding group convolutional network trained without data augmentation. This result holds for all training times as well as off manifold. We show empirically that this holds approximately at finite width.
- We verify experimentally that the NTKs of finite-width equivariant networks converge to our equivariant NTKs as width grows to infinity. Furthermore, we demonstrate the superior performance of equivariant NTKs over other kernels for medical image classification and quantum mechanical property prediction.

2. Related Work

Neural Tangent Kernel.
Gaussian processes can be viewed as Bayesian neural networks, as first pointed out by (Neal, 1996), and this relation extends to deep neural networks as shown in (Lee et al., 2018). Neural tangent kernels allow a description of the training dynamics, see the seminal reference (Jacot et al., 2018) and (Golikov et al., 2022) for an accessible review. In (Lee et al., 2019), NTK theory was used to show that wide neural networks trained with gradient descent become Gaussian processes; this was generalized in a more rigorous and systematic manner by (Yang, 2020). NTKs can be used to derive parametrizations that allow scaling networks to large width (Yang & Hu, 2021). They can also be used to theoretically analyze GANs (Franceschi et al., 2022), PINNs (Wang et al., 2022b), backdoor attacks (Hayase & Oh, 2023), pruning (Yang & Wang, 2023) and spectral learning biases (Bordelon et al., 2020; Canatar et al., 2021). Recently, corrections to the infinite width limit have been studied by (Huang & Yau, 2020; Yaida, 2020; Halverson et al., 2021; Erbin et al., 2022) using techniques inspired by perturbative quantum field theory. The NTK for convolutional architectures was derived in (Arora et al., 2019). Our results can be thought of as a generalization thereof to general group convolutions.

Equivariant Neural Networks. Equivariance has been an important theme of deep learning research over the last years, see (Gerken et al., 2023) for an accessible review. Equivariant deep learning is part of the larger area of geometric deep learning (Bronstein et al., 2017), in which more general geometric properties of the different parts of the learning problem (e.g. the data (Dombrowski et al., 2024), model (Weiler et al., 2023) and optimization procedure (Amari, 1998)) are studied. Herein, we focus on group convolutional layers (Cohen & Welling, 2016) which are the unique linear equivariant layers. A comprehensive summary is given in (Cohen, 2021).
These architectures have found widespread application in computer vision (Chidester et al., 2019; Celledoni et al., 2021; Moyer et al., 2021), medical applications (Bekkers et al., 2018; Chidester et al., 2019; Pang et al., 2023) as well as natural science use cases (Nicoli et al., 2020; Liao & Smidt, 2023; Bekkers et al., 2024).

Learned vs. Manifest Equivariance. While equivariance can be enforced on the training data via data augmentation, the imposed symmetry does in general not extend to out-of-distribution data (Moskalev et al., 2023). This is in accordance with the tighter upper bound on the generalization error of equivariant networks compared to purely data-augmented ones that was found in (Wang et al., 2022a). By analyzing the spectrum of the NTK in a toy problem, (Perin & Deny, 2024) found that non-equivariant neural networks are unable to generalize symmetries learned in one class via data augmentation to another, only partially augmented class. They further show how the qualitatively different NTK spectrum of a CNN improves generalization for the task under consideration. In (Gerken & Kessel, 2024), the effect of data augmentation on infinitely wide neural network ensembles was studied. The authors found that the resulting Gaussian process is equivariant at all training times and even off the data manifold. In contrast, our results do not require data augmentation but derive an NTK for manifestly equivariant group convolution layers. This allows us to find a connection between the ensemble means of data-augmented and equivariant networks, which complements the previously mentioned results focusing on individual networks.

3. Background

This section gives a brief overview of NTK theory (for an optional broader introduction, see Appendix A) as well as of equivariant neural networks with a particular emphasis on group convolutional neural networks (GCNNs).

Neural Tangent Kernels.
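The NTK can be computed by layer-wise recursive relations (Jacot et al., 2018) starting from the definition

$$\Theta^{(\ell)}(x, x') = \mathbb{E}\left[\sum_{\ell'=1}^{\ell} \frac{\partial \mathcal{N}^{(\ell)}(x)}{\partial \theta^{(\ell')}} \left(\frac{\partial \mathcal{N}^{(\ell)}(x')}{\partial \theta^{(\ell')}}\right)^{\top}\right] \tag{1}$$

of the layer-$\ell$ NTK. The NTK of the full network is given by $\Theta(x, x') = \Theta^{(L)}(x, x')$ for a network of depth $L$. Here, $\theta^{(\ell)}$ are the parameters of layer $\ell$ and we adopt the convention that expectation values are over the initialization distribution unless otherwise stated. In the limit of infinitely wide hidden layers, not only do analytic expressions exist for (1), but $\Theta^{(\ell)}(x, x')$ also stays constant during training. Hence it is often referred to as the frozen NTK (Geiger et al., 2020).

As customary in the NTK literature, we treat activations and preactivations as distinct layers and refer to $\mathcal{N}^{(\ell)}$ as the layer-$\ell$ features with $\mathcal{N}(x) = \mathcal{N}^{(L)}(x)$. This allows us to treat linear and nonlinear layers on an equal footing. Since (1) is proportional to the unit matrix, we can treat it as a scalar.

We can find a recursion relation between $\Theta^{(\ell+1)}$ and $\Theta^{(\ell)}$ by separating the $\ell' = \ell+1$ contribution from the sum and computing the $\ell' \le \ell$ contributions in terms of derivatives through the layer $\ell+1$ using the chain rule,

$$\Theta^{(\ell+1)}(x, x') = \mathbb{E}\left[\frac{\partial \mathcal{N}^{(\ell+1)}(x)}{\partial \theta^{(\ell+1)}} \left(\frac{\partial \mathcal{N}^{(\ell+1)}(x')}{\partial \theta^{(\ell+1)}}\right)^{\top}\right] + \mathbb{E}\left[\frac{\partial \mathcal{N}^{(\ell+1)}(x)}{\partial \mathcal{N}^{(\ell)}(x)}\, \underbrace{\sum_{\ell'=1}^{\ell}\frac{\partial \mathcal{N}^{(\ell)}(x)}{\partial \theta^{(\ell')}} \left(\frac{\partial \mathcal{N}^{(\ell)}(x')}{\partial \theta^{(\ell')}}\right)^{\top}}_{\Theta^{(\ell)}(x, x')} \left(\frac{\partial \mathcal{N}^{(\ell+1)}(x')}{\partial \mathcal{N}^{(\ell)}(x')}\right)^{\top}\right] . \tag{2}$$

Note that according to the NTK's definition (1), it holds that $\Theta^{(0)} = 0$. The recursions (2) have been computed explicitly for a number of layers, e.g. fully connected (Jacot et al., 2018), nonlinear (Jacot et al., 2018), convolution (Arora et al., 2019), and graph convolution (Du et al., 2019). An efficient implementation for many layers is available in the JAX-based Python package neural-tangents (Novak et al., 2020).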
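To make the definition (1) concrete, the following is a minimal NumPy sketch, not taken from the paper or from neural-tangents. It assumes a two-layer ReLU network in NTK parametrization without biases, evaluates the gradient inner product at a single random initialization, and compares it with the closed-form infinite-width value obtained from the known ReLU Gaussian expectations. The function names `empirical_ntk` and `analytic_ntk` are hypothetical.

```python
import numpy as np

def empirical_ntk(x, xp, width, rng):
    """Gradient inner product for N(x) = a . relu(W x / sqrt(d)) / sqrt(width)."""
    d = x.size
    w = rng.standard_normal((width, d))   # first-layer weights (assumed N(0, 1))
    a = rng.standard_normal(width)        # output weights (assumed N(0, 1))
    z, zp = w @ x / np.sqrt(d), w @ xp / np.sqrt(d)
    # contribution of the output weights: dN/da_j = relu(z_j) / sqrt(width)
    grad_a = np.maximum(z, 0) @ np.maximum(zp, 0) / width
    # contribution of the first layer: dN/dW_j = a_j relu'(z_j) x / sqrt(d * width)
    grad_w = np.sum(a**2 * (z > 0) * (zp > 0)) * (x @ xp) / (d * width)
    return grad_a + grad_w

def analytic_ntk(x, xp):
    """Infinite-width value: E[relu(u) relu(v)] + E[relu'(u) relu'(v)] * x.x'/d."""
    d = x.size
    kxx, kyy, kxy = x @ x / d, xp @ xp / d, x @ xp / d
    theta = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
    k_relu = np.sqrt(kxx * kyy) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    k_dot = (np.pi - theta) / (2 * np.pi)
    return k_relu + k_dot * kxy

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(8), rng.standard_normal(8)
for width in (100, 10_000, 200_000):
    print(width, empirical_ntk(x, xp, width, rng), analytic_ntk(x, xp))
```

Under these assumptions the printed estimates should approach the analytic value as the width grows, which is the concentration behind the frozen-NTK statement above.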
For evaluating the expectation values in (2), it is convenient to introduce the neural network Gaussian process (NNGP) kernel

$$K^{(\ell)}(x, x') = \mathbb{E}\left[\mathcal{N}^{(\ell)}(x) \left(\mathcal{N}^{(\ell)}(x')\right)^{\top}\right] , \tag{3}$$

whose name originates in the fact that at initialization, the neural network converges in the infinite width limit to a zero-mean Gaussian process with covariance function $K^{(L)}(x, x')$ (Neal, 1996; Lee et al., 2018). In the infinite width limit, $K$ is proportional to the unit matrix, so we will treat it as a scalar as well. The NNGP can also be computed recursively layer-by-layer. For $\ell = 0$, the NNGP is the covariance matrix of the input features, $K^{(0)}(x, x') = x^{\top} x'$.

Using the definition (3) of the NNGP, we can determine the structure of the NTK recursive relations from (2). For linear layers, the first expectation value will evaluate to the NNGP, while the second expectation value will be proportional to the unit matrix due to the initialization with independent normally distributed parameters. For nonlinear layers, the first expectation value vanishes and the second expectation value will depend on the derivative of the nonlinearity.

Group Convolutions. Group convolutions (Cohen & Welling, 2016) act on feature maps $f : G \to \mathbb{R}^{n_{\mathrm{in}}}$ where $n_{\mathrm{in}}$ denotes the number of input features to the network. In the example of image inputs, this feature map would be $f : \mathbb{Z}^2 \to \mathbb{R}^3$ where $\mathbb{Z}^2$ is the pixel grid, $\mathbb{R}^3$ is the space of RGB colors, and the feature map $f$ is supported on $[0, h] \times [0, w]$ for image size $h \times w$. Let $L^2(X, Y)$ denote the set of square integrable functions from $X$ to $Y$. The $\ell$-th neural network layer $\mathcal{N}^{(\ell)} : L^2(G, \mathbb{R}^{n_{\mathrm{in}}}) \to L^2(G, \mathbb{R}^{n_\ell})$ maps an input feature map $f : G \to \mathbb{R}^{n_{\mathrm{in}}}$ to an output feature map $\mathcal{N}^{(\ell)}(f) : G \to \mathbb{R}^{n_\ell}$. A particular instance of such a layer is the group convolution layer which in NTK parametrization is given by

$$[\mathcal{N}^{(\ell+1)}(f)](g) = \frac{1}{\sqrt{n_\ell\, |S_\kappa|}} \int_G \mathrm{d}h\; \kappa(g^{-1}h)\, [\mathcal{N}^{(\ell)}(f)](h) , \tag{4}$$

with filter $\kappa : G \to \mathbb{R}^{n_\ell, n_{\ell+1}}$ with support $S_\kappa \subseteq G$. Here, we integrate over the group with respect to the Haar measure.
For finite groups, the integral becomes a sum over group elements. Due to the invariance of the Haar measure, the layers (4) are equivariant with respect to the regular representation

$$(\rho_{\mathrm{reg}}(g) f)(h) = f(g^{-1}h) \qquad \forall\, g, h \in G . \tag{5}$$

Since the input features typically have a domain $X$ (as maps $X \to \mathbb{R}^{n_{\mathrm{in}}}$) which is not the symmetry group $G$, the first layer of a group convolutional neural network (GCNN) is a lifting layer which maps a feature map with domain $X$ equivariantly into a feature map with domain $G$ (Cohen & Welling, 2016),

$$[\mathcal{N}^{(1)}(f)](g) = \frac{1}{\sqrt{n_{\mathrm{in}}\, |S_\kappa|}} \int_X \mathrm{d}x\; \kappa(\rho(g^{-1})x)\, f(x) , \tag{6}$$

where $\rho$ is a representation of $G$ on $X$. We assume here and in the following that $X$ is a homogeneous space of $G$, i.e. that any two elements of $X$ are connected by a group transformation.

As is common for other network types as well, the nonlinearities in group convolutional networks are applied component-wise across the different group elements,

$$[\mathcal{N}^{(\ell+1)}(f)](g) = \sigma\left([\mathcal{N}^{(\ell)}(f)](g)\right) \tag{7}$$

for nonlinearity $\sigma$. Due to this component-wise structure, the layers (7) are equivariant with respect to the regular representation (5) as well.

By combining lifting and group-convolution layers with nonlinearities, one can construct expressive architectures which are equivariant with respect to the regular representation, i.e. which satisfy

$$\mathcal{N}(\rho_{\mathrm{reg}}(g) f) = \rho_{\mathrm{reg}}(g)\, \mathcal{N}(f) , \qquad \forall\, g \in G . \tag{8}$$

Many practical applications necessitate an invariant network

$$\mathcal{N}(\rho_{\mathrm{reg}}(g) f) = \mathcal{N}(f) , \qquad \forall\, g \in G . \tag{9}$$

Such a transformation property can be achieved by appending a group pooling layer to a GCNN,

$$\mathcal{N}^{(\ell+1)}(f) = \frac{1}{\mathrm{vol}(G)} \int_G \mathrm{d}g\; [\mathcal{N}^{(\ell)}(f)](g) . \tag{10}$$

Using these layers, a wide variety of equivariant and invariant networks with respect to a general symmetry group $G$ can be easily constructed.

4. Equivariant Neural Tangent Kernels

This section presents our recursive relations for the NTK and the NNGP for group convolutional layers.
These recursions allow for efficient calculation of these kernels for arbitrary group convolutional architectures and thus provide the necessary tools to analytically study their training dynamics in the large width limit. Specifically, we derive recursion relations for group convolutions (4), lifting layers (6) and group pooling layers (10) by evaluating the derivatives and expectation values in (2).

4.1. Equivariant NTK for Group Convolutions

Since the domain of the feature maps in GCNNs is the symmetry group $G$, the layer-$\ell$ NNGP and NTK kernels do not only depend on the input feature maps $f$ and $f'$ but also on the group elements $g, g'$ at which the feature maps are evaluated, i.e.,

$$K^{(\ell)}_{g,g'}(f, f') = \mathbb{E}\left[[\mathcal{N}^{(\ell)}(f)](g)\, \left([\mathcal{N}^{(\ell)}(f')](g')\right)^{\top}\right] ,$$

$$\Theta^{(\ell)}_{g,g'}(f, f') = \mathbb{E}\left[\sum_{\ell'=1}^{\ell} \frac{\partial [\mathcal{N}^{(\ell)}(f)](g)}{\partial \theta^{(\ell')}} \left(\frac{\partial [\mathcal{N}^{(\ell)}(f')](g')}{\partial \theta^{(\ell')}}\right)^{\top}\right] .$$

For these kernels, we derive the following recursion relation:

Theorem 4.1 (Kernel recursions for group convolutional layers). The layer-wise recursive relations for the NNGP and NTK of the group convolutional layer (4) are given by

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; K^{(\ell)}_{gh,\,g'h}(f, f') \tag{11}$$

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = K^{(\ell+1)}_{g,g'}(f, f') + \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; \Theta^{(\ell)}_{gh,\,g'h}(f, f') . \tag{12}$$

Proof. See Appendix B.

Given $G$-invariant filter supports $S_\kappa$, these recursive definitions imply an invariance of the kernels in their group indices under right-multiplication by the same group element $h \in G$,

$$K^{(\ell+1)}_{gh,\,g'h}(f, f') = K^{(\ell+1)}_{g,g'}(f, f') \tag{13}$$

$$\Theta^{(\ell+1)}_{gh,\,g'h}(f, f') = \Theta^{(\ell+1)}_{g,g'}(f, f') . \tag{14}$$

While the kernels of feature maps on the group carry $g, g'$-indices, the kernels of the input features carry $x, x'$-indices,

$$K^{(0)}_{x,x'}(f, f') = f(x)^{\top} f'(x') , \qquad \Theta^{(0)}_{x,x'}(f, f') = 0 . \tag{15}$$

Using this, we also derive the following recursion relations:

Theorem 4.2 (Kernel recursions for the lifting layer).
The layer-wise recursive relations for the NNGP and NTK of the lifting layer (6) are given by (see footnote 1)

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; K^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') , \tag{16}$$

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; \Theta^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') + K^{(\ell+1)}_{g,g'}(f, f') , \tag{17}$$

where the regular representation $\rho_{\mathrm{reg}}$ is defined in (5).

Footnote 1: In practice, the lifting layer is usually the first layer, thus $\ell = 0$.

Proof. See Appendix B.

The group pooling layer (10) maps feature maps on $G$ onto channel-vectors. Therefore, the kernels lose their $g, g'$-indices in this layer, as is reflected in the following result:

Theorem 4.3 (Kernel recursions for group pooling layer). The layer-wise recursive relations for the NNGP and NTK of the group pooling layer (10) are given by

$$K^{(\ell+1)}(f, f') = \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g \int_G \mathrm{d}g'\; K^{(\ell)}_{g,g'}(f, f') \tag{18}$$

$$\Theta^{(\ell+1)}(f, f') = \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g \int_G \mathrm{d}g'\; \Theta^{(\ell)}_{g,g'}(f, f') . \tag{19}$$

Proof. See Appendix B.

The final layers necessary to compute kernels of GCNNs are the nonlinearities (7). Since these act pointwise on the feature maps, the recursive relations are the same as those for nonlinearities in MLPs (Jacot et al., 2018):

Corollary 4.4 (Kernel recursions for nonlinearities). The layer-wise recursive relations for the NNGP and NTK of the nonlinear layer (7) are given by

$$\Lambda^{(\ell)}_{g,g'}(f, f') = \begin{pmatrix} K^{(\ell)}_{g,g}(f, f) & K^{(\ell)}_{g,g'}(f, f') \\ K^{(\ell)}_{g',g}(f', f) & K^{(\ell)}_{g',g'}(f', f') \end{pmatrix} \tag{20}$$

$$K^{(\ell+1)}_{g,g'}(f, f') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0,\, \Lambda^{(\ell)}_{g,g'}(f, f'))}\left[\sigma(u)\,\sigma(v)\right] \tag{21}$$

$$\dot{K}^{(\ell+1)}_{g,g'}(f, f') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0,\, \Lambda^{(\ell)}_{g,g'}(f, f'))}\left[\sigma'(u)\,\sigma'(v)\right] \tag{22}$$

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \dot{K}^{(\ell+1)}_{g,g'}(f, f')\; \Theta^{(\ell)}_{g,g'}(f, f') . \tag{23}$$

Using these results, the NTK and NNGP can be straightforwardly computed for any GCNN architecture. In particular, consider the transformation of the kernels under transformations of the inputs, i.e. consider $K(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f')$ and $\Theta(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f')$. From the recursion relations for the lifting layer in Theorem 4.2, we have the transformation property

$$K^{(1)}_{g,g'}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = K^{(1)}_{h^{-1}g,\,h'^{-1}g'}(f, f') \tag{24}$$

$$\Theta^{(1)}_{g,g'}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = \Theta^{(1)}_{h^{-1}g,\,h'^{-1}g'}(f, f') .$$
(25)

This left-multiplication is preserved by the recursions of both the group convolutions in Theorem 4.1 and the nonlinearities in Corollary 4.4. Therefore, before any pooling layer, we have

$$K^{(\ell)}_{g,g'}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = K^{(\ell)}_{h^{-1}g,\,h'^{-1}g'}(f, f') \tag{26}$$

$$\Theta^{(\ell)}_{g,g'}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = \Theta^{(\ell)}_{h^{-1}g,\,h'^{-1}g'}(f, f') , \tag{27}$$

reflecting the equivariance of the network. The recursions of the group-pooling layer in Theorem 4.3 average over the group and the kernels become invariant after the group pooling layer,

$$K^{(\ell)}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = K^{(\ell)}(f, f') \tag{28}$$

$$\Theta^{(\ell)}(\rho_{\mathrm{reg}}(h)f, \rho_{\mathrm{reg}}(h')f') = \Theta^{(\ell)}(f, f') , \tag{29}$$

as expected from an invariant network. Note that these transformation properties of the kernels are independent for both arguments.

4.2. Roto-Translations in the Plane

The kernel recursions provided in the previous section are valid for general symmetry groups $G$. In this section, we will specialize these expressions to the case of roto-translations in the plane with rotations by $(360/n)^\circ$. In this case, $G = C_n \ltimes \mathbb{R}^2$ where $G$ is the semidirect product of the cyclic group $C_n$ and the translation group in two dimensions $\mathbb{R}^2$. It was shown that adding this rotational symmetry to conventional CNNs boosts performance considerably for important applications such as medical image analysis (Chidester et al., 2019; Bekkers et al., 2018; Pang et al., 2023).

Due to the semidirect product nature of the symmetry group, the group convolutional layers can be written as a stack of $n$ conventional convolutions which are summed over the rotation group. Details and explicit expressions for the lifting, group convolutional and group pooling layers in this case can be found in Appendix C.1.

The kernel recursion of ordinary CNN-layers can be written in terms of the operator (Xiao et al., 2018)

$$[\mathcal{A}_{S_\kappa}(K)](t, t') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}\tilde{t}\; K(t + \tilde{t},\, t' + \tilde{t}) , \tag{30}$$

for which efficient implementations in terms of convolutions are available in (Novak et al., 2020).
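As an illustration of the operator in (30), here is a minimal NumPy sketch on a one-dimensional grid. It is not the neural-tangents implementation; periodic boundary conditions and the name `diagonal_average` are assumptions made only to keep the example short.

```python
import numpy as np

def diagonal_average(k, support):
    """[A_S(K)](t, t') = (1/|S|) * sum_{s in S} K(t + s, t' + s) on a periodic 1d grid."""
    # np.roll with shift -s gives the array whose (t, t') entry is K(t + s, t' + s)
    return sum(np.roll(k, (-s, -s), axis=(0, 1)) for s in support) / len(support)

signal_a = np.array([1.0, 2.0, 0.0, -1.0])
signal_b = np.array([0.5, 0.0, 1.0, 2.0])
k0 = np.outer(signal_a, signal_b)            # input kernel K(t, t') = a(t) b(t')
k1 = diagonal_average(k0, support=(-1, 0, 1))
print(k1)
```

The operator averages the kernel along its diagonals over the filter support, which is why it can be phrased as a convolution with an identity-shaped filter in practical implementations.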
In Appendix C.2, we present explicit expressions for the NNGP and NTK recursions of roto-translation equivariant convolutions in terms of $\mathcal{A}$, retaining the efficiency of the non-rotation-equivariant kernel computations. We provide implementations of these recursions for $n = 4$ as new layers based on the neural-tangents package.

4.3. Rotations in 3d

Spherical signals subject to rotations in 3d are a further important use case with numerous applications in quantum chemistry (Duval et al., 2023), weather prediction (Bonev et al., 2023) and 3d shape recognition (Fuchs et al., 2020). The group convolutions for the corresponding symmetry group $\mathrm{SO}(3)$ can be computed efficiently in the Fourier domain, in terms of coefficients in a steerable basis of spherical harmonics $Y^l_m$ or Wigner matrices $D^l_{mn}$, respectively (Cohen et al., 2018; Cohen & Welling, 2017). Due to the continuous nature of the $\mathrm{SO}(3)$ group, comprehensive data augmentation is not feasible, thus making group convolutional networks the natural choice to incorporate such symmetries.

In Appendix D.1, we provide a summary of the necessary Fourier space relations for $\mathrm{SO}(3)$-equivariant networks. The kernel forward equations (11), (12) simplify to purely algebraic equations in terms of Fourier coefficients

$$\left[\widehat{K^{(\ell)}}(f, f')\right]^{l,l'}_{mn,\,m'n'} = \int \mathrm{d}R \int \mathrm{d}R'\; K^{(\ell)}_{R,R'}(f, f')\, D^{l}_{mn}(R)\, D^{l'}_{m'n'}(R') , \tag{31}$$

and analogously for the NTK. Note that the kernels have two group indices, thus necessitating a double Fourier transform. Detailed relations for the lifting, group convolutional and group pooling layers in Fourier space are provided in Appendix D.2. Again, these layers are implemented in the neural-tangents package and the necessary generalized FFTs are provided by the JAX-based package s2fft (Price & McEwen, 2024).

5.
Data Augmentation Versus Group Convolutions at Infinite Width

The recursive relations presented in the previous sections give analytical access to the training dynamics of equivariant neural networks. In particular, they allow for a more in-depth theoretical understanding of the similarities and differences of data augmentation and manifest equivariance than previously possible.

It is known that ensembles of independently initialized neural networks trained with data augmentation yield equivariant mean predictions (Gerken & Kessel, 2024; Nordenfors & Flinth, 2024). It is however unclear how these equivariant functions relate to trained manifestly equivariant networks. Using the recursive relations from Section 4.1, it is possible to show that non-equivariant networks trained with data augmentation in fact converge to group convolutional networks in the ensemble mean.

5.1. Data Augmentation at Infinite Width

In the infinite width limit, the training dynamics under gradient descent can be solved exactly (Jacot et al., 2018). This enables us to explicitly study data augmentation, showing that data augmentation and kernel averaging yield the same mean predictions, as detailed in the following theorem.

Theorem 5.1. Let $\mu^{\mathrm{aug}}_t$ and $\mu_t$ be the mean predictions after $t$ training steps of infinite ensembles of two neural network architectures $\mathcal{N}^{\mathrm{aug}}$ and $\mathcal{N}$. Let $\mathcal{N}^{\mathrm{aug}}$ be trained on the fully $G$-augmented training data of $\mathcal{N}$ and assume that the NTKs of the two architectures are related by

$$\Theta(f, f') = \frac{1}{|G|} \sum_{g \in G} \Theta^{\mathrm{aug}}(f, \rho_{\mathrm{reg}}(g) f') . \tag{32}$$

Then, $\mu^{\mathrm{aug}}_t$ and $\mu_t$ converge in the infinite width limit to the same function for all $t$ for quadratic losses, up to quadratic corrections in the learning rate.

Proof. See Appendix E.

The proof of Theorem 5.1 proceeds inductively over training steps. At initialization, both mean functions are identically zero (Neal, 1996; Lee et al., 2018).
The updates for the two networks can be written in terms of the NTK and shown to agree by splitting the sum over augmented training data into a sum over samples and a sum over $G$ and using the assumption (32). In fact, the same argument can be used to show that the individual networks $\mathcal{N}^{\mathrm{aug}}$ and $\mathcal{N}$ (as opposed to their ensemble averages) agree if their empirical NTKs satisfy (32) as long as the networks are identical for all inputs and equivariant at initialization. This would for instance be the case if $\mathcal{N}(x) = \mathcal{N}^{\mathrm{aug}}(x) = 0$ for all $x$ at initialization.

5.2. Kernel Averaging Yields GCNN-Kernels

Theorem 5.1 shows equivalence of augmented and non-augmented networks if the NTKs of both architectures are related by group-averaging. Consider the case of training an MLP on augmented data. Then, (32) prompts us to consider the group-average of its NTK to find the architecture which results in the same mean predictions if trained without data augmentation. By iterating the recursive kernel relations found in the previous section, one can in fact show that this architecture is a GCNN, as detailed in the following theorem.

Theorem 5.2. Let $\mathcal{N}^{\mathrm{FC}}$ be an MLP acting on feature maps with output in $\mathbb{R}$ and architecture

$$\mathcal{N}^{\mathrm{FC}} = \mathrm{FC}^{(L)} \circ \sigma \circ \cdots \circ \mathrm{FC}^{(3)} \circ \sigma \circ \mathrm{FC}^{(1)} , \tag{33}$$

where FC denotes a dense MLP layer and $\sigma$ a point-wise nonlinearity. Let $\mathcal{N}^{\mathrm{GC}}$ be a $G$-invariant GCNN with architecture

$$\mathcal{N}^{\mathrm{GC}} = \mathrm{GPool} \circ \mathrm{GConv}(S^{L}_\kappa) \circ \sigma \circ \mathrm{GConv}(S^{L-2}_\kappa) \circ \sigma \circ \cdots \circ \mathrm{GConv}(S^{3}_\kappa) \circ \sigma \circ \mathrm{Lifting}(S^{1}_\kappa) , \tag{34}$$

where $S^{\ell}_\kappa$ are the supports of the convolutional filters with $S^{1}_\kappa = X$, the domain of the input feature maps, and the other $S^{\ell}_\kappa$ are invariant under $G$. Then, the $G$-averages of the kernels of the MLP are given by the kernels of the GCNN,

$$K^{\mathrm{GC}}(f, f') = \frac{1}{\mathrm{vol}(G)} \int_G \mathrm{d}g\; K^{\mathrm{FC}}(f, \rho_{\mathrm{reg}}(g) f') \tag{35}$$

$$\Theta^{\mathrm{GC}}(f, f') = \frac{1}{\mathrm{vol}(G)} \int_G \mathrm{d}g\; \Theta^{\mathrm{FC}}(f, \rho_{\mathrm{reg}}(g) f') . \tag{36}$$

Proof. See Appendix E.
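The statement of Theorem 5.2 can be checked numerically in a small setting. The sketch below is an illustration under several assumptions that are not from the paper: the group is the cyclic group $C_n$ acting on $X = \mathbb{Z}_n$ by shifts, inputs have a single channel, all filters have full support, the nonlinearity is ReLU (using its known closed-form Gaussian expectations), and all function names are hypothetical. It implements the recursions of Theorems 4.1 to 4.3 and Corollary 4.4 for the GCNN and compares the result with the group average of the MLP kernels.

```python
import numpy as np

def relu_expectations(kxx, kyy, kxy):
    """E[relu(u)relu(v)] and E[relu'(u)relu'(v)] for (u, v) ~ N(0, [[kxx, kxy], [kxy, kyy]])."""
    norm = np.sqrt(kxx * kyy)
    cos_t = np.clip(kxy / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    k = norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
    return k, (np.pi - theta) / (2 * np.pi)

def mlp_kernels(f, fp, n_hidden):
    """NNGP and NTK of FC -> (ReLU -> FC) x n_hidden on flattened inputs."""
    n = f.size
    kxx, kyy, kxy = f @ f / n, fp @ fp / n, f @ fp / n
    ntk = kxy                                      # first dense layer, Theta^(0) = 0
    for _ in range(n_hidden):
        k_relu, k_dot = relu_expectations(kxx, kyy, kxy)
        kxx, kyy, kxy = kxx / 2, kyy / 2, k_relu   # E[relu(u)^2] = Var(u) / 2
        ntk = kxy + k_dot * ntk                    # ReLU step, then dense step
    return np.array([kxy, ntk])

def lift(f, fp):
    """Lifting recursion with S_kappa = X: K_{g,g'} = (1/n) sum_x f(x+g) f'(x+g')."""
    n = f.size
    rows = np.stack([np.roll(f, -g) for g in range(n)])
    cols = np.stack([np.roll(fp, -g) for g in range(n)])
    return rows @ cols.T / n

def gconv(k):
    """Group-convolution average (1/n) sum_h K_{g+h, g'+h} with full support."""
    n = k.shape[0]
    return sum(np.roll(k, (-h, -h), axis=(0, 1)) for h in range(n)) / n

def relu_step(k_ab, k_aa, k_bb):
    """Nonlinearity recursion applied entrywise in the group indices (g, g')."""
    return relu_expectations(np.diag(k_aa)[:, None], np.diag(k_bb)[None, :], k_ab)

def gcnn_kernels(f, fp, n_hidden):
    """NNGP and NTK of GPool o (GConv o ReLU) x n_hidden o Lifting for G = C_n."""
    kxx, kyy, kxy = lift(f, f), lift(fp, fp), lift(f, fp)
    ntk = kxy.copy()                               # lifting layer, Theta^(0) = 0
    for _ in range(n_hidden):
        new_xx, _ = relu_step(kxx, kxx, kxx)
        new_yy, _ = relu_step(kyy, kyy, kyy)
        new_xy, k_dot = relu_step(kxy, kxx, kyy)
        kxx, kyy, kxy = gconv(new_xx), gconv(new_yy), gconv(new_xy)
        ntk = kxy + gconv(k_dot * ntk)
    return np.array([kxy.mean(), ntk.mean()])      # group pooling over g and g'

rng = np.random.default_rng(0)
f, fp = rng.standard_normal(6), rng.standard_normal(6)
averaged = np.mean([mlp_kernels(f, np.roll(fp, g), 2) for g in range(6)], axis=0)
pooled = gcnn_kernels(f, fp, 2)
print(averaged, pooled)
```

In this toy setting the two printed pairs (NNGP, NTK) should coincide up to floating-point error, mirroring (35) and (36) with the integral replaced by a sum over the six group elements.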
Together with Theorem 5.1, this theorem shows that by augmenting an arbitrarily deep MLP at infinite width, one obtains a specific equivariant architecture, namely a GCNN with the same depth $L$ and an additional group-pooling layer. This result singles out group convolutional layers among other equivariant layers and mirrors the fact that group convolutions are the unique linear equivariant layers under the regular representation. Note that according to Theorem 5.1 the equivalence between augmented and equivariant networks holds throughout training and even out of distribution.

5.3. Augmenting a CNN

Consider a generalization of the roto-translation symmetry discussed in Section 4.2, namely a general semidirect product group, $G = K \ltimes N$ with $N$ a normal subgroup of $G$. For $N$ a translation group, this covers cases such as CNNs in two and three dimensions with additional rotation or reflection symmetry (Cesa et al., 2022). The semidirect product structure of $G$ allows a splitting of the full equivariance, namely training a $K \ltimes N$-invariant GCNN is equivalent to training an $N$-invariant GCNN on $K$-augmented data. In order to see this, we show that the corresponding kernel averages required by Theorem 5.1 hold:

Theorem 5.3. Let $\mathcal{N}^{K \ltimes N}$ be the $K \ltimes N$-invariant GCNN with architecture (34) and $K$-invariant filter supports $S^{\ell}_\kappa$ which for the GConv-layers decompose as $S^{\ell}_\kappa = K^{\ell}_\kappa \ltimes N^{\ell}_\kappa$, $K^{\ell}_\kappa \subseteq K$, $N^{\ell}_\kappa \subseteq N$. Let $\mathcal{N}^{N}$ be the $N$-invariant GCNN with architecture (34) and filter supports $N^{L}_\kappa, \dots, N^{3}_\kappa$ and $S^{1}_\kappa$. Then, the NNGPs and NTKs of these networks are

$$K^{K \ltimes N}(f, f') = \frac{1}{\mathrm{vol}(K)} \int_K \mathrm{d}k\; K^{N}(f, \rho_{\mathrm{reg}}(k) f') \tag{37}$$

$$\Theta^{K \ltimes N}(f, f') = \frac{1}{\mathrm{vol}(K)} \int_K \mathrm{d}k\; \Theta^{N}(f, \rho_{\mathrm{reg}}(k) f') . \tag{38}$$

Proof. See Appendix E.

Remark 5.4. For $K = C_n$ and $N = \mathbb{R}^2$, i.e. roto-translations in the plane, $\mathcal{N}^{N}$ becomes an ordinary CNN and $\mathcal{N}^{K \ltimes N}$ is of the form discussed in Section 4.2.
According to Theorem 5.3, the kernels of the rotation-equivariant network are then given by averaging the kernels of the CNN,

$$K^{C_n \ltimes \mathbb{R}^2}(f, f') = \frac{1}{n} \sum_{r \in C_n} K^{\mathrm{CNN}}(f, \rho_{\mathrm{reg}}(r) f') \tag{39}$$

$$\Theta^{C_n \ltimes \mathbb{R}^2}(f, f') = \frac{1}{n} \sum_{r \in C_n} \Theta^{\mathrm{CNN}}(f, \rho_{\mathrm{reg}}(r) f') \tag{40}$$

if the spatial filter shapes agree for both networks and are rotation-invariant.

Taken together with Theorem 5.1, this shows that training an $N$-invariant GCNN on $K$-augmented data results in a $K \ltimes N$-invariant GCNN. For the special case of $N$ a translation group and $K$ a rotation group, this means that training a CNN on rotation-augmented data is equivalent to training a roto-translation equivariant GCNN on unaugmented data. In Section 6, we will show that this still holds approximately for finite-width networks and ensembles.

5.4. Distribution of Ensemble Members

Consider training two ensembles of networks with (a) data augmentation on a non-equivariant architecture and (b) no data augmentation on an equivariant architecture. Then, the distributions of the individual networks in these ensembles do not agree since most of the augmented networks will not be equivariant. However, our results show that the ensemble mean of (a) converges to the ensemble mean of (b) with a specific GCNN architecture. This establishes a highly non-trivial relation between data augmentation and GCNNs.

6. Experiments

In the following, we validate the theoretical results of the preceding sections experimentally for various datasets (CIFAR10, QM9, MNIST, and histological data), tasks (regression and classification), and groups ($\mathrm{SO}(3)$ and $C_4 \ltimes \mathbb{R}^2$).

Kernel Convergence for $C_4 \ltimes \mathbb{R}^2$. Figure 1 confirms that the Monte-Carlo estimate of the NTK converges to our analytical expression as the width increases. Our MC estimates are obtained by replacing the expectation values in (1)

Figure 1 (plot of Relative Error versus Network Width). Convergence of the Monte-Carlo estimates of the NTK to their infinite-width limits for $G = C_4 \ltimes \mathbb{R}^2$.
Plotted is the relative error averaged over the components of a $3 \times 3$ Gram matrix for networks with a ReLU or an error function nonlinearity. Bands show one standard deviation of the estimator.

Figure 2 (plot of Test Accuracy versus Training Set Size). NTK for image classification. Test accuracy of the arising NTK kernel methods in the infinite width and infinite training time limit for different training set sizes. The results for both a conventional CNN and a $C_4 \ltimes \mathbb{R}^2$-invariant GCNN are shown.

by the sample mean of 1000 initializations of the network. We considered GCNNs with one lifting and four group-convolution layers interspersed with ReLU nonlinearities, followed by a group-pooling layer. The convergence of the NNGP is shown in a similar plot in Appendix F.1.

Medical Image Classification with $C_4 \ltimes \mathbb{R}^2$. We show that rotation-equivariant NTK-predictors outperform non-equivariant NTK-predictors on a dataset of histological images (Kather et al., 2018) containing nine distinct classes of tissues. Specifically, we compare a CNN architecture with the corresponding rotation-invariant GCNN architecture, in which we replace each of the five convolutional layers with a group-convolutional or lifting layer, respectively, and use a group-pooling layer instead of a SumPool layer. Figure 2 shows the improved scaling behavior of the equivariant kernel with training set size upon using the infinite-time solution of the NTK-dynamics under MSE loss,

$$\mu(x) = \Theta(x, X)\, \Theta(X, X)^{-1}\, Y , \tag{41}$$

where $X$ represents the training images, scaled down to $32 \times 32$ pixels, $Y$ are the training labels, given by $e_c - \tfrac{1}{9}\mathbb{1}$ for class $c$, and $\Theta(X, X)$ is the Gram matrix of the NTK.

Figure 3 (plot versus Training Set Size, logarithmic axis). NTK for molecular energy prediction. Molecular energy MAEs of the NTK kernel methods in the infinite width and training time limit for different training set sizes. The results are for both a conventional MLP and an $\mathrm{SO}(3)$-invariant GCNN.

We refer to Appendix F.2 for more details.
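The predictor (41) is ordinary kernel regression with the NTK Gram matrix. The following NumPy sketch shows the computation under stated assumptions: a Gaussian kernel stands in for the NTK (the paper's kernels are not reproduced here), the data are random, and the function names as well as the optional `ridge` regularizer are hypothetical additions for illustration.

```python
import numpy as np

def centered_one_hot(labels, n_classes):
    """Targets e_c - (1/n_classes) * ones, one row per sample."""
    return np.eye(n_classes)[labels] - 1.0 / n_classes

def kernel_predict(k_test_train, k_train_train, targets, ridge=0.0):
    """mu(x) = K(x, X) K(X, X)^{-1} Y, using a linear solve instead of an explicit inverse."""
    n = k_train_train.shape[0]
    coeffs = np.linalg.solve(k_train_train + ridge * np.eye(n), targets)
    return k_test_train @ coeffs

def toy_kernel(a, b):
    """Placeholder Gram matrix (Gaussian kernel); an NTK Gram matrix would be used instead."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
x_train = rng.standard_normal((12, 5))
y_train = rng.integers(0, 3, size=12)
targets = centered_one_hot(y_train, n_classes=3)

x_test = rng.standard_normal((4, 5))
scores = kernel_predict(toy_kernel(x_test, x_train), toy_kernel(x_train, x_train), targets)
predicted_class = scores.argmax(axis=1)       # class with the largest predicted score
print(scores.shape, predicted_class)
```

With `ridge=0.0` the predictor interpolates the training targets whenever the Gram matrix is invertible; a small positive `ridge` is a common numerical safeguard but is not part of (41).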
Molecular Energy Regression with $\mathrm{SO}(3)$. We benchmark the NTK-predictor resulting from an $\mathrm{SO}(3)$-invariant network on the QM9 dataset (Ramakrishnan et al., 2014) by predicting molecular energies $U_0$ from atom configurations using (41). Comparing this to the corresponding MLP kernel, we observe a considerable performance boost for the invariant kernel over a range of training set sizes, as shown in Figure 3. For preprocessing, we construct spherical signals from the atom configurations as described by Esteves et al. (2023). For each atom $i$ of the at most 29 atoms, the environment is represented by pairwise Gaussian smearing over atoms with the same atomic number $z$,

$$f_{i,z,p}(x) = \sum_{j \neq i,\; z_j = z} \frac{z_i z}{\lVert r_{ij} \rVert^{p}} \exp\!\left( -\frac{1}{\beta} \left( \frac{r_{ij}}{\lVert r_{ij} \rVert} \cdot x - 1 \right)^{2} \right) . \tag{42}$$

Choosing $p \in \{2, 6\}$ and considering all of QM9's five atom types leads to 29 spherical per-atom signals with $5 \times 2$ channels each. Each of those per-atom signals is then processed either by a two-layer $\mathrm{SO}(3)$-equivariant network with group pooling on top or by a two-layer MLP. The per-atom outputs are eventually summed and fed into a final fully-connected layer, similar to Cohen et al. (2018). The input signals are constructed at a resolution of $12 \times 11$ on the sphere, corresponding to a bandlimit of $L = 6$, and are then downsampled to a bandlimit of $L = 3$ for the group layer. We provide further details in Appendix F.3.

Figure 4. Convergence of finite-width ensembles trained with data augmentation to ensembles of GCNNs on MNIST. $L_2$-distance between the logits of the equivariant and non-equivariant ensemble trained with data augmentation for different ensemble sizes on out of distribution data. For larger ensembles, the distance decreases. (Plotted quantity: L2(equiv-logits, non-equiv-logits); curves for ensemble sizes 1, 3, 8 and 100.)

Data Augmentation Versus Group Convolutions at Finite Width.
In Section 5, we proved that networks trained with data augmentation converge in expectation to group convolutional networks at infinite width. We also verify empirically that, at finite width, the mean predictions agree progressively better with increasing ensemble size, supporting approximate validity away from the theoretical limit. To this end, we train ensembles of CNNs and GCNNs with symmetry group $C_4 \ltimes \mathbb{R}^2$, as discussed in Remark 5.4, on CIFAR10 and MNIST using the MSE-loss against smoothed one-hot labels as for the medical images above. For implementing the GCNNs, we used the escnn-package (Weiler & Cesa, 2019). As shown in Figure 4 for MNIST, the outputs of both ensembles converge to the same vector for large ensemble sizes throughout training and even out of distribution. For further details on the model architectures, out of distribution data, as well as results on CIFAR10 and histological images, see Appendix F.4.

7. Conclusion

This paper provides recursive relations for the NNGP and NTK for group convolutional neural networks, allowing us to theoretically establish an interesting equivalence between equivariance-based and data-augmentation-based training dynamics. We also show that equivariant kernels outperform their non-equivariant counterparts as kernel machines. A careful comparison of equivariant GCNNs and equivariant data augmentation, beyond the invariant case analyzed in Section 5, would be an interesting subject for further research. In particular, Theorem 5.1 can be straightforwardly extended to the equivariant case, as demonstrated in Appendix E. However, Theorems 5.2 and 5.3 rely on the group pooling layer and are thus specialized to invariant GCNNs. An equivariant extension would require new layers beyond those presented here since in the infinite-width limit, the NTK of an MLP becomes proportional to the unit matrix in the output channels, trivializing the feature map.

Acknowledgments

J.G. and P.M.
want to thank Max Guillen for inspiring discussions and collaborations on related projects. The work of J.G. and P.M. is supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at C3SE partially funded by the Swedish Research Council through grant agreements 2022-06725 and 2018-05973.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pp. 1–3, 2024.

Amari, S.-i. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276, February 1998. ISSN 0899-7667. doi: 10.1162/089976698300017746.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On Exact Computation with an Infinitely Wide Neural Net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf.

Basu, S., Sattigeri, P., Ramamurthy, K. N., Chenthamarakshan, V., Varshney, K. R., Varshney, L. R., and Das, P. Equi-Tuning: Group Equivariant Fine-Tuning of Pretrained Models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 6788–6796, June 2023. doi: 10.1609/aaai.v37i6.25832. URL https://ojs.aaai.org/index.php/AAAI/article/view/25832.

Batzner, S., Musaelian, A., Sun, L., Geiger, M., Mailoa, J.
P., Kornbluth, M., Molinari, N., Smidt, T. E., and Kozinsky, B. E(3)-Equivariant Graph Neural Networks for Data Efficient and Accurate Interatomic Potentials. Nature Communications, 13(1):2453, May 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-29939-5. URL https://doi.org/10.1038/s41467-022-29939-5.

Bekkers, E. J., Lafarge, M. W., Veta, M., Eppenhof, K. A. J., Pluim, J. P. W., and Duits, R. Roto-Translation Covariant Convolutional Networks for Medical Image Analysis. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Lecture Notes in Computer Science, pp. 440–448, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00928-1. doi: 10.1007/978-3-030-00928-1_50.

Bekkers, E. J., Vadgama, S. P., Hesselink, R., van der Linden, P. A., and Romero, D. W. Fast, expressive se(n) equivariant networks through weight-sharing in position-orientation space. In International Conference on Learning Representations, ICLR, 2024. URL https://openreview.net/forum?id=dPHLbUqGbr.

Bogatskiy, A., Anderson, B., Offermann, J., Roussi, M., Miller, D., and Kondor, R. Lorentz group equivariant neural network for particle physics. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 992–1002. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/bogatskiy20a.html.

Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., and Anandkumar, A. Spherical Fourier neural operators: Learning stable dynamics on the sphere. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 2806–2823. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/bonev23a.html.

Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1024–1034.
PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/bordelon20a.html.

Brehmer, J., Behrends, S., de Haan, P., and Cohen, T. Does equivariance matter at scale? arXiv, October 2024. ISSN 2331-8422. URL https://arxiv.org/abs/2410.23179.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017. ISSN 1558-0792. doi: 10.1109/MSP.2017.2693418.

Canatar, A., Bordelon, B., and Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1), May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-23103-1. URL http://dx.doi.org/10.1038/s41467-021-23103-1.

Celledoni, E., Ehrhardt, M. J., Etmann, C., Owren, B., Schönlieb, C.-B., and Sherry, F. Equivariant neural networks for inverse problems. Inverse Problems, 37(8):085006, July 2021. doi: 10.1088/1361-6420/ac104f. URL https://dx.doi.org/10.1088/1361-6420/ac104f.

Cesa, G., Lang, L., and Weiler, M. A Program to Build E(N)-Equivariant Steerable CNNs. In International Conference on Learning Representations, ICLR, 2022. URL https://openreview.net/forum?id=WE4qe9xlnQw.

Chidester, B., Ton, T.-V., Tran, M.-T., Ma, J., and Do, M. N. Enhanced Rotation-Equivariant U-Net for Nuclear Segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1097–1104, June 2019. doi: 10.1109/CVPRW.2019.00143.

Cohen, T. Equivariant Convolutional Networks. PhD thesis, University of Amsterdam, 2021.

Cohen, T. and Welling, M. Group Equivariant Convolutional Networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2990–2999, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/cohenc16.html.

Cohen, T. S. and Welling, M. Steerable CNNs.
In International Conference on Learning Representations, ICLR, 2017. URL https://openreview.net/forum?id=rJQKYt5ll.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. In International Conference on Learning Representations, ICLR, 2018. URL https://openreview.net/forum?id=Hkbd5xZRb.

Day, H., Kahn, Y., and Roberts, D. A. Feature Learning and Generalization in Deep Networks with Orthogonal Weights. arXiv, October 2023. ISSN 2331-8422. URL https://arxiv.org/abs/2310.07765.

Dombrowski, A.-K., Gerken, J. E., Müller, K.-R., and Kessel, P. Diffeomorphic Counterfactuals With Generative Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3257–3274, May 2024. ISSN 1939-3539. doi: 10.1109/TPAMI.2023.3339980.

Driscoll, J. and Healy, D. Computing Fourier transforms and convolutions on the 2-sphere. Advances in Applied Mathematics, 15(2):202–250, 1994. ISSN 0196-8858. doi: https://doi.org/10.1006/aama.1994.1008. URL https://www.sciencedirect.com/science/article/pii/S0196885884710086.

Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/663fd3c5144fd10bd5ca6611a9a5b92d-Paper.pdf.

Duval, A., Mathis, S. V., Joshi, C. K., Schmidt, V., Miret, S., Malliaros, F. D., Cohen, T., Lio, P., Bengio, Y., and Bronstein, M. A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems. arXiv, December 2023. ISSN 2331-8422. URL https://arxiv.org/abs/2312.07511.

Engel, A., Wang, Z., Frank, N., Dumitriu, I., Choudhury, S., Sarwate, A. D., and Chiang, T. Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models. In International Conference on Learning Representations, ICLR, 2024.
URL https://openreview.net/forum?id=yKksu38BpM.

Erbin, H., Lahoche, V., and Samary, D. O. Nonperturbative renormalization for the neural network-QFT correspondence. Machine Learning: Science and Technology, 3(1):015027, March 2022. ISSN 2632-2153. doi: 10.1088/2632-2153/ac4f69.

Esteves, C., Slotine, J.-J., and Makadia, A. Scaling spherical CNNs. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 9396–9411. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/esteves23a.html.

Franceschi, J.-Y., Bézenac, E. D., Ayed, I., Chen, M., Lamprier, S., and Gallinari, P. A Neural Tangent Kernel Perspective of GANs. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 6660–6704. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/franceschi22a.html.

Fuchs, F., Worrall, D., Fischer, V., and Welling, M. SE(3)-transformers: 3D roto-translation equivariant attention networks. In Advances in Neural Information Processing Systems, volume 33, pp. 1970–1981. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/15231a7ce4ba789d13b722cc5c955834-Paper.pdf.

Geiger, M., Spigler, S., Jacot, A., and Wyart, M. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020(11):113301, November 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/abc4de.

Gerken, J. E. and Kessel, P. Emergent Equivariance in Deep Ensembles. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 15438–15465. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/gerken24a.html.

Gerken, J. E., Carlsson, O., Linander, H., Ohlsson, F., Petersson, C., and Persson, D. Equivariance versus Augmentation for Spherical Images. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 7404–7421.
PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/gerken22a.html.

Gerken, J. E., Aronsson, J., Carlsson, O., Linander, H., Ohlsson, F., Petersson, C., and Persson, D. Geometric deep learning and equivariant neural networks. Artificial Intelligence Review, June 2023. ISSN 1573-7462. doi: 10.1007/s10462-023-10502-7.

Golikov, E., Pokonechnyy, E., and Korviakov, V. Neural Tangent Kernel: A Survey. arXiv, August 2022. ISSN 2331-8422. URL https://arxiv.org/abs/2208.13614.

Halverson, J., Maiti, A., and Stoner, K. Neural Networks and Quantum Field Theory. Machine Learning: Science and Technology, 2(3):035002, September 2021. ISSN 2632-2153. doi: 10.1088/2632-2153/abeca3.

Hayase, J. and Oh, S. Few-shot Backdoor Attacks via Neural Tangent Kernels. In International Conference on Learning Representations, ICLR, 2023. URL https://openreview.net/forum?id=a70lGJ-rwy.

Huang, J. and Yau, H.-T. Dynamics of Deep Neural Networks and Neural Tangent Hierarchy. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 4542–4551. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/huang20l.html.

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.

Jacot, A., Gabriel, F., Ged, F., and Hongler, C. Order and Chaos: NTK views on DNN Normalization, Checkerboard and Boundary Artifacts. arXiv, June 2020. ISSN 2331-8422. URL https://arxiv.org/abs/1907.05715.

Kather, J. N., Halama, N., and Marx, A. 100,000 histological images of human colorectal cancer and healthy tissue. Zenodo, April 2018. URL https://doi.org/10.5281/zenodo.1214456.

Kondor, R. and Trivedi, S.
On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2747–2755. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/kondor18a.html.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep Neural Networks as Gaussian Processes. In International Conference on Learning Representations, ICLR, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.

Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/0d1a9651497a38d8b1c3871c84528bd4-Paper.pdf.

Lee, J., Schoenholz, S., Pennington, J., Adlam, B., Xiao, L., Novak, R., and Sohl-Dickstein, J. Finite versus infinite neural networks: An empirical study. In Advances in Neural Information Processing Systems, volume 33, pp. 15156–15172. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ad086f59924fffe0773f8d0ca22ea712-Paper.pdf.

Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., and Arora, S. Enhanced Convolutional Neural Tangent Kernels. arXiv, November 2019. ISSN 2331-8422. URL https://arxiv.org/abs/1911.00809.

Liao, Y. and Smidt, T. E. Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. In International Conference on Learning Representations, ICLR, 2023. URL https://openreview.net/forum?id=KwmPfARgOTD.

Mok, J., Na, B., Kim, J.-H., Han, D., and Yoon, S. Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training? In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.
11851–11860, June 2022. doi: 10.1109/CVPR52688.2022.01156.

Moskalev, A., Sepliarskaia, A., Bekkers, E. J., and Smeulders, A. W. On genuine invariance learning without weight-tying. In Proceedings of 2nd Annual Workshop on Topology, Algebra, and Geometry in Machine Learning (TAG-ML), volume 221 of Proceedings of Machine Learning Research, pp. 218–227. PMLR, 28 Jul 2023. URL https://proceedings.mlr.press/v221/moskalev23a.html.

Moyer, D., Abaci Turk, E., Grant, P. E., Wells, W. M., and Golland, P. Equivariant Filters for Efficient Tracking in 3D Imaging. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pp. 193–202, Cham, 2021. Springer International Publishing. ISBN 978-3-030-87202-1. doi: 10.1007/978-3-030-87202-1_19.

Müller, P., Golkov, V., Tomassini, V., and Cremers, D. Rotation-Equivariant Deep Learning for Diffusion MRI. arXiv, February 2021. ISSN 2331-8422. URL https://arxiv.org/abs/2102.06942.

Neal, R. M. Bayesian Learning for Neural Networks. Springer Science & Business Media, 1996. ISBN 978-1-4612-0745-0.

Nicoli, K. A., Nakajima, S., Strodthoff, N., Samek, W., Müller, K.-R., and Kessel, P. Asymptotically unbiased estimation of physical observables with neural samplers. Physical Review E, 101(2):023304, 2020.

Nordenfors, O. and Flinth, A. Ensembles provably learn equivariance through data augmentation. arXiv, October 2024. ISSN 2331-8422. URL https://arxiv.org/abs/2410.01452.

Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., Sohl-Dickstein, J., and Schoenholz, S. S. Neural Tangents: Fast and Easy Infinite Neural Networks in Python. In International Conference on Learning Representations, ICLR, 2020. URL https://openreview.net/forum?id=SklD9yrFPS.

Pang, S., Du, A., Orgun, M. A., Wang, Y., Sheng, Q. Z., Wang, S., Huang, X., and Yu, Z. Beyond CNNs: Exploiting Further Inherent Symmetries in Medical Image Segmentation. IEEE Transactions on Cybernetics, 53(11):6776–6787, November 2023.
ISSN 2168-2275. doi: 10.1109/TCYB.2022.3195447.

Perin, A. and Deny, S. On the ability of deep networks to learn symmetries from data: A neural kernel theory. arXiv, December 2024. ISSN 2331-8422. URL https://arxiv.org/abs/2412.11521.

Perraudin, N., Defferrard, M., Kacprzak, T., and Sgier, R. DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130–146, April 2019. ISSN 2213-1337. doi: 10.1016/j.ascom.2019.03.004.

Price, M. A. and McEwen, J. D. Differentiable and accelerated spherical harmonic and Wigner transforms. Journal of Computational Physics, 510:113109, 2024. doi: 10.1016/j.jcp.2024.113109.

Ramakrishnan, R., Dral, P. O., Rupp, M., and von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.

Schütt, K., Unke, O., and Gastegger, M. Equivariant Message Passing for the Prediction of Tensorial Properties and Molecular Spectra. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 9377–9388. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/schutt21a.html.

Tsai, C.-P., Yeh, C.-K., and Ravikumar, P. Sample Based Explanations via Generalized Representers. In Advances in Neural Information Processing Systems, volume 36, pp. 23485–23498. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/49cf35ff2298c10452db99d08036805b-Paper-Conference.pdf.

Unke, O., Bogojeski, M., Gastegger, M., Geiger, M., Smidt, T., and Müller, K.-R. SE(3)-equivariant prediction of molecular wavefunctions and electronic densities. In Advances in Neural Information Processing Systems, volume 34, pp. 14434–14447. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/78f1893678afbeaa90b1fa01b9cfb860-Paper.pdf.

Wang, R., Walters, R., and Yu, R. Data augmentation vs.
equivariant networks: A theory of generalization on dynamics forecasting. arXiv, June 2022a. ISSN 2331-8422. URL https://arxiv.org/abs/2206.09450.

Wang, S., Yu, X., and Perdikaris, P. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, January 2022b. ISSN 0021-9991. doi: 10.1016/j.jcp.2021.110768.

Wang, Y., Elhag, A. A., Jaitly, N., Susskind, J. M., and Bautista, M. A. Swallowing the Bitter Pill: Simplified Scalable Conformer Generation. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 50400–50418. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/wang24q.html.

Weiler, M. and Cesa, G. General E(2)-Equivariant Steerable CNNs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/45d6637b718d0f24a237069fe41b0db4-Paper.pdf.

Weiler, M., Hamprecht, F. A., and Storath, M. Learning Steerable Filters for Rotation Equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849–858, 2018.

Weiler, M., Forré, P., Verlinde, E., and Welling, M. Equivariant and Coordinate Independent Convolutional Networks. 2023. URL https://maurice-weiler.gitlab.io/cnn_book/EquivariantAndCoordinateIndependentCNNs.pdf.

Winkels, M. and Cohen, T. S. Pulmonary nodule detection in CT scans with equivariant CNNs. Medical Image Analysis, 55:15–26, July 2019. ISSN 1361-8415. doi: 10.1016/j.media.2019.03.010.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 5393–5402. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/xiao18a.html.

Yaida, S.
Non-Gaussian processes and neural networks at finite widths. In Proceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107, pp. 165–192. PMLR, 20–24 Jul 2020. URL https://proceedings.mlr.press/v107/yaida20a.html.

Yang, G. Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation. arXiv, April 2020. ISSN 2331-8422. URL https://arxiv.org/abs/1902.04760.

Yang, G. and Hu, E. J. Tensor programs IV: Feature learning in infinite-width neural networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 11727–11737. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21c.html.

Yang, G., Hu, E., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. In Advances in Neural Information Processing Systems, volume 34, pp. 17084–17097. Curran Associates, Inc., 2021.

Yang, H. and Wang, Z. On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pp. 1513–1553. PMLR, April 2023.

A. Basics of Neural Tangent Kernel Theory

A neural network $\mathcal{N} : \mathbb{R}^{n_{\mathrm{in}}} \to \mathbb{R}^{n_{\mathrm{out}}}$ which is trained using continuous gradient descent on a loss function $\mathcal{L}$ with learning rate $\eta$ evolves according to

$$\frac{\mathrm{d}\mathcal{N}(x)}{\mathrm{d}t} = -\eta \sum_{i=1}^{n_{\mathrm{train}}} \Theta_t(x, x_i)\, \frac{\partial \mathcal{L}}{\partial \mathcal{N}(x_i)} , \tag{44}$$

where the sum runs over the training samples $x_i$ and $\Theta_t \in \mathbb{R}^{n_{\mathrm{out}} \times n_{\mathrm{out}}}$ is the NTK

$$\Theta_t(x, x') = \sum_{\mu} \frac{\partial \mathcal{N}(x)}{\partial \theta_\mu} \left( \frac{\partial \mathcal{N}(x')}{\partial \theta_\mu} \right)^{\top} , \tag{45}$$

with $\theta_\mu$ running over all trainable parameters. For finite-width networks, $\Theta_t$ depends on the initialization and the training time and is referred to as the empirical NTK.
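The dependence of the empirical NTK on the initialization can be made concrete with a small numerical experiment. The sketch below is an illustration under stated assumptions, not the setup used in the paper: it uses a bias-free two-layer ReLU network with scalar output in NTK parametrization, $\mathcal{N}(x) = \frac{1}{\sqrt{n}}\, v \cdot \mathrm{relu}(W x / \sqrt{d})$, computes the parameter gradients by hand, and measures how the spread of the empirical NTK over random initializations changes with the width $n$. The function name `empirical_ntk` is hypothetical.

```python
import numpy as np

def empirical_ntk(x1, x2, width, rng):
    """Empirical NTK of N(x) = v . relu(W x / sqrt(d)) / sqrt(width) at one random init."""
    d = x1.shape[0]
    W = rng.normal(size=(width, d))  # standard Gaussian initialization
    v = rng.normal(size=width)

    def param_gradient(x):
        pre = W @ x / np.sqrt(d)                       # hidden pre-activations
        grad_v = np.maximum(pre, 0.0) / np.sqrt(width)  # dN/dv_j
        grad_W = np.outer(v * (pre > 0), x) / np.sqrt(width * d)  # dN/dW_jk
        return np.concatenate([grad_v, grad_W.ravel()])

    # Sum over all parameters of the product of gradients at x1 and x2.
    return param_gradient(x1) @ param_gradient(x2)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=8), rng.normal(size=8)

# Standard deviation of the empirical NTK over 200 initializations per width.
spread = {
    width: np.std([empirical_ntk(x1, x2, width, rng) for _ in range(200)])
    for width in (16, 256, 4096)
}
```

In this toy setting the spread shrinks as the width grows, which is the finite-width counterpart of the kernel becoming deterministic in the infinite-width limit.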
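At infinite width, $\Theta_t$ becomes independent of the initialization (it still depends on the initialization distribution) since it approaches its expectation value over initializations

$$\Theta(x, x') = \mathbb{E}_{\theta \sim p_{\mathrm{init}}}\left[ \sum_{\mu} \frac{\partial \mathcal{N}(x)}{\partial \theta_\mu} \left( \frac{\partial \mathcal{N}(x')}{\partial \theta_\mu} \right)^{\top} \right] . \tag{46}$$

It furthermore becomes constant throughout training and proportional to the unit matrix (Jacot et al., 2018) in the NTK parametrization. For this reason, we drop the $t$-subscript on this frozen NTK and treat it as a scalar. In the following, we will always mean (46) when we refer to the NTK unless otherwise stated.

The NTK parametrization of a linear layer has an additional $1/\sqrt{n_{\text{fan in}}}$ prefactor and uses independent standard Gaussians as initialization distributions. Hence, an MLP layer is given by

$$\mathcal{N}^{(\ell)}(x) = \sigma\big(\mathcal{N}^{(\ell-1)}(x)\big) = \sigma\Big( \frac{1}{\sqrt{n_{\text{fan in}}}}\, W \mathcal{N}^{(\ell-2)}(x) + b \Big) , \tag{47}$$

with nonlinearity $\sigma$, weights $W$ and bias $b$.

B. Proofs: Kernel Recursions for GCNN-Layers

In this section, we provide proofs for the theorems given in Section 4.1 in the main text.

Theorem 4.1 (Kernel recursions for group convolutional layers). The layer-wise recursive relations for the NNGP and NTK of the group convolutional layer (4) are given by

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; K^{(\ell)}_{gh,\,g'h}(f, f') \tag{11}$$

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = K^{(\ell+1)}_{g,g'}(f, f') + \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; \Theta^{(\ell)}_{gh,\,g'h}(f, f') . \tag{12}$$

Proof. We first compute the NNGP recursion relation. For group-convolution layers, the definition (3) of the NNGP reads

$$K^{(\ell+1)}_{g,g'}(f, f') = \mathbb{E}\Big[ [\mathcal{N}^{(\ell+1)}(f)](g)\, [\mathcal{N}^{(\ell+1)}(f')](g') \Big] \tag{48}$$

$$= \frac{1}{|S_\kappa|} \int_G \mathrm{d}h\, \mathrm{d}h'\; \mathbb{E}\Big[ \kappa^{(\ell+1)}(g^{-1}h)\, \kappa^{(\ell+1)}(g'^{-1}h') \Big]\, \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](h)\, [\mathcal{N}^{(\ell)}(f')](h') \Big] , \tag{49}$$

where we have again dropped the $1/\sqrt{n_\ell}$-prefactors and channel dependencies since these converge to the expectation value in the infinite width limit. Next, we shift the integration variables by $g$ and $g'$, which leaves the Haar measure invariant by its definition,

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_G \mathrm{d}h\, \mathrm{d}h'\; \mathbb{E}\Big[ \kappa^{(\ell+1)}(h)\, \kappa^{(\ell+1)}(h') \Big]\, \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](gh)\, [\mathcal{N}^{(\ell)}(f')](g'h') \Big] . \tag{50}$$

Since the kernel components are sampled independently from standard Gaussians at initialization, we only obtain a contribution to the integral when $h = h'$ and $\kappa^{(\ell+1)}$ has support at this point, i.e.

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](gh)\, [\mathcal{N}^{(\ell)}(f')](g'h) \Big] . \tag{51}$$

Comparing to (3) shows that the right-hand side is just the NNGP $K^{(\ell)}(f, f')$ of the previous layer evaluated at group indices $gh$ and $g'h$. This proves the NNGP recursion relation stated in the theorem.

For the NTK recursion relation, we start by specializing the general expression (2) to group-convolution layers and adapting it to the functional framework used for feature maps,

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \int_{S_\kappa} \mathrm{d}h\; \mathbb{E}\left[ \frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta \kappa^{(\ell+1)}(h)}\, \frac{\delta [\mathcal{N}^{(\ell+1)}(f')](g')}{\delta \kappa^{(\ell+1)}(h)} \right] + \int_G \mathrm{d}\tilde g\, \mathrm{d}\tilde g'\; \mathbb{E}\left[ \frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta [\mathcal{N}^{(\ell)}(f)](\tilde g)}\, \underbrace{\sum_{\ell'=1}^{\ell} \int_{S_\kappa} \mathrm{d}\tilde h\; \frac{\delta [\mathcal{N}^{(\ell)}(f)](\tilde g)}{\delta \kappa^{(\ell')}(\tilde h)}\, \frac{\delta [\mathcal{N}^{(\ell)}(f')](\tilde g')}{\delta \kappa^{(\ell')}(\tilde h)}}_{\Theta^{(\ell)}_{\tilde g, \tilde g'}(f, f')}\, \frac{\delta [\mathcal{N}^{(\ell+1)}(f')](g')}{\delta [\mathcal{N}^{(\ell)}(f')](\tilde g')} \right] . \tag{52}$$

According to the layer definition (4), the derivatives evaluate to

$$\frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta \kappa^{(\ell+1)}(h)} = \frac{1}{\sqrt{n_\ell |S_\kappa|}}\, [\mathcal{N}^{(\ell)}(f)](gh) \tag{53}$$

$$\frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta [\mathcal{N}^{(\ell)}(f)](\tilde g)} = \frac{1}{\sqrt{n_\ell |S_\kappa|}}\, \kappa^{(\ell+1)}(g^{-1}\tilde g) . \tag{54}$$

Therefore, (52) becomes

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](gh)\, [\mathcal{N}^{(\ell)}(f')](g'h) \Big] + \frac{1}{|S_\kappa|} \int_G \mathrm{d}\tilde g\, \mathrm{d}\tilde g'\; \Theta^{(\ell)}_{\tilde g, \tilde g'}(f, f')\, \mathbb{E}\Big[ \kappa^{(\ell+1)}(g^{-1}\tilde g)\, \kappa^{(\ell+1)}(g'^{-1}\tilde g') \Big] \tag{55}$$

$$= K^{(\ell+1)}_{g,g'}(f, f') + \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}h\; \Theta^{(\ell)}_{gh,\,g'h}(f, f') , \tag{56}$$

where we have dropped the channel-prefactors as usual. The last line is just the NTK recursion to be proven.

Theorem 4.2 (Kernel recursions for the lifting layer). The layer-wise recursive relations for the NNGP and NTK of the lifting layer (6) are given by²

$$K^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; K^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') , \tag{16}$$

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; \Theta^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') + K^{(\ell+1)}_{g,g'}(f, f') , \tag{17}$$

where the regular representation $\rho_{\mathrm{reg}}$ is defined in (5).

² In practice, the lifting layer is usually the first layer, thus $\ell = 0$.

Proof.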
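The NNGP of the lifting layer (6) is given by

$$K^{(\ell+1)}_{g,g'}(f, f') = \mathbb{E}\Big[ [\mathcal{N}^{(\ell+1)}(f)](g)\, [\mathcal{N}^{(\ell+1)}(f')](g') \Big] \tag{57}$$

$$= \frac{1}{|S_\kappa|} \int_X \mathrm{d}x\, \mathrm{d}x'\; \mathbb{E}\Big[ \kappa(\rho(g^{-1})x)\, \kappa(\rho(g'^{-1})x') \Big]\, \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](x)\, [\mathcal{N}^{(\ell)}(f')](x') \Big] \tag{58}$$

$$= \frac{1}{|S_\kappa|} \int_X \mathrm{d}x\, \mathrm{d}x'\; \mathbb{E}\big[ \kappa(x)\, \kappa(x') \big]\, \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](\rho(g)x)\, [\mathcal{N}^{(\ell)}(f')](\rho(g')x') \Big] \tag{59}$$

$$= \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](\rho(g)x)\, [\mathcal{N}^{(\ell)}(f')](\rho(g')x) \Big] \tag{60}$$

$$= \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; K^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') , \tag{61}$$

where we have moved the regular representation through $\mathcal{N}^{(\ell)}$ onto $f$ by using equivariance. This proves the NNGP recursion-relation.

According to (2), the NTK recursion evaluates to

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \int_{S_\kappa} \mathrm{d}x\; \mathbb{E}\left[ \frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta \kappa^{(\ell+1)}(x)}\, \frac{\delta [\mathcal{N}^{(\ell+1)}(f')](g')}{\delta \kappa^{(\ell+1)}(x)} \right] + \int_X \mathrm{d}\tilde x\, \mathrm{d}\tilde x'\; \mathbb{E}\left[ \frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta [\mathcal{N}^{(\ell)}(f)](\tilde x)}\, \Theta^{(\ell)}_{\tilde x, \tilde x'}(f, f')\, \frac{\delta [\mathcal{N}^{(\ell+1)}(f')](g')}{\delta [\mathcal{N}^{(\ell)}(f')](\tilde x')} \right] . \tag{62}$$

The derivatives in this expression are given by

$$\frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta \kappa^{(\ell+1)}(x)} = \frac{1}{\sqrt{n_\ell |S_\kappa|}}\, [\mathcal{N}^{(\ell)}(f)](\rho(g)x) \tag{63}$$

$$\frac{\delta [\mathcal{N}^{(\ell+1)}(f)](g)}{\delta [\mathcal{N}^{(\ell)}(f)](\tilde x)} = \frac{1}{\sqrt{n_\ell |S_\kappa|}}\, \kappa^{(\ell+1)}(\rho(g^{-1})\tilde x) . \tag{64}$$

Plugging this back into (62) yields the desired NTK recursion relation,

$$\Theta^{(\ell+1)}_{g,g'}(f, f') = \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](\rho(g)x)\, [\mathcal{N}^{(\ell)}(f')](\rho(g')x) \Big] + \frac{1}{|S_\kappa|} \int_X \mathrm{d}\tilde x\, \mathrm{d}\tilde x'\; \Theta^{(\ell)}_{\tilde x, \tilde x'}(f, f')\, \mathbb{E}\Big[ \kappa^{(\ell+1)}(\rho(g^{-1})\tilde x)\, \kappa^{(\ell+1)}(\rho(g'^{-1})\tilde x') \Big] \tag{65}$$

$$= K^{(\ell+1)}_{g,g'}(f, f') + \frac{1}{|S_\kappa|} \int_{S_\kappa} \mathrm{d}x\; \Theta^{(\ell)}_{\rho(g)x,\,\rho(g')x}(f, f') . \tag{66}$$

Theorem 4.3 (Kernel recursions for group pooling layer). The layer-wise recursive relations for the NNGP and NTK of the group pooling layer (10) are given by

$$K^{(\ell+1)}(f, f') = \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g\, \mathrm{d}g'\; K^{(\ell)}_{g,g'}(f, f') \tag{18}$$

$$\Theta^{(\ell+1)}(f, f') = \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g\, \mathrm{d}g'\; \Theta^{(\ell)}_{g,g'}(f, f') . \tag{19}$$

Proof. Since we integrate over the entire domain of the input feature maps $\mathcal{N}^{(\ell)}(f) : G \to \mathbb{R}^{n_\ell}$ in the pooling layer (10), the output features $\mathcal{N}^{(\ell+1)}(f) \in \mathbb{R}^{n_{\ell+1}}$ are a channel-vector. Therefore, the NNGP of the group pooling layer is given by

$$K^{(\ell+1)}(f, f') = \mathbb{E}\Big[ \mathcal{N}^{(\ell+1)}(f)\, \mathcal{N}^{(\ell+1)}(f') \Big] \tag{67}$$

$$= \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g\, \mathrm{d}g'\; \mathbb{E}\Big[ [\mathcal{N}^{(\ell)}(f)](g)\, [\mathcal{N}^{(\ell)}(f')](g') \Big] \tag{68}$$

$$= \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g\, \mathrm{d}g'\; K^{(\ell)}_{g,g'}(f, f') . \tag{69}$$

The NTK recursion (2) is in this case

$$\Theta^{(\ell+1)}(f, f') = \int_G \mathrm{d}g\, \mathrm{d}g'\; \mathbb{E}\left[ \frac{\delta \mathcal{N}^{(\ell+1)}(f)}{\delta [\mathcal{N}^{(\ell)}(f)](g)}\, \Theta^{(\ell)}_{g,g'}(f, f')\, \frac{\delta \mathcal{N}^{(\ell+1)}(f')}{\delta [\mathcal{N}^{(\ell)}(f')](g')} \right] \tag{70}$$

$$= \frac{1}{(\mathrm{vol}(G))^2} \int_G \mathrm{d}g\, \mathrm{d}g'\; \Theta^{(\ell)}_{g,g'}(f, f') , \tag{71}$$

which is the NTK-relation to be shown.

C. Equivariant Kernels for Roto-Translations in the Plane

In this appendix, we provide explicit expressions for the kernel recursions of lifting, group convolutional and group pooling layers for the special case of roto-translations in the plane, i.e. for the symmetry group $G = C_n \ltimes \mathbb{R}^2$. In this case, the general expressions in Theorems 4.1, 4.2 and 4.3 simplify and can be written in terms of the $A$ operator (30) which can be computed efficiently in terms of ordinary 2d convolutions. However, before discussing the kernel recursions, we will first establish a simplifying notation for the GCNN layer-definitions.

C.1. GCNNs for $G = C_n \ltimes \mathbb{R}^2$

Due to the semidirect-product structure of $G$, any element $g \in G$ can be written uniquely as a product of a translation and a rotation, $g = tr$ with $t \in \mathbb{R}^2$ and $r \in C_n$.³ We can therefore write a feature map $\mathcal{N}^{(\ell)}$ on $G$ as a stack of $n$ feature maps on $\mathbb{R}^2$,

$$\mathcal{N}^{(\ell)}(g = tr) = \mathcal{N}^{(\ell)}_r(t) . \tag{72}$$

Using this representation, the lifting and group convolutional layers can be written in terms of ordinary two-dimensional convolutions as (Cohen & Welling, 2016)

$$[\mathcal{N}^{(1)}(f)]_r(t) = \frac{1}{\sqrt{n_0 |S_\kappa|}} \int_{\mathbb{R}^2} \mathrm{d}x\; \kappa\big( \rho(r^{-1})(x - t) \big)\, f(x) \tag{73}$$

$$[\mathcal{N}^{(\ell+1)}(f)]_r(t) = \frac{1}{\sqrt{n_\ell |S_\kappa|}} \sum_{r' \in C_n} \int_{\mathbb{R}^2} \mathrm{d}t'\; \kappa_{r^{-1}r'}\big( \rho(r^{-1})(t' - t) \big)\, [\mathcal{N}^{(\ell)}(f)]_{r'}(t') , \quad \ell \geq 1 , \tag{74}$$

where $\rho$ is the fundamental representation of $\mathrm{SO}(2)$ on $\mathbb{R}^2$, given by two-dimensional rotation matrices. Finally, for invariant problems like classification, the group pooling layer (10) is central to making the network invariant. For $C_n \ltimes \mathbb{R}^2$, it is given by

$$\mathcal{N}^{(\ell+1)}(f) = \frac{1}{\sqrt{n\, |\mathrm{supp}(\mathcal{N}^{(\ell)}(f))|}} \sum_{r \in C_n} \int \mathrm{d}x\; [\mathcal{N}^{(\ell)}(f)]_r(x) . \tag{75}$$

C.2.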
Kernel recursions for $G = C_n \ltimes \mathbb{R}^2$

In analogy to the notation introduced in the previous section for feature maps, we write for the NNGP and NTK on $C_n \ltimes \mathbb{R}^2$

$$K_{g=tr,\,g'=t'r'}(f,f') = [K_{rr'}(f,f')](t,t')\,,\qquad \Theta_{g=tr,\,g'=t'r'}(f,f') = [\Theta_{rr'}(f,f')](t,t') \tag{76}$$

to emphasize the dependency on the two translations $t, t' \in \mathbb{R}^2$. Furthermore, we repeat here the definition of the operator (30) for convenience,

$$[A_{S_\kappa}(K)](t,t') = \frac{1}{|S_\kappa|}\int_{S_\kappa}\mathrm{d}\tilde{t}\,K(t + \tilde{t},\,t' + \tilde{t})\,. \tag{77}$$

Given these definitions, the recursive relations from Theorem 4.1 for group convolutions can be computed efficiently using the following lemma.

Footnote 3: In an abuse of notation, we will denote both the abstract translation group element and its representation as a vector in $\mathbb{R}^2$ by the same symbol.

Lemma C.1 (Kernel recursions of group convolutional layers for roto-translations). In the case $G = C_n \ltimes \mathbb{R}^2$, the layer-wise recursive relations for the NNGP and NTK of the group convolutional layer (74) are given by

\begin{align}
[K^{(\ell+1)}_{rr'}(f,f')](t,t') &= \sum_{\tilde{r}\in C_n}\big[A_{\rho(r)S_\kappa}\big(\tilde{K}^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')\big)\big](t,\,\rho(rr'^{-1})t') \tag{78}\\
[\Theta^{(\ell+1)}_{rr'}(f,f')](t,t') &= [K^{(\ell+1)}_{rr'}(f,f')](t,t') + \sum_{\tilde{r}\in C_n}\big[A_{\rho(r)S_\kappa}\big(\tilde{\Theta}^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')\big)\big](t,\,\rho(rr'^{-1})t')\,, \tag{79}
\end{align}

where

\begin{align}
[\tilde{K}^{(\ell)}_{rr'}(f,f')](t,t') &= [K^{(\ell)}_{rr'}(f,f')](t,\,\rho(r'r^{-1})t') \tag{80}\\
[\tilde{\Theta}^{(\ell)}_{rr'}(f,f')](t,t') &= [\Theta^{(\ell)}_{rr'}(f,f')](t,\,\rho(r'r^{-1})t')\,. \tag{81}
\end{align}

Proof. In order to compute the NNGP recursion relation in the notation (76), we first need to compute the unique decomposition of a general group product $gh$, with $g, h \in G$, into a rotation and a translation. This is possible since $G$ is a semidirect product group. Starting from $g = t_g r_g$ and $h = t_h r_h$ with $t_g, t_h \in \mathbb{R}^2$ and $r_g, r_h \in C_n$, we have

$$gh = t_g r_g t_h r_h = t_g r_g t_h r_g^{-1}\, r_g r_h\,. \tag{82}$$

Since $\mathbb{R}^2$ is a normal subgroup of $G$ (a further property implied by the semidirect product), $r_g t_h r_g^{-1} \in \mathbb{R}^2$. Therefore,

$$t_{gh} = t_g r_g t_h r_g^{-1} \in \mathbb{R}^2 \qquad\text{and}\qquad r_{gh} = r_g r_h \in C_n\,.$$
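(83)

Since the action of $t_{gh}$ on a vector $x \in \mathbb{R}^2$ is given by

$$\rho(t_{gh})x = x + \rho(r_g^{-1})t_h + t_g\,, \tag{84}$$

we obtain for the NNGP recursion from Theorem 4.1,

$$[K^{(\ell+1)}_{rr'}(f,f')](t,t') = \frac{1}{|S_\kappa|}\sum_{\tilde{r}\in C_n}\int_{S_\kappa}\mathrm{d}\tilde{t}\,[K^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')]\big(\rho(r^{-1})\tilde{t} + t,\;\rho(r'^{-1})\tilde{t} + t'\big)\,. \tag{85}$$

We will now write this in terms of the $A$-operator (77). The $A$-operator shifts both slots of its argument kernel by the same vector $\tilde{t}$, whereas in (85) the first argument is shifted by $\rho(r^{-1})\tilde{t}$ and the second by $\rho(r'^{-1})\tilde{t}$. We therefore cannot write (85) directly in terms of $A(K)$, but need to compute $A$ at a transformed argument instead and then transform back. To this end, first consider

\begin{align}
[A_{S_\kappa}(\tilde{K}^{(\ell)}_{rr'}(f,f'))](t,t') &= \frac{1}{|S_\kappa|}\int_{S_\kappa}\mathrm{d}\tilde{t}\,[\tilde{K}^{(\ell)}_{rr'}(f,f')](t + \tilde{t},\,t' + \tilde{t}) \tag{86}\\
&= \frac{1}{|S_\kappa|}\int_{S_\kappa}\mathrm{d}\tilde{t}\,[K^{(\ell)}_{rr'}(f,f')]\big(t + \tilde{t},\;\rho(r'r^{-1})(t' + \tilde{t})\big)\,. \tag{87}
\end{align}

Therefore, we obtain for the RHS of the NNGP recursion

\begin{align}
\sum_{\tilde{r}\in C_n}\big[A_{\rho(r)S_\kappa}\big(\tilde{K}^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')\big)\big]\big(t,\,\rho(rr'^{-1})t'\big)
&= \frac{1}{|S_\kappa|}\sum_{\tilde{r}\in C_n}\int_{\rho(r)S_\kappa}\mathrm{d}\tilde{t}\,[K^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')]\big(t + \tilde{t},\;\rho(r'r^{-1})\big(\rho(rr'^{-1})t' + \tilde{t}\big)\big) \tag{88}\\
&= \frac{1}{|S_\kappa|}\sum_{\tilde{r}\in C_n}\int_{\rho(r)S_\kappa}\mathrm{d}\tilde{t}\,[K^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')]\big(t + \tilde{t},\;t' + \rho(r'r^{-1})\tilde{t}\big) \tag{89}\\
&= \frac{1}{|S_\kappa|}\sum_{\tilde{r}\in C_n}\int_{S_\kappa}\mathrm{d}\tilde{t}\,[K^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')]\big(t + \rho(r)\tilde{t},\;t' + \rho(r')\tilde{t}\big)\,. \tag{90}
\end{align}

The last line is just (85), proving the NNGP recursion relation.

For the NTK, we start from the NTK recursion in Theorem 4.1. The structure of the integral appearing in that recursion is the same as that of the integral in the NNGP recursion. Therefore, the NTK recursion is given by

$$[\Theta^{(\ell+1)}_{rr'}(f,f')](t,t') = [K^{(\ell+1)}_{rr'}(f,f')](t,t') + \frac{1}{|S_\kappa|}\sum_{\tilde{r}\in C_n}\int_{S_\kappa}\mathrm{d}\tilde{t}\,[\Theta^{(\ell)}_{r\tilde{r},\,r'\tilde{r}}(f,f')]\big(\rho(r^{-1})\tilde{t} + t,\;\rho(r'^{-1})\tilde{t} + t'\big)\,. \tag{91}$$

The integral can be written in terms of the $A$-operator following the same steps as for the NNGP above.

Therefore, by first computing the kernels $\tilde{K}^{(\ell)}$ and $\tilde{\Theta}^{(\ell)}$ and then applying the $A$-operator, it is possible to efficiently compute the kernel recursions in this case.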
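To make the action of the $A$-operator (77) concrete, the following is a minimal NumPy sketch on a discrete grid. It is not the implementation used in the paper: the function name `a_operator`, the list of integer offsets standing in for the filter support $S_\kappa$, and the periodic boundary conditions are all assumptions made for illustration. The text states that the operator can be computed with ordinary 2d convolutions; the explicit loop over shifts below trades efficiency for readability.

```python
import numpy as np

def a_operator(K, offsets):
    """Discrete sketch of (77): average K(t + s, t' + s) over all s in `offsets`.

    K has shape (H, W, H, W) and holds K(t, t') on an H x W grid.
    `offsets` lists integer shifts (dy, dx) standing in for the support S_kappa.
    Periodic boundaries are an assumption made for simplicity.
    """
    out = np.zeros(K.shape, dtype=float)
    for dy, dx in offsets:
        # np.roll with shift -d returns an array whose entry i is K[i + d],
        # so both slots of the kernel are shifted by the same offset.
        out += np.roll(K, shift=(-dy, -dx, -dy, -dx), axis=(0, 1, 2, 3))
    return out / len(offsets)

# Hypothetical 3 x 3 filter support centred at the origin
offsets_3x3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
rng = np.random.default_rng(0)
K_demo = rng.normal(size=(5, 5, 5, 5))
AK_demo = a_operator(K_demo, offsets_3x3)
```

In the recursions of Lemma C.1, this operator would be applied to the transformed kernels $\tilde{K}^{(\ell)}$ and $\tilde{\Theta}^{(\ell)}$ with the rotated support $\rho(r)S_\kappa$, which for $C_4$ and a square filter coincides with the original support.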
Similarly, the recursive kernel relations for the lifting layer can also be written efficiently in terms of the $A$-operator, as detailed in the following lemma.

Lemma C.2 (Kernel recursions of lifting layers for roto-translations). The layer-wise recursive relations for the NNGP and NTK of the lifting layer (73) are given by

\begin{align}
[K^{(\ell+1)}_{rr'}(f,f')](t,t') &= \big[A_{\rho(r)S_\kappa}\big(\tilde{K}^{(\ell)}_{rr'}(f,f')\big)\big](t,\,\rho(rr'^{-1})t') \tag{92}\\
[\Theta^{(\ell+1)}_{rr'}(f,f')](t,t') &= [K^{(\ell+1)}_{rr'}(f,f')](t,t') + \big[A_{\rho(r)S_\kappa}\big(\tilde{\Theta}^{(\ell)}_{rr'}(f,f')\big)\big](t,\,\rho(rr'^{-1})t')\,, \tag{93}
\end{align}

where

\begin{align}
[\tilde{K}^{(\ell)}_{rr'}(f,f')](t,t') &= [K^{(\ell)}(f,f')](t,\,\rho(r'r^{-1})t') \tag{94}\\
[\tilde{\Theta}^{(\ell)}_{rr'}(f,f')](t,t') &= [\Theta^{(\ell)}(f,f')](t,\,\rho(r'r^{-1})t')\,. \tag{95}
\end{align}

Proof. According to Theorem 4.2, the NNGP recursion is given by

$$[K^{(\ell+1)}_{rr'}(f,f')](t,t') = \frac{1}{|S_\kappa|}\int_{S_\kappa}\mathrm{d}x\,K^{(\ell)}\big(\rho(r)x + t,\;\rho(r')x + t'\big)\,. \tag{96}$$

Comparing this expression to (85) shows that the sum over $\tilde{r}$ as well as the $r, r'$-indices on $K^{(\ell)}$ are absent in (96), but otherwise the two expressions agree. Therefore, we can use the same argument as above to rewrite (96) in terms of the $A$-operator and only need to drop the $r, r'$-indices in the definition of $\tilde{K}^{(\ell)}$ as well as pick the $\tilde{r} = e$ contribution in the sum. Similarly, we can show the NTK recursion relation starting from the NTK recursion in Theorem 4.2.

Finally, in the group pooling layer, the kernels are trivialized over their $r$- and $t$-indices, resulting in a kernel without spatial indices:

Lemma C.3 (Kernel recursions of group pooling layers for roto-translations). The layer-wise recursive relations for the NNGP and NTK of the group pooling layer (75) are given by

\begin{align}
K^{(\ell+1)}(f,f') &= \frac{1}{n\,|\mathrm{supp}(\mathcal{N}^{(\ell)}(f))|}\sum_{r,r'\in C_n}\int \mathrm{d}t\,\mathrm{d}t'\,[K^{(\ell)}_{rr'}(f,f')](t,t') \tag{97}\\
\Theta^{(\ell+1)}(f,f') &= \frac{1}{n\,|\mathrm{supp}(\mathcal{N}^{(\ell)}(f))|}\sum_{r,r'\in C_n}\int \mathrm{d}t\,\mathrm{d}t'\,[\Theta^{(\ell)}_{rr'}(f,f')](t,t')\,. \tag{98}
\end{align}

Proof. The integral over two copies of the group in Theorem 4.3 factorizes for $G = C_n \ltimes \mathbb{R}^2$ into integrals over the translations in $\mathbb{R}^2$ and sums over the discrete rotations in $C_n$. This immediately implies the recursions in the statement of the lemma.
The expressions given in the lemmata in this section can be implemented straightforwardly and therefore allow for explicit calculations of the NTK and NNGP of realistically sized GCNNs.

D. Equivariant NTK in the Fourier Domain for 3d Rotations

D.1. Group Convolutions in the Fourier Domain for $G = SO(3)$

For compact groups, it is possible to define a Fourier transformation. The group convolution (4) then becomes a point-wise product in the Fourier domain. For the case of $G = SO(3)$, the Fourier transformation is given in terms of Wigner matrices $D^l_{mn}$,

\begin{align}
f(R) &= \sum_{l=0}^{\infty}\frac{2l+1}{8\pi^2}\sum_{m,n=-l}^{l}\hat{f}^l_{mn}\,\overline{D^l_{mn}(R)} \tag{99}\\
\hat{f}^l_{mn} &= \int_{SO(3)}\mathrm{d}R\,f(R)\,D^l_{mn}(R)\,, \tag{100}
\end{align}

where $R \in SO(3)$ is a rotation matrix. Note that the presented convention corresponds to the one in the s2fft package (Price & McEwen, 2024). The rotations act naturally on the sphere $S^2$, on which the Fourier transform is given in terms of spherical harmonics $Y^l_m$,

\begin{align}
f(x) &= \sum_{l=0}^{\infty}\sum_{m=-l}^{l}\hat{f}^l_m\,Y^l_m(x) \tag{101}\\
\hat{f}^l_m &= \int_{S^2}\mathrm{d}x\,f(x)\,\overline{Y^l_m(x)}\,. \tag{102}
\end{align}

These Fourier transformations are used e.g. in spherical CNNs (Cohen et al., 2018) and steerable convolutional networks (Weiler et al., 2018), which define equivariant group convolution layers with respect to $SO(3)$ and act on input features defined on the sphere $S^2$. The change to Fourier space is motivated by the fact that group convolutions reduce to simple multiplications of the corresponding Fourier components.

For $SO(3)$, the group convolutions (4) for filter support $S_\kappa \subseteq SO(3)$ are defined as

$$[\mathcal{N}^{(\ell+1)}(f)](R) = \frac{1}{\sqrt{n_\ell|S_\kappa|}}\int_{S_\kappa}\mathrm{d}S\,\kappa\big(R^{-1}S\big)\,[\mathcal{N}^{(\ell)}(f)](S)\,. \tag{103}$$

Using

\begin{align}
D^l_{mn}(R^{-1}) &= \overline{D^l_{nm}(R)}\,, \tag{104}\\
\overline{D^l_{mn}(R)} &= (-1)^{m-n}\,D^l_{-m,-n}(R)\,, \tag{105}\\
\int_{SO(3)}\mathrm{d}R\,\overline{D^l_{mn}(R)}\,D^{l'}_{m'n'}(R) &= \frac{8\pi^2}{2l+1}\,\delta_{ll'}\,\delta_{nn'}\,\delta_{mm'}\,, \tag{106}
\end{align}

the Fourier components (100) of the layer in (103) can be written compactly as

$$[\widehat{\mathcal{N}^{(\ell+1)}(f)}]^l_{mn} = \frac{1}{\sqrt{n_\ell|S_\kappa|}}\sum_{p=-l}^{l}[\widehat{\mathcal{N}^{(\ell)}(f)}]^l_{mp}\,\overline{\hat{\kappa}^l_{np}}\,. \tag{107}$$

Note that we have assumed a real-valued kernel $\kappa$.
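The statement that group convolutions reduce to multiplications of Fourier components can be illustrated with a short NumPy sketch of (107): for every degree $l$, the layer acts as a product of $(2l+1)\times(2l+1)$ coefficient blocks. This is only a sketch under stated assumptions. The function name `so3_conv_fourier`, the list-of-blocks data layout, and the `norm` argument (standing in for the layer's normalization prefactor) are hypothetical, and the complex conjugate on the filter coefficients reflects our reading of (107) for a real-valued filter under the conventions (100) and (104).

```python
import numpy as np

def so3_conv_fourier(feat_hat, kappa_hat, norm=1.0):
    """Per-degree block product sketching (107).

    feat_hat[l] and kappa_hat[l] are (2l+1, 2l+1) complex arrays indexed by (m, p)
    and (n, p). The result at degree l has entries
        out[m, n] = norm * sum_p feat_hat[l][m, p] * conj(kappa_hat[l][n, p]).
    """
    return [norm * F @ K.conj().T for F, K in zip(feat_hat, kappa_hat)]

L_DEMO = 3  # hypothetical bandlimit: degrees l = 0, 1, 2
rng = np.random.default_rng(0)

def random_coeffs():
    # Random stand-ins for Fourier coefficients, one block per degree l
    return [rng.normal(size=(2 * l + 1, 2 * l + 1)) + 1j * rng.normal(size=(2 * l + 1, 2 * l + 1))
            for l in range(L_DEMO)]

feat_hat, kappa_hat = random_coeffs(), random_coeffs()
out_hat = so3_conv_fourier(feat_hat, kappa_hat)
```

Since different degrees $l$ do not mix, the cost of the layer is that of one small matrix product per degree, rather than an integral over $SO(3)$ for every output rotation.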
Similarly, the lifting layer (6) for features on $S^2$ is

$$[\mathcal{N}^{(1)}(f)](R) = \frac{1}{\sqrt{n_0|S_\kappa|}}\int_{S^2}\mathrm{d}x\,\kappa\big(R^{-1}x\big)\,f(x)\,, \tag{108}$$

which, in terms of the Fourier coefficients (102), becomes

$$[\widehat{\mathcal{N}^{(1)}(f)}]^l_{mn} = \frac{1}{\sqrt{n_0|S_\kappa|}}\,\frac{8\pi^2}{2l+1}\,\hat{f}^l_m\,\overline{\hat{\kappa}^l_n}\,. \tag{109}$$

Again, we have assumed a real-valued kernel $\kappa$ and used the relations

\begin{align}
Y^l_m(Rx) &= \sum_{n=-l}^{l}\overline{D^l_{mn}(R)}\,Y^l_n(x)\,, \tag{110}\\
\overline{Y^l_m(x)} &= (-1)^m\,Y^l_{-m}(x)\,, \tag{111}\\
\int_{S^2}\mathrm{d}x\,\overline{Y^l_m(x)}\,Y^{l'}_{m'}(x) &= \delta_{ll'}\,\delta_{mm'}\,. \tag{112}
\end{align}

D.2. Kernel Recursions for $G = SO(3)$

As we have seen in Section D.1, $SO(3)$ group convolutions are frequently computed in the Fourier domain. In this section, we show how the kernel recursions from Theorems 4.1 and 4.2 for group convolution layers and lifting layers can also be computed in the Fourier domain, corresponding to the spherical convolutions presented in Section D.1. In the following, we assume filters $\kappa$ with global support $S_\kappa = SO(3)$ or $S_\kappa = S^2$, respectively. The reason is that equations (106) and (112) otherwise have to be replaced by expressions involving the Wigner 3j symbols. Since there is currently no efficient JAX-based implementation for computing them, we restrict ourselves to the more efficient case of global filters.

In terms of the Fourier coefficients defined in (31), the recursive kernel relations for the spherical convolution layer (103) are specified in the following lemma.

Lemma D.1 (Kernel recursions of $SO(3)$ group convolutions in the Fourier domain). The layer-wise recursive relations for the NNGP and NTK of the group convolutional layer (107) for $G = SO(3)$ and global filters are given by

\begin{align}
[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= \frac{1}{2l+1}\,\delta_{ll'}\,\delta_{n,-n'}\sum_{p=-l}^{l}(-1)^{n-p}\,[\widehat{K^{(\ell)}(f,f')}]^{l,l'}_{mp,m'(-p)} \tag{113}\\
[\widehat{\Theta^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= [\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} + \frac{1}{2l+1}\,\delta_{ll'}\,\delta_{n,-n'}\sum_{p=-l}^{l}(-1)^{n-p}\,[\widehat{\Theta^{(\ell)}(f,f')}]^{l,l'}_{mp,m'(-p)}\,. \tag{114}
\end{align}

Proof. We identify elements of $SO(3)$ with $3 \times 3$ rotation matrices $R, S, \dots$.
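Then, the recursive relations for the group convolution layer from Theorem 4.1 are

\begin{align}
K^{(\ell+1)}_{R,R'}(f,f') &= \frac{1}{8\pi^2}\int_{SO(3)}\mathrm{d}S\,K^{(\ell)}_{RS,R'S}(f,f') \tag{115}\\
\Theta^{(\ell+1)}_{R,R'}(f,f') &= K^{(\ell+1)}_{R,R'}(f,f') + \frac{1}{8\pi^2}\int_{SO(3)}\mathrm{d}S\,\Theta^{(\ell)}_{RS,R'S}(f,f')\,. \tag{116}
\end{align}

The Fourier coefficients of the NNGP are given by a double Fourier integral of the form (31), so the recursion (115) becomes

$$[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} = \frac{1}{8\pi^2}\int_{[SO(3)]^3}\mathrm{d}S\,\mathrm{d}R\,\mathrm{d}R'\,K^{(\ell)}_{RS,R'S}(f,f')\,D^l_{mn}(R)\,D^{l'}_{m'n'}(R')\,. \tag{117}$$

Plugging in the Fourier expansion of the kernel $K^{(\ell)}_{RS,R'S}(f,f')$, in the convention of (99), yields

\begin{align}
[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} = \frac{1}{8\pi^2}\int_{[SO(3)]^3}\mathrm{d}S\,\mathrm{d}R\,\mathrm{d}R'\sum_{p,p'=0}^{\infty}\frac{(2p+1)(2p'+1)}{(8\pi^2)^2}\sum_{q,r=-p}^{p}\;\sum_{q',r'=-p'}^{p'}&[\widehat{K^{(\ell)}(f,f')}]^{p,p'}_{qr,q'r'}\,\overline{D^p_{qr}(RS)}\;\overline{D^{p'}_{q'r'}(R'S)} \notag\\
&\times D^l_{mn}(R)\,D^{l'}_{m'n'}(R')\,. \tag{118}
\end{align}

Using (106), (105) and

$$D^l_{mn}(RS) = \sum_{p=-l}^{l}D^l_{mp}(R)\,D^l_{pn}(S)\,, \tag{119}$$

we can simplify the expression to

\begin{align}
[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= \frac{1}{8\pi^2}\sum_{p,p'=0}^{\infty}\;\sum_{q,r,u=-p}^{p}\;\sum_{q',r',u'=-p'}^{p'}(-1)^{r-u}\,\frac{8\pi^2}{2l+1}\,\delta_{pl}\,\delta_{mq}\,\delta_{nu}\,\delta_{l'p'}\,\delta_{m'q'}\,\delta_{n'u'}\,\delta_{pp'}\,\delta_{u,-u'}\,\delta_{r,-r'}\,[\widehat{K^{(\ell)}(f,f')}]^{p,p'}_{qr,q'r'} \tag{120}\\
&= \frac{1}{2l+1}\,\delta_{ll'}\,\delta_{n,-n'}\sum_{r=-l}^{l}(-1)^{r-n}\,[\widehat{K^{(\ell)}(f,f')}]^{l,l'}_{mr,m'(-r)}\,. \tag{121}
\end{align}

Renaming the summation index $r \to p$ yields the desired result. The computation for the NTK is analogous.

Similarly, for the lifting layer (108) for features on the sphere, the kernel recursions can be expressed in terms of the Fourier coefficients (102) according to the following lemma.

Lemma D.2 (Kernel recursions of spherical lifting layer in the Fourier domain). The layer-wise recursive relations for the NNGP and NTK of the lifting layer (109) from features on $S^2$ to features on $SO(3)$ with global filters are given by

\begin{align}
[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= \frac{1}{4\pi}\left(\frac{8\pi^2}{2l+1}\right)^2(-1)^n\,\delta_{ll'}\,\delta_{n,-n'}\,[\widehat{K^{(\ell)}(f,f')}]^{l,l'}_{m,m'} \tag{122}\\
[\widehat{\Theta^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= [\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} + \frac{1}{4\pi}\left(\frac{8\pi^2}{2l+1}\right)^2(-1)^n\,\delta_{ll'}\,\delta_{n,-n'}\,[\widehat{\Theta^{(\ell)}(f,f')}]^{l,l'}_{m,m'}\,. \tag{123}
\end{align}

Proof. Starting from the recursive relations for the lifting layer in Theorem 4.2, the recursions in real space for $G = SO(3)$ are

\begin{align}
K^{(\ell+1)}_{R,R'}(f,f') &= \frac{1}{4\pi}\int_{S^2}\mathrm{d}x\,K^{(\ell)}_{Rx,R'x}(f,f') \tag{124}\\
\Theta^{(\ell+1)}_{R,R'}(f,f') &= K^{(\ell+1)}_{R,R'}(f,f') + \frac{1}{4\pi}\int_{S^2}\mathrm{d}x\,\Theta^{(\ell)}_{Rx,R'x}(f,f')\,.
\end{align}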
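(125)

Expressing the Fourier coefficients of the NNGP according to (31) gives

$$[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} = \frac{1}{4\pi}\int_{S^2}\mathrm{d}x\int_{[SO(3)]^2}\mathrm{d}R\,\mathrm{d}R'\,K^{(\ell)}_{Rx,R'x}(f,f')\,D^l_{mn}(R)\,D^{l'}_{m'n'}(R')\,. \tag{126}$$

We can now plug in the Fourier expansion of the kernel on $S^2$,

$$K^{(\ell)}_{x,x'}(f,f') = \sum_{l,l'=0}^{\infty}\;\sum_{m=-l}^{l}\;\sum_{m'=-l'}^{l'}[\widehat{K^{(\ell)}(f,f')}]^{l,l'}_{m,m'}\,Y^l_m(x)\,Y^{l'}_{m'}(x')\,, \tag{127}$$

to obtain

$$[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} = \frac{1}{4\pi}\int_{S^2}\mathrm{d}x\int_{[SO(3)]^2}\mathrm{d}R\,\mathrm{d}R'\sum_{p,p'=0}^{\infty}\;\sum_{q=-p}^{p}\;\sum_{q'=-p'}^{p'}[\widehat{K^{(\ell)}(f,f')}]^{p,p'}_{q,q'}\,Y^p_q(Rx)\,Y^{p'}_{q'}(R'x)\,D^l_{mn}(R)\,D^{l'}_{m'n'}(R')\,. \tag{128}$$

Using (106), (112), (110) and (111), one can rewrite and simplify the expression as

\begin{align}
[\widehat{K^{(\ell+1)}(f,f')}]^{l,l'}_{mn,m'n'} &= \frac{1}{4\pi}\sum_{p,p'=0}^{\infty}\;\sum_{q,r=-p}^{p}\;\sum_{q',r'=-p'}^{p'}\frac{8\pi^2}{2l+1}\,\frac{8\pi^2}{2l'+1}\,(-1)^r\,\delta_{lp}\,\delta_{mq}\,\delta_{nr}\,\delta_{l'p'}\,\delta_{m'q'}\,\delta_{n'r'}\,\delta_{pp'}\,\delta_{r,-r'}\,[\widehat{K^{(\ell)}(f,f')}]^{p,p'}_{q,q'} \tag{129}\\
&= \frac{1}{4\pi}\left(\frac{8\pi^2}{2l+1}\right)^2(-1)^n\,\delta_{l,l'}\,\delta_{n,-n'}\,[\widehat{K^{(\ell)}(f,f')}]^{l,l'}_{m,m'}\,, \tag{130}
\end{align}

which is the claimed result.

E. Proofs: Data Augmentation Versus Group Convolutions at Infinite Width

In this section, we provide proofs for the theorems given in Section 5 in the main text.

Theorem 5.1. Let $\mu^{\mathrm{aug}}_t$ and $\mu_t$ be the mean predictions after $t$ training steps of infinite ensembles of two neural network architectures $\mathcal{N}^{\mathrm{aug}}$ and $\mathcal{N}$. Let $\mathcal{N}^{\mathrm{aug}}$ be trained on the fully $G$-augmented training data of $\mathcal{N}$ and assume that the NTKs of the two architectures are related by

$$\Theta(f,f') = \frac{1}{|G|}\sum_{g\in G}\Theta^{\mathrm{aug}}(f,\rho_{\mathrm{reg}}(g)f')\,. \tag{32}$$

Then, $\mu^{\mathrm{aug}}_t$ and $\mu_t$ converge in the infinite width limit to the same function for all $t$ for quadratic losses, up to quadratic corrections in the learning rate.

Proof. For a neural network $\mathcal{N}$, we can expand the change $\Delta\mathcal{N}$ in the output due to one training step of gradient descent in the learning rate $\eta$,

\begin{align}
\Delta\mathcal{N}_{t+1}(f) = \mathcal{N}_{t+1}(f) - \mathcal{N}_t(f) &= (\theta_{t+1} - \theta_t)\cdot\frac{\partial\mathcal{N}_t(f)}{\partial\theta} + \mathcal{O}(\eta^2) \tag{131}\\
&= -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\underbrace{\frac{\partial\mathcal{N}_t(f)}{\partial\theta}\cdot\frac{\partial\mathcal{N}_t(f_i)}{\partial\theta}}_{\Theta_t(f,f_i)}\,\mathcal{L}'(\mathcal{N}_t(f_i),y_i) + \mathcal{O}(\eta^2)\,, \tag{132}
\end{align}

where $\Theta_t$ is the empirical NTK at training step $t$, $y_i$ are the training labels and $\mathcal{L}'$ is the derivative of the per-sample loss with respect to the output of the network. Taking the mean and the infinite width limit yields

$$\Delta\mu_{t+1}(f) = -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\Theta(f,f_i)\,\mathcal{L}'(\mu_t(f_i),y_i)\,, \tag{133}$$

since we have assumed that $\mathcal{L}'$ is linear in its first argument.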
The network $\mathcal{N}^{\mathrm{aug}}$, on the other hand, is trained using full data augmentation over $G$, so we can decompose the sum over training samples into a sum over the training samples in (133) and a sum over $G$. Note that since we assume full data augmentation and a finite training set, we restrict to $G$ being finite in this section. We obtain

$$\Delta\mu^{\mathrm{aug}}_{t+1}(f) = -\frac{\eta}{n_{\mathrm{train}}|G|}\sum_{g\in G}\sum_{i=1}^{n_{\mathrm{train}}}\Theta^{\mathrm{aug}}(f,\rho_{\mathrm{reg}}(g)f_i)\,\mathcal{L}'(\mu^{\mathrm{aug}}_t(\rho_{\mathrm{reg}}(g)f_i),y_i)\,. \tag{134}$$

As mentioned in the main text, we will prove the statement inductively over training steps $t$. At $t = 0$, the mean output of all neural networks is zero in the infinite width limit (Neal, 1996; Lee et al., 2018). For the induction step, assume that $\mu^{\mathrm{aug}}_t = \mu_t$. Then, $\mu^{\mathrm{aug}}_{t+1} = \mu_{t+1}$ if $\Delta\mu^{\mathrm{aug}}_{t+1} = \Delta\mu_{t+1}$. Since the ensemble mean of networks trained with data augmentation is exactly equivariant (Gerken & Kessel, 2024; Nordenfors & Flinth, 2024), we have $\mu^{\mathrm{aug}}_t(\rho_{\mathrm{reg}}(g)f_i) = \mu^{\mathrm{aug}}_t(f_i) = \mu_t(f_i)$ by the induction assumption. Therefore,

$$\Delta\mu^{\mathrm{aug}}_{t+1}(f) = -\frac{\eta}{n_{\mathrm{train}}|G|}\sum_{g\in G}\sum_{i=1}^{n_{\mathrm{train}}}\Theta^{\mathrm{aug}}(f,\rho_{\mathrm{reg}}(g)f_i)\,\mathcal{L}'(\mu_t(f_i),y_i)\,. \tag{135}$$

Using assumption (32) concludes the proof,

$$\Delta\mu^{\mathrm{aug}}_{t+1}(f) = -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\Theta(f,f_i)\,\mathcal{L}'(\mu_t(f_i),y_i) = \Delta\mu_{t+1}(f)\,. \tag{136}$$

As mentioned in Section 7, Theorem 5.1 can be generalized to equivariantly data-augmented networks trained on

$$\bigcup_{g\in G}\{(\rho_{\mathrm{reg}}(g)f_i,\;\rho_{\mathrm{reg}}(g)y_i)\}\,, \tag{137}$$

where the targets $y_i : G \to \mathbb{R}^{n_{\mathrm{out}}}$ are signals on the group.

Theorem E.1. Let $\mu^{\mathrm{aug}}_t$ and $\mu_t$ be the mean predictions after $t$ training steps of infinite ensembles of two neural network architectures $\mathcal{N}^{\mathrm{aug}}$ and $\mathcal{N}$. Let $\mathcal{N}^{\mathrm{aug}}$ be trained on the fully equivariantly $G$-augmented training data of $\mathcal{N}$ and assume that the NTKs of the two architectures are related by

$$\Theta_{g,g'}(f,f') = \frac{1}{|G|}\sum_{h\in G}\Theta^{\mathrm{aug}}_{g,hg'}(f,\rho_{\mathrm{reg}}(h)f')\,. \tag{138}$$

Then, $\mu^{\mathrm{aug}}_t$ and $\mu_t$ converge in the infinite width limit to the same function for all $t$ for quadratic losses, up to quadratic corrections in the learning rate.

Proof. This theorem is a straightforward extension of Theorem 5.1, which is why we only highlight the differences.
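Similarly to (132), the change in the output after one training step on unaugmented data is given by

$$[\Delta\mathcal{N}_{t+1}(f)](g) = -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\sum_{g'\in G}\Theta_{t;g,g'}(f,f_i)\,\mathcal{L}'([\mathcal{N}_t(f_i)](g'),y_i(g')) + \mathcal{O}(\eta^2)\,, \tag{139}$$

where $\mathcal{L}$ is the pointwise per-sample loss. Again, $\mathcal{L}'$ is the derivative with respect to the output of the network at a given point. As before, we assume that $\mathcal{L}'$ is linear in its first argument, allowing us to simplify the expectation of the infinite-width version of (139) to

$$[\Delta\mu_{t+1}(f)](g) = -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\sum_{g'\in G}\Theta_{g,g'}(f,f_i)\,\mathcal{L}'([\mu_t(f_i)](g'),y_i(g'))\,. \tag{140}$$

In a similar fashion, we derive the update of a network trained on fully equivariantly augmented data and obtain

$$[\Delta\mu^{\mathrm{aug}}_{t+1}(f)](g) = -\frac{\eta}{n_{\mathrm{train}}|G|}\sum_{i=1}^{n_{\mathrm{train}}}\sum_{h\in G}\sum_{g'\in G}\Theta^{\mathrm{aug}}_{g,g'}(f,\rho_{\mathrm{reg}}(h)f_i)\,\mathcal{L}'\big([\mu^{\mathrm{aug}}_t(\rho_{\mathrm{reg}}(h)f_i)](g'),\,[\rho_{\mathrm{reg}}(h)y_i](g')\big)\,. \tag{141}$$

Using the equivariance property of the ensemble mean again (Gerken & Kessel, 2024), i.e. $\mu^{\mathrm{aug}}_t(\rho_{\mathrm{reg}}(h)f_i) = \rho_{\mathrm{reg}}(h)\mu^{\mathrm{aug}}_t(f_i)$, and shifting the summation as $g' \to hg'$, we obtain

\begin{align}
[\Delta\mu^{\mathrm{aug}}_{t+1}(f)](g) &= -\frac{\eta}{n_{\mathrm{train}}|G|}\sum_{i=1}^{n_{\mathrm{train}}}\sum_{h\in G}\sum_{g'\in G}\Theta^{\mathrm{aug}}_{g,hg'}(f,\rho_{\mathrm{reg}}(h)f_i)\,\mathcal{L}'([\mu^{\mathrm{aug}}_t(f_i)](g'),y_i(g')) \tag{142}\\
&= -\frac{\eta}{n_{\mathrm{train}}}\sum_{i=1}^{n_{\mathrm{train}}}\sum_{g'\in G}\left(\frac{1}{|G|}\sum_{h\in G}\Theta^{\mathrm{aug}}_{g,hg'}(f,\rho_{\mathrm{reg}}(h)f_i)\right)\mathcal{L}'([\mu^{\mathrm{aug}}_t(f_i)](g'),y_i(g'))\,, \tag{143}
\end{align}

where we have used that $[\mu^{\mathrm{aug}}_t(\rho_{\mathrm{reg}}(h)f_i)](hg') = [\mu^{\mathrm{aug}}_t(f_i)](g')$ and analogously for $y_i$. Using (138) and following the same inductive argument as in the proof of Theorem 5.1 concludes this proof.

Theorem 5.2. Let $\mathcal{N}^{\mathrm{FC}}$ be an MLP acting on feature maps with output in $\mathbb{R}$ and architecture

$$\mathcal{N}^{\mathrm{FC}} = \mathrm{FC}^{(L)}\circ\sigma\circ\cdots\circ\mathrm{FC}^{(3)}\circ\sigma\circ\mathrm{FC}^{(1)}\,, \tag{33}$$

where FC denotes a dense MLP layer and $\sigma$ a point-wise nonlinearity. Let $\mathcal{N}^{\mathrm{GC}}$ be a $G$-invariant GCNN with architecture

$$\mathcal{N}^{\mathrm{GC}} = \mathrm{GPool}\circ\mathrm{GConv}(S^L_\kappa)\circ\sigma\circ\mathrm{GConv}(S^{L-2}_\kappa)\circ\sigma\circ\cdots\circ\mathrm{GConv}(S^3_\kappa)\circ\sigma\circ\mathrm{Lifting}(S^1_\kappa)\,, \tag{34}$$

where $S^\ell_\kappa$ are the supports of the convolutional filters with $S^1_\kappa = X$, the domain of the input feature maps, and the other $S^\ell_\kappa$ are invariant under $G$. Then, the $G$-averages of the kernels of the MLP are given by the kernels of the GCNN,

\begin{align}
K^{\mathrm{GC}}(f,f') &= \frac{1}{\mathrm{vol}(G)}\int_G \mathrm{d}g\,K^{\mathrm{FC}}(f,\rho_{\mathrm{reg}}(g)f') \tag{35}\\
\Theta^{\mathrm{GC}}(f,f') &= \frac{1}{\mathrm{vol}(G)}\int_G \mathrm{d}g\,\Theta^{\mathrm{FC}}(f,\rho_{\mathrm{reg}}(g)f')\,.
\end{align}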
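(36)

Proof. In order to prove the kernel equalities, we will construct the kernels for the fully connected architecture (33) and the group convolutional architecture (34) by explicitly iterating the recursion relations. The iteration starts with the input kernels, which for the fully connected network are

$$K^{\mathrm{FC}(0)}(f,f') = \frac{1}{\mathrm{vol}(X)}\int_X \mathrm{d}x\,f(x)f'(x)\,,\qquad \Theta^{\mathrm{FC}(0)}(f,f') = 0\,, \tag{144}$$

since the different points in the domain $X$ of the input function take the role of different channels when the image tensor is flattened. The first layer of the FC network is a fully connected layer. Such layers update the kernels according to (Jacot et al., 2018)

\begin{align}
K^{\mathrm{FC}(\ell+1)}(f,f') &= K^{\mathrm{FC}(\ell)}(f,f') \tag{145}\\
\Theta^{\mathrm{FC}(\ell+1)}(f,f') &= K^{\mathrm{FC}(\ell+1)}(f,f') + \Theta^{\mathrm{FC}(\ell)}(f,f')\,. \tag{146}
\end{align}

In order to write kernel transformations like this more compactly, we collect all relevant kernels at layer $\ell$ into an $\mathbb{R}^4$ vector $\Xi^{\mathrm{FC}(\ell)}(f,f')$ according to

$$\Xi^{\mathrm{FC}(\ell)}(f,f') = \begin{pmatrix} K^{\mathrm{FC}(\ell)}(f,f)\\ K^{\mathrm{FC}(\ell)}(f,f')\\ K^{\mathrm{FC}(\ell)}(f',f')\\ \Theta^{\mathrm{FC}(\ell)}(f,f') \end{pmatrix}, \tag{147}$$

where the components $K^{\mathrm{FC}(\ell)}(f,f)$ and $K^{\mathrm{FC}(\ell)}(f',f')$ are needed for the nonlinear layers below. In the $\Xi^{\mathrm{FC}}$-notation, (145) and (146) can be summarized by a function $G : \mathbb{R}^4 \to \mathbb{R}^4$ mapping $\Xi^{\mathrm{FC}(\ell)}(f,f') \mapsto \Xi^{\mathrm{FC}(\ell+1)}(f,f')$, defined by

$$G\begin{pmatrix} k_1\\ k_2\\ k_3\\ \Theta \end{pmatrix} = \begin{pmatrix} k_1\\ k_2\\ k_3\\ k_2 + \Theta \end{pmatrix}. \tag{148}$$

Therefore, the kernels of the first fully connected layer take the form

$$\Xi^{\mathrm{FC}(1)}(f,f') = G(\Xi^{\mathrm{FC}(0)}(f,f')) = \frac{1}{\mathrm{vol}(X)}\int_X \mathrm{d}x\begin{pmatrix} f(x)f(x)\\ f(x)f'(x)\\ f'(x)f'(x)\\ f(x)f'(x) \end{pmatrix}. \tag{149}$$

In the architecture (33), fully connected layers are alternated with nonlinearities, which act on the kernels according to (Jacot et al., 2018)

\begin{align}
\Lambda^{\mathrm{FC}(\ell)}(f,f') &= \begin{pmatrix} K^{\mathrm{FC}(\ell)}(f,f) & K^{\mathrm{FC}(\ell)}(f,f')\\ K^{\mathrm{FC}(\ell)}(f',f) & K^{\mathrm{FC}(\ell)}(f',f') \end{pmatrix} \tag{150}\\
K^{\mathrm{FC}(\ell+1)}(f,f') &= \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\,\Lambda^{\mathrm{FC}(\ell)}(f,f'))}[\sigma(u)\sigma(v)] \tag{151}\\
\dot{K}^{\mathrm{FC}(\ell+1)}(f,f') &= \mathbb{E}_{(u,v)\sim\mathcal{N}(0,\,\Lambda^{\mathrm{FC}(\ell)}(f,f'))}[\sigma'(u)\sigma'(v)] \tag{152}\\
\Theta^{\mathrm{FC}(\ell+1)}(f,f') &= \dot{K}^{\mathrm{FC}(\ell+1)}(f,f')\,\Theta^{\mathrm{FC}(\ell)}(f,f')\,. \tag{153}
\end{align}

We will denote the corresponding action on the $\Xi^{\mathrm{FC}}$-vectors by a function $F_\sigma : \mathbb{R}^4 \to \mathbb{R}^4$. Therefore, a fully connected layer followed by a nonlinearity can be written as

$$\Xi^{\mathrm{FC}(\ell+2)}(f,f') = F_\sigma(G(\Xi^{\mathrm{FC}(\ell)}(f,f')))\,.$$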
(154) Equivariant Neural Tangent Kernels Hence, in this notation, the kernels of the entire FC network are given by ΞFC(f, f ) = G Fσ G Fσ G(ΞFC(0)(f, f )) . (155) Next, we compute the kernels of the GCNN. The input kernels in this case are KGC(0) x,x (f, f ) = f(x)f (x) , ΘGC(0) x,x (f, f ) = 0 . (156) According to (34), the first layer of the network is a lifting layer whose recursion relation was given in Theorem 4.2. Again, we define an R4-vector to collect all kernel components necessary for computing the kernels of the network, ΞGC(ℓ) g,g (f, f ) = KGC(ℓ) g,g (f, f) KGC(ℓ) g,g (f, f ) KGC(ℓ) g ,g (f , f ) ΘGC(ℓ) g,g (f, f ) In terms of ΞGC the kernels of the lifting layer are given by (note that the filter of the lifting layer has global support by assumption) ΞGC(1) g,g (f, f ) = G KGC(0) ρ(g)x,ρ(g)x(f, f) KGC(0) ρ(g)x,ρ(g )x(f, f ) KGC(0) ρ(g )x,ρ(g )x(f , f ) ΘGC(0) ρ(g)x,ρ(g )x(f, f ) f(ρ(g)x)f(ρ(g)x) f(ρ(g)x)f (ρ(g )x) f (ρ(g )x)f (ρ(g )x) f(ρ(g)x)f (ρ(g )x) For later convenience, we note here that ΞGC(1) h,g 1h(f, f ) = 1 vol(X) f(ρ(h)x)f(ρ(h)x) f(ρ(h)x)f (ρ(g 1h)x) f (ρ(g 1h)x)f (ρ(g 1h)x) f(ρ(h)x)f (ρ(g 1h)x) f(x)f(x) f(x)f (ρ(g 1)x) f (ρ(g 1)x)f (ρ(g 1)x) f(x)f (ρ(g 1)x) = ΞFC(1)(f, ρreg(g)f ) , (161) where we shifted the integration variable in the second step and used (149). After the lifting layer, we act with a point-wise nonlinearty whose recursion relations are given in Corollary 4.4. Since this transformation is independent for the different g, g -components, we can write it using the same function Fσ introduced above as ΞGC(ℓ+1) g,g (f, f ) = Fσ(ΞGC(ℓ) g,g (f, f )) . (162) A GCNN layer transforms the NNGP and NTK according to Theorem 4.1. We can write this in terms of ΞGC(ℓ) g,g as ΞGC(ℓ+1) g,g (f, f ) = 1 |Sℓκ| Sκ dhℓG(ΞGC(ℓ) ghℓ,g hℓ(f, f )) , (163) with G as introduced in (148). The final pooling layer acts according to Theorem 4.3, which we can write as ΞGC(ℓ) g,g (f, f ) = 1 (vol(G))2 G dg ΞGC(ℓ) g,g (f, f ) . 
(164) Equivariant Neural Tangent Kernels With the expressions (162), (163) and (164), we can write the kernels of the entire network as ΞGC(f, f ) = 1 (vol(G))2 G dg 1 |SL κ | SL κ dh L G 1 |SL 2 κ | SL 2 κ dh L 2 G S3κ dh3 G Fσ(ΞGC(1) gh Lh L 2 h5h3,g h Lh L 2 h5h3(f, f )) In order to simplify this expression, we shift h3 and absorb gh Lh L 2 h5 into it. This will not change the integration domain of h3 since S3 κ is by assumption invariant under G. Then, the integrals over h L, h L 2, . . . , h5 become trivial and cancel against their 1/|Sℓ κ|-prefactors. We are left with ΞGC(f, f ) = 1 (vol(G))2 S3κ dh3 G Fσ(ΞGC(1) h3,g g 1h3(f, f )) Finally, we trivialize the g -integral by shifting g 1 to absorb g . Thus, we obtain ΞGC(f, f ) = 1 vol(G) S3κ dh3 G Fσ(ΞGC(1) h3,g 1h3(f, f )) G dg G Fσ G Fσ G Fσ(ΞFC(1)(f, ρreg(g)f )) (168) G dg ΞFC(f, ρreg(g)f ) , (169) where we used (161), trivializing the integral over h3, and then identified ΞFC from (155). The statement follows by taking the second and fourth components of (169). Theorem 5.3. Let N K N be the K N-invariant GCNN with architecture (34) and K-invariant filter supports Sℓ κ which for the GConv-layers decompose as Sℓ κ = Kℓ κ N ℓ κ, Kℓ κ K, N ℓ κ N. Let N N be the N-invariant GCNN with architecture (34) and filter supports N L κ , . . . , N 3 κ and S1 κ. Then, the NNGPs and NTKs of these networks are related by KK N(f, f ) = 1 vol(K) K dk KN(f, ρreg(k)f ) (37) ΘK N(f, f ) = 1 vol(K) K dk ΘN(f, ρreg(k)f ) . (38) Proof. In this proof, we will use the same notation as in the proof for Theorem 5.2 above and use results from there as well. We start by considering the kernels ΞK N of N K N by specializing (165) to the case G = K N. Due to the semidirect product structure of G, there is a unique decomposition g = kn for each g G into k K and n N. 
Since by assumption the filter supports Sℓ κ on G also factorize over K and N, we can split all G-integrations in (165) over N and K and obtain ΞK N(f, f ) = 1 (vol(K))2 1 (vol(N))2 N dn 1 |KL κ ||N L κ | NL κ dm L G 1 |K3κ||N 3κ| N 3 κ dm3 G Fσ(ΞK N(1) knj Lm L j3m3,k n j Lm L j3m3(f, f )) In order to trivialize the integrals over K, as was done with the integrals over G in (166), we need to rewrite the first group Equivariant Neural Tangent Kernels index of ΞK N(1) such that all jℓappear next to each other. To this end, we introduce several unit elements knj Lm L j7m7j5m5j3m3 = knj Lm L j7m7j5j3 j 1 3 m5j3 = knj Lm L j7j5j3 (j5j3) 1m7j5j3 N j 1 3 m5j3 N m3 (172) ... = kj L j3 (j L j3) 1nj L j3 N (j L 2 j3) 1m L N j 1 3 m5j3 N m3 . (173) We perform the same rewriting also on the second group index of ΞK N(1). Since N is a normal subgroup of G, knk 1 N for all k K, n N and the Haar measure on N is invariant under shifts of the form n knk 1. Furthermore, the integration domains N ℓ κ are by assumption invariant under this transformation. Hence, we shift n, n and the mℓby n j L j3n(j L j3) 1 (174) n j L j3n (j L j3) 1 (175) mℓ jℓ 2 j3mℓ(jℓ 2 j3) 1 ℓ> 3 . (176) With this (170) becomes ΞK N(f, f ) = 1 (vol(K))2 1 (vol(N))2 N dn 1 |KL κ ||N L κ | NL κ dm L G 1 |K3κ||N 3κ| N3 κ dm3 G Fσ(ΞK N(1) kj L j3nm L m3,k j L j3n m L m3(f, f )) = 1 (vol(K))2 1 (vol(N))2 N dn 1 |N L κ | NL κ dm L G 1 |K3κ||N 3κ| N3 κ dm3 G Fσ(ΞK N(1) j3nm L m3,k k 1j3n m L m3(f, f )) = 1 vol(K) 1 (vol(N))2 N dn 1 |N L κ | N L κ dm L G 1 |K3κ||N 3κ| N3 κ dm3 G Fσ(ΞK N(1) j3nm L m3,k 1j3n m L m3(f, f )) Here, we shifted j3 (kj L j5) in the first step, trivializing the integrals over j L, . . . , j5 which then cancel against their 1/|Kℓ κ|-prefactors. In the second step, we first trivialized the integral over k by shifting k k k and then canceled it against its 1/ vol(K)-prefactor. 
Next, we perform another manipulation on the group indices of ΞK N(1) by first inserting suitable unit elements, j3nm L m3 = j3nj 1 3 N j3m Lj 1 3 N j3m L 2j 1 3 N j3m3j 1 3 N j3 , (180) and similarly for the second group index of ΞK N(1). After shifting n j 1 3 nj3 , n j 1 3 n j3 , mℓ j 1 3 mℓj3 ℓ 3 , (181) in (179), we obtain ΞK N(f, f ) = 1 vol(K) 1 (vol(N))2 N dn 1 |N L κ | N L κ dm L G 1 |K3κ||N 3κ| N3 κ dm3 G Fσ(ΞK N(1) nm L m3j3,k 1n m L m3j3(f, f )) Equivariant Neural Tangent Kernels 0 10 20 30 40 50 60 70 80 Network Width Relative Error Figure 5. Convergence of the Monte-Carlo estimates of the NNGP to their infinite-width limits for G = C4 R2. Plotted is the relative error averaged over the components of a 3 3 Gram matrix for networks with a Re LU or an error function nonlinearity. The bands correspond to one standard deviation of the estimator. As in the proof for Theorem 5.2, we will now write ΞK N(1) in terms of ΞN(1). Using the shorthand m = m L m3 and the analogous steps to (161), we find ΞK N(1) n mj3,k 1n mj3(f, f ) = 1 |S1κ| f(ρ(n mj3)x)f(ρ(n mj3)x) f(ρ(n mj3)x)f (ρ(k 1n mj3)x) f (ρ(k 1n mj3)x)f (ρ(k 1n mj3)x) f(ρ(n mj3)x)f (ρ(k 1n mj3)x) f(ρ(n m)x)f(ρ(n m)x) f(ρ(n m)x)f (ρ(k 1n m)x) f (ρ(k 1n m)x)f (ρ(k 1n m)x) f(ρ(n m)x)f (ρ(k 1n m)x) = ΞN(1) n m,n m(f, ρreg(k)f ) , (185) where for the second equality, we have shifted x ρ(j 1 3 )x, which leaves S1 κ invariant by assumption. Plugging (185) into (182) trivializes the j3-integral which cancels against its 1/|K3 κ|-prefactor, yielding ΞK N(f, f ) = 1 vol(K) K dk 1 (vol(N))2 N dn 1 |N L κ | N L κ dm L G N3 κ dm3 G Fσ(ΞN(1) nm L m3,n m L m3(f, ρreg(k)f )) K dk ΞN(f, ρreg(k)f ) , (187) where we have identified ΞN by comparing to (165). The statement of the theorem follows by considering the second and fourth components of (187). F. Further Experimental Results In this appendix, we provide further details and results of the numerical experiments presented in Section 6. F.1. 
Kernel Convergence

Figure 5 shows the convergence of Monte-Carlo estimates of the NNGP to the analytical infinite-width expression derived using the theorems in Section 4.1.

Table 1. Architectures used for the medical image classification described in Section 6. For convolutional, group-convolutional and lifting layers, the argument is the kernel size (all kernels are square). Both pooling layers are global. The number of output neurons is finite and has to correspond to the 9 classes.

| CNN | GCNN |
|---|---|
| Conv(3) | Lifting(3) |
| ReLU | ReLU |
| Conv(3) | GConv(3) |
| ReLU | ReLU |
| Conv(3) | GConv(3) |
| ReLU | ReLU |
| Conv(3) | GConv(3) |
| ReLU | ReLU |
| Conv(3) | GConv(3) |
| ReLU | ReLU |
| SumPool | GPool |
| Dense | Dense |
| ReLU | ReLU |
| Dense(9) | Dense(9) |

F.2. Medical Image Experiments

In the infinite-width limit, the NTK becomes deterministic and time-independent under the gradient flow dynamics. In the case of MSE loss, the differential equation describing the mean output of a network at time $t$ becomes a linear ODE, thus allowing for an analytic expression at arbitrary time. In the limit of infinite training time, the mean is given by (41) (Jacot et al., 2018). This relation is effectively a kernel method that can be used to generate predictions of the infinitely wide network.

The task consists of classifying histological images (Kather et al., 2018) containing nine classes of tissues, two of which are cancerous. The original images have a resolution of $224 \times 224$ pixels each and have been down-scaled to a resolution of $32 \times 32$ pixels to reduce the kernel evaluation time. Note that the size of the final kernel matrix, which needs to be inverted, is independent of the resolution because we use a group pooling or SumPool layer, respectively. Since the analytic solution in (41) only applies for MSE loss, we constructed target vectors $Y = \{y_0, \dots, y_N\}$ from classes $c$ according to $e_c - \frac{1}{9}\mathbb{1}$, as is standard in the NTK literature (Lee et al., 2020).
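The kernel prediction described above can be sketched in a few lines of NumPy. The sketch assumes that (41) has the standard infinite-time form for MSE loss, $\mu(x) = \Theta(x, X)\,\Theta(X, X)^{-1}\,Y$; this form, the function names, the optional `ridge` term and the Gaussian `toy_kernel` (a stand-in for an actual NTK Gram matrix) are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def centered_one_hot(labels, num_classes):
    """Targets e_c - (1/num_classes) * 1 for integer class labels c."""
    return np.eye(num_classes)[labels] - 1.0 / num_classes

def kernel_predict(theta_train, theta_test_train, targets, ridge=0.0):
    """Mean prediction Theta(x, X) Theta(X, X)^{-1} Y via a linear solve."""
    n = theta_train.shape[0]
    coeffs = np.linalg.solve(theta_train + ridge * np.eye(n), targets)
    return theta_test_train @ coeffs

def toy_kernel(A, B):
    """Hypothetical stand-in for an NTK Gram matrix: a Gaussian kernel."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(12, 4))
x_test = rng.normal(size=(3, 4))
labels = rng.integers(0, 9, size=12)

Y = centered_one_hot(labels, num_classes=9)
gram = toy_kernel(x_train, x_train)
train_pred = kernel_predict(gram, gram, Y)
test_pred = kernel_predict(gram, toy_kernel(x_test, x_train), Y)
predicted_classes = test_pred.argmax(axis=1)
```

Because the pooled kernels have no spatial indices, the Gram matrix has one entry per pair of training images, so the cost of the solve depends on the number of training samples and not on the image resolution.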
The CNN and GCNN architectures that were used are shown in Table 1. Note that the infinite-width limit refers to the number of channels, which is why we only need to specify the kernel sizes. The same training and test data were used for both models, with a test set of 1000 images. Both architectures have been implemented in the neural-tangents package (Novak et al., 2020).

F.3. Molecular Energy Regression

We used the same kernel method resulting from the infinite-width and infinite-time limit as explained in Section F.2. The grids on both $S^2$ and $SO(3)$ are equiangular Driscoll & Healy grids (Driscoll & Healy, 1994) with resolution $2L \times (2L - 1)$ on $S^2$ (see footnote 4) and $(2L - 1) \times 2L \times (2L - 1)$ on $SO(3)$ (parametrized in Euler angles). $L$ is the corresponding bandlimit defining the cutoff in the Fourier domain, i.e. only Fourier coefficients with $l < L$ are considered. The input signals are sampled for $L = 6$. As labels we have used the internal energies $U_0$ of the molecules at 0 K after subtracting the atomic reference energies. The hyperparameter $\beta$ in (42) was chosen as described in (Esteves et al., 2023) according to

$$\beta = -\frac{(\cos(\pi/4) - 1)^2}{\log(0.05)}\,. \tag{188}$$

Footnote 4: In the original work by Driscoll & Healy (1994), the grid actually contained $2L \times 2L$ points, but we have adapted our grid to the convention used in the s2fft package.

Table 2. Architectures used for the molecular energy regression described in Section 6. 29 identical networks (grouped by curly braces) process the inputs associated with each atom. Their outputs are then summed together. For group-convolutional and lifting layers, the output bandlimit $L$ is stated. The pooling layer is global and the single output neuron represents the predicted energy of the network.
| Stage | MLP-based network | SO(3)-invariant network |
|---|---|---|
| 29 per-atom networks | Dense, ReLU, Dense | Lifting(3), Erf, GConv(3) |
| combined to molecule network | FanInSum, Dense(1) | FanInSum, Dense(1) |

[Figure 6 plot: vertical axis L2(equiv-logits, non-equiv-logits); horizontal axis ticks 1 to 20; curves for ensemble sizes 1, 3, 8 and 100.]

Figure 6. Convergence of finite-width ensembles trained with data augmentation to ensembles of GCNNs on CIFAR10. Shown is the L2-distance between the logits of the equivariant ensemble and the non-equivariant ensemble trained with data augmentation for different ensemble sizes on out-of-distribution data. For larger ensembles, the distance decreases.

The precise architectures of the MLP-based network and the SO(3)-invariant network are listed in Table 2. The MAE loss was evaluated on a test set of 100 molecules.

F.4. Data Augmentation Versus Group Convolutions at Finite Width

Figure 6 shows that large ensembles trained with data augmentation on CIFAR10 converge to GCNNs even out of distribution. Similarly, Figure 7 shows the same behavior on the NCT-CRC-HE-100K data set of histological images (Kather et al., 2018), downscaled to $32 \times 32$ pixels. Samples of the out-of-distribution data, whose mean and variance were normalized to 0 and 1, respectively, are provided in Figure 8. The architectures used for the ensemble members are detailed in Table 3.

[Figure 7 plot: vertical axis L2(equiv-logits, non-equiv-logits); horizontal axis ticks 1 to 20; curves for ensemble sizes 1, 3, 8, 16 and 40.]

Figure 7. Convergence of finite-width ensembles trained with data augmentation to ensembles of GCNNs on histological images from the NCT-CRC-HE-100K data set. Shown is the L2-distance between the logits of the equivariant ensemble and the non-equivariant ensemble trained with data augmentation for different ensemble sizes on out-of-distribution data.
For larger ensembles, the distance decreases. The standard deviation is estimated from 20 independent runs for each curve.

Figure 8. Examples of out-of-distribution data for MNIST (left), CIFAR10 (middle) and NCT-CRC-HE-100K (right).

Table 3. Architectures used for the ensemble members in the experiments described in Section 6. For convolutional, group-convolutional and lifting layers, the arguments are input channels, output channels and kernel size (all kernels are square). For max-pooling layers, the arguments are kernel size and stride. For the GCNNs, the max pooling is done only over spatial dimensions, not group dimensions. The kernel sizes were selected such that the GCNNs are exactly equivariant for the respective input sizes of 28 × 28 and 32 × 32.

| MNIST CNN | MNIST GCNN | CIFAR10 CNN | CIFAR10 GCNN |
| --- | --- | --- | --- |
| Conv(1, 4, 3) | Lifting(1, 4, 3) | Conv(3, 4, 3) | Lifting(3, 4, 3) |
| ReLU | ReLU | ReLU | ReLU |
| MaxPool(2, 2) | SpatialMaxPool(2, 2) | MaxPool(2, 2) | SpatialMaxPool(2, 2) |
| Conv(4, 16, 4) | GConv(4, 16, 4) | Conv(4, 16, 4) | GConv(4, 16, 4) |
| ReLU | ReLU | ReLU | ReLU |
| MaxPool(2, 2) | SpatialMaxPool(2, 2) | MaxPool(2, 2) | SpatialMaxPool(2, 2) |
| Conv(16, 32, 3) | GConv(16, 32, 3) | Conv(16, 32, 3) | GConv(16, 32, 3) |
| ReLU | ReLU | ReLU | ReLU |
| Conv(32, 64, 3) | GConv(32, 64, 3) | Conv(32, 64, 4) | GConv(32, 64, 4) |
| ReLU | ReLU | ReLU | ReLU |
| Conv(64, 128, 1) | GConv(64, 128, 1) | Conv(64, 128, 1) | GConv(64, 128, 1) |
| ReLU | ReLU | ReLU | ReLU |
| Conv(128, 32, 1) | GConv(128, 32, 1) | Conv(128, 32, 1) | GConv(128, 32, 1) |
| ReLU | ReLU | ReLU | ReLU |
| Conv(32, 10, 1) | GConv(32, 10, 1) | Conv(32, 10, 1) | GConv(32, 10, 1) |
| | GPool | | GPool |

Table 4. Architectures used for the ensemble members in the experiments described in Section F.4. For convolutional, group-convolutional and lifting layers, the arguments are input channels, output channels and kernel size (all kernels are square). For max-pooling layers, the arguments are kernel size and stride.
For the GCNNs, the max pooling is done only over spatial dimensions, not group dimensions. The kernel sizes were selected such that the GCNNs are exactly equivariant for the input size of 32 × 32.

| NCT-CRC-HE-100K CNN | NCT-CRC-HE-100K GCNN |
| --- | --- |
| Conv(3, 4, 3) | Lifting(3, 4, 3) |
| ReLU | ReLU |
| MaxPool(2, 2) | SpatialMaxPool(2, 2) |
| Conv(4, 16, 4) | GConv(4, 16, 4) |
| ReLU | ReLU |
| MaxPool(2, 2) | SpatialMaxPool(2, 2) |
| Conv(16, 32, 3) | GConv(16, 32, 3) |
| ReLU | ReLU |
| Conv(32, 64, 4) | GConv(32, 64, 4) |
| ReLU | ReLU |
| Conv(64, 128, 1) | GConv(64, 128, 1) |
| ReLU | ReLU |
| Conv(128, 32, 1) | GConv(128, 32, 1) |
| ReLU | ReLU |
| Conv(32, 9, 1) | GConv(32, 9, 1) |
| | GPool |
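The captions of Tables 3 and 4 state that the kernel sizes were chosen so that the GCNNs are exactly equivariant for the given input sizes. The following minimal sketch traces the spatial extent of the feature maps through the listed layers. It assumes unpadded, stride-1 convolutions and unpadded max pooling; neither is stated in the tables, and the names `trace_extents`, `MNIST_LAYERS` and `CIFAR10_LAYERS` are hypothetical helpers introduced only for this illustration.

```python
# Sketch under stated assumptions: convolutions are unpadded with stride 1,
# and max pooling is unpadded. The tables do not specify padding, so this is
# an interpretation, not the authors' implementation.

def conv_out(size, kernel):
    """Spatial extent after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, kernel, stride):
    """Spatial extent after an unpadded max-pooling layer."""
    return (size - kernel) // stride + 1


def trace_extents(size, layers):
    """Return the list of spatial extents, starting with the input extent.

    `layers` holds tuples ("conv", kernel) or ("pool", kernel, stride).
    Pointwise nonlinearities are omitted since they keep the extent fixed.
    """
    extents = [size]
    for layer in layers:
        if layer[0] == "conv":
            size = conv_out(size, layer[1])
        elif layer[0] == "pool":
            size = pool_out(size, layer[1], layer[2])
        else:
            raise ValueError(f"unknown layer type: {layer[0]}")
        extents.append(size)
    return extents


# Kernel sizes read off the MNIST columns of Table 3 (input 28 x 28).
MNIST_LAYERS = [
    ("conv", 3), ("pool", 2, 2),
    ("conv", 4), ("pool", 2, 2),
    ("conv", 3), ("conv", 3),
    ("conv", 1), ("conv", 1), ("conv", 1),
]

# Kernel sizes read off the CIFAR10 columns of Table 3 (input 32 x 32).
# The NCT-CRC-HE-100K networks of Table 4 use the same kernel sizes.
CIFAR10_LAYERS = [
    ("conv", 3), ("pool", 2, 2),
    ("conv", 4), ("pool", 2, 2),
    ("conv", 3), ("conv", 4),
    ("conv", 1), ("conv", 1), ("conv", 1),
]

if __name__ == "__main__":
    print("MNIST:  ", trace_extents(28, MNIST_LAYERS))
    print("CIFAR10:", trace_extents(32, CIFAR10_LAYERS))
```

Under these assumptions, the extents are 28, 26, 13, 10, 5, 3, 1 for MNIST and 32, 30, 15, 12, 6, 4, 1 for the 32 × 32 inputs, after which the 1 × 1 convolutions leave the extent unchanged. Every pooling layer receives an even extent, so the 2 × 2 windows tile the feature map without remainder, and the spatial dimensions collapse to a single point before the final layers. This is one plausible reading of the exact-equivariance remark in the captions, for instance with respect to rotations by 90°.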