A PROGRAM TO BUILD E(n)-EQUIVARIANT STEERABLE CNNS

Published as a conference paper at ICLR 2022

Gabriele Cesa, Qualcomm AI Research and University of Amsterdam, gcesa@qti.qualcomm.com
Leon Lang, University of Amsterdam, l.lang@uva.nl
Maurice Weiler, University of Amsterdam, m.weiler.ml@gmail.com

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

ABSTRACT

Equivariance is becoming an increasingly popular design choice to build data efficient neural networks by exploiting prior knowledge about the symmetries of the problem at hand. Euclidean steerable CNNs are one of the most common classes of equivariant networks. While the constraints these architectures need to satisfy are understood, existing approaches are tailored to specific (classes of) groups. No generally applicable method that is practical for implementation has been described so far. In this work, we generalize the Wigner-Eckart theorem proposed in Lang & Weiler (2020), which characterizes general $G$-steerable kernel spaces for compact groups $G$ over their homogeneous spaces, to arbitrary $G$-spaces. This enables us not only to directly parameterize filters in terms of a band-limited basis on the whole space rather than on $G$'s orbits, but also to easily implement steerable CNNs equivariant to a large number of groups. To demonstrate its generality, we instantiate our method on a variety of isometry groups acting on the Euclidean space $\mathbb{R}^3$. Our framework allows us to build $E(3)$- and $SE(3)$-steerable CNNs like previous works, but also CNNs with arbitrary $G \leq O(3)$-steerable kernels. For example, we build 3D CNNs equivariant to the symmetries of platonic solids, or choose $G = SO(2)$ when working with 3D data having only azimuthal symmetries. We compare these models on 3D shapes and molecular datasets, observing improved performance by matching the models' symmetries to those of the data.

1 INTRODUCTION

In machine learning, it is common for learning tasks to present a number of symmetries. A symmetry in the data occurs, for example, when some property (e.g., the label) does not change if a set of transformations is applied to the data itself, e.g. translations or rotations of images. Symmetries are algebraically described by groups. If prior knowledge about the symmetries of a task is available, it is usually beneficial to encode them in the models used (Shawe-Taylor, 1989; Cohen & Welling, 2016a). This property of such models is referred to as equivariance and is obtained by introducing equivariance constraints in the architecture (see Eq. 2). A classical example are convolutional neural networks (CNNs), which achieve translation equivariance by constraining linear layers to be convolution operators. A wider class of equivariant models are Euclidean steerable CNNs (Cohen & Welling, 2016b; Weiler et al., 2018a; Weiler & Cesa, 2019; Jenner & Weiler, 2022), which guarantee equivariance to isometries $\mathbb{R}^n \rtimes G$ of a Euclidean space $\mathbb{R}^n$, i.e., to translations and a group $G$ of origin-preserving transformations, such as rotations and reflections. As proven in Weiler et al. (2018a; 2021); Jenner & Weiler (2022), this requires convolutions with $G$-steerable (equivariant) kernels. Our goal is to develop a program that parameterizes, with minimal requirements, arbitrary $G$-steerable kernel spaces for compact $G$, as required to implement $\mathbb{R}^n \rtimes G$-equivariant CNNs.
Lang & Weiler (2020) provide a first step in this direction by generalizing the Wigner-Eckart theorem from quantum mechanics to obtain a general technique to parametrize $G$-steerable kernel spaces over orbits of a compact $G$. The theorem reduces the task of building steerable kernel bases to that of finding some purely representation-theoretic ingredients. Since the equivariance constraint only relates points $g.x \in \mathbb{R}^n$ in the same orbit $G.x \subseteq \mathbb{R}^n$, a kernel can take independent values on different orbits. Fig. 1 shows examples of orbits in $\mathbb{R}^2$ under the action of continuous planar rotations $G = SO(2)$ (orbits are rings of different radii) and of discrete rotations by $\pi/2$, $G = C_4$ (orbits are sets of four points). Hence, Lang & Weiler (2020) suggest independently parameterizing orbits of $G$ in $\mathbb{R}^n$ (colored regions in Fig. 1).

[Figure 1: Orbits in $\mathbb{R}^2$. Each color represents points in the same orbit under $G$'s action. (a) Orbits of $G = SO(2)$; (b) Orbits of $G = C_4$.]

While this guarantees a complete parameterization of $G$-steerable functions, the discretization of the filters requires them to be sufficiently band-limited¹. Indeed, particular care should be taken to avoid aliasing effects while discretizing filters on a grid (e.g., to process voxelized data). Lang & Weiler (2020) naturally support band-limiting along each orbit of $G$ but do not restrict the kernel across different orbits. Intuitively, this means that a filter can be decomposed into a finite number of orbits, but these should then be patched together with some smoothness considerations. This introduces an important requirement and a sensitive design choice. First, building new $G$-steerable CNNs requires one to identify the orbits of $G$ in $\mathbb{R}^n$. Second, the choice of orbits to consider affects a filter's band-width, but this relation is hard to quantify explicitly in a general setting. This issue is more severe for smaller groups (particularly, finite ones), as $\mathbb{R}^n$ decomposes into a larger number of orbits. For example, if $G = SO(2)$ as in Fig. 1a, band-limiting along rings is controlled by the Wigner-Eckart theorem but is not enforced along the radial component. If $G = C_4$ as in Fig. 1b, each ring is partitioned into discrete orbits: a larger number of orbits increases the maximum angular frequency, so aliasing is sensitive to this design choice.

Contributions and Outline. In this work, we solve this issue and propose a general solution strategy to build arbitrary CNNs with $G$-steerable kernels, for any compact group $G$. Theorem 2.1 generalizes the result of Lang & Weiler (2020) from $G$-homogeneous spaces to more general spaces $X$ carrying a $G$-action. Essentially, our theorem replaces the harmonic basis on the orbits of $G$ in $X$ with a $G$-steerable basis $B$ for unconstrained scalar filters (Freeman & Adelson, 1991) over the whole space $X$. Aliasing effects are thereby controlled by the choice of this initial basis. In practice, in this work we focus on Euclidean spaces $X = \mathbb{R}^n$, although the theory developed holds for any semi-direct product group $X \rtimes G$, and can be extended with minor changes to general homogeneous spaces $X$ with compact stabilizers $G$ (as argued in Remark D.15 in Lang & Weiler (2020)). Since a $G'$-steerable basis is $G$-steerable for any subgroup $G \leq G'$, we also propose to use an initial $G' = O(n)$-steerable basis to support any compact group $G \leq O(n)$ with minimal requirements.
In summary, the benefits of our method are two-fold: i) it allows direct control over band-width and aliasing via the initial basis $B$ and ii) it completely disentangles the discretization issues from the choice of $G$, minimizing the requirements to implement equivariance to new groups. For example, this enables the parameterization of $C_4$ filters without discretizing rings into a finite number of points as in Fig. 1b. Algorithm 1 summarizes our method and its requirements. Since the planar case $n = 2$ was extensively studied in Weiler & Cesa (2019) and since $\mathbb{R}^3$ has a large variety of isometry (sub)groups, in our experiments we put particular emphasis on the 3D setting. To illustrate the generality of our method, we instantiate it on many different subgroups of $O(3)$, the group of 3D rotations and reflections, and compare them experimentally. This results in many new 3D convolution networks, equivariant, for example, to icosahedral, axial, cylindrical or conical symmetries. Axial symmetries are particularly relevant since natural scenes generally have a vertical orientation, while discrete symmetries occur in crystallography or solid state physics; see Sec. G. In Sec. 5.1, we discuss many design choices for 3D equivariant networks and compare them in the experiments in Sec. 5.3. In particular, we find designs based on different discretizations of the group and pointwise non-linearities to be beneficial when working with volumetric data. Additionally, Sec. 5.2 studies the benefit of our basis in terms of equivariance after discretization and accuracy of trained models. In the Appendix, we derive some results for working with representation-theoretic objects to reduce the work needed to implement new equivariant CNNs. This includes real-valued representations of compact groups, harmonic bases for induced representations, numerical irreps decomposition and representation theory of direct product groups. Finally, we implement the program described in this work as a general purpose library based on PyTorch at github.com/QUVA-Lab/escnn.

¹A band-limited signal's spectrum has bounded support. The Nyquist-Shannon sampling theorem states that only sufficiently band-limited signals are faithfully represented by discrete samples; otherwise, aliasing effects occur.

2 THE THEORY OF STEERABLE CNNS AND G-STEERABLE KERNEL SPACES

In Sec. 2.1, we discuss preliminaries on steerable CNNs and the equivariance constraints on their filters. Sec. 2.2 describes our generalization of Lang & Weiler (2020) used to parametrize kernels.

2.1 EUCLIDEAN STEERABLE CNNS

Equivariant neural networks are characterized by their property to commute with the action of a given symmetry group on their input, intermediate feature spaces, and output. We focus on convolutional networks on Euclidean spaces $X = \mathbb{R}^n$, acted on by the group $(\mathbb{R}^n, +) \rtimes G$, i.e. the semidirect product of translations in $(\mathbb{R}^n, +)$ and origin-preserving transformations $G \leq O(n)$. A general framework of equivariant convolutional models is that of steerable CNNs, which we briefly recapitulate here. For more details, see Cohen & Welling (2016b); Weiler et al. (2018a; 2021); Weiler & Cesa (2019). The fundamental design choice underlying steerable CNNs is that they operate on feature vector fields, which are similar to feature maps but differ in that they are associated with a well defined action of $(\mathbb{R}^n, +) \rtimes G$.
The geometric type of a feature field is prescribed by an orthogonal group representation $\rho : G \to \mathbb{R}^{d_\rho \times d_\rho}$ which determines the transformation law of $d_\rho$-dimensional feature vectors under the action of $G$. Specifically, a feature field of type $\rho$ is a map $f : \mathbb{R}^n \to \mathbb{R}^{d_\rho}$ which transforms under $(\mathbb{R}^n, +) \rtimes G$ according to the induced representation $\mathrm{Ind}_{G}^{(\mathbb{R}^n,+) \rtimes G} \rho$:

$\big[ \big[ \mathrm{Ind}_{G}^{(\mathbb{R}^n,+) \rtimes G} \rho \big](tg) \, f \big](x) := \rho(g) \, f\big(g^{-1}(x - t)\big) \qquad \forall\, g \in G,\ t \in (\mathbb{R}^n, +). \quad (1)$

Intuitively, the induced representation acts by moving feature vectors spatially from $g^{-1}(x - t)$ to $x$ and by transforming each feature $f(x)$ according to $\rho(g)$ (see Fig. 1 in Weiler & Cesa (2019)). Notable examples are scalar or tangent-vector fields, which transform according to the trivial representation $\rho(g) = 1$ or the standard representation $\rho(g) = g$, respectively. Group-convolution networks (Cohen & Welling, 2016a; Kondor & Trivedi, 2018) are special cases with $\rho$ being the regular representation of $G$. It is common to define the full feature fields of a network as a direct sum (concatenation) $f := \bigoplus_i f_i$ of multiple individual feature fields $f_i$. The full feature space transforms according to the direct sum representation $\rho := \bigoplus_i \rho_i$, which ensures fields transform independently from each other.

Weiler et al. (2018a); Jenner & Weiler (2022) proved that, under mild assumptions, the most general linear equivariant maps between spaces of feature fields are convolutions with $G$-steerable kernels. If the input and output fields have types $\rho_{in} : G \to \mathbb{R}^{d_{in} \times d_{in}}$ and $\rho_{out} : G \to \mathbb{R}^{d_{out} \times d_{out}}$, $G$-steerable kernels are convolution kernels $K : \mathbb{R}^n \to \mathbb{R}^{d_{out} \times d_{in}}$ satisfying the linear $G$-steerability constraint

$K(g.x) = \rho_{out}(g) \, K(x) \, \rho_{in}(g)^{-1} \qquad \forall\, g \in G,\ x \in \mathbb{R}^n. \quad (2)$

Thus, designing steerable CNNs only requires finding a basis of the vector space of $G$-steerable kernels, which is then used to parameterize conventional Euclidean convolutions. Note that Eq. 2 relates the kernel values on all points $g.x$ in the orbit $G.x$, but leaves values on different orbits unrelated.

2.2 A WIGNER-ECKART THEOREM FOR SUBGROUP EQUIVARIANT CONVOLUTION KERNELS

First, we need to introduce a few concepts and some notation. To keep the presentation general, in this section we will consider a general space $X$ rather than $\mathbb{R}^n$. Since $G$ is compact, we can assume all representations to be orthogonal, i.e., $\rho(g^{-1}) = \rho(g)^{-1} = \rho(g)^T$. To parametrize a kernel $K : X \to \mathbb{R}^{d_{out} \times d_{in}}$ satisfying the steerability constraint in Eq. 2, it is convenient to use its vectorized form $\kappa(\cdot) = \mathrm{vec}(K(\cdot)) : X \to \mathbb{R}^{d_{out} d_{in}}$. The constraint² becomes:

$\kappa(g.x) = (\rho_{in} \otimes \rho_{out})(g) \, \kappa(x) \qquad \forall\, g \in G,\ x \in X. \quad (3)$

The matrix $(\rho_{in} \otimes \rho_{out})(g)$ is the Kronecker product of the two matrices $\rho_{in}(g)$ and $\rho_{out}(g)$.

²Note that $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$, where $\mathrm{vec}(\cdot)$ stacks matrix columns into a vector.
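As a concrete sanity check, the following self-contained NumPy sketch (illustrative code written for this presentation, not the paper's library) verifies the vectorization identity of footnote 2 and checks Eq. 2 and Eq. 3 for a hand-built $G = SO(2)$ kernel on the circle, with trivial input irrep $\rho_{in} = \rho_0$ and frequency-2 output irrep $\rho_{out} = \rho_2$:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix, i.e. the frequency-1 real irrep of SO(2)."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def vec(M):
    """Column-stacking vectorization, so that vec(ABC) = (C^T kron A) vec(B)."""
    return M.flatten(order='F')

rng = np.random.default_rng(0)

# Footnote 2: vec(ABC) = (C^T kron A) vec(B), checked on random matrices.
A, B, C = rng.normal(size=(3, 2)), rng.normal(size=(2, 4)), rng.normal(size=(4, 5))
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# A steerable kernel on the circle for rho_in = rho_0 (trivial, 1-dim) and
# rho_out = rho_2 (frequency 2): K(theta) = [cos(2 theta), sin(2 theta)]^T.
K = lambda theta: np.array([[np.cos(2 * theta)], [np.sin(2 * theta)]])
rho_in = lambda g: np.eye(1)
rho_out = lambda g: rot(2 * g)

for g in rng.uniform(0, 2 * np.pi, 5):
    for x in rng.uniform(0, 2 * np.pi, 5):
        # Eq. 2: K(g.x) = rho_out(g) K(x) rho_in(g)^{-1}
        assert np.allclose(K(x + g), rho_out(g) @ K(x) @ np.linalg.inv(rho_in(g)))
        # Eq. 3: kappa(g.x) = (rho_in kron rho_out)(g) kappa(x)
        assert np.allclose(vec(K(x + g)),
                           np.kron(rho_in(g), rho_out(g)) @ vec(K(x)))
```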
Irrep Decomposition. A useful property is that there exists a set³ $\hat{G}$ of special representations of $G$, called irreducible representations or irreps, such that any representation $\rho$ of $G$ can be decomposed as a direct sum of them: $\rho(g) = Q^T \big[ \bigoplus_{i \in I} \rho_i(g) \big] Q$, where $\rho_i \in \hat{G}$, $I$ is an index set ranging over the elements of $\hat{G}$ (possibly, with repetition) and $Q$ is an orthogonal matrix. The direct sum $\rho_1(g) \oplus \rho_2(g)$ of two matrices is a block-diagonal matrix containing the two matrices in its blocks. W.l.o.g., we can assume $\rho_{in}$ and $\rho_{out}$ in the kernel constraint from Eq. 2 to be irreps. The solutions for arbitrary $\rho_{in}$ and $\rho_{out}$ can be recovered from their irreps decomposition, as explained in Weiler & Cesa (2019). Hence, from now on, we will assume irreps $\rho_{in} = \rho_l$ and $\rho_{out} = \rho_J$, where $l$ and $J$ are indexes over $\hat{G}$.

Tensor Product. The matrix $\rho_{in} \otimes \rho_{out}$ built with the Kronecker product in Eq. 3 is itself a representation of $G$, namely a tensor product representation. As such, we can decompose it in terms of $G$'s irreps: $\rho_l \otimes \rho_J = [CG_{lJ}]^T \big[ \bigoplus_j \bigoplus_{s}^{[j(lJ)]} \rho_j \big] CG_{lJ}$, where $[j(lJ)]$ indicates the multiplicity of the irrep $\rho_j$, while $CG_{lJ}$ is the change of basis matrix.⁴ The block-diagonal structure of the direct sum allows one to distinguish blocks of rows in $CG_{lJ}$, acted on by individual irreps in the direct sum. The block associated with the $s$-th occurrence of $\rho_j$ is denoted as $CG_{j(lJ),s} \in \mathbb{R}^{d_j \times d_l d_J}$.

Endomorphisms. We are interested in the space of equivariant linear maps between two spaces transforming according to irreps $\rho_j, \rho_{j'} \in \hat{G}$. If $j \neq j'$, this space contains only the zero map; otherwise, it is the endomorphism space of $\rho_j$. This is a vector space; assume $\{c_{j,r} \in \mathbb{R}^{d_j \times d_j}\}_r$ is a basis⁵ for it.

Steerable Basis. To parametrize our kernels, we first need a basis for square-integrable functions on $X$, i.e. for $L^2(X)$. A $G$-steerable basis for $L^2(X)$ is a collection of orthogonal functions $\{Y^m_{ji} : X \to \mathbb{R}\}_{j \in \hat{G},\, m \leq d_j,\, i \leq m_j}$, with $m_j$ a positive (possibly infinite) integer. Denoting by $Y_{ji} : X \to \mathbb{R}^{d_j}$ the stack $\{Y^m_{ji}\}_{m=1}^{d_j}$, $Y_{ji}$ has the defining property that $Y_{ji}(g.x) = \rho_j(g) \, Y_{ji}(x)$ for all $g \in G$, $x \in X$.

The vectorized constraint in Eq. 3 is equivalent to Eq. 2 with a scalar input field and a $\rho_{in} \otimes \rho_{out}$ output one. A (vectorized) kernel $\kappa$ is a $G$-equivariant linear map from $L^2(X)$ to $\mathbb{R}^{d_l d_J}$, which will be applied convolutionally over a scalar field. Hence, a basis for this kernel space is given by considering the irreps decomposition of $G$'s action on $L^2(X)$ and $\mathbb{R}^{d_l d_J}$, and, then, equivariant maps between pairs of irreps. The basis $B$ defines an irrep decomposition of $L^2(X)$, $CG_{lJ}$ decomposes $\rho_{in} \otimes \rho_{out}$, and equivariant maps are parameterized by each irrep's endomorphism basis. This leads to:

Theorem 2.1 (Basis for G-Steerable Kernels). Let $G$ be a compact group acting on a space $X$. Let $B = \{Y_{ji}\}_{ji}$ be a $G$-steerable basis for $L^2(X)$. Assume $\rho_{in} = \rho_l$ and $\rho_{out} = \rho_J$ in $\hat{G}$. Under minor conditions, a basis for (vectorized) $G$-steerable kernels over $X$ is given by $K = \{\kappa_{jisr}\}_{jisr}$, with:

$\kappa_{jisr}(x) = [CG_{j(lJ),s}]^T \, c_{j,r} \, Y_{ji}(x). \quad (4)$

The proof is in Appendix B (including the case of complex valued representations); see also Fig. 2.

[Figure 2: Two (unvectorized) basis kernels ($j = 1$, $i = 1$, $r \in \{1, 2\}$) from Thm. 2.1 for $G = SO(2)$, $l = 1$, $J = 2$.]

Designing a new $G$-steerable basis $B$ can be laborious; however, if $G$ is a subgroup of $G'$ (i.e., $G \leq G'$), a $G'$-steerable basis can be turned into a $G$-steerable one via group restriction:

Group Restriction. Given two groups $G \leq G'$, one can turn an irrep $\rho_{j'}$ of $G'$ into a representation $\mathrm{Res}^{G'}_{G} \rho_{j'}$ of $G$ by restricting its domain to $G$. This representation is generally not irreducible and decomposes as $\mathrm{Res}^{G'}_{G} \rho_{j'} = [ID_{j'}]^T \big[ \bigoplus_j \bigoplus_t^{[j'j]} \rho_j \big] ID_{j'}$, where $[j'j]$ is the multiplicity of $\rho_j$ and $ID_{j'}$ is the change of basis. As earlier, the block-diagonal structure of the direct sum distinguishes blocks of rows of the matrix $ID_{j'}$. We denote the block associated with the $t$-th occurrence of $\rho_j$ as $ID_{j'j,t} \in \mathbb{R}^{d_j \times d_{j'}}$.

Restricted Steerable Basis. If $B' = \{Y_{j'i'}\}_{j'i'}$ is a $G'$-steerable basis for $L^2(X)$, the following set is a $G$-steerable basis, with index $i = (j'i't)$ (see Appendix B.3):

$B = \{Y_{j(j'i't)} \mid Y_{j(j'i't)}(x) = ID_{j'j,t} \, Y_{j'i'}(x)\}_{j \in \hat{G},\ Y_{j'i'} \in B',\ 1 \leq t \leq [j'j]}. \quad (5)$

³More precisely, irreps come in equivalence classes and the set $\hat{G}$ contains a representative for each class.
⁴$CG$ stands for Clebsch-Gordan, since this matrix contains the so-called Clebsch-Gordan coefficients.
⁵In the complex case, such matrices are multiples of the identity, but this space can be larger in the real case.
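To make the restriction step concrete, the following NumPy sketch (again illustrative, not the paper's implementation) instantiates Eq. 5 for $G = C_4 \leq G' = SO(2)$: the $SO(2)$ irrep $\rho_{j'=3}$ restricts to the 2-dimensional $C_4$ irrep $\rho_{j=1}$, and one valid change-of-basis block is $ID_{j'j,t} = \mathrm{diag}(1, -1)$, as can be checked from the intertwiner property:

```python
import numpy as np

rot = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
C4 = [k * np.pi / 2 for k in range(4)]          # elements of G = C_4

rho_3 = lambda g: rot(3 * g)   # G' = SO(2) irrep of frequency j' = 3
rho_1 = lambda g: rot(g)       # G = C_4 irrep of frequency j = 1
ID = np.diag([1.0, -1.0])      # one valid block ID_{j'j,t} of the change of basis

# Intertwiner property behind the decomposition of Res^{G'}_G rho_{j'}:
for g in C4:
    assert np.allclose(ID @ rho_3(g), rho_1(g) @ ID)

# Eq. 5: the frequency-3 circular harmonic becomes a C_4-steerable basis element.
Y3 = lambda t: np.array([np.cos(3 * t), np.sin(3 * t)])    # Y_{j'i'} in B'
Y = lambda t: ID @ Y3(t)                                    # Y_{j(j'i't)} in B

rng = np.random.default_rng(1)
for g in C4:
    for t in rng.uniform(0, 2 * np.pi, 8):
        # defining property of a steerable basis: Y(g.x) = rho_j(g) Y(x)
        assert np.allclose(Y(t + g), rho_1(g) @ Y(t))
```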
We present two examples on the circle $X = S^1$; see Appendix B.4 for more detailed examples.

Example 1: SO(2)-Steerable Kernels. In Fig. 3, we show a simple example of the basis for a $G = SO(2)$-steerable kernel space. Since $X = S^1$ is an orbit of $G$, our parameterization is equivalent to Lang & Weiler (2020). An $SO(2)$-steerable basis on $X = S^1$ is given by the circular harmonics: $Y^0_{ji}(\theta) = \cos(j\theta)$ and $Y^1_{ji}(\theta) = \sin(j\theta)$, for $j \in \mathbb{N}$ and $i = 1$. We consider steerable kernels with input irrep $\rho_{in} = \rho_{l=0}$ and output irrep $\rho_{out}$ one of $\rho_{J=0}$, $\rho_{J=1}$ or $\rho_{J=2}$.⁶ Hence, the kernels have form, respectively, $K_0 : S^1 \to \mathbb{R}^{1 \times 1}$, $K_1 : S^1 \to \mathbb{R}^{2 \times 1}$ and $K_2 : S^1 \to \mathbb{R}^{2 \times 1}$. Since $\rho_0 \otimes \rho_J = \rho_J$, we always have $j = J$ and $s = 1$. Additionally, any 2-dimensional irrep $\rho_j$ ($j > 0$) of $SO(2)$ has a 2-dimensional endomorphism space, spanned⁷ by the identity $c_{j,r=1} = \big(\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\big)$ and the rotation by $90°$, $c_{j,r=2} = \big(\begin{smallmatrix} 0 & 1 \\ -1 & 0 \end{smallmatrix}\big)$. This can be verified visually: in Fig. 3b and 3c, the second column is obtained by swapping the rows of the first and by changing the sign of the second row.

[Figure 3: Basis for $G = SO(2)$-steerable kernels from Eq. 4. $\rho_{in} = \rho_{l=0}$ is 1-dimensional, while $\rho_{out} = \rho_J$ for $J \in \{0, 1, 2\}$, which are 1, 2 and 2 dimensional. Each column is an element $K_{jisr}$ of the basis, while each row is an output channel. $i = 1$ always and, since $\rho_0 \otimes \rho_J = \rho_J$, $j = J$.]

Example 2: C4-Steerable Kernels. In Fig. 4, we show the basis for $G = C_4$ steerable kernels described by Theorem 2.1, using the $C_4$-steerable basis generated by Eq. 5 from the circular harmonics basis $B'$ used for $G' = SO(2)$ in the previous example. $\rho_j$ occurs in a restricted representation $\mathrm{Res}^{SO(2)}_{C_4} \rho_{j'}$ whenever $j' = |j + 4k|$, $k \in \mathbb{Z}$. Indeed, the basis prescribed in Eq. 5 has infinitely many elements, but each element is associated with a circular harmonic of a specific frequency $j'$. It is natural to use only a band-limited finite subset of this basis: Fig. 4 shows only frequencies up to $j' = 4$. If $j = 0$ or $j = 2$ and $j' = |j + 4k| > 0$, $\rho_j$ is 1-dimensional but $\rho_{j'}$ is 2-dimensional, so $\mathrm{Res}^{G'}_{G} \rho_{j'}$ contains two copies of $\rho_j$ and $t \in \{1, 2\}$ (see $j' = 2$ or $4$ in Fig. 4a). We use the trivial irrep $l = 0$ in the input and $J \in \{0, 1, 2\}$ in the output. $\rho_{j=1}$ has a 2-dimensional space of endomorphisms, spanned by the same basis described in the $SO(2)$ example above, so $r \in \{1, 2\}$ in this case. Note that the parametrization in Fig. 4 is continuous along the ring and does not suffer from the discretization issue in Fig. 1b.

[Figure 4: Basis for $G = C_4$-steerable kernels as in Eq. 4, using a steerable basis $B$ built from circular harmonics via Eq. 5. $\rho_{in} = \rho_{l=0}$, while $\rho_{out} = \rho_J$ for $J \in \{0, 1, 2\}$, which are 1, 2, and 2 dimensional. Each column is an element $K_{j(j'i't)sr}$ of the basis, while each row is a different output channel. Since $\rho_0 \otimes \rho_J = \rho_J$, $j = J$; for circular harmonics, $i' = 1$. We require $G = C_4$ equivariance, so a frequency $j' = 4$ filter is invariant to $G$; analogously, the output of frequency $j' = |j + 4k|$ filters ($k \in \mathbb{Z}$) transforms like the output of a frequency $j$ one.]

⁶For $\theta \in [0, 2\pi)$, $\rho_0 : \theta \mapsto 1$ is the trivial irrep; if $k > 0$, $\rho_k$ maps $\theta$ to the $2 \times 2$ rotation matrix by $k\theta$ radians.
⁷One can verify that any linear combination of these two matrices commutes with any $2 \times 2$ rotation matrix.
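The construction of Example 1 is simple enough to spell out in code. The sketch below (illustrative only) builds the basis kernels of Eq. 4 for $\rho_{in} = \rho_0$ and $\rho_{out} = \rho_J$, where the Clebsch-Gordan matrix is the identity (since $\rho_0 \otimes \rho_J = \rho_J$) and only the endomorphism basis $\{c_{j,1}, c_{j,2}\}$ acts on the circular harmonics:

```python
import numpy as np

rot = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# Endomorphism basis of a 2-dim SO(2) irrep: identity and 90-degree rotation.
c = [np.eye(2), np.array([[0.0, 1.0], [-1.0, 0.0]])]

def kernel(J, r, theta):
    """Basis kernel K_{j=J, i=1, s=1, r} : S^1 -> R^{2x1} from Eq. 4."""
    Y = np.array([[np.cos(J * theta)], [np.sin(J * theta)]])  # harmonic Y_{J,1}
    return c[r] @ Y            # the CG block is the identity here, so only c_r acts

rng = np.random.default_rng(2)
for J in (1, 2):
    for r in (0, 1):
        for g in rng.uniform(0, 2 * np.pi, 4):
            for x in rng.uniform(0, 2 * np.pi, 4):
                # Eq. 2 with trivial rho_in: K(g.x) = rho_J(g) K(x)
                assert np.allclose(kernel(J, r, x + g),
                                   rot(J * g) @ kernel(J, r, x))
```

Combining this with the $ID$ projection of the previous sketch, applied to harmonics of frequency $j' = |J + 4k|$, and restricting $g$ to multiples of $\pi/2$ yields the $C_4$ basis of Example 2.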
3 IMPLEMENTATION DETAILS

Given the theoretical results described in the previous sections, we now describe the steps and inputs required to build new $G$-steerable kernels. See Algorithm 1 for a summary. Theorem 2.1 relies on: a) the irreps $\hat{G} = \{\rho_j\}_j$, b) a $G$-steerable basis $B = \{Y_{ji}\}_{ji}$ for $L^2(X)$, c) the decomposition $\rho_l \otimes \rho_J = [CG_{lJ}]^T \big[ \bigoplus_{j \in \hat{G}} \bigoplus_{s=1}^{[j(lJ)]} \rho_j \big] CG_{lJ}$, and d) a basis $\{c_{j,r}\}_r$ for the endomorphism space of $\rho_j$. Optionally, $B$ can be generated via Eq. 5 using e) a larger group $G'$ and its irreps $\hat{G'} = \{\rho_{j'}\}_{j'}$, f) a $G'$-steerable basis $B' = \{Y_{j'i'}\}_{j'i'}$ and g) the decomposition $\mathrm{Res}^{G'}_{G} \rho_{j'} = [ID_{j'}]^T \big[ \bigoplus_j \bigoplus_t^{[j'j]} \rho_j \big] ID_{j'}$.

Algorithm 1: Generate G-steerable basis on space X
Require: $\rho_{in} = \rho_l$ and $\rho_{out} = \rho_J$; a $G'$-steerable basis $B' = \{Y_{j'i'}\}_{j'i'}$; $\hat{G} = \{\rho_j\}_j$ and $\hat{G'} = \{\rho_{j'}\}_{j'}$
1: $\{c_{j,r}\}_r \leftarrow$ basis for endomorphism space of $\rho_j$, for all $\rho_j$ (Appendix C)
2: $CG_{lJ}, \{[j(lJ)]\}_j \leftarrow \mathrm{decompose}(\rho_l \otimes \rho_J)$ (Appendix E)
3: for all $Y_{j'i'} \in B'$ do
4:   $ID_{j'}, \{[j'j]\}_j \leftarrow \mathrm{decompose}(\mathrm{Res}^{G'}_{G} \rho_{j'})$ (Appendix E)
5:   for all $j \in \hat{G}$ with $[j'j] > 0$, $s \leq [j(lJ)]$, $t \leq [j'j]$, $c_{j,r} \in \{c_{j,r}\}_r$ do
6:     $Y_{j(j'i't)}(x) \leftarrow ID_{j'j,t} \, Y_{j'i'}(x)$ (Equation 5)
7:     $K_{j(j'i't)sr}(x) \leftarrow \mathrm{unvec}\big([CG_{j(lJ),s}]^T \, c_{j,r} \, Y_{j(j'i't)}(x)\big)$ (Theorem 2.1)
8:     yield $K_{j(j'i't)sr}$

The irreps of $G$ in a) are the main requirement: they allow us to build arbitrary representations (via direct sum) and reduce the kernel constraint to the form in Sec. 2.2 by decomposing the input and output representations. If $B$ in b) is unknown, one can rely on Eq. 5. A convenient choice for $G'$ in e) is a group whose irreps $\hat{G'}$, together with a steerable basis $B'$ as in f), are known. $G' = O(n)$ is always possible if $X = \mathbb{R}^n$, and $B'$ is built via polar decomposition of $\mathbb{R}^n$, combining hyper-spherical harmonics with a radial basis. Like Worrall et al. (2017); Weiler et al. (2018b;a), we use Gaussian radial profiles⁸. In Sec. 5, this choice of $G'$ allows easily experimenting with multiple subgroups $G \leq O(3)$, without the overhead of identifying each $G$'s orbits or designing ad-hoc $G$-steerable bases. Note also that band-limiting is achieved by modifying $B'$ and is independent from $G$. Orbits of other groups $G'$ can also be used, combining harmonic bases (see Appendix D) with a Gaussian kernel. The irreps decomposition of $\rho_l \otimes \rho_J$ in c) provides the multiplicity $[j(lJ)]$ of each irrep $\rho_j$ and the matrices $CG_{j(lJ),s}$. Similarly, the irreps decomposition of $\mathrm{Res}^{G'}_{G} \rho_{j'}$ in g) provides the multiplicity of each $G$-irrep $\rho_j$ in each $G'$-irrep $\rho_{j'}$ and the projection matrices $ID_{j'j,t}$. Knowing these decompositions a priori for any $l$, $J$ and $j'$ is generally difficult, but they can be easily computed numerically; see Appendix E. Finally, d) requires a basis $\{c_{j,r}\}_r$ for the endomorphism space of each $\rho_j \in \hat{G}$. This can be computed numerically, but is often unnecessary; see Appendix C. In summary, if $X = \mathbb{R}^n$ and $G' = O(n)$, implementing new $G$-steerable CNNs only requires knowing the irreps $\hat{G}$: by knowing the action of $G$ on $\mathbb{R}^n$, one implicitly knows its embedding into $O(n)$ and, therefore, can apply Eq. 5.

Practical Example and Experiments. In Sec. 5.3, we implement a variety of subgroups of the isometries of $\mathbb{R}^3$ (see Appendix G). To do so, we choose $G' = O(3)$ and $B'$ is built by combining spherical harmonics with a Gaussian radial profile. Since irreps are used to define the types of the feature fields, we need the irreps of $SO(3)$, $O(2)$, $SO(2)$, $C_N$, $I$ and $O$. Direct product groups such as $\mathrm{Inv} \times SO(2)$ are built as in Appendix F. Since there are generally multiple subgroups $G < O(3)$ isomorphic to the same abstract group (e.g. see $O(2)$ in Tab. 1), for each $G$ we explicitly define its isomorphism with an abstract group; this enables the automatic restriction of $G'$'s irreps and the numerical computation of the $ID_{j'j,t}$ matrices. The matrices $CG_{j(lJ),s}$ are also computed numerically.

⁸Other radial profiles are suitable as well, without any substantial difference in our theory. The Gaussian profile is chosen mostly for presentation and implementation convenience.
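In practice, these steps are packaged in the escnn library mentioned above. The following usage sketch is based on the library's public interface; the exact factory and class names are assumptions taken from its documentation and may differ between versions:

```python
# Hedged usage sketch of the escnn library (github.com/QUVA-Lab/escnn);
# the API names below are assumed from the library docs and may change.
import torch
from escnn import gspaces, nn

# Pick the symmetry: the icosahedral subgroup I < O(3) acting on R^3.
gspace = gspaces.icoOnR3()

# Feature field types: one scalar (trivial) input field, 4 regular output fields.
in_type = nn.FieldType(gspace, [gspace.trivial_repr])
out_type = nn.FieldType(gspace, 4 * [gspace.regular_repr])

# The I-steerable kernel basis of Theorem 2.1 is generated internally from an
# O(3)-steerable basis (spherical harmonics times a Gaussian radial profile).
conv = nn.R3Conv(in_type, out_type, kernel_size=5)

x = nn.GeometricTensor(torch.randn(1, 1, 33, 33, 33), in_type)
y = conv(x)   # an equivariant feature field; |I| = 60, so 4 * 60 = 240 channels
```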
Limitations. Our choice of $B'$ implies we only parameterize filters supported on a compact ball, which is slightly more restrictive than the usual parameterization of filters with support on a cube. For a fixed kernel size, an ad-hoc implementation of $G$-steerable filters can exploit a larger initial basis, potentially leading to some performance gain (e.g. $C_4$ allows filters supported on a square rather than a disk). However, if an ad-hoc implementation of $G' < O(n)$ steerable kernels is available, our method can be used to construct $G \leq G'$ steerable spaces by using the $G'$-steerable basis rather than the $O(n)$-steerable one. Comparing the effect of different $B$ on the performance of a $G$-steerable CNN is beyond the scope of this paper, which instead focuses on building a method suitable for any $G$.

4 RELATED WORKS

Similar in spirit to our work is the e2cnn library (Weiler & Cesa, 2019), although it is limited to $X = \mathbb{R}^2$ and $G \leq O(2)$; our method with $G' = O(2)$ recovers their solutions. Geiger et al. (2020) present a library implementing general 3D steerable CNNs, but limited to the choices $G = SO(3)$ or $O(3)$. In comparison, we currently support both 2D and 3D convolution and any compact group $G \leq O(3)$ (including discrete and planar subgroups acting on $\mathbb{R}^3$), and other spaces $X$ can potentially be integrated. Finzi et al. (2021) propose a numerical method to parameterize (finite dimensional) MLPs that are equivariant to arbitrary matrix groups. Their numerical method to compute a basis of equivariant linear maps resembles our irreps decomposition method; see Appendix E. Bekkers (2020) and Finzi et al. (2020) implement group convolution on Lie groups using a finite number of samples from the continuous group. This is similar to our $SO(3)$ or $O(3)$ architectures with pointwise non-linearities; see Sec. 5.1. However, since our features and filters are explicitly parametrized on a band-limited space, we can adapt the sampling set to control the equivariance error caused by this approximation. Aronsson (2021) previously discussed band-limited convolution operators in equivariant CNNs. Jenner & Weiler (2022) generalize Euclidean steerable CNNs to partial differential operators. Cohen et al. (2019b); Kondor & Trivedi (2018); Bekkers (2020) define steerable CNNs on homogeneous spaces. $G$-steerable kernels are necessary to implement gauge equivariant CNNs on general manifolds (Weiler et al., 2021; Cohen et al., 2019a; Kicanaoglu et al., 2019; Haan et al., 2021). Li et al. (2021) use transformed filters to parameterize steerable kernels. Previously, Mallat (2012); Oyallon & Mallat (2015); Sifre & Mallat (2013) described similar architectures based on scattering. Brandstetter et al. (2021) use a non-linear parameterization of steerable convolutions on geometric graphs.

5 INSTANTIATION OF THE METHOD AND EXPERIMENTS

We demonstrate the generality of our method by instantiating steerable CNNs for different groups $G$. We will mostly focus on $\mathbb{R}^3$ and its isometries; Appendix G briefly summarizes the subgroups used. Sec. 5.1 discusses existing and new design choices for 3D steerable CNNs. We then compare these designs in Sec. 5.3. Sec. 5.2 numerically compares our steerable basis with alternative bases.

5.1 DESIGN CHOICES OF 3D STEERABLE CNNS

A $G$-steerable CNN design involves a choice of feature field types, i.e.
$G$-representations acting on the channels of the features; see Sec. 2.1. The chosen types constrain the non-linear operations permitted. Many works in the literature focus on designs which achieve perfect $SO(3)$ equivariance (Weiler et al., 2018b; Thomas et al., 2018; Anderson et al., 2019). To ensure this, a common choice is the use of irreps feature types and gated non-linearities. This operation combines a feature field $f_\rho(x) \in \mathbb{R}^{d_\rho}$ of type $\rho$ with a gate $f_0(x) \in \mathbb{R}$ as $f_\rho(x) \mapsto \sigma(f_0(x)) \, f_\rho(x)$, where $\sigma$ is a sigmoid function and $f_0$ is a feature field of trivial type. This non-linearity is equivariant to any $G$, but Weiler & Cesa (2019) found it to perform worse than point-wise non-linearities like those used in GCNNs. Another popular choice is using the tensor product of two feature fields as a non-linear operation (Kondor et al., 2018), i.e. $f_{\rho_i}(x), f_{\rho_j}(x) \mapsto f_{\rho_i}(x) \otimes f_{\rho_j}(x)$. However, this non-linearity is quadratic and turns the network into a polynomial function of the input, which can lead to training difficulties (Anderson et al., 2019). Additionally, computing the tensor product has computational and memory cost $O(d_{\rho_i} d_{\rho_j})$; it is common to project the output features to a smaller feature field through a learnable linear map.

A different approach is used in Worrall & Brostow (2018); Winkels & Cohen (2018), which use the octahedral group $O$ on voxelized data. The finiteness of the group allows for a simple GCNN (Cohen & Welling, 2016a) architecture, corresponding to the choice of the regular representation $\rho_{reg}$ as the feature type and standard pointwise non-linearities (e.g. ReLU). The representation $\rho_{reg}$ of a group $G$ acts on feature vectors in $\mathbb{R}^{|G|}$ by permutations, and a vector in this space can be interpreted as a scalar function over $G$; see Weiler & Cesa (2019) for an intuitive description.

Because $SO(3)$ (and $O(3)$) has only a limited number of discrete subgroups, we also consider discretizations without group structure. We interpret a feature vector $f(x) \in \mathbb{R}^B$ as the coefficients parameterizing a band-limited function over $G$ in Fourier space; see Appendix D. To apply a pointwise non-linearity $\sigma : \mathbb{R} \to \mathbb{R}$ (e.g. ELU) on $f(x)$, we i) sample the parameterized function on a finite set $\tilde{G} \subset G$, ii) apply $\sigma$ and, then, iii) recover the coefficients $f'(x)$ of the output via a Fourier transform. If $\sigma$ is smooth, the output signal is still approximately band-limited but can contain higher frequencies than the input $f(x)$. This implies that a larger set $\tilde{G}$ is required to reconstruct $f'(x)$ than for $f(x)$. If a fixed budget of channels (i.e. number of samples $|\tilde{G}|$) is available for a non-linear module, the input field $f(x)$ must be strongly band-limited. While this can reduce the expressiveness of the model's features, it also implies that linear layers are computationally cheaper, as they operate on a smaller number of channels. A similar result was leveraged in Cheng et al. (2019) for $G = SO(2)$; this regular non-linearity was also used in Kicanaoglu et al. (2019) for $G = SO(2)$. The same approach can be generalized to quotient fields, i.e. feature fields which parameterize functions over a quotient space $X = G/H$ (e.g. $S^2 = SO(3)/SO(2)$). See Appendix H.2 for more details on point-wise non-linearities and their computational benefit.
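The following NumPy sketch (illustrative, for $G = SO(2)$ rather than $SO(3)$) implements steps i)-iii) for a band-limited feature vector and shows that the ELU output is only approximately band-limited: the reconstruction residual measures the high-frequency content lost by the output band-limit, and it is only estimated reliably once the sample set $\tilde{G}$ is large enough (for $|\tilde{G}|$ close to the number of output coefficients, the fit is deceptively exact due to aliasing).

```python
import numpy as np

def fourier_basis(thetas, F):
    """Real Fourier basis [1, cos t, sin t, ..., cos Ft, sin Ft] sampled at thetas."""
    cols = [np.ones_like(thetas)]
    for k in range(1, F + 1):
        cols += [np.cos(k * thetas), np.sin(k * thetas)]
    return np.stack(cols, axis=-1)                  # shape (|G~|, 2F + 1)

elu = lambda x: np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

F_in, F_out = 2, 4                                  # input / output band-limits
coeffs = np.random.default_rng(3).normal(size=2 * F_in + 1)  # feature f(x) in R^B

for n in (9, 16, 32, 64):                           # size of the sample set G~
    grid = np.linspace(0, 2 * np.pi, n, endpoint=False)
    samples = fourier_basis(grid, F_in) @ coeffs    # i)   sample f on G~
    activated = elu(samples)                        # ii)  pointwise non-linearity
    A_out = fourier_basis(grid, F_out)
    out_coeffs, *_ = np.linalg.lstsq(A_out, activated, rcond=None)  # iii) FT fit
    residual = np.linalg.norm(A_out @ out_coeffs - activated) / np.sqrt(n)
    print(f"|G~| = {n:3d}: band-limited reconstruction residual = {residual:.4f}")
```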
Table 1: Rotated ModelNet10 ($O(3)$ symmetry). * indicates wider models to fix the computational cost.

G | Description | Accuracy
{e} | Conventional CNN | 82.5 ± 1.4
SO(2) | Axial Symmetry | 86.9 ± 1.9
SO(2)⋊F = O(2) | Dihedral Symmetry | 87.5 ± 0.7
SO(2)⋊M = O(2) | Conical Symmetry | 88.5 ± 0.8
Inv × SO(2) | Cylindrical Symmetry | 86.8 ± 0.7
Inv × SO(2)⋊F | Full Cylindrical Symmetry | 87.0 ± 1.0
O | Octahedral Symmetry (Winkels & Cohen, 2018) | 89.7 ± 0.6
I | Icosahedral Symmetry | 90.0 ± 0.6
I | Icosahedral Symmetry (finite orbits basis) | 88.2 ± 1.0
SO(3) | Chiral (Tensor product) (Anderson et al., 2019) | 86.3 ± 1.0
SO(3) | Chiral (Gated non-linearity) (Weiler et al., 2018b) | 88.8 ± 1.2
SO(3) | Chiral (Regular, |G̃| = 96) | 89.1 ± 1.2
SO(3) | Chiral (Regular, |G̃| = 192)* | 89.4 ± 1.4
SO(3) | Chiral (Quotient S² = SO(3)/SO(2), |X̃| = 30) | 89.5 ± 1.0
O(3) | Achiral (Regular, |G̃| = 120) | 89.2 ± 0.6
O(3) | Achiral (Regular, |G̃| = 144)* | 89.4 ± 0.7
O(3) | Achiral (Quotient Inv × S² = O(3)/SO(2), |X̃| = 60) | 88.6 ± 0.9

5.2 NUMERICAL COMPARISON OF STEERABLE BASES

In this section, we compare our basis with one built according to Lang & Weiler (2020). We consider the example of icosahedral $I$-steerable 3D convolution. Our basis is composed of a band-limited set of spherical harmonics per spherical shell, combined as in Theorem 2.1. The two baselines use respectively the centers of the 12 faces of the dodecahedron and of the 20 faces of the icosahedron, embedded in each shell, as orbits, and parameterize the kernel along each orbit directly using Lang & Weiler (2020).

[Figure 5: Histograms of relative equivariance errors (%) of three different icosahedral $I$-steerable bases. The first two parameterize two different orbits of $I$ (the 12 dodecahedron and 20 icosahedron face centers), while the third uses our basis based on spherical harmonics.]

Fig. 5 shows the histograms of the equivariance errors of the three bases with respect to the icosahedral group; each point represents an element of the basis and the errors are averaged over all transformations; see Appendix H.1. Our basis shows significantly higher stability. Finally, we study the effect of these bases on a model's performance. In Tab. 1 we compare two $I$-equivariant models: one using our harmonic basis and one using the 20 faces of the icosahedron as orbit. Our anti-aliased basis leads to a significant improvement in accuracy.

5.3 EXPERIMENTS

To emphasize that our method is not limited to $\mathbb{R}^3$, we include a simple experiment with 2D images. In the rest of the section, we compare different model designs on two volumetric datasets: ModelNet10 (Wu et al., 2015) (and a rotated version of it) and LBA (Townshend et al., 2020). For each model, we run a simple search over hyperparameters and minor variants of the designs described in Sec. 5.1 using validation accuracy, to ensure a fair comparison. Unless specified, within a task, all models approximately share the same width. In particular, in all tasks, the models which more closely match the symmetries of the data perform the best. See Appendix H for more details on the experiments. The equivariance groups used (column $G$ in the tables) are described in Appendix G.

Rotated MNIST. As a simple 2D experiment, we train a conventional CNN and a $G = C_8$ equivariant CNN on rotated MNIST. The steerable kernel bases are similar to those in Weiler & Cesa (2019) and we use regular field types with pointwise ELU. The conventional model achieves 96.3 ± 0.1 test accuracy, while the $C_8$ equivariant one achieves 96.7 ± 0.4.
Rotated ModelNet10. We generate a voxelized version of ModelNet10 with resolution 33px by sampling 3 random $SO(3)$ rotations for each object. During training, we use rotations from the octahedral group $O$ for augmentation. In Tab. 1, we compare different equivariance groups and different equivariant designs. It includes our implementation of some relevant related works; see Sec. 5.1. We first highlight the large margin between all equivariant networks and the conventional CNN. $SO(3)$ and $O(3)$ models achieve better performance than the other continuous subgroups. The best results are achieved by using the discrete groups $I$ and $O$ or the $SO(3)$ model with spherical quotient feature types. Recall also that the models with quotient or regular non-linearities are computationally cheaper; see Sec. 5.1.

Table 2: ModelNet10 ($O(2)$ symmetry)

G | Description | Accuracy
{e} | Conventional CNN | 91.2 ± 0.5
SO(2) | Azimuthal Symmetry | 91.9 ± 0.8
SO(3) | Chiral (Regular, |G̃| = 72) | 89.8 ± 0.6
O(2) | Full Azimuthal Symmetry | 92.3 ± 0.4
O(3) | Achiral (Regular, |G̃| = 120) | 89.9 ± 1.0
C₂ × F | Klein Group (dihedral symmetry) | 91.0 ± 0.6
- | VOXNet (Maturana & Scherer, 2015) | 92.0
C₂ × F | Klein Group (Worrall & Brostow, 2018) | 94.2

ModelNet10. To experiment with data with only axial and mirror symmetries, we generate a version of ModelNet10 by sampling the random rotations only around the Z axis. $\pi/2$ rotations and reflection augmentation in the XY plane are used for training, but no rotation averaging is used for testing. The final accuracies are in Tab. 2. The models equivariant to 3D rotations are outperformed by the $SO(2)$ and $O(2)$ models, which better match the data symmetries.

Ligand Binding Affinity (LBA) regression. We also evaluate our models on a regression task using the LBA dataset (Townshend et al., 2020), containing protein-ligand complexes from the PDBBind database (Wang et al., 2004; Liu et al., 2014). In Townshend et al. (2020), an $SO(3)$-equivariant model with tensor product activations performed best, although its cost prevented training on the full dataset. We propose a volumetric $SO(3)$ network with a computational cost equivalent to a conventional CNN by leveraging the benefit of regular non-linearities. Tab. 3 reports our results.

Discussion. We again emphasize that the purpose of our experiments is mostly demonstrating the generality and flexibility of our program, and we do not necessarily envision applications of all subgroups $G < O(3)$ considered. Still, some subgroups are practically relevant; indeed, models with azimuthal symmetry perform best in Tab. 2, where the data lacks full rotational symmetry. Here, enforcing $SO(3)$ equivariance results in over-constrained models which are not sufficiently expressive and are outperformed even by the conventional CNN. Overall, we observe that models whose symmetry matches the data tend to perform best. Some discrete subgroups, e.g. the platonic ones, constitute a special exception: they approximate the full rotational symmetry but are less restrictive than a fully rotation equivariant model, sometimes better matching the approximate symmetry of voxelized data. Finally, we only found minor differences among the various $SO(3)$ models, although regular and quotient non-linearities stood out for their reduced computational cost (see also Appendix H.7).

6 CONCLUSIONS

In this work, we gave a general characterization of $G$-steerable kernel spaces over an arbitrary space $X$ and proposed a general strategy to automatically parameterize them based only on $G$'s irreps.
This enabled us to implement steerable CNNs equivariant to a variety of new groups $G \leq O(3)$, which we compare in an exploratory study in our experiments. Finally, we believe that Theorem 2.1 suggests a new theoretical perspective on the construction of equivariant CNNs, reducing the problem to that of devising a suitable starting basis $B$, regardless of the choice of feature types in the model.

Table 3: Ligand Binding Affinity dataset. * indicates wider models to fix the computational cost.

G | Description | RMSD | Pearson | Spearman
{e} | Conventional CNN | 1.419 ± 0.047 | 0.575 ± 0.022 | 0.569 ± 0.021
O | Octahedral Symmetry | 1.417 ± 0.032 | 0.589 ± 0.010 | 0.581 ± 0.011
I | Icosahedral Symmetry | 1.432 ± 0.020 | 0.569 ± 0.023 | 0.559 ± 0.023
SO(3) | Chiral (Regular, |G̃| = 72) | 1.397 ± 0.039 | 0.580 ± 0.021 | 0.573 ± 0.022
SO(3) | Chiral (Regular, |G̃| = 60)* | 1.380 ± 0.033 | 0.588 ± 0.015 | 0.578 ± 0.018
SO(3) | Chiral (Quotient S², |X̃| = 24) | 1.405 ± 0.028 | 0.582 ± 0.018 | 0.576 ± 0.016
SO(3) | Chiral (Quotient S², |X̃| = 30)* | 1.385 ± 0.032 | 0.587 ± 0.019 | 0.576 ± 0.019
{e} | 3DCNN (Townshend et al., 2020) | 1.520 | 0.558 | 0.556
{e} | Graph-NN (GNN) (Townshend et al., 2020) | 1.936 | 0.581 | 0.647
SO(3) | E-GNN (Townshend et al., 2020) | 1.429 | 0.541 | 0.532

ACKNOWLEDGMENTS

Gabriele would like to thank Arash Behboodi for fruitful discussions on the approximation of point-wise non-linearities with finite samples.

REPRODUCIBILITY STATEMENT

To ensure reproducibility, we include a dedicated section in Appendix H describing our experiments in more detail. Moreover, we implemented all the methods described in this work in a Python library which can be found at github.com/QUVA-Lab/escnn. We also plan to publish the code used to run our experiments.

POSSIBLE NEGATIVE SOCIETAL IMPACTS

Since this work is very foundational, many different negative societal impacts are conceivable. In particular, any misuse which is eased by improved computer vision systems can profit from this and similar work. Of course, we cannot discuss them all. Nevertheless, we want to highlight one specific misuse which seems likely to occur for the field of equivariant deep learning in the future, namely the adoption in lethal autonomous weapons systems (LAWS). We can easily think of drones using spherical vision which would profit from equivariant CNNs. In particular, since they have a fixed vertical orientation, but arbitrary horizontal orientation, they could profit from spherical CNNs using azimuthal symmetries. It is important to realize that the mitigation of such misuses cannot happen at the level of individual research projects. Instead, what is needed is a global effort to mitigate the misuse of AI systems for LAWS in general. Fortunately, the broader AI community agrees that AI should not be used in LAWS, and that international regulations are necessary to prevent their use in the future (Autonomous Weapons Open Letter, 2015). We hope that efforts in this direction are continued.

REFERENCES

Autonomous Weapons: An Open Letter from AI & Robotics Researchers. https://futureoflife.org/openletter-autonomous-weapons/, 2015.

Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Jimmy Aronsson. Homogeneous vector bundles and G-equivariant convolutional neural networks. PhD thesis, Chalmers Tekniska Högskola (Sweden), 2021.

Erik J. Bekkers. B-spline CNNs on Lie groups. In International Conference on Learning Representations (ICLR), 2020.

J. Michael Boardman.
Real and complex representations, 2007. URL https://math.jhu.edu/~jmb/note/rschur.pdf.

Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik Bekkers, and Max Welling. Geometric and physical quantities improve E(3) equivariant message passing. arXiv preprint arXiv:2110.02905, 2021.

T. Bröcker and T. tom Dieck. Representations of Compact Lie Groups. Graduate Texts in Mathematics. Springer Berlin Heidelberg, 2003. ISBN 9783540136781. URL https://books.google.nl/books?id=AfBzWL5bIIQC.

Xiuyuan Cheng, Qiang Qiu, Robert Calderbank, and Guillermo Sapiro. RotDCF: Decomposition of convolutional filters for rotation-equivariant deep networks. In International Conference on Learning Representations (ICLR), 2019.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, 2016a.

Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In International Conference on Machine Learning, pp. 1321-1330. PMLR, 2019a.

Taco S. Cohen and Max Welling. Steerable CNNs. In ICLR 2017, November 2016b.

Taco S. Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. Advances in Neural Information Processing Systems, 32, 2019b.

Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. pp. 3165-3176, 2020.

Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. In International Conference on Machine Learning, pp. 3318-3328. PMLR, 2021.

William T. Freeman and Edward H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891-906, 1991.

Mario Geiger, Tess Smidt, Alby M., Benjamin Kurt Miller, Wouter Boomsma, Bradley Dice, Kostiantyn Lapchevskyi, Maurice Weiler, Michał Tyszkiewicz, Simon Batzner, Martin Uhrin, Jes Frellsen, Nuri Jung, Sophia Sanborn, Josh Rackers, and Michael Bailey. Euclidean neural networks: e3nn, 2020. URL https://github.com/e3nn/e3nn.

Pim de Haan, Maurice Weiler, Taco Cohen, and Max Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. In International Conference on Learning Representations, 2021.

Erik Jenner and Maurice Weiler. Steerable partial differential operators for equivariant neural networks. In International Conference on Learning Representations, 2022.

Berkay Kicanaoglu, Pim de Haan, and Taco Cohen. Gauge equivariant spherical CNNs. 2019.

Anthony Knapp. Lie Groups Beyond an Introduction, Second edition, volume 140. 01 2002. ISBN 978-1-4757-2455-4. doi: 10.1007/978-1-4757-2453-0.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning (ICML), 2018.

Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan nets: a fully Fourier space spherical convolutional neural network. Advances in Neural Information Processing Systems, 31, 2018.

Leon Lang and Maurice Weiler. A Wigner-Eckart theorem for group equivariant convolution kernels. In International Conference on Learning Representations, 2020.

Bo Li, Qili Wang, and Gim Hee Lee.
FILTRA: Rethinking steerable CNN by filter transform. In International Conference on Machine Learning, pp. 6515-6522. PMLR, 2021.

Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3):405-412, 10 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu626. URL https://doi.org/10.1093/bioinformatics/btu626.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922-928, 2015. doi: 10.1109/IROS.2015.7353481.

Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

J. Shawe-Taylor. Building symmetries into feedforward networks. In 1989 First IEE International Conference on Artificial Neural Networks (Conf. Publ. No. 313), pp. 158-162, Oct 1989.

Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

Keith Taylor and Eberhard Kaniuth. Induced Representations of Locally Compact Groups. 2013. doi: 10.1017/CBO9781139045391.

Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.

Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Brandon Anderson, Stephan Eismann, Risi Kondor, et al. ATOM3D: Tasks on molecules in three dimensions. 2020.

Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 47(12):2977-2980, June 2004. ISSN 0022-2623. doi: 10.1021/jm030580l. URL https://doi.org/10.1021/jm030580l.

Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. In Conference on Neural Information Processing Systems (NeurIPS), 2019.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S. Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Conference on Neural Information Processing Systems (NeurIPS), 2018a.

Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018b.

Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Coordinate independent convolutional networks: Isometry and gauge equivariant convolutions on Riemannian manifolds. arXiv preprint arXiv:2106.06020, 2021.

Marysia Winkels and Taco S. Cohen. 3D G-CNNs for pulmonary nodule detection. In Conference on Medical Imaging with Deep Learning (MIDL), 2018.

Daniel E. Worrall and Gabriel J. Brostow. CubeNet: Equivariance to 3D rotation and translation. In European Conference on Computer Vision (ECCV), 2018.

Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow.
Harmonic networks: Deep translation and rotation equivariance. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912-1920, 2015. doi: 10.1109/CVPR.2015.7298801.

CONTENTS OF THE APPENDIX

A Preliminaries on the Representation Theory of Compact Groups
  A.1 Basics of (Unitary and Irreducible) Representations
  A.2 Homogeneous Spaces, Square-Integrable Functions, and the Peter-Weyl Theorem
  A.3 Constructing Unitary Representations from Others and their Relations
  A.4 Compact Subgroups and Restricted Representations
B A Generalized Wigner-Eckart Theorem for Steerable Kernels
  B.1 Basis-Independent Formulation of the Group-Restricted Wigner-Eckart Theorem
  B.2 Steerable Kernel Bases
  B.3 A Wigner-Eckart Theorem for Group-Restricted Steerable Kernels
  B.4 Extended Examples of Steerable Kernel Bases
C Real Valued Representations
  C.1 Preliminaries on the Structure of End_{G,R}(V_ψ)
  C.2 Orthogonality and Alignment of Matrix Coefficients
  C.3 A Convenient Choice of Basis for Real Irreps
D The Computation of the Induced Representation of Compact Groups
  D.1 Useful Definitions, Assumptions and Notations
  D.2 Induced Representations
  D.3 Harmonic Analysis of Induced Representations
  D.4 Reconstructing the irrep decomposition: the complex case
  D.5 Reconstructing the irrep decomposition: the real case
E Numerical Irreps Decomposition of Real Representations
F Real Irreducible Representations of the Direct Product of two Compact Groups
G 3D symmetries, Azimuthal symmetries and their relevance
H Further Details on Network Designs and Experiments
  H.1 Estimating the Equivariance Error of Steerable Bases
  H.2 Regular and Quotient Non-Linearities
  H.3 Architecture Details
  H.4 MNIST Experiments
  H.5 ModelNet10 Experiments
  H.6 Auto-Encoder Experiments on ModelNet10
  H.7 Computational cost and Inference time of the models
  H.8 LBA Experiments

In the Appendix, we provide a more detailed description and the proofs of the theoretical results mentioned in the main paper. We first give a brief overview of the representation theory of compact groups and introduce the notation we will use throughout the next chapters in Section A. In Section B, we state more precisely and prove the main result from Sec. 2.2, which we used to parametrize the steerable filters.
In particular, we state both a complex and a real version of Theorem 2.1. In Section B.4, we discuss two examples of kernel bases. Sections C, D, E and F provide additional theoretical results and computational methods which can ease the practical implementation of steerable CNNs, but are not necessary for understanding our main results. Because the implementation of steerable CNNs necessarily uses real numbers, its design should rely on the real representation theory of the group considered. In Section C, we review the theory of real irreducible representations and derive some useful results to deal with these representations in the implementation. Notably, we show that, for a compact group $G$, any real irrep of $G$ can always be expressed in a particular basis which allows one to recover a harmonic basis for square-integrable functions over $G$ from the matrix coefficients of the irreps. The results shown in Section C will become useful in Sections D, E and F. Regular, quotient and induced feature types encode scalar or vector functions over a homogeneous space of $G$ in terms of a harmonic basis. Moreover, harmonic bases provide a simple way to generate the $G$-steerable bases required to define the kernel basis in Sec. 2.2: by leveraging the fact that orbits of $G$ in $X$ are isomorphic to homogeneous spaces, the harmonic bases for the orbits of $G$ in $X$ can be combined with a Gaussian kernel to generate $G$-steerable bases over $X$. In Section D, we derive a method to compute the (either real or complex) harmonic basis for any homogeneous space of $G$ from the harmonic basis of functions over $G$. The method described in Section D turns out to be more general and also supports vector fields over homogeneous spaces, i.e. general induced representations of $G$. Since the kernel parameterization in Sec. 2.2 relies on the irreps decomposition of the input and output representations, of the tensor product of irreps (to compute the Clebsch-Gordan coefficients $CG$) and of the restricted irreps of $G'$ (to compute the $ID$ coefficients), the ability to decompose an arbitrary representation is essential to apply our method in a fully general setting. Therefore, in Section E, we describe a numerical method to perform this decomposition for real representations. As explained in Sec. 3, the main (and, often, only) requirement to apply our method to a new group $G$ is the knowledge of its irreps $\hat{G}$. A simple operation which can be used to generate new groups is the direct product of two groups. Examples of direct product groups include many subgroups of $O(3)$ that we used in Sec. 5. In Section F, we describe a method to derive the real irreps of a direct product group $G = A \times B$ from the real irreps of the two subgroups $A$ and $B$. Section H includes additional details on the experimental results reported in Sec. 5. In particular, we describe in more detail the different equivariant designs, the architectures used, the datasets and the training and testing configurations.

A PRELIMINARIES ON THE REPRESENTATION THEORY OF COMPACT GROUPS

In this section, we collect many notions and concepts which we need throughout the appendix for the development of the theory. This section assumes some familiarity with the representation theory of compact groups and is meant as a reminder. For readers with no background in representation theory, we can recommend the appendix of Lang & Weiler (2020). We will use $\mathbb{K}$ to denote any of the two fields $\mathbb{R}$ or $\mathbb{C}$, i.e., we develop the theory in such a way that it covers both real and complex representations. Whenever we deviate from this assumption, we will say so explicitly. Throughout, let furthermore $G$ be any compact group.

A.1 BASICS OF (UNITARY AND IRREDUCIBLE) REPRESENTATIONS

Linear representations are (in a certain sense continuous) group homomorphisms $\rho : G \to \mathrm{GL}(V)$ for some topological $\mathbb{K}$-vector space $V$. In our applications, $V$ always carries a scalar product $\langle \cdot \,|\, \cdot \rangle$ which induces its topology, and which we always assume to be conjugate-linear in the first component and linear in the second component. The pair $(V, \langle \cdot \,|\, \cdot \rangle)$ is assumed to be a Hilbert space, meaning that the scalar product makes $V$ a complete metric space. Note that, as is typical in a physics context, we freely make use of the bra-ket convention, and thus for example view $\langle v |$ as a linear functional which maps $| w \rangle := w$ to $\langle v \,|\, w \rangle$. By the Riesz representation theorem, there are no other continuous linear functionals.

Unitary representations are linear representations $\rho : G \to U(V) \subseteq \mathrm{GL}(V)$, where $U(V)$ is the unitary group of $V$. This is the group of unitary transformations on $V$, i.e., linear functions on $V$ which preserve the scalar product, and thus distances. Whenever we explicitly assume $\mathbb{K} = \mathbb{R}$, we will write $O(V)$ instead of $U(V)$, which is the group of orthogonal transformations on $V$, and the representations are then called orthogonal instead of unitary.⁹

We will denote the set of isomorphism classes of all irreducible unitary representations (irreps) of $G$ as $\hat{G}$. Thereby, irreducible representations are representations which cannot be decomposed nontrivially into subrepresentations. We usually write indices $j, l, J \in \hat{G}$ to indicate such isomorphism classes, and then, e.g., $\rho_j : G \to U(V_j)$ to denote a representative of such an isomorphism class. We set $d_j := \dim V_j$ as the dimension of the representation space $V_j$. Whenever we consider only one irrep, we may also, by abuse of notation, write $\rho \in \hat{G}$ to indicate (a representative of an isomorphism class of) an irrep of $G$. We now collect definitions which we use throughout.

Definition A.1 (Homomorphisms/Intertwiners). Let $\rho_V : G \to \mathrm{GL}(V)$ and $\rho_W : G \to \mathrm{GL}(W)$ be two linear representations. The space of homomorphisms, also called intertwiners, $\mathrm{Hom}_{G,\mathbb{K}}(V, W)$, consists of all linear, continuous functions $f : V \to W$ which commute with the two representations, i.e., $f \circ \rho_V(g) = \rho_W(g) \circ f$ for all $g \in G$. We also write $\mathrm{Hom}_{G,\mathbb{K}}(\rho_V, \rho_W)$ for this space, depending on whether we want to emphasize the representation spaces or the representations.

Definition A.2 (Equivalences and Isomorphisms). Let $\rho_V : G \to \mathrm{GL}(V)$ and $\rho_W : G \to \mathrm{GL}(W)$ be two linear representations. An intertwiner $f : V \to W$ is called an equivalence if there is an intertwiner $g : W \to V$ such that $g \circ f = \mathrm{id}_V$ and $f \circ g = \mathrm{id}_W$. $\rho_V$ and $\rho_W$, or also $V$ and $W$, are then said to be equivalent. If $\rho_V$ and $\rho_W$ are even unitary representations, and $f$ and $g$ additionally are unitary transformations, i.e., preserve the scalar product and thus distances, then $V$ and $W$, or $\rho_V$ and $\rho_W$, are called isomorphic, and $f$ and $g$ isomorphisms. We write this as $\rho_V \cong \rho_W$ or $V \cong W$, depending on whether we want to emphasize the representations or the representation spaces.

Definition A.3 (Endomorphisms, Endomorphism Basis). Let $\rho : G \to \mathrm{GL}(V)$ be a linear representation. The space of endomorphisms $\mathrm{End}_{G,\mathbb{K}}(V)$ consists of all linear, continuous maps $c : V \to V$ that commute with $\rho$, i.e., $c \circ \rho(g) = \rho(g) \circ c$ for all $g \in G$. For $\rho = \rho_j$ being an irrep, this space is finite-dimensional and has a basis (the endomorphism basis) $(c_{j,r})_{r=1}^{E_j}$ with $E_j = \dim(\mathrm{End}_{G,\mathbb{K}}(V_j))$. Note that $\mathrm{End}_{G,\mathbb{K}}(V_j) = \mathrm{Hom}_{G,\mathbb{K}}(V_j, V_j)$. By Schur's Lemma, the endomorphism spaces of complex irreps are always 1-dimensional, generated by the identity. For real representations, they can be 1, 2, or 4-dimensional, and the irrep is then correspondingly called of real type, complex type, or quaternionic type. See Supplementary C for more details on real representations.

A.2 HOMOGENEOUS SPACES, SQUARE-INTEGRABLE FUNCTIONS, AND THE PETER-WEYL THEOREM

Recall that a homogeneous space of the compact group $G$ is a topological (Hausdorff) space $X$ together with a transitive continuous action $. : G \times X \to X$, $(g, x) \mapsto g.x = gx$. As is commonly known, $G$ carries an inversion-invariant and left-right invariant so-called Haar measure $\mu$, which can furthermore be assumed to be normalized: $\mu(G) = 1$. This Haar measure also induces a corresponding $G$-action invariant measure $\nu$ on $X$, which we also call Haar measure and assume to be normalized: $\nu(X) = 1$. The normalization makes both Haar measures the unique measures which are defined on the whole Borel sigma algebras of these spaces and satisfy the mentioned invariance properties. To reduce clutter, we will mostly not mention the used measure in integral formulas, but we view it as understood that all integrals are with respect to these unique Haar measures. Fix from now on a homogeneous space $X$ of $G$.

⁹But note that in the general case of an unspecified field $\mathbb{K}$, we always speak of unitary representations, by which we just mean orthogonal in the case $\mathbb{K} = \mathbb{R}$.

Definition A.4 ((The Space of) Square-Integrable Functions). A square-integrable function $f : X \to \mathbb{K}$ is a measurable function with the property that the square of its absolute value is integrable: $\int_X |f(x)|^2 \, dx < \infty$. We denote the space of all such square-integrable functions by $L^2_{\mathbb{K}}(X)$. It carries a scalar product given by $\langle f \,|\, g \rangle = \int_X \overline{f(x)} \, g(x) \, dx$, making it a Hilbert space. It also carries a unitary representation of $G$ given by $\lambda : G \to U(L^2_{\mathbb{K}}(X))$, $(\lambda(g) f)(x) := f(g^{-1} x)$.

Definition A.5 (Regular Representation). In the case that $X = G$ is the group itself, we call $L^2_{\mathbb{K}}(G)$, together with the action $[\lambda(g) f](g') := f(g^{-1} g')$, the regular representation of $G$.

Theorem A.6 (Peter-Weyl Theorem). There is a decomposition $L^2_{\mathbb{K}}(X) = \widehat{\bigoplus}_{j \in \hat{G}} \bigoplus_{i=1}^{m_j} V_{ji}$ into irreducible subrepresentations. Thereby, $V_{ji}$ is isomorphic to the representative irreducible representation $V_j$ and generated by orthonormal harmonic basis functions $Y^m_{ji} : X \to \mathbb{K}$. We have $m_j \leq d_j := \dim V_j$. If $\mathbb{K} = \mathbb{C}$ and $X = G$, then $m_j = d_j$.

Note that the hat $\widehat{(\cdot)}$ denotes a topological closure. It is a technicality which the reader can mostly ignore without problems.

Proposition A.7 (Peter-Weyl Theorem: An Explicit Basis for Complex-Valued Square-Integrable Functions on a Compact Group). In the complex case $\mathbb{K} = \mathbb{C}$, the orthonormal harmonic basis for $L^2_{\mathbb{C}}(G)$ is given explicitly by $Y^m_{ji}(g) = \sqrt{d_j} \, \langle \rho_j(g) e_i \,|\, e_m \rangle = \sqrt{d_j} \, \overline{[\rho_j(g)]_{mi}} = \sqrt{d_j} \, [\rho_j(g^{-1})]_{im}$. In other words, the matrix coefficients of the complex irreps of $G$ form an orthogonal basis for $L^2_{\mathbb{C}}(G)$.

Definition A.8 (Harmonic Projections). We denote the orthogonal projection from $L^2_{\mathbb{K}}(X)$ to the $i$-th copy of $V_j$ by $p_{ji} : L^2_{\mathbb{K}}(X) \to V_j$. On the harmonic basis, it is given by $p_{ji}\big(Y^{\tilde{m}}_{\tilde{j}\tilde{i}}\big) = \delta_{j\tilde{j}} \, \delta_{i\tilde{i}} \, |j \tilde{m}\rangle$, which just follows from the fact that the harmonic basis functions form an orthonormal basis of $L^2_{\mathbb{K}}(X)$, which is then mapped to the corresponding orthonormal basis of $V_j$, the representation space of the representative irrep $\rho_j : G \to U(V_j)$.

One commonly known corollary of the Peter-Weyl theorem is the following. We make use of it in the definitions of the Clebsch-Gordan coefficients and irrep-decomposition coefficients which we define in the next two subsections:

Corollary 1. Any unitary representation of $G$, even infinite dimensional, can be decomposed into a (topological closure of a) direct sum of irreps of $G$ with a unitary isomorphism, i.e.: for all unitary representations $\pi : G \to U(V)$ of $G$, there exists a unitary transformation $M : V \to \widehat{\bigoplus}_{i \in I} V_i$, where $\rho_i : G \to U(V_i)$ are unitary irreducible representations, such that for all $g \in G$ the following identity holds: $\pi(g) = M^{-1} \big[ \widehat{\bigoplus}_{i \in I} \rho_i(g) \big] M$. Here, $I$ is a set indexing irreps in $\hat{G}$, and the same irrep may occur multiple times. If such an $M$ exists, we also write $V \cong \widehat{\bigoplus}_{i \in I} V_i$.

That $M$ is a unitary transformation means, as explained earlier, that it preserves the scalar product and thus distances. Consequently, one has $M^{-1} = M^*$, i.e., the inverse is equal to the adjoint. In the finite-dimensional case, and working with matrices, one has that $M^*$ is given by the conjugate transpose of $M$, and if we additionally consider real-valued representations, then $M^* = M^T$ is just the transpose.

A.3 CONSTRUCTING UNITARY REPRESENTATIONS FROM OTHERS AND THEIR RELATIONS

Now we come to tensor products and Clebsch-Gordan coefficients. Recall that given two finite-dimensional unitary representations $\rho_V : G \to U(V)$ and $\rho_W : G \to U(W)$, one can define a unitary representation $\rho_V \otimes \rho_W : G \to U(V \otimes W)$, called the tensor product, given by $(\rho_V \otimes \rho_W)(g) = \rho_V(g) \otimes \rho_W(g) : v \otimes w \mapsto [\rho_V(g)](v) \otimes [\rho_W(g)](w)$. Thereby, $V \otimes W$ is the tensor product of the vector spaces $V$ and $W$. If $V$ and $W$ have bases $\{v_1, \ldots, v_n\}$ and $\{w_1, \ldots, w_m\}$, respectively, then the tensor product has a basis $\{v_i \otimes w_j \mid i = 1, \ldots, n,\ j = 1, \ldots, m\}$. Correspondingly, represented with respect to given bases, the matrix corresponding to $\rho_V(g) \otimes \rho_W(g)$ is the Kronecker product of the matrices corresponding to $\rho_V(g)$ and $\rho_W(g)$. The scalar product on the tensor product is given by $\langle v \otimes w \,|\, v' \otimes w' \rangle = \langle v \,|\, v' \rangle \langle w \,|\, w' \rangle$. The tensor product $V \otimes W$ has a universal property which relates bilinear functions on $V \times W$ to linear functions on $V \otimes W$. Recall that for an irrep $\rho_j : G \to U(V_j)$, $d_j$ denotes the dimension of $V_j$.

Definition A.9 (Clebsch-Gordan Coefficients). Let $\rho_l : G \to U(V_l)$ and $\rho_J : G \to U(V_J)$ be irreducible unitary representations of $G$. There is a decomposition of the tensor product into irreps, $V_l \otimes V_J \cong \bigoplus_{j \in \hat{G}} \bigoplus_{s=1}^{[j(lJ)]} V_j$, where $[j(lJ)]$ is the multiplicity of $V_j$ in $V_l \otimes V_J$. This multiplicity is zero for all but finitely many $j \in \hat{G}$. We denote the coupling coefficients between basis elements coming from this decomposition by $\langle s, jm \,|\, ln; JM \rangle$, where $m \leq d_j$, $n \leq d_l$, $M \leq d_J$ and $s \leq [j(lJ)]$. They are called the Clebsch-Gordan coefficients.

Remark A.10. The main application of the tensor product will be in the Wigner-Eckart theorem which we state in Section B, where $\rho_j$ will be the irrep corresponding to harmonic basis functions on a homogeneous space, $\rho_l$ will be the input representation, and $\rho_J$ will be the output representation.
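As a concrete instance of Definition A.9, the following NumPy sketch (illustrative only) writes down the real Clebsch-Gordan matrix for $G = SO(2)$ and $\rho_l = \rho_J = \rho_1$, where $\rho_1 \otimes \rho_1 \cong \rho_0 \oplus \rho_0 \oplus \rho_2$ with multiplicity $[0(1 \otimes 1)] = 2$, and verifies the block diagonalization numerically:

```python
import numpy as np

rot = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# Real Clebsch-Gordan decomposition rho_1 (x) rho_1 ~= rho_0 (+) rho_0 (+) rho_2.
s = 1 / np.sqrt(2)
Q = np.array([[s, 0,  0,  s],    # trivial copy 1: (e1 (x) e1 + e2 (x) e2) / sqrt(2)
              [0, s, -s,  0],    # trivial copy 2: (e1 (x) e2 - e2 (x) e1) / sqrt(2)
              [s, 0,  0, -s],    # frequency-2 pair: symmetric traceless part ...
              [0, s,  s,  0]])   # ... of the 2x2 matrix, which rotates at 2g

for g in np.random.default_rng(4).uniform(0, 2 * np.pi, size=10):
    block = Q @ np.kron(rot(g), rot(g)) @ Q.T
    target = np.zeros((4, 4))
    target[0, 0] = target[1, 1] = 1.0    # two copies of the trivial irrep rho_0
    target[2:, 2:] = rot(2 * g)          # one copy of the frequency-2 irrep rho_2
    assert np.allclose(block, target)
print("Q block-diagonalizes rho_1 (x) rho_1 into rho_0 (+) rho_0 (+) rho_2")
```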
We will use K to denote any of the two fields R or C, i.e., we develop the theory in such a way that it covers both real and complex representations. Whenever we deviate from this assumption, we will say so explicitly. Throughout, let furthermore G be any compact group. A.1 BASICS OF (UNITARY AND IRREDUCIBLE) REPRESENTATIONS Linear representations are (in a certain sense continuous) group homomorphisms ρ : G GL(V ) for some topological K-vector space V . In our applications, V always carries a scalar product | which induces its topology, and which we always assume to be conjugate linear in the first component, Published as a conference paper at ICLR 2022 and linear in the second component. The pair (V, | ) is assumed to be a Hilbert space, meaning that the scalar product makes V a complete metric space. Note that, as is typical in a physics context, we freely make use of the Bra-Ket convention, and thus for example view v| as a linear functional which maps |w := w to v | w . By the Riesz representation theorem, there are no other continuous linear functionals. Unitary representations are linear representations ρ : G U(V ) GL(V ), where U(V ) is the unitary group of V . This is the group of unitary transformations on V , i.e., linear functions on V which preserve the scalar product, and thus distances. Whenever we explicitly assume K = R, we will write O(V ) instead of U(V ), which is the group of orthogonal transformations on V , and the representations are then called orthogonal instead of unitary.9 We will denote the set of isomorphism classes of all irreducible unitary representations (irreps) of G as b G. Thereby, irreducible representations are representations which cannot be decomposed nontrivially into subrepresentations. We usually write indices j, l, J b G to indicate such isomorphism classes, and then, e.g., ρj : G U(Vj) to denote a representative of such an isomorphism class. We set dj := dim Vj as the dimension of the representation space Dof Vj. Whenever we consider only one irrep, we may also, by abuse of notation, write ρ b G to indicate (a representative of an isomorphism class) of an irrep of G. We now collect definitions which we use throughout. Definition A.1 (Homomorphisms/Intertwiners). Let ρV : G GL(V ) and ρW : G GL(W) be two linear representations. The space of homomorphisms, also called intertwiners, Hom G,K(V, W), consists of all linear, continuous functions f : V W which commute with the two representations, i.e., f ρV (g) = ρW (g) f for all g G. We write for this space also Hom G,K(ρV , ρW ), depending on whether we want to emphasize the representation spaces or the representations. Definition A.2 (Equivalences and Isomorphisms). Let ρV : G GL(V ) and ρW : G GL(W) be two linear representations. An intertwiner f : V W is called an equivalence if there is an intertwiner g : W V such that g f = id V and f g = id W . ρV and ρW , or also V and W, are then said to be equivalent. If ρV and ρW are even unitary representations, and f and g additionally are unitary transformations, i.e., preserve the scalar product and thus distances, then V and W or ρV and ρW are called isomorphic, and f and g isomorphisms. We write this as ρV = ρW or V = W, depending on whether we want to emphasize the representations or the representation spaces. Definition A.3 (Endomorphisms, Endomorphism Basis). Let ρ : G GL(V ) be a linear representation. The space of endomorphisms End G,K(V ) consists of all linear, continuous maps c : V V that commute with ρ, i.e., c ρ(g) = ρ(g) c for all g G. 
For ρ = ρj being an irrep, this space is finite-dimensional and Das a basis (the endomorphism basis) (cj r)Ej r=1 with Ej = dim (End G,K(Vj)). Note that End G,K(Vj) = Hom G,K(Vj, Vj). By Schur s Lemma, the endomorphism spaces of complex irreps are always 1-dimensional, generated by the identity. For real representations, they can be 1, 2, or 4-dimensional, and the irrep is then correspondingly called of real type, complex type, or quaternionic type. See Supplementary C for more details on real representations. A.2 HOMOGENEOUS SPACES, SQUARE-INTEGRABLE FUNCTIONS, AND THE PETER-WEYL THEOREM Recall that a homogeneous space of the compact group G is a topological (Hausdorff) space X together with a transitive continuous action . : G X X, (g, x) 7 g.x = gx. As is commonly known, G carries an inversion-invariant and left-right invariant so-called Haarmeasure µ, which can furthermore be assumed to be normalized: µ(G) = 1. This Haar measure also 9But note that in the general case of unspecified field K, we always speak of unitary representations, by which we just mean orthogonal in the case K = R Published as a conference paper at ICLR 2022 induces a corresponding G-action invariant measure ν on X, which we also call Haar-measure and assume to be normalized: ν(X) = 1. The normalization makes both Haar-measures the unique measures which are defined on the whole Borel Sigma algebras of these spaces and satisfying the mentioned invariance properties. To reduce clutter, we will mostly not mention the used measure in integral-formulas, but we view it as understood that all integrals are with respect to these unique Haar measures. Fix from now on a homogeneous space X of G. Definition A.4 ((The Space of) Square-Integrable Functions). A square-integrable function f : X K is a measurable function with the property that the square of its absolute value is integrable: R X |f(x)|2dx < . We denote the space of all such square-integrable functions by L2 K(X). It carries a scalar product given by f | g = R X f(x)g(x)dx, making it a Hilbert space. It also carries a unitary representation of G given by λ : G U(L2 K(X)), (λ(g)f)(x) := f(g 1x). Definition A.5 (Regular Representation). In the case that X = G is the group itself, we call L2 K(G), together with the action [λ(g)f](g ) := f(g 1g ), the regular representation of G. Theorem A.6 (Peter-Weyl Theorem). There is a decomposition L2 K(X) =[ M into irreducible subrepresentations. Thereby, Vji is isomorphic to the representative irreducible representation Vj and generated by orthonormal harmonic basis functions Y m ji : X K. We have mj dj := dim Vj. If D= C and X = G, then mj = dj. Note that the hat c ( ) denotes a topological closure. It is a technicality which the reader can mostly ignore without problems. Proposition A.7 (Peter-Weyl Theorem: An Explicit Basis for Complex-Valued Square-Integrable Functions on a Compact Group). In the complex field K = C, the orthonormal harmonic basis for L2 C(G) is given explicitly by: Y m ji (g) = p dj ρj(g)ei|em = p dj [ρj(g)]mi = p dj [ρj(g 1)]im In other words, the matrix coefficients of the complex irreps of G form an orthogonal basis for L2 C(G). Definition A.8 (Harmonic Projections). We denote the orthogonal projection from L2 K(X) to the i th copy of Vj by pji : L2 K(X) Vj. 
On the harmonic basis, it is given by pji Y e m ejei = δjejδiei |j em , which just follows from the fact that the harmonic basis functions form an orthonormal basis of L2 K(X), which is then mapped to the corresponding orthonormal basis of Vj, the representation space of the representative irrep ρj : G U(Vj). One commonly known corollary of the Peter-Weyl theorem is the following. We make use of it in the definitions of the Clebsch-Gordan coefficients and irrep-decomposition coefficients which we define in the next two subsections: Corollary 1. Any unitary representation of G, even infinite dimensional, can be decomposed into a (topological closure of a) direct sum of irreps of G with a unitary isomorphism, i.e.: for all unitary representations π : G U(V ) of G, there exists a unitary transformation M : V c L i IVi, where ρi : G U(Vi) are unitary irreducible representations such that for all g G the following identity holds: π(g) = M 1 [ M I is a set indexing irreps in b G, and the same irrep may occur multiple times. If such an M exists, we also write V = c L Published as a conference paper at ICLR 2022 That M is a unitary transformation means, as explained earlier, that it preserves the scalar product and thus distances. Consequently, one has M 1 = M , i.e., the inverse is equal to the adjoint10. In the finite-dimensional case, and working with matrices, one has that M is given by the conjugate transpose of M, and if we additionally consider real-valued representations, then M = M T is just the transpose. A.3 CONSTRUCTING UNITARY REPRESENTATIONS FROM OTHERS AND THEIR RELATIONS Now we come to tensor-products and Clebsch-Gordan coefficients. Recall that given two finitedimensional unitary representations ρV : G U(V ) and ρW : G U(W), one can define a unitary representation ρV ρW : G U(V W) called tensor product given by (ρV ρW )(g) = ρV (g) ρW (g) : v w 7 [ρV (g)] (v) [ρW (g)] (w). Thereby, V W is the tensor product of the vector spaces V and W. If V and W have bases {v1, . . . , vn} and {w1, . . . , wm}, respectively, then the tensor product has a basis {vi wj | i = 1, . . . , n, j = 1, . . . , m}. Correspondingly, represented with respect to given bases, the matrix corresponding to ρV (g) ρW (g) is the Kronecker product of the matrices corresponding to ρV (g) and ρW (g). The scalar product on the tensor product is given by v w | v w = v | v w | w . The tensor product V W has a universal property which relates bilinear functions on V W to linear functions on V W. Recall that for an irrep ρj : G U(Vj), dj denotes the dimension of Vj. Definition A.9 (Clebsch-Gordan Coefficients). Let ρl : G U(Vl) and ρJ : G U(VJ) be irreducible unitary representations of G. There is a decomposition of the tensor product in irreps, where [j(l J)] is the multiplicity of Vj in Vl VJ. This multiplicity is zero for all but finitely many j b G. We denote the coupling coefficients between basis elements coming from this decomposition by s, jm|ln; JM , where m dj, n dl, M d J and s [j(l J)]. They are called the Clebsch-Gordan coefficients. Remark A.10. The main application of the tensor product will be in the Wigner-Eckart theorem which we state in Section B, where ρj will be the irrep corresponding to harmonic basis functions on a homogeneous space, ρl will be the input representation, and ρJ will be the output representation. 
Compared to the work Lang & Weiler (2020), this means that we will decompose Vl VJ instead of Vj Vl, which leads to a simpler implementation for real representations, the case we are concerned with, at the cost of the necessity to invoke dual representations in the case that one works with complex representations. The definition of these coefficients can be made more explicit by invoking the corresponding projections and embeddings, which we will use in the proof of our adapted Wigner-Eckart Theorem. We thereby define |ln , n dl, as the n th basis vector in Vl, for a fixed basis, and similarly for |jm and |JM . Definition A.11 (Clebsch-Gordan Projections and Embeddings). We denote the projection from Vl VJ to the s th copy of Vj by CGj(l J) s : Vl VJ Vj. It has as matrix elements the Clebsch Gordan coefficients for fixed s, j, l, J: s, jm | ln; JM = D jm CGj(l J) s |ln |JM E . (6) Dually, we denote the embedding of the s th copy of Vj into Vl VJ by ij(l J) s : Vj Vl VJ. It relates to the Clebsch-Gordan coefficients as follows: s, jm|ln; JM = ij(l J) s (|jm ) ln; JM . (7) Note that this definition of Glebsh-Gordan coefficient is compatible with the one in Sec. 2.2, for real representations. Indeed, in Sec. 2.2, CGj(l J) s denotes the matrix of Clebsch-Gordan coefficients s, j em|lm; JM of shape dj (dl d J). 10The adjoint of a continuous linear operator f : V W between Hilbert spaces is defined as the unique continuous linear operator f : W V satisfying f(v) | w = v | f (w) for all v V and w W. Published as a conference paper at ICLR 2022 For a Hilbert space V , we set V as the space of continuous linear functionals on V . As mentioned above, by the Riesz representation theorem these linear functionals are given by all v := v| for v V . V is then also a Hilbert space with the scalar product given by v |w = w|v .11 This definition of a dual can be extended to a dual of unitary representations: Definition A.12 (Dual Representation). Let ρ : G U(V ) be a finite-dimensional unitary representation. Then we can define the dual representation ρ : G U(V ) as follows: [ρ (g)](v ) := v ρ(g 1) : V K. (8) If ρl : G U(Vl) is a finite-dimensional irreducible representation, then we write Vl := (Vl) and (ρl) := ρl for a new auxiliary label l . If |ln , n = 1, . . . , dl, are chosen basis elements of Vl, then we choose as the basis of Vl the vectors |l n := ln|, which are the functionals defined by the property len|ln = δenn. Similarly to how we defined a tensor product representation above, one can also define Homrepresentations: Let ρV : G U(V ) and ρW : G U(W) be finite-dimensional unitary representations. Let Hom K(V, W) be defined as the vector space of linear, not necessarily equivariant, functions from V to W. Choosing bases of V and W, this space is isomorphic to Kdim V dim W and thus carries a Euclidean scalar product which makes it a Hilbert space. The Hom-representation is then given by ρHom : G U(Hom K(V, W)) with [ρHom(g)](f) := ρW (g) f ρV (g) 1. Tensor product representations and Hom-representations are related as follows: Proposition A.13. Let ρV : G U(V ) and ρW : G U(W) be finite-dimensional unitary representations. The function ΨHT : V W Hom K(V, W) given by ΨHT : v w 7 v w (v ) := v (v ) w (9) is an isomorphism of unitary representations. The indices HT stand for Hom and Tensor . 
Note that when working with real representations instead of complex representations, the map v 7 v is linear instead of only conjugate linear, and consequently, dual representations are isomorphic to the original representation. In that case, one thus obtains V W = Hom R(V, W). A.4 COMPACT SUBGROUPS AND RESTRICTED REPRESENTATIONS Let in this subsection G G be a compact subgroup of the compact group G . By restriction, one can view any unitary representation of G as one of G, which will become important both in our treatment of the adapted Wigner-Eckart Theorem in Section B and in considerations on the induced representation of a given irrep in Section D: Definition A.14 (Restricted Representation). Let ρ : G U(V ) be a unitary representation of G . We define Res G G V := V and Res G G ρ by Res G G ρ : G U(Res G G V ) = U(V ), g 7 ρ (g). It is called the restricted representation of ρ on G. Crucially, restrictions of irreps of G need not be irreps of G. In analogy to the Clebsch-Gordan coefficients, which emerge when decomposing tensor products, one can define irrep-decomposition coefficients, which emerge when decomposing restrictions of irreps of the group G into irreps of G: Definition A.15 (Irrep-Decomposition Coefficients). Let ρj : G U(Vj ) be an irrep of G . Then there is a decomposition Res G G Vj = M where [jj ] is the multiplicity of irrep Vj of G in Res G G Vj . This multiplicity is zero for all but finitely many j b G. We denote the coupling coefficients between basis elements coming from this decomposition by t, jm|j m , where m dj, m dj , t [jj ]. We call them the irrep-decomposition coefficients. 11Notice the change of order. Published as a conference paper at ICLR 2022 Definition A.16 (Irrep-Decomposition Projections). We denote the projection from Res G G Vj to the t th copy of Vj by IDjj t : Res G G Vj Vj. Is has as matrix coefficients the irrep-decomposition coefficients for fixed t, j, j : t, jm | j m = jm IDjj t j m . (10) Here, ID stands for irrep and decomposition . B A GENERALIZED WIGNER-ECKART THEOREM FOR STEERABLE KERNELS The kernel space solution from this work (Theorem 2.1) will be based on the Wigner-Eckart theorem for steerable kernels from Lang & Weiler (2020). We thereby adapt that theorem in two practically relevant ways, and thus have to reprove the theorem: First, rather than working on a homogeneous space X of G, we work with an arbitrary space X equipped with an action of G. The special case where X is a homogeneous space of G then recovers the result in Lang & Weiler (2020). Second, by decomposing a certain Hom-representation instead of a tensor product representation, we make the computational process easier which finds the harmonic basis functions that build the equivariant kernels. Finally, in Section B.3, we extend the previous result by reusing a G -steerable basis to parameterize a G-steerable kernel space. In this whole section, we freely make use of the concepts and notation defined in Section A. B.1 BASIS-INDEPENDENT FORMULATION OF THE GROUP-RESTRICTED WIGNER-ECKART THEOREM Assumptions We explain the kernel space solution for the following general setting. We assume G is a compact group. Also, K is one of the two fields R or C (with our applications focused on the case R). Let X be a topological Hausdorff space equipped with the Borel σ-algebra, making it a measurable space. We assume X is equipped with a measure µ and a continuous action of G on X. 
Since we want L2 K(X) to be a unitary representation over G, we assume the action of G on X to preserve the measure µ. All these assumptions are, for example, satisfied if X is any homogeneous space of G, or if G is a subgroup of O(n) and X Rn any subset of the Euclidean space that is preserved under the action of G. Finally, ρl : G U(Vl) and ρJ : G U(VJ) are irreducible unitary input and output representations of G. Their dimension is denoted dl and d J, respectively. In this setting, our goal is to determine a basis for the space of G-steerable kernels K : X Hom K(Vl, VJ), which we define as any square-integrable function such that the steerability constraint is satisfied, i.e.: for all g G and x X, the following holds: K(gx) = ρout(g) K(x) ρin(g) 1. Note that, compared to Lang & Weiler (2020), the space X is in general not a homogeneous space of the group G. We denote the space of all these G-steerable kernels by Hom G(X, Hom K(Vl, VJ)). More details and intuitions on this abstract definition of steerable kernels (however, only for the special case where X is a homogeneous space of G) can be found in Lang & Weiler (2020), Appendix C.1. First, note that the space L2 K(X) of square-integrable functions on X carries an action of G as well. Similarly as in Lang & Weiler (2020), the main ingredient of the Wigner-Eckart Theorem is a correspondence between G-steerable kernels on X and kernel operators, i.e., intertwiners, on L2 K(X). For this, recall that both Hom K(Vl, VJ) and L2 K(X) carry a G-representation. Definition B.1 (The Kernel-Operator Correspondence). For any intertwiner K : L2 K(X) Hom K(Vl, VJ), we define its restriction to X by K|X(x) := K(δx), where δx is the Dirac delta function at x X.12 K|X is then a G-steerable kernel. 12Since the Dirac delta function is not actually square-integrable, formally, this is defined by performing a limit over square-integrable functions that more and more resemble the Dirac delta, see Lang & Weiler (2020). Published as a conference paper at ICLR 2022 In the other direction, for any G-steerable kernel K : X Hom K(Vl, VJ), we define the extension to L2 K(X) by b K(f) := Z X f(x)K(x)dµx. This is then an intertwiner. Theorem B.2. Those two operations are inverse to each other, i.e., b K|X = K and d K|X = K. Proof. This follows from the well-known fact that L2 K(X) is isomorphic to its own dual space . First, note that an intertwiner K : L2 K(X) Hom K(Vl, VJ) belongs to Ldl d J L2 K(X) , i.e. to dl d J copies of the dual space of L2 K(X). Conversely, a kernel K : X Hom K(Vl, VJ) belongs to Ldl d J L2 K(X), i.e. to dl d J copies of L2 K(X). For 1 < p < and q = p p 1, there is a natural isomorphism between the spaces Lp(X) and Lq(X) given by the map: Φ : Lq(X) Lp(X) , f 7 ˆf, ˆf(g) = Z X f(x)g(x)dµ(x) Φ 1 : Lp(X) Lq(X), ˆf 7 f, f(x) = ˆf(δx) By choosing p = q = 2 we obtain the isomorphism between L2 K(X) and L2 K(X) . Then, note that the operations c ( ) and ( )|X are respectively equivalent to Φ and Φ 1 applied independently to each of the dl d J subspaces isomorphic to L2 K(X) or L2 K(X) . And finally, note that this isomorphism is still well-defined when restricted to the equivariant kernels and kernel operators, respectively. Note now that, since G is acting as an isometry on X, its action on L2 K(X) is unitary. By Corollary 1, it follows that this actions decomposes into a direct sum of irreps of G. 
It follows that L2 K(X) also decomposes into a direct sum of invariant subspaces L2 K(X) = c L j ˆ Gc Lmj i V i j , where mj is the multiplicity of the irrep ρj ˆG and each V i j = Vj is acted on by G through ρj ˆG. We denote a basis for V i j with {Y m ji : X K}dj m. Note that mj may be infinite, even uncountably large, since we do not assume X to be a homogeneous space of G. Definition B.3 (Steerable Basis). The union of {Y m ji : X K | j ˆG, i mj, m dj} forms a basis for L2 K(X). We call this basis steerable since the action of G on a function L2 K(X) is realized just by recombination of the linear coefficients used to expand this basis (via its action on each invariant subspace through its irreps). Note that this definition is compatible with the one in Freeman & Adelson (1991). Similar to the harmonic projection in Def. A.8, we define an orthogonal projection on the invariant subspaces of L2 K(X): Definition B.4 (Steerable Basis Projections). We denote the orthogonal projection from L2 K(X) to the i th copy of Vj by pji : L2 K(X) Vj. On the steerable basis, it is given by pji Y e m ejei = δjejδiei |j em , which just follows from the fact that the steerable basis forms an orthonormal basis of L2 K(X), which is then mapped to the corresponding orthonormal basis of Vj, the representation space of ρj ˆG. With all this in mind, we can formulate our basis-independent Wigner-Eckart Theorem for G-steerable kernels: Theorem B.5 (Wigner-Eckart Theorem for Steerable Kernels). The basis-independent version of the Wigner-Eckart theorem consists of two parts: 1. There is an isomorphism of vector spaces s=1 End G,K(Vj) Hom G(X, Hom K(Vl, VJ)), Published as a conference paper at ICLR 2022 where the hat c ( ) denotes a topological closure and can be ignored by the reader. 2. Explicitly, GKer is given by GKer((cjis)jis) = X s=1 ΨHT ij(l J) s cjis pji|X. Proof. For the first statement, we observe: Hom G(X, Hom K(Vl, VJ)) (1) =Hom G,K L2 K(X), Hom K(Vl, VJ) i=1 Hom G,K(Vj, Vl VJ) s=1 End G,K(Vj) In (1), we perform the linear extension to the space of square integrable functions from Definition B.1, using Theorem B.2. In (2), we use the Peter-Weyl Theorem A.6, and the isomorphism ΨHT which we defined in Proposition A.13. Note that the decomposition from the Peter-Weyl Theorem is G-equivariant, so this step is valid. In (3), we use the Clebsch-Gordan decomposition from Definition A.9, and Schur s Lemma which shows that Vj can only have a nontrivial homomorphism to Vj, and not to any other irrep. From right to left, we call the isomorphism GKer. This proves the first statement. For the second statement, we go through the sequence of isomorphisms from bottom to top and trace back where an arbitrary endomorphism tuple comes from : GKer : (cjis)jis (3) 7 s=1 ij(l J) s cjis s=1 ΨHT ij(l J) s cjis pji s=1 ΨHT ij(l J) s cjis pji|X. This finishes the proof of the second statement, and thus of the theorem. B.2 STEERABLE KERNEL BASES After the basis-independent version of the theorem, we now state how a basis for the space of steerable kernels can be constructed. The basis elements of the representation space Vl, VJ and Vj are denoted |ln , |JM and |jm , respectively, where the indices range in 0 n dl, 0 M d J, and 0 m dj. Sometimes, we write also Y M J instead of |JM , etc., in order to remind to the connection with harmonic basis functions. Theorem B.6 (Steerable Basis Kernels). 
A basis of the space of G-steerable kernels Hom G(X, Hom K(Vl, VJ)) is given by basis kernels Kjisr j ˆG, i mj, s [j(l J)], r Ej , which are given as follows: Kjisr = GKer(cjisr), where cjisr is zero at every entry except at position jis, where it takes on value cj r.13 13Remember that cj r is the r th basis endomorphism of ρj. Published as a conference paper at ICLR 2022 Furthermore, the matrix-elements of such a basis kernel with respect to orthonormal bases |ln of Vl and |JM of VJ, where M d J and n dl, are given by: JM|Kjisr(x)|ln s, j em|l n; JM j em cj r jm i, jm|x (11) where the overline denotes complex conjugation, and with i, jm|x := Y m ji (x) for steerable basis functions Y m ji : X K. Proof. That these kernels form a basis follows from the fact that the cjisr obviously form a basis of endomorphism tuples, and that GKer is an isomorphism by Theorem B.5. Concretely, this theorem then shows that Kjisr = ΨHT ij(l J) s cj r pji|X. We write [ Kjisr for the extension from Definition B.1, which is just given by [ Kjisr = ΨHT ij(l J) s cj r pji. Furthermore, we expand the Dirac delta function δx for x X in the steerable basis of L2 K(X) as D Y e m ejei δx E Y e m ejei . This expansion can be justified with a similar approximation procedure as in the proof of Lang & Weiler (2020), Theorem D.13. Thereby, we will in the following write the coefficients as D ei,ej em x E := D Y e m ejei δx E , which is also equal to Y e m ejei (x). With these tricks, we obtain: JM|Kjisr(x)|ln (1) = D JM [ Kjisr(δx) ln E D ei,ej em x E D JM ΨHT ij(l J) s cj r pji (Y e m ejei ) ln E m=1 i, jm|x D JM ΨHT ij(l J) s cj r (Y m j ) ln E m=1 i, jm|x j em cj r jm D JM ΨHT ij(l J) s (Y e m j ) ln E D JM ΨHT ij(l J) s (Y e m j ) ln E dj X j em cj r jm i, jm|x We first explain the steps so far and will then deal with the last term. Step (1) uses the extension of K from X to L2 K(X). Step (2) uses the expansion of δx explained before. Step (3) uses the property of the projection discussed in Definition B.4 and a renaming of em to m. Step (4) expands cj r(Y m j ) in terms of the basis vectors Y e m j , with the expansion coefficients being the matrix elements of cj r. Step (5) is then just a reordering of the terms. Published as a conference paper at ICLR 2022 Now we deal with the first factor in the last term: D JM ΨHT ij(l J) s (Y e m j ) ln E D l en; J f M ij(l J) s j em E D JM ΨHT l en; J f M ln E D s, j em l en; J f M E D JM ( len|)|J f M D s, j em l en; J f M E JM len|ln J f M D s, j em l en; J f M E δnenδM f M (e) = s, j em|l n; JM . In step (a), we expand ij(l J) s (Y e m j ) in the basis of Vl VJ. In step (b), we use the definition of the Clebsch-Gordan coefficients using the embedding ij(l J) s and of ΨHT given in Proposition A.13. Also, the definition of the dual basis given after Definition A.12 is used as the equality |l en = len|. In step (c), we use eq. 9. Step (d) and (e) are clear. Plugging this intermediate result into our earlier computation finishes the proof. We now formulate a matrix-version of this result that is more suitable to implementation. We thereby identify Kjisr(x) as a matrix in Kd J dl, for chosen bases |JM in VJ and |ln in Vl. I.e., Kjisr(x) is viewed as the matrix with coefficients JM|Kjisr(x)|ln . Furthermore, we identify cj r with the matrix in Kdj dj with coefficients j em cj r jm . Additionally, with Yji(x) we mean the column vector in Kdj with entries i, jm|x . Finally, with CGj(l J) s we mean the matrix of Clebsch-Gordan coefficients s, j em l m; JM of shape dj (dl d J). 
Its conjugate transpose is denoted as h CGj(l J) s i , which is of shape (d J dl) dj. When using it in matrix multiplications, it is interpreted as having d J dl rows and dj columns. This is compatible with the vectorization of steerable kernels which was discussed in the main paper. Theorem B.7 (Matrix-Version of Steerable Basis Kernels). The basis kernel evaluated at x, Kjisr(x), is given by the matrix Kjisr(x) = h CGj(l J) s i cj r Yji(x) (12) where each dot means conventional matrix multiplication. Proof. We do the sanity check, i.e., we test whether the resulting matrix is well-defined and of shape d J dl. This can be verified by observing that the shape of the right-hand-side of eq. 12 is: [(d J dl) dj] [dj dj] [dj] = [d J dl]. The full proof of eq. 12 follows directly from eq. 11. Remark B.8. Note that if X is a homogeneous space for G, then the steerable basis {Y m ji }jim is the harmonic basis in Theorem A.6. Additionally, if all endomorphism spaces are 1-dimensional, which is for example always the case for complex representations, then cj r can be chosen to be the identity matrix and thus completely omitted from the formula. Additionally, if one is concerned with real representations, then l = l and the complex conjugation in the Clebsch-Gordan coefficients can be omitted, which leads to the following formula which we implement in this work: Kjisr(x) = h CGj(l J) s i T cj r Yji(x). (13) For the special case that G = SO(3), X = S2 and K = R, one obtains the basis kernels Kj(x) = CGj(l J) T Yj(x), Published as a conference paper at ICLR 2022 where for |l J| j l + J, the Yj : S2 R2j+1 are the spherical harmonics. This result was derived in Weiler et al. (2018a). Many other examples for X homogeneous space of G, with a tensor-product decomposition instead of a Hom-space decomposition, can be found in Lang & Weiler (2020). B.3 A WIGNER-ECKART THEOREM FOR GROUP-RESTRICTED STEERABLE KERNELS In this section, we consider a special case of the previous theorem where X carries an action of a compact group G , with G G . We assume this action to extend the action of G on X. This result is useful to reuse G -steerable kernels to parameterize new G-steerable spaces without designing a new basis for L2 K(Rn). First of all, we denote by ˆG a set of representatives of the irreps of G and by {Y m j i |j ˆG , i mj , m dj } a G -steerable basis for X. Recall also the definition of group restriction and irrep decomposition coefficients from Definition A.15. In particular, [jj ] is the multiplicity of ρj ˆG inside Res G G ρj , with j ˆG , and IDjj t is the projection from Res G G Vj to the t th copy of Vj (Definition A.16). Proposition B.9. The G -steerable basis for X given by {Y m j i |j ˆG , i mj , m dj } can be turned into a G-steerable basis defined as: n Y m j(i j t) = h ID[jj ] t Yj i im |j ˆG , i mj , j ˆG, t [jj ], m dj o Proof. Because the space X carries an action of G , we can find the action of G on L2 K(X) via group restriction: Res G G L2 K(X) = Res G G d M j ˆ G d Mmj j ˆ G d Mmj i =1 Res G G Vj j ˆ G d Mmj j ˆ G d M[jj ] We can plug this basis in Theorem B.7 to obtain the following result: Theorem B.10 (Matrix-Version of Group-Restricted Steerable Basis Kernels). The basis kernel evaluated at x, Kj i jtsr(x ), is given by the matrix Kj i jtsr(x) = h CGj(l J) s i cj r IDjj t Yj i (x) (14) where each dot means conventional matrix multiplication. B.4 EXTENDED EXAMPLES OF STEERABLE KERNEL BASES In this section, we provide an extended version of the two examples in Section 2.2. 
In the following examples we will consider the space X = R2 and the two groups SO(2) and C4, i.e. the group of all planar rotations and the group of rotations by multiples of π/2 radians. SO(2)-Steerable Kernels First of all, recall the (real) irreducible representations of G = SO(2). For rθ SO(2): ρj(rθ) = cos j θ sin j θ sin j θ cos j θ where j is a non-negative integer which can be interpreted as the rotational frequency. All irreps are 2 dimensional but for the frequency j = 0, which is 1 dimensional. We consider the basis B = {Yji}ji Published as a conference paper at ICLR 2022 generated by circular harmonics combined with a Gaussian radial profile, as in Worrall et al. (2017); Weiler et al. (2018b); Weiler & Cesa (2019), defined as: Yji(r, φ) = ωRi(r) cos(j φ) sin(j φ) where ωRi : R R is a Gaussian radial kernel centered around Ri R, defining a ring of radius Ri. Hence, in this basis, i indexes the circular shells, while j indexes the angular frequencies along each ring. For simplicity, in the following examples we will only consider a single ring, indexed by i = 1. This setting is similar to the X = S1 examples discussed in Section 2.2, since each ring is isomorphic to a circle S1. One can verify that the basis elements Yji in B are G = SO(2)-steerable using the irrep ρj, i.e. Yji(r, φ + θ) = ρj(rθ)Yji(r, φ) . To apply Theorem 2.1, we need a basis for the endomorphism space of each irrep and the irreps decomposition between their tensor-products. If j > 0, ρj is a real irrep of complex type so its endomorphism space is 2-dimensional and is spanned by14: cj 1 = 1 0 0 1 , cj 2 = 0 1 1 0 If J = 0, ρ0 ρl = ρl and if l = 0, ρJ ρ0 = ρJ. For J, l > 0, ρJ ρl = ρ|J l| ρJ+l if J = l or ρ0 ρ0 ρJ+l otherwise. In particular, in the first case, using some trigonometric identities, one can verify that 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 satisfies ρJ ρl = [CGl J]T ρ|J l| ρJ+l CGl J. In this case, CGj(l J) 1 = 1 1 0 0 1 0 1 1 0 j = |J l| and CGj(l J) 1 = 1 1 0 0 1 0 1 1 0 if j = J + l. Thus, for l = J > 0, Theorem 2.1 prescribes a basis containing the following elements (for a single ring at radius Ri): κ|J l|,i,1,1(r, φ) = ωRi(r) cos(|J l|φ) sin(|J l|φ) sin(|J l|φ) cos(|J l|φ) κ|J l|,i,1,2(r, φ) = ωRi(r) sin(|J l|φ) cos(|J l|φ) cos(|J l|φ) sin(|J l|φ) κJ+l,i,1,1(r, φ) = ωRi(r) cos((J + l)φ) sin((J + l)φ) sin((J + l)φ) cos((J + l)φ) κJ+l,i,1,2(r, φ) = ωRi(r) sin((J + l)φ) cos((J + l)φ) cos((J + l)φ) sin((J + l)φ) Observe that these 4 basis kernels are identical to the ones in Table 8 of Weiler & Cesa (2019), up to vectorization and a minus sign for κ|J l|,i,1,2 and κJ+l,i,1,2. Fig. 2 shows the two kernels among these four obtained with j = |J l| = 1, in the case l = 1 and J = 2. For l = 0, j = J and CG0J 14Observe that we expressed ρj in a basis such that is has the form described in Lemma C.13, which implies the endomorphism space is spanned by the basis described in Section C.3. Published as a conference paper at ICLR 2022 is the identity matrix; then, for J > 0, Theorem 2.1 prescribes the following basis elements: κJ,i,1,1(r, φ) = ωRi(r) cos(Jφ) sin(Jφ) κJ,i,1,2(r, φ) = ωRi(r) sin(Jφ) cos(Jφ) which are also shown in Fig. 3. C4-Steerable Kernels We construct a G = C4-steerable basis through Eq. 5 by considering the G = SO(2)-steerable basis described before in Eq. 15. First, recall the representation theory of G = C4. For p {0, 1, 2, 3}, rp π 2 )p C4; C4 has the following three irreps: 2 ) = cos p π 2 ) = ( 1)p which are, respectively, 1, 2 and 1 dimensional. 
Again, we index the irreps by j {0, 1, 2}, which can be interpreted as the rotational frequency. To apply Eq. 5, we need to know the irreps decomposition of Res SO(2) C4 ρj , for each irrep ρj of SO(2). We have, for t N: Res SO(2) C4 ρ0 = ρ0 Res SO(2) C4 ρ4t = ρ0 ρ0 Res SO(2) C4 ρ4t+1 = ρ1 Res SO(2) C4 ρ4t+2 = ρ2 ρ2 Res SO(2) C4 ρ4t+3 = 1 0 0 1 so, IDj = 1 0 0 1 if j = 4t + 3 and IDj is the identity otherwise. Thus, if B = {Yj i }j i is the SO(2)-steerable basis described before in Eq. 15, according to Eq. 5, a C4-steerable basis B contains the following elements for each t N: Y0,(i ,0,1)(r, φ) = Y0,i (r, φ) = ωRi (r), Y0,(i ,4t,1)(r, φ) = Y 1 4t,i (r, φ) = ωRi (r) cos(4tφ) Y0,(i ,4t,2)(r, φ) = Y 2 4t,i (r, φ) = ωRi (r) sin(4tφ), Y1,(i ,4t+1,1)(r, φ) = Y4t+1,i (r, φ) = ωRi (r) cos((4t + 1)φ) sin((4t + 1)φ) Y2,(i ,4t+2,1)(r, φ) = Y 1 4t+2,i (r, φ) = ωRi (r) cos((4t + 2)φ), Y2,(i ,4t+2,2)(r, φ) = Y 2 4t+2,i (r, φ) = ωRi (r) sin((4t + 2)φ), Y1,(i ,4t+3,1)(r, φ) = 1 0 0 1 Y4t+3,i (r, φ) = ωRi (r) cos((4t + 3)φ) sin((4t + 3)φ) The endomorphism space of ρ0 and ρ2 is one dimensional and is spanned by the identity, while the endomorphism space of ρ1 is spanned by c1 1 = 1 0 0 1 , c1 2 = 0 1 1 0 The tensor products decompose as follows: ρ0 ρj = ρj ρ0 = ρj ρ2 ρ2 = ρ0 ρ1 ρ2 = ρ2 ρ1 = 1 0 0 1 ρ1 ρ1 = [CG11]T (ρ0 ρ0 ρ2 ρ2) CG11 Published as a conference paper at ICLR 2022 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 0 whose rows are respectively denoted CG0(11) 1 , CG0(11) 2 , CG2(11) 1 and CG2(11) 2 . Finally, Theorem 2.1 prescribes the following basis elements for each t N and each i : l, J {0, 2}: κj,(i ,4t+j,1),1,1(r, φ) = ωRi (r) cos((4t + j)φ) κj,(i ,4t+j,2),1,1(r, φ) = ωRi (r) sin((4t + j)φ) where j = |l J| {0, 2}. l = 0 and J = 1 (or l = 1 and J = 0): κ1,(i ,4t+1,1),1,1(r, φ) = ωRi (r) cos((4t + 1)φ) sin((4t + 1)φ) κ1,(i ,4t+1,1),1,2(r, φ) = ωRi (r) sin((4t + 1)φ) cos((4t + 1)φ) κ1,(i ,4t+3,1),1,1(r, φ) = ωRi (r) cos((4t + 3)φ) sin((4t + 3)φ) κ1,(i ,4t+3,1),1,2(r, φ) = ωRi (r) sin((4t + 3)φ) cos((4t + 3)φ) l = 2 and J = 1 (or l = 1 and J = 2): κ1,(i ,4t+1,1),1,1(r, φ) = ωRi (r) cos((4t + 1)φ) sin((4t + 1)φ) κ1,(i ,4t+1,1),1,2(r, φ) = ωRi (r) sin((4t + 1)φ) cos((4t + 1)φ) κ1,(i ,4t+3,1),1,1(r, φ) = ωRi (r) cos((4t + 3)φ) sin((4t + 3)φ) κ1,(i ,4t+3,1),1,2(r, φ) = ωRi (r) sin((4t + 3)φ) cos((4t + 3)φ) l = 1 and J = 1: κ0,(i ,4t,1),1,1(r, φ) = ωRi (r) [cos(4tφ) 0 0 cos(4tφ)]T , κ0,(i ,4t,1),2,1(r, φ) = ωRi (r) [0 cos(4tφ) cos(4tφ) 0]T , κ0,(i ,4t,2),1,1(r, φ) = ωRi (r) [sin(4tφ) 0 0 sin(4tφ)]T , κ0,(i ,4t,2),2,1(r, φ) = ωRi (r) [0 sin(4tφ) sin(4tφ) 0]T , κ2,(i ,4t+2,1),1,1(r, φ) = ωRi (r) [cos((4t + 2)φ) 0 0 cos((4t + 2)φ)]T , κ2,(i ,4t+2,1),2,1(r, φ) = ωRi (r) [0 cos((4t + 2)φ) cos((4t + 2)φ) 0]T , κ2,(i ,4t+2,2),1,1(r, φ) = ωRi (r) [sin((4t + 2)φ) 0 0 sin((4t + 2)φ)]T , κ2,(i ,4t+2,2),2,1(r, φ) = ωRi (r) [0 sin((4t + 2)φ) sin((4t + 2)φ) 0]T Fig 4 shows the basis kernels for l = 0 and t = 0. For simplicity, in Fig 4, we use k Z rather that t N to index the basis; this difference emerges from the fact that ρj is isomorphic to ρ1 when restricted from SO(2) to C4 for any j = |1 + 4k|, but k > 0 and k < 0 require a different change of basis (see 4t + 1 and 4t + 3 above). Observe also that these basis kernels span the same space of the ones in Table 11 of Weiler & Cesa (2019), for N = 4, up to vectorization. Published as a conference paper at ICLR 2022 C REAL VALUED REPRESENTATIONS In this section, we discuss the representation theory of compact groups on the real field R. 
Usually in representation theory, one assumes complex-valued signals in L2 C(G) and complex representations of G. However, in this work, we implement equivariant CNNs using real representations, and therefore must discuss general issues arising in this case. When using real representations, the theory becomes more difficult: while the matrix coefficients of the irreducible representation span the space of real square-integrable functions L2 R(G), they are not necessarily orthogonal to each other anymore. More precisely, the matrix coefficients belonging to the same irrep might be linearly dependent, while different irreps still contain orthogonal coefficients. This is a consequence of the fact that the strong version of Schur s Lemma stating that the endomorphism space of irreps are one dimensional does not hold for representations over R. As before, we freely make use of the preliminaries from Section A. In particular, recall that for real representations, we talk about orthogonal representations instead of unitary representations, and such representations take values in the orthogonal group O(V ) of some real Hilbert-space V . In Section C.1, we introduce some properties of real irreducible representations and characterize their endomorphism space. In Section C.2, we show that any real irrep can always be expressed with respect to a convenient basis such that a set of linearly independent matrix coefficients are contained in a subset of the columns of the irrep. This will be particularly useful to construct real harmonic bases in Section D and, in particular, in Corollary 5. Additionally, we describe a method to find such a basis for an arbitrary real irrep. Finally, in Section C.3, we show an example of such basis, which makes the result more intuitive. Additionally, we use this specific basis to simplify computations in Section F. C.1 PRELIMINARIES ON THE STRUCTURE OF End G,R(Vψ) Fix an orthogonal irreducible real representation ψ : G O(Vψ) of the compact group G. Let End G,R(Vψ) be its endomorphism algebra. It is well-known that this endomorphism algebra is isomorphic (in a sense which will be made precise in the following proposition) to one of the division algebras R, C, and H, i.e., the real numbers, complex numbers, or quaternions, see Bröcker & Dieck (2003), Theorem 6.7. Let K = End G,R(Vψ) be this division algebra. Let Eψ = dim End G,R(Vψ) its dimension. Let i1 = 1, i2 = i, i3 = j and i4 = k with 1, i, j, k H the usual basis. It has the characterizing properties i2 = j2 = k2 = 1, that 1 is a multiplicative identity, and that ij = k = ji, jk = i = kj, and ki = j = ik. Then K has as an orthonormal basis {il}Eψ l=1. We now make the properties of the isomorphism precise: Proposition C.1. There is K {R, C, H} such that there is a function Φ : K End G,R(Vψ) with the following properties: 1. Φ(x + y) = Φ(x) + Φ(y) for all x, y K. 2. Φ(λx) = λΦ(x) for all λ R and x K. 3. Φ(xy) = Φ(x) Φ(y) for all x, y K. 4. Φ(1) = id Vψ. 5. Φ is bijective. 6. Φ(ik) O(Vψ) for all k {1, . . . , Eψ}. Proof. Properties 1 to 5 are the properties of an R-algebra isomorphism. Such an R-algebra isomorphism Φ exists by Bröcker & Dieck (2003), Theorem 6.7. We now show that this automatically already implies property 6 as well: Φ(il) End G,R(Vψ) is a non-zero endomorphism, and thus, by Schur s Lemma for orthogonal representations (see Lang & Weiler (2020), Lemma B.29), there exists a coefficient µl R such that µlΦ(il) O(Vψ). 
We Published as a conference paper at ICLR 2022 deduce: µlΦ(il) µlΦ(il) = µ2 l Φ(il il) = µ2 l Φ(1) = µ2 l id Vψ . Together with µlΦ(il) µlΦ(il) O(Vψ) we necessarily have µ2 l { 1}, and thus µl = 1. Thus, Φ(il) = µlΦ(il) O(Vl), which shows 6. Definition C.2. Let ψ be a real irrep. Depending on its endomorphism algebra End G(ψ), ψ is classified in one of the following three categories: real type: if End G(Vψ) = R complex type: if End G(Vψ) = C quaternionic type: if End G(Vψ) = H From now on, fix the isomorphism Φ : K End G,R(Vψ) and denote Il := Φ(il). For simplifying the notation, we set dψ := dimψ := dim Vψ. From now on, we assume for simplicity that Vψ = Rdψ, which holds up to isomorphism. The scalar product on Vψ is then just given by v | w = v T w, and Vψ has the standard basis {e1, . . . , edψ} with ei having a 1 at position i and being zero elsewhere. For any A End G,R(Vψ), we define AT End G,R(Vψ) as the transpose, which is also the unique matrix and linear function AT : Vψ Vψ such that Av | w = v | AT w for all v, w Vψ. That AT is also an endomorphism is for the fact that ψ is an orthogonal representation. We define the following inner product on End G,R(Vψ): dψ Tr(ABT ), (16) where Tr is the trace of a matrix, given by the sum of diagonal elements. The trace is linear and continuous and has the property Tr(AB) = Tr(BA), which results in the trace of a matrix being independent of the basis in which it is expressed. Lemma C.3. We have the following: 1. Il = I 1 l for l > 1. 2. [Il]ii = 0 for l > 1 and i {1, . . . , dψ}. 3. Il | Il = δll for all l. Proof. 1. This follows from Il Il = Φ(i2 l ) = Φ(1) = id Vψ for all l > 1. 2. From Proposition C.1 we know that Il O(Vψ) for all l. So, I 1 l = IT l . For l > 1, we deduce Il = IT l from 1, i.e., Il + IT l = 0. It follows [Il]ii = 1 2[Il + IT l ]ii = 0. 3. Note that there is some l {1, . . . , Eψ} such that the following holds: Il | Il = 1 dψ Tr Il IT l dψ Tr Φ(il) Φ(il ) 1 dψ Tr Φ(il i 1 l ) dψ Tr Φ( il ) Published as a conference paper at ICLR 2022 Thereby, we made use of Il O(Vψ) in step 2, and used the multiplication properties of the quaternions in the second to last step. Note that il = 1 if and only if l = l , and that in this case, we have Il = Φ(il ) = id Vψ, leading to Il|Il = 1. If, however, l > 1 (i.e., l = l ), then we have [Il ]ii = 0 for all i by part 2, and thus we get a zero trace, proving the result. Corollary 2. The function Φ : K End G,R(Vψ) is a unitary isomorphism, meaning: When considering K as an R-vector space with standard inner product | and considering End G,R(Vψ) as an R-vector space with inner product given by eq. 16, then we have for all x, y K Φ(x) | Φ(y) = x | y . Furthermore, {I1, . . . , IEψ} forms an orthonormal basis of End G,R(Vψ). Proof. First, note that eq. 16 clearly defines a positive-definite inner product on End G,R(Vψ). Write x = PEψ l=1 xlil and y = PEψ l =1 yl il . Using Lemma C.3, part 3, we obtain: Φ(x) | Φ(y) = l =1 xlyl Il | Il = x | y The last statement that the Il form an orthonormal basis of End G,R(Vψ) follows from Lemma C.3 part 3 as well, together with the fact that Φ(il) = Il and that Φ is an isomorphism. For x = PEψ l=1 xlil K, the conjugation x is defined by Furthermore, we define the norm W := p W | W for W End G,R(Vψ). Proposition C.4. We have the following: 1. For x K, we have Φ(x) = Φ(x)T . 2. For any W End G,R(Vψ), we have WW T = W 2 id Vψ. Proof. 1. We have l=2 xl IT l In the third step, we used I1 = id Vψ and IT l = Il, which was shown in Lemma C.3 together with Il O(Vl). 
Published as a conference paper at ICLR 2022 2. This follows from the well-known fact that xx = x 2 for x K and part 1. C.2 ORTHOGONALITY AND ALIGNMENT OF MATRIX COEFFICIENTS The investigations here on the orthogonality relations of matrix coefficients of irreducible real representations generalize what s presented in Knapp (2002) for complex representations to real representations. The results thereby become a bit more involved. Remember that Vψ = Rdψ has the standard basis {e1, . . . , edψ}. Define Eij := eie T j Rdψ dψ. Furthermore, define G ψ(g)Eijψ(g) 1dg = Z G ψ(g)Eijψ(g)T dg. Remember that ψ : G O(Vψ) = O(Rdψ) Rdψ dψ takes values in dψ dψ-matrices. For k, j {1, . . . , dψ}, define ψkj : G R as the function ψkj L2 R(G) given by ψkj(g) := ψ(g)kj. Remember that we have a scalar product on L2 R(G) given by f | h := R G f(g)h(g)dg, which means we can evaluate also scalar products of different matrix coefficients. Lemma C.5. We have the following: 1. Oij End G,R(Vψ). 2. Oij = 1 dψ PEψ l=1[Il]ij Il . 3. Oij = ψk1i | ψk2j dψ k1,k2=1. Proof. We prove the three statements as follows: 1. From the integration-measure on G being a Haar measure, i.e., left-invariant, it can easily be checked that Oijψ(g ) = ψ(g )Oij for all g G. 2. From Oij End G,R(Vψ) and the fact that the Il form an orthonormal basis of the space End G,R(Vψ) by Corollary 2, we obtain Oij = PEψ l=1 Il | Oij Il. We now determine the coefficients: Il | Oij = Oij | Il dψ Tr Oij IT l G ψ(g)Eijψ(g) 1IT l dg G Tr ψ(g)Eij IT l ψ(g) 1 dg G Tr Eij IT l dg dψ Tr Eij IT l In (1), we use that Il is an endomorphism, i.e., commutes with ψ(g). Additionally, we use that Tr is linear and continuous, and thus commutes with integrals. Step (2) uses that the trace satisfies Tr(AB) = Tr(BA). In the last step, we make use of Eij = eie T j . Published as a conference paper at ICLR 2022 G ψ(g)Eijψ(g) 1dg G ψ(g)eie T j ψ(g)T dg ψ1i(g) ... ψdψi(g) ψ1j(g) . . . ψdψj(g) dg ψk1i(g)ψk2j(g) dψ k1,k2=1dg = ψk1i | ψk2j dψ k1,k2=1. Corollary 3. We have ψk1j | ψk2j = δk1k2 dψ for all k1, k2, j {1, . . . , dψ}. In other words, the matrix coefficients in the same column are orthogonal to each other, and they are normalized up to a constant factor: p dψ ψkj = 1. Proof. Remember that in Lemma C.3 we showed [Il]jj = 0 for l > 1 and that I1 = id Vψ. Together with Lemma C.5, we obtain: ψk1j | ψk2j = Ojj l=1 [Il]jj Il dψ [I1]jj [I1]k1k2 The statement about the normalization is clear. Lemma C.6. For any p, q {1, . . . , Eψ} and for any v Vψ, Ipv|Iqv = δpq v|v . Proof. Note that there is a l such that IT p Iq = Il. If p = q, IT p Ip = I1 = id Vψ is the identity matrix, hence v T IT p Iqv = v T v. Otherwise, l > 1 and Il = IT l is a skew symmetric matrix, hence v T Ilv = 0. Indeed, note that, for any matrix A, it holds that v T Av = v T AT v. If A is a skew symmetric matrix (AT = A), it follows that v T Av = 0. Lemma C.7. For each column j, the set of matrices {Oij}i spans the space End G(Vψ), which is an Eψ dimensional space. Proof. To prove this, it is sufficient to look at the set of coefficient vectors of each of these matrices when expressed with respect to the basis {Il}l of End G(Vψ). If this set of coefficient vectors spans an Eψ dimensional space, the proof is complete. Because Oij = 1 dψ P l[Il]ij Il, let s consider the set of vectors {([Il]ij)l}dψ i=1 and stack them in the rows of a matrix M Rdψ Eψ. The set of vectors spans an Eψ dimensional space if and only if M has row rank or, equivalently, column rank Eψ. 
Because M has Eψ nonzero columns (namely, the l th column of M is the j th column of Il, which is non-zero since Il is invertible), then it has column rank equal to Eψ if and only if all its columns are pairwise orthogonal. The orthogonality of the columns of M follows from Lemma C.6. Indeed, note that the l-th column of M contains the j-th column of Il. By choosing v = ej in Lemma C.6, it follows that the column j of two different basis elements Il1 and Il2 are orthogonal. Thus, the columns of M are orthogonal. Published as a conference paper at ICLR 2022 From Lemma C.7, and since End G(Vψ) contains only the zero-map and isomorphisms, it follows that, for any j, the set of vectors {Oijej}dψ i=1 spans an Eψ dimensional vector space. We will show that this means that there is a change of basis B Rdψ dψ and a corresponding irrep eψ = BT ψB isomorphic to ψ such that, for any j, {g Oijej}dψ i=1 contains only Eψ non-zero vectors, where g Oij := R G eψ(g)Eij eψ(g)T dg. We give a constructive proof now. First, we build one such matrix B starting from the basis elements {Il}Eψ l=1. Then, we show that for eψ = BT ψB, for each j, the matrix g Oij is non-zero precisely for Eψ different values of i. Lemma C.8. For any vectors v, w Rdψ, if Ipv|w = 0 for all p, then Ipv|Iqw = 0 for any p and q. Proof. For any p and q, there exists l such that Ip IT q = Il. Then Ipv|Iqw = v T IT p Iqw = v T IT l w = Ilv|w which is zero by assumption. To construct the matrix B, we use the following algorithm. Let B = Rdψ the empty set. Repeat the following process dψ/Eψ times:15 Take a normalized vector v Rdψ which is orthogonal to all vectors in B. Add the vectors {Ilv}Eψ l=1 to B. Note that all vectors {Ilv}Eψ l=1 are orthogonal to each other because of Lemma C.6. Note that if B contains the vectors {Ilw}l and we choose a vector v such that it is orthogonal to all {Ilw}l, then, due to Lemma C.8, the updated set B still contains only pairwise orthogonal vectors. The matrix B is built by stacking horizontally the vectors in B. Note that since all v in the process are normalized, and the Ilv are normalized as well by Lemma C.6, the matrix is orthogonal, i.e. BT B = BBT = id. By construction, for any l, the left multiplication of B by Il only results in a permutation of the columns of B (and, potentially, flipping their sign). In particular, Il only permutes the vectors in B belonging to the same group {Ilv}l. Hence, there exists a permutation matrix Pl (with, potentially, negative entries) such that Il B = BPl. Let, as mentioned before, eψ(g) := BT ψ(g)B, i.e., eψ is isomorphic to ψ via B. Also, let e Oij = R G eψ(g)Eij eψ(g)T dg. Lemma C.9. Let ψ be a real irrep and B a matrix computed as above. Fix j {1, . . . , dψ}. Then, the following holds: 1. Fix also i {1, . . . , dψ}. Then e Oij = 0 if and only if there is a unique l {1, . . . , Eψ} such that [Pl]ij = 0. In that case, e Oij = 1 dψ [Pl]ij Pl. 2. The matrix e Oij is non-zero for precisely Eψ different values of i. 15The process itself will reveal that dψ/Eψ is a whole number. Published as a conference paper at ICLR 2022 Proof. 1. Let Bi and Bj, respectively, the i-th and the j-th columns of B. G BT ψ(g)BEij BT ψ(g)T Bdg G ψ(g)Bi BT j ψ(g)T dg B ab Bai Bbj Eab a,b Bai Bbj BT Oab B a,b Bai Bbj[Il]ab l [BT Il B]ij BT Il B l [Pl]ij Pl, where in the last step we used that B has the property BT Il B = BT BPl = Pl, where Pl has the structure of a signed permutation matrix. Thus, if e Oij = 0, then there exists l such that [Pl]ij = 0. 
For the other direction, assume that there is such an l. We want to show that there cannot be another k with [Pk]ij = 0. If we can do that, then it follows e Oij = 1 dψ [Pl]ij Pl = 0 and we are done. To do so, first observe that Plej|Pkej = δlk. Indeed, Plej|Pkej = e T j P T l Pkej = e T j BT IT l BBT Ik Bej = e T j BT IT l Ik Bej. Now, note that there exists a m such that IT l Ik = Im. In particular, if l = k, IT l Il = I1 = id and, therefore, e T j BT IT l Ik Bej = e T j ej = 1. Otherwise, e T j BT IT l Ik Bej = e T j BT Im Bej = 0, thanks to Lemma C.6. Now, assume we have [Pl]ij = 0 = [Pk]ij. Then, since Pl and Pk are signed permutation matrices, all other entries in column j are zero, meaning that 0 = [Pl]ij [Pk]ij = X i [Pl]ij [Pk]ij = Plej | Pkej = δkl, meaning that k = l. 2. For l = 1, . . . , Eψ, let il {1, . . . , dψ} be the unique index such that [Pl]ilj = 0. It follows e Oilj = 0 from part 1. We also know from part 1 that all il are pairwise different, so we found Eψ such indices. Now, assume i is such that e Oij = 0. Then, from part 1 again, we get that there exists l such that [Pl]ij = 0, and thus i = il. That finishes the proof. As a last classical result, we will use the Cauchy-Schwartz inequality, which is well-known and therefore without a proof: Lemma C.10. Let V be any Hilbert space with scalar product | . Let v, w V . Then one has | v | w | v w , with equality if and only if v and w are linearly dependent. Published as a conference paper at ICLR 2022 Theorem C.11. Let ψ be a real irrep. Then, there exists a real irrep eψ, isomorphic to ψ, such that the following holds. Fix j {1, . . . , dψ}: 1. The matrix coefficients in column j of eψ are orthogonal to all matrix coefficients of eψ in all but precisely Eψ columns i. 2. If i is one of the remaining Eψ columns, then eψei is a signed permutation of eψej. Proof. Let B and eψ = BT ψB as before. We know from Lemma C.9 that there are precisely Eψ values for i such that e Oij = 0. 1. Assume e Oij = 0. Then, since by Lemma C.5 we have e Oij = eψk1i eψk2j dψ k1,k2=1, it follows that columns i and j of eψ are fully orthogonal to each other. 2. Assume e Oij = 0. Then, by Lemma C.9, there is l such that [Pl]ij = 0 and e Oij = 1 dψ [Pl]ij Pl, where Pl is a signed permutation matrix. We have [Pl]ij = 1, i.e., we can wlog. assume [Pl]ij = 1, i.e., e Oij = 1 dψ Pl. Now, let k1 {1, . . . , dψ} be arbitrary and look at the matrix coefficient eψik1. Then, since 1 dψ Pl = eψk1i eψk2j dψ k1,k2=1 by Lemma C.5 again, and due to the structure of Pl, there is precisely one k2 {1, . . . , dψ} such that eψk1i | eψk2j = 1 dψ . By the Cauchy-Schwartz inequality Lemma C.10 it follows: 1 dψ = eψk1i eψk2j eψk1i eψk2j = 1 p In the second to last step, we thereby made use of Corollary 3. We obtain eψk1i eψk2j = eψk1i eψk2j and thus, by the last statement in Lemma C.10, that eψk1i and eψk2j are linearly dependent. Since they have the same norm, it follows eψk1i = eψk2j. Since for different k1, the value of k2 needs to differ as well due to Pl being a signed permutation matrix, the result follows. This result is convenient to define a real harmonic basis {Y m ij } for L2 R(G), which we then use in Corollary 5. C.3 A CONVENIENT CHOICE OF BASIS FOR REAL IRREPS At the end of the previous section, we described a method that, given a real irrep ψ and a basis for End G(Vψ), generates an isomorphic irrep eψ which satisfies the convenient properties in Theorem C.11. 
To provide an intuition of what structure these real irreps have and to simplify some computations in Section F, in this section we describe a choice of basis to express each real irreducible representation, satisfying the properties in Theorem C.11. Note that this particular choice of basis is not necessary in most of our results, since we rely only on a generic orthonormal basis for End G(Vψ), but is useful to simplify Section F. The content of this section is mostly based on Boardman (2007) and Bröcker & Dieck (2003). Recall first the classification of real irreps in real, complex or quaternionic type according to Definition C.2. We will rely on the following useful results: Lemma C.12. For any real type real irrep ψ, its complexification is isomorphic to a complex irrep σ, with σ = σ. The complexification of ψ is a representation like ψ which acts on Cdψ rather than Rdψ. Published as a conference paper at ICLR 2022 Lemma C.13. Any complex or quaternionic type real irrep ψ is isomorphic to the realification of a complex irrep σ. In other words, for any complex or quaternionic type real irrep ψ, there exists a complex irrep σ such that: ψ = Re (σ) Im (σ) Im (σ) Re (σ) Additionally, if it has quaternionic type, it also holds that σ = σ. Indeed, using Re (x) = 1 2(x + x) and Im (x) = 1 2i(x x), one can verify that: Re (σ) Im (σ) Im (σ) Re (σ) = D σ(g) 0 0 σ(g) i iddσ i iddσ iddσ iddσ Using these results, we define a convenient choice of basis for ψ for each possible irrep type. For each case, we also define an associated basis for End G(Vψ). Real type irrep Since ψ = σ (as complex irreps), we do not need further assumptions on the basis of ψ. Note also that End G(Vψ) contains only scalar multiples of the identity; hence, regardless of the choice of basis for ψ, a basis for End G(Vψ) is given by the identity matrix. Complex type irrep If ψ has complex type, we assume that ψ is expressed in a basis such that ψ = Re (σ) Im (σ) Im (σ) Re (σ) for some complex irrep σ. Note that this is a natural choice, for example, for G = SO(2). It follows that a basis for End G(Vψ) is given by the following homomorphism Φ : C End G(Vψ): Φ(1) = idn 0 0 idn Φ(i) = 0 idn idn 0 where n = dψ/2 and idn is the identity matrix of size n. Quaternionic type irrep Similarly, if ψ has quaternionic type, we assume that ψ is expressed in a basis such that ψ = Re (σ) Im (σ) Im (σ) Re (σ) for some complex irrep σ. One can show that if σ has quaternionic type, it can always be expressed in a basis such that it has the following block structure: σ(g) = σ1(g) σ2(g) σ2(g) σ1(g) and, therefore: Re σ1(g) σ2(g) σ2(g) σ1(g) Im σ1(g) σ2(g) σ2(g) σ1(g) Im σ1(g) σ2(g) σ2(g) σ1(g) Re σ1(g) σ2(g) σ2(g) σ1(g) Published as a conference paper at ICLR 2022 A basis for End G(Vψ) is given by the following homomorphism Φ : H End G(Vψ): idn 0 0 0 0 idn 0 0 0 0 idn 0 0 0 0 idn 0 0 idn 0 0 0 0 idn idn 0 0 0 0 idn 0 0 0 idn 0 0 idn 0 0 0 0 0 0 idn 0 0 idn 0 0 0 0 idn 0 0 idn 0 0 idn 0 0 idn 0 0 0 where n = dψ/4. D THE COMPUTATION OF THE INDUCED REPRESENTATION OF COMPACT GROUPS Quotient and induced feature fields, i.e. features fields where the feature type ρ is an quotient or induced representation (see Sec 2.1), were studied in Weiler & Cesa (2019) for the planar groups. These feature types encode square-integrable scalar or vector functions over a homogeneous space X = G/H of G. If X is not finite, one restricts the consideration to a subspace of bandlimited functions by using a finite subset of an harmonic basis for the space of all functions. 
A band-limited function is then parameterized by a coefficients vector, which carries an action of G as a direc-sum of G irreps. Given one such vector, the function it parametrizes can be sampled at any point of x X by evaluating the harmonic basis on x. For example, in Sec.5.3, we employed models with quotient features over the sphere S2 = SO(3)/ SO(2) or the space Inv S2 = O(3)/ SO(2). See also Section H.3 for more details on how this is used in our neural network design. Recall also that a G-orbit is isomorphic to an homogeneous space for G and, therefore, to a quotient space G/H = {g.H|g G}, where H < G is a stabilizer subgroup of the space; e.g. SO(3).x = S2 = SO(3)/ SO(2) for x R3 non-zero. The Wigner-Eckart theorem from Lang & Weiler (2020) relies on an harmonic basis over an orbit G.x of the equivariance group G inside the base-space X. While our Theorem B.7 relies on a G-steerable basis for X, a simple way (although, generally more complicated to band-limit and discretize) to generate such basis is by embedding G-orbits inside X and use the harmonic bases of such orbits. This is visualized for G = C4 in Fig. 1b and used for the G = I baselines in Sec. 5.2 In all these settings, an harmonic basis for functions on an homogeneous space is required. In this section, we describe how to compute the harmonic basis for scalar and vector fields over a compact homogeneous space16 X = G/H, directly from the matrix coefficients of the irreps of G. A vector field over X = G/H is a function f : X Rdψ (or Cdψ) and transforms under the action of G according to an induced representation Ind G H ψ, where ψ is an irrep of H. We give a precise definition of induced representation acting on such vector fields in Def. D.3. Note that scalar fields are considered a special case of vector fields. In particular, scalar fields on G/H transform according to Ind G H ψ0, where ψ0 : H {1} is the trivial representation of H. We occasionally call this representation acting on scalar fields a quotient representation. Similar investigations are well-documented for complex representations: for the case X = G, a complex harmonic basis is given by the matrix coefficients of its irreps, as described by the Peter- 16Note that in this section we use X with a different meaning with respect to the rest of the paper. Published as a conference paper at ICLR 2022 Weyl theorem A.6. A similar result for complex-valued representations and scalar functions on homogeneous spaces was also described in Kondor & Trivedi (2018). Outline and goal After some definitions and assumptions in Supplementary D.1, we provide different equivalent definitions of the induced representation in Supplementary D.2. Then, we construct the harmonic basis of induced representations in Supplementary D.3. Finally, Corollaries 4 and 5 provide the final harmonic bases for the complex and real settings, respectively. We freely make use of the concepts defined in Supplementary A in this whole section. D.1 USEFUL DEFINITIONS, ASSUMPTIONS AND NOTATIONS In this section, we will always assume a compact group G and a compact subgroup H G. The neutral elements of these groups are just denoted e H, which is the same for both groups. As before, for generality, we allow K to be any of the two fields R or C before we will later separately consider complex and real representations. b H is the set of all isomorphism classes of unitary irreps of H and ψ b H is, by abuse of notation, meant to refer to a representative of such an isomorphism class. 
For the rest of the section, we fix such an irrep ψ : H → U(V_ψ). In order to generate a harmonic basis for the functions in the induced representation, we only rely on a harmonic basis for L²_K(G), a basis for the endomorphism space of each irrep ρ ∈ Ĝ, and the irrep decomposition of the irreps of G when restricted to H < G. Assuming these are known, in the rest of this section we will use them to explicitly build a harmonic basis for functions transforming according to an induced representation.

D.2 INDUCED REPRESENTATIONS

We can now introduce the induced representation. In this section, we will restrict to the case where ψ : H → U(V_ψ) is an irreducible representation of H; this is fixed in this whole section. Induction from non-irreducible representations of H can be built by combining the induced representations of the H-irreps composing the reducible H-representation. We now define the induced representation as a space of Mackey functions, and then, alternatively, as a space of vector fields over a quotient space. The constructions are well known and can, for example, be found in Kaniuth & Taylor (2013). This book also contains more theoretical justifications, including the definition of the scalar product making the induced representation a Hilbert space and proofs that both versions of the representation are actually unitary.

Definition D.1 (Mackey Functions). A Mackey function is a ψ-vector field f : G → V_ψ which has square-integrable component functions, i.e., in L²_K(G), and with the following equivariance property: for all h ∈ H and g ∈ G, one has

    f(gh⁻¹) = ψ(h) f(g).   (18)

We denote the space of Mackey functions by Hom_H(G, V_ψ) (these functions can be viewed as left-H-equivariant when using the left action h ∘ g := gh⁻¹ of H on G, which explains the notation). This becomes a unitary G-representation with the action

    [g.f](g′) := f(g⁻¹g′).   (19)

Remark D.2. One can verify that g.f still belongs to the space of Mackey functions. Indeed, for g, g′ ∈ G, h ∈ H and f ∈ Hom_H(G, V_ψ), one has

    [g.f](g′h⁻¹) = f(g⁻¹(g′h⁻¹)) = f((g⁻¹g′)h⁻¹) = ψ(h) f(g⁻¹g′) = ψ(h) [g.f](g′).

Before we come to the definition as quotient vector fields, recall that, for the pair of compact groups H ≤ G, one can build the space of cosets

    G/H := {gH | g ∈ G}.   (20)

There is a projection π : G → G/H given by g ↦ gH, which induces a topology on G/H, making it a compact topological Hausdorff space. Together with the action g.(g′H) := (gg′)H, G/H becomes a homogeneous space of G and thus carries a Haar measure. Now, pick an arbitrary measurable section r : G/H → G associating a representative element to each coset gH ∈ G/H (this section can usually not be chosen to be continuous, but being measurable is enough for our purposes). In other words, the section r satisfies π ∘ r = id_{G/H}, with the projection π as defined above. This can also be written as

    r(gH)H = gH,  for all g ∈ G.   (21)

Since r(gH)H = gH, we automatically have that

    h(g) := r(gH)⁻¹ g ∈ H,  for all g ∈ G.   (22)

Since g = r(gH) · h(g), it follows that every g ∈ G is fully characterized by the pair (gH, h(g)) ∈ (G/H) × H. We fix from now on the definitions of the section r : G/H → G and of the map h : G → H, with the latter implicitly depending on r.

Definition D.3 (ψ-Vector Fields over the Quotient Space). A ψ-vector field over the quotient space G/H is defined to be any function of the form f̃ : G/H → V_ψ that has square-integrable component functions with respect to the Haar measure on G/H, i.e., the component functions are in L²_K(G/H). The space of these ψ-vector fields is denoted by Hom(G/H, V_ψ).
This becomes a unitary G-representation with the action

    [g.f̃](g′H) := ψ( h( g · r(g⁻¹g′H) ) ) f̃( g⁻¹g′H ).   (23)

Note that the action of G on Hom(G/H, V_ψ) depends on the choice of the (not necessarily continuous) section r : G/H → G of the quotient space. We now show that these two definitions are equivalent:

Proposition D.4. There is an isomorphism of unitary representations Hom_H(G, V_ψ) ≅ Hom(G/H, V_ψ) between the space of Mackey functions on the left and the space of quotient vector fields on the right, given as follows:
1. for f : G → V_ψ a Mackey function, we define f̃(gH) := f(r(gH)), using the section defined in eq. 21;
2. for a quotient vector field f̃ : G/H → V_ψ, we define f̌(g) := ψ(h(g))⁻¹ f̃(gH), using the function h : G → H defined in eq. 22.

Proof. In this proof, we freely make use of eq. 21 and eq. 22. First, we show the equivariance of f ↦ f̃. Let f : G → V_ψ satisfy eq. 18. First, note that

    h( g · r(g⁻¹g′H) ) = r( g r(g⁻¹g′H) H )⁻¹ g r(g⁻¹g′H) = r( g g⁻¹ g′ H )⁻¹ g r(g⁻¹g′H) = r(g′H)⁻¹ g r(g⁻¹g′H) = ( r(g⁻¹g′H)⁻¹ g⁻¹ r(g′H) )⁻¹ = h( g⁻¹ r(g′H) )⁻¹.   (24)

From this, we deduce:

    (g.f)~(g′H) = [g.f](r(g′H)) = f( g⁻¹ r(g′H) ) = f( r(g⁻¹ r(g′H) H) · h(g⁻¹ r(g′H)) ) = f( r(g⁻¹g′H) · h(g⁻¹ r(g′H)) ) = ψ( h(g⁻¹ r(g′H))⁻¹ ) f( r(g⁻¹g′H) ) = ψ( h(g · r(g⁻¹g′H)) ) f̃( g⁻¹g′H ) = [g.f̃](g′H),

i.e., the desired equivariance (g.f)~ = g.f̃.

Given f̃ : G/H → V_ψ, we first verify that f̌ satisfies eq. 18, i.e., that it lives in Hom_H(G, V_ψ). First, note that

    h(gh⁻¹) = r(gh⁻¹H)⁻¹ g h⁻¹ = r(gH)⁻¹ g h⁻¹ = h(g) h⁻¹.   (25)

From this, we deduce:

    f̌(gh⁻¹) = ψ( h(gh⁻¹) )⁻¹ f̃(gh⁻¹H) = ψ( h(g) h⁻¹ )⁻¹ f̃(gH) = ψ(h) ψ(h(g))⁻¹ f̃(gH) = ψ(h) f̌(g).

We now show the equivariance of f̃ ↦ f̌. Using eq. 24 and eq. 25, we first note:

    h(g′)⁻¹ h( g · r(g⁻¹g′H) ) = h(g′)⁻¹ h( g⁻¹ r(g′H) )⁻¹ = ( h(g⁻¹ r(g′H)) h(g′) )⁻¹ = h( g⁻¹ r(g′H) h(g′) )⁻¹ = h(g⁻¹g′)⁻¹.

We deduce:

    (g.f̃)ˇ(g′) = ψ( h(g′) )⁻¹ [g.f̃](g′H) = ψ( h(g′) )⁻¹ ψ( h(g · r(g⁻¹g′H)) ) f̃( g⁻¹g′H ) = ψ( h(g⁻¹g′) )⁻¹ f̃( g⁻¹g′H ) = f̌(g⁻¹g′) = [g.f̌](g′),

i.e., the desired equivariance (g.f̃)ˇ = g.f̌.

Finally, we need to show that (f̃)ˇ = f and (f̌)~ = f̃, i.e., that the two maps are inverse to each other. We have

    (f̃)ˇ(g) = ψ( h(g) )⁻¹ f̃(gH) = ψ( h(g) )⁻¹ f( r(gH) ) = f( r(gH) · h(g) ) = f(g)

and

    (f̌)~(gH) = f̌( r(gH) ) = ψ( h(r(gH)) )⁻¹ f̃( r(gH)H ) = f̃(gH),

where we used r(gH)H = gH, h(r(gH)) = e, and ψ(e) = id_{V_ψ} in the last step. More technical details, especially on the unitarity of the isomorphism, can be found in Kaniuth & Taylor (2013). This finishes the proof.

Proposition D.4 justifies the following definition:

Definition D.5 (Induced Representation). For any irrep ψ of H ≤ G, we define the induced representation Ind_H^G ψ as the unitary G-representation acting on Ind_H^G V_ψ := Hom_H(G, V_ψ) ≅ Hom(G/H, V_ψ) as in eq. 19 or eq. 23, depending on which representation space is chosen. We will always make the representation space (Mackey functions or quotient vector fields) explicit in the following.
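As a small sanity check of the two models of the induced representation and of Proposition D.4, the following self-contained Python sketch (our own illustration; all names are ours) instantiates them for G = C4 = Z/4 (written additively) and H = C2 = {0, 2}, with ψ the sign irrep of H, and verifies numerically that the map f ↦ f̃ intertwines the two actions and that the two maps are mutually inverse:

    import numpy as np

    G = [0, 1, 2, 3]
    psi = {0: 1.0, 2: -1.0}                 # sign irrep of H = {0, 2}

    r = lambda c: c                         # section r: G/H -> G; cosets indexed by g mod 2
    h = lambda g: (g - g % 2) % 4           # h(g) = r(gH)^{-1} g, eq. 22

    def act_mackey(k, f):
        # eq. 19 in additive notation: [k.f](g) = f(g - k)
        return {g: f[(g - k) % 4] for g in G}

    def act_quotient(k, ft):
        # eq. 23: [k.ft](g'H) = psi( h( k + r((g'-k)H) ) ) ft( (g'-k)H )
        return {c: psi[h((k + r((c - k) % 2)) % 4)] * ft[(c - k) % 2] for c in [0, 1]}

    tilde = lambda f: {c: f[r(c)] for c in [0, 1]}             # Mackey -> quotient field
    check = lambda ft: {g: psi[h(g)] * ft[g % 2] for g in G}   # quotient field -> Mackey

    # a random Mackey function: eq. 18 here reads f(g - 2) = -f(g)
    a, b = np.random.randn(2)
    f = {0: a, 1: b, 2: -a, 3: -b}
    for k in G:
        lhs, rhs = tilde(act_mackey(k, f)), act_quotient(k, tilde(f))
        assert all(np.isclose(lhs[c], rhs[c]) for c in [0, 1])
    assert all(np.isclose(check(tilde(f))[g], f[g]) for g in G)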
D.3 HARMONIC ANALYSIS OF INDUCED REPRESENTATIONS

Before we come to finding harmonic bases of induced representations, we must deal with a small technicality: in the induced representation given by Mackey functions, the equivariance property f(gh⁻¹) = ψ(h) f(g) is used, which suggests that we look at the following left-action of G on itself: g ∘ g′ := g′g⁻¹, compared to the usual left action g.g′ := gg′.

Similarly to how we solved for a basis of the space of steerable kernels K : X → Hom_G(V_l, V_J) with our Wigner-Eckart theorem, Theorem B.5, by studying L²_K(X), we want to find a basis of the space of Mackey functions f : G → V_ψ by studying L²_K(G); however, due to the adapted left-action on G, we also need to consider a different action on L²_K(G). Thus, define L²_K(G)₂ := L²_K(G) together with the action (g ∘ f)(g′) := f(g′g), which can be shown to be a well-defined left-action on L²_K(G)₂. This is then a unitary representation. Fortunately, it turns out to be isomorphic to the standard regular representation:

Lemma D.6. There is an isomorphism of unitary representations L²_K(G) ≅ L²_K(G)₂, given in both directions by the same function: Φ(f)(g) := f(g⁻¹).

Proof. Since in L²_K(G) we integrate over an inversion-invariant Haar measure, Φ is a unitary transformation. It is also clear that Φ is its own inverse. Thus, we are only left with checking the equivariance in one of the two directions:

    Φ(g.f)(g′) = [g.f](g′⁻¹) = f(g⁻¹g′⁻¹) = f((g′g)⁻¹) = Φ(f)(g′g) = (g ∘ Φ(f))(g′).

It follows that Φ(g.f) = g ∘ Φ(f), i.e., the desired equivariance.

We remind the reader of the following notation, introduced in Section A, which we use in the harmonic analysis of the induced representation:
- Ĝ is the set of isomorphism classes of unitary irreps of G;
- m_j is the multiplicity of the irrep ρ_j : G → U(V_j) in the regular representation L²_K(G); for i ≤ m_j, p_{ji} : L²_K(G) → V_j is the corresponding harmonic projection;
- [ψj] is the multiplicity of the irrep ψ : H → U(V_ψ) in Res_H^G V_j;
- E_ψ = dim End_{H,K}(V_ψ) is the dimension of the endomorphism space of V_ψ; the c_r^ψ : V_ψ → V_ψ, r ≤ E_ψ, form a basis of endomorphisms of V_ψ;
- ID_t^{ψj} : Res_H^G V_j → V_ψ is an irrep decomposition projection;
- the Y_{ji}^m : G → K form a harmonic basis of the regular representation L²_K(G); Y_{ji} : G → K^{d_j} = V_j is the column vector of all the Y_{ji}^m, and its complex conjugate is denoted by a bar.
Finally, note that, after choosing bases, by abuse of notation we also refer with c_r^ψ and ID_t^{ψj} to the matrices corresponding to these linear maps.

Theorem D.7. A basis of Mackey functions Ind_H^G V_ψ = Hom_H(G, V_ψ) is given by

    { f_{jitr} : G → V_ψ = K^{d_ψ} | j ∈ Ĝ, i ≤ m_j, t ≤ [ψj], r ≤ E_ψ }.

Thereby, we have

    f_{jitr}(g) = c_r^ψ · ID_t^{ψj} · Y_{ji}(g⁻¹).

Consequently, a basis of ψ-vector fields over the quotient space, Ind_H^G V_ψ = Hom(G/H, V_ψ), is given by

    { f̃_{jitr} : G/H → V_ψ = K^{d_ψ} | j ∈ Ĝ, i ≤ m_j, t ≤ [ψj], r ≤ E_ψ },

with f̃_{jitr}(gH) := f_{jitr}(r(gH)), where r : G/H → G is a measurable section.

Proof. We have

    Hom_H(G, V_ψ) ≅(1) Hom_{H,K}( Res_H^G L²_K(G)₂, V_ψ ) ≅(2) Hom_{H,K}( Res_H^G L²_K(G), V_ψ ) ≅(3) ⊕_{j ∈ Ĝ} ⊕_{i=1}^{m_j} Hom_{H,K}( Res_H^G V_j, V_ψ ) ≅(4) ⊕_{j ∈ Ĝ} ⊕_{i=1}^{m_j} ⊕_{t=1}^{[ψj]} End_{H,K}(V_ψ).

Step (1) follows from Theorem B.2, instantiated with the group H acting on the space X = G via the left-action g ∘ g′ := g′g⁻¹ and with Hom_K(V_l, V_J) replaced by V_ψ (note that, in the original theorem, the structure of the Hom-representation is not actually used, so we can just replace it by any finite-dimensional representation, in this case V_ψ). Step (2) follows from Lemma D.6. Step (3) is the Peter-Weyl theorem A.6. And, finally, step (4) uses the irrep decomposition projections ID_t^{ψj} : Res_H^G V_j → V_ψ. Explicitly, from right to left, the isomorphism is given by:

    ( c_{jit} )_{jit} ↦ ( Σ_t c_{jit} ∘ ID_t^{ψj} )_{ji} ↦ ( Σ_t c_{jit} ∘ ID_t^{ψj} ∘ p_{ji} )_j ↦ Σ_{jit} c_{jit} ∘ ID_t^{ψj} ∘ p_{ji} ∘ Φ ↦ Σ_{jit} c_{jit} ∘ ID_t^{ψj} ∘ p_{ji} ∘ Φ|_G.

Thus, a basis is given by all f_{jitr} := c_r^ψ ∘ ID_t^{ψj} ∘ p_{ji} ∘ Φ|_G.
The result now follows by expanding the Dirac delta function δ_g in the harmonic basis:

    (p_{ji} ∘ Φ|_G)(g) = p_{ji}( Φ(δ_g) ) = p_{ji}( δ_{g⁻¹} ) = Σ_{m=1}^{d_j} Y_{ji}^m(g⁻¹) |jm⟩ = Y_{ji}(g⁻¹),

where, in the last step, we identified V_j ≅ K^{d_j}, and thus |jm⟩ with a standard basis vector e_m. Note that, to be precise, one would need to approximate the Dirac delta function with square-integrable functions; the details of such arguments can be found in the proof of Theorem D.13 of Lang & Weiler (2020). The last statement about a basis of Hom(G/H, V_ψ), finally, follows from Proposition D.4.

Corollary 4. A harmonic basis for complex Mackey functions is given by:

    { f_{jit} : G → V_ψ = C^{d_ψ} | j ∈ Ĝ, i ≤ d_j, t ≤ [ψj] },  where  f_{jit}(g) = ID_t^{ψj} √(d_j) ρ_j(g⁻¹) e_i.

Proof. Using Proposition A.7, we find that Y_{ji}(g⁻¹) = √(d_j) ρ_j(g⁻¹) e_i. Because we assume complex representations, End_{H,C}(V_ψ) only contains scalar multiples of the identity, due to Schur's Lemma. In other words, c_r^ψ is always (a multiple of) the identity matrix.

Corollary 5. Assume that all real irreps are expressed in a basis as in Theorem C.11. A harmonic basis for real Mackey functions is given by:

    { f_{jitr} : G → V_ψ = R^{d_ψ} | j ∈ Ĝ, i ≤ m_j = d_j / E_j, t ≤ [ψj], r ≤ E_ψ },  where  f_{jitr}(g) = c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g⁻¹) e_{I_j[i]}

and I_j is an index set containing the indexes of the m_j = d_j / E_j columns of ρ_j with linearly independent matrix coefficients.

Proof. Using Proposition A.7, if ρ_j is expressed in a basis as in Theorem C.11, the matrix coefficients of ρ_j in a column i are linearly dependent on those in E_j columns of ρ_j. Therefore, it is sufficient to consider a subset of m_j = d_j / E_j columns of ρ_j which contain linearly independent matrix coefficients. Additionally, if ρ_j is expressed in the basis described in Supplementary C.3, I_j = {1, 2, …, d_j / E_j}, so I_j[i] = i.

In Section D.5, we show how to turn this basis into a harmonic one and explicitly derive the action of G on it as a direct sum of irreps.

D.4 RECONSTRUCTING THE IRREP DECOMPOSITION: THE COMPLEX CASE

The goal of this and the next section is to write the action of G on the bases found above explicitly as a direct sum of irreps of G. We first consider the simpler case of complex-valued irreps. Recall Corollary 4, and consider the set of all basis elements for a fixed value of j, i.e., {f_{jit}}_{it}, where f_{jit}(g) = ID_t^{ψj} √(d_j) ρ_j(g⁻¹) e_i. Let f be a Mackey function in the span of these basis elements, i.e., there exist {w_{it} ∈ C}_{it} such that:

    f(g) = Σ_{it} w_{it} f_{jit}(g) = Σ_{it} w_{it} ID_t^{ψj} √(d_j) ρ_j(g⁻¹) e_i = Σ_t ID_t^{ψj} √(d_j) ρ_j(g⁻¹) w_t,

where [w_t]_i = w_{it} and w_t ∈ C^{d_j}. Now, the action of G on these functions is the usual action on L²(G), i.e., for k, g ∈ G:

    [k.f](g) = f(k⁻¹g) = Σ_{it} w_{it} f_{jit}(k⁻¹g) = Σ_t ID_t^{ψj} √(d_j) ρ_j(g⁻¹) ( ρ_j(k) w_t ).

In other words, the space of functions spanned by {f_{jit}}_{it} is isomorphic to ⊕_{t=1}^{[ψj]} V_j, and each w_t lives in one such V_j space. We call the vectors w_t the Fourier transform coefficients of the function f. More generally, let f be a generic Mackey function in Ind_H^G V_ψ. The matrix f̂(ρ_j) of shape d_j × [ψj] contains in its columns the vectors w_t ∈ C^{d_j}, for each t. We call the collection { f̂(ρ_j) }_j the Fourier transform of the function f. Note that, if ψ is the trivial representation, this definition recovers the classical definition of the Fourier transform of functions over compact groups or homogeneous spaces. Finally, it follows that Ind_H^G ψ ≅ ⊕_j ⊕_{t=1}^{[ψj]} ρ_j.
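The following minimal numpy sketch (our own illustration; the setup is an assumption chosen for simplicity) checks the complex-case construction on G = C4 with H = C2 = {0, 2} and ψ the sign irrep of H. The complex irreps of C4 are ρ_j(g) = i^{jg} with d_j = 1, and Res ρ_j contains ψ iff (−1)^j = −1, i.e., [ψj] = 1 exactly for j ∈ {1, 3}; the sketch verifies the Mackey property of the basis elements and that the Fourier coefficients transform through ρ_j(k):

    import numpy as np

    G, js = range(4), [1, 3]
    rho = lambda j, g: 1j ** ((j * g) % 4)

    # basis functions of Corollary 4: here simply f_j(g) = rho_j(g^{-1})
    f = lambda j, g: rho(j, (-g) % 4)

    # Mackey property, eq. 18 (additive notation): f(g - 2) = -f(g)
    for j in js:
        for g in G:
            assert np.isclose(f(j, (g - 2) % 4), -f(j, g))

    # harmonic property: [k.F](g) = F(k^{-1} g) has coefficients rho_j(k) w_j
    w = {j: np.random.randn() + 1j * np.random.randn() for j in js}
    F = lambda g: sum(w[j] * f(j, g) for j in js)
    for k in G:
        for g in G:
            lhs = F((g - k) % 4)
            rhs = sum(rho(j, k) * w[j] * f(j, g) for j in js)
            assert np.isclose(lhs, rhs)

Note also that the number of basis elements, Σ_j [ψj] d_j = 2, matches the dimension |G/H| · d_ψ = 2 of the induced representation.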
D.5 RECONSTRUCTING THE IRREP DECOMPOSITION: THE REAL CASE

In the real case, this decomposition is less straightforward. Recall Corollary 5, and consider the set of all basis elements for a fixed value of j, i.e., {f_{jitr}}_{itr}, where:

    f_{jitr}(g) = c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g⁻¹) e_i,  with i ≤ m_j = d_j / E_j.

Let f be a Mackey function in the span of these basis elements, i.e., there exist {w_{itr} ∈ R}_{itr} such that:

    f(g) = Σ_{itr} w_{itr} f_{jitr}(g) = Σ_{tr} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g⁻¹) Σ_{i ≤ m_j} w_{itr} e_i = Σ_{tr} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g⁻¹) w_{rt},

where [w_{rt}]_i = w_{itr} if i ≤ m_j and [w_{rt}]_i = 0 otherwise. Unfortunately, this basis is not harmonic, since it does not have an explicit action of G through ρ_j on the coefficient vectors. To see this, consider the function above transformed by an element k ∈ G:

    [k.f](g) = Σ_{itr} w_{itr} [k.f_{jitr}](g) = Σ_{itr} w_{itr} f_{jitr}(k⁻¹g) = Σ_{tr} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g⁻¹) ρ_j(k) w_{rt}.

The vector ρ_j(k) w_{rt} does not generally belong to the span of {e_i | i ≤ m_j}, i.e., the function k.f as written above is not expressed directly in terms of the basis in Corollary 5. This means that the basis we found is not a harmonic basis yet.

Nevertheless, we know that the space of Mackey functions carries an action of the compact group G and, therefore, must decompose into a direct sum of irreps of G. Hence, there exists a change of basis from the current one to a harmonic one. In the rest of this section, we derive a variation of the Gram-Schmidt process to compute such a change of basis.

First of all, we want to identify the invariant subspaces spanned by the basis. To do so, we re-project each basis element after an action of G on the basis itself, i.e., we project the function g.f_{jitr}, with g ∈ G, on each other basis element f_{jksq}. First, note that:

    [g.f_{jitr}](k) = c_r^ψ ID_t^{ψj} √(d_j) ρ_j(k⁻¹) ρ_j(g) e_i = Σ_{i′ ≤ d_j} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(k⁻¹) e_{i′} [ρ_j(g)]_{i′,i},

i.e., g.f_{jitr} belongs to the span of the set of functions {f_{ji′tr} | i′ ≤ d_j} (note that i′ ranges up to d_j rather than m_j = d_j / E_j). Although the functions {f_{ji′tr} | m_j < i′ ≤ d_j} do not belong to the basis in Corollary 5, one can easily verify that they are actually Mackey functions and, therefore, must belong to the space spanned by the basis in Corollary 5. Additionally, we will see that {f_{ji′tr} | i′ ≤ d_j} contains functions which are orthogonal to each other; before showing this, we should define an inner product of Mackey functions:

    ⟨f₁, f₂⟩_G = ∫_G ⟨f₁(g), f₂(g)⟩_ψ dg = ∫_G f₁(g)ᵀ f₂(g) dg.

Note that we assumed real functions and real representations. We can now compute the projection of any two functions:

    ⟨f_{jitr}, f_{jksq}⟩ = ∫_G e_iᵀ ρ_j(g) √(d_j) [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} √(d_j) ρ_j(g)ᵀ e_k dg = e_iᵀ O_{tr,sq} e_k,

where we define O_{tr,sq} := ∫_G ρ_j(g) √(d_j) [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} √(d_j) ρ_j(g)ᵀ dg.

Lemma D.8. The following statements are true:
1. O_{tr,sq} ∈ End_G(ρ_j)
2. O_{tr,sq} = Σ_l ⟨O_{tr,sq}, I_l⟩ I_l = Σ_l Tr( [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} I_lᵀ ) I_l

Proof. 1) By definition, O_{tr,sq} is the invariant projection of a matrix with respect to a left and right action of G through ρ_j. It follows that O_{tr,sq} commutes with ρ_j(g) for any g ∈ G; hence, O_{tr,sq} ∈ End_G(ρ_j).
2) Let {I_l | 1 ≤ l ≤ E_j} be an orthonormal basis for End_G(ρ_j) as in Section C. Then, we can project O_{tr,sq} on this basis:

    ⟨O_{tr,sq}, I_l⟩ = (1/d_j) Tr( O_{tr,sq} I_lᵀ ) = (1/d_j) Tr( ∫_G ρ_j(g) √(d_j) [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} √(d_j) I_lᵀ ρ_j(g)ᵀ dg ) = Tr( [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} I_lᵀ ),

where we used the fact that I_l commutes with ρ_j(g), so that ρ_j(g)ᵀ I_lᵀ = I_lᵀ ρ_j(g)ᵀ. Hence,

    O_{tr,sq} = Σ_l ⟨O_{tr,sq}, I_l⟩ I_l = Σ_l Tr( [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} I_lᵀ ) I_l.

Note that the coefficients ⟨O_{tr,sq}, I_l⟩ can be easily computed numerically, since all the matrices involved are known.

Corollary 6. The following statements are true:
1. ⟨O_{tr,sq}, I_1⟩ = δ_st δ_rq
2. O_{tr,tr} is the identity matrix
3. ⟨f_{jitr}, f_{jktr}⟩ = δ_ik
Proof. 1) By considering l = 1, i.e., I_l = id the identity, in Lemma D.8, we find

    ⟨O_{tr,sq}, I_1⟩ = Tr( [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} ) = Tr( ID_s^{ψj} [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ).

Leveraging the orthogonality of ID^{ψj} and the fact that ID_t^{ψj} and ID_s^{ψj} contain (different) rows of ID^{ψj}, i.e., ID_s^{ψj} [ID_t^{ψj}]ᵀ = δ_st id, this equals δ_st Tr( [c_r^ψ]ᵀ c_q^ψ ). There exists a unique r′ such that c_{r′}^ψ = ±[c_r^ψ]ᵀ c_q^ψ, and c_{r′}^ψ is not skew-symmetric (and, therefore, has non-zero trace) iff r′ = 1, i.e., iff r = q. With the normalization of the basis {c_r^ψ}_r, it follows that ⟨O_{tr,sq}, I_1⟩ = δ_st δ_rq.
2) Note that the matrix O_{tr,tr} is symmetric; since I_l for l ≠ 1 is antisymmetric, ⟨O_{tr,tr}, I_l⟩ = 0 if l ≠ 1. Using the result in 1), it follows that O_{tr,tr} = id is the identity matrix.
3) An immediate consequence of this is that ⟨f_{jitr}, f_{jktr}⟩ = e_iᵀ O_{tr,tr} e_k = δ_ik.

Note that the last statement in Corollary 6 holds also for the functions in {f_{jitr} | i > m_j = d_j / E_j}, which do not belong to the basis in Corollary 5. Intuitively, for a fixed value of (t, r), the basis in Corollary 5 ignores these orthogonal elements, since they are linearly dependent on the basis elements associated with some other value of (t, r). This explains why the form of the basis in Corollary 5 is not convenient for us.

Using the projection coefficients ⟨O_{tr,sq}, I_l⟩, we can compute the necessary change of basis. To do so, we start with the over-complete basis given by:

    {f_{jitr} | i ≤ d_j, r ≤ E_ψ, t ≤ [ψj]}.

Recall that, for a fixed value of r and t, all basis elements B_tr = {f_{jitr}}_i are orthogonal. We can therefore build a basis by iteratively picking a set B_tr and projecting it to the space orthogonal to the one spanned by the basis elements already chosen. Before making this more precise, we need to introduce a few more useful results.

Lemma D.9. For any r ≤ E_ψ, t ≤ [ψj] and l ≤ E_j:
1. c_r^ψ ID_t^{ψj} I_l ∈ Hom_H(Res_H^G ρ_j, ψ)
2. there is a set of real numbers (w_{tr,sq,l})_{s,q} s.t. c_r^ψ ID_t^{ψj} I_l = Σ_{s ≤ [ψj], q ≤ E_ψ} c_q^ψ ID_s^{ψj} w_{tr,sq,l}
3. w_{tr,sq,l} = ⟨O_{tr,sq}, I_l⟩

Proof. 1) Note that c_r^ψ ID_t^{ψj} ∈ Hom_H(Res_H^G ρ_j, ψ) and I_l ∈ End_G(ρ_j). It follows that, for any h ∈ H < G:

    c_r^ψ ID_t^{ψj} I_l ρ_j(h) = c_r^ψ ID_t^{ψj} ρ_j(h) I_l = c_r^ψ ψ(h) ID_t^{ψj} I_l = ψ(h) c_r^ψ ID_t^{ψj} I_l,

i.e., c_r^ψ ID_t^{ψj} I_l commutes with the action of H; hence, it belongs to Hom_H(Res_H^G ρ_j, ψ).
2) This result is an immediate consequence of 1), by using the decomposition Hom_H(Res_H^G ρ_j, ψ) ≅ ⊕_{s=1}^{[ψj]} End_H(ψ) and the basis {c_q^ψ}_q for End_H(ψ); the set {c_q^ψ ID_s^{ψj}}_{q,s} is then a basis for Hom_H(Res_H^G ρ_j, ψ).
3) Since {c_q^ψ ID_s^{ψj}}_{q,s} is an orthonormal basis for Hom_H(Res_H^G ρ_j, ψ) (with respect to a suitably scaled standard inner product), the coefficients (w_{tr,sq,l})_{s,q} are found just by projecting the matrix on each basis element, i.e.:

    w_{tr,sq,l} = ⟨ c_q^ψ ID_s^{ψj}, c_r^ψ ID_t^{ψj} I_l ⟩ = Tr( [ID_t^{ψj}]ᵀ [c_r^ψ]ᵀ c_q^ψ ID_s^{ψj} I_lᵀ ) = ⟨O_{tr,sq}, I_l⟩.

Lemma D.10. Let {b_i}_{i=1}^{d_j} be a set of Mackey functions defined as:

    b_i(g) = Σ_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ e_i b_{r,t},

parameterized by the vector b = (b_{r,t})_{r,t}. Let b(g) = ( b₁(g) | … | b_i(g) | … ) ∈ R^{d_ψ × d_j} be the matrix obtained by stacking horizontally the vectors {b_i(g)}_i, i.e.,

    b(g) = Σ_{r,t} b_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ.

The set of functions {b_i}_{i=1}^{d_j} is a harmonic family, i.e., if f(g) = b(g) f for a coefficient vector f ∈ R^{d_j}, then

    [k.f](g) = f(k⁻¹g) = b(g) ρ_j(k) f.

In other words, the function k.f is the function parameterized by the coefficients ρ_j(k) f.

Proof.

    f(k⁻¹g) = b(k⁻¹g) f = Σ_{r,t} b_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(k⁻¹g)ᵀ f = Σ_{r,t} b_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ ρ_j(k) f = b(g) ρ_j(k) f.

Finally, we define a few operations which we will need to combine and project harmonic functions in our final algorithm.

Lemma D.11. Let a and b be two sets of harmonic functions as in Lemma D.10, respectively parameterized by the vectors a and b.
Let α ∈ R and M ∈ End_G(ρ_j); then, the following sets of functions are all harmonic families as in Lemma D.10:
1. the set f defined as f(g) = α a(g)
2. the set f defined as f(g) = b(g) − a(g)
3. the set f defined as f(g) = a(g) M

Proof. 1) This is a simple consequence of the linearity of the harmonic property.
2) Similarly:

    b(g) − a(g) = Σ_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ b_{r,t} − Σ_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ a_{r,t} = Σ_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ ( b_{r,t} − a_{r,t} ),

i.e., f is the harmonic family parameterized by the coefficients f_{r,t} := b_{r,t} − a_{r,t}.
3) Let M = Σ_l m_l I_l. Then:

    a(g) M = Σ_{r,t} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ a_{r,t} M = Σ_{l,r,t} a_{r,t} m_l c_r^ψ ID_t^{ψj} I_l √(d_j) ρ_j(g)ᵀ = Σ_{l,r,t} a_{r,t} m_l Σ_{q,s} c_q^ψ ID_s^{ψj} w_{tr,sq,l} √(d_j) ρ_j(g)ᵀ = Σ_{q,s} c_q^ψ ID_s^{ψj} √(d_j) ρ_j(g)ᵀ ( Σ_{l,r,t} w_{tr,sq,l} a_{r,t} m_l ),

where we used Lemma D.9 and the fact that M commutes with ρ_j(g)ᵀ. Hence, f is the harmonic family parameterized by f_{s,q} = Σ_{l,r,t} w_{tr,sq,l} a_{r,t} m_l.

Corollary 7. Given two harmonic families a and b, let O_{a,b} = ∫_G a(g)ᵀ b(g) dg be the matrix containing the inner products between each pair of functions in a and b. Then, O_{a,b} ∈ End_G(ρ_j) and O_{a,b} = Σ_l w_{a,b,l} I_l, where w_{a,b,l} = Σ_{sq,tr} a_{s,q} b_{t,r} w_{sq,tr,l}. Moreover, assume the family a is normalized, i.e., O_{a,a} = id, and let f be the set of functions defined as f(g) = b(g) − a(g) O_{a,b}. Then, O_{a,f} is the zero matrix.

Proof. The first result follows from:

    O_{a,b} = ∫_G a(g)ᵀ b(g) dg = ∫_G ( Σ_{s,q} √(d_j) a_{s,q} ρ_j(g) [ID_s^{ψj}]ᵀ [c_q^ψ]ᵀ ) ( Σ_{t,r} c_r^ψ ID_t^{ψj} √(d_j) ρ_j(g)ᵀ b_{t,r} ) dg = Σ_{sq,tr} a_{s,q} b_{t,r} O_{sq,tr},

together with Lemma D.8. The second result follows from:

    O_{a,f} = ∫_G a(g)ᵀ f(g) dg = ∫_G a(g)ᵀ b(g) dg − ( ∫_G a(g)ᵀ a(g) dg ) O_{a,b} = O_{a,b} − id · O_{a,b} = 0.

Additionally, note that all functions within the same harmonic family b have the same norm, i.e.,

    O_{b,b} = ( Σ_{t,r} b_{t,r}² ) id = ‖b‖² id,

where we used the fact that O_{b,b} is a symmetric matrix and, therefore, only its component along I_1 = id survives, with ⟨O_{sq,tr}, I_1⟩ = δ_st δ_qr (Corollary 6). Here, ‖b‖² is the (squared) norm of each function in b.

We now have all the ingredients to describe the final algorithm. Let B = {} be the (initially empty) set of harmonic families chosen so far. In other words, an element b_i of B is a harmonic family of d_j functions as in Lemma D.10 and is represented by a vector b_i = (b_{i,t,r})_{t,r}. Our goal is to find M_j = E_ψ [ψj] / E_j vectors {b_i}_i such that O_{b_i,b_k} = δ_ik id. To do so, we repeat the following steps until the set B contains M_j elements:
1. pick a vector b = (b_{t,r})_{t,r}, which defines a set of harmonics b;
2. define b′ = b − Σ_{b_i ∈ B} b_i O_{b_i,b}, represented by the coefficient vector b′ = (b′_{t,r})_{t,r}, whose coefficients can be computed using Lemma D.11 and Corollary 7;
3. normalize b′ as b′′ = b′ / ‖b′‖, where ‖b′‖² is read off O_{b′,b′} = ‖b′‖² id;
4. add b′′ to B.
Note that, at each iteration, we add d_j orthogonal functions to the new basis and that the original basis in Corollary 5 spans an (M_j · d_j)-dimensional space. Therefore, the algorithm terminates in M_j iterations.
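The following numpy sketch (our own illustration, not the paper's code) implements this procedure entirely at the coefficient level. It assumes as given the tensor w[t, r, s, q, l] = ⟨O_{tr,sq}, I_l⟩ of shape ([ψj], E_ψ, [ψj], E_ψ, E_j), which is computable from Lemma D.8, with I_1 = id stored at index l = 0; all function names are ours:

    import numpy as np

    def gram_coeffs(a, b, w):
        # Corollary 7: coefficients of O_{a,b} = sum_l w_{a,b,l} I_l
        return np.einsum('sq,tr,sqtrl->l', a, b, w)

    def mul_endo(a, m, w):
        # Lemma D.11(3): coefficients of the family g -> a(g) M, with M = sum_l m_l I_l
        return np.einsum('tr,l,trsql->sq', a, m, w)

    def harmonic_change_of_basis(w, M_j, tol=1e-8):
        T, R = w.shape[0], w.shape[1]
        # candidate coefficient vectors: the unit vectors e_{t,r}
        candidates = [np.eye(T * R)[i].reshape(T, R) for i in range(T * R)]
        B = []
        for b in candidates:
            # step 2: project away the families already chosen
            for bi in B:
                b = b - mul_endo(bi, gram_coeffs(bi, b, w), w)
            sq_norm = gram_coeffs(b, b, w)[0]   # O_{b,b} = ||b||^2 id, read off I_1
            if sq_norm > tol:
                B.append(b / np.sqrt(sq_norm))  # step 3: normalize and keep
            if len(B) == M_j:                   # M_j = E_psi [psi j] / E_j families
                break
        return B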
E NUMERICAL IRREPS DECOMPOSITION OF REAL REPRESENTATIONS

Assume a compact group G and a set of (representatives of the isomorphism classes of) orthogonal real irreps Ĝ = {ρ_j}_j. In this section, we only engage with real-valued representations. In particular, recall that, for a real-valued irrep ρ, its endomorphism space End_G(ρ) is not necessarily 1-dimensional; see Section C. As stated in Corollary 1, any orthogonal real representation π of G can be decomposed as a direct sum of irreps of G (that corollary speaks of unitary representations, but recall that, by our convention, unitary means orthogonal when the field is the real numbers). The goal of this section is that of explicitly computing this decomposition for an arbitrary finite-dimensional orthogonal real representation π of G. In other words, given an input finite-dimensional (real, orthogonal) representation π, we want to find the orthogonal transformation/matrix M and the multiplicities {[jπ] ∈ N}_j such that:

    π(g) = M ( ⊕_j ⊕_{i=1}^{[jπ]} ρ_j(g) ) Mᵀ.

Thereby, being orthogonal means that the scalar product is preserved, which for matrices means that the matrix is orthogonal in the classical sense, i.e., its columns build an orthonormal basis.

Warning: note that, with respect to Corollary 1 and the notation used in the main paper, in this section M is transposed on the right side rather than the left side of the direct sum of irreps. To obtain the decomposition used in the other sections, one just needs to transpose our final result. The reason why we consider this notation is simply that this is the convention we used in our implementation, and we wanted to keep the following algorithm consistent with our code.

In this whole section, we assume that π and all ρ_j are matrix representations and, correspondingly, that M is given by a matrix. To simplify the notation, define P = ⊕_j ⊕_{i=1}^{[jπ]} ρ_j, such that π = M P Mᵀ. Since M is orthogonal, we can write π M = M P, i.e., M needs to intertwine P and π. It follows that M ∈ Hom_{G,R}(P, π).

Recall that the block-diagonal structure of P induces a block structure in the columns of M: we split M in a number of vertical blocks, one for each occurrence i of each irrep ρ_j in P, i.e., M = ( ⋯ | M_{j1} | ⋯ | M_{ji} | ⋯ | M_{j[jπ]} | ⋯ ), so that the equation π M = M P reads, visually:

    π(g) ( ⋯ | M_{j1} | ⋯ | M_{ji} | ⋯ ) = ( ⋯ | M_{j1} | ⋯ | M_{ji} | ⋯ ) · blockdiag( …, ρ_j(g), …, ρ_j(g), … ).

From this equation we obtain the following: each such block M_{ji} satisfies π M_{ji} = M_{ji} ρ_j, i.e., M_{ji} ∈ Hom_G(ρ_j, π). We restrict our attention to this case. Thus, from now on, assume that ρ_j is a real orthogonal irrep of G, and our goal is to find an orthogonal embedding in Hom_G(ρ_j, π), where π is not necessarily irreducible. Note that Hom_G(ρ_j, π) ≅ ⊕_{i=1}^{[jπ]} End_G(ρ_j).

We can easily find a basis for Hom_G(ρ_j, π) by solving a linear system. Let d_j be the dimension of the representation space of ρ_j and, similarly, d_π the dimension of the representation space of π. Let I_j be the d_j × d_j identity matrix and I_π the d_π × d_π identity matrix; then, the matrix M_{ji} should satisfy:

    ∀g ∈ G,  π(g) M_{ji} = M_{ji} ρ_j(g)  ⟺  π(g) M_{ji} I_j = I_π M_{ji} ρ_j(g).

Vectorizing both sides of the equation and using vec(ABC) = (Cᵀ ⊗ A) vec(B):

    ∀g ∈ G,  (I_j ⊗ π(g)) vec(M_{ji}) = (ρ_j(g)ᵀ ⊗ I_π) vec(M_{ji})  ⟺  ( I_j ⊗ π(g) − ρ_j(g)ᵀ ⊗ I_π ) vec(M_{ji}) = 0.

Note that I_j ⊗ π(g) − ρ_j(g)ᵀ ⊗ I_π = (−ρ_j(g)ᵀ) ⊕ π(g), where ⊕ indicates the Kronecker sum here (not to be confused with the direct sum). Note that the last line above describes a linear constraint on M_{ji} for each group element g ∈ G. One can solve for the space of suitable M_{ji} by stacking vertically the matrices (I_j ⊗ π(g) − ρ_j(g)ᵀ ⊗ I_π) for each g ∈ G and then solving for the null space of this stack (e.g., through an SVD). While this approach is feasible if G is a finite group and |G| is sufficiently small, it cannot be used for infinite groups (or large finite groups). If G is a finite group, one can easily verify that it is sufficient to consider the matrices (I_j ⊗ π(g) − ρ_j(g)ᵀ ⊗ I_π) associated with a set of generators of G: indeed, note that, if M_{ji} satisfies the constraint above for two group elements a, b ∈ G, then it satisfies it also for ab and ba ∈ G. Finzi et al. (2021) follows precisely this approach to compute a basis for a general space Hom_G(π₁, π₂) when G is finite.
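The following numpy sketch (our own illustration, not the authors' implementation) solves the constraint above. Note that the text uses column-major vectorization, for which vec(ABC) = (Cᵀ ⊗ A) vec(B); since numpy reshapes row-major, the sketch uses the equivalent row-major identity vec(AXB) = (A ⊗ Bᵀ) vec(X) instead:

    import numpy as np

    def intertwiner_basis(rho, pi, samples, tol=1e-6):
        # rho(g), pi(g): orthogonal matrices of the two representations;
        # `samples`: all elements or generators of a finite group, or a few
        # random samples of a continuous one.
        d_rho = rho(samples[0]).shape[0]
        d_pi = pi(samples[0]).shape[0]
        # row-major: vec(pi(g) M) = (pi(g) (x) I) vec(M),
        #            vec(M rho(g)) = (I (x) rho(g)^T) vec(M)
        A = np.concatenate([
            np.kron(pi(g), np.eye(d_rho)) - np.kron(np.eye(d_pi), rho(g).T)
            for g in samples
        ], axis=0)
        _, s, Vt = np.linalg.svd(A)
        rank = int((s > tol).sum())
        # the null space of A spans Hom_G(rho, pi); un-vectorize each basis vector
        return [v.reshape(d_pi, d_rho) for v in Vt[rank:]]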
When G is an infinite Lie group, Finzi et al. (2021) shows that it is sufficient to consider the Lie algebra generators of G to solve this constraint. In our work, however, we do not want to rely on the Lie algebra of G, since it would introduce additional non-trivial requirements when implementing a new equivariance group G. Instead, we note that it is generally sufficient to consider a few random samples G̃ ⊂ G to generate enough constraint matrices {I_j ⊗ π(g) − ρ_j(g)ᵀ ⊗ I_π | g ∈ G̃} such that the null space of their vertical stack is precisely Hom_G(ρ_j, π). A similar method was already used in Weiler et al. (2018a) to compute the Clebsch-Gordan coefficients in the decomposition of the tensor product of the irreps of SO(3). These coefficients appear in their kernel basis precisely as in our vectorized kernel constraint in Sec. 2.2. In Weiler & Cesa (2019) (Appendix G), the authors discuss the cost of this approach when applied to generic input and output representation pairs, and highlight the benefits of, instead, restricting the consideration to only pairs of input and output irreducible representations, as done here.

Therefore, by following the method just described, we can generate an orthonormal basis for the space of vectorized matrices satisfying the constraint above. By un-vectorization, any element in the space spanned by this basis generates a matrix M_{ji} such that π M_{ji} = M_{ji} ρ_j. This is not sufficient yet for us, since we require the whole matrix M to be orthogonal.

From now on, we assume the previous method generated an orthonormal basis B_j = {B_{j,i}}_{i=1}^{d} for Hom_G(ρ_j, π), i.e., a collection of d = [jπ] · E_j matrices of shape d_π × d_j which are orthonormal with respect to a (scaled) standard inner product ⟨A, B⟩ = (1/d_j) Tr(A Bᵀ) = (1/d_j) vec(A)ᵀ vec(B). Recall that E_j is the dimensionality of End_{G,R}(ρ_j). In order to build a matrix M which is orthogonal, for each j, we need to choose [jπ] orthogonal matrices {M_{ji}}_{i=1}^{[jπ]} in Hom_G(ρ_j, π), such that all columns of all matrices M_{ji} are also orthogonal to each other.

Lemma E.1. Any matrix A₁ ∈ Hom_G(ρ_{j₁}, π) has columns which are orthogonal to the columns of any matrix A₂ ∈ Hom_G(ρ_{j₂}, π), for j₁ ≠ j₂.

Proof. This is a direct consequence of the fact that any homomorphism A₁ ∈ Hom_G(ρ_{j₁}, π) maps to a subspace V₁ ⊂ V_π which is orthogonal to the image V₂ of A₂ ∈ Hom_G(ρ_{j₂}, π). This can be seen by realizing that the projection from V₁ to V₂ is an intertwiner and thus needs to be zero by Schur's Lemma, since V₁ and V₂ are not isomorphic.

Lemma E.2. Any orthonormal basis for Hom_G(ρ_j, π) contains matrices which are orthogonal, i.e., such that their columns are orthogonal and normalized.

Proof. Let {I_k}_{k=1}^{E_j} be an orthonormal basis for End_G(ρ_j). We know that there exists a matrix M such that 1) M is orthogonal and 2) π = M ( ⊕_j ρ_j^{⊕[jπ]} ) Mᵀ (this is a direct consequence of Lemma B.29 in Lang & Weiler (2020)). Therefore, there exists a collection {M_{ji}}_{i=1}^{[jπ]} of orthogonal matrices in Hom_G(ρ_j, π), where the M_{ji} are the blocks of the matrix M above. It follows that there exists a basis for Hom_G(ρ_j, π) defined as H = {M_{ji} I_k | 1 ≤ k ≤ E_j, 1 ≤ i ≤ [jπ]}. This basis is made of orthogonal matrices, since both I_k and M_{ji} are orthogonal. Note that we do not know M and H explicitly; we only know that they exist.

Assume H = {H_i}_i is a basis for Hom_G(ρ_j, π) made of orthogonal matrices as above, i.e., for each i, H_iᵀ H_i = I; one can further verify that, for this basis, H_{i₁}ᵀ H_{i₂} + H_{i₂}ᵀ H_{i₁} = 2 δ_{i₁i₂} I. Then, any orthonormal basis for this space is made of orthogonal matrices. Indeed, assume B = {B_l}_l is a basis such that ⟨B_l, B_l⟩ = 1. An element B_l can be written as B_l = Σ_i b_{l,i} H_i.
It follows that, for any element B_l:

    B_lᵀ B_l = Σ_{i₁,i₂} b_{l,i₁} b_{l,i₂} H_{i₁}ᵀ H_{i₂} = Σ_i b_{l,i}² I = ⟨B_l, B_l⟩ I = I,

i.e., any element of an orthonormal basis for Hom_{G,R}(ρ_j, π) must itself be an orthogonal matrix.

It follows that we only need to focus on finding a collection {M_{ji}}_{i=1}^{[jπ]} of matrices whose columns are orthogonal across different matrices; the orthogonality of columns within the same matrix M_{ji} and between matrices M_{j₁,i₁} and M_{j₂,i₂} with j₁ ≠ j₂ is already guaranteed. Recall that B_j = {B_{j,i}}_{i=1}^{d} is the orthonormal basis for Hom_G(ρ_j, π) generated by the numerical method (e.g., SVD). To find suitable {M_{ji}}_i, it is sufficient to set M_{j0} = B_{j,0}. Let M_j be the subset of all matrices {M_{ji}}_i already set (initially containing only M_{j0}). Let {I_k}_{k=1}^{E_j} be an orthonormal basis for End_{G,R}(ρ_j) and let B⁰ = B_j. Then, we build the set M_j through the following iterative method:
1. create B′ = {M_{ji} I_k | M_{ji} ∈ M_j, k ∈ {1, …, E_j}};
2. project the elements in Bⁱ on the space of matrices which is orthogonal to the span of B′ and remove all zero elements (i.e., all elements which already belong to the span of B′), obtaining Bⁱ⁺¹;
3. pick the first non-zero element of Bⁱ⁺¹, set the next M_{ji} to it and add it to M_j;
4. repeat until the set Bⁱ is empty.
Up to a scalar factor (of √(d_j)), the matrices in M_j are all orthogonal. Additionally, this set contains precisely [jπ] matrices. By stacking all the matrices together, one builds the matrix M.

Note also that this method can be used to find the multiplicities [jπ], too. Indeed, because each column of ρ_j is related through an endomorphism with precisely E_j other columns (see Theorem C.11), once a basis B_j with d elements is found, it is sufficient to compute [jπ] = d / E_j, since we assume the type of ρ_j (and, therefore, E_j) is known.
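The following numpy sketch (our own illustration; it reuses the hypothetical `intertwiner_basis` from the earlier sketch) assembles the full decomposition. It uses a simplified but equivalent form of step 2 above: since the columns of M_{ji} I_k span the same subspace as the columns of M_{ji}, projecting with the orthogonal projector M_{ji} M_{ji}ᵀ removes the span of the whole set B′; moreover, for an intertwiner B, the symmetric matrix Bᵀ B ∈ End_G(ρ_j) is a multiple of the identity, which gives the normalization:

    import numpy as np

    def decompose(pi, irreps, endo_dims, samples):
        # irreps: dict j -> rho_j (callable); endo_dims: dict j -> E_j
        blocks, multiplicities = [], {}
        for j, rho_j in irreps.items():
            basis = intertwiner_basis(rho_j, pi, samples)   # d = [j pi] * E_j elements
            multiplicities[j] = len(basis) // endo_dims[j]
            chosen = []
            for B in basis:
                for M in chosen:
                    B = B - M @ (M.T @ B)       # project away the subspace of M
                c = np.trace(B.T @ B) / B.shape[1]
                if c > 1e-9:
                    chosen.append(B / np.sqrt(c))   # B^T B = c * id, so this is orthogonal
                if len(chosen) == multiplicities[j]:
                    break
            blocks += chosen
        M = np.concatenate(blocks, axis=1)              # d_pi x d_pi, orthogonal
        return M, multiplicities

Given the output, one can check numerically that π(g) ≈ M · blockdiag(…, ρ_j(g), …) · Mᵀ for a few sampled g.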
F REAL IRREDUCIBLE REPRESENTATIONS OF THE DIRECT PRODUCT OF TWO COMPACT GROUPS

A useful operation to generate new groups is the direct product. Given two groups A and B, their direct product G = A × B is a group defined as the Cartesian product of the elements of A and B, i.e., {(a, b) | a ∈ A, b ∈ B}, with the group law · : G × G → G defined as (a₁, b₁) · (a₂, b₂) = (a₁a₂, b₁b₂). Many groups can be written as a direct product of two smaller groups. In particular, O(3) = Inv × SO(3), or, more generally, O(n) = Inv × SO(n) when n is odd, where Inv = {±1} is the group generated by the inversion x ↦ −x. In our work, we are interested in the direct product to easily build many subgroups of O(n) (in particular, of O(3)) which have the form G = Inv × H, with H < SO(n). This is the case for the cylindrical symmetries (H = SO(2), O(2) or their discrete subgroups) or the full isometries of the Platonic solids (H = T, O or I); see Sec. G. In order to use these groups, it is sufficient for us to build their sets of irreducible representations. In this section, we describe a method to generate the set of real irreducible representations of a direct product of two compact groups from the irreps of the two groups.

Given two compact groups A and B and their respective sets of (representatives of) complex unitary irreducible representations Â = {α_i}_i and B̂ = {β_j}_j, it is well known that any complex unitary irrep of G = A × B is isomorphic to a unique γ_ij = α_i ⊗ β_j for a particular choice of i and j. Thereby, (α_i ⊗ β_j)(a, b) := α_i(a) ⊗ β_j(b) is the Kronecker product of the two matrices α_i(a) and β_j(b). Thus, a set of representatives of isomorphism classes of unitary irreps Ĝ can be constructed as

    Ĝ = { γ_ij = α_i ⊗ β_j | α_i ∈ Â, β_j ∈ B̂ }.   (26)

However, this result does not hold over the real field. In the rest of this section, we derive a variation of this theorem using real irreducible representations, which is more useful to describe the representations of the groups we used.

In Appendix C.3, we have already shown that any real irrep ψ of G can be classified in one of the following three categories, where σ is a complex irrep of G:
- real type: the complexification of ψ is isomorphic to σ, with σ ≅ σ̄;
- complex type: ψ ≅ σ_R is the realification of σ, whose complexification is σ ⊕ σ̄, with σ ≇ σ̄;
- quaternionic type: ψ ≅ σ_R is the realification of σ, whose complexification is σ ⊕ σ̄ ≅ σ ⊕ σ, as σ ≅ σ̄.
Moreover, for each complex irrep σ, there exists only one real irrep ψ which contains it.

We can use this classification starting from the complex irreps of G which we have already defined in Eq. 26. First, we clarify some notation. We use Greek letters like α, β, γ to indicate complex irreps and Roman letters a, b, c to indicate the corresponding real irreps. For each complex irrep γ_ij, there can be only one real irrep c_ij containing it; it follows that α_i ⊗ β_j only appears in c_ij. Moreover, if ᾱ_i ≇ α_i, i.e., a_i is of complex type, we index the complex irrep ᾱ_i with −i, i.e., ᾱ_i = α_{−i} (and similarly for β̄_j = β_{−j}).

We can now introduce a useful result for real irreps:

Definition F.1 (Frobenius-Schur Indicator). Let G be a compact group and ρ a complex irreducible representation of G. Let χ be the character of ρ, i.e., χ : G → C, g ↦ Tr(ρ(g)). The Frobenius-Schur indicator ι_ρ of ρ is defined as:

    ι_ρ = ∫_G χ(g²) dg.

This indicator can only take the following values:
- ι_ρ = 1, if ρ is contained in a real type irrep;
- ι_ρ = 0, if ρ is contained in a complex type irrep;
- ι_ρ = −1, if ρ is contained in a quaternionic type irrep.

This allows us to analytically determine the type of the real irrep associated with a complex one. We can combine this result with the definition of the complex irreps of G = A × B in Eq. 26 to find the type of the real irreps of G. Let γ_ij = α_i ⊗ β_j be a complex irrep of G; then:

    ι_{γij} = ∫_G Tr( (α_i ⊗ β_j)(g²) ) dg = ∫_A ∫_B Tr( α_i(a²) ⊗ β_j(b²) ) da db.

Using the property of the trace of the Kronecker product:

    = ∫_A ∫_B Tr(α_i(a²)) Tr(β_j(b²)) da db = ( ∫_A Tr(α_i(a²)) da ) ( ∫_B Tr(β_j(b²)) db ) = ι_{αi} · ι_{βj}.

Hence, the types of the irreps α_i and β_j of A and B fully determine the type of the irrep γ_ij of G. This can be visualized in the following table:

    ι_{γij}                          | β_j real (ι_{βj} = 1) | β_j complex (ι_{βj} = 0) | β_j quaternionic (ι_{βj} = −1)
    α_i real (ι_{αi} = 1)            | 1                     | 0                        | −1
    α_i complex (ι_{αi} = 0)         | 0                     | 0                        | 0
    α_i quaternionic (ι_{αi} = −1)   | −1                    | 0                        | 1
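For a finite group, the integral in Definition F.1 becomes the average (1/|G|) Σ_g χ(g²). The following minimal numpy sketch (our own illustration; the chosen groups and irreps are assumptions made for the demo) computes the indicator for a complex-type irrep of C4 and for the quaternionic-type 2-dimensional irrep of the quaternion group Q8:

    import numpy as np

    def fs_indicator(elements, square, chi):
        # iota = (1/|G|) * sum_g chi(g^2)
        return sum(chi(square(g)) for g in elements) / len(elements)

    # C4 = Z/4 with the non-real 1-dim irrep chi(g) = i^g
    iota_c4 = fs_indicator(range(4), lambda g: (2 * g) % 4, lambda g: 1j ** g)
    print(np.round(iota_c4, 6))    # 0 -> complex type

    # Q8 = {+-1, +-i, +-j, +-k} via its 2-dim complex irrep
    one = np.eye(2)
    I = np.array([[1j, 0], [0, -1j]])
    J = np.array([[0, 1], [-1, 0]])
    K = I @ J
    elems = [s * M for s in (1, -1) for M in (one, I, J, K)]
    iota_q8 = sum(np.trace(M @ M) for M in elems).real / len(elems)
    print(np.round(iota_q8, 6))    # -1 -> quaternionic type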
In the rest of this section, we will use the following two useful facts:

    Re(γ_ij) = Re(α_i ⊗ β_j) = Re(α_i) ⊗ Re(β_j) − Im(α_i) ⊗ Im(β_j)
    Im(γ_ij) = Im(α_i ⊗ β_j) = Re(α_i) ⊗ Im(β_j) + Im(α_i) ⊗ Re(β_j)

We will also assume the irreps to be expressed in the bases described in Section C.3. In particular, if α_i (or β_j) is a real type irrep, we assume it is expressed with respect to a basis such that it has real entries, i.e., α_i = a_i (and β_j = b_j); in such a case, Re(α_i) = a_i and Im(α_i) = 0. If α_i (or β_j) is a complex or quaternionic type irrep, we assume a_i (or b_j) is expressed as the realification:

    a_i = ( Re(α_i)  −Im(α_i) ; Im(α_i)  Re(α_i) ).

This enables us to extract Re(α_i) and Im(α_i) directly from the matrix coefficients of a_i (or b_j). Therefore, to build the real irreps of G = A × B, we distinguish three cases, depending on the type of the irrep.

Real type irrep. Assume γ_ij has real type. This implies that either both α_i and β_j have real type or both have quaternionic type. In the first case, g_ij is simply computed as g_ij = a_i ⊗ b_j. In the second case, assume α_i and β_j have the structure in Eq. 17. Then, one can verify that the change of basis

    Q = (1/√2) ( I  0  0  I ; −iI  0  0  iI ; 0  I  −I  0 ; 0  −iI  −iI  0 )

turns γ_ij = α_i ⊗ β_j into a matrix with only real entries, i.e., g_ij = Q γ_ij Q†, where I is the identity matrix of size dim(γ_ij)/4.

Complex type irrep. Whenever γ_ij has complex type, we can easily implement g_ij as the realification:

    g_ij := ( Re(γ_ij)  −Im(γ_ij) ; Im(γ_ij)  Re(γ_ij) ).

By using the identities above, we can easily compute this matrix from the entries of the matrices of a_i and b_j. Note that, if both α_i and β_j have complex type, the irreps γ_ij = α_i ⊗ β_j and γ_{i,−j} = α_i ⊗ β̄_j generate two different real irreps g_ij and g_{i,−j}. Since β_j and β̄_j realify to isomorphic real irreps, only one of the two belongs to the set of representatives B̂ chosen. However, Re(β̄_j) = Re(β_j) and Im(β̄_j) = −Im(β_j), so g_{i,−j} can still be computed easily from the coefficients of a_i and b_j.

Quaternionic type irrep. Like for the complex type, whenever γ_ij has quaternionic type, we can easily implement g_ij as the realification:

    g_ij := ( Re(γ_ij)  −Im(γ_ij) ; Im(γ_ij)  Re(γ_ij) ).

Additionally, note that this is the case only if either α_i or β_j is of real type, while the other is of quaternionic type. Hence, the formula can be simplified by noting that either Im(α_i) or Im(β_j) is zero.

G 3D SYMMETRIES, AZIMUTHAL SYMMETRIES AND THEIR RELEVANCE

E(3) is the group of isometries of R³ and includes translations (R³, +), rotations SO(3) and the inversion through the origin Inv (Inv ≅ C₂ is the group containing the identity element and the inversion x ↦ −x of R³). The group O(3) = SO(3) × Inv is the compact group of origin-preserving isometries. While continuous rotational symmetry SO(3) occurs commonly in molecular data, smaller symmetry groups are sometimes more relevant to describe scenes or objects in natural environments. For instance, when the vertical orientation of the scene is obvious, the symmetry group of the data is reduced to axial symmetries. There is only a limited number of topologically closed infinite subgroups of O(3) (up to conjugation). The building blocks are rotations along an axis (SO(2)), inversions (Inv), mirroring with respect to a plane (M) and an out-of-plane rotation by π (F). Together, they generate 5 continuous symmetry groups: axial symmetry (SO(2) < SO(3)), dihedral symmetry (O(2) ≅ SO(2) ⋊ F < SO(3)), conical symmetry (O(2) ≅ SO(2) ⋊ M < O(3)), cylindrical symmetry (Inv × SO(2) < O(3)) and full cylindrical symmetry (Inv × (SO(2) ⋊ F) < O(3)).

In the planar case, Weiler & Cesa (2019) observed benefits from using discrete groups rather than continuous ones when the data is discretized on a grid and, therefore, the continuous symmetry of the data is lost. The only discrete subgroups of SO(3) with more than 2 rotation axes are the symmetry groups of the Platonic solids: the tetrahedral (T), the octahedral (O) and the icosahedral (I) groups, which have, respectively, 12, 24 and 60 elements. T and O are perfect symmetries of a voxel grid and were previously used in Worrall & Brostow (2018); Winkels & Cohen (2018). Also, note that T < I but O ≮ I; hence, an I-equivariant CNN is not equivariant to all symmetries of the voxel grid.

H FURTHER DETAILS ON NETWORK DESIGNS AND EXPERIMENTS

H.1 ESTIMATING THE EQUIVARIANCE ERROR OF STEERABLE BASES

In this section, we provide more details on Fig. 5. In this experiment, we considered equivariance to the icosahedral group G = I.
We employ 5 × 5 × 5 filters, parameterized by two spherical shells (one at radius 1 and one at radius 2) and the origin. In the two baselines, we use the orbits containing the vertices of either the icosahedron or the dodecahedron in the two spherical shells, plus the orbit containing only the origin. The kernel is parameterized independently along each orbit by directly using the Wigner-Eckart theorem from Lang & Weiler (2020) and, then, the kernels are embedded in R³ by means of a small Gaussian smoothing kernel. In our basis, we parameterize each shell with spherical harmonics band-limited to frequency 2 (the origin contains only frequency 0); each shell is diffused along the radial direction with a small Gaussian kernel.

Each histogram in Fig. 5 shows the relative equivariance error of each element of one of the bases. A histogram contains a point for each filter k in the basis associated with each pair of input and output irreps (ρ_in, ρ_out) of G = I. The error of a basis filter k is computed as the average over all 60 rotations g ∈ I of the error returned by the following procedure:

    gx  ← downsample(g.X)
    x   ← downsample(X)
    fgx ← conv3d(gx, k)
    fx  ← conv3d(x, k)
    gfx ← ρ_out(g) fx
    return compare(fgx, gfx)

where x (and gx) have the same resolution as the filter k, such that the outputs fx and fgx do not have spatial resolutions. In this way, g acts on fx only by transforming its channels, and no spatial interpolation is necessary. To reduce the interpolation artifacts when rotating X ↦ g.X, we perform the rotation at a higher resolution and then downsample the inputs by a factor of 3.
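For concreteness, the following PyTorch sketch (our own illustration; the names, the pre-rotated inputs X_rot and the relative-error metric used as `compare` are assumptions) makes the procedure above runnable. X_rot[g] is assumed to hold the input X pre-rotated by g at 3x resolution, with g = 0 indexing the identity, rho_out[g] the output representation matrices, and k a steerable filter of shape (c_out, c_in, 5, 5, 5):

    import torch
    import torch.nn.functional as F

    def equivariance_error(X_rot, rho_out, k):
        errors = []
        for g in rho_out:
            gx = F.avg_pool3d(X_rot[g], 3)      # downsample(g.X), 15^3 -> 5^3
            x = F.avg_pool3d(X_rot[0], 3)       # downsample(X)
            fgx = F.conv3d(gx, k).flatten(1)    # no padding: 1x1x1 spatial output
            fx = F.conv3d(x, k).flatten(1)
            gfx = fx @ rho_out[g].T             # rho_out(g) acts on the channels only
            errors.append(
                (fgx - gfx).norm() / (fgx.norm() + gfx.norm() + 1e-12)
            )
        return torch.stack(errors).mean()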
H.2 REGULAR AND QUOTIENT NON-LINEARITIES

Quotient- and regular-type features encode square-integrable scalar functions over a homogeneous space Q = G/H of G (with H = {e} and Q = G for the regular case). Whenever Q = G/H is infinite, one can rely on a band-limited representation of these functions in the Fourier domain, using the harmonic bases we derived in Section D. More precisely, band-limited functions are parameterized by using a finite subset of the harmonic basis in Corollary 5, considering only those j in a subset G̃ ⊂ Ĝ. Then, the action of G on a function turns into the representation ρ_Q = ⊕_{j ∈ G̃} ⊕^{M_j} ρ_j acting on the coefficient vector which parameterizes it, where M_j is the multiplicity of the irrep ρ_j. Given one such vector, the function it parameterizes can be sampled at any point q ∈ Q by evaluating the harmonic basis on q.

In a neural network, a feature vector f(x) ∈ R^{d_ρ} of quotient (or regular) type ρ is interpreted as the coefficient vector parameterizing a band-limited function in L²(Q) = L²(G/H) on a basis, which transforms according to ρ. A point-wise non-linearity σ : R → R (e.g., ELU) is applied on a quotient feature f(x) : Q → R by composition, obtaining a new quotient feature f′(x), defined as [f′(x)](q) = σ([f(x)](q)). When Q is infinite, if f(x) is band-limited and σ is sufficiently smooth, we can recover an approximation of f′(x) from a finite number of samples.

Approximate point-wise non-linearity. Let Q̃ ⊂ Q be a finite set of samples from the space Q = G/H. We define the non-linearity as f(x) ↦ f′(x) = [FT ∘ σ ∘ IFT](f(x)), where FT is a Fourier transform, which recovers the coefficients of a band-limited function from a number of samples, σ is the point-wise non-linearity and IFT is an inverse Fourier transform, which just samples the function parameterized by the coefficients f(x) on the elements of Q̃. Note that IFT can be implemented as a simple multiplication with a matrix of shape |Q̃| × d_ρ and, similarly, FT can be implemented as a multiplication with a matrix of shape d_ρ × |Q̃|; hence f′(x) = FT · σ( IFT · f(x) ), where FT and IFT are interpreted as matrices. In practice, the matrix IFT is implemented as a matrix whose rows contain the Fourier transforms of Dirac delta functions centered at each sample in Q̃, while FT is implemented as the pseudo-inverse matrix FT = IFT⁺. This operation is only approximately equivariant, and the degree of equivariance strongly depends on the number of samples |Q̃|, as well as on their distribution over the space (and on the smoothness of σ). In particular, it is important that the samples in Q̃ uniformly cover the space Q = G/H and are, therefore, as far away from each other as possible. To generate a sampling set Q̃, one can use the elements of a discrete subgroup of G (e.g., the icosahedral group I < SO(3)) or optimize the location of |Q̃| points on Q by minimizing a potential energy, as done by Bekkers (2020).

Computational benefits. A typical feature f in our models includes multiple fields of the same type, i.e., f = ⊕_i f_i and ρ = ⊕_i ρ_i, with ρ_i = ρ_{i′} for all i, i′. Assume a sequence ConvND - σ - ConvND in the model, where all layers operate on feature fields of type ρ. Let C = Σ_i d_{ρi} be the size of the feature vectors f(x) ∈ R^C. In our neural network designs, we ensure that the number of channels in each layer is approximately constant across different models. In particular, in models using quotient (or regular) feature types, we count the number of samples |Q̃| as channels and, therefore, keep the quantity Σ_i |Q̃| ≈ N approximately constant. Let α = |Q̃| / d_{ρi}; this implies that the first ConvND only has Σ_i d_{ρi} ≈ N/α output channels (and the second ConvND has this many input channels). Let k be the cost of a single convolution with 1 input and 1 output channel, and let m be the number of pixels in the input grid. The cost of the non-linearity applied to a single feature field is O(d_{ρi} |Q̃|), and there are roughly N/|Q̃| such input fields. Then, the cost of this layer is only O(km·C·N/α + m·(N/|Q̃|)·|Q̃|·d_{ρi}) = O(km·C·N/α + m·N·d_{ρi}), which is equal to O(km·C·N/α), since k·C ≫ |Q̃| = α·d_{ρi} in practice. For comparison, the same block in a conventional network has cost O(km·C·N + m·N) = O(km·C·N); hence, this layer is approximately α times cheaper. In Section H.7, we compare the approximate inference time of different architectures to verify the computational benefits described above.
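The following numpy sketch (our own illustration; names and the choice of G = SO(2) with circular harmonics are assumptions made for brevity) shows the FT/IFT construction of the approximate point-wise non-linearity and its approximate equivariance:

    import numpy as np

    elu = lambda x: np.where(x > 0, x, np.exp(x) - 1)

    def circular_basis(thetas, max_freq=2):
        # rows: the harmonic basis sampled at each point (the Dirac delta rows)
        cols = [np.ones_like(thetas)]
        for f in range(1, max_freq + 1):
            cols += [np.cos(f * thetas), np.sin(f * thetas)]
        return np.stack(cols, axis=1)                   # shape (|Q|, d_rho)

    Q = np.linspace(0, 2 * np.pi, 12, endpoint=False)   # 12 samples, d_rho = 5
    IFT = circular_basis(Q)                             # (|Q|, d_rho)
    FT = np.linalg.pinv(IFT)                            # (d_rho, |Q|), FT = IFT^+

    f = np.random.randn(5)                              # input coefficient vector
    f_out = FT @ elu(IFT @ f)                           # FT( sigma( IFT f ) )

    def rho(g):
        # block-diagonal action on the coefficients: rho_0 + rho_1 + rho_2
        out = np.zeros((5, 5)); out[0, 0] = 1.0
        for idx, k in zip((1, 3), (1, 2)):
            c, s = np.cos(k * g), np.sin(k * g)
            out[idx:idx + 2, idx:idx + 2] = [[c, -s], [s, c]]
        return out

    g = 0.7
    err = np.linalg.norm(FT @ elu(IFT @ rho(g) @ f) - rho(g) @ f_out)
    print(err)   # small but non-zero: the operation is only approximately equivariant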
H.3 ARCHITECTURE DETAILS

All models used in our experiments consist of a sequence of a basic inverted-bottleneck residual block, or small variations of it (see each experiment's section). The basic block for a conventional network consists of a sequence Conv3D - BatchNorm - ELU - Conv3D, whose input and output are connected with a skip connection. The block has C_in input channels, and the first convolution maps to N > C_in channels. To downsample, we use stride 2 in the second Conv3D and average pooling in the skip connection. We use kernel_size = 3 for both Conv3D. If the block has a different number of channels C_out in output, the skip connection also contains a convolution with kernel_size = 1. We refer to the linear path in the architecture, i.e., the sequence of skip connections, as the backbone of the model. If C_j is the number of input channels of the j-th residual block, we refer to [C₁, C₂, …] as the number of channels in the backbone features of the model. If N_j is the number of channels used by the non-linearity of the j-th residual block, we refer to [N₁, N₂, …] as the number of channels in the residual blocks of the model.

In an equivariant model, we associate a stack of field types {ρ_i}_i to the input C_in channels, such that Σ_i d_{ρi} ≈ C_in. Similarly, we associate a stack of field types to the N channels produced by the first Conv3D. The particular choice here depends on the kind of equivariant non-linearity used in place of ELU. In general, we try to ensure that the number of channels used in the non-linearity is approximately equal to N, for a fair comparison. We now list a number of the non-linearities considered.

Gated non-linearities. The input consists of a number of feature fields f(x) = ⊕_i f_i(x) of type ⊕_i ρ_i. Recall that f_i(x) ∈ R^{d_{ρi}}. For each feature field i, the input should also contain a gate g_i(x) ∈ R of trivial type. Then, each input feature field f_i(x) is mapped to f′_i(x) = σ(g_i(x)) f_i(x), where σ is a non-linearity like the sigmoid (a minimal implementation is sketched at the end of this subsection). Note that a gated non-linearity inherently loses some of its input channels, i.e., the input gates {g_i}_i. In our models, we try to ensure that Σ_i (d_{ρi} + 1) ≈ N (the +1 corresponds to the size of the gate g_i(x) for each input feature field f_i). Note that a field type ρ_i can be any representation. In particular, we experiment with different choices of ρ_i; we consider {ρ_i}_i containing:
- each irrep of SO(3) of frequencies 1 to 2, in equal proportions;
- each irrep of SO(3) of frequencies 1 to 3, in equal proportions;
- each irrep of SO(3) of frequencies 1 to 2, with multiplicities proportional to their multiplicities in the regular representation of SO(3), i.e., for each trivial ρ₀, we include 3 copies of the frequency-1 irrep ρ₁ and 5 copies of ρ₂;
- copies of the quotient representation ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂, band-limited up to frequency 2;
- copies of the quotient representation ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂ ⊕ ρ₃, band-limited up to frequency 3;
- copies of the regular representation ρ_reg = ρ₀ ⊕ ρ₁^{⊕3} ⊕ ρ₂^{⊕5}, band-limited up to frequency 2.
To match the N channels, we also add trivial feature fields (ρ_i = ρ₀) and operate on them with a simple ELU non-linearity. Note that, in all cases, the first Conv3D has C_in input channels and N output channels, maintaining approximately the same computational cost as the conventional CNN.

Tensor-product non-linearity. Let f(x) = ⊕_i f_i(x) be the input feature field, with type ρ = ⊕_i ρ_i. In our experiments, we always chose all {ρ_i}_i equal to a unique type, i.e., ρ_i = ρ_{i′} for all i, i′. The tensor-product non-linearity maps f_i(x) ∈ R^{d_{ρi}} to f′_i(x) = f_i(x) ⊗ f_i(x) ∈ R^{d²_{ρi}}. Note the quadratic growth in the number of channels. In the experiments, we keep the output size approximately equal to N, i.e., Σ_i d²_{ρi} ≈ N. It is common in the literature to learn a linear projection of f′_i(x) to a lower-dimensional feature field f′′_i(x) of type ρ′_i, with d_{ρ′i} ≪ d²_{ρi}. The second Conv3D originally maps Σ_i d²_{ρi} ≈ N channels to the output C_out channels, while this projection would reduce the input channels to Σ_i d_{ρ′i} ≪ N, thereby reducing the computational cost of the layer proportionally. This, however, also results in a potential loss of expressiveness, since a Conv3D layer taking N channels in input can learn a larger space of operations. Since we do not expect this linear projection to improve the performance of this architecture, we do not include it in our experiments.
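As a concrete illustration of the gated non-linearity described above, the following PyTorch sketch (our own illustration; the channel layout, with all gates stored before the fields, is an assumption) applies sigmoid gates to a stack of equally-sized feature fields:

    import torch

    def gated_nonlinearity(x, n_fields, d):
        # x: (batch, n_fields * (1 + d), D, H, W); first n_fields channels are gates
        gates, fields = x[:, :n_fields], x[:, n_fields:]
        gates = torch.sigmoid(gates)                        # sigma(g_i(x))
        b, _, D, H, W = fields.shape
        fields = fields.view(b, n_fields, d, D, H, W)
        out = fields * gates.view(b, n_fields, 1, D, H, W)  # f'_i = sigma(g_i) f_i
        return out.view(b, n_fields * d, D, H, W)

Since each gate is a trivial (invariant) scalar field, rescaling a field by it preserves the field's transformation law, which is why this operation is exactly equivariant.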
Finally, in our experiments we consider the following choices of ρ_i for the tensor-product non-linearity:
- the quotient representation ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂, band-limited up to frequency 2;
- the quotient representation ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂ ⊕ ρ₃, band-limited up to frequency 3;
- the regular representation ρ_reg = ρ₀ ⊕ ρ₁^{⊕3} ⊕ ρ₂^{⊕5}, band-limited up to frequency 2.
Note that a larger feature type ρ_i (i.e., larger d_{ρi}) results in a smaller number of independent copies of it, since we constrain Σ_i d²_{ρi} ≈ N. On the other hand, a larger ρ_i makes the tensor product f_i(x) ⊗ f_i(x) more expressive, since it combines more different input channels. Therefore, the choice of the size of ρ_i should be a trade-off between expressiveness and computational and memory cost.

(Pointwise) Regular non-linearities. Again, we consider all input feature fields of the same type, i.e., ρ_i = ρ_{i′} for all i, i′. Here, f_i(x) ∈ R^{d_{ρi}} is interpreted as the coefficient vector parameterizing a band-limited function in L²(G), as described in Section H.2. We consider band-limiting up to frequency 1, 2 or 3, i.e., the three following choices of ρ_i:
- ρ_i = ρ₀ ⊕ ρ₁^{⊕3}
- ρ_i = ρ₀ ⊕ ρ₁^{⊕3} ⊕ ρ₂^{⊕5}
- ρ_i = ρ₀ ⊕ ρ₁^{⊕3} ⊕ ρ₂^{⊕5} ⊕ ρ₃^{⊕7}
We experiment with different sampling sets G̃ ⊂ G (some of which can be generated with the potential-energy sketch below):
- the 24 elements of the octahedral group O < SO(3);
- the 60 elements of the icosahedral group I < SO(3);
- P points over G with maximal relative distance, obtained by optimizing a potential energy;
- P = 24 · |S| points over G with maximal relative distance, obtained by optimizing over a set S ⊂ G a potential energy on G̃ = ∪_{o ∈ O} {o.g | g ∈ S}, such that G̃ has octahedral symmetry: for any o ∈ O, o.G̃ = G̃.
In particular, the last sampling set guarantees that the resulting neural network is at least perfectly equivariant to O, i.e., the symmetries of the voxel grid. In general, we observe that at least |G̃| ≈ 2 d_{ρi} samples are necessary for achieving an equivariance error below 10%. For G = O(3), we use a similar design and just augment the sampling set with inversions, i.e., we use G̃ = Inv × G̃′, where G̃′ is a sampling set used for SO(3). We argued about the computational benefits of this design in Section H.2. For discrete groups such as O and I, we also use regular non-linearities. In this case, however, the first Conv3D in the residual block directly uses a regular feature type as output type and is immediately followed by the point-wise non-linearity, with no need to perform sampling. Hence, the computational and memory cost of these models is equivalent to that of a conventional network, i.e., they have no additional computational gain.

(Pointwise) Quotient non-linearities. Similarly, we define quotient non-linearities, with the only difference that f_i(x) parameterizes a band-limited function over a homogeneous space Q = G/H, for some H < G; see Section H.2. For G = SO(3), we consider the following quotient representations ρ_i:
- ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂
- ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂ ⊕ ρ₃
and the following sampling sets:
- the (centers of the) 30 edges of the icosahedron; note that this set has icosahedral symmetry;
- P points over Q with maximal relative distance, obtained by optimizing a potential energy;
- P = 24 · |S| points over Q with maximal relative distance, obtained by optimizing over a set S ⊂ Q a potential energy on Q̃ = ∪_{o ∈ O} {o.q | q ∈ S}, such that Q̃ has octahedral symmetry: for any o ∈ O, o.Q̃ = Q̃.
In particular, the last sampling set guarantees that the resulting neural network is at least perfectly equivariant to O, i.e., the symmetries of the voxel grid. Additionally, note that, for Q = S², the second sampling set corresponds to the well-known Thomson problem.
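The following PyTorch sketch (our own illustration; a Coulomb-like repulsion on S² and the optimizer settings are assumptions) generates such a sampling set by minimizing a potential energy with projected gradient descent:

    import torch

    def sample_sphere(P, steps=2000, lr=1e-2):
        x = torch.randn(P, 3)
        x = x / x.norm(dim=1, keepdim=True)
        x.requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            d = torch.cdist(x, x) + torch.eye(P)    # avoid the zero self-distances
            energy = (1.0 / d).triu(1).sum()        # pairwise Coulomb-like repulsion
            energy.backward()
            opt.step()
            with torch.no_grad():
                x /= x.norm(dim=1, keepdim=True)    # project back onto the sphere
        return x.detach()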
When G = O(3), we consider Q = O(3)/O(2) = S² and Q = O(3)/SO(2) = Inv × S². In the first case, we can reuse the same sampling sets defined above for Q = SO(3)/SO(2) = S². In the second one, we augment the sampling set with inversions. The considerations about the computational cost discussed in Section H.2 also apply here.

Invariant maps. Because we consider invariant tasks, an equivariant architecture should map to invariant features before the final classification layer. To do so, it is common to use some invariant map at the end of the convolutional feature extractor and then feed the invariant features into a few final fully-connected layers. Here, we consider a few different invariant maps.

Norm-pooling: given an input field f_i(x) ∈ R^{d_{ρi}}, we compute its norm, i.e., f_i(x) ↦ f′_i(x) = ‖f_i(x)‖₂ ∈ R; note that this is indeed an invariant quantity (a minimal implementation is sketched below). We use this kind of map for models which should be perfectly G-equivariant when G is a continuous group; for instance, we use it for models which use gated non-linearities or tensor-product non-linearities. We experiment with different choices of ρ_i:
- the quotient representation ρ_{S²} = ρ₀ ⊕ ρ₁ ⊕ ρ₂, band-limited up to frequency 2;
- the regular representation ρ_reg = ρ₀ ⊕ ρ₁^{⊕3}, band-limited up to frequency 1;
- the regular representation ρ_reg = ρ₀ ⊕ ρ₁^{⊕3} ⊕ ρ₂^{⊕5}, band-limited up to frequency 2;
- {ρ_i}_i containing copies of the irreps ρ₀, ρ₁ and ρ₂ in equal proportion.
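A minimal PyTorch sketch of norm-pooling (our own illustration; the channel layout, with n_fields consecutive blocks of size d, is an assumption):

    import torch

    def norm_pool(x, n_fields, d):
        # x: (batch, n_fields * d, D, H, W) -> (batch, n_fields, D, H, W)
        b, _, D, H, W = x.shape
        x = x.view(b, n_fields, d, D, H, W)
        return x.norm(dim=2)    # ||f_i(x)||_2 is invariant for orthogonal rho_i

Since the representations considered are orthogonal, the norm of each field is preserved under the group action, which makes the output exactly invariant.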
H.4 MNIST EXPERIMENTS

As a proof of concept, we train a conventional and a C8-equivariant model on rotated MNIST. Each model consists of a sequence of 6 residual blocks similar to the ones described in Section H.3. With respect to that design, here we use Conv2D layers and stride 2 in the second convolution. We downsample (with a stride-2 convolution) every two blocks. In the first layer of the first block, we use kernel_size = 7. The number of channels in the features of the backbone, connected through the skip connections, is C = [1, 21, 54, 72, 108, 168]. The number of channels in each residual block is N = [96, 192, 288, 288, 576, 576]. Both models map to 128 final invariant features. The conventional model uses a final 3 × 3 max pooling over the spatial dimensions. In the equivariant model, we do not pad the convolution in the last residual block, such that its output already has 1 × 1 resolution; the last block maps to 128 copies of the regular representation of G = C8, and we then apply group pooling over them to generate 128 invariant features. Finally, both models use 2 fully-connected layers to perform classification.

We train both models using Adam. We performed simple tuning of batch size, learning rate and weight decay of each model independently, by evaluating them on the validation set. The same number of hyper-parameter combinations was evaluated for both models. Once the best set of hyper-parameters was chosen, we re-trained the model with this configuration 5 more times using different random seeds. We report the means and standard deviations of the test performance over these runs.

H.5 MODELNET10 EXPERIMENTS

To generate the two ModelNet10 datasets, we use the original mesh data. Each mesh is centered and scaled to fit in the [−1, 1]³ cube. Then, for each mesh, we apply 3 random rotations. In the normal ModelNet10 experiments, we only sample SO(2) rotations around the Z axis. In the rotated ModelNet10 experiments, we sample arbitrary SO(3) rotations. A mesh is then rendered by computing the signed distance of each point in the voxel grid from the mesh. This distance is then passed through a Gaussian kernel to generate a smooth 3D occupancy grid of resolution 33 pixels.
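As an illustration of this rendering step, the following sketch voxelizes a single mesh. The signed_distance callable stands for any mesh-distance routine (e.g., from a mesh-processing library), and the Gaussian width sigma is our assumption, since the text does not specify it.

```python
import numpy as np

def render_occupancy(signed_distance, resolution=33, sigma=0.1):
    """Smooth occupancy grid from a mesh's signed distance function.

    signed_distance: hypothetical callable mapping (M, 3) points to (M,) distances.
    sigma: assumed Gaussian kernel width (not specified in the text).
    """
    axis = np.linspace(-1.0, 1.0, resolution)     # meshes are scaled to [-1, 1]^3
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    d = signed_distance(grid.reshape(-1, 3)).reshape((resolution,) * 3)
    return np.exp(-0.5 * (d / sigma) ** 2)        # smooth occupancy values in (0, 1]
```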
The architecture consists of a first Conv3D with kernel_size = 5, followed by 4 residual blocks and a final Conv3D layer. The second convolution layer of each of the 4 residual blocks uses stride 2. All models map to 128 final invariant features. The last convolution layer has kernel_size = 3. In the conventional networks, the last convolution layer has 128 output channels and uses padding = 1, so its output has resolution 3 pixels; we apply max pooling with a window of size 3 to generate 128 translation-invariant features of resolution 1 pixel. In the equivariant networks, the last convolution uses padding = 0, such that the output resolution is already 1 pixel; however, the last block maps to a larger number of channels 128·P, depending on the invariant map chosen. We choose this design to keep the computational and memory cost of the equivariant and the conventional models comparable; indeed, the final features of the conventional network have shape 128 × 3³ while the features of the equivariant one have shape 128·P × 1³, before applying, respectively, max pooling or an invariant map. We try to keep P ≈ 3³ = 27. In the architectures, we use N = [240, 480, 480, 960] channels in the residual blocks and C = [39, 78, 240, 480, 312] channels in the backbone. Finally, three fully-connected layers, alternated with dropout, batch normalization and ELU non-linearities, use these invariant features to perform classification.

For the SO(2) model, we used regular non-linearities with regular representations band-limited up to frequency 3 and |G̃| = 12 samples on a regular grid. Similarly, for the O(2), Inv × SO(2) and Inv × O(2) models, we used regular representations band-limited up to frequency 3, and G̃ was built by extending a grid of 12 points on their SO(2) subgroup with reflections and flips. In all cases, as invariant map, we used quotient average pooling with X = S¹ ≅ SO(2) and maximum frequency 3. In the SO(3) models using regular non-linearities, we used features band-limited up to frequency 2 when |G̃| = 60 and up to frequency 3 when |G̃| = 120. For the O(3) model, we used regular features band-limited up to frequency 2. In both the SO(3) and O(3) models using quotient features, we used features band-limited up to frequency 2.

For each equivariant model, the choice of feature types in the backbone (the input and output types of each residual block) is independent of the non-linearities used. Moreover, since no non-linearities are applied to these features, w.l.o.g. we can consider only irrep feature types. We treat the multiplicity of each irrep as a hyper-parameter, tuned during the hyper-parameter search.

We train all models using Adam. For each model, we independently performed simple tuning of batch size, learning rate and weight decay by evaluating them on the validation set. This hyper-parameter search also includes a search over the variants of each architecture design (see the variants of each non-linear operation described in Section H.3). The same number of hyper-parameter combinations was evaluated for all models. Once the best set of hyper-parameters was chosen, we re-trained the model with this configuration 5 more times using different random seeds. We report the means and standard deviations of the test performance over these runs.

H.6 AUTO-ENCODER EXPERIMENTS ON MODELNET10

For completeness, here we extend the comparison between the steerable bases discussed in Section 5.2 with an auto-encoding experiment on ModelNet10 data. In this experiment, we use the rotated voxel data from Section H.5 and train a model to reconstruct its own input. We consider an auto-encoder architecture whose encoder and decoder comprise 5 blocks, each down/upsampling the features by a factor of 2, such that the latent features have no spatial resolution. Each block is a residual block including a sequence Conv3D - BatchNorm - ELU followed by either a Conv3D or a Conv3DTransposed; all convolutions have kernel_size = 5 and padding = 2, and the last Conv3D/Conv3DTransposed has stride = 2. The residual connection down/upsamples with a Gaussian kernel followed by a learnable 1 × 1 × 1 convolution. Finally, the model includes an invariant map and a Conv3D layer to predict a single value at each voxel cell. The outputs of the encoder's blocks have, in order, approximately [30, 90, 240, 540, 1080] output channels, while the decoder's blocks have [540, 240, 120, 90, 50] output channels. The ELU modules within each block have about [180, 360, 600, 840, 1080] and [1080, 840, 600, 360, 180] channels.

The models are trained on the data from Section H.5, which includes 3 random rotations of each mesh. Testing is performed on a subset of the test meshes, but each mesh is rotated by each element of the icosahedral group (i.e., |I| = 60 times), since the purpose is to verify the equivariance of the model to the icosahedral group I. As in Section 5.2, we consider two versions of the same model, which differ by the choice of steerable basis employed. The following table reports the test mean-squared errors (MSE) of the two models.

Description                                     MSE
Icosahedral Symmetry                            0.785 ± 0.130
Icosahedral Symmetry (finite-orbits basis)      0.948 ± 0.004

Once again, we find that the model using our anti-aliased basis outperforms the baseline using a finite-orbits basis.
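For concreteness, a hedged sketch of this testing protocol follows; icosahedral_rotations (a list of 60 rotation matrices), voxelize, and the model name are all hypothetical placeholders, not identifiers from the original code.

```python
import torch

def test_equivariance(model, meshes, icosahedral_rotations, voxelize):
    """Average reconstruction MSE over all meshes and all 60 icosahedral rotations."""
    errors = []
    with torch.no_grad():
        for mesh in meshes:
            for R in icosahedral_rotations:
                x = voxelize(mesh, rotation=R)   # assumed (1, 1, 33, 33, 33) occupancy grid
                errors.append(torch.mean((model(x) - x) ** 2).item())
    return sum(errors) / len(errors)
```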
H.7 COMPUTATIONAL COST AND INFERENCE TIME OF THE MODELS

In Section H.2, we argued that point-wise non-linearities based on the discrete Fourier transform and its inverse benefit from a lower computational cost due to the factorized linear layers. In this section, we verify this by comparing the inference times measured for different architectures. We measure inference time by feeding 100 batches of 8 random 33 × 33 × 33 inputs (after first feeding 10 more batches for warm-up). We report the mean and standard deviation over these runs in Tab. 4. The architectures used are the ones described in Sections H.3 and H.5. For reference, we also report the number of parameters of each model. Note also that the equivariant models include a small overhead due to the less optimized implementation of certain operations: while the I model has approximately the same running time as a conventional CNN, the gated non-linearity model and the scaled-up regular model (marked with *) are slower than the first two. Despite this overhead, the normal regular SO(3) model (without *) is much faster than all other models.

Table 4: Approximate inference times. * indicates wider models to preserve the computational cost.

G        Description                          time (ms)      # params (×10⁶)
{e}      Conventional CNN                     41.6 ± 0.2     36
I        Icosahedral Symmetry                 43.1 ± 0.2     0.49
SO(3)    Chiral (Gated non-linearities)       52.2 ± 0.2     0.37
SO(3)    Chiral (Regular, |G̃| = 96)           28.2 ± 0.2     0.25
SO(3)    Chiral (Regular, |G̃| = 192)*         51.6 ± 0.2     0.24
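The timing protocol is straightforward to reproduce; a possible PyTorch implementation is sketched below, where the single input channel and the CUDA synchronization calls are our assumptions rather than details stated in the text.

```python
import time
import torch

def measure_inference_time(model, device="cuda", warmup=10, runs=100):
    """Mean and std of per-batch inference time (ms) over `runs` timed batches."""
    model = model.to(device).eval()
    times = []
    with torch.no_grad():
        for i in range(warmup + runs):
            x = torch.randn(8, 1, 33, 33, 33, device=device)
            if device.startswith("cuda"):
                torch.cuda.synchronize()         # flush pending asynchronous kernels
            start = time.perf_counter()
            model(x)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            if i >= warmup:                      # discard the warm-up batches
                times.append((time.perf_counter() - start) * 1000.0)
    mean = sum(times) / len(times)
    std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
    return mean, std
```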
H.8 LBA EXPERIMENTS

To perform the LBA experiments, we use the atom3d library from Townshend et al. (2020). In particular, we adapted the 3DCNN example from their repository. In the example code, each test molecule is randomly rotated at every run of the test. To make the testing procedure deterministic, we generate a test set by rotating each test molecule with 5 random SO(3) rotations. The occupancy grid of a molecule is generated by setting to 1 the voxel cell closest to each atom. This results in a coarser rendering of the data compared with the ModelNet10 datasets. To mitigate the discretization noise, we use voxel grids with a higher resolution of 65 pixels; note that Townshend et al. (2020) used voxel grids with resolution 40 pixels. However, the larger resolution of the input data can drastically increase the memory cost of the models. To prevent this, all models include a first Conv3D layer with kernel_size = 7 and stride = 2, which immediately downsamples the input. This first convolution layer is then followed by 4 residual blocks. In each residual block, the stride-2 convolution is performed in the first Conv3D rather than the second. In the conventional network, the output resolution is 3 pixels and we use max pooling to reduce it to 1 pixel. In the equivariant models, the last residual block does not use padding, such that the output resolution is already 1 pixel. However, this last equivariant block has more output channels than the conventional network, since it is followed by an invariant map layer; see Section H.5 for more details about this design choice. In the architectures, we use N = [120, 120, 240, 480] channels in the residual blocks and C = [26, 32, 64, 128] channels in the backbone. Finally, all models produce 256 invariant features and terminate with two fully-connected layers, alternated with dropout, batch normalization and ELU, which perform the regression.

In the SO(3) architecture, we use regular non-linearities with regular features band-limited up to frequency 2. Because of the lower computational cost of this architecture, we scale up the number of channels in the residual blocks such that the computational cost is comparable with that of a conventional CNN. This results in approximately twice the number of channels of the original model; see the paragraph on regular non-linearities in Section H.3 for more details about the computational cost of these operations.

To make our results comparable with Townshend et al. (2020), we use the same train, validation and test splits. Moreover, during training, we augment each molecule with a random SO(3) rotation, as done in Townshend et al. (2020); one standard way to sample such rotations is sketched below. The training details and hyper-parameter tuning procedures are similar to those described in Section H.5.
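The following sketch samples a rotation uniformly with respect to the Haar measure by QR-decomposing a Gaussian matrix and fixing signs. This is one standard recipe, not necessarily the one used in the original code.

```python
import numpy as np

def random_so3_matrix(rng=np.random):
    """Sample a rotation matrix approximately Haar-uniformly from SO(3)."""
    A = rng.standard_normal((3, 3))
    Q, R = np.linalg.qr(A)
    Q = Q @ np.diag(np.sign(np.diag(R)))   # fix signs so Q is Haar-distributed on O(3)
    if np.linalg.det(Q) < 0:               # restrict from O(3) to SO(3)
        Q[:, 0] = -Q[:, 0]
    return Q
```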