# Efficient Generalized Spherical CNNs

Published as a conference paper at ICLR 2021

Oliver J. Cobb, Christopher G. R. Wallis, Augustine N. Mavor-Parker, Augustin Marignier, Matthew A. Price, Mayeul d'Avezac & Jason D. McEwen
Kagenova Limited, Guildford GU5 9LD, UK
Corresponding author: jason.mcewen@kagenova.com

ABSTRACT

Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity $\mathcal{O}(C^2 L^5)$, where $C$ is a measure of representational capacity and $L$ the spherical harmonic bandlimit. Such a high computational cost often prohibits the use of strictly equivariant spherical CNNs. We develop two new strictly equivariant layers with reduced complexity $\mathcal{O}(C L^4)$ and $\mathcal{O}(C L^3 \log L)$, making larger, more expressive models computationally feasible. Moreover, we adopt efficient sampling theory to achieve further computational savings. We show that these developments allow the construction of more expressive hybrid models that achieve state-of-the-art accuracy and parameter efficiency on spherical benchmark problems.

1 INTRODUCTION

Many fields involve data that live inherently on spherical manifolds, e.g. 360° photo and video content in virtual reality and computer vision, the cosmic microwave background radiation from the Big Bang in cosmology, topographic and gravitational maps in planetary sciences, and molecular shape orientations in molecular chemistry, to name just a few. Convolutional neural networks (CNNs) have been tremendously effective for data defined on Euclidean domains, such as the 1D line, 2D plane, or nD volumes, thanks in part to their translation invariance properties. However, these techniques are not effective for data defined on spherical manifolds, which have a very different geometric structure to Euclidean spaces (see Appendix A). To transfer the remarkable success of deep learning to data defined on spherical domains, deep learning techniques defined inherently on the sphere are required.

Recently, a number of spherical CNN constructions have been proposed. Existing CNN constructions on the sphere fall broadly into three categories: fully real (i.e. pixel) space approaches (e.g. Boomsma & Frellsen, 2017; Jiang et al., 2019; Perraudin et al., 2019; Cohen et al., 2019); combined real and harmonic space approaches (Cohen et al., 2018; Esteves et al., 2018; 2020); and fully harmonic space approaches (Kondor et al., 2018). Real space approaches can often be computed efficiently but they necessarily provide an approximate representation of spherical signals and the connection to the underlying continuous symmetries of the sphere is lost. Consequently, such approaches cannot fully capture rotational equivariance. Other constructions take a combined real and harmonic space approach (Cohen et al., 2018; Esteves et al., 2018; 2020), where sampling theorems (Driscoll & Healy, 1994; Kostelec & Rockmore, 2008) are exploited to connect with underlying continuous signal representations to capture the continuous symmetries of the sphere. However, in these approaches non-linear activation functions are computed pointwise in real space, which induces aliasing errors that break strict rotational equivariance.
Fully harmonic space spherical CNNs have been constructed by Kondor et al. (2018). A continual connection with underlying continuous signal representations is captured by using harmonic signal representations throughout. Consequently, this is the only approach exhibiting strict rotational equivariance. However, strict equivariance comes at great computational cost, which can often prohibit usage.

In this article we present a generalized framework for CNNs on the sphere (and rotation group), which encompasses and builds on the influential approaches of Cohen et al. (2018), Esteves et al. (2018) and Kondor et al. (2018) and allows them to be leveraged alongside each other. We adopt a harmonic signal representation in order to retain the connection with underlying continuous representations and thus capture all symmetries and geometric properties of the sphere. We construct new fully harmonic (non-linear) spherical layers that are strictly rotationally equivariant, are parameter-efficient, and dramatically reduce computational cost compared to similar approaches. This is achieved by a channel-wise structure, constrained generalized convolutions, and an optimized degree mixing set determined by a minimum spanning tree. Furthermore, we adopt efficient sampling theorems on the sphere (McEwen & Wiaux, 2011) and rotation group (McEwen et al., 2015a) to improve efficiency compared to the sampling theorems used in existing approaches (Driscoll & Healy, 1994; Kostelec & Rockmore, 2008). We demonstrate state-of-the-art performance on all spherical benchmark problems considered, both in terms of accuracy and parameter efficiency.

2 GENERALIZED SPHERICAL CNNS

We first overview the theoretical underpinnings of the spherical CNN frameworks introduced by Cohen et al. (2018), Esteves et al. (2018), and Kondor et al. (2018), which make a connection to underlying continuous signals through harmonic representations. For more in-depth treatments of the underlying harmonic analysis we recommend Esteves (2020), Kennedy & Sadeghi (2013) and Gallier & Quaintance (2019). We then present a generalized spherical layer in which these and other existing frameworks are encompassed, allowing existing frameworks to be easily integrated and leveraged alongside each other in hybrid networks. Throughout the following we consider a network composed of $S$ rotationally equivariant layers $A^{(1)}, \ldots, A^{(S)}$, where the $i$-th layer $A^{(i)}$ maps an input activation $f^{(i-1)} \in \mathcal{H}^{(i-1)}$ onto an output activation $f^{(i)} \in \mathcal{H}^{(i)}$. We focus on the case where the network input space $\mathcal{H}^{(0)}$ consists of spherical signals (but note that input signals on the rotation group may also be considered).

2.1 SIGNALS ON THE SPHERE AND ROTATION GROUP

Let $L^2(\Omega)$ denote the space of square-integrable functions over domain $\Omega$. A signal $f \in L^2(\Omega)$ on the sphere ($\Omega = S^2$) or rotation group ($\Omega = SO(3)$) can be rotated by $\rho \in SO(3)$ by defining the action of rotation on signals by $(\mathcal{R}_\rho f)(\omega) = f(\rho^{-1}\omega)$ for $\omega \in \Omega$. An operator $\mathcal{A}: L^2(\Omega_1) \to L^2(\Omega_2)$, where $\Omega_1, \Omega_2 \in \{S^2, SO(3)\}$, is then equivariant to rotations if $\mathcal{R}_\rho(\mathcal{A}(f)) = \mathcal{A}(\mathcal{R}_\rho f)$ for all $f \in L^2(\Omega_1)$ and $\rho \in SO(3)$, i.e. rotating the function before application of the operator is equivalent to application of the operator first, followed by a rotation.
A spherical signal $f \in L^2(S^2)$ admits a harmonic representation $(\hat{f}^0, \hat{f}^1, \ldots)$ where $\hat{f}^\ell \in \mathbb{C}^{2\ell+1}$ are the harmonic coefficients given by the inner products $\langle f, Y^\ell_m \rangle$, where $Y^\ell_m$ are the spherical harmonic functions of degree $\ell$ and order $|m| \leq \ell$. Likewise a signal $f \in L^2(SO(3))$ on the rotation group admits a harmonic representation $(\hat{f}^0, \hat{f}^1, \ldots)$ where $\hat{f}^\ell \in \mathbb{C}^{(2\ell+1)\times(2\ell+1)}$ are the harmonic coefficients with $(m,n)$-th entry $\langle f, D^\ell_{mn} \rangle$ for integers $|m|, |n| \leq \ell$, where $D^\ell: SO(3) \to \mathbb{C}^{(2\ell+1)\times(2\ell+1)}$ is the unique $(2\ell+1)$-dimensional irreducible group representation of $SO(3)$ on $\mathbb{C}^{2\ell+1}$. The rotation $f \mapsto \mathcal{R}_\rho f$ of a signal $f \in L^2(\Omega)$ can be described in harmonic space by $\hat{f}^\ell \mapsto D^\ell(\rho) \hat{f}^\ell$. A signal on the sphere or rotation group is said to be bandlimited at $L$ if, respectively, $\langle f, Y^\ell_m \rangle = 0$ or $\langle f, D^\ell_{mn} \rangle = 0$ for $\ell \geq L$. Furthermore, a signal on the rotation group is said to be azimuthally bandlimited at $N$ if, additionally, $\langle f, D^\ell_{mn} \rangle = 0$ for $|n| \geq N$. Bandlimited signals therefore admit finite harmonic representations $(\hat{f}^0, \ldots, \hat{f}^{L-1})$. In practice real-world signals can be accurately represented by suitably bandlimited signals; henceforth, we assume signals are bandlimited.

2.2 CONVOLUTION ON THE SPHERE AND ROTATION GROUP

A standard definition of convolution between two signals $f, \psi \in L^2(\Omega)$ on either the sphere ($\Omega = S^2$) or rotation group ($\Omega = SO(3)$) is given by

$$(f \star \psi)(\rho) = \langle f, \mathcal{R}_\rho \psi \rangle = \int_\Omega \mathrm{d}\mu(\omega)\, f(\omega)\, \psi^*(\rho^{-1}\omega), \quad (1)$$

where $\mathrm{d}\mu(\omega)$ denotes the Haar measure on $\Omega$ and $\cdot^*$ complex conjugation (e.g. Wandelt & Górski, 2001; McEwen et al., 2007; 2013; 2015b; 2018; Cohen et al., 2018; Esteves et al., 2018). In particular, the convolution satisfies

$$((\mathcal{R}_\rho f) \star \psi)(\rho') = \langle \mathcal{R}_\rho f, \mathcal{R}_{\rho'} \psi \rangle = \langle f, \mathcal{R}_{\rho^{-1}\rho'} \psi \rangle = (\mathcal{R}_\rho (f \star \psi))(\rho') \quad (2)$$

and is therefore a rotationally equivariant linear operation, which we shall denote by $\mathcal{L}(\psi)$. The convolution of bandlimited signals can be computed exactly and efficiently in harmonic space as

$$\widehat{(f \star \psi)}^\ell = \hat{f}^\ell \hat{\psi}^{\ell\dagger}, \quad \ell = 0, \ldots, L-1, \quad (3)$$

which for each degree $\ell$ is a vector outer product for signals on the sphere and a matrix product for signals on the rotation group (see Appendix B for further details). Convolving in this manner results in signals on the rotation group (for inputs on both the sphere and rotation group). However, in the spherical case, if the filter is invariant to azimuthal rotations the resultant convolved signal may be interpreted as a signal on the sphere (see Appendix B).

2.3 GENERALIZED SIGNAL REPRESENTATIONS

The harmonic representations and convolutions described above have proven useful for describing rotationally equivariant linear operators $\mathcal{L}(\psi)$. Cohen et al. (2018) and Esteves et al. (2018) define spherical CNNs that sequentially apply this operator, with intermediary representations taking the form of signals on $SO(3)$ and $S^2$ respectively. Alternatively, for intermediary representations we now consider the more general space of signals introduced by Kondor et al. (2018), to which the aforementioned notions of rotation and convolution naturally extend. In describing the generalization we first note from Section 2.1 that all bandlimited signals on the sphere and rotation group can be represented as a set of variable length vectors of the form

$$f = \{\hat{f}^\ell_t \in \mathbb{C}^{2\ell+1} : \ell = 0, \ldots, L-1;\ t = 1, \ldots, \tau^\ell_f\}, \quad (4)$$

where $\tau^\ell_f = 1$ for signals on the sphere and $\tau^\ell_f = \min(2\ell+1, 2N-1)$ for signals on the rotation group. The generalization is to let $\mathcal{F}_L$ be the space of all such sets of variable length vectors, with $\tau_f$ unrestricted.
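To make the structure of Equation 4 concrete, the following minimal sketch (our illustration, not the authors' implementation; all names are hypothetical) stores a generalized signal as one complex array of fragments per degree:

```python
import numpy as np

class GeneralizedSignal:
    """Hypothetical container for an element of F_L (Equation 4): for each
    degree ell it stores tau[ell] fragments, each a complex vector of
    length 2*ell + 1."""

    def __init__(self, fragments):
        # fragments[ell] is a complex array of shape (tau_ell, 2*ell + 1)
        self.fragments = [np.asarray(frag, dtype=complex) for frag in fragments]
        self.L = len(self.fragments)                           # bandlimit
        self.type = tuple(f.shape[0] for f in self.fragments)  # (tau_0, ..., tau_{L-1})

def sphere_signal(flm):
    """Wrap the harmonic coefficients of a bandlimited spherical signal,
    given as a list where flm[ell] has length 2*ell + 1, as a generalized
    signal of type (1, ..., 1)."""
    return GeneralizedSignal([np.asarray(f)[None, :] for f in flm])

# Example: a random generalized signal of type (1, 2, 3) at bandlimit L = 3.
rng = np.random.default_rng(0)
f = GeneralizedSignal([rng.standard_normal((tau, 2 * ell + 1))
                       for ell, tau in enumerate((1, 2, 3))])
print(f.type)  # (1, 2, 3)
```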
This more general space contains the spaces of bandlimited signals on the sphere and rotation group as strict subspaces. For a generalized signal $f \in \mathcal{F}_L$ we adopt the terminology of Kondor et al. (2018) by referring to $\hat{f}^\ell_t$ as the $t$-th fragment of degree $\ell$ and to $\tau_f = (\tau^0_f, \ldots, \tau^{L-1}_f)$, specifying the number of fragments for each degree, as the type of $f$. The action of rotations upon $\mathcal{F}_L$ can be naturally extended from their action upon $L^2(S^2)$ and $L^2(SO(3))$. For $f \in \mathcal{F}_L$ we define the rotation operator $f \mapsto \mathcal{R}_\rho f$ by $\hat{f}^\ell_t \mapsto D^\ell(\rho) \hat{f}^\ell_t$, allowing us to extend the usual notion of equivariance to operators $\mathcal{A}: \mathcal{F}_L \to \mathcal{F}_L$.

2.4 GENERALIZED CONVOLUTIONS

The convolution described by Equation 1 provides a learnable linear operator $\mathcal{L}(\psi)$ that satisfies the desired property of equivariance. Nevertheless, given the generalized interpretation of signals on $S^2$ and $SO(3)$ as signals in $\mathcal{F}_L$, the notion of convolution can also be generalized (Kondor et al., 2018). In order to linearly and equivariantly transform a signal $f \in \mathcal{F}_L$ of type $\tau_f$ into a new signal $f \star \psi \in \mathcal{F}_L$ of any desired type $\tau_{(f \star \psi)}$, we may specify a filter $\psi = \{\hat{\psi}^\ell \in \mathbb{C}^{\tau^\ell_f \times \tau^\ell_{(f \star \psi)}} : \ell = 0, \ldots, L-1\}$, which in general is not an element of $\mathcal{F}_L$, and define a transformation $f \mapsto f \star \psi$ by

$$\widehat{(f \star \psi)}^\ell_t = \sum_{t'=1}^{\tau^\ell_f} \hat{\psi}^\ell_{t',t}\, \hat{f}^\ell_{t'}, \quad \ell = 0, \ldots, L-1;\ t = 1, \ldots, \tau^\ell_{(f \star \psi)}. \quad (5)$$

The degree-$\ell$ fragments of the transformed signal $(f \star \psi)$ are simply linear combinations of the degree-$\ell$ fragments of $f$, with no mixing between degrees. Equation 3 shows that this is precisely the form taken by convolution on the sphere and rotation group. In fact Kondor & Trivedi (2018) show that all equivariant linear operations take this general form; the standard notion of convolution is just a special case. One benefit to the generalized notion is that the filter $\psi$ is not forced to occupy the same domain as the signal $f$, thus allowing control over the type $\tau_{(f \star \psi)}$ of the transformed signal. We use $\mathcal{L}_G(\psi)$ to denote this generalized convolutional operator.

2.5 NON-LINEAR ACTIVATION OPERATORS

For $\mathcal{F}_L$ to be a useful representational space, it must be possible to not only linearly but also non-linearly transform its elements in an equivariant manner. However, equivariance and non-linearity is not enough. Equivariant linear operators cannot mix information corresponding to different degrees. Therefore it is of crucial importance that degree mixing is achieved by the non-linear operator.

2.5.1 POINTWISE ACTIVATIONS

When the type $\tau_f$ of $f \in \mathcal{F}_L$ permits an interpretation as a signal on $S^2$ or $SO(3)$ we may perform an inverse harmonic transform to map the function onto a sample-based representation (e.g. Driscoll & Healy, 1994; McEwen & Wiaux, 2011; Kostelec & Rockmore, 2008; McEwen et al., 2015a). A non-linear function $\sigma: \mathbb{C} \to \mathbb{C}$ may then be applied pointwise, i.e. separately to each sample, before performing a harmonic transform to return to a representation in $\mathcal{F}_L$. We denote the corresponding non-linear operator as $\mathcal{N}_\sigma(f) = \mathcal{F}(\sigma(\mathcal{F}^{-1}(f)))$, where $\mathcal{F}$ represents the harmonic (i.e. Fourier) transform on $S^2$ or $SO(3)$. The computational cost of the non-linear operator is dominated by the harmonic transforms. While costly, fast algorithms can be leveraged (see Appendix A). While inverse and forward harmonic transforms on $S^2$ or $SO(3)$ that are based on a sampling theory maintain perfect equivariance for bandlimited signals, the pointwise application of $\sigma$ (most commonly ReLU) is only equivariant in the continuous limit $L \to \infty$.
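As an illustration of the operator $\mathcal{N}_\sigma$, the sketch below applies a pointwise ReLU via round-trip harmonic transforms, optionally on an oversampled grid to suppress (but not remove) the aliasing discussed next. It assumes the pyssht bindings to the ssht package and their coefficient ordering; treat the exact calls as assumptions rather than a prescribed API.

```python
import numpy as np
import pyssht  # python bindings to the ssht package (assumed available)

def pointwise_relu(flm, L, oversample=1):
    """Sketch of N_sigma for a real spherical signal: inverse transform to a
    sample grid at bandlimit oversample*L, apply ReLU pointwise, transform
    back and truncate to the original bandlimit L. flm is a flat array of
    L*L coefficients indexed by ell*ell + ell + m (pyssht convention)."""
    L_over = oversample * L
    flm_pad = np.zeros(L_over * L_over, dtype=complex)
    flm_pad[:L * L] = flm                      # zero-pad to higher bandlimit
    f = pyssht.inverse(flm_pad, L_over, Method='MW')
    f = np.maximum(f.real, 0.0)                # pointwise ReLU on samples
    flm_out = pyssht.forward(f.astype(complex), L_over, Method='MW')
    return flm_out[:L * L]                     # truncate back to bandlimit L
```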
For any finite bandlimit $L$, aliasing effects are introduced such that equivariance becomes approximate only, as shown by the following experiments. We consider 100 random rotations $\rho \in SO(3)$, for each of 100 random signal-filter pairs $(f, \psi)$, and compute the mean equivariance error $d(\mathcal{A}(\mathcal{R}_\rho f), \mathcal{R}_\rho(\mathcal{A}f))$ for operator $\mathcal{A}$, where $d(f, g) = \|f - g\| / \|f\|$ is the relative distance between signals. For convolutions the equivariance error is $4.4 \times 10^{-7}$ for signals on $S^2$ and $5.3 \times 10^{-7}$ for signals on $SO(3)$ (achieving floating point precision). By comparison the equivariance error for a pointwise ReLU is 0.34 for signals on $S^2$ and 0.37 for signals on $SO(3)$. Only approximate equivariance is achieved for the ReLU since the non-linear operation spreads information to higher degrees that are not captured at the original bandlimit, resulting in aliasing. To demonstrate this point we reduce aliasing error by oversampling the real-space signal. When oversampling by 2× or 8× for signals on $SO(3)$ the equivariance error of the ReLU is reduced to 0.10 and 0.01, respectively. See Appendix D for further experimental details. Despite the high cost of repeated harmonic transforms and imperfect equivariance, this is nevertheless the approach adopted by Cohen et al. (2018), Esteves et al. (2018) and others, who find empirically that such models maintain a reasonable degree of equivariance.

2.5.2 TENSOR-PRODUCT ACTIVATIONS

In order to define a strictly equivariant non-linear operation that can be applied to a signal $f \in \mathcal{F}_L$ of any type $\tau_f$, the decomposability of tensor products between group representations may be leveraged, as first considered by Thomas et al. (2018) in the context of neural networks. Given two group representations $D^{\ell_1}$ and $D^{\ell_2}$ of $SO(3)$ on $\mathbb{C}^{2\ell_1+1}$ and $\mathbb{C}^{2\ell_2+1}$ respectively, the tensor-product group representation $D^{\ell_1} \otimes D^{\ell_2}$ of $SO(3)$ on $\mathbb{C}^{2\ell_1+1} \otimes \mathbb{C}^{2\ell_2+1}$ is defined such that $(D^{\ell_1} \otimes D^{\ell_2})(\rho) = D^{\ell_1}(\rho) \otimes D^{\ell_2}(\rho)$ for all $\rho \in SO(3)$. Decomposing $D^{\ell_1} \otimes D^{\ell_2}$ into a direct sum of irreducible group representations then constitutes finding a change of basis for $\mathbb{C}^{2\ell_1+1} \otimes \mathbb{C}^{2\ell_2+1}$ such that $(D^{\ell_1} \otimes D^{\ell_2})(\rho)$ is block diagonal, where for each $\ell$ there is a block equal to $D^\ell(\rho)$. The necessary change of basis for $\hat{u}^{\ell_1} \otimes \hat{v}^{\ell_2} \in \mathbb{C}^{2\ell_1+1} \otimes \mathbb{C}^{2\ell_2+1}$ is given by

$$(\hat{u}^{\ell_1} \otimes \hat{v}^{\ell_2})^\ell_m = \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2} C^{\ell_1,\ell_2,\ell}_{m_1,m_2,m}\, \hat{u}^{\ell_1}_{m_1} \hat{v}^{\ell_2}_{m_2}, \quad (6)$$

where $C^{\ell_1,\ell_2,\ell}_{m_1,m_2,m} \in \mathbb{C}$ denote Clebsch-Gordan coefficients whose symmetry properties are such that $(\hat{u}^{\ell_1} \otimes \hat{v}^{\ell_2})^\ell_m$ is non-zero only for $|\ell_1 - \ell_2| \leq \ell \leq \ell_1 + \ell_2$. The use of Equation 6 arises naturally in quantum mechanics when coupling angular momenta. This property is useful since if $\hat{f}^{\ell_1} \in \mathbb{C}^{2\ell_1+1}$ and $\hat{f}^{\ell_2} \in \mathbb{C}^{2\ell_2+1}$ are two fragments that are equivariant with respect to rotations of the network input, then a rotation of $\rho$ applied to the network input results in $\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2}$ transforming as

$$[\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2}]^\ell \mapsto [(D^{\ell_1}(\rho)\hat{f}^{\ell_1}) \otimes (D^{\ell_2}(\rho)\hat{f}^{\ell_2})]^\ell \quad (7)$$
$$= [(D^{\ell_1} \otimes D^{\ell_2})(\rho)(\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2})]^\ell \quad (8)$$
$$= D^\ell(\rho)[\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2}]^\ell, \quad (9)$$

where the final equality follows by block diagonality with respect to the chosen basis. Therefore, if fragments $\hat{f}^{\ell_1}$ and $\hat{f}^{\ell_2}$ are equivariant with respect to rotations of the network input, then so is the fragment $(C^{\ell_1,\ell_2,\ell})^\top(\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2}) \in \mathbb{C}^{2\ell+1}$, where we have written Equation 6 more compactly. We now describe how Kondor et al. (2018) use this fact to define equivariant non-linear transformations of elements in $\mathcal{F}_L$.
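Before proceeding, the coupling $(C^{\ell_1,\ell_2,\ell})^\top(\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2})$ of Equation 6 can be made concrete. The sketch below is our illustration only: it obtains Clebsch-Gordan coefficients from sympy for clarity, whereas an efficient implementation would precompute them and exploit their $m = m_1 + m_2$ sparsity.

```python
import numpy as np
from sympy import S
from sympy.physics.quantum.cg import CG

def cg_tensor(l1, l2, l):
    """Clebsch-Gordan coefficients C^{l1,l2,l}_{m1,m2,m} as a dense array of
    shape (2*l1+1, 2*l2+1, 2*l+1); non-zero only when m = m1 + m2."""
    C = np.zeros((2 * l1 + 1, 2 * l2 + 1, 2 * l + 1))
    for m1 in range(-l1, l1 + 1):
        for m2 in range(-l2, l2 + 1):
            m = m1 + m2
            if abs(m) <= l:
                C[m1 + l1, m2 + l2, m + l] = float(
                    CG(S(l1), S(m1), S(l2), S(m2), S(l), S(m)).doit())
    return C

def couple_fragments(f1, f2, l):
    """Degree-l fragment (C^{l1,l2,l})^T (f1 tensor f2) of two equivariant
    fragments f1 (length 2*l1+1) and f2 (length 2*l2+1)."""
    l1, l2 = (len(f1) - 1) // 2, (len(f2) - 1) // 2
    return np.einsum('i,j,ijk->k', f1, f2, cg_tensor(l1, l2, l))
```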
A signal $f = \{\hat{f}^\ell_t \in \mathbb{C}^{2\ell+1} : \ell = 0, \ldots, L-1;\ t = 1, \ldots, \tau^\ell_f\} \in \mathcal{F}_L$ may be equivariantly and non-linearly transformed by an operator $\mathcal{N}_\otimes: \mathcal{F}_L \to \mathcal{F}_L$ defined as

$$\mathcal{N}_\otimes(f) = \{(C^{\ell_1,\ell_2,\ell})^\top(\hat{f}^{\ell_1}_{t_1} \otimes \hat{f}^{\ell_2}_{t_2}) : \ell = 0, \ldots, L-1;\ (\ell_1, \ell_2) \in P^\ell_L;\ t_1 = 1, \ldots, \tau^{\ell_1}_f;\ t_2 = 1, \ldots, \tau^{\ell_2}_f\}, \quad (10)$$

where for each degree $\ell \in \{0, \ldots, L-1\}$ the set

$$P^\ell_L = \{(\ell_1, \ell_2) \in \{0, \ldots, L-1\}^2 : |\ell_1 - \ell_2| \leq \ell \leq \ell_1 + \ell_2\} \quad (11)$$

is defined in order to avoid the computation of trivially equivariant all-zero fragments. We make the dependence on $P^\ell_L$ explicit since we redefine it in Section 3. Unlike the pointwise activations discussed in the previous section this operator is strictly equivariant, with a mean relative equivariance error of $5.0 \times 10^{-7}$ (see Appendix D). Note that applying this operator to signals on the sphere or rotation group results in generalized signals that are no longer on the sphere or rotation group. This is the rationale for the generalization to $\mathcal{F}_L$: to unlock the ability to introduce non-linearity in a strictly equivariant manner. Note, however, that $g = \mathcal{N}_\otimes(f)$ has type $\tau_g = (\tau^0_g, \ldots, \tau^{L-1}_g)$ where $\tau^\ell_g = \sum_{(\ell_1,\ell_2) \in P^\ell_L} \tau^{\ell_1}_f \tau^{\ell_2}_f$, and therefore application of this non-linear operator results in a drastic expansion in representation size, which is problematic.

2.6 GENERALIZED SPHERICAL CNNS

Equipped with operators to both linearly and non-linearly transform elements of $\mathcal{F}_L$, with the latter also performing degree mixing, we may consider a network with representation spaces $\mathcal{H}^{(0)} = \cdots = \mathcal{H}^{(S)} = \mathcal{F}_L$. We consider the $s$-th layer of the network to take the form of a triple $A^{(s)} = (\mathcal{L}_1, \mathcal{N}, \mathcal{L}_2)$ such that $A^{(s)}(f^{(s-1)}) = \mathcal{L}_2(\mathcal{N}(\mathcal{L}_1(f^{(s-1)})))$, where $\mathcal{L}_1, \mathcal{L}_2: \mathcal{F}_L \to \mathcal{F}_L$ are linear operators and $\mathcal{N}: \mathcal{F}_L \to \mathcal{F}_L$ is a non-linear activation operator. The approaches of Cohen et al. (2018) and Esteves et al. (2018) are encompassed in this framework as $A^{(s)} = (\mathcal{L}(\psi), \mathcal{N}_\sigma, I)$, where $I$ denotes the identity operator and filters $\psi$ may be defined to encode real-space properties such as localization (see Appendix C). The framework of Kondor et al. (2018) is also encompassed as $A^{(s)} = (I, \mathcal{N}_\otimes, \mathcal{L}_G(\psi))$. Here the generalized convolution $\mathcal{L}_G(\psi)$ comes last to counteract the representation-expanding effect of the tensor-product activation and prevent it from compounding as signals pass through the network. Appendix E lends intuition regarding relationships that may be captured by tensor-product activations followed by generalized convolutions. For any intermediary representation $f^{(i)} \in \mathcal{F}_L$ we may transition from equivariance with respect to the network input to invariance by discarding all but the scalar-valued fragments corresponding to $\ell = 0$ (equivalent to average pooling for signals on the sphere and rotation group), as in the sketch below. Finally, note that within this general framework we are free to consider hybrid approaches whereby layers proposed by Cohen et al. (2018); Esteves et al. (2018); Kondor et al. (2018) and others, and those presented subsequently, can be leveraged alongside each other within a single model.
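The layer triple and the invariant readout admit a very small sketch (continuing the hypothetical GeneralizedSignal container above; the operator arguments are assumed to be callables on generalized signals):

```python
def generalized_layer(f, L1, N, L2):
    """One generalized layer A = (L1, N, L2): f -> L2(N(L1(f))), where L1
    and L2 are equivariant linear operators and N a non-linear activation."""
    return L2(N(L1(f)))

def invariant_features(f):
    """Rotationally invariant readout: discard all but the scalar-valued
    degree-0 fragments of a GeneralizedSignal (cf. average pooling)."""
    return f.fragments[0][:, 0]  # shape (tau_0,): one scalar per fragment
```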
3 EFFICIENT GENERALIZED SPHERICAL CNNS

Existing approaches to spherical convolutional layers that are encompassed within the above framework are computationally demanding. They require the evaluation of costly harmonic transforms on the sphere and rotation group. Furthermore, the only strictly rotationally equivariant non-linear layer is that of Kondor et al. (2018), which has an even greater computational cost, scaling with the fifth power of the bandlimit, thereby limiting spatial resolution, and quadratically with the number of fragments per degree, thereby limiting representational capacity. This often prohibits the use of strictly equivariant spherical networks. In this section we introduce a channel-wise structure, constrained generalized convolutions, and an optimized degree mixing set in order to construct new strictly equivariant layers that exhibit much improved scaling properties and parameter efficiency. Furthermore, we adopt efficient sampling theory on the sphere and rotation group to achieve additional computational savings.

3.1 EFFICIENT GENERALIZED SPHERICAL LAYERS

For an activation $f \in \mathcal{F}_L$ the value $\bar{\tau}_f = \frac{1}{L}\sum_{\ell=0}^{L-1} \tau^\ell_f$ represents a resolution-independent proxy for its representational capacity. Kondor et al. (2018) consider the separate fragments contained within $f$ to subsume the traditional role of separate channels and therefore control the capacity of intermediary network representations through specification of $\tau_f$. This is problematic because, whereas activation functions usually act on each channel separately and therefore have a cost that scales linearly with representational capacity (usually controlled by the number of channels), for the activation function $\mathcal{N}_\otimes$ not only does the cost scale quadratically with representational capacity $\bar{\tau}_f$, but so too does the size of $\mathcal{N}_\otimes(f)$. This feeds forward the quadratic dependence to the cost of, and number of parameters required by, the subsequent generalized convolution. More specifically, note that computation of $g = \mathcal{N}_\otimes(f)$ requires the computation of $\sum_{\ell=0}^{L-1} \tau^\ell_g$ fragments, where $\tau^\ell_g = \sum_{(\ell_1,\ell_2) \in P^\ell_L} \tau^{\ell_1}_f \tau^{\ell_2}_f$. The size of $P^\ell_L$ is $\mathcal{O}(L\ell)$ for each $\ell$ and therefore the expanded representation has size $\sum_{\ell=0}^{L-1} \tau^\ell_g$, of order $\mathcal{O}(\bar{\tau}_f^2 L^3)$. By exploiting the sparsity of Clebsch-Gordan coefficients ($C^{\ell_1,\ell_2,\ell}_{m_1,m_2,m} = 0$ if $m_1 + m_2 \neq m$) each fragment $(C^{\ell_1,\ell_2,\ell})^\top(\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2})$ can be computed in $\mathcal{O}(\ell \min(\ell_1, \ell_2))$. Hence, the total cost of computing all necessary fragments has complexity $\mathcal{O}(C^2 L^5)$, where $C = \bar{\tau}_f$ captures representational capacity.

3.1.1 CHANNEL-WISE TENSOR-PRODUCT ACTIVATIONS

As is more standard for CNNs we maintain a separate channels axis, with network activations taking the form $(f_1, \ldots, f_K) \in \mathcal{F}^K_L$ where the $f_i \in \mathcal{F}_L$ all share the same type $\tau_f$. The non-linearity $\mathcal{N}_\otimes$ may then be applied to each channel separately at a cost that is reduced by $K$-times relative to its application on a single channel with the same total number of fragments. This saving arises since for each $\ell$ we need only compute $K \sum_{(\ell_1,\ell_2) \in P^\ell_L} \tau^{\ell_1}_f \tau^{\ell_2}_f$ fragments rather than $\sum_{(\ell_1,\ell_2) \in P^\ell_L} (K\tau^{\ell_1}_f)(K\tau^{\ell_2}_f)$. Figure 1 visualizes this reduction for the case $K = 3$. Note, however, that for practical applications $K \approx 100$ is more typical. The $K$-times reduction in cost is therefore substantial and allows for intermediary activations with orders of magnitude more representational capacity. By introducing this multi-channel approach and using $C = K$ rather than $C = \bar{\tau}_f$ to control representational capacity, we reduce the complexity of $\mathcal{N}_\otimes$ with respect to representational capacity from $\mathcal{O}(C^2)$ to $\mathcal{O}(C)$. The fragment-counting sketch below illustrates the saving.

Figure 1: Comparison (to scale) of the expansion caused by the tensor-product activation applied to inputs of equal representational capacity but different structure: (a) the prior approach of applying a tensor-product based non-linear operator to a single channel; (b) our channel-wise approach. With depth representing the number of channels and width the number of fragments for each degree, it is clear that by grouping fragments into $K$ separate channels the expansion (and therefore cost) can be $K$-times reduced. Visualization corresponds to inputs with $(L, K)$ equal to $(3, 1)$ and $(3, 3)$ for panels (a) and (b), respectively.
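The fragment counting behind this saving can be reproduced directly from Equations 10 and 11; a minimal sketch (ours) follows.

```python
def mixing_set(ell, L):
    """The full degree-mixing set P^ell_L of Equation 11."""
    return [(l1, l2) for l1 in range(L) for l2 in range(L)
            if abs(l1 - l2) <= ell <= l1 + l2]

def expansion_sizes(tau, L, K):
    """Total fragments produced by the tensor-product activation when K
    channels of type tau are processed channel-wise, versus a single
    channel holding all K * tau[ell] fragments per degree."""
    per_channel = sum(tau[l1] * tau[l2]
                      for ell in range(L) for (l1, l2) in mixing_set(ell, L))
    single = sum((K * tau[l1]) * (K * tau[l2])
                 for ell in range(L) for (l1, l2) in mixing_set(ell, L))
    return K * per_channel, single

channelwise, single = expansion_sizes([1] * 8, L=8, K=4)
print(channelwise, single, single // channelwise)  # ratio equals K
```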
3.1.2 CONSTRAINED GENERALIZED CONVOLUTION

Although much reduced, for a signal $f \in \mathcal{F}^{K_{\mathrm{in}}}_L$ the channel-wise application of $\mathcal{N}_\otimes$ still results in a drastically expanded representation $g = \mathcal{N}_\otimes(f)$, to which a representation-contracting generalized convolution must be applied in order to project onto a new activation $g' = \mathcal{L}_G(\psi)(g) \in \mathcal{F}^{K_{\mathrm{out}}}_L$ of the desired type $\tau_{g'}$ and number of channels $K_{\mathrm{out}}$. However, under our multi-channel structure computational and parameter efficiency can be improved significantly by decomposing $\mathcal{L}_G(\psi)$ into three separate linear operators, $\mathcal{L}_{G_1}(\psi_1)$, $\mathcal{L}_{G_2}(\psi_2)$ and $\mathcal{L}_{G_3}(\psi_3)$. The first, $\mathcal{L}_{G_1}(\psi_1)$, acts uniformly across channels, performing a linear projection down onto the desired type, and should be interpreted as a learned extension of $\mathcal{N}_\otimes$ which undoes the drastic expansion. The second, $\mathcal{L}_{G_2}(\psi_2)$, then acts channel-wise, taking linear combinations of the (contracted number of) fragments within each channel. The third, $\mathcal{L}_{G_3}(\psi_3)$, acts across channels, taking linear combinations to learn new features. More concretely, the three filters are of the form $\psi_1 = \{\hat{\psi}^\ell_1 \in \mathbb{C}^{\tau^\ell_g \times \tau^\ell_{g'}} : \ell = 0, \ldots, L-1\}$, $\psi_2 = \{\hat{\psi}^{\ell,k}_2 \in \mathbb{C}^{\tau^\ell_{g'} \times \tau^\ell_{g'}} : \ell = 0, \ldots, L-1;\ k = 1, \ldots, K_{\mathrm{in}}\}$ and $\psi_3 = \{\hat{\psi}^\ell_3 \in \mathbb{C}^{K_{\mathrm{in}} \times K_{\mathrm{out}}} : \ell = 0, \ldots, L-1\}$, rather than a single filter of the form $\psi = \{\hat{\psi}^\ell \in \mathbb{C}^{K_{\mathrm{in}} \times \tau^\ell_g \times K_{\mathrm{out}} \times \tau^\ell_{g'}} : \ell = 0, \ldots, L-1\}$, leading to a large reduction in the number of parameters as $\tau^\ell_g$ is invariably very large. By applying the first step uniformly across channels we minimize the parametric dependence on the expanded representation and allow new features to be subsequently learned much more efficiently. Together the second and third steps can be seen as analogous to the depthwise separable convolutions often used in planar convolutional networks.

3.1.3 OPTIMIZED DEGREE MIXING SETS

We now consider approaches to reduce the $\mathcal{O}(L^5)$ complexity with respect to spatial resolution $L$. In the definition of $\mathcal{N}_\otimes$ each element of $P^\ell_L$ independently defines an equivariant fragment. Therefore a restricted $\mathcal{N}_\otimes$ in which only a subset of $P^\ell_L$ is used for each degree $\ell$ still defines a strictly equivariant operator, while reducing computational complexity. In order to make savings whilst remaining at resolution $L$ it is necessary to consider subsets of $P^\ell_L$ that scale better than $\mathcal{O}(L^2)$. The challenge is to find such subsets that do not hamper the ability of the resulting operator to inject non-linearity and mix information corresponding to different degrees $\ell$. Whilst various subsetting approaches are possible, the following argument motivates an approach that we have found to be particularly effective. If $(\ell_1, \ell_3) \in P^\ell_L$, then representational space is designated to capture the relationship between $\ell_1$ and $\ell_3$-degree information. However, if resources have been designated already to capture the relationship between $\ell_1$ and $\ell_2$-degrees, as well as between $\ell_2$ and $\ell_3$-degrees, then some notion of the relationship between $\ell_1$ and $\ell_3$-degrees has been captured already. Consequently, it is unnecessary to designate further resources for this purpose. More generally, consider the graph $G^\ell_L = (N_L, P^\ell_L)$ with nodes $N_L = \{0, \ldots, L-1\}$ and edges $P^\ell_L$. A restricted tensor-product activation can be constructed by using a subset of $P^\ell_L$ that corresponds to a subgraph of $G^\ell_L$. The subgraph of $G^\ell_L$ captures some notion of the relationship between incoming $\ell_1$ and $\ell_2$-degree information if it contains a path between nodes $\ell_1$ and $\ell_2$. Therefore we are interested in subgraphs for which there exists a path between any two nodes if there exists such a path in the original graph, guaranteeing that any degree-mixing relationship captured by the original graph is also captured by the subgraph. The smallest subgraph satisfying this property is a minimum spanning tree (MST) of $G^\ell_L$. The set of edges corresponding to any MST has at most $L$ elements and we choose to consider its union with the set of loop-edges in $G^\ell_L$ (of the form $(\ell_1, \ell_1)$), which proved particularly important for injecting non-linearity. We denote the resulting set as $\tilde{P}^\ell_L$ and note that it satisfies $|\tilde{P}^\ell_L| \leq 2L$. Therefore the tensor-product activation $\tilde{\mathcal{N}}_\otimes$, given by Equation 10 with $P^\ell_L$ replaced by $\tilde{P}^\ell_L$, has reduced spatial complexity $\mathcal{O}(L^4)$. Given that many minimum spanning trees of the unweighted graph $G^\ell_L$ exist for each $\ell$, we select the ones that minimize the cost of the resulting activation $\tilde{\mathcal{N}}_\otimes$ by assigning to each edge $(\ell_1, \ell_2)$ in $G^\ell_L$ a weight equal to the cost of computing $(C^{\ell_1,\ell_2,\ell})^\top(\hat{f}^{\ell_1} \otimes \hat{f}^{\ell_2})$ and selecting the MST of the weighted graph, as in the sketch below.
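A minimal construction of the weighted MST mixing set $\tilde{P}^\ell_L$ (our sketch, using scipy's spanning-tree routine; the edge weight follows the $\mathcal{O}(\ell \min(\ell_1, \ell_2))$ fragment cost noted in Section 3.1):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_mixing_set(ell, L):
    """MST-based mixing set: edges of a minimum spanning tree of G^ell_L,
    weighted by the cost of computing each fragment, together with the
    loop-edges (l1, l1) contained in P^ell_L."""
    W = np.zeros((L, L))
    for l1 in range(L):
        for l2 in range(L):
            if abs(l1 - l2) <= ell <= l1 + l2:
                W[l1, l2] = 1 + ell * min(l1, l2)  # strictly positive weight
    tree = minimum_spanning_tree(csr_matrix(W)).tocoo()
    edges = {(min(i, j), max(i, j)) for i, j in zip(tree.row, tree.col)}
    loops = {(l1, l1) for l1 in range(L) if ell <= 2 * l1}  # loop-edges in P^ell_L
    return sorted(edges | loops)

print(len(mst_mixing_set(4, 7)))  # at most 2L elements
```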
Figure 2: Visualization of the mixing set $P^\ell_L$ (for $L = 7$ and $\ell = 4$) and the approaches to subsetting based on the minimum spanning tree (MST) and reduced minimum spanning tree (RMST) mixing policies, which reduce the related computational costs from $\mathcal{O}(L^2)$ to, respectively, $\mathcal{O}(L)$ or $\mathcal{O}(\log L)$: (a) full $P^\ell_L$ set of size $\mathcal{O}(L^2)$; (b) MST subset of size $\mathcal{O}(L)$; (c) RMST subset of size $\mathcal{O}(\log L)$.

An example of $P^\ell_L$ and an MST subset $\tilde{P}^\ell_L$ is shown in Figure 2, where the dashed line in Figure 2c shows the general form of the MST. Using this as a principled starting point we consider the further reduced MST (RMST) subset $\check{P}^\ell_L$, corresponding to centering the MST at the edge $(\ell, \ell)$ and retaining only the edges that fall a distance of $2^i$ away on the dashed line for some $i \in \mathbb{N}$. We use $\check{\mathcal{N}}_\otimes$ to denote the corresponding operator and note that it has further reduced spatial complexity of $\mathcal{O}(L^3 \log L)$. We demonstrate in Section 4 that networks that make use of the MST tensor-product activation achieve state-of-the-art performance. Replacing the MST with the RMST activation results in only a small, insignificant degradation in performance, which is offset by the reduced computational cost.

3.1.4 REDUCTION IN COMPUTATIONAL AND MEMORY FOOTPRINTS

The three modifications proposed in Sections 3.1.1 to 3.1.3 are motivated by improved scaling properties. Importantly, they also translate to large reductions in the computational and memory cost of strictly equivariant layers in practice, as detailed in Appendix F. Even at a modest bandlimit of $L = 64$ and a relatively small number of channels $K = 4$, for example, the modifications together lead to a 51-times reduction in the number of FLOPs required for computations and a 16-times reduction in the amount of memory required to store representations, weights and gradients for training.
3.2 EFFICIENT SAMPLING THEORY

By adopting sampling theorems on the sphere we provide access to underlying continuous signal representations that fully capture the symmetries and geometric properties of the sphere, and allow standard convolutions to be computed exactly and efficiently through their harmonic representations, as discussed in greater detail in Appendices A and B. We adopt the efficient sampling theorems on the sphere and rotation group of McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively, which reduce the Nyquist rate by a factor of two compared to those of Driscoll & Healy (1994) and Kostelec & Rockmore (2008), which have been adopted in other spherical CNN constructions (e.g. Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020). The sampling theorems adopted are equipped with fast algorithms to compute harmonic transforms, with complexity $\mathcal{O}(L^3)$ for the sphere and $\mathcal{O}(L^4)$ for the rotation group. When imposing an azimuthal bandlimit $N \ll L$, the complexity of transforms on the rotation group can be reduced to $\mathcal{O}(NL^3)$, which we often exploit in our standard (non-generalized) convolutional layers.

4 EXPERIMENTS

Using our efficient generalized spherical CNN framework (implemented in the fourpiAI¹ code) we construct networks that we apply to numerous spherical benchmark problems. We achieve state-of-the-art performance, demonstrating enhanced equivariance without compromising representational capacity or parameter efficiency. In all experiments we use a similar architecture, consisting of 2-3 standard convolutional layers (e.g. $S^2$ or $SO(3)$ convolutions followed by ReLUs) and then 2-3 of our efficient generalized layers. We adopt the efficient sampling theory described in Section 3.2 and encode localization of spatial filters as discussed in Appendix C. Full experimental details may be found in Appendix G.

¹ https://www.kagenova.com/products/fourpiAI/

Table 1: Test accuracy for spherical MNIST digits classification problem

| Method | NR/NR | R/R | NR/R | Params |
|---|---|---|---|---|
| Planar CNN | 99.32 | 90.74 | 11.36 | 58k |
| Cohen et al. (2018) | 95.59 | 94.62 | 93.40 | 58k |
| Kondor et al. (2018) | 96.40 | 96.60 | 96.00 | 286k |
| Esteves et al. (2020) | 99.37 | 99.37 | 99.08 | 58k |
| Ours (MST) | 99.35 | 99.38 | 99.34 | 58k |
| Ours (RMST) | 99.29 | 99.17 | 99.18 | 57k |

Table 2: Test root mean squared (RMS) error for QM7 regression problem

| Method | RMS | Params |
|---|---|---|
| Montavon et al. (2012) | 5.96 | - |
| Cohen et al. (2018) | 8.47 | 1.4M |
| Kondor et al. (2018) | 7.97 | >1.1M |
| Ours (MST) | 3.16 | 337k |
| Ours (RMST) | 3.46 | 335k |

Table 3: SHREC'17 object retrieval competition metrics (perturbed micro-all)

| Method | P@N | R@N | F1@N | mAP | NDCG | Params |
|---|---|---|---|---|---|---|
| Kondor et al. (2018) | 0.707 | 0.722 | 0.701 | 0.683 | 0.756 | >1M |
| Cohen et al. (2018) | 0.701 | 0.711 | 0.699 | 0.676 | 0.756 | 1.4M |
| Esteves et al. (2018) | 0.717 | 0.737 | - | 0.685 | - | 500k |
| Ours | 0.719 | 0.710 | 0.708 | 0.679 | 0.758 | 250k |

4.1 ROTATED MNIST ON THE SPHERE

We consider the now standard benchmark problem of classifying MNIST digits projected onto the sphere. Three experimental modes, NR/NR, R/R and NR/R, are considered, indicating whether the training/test sets have been randomly rotated (R) or not (NR). Results are presented in Table 1, which shows that we closely match the prior state-of-the-art performance obtained by Esteves et al. (2020) on the NR/NR and R/R modes, whilst outperforming all previous spherical CNNs on the NR/R mode, demonstrating the increased degree of equivariance achieved by our model. Results are shown for models using both the MST-based and RMST-based mixing sets within the tensor-product activation.
The results obtained when using the full sets $P^\ell_L$ are very similar to those obtained when using the MST-based sets (e.g. the full sets achieved an accuracy of 99.39 for R/R).

4.2 ATOMIZATION ENERGY PREDICTION

We consider the problem of regressing the atomization energy of molecules given the molecule's Coulomb matrix and the positions of the atoms in space, using the QM7 dataset (Blum & Reymond, 2009; Rupp et al., 2012). Results are presented in Table 2, which shows that we dramatically outperform other approaches, whilst using significantly fewer parameters.

4.3 3D SHAPE RETRIEVAL

We consider the 3D shape retrieval problem on the SHREC'17 (Savva et al., 2017) competition dataset, containing 51k 3D object meshes. We follow the pre-processing step of Cohen et al. (2018), where several spherical projections of each mesh are computed, and use the official SHREC'17 data splits. Results are presented in Table 3 for the standard SHREC precision and recall metrics, which shows that we achieve state-of-the-art performance compared to other spherical CNN approaches, achieving the highest score on three of five performance metrics, whilst using significantly fewer parameters.

5 CONCLUSIONS

We have presented a generalized framework for CNNs on the sphere that encompasses various existing approaches. We developed new efficient layers to be used as primary building blocks in this framework by introducing a channel-wise structure, constrained generalized convolutions, and optimized degree mixing sets determined by minimum spanning trees. These new efficient layers exhibit strict rotational equivariance, without compromising on representational capacity or parameter efficiency. When combined with the flexibility of the generalized framework to leverage the strengths of alternative layers, powerful hybrid models can be constructed. On all spherical benchmark problems considered we achieve state-of-the-art performance, both in terms of accuracy and parameter efficiency. In future work we intend to improve the scalability of our generalized framework further still. In particular, we plan to introduce additional highly scalable layers, for example by extending scattering transforms (Mallat, 2012) to the sphere, to further realize the potential of deep learning on a host of new applications where spherical data are prevalent.

REFERENCES

Lorenz Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131:8732, 2009.

Wouter Boomsma and Jes Frellsen. Spherical convolutions and their application in molecular modelling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 3433-3443. Curran Associates, Inc., 2017.

Taco Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations, 2018. URL https://arxiv.org/abs/1801.10130.

Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019. URL https://arxiv.org/abs/1902.04615.

James Driscoll and Dennis Healy. Computing Fourier transforms and convolutions on the sphere. Advances in Applied Mathematics, 15:202-250, 1994.

Carlos Esteves. Theoretical aspects of group equivariant neural networks. arXiv preprint arXiv:2004.05154, 2020. URL https://arxiv.org/abs/2004.05154.
Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52-68, 2018. URL https://arxiv.org/abs/1711.06721.

Carlos Esteves, Ameesh Makadia, and Kostas Daniilidis. Spin-weighted spherical CNNs. arXiv preprint arXiv:2006.10731, 2020. URL https://arxiv.org/abs/2006.10731.

Jean Gallier and Jocelyn Quaintance. Aspects of Harmonic Analysis and Representation Theory. 2019. URL https://www.seas.upenn.edu/~jean/nc-harmonic.pdf.

Dennis Healy, Daniel Rockmore, Peter Kostelec, and S. Moore. FFTs for the 2-sphere: improvements and variations. Journal of Fourier Analysis and Applications, 9(4):341-385, 2003.

Chiyu Jiang, Jingwei Huang, Karthik Kashinath, Philip Marcus, Matthias Niessner, et al. Spherical CNNs on unstructured grids. arXiv preprint arXiv:1901.02039, 2019. URL https://arxiv.org/abs/1901.02039.

Rodney A. Kennedy and Parastoo Sadeghi. Hilbert Space Methods in Signal Processing. Cambridge University Press, 2013.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR: International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1412.6980.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, pp. 2747-2755, 2018. URL https://arxiv.org/abs/1802.03690.

Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan nets: a fully Fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems, pp. 10117-10126, 2018. URL https://arxiv.org/abs/1806.09231.

Peter Kostelec and Daniel Rockmore. FFTs on the rotation group. Journal of Fourier Analysis and Applications, 14:145-179, 2008.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012. URL https://arxiv.org/abs/1101.2286.

Domenico Marinucci and Giovanni Peccati. Random Fields on the Sphere: Representation, Limit Theorems and Cosmological Applications. Cambridge University Press, 2011.

Jason McEwen and Yves Wiaux. A novel sampling theorem on the sphere. IEEE Transactions on Signal Processing, 59(12):5876-5887, 2011. URL https://arxiv.org/abs/1110.6298.

Jason McEwen, Michael P. Hobson, Daniel J. Mortlock, and Anthony N. Lasenby. Fast directional continuous spherical wavelet transform algorithms. IEEE Trans. Sig. Proc., 55(2):520-529, 2007. URL https://arxiv.org/abs/astro-ph/0506308.

Jason McEwen, Pierre Vandergheynst, and Yves Wiaux. On the computation of directional scale-discretized wavelet transforms on the sphere. In Wavelets and Sparsity XV, SPIE international symposium on optics and photonics, invited contribution, volume 8858, 2013. URL https://arxiv.org/abs/1308.5706.

Jason McEwen, Martin Büttner, Boris Leistedt, Hiranya V. Peiris, and Yves Wiaux. A novel sampling theorem on the rotation group. IEEE Signal Processing Letters, 22(12):2425-2429, 2015a. URL https://arxiv.org/abs/1508.03101.

Jason McEwen, Boris Leistedt, Martin Büttner, Hiranya Peiris, and Yves Wiaux. Directional spin wavelets on the sphere. IEEE Trans. Sig. Proc., submitted, 2015b. URL https://arxiv.org/abs/1509.06749.

Jason McEwen, Claudio Durastanti, and Yves Wiaux.
Localisation of directional scale-discretised wavelets on the sphere. Applied Comput. Harm. Anal., 44(1):59-88, 2018. URL https://arxiv.org/abs/1509.06767.

Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 440-448. Curran Associates, Inc., 2012.

Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 27:130-146, 2019. URL https://arxiv.org/abs/1810.12186.

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012. URL https://arxiv.org/abs/1109.2618.

Manolis Savva, Fisher Yu, Hao Su, Asako Kanezaki, Takahiko Furuya, Ryutarou Ohbuchi, Zhichao Zhou, Rui Yu, Song Bai, Xiang Bai, et al. Large-scale 3D shape retrieval from ShapeNet Core55: SHREC'17 track. In Proceedings of the Workshop on 3D Object Retrieval, pp. 39-50. Eurographics Association, 2017.

Max Tegmark. An icosahedron-based method for pixelizing the celestial sphere. Astrophys. J. Lett., 470:L81, October 1996. URL https://arxiv.org/abs/astro-ph/9610094.

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018. URL https://arxiv.org/abs/1802.08219.

Stefano Trapani and Jorge Navaza. Calculation of spherical harmonics and Wigner d functions by FFT. Applications to fast rotational matching in molecular replacement and implementation into AMoRe. Acta Crystallographica Section A, 62(4):262-269, 2006.

Benjamin Wandelt and Krzysztof Górski. Fast convolution on the sphere. Phys. Rev. D., 63(12):123002, 2001. URL https://arxiv.org/abs/astro-ph/0008227.

A REPRESENTATIONS OF SIGNALS ON THE SPHERE AND ROTATION GROUP

To provide further context for the discussion presented in the introduction and to elucidate the properties of different sampling theory on the sphere and rotation group, we concisely review representations of signals on the sphere and rotation group.

A.1 DISCRETIZATION

It is well-known that a completely regular point distribution on the sphere does not in general exist (e.g. Tegmark, 1996). Consequently, while a variety of spherical discretization schemes exist (e.g. icosahedron, HEALPix, graph, and other representations), it is not possible to discretize (i.e. to sample or pixelize) the sphere in a manner that is invariant to rotations, i.e. a discrete sampling of rotations of the samples on the sphere will in general not map onto the same set of sample positions. This differs from the Euclidean setting and has important implications when constructing convolution operators on the sphere, which clearly are a critical component of CNNs.
Since convolution operators are in general built using a translation operator (equivalently, a rotation operator when on the sphere), it is thus not possible to construct a convolution operator directly on a discretized representation of the sphere that captures all of the symmetries of the underlying spherical manifold. While approximate discrete representations can be considered, and are nevertheless useful, such representations cannot capture all underlying spherical symmetries.

A.2 SAMPLING THEORY

Alternative representations, however, can capture all underlying spherical symmetries. Sampling theories on the sphere (e.g. Driscoll & Healy, 1994; McEwen & Wiaux, 2011) provide a mechanism to capture all information content of an underlying continuous function on the sphere from a finite set of samples (and similarly on the rotation group; Kostelec & Rockmore 2008; McEwen et al. 2015a). A sampling theory on the sphere is equivalent to a cubature (i.e. quadrature) rule for the exact integration of bandlimited functions on the sphere. While optimal cubature on the sphere remains an open problem, the most efficient sampling theory on the sphere and rotation group is that developed by McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively. On a compact manifold like the sphere (and rotation group), harmonic (i.e. Fourier) space is discrete. Hence, a finite set of harmonic coefficients captures all information content of an underlying continuous bandlimited signal. Since such a representation provides access to the underlying continuous signal, all symmetries and geometric properties of the sphere are captured perfectly. Such representations have been employed extensively in the construction of wavelet transforms on the sphere, where the use of sampling theorems on the sphere and rotation group yields wavelet transforms of discretized continuous signals that are theoretically exact (e.g. McEwen et al., 2013; 2015b; 2018). Harmonic signal representations have also been exploited in spherical CNNs to access all underlying spherical symmetries and develop equivariant network layers (Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020).

A.3 EXACT AND EFFICIENT COMPUTATION

Signals on the sphere $f \in L^2(S^2)$ may be decomposed into their harmonic representations as

$$f(\omega) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} f^\ell_m\, Y^\ell_m(\omega), \quad (12)$$

where their spherical harmonic coefficients are given by

$$f^\ell_m = \langle f, Y^\ell_m \rangle = \int_{S^2} \mathrm{d}\mu(\omega)\, f(\omega)\, Y^{\ell*}_m(\omega), \quad (13)$$

for $\omega \in S^2$. Similarly, signals on the rotation group $g \in L^2(SO(3))$ may be decomposed into their harmonic representations as

$$g(\rho) = \sum_{\ell=0}^{\infty} \frac{2\ell+1}{8\pi^2} \sum_{m=-\ell}^{\ell} \sum_{n=-\ell}^{\ell} g^\ell_{mn}\, D^{\ell*}_{mn}(\rho), \quad (14)$$

where their harmonic (Wigner) coefficients are given by

$$g^\ell_{mn} = \langle g, D^{\ell*}_{mn} \rangle = \int_{SO(3)} \mathrm{d}\mu(\rho)\, g(\rho)\, D^\ell_{mn}(\rho), \quad (15)$$

for $\rho \in SO(3)$. Note that we adopt the convention where the conjugate of the Wigner D-function is used in Equation 14 since this leads to a convenient harmonic representation when considering convolutions (cf. McEwen et al., 2015a; 2018). As mentioned above, sampling theory pertains to strategies to capture all of the information content of bandlimited signals from a finite set of samples. Since the harmonic space of the sphere and rotation group is discrete, this is equivalent to an exact quadrature rule for the computation of harmonic coefficients by Equation 13 and Equation 15 from sampled signals. The canonical equiangular sampling theory on the sphere was that developed by Driscoll & Healy (1994), and subsequently extended to the rotation group by Kostelec & Rockmore (2008).
More recently, novel sampling theorems on the sphere and rotation group were developed by McEwen & Wiaux (2011) and McEwen et al. (2015a), respectively, that reduce the Nyquist rate by a factor of two. Previous CNN constructions on the sphere (e.g. Cohen et al., 2018; Kondor et al., 2018; Esteves et al., 2018; 2020) have adopted the more well-known sampling theories of Driscoll & Healy (1994) and Kostelec & Rockmore (2008). In contrast, we adopt the more efficient sampling theories of McEwen & Wiaux (2011) and McEwen et al. (2015a) to provide additional efficiency savings, implemented in the open source ssht² and so3³ software packages (we also make use of a TensorFlow implementation of these algorithms in our private tensossht⁴ code). Note also that the sampling schemes associated with the theory of McEwen & Wiaux (2011) (and other minor variants implemented in ssht) align more closely with the one-to-two aspect ratio of common spherical data, such as 360° photos and videos. All of the sampling theories discussed are equipped with fast algorithms to compute harmonic transforms, with complexity $\mathcal{O}(L^3)$ for transforms on the sphere (Driscoll & Healy, 1994; McEwen & Wiaux, 2011) and complexity $\mathcal{O}(L^4)$ for transforms on the rotation group (Kostelec & Rockmore, 2008; McEwen et al., 2015a). Note that algorithms that achieve slightly lower complexity have been developed (Driscoll & Healy, 1994; Healy et al., 2003; Kostelec & Rockmore, 2008) but these are known to suffer stability issues (Healy et al., 2003; Kostelec & Rockmore, 2008). By imposing an azimuthal bandlimit $N$, where typically $N \ll L$, the complexity of transforms on the rotation group can be reduced to $\mathcal{O}(NL^3)$ (McEwen et al., 2015a), which we exploit in our networks. These fast algorithms to compute harmonic transforms on the sphere and rotation group can be leveraged to yield the exact and efficient computation of convolutions through their harmonic representations (see Appendix B). By computing convolutions in harmonic space, pixelization and quadrature errors are avoided and computational complexity is reduced to the cost of the respective harmonic transforms.

² http://www.spinsht.org/
³ http://www.sothree.org/
⁴ Available on request from https://www.kagenova.com/.

B CONVOLUTION ON THE SPHERE AND ROTATION GROUP

For completeness we make explicit the standard (non-generalized) convolution operations on the sphere and rotation group that we adopt. The general form of convolution for signals $f, \psi \in L^2(\Omega)$ either on the sphere ($\Omega = S^2$) or rotation group ($\Omega = SO(3)$) is specified by Equation 1, with harmonic representation given by Equation 3. Here we provide specific expressions for the convolution for a variety of cases, describe the normalization constants that arise and may be absorbed into learnable filters, and derive the corresponding harmonic forms. In practice all convolutions are computed in harmonic space since the computation is then exact, avoiding pixelization or quadrature errors, and efficient when fast algorithms to compute harmonic transforms are exploited (see Appendix A).

B.1 CONVOLUTION ON THE SPHERE

Given two spherical signals $f, \psi \in L^2(S^2)$ their convolution, which in general is a signal on the rotation group, may be decomposed as

$$(f \star \psi)(\rho) = \langle f, \mathcal{R}_\rho \psi \rangle \quad (16)$$
$$= \int_{S^2} \mathrm{d}\Omega(\omega)\, f(\omega)\, \psi^*(\rho^{-1}\omega) \quad (17)$$
$$= \sum_{\ell m} \sum_{\ell' m' n} f^\ell_m\, D^{\ell'*}_{m'n}(\rho)\, \psi^{\ell'*}_n \int_{S^2} \mathrm{d}\Omega(\omega)\, Y^\ell_m(\omega)\, Y^{\ell'*}_{m'}(\omega) \quad (18)$$
$$= \sum_{\ell m} \sum_{\ell' m' n} f^\ell_m\, D^{\ell'*}_{m'n}(\rho)\, \psi^{\ell'*}_n\, \delta_{\ell\ell'}\, \delta_{mm'} \quad (19)$$
$$= \sum_{\ell m n} f^\ell_m\, \psi^{\ell*}_n\, D^{\ell*}_{mn}(\rho), \quad (20)$$

yielding harmonic coefficients

$$(f \star \psi)^\ell_{mn} = \frac{8\pi^2}{2\ell+1}\, f^\ell_m\, \psi^{\ell*}_n. \quad (21)$$

The constants $8\pi^2/(2\ell+1)$ may be absorbed into learnable parameters.
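Equation 21 translates directly into code; the following sketch (ours) computes the Wigner coefficients of a spherical convolution as per-degree outer products.

```python
import numpy as np

def sphere_conv_harmonic(flm, psilm, L):
    """Harmonic-space spherical convolution (Equation 21): the degree-ell
    Wigner coefficients of f * psi form the scaled outer product
    (8 pi^2 / (2 ell + 1)) * f^ell (psi^ell)^dagger. flm[ell] and
    psilm[ell] are complex vectors of length 2*ell + 1."""
    out = []
    for ell in range(L):
        scale = 8.0 * np.pi**2 / (2 * ell + 1)
        # out[ell][m + ell, n + ell] = (f * psi)^ell_{mn}
        out.append(scale * np.outer(flm[ell], np.conj(psilm[ell])))
    return out
```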
B.2 CONVOLUTION ON THE SPHERE WITH AXISYMMETRIC FILTERS

When convolving a spherical signal $f \in L^2(S^2)$ with an axisymmetric spherical filter $\psi \in L^2(S^2)$ that is invariant to azimuthal rotations, the resultant $(f \star \psi)$ may be interpreted as a signal on the sphere. To see this note that an axisymmetric filter $\psi$ has harmonic coefficients $\psi^\ell_n = \psi^\ell_0 \delta_{n0}$ that are non-zero only for $n = 0$. Denoting rotations by their zyz-Euler angles $\rho = (\alpha, \beta, \gamma)$ and substituting into Equation 20 we see that the convolution may be decomposed as

$$(f \star \psi)(\alpha, \beta, \gamma) = \sum_{\ell m n} f^\ell_m\, \psi^{\ell*}_0\, \delta_{n0}\, D^{\ell*}_{mn}(\alpha, \beta, \gamma) \quad (22)$$
$$= \sum_{\ell m} f^\ell_m\, \psi^{\ell*}_0\, D^{\ell*}_{m0}(\alpha, \beta, 0) \quad (23)$$
$$= \sum_{\ell m} f^\ell_m\, \psi^{\ell*}_0\, \sqrt{\frac{4\pi}{2\ell+1}}\, Y^\ell_m(\beta, \alpha). \quad (24)$$

We may therefore interpret $(f \star \psi)$ as a signal on the sphere with spherical harmonic coefficients

$$(f \star \psi)^\ell_m = \sqrt{\frac{4\pi}{2\ell+1}}\, f^\ell_m\, \psi^{\ell*}_0. \quad (25)$$

The constants $\sqrt{4\pi/(2\ell+1)}$ may be absorbed into learnable parameters.

B.3 CONVOLUTION ON THE ROTATION GROUP

Given two signals $f, \psi \in L^2(SO(3))$ on the rotation group their convolution may then be decomposed as

$$(f \star \psi)(\rho) = \langle f, \mathcal{R}_\rho \psi \rangle \quad (26)$$
$$= \int_{SO(3)} \mathrm{d}\mu(\rho')\, f(\rho')\, \psi^*(\rho^{-1}\rho') \quad (27)$$
$$= \int_{SO(3)} \mathrm{d}\mu(\rho') \sum_{\ell m n} \frac{2\ell+1}{8\pi^2} f^\ell_{mn}\, D^{\ell*}_{mn}(\rho') \sum_{\ell' m' n'} \frac{2\ell'+1}{8\pi^2} \psi^{\ell'*}_{m'n'}\, D^{\ell'}_{m'n'}(\rho^{-1}\rho') \quad (28)$$
$$= \sum_{\ell m n} \frac{2\ell+1}{8\pi^2} f^\ell_{mn} \sum_{\ell' m' n'} \frac{2\ell'+1}{8\pi^2} \psi^{\ell'*}_{m'n'} \int_{SO(3)} \mathrm{d}\mu(\rho')\, D^{\ell*}_{mn}(\rho')\, D^{\ell'}_{m'n'}(\rho^{-1}\rho') \quad (29)$$
$$= \sum_{\ell m n} \frac{2\ell+1}{8\pi^2} f^\ell_{mn} \sum_{\ell' m' n'} \frac{2\ell'+1}{8\pi^2} \psi^{\ell'*}_{m'n'} \int_{SO(3)} \mathrm{d}\mu(\rho')\, D^{\ell*}_{mn}(\rho') \sum_k D^{\ell'*}_{km'}(\rho)\, D^{\ell'}_{kn'}(\rho') \quad (30)$$
$$= \sum_{\ell m n} \frac{2\ell+1}{8\pi^2} f^\ell_{mn} \sum_{\ell' m' n'} \frac{2\ell'+1}{8\pi^2} \psi^{\ell'*}_{m'n'} \sum_k D^{\ell'*}_{km'}(\rho)\, \frac{8\pi^2}{2\ell+1}\, \delta_{\ell\ell'} \delta_{mk} \delta_{nn'} \quad (31)$$
$$= \sum_\ell \frac{2\ell+1}{8\pi^2} \sum_{m m'} D^{\ell*}_{mm'}(\rho) \left( \sum_n f^\ell_{mn}\, \psi^{\ell*}_{m'n} \right), \quad (32)$$

where for Equation 30 we make use of the relation (e.g. Marinucci & Peccati, 2011; McEwen et al., 2018)

$$D^\ell_{mn}(\rho^{-1}\rho') = \sum_k D^{\ell*}_{km}(\rho)\, D^\ell_{kn}(\rho'). \quad (33)$$

This decomposition yields harmonic coefficients

$$(f \star \psi)^\ell_{mn} = \sum_{m'} f^\ell_{mm'}\, \psi^{\ell*}_{nm'}. \quad (34)$$

C FILTERS ON THE SPHERE AND ROTATION GROUP

When defining filters we look to encode desirable real-space properties, such as locality and regularity. However, in practice considerable computation may be saved by defining the filters in harmonic space, saving the cost of harmonic transforming ahead of harmonic space convolutions. We describe here how filters motivated by their real-space properties may be defined directly in harmonic space.

C.1 DIRAC DELTA FILTERS ON THE SPHERE

Spherical filters may be constructed as a weighted sum of Dirac delta functions on the sphere. This construction is useful as the harmonic representation has an analytic form that may be computed efficiently. Furthermore, various real-space properties can be encoded through sensible placement of the Dirac delta functions. The spherical Dirac delta function $\delta_{\omega'}$ centered at $\omega' = (\theta', \varphi') \in S^2$ is defined (with respect to the Haar measure) as

$$\delta_{\omega'}(\omega) = \delta_R(\cos\theta - \cos\theta')\, \delta_R(\varphi - \varphi'), \quad (35)$$

where $\delta_R$ is the familiar Dirac delta function on the reals centered at 0. The Dirac delta on the sphere may be represented in harmonic space by

$$(\delta_{\omega'})^\ell_m = Y^{\ell*}_m(\omega') = N^\ell_m\, P^\ell_m(\cos\theta')\, e^{-im\varphi'}, \quad (36)$$

which follows from the sifting property of the Dirac delta, and where $Y^\ell_m$ denote the spherical harmonic functions, $P^\ell_m(x)$ are associated Legendre functions and

$$N^\ell_m = \sqrt{\frac{2\ell+1}{4\pi} \frac{(\ell-m)!}{(\ell+m)!}} \quad (37)$$

is a normalizing constant. This representation may then be used to define a filter $\psi \in L^2(S^2)$ as a weighted sum of spherical Dirac delta functions, with weights $w_{ij}$ assigned to Dirac delta functions centered at points $\{(\theta_i, \varphi_j) : i = 1, \ldots, N_\theta;\ j = 1, \ldots, N_\varphi\}$. The associated harmonic space representation is given by

$$\psi^\ell_m = \sum_{i,j} w_{ij}\, N^\ell_m\, P^\ell_m(\cos\theta_i)\, e^{-im\varphi_j} \quad (38)$$
$$= \sum_i N^\ell_m\, P^\ell_m(\cos\theta_i) \sum_j w_{ij}\, e^{-im\varphi_j}, \quad (39)$$

where fast Fourier transforms may be leveraged to compute the inner sum if the Dirac deltas are spaced evenly azimuthally (e.g. if $\varphi_j = 2\pi j / N_\varphi$).
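For illustration, the coefficients of Equations 38-39 can be evaluated with a direct double sum over the Dirac delta positions (our sketch; scipy's sph_harm supplies $Y^\ell_m$, and the azimuthal sum becomes an FFT for evenly spaced $\varphi_j$):

```python
import numpy as np
from scipy.special import sph_harm

def dirac_filter_harmonics(weights, thetas, phis, L):
    """Harmonic coefficients psi^ell_m = sum_ij w_ij * conj(Y^ell_m(theta_i, phi_j))
    of a filter built as a weighted sum of Dirac deltas on the sphere
    (Equations 38-39), computed by a direct sum for clarity."""
    psilm = []
    for ell in range(L):
        row = np.zeros(2 * ell + 1, dtype=complex)
        for m in range(-ell, ell + 1):
            for i, theta in enumerate(thetas):      # theta: polar angle
                for j, phi in enumerate(phis):      # phi: azimuthal angle
                    # scipy's sph_harm signature is (order m, degree ell,
                    # azimuthal angle, polar angle)
                    row[m + ell] += weights[i, j] * np.conj(
                        sph_harm(m, ell, phi, theta))
        psilm.append(row)
    return psilm
```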
Alternative arbitrary samplings can of course be considered if useful for a problem at hand. When defining filters in this manner one should be careful not to over-parametrize by assigning more weights than needed to define a filter at the harmonic bandlimit of the signal with which we wish to convolve. For example, if the filter is to be convolved with a signal bandlimited at $L$ then a maximum of $2L-1$ Dirac deltas should be placed along each ring of constant $\theta$. One may also choose to interpolate the weights from a smaller number of learnable parameters acting as anchor points, allowing higher resolution filters to be defined with fewer learnable parameters.

C.2 DIRAC DELTA FILTERS ON THE ROTATION GROUP

Similarly a Dirac delta function $\delta_{\rho'}$ on the rotation group $SO(3)$ centered at position $\rho' = (\alpha', \beta', \gamma') \in SO(3)$ is defined (with respect to the Haar measure) as

$$\delta_{\rho'}(\rho) = \delta_R(\alpha - \alpha')\, \delta_R(\cos\beta - \cos\beta')\, \delta_R(\gamma - \gamma'), \quad (40)$$

with harmonic form

$$(\delta_{\rho'})^\ell_{mn} = D^\ell_{mn}(\rho') = e^{-im\alpha'}\, d^\ell_{mn}(\beta')\, e^{-in\gamma'}, \quad (41)$$

where $d^\ell_{mn}$ are Wigner (small) d-matrices. The filter $\psi \in L^2(SO(3))$ corresponding to a weighted sum of Dirac deltas with weights $w_{ijk}$ assigned to Dirac delta functions centered at points $\{(\alpha_i, \beta_j, \gamma_k) : i = 1, \ldots, N_\alpha;\ j = 1, \ldots, N_\beta;\ k = 1, \ldots, N_\gamma\}$ has harmonic form

$$\psi^\ell_{mn} = \sum_{i,j,k} w_{ijk}\, e^{-im\alpha_i}\, d^\ell_{mn}(\beta_j)\, e^{-in\gamma_k} \quad (42)$$
$$= \sum_j d^\ell_{mn}(\beta_j) \sum_{i,k} w_{ijk}\, e^{-im\alpha_i}\, e^{-in\gamma_k}, \quad (43)$$

where again fast Fourier transforms may be leveraged to compute the inner two sums assuming the Dirac deltas are spaced evenly in $\alpha$ and $\gamma$. The outer sums of Equation 39 and Equation 43 can also be computed by fast Fourier transforms by decomposing the Wigner d-matrices into their Fourier representation (cf. Trapani & Navaza, 2006; McEwen & Wiaux, 2011). One should again be careful not to over-parametrize.

D EQUIVARIANCE TESTS

To test rotational equivariance of operators we consider $N_f = 100$ random signals $\{f_i\}_{i=1}^{N_f}$ in $L^2(\Omega_1)$ with harmonic coefficients sampled from the standard normal distribution and $N_\rho = 100$ random rotations $\{\rho_j\}_{j=1}^{N_\rho}$ sampled uniformly on $SO(3)$. In order to measure the extent to which an operator $\mathcal{A}: L^2(\Omega_1) \to L^2(\Omega_2)$ is equivariant we evaluate the mean relative error

$$\frac{1}{N_f N_\rho} \sum_{i=1}^{N_f} \sum_{j=1}^{N_\rho} d(\mathcal{A}(\mathcal{R}_{\rho_j} f_i), \mathcal{R}_{\rho_j}(\mathcal{A} f_i)) = \frac{1}{N_f N_\rho} \sum_{i=1}^{N_f} \sum_{j=1}^{N_\rho} \frac{\|\mathcal{A}(\mathcal{R}_{\rho_j} f_i) - \mathcal{R}_{\rho_j}(\mathcal{A} f_i)\|}{\|\mathcal{A}(\mathcal{R}_{\rho_j} f_i)\|} \quad (44)$$

resulting from pre-rotation of the signal, followed by application of $\mathcal{A}$, as opposed to post-rotation after application of $\mathcal{A}$, where the norm $\|\cdot\|$ is defined using the inner product $\langle \cdot, \cdot \rangle$ on $L^2(\Omega_2)$. Table 4 presents the mean relative equivariance errors computed. We consider the three standard convolutions described in Appendix B (with a random filter $\psi_i$ for each signal $f_i$, generated in the same manner as $f_i$), the pointwise ReLU activation described in Section 2.5.1 for signals on the sphere ($\Omega_1 = S^2$) and rotation group ($\Omega_1 = SO(3)$), and the composition of the tensor-product activation with a generalized convolution, described in Sections 2.5.2 and 2.4, respectively. We follow the tensor-product activation with a generalized convolution in order to project down onto the sphere to allow the same notion of error to be adopted as for the other operators. For consistency with the context in which we leverage these operators, all experiments are performed using single-precision arithmetic.
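The test procedure of Equation 44 reduces to a few lines; the sketch below (ours) assumes the operator, rotation and signals are supplied as callables and arrays in a common representation.

```python
import numpy as np

def mean_relative_equivariance_error(A, rotate, signals, rotations):
    """Equation 44: average, over test signals and rotations, of the
    relative distance between rotating-then-applying and
    applying-then-rotating. rotate(f, rho) must implement R_rho in the
    representation in which signals are stored."""
    errors = []
    for f in signals:
        for rho in rotations:
            pre = A(rotate(f, rho))    # rotate first, then apply A
            post = rotate(A(f), rho)   # apply A first, then rotate
            errors.append(np.linalg.norm(pre - post) / np.linalg.norm(pre))
    return float(np.mean(errors))
```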
We see that the three standard notions of convolution and the composition of the tensor-product activation with a generalized convolution are all strictly equivariant to floating-point machine precision, with errors on the order of $10^{-7}$. The pointwise ReLU operator is not strictly equivariant, with a mean relative error of 0.37 for signals on the rotation group and 0.34 for signals on the sphere. These errors reduce when the signals are oversampled before application of the ReLU, indicating that the error is due to aliasing induced by the spreading of information to higher degrees that are not captured at the original bandlimit. For example, for the pointwise ReLU operator on the rotation group, oversampling by factors of 2x, 4x and 8x reduces the mean relative equivariance error from 0.37 to 0.098, 0.032 and 0.0096, respectively.

Table 4: Layer equivariance tests

    Layer                                              Mean relative error
    $S^2$ to $S^2$ conv.                               4.4 × 10⁻⁷
    $S^2$ to $SO(3)$ conv.                             5.3 × 10⁻⁷
    $SO(3)$ to $SO(3)$ conv.                           9.3 × 10⁻⁷
    Tensor-product activation → generalized conv.      5.0 × 10⁻⁷
    $S^2$ ReLU                                         3.4 × 10⁻¹
    $S^2$ ReLU (2x oversampling)                       8.9 × 10⁻²
    $S^2$ ReLU (4x oversampling)                       2.9 × 10⁻²
    $S^2$ ReLU (8x oversampling)                       1.3 × 10⁻²
    $SO(3)$ ReLU                                       3.7 × 10⁻¹
    $SO(3)$ ReLU (2x oversampling)                     9.8 × 10⁻²
    $SO(3)$ ReLU (4x oversampling)                     3.2 × 10⁻²
    $SO(3)$ ReLU (8x oversampling)                     9.6 × 10⁻³

E CONNECTION BETWEEN THE TENSOR-PRODUCT ACTIVATION AND POINTWISE SQUARING

To provide some intuition on the manner in which the tensor-product based activation introduces non-linearity into representations, we describe its relationship to pointwise squaring for signals on the sphere. Here we consider the operator $N : L^2(S^2) \to L^2(S^2)$ satisfying $(Nf)(x) = f^2(x)$ for all $x \in S^2$, which differs subtly from $N_\sigma$ with $\sigma(x) = x^2$ (using notation from Section 2.5.1): the latter corresponds to obtaining a sample-based representation at a finite bandlimit $L$ and applying the squaring at the sample positions. For the special case $\sigma(x) = x^2$ we can directly compute the harmonic representation corresponding to the equivariance-preserving continuous limit $L \to \infty$.

Figure 3: Comparison of computation and memory footprints between the generalized spherical CNN layers of Kondor et al. (2018) and our efficient generalized layers, plotted against bandlimit L (panel (a): computational cost; panel (b): memory requirements in MB). The reduction in cost due to our efficient layers is given as multiplicative factors in the lower plot of each panel.

Given a spherical signal $f \in L^2(S^2)$ with generalized representation $f = \{\hat{f}^\ell_0 \in \mathbb{C}^{2\ell+1} : \ell = 0, \ldots, L-1\}$, the generalized representation of the signal $f^2 \in L^2(S^2)$ is given as

$\big\{ \sum_{(\ell_1,\ell_2) \in \mathcal{P}_{\ell,L}} (G^{\ell_1,\ell_2,\ell})^\top (\hat{f}^{\ell_1}_0 \otimes \hat{f}^{\ell_2}_0) : \ell = 0, 1, \ldots, L-1 \big\},$   (45)

where $G^{\ell_1,\ell_2,\ell} \in \mathbb{C}^{(2\ell_1+1)(2\ell_2+1) \times (2\ell+1)}$ are matrices of Gaunt coefficients, defined as

$G^{j_1 j_2 j_3}_{m_1 m_2 m_3} = \int_{S^2} d\mu(\omega) \, Y^{j_1}_{m_1}(\omega) \, Y^{j_2}_{m_2}(\omega) \, \overline{Y^{j_3}_{m_3}(\omega)}.$   (46)

Gaunt coefficients are related to the Clebsch-Gordan coefficients by

$G^{j_1 j_2 j_3}_{m_1 m_2 m_3} = w^{j_1 j_2 j_3} \, C^{j_1 j_2 j_3}_{m_1 m_2 m_3},$   (47)

where $w^{j_1 j_2 j_3} = (-1)^{m_3} \sqrt{\tfrac{(2j_1+1)(2j_2+1)}{4\pi(2j_3+1)}} \, C^{j_1 j_2 j_3}_{0\,0\,0}$. Therefore, the continuous squaring operation corresponds to passing $f$ through a tensor-product activation $N_\otimes$ followed by a generalized convolution back down onto the sphere (a single fragment per degree), with the weight assigned to the $(\ell_1, \ell_2) \in \mathcal{P}_{\ell,L}$ fragment in degree $\ell$ given by $w^{\ell_1 \ell_2 \ell}$.
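The Gaunt coefficients of Equation 46 can be computed exactly with sympy. The sketch below is our illustration, not the authors' code: sympy's `gaunt` integrates three unconjugated harmonics, so we map to the conjugated definition of Equation 46 via the identity $\overline{Y^j_m} = (-1)^m Y^j_{-m}$.

```python
import numpy as np
from sympy.physics.wigner import gaunt


def gaunt_tensor(l1, l2, l3):
    """Gaunt coefficients G^{l1 l2 l3}_{m1 m2 m3} of Equation 46 (third
    harmonic conjugated), assembled as a (2*l1+1, 2*l2+1, 2*l3+1) array.
    sympy's gaunt() integrates three *unconjugated* spherical harmonics,
    so conj(Y^l_m) = (-1)^m Y^l_{-m} maps between the two conventions."""
    G = np.zeros((2 * l1 + 1, 2 * l2 + 1, 2 * l3 + 1))
    for m1 in range(-l1, l1 + 1):
        for m2 in range(-l2, l2 + 1):
            m3 = m1 + m2  # the conjugation forces the selection rule m3 = m1 + m2
            if -l3 <= m3 <= l3:
                G[m1 + l1, m2 + l2, m3 + l3] = float(
                    (-1) ** m3 * gaunt(l1, l2, l3, m1, m2, -m3))
    return G
```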
The connection above demonstrates that activations learnable within our framework can have very simple real-space interpretations. Even when confining outputs to the sphere, we found it beneficial to allow the down-projection to be learnable rather than enforcing the weights given above for pointwise squaring. Learned activations nevertheless remain quadratic, given that output fragments are linear combinations of products between input fragments.

F COMPARISON OF COMPUTATIONAL AND MEMORY FOOTPRINTS

We perform a quantitative analysis of the computational cost and memory requirements of our proposed layers, with comparisons to prior approaches, to demonstrate how the complexity savings made through our proposals in Section 3 translate into tangible efficiency improvements. We consider the simplest comparison: between a multi-channel generalized signal $f = (f_1, \ldots, f_K) \in \mathcal{F}^K_L$, where each channel $f_i$ has type $\tau_{f_i} = (1, \ldots, 1)$ (and therefore corresponds to a signal on the sphere), and a uni-channel signal $g \in \mathcal{F}_L$ of type $\tau_g = (K, \ldots, K)$. The setting corresponding to signal $f$ captures the efficiency improvements of our proposed layers, while the setting corresponding to $g$ represents the case without these improvements. Notice that the total number of fragments is the same for both $f$ and $g$. We compare the number of floating-point operations (FLOPs) and the amount of memory required to perform a tensor-product based activation followed by a generalized convolution projecting back down onto a signal of the same type as the input. When applied to $f$, MST mixing sets $\mathcal{P}_{\ell,L}$ (Section 3.1.3) and constrained generalized convolutions (Section 3.1.2) are used.

Figure 4: Visualization of the architecture used for the convolutional base in our hybrid models: an SO(3) layer (an SO(3) convolution followed by a constrained generalized convolution) followed by efficient generalized layers (each comprising tensor products and a constrained generalized convolution). The input to the first convolutional layer is a signal on the sphere. The output of the final convolutional layer consists of scalar values corresponding to fragments of degree $\ell = 0$, which are then mapped through some fully connected layers to give the model output.

When applied to $g$, full mixing sets $\mathcal{P}_{\ell,L}$ and unconstrained generalized convolutions are used, as in Kondor et al. (2018). The costs considered are those for a single training instance (batch size of 1). Figure 3 shows the computational costs and memory requirements, in terms of floating-point operations and megabytes respectively, for $K = 4$ and various spatial bandlimits $L$. We adopt the convention whereby complex addition and multiplication require 2 and 6 floating-point operations respectively. At low bandlimits we see the saving arising from the channel-wise structure. Note that the saving illustrated here is relatively small since $K = 4$ in these experiments (so that the scaling with $L$ is apparent), whereas in practice typically $K \sim 100$. The saving then increases linearly with the bandlimit of the input, as expected given the $O(L^5)$ and $O(L^4)$ complexities of the respective approaches. We see that even in this simple case, with a relatively small number of channels ($K = 4$), both the computational and memory footprints are reduced by orders of magnitude. At a bandlimit of $L = 128$ the computational cost is reduced by a factor of 101 and the memory requirement by a factor of 29.
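To make the FLOP convention concrete, the following rough sketch (ours; it counts only the outer products of the tensor-product activation, ignoring the Clebsch-Gordan contraction and the subsequent generalized convolution) illustrates how the channel-wise structure changes the count:

```python
def tensor_product_flops(L, K, channelwise=True):
    """Rough FLOP count for the outer products in a tensor-product
    activation at bandlimit L with K channels, using the convention of
    6 FLOPs per complex multiplication. Channel-wise (ours): K signals of
    type (1, ..., 1), giving K fragment pairs per degree pair (l1, l2);
    otherwise one signal of type (K, ..., K), as in Kondor et al. (2018),
    giving K**2 fragment pairs per degree pair."""
    pairs = K if channelwise else K ** 2
    return sum(pairs * 6 * (2 * l1 + 1) * (2 * l2 + 1)
               for l1 in range(L) for l2 in range(L))
```

In this toy count the channel-wise structure saves exactly a factor of K; the larger savings reported above additionally reflect the constrained convolutions and MST mixing sets.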
G ADDITIONAL INFORMATION ON EXPERIMENTS

G.1 ROTATED MNIST ON THE SPHERE

For our MNIST experiments we used a hybrid model with the architecture shown in Figure 4. The first block includes a directional convolution on the sphere that lifts the spherical input ($\tau^\ell_{f^{(0)}} = 1$) onto the rotation group ($\tau^\ell_{f^{(1)}} = \min(2\ell+1, 2N_1-1)$). The second block includes a convolution on the rotation group, hence its input and output both live on the rotation group. We then apply a restricted generalized convolution to map to type $\tau^\ell_{f^{(3)}} = \lceil \tau_{\max} / \sqrt{2\ell+1} \rceil$, where $\tau_{\max} = 5$. The same type is used for the following three channel-wise tensor-product activations and two restricted generalized convolutions, until the final restricted generalized convolution maps down to a rotationally invariant representation ($\tau^\ell_{f^{(5)}} = \delta_{\ell 0}$). As is traditional in convolutional networks, we gradually decrease the resolution, with $(L_0, L_1, L_2, L_3, L_4, L_5) = (20, 10, 10, 6, 3, 1)$, and increase the number of channels, with $(K_0, K_1, K_2, K_3, K_4, K_5) = (1, 20, 22, 24, 26, 28)$. We follow these convolutional layers with a single dense layer of size 30, sandwiched between two dropout layers (keep probability 0.5), and then fully connect to the output of size 10. We train the network for 50 epochs on batches of size 32, using the Adam optimizer (Kingma & Ba, 2015) with a decaying learning rate starting at 0.001. For the restricted generalized convolutions we follow the approach of Kondor et al. (2018) by using L1 regularization (regularization strength $10^{-5}$) and applying a restricted batch normalization across fragments, where the fragments are only scaled by their average and not translated (to preserve equivariance).

G.2 ATOMIZATION ENERGY PREDICTION

When regressing the atomization energy of molecules there are two inputs to the model: the number of atoms of each element contained in the molecule, and spherical cross-sections of the potential energy around each atom. We adopt the high-level QM7-specific architecture of Cohen et al. (2018), which contains a spherical CNN as a sub-model, for which we substitute our own. This results in an overall model that is invariant both to rotations of the molecule around each constituent atom and to permutations of the ordering of the atoms. The first (non-spherical) input is mapped onto a scalar output using a multi-layer perceptron (MLP) with three hidden layers of sizes 100, 100 and 10 (and ReLU activations). The second input, multiple spherical cross-sections for each atom, is projected separately using a shared spherical CNN (of the architecture described below) onto lower-dimensional vectors of size 64. The mean vector is then taken across atoms (ensuring invariance w.r.t. permutations of the atoms) and mapped onto a scalar output using an MLP with a single hidden layer of size 512 (with a ReLU activation). The predicted energy is then taken to be the sum of the two scalar outputs. As a starting point we train the first MLP to regress the atomization energies alone (achieving an RMS error of 20), before pairing it with the spherical model (and its connected MLP). We then train the joint model for 60 epochs, again with the Adam optimizer, a decaying learning rate (starting at $2.5 \times 10^{-4}$), L2 regularization of the efficient generalized layers (strength $2.5 \times 10^{-6}$) and batches of size 32. For the spherical component we again adopt the convolutional architecture shown in Figure 4, except with one fewer efficient generalized layer. We use bandlimits of $(L_0, L_1, L_2, L_3, L_4) = (10, 6, 6, 3, 1)$, channels of $(K_0, K_1, K_2, K_3, K_4) = (5, 16, 24, 32, 40)$ and $\tau_{\max} = 6$. One minor difference is that this time we include a skip connection between the $\ell = 0$ components of the fourth and fifth layers. We follow the convolutional layers with two dense layers of sizes (256, 64) and use batch normalization between each layer.
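The invariances described above are simple to realize in code. The sketch below (our illustration, with illustrative names; the two MLPs and the per-atom features from the shared spherical CNN are assumed to be given) shows the two-branch structure and the mean-pooling that enforces permutation invariance:

```python
import numpy as np


def predict_atomization_energy(atom_counts, per_atom_features, count_mlp, pooled_mlp):
    """Two-branch QM7 head. `atom_counts` is the per-element composition
    vector; `per_atom_features` has shape (n_atoms, 64), one vector per
    atom from the shared spherical CNN; `count_mlp` and `pooled_mlp` are
    the two MLPs described above, passed in as callables."""
    composition_term = count_mlp(atom_counts)
    pooled = per_atom_features.mean(axis=0)  # mean over atoms: permutation invariant
    geometry_term = pooled_mlp(pooled)
    return composition_term + geometry_term  # predicted energy is the sum
```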
G.3 3D SHAPE RETRIEVAL

To project the 3D meshes of the SHREC'17 data onto spherical representations (bandlimited at $L = 128$) we adopt the preprocessing approach of Cohen et al. (2018) and augment the data with random rotations and translations. We construct a model with an architecture that is again similar to that described in Appendix G.1, but with an additional axisymmetric convolutional layer prepended to the start of the network and one fewer efficient generalized layer. We use bandlimits $(L_0, L_1, L_2, L_3, L_4, L_5) = (128, 32, 16, 16, 6, 1)$, channels $(K_0, K_1, K_2, K_3, K_4, K_5) = (6, 20, 30, 40, 60, 70)$ and $\tau_{\max} = 6$ for the efficient generalized layers. The convolutional layers are followed by a dense layer of size 128, which is fully connected to the output (of size 55). We again train with the Adam optimizer, a decaying learning rate (starting at $5 \times 10^{-4}$) and batches of size 8, this time until performance on the validation set showed no improvement for at least 4 epochs (36 epochs in total). We perform batch normalization between convolutional layers and apply dropout before the dense layer. We regularize the efficient generalized layers with L2 regularization (strength $10^{-5}$). When testing our model we average the output probabilities over 15 augmentations of the data.
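Test-time augmentation of this kind may be sketched as follows (our illustration; `model` is assumed to return class probabilities over the 55 SHREC'17 categories, and `augment` to apply a random rotation and translation before projection onto the sphere):

```python
import numpy as np


def predict_with_augmentation(model, mesh, augment, n_aug=15):
    """Average class probabilities over n_aug random augmentations of the
    input, as used when testing the shape-retrieval model."""
    probabilities = [model(augment(mesh)) for _ in range(n_aug)]
    return np.mean(probabilities, axis=0)
```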