# Implicit Bias of Linear Equivariant Networks

Hannah Lawrence 1, Kristian Georgiev 1, Andrew Dienes 1, Bobak T. Kiani 1

1 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Correspondence to: Bobak T. Kiani.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract: Group equivariant convolutional neural networks (G-CNNs) are generalizations of convolutional neural networks (CNNs) which excel in a wide range of technical applications by explicitly encoding symmetries, such as rotations and permutations, in their architectures. Although the success of G-CNNs is driven by their explicit symmetry bias, a recent line of work has proposed that the implicit bias of training algorithms on particular architectures is key to understanding generalization for overparameterized neural nets. In this context, we show that L-layer full-width linear G-CNNs trained via gradient descent for binary classification converge to solutions with low-rank Fourier matrix coefficients, regularized by the 2/L-Schatten matrix norm. Our work strictly generalizes previous analysis on the implicit bias of linear CNNs to linear G-CNNs over all finite groups, including the challenging setting of non-commutative groups (such as permutations), as well as band-limited G-CNNs over infinite groups. We validate our theorems via experiments on a variety of groups, and empirically explore more realistic nonlinear networks, which locally capture similar regularization patterns. Finally, we provide intuitive interpretations of our Fourier-space implicit regularization results in real space via uncertainty principles.

1. Introduction

Modern deep learning algorithms typically have many more parameters than data points, and their ability to achieve good generalization in this overparameterized setting is largely unexplained by current theory. Classic generalization bounds, which bound the generalization error when models are not overly complex, are vacuous for neural networks that can perfectly fit random training labels (Zhang et al., 2017). More recent work analyzes the complexity of deep learning algorithms by instead characterizing the properties of the functions they output. Notably, prior work has shown that training via gradient descent implicitly regularizes towards certain hypothesis classes with low complexity, which may generalize better as a result. For example, in underdetermined least squares regression, gradient descent converges to the ℓ2-norm minimizer, while a pointwise-square reparametrization converges to the ℓ1-norm minimizer (Gunasekar et al., 2018a). For logistic regression on linearly separable data, Soudry et al. (2018) proved that the learned predictor under gradient descent converges in direction to the max-margin solution. Similar phenomena have been established for certain linear neural networks: Lyu and Li (2020) extended this max-margin result to gradient descent on any homogeneous neural network, and Gunasekar et al. (2018b) showed that learned linear fully-connected and convolutional networks implicitly regularize the ℓ2 norm and a depth-dependent norm in Fourier space, respectively.
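As a concrete reference point for the implicit-bias results cited above, the following self-contained sketch (ours, not taken from the paper or its code release; the dimensions, step size, and iteration count are illustrative assumptions) runs gradient descent on an underdetermined least-squares problem under two parametrizations of the same linear predictor. The direct parametrization with zero initialization converges to the minimum-ℓ2-norm interpolant, while the pointwise-square reparametrization β = u⊙u − v⊙v with small initialization typically ends at an interpolant with noticeably smaller ℓ1 norm, in line with Gunasekar et al. (2018a).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 15, 40                                  # underdetermined: fewer equations than unknowns
A = rng.standard_normal((n, d))
x_true = np.zeros(d); x_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = A @ x_true

# (i) Direct parametrization, zero init: gradient descent converges to the
#     minimum-l2-norm interpolant, given in closed form by the pseudoinverse.
beta_l2 = np.linalg.pinv(A) @ y

# (ii) Pointwise-square reparametrization beta = u*u - v*v, small init.
u = 0.01 * np.ones(d)
v = 0.01 * np.ones(d)
lr = 1e-3
for _ in range(200_000):
    beta = u * u - v * v
    grad_beta = A.T @ (A @ beta - y)           # gradient of 0.5*||A beta - y||^2 w.r.t. beta
    u, v = u - lr * 2.0 * grad_beta * u, v + lr * 2.0 * grad_beta * v

beta_quad = u * u - v * v
print("interpolation residuals:", np.linalg.norm(A @ beta_l2 - y), np.linalg.norm(A @ beta_quad - y))
print("l1 norms -- min-l2 interpolant: %.3f, quadratic reparametrization: %.3f"
      % (np.abs(beta_l2).sum(), np.abs(beta_quad).sum()))
```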
From a more applied perspective, a large body of work imposes structured inductive biases on deep learning algorithms to exploit symmetry patterns (Kondor, 2007; Reisert, 2008; Cohen and Welling, 2016b). One prominent method parameterizes models over functions that are equivariant with respect to a symmetry group (i.e., outputs transform predictably in response to input transformations). In fact, Kondor and Trivedi (2018) showed that any group equivariant network can be expressed as a series of group convolutional layers interwoven with pointwise nonlinearities, demonstrating that group convolutional neural networks (G-CNNs) are the most general family of equivariant networks.

Our Contributions: The explicit inductive bias of G-CNNs is the main reason for their usage. Yet, to the best of our knowledge, the implicit bias imposed by equivariant architectures has not been explored. Here, we greatly generalize the results of Yun et al. (2020) and Gunasekar et al. (2018b) to linear G-CNNs whose hidden layers perform group equivariant transformations. We show, surprisingly, that L-layer G-CNNs are implicitly regularized by the 2/L-Schatten norm, which is the 2/L norm of a matrix's singular values, over the irreducible representations in the Fourier basis. As a result, convergence is biased towards sparse solutions in the Fourier regime, as summarized in Theorem 1 and illustrated in Figure 1a, as well as in further experiments in Section 6.

Figure 1: Training via gradient descent of linear G-CNNs (with linearization β) implicitly biases towards sparse singular values for Fourier matrix coefficients of β (see Theorem 1). Panel (a) plots the Fourier space norm and panel (b) the real space norm of the network linearization β against training epoch for CNN, G-CNN, and FC architectures. Figure 1a traces the D8 Fourier space sparsity over the course of training three architectures in an overparameterized classification task, beginning from epoch 1. The G-CNN converges to a Fourier-sparse linearization, in contrast with fully-connected and convolutional networks. The horizontal line shows the minimum Fourier Schatten norm among all interpolating predictors, obtained as the solution of a convex relaxation; since the G-CNN norm converges toward this value, this verifies our main statement of convergence to a local minimum. Figure 1b illustrates the uncertainty principles, discussed in Section 6 and illustrated more explicitly in Figure 5: sparseness in (group) Fourier space necessitates being dense in real space, and vice versa.

New Technical Ingredients: Our primary technical contribution is to generalize the proof technique of Gunasekar et al. (2018b) by realizing it in a more general setting, translating it to non-abelian groups using the language of representation theory. Since convolutions in real space correspond to (non-commutative) matrix multiplications, rather than scalar multiplications, in the appropriate group Fourier space, this substantially complicates the analysis of the KKT conditions. Nonetheless, we use the framework of non-commutative Fourier and convex analysis to prove that a particular function of the singular values of the Fourier-space linearization is implicitly optimized.

Theorem 1 (main result; informal). Let $NN_L(\cdot)$ denote an $L$-layer linear group convolutional neural network, in which each hidden layer performs group cross-correlation over the group $G$ with a full-width kernel, and the final layer is a fully connected layer.
When learning linearly separable data $\{x_i, y_i\}_{i=1}^n$ using the exponential loss function, the network converges in direction to a linear function $NN_L(x) \propto \beta^T x$, where $\beta$ is given by a stationary point of the following minimization problem:

$$\min_{\beta} \; \|\hat{\beta}\|_{(S)\,2/L} \quad \text{s.t.} \quad y_i x_i^T \beta \geq 1 \;\; \forall i \in [n] \qquad (1)$$

Here, $\|\hat{\beta}\|_{(S)\,2/L}$ denotes the 2/L-Schatten norm of the matrix Fourier transform of $\beta$ (see Definition 4.1), equivalent to

$$\|\hat{\beta}\|_{(S)\,2/L} = \Big( \sum_{\rho \in \hat{G}} d_\rho \sum_{i} \sigma_i\big(\hat{\beta}(\rho)\big)^{2/L} \Big)^{L/2} \qquad (2)$$

where $\hat{G}$ is a complete set of unitary irreducible representations of $G$, $d_\rho$ is the dimension of irreducible representation $\rho$, and $\sigma_i(\cdot)$ denotes the $i$-th singular value.

We note that both sparsity (for vectors) and low-rankness (for matrices) are desirable properties for many applications, not least because such predictors are efficient to store and manipulate, thus potentially expanding the scope of application of G-CNNs to areas where sparsity and low-rankness are explicit desiderata. Connecting our findings to research on uncertainty theorems (Wigderson and Wigderson, 2021), we also show that the implicit regularization towards sparseness (or low-rank irreducible representations) in the Fourier regime necessarily implies that solutions in the real regime are dense, as illustrated in Figure 1b. These results provide a more intuitive and practical perspective on the inductive bias of G-CNNs and the types of functions that they learn.

We proceed as follows. Section 2 discusses related works and their relation to our contributions. In Section 3, we define notation. Section 4 provides a basic background in the group theory and Fourier analysis necessary to understand our results, with a more complete exposition in Appendix B. Our main results are stated in Section 5, with the main proof ideas for the abelian (or commutative) and non-abelian (or non-commutative) cases given in Section 5.1 and Section 5.2, respectively (complete proofs can be found in Appendix A). Section 6 validates our theoretical results with synthetic experiments on a variety of groups, and with exploratory experiments on non-linear networks. Finally, we discuss these results and future questions in Section 7.

2. Related Work

Enforcing equivariance and symmetries via parameter sharing schemes was introduced in the group theoretic setting in Cohen and Welling (2016a) and Gens and Domingos (2014). Despite considerable interest in equivariant learning, no works to our knowledge have explored the implicit regularization of gradient descent on equivariant convolutional neural networks. We show that the tensor formulation of neural networks in Yun et al. (2020) and the proofs in Gunasekar et al. (2018b) encompass G-CNNs for which the underlying group is cyclic, and we naturally extend their results to G-CNNs over any commutative group (see Section 5.1). However, these works do not cover the case of convolutions with respect to non-commutative groups, such as three-dimensional rotations and permutations, which incidentally include some of the most compelling applications of group equivariance in practice (Zaheer et al., 2017; Anderson et al., 2019; Esteves et al., 2018). As such, articulating the implicit bias in the more general non-abelian case is important for understanding many of the current group equivariant architectures (Zaheer et al., 2017; Kondor et al., 2018; Esteves et al., 2018; Weiler and Cesa, 2019).
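To make the regularizer in Theorem 1 concrete, the following short sketch (ours, not from the paper's code release) evaluates the quantity in Equation (2) given the per-irrep Fourier blocks of a predictor and their dimensions. The function name and the convention that each block is weighted by its multiplicity $d_\rho$ follow our reading of Definition 4.1 and Equation (2), so treat this as an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

def schatten_2_over_L(fourier_blocks, block_dims, L):
    """Evaluate (sum_rho d_rho * sum_i sigma_i(beta_hat(rho))^(2/L))^(L/2),
    the 2/L-Schatten quasi-norm of the block-diagonal matrix Fourier transform
    in which each irrep block beta_hat(rho) appears d_rho times."""
    p = 2.0 / L
    total = 0.0
    for block, d_rho in zip(fourier_blocks, block_dims):
        sing_vals = np.linalg.svd(np.atleast_2d(block), compute_uv=False)
        total += d_rho * np.sum(sing_vals ** p)
    return total ** (1.0 / p)

# Toy example: one 1-dimensional irrep and one 2-dimensional irrep.
rng = np.random.default_rng(0)
blocks = [np.array([[0.7]]), rng.standard_normal((2, 2))]
dims = [1, 2]
print(schatten_2_over_L(blocks, dims, L=3))    # depth-3 network -> 2/3-quasi-norm
```

For an abelian group every block is 1 x 1, and this quantity reduces to the 2/L vector quasi-norm of the scalar Fourier coefficients, matching Remark 5.5.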
Non-abelian convolutions require more structure to analyze theoretically than abelian convolutions: the latter are merely pointwise multiplications in Fourier space, whereas the former are matrix multiplications between irreducible representations, and therefore cannot be expressed in the tensor language of Yun et al. (2020). Instead, we build on the optimization tools and comparable convergence assumptions of Gunasekar et al. (2018b) to explicitly characterize the stationary points of convergence for non-abelian G-CNNs. We also note that our results are consistent with those of Razin and Cohen (2020), who show that implicit regularization is often captured by measures of complexity which are quasi-norms, such as tensor rank. Our results prove that linear G-CNNs are biased towards low-rank solutions in the Fourier regime, via regularization of the 2/L-Schatten norms over Fourier matrix coefficients (also a quasi-norm). Lastly, there is a line of work focusing on understanding the expressivity (Kondor and Trivedi, 2018; Cohen et al., 2019; Yarotsky, 2021) and generalization (Sannai and Imaizumi, 2019; Lyle et al., 2020; Bulusu et al., 2021; Elesedy and Zaidi, 2021) of equivariant networks, but not specifically their implicit regularization.

To analyze bounded-width filters, which are more commonly used in practice, a recent work by Jagadeesan et al. (2021) shows that the implicit regularization for an arbitrary filter width K is unlikely to admit a closed-form solution. Separate from calculating the exact form of implicit regularization, there is a rich line of work that details the trade-offs between restricting a function in its real versus Fourier regimes via uncertainty principles (Meshulam, 1992; Wigderson and Wigderson, 2021). While the connection between uncertainty theorems and bounded-width convolutional neural networks has not been thoroughly explored, Caro et al. (2021) and Nicola and Trapasso (2021) highlight the importance of uncertainty principles for understanding the behaviour of modern CNNs.

3. Notation

Throughout this text, we denote scalars in lowercase script ($a$), vectors in bold lowercase script ($\mathbf{a}$), matrices in either bold uppercase script ($\mathbf{A}$) or hatted bold lowercase script ($\hat{\mathbf{a}}$) when vectors are transformed into the Fourier regime (see Definition 4.1), and tensors in bold non-italic uppercase script. For $f$ a function with range in $\mathbb{C}$, we overload notation slightly and let $\bar{f}$ denote the function with an element-wise complex conjugate applied, i.e. $\bar{f}(x) = \overline{f(x)}$. If $f$ is defined on a group $G$, let $f^*(g) = \bar{f}(g^{-1})$. For a vector $\mathbf{a} \in \mathbb{C}^n$ or a matrix $\mathbf{A} \in \mathbb{C}^{m \times n}$, we denote the conjugate transpose as $\mathbf{a}^*$ and $\mathbf{A}^*$ respectively. We use $\langle \cdot, \cdot \rangle$ to denote the standard inner product between two vectors and $\langle \cdot, \cdot \rangle_M$ to denote the inner product between matrices, defined as $\langle A, B \rangle_M = \mathrm{tr}(AB^*)$. $\|\cdot\|_p$ denotes the vector $p$-norm ($p = 2$ when the subscript is omitted) and $\|\cdot\|_{(S)\,p}$ denotes the $p$-Schatten norm¹ of a matrix (equivalent to the $p$-vector norm of the singular values of the matrix). We denote groups by uppercase letters $G$, an irreducible representation (irrep) of a group $G$ by $\rho$ or $\rho_i$, and a complete set of irreps by $\hat{G}$, so every unitary irrep $\rho$ is equivalent (up to isomorphism) to exactly one element of $\hat{G}$. The dimension of a given irrep $\rho$ is $d_\rho$.

¹ Despite our terminology, $p$-vector and $p$-Schatten norms are technically quasi-norms for $p < 1$.
4. Background in group theory and Group-Equivariant CNNs

In this study, we analyze linear G-CNNs in a binary classification setting, where hidden layers perform equivariant operations over a finite group $G$, and networks have no nonlinear activation function after their linear group operations. Inputs $x_i$ are vectors of dimension $|G|$ (i.e., vectorized group functions $x : G \to \mathbb{R}$), and targets $y_i$ are scalars taking the value of either $+1$ or $-1$. Hidden layers in our G-CNNs perform cross-correlation over a group $G$, defined as

$$(g \star h)(u) = \sum_{v \in G} g(uv)\, h(v) \qquad (3)$$

where $g, h : G \to \mathbb{R}$. Note that the above is equivariant to the left action of the group, i.e., if $w \in G$ and $g^w(u) = g(wu)$, then $(g^w \star h)(u) = (g \star h)(wu)$. The final layer of our G-CNN is a fully connected layer mapping vectors of length $|G|$ to scalars. We note that this final layer in general will not construct functions that are symmetric to the group operations, as strictly enforcing group invariance in this linear setting would result in trivial outputs (only scalings of the average value of the input). Nonetheless, this model still captures the composed convolutions of G-CNNs, and is similar in construction to many practical G-CNN models, whose earlier G-convolutions still capture useful high-level equivariant features. For instance, the spherical CNN of Cohen et al. (2018) also has a final fully connected layer.

Figure 2: Cross-correlation of two functions over a group is equivalent to matrix multiplication over irreps (shown as blocks of a larger matrix here) in Fourier space.

Analogous to the discrete Fourier transform, there exists a group Fourier transform mapping a function into the Fourier basis over the irreps of $G$.

Definition 4.1 (Group Fourier transform). Let $f : G \to \mathbb{C}$. Given a fixed ordering of $G$, let $e_u$ be the standard basis vector in $\mathbb{R}^{|G|}$ that is 1 at the location of $u$ and 0 elsewhere. Then, $\mathbf{f} = \sum_{u \in G} f(u)\, e_u$ is the vectorized function $f$. Given $\hat{G}$ a complete set of unitary irreps of $G$, let $\rho \in \hat{G}$ be a given irrep of dimension $d_\rho$, $\rho : G \to GL(d_\rho, \mathbb{C})$². The group Fourier transform of $f$ at a representation $\rho$ is defined as (Terras, 1999)

$$\hat{f}(\rho) = \sum_{u \in G} f(u)\, \rho(u). \qquad (4)$$

By choosing a fixed ordering of $\hat{G}$, one can similarly construct $\hat{\mathbf{f}}$ as a block-diagonal matrix version of $\hat{f}$ (as in Figure 2). We define $F_M$ to be the matrix Fourier transform that takes $\mathbf{f}$ to $\hat{\mathbf{f}}$:

$$\hat{\mathbf{f}} = F_M \mathbf{f} = \bigoplus_{\rho \in \hat{G}} \hat{f}(\rho)^{\oplus d_\rho} \in GL(|G|, \mathbb{C}). \qquad (5)$$

² Note that for an abelian group, $d_\rho = 1$ for all $\rho$. For standard Fourier analysis over the cyclic group, each $\rho$ is a complex sinusoid at some frequency.

$\hat{\mathbf{f}}$ and $F_M \mathbf{f}$ are shortened notation for the complete Fourier transform. Furthermore, by vectorizing the matrix $\hat{\mathbf{f}}$, there is a unitary matrix $F$ taking $\mathbf{f}$ to $\hat{\mathbf{f}}$, analogous to the standard discrete Fourier matrix. We use the following explicit construction of $F$: denoting $e_{[\rho,i,j]}$ as the column-major vectorized basis for element $\rho_{ij}$ in the group Fourier transform, we can form the matrix

$$F = \sum_{u \in G} \sum_{\rho \in \hat{G}} \sqrt{\frac{d_\rho}{|G|}} \sum_{i,j=1}^{d_\rho} \rho(u)_{ij}\, e_{[\rho,i,j]}\, e_u^T. \qquad (6)$$

Intuitively, for each group element $g$, the matrix $F$ contains all the irrep images $\rho(g)$ flattened into a single column. See Appendix B for further exposition.

Convolution and cross-correlation are equivalent, up to scaling, to matrix operations after Fourier transformation. For example, for cross-correlation (Equation 3), $\widehat{(g \star h)}(\rho) = \hat{g}(\rho)\, \hat{h}(\rho)^*$. This simple fact, illustrated in Figure 2, is behind the proofs of our implicit bias results.

5.
Main Results We consider linear group-convolutional networks for classification analogous to those of Yun et al. (2020) and Gunasekar et al. (2018b). A linear G-CNN is composed of several group cross-correlation layers followed by a fully connected layer. The input is formalized as a function on the group (according to some pre-defined ordering of elements), x : G R, and the output is a scalar. Explicitly, let G be a finite group with x, w1, . . . , w L realvalued functions on G. The network output is NN(x) = x w1 w L 1, w L x, β . As an example, in Figure 3, we illustrate both a practical G-CNN architecture and this linearized version that we will study, for the group D8. Let W = [w1 . . . w L] be the concatenation of all network parameters, and β = P(W ) = w L w L 1 w 1 be the end-to-end linear predictor consisting of composed cross-correlation operations. One can check (as we do in Lemma A.1) that bβ FMβ = c w L . . . c w1. Networks are trained via gradient descent over the exponential loss function on linearly separable data {xi, yi}N i=1. Iterates take the form W (t+1) ℓ = W (t) ℓ ηt WℓL(P(W )) (7) where ℓ( x, β , y) = exp( x, β y) and L(β) = PN i=1 ℓ( xi, β , yi). Implicit Bias of Linear Equivariant Networks Input Filter Re LU Re LU Re LU Filter Filter Layer L Layer 1 Layer 2 flip, rot 90 flip, rot 180 flip, rot 270 (a) A practical G-CNN architecture for D8. Filter Filter Layer 1 Layer 2 Identity Identity Input on D8: flip, rot 90 flip, rot 180 flip, rot 270 (b) The linear G-CNN architecture we analyze for D8. Figure 3: A comparison between a practical G-CNN architecture for the group D8, and its corresponding linear idealization that we will theoretically analyze. D8 is a group of size 8, consisting of all rotations by 90 degrees and reflections. In a practical architecture, as shown in the first panel, the input may be an image, or anything upon which D8 acts; it can be convolved over D8 with respect to a learnable filter, yielding a function on D8 (i.e. a function that takes 8 values). The following layers intersperse D8-convolutions, possibly with several channels, and nonlinearities, before a final fully connected layer yields a scalar output. In contrast, the second panel shows the simplified architecture we analyze here. In particular, there are no non-linearities and only a single channel is used. Furthermore, we assume the input is already a function on D8, i.e. the input is an 8-dimensional vector. (One can think of this input as a fixed featurization of some image convolved over D8 with a fixed filter.) 5.1. Abelian Similar to ordinary (cyclic) convolution3, the commutative property of abelian groups implies that convolutions in real space are equivalently pointwise multiplication of irreps in Fourier space, since all irreps are one-dimensional for commutative groups (dρ = 1 ρ). To start, recall the key definition of Yun et al. (2020), determining which network architectures fall within the purview of their results: Proposition 5.1 (paraphrased from (Yun et al., 2020)). Let M(x) be a map from data x Rd to a data tensor M(x) Rk1 k2 k L. The input into an L-layer tensorized neural network can be written as an orthogonally decomposable data tensor if there exists a full column rank matrix S Cm d and semi-unitary matrices U1, . . . , UL Ckℓ m where d m minℓkℓsuch that: j=1 [Sx]j ([U1] ,j [U2] ,j [UL] ,j) and moreover the network output is the tensor multiplication between M(x) and each layer s parameters: NN(x; Θ) = M(x) (w1, . . . 
, w L) i L=1 M(x)i1...i L(w1)i1 (w L)i L 3We include a canonical result in the appendix, Theorem A.4, demonstrating that all finite abelian groups are direct products of cyclic groups, i.e. multidimensional translational symmetries. Indeed, a linear G-CNN over an abelian group can be expressed in a way that satisfies Proposition A.3 for an appropriate choice of S and U1, . . . , UL, as stated in the following proposition. Proposition 5.2. Let M(x) be an orthogonally decomposable data tensor with associated matrices S, U1, . . . , UL as in Proposition 5.1. Given a finite abelian group G, let d = m = kℓ= |G| and F Cd d be the group Fourier transform of G (see Definition 4.1). With S = d L 1 2 F, unitary matrices Uℓ= F 1 = F , and the data tensor M(x) defined correspondingly, the output of a G-CNN with real-valued filters w1, . . . , w L is a tensor operation: M(x) (w1, . . . w L) = x w1 w L 1, w L The proof is deferred to Appendix A. Fundamentally, the result requires not only that F is unitary, which holds for all finite groups, but also that cross-correlation is pointwise multiplication (up to a conjugate transpose) in Fourier space, i.e. F(x w) = (Fx) (Fw) . This property only holds for commutative groups, as matrix multiplication is pointwise multiplication only for matrices of dimension dρ = 1. Given Proposition 5.2, we apply the implicit bias statement of Yun et al. (2020). Theorem 5.3 (Implicit regularization of linear G-CNNs for G an abelian group). Suppose there exists λ > 0 such that the initial directions w1, . . . , w L of the network parameters satisfy |[F wℓ]j|2 |[F w L]j|2 λ for all ℓ [L 1] and j [m], i.e. if the Fourier transform magnitudes of the initial directions look sufficiently different pointwise (which Implicit Bias of Linear Equivariant Networks occurs with high probability for e.g. a Gaussian random initialization). Then, β = P([w1, . . . , w L]) converges in a direction that aligns with a stationary point z of the following optimization program: min z Cm Fz 2/L s.t. yi xi, z 1, i [n] (8) As noted in Theorem A.4 of the Appendix, all finite abelian groups can be expressed as a direct product of cyclic groups. In contrast, many groups (rotations, subgroups of permutations, etc.) with much richer structure are non-commutative, and we now turn our attention to the non-abelian case. 5.2. Non-Abelian In Fourier space, non-abelian convolution consists of matrix multiplication over irreps, and does not fit the pointwise multiplication structure of Proposition 5.2. We instead build upon the results of Gunasekar et al. (2018b), and directly analyze the stationary points of the proposed optimization program to prove the following: Theorem 5.4 (Non-abelian; see also Theorem A.5). Consider a classification task with ground-truth linear predictor β, trained via a linear G-CNN architecture with L > 2 layers under the exponential loss. For almost all β-separable datasets {xi, yi}n i=1, any bounded sequence of step sizes ηt, and almost all initializations: if the loss converges to 0, the gradients converge in direction, and the iterates themselves all converge in direction to a classifier with positive margin, then the resultant predictor is a scaling of a first order stationary point of the optimization problem: min β bβ (S) 2/L s.t. n, yn xn, β 1 (9) To prove the above statement, we show that linear G-CNNs converge to stationary points of Equation 9 via KKT conditions, which is also the high-level method of Gunasekar et al. (2018b). However, our proof diverges in several key ways. 
First, we carefully redefine operations of the G-CNN as a series of inner products and cross-correlations with respect to the matrix Fourier transform of Definition 4.1. Second, in this Fourier space, we analyze the subdifferential of the Schatten norms, to ultimately show that the KKT conditions of Equation 9 are satisfied. In contrast, Gunasekar et al. (2018b) analyze the subdifferential of a different objective, the ordinary 2/L-vector norm. The fact that the irreps of a group are only unique up to isomorphism (e.g., conjugation by a unitary matrix) hints at the Schatten norm as the correct regularizer, since the Schatten norm is among the norms invariant to unitary matrix conjugation, but this must be confirmed by rigorous analysis. These features are specific to the non-abelian case. More specifically, the proof of this result follows the outline below: 1. First, by applying a general result of Gunasekar et al. (2018b), Theorem A.6, we can immediately characterize the implicit regularization in the full space of parameters, W (in contrast to the end-to-end linear predictor β), as a (scaled) stationary point W of the following optimization problem in W : min W RP W 2 2 s.t. n, yn xn, P(W ) 1 (10) 2. Separately, we define a distinct optimization problem, Equation 9, in β, with the aim of showing that stationary points of Equation 10 are a subset of those of Equation 9, up to scaling. 3. The necessary KKT conditions for Equation 10 characterize its stationary points: {αn 0} : αn = 0 if yn xn, P(W ) > 1 w i = wi D P(W ), X n αnynxn E (11) From here, we show that the sufficient KKT conditions for Equation 9 are also satisfied by the corresponding end-to-end predictor. In particular, we calculate the set of subgradients4 o bβ (S) p for p = 2 L < 1, and then use Equation 11 to derive recurrences demonstrating that a positive scaling of P n αnynbxn is a member of this set. Remark 5.5. For abelian groups where all irreps are onedimensional, bβ in Theorem 5.4 is a diagonal matrix. Thus, the p-Schatten norm coincides with the p-vector norm of the diagonal entries, recovering results in Section 5.2. However, Theorem 5.4 requires stronger convergence assumptions. Infinite dimensional groups: Theorem 5.4 applies to all finite groups, but G-CNNs have extensive applications for infinite groups, where outputs of convolutions are infinitedimensional. Here, it is common to assume sparsity in the Fourier coefficients and band-limit filters over a set of low-frequency irreps (under some natural group-specific ordering) that form a finite dimensional linear subspace (we denote the representation of a function in this band-limited Fourier space as b w). G-CNNs with band-limited filters take precisely the form of the finite G-CNNs from Theorem 5.4. Thus, slight modifications yield the following for infinite groups (see Appendix A.2.1 for details). Corollary 5.6 (Infinite-dimensional groups with band-limited functions; see also Theorem A.11). Let G be a compact Lie group with irreps b G, and let B b G with |B| < .5 4 o is the local subgradient of (Clarke, 1975): of(β) = conv{limi f(β + hi) : hi 0} 5For example, G = SO(3) and B indexes all Wigner dmatrices with |ℓ| L (Kondor et al., 2018). 
Implicit Bias of Linear Equivariant Networks Proceed fully in Fourier space, in the subspace corresponding to B: consider a linearly separable classification task with ground-truth linear predictor bβ and inputs bx, both real-valued and supported only on irreps in B, and proceed by gradient descent on the band-limited Fourier-space filters. Under near-identical conditions as Theorem 5.4, the resultant predictor is a scaling of a first order stationary point of: 2/L s.t. n, yn D bxn, bβ 6. Experiments We first experimentally confirm our theory in a simple setting illustrating the effects of implicit regularization.6 Then, we relax the crucial assumption of linearity in our setup, to empirically show that the results may hold locally even in nonlinear settings (including the practical case of spherical CNNs (Cohen et al., 2018)). Note that the results for nonlinear networks in Section 6.2 are only empirical in nature, and Theorems 5.3 and A.5 do not necessarily hold in the more general nonlinear setting. For all binary classification tasks, we use three-layer networks with inputs and convolution weights in R|G|, and all plots begin at epoch 1. Since we are interested in the resulting implicit bias alone, we only analyze loss on data in the training set. A complete description of our experimental setup can be found in Appendix E. Throughout this section, we plot norms in the Fourier and real regimes side-by-side to highlight unavoidable tradeoffs in implicit regularization between the two conjugate regimes. These trade-offs, which have a rich history of study in physics and group theory, are commonly termed uncertainty principles (Wigderson and Wigderson, 2021). Since non-abelian groups have matrix-valued irreducible representations, uncertainty theorems must account for norms and notions of support in the context of matrices. One especially relevant uncertainty theorem states that sparseness in the real or Fourier regime necessarily implies dense support in the conjugate regime: Theorem 6.1 (Meshulam uncertainty theorem (Meshulam, 1992)). Given a finite group G and f : G C, let b G be the set of irreps of G and f be the vectorized function (see Definition 4.1). Then ρ b G dρ rank b f(ρ) The theorem above shows that the rank of a function s Fourier matrix coefficients is the proper notion of support in 6Our code is available here: https://github.com/kri stian-georgiev/implicit-bias-of-linear-equ ivariant-networks the uncertainty theorem for a non-abelian group. In the context of our result, this implies that the learned linear function is likely to have large support in real space (at least locally, with respect to the network parameters). Other uncertainty principles are detailed in Appendix C. Remark 6.2 (Generalization). The implication of our results on generalization are highly task and data-dependent. Indeed, one could create contrived experiments where the inductive bias leads to worse or better results, which would depend entirely on whether the ground-truth linear predictor has small Fourier Schatten norm (i.e. based on the uncertainty principle above, whether the ground-truth function is sparse in real or Fourier space). 
Noting that band-limited functions have small Schatten norm, and that practical spherical CNNs band-limit (as natural spherical images can often be well-represented by their first few spherical harmonics), there is reason to believe this implicit bias could aid in generalization on natural data distributions, but the primary intention of our result is purely to highlight that such an implicit bias exists and characterize it mathematically. As such, our experiments demonstrate only effects on training error, rather than generalization. 6.1. Empirical Confirmation of Theory We first trace the regularization through (training) epochs for networks trained to classify data with 1 labels. We consider three groups here, with more in Appendix E. Figure 1 shows the implicit bias for a G-CNN over the dihedral group D8, a simple non-abelian group that captures the geometry of a square7. Inputs are vectors with elements drawn i.i.d. from the standard normal distribution. Figure 4 shows the implicit bias for a G-CNN over the non-abelian group (C28 C28) D8 which acts on images (the digits 1 and 5) from the MNIST dataset. In the three settings above, we compare the behaviors of GCNN, traditional CNN8, and fully-connected (FC) network architectures with similar instantiations. We plot the real space vector norm and Fourier space Schatten norm of the network linearization over training epochs. All models perfectly fit the data in this overparameterized setting, and convergence to a given regularized solution corresponds to convergence in the loss to zero. Consistent with theory, G-CNN architectures shown in Figures 1 and 4 have the smallest Fourier space Schatten norms among the architectures. Note that since our theory only implies a G-CNN will reach a stationary point of the Schatten norm minimization subject to fitting the training data, Schatten norms greater than 0 are expected. The form of the implicit bias is visualized in the example shown in Fig- 7Dn denotes the dihedral group of order n. 8 CNN generically refers to a G-CNN over the cyclic group of size equal to the size of the input. Implicit Bias of Linear Equivariant Networks 0 20 40 60 80 100 epoch CNN G-CNN FC (a) Fourier space norm of network linearization β 0 20 40 60 80 100 epoch CNN G-CNN FC (b) Real space norm of network linearization β Figure 4: Norms of the linearizations of three different linear architectures for the non-abelian group G = (C28 C28) D8 trained using the digits 1 and 5 from the MNIST dataset. Figure 5: Linearized functions of the G-CNN (over the dihedral group D60) are sparse in the Fourier regime of the group. Furthermore, the sparsity pattern shows up in blocks of 4, which are the blocks containing coefficients of individual irreps of the dihedral group D60. Linearized functions in the real regime of the G-CNN, in contrast, are rather dense, highlighting the uncertainty principles inherent in the implicit bias of G-CNNs. Sparsity in the group Fourier regime is not evident in CNNs or fully connected (FC) networks. This is the same setting as that in Figure 1, except with D60 instead of D8 (see Appendix D for more details). ure 5, which shows the values of the linearization over irreps in Fourier space (see Appendix D for further details). As expected, the G-CNN outputs a linearization that is sparse over low-rank irreps in the Fourier regime. 
FC networks exhibit no group Fourier regime regularization, while standard CNNs exhibit some regularization since their irreducible representations are similar to those of the D60-CNN (see e.g. B.8 in the Appendix). The differing behaviors of CNNs and FC networks show that implicit regularization is a consequence of the choice of architecture, and not inherent to the task itself. 6.2. Assessment of Theory on Nonlinear Networks Here we introduce rectified linear unit (Re LU) nonlinearities between hidden layers to analyze implicit bias in the presence of nonlinearity. Our theoretical results do not necessarily hold in this case, so we are exploring, rather than confirming, their validity for nonlinear networks. Given a G-CNN with Re LU activations, we wish to calculate the Schatten norm of the Fourier matrix coefficients of the network linearization β. However, networks can no longer be collapsed into linear functions due to the nonlinearities. Instead, we construct local linear approximations at each data point via a first-order Taylor expansion, calculate the norms of interest according to this linearization, and average the results across the dataset to get a single aggregate value. We evaluate the implicit bias of a nonlinear G-CNN (with linear final layer) on the dihedral group D60 with synthetic data. We also evaluate a nonlinear spherical-CNN, where G = SO(3) is an infinite group, with spherical MNIST data; see Cohen et al. (2018) for details. Linear approximations are used only to analyze implicit regularization, and not to further interpret the outputs of the locally linear neural network, as such analysis can give rise to misleading or fragile feature importance maps (Ghorbani et al., 2019). Remarkably, as shown in Figures 6a and 6b, our results remain valid in this nonlinear setting. While this does not guarantee that our implicit bias characterization will hold in Implicit Bias of Linear Equivariant Networks 0 200 400 600 800 1000 epoch Re LU CNN Re LU G-CNN Re LU FC (a) A G-CNN on non-abelian G = D60. 0 20 40 60 80 epoch Spherical CNN CNN FC (b) A Spherical CNN on bandlimited G = SO(3). Figure 6: Group Fourier norms for nonlinear architectures with Re LU activations show that nonlinear G-CNNs implicitly regularize locally. Both figures track the mean norm of the per-sample local linearizations of each network. Figure 6a is on a network with a final linear layer, and performs binary classification on 10 isotropic Gaussian data points. Figure 6b is on a spherical CNN, trained on rotated spherical MNIST (Cohen et al., 2018). See Appendix E.3 for details. more general settings, it is encouraging that our theoretical predictions seem to numerically hold, despite the violation of linearity and, in the case of Figure 6b, the cross-entropy loss function. Additional figures detailing the real-space behaviour are provided in Appendix E. 7. Discussion In this work, we have shown that L-layer linear G-CNNs with full width kernels are biased towards sparse solutions in the Fourier regime regularized by the 2/L-Schatten norm over Fourier matrix coefficients. Our analysis applies to linear G-CNNs, over either finite groups or infinite groups with band-limited inputs, which are trained to perform binary classification. 
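As a pointer back to the local-linearization measurement used in Section 6.2, the following minimal PyTorch sketch (ours; the toy model, shapes, and function names are illustrative assumptions rather than the paper's released code) extracts the per-sample linearization β of a nonlinear network with scalar output via a first-order Taylor expansion. Each extracted β would then be mapped through the group Fourier transform and the resulting 2/L-Schatten quasi-norms averaged over the dataset, as described above.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a nonlinear G-CNN with scalar output (|G| = 8 here).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

def local_linearization(model, x):
    """Return beta_x = d model(x) / d x, the first-order Taylor coefficient at x."""
    x = x.clone().requires_grad_(True)
    out = model(x).squeeze()
    (beta_x,) = torch.autograd.grad(out, x)
    return beta_x.detach()

xs = torch.randn(10, 8)                       # 10 samples, each a function on a size-8 group
betas = torch.stack([local_linearization(model, x) for x in xs])
# Each row of `betas` would then be transformed into group Fourier space and its
# 2/L-Schatten quasi-norm computed, with results averaged across the dataset.
print(betas.shape)                            # torch.Size([10, 8])
```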
In advancing our results on implicit regularization, we highlight some limitations of this work and important future directions: Nonlinearities: Adding nonlinearities to the G-CNNs studied here expands the space of functions which the G-CNNs can express, but implicit regularization in this nonlinear setting may be challenging to characterize as G-CNNs are no longer linear predictors. Local analysis may be possible in special cases, e.g., in the infinite width limit (Lee et al., 2017). Bounded width kernels: Our results apply to fullwidth kernels, supported on the entire group. Expanding results to bounded-width kernels, i.e., those with sparse support, is an obvious future direction, though prior work indicates that closed form solutions may not exist for these special cases (Jagadeesan et al., 2021). Different loss function and learning settings: We study the exponential loss function on binary classification. It is an open question how the implicit bias changes for classification over more than two classes, even for CNNs and fully-connected networks, as well as G-CNNs (Gunasekar et al., 2018b). Although concise implicit regularization measures are challenging to analyze for realistic, nonlinear architectures, linear networks provide an instructive case study with precise analytic results. In this work, by proving that linear group equivariant CNNs trained with gradient descent are regularized towards low-rank matrices in Fourier space, we hope to advance the broader agenda of understanding generalization, and in particular how networks with diverse architectures particularly those with built-in symmetries learn. ACKNOWLEDGMENTS We thank Ankur Moitra and Chulhee Yun for helpful discussions early in the development of this project. HL is supported by the Fannie and John Hertz Foundation and the National Science Foundation Graduate Research Fellowship under Grant No. 1745302. B. Anderson, T.-S. Hy, and R. Kondor. Cormorant: Covariant Molecular Neural Networks. Curran Associates Inc., Red Hook, NY, USA, 2019. S. Bulusu, M. Favoni, A. Ipp, D. I. M uller, and D. Schuh. Generalization capabilities of translationally equivariant neural networks. ar Xiv preprint ar Xiv:2103.14686, 2021. J. O. Caro, Y. Ju, R. Pyle, S. Dey, W. Brendel, F. Anselmi, and A. Patel. Local convolutions cause an implicit bias towards high frequency adversarial examples, 2021. Implicit Bias of Linear Equivariant Networks F. H. Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247 262, 1975. T. S. Cohen and M. Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, page 2990 2999. JMLR.org, 2016a. T. S. Cohen and M. Welling. Group Equivariant Convolutional Networks. ar Xiv, 2016b. T. S. Cohen, M. Geiger, J. K ohler, and M. Welling. Spherical cnns. ar Xiv preprint ar Xiv:1801.10130, 2018. T. S. Cohen, M. Geiger, and M. Weiler. A general theory of equivariant cnns on homogeneous spaces. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper/2019/file/b9cfe8b6042cf759d c4c0cccb27a6737-Paper.pdf. D. L. Donoho and P. B. Stark. Uncertainty principles and signal recovery. SIAM Journal on Applied Mathematics, 49(3):906 931, 1989. D. S. Dummit and R. M. Foote. Abstract Algebra, volume 3. 
Wiley Hoboken, 2004. B. Elesedy and S. Zaidi. Provably strict generalisation benefit for equivariant models. ar Xiv preprint ar Xiv:2102.10333, 2021. C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning so(3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. R. Gens and P. M. Domingos. Deep symmetry networks. Advances in neural information processing systems, 27: 2537 2545, 2014. A. Ghorbani, A. Abid, and J. Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3681 3688, 2019. S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1832 1841. PMLR, 10 15 Jul 2018a. URL http://proceedings.mlr.press/v80/guna sekar18a.html. S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018b. URL https://proceedings.neurips.cc/paper /2018/file/0e98aeeb54acf612b9eb4e48a 269814c-Paper.pdf. M. Jagadeesan, I. P. Razenshteyn, and S. Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. Co RR, abs/2102.12238, 2021. URL https://arxiv.org/abs/2102.12238. R. Kondor. A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. ar Xiv preprint cs/0701127, 2007. R. Kondor and S. Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups, 2018. R. Kondor, Z. Lin, and S. Trivedi. Clebsch gordan nets: a fully fourier space spherical convolutional neural network. Advances in Neural Information Processing Systems, 31: 10117 10126, 2018. J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. ar Xiv preprint ar Xiv:1711.00165, 2017. A. S. Lewis. The convex analysis of unitarily invariant matrix functions. J. Convex Anal, pages 173 183, 1995. C. Lyle, M. van der Wilk, M. Kwiatkowska, Y. Gal, and B. Bloem-Reddy. On the benefits of invariance in neural networks. ar Xiv preprint ar Xiv:2005.00178, 2020. K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2020. URL https: //openreview.net/forum?id=SJe LIg BKPS. E. Matusiak, M. Ozaydin, and T. Przebinda. The donoho stark uncertainty principle for a finite abelian group. Acta Math. Univ. Comenianae, 73(2):155 160, 2004. R. Meshulam. An uncertainty inequality for groups of order pq. European journal of combinatorics, 13(5):401 407, 1992. F. Nicola and S. I. Trapasso. On the stability of deep convolutional neural networks under irregular or random deformations, 2021. N. Razin and N. Cohen. Implicit regularization in deep learning may not be explainable by norms. In H. Larochelle, Implicit Bias of Linear Equivariant Networks M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21174 21187. Curran Associates, Inc., 2020. 
URL https://proceedings.neurips. cc/paper/2020/file/f21e255f89e0f258a ccbe4e984eef486-Paper.pdf. M. Reisert. Group integration techniques in pattern analysis a kernel view. 2008. A. Sannai and M. Imaizumi. Improved generalization bound of group invariant/equivariant deep networks via quotient feature space. ar Xiv preprint ar Xiv:1910.06552, 2019. D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19 (70):1 57, 2018. URL http://jmlr.org/paper s/v19/18-188.html. A. Terras. Fourier analysis on finite groups and applications. Number 43. Cambridge University Press, 1999. G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear algebra and its applications, 170(0):33 45, 1992. M. Weiler and G. Cesa. General E(2)-Equivariant Steerable CNNs. In Conference on Neural Information Processing Systems (Neur IPS), 2019. A. Wigderson and Y. Wigderson. The uncertainty principle: variations on a theme. Bulletin of the American Mathematical Society, 58(2):225 261, 2021. D. Yarotsky. Universal approximations of invariant maps by neural networks. Constructive Approximation, pages 1 68, 2021. C. Yun, S. Krishnan, and H. Mobahi. A unifying view on implicit bias in training linear neural networks. Co RR, abs/2010.02501, 2020. URL https://arxiv.org/ abs/2010.02501. M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. 2017. URL https://arxiv.org/abs/16 11.03530. Implicit Bias of Linear Equivariant Networks We begin with a simple lemma detailing the Fourier space end-to-end predictor for a G-CNN. Lemma A.1 (G-CNN in Fourier space). A G-CNN given by x w1 w L 1, w L is equivalent to x, w L w L 1 w 1 , or in Fourier space to 1 |G| bx, b w L b w1 M. N(x) = x w1 w L 1, w L = 1 |G| FM(x w1 w L 1), FMw L M = 1 |G| tr[FM(x w1 w L 1) b w L] = 1 |G| tr[FM(x w1 w L 2) b w L 1 b w L] = 1 |G| tr[bx b w 1 b w L 1 b w L] = 1 |G| bx, b w L b w1 M To see the stated equality in real space, observe that d w 1 = c w1 by definition. Thus, FM(w L w L 1 w 1 ) = b w L b w1 A.1. Abelian The proof of Proposition 5.2 is included below. First, recall the primary theorem of Yun et al. (2020): Theorem A.2 (paraphrased from (Yun et al., 2020)). If there exists λ > 0 such that the initial directions v1, . . . , v L of the network parameters satisfy |[F vℓ]j|2 |[F v L]j|2 λ for all ℓ [L 1] and j [m], i.e. of the Fourier transform magnitudes of the initial directions look sufficiently different pointwise (which is likely for e.g. a random initialization), then β(Θ(t)) converges in a direction that aligns with ST ρ where ρ Cm denotes a stationary point of the following optimization program: min ρ Cm ρ 2/L s.t. yix T i FT vρ 1, i [n] (A.2) Since S = F is invertible, then in fact β(Θ(t)) converges in a direction that aligns with a stationary point z of the following optimization program: min z Cm Fz 2/L s.t. yix T i z 1, i [n] (A.3) We proceed by showing that abelian G-CNNs can be written as a vector of parameters contracted (according to tensor operations) with an orthogonally decomposable data tensor, which is the primary condition for Theorem A.2 to hold. Proposition A.3 (paraphrased from (Yun et al., 2020)). Let M(x) be a function that maps data x Rd to a data tensor M(x) Rk1 k2 k L. 
The data input into an L-layer tensorized neural network can be written in the form of an orthogonally decomposable data tensor if there exists a full column rank matrix S Cm d and semi-unitary matrices U1, . . . , UL Ckℓ m where d m minℓkℓsuch that M(x) can be written as: j=1 [Sx]j ([U1] ,j [U2] ,j [UL] ,j) (A.4) Implicit Bias of Linear Equivariant Networks and such that the network output is the tensor multiplication between M(x) and each layer s parameters: NN(x; Θ) = M(x) (W1, . . . , WL) i L=1 M(x)i1...i L(W1)i1 (WL)i L For Proposition A.3. By direct manipulation: f(x; Θ) = M(x) (v1, . . . , v L) i L=1 M(x)i1...i L(v1)i1 (v L)i L j=1 [Sx]j ([U1] ,j [U2] ,j [UL] ,j) i1...i L (v1)i1 (v L)i L j=1 [Sx]j ([U1]i1,j[U2]i2,j [UL]i L,j) (v1)i1 (v L)i L j=1 [Sx]j [U T 1 v1]i1[U T 2 v2]i2 [U T L v L]i L j=1 [Fx]j [Fv1]i1[Fv2]i2 [Fv L]i L j=1 [Fx]j [Fv1]i1[Fv2]i2 [Fv L]i L j=1 [Fx]j ([Fv1]i1[Fv2]i2 [Fv L]i L) = Fx, Fv1 Fv L = Fx Fv1, Fv2 Fv L = F(x v1) Fv2, Fv3 Fv L = F(x v1 v2) Fv3, Fv4 Fv L = F(x v1 v2 v L 1), Fv L = x v1 v L 1, v L Here, we have used that the filters are real-valued. Note that Theorem 5.3 then merely requires that ||F T z|| = ||Fz|| = ||Fz||, for real-valued z. A.1.1. FUNDAMENTAL THEOREM OF FINITE ABELIAN GROUPS While the proof in the previous section is complete and correct, intuition (and/or alternate analysis) for abelian groups is aided by the important fact that all finite abelian groups are a direct product of cyclic groups. Theorem A.4 (Fundamental theorem of finite abelian groups (Dummit and Foote, 2004)). Any finite abelian group is a direct product of a finite number of cyclic groups whose orders are prime powers uniquely determined by the group. Given a decomposition of an abelian group G into k cyclic groups Cd1 Cdk, one can easily construct the group Fourier transform as a Kronecker product of discrete Fourier transform matrices which are the group Fourier transforms of Implicit Bias of Linear Equivariant Networks the respective cyclic groups. G = Cd1 Cdk = F = i=1 Fdi, (A.5) where N denotes the Kronecker product over matrices and Fd is the standard (unitary) discrete Fourier transform matrix of dimension d defined as ω0 0 d ω0 1 d ω0 (d 1) d ω1 0 d ω1 1 d ω1 (d 1) d ... ... ... ... ω(d 1) 0 d ω(d 1) 1 d ω(d 1) (d 1) d , ωd = e 2πi From this result, it is clear that the desired properties of the Fourier transform and convolution hold. A.2. Non-abelian Theorem A.5. Consider a classification task with ground-truth linear predictor β, trained via a linear G-CNN architecture NN(x) = x w1 w L 1, w L (see Section 5 for architecture details) with L 2 layers under the exponential loss. Then for almost any datasets {xi, yi} separable by β, any bounded sequence of step sizes ηt, and almost all initializations, suppose that: The loss L(W ) converges to 0 The gradients with respect to the end-to-end linear predictor, βL(P(W t)), converge in direction as t The iterates W t themselves converge in direction as t to a separator P(W t) with positive margin When L = 2, we need an additional technical assumption, Assumption A.7. Then, the resultant linear predictor bβ is a positive scaling of a first order stationary point of the optimization problem: min β bβ (S) 2/L s.t. n, yn βn, xn M 1 (A.7) In this section, we prove the non-abelian case, Theorem A.5. The proof of our result proceeds according to the following outline: 1. By applying a general result of Gunasekar et al. 
(2018b), Theorem A.6, we characterize the implicit regularization in the full space of parameters, W (in contrast to the end-to-end linear predictor β), as the stationary point of an optimization problem Equation A.9 in W . 2. Separately, we define a distinct optimization problem, Equation A.7 in β. The goal is to demonstrate that stationary points of Equation A.9 are a subset of the stationary points of Equation A.7. 3. The necessary KKT conditions for Equation A.9 characterize its stationary points. Using this characterization, we show that the sufficient KKT optimality conditions for Equation A.7 are in fact also satisfied for the corresponding end-to-end predictor. Thus, we show that for any stationary point W of Equation A.9, the linear predictor P(W ) is a stationary point of Equation A.7. First, recall that Gunasekar et al. (2018b) prove the following general result about the implicit regularization of any homogeneous polynomial parametrization. (Lyu and Li (2020) later strengthened this result to the case of arbitrary homogeneous mappings, and showed furthermore that the margin is monotonically increasing under gradient flow, but the result of Gunasekar et al. (2018b) is suitable for our needs.) Implicit Bias of Linear Equivariant Networks Theorem A.6 (Homogeneous polynomial parametrization, Theorem 4 of (Gunasekar et al., 2018b)). Let W be the concatenation of all (real-valued) parameters Wi. For any homogeneous polynomial map P : RP R|G| from parameters Wi RP to linear predictors, almost all datasets {xn, yn}N n=1 separable by the ground truth predictor β := {P(W ) : W RP }, almost all initializations W 0, and any bounded sequence of step sizes {ηt}t, consider the gradient descent updates: W t+1 = W t ηt W L(W t) = W t ηt W P(W t) βL(P(W t)) (A.8) Suppose furthermore that the exponential loss converges to zero, that the gradients βL(βt) converge in direction, and that the iterates W t themselves converge in direction to yield a separator with positive margin. Then, the limit direction of the parameters W = limt W (t) W (t) 2 is a positive scaling of a first order stationary point of the following optimization problem: min W RP W 2 2 s.t. n, yn xn, P(W ) 1. (A.9) To keep track of constant factors, let W = τW denote the first order stationary point itself. Furthermore, let W i denote the individual layers (or parameter blocks) comprising W , and similarly let W i denote the individual layers comprising W . We then have via the KKT conditions that: {αn : αn 0}N n=1 s.t. αn = 0 if yn xn, P(W ) > 1 W i = Wi P(W ) = Wi D P(W ), X n αnynxn E (A.10) While this is an interesting result alone, the goal of implicit regularization is to characterize the final linear predictor (which is some function P of the complete parametrization W). To that end, consider the following optimization problem in β: min β bβ (S) 2/L s.t. n, yn βn, xn 1 (A.11) We will leverage the necessary KKT conditions from Equation A.10 to show that first-order stationary points of Equation A.9 are (up to a scaling) also first-order stationary points of Equation A.7, using the sufficient KKT conditions for Equation A.7. Using standard KKT sufficiency conditions, the first-order stationary points of Equation A.7 are those vectors β such that there exist α1, . . . , αn satisfying: 1. Feasibility: n, yn β, xn 1 and αi 0 i 2. Complementary slackness: i, αi = 0 if yn β, xn > 1 3. 
Membership in subdifferential: P n αnynbxn o bβ (S) 2/L In the third condition above, o is the local sub-differential of (Clarke, 1975): of(β) = conv{limi f(β + hi) : hi 0}9. We will need the following assumption in the special case L = 2: Assumption A.7 (L = 2 bounded subgradient). Let bz = P n αnynbxn result from the KKT conditions of the optimization problem in W , Equation A.10, as described previously. Then, we assume that bz (S) 1. Let β = P(W ) and let αi = 1 γ αi for all i, where γ is equal to || bβ || 2 for L > 2 and to 1 otherwise. (Note that by homogeneity of P, P(W ) = P(τW ) = τ LP(W ).) We will check these conditions one by one for β and αi, with the first two following immediately from Theorem A.6 and the last one requiring the most manipulation. Feasibility Trivially, αi 0 i by definition of α in Equation A.10. Similarly, yn xn, β = yn xn, P(W ) 1. 9hi is a sequence of vectors in some linear space, and we take hi 0 as an entry-wise statement. This is because the vectors are finite-dimensional, so all norms are equivalent. Implicit Bias of Linear Equivariant Networks Complementary slackness If yn β , xn > 1 = yn P(W ), xn > 1, then αn αn = 0. Membership in subdifferential We first characterize the set o bβ (S) 2/L for a generic matrix β, and then show that n αnynbxn o bβ (S) 2/L. When L = 2 and thus p = 1, the Schatten norm bβ 1 is indeed a norm and its subgradient is known; see e.g. Watson (1992). We restate this result below: Lemma A.8 (Subdifferential of p-Schatten norm, p = 1). Suppose L = 2, such that 2 L = p = 1. Let A be an n n complex-valued matrix. Then we have o A (S) 1 = n G : A (S) 1 = tr[G A], G (S) 1] o (A.12) When L > 2, previous works have characterized the subdifferential of the 2/L-Schatten norm. For example, Lewis (1995) characterizes the subdifferential of any unitary matrix norm ||X|| as all those matrices with the same left and right singular vectors as X, but whose singular values lie in the subdifferential of the corresponding norm of the singular values of X (Corollary 2.5). For our purposes, a different formulation of the subdifferential will be more useful later in the overall proof. To keep the paper relatively self-contained, we prove the following result from scratch. Lemma A.9 (Subdifferential of p-Schatten norm, p < 1). Suppose L > 2 and let p = 2 L, such that 0 < p < 1. Let A be an n n complex-valued matrix with singular value decomposition A = UDV . Let ΠV project onto the row space of A, i.e. ΠV (M) = MV V . Then we have o A (S) p = G : A G = 1 A A p and GA = 1 Proof. Suppose A is rank r and has singular value decomposition A = Pr i=1 diuiv i , where di > 0 for all i. Consider unit vectors W1, . . . , Wn r which are a basis for the orthogonal subspace to Span(u1, . . . , ur), and unit vectors s1, . . . , sn r which are a basis for the orthogonal subspace to Span(v1, . . . , vr). Treating the space of n n complex matrices as a n2-dimensional linear space, we see that {uiv i }r i=1 form a basis for an r-dimensional subspace. Let ΠA denote the projector onto this space, ΠA = Pr i=1 uiv i . Note that ΠAA = A. For any set of n r sequences n {ϵi,m} m=1 on r that limm ϵm,i = 0 for all i = 1, . . . , n r, consider the particular sequence of matrices {Hm} m=1 defined by i=1 ϵm,i Wis i By definition, limm ||Hm||Fro = 0, where Hm Fro Hm (S) 2 .Also, if M is a full rank matrix with singular value decomposition ADB , ||M||(S) p is differentiable at M and ||M||(S) p = 1 ||M||(S) p ADp 1B . 
For convenience of notation, let $U$ be the matrix with $i$-th column $u_i$, $W$ the matrix with $i$-th column $W_i$, and similarly for $V$ and $S$ with respect to the vectors $v_i$ and $s_i$, respectively. Also, let $D$ be the diagonal matrix with $d_i$ on the $i$-th diagonal. Combining the above differentiability fact with the construction of $H_m$, we have
$$\|A + H_m\|^{(S)}_p = \Big\| \sum_{i=1}^r d_i u_i v_i^* + \sum_{i=1}^{n-r} \epsilon_{m,i} W_i s_i^* \Big\|^{(S)}_p = \Big( \sum_{i=1}^r d_i^p + \sum_{i=1}^{n-r} \epsilon_{m,i}^p \Big)^{1/p},$$
since
$$A + H_m = \begin{pmatrix} U & W \end{pmatrix} \begin{pmatrix} D & 0 \\ 0 & \mathrm{diag}(\epsilon_{m,1}, \ldots, \epsilon_{m,n-r}) \end{pmatrix} \begin{pmatrix} V & S \end{pmatrix}^*$$
is a singular value decomposition of the (full-rank) matrix $A + H_m$, and therefore
$$\nabla \|A + H_m\|^{(S)}_p = \frac{1}{(\|A + H_m\|^{(S)}_p)^{p-1}} \begin{pmatrix} U & W \end{pmatrix} \begin{pmatrix} D^{p-1} & 0 \\ 0 & \mathrm{diag}(\epsilon_{m,1}^{p-1}, \ldots, \epsilon_{m,n-r}^{p-1}) \end{pmatrix} \begin{pmatrix} V & S \end{pmatrix}^*.$$

In the limit as $m$ goes to infinity, $\sum_{i=1}^{n-r} \epsilon_{m,i}^p$ approaches $0$, so $\|A + H_m\|^{(S)}_p \to \|A\|^{(S)}_p$. However, $p - 1 < 0$ implies that $\lim_{m\to\infty} \epsilon_{m,i}^{p-1} = \infty$, and the relative rates of the $\epsilon_{m,i}$ may be chosen arbitrarily. By taking convex combinations, one can therefore create any matrix with left and right singular vectors $W$ and $S$ in the second term. Formally, we have
$$\partial^\circ \|A\|^{(S)}_p = \mathrm{conv}\Big\{ \lim_{m\to\infty} \nabla \|A + H_m\|^{(S)}_p : H_m \to 0 \Big\} \qquad (A.17)$$
$$= \frac{1}{(\|A\|^{(S)}_p)^{p-1}} U D^{p-1} V^* + \frac{1}{(\|A\|^{(S)}_p)^{p-1}} \mathrm{conv}\Big\{ \lim_{m\to\infty} W\, \mathrm{diag}(\epsilon_{m,1}^{p-1}, \ldots, \epsilon_{m,n-r}^{p-1})\, S^* : \epsilon_{m,i} \xrightarrow{m\to\infty} 0 \Big\}$$
$$= \frac{1}{(\|A\|^{(S)}_p)^{p-1}} U D^{p-1} V^* + \{ W \Sigma S^* : \Sigma \text{ is real and diagonal} \}. \qquad (A.19)$$

Let $\Pi_U$ project onto the column space of $A$, i.e. $\Pi_U(M) = UU^* M$. Note that for any rank-one matrix $M = ab^*$, if $a$ is orthogonal to each column of $U$, then $UU^* M = 0$; similarly, if $b$ is orthogonal to each column of $V$, then $MVV^* = 0$. Thus, for an arbitrary matrix $M$, by decomposing it into a sum of rank-one matrices via its SVD, we see that if both the column and row spaces of $M$ are orthogonal to those of $A$, then $\Pi_U M = \Pi_V M = 0$; in particular, this holds for the second term of Equation A.19. Thus, we can apply $\Pi_U$ and $\Pi_V$ to disregard the second term of Equation A.19 and obtain the following expression for the subgradient:
$$\partial^\circ \|A\|^{(S)}_p = \Big\{ G : \Pi_U G = \Pi_V G = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}}\, U D^{p-1} V^* \Big\}. \qquad (A.20)$$
Consider first the equality
$$\Pi_U G = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}}\, U D^{p-1} V^*. \qquad (A.21)$$
We can left-multiply by $A^*$ without changing the set of matrices $G$ satisfying this relation: one can check that if $A^* \Pi_U G = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} A^* U D^{p-1} V^*$, then left-multiplying both sides by $U D^{-1} V^*$ recovers Equation A.21. Similarly, consider the second equality:
$$\Pi_V G = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}}\, U D^{p-1} V^*. \qquad (A.22)$$
We can right-multiply by $A^*$ without changing the set of matrices $G$ satisfying this relation: one can check that if $(\Pi_V G) A^* = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} U D^{p-1} V^* A^*$, then right-multiplying both sides by $U D^{-1} V^*$ recovers Equation A.22.

Finally, observe that $A^* \Pi_U G = V D U^* U U^* G = A^* G$ and $(\Pi_V G) A^* = G V V^* V D U^* = G V D U^* = G A^*$. Furthermore,
$$A^* \cdot \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} U D^{p-1} V^* = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} V D^{p} V^* = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} (A^* A)^{p/2}$$
and
$$\tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} U D^{p-1} V^* \cdot A^* = \tfrac{1}{(\|A\|^{(S)}_p)^{p-1}} (A A^*)^{p/2}.$$
This completes the proof of the lemma.

Lemmas A.8 and A.9 characterize the subdifferential of the Schatten norm. Now, we show that $\sum_n \tilde{\alpha}_n y_n \hat{x}_n$ satisfies Equation A.13.

Lemma A.10. Recall that our G-CNN is given by $\mathrm{NN}(x) = \langle x \star w_1 \star \cdots \star w_{L-1},\, w_L \rangle$, where $\star$ denotes group cross-correlation, with vector of parameters $W = [w_1, \ldots, w_L]$ and end-to-end linear predictor $P(W) = w_1 \star \cdots \star w_L$. Consider an arbitrary such vector of real-valued parameters; recall that we have assumed the filters are real-valued in real space, i.e. $w_i \in \mathbb{R}^{|G|}$. Then the following relation holds:
$$F_M \nabla_{w_\ell} \langle P(W), e_i \rangle = \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\; \hat{e}_i\; \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*, \qquad (A.23)$$
where $e_i$ is the $i$-th standard basis vector.

Proof. Since $\langle P(W), e_i \rangle$ is real, we have $\langle P(W), e_i \rangle = \langle e_i, P(W) \rangle$.
Plugging this in,
$$\begin{aligned}
F_M \nabla_{w_\ell} \langle P(W), e_i \rangle &= F_M \nabla_{w_\ell} \langle e_i, P(W) \rangle \\
&= F_M \nabla_{w_\ell} \tfrac{1}{|G|} \langle F_M e_i, F_M P(W) \rangle_M \\
&= F_M \nabla_{w_\ell} \tfrac{1}{|G|} \langle \hat{e}_i, \hat{w}_L \cdots \hat{w}_1 \rangle_M \\
&= F_M \nabla_{w_\ell} \tfrac{1}{|G|} \mathrm{tr}\big[\hat{e}_i (\hat{w}_L \cdots \hat{w}_1)^*\big] \\
&= F_M \nabla_{w_\ell} \tfrac{1}{|G|} \mathrm{tr}\big[\hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{e}_i\, \hat{w}_1^* \cdots \hat{w}_\ell^*\big] \\
&= F_M \nabla_{w_\ell} \tfrac{1}{|G|} \big\langle \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{e}_i\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*,\; \hat{w}_\ell \big\rangle_M \\
&= F_M \nabla_{w_\ell} \big\langle F_M^{-1}\big(\hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{e}_i\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*\big),\; w_\ell \big\rangle \\
&= \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{e}_i\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*.
\end{aligned}$$

Here and below, write $\hat{w}_\ell := F_M W^\star_\ell$ for the Fourier matrix coefficients of the stationary point's $\ell$-th filter. Letting $z = \sum_n \tilde{\alpha}_n y_n x_n = \frac{1}{\gamma} \sum_n \alpha_n y_n x_n$ and $W = W^\star$, Lemma A.10 (applied by linearity with $z$ in place of $e_i$) implies that
$$F_M \nabla_{w_\ell} \langle P(W^\star), z \rangle = \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*. \qquad (A.25)$$
By combining Equation A.10 with Lemma A.10, we have
$$\frac{1}{\gamma}\hat{w}_\ell = \frac{1}{\gamma} F_M \nabla_{w_\ell} \Big\langle P(W^\star), \sum_n \alpha_n y_n x_n \Big\rangle \qquad (A.26)$$
$$= F_M \nabla_{w_\ell} \Big\langle P(W^\star), \sum_n \tilde{\alpha}_n y_n x_n \Big\rangle \qquad (A.27)$$
$$= \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*. \qquad (A.28)$$
As a result,
$$\hat{w}_\ell = \gamma\, \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^* \quad \Longrightarrow \quad \hat{w}_\ell \hat{w}_\ell^* = \gamma\, \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_\ell^*. \qquad (A.29)$$
Write $\hat{\beta} := F_M P(W^\star) = \hat{w}_L \cdots \hat{w}_1$ (this is the $\hat{\beta}^\star$ defined above; we drop the star for brevity). Applying Equation A.29 with $\ell = L$, we have that
$$\hat{w}_L \hat{w}_L^* = \gamma\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_L^* \qquad (A.30)$$
$$= \gamma\, \hat{z} \hat{\beta}^*. \qquad (A.31)$$
Taking adjoints of both sides implies that $\hat{z}\hat{\beta}^*$ is Hermitian, which will be useful later.

From Equation A.29 we can derive the following recursion:
$$\begin{aligned}
\hat{\beta} \hat{\beta}^* &= \hat{w}_L \cdots \hat{w}_1\, \hat{w}_1^* \cdots \hat{w}_L^* \\
&= \gamma\, \hat{w}_L \cdots \hat{w}_2\, \hat{w}_2^* \cdots \hat{w}_L^*\, \hat{z}\hat{\beta}^* && \text{by Equation A.29} \\
&= \gamma^2\, \hat{w}_L \cdots \hat{w}_3\, \hat{w}_3^* \cdots \hat{w}_L^*\, \hat{z}\hat{\beta}^*\, \hat{z}\hat{\beta}^* && \text{again by Equation A.29} \\
&= \gamma^2\, \hat{w}_L \cdots \hat{w}_3\, \hat{w}_3^* \cdots \hat{w}_L^*\, (\hat{z}\hat{\beta}^*)^2 && \text{by definition of }\hat\beta \\
&= \gamma^L (\hat{z}\hat{\beta}^*)^L && \text{by repeated application of Equation A.29.}
\end{aligned} \qquad (A.32)$$
Similarly to Equation A.29, we have
$$\hat{w}_\ell^* \hat{w}_\ell = \gamma\, \hat{w}_\ell^* \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*. \qquad (A.33)$$
By considering $\ell = 1$, we have $\hat{w}_1^* \hat{w}_1 = \gamma\, \hat{w}_1^* \cdots \hat{w}_L^*\, \hat{z} = \gamma\, \hat{\beta}^* \hat{z}$, which shows that $\hat{\beta}^* \hat{z}$ is Hermitian as well. Using Equation A.33, we can similarly reason about $\hat{\beta}^* \hat{\beta}$:
$$\begin{aligned}
\hat{\beta}^* \hat{\beta} &= \hat{w}_1^* \cdots \hat{w}_L^*\, \hat{w}_L \cdots \hat{w}_1 \\
&= \gamma\, \hat{w}_1^* \cdots \hat{w}_{L-1}^*\, \big(\hat{w}_L^* \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{L-1}^*\big)\, \hat{w}_{L-1} \cdots \hat{w}_1 \\
&= \gamma\, (\hat{\beta}^* \hat{z})\, \hat{w}_1^* \cdots \hat{w}_{L-1}^*\, \hat{w}_{L-1} \cdots \hat{w}_1 \\
&= \gamma\, (\hat{\beta}^* \hat{z})\, \hat{w}_1^* \cdots \hat{w}_{L-2}^*\, \big(\gamma\, \hat{w}_{L-1}^* \hat{w}_L^* \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{L-2}^*\big)\, \hat{w}_{L-2} \cdots \hat{w}_1 \\
&= \gamma^2 (\hat{\beta}^* \hat{z})^2\, \hat{w}_1^* \cdots \hat{w}_{L-2}^*\, \hat{w}_{L-2} \cdots \hat{w}_1 \\
&= \gamma^L (\hat{\beta}^* \hat{z})^L.
\end{aligned} \qquad (A.34)$$

For $L > 2$, we have shown in Lemma A.9 (Equation A.13) that
$$\partial^\circ \|\hat{\beta}\|^{(S)}_p = \Big\{ G : \hat{\beta}^* G = \tfrac{1}{(\|\hat{\beta}\|^{(S)}_p)^{p-1}} (\hat{\beta}^* \hat{\beta})^{p/2} \;\text{ and }\; G \hat{\beta}^* = \tfrac{1}{(\|\hat{\beta}\|^{(S)}_p)^{p-1}} (\hat{\beta} \hat{\beta}^*)^{p/2} \Big\}.$$
Let us check that setting $G = \hat{z}$ satisfies this relation. Since $p = \frac{2}{L}$, we have $\frac{p}{2} = \frac{1}{L}$, so by Equation A.32 and the fact that $\hat{z}\hat{\beta}^*$ is Hermitian (with $\hat{\beta}\hat{\beta}^*$ positive semi-definite),
$$(\hat{\beta} \hat{\beta}^*)^{p/2} = \gamma\, \hat{z} \hat{\beta}^*. \qquad (A.36)$$
Similarly, by Equation A.34 and the fact that $\hat{\beta}^* \hat{z}$ is Hermitian,
$$(\hat{\beta}^* \hat{\beta})^{p/2} = \gamma\, \hat{\beta}^* \hat{z}. \qquad (A.37)$$
By the choice of $\gamma$, we have $\gamma = (\|\hat{\beta}\|^{(S)}_{2/L})^{2/L - 1} = (\|\hat{\beta}\|^{(S)}_p)^{p-1}$, so Equations A.36 and A.37 are exactly the two conditions in Equation A.13 with $G = \hat{z}$. Thus $\hat{z} \in \partial^\circ \|\hat{\beta}\|^{(S)}_p$, as desired for $L > 2$.

For $L = 2$, $p = 1$ and by Lemma A.8 we had the following expression for the subgradient:
$$\partial^\circ \|A\|^{(S)}_1 = \big\{ G : \|A\|^{(S)}_1 = \mathrm{tr}[G^* A],\; \|G\|^{(S)}_{\infty} \le 1 \big\}. \qquad (A.38)$$
As a technical condition, we had to assume in Assumption A.7 that $\|\hat{z}\|^{(S)}_{\infty} \le 1$. (We believe that with a refined analysis in future work, this assumption can be shown to hold given only our existing assumptions.) By the previous reasoning applied with $L = 2$, we had that $\hat{\beta}\hat{\beta}^* = \gamma^2 (\hat{z}\hat{\beta}^*)^2$. Since $\hat{\beta}\hat{\beta}^* = U D^2 U^*$ is positive semi-definite and Hermitian (where $\hat{\beta} = UDV^*$ is a singular value decomposition), and since $\hat{z}\hat{\beta}^*$ is Hermitian, we can take the square root of both sides and obtain $U D U^* = \gamma\, \hat{z}\hat{\beta}^* = \gamma\, \hat{\beta}\hat{z}^*$. Thus $\mathrm{tr}[\hat{z}^* \hat{\beta}] = \mathrm{tr}[\hat{\beta} \hat{z}^*] = \mathrm{tr}[U D U^*] = \mathrm{tr}[D] = \|\hat{\beta}\|^{(S)}_1$, which is what was needed (recall $\gamma = 1$ in the $L = 2$ case).

A.2.1. INFINITE GROUPS WITH BAND-LIMITED INPUTS

Here we consider the case of more general, infinite compact Lie groups. Such groups admit a Fourier transform which is an operator between infinite-dimensional spaces, rather than a finite matrix as before, but which has the same key properties: a convolution theorem and preservation of inner products.
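As a quick numerical illustration of these two properties in the simplest compact group, consider the circle $S^1 \cong \mathrm{SO}(2)$, whose irreps are the characters $\theta \mapsto e^{-in\theta}$. The following sketch is ours (not the paper's code); it assumes a uniform grid discretization of the Haar integral and NumPy's FFT, and checks the convolution theorem and Parseval's identity, with band-limiting amounting to truncating the coefficient index.

```python
# Minimal sketch (assumptions: uniform grid of N points approximating the Haar
# integral on the circle; NumPy FFT as the character transform).
import numpy as np

N = 256
theta = 2 * np.pi * np.arange(N) / N
f = np.cos(3 * theta) + 0.5 * np.sin(7 * theta)   # band-limited: frequencies {3, 7}
g = np.exp(np.cos(theta))

def fourier(h):
    # hat{h}(n) = (1/2pi) * integral of h(theta) e^{-i n theta} d(theta), on the grid
    return np.fft.fft(h) / N

def conv(h1, h2):
    # group (circular) convolution with normalized Haar measure
    return np.array([np.mean(h1 * np.roll(h2[::-1], j + 1)) for j in range(N)])

# Convolution theorem: coefficients of h1 * h2 are products of coefficients.
assert np.allclose(fourier(conv(f, g)), fourier(f) * fourier(g))

# Preservation of inner products (Parseval): <f, g> under the Haar measure
# equals the inner product of the coefficient sequences.
assert np.isclose(np.mean(f * np.conj(g)), np.vdot(fourier(g), fourier(f)))

# Band-limiting keeps only coefficients |n| <= B; since f is supported on
# frequencies {3, 7}, truncating at B = 8 loses nothing.
```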
To be concrete, the Fourier transform is now defined as $\hat{f}(\rho) = \int_{g \in G} \rho(g) f(g)\, d\mu(g)$, where $\mu$ denotes the Haar measure of the group. Since it is impossible to store a general function $x : G \to \mathbb{R}$, one must make simplifying assumptions on the input $x$. A common assumption is that $x$ is band-limited in Fourier space, i.e. $\hat{x}$ is supported only on a small (and finite) subset of Fourier coefficients, contained within the irreps $\rho \in S$. For many such groups there is a natural hierarchy of irreps (analogous to low and high frequencies in classical Fourier analysis), so practical architectures typically assume that only the coefficients corresponding to low frequencies are non-zero. For ease of analysis, we will assume that the input functions and all convolutional filters are real-valued in Fourier space.

The architecture of our G-CNN is the same as in the finite-group setting, except that we apply all functions entirely in the finite-dimensional Fourier space over the irreps in $S$. Given a function $\hat{x}$ supported only on $S$ in Fourier space, we run gradient descent over only the Fourier coefficients on $S$ of the filters $\hat{w}_i$, and assume they are zero elsewhere. Before we proceed, we define $F_M$ in the natural way after restricting the irreps to the band-limiting set $S$:
$$\hat{f} = F_M f = \bigoplus_{\rho \in S} \hat{f}(\rho)^{\oplus d_\rho} \in \mathrm{GL}\Big(\textstyle\sum_{\rho \in S} d_\rho^2,\ \mathbb{C}\Big).$$
In this band-limited space, each filter has $\sum_{\rho \in S} d_\rho^2$ trainable parameters, corresponding to the entries of the irreps in $S$. Note that these entries are also orthogonal to each other with respect to the inner product $\langle \cdot, \cdot \rangle_M$ and thus form a vector space of dimension $\sum_{\rho \in S} d_\rho^2$.

Recall that we previously applied Theorem A.6 to the homogeneous polynomial parametrization $(w_1, \ldots, w_L) \mapsto \langle x \star w_1 \star \cdots \star w_{L-1},\, w_L \rangle$. Since we will operate in Fourier space, instead consider the homogeneous polynomial parametrization in $\hat{W} \in \mathbb{R}^p$, the vector containing the parameters stored in the matrices $\{\hat{w}_1, \ldots, \hat{w}_L\}$; in other words, there are $p = L \sum_{\rho \in S} d_\rho^2$ parameters stored in the matrices contained in $\hat{W}$. The gradient descent updates are
$$\hat{W}^{t+1} = \hat{W}^t - \eta_t \nabla_{\hat{W}} \mathcal{L}(\hat{W}^t) = \hat{W}^t - \eta_t \nabla_{\hat{W}} P(\hat{W}^t)\, \nabla_{\hat{\beta}} \mathcal{L}\big(P(\hat{W}^t)\big). \qquad (A.40)$$
Note that here, iterates are only allowed to vary over the finite subset $S$ of Fourier coefficients and are assumed to equal $0$ elsewhere. If we assume further that the exponential loss converges to zero, that the gradients $\nabla_{\hat{\beta}} \mathcal{L}(\hat{\beta}^t)$ converge in direction, and that the iterates $\hat{W}^t$ themselves converge in direction to yield a separator with positive margin, then the limit direction of the parameters $\bar{\hat{W}} = \lim_{t\to\infty} \frac{\hat{W}^t}{\|\hat{W}^t\|_2}$ is a positive scaling of a first-order stationary point of the following optimization problem:
$$\min_{\hat{W} \in \mathbb{R}^p} \|\hat{W}\|_2^2 \quad \text{s.t.} \quad \forall n,\; y_n \langle \hat{x}_n, P(\hat{W}) \rangle_M \ge 1. \qquad (A.41)$$
Again letting $\hat{W}^\star = \tau \bar{\hat{W}}$ denote the first-order stationary point itself, $\hat{W}^\star_i$ the individual layers (or parameter blocks) comprising $\hat{W}^\star$, and $\bar{\hat{W}}_i$ the individual layers comprising $\bar{\hat{W}}$, we again have via the KKT conditions that there exist $\{\alpha_n : \alpha_n \ge 0\}_{n=1}^N$ such that $\alpha_n = 0$ whenever $y_n \langle \hat{x}_n, P(\hat{W}^\star) \rangle_M > 1$, and
$$\hat{W}^\star_i = \nabla_{\hat{W}_i} P(\hat{W}^\star)\Big[\sum_n \alpha_n y_n \hat{x}_n\Big] = \nabla_{\hat{W}_i} \Big\langle P(\hat{W}^\star), \sum_n \alpha_n y_n \hat{x}_n \Big\rangle_M.$$
Equivalently, writing the above in terms of the matrices $\hat{w}_\ell$ and defining $\hat{z} = \sum_n \alpha_n y_n \hat{x}_n$, we have the following analogue of Lemma A.10:
$$\begin{aligned}
\nabla_{\hat{w}_\ell} P(\hat{W}^\star)[\hat{z}] &= \nabla_{\hat{w}_\ell} \big\langle P(\hat{W}^\star), \hat{z} \big\rangle_M = \nabla_{\hat{w}_\ell} \big\langle \hat{z}, P(\hat{W}^\star) \big\rangle_M \\
&= \nabla_{\hat{w}_\ell} \big\langle \hat{z}, \hat{w}_L \cdots \hat{w}_1 \big\rangle_M \\
&= \nabla_{\hat{w}_\ell}\, \mathrm{tr}\big[\hat{z}\, (\hat{w}_L \cdots \hat{w}_1)^*\big] \\
&= \nabla_{\hat{w}_\ell}\, \mathrm{tr}\big[\hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_\ell^*\big] \\
&= \nabla_{\hat{w}_\ell} \big\langle \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*,\; \hat{w}_\ell \big\rangle_M \\
&= \hat{w}_{\ell+1}^* \cdots \hat{w}_L^*\, \hat{z}\, \hat{w}_1^* \cdots \hat{w}_{\ell-1}^*.
\end{aligned}$$

Using the KKT conditions above and this fact, all of the manipulations demonstrating that a positive scaling of $\sum_n \alpha_n y_n \hat{x}_n$ is a first-order stationary point of the optimization problem below carry over exactly as they do in the previous part. This yields the following formal result:

Theorem A.11. Consider a classification task with ground-truth linear predictor $\hat{\beta}$, trained via a real-valued, Fourier-space, band-limited linear G-CNN architecture $\mathrm{NN}(\hat{x}) = \langle \hat{x},\, \hat{w}_L \cdots \hat{w}_1 \rangle_M$ with $L \ge 2$ layers under the exponential loss. For almost all datasets $\{x_i, y_i\}$ separable by $\beta$, any bounded sequence of step sizes $\eta_t$, and almost all initializations, suppose that:

- the loss converges to $0$;
- the gradients with respect to the end-to-end linear predictor $\hat{\beta}$ converge in direction;
- the iterates $\hat{w}_i$ themselves converge in direction to a separator with positive margin.

When $L = 2$, we need the additional technical condition of Assumption A.7. Then the resulting linear predictor $\hat{\beta}$ is a positive scaling of a first-order stationary point of the optimization problem
$$\min_{\hat{\beta}} \|\hat{\beta}\|^{(S)}_{2/L} \quad \text{s.t.} \quad \forall n,\; y_n \big\langle \hat{x}_n, \hat{\beta} \big\rangle_M \ge 1. \qquad (A.44)$$

B. Group Fourier Transforms

To aid the reader in understanding the notation and structure behind the group Fourier transform, the following exposition is provided for reference and convenience. Here, we introduce important concepts from representation theory and, from there, provide explicit constructions of the group Fourier transform.

A representation of a group $G$ is a vector space $V$ together with a group homomorphism $\rho : G \to \mathrm{GL}(V)$. Of particular interest is the regular representation, which we construct as follows. Let $G$ be a group of order $n$ and choose $V = \mathbb{C}^n$. Consider an element $u \in \mathbb{C}[G]$, so $u = a_1 g_1 + \cdots + a_n g_n$ where $a_i \in \mathbb{C}$ and $g_i \in G$, and the associated vector $\mathbf{u} = (a_1, \ldots, a_n)^T \in \mathbb{C}^n$. Left-multiplication of $u$ by any $h \in G$ yields $hu = a_1 (h g_1) + \cdots + a_n (h g_n)$, which is equivalent to a permutation of the coefficients, so there is a unique matrix $H \in \mathrm{GL}(\mathbb{C}^n)$ such that $H\mathbf{u}$ is the associated vector for $hu$. The map $L : h \mapsto H$ (we write $L_h$ for the image of $h$) is the (left) regular representation.

The direct sum of two $G$-representations $(\rho_1, V_1)$ and $(\rho_2, V_2)$ is constructed by $(\rho_1 \oplus \rho_2)(g) = \rho_1(g) \oplus \rho_2(g)$. The dimension $d_\rho$ of a representation $(\rho, V)$ is defined to be $\dim(V)$. A finite-dimensional representation is irreducible if it cannot be written as the direct sum of two nontrivial representations. Denote by $\hat{G}$ the set of isomorphism classes of irreducible subrepresentations of $L$, and let $d_{\rho_i}$ denote the dimension of $\rho_i \in \hat{G}$. As it turns out, there is an isomorphism
$$L \cong \bigoplus_{\rho} \rho, \qquad (B.1)$$
where $\rho$ ranges over one representative from each of the isomorphism classes of $\hat{G}$, repeated according to its multiplicity. In general, this decomposition is not uniquely determined, as it depends on the choice of representatives. Throughout this paper, we choose representatives such that each $\rho$ is unitary, meaning that every $\rho(g)$ is a unitary matrix.

Every function $f : G \to \mathbb{C}$ can be considered as equivalent to an element $u_f \in \mathbb{C}[G]$ by setting $u_f = f(g_1) g_1 + \cdots + f(g_n) g_n$; and, as we have already seen, every $u \in \mathbb{C}[G]$ can naturally be regarded as a vector on which the regular representation $L$ acts.
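As a concrete illustration of the regular representation just described, the following minimal sketch (ours; it assumes the cyclic group $C_4$ purely for concreteness) builds the permutation matrices $L_h$, checks the homomorphism property $L_h L_g = L_{hg}$, and shows left multiplication permuting the coefficient vector of $u_f$.

```python
# Minimal sketch of the left regular representation for C_4 (illustrative choice).
import numpy as np

n = 4  # C_4 = {0, 1, 2, 3} under addition mod 4

def L(h):
    # Left multiplication by h permutes the basis {g_1, ..., g_n} of C[G]:
    # the coefficient sitting on g moves to the coefficient of h*g.
    M = np.zeros((n, n))
    for g in range(n):
        M[(h + g) % n, g] = 1.0
    return M

# Homomorphism property: L_h L_g = L_{hg}.
for h in range(n):
    for g in range(n):
        assert np.array_equal(L(h) @ L(g), L((h + g) % n))

# A function f : G -> C, viewed as u_f = sum_g f(g) g in C[G], is just its
# coefficient vector; left multiplication by h permutes that vector.
f = np.array([1.0, 2.0, 3.0, 4.0])
print(L(1) @ f)   # -> [4. 1. 2. 3.]
```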
The Fourier transform of $f$ at a representation $\rho$, denoted $\hat{f}(\rho) \in \mathrm{GL}(d_\rho, \mathbb{C})$ and given by $\hat{f}(\rho) = \sum_{u \in G} f(u) \rho(u)$, can then be considered as a projection of $u_f$ onto the orthogonal subspace described by $\rho$. Throughout the paper we use slightly different notations and characterizations of the Fourier transform depending on the context, but all share this projection as the fundamental operation.

Recalling Equation B.1, there is a representation isomorphic to $L$, which we will suggestively call $F_M$, that block-diagonalizes the image of $L$ into orthogonal subspaces of unitary irreducible representations (each $\rho \in \hat{G}$ repeated with multiplicity $d_\rho$):
$$F_M(u) = \begin{pmatrix} \rho_1(u) & & \\ & \ddots & \\ & & \rho_j(u) \end{pmatrix} \in \mathrm{GL}(\mathbb{C}^n). \qquad (B.2)$$
For the last piece of the puzzle, extend the domain of $F_M$ to all functions $f : G \to \mathbb{C}$ (by considering $f$ as an element of $\mathbb{C}[G]$ and extending linearly). Then we obtain
$$F_M(f) = \begin{pmatrix} \hat{f}(\rho_1) & & \\ & \ddots & \\ & & \hat{f}(\rho_j) \end{pmatrix} \in \mathrm{GL}(\mathbb{C}^n), \qquad (B.3)$$
which we call the matrix Fourier transform of $f$.

We also make use of the Fourier basis matrix $F$, which depends only on the group $G$ and not on the function $f$. To construct it, we first need the operation $\mathrm{Flatten}$, which vertically stacks the columns of a matrix. Then define the transform
$$\varphi(f) = \begin{pmatrix} \mathrm{Flatten}\, \hat{f}(\rho_1) \\ \vdots \\ \mathrm{Flatten}\, \hat{f}(\rho_j) \end{pmatrix},$$
where now each $\rho \in \hat{G}$ appears once. Letting $e_i : G \to \mathbb{C}$ be the indicator function $e_i(g_j) = \mathbb{1}_{i=j}$, we can describe the unitary Fourier basis matrix $F$ for a group $G$ as a row-scaling of
$$\tilde{F} = \begin{pmatrix} \varphi(e_1) & \varphi(e_2) & \cdots & \varphi(e_n) \end{pmatrix}. \qquad (B.5)$$
In other words, given a column vectorization $\mathbf{f}$ of a function $f$ such that $\mathbf{f}_i = f(g_i)$, $F$ is the matrix such that un-flattening $F\mathbf{f}$ equals $F_M f$ up to the row-scaling constants. Thus we can treat the group Fourier transform either as an abstract isomorphism or as a concrete matrix-vector multiplication, depending on the application. The matrix $F$ can be explicitly constructed as described in Definition 4.1. Denoting by $e_{[\rho,i,j]}$ the column-major vectorized basis element for the entry $\rho_{ij}$ in the group Fourier transform, we can form the matrix
$$\tilde{F} = \sum_{u \in G} \sum_{\rho \in \hat{G}} \sum_{i,j=1}^{d_\rho} \rho(u)_{ij}\, e_{[\rho,i,j]}\, e_u^T. \qquad (B.6)$$

For visualization, consider the dihedral group $D_6 = \{1, r, r^2, a, ar, ar^2\}$, which has three irreducible representations $\rho_1, \rho_2, \rho_3$ (up to isomorphism) of dimensions 1, 1, and 2 respectively, and let $f : D_6 \to \mathbb{C}$. The Fourier coefficients are the $1 \times 1$ matrices $\hat{f}(\rho_1)$, $\hat{f}(\rho_2)$ and the $2 \times 2$ matrix $\hat{f}(\rho_3)$. Since $L_{D_6} \cong \rho_1 \oplus \rho_2 \oplus \rho_3^{\oplus 2}$, the matrix Fourier transform $F_M f$ is block-diagonal with blocks $\hat{f}(\rho_1), \hat{f}(\rho_2), \hat{f}(\rho_3), \hat{f}(\rho_3)$, while the columns of the Fourier basis matrix are indexed by the group elements $1, r, r^2, a, ar, ar^2$. Note that we do not yet include the row-scaling constants (for the rows coming from an irrep $\rho$, the constant $\sqrt{d_\rho / |G|}$ makes $F$ unitary). Now explicitly, choosing $\rho_1$ the trivial irrep, $\rho_2$ the sign irrep, and $\rho_3$ the representation sending
$$\rho_3(a) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \rho_3(r) = \begin{pmatrix} \omega & 0 \\ 0 & \omega^2 \end{pmatrix},$$
where $\omega = e^{2\pi i/3}$, the matrix $\tilde{F}$ is exactly
$$\tilde{F} = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & -1 & -1 & -1 \\
1 & \omega & \omega^2 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & \omega & \omega^2 \\
0 & 0 & 0 & 1 & \omega^2 & \omega \\
1 & \omega^2 & \omega & 0 & 0 & 0
\end{pmatrix},$$
with columns indexed by $1, r, r^2, a, ar, ar^2$ and the last four rows indexed by the entries of $\rho_3$ in column-major order. Note that the above is only one possible way of writing $\tilde{F}$, since the 2-dimensional irreducible representation of $D_6$ is unique only up to conjugation by a unitary matrix.

C. Uncertainty principles for groups

In mathematics, uncertainty principles quantify trade-offs in the amount of information a function can store between canonically conjugate regimes, e.g., the position (real regime) and momentum (Fourier regime) of a physical particle. More generically, uncertainty principles show that a function and its Fourier transform cannot both be very localized or concentrated. In the context of group theory, uncertainty principles specifically show that sparse support in either the real or Fourier regime of a group necessarily implies dense support in the conjugate regime (Wigderson and Wigderson, 2021).
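As a quick numerical illustration (ours, not from the paper), the following sketch checks the support form of this trade-off for the cyclic group $\mathbb{Z}_N$, whose group Fourier transform is the discrete Fourier transform: $|\mathrm{supp}(f)| \cdot |\mathrm{supp}(\hat{f})| \ge |G|$ for every nonzero $f$, with equality for the indicator of a subgroup.

```python
# Donoho-Stark support check on Z_12 (illustrative group choice).
import numpy as np

N = 12
rng = np.random.default_rng(0)

def support_size(v, tol=1e-10):
    return int(np.sum(np.abs(v) > tol))

for _ in range(100):
    # random f supported on a random subset of the group
    k = rng.integers(1, N + 1)
    f = np.zeros(N, dtype=complex)
    idx = rng.choice(N, size=k, replace=False)
    f[idx] = rng.standard_normal(k) + 1j * rng.standard_normal(k)
    fhat = np.fft.fft(f)
    assert support_size(f) * support_size(fhat) >= N

# The indicator of a subgroup meets the bound with equality: for H = {0, 3, 6, 9},
# |supp(f)| = 4 and fhat is supported on the 3 multiples of 4, so the product is 12.
f = np.zeros(N)
f[::3] = 1.0
assert support_size(f) * support_size(np.fft.fft(f)) == N
```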
Such results are directly relevant when interpreting the implicit regularization of linear G-CNNs, which biases gradient descent towards sparse solutions in the Fourier basis of the group. In this section, we formally state and summarize these group-theoretic uncertainty principles to provide intuition into the properties of functions which linear group convolutional networks are likely to learn.

Abelian groups. For abelian groups, the fundamental uncertainty principle details a trade-off between the norms of a function in its real and Fourier bases.

Theorem C.1 (Generalization of the Donoho-Stark theorem (Wigderson and Wigderson, 2021)). Given a finite abelian group $G$, let $\hat{G}$ be the set of irreducible representations of $G$ and $f : G \to \mathbb{C}$ a function mapping group elements to complex numbers. Let $\mathbf{f}$ be the vectorized function and $F$ the unitary group Fourier transform (see Definition 4.1). Then
$$\frac{\|\mathbf{f}\|_1}{\|\mathbf{f}\|_\infty} \cdot \frac{\|F\mathbf{f}\|_1}{\|F\mathbf{f}\|_\infty} \ge |G|. \qquad (C.1)$$

Remark C.2. Since the size of the support of a vector is bounded below by $|\mathrm{supp}(v)| \ge \frac{\|v\|_1}{\|v\|_\infty}$, this directly implies that $|\mathrm{supp}(\mathbf{f})| \cdot |\mathrm{supp}(F\mathbf{f})| \ge |G|$, recovering the Donoho-Stark theorem (Donoho and Stark, 1989; Matusiak et al., 2004).

Non-abelian groups. Since non-abelian groups have matrix-valued irreducible representations, uncertainty theorems must account for norms and notions of support in the context of matrices. Here, we provide two different uncertainty theorems for the non-abelian setting: one via the rank of the Fourier coefficients of the irreducible representations, and another via their Schatten norms. Uncertainty relations for non-abelian groups make use of the matrix group Fourier transform detailed in Definition 4.1.

Theorem C.3 (Meshulam uncertainty theorem (Meshulam, 1992)). Given a finite non-abelian group $G$, let $\hat{G}$ be the set of irreducible representations of $G$ and $f : G \to \mathbb{C}$ a function mapping group elements to complex numbers. Let $\mathbf{f}$ be the vectorized function and $F_M$ the matrix group Fourier transform (see Definition 4.1). Then
$$|\mathrm{supp}(f)| \cdot \mathrm{rank}(F_M \mathbf{f}) = |\mathrm{supp}(f)| \sum_{\rho \in \hat{G}} d_\rho\, \mathrm{rank}\big(\hat{f}(\rho)\big) \ge |G|. \qquad (C.2)$$

The above theorem shows that the rank of the matrix Fourier transform is the proper notion of Fourier support in the uncertainty theorem for a non-abelian group. A stronger uncertainty principle, which parallels Theorem C.1 more directly, can be obtained via the Schatten norms of the Fourier coefficients, as shown below.

Theorem C.4 (Kuperberg uncertainty theorem (Wigderson and Wigderson, 2021)). Given a finite non-abelian group $G$, let $\hat{G}$ be the set of irreducible representations of $G$ and $f : G \to \mathbb{C}$ a function mapping group elements to complex numbers. Let $\mathbf{f}$ be the vectorized function and $F_M$ the matrix group Fourier transform (see Definition 4.1). Then
$$\frac{\|F_M \mathbf{f}\|^{(S)}_1}{\|F_M \mathbf{f}\|^{(S)}_\infty} \cdot \frac{\|\mathbf{f}\|_1}{\|\mathbf{f}\|_\infty} = \frac{\sum_{\rho \in \hat{G}} d_\rho \|\hat{f}(\rho)\|^{(S)}_1}{\max_{\rho \in \hat{G}} \|\hat{f}(\rho)\|^{(S)}_\infty} \cdot \frac{\|\mathbf{f}\|_1}{\|\mathbf{f}\|_\infty} \ge |G|. \qquad (C.3)$$

D. Visualizing the implicit bias

Implicit biases induced by the G-CNN architectures studied here can be readily observed by analyzing the coefficients of the linearized transformation in the real or Fourier regime. Here, we visualize the linearized outputs of a 3-layer linear G-CNN over the dihedral group $D_{60}$, which has 4 scalar irreps and 14 irreps of dimension 2 (hence $2 \times 2$ matrices). Figure 5 shows these linearized coefficients of the G-CNN, CNN, and FC networks at initialization and after training. As evident in Figure 5, the learned coefficients of the G-CNN are sparse in the Fourier regime of the group. This sparsity pattern appears over blocks of coefficients of length four, corresponding to the coefficients of the $2 \times 2$ irreps of $D_{60}$.
Furthermore, the values of the coefficients within a block are roughly constant, highlighting the bias towards low-rank irrep coefficients. The relative denseness of the trained G-CNN's coefficients in the real regime is inherent, due to the uncertainty principles for group functions. Unlike the G-CNN, the fully connected network (FC) appears to have no bias towards sparseness in its coefficients in either the real or Fourier regime. On a related note, the cyclic group underlying the CNN shares some of its irreps with the G-CNN studied here; this may be one explanation for the partial sparsity patterns observed in the coefficients of the CNN in the Fourier regime.

E. Computational details

As described in Section 6, for all models we use three-layer networks over $\mathbb{R}^{|G|}$ unless otherwise stated, and binary classification tasks trained via standard gradient descent with the exponential loss. We often train networks on isotropic Gaussian data points, i.e. random vectors whose entries are drawn i.i.d. from a standard normal distribution. For the linear networks on simulated data (with a fully connected output layer) we use the groups $D_8$ and $(C_5 \times C_5) \rtimes Q_8$. For the linear and nonlinear networks on MNIST, we use $(C_{28} \times C_{28}) \rtimes D_8$. For the networks with ReLU activations and a linear layer (instead of pooling) we use the groups $D_8$ and $D_{60}$. For the experiments on MNIST with nonlinear networks, we used the e2cnn package (Weiler and Cesa, 2019). The weights are initialized with the standard uniform initialization. We choose an appropriate learning rate for each task depending on the dimension and magnitude of the values. Because gradients decay over time under the exponential loss, we sometimes periodically increased the learning rate to speed up convergence. Since each problem is overparameterized, the loss will almost surely converge to zero, so we choose enough training epochs to achieve satisfactory convergence. For each plot, we report 95% confidence intervals over 3 to 50 runs, depending on the classification task. The computational resources used are modest: commodity hardware should suffice to fully reproduce our results. To replicate experiments, please visit our code repository.¹⁰

E.1. Additional experiments - linear

Figure 7: Norms of the linearizations of three different linear architectures (CNN, G-CNN, FC) for the non-abelian group $G = (C_5 \times C_5) \rtimes Q_8$, trained on a binary classification task with 10 isotropic Gaussian data points over 500 epochs. (a) Fourier-space norm of the network linearization $\beta$; (b) real-space norm of the network linearization $\beta$.

We include linear architecture experiments on two additional groups. First, we include a G-CNN over the group $(C_5 \times C_5) \rtimes Q_8$ (Figure 7), a non-abelian group that has irreducible representations of dimension up to 8 and therefore displays implicit regularization over a more elaborate group structure. Inputs are vectors with entries drawn i.i.d. from the standard normal distribution.

E.2. Additional experiments - nonlinear

We evaluate the implicit bias of an invariant ReLU G-CNN (with a final pooling layer) with respect to translations, rotations, and flips on MNIST digits in Figure 8.
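To make the tracked quantity concrete, the following is a toy sketch of our own (not the experiment code behind the figures, with generic matrix-valued inputs standing in for the Fourier coefficients $\hat{x}_n$): a three-layer linear network parameterized directly by $d \times d$ Fourier-block matrices is trained with the exponential loss by plain gradient descent, while the $2/3$-Schatten quasi-norm of its end-to-end linearization is recorded.

```python
# Toy sketch (assumptions: a single d x d block, synthetic separable data).
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, L = 4, 6, 3

# Separable toy data: labels from a fixed ground-truth matrix predictor.
X = rng.standard_normal((n_samples, d, d))
B_star = rng.standard_normal((d, d))
y = np.sign(np.einsum('nij,ij->n', X, B_star))

Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]  # W1, W2, W3

def schatten(M, p):
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

lr = 0.05
for epoch in range(20001):
    beta = Ws[2] @ Ws[1] @ Ws[0]                      # end-to-end linearization
    margins = y * np.einsum('nij,ij->n', X, beta)
    loss_grad = -y * np.exp(-margins)                 # derivative of the exp loss
    g_beta = np.einsum('n,nij->ij', loss_grad, X)     # gradient w.r.t. beta
    grads = [Ws[1].T @ Ws[2].T @ g_beta,              # chain rule: dL/dW1
             Ws[2].T @ g_beta @ Ws[0].T,              # dL/dW2
             g_beta @ Ws[0].T @ Ws[1].T]              # dL/dW3
    for W, g in zip(Ws, grads):
        W -= lr * g
    if epoch % 5000 == 0:
        print(epoch, schatten(beta, 2.0 / L))         # the quantity tracked above
```

Under the analysis above, one expects this quasi-norm to approach that of a sparse-spectrum interpolator as the margin grows; the actual experiments replace the generic matrices with genuine group Fourier coefficients and full G-CNN layers.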
¹⁰Our code is available here: https://github.com/kristian-georgiev/implicit-bias-of-linear-equivariant-networks

Figure 8: Group Fourier norm for the non-abelian group $(C_{28} \times C_{28}) \rtimes D_8$ for an additional nonlinear architecture with ReLU activations, showing that a nonlinear G-CNN seems to implicitly regularize locally. The figure tracks the mean norm of the per-sample local linearizations of each network (G-CNN, CNN, FC) over 300 epochs. The network uses a pooling layer after the convolutional layers to maintain invariance. We evaluate a binary classification task using the MNIST digits 0 and 5.

E.3. Spherical CNN Experimental Details

For the spherical CNN experiments, our architecture is drawn directly from Cohen et al. (2018). Given an input image defined on a sphere, we apply to it a fixed $S^2$-convolution, followed by two learnable $\mathrm{SO}(3)$-convolutions and a learnable fully connected final layer, with ReLUs in between. As in Cohen et al. (2018), convolutions are performed in Fourier space. We bandlimit the signals up to $\ell = 5$, and then $\ell = 3$. The first $\mathrm{SO}(3)$-convolution has 20 channels, and the second has 100. The use of varying bandlimits is adapted directly from Cohen et al. (2018), but departs from our theory in the linear case of infinite groups. We compute all Schatten norms with respect to the bandwidth $\ell = 5$. Since we use multiple (20) channels for the first convolution, we compute the linearization for each channel separately and report the average Schatten norm over all channels. We note that this also departs from our theory towards a more realistic scenario, and we only provide these empirical observations for multi-channel networks.

The architecture is trained via stochastic gradient descent on the cross-entropy loss, as the exponential loss did not result in learning the training data. Although this further departs from the theory, other implicit regularization works have studied the cross-entropy and exponential losses simultaneously, suggesting that the same results may carry over. Indeed, it is promising that the Schatten norm regularization occurs with a different loss function in the nonlinear case.

For comparison architectures, we use a fully connected architecture and a 2D CNN (applied to each discretization index of $\mathrm{SO}(3)$). All architectures take as input a fixed, shared $S^2$ convolution and ReLU layer, which we conceptualize as a fixed, nonlinear feature embedding. The reason for this is as follows: to apply the representation theory of $\mathrm{SO}(3)$, we need inputs and linearizations which are defined on $\mathrm{SO}(3)$, not on the homogeneous space $S^2$ (the sphere). As the output of an $S^2$ convolution is defined on $\mathrm{SO}(3)$, the fixed feature embedding serves this purpose.

E.4. Omitted real-space plots

Figure 9: Real-space norm for $D_{60}$ with ReLU networks (ReLU CNN, ReLU G-CNN, ReLU FC), trained on 10 Gaussian training points over 1000 epochs. See Figure 6a for the comparable plot of norms in Fourier space.

E.5. Loss plots

Figure 10: Loss trajectory for the setting of $D_8$ (see Figure 1), tracked over 20000 epochs for the CNN, G-CNN, and FC networks. Networks are trained on 2 Fourier i.i.d. data points.

Figure 11: Loss trajectory for the setting of $C_{10} \times C_{10} \times C_2$ (see ??), tracked over 100 epochs for the CNN, G-CNN, and FC networks. Networks are trained on 6 Gaussian i.i.d. data points.
Figure 12: Loss trajectory for the setting of $(C_5 \times C_5) \rtimes Q_8$ (see Figure 7), tracked over 500 epochs for the CNN, G-CNN, and FC networks. Networks are trained on 10 Gaussian i.i.d. data points.

Figure 13: Loss trajectory for the setting of $(C_{28} \times C_{28}) \rtimes D_8$ (see Figure 4), tracked over 100 epochs for the CNN, G-CNN, and FC networks. Networks are trained on the digits 1 and 5 from MNIST.

Figure 14: Loss trajectory for the setting of $D_{60}$ (see Figure 6a), tracked over 1000 epochs for the ReLU CNN, ReLU G-CNN, and ReLU FC networks. Networks are nonlinear with ReLU activations and trained on 10 distinct frequencies as data points.

Figure 15: Loss trajectory for the setting of bandlimited $\mathrm{SO}(3)$, tracked over roughly 80 epochs for the spherical CNN, CNN, and FC networks. Networks are nonlinear with ReLU activations and trained on a subset of spherical MNIST (digits 0 and 5).