# IMPLICIT REGULARIZATION FOR GROUP SPARSITY

Published as a conference paper at ICLR 2023

Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde & Raymond K. W. Wong
Texas A&M University; New York University
{jiangyuanli, raywong}@tamu.edu; thanhng.cs@gmail.com; chinmay.h@nyu.edu

ABSTRACT

We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works on understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.¹

1 INTRODUCTION

Motivation. A salient feature of modern deep neural networks is that they are highly overparameterized, with many more parameters than available training examples. Surprisingly, however, deep neural networks trained with gradient descent generalize quite well in practice, even without explicit regularization. One hypothesis is that the dynamics of gradient descent-based training itself induce some form of implicit regularization, biasing toward low-complexity solutions (Hardt et al., 2016; Neyshabur et al., 2017).

Recent research in deep learning theory has validated the hypothesis of such implicit regularization effects. A large body of work, which we survey below, has considered certain (restricted) families of linear neural networks and established two types of implicit regularization, standard sparse regularization and ℓ2-norm regularization, depending on how gradient descent is initialized. On the other hand, the role of the network architecture, i.e., the way the model is parameterized, in implicit regularization is less well understood. Does there exist a parameterization that promotes implicit regularization of gradient descent towards richer structures beyond standard sparsity?

In this paper, we analyze a simple, prototypical hierarchical architecture for which gradient descent induces group-sparse regularization. Our finding that finer, structured biases can be induced via gradient dynamics highlights the richness of co-designing neural networks along with optimization methods for producing more sophisticated regularization effects.

Background. Many recent theoretical efforts have revisited traditional, well-understood problems such as linear regression (Vaskevicius et al., 2019; Li et al., 2021; Zhao et al., 2019), matrix factorization (Gunasekar et al., 2018b; Li et al., 2018; Arora et al., 2019), and tensor decomposition (Ge et al., 2017; Wang et al., 2020) from the perspective of neural network training.
For nonlinear models with squared error loss, Williams et al. (2019) and Jin & Montúfar (2020) study the implicit bias of gradient descent in wide depth-2 ReLU networks with input dimension 1. Other works (Gunasekar et al., 2018c; Soudry et al., 2018; Nacson et al., 2019) show that gradient descent biases the solution towards the max-margin (or minimum ℓ2-norm) solution over separable data.

¹Code is available at https://github.com/jiangyuan2li/Implicit-Group-Sparsity

| | NNs | Noise | Implicit vs. Explicit | Regularization |
|---|---|---|---|---|
| Vaskevicius et al. (2019) | DLNN | | Implicit (GD) | Sparsity |
| Dai et al. (2021) | LNN | | Explicit (ℓ2-penalty) | (Group) quasi-norm |
| Jagadeesan et al. (2021) | LCNN | | Explicit (ℓ2-penalty) | Norm induced by SDP |
| Wu et al. (2020) | DLNN | | Implicit | ℓ2-norm |
| This paper | DGLNN | | Implicit (GD) | Structured sparsity |

Table 1: Comparisons to related work on implicit and explicit regularization. Here, GD stands for gradient descent, (D)LNN/CNN for (diagonal) linear/convolutional neural network, and DGLNN for diagonally grouped linear neural network.

Outside of implicit regularization, several other works study the inductive bias of network architectures under explicit ℓ2 regularization on model weights (Pilanci & Ergen, 2020; Sahiner et al., 2020). For multichannel linear convolutional networks, Jagadeesan et al. (2021) show that ℓ2-norm minimization of weights leads to a norm regularizer on predictors, where the norm is given by a semidefinite program (SDP). The representation cost in predictor space induced by explicit ℓ2 regularization on (various different versions of) linear neural networks is studied in Dai et al. (2021), which demonstrates several interesting (induced) regularizers on the linear predictors, such as ℓp quasi-norms and group quasi-norms. However, these results are silent on the behavior of gradient descent-based training without explicit regularization.

In light of the above results, we ask the following question: Beyond ℓ2-norm, sparsity, and low-rankness, can gradient descent induce other forms of implicit regularization?

Our contributions. In this paper, we rigorously show that a diagonally grouped linear neural network (see Figure 1b) trained by gradient descent with (proper/partial) weight normalization induces group-sparse regularization: a form of structured regularization that, to the best of our knowledge, has not been provably established in previous work.

One major approach to understanding the implicit regularization of gradient descent is based on its equivalence to a mirror descent (on a different objective function) (e.g., Gunasekar et al., 2018a; Woodworth et al., 2020). However, we show that, for the diagonally grouped linear network architecture, the gradient dynamics go beyond mirror descent. We then analyze the convergence of gradient flow with early stopping under orthogonal design with possibly noisy observations, and show that the obtained solution exhibits an implicit regularization effect towards structured (specifically, group) sparsity. In addition, we show that weight normalization can deal with instability related to the choices of learning rates and initialization. With weight normalization, we are able to obtain a similar implicit regularization result in more general settings: orthogonal/non-orthogonal designs with possibly noisy observations. Moreover, the obtained solution achieves minimax-optimal error rates. Overall, compared to existing analyses of diagonal linear networks, our model design that induces structured sparsity exhibits provably improved sample complexity.
In the degenerate case of size-one groups, our bounds coincide with previous results, and our approach can be interpreted as a new algorithm for sparse linear regression.

Our techniques. Our approach is built upon the power reparameterization trick, which has been shown to promote model sparsity (Schwarz et al., 2021). Raising the parameters of a linear model element-wise to the N-th power (N > 1) causes parameters of smaller magnitude to receive smaller gradient updates, while parameters of larger magnitude receive larger updates. In essence, this leads to a "rich get richer" phenomenon in gradient-based training. In Gissin et al. (2019) and Berthier (2022), the authors analyze the gradient dynamics on a toy example and call this incremental learning. Concretely, for a linear predictor w ∈ ℝ^p, if we reparameterize the model as w = u^{⊙N} − v^{⊙N} (where u^{⊙N} denotes the N-th element-wise power of u), then gradient descent biases the training towards sparse solutions. This reparameterization is equivalent to a diagonal linear network, as shown in Figure 1a. This is further studied in Woodworth et al. (2020) for interpolating predictors, where they show that a small enough initialization induces ℓ1-norm regularization. For noisy settings, Vaskevicius et al. (2019) and Li et al. (2021) show that gradient descent converges to sparse models with early stopping. In the special case of sparse recovery from under-sampled observations (or compressive sensing), the optimal sample complexity can also be obtained via this reparameterization (Chou et al., 2021).

Figure 1: An illustration of the two architectures for standard and group sparse regularization. (a) Diagonal linear NN (DLNN). (b) Diagonally grouped linear NN (DGLNN).

Inspired by this approach, we study a novel model reparameterization of the form w = [w_1, …, w_L], where w_l = u_l^2 v_l for each group l ∈ {1, …, L}. (One way to interpret this model is to think of u_l as the magnitude and v_l as the direction of the subvector corresponding to each group; see Section 2 for details.) This corresponds to a special type of linear neural network architecture, as shown in Figure 1b. A related architecture has also been recently studied in Dai et al. (2021), but there the authors focus on the bias induced by an explicit ℓ2 regularization on the weights and do not investigate the effect of gradient dynamics.

The diagonal linear network parameterization of Woodworth et al. (2020) and Li et al. (2021) does not suffer from identifiability issues. In contrast, in our setup the magnitude parameter u_l of each group interacts with the norm of the direction, ∥v_l∥_2, causing a fundamental identifiability problem. By leveraging the layer balancing effect (Du et al., 2018) in DGLNN, we verify the group regularization effect implicit in gradient flow with early stopping. But gradient flow is idealized; for a more practical algorithm, we use a variant of gradient descent based on weight normalization, proposed in Salimans & Kingma (2016) and studied in more detail in Wu et al. (2020). Weight normalization has been shown to be particularly helpful in stabilizing the effect of learning rates (Morwani & Ramaswamy, 2022; Van Laarhoven, 2017). With weight normalization, the learning effect is separated into magnitudes and directions. We derive the gradient dynamics of both magnitudes and directions with perturbations: the directions guide the magnitudes to grow, and as the magnitudes grow, the directions become more accurate. Thereby, we are able to establish the regularization effect implied by such gradient dynamics.
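To make the incremental-learning intuition concrete, the following minimal NumPy sketch (not taken from the paper or its code release) runs plain gradient descent on the depth-2 diagonal reparameterization w = u ⊙ u − v ⊙ v for a noiseless, under-determined least-squares problem; the problem sizes, initialization scale, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 100, 3                          # under-determined: fewer samples than dimensions
X = rng.standard_normal((n, p))
w_star = np.zeros(p)
w_star[:k] = [3.0, -2.0, 1.5]                 # sparse ground truth
y = X @ w_star                                # noiseless observations for simplicity

alpha, lr, T = 1e-2, 5e-3, 10000              # small initialization drives the sparsity bias
u = alpha * np.ones(p)
v = alpha * np.ones(p)

for _ in range(T):
    w = u * u - v * v                         # diagonal linear network: w = u**2 - v**2
    grad_w = X.T @ (X @ w - y) / n            # gradient of the squared loss in predictor space
    u, v = u - lr * 2 * u * grad_w, v + lr * 2 * v * grad_w   # chain rule through the reparameterization

w = u * u - v * v
print("largest recovered entries:", np.argsort(-np.abs(w))[:k])   # typically the true support {0, 1, 2}
print("largest off-support magnitude:", np.abs(w[k:]).max())      # stays small relative to the signal
```

The same gradient steps applied directly to an unparameterized w would return a dense minimum-ℓ2-norm interpolator; the sparsity here comes purely from the reparameterization together with the small initialization.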
A remark on grouped architectures. Finally, we remark that grouping layers have been commonly used in grouped CNNs and grouped attention mechanisms (Xie et al., 2017; Wu et al., 2021), which leads to parameter efficiency and better accuracy. Group sparsity is also useful for deep learning models on multi-omics data for survival prediction (Xie et al., 2019). We hope that our analysis of the diagonally grouped linear NN leads to a better understanding of the inductive biases of grouping-style architectures.

Notation. Denote the set {1, 2, …, L} by [L] and the vector ℓ2 norm by ∥·∥. We use 1_p and 0_p to denote the p-dimensional vectors of all 1s and all 0s, respectively. Also, ⊙ represents entry-wise multiplication, whereas β^{⊙N} denotes the element-wise N-th power of a vector β. We use e_i to denote the i-th canonical vector. We write inequalities up to multiplicative constants using the notation ≲, whereby the constants do not depend on any problem parameter.

Observation model. Suppose that the index set [p] = ∪_{l=1}^L G_l is partitioned into L disjoint (i.e., non-overlapping) groups G_1, G_2, …, G_L with G_i ∩ G_j = ∅ for i ≠ j. The size of G_l is denoted by p_l = |G_l| for l ∈ [L]. Let w* ∈ ℝ^p be a p-dimensional vector whose entries are non-zero only on a subset of groups. We posit a linear model of the data, where observations (x_i, y_i) ∈ ℝ^p × ℝ, i ∈ [n], are given such that y_i = ⟨x_i, w*⟩ + ξ_i for i = 1, …, n, and ξ = [ξ_1, …, ξ_n] is a noise vector. Note that we do not impose any special restriction between n (the number of observations) and p (the dimension). We write the linear model in the following matrix-vector form: y = Xw* + ξ, with the n × p design matrix X = [X_1, X_2, …, X_L], where X_l ∈ ℝ^{n×p_l} represents the features from the l-th group G_l, for l ∈ [L]. We make the following assumptions on X:

Assumption 1. The design matrix X satisfies
$$\sup_{\|\beta_1\| \le 1,\ \|\beta_2\| \le 1} \beta_1^\top \Big(\frac{1}{n} X_l^\top X_l - I\Big)\beta_2 \le \delta_{\mathrm{in}}, \quad \text{where } \beta_1, \beta_2 \in \mathbb{R}^{p_l}, \tag{1}$$
$$\sup_{\|\beta_1\| \le 1,\ \|\beta_2\| \le 1} \Big\langle \frac{1}{\sqrt{n}} X_l \beta_1,\ \frac{1}{\sqrt{n}} X_{l'} \beta_2 \Big\rangle \le \delta_{\mathrm{out}}, \quad \text{where } \beta_1 \in \mathbb{R}^{p_l},\ \beta_2 \in \mathbb{R}^{p_{l'}},\ l \ne l', \tag{2}$$
for some constants δ_in, δ_out ∈ (0, 1).

The first part (1) is a within-group eigenvalue condition, while the second part (2) is a between-group block coherence assumption. There are multiple ways to construct a sensing matrix that fulfills these two conditions (Eldar & Bolcskei, 2009; Baraniuk et al., 2010). One of them is based on the fact that random Gaussian matrices satisfy such conditions with high probability (Stojnic et al., 2009).

Reparameterization. Our goal is to learn the parameter w*, whose coefficients obey a group structure, from the data {(x_i, y_i)}_{i=1}^n. Instead of imposing an explicit group-sparsity constraint on w (e.g., via weight penalization by group), we show that gradient descent on the unconstrained regression loss can still learn w*, provided we design a special reparameterization. Define a mapping g(·): [p] → [L] from each index i to its group g(i). Each parameter is rewritten as w_i = u_{g(i)}^2 v_i, i ∈ [p]. The parameterization G(·): ℝ_+^L × ℝ^p → ℝ^p reads
$$[u_1, \ldots, u_L, v_1, v_2, \ldots, v_p] \mapsto [u_1^2 v_1,\ u_1^2 v_2,\ \ldots,\ u_L^2 v_p].$$
This corresponds to the 2-layer neural network architecture displayed in Figure 1b, in which W_1 = diag(v_1, …, v_p) and W_2 is diagonally tied within each group: W_2 = diag(u_1, …, u_1, u_2, …, u_2, …, u_L, …, u_L).
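As a concrete illustration (a minimal sketch with assumed group sizes, not code from the paper), the map G can be implemented by expanding the group magnitudes u through a group-membership indicator matrix (denoted D in the next subsection), squaring elementwise, and multiplying by the direction vector v:

```python
import numpy as np

# Illustrative group structure: L = 3 groups with sizes p_l = (2, 3, 2), so p = 7.
group_sizes = [2, 3, 2]
L, p = len(group_sizes), sum(group_sizes)

# D is the p x L indicator matrix with a block of ones (1_{p_l}) in column l.
D = np.zeros((p, L))
start = 0
for l, pl in enumerate(group_sizes):
    D[start:start + pl, l] = 1.0
    start += pl

def G(u, v):
    """DGLNN parameterization: w_i = u_{g(i)}^2 * v_i, i.e. w = (Du)^{odot 2} * v."""
    return (D @ u) ** 2 * v

u = np.array([0.5, 2.0, 0.1])          # one magnitude per group
v = np.arange(1.0, p + 1.0)            # one direction coordinate per entry
w = G(u, v)
print(w)                               # entries within group l share the common factor u_l^2
```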
Gradient dynamics. We learn u and v by minimizing the standard squared loss
$$\mathcal{L}(u, v) = \frac{1}{2n}\big\|y - X[(Du)^{\odot 2} \odot v]\big\|^2, \qquad D = \begin{bmatrix} 1_{p_1} & 0_{p_1} & \cdots & 0_{p_1} \\ 0_{p_2} & 1_{p_2} & \cdots & 0_{p_2} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{p_L} & 0_{p_L} & \cdots & 1_{p_L} \end{bmatrix}.$$
By simple algebra, the gradients with respect to u and v read as follows:
$$\nabla_u \mathcal{L} = \frac{2}{n}\, D^\top\Big[v \odot \big(X^\top X((Du)^{\odot 2} \odot v - w^*) - X^\top \xi\big) \odot Du\Big],$$
$$\nabla_v \mathcal{L} = \frac{1}{n}\Big[X^\top X((Du)^{\odot 2} \odot v - w^*) - X^\top \xi\Big] \odot (Du)^{\odot 2}.$$
Denote $r(t) = y - \sum_{l=1}^{L} u_l^2(t) X_l v_l(t)$. For each group l ∈ [L], the gradient flow reads
$$\dot{u}_l(t) = \frac{2}{n}\, u_l(t)\, v_l^\top(t) X_l^\top r(t), \qquad \dot{v}_l(t) = \frac{1}{n}\, u_l^2(t)\, X_l^\top r(t). \tag{3}$$
Although we are not able to transform the gradient dynamics back onto w(t) due to the overparameterization, the extra factor u_l(t) on the group magnitude leads to an incremental learning effect.

3 ANALYSIS OF GRADIENT FLOW

3.1 FIRST ATTEMPT: MIRROR FLOW

Existing results about implicit bias in overparameterized models are mostly based on recasting the training process from the parameter space {u(t), v(t)}_{t≥0} to the predictor space {w(t)}_{t≥0} (Woodworth et al., 2020; Gunasekar et al., 2018a). If properly performed, the (induced) dynamics in the predictor space can then be analyzed by a classical algorithm: mirror descent (or mirror flow). Implicit regularization is demonstrated by showing that the limit point satisfies a KKT (Karush-Kuhn-Tucker) condition with respect to minimizing some regularizer R(·) among all possible solutions.

At first, we were unable to express the gradient dynamics in Eq. (3) in terms of w(t) (i.e., in the predictor space), due to complicated interactions between u and v. This hints that the training trajectory induced by an overparameterized DGLNN may not be analyzable by mirror flow techniques. In fact, we prove a stronger negative result and rigorously show that the corresponding dynamics cannot be recast as a mirror flow. Therefore, we conclude that our subsequent analysis techniques are necessary and do not follow as a corollary from existing approaches. We first list two definitions from differential topology below.

Definition 1. Let M be a smooth submanifold of ℝ^D. Given two C¹ vector fields X, Y on M, we define the Lie bracket of X and Y as $[X, Y](x) := \partial Y(x) X(x) - \partial X(x) Y(x)$.

Definition 2. Let M be a smooth submanifold of ℝ^D. A C² parameterization G: M → ℝ^d is said to be commuting iff for any i, j ∈ [d], the Lie bracket $[\nabla G_i, \nabla G_j](x) = 0$ for all x ∈ M.

The parameterization studied in most existing works on diagonal networks is separable, meaning that each parameter affects only one coordinate in the predictor space. In DGLNN, the parameterization is not separable, due to the shared parameter u_l within each group. We formally show that it is indeed not commuting.

Lemma 1. G(·) is not a commuting parameterization.

Non-commutativity of the parameterization implies that moving along ∇G_i and then ∇G_j differs from moving along ∇G_j first and then ∇G_i. This causes extra difficulty in analyzing the gradient dynamics. Li et al. (2022) study the equivalence between gradient flow on reparameterized models and mirror flow, and show that a commuting parameterization is a sufficient condition for a gradient flow under a given parameterization to simulate a mirror flow. A complementary necessary condition is also established on the Lie algebra generated by the gradients of the coordinate functions of G, with order higher than 2. We show that the parameterization G(·) violates this necessary condition.
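Lemma 1 can be checked symbolically on the smallest interesting example: a single group of size two with parameters (u, v1, v2) and coordinate functions G_1 = u²v_1, G_2 = u²v_2. The following SymPy sketch is only an illustration of the non-commutativity, not the paper's proof.

```python
import sympy as sp

u, v1, v2 = sp.symbols('u v1 v2', real=True)
params = sp.Matrix([u, v1, v2])

# One group of size 2 under the DGLNN parameterization: G_i(u, v) = u^2 * v_i.
G1, G2 = u**2 * v1, u**2 * v2

# Gradient vector fields of the coordinate functions.
X = sp.Matrix([sp.diff(G1, s) for s in params])   # grad G1 = (2*u*v1, u**2, 0)
Y = sp.Matrix([sp.diff(G2, s) for s in params])   # grad G2 = (2*u*v2, 0, u**2)

# Lie bracket [X, Y](x) = dY(x) X(x) - dX(x) Y(x), as in Definition 1.
JX = X.jacobian(params)
JY = Y.jacobian(params)
bracket = sp.simplify(JY @ X - JX @ Y)
print(bracket.T)
```

The bracket evaluates to (0, −4u²v_2, 4u²v_1), which is nonzero whenever u ≠ 0 and (v_1, v_2) ≠ 0, consistent with Lemma 1.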
Theorem 1. There exist an initialization [u_init, v_init] ∈ ℝ_+^L × ℝ^p and a time-dependent loss L_t such that the gradient flow of L_t ∘ G starting from [u_init, v_init] cannot be written as a mirror flow with respect to any Legendre function R under the loss L_t.

The detailed proof is deferred to the Appendix. Theorem 1 shows that the gradient dynamics implied by DGLNN cannot be emulated by mirror descent. Therefore, a different technique is needed to analyze the gradient dynamics and any associated implicit regularization effect.

3.2 LAYER BALANCING AND GRADIENT FLOW

Let us first introduce the relevant quantities. Following our reparameterization, we rewrite the true parameters for each group l as w*_l = (u*_l)² v*_l, with ∥v*_l∥_2 = 1 and v*_l ∈ ℝ^{p_l}. The support is defined at the group level, S = {l ∈ [L] : u*_l > 0}, and the support size is s = |S|. We denote u*_max = max{u*_l : l ∈ S} and u*_min = min{u*_l : l ∈ S}.

The gradient dynamics of our reparameterization do not preserve ∥v_l(t)∥_2 = 1, which makes it difficult to identify the magnitudes of u_l(t) and ∥v_l(t)∥_2. Du et al. (2018) and Arora et al. (2018) show that the gradient flow of multi-layer homogeneous functions effectively enforces the differences between squared norms across different layers to remain invariant. Following the same idea, we discover a similar balancing effect in DGLNN between the parameters u and v.

Lemma 2. For any l ∈ [L], we have
$$\frac{d}{dt}\Big(\tfrac{1}{2}\, u_l^2(t) - \|v_l(t)\|^2\Big) = 0.$$

The balancing result eliminates the identifiability issue on the magnitudes. As the coordinates within one group affect each other, the direction, which controls the growth rates of both u and v, needs to be determined as well.

Lemma 3. If the initialization v_l(0) is proportional to (1/n) X_l^⊤ y, then
$$\Big\langle \frac{v_l(0)}{\|v_l(0)\|},\ v_l^* \Big\rangle \ge 1 - \Big(\delta_{\mathrm{in}} + L\delta_{\mathrm{out}} + \big\|\tfrac{1}{n} X_l^\top \xi\big\|_2 / (u_l^*)^2\Big)^2.$$

Note that this initialization can be obtained by a single step of gradient descent from the zero initialization. Lemma 3 suggests that the direction is close to the truth at initialization. We can further normalize it so that ∥v_l(0)∥_2² = ½ u_l²(0), based on the balancing criterion. The magnitude equality ∥v_l(t)∥_2² = ½ u_l²(t) is then preserved by Lemma 2. However, ensuring the closeness of the direction throughout the gradient flow presents significant technical difficulties. That said, we are able to present a meaningful implicit regularization result for the gradient flow under orthogonal (and noisy) settings.

Theorem 2. Fix ϵ > 0. Consider the case where (1/n) X_l^⊤ X_l = I and (1/n) X_l^⊤ X_{l'} = O for l ≠ l', with initialization u_l(0) = θ < ϵ/(2(u*_max)²) and v_l(0) = η_l (1/n) X_l^⊤ y satisfying ∥v_l(0)∥_2² = ½θ², l ∈ [L]. Then there exist a lower bound and an upper bound on the time, T_l < T_u, in the gradient flow of Eq. (3), such that for any T_l ≤ t ≤ T_u we have
$$\big\|u_l^2(t)\, v_l(t) - w_l^*\big\| \lesssim \begin{cases} \big\|\tfrac{1}{n} X^\top \xi\big\|_\infty \vee \epsilon, & \text{if } l \in S, \\ \theta^{3/2}, & \text{if } l \notin S. \end{cases}$$

Theorem 2 states the error bounds for the estimation of the true weights w*. For entries outside the (true) support, the error is controlled by θ^{3/2}. When θ is small, the algorithm keeps all non-supported entries close to zero throughout the iterations, while maintaining the guarantee for supported entries. Theorem 2 shows that, under the assumption of orthogonal design, gradient flow with early stopping is able to obtain a solution with group sparsity.
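The following NumPy sketch discretizes the gradient flow (3) with a forward-Euler step under an exactly orthogonal design, loosely mirroring the setting of Theorem 2; the problem sizes, noise level, step dt, initialization scale θ, and stopping time are illustrative assumptions rather than the constants prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, group_sizes = 200, [3, 3, 3, 3]               # L = 4 groups of size 3 (illustrative)
L, p = len(group_sizes), sum(group_sizes)
bounds = np.cumsum([0] + group_sizes)

# Orthogonal design: (1/n) X_l^T X_l = I and (1/n) X_l^T X_{l'} = O.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q

w_star = np.zeros(p)                              # group-sparse truth: groups 0 and 1 are active
w_star[bounds[0]:bounds[1]] = [2.0, -1.0, 1.0]
w_star[bounds[1]:bounds[2]] = [0.5, 1.5, -0.5]
y = X @ w_star + 0.1 * rng.standard_normal(n)

theta, dt, T = 1e-2, 2e-3, 40000                  # small init, and T chosen inside the stopping window
u = theta * np.ones(L)
v = np.empty(p)
for l in range(L):                                # v_l(0) ~ (1/n) X_l^T y, rescaled so ||v_l(0)||^2 = theta^2 / 2
    g = X[:, bounds[l]:bounds[l + 1]].T @ y / n
    v[bounds[l]:bounds[l + 1]] = (theta / np.sqrt(2)) * g / np.linalg.norm(g)

for _ in range(T):                                # forward-Euler discretization of Eq. (3)
    r = y - X @ (np.repeat(u, group_sizes) ** 2 * v)
    for l in range(L):
        s = slice(bounds[l], bounds[l + 1])
        corr = X[:, s].T @ r / n
        u_old = u[l]
        u[l] += dt * 2.0 * u_old * (v[s] @ corr)  # du_l/dt = (2/n) u_l v_l^T X_l^T r
        v[s] += dt * u_old ** 2 * corr            # dv_l/dt = (1/n) u_l^2 X_l^T r

w_hat = np.repeat(u, group_sizes) ** 2 * v
print("active group magnitudes u_l^2  :", np.round(u[:2] ** 2, 3))
print("inactive group magnitudes u_l^2:", u[2:] ** 2)     # remain near theta^2 within the stopping window
print("recovery error ||w - w*||      :", np.linalg.norm(w_hat - w_star))
```

The run exhibits the incremental-learning behavior discussed above: the active group magnitudes escape the small initialization and fit their signals, while the inactive groups stay essentially at the initialization scale as long as the flow is stopped early.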
4 GRADIENT DESCENT WITH WEIGHT NORMALIZATION

Algorithm 1: Gradient descent with weight normalization
Initialize: u(0) = α·1_L, unit-norm initialization v_l(0) for each l ∈ [L], η_{l,t} = 1/u_l^4(t).
for t = 0 to T do
    z(t + 1) = v(t) − η_{l,t} ∇_v L(u(t), v(t))
    v_l(t + 1) = z_l(t + 1)/∥z_l(t + 1)∥_2, l ∈ [L]
    u(t + 1) = u(t) − γ ∇_u L(u(t), v(t + 1))
    if the early stopping criterion is satisfied then stop
end for

We now seek a more practical algorithm with more general assumptions and requirements on initialization. To speed up the presentation, we directly discuss the corresponding variant of (the more practical) gradient descent instead of gradient flow. When standard gradient descent is applied to DGLNN, the initialization of the directions is crucial; the algorithm may fail even with a very small initialization when the direction is not accurate, as shown in Appendix E. Moreover, the balancing effect (Lemma 2) is sensitive to the step size, and errors may accumulate (Du et al., 2018).

Weight normalization, a commonly used training technique, has been shown to help stabilize the training process. The identifiability of the magnitude is naturally resolved by weight normalization on each v_l. Moreover, weight normalization allows for a larger step size on v, which makes the direction estimate at each step behave like that at the origin. This removes the restrictive assumption of orthogonal design. With these intuitions in mind, we study the gradient descent algorithm with weight normalization on v summarized in Algorithm 1. One advantage of our algorithm is that it converges with any unit-norm initialization v_l(0). The step size on u(t) is chosen small enough to enable incremental learning, whereas the step size on v(t) is chosen as η_{l,t} = 1/u_l^4(t), as prescribed by our theoretical investigation. For convenience, we define
$$\zeta = 80\Big(\big\|\tfrac{1}{n} X^\top \xi\big\|_\infty \vee \epsilon\Big)$$
for a precision parameter ϵ > 0. The convergence of Algorithm 1 is formalized as follows:

Theorem 3. Fix ϵ > 0. Consider Algorithm 1 with u_l(0) = α < min{ϵ⁴/(u*_max)⁸, ϵ/(80L(u*_min)²)}, any unit-norm initialization of v_l for each l ∈ [L], and γ ≤ 1/(20(u*_max)²). Suppose Assumption 1 is satisfied with δ_in ≤ (u*_min)²/(120(u*_max)²) and δ_out ≤ (u*_min)²/(120 s (u*_max)²). There exist a lower bound on the number of iterations,
$$T_{\mathrm{lb}} = \frac{\log\frac{(u^*_{\max})^2}{2\alpha^2}}{2\log\big(1 + \frac{\gamma}{2}(\zeta \vee (u^*_{\min})^2)\big)} + \log_2\big((u^*_{\max})^2\big)\cdot\frac{5}{2\gamma(\zeta \vee (u^*_{\min})^2)},$$
and an upper bound
$$T_{\mathrm{ub}} \ge \frac{5}{16\gamma(\zeta \vee (u^*_{\min})^2)}\log\frac{1}{\alpha},$$
such that T_lb ≤ T_ub and, for any T_lb ≤ t ≤ T_ub,
$$\big\|u_l^2(t)\, v_l(t) - w_l^*\big\| \lesssim \begin{cases} \big\|\tfrac{1}{n} X^\top \xi\big\|_\infty \vee \epsilon, & \text{if } l \in S, \\ \alpha, & \text{if } l \notin S. \end{cases}$$

Similarly to Theorem 2, Theorem 3 states error bounds for the estimation of the true weights w*. When α is small, the algorithm keeps all non-supported entries close to zero throughout the iterations, while maintaining the guarantee for supported entries. Compared to works on implicit (unstructured) sparse regularization (Vaskevicius et al., 2019; Chou et al., 2021), our assumption on the incoherence parameter δ_out scales with 1/s, where s is the number of non-zero groups, instead of the total number of non-zero entries. Therefore, the relaxed bound on δ_out implies an improved sample complexity, which is also observed experimentally in Figure 4.

We now state a corollary in a common setting with independent random noise, where (asymptotic) recovery of w* is possible.

Definition 3. A random variable Y is σ-sub-Gaussian if E[e^{tY}] ≤ e^{σ²t²/2} for all t ∈ ℝ.

Corollary 1. Suppose the noise vector ξ has independent σ-sub-Gaussian entries and ϵ = σ√(log p / n). Under the assumptions of Theorem 3, Algorithm 1 produces w(t) = (Du(t))^{⊙2} ⊙ v(t) that satisfies
$$\|w(t) - w^*\|_2^2 \lesssim \frac{s\sigma^2\log p}{n}$$
with probability at least 1 − 1/(8p³), for any t such that T_lb ≤ t ≤ T_ub.
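For concreteness, here is a minimal NumPy sketch of Algorithm 1; the synthetic data, the specific values of α and γ, and the fixed iteration budget standing in for the early stopping rule are illustrative assumptions, and the gradients follow the loss L(u, v) written in Section 2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, group_sizes = 150, [3, 3, 3, 3]                  # illustrative sizes
L, p = len(group_sizes), sum(group_sizes)
bounds = np.cumsum([0] + group_sizes)
groups = [slice(bounds[l], bounds[l + 1]) for l in range(L)]

X = rng.choice([-1.0, 1.0], size=(n, p))            # Rademacher design, as in Section 5
w_star = np.zeros(p)
w_star[groups[0]] = [2.0, -1.0, 1.0]
w_star[groups[1]] = [0.5, 1.5, -0.5]
y = X @ w_star + 0.1 * rng.standard_normal(n)

def grads(u, v):
    """Gradients of L(u, v) = ||y - X[(Du)^{.2} * v]||^2 / (2n) with respect to u and v."""
    w = np.repeat(u, group_sizes) ** 2 * v
    grad_w = X.T @ (X @ w - y) / n
    grad_v = np.repeat(u, group_sizes) ** 2 * grad_w
    grad_u = np.array([2.0 * u[l] * (v[g] @ grad_w[g]) for l, g in enumerate(groups)])
    return grad_u, grad_v

alpha, gamma, T = 1e-6, 1e-2, 3000                  # small alpha; gamma below 1 / (20 (u*_max)^2)
u = alpha * np.ones(L)
v = np.concatenate([rng.standard_normal(pl) for pl in group_sizes])
for g in groups:
    v[g] /= np.linalg.norm(v[g])                    # any unit-norm initialization v_l(0)

for t in range(T):                                  # fixed budget stands in for early stopping
    _, grad_v = grads(u, v)
    for l, g in enumerate(groups):
        z = v[g] - (1.0 / u[l] ** 4) * grad_v[g]    # step size eta_{l,t} = 1 / u_l(t)^4
        v[g] = z / np.linalg.norm(z)                # weight normalization of each group direction
    grad_u, _ = grads(u, v)                         # gradient w.r.t. u at the updated directions
    u = u - gamma * grad_u

w_hat = np.repeat(u, group_sizes) ** 2 * v
print("group magnitudes u_l^2:", np.round(u ** 2, 4))   # active groups grow; inactive stay negligible
print("recovery error        :", np.linalg.norm(w_hat - w_star))
```

With these settings the two active group magnitudes grow towards approximately (u*_l)² while the inactive ones remain negligible, matching the incremental-learning behavior formalized in Theorem 3.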
Note that the error bound we obtain is minimax-optimal. Despite these appealing properties of Algorithm 1, our theoretical results require a large step size on each v_l(t), which may cause instability at later stages of learning. We observe this instability numerically (see Figure 6, Appendix E). Although the estimation error of w remains small (which aligns with our theoretical result), individual entries of v may fluctuate considerably. Indeed, the large step size is mainly introduced to maintain strong directional information extracted from the gradient of v_l(t), so as to stabilize the updates of u(t) in the early iterations. Therefore, we also propose Algorithm 2, a variant of Algorithm 1 in which we decrease the step size after a certain number of iterations.

Algorithm 2: Run Algorithm 1 with the same setup until each u_l(t), l ∈ [L], is roughly accurate; then set η_{l,t} = η and continue Algorithm 1 until the early stopping criterion is satisfied.

Theorem 4. Under the assumptions of Theorem 3, with the conditions on the δ's replaced by δ_in ≤ ζ(u*_min)²/(120(u*_max)³) and δ_out ≤ ζ(u*_min)²/(120 s (u*_max)³), apply Algorithm 2 with η_{l,t} = 1/u_l^4(t) at the beginning and η_{l,t} = η ≤ 4/(9(u*_max)²) once u_l²(t) ≥ ½(u*_l)² for all l ∈ [L]. Then, with the same T_lb and T_ub, we have that for any T_lb ≤ t ≤ T_ub,
$$\big\|u_l^2(t)\, v_l(t) - w_l^*\big\| \lesssim \begin{cases} \big\|\tfrac{1}{n} X^\top \xi\big\|_\infty \vee \epsilon, & \text{if } l \in S, \\ \alpha, & \text{if } l \notin S. \end{cases}$$

In Theorem 4, the criterion for decreasing the step size is u_l²(t) ≥ ½(u*_l)² for all l ∈ [L]. Once this criterion is satisfied, our proof ensures that it continues to hold at least up to the early stopping time T_ub specified in the theorem. In practice, since the u*_l are unknown, we can switch to a more practical criterion: max_{l∈[L]} |u_l(t+1) − u_l(t)|/|u_l(t) + ε| < τ for some pre-specified tolerance τ > 0 and small value ε > 0. The motivation for this criterion is discussed further in Appendix D. The error bound remains the same as in Theorem 3. The change in step size requires a new way to study the gradient dynamics of the directions with perturbations. With our proof technique, Theorem 4 requires a smaller bound on the δ's (see Lemma 16 versus Lemma 8 in Appendix C for details). We believe this is a proof artifact and leave the improvement for future work.

Connection to standard sparsity. Consider the degenerate case where each group has size 1. Our reparameterization, together with the normalization step, can roughly be interpreted as w_i ≈ u_i² sgn(v_i), which is different from the power reparameterization w_i = u_i^N − v_i^N, N ≥ 2, of Vaskevicius et al. (2019) and Li et al. (2021). This also shows why a large step size on v_i is needed at the beginning: if the initialization of v_i is incorrect, the sign of v_i may not flip under a small step size.
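The practical switching rule described after Theorem 4 can be isolated as a small helper; the tolerance τ and smoothing constant ε below are illustrative values, and the function name is ours rather than the paper's.

```python
import numpy as np

def should_switch(u_prev, u_curr, tau=1e-3, eps=1e-8):
    """Practical criterion from Section 4: switch eta_{l,t} from 1/u_l^4 to the small
    constant step size once max_l |u_l(t+1) - u_l(t)| / |u_l(t) + eps| < tau."""
    return np.max(np.abs(u_curr - u_prev) / np.abs(u_prev + eps)) < tau

# Group magnitudes still growing -> keep the large, adaptive step size on v.
print(should_switch(np.array([0.5, 1.2, 1e-6]), np.array([0.8, 1.5, 1e-6])))          # False
# Group magnitudes have stabilized -> decrease the step size and run to the stopping time.
print(should_switch(np.array([1.56, 1.29, 1e-6]), np.array([1.5601, 1.2901, 1e-6])))  # True
```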
5 SIMULATION STUDIES

We conduct various experiments on simulated data to support our theory. Following the model in Section 2, we sample the entries of X i.i.d. as Rademacher random variables and the entries of the noise vector ξ i.i.d. from N(0, σ²). We set σ = 0.5 throughout the experiments.

Figure 2: Convergence of Algorithm 1. The entries on the support are all 10. (Left: recovery error ∥w(t) − w*∥_2; right: recovered group magnitudes, including max_{l∉S} u_l(t).)

Figure 3: Convergence of Algorithm 2. The entries on the support range from 5 to 13. (Panels, left to right: recovered entries, recovered group magnitudes, and recovered group directions.)

The effectiveness of our algorithms. We start by demonstrating the convergence of the two proposed algorithms. In this experiment, we set n = 150 and p = 300. The number of non-zero entries is 9, divided into 3 groups of size 3. We run both Algorithms 1 and 2 with the same initialization α = 10⁻⁶. The step size γ on u and the decreased step size η on v are both 10⁻³. In Figure 2, we present the recovery error of w* on the left and the recovered group magnitudes on the right. As we can see, early stopping is crucial for reaching the structured sparse solution. In Figure 3, we present the recovered entries, recovered group magnitudes, and recovered directions for each group, from left to right. In addition to convergence, we also observe an incremental learning effect.

Figure 4: Comparison with the reparameterization using standard sparsity; n = 100, p = 500. (Left: recovered entries using group sparsity; right: recovered entries using standard sparsity.)

Structured sparsity versus standard sparsity. From our theory, we see that the block incoherence parameter scales with the number of non-zero groups, as opposed to the number of non-zero entries. As such, we can expect an improved sample complexity over estimators based on unstructured sparse regularization. We choose a larger support size of 16, with all entries on the support equal to 1 for simplicity. We apply Algorithm 2 with group size 4. The result is shown in Figure 4 (left). We compare with the method of Vaskevicius et al. (2019), with parameterization w = u^{⊙2} − v^{⊙2}, designed for unstructured sparsity. We display the result in the right panel; interestingly, that algorithm fails to converge because of an insufficient number of samples.

Figure 5: Degenerate case when each group size is 1. (Left: recovered entries; right: recovery error log₂∥w(t) − w*∥₂².) The log ℓ2-error experiment is repeated 30 times, and the mean is depicted; the shaded area indicates the region between the 25th and 75th percentiles.

Degenerate case. In the degenerate case where each group has size 1, our reparameterization takes the simpler form w_i ≈ u_i² sgn(v_i); that is, due to weight normalization, our method normalizes each v_i to 1 or −1 after every step. We demonstrate the efficacy of our algorithms even in this degenerate case. We set n = 80 and p = 200. The entries on the support are ±1, with both positive and negative entries. We present the coordinate plot and the recovery error in Figure 5.

6 DISCUSSION

In this paper, we show that implicit regularization towards group-structured sparsity can be obtained by gradient descent (with weight normalization) for a certain, specially designed network architecture. Overall, we hope that such analysis further enhances our understanding of neural network training. Future work includes relaxing the assumptions on the δ's in Theorem 2, and a rigorous analysis of modern grouping architectures as well as power parameterizations.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under grants CCF-1934904, CCF-1815101, and CCF-2005804.

REFERENCES

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pp. 244–253. PMLR, 2018.
Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.

Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.

Raphaël Berthier. Incremental learning in diagonal linear networks. arXiv preprint arXiv:2208.14673, 2022.

Iain Carmichael, Thomas Keefe, Naomi Giertych, and Jonathan P Williams. yaglm: a python package for fitting and tuning generalized linear models that supports structured, adaptive and non-convex penalties. arXiv preprint arXiv:2110.05567, 2021.

Hung-Hsu Chou, Johannes Maly, and Holger Rauhut. More is less: Inducing sparsity via overparameterization. arXiv preprint arXiv:2112.11027, 2021.

Zhen Dai, Mina Karzand, and Nathan Srebro. Representation costs of linear neural networks: Analysis and design. Advances in Neural Information Processing Systems, 34, 2021.

Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems, 31, 2018.

Yonina C Eldar and Helmut Bolcskei. Block-sparsity: Coherence and efficient recovery. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2885–2888. IEEE, 2009.

Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. arXiv preprint arXiv:1909.12051, 2019.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pp. 1832–1841. PMLR, 2018a.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018b.

Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 31, 2018c.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. PMLR, 2016.

Meena Jagadeesan, Ilya Razenshteyn, and Suriya Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. arXiv preprint arXiv:2102.12238, 2021.

Hui Jin and Guido Montúfar. Implicit bias of gradient descent for mean squared error regression with wide neural networks. arXiv preprint arXiv:2006.07356, 2020.

Li Jing, Jure Zbontar, et al. Implicit rank-minimizing autoencoder. Advances in Neural Information Processing Systems, 33:14736–14746, 2020.

Tae Kwan Lee, Wissam J Baddar, Seong Tae Kim, and Yong Man Ro. Convolution with logarithmic filter groups for efficient shallow CNN. In International Conference on Multimedia Modeling, pp. 117–129. Springer, 2018.

Jiangyuan Li, Thanh Nguyen, Chinmay Hegde, and Raymond K. W. Wong. Implicit sparse regularization: The impact of depth and early stopping. Advances in Neural Information Processing Systems, 34, 2021.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pp. 2–47. PMLR, 2018.

Zhiyuan Li, Tianhao Wang, Jason D Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. arXiv preprint arXiv:2207.04036, 2022.

Cesare Molinari, Mathurin Massias, Lorenzo Rosasco, and Silvia Villa. Iterative regularization for convex regularizers. In International Conference on Artificial Intelligence and Statistics, pp. 1684–1692. PMLR, 2021.

Depen Morwani and Harish G Ramaswamy. Inductive bias of gradient descent for weight normalized smooth homogeneous neural nets. In International Conference on Algorithmic Learning Theory, pp. 827–880. PMLR, 2022.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420–3428. PMLR, 2019.

Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.

Mert Pilanci and Tolga Ergen. Neural networks are convex regularizers: Exact polynomial-time convex optimization formulations for two-layer networks. In International Conference on Machine Learning, pp. 7695–7705. PMLR, 2020.

Arda Sahiner, Tolga Ergen, John Pauly, and Mert Pilanci. Vector-output ReLU neural network problems are copositive programs: Convex analysis of two layer networks and polynomial-time algorithms. arXiv preprint arXiv:2012.13329, 2020.

Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.

Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun, Kevin L Knudtson, Anne M Dorrance, Gerald F Di Bona, Jian Huang, Thomas L Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.

Jonathan Schwarz, Siddhant Jayakumar, Razvan Pascanu, Peter Latham, and Yee Teh. Powerpropagation: A sparsity inducing weight reparameterisation. Advances in Neural Information Processing Systems, 34, 2021.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Mihailo Stojnic, Farzad Parvaresh, and Babak Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing, 57(8):3075–3085, 2009.

Twan Van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.

Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pp. 2972–2983, 2019.

Xiang Wang, Chenwei Wu, Jason D Lee, Tengyu Ma, and Rong Ge. Beyond lazy training for over-parameterized tensor decomposition. Advances in Neural Information Processing Systems, 33:21934–21944, 2020.
Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.

Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pp. 3635–3673. PMLR, 2020.

Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, and Qiang Liu. Implicit regularization and convergence for weight normalization. Advances in Neural Information Processing Systems, 33:2835–2847, 2020.

Xuan Wu, Zhijie Zhang, Wanchang Zhang, Yaning Yi, Chuanrong Zhang, and Qiang Xu. A convolutional neural network based on grouping structure for scene classification. Remote Sensing, 13(13):2457, 2021.

Gangcai Xie, Chengliang Dong, Yinfei Kong, Jiang F Zhong, Mingyao Li, and Kai Wang. Group lasso regularized deep learning for cancer prognosis from multi-omics and clinical features. Genes, 10(3):240, 2019.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500, 2017.

Yi Yang and Hui Zou. A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing, 25:1129–1141, 2015.

Peng Zhao, Yun Yang, and Qiao-Chu He. Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367, 2019.