# mask_in_the_mirror_implicit_sparsification__2fd52089.pdf

Published as a conference paper at ICLR 2025

MASK IN THE MIRROR: IMPLICIT SPARSIFICATION

Tom Jacobs CISPA Helmholtz Center for Information Security tom.jacobs@cispa.de

Rebekka Burkholz CISPA Helmholtz Center for Information Security burkholz@cispa.de

Continuous sparsification strategies are among the most effective methods for reducing the inference costs and memory demands of large-scale neural networks. A key factor in their success is the implicit L1 regularization induced by jointly learning both mask and weight variables, which has been shown experimentally to outperform explicit L1 regularization. We provide a theoretical explanation for this observation by analyzing the learning dynamics, revealing that early continuous sparsification is governed by an implicit L2 regularization that gradually transitions to an L1 penalty over time. Leveraging this insight, we propose a method to dynamically control the strength of this implicit bias. Through an extension of the mirror flow framework, we establish convergence and optimality guarantees in the context of underdetermined linear regression. Our theoretical findings may be of independent interest, as we demonstrate how to enter the rich regime and show that the implicit bias can be controlled via a time-dependent Bregman potential. To validate these insights, we introduce PILo T, a continuous sparsification approach with novel initialization and dynamic regularization, which consistently outperforms baselines in standard experiments.

1 INTRODUCTION

Deep learning continues to impress across disciplines ranging from language and vision (Ramesh et al., 2022) to drug design (Stephenson et al., 2019; Jumper et al., 2021) and even fast matrix multiplication (Fawzi et al., 2022). These accomplishments come at immense costs, as they rely on increasingly large neural network models. Moreover, training such massive models with first-order methods like variants of Stochastic Gradient Descent (SGD) is a considerable challenge and often requires large-scale compute infrastructure (Kaack et al., 2022). Even higher costs are incurred at inference time, if the trained models are frequently evaluated (Wu et al., 2022; Luccioni et al., 2023).

Sparsifying such neural network models is thus a pressing objective. It not only holds the promise to save computational resources, it can also improve generalization (Frankle & Carbin, 2019; Paul et al., 2023), interpretability (Chen et al., 2022; Hossain et al., 2024), denoising (Jin et al., 2022; Wang et al., 2023), and verifiability (Narodytska et al., 2020; Albarghouthi, 2021). However, at its core is a hard large-scale nested optimization problem combining multiple objectives. In addition to minimizing a typical neural network loss minw Rn f(w) (and its generalization error), we wish to rely on the smallest possible number of weights, effectively minimizing the L0 norm minw Rn ||w||L0. This is an NP-hard problem that is also practically hard to solve due to its mixed discrete and continuous nature. This becomes more apparent when we reformulate it like the best performing sparsification methods.

Among such approaches that achieve high sparsity while maintaining high generalization performance are continuous sparsification methods and iterative pruning strategies. These methods explicitly identify for each weight parameter w of a neural network a binary mask m {0, 1} that signifies whether a parameter is pruned. Thus, a parameter is set to zero (m = 0) or not (m = 1), effectively parameterizing the network with parameters x = m w, where is the pointwise mutliplication (Hadamard product). The introduction of the additional mask parameters m turns the sparsity objective into a discrete L1 penalty of m as ||w m||L0 = P

i mi subject to mi {0, 1}, where denotes elementwise multiplication and we assume that w = 0 if m = 1. The L1 objective is already more amenable to continuous optimization than the original L0 objective (Louizos et al., 2018). Nevertheless, the big challenge arises from the fact that m is binary.

Published as a conference paper at ICLR 2025

Continuous sparsification addresses this issue by relaxing the optimization problem to continuous or even differentiable variables m, often with m [0, 1] by learning a parameterization m = g(s) with g: R [0, 1], e.g., like a sigmoid. This way, the problem becomes solvable with standard first-order optimization methods. Yet, moving from the continuous space back to the discrete space is error-prone. Regularizing and projecting m towards binary values {0, 1} is generally problematic, requires careful tuning, and often entails robustness issues.

However, this projection step is not necessarily required in a parameterization m w, where m can freely attain values in R. Ziyin & Wang (2023) show that a loss with this parameterization m w combined with weight decay α ||m||2 L2 + ||w||2 L2 is equivalent to LASSO and thus an explicit L1regularization (Ziyin, 2023). Yet, their proposed method spred significantly outperforms LASSO, which suggests that the posed equivalence of optimized objectives cannot explain this success.

As we show, an important, but overlooked, distinguishing factor is the fact that continuous sparsification with m w parameterization is driven by an implicit rather than an explicit regularization. This implies that in a sufficiently overparameterized setting, the following hierarchical optimization problem is solved: min x Rn:f(x)=0 ||x||L1. (1)

The main idea follows the same philosophy as the lottery ticket hypothesis (Frankle & Carbin, 2019). From a range of models which all attain optimal training loss, it prefers the sparsest model. In other words, we aim to find a subnetwork with a similar accuracy as a dense network. Instead of having two competing objectives, we enforce cooperation between the two objectives by subjugating the sparsification. While this formulation can have advantages, if we can attain zero training loss (f(x) = 0), it might still be equivalent to an explicit regularization.

The real potential of continuous sparsification and its induced implicit bias (in particular in nonconvex settings) becomes apparent when we study the corresponding learning dynamics. Our theoretical analysis of (stochastic) gradient flow applied to m w reveals that the dynamics differ fundamentally from the ones of LASSO, where redundant features are sparsified exponentially fast. Instead, we show by extending the mirror flow framework that a dynamic explicit weight decay can move the implicit bias from L2 to L1 during training. In consequence, the sparsification becomes effective only relatively late during training, allowing the overparameterized model to first attain high generalization performance. The dynamics resemble thus a successful strategy that is applied across continuous and iterative pruning methods, which all acknowledge and realize the premise that training overparameterized models before they are sparsified usually leads to significant performance benefits (Frankle & Carbin, 2018; Paul et al., 2023; Gadhikar & Burkholz, 2024).

Our analysis extends the implicit bias framework, which covers different parameterizations (Pesme et al., 2021; Gunasekar et al., 2020; Woodworth et al., 2020; Li et al., 2022) and the well-known fact that standard gradient flow has an implicit L2 bias (Nemirovski & Yudin, 1983; Beck & Teboulle, 2003). Another common use of the framework has been to study when training dynamics enter the so-called rich regime, which is responsible for improved feature learning. In this context, we study two main innovations. a) As we show, the explicit regularization (i.e. weight decay on m and w) guides the strength of the implicit bias. The fact that this makes the implicit bias tuneable makes it practically relevant for sparsification, as we have to be able to reach a target sparsity. b) By proposing a dynamic regularization (rather than a common static one), we obtain control over the transition speed from L2 to L1 regularization, which is crucial for performance gains in the high sparsity regime and enables us to enter the rich regime. From a conceptual point of view, we unite explicit and implicit bias within a time-dependent Bregman potential, which is potentially of independent theoretical interest.

While our general derivations provide insights into various continuous sparsification approaches, including STR (Kusupati et al., 2020), spred Ziyin & Wang (2023), or (Savarese et al., 2021), we also utilize them to propose a new improved algorithm, PILo T (Parametric Implicit Lottery Ticket). PILo T combines the m w parameterization with a dynamic regularization and an initialization that enables sign flips. Such sign flips are key to effective sparse training (Gadhikar & Burkholz, 2024), but are not feasible with the spred initialization. The dynamic regularization and thus implicit bias leads us to outperform state-of-the-art baselines in particular in the high-sparsity regime, as we demonstrate in extensive experiments.

In summary, we make the following contributions:

Published as a conference paper at ICLR 2025

We gain novel insights into continuous sparsification by highlighting its implicit bias towards sparsity induced by doubling the number of trainable parameters. In particular, we explain the effectiveness of spred (Ziyin & Wang, 2023), which is based on m w.

To the best of our knowledge, we are the first to introduce the implicit bias with an explicit regularization resulting in a mirror flow with a time-dependent Bregman potential.

We provide convergence results for (quasi)-convex loss functions (Theorem 2.2) and optimality for underdetermined linear regression (Theorem 2.3) with time-dependent Bregman potential.

Improving results by (Alvarez et al., 2004; Li et al., 2022), we replace convexity with the Polyak-Łojasiewicz inequality, quasi-convexity and a growth condition on the Bregman potential (see Theorem A.3).

Using our extensions of the mirror flow framework, we propose a new continuous sparsification method, PILo T, which controls the implicit regularization dynamically moving from L2 to L1. Its initialization enables sign flips in contrast to spred.

In experiments for diagonal linear networks and vision benchmarks (including Image Net), PILo T consistently outperforms baseline sparsification methods such as STR and spred, which demonstrates the utility of our theoretical insights.

1.1 RELATED WORK

Neural network sparsification. A multitude of neural network sparsification methods have been proposed with different objectives (Liu & Wang, 2023). A popular objective is, for instance, to save computational and memory costs primarily at inference, or also during training, which is linked to the time of pruning, i.e., initially (Frankle et al., 2021; Lee et al., 2019; Tanaka et al., 2020; Wang et al., 2020; Pham et al., 2023; Patil & Dovrolis, 2021; Tanaka et al., 2020; Liu et al., 2021; Gadhikar et al., 2023; Fischer & Burkholz, 2021), early during training (Evci et al., 2020; Dettmers & Zettlemoyer, 2019), during training like continuous sparsification (Sreenivasan et al., 2022; Kusupati et al., 2020; Savarese et al., 2021; Peste et al., 2021) or within multiple pruning-training iterations (Han et al., 2015; Frankle & Carbin, 2018; You et al., 2020; Renda et al., 2020; Gadhikar & Burkholz, 2024). Other distinguishing factors are which type of sparsity the methods seek, if they focus on saving computational resources and memory in specific resource-constrained environments or, which methodological approach they follow.

Unstructured sparsity. In this work, we focus on unstructured sparsity, i.e., the fraction of pruned weights, and thus seek to remove as many weight entries as possible, which can achieve generally the highest sparsity ratios while maintaining high generalization performance. Structured sparsity, which usually obtains higher computational gains on modern GPUs (Kuzmin et al., 2019; Wen et al., 2016; Lasby et al., 2023), could also be realized in the continuous sparsification setting, for instance, by learning neuron-, group, or even layer-wise masks. Yet, this would not enjoy the same theoretical benefits as we derive here by showing that the unstructured continuous m w parameterization induces a mirror flow, whereas for example the neuron-wise mask does not (see Section D).

Iterative pruning. Iterative pruning is often motivated by the Lottery Ticket Hypothesis (LTH) (Frankle & Carbin, 2018), which conjectures the existence of sparse subnetworks of larger dense source networks that can achieve the same accuracy as the dense network when trained (Frankle et al., 2021; Liu et al., 2024; Malach et al., 2020; Orseau et al., 2020; Pensia et al., 2020; Burkholz et al., 2022; Fischer & Burkholz, 2021; Burkholz, 2022a;b; da Cunha et al., 2022; Ferbach et al., 2022). In addition to the sparse structure, iterative pruning tries to identify a trainable parameter initialization, indirectly also implementing an approximate L0-regularization. In repeated prunetrain iterations, trained weights are thresholded according to an importance score like magnitude. Afterward, the remaining parameters are free to adapt to data in a new training run and not be regularized by a sparsity penalty (like L1). Our proposal PILo T can be combined with such iterative schemes. The experiments show that it boosts performance of state-of-the-art schemes like Weight Rewinding (WR) (Frankle et al., 2019), and Learning Rate Rewinding (LRR) (Maene et al., 2021).

Continuous sparsification. Continuous sparsification characterizes a collection of methods that can compete with iterative pruning techniques, while often requiring fewer training epochs (Sreenivasan et al., 2022; Kusupati et al., 2020). In one of the first proposals by (Savarese et al., 2021), the mask

Published as a conference paper at ICLR 2025

is relaxed to a continuous variable. In general, continuous sparsification can be combined with a probabilistic approach where m is interpreted as a probability (Louizos et al., 2018; Zhou et al., 2021a;b). Other parameterizations of m that are not restricted to the range [0, 1] (e.g. Powerpropagation) can also be utilized to regularize towards higher sparsity (Schwarz et al., 2021). Yet, they are usually combined with projection to map m to a binary mask. The spred algorithm (Ziyin & Wang, 2023) removes any projections and shows that m w with weight decay solve a LASSO objective. To explain its performance gain over LASSO, we extend the mirror flow framework and find an explanation in the training dynamics. Our extension, PILo T, dynamically adjusts the weight decay and induces a transition from an implicit L2 to L1 regularization. This enables it to outperform the state-of-the-art method STR (Kusupati et al., 2020) in the high-sparsity regime. For a survey of other methods see (Kuznedelev et al., 2023).

Implicit bias. The implicit bias of (S)GD is a well-studied phenomenon (Chizat & Bach, 2020; Li et al., 2022; Woodworth et al., 2020; Gunasekar et al., 2020; 2017; Chou et al., 2024; Vaˇskeviˇcius et al., 2019; Li et al., 2023; Zhao et al., 2022; Li et al., 2021) and can in certain cases be described by a mirror flow or mirror descent (in the discrete case with finite learning rate) (Li et al., 2022). Originally, mirror descent was proposed to generalize gradient descent and other first-order methods in convex optimization (Alvarez et al., 2004; Rockafellar & Fenchel, 1970; Boyd & Vandenberghe, 2009; Nemirovski & Yudin, 1983; Beck & Teboulle, 2003). Moreover, it has been used to study the implicit regularization of SGD in diagonal linear networks (Pesme et al., 2021; Even et al., 2023). More recently, it also has been applied to analyze the implicit bias of attention (Sheen et al., 2024). While (Li et al., 2022) has shown that different parameterizations have a corresponding mirror flow, we find that m w with our proposed explicit regularization, PILo T, gives rise to a corresponding time-dependent mirror flow. Its time dependence gives us means to control the implicit bias, while still achieving convergence. Time-dependent mirror descent has so far only been studied in the discrete case as a general possibility (Radhakrishnan et al., 2021). The time dependence also naturally arises in SDE modelling, yet, without control of the implicit bias (Pesme et al., 2021; Even et al., 2023). Here, we not only highlight a practical use case for time-dependent Bregman potentials, we also derive a way to control and exploit it.

Optimization and convergence proofs. Loss landscapes and the convergence of first-order methods is a large field of study (Karimi et al., 2016; Fehrman et al., 2019) in its own right. We draw on literature that shows convergence by using the Polyak-Łojasiewicz inequality (Wojtowytsch, 2021; Dereich & Kassing, 2024), which is a more realistic assumption in the deep learning context than, for example convexity, because it can hold locally true for non-convex loss functions that are common in machine learning.

2 CONTROLLING THE IMPLICIT BIAS WITH EXPLICIT REGULARIZATION

Structure of theoretical exposition. Our first goal is to advance the mirror flow framework from (Li et al., 2022) to incorporate time-dependent regularization. This is a key innovation that forms the foundation of our proposed PILo T algorithm (Algorithm 1). The dynamical description also covers constant regularization, as implemented in the spred algorithm. We begin by integrating time-dependent regularization in the mirror flow framework in the case of the parameterization m w combined with time-varying weight decay. This corresponds to a time-dependent Bregman potential, enabling a more dynamic and powerful form of implicit regularization. The implicit regularization becomes controllable and moves from an L2 to an L1 regularization. Building on this, Theorem 2.1 rigorously characterizes this process within this extended framework, offering new insights into the sparsification process. Then Theorem 2.2 establishes convergence to a solution of the original optimization problem. Sparsity is still attained according to Theorem 2.1, which can also be observed in the gradient flow in Eq. (6) of PILo T. For diagonal linear networks, which is an analytically tractable setting, we also prove optimality, as stated by Theorem 2.3. This highlights a mechanism how our method PILo T improves over spred, since spred cannot reach optimality.

Optimization problem. Consider the following time-dependent optimization problem for a loss function f : Rn R:

min m,w Rn f(m w) + αt ||m||2 L2 + ||w||2 L2 . (2)

Published as a conference paper at ICLR 2025

where αt 0 can change during training. In contrast, (Ziyin & Wang, 2023) set αt = α constant and show that Eq. (2) is equivalent to the LASSO objective. Why does spred tend to outperform LASSO then?

Seeking answers in the training dynamics. The gradient flow associated with minimizing the continuously differentiable loss function f is: dxt = f(xt)dt, x0 = xinit. Using this gradient flow framework, (Li et al., 2022) show that a reparameterization or overparameterization of the parameters x leads to a mirror flow. A mirror flow informally minimizes a potential in the background, for example, the L1 or L2-norm. In contrast, explicit regularization forces a direct trade-off. The no need for a trade-off becomes clear in the convergence and optimality theorem.

Mirror flow. Concretely, to define a mirror flow, let R : Rn R be a differentiable function. It is described by d R(xt) = f(xt)dt, x0 = xinit. (Li et al., 2022) provide sufficient conditions for a paramerization g: M Rn to induce a mirror flow, where M is a smooth sub-manifold in RD for D n. The parameterization m w falls into this category. The corresponding potential R is either close to the L1 or L2 norm depending on the initialization of m0 and w0. If we could steer it towards L1, we could therefore induce an implicit regularization towards sparsity, yet, not without issues.

Two caveats and their solution. a) The potential R attains its global minimum at the initial x0 (and not 0). This also holds for other reparameterizations (Li et al., 2022). In consequence, we would not promote actual sparsity for x0 = 0. b) To induce L1 regularization and enter the rich regime, the initialization of both m0 and w0 would need to be exponentially small (Woodworth et al., 2020).

The explicit dynamic regularization of PILo T in Eq. (2) solves both of these problems, as we show next. The corresponding mirror function R is stated in Theorem A.1 and the corresponding convergence and optimality theorems in Theorem A.2 and A.4.

Dynamic regularization. Next we present the main result, the dynamical description with timedependent regularization. The exact dynamics are described by the time-dependent mirror flow and is derived in Theorem 2.1.

Theorem 2.1 Let |w0,i| < m0,i for all i [n], the time-dependent Bregman potential is given by

i=1 xiarcsinh xi

x2 i + a2 t,i xilog u0,i

with at,i = 2u0,iv0,iexp 2 R t 0 αsds and u0,i = m0,i+w0,i

2 and v0,i = m0,i w0,i

2 . The gradient flow of xt = mt wt induced by Eq. (2) then satisfies

d Rat(xt) = f(xt)dt, x0 = m0 w0.

Proof. The proof is given in the appendix. The main steps are: a) Deriving the evolution of the gradient flow (Lemma B.1). b) Showing that it satisfies the time-dependent mirror flow (Lemma B.2). Note that step a) also derives Eq. (6).

Observe that the potential in Eq. (3) now depends on at. This changes the global minimum to

Rat(x) = 0 x = exp 2 Z t

0 αsds m0 w0.

Thus, we gain control over the positional implicit bias, solving our problem with the nonzero global minimum. Next, we characterize the asymptotic behavior, which we control in practice with αt, which determines also at. The asymptotics follows from Theorem 2 in (Woodworth et al., 2020). For a 0 and | x

a| , we receive

Ra(x) log 1

Interestingly, the term xilog u0,i v0,i

does not play a role in the asymptotics. The reason is that

a) in front of the other term dominates. Figure 1 illustrates the asymptotics for the one dimensional case. We observe that our two previously identified mirror flow caveats can be resolved: a)

Published as a conference paper at ICLR 2025

Increasing a moves the minimum to the origin, leading to an L1 regularization. This implies initializing at zero with an exponentially small scaling (i.e. a 0) is not necessary. b) Moreover, the regularization αt has an exponential and time-dependent effect on at. It thus enables steering the dynamics towards an L1 regularization at the desired speed. In conclusion, our extension and novel analysis have revealed how explicit regularization solves the problems of the standard mirror flow framework.

Remark 2.1 At the end of LASSO training, the regularization can be turned off to enable the search for a better solution. However, this risks losing the benefits of the regularization, for example, if the basin of attraction contains non-sparse critical points. In contrast, for the m w reparameterization, the time-dependent Bregman potential steers (but does not force) the bias towards sparsity.

x 1 0 1 2 3 0 2 4

Figure 1: Evolution of the timedependent Bregman potential. α = R t 0 αsds is the exponent of at.

Convergence and optimality. It remains to be shown that convergence and optimality results transfer from the mirror flow framework to the time-dependent mirror flow framework. For quasi-convex loss functions, we prove convergence to a critical point. For convex or quasi-convex functions that satisfy the PL-inequality, we derive convergence to a minimizer.

Theorem 2.2 Assume f is quasi-convex, f is locally Lipschitz and argmin{f(x)|x Rn} is non-empty. Assume αt 0 for all t 0 and that R t 0 αsds < for t [0, ) { }. Then as t , xt converges to some critical point x . Furthermore, if f is either convex or both quasi-convex and satisfies the PL-inequality in Eq. (9). Then xt convergences to an interpolator x that is a minimizer of f. Furthermore, in the PL-inequality case, the loss converges linearly such that there is a constant C > 0 such that

f(xt) f(x ) Bexp ( λA t) , (4)

where B = (f(x0) f(x )) exp C||x ||L2 R 0 αsds with C depending on the smoothness of the loss function and A = mini m2 0,i w2 0,i exp 2 R 0 αsds .

Proof. The main steps of the proof are to show that a) the iterates are bounded and converge to a critical point (Lemma B.3); b) the loss converges (Theorem B.1). A noteworthy tool is a timedependent Bregman divergence, which we use to bound the iterates. Furthermore, we utilize that αt 0 converges to zero, resulting eventually in a non-increasing evolution of the loss.

Potential drawbacks. Theorem 2.2 guarantees convergence in the case of implicit regularization. Explicit L1 regularization or spred, on the other hand, cannot achieve the same result due to constant regularization, which we will also highlight in experiments. Regardless, note that the constant B in Eq. (4) could be large. Furthermore, to reach the implicit L1-regularization, at = m2 0 w2 0 exp 2 R t 0 αsds needs to be exponentially small similarly as in Theorem 2 in (Woodworth et al., 2020). These two potential drawbacks also reveal where the method will work, namely, in overparameterized settings where the solution x should have less active parameters. Then B is potentially relatively small.

Remark 2.2 If f is one-sided inversely Lipschitz, a speed-up is possible. The quantity that needs to be bounded for convergence is f(xt) xt. In this case, we get

f(xt) xt f(xt) x ||xt x ||2 L2 C||x ||L2 ||xt x ||2 L2,

where C is the bound on the smoothness of the loss function f. This implies that when the interpolator x 0 is small, the right-hand side is negative, leading to a speed-up. This condition is also known as coercive.

Optimality. The main assumption in Theorem 2.2 is αt 0, which ensures convergence to a minimizer of the original problem, while retaining sparsity depending on at a . In the case of diagonal linear networks, we can even prove optimality with respect to the final Bregman potential Ra .

Published as a conference paper at ICLR 2025

Theorem 2.3 In case of under-determined regression consider the loss function f(x) = f(Zx Y ). Assume f satisfies the conditions with at least one of the convergence criteria of Theorem 2.2. Then xt converges to x such that x = argmin Zx=Y Ra (x). (5)

Proof. We show that the KKT conditions of Problem (5) are satisfied (Theorem B.2).

Take away. We have shown that our extended mirror flow framework can attain convergence and optimality in an analytically tractable scenario. Furthermore, from a practical side, it also gives us new tools to derive more promising continuous sparsification techniques with implicit regularization. In the following section, we propose a way to dynamically control the transition from implicit L2 regularization to L1 during training with the help of the derived time-dependent Bregman potential.

3 THE ALGORITHM: PILOT

Like spred, our new algorithm PILo T (Algorithm 1) utilizes the parameterization m w, but proposes a novel initialization and dynamic regularization schedule to control the transition from implicit L2 to L1 regularization. To attain the desired results in the original parameterization x, we first derive its gradient flow.

Gradient flow. Essentially, the gradient flow follows from the analysis in Section 2. According to Theorem 2.2, we guarantee convergence by ensuring αt 0. Concretely, the gradient flow for xt = mt wt induced by Eq. (2) is given by:

x2 t + a2 t

f(xt) + 2αt xt p

x2 t + a2 t

dt, x0 = xinit, (6)

where at = m2 0 w2 0 exp 2 R t 0 αsds . m0 and w0 have to be initialized such that m0 w0 = xinit. Note that all operations are point-wise. The derivation is based on the time-dependent mirror flow in Section 2.

Remark 3.1 The gradient flow in Eq. (6) allows us to make a direct comparison to the continuous sparsification method STR (Kusupati et al., 2020). Instead of the soft thresholding operator, we have p

x2 t + a2 t. The main difference is that STR does not change the magnitude of the gradient update outside of the (learnable) threshold, while both PILo T and spred actively change the magnitude depending on the magnitude of the weight. This active sparsification explains why spred and also PILo T can perform better in the high-sparsity regime.

Spred. The gradient flow in Eq. (6) explains why spred performs better than LASSO and highlights where spred can further be improved. Note that the balanced initialization of spred is defined such that m2 0 w2 0 = 0 and the regularization is constant αt = α. Plugging this into Eq. (6) gives

x2 t ( f(xt) + 2αsign (xt)) dt, x0 = xinit. (7)

Compare this with the gradient flow of LASSO with regularization strength 2α:

dxt = ( f(xt) + 2αsign (xt)) dt, x0 = xinit.

We observe that the main difference to the gradient flow of LASSO is the factor p

x2 t. This implies the considerable drawback that spred gradient flows cannot sign flip. Therefore, it cannot reach the optimal solution or specific minimizers potentially. Another way to see this is studying the evolution xt = x0 exp 4sign (x0) R t 0 f(xs)ds 2αt satisfying Eq. (7). In practice, the absence of sign flips might be remedied by using a large learning rate and noise. The evolution also explains why it can perform better than LASSO, as it decays redundant parameters exponentially faster instead of linearly. In other words, the gradient update is proportional to the magnitude of the parameter. Therefore, the evolution of spred (Eq. (7)) can converge faster and come closer to zero than LASSO.

PILo T. Our main goal is to remedy the caveats of the spred by inducing the more general gradient flow in Eq. (6). The first improvement is to enable sign flips by changing the initialization to m2 0

Published as a conference paper at ICLR 2025

w2 0 = β > 0, where β denotes the scaling constant. After discretizing Eq. (6), the effective learning rate at initialization x0 = 0 is η|β|, where η > 0 is the learning rate. Therefore, we use β = 1 in the experiments so that the learning rate η is not altered.

Our second and main improvement is induced by the time dependence of αt. αt controls the strength of both the implicit and explicit regularization via at. If at >> xt, then the regularization term in Eq. (6) resembles an L2 instead of an L1 norm. Therefore decreasing at moves the implicit regularization from L2 to L1. Accordingly, we sparsify only mildly in early training epochs in contrast to spred. We have shown this formally in Section 2. Furthermore, convergence is covered by Theorem 2.2 when αt 0. Even when PILo T attains a similar sparsity as spred at the end of the training dynamics, it can usually still achieve a higher accuracy due to its improved training dynamics.

Details on PILo T. The described design choices define Algorithm 1. Our update of the regularization strength αk depends on three quantities: a) the sparsity threshold K for the weights (a hyperparameter), b) the training accuracy, and c) δ 1, the multiplicative factor to gradually increase or decrease the regularization strength. The regularization strength (and thus sparsity) grows if the sparsity threshold has not been reached yet and the training accuracy has increased in the previous gradient update step. As the strength is adaptive, the algorithm is less sensitive to the initial strength α0. Note that the setting δ = 1 and β = 0 corresponds to spred. Therefore, PILo T is a strict generalization of spred. In contrast to spred, however, after half of the training epochs, we decay the regularization strength regardless of whether the sparsity threshold K is reached. This guarantees convergence of the corresponding gradient flow, in accordance with Theorem 2.2.

Algorithm 1 PILo T

Require: epochs T, schedule αinit, initialization xinit, scaling constant β

Initialize m0, w0 such that m0 w0 = xinit, m2 0 w2 0 = β, δ 1 and, K α0 αinit Current training acc 0 Set f(m, w, α0) := f(m w) + α0 ||m||2 L2 + ||w||2 L2

for k in 1 . . . T do

(mk, wk) = Optimizer Step f(mk 1, wk 1, αk 1)

if Training acc Current training acc and ||mk wk||L1 K and k T

2 then αk αk 1δ else

αk αk 1/δ end if Current training acc Training acc end for return Model f(x T ) with x T = m T w T

4 EXPERIMENTS

We demonstrate the effectiveness of PILo T in extensive experiments covering three different scenarios. Firstly, we confirm our theoretical results on the gradient flow in Theorem 2.3. Secondly, we compare PILo T with other state-of-the-art continuous sparsification methods such as STR (Kusupati et al., 2020) and spred (Ziyin & Wang, 2023) in a one-shot setting. In this context, we also isolate the individual contribution of our initialization. Finally, we combine PILo T with iterative pruning methods such as WR (Frankle & Carbin, 2019) and LRR (Maene et al., 2021).

Memory requirements. As most other continuous sparsification approaches, note that PILo T doubles the number of parameters during training. Yet, according to Ziyin & Wang (2023), the training time of a Res Net50 with m w parameterization on Image Net increases roughly by 5% only and the memory cost is negligible if the batch size is larger than 50. At inference, we would return to the original representation x and therefore benefit from the improved sparsification.

Published as a conference paper at ICLR 2025

Diagonal Linear Network. We have proven optimality for the analytically tractable setting of diagonal linear networks. Now we illustrate the benefit of our initialization and dynamic explicit regularization. Furthermore, we highlight the impact of a good dynamic schedule of the regularization strength αt. We use d = 40 amount of data points with feature dimension n = 100 and sample zj N(0, In) for j [d]. The ground truth x is set such that ||x ||L0 = 5. Furthermore, the network parameters are initialized with x0 N(0, In 1 n). The step-size is η = 10 4

and the trajectories are averaged over 5 initializations. 0.95 confidence regions are indicated by shades. The mean squared error is used as loss function. We report the distance to the ground truth ||xt x ||L2 over training time in Figure 2. Two different initializations, i.e., the one of spred and of PILo T, and different regularization schedules are considered. The schedules are described in Appendix C.1. Results for the best ones are shown in the figure and confirm our theoretical insights. The inability to sign flip prevents spred from reaching the ground truth for all considered schedules. Furthermore, with a dynamic regularization, our PILo T initialization outperforms both spred and LASSO, reaching the ground truth. The best-performing schedule for the PILo T initialization is a geometrically decaying schedule, as also implemented in Algorithm 1. In contrast, for the other two methods, a constant regularization works best. This experimentally confirms Theorem 2.3 and Remark 2.2. Note that LASSO with gradient descent performs as expected. Decaying schedules perform worse than constant ones supporting Remark 2.1.

0 250 500 750 1000 Time

||x x * ||L2

mw spred init mw PILo T init x L1

Figure 2: A simulation of gradient flow on a diagonal linear network is given for the different regularizations.

One-shot sparsification. Firstly, we compare our method PILo T with STR, spred, and LASSO on CIFAR10 and CIFAR100 training a Res Net-20 or Res Net-18, respectively. Furthermore, in the case of CIFAR10, we also implement the novel initialization (m2 0 w2 0 = 1) without dynamic regularization to isolate its benefits. We consider two learning rates {0.1, 0.2} and the weight decay range {1e 5, . . . 1e 2} for CIFAR10 and range {1e 4, . . . 1e 3} for CIFAR100 and always show the best result. The same range for the regularization strength is explored for LASSO. The other hyperparameters are reported in the appendix. Secondly, we train Res Net-50 on Image Net with the setup of STR (Kusupati et al., 2020) and compare directly with their results. Furthermore, we implement both PILo T and spred in this setting.

Figure 3 presents results for CIFAR10 and CIFAR100. PILo T outperforms all other methods and is particularly effective in the high-sparsity regime. Our PILo T initialization leads to improvements over spred and STR for medium levels of sparsity. This supports our theoretical insight into the role of initializations and how they influence the implicit bias. In addition, STR is outperformed by spred in the high-sparsity regime, confirming the findings of (Ziyin & Wang, 2023).

0.80 0.85 0.90 0.95 1.00 Sparsity (%)

Test Accuracy (%)

0.900 0.925 0.950 0.975 1.000 Sparsity (%)

spred PILo T L1 STR PILo T init

Figure 3: One-shot sparsification. Acc. versus sparsity for CIFAR10 (left) and CIFAR100 (right).

In Table 1, we compare PILo T to both STR and spred on Image Net (Deng et al., 2009). See Appendix C.2 Table 3 for details on experimental configurations. Our method competes with or out-

1Starting from a pretrained model with 77% validation accuracy

Published as a conference paper at ICLR 2025

Table 1: Res Net-50 on Image Net sparsity (%) versus accuracy (%) results.

Method Top-1 Acc Sparsity Res Net-50 77.01 0 STR 76.19 79.55 STR 76.12 81.27 spred1 75.5 80.00 spred 72.64 79.03 PILo T 75.62 80.00 STR 74.73 87.7 STR 74.01 90.55 spred 71.84 89.26 PILo T 74.73 88.00 PILo T 74.04 91.00

Method Top-1 Acc Sparsity Res Net-50 77.01 0 STR 70.4 95.03 spred 69.47 94.50 PILo T 72.67 94.00 PILo T 71.30 95.00 PILo T 71.05 95.60 PILo T 70.49 96.00 STR 67.22 96.53 spred 66.12 97.19 PILo T 68.49 97.19 STR 61.46 98.05 spred 62.71 98.20 PILo T 66.49 97.75 PILo T 64.06 98.20

0.0 0.2 0.4 0.6 0.8 Sparsity (%)

Test Accuracy (%)

0.80 0.85 0.90 0.95 Sparsity (%)

LRR+PILo T WR+PILo T LRR WR

Figure 4: Learning Rate Rewinding (LRR) and Weight Rewinding (WR) with PILo T on Image Net Res Net-18. The left plot is the complete plot and the right plot is a zoomed-in version.

performs all baselines at medium and high sparsity levels. In addition, it improves over spred at 80% sparsity, even when spred is initialized with a 77% pretrained Res Net-50. This highlights the effectiveness of PILo T and confirms the insights from the developed theory.

Iterative Pruning. To demonstrate the versatility of PILo T, we also combine it with the state-ofthe-art iterative pruning methods Learning Rate Rewinding (LRR) (Maene et al., 2021) and Weight Rewinding (WR) (Frankle et al., 2019) on Image Net using a Res Net-18. For simplicity, we use β = 1 and no regularization. We see in Figure 4 that the parameterization m w futher boosts the performance of iterative methods. Remarkably, WR becomes competitive with LRR. Additional experiments on CIFAR10 and CIFAR100 with regularization are in Appendix C.3. The regularization further helps to reach higher accuracy at high sparsities.

5 DISCUSSION

We have shed light on the inner workings of continuous sparsification. Its basic relaxed formulation utilizes the parameterization m w, which induces an implicit bias towards sparsity. In contrast to an explicit L1-regularization, it enjoys all the benefits of an implicit regularization that caters first to the loss and not a sparsity penalty. Exploiting this insight for neural network sparsification, we have proposed PILo T, which relies on a controllable regularization that acts like an implicit regularization in the original neural network parameter space and, remarkably, corresponds to a time-dependent Bregman potential. As we have shown, the time-dependent control enables the associated mirror flow to enter the so-called rich regime, and thus effectively change the implicit regularization from L2 to L1. This property is central to showing convergence of our approach for (quasi)-convex loss functions and optimality for underdetermined linear regression. Experiments on standard vision benchmarks further corroborate the utility of our theoretical insights, as our proposal PILo T achieves significant improvements over state-of-the-art baselines.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. for funding this project by providing computing time on the GCS Supercomputer JUWELS at J ulich Supercomputing Centre (JSC). We also gratefully acknowledge funding from the European Research Council (ERC) under the Horizon Europe Framework Programme (HORIZON) for proposal number 101116395 SPARSE-ML.

Aws Albarghouthi. Introduction to neural network verification, 2021.

Felipe Alvarez, J erˆome Bolte, and Olivier Brahic. Hessian riemannian gradient flows in convex programming. SIAM Journal on Control and Optimization, 43(2):477 501, January 2004. ISSN 1095-7138. doi: 10.1137/s0363012902419977. URL http://dx.doi.org/10.1137/ S0363012902419977.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. ISSN 0167-6377. doi: https://doi.org/10.1016/S0167-6377(02)00231-6. URL https://www.sciencedirect. com/science/article/pii/S0167637702002316.

Stephen P. Boyd and Lieven Vandenberghe. Convex optimization. 2009. URL https://web. stanford.edu/ boyd/cvxbook/.

Rebekka Burkholz. Convolutional and residual networks provably contain lottery tickets. In International Conference on Machine Learning, 2022a.

Rebekka Burkholz. Most activation functions can win the lottery without excessive depth. In Advances in Neural Information Processing Systems, 2022b.

Rebekka Burkholz, Nilanjana Laha, Rajarshi Mukherjee, and Alkis Gotovos. On the existence of universal lottery tickets. In International Conference on Learning Representations, 2022.

Tianlong Chen, Zhenyu Zhang, Jun Wu, Randy Huang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Can you win everything with a lottery ticket? Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id= JL6MU9XFz W.

L ena ıc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Jacob Abernethy and Shivani Agarwal (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 1305 1338. PMLR, 09 12 Jul 2020. URL https://proceedings.mlr. press/v125/chizat20a.html.

Hung-Hsu Chou, Johannes Maly, and Dominik St oger. How to induce regularization in linear models: A guide to reparametrizing gradient flow, 2024.

Arthur da Cunha, Emanuele Natale, and Laurent Viennot. Proving the lottery ticket hypothesis for convolutional neural networks. In International Conference on Learning Representations, 2022.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248 255, 2009. doi: 10.1109/CVPR.2009.5206848.

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for lojasiewicz-landscapes, 2024.

Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. 2019.

Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, and Aarti Singh. Gradient descent can take exponential time to escape saddle points, 2017.

Published as a conference paper at ICLR 2025

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943 2952. PMLR, 2020.

Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (s)gd over diagonal linear networks: Implicit regularisation, large stepsizes and edge of stability. Ar Xiv, abs/2302.08982, 2023. URL https://api.semanticscholar.org/Corpus ID:268042036.

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47 53, 2022.

Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions, 2019.

Damien Ferbach, Christos Tsirigotis, Gauthier Gidel, and Bose Avishek. A general framework for proving the equivariant strong lottery ticket hypothesis, 2022.

Jonas Fischer and Rebekka Burkholz. Plant n seek: Can you find the winning ticket?, 2021.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ar Xiv: Learning, 2018. URL https://api.semanticscholar.org/ Corpus ID:53388625.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. Co RR, abs/1912.05671, 2019. URL http:// arxiv.org/abs/1912.05671.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? In International Conference on Learning Representations, 2021.

Advait Gadhikar and Rebekka Burkholz. Masks, signs, and learning rate rewinding. In Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=q ODvx Q8TXW.

Advait Harshal Gadhikar, Sohom Mukherjee, and Rebekka Burkholz. Why random pruning is all we need to start sparse. In International Conference on Machine Learning, 2023.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization, 2017.

Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry, 2020.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015.

Intekhab Hossain, Jonas Fischer, Rebekka Burkholz, and John Quackenbush. Not all tickets are equal and we know it: Guiding pruning with domain-specific knowledge, 2024.

Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, and Gintare Karolina Dziugaite. Pruning s effect on generalization through the lens of training and regularization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id= Orc LKV9s KWp.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin ˇZ ıdek, Anna Potapenko, et al. Highly accurate protein structure prediction with Alpha Fold. Nature, 596(7873):583 589, 2021.

Published as a conference paper at ICLR 2025

Lynn H Kaack, Priya L Donti, Emma Strubell, George Kamiya, Felix Creutzig, and David Rolnick. Aligning artificial intelligence with climate change mitigation. Nature Climate Change, 12(6): 518 527, 2022.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximalgradient methods under the polyak-łojasiewicz condition. In Paolo Frasconi, Niels Landwehr, Giuseppe Manco, and Jilles Vreeken (eds.), Machine Learning and Knowledge Discovery in Databases, pp. 795 811, Cham, 2016. Springer International Publishing. ISBN 978-3-31946128-1.

Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning, July 2020.

Andrey Kuzmin, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, and Max Welling. Taxonomy and evaluation of structured compression of convolutional neural networks, 2019.

Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, and Dan Alistarh. Accurate neural network pruning requires rethinking sparse optimization, 2023.

Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, and Yani Ioannou. Dynamic sparse training with structured sparsity. ar Xiv preprint ar Xiv:2305.02299, 2023.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Snip: single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.

Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, and Raymond K. W. Wong. Implicit sparse regularization: The impact of depth and early stopping, 2021. URL https://arxiv.org/ abs/2108.05574.

Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, and Raymond K. W. Wong. Implicit regularization for group sparsity, 2023. URL https://arxiv.org/abs/2301.12540.

Zhiyuan Li, Tianhao Wang, Jason D. Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. Ar Xiv, abs/2207.04036, 2022. URL https://api.semanticscholar.org/Corpus ID:250407876.

Bohan Liu, Zijie Zhang, Peixiong He, Zhensen Wang, Yang Xiao, Ruimeng Ye, Yang Zhou, Wei Shinn Ku, and Bo Hui. A survey of lottery ticket hypothesis, 2024.

Shiwei Liu and Zhangyang Wang. Ten lessons we have learned in the new sparseland : A short handbook for sparse neural network researchers, 2023.

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. In International Conference on Learning Representations, 2021.

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through l0 regularization, 2018.

Alexandra Sasha Luccioni, Yacine Jernite, and Emma Strubell. Power hungry processing: Watts driving the cost of ai deployment? ar Xiv preprint ar Xiv:2311.16863, 2023.

Jaron Maene, Mingxiao Li, and Marie-Francine Moens. Towards understanding iterative magnitude pruning: Why lottery tickets win, 2021.

Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, 2020.

Nina Narodytska, Hongce Zhang, Aarti Gupta, and Toby Walsh. In search for a sat-friendly binarized neural network architecture. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJx-j64FDr.

Published as a conference paper at ICLR 2025

A.S. Nemirovski and D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication. Wiley, 1983. ISBN 9780471103455. URL https://books. google.de/books?id=6ULv AAAAMAAJ.

Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. Advances in Neural Information Processing Systems, 33, 2020.

Shreyas Malakarjun Patil and Constantine Dovrolis. Phew: Constructing sparse networks that learn fast and generalize well without training data. In International Conference on Machine Learning, pp. 8432 8442. PMLR, 2021.

Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina Dziugaite. Unmasking the lottery ticket hypothesis: What s encoded in a winning ticket s mask? In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=x Ss W2Am-uk Z.

Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. In Advances in Neural Information Processing Systems, volume 33, pp. 2599 2610, 2020.

Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity, 2021.

Alexandra Peste, Eugenia Iofinova, Adrian Vladu, and Dan Alistarh. Ac/dc: Alternating compressed/decompressed training of deep neural networks, 2021.

Hoang Pham, Shiwei Liu, Lichuan Xiang, Dung D Le, Hongkai Wen, Long Tran-Thanh, et al. Towards data-agnostic pruning at initialization: What makes a good sparse mask? In Thirtyseventh Conference on Neural Information Processing Systems, 2023.

Adityanarayanan Radhakrishnan, Mikhail Belkin, and Caroline Uhler. Linear convergence of generalized mirror descent with time-dependent mirrors, 2021.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022.

Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.

Tyrrel R Rockafellar and Werner Fenchel. Convex Analysis. 1970. URL https://api. semanticscholar.org/Corpus ID:198120397.

Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification, 2021.

Jonathan Schwarz, Siddhant M. Jayakumar, Razvan Pascanu, Peter E. Latham, and Yee Whye Teh. Powerpropagation: A sparsity inducing weight reparameterisation, 2021.

Heejune Sheen, Siyu Chen, Tianhao Wang, and Harrison H. Zhou. Implicit regularization of gradient flow on one-layer softmax attention, 2024.

Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, and Dimitris Papailiopoulos. Rare gems: Finding lottery tickets at initialization. Advances in Neural Information Processing Systems, 35:14529 14540, 2022.

Natalie Stephenson, Emily Shane, Jessica Chase, Jason Rowland, David Ries, Nicola Justice, Jie Zhang, Leong Chan, and Renzhi Cao. Survey of machine learning techniques in drug discovery. Current drug metabolism, 20(3):185 193, 2019.

Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, 2020.

Tomas Vaˇskeviˇcius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery, 2019. URL https://arxiv.org/abs/1909.05122.

Published as a conference paper at ICLR 2025

Chaoqi Wang, Guodong Zhang, and Roger B. Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.

Kun Wang, Yuxuan Liang, Pengkun Wang, Xu Wang, Pengfei Gu, Junfeng Fang, and Yang Wang. Searching lottery tickets in graph neural networks: A dual perspective. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/ forum?id=Dvs-a3aym Pe.

Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016.

Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type. part i: Discrete time analysis, 2021.

Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models, 2020.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems, 4:795 813, 2022.

Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep networks. In International Conference on Learning Representations, 2020.

Peng Zhao, Yun Yang, and Qiao-Chu He. High-dimensional linear regression via implicit regularization. Biometrika, 109(4):1033 1046, February 2022. ISSN 1464-3510. doi: 10.1093/biomet/ asac010. URL http://dx.doi.org/10.1093/biomet/asac010.

Xiao Zhou, Weizhong Zhang, Zonghao Chen, Shizhe Diao, and Tong Zhang. Efficient neural network training via forward and backward propagation sparsification. Advances in Neural Information Processing Systems, 34:15216 15229, 2021a.

Xiao Zhou, Weizhong Zhang, Hang Xu, and Tong Zhang. Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3599 3608, 2021b.

Liu Ziyin. Symmetry induces structure and constraint of learning. 2023. URL https://api. semanticscholar.org/Corpus ID:263310669.

Liu Ziyin and Zihao Wang. spred: Solving l1 penalty with sgd, 2023. URL https://arxiv. org/abs/2210.01212.

A MIRROR FLOW FRAMEWORK

In this section we present some known results from the mirror flow framework for completeness. We derive the Bregman potential associated with m w in Theorem A.1. Next, we will also show the convergence of the loss and provide optimality guarantees in Theorem A.2. In addition, we extend Theorem A.2 with Theorem A.3. Finally, we give the optimality result for diagonal linear networks in Theorem A.4

Theorem A.1 Let the initialization of m and w satisfy m0,i > |w0,i| for all i [n]. Then the corresponding mirror function is:

i=1 xiarcsinh xi 2u0,iv0,i

x2 i + 4u2 0,iv2 0,i xilog u0,i

where u0,i = m0,i+w0,i

2 and v0,i = m0,i w0,i

2 . Furthermore, R is a Bregman function.

Published as a conference paper at ICLR 2025

Proof. The result follows directly from applying Theorem 4.16 in (Li et al., 2022).

Theorem A.1 implies the following: a) The global minima of R is at the initialization x0 = m0 w0. b) The Lipschitz coeficient of R depends on the initalization. The Lipschitz coeficient LR of (8) is LR = 1 mini 2u0,iv0,i , determining the smoothness of the potential. Following these two observations we make the following remark about Theorem A.1.

Remark A.1 Note that when the initialization is zero, i.e., w0 = 0, m0 = a with a 0 then (8) is the hyperbolic entropy. The hyperbolic entropy is

i=1 xiarcsinh xi

Theorem 2 of (Woodworth et al., 2020) characterizes the behavior in the limit for this case. For the hyperbolic entropy in case a 0 and | x

This means an L1 bias is induced when a is small. Nevertheless, we need an exponentially small a compared to x to get there as shown in (Woodworth et al., 2020), which can lead to numerical problems. Furthermore, m0 = w0 = 0 is a saddle point which can slow down training (exponentially) (Du et al., 2017). Additionally, the asymptotic result only holds for initializing at zero. Note LR = a in this case.

Remark A.1 shows the potential of using the implicit bias to induce sparsity. To actualize this, we need to solve the two challenges posed in the remark. Both are remedied in Section 2.

In addition to this promising formulation of implicitly minimizing an L1 norm with the use of the mirror framework, we can get convergence results. These results make it clear why implicit regularization is preferable over explicit regularization. The convergence result from (Li et al., 2022) is stated for our setting. Furthermore, the theorem is extended for a specific class of Bregman functions.

Theorem A.2 (Theorem 4.14 (Li et al., 2022)) Assume that f is quasi-convex, f is locally Lipschitz and argmin{f(x)|x Rn} is non-empty. Then as t , xt converges to some critical point x . Moreover, if f is convex xt converges to a minimizer of f.

In Theorem A.2 it is shown that with implicit regularization an optimal solution to the original optimization problem can be reached. In contrast, explicit regularization makes this not possible, by definition. Because the optimization problem has fundamentally changed. Showing the benefit of implicit over explicit.

For the extension, the convexity constraint is replaced by the Polyak-Łojasiewicz (PL) inequality in the theorem. The PL-inequality is a more realistic constraint in a machine learning context as loss functions are not locally convex but can satisfy the PL inequality locally (Wojtowytsch, 2021; Dereich & Kassing, 2024). The PL-inequality for a continuously differentiable function f is

|| f(x)||2 L2 λ (f(x) f(x )) x Rn (9)

for some λ > 0 and global minima x of f. This allows us to state the modified theorem.

Theorem A.3 Consider the same setting as Theorem A.2. Assume R satisfies for all x Rn,

z T 2R(x) 1 z σ||z||2 L2 z Rn. (10)

Furthermore, assume f satisfies the PL-inequality (9). Then xt converges to a minimizer of f. Furthermore, the loss converges linearly with rate σλ.

Proof. The evolution of f(xt) f(x ) is described by df(xt) = f(xt) 2R(xt) 1 f(xt)dt. From (10) and (9) the evolution is bounded by

df(xt) σ|| f(xt)||2 L2dt σλ (f(xt) f(x )) dt.

Published as a conference paper at ICLR 2025

Applying Gronwall s Lemma concludes the proof.

Note that Theorem A.3 holds in the same (general) setting as Theorem A.2. Also, note that the PL-inequality together with quasi-convexity does not imply convexity. Theorem A.3 holds for our setting. In this case, it follows from a direct computation that 2R(x) 1 = diag q

x2 + 4u2 0,1v2 0,1, . . . , q

x2 + 4u2 0,nv2 0,n .

This implies that σ = 2 mini u0,iv0,i in Theorem A.3, which again highlights the importance of the initialization.

Finally, in the case of under-determined linear regression, we can derive optimality conditions in the form of KKT conditions of R. Consider a data set (zj, yj)d i=1 with zj Rn and yj R. Let Z = (z1, . . . zd) and Y = (y1, . . . yd). For the regression to be called underdetermined n > d.

Theorem A.4 (Theorem 4.17 (Li et al., 2022)) In case of under-determined regression consider the loss function f(x) = f(Zx Y ). Assume f satisfies the conditions of Theorem A.2. Then xt converges to x such that x = argmin Zx=Y R(x)

Note that Theorem 4.17 of (Li et al., 2022) only uses quasi-convexity of the loss. Theorem A.4 guarantees that the optimization problem is solved while implicitly minimizing the potential R. Thus choosing the sparsest model out of the models that predict the data perfectly. This highlights another benefit of implicit regularization over explicit regularization.

In this section, we have shown the viability of using the implicit bias framework to induce an implicit regularization. Furthermore, we have given two known benefits of using the implicit bias framework over explicit regularization. The benefits are convergence to the optimal solution of the original problem and optimality in the case of underdetermined regression. To add to this, we have extended the convergence theorem using the PL-inequality in 9. Moreover, we again highlight the importance of the initialization of m0 and w0 with the influence on the smoothness of the Bregman potential and convergence of the loss. The initialization insight as in the main text is used to improve upon spred (Ziyin, 2023) as their initialization has scaling 2u0v0 = 0, it follows already from the mirror flow framework that 2u0v0 = 1 is a better initialization. Furthermore, we also do not initialize at zero though, and scaling needs to be exponentially small to get a good approximation of the L1 norm potentially making it hard to escape the saddle point. Therefore, the explicit regularization analyzed in Section 2 is necessary to exploit the implicit bias framework.

B PROOF MAIN RESULT

We show the main result here. The proof consists of four parts

Rat satisfies a mirror flow (Lemmas B.1 and B.2) Boundedness of the iterates and convergence to a critical point (Lemma B.3) Convergence of the loss (Theorem B.1) Optimality in case of underdetermined linear regression (Theorem B.2)

Consider the following gradient flow

dmt = f (mt wt) wt 2αtmtdt dwt = f (mt wt) mt 2αtwtdt (11)

For the flow in (11) to be well-posed f needs to be locally Lipschitz continuous. This is a sufficient condition given that αt is nice , which will be made more rigorous later. The evolution of xt = mt wt is derived in Lemma B.1.

Lemma B.1 The evolution of xt = mt wt with 11 is described by

xt = u2 0 exp 2 Z t

0 f (xs) ds 4 Z t

0 αsds v2 0 exp 2 Z t

0 f (xs) ds 4 Z t

Published as a conference paper at ICLR 2025

where u0 = m0+w0

2 and v0 = m0 w0

Proof. This follows from deriving the flow of mt and wt and then combining the two. The evolution of both are given by

mt = m0 cosh R t 0 f (xs) ds + w0 sinh R t 0 f (xs) ds exp 2 R t 0 αsds

wt = w0 cosh R t 0 f (xs) ds + m0 sinh R t 0 f (xs) ds exp 2 R t 0 αsds .

For ease of notation set Lt = R t 0 f (xs) ds and At = R t 0 αsds. Combining gives us

xt = mt wt = m2 0 + w2 0 cosh ( Lt) sinh ( Lt) exp ( 4At)

+ w0 m0 cosh ( Lt)2 + cosh ( Lt)2 exp ( 4At)

m2 0 + w2 0

2 sinh ( 2Lt)

+ w0 m0 (cosh ( 2Lt)) exp ( 4At)

= u2 0 exp ( 2Lt 4At) v2 0 exp (2Lt 4At)

where the second equality follows from hyperbolic identities.

It follows from Lemma B.1 and the local Lipschitz condition on f and R t 0 αsds < for all t 0, that the flow is well-posed. We now define the (corrected)-hyperbolic entropy function. The corrected hyperbolic entropy is given by

i=1 xiarcsinh xi

x2 i + a2 xilog u0i

where the last term is the correction. The correction stems from not initializing at zero.

Lemma B.2 Let |wi0| m0i for all i [n], then Rat(xt) with at = 2u0 v0exp 2 R t 0 αsds

satisfies d Rat(xt) = f(xt)dt x0 = m0 w0. (12)

Proof. This follows from Lemma B.1,

xtexp 4 Z t

0 αsds = u2 0exp 2 Z t

0 f(xs)ds v2 0exp 2 Z t

at ) log(u0

This equivalence follows from setting z = exp 2 R t 0 f(xs)ds and solving the resulting

quadratic equation. Notice that the left hand side in (12) is Rat(xt).

Lemma B.3 Let f be a quasi-convex function and αt 0 for all t 0. Furthermore, assume the integral R t 0 αsds < . Then the iterates are bounded and converge to a critical point.

Consider the time-dependent Bregman divergence

Dat(x , xt) := Rat(x ) Rat(xt) x RT at(x xt) 0

The divergence is bounded by:

Dat(x , xt) Ra (x ) Rat(xt) x RT at(x xt) =: Wt,

this follows from the fact that the map a Ra is decreasing. We make the following two observations: d da Ra(x) = 1

2 x2 |a| + a3

a2 + x2 0 and dat

dt 0 t 0. (13)

Published as a conference paper at ICLR 2025

This allows us to bound the evolution d dt Wt = d

dt Rat (xt) x RT at (x xt)

da Rat (xt) d

dt x RT at (x xt)

xf(xt)T (x xt) 0,

where the observations in (13) are used in the first inequality.

From Theorem 4.16 in (Li et al., 2022) it follows that for all a > 0, Ra is a Bregman potential. Implying that for all a > 0 the level set for γ R,

{x Rn : Da(x , x) γ}

is bounded. Combining this with the fact that the evolution is bounded, implies the iterates are bounded.

In the next part, it is shown that xt converges to a critical point. We first show that the loss becomes eventually non-increasing. There is a T such that for all t T the loss f is non-increasing.,

df(xt) = f(xt)T diag q

x2 t + a2 t

f(xt) + 2αt f(xt)T xt

f(xt)T diag q

x2 t + a2 t

f(xt) + 2αt C dt,

where it is used that the iterates are bounded and f is locally Lipschitz. As t we have that αt 0 and at a > 0 by assumption. Hence there exists a T such that for all t T we have

Note if this is not the case then there is a T > 0 such that f(x T ) = 0 implying convergence (to a critical point).

Now let x be an accumulation point of the bounded flow xt. We use this to show convergence to a critical point. We have that for all x Rn,

f(x ) x = lim t 1

x = lim t 1

t Ra T (x T ) Ra T +t(x T +t) x = 0,

where the first equality follows from that the loss is non-increasing and the second one from the time-dependent mirror flow description. Finally, because xt converges to an accumulation point we also have limt Rat(xt) = Ra (x ) by continuity, giving the last equality.

We use that the accumulation point is a critical point and set x = x in Wt such that Wt 0. This implies Dat(x , xt) 0 by the upperbound. It follows from the fact that the iterates are bounded that Rat is µ-strongly convex on this bounded convex set where the iterates stay. This gives

||x xt||L2 µ

2 Dat(x , xt) 0,

showing xt converges to a critical point.

Lemma B.3 gives a condition such that the iterates are bounded and converge to a critical point. It remains to be shown that the loss converges. This is done in Theorem B.1.

Theorem B.1 Consider the same setting as Lemma B.3, if f is convex or satisfies the PL-inequality we have convergence to an interpolator x such that it is a minimizer of f. Furthermore, in the PL-inequality case, the loss converges linearly.

Proof. Assume f is convex, notice first that there is a T such that for all t T the loss is nonincreasing. Combining this with a bound on the time-dependent Bregman potential gives us convergence of the loss. The time-dependent Bregman divergence is again defined by

Dat(x , xt) = Rat(x ) Rat(xt) x R at(x xt) 0.

Published as a conference paper at ICLR 2025

The divergence is bounded by:

Dat(x , xt) Wt.

The evolution of the bound is

d dt Wt = d

dt Rat (xt) x R at (x xt)

da Rat (xt) d

dt x R at (x xt)

xf(xt) (x xt) f (x ) f (xt) ,

where again the observations in (13) are used in the first inequality. Therefore the loss converges:

f(x T +t) f(x ) 1

T f(xs) f(x )ds

where the first inequality follows from convexity of the loss and the third inequality from the fact that Wt Dat(x , xt) 0. So the loss converges. We already know from Lemma B.3 that the iterates converge, concluding the convex case.

In case when f satisfies the PL-inequality, we proceed in the same way as Theorem A.3. The evolution of f is given by

df(xt) = f(xt) diag q

x2 t + a2 t

f(xt) + 2αt f(xt) xt

( A λ (f(xt) f(x )) + αt C||x ||L2) dt

where C is constant depending on the smoothness of f and A = 2 mini u0,iv0,i exp 2 R 0 αsds . Then it follows from Gronwall s Lemma that

f(xt) f(x ) (f(x0) f(x )) exp A λt + Z t

0 αs C||x ||L2ds .

It follows from the fact that R t 0 αsds < for all t 0 that the loss f convergence. Convergence of the iterates now follows in a similar way as the convex case.

We now show optimality in the case of underdetermined linear regression Consider a data set (zj, yj)d i=1 with zj Rn and yj R. Let Z = (z1, . . . zd) and Y = (y1, . . . yd). For the regression to be called underdetermined n > d.

Theorem B.2 In case of under-determined regression consider the loss function f(x) = f(Zx Y ). Assume f satisfies the conditions with at least one of the convergence criteria of Theorem B.1. Then xt converges to x such that x = argmin Zx=Y Ra (x) (14)

Proof. Convergence follows from Theorem B.1. It remains to be shown that the optimality conditions of (14) are satisfied. The gradient flow of Rat satisfies

Rat(xt) = Z Z t

0 f(xs)ds span{Z }.

This quantity is well defined for all t 0 because f is locally Lipschitz (because f has to be locally Lipschitz). Therefore taking the t yields the KKT conditions of the optimization problem in (14).

Published as a conference paper at ICLR 2025

B.1 DISCUSSION OF THE PROOF

Most of the proof follows the same arguments as in (Alvarez et al., 2004; Li et al., 2022; Pesme et al., 2021). The notable differences are showing that the loss becomes decreasing over time and the observations made in (13).

C DETAILS EXPERIMENTS

In this section, we provide the details of the experiments. In addition, there are additional figures given.

Compute The codebase for the experiments is written in Py Torch and torchvision and their relevant primitives for model construction and data-related operations. The experiments in the paper are trained on an NVIDIA A6000. In addition, the diagonal linear network is trained on a CPU 13th Gen INTEL(R) Core(TM) i9-13900H.

C.1 DIAGONAL LINEAR NETWORK

For each setting, different regularization schemes are tried. In total, 7 options are tried. 4 of the schedules are constant i.e. the regularization stays the same during training. The remaining 3 are decaying schedules. These schedules we name harmonic, quadratic, and geometric. The schedules are described by the following recurrent relations:

k2 and gk = pk

where we have set p = 0.95. These schedules lead to a total strength of regularization applied. We denote A := R t 0 αsds the total strength of the regularization. So in practice, it is the weighted sum of the regularization strength.

In Figure 5 we present the trajectories of all the schedules for the 3 considered settings. We observe that our regularization performs the best with the decaying schedules as predicted by the theory. The other methods need constant regularization to perform well as already mentioned in Remark 2.1. Note that, PILo T also can perform well with constant regularization (see Figure 5.

0 500 1000 10

0 500 1000 0 500 1000

||x x * ||L2

Constant A = 0.1 Constant A = 1

Constant A = 10 Constant A = 100

Harmonic A 14.4

Quadratic A

Geometeric A 20

Figure 5: All runs for the diagonal linear network. From left to right m w with PILo T initalization, m w with spred initialization, and x with L1 regularization

C.2 ONE-SHOT

In this section we give the additional details for the one-shot experiments. In Table 2 the hyperparameters for the CIFAR 10 and 100 experiments are given. To determine which configuration is best

Published as a conference paper at ICLR 2025

for which sparsity level we compute the validation accuracy at multiple levels and choose the level just before the accuracy drops 1% or in the high-sparsity regime 2%. Moreover, we use s = 200 for STR.

For the Image Net experiment we use the setup of (Kusupati et al., 2020). For PILo T and spred we in addition use L2 regularization to compensate for the weight decay i.e. we add a term (m w)2

with strength 0.000030517578125/2, which is based on the weight decay strength in (Kusupati et al., 2020) for the baseline. Furthermore, weight decay is turned off for the other parameters. In Table 3 we present the configurations that correspond to the values in Table 1. Note for all PILo T configs δ = 1.01 is used.

Table 2: One-shot experiment

Parameter Setting Comments

Optimizer SGD Momentum 0.9 Batch size 256 Activation function Re Lu Weight decay2 10 4 Base learning rate {0.1, 0.2} Epochs 150 Warmup period 0 Initialization Kaiming normal Scaling 1 Only for m w δ 1.01 K 8000

Learning rate schedule cosine warmup

Learning rate schedule step warmup

Table 3: Res Net-50 on Image Net configurations for each sparsity (%).

Method αinit K sparsity spred3 2e 5 - 80.00 spred 3e 6 - 79.03 PILo T 7e 6 60000 80.00 spred 5e 6 - 89.26 PILo T 1e 5 60000 88.00 PILo T 1.4e 5 60000 91.00

Method αinit K sparsity spred 2e 5 - 94.50 PILo T 2e 5 60000 94.00 PILo T 3e 5 60000 95.00 PILo T 3e 5 60000 95.60 PILo T 3e 5 60000 96.00 spred 3e 5 - 97.19 PILo T 4e 5 40000 97.19 spred 5e 5 - 98.20 PILo T 5e 5 20000 97.75 PILo T 7e 5 20000 98.20

Label smoothing Altough PILo T is competitive in the 80% 90% sparsity range it is not SOTA. Nevertheless, if we turn of label smoothing in the experiment PILo T outperforms STR in this range as well. The only change for STR is turning of labelsmoothing. For PILo T we use two different configurations. We use L2 regularization i.e. ||m w||2 L2 regularization set to 5 10 5 instead of the value from STR experiment. We use K = 600000 and δ = 1.01. Furthermore, the strength of the PILo T regularization is initialized at {1 10 5, 2 10 5} and no weight decay is used on the rest of the parameters. The results are given in Table C.2.

2Applied to the other parameters 3Starting from a pretrained model with 77% validation accuracy

Published as a conference paper at ICLR 2025

Table 4: Extra experiment Res Net-50 on Image Net sparsity (%) versus accuracy (%) without label smoothing.

Method Top-1 Acc Sparsity Res Net-50 75.80 0 STR 73.03 79.03 PILo T 74.72 79.03 STR 71.6 89.26 PILo T 73.21 91.41

C.3 ITTERATIVE PRUNING

In Table 5 the details of the Image Net the experiment are given. Note the base learning rate 0.1 is for the baseline and 0.2 is used for our parameterization combined with scaling 1. In addition, the L2 regularization denotes the reparameterization of the original weight decay. Thus for PILo T, in this case, we use that instead. Moreover, all runs have been done for 3 different seeds. Furthermore, we provide additional experiments on CIFAR 10 and 100 with Res Net-20 and Res Net-18 respectively in Figure 6. The details are given in Table 6

Table 5: WR and LRR experiment on Image Net

Parameter Setting Comments

Optimizer SGD Momentum 0.9 Batch size 512 Activation function Re Lu Weight decay {0, 10 4} Learning rate schedule step warmup Base learning rate {0.1, 0.2} Cycles 25 Pruning rate 0.8 Epochs per cycle 90 Warmup period 10 Initialization Kaiming normal L2 regularization 5 10 5 Only for m w PILOT regularization {0} Only for m w Scaling 1 Only for m w δ 1 Only for m w K Only for m w

0.0 0.5 1.0 90

0.8 0.9 1.0 90

0.0 0.5 1.0 73

0.90 0.95 1.00 73

LRR+PILo T WR+PILo T LRR WR

Test Accuracy

Figure 6: Learning Rate Rewinding (LRR) and Weight Rewinding (WR) with PILo T shows improvement over the baseline itterative pruning methods for CIFAR 10 and 100.

D REMARK ON NEURONWISE PRUNING

In the main text, we have used mirror flow to describe the implicit bias of parameterwise pruning. In this section, we show that neuronwise pruning can not be analyzed in the same way. We first define neuronwise pruning. Next, we paraphrase the necessary condition such that the implicit bias can be

Published as a conference paper at ICLR 2025

Table 6: WR and LRR experiment on CIFAR 10 and 100

Parameter Setting Comments

Optimizer SGD Momentum 0.9 Batch size 256 Activation function Re Lu Weight decay 10 4 Learning rate schedule step warmup Base learning rate {0.1, 0.2} Cycles 25 Pruning rate 0.8 Epochs per cycle 150 Warmup period 50 Initialization Kaiming normal L2 regularization 0 Only for m w PILOT regularization {10 4} Only for m w Scaling 1 Only for m w δ 1 Only for m w K Only for m w

described by a mirror flow from (Li et al., 2022). Finally, we show neuronwise pruning violates this condition pointing out a limitation of the framework.

Consider the parameterization for a function with p neurons, g : Rp Rn1 . . . Rnp,

g(m, w1, . . . , wp) = (m1w1, . . . , mpwp)

where mi R is the mask and wi Rni are the neurons.

We state the necessary condition for a parameterization to induce a mirror flow.

Theorem D.1 (Theorem 4.10 (Li et al., 2022)) The Lie bracket span of { ig}n i=1 is in the kernel of Jacobian g.

Now, we use this theorem to show that neuronwise pruning does not induce a mirror flow.

Lemma D.1 Neuronwise pruning violates Theorem D.1.

Proof. We show that for p = 1 the condition is already violated. This implies that in the general case, the condition is also violated as the neurons themselves are commuting with each other as they are parameterized separate.

To see this we can explicitly check the commuting condition for the following parameterization g : R Rn Rn

g(m, w) = mw

Then the gradients (Jacobians) and Hessian s are given by:

wi m Ii=1 ... m Ii=n1

0 Ii=1 . . . Ii=n Ii=1 0 . . . 0 ... ... Ii=n 0 . . . 0

for i = 1, 2

Computing Hgi gj

Hgi gj = wj

0 Ii=1 ... Ii=n

Published as a conference paper at ICLR 2025

we compute the Lie brackets which span a subspace of the Lie Algebra LIE 2( g). The subspace is spanned by

0 Ii=1 ... Ii=n

0 Ij=1 ... Ij=n

for i, j [n]

Clearly, this span is not in Ker( g) as can be shown by a direct computation:

0 Ii=1 ... Ii=n

0 Ij=1 ... Ij=n

= (Ii=1mwj Ij=1mwi, . . . , Ii=nmwi Ij=1mwi)

For the span to be in the kernel we need either that all wi = 0 for all i [n] or m = 0. In both cases this implies that g(m, w) {0}. This implies that a mirror flow is only well-defined if the parameterization is zero. Hence, neuron-wise continuous sparsification does not induce a mirror flow.