# Spartan: Differentiable Sparsity via Regularized Transportation

Kai Sheng Tai, Taipeng Tian, Ser-Nam Lim

Correspondence to: Kai Sheng Tai (kst@meta.com)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.

1 Introduction

Sparse learning algorithms search for model parameters that minimize training loss while retaining only a small fraction of non-zero entries. Parameter sparsity yields benefits along several axes: reduced model storage costs, greater computational and energy efficiency during training and inference, and potentially improved model generalization [33]. However, sparse training is a challenging optimization problem: for the general problem of learning parameters θ ∈ R^d subject to the constraint that θ is k-sparse, the feasible set is the union of axis-aligned hyperplanes intersecting at the origin, each of dimension k. This complex, nonconvex geometry of the feasible set compounds the existing difficulty of optimizing deep neural network models.

This work focuses on improving the generalization error of sparse neural network models. To this end, we introduce Spartan (Sparsity via Regularized Transportation), a sparse learning algorithm that leverages an optimal transportation-based top-k masking operation to induce parameter sparsity during training. Spartan belongs to the family of parameter-dense training algorithms that maintain a dense parameter vector θ ∈ R^d throughout training [48; 24], in contrast to parameter-sparse training algorithms that adhere to an O(k) memory budget for representing the parameters of a k-sparse model [3; 31; 32; 13; 16]. While computational cost and memory usage at training time are important considerations, Spartan primarily optimizes for performance at inference time.

Intuitively, Spartan aims to achieve a controlled transition between the exploration and exploitation of various sparsity patterns during training. In the exploration regime, the goal is for the learner to easily transition between differing sparsity patterns in order to escape those that correspond to poor minima of the loss function. On the other hand, the goal of exploitation is for the learner to optimize model parameters given a fixed sparsity pattern, or a small set of similar patterns. This latter setting avoids frequent oscillations between disparate sparsity patterns, which may be detrimental to the optimization process.

Figure 1: Soft top-k masking. (a, b) The soft top-k masking operator σ^β_k computes approximately k-sparse outputs, with the sharpness of the mask controlled by the parameter β. (c) Small updates to iterates far from the k-sparse feasible set (θ_t → θ_{t+1}) can correspond to a large perturbation in parameter space after projection by Π_k. Updates to iterates θ_t in the approximately sparse region (shaded in grey) correspond to smaller post-projection perturbations.
We operationalize this intuition through the use of a differentiable soft top-k masking operation σ^β_k : R^d → R^d. This function maps parameters θ to an approximately sparse output that suppresses low-magnitude entries in θ. We parameterize the soft top-k mask with a sharpness parameter β: at β = 0, σ^β_k simply scales the input by a constant, and as β → ∞, the mapping reduces to hard top-k magnitude-based selection (Figure 1 (a, b)). Soft top-k masking therefore constrains the iterates θ_t to be close to the set of exactly k-sparse vectors, with the strength of this constraint mediated by the parameter β. Figures 1 (c) and 2 give some geometric intuition for the effect of this mapping. We implement soft top-k masking using a regularized optimal transportation formulation [11; 42] and demonstrate that this technique scales to networks on the order of 10^8 parameters in size.

We evaluate Spartan on ResNet-50 [21] and ViT [15] models trained on the ImageNet-1K dataset. On ResNet-50, we find that sparse models trained with Spartan achieve higher generalization accuracies than those trained with existing methods at sparsity levels of 90% and above. In particular, we train ResNet-50 models to 76.5% top-1 validation accuracy at 95% sparsity and to 74.2% accuracy at 97.5% sparsity, improving on the previous state-of-the-art by 0.6% and 1.6% respectively. Our sparse ViT-B/16 models reduce model storage size by 10× and inference FLOPs by 7.4× at the cost of a 0.6% accuracy reduction relative to DeiT-B [38]. We further demonstrate that Spartan is effective for block structured pruning, a form of structured sparsity that is more amenable to acceleration on current GPU hardware than unstructured sparsity [18]. On a ViT-B/16 model with 16×16 block structured pruning, Spartan achieves 79.1% top-1 accuracy at 90% sparsity, compared to the baseline of 74.1% at the same sparsity level.

To summarize, we make the following contributions in this paper:

- We present Spartan, a sparsity-inducing training algorithm based on a soft top-k masking operation. We show that Spartan interpolates between two existing sparse learning algorithms: iterative magnitude pruning [48] and Top-K Always Sparse Training [24].
- We empirically evaluate Spartan using ResNet-50 and ViT models trained on the ImageNet-1K dataset, demonstrating consistent improvements over the prior state-of-the-art.
- We study the effect of Spartan's hyperparameters on an exploration-exploitation tradeoff during training and on the final accuracy of the trained models.
- We provide an open source implementation of Spartan at https://github.com/facebookresearch/spartan.

Notation. We use R_+ and R_++ to denote the nonnegative real numbers and the strictly positive real numbers respectively. 1_d is the all-ones vector of dimension d. Given a vector x ∈ R^d, ‖x‖_p denotes the p-norm of x, ‖x‖_0 is the number of nonzero entries in x, and |x| is the vector of elementwise absolute values. When x and y are vectors, x ⊙ y denotes elementwise multiplication and x/y denotes elementwise division. The operator Π_k(x) := argmin_{y : ‖y‖_0 ≤ k} ‖x − y‖_2 denotes Euclidean projection onto the set of k-sparse vectors.

Figure 2: A 2D example of soft masking. We plot probability densities corresponding to the action of the soft top-1 operator σ^β_1 on the 2D standard Gaussian distribution (darker colors indicate higher density). Specifically, we visualize the densities of σ^β_1(x) for x ∼ N(0, I_2) over a range of sharpness parameters β. Higher values of β constrain iterates to be closer to the k-sparse feasible set.
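To make the notation concrete, the following is a minimal PyTorch sketch of the hard top-k projection Π_k as magnitude-based masking; the function name and tensor shapes are our own illustrative choices rather than code from the released implementation.

```python
import torch

def hard_topk_project(theta: torch.Tensor, k: int) -> torch.Tensor:
    """Euclidean projection onto the set of k-sparse vectors: keep the k
    largest-magnitude entries of theta and zero out the rest."""
    flat = theta.flatten()
    _, idx = torch.topk(flat.abs(), k)  # indices of the k largest |theta_i|
    mask = torch.zeros_like(flat)
    mask[idx] = 1.0
    return (flat * mask).view_as(theta)

# Example: project a random vector onto the set of 3-sparse vectors.
x = torch.randn(10)
print(hard_topk_project(x, k=3))
```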
2 Related Work

Neural network pruning. Our use of a magnitude-based criterion for sparse learning draws on early work in the area of neural network pruning [23]. Magnitude-based pruning is computationally cheap relative to alternative criteria that rely on first- or second-order information [33; 27; 20], and is a perhaps surprisingly performant option despite its simplicity [17]. More generally, Spartan builds on a substantial body of previous work that aims to jointly optimize sparsity patterns and model parameters for deep neural networks [43; 8; 19; 29; 28; 48, inter alia; see, e.g., 22 for a survey].

Spartan as a generalization of existing methods. We highlight two particularly relevant sparse training methods in the literature: the iterative magnitude pruning (IMP) method (Algorithm 1, [48]) and Top-K Always Sparse Training, or Top-KAST (Algorithm 2, [24]):

Algorithm 1: Iterative magnitude pruning update
Input: parameters θ_t, loss function L(·), sparsity budget k, step size η_t
Output: parameters θ_{t+1}
1: θ̃_t = Π_k(θ_t)
2: θ_{t+1} = θ_t − η_t ∇θ̃_t ∇L(θ̃_t)
3: return θ_{t+1}

Algorithm 2: Dual averaging / Top-KAST update
Input: parameters θ_t, loss function L(·), sparsity budget k, step size η_t
Output: parameters θ_{t+1}
1: θ̃_t = Π_k(θ_t)
2: θ_{t+1} = θ_t − η_t ∇L(θ̃_t)
3: return θ_{t+1}

Note that both Algorithms 1 and 2 compute the forward pass through the network using the sparsified parameters θ̃, obtained by setting all but the top-k entries of θ by magnitude to zero. Algorithms 1 and 2 differ in the presence of the ∇θ̃_t term in line 2. This is a d × d diagonal matrix where (∇θ̃_t)_{ii} = 1 iff θ_{t,i} was not set to zero by the projection Π_k, and (∇θ̃_t)_{ii} = 0 otherwise.

At the extreme points of the sharpness parameter β, Spartan's parameter update procedure reduces to those of IMP and Top-KAST. We can thus view Spartan as a method that generalizes and smoothly interpolates between this pair of algorithms. As β → ∞, Spartan is equivalent to IMP: this approach sparsifies parameters using a top-k magnitude-based binary mask, and entries that were masked in the forward pass receive zero gradient in the backward pass. At β = 0, Spartan reduces to Top-KAST: this method again sparsifies parameters by magnitude with a binary mask, but unlike IMP, all entries are updated in the backward pass using the gradient of the loss with respect to the masked parameters. The Top-KAST update is thus an application of the straight-through gradient method [5; 9], otherwise known as lazy projection or dual averaging (DA) in optimization [34; 41; 2; 14]. See Sec. 5 for further discussion on this point.
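The contrast between the two updates can be made concrete with a short PyTorch sketch. This is an illustrative single-step implementation under our own naming, not the reference code: in the IMP step the gradient flows through the hard mask, so pruned entries receive zero gradient, while in the dual averaging step the gradient of the loss at the masked parameters is applied densely to θ.

```python
import torch

def _project_topk(theta: torch.Tensor, k: int) -> torch.Tensor:
    # Hard top-k magnitude mask; the binary mask itself is treated as a constant.
    _, idx = torch.topk(theta.abs().flatten(), k)
    mask = torch.zeros_like(theta.flatten())
    mask[idx] = 1.0
    return theta * mask.view_as(theta)

def imp_step(theta, loss_fn, k, lr):
    """Algorithm 1: entries zeroed by the projection receive zero gradient."""
    theta = theta.detach().clone().requires_grad_(True)
    loss_fn(_project_topk(theta, k)).backward()
    return (theta - lr * theta.grad).detach()

def dual_averaging_step(theta, loss_fn, k, lr):
    """Algorithm 2 (Top-KAST): dense straight-through update from the gradient
    of the loss evaluated at the hard-masked parameters."""
    masked = _project_topk(theta, k).detach().requires_grad_(True)
    loss_fn(masked).backward()
    return theta - lr * masked.grad  # every coordinate of theta is updated

# Toy loss whose gradient is nonzero at pruned (zeroed) entries.
theta = torch.randn(8)
loss_fn = lambda w: ((w - 1.0) ** 2).sum()
print(imp_step(theta, loss_fn, k=2, lr=0.1))             # pruned entries unchanged
print(dual_averaging_step(theta, loss_fn, k=2, lr=0.1))  # all entries move
```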
Smooth approximations. Our use of the soft top-k operation is related to prior methods that use the logistic sigmoid function as a differentiable approximation to the step function for sparse training [30; 36; 1]. These approaches similarly regulate the sharpness of the approximation with a temperature parameter that scales the input logits to the sigmoid function. A distinctive characteristic of Spartan is that the parameter k directly controls the degree of sparsity of the mask; this differs from previous approaches involving the logistic sigmoid approximation, which only indirectly control the degree of sparsity using an L1 penalty term.

Transformer-specific methods. Several approaches are specialized to transformer-based architectures. These include structured pruning of pre-trained transformers for NLP tasks [39; 26; 40] and sparse training methods specifically applied to vision transformers [7; 6]. In contrast, Spartan is a general-purpose sparse training algorithm that is designed to be agnostic to the model architecture.

3 Sparsity via Regularized Transportation

We begin by motivating our proposed approach. A key advantage of the dual averaging method (Algorithm 2) over IMP (Algorithm 1) is that it mitigates the issue of gradient sparsity. In the IMP update, only those parameters that were not masked in the forward pass receive a nonzero gradient, resulting in slow progress at high levels of sparsity. In contrast, dual averaging applies a dense update to the parameter vector in each iteration; intuitively, these dense updates can help revive parameters that are below the top-k magnitude threshold and are consequently masked to zero. Dual averaging iterates are thus able to more quickly explore the space of possible sparsity patterns.

On the other hand, large variations in the sparsity patterns realized post-projection can lead to instability in optimization, thus hampering the final performance of the model. For instance, Top-KAST empirically benefits from the application of additional sparsity-inducing regularization on the parameters [24 (Sec. 2.3), 37]. This issue motivates our use of a soft top-k approximation as a mechanism for controlling the stability of our training iterates.

Each iteration of Spartan consists of the following two steps (Algorithm 3): (1) an approximately k-sparse masking of the model parameters using the soft top-k operator, and (2) a dual averaging-based parameter update with the set of exactly k-sparse vectors as the feasible set. This procedure aims to combine the advantages of Algorithms 1 and 2 within a single parameter update scheme. Note that the Spartan parameter update is essentially identical to Algorithm 2, save for the inclusion of the intermediate soft top-k masking step.

Algorithm 3: Spartan parameter update
Input: parameters θ_t, loss function L(·), sparsity budget k, sharpness parameter β, step size η_t
Output: parameters θ_{t+1}
1: σ^β_k(θ_t) = θ_t ⊙ softtopk(|θ_t|, k, β)   ▷ apply soft masking
2: θ̃_t = Π_k(σ^β_k(θ_t))   ▷ project onto k-sparse set
3: θ_{t+1} = θ_t − η_t ∇σ^β_k(θ_t) ∇L(θ̃_t)   ▷ compute dual averaging update
4: return θ_{t+1}

In the following, we begin by describing the soft top-k masking operation and address the issues of scalability and of applying soft top-k masking to induce structured sparsity. We then discuss how the update procedure outlined in Algorithm 3 can be incorporated within a complete training loop.
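As a rough PyTorch rendering of Algorithm 3 (a sketch under our own naming, not the reference implementation): the soft mask is applied differentiably, the hard projection Π_k enters only the forward pass, and the backward pass is routed straight through the projection and through the soft mask. It assumes a `soft_topk_mask(values, k, beta)` helper such as the Sinkhorn-based sketch given in Section 3.1 below, and a flat parameter vector.

```python
import torch

def spartan_step(theta, loss_fn, k, beta, lr, soft_topk_mask):
    """One Spartan update (sketch of Algorithm 3) on a flat parameter vector.

    soft_topk_mask(values, k, beta) is assumed to return m in [0, 1]^d,
    e.g. the Sinkhorn-based routine sketched in Section 3.1."""
    theta = theta.detach().clone().requires_grad_(True)
    # (1) Soft masking: sigma^beta_k(theta) = theta * softtopk(|theta|, k, beta).
    soft = theta * soft_topk_mask(theta.abs(), k, beta)
    # (2) Hard projection Pi_k of the soft-masked parameters (no gradient).
    _, idx = torch.topk(soft.detach().abs(), k)
    hard = torch.zeros_like(soft)
    hard[idx] = soft.detach()[idx]
    # (3) Straight-through / dual averaging: the forward pass sees `hard`,
    #     while the backward pass differentiates through `soft`.
    forward = soft + (hard - soft.detach())
    loss_fn(forward).backward()
    return (theta - lr * theta.grad).detach()
```

Treating the projection as the identity in the backward pass is what makes this a dual averaging (straight-through) update; taking β → ∞ or β = 0 in the soft mask recovers the IMP and Top-KAST behaviors discussed in Section 2.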
3.1 Soft Top-k Masking

Our soft top-k masking scheme is based on the soft top-k operator softtopk(v, k, β) described by Xie et al. [42]. softtopk takes as input a vector v ∈ R^d, a budget parameter k > 0, and a sharpness parameter β ≥ 0, and outputs m ∈ [0, 1]^d, a smoothed version of the top-k indicator vector. By using the magnitudes |θ| of the parameter vector as a measure of the value of each entry, we obtain a soft top-k magnitude pruning operator σ^β_k(θ) by masking the entries of the parameter vector with the output of the soft top-k operator: σ^β_k(θ) := θ ⊙ softtopk(|θ|, k, β).

Algorithm 4: Soft top-k forward pass
Input: values v ∈ R^d, costs c ∈ R^d_{++}, budget k ∈ (0, 1_d^T c], sharpness parameter β ≥ 0, max. iterations T, tolerance ε, initial dual variable µ_0
Output: mask m ∈ [0, 1]^d, dual variables µ ∈ R, ν ∈ R^d
1: z = βv/c, m_0 = 1_d   ▷ normalize values v by costs c
2: for t = 1, ..., T do
3:   ν_t = log c − log(1_d + exp(z + µ_{t−1}))   ▷ normalize mask entries
4:   µ_t = log k − log Σ_i exp(z_i + ν_{t,i})   ▷ normalize sum to be equal to k
5:   m_t = exp(z + µ_t + ν_t − log c)   ▷ compute mask
6:   if |v^T (m_t − m_{t−1})| < ε |v^T m_{t−1}| then   ▷ check stopping criterion
7:     return m_t, µ_t, ν_t
8: return m_T, µ_T, ν_T

Algorithm 5: Soft top-k backward pass
Input: gradient w.r.t. outputs g ∈ R^d, mask m ∈ [0, 1]^d, costs c, budget k, sharpness parameter β
Output: gradient w.r.t. the values v
1: a_1 = Σ_i g_i m_i (1 − m_i),  a_2 = Σ_i c_i m_i^2
2: return βm ⊙ (1_d − m) ⊙ (g/c − a_1/(k − a_2))

We introduce a mild generalization of the soft top-k operator from Xie et al. [42] by incorporating a strictly positive cost vector c. In particular, we require that the output mask m ∈ [0, 1]^d satisfies the budget constraint c^T m = k. This abstraction is useful for modeling several notions of heterogeneous parameter costs within a model; for instance, parameters that are repeatedly invoked within a network have a higher associated computational cost. As a concrete example, Evci et al. [16] observed that the FLOP count of a ResNet-50 model at a fixed global sparsity can vary by a factor of over 2 due to differences in the individual sparsities of convolutional layers with differing output dimensions.

To derive this cost-sensitive soft top-k operator, we begin by expressing the mask m as the solution to the following linear program (LP):

    min_{m ∈ R^d} −v^T m   subject to   0 ≤ m ≤ 1_d,  c^T m = k.   (1)

Deferring the derivation to the Appendix, we can rewrite this LP as the following optimal transportation (OT) problem:

    min_{Y ∈ R^{d×2}_+} Σ_{ij} C_{ij} Y_{ij}   subject to   Y 1_2 = c,  1_d^T Y = [k, 1_d^T c − k],   (2)

with the cost matrix C := [−v/c, 0]. We can recover the mask from Y as m_i = Y_{i1}/c_i. Now following [42; 11], we introduce entropic regularization to obtain a smoothed version of the top-k operator, with the degree of smoothing controlled by β:

    min_{Y ∈ R^{d×2}_+} Σ_{ij} C_{ij} Y_{ij} − (1/β) H(Y)   subject to   Y 1_2 = c,  1_d^T Y = [k, 1_d^T c − k],   (3)

where H(Y) := −Σ_{ij} Y_{ij}(log Y_{ij} − 1) is the entropy of Y. In this standard form, we can efficiently solve this regularized OT problem using the Sinkhorn-Knopp algorithm [11; 4; 12]. Algorithms 4 and 5 describe the forward and backward passes of the resulting cost-sensitive soft top-k operator (see Appendices B and C for their derivations). The expression for the gradient in Algorithm 5 follows from Theorem 3 in Xie et al. [42] with some algebraic simplification. We remark that this closed-form backward pass is approximate in the sense that it assumes that we obtain the optimal dual variables µ, ν in the forward pass; in practice, we do not encounter issues when using estimates of the dual variables obtained with tolerance ε = 0.01 and a maximum iteration count of T = 100 in Algorithm 4.
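The fixed-point iteration of Algorithm 4 is compact once the row update is folded in, since each mask entry reduces to a logistic function of z_i + µ. The sketch below follows that structure in PyTorch (our own simplified rendering, not the released code); it initializes µ near the hard top-k threshold rather than at a user-supplied µ_0, and it relies on autograd through the unrolled iterations instead of the closed-form backward pass of Algorithm 5.

```python
import torch

def soft_topk_mask(values, k, beta, costs=None, n_iters=100, tol=1e-2):
    """Cost-sensitive soft top-k mask via the fixed-point iteration of Algorithm 4.

    Returns m in [0, 1]^d with approximately costs^T m = k; as beta grows, m
    approaches the hard top-k indicator of values / costs. Assumes integer k."""
    if costs is None:
        costs = torch.ones_like(values)
    z = beta * values / costs
    # Initialize the dual variable near the hard top-k threshold (cf. the
    # "sorting" heuristic discussed under Scalability below).
    threshold = torch.topk(values / costs, int(k)).values[-1]
    mu = -beta * threshold
    m_prev = torch.ones_like(values)
    for _ in range(n_iters):
        # Row normalization: nu = log c - log(1 + exp(z + mu)).
        nu = torch.log(costs) - torch.nn.functional.softplus(z + mu)
        # Budget normalization: pick mu so that c^T m = k.
        mu = torch.log(torch.tensor(float(k))) - torch.logsumexp(z + nu, dim=0)
        # Mask: m = exp(z + mu + nu - log c), i.e. a logistic function of z + mu.
        m = torch.exp(z + mu + nu - torch.log(costs))
        if torch.abs(values @ (m - m_prev)) < tol * torch.abs(values @ m_prev):
            break
        m_prev = m
    return m

v = torch.tensor([3.0, 1.0, 0.5, 2.0])
print(soft_topk_mask(v, k=2, beta=1.0))   # diffuse mask over all entries
print(soft_topk_mask(v, k=2, beta=10.0))  # close to the hard top-2 indicator
```

The paper pairs this forward pass with the closed-form gradient of Algorithm 5; in this sketch, differentiating through the unrolled iterations with autograd plays that role instead.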
Scalability. Since we apply soft top-k masking to high-dimensional parameters in each iteration of training, it is important to minimize the computational overhead of Sinkhorn iteration in the forward pass. We note that in order to compute the hard top-k projection step in dual averaging (as in Algorithm 3), it is necessary to find the index i_k of the kth largest entry of |θ|/c. Given the index i_k, we can use the heuristic initialization µ_0 = −β|θ_{i_k}|/c_{i_k} for the dual variable µ to accelerate the convergence of Sinkhorn iteration. This heuristic is based on the observation that |θ_{i_k}|/c_{i_k} is the threshold value of the normalized value vector |θ|/c in hard top-k masking. Concretely, we demonstrate in Sec. 4.3 that our implementation of Spartan incurs a per-iteration runtime overhead of approximately 5% over standard dense ResNet-50 training.

Structured sparsity. To implement block structured sparsity, we instantiate one mask variable per block and mask all parameters within that block with the same mask variable. To compute the mask, we use the sum of the magnitudes of the entries in each block as the corresponding value v_i. In a standard pruning setting with uniform costs across all parameters, the corresponding cost c_i is simply the total number of entries in the block.

Figure 3: Spartan parameterizes an exploration-exploitation tradeoff. For each run of ResNet-50 training on ImageNet-1K, we plot Pearson correlation coefficients between the sparsity mask at the end of each epoch and the mask obtained at the end of training (left), and between sparsity masks at the end of subsequent epochs (right). Kinks in the correlation curves at epochs 20 and 80 are respectively due to the end of sparsity annealing and the start of fine-tuning with fixed masks (see Section 3.2 for details on the training schedule).

3.2 Training with Spartan

Training schedule. When training with Spartan, we divide the training process into a warmup phase, an intermediate phase, and a fine-tuning phase. In our experiments, the warmup phase consists of the first 20% of epochs, the intermediate phase the next 60%, and the fine-tuning phase the final 20%. In the warmup phase, we linearly anneal the global density of the model (the fraction of retained parameters) from a value of 1 at the start of training to the target level. Throughout the intermediate phase, we maintain the target sparsity level, and during the fine-tuning phase, we fix the sparsity mask to that used at the start of fine-tuning. This training schedule is similar to those used in prior work [37] (a schematic version of this schedule is sketched at the end of this subsection).

Exploration vs. exploitation as a function of β. From the start of training up to the start of the fine-tuning phase, we linearly ramp the sharpness parameter β from an initial value of 1 to the final value β_max. We interpret the spectrum of updates parameterized by β as being characteristic of an exploration-exploitation tradeoff. In Figure 3, we illustrate this phenomenon by plotting Pearson correlation coefficients between sparsity masks at different stages of training. We observe that iterative magnitude pruning converges on the final mask relatively early in training, which is indicative of insufficient exploration of different sparsity patterns; consequently, this model achieves lower validation accuracy than the remaining models. The three models trained with Spartan each ramp β from 1 at the start of training to β_max at epoch 80. The value of β_max correlates well both with the Pearson correlation with the final mask and with the Pearson correlation between masks at the end of subsequent epochs. Spartan thus interpolates between the high-exploration regime of standard dual averaging and the low-exploration regime of iterative magnitude pruning. In this example, the intermediate setting β_max = 10 achieves the highest top-1 validation accuracy of 74.6%, with Top-KAST at 73.5% and IMP at 68.2%.
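As a concrete illustration of the 20/60/20 schedule described above, the following pure-Python helper returns a per-epoch budget, sharpness value, and mask-freezing flag. The function name and the choice to anneal the budget k linearly from fully dense are our own illustrative assumptions, not code from the released implementation.

```python
def spartan_schedule(epoch, total_epochs, d, k_target, beta_max):
    """Per-epoch Spartan schedule (illustrative sketch).

    Returns (k, beta, freeze_mask): the sparsity budget, the sharpness
    parameter, and whether the mask is frozen for fine-tuning."""
    warmup_end = 0.2 * total_epochs      # warmup: first 20% of epochs
    finetune_start = 0.8 * total_epochs  # fine-tuning: final 20% of epochs
    # Anneal the budget from fully dense (d) to the target during warmup,
    # then hold the target budget for the rest of training.
    frac = min(epoch / warmup_end, 1.0)
    k = round(d - (d - k_target) * frac)
    # Linearly ramp the sharpness from 1 to beta_max until fine-tuning begins.
    beta = 1.0 + (beta_max - 1.0) * min(epoch / finetune_start, 1.0)
    # During fine-tuning, the mask computed at the start of the phase is reused.
    freeze_mask = epoch >= finetune_start
    return k, beta, freeze_mask

# Example: a 100-epoch run targeting 95% sparsity on a 25.6M-parameter model.
for epoch in (0, 10, 20, 50, 80, 99):
    print(epoch, spartan_schedule(epoch, 100, d=25_600_000,
                                  k_target=1_280_000, beta_max=10.0))
```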
Table 1: Top-1 accuracies on the ImageNet-1K validation set with fully dense training.

| Method | Epochs | Accuracy (%) |
|---|---|---|
| ResNet-50 | 100 | 77.08 ± 0.11 |
| ResNet-50 | 200 | 77.46 ± 0.09 |
| ResNet-50 | 400 | 77.17 ± 0.04 |
| ViT-B/16 | 100 | 80.06 ± 0.11 |

Table 2: ResNet-50 top-1 accuracies on the ImageNet-1K validation set at varying levels of sparsity and epochs of training. When available, we report means and standard deviations over 3 trials.

| Method | Epochs | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| RigL (ERK) [16] | 100 | 75.1 ± 0.05 | 73.0 ± 0.04 | 69.7 ± 0.17 | - | - |
| | 500 | 77.1 ± 0.06 | 76.4 ± 0.05 | 74.5 ± 0.09 | - | - |
| STR [25] | 100 | 76.19 | 74.31 | 70.40 | 62.84 | 51.82 |
| ProbMask [47] | 100 | - | 74.68 | 71.50 | 66.83 | 61.07 |
| OptG [46] | 100 | - | 74.28 | 72.38 | - | 62.10 |
| IMP | 100 | 75.3 ± 0.07 | 73.7 ± 0.14 | 70.6 ± 0.05 | - | - |
| Top-KAST | 100 | 76.08 ± 0.02 | 75.13 ± 0.03 | 73.19 ± 0.02 | - | - |
| (with PP) | 100 | 76.24 ± 0.07 | 75.23 ± 0.02 | 73.25 ± 0.02 | - | - |
| (with ERK) | 100 | 76.42 ± 0.03 | 75.51 ± 0.05 | - | - | - |
| (with PP & ERK) | 100 | 76.76 ± 0.08 | 75.74 ± 0.08 | - | - | - |
| | 200 | 77.51 ± 0.03 | 76.94 ± 0.10 | - | - | - |
| | 300 | 77.64 ± 0.05 | 77.16 ± 0.19 | - | - | - |
| Top-KAST (our run) | 100 | 76.66 ± 0.04 | 75.48 ± 0.15 | 73.51 ± 0.16 | 70.23 ± 0.05 | - |
| | 200 | 77.55 ± 0.08 | 76.84 ± 0.11 | 75.20 ± 0.11 | 71.72 ± 0.06 | - |
| | 400 | 77.75 ± 0.09 | 77.37 ± 0.07 | 75.90 ± 0.04 | 72.58 ± 0.13 | - |
| Spartan | 100 | 76.89 ± 0.09 | 76.17 ± 0.10 | 74.68 ± 0.24 | 71.95 ± 0.11 | 63.87 ± 0.12 |
| | 200 | 77.56 ± 0.19 | 77.06 ± 0.13 | 75.92 ± 0.01 | 73.41 ± 0.15 | 65.77 ± 0.03 |
| | 400 | 77.57 ± 0.08 | 77.40 ± 0.06 | 76.48 ± 0.20 | 74.15 ± 0.10 | 66.80 ± 0.03 |

Table notes: Reported accuracies for models closest to the listed sparsity level. Model trained at 98% sparsity.

4 Empirical Evaluation

In this section, we report the results of our sparse training experiments on the ImageNet-1K dataset with two standard architectures: ResNet-50 [21] and ViT-B/16 [15], consisting of 25.6M and 86.4M parameters respectively. On ResNet-50, we evaluate only unstructured pruning, whereas on ViT-B/16, we evaluate both unstructured and block structured pruning. We subsequently present empirical studies of the sensitivity of Spartan to the value of β, the effect of running Spartan without the dual averaging step, and the computational overhead of soft top-k masking.

4.1 ImageNet-1K Classification

In all our experiments, we run Spartan with the training schedule described in Section 3.2. We train and evaluate our models on the ImageNet-1K dataset with the standard training-validation split and report means and standard deviations over 3 independent trials. We provide additional details on our experimental setup in the Appendix.

ResNet-50 experimental setup. For consistency with our baselines, we use standard data augmentation with horizontal flips and random crops at 224×224 resolution. For all Spartan runs, we use β_max = 10, which we selected based on models trained at 95% sparsity. We sparsify the parameters of all linear and convolutional layers with a global sparsity budget, excluding bias parameters and the parameters of batch normalization layers. Our baselines are iterative magnitude pruning [48], RigL with the Erdos-Renyi-Kernel (ERK) sparsity distribution [16], Soft Threshold Weight Reparameterization (STR) [25], probabilistic masking (ProbMask) [47], OptG [46], and Top-KAST with Powerpropagation and ERK [24; 37]. We additionally re-run the most performant baseline method, Top-KAST, using a reimplementation in our own codebase. For Top-KAST, we exclude the first convolutional layer from pruning (following [37; 16]) and we use fully dense backward passes (i.e., with a backward sparsity of 0), since this setting achieved the highest accuracy in prior work [24]. We use mixed precision training with a batch size of 4096 on 8 NVIDIA A100 GPUs.
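As a minimal sketch of how the global sparsity budget above can be assembled in PyTorch (our own illustrative helper; it does not reproduce details such as the first-layer exclusion used for some baselines):

```python
import torch
import torchvision

def global_budget(model: torch.nn.Module, sparsity: float):
    """Collect prunable weights (linear and convolutional layers only; biases and
    batch-norm parameters stay dense) and compute the global budget k."""
    prunable = [m.weight for m in model.modules()
                if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    total = sum(w.numel() for w in prunable)
    k = int(round((1.0 - sparsity) * total))
    return k, prunable

model = torchvision.models.resnet50()
k, prunable = global_budget(model, sparsity=0.95)
print(f"keep {k} of {sum(w.numel() for w in prunable)} prunable weights")
```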
Table 3: ViT-B/16 top-1 accuracies on the ImageNet-1K validation set at 90% sparsity with unstructured sparsity and block structured pruning of attention layers.

| Method | Epochs | Unstructured | 16×16 blocks | 32×32 blocks |
|---|---|---|---|---|
| Top-KAST | 100 | 78.05 ± 0.07 | 67.69 ± 0.23 | 64.79 ± 0.14 |
| | 300 | 80.86 ± 0.03 | 74.11 ± 0.20 | 70.67 ± 0.96 |
| Spartan | 100 | 79.88 ± 0.16 | 75.62 ± 0.07 | 74.50 ± 0.23 |
| | 300 | 81.18 ± 0.04 | 79.12 ± 0.16 | 78.43 ± 0.10 |

ViT experimental setup. We use the ViT-B architecture with 16×16 patches at 224×224 input resolution. We augment the training data using RandAugment [10], MixUp [45] and CutMix [44]. Our ViT models are trained from random initialization, without any pretraining. We set β_max = 20 for Spartan with unstructured sparsity, and β_max = 320 and β_max = 640 for Spartan with 16×16 and 32×32 blocks respectively. These latter settings are the values of β_max for the unstructured case scaled up by factors of 16 and 32: since each block averages the magnitudes of B² entries, we expect the variance of the block values to correspondingly decrease by a factor of B², and we thus compensate by scaling up β by B. In the block structured case, we exclude the input convolutional layer and the output classification head from pruning since their parameter dimensions are not divisible by the block size. We use mixed precision training with a batch size of 4096 on 16 NVIDIA A100 GPUs across 2 nodes.

Results. Table 1 lists the top-1 validation accuracies achieved by fully dense ResNet-50 and ViT-B/16 models. Table 2 reports validation accuracies for ResNet-50 models at 80%, 90%, 95%, 97.5% and 99% sparsity, and Table 3 reports validation accuracies for ViT at 90% sparsity. In the Appendix, we report additional measurements of inference FLOP costs for our sparse models and the results of experiments with FLOP-sensitive pruning.

For ResNet-50 models, we find that Spartan outperforms all our baselines across all training durations at sparsity levels of 90% and above. In particular, Spartan achieves a mean top-1 accuracy within 1% of fully dense training at 95% sparsity. We observe that additional epochs of sparse training consistently improve the final generalization accuracy; in contrast, validation accuracy peaks at 200 epochs for dense training. This trend persists at 800 training epochs, where Spartan achieves 67.18 ± 0.13 top-1 accuracy at 99% sparsity. For the Top-KAST baseline, we omit results at 99% sparsity due to training instability. We note that the accuracy improvements of the Spartan-trained ResNet-50 models relative to Top-KAST are not a result of increased FLOP counts at a given level of sparsity, as can be seen from the FLOP measurements reported in Appendix E.

For ViT-B/16 models, Spartan outperforms Top-KAST for both unstructured and block structured pruning. We observe a particularly large improvement over Top-KAST in the block structured case, where Spartan improves absolute validation accuracy by 7.8% for 32×32 block structured pruning. For unstructured pruning, Spartan achieves comparable accuracy to SViTE [7] (81.2% for Spartan vs. 81.3% for SViTE), but with 30% higher sparsity (90% for Spartan vs. 60% for SViTE). In exchange for a 0.6% reduction in accuracy relative to DeiT-B [38], Spartan reduces model storage cost by 10× and the FLOP cost of inference by 7.4×.
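To make the block structured variant evaluated above concrete, here is a hedged sketch (our own helper names and reshaping convention) of how per-block values and costs can be formed from a 2-D weight matrix, and how a per-block mask is broadcast back to the weight shape, following the description in Section 3.1; the resulting values and costs can be fed to a soft top-k routine such as the one sketched there.

```python
import torch

def block_values_and_costs(weight: torch.Tensor, block: int):
    """One mask variable per (block x block) tile: its value is the summed
    magnitude of the tile's entries and its cost is the tile's entry count."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "shape must be divisible by block"
    tiles = weight.reshape(rows // block, block, cols // block, block)
    values = tiles.abs().sum(dim=(1, 3)).flatten()
    costs = torch.full_like(values, float(block * block))
    return values, costs

def expand_block_mask(block_mask: torch.Tensor, shape, block: int) -> torch.Tensor:
    """Broadcast one mask value per tile back to the full weight shape."""
    rows, cols = shape
    m = block_mask.reshape(rows // block, cols // block)
    return m.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

w = torch.randn(64, 64)
values, costs = block_values_and_costs(w, block=16)  # 16 tiles, cost 256 each
print(values.shape, costs[0])
```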
4.2 Sensitivity and Ablation Analysis

Figure 4: Effect of β and dual averaging (left). Top-1 ImageNet-1K validation accuracies for Spartan with and without dual averaging (DA) and with varying β_max at 95% sparsity. Computational overhead (right). Percentage increase in per-iteration wall clock runtime over dense training for Spartan with standard Sinkhorn iteration, Sinkhorn with dual caching, and Sinkhorn with sorting.

Figure 4 (left) shows the effect of ablating the dual averaging step in Spartan (i.e., omitting hard top-k projection in the forward pass) over a range of settings of β_max for 95% sparse ResNet-50 models trained for 100 epochs. The dashed line shows top-1 accuracy with Top-KAST. For Spartan training without dual averaging, we compute a hard top-k mask at the end of epoch 80, and as with standard Spartan training, we fix this mask until the end of training. In the absence of top-k projection, we find that accuracy increases with increasing β_max up to β_max = 80; at lower settings of β_max, the higher approximation error of soft masking is detrimental to the final accuracy of the model. In contrast, the use of top-k projection in the forward pass mitigates this mismatch between training and inference and improves the final accuracy of the sparse model.

4.3 Computational Overhead

We evaluate the computational overhead of Spartan over standard dense training by measuring wall clock training times for a ResNet-50 model (Figure 4, right). Our benchmark measures runtime on a single NVIDIA A100 GPU over 50 iterations of training with a batch size of 256. We use random inputs and labels to avoid incurring data movement costs. We compare three approaches: standard Sinkhorn iteration, Sinkhorn iteration with the dual variable µ initialized using its final value from the previous iteration ("dual caching"), and Sinkhorn iteration with µ initialized using the threshold value β|θ_{i_k}|/c_{i_k}, where we compute the index i_k of the kth largest entry of the normalized value vector |θ|/c by sorting its entries in each iteration ("sorting").

Standard Sinkhorn iteration incurs higher overheads as β increases; this is due to the additional iterations required to reach convergence as the regularized OT problem more closely approximates the original LP. We find that dual caching and sorting both prevent this growth in runtime over the range of β values that we tested. In our remaining experiments, we use the simple sorting-based approach since it corresponds to the lowest relative overhead over standard dense training (approximately 5%). We note that this relative overhead decreases as the batch size increases, since the cost of computing the mask is independent of the batch size. Spartan is therefore best suited to training with large batch sizes, since we amortize the cost of mask updates over the size of the batch.

5 Discussion

Extensions. As an alternative to magnitude-based pruning, we may also apply Spartan in conjunction with learned value parameters, as in methods such as Movement Pruning [35]. In this approach, we would compute masks using a set of auxiliary parameters φ ∈ R^d instead of the magnitudes |θ|: σ^β_k(θ; φ) = θ ⊙ softtopk(φ, k, β), and similarly for hard projection. We remark that while this variant requires additional memory during training to store the value parameters, there is no additional cost during inference since the sparsity mask is fixed at the end of training.
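A hedged sketch of this learned-value variant (hypothetical module and names; it assumes a `soft_topk_mask` routine like the one sketched in Section 3.1, and a simple zero initialization of φ chosen for illustration):

```python
import torch

class LearnedValueMask(torch.nn.Module):
    """Mask scores come from auxiliary parameters phi rather than |theta|,
    in the spirit of Movement Pruning (illustrative sketch only)."""

    def __init__(self, weight: torch.nn.Parameter):
        super().__init__()
        self.weight = weight
        # Learned value scores; a real implementation would likely choose a
        # more informative initialization than zeros.
        self.phi = torch.nn.Parameter(torch.zeros_like(weight))

    def masked_weight(self, k: int, beta: float, soft_topk_mask) -> torch.Tensor:
        # soft_topk_mask(values, k, beta) -> m in [0, 1]^d, as in Section 3.1;
        # the scores are phi rather than the weight magnitudes.
        m = soft_topk_mask(self.phi.flatten(), k, beta).view_as(self.weight)
        return self.weight * m
```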
Limitations. Since Spartan retains a dense parameter vector and computes dense backward passes during training, it incurs higher memory and computational costs in each iteration than methods like RigL [16] that use both sparse parameters and sparse backward passes. Nevertheless, we note that in terms of total computational or energy cost over a full training run, Spartan may remain a competitive option, as it requires fewer iterations of training to reach a given accuracy threshold relative to these sparse-to-sparse methods. However, we do not compare total training FLOP costs in our empirical evaluation.

A further limitation is that cost-sensitive pruning with Spartan is only compatible with relatively crude linear cost models of the form c^T m, where c is the cost vector and m is the mask. This restriction arises due to the requirements of the regularized OT formulation used to compute the soft top-k mask. In particular, this limitation precludes the use of cost models involving interaction terms, such as those arising from spatially coherent sparsity patterns. For example, the cost models that we consider here cannot encode the notion that coherently pruning an entire row or column of a parameter matrix will yield more cost savings than pruning random entries of the matrix.

Optimization with Dual Averaging. As discussed in Section 3, Spartan and Top-KAST are both applications of the dual averaging method [34; 41]. The empirically observed effectiveness of dual averaging in sparse training is reminiscent of its success in a related domain where continuous optimization is complicated by discrete constraints: neural network quantization. In this area, the BinaryConnect algorithm [9], which trains binary neural networks with parameters in {±1}^d, is considered to be a foundational method. As observed by Bai et al. [2], BinaryConnect is itself a special case of dual averaging, and the theoretical underpinnings of this connection to dual averaging were further analyzed by Dockhorn et al. [14]. These parallels with neural network quantization suggest that similar ideas may well apply towards deepening our understanding of sparse learning.

Societal Impacts. Inference with deep neural network models is a computationally intensive process. At present, the total energy footprint associated with serving these models in production is expected to continue growing in tandem with the rising prevalence of large transformer-based architectures in vision and NLP applications. Research towards improving the energy efficiency of deep neural networks is therefore an important counterbalancing force against increasing resource usage by these models. The development of sparse learning algorithms is particularly relevant to these efforts, and we expect that the impact of these approaches will further increase as sparsity-aware hardware acceleration becomes more widely available.

6 Conclusions & Future Work

In this work, we describe a sparse learning algorithm that interpolates between two parameter update schemes: standard stochastic gradient updates with hard masking and the dual averaging method. We show that there exists an intermediate regime between these two methods that yields improved generalization accuracy for sparse convolutional and transformer models, particularly at higher levels of sparsity.
While we have demonstrated promising empirical results with our proposed method, the learning dynamics of stochastic optimization for deep networks under sparsity constraints remain relatively poorly understood from a theoretical standpoint. There thus exists ample opportunity for further work towards better understanding sparse learning algorithms, which may in turn inspire future algorithmic advances in this area.

Acknowledgements and Disclosure of Funding

We are grateful to Trevor Gale for his feedback on this work. We also thank our anonymous reviewers for their valuable suggestions on improving the manuscript. This work was funded by Meta.

References

[1] Kambiz Azarian, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. Learned threshold pruning. arXiv preprint arXiv:2003.00075, 2020.
[2] Yu Bai, Yu-Xiang Wang, and Edo Liberty. ProxQuant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2018.
[3] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations, 2018.
[4] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), 2015.
[5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[6] Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. arXiv preprint arXiv:2201.00814, 2022.
[7] Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. In Advances in Neural Information Processing Systems, 2021.
[8] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.
[9] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 2015.
[10] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
[11] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
[12] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, 2019.
[13] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.
[14] Tim Dockhorn, Yaoliang Yu, Eyyüb Sari, Mahdi Zolnouri, and Vahid Partovi Nia. Demystifying and generalizing BinaryConnect. In Advances in Neural Information Processing Systems, 2021.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[16] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2020.
[17] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
[18] Scott Gray, Alec Radford, and Diederik P Kingma. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
[19] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015.
[20] Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, 1992.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[22] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 2021.
[23] Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 1989.
[24] Siddhant Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. Top-KAST: Top-K Always Sparse Training. In Advances in Neural Information Processing Systems, 2020.
[25] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, 2020.
[26] François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[27] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 2, 1989.
[28] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, 2018.
[29] C Louizos, M Welling, and DP Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
[30] Jian-Hao Luo and Jianxin Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition, 2020.
[31] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 2018.
[32] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 2019.
[33] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, 1988.
[34] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 2009.
[35] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. In Advances in Neural Information Processing Systems, 2020.
[36] Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. In Advances in Neural Information Processing Systems, 2020.
[37] Jonathan Schwarz, Siddhant Jayakumar, Razvan Pascanu, Peter Latham, and Yee Teh. Powerpropagation: A sparsity inducing weight reparameterisation. In Advances in Neural Information Processing Systems, 2021.
[38] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021.
[39] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[40] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408, 2022.
[41] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, 2009.
[42] Yujia Xie, Hanjun Dai, Minshuo Chen, Bo Dai, Tuo Zhao, Hongyuan Zha, Wei Wei, and Tomas Pfister. Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems, 2020.
[43] Dong Yu, Frank Seide, Gang Li, and Li Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[44] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In IEEE/CVF International Conference on Computer Vision, 2019.
[45] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[46] Yuxin Zhang, Mingbao Lin, Mengzhao Chen, Zihan Xu, Fei Chao, Yunhan Shen, Ke Li, Yongjian Wu, and Rongrong Ji. Optimizing gradient-driven criteria in network sparsity: Gradient is all you need. arXiv preprint arXiv:2201.12826, 2022.
[47] Xiao Zhou, Weizhong Zhang, Hang Xu, and Tong Zhang. Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[48] Michael Zhu and Suyog Gupta. To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. In International Conference on Learning Representations, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 5.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4 and the Appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]