# Spartan: Differentiable Sparsity via Regularized Transportation

Kai Sheng Tai, Taipeng Tian, Ser-Nam Lim

Correspondence to: Kai Sheng Tai (kst@meta.com)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.

1 Introduction

Sparse learning algorithms search for model parameters that minimize training loss while retaining only a small fraction of non-zero entries. Parameter sparsity yields benefits along several axes: reduced model storage costs, greater computational and energy efficiency during training and inference, and potentially improved model generalization [33]. However, sparse training is a challenging optimization problem: for the general problem of learning parameters θ ∈ R^d subject to the constraint that θ is k-sparse, the feasible set is the union of axis-aligned hyperplanes intersecting at the origin, each of dimension k. This complex, nonconvex geometry of the feasible set compounds the existing difficulty of optimizing deep neural network models.

This work focuses on improving the generalization error of sparse neural network models. To this end, we introduce Spartan (Sparsity via Regularized Transportation), a sparse learning algorithm that leverages an optimal transportation-based top-k masking operation to induce parameter sparsity during training. Spartan belongs to the family of parameter-dense training algorithms that maintain a dense parameter vector θ ∈ R^d throughout training [48; 24], in contrast to parameter-sparse training algorithms that adhere to an O(k) memory budget for representing the parameters of a k-sparse model [3; 31; 32; 13; 16]. While computational cost and memory usage at training time are important considerations, Spartan primarily optimizes for performance at inference time.

Intuitively, Spartan aims to achieve a controlled transition between the exploration and exploitation of various sparsity patterns during training. In the exploration regime, the goal is for the learner to easily transition between differing sparsity patterns in order to escape those that correspond to poor minima of the loss function. On the other hand, the goal of exploitation is for the learner to optimize model parameters given a fixed sparsity pattern, or a small set of similar patterns. This latter setting avoids frequent oscillations between disparate sparsity patterns, which may be detrimental to the optimization process.

Figure 1: Soft top-k masking. (a, b) The soft top-k masking operator σ^β_k computes approximately k-sparse outputs, with the sharpness of the mask controlled by the parameter β. (c) Small updates to iterates far from the k-sparse feasible set (θ_t → θ_{t+1}) can correspond to a large perturbation in parameter space after projection by Π_k. Updates to iterates θ_t in the approximately sparse region (shaded in grey) correspond to smaller post-projection perturbations.
We operationalize this intuition through the use of a differentiable soft top-k masking operation σ^β_k : R^d → R^d. This function maps parameters θ to an approximately sparse output that suppresses low-magnitude entries in θ. We parameterize the soft top-k mask with a sharpness parameter β: at β = 0, σ^β_k simply scales the input by a constant, and as β → ∞, the mapping reduces to hard top-k magnitude-based selection (Figure 1 (a, b)). Soft top-k masking therefore constrains the iterates θ_t to be close to the set of exactly k-sparse vectors, with the strength of this constraint mediated by the parameter β. Figures 1 (c) and 2 give some geometric intuition for the effect of this mapping. We implement soft top-k masking using a regularized optimal transportation formulation [11; 42] and demonstrate that this technique scales to networks on the order of 10^8 parameters in size.

We evaluate Spartan on ResNet-50 [21] and ViT [15] models trained on the ImageNet-1K dataset. On ResNet-50, we find that sparse models trained with Spartan achieve higher generalization accuracies than those trained with existing methods at sparsity levels of 90% and above. In particular, we train ResNet-50 models to 76.5% top-1 validation accuracy at 95% sparsity and to 74.2% accuracy at 97.5% sparsity, improving on the previous state-of-the-art by 0.6% and 1.6% respectively. Our sparse ViT-B/16 models reduce model storage size by 10× and inference FLOPs by 7.4× at the cost of a 0.6% accuracy reduction relative to DeiT-B [38]. We further demonstrate that Spartan is effective for block structured pruning, a form of structured sparsity that is more amenable to acceleration on current GPU hardware than unstructured sparsity [18]. On a ViT-B/16 model with 16×16 block structured pruning, Spartan achieves 79.1% top-1 accuracy at 90% sparsity, compared to the baseline of 74.1% at the same sparsity level.

To summarize, we make the following contributions in this paper:

- We present Spartan, a sparsity-inducing training algorithm based on a soft top-k masking operation. We show that Spartan interpolates between two existing sparse learning algorithms: iterative magnitude pruning [48] and Top-K Always Sparse Training [24].
- We empirically evaluate Spartan using ResNet-50 and ViT models trained on the ImageNet-1K dataset, demonstrating consistent improvements over the prior state-of-the-art.
- We study the effect of Spartan's hyperparameters on an exploration-exploitation tradeoff during training and on the final accuracy of the trained models.
- We provide an open source implementation of Spartan at https://github.com/facebookresearch/spartan.

Notation. We use R_+ and R_++ to denote the nonnegative real numbers and the strictly positive real numbers respectively. 1_d is the all-ones vector of dimension d. Given a vector x ∈ R^d, ‖x‖_p denotes the p-norm of x, ‖x‖_0 is the number of nonzero entries in x, and |x| is the vector of elementwise absolute values. When x and y are vectors, x ⊙ y denotes elementwise multiplication and x/y denotes elementwise division. The operator Π_k(x) := argmin_{y : ‖y‖_0 ≤ k} ‖x − y‖_2 denotes Euclidean projection onto the set of k-sparse vectors.

Figure 2: A 2D example of soft masking. We plot probability densities corresponding to the action of the soft top-1 operator σ^β_1 on the 2D standard Gaussian distribution (darker colors indicate higher density). Specifically, we visualize the densities of σ^β_1(x) for x ∼ N(0, I_2) over a range of sharpness parameters β. Higher values of β constrain iterates to be closer to the k-sparse feasible set.
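To make the notation concrete, the following is a minimal PyTorch sketch of the hard top-k projection Π_k as magnitude-based masking; the function name and tensor shapes are our own illustrative choices rather than code from the released implementation.

```python
import torch

def hard_topk_project(theta: torch.Tensor, k: int) -> torch.Tensor:
    """Euclidean projection onto the set of k-sparse vectors: keep the k
    largest-magnitude entries of theta and zero out the rest."""
    flat = theta.flatten()
    _, idx = torch.topk(flat.abs(), k)  # indices of the k largest |theta_i|
    mask = torch.zeros_like(flat)
    mask[idx] = 1.0
    return (flat * mask).view_as(theta)

# Example: project a random vector onto the set of 3-sparse vectors.
x = torch.randn(10)
print(hard_topk_project(x, k=3))
```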
2 Related Work

Neural network pruning. Our use of a magnitude-based criterion for sparse learning draws on early work in the area of neural network pruning [23]. Magnitude-based pruning is computationally cheap relative to alternative criteria that rely on first- or second-order information [33; 27; 20], and is a perhaps surprisingly performant option despite its simplicity [17]. More generally, Spartan builds on a substantial body of previous work that aims to jointly optimize sparsity patterns and model parameters for deep neural networks [43; 8; 19; 29; 28; 48, inter alia; see, e.g., 22 for a survey].

Spartan as a generalization of existing methods. We highlight two particularly relevant sparse training methods in the literature: the iterative magnitude pruning (IMP) method (Algorithm 1, [48]) and Top-K Always Sparse Training, or Top-KAST (Algorithm 2, [24]):

Algorithm 1: Iterative magnitude pruning update
Input: parameters θ_t, loss function L(·), sparsity budget k, step size η_t
Output: parameters θ_{t+1}
1: θ̃_t = Π_k(θ_t)
2: θ_{t+1} = θ_t − η_t ∇θ̃_t ∇L(θ̃_t)
3: return θ_{t+1}

Algorithm 2: Dual averaging / Top-KAST update
Input: parameters θ_t, loss function L(·), sparsity budget k, step size η_t
Output: parameters θ_{t+1}
1: θ̃_t = Π_k(θ_t)
2: θ_{t+1} = θ_t − η_t ∇L(θ̃_t)
3: return θ_{t+1}

Note that both Algorithms 1 and 2 compute the forward pass through the network using the sparsified parameters θ̃, obtained by setting all but the top-k entries of θ by magnitude to zero. Algorithms 1 and 2 differ in the presence of the ∇θ̃_t term in line 2. This is a d × d diagonal matrix where (∇θ̃_t)_{ii} = 1 iff θ_{t,i} was not set to zero by the projection Π_k, and (∇θ̃_t)_{ii} = 0 otherwise.

At the extreme points of the sharpness parameter β, Spartan's parameter update procedure reduces to those of IMP and Top-KAST. We can thus view Spartan as a method that generalizes and smoothly interpolates between this pair of algorithms. As β → ∞, Spartan is equivalent to IMP: this approach sparsifies parameters using a top-k magnitude-based binary mask, and entries that were masked in the forward pass receive zero gradient in the backward pass. At β = 0, Spartan reduces to Top-KAST: this method again sparsifies parameters by magnitude with a binary mask, but unlike IMP, all entries are updated in the backward pass using the gradient of the loss with respect to the masked parameters. The Top-KAST update is thus an application of the straight-through gradient method [5; 9], otherwise known as lazy projection or dual averaging (DA) in optimization [34; 41; 2; 14]. See Sec. 5 for further discussion on this point.
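The contrast between the two updates can be made concrete with a short PyTorch sketch. This is an illustrative single-step implementation under our own naming, not the reference code: in the IMP step the gradient flows through the hard mask, so pruned entries receive zero gradient, while in the dual averaging step the gradient of the loss at the masked parameters is applied densely to θ.

```python
import torch

def _project_topk(theta: torch.Tensor, k: int) -> torch.Tensor:
    # Hard top-k magnitude mask; the binary mask itself is treated as a constant.
    _, idx = torch.topk(theta.abs().flatten(), k)
    mask = torch.zeros_like(theta.flatten())
    mask[idx] = 1.0
    return theta * mask.view_as(theta)

def imp_step(theta, loss_fn, k, lr):
    """Algorithm 1: entries zeroed by the projection receive zero gradient."""
    theta = theta.detach().clone().requires_grad_(True)
    loss_fn(_project_topk(theta, k)).backward()
    return (theta - lr * theta.grad).detach()

def dual_averaging_step(theta, loss_fn, k, lr):
    """Algorithm 2 (Top-KAST): dense straight-through update from the gradient
    of the loss evaluated at the hard-masked parameters."""
    masked = _project_topk(theta, k).detach().requires_grad_(True)
    loss_fn(masked).backward()
    return theta - lr * masked.grad  # every coordinate of theta is updated

# Toy loss whose gradient is nonzero at pruned (zeroed) entries.
theta = torch.randn(8)
loss_fn = lambda w: ((w - 1.0) ** 2).sum()
print(imp_step(theta, loss_fn, k=2, lr=0.1))             # pruned entries unchanged
print(dual_averaging_step(theta, loss_fn, k=2, lr=0.1))  # all entries move
```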
Smooth approximations. Our use of the soft top-k operation is related to prior methods that use the logistic sigmoid function as a differentiable approximation to the step function for sparse training [30; 36; 1]. These approaches similarly regulate the sharpness of the approximation with a temperature parameter that scales the input logits to the sigmoid function. A distinctive characteristic of Spartan is that the parameter k directly controls the degree of sparsity of the mask; this differs from previous approaches involving the logistic sigmoid approximation, which only indirectly control the degree of sparsity using an L1 penalty term.

Transformer-specific methods. Several approaches are specialized to transformer-based architectures. These include structured pruning of pre-trained transformers for NLP tasks [39; 26; 40] and sparse training methods specifically applied to vision transformers [7; 6]. In contrast, Spartan is a general-purpose sparse training algorithm that is designed to be agnostic to the model architecture.

3 Sparsity via Regularized Transportation

We begin by motivating our proposed approach. A key advantage of the dual averaging method (Algorithm 2) over IMP (Algorithm 1) is that it mitigates the issue of gradient sparsity. In the IMP update, only those parameters that were not masked in the forward pass receive a nonzero gradient, resulting in slow progress at high levels of sparsity. In contrast, dual averaging applies a dense update to the parameter vector in each iteration; intuitively, these dense updates can help revive parameters that are below the top-k magnitude threshold and are consequently masked to zero. Dual averaging iterates are thus able to more quickly explore the space of possible sparsity patterns.

On the other hand, large variations in the sparsity patterns realized post-projection can lead to instability in optimization, thus hampering the final performance of the model. For instance, Top-KAST empirically benefits from the application of additional sparsity-inducing regularization on the parameters [24 (Sec. 2.3), 37]. This issue motivates our use of a soft top-k approximation as a mechanism for controlling the stability of our training iterates.

Each iteration of Spartan consists of the following two steps (Algorithm 3): (1) an approximately k-sparse masking of the model parameters using the soft top-k operator, and (2) a dual averaging-based parameter update with the set of exactly k-sparse vectors as the feasible set. This procedure aims to combine the advantages of Algorithms 1 and 2 within a single parameter update scheme. Note that the Spartan parameter update is essentially identical to Algorithm 2, save for the inclusion of the intermediate soft top-k masking step.

Algorithm 3: Spartan parameter update
Input: parameters θ_t, loss function L(·), sparsity budget k, sharpness parameter β, step size η_t
Output: parameters θ_{t+1}
1: σ^β_k(θ_t) = θ_t ⊙ softtopk(|θ_t|, k, β)   ▷ apply soft masking
2: θ̃_t = Π_k(σ^β_k(θ_t))   ▷ project onto k-sparse set
3: θ_{t+1} = θ_t − η_t ∇σ^β_k(θ_t) ∇L(θ̃_t)   ▷ compute dual averaging update
4: return θ_{t+1}

In the following, we begin by describing the soft top-k masking operation and address the issues of scalability and of applying soft top-k masking to induce structured sparsity. We then discuss how the update procedure outlined in Algorithm 3 can be incorporated within a complete training loop.
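As a rough PyTorch rendering of Algorithm 3 (a sketch under our own naming, not the reference implementation): the soft mask is applied differentiably, the hard projection Π_k enters only the forward pass, and the backward pass is routed straight through the projection and through the soft mask. It assumes a `soft_topk_mask(values, k, beta)` helper such as the Sinkhorn-based sketch given in Section 3.1 below, and a flat parameter vector.

```python
import torch

def spartan_step(theta, loss_fn, k, beta, lr, soft_topk_mask):
    """One Spartan update (sketch of Algorithm 3) on a flat parameter vector.

    soft_topk_mask(values, k, beta) is assumed to return m in [0, 1]^d,
    e.g. the Sinkhorn-based routine sketched in Section 3.1."""
    theta = theta.detach().clone().requires_grad_(True)
    # (1) Soft masking: sigma^beta_k(theta) = theta * softtopk(|theta|, k, beta).
    soft = theta * soft_topk_mask(theta.abs(), k, beta)
    # (2) Hard projection Pi_k of the soft-masked parameters (no gradient).
    _, idx = torch.topk(soft.detach().abs(), k)
    hard = torch.zeros_like(soft)
    hard[idx] = soft.detach()[idx]
    # (3) Straight-through / dual averaging: the forward pass sees `hard`,
    #     while the backward pass differentiates through `soft`.
    forward = soft + (hard - soft.detach())
    loss_fn(forward).backward()
    return (theta - lr * theta.grad).detach()
```

Treating the projection as the identity in the backward pass is what makes this a dual averaging (straight-through) update; taking β → ∞ or β = 0 in the soft mask recovers the IMP and Top-KAST behaviors discussed in Section 2.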
3.1 Soft Top-k Masking

Our soft top-k masking scheme is based on the soft top-k operator softtopk(v, k, β) described by Xie et al. [42]. softtopk takes as input a vector v ∈ R^d, a budget parameter k > 0, and a sharpness parameter β ≥ 0, and outputs m ∈ [0, 1]^d, a smoothed version of the top-k indicator vector. By using the magnitudes |θ| of the parameter vector as a measure of the value of each entry, we obtain a soft top-k magnitude pruning operator σ^β_k(θ) by masking the entries of the parameter vector with the output of the soft top-k operator: σ^β_k(θ) := θ ⊙ softtopk(|θ|, k, β).

Algorithm 4: Soft top-k forward pass
Input: values v ∈ R^d, costs c ∈ R^d_{++}, budget k ∈ (0, 1_d^T c], sharpness parameter β ≥ 0, max. iterations T, tolerance ε, initial dual variable µ_0
Output: mask m ∈ [0, 1]^d, dual variables µ ∈ R, ν ∈ R^d
1: z = βv/c, m_0 = 1_d   ▷ normalize values v by costs c
2: for t = 1, ..., T do
3:   ν_t = log c − log(1_d + exp(z + µ_{t−1}))   ▷ normalize mask entries
4:   µ_t = log k − log Σ_i exp(z_i + ν_{t,i})   ▷ normalize sum to be equal to k
5:   m_t = exp(z + µ_t + ν_t − log c)   ▷ compute mask
6:   if |v^T (m_t − m_{t−1})| < ε |v^T m_{t−1}| then   ▷ check stopping criterion
7:     return m_t, µ_t, ν_t
8: return m_T, µ_T, ν_T

Algorithm 5: Soft top-k backward pass
Input: gradient w.r.t. outputs g ∈ R^d, mask m ∈ [0, 1]^d, costs c, budget k, sharpness parameter β
Output: gradient w.r.t. the values v
1: a_1 = Σ_i g_i m_i (1 − m_i),  a_2 = Σ_i c_i m_i^2
2: return βm ⊙ (1_d − m) ⊙ (g/c − a_1/(k − a_2))

We introduce a mild generalization of the soft top-k operator from Xie et al. [42] by incorporating a strictly positive cost vector c. In particular, we require that the output mask m ∈ [0, 1]^d satisfies the budget constraint c^T m = k. This abstraction is useful for modeling several notions of heterogeneous parameter costs within a model; for instance, parameters that are repeatedly invoked within a network have a higher associated computational cost. As a concrete example, Evci et al. [16] observed that the FLOP count of a ResNet-50 model at a fixed global sparsity can vary by a factor of over 2 due to differences in the individual sparsities of convolutional layers with differing output dimensions.

To derive this cost-sensitive soft top-k operator, we begin by expressing the mask m as the solution to the following linear program (LP):

    min_{m ∈ R^d} −v^T m   subject to   0 ≤ m ≤ 1_d,  c^T m = k.   (1)

Deferring the derivation to the Appendix, we can rewrite this LP as the following optimal transportation (OT) problem:

    min_{Y ∈ R^{d×2}_+} Σ_{ij} C_{ij} Y_{ij}   subject to   Y 1_2 = c,  1_d^T Y = [k, 1_d^T c − k],   (2)

with the cost matrix C := [−v/c, 0]. We can recover the mask from Y as m_i = Y_{i1}/c_i. Now following [42; 11], we introduce entropic regularization to obtain a smoothed version of the top-k operator, with the degree of smoothing controlled by β:

    min_{Y ∈ R^{d×2}_+} Σ_{ij} C_{ij} Y_{ij} − (1/β) H(Y)   subject to   Y 1_2 = c,  1_d^T Y = [k, 1_d^T c − k],   (3)

where H(Y) := −Σ_{ij} Y_{ij}(log Y_{ij} − 1) is the entropy of Y. In this standard form, we can efficiently solve this regularized OT problem using the Sinkhorn-Knopp algorithm [11; 4; 12]. Algorithms 4 and 5 describe the forward and backward passes of the resulting cost-sensitive soft top-k operator (see Appendices B and C for their derivations). The expression for the gradient in Algorithm 5 follows from Theorem 3 in Xie et al. [42] with some algebraic simplification. We remark that this closed-form backward pass is approximate in the sense that it assumes that we obtain the optimal dual variables µ, ν in the forward pass; in practice, we do not encounter issues when using estimates of the dual variables obtained with tolerance ε = 0.01 and a maximum iteration count of T = 100 in Algorithm 4.
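The fixed-point iteration of Algorithm 4 is compact once the row update is folded in, since each mask entry reduces to a logistic function of z_i + µ. The sketch below follows that structure in PyTorch (our own simplified rendering, not the released code); it initializes µ near the hard top-k threshold rather than at a user-supplied µ_0, and it relies on autograd through the unrolled iterations instead of the closed-form backward pass of Algorithm 5.

```python
import torch

def soft_topk_mask(values, k, beta, costs=None, n_iters=100, tol=1e-2):
    """Cost-sensitive soft top-k mask via the fixed-point iteration of Algorithm 4.

    Returns m in [0, 1]^d with approximately costs^T m = k; as beta grows, m
    approaches the hard top-k indicator of values / costs. Assumes integer k."""
    if costs is None:
        costs = torch.ones_like(values)
    z = beta * values / costs
    # Initialize the dual variable near the hard top-k threshold (cf. the
    # "sorting" heuristic discussed under Scalability below).
    threshold = torch.topk(values / costs, int(k)).values[-1]
    mu = -beta * threshold
    m_prev = torch.ones_like(values)
    for _ in range(n_iters):
        # Row normalization: nu = log c - log(1 + exp(z + mu)).
        nu = torch.log(costs) - torch.nn.functional.softplus(z + mu)
        # Budget normalization: pick mu so that c^T m = k.
        mu = torch.log(torch.tensor(float(k))) - torch.logsumexp(z + nu, dim=0)
        # Mask: m = exp(z + mu + nu - log c), i.e. a logistic function of z + mu.
        m = torch.exp(z + mu + nu - torch.log(costs))
        if torch.abs(values @ (m - m_prev)) < tol * torch.abs(values @ m_prev):
            break
        m_prev = m
    return m

v = torch.tensor([3.0, 1.0, 0.5, 2.0])
print(soft_topk_mask(v, k=2, beta=1.0))   # diffuse mask over all entries
print(soft_topk_mask(v, k=2, beta=10.0))  # close to the hard top-2 indicator
```

The paper pairs this forward pass with the closed-form gradient of Algorithm 5; in this sketch, differentiating through the unrolled iterations with autograd plays that role instead.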
Scalability. Since we apply soft top-k masking to high-dimensional parameters in each iteration of training, it is important to minimize the computational overhead of Sinkhorn iteration in the forward pass. We note that in order to compute the hard top-k projection step in dual averaging (as in Algorithm 3), it is necessary to find the index i_k of the kth largest entry of |θ|/c. Given the index i_k, we can use the heuristic initialization µ_0 = −β|θ_{i_k}|/c_{i_k} for the dual variable µ to accelerate the convergence of Sinkhorn iteration. This heuristic is based on the observation that |θ_{i_k}|/c_{i_k} is the threshold value of the normalized value vector |θ|/c in hard top-k masking. Concretely, we demonstrate in Sec. 4.3 that our implementation of Spartan incurs a per-iteration runtime overhead of approximately 5% over standard dense ResNet-50 training.

Structured sparsity. To implement block structured sparsity, we instantiate one mask variable per block and mask all parameters within that block with the same mask variable. To compute the mask, we use the sum of the magnitudes of the entries in each block as the corresponding value v_i. In a standard pruning setting with uniform costs across all parameters, the corresponding cost c_i is simply the total number of entries in the block.

Figure 3: Spartan parameterizes an exploration-exploitation tradeoff. For each run of ResNet-50 training on ImageNet-1K, we plot Pearson correlation coefficients between the sparsity mask at the end of each epoch and the mask obtained at the end of training (left), and between sparsity masks at the end of subsequent epochs (right). Kinks in the correlation curves at epochs 20 and 80 are respectively due to the end of sparsity annealing and the start of fine-tuning with fixed masks (see Section 3.2 for details on the training schedule).

3.2 Training with Spartan

Training schedule. When training with Spartan, we divide the training process into a warmup phase, an intermediate phase, and a fine-tuning phase. In our experiments, the warmup phase consists of the first 20% of epochs, the intermediate phase the next 60%, and the fine-tuning phase the final 20%. In the warmup phase, we linearly anneal the global density of the model (the fraction of retained parameters) from a value of 1 at the start of training to the target level. Throughout the intermediate phase, we maintain the target sparsity level, and during the fine-tuning phase, we fix the sparsity mask to that used at the start of fine-tuning. This training schedule is similar to those used in prior work [37] (a schematic version of this schedule is sketched at the end of this subsection).

Exploration vs. exploitation as a function of β. From the start of training up to the start of the fine-tuning phase, we linearly ramp the sharpness parameter β from an initial value of 1 to the final value β_max. We interpret the spectrum of updates parameterized by β as being characteristic of an exploration-exploitation tradeoff. In Figure 3, we illustrate this phenomenon by plotting Pearson correlation coefficients between sparsity masks at different stages of training. We observe that iterative magnitude pruning converges on the final mask relatively early in training, which is indicative of insufficient exploration of different sparsity patterns; consequently, this model achieves lower validation accuracy than the remaining models. The three models trained with Spartan each ramp β from 1 at the start of training to β_max at epoch 80. The value of β_max correlates well both with the Pearson correlation with the final mask and with the Pearson correlation between masks at the end of subsequent epochs. Spartan thus interpolates between the high-exploration regime of standard dual averaging and the low-exploration regime of iterative magnitude pruning. In this example, the intermediate setting β_max = 10 achieves the highest top-1 validation accuracy of 74.6%, with Top-KAST at 73.5% and IMP at 68.2%.
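As a concrete illustration of the 20/60/20 schedule described above, the following pure-Python helper returns a per-epoch budget, sharpness value, and mask-freezing flag. The function name and the choice to anneal the budget k linearly from fully dense are our own illustrative assumptions, not code from the released implementation.

```python
def spartan_schedule(epoch, total_epochs, d, k_target, beta_max):
    """Per-epoch Spartan schedule (illustrative sketch).

    Returns (k, beta, freeze_mask): the sparsity budget, the sharpness
    parameter, and whether the mask is frozen for fine-tuning."""
    warmup_end = 0.2 * total_epochs      # warmup: first 20% of epochs
    finetune_start = 0.8 * total_epochs  # fine-tuning: final 20% of epochs
    # Anneal the budget from fully dense (d) to the target during warmup,
    # then hold the target budget for the rest of training.
    frac = min(epoch / warmup_end, 1.0)
    k = round(d - (d - k_target) * frac)
    # Linearly ramp the sharpness from 1 to beta_max until fine-tuning begins.
    beta = 1.0 + (beta_max - 1.0) * min(epoch / finetune_start, 1.0)
    # During fine-tuning, the mask computed at the start of the phase is reused.
    freeze_mask = epoch >= finetune_start
    return k, beta, freeze_mask

# Example: a 100-epoch run targeting 95% sparsity on a 25.6M-parameter model.
for epoch in (0, 10, 20, 50, 80, 99):
    print(epoch, spartan_schedule(epoch, 100, d=25_600_000,
                                  k_target=1_280_000, beta_max=10.0))
```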
Table 1: Top-1 accuracies on the ImageNet-1K validation set with fully dense training.

| Method | Epochs | Accuracy (%) |
|---|---|---|
| ResNet-50 | 100 | 77.08 ± 0.11 |
| ResNet-50 | 200 | 77.46 ± 0.09 |
| ResNet-50 | 400 | 77.17 ± 0.04 |
| ViT-B/16 | 100 | 80.06 ± 0.11 |

Table 2: ResNet-50 top-1 accuracies on the ImageNet-1K validation set at varying levels of sparsity and epochs of training. When available, we report means and standard deviations over 3 trials.

| Method | Epochs | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| RigL (ERK) [16] | 100 | 75.1 ± 0.05 | 73.0 ± 0.04 | 69.7 ± 0.17 | - | - |
| | 500 | 77.1 ± 0.06 | 76.4 ± 0.05 | 74.5 ± 0.09 | - | - |
| STR [25] | 100 | 76.19 | 74.31 | 70.40 | 62.84 | 51.82 |
| ProbMask [47] | 100 | - | 74.68 | 71.50 | 66.83 | 61.07 |
| OptG [46] | 100 | - | 74.28 | 72.38 | - | 62.10 |
| IMP | 100 | 75.3 ± 0.07 | 73.7 ± 0.14 | 70.6 ± 0.05 | - | - |
| Top-KAST | 100 | 76.08 ± 0.02 | 75.13 ± 0.03 | 73.19 ± 0.02 | - | - |
| (with PP) | 100 | 76.24 ± 0.07 | 75.23 ± 0.02 | 73.25 ± 0.02 | - | - |
| (with ERK) | 100 | 76.42 ± 0.03 | 75.51 ± 0.05 | - | - | - |
| (with PP & ERK) | 100 | 76.76 ± 0.08 | 75.74 ± 0.08 | - | - | - |
| | 200 | 77.51 ± 0.03 | 76.94 ± 0.10 | - | - | - |
| | 300 | 77.64 ± 0.05 | 77.16 ± 0.19 | - | - | - |
| Top-KAST (our run) | 100 | 76.66 ± 0.04 | 75.48 ± 0.15 | 73.51 ± 0.16 | 70.23 ± 0.05 | - |
| | 200 | 77.55 ± 0.08 | 76.84 ± 0.11 | 75.20 ± 0.11 | 71.72 ± 0.06 | - |
| | 400 | 77.75 ± 0.09 | 77.37 ± 0.07 | 75.90 ± 0.04 | 72.58 ± 0.13 | - |
| Spartan | 100 | 76.89 ± 0.09 | 76.17 ± 0.10 | 74.68 ± 0.24 | 71.95 ± 0.11 | 63.87 ± 0.12 |
| | 200 | 77.56 ± 0.19 | 77.06 ± 0.13 | 75.92 ± 0.01 | 73.41 ± 0.15 | 65.77 ± 0.03 |
| | 400 | 77.57 ± 0.08 | 77.40 ± 0.06 | 76.48 ± 0.20 | 74.15 ± 0.10 | 66.80 ± 0.03 |

Table notes: Reported accuracies for models closest to the listed sparsity level. Model trained at 98% sparsity.

4 Empirical Evaluation

In this section, we report the results of our sparse training experiments on the ImageNet-1K dataset with two standard architectures: ResNet-50 [21] and ViT-B/16 [15], consisting of 25.6M and 86.4M parameters respectively. On ResNet-50, we evaluate only unstructured pruning, whereas on ViT-B/16, we evaluate both unstructured and block structured pruning. We subsequently present empirical studies of the sensitivity of Spartan to the value of β, the effect of running Spartan without the dual averaging step, and the computational overhead of soft top-k masking.

4.1 ImageNet-1K Classification

In all our experiments, we run Spartan with the training schedule described in Section 3.2. We train and evaluate our models on the ImageNet-1K dataset with the standard training-validation split and report means and standard deviations over 3 independent trials. We provide additional details on our experimental setup in the Appendix.

ResNet-50 experimental setup. For consistency with our baselines, we use standard data augmentation with horizontal flips and random crops at 224×224 resolution. For all Spartan runs, we use β_max = 10, which we selected based on models trained at 95% sparsity. We sparsify the parameters of all linear and convolutional layers with a global sparsity budget, excluding bias parameters and the parameters of batch normalization layers. Our baselines are iterative magnitude pruning [48], RigL with the Erdos-Renyi-Kernel (ERK) sparsity distribution [16], Soft Threshold Weight Reparameterization (STR) [25], probabilistic masking (ProbMask) [47], OptG [46], and Top-KAST with Powerpropagation and ERK [24; 37]. We additionally re-run the most performant baseline method, Top-KAST, using a reimplementation in our own codebase. For Top-KAST, we exclude the first convolutional layer from pruning (following [37; 16]) and we use fully dense backward passes (i.e., with a backward sparsity of 0), since this setting achieved the highest accuracy in prior work [24]. We use mixed precision training with a batch size of 4096 on 8 NVIDIA A100 GPUs.
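As a minimal sketch of how the global sparsity budget above can be assembled in PyTorch (our own illustrative helper; it does not reproduce details such as the first-layer exclusion used for some baselines):

```python
import torch
import torchvision

def global_budget(model: torch.nn.Module, sparsity: float):
    """Collect prunable weights (linear and convolutional layers only; biases and
    batch-norm parameters stay dense) and compute the global budget k."""
    prunable = [m.weight for m in model.modules()
                if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    total = sum(w.numel() for w in prunable)
    k = int(round((1.0 - sparsity) * total))
    return k, prunable

model = torchvision.models.resnet50()
k, prunable = global_budget(model, sparsity=0.95)
print(f"keep {k} of {sum(w.numel() for w in prunable)} prunable weights")
```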
Table 3: ViT-B/16 top-1 accuracies on the ImageNet-1K validation set at 90% sparsity with unstructured sparsity and block structured pruning of attention layers.

| Method | Epochs | Unstructured | 16×16 blocks | 32×32 blocks |
|---|---|---|---|---|
| Top-KAST | 100 | 78.05 ± 0.07 | 67.69 ± 0.23 | 64.79 ± 0.14 |
| | 300 | 80.86 ± 0.03 | 74.11 ± 0.20 | 70.67 ± 0.96 |
| Spartan | 100 | 79.88 ± 0.16 | 75.62 ± 0.07 | 74.50 ± 0.23 |
| | 300 | 81.18 ± 0.04 | 79.12 ± 0.16 | 78.43 ± 0.10 |

ViT experimental setup. We use the ViT-B architecture with 16×16 patches at 224×224 input resolution. We augment the training data using RandAugment [10], MixUp [45] and CutMix [44]. Our ViT models are trained from random initialization, without any pretraining. We set β_max = 20 for Spartan with unstructured sparsity, and β_max = 320 and β_max = 640 for Spartan with 16×16 and 32×32 blocks respectively. These latter settings are the values of β_max for the unstructured case scaled up by factors of 16 and 32: since each block averages the magnitudes of B² entries, we expect the variance of the block values to correspondingly decrease by a factor of B², and we thus compensate by scaling up β by B. In the block structured case, we exclude the input convolutional layer and the output classification head from pruning since their parameter dimensions are not divisible by the block size. We use mixed precision training with a batch size of 4096 on 16 NVIDIA A100 GPUs across 2 nodes.

Results. Table 1 lists the top-1 validation accuracies achieved by fully dense ResNet-50 and ViT-B/16 models. Table 2 reports validation accuracies for ResNet-50 models at 80%, 90%, 95%, 97.5% and 99% sparsity, and Table 3 reports validation accuracies for ViT at 90% sparsity. In the Appendix, we report additional measurements of inference FLOP costs for our sparse models and the results of experiments with FLOP-sensitive pruning.

For ResNet-50 models, we find that Spartan outperforms all our baselines across all training durations at sparsity levels of 90% and above. In particular, Spartan achieves a mean top-1 accuracy within 1% of fully dense training at 95% sparsity. We observe that additional epochs of sparse training consistently improve the final generalization accuracy; in contrast, validation accuracy peaks at 200 epochs for dense training. This trend persists at 800 training epochs, where Spartan achieves 67.18 ± 0.13 top-1 accuracy at 99% sparsity. For the Top-KAST baseline, we omit results at 99% sparsity due to training instability. We note that the accuracy improvements of the Spartan-trained ResNet-50 models relative to Top-KAST are not a result of increased FLOP counts at a given level of sparsity, as can be seen from the FLOP measurements reported in Appendix E.

For ViT-B/16 models, Spartan outperforms Top-KAST for both unstructured and block structured pruning. We observe a particularly large improvement over Top-KAST in the block structured case, where Spartan improves absolute validation accuracy by 7.8% for 32×32 block structured pruning. For unstructured pruning, Spartan achieves comparable accuracy to SViTE [7] (81.2% for Spartan vs. 81.3% for SViTE), but with 30% higher sparsity (90% for Spartan vs. 60% for SViTE). In exchange for a 0.6% reduction in accuracy relative to DeiT-B [38], Spartan reduces model storage cost by 10× and the FLOP cost of inference by 7.4×.
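To make the block structured variant evaluated above concrete, here is a hedged sketch (our own helper names and reshaping convention) of how per-block values and costs can be formed from a 2-D weight matrix, and how a per-block mask is broadcast back to the weight shape, following the description in Section 3.1; the resulting values and costs can be fed to a soft top-k routine such as the one sketched there.

```python
import torch

def block_values_and_costs(weight: torch.Tensor, block: int):
    """One mask variable per (block x block) tile: its value is the summed
    magnitude of the tile's entries and its cost is the tile's entry count."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "shape must be divisible by block"
    tiles = weight.reshape(rows // block, block, cols // block, block)
    values = tiles.abs().sum(dim=(1, 3)).flatten()
    costs = torch.full_like(values, float(block * block))
    return values, costs

def expand_block_mask(block_mask: torch.Tensor, shape, block: int) -> torch.Tensor:
    """Broadcast one mask value per tile back to the full weight shape."""
    rows, cols = shape
    m = block_mask.reshape(rows // block, cols // block)
    return m.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)

w = torch.randn(64, 64)
values, costs = block_values_and_costs(w, block=16)  # 16 tiles, cost 256 each
print(values.shape, costs[0])
```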
4.2 Sensitivity and Ablation Analysis

Figure 4: Effect of β and dual averaging (left). Top-1 ImageNet-1K validation accuracies for Spartan with and without dual averaging (DA) and with varying β_max at 95% sparsity. Computational overhead (right). Percentage increase in per-iteration wall clock runtime over dense training for Spartan with standard Sinkhorn iteration, Sinkhorn with dual caching, and Sinkhorn with sorting.

Figure 4 (left) shows the effect of ablating the dual averaging step in Spartan (i.e., omitting hard top-k projection in the forward pass) over a range of settings of β_max for 95% sparse ResNet-50 models trained for 100 epochs. The dashed line shows top-1 accuracy with Top-KAST. For Spartan training without dual averaging, we compute a hard top-k mask at the end of epoch 80, and as with standard Spartan training, we fix this mask until the end of training. In the absence of top-k projection, we find that accuracy increases with increasing β_max up to β_max = 80; at lower settings of β_max, the higher approximation error of soft masking is detrimental to the final accuracy of the model. In contrast, the use of top-k projection in the forward pass mitigates this mismatch between training and inference and improves the final accuracy of the sparse model.

4.3 Computational Overhead

We evaluate the computational overhead of Spartan over standard dense training by measuring wall clock training times for a ResNet-50 model (Figure 4, right). Our benchmark measures runtime on a single NVIDIA A100 GPU over 50 iterations of training with a batch size of 256. We use random inputs and labels to avoid incurring data movement costs. We compare three approaches: standard Sinkhorn iteration, Sinkhorn iteration with the dual variable µ initialized using its final value from the previous iteration ("dual caching"), and Sinkhorn iteration with µ initialized using the threshold value β|θ_{i_k}|/c_{i_k}, where we compute the index i_k of the kth largest entry of the normalized value vector |θ|/c by sorting its entries in each iteration ("sorting").

Standard Sinkhorn iteration incurs higher overheads as β increases; this is due to the additional iterations required to reach convergence as the regularized OT problem more closely approximates the original LP. We find that dual caching and sorting both prevent this growth in runtime over the range of β values that we tested. In our remaining experiments, we use the simple sorting-based approach since it corresponds to the lowest relative overhead over standard dense training (approximately 5%). We note that this relative overhead decreases as the batch size increases, since the cost of computing the mask is independent of the batch size. Spartan is therefore best suited to training with large batch sizes, since we amortize the cost of mask updates over the size of the batch.

5 Discussion

Extensions. As an alternative to magnitude-based pruning, we may also apply Spartan in conjunction with learned value parameters, as in methods such as Movement Pruning [35]. In this approach, we would compute masks using a set of auxiliary parameters φ ∈ R^d instead of the magnitudes |θ|: σ^β_k(θ; φ) = θ ⊙ softtopk(φ, k, β), and similarly for hard projection. We remark that while this variant requires additional memory during training to store the value parameters, there is no additional cost during inference since the sparsity mask is fixed at the end of training.
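A hedged sketch of this learned-value variant (hypothetical module and names; it assumes a `soft_topk_mask` routine like the one sketched in Section 3.1, and a simple zero initialization of φ chosen for illustration):

```python
import torch

class LearnedValueMask(torch.nn.Module):
    """Mask scores come from auxiliary parameters phi rather than |theta|,
    in the spirit of Movement Pruning (illustrative sketch only)."""

    def __init__(self, weight: torch.nn.Parameter):
        super().__init__()
        self.weight = weight
        # Learned value scores; a real implementation would likely choose a
        # more informative initialization than zeros.
        self.phi = torch.nn.Parameter(torch.zeros_like(weight))

    def masked_weight(self, k: int, beta: float, soft_topk_mask) -> torch.Tensor:
        # soft_topk_mask(values, k, beta) -> m in [0, 1]^d, as in Section 3.1;
        # the scores are phi rather than the weight magnitudes.
        m = soft_topk_mask(self.phi.flatten(), k, beta).view_as(self.weight)
        return self.weight * m
```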
Limitations. Since Spartan retains a dense parameter vector and computes dense backward passes during training, it incurs higher memory and computational costs in each iteration than methods like RigL [16] that use both sparse parameters and sparse backward passes. Nevertheless, we note that in terms of total computational or energy cost over a full training run, Spartan may remain a competitive option, as it requires fewer iterations of training to reach a given accuracy threshold relative to these sparse-to-sparse methods. However, we do not compare total training FLOP costs in our empirical evaluation.

A further limitation is that cost-sensitive pruning with Spartan is only compatible with relatively crude linear cost models of the form c^T m, where c is the cost vector and m is the mask. This restriction arises due to the requirements of the regularized OT formulation used to compute the soft top-k mask. In particular, this limitation precludes the use of cost models involving interaction terms, such as those arising from spatially coherent sparsity patterns. For example, the cost models that we consider here cannot encode the notion that coherently pruning an entire row or column of a parameter matrix will yield more cost savings than pruning random entries of the matrix.

Optimization with Dual Averaging. As discussed in Section 3, Spartan and Top-KAST are both applications of the dual averaging method [34; 41]. The empirically observed effectiveness of dual averaging in sparse training is reminiscent of its success in a related domain where continuous optimization is complicated by discrete constraints: neural network quantization. In this area, the BinaryConnect algorithm [9], which trains binary neural networks with parameters in {±1}^d, is considered to be a foundational method. As observed by Bai et al. [2], BinaryConnect is itself a special case of dual averaging, and the theoretical underpinnings of this connection to dual averaging were further analyzed by Dockhorn et al. [14]. These parallels with neural network quantization suggest that similar ideas may well apply towards deepening our understanding of sparse learning.

Societal Impacts. Inference with deep neural network models is a computationally intensive process. At present, the total energy footprint associated with serving these models in production is expected to continue growing in tandem with the rising prevalence of large transformer-based architectures in vision and NLP applications. Research towards improving the energy efficiency of deep neural networks is therefore an important counterbalancing force against increasing resource usage by these models. The development of sparse learning algorithms is particularly relevant to these efforts, and we expect that the impact of these approaches will further increase as sparsity-aware hardware acceleration becomes more widely available.

6 Conclusions & Future Work

In this work, we describe a sparse learning algorithm that interpolates between two parameter update schemes: standard stochastic gradient updates with hard masking and the dual averaging method. We show that there exists an intermediate regime between these two methods that yields improved generalization accuracy for sparse convolutional and transformer models, particularly at higher levels of sparsity.
While we have demonstrated promising empirical results with our proposed method, the learning dynamics of stochastic optimization for deep networks under sparsity constraints remain relatively poorly understood from a theoretical standpoint. There thus exists ample opportunity for further work towards better understanding sparse learning algorithms, which may in turn inspire future algorithmic advances in this area.

Acknowledgements and Disclosure of Funding

We are grateful to Trevor Gale for his feedback on this work. We also thank our anonymous reviewers for their valuable suggestions on improving the manuscript. This work was funded by Meta.

References

[1] Kambiz Azarian, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. Learned threshold pruning. arXiv preprint arXiv:2003.00075, 2020.
[2] Yu Bai, Yu-Xiang Wang, and Edo Liberty. ProxQuant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2018.
[3] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations, 2018.
[4] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), 2015.
[5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[6] Arnav Chavan, Zhiqiang Shen, Zhuang Liu, Zechun Liu, Kwang-Ting Cheng, and Eric Xing. Vision transformer slimming: Multi-dimension searching in continuous optimization space. arXiv preprint arXiv:2201.00814, 2022.
[7] Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, and Zhangyang Wang. Chasing sparsity in vision transformers: An end-to-end exploration. In Advances in Neural Information Processing Systems, 2021.
[8] Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.
[9] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 2015.
[10] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical automated data augmentation with a reduced search space. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
[11] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
[12] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Advances in Neural Information Processing Systems, 2019.
[13] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.
[14] Tim Dockhorn, Yaoliang Yu, Eyyüb Sari, Mahdi Zolnouri, and Vahid Partovi Nia. Demystifying and generalizing BinaryConnect. In Advances in Neural Information Processing Systems, 2021.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[16] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2020.
[17] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
[18] Scott Gray, Alec Radford, and Diederik P Kingma. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
[19] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015.
[20] Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, 1992.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[22] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 2021.
[23] Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 1989.
[24] Siddhant Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. Top-KAST: Top-K Always Sparse Training. In Advances in Neural Information Processing Systems, 2020.
[25] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, 2020.
[26] François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[27] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 2, 1989.
[28] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations, 2018.
[29] C Louizos, M Welling, and DP Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
[30] Jian-Hao Luo and Jianxin Wu. AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition, 2020.
[31] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 2018.
[32] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 2019.
[33] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, 1988.
[34] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 2009.
[35] Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. In Advances in Neural Information Processing Systems, 2020.
[36] Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. In Advances in Neural Information Processing Systems, 2020.
[37] Jonathan Schwarz, Siddhant Jayakumar, Razvan Pascanu, Peter Latham, and Yee Teh. Powerpropagation: A sparsity inducing weight reparameterisation. In Advances in Neural Information Processing Systems, 2021.
[38] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021.
[39] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[40] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408, 2022.
[41] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, 2009.
[42] Yujia Xie, Hanjun Dai, Minshuo Chen, Bo Dai, Tuo Zhao, Hongyuan Zha, Wei Wei, and Tomas Pfister. Differentiable top-k with optimal transport. In Advances in Neural Information Processing Systems, 2020.
[43] Dong Yu, Frank Seide, Gang Li, and Li Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[44] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In IEEE/CVF International Conference on Computer Vision, 2019.
[45] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
[46] Yuxin Zhang, Mingbao Lin, Mengzhao Chen, Zihan Xu, Fei Chao, Yunhan Shen, Ke Li, Yongjian Wu, and Rongrong Ji. Optimizing gradient-driven criteria in network sparsity: Gradient is all you need. arXiv preprint arXiv:2201.12826, 2022.
[47] Xiao Zhou, Weizhong Zhang, Hang Xu, and Tong Zhang. Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[48] Michael Zhu and Suyog Gupta. To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression. In International Conference on Learning Representations, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 5.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4 and the Appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]