Published as a conference paper at ICLR 2024

DYNAMIC SPARSE TRAINING WITH STRUCTURED SPARSITY

Mike Lasby1, Anna Golubeva2,3, Utku Evci4, Mihai Nica5,6, Yani A. Ioannou1
1University of Calgary, 2Massachusetts Institute of Technology, 3IAIFI, 4Google DeepMind, 5University of Guelph, 6Vector Institute for AI

Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically less computationally expensive, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work, we propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity by imposing a constant fan-in constraint. Using our empirical analysis of existing DST methods at high sparsity, we additionally employ a neuron ablation method which enables SRigL to achieve state-of-the-art sparse-to-sparse structured DST performance on a variety of Neural Network (NN) architectures. Using a 90% sparse linear layer, we demonstrate a real-world acceleration of 3.4×/2.5× on CPU for online inference and 1.7×/13.0× on GPU for inference with a batch size of 256 when compared to equivalent dense/unstructured (CSR) sparse layers, respectively.

1 INTRODUCTION

Dynamic Sparse Training (DST) methods such as RigL (Evci et al., 2021) are the state of the art in sparse training methods for Deep Neural Networks (DNNs). DST methods typically learn unstructured masks resulting in 85–95% fewer weights than dense models, while maintaining dense-like generalization and typically outperforming masks found via pruning. Furthermore, sparse-to-sparse DST algorithms are capable of employing sparsity both during training and inference, unlike pruning and dense-to-sparse DST methods such as SR-STE (Zhou et al., 2021), which only exploit sparsity at inference time. While models trained with DST methods are highly sparse and enable a large reduction in Floating Point Operations (FLOPs) in theory, realizing these speedups on hardware is challenging when the sparsity pattern is unstructured. Even considering recent advances in accelerating unstructured Sparse Neural Networks (SNNs) (Gale et al., 2020; Elsen et al., 2020; Ji & Chen, 2022), structured sparsity realizes much stronger acceleration on real-world hardware. On the other hand, structured sparse pruning often removes salient weights, resulting in worse generalization than comparable unstructured SNNs for the same sparsity level (Fig. 1a). Our work presents a best-of-both-worlds approach: we exploit the DST framework to learn both a highly sparse and structured representation while maintaining generalization performance. In summary, our work makes the following contributions:

1. We propose a novel sparse-to-sparse DST method, Structured RigL (SRigL), based on RigL (Evci et al., 2021). SRigL learns a SNN with constant fan-in fine-grained structured sparsity (Fig. 1a) while maintaining generalization comparable with RigL up to a high sparsity level (99%) for a variety of network architectures. This structure is a particular case of N:M sparsity, which requires N out of M consecutive weights to be non-zero (Mishra et al., 2021).

2. Our empirical analysis shows that RigL, at sparsity levels > 90%, ablates whole neurons.
By allowing neuron ablation in SRigL, we match RigL's generalization even in this high-sparsity regime.

3. We enable neuron ablation in SRigL across all sparsity regimes. We find this structured sparsity is complementary to the constant fan-in sparsity in improving real-world inference timings while maintaining generalization comparable to unstructured DST methods.

4. We demonstrate that constant fan-in sparsity enables a compact representation that is not only parameter- and memory-efficient, but also amenable to real-world acceleration. We observe significantly reduced real-world timings for online inference using our CPU-based PyTorch implementation and for batched inference using a GPU-based implementation from Schultheis & Babbar (2023) over dense and unstructured baselines.

{mklasby,yani.ioannou}@ucalgary.ca, golubeva@mit.edu, evcu@google.com, nicam@uoguelph.ca. Our source code is available here.

Figure 1: (a) Constant fan-in pruning vs. unstructured pruning: constant fan-in pruning keeps the most salient weights per neuron, while unstructured pruning keeps the most salient weights per layer. A constant fan-in weight matrix has the same number of non-zero elements (here 2) per column, allowing a condensed representation. While pruning may remove salient weights, affecting generalization, with SRigL structure and weights are learned concurrently. (b) Output-norm variance analysis: theoretical predictions and simulation results (see Appendix A) demonstrating that sparse layers with constant fan-in have consistently smaller output-norm variance than layers with the same sparsity but without the constant fan-in constraint.

2 RELATED WORK

Dynamic sparse training. Unlike with pruning, where weights are typically pruned after the dense network has been trained (Han et al., 2015; 2016) or at initialization (Wang et al., 2020), DST methods learn the sparse connectivity during training by periodically adding and removing weights based on various saliency criteria. For instance, Sparse Evolutionary Training (SET) (Mocanu et al., 2018) removes weights with the smallest magnitude and adds weights randomly; similarly, RigL (Evci et al., 2021) prunes weights with the smallest magnitude and regrows weights that have large-magnitude gradients. Liu et al. (2021c) further improved the original RigL results by increasing the extent of the parameter space explored, modifying the sparse connectivity update schedule and drop rate. Many recent works have examined the effect of different grow and prune saliency criteria on unstructured DST approaches, including SET, Deep Rewiring (DeepR) (Bellec et al., 2018), Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), Dynamic Sparse Reparameterization (DSR) (Mostafa & Wang, 2019), Top-K Always Sparse Training (Top-KAST) (Jayakumar et al., 2020), and Memory-Economic Sparse Training (MEST) (Yuan et al., 2021a). In Section 4, we compare SRigL to several of these methods. While the above-noted DST methods are highly effective at finding SNNs which reduce theoretical inference cost, they result in unstructured SNNs which are difficult to accelerate in practice on common hardware architectures. In contemporaneous work, Yin et al. (2023) also identified the existence of sparse amenable channels in existing unstructured DST algorithms. Their method, Chase, achieves state-of-the-art generalization performance by including a soft memory bound similar to Yuan et al.
(2021b) and calculating the saliency of parameters based on global instead of layer-wise statistics. Chase requires that the structured sparsity level be set prior to training. In contrast, SRigL dynamically learns to ablate channels based on the number of remaining weights that are considered salient.

Accelerating unstructured sparse neural networks. Elsen et al. (2020) proposed a method for accelerating unstructured SNNs based on one-dimensional tiling of non-zero elements, which demonstrated significant speedups on both Central Processing Unit (CPU) (Elsen et al., 2020) and Graphics Processing Unit (GPU) (Gale et al., 2020). However, like most approaches to accelerating unstructured SNNs, this method relies on imposing structure on an existing sparse weight matrix after training. Our method can be considered a way of adding structure to SNNs during training, allowing the model to maximally utilize non-zero weights since structure and weights are learned concurrently. The DeepSparse Engine (Neural Magic, 2021) accelerates inference of unstructured sparse networks on CPU by applying several innovations. In Appendix K, we compare our SRigL timings to the DeepSparse Engine.

Learning block-structured sparsity from scratch. Block sparsity is a particular type of structured sparsity in which blocks of non-zero weights are grouped together in arrangements that reduce the memory overhead required to store the indices of the non-zero weights. Blocks can be generated out of contiguous weights in 1D (sometimes called tiles) or 2D, or by utilizing a fixed number of non-zero weights per row or column group in the case of block-balanced sparsity (Hoefler et al., 2021). Spurred by the success of DST in learning unstructured sparse models, recent works have attempted to apply DST principles to learn block-structured sparsity. Jiang et al. (2022) introduced a novel block-aware DST algorithm known as Dynamic Shuffled Block (DSB). DSB reshuffles non-zero weights into a block sparsity pattern after sparse connectivity updates, thereby improving memory access efficiency. Wall-clock speed-ups of up to 4× were reported with this method; however, generalization performance was reduced compared to RigL at comparable sparsities. Dietrich et al. (2022) applied a modified variant of RigL to BERT models (Devlin et al., 2019). The resulting method is capable of learning models with block-structured sparsity.

Learning N:M structured sparsity from scratch. N:M sparsity is a specific form of block-balanced sparsity in which 1D blocks with M contiguous elements contain exactly N non-zero elements. N:M sparsity is particularly amenable to acceleration, and several attempts have been made to train models with N:M fine-grained structure using DST methods. Yang et al. (2022) extended the DST method proposed by Liu et al. (2021b) to train multiple sparse sub-networks sampled from a single dense super-network. Their proposed method, Alternating Sparse Training (AST), switches the network topology between sparse sub-networks after each mini-batch during training. Yang et al. (2022) demonstrated state-of-the-art performance on several typical sparse training benchmarks. However, the dense model weights and gradients are required throughout the majority of training, greatly increasing the overall compute and storage requirements.
While AST demonstrated a tantalizing possibility of training multiple sparse sub-networks within a single training loop, the gradual dense-to-sparse training paradigm used by Liu et al. (2021b) is not directly comparable to RigL or other similar end-to-end sparse DST methods. Zhou et al. (2021) explored how N:M sparsity can be achieved during training using magnitude-based pruning during the forward pass and a Straight-Through Estimator (STE) (Bengio et al., 2013) on the backward pass. In their method, the dense network weights are projected into a sparse network during each training iteration. The sparse network is obtained by selecting the top-N out of every M contiguous weights, and STE is used to propagate the approximated gradients through the projection function. A regularization term is applied to the gradients of pruned weights to reduce instabilities during training. Their approach, Sparse-Refined Straight-Through Estimator (SR-STE), was applied to networks with N:M ratios of 1:4, 2:4, 2:8, 4:8, and 1:16. Although SR-STE utilizes sparse operations in the forward pass and can find sparse models optimized for inference, it does not reduce the training cost significantly. Specifically, SR-STE training requires (1) storing original parameters in their dense format, and (2) calculating dense gradients during each training iteration. This makes SR-STE training as expensive as the original dense training in terms of memory and compute cost.¹ On the other hand, DST methods such as RigL, and our proposed method SRigL, are capable of end-to-end sparse training and use sparse parameters and gradients throughout training.

¹ To be precise, SR-STE can use some sparse operations and reduce training cost up to two thirds of the original dense training. However, this is still far from fully sparse acceleration for training.

Accelerating fine-grained N:M structured sparsity. Nvidia (2020); Mishra et al. (2021) introduced the Ampere Tensor Core GPU architecture (e.g., A100 GPUs) and proposed the 2:4 fine-grained structured sparsity scheme that enables SNNs to be accelerated on this hardware at inference time. This scheme places a constraint on the allowed sparsity pattern: for every contiguous array of four weights, two are pruned, yielding a 50%-sparse network. The resulting regular structure of the weight matrix allows one to compress it efficiently and to reduce memory storage and bandwidth by operating on the non-zero weights only. Since the focus is on acceleration at inference time, the authors proposed to use the standard method of magnitude-based pruning post training to achieve the 2:4 sparsity. Importantly, this work considered exclusively the 2:4 ratio; other N:M ratios cannot be accelerated on Ampere GPUs.

Constant fan-in N:M structured sparsity. The constant fan-in constraint represents a special case of N:M sparsity where N is the number of non-zero weights per neuron and M is the dense fan-in for each neuron within a given layer. While commodity hardware acceleration currently exists only for 2:4 sparsity on Nvidia's Ampere and later architectures (Mishra et al., 2021), a constant fan-in constraint can also take advantage of the efficient memory access and throughput increase that N:M sparsity yields, as recently demonstrated by Schultheis & Babbar (2023).
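To make the relationship between N:M sparsity and the constant fan-in constraint concrete, the snippet below builds both kinds of magnitude-based mask for a small dense weight matrix. It is a minimal illustration only, not the SRigL implementation; the function names, shapes, and example sizes are our own.

```python
import torch

def nm_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every contiguous block of m."""
    out_f, in_f = w.shape
    blocks = w.abs().reshape(out_f, in_f // m, m)
    topk = blocks.topk(n, dim=-1).indices
    mask = torch.zeros_like(blocks, dtype=torch.bool).scatter_(-1, topk, True)
    return mask.reshape(out_f, in_f)

def constant_fan_in_mask(w: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude weights in every row (neuron).
    Equivalent to N:M sparsity with N = k and M = the dense fan-in."""
    topk = w.abs().topk(k, dim=-1).indices
    return torch.zeros_like(w, dtype=torch.bool).scatter_(-1, topk, True)

w = torch.randn(8, 16)                 # 8 neurons, dense fan-in of 16
m24 = nm_mask(w, n=2, m=4)             # 2:4 sparsity (50% sparse)
mcf = constant_fan_in_mask(w, k=4)     # 4:16 constant fan-in (75% sparse)
print(m24.sum(dim=1))                  # every row keeps 2 weights per block of 4
print(mcf.sum(dim=1))                  # every row keeps exactly k = 4 weights
```

Each row of the constant fan-in mask can be stored densely as k weights plus k column indices, which is the idea behind the condensed representation used for acceleration in Section 4.4.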
Constant fan-in sparsity has several attributes which differentiate it from N:M sparsity. Constant fan-in sparsity is more flexible than N:M sparsity, enabling arbitrary global sparsity values to be applied to the model, whereas N:M sparsity is limited to specific sparsity ratios. With the constant fan-in constraint, per-layer sparsity distributions such as Erdős–Rényi–Kernel (ERK) can be applied to the model. The ERK distribution has been demonstrated to outperform uniform sparsity distributions by reallocating parameters to layers with fewer parameters (Mocanu et al., 2018; Evci et al., 2021). In contrast, N:M sparsity can only be applied with a uniform sparsity distribution. Hardware support for acceleration of N:M sparsity is currently limited to 2:4 sparsity on Nvidia GPUs, offering a modest acceleration on the order of 2×. In contrast, highly sparse models (≥90% sparsity) hold the promise of being up to 10× faster than an equivalent dense model. As we demonstrate in Section 4.4 and Appendix I, our condensed sparse representation with constant fan-in sparsity can achieve significant acceleration over a wide range of sparsities even without specialized hardware.

Online inference. In many applications, DNNs are used in an online manner, i.e., by using only single inputs and not batches of inputs. Online inference is common in real-time and latency-sensitive applications, or applications without significant numbers of simultaneous requests allowing batching. Online inference, especially for real-time applications, does not typically benefit from accelerators such as GPUs that require host-to-device transfers, since the cost of the transfer itself often negates any benefit in compute. Accelerating online inference workloads remains an open research problem, with many systems engineering solutions proposed to achieve acceleration (Kumar et al., 2019; Li et al., 2020; Wang et al., 2022; Wu et al., 2020). Our CPU implementation of the condensed representation, which exploits both structured and constant fan-in sparsity, offers a complementary, orthogonal solution to these engineered solutions by directly accelerating model inference for single samples.

Our goal in this work is to introduce structural constraints on the sparse mask learned by RigL, in order to make it more amenable to acceleration at inference time while not affecting RigL's generalization performance. We first performed a theoretical analysis to explore the effect of various sparsity distributions with different degrees of structural constraints on the training dynamics of SNNs, detailed in Fig. 1b and Appendix A. Based on this analysis, we did not find any evidence to suggest that the constant fan-in constraint would impair SNN training dynamics and performance, motivating the use of constant fan-in sparsity in our method outlined in Section 3.1.

3.1 STRUCTURED RIGL

As motivated by Appendix A, we propose to enforce the constant fan-in constraint within a sparse-to-sparse DST method to learn structured sparse connectivity from scratch. Specifically, we use RigL (Evci et al., 2021), which can obtain highly sparse networks with generalization performance comparable to their dense baselines. In brief, the methodology of RigL is to update the SNN connectivity during training by pruning weights with the smallest magnitude and regrowing those with the largest corresponding gradient magnitude in each layer. This occurs in periodic, but relatively infrequent, mask update steps throughout most of training.
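As a point of reference for the modifications described next, the following is a minimal sketch of a single RigL-style connectivity update for one layer: drop the smallest-magnitude active weights and regrow the inactive weights with the largest gradient magnitudes. The function and variable names are ours, and this simplification ignores details of RigL such as the update schedule, per-layer sparsity distribution, and exclusion rules.

```python
import torch

def rigl_layer_update(w: torch.Tensor, grad: torch.Tensor,
                      mask: torch.Tensor, num_update: int) -> torch.Tensor:
    """One simplified RigL connectivity update for a single layer.

    w, grad, mask all have shape (out_features, in_features); mask is boolean.
    num_update connections are dropped and the same number are regrown.
    """
    new_mask = mask.clone().view(-1)
    w_abs, g_abs = w.abs().view(-1), grad.abs().view(-1)

    # Drop: among currently active weights, remove the smallest magnitudes.
    drop_scores = torch.where(new_mask, w_abs, torch.tensor(float("inf")))
    new_mask[torch.topk(drop_scores, num_update, largest=False).indices] = False

    # Grow: among inactive weights, activate those with the largest gradient
    # magnitudes. (For simplicity, weights dropped above are not excluded from
    # regrowth here; in RigL, newly grown weights are initialized to zero.)
    grow_scores = torch.where(new_mask, torch.tensor(float("-inf")), g_abs)
    new_mask[torch.topk(grow_scores, num_update, largest=True).indices] = True
    return new_mask.view_as(mask)
```

A typical call would be `new_mask = rigl_layer_update(layer.weight.data, layer.weight.grad, mask, num_update)` at each mask update step, followed by re-applying the mask to the weights.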
In SRig L, weight saliency must be determined at the neuron level (in convolutional layers, at the level of each filter), since we enforce that every neuron (output channel) has the same number of unmasked incoming weights, thereby satisfying the constant fan-in constraint. (Fig. 1a). However, this approach alone significantly lags behind Rig L s generalization at very high sparsities (>90%) and with transformer architectures, as shown in Fig. 3a and Table 4. This is because the constant fan-in constraint has an important side-effect: under a strict constant fan-in constraint, neurons Published as a conference paper at ICLR 2024 Figure 2: Neuron ablation. At sparsity levels over 90%, Rig L learns to completely mask (ablate) a large number of neurons within each layer, effectively reducing layer width. Imposing a constant fan-in constraint requires all neurons to have the same number of (non-pruned) incoming weights and therefore inhibits ablation, which results in worse generalization performance than Rig L. Allowing SRig L to ablate neurons restores Rig L-level performance. can never be entirely masked (ablated), as illustrated in Fig. 2. At very high sparsity levels this can lead to many neurons that have only 1 2 weights, limiting the capacity to learn complex features and consequently reducing generalization performance. Indeed, at high sparsities we observed empirically that Rig L ablates large numbers of neurons (Figs. 3b, 11 and 12). Effectively, Rig L reduces the width of the model at high sparsities to maintain generalization performance; we believe we are the first to explicitly identify this behaviour within a DST method. To resolve this issue in SRig L, we implement a neuron ablation method, allowing SRig L to maintain both a constant fan-in constraint and to reduce layer width at high sparsities. We introduce a new hyperparameter, γsal, which defines the required minimum percentage of salient weights per neuron. Given a neuron with constant fan-in of k, if fewer than γsal k weights are considered salient by either the drop or grow criteria, then the neuron is ablated and its weights redistributed to other neurons within the same layer. Notably this neuron ablation method allows SRig L to exploit neuron ablation structured sparsity at much lower sparsity levels than we identified it occurring at in Rig L, while maintaining good generalization, as demonstrated in Table 4. The steps below outline our final SRig L method with neuron ablation. In the following procedure, the first two steps are the same as in Rig L, while the other steps are specific to SRig L, containing modifications to include the constant fan-in constraint and dynamic neuron ablation. We first set an ablation threshold γsal. Then, for each layer we do the following: 1. Obtain magnitudes of the active weights and gradient magnitudes of the pruned weights; these will serve as prune and growth criteria, respectively. 2. Compute K, the number of weights to be grown and pruned in the current step in this layer. We always grow the same number of connections as we prune. 3. Count the number of salient weights per neuron. A weight is considered salient if it is in the top-K of either the largest-magnitude weights or the largest-magnitude gradients. 4. Ablate neurons that have fewer salient weights than γsal k, where k is the fan-in. Ablation is done by pruning all incoming weights. These pruned weights are redistributed to the remaining neurons in the following steps. 5. 
Compute the new constant fan-in constraint, k′, based on the number of ablated neurons. 6. Prune the K smallest-magnitude weights in the current layer. Note that this pruning criterion considers all weights within a layer rather than pruning only the smallest weights in each neuron. 7. For each active neuron, regrow as many weights as required, proceeding in order of decreasing gradient magnitude, until the target fan-in, k′, is achieved (a simplified code sketch of this per-layer update is given below, after the experimental setup).

We implement SRigL in PyTorch by extending an existing implementation of RigL (McCreary, 2020). We evaluate our method empirically on image classification tasks: on the CIFAR-10 dataset (Krizhevsky, 2009) we train a variant of ResNet-18 (He et al., 2016) suitable for CIFAR-10 and WideResNet-22 (Zagoruyko & Komodakis, 2017); on the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-12) dataset (Russakovsky et al., 2015), commonly referred to as ImageNet, we train ResNet-50 (He et al., 2016), MobileNet-V3 (Howard et al., 2019), and Vision Transformer (ViT-B/16) (Dosovitskiy et al., 2021). See Appendix C and Appendix D.4 for WideResNet-22 and MobileNet-V3 experimental results, respectively. Unless noted otherwise, we use the same hyperparameter configuration as the original RigL method. A detailed summary of our hyperparameter settings and training details can be found in Appendix D.

Figure 3: (a) ResNet-50/ImageNet top-1 test accuracy when trained with SRigL for a range of sparsities is comparable to RigL. Extended training durations of 2× and 5× are also reported for SRigL. Results reported are single runs. (b) Neuron ablation: the percentage of active neurons (i.e., not ablated) following RigL/SRigL training on ResNet-50/ImageNet. RigL ablates a large number of neurons at high sparsities.

We set the ablation threshold, γsal, to 30% for all SRigL results, except for our ViT-B/16 experiments. This value was selected based on a hyperparameter sweep performed by training ResNet-18 and WideResNet-22 on the CIFAR-10 dataset; see Appendix E.

4.1 RESNET-18 TRAINED ON CIFAR-10

We use a variant of ResNet-18 with reduced kernel dimensions and stride in the first two convolutional layers to obtain a model suitable for CIFAR-10; our training regimen generally follows Evci et al. (2021), see Appendix D.1 for more information. We repeat training with five different random seeds for both methods and report the mean and 95% confidence interval compared to a densely-connected benchmark model in Table 2. These results confirm that imposing a constant fan-in constraint during sparse training does not significantly degrade generalization performance of the SNN compared to the RigL method. In Fig. 11 we plot the number of neurons ablated at ablation thresholds of 0%, 30%, and 50% to demonstrate how the γsal hyperparameter can be used to guide the final model width during training.

4.2 RESNET-50 TRAINED ON IMAGENET

Our training regimen for the ImageNet dataset generally follows Evci et al. (2021); see Appendix D.2 for more details. We investigate the effect of extended training with 2× and 5× the original number of training epochs. We train each model with a single seed and report the results in Fig. 3a and Table 1.
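For concreteness, the sketch below implements one reasonable reading of the per-layer SRigL update described in steps 1–7 of Section 3.1 for a linear layer: count salient weights per neuron, ablate neurons that fall below γsal·k, recompute the target fan-in k′, prune the smallest-magnitude weights, and regrow per neuron by gradient magnitude. The helper name, the saliency bookkeeping, and details such as convolutional reshaping and global FLOP accounting are our own simplifications and do not mirror the released implementation.

```python
import torch

def srigl_layer_update(w, grad, mask, num_update, gamma_sal=0.3):
    """Simplified SRigL update for one linear layer of shape (out_f, in_f)."""
    out_f, in_f = w.shape
    total_active = int(mask.sum())
    k = total_active // out_f                      # current constant fan-in
    neg_inf = torch.tensor(float("-inf"))

    # Steps 1-3: mark as "salient" the active weights that would survive the
    # layer-wise prune (by |w|) plus the top inactive weights (by |grad|),
    # then count salient weights per neuron.
    w_scores = torch.where(mask, w.abs(), neg_inf)
    g_scores = torch.where(mask, neg_inf, grad.abs())
    salient = torch.zeros(out_f * in_f, dtype=torch.bool)
    salient[torch.topk(w_scores.flatten(), total_active - num_update).indices] = True
    salient[torch.topk(g_scores.flatten(), num_update).indices] = True
    salient_per_neuron = salient.view(out_f, in_f).sum(dim=1)

    # Step 4: ablate neurons with fewer than gamma_sal * k salient weights.
    active_rows = salient_per_neuron >= gamma_sal * k
    new_mask = mask & active_rows.unsqueeze(1)

    # Step 5: new constant fan-in after redistributing the freed weights.
    k_new = min(total_active // max(int(active_rows.sum()), 1), in_f)

    # Step 6: prune the num_update smallest-magnitude weights still active.
    act_scores = torch.where(new_mask, w.abs(), torch.tensor(float("inf")))
    drop = torch.topk(act_scores.flatten(), num_update, largest=False).indices
    new_mask.view(-1)[drop] = False

    # Step 7: per active neuron, regrow the highest-|grad| inactive weights
    # until the target fan-in k_new is reached.
    for r in torch.nonzero(active_rows).flatten().tolist():
        deficit = k_new - int(new_mask[r].sum())
        if deficit > 0:
            cand = torch.where(new_mask[r], neg_inf, grad[r].abs())
            new_mask[r, torch.topk(cand, deficit).indices] = True
    return new_mask, active_rows
```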
SRigL yields generalization performance similar to RigL across each sparsity and training duration considered. At high sparsities, SRigL with ablation outperforms SRigL without ablation, highlighting the importance of neuron ablation as sparsity increases. Notably, the RigL 5× results at 99% sparsity in Evci et al. (2021) used a dense first layer, unlike all other results reported in Table 1. Despite this difference, SRigL 5× at 99% sparsity is comparable to the RigL 5× results. We expect that the 99% sparse models would be improved by using a dense first layer for all SRigL results. Similar to RigL, we observe that SRigL's generalization performance improves with increasing training time. We inspect the connectivity of ResNet models trained with the RigL method and find, as shown in Fig. 3b, that at 95% sparsity 10.9% of neurons are removed completely. Thus, RigL results in fewer, but more densely connected, neurons, whereas the fan-in constraint enforces that all neurons are retained. In Table 3 we compare SRigL to a variety of DST algorithms. SRigL performs comparably to other methods, even those which learn unstructured sparsity. Methods with a memory footprint listed as dense require training with the dense network and therefore are not directly comparable to other sparse-to-sparse DST methods. The most directly comparable method to ours is DSB; we note that SRigL outperforms DSB at all sparsity ratios reviewed.

Table 1: Top-1 ImageNet test accuracy of ResNet-50 trained with RigL or SRigL at high sparsities and with various training times (as in Evci et al. (2021)), e.g., 5× more training epochs than dense ResNet-50.

| sparsity (%) | RigL 1× | RigL 5× | SRigL w/o ablation 1× | SRigL w/ ablation 1× | SRigL w/ ablation 2× | SRigL w/ ablation 5× |
|---|---|---|---|---|---|---|
| 80 | 74.9 | 77.1 | 74.8 | 75.0 | 76.5 | 77.2 |
| 90 | 72.8 | 76.6 | 72.6 | 72.7 | 74.7 | 76.2 |
| 95 | 69.6 | 74.6 | 68.8 | 69.1 | 71.5 | 73.6 |
| 99 | 51.4 | 61.9 | 48.7 | 51.5 | 55.3 | 59.0 |

Dense ResNet-50: 76.7. The 5× RigL results are from Evci et al. (2021); the 99% sparse RigL 5× result uses a dense first layer, unlike other results.

Table 2: Test accuracy for ResNet-18 on CIFAR-10 trained with RigL or SRigL with/without neuron ablation at varying sparsities, repeated with five different random seeds.

| sparsity (%) | RigL | SRigL w/o ablation | SRigL w/ ablation |
|---|---|---|---|
| 80 | 95.2 ± 0.1 | 95.2 ± 0.1 | 95.2 ± 0.0 |
| 90 | 95.1 ± 0.1 | 95.0 ± 0.1 | 95.1 ± 0.1 |
| 95 | 94.6 ± 0.2 | 94.5 ± 0.3 | 94.7 ± 0.2 |
| 99 | 92.9 ± 0.1 | 91.5 ± 0.3 | 92.8 ± 0.1 |

Dense ResNet-18: 95.5.

Table 3: Top-1 ImageNet test accuracy of ResNet-50 trained with a variety of DST methods, highlighting methods that both are sparse-to-sparse (i.e., sparse training) and learn structured sparsity similar to SRigL; only DSB-16 (2:4 and 1:4 sparsity) is directly comparable in this regard. RigL and SRigL results are from our experiments; other values are obtained from each method's corresponding paper, unless noted otherwise.

| method | training | structured | 50% | 75% | 80% | 90% | 93.75% |
|---|---|---|---|---|---|---|---|
| Static* | sparse | no | | | 70.6 ± 0.06 | 65.8 ± 0.04 | |
| SET* | sparse | no | | | 72.9 ± 0.39 | 69.6 ± 0.23 | |
| DeepR | sparse | no | | | 71.7 | 70.2 | |
| DSR | sparse | no | | | 73.3 | 71.6 | |
| Top-KAST | sparse | no | | | 74.76 | 70.42 | |
| MEST | sparse | no | | | 75.39 | 72.58 | |
| RigL | sparse | no | | | 74.98 | 72.81 | |
| DSB-16 | sparse | yes | 76.33 | 74.04 | | | |
| Chase | sparse | yes | | | 75.27 | 74.03 | |
| SRigL (Ours) | sparse | yes | 76.60 | 75.55 | 75.01 | 72.71 | 70.56 |
| SNFS (ERK)* | dense | no | | | 75.2 ± 0.11 | 73.0 ± 0.04 | |
| AST+GC** | dense | no | | | 73.2 | 73.1 | |
| SR-STE | dense | yes | 76.2 | | | | 71.5 |

Dense ResNet-50: 76.7.
*Values obtained from Evci et al. (2021). Values obtained from Mostafa & Wang (2019). Values for the MEST (×0.67+EM) variant, matched to the same number of training FLOPs as RigL.
Values tabulated for Top-KAST correspond to the backwards sparsity, as Top-KAST uses different sparsities in the forward and backward passes; for more information see Table 1 in Jayakumar et al. (2020). Values from Yin et al. (2023) are for channel sparsity (Sc) set to 40%. **50% initial sparsity; values from Yang et al. (2022).

Table 4: Top-1 test accuracy of ViT-B/16 trained on ImageNet with or without neuron ablation. Sparsity level is set for all modules except the multi-headed attention input projections, which remain dense; see Appendix D.3 for more details.

| sparsity (%) | RigL | SRigL w/o ablation | SRigL w/ ablation |
|---|---|---|---|
| 80 | 77.9 | 73.5 | 77.5 |
| 90 | 76.4 | 71.3 | 76.0 |

Dense ViT-B/16: 78.35.

Table 5: SRigL sparsity and FLOPs for ResNet-50/ImageNet training and inference. See Appendix G for more details.

| sparsity (%) | training FLOPs (×1e18) | inference FLOPs (×1e9) |
|---|---|---|
| 80 | 1.13 | 3.40 |
| 90 | 0.77 | 1.99 |
| 95 | 0.40 | 1.01 |
| 99 | 0.09 | 0.21 |
| 0 (dense) | 3.15 | 8.20 |

4.3 VISION TRANSFORMER TRAINED ON IMAGENET

We train the vision transformer variant ViT-B/16 on ImageNet generally following the original training recipe per Dosovitskiy et al. (2021) with select modifications; see Appendix D.3 for more information. Similar to our Convolutional Neural Network (CNN) experiments, RigL ablates a significant number of neurons when applied to the ViT-B/16 architecture with sparsities of 80 and 90%. Additionally, we find that RigL learns sparse connectivities with a high variance of fan-in between neurons (see Fig. 12). At 90% sparsity, some neurons are allocated up to 10× more active weights than the mean number of active weights in the same layer. We hypothesize that these more densely connected neurons found in our RigL experiments are important for generalization performance; therefore, a high γsal threshold should improve performance of SRigL by ablating neurons until a sufficient density of sparse fan-in is reached. Indeed, we find that SRigL's generalization performance is sensitive to γsal and that high γsal thresholds of 90% to 99% perform best. See Fig. 9a and Appendix E for more details on how γsal affects the generalization performance of ViT-B/16. For the following results, we used a γsal of 95%. We train each model with a single random initialization and report the results in Table 4. SRigL without ablation is unable to match the generalization performance of RigL at very high sparsity. However, with neuron ablation enabled, SRigL's performance greatly improves and is closely comparable to RigL at 80% and 90% sparsity.

4.4 ACCELERATION OF CONSTANT FAN-IN SPARSITY

Algorithm 1: Condensed linear layer with constant fan-in sparsity, forward pass
1: Input: x: the input matrix of shape (batch_size, num_features)
2:        w: the condensed weight matrix of shape (active_neurons, constant_fan_in)
3:        idx: indices of the non-zero dense weights, of shape (active_neurons, constant_fan_in)
4: output ← torch.zeros(size=(batch_size, active_neurons))
5: for b in range(batch_size) do            # For each sample in the mini-batch
6:     for n in range(active_neurons) do    # For each active neuron in the layer
7:         for k in range(constant_fan_in) do   # For each non-zero weight
8:             source_idx ← idx[n, k]
9:             feature ← x[b, source_idx]
10:            output[b, n] += feature · w[n, k]
11: return output

While SRigL shows promising theoretical speedups (i.e., FLOPs), as demonstrated in Table 5 and Appendix G, FLOPs are limited in demonstrating the real-world acceleration potential of a proposed sparse representation in general.
Yet conversely, creating a fully-optimized software or hardware implementation of a novel representation typically requires significant engineering effort outside of the scope of this paper. Here we show that even a straightforward PyTorch implementation of our proposed condensed neural network representation (see Appendix F) can demonstrate this real-world acceleration. The algorithm to accelerate our condensed sparsity representation is shown in Algorithm 1, demonstrating that it is embarrassingly parallel. Additionally, leveraging CUDA kernels from Schultheis & Babbar (2023), we also demonstrate that constant fan-in sparsity can be accelerated on commodity GPUs. To accelerate our condensed linear layer we exploit both structured and constant fan-in sparsity by removing ablated neurons and zero-valued weights from active neurons. In Fig. 4, we present real-world timings comparing our condensed linear layer to structured and unstructured sparse representations. We extract the trained layer weights and bias from ViT-B/16 models trained with SRigL to obtain an accurate representation of the sparse topology produced during a real training run with SRigL. Our condensed representation is significantly faster than the dense benchmark and other sparse representations across all sparsities investigated. This real-world speed-up is immediately applicable to applications where latency is critical. In some instances, we found structured sparsity yields the best acceleration. By including both structured and constant fan-in sparsity, models trained with SRigL can use either the fully condensed (structured + constant fan-in) or purely structured sparse representations to obtain real-world acceleration across a broad range of applications with the same set of weights.

Figure 4: Comparing real-world timings for a fully-connected layer extracted from a ViT-B/16 model trained with SRigL when compressed using the condensed representation learned by SRigL, compared to structured (i.e., the same layer accelerated using only the ablated neurons without exploiting the fine-grained sparsity) and unstructured (i.e., Compressed Sparse Row (CSR)) representations. The median over a minimum of 5 runs is shown, while the error bars show the std. dev. Note: the increased timings for the 95 & 99% sparse structured representations are due to SRigL ablating relatively fewer neurons at these sparsities compared to 80 and 90%. (a) CPU wall-clock timings for online inference on an Intel Xeon W-2145. For online (single input) inference, our condensed representation at 90% is 3.4× faster than dense and 2.5× faster than unstructured sparsity. See Appendix I. (b) GPU wall-clock timings for inference with a batch size of 256 on an NVIDIA Titan V. At 90% sparsity, our condensed representation is 1.7× faster than dense and 13.0× faster than unstructured (CSR) sparse layers. Note the y-axis is log-scaled. See Appendix I and Appendix J for details on wall-clock benchmarks across a range of threads and batch sizes.

Furthermore, we expect that a more optimized software implementation and/or explicit hardware support would enable use of SRigL across a wider range of applications.
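Algorithm 1 is written as explicit loops for clarity; in practice the same computation vectorizes naturally over the batch and the fan-in dimension. The snippet below is a minimal batched PyTorch sketch of the condensed forward pass using torch.gather. It is our own simplification of the implementation described in Appendix F (the bias term and the convolutional case are omitted, and the example sizes are illustrative only).

```python
import torch

def condensed_linear(x: torch.Tensor, w: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Condensed constant fan-in linear layer.

    x:   (batch_size, num_features) input activations
    w:   (active_neurons, constant_fan_in) non-zero weights
    idx: (active_neurons, constant_fan_in) column indices of those weights
    Returns (batch_size, active_neurons).
    """
    batch_size = x.shape[0]
    active_neurons, fan_in = w.shape
    # Gather, for every sample and every active neuron, the fan_in input
    # features that its non-zero weights connect to.
    gather_idx = idx.reshape(1, -1).expand(batch_size, -1)                 # (B, N*K)
    feats = torch.gather(x, 1, gather_idx).view(batch_size, active_neurons, fan_in)
    # Multiply by the condensed weights and reduce over the fan-in dimension.
    return (feats * w.unsqueeze(0)).sum(dim=-1)

# Example: a layer with 512 of 768 neurons still active and a fan-in of 77.
x = torch.randn(256, 768)
idx = torch.stack([torch.randperm(768)[:77] for _ in range(512)])
w = torch.randn(512, 77)
print(condensed_linear(x, w, idx).shape)   # torch.Size([256, 512])
```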
5 CONCLUSION In this work we present SRig L, a novel DST method that learns a sparsity mask incorporating both structured and constant fan-in sparsity. SRig L is capable of sparse-to-sparse training while maintaining generalization performance on par with state-of-the-art unstructured sparse training methods on a wide variety of network architectures. Our observation that Rig L ablates neurons at high sparsities inspires our neuron ablation method which enables SRig L to match the performance of Rig L, even at high sparsities and on the Vi T-B/16 network architecture. SRig L s constant fan-in constraint and neuron ablation results in real-world acceleration for CPU online inference and GPU batched inference. We hope this work will motivate the implementation of additional fine-grained structured sparsity schemes and the engineering efforts required to accelerate them further. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS We acknowledge the support of Alberta Innovates, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the NSF AI Institute for Artificial Intelligence and Fundamental Interactions (IAIFI). We are grateful for computational resources made available to us by Denvr Dataworks, Google, Amazon, and the Digital Research Alliance of Canada. We also acknowledge the very helpful feedback of Erik Schultheis and Trevor Gale. Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep Rewiring: Training very sparse deep networks. In International Conference on Learning Representations, February 2018. URL https://openreview.net/forum?id=BJ_w N01C-. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. ar Xiv pre-print, 2013. Ting-Wu Chin, Ruizhou Ding, Cha Zhang, and Diana Marculescu. Towards efficient model compression via learned global ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Auto Augment: Learning Augmentation Policies from Data, April 2019. URL http://arxiv.org/abs/1805.09501. ar Xiv:1805.09501 [cs, stat]. Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Rand Augment: Practical Automated Data Augmentation with a Reduced Search Space. In Advances in Neural Information Processing Systems, volume 33, pp. 18613 18624. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ d85b63ef0ccb114d0a3bb7b7d808028f-Abstract.html. Tim Dettmers and Luke Zettlemoyer. Sparse Networks from Scratch: Faster Training without Losing Performance. Technical Report ar Xiv:1907.04840, ar Xiv, August 2019. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, May 2019. URL http://arxiv.org/abs/1810.04805. ar Xiv:1810.04805 [cs]. Anastasia S. D. Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, and Carlo Luschi. Towards Structured Dynamic Sparse Pre-Training of BERT. Submitted to the International Conference on Learning Representations, January 2022. URL https://openreview.net/forum?id=-e7awdz Ws Oc. Xuanyi Dong and Yi Yang. Network pruning via transformable architecture search. Advances in Neural Information Processing Systems, 32, 2019. 
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, January 2021. URL https://openreview.net/forum?id=Yicb Fd NTTy. Erich Elsen, Marat Dukhan, Trevor Gale, and Karen Simonyan. Fast sparse convnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the Lottery: Making All Tickets Winners. Technical Report ar Xiv:1911.11134, ar Xiv, July 2021. ar Xiv:1911.11134. Utku Evci, Yani Ioannou, cem Keskin, and Yann Dauphin. Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win - AAAI 2022 Poster, February 2022. Trevor Gale, Matei Zaharia, Cliff Young, and Erich Elsen. Sparse gpu kernels for deep learning, 2020. Published as a conference paper at ICLR 2024 Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, April 2018. Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, 2015. Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, Las Vegas, NV, USA, June 2016. IEEE. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.90. Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. ar Xiv:2102.00554 [cs], January 2021. URL http://arxiv.org/abs/2102.00554. ar Xiv: 2102.00554. Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. Eugenia Iofinova, Alexandra Peste, Mark Kurtz, and Dan Alistarh. How well do sparse imagenet models transfer? Co RR, abs/2111.13445, 2021. URL https://arxiv.org/abs/2111.13445. Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. Sparse is enough in scaling transformers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=-b5OSCyd OMe. Siddhant Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, and Erich Elsen. Top-KAST: Top-K Always Sparse Training. 
In Advances in Neural Information Processing Systems, volume 33, pp. 20744 20754. Curran Associates, Inc., 2020. Bo Ji and Tianyi Chen. FSCNN: A Fast Sparse Convolution Neural Network Inference System, December 2022. URL https://arxiv.org/abs/2212.08815v1. Peng Jiang, Lihan Hu, and Shihui Song. Exposing and Exploiting Fine-Grained Block Structures for Fast and Accurate Sparse Training. In Proceedings of the Neural Information Processing Systems Conference (Neur IPS), October 2022. URL https://openreview.net/forum?id=s Fapsu4h Yo. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, and Aditya Akella. Accelerating deep learning inference via freezing. In 11th USENIX Workshop on Hot Topics in Cloud Computing (Hot Cloud 19), Renton, WA, July 2019. USENIX Association. URL https://www.usenix.org/conference/hotcloud19/presentation/kumar. Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks. In Proceedings of the 37th International Conference on Machine Learning, pp. 5533 5543. PMLR, November 2020. URL https://proceedings.mlr.press/v119/kurtz20a.html. ISSN: 2640-3498. Published as a conference paper at ICLR 2024 En Li, Liekang Zeng, Zhi Zhou, and Xu Chen. Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing. IEEE Transactions on Wireless Communications, 19(1):447 457, January 2020. ISSN 1558-2248. doi: 10.1109/TWC.2019.2946140. URL https://ieeexplore.ieee.org/abstract/document/8876870. Conference Name: IEEE Transactions on Wireless Communications. Mufan Li, Mihai Nica, and Daniel M. Roy. The future is log-gaussian: Resnets and their infinite-depthand-width limit at initialization. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1529 1538, 2020. Etai Littwin, Ben Myara, Sima Sabah, Joshua Susskind, Shuangfei Zhai, and Oren Golan. Collegial ensembles. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 18738 18748. Curran Associates, Inc., 2020. Kuang Liu. pytorch-cifar, 2017. URL https://github.com/kuangliu/pytorch-cifar. Liyang Liu, Shilong Zhang, Zhanghui Kuang, Aojun Zhou, Jing-Hao Xue, Xinjiang Wang, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Group fisher pruning for practical network compression. In International Conference on Machine Learning, pp. 7021 7032. PMLR, 2021a. Shiwei Liu, Tianlong Chen, Xiaohan Chen, Zahra Atashgahi, Lu Yin, Huanyu Kou, Li Shen, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Sparse Training via Boosting Pruning Plasticity with Neuroregeneration. In Advances in Neural Information Processing Systems, volume 34, pp. 9908 9922. Curran Associates, Inc., 2021b. Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. In Proceedings of the 38th International Conference on Machine Learning, pp. 
6989 7000. PMLR, July 2021c. URL https://proceedings.mlr.press/v139/liu21y.html. ISSN: 2640-3498. Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, December 2018. URL https: //openreview.net/forum?id=Bkg6Ri Cq Y7. Maintainers and Contributors. Torchvision: Pytorch s computer vision library. https: //github.com/pytorch/vision, 2016. Dyllan Mc Creary. Pytorch implementation of rigging the lottery: Making all tickets winners, Nov 2020. Re-implementation/extension of the work done by Google Research: https://github.com/googleresearch/rigl. Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating Sparse Deep Neural Networks, April 2021. URL http://arxiv.org/abs/2104.08378. ar Xiv:2104.08378 [cs]. Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, December 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-04316-3. Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the 36th International Conference on Machine Learning, pp. 4646 4655. PMLR, May 2019. ISSN: 2640-3498. Neural Magic. Deepsparse engine: Sparsity-aware deep learning inference runtime for CPUs, 2021. URL https://github.com/neuralmagic/deepsparse. Nvidia. Nvidia A100 Tensor Core GPU Architecture. Technical report, Nvidia, 2020. Published as a conference paper at ICLR 2024 Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015. Erik Schultheis and Rohit Babbar. Towards Memory-Efficient Training for Extremely Large Output Spaces Learning with 500k Labels on a Single Commodity GPU, June 2023. URL http://arxiv.org/abs/2306.03725. ar Xiv:2306.03725 [cs]. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56):1929 1958, 2014. ISSN 1533-7928. URL http://jmlr.org/papers/v15/srivastava14a.html. Xiu Su, Shan You, Tao Huang, Fei Wang, Chen Qian, Changshui Zhang, and Chang Xu. Locally free weight sharing for network width search. ar Xiv preprint ar Xiv:2102.05258, 2021. Yang Sui, Miao Yin, Yi Xie, Huy Phan, Saman Aliari Zonouz, and Bo Yuan. Chip: Channel independence-based pruning for compact neural networks. Advances in Neural Information Processing Systems, 34:24604 24616, 2021. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818 2826, June 2016. doi: 10.1109/CVPR.2016.308. ISSN: 1063-6919. Yehui Tang, Yunhe Wang, Yixing Xu, Dacheng Tao, Chunjing Xu, Chao Xu, and Chang Xu. Scop: Scientific control for reliable neural network pruning. Advances in Neural Information Processing Systems, 33:10936 10947, 2020. Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. 
COURSERA: Neural networks for machine learning, 4(2):26 31, 2012. Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In ICLR, 2020. Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi, and Kunlun Li. Merlin Huge CTR: GPU-accelerated Recommender System Training and Inference. In Proceedings of the 16th ACM Conference on Recommender Systems, Rec Sys 22, pp. 534 537, New York, NY, USA, September 2022. Association for Computing Machinery. ISBN 978-1-4503-9278-5. doi: 10.1145/3523227.3547405. URL https://dl.acm.org/doi/10.1145/3523227.3547405. Xiaorui Wu, Hong Xu, and Yi Wang. Irina: Accelerating DNN Inference with Efficient Online Scheduling. In Proceedings of the 4th Asia-Pacific Workshop on Networking, APNet 20, pp. 36 43, New York, NY, USA, August 2020. Association for Computing Machinery. ISBN 978-1-4503-8876-4. doi: 10.1145/3411029.3411035. URL https://dl.acm.org/doi/10.1145/3411029.3411035. Li Yang, Jian Meng, Jae-sun Seo, and Deliang Fan. Get More at Once: Alternating Sparse Training with Gradient Correction. In Proceedings of the Neural Information Processing Systems Conference (Neur IPS), October 2022. URL https://openreview.net/forum?id=l YZQRpq Lesi. Lu Yin, Gen Li, Meng Fang, Li Shen, Tianjin Huang, Zhangyang Wang, Vlado Menkovski, Xiaolong Ma, Mykola Pechenizkiy, and Shiwei Liu. Dynamic Sparsity Is Channel-Level Sparsity Learner. In Neural Information Processing Systems. ar Xiv, November 2023. doi: 10.48550/ar Xiv.2305.19454. URL http://arxiv.org/abs/2305.19454. ar Xiv:2305.19454 [cs]. Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. Advances in neural information processing systems, 32, 2019. Jiahui Yu and Thomas Huang. Autoslim: Towards one-shot architecture search for channel numbers. ar Xiv preprint ar Xiv:1903.11728, 2019. Published as a conference paper at ICLR 2024 Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, Siyue Wang, Minghai Qin, Bin Ren, Yanzhi Wang, Sijia Liu, and Xue Lin. MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge. In Advances in Neural Information Processing Systems, volume 34, pp. 20838 20850. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper/2021/ hash/ae3f4c649fb55c2ee3ef4d1abdb79ce5-Abstract.html. Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, et al. Mest: Accurate and fast memory-economic sparse training framework on the edge. Advances in Neural Information Processing Systems, 34:20838 20850, 2021b. Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. Cut Mix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022 6031, Seoul, Korea (South), October 2019. IEEE. ISBN 978-1-72814-803-8. doi: 10.1109/ICCV.2019.00612. URL https://ieeexplore.ieee.org/document/9008296/. Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks, 2017. ar Xiv:1605.07146. Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In International Conference on Machine Learning, May 2023. 
URL https://openreview.net/forum?id=r1Ddp1-Rb.

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation, November 2017. URL http://arxiv.org/abs/1708.04896. arXiv:1708.04896 [cs].

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning N:M fine-grained structured sparse neural networks from scratch. In International Conference on Learning Representations, 2021.

A SPARSITY AND OUTPUT-NORM VARIANCE

Consider a SNN with ReLU activations, where each neuron has on average k connections to the previous layer (i.e., fan-in). It has been shown by Evci et al. (2022) that by normalizing the weights at initialization by a factor of $\sqrt{2/k}$, one achieves the following desirable normalization property for each layer $\ell$ with output $z^{\ell}$:

$$\mathbb{E}\left[\frac{\|z^{\ell+1}\|^2}{\|z^{\ell}\|^2}\right] = 1,$$

meaning that, on average, the norm of each layer's output is preserved. However, the variance of this ratio is non-trivial. In networks with large depth, it can accumulate, leading to exponentially large variance at the final layer (Li et al., 2021). Minimizing this variance at initialization has been shown to have a positive effect on training dynamics in some network models (Littwin et al., 2020), as it stabilizes the gradients. We therefore analyze the output-norm variance as a guiding quantity for sparsity-type selection. In the following, we consider three different types of sparsity distributions, which respectively correspond to different degrees of sparsity structure in the SNN, and derive analytic expressions for the behaviour of the output-norm variance in SNNs with the given sparsity type. The derivations for the following results can be found in Appendix B.

Bernoulli sparsity: A connection between each neuron in layer $\ell+1$ and each neuron in layer $\ell$ appears independently with probability $p = k/n$, resulting in each neuron having k connections on average and each layer having nk connections on average. The variance is:

$$\mathrm{Var}_{\text{Bernoulli}} = \frac{5n - 8 + 18\,\frac{n}{k}}{n(n+2)}. \quad (1)$$

Constant Per-Layer sparsity: Exactly kn connections are distributed at random in the layer connecting the n neurons in layer $\ell+1$ and the n neurons in layer $\ell$, resulting in each neuron having k connections on average. The variance is:

$$\mathrm{Var}_{\text{Const-Per-Layer}} = \frac{(n^2 + 7n - 8)\,C_{n,k} + 18\,\frac{n}{k} - n^2 - 2n}{n(n+2)}, \quad (2)$$

where $C_{n,k} = \frac{n - 1/k}{n - 1/n}$. Note that when $n \gg 1$, $C_{n,k} \approx 1 - \frac{n-k}{n^2 k}$ is close to 1, and with $C_{n,k} = 1$ we recover the formula for Bernoulli sparsity, meaning that this sparsity type and Bernoulli sparsity are very similar.

Constant Fan-In sparsity: Each neuron in layer $\ell+1$ is connected to exactly k neurons from layer $\ell$, chosen uniformly at random. In this case, the variance is:

$$\mathrm{Var}_{\text{Const-Fan-In}} = \frac{5n - 8 + 18\,\frac{n}{k}}{n(n+2)} - \frac{3(n-k)}{kn(n+2)}. \quad (3)$$

In deriving the above results we assumed that the direction of the layer output vector $z^{\ell}/\|z^{\ell}\|$ is uniformly distributed on the unit sphere. We compare our theoretical predictions with simulations in Fig. 1b and verify their accuracy. Bernoulli and constant-per-layer distributions result in unstructured sparsity, and most of the current DST approaches, including RigL, operate with constant-per-layer sparsity. In contrast, the constant-fan-in type imposes a strong structural constraint. We are therefore somewhat surprised to find that, in fact, constant-fan-in sparsity always produces slightly smaller output-norm variance than the other types. The difference is larger when $k \ll n$, i.e., for very sparse networks.
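The closed-form expressions above are straightforward to sanity-check numerically. The sketch below is our own minimal Monte-Carlo estimate of Var(||z||²) for the constant fan-in case, using the layer model defined in Appendix B (z = sqrt(2/k)·(W ⊙ I)(ξ ⊙ u) with W iid N(0,1), a constant fan-in mask I, ξ ~ Bernoulli(1/2), and u uniform on the unit sphere); it is not the authors' simulation code, and the problem sizes are illustrative.

```python
import torch

def sample_z_norm_sq(n: int, k: int, trials: int = 10_000) -> torch.Tensor:
    """Monte-Carlo samples of ||z||^2 under constant fan-in sparsity."""
    samples = torch.empty(trials)
    for t in range(trials):
        W = torch.randn(n, n)
        I = torch.zeros(n, n)
        rows = torch.arange(n).repeat_interleave(k)
        cols = torch.stack([torch.randperm(n)[:k] for _ in range(n)]).flatten()
        I[rows, cols] = 1.0                       # exactly k ones per row
        xi = (torch.rand(n) < 0.5).float()        # Bernoulli(1/2) gating
        u = torch.randn(n)
        u = u / u.norm()                          # uniform direction on the sphere
        z = (2.0 / k) ** 0.5 * (W * I) @ (xi * u)
        samples[t] = z.pow(2).sum()
    return samples

n, k = 64, 4
s = sample_z_norm_sq(n, k)
var_theory = (5 * n - 8 + 18 * n / k) / (n * (n + 2)) - 3 * (n - k) / (k * n * (n + 2))
print(f"simulated Var: {s.var().item():.4f}   theory (Eq. 3): {var_theory:.4f}")
```

Swapping the mask construction for a Bernoulli or constant-per-layer mask reproduces the corresponding comparison in Fig. 1b.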
This indicates that, at the very least, the constant fan-in constraint should not impair SNN training dynamics and performance, motivating our method of maintaining the constant fan-in sparsity constraint within a DST approach. Published as a conference paper at ICLR 2024 B COMPUTING THE OUTPUT NORM VARIANCE Definition B.1. Let ξ {0,1}N be a binary vector. Let I {0,1}N N be an N N binary matrix. Let u RN be any vector. Let W RN N be a matrix of iid N(0,1) random variables. Define the vector z by: 2 k (W I)(ξ u) (4) i.e. the entries zi are given by: j=1 Wij Iijξjuj (5) Proposition B.2. The variance of each entry zi is: j=1 Iijξju2 j (6) and therefore the distribution of each zi can be written as j=1 Iijξju2 j (7) where gi are N iid N(0,1) random variables. Proof. By the properties of variance: j,j Iij Iij ξjξj uju j Cov(Wij,Wij ) (8) j,j Iij Iij ξjξj uju jδj=j (9) j I2 ijξ2 j u2 j (10) j Iijξju2 j (11) since I2 ij = Iij and ξ2 j = ξj because they are binary valued. Once the variance is established, notice that zi is a linear combination of Gaussians with zi zi , because the row Wij Wi j. Hence the zi are independent Gaussians, so the form zi d=gi q 2 k Pn j=1Iijξju2 j follows. Corollary B.3. The norm z 2 can be written as: i,j=1 g2 i Iijξju2 j (12) Proposition B.4 ( Bernoulli Sparsity ). Suppose that u Rn is uniform from the unit sphere, the entries Iij Ber k n , ξj Ber( 1 2) all independent of each other. Then: E z 2 =1 (13) Var z 2 = 5n 8+18 n k n(n+2) (14) Published as a conference paper at ICLR 2024 Case Num. Terms E g2 i g2 i E[Ii j Iij] E[ξjξj ] E u2 ju2 j i=i ,j =j n2 3 k n 1 2 3 n(n+2) i =i ,j =j n2(n 1) 1 k n 2 1 2 3 n(n+2) i=i ,j =j n2(n 1) 3 k 2 2 1 n(n+2) i =i ,j =j n2(n 1)2 1 k 2 2 1 n(n+2) Table 6: Overview of terms for Bernoulli type sparsity. Proof. We have i,j=1 E g2 i Iijξju2 j (15) i,j=1 E g2 i E[Iij]E[ξj]E u2 j (16) Similarly, we compute the 4-th moment as follows: i,j,i ,j E g2 i g2 i E[Ii j Iij]E[ξjξj ]E u2 ju2 j (19) We split this into four cases and evaluate these based on whether or not i=i and j =j in the following table. Combining the value of each term with the number of terms gives the desired result for the variance. Proposition B.5 ( Constant-per-layer sparsity ). Suppose that u Rn is uniform from the unit sphere and ξj Ber( 1 2) are independent of each other. Suppose the entries of the matrix Iij are chosen such that: There are exactly kn ones and exactly n2 nk zeros in the matrix I, and their positions in the matrix are chosen uniformly from the n2 nk possible configurations. Then: E z 2 =1 (20) Var z 2 = (n2+7n 8)Cn,k+18 k n n2 2n n(n+2) (21) Proof. Note that E(Iij)=k/n still holds, since there are kn ones distributed over n2 locations. Thus the computation for E( z 2) is identical to the previous proposition. Note also that when there are two entries, we have: E[Iij Ii j ]= ( k n if i=i and j =j n2 1 otherwise (22) ( k n if i=i and j =j k n 2 Cn,k otherwise (23) where Cn,k = n 1/k n 1/n. The table with terms for computing E( z 4) becomes: The extra factor of Cn,k in the entries leads to the stated result. Published as a conference paper at ICLR 2024 Case Num. Terms E g2 i g2 i E[Ii j Iij] E[ξjξj ] E u2 ju2 j i=i ,j =j n2 3 k n 1 2 3 n(n+2) i =i ,j =j n2(n 1) 1 k n 2Cn,k 1 2 3 n(n+2) i=i ,j =j n2(n 1) 3 k 2 2 1 n(n+2) i =i ,j =j n2(n 1)2 1 k 2 2 1 n(n+2) Table 7: Overview of terms for Constant-per-layer type sparsity. Proposition B.6 ( Constant Fan-In sparsity ). 
Suppose that u ∈ R^n is uniform on the unit sphere and the ξ_j ~ Ber(1/2) are all independent of each other. Suppose the entries of the matrix I are chosen so that:

1. There are exactly k ones (and exactly n - k zeros) in each row of the matrix I, chosen uniformly from the possible ways this can happen.
2. Different rows of I are independent.

Then:

E[\|z\|^2] = 1, \quad (24)
\mathrm{Var}(\|z\|^2) = \frac{5n - 8 + 18\,n/k}{n(n+2)} - \frac{3(n-k)}{kn(n+2)}. \quad (25)

Proof. The same arguments as before apply, but now we have

E[I_{ij} I_{i'j'}] = \begin{cases} \tfrac{k}{n} & \text{if } i = i' \text{ and } j = j', \\ \tfrac{k}{n} \cdot \tfrac{k-1}{n-1} & \text{if } i = i' \text{ and } j \neq j', \\ \left(\tfrac{k}{n}\right)^2 & \text{otherwise}, \end{cases} \quad (26)

and the table for the variance computation becomes Table 8, which leads to the stated result.

Table 8: Overview of terms for constant fan-in type sparsity.

Case            Num. terms    E[g_i^2 g_{i'}^2]   E[I_{i'j'} I_{ij}]     E[ξ_j ξ_{j'}]   E[u_j^2 u_{j'}^2]
i=i', j=j'      n^2           3                   k/n                    1/2             3/(n(n+2))
i≠i', j=j'      n^2(n-1)      1                   (k/n)^2                1/2             3/(n(n+2))
i=i', j≠j'      n^2(n-1)      3                   (k/n)·(k-1)/(n-1)      (1/2)^2         1/(n(n+2))
i≠i', j≠j'      n^2(n-1)^2    1                   (k/n)^2                (1/2)^2         1/(n(n+2))

Figure 5 & Table 9: Test accuracy of WideResNet-22 trained on CIFAR-10. Mean and 95% confidence intervals are reported over five runs. The dense WideResNet-22 benchmark reaches 95.0%.

Sparsity (%)   RigL          SRigL w/o ablation   SRigL w/ ablation
50             94.6 ± 0.1    94.7 ± 0.1           94.6 ± 0.1
60             94.6 ± 0.1    94.5 ± 0.1           94.6 ± 0.1
70             94.5 ± 0.1    94.4 ± 0.1           94.4 ± 0.1
80             94.0 ± 0.1    94.1 ± 0.2           94.0 ± 0.1
90             93.3 ± 0.1    93.1 ± 0.1           93.3 ± 0.1
95             92.1 ± 0.1    91.4 ± 0.1           91.8 ± 0.2
99             84.9 ± 0.2    76.9 ± 0.3           82.7 ± 0.8

C WIDE RESNET-22 TRAINED ON CIFAR-10

In Fig. 5 we present results of training WideResNet-22 (Zagoruyko & Komodakis, 2017) with RigL or SRigL on the CIFAR-10 dataset. The training details for this experiment are identical to those reported in Section 4.1. SRigL without ablation performs poorly at very high sparsities. With ablation, SRigL achieves generalization performance comparable to RigL.

D HYPERPARAMETER AND TRAINING DETAILS

D.1 RESNET-18 TRAINED ON CIFAR-10

As per Liu (2017), we modify the original ResNet-18 network by changing the kernel dimensions of the first convolutional layer to 3×3 instead of 7×7. Further, we reduce the stride in the first two convolutional layers to one to avoid excessive reduction of the feature map's spatial dimensions. We train each network for 250 epochs (97,656 steps) using a batch size of 128. An initial learning rate of 0.1 is reduced by a factor of 5 every 77 epochs (about 30,000 steps). We use stochastic gradient descent (SGD) with momentum, with an L2 weight-decay coefficient of 5e-4 and a momentum coefficient of 0.9. We train each model using a single Nvidia V100 GPU.

We achieve the desired overall sparsity by distributing the per-layer sparsity according to the ERK (Evci et al., 2021; Mocanu et al., 2018) distribution, which scales the per-layer sparsity based on the number of neurons and the dimensions of the convolutional kernel, if present. We set the number of mini-batch steps between connectivity updates, T, to 100. γsal is set to 30% based on the results of a small grid search performed on CIFAR-10 with ResNet-18 and WideResNet-22; see Fig. 8 for details. For each trial, we select a desired sparsity in the range from 0.5 to 0.99. At each connectivity update, the portion of weights to be pruned or regrown follows a cosine annealing schedule (Dettmers & Zettlemoyer, 2019) with an initial value α = 0.3. The portion of weights to be updated decays from the initial value to zero once 75% of the total training steps have been completed, after which the weight mask remains constant; a minimal sketch of this schedule is given below.
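Concretely, the update fraction described above follows the cosine decay f(t) = (α/2)(1 + cos(πt/T_end)), with T_end set to 75% of training, and is zero afterwards. The snippet below is a minimal sketch of this schedule; the function and argument names are ours, not identifiers from the SRigL code base:

```python
import math

def prune_regrow_fraction(step: int, total_steps: int,
                          alpha: float = 0.3, end_frac: float = 0.75) -> float:
    """Cosine-annealed fraction of weights pruned/regrown at a connectivity update."""
    t_end = end_frac * total_steps
    if step >= t_end:
        return 0.0  # mask is frozen for the remainder of training
    return 0.5 * alpha * (1.0 + math.cos(math.pi * step / t_end))

# Example: ResNet-18/CIFAR-10 schedule with connectivity updates every T = 100 steps.
fractions = [prune_regrow_fraction(s, total_steps=97_656) for s in range(0, 97_656, 100)]
```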
D.2 RESNET-50 TRAINED ON IMAGENET

We use a mini-batch size of 512 instead of 4096, and we linearly scale the learning rate and T to account for our smaller batch size. Linearly scaling the learning rate in this manner was included in the original RigL source code and is further motivated by Goyal et al. (2018). We increase T to 800 and average the dense gradients over eight mini-batch steps to ensure that SRigL has the same quality of parameter-saliency information available as RigL at each network connectivity update. We set γsal to 30% based on our grid search presented in Fig. 8.

Our learning rate uses a linear warm-up to reach a maximum value of 0.2 at epoch five and is reduced by a factor of 10 at epochs 30, 70, and 90. Using a mini-batch size of 512, we train the networks for 256,000 steps to match RigL's training duration. We use a cosine connectivity update schedule with α = 0.3. We initialize the sparse model weights per Evci et al. (2022). We train the networks using SGD with momentum, L2 weight decay, and label smoothing (Szegedy et al., 2016) coefficients of 0.9, 1e-4, and 0.1, respectively. We use the same standard data augmentation in our data preprocessing as RigL, including random resizing to 256×256 or 480×480 pixels, random crops to 224×224 pixels, random horizontal flips, and per-image normalization to zero mean and unit variance using identical per-RGB-channel mean and standard deviation values as RigL. We train each model using either four Nvidia V100 or A100 GPUs.

D.3 VISION TRANSFORMER TRAINED ON IMAGENET

For our ViT-B/16 experiments, we use sparsity on the convolutional projection (input projection to patches), the fully connected layers in the feed-forward (MLP) blocks, and the output projections of the multi-headed attention (MHA) modules. We performed a lightweight ablation study on four ViT-B/16 networks trained on ImageNet to determine the effect of sparsifying the first convolutional projection layer as well as the input projection layers in the MHA modules. Based on the results of our ablation study, we did not use sparsity on the MHA input projection layers or the scaled dot products; see Fig. 9b for more details. This setup is similar to the "Sparse FF" models investigated by Jaszczur et al. (2021). The global model sparsity level reported in Table 4 is calculated based on the sparse modules only. If we also consider the parameters in the MHA input projections as part of our parameter budget, the global model sparsities tabulated in Table 4 correspond to 60.35% and 67.90% for the rows labelled 80% and 90% sparsity, respectively.

We add additional data augmentations following the standard TorchVision (Maintainers & Contributors, 2016) ViT-B/16 training procedure for ImageNet. The applied data augmentations include: random cropping, resizing the cropped image to 224×224 pixels, random horizontal flips, random augmentation with the RandAugment algorithm (Cubuk et al., 2020), and normalization with the typical RGB channel mean and standard deviation values. We also randomly choose one of random mixup (Zhang et al., 2023) or random cutmix (Yun et al., 2019) per batch and add it to the above-noted augmentations. We use 0.2 and 1.0 for the alpha parameter values of mixup and cutmix, respectively; a sketch of this per-batch choice is given below.
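The following is an illustrative re-implementation of the per-batch choice between mixup and cutmix described above, written as a standalone PyTorch function rather than the TorchVision recipe code; the function name and structure are ours:

```python
import torch

def mixup_or_cutmix(images, targets, num_classes=1000,
                    mixup_alpha=0.2, cutmix_alpha=1.0):
    """Apply either mixup or cutmix (chosen uniformly at random) to a batch.

    images: (B, C, H, W) float tensor; targets: (B,) integer class labels.
    Returns augmented images and soft (mixed) one-hot targets.
    """
    onehot = torch.nn.functional.one_hot(targets, num_classes).float()
    perm = torch.randperm(images.size(0))
    if torch.rand(()) < 0.5:  # mixup: convex combination of two images
        lam = torch.distributions.Beta(mixup_alpha, mixup_alpha).sample()
        images = lam * images + (1.0 - lam) * images[perm]
    else:                     # cutmix: paste a random box from the permuted batch
        lam = torch.distributions.Beta(cutmix_alpha, cutmix_alpha).sample()
        H, W = images.shape[-2:]
        rh, rw = int(H * (1.0 - lam).sqrt()), int(W * (1.0 - lam).sqrt())
        cy, cx = int(torch.randint(H, ())), int(torch.randint(W, ()))
        y0, y1 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
        x0, x1 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
        images = images.clone()
        images[:, :, y0:y1, x0:x1] = images[perm][:, :, y0:y1, x0:x1]
        lam = 1.0 - (y1 - y0) * (x1 - x0) / (H * W)  # correct for the clipped box area
    return images, lam * onehot + (1.0 - lam) * onehot[perm]
```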
We omit Dropout (Srivastava et al., 2014) from the model entirely to avoid potential layer collapse in the case where all non-zero weights are dropped from a layer, and to avoid any other unintended interference with SRigL's sparse training procedure. We sample eight mini-batch steps with 512 samples per mini-batch and accumulate gradients before applying the optimizer, resulting in an effective mini-batch size of 4096. We train the model for 150 epochs using an AdamW (Loshchilov & Hutter, 2018) optimizer with weight decay, label smoothing, β1, and β2 coefficients of 0.3, 0.11, 0.9, and 0.999, respectively. We use cosine annealing with linear warm-up for our learning rate scheduler, with an initial learning rate of 9.9e-5 that warms up to a maximum value of 0.003 at epoch 16. We clip all parameter gradients to a maximum L2 norm of 1.0. We apply uniformly distributed sparsity across all layers in the model. T is set to 100 to update network connectivity every 100 mini-batch steps. We train each model using either four Nvidia V100 or A100 GPUs.

D.4 MOBILENET-V3 TRAINED ON IMAGENET

We follow the TorchVision (Maintainers & Contributors, 2016) training recipe for MobileNet-V3 Large and Small on ImageNet. We set T to 100 and γsal to 30%, similar to our other CNN experiments. We train the models from scratch for 600 epochs using an RMSProp (Tieleman et al., 2012) optimizer with momentum, L2 weight decay, and smoothing constant coefficients of 0.9, 1e-5, and 0.9, respectively. The networks are trained with a step learning rate decay schedule with an initial learning rate of 0.064 and a multiplicative factor of 0.973, decaying the learning rate every two epochs. The input data is augmented with random cropping to 224×224 pixels, random horizontal flips, AutoAugment using the ImageNet policy (Cubuk et al., 2019), normalization to standard RGB mean and standard deviation values, and random erasing with a probability of 0.2 (Zhong et al., 2017). Similar to the above, we omit Dropout (Srivastava et al., 2014) to avoid potential layer collapse. Unlike the TorchVision recipe, we do not average the trained parameters across the last three checkpoints that improved the top-1 accuracy. We train with a batch size of 512 and accumulate gradients across two mini-batches, resulting in an effective mini-batch size of 1024. We train each model using four Nvidia A100 GPUs.

Figure 6: MobileNet-V3 Large / ImageNet top-1 accuracy vs. sparsity. SRigL compares well against RigL, but both models perform poorly compared to the dense baseline at 99% sparsity.

Figure 7: MobileNet-V3 Small / ImageNet top-1 accuracy vs. sparsity. SRigL compares well against RigL, but both models perform poorly compared to the dense baseline at 99% sparsity.

E TUNING γsal, MINIMUM PERCENTAGE SALIENT WEIGHTS PER NEURON

Fig. 8 depicts the generalization performance of highly sparse ResNet-18 and WideResNet-22 models trained on the CIFAR-10 dataset. SRigL's generalization performance at high sparsities is improved with neuron ablation; however, the specific value selected for γsal does not have a significant effect on performance. Our experiments demonstrate that SRigL performs well with a variety of γsal values. In Section 4 we report the results of SRigL models trained with γsal set to 30%.
With dynamic ablation enabled, we set the minimum number of salient weights per neuron to one if the user-defined threshold results in a value less than one. As shown in Fig. 10, many layers in ResNet-50 are set to this minimum threshold of one when we apply a γsal of 30%, the value used for all model types other than ViT-B/16. This minimum threshold explains the invariance of the model's performance when comparing multiple values of γsal.

Figure 8: (a) ResNet-18/CIFAR-10 test accuracy vs. γsal (minimum salient weights per neuron, %) when trained with SRigL with and without ablation at 90%, 95%, and 99% sparsity. The mean and 95% confidence intervals are shown for five different random seeds for the runs with ablation; for the runs without ablation, we report the mean over five different random seeds. (b) WideResNet-22/CIFAR-10 test accuracy vs. γsal. The mean and 95% confidence intervals are shown for five different random seeds.

Fig. 9a demonstrates that ViT-B/16's generalization performance is much more sensitive to γsal. We find that RigL learns a sparse connectivity pattern with a large variance in sparse fan-in between neurons within a given layer, with some neurons having an order of magnitude more fan-in connections than the mean fan-in.

Figure 9: (a) ViT-B/16/ImageNet test accuracy vs. γsal when trained with SRigL with and without ablation enabled at 80% and 90% sparsity. ViT-B/16's performance is much more sensitive to γsal and generally performs best with high ablation thresholds; based on this data we set γsal to 95% for the results reported in Section 4.3. (b) ViT-B/16 ablation study over dense vs. sparse MHA input projections and dense vs. sparse first layer. The best-performing variant used a sparse first layer and dense input projections in the MHA modules.

F CONDENSED MATRIX MULTIPLICATION

Using a constant fan-in sparse representation presents an advantage compared to the general N:M sparse representation in that we can represent our weight matrices in a compact form, since every neuron/convolutional filter has the same number of non-zero weights. Here we demonstrate how this can be used to accelerate a fully-connected layer. Consider the standard matrix-vector product:

v_{\text{out}} = W v = \begin{pmatrix} W_{11} & W_{12} & \dots & W_{1d} \\ W_{21} & W_{22} & \dots & W_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ W_{n1} & W_{n2} & \dots & W_{nd} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{pmatrix}. \quad (29)

When W ∈ R^{n×d} is sparse and has only k non-zero elements per row, the sum representing each element of v_out is limited to k terms, i.e.:

(v_{\text{out}})_i = \sum_{\alpha=1}^{k} W_{i j_\alpha} v_{j_\alpha}, \quad \text{with } j_\alpha \in \{1,\dots,d\} \text{ and } j_\alpha \neq j_{\alpha'} \text{ for } \alpha \neq \alpha'. \quad (30)

Note that the expression on the right-hand side of Eq. (29) can be represented as an operation between a dense matrix W^c ∈ R^{n×k} (we call it the condensed W) and k vectors v_{π_1}, ..., v_{π_k}, with v_{π_i} ∈ R^n, whose elements are drawn from v with replacement (we call them recombinations of v). The operation is a sum over element-wise products between the i-th column of W^c and the i-th recombination vector v_{π_i}:

v_{\text{out}} = \sum_{i=1}^{k} W^c_{:,i} \odot v_{\pi_i}. \quad (31)

Mathematically, these methods are equivalent for any matrices. Computationally, the condensed method can be more efficient, in particular for sparse matrices with constant small fan-in k; a minimal sketch of this computation is given below.
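As a concrete illustration of Eq. (31), the following is a minimal PyTorch sketch of the condensed product; the function and variable names are ours and do not correspond to our optimized CPU/GPU kernels:

```python
import torch

def condensed_matvec(w_condensed: torch.Tensor,
                     col_idx: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
    """Constant fan-in sparse matrix-vector product, Eq. (31).

    w_condensed: (n, k) non-zero values of W, one row per output neuron.
    col_idx:     (n, k) integer column indices of those values in the full W.
    v:           (d,)   dense input vector.
    """
    # v[col_idx] gathers the k "recombinations" of v, shape (n, k); the
    # element-wise product and row-sum realize the sum over W^c[:, i] * v_{pi_i}.
    return (w_condensed * v[col_idx]).sum(dim=1)

# Sanity check against the equivalent dense product.
n, d, k = 8, 16, 4
col_idx = torch.stack([torch.randperm(d)[:k] for _ in range(n)])   # k distinct columns per row
w_condensed = torch.randn(n, k)
W = torch.zeros(n, d).scatter_(1, col_idx, w_condensed)            # reconstruct the sparse W
v = torch.randn(d)
assert torch.allclose(condensed_matvec(w_condensed, col_idx, v), W @ v, atol=1e-6)
```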
By construction, this method requires the sparse matrix W to be stored in a dense, condensed representation consisting of two 2D arrays of shape n×k: one holds the values of the non-zero elements of W, and the other holds their respective column indices, which are used to generate the input-vector recombinations. An efficient computational implementation of this method is the subject of ongoing work on this project. Based on our results, the constant fan-in constraint does not appear to have a limiting effect on SNNs.

Figure 10: ResNet-50 layer vs. minimum salient weights per neuron. SRigL sets the minimum number of salient weights per neuron to 1 if the product between γsal and the sparse fan-in per neuron is less than 1. Therefore, even in a relatively large network such as ResNet-50, many of the layers only require that a single weight be active to keep the neuron active. We believe this is why SRigL's performance is relatively invariant to the ablation threshold when applied to CNNs.

Figure 11: ResNet-18/CIFAR-10 layer widths at the end of training at 99% sparsity. Without ablation, the constant fan-in constraint enforces that sparse layers retain their original width. When ablation is enabled, the γsal threshold (minimum percentage of salient weights per neuron) is used to control the amount of ablation.

Figure 12: Sparse fan-in vs. ViT-B/16 layer index at the end of training with RigL at 90% sparsity. Only the first 10 layers are shown for clarity. We find that RigL learns a sparse connectivity with large variance in fan-in between neurons within the same layer, with some neurons receiving up to 10× the number of active connections of the mean for the same layer.

Figure 13: Training FLOPs for SRigL on ResNet-50/ImageNet at a variety of sparsities compared with dense generalization. FLOPs are normalized by dense training FLOPs.

G FLOPS ANALYSIS

In Fig. 13, we present an analysis of the FLOPs required during training and inference for SRigL and compare with SR-STE. We calculate FLOPs using the same methodology as Evci et al. (2021), considering only operations induced by convolutional and linear layers and their activations; FLOPs for add and pooling operations are ignored. For training FLOPs, we also disregard the FLOPs required for mask updates, as this step is amortized over T steps and is negligible compared to the FLOPs otherwise required for training. A simplified sketch of this counting convention is given below.
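The following is a simplified sketch of this counting convention for linear and 2D convolutional layers (a scaled-down illustration, not the MicroNet Challenge counter we actually use); sparse FLOPs are obtained by scaling the dense multiply-accumulate count by the layer's weight density:

```python
def linear_flops(in_features: int, out_features: int, density: float = 1.0) -> int:
    """Multiply-accumulate FLOPs for a linear layer, scaled by weight density."""
    return int(2 * in_features * out_features * density)

def conv2d_flops(c_in: int, c_out: int, k_h: int, k_w: int,
                 out_h: int, out_w: int, density: float = 1.0) -> int:
    """Multiply-accumulate FLOPs for a 2D convolution, scaled by weight density."""
    return int(2 * c_in * c_out * k_h * k_w * out_h * out_w * density)

# Example: a 90%-sparse 3x3 convolution with 256 input/output channels on a 14x14 feature map.
print(conv2d_flops(256, 256, 3, 3, 14, 14, density=0.1))
```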
The open-source code for counting operations is from the NeurIPS 2019 MicroNet Challenge and is available on GitHub². Similar to other DST methods, SRigL obtains generalization performance comparable to a dense network benchmark at a fraction of the FLOPs required for both training and inference.

H IN-TIME OVERPARAMETERIZATION RATES

In Figs. 14 to 17 we present the In-Time Overparameterization (ITOP) rate (Liu et al., 2021c) for various models and datasets. In the same work, Liu et al. (2021c) proposed modified hyperparameters for RigL that may yield higher generalization performance; however, a detailed investigation of these hyperparameters for SRigL is left to future work.

Figure 14: ResNet-18/CIFAR-10 ITOP rate vs. sparsity for RigL, SRigL w/o ablation, and SRigL.
Figure 15: ResNet-18/CIFAR-10 ITOP rate vs. sparsity for RigL, SRigL w/o ablation, and SRigL.
Figure 16: ViT-B/16/ImageNet ITOP rate vs. sparsity for RigL, SRigL w/o ablation, and SRigL.
Figure 17: ResNet-50/ImageNet ITOP rate vs. sparsity for RigL, SRigL w/o ablation, SRigL, SRigL x2, and SRigL x5.

² MicroNet Challenge GitHub repository.

I CONDENSED LINEAR CPU BENCHMARK DETAILS

For each sparsity level, we use the trained weights from the last linear layer in the final multi-layer perceptron block of the ViT-B/16 transformer encoder. This layer has a width of 768 neurons and an input of 3072 features. The input and layer parameters are all stored as 32-bit floating point values. Across all sparsities, batch sizes, and numbers of threads investigated, our condensed representation utilizing both structured and fine-grained sparsity yields the fastest online inference speed. However, at higher batch sizes and modest sparsities, structured sparsity alone is often faster than our condensed representation. See Figs. 18 to 20 for benchmark results with 1-8 threads and batch sizes 1-64. We note that SRigL with either a condensed or a structured sparse representation yields the fastest benchmark times.

We used torch.compile with the inductor backend. For compiler options, we used the max-autotune mode and full-graph output. However, full-graph output is not compatible with CSR formats, so we omit this option for the unstructured benchmarks. The benchmark script was run with a niceness value of 15 to ensure as accurate results as possible. The apparent slow-down of the 99% structured sparse benchmarks compared to other sparsities is due to the fact that SRigL ablates fewer neurons at 99% sparsity: at extreme sparsities, each neuron has very few active weights, resulting in more neurons being considered salient by SRigL.

J GPU BENCHMARKS

Using the GPU CUDA kernels developed by Schultheis & Babbar (2023), we accelerate our sparse networks and demonstrate a significant acceleration for batched inference and a modest acceleration for online inference at high sparsities (>90%); see Fig. 21. All runs were conducted on an NVIDIA Titan V; note that the y-axis scale is logarithmic. A minimal sketch of the single-layer timing harness used for these benchmarks is given below.
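The following is a minimal sketch of the kind of single-layer timing harness used for the benchmarks in Appendices I and J; the layer construction and names are illustrative, and a dense nn.Linear stands in for the condensed, structured, and CSR variants that we actually time:

```python
import torch
from torch.utils import benchmark

# Illustrative stand-in for the benchmarked ViT-B/16 MLP output layer (3072 -> 768).
layer = torch.nn.Linear(3072, 768).eval()
compiled = torch.compile(layer, backend="inductor", mode="max-autotune", fullgraph=True)

x = torch.randn(1, 3072)  # online inference; increase the batch dimension for batched runs
with torch.inference_mode():
    compiled(x)  # warm-up / trigger compilation
    timer = benchmark.Timer(stmt="compiled(x)", globals={"compiled": compiled, "x": x})
    print(timer.timeit(100).median)  # median wall-clock time per call, in seconds
```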
Figure 18: CPU benchmarks with 1 thread, up to batch size 64, comparing SRigL (ours), structured, unstructured, and dense representations.

Figure 19: CPU benchmarks with 4 threads, up to batch size 64, comparing SRigL (ours), structured, unstructured, and dense representations.

Figure 20: CPU benchmarks with 8 threads, up to batch size 64, comparing SRigL (ours), structured, unstructured, and dense representations.

Figure 21: Real-world GPU wall-clock timings for inference on an NVIDIA Titan V at 80%, 90%, 95%, and 99% sparsity. We compare timings for a fully-connected layer extracted from the ViT-B/16 model trained with SRigL when compressed using the condensed representation learned by SRigL, structured (i.e., SRigL with only neuron ablation), and unstructured (i.e., CSR) representations. Sub-figures 21a, 21b, and 21c show (a) online inference and (b, c) batched inference, with batch sizes of 1, 256, and 2048, respectively. The median over a minimum of 5 runs is shown, and the error bars show the standard deviation. Note: the y-axis scale is logarithmic.

Figure 22: Online inference with DeepSparse compared to SRigL (ours), structured, unstructured (CSR), and dense representations on an Intel Xeon W-2145 with 4 threads at 80%, 90%, 95%, and 99% sparsity. The median over a minimum of 5 runs is shown, and the error bars show the standard deviation.

K DEEPSPARSE CPU BENCHMARKS

Here we present online inference benchmarks for CPU using the DeepSparse engine library (Iofinova et al., 2021). The DeepSparse library includes several engineering innovations to accelerate unstructured sparsity on CPU. For instance, a depth-wise asynchronous execution algorithm takes advantage of the relatively large cache size of CPUs compared to hardware accelerators such as GPUs. Other innovations include pre-loading the input data to hide latency via CPU pipelining, compressing sparse activations into a CSR format on-the-fly, and keeping convolutional kernels in L2 cache. For more details see Kurtz et al. (2020). We compare our CPU timings for SRigL to DeepSparse in Fig. 22 and find similar latency; however, we note that DeepSparse is subject to higher variability, as evidenced by a larger standard deviation. Further, many of the innovations used to accelerate unstructured sparse networks with DeepSparse could equally be applied to networks trained with SRigL.

L COMPARISON WITH STRUCTURED PRUNING METHODS

In the following table we compare several structured pruning methods to SRigL. The tabulated structured pruning methods typically prune and fine-tune a pretrained model, resulting in an extended training duration compared to typical dense training.
We report the inference FLOPs, top-1 accuracy, and number of training epochs for each method in Table 10.

Table 10: Top-1 ImageNet test accuracy of ResNet-50 for various structured pruning methods compared with SRigL and Chase (Yin et al., 2023). All values, except for SRigL, are obtained from Yin et al. (2023). SRigL and Chase are DST methods; all other methods tabulated are structured pruning methods.

Method                              Inference FLOPs   Top-1 Accuracy   Epochs
Uniform                             2.0G              75.1%            300
Random                              2.0G              74.6%            300
GBN (You et al., 2019)              2.4G              76.2%            350
LEGR (Chin et al., 2020)            2.4G              75.7%            -
FPGM (He et al., 2019)              2.4G              75.6%            200
TAS (Dong & Yang, 2019)             2.3G              76.2%            240
HRank (Lin et al., 2020)            2.3G              75.0%            570
SCOP (Tang et al., 2020)            2.2G              76.0%            230
CHIP (Sui et al., 2021)             2.2G              76.3%            -
Group Fisher (Liu et al., 2021a)    2.0G              76.4%            -
AutoSlim (Yu & Huang, 2019)         2.0G              75.6%            -
CafeNet-R (Su et al., 2021)         2.0G              76.5%            300
Chase-1 (Yin et al., 2023)          1.5G              76.6%            250
SRigL                               2.0G              74.7%            205
SRigL                               2.0G              76.2%            515
Uniform                             1.0G              73.1%            300
Random                              1.0G              72.2%            300
Group Fisher (Liu et al., 2021a)    1.0G              73.9%            -
CafeNet-R (Su et al., 2021)         1.0G              74.9%            300
CafeNet-E (Su et al., 2021)         1.0G              75.3%            300
Chase-2 (Yin et al., 2023)          0.9G              75.7%            250
SRigL                               1.0G              71.5%            205
SRigL                               1.0G              73.6%            515