# Sparse Weight Activation Training

Md Aamir Raihan, Tor M. Aamodt
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC
{araihan,aamodt}@ece.ubc.ca

## Abstract

Neural network training is computationally and memory intensive. Sparse training can reduce the burden on emerging hardware platforms designed to accelerate sparse computations, but it can also affect network convergence. In this work, we propose a novel CNN training algorithm called Sparse Weight Activation Training (SWAT). SWAT is more computation- and memory-efficient than conventional training. SWAT modifies back-propagation based on the empirical insight that convergence during training tends to be robust to the elimination of (i) small-magnitude weights during the forward pass and (ii) both small-magnitude weights and activations during the backward pass. We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and Wide ResNet using the CIFAR-10, CIFAR-100 and ImageNet datasets. For ResNet-50 on ImageNet, SWAT reduces total floating-point operations (FLOPs) during training by 80%, resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms, while incurring only a 1.63% reduction in validation accuracy. Moreover, SWAT reduces the memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights. Code is available at https://github.com/AamirRaihan/SWAT.

## 1 Introduction

Convolutional Neural Networks (CNNs) are effective at many complex computer vision tasks including object recognition [27, 57], object detection [56, 50] and image restoration [12, 65]. However, training CNNs requires significant computation and memory resources. Software and hardware approaches have been proposed for addressing this challenge. On the hardware side, graphics processing units (GPUs) are now typically used for training [27], and recent GPUs from NVIDIA include specialized Tensor Core hardware specifically to accelerate deep learning [46, 44, 45]. Specialized programmable hardware is being deployed in datacenters by companies such as Google and Microsoft [25, 7]. Techniques for reducing computation and memory consumption on existing hardware include those reducing the number of training iterations, such as batch normalization [24] and enhanced optimization strategies [26, 13], and those reducing computations per iteration. Examples of the latter, which may be effective with appropriate hardware support, include techniques such as quantization [67, 6, 62, 59], use of fixed-point instead of floating-point arithmetic [63, 8], sparsification [60] and dimensionality reduction [34]. This paper introduces Sparse Weight Activation Training (SWAT), which significantly extends the sparsification approach.

Training involves repeated application of forward and backward passes. Prior research on introducing sparsity during training has focused on sparsifying the backward pass. While model compression [18, 40, 36, 35, 32, 21, 37, 61, 41] introduces sparsification into the forward pass, it typically does so by introducing additional training phases which increase overall training time. Amdahl's Law [2] implies overall speedup is limited by the fraction of original execution time spent on computations that are not sped up by system changes.
To reduce training time significantly by reducing computations per training iteration, it is necessary to address both the forward and backward passes. SWAT introduces sparsification into both passes and is suitable for emerging hardware platforms with support for sparse matrix operations. Such hardware is now available: for example, the recently announced Ampere GPU architecture [46] includes support for exploiting sparsity. In addition, there is a growing body of research on hardware accelerators for sparse networks [47, 9, 66, 1], and we demonstrate, via hardware simulation, that SWAT can potentially train 5.9× faster when such accelerators become available. While SWAT employs sparsity, it does so with the objective of reducing training time, not performing model compression. The contributions of this paper are:

- an empirical sensitivity analysis of approaches to inducing sparsity in network training;
- SWAT, a training algorithm that introduces sparsity in weights and activations, resulting in reduced execution time in both the forward and backward passes of training;
- an empirical evaluation showing SWAT is effective on complex models and datasets.

## 2 Related work

Below we summarize the works most closely related to SWAT.

**Network pruning:** LeCun et al. [31] proposed removing network parameters using second-order information of the loss function to improve generalization and reduce training and inference time. More recently, pruning has been explored primarily as a way to improve the efficiency and storage requirements of inference, but at the expense of increasing training time, contrary to the objective of this paper. Specifically, Han et al. [18] showed how to substantially reduce network parameters while improving validation accuracy by pruning weights based upon their magnitude, combined with a subsequent retraining phase that fine-tunes the remaining weights. Regularization techniques can be employed to learn a pruned network [40, 36]. Various approaches to structured pruning [61, 41, 32, 35, 21, 37] ensure entire channels or filters are removed to reduce inference execution time on the vector hardware found in GPUs.

**Reducing per-iteration training overhead:** meProp [55, 60] reduces computations during training by back-propagating only the highest-magnitude gradient components and setting the other components to zero. As shown in Section 3.2, on complex networks and models training is very sensitive to the fraction of gradient components set to zero. Liu et al. [34] proposed reducing computation during training and inference by constructing a dynamic sparse graph (DSG) using random projection. DSG incurs an accuracy loss of 3% on ImageNet at a sparsity of 50%. Goli and Aamodt [17] propose to reduce backward-pass computations by performing convolutions only on gradient components that change substantially from the prior iteration. They reduce the overall computation in the backward pass by up to 90% with minimal loss in accuracy. Their approach to backward-pass sparsification is orthogonal to SWAT.

**Sparse learning:** Sparse learning attempts to learn a sparse representation during training, generally as a way to achieve model compression and reduce computation during inference. Since sparse learning introduces sparsity during training it can potentially reduce training time, but pruning weights by itself does not reduce the weight-gradient computation (Equation 3). Many sparse learning algorithms start with a random sparse network, then repeat a cycle of training, pruning and regrowth.
Sparse Evolutionary Training (SET) [39] prunes the most negative and smallest positive weights, then randomly selects latent (i.e., missing) weights for regrowth. Dynamic Sparse Reparameterization (DSR) [42] uses a global adaptive threshold for pruning and randomly regrows latent weights in a layer proportionally to the number of active (non-zero) weights in that same layer. Sparse Networks From Scratch (SNFS) [10] further improves performance using magnitude-based pruning and momentum for determining regrowth across layers. Rigging the Lottery (RigL) [15] uses the instantaneous gradient as its regrowth criterion. Dynamic Sparse Training (DST) [33] defines a trainable mask to determine which weights to prune. Recently, Kusupati et al. [30] proposed a novel state-of-the-art method for finding per-layer learnable thresholds, which reduces inference FLOPs by employing a non-uniform sparsity budget across layers.

Table 1: Comparison of SWAT with related work.

| Method | Sparsity Across Layers | Sparse Forward Pass | Sparse Input Gradient | Sparse Weight Gradient | Unstructured Sparse Network | Structured Sparse Network |
|---|---|---|---|---|---|---|
| Network Pruning | Fixed/Variable | Yes/Gradual | Yes/Gradual | No | Depends on algorithm | Depends on algorithm |
| meProp [55] | Fixed | No | Yes | Yes | No | - |
| DSG [34] | Fixed | Yes | Yes | Yes | Yes | - |
| SET [39] | Fixed | Yes | Yes | No | Yes | - |
| DSR [42] | Variable | Yes | Yes | No | Yes | - |
| SNFS [10] | Variable | Yes | Yes | No | Yes | - |
| RigL [15] | Fixed | Yes | Yes | No | Yes | - |
| SWAT | Fixed/Variable | Yes | Yes | Yes | Yes | Yes |

In contrast, SWAT employs a unified training phase in which the algorithm continuously explores sparse topologies by using simple magnitude-based thresholds to determine which weight and activation components to operate upon. In addition, the sparsifying function used in SWAT can be adapted to induce structured sparse topologies and varying sparsity across layers. Table 1 summarizes the differences between SWAT and recent related work on increasing sparsity. The column "Sparsity Across Layers" indicates whether the per-layer sparsity is constant during training. The columns "Sparse Input Gradient" and "Sparse Weight Gradient" indicate whether the input-gradient or weight-gradient computation is sparse during training.

## 3 Sparse weight activation training

We begin with preliminaries, describe a sensitivity study motivating SWAT, then describe SWAT and several enhancements.

### 3.1 Preliminaries

We consider a CNN trained using mini-batch stochastic gradient descent, where the $l$-th layer maps input activations $a_{l-1}$ to outputs $a_l$ using a function $f_l$:

$$a_l = f_l(a_{l-1}, w_l) \qquad (1)$$

where $w_l$ are layer $l$'s weights. During back-propagation the $l$-th layer receives the gradient of the loss with respect to its output activations ($\nabla a_l$). This is used to compute the gradient of the loss with respect to its input activations ($\nabla a_{l-1}$) and weights ($\nabla w_l$) using functions $G_l$ and $H_l$ respectively. Thus, the backward pass for the $l$-th layer can be defined as:

$$\nabla a_{l-1} = G_l(\nabla a_l, w_l), \qquad (2)$$

$$\nabla w_l = H_l(\nabla a_l, a_{l-1}) \qquad (3)$$

We induce sparsity on a tensor by retaining the values of the K highest-magnitude elements and setting the remaining elements to zero. Building on the notion of a Top-K query in databases [52], and similar to Sun et al. [55], we call this process Top-K sparsification.
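To make the Top-K operation concrete, the following is a minimal PyTorch sketch of magnitude-based Top-K sparsification (illustrative only, not the released SWAT code; ties at the threshold may retain slightly more than K elements):

```python
import torch

def topk_sparsify(x: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the K largest-magnitude elements of x and zero out the rest.

    `sparsity` is the fraction of elements to drop (e.g. 0.9 keeps the top 10%).
    """
    k = max(1, int(round((1.0 - sparsity) * x.numel())))
    # Magnitude of the K-th largest element acts as the keep threshold.
    threshold = torch.topk(x.abs().flatten(), k).values[-1]
    return x * (x.abs() >= threshold).to(x.dtype)

w = torch.randn(64, 32, 3, 3)
w_sparse = topk_sparsify(w, sparsity=0.9)
print(float((w_sparse == 0).float().mean()))  # roughly 0.9
```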
### 3.2 Sensitivity Analysis

We begin by studying the sensitivity of network convergence to applying Top-K sparsification to weights ($w_l$), activations ($a_{l-1}$) and/or back-propagated error gradients ($\nabla a_l$) during training. For these experiments, which help motivate SWAT, we evaluate DenseNet-121, VGG-16 and ResNet-18 on the CIFAR-100 dataset. We run each experiment three times and report the mean value.

Figure 1 plots the impact on validation accuracy of applying varying degrees of Top-K sparsification to weights or activations during computations in the forward pass (Equation 1). In this experiment, when applying Top-K sparsification in the forward pass, weights or activations are not permanently removed. Rather, low-magnitude weights or activations are removed temporarily during the forward pass and restored before the backward pass. Thus, in this experiment the backward pass uses unmodified weights and activations without applying any sparsification, and all weights are updated by the resulting dense weight gradient. The data in Figure 1 suggests convergence is more robust to Top-K sparsification of weights than of activations during the forward pass. The data also shows that inducing high activation sparsity hurts accuracy beyond a certain point, confirming similar observations by Georgiadis [16] and Kurtz et al. [29].

*Figure 1: Forward pass sensitivity analysis. Validation accuracy (%) versus sparsity (50% to 90%) for DenseNet-121, VGG-16 and ResNet-18 with sparse weights versus sparse activations.*

*Figure 2: Backward pass sensitivity analysis. Validation accuracy (%) versus sparsity (50% to 90%) for DenseNet-121, VGG-16 and ResNet-18 with sparse weights and activations versus sparse output gradients.*

Figure 2 plots the impact on validation accuracy of applying varying degrees of Top-K sparsification to both weights and activations (labeled "sparse weight and activation") or to only output gradients (labeled "sparse output gradient") during computations in the backward pass (Equations 2 and 3). Weights and activations, or back-propagated output error gradients, are sparsified before performing the convolutions that generate the weight and input gradients. The generated gradients are dense, since a convolution between a sparse and a dense input will in general produce a dense output. The resulting dense weight gradients are used in the parameter update stage. We note that this process differs, for example, from recent approaches to sparsifying inference (e.g., [30, 68, 15]) which employ dense back-propagated output gradients during convolution to generate weight gradients that are masked during the parameter update stage. The "sparse weight and activation" curve shows that convergence is relatively insensitive to applying Top-K sparsification. In contrast, the "sparse output gradient" curve shows that convergence is sensitive to applying Top-K sparsification to back-propagated error gradients ($\nabla a_l$). The latter observation indicates that meProp, which drops back-propagated error gradients, will suffer convergence issues on larger networks.

### 3.3 The SWAT Algorithm

The analysis above suggests two strategies: in the forward pass use sparse weights (but not activations), and in the backward pass use sparse weights and activations (but not gradients). Sparse Weight Activation Training (SWAT) embodies these two strategies as follows (for pseudo-code see the supplementary material). During each training pass (a forward and backward iteration on a minibatch) a sparse weight topology is induced during the forward pass using the Top-K function. This partitions weights into active (i.e., Top-K) and non-active sets for the current iteration. The forward pass uses the active weights. In the backward pass, the full gradients and active weights are used in Equation 2, and the full gradients and highly activated neurons (Top-K sparsified activations) are used in Equation 3. The latter generates dense weight gradients that are used to update both active and non-active weights. The updates to non-active weights mean the topology can change from iteration to iteration. This enables SWAT to perform dynamic topology exploration: back-propagation with sparse weights and activations approximates back-propagation on a network with sparse connectivity and sparsely activated neurons. The dense gradients generated during back-propagation minimize the loss for the current sparse connectivity. However, the updated weights resulting from the dense gradients can lead to a new sparse network, since non-active weights are also updated. This process captures the fine-grained temporal importance of connectivity during training. Section 4.5 shows, quantitatively, the importance of unmasked gradient updates and dynamic exploration of connectivity.
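The following is a minimal sketch of a single SWAT layer for the fully connected case, assuming the same Top-K fraction for weights and activations; the released implementation uses custom convolution layers backed by cuDNN (Section 4.1), so this is illustrative only:

```python
import torch

def topk_sparsify(t: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Keep the largest-magnitude (1 - sparsity) fraction of elements, zero the rest.
    k = max(1, int(round((1.0 - sparsity) * t.numel())))
    thresh = torch.topk(t.abs().flatten(), k).values[-1]
    return t * (t.abs() >= thresh).to(t.dtype)

class SWATLinear(torch.autograd.Function):
    """Sketch of one SWAT layer (fully connected case), not the authors' released code."""

    @staticmethod
    def forward(ctx, x, weight, sparsity):
        w_sparse = topk_sparsify(weight, sparsity)   # active weights only
        a_sparse = topk_sparsify(x, sparsity)        # top-K activations, kept for Eq. 3
        ctx.save_for_backward(a_sparse, w_sparse)
        return x @ w_sparse.t()                      # forward pass uses sparse weights

    @staticmethod
    def backward(ctx, grad_out):
        a_sparse, w_sparse = ctx.saved_tensors
        grad_x = grad_out @ w_sparse        # Eq. 2: sparse weights, unsparsified gradient
        grad_w = grad_out.t() @ a_sparse    # Eq. 3: sparse activations, unsparsified gradient
        return grad_x, grad_w, None         # grad_w is dense: non-active weights are updated too

x = torch.randn(8, 128)
w = torch.randn(64, 128, requires_grad=True)
SWATLinear.apply(x, w, 0.8).sum().backward()
print(w.grad.shape)  # gradient is produced for the full dense weight tensor
```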
#### 3.3.1 Top-K Channel Selection

*Figure 3: Top-K channel selection (top 50% of channels).*

Top-K sparsification induces fine-grained sparsity. SWAT can instead induce structured sparsity on weights by pruning channels. Similarly to [4, 35, 38, 32], the saliency criterion for selecting channels is the L1 norm. Figure 3 illustrates using the channel L1 norm to select 50% of the channels (Top-50%): the squares on the right side contain the channel L1 norms, and lower-L1-norm channels are set as non-active (lightly shaded). The importance of channels is considered independently, i.e., different filters can select different active channels.
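A minimal sketch of this channel selection, assuming a weight layout of (out_channels, in_channels, kH, kW); the helper name and mask-based formulation are illustrative, not taken from the released code:

```python
import torch

def topk_channel_select(weight: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Structured Top-K: keep the highest-L1-norm input channels of each filter.

    weight: (out_channels, in_channels, kH, kW). Channels are ranked per filter,
    so different filters may keep different channels (as in Figure 3).
    """
    in_ch = weight.shape[1]
    k = max(1, int(round(keep_frac * in_ch)))
    # L1 norm of every (filter, channel) slice -> shape (out_channels, in_channels).
    saliency = weight.abs().sum(dim=(2, 3))
    keep_idx = torch.topk(saliency, k, dim=1).indices
    mask = torch.zeros_like(saliency)
    mask.scatter_(1, keep_idx, 1.0)
    # Broadcast the channel mask over the kernel's spatial dimensions.
    return weight * mask[:, :, None, None]

w = torch.randn(64, 32, 3, 3)
w_structured = topk_channel_select(w, keep_frac=0.5)
print((w_structured.abs().sum(dim=(2, 3)) > 0).sum(dim=1))  # 16 active channels per filter
```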
#### 3.3.2 Sparsity Distribution

The objective of SWAT is to reduce training time while maintaining accuracy. So far we have assumed the per-layer sparsity of weights and activations is equal to the target sparsity. Prior work [15, 10] has demonstrated that a non-uniform distribution of active weights and activation sparsity can improve accuracy for a given overall sparsity. We explore three strategies for distributing sparsity. All three of the following variants employ magnitude comparisons to select which tensor elements to set to zero. We say an element of a tensor is unmasked if it is not forced to zero before being used in a computation. For all three techniques below, the fraction of unmasked elements in a given layer is identical for weights and activations.

- **Uniform (SWAT-U):** Similar to others (e.g., Evci et al. [15]), we found that keeping the first layer dense improves validation accuracy. For SWAT-U we keep the first layer dense and apply the same sparsity uniformly across all other layers.
- **Erdos-Renyi-Kernel (SWAT-ERK):** For SWAT-ERK, active weights and unmasked activations are distributed across layers by taking into account layer dimensions and setting a per-layer threshold for magnitude-based pruning. Following Evci et al. [15], higher sparsity is allocated to layers containing more parameters.
- **Momentum (SWAT-M):** For SWAT-M, active weights and unmasked activations are distributed across layers such that fewer non-zero elements are retained in layers with smaller average momentum for active weights. This approach is inspired by Dettmers and Zettlemoyer [10].

#### 3.3.3 Efficient Top-K Threshold Calculation

The variants of SWAT described above induce sparsity by examining the magnitude of tensor elements, and this incurs some overhead. Naively, the Top-K operation could be performed on a one-dimensional array of size N in O(N log N) time using sorting. The overhead can be reduced to O(N) using a threshold operation where elements with magnitude greater than a threshold are retained and the others are treated as zeros. The K-th largest element can be found in O(N) average time using Quickselect [22] or in Θ(N) worst-case time using either BFPRT [3] or Introselect [43], and efficient parallel implementations of Top-K have been proposed for GPUs [52].

Figure 4 plots the threshold value required to achieve 90% weight sparsity versus training iteration for four representative layers of ResNet-18 trained on CIFAR-100 with SWAT-U and unstructured pruning. This data suggests that, for a given layer, the magnitude of the K-th largest element is almost constant during training. Thus, we explored whether we can set a per-layer threshold for weights and activations on the first iteration and only update it periodically.

*Figure 4: Top-K weight threshold versus training iteration for four representative layers (layers 1, 7, 14 and 17) of ResNet-18.*

Figure 5 plots the final Top-1 validation accuracy after training converges versus the Top-K sampling period for SWAT-U with unstructured pruning applied to ResNet-18 trained on CIFAR-100. Here 391 iterations corresponds to a single epoch. The data indicates that for sampling intervals up to 1000 iterations validation accuracy is not degraded significantly.

*Figure 5: Impact of the Top-K sampling period (iterations) on validation accuracy (%) at 70% and 90% sparsity.*

## 4 Experiments

Below we present results for validation accuracy, theoretical reduction in floating-point operations (FLOPs) during training, and estimates of training speedup on a simulated sparse accelerator. While not our primary objective, we also report the theoretical FLOP reduction for inference.

### 4.1 Methodology

We measure the validation accuracy of SWAT by implementing custom convolution and linear layers in PyTorch 1.1.0 [48]. Inside each custom PyTorch layer we perform sparsification before performing the layer's forward or backward pass computation. To obtain accuracy measurements in a reasonable time, these custom layers invoke NVIDIA's cuDNN library using PyTorch's C++ interface. We estimate the potential for training time reduction using an analytical model to measure total floating-point operations and an architecture simulator modeling a sparse DNN training accelerator based upon an extension of the Bit-Tactical inference accelerator [9].

We employ standard training schedules for training ResNet [19], VGG [53] with batch normalization [24], DenseNet [23] and Wide Residual Networks [64] on CIFAR-10 and CIFAR-100 [28]. We use SGD with momentum as the optimization algorithm, with an initial learning rate of 0.1, momentum of 0.9 and weight decay λ of 0.0005. For training runs with ImageNet we employ the augmentation technique proposed by Krizhevsky et al. [27]: 224×224 random crops from the input images or their horizontal flips are used for training. Networks are trained with label smoothing [58] of 0.1 for 90 epochs with a batch size of 256 samples on a system with eight NVIDIA 2080Ti GPUs. The learning rate schedule starts with a linear warm-up reaching its maximum of 0.1 at epoch 5 and is reduced by a factor of 10 at epochs 30, 60 and 80. The optimization method is SGD with Nesterov momentum of 0.9 and weight decay λ of 0.0001. Results for ImageNet use a Top-K threshold recomputed every 1000 iterations, while those for CIFAR-10 recompute the Top-K threshold every iteration.
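A minimal sketch of the periodic threshold recomputation described in Section 3.3.3, with illustrative names; between refreshes, sparsification reduces to a single comparison against the cached per-layer threshold:

```python
import torch

class PeriodicTopKThreshold:
    """Cache a per-layer magnitude threshold and refresh it every `period` iterations.

    Between refreshes, sparsification is an O(N) comparison against the cached
    threshold instead of a full Top-K selection.
    """

    def __init__(self, sparsity: float, period: int = 1000):
        self.sparsity = sparsity
        self.period = period
        self.step = 0
        self.threshold = None

    def __call__(self, t: torch.Tensor) -> torch.Tensor:
        if self.threshold is None or self.step % self.period == 0:
            k = max(1, int(round((1.0 - self.sparsity) * t.numel())))
            # K-th largest magnitude; kthvalue returns the k-th *smallest* element.
            self.threshold = torch.kthvalue(t.abs().flatten(), t.numel() - k + 1).values
        self.step += 1
        return t * (t.abs() >= self.threshold).to(t.dtype)

sparsifier = PeriodicTopKThreshold(sparsity=0.9, period=1000)
w = torch.randn(512, 512)
for _ in range(3):                 # threshold computed once, reused afterwards
    w_sparse = sparsifier(w)
print(float((w_sparse == 0).float().mean()))  # roughly 0.9
```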
Due to time and resource constraints, below we report SWAT-ERK and SWAT-M results only for CIFAR-10 and unstructured sparsity. The supplementary material includes detailed hyperparameters, results for CIFAR-100, and an ablation study of the accuracy impact of performing Top-K on different subsets of the weight and activation tensors.

### 4.2 Unstructured SWAT

**CIFAR-10:** Table 2 compares SWAT-U, SWAT-ERK and SWAT-M with unstructured sparsity against published results for DST [33] and SNFS [10] for VGG-16-D, WRN-16-8 and DenseNet-121 on CIFAR-10. Under the heading "Training Sparsity" we report the average sparsity for weights (W) and activations (Act). For SWAT-ERK and SWAT-M the per-layer sparsities of weights and activations are equal for a given layer, but their network-wide averages differ because the averages are weighted by the number of weights or activations per layer. Comparing SWAT against SNFS and DST on VGG-16 and WRN-16-8, SWAT-M has better accuracy than SNFS and DST on both networks. While SWAT-M reduces training FLOPs by 33% and 22% versus SNFS and DST, respectively, it requires more training FLOPs than DST on VGG-16. SWAT-U has better accuracy than SNFS and DST on WRN-16-8, and SWAT-U obtains 2.53× and 1.96× harmonic-mean reductions in remaining training FLOPs versus SNFS and DST, respectively. It is important to note that the reduction in FLOPs for SWAT-ERK is competitive with SWAT-U. In general, uniform sparsity requires fewer training and inference computations than ERK when the input resolution is high, since ERK generally applies lower sparsity to the initial layer, resulting in significant overhead there. However, for all the networks in Table 2 the initial layer involves little computation due to the small input resolution of the CIFAR-10 dataset, and computationally expensive layers are allotted higher sparsity in SWAT-ERK. The data under "Act" report the fraction of activation elements masked to zero, which can be exploited by the hardware compression proposed by NVIDIA [51] to reduce transfer time between GPU and CPU.

Table 2: Unstructured SWAT on the CIFAR-10 dataset. W: Weight, Act: Activation, BA: Baseline Accuracy, AC: Accuracy Change, MC: Model Compression, DS: Default Sparsity.

| Network | Method | Sparsity W (%) | Sparsity Act (%) | Top-1 Acc. (%) | SD (σ) | BA / AC | Training FLOPs reduction (%) | Training Act (%) | MC (×) | Inference FLOPs reduction (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| VGG-16-D | SNFS [10] | 95.0 | DS | 93.31 | - | 93.41 / -0.10 | 57.0 | DS | 20.0 | 62.9 |
| VGG-16-D | DST [33] | 96.2 | DS | 93.02 | - | 93.75 / -0.73 | 75.7 | DS | 26.3 | 83.2 |
| VGG-16-D | SWAT-U | 90.0 | 90.0 | 91.95 | 0.06 | 93.30 / -1.35 | 89.7 | 36.2 | 10.0 | 89.5 |
| VGG-16-D | SWAT-ERK | 95.0 | 82.0 | 92.50 | 0.07 | 93.30 / -0.80 | 89.5 | 33.0 | 20.0 | 89.5 |
| VGG-16-D | SWAT-M | 95.0 | 65.0 | 93.41 | 0.05 | 93.30 / +0.11 | 64.0 | 25.0 | 20.0 | 58.4 |
| WRN-16-8 | SNFS [10] | 95.0 | DS | 94.38 | - | 95.43 / -1.05 | 81.8 | DS | 20.0 | 88.0 |
| WRN-16-8 | DST [33] | 95.4 | DS | 94.73 | - | 95.18 / -0.45 | 83.3 | DS | 21.7 | 91.4 |
| WRN-16-8 | SWAT-U | 90.0 | 90.0 | 95.13 | 0.11 | 95.10 / +0.03 | 90.0 | 49.0 | 10.0 | 90.0 |
| WRN-16-8 | SWAT-ERK | 95.0 | 84.0 | 95.00 | 0.12 | 95.10 / -0.10 | 91.4 | 45.8 | 20.0 | 91.7 |
| WRN-16-8 | SWAT-M | 95.0 | 78.0 | 94.97 | 0.04 | 95.10 / -0.13 | 86.3 | 42.5 | 20.0 | 85.9 |
| DenseNet-121 | SWAT-U | 90.0 | 90.0 | 94.48 | 0.06 | 94.46 / +0.02 | 89.8 | 44.0 | 10.0 | 89.8 |
| DenseNet-121 | SWAT-ERK | 90.0 | 88.0 | 94.14 | 0.11 | 94.46 / -0.32 | 89.7 | 43.0 | 10.0 | 89.6 |
| DenseNet-121 | SWAT-M | 90.0 | 86.0 | 94.29 | 0.11 | 94.46 / -0.17 | 84.2 | 42.0 | 10.0 | 83.6 |
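As a rough cross-check of the "Training FLOPs" columns, the per-layer reduction implied by the sparsification strategy of Section 3.3 can be approximated with a back-of-envelope model (a sketch under simplifying assumptions; it ignores the dense first layer and non-convolution work, which is why the measured reductions in Tables 2 and 3 are somewhat lower than the sparsity targets):

```python
def training_flops_reduction(weight_sparsity: float, act_sparsity: float) -> float:
    """Rough per-layer estimate of the SWAT training FLOP reduction.

    Assumes the three convolutions of a training step (forward, input-gradient,
    weight-gradient) cost the same when dense, that forward and input-gradient
    FLOPs scale with weight density (Eq. 1 and 2), and that weight-gradient
    FLOPs scale with activation density (Eq. 3).
    """
    dense = 3.0
    sparse = (1 - weight_sparsity) + (1 - weight_sparsity) + (1 - act_sparsity)
    return 1.0 - sparse / dense

print(training_flops_reduction(0.8, 0.8))  # 0.80, close to the 76.1% in Table 3
print(training_flops_reduction(0.9, 0.9))  # 0.90, close to the 85.6% in Table 3
```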
Table 3: Unstructured SWAT on the ImageNet dataset. W: Weight, Act: Activation, BA: Baseline Accuracy, AC: Accuracy Change, MC: Model Compression, DS: Default Sparsity.

| Network | Method | Sparsity W (%) | Sparsity Act (%) | Top-1 Acc. (%) | SD (σ) | BA / AC | Training FLOPs reduction (%) | Training Act (%) | MC (×) | Inference FLOPs reduction (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | SET [39] | 80.0 | DS | 73.4 | 0.32 | 76.8 / -3.4 | 58.1 | DS | 3.4 | 73.0 |
| ResNet-50 | SET [39] | 90.0 | DS | 71.3 | 0.24 | 76.8 / -5.5 | 63.8 | DS | 5.0 | 82.1 |
| ResNet-50 | DSR [42] | 80.0 | DS | 74.1 | 0.17 | 76.8 / -2.7 | 51.6 | DS | 3.4 | 59.4 |
| ResNet-50 | DSR [42] | 90.0 | DS | 71.9 | 0.07 | 76.8 / -4.9 | 58.9 | DS | 5.0 | 70.7 |
| ResNet-50 | SNFS [10] | 80.0 | DS | 74.9 | 0.07 | 77.0 / -2.1 | 45.8 | DS | 5.0 | 43.3 |
| ResNet-50 | SNFS [10] | 90.0 | DS | 72.9 | 0.07 | 77.0 / -4.1 | 57.6 | DS | 10.0 | 59.7 |
| ResNet-50 | RigL [15] | 80.0 | DS | 74.6 | 0.06 | 76.8 / -2.2 | 67.2 | DS | 5.0 | 77.7 |
| ResNet-50 | RigL [15] | 90.0 | DS | 72.0 | 0.05 | 76.8 / -4.8 | 74.1 | DS | 10.0 | 87.4 |
| ResNet-50 | DST [33] | 80.4 | DS | 74.0 | 0.41 | 76.8 / -2.8 | 67.1 | DS | 5.0 | 84.9 |
| ResNet-50 | DST [33] | 90.1 | DS | 72.8 | 0.27 | 76.8 / -4.0 | 75.8 | DS | 10.0 | 91.3 |
| ResNet-50 | SWAT-U | 80.0 | 80.0 | 75.2 | 0.06 | 76.8 / -1.6 | 76.1 | 39.2 | 5.0 | 77.7 |
| ResNet-50 | SWAT-U | 90.0 | 90.0 | 72.1 | 0.03 | 76.8 / -4.7 | 85.6 | 44.0 | 10.0 | 87.4 |
| ResNet-50 | SWAT-ERK | 80.0 | 52.0 | 76.0 | 0.16 | 76.8 / -0.8 | 60.0 | 25.5 | 5.0 | 58.9 |
| ResNet-50 | SWAT-ERK | 90.0 | 64.0 | 73.8 | 0.23 | 76.8 / -3.0 | 79.0 | 31.4 | 10.0 | 77.8 |
| ResNet-50 | SWAT-M | 80.0 | 49.0 | 74.6 | 0.11 | 76.8 / -2.2 | 45.9 | 23.7 | 5.0 | 45.0 |
| ResNet-50 | SWAT-M | 90.0 | 57.0 | 74.0 | 0.18 | 76.8 / -2.8 | 65.4 | 27.2 | 10.0 | 64.8 |
| WRN-50-2 | SWAT-U | 80.0 | 80.0 | 76.4 | 0.10 | 78.5 / -2.1 | 78.6 | 39.1 | 5.0 | 79.2 |
| WRN-50-2 | SWAT-U | 90.0 | 90.0 | 74.7 | 0.27 | 78.5 / -3.8 | 88.4 | 43.9 | 10.0 | 89.0 |

*Figure 6: Accuracy decrease versus reduction in training FLOPs (2× to 8×) for SWAT-U, SET, DSR, DST and the dense baseline.*

**ImageNet:** Table 3 compares SWAT-U with unstructured sparsity against six recently proposed sparse learning algorithms at target weight sparsities of 80% and 90% on ImageNet. Data for all sparse learning algorithms except RigL were obtained by running their code using the hyperparameters in Section 4.1. Results for RigL are quoted from Evci et al. [15]. SET and DSR do not sparsify downsampling layers, leading to increased parameter counts. At 80% sparsity, SWAT-U attains the highest validation accuracy while reducing training FLOPs. DST trains ResNet-50 on the ImageNet dataset with less computation ("Training FLOPs") than RigL even though DST is a dense-to-sparse training algorithm while RigL is a sparse-to-sparse training algorithm. This occurs for two reasons: (1) DST quickly reaches a relatively sparse topology after a few initial epochs; in our experiment, DST discards more than 70% of the network parameters within 5 epochs; (2) the sparsity distribution across layers is the crucial factor determining the reduction in FLOPs, since allocating higher sparsity to the computationally expensive layers alleviates the initial overhead over the entire training run. Therefore, the overall benefit of DST depends on the network architecture, the sparsity distribution and the parameter discard rate.
Figure 6 plots the reduction in validation accuracy versus the reduction in training FLOPs relative to baseline dense training at both 80% and 90% sparsity, using the same data as Table 3. This figure shows SWAT-U provides the best tradeoff between validation accuracy and reduction in training FLOPs.

### 4.3 Structured SWAT

**CIFAR-10:** Table 4 provides results for SWAT-U with channel pruning on the CIFAR-10 dataset. At 70% sparsity, SWAT-U with channel pruning improves validation accuracy on both ResNet-18 and DenseNet-121 while reducing training FLOPs by 3.3×.

Table 4: Structured SWAT on the CIFAR-10 dataset. W: Weight, Act: Activation, CP: Channel Pruned, BA: Baseline Accuracy, AC: Accuracy Change, MC: Model Compression.

| Network | Method | W (%) / Act (%) | CP (%) | Top-1 Acc. (%) | SD (σ) | BA / AC | Training FLOPs reduction (%) | Training Act (%) | MC (×) | Inference FLOPs reduction (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | SWAT-U | 50.0/50.0 | 50.0 | 94.73 | 0.06 | 94.59 / +0.14 | 49.8 | 26.0 | 2.0 | 49.8 |
| ResNet-18 | SWAT-U | 60.0/60.0 | 60.0 | 94.68 | 0.03 | 94.59 / +0.09 | 59.8 | 31.2 | 2.5 | 59.8 |
| ResNet-18 | SWAT-U | 70.0/70.0 | 70.0 | 94.65 | 0.19 | 94.59 / +0.06 | 69.8 | 36.4 | 3.3 | 69.8 |
| DenseNet-121 | SWAT-U | 50.0/50.0 | 50.0 | 95.04 | 0.26 | 94.51 / +0.53 | 49.9 | 25.0 | 2.0 | 49.9 |
| DenseNet-121 | SWAT-U | 60.0/60.0 | 60.0 | 94.82 | 0.11 | 94.51 / +0.31 | 59.9 | 30.0 | 2.5 | 59.9 |
| DenseNet-121 | SWAT-U | 70.0/70.0 | 70.0 | 94.81 | 0.20 | 94.51 / +0.30 | 69.9 | 35.0 | 3.3 | 69.9 |

**ImageNet:** Table 5 compares SWAT-U with channel pruning against four recent structured pruning algorithms. At 70% sparsity, SWAT-U with channel pruning reduces training FLOPs by 3.19× by pruning 70% of the channels of ResNet-50 while incurring only a 1.2% loss in validation accuracy. SWAT-U with structured sparsity shows better accuracy, but shows larger drops versus our baseline (trained with label smoothing) in some cases. The works we compare against start with a densely trained network, prune channels, then fine-tune, and so increase training time, contrary to our objective.

Table 5: Structured SWAT on the ImageNet dataset. W: Weight, Act: Activation, CP: Channel Pruned, BA: Baseline Accuracy, AC: Accuracy Change, MC: Model Compression.

| Network | Method | Training W (%) / Act (%) | Training FLOPs reduction (%) | Training Act (%) | CP (%) | Fine-Tune (epochs) | Top-1 Acc (%) | BA / AC | Inference FLOPs reduction (%) |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | DCP [69] | Offline pruning | - | - | - | 60 | 74.95 | 76.01 / -1.06 | 55.0 |
| ResNet-50 | CCP [49] | Offline pruning | - | - | 35 | 100 | 75.50 | 76.15 / -0.65 | 48.8 |
| ResNet-50 | AOFP [11] | Offline pruning | - | - | - | Yes (-) | 75.11 | 75.34 / -0.23 | 56.73 |
| ResNet-50 | Soft-Pruning [20] | - | - | - | 30 | No | 74.61 | 76.15 / -1.54 | 41.8 |
| ResNet-50 | SWAT-U (Structured) | 50.0/50.0 | 47.6 | 24.5 | 50 | No | 76.51 ± 0.30 | 76.80 / -0.29 | 48.6 |
| ResNet-50 | SWAT-U (Structured) | 60.0/60.0 | 57.1 | 29.5 | 60 | No | 76.35 ± 0.06 | 76.80 / -0.45 | 58.3 |
| ResNet-50 | SWAT-U (Structured) | 70.0/70.0 | 66.6 | 34.3 | 70 | No | 75.67 ± 0.06 | 76.80 / -1.13 | 68.0 |
| WRN-50-2 | SWAT-U (Structured) | 50.0/50.0 | 49.1 | 24.5 | 50 | No | 78.08 ± 0.20 | 78.50 / -0.42 | 49.5 |
| WRN-50-2 | SWAT-U (Structured) | 60.0/60.0 | 58.9 | 29.4 | 60 | No | 77.55 ± 0.07 | 78.50 / -0.95 | 59.4 |
| WRN-50-2 | SWAT-U (Structured) | 70.0/70.0 | 68.7 | 34.2 | 70 | No | 77.19 ± 0.11 | 78.50 / -1.31 | 69.3 |

### 4.4 Sparse Accelerator Speedup

We believe SWAT is well suited for emerging sparse machine learning accelerator hardware designs like NVIDIA's recently announced Ampere (A100) GPU. To estimate the speedup on a sparse accelerator, we used an architecture simulator developed for a recent sparse accelerator hardware proposal [9]. The simulator counts only the effectual (non-zero) computations and exploits the sparsity present in the computation. The architecture has a 2D array of processing units, where each processing unit has an array of multipliers and dedicated weight, activation and accumulation buffers. The simulator counts the cycles taken to spatially map and schedule the computation present in each layer of the network. The memory hierarchy is similar to the DaDianNao architecture [5]. The activation and weight buffers can hold one entire layer at a time and hide the latency of memory transfer. The memory throughput is high enough to satisfy the computation throughput.

Table 6: Speedup due to SWAT on an ML accelerator.

| Top-K Sparsity | Forward Pass Speedup | Backward Pass Speedup |
|---|---|---|
| 0% | 1× | 1× |
| 80% | 3.3× | 3.4× |
| 90% | 5.6× | 6.3× |

The simulator implements only the forward pass and therefore does not simulate the backward pass. However, the backward-pass convolution is a transposed convolution, which can be translated into a standard convolution by rotating the input tensor [14]. So we estimated the backward-pass speedup by transforming the transposed convolution into a standard convolution and using the inference simulator. The transformation overhead was not considered in the speedup estimate; however, the overhead would be small in actual hardware, since the transformation is a simple rotation, i.e., rotating the tensor along some axes, and therefore we assume it could be accelerated in hardware.
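The equivalence underlying this estimate can be checked in PyTorch for the stride-1, unpadded case: a transposed convolution matches a standard convolution over the zero-padded input with a spatially flipped, channel-swapped kernel (a small illustrative check, not the simulator code; here the kernel rather than the input is rotated):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
co, ci, k = 4, 3, 3
g = torch.randn(1, co, 8, 8)     # back-propagated output gradient
w = torch.randn(co, ci, k, k)    # forward-pass convolution weights

# Transposed convolution, as used to compute input gradients (stride 1, no padding).
ref = F.conv_transpose2d(g, w)

# Equivalent standard convolution: zero-pad by k-1, flip the kernel spatially
# and swap its input/output channel dimensions.
w_rot = w.flip([2, 3]).permute(1, 0, 2, 3)
out = F.conv2d(F.pad(g, (k - 1,) * 4), w_rot)

print(torch.allclose(ref, out, atol=1e-5))  # True
```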
Table 6 reports the forward- and backward-pass training speedup (in simulated clock cycles) for SWAT-U with unstructured pruning on ResNet-50 with ImageNet. From Table 3 we see that at 80% sparsity SWAT-U incurs a 1.63% accuracy loss, but the data in Table 6 suggests it may improve training time by 3.3× on emerging platforms supporting hardware acceleration of sparse computations.

### 4.5 Effect of updates to non-active weights

SWAT updates non-active weights in the backward pass using dense gradients, and one might reasonably ask whether it would be possible to further decrease training time by masking these gradients. Figure 7 measures the effect of this masking on validation accuracy while training ResNet-18 on CIFAR-100. The data suggest it is important to update non-active weights. Since non-active weights are updated, they may become active, changing the topology. Figure 8 shows the effect of freezing the topology after differing numbers of epochs during training of ResNet-18 on the CIFAR-100 dataset. Freezing topology exploration early is harmful to convergence and results in reduced final validation accuracy.

*Figure 7: Effect of masking the gradient updates of non-active weights (validation accuracy at 80% and 90% sparsity, with and without masking).*

*Figure 8: Effect of stopping dynamic topology exploration early (validation accuracy at 70%, 80% and 90% sparsity when the topology is frozen at epoch 25, 50, 75, 100, or not at all).*

## 5 Conclusion

In this work we propose SWAT, a novel efficient training algorithm that sparsifies both the forward and the backward passes with negligible impact on convergence. SWAT is the first sparse learning algorithm we are aware of that can train both structured and unstructured sparse networks. SWAT achieves validation accuracy comparable to other pruning and sparse learning algorithms while demonstrating the potential for a significant reduction in training time on emerging platforms supporting hardware acceleration of sparse computations.

## 6 Broader Impact

This work has the following potential positive impacts on society: 1) it is an entirely sparse training algorithm for emerging sparse hardware accelerators; 2) it makes training more efficient and faster, thus decreasing the cost of model training; 3) it can reduce the overall carbon footprint of model training, which is a significant problem: Strubell et al. [54] show that training a model can release 5× the carbon emissions of a car over its lifetime; 4) it can enable training even bigger models, thus allowing new state-of-the-art accuracies to be reached. At the same time, this work may have negative consequences: SWAT can enable the development of better AI, and better AI technologies may have negative societal implications such as strict surveillance, privacy concerns, and job loss.
## 7 Acknowledgements

We thank Aayush Ankit, Francois Demoullin, Negar Goli, Dave Evans, Deval Shah, Yuan Hsi Chou, and the anonymous reviewers for their valuable comments on this work. This research has been funded in part by the Computing Hardware for Emerging Intelligent Sensory Applications (COHESA) project. COHESA is financed under the National Sciences and Engineering Research Council of Canada (NSERC) Strategic Networks grant number NETGP485577-15. Professor Aamodt serves as a consultant for Huawei Technologies Canada.

## References

[1] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Computer Architecture News, 44(3):1–13, 2016.

[2] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485, 1967.

[3] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448–461, 1973. doi: 10.1016/S0022-0000(73)80033-9. URL http://www.sciencedirect.com/science/article/pii/S0022000073800339.

[4] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations, 2020.

[5] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE, 2014.

[6] Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). arXiv preprint arXiv:1807.06964, 2018.

[7] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38(2):8–20, 2018.

[8] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudigere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Banerjee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat Kaul, Evangelos Georganas, Alexander Heinecke, Pradeep Dubey, Jesus Corbal, Nikita Shustrov, Roma Dubtsov, Evarist Fomenko, and Vadim Pirogov. Mixed precision training of convolutional neural networks using integer operations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H135uzZ0-.

[9] Alberto Delmas Lascorz, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Kevin Siu, and Andreas Moshovos. Bit-Tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 749–763, 2019.

[10] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

[11] Xiaohan Ding, Guiguang Ding, Yuchen Guo, Jungong Han, and Chenggang Yan. Approximated oracle filter pruning for destructive CNN width optimization. arXiv preprint arXiv:1905.04748, 2019.
[12] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.

[13] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[14] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.

[15] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners, 2019.

[16] Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7085–7095, 2019.

[17] Negar Goli and Tor Aamodt. ReSprop: Reuse sparsified backpropagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[18] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2234–2240. AAAI Press, 2018.

[21] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.

[22] C. A. R. Hoare. Algorithm 65: Find. Communications of the ACM, 4(7):321–322, July 1961. doi: 10.1145/366622.366647. URL http://doi.acm.org/10.1145/366622.366647.

[23] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[25] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017.

[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[28] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[29] Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Bill Nell, Nir Shavit, et al. Inducing and exploiting activation sparsity for fast neural network inference.
[30] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231, 2020.

[31] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.

[32] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[33] Junjie Liu, Zhe Xu, Runbin Shi, Ray C. C. Cheung, and Hayden K. H. So. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJlbGJrtDB.

[34] Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Guoqi Li, Yufei Ding, and Yuan Xie. Dynamic sparse graph for efficient deep learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1goBoR9F7.

[35] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.

[36] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.

[37] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.

[38] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J. Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.

[39] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.

[40] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2498–2507. JMLR.org, 2017.

[41] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

[42] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655, 2019.

[43] David R. Musser. Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8):983–993, 1997. doi: 10.1002/(SICI)1097-024X(199708)27:8<983::AID-SPE117>3.0.CO;2-#. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-024X%28199708%2927%3A8%3C983%3A%3AAID-SPE117%3E3.0.CO%3B2-%23.

[44] NVIDIA Corporation. NVIDIA Turing Architecture Whitepaper. https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf, June 2017.

[45] NVIDIA Corporation. NVIDIA Tesla V100 GPU Architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, June 2017.

[46] NVIDIA Corporation. NVIDIA A100 Tensor Core GPU Architecture. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, May 2020.
[47] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 27–40. IEEE, 2017.

[48] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[49] Hanyu Peng, Jiaxiang Wu, Shifeng Chen, and Junzhou Huang. Collaborative channel pruning for deep networks. In International Conference on Machine Learning, pages 5113–5122, 2019.

[50] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.

[51] Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W. Keckler. Compressing DMA engine: Leveraging activation sparsity for training deep neural networks. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 78–91. IEEE, 2018.

[52] Anil Shanbhag, Holger Pirk, and Samuel Madden. Efficient top-k query processing on massively parallel hardware. In Proceedings of the 2018 International Conference on Management of Data, pages 1557–1570, 2018.

[53] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[54] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243, 2019.

[55] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3299–3308. JMLR.org, 2017.

[56] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2553–2561. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf.

[57] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[58] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[59] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pages 7675–7684, 2018.

[60] Bingzhen Wei, Xu Sun, Xuancheng Ren, and Jingjing Xu. Minimal effort back propagation for convolutional neural networks. arXiv preprint arXiv:1709.05804, 2017.
[61] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[62] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.

[63] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJGXzmspb.

[64] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[65] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.

[66] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-X: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press, 2016.

[67] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[68] Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

[69] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875–886, 2018.