# dynamic_model_pruning_with_feedback__be7a74db.pdf

Published as a conference paper at ICLR 2020

DYNAMIC MODEL PRUNING WITH FEEDBACK

Tao Lin (EPFL, Switzerland) tao.lin@epfl.ch
Sebastian U. Stich (EPFL, Switzerland) sebastian.stich@epfl.ch
Luis Barba (EPFL & ETH Zurich, Switzerland) luis.barba@inf.ethz.ch
Daniil Dmitriev (EPFL, Switzerland) daniil.dmitriev@epfl.ch
Martin Jaggi (EPFL, Switzerland) martin.jaggi@epfl.ch

ABSTRACT

Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporation of a feedback signal to reactivate prematurely pruned weights, we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models. Moreover, their performance surpasses that of models generated by all previously proposed pruning schemes.

1 INTRODUCTION

Highly overparametrized deep neural networks show impressive results on machine learning tasks. However, with the increase in model size comes also the demand for memory and computing power at the inference stage, two resources that are scarcely available on low-end devices. Pruning techniques have been successfully applied to remove a significant fraction of the network weights while preserving the test accuracy attained by dense models. In some cases, the generalization of compressed networks has even been found to be better than that of full models (Han et al., 2015; 2017; Mocanu et al., 2018).

The sparsity of a network is the number of weights that are identically zero, and can be obtained by applying a sparsity mask on the weights. There are several different approaches to find sparse models. For instance, one-shot pruning strategies find a suitable sparsity mask by inspecting the weights of a pretrained network (Mozer & Smolensky, 1989; LeCun et al., 1990; Han et al., 2017). While these algorithms achieve a substantial size reduction of the network with little degradation in accuracy, they are computationally expensive (training and refinement of the dense model), and they are outperformed by algorithms that explore different sparsity masks instead of a single one. In dynamic pruning methods, the sparsity mask is readjusted during training according to different criteria (Mostafa & Wang, 2019; Mocanu et al., 2018). However, these methods require fine-tuning of many hyperparameters.

We propose a new pruning approach to obtain sparse neural networks with state-of-the-art test accuracy. Our compression scheme uses a new saliency criterion that identifies important weights in the network throughout training to propose candidate masks. As a key feature, our algorithm not only evolves the pruned sparse model alone, but jointly also a (closely related) dense model that is used in a natural way to correct for pruning errors during training. This results in better generalization properties on a wide variety of tasks. The simplicity of the scheme further allows us to study it from a theoretical point of view, and to provide additional insights and interpretation.
We do not require tuning of additional hyperparameters, and no retraining of the sparse model is needed (though it can further improve performance).

Figure 1: Schematic view of different pruning methodologies and their properties.

| Scheme | Gradient | Pruning | Comments |
| --- | --- | --- | --- |
| Pruning before training | $\tilde{g}(\tilde{w}_t)$ (on pruned model) | once at the beginning; mask $m_0$ fixed throughout | + reaches local minima; - finding suitable masks is challenging |
| Pruning after training | $g(w_t)$ (on full model) | once at the end; mask $m_T$ fixed at the end | - does not reach local minima; - fine-tuning of pruned model required |
| Incremental pruning during training | $\tilde{g}(\tilde{w}_t)$ (on partially pruned model) | several times during training; incrementally, $m_{t+1} \subseteq m_t$ | + adaptive; - cannot recover from premature pruning |
| Dynamic pruning during training | $g(\tilde{w}_t)$ (on pruned model) | dynamically adapting mask $m_t$, every few iterations | + adaptive; + can recover from premature pruning |

Contributions.
- A novel dynamic pruning scheme that incorporates error feedback in a natural way (Sec. 3) and finds a trained sparse model in one training pass (Sec. 5).
- We demonstrate state-of-the-art performance (in accuracy and sparsity), outperforming all previously proposed pruning schemes (Sec. 5).
- We complement our results with an ablation study that provides further insights (Sec. 6) and a convergence analysis for convex and non-convex objectives (Sec. 4).

2 RELATED WORK

Previous works on obtaining pruned networks can (loosely) be divided into three main categories.

Pruning after training. Training approaches to obtain sparse networks usually include a three-stage pipeline: training of a dense model, one-shot pruning, and fine-tuning, e.g., (Han et al., 2015). Their results (i.e., moderate sparsity levels with minor quality loss) made them the standard method for network pruning and led to several variations (Guo et al., 2016; Carreira-Perpinán & Idelbayev, 2018).

Pruning during training. Zhu & Gupta (2017) propose the use of magnitude-based pruning and to gradually increase the sparsity ratio while training the model from scratch. A pruning schedule determines when the new masks are computed (extending and simplifying (Narang et al., 2017)). He et al. (2018) (SFP) prune entire filters of the model at the end of each epoch, but allow the pruned filters to be updated when training the model. Deep Rewiring (DeepR) (Bellec et al., 2018) allows for even more adaptivity by performing pruning and regrowth decisions periodically. This approach is computationally expensive and challenging to apply to large networks and datasets. Sparse evolutionary training (SET) (Mocanu et al., 2018) simplifies prune-regrowth cycles by using heuristics for random growth at the end of each training epoch, and NeST (Dai et al., 2019) does so by inspecting gradient magnitudes. Dynamic Sparse Reparameterization (DSR) (Mostafa & Wang, 2019) implements a prune-redistribute-regrowth cycle where target sparsity levels are redistributed among layers, based on loss gradients (in contrast to SET, which uses fixed, manually configured, sparsity levels). Sparse Momentum (SM) (Dettmers & Zettlemoyer, 2019) follows the same cycle but instead uses the mean momentum magnitude of each layer during the redistribute phase. SM outperforms DSR on ImageNet for unstructured pruning by a small margin but shows no performance difference in the CIFAR experiments.
Our approach also falls into the dynamic category, but we use an error compensation mechanism instead of hand-crafted redistribute-regrowth cycles.

Pruning before training. Recently, spurred by the lottery ticket hypothesis (LT) (Frankle & Carbin, 2019), methods that try to find a sparse mask which can be trained from scratch have attracted increased interest. For instance, Lee et al. (2019) propose SNIP to find a pruning mask by inspecting connection sensitivities and identifying structurally important connections in the network for a given task. Pruning is applied at initialization, and the sparsity mask remains fixed throughout training. Note that Frankle & Carbin (2019); Frankle et al. (2019) do not propose an efficient pruning scheme to find the mask; instead they rely on iterative pruning, repeated for several full training passes.

Figure 2: Left: One-shot pruning (red) computes a stochastic gradient at $w$ and takes a step towards the best dense model. In contrast, DPF (blue) computes a stochastic gradient at the pruned model $\tilde{w}$ (here obtained by smallest-magnitude pruning), and takes a step that best suits the compressed model. Right: One-shot pruning commits to a single sparsity mask and might obtain sparse models that generalize poorly (without retraining). DPF explores different available sparsity patterns and finds better sparse models.

Further approaches. Srinivas et al. (2017); Louizos et al. (2018) learn gating variables (e.g., through $\ell_0$ regularization) that minimize the number of nonzero weights; recent parallel work studies filter pruning for pre-trained models (You et al., 2019). Gal et al. (2017); Neklyudov et al. (2017); Molchanov et al. (2017) prune from a Bayesian perspective and learn dropout probabilities during training to prune and sparsify networks as dropout weight probabilities reach 1. Gale et al. (2019) extensively study recent unstructured pruning methods on large-scale learning tasks, and find that complex techniques (Molchanov et al., 2017; Louizos et al., 2018) perform inconsistently. Simple magnitude pruning approaches achieve comparable or better results (Zhu & Gupta, 2017).

3 DYNAMIC MODEL PRUNING WITH FEEDBACK

We consider the training of a non-convex loss function $f : \mathbb{R}^d \to \mathbb{R}$. We assume for a weight vector $w \in \mathbb{R}^d$ to have access to a stochastic gradient $g(w) \in \mathbb{R}^d$ such that $\mathbb{E}[g(w)] = \nabla f(w)$. This corresponds to the standard machine learning setting, with $g(w)$ representing a (mini-batch) gradient of one (or several) components of the loss function. Stochastic Gradient Descent (SGD) computes a sequence of iterates by the update rule

$$w_{t+1} := w_t - \gamma_t g(w_t)\,, \qquad \text{(SGD)}$$

for some learning rate $\gamma_t$. To obtain a sparse model, a general approach is to prune some of the weights of $w_t$, i.e., to set them to zero. Such pruning can be implemented by applying a mask $m \in \{0,1\}^d$ to the weights, resulting in a sparse model $\tilde{w}_t := m \odot w_t$, where $\odot$ denotes the entry-wise (Hadamard) product. The mask could potentially depend on the weights $w_t$ (e.g., smallest-magnitude pruning), or depend on $t$ (e.g., the sparsity is incremented over time).

Before we introduce our proposed dynamic pruning scheme, we formalize the three main existing types of pruning methodologies (summarized in Figure 1). These approaches differ in the way the mask is computed, and in the moment when it is applied.¹
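To make the notation concrete, here is a minimal sketch (in PyTorch-style Python) of smallest-magnitude pruning: it builds a binary mask $m \in \{0,1\}^d$ and the pruned model $\tilde{w} = m \odot w$. The helper name `magnitude_mask` and the use of a single flattened weight vector are our own illustrative choices, not part of the paper's released implementation.

```python
import torch

def magnitude_mask(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask that zeroes out the `sparsity` fraction of
    smallest-magnitude entries of w (global, unstructured pruning)."""
    k = int(sparsity * w.numel())            # number of weights to prune
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest |w_i|
    return (w.abs() > threshold).float()     # 1 = keep, 0 = prune

# Example: prune 90% of a weight vector and form the sparse model w_tilde = m ⊙ w.
w = torch.randn(1000)
m = magnitude_mask(w, sparsity=0.9)
w_tilde = m * w                              # Hadamard (entry-wise) product
print(f"kept {int(m.sum())} of {w.numel()} weights")
```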
Pruning before training. A mask $m_0$ (depending e.g. on the initialization $w_0$ or on the network architecture of $f$) is applied and (SGD) is used for training on the resulting subnetwork $\tilde{f}(w) := f(m_0 \odot w)$, with the advantage that only the pruned weights need to be stored and updated,² and that by training with SGD a local minimum of the subnetwork $\tilde{f}$ (but not of $f$, the original training target) can be reached. In practice, however, it remains a challenge to efficiently determine a good mask $m_0$, and a wrongly chosen mask at the beginning strongly impacts the performance.

¹ The methods introduced in Section 2 typically follow one of these broad themes loosely, with slight variations in detail. For the sake of clarity we omit a too technical and detailed discussion here.
² When training on $\tilde{f}(w)$, it suffices to access stochastic gradients of $\tilde{f}(w)$, denoted by $\tilde{g}(w)$, which can potentially be computed more cheaply than by naively applying the mask to $g(w)$ (note $\tilde{g}(w) = m_0 \odot g(w)$).

Pruning after training (one-shot pruning). A dense model is trained, and pruning is applied to the trained model $w_T$. As the pruned model $\tilde{w}_T = m_T \odot w_T$ is very likely not at a local optimum of $f$, fine-tuning (retraining with the fixed mask $m_T$) is necessary to improve performance.

Pruning during training (incremental and dynamic pruning). Dynamic schemes change the mask $m_t$ every (few) iterations based on observations during training (i.e., by observing the weights and stochastic gradients). Incremental schemes monotonically increase the sparsity pattern; fully dynamic schemes can also reactivate previously pruned weights.

In contrast to previous dynamic schemes that relied on elaborate heuristics to adapt the mask $m_t$, we propose a simpler approach: Dynamic pruning with feedback (DPF, Algorithm 1). Our scheme evaluates a stochastic gradient at the pruned model $\tilde{w}_t = m_t \odot w_t$ and applies it to the (simultaneously maintained) dense model $w_t$:

$$w_{t+1} := w_t - \gamma_t g(m_t \odot w_t) = w_t - \gamma_t g(\tilde{w}_t)\,. \qquad \text{(DPF)}$$

Applying the gradient to the full model allows it to recover from "errors", i.e., prematurely masking out important weights: when the accumulated gradient updates from the following steps drastically change a specific weight, it can become activated again (in contrast to incremental pruning approaches that have to stick to sub-optimal decisions). For illustration, observe that (DPF) can equivalently be written as $w_{t+1} = w_t - \gamma_t g(w_t + e_t)$, where $e_t := \tilde{w}_t - w_t$ is the error produced by the compression. This provides a different intuition of the behavior of (DPF), and connects it with the concept of error feedback (Stich et al., 2018; Karimireddy et al., 2019).³ We illustrate this principle in Figure 2 and give detailed pseudocode and further implementation details in Appendix A.1.

³ Our variable $w_t$ corresponds to $x_t$ in the notation of Karimireddy et al. (2019). Their error-fixed SGD algorithm evaluates gradients at perturbed iterates $\tilde{x}_t := x_t + e_t$, which correspond precisely to $\tilde{w}_t = w_t + e_t$ in our notation. This shows the connection between the two methods.

The DPF scheme can also be seen as an instance of a more general class of schemes that apply (arbitrary) perturbed gradient updates to the dense model. For instance, straight-through gradient estimators (Bengio et al., 2013), which are used to empirically simplify backpropagation, can be seen as such perturbations. Our stronger assumptions on the structure of the perturbation allow us to derive non-asymptotic convergence rates in the next section, though our analysis could also be extended to the setting in (Yin et al., 2019) if the perturbations can be bounded.
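A minimal sketch of one (DPF) update, assuming plain SGD (no momentum) and reusing the illustrative `magnitude_mask` helper from the earlier sketch; `dpf_step`, the toy least-squares objective, and all tensor names are our own and not part of the authors' released code.

```python
import torch

def dpf_step(w: torch.Tensor, mask: torch.Tensor, loss_fn, lr: float) -> None:
    """One DPF update: w <- w - lr * g(m ⊙ w).  `loss_fn` maps a weight vector to a
    scalar (mini-batch) loss; the gradient is taken at the PRUNED weights, but the
    DENSE weights w are the ones being updated (error feedback)."""
    w_tilde = (mask * w).detach().requires_grad_(True)   # pruned model  w̃ = m ⊙ w
    loss = loss_fn(w_tilde)                               # forward pass on w̃
    loss.backward()                                       # stochastic gradient g(w̃)
    with torch.no_grad():
        w -= lr * w_tilde.grad                            # update the dense model

# Toy usage on a least-squares objective (illustration only; the mask is kept
# fixed here, whereas the full method recomputes it every p iterations).
torch.manual_seed(0)
A, b = torch.randn(64, 100), torch.randn(64)
w = torch.randn(100)
mask = (torch.rand(100) > 0.9).float()                    # ~90% sparsity
for _ in range(100):
    dpf_step(w, mask, lambda v: ((A @ v - b) ** 2).mean(), lr=0.01)
```

In the full method the mask is recomputed every $p$ iterations (Algorithm 1 in Appendix A.1); it is kept fixed here only for brevity.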
4 CONVERGENCE ANALYSIS

We now present convergence guarantees for (DPF). For the purpose of deriving theoretical guarantees, we assume that the training objective $f$ is $L$-smooth, that is, $\|\nabla f(w) - \nabla f(v)\| \le L \|w - v\|$ for all $w, v \in \mathbb{R}^d$ and a constant $L > 0$, and that the stochastic gradients are bounded, $\mathbb{E}\|g(\tilde{w}_t)\|^2 \le G^2$, for every pruned model $\tilde{w}_t = m_t(w_t) \odot w_t$. The quality of this pruning is measured by the parameter $\delta_t \in [0,1]$ defined as

$$\delta_t := \frac{\|w_t - \tilde{w}_t\|^2}{\|w_t\|^2}\,. \qquad (1)$$

Pruning without information loss corresponds to $\tilde{w}_t = w_t$, i.e., $\delta_t = 0$, and in general $\delta_t \le 1$.
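To make the quantity in (1) concrete, the following snippet (a monitoring sketch of our own, not from the paper) computes $\delta_t$ for a given dense weight vector and mask; for magnitude pruning, $\delta_t$ equals the fraction of the squared $\ell_2$ mass carried by the pruned (smallest-magnitude) weights.

```python
import torch

def pruning_quality(w: torch.Tensor, mask: torch.Tensor) -> float:
    """delta_t = ||w - m ⊙ w||^2 / ||w||^2  (Eq. (1)); 0 means lossless pruning."""
    w_tilde = mask * w
    return ((w - w_tilde).norm() ** 2 / w.norm() ** 2).item()

w = torch.randn(10_000)
m = (w.abs() > w.abs().quantile(0.9)).float()   # keep the largest 10% of weights
print(pruning_quality(w, m))                    # noticeably smaller than the 0.9 sparsity ratio
```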
Convergence on convex functions. We first consider the case where $f$ is in addition $\mu$-strongly convex, that is, $\langle \nabla f(v), w - v \rangle \le f(w) - f(v) - \frac{\mu}{2}\|w - v\|^2$ for all $w, v \in \mathbb{R}^d$. While it is clear that this assumption does not apply to neural networks, it eases the presentation, as strongly convex functions have a unique (global) minimizer $w^\star := \arg\min_{w \in \mathbb{R}^d} f(w)$.

Theorem 4.1. Let $f$ be $\mu$-strongly convex and let the learning rates be given as $\gamma_t = \frac{4}{\mu(t+2)}$. Then for a randomly chosen pruned model $\tilde{u}$ among the iterates $\{\tilde{w}_0, \ldots, \tilde{w}_T\}$ of DPF, concretely $\tilde{u} = \tilde{w}_t$ with probability $p_t = \frac{2(t+1)}{(T+1)(T+2)}$, it holds in expectation over the stochasticity and the selection of $\tilde{u}$:

$$\mathbb{E} f(\tilde{u}) - f(w^\star) = O\!\left(\frac{G^2}{\mu T} + L\,\mathbb{E}\big[\delta_t \|w_t\|^2\big]\right). \qquad (2)$$

The rightmost term in (2) measures the average quality of the pruning. However, unless $\delta_t \to 0$ or $\|w_t\| \to 0$ for $t \to \infty$, the error term never completely vanishes, meaning that the method converges only to a neighborhood of the optimal solution (this holds not only for the pruned model, but also for the jointly maintained dense model, as we show in the appendix). This behavior is expected, as the globally optimal model $w^\star$ might be dense and cannot be approximated well by a sparse model.

For one-shot methods that only prune the final (SGD) iterate $w_T$ at the end, we have instead

$$\mathbb{E} f(\tilde{w}_T) - f(w^\star) \le 2\,\mathbb{E}\big(f(w_T) - f(w^\star)\big) + L\delta_T\,\mathbb{E}\|w_T\|^2 = O\!\left(\frac{LG^2}{\mu^2 T} + L\,\mathbb{E}\big[\delta_T \|w_T\|^2\big]\right),$$

as we show in the appendix. First, we see from this expression that the estimate is very sensitive to $\delta_T$ and $\|w_T\|$, i.e., the quality of the pruning of the final model. This could be better or worse than the average pruning quality over all iterates. Moreover, one also loses a factor of the condition number $\frac{L}{\mu}$ in the asymptotically decreasing term, compared to (2). This is due to the fact that standard convergence analysis only achieves optimal rates for an average of the iterates (but not for the last one). This shows a slight theoretical advantage of DPF over rounding at the end.

Convergence on non-convex functions to stationary points. Secondly, we consider the case where $f$ is a non-convex function and show convergence (to a neighborhood) of a stationary point.

Theorem 4.2. Let the learning rate be given as $\gamma = \frac{c}{\sqrt{T}}$, for $c = \sqrt{\frac{f(w_0) - f(w^\star)}{L G^2}}$. Then for a pruned model $\tilde{u}$ chosen uniformly at random from the iterates $\{\tilde{w}_0, \ldots, \tilde{w}_T\}$ of DPF, concretely $\tilde{u} := \tilde{w}_t$ with probability $p_t = \frac{1}{T+1}$, it holds in expectation over the stochasticity and the selection of $\tilde{u}$:

$$\mathbb{E}\|\nabla f(\tilde{u})\|^2 = O\!\left(\sqrt{\frac{L\,(f(w_0) - f(w^\star))}{T}}\;G + L^2\,\mathbb{E}\big[\delta_t \|w_t\|^2\big]\right).$$

Extension to other compression schemes. So far we put our focus on simple mask pruning schemes to achieve high model sparsity. However, the pruning scheme in Algorithm 1 could be replaced by an arbitrary compressor $C : \mathbb{R}^d \to \mathbb{R}^d$, i.e., $\tilde{w}_t = C(w_t)$. Our analysis extends to compressors as e.g. defined in (Karimireddy et al., 2019; Stich & Karimireddy, 2019), whose quality is also measured in terms of the $\delta_t$ parameters as in (1). For example, if our objective were not to obtain a sparse model, but to produce a quantized neural network where inference can be computed faster on low-precision numbers, then we could define $C$ as a quantization compressor. One variant of this approach is implemented in the BinaryConnect algorithm (BC) (Courbariaux et al., 2015) with prominent results; see also (Li et al., 2017) for further insights and discussion.
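As an illustration of this generality, the sketch below (our own example) replaces the pruning operator with a quantization compressor C while keeping the same feedback update; the scaled-sign quantizer shown here is one standard choice and is not claimed to match the exact BinaryConnect operator.

```python
import torch

def scaled_sign(w: torch.Tensor) -> torch.Tensor:
    """A simple quantization compressor C(w): one scale per tensor times sign(w)."""
    return w.abs().mean() * w.sign()

def compressed_feedback_step(w: torch.Tensor, loss_fn, lr: float,
                             compressor=scaled_sign) -> None:
    """Same update as DPF, with the mask replaced by an arbitrary compressor:
    w <- w - lr * g(C(w)).  The gradient is taken at the compressed weights,
    while the dense weights keep accumulating the updates."""
    w_c = compressor(w).detach().requires_grad_(True)
    loss_fn(w_c).backward()
    with torch.no_grad():
        w -= lr * w_c.grad
```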
5 EXPERIMENTS

We evaluated DPF together with its competitors on a wide range of neural architectures and sparsity levels. DPF exhibits consistent and noticeable performance benefits over its competitors.

5.1 EXPERIMENTAL SETUP

Datasets. We evaluated DPF on two image classification benchmarks: (1) CIFAR-10 (Krizhevsky & Hinton, 2009) (50K/10K training/test samples with 10 classes), and (2) ImageNet (Russakovsky et al., 2015) (1.28M/50K training/validation samples with 1000 classes). We adopted the standard data augmentation and preprocessing scheme from He et al. (2016a); Huang et al. (2016); for further details refer to Appendix A.2.

Models. Following the common experimental setting in related work on network pruning (Liu et al., 2019; Gale et al., 2019; Dettmers & Zettlemoyer, 2019; Mostafa & Wang, 2019), our main experiments focus on ResNet (He et al., 2016a) and WideResNet (Zagoruyko & Komodakis, 2016). However, DPF can be effectively extended to other neural architectures, e.g., VGG (Simonyan & Zisserman, 2015) and DenseNet (Huang et al., 2017). We followed the common definition in He et al. (2016a); Zagoruyko & Komodakis (2016) and use ResNet-a and WideResNet-a-b to denote a neural network with a layers and width factor b.

Baselines. We considered the state-of-the-art model compression methods presented in the table below as our strongest competitors. We omit the comparison to other dynamic reparameterization methods, as DSR outperforms DeepR (Bellec et al., 2018) and SET (Mocanu et al., 2018) by a noticeable margin (Mostafa & Wang, 2019).

| Scheme | Reference | Pruning | How the mask(s) are found |
| --- | --- | --- | --- |
| Lottery Ticket (LT) | 2019, FDRC | before training | 10-30 successive rounds of (full training + pruning). |
| SNIP | 2019, LAT | before training | By inspecting properties/sensitivity of the network. |
| One-shot + fine-tuning (One-shot P+FT) | 2015, HPDT | after training | Saliency criterion (prunes smallest weights). |
| Incremental pruning + fine-tuning (Incremental) | 2017, ZG | incremental | Saliency criterion. Sparsity is gradually incremented. |
| Dynamic Sparse Reparameterization (DSR) | 2019, MW | dynamic | Prune-redistribute-regrowth cycle. |
| Sparse Momentum (SM) | 2019, DZ | dynamic | Prune-redistribute-regrowth + mean momentum. |
| DPF | ours | dynamic | Reparameterization via error feedback. |

Implementation of DPF. Compared to other dynamic reparameterization methods (e.g., DSR and SM) that introduce many extra hyperparameters, our method has trivial hyperparameter tuning overhead. We perform pruning across all neural network layers (no layer-wise pruning) using magnitude-based unstructured weight pruning (inherited from Han et al. (2015)). We found the best performance when updating the mask every 16 iterations (see also Figure 11), and we keep this value fixed for all experiments (independent of the architecture or task). Unlike our competitors, which may ignore some layers (e.g., the first convolution and downsampling layers in DSR), we applied DPF (as well as the One-shot P+FT and Incremental baselines) to all convolutional layers while keeping the last fully-connected layer,⁴ biases and batch normalization layers dense. Lastly, our algorithm gradually increases the sparsity $s_t$ of the mask from 0 to the desired sparsity using the same scheduling as in (Zhu & Gupta, 2017); see Appendix A.2.

⁴ The last fully-connected layer normally makes up only a very small fraction of the total MACs, e.g., 0.05% for ResNet-50 on ImageNet and 0.0006% for WideResNet-28-2 on CIFAR-10.

Training schedules. For all competitors, we adapted their open-sourced code and applied a consistent (and standard) training scheme across the different methods to ensure a fair comparison. Following the standard training setup for CIFAR-10, we trained ResNet-a for 300 epochs and decayed the learning rate by 10 when accessing 50% and 75% of the total training samples (He et al., 2016a; Huang et al., 2017); and we trained WideResNet-a-b as in Zagoruyko & Komodakis (2016) for 200 epochs and decayed the learning rate by 5 when accessing 30%, 60% and 80% of the total training samples. For ImageNet training, we used the training scheme of Goyal et al. (2017) for 90 epochs and decayed the learning rate by 10 at epochs 30, 60 and 80. For all datasets and models, we used mini-batch SGD with Nesterov momentum (factor 0.9) with a fine-tuned learning rate for DPF. We reused the tuned (or recommended) hyperparameters for our baselines (DSR and SM), and fine-tuned the optimizer and learning rate for One-shot P+FT, Incremental and SNIP. The mini-batch size is fixed to 128 for CIFAR-10 and 1024 for ImageNet regardless of models and methods.

5.2 EXPERIMENT RESULTS

CIFAR-10. Figure 3 shows a comparison of different methods for WideResNet-28-2. For low sparsity levels (e.g., 50%), DPF outperforms even the dense baseline, which is in line with the regularization properties of network pruning. Furthermore, DPF can prune the model up to a very high level (e.g., 99%) and still exhibit viable performance. This observation is also present in Table 1, which depicts the results of training different state-of-the-art DNN architectures with higher sparsities. DPF shows reasonable performance even at extremely high sparsity levels on large models (e.g., WideResNet-28-8 with sparsity ratio 99.9%), while other methods either suffer from significant quality loss or even fail to converge.

Because simple model pruning techniques sometimes show better performance than complex techniques (Gale et al., 2019), we further consider these simple baselines in Table 2. While DPF outperforms them in almost all settings, it faces difficulties pruning smaller models to extremely high sparsity ratios (e.g., ResNet-20 with sparsity ratio 95%). This seems however to be an artifact of fine-tuning, as DPF with extra fine-tuning convincingly outperforms all other methods regardless of the network size. This comes as no surprise, as schemes like One-shot P+FT and Incremental do not benefit from extra fine-tuning, since it is already incorporated in their training procedure and they might become stuck in local minima. On the other hand, dynamic pruning methods, and in particular DPF, work on a different paradigm and can still heavily benefit from fine-tuning.⁵

⁵ Although a large fraction of the mask elements already converge during training (see e.g. Figure 4 below), not all mask elements converge. Thus DPF can still benefit from fine-tuning on the fixed sparsity mask.

Figure 13 (in Appendix A.3.4) depicts another interesting property of DPF. When we search for a subnetwork with a (small) predefined number of parameters for a fixed task, it is much better to run DPF on a large model (e.g., WideResNet-28-8) than on a smaller one (e.g., WideResNet-28-2). That is, DPF performs structural exploration more efficiently in larger parametric spaces.
Figure 3: Top-1 test accuracy of WideResNet-28-2 on CIFAR-10 for unstructured weight pruning; panels: (a) sparsity ratio vs. top-1 test accuracy, (b) # of params vs. top-1 test accuracy, (c) MACs vs. top-1 test accuracy. The original model has 1.47M parameters and 216M MACs (Multiplier-ACcumulator operations). We varied the sparsity ratio from 50% to 99%. The complete numerical test accuracy values are given in Table 4 in Appendix A.3.1. The lower the # of params and MACs of a model, the higher the sparsity ratio it uses. All results are averaged over three runs. Note that different methods might consider different types of layers, and thus the same pruning sparsity ratio might result in slight differences in both # of params and MACs. DSR cannot converge at the extremely high sparsity ratio (99%).

Table 1: Top-1 test accuracy of SOTA DNNs on CIFAR-10 for unstructured weight pruning. A dash (-) indicates that the method cannot converge. The results are averaged over three runs. The results presented for each model consider some reasonable pruning ratios (we prune more aggressively for deeper and wider neural networks); refer to Table 4 (in Appendix A.3.1) for a complete overview.

| Model | Dense baseline | SNIP (L+, 2019) | SM (DZ, 2019) | DSR (MW, 2019) | DPF (ours) | Target pr. ratio |
| --- | --- | --- | --- | --- | --- | --- |
| VGG16-D | 93.74 ± 0.13 | 93.04 ± 0.26 | 93.59 ± 0.17 | - | 93.87 ± 0.15 | 95% |
| ResNet-20 | 92.48 ± 0.20 | 91.10 ± 0.22 | 91.98 ± 0.01 | 92.00 ± 0.19 | 92.42 ± 0.14 | 70% |
| | | 90.53 ± 0.27 | 91.54 ± 0.16 | 91.78 ± 0.28 | 92.17 ± 0.21 | 80% |
| | | 88.50 ± 0.13 | 89.76 ± 0.40 | 87.88 ± 0.04 | 90.88 ± 0.07 | 90% |
| | | 84.91 ± 0.25 | 83.03 ± 0.74 | - | 88.01 ± 0.30 | 95% |
| ResNet-32 | 93.83 ± 0.12 | 90.40 ± 0.26 | 91.54 ± 0.18 | 91.41 ± 0.23 | 92.42 ± 0.18 | 90% |
| | | 87.23 ± 0.29 | 88.68 ± 0.22 | 84.12 ± 0.32 | 90.94 ± 0.35 | 95% |
| ResNet-56 | 94.51 ± 0.20 | 91.43 ± 0.34 | 92.73 ± 0.21 | 93.78 ± 0.20 | 93.95 ± 0.11 | 90% |
| | | 90.96 ± 0.40 | 92.57 ± 0.09 | - | 92.74 ± 0.08 | 95% |
| WideResNet-28-2 | 95.01 ± 0.04 | 92.58 ± 0.22 | 93.41 ± 0.22 | 93.88 ± 0.08 | 94.36 ± 0.24 | 90% |
| | | 90.80 ± 0.04 | 92.24 ± 0.14 | 92.74 ± 0.17 | 93.62 ± 0.05 | 95% |
| | | 83.45 ± 0.38 | 85.36 ± 0.80 | - | 88.92 ± 0.29 | 99% |
| WideResNet-28-4 | 95.69 ± 0.10 | 93.62 ± 0.17 | 94.45 ± 0.14 | 94.63 ± 0.08 | 95.38 ± 0.04 | 95% |
| | | 92.06 ± 0.38 | 93.80 ± 0.24 | 93.92 ± 0.16 | 94.98 ± 0.08 | 97.5% |
| | | 89.49 ± 0.20 | 92.18 ± 0.04 | 92.50 ± 0.07 | 93.86 ± 0.20 | 99% |
| WideResNet-28-8 | 96.06 ± 0.06 | 95.49 ± 0.21 | 95.67 ± 0.14 | 95.81 ± 0.10 | 96.08 ± 0.15 | 90% |
| | | 94.92 ± 0.13 | 95.64 ± 0.07 | 95.55 ± 0.12 | 95.98 ± 0.10 | 95% |
| | | 94.11 ± 0.19 | 95.31 ± 0.20 | 95.11 ± 0.07 | 95.84 ± 0.04 | 97.5% |
| | | 92.04 ± 0.11 | 94.38 ± 0.12 | 94.10 ± 0.12 | 95.63 ± 0.16 | 99% |
| | | 74.50 ± 2.23 | 88.65 ± 0.36 | - | 91.76 ± 0.18 | 99.9% |
Table 2: Top-1 test accuracy of SOTA DNNs on CIFAR-10 for unstructured weight pruning via some simple pruning techniques. This table complements Table 1 and evaluates the performance of model compression under One-shot P+FT and Incremental, as well as how extra fine-tuning (FT) impacts the performance of Incremental and our DPF. Note that One-shot P+FT prunes the dense model and uses extra fine-tuning itself. The Dense, Incremental and DPF models are trained for the same number of epochs from scratch. The extra fine-tuning procedure starts from the model checkpoint at the end of normal training and uses the same number of additional training epochs (60 epochs in our case) with a tuned optimizer and learning rate. The detailed hyperparameter tuning procedure is described in Appendix A.2.

| Model | Dense baseline | One-shot + FT (H+, 2015) | Incremental (ZG, 2017) | Incremental + FT | DPF | DPF + FT | Target pr. ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-20 | 92.48 ± 0.20 | 90.18 ± 0.12 | 90.55 ± 0.38 | 90.54 ± 0.25 | 90.88 ± 0.07 | 91.76 ± 0.12 | 90% |
| | | 86.91 ± 0.16 | 89.21 ± 0.10 | 89.24 ± 0.28 | 88.01 ± 0.30 | 90.34 ± 0.31 | 95% |
| ResNet-32 | 93.83 ± 0.12 | 91.72 ± 0.15 | 91.69 ± 0.12 | 91.76 ± 0.14 | 92.42 ± 0.18 | 92.61 ± 0.11 | 90% |
| | | 89.31 ± 0.18 | 90.86 ± 0.17 | 90.93 ± 0.18 | 90.94 ± 0.35 | 92.18 ± 0.14 | 95% |
| ResNet-56 | 94.51 ± 0.20 | 93.26 ± 0.06 | 93.14 ± 0.23 | 93.09 ± 0.16 | 93.95 ± 0.11 | 93.95 ± 0.17 | 90% |
| | | 91.61 ± 0.07 | 92.14 ± 0.10 | 92.50 ± 0.25 | 92.74 ± 0.08 | 93.25 ± 0.15 | 95% |

ImageNet. We compared DPF to other dynamic reparameterization methods as well as to the strong Incremental baseline in Table 3. For both sparsity levels (80% and 90%), DPF shows a significant improvement in top-1 test accuracy with a smaller or equal number of parameters.

Table 3: Top-1 test accuracy of ResNet-50 on ImageNet for unstructured weight pruning. The # of parameters of the full model is 25.56 M. We used the results of DSR from Mostafa & Wang (2019), as we use the same (standard) training/data augmentation scheme for the same neural architecture. Note that different methods prune different types of layers, resulting in different # of parameters for the same target pruning ratio. We also directly (and fairly) compare with the results of Incremental (Zhu & Gupta, 2017) as reproduced and fine-tuned by Gale et al. (2019), who consider layer-wise sparsity ratios and fine-tune both the sparsity warmup schedule and label-smoothing for better performance.

| Method | Top-1 dense | Top-1 pruned | Top-1 diff. | Top-5 dense | Top-5 pruned | Top-5 diff. | Target pr. ratio | Reached pr. ratio | Remaining # of params |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Incremental (ZG, 2017) | 75.95 | 74.25 | -1.70 | 92.91 | 91.84 | -1.07 | 80% | 73.5% | 6.79 M |
| DSR (MW, 2019) | 74.90 | 73.30 | -1.60 | 92.40 | 92.40 | 0 | 80% | 71.4% | 7.30 M |
| SM (DZ, 2019) | 75.95 | 74.59 | -1.36 | 92.91 | 92.37 | -0.54 | 80% | 72.4% | 7.06 M |
| DPF | 75.95 | 75.48 | -0.47 | 92.91 | 92.59 | -0.32 | 80% | 73.5% | 6.79 M |
| Incremental (Gale et al., 2019) | 76.69 | 75.58 | -1.11 | - | - | - | 80% | 79.9% | 5.15 M |
| DPF | 75.95 | 75.13 | -0.82 | 92.91 | 92.52 | -0.39 | 80% | 79.9% | 5.15 M |
| Incremental (ZG, 2017) | 75.95 | 73.36 | -2.59 | 92.91 | 91.27 | -1.64 | 90% | 82.6% | 4.45 M |
| DSR (MW, 2019) | 74.90 | 71.60 | -3.30 | 92.40 | 90.50 | -1.90 | 90% | 80.0% | 5.10 M |
| SM (DZ, 2019) | 75.95 | 72.65 | -3.30 | 92.91 | 91.26 | -1.65 | 90% | 82.0% | 4.59 M |
| DPF | 75.95 | 74.55 | -1.44 | 92.91 | 92.13 | -0.78 | 90% | 82.6% | 4.45 M |

6 DISCUSSION

Besides the theoretical guarantees, a straightforward practical benefit of DPF over one-shot pruning is its fine-tuning-free training process. Figure 12 in the appendix (Section A.3.3) demonstrates the trivial computational overhead (considering the dynamic reparameterization cost) of using DPF to train the model from scratch. The small number of hyperparameters compared to other dynamic reparameterization methods (e.g., DSR and SM) is another advantage of DPF, and Figure 11 further studies how different setups of DPF impact the final performance. Notice also that for DPF, inference is performed only on sparse models, an aspect that could be leveraged for more efficient computations.

Empirical difference between one-shot pruning and DPF.
From Figure 2 one can see that DPF tends to oscillate among several local minima, whereas one-shot pruning, even with fine-tuning, converges to a single solution, which is not necessarily close to the optimum. We believe that the wider exploration of DPF helps to find better local minima (which can be even further improved by fine-tuning, as shown in Table 2). We empirically analyzed how drastically the mask changes between reparameterization steps, and how likely it is for a pruned weight to become non-zero in the later stages of training. Figure 4 shows at what stage of training each element of the final mask becomes fixed. For each epoch, we report how many mask elements are flipped from this epoch onwards. As an example, we see that for sparsity ratio 95%, after epoch 157 (i.e., with 43 epochs left), only 5% of the mask elements were still changing. This suggests that, except for a small percentage of weights that keep oscillating, the mask converges early in training. In the final epochs, the algorithm keeps improving accuracy, but the masks are only being fine-tuned. A similar mask convergence behavior can be found in the appendix (Figure 7) for training ResNet-20 on CIFAR-10.

Figure 4: Convergence of the pruning mask $m_t$ of DPF for different target sparsity levels (50% to 99%, see legend). The y-axis represents the percentage of mask elements that still change after a certain epoch (x-axis), with a zoom-in on epochs 120 to 200. The illustrated example is WideResNet-28-2 on CIFAR-10. We decayed the learning rate at epochs 60, 120 and 160.

DPF does not find a lottery ticket. The LT hypothesis (Frankle & Carbin, 2019) conjectures that for every desired sparsity level there exists a sparse submodel that can be trained to the same or better performance as the dense model. In Figure 5 we show that the mask found by DPF is not a LT, i.e., training the obtained sparse model from scratch does not recover the same performance. The (expensive) procedure proposed in Frankle & Carbin (2019); Frankle et al. (2019) finds different masks and achieves the same performance as DPF for mild sparsity levels, but DPF is much better for extremely sparse models (99% sparsity).

Figure 5: Top-1 test accuracy for different target sparsity levels (WideResNet-28-2 on CIFAR-10, unstructured pruning), comparing DPF, re-training with the original initialization (mask from DPF), re-training without the original initialization (mask from DPF), and lottery ticket training. DPF reaches comparable accuracy to the LT training method (and better for 99% target sparsity), but involves much less computation (extra full training rounds, right y-axis). Training the sparse models found by DPF from scratch does not reach the same performance (hence our sparse models are not lottery tickets).

Extension to structured pruning. The current state-of-the-art dynamic reparameterization methods only consider unstructured weight pruning. Structured filter pruning⁶ is either ignored (Bellec et al., 2018; Mocanu et al., 2018; Dettmers & Zettlemoyer, 2019) or shown to be challenging (Mostafa & Wang, 2019) even on the CIFAR dataset. In Figure 6 below we present some preliminary results on CIFAR-10 to show that our DPF can also be applied to structured filter pruning schemes. DPF outperforms the current filter-norm based state-of-the-art method for structured pruning (e.g., SFP (He et al., 2018)) by a noticeable margin. Figure 16 in Appendix A.4.3 displays the transition of the sparsity pattern (of different layers) for WideResNet-28-2 with different sparsity levels. DPF can be seen as a particular neural architecture search method, as it gradually learns to prune entire layers under the guidance of the feedback signal. We followed the common experimental setup described in Section 5, with an ℓ2-norm based filter selection criterion for the structured pruning extension on CIFAR-10. We believe a better filter selection scheme (Ye et al., 2018; He et al., 2019; Lym et al., 2019) could further improve the results, but we leave this exploration for future work.
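For the structured variant, the mask acts on whole output filters rather than on individual weights. Below is a minimal sketch (our own illustrative code, with hypothetical helper names) of ℓ2-norm based filter selection on a convolutional weight tensor of shape [out_channels, in_channels, k, k].

```python
import torch

def filter_mask(conv_weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask over output filters: prune the `sparsity` fraction of filters
    with the smallest L2 norm (mask broadcast over [out, in, k, k])."""
    norms = conv_weight.flatten(1).norm(dim=1)   # one L2 norm per output filter
    k = int(sparsity * norms.numel())
    if k == 0:
        return torch.ones_like(conv_weight)
    threshold = norms.kthvalue(k).values
    keep = (norms > threshold).float()           # 1 = keep filter, 0 = prune filter
    return keep.view(-1, 1, 1, 1).expand_as(conv_weight).contiguous()

w = torch.randn(64, 32, 3, 3)                    # a conv layer's weights
m = filter_mask(w, sparsity=0.4)
w_tilde = m * w                                  # whole filters are zeroed out
```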
Figure 6: MACs vs. top-1 test accuracy for training WideResNet-28 (with different widths) on CIFAR-10 under structured filter pruning: (a) WideResNet-28-2, (b) WideResNet-28-4, (c) WideResNet-28-8. The reported results are averaged over three runs. WideResNet-28-2 has 216M MACs, WideResNet-28-4 has 848M MACs and WideResNet-28-8 has 3366M MACs. Further details are given in Appendix A.4.1, e.g., # of params vs. top-1 test accuracy in Figure 14 and the numerical test accuracy scores in Table 6.

⁶ Lym et al. (2019) consider structured filter pruning and reconfigure the large (but sparse) model into a small (but dense) model during training for better training efficiency. Note that they perform model updates on a gradually reduced model space, which is completely different from the dynamic reparameterization methods (e.g., DSR, SM and our scheme) that perform reparameterization in the original (full) model space.

ACKNOWLEDGEMENTS

We acknowledge funding from SNSF grant 200021_175796, as well as a Google Focused Research Award.

REFERENCES

Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In ICLR - International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJ_wN01C-.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Miguel A. Carreira-Perpinán and Yerlan Idelbayev. "Learning-compression" algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8532-8541, 2018.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NeurIPS - Advances in Neural Information Processing Systems, pp. 3123-3131, 2015.

Xiaoliang Dai, Hongxu Yin, and Niraj Jha. NeST: A neural network synthesis tool based on a grow-and-prune paradigm. IEEE Transactions on Computers, 2019.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR - International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In NeurIPS - Advances in Neural Information Processing Systems, pp. 3581-3590, 2017.

Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NeurIPS - Advances in Neural Information Processing Systems, pp. 1379-1387, 2016.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS - Advances in Neural Information Processing Systems, pp. 1135-1143, 2015.

Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John Tran, and William J. Dally. DSD: Regularizing deep neural networks with dense-sparse-dense training flow. In ICLR - International Conference on Learning Representations, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016b.

Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 2234-2240, 2018.

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646-661. Springer, 2016.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252-3261, 2019.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NeurIPS - Advances in Neural Information Processing Systems, pp. 598-605, 1990.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. SNIP: Single-shot network pruning based on connection sensitivity. In ICLR - International Conference on Learning Representations, 2019.
Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In NeurIPS - Advances in Neural Information Processing Systems, pp. 5811-5821, 2017.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR - International Conference on Learning Representations, 2019.

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through l_0 regularization. In ICLR - International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Y8hhg0b.

Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Mattan Erez, and Sujay Shanghavi. PruneTrain: Gradual structured pruning from scratch for faster neural network training. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2019.

Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):2383, 2018.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In ICML - International Conference on Machine Learning, pp. 2498-2507. JMLR.org, 2017.

Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML - International Conference on Machine Learning, 2019.

Michael C. Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In NeurIPS - Advances in Neural Information Processing Systems, pp. 107-115, 1989.

Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. In ICLR - International Conference on Learning Representations, 2017.

Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In NeurIPS - Advances in Neural Information Processing Systems, pp. 6775-6784, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR - International Conference on Learning Representations, 2015.

Suraj Srinivas, Akshayvarun Subramanya, and R. Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 138-145, 2017.

Sebastian U. Stich and Sai P. Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019. URL https://arxiv.org/abs/1909.05350.

Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In NeurIPS - Advances in Neural Information Processing Systems, pp. 4447-4458. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.
Jianbo Ye, Xin Lu, Zhe Lin, and James Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In ICLR - International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJ94fqApW.

Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. In ICLR - International Conference on Learning Representations, 2019.

Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In NeurIPS - Advances in Neural Information Processing Systems, 2019.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

A.1 ALGORITHM

Algorithm 1: The detailed training procedure of DPF.
input: uncompressed model weights $w \in \mathbb{R}^d$; pruned weights $\tilde{w}$; mask $m \in \{0,1\}^d$; reparametrization period $p$; training iterations $T$.
1: for t = 0, ..., T do
2:   if p | t then  (trigger mask update, per default every p = 16 iterations)
3:     compute mask $m \leftarrow m_t(w_t)$ by an arbitrary pruning scheme (e.g., unstructured magnitude pruning)
4:   end if
5:   $\tilde{w}_t \leftarrow m \odot w_t$  (apply the (precomputed) mask)
6:   compute the (mini-batch) gradient $g(\tilde{w}_t)$  (forward/backward pass with pruned weights $\tilde{w}_t$)
7:   $w_{t+1} \leftarrow$ apply the gradient update $g(\tilde{w}_t)$ to $w_t$ via an arbitrary optimizer (e.g., SGD with momentum)
8: end for
output: $w_T$ and $\tilde{w}_T$

We trigger the mask update every p = 16 iterations (see also Figure 11) and we keep this parameter fixed throughout all experiments, independent of architecture or task. We perform pruning across all neural network layers (no layer-wise pruning) using magnitude-based unstructured weight pruning (inherited from Han et al. (2015)). Pruning is applied to all convolutional layers while keeping the last fully-connected layer, biases and batch normalization layers dense. We gradually increase the sparsity $s_t$ of the mask from 0 to the desired sparsity using the same scheduling as in Zhu & Gupta (2017); see Appendix A.2 below.

A.2 IMPLEMENTATION DETAILS

We implemented our DPF in PyTorch (Paszke et al., 2017). All experiments were run on NVIDIA Tesla V100 GPUs. Sparse tensors in our implementation are represented as dense tensors multiplied by the corresponding binary masks.

Datasets. We evaluate all methods on the following standard image classification tasks:
- Image classification on CIFAR-10 (Krizhevsky & Hinton, 2009). The dataset consists of a training set of 50K and a test set of 10K color images of 32x32 pixels, with 10 target classes. We adopt the standard data augmentation and preprocessing scheme (He et al., 2016a; Huang et al., 2016).
- Image classification on ImageNet (Russakovsky et al., 2015). The ILSVRC 2012 classification dataset consists of 1.28 million images for training and 50K for validation, with 1K target classes. We use ImageNet-1k (Deng et al., 2009) and adopt the same data preprocessing and augmentation scheme as in He et al. (2016a;b); Simonyan & Zisserman (2015).

Gradual pruning scheme. For the Incremental baseline, we tuned their automated gradual pruning scheme

$$s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3$$

to gradually adjust the pruning sparsity ratio $s_t$ for $t \in \{t_0, \ldots, t_0 + n\Delta t\}$. That is, in our setup, we increased the sparsity from an initial ratio $s_i = 0$ to the desired target model sparsity ratio $s_f$ over $n$ epochs, where $n$ is the epoch at which the second learning rate decay is performed, starting from training epoch $t_0 = 0$ and with pruning frequency $\Delta t = 1$ epoch. In our experiments, we used this gradual pruning scheme for the different methods, except for One-shot P+FT, SNIP, and the methods (DSR, SM) that have their own fine-tuned gradual pruning scheme.
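For reference, a small helper (our own naming, under the default assumptions stated above: $s_i = 0$, $t_0 = 0$, $\Delta t = 1$ epoch) that reproduces this cubic sparsity warmup.

```python
def target_sparsity(t: int, s_i: float, s_f: float, t0: int, n: int, dt: int = 1) -> float:
    """Cubic sparsity schedule of Zhu & Gupta (2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3 for t in [t0, t0 + n*dt],
    clamped to s_f afterwards (t measured in epochs here, with dt = 1 epoch)."""
    if t < t0:
        return s_i
    progress = min(1.0, (t - t0) / float(n * dt))
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

# Example: ramp from 0% to 90% sparsity over the first 120 of 200 epochs.
schedule = [target_sparsity(t, s_i=0.0, s_f=0.9, t0=0, n=120) for t in range(200)]
```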
Hyper-parameter tuning procedure. We grid-searched the optimal learning rate, starting from the range {0.05, 0.10, 0.15, 0.20}. More precisely, we evaluate a linearly spaced grid of learning rates; if the best performance was ever at one of the extremes of the grid, we would add new grid points so that the best performance was contained in the middle of the grid. We trained most of the methods using mini-batch SGD with Nesterov momentum. For baselines involving a fine-tuning procedure (e.g., Table 2), we grid-searched for the optimal results by tuning the optimizer (i.e., mini-batch SGD with Nesterov momentum, or Adam) and the learning rate.

The optimal hyper-parameters for DPF. The mini-batch size is fixed to 128 for CIFAR-10 and 1024 for ImageNet regardless of the model. For CIFAR-10, we trained ResNet-a and VGG for 300 epochs and decayed the learning rate by 10 when accessing 50% and 75% of the total training samples (He et al., 2016a; Huang et al., 2017); and we trained WideResNet-a-b as in Zagoruyko & Komodakis (2016) for 200 epochs and decayed the learning rate by 5 when accessing 30%, 60% and 80% of the total training samples. The optimal learning rates for ResNet-a, WideResNet-a-b and VGG are 0.2, 0.1 and 0.2 respectively; the corresponding weight decays are 1e-4, 5e-4 and 1e-4 respectively. For ImageNet training, we used the training scheme of Goyal et al. (2017) for 90 epochs, where we gradually warm up the learning rate from 0.1 to 0.4 and decay it by 10 at epochs 30, 60 and 80. The weight decay used is 1e-4.

A.3 ADDITIONAL RESULTS FOR UNSTRUCTURED PRUNING

A.3.1 COMPLETE RESULTS OF UNSTRUCTURED PRUNING ON CIFAR-10

Table 4 details the numerical results for training SOTA DNNs on CIFAR-10; parts of it underlie Table 1 and Figure 3.

Table 4: Top-1 test accuracy for training (compressed) SOTA DNNs on CIFAR-10 from scratch. We considered unstructured pruning; a dash (-) indicates that the method cannot converge (or was not run). The results are averaged over three runs.
| Model | Dense baseline | SNIP (L+, 2019) | SM (DZ, 2019) | DSR (MW, 2019) | DPF (ours) | Target pr. ratio |
| --- | --- | --- | --- | --- | --- | --- |
| VGG16-D | 93.74 ± 0.13 | 93.04 ± 0.26 | 93.59 ± 0.17 | - | 93.87 ± 0.15 | 95% |
| ResNet-20 | 92.48 ± 0.20 | 91.10 ± 0.22 | 91.98 ± 0.01 | 92.00 ± 0.19 | 92.42 ± 0.14 | 70% |
| ResNet-20 | 92.48 ± 0.20 | 90.53 ± 0.27 | 91.54 ± 0.16 | 91.78 ± 0.28 | 92.17 ± 0.21 | 80% |
| ResNet-20 | 92.48 ± 0.20 | 88.50 ± 0.13 | 89.76 ± 0.40 | 87.88 ± 0.04 | 90.88 ± 0.07 | 90% |
| ResNet-20 | 92.48 ± 0.20 | 84.91 ± 0.25 | 83.03 ± 0.74 | - | 88.01 ± 0.30 | 95% |
| ResNet-32 | 93.83 ± 0.12 | 90.40 ± 0.26 | 91.54 ± 0.18 | 91.41 ± 0.23 | 92.42 ± 0.18 | 90% |
| ResNet-32 | 93.83 ± 0.12 | 87.23 ± 0.29 | 88.68 ± 0.22 | 84.12 ± 0.32 | 90.94 ± 0.35 | 95% |
| ResNet-56 | 94.51 ± 0.20 | 91.43 ± 0.34 | 92.73 ± 0.21 | 93.78 ± 0.20 | 93.95 ± 0.11 | 90% |
| ResNet-56 | 94.51 ± 0.20 | 90.96 ± 0.40 | 92.57 ± 0.09 | - | 92.74 ± 0.08 | 95% |
| WideResNet-28-2 | 95.01 ± 0.04 | 94.67 ± 0.23 | 94.73 ± 0.16 | 94.80 ± 0.14 | 95.11 ± 0.06 | 50% |
| WideResNet-28-2 | 95.01 ± 0.04 | 94.47 ± 0.19 | 94.65 ± 0.16 | 94.98 ± 0.07 | 94.90 ± 0.06 | 60% |
| WideResNet-28-2 | 95.01 ± 0.04 | 94.29 ± 0.22 | 94.46 ± 0.11 | 94.80 ± 0.15 | 94.86 ± 0.13 | 70% |
| WideResNet-28-2 | 95.01 ± 0.04 | 93.56 ± 0.14 | 94.17 ± 0.12 | 94.57 ± 0.13 | 94.76 ± 0.18 | 80% |
| WideResNet-28-2 | 95.01 ± 0.04 | 92.58 ± 0.22 | 93.41 ± 0.22 | 93.88 ± 0.08 | 94.36 ± 0.24 | 90% |
| WideResNet-28-2 | 95.01 ± 0.04 | 90.80 ± 0.04 | 92.24 ± 0.14 | 92.74 ± 0.17 | 93.62 ± 0.05 | 95% |
| WideResNet-28-2 | 95.01 ± 0.04 | 83.45 ± 0.38 | 85.36 ± 0.80 | - | 88.92 ± 0.29 | 99% |
| WideResNet-28-4 | 95.69 ± 0.10 | 95.42 ± 0.05 | 95.57 ± 0.08 | 95.67 ± 0.07 | 95.58 ± 0.21 | 70% |
| WideResNet-28-4 | 95.69 ± 0.10 | 95.24 ± 0.07 | 95.27 ± 0.02 | 95.49 ± 0.04 | 95.60 ± 0.08 | 80% |
| WideResNet-28-4 | 95.69 ± 0.10 | 94.56 ± 0.11 | 95.01 ± 0.05 | 95.30 ± 0.12 | 95.65 ± 0.14 | 90% |
| WideResNet-28-4 | 95.69 ± 0.10 | 93.62 ± 0.17 | 94.45 ± 0.14 | 94.63 ± 0.08 | 95.38 ± 0.04 | 95% |
| WideResNet-28-4 | 95.69 ± 0.10 | 92.06 ± 0.38 | 93.80 ± 0.24 | 93.92 ± 0.16 | 94.98 ± 0.08 | 97.5% |
| WideResNet-28-4 | 95.69 ± 0.10 | 89.49 ± 0.20 | 92.18 ± 0.04 | 92.50 ± 0.07 | 93.86 ± 0.20 | 99% |
| WideResNet-28-8 | 96.06 ± 0.06 | 95.81 ± 0.05 | 95.92 ± 0.12 | 96.06 ± 0.09 | - | 70% |
| WideResNet-28-8 | 96.06 ± 0.06 | 95.86 ± 0.10 | 95.97 ± 0.05 | 96.05 ± 0.12 | - | 80% |
| WideResNet-28-8 | 96.06 ± 0.06 | 95.49 ± 0.21 | 95.67 ± 0.14 | 95.81 ± 0.10 | 96.08 ± 0.15 | 90% |
| WideResNet-28-8 | 96.06 ± 0.06 | 94.92 ± 0.13 | 95.64 ± 0.07 | 95.55 ± 0.12 | 95.98 ± 0.10 | 95% |
| WideResNet-28-8 | 96.06 ± 0.06 | 94.11 ± 0.19 | 95.31 ± 0.20 | 95.11 ± 0.07 | 95.84 ± 0.04 | 97.5% |
| WideResNet-28-8 | 96.06 ± 0.06 | 92.04 ± 0.11 | 94.38 ± 0.12 | 94.10 ± 0.12 | 95.63 ± 0.16 | 99% |
| WideResNet-28-8 | 96.06 ± 0.06 | 74.50 ± 2.23 | 88.65 ± 0.36 | - | 91.76 ± 0.18 | 99.9% |

A.3.2 UNDERSTANDING THE TRAINING DYNAMICS AND THE LOTTERY TICKET EFFECT

Figure 7 and Figure 8 complement Figure 4 and detail the training dynamics of DPF (e.g., the convergence of δ and of the masks) from other perspectives. Figure 9 compares the training dynamics of DPF and Incremental (Zhu & Gupta, 2017), demonstrating that our scheme enables a drastic reparameterization over the dense parameter space for better generalization performance.

Figure 7: Convergence of the pruning mask $m_t$ of DPF for different target sparsity levels (see legend). The y-axis represents the percentage of mask elements that still change after a certain epoch (x-axis), with a zoom-in on epochs 200 to 300. The illustrated example is ResNet-20 on CIFAR-10. We decayed the learning rate at epochs 150 and 225.

Figure 8: Training dynamics of DPF (WideResNet-28-2 on CIFAR-10) for unstructured pruning with different sparsity ratios (0.5 to 0.99): (a) δ vs. epoch, (b) mask flip ratio vs. epoch, (c) IoU vs. epoch. IoU stands for Intersection over Union of the non-masked elements of two consecutive masks; the smaller the value, the larger the fraction of the mask that flips.
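For completeness, the mask-convergence statistics plotted in Figure 8 (flip ratio and IoU of consecutive masks) can be computed as follows; this is a monitoring sketch of our own, assuming masks are stored as 0/1 tensors.

```python
import torch

def mask_stats(m_prev: torch.Tensor, m_curr: torch.Tensor) -> dict:
    """Flip ratio and IoU of the kept (non-masked) positions of two consecutive masks."""
    prev, curr = m_prev.bool(), m_curr.bool()
    flip_ratio = (prev ^ curr).float().mean().item()   # fraction of entries that changed
    inter = (prev & curr).sum().item()
    union = (prev | curr).sum().item()
    iou = inter / union if union > 0 else 1.0          # smaller IoU = more mask churn
    return {"flip_ratio": flip_ratio, "iou": iou}
```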
Figure 9: Convergence of the pruning mask $m_t$ of DPF vs. Incremental (Zhu & Gupta, 2017) for different target sparsity levels (see legend). The y-axis represents the percentage of mask elements that still change after a certain epoch (x-axis). The illustrated example is WideResNet-28-2 on CIFAR-10. We decayed the learning rate at epochs 60, 120 and 160. Both schemes use the same gradual warmup schedule (and hyper-parameters) for the pruning ratio during training.

Figure 10, in addition to Figure 5 (in the main text), further studies the lottery ticket hypothesis under different training budgets (same number of epochs or same total flops). The results of DPF also demonstrate the importance of training-time structural exploration as well as the corresponding implicit regularization effects. Note that we do not want to question the importance of the weight initialization or the existence of lottery tickets. Instead, DPF provides an alternative training scheme that compresses the model to an extremely high compression ratio without sacrificing test accuracy, where most existing methods still suffer severe quality loss (including Frankle & Carbin (2019); Liu et al. (2019); Frankle et al. (2019)).

Figure 10: Investigating the lottery ticket effect for model compression (unstructured weight pruning for WideResNet-28-2 on CIFAR-10): (a) retraining with the same number of epochs, (b) retraining with the same flops. It complements the observations in Figure 5 by retraining the model for the same amount of computation budget (i.e., flops).

A.3.3 COMPUTATIONAL OVERHEAD AND THE IMPACT OF HYPER-PARAMETERS

In Figure 11, we evaluate the top-1 test accuracy of a compressed model trained by DPF under different setups, e.g., different reparameterization periods p, different sparsity ratios, different mini-batch sizes, as well as layer-wise vs. global pruning. We can see that the optimal reparameterization period (i.e., p = 16) is quite consistent over different sparsity ratios and different mini-batch sizes, and we used it in all our experiments. Global unstructured weight pruning (instead of layer-wise weight pruning) gives DPF more flexibility to perform dynamic parameter reallocation, and thus provides better results, especially for more aggressive pruning sparsity ratios. However, we also note that, for the same number of compressed parameters, global pruning leads to a slight increase in the amount of MACs compared to layer-wise pruning, as illustrated in Table 5.

Figure 11: Investigating how the reparameterization period/scheme and the mini-batch size impact the generalization performance (top-1 test accuracy) when dynamically training (and reparameterizing) a compressed model from scratch (ResNet-20 on CIFAR-10). (a) Reparameterization period vs. sparsity ratio; the heatmap value is the test accuracy minus the best accuracy of the corresponding sparsity (mini-batch size 128, without layer-wise reparameterization). (b) Reparameterization period vs. mini-batch size; the heatmap displays the test accuracy (sparsity ratio 0.80). (c) Reparameterization scheme (layer-wise or not) vs. sparsity ratio; the heatmap displays the test accuracy (mini-batch size 128, reparameterization period p = 16).
Figure 11: Impact of the reparameterization period/scheme and the mini-batch size on the generalization performance (top-1 test accuracy) when dynamically training (and reparameterizing) a compressed model from scratch (ResNet-20 on CIFAR-10). (a) Reparameterization period vs. sparsity ratio; the heatmap value is the test accuracy minus the best accuracy at the corresponding sparsity (mini-batch size 128, without layer-wise reparameterization). (b) Reparameterization period vs. mini-batch size; the heatmap displays the test accuracy (sparsity ratio 0.80). (c) Reparameterization scheme (layer-wise or not) vs. sparsity ratio; the heatmap displays the test accuracy (mini-batch size 128, reparameterization period p = 16).

Table 5: Impact of the reparameterization scheme (layer-wise or global) on the number of MACs, for the same number of compressed parameters, when using DPF on ResNet-20 with CIFAR-10.
Target sparsity | 50% | 60% | 70% | 80% | 90% | 95%
w/o layer-wise reparameterization | 22.80M | 19.13M | 15.18M | 11.12M | 6.54M | 4.02M
w/ layer-wise reparameterization | 20.99M | 16.91M | 12.83M | 8.75M | 4.67M | 2.63M

Figure 12 demonstrates the negligible computational overhead of using DPF to gradually train a compressed model (ResNet-50) from scratch on ImageNet. Note that we only measure the reparameterization cost introduced by dynamic pruning, which is independent of the (potentially significant) system speedup that the extremely high model sparsity could bring. Even though we do not estimate the practical speedup, we believe the training efficiency is similar to the values reported in Dettmers & Zettlemoyer (2019).

Figure 12: Training-loss curves of our DPF (with unstructured magnitude pruning) and standard mini-batch SGD for training ResNet-50 on ImageNet; our proposed DPF has negligible computational overhead. We trained ResNet-50 on 4 NVIDIA V100 GPUs with a mini-batch size of 1024; the target sparsity ratio is 80%.

A.3.4 IMPLICIT NEURAL ARCHITECTURE SEARCH

DPF provides effective training-time structural exploration and can even perform an implicit neural architecture search. Figure 13 below shows that, for the same pruned model size (i.e., any point on the x-axis), pruning a larger network (e.g., WideResNet-28-8) yields a better-generalizing compressed model than pruning a relatively small network (e.g., WideResNet-28-4).

Figure 13: Top-1 test accuracy vs. compressed model size (# of params, in M) for training WideResNet-28 on CIFAR-10. The compressed models are obtained from WideResNet-28 with fixed depth and different widths (WideResNet-28-2, -4, -8, i.e., different numbers of filters per layer).
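Both the model sizes on the x-axis of Figure 13 and the MAC counts in Table 5 can be derived from the pruning masks alone. The sketch below illustrates one such accounting for convolutional layers, assuming each kept weight contributes one multiply-accumulate per output spatial position; the exact counting convention used in the paper may differ.

```python
import torch

def pruned_param_count(masks):
    """Number of remaining (nonzero) parameters, summed over all masked tensors."""
    return int(sum(m.sum().item() for m in masks))

def conv_macs(mask, out_h, out_w):
    """MACs of one conv layer under unstructured sparsity: every kept weight is
    used once per output spatial position (bias and batch-norm cost ignored)."""
    return int(mask.sum().item()) * out_h * out_w

def total_macs(conv_masks, out_sizes):
    """Sum the per-layer MACs; `out_sizes` holds the (H, W) of each layer's output.
    Two mask sets with the same total NNZ can give different totals here, because
    weights kept in layers with larger output maps cost more MACs -- one way to
    read the gap between the two rows of Table 5."""
    return sum(conv_macs(m, h, w) for m, (h, w) in zip(conv_masks, out_sizes))
```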
A.4 ADDITIONAL RESULTS FOR STRUCTURED PRUNING

A.4.1 GENERALIZATION PERFORMANCE FOR CIFAR-10

Figure 14 complements the structured pruning results in the main text (Figure 6), and Table 6 details the numerical results presented in both Figure 6 and Figure 14.

Figure 14: # of params vs. top-1 test accuracy for training WideResNet-28 (with different widths) on CIFAR-10 with structured filter-wise pruning: (a) WideResNet-28-2, (b) WideResNet-28-4, (c) WideResNet-28-8. The reported results are averaged over three runs.

Table 6: Performance evaluation of DPF (and the SFP baseline) for training (Wide)ResNet variants on CIFAR-10 with structured pruning. We use a norm-based criterion for filter selection (as in SFP (He et al., 2018)) to estimate the output channel-wise pruning threshold, and we follow the gradual pruning warmup scheme of Zhu & Gupta (2017) from epoch 0 until the epoch of the second learning rate decay. Note that SFP prunes filters within each layer by a given ratio, while our DPF prunes filters across layers; because filters differ between layers, the number of parameters pruned by DPF may differ slightly from the number pruned by SFP. The pruning ratio therefore refers to the fraction of filters pruned either within each layer or across layers.
Model | Baseline on dense model | SFP (He et al., 2018) | DPF | Target pr. ratio
ResNet-20 | 92.48 ± 0.20 | 92.18 ± 0.31 | 92.54 ± 0.07 | 10%
ResNet-20 | 92.48 ± 0.20 | 91.12 ± 0.20 | 91.90 ± 0.06 | 20%
ResNet-20 | 92.48 ± 0.20 | 90.32 ± 0.25 | 91.07 ± 0.40 | 30%
ResNet-20 | 92.48 ± 0.20 | 89.60 ± 0.46 | 90.28 ± 0.26 | 40%
ResNet-32 | 93.52 ± 0.13 | 92.07 ± 0.22 | 92.18 ± 0.16 | 30%
ResNet-32 | 93.52 ± 0.13 | 91.14 ± 0.45 | 91.50 ± 0.21 | 40%
ResNet-56 | 94.51 ± 0.20 | 93.99 ± 0.27 | 94.53 ± 0.13 | 30%
ResNet-56 | 94.51 ± 0.20 | 93.57 ± 0.16 | 94.03 ± 0.38 | 40%
WideResNet-28-2 | 95.01 ± 0.04 | 94.02 ± 0.24 | 94.52 ± 0.08 | 40%
WideResNet-28-2 | 95.01 ± 0.04 | 93.34 ± 0.14 | 94.11 ± 0.12 | 50%
WideResNet-28-2 | 95.01 ± 0.04 | 92.07 ± 0.09 | 93.74 ± 0.25 | 60%
WideResNet-28-2 | 95.01 ± 0.04 | 90.66 ± 0.62 | 92.89 ± 0.16 | 70%
WideResNet-28-2 | 95.01 ± 0.04 | 86.00 ± 1.09 | 90.53 ± 0.17 | 80%
WideResNet-28-4 | 95.69 ± 0.10 | 95.15 ± 0.11 | 95.50 ± 0.05 | 40%
WideResNet-28-4 | 95.69 ± 0.10 | 94.86 ± 0.10 | 95.43 ± 0.16 | 50%
WideResNet-28-4 | 95.69 ± 0.10 | 94.47 ± 0.10 | 95.20 ± 0.05 | 60%
WideResNet-28-4 | 95.69 ± 0.10 | 93.37 ± 0.18 | 94.67 ± 0.08 | 70%
WideResNet-28-4 | 95.69 ± 0.10 | 91.88 ± 0.59 | 93.79 ± 0.09 | 80%
WideResNet-28-8 | 96.06 ± 0.06 | 95.62 ± 0.04 | 96.06 ± 0.12 | 40%
WideResNet-28-8 | 96.06 ± 0.06 | 95.59 ± 0.09 | 96.03 ± 0.02 | 50%
WideResNet-28-8 | 96.06 ± 0.06 | 95.40 ± 0.14 | 95.88 ± 0.16 | 60%
WideResNet-28-8 | 96.06 ± 0.06 | 94.99 ± 0.22 | 95.71 ± 0.16 | 70%
WideResNet-28-8 | 96.06 ± 0.06 | 94.22 ± 0.21 | 95.15 ± 0.03 | 80%

A.4.2 UNDERSTANDING THE LOTTERY TICKET EFFECT

Similar to the observations in Section 6 (for unstructured pruning), Figure 15 considers structured pruning, and again we find that DPF does not find a lottery ticket: the superior generalization performance of DPF cannot be explained by the found mask or the weight initialization scheme alone.

Figure 15: Investigating the lottery ticket effect for model compression (WideResNet-28-2 on CIFAR-10) with structured pruning; we compare DPF with retraining from the DPF mask, with and without the original initialization, using the same epoch budget.

A.4.3 MODEL SPARSITY VISUALIZATION

Figure 16 below visualizes the per-layer sparsity patterns for different model sparsity levels under structured pruning. We observe that, due to the presence of residual connections, DPF gradually learns to prune entire residual blocks.
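Concretely, the cross-layer, norm-based filter selection used in this structured-pruning section could look as follows. This is a minimal sketch under our own assumptions (L2 filter norms and a single global threshold); the authors' implementation may differ in these details.

```python
import torch

def structured_filter_masks(conv_weights, prune_ratio):
    """Rank all output filters of all conv layers by their L2 norm and prune the
    smallest `prune_ratio` fraction globally (across layers); an SFP-style variant
    would instead apply the ratio inside every layer separately."""
    norms = []
    for w in conv_weights:                         # w: (out_ch, in_ch, kh, kw)
        norms.append(w.detach().flatten(1).norm(dim=1))
    all_norms = torch.cat(norms)
    k = int(prune_ratio * all_norms.numel())       # number of filters to prune
    if k == 0:
        return [torch.ones_like(w) for w in conv_weights]
    thr = torch.kthvalue(all_norms, k).values
    masks = []
    for w, n in zip(conv_weights, norms):
        keep = (n > thr).float().view(-1, 1, 1, 1)  # broadcast over the whole filter
        masks.append(keep.expand_as(w).clone())
    return masks
```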
Each panel shows the per-layer weights footprint (element-wise NNZ of the sparse vs. the dense model, for every layer from conv1.weight to linear.weight).
(a) Structured pruning with 40% sparsity. (b) Structured pruning with 50% sparsity. (c) Structured pruning with 60% sparsity.
(d) Structured pruning with 70% sparsity. (e) Structured pruning with 80% sparsity.
Figure 16: The element-wise sparsity of each layer for WideResNet-28-2 (trained on CIFAR-10 via DPF), under different structured pruning sparsity ratios. DPF for model compression (with structured pruning) implicitly performs a neural architecture search.

B MISSING PROOFS

In this section we present the proofs for the claims in Section 4. First, we give the proof for the strongly convex case. Here we follow Lacoste-Julien et al. (2012) for the general structure, combined with estimates from the error-feedback framework (Stich et al., 2018; Stich & Karimireddy, 2019) to control the pruning errors.

Proof of Theorem 4.1. By definition of (DPF), $w_{t+1} = w_t - \gamma_t g(\tilde{w}_t)$, hence
$$
\mathbb{E}\big[\|w_{t+1}-w^\star\|^2 \,\big|\, w_t\big]
= \|w_t-w^\star\|^2 - 2\gamma_t \langle w_t-w^\star, \mathbb{E}\, g(\tilde{w}_t)\rangle + \gamma_t^2\, \mathbb{E}\|g(\tilde{w}_t)\|^2
\le \|w_t-w^\star\|^2 - 2\gamma_t \langle w_t-w^\star, \nabla f(\tilde{w}_t)\rangle + \gamma_t^2 G^2
$$
$$
= \|w_t-w^\star\|^2 - 2\gamma_t \langle \tilde{w}_t-w^\star, \nabla f(\tilde{w}_t)\rangle + \gamma_t^2 G^2 + 2\gamma_t \langle \tilde{w}_t-w_t, \nabla f(\tilde{w}_t)\rangle .
$$
By strong convexity,
$$
-2\langle \tilde{w}_t-w^\star, \nabla f(\tilde{w}_t)\rangle \le -\mu\|\tilde{w}_t-w^\star\|^2 - 2\big(f(\tilde{w}_t)-f(w^\star)\big),
$$
with $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ further
$$
-\|\tilde{w}_t-w^\star\|^2 \le -\tfrac{1}{2}\|w_t-w^\star\|^2 + \|w_t-\tilde{w}_t\|^2 ,
$$
and with $\langle a,b\rangle \le \tfrac{1}{2\alpha}\|a\|^2 + \tfrac{\alpha}{2}\|b\|^2$ for $a,b\in\mathbb{R}^d$ and $\alpha>0$,
$$
2\langle \tilde{w}_t-w_t, \nabla f(\tilde{w}_t)\rangle
\le 2L\|\tilde{w}_t-w_t\|^2 + \tfrac{1}{2L}\|\nabla f(\tilde{w}_t)\|^2
= 2L\|\tilde{w}_t-w_t\|^2 + \tfrac{1}{2L}\|\nabla f(\tilde{w}_t)-\nabla f(w^\star)\|^2
\le 2L\|\tilde{w}_t-w_t\|^2 + f(\tilde{w}_t)-f(w^\star),
$$
where the last inequality is a consequence of $L$-smoothness. Combining all these inequalities yields
$$
\mathbb{E}\big[\|w_{t+1}-w^\star\|^2 \,\big|\, w_t\big]
\le \big(1-\tfrac{\mu\gamma_t}{2}\big)\|w_t-w^\star\|^2 - \gamma_t\big(f(\tilde{w}_t)-f(w^\star)\big) + \gamma_t^2 G^2 + \gamma_t(2L+\mu)\|\tilde{w}_t-w_t\|^2
$$
$$
\le \big(1-\tfrac{\mu\gamma_t}{2}\big)\|w_t-w^\star\|^2 - \gamma_t\big(f(\tilde{w}_t)-f(w^\star)\big) + \gamma_t^2 G^2 + 3\gamma_t L\|\tilde{w}_t-w_t\|^2 ,
$$
as $\mu \le L$. Hence, by rearranging and multiplying with a weight $\lambda_t>0$:
$$
\lambda_t\, \mathbb{E}\big(f(\tilde{w}_t)-f(w^\star)\big)
\le \frac{\lambda_t(1-\mu\gamma_t/2)}{\gamma_t}\, \mathbb{E}\|w_t-w^\star\|^2 - \frac{\lambda_t}{\gamma_t}\, \mathbb{E}\|w_{t+1}-w^\star\|^2 + \gamma_t\lambda_t G^2 + 3\lambda_t L\, \mathbb{E}\|\tilde{w}_t-w_t\|^2 .
$$
By plugging in the learning rate $\gamma_t = \frac{4}{\mu(t+2)}$ and setting $\lambda_t = t+1$ we obtain
$$
\lambda_t\, \mathbb{E}\big(f(\tilde{w}_t)-f(w^\star)\big)
\le \frac{\mu}{4}\Big[t(t+1)\, \mathbb{E}\|w_t-w^\star\|^2 - (t+1)(t+2)\, \mathbb{E}\|w_{t+1}-w^\star\|^2\Big] + \frac{4(t+1)}{\mu(t+2)}G^2 + 3(t+1)L\, \mathbb{E}\|\tilde{w}_t-w_t\|^2 .
$$
By summing these $\lambda_t$-weighted inequalities from $t=0$ to $t=T$, we obtain a telescoping sum:
$$
\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\big(f(\tilde{w}_t)-f(w^\star)\big)
\le \frac{\mu}{4}\Big[0 - (T+1)(T+2)\, \mathbb{E}\|w_{T+1}-w^\star\|^2\Big] + \frac{4(T+1)}{\mu}G^2 + 3L\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\|\tilde{w}_t-w_t\|^2
\le \frac{4(T+1)}{\mu}G^2 + 3L\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\|\tilde{w}_t-w_t\|^2 .
$$
Hence, for $\Lambda_T := \sum_{t=0}^{T}\lambda_t = \frac{(T+1)(T+2)}{2}$,
$$
\frac{1}{\Lambda_T}\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\big(f(\tilde{w}_t)-f(w^\star)\big)
\le \frac{4(T+1)}{\mu\Lambda_T}G^2 + \frac{3L}{\Lambda_T}\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\|\tilde{w}_t-w_t\|^2
= \frac{8G^2}{\mu(T+2)} + \frac{3L}{\Lambda_T}\sum_{t=0}^{T}\lambda_t\, \mathbb{E}\|\tilde{w}_t-w_t\|^2 .
$$
Finally, using $\|\tilde{w}_t-w_t\|^2 = \delta_t\|w_t\|^2$ from (1) shows the theorem.

Before giving the proof of Theorem 4.2, we first give a justification for the remark just below Theorem 4.1 on the one-shot pruning of the final iterate. By $L$-smoothness and $\langle a,b\rangle \le \frac{1}{2\alpha}\|a\|^2 + \frac{\alpha}{2}\|b\|^2$ for $a,b\in\mathbb{R}^d$ and $\alpha>0$, we have for any iterate $w_t$:
$$
f(\tilde{w}_t)-f(w^\star)
\le f(w_t)-f(w^\star) + \langle \nabla f(w_t), \tilde{w}_t-w_t\rangle + \frac{L}{2}\|\tilde{w}_t-w_t\|^2
\le f(w_t)-f(w^\star) + \frac{1}{2L}\|\nabla f(w_t)\|^2 + L\|\tilde{w}_t-w_t\|^2
\le 2\big(f(w_t)-f(w^\star)\big) + \delta_t L\|w_t\|^2 . \qquad (4)
$$
Furthermore, again by $L$-smoothness, $f(w_T)-f(w^\star) \le \frac{L}{2}\|w_T-w^\star\|^2 = O\big(\frac{LG^2}{\mu^2 T}\big)$, as standard SGD analysis gives the estimate $\mathbb{E}\|w_T-w^\star\|^2 = O\big(\frac{G^2}{\mu^2 T}\big)$, see e.g. Lacoste-Julien et al. (2012). Combining these two estimates (with $w_t = w_T$) shows the claim.

Furthermore, we also claimed that the dense model converges to a neighborhood of the optimal solution. This follows by $L$-smoothness and (4): for any fixed model $w_t$ we have the estimate (4), and, by the same argument with the roles of $w_t$ and $\tilde{w}_t$ exchanged, also $f(w_t)-f(w^\star) \le 2\big(f(\tilde{w}_t)-f(w^\star)\big) + \delta_t L\|w_t\|^2$. Hence, for a randomly chosen (dense) model $u$ (from the same distribution as the sparse model in Theorem 4.1) we have
$$
\mathbb{E} f(u) - f(w^\star)
\le 2\, \mathbb{E}\big[f(\tilde{u})-f(w^\star)\big] + L\, \mathbb{E}\big[\delta_t\|w_t\|^2\big]
\overset{\text{(Thm. 4.1)}}{=} O\Big(\frac{G^2}{\mu T}\Big) + L\, \mathbb{E}\big[\delta_t\|w_t\|^2\big] .
$$
Lastly, we give the proof of Theorem 4.2, following Karimireddy et al. (2019).

Proof of Theorem 4.2. By smoothness, and $\langle a,b\rangle \le \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2$ for $a,b\in\mathbb{R}^d$,
$$
\mathbb{E}\big[f(w_{t+1}) \,\big|\, w_t\big]
\le f(w_t) - \gamma\langle \nabla f(w_t), \mathbb{E}\, g(\tilde{w}_t)\rangle + \frac{\gamma^2 L}{2}\, \mathbb{E}\|g(\tilde{w}_t)\|^2
\le f(w_t) - \gamma\langle \nabla f(w_t), \nabla f(\tilde{w}_t)\rangle + \frac{\gamma^2 L G^2}{2}
$$
$$
= f(w_t) - \gamma\|\nabla f(\tilde{w}_t)\|^2 + \frac{\gamma^2 L G^2}{2} + \gamma\langle \nabla f(\tilde{w}_t)-\nabla f(w_t), \nabla f(\tilde{w}_t)\rangle
$$
$$
\le f(w_t) - \gamma\|\nabla f(\tilde{w}_t)\|^2 + \frac{\gamma^2 L G^2}{2} + \frac{\gamma}{2}\|\nabla f(\tilde{w}_t)-\nabla f(w_t)\|^2 + \frac{\gamma}{2}\|\nabla f(\tilde{w}_t)\|^2
$$
$$
\le f(w_t) - \frac{\gamma}{2}\|\nabla f(\tilde{w}_t)\|^2 + \frac{\gamma^2 L G^2}{2} + \frac{\gamma L^2}{2}\|w_t-\tilde{w}_t\|^2 ,
$$
where the last step uses $L$-smoothness of the gradient, and by rearranging
$$
\mathbb{E}\|\nabla f(\tilde{w}_t)\|^2 \le \frac{2}{\gamma}\big[\mathbb{E} f(w_t) - \mathbb{E} f(w_{t+1})\big] + \gamma L G^2 + L^2\, \mathbb{E}\|w_t-\tilde{w}_t\|^2 .
$$
Summing these inequalities from $t=0$ to $t=T$ gives
$$
\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\|\nabla f(\tilde{w}_t)\|^2
\le \frac{2}{\gamma(T+1)}\sum_{t=0}^{T}\big(\mathbb{E}[f(w_t)]-\mathbb{E}[f(w_{t+1})]\big) + \gamma L G^2 + \frac{L^2}{T+1}\sum_{t=0}^{T}\mathbb{E}\|w_t-\tilde{w}_t\|^2
\le \frac{2\big(f(w_0)-f(w^\star)\big)}{\gamma(T+1)} + \gamma L G^2 + \frac{L^2}{T+1}\sum_{t=0}^{T}\mathbb{E}\|e_t\|^2 .
$$
Finally, using $\|\tilde{w}_t-w_t\|^2 = \delta_t\|w_t\|^2$ from (1), and plugging in the stepsize $\gamma$ that minimizes the right-hand side, shows the claim.
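As a quick numerical illustration of Theorem 4.1 (not part of the paper), the recursion $w_{t+1} = w_t - \gamma_t g(\tilde{w}_t)$ with $\gamma_t = 4/(\mu(t+2))$ and weights $\lambda_t = t+1$ can be simulated on a small strongly convex quadratic with magnitude pruning. The theorem then guarantees that the weighted suboptimality of the pruned iterates is bounded by the $3L$-weighted pruning-error term plus the vanishing $O(G^2/(\mu T))$ term; the toy problem and all constants below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, sparsity = 20, 20000, 0.5
A = np.diag(np.linspace(1.0, 10.0, d))      # Hessian of f: mu = 1, L = 10
mu, L = 1.0, 10.0
w_star = rng.normal(size=d)
f = lambda w: 0.5 * (w - w_star) @ A @ (w - w_star)   # f(w_star) = 0

def grad(w):
    """Stochastic gradient of f with additive noise."""
    return A @ (w - w_star) + 0.1 * rng.normal(size=d)

def prune(w):
    """Magnitude pruning: keep the largest-magnitude half of the coordinates."""
    keep = np.argsort(np.abs(w))[int(sparsity * d):]
    m = np.zeros(d)
    m[keep] = 1.0
    return m * w

w = np.zeros(d)
weighted_gap, weights_sum, error_term = 0.0, 0.0, 0.0
for t in range(T):
    w_tilde = prune(w)
    lam, gamma = t + 1.0, 4.0 / (mu * (t + 2))
    weighted_gap += lam * f(w_tilde)                     # lambda_t * (f(w~_t) - f*)
    error_term += lam * np.sum((w_tilde - w) ** 2)       # lambda_t * delta_t * ||w_t||^2
    weights_sum += lam
    w = w - gamma * grad(w_tilde)                        # DPF: dense w, gradient at pruned w

print("weighted avg suboptimality:", weighted_gap / weights_sum)
print("3L * weighted avg error   :", 3 * L * error_term / weights_sum)  # floor in Thm 4.1
```

In such a run, the first printed quantity stays below the second one plus the vanishing $8G^2/(\mu(T+2))$ term, consistent with the bound above.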