# Effective Model Sparsification by Scheduled Grow-and-Prune Methods

Published as a conference paper at ICLR 2022

Xiaolong Ma1*, Minghai Qin2*, Fei Sun2*, Zejiang Hou3, Kun Yuan2, Yi Xu4, Yanzhi Wang1, Yen-Kuang Chen2, Rong Jin2, Yuan Xie2
1 Northeastern University  2 DAMO Academy, Alibaba Group  3 Princeton University  4 Dalian University of Technology

ABSTRACT

Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy), but their excessive computation results in long inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer for pruning. The limitations of these algorithms reduce the level of achievable sparsity. In addition, many algorithms still require pre-trained dense models and thus suffer from a large memory footprint. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology that does not require pre-training a dense model. It addresses the shortcomings of previous works by repeatedly growing a subset of layers to dense and then pruning them back to sparse after some training. Experiments show that models pruned using the proposed methods match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, object detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) methods for model sparsification. As an example, a 90% non-uniform sparse ResNet-50 model obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the previous SOTA result by 1.5%. Code is available at: https://github.com/boone891214/GaP.

1 INTRODUCTION

Deep neural networks (DNNs) have achieved great performance in many real-world scenarios. However, the large computation and memory requirements of deep neural networks discourage them from being applied to broader applications on resource-limited devices. Model compression via weight pruning is a popular research topic, where a large proportion of weights are set to zero, leading to significant reductions in both memory and computation. The introduction of sparse tensor cores in the NVIDIA A100 GPU (NVIDIA, 2020b) brings weight pruning into the mainstream.

Early works on weight pruning generally follow a prune-from-dense methodology (Guo et al., 2016; Yu et al., 2018; Wen et al., 2016; Liu et al., 2019b; Pham et al., 2018; Real et al., 2019), which usually requires three phases of training: pre-train a dense model, prune it to sparse, and fine-tune it. In such methodologies, one-shot or iterative pruning from a well-trained DNN can only remove weights; it lacks the flexibility to grow back weights that are considered unimportant early in the training process but turn out to be significant later in training. On the other hand, early-stage pruning methods (Lee et al., 2019; Wang et al., 2020; You et al., 2020) avoid training dense models to convergence. However, their sparse masks (i.e., the binary matrices in which a zero value removes the corresponding weight entry from the model) are prematurely fixed, resulting in inferior model quality.
Methods based on sparse mask exploration, such as DeepR (Bellec et al., 2018), SET (Mocanu et al., 2018), DSR (Mostafa & Wang, 2019), and RigL (Evci et al., 2020), maintain the target sparsity in all layers throughout the training process and periodically explore a small fraction of the weights. The explored weights are chosen based on random or greedy heuristics, leading to limited coverage of sparse patterns and consequently sub-optimal model quality.

*Equal contribution. This work was partially done during Xiaolong Ma's internship and Yi Xu's working period at Alibaba Group.

Figure 1: Overview of the cyclic GaP (C-GaP) training method. We assume 4 partitions in a model are grown and pruned in a cyclic order. Only one partition is kept dense during training. After K steps, the dense partition is pruned and the whole model is fine-tuned to obtain the sparse model.

To overcome the shortcomings of the previous solutions, we propose a scheduled grow-and-prune (GaP) methodology. It does not require a dense model at any time during the training process, leading to a greatly reduced memory footprint. Meanwhile, the sparse mask of a layer is updated only after all weights in that layer have been explored, resulting in better mask-update efficiency. The method aims to obtain the best-quality sparse model under an inference runtime budget. Since a large model may be compressed below a target latency before being deployed in the wild to thousands of devices and inferenced billions of times per day (Hazelwood et al., 2018; Wu et al., 2019), spending a bit more time in training to find a better model is well rewarded.

The scheduled GaP methodology divides a DNN into several partitions composed of contiguous layers. During model training, one of the sparse partitions is grown to dense while the rest remain sparse. After some epochs of training, the previously dense partition is pruned to sparse, and an alternate partition is grown to dense. This process is repeated so that all partitions are grown to dense and pruned back to sparse multiple times. If the partitions are grown to dense one by one sequentially, we call it the cyclic GaP method (shown in Figure 1). If the model is replicated to multiple machines with different dense partitions, we call it the parallel GaP method.

The GaP methods are carefully scheduled to explore all weights efficiently, which is not guaranteed in the existing mask exploration methods. We illustrate the difference with an example in Figure 2. In the scheduled GaP methods, all weights to be grown (or to be pruned) belong to the same partition, and all weights in the model are explored once every partition has been grown to dense and pruned back to sparse. In contrast, in the random or greedy exploration methods, the grown and pruned weights are scattered across all layers, and there is no guarantee that all weights are explored; e.g., the connection between the 1st input neuron and the 3rd neuron of the first layer is never explored in Figure 2(b). Random sparse mask exploration and our scheduled GaP methods are analogous to sampling with and without replacement (Basu, 1958): the former requires far more samples to achieve the same weight coverage as the latter. Please see Appendix C for an illustrative example.
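To make the with/without-replacement analogy concrete, the toy simulation below (our own illustration, not part of the paper's code; the 10-weight setting mirrors the example in Appendix C) counts how many selection steps are needed before every weight has been explored at least once under the two schemes.

```python
# Toy simulation (our own illustration, mirroring the 10-weight example in
# Appendix C): how many selection steps are needed until every weight has been
# explored at least once, with replacement (random mask exploration) versus
# without replacement (a scheduled sweep over partitions).
import random

def steps_with_replacement(n_weights, rng):
    seen, steps = set(), 0
    while len(seen) < n_weights:
        seen.add(rng.randrange(n_weights))  # a random weight is picked; repeats are allowed
        steps += 1
    return steps

def steps_without_replacement(n_weights):
    return n_weights  # a scheduled sweep touches every weight exactly once

if __name__ == "__main__":
    rng, n = random.Random(0), 10
    trials = [steps_with_replacement(n, rng) for _ in range(10000)]
    print("without replacement:", steps_without_replacement(n), "steps")
    print(f"with replacement: {sum(trials) / len(trials):.1f} steps on average")
```

With 10 weights, the scheduled sweep needs exactly 10 steps, whereas random selection needs close to 30 on average, consistent with the discussion in Appendix C.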
The contribution of this paper is summarized as follows.
- We propose two GaP methods that do not require training a dense model at any time during the training process, reducing the training memory footprint.
- Sparse models obtained by the proposed GaP methods outperform the SOTA sparsification algorithms on a wide variety of deep learning tasks, including 2D image recognition, object detection, 3D object part segmentation, and machine translation. For all four tasks, the proposed GaP methods surprisingly generate sparse models that match or even exceed their dense counterparts at 80% sparsity.
- We provide theoretical guarantees for the GaP method and clarify its convergence property. In particular, we characterize the influence of the mask-induced error.

Figure 2: Comparison between (a) the scheduled mask exploration and (b) the random/greedy mask exploration. In (a), the weights to be grown (or pruned) are in the same layer and the growing phase covers all weights in that layer. In (b), they may be in different layers and the growing phase does not guarantee full coverage during training.

Algorithm 1: C-GaP training flow.
Input: an L-layer model with uninitialized weights Θ; pruning ratio r.
Output: an L-layer sparse model satisfying the target sparsity requirement.
1:  Initialize f(x; Θ ⊙ m) with random weights Θ and random masks m that satisfy the sparsity requirement.
2:  Divide all layers into κ partitions, denoted by S_i, i ∈ {0, 1, ..., κ−1}.
3:  step ← 0.
4:  while step < K do
5:      Partition to grow: i ← step mod κ; partition to prune: j ← (step − 1) mod κ.
6:      Prune Θ_{S_j} and update m_{S_j} by [Θ_{S_j}, m_{S_j}] ← ArgPruneTo(Θ_{S_j}, r).
7:      Grow Θ_{S_i} by m_{S_i} ← ArgGrowTo(Θ_{S_i}).
8:      Train the model f(x; Θ ⊙ m) for T epochs.
9:      step ← step + 1.
10: Denote the final dense partition by S_d.
11: Prune Θ_{S_d} and update m_{S_d} by [Θ_{S_d}, m_{S_d}] ← ArgPruneTo(Θ_{S_d}, r).
12: Fine-tune f(x; Θ ⊙ m) for T epochs.

2 METHODOLOGY

In this section, we describe the scheduled grow-and-prune (GaP) methods in detail. The process starts from a randomly initialized model. First, its weights are pruned randomly to reach the target sparsity, i.e., a random sparse mask is applied to the weights. Then, the model is divided into κ partitions and each partition is grown and pruned separately. The following two sub-sections describe the two variants of the GaP methods in detail.

2.1 CYCLIC GAP

As shown in Figure 1, the cyclic GaP (C-GaP) method rotates the dense partition among all κ partitions (κ = 4 in Figure 1). Starting from all partitions with random sparse masks, the first partition is grown to dense. All weights in the first partition, together with the weights that survive the masks in the remaining three partitions, are trained for a few epochs (Step 1). Then, the first partition is pruned to sparse (Step 2). Next, we apply the same strategy to the second partition, and continue iterating over all κ partitions. When all layers have been cyclically grown and pruned once after κ steps, we call it one round. This process is repeated for K steps in total. Algorithm 1 describes the process in more detail. In Line 1, we use f(x; Θ) to represent an L-layer deep neural network model, where Θ ∈ R^M are the M trainable weights and x are the training samples. A sparse model is denoted by f(x; Θ ⊙ m), where ⊙ denotes element-wise multiplication and m ∈ {0, 1}^M is a binary mask: a zero in m means that the corresponding weight in Θ is fixed to zero, and a one means that the corresponding weight is free to be updated.
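To make Algorithm 1 concrete, the PyTorch-style sketch below shows how the cyclic grow-and-prune loop could be implemented with per-parameter binary masks. It is our own illustration under stated assumptions: `magnitude_prune_`, `grow_`, and the user-supplied `train_fn` (which performs the T epochs of masked training) are hypothetical helpers, not the authors' released code.

```python
# PyTorch-style sketch of Algorithm 1 (C-GaP). This is our own illustration:
# helper names, the mask bookkeeping, and `train_fn` are assumptions, not the
# authors' released implementation.
import torch


def magnitude_prune_(params, masks, ratio):
    """ArgPruneTo: zero out the `ratio` fraction of smallest-magnitude weights in a partition."""
    flat = torch.cat([p.detach().abs().flatten() for p in params])
    k = int(ratio * flat.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(flat, k).values
    for p, m in zip(params, masks):
        m.copy_((p.detach().abs() > threshold).float())
        p.data.mul_(m)  # keep the weights consistent with the new mask


def grow_(masks):
    """ArgGrowTo: grow a partition to dense by setting its masks to all ones."""
    for m in masks:
        m.fill_(1.0)


def c_gap(model, partitions, train_fn, ratio=0.9, num_steps=8):
    """`partitions`: list of parameter lists (e.g. 4 groups of contiguous layers)."""
    # Line 1: random weights plus random masks at the target sparsity.
    masks = [[torch.ones_like(p) for p in part] for part in partitions]
    for part, mask in zip(partitions, masks):
        for p, m in zip(part, mask):
            m.bernoulli_(1.0 - ratio)
            p.data.mul_(m)

    kappa = len(partitions)
    for step in range(num_steps):                        # Lines 4-9
        i, j = step % kappa, (step - 1) % kappa
        magnitude_prune_(partitions[j], masks[j], ratio)  # prune the previously dense partition
        grow_(masks[i])                                   # grow the next partition to dense
        train_fn(model, masks)                            # train for T epochs; train_fn re-applies
                                                          # the masks after every optimizer step
    d = (num_steps - 1) % kappa                           # Lines 10-12
    magnitude_prune_(partitions[d], masks[d], ratio)      # prune the last dense partition
    train_fn(model, masks)                                # final fine-tuning
    return masks
```

In practice, the partition list would follow the model's natural stages (e.g., the four residual stages of ResNet-50), and `train_fn` multiplies the weights by their masks after each optimizer step so that pruned weights stay at zero.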
Figure 3: Overview of the parallel GaP (P-GaP) training method. We assume 4 training nodes are used for the 4 partitions, which are grown in parallel for T epochs. At the end of each step, the dense parts from the nodes are combined, followed by pruning back to a sparse model. After K steps, the model is fine-tuned.

In the C-GaP, all layers of the model {1, 2, ..., L} are divided into κ subsets S_0, S_1, ..., S_{κ−1}, where κ is an integer between 1 and L, as shown in Line 2. Let Θ_S and m_S denote a subset of the trainable weights and their masks. Consequently, Θ is divided into κ partitions, denoted by Θ_{S_0}, ..., Θ_{S_{κ−1}}; for the i-th partition Θ_{S_i}, we denote its mask by m_{S_i}. The grow-and-prune process is then iterated for K steps (Line 4). First, the indices of the partitions to grow and to prune are determined in a cyclic order (Line 5). Then, the currently dense partition Θ_{S_j} is pruned to sparse with pruning ratio r (Line 6). Meanwhile, the next sparse partition is grown to dense (Line 7): the mask m_{S_i} is set to all ones so that all weights in Θ_{S_i} are free to be updated during training. After that, the model is trained for T epochs (Line 8). Once the entire grow-and-prune process has finished, the final dense partition is pruned to sparse (Line 11) and the model is fine-tuned to obtain the final sparse model (Line 12).

2.2 PARALLEL GAP

In this section, we introduce the parallel GaP (P-GaP) as a flexible and highly efficient complement to the C-GaP sparse training method when sufficient training resources are available, as described in Figure 3. Unlike the C-GaP method, where partitions are grown and pruned serially on one machine, the P-GaP method distributes the model to κ machines. Each machine grows a different partition to dense and trains the model in parallel. Then, the trained dense partitions are combined into a dense model and pruned back to a sparse one. As a result, the P-GaP method uses κ× more machines, but its training time is also κ× shorter than that of the C-GaP method. Since each server trains the model independently, inter-server communication occurs only when the dense parts are combined to synchronize their learned weights, once every few epochs; its impact on performance is therefore negligible. Each pair of distribution and combination forms one round of P-GaP. Unlike C-GaP, one round of P-GaP consists of only one step. The detailed P-GaP method is shown in Algorithm 2 and illustrated in Figure 4(b) in Appendix D.

Some differences between the parallel GaP and conventional distributed training are listed as follows.
1. In P-GaP, data is communicated across server nodes only once every T epochs of training. This is far less frequent than conventional distributed training, which requires communication and synchronization in every mini-batch. Therefore, the required data bandwidth between the nodes remains minimal.
2. In P-GaP, each node can use the single-node hyper-parameters to train the sparse models. It does not require an excessively large batch size to fully utilize the computing resources.
3. Different from data parallelism, model parallelism, and pipeline parallelism, P-GaP explores another dimension of parallelism when training sparse models, which we call mask parallelism. This implies that masks in different partitions are less correlated and can be searched separately.

Table 1: Results of sparse ResNet-50 models on ImageNet. Dense baseline (NVIDIA, 2020a): top-1 accuracy 78.2%, 8.2e9 FLOPS.

| Method | Distribution | Total epochs | Top-1 Acc (%) @ 80% | FLOPS (e9) @ 80% | Top-1 Acc (%) @ 90% | FLOPS (e9) @ 90% |
|---|---|---|---|---|---|---|
| Prune-from-dense | uniform | 750/1250* | 77.1 | 1.7 | 75.8 | 0.8 |
| SNIP (Lee et al., 2019) | uniform | 100 | 72.0 | 1.7 | 67.2 | 0.8 |
| SET (Mocanu et al., 2018) | uniform | 100 | 72.9 | 1.7 | 69.6 | 0.8 |
| SET12×† | uniform | 1200 | 76.4 | 1.7 | 74.5 | 0.8 |
| RigL (Evci et al., 2020) | uniform | 100 | 74.6 | 1.7 | 72.0 | 0.8 |
| RigL† | uniform | 100 | 74.6 | 1.7 | 72.5 | 0.8 |
| RigL5× (Evci et al., 2020) | uniform | 500 | 76.6 | 1.7 | 75.7 | 0.8 |
| RigL5×† | uniform | 500 | 76.9 | 1.7 | 75.6 | 0.8 |
| RigL12×† | uniform | 1200 | 77.1 | 1.7 | 76.0 | 0.8 |
| C-GaP | uniform | 990 | 77.9 | 1.7 | 76.3 | 0.8 |
| P-GaP | uniform | 1110 | 77.5 | 1.7 | 76.1 | 0.8 |
| GMP (Gale et al., 2019) | non-uniform | 135 | 76.5 | n/a | 75.0 | n/a |
| SNIP (Lee et al., 2019) | non-uniform | 100 | 69.7 | 2.8 | 61.9 | 1.9 |
| GraSP (Wang et al., 2020) | non-uniform | 150 | 72.1 | 2.8 | 68.1 | 1.9 |
| DeepR (Bellec et al., 2018) | non-uniform | 100 | 71.7 | n/a | 70.2 | n/a |
| SET (Mocanu et al., 2018) | non-uniform | 100 | 72.6 | 1.7 | 70.4 | 0.8 |
| DSR (Mostafa & Wang, 2019) | non-uniform | 100 | 73.3 | 2.9 | 71.6 | 0.8 |
| DPF (Lin et al., 2020b) | non-uniform | 90 | 75.1 | n/a | n/a | n/a |
| RigL (Evci et al., 2020) | ERK | 100 | 75.1 | 3.4 | 73.0 | 2.0 |
| RigL† | ERK | 100 | 75.4 | 3.4 | 73.9 | 2.0 |
| RigL5× (Evci et al., 2020) | ERK | 500 | 77.1 | 3.4 | 76.4 | 2.0 |
| RigL5×† | ERK | 500 | 77.4 | 3.4 | 76.3 | 2.0 |
| RigL12×† | ERK | 1200 | 77.4 | 3.4 | 76.8 | 2.0 |
| C-GaP | non-uniform | 990 | 78.1 | 2.7 | 77.9 | 2.0 |

* 750 epochs for 80% sparsity and 1250 epochs for 90% sparsity (please see Appendix B.1 for details).
† Our implementation (please refer to Appendix B.2 for implementation details).
Following the convention in Evci et al. (2020), a multiplication and an addition are counted as two operations.

3 EXPERIMENTAL RESULTS

This section evaluates the proposed C-GaP and P-GaP sparse training methods on different machine learning tasks. Since our goal is to search for the optimal sparse models to speed up inference, we report the best model accuracy obtained using the GaP methods for all models. We compare our GaP methods with the state of the art, as well as with our own prune-from-dense results. Unless otherwise specified, all prune-from-dense baselines are implemented using the magnitude-based iterative pruning schedule described in GMP (Zhu & Gupta, 2017; Gale et al., 2019). More ablation experiments on the number of partitions, the partition scheme, different sparsity granularities with GaP, and FC-layer pruning are shown in Appendix A.

All of our experiments are trained and evaluated using PyTorch on machines with 8 NVIDIA V100 GPUs. The cyclic GaP is trained on one training node and the parallel GaP on κ training nodes, where κ is the number of model partitions. In the pruning stage of the GaP methods, the weights with the lowest magnitudes are pruned to zero. Note that in our experiments, only the weights used for convolutions, matrix-vector, and matrix-matrix multiplications are made sparse; the weights related to biases and batch normalization are kept dense, as they are critical for model quality but contribute only a tiny fraction of the inference FLOPs. For uniform sparsity, we prune each layer separately: we sort all weights in the layer by magnitude and prune away any weight whose magnitude falls below the percentile given by the sparsity level, so all layers receive the same sparsity. For non-uniform sparsity, the pruning process is similar, but the weights of the entire model are sorted together; the sparsity level may therefore differ across layers, with more important layers pruned less than trivial ones. We list the implementation details and hyper-parameter settings for all tasks in Appendix B.
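The two pruning modes above differ only in where the magnitude threshold is computed: per layer for uniform sparsity and over the whole model for non-uniform sparsity. The sketch below is a minimal illustration under our own assumptions (a plain dict of weight tensors), not the exact code used for the experiments.

```python
# Sketch of the two magnitude-pruning modes described above (our own minimal
# illustration using a plain dict of weight tensors, not the experiment code).
import torch

def uniform_masks(weights, sparsity):
    """Uniform sparsity: one threshold per layer, so every layer gets the same sparsity."""
    masks = {}
    for name, w in weights.items():
        k = max(int(sparsity * w.numel()), 1)
        thresh = torch.kthvalue(w.abs().flatten(), k).values
        masks[name] = (w.abs() > thresh).float()
    return masks

def nonuniform_masks(weights, sparsity):
    """Non-uniform sparsity: one global threshold, so per-layer sparsity may differ."""
    all_mags = torch.cat([w.abs().flatten() for w in weights.values()])
    k = max(int(sparsity * all_mags.numel()), 1)
    thresh = torch.kthvalue(all_mags, k).values
    return {name: (w.abs() > thresh).float() for name, w in weights.items()}

if __name__ == "__main__":
    # conv1 is given deliberately larger weights to make the non-uniform allocation visible.
    weights = {"conv1": 3 * torch.randn(64, 3, 7, 7), "fc": torch.randn(1000, 2048)}
    for name, m in nonuniform_masks(weights, 0.8).items():
        print(name, f"layer sparsity = {1 - m.mean().item():.2f}")
```

Because the global threshold is shared, the layer with larger-magnitude weights (`conv1` in this toy) is pruned less than `fc`, which is the behavior described above for non-uniform sparsity.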
3.1 IMAGE CLASSIFICATION: RESNET-50 ON IMAGENET

Table 1 compares the accuracy of the GaP methods with previous works. ResNet-50 is divided into four partitions at the last three downsampling stages. For non-uniform sparsity, all layers are sparsified. For uniform sparsity, the first convolutional layer with 7×7 kernels is kept dense, the same as in Evci et al. (2020). The fully connected (FC) layer only contributes 0.5% of the FLOPs, so it is also kept dense (a sparse FC layer yields similar accuracy to a dense one; please see the ablation results in Appendix A).

To ensure a fair comparison, we add the following experiments on top of the originally reported baseline results. We include the SNIP (Lee et al., 2019) and SET (Mocanu et al., 2018) results at uniform sparsity that are reproduced in Evci et al. (2020). We also implement RigL and RigL5× (Evci et al., 2020) in our PyTorch code base (please refer to Appendix B.2 for details); in Table 1, our reproduction achieves equivalent or higher accuracy than that reported in the paper. Based on that, we extend the RigL training time by 12× to match the GaP training budget. Additionally, since SET uses a method similar to RigL, we also implement SET and extend its training by 12×.

We observe that the improvement over RigL5× (Evci et al., 2020) is 1.3% (77.9% vs. 76.6%) at 80% uniform sparsity and 0.6% at 90% uniform sparsity. The C-GaP is also better than our re-implementation of the prune-from-dense method using ADMM (Ren et al., 2019). For the non-uniform sparse ResNet-50 models, the improvement over RigL5× (Evci et al., 2020) is 1.0% and 1.5% at 80% and 90% sparsity, respectively. Note that the 80% sparse ResNet-50 model almost matches the accuracy of the dense model reported by NVIDIA (2020a). We also observe that, even when we extend the training time of SOTA methods such as SET and RigL to match the GaP training effort, their accuracy still lags behind. As an example, the model at 80% non-uniform sparsity trained using the C-GaP method outperforms our re-implementation of the ERK RigL12× model with a 0.7% accuracy increase and a 21% FLOPS reduction (2.7 GFLOPS vs. 3.4 GFLOPS). The 90% non-uniform sparsity model achieves a 1.1% accuracy increase at the same FLOPS.

Comparing the C-GaP and P-GaP methods under comparable numbers of training epochs, we find that C-GaP achieves 0.4% and 0.2% higher accuracy at 80% and 90% uniform sparsity, respectively. We conjecture that this is because P-GaP converges more slowly than C-GaP: in C-GaP, the weights in the sparse partitions continue to be trained while the dense partition is explored, whereas P-GaP keeps only the dense partitions and discards the trained sparse partitions when combining them into the updated model.
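As a concrete picture of the four-way split used above, the snippet below groups torchvision's ResNet-50 parameters by residual stage. It is only an illustration: the exact grouping of the stem, FC layer, and stage boundaries is our assumption, and in the actual experiments only convolution and FC weight tensors inside each partition would be masked.

```python
# Illustrative 4-way partition of ResNet-50 along its residual stages (our own
# grouping; the paper splits at the last three downsampling stages, and for
# uniform sparsity it keeps conv1 and the FC layer dense).
import torchvision

model = torchvision.models.resnet50()

partitions = [
    list(model.conv1.parameters()) + list(model.bn1.parameters()) + list(model.layer1.parameters()),
    list(model.layer2.parameters()),
    list(model.layer3.parameters()),
    list(model.layer4.parameters()) + list(model.fc.parameters()),
]
for i, part in enumerate(partitions):
    n = sum(p.numel() for p in part)
    print(f"partition {i}: {n / 1e6:.2f}M parameters")
# Biases and batch-norm parameters stay dense; only convolution and FC weight
# tensors would actually be masked during grow-and-prune.
```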
3.2 OBJECT DETECTION: SSD ON COCO-2017

We divide the SSD network into 4 partitions, with three in the backbone and one in the detection head. We train the model using the cyclic-partitioned C-GaP with a uniform weight sparsity of 90% on all layers except for the last output layer, which is kept dense. We report the accuracy of the best model found within 40 GaP steps. As shown in Table 2, in the mean average precision category mAP@[.5:.95], the sparse model obtained using our C-GaP method exceeds the dense model by 0.7 mAP (25.9 vs. 25.2). It also exceeds the best sparse model iteratively pruned from the dense model by 1.6 mAP (25.9 vs. 24.3).

Table 2: Results of sparse SSD models for object detection on COCO-2017.

| Method | Sparsity ratio | AP@[.5:.95] | AP@.5 | AP@.75 | AP (S) | AP (M) | AP (L) | AR@1 | AR@10 | AR@100 | AR (S) | AR (M) | AR (L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense (NVIDIA, 2020a) | 0 | 25.2 | 42.7 | 25.8 | 7.3 | 27.1 | 40.8 | 23.8 | 34.5 | 36.1 | 11.7 | 39.6 | 56.1 |
| Prune-from-dense | 90% | 24.3 | 41.1 | 24.8 | 6.8 | 26.3 | 40.0 | 23.3 | 34.0 | 35.8 | 11.1 | 39.4 | 55.7 |
| C-GaP | 90% | 25.9 | 42.3 | 26.9 | 8.0 | 28.1 | 42.6 | 24.7 | 35.9 | 37.8 | 12.7 | 41.5 | 58.5 |

3.3 3D OBJECT PART SEGMENTATION: POINTNET++ ON SHAPENET

We divide the PointNet++ model into 4 partitions, with three in the backbone and one in the segmentation head. We apply the C-GaP and P-GaP methods with uniform sparsity of 80% and 90% on all layers. We report the accuracy of the best model found within 40 GaP steps. Table 3 compares them with the dense model and with the best sparse model pruned from the dense model using the ADMM algorithm. The results show that, in both the class and instance mIoU categories, the model pruned using the C-GaP method easily beats the model pruned from dense at both sparsities; it even beats the dense model at 80% sparsity. The model pruned using the P-GaP method also beats the model pruned from dense at 80% sparsity and is not far behind at 90% sparsity.

Table 3: Results of sparse PointNet++ models for 3D part segmentation on ShapeNet. Dense baseline (Yan, 2019), 2k points: class mIoU 82.5%, instance mIoU 85.4%.

| Method | #Points | Class mIoU (%) @ 80% | Instance mIoU (%) @ 80% | Class mIoU (%) @ 90% | Instance mIoU (%) @ 90% |
|---|---|---|---|---|---|
| Prune-from-dense | 2k | 79.1 | 84.5 | 77.1 | 84.0 |
| C-GaP | 2k | 82.9 | 85.8 | 79.5 | 85.1 |
| P-GaP | 2k | 80.8 | 85.5 | 74.0 | 83.7 |

3.4 TRANSLATION: TRANSFORMER ON THE WMT-14 EN-DE DATASET

In this part, we evaluate the C-GaP and P-GaP methods on the translation task based on the Transformer (Vaswani et al., 2017). We train the Transformer models on the WMT-14 En-De dataset and evaluate the SacreBLEU score (Post, 2018). We divide the Transformer models equally into three or six partitions, with the decoder containing twice as many partitions as the encoder. We apply the C-GaP and P-GaP methods with uniform sparsity of 80% and 90% on all layers. We report the best SacreBLEU scores on the validation dataset within 30 GaP steps in Table 4. The models trained using both the C-GaP and P-GaP methods significantly improve over the sparse models obtained by pruning from the dense counterparts. They exceed the dense model quality at 80% sparsity, and the three-partition C-GaP model even outperforms the dense model at 90% sparsity.
Table 4: Results of sparse Transformer models for the translation task on WMT-14 En-De (SacreBLEU). Dense baseline (NVIDIA, 2020a): SacreBLEU 27.6. Prune-from-dense does not use partitions, so a single score is reported per sparsity ratio.

| Method | 80% sparsity, κ = 3 | 80% sparsity, κ = 6 | 90% sparsity, κ = 3 | 90% sparsity, κ = 6 |
|---|---|---|---|---|
| Prune-from-dense | 27.1 | | 25.7 | |
| C-GaP | 27.6 | 27.6 | 27.7 | 27.1 |
| P-GaP | 27.9 | 27.7 | 27.3 | 26.9 |

4 THEORETICAL JUSTIFICATION

We now provide a convergence guarantee for the cyclic-partitioned C-GaP algorithm. Let F(Θ) = E_{x∼D} f(x; Θ) be the loss function of the deep learning task, where D is the data distribution. In addition, let m_q be the mask at the start of the q-th round, Θ_q be the learned model weights after q − 1 rounds, and Θ_q^{(i)}, i = 1, ..., κ, be the learned weights in the q-th round after the i-th partition has been grown and trained for T epochs. Proposition 1 shows that C-GaP converges to a neighborhood around a stationary solution at rate O(1/√Q) when the learning rate is set properly. Due to space limitations, we put the proof in Appendix E.

Proposition 1 (C-GaP convergence). Suppose the loss function F(Θ) is partition-wise L-smooth, the sampled stochastic gradient is unbiased with bounded variance, and the relative error introduced by each mask is bounded by δ² ∈ [0, 1), i.e., ‖Θ_q − Θ_q ⊙ m_q‖² ≤ δ²‖Θ_q‖². If the learning rate is γ = 1/(4κLT√Q), the sparse models generated by the C-GaP method after Q rounds converge as follows:

$$\frac{1}{Q}\sum_{q=1}^{Q} \mathbb{E}\,\big\|\nabla F(\Theta_q \odot m_q)\big\|^2 = O\left(\frac{G}{\sqrt{Q}} + \frac{\delta^2}{Q}\sum_{q=1}^{Q}\sum_{i=1}^{\kappa} \mathbb{E}\,\big\|\Theta_q^{(i)}\big\|^2\right) \tag{1}$$

where G is a constant related to the gradient noise and the initialized model weights. We make several remarks on Proposition 1 as follows.

Remark 1. If there is no pruning in the training process, then δ² = 0. Substituting this into Equation 1, we find that C-GaP converges exactly to a stationary solution, i.e., E‖∇F(Θ_Q)‖² → 0 as Q increases to infinity. This implies that C-GaP is an effective algorithm even for dense training.

Remark 2. When a sparse mask is utilized, δ² ≠ 0. In this scenario, the mask-induced error inevitably influences the accuracy of the trained model: the smaller the mask-induced error, the more accurate the model that C-GaP can converge to.

Remark 3. In our proofs, we find that cycling through all partitions, so that all weights are trained in every round, is critical to guaranteeing the convergence of C-GaP. Note that previous works update weights either greedily (e.g., RigL (Evci et al., 2020)) or randomly (e.g., SET (Mocanu et al., 2018) and DSR (Mostafa & Wang, 2019)); it may take numerous training steps for these algorithms to explore and update every weight. This may explain why C-GaP achieves better accuracy than RigL, SET, and DSR (see Table 1). This intuition is consistent with (Gürbüzbalaban et al., 2019; Ying et al., 2018), which theoretically establish that sampling data cyclically converges faster than sampling uniformly at random in SGD training.

5 RELATED WORKS

Pruning from pre-trained dense models: Among various pruning algorithms, one-shot pruning (LeCun et al., 1990; Xu et al., 2021) zeroes out a given percentage of the trained weights with the smallest magnitudes. Building on magnitude-based one-shot pruning, gradual magnitude pruning (GMP) (Zhu & Gupta, 2017; Gale et al., 2019) proposes an iterative method that progressively prunes models to the target pruning ratio. However, those methods suffer from a significant accuracy drop and relatively long training time.
Besides magnitude-based pruning, other approaches (Wen et al., 2016; He et al., 2017; Zhang et al., 2021a) adapt mathematics-oriented regularization-based algorithms to generate sparsity. Later works (Zhang et al., 2018b; Ren et al., 2019; Ma et al., 2020a;b; Niu et al., 2020; 2021; Rumi et al., 2020; Ma et al., 2021a; Zhang et al., 2021b; Huang et al., 2021) utilizes dynamic regularization penalty such as ADMM to solve the pruning problem as an optimization problem and maintaining high accuracy. Several methods (Zhou et al., 2021; Srinivas et al., 2017; Molchanov et al., 2017) work in probability space. Other works such as HRank (Lin et al., 2020a), SCOP (Tang et al., 2020), DMCP (Guo et al., 2020), Meta Pruning (Liu et al., 2019a) use complicated rules to generate the sparsity distribution in the channel level. Since pre-training a dense model from which to select critical weights consumes large memory space, our work differs substantially from them such that we do not rely on a pre-trained dense model. Pruning at an early stage: A new trend of exploring sparsity at an early stage (Lee et al., 2019; Wang et al., 2020; Tanaka et al., 2020; Wimmer et al., 2020; van Amersfoort et al., 2020; Ma et al., 2021b; Liu et al., 2021) has emerged to embrace the promising sparse training paradigm. SNIP (Lee et al., 2019) finds the sparse masks based on the saliency score of each weight that is obtained after training the dense model for only a few iterations. Gra SP (Wang et al., 2020) prunes weights based on preserving the gradient flow in the model. Early Bird (You et al., 2020) conducts a retraining process after searching sub-network topology for a few epochs. Prune Train (Lym et al., 2019) integrates structured pruning in the pretraining process. However, pruning at an early stage fails to achieve acceptable accuracy on Image Net (Deng et al., 2009). We recognize the underlying reason is that the single-shot pruning criterion is based on the information from a few mini-batch of training samples, which cannot accurately represent the rich features of large-scale dataset. The proposed Ga P methods repeatedly update the sparse masks and hence overcome their drawbacks. Sparse mask exploration: To satisfy the condition of low computation cost as well as low memory footprint, many works have incorporated the training-from-sparse-initialization paradigm, and optimize the network by sparse mask exploration. Deep R (Bellec et al., 2018) proposes to train a sparse network at initialization, and dynamically adjust the sparse topology by pruning the weights that flip the sign and randomly growing new weights during training. However, Deep R is primarily demonstrated with sparsification of fully-connected layers and applied to relatively small and shallow networks. Sparse Evolutionary Training (SET) (Mocanu et al., 2018) and MEST (Yuan et al., 2021a) uses layer-wise magnitude-based pruning and random growth at the end of each training epoch. Published as a conference paper at ICLR 2022 DSR (Mostafa & Wang, 2019) and STR (Kusupati et al., 2020) designs a dynamic reparameterization method that allows weights to be re-distributed across layers by providing an effective and trainable way to allocate the sparsity across all layers. Similarly, DNW (Wortsman et al., 2019) trains a sparse model and keeps dense gradients to dynamically adjust the network connections. SNFS (Dettmers & Zettlemoyer, 2019) develops sparse momentum to identify the importance of weights. 
Rig L (Evci et al., 2020) and Ne ST (Dai et al., 2019) propose to use magnitude-based pruning and gradientflow-based growth that update sparse model topology during training. Yuan et al. (2021b) uses a continuous relaxation and optimization method to dynamically update the structural masks of DNN models. All of the mentioned works update the sparse masks based on heuristics and there is little or no training at the mask update stage. Our work differs from them in that the mask updating rule of the Ga P methods is more optimized by training a subset of layers into dense and then pruned. 6 SOCIETAL IMPACT AND LIMITATIONS Societal Impact. The scheduled Ga P methods target to obtain the best quality of the pruned model under a runtime budget. Based on the researches in Facebook (Hazelwood et al., 2018; Wu et al., 2019), many models are inferenced many billions of times per day, which makes the cost of pruning such models a tiny fraction of the inference cost in one day. Thus, a better pruning algorithm resulting to a smaller model may save a lot more computation on the inference side, which is of high value for the community and society to achieve Green AI (Schwartz et al., 2020). On the other hand, our method cannot prevent the possible malicious usage, which may cause negative societal impact. Limitations. In this work, we focus on the unstructured sparsity to demonstrate the algorithm-level innovations of our proposed Ga P methods, and we only preliminarily include a block sparsity example in Table 9 of Appendix A.3 to demonstrate its applicability to structured sparsity and its potential to improve the inference speed. More detailed analysis is still a work in progress. The scheduled Ga P methods require more training time than conventional pruning methodology, which may potentially limit its applicability when the training compute resources are limited. Thus, the scheduled Ga P methods are mostly beneficial to models whose benefit of the reduced inference time and/or improved accuracy out weigh the cost of the moderately longer training time. The key hyper-parameter settings such as partition numbers or partition strategies are determined heuristically. In this work, we try to divide the models to partitions with similar parameter counts or computation. We also intuitively group consecutive layers to the same partition, based on our hypothesis that adjacent layers are tighter correlated than more distant layers. Our analysis in Proposition 1 cannot be directly extended to cover P-Ga P since the analysis is built on the partition-wise cyclic updating structure of C-Ga P. We will leave the analysis for P-Ga P as a future work. The prediction results from the sparse models are generally biased comparing with the dense models they are pruned from (Hooker et al., 2019). This is a general problem on sparse model architectures, and our Ga P methodology cannot reduce such biases. 7 CONCLUSIONS Sparsity based model compression is gaining traction in research and industry to reduce the memory footprint and inference runtime. However, conventional compression approaches result to noticeable accuracy loss during the pruning process. In this paper, we propose a scheduled grow-and-prune (Ga P) methodology to obtain sparse models. It divides a model into several partitions, grows and prunes each partition cyclically or in parallel. The experimental results show that this method preserves the model quality far better than previous works. 
In all four experimented tasks, the Ga P method matches or even beats the dense solution at 80% sparsity. In addition, the Ga P method does not require a pre-trained dense model and can continuously searching sparsity masks with new data, showing its flexibility and practical potential. Published as a conference paper at ICLR 2022 8 ACKNOWLEDGEMENT This work was supported by Alibaba Group through Alibaba Research Intern Program, and was partly supported by the National Science Foundation CCF-1937500. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. D Basu. On sampling with and without replacement. In Sankhy a: The Indian Journal of Statistics, pp. 287 294, 1958. Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Xiaoliang Dai, Hongxu Yin, and Niraj K Jha. Nest: A neural network synthesis tool based on a grow-and-prune paradigm. In Proceedings of the IEEE Transactions on Computers, volume 68, pp. 1487 1497, 2019. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248 255, 2009. Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), 2019. Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2943 2952, 2020. Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ar Xiv preprint ar Xiv:1902.09574, 2019. Shaopeng Guo, Yujie Wang, Quanquan Li, and Junjie Yan. DMCP: Differentiable Markov channel pruning for neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1539 1547, 2020. Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), pp. 1379 1387, 2016. Mert G urb uzbalaban, Asu Ozdaglar, and Pablo A Parrilo. Why random reshuffling beats stochastic gradient descent. In Proceedings of the Mathematical Programming, pp. 1 36, 2019. Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. Applied machine learning at facebook: A datacenter infrastructure perspective. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620 629, 2018. Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1389 1397, 2017. Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, and Andrea Frome. What do compressed deep neural networks forget? ar Xiv preprint ar Xiv:1911.05248, 2019. Shaoyi Huang, Dongkuan Xu, Ian EH Yen, Sung-en Chang, Bingbing Li, Shiyang Chen, Mimi Xie, Hang Liu, and Caiwen Ding. Sparse progressive distillation: Resolving overfitting under pretrain-and-finetune paradigm. ar Xiv preprint ar Xiv:2110.08190, 2021. 
Published as a conference paper at ICLR 2022 Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning (ICML), pp. 5544 5555, 2020. Yann Le Cun, John S Denker, and Sara A Solla. Optimal brain damage. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), pp. 598 605, 1990. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In Proceedings of the International Conference on Learning Representations (ICLR), 2019. Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1529 1538, 2020a. Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In Proceedings of the International Conference on Learning Representations (ICLR), 2020b. Ning Liu, Geng Yuan, Zhengping Che, Xuan Shen, Xiaolong Ma, Qing Jin, Jian Ren, Jian Tang, Sijia Liu, and Yanzhi Wang. Lottery ticket preserves weight correlation: Is it desirable or not? In International Conference on Machine Learning (ICML), pp. 7011 7020. PMLR, 2021. Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Kwang-Ting Cheng, and Jian Sun. Meta Pruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3296 3305, 2019a. Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019b. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. Prune Train: fast neural network training by dynamic sparse model reconfiguration. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1 13, 2019. Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pp. 5117 5124, 2020a. Xiaolong Ma, Wei Niu, Tianyun Zhang, Sijia Liu, Sheng Lin, Hongjia Li, Wujie Wen, Xiang Chen, Jian Tang, Kaisheng Ma, et al. An image enhancing pattern-based sparsity for real-time inference on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pp. 629 645. Springer, 2020b. Xiaolong Ma, Sheng Lin, Shaokai Ye, Zhezhi He, Linfeng Zhang, Geng Yuan, Sia Huat Tan, Zhengang Li, Deliang Fan, Xuehai Qian, et al. Non-structured dnn weight pruning is it beneficial in any platform? IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021a. Xiaolong Ma, Geng Yuan, Xuan Shen, Tianlong Chen, Xuxi Chen, Xiaohan Chen, Ning Liu, Minghai Qin, Sijia Liu, Zhangyang Wang, et al. Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? 
Advances in Neural Information Processing Systems (Neur IPS), 34, 2021b. Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. In Proceedings of the Nature Communications, volume 9, pp. 1 12, 2018. Published as a conference paper at ICLR 2022 Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2498 2507, 2017. Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4646 4655, 2019. Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. Patdnn: Achieving real-time dnn execution on mobile devices with pattern-based weight pruning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 907 922, 2020. Wei Niu, Zhengang Li, Xiaolong Ma, Peiyan Dong, Gang Zhou, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren. Grim: A general, real-time deep learning inference framework for mobile devices based on fine-grained structured weight sparsity. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021. NVIDIA. https://github.com/NVIDIA/Deep Learning Examples, 2020a. NVIDIA. NVIDIA A100 tensor core GPU architecture. Technical report, 2020b. Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4095 4104, 2018. Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186 191, 2018. Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. ar Xiv preprint ar Xiv:1706.02413, 2017. Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pp. 4780 4789, 2019. Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 925 938, 2019. Masuma Akter Rumi, Xiaolong Ma, Yanzhi Wang, and Peng Jiang. Accelerating sparse cnn inference on gpus with performance-aware weight pruning. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 267 278, 2020. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. In Proceedings of the Communications of the ACM, volume 63, pp. 54 63, 2020. Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition workshops (CVPR workshop), pp. 138 145, 2017. Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. 
Pruning neural networks without any data by iteratively conserving synaptic flow. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), volume 33, 2020. Yehui Tang, Yunhe Wang, Yixing Xu, Dacheng Tao, Chunjing Xu, Chao Xu, and Chang Xu. Scop: Scientific control for reliable neural network pruning. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), 2020. Joost van Amersfoort, Milad Alizadeh, Sebastian Farquhar, Nicholas Lane, and Yarin Gal. Single shot structured pruning before training. ar Xiv preprint ar Xiv:2007.00389, 2020. Published as a conference paper at ICLR 2022 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), pp. 5998 6008, 2017. Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), pp. 2074 2082, 2016. Paul Wimmer, Jens Mehnert, and Alexandru Condurache. Freezenet: Full performance by reduced storage costs. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2020. Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. Discovering neural wirings. In Proceedings of the Advances in Neural Information Processing Systems (Neur IPS), volume 32, pp. 2684 2694, 2019. Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, et al. Machine learning at facebook: Understanding inference at the edge. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331 344, 2019. Dongkuan Xu, Ian EH Yen, Jinxi Zhao, and Zhibin Xiao. Rethinking network pruning under the pre-train and fine-tune paradigm. NAACL, 2021. Xu Yan. Pointnet/pointnet++ pytorch. https://github.com/yanx27/Pointnet Pointnet2 pytorch, 2019. Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3D shape collections. In Proceedings of the ACM Transactions on Graphics (To G), volume 35, pp. 1 12, 2016. Bicheng Ying, Kun Yuan, Stefan Vlaski, and Ali H Sayed. Stochastic learning under random reshuffling with constant step-sizes. In Proceedings of the IEEE Transactions on Signal Processing, volume 67, pp. 474 489, 2018. Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Yingyan Lin, Zhangyang Wang, and Richard G. Baraniuk. Drawing early-bird tickets: Toward more efficient training of deep networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching Yung Lin, and Larry S Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9194 9203, 2018. Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, et al. 
Mest: Accurate and fast memory-economic sparse training framework on the edge. Advances in Neural Information Processing Systems (Neur IPS), 34, 2021a. Xin Yuan, Pedro Henrique Pamplona Savarese, and Michael Maire. Growing efficient deep networks by structured continuous sparsification. In Proceedings of the International Conference on Learning Representations (ICLR), 2021b. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2018a. Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184 199, 2018b. Published as a conference paper at ICLR 2022 Tianyun Zhang, Xiaolong Ma, Zheng Zhan, Shanglin Zhou, Caiwen Ding, Makan Fardad, and Yanzhi Wang. A unified dnn weight pruning framework using reweighted optimization methods. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pp. 493 498. IEEE, 2021a. Tianyun Zhang, Shaokai Ye, Xiaoyu Feng, Xiaolong Ma, Kaiqi Zhang, Zhengang Li, Jian Tang, Sijia Liu, Xue Lin, Yongpan Liu, et al. Structadmm: Achieving ultrahigh efficiency in structured pruning for dnns. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2021b. Xiao Zhou, Weizhong Zhang, Hang Xu, and Tong Zhang. Effective sparsification of neural networks with global sparsity constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3599 3608, 2021. Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. ar Xiv preprint ar Xiv:1710.01878, 2017. Published as a conference paper at ICLR 2022 A ABLATION STUDY In this section, we present some ablation study on the proposed Ga P methods. A.1 NUMBER OF PARTITIONS FOR GAP The number of partitions in Ga P methods play a key role in obtaining better results. As compared in Table 5 and Table 6, one partition is not sufficient to preserve the best quality. Too many partitions may reduce quality as well (as shown in the model with six partitions in Table 6). Empirically, a small number of equal partitions that naturally follow the structures of the models usually perform well. For example, Res Net-50 model is composed of four blocks, each sub-samples from the previous block. Thus a four-partition Ga P is a natural choice. Transformer model contains an encoder and a decoder, with the decoder twice the number of attention blocks as the encoder. It is naturally divided to three partitions. Table 5: The effect of different number of partitions for Res Net-50 models on Image Net. Method Sparsity Ratio Distribution Num. Partitions κ Top-1 Acc (%) C-Ga P 90% uniform 1 75.9 C-Ga P 90% uniform 4 76.3 Table 6: The effect of different number of partitions for Transformer models on WMT-14 En-De. Method Sparsity Ratio Distribution Num. Partitions κ Sacre BLEU C-Ga P 90% uniform 1 26.8 C-Ga P 90% uniform 3 27.7 C-Ga P 90% uniform 6 27.1 A.2 PARTITION THE MODEL IN A RANDOM MANNER In the main text, each partition of the model consists of contiguous layers and they are cyclically grown and pruned. We also explore a random partition strategy for each step in the C-Ga P for sparse Res Net-50 and Transformer models. 
Unlike the cyclic partition, a random partition chooses a uniformly random subset of layers to grow in each step of the Ga P process. On the results for Res Net-50 described in Table 7, the random partition is slightly worse than the cyclic partition but still better than all previous existing solutions. On the results for Transformer described in Table 8, however, the accuracy gap is much larger. At 90% sparsity, Transformer model can achieve 27.7 Sacre BLEU score in uniform sparsity with 3 partitions using C-Ga P. When a random 3-partition is applied to a 90% sparsity Transformer, the Sacre BLEU score is reduced to 27.0 and 26.9 for uniform and non-uniform sparsity, respectively. We conjecture that different model structures may result to different accuracy gaps. For Res Net-50, the correlations between neighboring layers may be small so each layer may be optimized independently. However, for Transformer, the inter-layer weight correlation is stronger (e.g., the Wkey, Wquery, Wvalue in the same attention block have very strong correlation since they process the same input sequence for the self-attention computation) This analysis shows that the partition mechanism affects the final model accuracy. Our scheduled Ga P methods minimize the chance that the neighboring layers belong to different partitions, and have achieved superior accuracy. Table 7: Different partition strategies for Res Net-50 models on Image Net. Method Sparsity Ratio Distribution Num. Partitions Partition strategy Top-1 Acc (%) C-Ga P 90% non-uniform 4 Cyclic 77.9 C-Ga P 90% non-uniform 4 Random 77.8 Published as a conference paper at ICLR 2022 Table 8: Different partition strategies for Transformer models on WMT-14 En-De. Method Sparsity Ratio Distribution Num. Partitions Partition strategy Sacre BLEU C-Ga P 90% uniform 3 Cyclic 27.7 C-Ga P 90% uniform 3 Random 27.0 C-Ga P 90% non-uniform 3 Random 26.9 A.3 GENERALIZATION TO OTHER SPARSE GRANULARITY We use unstructured sparsity throughout the paper. In this section, we demonstrate that the Ga P method is also applicable to other coarse-grained sparse granularity such as block-wise sparsity. The unit for the block-wise sparsity in this experiment is 1 8, i.e., if the weight matrix is viewed as many 1 8 blocks, then the sparsity ratio of the matrix is defined as the ratio of all-zero blocks. It is often believed that the accuracy would be impacted severely when the block size is 8 or greater. Table 9 compares the Sacre Bleu scores of the sparse Transformer models obtained by applying the 3-partition cyclic C-Ga P and the prune-from-dense methods. It is observed that the Ga P method improves over the prune-from-dense method significantly (27.21 vs. 26.21). This further validates the benefits of mask updating strategy brought by the Ga P method. Table 9: Different sparsifying methods for the Transformer models with structured sparsity on WMT14 En-De. Method Sparsity Ratio sparse granularity Num. Partitions κ BLEU prune-from-dense 80% block 1 8 - 26.1 C-Ga P 80% block 1 8 3 27.2 A.4 PRUNING THE FULLY CONNECTED LAYER IN THE RESNET-50 MODEL In the main context, the last fully connected (FC) layer of the Res Net-50 model is kept dense in the uniform sparsity distribution since it contributes to less than 1% of the FLOPs. In Table 10, we perform additional experiments to supplement Table 1 by pruning the last FC layer using the C-Ga P method and comparing them with Evci et al. (2020). 
The results in Table 10 and Table 1 indicate that the accuracy of the Res Net-50 models with sparse and dense FC layers are similar, and both of them outperform the state-of-the-art results in Evci et al. (2020). Please note that models with the non-uniform sparsifying distribution in Table 1 already have the last FC layer pruned, thus the experiment setup is the same as the ones in Evci et al. (2020). Table 10: Results of sparse Res Net-50 models on Image Net. Method Distribution Sparsity ratio: 80% Sparsity ratio: 90% Top-1 Acc (%) FLOPS ( e9) Top-1 Acc (%) FLOPS ( e9) Rig L Evci et al. (2020) uniform 74.6 1.7 72.0 0.8 Rig L5 Evci et al. (2020) uniform 76.6 1.7 75.7 0.8 C-Ga P (dense FC) uniform 77.9 1.7 76.3 0.8 C-Ga P (sparse FC) uniform 77.8 1.7 76.4 0.8 B EXPERIMENT SETUP Most training scripts and hyper-parameters are inherited from NVIDIA (2020a) and Yan (2019). Identical hyper-parameters are applied to each Ga P step and the final fine-tune stage. For example, the initial learning rate of each step is restored to the same value. This hyper-parameter restoration avoids pre-determining the total number of epochs before training and enables the Ga P methods to continuously search for sparse masks. Published as a conference paper at ICLR 2022 B.1 IMAGE CLASSIFICATION: RESNET-50 ON IMAGENET In the image classification task, we use standard data augmentation, a batch size of 2048, cosine annealing learning rate schedule, SGD optimizer with a momentum of 0.875, and a weight decay of 3.05e-5. The learning rate is scheduled with a linear warm-up for 2 epochs before reaching the initial learning rate of 2.048. Each Ga P step with non-uniform and uniform sparsity includes 30 of training, respectively. The final fine-tuning includes 150 epochs. For C-Ga P, we train for 28 steps (i.e., 7 rounds with 4 partitions), and for P-Ga P, we train for 32 steps (i.e., 8 rounds with 4 partitions). After the Ga P step, we prune the dense partition(s) and fine-tune the model. For the prune-from-dense method, we use iterative ADMM pruning. We first pretrain a dense model using 250 epochs, and perform ADMM regularization training for 250 epochs. Then we prune the model to 80% sparsity and finetune for another 250 epochs. Thus, the total number of epochs for 80% sparsity is 750 epochs. Starting from the 80% sparse model, we iteratively prune to 90% sparsity by firstly training the 80% sparse model with ADMM regularization for 250 epochs, and then prune to 90% sparsity and finetune 250 epochs. It takes a total of 1250 epochs to train a 90% sparsity model. B.2 OUR IMPLEMENTATION DETAILS OF RIGL In this section, we provide the implementation details of the Rig L (Evci et al., 2020). For the experiments with 100 training epochs, Most of the important hyper-paramters can be found in Evci et al. (2020). We adopt the settings in our Py Torch code base and the accuracy of Rig L in 80% and 90% sparsity can be reproduced, both in uniform and ERK sparsity distributions, respectively. As shown in Table 1, our implementation has slightly higher accuracy, which indicates that our code base in Py Torch is comparable with the original implementation in Evci et al. (2020), and can be used for reproducing and extending the training of Rig L without causing fairness issue. Based on the above implementation, we reproduce the Rig L results with 500 training epochs. However, Evci et al. (2020) doesn t provide detailed hyper-parameter settings for 500-epoch training. 
B.2 OUR IMPLEMENTATION DETAILS OF RIGL

In this section, we provide the implementation details of RigL (Evci et al., 2020). For the experiments with 100 training epochs, most of the important hyper-parameters can be found in Evci et al. (2020). We adopt those settings in our PyTorch code base, and the accuracy of RigL at 80% and 90% sparsity can be reproduced for both the uniform and ERK sparsity distributions. As shown in Table 1, our implementation has slightly higher accuracy, which indicates that our PyTorch code base is comparable with the original implementation in Evci et al. (2020) and can be used for reproducing and extending the training of RigL without raising fairness issues.

Based on the above implementation, we reproduce the RigL results with 500 training epochs. However, Evci et al. (2020) does not provide detailed hyper-parameter settings for 500-epoch training. We perform extensive experiments and find that the same step learning rate decay scheduler cannot reproduce the accuracy claimed in the original RigL paper, regardless of the learning rate decay factor or decay epochs. We therefore change the step learning rate decay scheduler to the cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and add the mixup technique (Zhang et al., 2018a) with its hyper-parameter set to 0.2 to improve the generalization ability of the network. We find that the 500-epoch accuracy can be reproduced with these settings. We conjecture that the cosine annealing learning rate schedule and the mixup technique mitigate the overfitting problem that easily occurs in long training runs. We inherit the RigL training recipe used in our 500-epoch training and extend the training to 1200 epochs. The reproduced results are shown in Table 1.

B.3 OBJECT DETECTION: SSD ON COCO-2017

To train SSD on the COCO-2017 dataset, we use an input size of 300×300 pixels, a batch size of 64, a step learning rate schedule, an SGD optimizer with a momentum of 0.9, and a weight decay of 5e-4. The learning rate is 2.6e-3 with 300 iterations of linear warm-up. Each GaP step includes 32 epochs of training, and the final fine-tuning includes 65 epochs. For the prune-from-dense method, we report an mAP of 24.3 obtained by iterative pruning. We first pretrain a dense model for 65 epochs and then prune iteratively to 50%, 70%, 80%, 87.5%, and finally 90% sparsity. For each pruning iteration, we train the sparse model for 65 epochs, so it takes 390 epochs to obtain the final 90% sparsity model. Note that directly pruning to 90% sparsity results in only 21.98 mAP, which is significantly lower than the 24.3 mAP of the iterative pruning method. The model pruned using the C-GaP method achieves 25.9 mAP.
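The iterative prune-from-dense baseline above can be summarized in a few lines of PyTorch. This is a hedged sketch, not the code used for the reported numbers: per-tensor magnitude pruning is an assumed criterion, and `train_fn` stands in for a hypothetical training loop that keeps masked weights at zero.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude (1 - sparsity) fraction of entries."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

def iterative_prune(model, train_fn, schedule=(0.5, 0.7, 0.8, 0.875, 0.9), epochs=65):
    """Sketch of the prune-from-dense baseline in B.3: after dense pretraining,
    repeatedly prune to the next sparsity target and retrain for `epochs` epochs.
    train_fn(model, epochs, masks) is an assumed helper that trains while
    keeping masked weights at zero."""
    masks = {}
    for target in schedule:
        for name, p in model.named_parameters():
            if p.dim() >= 2:                       # prune conv/linear weights only
                masks[name] = magnitude_mask(p.data, target)
                p.data.mul_(masks[name])           # zero out pruned entries
        train_fn(model, epochs, masks)             # retrain at this sparsity level
    return model, masks
```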
B.4 3D OBJECT PART SEGMENTATION: POINTNET++ ON SHAPENET

When training the PointNet++ (Qi et al., 2017; Yan, 2019) model on the ShapeNet (Yi et al., 2016) dataset, we use a batch size of 32, a step learning rate schedule with an initial learning rate of 0.001, an SGD optimizer with a momentum of 0.9, and a weight decay of 1e-4. We pretrain a dense model for 250 epochs. Each GaP step includes 25 epochs of training, and the final fine-tuning includes 250 epochs.

Figure 4: Training workflow of (a) C-GaP and (b) P-GaP. In C-GaP (left), the model is partitioned and trained on one machine for multiple rounds. In each round, the sparse mask of each partition is grown and pruned cyclically; each round has κ steps when the model is partitioned into κ parts. In P-GaP (right), a randomly initialized and sparsified model is copied and distributed to κ machines when partitioned into κ parts. Each round contains one step since all machines train their copies in parallel. The optimized model is collected from the process, pruned to the desired sparsity scheme, and then fine-tuned to restore accuracy.

B.5 TRANSLATION: TRANSFORMER ON THE WMT-14 EN-DE DATASET

We use a batch size of 5120, an inverse square-root scheduler with an initial learning rate of 0.000846, and the Adam optimizer. Each GaP step includes 10 epochs of training, and the final fine-tuning includes 40 epochs. We pretrain a dense model for 40 epochs. For the prune-from-dense method, we use one-shot magnitude-based pruning to prune the Transformer directly to 80% and 90% sparsity.

C THE DIFFERENCE BETWEEN SCHEDULED GAP AND RANDOM SPARSE MASK EXPLORATION

In our scheduled GaP algorithm, we intend to use all information in a model, i.e., to explore all weights efficiently. One may argue that, when trained sufficiently long, all weights will be explored even under random mask exploration. However, random sparse mask exploration and our scheduled GaP methods are like sampling with replacement and sampling without replacement (Basu, 1958), and the training time required to reach the same weight coverage differs substantially. Consider a very small model with only 10 weights, where each step trains one weight and keeps the remaining weights unchanged. When we apply the scheduled GaP algorithm (without replacement), every weight is guaranteed to be trained after 10 steps, because scheduled GaP alternates the weight being trained: an already-trained weight is not re-trained until all the remaining weights have been trained. On the other hand, if a random weight is selected to train in each step (with replacement), it takes about 29 steps to be fairly confident that all weights have been trained (the coupon-collector expectation, 10 × (1 + 1/2 + ⋯ + 1/10) ≈ 29.3). That is almost 3 times more training time than the scheduled GaP. Note that this example has only 10 weights; a real neural network contains millions of weights, and it is far more difficult to guarantee full exploration.

Algorithm 2: P-GaP training flow.
Input: An L-layer model with uninitialized weights Θ; pruning ratio r.
Output: An L-layer sparse model satisfying the target sparsity requirement.
1: Divide all layers into κ partitions, denoted by S_i, i ∈ {0, 1, ..., κ − 1}.
2: Initialize f(x; Θ ⊙ m) with random weights Θ and random masks m that satisfy the sparsity requirement.
3: step = 0.
4: while step < K do
5:   Send identical copies of [Θ, m] to all κ nodes and denote them by Θ^(i) and m^(i) on the i-th node.
6:   parfor i ∈ {0, ..., κ − 1} do
7:     Grow S_i by updating the mask m^(i)_{S_i} ← ArgGrowTo(Θ^(i)_{S_i}).
8:     Train the model f(x; Θ^(i) ⊙ m^(i)) for T epochs.
9:     Collect the dense partition Θ^(i)_{S_i} from the i-th machine.
10:  Combine all Θ^(i)_{S_i} into a dense model Θ_{S_0 ∪ ⋯ ∪ S_{κ−1}} with the updated weights.
11:  Prune Θ and update m by [Θ, m] ← ArgPruneTo(Θ, r).
12:  step = step + 1.
13: Fine-tune f(x; Θ ⊙ m) for T epochs.

D A DETAILED WORKFLOW OF THE CYCLIC GAP AND PARALLEL GAP

In this section, we show the general flow of parallel GaP in Algorithm 2 and illustrate the detailed workflows of the proposed cyclic GaP and parallel GaP methods in Figure 4, as a supplement to Algorithm 1 and Algorithm 2.
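As a companion to Figure 4(a) and Algorithm 2, the sketch below outlines the cyclic (C-GaP) schedule in PyTorch-style code. The helpers `train` and `arg_prune_to`, and the representation of partitions as lists of parameter names, are assumptions made for this illustration; this is not the released implementation.

```python
import torch

def c_gap(model, partitions, train, arg_prune_to, rounds, ratio):
    """Cyclic GaP sketch: grow one partition to dense, train, prune it back,
    move to the next partition, and repeat for several rounds (cf. Figure 4(a)).

    partitions  : list of kappa lists of parameter names (assumed representation)
    train       : train(model, masks) trains T epochs with masked weights held at zero
    arg_prune_to: arg_prune_to(tensor, r) returns a {0,1} mask at pruning ratio r
    """
    params = dict(model.named_parameters())
    # Start from a randomly pruned model: every partition is sparse.
    masks = {n: arg_prune_to(p.data, ratio) for n, p in params.items()}
    kappa = len(partitions)
    for _ in range(rounds):
        for i in range(kappa):
            for name in partitions[i]:                 # grow partition i to dense
                masks[name] = torch.ones_like(params[name])
            for name in partitions[(i - 1) % kappa]:   # prune the previously dense partition
                masks[name] = arg_prune_to(params[name].data, ratio)
            train(model, masks)                        # one GaP step (T epochs)
    for name in partitions[kappa - 1]:                 # prune the last dense partition
        masks[name] = arg_prune_to(params[name].data, ratio)
    train(model, masks)                                # final fine-tuning
    return model, masks
```

Keeping only one partition dense at any time is what keeps the memory footprint close to that of the sparse model throughout training.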
E PROOF OF PROPOSITION 1

E.1 PROBLEM FORMULATION

Training deep neural networks can be formulated as minimizing the following problem:

\min_{\Theta \in \mathbb{R}^d} F(\Theta) = \mathbb{E}_{x\sim\mathcal{D}}[f(x;\Theta)]   (2)

where \Theta \in \mathbb{R}^d is the model parameter to be learned and f(x;\Theta) is a smooth non-convex cost function of \Theta. The random variable x denotes the data samples, which follow the distribution \mathcal{D}, and \mathbb{E}_{x\sim\mathcal{D}}[\cdot] denotes the expectation over x.

E.2 NOTATION

We first clarify the notation used in the convergence analysis. To characterize how GaP updates the model parameters in a partition-wise manner, we introduce matrices U_i \in \mathbb{R}^{d\times d_i}, i = 1,\dots,\kappa, such that the identity matrix I_d \in \mathbb{R}^{d\times d} can be written as

I_d = (U_1, U_2, \dots, U_\kappa).   (3)

- \Theta_{S_i} = U_i^{\top}\Theta \in \mathbb{R}^{d_i} is the model parameter of the i-th partition (i.e., partition S_i).
- m_{S_i} \in \mathbb{R}^{d_i} is the mask of the i-th partition.
- \nabla f(x;\Theta) \in \mathbb{R}^d and \nabla F(\Theta) \in \mathbb{R}^d are the complete stochastic and exact gradients with respect to \Theta, respectively.
- \nabla_i f(x;\Theta) = U_i^{\top}\nabla f(x;\Theta) \in \mathbb{R}^{d_i} is the stochastic gradient of the i-th partition, and \nabla_i F(\Theta) = U_i^{\top}\nabla F(\Theta) \in \mathbb{R}^{d_i} is the exact gradient of the i-th partition.
- \Theta^{(i)}_{q,t} \in \mathbb{R}^d denotes the complete model parameter in which the i-th partition (i.e., [\Theta^{(i)}_{q,t}]_{S_i} \in \mathbb{R}^{d_i}) is newly updated at the q-th round and t-th inner iteration, while the other partitions remain unchanged.
- m^{(i)}_q \in \mathbb{R}^d is the complete mask vector for the i-th partition in the q-th round.
- x^{(i)}_{q,t} is the data sampled at the q-th round and t-th inner iteration to update the i-th partition.

E.3 ALGORITHM REFORMULATION

In this subsection we reformulate the cyclic GaP (C-GaP) of Algorithm 1 in a form that presents the convergence analysis more easily. With the notation in Sec. E.2, the C-GaP Algorithm 1 can be restated as Algorithm 3.

Algorithm 3: C-GaP algorithm (a math-friendly version)
1: Initialize \Theta^{(1)}_{1,0} and m^{(0)}_1 randomly. Initialize the pruning ratio r.
2: for round q = 1, 2, \dots, Q do
3:   for partition i = 1, 2, \dots, \kappa do
4:     Generate a mask m^{(i)}_q = \mathrm{GaPMask}(\Theta^{(i)}_{q,0}, m^{(i-1)}_q, r);
5:     for t = 1, \dots, T do
6:       Update \Theta^{(i)}_{q,t} = \Theta^{(i)}_{q,t-1} - \gamma U_i \nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1} \odot m^{(i)}_q)
7:     Update \Theta^{(i+1)}_{q,0} = \Theta^{(i)}_{q,T}
8:   Update \Theta^{(1)}_{q+1,0} = \Theta^{(\kappa)}_{q,T} and m^{(0)}_{q+1} = m^{(\kappa)}_q;
9: Output: \Theta^{(\kappa)}_{Q,T} \odot m^{(\kappa)}_Q.

In \mathrm{GaPMask}(\Theta^{(i)}_{q,0}, m^{(i-1)}_q, r), the partitions of m^{(i)}_q are updated as follows:
- Prune the (i-1)-th partition: update [m^{(i)}_q]_{S_{i-1}} \leftarrow \mathrm{ArgPruneTo}([\Theta^{(i)}_{q,0}]_{S_{i-1}}, r) for pruning ratio r.
- Grow the i-th partition: let [m^{(i)}_q]_{S_i} = \mathbf{1}_{d_i} \in \mathbb{R}^{d_i}.
- Other partitions: [m^{(i)}_q]_{S_j} = [m^{(i-1)}_q]_{S_j} for all j \neq i and j \neq i-1,
and m^{(i)}_q = [m^{(i)}_q]_{S_1 \cup S_2 \cup \dots \cup S_\kappa}.

E.4 ASSUMPTIONS

We now introduce several assumptions on the cost function and the gradient noise, which are standard in the literature.

Assumption 1 (SMOOTHNESS). We assume the cost function F(\Theta) is partition-wise L-smooth, i.e.,

\|\nabla_i F(\Theta + U_i h_i) - \nabla_i F(\Theta)\| \le L\|h_i\|, \quad \forall h_i \in \mathbb{R}^{d_i}.   (4)

The above assumption implies that

\|\nabla F(\Theta) - \nabla F(\Phi)\| \le L\|\Theta - \Phi\|, \quad \forall\, \Theta, \Phi \in \mathbb{R}^d,   (5)

or equivalently,

F(\Theta) - F(\Phi) \le \langle \nabla F(\Phi), \Theta - \Phi \rangle + \frac{L}{2}\|\Theta - \Phi\|^2, \quad \forall\, \Theta, \Phi \in \mathbb{R}^d.   (6)

Assumption 2 (GRADIENT NOISE). We assume, for any q, t, and i, that

\mathbb{E}[\nabla_i f(x^{(i)}_{q,t}; \Theta)] = \nabla_i F(\Theta),   (7)
\mathbb{E}[\|\nabla_i f(x^{(i)}_{q,t}; \Theta) - \nabla_i F(\Theta)\|^2] \le \sigma^2,   (8)

where \sigma > 0 is a constant. Moreover, we assume the data samples x^{(i)}_{q,t} are independent of each other for any q, t, and i. This assumption implies that the stochastic partition-gradient is unbiased and has bounded variance.

Assumption 3 (MASK-INCURRED ERROR). It is assumed that

\|\Theta^{(i)}_{q,t-1} \odot m^{(i)}_q - \Theta^{(i)}_{q,t-1}\|^2 \le \delta^2 \|\Theta^{(i)}_{q,t-1}\|^2,   (9)

where \delta \in (0, 1).
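For intuition, Assumption 3 can be checked numerically for magnitude-based pruning: zeroing entries only changes the vector on the pruned coordinates, so the mask-incurred error is exactly the pruned fraction of the squared norm, which is strictly below 1. The snippet below is an illustration under that assumption; `arg_prune_to` here is a stand-in magnitude mask, not our implementation.

```python
import torch

def arg_prune_to(theta: torch.Tensor, r: float) -> torch.Tensor:
    """Magnitude mask keeping the largest (1 - r) fraction of entries (illustrative)."""
    k = int(r * theta.numel())
    if k == 0:
        return torch.ones_like(theta)
    thresh = theta.abs().flatten().kthvalue(k).values
    return (theta.abs() > thresh).float()

# Empirical check of Assumption 3 for a random Gaussian parameter vector:
# ||theta * m - theta||^2 equals the squared magnitude of the pruned entries,
# so delta^2 is the pruned fraction of ||theta||^2.
theta = torch.randn(10_000)
m = arg_prune_to(theta, r=0.9)
delta_sq = ((theta * m - theta).norm() ** 2 / theta.norm() ** 2).item()
print(f"empirical delta^2 at 90% magnitude pruning: {delta_sq:.3f}")  # strictly less than 1
```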
Note that [m^{(i)}_q]_{S_i} = \mathbf{1}_{d_i} and that only the i-th partition of \Theta^{(i)}_{q,t} is updated during the iterations (q,1),\dots,(q,T). We therefore have

\|\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q - \Theta^{(i)}_{q,t-1}\|^2 = \|\Theta^{(i)}_{q,0}\odot m^{(i)}_q - \Theta^{(i)}_{q,0}\|^2 \le \delta^2 \|\Theta^{(i)}_{q,0}\|^2.   (10)

E.5 CONVERGENCE ANALYSIS

Now we are ready to establish the convergence property of C-GaP in Algorithm 3. The argument is straightforward. First, we establish that the function value \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] decreases to \mathbb{E}[F(\Theta^{(i)}_{q,t})], up to an error term caused by the gradient noise and the mask pruning, after each inner iteration t when updating the i-th partition (Lemma 1). Based on this fact, we next prove that \mathbb{E}[F(\Theta^{(i)}_{q,0})] decreases to \mathbb{E}[F(\Theta^{(i+1)}_{q,0})] when the T-step training of the i-th partition is completed (Lemma 2). Finally, we show that \mathbb{E}[F(\Theta^{(1)}_{q,0})] decreases to \mathbb{E}[F(\Theta^{(1)}_{q+1,0})] after the q-th outer round (Lemmas 3 and 4). Since the function value decreases in every round q, we can prove that C-GaP converges to a stationary solution after a sufficiently large number of rounds Q (Theorem 1). Next we present the detailed analysis.

Lemma 1 (DESCENT LEMMA AFTER EACH INNER ITERATION). Under Assumptions 1-3, it holds for each q, i, and t that

\mathbb{E}[F(\Theta^{(i)}_{q,t})] \le \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] - \frac{\gamma}{3}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L\sigma^2}{2} + \frac{2\gamma L^2\delta^2}{3}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (11)

Remark 4. Lemma 1 shows that the function value decreases by \frac{\gamma}{3}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 at each inner iteration t, but this decrement suffers from two error terms: one caused by the stochastic gradient noise and the other by the inexact pruning.

Proof. Since F(\Theta) is L-smooth (see Assumption 1), it holds that

F(\Theta^{(i)}_{q,t}) \le F(\Theta^{(i)}_{q,t-1}) + \langle \nabla F(\Theta^{(i)}_{q,t-1}), \Theta^{(i)}_{q,t} - \Theta^{(i)}_{q,t-1}\rangle + \frac{L}{2}\|\Theta^{(i)}_{q,t} - \Theta^{(i)}_{q,t-1}\|^2
= F(\Theta^{(i)}_{q,t-1}) - \gamma\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\rangle + \frac{\gamma^2 L}{2}\|\nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2.   (12)

With Assumption 2, it is easy to verify that

\mathbb{E}\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\rangle
= \mathbb{E}_{\Theta^{(i)}_{q,t-1}}\big\{ \mathbb{E}[\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\rangle \mid \Theta^{(i)}_{q,t-1}] \big\}
\overset{(7)}{=} \mathbb{E}\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\rangle.   (13)

In a similar way we can prove, with equation (8), that

\mathbb{E}\|\nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2 \le \mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2 + \sigma^2.   (14)

Taking the expectation of (12) and substituting (13) and (14) into (12), we obtain

\mathbb{E}[F(\Theta^{(i)}_{q,t})] \le \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] - \gamma\mathbb{E}\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\rangle + \frac{\gamma^2 L}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2 + \frac{\gamma^2 L\sigma^2}{2}
= \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] - \gamma\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2 + \frac{\gamma^2 L\sigma^2}{2}
\quad - \gamma\mathbb{E}\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q) - \nabla_i F(\Theta^{(i)}_{q,t-1})\rangle.   (15)

Moreover,

\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2 \le 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}) - \nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2
\le 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 2\mathbb{E}\|\nabla F(\Theta^{(i)}_{q,t-1}) - \nabla F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q)\|^2
\overset{(5)}{\le} 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 2L^2\mathbb{E}\|\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q - \Theta^{(i)}_{q,t-1}\|^2
\overset{(10)}{\le} 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 2L^2\delta^2\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (16)

and, similarly,

-\mathbb{E}\langle \nabla_i F(\Theta^{(i)}_{q,t-1}), \nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q) - \nabla_i F(\Theta^{(i)}_{q,t-1})\rangle
\le \frac{1}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{1}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1}\odot m^{(i)}_q) - \nabla_i F(\Theta^{(i)}_{q,t-1})\|^2
\le \frac{1}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{L^2\delta^2}{2}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (17)

Substituting (16) and (17) into (15), and setting \gamma \le \tfrac{1}{6L}, we have

\mathbb{E}[F(\Theta^{(i)}_{q,t})] \le \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] - \frac{\gamma(1-2\gamma L)}{2}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L\sigma^2}{2} + \frac{\gamma L^2\delta^2(1+2\gamma L)}{2}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2
\le \mathbb{E}[F(\Theta^{(i)}_{q,t-1})] - \frac{\gamma}{3}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L\sigma^2}{2} + \frac{2\gamma L^2\delta^2}{3}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (18)
Lemma 2 (DESCENT LEMMA AFTER EACH PARTITION UPDATE). Under Assumptions 1-3, it holds for each q and i that

\mathbb{E}[F(\Theta^{(i+1)}_{q,0})] \le \mathbb{E}[F(\Theta^{(i)}_{q,0})] - \frac{\gamma}{3}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L\sigma^2 T}{2} + \frac{2\gamma L^2\delta^2 T}{3}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (19)

Proof. Summing inequality (11) over t = 1,\dots,T, we obtain

\frac{\gamma}{3}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 \le \sum_{t=1}^{T}\mathbb{E}[F(\Theta^{(i)}_{q,t-1}) - F(\Theta^{(i)}_{q,t})] + \frac{\gamma^2 L\sigma^2 T}{2} + \frac{2\gamma L^2\delta^2 T}{3}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2
= \mathbb{E}[F(\Theta^{(i)}_{q,0}) - F(\Theta^{(i)}_{q,T})] + \frac{\gamma^2 L\sigma^2 T}{2} + \frac{2\gamma L^2\delta^2 T}{3}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (20)

which, together with \Theta^{(i+1)}_{q,0} = \Theta^{(i)}_{q,T}, completes the proof.

Lemma 3 (DESCENT LEMMA AFTER EACH ROUND). Under Assumptions 1-3, it holds for each q that

\mathbb{E}[F(\Theta^{(1)}_{q+1,0})] \le \mathbb{E}[F(\Theta^{(1)}_{q,0})] - \frac{\gamma}{3}\sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + \frac{\gamma^2 L\sigma^2 T\kappa}{2} + \frac{2\gamma L^2\delta^2 T}{3}\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (21)

Proof. Summing inequality (19) over i = 1,\dots,\kappa, we have

\sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 \le \frac{3}{\gamma}\sum_{i=1}^{\kappa}\mathbb{E}[F(\Theta^{(i)}_{q,0}) - F(\Theta^{(i)}_{q,T})] + \frac{3\gamma L\sigma^2 T\kappa}{2} + 2L^2\delta^2 T\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (22)

On the other hand, by the updating rule of \Theta^{(i)}_{q,0}, we have

\sum_{i=1}^{\kappa}\mathbb{E}[F(\Theta^{(i)}_{q,0}) - F(\Theta^{(i)}_{q,T})]
= \mathbb{E}[F(\Theta^{(\kappa)}_{q,0}) - F(\Theta^{(\kappa)}_{q,T})] + \sum_{i=1}^{\kappa-1}\mathbb{E}[F(\Theta^{(i)}_{q,0}) - F(\Theta^{(i)}_{q,T})]
= \mathbb{E}[F(\Theta^{(\kappa)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + \sum_{i=1}^{\kappa-1}\mathbb{E}[F(\Theta^{(i)}_{q,0}) - F(\Theta^{(i+1)}_{q,0})]
= \mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})].   (23)

Substituting (23) into (22), we obtain

\sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 \le \frac{3}{\gamma}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + \frac{3\gamma L\sigma^2 T\kappa}{2} + 2L^2\delta^2 T\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (24)

which completes the proof.

To finish the convergence proof, we still need to relate the decrement \sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 to the full gradient \mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2, which leads to the following lemma.

Lemma 4 (DESCENT LEMMA II AFTER EACH ROUND). Under Assumptions 1-3, and with learning rate \gamma \le \frac{1}{4\kappa L T}, it holds for each q that

\mathbb{E}[F(\Theta^{(1)}_{q+1,0})] \le \mathbb{E}[F(\Theta^{(1)}_{q,0})] - \frac{\gamma T}{12}\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2 + \frac{\gamma^2 L\sigma^2 T^2\kappa}{6} + \frac{\gamma L^2\delta^2 T^2}{3}\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (25)

Proof. In the following we lower-bound the term \sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 in (21). Recall from Algorithm 3 that

\Theta^{(i+1)}_{q,0} = \Theta^{(i)}_{q,0} - \gamma U_i\sum_{t=1}^{T}\nabla_i f(x^{(i)}_{q,t-1}; \Theta^{(i)}_{q,t-1}\odot m^{(i)}_q),   (26)
\Theta^{(i)}_{q,t} = \Theta^{(i)}_{q,0} - \gamma U_i\sum_{s=1}^{t}\nabla_i f(x^{(i)}_{q,s-1}; \Theta^{(i)}_{q,s-1}\odot m^{(i)}_q).   (27)

With these relations, it holds for i \ge 1 and t \ge 1 that

\mathbb{E}\|\Theta^{(i)}_{q,t} - \Theta^{(1)}_{q,0}\|^2 = \mathbb{E}\|\Theta^{(i)}_{q,t} - \Theta^{(i)}_{q,0} + \Theta^{(i)}_{q,0} - \Theta^{(i-1)}_{q,0} + \dots + \Theta^{(2)}_{q,0} - \Theta^{(1)}_{q,0}\|^2
\overset{(a)}{=} \mathbb{E}\|\Theta^{(i)}_{q,t} - \Theta^{(i)}_{q,0}\|^2 + \mathbb{E}\|\Theta^{(i)}_{q,0} - \Theta^{(i-1)}_{q,0}\|^2 + \dots + \mathbb{E}\|\Theta^{(2)}_{q,0} - \Theta^{(1)}_{q,0}\|^2
\overset{(b)}{=} \gamma^2\mathbb{E}\Big\|\sum_{s=1}^{t}\nabla_i f(x^{(i)}_{q,s-1}; \Theta^{(i)}_{q,s-1}\odot m^{(i)}_q)\Big\|^2 + \gamma^2\sum_{j=1}^{i-1}\mathbb{E}\Big\|\sum_{t=1}^{T}\nabla_j f(x^{(j)}_{q,t-1}; \Theta^{(j)}_{q,t-1}\odot m^{(j)}_q)\Big\|^2
\overset{(c)}{\le} \gamma^2\mathbb{E}\Big\|\sum_{s=1}^{t}\nabla_i F(\Theta^{(i)}_{q,s-1}\odot m^{(i)}_q)\Big\|^2 + \gamma^2\sum_{j=1}^{i-1}\mathbb{E}\Big\|\sum_{t=1}^{T}\nabla_j F(\Theta^{(j)}_{q,t-1}\odot m^{(j)}_q)\Big\|^2 + i\gamma^2\sigma^2
\overset{(d)}{\le} t\gamma^2\sum_{s=1}^{t}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,s-1}\odot m^{(i)}_q)\|^2 + \gamma^2 T\sum_{j=1}^{i-1}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1}\odot m^{(j)}_q)\|^2 + i\gamma^2\sigma^2
\le \gamma^2 T\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1}\odot m^{(j)}_q)\|^2 + \kappa\gamma^2\sigma^2
\le 2\gamma^2 T\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1})\|^2 + 2\gamma^2 L^2 T\delta^2\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\Theta^{(j)}_{q,0}\|^2 + \kappa\gamma^2\sigma^2
= 2\gamma^2 T\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1})\|^2 + 2\gamma^2 L^2 T^2\delta^2\sum_{j=1}^{i}\mathbb{E}\|\Theta^{(j)}_{q,0}\|^2 + \kappa\gamma^2\sigma^2,   (28)

where (a) holds because the differences \Theta^{(i)}_{q,t} - \Theta^{(i)}_{q,0} and \Theta^{(j)}_{q,0} - \Theta^{(j-1)}_{q,0} lie in different partitions and are therefore orthogonal to each other, (b) holds because of (26) and (27), (c) holds because of Assumption 2, and (d) holds because of Jensen's inequality.
With the above inequality, it holds for t \in [1, T] that

\mathbb{E}\|\nabla_i F(\Theta^{(1)}_{q,0})\|^2 \le 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t})\|^2 + 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t}) - \nabla_i F(\Theta^{(1)}_{q,0})\|^2
\le 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t})\|^2 + 2\mathbb{E}\|\nabla F(\Theta^{(i)}_{q,t}) - \nabla F(\Theta^{(1)}_{q,0})\|^2
\overset{(5)}{\le} 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t})\|^2 + 2L^2\mathbb{E}\|\Theta^{(i)}_{q,t} - \Theta^{(1)}_{q,0}\|^2
\overset{(28)}{\le} 2\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t})\|^2 + 4L^2\gamma^2 T\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1})\|^2 + 4\gamma^2 L^4 T^2\delta^2\sum_{j=1}^{i}\mathbb{E}\|\Theta^{(j)}_{q,0}\|^2 + 2\kappa\gamma^2\sigma^2 L^2.   (29)

Summing the above inequality over t = 1,\dots,T, we have

T\,\mathbb{E}\|\nabla_i F(\Theta^{(1)}_{q,0})\|^2 \le 2\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 4T^2 L^2\gamma^2\sum_{j=1}^{i}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1})\|^2 + 4\gamma^2 L^4 T^3\delta^2\sum_{j=1}^{i}\mathbb{E}\|\Theta^{(j)}_{q,0}\|^2 + 2\kappa\gamma^2\sigma^2 L^2 T.   (30)

Summing the above inequality over i = 1,\dots,\kappa, and noting that \sum_{i=1}^{\kappa}\mathbb{E}\|\nabla_i F(\Theta^{(1)}_{q,0})\|^2 = \mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2, we have

T\,\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2 \le 2\sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 4T^2 L^2\gamma^2\kappa\sum_{j=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_j F(\Theta^{(j)}_{q,t-1})\|^2 + 4\gamma^2 L^4 T^3\delta^2\kappa\sum_{j=1}^{\kappa}\mathbb{E}\|\Theta^{(j)}_{q,0}\|^2 + 2\kappa^2\gamma^2\sigma^2 L^2 T
= (4T^2 L^2\gamma^2\kappa + 2)\sum_{i=1}^{\kappa}\sum_{t=1}^{T}\mathbb{E}\|\nabla_i F(\Theta^{(i)}_{q,t-1})\|^2 + 4\gamma^2 L^4 T^3\delta^2\kappa\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2 + 2\kappa^2\gamma^2\sigma^2 L^2 T.   (31)

Combining (24) and (31), we have

T\,\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2 \le (4T^2 L^2\gamma^2\kappa + 2)\Big(\frac{3}{\gamma}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + \frac{3\gamma L\sigma^2 T\kappa}{2} + 2L^2\delta^2 T\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2\Big) + 4\gamma^2 L^4 T^3\delta^2\kappa\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2 + 2\kappa^2\gamma^2\sigma^2 L^2 T
\overset{(a)}{\le} \frac{12}{\gamma}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + (6\gamma L T\kappa + 2\kappa^2\gamma^2 L^2 T)\sigma^2 + \big(4\gamma^2 L^4 T^3\delta^2\kappa + 8L^2\delta^2 T\big)\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2
\overset{(b)}{\le} \frac{12}{\gamma}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + 8\gamma L T\kappa\sigma^2 + 10L^2\delta^2 T\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (32)

where (a) and (b) use the facts that

2T^2 L^2\gamma^2\kappa + 1 \le 2 \quad (\text{it is enough to set } \gamma \le \tfrac{1}{2\kappa L T}),   (33)
\kappa^2\gamma^2 L^2 T \le \gamma L T\kappa \quad (\text{it is enough to set } \gamma \le \tfrac{1}{\kappa L}).   (34)

Then inequality (32) becomes

\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2 \le \frac{12}{\gamma T}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + 8\gamma L\kappa\sigma^2 + 10L^2\delta^2\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2.   (35)

Finally, we establish the convergence of the C-GaP algorithm as follows.

Theorem 1 (C-GAP CONVERGENCE). Under Assumptions 1-3, if the learning rate is set as \gamma = \frac{1}{4\kappa L T\sqrt{Q}}, it holds that

\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0}\odot m^{(1)}_q)\|^2 \le \frac{G}{\sqrt{Q}} + \frac{10L^2 T\delta^2}{Q}\sum_{q=1}^{Q}\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (36)

where G = 96\kappa L\,\mathbb{E}[F(\Theta^{(1)}_{1,0})] + \sigma^2 is a constant.

Proof. Since

\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0}\odot m^{(1)}_q)\|^2 \le 2\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0})\|^2 + 2L^2\delta^2\mathbb{E}\|\Theta^{(1)}_{q,0}\|^2,   (37)

we thus have

\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0}\odot m^{(1)}_q)\|^2 \overset{(25)}{\le} \frac{24}{\gamma T}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + 8L^2 T\delta^2\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2 + 4\gamma L T\kappa\sigma^2 + 2L^2\delta^2\mathbb{E}\|\Theta^{(1)}_{q,0}\|^2
\le \frac{24}{\gamma T}\mathbb{E}[F(\Theta^{(1)}_{q,0}) - F(\Theta^{(1)}_{q+1,0})] + 10L^2 T\delta^2\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2 + 4\gamma L T\kappa\sigma^2.   (38)

Summing the above inequality over q = 1,\dots,Q and taking the average, we obtain

\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0}\odot m^{(1)}_q)\|^2 \le \frac{24}{\gamma T Q}\mathbb{E}[F(\Theta^{(1)}_{1,0})] + \frac{10L^2 T\delta^2}{Q}\sum_{q=1}^{Q}\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2 + 4\gamma L T\kappa\sigma^2.   (39)

If we let \gamma = \frac{1}{4\kappa L T\sqrt{Q}}, the above inequality becomes

\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\|\nabla F(\Theta^{(1)}_{q,0}\odot m^{(1)}_q)\|^2 \le \frac{96\kappa L}{\sqrt{Q}}\mathbb{E}[F(\Theta^{(1)}_{1,0})] + \frac{\sigma^2}{\sqrt{Q}} + \frac{10L^2 T\delta^2}{Q}\sum_{q=1}^{Q}\sum_{i=1}^{\kappa}\mathbb{E}\|\Theta^{(i)}_{q,0}\|^2,   (40)

which completes the proof.