# Accelerating Stratified Sampling SGD by Reconstructing Strata

Weijie Liu$^{1,2}$, Hui Qian$^{2}$, Chao Zhang$^{2}$, Zebang Shen$^{3}$, Jiahao Xie$^{2}$ and Nenggan Zheng$^{1}$
$^{1}$Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China
$^{2}$College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China
$^{3}$University of Pennsylvania, Philadelphia, Pennsylvania
{westonhunter, qianhui, zczju, xiejh, zng}@zju.edu.cn, zebang@seas.upenn.edu

## Abstract

In this paper, a novel stratified sampling strategy is designed to accelerate mini-batch SGD. We derive a new iteration-dependent surrogate that bounds the stochastic variance from above. To keep the strata minimizing this surrogate with high probability, a stochastic stratifying algorithm is adopted in an adaptive manner: in each iteration, strata are reconstructed only if an easily verifiable condition is met. Based on this sampling strategy, we propose an accelerated mini-batch SGD algorithm named SGD-RS. Our theoretical analysis shows that the convergence rate of SGD-RS is superior to the state of the art. Numerical experiments corroborate our theory and demonstrate that SGD-RS achieves at least a 3.48-times speed-up compared to vanilla mini-batch SGD.

## 1 Introduction

Over the past decades, mini-batch Stochastic Gradient Descent (SGD) has become the main workhorse in classical machine learning tasks due to its balance between efficacy and scalability [Finkel et al., 2008; Krizhevsky et al., 2012; Sutskever, 2013]. To form a gradient estimate, the vanilla implementation of mini-batch SGD samples a small and fixed number of training instances uniformly in each iteration. Although uniform sampling is easy to implement, the resulting estimate may have a rather high variance, since the component gradients of individual samples can vary considerably, which slows down the convergence of the underlying optimization procedure [Zhao and Zhang, 2015; Borsos et al., 2018].

A strategy that can significantly ameliorate the high-variance issue is importance sampling, which constructs a non-uniform sampling distribution to ensure that important training instances are sampled more frequently. Obtaining the sampling distribution that minimizes the variance requires calculating all component gradients and is hence prohibitive. Zhao and Zhang [2015] and Needell et al. [2016] use the training instances to calculate a fixed sampling distribution. Recently, there has been abundant research on finding iteration-dependent sampling distributions. Katharopoulos and Fleuret [2018] propose an upper bound on the gradient norm that can be calculated in the forward pass of the neural network and use this upper bound to calculate the sampling distribution. Johnson and Guestrin [2018] compute the sampling distribution using robust optimization and find the distribution that is minimax optimal with respect to an uncertainty set. Salehi et al. [2017] and Borsos et al. [2018] formulate sampling-distribution learning as an online optimization problem with bandit feedback.

Another line of research that also aims to effectively reduce the variance of gradient estimates is stratified sampling [Botev and Ridder, 2014]. Stratified sampling involves two phases: first dividing a population into subpopulations (strata), and then applying uniform sampling to each subpopulation to form the estimate.
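The variance reduction that stratification buys can be seen on a toy example. Below is a minimal sketch (ours, not from the paper) that estimates the mean of a heterogeneous population with the same sample budget under uniform and stratified sampling; the strata, sizes, and budget are hypothetical.

```python
# Toy comparison: uniform vs. stratified sampling of a heterogeneous population.
import numpy as np

rng = np.random.default_rng(0)
# Two very different sub-populations play the role of the strata.
pop = np.concatenate([rng.normal(0.0, 1.0, 8000), rng.normal(50.0, 1.0, 2000)])
strata = [pop[:8000], pop[8000:]]
b = 20  # total sample budget per estimate

def uniform_estimate():
    return rng.choice(pop, size=b, replace=False).mean()

def stratified_estimate():
    # Phase 1: the population is already divided into strata.
    # Phase 2: sample each stratum uniformly (proportional allocation) and
    #          recombine the per-stratum means with the stratum weights.
    est = 0.0
    for stratum in strata:
        b_i = max(1, round(b * len(stratum) / len(pop)))
        est += (len(stratum) / len(pop)) * rng.choice(stratum, size=b_i, replace=False).mean()
    return est

print("variance, uniform sampling:   ", np.var([uniform_estimate() for _ in range(2000)]))
print("variance, stratified sampling:", np.var([stratified_estimate() for _ in range(2000)]))
```

With instances grouped into homogeneous strata, the stratified estimate has a far smaller variance than the uniform one at the same budget, which is the effect the methods below exploit for gradient estimates.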
When applied to mini-batch SGD, calculating and stratifying all component gradients at every iteration is computationally impractical. To reduce this overhead, Zhao and Zhang [2014] propose a practical algorithm, SGD-ss, that forms fixed strata by clustering training instances instead of component gradients, which essentially determines strata by minimizing a surrogate bounding the stochastic variance from above. More recently, methods based on repulsive point processes generalize the idea of SGD-ss by adopting a measure of similarity between data points and allowing strata to overlap [Zhang et al., 2017; Zhang et al., 2019]. Although these methods substantially alleviate the overhead of stratification, their strata are predetermined before the main gradient descent iterations and cannot partition the time-variant component gradients well.

In this paper, a novel stratified sampling strategy is designed to accelerate mini-batch SGD. Specifically, we derive a new iteration-dependent surrogate and propose a stochastic stratifying algorithm to find the strata that minimize it. Strata are reconstructed in an adaptive manner, that is, in each iteration, strata are reconstructed only if an easily verifiable condition is met. Based on this novel sampling strategy, we propose an improved stratified mini-batch SGD algorithm named SGD-RS. Our contributions can be summarized as follows.

- We derive a new surrogate bounding the stochastic variance from above. Compared with the counterpart used in SGD-ss, ours is closer to the stochastic variance and does not require the component gradient to be Lipschitz continuous with respect to the feature vector. As a result, our theoretical analysis shows better convergence rates than SGD-ss for both convex and non-convex objective functions.
- We design a strategy to substantially alleviate the overhead of reconstructing strata. Specifically, a reconstructing condition is devised for our method to avoid the per-iteration computational load of stratification. We prove that, when the current strata do not minimize the surrogate, the reconstruction is invoked with high probability. Therefore, the strata maintained by this strategy attain the minimum of the surrogate with high probability for all SGD iterations, with only sporadic stratification. Besides, we prove that our stratifying algorithm can find the optimal strata with high probability in polynomial time.
- Numerical experiments are conducted on popular applications including logistic regression and neural-network-based image classification. On the rcv1 binary classification dataset, SGD-RS achieves a more than 4-times speed-up compared to SGD-ss, with only 1.20% of iterations reconstructing strata. On the CIFAR10 image classification dataset, a more than 3.5-times speed-up is obtained by our method compared to SGD-ss, with merely 0.33% of iterations reconstructing strata. On average, strata reconstruction occurs at a rate of 0.61%.

Notation. We use bold lowercase symbols to denote vectors (e.g., $\mathbf{x}$). We denote the $\ell_2$ norm of a vector $a$ by $\|a\|$ and the cardinality of a set $A$ by $|A|$. For a random variable $x$, $\mathbb{E}\,x$ is its expectation.

## 2 Preliminaries

Throughout this paper, we use the empirical risk minimization setting to demonstrate the advantages of our method, from which the nature of stratification can be gleaned. Given a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_s$ is the feature vector and $y_s$ is the label for all $s \in \{1, \ldots, n\}$, we solve the empirical risk minimization problem

$$\min_{w \in W} f(w) = \frac{1}{n}\sum_{s=1}^{n} f_s(w), \qquad (1)$$

where $W \subseteq \mathbb{R}^d$ is a convex set, and the component function $f_s(w) = \ell(w; x_s, y_s)$ designates the loss the model $w$ incurs on the datum $(x_s, y_s)$. Let $g$ be an unbiased estimate of the true gradient $\nabla f$. The variance of $g$ is defined as

$$V(g) := \mathbb{E}\,\|g - \nabla f\|^2. \qquad (2)$$
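The following is a minimal sketch (ours) of this setup for a logistic loss: it computes the component gradients $\nabla f_s(w)$, the full gradient $\nabla f(w)$ of (1), and a Monte-Carlo estimate of the variance $V(g)$ in (2) for the plain uniform mini-batch estimator. The data are synthetic.

```python
# Per-sample logistic-loss gradients, the full gradient of (1), and an empirical
# estimate of V(g) in (2) for the uniform mini-batch gradient estimator.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 1000, 5, 32
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

def per_sample_grads(w):
    # gradient of log(1 + exp(-y_s <w, x_s>)) w.r.t. w, one row per sample
    margin = y * (X @ w)
    return (-y / (1.0 + np.exp(margin)))[:, None] * X

G = per_sample_grads(w)
full_grad = G.mean(axis=0)                       # the true gradient in (1)

estimates = np.stack([G[rng.choice(n, size=b, replace=False)].mean(axis=0)
                      for _ in range(2000)])     # uniform mini-batch estimates g
print("estimated V(g):", np.mean(np.sum((estimates - full_grad) ** 2, axis=1)))
```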
### 2.1 Stratified Sampling for Mini-batch SGD

We now introduce stratified sampling for mini-batch SGD. At the $t$-th iteration, we partition the training set into $k$ disjoint subsets $A^t = \{A_1^t, \ldots, A_k^t\}$. Each subset has cardinality $n_i^t = |A_i^t|$. We sample subsets $B_i^t$ from $A_i^t$, i.e., $B_i^t \subseteq A_i^t$. The mini-batch is the union of these sampled subsets, $B^t = B_1^t \cup \cdots \cup B_k^t$. Let $b_i^t = |B_i^t|$. The pre-determined mini-batch size is $b = \sum_{i=1}^k b_i^t$.

SGD-ss assumes that $\partial\ell(w^t; x_s, y_s)/\partial w$ is $L$-Lipschitz continuous with respect to $x_s$ [Zhao and Zhang, 2014] and uses iteration-independent strata $A$ throughout training. $b_i^t$ is set to the iteration-independent constant $b_i = \frac{b\, n_i u_i}{\sum_{j=1}^{k} n_j u_j}$, where $u_i = \frac{1}{n_i}\sum_{s\in A_i}\big\|x_s - \frac{1}{|A_i|}\sum_{r\in A_i} x_r\big\|^2$. SGD-ss then has the gradient estimate

$$\hat\nabla f_{\mathrm{ss}}(w^t) = \frac{1}{n}\sum_{i=1}^{k}\frac{n_i}{b_i}\sum_{s\in B_i^t}\nabla\ell(w^t; x_s, y_s)$$

and the following upper bound on the variance of $\hat\nabla f_{\mathrm{ss}}(w^t)$:

$$V\big(\hat\nabla f_{\mathrm{ss}}(w^t)\big) \le \frac{L^2}{nb}\sum_{i=1}^{k}\sum_{s\in A_i}\Big\|x_s - \frac{1}{|A_i|}\sum_{r\in A_i} x_r\Big\|^2. \qquad (3)$$

The fixed strata in SGD-ss are obtained by minimizing the right-hand side of (3).

### 2.2 Importance Sampling for Mini-batch SGD

At the $t$-th iteration, importance sampling methods assign each $s \in \{1, \ldots, n\}$ a probability $q_s^t \ge 0$ such that $\sum_{s=1}^n q_s^t = 1$, and select training instances according to the sampling distribution $q^t = (q_1^t, \ldots, q_n^t)$. The gradient estimate of importance sampling is then

$$\hat\nabla f_{\mathrm{is}}(w^t) = \frac{1}{|B_{\mathrm{is}}^t|}\sum_{s\in B_{\mathrm{is}}^t}\frac{1}{n\,q_s^t}\nabla f_s(w^t), \qquad (4)$$

where $B_{\mathrm{is}}^t$ is the sampled mini-batch. According to [Johnson and Guestrin, 2018], its variance achieves the minimum when $q_s^t \propto \|\nabla f_s(w^t)\|$ for all $s$, i.e.,

$$q_s^t = \frac{\|\nabla f_s(w^t)\|}{\sum_{j=1}^{n}\|\nabla f_j(w^t)\|}. \qquad (5)$$

## 3 SGD-RS

In this section, we discuss two critical details of SGD-RS: strata construction and the reconstructing condition.

### 3.1 Constructing Strata

The gradient estimate adopting both stratified sampling and importance sampling is

$$\hat\nabla f_{\mathrm{ss\text{-}is}}(w^t) = \frac{1}{n}\sum_{i=1}^{k}\sum_{s\in B_i^t}\frac{1}{b_i^t\, p_s^t}\nabla\ell(w^t; x_s, y_s), \qquad (6)$$

where $p_s^t$ is the probability assigned to $s$ in $A_i^t$.

Proposition 1. (6) is an unbiased gradient estimate, and its variance can be bounded from above by

$$V_0^t(A^t, p^t) = \frac{1}{n^2}\sum_{i=1}^{k}\frac{1}{b_i^t}\Bigg(\sum_{s\in A_i^t}\frac{\big\|\nabla f_s(w^t)\big\|^2}{p_s^t} - \Big\|\sum_{s\in A_i^t}\nabla f_s(w^t)\Big\|^2\Bigg). \qquad (7)$$

Finding the partition $A^t$ and the probabilities $p^t$ minimizing (7) can be done in an alternating manner, that is, we first find the best $A^t$ with $p^t$ fixed, and then vice versa. As we shall see in Section 4.3, if we choose $p^t$ to be uniform within each stratum, we already obtain a good partition that guarantees a better performance than SGD-ss. With such a $p^t$ (and $b_i^t$ proportional to $n_i^t$), $V_0^t(A^t, p^t)$ reduces to

$$\hat V_0^t(A^t) = \frac{1}{nb}\sum_{i=1}^{k}\sum_{s\in A_i^t}\Big\|\nabla f_s(w^t) - \frac{1}{n_i^t}\sum_{r\in A_i^t}\nabla f_r(w^t)\Big\|^2, \qquad (8)$$

which is also the objective function of standard $k$-means. We can find the optimal $A^t$ with Algorithm 1, a modified version of stochastic $k$-means [Tang and Monteleoni, 2016]; the difference is that stochastic $k$-means only updates the centroids and does not maintain the strata. As we shall see in Section 4.1, Algorithm 1 can find the strata minimizing (8) in polynomial time, which we prove by extending Tang and Monteleoni's result. Note that, in contrast to SGD-ss, we partition the training set by Euclidean distances between gradient vectors. We use Single-Linkage to initialize the centroids [Wikipedia contributors, 2019].

Algorithm 1 Stochastic Stratifying
1: Input: $w^t$, $k$, $A^0$, mini-batch sizes $\beta$ and $\beta_0$, learning rate sequence $\lambda_i^\tau$, termination criterion.
2: Sample $S_0$ of size $\beta_0$ uniformly with replacement
3: $G_{S_0} = \{\nabla f_s(w^t),\, s \in S_0\}$
4: $c^0 \leftarrow$ run Single-Linkage on $G_{S_0}$
5: for $\tau = 0, 1, \ldots$ do
6:   Sample indices $S$ of size $\beta$ uniformly with replacement; set counter $m_i$ to 0 and $S_i$ to an empty set for all $i = 1, \ldots, k$
7:   $G_S = \{\nabla f_s(w^t),\, s \in S\}$
8:   for $\nabla f_s(w^t) \in G_S$ do
9:     Find $i = \arg\min_l \|\nabla f_s(w^t) - c_l^\tau\|^2$ and let $I$ be the stratum $s$ currently belongs to
10:    $S_i = S_i \cup \{s\}$; $m_i = m_i + 1$
11:    if $i \ne I$ then $A_i^\tau = A_i^\tau \cup \{s\}$; $A_I^\tau = A_I^\tau \setminus \{s\}$
12:    end if
13:   end for
14:   for $i = 1, 2, \ldots, k$ do
15:     if $m_i > 0$ then
16:       $\hat c_i^{\tau+1} = \frac{1}{m_i}\sum_{s\in S_i}\nabla f_s(w^t)$
17:       $c_i^{\tau+1} = (1-\lambda_i^\tau)\,c_i^\tau + \lambda_i^\tau\,\hat c_i^{\tau+1}$
18:     else
19:       $c_i^{\tau+1} = c_i^\tau$
20:     end if
21:   end for
22:   if the termination criterion is satisfied then return $A^\tau$
23:   end if
24: end for
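Below is a compact sketch, in our own notation, of the stratifying loop of Algorithm 1: mini-batch $k$-means on the current gradient vectors that also moves sample indices between strata. The Single-Linkage initialization and the termination criterion are replaced by simple stand-ins (a fixed number of rounds), the parameter values are hypothetical, and for clarity the sketch takes the full matrix of component gradients as given even though the loop only ever touches the sampled rows.

```python
# Sketch of the stochastic stratifying loop: assign sampled gradients to the
# nearest centroid, move them between strata accordingly, and update centroids
# with a decaying learning rate, as in lines 5-21 of Algorithm 1.
import numpy as np

def stochastic_stratify(grads, strata, centroids, beta=64, c=2.0, tau0=10.0,
                        n_rounds=200, rng=np.random.default_rng(0)):
    """grads: (n, d) component gradients at the current iterate w^t.
    strata: length-n integer array, strata[s] = index of the stratum containing s.
    centroids: (k, d) initial centroids (e.g. produced by Single-Linkage)."""
    n = len(grads)
    for tau in range(n_rounds):
        S = rng.choice(n, size=beta, replace=True)            # sample beta indices
        dists = ((grads[S][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        nearest = dists.argmin(axis=1)                        # nearest centroid per sample
        strata[S] = nearest                                   # move s between strata (line 11)
        lam = c / (tau0 + tau)                                # learning rate lambda_i^tau
        for i in np.unique(nearest):
            batch_mean = grads[S[nearest == i]].mean(axis=0)  # mini-batch centroid estimate
            centroids[i] = (1.0 - lam) * centroids[i] + lam * batch_mean
    return strata, centroids
```

In SGD-RS this routine is only invoked when the reconstructing condition of Section 3.2 fires, so the cost of repeatedly touching gradient vectors is paid in a small fraction of the iterations.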
The stratification results for a toy 2-D logistic regression task are shown in Figure 1: SGD-RS forms better strata than SGD-ss does, and its gradient estimate is also closer to the true gradient.

Figure 1: Comparison of gradient estimates in a toy 2-D logistic regression task. Each circle is a gradient vector. If a gradient vector is used to form the gradient estimate, it is denoted by a filled circle; otherwise it is an unfilled circle. In the importance sampling panel, the size of each circle is proportional to the probability that this gradient vector is used. In the second row, the color of each point indicates the clustering assignment. The black plus sign and the cyan marker in each subplot denote the true gradient and the gradient estimate, respectively. As shown, SGD-ss, which only forms strata by clustering training instances, cannot partition gradient vectors well. The gradient estimate of SGD-RS is the closest to the true gradient.

### 3.2 Reconstructing Condition

To avoid per-iteration reconstruction of strata, we design a reconstructing condition that exploits a subset of the sampled mini-batch, $D^t = D_1^t \cup \cdots \cup D_k^t$, where $D_1^t \subseteq B_1^t, \ldots, D_k^t \subseteq B_k^t$. It is defined as

$$\zeta(D^t) = \begin{cases} \text{true}, & \text{if } v_i^t(s_1, s_2) \ge \epsilon^t \text{ for some } i \text{ and some } s_1, s_2 \in D_i^t,\\ \text{false}, & \text{otherwise}, \end{cases} \qquad (9)$$

where $v_i^t(s_1, s_2)$ is the distance between two component gradients within the same stratum $i$, i.e.,

$$v_i^t(s_1, s_2) = \big\|\nabla f_{s_1}(w^t) - \nabla f_{s_2}(w^t)\big\|, \quad \text{for } s_1, s_2 \in D_i^t. \qquad (10)$$

The intuition is that we reconstruct the strata only when component gradients from the same stratum are not homogeneous. We summarize the proposed SGD-RS in Algorithm 2. SGD-RS reconstructs strata when the reconstructing condition $\zeta(D^t)$ is true. After calculating $\hat\nabla f(w^t)$, we perform a stochastic gradient step and project $w^{t+1}$ onto $W$.

Algorithm 2 SGD-RS
1: Input: initialization $w^0$, learning rate sequence $\eta_t$
2: for $t = 0, 1, \ldots, T-1$ do
3:   For all $i$, sample $B_i^t$ from $A_i^t$
4:   Calculate gradient vectors $\{\nabla f_s(w^t),\, s \in B^t\}$
5:   if $\zeta(D^t)$ is true then
6:     Run Algorithm 1 to reconstruct $A^t$
7:     For all $i$, sample $B_i^t$ from $A_i^t$
8:     Calculate gradient vectors $\{\nabla f_s(w^t),\, s \in B^t\}$
9:   end if
10:  $\hat\nabla f(w^t) = \frac{1}{n}\sum_{i=1}^{k}\frac{n_i^t}{b_i^t}\sum_{s\in B_i^t}\nabla f_s(w^t)$
11:  $w^{t+1} = \Pi_W\big(w^t - \eta_t\,\hat\nabla f(w^t)\big)$; $A^{t+1} = A^t$
12: end for
13: Output: Select $\bar w_T$ from $\{w^0, \ldots, w^{T-1}\}$ with probability $P(\bar w_T = w^t) = \frac{\eta_t}{\sum_{t=0}^{T-1}\eta_t}$
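The two ingredients that SGD-RS evaluates at every iteration, the condition $\zeta(D^t)$ of (9)-(10) and the stratified estimate of line 10 in Algorithm 2, amount to very little code. The sketch below is ours, not the authors' implementation; `eps_t` is a hypothetical tolerance, and `grads` holds only the gradients of the checked subset $D^t$ (respectively the sampled mini-batch $B^t$).

```python
# zeta(D^t): do two sampled gradients from the same stratum differ by at least eps_t?
# Stratified estimate: (1/n) * sum_i (n_i^t / b_i^t) * sum_{s in B_i^t} grad f_s(w^t).
import numpy as np
from itertools import combinations

def zeta(grads, stratum_of, eps_t):
    """grads: gradients of the subset D^t; stratum_of: their stratum indices."""
    for i in np.unique(stratum_of):
        G_i = grads[stratum_of == i]
        for a, b in combinations(range(len(G_i)), 2):
            if np.linalg.norm(G_i[a] - G_i[b]) >= eps_t:
                return True          # some stratum is no longer homogeneous
    return False

def stratified_estimate(grads, stratum_of, stratum_sizes, n):
    """grads: gradients of the mini-batch B^t; stratum_sizes[i] = n_i^t."""
    g_hat = np.zeros(grads.shape[1])
    for i in np.unique(stratum_of):
        G_i = grads[stratum_of == i]                  # the b_i^t sampled gradients
        g_hat += stratum_sizes[i] * G_i.mean(axis=0)  # = (n_i^t / b_i^t) * sum of them
    return g_hat / n
```

If `zeta` returns `True`, Algorithm 1 is run to rebuild the strata and the mini-batch is re-sampled before the estimate is formed; otherwise the current strata are reused, which is what keeps the stratification overhead sporadic.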
## 4 Theoretical Analysis

In this section, we conduct a theoretical analysis of SGD-RS. All proofs are deferred to the long version of this paper.

### 4.1 Algorithm 1 Terminates in Finite Steps

We prove that Algorithm 1 can find the $A^t$ that minimizes (8) in finitely many steps. We first introduce the proximity condition and illustrate it in Figure 2.

Definition 1 (Proximity condition (cf. [Kumar and Kannan, 2010])). Assume the point set $X$ admits a ground-truth non-degenerate $k$-clustering (strata) $T = \{T_1, \ldots, T_k\}$ with ground-truth centroids $\mu = \{\mu_1, \ldots, \mu_k\}$ (a $k$-clustering is degenerate if any of its $k$ clusters is empty). Let $\bar g$ be the projection of $g$ onto the line from $\mu_i$ to $\mu_j$. We say a point $g \in T_i$ satisfies the proximity condition if, for any $j \ne i$, $\bar g$ is at least $\Delta_{ij}$ closer to $\mu_i$ than to $\mu_j$, i.e.,

$$\|\bar g - \mu_j\| - \|\bar g - \mu_i\| \ge \Delta_{ij}, \qquad \forall\, j \ne i,\ g \in T_i. \qquad (11)$$

Figure 2: Visualization of Definition 1. $m$ is the midpoint between $\mu_i$ and $\mu_j$. Because $\|\bar g - \mu_j\| - \|\bar g - \mu_i\| = 2\|\bar g - m\|$, $g$ satisfies the proximity condition whenever $2\|\bar g - m\| \ge \Delta_{ij}$.

We assume that at iteration $t$ the set of gradient vectors $G^t$ admits a ground-truth non-degenerate $k$-clustering (strata) $T^t = \{T_1^t, \ldots, T_k^t\}$ with ground-truth centroids $\mu^t = \{\mu_1^t, \ldots, \mu_k^t\}$. The ground-truth strata attain the minimum of the surrogate. We further make the following assumptions about $G^t$; these assumptions are commonly used in the clustering literature (see, e.g., [Tang and Monteleoni, 2016]).

Assumption 1 ($\delta^t$-general dataset). $G^t$ is a general dataset with $\delta^t$-margin if all points satisfy the proximity condition and

$$\delta^t = \min_{i,\, j\ne i} \Delta_{ij}^t, \qquad \delta^t > 0. \qquad (12)$$

Assumption 2 ($F(a)$-clusterability). We say that $G^t$ is $F(a)$-clusterable if $G^t$ has $\delta^t$-margin such that, for all $i \ne j$,

$$\Delta_{ij}^t \ge F(a)\Big(\frac{1}{\sqrt{|T_i^t|}} + \frac{1}{\sqrt{|T_j^t|}}\Big)\sqrt{\varphi^t},$$

where $\varphi^t = \sum_{i=1}^{k}\sum_{g\in T_i^t}\|g - \mu_i^t\|^2$ and $F(a) \ge \max\big\{\alpha^t, 64, \frac{5a+5}{256a}\big\}$ with $0 < a < 1$ and $\alpha^t = \max_{i,j} \frac{|T_i^t|}{|T_j^t|}$.

Our proof is based on Tang and Monteleoni [2016]'s result, which we include here for completeness.

Lemma 2 (Theorem 3 of [Tang and Monteleoni, 2016]). Suppose $G^t$ satisfies Assumptions 1 and 2, and fix any $0 < \theta \le 1$ and $0 < \xi < 1$. If we set $\beta > \frac{\ln(1-a)}{\ln\big(1 - \min_i 4|T_i^t|/(5n)\big)}$ and $\lambda_i^\tau = \frac{c'}{\tau_0 + \tau}$ with $c' > \frac{1}{(1-a)\big(1 - (1 - \min_i 4|T_i^t|/(5n))^{\beta}\big)}$ and $\tau_0$ sufficiently large (of order $(c')^2\big(1 + 16/F(a)^2\big)^2 n^2 \ln^2\frac{1}{\theta}$), and choose $\beta_0$ in the range prescribed by [Tang and Monteleoni, 2016], then with probability at least $(1-\theta)(1-\xi)$, Algorithm 1 guarantees

$$\sum_{i=1}^{k} |T_i^t|\,\|c_i^\tau - \mu_i^t\|^2 = O(1/\tau).$$

Algorithm 1 can assign all points to their ground-truth clusters in polynomial time when Assumptions 1 and 2 hold. The main idea is that $\sum_{i=1}^{k}\sum_{j\ne i}|T_i^t \cap A_j^\tau|$ can be bounded using $\sum_{i=1}^{k}|T_i^t|\,\|c_i^\tau - \mu_i^t\|^2$. Extending Lemma 2, we have the following result.

Theorem 3. When Assumptions 1 and 2 hold and the parameters of Lemma 2 are used, Algorithm 1 guarantees that

$$\sum_{i=1}^{k}\sum_{j\ne i}|T_i^t \cap A_j^\tau| = O(1/\tau)$$

with probability at least $(1-\xi)(1-\theta)(1-\nu_\tau)$, where $\nu_\tau = O(k/\tau)$.

### 4.2 Reconstructing Condition

In this subsection, we prove that the $A^t$ maintained by SGD-RS minimizes (8) for all $t$ with high probability. The intuition behind reconstructing strata only sporadically is simple: gradient vectors that are far away from each other at iteration $t$ tend to still be far away at iteration $t+1$, while gradient vectors that are close to each other at iteration $t$ tend to still be close at iteration $t+1$. More specifically,

$$\Big|\,\|\nabla f_s(w^{t+1}) - \nabla f_r(w^{t+1})\| - \|\nabla f_s(w^{t}) - \nabla f_r(w^{t})\|\,\Big| \le \|\nabla f_s(w^{t+1}) - \nabla f_s(w^{t})\| + \|\nabla f_r(w^{t+1}) - \nabla f_r(w^{t})\|.$$

When $w^t$ and $w^{t+1}$ are located within a locally smooth region, $\|\nabla f_s(w^{t+1}) - \nabla f_s(w^{t})\|$ and $\|\nabla f_r(w^{t+1}) - \nabla f_r(w^{t})\|$ are of order $O\big(\eta_t\,\|\hat\nabla f(w^t)\|\big)$, which is small if we set small learning rates. SGD-RS requires $\epsilon^t$ to satisfy $\epsilon^t \le \delta^t$, where $\delta^t$ is the margin in Assumption 1.
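Since a feasible $\epsilon^t$ is anything below the margin $\delta^t$, it is easy to probe numerically how large that margin is for a given set of gradient vectors and candidate strata. The sketch below (ours, with synthetic clusters) computes, for every point, the projected gap of Definition 1 and reports the largest $\delta^t$ for which Assumption 1 would hold.

```python
# For each point g in cluster i and each j != i, project g onto the mu_i -> mu_j
# line and measure ||g_bar - mu_j|| - ||g_bar - mu_i||, the quantity bounded in (11).
import numpy as np

def proximity_margins(points, labels, centroids):
    margins = []
    for g, i in zip(points, labels):
        for j in range(len(centroids)):
            if j == i:
                continue
            direction = centroids[j] - centroids[i]
            direction = direction / np.linalg.norm(direction)
            g_bar = centroids[i] + np.dot(g - centroids[i], direction) * direction
            margins.append(np.linalg.norm(g_bar - centroids[j])
                           - np.linalg.norm(g_bar - centroids[i]))
    return np.array(margins)

rng = np.random.default_rng(0)
mus = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # ground-truth centroids
pts = np.concatenate([mu + 0.5 * rng.normal(size=(50, 2)) for mu in mus])
lbl = np.repeat([0, 1, 2], 50)
# The smallest projected gap is the largest margin delta_t compatible with
# Assumption 1, and hence an upper limit for the tolerance eps_t used in (9).
print("largest feasible delta_t:", proximity_margins(pts, lbl, mus).min())
```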
When $s_1$ and $s_2$ are from different ground-truth clusters, $v_i^t(s_1, s_2) \ge \delta^t \ge \epsilon^t$, so $\zeta(D^t)$ is true. The following lemma states that, with high probability, we reconstruct the strata when there exists $i$ such that $A_i^t \ne T_i^t$.

Lemma 4. Assume that $G^t$ is a $\delta^t$-general dataset. When there exists $i$ such that $A_i^t \ne T_i^t$, $\zeta(D^t)$ is false with probability at most

$$P\big(\zeta(D^t) = \text{false} \,\big|\, \exists\, i \text{ s.t. } A_i^t \ne T_i^t\big) \le \prod_{i=1}^{k}\sum_{j=1}^{k}\big(p_{ij}^t\big)^{|D_i^t|}, \qquad (16)$$

where $p_{ij}^t = \frac{|A_i^t \cap T_j^t|}{|A_i^t|}$ for all $i, j = 1, \ldots, k$.

Remark. As Lemma 4 states, if we choose an appropriate $k$, then $\prod_{i=1}^{k}\sum_{j=1}^{k}(p_{ij}^t)^{|D_i^t|}$ is close to 0 even if some of the $\sum_{j=1}^{k}(p_{ij}^t)^{|D_i^t|}$ approach 1, which implies that $\zeta(D^t)$ holds with high probability when there exists $i$ such that $A_i^t \ne T_i^t$.

### 4.3 Faster Convergence

We now prove that SGD-RS improves the convergence rates for both convex and non-convex objectives by obtaining tighter variance upper bounds. We first give the following lemma.

Lemma 5. The minimum of $V_0^t$ is less than or equal to the minimal upper bound of SGD-ss, i.e.,

$$\min_{A^t,\, p^t} V_0^t(A^t, p^t) \le \min_{A} \frac{L^2}{nb}\sum_{i=1}^{k}\sum_{s\in A_i}\Big\|x_s - \frac{1}{|A_i|}\sum_{r\in A_i} x_r\Big\|^2. \qquad (17)$$

Furthermore, SGD-RS has a tighter variance bound than SGD-ss, i.e.,

$$\min_{A^t} \hat V_0^t(A^t) \le \min_{A} \frac{L^2}{nb}\sum_{i=1}^{k}\sum_{s\in A_i}\Big\|x_s - \frac{1}{|A_i|}\sum_{r\in A_i} x_r\Big\|^2. \qquad (18)$$

Now we formally give the convergence rates. Let $w^* = \arg\min_{w\in W} f(w)$ and $\sigma^2 \ge \max_t \min_{A^t} \hat V_0^t(A^t)$, where $A^t$ is maintained by SGD-RS.

Theorem 6. When $\|\nabla f(w)\|^2 \le M^2$ and $f_s(w)$ is convex for all $s = 1, \ldots, n$, the objective can be bounded as follows:

$$\mathbb{E}\,f(\bar w_T) - f(w^*) \le \frac{\|w^0 - w^*\|^2 + (M^2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t}. \qquad (19)$$

Theorem 7. If $f(w)$ is non-convex and $\gamma$-Lipschitz smooth, then the number of iterations for $\mathbb{E}\,\|\nabla f(\bar w_T)\|^2$ to be smaller than $\epsilon^2$ is

$$\Omega\Big(\frac{\gamma\,\big(f(w^0) - f(w^*)\big)\,\sigma^2}{\epsilon^4}\Big), \qquad (20)$$

where $\Omega(\cdot)$ denotes the asymptotic lower bound.

Combining Lemma 4 and Lemma 5, SGD-RS has a tighter variance bound than SGD-ss in every iteration with probability at least $1 - \prod_{i=1}^{k}\sum_{j=1}^{k}(p_{ij}^t)^{|D_i^t|}$, which implies that SGD-RS converges faster than SGD-ss in both the convex and the non-convex case.
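Lemma 5 compares the gradient-based surrogate of SGD-RS with the feature-based surrogate of SGD-ss after both are minimized over strata. A lighter way to see the same effect, sketched below with synthetic data (ours), is to evaluate the single surrogate $\hat V_0^t(A^t)$ of (8) for strata obtained by clustering the feature vectors (the SGD-ss recipe) and for strata obtained by clustering the current gradient vectors (the SGD-RS recipe); the latter is smaller essentially by construction.

```python
# Evaluate the surrogate (8) for feature-based strata vs. gradient-based strata.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k, b = 2000, 10, 20, 100
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)
grads = (-y / (1.0 + np.exp(y * (X @ w))))[:, None] * X      # logistic-loss gradients

def surrogate(assign):
    # (1/(n b)) * sum_i sum_{s in A_i} ||grad f_s - stratum mean||^2, cf. (8)
    total = 0.0
    for i in np.unique(assign):
        G_i = grads[assign == i]
        total += ((G_i - G_i.mean(axis=0)) ** 2).sum()
    return total / (n * b)

feature_strata  = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(X)
gradient_strata = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(grads)
print("surrogate, feature-based strata :", surrogate(feature_strata))
print("surrogate, gradient-based strata:", surrogate(gradient_strata))   # smaller
```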
## 5 Experiments

In this section, we compare SGD-RS with state-of-the-art algorithms, including SGD-ss [Zhao and Zhang, 2014], PDS [Zhang et al., 2019], Upper-bound [Katharopoulos and Fleuret, 2018], RAIS [Johnson and Guestrin, 2018], VRB [Borsos et al., 2018], and vanilla mini-batch SGD. SGD-ss pre-stratifies the dataset. PDS generalizes the idea of SGD-ss by adopting a measure of similarity between data points and also has iteration-independent sampling behaviour. Upper-bound, RAIS, and VRB are importance sampling algorithms. We run SGD-RS and the baseline methods 5 times and report the averaged objective values, the averaged testing losses, and the estimated variances of the gradient estimates. The overhead of stratification and of updating the importance sampling distributions is included in the comparison. To estimate the variances, we calculate the distances between the gradient estimates and the true gradients each time we run the algorithms and average the distances over the 5 rounds.
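A sketch of that variance-estimation protocol is given below; it is our reading of the procedure, with `grad_estimate` standing in for whichever sampling scheme is being measured at a fixed iterate.

```python
# Average, over repeated runs, the squared distance between the mini-batch
# gradient estimate and the full gradient at the same iterate.
import numpy as np

def estimated_variance(full_grad, grad_estimate, n_runs=5, seed=0):
    rng = np.random.default_rng(seed)
    dists = [np.sum((grad_estimate(rng) - full_grad) ** 2) for _ in range(n_runs)]
    return float(np.mean(dists))
```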
The datasets and the corresponding parameter setup are summarized in Table 1.

| Dataset | rcv1 | ijcnn1 | w8a | MNIST | CIFAR10 | CIFAR100 |
|---------|------|--------|-----|-------|---------|----------|
| b       | 100  | 200    | 300 | 100   | 100     | 100      |
| k       | 50   | 100    | 50  | 50    | 50      | 50       |

Table 1: A summary of the datasets and parameter choices in the logistic regression with $\ell_2$ regularization experiment and the neural-network-based image classification experiment.

### 5.1 Logistic Regression

We conduct logistic regression experiments on three real-world benchmark datasets: rcv1, ijcnn1, and w8a (downloaded from the LIBSVM website https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Figure 3: Logistic regression results on rcv1 (top row), ijcnn1 (middle row), and w8a (bottom row). From left to right, we report training loss vs. iterations, training loss vs. time (seconds), testing loss vs. iterations, testing loss vs. time (seconds), and estimated variance vs. iterations. Estimated variance is in log scale. Filled areas signify 1.5 times the standard error of the mean.

We can see from Figure 3 that both stratified sampling and importance sampling reduce the variance of the gradient estimates and accelerate convergence. The gradient estimates in SGD-RS are more accurate than those of the baseline methods, and hence SGD-RS converges fastest. The variance of SGD-RS can be as small as 8.2% of the variance of SGD-ss. We observe that strata reconstruction happens at rates of 1.20%, 2.90%, and 2.07% on rcv1, ijcnn1, and w8a, respectively.

### 5.2 Image Classification

We evaluate the empirical performance of SGD-RS on image classification benchmark datasets: MNIST, CIFAR10, and CIFAR100. On MNIST, we train a simple network that has three fully-connected layers and two ReLU layers. We train VGG-11 [Simonyan and Zisserman, 2014] on CIFAR10 and ResNet-18 [He et al., 2016] on CIFAR100, respectively, and the networks are initialized by running vanilla mini-batch SGD for 50 epochs.

Figure 4: Image classification results on MNIST (top row), CIFAR10 (middle row), and CIFAR100 (bottom row). From left to right, we report training loss vs. iterations, training loss vs. time (seconds), testing loss vs. iterations, testing loss vs. time (seconds), and estimated variance vs. iterations. Estimated variance is in log scale. Filled areas signify 1.5 times the standard error of the mean.

As shown in Figure 4, SGD-RS consistently outperforms the baseline methods both in terms of the objective value and in terms of generalization performance. Strata reconstruction happens at rates of 0.27%, 0.33%, and 1.26% on MNIST, CIFAR10, and CIFAR100, respectively. SGD-RS is also the most stable.

## 6 Conclusion

We propose a novel stratified sampling strategy for mini-batch SGD. It maintains strata by minimizing an iteration-dependent surrogate majorizing the variance of gradient estimates, which accelerates convergence. To alleviate the overhead of stratifying, we devise a reconstructing condition and reconstruct strata only when the condition is met. We demonstrate the superiority of the strategy in extensive experiments. The strategy is complementary to other variance reduction methods such as SVRG [Johnson and Zhang, 2013] and can be used simultaneously with them.

## Acknowledgments

This work is supported by Zhejiang Provincial Natural Science Foundation of China (Grant No. LZ18F020002, LR19F020005) and National Natural Science Foundation of China (Grant No. 61672376, 61751209, 61472347, 61572433, 61972347).

## References

[Borsos et al., 2018] Zalán Borsos, Andreas Krause, and Kfir Y. Levy. Online variance reduction for stochastic optimization. arXiv preprint arXiv:1802.04715, 2018.

[Botev and Ridder, 2014] Zdravko Botev and Ad Ridder. Variance reduction. Wiley StatsRef: Statistics Reference Online, pages 1-6, 2014.

[Finkel et al., 2008] Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL-08: HLT, pages 959-967, 2008.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[Johnson and Guestrin, 2018] Tyler B. Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems, pages 7265-7275, 2018.

[Johnson and Zhang, 2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[Katharopoulos and Fleuret, 2018] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[Kumar and Kannan, 2010] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 299-308. IEEE, 2010.

[Needell et al., 2016] Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1-2):549-573, 2016.

[Salehi et al., 2017] Farnood Salehi, L. Elisa Celis, and Patrick Thiran. Stochastic optimization with bandit sampling. arXiv preprint arXiv:1708.02544, 2017.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Sutskever, 2013] Ilya Sutskever. Training Recurrent Neural Networks. PhD thesis, University of Toronto, Toronto, Ontario, Canada, 2013.

[Tang and Monteleoni, 2016] Cheng Tang and Claire Monteleoni. Convergence rate of stochastic k-means. arXiv preprint arXiv:1610.04900, 2016.

[Wikipedia contributors, 2019] Wikipedia contributors. Single-linkage clustering. Wikipedia, the free encyclopedia, 2019. [Online; accessed 21-January-2020].

[Zhang et al., 2017] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Determinantal point processes for mini-batch diversification. arXiv preprint arXiv:1705.00607, 2017.

[Zhang et al., 2019] Cheng Zhang, Cengiz Öztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5741-5748, 2019.

[Zhao and Zhang, 2014] Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080, 2014.

[Zhao and Zhang, 2015] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1-9, 2015.