# Lock-Free Optimization for Non-Convex Problems

Shen-Yi Zhao, Gong-Duo Zhang, Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy, zhanggd}@lamda.nju.edu.cn, liwujun@nju.edu.cn

## Abstract

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

## Introduction

Many machine learning models can be formulated as the following optimization problem:

$$\min_{w} f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \tag{1}$$

where $w$ is the parameter to learn (optimize), $n$ is the number of training instances, and $f_i(w)$ is the loss defined on instance $i$. For example, assume we are given a set of labeled instances $\{(x_i, y_i)\mid i = 1, 2, \ldots, n\}$, where $x_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{-1, 1\}$ is the label of $x_i$. Then $f_i(w)$ can be $\log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2$, which is known as the regularized loss in logistic regression (LR). We can also take $f_i(w)$ to be $\max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2$, which is known as the regularized loss in support vector machine (SVM). Here, $\lambda$ is the regularization hyper-parameter. Moreover, many other machine learning models, including neural networks (Krizhevsky, Sutskever, and Hinton 2012), matrix factorization (Koren, Bell, and Volinsky 2009), and principal component analysis (PCA) (Shamir 2015), can also be formulated as in (1).

When the problem in (1) is large-scale, i.e., when $n$ is large, researchers have recently proposed stochastic gradient descent (SGD) and its variants like SVRG (Johnson and Zhang 2013) to solve it. Many works (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013) have found that SGD-based methods can achieve promising performance in large-scale learning problems. According to the implementation platforms or systems, existing SGD-based methods can be divided into three categories: sequential SGD (SSGD) methods, parallel SGD (PSGD) methods, and distributed SGD (DSGD) methods. SSGD methods are designed for a single thread on a single machine; PSGD methods are designed for multi-core (multi-thread) systems on a single machine with a shared memory (in some literature, PSGD refers to methods implemented on both multi-core and multi-machine systems; in this paper, PSGD only refers to methods implemented on multi-core systems with a shared memory); and DSGD methods are designed for multiple machines. When the problem in (1) is convex, the SGD methods, including SSGD (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013), PSGD (Recht et al. 2011), and DSGD (Jaggi et al. 2014; Li et al. 2014; Xing et al. 2015; Zhang, Zheng, and Kwok 2016), have achieved very promising empirical performance. Furthermore, good theoretical results about the convergence of these SGD methods are also provided by the existing works.
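For concreteness, the following is a minimal NumPy sketch of the finite-sum objective (1) with the two example losses above. It is an illustration written for this text rather than code from the paper, and the function names are our own.

```python
import numpy as np

def logistic_loss(w, x_i, y_i, lam):
    # Regularized logistic loss: log(1 + exp(-y_i * x_i^T w)) + (lam / 2) * ||w||^2
    return np.logaddexp(0.0, -y_i * x_i.dot(w)) + 0.5 * lam * w.dot(w)

def hinge_loss(w, x_i, y_i, lam):
    # Regularized SVM (hinge) loss: max(0, 1 - y_i * x_i^T w) + (lam / 2) * ||w||^2
    return max(0.0, 1.0 - y_i * x_i.dot(w)) + 0.5 * lam * w.dot(w)

def objective(w, X, y, lam, loss=logistic_loss):
    # f(w) = (1/n) * sum_i f_i(w), the finite-sum objective in (1).
    return np.mean([loss(w, x_i, y_i, lam) for x_i, y_i in zip(X, y)])
```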
In many real applications, the problems to optimize can be non-convex. For example, the problems for neural networks are typically non-convex. Because many researchers (Li et al. 2014; Xing et al. 2015) find that SGD methods can also achieve good empirical results for non-convex problems, theoretical proof about the convergence of SGD methods for non-convex problems has recently attracted much attention, and some progress has been achieved. For example, the works in (Ghadimi and Lan 2013; Reddi et al. 2016; Li et al. 2016; Allen-Zhu and Hazan 2016; Allen-Zhu and Yuan 2016) have proved the convergence of sequential SGD and its variants for non-convex problems. There are also some other theoretical results for particular non-convex problems, like PCA (Shamir 2015; 2016a; 2016b) and matrix factorization (Sa, Ré, and Olukotun 2015). But all these works are only for SSGD methods.

There have appeared only two works (Lian et al. 2015; Huo and Huang 2016) which propose PSGD methods for non-convex problems with theoretical proof of convergence. However, the PSGD methods in (Lian et al. 2015) need a write-lock or an atomic operation on the memory to prove the convergence (although the implementation of AsySG-incon in (Lian et al. 2015) is lock-free, the theoretical analysis about the convergence of AsySG-incon is based on an assumption that no over-writing happens, i.e., the theoretical analysis is not for the lock-free case). Similarly, the work in (Huo and Huang 2016) also does not prove convergence for the lock-free case considered in our paper. Recent works (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015; J. Reddi et al. 2015; Zhao and Li 2016) find that lock-free strategy based parallel SGD (LF-PSGD) methods can empirically outperform lock-based PSGD methods on multi-core systems. Although some existing works (Chaturapruek, Duchi, and Ré 2015; Zhao and Li 2016) have proved the convergence of these LF-PSGD methods for convex problems, no work has proved the convergence of the LF-PSGD methods for non-convex problems.

In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! (Recht et al. 2011; Chaturapruek, Duchi, and Ré 2015) and AsySVRG (Zhao and Li 2016), for non-convex problems. The contribution of this work can be outlined as follows:

- Theoretical results show that both Hogwild! and AsySVRG can converge with the lock-free strategy for non-convex problems.
- Hogwild! gets a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = pT$ is the total number of iterations of the $p$ threads.
- AsySVRG gets a convergence rate of $O(1/\tilde{T})$ for non-convex problems. To get an $\epsilon$-local optimal solution for AsySVRG, the computation complexity of all threads is $O(n^{2/3}/\epsilon)$, or equivalently the computation complexity of each thread is $O(\frac{n^{2/3}}{p\epsilon})$. This is faster than traditional parallel gradient descent methods, whose computation complexity is $O(\frac{n}{p\epsilon})$ for each thread.
- Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.

## Preliminary

We use $f(w)$ to denote the objective function in (1), which means $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, and we use $\|\cdot\|$ to denote the $L_2$-norm $\|\cdot\|_2$.

**Assumption 1.** The function $f_i(\cdot)$ in (1) is smooth, which means that there exists a constant $L > 0$ such that $\forall a, b$,
$$f_i(b) \le f_i(a) + \nabla f_i(a)^T(b - a) + \frac{L}{2}\|b - a\|^2,$$
or equivalently
$$\|\nabla f_i(b) - \nabla f_i(a)\| \le L\|b - a\|.$$

This is a common assumption for the convergence analysis of most existing gradient-based methods.
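As a concrete example of Assumption 1 (our own calculation, not taken from the paper), the regularized logistic loss from the Introduction is $L$-smooth with $L = \frac{\|x_i\|^2}{4} + \lambda$:

```latex
% Illustrative check of Assumption 1 (not from the paper): the regularized logistic loss
%   f_i(w) = \log(1 + e^{-y_i x_i^T w}) + \tfrac{\lambda}{2}\|w\|^2 ,  y_i \in \{-1, 1\},
% has gradient and Hessian
\begin{align*}
  \nabla f_i(w)   &= -\sigma(-y_i x_i^T w)\, y_i x_i + \lambda w ,
                     \qquad \sigma(z) = \frac{1}{1 + e^{-z}} , \\
  \nabla^2 f_i(w) &= \sigma(y_i x_i^T w)\bigl(1 - \sigma(y_i x_i^T w)\bigr)\, x_i x_i^T + \lambda I
                     \;\preceq\; \Bigl(\frac{\|x_i\|^2}{4} + \lambda\Bigr) I .
\end{align*}
% Since the Hessian is bounded, both inequalities in Assumption 1 hold with
% L = \|x_i\|^2 / 4 + \lambda.
```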
Since we focus on non-convex problems in this paper, it is difficult to get the global solution of (1) based on gradient methods. Hence, we use $\|\nabla f(w)\|^2$ to measure the convergence instead of $f(w) - \min_{w} f(w)$.

Here, we give a lemma which is useful in the convergence analysis of Hogwild! and AsySVRG.

**Lemma 1.** Assume $B$ is a positive semi-definite matrix whose largest eigenvalue is less than or equal to 1 and whose minimum eigenvalue is $\alpha > 0$. Then $\forall x, y$,
$$-\nabla f(x)^T B \nabla f(y) \le -\frac{\alpha}{2}\|\nabla f(x)\|^2 + \frac{L^2}{2}\|x - y\|^2,$$
which follows from
$$-\nabla f(x)^T B \nabla f(y) = -\frac{1}{2}\|B^{\frac{1}{2}}\nabla f(x)\|^2 - \frac{1}{2}\|B^{\frac{1}{2}}\nabla f(y)\|^2 + \frac{1}{2}\|B^{\frac{1}{2}}(\nabla f(x) - \nabla f(y))\|^2 \le -\frac{\alpha}{2}\|\nabla f(x)\|^2 + \frac{1}{2}\|\nabla f(x) - \nabla f(y)\|^2$$
together with the smoothness of $f$.

## Hogwild! for Non-Convex Problems

The Hogwild! method (Recht et al. 2011) is listed in Algorithm 1. Each thread reads $w$ from the shared memory, computes a stochastic gradient, and updates the $w$ in the shared memory. Please note that Hogwild! in (Recht et al. 2011) has several variants, with locks or lock-free. Here, we only focus on the lock-free variant of Hogwild!, which means that we do not use any locks, either read-lock or write-lock, for any thread.

**Algorithm 1** Hogwild!
Initialization: $p$ threads, initialize $w_0$, $\eta$;
For each thread, do:
  for $l = 0, 1, 2, \ldots, T-1$ do
    Read current $w$ in the shared memory, denoted as $\hat{w}$;
    Randomly pick up an $i$ from $\{1, \ldots, n\}$ and compute the gradient $\nabla f_i(\hat{w})$;
    $w \leftarrow w - \eta \nabla f_i(\hat{w})$;
  end for

As in (Zhao and Li 2016), we can construct an equivalent write sequence $\{w_t\}$:
$$w_{t+1} = w_t - \eta B_t \nabla f_{i_t}(\hat{w}_t), \tag{2}$$
where $0 \le t < pT$ and $B_t$ is a random diagonal matrix whose diagonal entries are 0 or 1. $B_t$ is used to denote whether over-writing happens. If the $k$-th diagonal entry of $B_t$ is 0, it means that the $k$-th element of the gradient vector $\nabla f_{i_t}(\hat{w}_t)$ is overwritten by other threads; otherwise, that element is not overwritten. $\hat{w}_t$ is read by the thread which computes $\nabla f_{i_t}(\hat{w}_t)$ and has the following format:
$$\hat{w}_t = w_{a(t)} - \eta \sum_{j=a(t)}^{t-1} P_{t,j-a(t)} \nabla f_{i_j}(\hat{w}_j), \tag{3}$$
where $a(t)$ means that some old stochastic gradients have been completely written on the $w$ in the shared memory, and $P_{t,j-a(t)}$ is a diagonal matrix whose diagonal entries are 0 or 1, which means $\hat{w}_t$ might include parts of new stochastic gradients.

In the lock-free strategy, we need the following assumptions to guarantee convergence:

**Assumption 2.** $a(t)$ is bounded by $0 \le t - a(t) \le \tau$. It means that the old stochastic gradients $\nabla f_{i_0}, \ldots, \nabla f_{i_{t-\tau-1}}$ have been completely written on $w$ in the shared memory.

**Assumption 3.** We consider the matrix $B_t$ as a random matrix and $\mathbb{E}[B_t \mid w_t, \hat{w}_t] = B \succ 0$ with minimum eigenvalue $\alpha > 0$. According to the definition of $B_t$, it is easy to find that $B_t$ and $B$ are positive semi-definite matrices and the largest eigenvalue of $B$ is less than or equal to 1. Assumption 3 means that the probability that over-writing happens is at most $1 - \alpha < 1$ for each write step.

**Assumption 4.** $B_t$ and $i_t$ are independent. Since $i_t$ is the random index selected by each thread while $B_t$ is mainly affected by the hardware, the independence assumption is reasonable.

For Hogwild!, the following assumption is also necessary:

**Assumption 5.** There exists a constant $V$ such that $\|\nabla f_i(w)\| \le V$ for all $w$ and all $i = 1, \ldots, n$.

For convenience, in this section we denote
$$q(x) = \frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x)\|^2.$$
It is easy to find that $\mathbb{E}q(\hat{w}_t) = \mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2]$, and note that when $x$ is close to some stationary point, $q(x)$ may still be far away from 0. Hence, Hogwild! is not a variance reduction method and we need to control the variance of the stochastic gradient.
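To make the lock-free update pattern of Algorithm 1 concrete, here is a minimal Python/NumPy sketch. It is our own illustration rather than the authors' implementation, and because CPython's global interpreter lock serializes bytecode, it only mimics the inconsistent-read/lock-free-write behavior that a true multi-core C implementation would exhibit.

```python
import threading
import numpy as np

def hogwild(grad_i, w0, n, eta, T, p):
    """Lock-free parallel SGD sketch. grad_i(w, i) returns the stochastic gradient of f_i at w."""
    w = w0.copy()                      # shared parameter, updated without any lock

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(T):
            w_hat = w.copy()           # inconsistent read of the shared w (no read-lock)
            i = rng.integers(n)
            g = grad_i(w_hat, i)
            w[:] -= eta * g            # lock-free in-place write; entries may be overwritten

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(p)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return w
```

The matrix $B_t$ in (2) simply records which coordinates of each such write survive; no lock or atomic operation is used anywhere in the sketch.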
The difficulty of the analysis is that $w_t \neq \hat{w}_t$. Here, we give the following lemmas (the proofs of some lemmas can be found in the supplementary material, which can be downloaded from http://cs.nju.edu.cn/lwj/paper/LFnonConvex_sup.pdf):

**Lemma 2.** In Hogwild!, we have $\mathbb{E}q(\hat{w}_t) \le \rho\,\mathbb{E}q(\hat{w}_{t+1})$ if $\rho, \eta$ satisfy
$$\frac{1}{\rho} \le 1 - \frac{9\eta^2(\tau+1)L^2(\rho^{\tau+1}-1)}{\rho-1}.$$

**Lemma 3.** With the condition about $\rho, \eta$ in Lemma 2, we have
$$\mathbb{E}\|w_t - \hat{w}_t\|^2 \le \frac{4\eta^2\tau\rho(\rho^\tau-1)}{\rho-1}\,\mathbb{E}q(\hat{w}_t). \tag{4}$$

Combining with Assumption 5, we can find that the gap between the write sequence and the read sequence can always be bounded by the constant $\frac{4\eta^2 V^2\tau\rho(\rho^\tau-1)}{\rho-1}$.

**Theorem 1.** Let $A = \frac{2f(w_0)}{\alpha}$ and $B = \frac{2V^2}{\alpha}\bigl(\frac{2\tau L^2\eta\rho(\rho^\tau-1)}{\rho-1} + \frac{L}{2}\bigr)$. If we take the stepsize $\eta = \sqrt{\frac{A}{\tilde{T}B}}$, where $\tilde{T} = pT$, we can get the following result:
$$\frac{1}{\tilde{T}}\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le 2\sqrt{\frac{AB}{\tilde{T}}}.$$

*Proof.* According to Assumption 1, we have
$$\begin{aligned}
\mathbb{E}[f(w_{t+1})\mid w_t, \hat{w}_t]
&\le f(w_t) - \eta\,\mathbb{E}[\nabla f(w_t)^T B_t \nabla f_{i_t}(\hat{w}_t)\mid w_t, \hat{w}_t] + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2\mid w_t, \hat{w}_t] \\
&= f(w_t) - \eta\,\nabla f(w_t)^T B \nabla f(\hat{w}_t) + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2\mid w_t, \hat{w}_t] \\
&\le f(w_t) - \frac{\alpha\eta}{2}\|\nabla f(w_t)\|^2 + \frac{L^2\eta}{2}\|w_t - \hat{w}_t\|^2 + \frac{L\eta^2}{2}\mathbb{E}[\|\nabla f_{i_t}(\hat{w}_t)\|^2\mid w_t, \hat{w}_t],
\end{aligned}$$
where the first equality uses Assumption 4 and the second inequality uses Lemma 1. Taking expectation on the above inequality, we obtain
$$\begin{aligned}
\mathbb{E}f(w_{t+1})
&\le \mathbb{E}f(w_t) - \frac{\alpha\eta}{2}\mathbb{E}\|\nabla f(w_t)\|^2 + \frac{L^2\eta}{2}\mathbb{E}\|w_t - \hat{w}_t\|^2 + \frac{L\eta^2 V^2}{2} \\
&\le \mathbb{E}f(w_t) - \frac{\alpha\eta}{2}\mathbb{E}\|\nabla f(w_t)\|^2 + \eta^2 V^2\Bigl(\frac{2\tau L^2\eta\rho(\rho^\tau-1)}{\rho-1} + \frac{L}{2}\Bigr),
\end{aligned}$$
where the first inequality uses Assumption 5 and the second inequality uses Lemma 3 together with Assumption 5. Summing the above inequality from $t = 0$ to $\tilde{T}-1$, we get
$$\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le \frac{2f(w_0)}{\alpha\eta} + \frac{2\eta\tilde{T}V^2}{\alpha}\Bigl(\frac{2\tau L^2\eta\rho(\rho^\tau-1)}{\rho-1} + \frac{L}{2}\Bigr) = \frac{A}{\eta} + \eta\tilde{T}B.$$
For convenience, $A = \frac{2f(w_0)}{\alpha}$ and $B = \frac{2V^2}{\alpha}\bigl(\frac{2\tau L^2\eta\rho(\rho^\tau-1)}{\rho-1} + \frac{L}{2}\bigr)$ are two bounded constants. If we take the stepsize $\eta = \sqrt{\frac{A}{\tilde{T}B}}$, we get
$$\frac{1}{\tilde{T}}\sum_{t=0}^{\tilde{T}-1}\mathbb{E}\|\nabla f(w_t)\|^2 \le 2\sqrt{\frac{AB}{\tilde{T}}}.$$

Hence, our theoretical result shows that Hogwild! with the lock-free strategy gets a convergence rate of $O(1/\sqrt{\tilde{T}})$ for non-convex problems, where $\tilde{T} = pT$ is the total number of iterations of the $p$ threads.

## AsySVRG for Non-Convex Problems

The AsySVRG method (Zhao and Li 2016) is listed in Algorithm 2. AsySVRG provides a lock-free parallel strategy for the original sequential SVRG (Johnson and Zhang 2013). Compared with Hogwild!, AsySVRG uses the full gradient to obtain a variance-reduced stochastic gradient, which has been proved to give a linear convergence rate on strongly convex problems (Zhao and Li 2016). In this section, we will prove that AsySVRG is also convergent for non-convex problems, and has a faster convergence rate than Hogwild! on non-convex problems.

**Algorithm 2** AsySVRG
Initialization: $p$ threads, initialize $w_0$, $\eta$;
for $t = 0, 1, 2, \ldots, T-1$ do
  $u_0 = w_t$;
  All threads parallelly compute the full gradient $\nabla f(u_0) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(u_0)$;
  $u = w_t$;
  For each thread, do:
    for $j = 0$ to $M-1$ do
      Read the current value of $u$, denoted as $\hat{u}$, from the shared memory, and randomly pick up an $i$ from $\{1, \ldots, n\}$;
      Compute the update vector: $\hat{v} = \nabla f_i(\hat{u}) - \nabla f_i(u_0) + \nabla f(u_0)$;
      $u \leftarrow u - \eta\hat{v}$;
    end for
  Take $w_{t+1}$ to be the current value of $u$ in the shared memory;
end for
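The following is a minimal Python/NumPy sketch of Algorithm 2, again our own illustration rather than the authors' implementation; as with the Hogwild! sketch above, Python threads only mimic the lock-free shared-memory access pattern.

```python
import threading
import numpy as np

def asy_svrg(grad_i, w0, n, eta, T, M, p):
    """Lock-free parallel SVRG sketch. grad_i(w, i) returns the stochastic gradient of f_i at w."""
    u = w0.copy()                                     # shared iterate; equals w_t at the start of each epoch
    for _ in range(T):                                # outer loop over epochs
        u0 = u.copy()                                 # snapshot u_{t,0} = w_t
        full_grad = np.mean([grad_i(u0, i) for i in range(n)], axis=0)

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(M):
                u_hat = u.copy()                      # inconsistent read of the shared u
                i = rng.integers(n)
                v = grad_i(u_hat, i) - grad_i(u0, i) + full_grad
                u[:] -= eta * v                       # lock-free in-place write (no atomicity)

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(p)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        # w_{t+1} is the current value of u in the shared memory
    return u
```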
Similar to the analysis in the last section, we construct an equivalent write sequence $\{u_{t,m}\}$ for the $t$-th outer loop:
$$u_{t,0} = w_t, \qquad u_{t,m+1} = u_{t,m} - \eta B_{t,m}\hat{v}_{t,m}, \tag{5}$$
where $\hat{v}_{t,m} = \nabla f_{i_{t,m}}(\hat{u}_{t,m}) - \nabla f_{i_{t,m}}(u_{t,0}) + \nabla f(u_{t,0})$ and $B_{t,m}$ is a diagonal matrix whose diagonal entries are 0 or 1. $\hat{u}_{t,m}$ is read by the thread which computes $\hat{v}_{t,m}$ and has the following format:
$$\hat{u}_{t,m} = u_{t,a(m)} - \eta \sum_{j=a(m)}^{m-1} P^{(t)}_{m,j-a(m)}\hat{v}_{t,j},$$
where $P^{(t)}_{m,j-a(m)}$ is a diagonal matrix whose diagonal entries are 0 or 1. Note that according to (5), $u_{t,\tilde{M}} = w_{t+1}$ since all the stochastic gradients have been written on $w$ at the end of the $t$-th outer loop.

Here, we also need the following assumptions, which are similar to those in the previous section:

- $0 \le m - a(m) \le \tau$;
- $\mathbb{E}[B_{t,m} \mid u_{t,m}, \hat{u}_{t,m}] = B \succ 0$ with minimum eigenvalue $\alpha > 0$;
- $B_{t,m}$ and $i_{t,m}$ are independent.

For convenience, let $p_i(x) = \nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})$, and in this section we denote
$$q(x) = \frac{1}{n}\sum_{i=1}^{n}\|p_i(x)\|^2.$$
It is easy to find that $\mathbb{E}q(\hat{u}_{t,m}) = \mathbb{E}[\|\hat{v}_{t,m}\|^2]$. The difference between Hogwild! and AsySVRG lies in the stochastic gradient, and we have the following lemmas, which lead to the faster convergence rate of AsySVRG:

**Lemma 4.** $\forall x$, we have $q(x) \le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2$.

*Proof.*
$$\begin{aligned}
q(x) &= \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0})\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0}) + \nabla f(u_{t,0}) - \nabla f(x)\|^2 + 2\|\nabla f(x)\|^2 \\
&\le \frac{2}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(u_{t,0})\|^2 + 2\|\nabla f(x)\|^2 \\
&\le 2L^2\|x - u_{t,0}\|^2 + 2\|\nabla f(x)\|^2,
\end{aligned}$$
where the first inequality uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second uses $\frac{1}{n}\sum_i\|a_i - \bar{a}\|^2 \le \frac{1}{n}\sum_i\|a_i\|^2$ with $a_i = \nabla f_i(x) - \nabla f_i(u_{t,0})$, and the last uses Assumption 1.

According to Lemma 4, we can find that AsySVRG is a variance reduction method for non-convex problems, because when $\hat{u}_{t,m}$ and $u_{t,0}$ get close to some stationary point, $q(\hat{u}_{t,m})$ gets close to 0. Hence, we do not need the bounded-gradient assumption for the convergence proof. Since $u_{t,m} \neq \hat{u}_{t,m}$, the difficulty of the convergence analysis lies in the gap between $u_{t,m}$ and $\hat{u}_{t,m}$, and in the relation between $q(\hat{u}_{t,m})$ and $q(u_{t,m})$.

**Lemma 5.** In AsySVRG, we have $\mathbb{E}q(\hat{u}_{t,m}) < \rho\,\mathbb{E}q(\hat{u}_{t,m+1})$ if we choose $\rho$ and $\eta$ to satisfy
$$\frac{1}{\rho} \le 1 - \frac{9\eta^2(\tau+1)L^2(\rho^{\tau+1}-1)}{\rho-1}.$$

**Lemma 6.** With the condition about $\rho, \eta$ in Lemma 5, we have
$$\mathbb{E}\|u_{t,m} - \hat{u}_{t,m}\|^2 \le \frac{4\eta^2\tau\rho(\rho^\tau-1)}{\rho-1}\,\mathbb{E}q(\hat{u}_{t,m}). \tag{6}$$

**Lemma 7.** With the condition about $\rho, \eta$ in Lemma 5, we have $\mathbb{E}q(\hat{u}_{t,m}) < \rho\,\mathbb{E}q(u_{t,m})$.

Combining Lemma 6 and Lemma 7, we can directly obtain
$$\mathbb{E}\|\hat{u}_{t,m} - u_{t,m}\|^2 \le \frac{4\eta^2\tau\rho^2(\rho^\tau-1)}{\rho-1}\,\mathbb{E}q(u_{t,m}). \tag{7}$$

**Theorem 2.** We define $c_m = c_{m+1}(1+\beta\eta) + 2L^2\eta^2 h_{m+1}$ and
$$h_m = \Bigl(\frac{\eta L^2}{2} + \frac{2c_m\eta L^2}{\beta}\Bigr)\frac{4\tau\rho^2(\rho^\tau-1)}{\rho-1} + \Bigl(c_m\rho + \frac{L\rho}{2}\Bigr),$$
with $c_0, \beta > 0$. Furthermore, we choose $c_0, \eta, \beta$ such that
$$\gamma = \min_m\Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\Bigr) > 0$$
and $c_{\tilde{M}} = 0$, where $\tilde{M} = Mp$. Then we have
$$\frac{1}{T\tilde{M}}\sum_{t=0}^{T-1}\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E}f(w_0) - \mathbb{E}f(w_T)}{\gamma T\tilde{M}}.$$

*Proof.* In the $t$-th outer loop, similar to (Reddi et al. 2016), we define $R_{t,m}$ as follows:
$$R_{t,m} = f(u_{t,m}) + c_m\|u_{t,m} - u_{t,0}\|^2.$$
Then $\forall \beta > 0$,
$$\begin{aligned}
\mathbb{E}[\|u_{t,m+1} - u_{t,0}\|^2 \mid u_{t,m}, \hat{u}_{t,m}]
&= \mathbb{E}\|u_{t,m+1} - u_{t,m}\|^2 + \|u_{t,m} - u_{t,0}\|^2 - 2\eta(\mathbb{E}B_{t,m}\hat{v}_{t,m})^T(u_{t,m} - u_{t,0}) \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1+\beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{\eta}{\beta}\|\nabla f(\hat{u}_{t,m})\|^2 \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1+\beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{2\eta}{\beta}\bigl(\|\nabla f(u_{t,m})\|^2 + \|\nabla f(\hat{u}_{t,m}) - \nabla f(u_{t,m})\|^2\bigr) \\
&\le \eta^2\mathbb{E}\|\hat{v}_{t,m}\|^2 + (1+\beta\eta)\|u_{t,m} - u_{t,0}\|^2 + \frac{2\eta}{\beta}\bigl(\|\nabla f(u_{t,m})\|^2 + L^2\|\hat{u}_{t,m} - u_{t,m}\|^2\bigr),
\end{aligned} \tag{8}$$
where the first inequality uses the fact $2ab \le \beta a^2 + \frac{1}{\beta}b^2$. Since the objective function is $L$-smooth, we have
$$\begin{aligned}
\mathbb{E}[f(u_{t,m+1}) \mid u_{t,m}, \hat{u}_{t,m}]
&\le f(u_{t,m}) - \eta\,\mathbb{E}[\nabla f(u_{t,m})^T B_{t,m}\hat{v}_{t,m} \mid u_{t,m}, \hat{u}_{t,m}] + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}] \\
&= f(u_{t,m}) - \eta\,\nabla f(u_{t,m})^T B\nabla f(\hat{u}_{t,m}) + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}] \\
&\le f(u_{t,m}) - \frac{\alpha\eta}{2}\|\nabla f(u_{t,m})\|^2 + \frac{L^2\eta}{2}\|u_{t,m} - \hat{u}_{t,m}\|^2 + \frac{L\eta^2}{2}\mathbb{E}[\|\hat{v}_{t,m}\|^2 \mid u_{t,m}, \hat{u}_{t,m}],
\end{aligned} \tag{9}$$
where the first equality uses the independence of $B_{t,m}$ and $i_{t,m}$, and the second inequality uses Lemma 1. Combining (8) and (9),
$$\begin{aligned}
\mathbb{E}R_{t,m+1} &= \mathbb{E}f(u_{t,m+1}) + c_{m+1}\mathbb{E}\|u_{t,m+1} - u_{t,0}\|^2 \\
&\le \mathbb{E}f(u_{t,m}) - \Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Bigr)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + \Bigl(\frac{L^2\eta}{2} + \frac{2c_{m+1}\eta L^2}{\beta}\Bigr)\mathbb{E}\|u_{t,m} - \hat{u}_{t,m}\|^2 \\
&\quad + c_{m+1}(1+\beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2\Bigl(c_{m+1} + \frac{L}{2}\Bigr)\mathbb{E}\|\hat{v}_{t,m}\|^2 \\
&\le \mathbb{E}f(u_{t,m}) - \Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Bigr)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + \Bigl(\frac{L^2\eta}{2} + \frac{2c_{m+1}\eta L^2}{\beta}\Bigr)\frac{4\tau\eta^2\rho^2(\rho^\tau-1)}{\rho-1}\mathbb{E}q(u_{t,m}) \\
&\quad + c_{m+1}(1+\beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2\Bigl(c_{m+1} + \frac{L}{2}\Bigr)\mathbb{E}\|\hat{v}_{t,m}\|^2,
\end{aligned}$$
where the last inequality uses equation (7). For convenience, we use
$$h_m = \Bigl(\frac{\eta L^2}{2} + \frac{2c_m\eta L^2}{\beta}\Bigr)\frac{4\tau\rho^2(\rho^\tau-1)}{\rho-1} + \rho\Bigl(c_m + \frac{L}{2}\Bigr).$$
Since $\mathbb{E}[\|\hat{v}_{t,m}\|^2] = \mathbb{E}q(\hat{u}_{t,m}) \le \rho\,\mathbb{E}q(u_{t,m})$, we have
$$\begin{aligned}
\mathbb{E}R_{t,m+1} &\le \mathbb{E}f(u_{t,m}) - \Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta}\Bigr)\mathbb{E}\|\nabla f(u_{t,m})\|^2 + c_{m+1}(1+\beta\eta)\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 + \eta^2 h_{m+1}\mathbb{E}q(u_{t,m}) \\
&\le \mathbb{E}f(u_{t,m}) + \bigl[c_{m+1}(1+\beta\eta) + 2L^2\eta^2 h_{m+1}\bigr]\mathbb{E}\|u_{t,m} - u_{t,0}\|^2 - \Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\Bigr)\mathbb{E}\|\nabla f(u_{t,m})\|^2,
\end{aligned}$$
where the second inequality uses Lemma 4. Then we can obtain
$$\mathbb{E}R_{t,m+1} \le \mathbb{E}R_{t,m} - \Bigl(\frac{\alpha\eta}{2} - \frac{2c_{m+1}\eta}{\beta} - 2\eta^2 h_{m+1}\Bigr)\mathbb{E}\|\nabla f(u_{t,m})\|^2,$$
where $c_m = c_{m+1}(1+\beta\eta) + 2L^2\eta^2 h_{m+1}$. We set $c_0 > 0$. It is easy to see that $c_m > c_{m+1}$. We can choose $c_0, \eta, \beta$ to make $c_{\tilde{M}} = 0$. Then we have
$$\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E}R_{t,0} - \mathbb{E}R_{t,\tilde{M}}}{\gamma} = \frac{\mathbb{E}f(w_t) - \mathbb{E}f(w_{t+1})}{\gamma},$$
which is equivalent to
$$\frac{1}{T\tilde{M}}\sum_{t=0}^{T-1}\sum_{m=0}^{\tilde{M}-1}\mathbb{E}\|\nabla f(u_{t,m})\|^2 \le \frac{\mathbb{E}f(w_0) - \mathbb{E}f(w_T)}{\gamma T\tilde{M}}.$$
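To see how the constants in Theorem 2 interact, the following sketch numerically runs the recursion for $c_m$ backward from $c_{\tilde{M}} = 0$ and reports $c_0$ together with $\gamma$, using the expressions exactly as stated above. It is our own illustration (the function name and the idea of checking the condition numerically are ours, not the paper's).

```python
import numpy as np

def check_theorem2_constants(L, alpha, tau, rho, eta, beta, M_tilde):
    # Backward recursion c_m = c_{m+1} * (1 + beta*eta) + 2*L^2*eta^2*h_{m+1},
    # starting from c_{M_tilde} = 0, with
    #   h_m = (eta*L^2/2 + 2*c_m*eta*L^2/beta) * 4*tau*rho^2*(rho^tau - 1)/(rho - 1)
    #         + rho*(c_m + L/2),
    # and gamma = min_m (alpha*eta/2 - 2*c_{m+1}*eta/beta - 2*eta^2*h_{m+1}).
    S = 4.0 * tau * rho**2 * (rho**tau - 1.0) / (rho - 1.0)
    c = 0.0                      # c_{M_tilde}
    gamma = np.inf
    for _ in range(M_tilde):     # m = M_tilde - 1, ..., 0
        h_next = (eta * L**2 / 2.0 + 2.0 * c * eta * L**2 / beta) * S + rho * (c + L / 2.0)
        gamma = min(gamma, alpha * eta / 2.0 - 2.0 * c * eta / beta - 2.0 * eta**2 * h_next)
        c = c * (1.0 + beta * eta) + 2.0 * L**2 * eta**2 * h_next
    return c, gamma              # Theorem 2 requires gamma > 0
```

The next subsection shows analytically that, with $\eta = \mu/n^{2/3}$, $\beta = v/n^{1/3}$, and suitable $\mu, v$, this $\gamma$ is positive and of order $O(1/n^{2/3})$.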
### Computation Complexity

In Theorem 2, we construct a sequence $\{c_m\}$ and need $\gamma > 0$. According to the definition of $h_m$, we can write $h_m$ as $h_m = g c_m + f$, where
$$g = \frac{2\eta L^2}{\beta}\cdot\frac{4\tau\rho^2(\rho^\tau-1)}{\rho-1} + \rho, \qquad f = \frac{\eta L^2}{2}\cdot\frac{4\tau\rho^2(\rho^\tau-1)}{\rho-1} + \frac{L\rho}{2}$$
are constants. First, we choose $\beta > \eta$; then both $g$ and $f$ are bounded positive constants. We have $c_m = c_{m+1}(1 + \beta\eta + 2L^2\eta^2 g) + 2L^2\eta^2 f$. Let $a = \beta\eta + 2L^2\eta^2 g$. Because $c_{\tilde{M}} = 0$, it is easy to get
$$c_0 = \frac{2L^2\eta^2 f\,\bigl[(1+a)^{\tilde{M}} - 1\bigr]}{a}.$$
We take $\tilde{M} = 1/a$; then we have $c_0 \le \frac{4L^2\eta^2 f}{a}$. As recommended in (Reddi et al. 2016), we can take $\eta = \mu/n^{2/3}$ and $\beta = v/n^{1/3}$ with $\eta < \beta$ (assuming $n$ is large). Then we get $f = O(1)$, $g = O(1)$, and $a = O(1/n)$. By choosing $\mu, v$ to satisfy $\frac{16L^2 f\mu}{\alpha v^2} < 1$, such that $\frac{4c_0}{\alpha\beta} < 1$, it is easy to find that $\gamma = O(1/n^{2/3}) > 0$ and $\tilde{M} = O(n)$. Hence, to get an $\epsilon$-local optimal solution, the computation complexity of all $p$ threads is $O(n^{2/3}/\epsilon)$, and the computation complexity of each thread is $O(\frac{n^{2/3}}{p\epsilon})$.

## Experiment

To verify our theoretical results about Hogwild! and AsySVRG, we use a fully-connected neural network to construct a non-convex function. The neural network has one hidden layer with 100 nodes, and the sigmoid function is used as the activation function. We use the softmax output and an $L_2$ regularization for training. The loss function is
$$f(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} 1\{y_i = k\}\log o_i^{(k)} + \frac{\lambda}{2}\|w\|^2,$$
where $w$ denotes the weights of the neural network, $b$ is the bias, $y_i$ is the label of instance $x_i$, $o_i^{(k)}$ is the $k$-th output corresponding to $x_i$, and $K$ is the total number of class labels.

We use two datasets, connect-4 and MNIST (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and set $\lambda = 10^{-3}$. We initialize $w$ by randomly sampling from a Gaussian distribution with mean 0 and variance 0.01, and initialize $b = 0$. During training, we use a fixed stepsize for both Hogwild! and AsySVRG. The stepsize is chosen from $\{0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001\}$, and the best result is reported. For the iteration number of the inner loop of AsySVRG, we set $M = n/p$, where $p$ is the number of threads. The experiments are conducted on a server with 12 Intel cores and 64GB of memory.

Figure 1 illustrates the convergence property of both Hogwild! and AsySVRG. The x-axis denotes the CPU time, where we set the CPU time that Hogwild! needs to pass through the whole dataset once with one thread as 1 unit. The y-axis denotes the training loss. In this experiment, we run Hogwild! and AsySVRG with 10 threads; Hogwild!-10 and AsySVRG-10 denote the corresponding methods with 10 threads. It is easy to see that both Hogwild! and AsySVRG are convergent. Furthermore, AsySVRG is faster than Hogwild!. This is consistent with our theoretical results in this paper.

Figure 1: Hogwild! vs. AsySVRG ((a) MNIST, (b) connect-4; training loss against CPU time).

Figure 2 reports the results of Hogwild! and AsySVRG with different numbers of threads, where the number of threads is $p = 1, 4, 10$. We can find that in most cases the two methods become faster as the number of threads increases. The only outlier is the case of Hogwild! on the dataset connect-4, where Hogwild! using 4 threads is slower than using 1 thread. One possible reason is that we have two CPUs in our server, with 6 cores each. In the 4-thread case, different threads may be allocated on different CPUs, which will cause extra cost.

Figure 2: Comparison between different numbers of threads ((a) Hogwild! on MNIST, (b) AsySVRG on MNIST, (c) Hogwild! on connect-4, (d) AsySVRG on connect-4; training loss against CPU time).
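For reference, here is a minimal NumPy sketch of the training objective described above (one hidden layer with sigmoid activations, softmax output, cross-entropy loss, and $L_2$ regularization on the weights). It is our own reconstruction of the stated loss, not the authors' code, and the variable names are ours.

```python
import numpy as np

def nn_loss(W1, b1, W2, b2, X, y, lam):
    # One hidden layer (sigmoid) -> softmax output; cross-entropy + (lam/2) * ||weights||^2.
    # X: (n, d) features; y: (n,) integer labels in {0, ..., K-1};
    # W1: (d, 100), b1: (100,), W2: (100, K), b2: (K,).
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))        # hidden activations
    scores = H @ W2 + b2
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)                # softmax outputs o_i^(k)
    n = X.shape[0]
    ce = -np.mean(np.log(P[np.arange(n), y]))        # -(1/n) * sum_i log o_i^(y_i)
    reg = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return ce + reg
```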
## Conclusion

In this paper, we have provided theoretical proof about the convergence of two representative lock-free strategy based parallel SGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results. To the best of our knowledge, this is the first work to prove the convergence of lock-free strategy based parallel SGD methods for non-convex problems.

## Acknowledgements

This work is partially supported by NSFC (No. 61472182) and a fund from Tencent.

## References

Allen-Zhu, Z., and Hazan, E. 2016. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning.

Allen-Zhu, Z., and Yuan, Y. 2016. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning.

Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In Proceedings of the Advances in Neural Information Processing Systems.

Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341-2368.

Huo, Z., and Huang, H. 2016. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. CoRR abs/1604.03584.

J. Reddi, S.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2015. On variance reduction in stochastic gradient descent and its asynchronous variants. In Proceedings of the Advances in Neural Information Processing Systems.

Jaggi, M.; Smith, V.; Takac, M.; Terhorst, J.; Krishnan, S.; Hofmann, T.; and Jordan, M. I. 2014. Communication-efficient distributed dual coordinate ascent. In Proceedings of the Advances in Neural Information Processing Systems.

Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems.

Koren, Y.; Bell, R. M.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. IEEE Computer 42(8):30-37.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems.

Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation.

Li, X.; Zhao, T.; Arora, R.; Liu, H.; and Haupt, J. 2016. Stochastic variance reduced optimization for nonconvex sparse learning. In Proceedings of the 33rd International Conference on Machine Learning.

Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Proceedings of the Advances in Neural Information Processing Systems.
Recht, B.; Ré, C.; Wright, S.; and Niu, F. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems.

Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. J. 2016. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning.

Roux, N. L.; Schmidt, M. W.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proceedings of the Advances in Neural Information Processing Systems.

Sa, C. D.; Ré, C.; and Olukotun, K. 2015. Global convergence of stochastic gradient descent for some non-convex matrix problems. In Proceedings of the 32nd International Conference on Machine Learning.

Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research 14(1):567-599.

Shamir, O. 2015. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning.

Shamir, O. 2016a. Convergence of stochastic gradient descent for PCA. In Proceedings of the 33rd International Conference on Machine Learning.

Shamir, O. 2016b. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In Proceedings of the 33rd International Conference on Machine Learning.

Xing, E. P.; Ho, Q.; Dai, W.; Kim, J. K.; Wei, J.; Lee, S.; Zheng, X.; Xie, P.; Kumar, A.; and Yu, Y. 2015. Petuum: A new platform for distributed machine learning on big data. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhang, R.; Zheng, S.; and Kwok, J. T. 2016. Asynchronous distributed semi-stochastic gradient optimization. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhao, S.-Y., and Li, W.-J. 2016. Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In Proceedings of the AAAI Conference on Artificial Intelligence.