# Accelerated Variance Reduced Stochastic ADMM

Yuanyuan Liu, Fanhua Shang, James Cheng
Department of Computer Science and Engineering, The Chinese University of Hong Kong
{yyliu, fhshang, jcheng}@cse.cuhk.edu.hk

**Abstract.** Recently, many variance reduced stochastic alternating direction method of multipliers (ADMM) methods (e.g., SAG-ADMM, SDCA-ADMM and SVRG-ADMM) have made exciting progress such as linear convergence rates for strongly convex problems. However, the best known convergence rate for general convex problems is O(1/T), as opposed to the O(1/T^2) rate of accelerated batch algorithms, where T is the number of iterations. Thus, there still remains a gap in convergence rates between existing stochastic ADMM and batch algorithms. To bridge this gap, we introduce the momentum acceleration trick for batch optimization into the stochastic variance reduced gradient based ADMM (SVRG-ADMM), which leads to an accelerated method (ASVRG-ADMM). We then design two different momentum term update rules for the strongly convex and general convex cases. We prove that ASVRG-ADMM converges linearly for strongly convex problems. Besides having a low per-iteration complexity as existing stochastic ADMM methods, ASVRG-ADMM improves the convergence rate on general convex problems from O(1/T) to O(1/T^2). Our experimental results show the effectiveness of ASVRG-ADMM.

## Introduction

In this paper, we consider a class of composite convex optimization problems

$$\min_{x\in\mathbb{R}^{d_1}} \; f(x) + h(Ax), \qquad (1)$$

where $A\in\mathbb{R}^{d_2\times d_1}$ is a given matrix, $f(x):=\frac{1}{n}\sum_{i=1}^{n} f_i(x)$, each $f_i(x)$ is a convex function, and $h(Ax)$ is convex but possibly non-smooth. With regard to $h(\cdot)$, we are interested in sparsity-inducing regularizers, e.g., the $\ell_1$-norm, group Lasso and the nuclear norm. When $A$ is an identity matrix, i.e., $A=I_{d_1}$, formulation (1) arises in many places in machine learning, statistics, and operations research (Bubeck 2015), such as logistic regression, Lasso and support vector machines (SVM).

We mainly focus on the large sample regime. In this regime, even first-order batch methods, e.g., FISTA (Beck and Teboulle 2009), become computationally burdensome due to their per-iteration complexity of $O(nd_1)$. As a result, stochastic gradient descent (SGD), with a per-iteration complexity of $O(d_1)$, has witnessed tremendous progress in recent years. In particular, a number of stochastic variance reduced gradient methods such as SAG (Roux, Schmidt, and Bach 2012), SDCA (Shalev-Shwartz and Zhang 2013) and SVRG (Johnson and Zhang 2013) have been proposed to successfully address the problem of the high variance of the gradient estimate in ordinary SGD, resulting in a linear convergence rate (for strongly convex problems) as opposed to the sub-linear rates of SGD. More recently, Nesterov's acceleration technique (Nesterov 2004) was introduced in (Allen-Zhu 2016; Hien et al. 2016) to further speed up the stochastic variance-reduced algorithms, which results in the best known convergence rates for both strongly convex and general convex problems. This motivates us to integrate the momentum acceleration trick into the stochastic alternating direction method of multipliers (ADMM) below.
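For concreteness (this worked instance is our own illustration, not text from the paper), two standard $A=I_{d_1}$ cases of (1) mentioned above, written with feature-label pairs $(a_i,b_i)$, are

$$\text{Lasso:}\quad f_i(x)=\tfrac{1}{2}\big(a_i^{T}x-b_i\big)^2,\qquad h(Ax)=\lambda\|x\|_1,\; A=I_{d_1};$$

$$\ell_1\text{-regularized logistic regression:}\quad f_i(x)=\log\big(1+\exp(-b_i a_i^{T}x)\big),\qquad h(Ax)=\lambda\|x\|_1,\; A=I_{d_1}.$$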
When $A$ is a more general matrix, i.e., $A\neq I_{d_1}$, formulation (1) covers many more complicated problems arising in machine learning, e.g., graph-guided fused Lasso (Kim, Sohn, and Xing 2009) and the generalized Lasso (Tibshirani and Taylor 2011). To solve this class of composite optimization problems, one can introduce an auxiliary variable $y=Ax$ and obtain a special case of the general ADMM form

$$\min_{x\in\mathbb{R}^{d_1},\,y\in\mathbb{R}^{d_2}} \; f(x)+h(y), \quad \text{s.t.}\;\; Ax+By=c. \qquad (2)$$

ADMM is an effective optimization tool for problems of the form (2) (Boyd et al. 2011), and has shown attractive performance in a wide range of real-world problems, such as big data classification (Nie et al. 2014). To tackle the issue of the high per-iteration complexity of batch (deterministic) ADMM (as a popular first-order optimization method), Wang and Banerjee (2012), Suzuki (2013) and Ouyang et al. (2013) proposed online or stochastic ADMM algorithms. However, these variants only achieve convergence rates of $O(1/\sqrt{T})$ for general convex problems and $O(\log T/T)$ for strongly convex problems, as compared with the $O(1/T^2)$ and linear convergence rates of accelerated batch algorithms (Nesterov 1983), e.g., FISTA, where $T$ is the number of iterations. By now several accelerated and faster converging versions of stochastic ADMM, all based on variance reduction techniques, have been proposed, e.g., SAG-ADMM (Zhong and Kwok 2014b), SDCA-ADMM (Suzuki 2014) and SVRG-ADMM (Zheng and Kwok 2016).

Table 1: Comparison of convergence rates and memory requirements of some stochastic ADMM algorithms.

| Algorithm  | General convex | Strongly convex | Space requirement   |
|------------|----------------|-----------------|---------------------|
| SAG-ADMM   | O(1/T)         | unknown         | O(d1 d2 + n d1)     |
| SDCA-ADMM  | unknown        | linear rate     | O(d1 d2 + n)        |
| SCAS-ADMM  | O(1/T)         | O(1/T)          | O(d1 d2)            |
| SVRG-ADMM  | O(1/T)         | linear rate     | O(d1 d2)            |
| ASVRG-ADMM | O(1/T^2)       | linear rate     | O(d1 d2)            |

With regard to strongly convex problems, Suzuki (2014) and Zheng and Kwok (2016) proved that linear convergence can be obtained for the special ADMM form (i.e., $B=-I_{d_2}$ and $c=0$) and the general ADMM form, respectively. In SAG-ADMM and SVRG-ADMM, an O(1/T) convergence rate can be guaranteed for general convex problems, which implies that there still remains a gap in convergence rates between stochastic ADMM and accelerated batch algorithms. To bridge this gap, we integrate the momentum acceleration trick in (Tseng 2010) for deterministic optimization into the stochastic variance reduced gradient (SVRG) based stochastic ADMM (SVRG-ADMM). Naturally, the proposed method has the same low per-iteration time complexity as existing stochastic ADMM algorithms, and does not require the storage of all gradients (or dual variables) as in SCAS-ADMM (Zhao, Li, and Zhou 2015) and SVRG-ADMM (Zheng and Kwok 2016), as shown in Table 1. We summarize our main contributions below.

- We propose an accelerated variance reduced stochastic ADMM (ASVRG-ADMM) method, which integrates both the momentum acceleration trick in (Tseng 2010) for batch optimization and the variance reduction technique of SVRG (Johnson and Zhang 2013).
- We prove that ASVRG-ADMM achieves a linear convergence rate for strongly convex problems, which is consistent with the best known results for SDCA-ADMM (Suzuki 2014) and SVRG-ADMM (Zheng and Kwok 2016). We also prove that ASVRG-ADMM has a convergence rate of O(1/T^2) for non-strongly convex problems, which is a factor of T faster than SAG-ADMM and SVRG-ADMM, whose convergence rates are O(1/T).
- Our experimental results further verify that ASVRG-ADMM has much better performance than the state-of-the-art stochastic ADMM methods.
## Related Work

Introducing $y=Ax\in\mathbb{R}^{d_2}$, problem (1) becomes

$$\min_{x\in\mathbb{R}^{d_1},\,y\in\mathbb{R}^{d_2}} \; f(x)+h(y), \quad \text{s.t.}\;\; Ax-y=0. \qquad (3)$$

Although (3) is only a special case of the general ADMM form (2), with $B=-I_{d_2}$ and $c=0$, the stochastic (or online) ADMM algorithms and theoretical results in (Wang and Banerjee 2012; Ouyang et al. 2013; Zhong and Kwok 2014b; Zheng and Kwok 2016) and in this paper all apply to the more general problem (2). To minimize (2), together with the dual variable $\lambda$, the update steps of batch ADMM are

$$y_k=\arg\min_{y}\; h(y)+\tfrac{\beta}{2}\|Ax_{k-1}+By-c+\lambda_{k-1}\|^2, \qquad (4)$$

$$x_k=\arg\min_{x}\; f(x)+\tfrac{\beta}{2}\|Ax+By_k-c+\lambda_{k-1}\|^2, \qquad (5)$$

$$\lambda_k=\lambda_{k-1}+Ax_k+By_k-c, \qquad (6)$$

where $\beta>0$ is a penalty parameter. To extend batch ADMM to the online and stochastic settings, the update steps for $y_k$ and $\lambda_k$ remain unchanged. In (Wang and Banerjee 2012; Ouyang et al. 2013), the update step for $x_k$ is approximated as follows:

$$x_k=\arg\min_{x}\; x^{T}\nabla f_{i_k}(x_{k-1})+\tfrac{1}{2\eta_k}\|x-x_{k-1}\|_G^2+\tfrac{\beta}{2}\|Ax+By_k-c+\lambda_{k-1}\|^2, \qquad (7)$$

where $i_k$ is drawn uniformly at random from $[n]:=\{1,\dots,n\}$, $\eta_k\propto1/\sqrt{k}$ is the step-size, and $\|z\|_G^2=z^{T}Gz$ with a given positive semi-definite matrix $G$, e.g., $G=I_{d_1}$ in (Ouyang et al. 2013). Analogous to SGD, these stochastic ADMM variants use an unbiased estimate of the gradient at each iteration. However, all of these algorithms have much slower convergence rates than their batch counterpart, as mentioned above. This barrier is mainly due to the variance introduced by the stochasticity of the gradients. Besides, to guarantee convergence, they employ a decaying sequence of step sizes $\eta_k$, which in turn impacts the rates.

More recently, a number of variance reduced stochastic ADMM methods (e.g., SAG-ADMM, SDCA-ADMM and SVRG-ADMM) have been proposed and have made exciting progress such as linear convergence rates. SVRG-ADMM (Zheng and Kwok 2016) is particularly attractive here because of its low storage requirement compared with the algorithms in (Zhong and Kwok 2014b; Suzuki 2014). Within each epoch of SVRG-ADMM, the full gradient $\tilde{p}=\nabla f(\tilde{x})$ is first computed, where $\tilde{x}$ is the average point of the previous epoch. Then $\nabla f_{i_k}(x_{k-1})$ and $\eta_k$ in (7) are replaced by

$$\widetilde{\nabla} f_{I_k}(x_{k-1})=\frac{1}{|I_k|}\sum_{i_k\in I_k}\big(\nabla f_{i_k}(x_{k-1})-\nabla f_{i_k}(\tilde{x})\big)+\tilde{p} \qquad (8)$$

and a constant step-size $\eta$, respectively, where $I_k\subseteq[n]$ is a mini-batch of size $b$ (a useful technique to reduce the variance). In fact, $\widetilde{\nabla} f_{I_k}(x_{k-1})$ is an unbiased estimator of the gradient $\nabla f(x_{k-1})$, i.e., $\mathbb{E}[\widetilde{\nabla} f_{I_k}(x_{k-1})]=\nabla f(x_{k-1})$.
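As a quick sanity check on the estimator (8), the following minimal NumPy sketch builds it for a toy finite sum; the quadratic $f_i$, function names and sizes are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def svrg_gradient(grads, x, x_snap, full_grad_snap, batch_idx):
    """Variance-reduced mini-batch gradient, cf. Eq. (8):
    (1/|I_k|) * sum_{i in I_k} (grad_i(x) - grad_i(x_snap)) + full_grad_snap."""
    diff = np.mean([grads[i](x) - grads[i](x_snap) for i in batch_idx], axis=0)
    return diff + full_grad_snap

# Toy finite sum f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2, for demonstration only.
rng = np.random.default_rng(0)
n, d = 100, 5
A_feat, b_lab = rng.standard_normal((n, d)), rng.standard_normal(n)
grads = [lambda x, a=A_feat[i], y=b_lab[i]: (a @ x - y) * a for i in range(n)]

x_snap = rng.standard_normal(d)                                # snapshot point of the last epoch
full_grad_snap = np.mean([g(x_snap) for g in grads], axis=0)   # full gradient at the snapshot
x = x_snap + 0.1 * rng.standard_normal(d)

batch = rng.choice(n, size=10, replace=False)
est = svrg_gradient(grads, x, x_snap, full_grad_snap, batch)
exact = np.mean([g(x) for g in grads], axis=0)
print(np.linalg.norm(est - exact))   # small, and zero in expectation over the mini-batch
```

Averaging the estimator over all mini-batches of size $b$ recovers $\nabla f(x_{k-1})$ exactly, which is the unbiasedness property used throughout the analysis.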
## Accelerated Variance Reduced Stochastic ADMM

In this section, we design an accelerated variance reduced stochastic ADMM method for both strongly convex and general convex problems. We first make the following assumptions:

- Each convex $f_i(\cdot)$ is $L_i$-smooth, i.e., there exists a constant $L_i>0$ such that $\|\nabla f_i(x)-\nabla f_i(y)\|\le L_i\|x-y\|$ for all $x,y\in\mathbb{R}^{d_1}$, and $L\equiv\max_i L_i$.
- $f(\cdot)$ is $\mu$-strongly convex, i.e., there is $\mu>0$ such that $f(x)\ge f(y)+\nabla f(y)^{T}(x-y)+\tfrac{\mu}{2}\|x-y\|^2$ for all $x,y\in\mathbb{R}^{d_1}$.
- The matrix $A$ has full row rank.

The first two assumptions are common in the analysis of first-order optimization methods, while the last one has been used in the convergence analysis of batch ADMM (Shang et al. 2014; Nishihara et al. 2015; Deng and Yin 2016) and stochastic ADMM (Zheng and Kwok 2016).

### The Strongly Convex Case

In this part, we consider the case of (2) in which each $f_i(\cdot)$ is convex and $L$-smooth, and $f(\cdot)$ is $\mu$-strongly convex. Recall that this class of problems includes graph-guided logistic regression and graph-guided SVM as notable examples. To efficiently solve this class of problems, we incorporate both the momentum acceleration and variance reduction techniques into stochastic ADMM. Our algorithm is divided into $T$ epochs, and each epoch consists of $m$ stochastic updates, where $m$ is usually chosen to be $O(n)$ as in (Johnson and Zhang 2013).

Let $z$ be an important auxiliary variable; its update rule is given as follows. Similar to (Zhong and Kwok 2014b; Zheng and Kwok 2016), we also use the inexact Uzawa method (Zhang, Burger, and Osher 2011) to approximate the sub-problem (7), which avoids computing the inverse of the matrix $(\tfrac{1}{\eta}I_{d_1}+\beta A^{T}A)$. Moreover, the momentum weight $0\le\theta_s\le1$ (the update rule for $\theta_s$ is provided below) is introduced into a proximal term similar to $\tfrac{1}{2\eta}\|x-x_{k-1}\|_G^2$ of (7), and then the sub-problem with respect to $z$ is formulated as follows:

$$\min_{z}\;(z-z_{k-1}^{s})^{T}\widetilde{\nabla} f_{I_k}(x_{k-1}^{s})+\frac{\theta_{s-1}}{2\eta}\|z-z_{k-1}^{s}\|_G^2+\frac{\beta}{2}\|Az+By_k^{s}-c+\lambda_{k-1}^{s}\|^2, \qquad (9)$$

where $\widetilde{\nabla} f_{I_k}(x_{k-1}^{s})$ is defined as in (8), $\eta<\tfrac{1}{2L}$, and $G=\gamma I_{d_1}-\tfrac{\eta\beta}{\theta_{s-1}}A^{T}A$ with $\gamma\ge\gamma_{\min}\equiv\tfrac{\eta\beta\|A^{T}A\|_2}{\theta_{s-1}}+1$ to ensure that $G\succeq I$, similar to (Zheng and Kwok 2016), where $\|\cdot\|_2$ is the spectral norm, i.e., the largest singular value of the matrix. Furthermore, the update rule for $x$ is given by

$$x_k^{s}=\tilde{x}_{s-1}+\theta_{s-1}(z_k^{s}-\tilde{x}_{s-1})=(1-\theta_{s-1})\tilde{x}_{s-1}+\theta_{s-1}z_k^{s}, \qquad (10)$$

where $\theta_{s-1}(z_k^{s}-\tilde{x}_{s-1})$ is the key momentum term (similar to those in accelerated batch methods (Nesterov 2004)), which helps accelerate our algorithm by using the iterate of the previous epoch, i.e., $\tilde{x}_{s-1}$. Similar to $x_k^{s}$, we set $\tilde{y}_s=(1-\theta_{s-1})\tilde{y}_{s-1}+\tfrac{\theta_{s-1}}{m}\sum_{k=1}^{m}y_k^{s}$. Moreover, $\theta_s$ can be set to a constant $\theta$ in all epochs of our algorithm, which must satisfy $0\le\theta\le1-\delta(b)/(\alpha-1)$, where $\alpha=\tfrac{1}{L\eta}>1+\delta(b)$ and $\delta(b)$ is defined below. The optimal value of $\theta$ is provided in Proposition 1 below.

The detailed procedure is shown in Algorithm 1, where we adopt the same initialization technique for $\tilde{\lambda}_s$ as in (Zheng and Kwok 2016), and $(\cdot)^{\dagger}$ is the pseudo-inverse. Note that, when $\theta=1$, ASVRG-ADMM degenerates to SVRG-ADMM (Zheng and Kwok 2016).

**Algorithm 1: ASVRG-ADMM for the strongly convex case**

*Input:* $m$, $\eta$, $\beta>0$, $1\le b\le n$. *Initialize:* $\tilde{x}_0=\tilde{z}_0$, $\tilde{y}_0$, $\theta$, $\tilde{\lambda}_0=-\tfrac{1}{\beta}(A^{T})^{\dagger}\nabla f(\tilde{x}_0)$.

1. **for** $s=1,2,\dots,T$ **do**
   1. $x_0^s=z_0^s=\tilde{x}_{s-1}$, $y_0^s=\tilde{y}_{s-1}$, $\lambda_0^s=\tilde{\lambda}_{s-1}$;
   2. $\tilde{p}=\nabla f(\tilde{x}_{s-1})$;
   3. **for** $k=1,2,\dots,m$ **do**
      1. Choose $I_k\subseteq[n]$ of size $b$ uniformly at random;
      2. $y_k^s=\arg\min_y h(y)+\tfrac{\beta}{2}\|Az_{k-1}^s+By-c+\lambda_{k-1}^s\|^2$;
      3. $z_k^s=z_{k-1}^s-\eta\big(\widetilde{\nabla} f_{I_k}(x_{k-1}^s)+\beta A^{T}(Az_{k-1}^s+By_k^s-c+\lambda_{k-1}^s)\big)/(\gamma\theta)$;
      4. $x_k^s=(1-\theta)\tilde{x}_{s-1}+\theta z_k^s$;
      5. $\lambda_k^s=\lambda_{k-1}^s+Az_k^s+By_k^s-c$;
   4. **end for**
   5. $\tilde{x}_s=\tfrac{1}{m}\sum_{k=1}^{m}x_k^s$, $\tilde{y}_s=(1-\theta)\tilde{y}_{s-1}+\tfrac{\theta}{m}\sum_{k=1}^{m}y_k^s$, $\tilde{\lambda}_s=-\tfrac{1}{\beta}(A^{T})^{\dagger}\nabla f(\tilde{x}_s)$;
2. **end for**

*Output:* $\tilde{x}_T$, $\tilde{y}_T$.

### The Non-Strongly Convex Case

In this part, we consider general convex problems of the form (2) in which each $f_i(\cdot)$ is convex and $L$-smooth, $f(\cdot)$ is not necessarily strongly convex, and $h(\cdot)$ is possibly non-smooth. Different from the strongly convex case, the momentum weight $\theta_s$ is required to satisfy the following inequalities:

$$\frac{1-\theta_s}{\theta_s^{2}}\le\frac{1}{\theta_{s-1}^{2}} \quad\text{and}\quad 0\le\theta_s\le1-\frac{\delta(b)}{\alpha-1}, \qquad (11)$$

where $\delta(b):=\tfrac{n-b}{b(n-1)}$ is a decreasing function with respect to the mini-batch size $b$. Condition (11) allows the momentum weight to decrease, but not too fast, similar to the requirement on the step-size $\eta_k$ in classical SGD and stochastic ADMM (Tseng 1998). Unlike batch acceleration methods, the weight must satisfy both inequalities in (11). Motivated by the momentum acceleration techniques in (Tseng 2010; Nesterov 2004) for batch optimization, we give the following update rule for the weight $\theta_s$ in the mini-batch case:

$$\theta_s=\frac{\sqrt{\theta_{s-1}^{4}+4\theta_{s-1}^{2}}-\theta_{s-1}^{2}}{2} \quad\text{with}\quad \theta_0=1-\frac{\delta(b)}{\alpha-1}. \qquad (12)$$

For the special case of $b=1$, we have $\delta(1)=1$ and $\theta_0=1-\tfrac{1}{\alpha-1}$, while for $b=n$ (i.e., the batch version), $\delta(n)=0$ and $\theta_0=1$. Since $\{\theta_s\}$ is decreasing, $\theta_s\le1-\delta(b)/(\alpha-1)$ is satisfied for all $s$.
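The schedule (12) is easy to check numerically. The short sketch below (illustrative Python, not the authors' code) iterates (12) from $\theta_0=1-\delta(b)/(\alpha-1)$ and verifies that each step satisfies the first inequality of (11) with equality and that the weights decrease.

```python
def delta(b, n):
    """delta(b) = (n - b) / (b * (n - 1)), the mini-batch variance factor."""
    return (n - b) / (b * (n - 1))

def theta_schedule(L, eta, b, n, num_epochs):
    """Momentum weights of Eq. (12), starting from theta_0 = 1 - delta(b)/(alpha - 1)."""
    alpha = 1.0 / (L * eta)                      # alpha = 1/(L*eta) > 1 + delta(b)
    theta = 1.0 - delta(b, n) / (alpha - 1.0)    # theta_0
    thetas = [theta]
    for _ in range(num_epochs):
        theta = ((theta**4 + 4 * theta**2) ** 0.5 - theta**2) / 2.0
        thetas.append(theta)
    return thetas

ts = theta_schedule(L=1.0, eta=1.0 / 8, b=10, n=1000, num_epochs=5)
for prev, cur in zip(ts, ts[1:]):
    # Eq. (11): (1 - theta_s)/theta_s^2 <= 1/theta_{s-1}^2 holds with equality for rule (12),
    # and the sequence is decreasing.
    assert abs((1 - cur) / cur**2 - 1 / prev**2) < 1e-10 and cur < prev
print([round(t, 4) for t in ts])
```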
The detailed procedure is shown in Algorithm 2, which differs slightly from Algorithm 1 in the initialization and the output of each epoch. In addition, the key difference between them is the update rule for the momentum weight $\theta_s$: $\theta_s$ in Algorithm 1 can be set to a constant, while that in Algorithm 2 is adaptively adjusted as in (12).

**Algorithm 2: ASVRG-ADMM for the general convex case**

*Input:* $m$, $\eta$, $\beta>0$, $1\le b\le n$. *Initialize:* $\tilde{x}_0=\tilde{z}_0$, $\tilde{y}_0$, $\tilde{\lambda}_0$, $\theta_0=1-\tfrac{L\eta\,\delta(b)}{1-L\eta}$.

1. **for** $s=1,2,\dots,T$ **do**
   1. $x_0^s=(1-\theta_{s-1})\tilde{x}_{s-1}+\theta_{s-1}\tilde{z}_{s-1}$, $y_0^s=\tilde{y}_{s-1}$, $\lambda_0^s=\tilde{\lambda}_{s-1}$;
   2. $\tilde{p}=\nabla f(\tilde{x}_{s-1})$, $z_0^s=\tilde{z}_{s-1}$;
   3. **for** $k=1,2,\dots,m$ **do**
      1. Choose $I_k\subseteq[n]$ of size $b$ uniformly at random;
      2. $y_k^s=\arg\min_y h(y)+\tfrac{\beta}{2}\|Az_{k-1}^s+By-c+\lambda_{k-1}^s\|^2$;
      3. $z_k^s=z_{k-1}^s-\eta\big(\widetilde{\nabla} f_{I_k}(x_{k-1}^s)+\beta A^{T}(Az_{k-1}^s+By_k^s-c+\lambda_{k-1}^s)\big)/(\gamma\theta_{s-1})$;
      4. $x_k^s=(1-\theta_{s-1})\tilde{x}_{s-1}+\theta_{s-1}z_k^s$;
      5. $\lambda_k^s=\lambda_{k-1}^s+Az_k^s+By_k^s-c$;
   4. **end for**
   5. $\tilde{x}_s=\tfrac{1}{m}\sum_{k=1}^{m}x_k^s$, $\tilde{y}_s=(1-\theta_{s-1})\tilde{y}_{s-1}+\tfrac{\theta_{s-1}}{m}\sum_{k=1}^{m}y_k^s$,
   6. $\tilde{\lambda}_s=\lambda_m^s$, $\tilde{z}_s=z_m^s$, $\theta_s=\tfrac{\sqrt{\theta_{s-1}^{4}+4\theta_{s-1}^{2}}-\theta_{s-1}^{2}}{2}$;
2. **end for**

*Output:* $\tilde{x}_T$, $\tilde{y}_T$.

## Convergence Analysis

This section provides the convergence analysis of our ASVRG-ADMM algorithms (i.e., Algorithms 1 and 2) for strongly convex and general convex problems, respectively. Following (Zheng and Kwok 2016), we first introduce the function

$$P(x,y):=f(x)-f(x^{*})-\nabla f(x^{*})^{T}(x-x^{*})+h(y)-h(y^{*})-h'(y^{*})^{T}(y-y^{*})$$

as a convergence criterion, where $(x^{*},y^{*})$ is an optimal solution of problem (2) and $h'(y)$ denotes a (sub)gradient of $h(\cdot)$ at $y$. Indeed, $P(x,y)\ge0$ for all $x$ and $y$. In the following, we give the intermediate key results for our analysis.

**Lemma 1.** The variance of the stochastic gradient estimator (8) satisfies

$$\mathbb{E}\big[\|\widetilde{\nabla} f_{I_k}(x_{k-1}^{s})-\nabla f(x_{k-1}^{s})\|^2\big]\le 2L\,\delta(b)\big[f(\tilde{x}_{s-1})-f(x_{k-1}^{s})+(x_{k-1}^{s}-\tilde{x}_{s-1})^{T}\nabla f(x_{k-1}^{s})\big],$$

where $\delta(b)=\tfrac{n-b}{b(n-1)}\le1$ and $1\le b\le n$.

**Lemma 2.** Using the same notation as in Lemma 1, let $(x^{*},y^{*},\lambda^{*})$ denote an optimal solution of problem (2), and let $\{(z_k^s,x_k^s,y_k^s,\lambda_k^s,\tilde{x}_s,\tilde{y}_s)\}$ be the sequence generated by Algorithm 1 or 2 with $\theta_s\le1-\delta(b)/(\alpha-1)$, where $\alpha=\tfrac{1}{L\eta}$. Then, for each epoch $s$, $P(\tilde{x}_s,\tilde{y}_s)$, together with the coupling terms $(x^{*}-z_k^s)^{T}A^{T}\phi_k^s+(y^{*}-y_k^s)^{T}B^{T}\phi_k^s$ with $\phi_k^s=\beta(\lambda_k^s-\lambda^{*})$, is bounded by $(1-\theta_{s-1})P(\tilde{x}_{s-1},\tilde{y}_{s-1})$ plus telescoping terms involving $\theta_{s-1}^{2}\big(\|x^{*}-z_0^s\|_G^{2}-\|x^{*}-z_m^s\|_G^{2}\big)$, $\|Az_0^s-Ax^{*}\|^{2}-\|Az_m^s-Ax^{*}\|^{2}$, and $\sum_{k=1}^{m}\|\lambda_k^s-\lambda_{k-1}^s\|^{2}$.

The detailed proofs of Lemmas 1 and 2 are provided in the Supplementary Material.

### Linear Convergence

Our first main result is the following theorem, which gives the convergence rate of Algorithm 1.

**Theorem 1.** Using the same notation as in Lemma 2 with a given $\theta\le1-\delta(b)/(\alpha-1)$, suppose $f(\cdot)$ is $\mu$-strongly convex and $L_f$-smooth, and $m$ is sufficiently large so that

$$\rho=\frac{\theta\,\|\theta G+\eta\beta A^{T}A\|_2}{\mu m\eta}+(1-\theta)+\frac{L_f\,\theta}{\beta m\,\sigma_{\min}(AA^{T})}<1, \qquad (13)$$

where $\sigma_{\min}(AA^{T})$ is the smallest eigenvalue of the positive semi-definite matrix $AA^{T}$, and $G$ is defined as in (9). Then

$$\mathbb{E}\big[P(\tilde{x}_T,\tilde{y}_T)\big]\le\rho^{T}\,P(\tilde{x}_0,\tilde{y}_0).$$

The proof of Theorem 1 is provided in the Supplementary Material. From Theorem 1, one can see that ASVRG-ADMM achieves linear convergence, which is consistent with that of SVRG-ADMM, while SCAS-ADMM has only an O(1/T) convergence rate.

**Remark 1.** Theorem 1 shows that our result improves slightly upon the rate $\rho$ in (Zheng and Kwok 2016) with the same $\eta$ and $\beta$. Specifically, as shown in (13), $\rho$ consists of three components, corresponding to those of Theorem 1 in (Zheng and Kwok 2016). In Algorithm 1, recall that $\theta\le1$ and $G$ is defined as in (9). Thus, both the first and third terms in (13) are slightly smaller than those of Theorem 1 in (Zheng and Kwok 2016). In addition, one can set $\eta=1/(8L)$ (i.e., $\alpha=8$) and $\theta=1-\delta(b)/(\alpha-1)=1-\delta(b)/7$. Thus, the second term in (13) equals $\delta(b)/7$, while that of SVRG-ADMM is approximately $4L\eta\delta(b)/(1-4L\eta\delta(b))\ge\delta(b)/2$.
In summary, the convergence bound of SVRG-ADMM can be slightly improved by ASVRG-ADMM.

### Selecting Scheme of θ

The rate $\rho$ in (13) of Theorem 1 can be expressed as a function of the parameters $\theta$ and $\beta$ for given $m$, $\eta$, $L_f$, $L$, $A$ and $\mu$. Similar to (Nishihara et al. 2015; Zheng and Kwok 2016), one can obtain the optimal parameter $\beta^{*}=\sqrt{L_f\mu/(\sigma_{\min}(AA^{T})\,\|A^{T}A\|_2)}$, which produces a smaller rate $\rho$. In addition, as shown in (13), all three terms depend on the weight $\theta$. Therefore, we give the following selecting scheme for $\theta$.

**Proposition 1.** Given $\kappa_f=L_f/\mu$, $\beta^{*}$, $\kappa=L/\mu$, $b$ and $A$, let $\omega=\|A^{T}A\|_2/\sigma_{\min}(AA^{T})$, and set $m>2\kappa+2\sqrt{\kappa_f\omega}$ and $\eta=1/(L\alpha)$, where $\alpha=\tfrac{m-2\sqrt{\kappa_f\omega}}{2\kappa}+\delta(b)+1$. Then the optimal $\theta$ of Algorithm 1 is given by

$$\theta^{*}=\frac{m-2\sqrt{\kappa_f\omega}}{m-2\sqrt{\kappa_f\omega}+2\kappa(\delta(b)+1)}.$$

The proof of Proposition 1 is provided in the Supplementary Material.

### Convergence Rate of O(1/T^2)

We first assume that $z\in\mathcal{Z}$, where $\mathcal{Z}$ is a convex compact set with diameter $D_{\mathcal{Z}}=\sup_{z_1,z_2\in\mathcal{Z}}\|z_1-z_2\|$, and that the dual variable $\lambda$ is also bounded, with $D_{\lambda}=\sup_{\lambda_1,\lambda_2}\|\lambda_1-\lambda_2\|$. For Algorithm 2, we give the following result.

**Theorem 2.** Using the same notation as in Lemma 2 with $\theta_0=1-\delta(b)/(\alpha-1)$, we have

$$\mathbb{E}\big[P(\tilde{x}_T,\tilde{y}_T)+\gamma\|A\tilde{x}_T+B\tilde{y}_T-c\|\big]\le\frac{4(\alpha-1)\delta(b)\big[P(\tilde{x}_0,\tilde{y}_0)+\gamma\|A\tilde{x}_0+B\tilde{y}_0-c\|\big]}{(\alpha-1-\delta(b))^{2}(T+1)^{2}}+\frac{2L\alpha\|x^{*}-\tilde{x}_0\|_G^{2}}{m(T+1)^{2}}+\frac{2\beta\|A^{T}A\|_2D_{\mathcal{Z}}^{2}+4D_{\lambda}^{2}}{m(T+1)}. \qquad (14)$$

The proof of Theorem 2 is provided in the Supplementary Material. Theorem 2 shows that the convergence bound consists of three components, which converge as $O(1/T^{2})$, $O(1/(mT^{2}))$ and $O(1/(mT))$, respectively, while the three components of SVRG-ADMM converge as $O(1/T)$, $O(1/(mT))$ and $O(1/(mT))$. Clearly, ASVRG-ADMM achieves a convergence rate of $O(1/T^{2})$, as opposed to the $O(1/T)$ rate of SVRG-ADMM and SAG-ADMM. All components in the convergence bound of SCAS-ADMM converge as $O(1/T)$. Thus, it is clear from this comparison that ASVRG-ADMM is a factor of $T$ faster than SAG-ADMM, SVRG-ADMM and SCAS-ADMM.

### Connections to Related Work

Our algorithms and convergence results can be extended to the following settings. When the mini-batch size is $b=n$ and $m=1$, then $\delta(n)=0$, that is, the first term of (14) vanishes, and ASVRG-ADMM degenerates to its batch version. Its convergence rate becomes $O\big(D_x^{2}/(T+1)^{2}+D_{\mathcal{Z}}^{2}/(T+1)+D_{\lambda}^{2}/(T+1)\big)$, which is consistent with the optimal result for accelerated deterministic ADMM methods (Goldstein et al. 2014; Lu et al. 2016), where $D_x=\|x^{*}-\tilde{x}_0\|_G$. Many empirical risk minimization problems can be viewed as the special case of (1) with $A=I$. Thus, our method can be extended to solve them, and has an $O(1/T^{2}+1/(mT^{2}))$ rate, which is consistent with the best known results in (Allen-Zhu 2016; Hien et al. 2016).

## Experiments

In this section, we use our ASVRG-ADMM method to solve the general convex graph-guided fused Lasso problem and the strongly convex graph-guided logistic regression and graph-guided SVM problems.
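Before turning to the comparisons, the following minimal NumPy sketch shows how one epoch of Algorithm 2 specializes to problem (3) (i.e., $B=-I$, $c=0$) with logistic $f_i$ and $h=\lambda_1\|\cdot\|_1$, so that the $y$-update reduces to soft-thresholding. It reflects our reading of the pseudocode rather than the authors' MATLAB implementation; in particular, the per-epoch choice $\gamma=\eta\beta\|A^{T}A\|_2/\theta+1$ and all function and variable names are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: elementwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def logistic_grad(X, y, w, idx):
    """Average gradient of log(1+exp(-y_i x_i^T w)) over the rows in idx."""
    Xi, yi = X[idx], y[idx]
    s = -yi / (1.0 + np.exp(yi * (Xi @ w)))
    return Xi.T @ s / len(idx)

def asvrg_admm_gc(X, y, A, lam1, T=10, m=None, b=10, eta=None, beta=1.0, seed=0):
    """Sketch of Algorithm 2 for min_x (1/n) sum_i log(1+exp(-y_i x_i^T x)) + lam1*||Ax||_1,
    written as f(x) + h(y_var) s.t. Ax - y_var = 0 (problem (3))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or 2 * n // b                                  # epoch length, as in the experiments
    L = 0.25 * np.max(np.sum(X**2, axis=1))              # crude smoothness bound for logistic loss
    eta = eta or 1.0 / (8.0 * L)
    alpha = 1.0 / (L * eta)
    delta_b = (n - b) / (b * (n - 1))
    theta = 1.0 - delta_b / (alpha - 1.0)                # theta_0 from Eq. (12)
    normAtA = np.linalg.norm(A.T @ A, 2)

    x_t, z_t = np.zeros(d), np.zeros(d)                  # snapshot x and z
    y_t, lam_t = A @ x_t, np.zeros(A.shape[0])           # snapshot y and dual variable
    for s in range(T):
        gamma = eta * beta * normAtA / theta + 1.0       # assumed choice ensuring G >= I
        z, yv, lam = z_t.copy(), y_t.copy(), lam_t.copy()
        xk = (1 - theta) * x_t + theta * z_t             # x_0 of this epoch
        p = logistic_grad(X, y, x_t, np.arange(n))       # full gradient at the snapshot
        x_sum, y_sum = np.zeros(d), np.zeros_like(yv)
        for k in range(m):
            idx = rng.choice(n, size=b, replace=False)
            yv = soft_threshold(A @ z + lam, lam1 / beta)                          # y-update (prox of h)
            g = logistic_grad(X, y, xk, idx) - logistic_grad(X, y, x_t, idx) + p   # Eq. (8)
            z = z - eta * (g + beta * A.T @ (A @ z - yv + lam)) / (gamma * theta)  # z-update
            xk = (1 - theta) * x_t + theta * z                                     # momentum step, Eq. (10)
            lam = lam + A @ z - yv                                                 # dual update
            x_sum += xk; y_sum += yv
        x_t = x_sum / m
        y_t = (1 - theta) * y_t + theta * y_sum / m
        z_t, lam_t = z, lam
        theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2.0                # Eq. (12)
    return x_t
```

In the experiments reported below, the paper instead fixes $\gamma=1$ and $m=2n/b$; the conservative $\gamma$ in the sketch only mirrors the condition $G\succeq I$ stated after (9).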
We compare ASVRG-ADMM with the following state-of-the-art methods: STOC-ADMM (Ouyang et al. 2013), OPG-ADMM (Suzuki 2013), SAG-ADMM (Zhong and Kwok 2014b), SCAS-ADMM (Zhao, Li, and Zhou 2015) and SVRG-ADMM (Zheng and Kwok 2016). All methods were implemented in MATLAB, and the experiments were performed on a PC with an Intel i5-2400 CPU and 16GB RAM.

### Graph-Guided Fused Lasso

We first evaluate the empirical performance of the proposed method for solving the graph-guided fused Lasso problem:

$$\min_{x}\;\frac{1}{n}\sum_{i=1}^{n}\ell_i(x)+\lambda_1\|Ax\|_1, \qquad (15)$$

where $\ell_i$ is the logistic loss on the feature-label pair $(a_i,b_i)$, i.e., $\log(1+\exp(-b_ia_i^{T}x))$, and $\lambda_1\ge0$ is the regularization parameter. Here, we set $A=[G;I]$ as in (Ouyang et al. 2013; Zhong and Kwok 2014b; Azadi and Sra 2014; Zheng and Kwok 2016), where $G$ is the sparsity pattern of the graph obtained by sparse inverse covariance selection (Banerjee, Ghaoui, and d'Aspremont 2008). We used four publicly available data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) in our experiments, as listed in Table 2.

Table 2: Summary of data sets and regularization parameters used in our experiments.

| Data set | #training | #test     | mini-batch $b$ | $\lambda_1$ | $\lambda_2$ |
|----------|-----------|-----------|----------------|-------------|-------------|
| a9a      | 16,281    | 16,280    | 20             | 1e-5        | 1e-2        |
| w8a      | 32,350    | 32,350    | 20             | 1e-5        | 1e-2        |
| SUSY     | 3,500,000 | 1,500,000 | 100            | 1e-5        | 1e-2        |
| HIGGS    | 7,700,000 | 3,300,000 | 150            | 1e-5        | 1e-2        |

Note that, except for STOC-ADMM, all the other algorithms adopt the linearization of the penalty term $\tfrac{\beta}{2}\|Ax-y+z\|^2$ to avoid the inversion of $\tfrac{1}{\eta_k}I_{d_1}+\beta A^{T}A$ at each iteration, which can be computationally expensive for large matrices. The parameters of ASVRG-ADMM are set as follows: $m=2n/b$ and $\gamma=1$ as in (Zhong and Kwok 2014b; Zheng and Kwok 2016), as well as suitable $\eta$ and $\beta$.

Figure 1 shows the training error (i.e., the training objective value minus the minimum) and the testing loss of all the algorithms for this general convex problem on the four data sets. SAG-ADMM could not generate experimental results on the HIGGS data set because it ran out of memory.

Figure 1: Comparison of different stochastic ADMM methods for graph-guided fused Lasso problems on the four data sets. The y-axis represents the objective value minus the minimum (top) or the testing loss (bottom), and the x-axis corresponds to the running time (seconds). (Compared methods: STOC-ADMM, OPG-ADMM, SAG-ADMM, SCAS-ADMM, SVRG-ADMM, ASVRG-ADMM.)

These figures clearly indicate that the variance reduced stochastic ADMM algorithms (including SAG-ADMM, SCAS-ADMM, SVRG-ADMM and ASVRG-ADMM) converge much faster than those without variance reduction techniques, e.g., STOC-ADMM and OPG-ADMM. Notably, ASVRG-ADMM consistently outperforms all other algorithms in terms of the convergence rate under all settings, which empirically verifies our theoretical result that ASVRG-ADMM has a faster convergence rate of O(1/T^2), as opposed to the best known rate of O(1/T).

### Graph-Guided Logistic Regression

We further discuss the performance of ASVRG-ADMM for solving the strongly convex graph-guided logistic regression problem (Ouyang et al. 2013; Zhong and Kwok 2014a):

$$\min_{x}\;\frac{1}{n}\sum_{i=1}^{n}\ell_i(x)+\frac{\lambda_2}{2}\|x\|^2+\lambda_1\|Ax\|_1. \qquad (16)$$

Due to limited space and similar experimental phenomena on the four data sets, we only report the experimental results on the a9a and w8a data sets in Figure 2.

Figure 2: Comparison of different stochastic ADMM methods for graph-guided logistic regression problems on the two data sets: a9a (top) and w8a (bottom).

From Figure 2 we observe that SVRG-ADMM and ASVRG-ADMM achieve comparable performance, and they significantly outperform the other methods in terms of the convergence rate, which is consistent with their linear (geometric) convergence guarantees. Moreover, ASVRG-ADMM converges slightly faster than SVRG-ADMM, which shows the effectiveness of the momentum trick in accelerating variance reduced stochastic ADMM, as we expected.
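For readers reproducing the $A=[G;I]$ construction used in both experiments above, the sketch below assembles such an $A$ from a given edge list and evaluates objective (15). The $\pm1$ edge-difference encoding of $G$ and all names are illustrative assumptions; the paper only specifies that $G$ is the sparsity pattern selected by sparse inverse covariance estimation.

```python
import numpy as np

def build_graph_guided_A(d, edges):
    """Stack A = [G; I]: one +1/-1 row per graph edge (j, k), then the identity."""
    G = np.zeros((len(edges), d))
    for r, (j, k) in enumerate(edges):
        G[r, j], G[r, k] = 1.0, -1.0
    return np.vstack([G, np.eye(d)])

def gg_fused_lasso_objective(X, y, x, A, lam1):
    """Objective (15): average logistic loss plus lam1 * ||A x||_1."""
    logits = y * (X @ x)
    return np.mean(np.log1p(np.exp(-logits))) + lam1 * np.linalg.norm(A @ x, 1)

# Tiny illustration: 5 features and a path graph 0-1-2-3-4.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))
A = build_graph_guided_A(5, edges=[(0, 1), (1, 2), (2, 3), (3, 4)])
print(A.shape, gg_fused_lasso_objective(X, y, np.zeros(5), A, lam1=1e-5))
```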
### Graph-Guided SVM

Finally, we evaluate the performance of ASVRG-ADMM for solving the graph-guided SVM problem:

$$\min_{x}\;\frac{1}{n}\sum_{i=1}^{n}\big[1-b_ia_i^{T}x\big]_{+}+\frac{\lambda_2}{2}\|x\|^2+\lambda_1\|Ax\|_1, \qquad (17)$$

where $[x]_{+}=\max(0,x)$ is the non-smooth hinge loss. To effectively solve problem (17), we use the smooth Huberized hinge loss in (Rosset and Zhu 2007) to approximate the hinge loss. For the 20newsgroups data set (http://www.cs.nyu.edu/~roweis/data.html), we randomly divide it into an 80% training set and a 20% test set. Following (Ouyang et al. 2013), we set $\lambda_1=\lambda_2=10^{-5}$, and use the one-vs-rest scheme for multi-class classification.

Figure 3 shows the average prediction accuracies and standard deviations of the testing accuracies over 10 different runs. Since STOC-ADMM, OPG-ADMM, SAG-ADMM and SCAS-ADMM consistently perform worse than SVRG-ADMM and ASVRG-ADMM in all settings, we only report the results of STOC-ADMM among them.

Figure 3: Comparison of testing accuracies for multi-class classification on the 20newsgroups data set: accuracy vs. running time (left) and number of epochs (right). (Compared methods: SVM, STOC-ADMM, SVRG-ADMM, ASVRG-ADMM.)

We observe that SVRG-ADMM and ASVRG-ADMM consistently outperform the classical SVM and STOC-ADMM. Moreover, ASVRG-ADMM performs much better than the other methods in all settings, which again verifies the effectiveness of our ASVRG-ADMM method.

## Conclusions

In this paper, we proposed an accelerated stochastic variance reduced ADMM (ASVRG-ADMM) method, in which we combine both the momentum acceleration trick for batch optimization and the variance reduction technique. We designed two different momentum term update rules for the strongly convex and general convex cases, respectively. Moreover, we theoretically analyzed the convergence properties of ASVRG-ADMM, from which it is clear that ASVRG-ADMM achieves a linear rate and an O(1/T^2) rate for the two cases, respectively. In particular, ASVRG-ADMM is at least a factor of T faster than existing stochastic ADMM methods for general convex problems.

## Acknowledgements

We thank the reviewers for their valuable comments. The authors are supported by the Hong Kong GRF 2150851 and 2150895, and Grants 3132964 and 3132821 funded by the Research Committee of CUHK.

## References

- Allen-Zhu, Z. 2016. Katyusha: Accelerated variance reduction for faster SGD. arXiv:1603.05953v4.
- Azadi, S., and Sra, S. 2014. Towards an optimal stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 620-628.
- Banerjee, O.; Ghaoui, L. E.; and d'Aspremont, A. 2008. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9:485-516.
- Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1):183-202.
- Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1):1-122.
- Bubeck, S. 2015. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8:231-358.
- Deng, W., and Yin, W. 2016. On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66:889-916.
- Goldstein, T.; O'Donoghue, B.; Setzer, S.; and Baraniuk, R. 2014. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3):1588-1623.
- Hien, L. T. K.; Lu, C.; Xu, H.; and Feng, J. 2016. Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization. arXiv:1605.06892v2.
- Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 315-323.
- Kim, S.; Sohn, K. A.; and Xing, E. P. 2009. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25:i204-i212.
- Lu, C.; Li, H.; Lin, Z.; and Yan, S. 2016. Fast proximal linearized alternating direction method of multiplier with parallel splitting. In Proc. 30th AAAI Conf. Artif. Intell. (AAAI), 739-745.
- Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady 27(2):372-376.
- Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer Academic Publ.
- Nie, F.; Huang, Y.; Wang, X.; and Huang, H. 2014. Linear time solver for primal SVM. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 505-513.
- Nishihara, R.; Lessard, L.; Recht, B.; Packard, A.; and Jordan, M. I. 2015. A general analysis of the convergence of ADMM. In Proc. 32nd Int. Conf. Mach. Learn. (ICML), 343-352.
- Ouyang, H.; He, N.; Tran, L. Q.; and Gray, A. 2013. Stochastic alternating direction method of multipliers. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 80-88.
- Rosset, S., and Zhu, J. 2007. Piecewise linear regularized solution paths. Ann. Statist. 35(3):1012-1030.
- Roux, N. L.; Schmidt, M.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2672-2680.
- Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14:567-599.
- Shang, F.; Liu, Y.; Cheng, J.; and Cheng, H. 2014. Robust principal component analysis with missing data. In Proc. 23rd Int. Conf. Inform. Knowl. Manag. (CIKM), 1149-1158.
- Suzuki, T. 2013. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 392-400.
- Suzuki, T. 2014. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 736-744.
- Tibshirani, R. J., and Taylor, J. 2011. The solution path of the generalized lasso. Ann. Statist. 39(3):1335-1371.
- Tseng, P. 1998. An incremental gradient(-projection) method with momentum term and adaptive step size rule. SIAM J. Optim. 8(2):506-531.
- Tseng, P. 2010. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125:263-295.
- Wang, H., and Banerjee, A. 2012. Online alternating direction method. In Proc. 29th Int. Conf. Mach. Learn. (ICML), 1119-1126.
- Zhang, X.; Burger, M.; and Osher, S. 2011. A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1):20-46.
- Zhao, S.-Y.; Li, W.-J.; and Zhou, Z.-H. 2015. Scalable stochastic alternating direction method of multipliers. arXiv:1502.03529v3.
- Zheng, S., and Kwok, J. T. 2016. Fast-and-light stochastic ADMM. In Proc. 25th Int. Joint Conf. Artif. Intell. (IJCAI), 2407-2413.
- Zhong, L. W., and Kwok, J. T. 2014a. Accelerated stochastic gradient method for composite regularization. In Proc. 17th Int. Conf. Artif. Intell. Statist. (AISTATS), 1086-1094.
- Zhong, L. W., and Kwok, J. T. 2014b. Fast stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 46-54.