Fast Nonsmooth Regularized Risk Minimization with Continuation

Shuai Zheng, Ruiliang Zhang, James T. Kwok
Department of Computer Science and Engineering
Hong Kong University of Science and Technology, Hong Kong
{szhengac, rzhangaf, jamesk}@cse.ust.hk

Abstract

In regularized risk minimization, the associated optimization problem becomes particularly difficult when both the loss and the regularizer are nonsmooth. Existing approaches either have slow or unclear convergence properties, are restricted to limited problem subclasses, or require careful setting of a smoothing parameter. In this paper, we propose a continuation algorithm that is applicable to a large class of nonsmooth regularized risk minimization problems, can be flexibly used with a number of existing solvers for the underlying smoothed subproblem, and comes with convergence results for the whole algorithm rather than just one of its subproblems. In particular, when accelerated solvers are used, the proposed algorithm achieves the fastest known rates of O(1/T^2) on strongly convex problems, and O(1/T) on general convex problems. Experiments on nonsmooth classification and regression tasks demonstrate that the proposed algorithm outperforms the state-of-the-art.

Introduction

In regularized risk minimization, one has to minimize the sum of an empirical loss and a regularizer. When both are smooth, the sum can be easily optimized by a variety of solvers (Nesterov 2004). In particular, a popular choice for big data applications is stochastic gradient descent (SGD), which is easy to implement and highly scalable (Kushner and Yin 2003). For many nonsmooth regularizers (such as the ℓ1 and nuclear norm regularizers), the corresponding regularized risks can still be efficiently minimized by the proximal gradient algorithm and its accelerated variants (Nesterov 2013). However, when the regularizer is smooth but the loss is nonsmooth (e.g., the hinge loss and absolute loss), or when both the loss and regularizer are nonsmooth, proximal gradient algorithms are not directly applicable.

On nonsmooth problems, SGD can still be used by simply replacing the gradient with a subgradient. However, the subgradient is much less informative than the gradient (Nemirovski and Yudin 1983), and convergence is hindered. On general convex problems, SGD converges at a rate of O(log T/√T), where T is the number of iterations, whereas on strongly convex problems the rate is O(log T/T). In contrast, its smooth counterparts converge with the much faster O(1/√T) and O(1/T) rates, respectively (Rakhlin, Shamir, and Sridharan 2012; Shamir and Zhang 2013). Recently, Shamir and Zhang (2013) recovered these rates by using a polynomial-decay averaging scheme on the SGD iterates. However, a major drawback is that this scheme does not exploit properties of the regularizer. For example, when used with a sparsity-inducing regularizer, the solution obtained may not be sparse (Duchi and Singer 2009).

Nesterov (2005b) proposed to smooth the nonsmooth objective so that it can then be efficiently optimized. This smoothing approach is now popularly used for nonsmooth optimization. However, the optimal smoothness parameter needs to be known in advance. This restriction was later avoided by the (batch) excessive gap algorithm (Nesterov 2005a).
In the stochastic setting, Ouyang and Gray (2012) combined Nesterov's smoothing with SGD. Though these methods achieve the fastest known convergence rates in the batch and stochastic settings respectively, they assume a Lipschitz-smooth regularizer, and nonsmooth regularizers (such as sparsity-inducing regularizers) cannot be used.

Recently, based on the observation that the training set is finite, a number of fast stochastic algorithms have been proposed for both smooth and composite optimization problems (Schmidt, Roux, and Bach 2013; Johnson and Zhang 2013; Xiao and Zhang 2014; Mairal 2013; Defazio, Bach, and Lacoste-Julien 2014). They are based on the idea of variance reduction, and attain convergence rates comparable to those of their batch counterparts. However, they are not applicable when both the loss and the regularizer are nonsmooth. To alleviate this, Shalev-Shwartz and Zhang (2014) suggested running these algorithms on the smoothed approximation obtained with Nesterov's smoothing. However, as in (Nesterov 2005b), this requires a careful setting of the smoothness parameter. Over-smoothing deteriorates solution quality, while under-smoothing slows down convergence.

The problem of setting the smoothness parameter can be alleviated by continuation (Becker, Bobin, and Candès 2011). It solves a sequence of smoothed problems, in which the smoothing parameter is gradually reduced from a large value (for which the corresponding smoothed problem is easy to solve) to a small value (which leads to a solution closer to that of the original nonsmooth problem). Moreover, the solution of each intermediate problem is used to warm-start the next smoothed problem. This approach is also similar to that of gradually changing the regularization parameter in (Hale, Yin, and Zhang 2007; Wen et al. 2010; Mazumder, Hastie, and Tibshirani 2010). Empirically, continuation converges much faster than the use of a fixed smoothing parameter (Becker, Bobin, and Candès 2011). However, the theoretical convergence rate obtained in (Becker, Bobin, and Candès 2011) is only for one stage of the continuation algorithm (i.e., on the smoothed problem with a particular smoothing parameter), while the convergence properties of the whole algorithm are not clear. Recently, Xiao and Zhang (2012) obtained a linear convergence rate for their continuation algorithm, though only for the special case of ℓ1-regularized least squares regression.

In this paper, we consider the general nonsmooth optimization setting, in which both the loss and the regularizer may be nonsmooth. The proposed continuation algorithm can be flexibly used with a variety of existing batch/stochastic solvers in each stage. Theoretical analysis shows that the proposed algorithm, with this wide class of solvers, achieves the rate of O(1/T^2) on strongly convex problems, and O(1/T) on general convex problems. These are the fastest known rates for nonsmooth optimization. Note that these rates are for the whole algorithm, not just one of its stages as in (Becker, Bobin, and Candès 2011). Experiments on nonsmooth classification and regression models demonstrate that the proposed algorithm outperforms the state-of-the-art.

Notation. For x, y ∈ R^d, ‖x‖_2 = (Σ_{i=1}^d x_i^2)^{1/2} is its ℓ2-norm, ‖x‖_1 = Σ_{i=1}^d |x_i| is its ℓ1-norm, and ⟨x, y⟩ is the dot product between x and y. Moreover, ∂f denotes the subdifferential of a nonsmooth function f; if f is differentiable, ∇f denotes its gradient. I is the identity matrix.
Related Work

Consider nonsmooth functions of the form

  g(x) = ĝ(x) + max_{u∈U} [⟨Ax, u⟩ − Q(u)],   (1)

where ĝ is convex and continuously differentiable with L̂-Lipschitz-continuous gradient, U ⊆ R^p is convex, A ∈ R^{p×d}, and Q is a continuous convex function. Nesterov (2005b) proposed the following smooth approximation:

  g_γ(x) = ĝ(x) + max_{u∈U} [⟨Ax, u⟩ − Q(u) − γω(u)],   (2)

where γ is a smoothness parameter, and ω is a nonnegative ζ-strongly convex function. For example, consider the hinge loss g(x) = max(0, 1 − y_i z_i^T x), where x is the linear model parameter, and (z_i, y_i) is the ith training sample with y_i ∈ {±1}. Using ω(u) = (1/2)‖u‖_2^2, g can be smoothed to (Ouyang and Gray 2012)

  g_γ(x) = 0                              if y_i z_i^T x ≥ 1,
           1 − y_i z_i^T x − γ/2          if y_i z_i^T x < 1 − γ,
           (1 − y_i z_i^T x)^2 / (2γ)     otherwise.                 (3)

Similarly, the ℓ1 loss g(x) = |y_i − z_i^T x| can be smoothed to

  g_γ(x) = (y_i − z_i^T x) − γ/2          if y_i − z_i^T x ≥ γ,
           −(y_i − z_i^T x) − γ/2         if y_i − z_i^T x < −γ,
           (y_i − z_i^T x)^2 / (2γ)       otherwise.                 (4)

Other examples in machine learning include popular regularizers such as the ℓ1 norm, total variation (Becker, Bobin, and Candès 2011), overlapping group lasso, and graph-guided fused lasso (Chen et al. 2012). Minimization of the smooth (and convex) g_γ can be performed efficiently using first-order methods, including the so-called optimal method and its variants (Nesterov 2005b) that achieve the optimal convergence rate.

Nesterov Smoothing with Continuation

Consider the following nonsmooth minimization problem:

  min_x P(x) ≡ f(x) + r(x),   (5)

where both f and r are convex and nonsmooth. In machine learning, x usually corresponds to the model parameter, f is the loss, and r the regularizer. We assume that the loss f on a set of n training samples can be decomposed as f(x) = (1/n) Σ_{i=1}^n f_i(x), where f_i is the loss value on the ith sample. Moreover, each f_i can be written as in (1), i.e., f_i(x) = f̂_i(x) + max_{u∈U} [⟨A_i x, u⟩ − Q(u)]. One can then apply Nesterov's smoothing, and P(x) in (5) is smoothed to

  P̃(x) = f_γ(x) + r(x),   (6)

where f_γ(x) = (1/n) Σ_{i=1}^n f̃_i(x) and

  f̃_i(x) = f̂_i(x) + max_{u∈U} [⟨A_i x, u⟩ − Q(u) − γω(u)].   (7)

As for r, we assume that it is "simple", namely that its proximal operator, prox_{λr}(z) ≡ arg min_x (1/2)‖x − z‖^2 + λr(x), can be easily computed for any λ > 0 (Parikh and Boyd 2014).

Strongly Convex Objectives

In this section, we assume that P is μ-strongly convex. This strong convexity may come from f (e.g., the ℓ2-regularized hinge loss), from r (e.g., the elastic-net regularizer), or from both.

Assumption 1. P is μ-strongly convex, i.e., there exists μ > 0 such that P(y) ≥ P(x) + ξ^T (y − x) + (μ/2)‖y − x‖_2^2 for all ξ ∈ ∂P(x) and all x, y ∈ R^d.

The proposed algorithm is based on continuation. It proceeds in stages, and a smoothed problem is solved in each stage (Becker, Bobin, and Candès 2011). The smoothness parameter is gradually reduced across stages, so that the smoothed problem becomes closer and closer to the original one. In each stage, an iterative solver M is used to solve the smoothed problem. It returns an approximate solution, which is then used to warm-start the next stage. In stage s, let the smoothness parameter be γ_s, the smoothed objective in (6) be P̃_s(x), x*_s = arg min_x P̃_s(x), and x̃_s the solution returned by M. As M is warm-started with x̃_{s−1}, the error before running M is P̃_s(x̃_{s−1}) − P̃_s(x*_s). At the end of stage s, we assume that this error has been reduced by a factor of ρ_s. The expectation E below is over the stochastic choice of training samples for a stochastic solver; for a deterministic solver, the expectation can be dropped.

Assumption 2. E[P̃_s(x̃_s)] − P̃_s(x*_s) ≤ ρ_s (P̃_s(x̃_{s−1}) − P̃_s(x*_s)), where ρ_s ∈ (0, 1).
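To make the smoothed subproblems handled in each stage concrete, the sketch below implements the smoothed hinge loss (3) and the smoothed absolute loss (4) for a single training sample. It only illustrates the formulas above; the function and variable names (smoothed_hinge, smoothed_abs, margin, residual) are ours and not from the paper.

```python
def smoothed_hinge(margin, gamma):
    # Smoothed hinge loss, cf. (3); margin = y_i * z_i^T x, gamma > 0.
    if margin >= 1.0:
        return 0.0
    if margin < 1.0 - gamma:
        return 1.0 - margin - gamma / 2.0
    return (1.0 - margin) ** 2 / (2.0 * gamma)


def smoothed_abs(residual, gamma):
    # Smoothed absolute (l1) loss, cf. (4); residual = y_i - z_i^T x, gamma > 0.
    if residual >= gamma:
        return residual - gamma / 2.0
    if residual < -gamma:
        return -residual - gamma / 2.0
    return residual ** 2 / (2.0 * gamma)
```

As γ → 0, both functions approach the original nonsmooth losses, at the price of a gradient Lipschitz constant that grows like 1/γ; this is why the condition number κ_s of the smoothed problem grows as γ_s shrinks across stages.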
We consider two types of solvers, which differ in the number of iterations T_s needed to satisfy Assumption 2:

1. Non-accelerated solvers: T_s = aκ_s φ(ρ_s) + bφ(ρ_s) + c;
2. Accelerated solvers: T_s = a√κ_s φ(ρ_s) + bφ(ρ_s) + c.

Here, κ_s is the condition number of the objective, and a, b, c ≥ 0 are constants not related to κ_s and φ(ρ_s). Moreover, φ satisfies (i) φ(ρ_s) > 0 and is non-increasing in ρ_s on (0, 1); and (ii) φ(ρ_s) is not related to κ_s. Note that when κ_s is large (as is typical when the smoothed problem approaches the original problem), non-accelerated solvers need a larger T_s than accelerated solvers. Table 1 shows some non-accelerated and accelerated solvers popularly used in machine learning.

Table 1: Examples of non-accelerated and accelerated solvers. Note that Prox-GD and APG are batch solvers, while the others are stochastic solvers. Here, θ and p are parameters related to the stepsize, and are fixed across stages. In particular, θ ∈ (0, 0.25) satisfies (1 − 4θ)ρ_s − 4θ > 0, and p ∈ (0, 1) satisfies ρ_s > p(2 + p)/(1 − p). Accelerated Prox-SVRG has T_s = O(√κ_s log(1/ρ_s)) only when a sufficiently large mini-batch is used.

Non-accelerated:
  Prox-GD (Nesterov 2013):  T_s = 4κ_s log(1/ρ_s);  φ(ρ_s) = log(1/ρ_s);  a = 4;  b = 0;  c = 0.
  Prox-SVRG (Xiao and Zhang 2014):  T_s = θ(κ_s + 4)/((1 − 4θ)ρ_s − 4θ);  φ(ρ_s) = 1/((1 − 4θ)ρ_s − 4θ);  a = θ;  b = 4θ;  c = 0.
  SAGA (Defazio, Bach, and Lacoste-Julien 2014):  T_s = (3n/ρ_s)(3κ_s/n + 1);  φ(ρ_s) = 1/ρ_s;  a = 9;  b = 3n;  c = 0.
  MISO (Mairal 2013):  T_s ∝ nκ_s.
Accelerated:
  APG (Schmidt, Roux, and Bach 2011):  T_s = √κ_s log(2/ρ_s);  φ(ρ_s) = log(2/ρ_s);  a = 1;  b = 0;  c = 0.
  Accelerated Prox-SVRG (Nitanda 2014):  T_s = (2/(1 − p))√κ_s log((2 + p)/(ρ_s p(1 − p)));  φ(ρ_s) = log((2 + p)/(ρ_s p(1 − p)));  a = 2/(1 − p);  b = 0;  c = 0.

Algorithm 1 shows the proposed procedure, which will be called CNS (Continuation for NonSmooth optimization). It is similar to that in (Becker, Bobin, and Candès 2011), which however does not have convergence results. Moreover, a small but important difference is that Algorithm 1 specifies how T_s should be updated across stages, and this is essential for proving convergence. Note the different update options for non-accelerated and accelerated solvers.

Algorithm 1 CNS algorithm for strongly convex problems.
1: Input: number of iterations T_1 and smoothness parameter γ_1 for stage 1, and shrinking parameter τ > 1.
2: Initialize: x̃_0.
3: for s = 1, 2, . . . do
4:   P̃_s ← smooth P with smoothing parameter γ_s;
5:   x̃_s ← minimize P̃_s(x) by running M for T_s iterations;
6:   γ_{s+1} = γ_s/τ;
7:   Option I (non-accelerated solvers): T_{s+1} = τ T_s;
8:   Option II (accelerated solvers): T_{s+1} = √τ T_s;
9: end for
10: Output: x̃_s.
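The continuation loop itself is only a few lines. The following is a minimal sketch of Algorithm 1, assuming a generic stage solver passed in as a callable (standing in for M, e.g., Prox-SVRG or its accelerated variant); all names are ours and purely illustrative.

```python
import math

def cns_strongly_convex(x0, solver, smooth_objective, gamma1, T1, tau, num_stages,
                        accelerated=False):
    """Sketch of Algorithm 1 (CNS, strongly convex case).

    solver(objective, x_init, num_iters): runs the stage solver M for num_iters
        iterations, warm-started at x_init, and returns an approximate minimizer.
    smooth_objective(gamma): returns the smoothed objective P~ for this gamma.
    """
    x, gamma, T = x0, gamma1, float(T1)
    for s in range(num_stages):
        P_s = smooth_objective(gamma)                # step 4: smooth P with gamma_s
        x = solver(P_s, x, int(math.ceil(T)))        # step 5: warm-started stage solve
        gamma /= tau                                 # step 6: gamma_{s+1} = gamma_s / tau
        T *= math.sqrt(tau) if accelerated else tau  # steps 7-8: Option II / Option I
    return x
```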
The following lemma shows that when T_1 is large enough, error reduction can be guaranteed across all stages.

Lemma 1. For both non-accelerated and accelerated solvers, if T_1 is large enough such that ρ_1 ≤ 1/τ^2, then ρ_s ≤ 1/τ^2 for all s > 1.

If κ_1 is known, a sufficiently large T_1 can be obtained from Table 1; otherwise, we can obtain T_1 by ensuring P̃_1(x̃_1) ≤ P̃_1(x̃_0)/τ^2, which then implies P̃_1(x̃_1) − P̃_1(x*_1) ≤ (P̃_1(x̃_0) − P̃_1(x*_1))/τ^2.

Convergence when a Non-Accelerated Solver is Used

Let x* = arg min_x P(x), and D_u = max_{u∈U} ω(u). The following lemma shows that if x is an ϵ-accurate solution of P̃_s (i.e., P̃_s(x) − P̃_s(x*_s) ≤ ϵ), it is also an (ϵ + γ_s D_u)-accurate solution of the original objective P.

Lemma 2. P̃_s(x) − P̃_s(x*_s) − γ_s D_u ≤ P(x) − P(x*) ≤ P̃_s(x) − P̃_s(x*_s) + γ_s D_u.

Since Lemma 2 holds for any x, it also holds in expectation, i.e., E[P̃_s(x̃_s)] − P̃_s(x*_s) − γ_s D_u ≤ E[P(x̃_s)] − P(x*) ≤ E[P̃_s(x̃_s)] − P̃_s(x*_s) + γ_s D_u.

Theorem 1. Assume that T_1 in Algorithm 1 is large enough so that ρ_1 ≤ 1/τ^2. When non-accelerated solvers are used,

  E[P(x̃_S)] − P(x*) ≤ (∏_{s=1}^S ρ_s)(P(x̃_0) − P(x*)) + O(γ_1 D_u / T),   (8)

where S is the number of stages, T = Σ_{s=1}^S T_s, and ∏_{s=1}^S ρ_s = O(1/T^2).

The first term on the RHS of (8) reflects the cumulative decrease of the objective after S stages, while the second term is due to smoothing. The condition ρ_1 ≤ 1/τ^2 is used to obtain the O(1/T) rate in the last term of (8). If we instead require only ρ_1 ≤ 1/τ, it can be shown that the rate slows to O(log T/T); if ρ_1 ≤ 1/√τ, it degrades further to O(1/√T). On the other hand, if ρ_1 ≤ 1/τ^c with c > 2, the rate is not improved.

Corollary 1. Together with Lemma 1, we have

  E[P(x̃_S)] − P(x*) ≤ (P(x̃_0) − P(x*))/τ^{2S} + O(γ_1 D_u / T),   (9)

where 1/τ^{2S} = O(1/T^2).

Existing stochastic algorithms such as SGD, FOBOS and RDA have a convergence rate of O(log T/T) (Rakhlin, Shamir, and Sridharan 2012; Duchi and Singer 2009; Xiao 2009), while here we have the faster O(1/T) rate. Recent works in (Shamir and Zhang 2013; Ouyang and Gray 2012) also achieve an O(1/T) rate. However, Shamir and Zhang (2013) use the stochastic subgradient, and do not exploit properties of the regularizer (such as sparsity). This can lead to inferior performance (Duchi and Singer 2009; Xiao 2009; Mazumder, Hastie, and Tibshirani 2010). On the other hand, (Ouyang and Gray 2012) is restricted to r ≡ 0 in (5).

Next, we compare with the case where continuation is not used (i.e., γ_s is a constant). Equivalently, this corresponds to setting τ = 1 in Algorithm 1.

Proposition 1. When continuation is not used, let ρ ∈ (0, 1) be the error reduction factor at each stage, and γ > 0 be the fixed smoothing parameter. When either an accelerated or non-accelerated solver is used,

  E[P(x̃_S)] − P(x*) ≤ ρ^S (P(x̃_0) − P(x*)) + (1 + ρ^S) γ D_u.   (10)

Proposition 2. Assume that the two terms on the RHS of (9), and similarly of (10), are equal to αϵ and (1 − α)ϵ, respectively, where α > 0 and ϵ > 0. Let ρ_1 = ρ = 1/τ^2 in (8) and (10). Assume that Algorithm 1 needs a total of T iterations to obtain an ϵ-accurate solution, while its fixed-γ variant takes T̄ iterations. Then,

  T̄ ≥ S (a ((τ^{2S} + 1)/(τ^{S+1} + τ^S)) κ_1 + C) φ(1/τ^2),

where S ≥ log(αϵ/(P(x̃_0) − P(x*))) / log(1/τ^2), C = ((τ^{2S} + 1)/(τ^{S+1} + τ^S)) K/μ, and K is a constant. As T and T̄ are usually dominated by the aκ_1 φ(1/τ^2) term, T̄ is roughly S times T. This is also consistent with empirical observations that continuation is much faster than fixed smoothing (Becker, Bobin, and Candès 2011).

Convergence when an Accelerated Solver is Used

Theorem 2. Assume that T_1 in Algorithm 1 is large enough so that ρ_1 ≤ 1/τ^2. When accelerated solvers are used,

  E[P(x̃_S)] − P(x*) ≤ (∏_{s=1}^S ρ_s)(P(x̃_0) − P(x*)) + O(γ_1 D_u / T^2),

where T = Σ_{s=1}^S T_s, and ∏_{s=1}^S ρ_s = O(1/T^4).

As the ρ_s's for non-accelerated and accelerated solvers are different, the ∏_{s=1}^S ρ_s term here is different from that in Theorem 1. Moreover, the last term is improved from O(1/T) in Theorem 1 to O(1/T^2) with accelerated solvers. This is also better than the rates of existing stochastic algorithms (O(log T/T) in (Duchi and Singer 2009; Xiao 2009) and O(1/T) in (Rakhlin, Shamir, and Sridharan 2012; Shamir and Zhang 2013; Ouyang and Gray 2012)). Besides, the black-box lower bound of O(1/T) for strongly convex problems (Agarwal et al. 2009) does not apply here, as we have the additional assumptions that the objective is of the form in (1) and that the number of training samples is finite. Though the (batch) excessive gap algorithm (Nesterov 2005a) also has an O(1/T^2) rate, it is limited to r ≡ 0 in (5).
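The arithmetic behind this improvement can be sketched informally; the following back-of-the-envelope calculation is ours, but it is consistent with the quantities appearing in Theorem 2. Under Option II, γ_s shrinks by a factor of τ per stage while T_s grows by √τ, so

  γ_s = γ_1 τ^{1−s},   T_s = τ^{(s−1)/2} T_1,   T = Σ_{s=1}^S T_s = T_1 (τ^{S/2} − 1)/(√τ − 1),

which gives τ^{S/2} = Θ(T). Hence the smoothing error of the final stage is γ_S D_u = γ_1 D_u τ^{1−S} = O(γ_1 D_u / T^2), and since each ρ_s ≤ 1/τ^2, we also get ∏_{s=1}^S ρ_s ≤ τ^{−2S} = O(1/T^4), matching the two terms in Theorem 2.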
As in Proposition 2, the following shows that if continuation is not used, the algorithm is roughly S times slower.

Proposition 3. With the same assumptions as in Proposition 2,

  T̄ ≥ ((τ^S − 1)/(τ − 1)) (√(((τ^{2S} + 1)/(τ^{S+1} + τ^S)) κ_1) + C) φ(1/τ^2),

where S and C are as defined in Proposition 2.

General Convex Objectives

When P is not strongly convex, we add to it a small ℓ2 term (with weight λ_s). We then gradually decrease γ_s and λ_s simultaneously to approach the original problem. The revised procedure is shown in Algorithm 2.

Algorithm 2 CNS algorithm for general convex problems.
1: Input: number of iterations T_1, smoothness parameter γ_1 and strong convexity parameter λ_1 for stage 1, and shrinking parameter τ > 1.
2: Initialize: x̃_0.
3: for s = 1, 2, . . . do
4:   P̃_s ← smooth P with smoothing parameter γ_s;
5:   x̃_s ← minimize P̃_s(x) + (λ_s/2)‖x‖_2^2 by running M for T_s iterations;
6:   γ_{s+1} = γ_s/τ; λ_{s+1} = λ_s/τ;
7:   Option I (non-accelerated solvers): T_{s+1} = τ^2 T_s;
8:   Option II (accelerated solvers): T_{s+1} = τ T_s;
9: end for
10: Output: x̃_s.

We assume that there exists R > 0 such that ‖x*‖_2 ≤ R and ‖x*_s‖_2 ≤ R for all s. Define H_s(x) = P̃_s(x) + (λ_s/2)‖x‖_2^2, and let x*_s = arg min_x H_s(x). The following assumption is similar to that for strongly convex problems.

Assumption 3. E[H_s(x̃_s)] − H_s(x*_s) ≤ ρ_s (H_s(x̃_{s−1}) − H_s(x*_s)), where ρ_s ∈ (0, 1).

Theorem 3. Assume that T_1 in Algorithm 2 is large enough so that ρ_1 ≤ 1/τ^2. When non-accelerated solvers are used,

  E[P(x̃_S)] − P(x*) ≤ (∏_{s=1}^S ρ_s)(P(x̃_0) − P(x*) + (λ_1/2) R^2) + O((γ_1 D_u + λ_1 R^2)/√T),   (11)

where ∏_{s=1}^S ρ_s = O(1/T). For accelerated solvers,

  E[P(x̃_S)] − P(x*) ≤ (∏_{s=1}^S ρ_s)(P(x̃_0) − P(x*) + (λ_1/2) R^2) + O((γ_1 D_u + λ_1 R^2)/T),   (12)

where ∏_{s=1}^S ρ_s = O(1/T^2).

For non-accelerated solvers, the O(1/√T) convergence rate in (11) is only as good as that obtained in (Xiao 2009; Duchi and Singer 2009; Ouyang and Gray 2012; Shamir and Zhang 2013). Hence, they will not be studied further in the sequel. However, for accelerated solvers, the O(1/T) convergence rate in (12) is faster than the O(1/√T) rate in (Xiao 2009; Duchi and Singer 2009; Ouyang and Gray 2012; Shamir and Zhang 2013) and the O(1/T^2 + log T/T) rate in (Orabona, Argyriou, and Srebro 2012). The O(1/T) convergence rate is also obtained in (Nesterov 2005a; 2005b), but again only for r ≡ 0 in (5).

When continuation is not used, the following results are analogous to those obtained in the previous section.

Proposition 4. Let x̃_0 = 0. When continuation is not used, let ρ be the error reduction factor at each stage. When either an accelerated or non-accelerated solver is used,

  E[P(x̃_S)] − P(x*) ≤ ρ^S (P(x̃_0) − P(x*)) + (1 + ρ^S) γ D_u + (λ/2) R^2.   (13)

Proposition 5. Let x̃_0 = 0. Suppose that the three terms on the RHS of (13) are equal to αϵ, βϵ and ζϵ, respectively, where α, β, ζ > 0 and α + β + ζ = 1. Let ρ_1 = ρ = 1/τ^2 in (12) and (13). Assume that Algorithm 2 (with an accelerated solver) needs a total of T iterations to obtain an ϵ-accurate solution, while its fixed-γ variant takes T̄ iterations. Then,

  T̄ ≥ S (√(((τ^{2S} + 1)/(τ + 1)^2) κ_1) + C) φ(1/τ^2),

where S ≥ log(αϵ/(P(x̃_0) − P(x*))) / log(1/τ^2), C = ((τ^{2S} + 1)/(τ + 1)^2 + (τ^S/(1 + τ)) √((τ^{2S} + 1)/(τ + 1)^2)) K/λ_1, and K is a constant.

A summary of the convergence results is shown in Table 2. As can be seen, the convergence rates of the proposed CNS algorithm match the fastest known rates in nonsmooth optimization, but CNS is less restrictive and can exploit the composite structure of the optimization problem.

Table 2: Comparison with the fastest known convergence rates for nonsmooth optimization problem (1). The fastest known batch solver is restricted to r ≡ 0, while the fastest known stochastic solver does not exploit properties of r.

  strongly convex   batch solver   stochastic solver   CNS non-accel. (batch/stochastic)   CNS accel. (batch/stochastic)
  yes               1/T^2          1/T                 1/T                                 1/T^2
  no                1/T            1/√T                1/√T                                1/T
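For reference, the general convex variant changes the continuation loop sketched earlier in only two places: each stage minimizes the smoothed objective plus a small ℓ2 term, and both γ_s and λ_s are shrunk while T_s grows faster. A minimal sketch under the same assumptions as before (generic solver callable; all names are ours):

```python
import math

def cns_general_convex(x0, solver, smooth_objective, gamma1, lambda1, T1, tau,
                       num_stages, accelerated=True):
    """Sketch of Algorithm 2 (CNS, general convex case).

    smooth_objective(gamma, lam): returns H_s(x) = P~_s(x) + (lam/2)*||x||^2,
        i.e. the smoothed objective plus the small l2 term added in stage s.
    solver(objective, x_init, num_iters): the stage solver M, as before.
    """
    x, gamma, lam, T = x0, gamma1, lambda1, float(T1)
    for s in range(num_stages):
        H_s = smooth_objective(gamma, lam)       # steps 4-5: smoothed, l2-perturbed objective
        x = solver(H_s, x, int(math.ceil(T)))    # warm-started stage solve
        gamma /= tau                             # step 6: shrink both parameters
        lam /= tau
        T *= tau if accelerated else tau ** 2    # steps 7-8: Option II / Option I
    return x
```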
Experiments

Because of the lack of space, we only report results on two data sets (Table 3) from the LIBSVM archive: (i) the popularly used classification data set rcv1; and (ii) YearPredictionMSD, the largest regression data set in the LIBSVM archive and a subset of the Million Song data set. We use the hinge loss for classification, and the ℓ1 loss for regression. Both can be smoothed using Nesterov's smoothing (to (3) and (4), respectively). As for the regularizer, we use the

1. elastic-net regularizer r(x) = ν_1 ‖x‖_1 + (ν_2/2)‖x‖_2^2 (Zou and Hastie 2005), for which problem (5) is strongly convex; and
2. ℓ1 regularizer r(x) = ν_1 ‖x‖_1, for which (5) is (general) convex.

Here, ν_1 and ν_2 are tuned by 5-fold cross-validation. Obviously, all the losses and regularizers are convex but nonsmooth. We use mini-batches for all methods. The mini-batch size b is 50 for rcv1, and 100 for YearPredictionMSD.

Table 3: Data sets used in the experiments.
                       #train     #test      #features
  rcv1                 20,242     677,399    47,236
  YearPredictionMSD    463,715    51,630     90

The following stochastic algorithms are compared:

1. Forward-backward splitting (FOBOS) (Duchi and Singer 2009), a standard baseline for nonsmooth stochastic composite optimization.
2. SGD with polynomial-decay averaging (Poly-SGD) (Shamir and Zhang 2013), the state-of-the-art for nonsmooth optimization.
3. Regularized dual averaging (RDA) (Xiao 2009), another state-of-the-art method for sparse learning problems.
4. The proposed CNS algorithm. We use proximal SVRG (PSVRG) (Xiao and Zhang 2014) as the underlying non-accelerated solver, and accelerated proximal SVRG (ACC-PSVRG) (Nitanda 2014) as the accelerated solver. The resultant procedures are denoted CNS-NA and CNS-A, respectively.

We set γ_1 = 0.01, τ = 2, and T_1 = ⌈n/b⌉. Empirically, this ensures ρ_1 ≤ 1/τ^2 (in Theorems 1 and 3) on the two data sets. Note that FOBOS, RDA and the proposed CNS can effectively make use of the composite structure of the problem, while Poly-SGD cannot. For each method, the stepsize is tuned by running on a subset containing 20% of the training data for a few epochs (for the proposed method, we tune η_1). All algorithms are implemented in Matlab.

Strongly Convex Objectives

Figure 1 shows the convergence of the objective and the testing performance (classification error for rcv1, and ℓ1-loss for YearPredictionMSD). The trends are consistent with Theorem 1. CNS-A is the fastest (with a rate of O(1/T^2)). This is followed by CNS-NA and Poly-SGD, both with the O(1/T) rate (from Theorem 1 and (Shamir and Zhang 2013)). The slowest are FOBOS and RDA, which converge at a rate of O(log T/T) (Duchi and Singer 2009; Xiao 2009).

[Figure 1: Objective minus best (top) and test error / ℓ1-loss (bottom) vs CPU time (s) on the strongly convex problem, for rcv1 and YearPredictionMSD.]

Figure 2 compares with the case where continuation is not used. Two fixed smoothness settings, γ = 10^{-2} and γ = 10^{-3}, are used. As can be seen, they are much slower (Propositions 2 and 3). Moreover, a smaller γ leads to slower convergence but a better solution, while a larger γ leads to faster convergence but a worse solution. This is also consistent with Proposition 1, as using a fixed γ only allows convergence to the optimal solution up to a tolerance of (1 + ρ^S)γ D_u. Moreover, a smaller γ leads to a larger condition number, and convergence becomes slower.

[Figure 2: Effect of continuation (strongly convex problem); objective minus best vs CPU time (s), for (a) rcv1 and (b) YearPredictionMSD.]
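The two regularizers used in these experiments are "simple" in the sense assumed earlier: their proximal operators have closed forms, which is what lets FOBOS, RDA and CNS exploit the composite structure. A minimal sketch (our own illustration; nu1 and nu2 correspond to ν_1 and ν_2 above):

```python
import numpy as np

def prox_l1(v, thresh):
    # Soft-thresholding: prox of thresh * ||x||_1 evaluated at v.
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def prox_elastic_net(v, step, nu1, nu2):
    # Prox of step * (nu1 * ||x||_1 + (nu2/2) * ||x||_2^2) evaluated at v:
    # soft-threshold at step * nu1, then shrink by 1 / (1 + step * nu2).
    return prox_l1(v, step * nu1) / (1.0 + step * nu2)
```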
General Convex Objectives

We set λ_1 in Algorithm 2 to 10^{-5} for rcv1, and 10^{-7} for YearPredictionMSD. As can be seen from Figure 3, the trends are again consistent with Theorem 3. CNS-A is the fastest (with an O(1/T) convergence rate), while the others all have a rate of O(1/√T) (Duchi and Singer 2009; Xiao 2009; Shamir and Zhang 2013). Also, RDA shows better performance than FOBOS and Poly-SGD. Recall that Poly-SGD outperforms FOBOS and RDA on strongly convex problems. On general convex problems, however, Poly-SGD is the worst, as its rate is only as good as the others' and it does not exploit the composite structure of the problem.

[Figure 3: Objective minus best (top) and test error / ℓ1-loss (bottom) vs CPU time (s) on the general convex problem, for rcv1 and YearPredictionMSD.]

Figure 4 compares with the case where continuation is not used. As in the previous section, CNS-NA and CNS-A show faster convergence than their fixed-smoothing counterparts.

[Figure 4: Effect of continuation (general convex problem); objective minus best vs CPU time (s), for (a) rcv1 and (b) YearPredictionMSD.]

Conclusion

In this paper, we proposed a continuation algorithm (CNS) for regularized risk minimization problems in which both the loss and the regularizer may be nonsmooth. In each of its stages, the smoothed subproblem can be easily solved by existing accelerated or non-accelerated solvers. Theoretical analysis establishes convergence results for the whole continuation algorithm, not just one of its stages. In particular, when accelerated solvers are used, the proposed CNS algorithm achieves the rate of O(1/T^2) on strongly convex problems, and O(1/T) on general convex problems. These are the fastest known rates for nonsmooth optimization. Moreover, CNS is advantageous in that it allows the use of a nonsmooth regularizer (unlike the fastest batch algorithm) and can exploit the composite structure of the optimization problem (unlike the fastest stochastic algorithm). Experiments on nonsmooth classification and regression models demonstrate that CNS outperforms the state-of-the-art.

Acknowledgments

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 614513).

References

Agarwal, A.; Wainwright, M. J.; Bartlett, P. L.; and Ravikumar, P. K. 2009. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, 1–9.
Becker, S.; Bobin, J.; and Candès, E. J. 2011. NESTA: A fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences 4(1):1–39.
Chen, X.; Lin, Q.; Kim, S.; Carbonell, J. G.; Xing, E. P.; et al. 2012. Smoothing proximal gradient method for general structured sparse regression. Annals of Applied Statistics 6(2):719–752.
Defazio, A.; Bach, F.; and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2116–2124.
Duchi, J., and Singer, Y. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research 10:2899–2934.
Hale, E.; Yin, W.; and Zhang, Y. 2007. A fixed-point continuation method for ℓ1-regularized minimization with applications to compressed sensing. Technical Report CAAM TR07-07, Rice University.
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 315–323.
Kushner, H. J., and Yin, G. 2003. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media.
Mairal, J. 2013. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning.
Mazumder, R.; Hastie, T.; and Tibshirani, R. 2010. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research 11:2287–2322.
Nemirovski, A., and Yudin, D. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley.
Nesterov, Y. 2004. Introductory Lectures on Convex Optimization, volume 87. Springer.
Nesterov, Y. 2005a. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization 16(1):235–249.
Nesterov, Y. 2005b. Smooth minimization of non-smooth functions. Mathematical Programming 103(1):127–152.
Nesterov, Y. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140(1):125–161.
Nitanda, A. 2014. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, 1574–1582.
Orabona, F.; Argyriou, A.; and Srebro, N. 2012. PRISMA: Proximal iterative smoothing algorithm. Preprint arXiv:1206.2372.
Ouyang, H., and Gray, A. G. 2012. Stochastic smoothing for nonsmooth minimizations: Accelerating SGD by exploiting structure. In Proceedings of the 29th International Conference on Machine Learning, 33–40.
Parikh, N., and Boyd, S. 2014. Proximal algorithms. Foundations and Trends in Optimization 1(3):127–239.
Rakhlin, A.; Shamir, O.; and Sridharan, K. 2012. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 449–456.
Schmidt, M.; Roux, N. L.; and Bach, F. R. 2011. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, 1458–1466.
Schmidt, M.; Roux, N. L.; and Bach, F. 2013. Minimizing finite sums with the stochastic average gradient. Preprint arXiv:1309.2388.
Shalev-Shwartz, S., and Zhang, T. 2014. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, 64–72.
Shamir, O., and Zhang, T. 2013. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 71–79.
Wen, Z.; Yin, W.; Goldfarb, D.; and Zhang, Y. 2010. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM Journal on Scientific Computing 32(4):1832–1857.
Xiao, L., and Zhang, T. 2012. A proximal-gradient homotopy method for the ℓ1-regularized least-squares problem. In Proceedings of the 29th International Conference on Machine Learning, 839–846.
Xiao, L., and Zhang, T. 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4):2057–2075.
Xiao, L. 2009. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, 2116–2124.
Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67(2):301–320.