# Optimal Black-Box Reductions Between Optimization Objectives

Zeyuan Allen-Zhu (zeyuan@csail.mit.edu), Institute for Advanced Study & Princeton University
Elad Hazan (ehazan@cs.princeton.edu), Princeton University

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning via reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong convexity in applications. Furthermore, unlike existing results, our new reductions are optimal and more practical. We show how these new reductions give rise to new and faster running times for training linear classifiers for various families of loss functions, and conclude with experiments showing their success also in practice.

The full version of this paper can be found at https://arxiv.org/abs/1603.05642. This paper is partially supported by an NSF Grant, no. 1523815, and a Microsoft Research Grant, no. 0518584.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

The basic machine learning problem of minimizing a regularizer plus a loss function comes in numerous different variations and names. Examples include Ridge Regression, Lasso, Support Vector Machine (SVM), Logistic Regression and many others. A multitude of optimization methods have been introduced for these problems, but in most cases they are specialized to very particular problem settings. Such specializations appear necessary, since objective functions for different classification and regularization tasks admit different convexity and smoothness parameters. We list below a few recent algorithms along with their applicable settings.

- Variance-reduction methods such as SAGA and SVRG [9, 14] intrinsically require the objective to be smooth, and do not work for non-smooth problems like SVM. This is because for loss functions such as the hinge loss, no unbiased gradient estimator can achieve a variance that approaches zero.
- Dual methods such as SDCA or APCG [20, 30] intrinsically require the objective to be strongly convex (SC), and do not directly apply to non-SC problems. This is because for a non-SC objective such as Lasso, its dual is not well-behaved due to the $\ell_1$ regularizer.
- Primal-dual methods such as SPDC [35] require the objective to be both smooth and SC. Many other algorithms are only analyzed for objectives that are both smooth and SC [7, 16, 17].

In this paper we investigate whether such specializations are inherent. Is it possible to take a convex optimization algorithm designed for one problem, and apply it to different classification or regression settings in a black-box manner? Such a reduction should ideally take full and optimal advantage of the objective properties, namely strong convexity and smoothness, in each setting.

Unfortunately, existing reductions are still very limited, for at least two reasons. First, they incur at least a logarithmic factor $\log(1/\varepsilon)$ in the running time, leading only to suboptimal convergence rates.² Second, after applying existing reductions, algorithms become biased, so the objective value does not converge to the global minimum. These theoretical concerns also translate into running-time losses and parameter-tuning difficulties in practice.
In this paper, we develop new and optimal regularization and smoothing reductions that shave off the non-optimal $\log(1/\varepsilon)$ factor and produce unbiased algorithms. Besides such technical advantages, our new reductions also enable researchers to focus on designing algorithms for only one setting but infer optimal results more broadly. This is opposed to results such as [4, 25], where the authors develop ad hoc techniques to tweak specific algorithms, rather than all algorithms, and apply them to other settings without losing extra factors and without introducing bias. Our new reductions also enable researchers to prove lower bounds more broadly [32].

1.1 Formal Setting and Classical Approaches

Consider minimizing a composite objective function

$\min_{x \in \mathbb{R}^d} \big\{ F(x) \overset{\mathrm{def}}{=} f(x) + \psi(x) \big\}$ ,   (1.1)

where $f(x)$ is a differentiable convex function and $\psi(x)$ is a relatively simple (but possibly non-differentiable) convex function, sometimes referred to as the proximal function. Our goal is to find a point $x \in \mathbb{R}^d$ satisfying $F(x) \le F(x^*) + \varepsilon$, where $x^*$ is a minimizer of $F$. In most classification and regression problems, $f(x)$ can be written as $f(x) = \frac{1}{n} \sum_{i=1}^n f_i(\langle x, a_i \rangle)$, where each $a_i \in \mathbb{R}^d$ is a feature vector. We refer to this as the finite-sum case of (1.1).

CLASSICAL REGULARIZATION REDUCTION. Given a non-SC $F(x)$, one can define a new objective $F'(x) \overset{\mathrm{def}}{=} F(x) + \frac{\sigma}{2}\|x_0 - x\|^2$, in which $\sigma$ is on the order of $\varepsilon$. In order to minimize $F(x)$, the classical regularization reduction calls an oracle algorithm to minimize $F'(x)$ instead, and this oracle only needs to work with SC functions. (A code sketch of this classical reduction is given at the end of this subsection.)

EXAMPLE. If $F$ is $L$-smooth, one can apply accelerated gradient descent to minimize $F'$ and obtain an algorithm that converges in $O\big(\sqrt{L/\varepsilon}\,\log\frac{1}{\varepsilon}\big)$ iterations in terms of minimizing the original $F$. This complexity has a suboptimal dependence on $\varepsilon$ and shall be improved using our new regularization reduction.

CLASSICAL SMOOTHING REDUCTION (FINITE-SUM CASE).³ Given a non-smooth $F(x)$ of finite-sum form, one can define a smoothed variant $\hat{f}_i(\alpha)$ for each $f_i(\alpha)$⁴ and let $F'(x) = \frac{1}{n} \sum_{i=1}^n \hat{f}_i(\langle a_i, x \rangle) + \psi(x)$. In order to minimize $F(x)$, the classical smoothing reduction calls an oracle algorithm to minimize $F'(x)$ instead, and this oracle only needs to work with smooth functions.

EXAMPLE. If $F(x)$ is $\sigma$-SC and one applies accelerated gradient descent to minimize $F'$, this yields an algorithm that converges in $O\big(\frac{1}{\sqrt{\sigma\varepsilon}}\log\frac{1}{\varepsilon}\big)$ iterations for minimizing the original $F(x)$. Again, the additional factor $\log(1/\varepsilon)$ can be removed using our new smoothing reduction.

Besides this non-optimality, applying the above two reductions yields only biased algorithms. One has to tune the regularization or smoothing parameter, and the algorithm only converges to the minimum of the regularized or smoothed problem $F'(x)$, which can be away from the true minimizer of $F(x)$ by a distance proportional to the parameter. This makes the reductions hard to use in practice.

²Recall that obtaining the optimal convergence rate is one of the main goals in operations research and machine learning. For instance, obtaining the optimal $1/\varepsilon$ rate for online learning was a major breakthrough after the non-optimal $\log(1/\varepsilon)/\varepsilon$ rate was first discovered [13, 15, 26].

³The smoothing reduction is typically applied to the finite-sum form only. This is because, for a general high-dimensional function $f(x)$, its smoothed variant $\hat{f}(x)$ may not be efficiently computable.

⁴More formally, one needs this variant to satisfy $|\hat{f}_i(\alpha) - f_i(\alpha)| \le \varepsilon$ for all $\alpha$ and be smooth at the same time. This can be done in at least two classical ways if $f_i(\alpha)$ is Lipschitz continuous. One is to define $\hat{f}_i(\alpha) = \mathbb{E}_{v \sim [-1,1]}[f_i(\alpha + \varepsilon v)]$ as an integral of $f_i$ over the scaled unit interval, see for instance Chapter 2.3 of [12], and the other is to define $\hat{f}_i(\alpha) = \max_\beta \{\beta\alpha - f_i^*(\beta) - \frac{\varepsilon}{2}\beta^2\}$ using the Fenchel conjugate $f_i^*(\beta)$ of $f_i(\alpha)$, see for instance [24].
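To make the classical regularization reduction concrete, here is a minimal Python sketch. The callables `F_grad`, `psi_prox`, and `sc_solver` are hypothetical names introduced only for illustration; the essential points are that $\sigma$ is fixed on the order of $\varepsilon$ and that the output minimizes the regularized $F'$ rather than $F$, which is exactly the bias discussed above.

```python
def classical_regularization_reduction(F_grad, psi_prox, x0, eps, sc_solver):
    """Classical (biased) regularization reduction: minimize
    F'(x) = F(x) + (sigma/2) * ||x - x0||^2 with sigma on the order of eps,
    via a single call to a solver for strongly convex objectives.
    All callables here are hypothetical interfaces, for illustration only."""
    sigma = eps  # fixed regularization weight ~ eps; must be tuned in practice

    def reg_grad(x):
        # gradient of the regularized smooth part f(x) + (sigma/2)||x - x0||^2
        return F_grad(x) + sigma * (x - x0)

    # One oracle call; the output minimizes F', not F, so the reduction is
    # biased: the error floor is proportional to sigma.
    return sc_solver(reg_grad, psi_prox, x0, sigma=sigma)
```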
1.2 Our New Results

To introduce our new reductions, we first define a property of the oracle algorithm.

Our Black-Box Oracle. Consider an algorithm A that minimizes (1.1) when the objective $F$ is $L$-smooth and $\sigma$-SC. We say that A satisfies the homogenous objective decrease (HOOD) property in time $\mathrm{Time}(L, \sigma)$ if, for every starting vector $x_0$, A produces an output $x'$ satisfying $F(x') - F(x^*) \le \frac{F(x_0) - F(x^*)}{4}$ in time $\mathrm{Time}(L, \sigma)$. In other words, A decreases the objective distance to the minimum by a constant factor in time $\mathrm{Time}(L, \sigma)$, regardless of how large or small $F(x_0) - F(x^*)$ is. (A hypothetical code sketch of this oracle interface is given below, after the AdaptReg discussion.) We give a few example algorithms that satisfy HOOD:

- Gradient descent and accelerated gradient descent satisfy HOOD with $\mathrm{Time}(L, \sigma) = O(L/\sigma) \cdot C$ and $\mathrm{Time}(L, \sigma) = O(\sqrt{L/\sigma}) \cdot C$ respectively, where $C$ is the time needed to compute a gradient $\nabla f(x)$ and perform a proximal gradient update [23]. Many subsequent works in this line of research also satisfy HOOD, including [3, 7, 16, 17].
- SVRG and SAGA [14, 34] solve the finite-sum form of (1.1) and satisfy HOOD with $\mathrm{Time}(L, \sigma) = O(n + L/\sigma) \cdot C_1$, where $C_1$ is the time needed to compute a stochastic gradient $\nabla f_i(x)$ and perform a proximal gradient update.
- Katyusha [1] solves the finite-sum form of (1.1) and satisfies HOOD with $\mathrm{Time}(L, \sigma) = O(n + \sqrt{nL/\sigma}) \cdot C_1$.

AdaptReg. For objectives $F(x)$ that are non-SC and $L$-smooth, our AdaptReg reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with an SC objective $F(x) + \frac{\sigma}{2}\|x - x_0\|^2$ for an exponentially decreasing value $\sigma$. In the end, AdaptReg produces an output $\hat{x}$ satisfying $F(\hat{x}) - F(x^*) \le \varepsilon$ with a total running time of $\sum_{t \ge 0} \mathrm{Time}(L, \varepsilon \cdot 2^t)$. Since most algorithms have an inverse polynomial dependence on $\sigma$ in $\mathrm{Time}(L, \sigma)$, when summing up $\mathrm{Time}(L, \varepsilon \cdot 2^t)$ over positive values of $t$, we do not incur the additional factor $\log(1/\varepsilon)$, as opposed to the old reduction. In addition, AdaptReg is an unbiased and anytime algorithm: $F(\hat{x})$ converges to $F(x^*)$ as time goes on, without the necessity of changing parameters, so the algorithm can be interrupted at any time.

We mention some theoretical applications of AdaptReg:

- Applying AdaptReg to SVRG, we obtain a running time $O\big((n \log \frac{1}{\varepsilon} + \frac{L}{\varepsilon}) \cdot C_1\big)$ for minimizing finite-sum, non-SC, smooth objectives (such as Lasso and logistic regression). This improves on known theoretical running times obtained by non-accelerated methods, including $O\big((n + \frac{L}{\varepsilon}) \log \frac{1}{\varepsilon} \cdot C_1\big)$ through the old reduction, as well as $O\big(\frac{n + L}{\varepsilon} \cdot C_1\big)$ through direct methods such as SAGA [9] and SAG [27].
- Applying AdaptReg to Katyusha, we obtain a running time $O\big((n \log \frac{1}{\varepsilon} + \sqrt{\frac{nL}{\varepsilon}}) \cdot C_1\big)$ for minimizing finite-sum, non-SC, smooth objectives (such as Lasso and logistic regression). This is the first and only known stochastic method that converges with the optimal $1/\sqrt{\varepsilon}$ rate (as opposed to $\log(1/\varepsilon)/\sqrt{\varepsilon}$) for this class of objectives [1].
- Applying AdaptReg to methods that do not originally work for non-SC objectives, such as [7, 16, 17], we improve their running times by a factor of $\log(1/\varepsilon)$ when working with non-SC objectives.
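For concreteness, the following is a minimal Python sketch of the black-box oracle interface assumed throughout. The names `HOODOracle`, `F_grad`, and `psi_prox` are hypothetical and introduced only so that the later sketches have something to refer to.

```python
from typing import Callable, Protocol
import numpy as np

class HOODOracle(Protocol):
    """Hypothetical interface for an algorithm A satisfying HOOD: given an
    L-smooth, sigma-strongly convex objective (represented here by a gradient
    oracle and a proximal operator for psi) and a starting point x0, return
    x' with F(x') - F(x*) <= (F(x0) - F(x*)) / 4, in time Time(L, sigma)."""
    def __call__(self, F_grad: Callable[[np.ndarray], np.ndarray],
                 psi_prox: Callable[[np.ndarray, float], np.ndarray],
                 x0: np.ndarray, L: float, sigma: float) -> np.ndarray:
        ...
```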
AdaptSmooth and JointAdaptRegSmooth. For objectives $F(x)$ that are finite-sum, $\sigma$-SC, but non-smooth, our AdaptSmooth reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with a smoothed variant $F^{(\lambda)}(x)$ of $F$ and an exponentially decreasing smoothing parameter $\lambda$. In the end, AdaptSmooth produces an output $\hat{x}$ satisfying $F(\hat{x}) - F(x^*) \le \varepsilon$ with a total running time of $\sum_{t \ge 0} \mathrm{Time}\big(\frac{1}{\varepsilon \cdot 2^t}, \sigma\big)$. Since most algorithms have a polynomial dependence on $L$ in $\mathrm{Time}(L, \sigma)$, when summing up $\mathrm{Time}\big(\frac{1}{\varepsilon \cdot 2^t}, \sigma\big)$ over positive values of $t$, we do not incur an additional factor of $\log(1/\varepsilon)$, as opposed to the old reduction. AdaptSmooth is also an unbiased and anytime algorithm, for the same reason as AdaptReg. In addition, AdaptReg and AdaptSmooth can effectively work together to solve the finite-sum, non-SC, and non-smooth case of (1.1); we call this joint reduction JointAdaptRegSmooth.

We mention some theoretical applications of AdaptSmooth and JointAdaptRegSmooth:

- Applying AdaptSmooth to Katyusha, we obtain a running time $O\big((n \log \frac{1}{\varepsilon} + \sqrt{\frac{n}{\sigma\varepsilon}}) \cdot C_1\big)$ for minimizing finite-sum, SC, non-smooth objectives (such as SVM). Therefore, Katyusha combined with AdaptSmooth is the first and only known stochastic method that converges with the optimal $1/\sqrt{\varepsilon}$ rate (as opposed to $\log(1/\varepsilon)/\sqrt{\varepsilon}$) for this class of objectives [1].
- Applying JointAdaptRegSmooth to Katyusha, we obtain a running time $O\big((n \log \frac{1}{\varepsilon} + \frac{\sqrt{n}}{\varepsilon}) \cdot C_1\big)$ for minimizing finite-sum, non-SC, non-smooth objectives (such as $\ell_1$-SVM). Therefore, Katyusha combined with JointAdaptRegSmooth gives the first and only known stochastic method that converges with the optimal $1/\varepsilon$ rate (as opposed to $\log(1/\varepsilon)/\varepsilon$) for this class of objectives [1].

Roadmap. We provide notation in Section 2 and discuss related works in Section 3. We propose AdaptReg in Section 4 and AdaptSmooth in Section 5. We leave the proofs, as well as the full description and analysis of JointAdaptRegSmooth, to the full version of this paper. We include experimental results in Section 7.

2 Preliminaries

In this paper we denote by $\nabla f(x)$ the full gradient of $f$ if it is differentiable, or a subgradient if $f$ is only Lipschitz continuous. Recall the classical definitions of strong convexity and smoothness.

Definition 2.1 (smoothness and strong convexity). For a convex function $f : \mathbb{R}^n \to \mathbb{R}$:
- $f$ is $\sigma$-strongly convex if $\forall x, y \in \mathbb{R}^n$, it satisfies $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2}\|x - y\|^2$;
- $f$ is $L$-smooth if $\forall x, y \in \mathbb{R}^n$, it satisfies $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$.

Characterization of SC and Smooth Regimes. In this paper we give numbers to the following 4 categories of objectives $F(x)$ in (1.1). Each of them corresponds to well-known training problems in machine learning. (We let $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$ be the $i$-th feature vector and label; Python definitions of representative objectives follow after this list.)

- Case 1: $\psi(x)$ is $\sigma$-SC and $f(x)$ is $L$-smooth. Examples:
  - ridge regression: $f(x) = \frac{1}{2n} \sum_{i=1}^n (\langle a_i, x \rangle - b_i)^2$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2$.
  - elastic net: $f(x) = \frac{1}{2n} \sum_{i=1}^n (\langle a_i, x \rangle - b_i)^2$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2 + \lambda\|x\|_1$.
- Case 2: $\psi(x)$ is non-SC and $f(x)$ is $L$-smooth. Examples:
  - Lasso: $f(x) = \frac{1}{2n} \sum_{i=1}^n (\langle a_i, x \rangle - b_i)^2$ and $\psi(x) = \lambda\|x\|_1$.
  - logistic regression: $f(x) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-b_i \langle a_i, x \rangle))$ and $\psi(x) = \lambda\|x\|_1$.
- Case 3: $\psi(x)$ is $\sigma$-SC and $f(x)$ is non-smooth (but Lipschitz continuous). Example:
  - SVM: $f(x) = \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - b_i \langle a_i, x \rangle\}$ and $\psi(x) = \frac{\sigma}{2}\|x\|_2^2$.
- Case 4: $\psi(x)$ is non-SC and $f(x)$ is non-smooth (but Lipschitz continuous). Example:
  - $\ell_1$-SVM: $f(x) = \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - b_i \langle a_i, x \rangle\}$ and $\psi(x) = \lambda\|x\|_1$.
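As a quick reference, here is a minimal sketch of the four cases as NumPy objective evaluations. The function names and the arguments `A` (the $n \times d$ feature matrix) and `b` (the labels) are hypothetical and purely illustrative.

```python
import numpy as np

# Hypothetical finite-sum objectives F(x) = (1/n) sum_i f_i(<a_i, x>) + psi(x).

def ridge(A, b, x, sigma):      # Case 1: smooth f, strongly convex psi
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * sigma * x @ x

def lasso(A, b, x, lam):        # Case 2: smooth f, non-SC psi
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()

def svm(A, b, x, sigma):        # Case 3: non-smooth f (hinge), SC psi
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x))) + 0.5 * sigma * x @ x

def l1_svm(A, b, x, lam):       # Case 4: non-smooth f (hinge), non-SC psi
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x))) + lam * np.abs(x).sum()
```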
Definition 2.2 (HOOD property). We say an algorithm $A(F, x_0)$ solving Case 1 of problem (1.1) satisfies the homogenous objective decrease (HOOD) property with time $\mathrm{Time}(L, \sigma)$ if, for every starting point $x_0$, it produces an output $x' \leftarrow A(F, x_0)$ such that $F(x') - \min_x F(x) \le \frac{F(x_0) - \min_x F(x)}{4}$ in time $\mathrm{Time}(L, \sigma)$.⁵

In this paper, we denote by $C$ the time needed for computing a full gradient $\nabla f(x)$ and performing a proximal gradient update of the form $x' \leftarrow \arg\min_x \big\{ \frac{1}{2}\|x - x_0\|^2 + \eta(\langle \nabla f(x_0), x - x_0 \rangle + \psi(x)) \big\}$. For the finite-sum case of problem (1.1), we denote by $C_1$ the time needed for computing a stochastic (sub-)gradient $\nabla f_i(\langle a_i, x \rangle)$ and performing a proximal gradient update of the form $x' \leftarrow \arg\min_x \big\{ \frac{1}{2}\|x - x_0\|^2 + \eta(\langle \nabla f_i(\langle a_i, x_0 \rangle)\, a_i, x - x_0 \rangle + \psi(x)) \big\}$. For finite-sum forms of (1.1), $C$ is usually on the order of $n \cdot C_1$.

⁵Although our definition is only for deterministic algorithms, if the guarantee is probabilistic, i.e., $\mathbb{E}[F(x')] - \min_x F(x) \le \frac{F(x_0) - \min_x F(x)}{4}$, all the results of this paper remain true. Also, the constant 4 is arbitrary and can be replaced with any other constant bigger than 1.

Algorithm 1 The AdaptReg Reduction
Input: an objective $F(\cdot)$ in Case 2 (smooth and not necessarily strongly convex); $x_0$ a starting vector; $\sigma_0$ an initial regularization parameter; $T$ the number of epochs; an algorithm $A$ that solves Case 1 of problem (1.1).
Output: $\hat{x}_T$.
1: $\hat{x}_0 \leftarrow x_0$.
2: for $t \leftarrow 0$ to $T - 1$ do
3:   Define $F^{(\sigma_t)}(x) \overset{\mathrm{def}}{=} \frac{\sigma_t}{2}\|x - x_0\|^2 + F(x)$.
4:   $\hat{x}_{t+1} \leftarrow A(F^{(\sigma_t)}, \hat{x}_t)$.
5:   $\sigma_{t+1} \leftarrow \sigma_t / 2$.
6: end for
7: return $\hat{x}_T$.

3 Related Works

Catalyst/APPA [11, 19] turn non-accelerated methods into accelerated ones, which is different from the purpose of this paper. They can be used as regularization reductions from Case 2 to Case 1 (but not from Cases 3 or 4); however, they suffer from two log-factor losses in the running time, and perform poorly in practice [1]. In particular, Catalyst/APPA fix the regularization parameter throughout the algorithm, while our AdaptReg decreases it exponentially. Their results do not imply ours.

Beck and Teboulle [5] give a reduction from Case 4 to Case 2. Such a reduction does not suffer from a log-factor loss for an almost trivial reason: by setting the smoothing parameter $\lambda = \varepsilon$ and applying any $O(1/\sqrt{\varepsilon})$-convergence method for Case 2, we immediately obtain an $O(1/\varepsilon)$ method for Case 4 without an extra log factor. Our main challenge in this paper is to provide log-free reductions to Case 1;⁶ simple ideas fail to produce log-free reductions in this case because all efficient algorithms solving Case 1 (due to linear convergence) carry a log factor. In addition, the Beck–Teboulle reduction is biased, whereas ours is unbiased.

The so-called homotopy methods (i.e., methods with geometrically decreasing regularization/smoothing weights) appear frequently in the literature [6, 25, 31, 33]. However, to the best of our knowledge, all existing homotopy analyses either work only for specific algorithms [6, 25, 31] or solve only special problems [33]. In other words, none of them provides all-purpose black-box reductions like we do.

⁶Designing reductions to Case 1 (rather than, for instance, Case 2) is crucial for various reasons. First, algorithm design for Case 1 is usually easier (especially in stochastic settings). Second, Case 3 can only be reduced to Case 1 but not to Case 2. Third, lower-bound results [32] require reductions to Case 1.

4 AdaptReg: Reduction from Case 2 to Case 1

We now focus on solving Case 2 of problem (1.1): that is, $f(\cdot)$ is $L$-smooth, but $\psi(\cdot)$ is not necessarily SC. We achieve this by reducing the problem to an algorithm $A$ that solves Case 1 and satisfies HOOD. AdaptReg works as follows (see Algorithm 1, and the code sketch below). At the beginning of AdaptReg, we set $\hat{x}_0$ to equal $x_0$, an arbitrary given starting vector. AdaptReg consists of $T$ epochs. At each epoch $t = 0, 1, \ldots, T - 1$, we define a $\sigma_t$-strongly convex objective $F^{(\sigma_t)}(x) \overset{\mathrm{def}}{=} \frac{\sigma_t}{2}\|x - x_0\|^2 + F(x)$. Here, the parameter $\sigma_{t+1} = \sigma_t/2$ for each $t \ge 0$, and $\sigma_0$ is an input parameter to AdaptReg that will be specified later. We run $A$ on $F^{(\sigma_t)}(x)$ with starting vector $\hat{x}_t$ in each epoch, and let the output be $\hat{x}_{t+1}$. After all $T$ epochs are finished, AdaptReg simply outputs $\hat{x}_T$.
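A minimal Python sketch of Algorithm 1 follows, assuming a HOOD oracle in the interface style sketched in Section 1.2. How the regularized objective is actually handed to the oracle (values, gradients, prox) depends on the oracle and is left abstract; all interfaces here are hypothetical.

```python
import numpy as np

def adapt_reg(F, oracle, x0, sigma0, T):
    """Minimal sketch of Algorithm 1 (AdaptReg), assuming oracle(G, x_start)
    is a HOOD-satisfying Case-1 solver (e.g. SVRG or Katyusha) applied to the
    regularized objective G.  Interfaces are hypothetical."""
    x_hat = np.asarray(x0, dtype=float)
    sigma = sigma0
    for t in range(T):
        # Epoch t: build the sigma_t-strongly convex objective
        #   F_sigma(x) = (sigma_t / 2) * ||x - x0||^2 + F(x).
        F_sigma = lambda x, s=sigma: 0.5 * s * np.sum((x - x0) ** 2) + F(x)
        x_hat = oracle(F_sigma, x_hat)  # warm start from the previous epoch
        sigma /= 2.0                    # halve the regularization parameter
    return x_hat                        # anytime: each x_hat is a valid output
```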
We state our main theorem for AdaptReg below and prove it in the full version of this paper.

Theorem 4.1 (AdaptReg). Suppose that in problem (1.1) $f(\cdot)$ is $L$-smooth. Let $x_0$ be a starting vector such that $F(x_0) - F(x^*) \le \Delta$ and $\|x_0 - x^*\|^2 \le \Theta$. Then, AdaptReg with $\sigma_0 = \Delta/\Theta$ and $T = \log_2(\Delta/\varepsilon)$ produces an output $\hat{x}_T$ satisfying $F(\hat{x}_T) - \min_x F(x) \le O(\varepsilon)$ in a total running time of $\sum_{t=0}^{T-1} \mathrm{Time}(L, \sigma_0 \cdot 2^{-t})$.⁷

Remark 4.2. We compare the parameter-tuning effort needed for AdaptReg against the classical regularization reduction. In the classical reduction, there are two parameters: $T$, the number of iterations, which does not need tuning; and $\sigma$, which had better equal $\varepsilon/\Theta$, an unknown quantity, and therefore requires tuning. In AdaptReg, we also need to tune only one parameter, namely $\sigma_0$. Our $T$ need not be tuned because AdaptReg can be interrupted at any moment, and the $\hat{x}_t$ of the current epoch can be outputted. In our experiments later, we spent the same effort tuning $\sigma$ in the classical reduction and $\sigma_0$ in AdaptReg. As can easily be seen from the plots, tuning $\sigma_0$ is much easier than tuning $\sigma$.

⁷If the HOOD property is only satisfied probabilistically, as per Footnote 5, our error guarantee becomes probabilistic as well, i.e., $\mathbb{E}[F(\hat{x}_T)] - \min_x F(x) \le O(\varepsilon)$. This is also true for the other reduction theorems of this paper.

Corollary 4.3.
- When AdaptReg is applied to SVRG, we solve the finite-sum case of Case 2 with running time $\sum_{t=0}^{T-1} \mathrm{Time}(L, \sigma_0 2^{-t}) = \sum_{t=0}^{T-1} O\big(n + \frac{L 2^t}{\sigma_0}\big) \cdot C_1 = O\big(n \log \frac{\Delta}{\varepsilon} + \frac{L\Theta}{\varepsilon}\big) \cdot C_1$. This is faster than the $O\big((n + \frac{L\Theta}{\varepsilon}) \log \frac{1}{\varepsilon}\big) \cdot C_1$ obtained through the old reduction, and faster than the $O\big(\frac{n + L\Theta}{\varepsilon}\big) \cdot C_1$ obtained by SAGA [9] and SAG [27].
- When AdaptReg is applied to Katyusha, we solve the finite-sum case of Case 2 with running time $\sum_{t=0}^{T-1} \mathrm{Time}(L, \sigma_0 2^{-t}) = \sum_{t=0}^{T-1} O\big(n + \sqrt{\frac{n L 2^t}{\sigma_0}}\big) \cdot C_1 = O\big(n \log \frac{\Delta}{\varepsilon} + \sqrt{\frac{n L \Theta}{\varepsilon}}\big) \cdot C_1$. This is faster than the $O\big((n + \sqrt{\frac{n L \Theta}{\varepsilon}}) \log \frac{1}{\varepsilon}\big) \cdot C_1$ obtained through the old reduction on Katyusha [1].⁸

⁸If the old reduction is applied to APCG, SPDC, or AccSDCA rather than Katyusha, then two log factors will be lost.

5 AdaptSmooth: Reduction from Case 3 to Case 1

We now focus on solving the finite-sum form of Case 3 of problem (1.1). Without loss of generality, we assume $\|a_i\| = 1$ for each $i \in [n]$, because otherwise one can scale $f_i$ accordingly. We solve this problem by reducing it to an oracle $A$ that solves the finite-sum form of Case 1 and satisfies HOOD. Recall the following definition using the Fenchel conjugate:⁹

Definition 5.1. For each function $f_i : \mathbb{R} \to \mathbb{R}$, let $f_i^*(\beta) \overset{\mathrm{def}}{=} \max_\alpha \{\alpha\beta - f_i(\alpha)\}$ be its Fenchel conjugate. Then, we define the following smoothed variant of $f_i$, parameterized by $\lambda > 0$: $f_i^{(\lambda)}(\alpha) \overset{\mathrm{def}}{=} \max_\beta \big\{\beta\alpha - f_i^*(\beta) - \frac{\lambda}{2}\beta^2\big\}$. Accordingly, we define $F^{(\lambda)}(x) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n f_i^{(\lambda)}(\langle a_i, x \rangle) + \psi(x)$.

From the properties of the Fenchel conjugate (see for instance the textbook [28]), we know that $f_i^{(\lambda)}(\cdot)$ is a $(1/\lambda)$-smooth function, and therefore the objective $F^{(\lambda)}(x)$ falls into the finite-sum form of Case 1 of problem (1.1) with smoothness parameter $L = 1/\lambda$.

⁹For every explicitly given $f_i(\cdot)$, this Fenchel conjugate can be symbolically computed and fed into the algorithm. This pre-processing is needed for nearly all known algorithms in order for them to apply to non-smooth settings (such as SVRG, SAGA, SPDC, APCG, SDCA, etc.).
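As a concrete instance of Definition 5.1, the smoothed hinge loss admits a closed form: for the hinge loss $f(\alpha) = \max\{0, 1 - \alpha\}$, the conjugate is $f^*(\beta) = \beta$ on $[-1, 0]$ (and $+\infty$ elsewhere), and maximizing over $\beta$ gives the familiar three-piece (Huberized) hinge. The following sketch is a standard calculation under these assumptions; the function name is illustrative.

```python
def smoothed_hinge(alpha, lam):
    """Smoothed variant f^(lambda) of the hinge loss f(a) = max{0, 1 - a},
    obtained in closed form from Definition 5.1 by maximizing
    beta * alpha - f*(beta) - (lam/2) * beta^2 over beta in [-1, 0].
    The result is a (1/lam)-smooth loss."""
    if alpha >= 1.0:
        return 0.0                        # flat region: maximizer beta* = 0
    elif alpha <= 1.0 - lam:
        return 1.0 - alpha - lam / 2.0    # linear region: beta* = -1
    else:
        return (1.0 - alpha) ** 2 / (2.0 * lam)  # quadratic (smoothed) region
```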
Our AdaptSmooth works as follows (see the pseudocode in the full version, and the code sketch below). At the beginning of AdaptSmooth, we set $\hat{x}_0$ to equal $x_0$, an arbitrary given starting vector. AdaptSmooth consists of $T$ epochs. At each epoch $t = 0, 1, \ldots, T - 1$, we define a $(1/\lambda_t)$-smooth objective $F^{(\lambda_t)}(x)$ using Definition 5.1. Here, the parameter $\lambda_{t+1} = \lambda_t/2$ for each $t \ge 0$, and $\lambda_0$ is an input parameter to AdaptSmooth that will be specified later. We run $A$ on $F^{(\lambda_t)}(x)$ with starting vector $\hat{x}_t$ in each epoch, and let the output be $\hat{x}_{t+1}$. After all $T$ epochs are finished, AdaptSmooth outputs $\hat{x}_T$.

We state our main theorem for AdaptSmooth below and prove it in the full version of this paper.

Theorem 5.2. Suppose that in problem (1.1), $\psi(\cdot)$ is $\sigma$-strongly convex and each $f_i(\cdot)$ is $G$-Lipschitz continuous. Let $x_0$ be a starting vector such that $F(x_0) - F(x^*) \le \Delta$. Then, AdaptSmooth with $\lambda_0 = \Delta/G^2$ and $T = \log_2(\Delta/\varepsilon)$ produces an output $\hat{x}_T$ satisfying $F(\hat{x}_T) - \min_x F(x) \le O(\varepsilon)$ in a total running time of $\sum_{t=0}^{T-1} \mathrm{Time}(2^t/\lambda_0, \sigma)$.

Remark 5.3. We emphasize that AdaptSmooth requires less parameter-tuning effort than the old reduction, for the same reason as in Remark 4.2. Also, AdaptSmooth, when applied to Katyusha, provides the fastest known running time for solving the Case 3 finite-sum form of (1.1), similar to Corollary 4.3.
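A minimal sketch of the AdaptSmooth epoch loop, in the same hypothetical interface style as the AdaptReg sketch above; `make_smoothed_F` stands for any routine that builds $F^{(\lambda)}$ from Definition 5.1 (e.g., via `smoothed_hinge`).

```python
def adapt_smooth(make_smoothed_F, oracle, x0, lam0, T):
    """Minimal sketch of the AdaptSmooth reduction.  Assumes
    make_smoothed_F(lam) builds the (1/lam)-smooth objective F^(lam) of
    Definition 5.1, and oracle(G, x_start, L) is a HOOD-satisfying Case-1
    solver.  Interfaces are hypothetical, for illustration only."""
    x_hat = x0
    lam = lam0
    for t in range(T):
        F_lam = make_smoothed_F(lam)               # smoothness L_t = 1/lam_t
        x_hat = oracle(F_lam, x_hat, L=1.0 / lam)  # warm-started epoch
        lam /= 2.0                                 # halve the smoothing parameter
    return x_hat
```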
6 JointAdaptRegSmooth: From Case 4 to Case 1

We show in the full version that AdaptReg and AdaptSmooth can work together to reduce the finite-sum form of Case 4 to Case 1. We call this reduction JointAdaptRegSmooth. It relies on a jointly, exponentially decreasing sequence $(\sigma_t, \lambda_t)$, where $\sigma_t$ is the weight of the strong-convexity term that we add on top of $F(x)$, and $\lambda_t$ is the smoothing parameter that determines how we change each $f_i(\cdot)$. The analysis is analogous to a careful combination of the proofs for AdaptReg and AdaptSmooth.

7 Experiments

We perform experiments to confirm the theoretical speed-ups obtained by AdaptReg and AdaptSmooth. We work on minimizing Lasso and SVM objectives for the following three well-known datasets that can be found on the LibSVM website [10]: covtype, mnist, and rcv1. We defer some dataset and implementation details to the full version of this paper.

7.1 Experiments on AdaptReg

To test the performance of AdaptReg, consider the Lasso objective, which is in Case 2 (i.e., non-SC but smooth). We apply AdaptReg to reduce it to Case 1, and apply either APCG [20], an accelerated method, or (Prox-)SDCA [29, 30], a non-accelerated method. Let us make a few remarks:

- APCG and SDCA are both indirect solvers for non-strongly convex objectives, and therefore regularization is intrinsically required in order to run them for Lasso or, more generally, Case 2.
- APCG and SDCA do not satisfy HOOD in theory. However, they still benefit from AdaptReg, as we shall see, demonstrating the practical value of AdaptReg.

A Practical Implementation. In principle, one can implement AdaptReg by setting the termination criterion of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations of SDCA to exactly $T = O(n + \frac{L}{\sigma_t})$ in the $t$-th epoch. However, in practice it is more desirable to terminate the oracle automatically whenever the objective distance to the minimum has decreased sufficiently. In all of our experiments, we simply compute the duality gap and terminate the oracle whenever the duality gap is below 1/4 times the last recorded duality gap of the previous epoch (see the sketch below). For details see the full version of this paper.
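A minimal sketch of this adaptive inner-loop termination; `oracle_step` (one pass of APCG/SDCA) and `duality_gap` are hypothetical callables introduced for illustration.

```python
def run_epoch_with_gap_stopping(oracle_step, duality_gap, x, last_gap):
    """Run the inner oracle until the duality gap drops below 1/4 of the
    last recorded gap of the previous epoch, then hand the iterate to the
    next epoch of the reduction."""
    gap = duality_gap(x)
    while gap > last_gap / 4.0:
        x = oracle_step(x)   # one inner iteration (or pass) of the oracle
        gap = duality_gap(x)
    return x, gap            # `gap` becomes `last_gap` for the next epoch
```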
Experimental Results. For each dataset, we consider three different magnitudes of regularization weight $\lambda$ for the $\ell_1$ regularizer in the Lasso objective. This totals 9 analysis tasks for each algorithm. For each such task, we first implement the old reduction by adding an additional $\frac{\sigma}{2}\|x\|^2$ term to the Lasso objective and then applying APCG or SDCA. We consider values of $\sigma$ in the set $\{10^k, 3 \cdot 10^k : k \in \mathbb{Z}\}$ and show the most representative six of them in the plots (blue solid curves in Figures 3 and 4). Naturally, for a larger value of $\sigma$ the old reduction converges faster, but to a point that is farther from the exact minimizer because of the bias. We then implement AdaptReg, where we choose the initial parameter $\sigma_0$ from the same set $\{10^k, 3 \cdot 10^k : k \in \mathbb{Z}\}$ and present the best one in each of the 18 plots (red dashed curves in Figures 3 and 4). Due to space limitations, we provide only 3 of the 18 plots, for medium-sized $\lambda$, in the main body of this paper (see Figure 1), and include Figures 3 and 4 only in the full version of this paper.

[Figure 1: Comparing AdaptReg and the classical reduction on Lasso (with $\ell_1$ regularizer weight $\lambda$); panels (a) covtype, $\lambda = 10^{-6}$; (b) mnist, $\lambda = 10^{-5}$; (c) rcv1, $\lambda = 10^{-5}$. The y-axis is the objective distance to the minimum, and the x-axis is the number of passes over the dataset. The blue solid curves represent APCG under the old regularization reduction, and the red dashed curve represents APCG under AdaptReg. For other values of $\lambda$, or the results on SDCA, please refer to the full version of this paper.]

It is clear from our experiments that:
- AdaptReg is more efficient than the old regularization reduction;
- AdaptReg requires no more parameter tuning than the classical reduction;
- AdaptReg is unbiased, and thus simplifies the parameter selection procedure.¹⁰

7.2 Experiments on AdaptSmooth

To test the performance of AdaptSmooth, consider the SVM objective, which is non-smooth but SC. We apply AdaptSmooth to reduce it to Case 1 and apply SVRG [14]. We emphasize that SVRG is an indirect solver for non-smooth objectives, and therefore smoothing is intrinsically required in order to run SVRG for SVM or, more generally, for Case 3.

A Practical Implementation. In principle, one can implement AdaptSmooth by setting the termination criterion of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations of SVRG to exactly $T = O(n + \frac{1}{\sigma \lambda_t})$ in the $t$-th epoch. In practice, however, it is more desirable to terminate the oracle automatically whenever the objective distance to the minimum has decreased sufficiently. In all of our experiments, we simply compute the Euclidean norm of the full gradient of the objective, and terminate the oracle whenever this norm is below 1/3 times the last recorded norm of the previous epoch. For details see the full version.

Experimental Results. For each dataset, we consider three different magnitudes of regularization weight $\sigma$ for the $\ell_2$ regularizer in the SVM objective. This totals 9 analysis tasks. For each such task, we first implement the old reduction by smoothing the hinge loss functions (using Definition 5.1) with a fixed parameter $\lambda > 0$ and then applying SVRG. We consider values of $\lambda$ in the set $\{10^k, 3 \cdot 10^k : k \in \mathbb{Z}\}$ and show the most representative six of them in the plots (blue solid curves in Figure 5). Naturally, for a larger $\lambda$ the old reduction converges faster, but to a point that is farther from the exact minimizer due to its bias. We then implement AdaptSmooth, where we choose the initial smoothing parameter $\lambda_0$ from the same set $\{10^k, 3 \cdot 10^k : k \in \mathbb{Z}\}$ and present the best one in each of the 9 plots (red dashed curves in Figure 5). Due to space limitations, we provide only 3 of the 9 plots, for small-sized $\sigma$, in the main body of this paper (see Figure 2), and include Figure 5 only in the full version.

[Figure 2: Comparing AdaptSmooth and the classical reduction on SVM (with $\ell_2$ regularizer weight $\sigma$); panels (a) covtype, $\sigma = 10^{-5}$; (b) mnist, $\sigma = 10^{-4}$; (c) rcv1, $\sigma = 10^{-4}$. The y-axis is the objective distance to the minimum, and the x-axis is the number of passes over the dataset. The blue solid curves represent SVRG under the old smoothing reduction, and the red dashed curve represents SVRG under AdaptSmooth. For other values of $\sigma$, please refer to the full version.]

It is clear from our experiments that:
- AdaptSmooth is more efficient than the old smoothing reduction, especially when the desired training error is small;
- AdaptSmooth requires no more parameter tuning than the classical reduction;
- AdaptSmooth is unbiased, and simplifies parameter selection for the same reason as in Footnote 10.

¹⁰It is easy to determine the best $\sigma_0$ in AdaptReg; in contrast, in the old reduction, if the desired error is somehow changed for the application, one has to select a different $\sigma$ and restart the algorithm.

References

[1] Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. ArXiv e-prints, abs/1603.05953, March 2016.
[2] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, abs/1407.1537, July 2014.
[3] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, 2016.
[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.
[5] Amir Beck and Marc Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.
[6] Radu Ioan Boţ and Christopher Hendrich. A variable smoothing algorithm for solving convex optimization problems. TOP, 23(1):124–150, 2015.
[7] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. ArXiv e-prints, abs/1506.08187, June 2015.
[8] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NIPS, 2014.
[10] Rong-En Fan and Chih-Jen Lin. LIBSVM Data: Classification, Regression and Multi-label. Accessed: 2015-06.
[11] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, pages 1–28, 2015.
[12] Elad Hazan. DRAFT: Introduction to online convex optimization. Foundations and Trends in Machine Learning, XX(XX):1–168, 2015.
[13] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013.
[15] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, abs/1212.2002, 2012.
[16] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, pages 147–156. IEEE, 2013.
[17] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. CoRR, abs/1408.3595, 2014.
[18] Hongzhou Lin. Private communication, 2016.
[19] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[20] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In NIPS, pages 3059–3067, 2014.
[21] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, January 2004.
[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.
[23] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.
[24] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, December 2005.
[25] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.
[26] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[27] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.
[28] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[29] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint arXiv:1211.2717, pages 1–18, 2012.
[30] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
[31] Quoc Tran-Dinh. Adaptive smoothing algorithms for nonsmooth composite convex minimization. arXiv preprint arXiv:1509.00106, 2015.
[32] Blake Woodworth and Nati Srebro. Tight Complexity Bounds for Optimizing Composite Objectives. In NIPS, 2016.
[33] Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.
[34] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[35] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.