# Connecting Optimization and Regularization Paths

Arun Sai Suggala, Carnegie Mellon University, Pittsburgh, PA 15213. asuggala@cs.cmu.edu
Adarsh Prasad, Carnegie Mellon University, Pittsburgh, PA 15213. adarshp@cs.cmu.edu
Pradeep Ravikumar, Carnegie Mellon University, Pittsburgh, PA 15213. pradeepr@cs.cmu.edu

**Abstract.** We study the implicit regularization properties of optimization techniques by explicitly connecting their optimization paths to the regularization paths of corresponding regularized problems. This surprising connection shows that iterates of optimization techniques such as gradient descent and mirror descent are pointwise close to solutions of appropriately regularized objectives. While such a tight connection between optimization and regularization is of independent intellectual interest, it also has important implications for machine learning: we can port results from regularized estimators to optimization, and vice versa. We investigate one key consequence, which borrows from the well-studied analysis of regularized estimators, to obtain tight excess risk bounds for the iterates generated by optimization techniques.

## 1 Introduction

With the recent success of optimization techniques in training over-parametrized deep neural networks, there has been growing interest in understanding the implicit regularization properties of various optimization techniques. Consequently, one line of work has focused on characterizing the implicit bias of the global optimum reached by various optimization algorithms. For example, Gunasekar et al. [2017] consider the problem of matrix factorization and show that gradient descent (GD) on the un-regularized objective converges to the minimum nuclear norm solution. Soudry et al. [2017] study gradient descent on un-regularized logistic regression and show that when the data is linearly separable, gradient descent converges to a max-margin solution. Gunasekar et al. [2018] generalize the results of Soudry et al. [2017] and study the limiting behavior of the iterates of general optimization techniques when the data is linearly separable.

Another line of work has focused on the implicit regularization properties of early stopping of various optimization algorithms, a technique widely used in neural network training. These works show that early stopping the iterative optimization of an empirical problem performs a form of implicit regularization. Yao et al. [2007] focus on non-parametric regression in reproducing kernel Hilbert spaces and provide theoretical justification for early stopping. In a similar setting, Raskutti et al. [2014] show that early stopping gradient descent on the least squares objective achieves risk bounds similar to those of the corresponding regularized problem, also called ridge regression. Hardt et al. [2015], Rosasco and Villa [2015] study the implicit regularization properties of early stopping stochastic gradient descent (SGD). All these results show that early stopping achieves performance similar to optimizing the corresponding regularized objective. Furthermore, several recent works suggest that there could be a much deeper connection between the iterates generated by optimization techniques on un-regularized objectives (the optimization path) and the minimizers of corresponding regularized objectives (the regularization path) than the performance similarity observed in the early stopping literature.
Friedman and Popescu [2003] empirically observe that for linear regression, the optimization and regularization paths are very similar to each other. Rosset et al. [2004a] show that, under certain conditions on the problem, the path traced by coordinate descent or boosting is similar to the regularization path of the $\ell_1$-constrained problem. In a related work, Neu and Rosasco [2018] consider the problem of linear least squares regression and show that the iterates produced by GD on the least squares objective are related to the solutions of ridge regression. Specifically, for any given regularization parameter of ridge regression, Neu and Rosasco [2018] show that there exists a weighting scheme for GD iterates that is exactly equal to the ridge solution.

In this work, we take a step towards understanding the deeper connection between the two paths by explicitly connecting the optimization path to the regularization path of the corresponding regularized problem. Our results show that the sequence of iterates produced by iterative optimization techniques such as gradient descent and mirror descent on strongly convex functions lies pointwise close to the regularization path of a corresponding regularized objective. This surprising connection allows us to transfer insights from regularization to optimization and vice versa. We expect that our work will lead to a new class of results in both fields that explicitly draw upon this connection.

In this paper, we focus on a particular consequence of our connection: we derive excess risk bounds for the iterates of optimization techniques. There is a long line of work in machine learning and statistics on understanding the risk bounds of regularized problems [Negahban et al., 2009, Hsu et al., 2012]. We utilize these results to derive excess risk bounds for the iterates of optimization techniques. Recently, there has also been a line of work studying the excess risk of iterates of optimization techniques directly. Yao et al. [2007], Raskutti et al. [2014] focus on non-parametric regression in a reproducing kernel Hilbert space and derive excess risk bounds for the iterates of gradient descent. Wei et al. [2017] extend these results to a broad class of loss functions. In the context of finite dimensional spaces, Hardt et al. [2015], Chen et al. [2018] use the notion of algorithmic stability, introduced by Bousquet and Elisseeff [2002], to derive bounds on the expected excess risk of iterates of various methods. Our technique for deriving excess risk bounds can be viewed as an alternative to stability, and has the advantage that we can make use of existing results on the statistical properties of regularized problems. Moreover, this approach has the potential to yield much tighter bounds than stability, and we stress that any improvement in the analysis of regularized estimation directly translates to a tighter bound for the corresponding optimization problem.

The main contributions of the paper are as follows:

- For strongly convex and smooth objectives, we explicitly connect the optimization path of GD and the regularization path of the $\ell_2^2$-penalized objective. We further extend these results to mirror descent with strongly convex and smooth divergences.
- We use these connections to derive the excess risk of the iterates of GD.
- For convex objectives, we show that the connection need not hold in general. However, for the problem of classification with separable data, we show that for losses with exponentially decaying tails, the optimization path of GD is close to the regularization path of the corresponding regularized objective.
## 2 Strongly Convex Loss

In this section we explicitly connect the optimization path of GD and the regularization path of the $\ell_2^2$-penalized objective for strongly convex and smooth functions. Let $f : \mathbb{R}^p \to \mathbb{R}$ be a twice differentiable function which is strongly convex and smooth with parameters $m, M > 0$. In this work we mainly focus on continuous time GD (that is, GD with an infinitesimally small step size). We define the optimization path of GD on $f(\cdot)$, started at $\theta_0$, as the trajectory followed by the GD iterates, which is given by the following ordinary differential equation (ODE):
$$\frac{d\theta(t)}{dt} = -\nabla f(\theta(t)), \qquad \theta(0) = \theta_0.$$
We now relate the above optimization path to the regularization path of the corresponding $\ell_2^2$-penalized objective, which is defined as the one-dimensional path of optimal solutions of the following regularized objective, obtained as the penalty parameter $\lambda$ varies over $[0, \infty)$:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; f(\theta) + \frac{1}{2\lambda}\,\|\theta - \theta_0\|_2^2. \tag{1}$$
The following theorem bounds the distance between the optimization and regularization paths.

**Theorem 1.** Let $\hat\theta$ be the minimizer of $f(\cdot)$. Let $\nu = m/M$ and let $c = \frac{2\nu}{1+\nu}$. Moreover, let the regularization penalty and time $t$ be related through the relation $\lambda(t) = \frac{1}{cm}\left(e^{cMt} - 1\right)$, and suppose GD is started at $\theta_0$. Then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \frac{\|\nabla f(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt} - e^{-cMt}\right).$$

Note that when $\nu = 1$ the upper bound in the above theorem is equal to $0$, thus showing that both paths are exactly the same. To get a sense of the quality of the bound, we compare it with a simple triangle-inequality based bound, where we derive an upper bound for $\|\theta(t) - \bar\theta(\lambda(t))\|_2$ by first bounding $\|\theta(t) - \hat\theta\|_2$ and $\|\bar\theta(\lambda(t)) - \hat\theta\|_2$ and then using the triangle inequality.

**Theorem 2 (Weak Bound).** Consider the same setting as in Theorem 1, with $\lambda(t) = \frac{1}{cm}\left(e^{cMt} - 1\right)$. Then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \|\theta(t) - \hat\theta\|_2 + \|\bar\theta(\lambda(t)) - \hat\theta\|_2 \;\le\; \frac{\|\nabla f(\theta_0)\|_2}{m}\left(e^{-mt} + \frac{c}{e^{cMt} - 1 - c}\right).$$

The above theorem gives an $O(e^{-mt} + e^{-Mt})$ upper bound for the distance $\|\theta(t) - \bar\theta(\lambda(t))\|_2$, whereas Theorem 1 gives an $O(e^{-mt} - e^{-Mt})$ upper bound, which is strictly better than Theorem 2. Moreover, for small $t$, the bound in Theorem 2 is much weaker than the bound in Theorem 1. As we show later, the tighter bound in Theorem 1 helps us obtain tight generalization bounds and early stopping rules for the iterates of GD. By choosing a different relation $\lambda(t)$, one can obtain a different connection and a different upper bound for the distance between the optimization and regularization paths. In Appendix A.1 we consider different choices for $\lambda(t)$ and obtain different bounds. We believe the bounds in Theorem 1 and Appendix A.1 can be further improved by choosing an optimal $\lambda(t)$.
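To make this connection concrete, here is a minimal numerical sketch, our illustration rather than the authors' code, for a strongly convex quadratic where both the GD path and the $\ell_2^2$-regularization path of Equation (1) are available in closed form. Rather than committing to the exact time-to-penalty calibration $\lambda(t)$ of Theorem 1, the sketch simply measures how far each GD iterate is from the nearest point on a fine grid along the regularization path; the test function, the grid, and all variable names are illustrative choices.

```python
# Minimal sketch (ours, not the paper's code): on a strongly convex quadratic, each point of
# the continuous-time GD path lies close to some point of the l2^2-regularization path
#   bar_theta(lam) = argmin_theta f(theta) + ||theta - theta0||^2 / (2*lam).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
H = A @ A.T + 0.5 * np.eye(p)           # Hessian of f: strongly convex and smooth
theta_hat = rng.standard_normal(p)      # minimizer of f
theta0 = np.zeros(p)                    # start of GD and centering point of the penalty

def gd_path(t):
    """theta(t) solving d theta/dt = -grad f(theta), theta(0) = theta0 (closed form for quadratics)."""
    return theta_hat + expm(-H * t) @ (theta0 - theta_hat)

def reg_path(lam):
    """argmin_theta f(theta) + ||theta - theta0||^2 / (2*lam), closed form for quadratics."""
    return np.linalg.solve(H + np.eye(p) / lam, H @ theta_hat + theta0 / lam)

lams = np.logspace(-3, 4, 400)                       # grid along the regularization path
reg_points = np.array([reg_path(l) for l in lams])

for t in [0.1, 0.5, 1.0, 2.0, 4.0]:
    th_t = gd_path(t)
    gap = np.min(np.linalg.norm(reg_points - th_t, axis=1))
    print(f"t={t:4.1f}  dist(theta(t), reg. path)={gap:.3e}  "
          f"dist(theta(t), minimizer)={np.linalg.norm(th_t - theta_hat):.3e}")
```

On such examples the printed distance to the regularization path is noticeably smaller than the distance to the minimizer along most of the trajectory, which illustrates (but of course does not prove) the pointwise closeness phenomenon formalized above.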
### 2.1 Extension to Mirror Descent

In this section we provide an extension of Theorem 1 to mirror descent. Before we proceed, we briefly review mirror descent; for a complete review of the properties of mirror descent and Bregman divergences, see Banerjee et al. [2005], Bubeck et al. [2015]. Let $\phi$ be a continuously differentiable Legendre function defined on $\mathbb{R}^p$. Moreover, let $\phi$ be $\alpha$-strongly convex w.r.t. a reference norm $\|\cdot\|$:
$$\phi(\theta_2) \;\ge\; \phi(\theta_1) + \langle\nabla\phi(\theta_1), \theta_2 - \theta_1\rangle + \frac{\alpha}{2}\|\theta_2 - \theta_1\|^2.$$
The Bregman divergence $D_\phi$ induced by $\phi$ is defined as $D_\phi(\theta_2, \theta_1) = \phi(\theta_2) - \phi(\theta_1) - \langle\nabla\phi(\theta_1), \theta_2 - \theta_1\rangle$.

**Mirror Descent (MD).** Suppose we want to minimize a convex function $f(\cdot)$ over $\mathbb{R}^p$. Then mirror descent with divergence $D_\phi$ uses the following update rule to estimate the minimizer:
$$\theta_{t+1} = \operatorname*{argmin}_{\theta}\; \eta_t\,\langle\nabla f(\theta_t), \theta\rangle + D_\phi(\theta, \theta_t).$$
Solving the above problem gives us the following update rule: $\nabla\phi(\theta_{t+1}) = \nabla\phi(\theta_t) - \eta_t\nabla f(\theta_t)$. The continuous time dynamics of MD, started at $\theta_0$, is given by the following ODE:
$$\frac{d\theta(t)}{dt} = -\left[\nabla^2\phi(\theta(t))\right]^{-1}\nabla f(\theta(t)), \qquad \theta(0) = \theta_0.$$

**$D_\phi$-Regularization.** We connect the optimization path of MD to the regularization path of the following regularized problem, where $\theta_0$ is some point in the domain:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; f(\theta) + \frac{1}{\lambda}\,D_\phi(\theta, \theta_0).$$
The solution $\bar\theta(\lambda)$ satisfies the following continuous time dynamics:
$$\frac{d\bar\theta(\lambda)}{d\lambda} = -\left[\lambda\,\nabla^2 f(\bar\theta(\lambda)) + \nabla^2\phi(\bar\theta(\lambda))\right]^{-1}\nabla f(\bar\theta(\lambda)).$$
We now show that the optimization path of mirror descent with divergence $D_\phi$, started at $\theta_0$, is closely related to the regularization path of the corresponding $D_\phi$-regularized objective. Our analysis is similar to the analysis of GD.

**Theorem 3.** Let $\hat\theta$ be the minimizer of $f(\cdot)$. Suppose $f$ is $m$-strongly convex and $M$-smooth, and $\phi$ is $\alpha$-strongly convex w.r.t. the Euclidean norm. Moreover, suppose $\phi$ is $\beta$-smooth w.r.t. the Euclidean norm in the following balls around $\theta_0$ and $\hat\theta$:
$$\{\theta : D_\phi(\theta, \theta_0) \le D_\phi(\hat\theta, \theta_0)\} \;\cup\; \{\theta : D_\phi(\hat\theta, \theta) \le D_\phi(\hat\theta, \theta_0)\}.$$
Let the regularization penalty and time $t$ be related through the relation $\lambda(t) = \frac{\beta}{cm}\left(e^{cMt/\beta} - 1\right)$, where $c = \frac{2\nu}{1+\nu}$ and $\nu = m/M$. If MD is started at $\theta_0$, then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \frac{\beta\,\|\nabla f(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt/\beta} - e^{-cMt/\alpha}\right).$$

Note that when $\phi(x) = \tfrac{1}{2}\|x\|_2^2$ (so that $\alpha = \beta = 1$), we retrieve the bounds of GD in Theorem 1. An example of a divergence which satisfies the assumptions of strong convexity and smoothness of $\phi$ over $\mathbb{R}^p$ is the Mahalanobis distance, the divergence induced by the function $\phi(x) = x^T A x$ for some positive definite matrix $A$. However, we note that many popular divergences, such as the KL-divergence, do not satisfy the smoothness condition over the entire space $\mathbb{R}^p$. For such divergences, $\beta$ depends on the distance between the starting point $\theta_0$ and the minimizer $\hat\theta$, and on their location in the space.
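As a companion to Theorem 3, the following sketch (again ours, with illustrative names, sizes, and step size) instantiates mirror descent with a Mahalanobis-type potential $\phi(x) = \tfrac12 x^T A x$, a scaled version of the example above, on a quadratic objective, and compares the iterates against the corresponding $D_\phi$-regularization path. As in the previous sketch, closeness is measured against a grid along the path rather than through the exact calibration of Theorem 3.

```python
# Minimal sketch (ours): mirror descent with potential phi(x) = 0.5 x^T A x (a Mahalanobis-type
# Bregman divergence) on a quadratic objective, compared against the D_phi-regularization path
#   bar_theta(lam) = argmin_theta f(theta) + D_phi(theta, theta0) / lam.
import numpy as np

rng = np.random.default_rng(1)
p = 4
B = rng.standard_normal((p, p))
H = B @ B.T + np.eye(p)                   # Hessian of the quadratic objective f
theta_hat = rng.standard_normal(p)        # minimizer of f
A = np.diag(rng.uniform(0.5, 2.0, p))     # positive definite matrix defining phi
A_inv = np.linalg.inv(A)
theta0 = np.zeros(p)
eta = 1e-3                                # small step size to mimic the continuous-time dynamics

grad_f = lambda th: H @ (th - theta_hat)

def reg_path(lam):
    """Closed form of argmin_theta f(theta) + D_phi(theta, theta0)/lam for quadratic f."""
    return np.linalg.solve(H + A / lam, H @ theta_hat + (A @ theta0) / lam)

reg_points = np.array([reg_path(l) for l in np.logspace(-3, 4, 400)])

# Discretized mirror descent: grad phi(theta_{k+1}) = grad phi(theta_k) - eta * grad f(theta_k),
# i.e. theta_{k+1} = theta_k - eta * A^{-1} grad f(theta_k) for this choice of phi.
theta = theta0.copy()
for k in range(1, 5001):
    theta = theta - eta * (A_inv @ grad_f(theta))
    if k % 1000 == 0:
        gap = np.min(np.linalg.norm(reg_points - theta, axis=1))
        print(f"step {k:5d}: dist to D_phi-regularization path = {gap:.3e}, "
              f"dist to minimizer = {np.linalg.norm(theta - theta_hat):.3e}")
```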
## 3 Consequences for Excess Risk of GD Iterates

In this section we utilize the connection between optimization and regularization paths derived in Theorem 1 to provide excess risk bounds for GD iterates. To this end, we first derive excess risk bounds for the solutions of the regularized problem, and then combine them with the result of Theorem 1 to obtain excess risk bounds for GD iterates.

### 3.1 General Analysis

In this section we provide a general statistical analysis of the solution of the regularized problem in Equation (1), for general statistical learning problems. Suppose we are given $n$ i.i.d. samples $D_n = \{x_i\}_{i=1}^n$, where $x_i \in \mathcal{X}$, drawn from a distribution $P$. Let $\ell : \mathbb{R}^p \times \mathcal{X} \to \mathbb{R}$ be a loss function that assigns a cost $\ell(\theta, x)$ to an observation $x$. Define the risk $R(\theta)$ as $R(\theta) = \mathbb{E}_{X\sim P}[\ell(\theta, X)]$ and let $\theta^*$ be the minimizer of the risk $R(\theta)$. Given samples $D_n$, our goal is to use $D_n$ to obtain an estimate $\hat\theta$ that has low excess risk $R(\hat\theta) - \min_\theta R(\theta)$. Let $R_n(\theta)$ denote the empirical risk, defined as $R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i)$. We consider the following regularized problem for estimating a $\theta$ with low excess risk:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; R_n(\theta) + \frac{1}{2\lambda}\,\|\theta\|_2^2. \tag{2}$$
The following theorem bounds the parameter estimation error of the minimizer of the above problem.

**Theorem 4.** Suppose the empirical risk $R_n(\theta)$ is $m$-strongly convex and $M$-smooth. Consider the regularized problem in Equation (2). Suppose the regularization penalty satisfies $\frac{1}{\lambda} \ge \frac{2\,\|\nabla R_n(\theta^*)\|_2}{\|\theta^*\|_2}$. Then the optimal solution $\bar\theta(\lambda)$ satisfies $\|\bar\theta(\lambda) - \theta^*\|_2 \le \frac{3}{m\lambda}\,\|\theta^*\|_2$.

Using the above result and the result of Theorem 1, we now bound the parameter error of the iterates of continuous time GD.

**Corollary 5.** Suppose the conditions of Theorem 1 are satisfied. Moreover, let $t$ satisfy $t \le \frac{1}{cM}\log\left(1 + \frac{cm\,\|\theta^*\|_2}{2\,\|\nabla R_n(\theta^*)\|_2}\right)$, where $c = \frac{2m}{m+M}$. Then $\theta(t)$ satisfies the following error bound:
$$\|\theta(t) - \theta^*\|_2 \;\le\; \frac{\|\nabla R_n(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt} - e^{-cMt}\right) \;+\; \frac{3c\,e^{-cMt}}{1 - e^{-cMt}}\,\|\theta^*\|_2.$$

Note that the above results provide deterministic error bounds for a particular choice of $\lambda$ and $t$. The random quantities $m$, $M$, $\|\nabla R_n(\theta^*)\|_2$ need to be bounded to instantiate the above result for specific learning problems. Let $\bar m$, $\bar M$ be the strong convexity and smoothness parameters of the population risk $R(\theta)$. Using standard tools from empirical process theory, under certain regularity conditions on the distribution $P$ and on $R(\theta)$, one can show that $\|\nabla R_n(\theta^*)\|_2$ and $\|\nabla R_n(0)\|_2$ scale as $O\big(\sqrt{p/n} + \|\theta^*\|_2\big)$, and that $m$, $M$ are close to $\bar m$, $\bar M$, with high probability. Substituting these in Corollary 5 gives a bound for $\|\theta(t^*) - \theta^*\|_2$ at $t^* = \frac{1}{\bar c \bar M}\log(\cdot)$, where $\bar c = \frac{2\bar m}{\bar m + \bar M}$. When $\bar m = \bar M$, this bound shows that at $t^*$ the GD iterate achieves the standard parametric error rate $O\big(\sqrt{p/n}\big)$. We note that the above bound can be improved with a tighter analysis of the regularized problem. In the next section, we consider the problem of linear regression and show how a better understanding of the regularized problem, coupled with our connection, helps us obtain a tighter parameter estimation error bound for the iterates of GD.

### 3.2 Tighter Analysis for Linear Regression

Recall that in linear regression we observe paired samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$. The distribution of $y$ conditioned on the covariates $x$ is specified by the linear model $y = \langle x, \theta^*\rangle + w$, where $w$ is drawn from a zero-mean distribution with bounded variance. In this work we assume that the noise has a normal distribution with variance $\sigma^2$; that is, $w \sim N(0, \sigma^2)$. The goal in linear regression is to learn a linear map $x \mapsto \langle\theta, x\rangle$ with low risk $R(\theta) = \mathbb{E}\big[(y - \langle\theta, x\rangle)^2\big]$. The empirical risk $R_n(\theta)$ is given by
$$R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle\theta, x_i\rangle\big)^2 = \frac{1}{n}\,\|y - X\theta\|_2^2,$$
where $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n\times p}$ is the matrix of covariates. Let $w = y - X\theta^*$ be the noise vector and $\hat\Sigma = \frac{1}{n}X^TX$ be the empirical covariance matrix. The regularized problem (2) corresponding to the least squares risk defined above is the ridge regression problem. Ridge regression has been well studied and analyzed in the machine learning and statistics literature. We now provide the following result from Hsu et al. [2012], which obtains tight upper bounds on the parameter estimation error of the solution of ridge regression.

**Theorem 6.** Suppose the covariate vector $x$ has a normal distribution with mean $0$ and covariance matrix $\Sigma$. Then there exist constants $c_1, c_2 > 0$ that depend on $\Sigma$, such that for $n \ge c_1\, p\log p$, the solution of ridge regression $\bar\theta(\lambda)$ satisfies, with probability at least $1 - 1/p^2$, an error bound that decomposes into a bias term and a variance term, where $\lambda_i$ denotes the $i$-th largest eigenvalue of $\Sigma$ and $\kappa = \lambda_p/\lambda_1$.

We now use the above bound to obtain error rates for the optimization path of GD. For simplicity, and to gain insight into the bound, we consider the special case of the identity covariance matrix.

**Corollary 7.** Suppose the covariate vector $x$ has a normal distribution with mean $0$ and identity covariance matrix. Then there exists a constant $c_1 > 0$ such that for $n \ge c_1\, p\log p$, the iterates of continuous time GD satisfy the following bound with probability at least $1 - 1/p^2$:
$$\|\theta(t) - \theta^*\|_2^2 \;\le\; 10\left(e^{-9t/10} + \frac{9}{10\,e^{99t/100} - 10}\right)\|\theta^*\|_2^2 \;+\; \frac{\lambda(t)^2\,\sigma^2\,p}{n},$$
where $\lambda(t) = \frac{100}{81}\left(e^{99t/100} - 1\right)$. Further, at the stopping time $t^*$ indicated by this bound, the iterate $\theta(t^*)$ satisfies the resulting error bound with a leading constant that is positive and less than $0.1$.

Note that the above bound provides an early stopping rule for GD on linear regression, and the resulting rate, especially in the low SNR regime where $\sigma$ is large, can be better than the $\frac{\sigma^2 p}{n}$ rate obtained by running GD until convergence.
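The early-stopping behaviour discussed above is easy to reproduce numerically. The following sketch is our own construction, not the paper's experiment (the sample size, step size, and noise level are illustrative choices): it runs discrete GD on a simulated linear model and tracks the parameter error, which first decreases and then increases, so that stopping early beats running to convergence.

```python
# Sketch (ours, not the authors' code) of the early-stopping behaviour behind Corollary 7:
# for GD on least squares, the parameter error ||theta_k - theta*|| dips at an intermediate
# iteration before rising toward the error of the un-regularized least-squares solution.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 120, 100, 2.0                      # illustrative sizes and noise level
theta_star = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + sigma * rng.standard_normal(n)

S = X.T @ X / n                                  # empirical covariance matrix
b = X.T @ y / n
eta = 0.01                                       # step size for GD on (1/2n)||y - X theta||_2^2

theta, errs = np.zeros(p), []
for _ in range(20000):
    theta -= eta * (S @ theta - b)               # gradient of the scaled least-squares risk
    errs.append(np.linalg.norm(theta - theta_star))

for k in [10, 100, 1000, 10000, 20000]:
    print(f"iteration {k:6d}: ||theta_k - theta*|| = {errs[k-1]:.3f}")
print(f"minimum error {min(errs):.3f} attained at iteration {int(np.argmin(errs)) + 1}")
```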
**Comparison with stability.** Hardt et al. [2015], Chen et al. [2018] used stability as a technique to provide expected excess risk bounds for the iterates generated by an iterative optimization algorithm. We note that in the setting of strong convexity, existing stability based approaches impose a much stronger condition than strong convexity of $R_n(\theta)$: they require the loss function $\ell(\theta, x)$ to be strongly convex in $\theta$ at each $x \in \mathcal{X}$. For example, this condition never holds for linear regression with dimension $p > 1$. Under the assumption that the loss function $\ell(\theta, x)$ is $m$-strongly convex and $M$-smooth at each $x \in \mathcal{X}$, stability gives an expected excess risk bound of the form $\mathbb{E}\big[R(\theta(t)) - R(\theta^*)\big] = O\!\big(\tfrac{1}{mn}\big)$. Note that such a bound does not provide an early stopping rule, and suggests that one has to run the algorithm until convergence for the best possible rates. Moreover, our approach can obtain high probability statements, whereas the above rates hold only in expectation.

**Comparison with VC and Rademacher complexity bounds.** Traditional techniques for bounding the excess risk of iterates proceed by separately bounding the optimization error $\|\theta(t) - \hat\theta\|$ and the statistical error $\|\hat\theta - \theta^*\|$, and then using the triangle inequality to bound $\|\theta(t) - \theta^*\|$. Such a technique gives rates of the form $O\big(e^{-mt} + \sigma\sqrt{p/n}\big)$. These rates suggest that one should always run GD until the end to obtain the best possible rates, and cannot predict optimal early stopping rules. Note that the bound in Corollary 7 at the early stopping time is much better than the rates obtained using standard VC and Rademacher complexity bounds. This is especially true in the low SNR regime, where $\sigma$ is large and, as a result, the signal-to-noise ratio is low. Moreover, this rate is the same as the ridge regression rate. This shows that GD with early stopping can obtain rates similar to ridge regression, and rates which are tighter than those obtained via VC bounds and stability based risk bounds.

## 4 Convex Loss

Having studied the setting where $f(\cdot)$ is strongly convex, we next turn our attention to losses which are merely convex. As we show below, in the convex case the connection between the two paths is more nuanced and, in particular, is problem specific. First, we derive a result which characterizes the end-point of the regularization path.

**Theorem 8.** Let $f : \mathbb{R}^p \to \mathbb{R}$ be a convex function, and suppose $f$ has a minimizer. Let $\bar\theta(\lambda)$ be the minimizer of $f(\theta) + \frac{1}{2\lambda}\|\theta - \theta_0\|_2^2$. Then, as $\lambda \to \infty$, $\bar\theta(\lambda)$ converges to the minimizer of $f$ which is closest to $\theta_0$.

We note that this result can be viewed as the regularization-path analog of the result of Gunasekar et al. [2017], where it was shown that for matrix factorization, the optimization path of GD converges to the minimum Frobenius norm solution. Next, we present a simple counterexample which shows that in the convex regime, the regularization and optimization paths need not converge to the same point.

**Lemma 1.** Consider the following function on the 2D domain $\mathbb{R} \times (-100, \infty)$: $f(x, y) = \frac{(x+1)^2}{y+100}$. Suppose continuous time gradient descent is initialized at $\theta_0 = (2, 1)$. Then
$$\lim_{\lambda\to\infty} \bar\theta(\lambda) \;\ne\; \lim_{t\to\infty}\theta(t).$$

[Figure: contours of $(x+1)^2/(y+100)$ with the GD path and the regularization path overlaid.]

The above result shows that for general convex losses, the two paths need not lie close to each other, even as $t, \lambda \to \infty$.
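The counterexample of Lemma 1 is easy to check numerically. In the sketch below (ours; the step size and horizon are arbitrary choices), the GD trajectory started at $(2, 1)$ drifts upward in its second coordinate and settles at a point different from $(-1, 1)$, which, by Theorem 8, is where the regularization path ends, since $(-1, 1)$ is the minimizer of $f$ closest to $\theta_0$.

```python
# Numerical check (ours) of the Lemma 1 counterexample for f(x, y) = (x+1)^2 / (y+100).
# The minimizers of f form the line {x = -1}; by Theorem 8 the regularization path started
# from theta0 = (2, 1) ends at the closest minimizer, (-1, 1).  Continuous-time GD from the
# same theta0, however, drifts upward in y and converges to a different point on that line.
import numpy as np

theta0 = np.array([2.0, 1.0])

def grad_f(th):
    x, y = th
    return np.array([2.0 * (x + 1.0) / (y + 100.0),
                     -((x + 1.0) ** 2) / (y + 100.0) ** 2])

# Explicit-Euler discretization of d theta / dt = -grad f(theta).
th = theta0.copy()
for _ in range(200_000):
    th -= 1e-2 * grad_f(th)

print("GD limit (approx):", np.round(th, 4))     # roughly (-1.0, 1.02...), not (-1.0, 1.0)
print("end of regularization path (Theorem 8): [-1.  1.]")
```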
### 4.1 Classification Loss

In this section, we focus on classification losses and show that the optimization path of GD and the corresponding regularization path of the $\ell_2^2$-penalized risk are close to each other. Commonly used classification losses such as the exponential and logistic losses are not strongly convex and, moreover, when the data $D_n$ is separable, the risk does not admit a finite minimizer. In such cases, a more careful analysis is needed to bound the distance between the optimization and regularization paths. Recent works by Ji and Telgarsky [2018], Soudry et al. [2017] study the behavior of gradient descent on un-regularized logistic regression and show that when the data is separable, GD converges to a max-margin solution. In this section we first show that similar properties hold for the regularization path of $\ell_2^2$-regularized objectives.

Recall that in classification we observe samples $D_n = \{(x_i, y_i)\}_{i=1}^n$, where each $(x_i, y_i) \in \mathbb{R}^p \times \{-1, +1\}$. Let $\ell(\theta, (x, y)) = \phi(y\,x^T\theta)$ be the loss at $(x, y)$, and consider the regularized problem in Equation (2). We first present the following useful result from Rosset et al. [2004b], which shows that when the data is linearly separable, as $\lambda \to \infty$, the minimizer $\bar\theta(\lambda)$ of (2) converges to a max-margin solution.

**Lemma 2.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\phi(z)$ be a monotone non-increasing loss function. If there exists $T > 0$ (possibly $T = \infty$) such that
$$\lim_{t\to T}\frac{\phi\big(t(1-\epsilon)\big)}{\phi(t)} = \infty \quad \text{for all } \epsilon > 0,$$
then $\phi$ is a margin maximizing loss function, in the sense that any convergence point of the normalized solutions $\frac{\bar\theta(\lambda)}{\|\bar\theta(\lambda)\|_2}$ of the regularized problem (2), as $\lambda \to \infty$, is an $\ell_2$ margin maximizing separating hyperplane. Consequently, if this margin-maximizing hyperplane is unique, then the normalized solutions converge to it.

The condition on $\phi$ in the above lemma is satisfied by many popular loss functions, such as the logistic, exponential, and squared hinge losses. Note that Lemma 2 is asymptotic in nature. Our first contribution is to derive a non-asymptotic version of this theorem. We focus on the exponential loss $\phi(z) = e^{-z}$, but our results can be generalized to other losses as well. Perhaps interestingly, our non-asymptotic bounds depend on the Lambert W (product-log) function [Corless et al., 1996], which has a long history of applications to instrument design [Ohayon and Ron, 2013] and statistical physics [Valluri et al., 2000].

**Theorem 9.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\ell(\theta, (x, y)) = \exp(-\theta^T(yx))$ and let $\bar\theta(\lambda)$ be the solution to the regularized problem in Equation (2). Then $\bar\theta(\lambda)$ satisfies:
i) $R_n(\bar\theta(\lambda)) \le C_1\,\frac{W(\lambda)}{\lambda}$, where $W(\cdot)$ is the Lambert W function;
ii) $\|\bar\theta(\lambda)\|_2 = \Theta(\log\lambda)$;
iii) $\min_i \frac{y_i\,x_i^T\bar\theta(\lambda)}{\|\bar\theta(\lambda)\|_2} \ge \left(1 - \frac{\log\log\lambda}{\log\lambda} - \frac{\log n}{\log\lambda}\right)\gamma$, where $\gamma$ denotes the maximum $\ell_2$ margin of $D_n$.

As $\lambda \to \infty$, the above theorem shows that $\bar\theta(\lambda)$ converges to a max-margin solution, thus recovering the asymptotic result of Rosset et al. [2004b]. Moreover, our result shows that the minimizer of the regularized problem (2) converges to the max-margin solution at a slow rate: the margin converges at an $O\!\big(\frac{1}{\log\lambda}\big)$ rate.

**Comparison with GD on $R_n(\theta)$.** Soudry et al. [2017] analyze gradient descent on the exponential loss with separable data and obtain similar bounds for the iterates of GD. Letting $\theta(t)$ be the iterate of GD at time $t$, they show that $R_n(\theta(t))$ goes down as $O(1/t)$, the margin converges as $O(1/\log t)$, and $\|\theta(t)\|_2$ increases as $\log t$. When combined with our result, this shows that the optimization and regularization paths are very close to each other.

**Theorem 10.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\ell(\theta, (x, y)) = \exp(-\theta^T(yx))$. Suppose the regularization parameter and time $t$ are related as $\lambda(t) = t$, and suppose GD is initialized at $\theta_0$. Then, for any $t \ge 0$, the difference $\big|\min_i y_i\langle x_i, \theta(t)\rangle - \min_i y_i\langle x_i, \bar\theta(\lambda(t))\rangle\big|$ is bounded.
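Theorem 10 and the discussion above suggest a direct numerical comparison of the two paths for the exponential loss on separable data. The sketch below is our own illustration (the data dimensions, step size, and inner optimizer are arbitrary choices, and the discrete pairing $\lambda = \eta\,t$ is our reading of the continuous-time relation $\lambda(t) = t$): it tracks the normalized margin along the GD path and along the corresponding $\ell_2^2$-regularized solutions.

```python
# Sketch (ours, not the authors' code): normalized margins along the GD path and along the
# l2^2-regularization path for the exponential loss on linearly separable data, using the
# pairing lambda = eta * (number of GD steps) as a discrete analogue of lambda(t) = t.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p, eta = 32, 128, 0.1                      # p >> n, so the sampled data is separable w.h.p.
theta_star = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = np.sign(X @ theta_star)

risk = lambda th: np.mean(np.exp(-y * (X @ th)))
grad_risk = lambda th: -(X.T @ (y * np.exp(-y * (X @ th)))) / n
margin = lambda th: np.min(y * (X @ th)) / np.linalg.norm(th)

theta_gd = np.zeros(p)
for step in range(1, 20001):
    theta_gd -= eta * grad_risk(theta_gd)
    if step % 5000 == 0:
        lam = eta * step
        obj = lambda th: risk(th) + th @ th / (2.0 * lam)
        jac = lambda th: grad_risk(th) + th / lam
        theta_reg = minimize(obj, theta_gd, jac=jac, method="L-BFGS-B").x
        print(f"step={step:6d}  GD margin={margin(theta_gd):.4f}  "
              f"regularized margin={margin(theta_reg):.4f}")
```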
## 5 Experiments

In this section, we conduct simulations to corroborate our theoretical findings.

### 5.1 Strongly Convex

[Figure 1: Connecting GD and $\ell_2^2$-penalization for linear regression. (a) $R(\theta_t) - R^*$ vs. $t$; (b) $t^*$ vs. $\log(n)$; (c) $\log(R(\theta_{t^*}) - R^*)$ vs. $\log(n)$.]

We use linear regression to empirically verify our results connecting ridge regression and gradient descent. We also corroborate our findings on excess risk and the optimality of the early-stopping rule for gradient descent.

**Setup.** We simulate a linear model by drawing the covariates from an isotropic Gaussian, $X \sim N(0, I_{p\times p})$, and the response $y\,|\,x \sim N(\theta^{*T}x, \sigma^2)$, where $\theta^* = [1/\sqrt{p}, 1/\sqrt{p}, \ldots, 1/\sqrt{p}]^T$ and $\sigma^2 = 2$. We generate a sequence of iterates by GD with step size $0.01$, and a corresponding sequence of solutions of the penalized estimation problem. We also study how the optimal iteration number $t^*$, which minimizes the excess risk, changes as we increase the number of samples for GD. In this case, we fix $p = 100$ and vary the number of samples $n$ from $100$ to $1500$. Similarly, we find the optimal penalization for each $n$. All results are reported after averaging over $50$ trials.

**Results.** We report our results in Figure 1. As shown by our theory, the excess risk bounds for GD on OLS are composed of two terms, one which increases with $t$ and one which decreases with $t$. Hence, one expects the excess risk to first decrease, then increase, before finally settling, which is corroborated by Figure 1(a). Figure 1(b) shows a logarithmic relationship between $t^*$ and $n$, thereby verifying our theoretical claims on $t^*$. Figure 1(c) shows that the optimal risk for GD coincides with that of $\ell_2^2$-penalized estimation across different values of $n$.

### 5.2 Classification

[Figure 2: Connecting GD and $\ell_2^2$-penalization for logistic regression. (a) $\|\bar\theta(\lambda)\|_2$ vs. $\log(\lambda)$; (b) margin of $\bar\theta(\lambda)$ vs. $\lambda$; (c) $|\mathrm{Margin}(\bar\theta(\lambda)) - \mathrm{Margin}(\theta_t)|$ vs. $t$.]

In this section, we corroborate our results connecting GD and $\ell_2^2$-regularization in the context of logistic regression on separable data. In particular, we corroborate our findings on the parameter error, the behavior of the margin, and the difference in margin between the optimization and regularization paths.

**Setup.** We construct a classification dataset by drawing covariates $X$ from an isotropic Gaussian, i.e. $X \sim N(0, I_p)$. We fix the true parameter $\theta^* = [1/\sqrt{p}, 1/\sqrt{p}, \ldots, 1/\sqrt{p}]^T$, the dimension $p = 128$, and the number of samples $n = 32$. Note that our choice of $p, n$ ensures that the generated data is separable. We run GD with a step size $\eta = 0.123$ and construct the corresponding points on the regularization path ($\lambda(t) = \eta t$) of the $\ell_2^2$-penalized objective.

**Results.** We report our results in Figure 2. Figure 2(a) shows the norm of the points on the optimization and regularization paths. As predicted by our theory, the norm increases at a logarithmic rate. Figure 2(b) plots the $\ell_2$-margin, $\min_i \frac{y_i\,\theta^T x_i}{\|\theta\|_2}$, for both the optimization and regularization paths. The figure confirms our result that the margin increases with $\lambda$. Although the margins of both the optimization and regularization paths increase, Figure 2(c) shows that after a few initial iterations, the difference in margin between $\theta_t$ and $\bar\theta(\lambda)$ decreases.
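The linear-regression protocol of Section 5.1 (Figure 1) can be approximated in a few lines. The sketch below is our reconstruction rather than the authors' released code: it uses a single trial per $n$ and a coarse penalty grid instead of 50 averaged trials. Because the covariance is the identity, the excess risk is simply $\|\theta - \theta^*\|_2^2$, so the optimal stopping time and the optimal ridge penalty can be read off directly.

```python
# Rough sketch (ours, not the authors' released code) of the Figure 1 protocol: for each n,
# find the GD stopping time t* and the ridge penalty lambda* that minimize excess risk, and
# compare the two optima.  With X ~ N(0, I), the excess risk is ||theta - theta*||^2.
import numpy as np

rng = np.random.default_rng(4)
p, sigma2, eta = 100, 2.0, 0.01
theta_star = np.ones(p) / np.sqrt(p)

def one_trial(n):
    X = rng.standard_normal((n, p))
    y = X @ theta_star + np.sqrt(sigma2) * rng.standard_normal(n)
    S, b = X.T @ X / n, X.T @ y / n
    # GD path: record the excess risk at every iterate.
    theta, gd_risks = np.zeros(p), []
    for _ in range(5000):
        theta -= eta * (S @ theta - b)
        gd_risks.append(np.sum((theta - theta_star) ** 2))
    # Ridge path over a grid of penalties.
    ridge_risks = [np.sum((np.linalg.solve(S + np.eye(p) / lam, b) - theta_star) ** 2)
                   for lam in np.logspace(-2, 3, 60)]
    return int(np.argmin(gd_risks)) + 1, min(gd_risks), min(ridge_risks)

for n in [100, 400, 1500]:
    t_star, gd_best, ridge_best = one_trial(n)
    print(f"n={n:5d}  t*={t_star:5d}  best GD excess risk={gd_best:.4f}  "
          f"best ridge excess risk={ridge_best:.4f}")
```

As in Figure 1, the optimal stopping iteration grows roughly logarithmically with $n$, and the best early-stopped GD excess risk closely tracks the best ridge excess risk.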
## 6 Summary and Future Work

In this work, we studied the connection between the trajectory of the iterates of optimization techniques such as GD and mirror descent, and the regularization path of the corresponding regularized objective. For strongly convex functions, our results show that the two paths are pointwise close. However, for general convex functions, our results show that the two paths need not be close to each other. For the popularly studied problem of classification with separable data, we showed that the optimization and regularization paths are close to each other. We believe studying the connection between optimization and regularization paths has several advantages, the key advantage being that it can be used to study the statistical properties of iterates generated by optimization techniques. We also believe that our results on strongly convex losses can be further improved to obtain tighter connections and better generalization bounds for the iterates. An interesting direction for future work would be to see if similar connections hold for non-convex problems, and specifically for the optimization objectives that arise in deep learning. For convex losses, our current work focused on analyzing classification losses with separable data. It would be interesting to study the connection for general convex losses and identify the conditions on the loss function under which the two paths stay close to each other. While our analysis in this paper focused on GD, it would be interesting to study whether similar connections hold for other non-stochastic methods such as steepest descent, accelerated GD, and Newton's method, as well as for stochastic methods such as SGD.

## 7 Acknowledgements

We acknowledge the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033. The authors are grateful to Suriya Gunasekar and the anonymous reviewers for helpful comments on the paper.

## References

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705-1749, 2005.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv e-prints, April 2018.

Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329-359, 1996.

J. Friedman and Bogdan E. Popescu. Gradient directed regularization for linear regression and classification. Technical report, Citeseer, 2003.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv e-prints, February 2018.

Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6152-6160, 2017.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Abdolhossein Hoorfar and Mehdi Hassani. Inequalities on the Lambert W function and hyperpower function. Journal of Inequalities in Pure and Applied Mathematics, 9(2):5-9, 2008.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348-1356, 2009.

G. Neu and L. Rosasco. Iterate averaging as regularization for stochastic gradient descent. arXiv e-prints, February 2018.

Ben Ohayon and Guy Ron. New approaches in designing a Zeeman slower. Journal of Instrumentation, 8(02):P02016, 2013.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335-366, 2014.

Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630-1638, 2015.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941-973, 2004a.

Saharon Rosset, Ji Zhu, and Trevor J. Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems, pages 1237-1244, 2004b.

Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707-1739, 2009.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Sree Ram Valluri, David J. Jeffrey, and Robert M. Corless. Some applications of the Lambert W function to physics. Canadian Journal of Physics, 78(9):823-831, 2000.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067-6077, 2017.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.