# Connecting Optimization and Regularization Paths

Arun Sai Suggala, Carnegie Mellon University, Pittsburgh, PA 15213. asuggala@cs.cmu.edu
Adarsh Prasad, Carnegie Mellon University, Pittsburgh, PA 15213. adarshp@cs.cmu.edu
Pradeep Ravikumar, Carnegie Mellon University, Pittsburgh, PA 15213. pradeepr@cs.cmu.edu

**Abstract.** We study the implicit regularization properties of optimization techniques by explicitly connecting their optimization paths to the regularization paths of corresponding regularized problems. This surprising connection shows that iterates of optimization techniques such as gradient descent and mirror descent are pointwise close to solutions of appropriately regularized objectives. While such a tight connection between optimization and regularization is of independent intellectual interest, it also has important implications for machine learning: we can port results from regularized estimators to optimization, and vice versa. We investigate one key consequence, which borrows from the well-studied analysis of regularized estimators, to obtain tight excess risk bounds for the iterates generated by optimization techniques.

## 1 Introduction

With the recent success of optimization techniques in training over-parametrized deep neural networks, there has been growing interest in understanding the implicit regularization properties of various optimization techniques. Consequently, one line of work has focused on characterizing the implicit bias of the global optimum reached by various optimization algorithms. For example, Gunasekar et al. [2017] consider the problem of matrix factorization and show that gradient descent (GD) on the un-regularized objective converges to the minimum nuclear norm solution. Soudry et al. [2017] study gradient descent on un-regularized logistic regression and show that when the data is linearly separable, gradient descent converges to a max-margin solution. Gunasekar et al. [2018] generalize the results of Soudry et al. [2017] and study the limiting behavior of the iterates of general optimization techniques when the data is linearly separable.

Another line of work has focused on the implicit regularization properties of early stopping of various optimization algorithms, a technique widely used in neural network training. These works show that early stopping the iterative optimization of an empirical problem performs a form of implicit regularization. Yao et al. [2007] focus on non-parametric regression in reproducing kernel Hilbert spaces and provide theoretical justification for early stopping. In a similar setting, Raskutti et al. [2014] show that early stopping gradient descent on the least squares objective achieves risk bounds similar to those of the corresponding regularized problem, also called ridge regression. Hardt et al. [2015], Rosasco and Villa [2015] study the implicit regularization properties of early stopping stochastic gradient descent (SGD). All these results show that early stopping achieves performance similar to optimizing the corresponding regularized objective. Furthermore, several recent works suggest that there could be a much deeper connection between the iterates generated by optimization techniques on un-regularized objectives (the optimization path) and the minimizers of corresponding regularized objectives (the regularization path) than the performance similarity observed in the early stopping literature.
Friedman and Popescu [2003] empirically observe that for linear regression, the optimization and regularization paths are very similar to each other. Rosset et al. [2004a] show that, under certain conditions on the problem, the path traced by coordinate descent or boosting is similar to the regularization path of the $\ell_1$-constrained problem. In a related work, Neu and Rosasco [2018] consider the problem of linear least squares regression and show that the iterates produced by GD on the least squares objective are related to the solutions of ridge regression. Specifically, for any given regularization parameter of ridge regression, Neu and Rosasco [2018] show that there exists a weighting scheme for GD iterates that is exactly equal to the ridge solution.

In this work, we take a step towards understanding the deeper connection between the two paths by explicitly connecting the optimization path to the regularization path of the corresponding regularized problem. Our results show that the sequence of iterates produced by iterative optimization techniques such as gradient descent and mirror descent on strongly convex functions lies pointwise close to the regularization path of a corresponding regularized objective. This surprising connection allows us to transfer insights from regularization to optimization and vice versa. We expect that our work will lead to a new class of results in both fields that explicitly draw upon this connection.

In this paper, we focus on a particular consequence of our connection: we derive excess risk bounds for the iterates of optimization techniques. There is a long line of work in machine learning and statistics on understanding the risk bounds of regularized problems [Negahban et al., 2009, Hsu et al., 2012]. We utilize these results to derive excess risk bounds for the iterates of optimization techniques. Recently, there has also been a line of work studying the excess risk of iterates of optimization techniques directly. Yao et al. [2007], Raskutti et al. [2014] focus on non-parametric regression in a reproducing kernel Hilbert space and derive excess risk bounds for the iterates of gradient descent. Wei et al. [2017] extend these results to a broad class of loss functions. In the context of finite dimensional spaces, Hardt et al. [2015], Chen et al. [2018] use the notion of algorithmic stability, introduced by Bousquet and Elisseeff [2002], to derive bounds on the expected excess risk of iterates of various methods. Our technique for deriving excess risk bounds can be viewed as an alternative to stability, and has the advantage that we can make use of existing results on the statistical properties of regularized problems. Moreover, this approach has the potential to yield much tighter bounds than stability, and we stress that any improvement in the analysis of regularized estimation directly translates to a tighter bound for the corresponding optimization problem.

The main contributions of the paper are as follows:

- For strongly convex and smooth objectives, we explicitly connect the optimization path of GD and the regularization path of the $\ell_2^2$-penalized objective. We further extend these results to mirror descent with strongly convex and smooth divergences.
- We use these connections to derive the excess risk of the iterates of GD.
- For convex objectives, we show that the connection need not hold in general. However, for the problem of classification with separable data, we show that for losses with exponentially decaying tails, the optimization path of GD is close to the regularization path of the corresponding regularized objective.
## 2 Strongly Convex Loss

In this section we explicitly connect the optimization path of GD and the regularization path of the $\ell_2^2$-penalized objective for strongly convex and smooth functions. Let $f : \mathbb{R}^p \to \mathbb{R}$ be a twice differentiable function which is strongly convex and smooth with parameters $m, M > 0$. In this work we mainly focus on continuous time GD (that is, GD with an infinitesimally small step size). We define the optimization path of GD on $f(\cdot)$, started at $\theta_0$, as the trajectory followed by the GD iterates, which is given by the following ordinary differential equation (ODE):
$$\frac{d\theta(t)}{dt} = -\nabla f(\theta(t)), \qquad \theta(0) = \theta_0.$$
We now relate the above optimization path to the regularization path of the corresponding $\ell_2^2$-penalized objective, which is defined as the one-dimensional path of optimal solutions of the following regularized objective, obtained as the penalty parameter $\lambda$ varies over $[0, \infty)$:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; f(\theta) + \frac{1}{2\lambda}\,\|\theta - \theta_0\|_2^2. \tag{1}$$
The following theorem bounds the distance between the optimization and regularization paths.

**Theorem 1.** Let $\hat\theta$ be the minimizer of $f(\cdot)$. Let $\nu = m/M$ and let $c = \frac{2\nu}{1+\nu}$. Moreover, let the regularization penalty and time $t$ be related through the relation $\lambda(t) = \frac{1}{cm}\left(e^{cMt} - 1\right)$, and suppose GD is started at $\theta_0$. Then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \frac{\|\nabla f(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt} - e^{-cMt}\right).$$

Note that when $\nu = 1$ the upper bound in the above theorem is equal to $0$, thus showing that both paths are exactly the same. To get a sense of the quality of the bound, we compare it with a simple triangle-inequality based bound, where we derive an upper bound for $\|\theta(t) - \bar\theta(\lambda(t))\|_2$ by first bounding $\|\theta(t) - \hat\theta\|_2$ and $\|\bar\theta(\lambda(t)) - \hat\theta\|_2$ and then using the triangle inequality.

**Theorem 2 (Weak Bound).** Consider the same setting as in Theorem 1, with $\lambda(t) = \frac{1}{cm}\left(e^{cMt} - 1\right)$. Then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \|\theta(t) - \hat\theta\|_2 + \|\bar\theta(\lambda(t)) - \hat\theta\|_2 \;\le\; \frac{\|\nabla f(\theta_0)\|_2}{m}\left(e^{-mt} + \frac{c}{e^{cMt} - 1 - c}\right).$$

The above theorem gives an $O(e^{-mt} + e^{-Mt})$ upper bound for the distance $\|\theta(t) - \bar\theta(\lambda(t))\|_2$, whereas Theorem 1 gives an $O(e^{-mt} - e^{-Mt})$ upper bound, which is strictly better than Theorem 2. Moreover, for small $t$, the bound in Theorem 2 is much weaker than the bound in Theorem 1. As we show later, the tighter bound in Theorem 1 helps us obtain tight generalization bounds and early stopping rules for the iterates of GD. By choosing a different relation $\lambda(t)$, one can obtain a different connection and a different upper bound for the distance between the optimization and regularization paths. In Appendix A.1 we consider different choices for $\lambda(t)$ and obtain different bounds. We believe the bounds in Theorem 1 and Appendix A.1 can be further improved by choosing an optimal $\lambda(t)$.
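To make this connection concrete, here is a minimal numerical sketch, our illustration rather than the authors' code, for a strongly convex quadratic where both the GD path and the $\ell_2^2$-regularization path of Equation (1) are available in closed form. Rather than committing to the exact time-to-penalty calibration $\lambda(t)$ of Theorem 1, the sketch simply measures how far each GD iterate is from the nearest point on a fine grid along the regularization path; the test function, the grid, and all variable names are illustrative choices.

```python
# Minimal sketch (ours, not the paper's code): on a strongly convex quadratic, each point of
# the continuous-time GD path lies close to some point of the l2^2-regularization path
#   bar_theta(lam) = argmin_theta f(theta) + ||theta - theta0||^2 / (2*lam).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
H = A @ A.T + 0.5 * np.eye(p)           # Hessian of f: strongly convex and smooth
theta_hat = rng.standard_normal(p)      # minimizer of f
theta0 = np.zeros(p)                    # start of GD and centering point of the penalty

def gd_path(t):
    """theta(t) solving d theta/dt = -grad f(theta), theta(0) = theta0 (closed form for quadratics)."""
    return theta_hat + expm(-H * t) @ (theta0 - theta_hat)

def reg_path(lam):
    """argmin_theta f(theta) + ||theta - theta0||^2 / (2*lam), closed form for quadratics."""
    return np.linalg.solve(H + np.eye(p) / lam, H @ theta_hat + theta0 / lam)

lams = np.logspace(-3, 4, 400)                       # grid along the regularization path
reg_points = np.array([reg_path(l) for l in lams])

for t in [0.1, 0.5, 1.0, 2.0, 4.0]:
    th_t = gd_path(t)
    gap = np.min(np.linalg.norm(reg_points - th_t, axis=1))
    print(f"t={t:4.1f}  dist(theta(t), reg. path)={gap:.3e}  "
          f"dist(theta(t), minimizer)={np.linalg.norm(th_t - theta_hat):.3e}")
```

On such examples the printed distance to the regularization path is noticeably smaller than the distance to the minimizer along most of the trajectory, which illustrates (but of course does not prove) the pointwise closeness phenomenon formalized above.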
### 2.1 Extension to Mirror Descent

In this section we provide an extension of Theorem 1 to mirror descent. Before we proceed, we briefly review mirror descent; for a complete review of the properties of mirror descent and Bregman divergences, see Banerjee et al. [2005], Bubeck et al. [2015]. Let $\phi$ be a continuously differentiable Legendre function defined on $\mathbb{R}^p$. Moreover, let $\phi$ be $\alpha$-strongly convex w.r.t. a reference norm $\|\cdot\|$:
$$\phi(\theta_2) \;\ge\; \phi(\theta_1) + \langle\nabla\phi(\theta_1), \theta_2 - \theta_1\rangle + \frac{\alpha}{2}\|\theta_2 - \theta_1\|^2.$$
The Bregman divergence $D_\phi$ induced by $\phi$ is defined as $D_\phi(\theta_2, \theta_1) = \phi(\theta_2) - \phi(\theta_1) - \langle\nabla\phi(\theta_1), \theta_2 - \theta_1\rangle$.

**Mirror Descent (MD).** Suppose we want to minimize a convex function $f(\cdot)$ over $\mathbb{R}^p$. Then mirror descent with divergence $D_\phi$ uses the following update rule to estimate the minimizer:
$$\theta_{t+1} = \operatorname*{argmin}_{\theta}\; \eta_t\,\langle\nabla f(\theta_t), \theta\rangle + D_\phi(\theta, \theta_t).$$
Solving the above problem gives us the following update rule: $\nabla\phi(\theta_{t+1}) = \nabla\phi(\theta_t) - \eta_t\nabla f(\theta_t)$. The continuous time dynamics of MD, started at $\theta_0$, is given by the following ODE:
$$\frac{d\theta(t)}{dt} = -\left[\nabla^2\phi(\theta(t))\right]^{-1}\nabla f(\theta(t)), \qquad \theta(0) = \theta_0.$$

**$D_\phi$-Regularization.** We connect the optimization path of MD to the regularization path of the following regularized problem, where $\theta_0$ is some point in the domain:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; f(\theta) + \frac{1}{\lambda}\,D_\phi(\theta, \theta_0).$$
The solution $\bar\theta(\lambda)$ satisfies the following continuous time dynamics:
$$\frac{d\bar\theta(\lambda)}{d\lambda} = -\left[\lambda\,\nabla^2 f(\bar\theta(\lambda)) + \nabla^2\phi(\bar\theta(\lambda))\right]^{-1}\nabla f(\bar\theta(\lambda)).$$
We now show that the optimization path of mirror descent with divergence $D_\phi$, started at $\theta_0$, is closely related to the regularization path of the corresponding $D_\phi$-regularized objective. Our analysis is similar to the analysis of GD.

**Theorem 3.** Let $\hat\theta$ be the minimizer of $f(\cdot)$. Suppose $f$ is $m$-strongly convex and $M$-smooth, and $\phi$ is $\alpha$-strongly convex w.r.t. the Euclidean norm. Moreover, suppose $\phi$ is $\beta$-smooth w.r.t. the Euclidean norm in the following balls around $\theta_0$ and $\hat\theta$:
$$\{\theta : D_\phi(\theta, \theta_0) \le D_\phi(\hat\theta, \theta_0)\} \;\cup\; \{\theta : D_\phi(\hat\theta, \theta) \le D_\phi(\hat\theta, \theta_0)\}.$$
Let the regularization penalty and time $t$ be related through the relation $\lambda(t) = \frac{\beta}{cm}\left(e^{cMt/\beta} - 1\right)$, where $c = \frac{2\nu}{1+\nu}$ and $\nu = m/M$. If MD is started at $\theta_0$, then
$$\|\theta(t) - \bar\theta(\lambda(t))\|_2 \;\le\; \frac{\beta\,\|\nabla f(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt/\beta} - e^{-cMt/\alpha}\right).$$

Note that when $\phi(x) = \tfrac{1}{2}\|x\|_2^2$ (so that $\alpha = \beta = 1$), we retrieve the bounds of GD in Theorem 1. An example of a divergence which satisfies the assumptions of strong convexity and smoothness of $\phi$ over $\mathbb{R}^p$ is the Mahalanobis distance, the divergence induced by the function $\phi(x) = x^T A x$ for some positive definite matrix $A$. However, we note that many popular divergences, such as the KL-divergence, do not satisfy the smoothness condition over the entire space $\mathbb{R}^p$. For such divergences, $\beta$ depends on the distance between the starting point $\theta_0$ and the minimizer $\hat\theta$, and on their location in the space.
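As a companion to Theorem 3, the following sketch (again ours, with illustrative names, sizes, and step size) instantiates mirror descent with a Mahalanobis-type potential $\phi(x) = \tfrac12 x^T A x$, a scaled version of the example above, on a quadratic objective, and compares the iterates against the corresponding $D_\phi$-regularization path. As in the previous sketch, closeness is measured against a grid along the path rather than through the exact calibration of Theorem 3.

```python
# Minimal sketch (ours): mirror descent with potential phi(x) = 0.5 x^T A x (a Mahalanobis-type
# Bregman divergence) on a quadratic objective, compared against the D_phi-regularization path
#   bar_theta(lam) = argmin_theta f(theta) + D_phi(theta, theta0) / lam.
import numpy as np

rng = np.random.default_rng(1)
p = 4
B = rng.standard_normal((p, p))
H = B @ B.T + np.eye(p)                   # Hessian of the quadratic objective f
theta_hat = rng.standard_normal(p)        # minimizer of f
A = np.diag(rng.uniform(0.5, 2.0, p))     # positive definite matrix defining phi
A_inv = np.linalg.inv(A)
theta0 = np.zeros(p)
eta = 1e-3                                # small step size to mimic the continuous-time dynamics

grad_f = lambda th: H @ (th - theta_hat)

def reg_path(lam):
    """Closed form of argmin_theta f(theta) + D_phi(theta, theta0)/lam for quadratic f."""
    return np.linalg.solve(H + A / lam, H @ theta_hat + (A @ theta0) / lam)

reg_points = np.array([reg_path(l) for l in np.logspace(-3, 4, 400)])

# Discretized mirror descent: grad phi(theta_{k+1}) = grad phi(theta_k) - eta * grad f(theta_k),
# i.e. theta_{k+1} = theta_k - eta * A^{-1} grad f(theta_k) for this choice of phi.
theta = theta0.copy()
for k in range(1, 5001):
    theta = theta - eta * (A_inv @ grad_f(theta))
    if k % 1000 == 0:
        gap = np.min(np.linalg.norm(reg_points - theta, axis=1))
        print(f"step {k:5d}: dist to D_phi-regularization path = {gap:.3e}, "
              f"dist to minimizer = {np.linalg.norm(theta - theta_hat):.3e}")
```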
## 3 Consequences for Excess Risk of GD Iterates

In this section we utilize the connection between optimization and regularization paths derived in Theorem 1 to provide excess risk bounds for GD iterates. To this end, we first derive excess risk bounds for the solutions of the regularized problem, and then combine them with the result of Theorem 1 to obtain excess risk bounds for GD iterates.

### 3.1 General Analysis

In this section we provide a general statistical analysis of the solution of the regularized problem in Equation (1), for general statistical learning problems. Suppose we are given $n$ i.i.d. samples $D_n = \{x_i\}_{i=1}^n$, where $x_i \in \mathcal{X}$, drawn from a distribution $P$. Let $\ell : \mathbb{R}^p \times \mathcal{X} \to \mathbb{R}$ be a loss function that assigns a cost $\ell(\theta, x)$ to an observation $x$. Define the risk $R(\theta)$ as $R(\theta) = \mathbb{E}_{X\sim P}[\ell(\theta, X)]$ and let $\theta^*$ be the minimizer of the risk $R(\theta)$. Given samples $D_n$, our goal is to use $D_n$ to obtain an estimate $\hat\theta$ that has low excess risk $R(\hat\theta) - \min_\theta R(\theta)$. Let $R_n(\theta)$ denote the empirical risk, defined as $R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta, x_i)$. We consider the following regularized problem for estimating a $\theta$ with low excess risk:
$$\bar\theta(\lambda) = \operatorname*{argmin}_{\theta}\; R_n(\theta) + \frac{1}{2\lambda}\,\|\theta\|_2^2. \tag{2}$$
The following theorem bounds the parameter estimation error of the minimizer of the above problem.

**Theorem 4.** Suppose the empirical risk $R_n(\theta)$ is $m$-strongly convex and $M$-smooth. Consider the regularized problem in Equation (2). Suppose the regularization penalty satisfies $\frac{1}{\lambda} \ge \frac{2\,\|\nabla R_n(\theta^*)\|_2}{\|\theta^*\|_2}$. Then the optimal solution $\bar\theta(\lambda)$ satisfies $\|\bar\theta(\lambda) - \theta^*\|_2 \le \frac{3}{m\lambda}\,\|\theta^*\|_2$.

Using the above result and the result of Theorem 1, we now bound the parameter error of the iterates of continuous time GD.

**Corollary 5.** Suppose the conditions of Theorem 1 are satisfied. Moreover, let $t$ satisfy $t \le \frac{1}{cM}\log\left(1 + \frac{cm\,\|\theta^*\|_2}{2\,\|\nabla R_n(\theta^*)\|_2}\right)$, where $c = \frac{2m}{m+M}$. Then $\theta(t)$ satisfies the following error bound:
$$\|\theta(t) - \theta^*\|_2 \;\le\; \frac{\|\nabla R_n(\theta_0)\|_2}{m}\cdot\frac{c}{1-c}\left(e^{-mt} - e^{-cMt}\right) \;+\; \frac{3c\,e^{-cMt}}{1 - e^{-cMt}}\,\|\theta^*\|_2.$$

Note that the above results provide deterministic error bounds for a particular choice of $\lambda$ and $t$. The random quantities $m$, $M$, $\|\nabla R_n(\theta^*)\|_2$ need to be bounded to instantiate the above result for specific learning problems. Let $\bar m$, $\bar M$ be the strong convexity and smoothness parameters of the population risk $R(\theta)$. Using standard tools from empirical process theory, under certain regularity conditions on the distribution $P$ and on $R(\theta)$, one can show that $\|\nabla R_n(\theta^*)\|_2$ and $\|\nabla R_n(0)\|_2$ scale as $O\big(\sqrt{p/n} + \|\theta^*\|_2\big)$, and that $m$, $M$ are close to $\bar m$, $\bar M$, with high probability. Substituting these in Corollary 5 gives a bound for $\|\theta(t^*) - \theta^*\|_2$ at $t^* = \frac{1}{\bar c \bar M}\log(\cdot)$, where $\bar c = \frac{2\bar m}{\bar m + \bar M}$. When $\bar m = \bar M$, this bound shows that at $t^*$ the GD iterate achieves the standard parametric error rate $O\big(\sqrt{p/n}\big)$. We note that the above bound can be improved with a tighter analysis of the regularized problem. In the next section, we consider the problem of linear regression and show how a better understanding of the regularized problem, coupled with our connection, helps us obtain a tighter parameter estimation error bound for the iterates of GD.

### 3.2 Tighter Analysis for Linear Regression

Recall that in linear regression we observe paired samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where each $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$. The distribution of $y$ conditioned on the covariates $x$ is specified by the linear model $y = \langle x, \theta^*\rangle + w$, where $w$ is drawn from a zero-mean distribution with bounded variance. In this work we assume that the noise has a normal distribution with variance $\sigma^2$; that is, $w \sim N(0, \sigma^2)$. The goal in linear regression is to learn a linear map $x \mapsto \langle\theta, x\rangle$ with low risk $R(\theta) = \mathbb{E}\big[(y - \langle\theta, x\rangle)^2\big]$. The empirical risk $R_n(\theta)$ is given by
$$R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle\theta, x_i\rangle\big)^2 = \frac{1}{n}\,\|y - X\theta\|_2^2,$$
where $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n\times p}$ is the matrix of covariates. Let $w = y - X\theta^*$ be the noise vector and $\hat\Sigma = \frac{1}{n}X^TX$ be the empirical covariance matrix. The regularized problem (2) corresponding to the least squares risk defined above is the ridge regression problem. Ridge regression has been well studied and analyzed in the machine learning and statistics literature. We now provide the following result from Hsu et al. [2012], which obtains tight upper bounds on the parameter estimation error of the solution of ridge regression.

**Theorem 6.** Suppose the covariate vector $x$ has a normal distribution with mean $0$ and covariance matrix $\Sigma$. Then there exist constants $c_1, c_2 > 0$ that depend on $\Sigma$, such that for $n \ge c_1\, p\log p$, the solution of ridge regression $\bar\theta(\lambda)$ satisfies, with probability at least $1 - 1/p^2$, an error bound that decomposes into a bias term and a variance term, where $\lambda_i$ denotes the $i$-th largest eigenvalue of $\Sigma$ and $\kappa = \lambda_p/\lambda_1$.

We now use the above bound to obtain error rates for the optimization path of GD. For simplicity, and to gain insight into the bound, we consider the special case of the identity covariance matrix.

**Corollary 7.** Suppose the covariate vector $x$ has a normal distribution with mean $0$ and identity covariance matrix. Then there exists a constant $c_1 > 0$ such that for $n \ge c_1\, p\log p$, the iterates of continuous time GD satisfy the following bound with probability at least $1 - 1/p^2$:
$$\|\theta(t) - \theta^*\|_2^2 \;\le\; 10\left(e^{-9t/10} + \frac{9}{10\,e^{99t/100} - 10}\right)\|\theta^*\|_2^2 \;+\; \frac{\lambda(t)^2\,\sigma^2\,p}{n},$$
where $\lambda(t) = \frac{100}{81}\left(e^{99t/100} - 1\right)$. Further, at the stopping time $t^*$ indicated by this bound, the iterate $\theta(t^*)$ satisfies the resulting error bound with a leading constant that is positive and less than $0.1$.

Note that the above bound provides an early stopping rule for GD on linear regression, and the resulting rate, especially in the low SNR regime where $\sigma$ is large, can be better than the $\frac{\sigma^2 p}{n}$ rate obtained by running GD until convergence.
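The early-stopping behaviour discussed above is easy to reproduce numerically. The following sketch is our own construction, not the paper's experiment (the sample size, step size, and noise level are illustrative choices): it runs discrete GD on a simulated linear model and tracks the parameter error, which first decreases and then increases, so that stopping early beats running to convergence.

```python
# Sketch (ours, not the authors' code) of the early-stopping behaviour behind Corollary 7:
# for GD on least squares, the parameter error ||theta_k - theta*|| dips at an intermediate
# iteration before rising toward the error of the un-regularized least-squares solution.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 120, 100, 2.0                      # illustrative sizes and noise level
theta_star = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ theta_star + sigma * rng.standard_normal(n)

S = X.T @ X / n                                  # empirical covariance matrix
b = X.T @ y / n
eta = 0.01                                       # step size for GD on (1/2n)||y - X theta||_2^2

theta, errs = np.zeros(p), []
for _ in range(20000):
    theta -= eta * (S @ theta - b)               # gradient of the scaled least-squares risk
    errs.append(np.linalg.norm(theta - theta_star))

for k in [10, 100, 1000, 10000, 20000]:
    print(f"iteration {k:6d}: ||theta_k - theta*|| = {errs[k-1]:.3f}")
print(f"minimum error {min(errs):.3f} attained at iteration {int(np.argmin(errs)) + 1}")
```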
**Comparison with stability.** Hardt et al. [2015], Chen et al. [2018] used stability as a technique to provide expected excess risk bounds for the iterates generated by an iterative optimization algorithm. We note that in the setting of strong convexity, existing stability based approaches impose a much stronger condition than strong convexity of $R_n(\theta)$: they require the loss function $\ell(\theta, x)$ to be strongly convex in $\theta$ at each $x \in \mathcal{X}$. For example, this condition never holds for linear regression with dimension $p > 1$. Under the assumption that the loss function $\ell(\theta, x)$ is $m$-strongly convex and $M$-smooth at each $x \in \mathcal{X}$, stability gives an expected excess risk bound of the form $\mathbb{E}\big[R(\theta(t)) - R(\theta^*)\big] = O\!\big(\tfrac{1}{mn}\big)$. Note that such a bound does not provide an early stopping rule, and suggests that one has to run the algorithm until convergence for the best possible rates. Moreover, our approach can obtain high probability statements, whereas the above rates hold only in expectation.

**Comparison with VC and Rademacher complexity bounds.** Traditional techniques for bounding the excess risk of iterates proceed by separately bounding the optimization error $\|\theta(t) - \hat\theta\|$ and the statistical error $\|\hat\theta - \theta^*\|$, and then using the triangle inequality to bound $\|\theta(t) - \theta^*\|$. Such a technique gives rates of the form $O\big(e^{-mt} + \sigma\sqrt{p/n}\big)$. These rates suggest that one should always run GD until the end to obtain the best possible rates, and cannot predict optimal early stopping rules. Note that the bound in Corollary 7 at the early stopping time is much better than the rates obtained using standard VC and Rademacher complexity bounds. This is especially true in the low SNR regime, where $\sigma$ is large and, as a result, the signal-to-noise ratio is low. Moreover, this rate is the same as the ridge regression rate. This shows that GD with early stopping can obtain rates similar to ridge regression, and rates which are tighter than those obtained via VC bounds and stability based risk bounds.

## 4 Convex Loss

Having studied the setting where $f(\cdot)$ is strongly convex, we next turn our attention to losses which are merely convex. As we show below, in the convex case the connection between the two paths is more nuanced and, in particular, is problem specific. First, we derive a result which characterizes the end-point of the regularization path.

**Theorem 8.** Let $f : \mathbb{R}^p \to \mathbb{R}$ be a convex function, and suppose $f$ has a minimizer. Let $\bar\theta(\lambda)$ be the minimizer of $f(\theta) + \frac{1}{2\lambda}\|\theta - \theta_0\|_2^2$. Then, as $\lambda \to \infty$, $\bar\theta(\lambda)$ converges to the minimizer of $f$ which is closest to $\theta_0$.

We note that this result can be viewed as the regularization-path analog of the result of Gunasekar et al. [2017], where it was shown that for matrix factorization, the optimization path of GD converges to the minimum Frobenius norm solution. Next, we present a simple counterexample which shows that in the convex regime, the regularization and optimization paths need not converge to the same point.

**Lemma 1.** Consider the following function on the 2D domain $\mathbb{R} \times (-100, \infty)$: $f(x, y) = \frac{(x+1)^2}{y+100}$. Suppose continuous time gradient descent is initialized at $\theta_0 = (2, 1)$. Then
$$\lim_{\lambda\to\infty} \bar\theta(\lambda) \;\ne\; \lim_{t\to\infty}\theta(t).$$

[Figure: contours of $(x+1)^2/(y+100)$ with the GD path and the regularization path overlaid.]

The above result shows that for general convex losses, the two paths need not lie close to each other, even as $t, \lambda \to \infty$.
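The counterexample of Lemma 1 is easy to check numerically. In the sketch below (ours; the step size and horizon are arbitrary choices), the GD trajectory started at $(2, 1)$ drifts upward in its second coordinate and settles at a point different from $(-1, 1)$, which, by Theorem 8, is where the regularization path ends, since $(-1, 1)$ is the minimizer of $f$ closest to $\theta_0$.

```python
# Numerical check (ours) of the Lemma 1 counterexample for f(x, y) = (x+1)^2 / (y+100).
# The minimizers of f form the line {x = -1}; by Theorem 8 the regularization path started
# from theta0 = (2, 1) ends at the closest minimizer, (-1, 1).  Continuous-time GD from the
# same theta0, however, drifts upward in y and converges to a different point on that line.
import numpy as np

theta0 = np.array([2.0, 1.0])

def grad_f(th):
    x, y = th
    return np.array([2.0 * (x + 1.0) / (y + 100.0),
                     -((x + 1.0) ** 2) / (y + 100.0) ** 2])

# Explicit-Euler discretization of d theta / dt = -grad f(theta).
th = theta0.copy()
for _ in range(200_000):
    th -= 1e-2 * grad_f(th)

print("GD limit (approx):", np.round(th, 4))     # roughly (-1.0, 1.02...), not (-1.0, 1.0)
print("end of regularization path (Theorem 8): [-1.  1.]")
```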
### 4.1 Classification Loss

In this section, we focus on classification losses and show that the optimization path of GD and the corresponding regularization path of the $\ell_2^2$-penalized risk are close to each other. Commonly used classification losses such as the exponential and logistic losses are not strongly convex and, moreover, when the data $D_n$ is separable, the risk does not admit a finite minimizer. In such cases, a more careful analysis is needed to bound the distance between the optimization and regularization paths. Recent works by Ji and Telgarsky [2018], Soudry et al. [2017] study the behavior of gradient descent on un-regularized logistic regression and show that when the data is separable, GD converges to a max-margin solution. In this section we first show that similar properties hold for the regularization path of $\ell_2^2$-regularized objectives.

Recall that in classification we observe samples $D_n = \{(x_i, y_i)\}_{i=1}^n$, where each $(x_i, y_i) \in \mathbb{R}^p \times \{-1, +1\}$. Let $\ell(\theta, (x, y)) = \phi(y\,x^T\theta)$ be the loss at $(x, y)$, and consider the regularized problem in Equation (2). We first present the following useful result from Rosset et al. [2004b], which shows that when the data is linearly separable, as $\lambda \to \infty$, the minimizer $\bar\theta(\lambda)$ of (2) converges to a max-margin solution.

**Lemma 2.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\phi(z)$ be a monotone non-increasing loss function. If there exists $T > 0$ (possibly $T = \infty$) such that
$$\lim_{t\to T}\frac{\phi\big(t(1-\epsilon)\big)}{\phi(t)} = \infty \quad \text{for all } \epsilon > 0,$$
then $\phi$ is a margin maximizing loss function, in the sense that any convergence point of the normalized solutions $\frac{\bar\theta(\lambda)}{\|\bar\theta(\lambda)\|_2}$ of the regularized problem (2), as $\lambda \to \infty$, is an $\ell_2$ margin maximizing separating hyperplane. Consequently, if this margin-maximizing hyperplane is unique, then the normalized solutions converge to it.

The condition on $\phi$ in the above lemma is satisfied by many popular loss functions, such as the logistic, exponential, and squared hinge losses. Note that Lemma 2 is asymptotic in nature. Our first contribution is to derive a non-asymptotic version of this theorem. We focus on the exponential loss $\phi(z) = e^{-z}$, but our results can be generalized to other losses as well. Perhaps interestingly, our non-asymptotic bounds depend on the Lambert W (product-log) function [Corless et al., 1996], which has a long history of applications to instrument design [Ohayon and Ron, 2013] and statistical physics [Valluri et al., 2000].

**Theorem 9.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\ell(\theta, (x, y)) = \exp(-\theta^T(yx))$ and let $\bar\theta(\lambda)$ be the solution to the regularized problem in Equation (2). Then $\bar\theta(\lambda)$ satisfies:
i) $R_n(\bar\theta(\lambda)) \le C_1\,\frac{W(\lambda)}{\lambda}$, where $W(\cdot)$ is the Lambert W function;
ii) $\|\bar\theta(\lambda)\|_2 = \Theta(\log\lambda)$;
iii) $\min_i \frac{y_i\,x_i^T\bar\theta(\lambda)}{\|\bar\theta(\lambda)\|_2} \ge \left(1 - \frac{\log\log\lambda}{\log\lambda} - \frac{\log n}{\log\lambda}\right)\gamma$, where $\gamma$ denotes the maximum $\ell_2$ margin of $D_n$.

As $\lambda \to \infty$, the above theorem shows that $\bar\theta(\lambda)$ converges to a max-margin solution, thus recovering the asymptotic result of Rosset et al. [2004b]. Moreover, our result shows that the minimizer of the regularized problem (2) converges to the max-margin solution at a slow rate: the margin converges at an $O\!\big(\frac{1}{\log\lambda}\big)$ rate.

**Comparison with GD on $R_n(\theta)$.** Soudry et al. [2017] analyze gradient descent on the exponential loss with separable data and obtain similar bounds for the iterates of GD. Letting $\theta(t)$ be the iterate of GD at time $t$, they show that $R_n(\theta(t))$ goes down as $O(1/t)$, the margin converges as $O(1/\log t)$, and $\|\theta(t)\|_2$ increases as $\log t$. When combined with our result, this shows that the optimization and regularization paths are very close to each other.

**Theorem 10.** Assume the data $D_n$ is linearly separable; that is, there exists $\theta$ such that $\min_i y_i\,x_i^T\theta > 0$. Let $\ell(\theta, (x, y)) = \exp(-\theta^T(yx))$. Suppose the regularization parameter and time $t$ are related as $\lambda(t) = t$, and suppose GD is initialized at $\theta_0$. Then, for any $t \ge 0$, the difference $\big|\min_i y_i\langle x_i, \theta(t)\rangle - \min_i y_i\langle x_i, \bar\theta(\lambda(t))\rangle\big|$ is bounded.
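Theorem 10 and the discussion above suggest a direct numerical comparison of the two paths for the exponential loss on separable data. The sketch below is our own illustration (the data dimensions, step size, and inner optimizer are arbitrary choices, and the discrete pairing $\lambda = \eta\,t$ is our reading of the continuous-time relation $\lambda(t) = t$): it tracks the normalized margin along the GD path and along the corresponding $\ell_2^2$-regularized solutions.

```python
# Sketch (ours, not the authors' code): normalized margins along the GD path and along the
# l2^2-regularization path for the exponential loss on linearly separable data, using the
# pairing lambda = eta * (number of GD steps) as a discrete analogue of lambda(t) = t.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p, eta = 32, 128, 0.1                      # p >> n, so the sampled data is separable w.h.p.
theta_star = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = np.sign(X @ theta_star)

risk = lambda th: np.mean(np.exp(-y * (X @ th)))
grad_risk = lambda th: -(X.T @ (y * np.exp(-y * (X @ th)))) / n
margin = lambda th: np.min(y * (X @ th)) / np.linalg.norm(th)

theta_gd = np.zeros(p)
for step in range(1, 20001):
    theta_gd -= eta * grad_risk(theta_gd)
    if step % 5000 == 0:
        lam = eta * step
        obj = lambda th: risk(th) + th @ th / (2.0 * lam)
        jac = lambda th: grad_risk(th) + th / lam
        theta_reg = minimize(obj, theta_gd, jac=jac, method="L-BFGS-B").x
        print(f"step={step:6d}  GD margin={margin(theta_gd):.4f}  "
              f"regularized margin={margin(theta_reg):.4f}")
```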
## 5 Experiments

In this section, we conduct simulations to corroborate our theoretical findings.

### 5.1 Strongly Convex

[Figure 1: Connecting GD and $\ell_2^2$-penalization for linear regression. (a) $R(\theta_t) - R^*$ vs. $t$; (b) $t^*$ vs. $\log(n)$; (c) $\log(R(\theta_{t^*}) - R^*)$ vs. $\log(n)$.]

We use linear regression to empirically verify our results connecting ridge regression and gradient descent. We also corroborate our findings on excess risk and the optimality of the early-stopping rule for gradient descent.

**Setup.** We simulate a linear model by drawing the covariates from an isotropic Gaussian, $X \sim N(0, I_{p\times p})$, and the response $y\,|\,x \sim N(\theta^{*T}x, \sigma^2)$, where $\theta^* = [1/\sqrt{p}, 1/\sqrt{p}, \ldots, 1/\sqrt{p}]^T$ and $\sigma^2 = 2$. We generate a sequence of iterates by GD with step size $0.01$, and a corresponding sequence of solutions of the penalized estimation problem. We also study how the optimal iteration number $t^*$, which minimizes the excess risk, changes as we increase the number of samples for GD. In this case, we fix $p = 100$ and vary the number of samples $n$ from $100$ to $1500$. Similarly, we find the optimal penalization for each $n$. All results are reported after averaging over $50$ trials.

**Results.** We report our results in Figure 1. As shown by our theory, the excess risk bounds for GD on OLS are composed of two terms, one which increases with $t$ and one which decreases with $t$. Hence, one expects the excess risk to first decrease, then increase, before finally settling, which is corroborated by Figure 1(a). Figure 1(b) shows a logarithmic relationship between $t^*$ and $n$, thereby verifying our theoretical claims on $t^*$. Figure 1(c) shows that the optimal risk for GD coincides with that of $\ell_2^2$-penalized estimation across different values of $n$.

### 5.2 Classification

[Figure 2: Connecting GD and $\ell_2^2$-penalization for logistic regression. (a) $\|\bar\theta(\lambda)\|_2$ vs. $\log(\lambda)$; (b) margin of $\bar\theta(\lambda)$ vs. $\lambda$; (c) $|\mathrm{Margin}(\bar\theta(\lambda)) - \mathrm{Margin}(\theta_t)|$ vs. $t$.]

In this section, we corroborate our results connecting GD and $\ell_2^2$-regularization in the context of logistic regression on separable data. In particular, we corroborate our findings on the parameter error, the behavior of the margin, and the difference in margin between the optimization and regularization paths.

**Setup.** We construct a classification dataset by drawing covariates $X$ from an isotropic Gaussian, i.e. $X \sim N(0, I_p)$. We fix the true parameter $\theta^* = [1/\sqrt{p}, 1/\sqrt{p}, \ldots, 1/\sqrt{p}]^T$, the dimension $p = 128$, and the number of samples $n = 32$. Note that our choice of $p, n$ ensures that the generated data is separable. We run GD with a step size $\eta = 0.123$ and construct the corresponding points on the regularization path ($\lambda(t) = \eta t$) of the $\ell_2^2$-penalized objective.

**Results.** We report our results in Figure 2. Figure 2(a) shows the norm of the points on the optimization and regularization paths. As predicted by our theory, the norm increases at a logarithmic rate. Figure 2(b) plots the $\ell_2$-margin, $\min_i \frac{y_i\,\theta^T x_i}{\|\theta\|_2}$, for both the optimization and regularization paths. The figure confirms our result that the margin increases with $\lambda$. Although the margins of both the optimization and regularization paths increase, Figure 2(c) shows that after a few initial iterations, the difference in margin between $\theta_t$ and $\bar\theta(\lambda)$ decreases.
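The linear-regression protocol of Section 5.1 (Figure 1) can be approximated in a few lines. The sketch below is our reconstruction rather than the authors' released code: it uses a single trial per $n$ and a coarse penalty grid instead of 50 averaged trials. Because the covariance is the identity, the excess risk is simply $\|\theta - \theta^*\|_2^2$, so the optimal stopping time and the optimal ridge penalty can be read off directly.

```python
# Rough sketch (ours, not the authors' released code) of the Figure 1 protocol: for each n,
# find the GD stopping time t* and the ridge penalty lambda* that minimize excess risk, and
# compare the two optima.  With X ~ N(0, I), the excess risk is ||theta - theta*||^2.
import numpy as np

rng = np.random.default_rng(4)
p, sigma2, eta = 100, 2.0, 0.01
theta_star = np.ones(p) / np.sqrt(p)

def one_trial(n):
    X = rng.standard_normal((n, p))
    y = X @ theta_star + np.sqrt(sigma2) * rng.standard_normal(n)
    S, b = X.T @ X / n, X.T @ y / n
    # GD path: record the excess risk at every iterate.
    theta, gd_risks = np.zeros(p), []
    for _ in range(5000):
        theta -= eta * (S @ theta - b)
        gd_risks.append(np.sum((theta - theta_star) ** 2))
    # Ridge path over a grid of penalties.
    ridge_risks = [np.sum((np.linalg.solve(S + np.eye(p) / lam, b) - theta_star) ** 2)
                   for lam in np.logspace(-2, 3, 60)]
    return int(np.argmin(gd_risks)) + 1, min(gd_risks), min(ridge_risks)

for n in [100, 400, 1500]:
    t_star, gd_best, ridge_best = one_trial(n)
    print(f"n={n:5d}  t*={t_star:5d}  best GD excess risk={gd_best:.4f}  "
          f"best ridge excess risk={ridge_best:.4f}")
```

As in Figure 1, the optimal stopping iteration grows roughly logarithmically with $n$, and the best early-stopped GD excess risk closely tracks the best ridge excess risk.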
## 6 Summary and Future Work

In this work, we studied the connection between the trajectory of the iterates of optimization techniques such as GD and mirror descent, and the regularization path of the corresponding regularized objective. For strongly convex functions, our results show that the two paths are pointwise close. However, for general convex functions, our results show that the two paths need not be close to each other. For the popularly studied problem of classification with separable data, we showed that the optimization and regularization paths are close to each other. We believe studying the connection between optimization and regularization paths has several advantages, the key advantage being that it can be used to study the statistical properties of iterates generated by optimization techniques. We also believe that our results on strongly convex losses can be further improved to obtain tighter connections and better generalization bounds for the iterates. An interesting direction for future work would be to see if similar connections hold for non-convex problems, and specifically for the optimization objectives that arise in deep learning. For convex losses, our current work focused on analyzing classification losses with separable data. It would be interesting to study the connection for general convex losses and identify the conditions on the loss function under which the two paths stay close to each other. While our analysis in this paper focused on GD, it would be interesting to study whether similar connections hold for other non-stochastic methods such as steepest descent, accelerated GD, and Newton's method, as well as for stochastic methods such as SGD.

## 7 Acknowledgements

We acknowledge the support of NSF via IIS-1149803, IIS-1664720, DMS-1264033. The authors are grateful to Suriya Gunasekar and the anonymous reviewers for helpful comments on the paper.

## References

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705-1749, 2005.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499-526, 2002.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv e-prints, April 2018.

Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329-359, 1996.

J. Friedman and Bogdan E. Popescu. Gradient directed regularization for linear regression and classification. Technical report, Citeseer, 2003.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv e-prints, February 2018.

Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6152-6160, 2017.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Abdolhossein Hoorfar and Mehdi Hassani. Inequalities on the Lambert W function and hyperpower function. Journal of Inequalities in Pure and Applied Mathematics, 9(2):5-9, 2008.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348-1356, 2009.

G. Neu and L. Rosasco. Iterate averaging as regularization for stochastic gradient descent. arXiv e-prints, February 2018.

Ben Ohayon and Guy Ron. New approaches in designing a Zeeman slower. Journal of Instrumentation, 8(02):P02016, 2013.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335-366, 2014.

Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630-1638, 2015.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941-973, 2004a.

Saharon Rosset, Ji Zhu, and Trevor J. Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems, pages 1237-1244, 2004b.

Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707-1739, 2009.

Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Sree Ram Valluri, David J. Jeffrey, and Robert M. Corless. Some applications of the Lambert W function to physics. Canadian Journal of Physics, 78(9):823-831, 2000.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067-6077, 2017.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.