# Fast Margin Maximization via Dual Acceleration

Ziwei Ji 1, Nathan Srebro 2, Matus Telgarsky 1

1 Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. 2 Toyota Technological Institute at Chicago, Chicago, Illinois, USA. Correspondence to: Ziwei Ji.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{O}(1/t^2)$. This contrasts with a rate of $O(1/\log(t))$ for standard gradient descent, and $O(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive nonuniform sampling via the dual variables.

## 1. Introduction

First-order optimization methods, such as stochastic gradient descent (SGD) and variants thereof, form the optimization backbone of deep learning, where they can find solutions with both low training error and low test error (Neyshabur et al., 2014; Zhang et al., 2016). Motivated by this observation of low test error, there has been extensive work on the implicit bias of these methods: amongst those predictors with low training error, which predictors do these methods implicitly prefer?

For linear classifiers and linearly separable data, Soudry et al. (2017) prove that gradient descent can not only minimize the training error, but also maximize the margin. This could help explain the good generalization of gradient descent, since a larger margin could lead to better generalization (Bartlett et al., 2017). However, gradient descent can only maximize the margin at a slow $O(1/\log(t))$ rate.

Figure 1. Margin-maximization performance of the new momentum-based method (cf. Algorithm 1 and eq. (1.1)), which has a rate $\widetilde{O}(1/t^2)$, compared with prior work discussed below. All methods are first-order methods, and all but batch perceptron use an exponentially-tailed smooth loss, whereas batch perceptron applies gradient descent to the hard-margin problem directly. The data here is linearly separable, specifically mnist digits 0 and 1. (Plot: margin versus gradient evaluations for Alg1 / eq. (1.1), normalized gradient descent, gradient descent, and batch perceptron.)

It turns out that the margin can be maximized much faster by simply normalizing the gradient: letting $\theta_t$ denote the step size and $R$ the empirical risk with the exponential loss, consider the normalized gradient step

$$w_{t+1} := w_t - \theta_t \frac{\nabla R(w_t)}{R(w_t)}.$$

Using this normalized update, margins are maximized at a $\widetilde{O}(1/\sqrt{t})$ rate with $\theta_t = 1/\sqrt{t}$ (Nacson et al., 2018), and at a $O(1/t)$ rate with $\theta_t = 1$ (Ji & Telgarsky, 2019). A key observation in proving such rates is that normalized gradient descent is equivalent to an entropy-regularized mirror descent on a certain margin dual problem (cf. Section 3.1).
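As a concrete illustration of this normalized update, here is a minimal NumPy sketch on a small synthetic separable dataset. It uses the notation of Section 2 ($z_i := -y_i x_i$, exponential loss), and all names and the toy data are illustrative rather than taken from the paper; the softmax form of the update is the identity from eq. (2.3), used here only for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data: x_i in R^2 with ||x_i||_2 <= 1, labels y_i in {-1, +1}.
n, d = 40, 2
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 1e-12)
X[:, 0] += 0.3 * y          # push the classes apart so the data is separable
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))

Z = -y[:, None] * X          # rows z_i := -y_i x_i, as in Section 2

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def margin(w):
    # gamma(w) = min_i y_i <w, x_i> / ||w||_2 = -max_i <w, z_i> / ||w||_2
    return -np.max(Z @ w) / np.linalg.norm(w)

w = np.zeros(d)
for t in range(1, 501):
    # Normalized gradient step: w <- w - theta * grad R(w) / R(w).
    # For the exponential loss, grad R(w) / R(w) = Z^T softmax(Zw) (cf. eq. (2.3)),
    # which avoids forming the tiny quantities R and grad R separately.
    w = w - 1.0 * (Z.T @ softmax(Z @ w))
    if t % 100 == 0:
        print(t, margin(w))
```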
**Contributions.** In this work, we further exploit this duality relationship from prior work, and design a momentum-based algorithm with iterates given by

$$g_t := \beta_t\left(g_{t-1} + \frac{\nabla R(w_t)}{R(w_t)}\right), \qquad w_{t+1} := w_t - \theta_t\left(g_t + \frac{\nabla R(w_t)}{R(w_t)}\right). \tag{1.1}$$

Our main result is that these iterates, with a proper choice of $\theta_t$ and $\beta_t$, can maximize the margin at a rate of $\widetilde{O}(1/t^2)$, whereas prior work had a rate of $O(1/t)$ at best. The key idea is to reverse the primal-dual relationship mentioned above: those works focus on primal normalized gradient descent, and show that it is equivalent to dual mirror descent, but here we start from the dual, apply Nesterov acceleration to make dual optimization faster, and then translate the dual iterates into the momentum form in eq. (1.1). Note that if our goal were just to accelerate dual optimization, then it would be natural to apply Nesterov's method; however, here our goal is to accelerate (primal) margin maximization: it was unclear whether the momentum method changes the implicit bias, and our margin analysis is very different from the standard analysis of Nesterov's method. The connection between momentum in the primal and acceleration in the dual also appears to be new, and we provide it as an auxiliary contribution. We state the method in full in Algorithm 1, and its analysis in Section 3.

Since our momentum-based iterates (cf. eq. (1.1)) are designed via a primal-dual framework, they can be written purely with dual variables, in which case they can be applied in a kernel setting. However, calculating the full-batch gradient would require $n^2$ calls to the kernel, where $n$ denotes the number of training examples. To reduce this computational burden, by further leveraging the dual perspective, we give an adaptive sampling procedure which avoids the earlier use of batch gradients and only needs $n$ kernel calls per iteration. We prove a $O(1/\sqrt{t})$ margin rate for a momentum-free version of this adaptive sampling method, but also provide empirical support for an aggressive variant which uses our batch momentum formulation verbatim with these efficient stochastic updates. These results are presented in Section 4.

For the sake of presentation, the preceding analyses and algorithm definitions use the exponential loss; however, they can be extended to both binary and multiclass losses with exponential tails. The multiclass extension is in fact a straightforward reduction to the binary case, and is used in most figures throughout this work. We discuss these extensions in Section 5.

As an illustrative application of these fast margin maximization methods, we use them to study the evolution of the kernel given by various stages of deep network training. The main point of interest is that while these kernels do seem to generally improve during training (in terms of both margins and test errors), we provide an example where simply changing the random seed switches between preferring the final kernel and the initial kernel. These empirical results appear in Section 6.

We conclude with open problems in Section 7. Full proofs and further experimental details are deferred to the appendices.
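The update in eq. (1.1) is only a few lines of code. The following NumPy sketch is a hypothetical reference implementation under the choices analyzed later ($\theta_t = 1$, $\beta_t = t/(t+1)$), again with $z_i := -y_i x_i$ collected as the rows of `Z`; it is illustrative only, and uses the softmax identity $\nabla R(w)/R(w) = Z^\top \mathrm{softmax}(Zw)$ from eq. (2.3) for numerical stability.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

def momentum_margin_maximizer(Z, num_iters=1000):
    """Iterates of eq. (1.1) with theta_t = 1 and beta_t = t / (t + 1)."""
    n, d = Z.shape
    w = np.zeros(d)
    g = np.zeros(d)
    for t in range(num_iters):
        q = softmax(Z @ w)                 # dual variable q_t (cf. Algorithm 1)
        v = Z.T @ q                        # equals grad R(w_t) / R(w_t)
        beta = t / (t + 1)
        g = beta * (g + v)                 # g_t = beta_t (g_{t-1} + grad R / R)
        w = w - (g + v)                    # w_{t+1} = w_t - theta_t (g_t + grad R / R)
    return w

# Example usage on a tiny separable problem:
if __name__ == "__main__":
    X = np.array([[0.9, 0.1], [0.2, 0.8], [-0.8, -0.3], [-0.1, -0.9]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    Z = -y[:, None] * X
    w = momentum_margin_maximizer(Z)
    print("margin:", -np.max(Z @ w) / np.linalg.norm(w))
```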
### 1.1. Related Work

This work is closely related to others on the implicit bias, most notably the original analysis for gradient descent on linearly separable data (Soudry et al., 2017). The idea of using normalized steps to achieve faster margin maximization rates was first applied in the case of coordinate descent (Telgarsky, 2013), where this normalization is closely associated with the usual step sizes in boosting methods (Freund & Schapire, 1997). Many other works have used these normalized iterates, associated potential functions, and duality concepts, both in the linear case (Gunasekar et al., 2018a; Ji & Telgarsky, 2018), and in the nonlinear case (Gunasekar et al., 2018b; Lyu & Li, 2019; Chizat & Bach, 2020; Ji & Telgarsky, 2020).

There appear to be few analyses of momentum methods; one example is the work of Ghadimi et al. (2015), which shows a $O(1/t)$ convergence rate for general convex problems over bounded domains, but cannot be applied to the exponentially-tailed loss setting here since the domain is unbounded and the solutions are off at infinity. Connections between momentum in the primal and Nesterov acceleration in the dual seem to not have been made before, and relatedly our use of the momentum coefficient $\beta_t = t/(t+1)$ is non-standard.

Further on the topic of acceleration, Tseng (2008) gave an application to a smoothed version of the nonsmooth hard-margin objective, with a rate of $O(1/t)$ to a fixed suboptimal margin. This analysis requires accelerated methods for general geometries, which were analyzed by Tseng (2008) and Allen-Zhu & Orecchia (2014). The original accelerated method for Euclidean geometry is due to Nesterov (1983). A simultaneous analysis of mirror descent and Nesterov acceleration is given here in Appendix B. The methods here, specifically Proposition 3.7, can ensure a margin of $\bar\gamma/4$ in $4\sqrt{\ln(n)}/\bar\gamma$ steps, where $\bar\gamma$ denotes the optimal margin and will be defined formally in Section 2. Another primal-dual method for fast linear feasibility was given by Hanashiro & Abernethy (2020); the method terminates in $O(\ln(n)/\bar\gamma)$ steps with a positive margin, however the analysis does not reveal how large this margin is.

Various figures throughout this work include experiments with the batch perceptron, which simply applies (super)gradient ascent to the explicit hard-margin maximization problem (Cotter et al., 2012). Despite this simplicity, the method is hard to beat, and surpasses prior implicit margin maximizers in experiments (cf. Figure 1). Interestingly, another standard method with strong guarantees fared less well in experiments (Clarkson et al., 2012), and is thus omitted from the figures.

## 2. Notation

The dataset is denoted by $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. Without loss of generality, we assume $\|x_i\|_2 \le 1$. Moreover, let $z_i := -y_i x_i$, and collect these vectors into a matrix $Z \in \mathbb{R}^{n \times d}$, whose $i$-th row is $z_i^\top$. We consider linear classifiers. The margin of a nonzero linear classifier $w \in \mathbb{R}^d$ is defined as

$$\gamma(w) := \min_{1 \le i \le n} \frac{y_i \langle w, x_i \rangle}{\|w\|_2} = -\max_{1 \le i \le n} \frac{\langle w, z_i \rangle}{\|w\|_2},$$

with $\gamma(0) := 0$. The maximum margin is $\bar\gamma := \max_{\|w\|_2 \le 1} \gamma(w)$. If $\bar\gamma > 0$, then the dataset is linearly separable; in this case, the maximum-margin classifier is defined as

$$\bar u := \arg\max_{\|w\|_2 \le 1} \gamma(w) = \arg\max_{\|w\|_2 = 1} \gamma(w).$$

If $\bar\gamma = 0$, the dataset is linearly nonseparable.

Our algorithms are based on the empirical risk, defined as

$$R(w) := \frac{1}{n} \sum_{i=1}^n \ell\big(\langle w, z_i \rangle\big).$$

For presentation, we mostly focus on the exponential loss $\ell(z) := e^z$, but our analysis can be extended to other exponentially-tailed losses such as the logistic loss $\ell(z) := \ln(1 + e^z)$ and various multiclass losses; these extensions are discussed in Section 5.
The following potential function $\psi : \mathbb{R}^n \to \mathbb{R}$ will be central to our analysis: given a strictly increasing loss $\ell : \mathbb{R} \to \mathbb{R}$ with $\lim_{z \to -\infty} \ell(z) = 0$ and $\lim_{z \to \infty} \ell(z) = \infty$, for $\xi \in \mathbb{R}^n$, let

$$\psi(\xi) := \ell^{-1}\left(\sum_{i=1}^n \ell(\xi_i)\right), \tag{2.1}$$

thus $\psi(Zw) = \ell^{-1}\big(n R(w)\big)$. For the exponential loss, $\psi$ is the ln-sum-exp function, meaning $\psi(Zw) = \ln\big(\sum_{i=1}^n \exp(\langle w, z_i \rangle)\big)$. This $\psi$ is crucial in our analysis since (i) it induces the dual variable, which motivates our algorithms (cf. Section 3.1); and (ii) it gives a smoothed approximation of the margin, which helps in the margin analysis (cf. Section 3.3). Here we note another useful property of $\psi$: the gradient of $\psi(Zw)$ with respect to $w$ is $Z^\top \nabla\psi(Zw)$, which is a normalized version of $\nabla R(w)$:

$$Z^\top \nabla\psi(Zw) = \frac{\sum_{i=1}^n \ell'(\langle w, z_i \rangle)\, z_i}{\ell'\big(\psi(Zw)\big)} = \frac{\nabla R(w)}{\ell'\big(\psi(Zw)\big)/n}. \tag{2.2}$$

Figure 2. Here the various margin-maximization methods from Figure 1 are run on non-separable data, specifically mnist digits 3 and 5; as such, test error and not margin are reported. The methods based on the exponential loss still perform well; by contrast, the batch perceptron suffers, and perhaps requires additional effort to tune a regularization parameter. (Plot: test error versus gradient evaluations for Alg1 / eq. (1.1), normalized gradient descent, gradient descent, and batch perceptron.)

Algorithm 1.
Input: data matrix $Z \in \mathbb{R}^{n \times d}$, step sizes $(\theta_t)_{t=0}^\infty$, momentum factors $(\beta_t)_{t=0}^\infty$.
Initialize: $w_0 = g_{-1} = (0, \ldots, 0) \in \mathbb{R}^d$, $q_0 = (\tfrac{1}{n}, \ldots, \tfrac{1}{n}) \in \Delta_n$.
for $t = 0, 1, 2, \ldots$ do
    $g_t \leftarrow \beta_t (g_{t-1} + Z^\top q_t)$.
    $w_{t+1} \leftarrow w_t - \theta_t (g_t + Z^\top q_t)$.
    $q_{t+1} \propto \exp(Z w_{t+1})$, with $q_{t+1} \in \Delta_n$.
end for

For the exponential loss, $\nabla\psi(Zw) \in \Delta_n$ is just the softmax mapping over $Zw$, where $\Delta_n$ denotes the probability simplex. Moreover,

$$Z^\top \nabla\psi(Zw) = \frac{\nabla R(w)}{R(w)}. \tag{2.3}$$

## 3. Analysis of Algorithm 1

A formal version of our batch momentum method is presented in Algorithm 1. It uses the exponential loss, and is equivalent to eq. (1.1) since by eq. (2.3),

$$Z^\top q_t = Z^\top \nabla\psi(Zw_t) = \frac{\nabla R(w_t)}{R(w_t)}.$$

Here are our main convergence results.

**Theorem 3.1.** Let $w_t$ and $g_t$ be given by Algorithm 1 with $\theta_t = 1$ and $\beta_t = t/(t+1)$.

1. If the dataset is separable, then for all $t \ge 1$,
$$\gamma(w_t) \ge \bar\gamma - \frac{4\big(1 + \ln(n)\big)\big(1 + 2\ln(t+1)\big)}{\bar\gamma (t+1)^2}.$$
2. For any dataset, separable or nonseparable, it holds for all $t \ge 1$ that
$$\frac{4\|g_t\|_2^2}{t^2} - \frac{8\ln(n)}{(t+1)^2} \le \bar\gamma^2 \le \frac{4\|g_t\|_2^2}{t^2}.$$

Our main result is in the separable case, where Algorithm 1 can maximize the margin at a $\widetilde{O}(1/t^2)$ rate; by contrast, as mentioned in the introduction, all prior methods have a $O(1/t)$ rate at best. On the other hand, for any dataset, our algorithm can find an interval of length $O(1/t^2)$ which includes $\bar\gamma^2$, in particular certifying non-existence of predictors with margin larger than any value in this interval. Moreover, as shown in Figure 2, Algorithm 1 can also achieve good test accuracy even in the nonseparable case; it is an interesting open problem to build a theory for this phenomenon.

The rest of this section sketches the proof of Theorem 3.1, with full details deferred to the appendices. In Section 3.1, we first consider gradient descent without momentum (i.e., $\beta_t = 0$), which motivates consideration of a dual problem. Then in Section 3.2, we apply Nesterov acceleration (Nesterov, 2004; Tseng, 2008; Allen-Zhu & Orecchia, 2014) to this dual problem, further derive the corresponding primal method in Algorithm 1, and also prove the second part of Theorem 3.1. Finally, we give a proof sketch of the margin rate in Section 3.3.
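Before sketching the analysis, here is a small NumPy check (illustrative, not from the paper) of the identity in eq. (2.3) that makes Algorithm 1 equal to eq. (1.1): for the exponential loss, the dual variable $q = \nabla\psi(Zw)$ is the softmax of $Zw$, and $Z^\top q$ coincides with $\nabla R(w)/R(w)$; the smoothed-margin property $-\psi(Zw)/\|w\|_2 \le \gamma(w)$ used in Section 3.3 is checked as well.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # ||x_i||_2 <= 1
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
Z = -y[:, None] * X                               # rows z_i := -y_i x_i
w = rng.normal(size=d)

# Dual variable: q = grad psi(Zw) = softmax(Zw) for the exponential loss.
s = Z @ w
q = np.exp(s - s.max())
q /= q.sum()

# Empirical risk and its gradient for the exponential loss.
R = np.exp(s).mean()
grad_R = Z.T @ np.exp(s) / n

# eq. (2.3): Z^T q == grad R(w) / R(w).
assert np.allclose(Z.T @ q, grad_R / R)

# Smoothed margin: psi(Zw) = ln sum exp(Zw) >= max_i <z_i, w>,
# hence -psi(Zw)/||w|| <= gamma(w) = -max_i <z_i, w>/||w||.
psi = np.log(np.exp(s).sum())
gamma_w = -s.max() / np.linalg.norm(w)
assert -psi / np.linalg.norm(w) <= gamma_w
print("identities verified")
```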
### 3.1. Motivation from Gradient Descent

We start by giving an alternate presentation and discussion of certain observations from the prior work of Ji & Telgarsky (2019), which in turn motivates Algorithm 1. Consider gradient descent $w_{t+1} := w_t - \eta_t \nabla R(w_t)$. Define the dual variable by $q_t := \nabla\psi(Zw_t)$; for the exponential loss, it is given by $q_t \propto \exp(Zw_t)$ with $q_t \in \Delta_n$. Note that

$$w_{t+1} = w_t - \eta_t \nabla R(w_t) = w_t - \eta_t R(w_t) \cdot \frac{\nabla R(w_t)}{R(w_t)} = w_t - \theta_t Z^\top q_t,$$

where we let $\theta_t = \eta_t R(w_t)$. Moreover,

$$q_{t+1} \propto \exp(Zw_{t+1}) = \exp\big(Zw_t - \theta_t ZZ^\top q_t\big) \propto q_t \odot \exp\big(-\theta_t ZZ^\top q_t\big) = q_t \odot \exp\big(-\theta_t \nabla\phi(q_t)\big),$$

where $\phi(q) := \|Z^\top q\|_2^2 / 2$ and $\odot$ denotes the coordinate-wise product. In other words, the update from $q_t$ to $q_{t+1}$ is a mirror descent / dual averaging update with the entropy regularizer on the dual objective $\phi$.

This dual objective $\|Z^\top q\|_2^2/2$ is related to the usual hard-margin dual objective, and is evocative of the SVM dual problem; this connection is made explicit in Appendix A. Even without deriving this duality formally, it makes sense that $q_t$ tries to minimize $\phi$, since $\phi$ encodes extensive structural information of the problem: for instance, if the dataset is not separable, then $\min_{q \in \Delta_n} \phi(q) = 0$ (cf. Lemma A.1). With a proper step size, we can ensure that $\phi(q_t) = \|\nabla R(w_t)\|_2^2 / \big(2 R(w_t)^2\big) \to 0$ and that $R(w_t)$ is nonincreasing, and it follows that $\|\nabla R(w_t)\|_2 \to 0$. If the dataset is separable, then $\min_{q \in \Delta_n} \phi(q) = \bar\gamma^2/2$ (cf. Lemma A.1), and

$$Z^\top \bar q = -\bar\gamma\, \bar u, \qquad \text{for } \bar q \in \arg\min_{q \in \Delta_n} \phi(q),$$

where $\bar u$ is the unique maximum-margin predictor, as defined in Section 2. As $q_t$ minimizes $\phi$, the vector $-Z^\top q_t$ becomes biased towards $\bar u$, by which we can also show $w_t / \|w_t\|_2 \to \bar u$. Ji & Telgarsky (2019) use this idea to show a $O(1/t)$ margin maximization rate for primal gradient descent.

The idea in this work is to reverse the above process: we can start from the dual and aim to minimize $\phi$ more efficiently, and then take the dual iterates $(q_t)_{t=0}^\infty$ from this more efficient minimization and use them to construct primal iterates $(w_t)_{t=0}^\infty$ satisfying $\nabla\psi(Zw_t) = q_t$. It is reasonable to expect such $w_t$ to maximize the margin faster, and indeed we show this is true in the following, by applying Nesterov acceleration to the dual, thanks to the smoothness of $\phi$ with respect to the $\ell_1$ norm (Ji & Telgarsky, 2019, Lemma 2.5).

### 3.2. Primal and Dual Updates

To optimize the dual objective $\phi$, we apply Nesterov's method with the $\ell_1$ geometry (Tseng, 2008; Allen-Zhu & Orecchia, 2014). The following update uses the entropy regularizer; more general updates are given in Appendix B. Let $\mu_0 = q_0 := (\tfrac{1}{n}, \ldots, \tfrac{1}{n})$. For $t \ge 0$, let $\lambda_t, \theta_t \in (0, 1]$, and

$$\nu_t := (1 - \lambda_t)\mu_t + \lambda_t q_t, \qquad q_{t+1} \propto q_t \odot \exp\left(-\frac{\theta_t}{\lambda_t} \nabla\phi(\nu_t)\right), \qquad \mu_{t+1} := (1 - \lambda_t)\mu_t + \lambda_t q_{t+1}.$$

If we just apply the usual mirror descent / dual averaging to $\phi$, then $\phi$ can be minimized at a $O(1/t)$ rate (Ji & Telgarsky, 2019, Theorem 2.2). However, using the above accelerated process, we can minimize $\phi$ at a $O(1/t^2)$ rate.
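The accelerated dual updates above are easy to simulate; the following NumPy sketch (illustrative only, with the choices $\theta_t = 1$ and $\lambda_t = 2/(t+2)$ analyzed in Lemma 3.2 below) runs them on a toy separable problem and prints $\phi(\mu_t)$, which should approach $\bar\gamma^2/2$. The log-domain bookkeeping for $q_t$ is an implementation detail, not part of the paper.

```python
import numpy as np

def accelerated_dual(Z, num_iters=200):
    """Accelerated entropy-regularized updates on phi(q) = ||Z^T q||^2 / 2."""
    n = Z.shape[0]
    K = Z @ Z.T                                   # Gram matrix; grad phi(q) = K q
    log_q = np.full(n, -np.log(n))                # q_0 uniform, kept in log-domain
    mu = np.exp(log_q)
    phi = lambda q: 0.5 * np.dot(Z.T @ q, Z.T @ q)
    for t in range(num_iters):
        lam = 2.0 / (t + 2)
        q = np.exp(log_q - log_q.max())
        q /= q.sum()
        nu = (1 - lam) * mu + lam * q
        log_q = log_q - (1.0 / lam) * (K @ nu)    # q_{t+1} prop. to q_t * exp(-(theta/lam) grad phi(nu))
        log_q -= log_q.max()                      # renormalize in log-domain
        log_q -= np.log(np.exp(log_q).sum())
        mu = (1 - lam) * mu + lam * np.exp(log_q)
        if (t + 1) % 50 == 0:
            print(t + 1, phi(mu))
    return mu

if __name__ == "__main__":
    X = np.array([[0.9, 0.1], [0.2, 0.8], [-0.8, -0.3], [-0.1, -0.9]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    accelerated_dual(-y[:, None] * X)             # phi(mu_t) should approach (max margin)^2 / 2
```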
**Lemma 3.2.** For all $t \ge 0$, let $\theta_t = 1$ and $\lambda_t = 2/(t+2)$. Then for all $t \ge 1$ and $\bar q \in \arg\min_{q \in \Delta_n} \phi(q)$,

$$\phi(\mu_t) - \phi(\bar q) \le \frac{4\ln(n)}{(t+1)^2}.$$

Next we construct corresponding primal variables $(w_t)_{t=0}^\infty$ such that $\nabla\psi(Zw_t) = q_t$. (We do not try to make $\nabla\psi(Zw_t) = \nu_t$ or $\mu_t$, since only $q_t$ is constructed using a mirror descent / dual averaging update.) Let $w_0 := 0$, and for $t \ge 0$, let

$$w_{t+1} := w_t - \frac{\theta_t}{\lambda_t} Z^\top \nu_t. \tag{3.3}$$

We can verify that $q_t$ is indeed the dual variable to $w_t$, in the sense that $\nabla\psi(Zw_t) = q_t$: this is true by definition at $t = 0$, since $\nabla\psi(Zw_0) = \nabla\psi(0) = q_0$. For $t \ge 0$, we have

$$q_{t+1} \propto q_t \odot \exp\left(-\frac{\theta_t}{\lambda_t} ZZ^\top \nu_t\right) \propto \exp(Zw_t) \odot \exp\left(-\frac{\theta_t}{\lambda_t} ZZ^\top \nu_t\right) = \exp(Zw_{t+1}).$$

In addition, we have the following characterization of $w_t$ based on a momentum term, giving rise to the earlier eq. (1.1).

**Lemma 3.4.** For all $\lambda_t, \theta_t \in (0, 1]$, if $\lambda_0 = 1$, then for all $t \ge 0$,

$$w_{t+1} = w_t - \theta_t\big(g_t + Z^\top q_t\big), \qquad \text{where } g_0 := 0, \text{ and for } t \ge 1, \quad g_t := \frac{\lambda_{t-1}(1 - \lambda_t)}{\lambda_t}\big(g_{t-1} + Z^\top q_t\big).$$

Specifically, for $\lambda_t = 2/(t+2)$, it holds that

$$\frac{\lambda_{t-1}(1 - \lambda_t)}{\lambda_t} = \frac{t}{t+1}, \qquad g_t = \sum_{j=1}^{t} \frac{j}{t+1}\, Z^\top q_j, \qquad \text{and} \qquad Z^\top \mu_t = \frac{2 g_t}{t}.$$

Consequently, with $\lambda_t = 2/(t+2)$, the primal iterate defined by eq. (3.3) coincides with the iterate given by Algorithm 1 with $\beta_t = t/(t+1)$. Additionally, Lemmas 3.2 and 3.4 already prove the second part of Theorem 3.1, since $\phi(\mu_t) = 4\|g_t\|_2^2/(2t^2)$ by Lemma 3.4, while $\phi(\bar q) = \bar\gamma^2/2$ by Lemma A.1.

### 3.3. Margin Analysis

Now we consider the margin maximization result of Theorem 3.1. The function $\psi$ will be important here, since it gives a smoothed approximation of the margin: recall that $\psi(Zw)$ is defined as

$$\psi(Zw) = \ell^{-1}\left(\sum_{i=1}^n \ell(\langle z_i, w \rangle)\right).$$

Since $\ell$ is increasing, we have

$$\psi(Zw) \ge \ell^{-1}\left(\max_{1 \le i \le n} \ell(\langle z_i, w \rangle)\right) = \max_{1 \le i \le n} \langle z_i, w \rangle.$$

As a result, to prove a lower bound on $\gamma(w_t) = -\max_i \langle z_i, w_t \rangle / \|w_t\|_2$, we only need to prove a lower bound on $-\psi(Zw_t)/\|w_t\|_2$, and it would be enough if we have an upper bound on $\psi(Zw_t)$ and an upper bound on $\|w_t\|_2$. Below is our bound on $\psi$ for Algorithm 1. Its proof is based on a much finer analysis of dual Nesterov, and uses both primal and dual smoothness.

**Lemma 3.5.** Let $\theta_t = 1$ for all $t \ge 0$, and $\lambda_0 = 1$. Then for all $t \ge 1$, $\psi(Zw_t)$ is at most $\psi(Zw_0)$ plus correction terms controlled by $1/(2\lambda_{t-1}^2)$ and $1/\lambda_j^2 - 1/\lambda_j$; in particular, whenever $1/\lambda_j^2 - 1/\lambda_j \le 1/\lambda_{j-1}^2$ for all $1 \le j < t$, it yields

$$\psi(Zw_t) \le \psi(Zw_0) - \sum_{j=0}^{t-1} \frac{1}{2\lambda_j} \|Z^\top \nu_j\|_2^2. \tag{3.8}$$

Additionally, here are our bounds on $\|w_t\|_2$.

**Lemma 3.6.** Let $\theta_t = 1$ for all $t \ge 0$. Then

$$\bar\gamma \sum_{j=0}^{t-1} \frac{1}{\lambda_j} \le \|w_t\|_2 \le \sum_{j=0}^{t-1} \frac{1}{\lambda_j} \|Z^\top \nu_j\|_2.$$

With Lemmas 3.5 and 3.6, we can prove Theorem 3.1. Here we show a weaker result which gives $1/t^2$ convergence to $\bar\gamma/2$; its proof is also part of the full proof of Theorem 3.1, but much simpler. The remaining proof of Theorem 3.1 is deferred to Appendix C.

**Proposition 3.7** (weaker version of Theorem 3.1). With $\theta_t = 1$ and $\lambda_t = 2/(t+2)$, we have

$$\gamma(w_t) \ge \frac{\bar\gamma}{2} - \frac{4\ln(n)}{\bar\gamma (t+1)^2}.$$

*Proof.* With $\lambda_t = 2/(t+2)$, it holds that $1/\lambda_j^2 - 1/\lambda_j \le 1/\lambda_{j-1}^2$, and therefore eq. (3.8) holds. Then eq. (3.8) and Lemma 3.6 imply

$$\psi(Zw_0) - \psi(Zw_t) \ge \sum_{j=0}^{t-1} \frac{1}{2\lambda_j} \|Z^\top \nu_j\|_2^2 \ge \frac{\bar\gamma}{2} \sum_{j=0}^{t-1} \frac{1}{\lambda_j} \|Z^\top \nu_j\|_2 \ge \frac{\bar\gamma}{2} \|w_t\|_2, \tag{3.9}$$

since $\|Z^\top \nu_j\|_2 \ge \bar\gamma$ (cf. Lemma A.1). On the other hand, Lemma 3.6 and $\lambda_t = 2/(t+2)$ imply

$$\|w_t\|_2 \ge \bar\gamma \sum_{j=0}^{t-1} \frac{1}{\lambda_j} \ge \frac{\bar\gamma (t+1)^2}{4}, \qquad \text{and thus} \qquad \frac{\psi(Zw_0)}{\|w_t\|_2} = \frac{\ln(n)}{\|w_t\|_2} \le \frac{4\ln(n)}{\bar\gamma (t+1)^2}. \tag{3.10}$$

It then follows from eqs. (3.9) and (3.10) that

$$\gamma(w_t) \ge \frac{-\psi(Zw_t)}{\|w_t\|_2} \ge \frac{\bar\gamma}{2} - \frac{\psi(Zw_0)}{\|w_t\|_2} \ge \frac{\bar\gamma}{2} - \frac{4\ln(n)}{\bar\gamma (t+1)^2}. \qquad \square$$

## 4. Analysis of Algorithm 2

Since Algorithm 1 is derived from dual Nesterov, it can also be run completely in the dual, meaning primal iterates and in particular the primal dimensionality never play a role. However, this dual version would require calculating $ZZ^\top q_t$, which in the kernel setting requires $n^2$ kernel calls. In Algorithm 2, we replace $Z^\top q_t$ with a single column $z_{i_t}$ of $Z^\top$, where $i_t$ is sampled from $q_t \in \Delta_n$. This sampling allows us to make only $n$ kernel calls per iteration, rather than $n^2$ as in Algorithm 1.

Algorithm 2.
Input: data matrix $Z \in \mathbb{R}^{n \times d}$, step sizes $(\theta_t)_{t=0}^\infty$, momentum factors $(\beta_t)_{t=0}^\infty$.
Initialize: $w_0 = g_{-1} = (0, \ldots, 0) \in \mathbb{R}^d$, $q_0 = (\tfrac{1}{n}, \ldots, \tfrac{1}{n}) \in \Delta_n$.
for $t = 0, 1, 2, \ldots$ do
    Sample $i_t \sim q_t$.
    $g_t \leftarrow \beta_t (g_{t-1} + z_{i_t})$.
    $w_{t+1} \leftarrow w_t - \theta_t (g_t + z_{i_t})$.
    $q_{t+1} \propto \exp(Z w_{t+1})$, with $q_{t+1} \in \Delta_n$.
end for

Unfortunately, we do not have a general theory for Algorithm 2. Instead, as follows, we provide here an analysis with momentum disabled, meaning $\beta_t = 0$, and a small constant step size $\theta_t$.

**Theorem 4.1.** Given $\epsilon > 0$ and $\delta \in (0, 1)$, let $t \ge \big(32\ln(n) + 64\ln(2/\delta)\big)/\epsilon^2$ and $\theta_j = \sqrt{\ln(n)/t}$ for $0 \le j < t$; then with probability $1 - \delta$, $\gamma(w_t) \ge \bar\gamma - \epsilon$.

The proof of Theorem 4.1 is similar to the proof of Theorem 3.1, but must additionally produce high-probability bounds on $\psi(Zw_t)$ and $\|w_t\|_2$; details are deferred to Appendix D. Although we do not have a convergence analysis for Algorithm 2 with a nonzero momentum, it works well in practice, as verified on the full mnist data, shown in Figure 3.
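For concreteness, here is a minimal NumPy sketch of Algorithm 2 (illustrative, written in the primal with explicit features rather than kernel evaluations); the momentum and step-size choices are passed in so that both the analyzed setting ($\beta_t = 0$, small constant $\theta_t$) and the heuristic setting ($\beta_t = t/(t+1)$, $\theta_t = 1$) from Figure 3 can be tried.

```python
import numpy as np

def algorithm2(Z, num_iters, theta_fn, beta_fn, seed=0):
    """Stochastic variant: replace Z^T q_t by a single sampled row z_{i_t} of Z."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    w = np.zeros(d)
    g = np.zeros(d)
    for t in range(num_iters):
        s = Z @ w                                  # in a kernel setting, this vector is
        q = np.exp(s - s.max())                    # maintained incrementally with n kernel
        q /= q.sum()                               # calls per iteration
        i = rng.choice(n, p=q)                     # adaptive nonuniform sampling i_t ~ q_t
        z = Z[i]
        g = beta_fn(t) * (g + z)
        w = w - theta_fn(t) * (g + z)
    return w

if __name__ == "__main__":
    X = np.array([[0.9, 0.1], [0.2, 0.8], [-0.8, -0.3], [-0.1, -0.9]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    Z = -y[:, None] * X
    T = 2000
    # Analyzed setting (cf. Theorem 4.1): no momentum, small constant step size.
    w_safe = algorithm2(Z, T, theta_fn=lambda t: np.sqrt(np.log(len(Z)) / T), beta_fn=lambda t: 0.0)
    # Heuristic setting from Figure 3: Algorithm 1's parameters reused verbatim.
    w_fast = algorithm2(Z, T, theta_fn=lambda t: 1.0, beta_fn=lambda t: t / (t + 1))
    for w in (w_safe, w_fast):
        print(-np.max(Z @ w) / np.linalg.norm(w))
```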
Still with $\beta_t = t/(t+1)$, Algorithm 2 can slightly beat the batch perceptron method, which is the fastest prior algorithm in the hard-margin kernel SVM setting. (Other classical methods, such as stochastic dual coordinate ascent (Shalev-Shwartz & Zhang, 2013), are focused on the nonseparable soft-margin SVM setting.)

Figure 3. Margin maximization performance of various methods requiring $O(n)$ kernel evaluations per iteration; panel (a) reports margins and panel (b) reports test error, both against iterations. The batch perceptron is slightly beaten by Algorithm 2 using the momentum and step size parameters from Algorithm 1 ($\beta_t = t/(t+1)$, $\theta_t = 1$), which is only provided here as a heuristic. By contrast, the theoretically-justified parameters ($\beta_t = 0$, $\theta_t = 1/\sqrt{t}$), as analyzed in Theorem 4.1, are slower than batch perceptron. The data here is the full mnist data, with features given by the initial kernel of a 2-homogeneous network of width 128 (cf. Appendix F).

## 5. Other Exponentially-Tailed Losses

Here we discuss the extension to other exponentially-tailed losses, such as the logistic loss in the case of binary classification, and to multiclass losses.

### 5.1. Binary Classification

In previous sections, we focused on the exponential loss. Our methods can also be applied to other strictly increasing losses, such as the logistic loss $\ell(z) := \ln(1 + e^z)$, simply by replacing $Z^\top q_t$ in Algorithm 1 with $Z^\top \nabla\psi(Zw_t)$, where $\psi$ is still defined by eq. (2.1). In the proof of Theorem 3.1, we only use two properties of $\psi$: (i) $\psi$ is $\rho$-smooth with respect to the $\ell_\infty$ norm, and (ii) $\|\nabla\psi\|_1 \le 1$. These two properties hold with $\rho = 1$ for the exponential loss, and with $\rho = n$ for the logistic loss (Ji & Telgarsky, 2019, Lemma 5.3, Lemma D.1). Therefore we can use the same analysis to prove a $\widetilde{O}(1/t^2)$ margin maximization rate for the logistic loss; details are given in Appendix C. However, the margin rate would additionally depend on $\rho$, which is $n$ for the logistic loss. Such a bad dependency on $n$ is probably due to the aggressive initial step size: from eq. (2.2), we know that $Z^\top\nabla\psi(Zw_t)$ is just $\nabla R(w_t)$ normalized by $\ell'\big(\psi(Zw_t)\big)/n$. However, this quantity is at most $1/n$ for the logistic loss, even at initialization. It is an interesting open problem to find a better initial step size.
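To make the logistic-loss substitution concrete, the following NumPy sketch (illustrative; the helper names are not from the paper) computes $\psi$ and $Z^\top\nabla\psi(Zw)$ for $\ell(z) = \ln(1+e^z)$ via eqs. (2.1) and (2.2), and prints the normalization factor $\ell'(\psi(Zw))/n$, which is at most $1/n$ as noted above.

```python
import numpy as np

def logistic(z):            # ell(z) = ln(1 + e^z)
    return np.logaddexp(0.0, z)

def logistic_prime(z):      # ell'(z) = sigmoid(z)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_inv(u):        # ell^{-1}(u) = ln(e^u - 1), valid for u > 0
    return np.log(np.expm1(u))

def psi_and_direction(Z, w):
    """psi(Zw) from eq. (2.1) and the normalized direction Z^T grad psi(Zw) from eq. (2.2)."""
    s = Z @ w
    psi = logistic_inv(logistic(s).sum())
    grad_psi = logistic_prime(s) / logistic_prime(psi)   # coordinates of grad psi(Zw)
    return psi, Z.T @ grad_psi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10, 3
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
    Z = -y[:, None] * X
    w = rng.normal(size=d)

    psi, direction = psi_and_direction(Z, w)
    n_grad_R = Z.T @ logistic_prime(Z @ w)               # n * grad R(w)
    assert np.allclose(direction, n_grad_R / logistic_prime(psi))
    print("ell'(psi)/n =", logistic_prime(psi) / n, "<= 1/n =", 1.0 / n)
```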
### 5.2. Multiclass Classification

Suppose now that inputs $(x_i)_{i=1}^N$ have multiclass labels $(c_i)_{i=1}^N$, meaning $c_i \in \{1, \ldots, k\}$. The standard approach to multiclass linear prediction associates a linear predictor $u_j$ with each class $j \in \{1, \ldots, k\}$; collecting these as columns of a matrix $U \in \mathbb{R}^{d \times k}$, the multiclass prediction is $x \mapsto \arg\max_{c \in \{1, \ldots, k\}} x^\top U e_c$, and letting $\|U\|_F$ denote the Frobenius norm, the margin of $U$ and the maximum margin are respectively

$$\gamma_m(U) := \min_i \min_{c \ne c_i} \frac{x_i^\top U e_{c_i} - x_i^\top U e_c}{\|U\|_F}, \qquad \bar\gamma_m := \max_{\|U\|_F \le 1} \gamma_m(U),$$

with the edge case $\gamma_m(0) = 0$ as before. We now show how to reduce this case to the binary case and allow the application of Algorithm 1 and its analysis in Theorem 3.1. The standard construction of multiclass losses uses exactly the differences of labels as in the preceding definition of $\gamma_m$ (Zhang, 2005; Tewari & Bartlett, 2007); that is, define a multiclass risk as

$$R_m(U) := \frac{1}{N} \sum_{i=1}^N \sum_{j \ne c_i} \ell\big(x_i^\top U e_j - x_i^\top U e_{c_i}\big).$$

To rewrite this in our notation as a prediction problem defined by a single matrix $Z$, define $n := N(k-1)$, let $F : \mathbb{R}^{d \times k} \to \mathbb{R}^{dk}$ be any fixed flattening of a $d \times k$ matrix into a vector of length $dk$, and let $\pi : \{1, \ldots, N\} \times \{1, \ldots, k-1\} \to \{1, \ldots, n\}$ be any bijection between the $N$ original examples and their $n$ new fake counterparts, defined as follows: for each example $i$ and incorrect label $j \ne c_i$, define

$$z_{\pi(i,j)} := x_i (e_j - e_{c_i})^\top / \sqrt{2},$$

and let $Z \in \mathbb{R}^{n \times dk}$ be the matrix whose row $\pi(i,j)$ is the flattening $F(z_{\pi(i,j)})^\top$; then, equivalently,

$$\frac{1}{k-1} R_m(U) = \frac{1}{n} \sum_{i=1}^N \sum_{j \ne c_i} \ell\Big(\sqrt{2}\, \big\langle F(U),\, F(z_{\pi(i,j)}) \big\rangle\Big).$$

In particular, it suffices to consider a flattened weight vector $w = F(U) \in \mathbb{R}^{dk}$, and invoke the algorithm and analysis from Section 3, with the preceding matrix $Z$.

**Theorem 5.1.** Let a multiclass problem $\{(x_i, c_i)\}_{i=1}^N$ be given with maximum multiclass margin $\bar\gamma_m > 0$. Then the corresponding matrix $Z$ as defined above has binary margin $\bar\gamma := \bar\gamma_m/\sqrt{2} > 0$. Moreover, letting $w_t$ denote the output of Algorithm 1 when run on this $Z$ as in Theorem 3.1, meaning exponential loss $\ell$ and $\beta_t := t/(t+1)$ and $\theta_t := 1$, for every $t \ge 1$ the un-flattened output $U_t := F^{-1}(w_t)$ satisfies

$$\gamma_m(U_t) \ge \bar\gamma_m - \frac{4\big(1 + \ln(n)\big)\big(1 + 2\ln(t+1)\big)}{\bar\gamma_m (t+1)^2}.$$

Due to proceeding by reduction, the guarantees of Section 4 also hold for an analogous multiclass version of Algorithm 2. Indeed, Algorithm 2, with the aggressive (heuristic) parameters $\beta_t = t/(t+1)$ and $\theta_t = 1$, proved effective in practice, and was used in the experiments of Figure 3, as well as the upcoming Figure 4.

One issue that arises in these reduction-based implementations is avoiding explicitly writing down $Z$ or even individual rows of $Z$, which have $dk$ elements. Instead, note that sampling from $q$ as in Algorithm 2 now returns both an example index $i$ and an incorrect label $j \ne c_i$. From here, updates to just the two columns of $U$ corresponding to $j$ and $c_i$ can be constructed; a small sketch of the reduction itself is given below.
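The following NumPy snippet (illustrative only, and following this document's sign convention $z_i := -y_i x_i$, so each fake row uses $e_j - e_{c_i}$) builds the reduced matrix $Z$ explicitly, which is fine for small problems, and checks that flattened inner products recover the multiclass margin differences.

```python
import numpy as np

def multiclass_reduction(X, c, k):
    """Rows z_{pi(i,j)} = flatten(x_i (e_j - e_{c_i})^T) / sqrt(2) for every j != c_i."""
    N, d = X.shape
    rows, pairs = [], []
    eye = np.eye(k)
    for i in range(N):
        for j in range(k):
            if j == c[i]:
                continue
            M = np.outer(X[i], eye[j] - eye[c[i]]) / np.sqrt(2.0)   # d x k
            rows.append(M.ravel())                                  # fixed flattening F
            pairs.append((i, j))
    return np.array(rows), pairs                                    # Z has shape (N(k-1), d k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, k = 6, 4, 3
    X = rng.normal(size=(N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    c = rng.integers(0, k, size=N)
    Z, pairs = multiclass_reduction(X, c, k)

    U = rng.normal(size=(d, k))
    w = U.ravel()                                                   # w = F(U)
    for (i, j), z in zip(pairs, Z):
        diff = X[i] @ U[:, j] - X[i] @ U[:, c[i]]                   # x_i^T U e_j - x_i^T U e_{c_i}
        assert np.isclose(np.sqrt(2.0) * (w @ z), diff)
    print("reduction verified; Z shape:", Z.shape)
```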
## 6. Application: Deep Network Kernel Evolution

As an application of these fast margin-maximization methods, we study the evolution of kernels encountered during deep network training. Specifically, consider the cifar10 dataset, which has 50,000 input images in 10 classes; a standard deep network architecture for this problem is AlexNet (Krizhevsky et al., 2012), which has convolutional, dense linear, and various nonlinear layers. Let $v_t$ denote the AlexNet parameters encountered at epoch $t$ of training on cifar10 with a standard stochastic gradient method, and let $A(x; v_t)$ denote the prediction of AlexNet on input $x$ with these parameters $v_t$. From here, we can obtain a feature vector $\nabla_v A(x; v_t)$, and use it to construct a matrix $Z$ to plug in to our methods; when $t = 0$, this corresponds to the Neural Tangent Kernel (NTK) (Jacot et al., 2018; Li & Liang, 2018; Du et al., 2018), but here we are also interested in later kernels, meaning $t > 0$. (To handle multiclass output, we simply flatten the Jacobian; as another technical point, we $\ell_2$-normalize the features to further simplify training and the selection of step sizes.)

For any fixed $t$, we thus obtain a linear prediction problem with rows of the matrix $Z$ given by the features $\nabla_v A(x; v_t)$ (with additional care for class labels, as in the reductions defined in Section 5.2), and can use Algorithm 2 to quickly determine the maximum margin. Figure 4(a) presents an experiment that is consistent with standard beliefs: as $t$ increases, the test error of the corresponding maximum-margin (kernel) predictor decreases. In these experiments, the AlexNet training is run until the features converge, and the test error of the final maximum-margin kernel predictor is identical to that of the final deep network.

A more interesting example is given in Figure 4(b): a case where feature learning does not help. All that differs between Figure 4(a) and Figure 4(b) is the choice of random seed. A key point is that the AlexNet in both experiments was trained with only 128 training points (the testing set had the usual 10,000 images, but test error is unsurprisingly large). The idea is that the feature learning implicit in deep network training can overfit with such small amounts of data. Of course, 128 examples is not a standard deep learning regime; these figures merely illustrate that feature learning may fail, not that it always fails. It is an interesting open question to study this phenomenon in realistic scenarios.

Figure 4. Test error curves of kernel predictors trained with Algorithm 2, using kernels from different epochs (0, 400, 800, 1200, 1600) of standard deep network training; panel (a) uses random seed 100, panel (b) uses random seed 13579. Please see Section 6 and Appendix F for details; the short summary is that changing the random seed suffices to change whether kernel features improve or not.

## 7. Concluding Remarks and Open Problems

In this work, we gave two new algorithms based on a dual perspective of margin maximization and implicit bias: a momentum-based method in Section 3, constructed via translating dual Nesterov acceleration iterates into the primal, and an adaptive sampling method in Section 4, which aims for greater per-iteration efficiency in the kernel case.

Turning first to Algorithm 1, its derivation exposes a connection between Nesterov acceleration in the dual and momentum in the primal. Does this connection exist more generally, namely in other optimization problems? A second open problem is to formally analyze Algorithm 2 with momentum. As demonstrated empirically in Figure 3, it can work well, whereas our analysis disables momentum. On the empirical side, the small-scale experiments of Section 6 scratched the surface of situations where feature learning can fail. Can this phenomenon be exhibited in more realistic scenarios?

## Acknowledgements

We thank the reviewers for their comments. ZJ and MT are grateful for support from the NSF under grant IIS-1750051, and from NVIDIA under a GPU grant.

## References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240-6249, 2017.

Borwein, J. and Lewis, A. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated, 2000.

Chizat, L. and Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.

Clarkson, K. L., Hazan, E., and Woodruff, D. P. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):1-49, 2012.
Cotter, A., Shalev-Shwartz, S., and Srebro, N. The kernelized stochastic batch perceptron. In ICML, 2012.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.

Ghadimi, E., Feyzmahdavian, H. R., and Johansson, M. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pp. 310-315. IEEE, 2015.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.

Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pp. 9461-9471, 2018b.

Hanashiro, R. and Abernethy, J. Linear separation via optimism. arXiv preprint arXiv:2011.08797, 2020.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.

Ji, Z. and Telgarsky, M. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.

Ji, Z. and Telgarsky, M. Characterizing the implicit bias via a primal-dual analysis. arXiv preprint arXiv:1906.04540, 2019.

Ji, Z. and Telgarsky, M. Directional convergence and alignment in deep learning. arXiv preprint arXiv:2006.06657, 2020.

Krizhevsky, A., Sutskever, I., and Hinton, G. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157-8166, 2018.

Lyu, K. and Li, J. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.

Nacson, M. S., Lee, J., Gunasekar, S., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Nesterov, Y. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372-376, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv:1412.6614 [cs.LG], 2014.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(2), 2013.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

Telgarsky, M. Margins, shrinkage, and boosting. In ICML, 2013.

Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.

Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. http://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf, 2008.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, T. Statistical analysis of some multi-category large margin classification methods. JMLR, 5:1225-1251, 2005.