Published as a conference paper at ICLR 2023

ON ACCELERATED PERCEPTRONS AND BEYOND

Guanghui Wang1, Rafael Hanashiro2, Etash Guha1, Jacob Abernethy1,3
1College of Computing, Georgia Tech, Atlanta, GA, USA
2Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
3Google Research, Atlanta, GA 30309
{gwang369,etash}@gatech.edu, rafah@mit.edu, abernethyj@google.com

The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies n linearly separable data points, assuming the classes are separated by some margin γ > 0. A foundational result is that the Perceptron converges after O(1/γ²) iterations. Several recent works have managed to improve this rate by a quadratic factor, to O(√(log n)/γ), with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from O(log t/t²) to O(1/t²), where t is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show that NAG can maximize the margin at an O(1/t²) rate; c) for the classical p-norm Perceptron problem, we provide an algorithm with an O(√((p−1) log n)/γ) convergence rate, while existing algorithms suffer an O((p−1)/γ²) convergence rate.

1 INTRODUCTION

In this paper, we revisit the problem of learning a linear classifier, which is one of the most important and fundamental tasks of machine learning (Bishop, 2007). In this problem, we are given a set S of n training examples, and the goal is to find a linear classifier that correctly separates S as fast as possible. The most well-known algorithm is the Perceptron (Rosenblatt, 1958), which converges to a perfect (mistake-free) classifier after O(1/γ²) iterations, provided the data is linearly separable with some margin γ > 0 (Novikoff, 1962). Over subsequent decades, many variants of the Perceptron have been developed (Aizerman, 1964; Littlestone, 1988; Wendemuth, 1995; Freund & Schapire, 1999; Cesa-Bianchi et al., 2005, to name a few). However, somewhat surprisingly, there has been little progress in substantially improving the fundamental Perceptron iteration bound presented by Novikoff (1962).
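For reference, here is a minimal sketch of the vanilla Perceptron in this setting (our illustration, not from the paper; picking the most-violated example is one standard variant): whenever some example has a non-positive margin, its label-signed feature vector is added to the iterate, and Novikoff's argument bounds the number of such updates by 1/γ².

```python
import numpy as np

def perceptron(A, max_iter=10000):
    """Vanilla Perceptron. Rows of A are y_i * x_i, so the goal is A @ w > 0."""
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = A @ w
        i = np.argmin(margins)   # most-violated example
        if margins[i] > 0:       # all points classified correctly
            return w
        w += A[i]                # add a violated (label-signed) example
    return w
```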
It is only recently that a number of researchers have discovered accelerated variants of the Perceptron with a faster O(√(log n)/γ) iteration complexity, albeit with a higher per-iteration cost. These works model the problem in different ways, e.g., as a non-smooth optimization problem or an empirical risk minimization task, and they establish faster rates using sophisticated optimization tools. Soheili & Peña (2012) put forward the smooth Perceptron, framing the objective as a non-smooth strongly-concave maximization and then applying Nesterov's excessive gap technique (NEG, Nesterov, 2005). Yu et al. (2014) proposed the accelerated Perceptron by furnishing a convex-concave objective that can be solved via the mirror-prox method (Nemirovski, 2004). Ji et al. (2021) put forward a third interpretation, obtaining the accelerated rate by minimizing the empirical risk under the exponential loss with a momentum-based normalized gradient descent algorithm.

Following this line of research, in this paper we present a unified analysis framework that reveals the exact relationship among these methods, which share the same order of convergence rate. Moreover, we show that the proposed framework also leads to improved results for various problems beyond the standard Perceptron setting. Specifically, we consider a general zero-sum game that involves two players (Abernethy et al., 2018): a main player that chooses the classifier, and an auxiliary player that picks a distribution over the data. The two players compete with each other by running no-regret online learning algorithms (Hazan, 2016; Orabona, 2019), and the goal is to find the equilibrium of some convex-concave function. We show that, under this dynamic, all of the existing accelerated Perceptrons find equivalent forms. In particular, these Perceptrons can be described as a dynamic in which the two players solve the game by performing optimistic online learning strategies (Rakhlin & Sridharan, 2013), one of the most important classes of algorithms in online learning. Note that implementing online learning algorithms (even optimistic strategies) to solve zero-sum games has already been extensively explored (e.g., Rakhlin & Sridharan, 2013; Daskalakis et al., 2018; Wang & Abernethy, 2018; Daskalakis & Panageas, 2019). However, we emphasize that our main novelty lies in showing that all of the existing accelerated Perceptrons, developed with advanced algorithms from different areas, can be perfectly described under this unified framework. It greatly simplifies the analysis of accelerated Perceptrons, as their convergence rates can now be obtained by plugging in off-the-shelf regret bounds of optimistic online learning algorithms. Moreover, the unified framework reveals a close connection between the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021):

Theorem 1 (informal). The smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) can both be described as a dynamic where the two players employ the optimistic follow-the-regularized-leader (OFTRL) algorithm. The main difference is that the smooth Perceptron outputs the weighted average of the main player's historical decisions, while the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum.

Beyond providing a deeper understanding of accelerated Perceptrons, our framework also provides improved new results for several other important areas:

Implicit bias analysis. The seminal work of Soudry et al. (2018) shows that, for linearly separable data, minimizing the empirical risk with vanilla gradient descent (GD) gives a classifier which not only has zero training error (and thus can be used for linear separation), but also maximizes the margin. This phenomenon characterizes the implicit bias of GD, as it implicitly prefers the (ℓ₂-)maximal margin classifier among all classifiers with a positive margin, and analyzing the implicit bias has become an important tool for understanding why classical optimization methods generalize well in supervised machine learning. Soudry et al. (2018) show that GD maximizes the margin at an O(1/log t) rate, which was later improved to O(1/√t) (Nacson et al., 2019) and then to O(1/t) (Ji & Telgarsky, 2021). The state-of-the-art algorithm is proposed by Ji et al. (2021), who show that their momentum-based GD enjoys an O(log t/t²) margin-maximization rate.
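As a quick numerical illustration of this phenomenon (our sketch on a tiny synthetic dataset, not from the paper), one can run vanilla GD on the exponential-loss empirical risk and watch the normalized margin min_i A_{(i,:)}w / ‖w‖₂ slowly creep toward its maximum:

```python
import numpy as np

# Two separable points (rows are y_i * x_i); GD never sees the margin explicitly.
A = np.array([[1.0, 0.5], [1.0, -0.5]])

def risk_grad(w):
    losses = np.exp(-A @ w)          # exponential loss on each example
    return -(A.T @ losses) / len(A)  # gradient of the empirical risk

w = np.zeros(2)
for t in range(1, 100001):
    w -= 0.1 * risk_grad(w)
    if t % 20000 == 0:
        # normalized margin; it increases toward gamma, illustrating the implicit bias
        print(t, np.min(A @ w) / np.linalg.norm(w))
```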
In this paper, we make two contributions toward this direction:

1. We show that, under our analysis framework, we can easily improve the margin-maximization rate of the algorithm of Ji et al. (2021) from O(log t/t²) to O(1/t²);
2. Although previous work has analyzed the implicit bias of GD and momentum-based GD, it remained unclear how the classical Nesterov's accelerated gradient descent (NAG, Nesterov, 1988) affects the implicit bias. In this paper, through our framework, we show that NAG with appropriately chosen parameters also enjoys an O(1/t²) margin-maximization rate. To our knowledge, this is the first time the implicit bias property of NAG has been proved.

p-norm Perceptron. Traditional work on Perceptrons typically assumes the feature vectors lie in an ℓ₂-ball. A more general setting is considered by Gentile (2000), who assumes the feature vectors lie inside an ℓ_p-ball, with p ∈ [2, ∞). The proposed algorithm requires O(p/γ²) iterations to find a zero-error classifier. In this paper, we develop a new Perceptron algorithm for this problem under our framework based on optimistic strategies, showing that it enjoys an accelerated O(√(p log n)/γ) rate.

2 RELATED WORK

This section briefly reviews related work on Perceptron algorithms, implicit-bias analysis, and game theory. Background on (optimistic) online learning is presented in the Preliminaries (Section 3).

Accelerated Perceptrons and p-norm Perceptron. The study of Perceptron algorithms has an extensive history dating back to the mid-twentieth century (Rosenblatt, 1958; Novikoff, 1962). However, it is only recently that progress has been made on improving the fundamental O(1/γ²) iteration bound of the vanilla Perceptron in the standard setting. Specifically, the smooth Perceptron proposed by Soheili & Peña (2012) achieves an O(√(log n)/γ) rate by maximizing an auxiliary non-smooth strongly-concave function with NEG (Nesterov, 2005). The same accelerated rate was later obtained by two other works (Yu et al., 2014; Ji et al., 2021). The former considers a bilinear saddle-point problem and employs the mirror-prox method (Nemirovski, 2004); the latter applies momentum-based GD to minimize the empirical risk with the exponential loss. In this paper, we show that these algorithms can be analyzed in a unified way under our framework.

Apart from the above accelerated Perceptrons, there exists another class of algorithms which enjoy an O(poly(n) log(1/γ)) convergence rate (Dunagan & Vempala, 2004; Peña & Soheili, 2016; Dadush et al., 2020). These methods typically call (accelerated) Perceptrons as a subroutine, and then apply a re-scaling technique to adjust the decision periodically. Although the dependence on γ becomes better, the polynomial dependence on n makes these methods computationally inefficient for large-scale data sets. In this paper, we focus on accelerated Perceptrons with the O(√(log n)/γ) rate, and leave explaining the re-scaling type algorithms as future work.

The p-norm Perceptron problem (Gentile & Littlestone, 1999) is a natural extension of the classical Perceptron setting, which assumes that the p-norms of the feature vectors are bounded, where 2 ≤ p < ∞. Gentile (2001) shows that a mirror-descent-style update guarantees an O(p/γ²) convergence rate.
By contrast, our proposed algorithm achieves a tighter O(√(p log n)/γ) convergence rate.

Implicit-bias analysis. In many real-world applications, directly minimizing the empirical risk (without any regularization) by first-order methods can produce a model which not only enjoys low training error, but also generalizes well (Soudry et al., 2018). This is usually attributed to the implicit bias introduced by the optimization methods (Soudry et al., 2018). Explaining this phenomenon is a crucial step toward understanding the generalization ability of commonly-used optimization methods. For linearly separable data, Soudry et al. (2018) prove that, when minimizing the empirical risk with the exponential loss, vanilla GD maximizes the margin at an O(1/log t) rate. This result implies the implicit bias of GD toward the ℓ₂-maximal margin classifier. Later, Nacson et al. (2019) show that GD with a function-value-dependent decreasing step size enjoys an O(1/√t) margin-maximization rate. Ji & Telgarsky (2021) improve this result to O(1/t) by employing a faster-decreasing step size. Ji et al. (2021) design a momentum-based GD that maximizes the margin at an O(log t/t²) rate. However, it remained unclear whether this rate could be further improved, and how to analyze the margin-maximization ability of other classical optimization methods such as NAG. In this paper, we provide positive answers to both questions. Finally, we note that Ramdas & Peña (2016) prove that a variant of the Perceptron algorithm, named the normalized Perceptron, also converges to a maximum-margin classifier at an O(1/√t) rate.

Games and no-regret dynamics. Our framework is motivated by the line of research that links optimization methods for convex optimization to equilibrium computation with no-regret dynamics. The seminal work of Rakhlin & Sridharan (2013) recovers the classical mirror-prox method (Nemirovski, 2004) with optimistic online mirror descent. Abernethy & Wang (2017) show that the well-known Frank-Wolfe algorithm can be seen as applying (optimistic) online algorithms to compute the equilibrium of a special zero-sum game called the Fenchel game, which is constructed via the Fenchel duality of the objective function. Later, researchers demonstrated that other classical optimization methods for smooth optimization, such as Heavy-ball and NAG, can be described similarly (Abernethy et al., 2018; Wang & Abernethy, 2018; Wang et al., 2021). We highlight the differences between this work and the previous ones: 1) our analysis does not involve the Fenchel game or Fenchel duality; instead, we directly work on the (regularized) min-max game designed for linear classification; 2) most of the previous work focuses on understanding optimization algorithms for smooth optimization, and it was unclear how to understand algorithms such as NEG under the game framework; 3) although both Abernethy et al. (2018) and our work analyze NAG, the goals are significantly different: Abernethy et al. (2018) focus on the optimization problem itself, while we consider how minimizing the empirical risk affects the margin; 4) to our knowledge, the link between implicit-bias problems and no-regret dynamics is also new.

3 PRELIMINARIES

Notation. We use lower-case boldface letters x, y to denote vectors, lower-case letters a, b to denote scalars, and upper-case letters A, B to denote matrices. For a vector x ∈ ℝ^d, we use x_i to denote the i-th component of x.
For a matrix A ∈ ℝ^{n×d}, let A_{(i,:)} be its i-th row, A_{(:,j)} its j-th column, and A_{(i,j)} the i-th element of the j-th column. We use ‖·‖ to denote a general norm, ‖·‖_* its dual norm, and ‖·‖_p the ℓ_p-norm. For a positive integer n, we denote the set {1, . . . , n} by [n], and the n-dimensional simplex by Δ_n. Let E: Δ_n → ℝ be the negative entropy function, defined as E(p) = Σ_{i=1}^n p_i log p_i for all p ∈ Δ_n. For a strongly convex function R: W → ℝ, define the Bregman divergence between any two points w, w′ ∈ W as D_R(w, w′) = R(w) − R(w′) − (w − w′)^⊤ ∇R(w′).

Online convex optimization (OCO). Here we review a general weighted OCO framework, proposed by Abernethy et al. (2018). In each round t = 1, . . . , T of this paradigm, a learner first chooses a decision z_t from a convex set Z ⊆ ℝ^d, then observes a loss function f_t(·): Z → ℝ as well as a weight α_t > 0, and finally updates the decision. The performance of the learner is measured by the weighted regret, defined as

R_T = Σ_{t=1}^T α_t f_t(z_t) − min_{z∈Z} Σ_{t=1}^T α_t f_t(z).

Perhaps the most natural method for OCO is follow-the-leader (FTL), which simply picks the empirically best decision at each round: z_t = argmin_{z∈Z} Σ_{i=1}^{t−1} α_i f_i(z). However, FTL is unstable, and one can easily find counter-examples where FTL suffers linear regret (Shalev-Shwartz, 2011). A classical way to address this limitation is to add a regularizer to the objective:

z_t = argmin_{z∈Z} η Σ_{i=1}^{t−1} α_i f_i(z) + D_R(z, z_0),

where η > 0 is the step size and z_0 is the initial decision. This method is called follow-the-regularized-leader (FTRL, Hazan, 2016), and it achieves a sub-linear regret bound with an appropriately chosen η. Moreover, tighter bounds are possible in favorable cases with more advanced techniques. In this paper, we consider FTRL equipped with optimistic strategies (i.e., Optimistic FTRL; Rakhlin & Sridharan, 2013; Orabona, 2019), given by

OFTRL[R, z_0, ψ_t, η, Z]: z_t = argmin_{z∈Z} η(Σ_{i=1}^{t−1} α_i f_i(z) + α_t ψ_t(z)) + D_R(z, z_0),

where an additional function ψ_t, an approximation of the next loss f_t, is added. A tighter regret bound can be achieved when ψ_t is close enough to f_t. Next, we introduce two special cases of OFTRL:

OFTL[ψ_t, Z]: z_t = argmin_{z∈Z} Σ_{j=1}^{t−1} α_j f_j(z) + α_t ψ_t(z),
FTRL+[R, z_0, η, Z]: z_t = argmin_{z∈Z} η Σ_{j=1}^{t} α_j f_j(z) + D_R(z, z_0).

The first is Optimistic FTL, where the regularizer is set to zero. The second is FTRL+, which uses f_t itself as the optimistic function ψ_t. Finally, we note that, apart from FTL-type algorithms, there also exist optimistic methods developed from mirror descent, such as optimistic online mirror descent (OMD, Rakhlin & Sridharan, 2013). Due to space limitations, the details of OMD and the regret bounds of these OCO algorithms are postponed to Appendix A.

No-regret dynamics for zero-sum games. Finally, we introduce the framework for using no-regret online algorithms to solve a zero-sum game. Consider the following general two-player game:

max_{w∈W} min_{p∈Q} g(w, p),   (1)

where g(w, p) is a concave-convex function, and W and Q are convex sets. The no-regret framework for solving (1) is presented in Protocol 1. In each round t of this procedure, the w-player first picks a decision w_t ∈ W. Then, the p-player observes its loss ℓ_t(·) = g(w_t, ·) as well as the weight α_t. After that, the p-player picks its decision p_t and passes it to the w-player. As a consequence, the w-player observes its loss h_t(·) = −g(·, p_t) and the weight α_t. Note that both ℓ_t and h_t are convex. Denote the weighted regret bounds of the two players by R_w and R_p, respectively.
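Protocol 1, stated below, formalizes this loop. To fix ideas, here is a minimal sketch of the dynamic (our code, not the paper's implementation; the learner interface play/update/feed is an assumption introduced for illustration):

```python
import numpy as np

def no_regret_dynamics(ol_w, ol_p, A, alphas):
    """Weighted no-regret dynamics (Protocol 1 style) driven by the data matrix A.

    ol_w.play(t) returns w_t; ol_p.update(t, alpha_t, A @ w_t) lets the p-player
    observe its weighted loss l_t(p) = g(w_t, p) and respond with p_t; the w-player
    then observes h_t(w) = -g(w, p_t) through ol_w.feed(t, alpha_t, A.T @ p_t).
    """
    weighted_sum, total = 0.0, 0.0
    for t, alpha in enumerate(alphas, start=1):
        w_t = ol_w.play(t)                    # w-player moves first
        p_t = ol_p.update(t, alpha, A @ w_t)  # p-player responds via its OCO rule
        ol_w.feed(t, alpha, A.T @ p_t)        # w-player's loss is revealed
        weighted_sum = weighted_sum + alpha * w_t
        total += alpha
    return weighted_sum / total               # the weighted average of Theorem 2
```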
Then, we have the following classical conclusion, whose proof is shown in Appendix B.

Theorem 2. Define m(w) = min_{p∈Q} g(w, p), and let w̄_T = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t be the weighted average of {w_t}_{t=1}^T. Then, for any w ∈ W, m(w) − m(w̄_T) ≤ (Σ_{t=1}^T α_t)^{−1}(R_p + R_w).

Protocol 1 No-regret dynamics with weighted OCO
1: Input: {α_t}_{t=1}^T, w_0, p_0.
2: Input: OL^w, OL^p. // the online algorithms for choosing w and p
3: for t = 1, . . . , T do
4:   w_t ← OL^w;
5:   OL^p ← α_t, ℓ_t(·); // define ℓ_t(·) = g(w_t, ·)
6:   p_t ← OL^p;
7:   OL^w ← α_t, h_t(·); // define h_t(·) = −g(·, p_t)
8: end for
9: Output: w̄_T = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t

4 UNDERSTANDING EXISTING ACCELERATED PERCEPTRONS

In this section, we first introduce our main topic, i.e., the binary linear classification problem, and then present the three accelerated Perceptrons and their equivalent forms under Protocol 1. For clarity, we use v_t to denote the classifier updates in the original algorithms, and w_t the corresponding updates in the equivalent forms under Protocol 1.

4.1 BINARY LINEAR CLASSIFICATION AND PERCEPTRON

We focus on the binary linear classification problem, which dates back to the pioneering work of Rosenblatt (1958). Let S = {(x^(i), y^(i))}_{i=1}^n be a linearly separable set of n training examples, where x^(i) ∈ ℝ^d is the feature vector of the i-th example, and y^(i) ∈ {−1, +1} is the corresponding label. The goal is to efficiently find a linear classifier w ∈ ℝ^d that correctly separates all data points. More formally, let A = [y^(1)x^(1), . . . , y^(n)x^(n)]^⊤ ∈ ℝ^{n×d} be the matrix that contains all of the data. Then we would like to find a w ∈ ℝ^d such that

min_{i∈[n]} A_{(i,:)}w > 0.   (2)

This goal can be reformulated as a min-max optimization problem:

max_{w∈ℝ^d} min_{i∈[n]} A_{(i,:)}w = max_{w∈ℝ^d} min_{p∈Δ_n} p^⊤Aw,   (3)

where, on the RHS of (3), min_{i∈[n]} A_{(i,:)}w is rewritten as min_{p∈Δ_n} p^⊤Aw. The two expressions are equivalent since the optimal distribution p ∈ Δ_n always puts all weight on one training example (i.e., one row) of A. For any w ∈ ℝ^d, let γ(w) = min_{p∈Δ_n} p^⊤Aw be the margin of w, and introduce the following standard assumption.

Assumption 1. We assume that the feature vectors are bounded, i.e., ‖x^(i)‖₂ ≤ 1 for all i ∈ [n], and that there exists a w* ∈ ℝ^d such that ‖w*‖₂ = 1 and γ(w*) = γ > 0.

4.2 SMOOTH PERCEPTRON

In order to solve (3), the vanilla Perceptron repeatedly moves the classifier toward examples on which it performs badly. However, this greedy policy only yields a sub-optimal convergence rate. To address this problem, Soheili & Peña (2012) propose the smooth Perceptron, whose pseudocode is summarized in the first box of Algorithm 1. The key idea is to find a classifier v ∈ ℝ^d that maximizes the following ℓ₂-regularized function:

ψ(v) = −(1/2)‖v‖₂² + min_{q∈Δ_n} q^⊤Av,   (4)

which is non-smooth and strongly concave with respect to v. Under Assumption 1, Soheili & Peña (2012) show that the maximum value of ψ is max_{v∈ℝ^d} ψ(v) = γ²/2, and that a classifier v ∈ ℝ^d satisfies (2) when ψ(v) > 0. In order to solve (4), Soheili & Peña (2012) apply the classical Nesterov's excessive gap technique (Nesterov, 2005) for strongly concave functions. This algorithm introduces a smoothed approximation of (4), parameterized by a constant µ > 0 and defined as

ψ_µ(v) = −(1/2)‖v‖₂² + min_{q∈Δ_n} {q^⊤Av + µ D_E(q, 1_n/n)},   (5)

which satisfies ψ(v) ≤ ψ_µ(v) ≤ ψ(v) + µ log n for all v ∈ ℝ^d and µ > 0. The algorithm then performs a sophisticated update rule such that an excessive gap condition, relating γ²/2, ‖Av_t‖₂²/2, and ψ_{µ_t}(v_t), holds for all t ≥ 1; we refer to Soheili & Peña (2012) for the precise statement.
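Since D_E(·, 1_n/n) is the Bregman divergence of the negative entropy, the smoothed best response q_µ(v) appearing in Algorithm 1 below has a softmax closed form; a small sketch in our notation:

```python
import numpy as np

def q_mu(A, v, mu):
    """Entropy-smoothed best response: argmin over the simplex of
    q^T A v + mu * D_E(q, uniform). The minimizer is softmax(-A v / mu)."""
    z = -A @ v / mu
    z -= z.max()          # shift for numerical stability; argmin is unchanged
    q = np.exp(z)
    return q / q.sum()
```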
Algorithm 1 Smooth Perceptron (Soheili & Peña, 2012)
Initialization: θ_0 = 2/3, µ_0 = 4, v_0 = A^⊤1_n/n,
q_0 = q_{µ_0}(v_0), where q_µ(v) = argmin_{q∈Δ_n} q^⊤Av + µ D_E(q, 1_n/n).
for t = 1, . . . , T − 1 do
  v_t = (1 − θ_{t−1})(v_{t−1} + θ_{t−1}A^⊤q_{t−1}) + θ²_{t−1}A^⊤q_{µ_{t−1}}(v_{t−1})
  µ_t = (1 − θ_{t−1})µ_{t−1}
  q_t = (1 − θ_{t−1})q_{t−1} + θ_{t−1}q_{µ_t}(v_t)
  θ_t = 2/(t+3)
end for
Output: v_{T−1}

Equivalent form under Protocol 1:
OL^w = OFTL[ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL^p = FTRL+[E, 1_n/n, 1/4, Δ_n]: p_t = argmin_{p∈Δ_n} (1/4) Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T

Soheili & Peña (2012) show that the smooth Perceptron can output a positive-margin classifier after O(√(log n)/γ) iterations. However, the analysis is quite involved, relying heavily on the intricate relationship between ψ(v_t), ψ_{µ_t}(v_t), and ‖Av_t‖₂. In the following, we provide a no-regret explanation of this algorithm and then show that the convergence rate can be easily obtained under our framework. Specifically, we define the objective function in Protocol 1 as

g(w, p) = p^⊤Aw − (1/2)‖w‖₂²,   (6)

and provide the equivalent expression in the second box of Algorithm 1. More specifically, we have the following proposition; the proof is given in Appendix C.1.

Proposition 1. Let α_t = t. Then the two interpretations of the smooth Perceptron in Algorithm 1 are the same, in the sense that v_{T−1} = w̄_T and q_{T−1} = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t p_t.

Proposition 1 shows that, under Protocol 1, the smooth Perceptron can be seen as optimizing (6) by implementing two online learning algorithms: in each round t, the w-player applies OFTL with ψ_t = h_{t−1}, a decision which the p-player subsequently observes and responds to with FTRL+. Moreover, we obtain theoretical guarantees for the smooth Perceptron based on Theorem 2, matching the convergence rate provided by Soheili & Peña (2012).

Theorem 3. Let α_t = t. Under Protocol 1, the regret of the two players in Algorithm 1 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).

4.3 ACCELERATED PERCEPTRON VIA ERM

Since binary linear classification is a (and perhaps the most) fundamental supervised machine learning problem, it can naturally be solved via empirical risk minimization (ERM). This idea is adopted by Ji et al. (2021), who consider the following ERM problem:

min_{v∈ℝ^d} R(v) = (1/n) Σ_{i=1}^n ℓ(−y^(i)v^⊤x^(i)),   (7)

where ℓ(z) = exp(z) is the exponential loss. To minimize (7), Ji et al. (2021) propose a normalized momentum-based gradient descent (NMGD) algorithm, and show that NMGD converges to a classifier with a positive margin after O(1/γ²) iterations. Based on properties of the normalized gradient, they also provide a primal-dual form of the algorithm (presented in the first box of Algorithm 2), which can be considered as applying Nesterov's accelerated gradient descent in the dual space (Ji et al., 2021).

Algorithm 2 Accelerated Perceptron of Ji et al. (2021)
Input: q_0 = 1_n/n, v_0 = 0, g_0 = 0.
for t = 1, . . . , T do
  Set θ_{t−1} = t/(2(t+1)), β_t = t/(t+1)
  v_t = v_{t−1} − θ_{t−1}(g_{t−1} − A^⊤q_{t−1})
  q_{t,i} = exp(−y^(i)v_t^⊤x^(i)) / Σ_{j=1}^n exp(−y^(j)v_t^⊤x^(j)), for i = 1, . . . , n
  g_t = β_t(g_{t−1} − A^⊤q_t)
end for
Output: v_T

Equivalent form under Protocol 1:
OL^w = OFTL[ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL^p = FTRL+[E, 1_n/n, 1/4, Δ_n]: p_t = argmin_{p∈Δ_n} (1/4) Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: (1/4) Σ_{t=1}^T α_t w_t (the weighted sum)

For the accelerated Perceptron of Ji et al. (2021), we again set g(w, p) = p^⊤Aw − (1/2)‖w‖₂² and provide its equivalent form in the second box of Algorithm 2.
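To make the shared dynamic concrete, the following is a small self-contained sketch (our code, following the second boxes of Algorithms 1 and 2, with the step size η = 1/4 taken from the boxes above): the w-player's OFTL step for g(w, p) = p^⊤Aw − ½‖w‖₂² has a closed form, a weighted combination of past p's in which the hint ψ_t = h_{t−1} counts p_{t−1} twice, and the p-player's entropy FTRL+ step is a softmax of cumulative weighted losses. Returning the weighted average gives the smooth-Perceptron output; returning the weighted sum instead corresponds (up to scaling) to the output of Ji et al. (2021).

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def unified_dynamic(A, T, eta=0.25, average=True):
    """Sketch of Protocol 1 with OFTL (w-player) vs. entropy FTRL+ (p-player)
    on g(w, p) = p^T A w - 0.5 * ||w||_2^2, with weights alpha_t = t."""
    n, d = A.shape
    p_prev = np.ones(n) / n            # p_0: uniform distribution over examples
    sum_ap = np.zeros(n)               # running sum of alpha_j * p_j
    sum_a = 0.0
    cum = np.zeros(n)                  # running sum of alpha_s * A w_s (p-player losses)
    out = np.zeros(d)
    for t in range(1, T + 1):
        a = float(t)
        # OFTL with hint psi_t = h_{t-1}: the quadratic minimization in closed form
        w = A.T @ (sum_ap + a * p_prev) / (sum_a + a)
        # FTRL+ with entropy regularizer: softmax of cumulative weighted losses
        cum += a * (A @ w)
        p_prev = softmax(-eta * cum)
        sum_ap += a * p_prev
        sum_a += a
        out += a * w
    return out / sum_a if average else out   # weighted average vs. weighted sum
```

On separable data, this is exactly the dynamic for which Theorem 3 predicts a positive margin once T is of order √(log n)/γ.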
More specifically, we have the following proposition.

Proposition 2. Let α_t = t. Then the two interpretations of the accelerated Perceptron in Algorithm 2 are the same, in the sense that q_T = p_T and v_T = (1/4) Σ_{t=1}^T α_t w_t.

Remark. Note that in Ji et al. (2021), the parameter θ_{t−1} is set to 1/2. We observe that this causes a mismatch between the weights and the w_t's in the output (when θ_{t−1} = 1/2, v_T becomes (1/4) Σ_{t=1}^T α_{t+1}w_t instead of (1/4) Σ_{t=1}^T α_t w_t). Our no-regret analysis suggests that θ_{t−1} should be configured as t/(2(t+1)), and later we show that this choice helps eliminate the log t factor in the convergence rate.

The proposition above reveals that the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) are closely related. The main difference is that the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum of the w_t's, instead of the weighted average. The rationale behind this is that, instead of the margin, Ji et al. (2021) use the normalized margin to measure the performance of the algorithm, defined as γ̄(v) = min_{p∈Δ_n} p^⊤Av/‖v‖₂. They not only prove γ̄(v_T) > 0 (which directly implies γ(v_T) > 0), but also show γ̄(v_T) = Ω(γ − log T/(γT²)). This guarantee is more powerful than the previous results, as it implies that the normalized margin can be maximized at an O(log t/t²) rate. As t approaches ∞, the direction of v_T converges to that of the maximal margin classifier. We show that a better margin-maximization rate can be directly obtained under our framework with the parameter setting in Algorithm 2.

Theorem 4. Let α_t = t. Under Protocol 1, the regret of the two players in Algorithm 2 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ), and γ̄(v_T) = Ω(γ − 8 log n/(γT(T+1))).

4.4 ACCELERATED PERCEPTRON OF YU ET AL. (2014)

Finally, Yu et al. (2014) consider applying the mirror-prox algorithm to solve the max-min optimization problem:

max_{‖w‖₂≤1} min_{p∈Δ_n} p^⊤Aw.   (8)

As discovered by Rakhlin & Sridharan (2013), mirror-prox can be recovered from their optimistic online mirror descent (OMD) with an appropriately chosen optimistic term. Here, we show that, by simply manipulating the notation, the algorithm of Yu et al. (2014) can be recovered as two players applying OMD to solve (8). Due to space limitations, the algorithmic details and their analysis are postponed to Appendix C.5; we summarize them as follows (formalized in Theorems 8 and 9):

Theorem 5 (informal). Let g(w, p) = p^⊤Aw. The Perceptron of Yu et al. (2014) can be described under Protocol 1, where both players apply OMD endowed with the appropriate parameters. Moreover, under Assumption 1, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).

5 BEYOND PERCEPTRONS

In this section, we show that our framework benefits a wide range of other problems.

5.1 MARGIN MAXIMIZATION BY NESTEROV'S ACCELERATED GRADIENT DESCENT

One major task of the implicit-bias literature is to explain why commonly-used first-order optimization methods generalize well. For linearly separable data, previous work proves that GD and momentum-based GD prefer the ℓ₂-maximal margin classifier by showing that they can maximize the margin. In this part, we show that the well-known Nesterov's accelerated gradient descent (NAG), with appropriately chosen parameters, can maximize the margin at an O(1/t²) rate. Specifically, following previous work, we consider maximizing the margin by solving the ERM problem in (7), and apply NAG to minimize the objective function. The details of the algorithm are summarized in the first box of Algorithm 3, and its equivalent form under Protocol 1 is presented in the second box.

Algorithm 3 NAG
Input: v_0 = 0, s_0 = 0.
for t = 1, . . . , T do
  u_t = s_{t−1} + (1/(2(t−1)))v_{t−1}   (at t = 1 this correction term vanishes, since v_0 = 0)
  v_t = v_{t−1} − η_t ∇R(u_t)
  s_t = s_{t−1} + (1/(2(t+1)))v_t
end for
Output: s_T

Equivalent form under Protocol 1:
OL^p = OFTRL[E, 1_n/n, 1/4, ψ_t = ℓ_{t−1}(·), Δ_n]: p_t = argmin_{p∈Δ_n} (1/4)(Σ_{s=1}^{t−1} α_s ℓ_s(p) + α_t ℓ_{t−1}(p)) + D_E(p, 1_n/n)
OL^w = FTRL+[(1/2)‖·‖₂², 0, 1, ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{s=1}^{t} α_s h_s(w) + (1/2)‖w‖₂²
Output: 2w̄_T = 2(Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t

For the NAG algorithm, we set the objective function as g(w, p) = p^⊤Aw − (1/2)‖w‖₂².
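Before stating the guarantee, here is a runnable sketch of the first box of Algorithm 3 (our code; the coefficients and the step size η_t = t/R(u_t) follow our reconstruction of the pseudocode and the theorem below, and the normalization requires numerical care for large T, since R(u_t) decays quickly):

```python
import numpy as np

def nag_margin(A, T):
    """Sketch of Algorithm 3: NAG on R(v) = (1/n) sum_i exp(-(A v)_i),
    where the rows of A are y_i * x_i and eta_t = t / R(u_t)."""
    n, d = A.shape
    v, s = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        u = s if t == 1 else s + v / (2 * (t - 1))
        losses = np.exp(-A @ u)
        R = losses.mean()              # empirical risk (shrinks toward 0)
        grad = -(A.T @ losses) / n     # its gradient
        v = v - (t / R) * grad         # normalized accelerated step
        s = s + v / (2 * (t + 1))
    return s   # the normalized margin of s approaches gamma at an O(1/T^2) rate
```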
We then have the following conclusion.

Theorem 6. Let η_t = t/R(u_t) and α_t = t. Then the two expressions in Algorithm 3 are equivalent, in the sense that s_T = 2w̄_T. Moreover,

min_{p∈Δ_n} p^⊤A s_T / ‖s_T‖₂ ≥ γ − (8 log n + 2)/(T(T+1)γ).

Remark. Theorem 6 indicates that NAG can also maximize the margin at an O(1/T²) rate. Note that, in Algorithm 3, the p-player plays first and the w-player second, while in Algorithm 2 it is the other way around. This difference makes sense, as Algorithm 2 can be considered as applying Nesterov's acceleration in the dual space (Ji et al., 2021), while Algorithm 3 uses Nesterov's acceleration in the primal space.

5.2 ACCELERATED p-NORM PERCEPTRON

In this section, we focus on the p-norm Perceptron problem, introduced by Gentile (2000). Compared to the classical Perceptron problem, the p-norm Perceptron introduces the following assumption, which is more general than Assumption 1.

Assumption 2. For the p-norm Perceptron problem, we assume ‖x^(i)‖_p ≤ 1 for all i ∈ [n], where p ∈ [2, ∞). Moreover, assume there exists a w* ∈ ℝ^d such that ‖w*‖_q ≤ 1 and min_{p∈Δ_n} p^⊤Aw* ≥ γ. Here, q is the dual exponent of p, i.e., 1/p + 1/q = 1 and q ∈ (1, 2].

Under Assumption 2, Gentile (2000) proposes a mirror-descent-style algorithm that achieves an O((p−1)/γ²) convergence rate. In the following, we provide a new algorithm with a better rate. Specifically, under Protocol 1, we define the objective function as g(w, p) = p^⊤Aw, and introduce Algorithm 4, wherein the w-player uses OFTRL with the regularizer (1/(2(q−1)))‖·‖_q², which is 1-strongly convex w.r.t. the ℓ_q-norm (Orabona, 2019). On the other hand, the p-player employs the FTRL+ algorithm.

Algorithm 4 Accelerated algorithm for the p-norm Perceptron
OL^w = OFTRL[(1/(2(q−1)))‖·‖_q², 0, η_w, ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} η_w(Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)) + D_{(1/(2(q−1)))‖·‖_q²}(w, 0)
OL^p = FTRL+[E, 1_n/n, η_p, Δ_n]: p_t = argmin_{p∈Δ_n} η_p Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T = (1/T) Σ_{t=1}^T w_t

We have the following result.

Theorem 7. Let α_t = 1, η_p = 1/η_w, and η_w = 1/√(2(q−1) log n). Then the output w̄_T of Algorithm 4 has a non-negative margin when T = Ω(√(2(p−1) log n)/γ).

6 CONCLUSION

In this paper, we provide a unified analysis of the existing accelerated Perceptrons, and obtain improved results for a series of problems. In the future, we will explore how to extend our framework to other closely related areas, such as semi-definite programming (Garber & Hazan, 2011) and generalized margin maximization (Sun et al., 2022). Another interesting direction is to investigate whether other, more advanced online learning algorithms (Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Jun et al., 2017; Zhang et al., 2018; Wang et al., 2020; Zhao et al., 2020) are useful for Perceptrons and for implicit-bias analysis under our framework.

ACKNOWLEDGMENTS

We gratefully thank the AI4OPT Institute for funding, as part of NSF Award 2112533. We also acknowledge the NSF for their support through Award IIS-1910077, as well as an ARC-ACO fellowship provided by Georgia Tech.
REFERENCES

Jacob Abernethy, Kevin A. Lai, Kfir Y. Levy, and Jun-Kun Wang. Faster rates for convex-concave games. In Proceedings of the 31st Annual Conference on Learning Theory, pp. 1595–1625, 2018.

Jacob D. Abernethy and Jun-Kun Wang. On Frank-Wolfe and equilibrium computation. In Advances in Neural Information Processing Systems 30, 2017.

Mark A. Aizerman. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory, pp. 1493–1529, 2018.

Daniel Dadush, László A. Végh, and Giacomo Zambelli. Rescaling algorithms for linear conic feasibility. Mathematics of Operations Research, 45(2):732–754, 2020.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, 2019.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

John Dunagan and Santosh Vempala. A polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 2004.

Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.

Dan Garber and Elad Hazan. Approximating semidefinite programs in sublinear time. In Advances in Neural Information Processing Systems 24, 2011.

Claudio Gentile. A new approximate maximal margin classification algorithm. In Advances in Neural Information Processing Systems 13, 2000.

Claudio Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pp. 1–11, 1999.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Ziwei Ji and Matus Telgarsky. Characterizing the implicit bias via a primal-dual analysis. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pp. 772–804, 2021.

Ziwei Ji, Nathan Srebro, and Matus Telgarsky. Fast margin maximization via dual acceleration. In Proceedings of the 38th International Conference on Machine Learning, pp. 4860–4869, 2021.

Kwang-Sung Jun, Francesco Orabona, Stephen Wright, and Rebecca Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 943–951, 2017.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420–3428, 2019.
Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

Yurii Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24(3):509–517, 1988.

A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pp. 615–622, 1962.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29, 2016.

Javier Peña and Negar Soheili. A deterministic rescaled perceptron algorithm. Mathematical Programming, 155(1):497–510, 2016.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26, pp. 3066–3074, 2013.

Aaditya Ramdas and Javier Peña. Towards a deeper geometric, analytic and algorithmic understanding of margins. Optimization Methods and Software, 31(2):377–391, 2016.

Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Negar Soheili and Javier Peña. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728–737, 2012.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, and Navid Azizan. Mirror descent maximizes generalized margin and can be implemented efficiently. arXiv preprint arXiv:2205.12808, 2022.

Guanghui Wang, Shiyin Lu, Quan Cheng, Wei-wei Tu, and Lijun Zhang. SAdam: A variant of Adam for strongly convex functions. In International Conference on Learning Representations, 2020.

Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems 31, pp. 3824–3834, 2018.

Jun-Kun Wang, Jacob Abernethy, and Kfir Y. Levy. No-regret dynamics in the Fenchel game: A unified framework for algorithmic convex optimization. arXiv preprint arXiv:2111.11309, 2021.

A. Wendemuth. Learning the unlearnable. Journal of Physics A: Mathematical and General, 28(18):5423, 1995.

Adams Wei Yu, Fatma Kilinc-Karzan, and Jaime Carbonell. Saddle points and accelerated perceptron algorithms. In International Conference on Machine Learning, pp. 1827–1835, 2014.

Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pp. 1323–1333, 2018.

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33, pp. 12510–12520, 2020.