Published as a conference paper at ICLR 2023

ON ACCELERATED PERCEPTRONS AND BEYOND

Guanghui Wang1, Rafael Hanashiro2, Etash Guha1, Jacob Abernethy1,3
1College of Computing, Georgia Tech, Atlanta, GA, USA
2Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
3Google Research, Atlanta, GA 30309
{gwang369,etash}@gatech.edu, rafah@mit.edu, abernethyj@google.com

The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies n linearly separable data points, assuming the classes are separated by some margin γ > 0. A foundational result is that the Perceptron converges after O(1/γ²) iterations. Several recent works have managed to improve this rate by a quadratic factor, to O(√(log n)/γ), with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from O(log t/t²) to O(1/t²), where t is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show that NAG can maximize the margin at an O(1/t²) rate; c) for the classical p-norm Perceptron problem, we provide an algorithm with an O(√((p−1) log n)/γ) convergence rate, while existing algorithms suffer an O((p−1)/γ²) convergence rate.

1 INTRODUCTION

In this paper, we revisit the problem of learning a linear classifier, which is one of the most important and fundamental tasks of machine learning (Bishop, 2007). In this problem, we are given a set S of n training examples, and the goal is to find a linear classifier that correctly separates S as fast as possible. The most well-known algorithm is the Perceptron (Rosenblatt, 1958), which converges to a perfect (mistake-free) classifier after O(1/γ²) iterations, provided the data is linearly separable with some margin γ > 0 (Novikoff, 1962). Over subsequent decades, many variants of the Perceptron have been developed (Aizerman, 1964; Littlestone, 1988; Wendemuth, 1995; Freund & Schapire, 1999; Cesa-Bianchi et al., 2005, to name a few). However, somewhat surprisingly, there has been little progress in substantially improving the fundamental Perceptron iteration bound presented by Novikoff (1962).
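For reference, here is a minimal sketch of the vanilla Perceptron in this setting (our illustration, not from the paper; picking the most-violated example is one standard variant): whenever some example has a non-positive margin, its label-signed feature vector is added to the iterate, and Novikoff's argument bounds the number of such updates by 1/γ².

```python
import numpy as np

def perceptron(A, max_iter=10000):
    """Vanilla Perceptron. Rows of A are y_i * x_i, so the goal is A @ w > 0."""
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        margins = A @ w
        i = np.argmin(margins)   # most-violated example
        if margins[i] > 0:       # all points classified correctly
            return w
        w += A[i]                # add a violated (label-signed) example
    return w
```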
It is only recently that a number of researchers have discovered accelerated variants of the Perceptron with a faster O(√(log n)/γ) iteration complexity, albeit with a higher per-iteration cost. These works model the problem in different ways, e.g., as a non-smooth optimization problem or an empirical risk minimization task, and they establish faster rates using sophisticated optimization tools. Soheili & Peña (2012) put forward the smooth Perceptron, framing the objective as a non-smooth strongly-concave maximization and then applying Nesterov's excessive gap technique (NEG, Nesterov, 2005). Yu et al. (2014) proposed the accelerated Perceptron by furnishing a convex-concave objective that can be solved via the mirror-prox method (Nemirovski, 2004). Ji et al. (2021) put forward a third interpretation, obtaining the accelerated rate by minimizing the empirical risk under the exponential loss with a momentum-based normalized gradient descent algorithm.

Following this line of research, in this paper we present a unified analysis framework that reveals the exact relationship among these methods, which share the same order of convergence rate. Moreover, we show that the proposed framework also leads to improved results for various problems beyond the standard Perceptron setting. Specifically, we consider a general zero-sum game that involves two players (Abernethy et al., 2018): a main player that chooses the classifier, and an auxiliary player that picks a distribution over the data. The two players compete with each other by running no-regret online learning algorithms (Hazan, 2016; Orabona, 2019), and the goal is to find the equilibrium of some convex-concave function. We show that, under this dynamic, all of the existing accelerated Perceptrons find equivalent forms. In particular, these Perceptrons can be described as a dynamic in which the two players solve the game by performing optimistic online learning strategies (Rakhlin & Sridharan, 2013), one of the most important classes of algorithms in online learning. Note that implementing online learning algorithms (even optimistic strategies) to solve zero-sum games has already been extensively explored (e.g., Rakhlin & Sridharan, 2013; Daskalakis et al., 2018; Wang & Abernethy, 2018; Daskalakis & Panageas, 2019). However, we emphasize that our main novelty lies in showing that all of the existing accelerated Perceptrons, developed with advanced algorithms from different areas, can be perfectly described under this unified framework. It greatly simplifies the analysis of accelerated Perceptrons, as their convergence rates can now be obtained by plugging in off-the-shelf regret bounds of optimistic online learning algorithms. Moreover, the unified framework reveals a close connection between the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021):

Theorem 1 (informal). The smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) can both be described as a dynamic where the two players employ the optimistic follow-the-regularized-leader (OFTRL) algorithm. The main difference is that the smooth Perceptron outputs the weighted average of the main player's historical decisions, while the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum.

Beyond providing a deeper understanding of accelerated Perceptrons, our framework also provides improved new results for several other important areas:

Implicit bias analysis. The seminal work of Soudry et al. (2018) shows that, for linearly separable data, minimizing the empirical risk with vanilla gradient descent (GD) gives a classifier which not only has zero training error (and thus can be used for linear separation), but also maximizes the margin. This phenomenon characterizes the implicit bias of GD, as it implicitly prefers the (ℓ₂-)maximal margin classifier among all classifiers with a positive margin, and analyzing the implicit bias has become an important tool for understanding why classical optimization methods generalize well in supervised machine learning. Soudry et al. (2018) show that GD maximizes the margin at an O(1/log t) rate, which was later improved to O(1/√t) (Nacson et al., 2019) and then to O(1/t) (Ji & Telgarsky, 2021). The state-of-the-art algorithm is proposed by Ji et al. (2021), who show that their momentum-based GD enjoys an O(log t/t²) margin-maximization rate.
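As a quick numerical illustration of this phenomenon (our sketch on a tiny synthetic dataset, not from the paper), one can run vanilla GD on the exponential-loss empirical risk and watch the normalized margin min_i A_{(i,:)}w / ‖w‖₂ slowly creep toward its maximum:

```python
import numpy as np

# Two separable points (rows are y_i * x_i); GD never sees the margin explicitly.
A = np.array([[1.0, 0.5], [1.0, -0.5]])

def risk_grad(w):
    losses = np.exp(-A @ w)          # exponential loss on each example
    return -(A.T @ losses) / len(A)  # gradient of the empirical risk

w = np.zeros(2)
for t in range(1, 100001):
    w -= 0.1 * risk_grad(w)
    if t % 20000 == 0:
        # normalized margin; it increases toward gamma, illustrating the implicit bias
        print(t, np.min(A @ w) / np.linalg.norm(w))
```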
In this paper, we make two contributions toward this direction:

1. We show that, under our analysis framework, we can easily improve the margin-maximization rate of the algorithm of Ji et al. (2021) from O(log t/t²) to O(1/t²);
2. Although previous work has analyzed the implicit bias of GD and momentum-based GD, it remained unclear how the classical Nesterov's accelerated gradient descent (NAG, Nesterov, 1988) affects the implicit bias. In this paper, through our framework, we show that NAG with appropriately chosen parameters also enjoys an O(1/t²) margin-maximization rate. To our knowledge, this is the first time the implicit bias property of NAG has been proved.

p-norm Perceptron. Traditional work on Perceptrons typically assumes the feature vectors lie in an ℓ₂-ball. A more general setting is considered by Gentile (2000), who assumes the feature vectors lie inside an ℓ_p-ball, with p ∈ [2, ∞). The proposed algorithm requires O(p/γ²) iterations to find a zero-error classifier. In this paper, we develop a new Perceptron algorithm for this problem under our framework based on optimistic strategies, showing that it enjoys an accelerated O(√(p log n)/γ) rate.

2 RELATED WORK

This section briefly reviews related work on Perceptron algorithms, implicit-bias analysis, and game theory. Background on (optimistic) online learning is presented in the Preliminaries (Section 3).

Accelerated Perceptrons and p-norm Perceptron. The study of Perceptron algorithms has an extensive history dating back to the mid-twentieth century (Rosenblatt, 1958; Novikoff, 1962). However, it is only recently that progress has been made on improving the fundamental O(1/γ²) iteration bound of the vanilla Perceptron in the standard setting. Specifically, the smooth Perceptron proposed by Soheili & Peña (2012) achieves an O(√(log n)/γ) rate by maximizing an auxiliary non-smooth strongly-concave function with NEG (Nesterov, 2005). The same accelerated rate was later obtained by two other works (Yu et al., 2014; Ji et al., 2021). The former considers a bilinear saddle-point problem and employs the mirror-prox method (Nemirovski, 2004); the latter applies momentum-based GD to minimize the empirical risk with the exponential loss. In this paper, we show that these algorithms can be analyzed in a unified way under our framework.

Apart from the above accelerated Perceptrons, there exists another class of algorithms which enjoy an O(poly(n) log(1/γ)) convergence rate (Dunagan & Vempala, 2004; Peña & Soheili, 2016; Dadush et al., 2020). These methods typically call (accelerated) Perceptrons as a subroutine, and then apply a re-scaling technique to adjust the decision periodically. Although the dependence on γ becomes better, the polynomial dependence on n makes these methods computationally inefficient for large-scale data sets. In this paper, we focus on accelerated Perceptrons with the O(√(log n)/γ) rate, and leave explaining the re-scaling type algorithms as future work.

The p-norm Perceptron problem (Gentile & Littlestone, 1999) is a natural extension of the classical Perceptron setting, which assumes that the p-norms of the feature vectors are bounded, where 2 ≤ p < ∞. Gentile (2001) shows that a mirror-descent-style update guarantees an O(p/γ²) convergence rate.
By contrast, our proposed algorithm achieves a tighter O(√(p log n)/γ) convergence rate.

Implicit-bias analysis. In many real-world applications, directly minimizing the empirical risk (without any regularization) by first-order methods can produce a model which not only enjoys low training error, but also generalizes well (Soudry et al., 2018). This is usually attributed to the implicit bias introduced by the optimization methods (Soudry et al., 2018). Explaining this phenomenon is a crucial step toward understanding the generalization ability of commonly-used optimization methods. For linearly separable data, Soudry et al. (2018) prove that, when minimizing the empirical risk with the exponential loss, vanilla GD maximizes the margin at an O(1/log t) rate. This result implies the implicit bias of GD toward the ℓ₂-maximal margin classifier. Later, Nacson et al. (2019) show that GD with a function-value-dependent decreasing step size enjoys an O(1/√t) margin-maximization rate. Ji & Telgarsky (2021) improve this result to O(1/t) by employing a faster-decreasing step size. Ji et al. (2021) design a momentum-based GD that maximizes the margin at an O(log t/t²) rate. However, it remained unclear whether this rate could be further improved, and how to analyze the margin-maximization ability of other classical optimization methods such as NAG. In this paper, we provide positive answers to both questions. Finally, we note that Ramdas & Peña (2016) prove that a variant of the Perceptron algorithm, named the normalized Perceptron, also converges to a maximum-margin classifier at an O(1/√t) rate.

Games and no-regret dynamics. Our framework is motivated by the line of research that links optimization methods for convex optimization to equilibrium computation with no-regret dynamics. The seminal work of Rakhlin & Sridharan (2013) recovers the classical mirror-prox method (Nemirovski, 2004) with optimistic online mirror descent. Abernethy & Wang (2017) show that the well-known Frank-Wolfe algorithm can be seen as applying (optimistic) online algorithms to compute the equilibrium of a special zero-sum game called the Fenchel game, which is constructed via the Fenchel duality of the objective function. Later, researchers demonstrated that other classical optimization methods for smooth optimization, such as Heavy-ball and NAG, can be described similarly (Abernethy et al., 2018; Wang & Abernethy, 2018; Wang et al., 2021). We highlight the differences between this work and the previous ones: 1) our analysis does not involve the Fenchel game or Fenchel duality; instead, we directly work on the (regularized) min-max game designed for linear classification; 2) most of the previous work focuses on understanding optimization algorithms for smooth optimization, and it was unclear how to understand algorithms such as NEG under the game framework; 3) although both Abernethy et al. (2018) and our work analyze NAG, the goals are significantly different: Abernethy et al. (2018) focus on the optimization problem itself, while we consider how minimizing the empirical risk affects the margin; 4) to our knowledge, the link between implicit-bias problems and no-regret dynamics is also new.

3 PRELIMINARIES

Notation. We use lower-case boldface letters x, y to denote vectors, lower-case letters a, b to denote scalars, and upper-case letters A, B to denote matrices. For a vector x ∈ ℝ^d, we use x_i to denote the i-th component of x.
For a matrix A ∈ ℝ^{n×d}, let A_{(i,:)} be its i-th row, A_{(:,j)} its j-th column, and A_{(i,j)} the i-th element of the j-th column. We use ‖·‖ to denote a general norm, ‖·‖_* its dual norm, and ‖·‖_p the ℓ_p-norm. For a positive integer n, we denote the set {1, . . . , n} by [n], and the n-dimensional simplex by Δ_n. Let E: Δ_n → ℝ be the negative entropy function, defined as E(p) = Σ_{i=1}^n p_i log p_i for all p ∈ Δ_n. For a strongly convex function R: W → ℝ, define the Bregman divergence between any two points w, w′ ∈ W as D_R(w, w′) = R(w) − R(w′) − (w − w′)^⊤ ∇R(w′).

Online convex optimization (OCO). Here we review a general weighted OCO framework, proposed by Abernethy et al. (2018). In each round t = 1, . . . , T of this paradigm, a learner first chooses a decision z_t from a convex set Z ⊆ ℝ^d, then observes a loss function f_t(·): Z → ℝ as well as a weight α_t > 0, and finally updates the decision. The performance of the learner is measured by the weighted regret, defined as

R_T = Σ_{t=1}^T α_t f_t(z_t) − min_{z∈Z} Σ_{t=1}^T α_t f_t(z).

Perhaps the most natural method for OCO is follow-the-leader (FTL), which simply picks the empirically best decision at each round: z_t = argmin_{z∈Z} Σ_{i=1}^{t−1} α_i f_i(z). However, FTL is unstable, and one can easily find counter-examples where FTL suffers linear regret (Shalev-Shwartz, 2011). A classical way to address this limitation is to add a regularizer to the objective:

z_t = argmin_{z∈Z} η Σ_{i=1}^{t−1} α_i f_i(z) + D_R(z, z_0),

where η > 0 is the step size and z_0 is the initial decision. This method is called follow-the-regularized-leader (FTRL, Hazan, 2016), and it achieves a sub-linear regret bound with an appropriately chosen η. Moreover, tighter bounds are possible in favorable cases with more advanced techniques. In this paper, we consider FTRL equipped with optimistic strategies (i.e., Optimistic FTRL; Rakhlin & Sridharan, 2013; Orabona, 2019), given by

OFTRL[R, z_0, ψ_t, η, Z]: z_t = argmin_{z∈Z} η(Σ_{i=1}^{t−1} α_i f_i(z) + α_t ψ_t(z)) + D_R(z, z_0),

where an additional function ψ_t, an approximation of the next loss f_t, is added. A tighter regret bound can be achieved when ψ_t is close enough to f_t. Next, we introduce two special cases of OFTRL:

OFTL[ψ_t, Z]: z_t = argmin_{z∈Z} Σ_{j=1}^{t−1} α_j f_j(z) + α_t ψ_t(z),
FTRL+[R, z_0, η, Z]: z_t = argmin_{z∈Z} η Σ_{j=1}^{t} α_j f_j(z) + D_R(z, z_0).

The first is Optimistic FTL, where the regularizer is set to zero. The second is FTRL+, which uses f_t itself as the optimistic function ψ_t. Finally, we note that, apart from FTL-type algorithms, there also exist optimistic methods developed from mirror descent, such as optimistic online mirror descent (OMD, Rakhlin & Sridharan, 2013). Due to space limitations, the details of OMD and the regret bounds of these OCO algorithms are postponed to Appendix A.

No-regret dynamics for zero-sum games. Finally, we introduce the framework for using no-regret online algorithms to solve a zero-sum game. Consider the following general two-player game:

max_{w∈W} min_{p∈Q} g(w, p),   (1)

where g(w, p) is a concave-convex function, and W and Q are convex sets. The no-regret framework for solving (1) is presented in Protocol 1. In each round t of this procedure, the w-player first picks a decision w_t ∈ W. Then, the p-player observes its loss ℓ_t(·) = g(w_t, ·) as well as the weight α_t. After that, the p-player picks its decision p_t and passes it to the w-player. As a consequence, the w-player observes its loss h_t(·) = −g(·, p_t) and the weight α_t. Note that both ℓ_t and h_t are convex. Denote the weighted regret bounds of the two players by R_w and R_p, respectively.
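Protocol 1, stated below, formalizes this loop. To fix ideas, here is a minimal sketch of the dynamic (our code, not the paper's implementation; the learner interface play/update/feed is an assumption introduced for illustration):

```python
import numpy as np

def no_regret_dynamics(ol_w, ol_p, A, alphas):
    """Weighted no-regret dynamics (Protocol 1 style) driven by the data matrix A.

    ol_w.play(t) returns w_t; ol_p.update(t, alpha_t, A @ w_t) lets the p-player
    observe its weighted loss l_t(p) = g(w_t, p) and respond with p_t; the w-player
    then observes h_t(w) = -g(w, p_t) through ol_w.feed(t, alpha_t, A.T @ p_t).
    """
    weighted_sum, total = 0.0, 0.0
    for t, alpha in enumerate(alphas, start=1):
        w_t = ol_w.play(t)                    # w-player moves first
        p_t = ol_p.update(t, alpha, A @ w_t)  # p-player responds via its OCO rule
        ol_w.feed(t, alpha, A.T @ p_t)        # w-player's loss is revealed
        weighted_sum = weighted_sum + alpha * w_t
        total += alpha
    return weighted_sum / total               # the weighted average of Theorem 2
```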
Then, we have the following classical conclusion, whose proof is shown in Appendix B.

Theorem 2. Define m(w) = min_{p∈Q} g(w, p), and let w̄_T = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t be the weighted average of {w_t}_{t=1}^T. Then, for any w ∈ W, m(w) − m(w̄_T) ≤ (Σ_{t=1}^T α_t)^{−1}(R_p + R_w).

Protocol 1 No-regret dynamics with weighted OCO
1: Input: {α_t}_{t=1}^T, w_0, p_0.
2: Input: OL^w, OL^p. // the online algorithms for choosing w and p
3: for t = 1, . . . , T do
4:   w_t ← OL^w;
5:   OL^p ← α_t, ℓ_t(·); // define ℓ_t(·) = g(w_t, ·)
6:   p_t ← OL^p;
7:   OL^w ← α_t, h_t(·); // define h_t(·) = −g(·, p_t)
8: end for
9: Output: w̄_T = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t

4 UNDERSTANDING EXISTING ACCELERATED PERCEPTRONS

In this section, we first introduce our main topic, i.e., the binary linear classification problem, and then present the three accelerated Perceptrons and their equivalent forms under Protocol 1. For clarity, we use v_t to denote the classifier updates in the original algorithms, and w_t the corresponding updates in the equivalent forms under Protocol 1.

4.1 BINARY LINEAR CLASSIFICATION AND PERCEPTRON

We focus on the binary linear classification problem, which dates back to the pioneering work of Rosenblatt (1958). Let S = {(x^(i), y^(i))}_{i=1}^n be a linearly separable set of n training examples, where x^(i) ∈ ℝ^d is the feature vector of the i-th example, and y^(i) ∈ {−1, +1} is the corresponding label. The goal is to efficiently find a linear classifier w ∈ ℝ^d that correctly separates all data points. More formally, let A = [y^(1)x^(1), . . . , y^(n)x^(n)]^⊤ ∈ ℝ^{n×d} be the matrix that contains all of the data. Then we would like to find a w ∈ ℝ^d such that

min_{i∈[n]} A_{(i,:)}w > 0.   (2)

This goal can be reformulated as a min-max optimization problem:

max_{w∈ℝ^d} min_{i∈[n]} A_{(i,:)}w = max_{w∈ℝ^d} min_{p∈Δ_n} p^⊤Aw,   (3)

where, on the RHS of (3), min_{i∈[n]} A_{(i,:)}w is rewritten as min_{p∈Δ_n} p^⊤Aw. The two expressions are equivalent since the optimal distribution p ∈ Δ_n always puts all weight on one training example (i.e., one row) of A. For any w ∈ ℝ^d, let γ(w) = min_{p∈Δ_n} p^⊤Aw be the margin of w, and introduce the following standard assumption.

Assumption 1. We assume that the feature vectors are bounded, i.e., ‖x^(i)‖₂ ≤ 1 for all i ∈ [n], and that there exists a w* ∈ ℝ^d such that ‖w*‖₂ = 1 and γ(w*) = γ > 0.

4.2 SMOOTH PERCEPTRON

In order to solve (3), the vanilla Perceptron repeatedly moves the classifier toward examples on which it performs badly. However, this greedy policy only yields a sub-optimal convergence rate. To address this problem, Soheili & Peña (2012) propose the smooth Perceptron, whose pseudocode is summarized in the first box of Algorithm 1. The key idea is to find a classifier v ∈ ℝ^d that maximizes the following ℓ₂-regularized function:

ψ(v) = −(1/2)‖v‖₂² + min_{q∈Δ_n} q^⊤Av,   (4)

which is non-smooth and strongly concave with respect to v. Under Assumption 1, Soheili & Peña (2012) show that the maximum value of ψ is max_{v∈ℝ^d} ψ(v) = γ²/2, and that a classifier v ∈ ℝ^d satisfies (2) when ψ(v) > 0. In order to solve (4), Soheili & Peña (2012) apply the classical Nesterov's excessive gap technique (Nesterov, 2005) for strongly concave functions. This algorithm introduces a smoothed approximation of (4), parameterized by a constant µ > 0 and defined as

ψ_µ(v) = −(1/2)‖v‖₂² + min_{q∈Δ_n} {q^⊤Av + µ D_E(q, 1_n/n)},   (5)

which satisfies ψ(v) ≤ ψ_µ(v) ≤ ψ(v) + µ log n for all v ∈ ℝ^d and µ > 0. The algorithm then performs a sophisticated update rule such that an excessive gap condition, relating γ²/2, ‖Av_t‖₂²/2, and ψ_{µ_t}(v_t), holds for all t ≥ 1; we refer to Soheili & Peña (2012) for the precise statement.
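Since D_E(·, 1_n/n) is the Bregman divergence of the negative entropy, the smoothed best response q_µ(v) appearing in Algorithm 1 below has a softmax closed form; a small sketch in our notation:

```python
import numpy as np

def q_mu(A, v, mu):
    """Entropy-smoothed best response: argmin over the simplex of
    q^T A v + mu * D_E(q, uniform). The minimizer is softmax(-A v / mu)."""
    z = -A @ v / mu
    z -= z.max()          # shift for numerical stability; argmin is unchanged
    q = np.exp(z)
    return q / q.sum()
```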
Algorithm 1 Smooth Perceptron (Soheili & Peña, 2012)
Initialization: θ_0 = 2/3, µ_0 = 4, v_0 = A^⊤1_n/n,
q_0 = q_{µ_0}(v_0), where q_µ(v) = argmin_{q∈Δ_n} q^⊤Av + µ D_E(q, 1_n/n).
for t = 1, . . . , T − 1 do
  v_t = (1 − θ_{t−1})(v_{t−1} + θ_{t−1}A^⊤q_{t−1}) + θ²_{t−1}A^⊤q_{µ_{t−1}}(v_{t−1})
  µ_t = (1 − θ_{t−1})µ_{t−1}
  q_t = (1 − θ_{t−1})q_{t−1} + θ_{t−1}q_{µ_t}(v_t)
  θ_t = 2/(t+3)
end for
Output: v_{T−1}

Equivalent form under Protocol 1:
OL^w = OFTL[ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL^p = FTRL+[E, 1_n/n, 1/4, Δ_n]: p_t = argmin_{p∈Δ_n} (1/4) Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T

Soheili & Peña (2012) show that the smooth Perceptron can output a positive-margin classifier after O(√(log n)/γ) iterations. However, the analysis is quite involved, relying heavily on the intricate relationship between ψ(v_t), ψ_{µ_t}(v_t), and ‖Av_t‖₂. In the following, we provide a no-regret explanation of this algorithm and then show that the convergence rate can be easily obtained under our framework. Specifically, we define the objective function in Protocol 1 as

g(w, p) = p^⊤Aw − (1/2)‖w‖₂²,   (6)

and provide the equivalent expression in the second box of Algorithm 1. More specifically, we have the following proposition; the proof is given in Appendix C.1.

Proposition 1. Let α_t = t. Then the two interpretations of the smooth Perceptron in Algorithm 1 are the same, in the sense that v_{T−1} = w̄_T and q_{T−1} = (Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t p_t.

Proposition 1 shows that, under Protocol 1, the smooth Perceptron can be seen as optimizing (6) by implementing two online learning algorithms: in each round t, the w-player applies OFTL with ψ_t = h_{t−1}, a decision which the p-player subsequently observes and responds to with FTRL+. Moreover, we obtain theoretical guarantees for the smooth Perceptron based on Theorem 2, matching the convergence rate provided by Soheili & Peña (2012).

Theorem 3. Let α_t = t. Under Protocol 1, the regret of the two players in Algorithm 1 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).

4.3 ACCELERATED PERCEPTRON VIA ERM

Since binary linear classification is a (and perhaps the most) fundamental supervised machine learning problem, it can naturally be solved via empirical risk minimization (ERM). This idea is adopted by Ji et al. (2021), who consider the following ERM problem:

min_{v∈ℝ^d} R(v) = (1/n) Σ_{i=1}^n ℓ(−y^(i)v^⊤x^(i)),   (7)

where ℓ(z) = exp(z) is the exponential loss. To minimize (7), Ji et al. (2021) propose a normalized momentum-based gradient descent (NMGD) algorithm, and show that NMGD converges to a classifier with a positive margin after O(1/γ²) iterations. Based on properties of the normalized gradient, they also provide a primal-dual form of the algorithm (presented in the first box of Algorithm 2), which can be considered as applying Nesterov's accelerated gradient descent in the dual space (Ji et al., 2021).

Algorithm 2 Accelerated Perceptron of Ji et al. (2021)
Input: q_0 = 1_n/n, v_0 = 0, g_0 = 0.
for t = 1, . . . , T do
  Set θ_{t−1} = t/(2(t+1)), β_t = t/(t+1)
  v_t = v_{t−1} − θ_{t−1}(g_{t−1} − A^⊤q_{t−1})
  q_{t,i} = exp(−y^(i)v_t^⊤x^(i)) / Σ_{j=1}^n exp(−y^(j)v_t^⊤x^(j)), for i = 1, . . . , n
  g_t = β_t(g_{t−1} − A^⊤q_t)
end for
Output: v_T

Equivalent form under Protocol 1:
OL^w = OFTL[ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL^p = FTRL+[E, 1_n/n, 1/4, Δ_n]: p_t = argmin_{p∈Δ_n} (1/4) Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: (1/4) Σ_{t=1}^T α_t w_t (the weighted sum)

For the accelerated Perceptron of Ji et al. (2021), we again set g(w, p) = p^⊤Aw − (1/2)‖w‖₂² and provide its equivalent form in the second box of Algorithm 2.
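To make the shared dynamic concrete, the following is a small self-contained sketch (our code, following the second boxes of Algorithms 1 and 2, with the step size η = 1/4 taken from the boxes above): the w-player's OFTL step for g(w, p) = p^⊤Aw − ½‖w‖₂² has a closed form, a weighted combination of past p's in which the hint ψ_t = h_{t−1} counts p_{t−1} twice, and the p-player's entropy FTRL+ step is a softmax of cumulative weighted losses. Returning the weighted average gives the smooth-Perceptron output; returning the weighted sum instead corresponds (up to scaling) to the output of Ji et al. (2021).

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def unified_dynamic(A, T, eta=0.25, average=True):
    """Sketch of Protocol 1 with OFTL (w-player) vs. entropy FTRL+ (p-player)
    on g(w, p) = p^T A w - 0.5 * ||w||_2^2, with weights alpha_t = t."""
    n, d = A.shape
    p_prev = np.ones(n) / n            # p_0: uniform distribution over examples
    sum_ap = np.zeros(n)               # running sum of alpha_j * p_j
    sum_a = 0.0
    cum = np.zeros(n)                  # running sum of alpha_s * A w_s (p-player losses)
    out = np.zeros(d)
    for t in range(1, T + 1):
        a = float(t)
        # OFTL with hint psi_t = h_{t-1}: the quadratic minimization in closed form
        w = A.T @ (sum_ap + a * p_prev) / (sum_a + a)
        # FTRL+ with entropy regularizer: softmax of cumulative weighted losses
        cum += a * (A @ w)
        p_prev = softmax(-eta * cum)
        sum_ap += a * p_prev
        sum_a += a
        out += a * w
    return out / sum_a if average else out   # weighted average vs. weighted sum
```

On separable data, this is exactly the dynamic for which Theorem 3 predicts a positive margin once T is of order √(log n)/γ.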
More specifically, we have the following proposition.

Proposition 2. Let α_t = t. Then the two interpretations of the accelerated Perceptron in Algorithm 2 are the same, in the sense that q_T = p_T and v_T = (1/4) Σ_{t=1}^T α_t w_t.

Remark. Note that in Ji et al. (2021), the parameter θ_{t−1} is set to 1/2. We observe that this causes a mismatch between the weights and the w_t's in the output (when θ_{t−1} = 1/2, v_T becomes (1/4) Σ_{t=1}^T α_{t+1}w_t instead of (1/4) Σ_{t=1}^T α_t w_t). Our no-regret analysis suggests that θ_{t−1} should be configured as t/(2(t+1)), and later we show that this choice helps eliminate the log t factor in the convergence rate.

The proposition above reveals that the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) are closely related. The main difference is that the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum of the w_t's, instead of the weighted average. The rationale behind this is that, instead of the margin, Ji et al. (2021) use the normalized margin to measure the performance of the algorithm, defined as γ̄(v) = min_{p∈Δ_n} p^⊤Av/‖v‖₂. They not only prove γ̄(v_T) > 0 (which directly implies γ(v_T) > 0), but also show γ̄(v_T) = Ω(γ − log T/(γT²)). This guarantee is more powerful than the previous results, as it implies that the normalized margin can be maximized at an O(log t/t²) rate. As t approaches ∞, the direction of v_T converges to that of the maximal margin classifier. We show that a better margin-maximization rate can be directly obtained under our framework with the parameter setting in Algorithm 2.

Theorem 4. Let α_t = t. Under Protocol 1, the regret of the two players in Algorithm 2 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ), and γ̄(v_T) = Ω(γ − 8 log n/(γT(T+1))).

4.4 ACCELERATED PERCEPTRON OF YU ET AL. (2014)

Finally, Yu et al. (2014) consider applying the mirror-prox algorithm to solve the max-min optimization problem:

max_{‖w‖₂≤1} min_{p∈Δ_n} p^⊤Aw.   (8)

As discovered by Rakhlin & Sridharan (2013), mirror-prox can be recovered from their optimistic online mirror descent (OMD) with an appropriately chosen optimistic term. Here, we show that, by simply manipulating the notation, the algorithm of Yu et al. (2014) can be recovered as two players applying OMD to solve (8). Due to space limitations, the algorithmic details and their analysis are postponed to Appendix C.5; we summarize them as follows (formalized in Theorems 8 and 9):

Theorem 5 (informal). Let g(w, p) = p^⊤Aw. The Perceptron of Yu et al. (2014) can be described under Protocol 1, where both players apply OMD endowed with the appropriate parameters. Moreover, under Assumption 1, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).

5 BEYOND PERCEPTRONS

In this section, we show that our framework benefits a wide range of other problems.

5.1 MARGIN MAXIMIZATION BY NESTEROV'S ACCELERATED GRADIENT DESCENT

One major task of the implicit-bias literature is to explain why commonly-used first-order optimization methods generalize well. For linearly separable data, previous work proves that GD and momentum-based GD prefer the ℓ₂-maximal margin classifier by showing that they can maximize the margin. In this part, we show that the well-known Nesterov's accelerated gradient descent (NAG), with appropriately chosen parameters, can maximize the margin at an O(1/t²) rate. Specifically, following previous work, we consider maximizing the margin by solving the ERM problem in (7), and apply NAG to minimize the objective function. The details of the algorithm are summarized in the first box of Algorithm 3, and its equivalent form under Protocol 1 is presented in the second box.

Algorithm 3 NAG
Input: v_0 = 0, s_0 = 0.
for t = 1, . . . , T do
  u_t = s_{t−1} + (1/(2(t−1)))v_{t−1}   (at t = 1 this correction term vanishes, since v_0 = 0)
  v_t = v_{t−1} − η_t ∇R(u_t)
  s_t = s_{t−1} + (1/(2(t+1)))v_t
end for
Output: s_T

Equivalent form under Protocol 1:
OL^p = OFTRL[E, 1_n/n, 1/4, ψ_t = ℓ_{t−1}(·), Δ_n]: p_t = argmin_{p∈Δ_n} (1/4)(Σ_{s=1}^{t−1} α_s ℓ_s(p) + α_t ℓ_{t−1}(p)) + D_E(p, 1_n/n)
OL^w = FTRL+[(1/2)‖·‖₂², 0, 1, ℝ^d]: w_t = argmin_{w∈ℝ^d} Σ_{s=1}^{t} α_s h_s(w) + (1/2)‖w‖₂²
Output: 2w̄_T = 2(Σ_{t=1}^T α_t)^{−1} Σ_{t=1}^T α_t w_t

For the NAG algorithm, we set the objective function as g(w, p) = p^⊤Aw − (1/2)‖w‖₂².
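Before stating the guarantee, here is a runnable sketch of the first box of Algorithm 3 (our code; the coefficients and the step size η_t = t/R(u_t) follow our reconstruction of the pseudocode and the theorem below, and the normalization requires numerical care for large T, since R(u_t) decays quickly):

```python
import numpy as np

def nag_margin(A, T):
    """Sketch of Algorithm 3: NAG on R(v) = (1/n) sum_i exp(-(A v)_i),
    where the rows of A are y_i * x_i and eta_t = t / R(u_t)."""
    n, d = A.shape
    v, s = np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        u = s if t == 1 else s + v / (2 * (t - 1))
        losses = np.exp(-A @ u)
        R = losses.mean()              # empirical risk (shrinks toward 0)
        grad = -(A.T @ losses) / n     # its gradient
        v = v - (t / R) * grad         # normalized accelerated step
        s = s + v / (2 * (t + 1))
    return s   # the normalized margin of s approaches gamma at an O(1/T^2) rate
```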
We then have the following conclusion.

Theorem 6. Let η_t = t/R(u_t) and α_t = t. Then the two expressions in Algorithm 3 are equivalent, in the sense that s_T = 2w̄_T. Moreover,

min_{p∈Δ_n} p^⊤A s_T / ‖s_T‖₂ ≥ γ − (8 log n + 2)/(T(T+1)γ).

Remark. Theorem 6 indicates that NAG can also maximize the margin at an O(1/T²) rate. Note that, in Algorithm 3, the p-player plays first and the w-player second, while in Algorithm 2 it is the other way around. This difference makes sense, as Algorithm 2 can be considered as applying Nesterov's acceleration in the dual space (Ji et al., 2021), while Algorithm 3 uses Nesterov's acceleration in the primal space.

5.2 ACCELERATED p-NORM PERCEPTRON

In this section, we focus on the p-norm Perceptron problem, introduced by Gentile (2000). Compared to the classical Perceptron problem, the p-norm Perceptron introduces the following assumption, which is more general than Assumption 1.

Assumption 2. For the p-norm Perceptron problem, we assume ‖x^(i)‖_p ≤ 1 for all i ∈ [n], where p ∈ [2, ∞). Moreover, assume there exists a w* ∈ ℝ^d such that ‖w*‖_q ≤ 1 and min_{p∈Δ_n} p^⊤Aw* ≥ γ. Here, q is the dual exponent of p, i.e., 1/p + 1/q = 1 and q ∈ (1, 2].

Under Assumption 2, Gentile (2000) proposes a mirror-descent-style algorithm that achieves an O((p−1)/γ²) convergence rate. In the following, we provide a new algorithm with a better rate. Specifically, under Protocol 1, we define the objective function as g(w, p) = p^⊤Aw, and introduce Algorithm 4, wherein the w-player uses OFTRL with the regularizer (1/(2(q−1)))‖·‖_q², which is 1-strongly convex w.r.t. the ℓ_q-norm (Orabona, 2019). On the other hand, the p-player employs the FTRL+ algorithm.

Algorithm 4 Accelerated algorithm for the p-norm Perceptron
OL^w = OFTRL[(1/(2(q−1)))‖·‖_q², 0, η_w, ψ_t = h_{t−1}(·), ℝ^d]: w_t = argmin_{w∈ℝ^d} η_w(Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)) + D_{(1/(2(q−1)))‖·‖_q²}(w, 0)
OL^p = FTRL+[E, 1_n/n, η_p, Δ_n]: p_t = argmin_{p∈Δ_n} η_p Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T = (1/T) Σ_{t=1}^T w_t

We have the following result.

Theorem 7. Let α_t = 1, η_p = 1/η_w, and η_w = 1/√(2(q−1) log n). Then the output w̄_T of Algorithm 4 has a non-negative margin when T = Ω(√(2(p−1) log n)/γ).

6 CONCLUSION

In this paper, we provide a unified analysis of the existing accelerated Perceptrons, and obtain improved results for a series of problems. In the future, we will explore how to extend our framework to other closely related areas, such as semi-definite programming (Garber & Hazan, 2011) and generalized margin maximization (Sun et al., 2022). Another interesting direction is to investigate whether other, more advanced online learning algorithms (Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Jun et al., 2017; Zhang et al., 2018; Wang et al., 2020; Zhao et al., 2020) are useful for Perceptrons and for implicit-bias analysis under our framework.

ACKNOWLEDGMENTS

We gratefully thank the AI4OPT Institute for funding, as part of NSF Award 2112533. We also acknowledge the NSF for their support through Award IIS-1910077, as well as an ARC-ACO fellowship provided by Georgia Tech.
REFERENCES

Jacob Abernethy, Kevin A. Lai, Kfir Y. Levy, and Jun-Kun Wang. Faster rates for convex-concave games. In Proceedings of the 31st Annual Conference on Learning Theory, pp. 1595–1625, 2018.

Jacob D. Abernethy and Jun-Kun Wang. On Frank-Wolfe and equilibrium computation. In Advances in Neural Information Processing Systems 30, 2017.

Mark A. Aizerman. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory, pp. 1493–1529, 2018.

Daniel Dadush, László A. Végh, and Giacomo Zambelli. Rescaling algorithms for linear conic feasibility. Mathematics of Operations Research, 45(2):732–754, 2020.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, 2019.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

John Dunagan and Santosh Vempala. A polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 2004.

Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.

Dan Garber and Elad Hazan. Approximating semidefinite programs in sublinear time. In Advances in Neural Information Processing Systems 24, 2011.

Claudio Gentile. A new approximate maximal margin classification algorithm. In Advances in Neural Information Processing Systems 13, 2000.

Claudio Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pp. 1–11, 1999.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Ziwei Ji and Matus Telgarsky. Characterizing the implicit bias via a primal-dual analysis. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pp. 772–804, 2021.

Ziwei Ji, Nathan Srebro, and Matus Telgarsky. Fast margin maximization via dual acceleration. In Proceedings of the 38th International Conference on Machine Learning, pp. 4860–4869, 2021.

Kwang-Sung Jun, Francesco Orabona, Stephen Wright, and Rebecca Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 943–951, 2017.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420–3428, 2019.
Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

Yurii Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24(3):509–517, 1988.

A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pp. 615–622, 1962.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29, 2016.

Javier Peña and Negar Soheili. A deterministic rescaled perceptron algorithm. Mathematical Programming, 155(1):497–510, 2016.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26, pp. 3066–3074, 2013.

Aaditya Ramdas and Javier Peña. Towards a deeper geometric, analytic and algorithmic understanding of margins. Optimization Methods and Software, 31(2):377–391, 2016.

Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Negar Soheili and Javier Peña. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728–737, 2012.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, and Navid Azizan. Mirror descent maximizes generalized margin and can be implemented efficiently. arXiv preprint arXiv:2205.12808, 2022.

Guanghui Wang, Shiyin Lu, Quan Cheng, Wei-wei Tu, and Lijun Zhang. SAdam: A variant of Adam for strongly convex functions. In International Conference on Learning Representations, 2020.

Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems 31, pp. 3824–3834, 2018.

Jun-Kun Wang, Jacob Abernethy, and Kfir Y. Levy. No-regret dynamics in the Fenchel game: A unified framework for algorithmic convex optimization. arXiv preprint arXiv:2111.11309, 2021.

A. Wendemuth. Learning the unlearnable. Journal of Physics A: Mathematical and General, 28(18):5423, 1995.

Adams Wei Yu, Fatma Kilinc-Karzan, and Jaime Carbonell. Saddle points and accelerated perceptron algorithms. In International Conference on Machine Learning, pp. 1827–1835, 2014.

Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pp. 1323–1333, 2018.

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33, pp. 12510–12520, 2020.