# Margins, Kernels and Non-linear Smoothed Perceptrons

**Aaditya Ramdas** (ARAMDAS@CS.CMU.EDU), Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA
**Javier Peña** (JFP@ANDREW.CMU.EDU), Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

*Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).*

**Abstract.** We focus on the problem of finding a non-linear classification function that lies in a Reproducing Kernel Hilbert Space (RKHS), both from the primal point of view (finding a perfect separator when one exists) and the dual point of view (giving a certificate of non-existence), with special focus on generalizations of two classical schemes - the Perceptron (primal) and Von Neumann (dual) algorithms. We cast our problem as one of maximizing the regularized normalized hard-margin ($\rho$) in an RKHS and rephrase it in terms of a Mahalanobis dot-product/semi-norm associated with the kernel's (normalized and signed) Gram matrix. We derive an accelerated smoothed algorithm with a convergence rate of $\tfrac{\sqrt{\log n}}{\rho}$ given $n$ separable points, which is strikingly similar to the classical kernelized Perceptron algorithm whose rate is $\tfrac{1}{\rho^2}$. When no such classifier exists, we prove a version of Gordan's separation theorem for RKHSs, and give a reinterpretation of negative margins. This allows us to give guarantees for a primal-dual algorithm that halts in $\min\{\tfrac{\sqrt{n}}{\rho}, \tfrac{\sqrt{n}}{\epsilon}\}$ iterations with a perfect separator in the RKHS if the primal is feasible, or a dual $\epsilon$-certificate of near-infeasibility.

## 1. Introduction

We are interested in the problem of finding a non-linear separator for a given set of $n$ points $x_1, ..., x_n \in \mathbb{R}^d$ with labels $y_1, ..., y_n \in \{\pm 1\}$. Finding a linear separator can be stated as the problem of finding a unit vector $w \in \mathbb{R}^d$ (if one exists) such that for all $i$
$$y_i(w^\top x_i) > 0 \quad \text{i.e.} \quad \mathrm{sign}(w^\top x_i) = y_i. \tag{1}$$
This is called the primal problem. In the more interesting non-linear setting, we will be searching for functions $f$ in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}_K$ associated with a kernel $K$ (to be defined later) such that for all $i$
$$y_i f(x_i) > 0. \tag{2}$$
We say that problems (1), (2) have an unnormalized margin $\rho > 0$ if there exists a unit vector $w$ (or unit-norm function $f$) such that, for all $i$, $y_i(w^\top x_i) \geq \rho$ or $y_i f(x_i) \geq \rho$.

True to the paper's title, margins of non-linear separators in an RKHS will be a central concept, and we will derive interesting smoothed accelerated variants of the Perceptron algorithm whose convergence rates (for the aforementioned primal and for a dual problem introduced later) are inversely proportional to the RKHS margin, as opposed to the inverse squared margin for the Perceptron.

The linear setting is well known by the name of linear feasibility problems - we are asking whether there exists any vector $w$ which makes an acute angle with all the vectors $y_i x_i$, i.e.
$$(XY)^\top w > 0_n, \tag{3}$$
where $Y := \mathrm{diag}(y)$ and $X := [x_1, ..., x_n]$. This can be seen as finding a vector $w$ inside the dual cone of $\mathrm{cone}\{y_i x_i\}$.

When normalized, as we will see in the next section, the margin is a well-studied notion of conditioning for these problems. It can be thought of as the width of the feasibility cone as in (Freund & Vera, 1999), a radius of well-posedness as in (Cheung & Cucker, 2001), and its inverse can be seen as a special case of a condition number defined by (Renegar, 1995) for these systems.
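To make the linear feasibility condition (3) concrete, here is a minimal NumPy sketch (the function name and the toy data are illustrative assumptions, not from the paper) that checks whether a candidate $w$ satisfies $(XY)^\top w > 0_n$ and reports the smallest value of $y_i\langle w, x_i\rangle$:

```python
import numpy as np

def is_separator(X, y, w):
    """Check the linear feasibility condition (3): y_i <w, x_i> > 0 for all i.

    X : (d, n) array whose columns are the points x_1, ..., x_n
    y : (n,) array of labels in {-1, +1}
    w : (d,) candidate separator
    """
    scores = y * (X.T @ w)               # entries y_i <w, x_i>
    return bool(np.all(scores > 0)), float(scores.min())

# Toy data (hypothetical): two points separated by w = (1, 0).
X = np.array([[1.0, -2.0],
              [0.5,  1.0]])             # columns are x_1 and x_2
y = np.array([+1, -1])
print(is_separator(X, y, np.array([1.0, 0.0])))   # (True, 1.0)
```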
### 1.1. Related Work

In this paper we focus on the famous Perceptron algorithm (Rosenblatt, 1958) and the less-famous Von Neumann algorithm (Dantzig, 1992) that we introduce in later sections. Our work builds on (Soheili & Peña, 2012; 2013a) from the field of optimization - we generalize the setting to learning functions in RKHSs, extend the algorithms, simplify proofs, and simultaneously bring new perspectives to it.

There is extensive literature around the Perceptron algorithm in the learning community; we restrict ourselves to discussing only a few directly related papers, in order to point out the several differences from existing work. We provide a general unified proof in the Appendix which borrows ideas from accelerated smoothing methods developed by Nesterov (Nesterov, 2005) - while this algorithm and others by (Nemirovski, 2004) and (Saha et al., 2011) can achieve similar rates for the same problem, those algorithms do not possess the simplicity of the Perceptron or Von Neumann algorithms and our variants, and they also do not consider the infeasible setting or primal-dual algorithms.

Accelerated smoothing techniques have also appeared in the learning literature, as in (Tseng, 2008) and many others. However, most of these deal with convex-concave problems where both sets involved are the probability simplex (as in game theory, boosting, etc.), while we deal with hard margins where one of the sets is a unit ℓ2 ball. Hence, their algorithms and results do not extend to ours trivially. This work is also connected to the idea of ϵ-coresets (Clarkson, 2010), though we will not explore that angle.

A related algorithm is the Winnow (Littlestone, 1991) - this works on the ℓ1 margin and is a saddle point problem over two simplices. One can ask whether such accelerated smoothed versions exist for the Winnow. The answer is in the affirmative - however, such algorithms look completely different from the Winnow, while in our setting the new algorithms retain the simplicity of the Perceptron.

### 1.2. Paper Outline

Sec. 2 will introduce the Perceptron and Normalized Perceptron algorithms and their convergence guarantees for linear separability, with specific emphasis on the unnormalized and normalized margins. Sec. 3 will then introduce RKHSs and the Normalized Kernel Perceptron algorithm, which we interpret as a subgradient algorithm for a regularized normalized hard-margin loss function. Sec. 4 describes the Smoothed Normalized Kernel Perceptron algorithm, which works with a smooth approximation to the original loss function, and outlines the argument for its faster convergence rate. Sec. 5 discusses the non-separable case and the Von Neumann algorithm, and we prove a version of Gordan's theorem in RKHSs. We finally give an algorithm in Sec. 6 which terminates with a separator if one exists, and with a dual certificate of near-infeasibility otherwise, in time inversely proportional to the margin. Sec. 7 has a discussion and some open problems.

## 2. Linear Feasibility Problems

### 2.1. Perceptron

The classical Perceptron algorithm can be stated in many ways; one is the following:

**Algorithm 1** Perceptron
- Initialize $w_0 = 0$
- for $k = 0, 1, 2, 3, ...$ do
  - if $\mathrm{sign}(w_k^\top x_i) \neq y_i$ for some $i$ then
    - $w_{k+1} := w_k + y_i x_i$
  - else
    - Halt: Return $w_k$ as solution
  - end if
- end for
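The mistake-driven loop of Algorithm 1 can be sketched in a few lines of NumPy; this is a plain restatement of the pseudocode above (the batch search for a misclassified point and the iteration cap are implementation conveniences assumed here, not part of the algorithm's statement):

```python
import numpy as np

def perceptron(X, y, max_iters=10_000):
    """Algorithm 1 (Perceptron): update w <- w + y_i x_i on any misclassified point.

    X : (d, n) points as columns, y : (n,) labels in {-1, +1}.
    Returns w as soon as y_i <w, x_i> > 0 for all i.
    """
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        scores = y * (X.T @ w)                 # y_i <w, x_i> for every i
        mistakes = np.flatnonzero(scores <= 0)
        if mistakes.size == 0:                 # no mistake left: w is a perfect separator
            return w
        i = mistakes[0]                        # any misclassified point is a valid choice
        w = w + y[i] * X[:, i]
    return w                                   # last iterate if the cap is hit
```

By the classic guarantee quoted next, this loop makes at most $\max_i\|x_i\|_2^2/\rho^2$ updates whenever a margin-$\rho$ separator exists.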
Algorithm 1 comes with the following classic guarantee, as proved by (Block, 1962) and (Novikoff, 1962): if there exists a unit vector $u \in \mathbb{R}^d$ such that $YX^\top u \geq \rho > 0$, then a perfect separator will be found in $\max_i \|x_i\|_2^2/\rho^2$ iterations/mistakes.

The algorithm works when updated with any arbitrary point $(x_i, y_i)$ that is misclassified; it has the same guarantees when $w$ is updated with the point that is misclassified by the largest amount, $\arg\min_i y_i w^\top x_i$. Alternately, one can define a probability distribution over examples
$$p(w) := \arg\min_{p \in \Delta_n} \langle YX^\top w, p\rangle, \tag{4}$$
where $\Delta_n$ is the $n$-dimensional probability simplex. Intuitively, $p(w)$ picks out the examples that have the lowest margin when classified by $w$. One can also normalize the updates so that we maintain a probability distribution over the examples used for updates from the start, as seen below:

**Algorithm 2** Normalized Perceptron
- Initialize $w_0 = 0$, $p_0 = 0$
- for $k = 0, 1, 2, 3, ...$ do
  - if $YX^\top w_k > 0$ then
    - Exit, with $w_k$ as solution
  - else
    - $\theta_k := \frac{1}{k+1}$
    - $w_{k+1} := (1-\theta_k)w_k + \theta_k XY p(w_k)$
  - end if
- end for

**Remark.** Normalized Perceptron has the same guarantees as the Perceptron - the Perceptron can perform its update online on any misclassified point, while the Normalized Perceptron performs updates on the most misclassified point(s), and yet there does not seem to be any change in performance. However, we will soon see that the ability to see all the examples at once gives us much more power.

### 2.2. Normalized Margins

If we normalize the data points by the ℓ2 norm, the resulting mistake bound of the Perceptron algorithm is slightly different. Let $X_2$ represent the matrix with columns $x_i/\|x_i\|_2$. Define the unnormalized and normalized margins as
$$\rho := \sup_{\|w\|_2=1} \inf_{p \in \Delta_n} \langle YX^\top w, p\rangle, \qquad \rho_2 := \sup_{\|w\|_2=1} \inf_{p \in \Delta_n} \langle YX_2^\top w, p\rangle.$$

**Remark.** Note that we have $\sup_{\|w\|_2=1}$ in the definition; this is equivalent to $\sup_{\|w\|_2 \leq 1}$ iff $\rho_2 > 0$.

Normalized Perceptron has the following guarantee on $X_2$: if $\rho_2 > 0$, then it finds a perfect separator in $1/\rho_2^2$ iterations.

**Remark.** Consider the max-margin separator $u$ for $X$ (which is also a valid perfect separator for $X_2$). Then
$$\frac{\rho}{\max_i \|x_i\|_2} = \frac{\min_i y_i x_i^\top u}{\max_i \|x_i\|_2} \;\leq\; \sup_{\|u\|_2=1} \min_i \frac{y_i x_i^\top u}{\|x_i\|_2} = \rho_2.$$
Hence, it is always better to normalize the data, as pointed out in (Graepel et al., 2001). This idea extends to RKHSs, motivating the normalized Gram matrix considered later.

**Example.** Consider a simple example in $\mathbb{R}^2_+$. Assume that the $+$ points are located along the line $6x_2 = 8x_1$, and the $-$ points along $8x_2 = 6x_1$, for $1/r \leq \|x\|_2 \leq r$, where $r > 1$. The max-margin linear separator will be $x_1 = x_2$. If all the data were normalized to have unit Euclidean norm, then all the $+$ points would be at $(0.6, 0.8)$ and all the $-$ points at $(0.8, 0.6)$, giving us a normalized margin of $\rho_2 \approx 0.14$. Unnormalized, the margin is $\rho \approx 0.14/r$ and $\max_i \|x_i\|_2 = r$. Hence, in terms of bounds, we get a discrepancy of $r^4$, which can be arbitrarily large.

**Winnow.** The question arises as to which norm we should normalize by. There is a now-classic algorithm in machine learning called Winnow (Littlestone, 1991), or Multiplicative Weights. It works on a slight transformation of the problem where we only need to search for $u \in \mathbb{R}^d_+$. It comes with a very well-known guarantee: if there exists a $u \in \mathbb{R}^d_+$ such that $YX^\top u \geq \rho > 0$, then feasibility is guaranteed in $\|u\|_1^2 \max_i \|x_i\|_\infty^2 \log n / \rho^2$ iterations. The appropriate notion of normalized margin here is
$$\rho_1 := \max_{w \in \Delta_d} \min_{p \in \Delta_n} \langle YX_\infty^\top w, p\rangle,$$
where $X_\infty$ is the matrix with columns $x_i/\|x_i\|_\infty$. Then, the appropriate iteration bound is $\log n/\rho_1^2$. We will return to this ℓ1-margin in the discussion section. In the next section, we will normalize by using the kernel appropriately.
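As a companion sketch to Algorithm 2, the following NumPy routine runs the Normalized Perceptron on ℓ2-normalized columns, using the point mass on a most-misclassified example as one valid choice of $p(w_k)$ (the `argmin` tie-breaking is an implementation assumption, not prescribed above):

```python
import numpy as np

def normalized_perceptron(X, y, max_iters=10_000):
    """Algorithm 2 (Normalized Perceptron) run on the l2-normalized data X_2 of Sec. 2.2."""
    X2 = X / np.linalg.norm(X, axis=0, keepdims=True)   # columns x_i / ||x_i||_2
    d, n = X.shape
    w = np.zeros(d)
    for k in range(max_iters):
        scores = y * (X2.T @ w)                          # entries of Y X_2^T w
        if np.all(scores > 0):
            return w                                     # perfect separator found
        j = int(np.argmin(scores))                       # a most-misclassified point
        theta = 1.0 / (k + 1)
        w = (1 - theta) * w + theta * y[j] * X2[:, j]    # w_{k+1} = (1-θ_k) w_k + θ_k X_2 Y p(w_k)
    return w
```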
## 3. Kernels and RKHSs

The theory of Reproducing Kernel Hilbert Spaces (RKHSs) has a rich history, and for a detailed introduction we refer the reader to (Schölkopf & Smola, 2002). Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric positive definite kernel, giving rise to a Reproducing Kernel Hilbert Space $\mathcal{F}_K$ with an associated feature map $\phi_x \in \mathcal{F}_K$ at each point $x \in \mathbb{R}^d$, where $\phi_x(\cdot) = K(x, \cdot)$, i.e. $\phi_x(y) = K(x, y)$. $\mathcal{F}_K$ has an associated inner product $\langle\phi_u, \phi_v\rangle_K = K(u, v)$. For any $f \in \mathcal{F}_K$, we have $f(x) = \langle f, \phi_x\rangle_K$.

Define the normalized feature map $\bar\phi_x := \phi_x/\sqrt{K(x,x)} \in \mathcal{F}_K$ and $\bar\phi_X := [\bar\phi_{x_i}]_1^n$. For any function $f \in \mathcal{F}_K$, we use the following notation:
$$Yf(\bar X) := \langle f, Y\bar\phi_X\rangle_K = \big[y_i\langle f, \bar\phi_{x_i}\rangle_K\big]_1^n = \left[\frac{y_i f(x_i)}{\sqrt{K(x_i,x_i)}}\right]_1^n.$$
We analogously define the normalized margin here to be
$$\rho_K := \sup_{\|f\|_K=1} \inf_{p \in \Delta_n} \big\langle Yf(\bar X), p\big\rangle. \tag{5}$$
Consider the following regularized empirical loss function:
$$L(f) = \sup_{p \in \Delta_n} \big\langle -Yf(\bar X), p\big\rangle + \tfrac12\|f\|_K^2. \tag{6}$$
Denoting $t := \|f\|_K > 0$ and writing $f = t\tfrac{f}{\|f\|_K} = t\bar f$, let us calculate the minimum value of this function:
$$\inf_{f \in \mathcal{F}_K} L(f) = \inf_{t>0}\, \inf_{\|\bar f\|_K=1}\, \sup_{p \in \Delta_n} \Big\{-t\big\langle \langle\bar f, Y\bar\phi_X\rangle_K, p\big\rangle + \tfrac{t^2}{2}\Big\} = \inf_{t>0} \Big\{-t\rho_K + \tfrac{t^2}{2}\Big\} = -\tfrac12\rho_K^2 \quad \text{when } t = \rho_K > 0. \tag{7}$$
Since $\max_{p\in\Delta_n}\langle -Yf(\bar X), p\rangle$ is some empirical loss function on the data and $\tfrac12\|f\|_K^2$ is an increasing function of $\|f\|_K$, the Representer Theorem (Schölkopf et al., 2001) implies that the minimizer of the above function lies in the span of the $\bar\phi_{x_i}$s (equivalently, the span of the $y_i\bar\phi_{x_i}$s). Explicitly,
$$\arg\min_{f \in \mathcal{F}_K} L(f) = \sum_{i=1}^n \alpha_i y_i \bar\phi_{x_i} =: \big\langle Y\bar\phi_X, \alpha\big\rangle. \tag{8}$$
Substituting this back into Eq. (6), we can define
$$L(\alpha) := \sup_{p \in \Delta_n} \langle -\alpha, p\rangle_G + \tfrac12\|\alpha\|_G^2, \tag{9}$$
where $G$ is the normalized signed Gram matrix with $G_{ii} = 1$ and
$$G_{ji} = G_{ij} := \frac{y_iy_j K(x_i,x_j)}{\sqrt{K(x_i,x_i)K(x_j,x_j)}} = \langle y_i\bar\phi_{x_i}, y_j\bar\phi_{x_j}\rangle_K,$$
and $\langle p, \alpha\rangle_G := p^\top G\alpha$, $\|\alpha\|_G := \sqrt{\alpha^\top G\alpha}$. One can verify that $G$ is a PSD matrix and the $G$-norm $\|.\|_G$ is a seminorm, whose properties are of great importance to us.

### 3.1. Some Interesting and Useful Lemmas

The first lemma justifies our algorithms' exit condition.

**Lemma 1.** $L(\alpha) < 0$ implies $G\alpha > 0$, and there exists a perfect classifier iff $G\alpha > 0$ for some $\alpha$.

*Proof.* $L(\alpha) < 0 \Rightarrow \sup_{p\in\Delta_n}\langle -G\alpha, p\rangle < 0 \Rightarrow G\alpha > 0$. Conversely, $G\alpha > 0$ implies $f_\alpha := \langle\alpha, Y\bar\phi_X\rangle$ is perfect, since
$$\frac{y_j f_\alpha(x_j)}{\sqrt{K(x_j,x_j)}} = \sum_{i=1}^n \alpha_i \frac{y_iy_j K(x_i,x_j)}{\sqrt{K(x_i,x_i)K(x_j,x_j)}} = (G\alpha)_j > 0.$$
If a perfect classifier exists, then $\rho_K > 0$ by definition and $L(f^*) = L(\alpha^*) = -\tfrac12\rho_K^2 < 0 \Rightarrow G\alpha^* > 0$, where $f^*, \alpha^*$ are the optimizers of $L(f), L(\alpha)$. □

The second lemma bounds the $G$-norm of vectors.

**Lemma 2.** For any $\alpha \in \mathbb{R}^n$, $\|\alpha\|_G \leq \|\alpha\|_1 \leq \sqrt n\,\|\alpha\|_2$.

*Proof.* Using the triangle inequality of norms, we get
$$\|\alpha\|_G = \sqrt{\alpha^\top G\alpha} = \sqrt{\big\langle\langle\alpha, Y\bar\phi_X\rangle, \langle\alpha, Y\bar\phi_X\rangle\big\rangle_K} = \Big\|\sum_i \alpha_i y_i\bar\phi_{x_i}\Big\|_K \leq \sum_i |\alpha_i|\,\|y_i\bar\phi_{x_i}\|_K = \|\alpha\|_1,$$
where we used $\langle\bar\phi_{x_i}, \bar\phi_{x_i}\rangle_K = K(x_i,x_i)/K(x_i,x_i) = 1$. □

The third lemma gives a new perspective on the margin.

**Lemma 3.** When $\rho_K > 0$, $f$ maximizes the margin iff $\rho_K f$ optimizes $L(f)$. Hence, the margin is equivalently
$$\rho_K = \sup_{\|\alpha\|_G=1} \inf_{p\in\Delta_n} \langle\alpha, p\rangle_G \;\leq\; \|p\|_G \quad \text{for all } p \in \Delta_n.$$

*Proof.* Let $f_\rho$ be any function with $\|f_\rho\|_K = 1$ that achieves the max-margin $\rho_K > 0$. Then it is easy to plug $\rho_K f_\rho$ into Eq. (6) and verify that $L(\rho_K f_\rho) = -\tfrac12\rho_K^2$, and hence $\rho_K f_\rho$ minimizes $L(f)$. Similarly, let $f_L$ be any function that minimizes $L(f)$, i.e. achieves the value $L(f_L) = -\tfrac12\rho_K^2$. Defining $t := \|f_L\|_K$ and examining Eq. (7), we see that $L(f_L)$ cannot achieve the value $-\tfrac12\rho_K^2$ unless $t = \rho_K$ and $\sup_{p\in\Delta_n}\langle -Yf_L(\bar X), p\rangle = -\rho_K^2$, which means that $f_L/\rho_K$ must achieve the max-margin. Hence considering only $f = \sum_i \alpha_i y_i\bar\phi_{x_i}$ is acceptable for both. Plugging this into Eq. (5) gives the equality, and
$$\rho_K = \sup_{\|\alpha\|_G=1}\inf_{p\in\Delta_n}\langle\alpha, p\rangle_G \;\leq\; \inf_{p\in\Delta_n}\sup_{\|\alpha\|_G=1}\langle\alpha, p\rangle_G, \qquad \sup_{\|\alpha\|_G=1}\langle\alpha, p\rangle_G \leq \|p\|_G$$
by applying Cauchy-Schwarz (this can also be seen by going back to function space). □
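The objects of Section 3 are easy to materialize numerically. The sketch below builds the normalized signed Gram matrix $G$ from any kernel matrix and evaluates $L(\alpha)$ of Eq. (9), using the fact that a linear function is maximized over the simplex at a vertex; the Gaussian-kernel toy data at the end (bandwidth 1, random points) is purely an illustrative assumption.

```python
import numpy as np

def normalized_signed_gram(K_mat, y):
    """G_ij = y_i y_j K(x_i, x_j) / sqrt(K(x_i, x_i) K(x_j, x_j)), cf. Eq. (9)."""
    s = np.sqrt(np.diag(K_mat))
    return np.outer(y, y) * K_mat / np.outer(s, s)

def loss_L(alpha, G):
    """L(alpha) = max_{p in simplex} <-alpha, p>_G + 0.5 ||alpha||_G^2
                = -min_j (G alpha)_j + 0.5 alpha^T G alpha."""
    Galpha = G @ alpha
    return -float(Galpha.min()) + 0.5 * float(alpha @ Galpha)

# Illustrative usage with a Gaussian kernel.
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))                              # 20 points in R^2, as rows
y = np.where(pts[:, 0] + pts[:, 1] > 0, 1.0, -1.0)          # a linearly separable labelling
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
G = normalized_signed_gram(np.exp(-sq_dists), y)
alpha = np.full(len(y), 1.0 / len(y))
print(loss_L(alpha, G), bool(np.all(G @ alpha > 0)))        # L(alpha) < 0 would certify G alpha > 0 (Lemma 1)
```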
## 4. Smoothed Normalized Kernel Perceptron

Define the distribution over the worst-classified points
$$p(f) := \arg\min_{p\in\Delta_n} \big\langle Yf(\bar X), p\big\rangle \quad\text{or}\quad p(\alpha) := \arg\min_{p\in\Delta_n} \langle\alpha, p\rangle_G. \tag{10}$$

**Algorithm 3** Normalized Kernel Perceptron (NKP)
- Set $\alpha_0 := 0$
- for $k = 0, 1, 2, 3, ...$ do
  - if $G\alpha_k > 0_n$ then
    - Exit, with $\alpha_k$ as solution
  - else
    - $\theta_k := \frac{1}{k+1}$
    - $\alpha_{k+1} := (1-\theta_k)\alpha_k + \theta_k p(\alpha_k)$
  - end if
- end for

Implicitly,
$$f_{k+1} = (1-\theta_k)f_k + \theta_k\big\langle Y\bar\phi_X, p(f_k)\big\rangle = f_k - \theta_k\Big(f_k - \big\langle Y\bar\phi_X, p(f_k)\big\rangle\Big) = f_k - \theta_k\,\partial L(f_k),$$
and hence the Normalized Kernel Perceptron (NKP) is a subgradient algorithm to minimize $L(f)$ from Eq. (6).

**Remark.** Lemma 3 yields deep insights. Since NKP can get arbitrarily close to the minimizer of the strongly convex $L(f)$, it also gets arbitrarily close to a margin maximizer. It is known that it finds a perfect classifier in $1/\rho_K^2$ iterations - we now additionally infer that it will continue to improve to find an approximate max-margin classifier. While both the classical and normalized Perceptrons find perfect classifiers in the same time, the latter is guaranteed to keep improving.

**Remark.** $\alpha_{k+1}$ is always a probability distribution. Curiously, a guarantee that the solution will lie in $\Delta_n$ is not made by the Representer Theorem in Eq. (8) - any $\alpha \in \mathbb{R}^n$ could satisfy Lemma 1. However, since NKP is a subgradient method for minimizing Eq. (6), we know that we will approach the optimum while only choosing $\alpha \in \Delta_n$.

Define the smooth minimizer analogous to Eq. (10) as
$$p_\mu(\alpha) := \arg\min_{p\in\Delta_n}\big\{\langle\alpha, p\rangle_G + \mu\, d(p)\big\},$$
where
$$d(p) := \sum_i p_i\log p_i + \log n \tag{12}$$
is 1-strongly convex with respect to the ℓ1-norm (Nesterov, 2005). Define a smoothed loss function as in Eq. (9):
$$L_\mu(\alpha) = \sup_{p\in\Delta_n}\big\{\langle -\alpha, p\rangle_G - \mu\, d(p)\big\} + \tfrac12\|\alpha\|_G^2.$$
Note that the maximizer above is precisely $p_\mu(\alpha)$.

**Algorithm 4** Smoothed Normalized Kernel Perceptron (SNKP)
- Set $\alpha_0 = 1_n/n$, $\mu_0 := 2$, $p_0 := p_{\mu_0}(\alpha_0)$
- for $k = 0, 1, 2, 3, ...$ do
  - if $G\alpha_k > 0_n$ then
    - Halt: $\alpha_k$ is a solution to Eq. (8)
  - else
    - $\theta_k := \frac{2}{k+3}$
    - $\alpha_{k+1} := (1-\theta_k)(\alpha_k + \theta_k p_k) + \theta_k^2\, p_{\mu_k}(\alpha_k)$
    - $\mu_{k+1} := (1-\theta_k)\mu_k$
    - $p_{k+1} := (1-\theta_k)p_k + \theta_k\, p_{\mu_{k+1}}(\alpha_{k+1})$
  - end if
- end for

**Lemma 4 (Lower Bound).** At any step $k$, we have
$$L_{\mu_k}(\alpha_k) \geq L(\alpha_k) - \mu_k\log n.$$

*Proof.* First note that $\sup_{p\in\Delta_n} d(p) = \log n$. Also,
$$\langle -\alpha, p\rangle_G - \mu\, d(p) \geq \langle -\alpha, p\rangle_G - \mu\sup_{p\in\Delta_n} d(p).$$
Combining these two facts and taking the supremum over $p \in \Delta_n$ gives us the result. □

**Lemma 5 (Upper Bound).** In any round $k$, SNKP satisfies
$$-\tfrac12\|p_k\|_G^2 \geq L_{\mu_k}(\alpha_k).$$

*Proof.* We provide a concise, self-contained and unified proof by induction in the Appendix for Lemma 5 and Lemma 8, borrowing ideas from Nesterov's excessive gap technique (Nesterov, 2005) for smooth minimization of structured non-smooth functions. □

Finally, we combine the above lemmas to get the following theorem about the performance of SNKP.

**Theorem 1.** The SNKP algorithm finds a perfect classifier $f \in \mathcal{F}_K$, when one exists, in $O\!\left(\frac{\sqrt{\log n}}{\rho_K}\right)$ iterations.

*Proof.* Lemma 4 gives us, for any round $k$,
$$L_{\mu_k}(\alpha_k) \geq L(\alpha_k) - \mu_k\log n.$$
From Lemmas 3 and 5 we get
$$-\tfrac12\rho_K^2 \geq -\tfrac12\|p_k\|_G^2 \geq L_{\mu_k}(\alpha_k).$$
Combining the two, we get that
$$L(\alpha_k) \leq \mu_k\log n - \tfrac12\rho_K^2.$$
Noting that $\mu_k = \frac{4}{(k+1)(k+2)} < \frac{4}{(k+1)^2}$, we see that $L(\alpha_k) < 0$ (and hence we solve the problem, by Lemma 1) after at most $k = \frac{2\sqrt{2\log n}}{\rho_K}$ steps. □
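A direct transcription of Algorithm 4 is sketched below. The entropy-smoothed distribution $p_\mu(\alpha)$ has the closed form of a softmin of $G\alpha$ at temperature $\mu$ (this follows from the standard Lagrangian calculation for entropy regularization, stated here as a known fact rather than quoted from the paper); the iteration cap and the exact test $G\alpha_k > 0$ are implementation conveniences.

```python
import numpy as np

def p_mu(alpha, G, mu):
    """p_mu(alpha) = argmin_p { <alpha, p>_G + mu d(p) } over the simplex: a softmin of G alpha."""
    z = -(G @ alpha) / mu
    z -= z.max()                       # numerical stabilization before exponentiating
    p = np.exp(z)
    return p / p.sum()

def snkp(G, max_iters=100_000):
    """Algorithm 4 (Smoothed Normalized Kernel Perceptron). Returns alpha with G alpha > 0 if found."""
    n = G.shape[0]
    alpha = np.full(n, 1.0 / n)
    mu = 2.0
    p = p_mu(alpha, G, mu)
    for k in range(max_iters):
        if np.all(G @ alpha > 0):
            return alpha                                   # perfect classifier, by Lemma 1
        theta = 2.0 / (k + 3)
        alpha = (1 - theta) * (alpha + theta * p) + theta ** 2 * p_mu(alpha, G, mu)
        mu = (1 - theta) * mu                              # mu_{k+1} = (1 - theta_k) mu_k
        p = (1 - theta) * p + theta * p_mu(alpha, G, mu)   # uses alpha_{k+1} and mu_{k+1}
    return alpha
```

Theorem 1 says this loop needs roughly $2\sqrt{2\log n}/\rho_K$ passes, versus $1/\rho_K^2$ for Algorithm 3.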
## 5. Infeasible Problems

What happens when the points are not separable by any function $f \in \mathcal{F}_K$? We would like an algorithm that terminates with a solution when there is one, and terminates with a certificate of non-separability if there isn't one. The idea is based on theorems of the alternative like Farkas' Lemma, specifically a version of Gordan's theorem (Chvátal, 1983):

**Lemma 6 (Gordan's Theorem).** Exactly one of the following two statements can be true:
1. Either there exists a $w \in \mathbb{R}^d$ such that for all $i$, $y_i(w^\top x_i) > 0$;
2. Or there exists a $p \in \Delta_n$ such that
$$\|XYp\|_2 = 0, \tag{13}$$
or equivalently $\sum_i p_iy_ix_i = 0$.

As mentioned in the introduction, the primal problem can be interpreted as finding a vector in the interior of the dual cone of $\mathrm{cone}\{y_ix_i\}$, which is infeasible if the dual cone is flat, i.e. if $\mathrm{cone}\{y_ix_i\}$ is not pointed, which happens when the origin is in the convex hull of the $y_ix_i$s.

We will generalize the following algorithm for linear feasibility problems, which can be dated back to Von Neumann, who mentioned it in a private communication with Dantzig, who later studied it himself (Dantzig, 1992).

**Algorithm 5** Normalized Von Neumann (NVN)
- Initialize $p_0 = 1_n/n$, $w_0 = XYp_0$
- for $k = 0, 1, 2, 3, ...$ do
  - if $\|XYp_k\|_2 \leq \epsilon$ then
    - Exit and return $p_k$ as an $\epsilon$-solution to (13)
  - else
    - $j := \arg\min_i y_ix_i^\top w_k$
    - $\theta_k := \arg\min_{\lambda\in[0,1]} \|(1-\lambda)w_k + \lambda y_jx_j\|_2$
    - $p_{k+1} := (1-\theta_k)p_k + \theta_k e_j$
    - $w_{k+1} := XYp_{k+1} = (1-\theta_k)w_k + \theta_k y_jx_j$
  - end if
- end for

This algorithm comes with the following guarantee: if the problem (3) is infeasible, then the above algorithm terminates with an $\epsilon$-approximate solution to (13) in $1/\epsilon^2$ iterations. (Epelman & Freund, 2000) proved an incomparable bound - Normalized Von Neumann (NVN) can compute an $\epsilon$-solution to (13) in $O\!\left(\frac{1}{\rho_2^2}\log\frac{1}{\epsilon}\right)$ iterations, and can also find a solution to the primal (using $w_k$) in $O\!\left(\frac{1}{\rho_2^2}\right)$ iterations when it is feasible. We derive a smoothed variant of NVN in the next section, after we prove some crucial lemmas in RKHSs.

### 5.1. A Separation Theorem for RKHSs

While finite-dimensional Euclidean spaces come with strong separation guarantees that go under various names like the separating hyperplane theorem, Gordan's theorem, Farkas' lemma, etc., the story is not always the same for infinite-dimensional function spaces, which can often be tricky to deal with. We will prove an appropriate version of such a theorem that will be useful in our setting. What follows is an interesting version of the Hahn-Banach separation theorem, which looks a lot like Gordan's theorem in finite-dimensional spaces. The conditions to note here are that either $G\alpha > 0$ or $\|p\|_G = 0$.

**Theorem 2.** Exactly one of the following has a solution:
1. Either $f \in \mathcal{F}_K$ such that for all $i$, $\frac{y_if(x_i)}{\sqrt{K(x_i,x_i)}} = \langle f, y_i\bar\phi_{x_i}\rangle_K > 0$, i.e. $G\alpha > 0$;
2. Or $p \in \Delta_n$ such that
$$\sum_i p_iy_i\bar\phi_{x_i} = 0 \in \mathcal{F}_K, \quad\text{i.e.}\quad \|p\|_G = 0. \tag{14}$$

*Proof.* Consider the following set:
$$Q := \Big\{\Big(\sum_i p_iy_i\bar\phi_{x_i},\, 1\Big) \;\Big|\; p \in \Delta_n\Big\} = \mathrm{conv}\big[(y_1\bar\phi_{x_1}, 1), ..., (y_n\bar\phi_{x_n}, 1)\big] \subseteq \mathcal{F}_K \times \mathbb{R}.$$
If (2) does not hold, then $(0, 1) \notin Q$. Since $Q$ is closed and convex, we can find a separating hyperplane between $Q$ and $(0,1)$; in other words, there exists $(f, t) \in \mathcal{F}_K \times \mathbb{R}$ (with the inner product $\langle(f,t),(g,s)\rangle := \langle f, g\rangle_K + ts$) such that
$$\langle(f,t), (g,s)\rangle \geq 0 \;\;\forall (g,s) \in Q \qquad\text{and}\qquad \langle(f,t), (0,1)\rangle < 0.$$
The second condition immediately yields $t < 0$. The first condition, when applied to $(g, s) = (y_i\bar\phi_{x_i}, 1) \in Q$, yields
$$\langle f, y_i\bar\phi_{x_i}\rangle_K + t \geq 0 \;\Rightarrow\; \frac{y_if(x_i)}{\sqrt{K(x_i,x_i)}} \geq -t > 0 \quad\text{since } t < 0,$$
which shows that (1) holds. It is also immediate that if (2) holds, then (1) cannot. □

Note that $G$ is positive semi-definite - infeasibility requires both that it is not positive definite, and also that the witness to $p^\top Gp = 0$ must be a probability vector. Similarly, while it suffices that $G\alpha > 0$ for some $\alpha \in \mathbb{R}^n$, coincidentally in our case $\alpha$ will also lie in the probability simplex.
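For reference, here is a NumPy sketch of Algorithm 5 (the linear, un-smoothed Von Neumann scheme from earlier in this section). The closed form used for the exact line search over $\lambda \in [0,1]$ is an elementary calculation, stated here as an assumption rather than quoted from the paper.

```python
import numpy as np

def normalized_von_neumann(X, y, eps=1e-3, max_iters=100_000):
    """Algorithm 5 (NVN): maintain p in the simplex and w = X Y p; stop when ||X Y p||_2 <= eps."""
    d, n = X.shape
    XY = X * y                               # columns y_i x_i
    p = np.full(n, 1.0 / n)
    w = XY @ p
    for _ in range(max_iters):
        if np.linalg.norm(w) <= eps:
            return p, w                      # p is an eps-approximate solution to (13)
        j = int(np.argmin(XY.T @ w))         # worst point: argmin_i y_i <x_i, w_k>
        v = XY[:, j]
        # exact line search: lambda* = <w, w - v> / ||w - v||^2, clipped to [0, 1]
        denom = float(np.dot(w - v, w - v))
        theta = 1.0 if denom == 0.0 else float(np.clip(np.dot(w, w - v) / denom, 0.0, 1.0))
        p *= (1.0 - theta)
        p[j] += theta                        # p_{k+1} = (1 - theta) p_k + theta e_j
        w = (1.0 - theta) * w + theta * v    # w_{k+1} = X Y p_{k+1}
    return p, w
```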
### 5.2. The infeasible margin $\rho_K$

Note that constraining $\|f\|_K = 1$ (or $\|\alpha\|_G = 1$) in Eq. (5) and Lemma 3 allows $\rho_K$ to be negative in the infeasible case. If the constraint had instead been $\|f\|_K \leq 1$, then $\rho_K$ would have been non-negative, because $f = 0$ (i.e. $\alpha = 0$) is always allowed.

So what is $\rho_K$ when the problem is infeasible? Let
$$\mathrm{conv}(Y\bar\phi_X) := \Big\{\sum_i p_iy_i\bar\phi_{x_i} \;\Big|\; p \in \Delta_n\Big\} \subseteq \mathcal{F}_K$$
be the convex hull of the $y_i\bar\phi_{x_i}$s.

**Theorem 3.** When the primal is infeasible, the margin¹ is
$$|\rho_K| = \delta_{\max} := \sup\big\{\delta \;:\; \|f\|_K \leq \delta \Rightarrow f \in \mathrm{conv}(Y\bar\phi_X)\big\}.$$

*Proof.* (1) For the inequality $\geq$. Choose any $\delta$ such that $f \in \mathrm{conv}(Y\bar\phi_X)$ for every $\|f\|_K \leq \delta$. Given an arbitrary $\bar f \in \mathcal{F}_K$ with $\|\bar f\|_K = 1$, put $f' := -\delta\bar f$. By our assumption on $\delta$, we have $f' \in \mathrm{conv}(Y\bar\phi_X)$, implying there exists a $p' \in \Delta_n$ such that $f' = \langle Y\bar\phi_X, p'\rangle$. Also,
$$\big\langle\bar f, \langle Y\bar\phi_X, p'\rangle\big\rangle_K = \langle\bar f, f'\rangle_K = -\delta\|\bar f\|_K^2 = -\delta.$$
Since this holds for a particular $p'$, we can infer
$$\inf_{p\in\Delta_n}\big\langle\bar f, \langle Y\bar\phi_X, p\rangle\big\rangle_K \leq -\delta.$$
Since this holds for any $\bar f$ with $\|\bar f\|_K = 1$, we have
$$\sup_{\|\bar f\|_K=1}\inf_{p\in\Delta_n}\big\langle\bar f, \langle Y\bar\phi_X, p\rangle\big\rangle_K \leq -\delta, \quad\text{i.e.}\quad |\rho_K| \geq \delta.$$

(2) For the inequality $\leq$. It suffices to show $\|f\|_K \leq |\rho_K| \Rightarrow f \in \mathrm{conv}(Y\bar\phi_X)$. We will prove the contrapositive: $f \notin \mathrm{conv}(Y\bar\phi_X) \Rightarrow \|f\|_K > |\rho_K|$. Since $\Delta_n$ is compact and convex, $\mathrm{conv}(Y\bar\phi_X) \subseteq \mathcal{F}_K$ is closed and convex. Therefore, if $f \notin \mathrm{conv}(Y\bar\phi_X)$, then there exists a $g \in \mathcal{F}_K$ with $\|g\|_K = 1$ that separates $f$ and $\mathrm{conv}(Y\bar\phi_X)$, i.e. for all $p \in \Delta_n$,
$$\big\langle g, f - \langle Y\bar\phi_X, p\rangle\big\rangle_K < 0, \quad\text{i.e.}\quad \langle g, f\rangle_K < \inf_{p\in\Delta_n}\big\langle g, \langle Y\bar\phi_X, p\rangle\big\rangle_K \leq \sup_{\|\bar f\|_K=1}\inf_{p\in\Delta_n}\big\langle\bar f, \langle Y\bar\phi_X, p\rangle\big\rangle_K = \rho_K.$$
Since $\rho_K < 0$, $|\rho_K| < |\langle f, g\rangle_K| \leq \|f\|_K\|g\|_K = \|f\|_K$. □

¹ We thank a reviewer for pointing out that by this definition, $\rho_K$ might always be 0 for infinite-dimensional RKHSs because there are always directions perpendicular to the finite-dimensional hull - we conjecture the definition can be altered to restrict attention to the relative interior of the hull, making it non-zero.

## 6. Kernelized Primal-Dual Algorithms

The preceding theorems allow us to write a variant of the Normalized Von Neumann algorithm from the previous section that is smoothed and works for RKHSs. Define
$$W := \Big\{p \in \Delta_n \;\Big|\; \sum_i p_iy_i\bar\phi_{x_i} = 0\Big\} = \big\{p \in \Delta_n \;\big|\; \|p\|_G = 0\big\}$$
as the set of witnesses to the infeasibility of the primal. The following lemma bounds the distance of any point in the simplex from the witness set by its $\|.\|_G$ norm.

**Lemma 7.** For all $q \in \Delta_n$, the distance to the witness set satisfies
$$\mathrm{dist}(q, W) := \min_{w\in W}\|q - w\|_2 \;\leq\; \sqrt 2\,\min\Big\{1, \frac{\|q\|_G}{|\rho_K|}\Big\}.$$
As a consequence, $\|p\|_G = 0$ iff $p \in W$.

*Proof.* This is trivial for $p \in W$. For arbitrary $p \in \Delta_n\setminus W$, consider $\tilde p := -|\rho_K|\frac{p}{\|p\|_G}$, so that $\|\langle Y\bar\phi_X, \tilde p\rangle\|_K = |\rho_K|$. Hence, by Theorem 3, there exists an $\alpha \in \Delta_n$ such that $\langle Y\bar\phi_X, \alpha\rangle = \langle Y\bar\phi_X, \tilde p\rangle$. Let $\beta = \lambda\alpha + (1-\lambda)p$ where $\lambda = \frac{\|p\|_G}{\|p\|_G + |\rho_K|}$. Then
$$\big\langle Y\bar\phi_X, \beta\big\rangle = \frac{1}{\|p\|_G + |\rho_K|}\big\langle Y\bar\phi_X, \|p\|_G\alpha + |\rho_K|p\big\rangle = \frac{1}{\|p\|_G + |\rho_K|}\big\langle Y\bar\phi_X, -|\rho_K|p + |\rho_K|p\big\rangle = 0,$$
so $\beta \in W$ (by definition of what it means to be in $W$), and
$$\|p - \beta\|_2 = \lambda\|p - \alpha\|_2 \leq \lambda\sqrt 2 \leq \sqrt 2\,\min\Big\{1, \frac{\|p\|_G}{|\rho_K|}\Big\}.$$
We take the min with $\sqrt 2$ because $\rho_K$ might be 0. □

Hence, for the primal or dual problem, points with small $G$-norm are revealing - either Lemma 3 shows that the margin $\rho_K \leq \|p\|_G$ will be small, or, if the problem is infeasible, then the above lemma shows that $p$ is close to the witness set.

We need a small alteration to the smoothing entropy prox-function that we used earlier. We will now use
$$d_q(p) := \tfrac12\|p - q\|_2^2$$
for some given $q \in \Delta_n$, which is strongly convex with respect to the ℓ2 norm. This allows us to define
$$p^q_\mu(\alpha) := \arg\min_{p\in\Delta_n}\big\{\langle G\alpha, p\rangle + \mu\, d_q(p)\big\}, \qquad L^q_\mu(\alpha) := \sup_{p\in\Delta_n}\big\{\langle -\alpha, p\rangle_G - \mu\, d_q(p)\big\} + \tfrac12\|\alpha\|_G^2,$$
where $p^q_\mu(\alpha)$ can easily be found by sorting the entries of $q - G\alpha/\mu$ (it is the Euclidean projection of this vector onto the simplex).

**Algorithm 6** Smoothed Normalized Kernel Perceptron-Von Neumann (SNKPVN(q, δ))
- **input** $q \in \Delta_n$, accuracy $\delta > 0$
- Set $\alpha_0 = q$, $\mu_0 := 2n$, $p_0 := p^q_{\mu_0}(\alpha_0)$
- for $k = 0, 1, 2, 3, ...$ do
  - if $G\alpha_k > 0_n$ then
    - Halt: $\alpha_k$ is a solution to Eq. (8)
  - else if $\|p_k\|_G < \delta$ then
    - Return $p_k$
  - else
    - $\theta_k := \frac{2}{k+3}$
    - $\alpha_{k+1} := (1-\theta_k)(\alpha_k + \theta_k p_k) + \theta_k^2\, p^q_{\mu_k}(\alpha_k)$
    - $\mu_{k+1} := (1-\theta_k)\mu_k$
    - $p_{k+1} := (1-\theta_k)p_k + \theta_k\, p^q_{\mu_{k+1}}(\alpha_{k+1})$
  - end if
- end for

When the primal is feasible, SNKPVN behaves similarly to SNKP.
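The only non-trivial step of Algorithm 6 is computing $p^q_\mu(\alpha)$. Since it is the Euclidean projection of $q - G\alpha/\mu$ onto the simplex, the usual sorting-based projection routine (a standard construction, included here for completeness rather than taken from the paper) gives it in $O(n\log n)$ time:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex, via one sort of its entries."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, v.size + 1) > css - 1.0)[-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def p_q_mu(alpha, G, q, mu):
    """p^q_mu(alpha) = argmin_p { <G alpha, p> + (mu/2) ||p - q||_2^2 } = Proj_simplex(q - G alpha / mu)."""
    return project_simplex(q - (G @ alpha) / mu)
```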
**Lemma 8 (When $\rho_K > 0$ and $\delta < \rho_K$).** For any $q \in \Delta_n$,
$$-\tfrac12\|p_k\|_G^2 \geq L^q_{\mu_k}(\alpha_k) \geq L(\alpha_k) - \mu_k.$$
Hence SNKPVN finds a separator $f$ in $O\!\left(\frac{\sqrt n}{\rho_K}\right)$ iterations.

*Proof.* We give a unified proof for the first inequality and Lemma 5 in the Appendix. The second inequality mimics Lemma 4 (with $\sup_{p\in\Delta_n} d_q(p) \leq 1$ in place of $\log n$). The final statement mimics Theorem 1. □

The following lemma captures the near-infeasible case.

**Lemma 9 (When $\rho_K < 0$ or $\delta > \rho_K$).** For any $q \in \Delta_n$,
$$-\tfrac12\|p_k\|_G^2 \geq L^q_{\mu_k}(\alpha_k) \geq -\tfrac12\mu_k\,\mathrm{dist}(q, W)^2.$$
Hence SNKPVN finds a $\delta$-solution in at most $O\!\left(\min\Big\{\frac{\sqrt n}{\delta}, \frac{\sqrt n\,\|q\|_G}{\delta\,|\rho_K|}\Big\}\right)$ iterations.

*Proof.* The first inequality is the same as in Lemma 8 above, and is proved in the Appendix. For the second, let $w \in W$ attain $\mathrm{dist}(q, W)$; then
$$L^q_{\mu_k}(\alpha_k) = \sup_{p\in\Delta_n}\big\{\langle -\alpha_k, p\rangle_G - \mu_k d_q(p)\big\} + \tfrac12\|\alpha_k\|_G^2 \geq \langle -\alpha_k, w\rangle_G - \mu_k d_q(w) \geq -\tfrac{\mu_k}{2}\|w - q\|_2^2 = -\tfrac12\mu_k\,\mathrm{dist}(q, W)^2 \geq -\mu_k\min\Big\{1, \frac{\|q\|_G^2}{|\rho_K|^2}\Big\},$$
using Lemma 7 (note that $\langle\alpha_k, w\rangle_G = 0$ since $Gw = 0$ for $w \in W$). Since $\mu_k = \frac{4n}{(k+1)(k+2)} \leq \frac{4n}{(k+1)^2}$, we get
$$\|p_k\|_G \leq \frac{2\sqrt{2n}}{k+1}\,\min\Big\{1, \frac{\|q\|_G}{|\rho_K|}\Big\}.$$
Hence $\|p_k\|_G \leq \delta$ after at most $\frac{2\sqrt{2n}}{\delta}\min\Big\{1, \frac{\|q\|_G}{|\rho_K|}\Big\}$ iterations. □

Using SNKPVN as a subroutine gives our final algorithm.

**Algorithm 7** Iterated Smoothed Normalized Kernel Perceptron-Von Neumann (ISNKPVN(γ, ϵ))
- **input** constant $\gamma > 1$, accuracy $\epsilon > 0$
- Set $q_0 := 1_n/n$
- for $t = 0, 1, 2, 3, ...$ do
  - $\delta_t := \|q_t\|_G/\gamma$
  - $q_{t+1} := \mathrm{SNKPVN}(q_t, \delta_t)$
  - if $\delta_t < \epsilon$ then
    - Halt; $q_{t+1}$ is an $\epsilon$-approximate solution to Eq. (14)
  - end if
- end for

**Theorem 4.** Algorithm ISNKPVN satisfies:
1. If the primal (2) is feasible and $\epsilon < \rho_K$, then each call to SNKPVN halts in at most $\frac{2\sqrt{2n}}{\rho_K}$ iterations. Algorithm ISNKPVN finds a solution in at most $\frac{\log(1/\rho_K)}{\log(\gamma)}$ outer loops, bounding the total iterations by $\frac{2\sqrt{2n}}{\rho_K}\cdot\frac{\log(1/\rho_K)}{\log(\gamma)}$.
2. If the dual (14) is feasible or $\epsilon > \rho_K$, then each call to SNKPVN halts in at most $O\!\left(\min\Big\{\frac{\sqrt n}{\epsilon}, \frac{\sqrt n}{|\rho_K|}\Big\}\right)$ steps. Algorithm ISNKPVN finds an $\epsilon$-solution in at most $\frac{\log(1/\epsilon)}{\log(\gamma)}$ outer loops, bounding the total iterations by $O\!\left(\min\Big\{\frac{\sqrt n}{\epsilon}, \frac{\sqrt n}{|\rho_K|}\Big\}\cdot\frac{\log(1/\epsilon)}{\log(\gamma)}\right)$.

*Proof.* First note that if ISNKPVN has not halted, then we know that after $t$ outer iterations $q_{t+1}$ has small $G$-norm:
$$\|q_{t+1}\|_G \leq \delta_t \leq \frac{\|q_0\|_G}{\gamma^{t+1}}. \tag{15}$$
The first inequality holds because of the inner-loop return condition, the second because of the update rule for $\delta_t$.

1. Lemma 3 shows that for all $p \in \Delta_n$ we have $\rho_K \leq \|p\|_G$, so the inner loop will halt with a solution to the primal as soon as $\delta_t \leq \rho_K$ (so that $\|p\|_G < \delta_t \leq \rho_K$ cannot be satisfied for the inner loop to return). From Eq. (15), this will definitely happen when $\frac{\|q_0\|_G}{\gamma^{t+1}} \leq \rho_K$, i.e. within $T = \frac{\log(\|q_0\|_G/\rho_K)}{\log(\gamma)}$ outer iterations. By Lemma 8, each such call runs for at most $\frac{2\sqrt{2n}}{\rho_K}$ steps.
2. We halt with an $\epsilon$-solution when $\delta_t < \epsilon$, which definitely happens when $\frac{\|q_0\|_G}{\gamma^{t+1}} < \epsilon$, i.e. within $T = \frac{\log(\|q_0\|_G/\epsilon)}{\log(\gamma)}$ outer iterations. Since $\|q_t\|_G/\delta_t = \gamma$, by Lemma 9 each call runs for at most $O\!\left(\min\Big\{\frac{\sqrt n}{\epsilon}, \frac{\sqrt n}{|\rho_K|}\Big\}\right)$ steps. □
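Putting Algorithms 6 and 7 together gives the primal-dual driver sketched below. It reuses the sorting-based simplex projection for $p^q_\mu$ and returns either an $\alpha$ with $G\alpha > 0$ or a $q$ of small $G$-norm; the default tolerances and the iteration cap are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def _project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, v.size + 1) > css - 1.0)[-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def _gnorm(p, G):
    return float(np.sqrt(max(p @ (G @ p), 0.0)))            # ||p||_G, guarded against round-off

def snkpvn(G, q, delta, max_iters=100_000):
    """Algorithm 6: return ('primal', alpha) if G alpha > 0, or ('dual', p) once ||p||_G < delta."""
    n = G.shape[0]
    alpha, mu = q.copy(), 2.0 * n
    p = _project_simplex(q - (G @ alpha) / mu)
    for k in range(max_iters):
        if np.all(G @ alpha > 0):
            return 'primal', alpha
        if _gnorm(p, G) < delta:
            return 'dual', p
        theta = 2.0 / (k + 3)
        alpha = (1 - theta) * (alpha + theta * p) + theta ** 2 * _project_simplex(q - (G @ alpha) / mu)
        mu = (1 - theta) * mu
        p = (1 - theta) * p + theta * _project_simplex(q - (G @ alpha) / mu)
    return 'dual', p

def isnkpvn(G, gamma=2.0, eps=1e-3):
    """Algorithm 7: call SNKPVN with a geometrically shrinking tolerance delta_t = ||q_t||_G / gamma."""
    q = np.full(G.shape[0], 1.0 / G.shape[0])
    while True:
        delta = _gnorm(q, G) / gamma
        status, out = snkpvn(G, q, delta)
        if status == 'primal':
            return status, out                               # a separator: G alpha > 0
        q = out
        if delta < eps:
            return 'dual', q                                 # eps-certificate of near-infeasibility
```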
## 7. Discussion

The SNK-Perceptron algorithm presented in this paper has a convergence rate of $\frac{\sqrt{\log n}}{\rho_K}$, and the Iterated SNK-Perceptron-Von-Neumann algorithm has a $\min\Big\{\frac{\sqrt n}{\epsilon}, \frac{\sqrt n}{|\rho_K|}\Big\}$ dependence on the number of points. Note that both of these are independent of the underlying dimensionality of the problem. We conjecture that it is possible to reduce this dependence to $\sqrt{\log n}$ for the primal-dual algorithm also, without paying a price in terms of the dependence on the margin $1/\rho$ (or the dependence on $\epsilon$).

It is possible that a tighter dependence on $n$ is achievable if we try other smoothing functions instead of the ℓ2 norm used in the last section. Specifically, it might be tempting to smooth with the $\|.\|_G$ semi-norm and define
$$p^q_\mu(\alpha) := \arg\min_{p\in\Delta_n}\Big\{\langle\alpha, p\rangle_G + \frac{\mu}{2}\|p - q\|_G^2\Big\}.$$
One can actually see that the proofs in the Appendix go through with no dimension dependence on $n$ at all! However, it is not possible to solve this in closed form - taking $\alpha = q$ and $\mu = 1$ reduces the problem to asking for
$$p^q(q) = \arg\min_{p\in\Delta_n}\|p\|_G,$$
which is an oracle for our problem, as seen from equation (14) - the solution's $G$-norm is 0 iff the problem is infeasible.

In the bigger picture, there are several interesting open questions. The ellipsoid algorithm for solving linear feasibility problems has a logarithmic dependence on $1/\epsilon$ and a polynomial dependence on dimension. Recent algorithms involving repeated rescaling of the space, like (Dunagan & Vempala, 2008), have a logarithmic dependence on $1/\rho$ and a polynomial dependence on dimension. While both these algorithms are poly-time under the real number model of computation of (Blum et al., 1998), it is unknown whether there is any algorithm that can achieve a polylogarithmic dependence on the margin/accuracy and a polylogarithmic dependence on dimension. This is strongly related to the open question of whether it is possible to learn a decision list in time polynomial in its binary description length. One can nevertheless ask whether rescaled smoothed perceptron methods like (Dunagan & Vempala, 2008) can be lifted to RKHSs, and whether using an iterated smoothed kernel perceptron would yield faster rates. The recent work (Soheili & Peña, 2013b) is a challenge to generalize - the proofs relying on geometry involve arguing about volumes of balls of functions in an RKHS - we conjecture that it is possible to do, but we leave it for later work.

## Acknowledgements

We thank Negar Soheili and Avrim Blum for discussions, and the excellent reviewers for references and Footnote 1.

## References

Block, H.D. The perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34(1):123, 1962.

Blum, Lenore, Cucker, Felipe, Shub, Michael, and Smale, Steve. Complexity and Real Computation. Springer, 1998.

Cheung, Dennis and Cucker, Felipe. A new condition number for linear programming. Mathematical Programming, 91(1):163-174, 2001.

Chvátal, Vašek. Linear Programming. Macmillan, 1983.

Clarkson, Kenneth L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.

Dantzig, George B. An ϵ-precise feasible solution to a linear program with a convexity constraint in 1/ϵ² iterations independent of problem size. Technical report, Stanford University, 1992.

Dunagan, John and Vempala, Santosh. A simple polynomial-time rescaling algorithm for solving linear programs. Mathematical Programming, 114(1):101-114, 2008.

Epelman, Marina and Freund, Robert M. Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Mathematical Programming, 88(3):451-485, 2000.

Freund, Robert M. and Vera, Jorge R. Condition-based complexity of convex optimization in conic linear form via the ellipsoid algorithm. SIAM Journal on Optimization, 10(1):155-176, 1999.

Graepel, Thore, Herbrich, Ralf, and Williamson, Robert C. From margin to sparsity. Advances in Neural Information Processing Systems, pp. 210-216, 2001.

Littlestone, Nicholas. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pp. 147-156. Morgan Kaufmann Publishers Inc., 1991.

Nemirovski, Arkadi. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.

Nesterov, Yu. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235-249, 2005.

Novikoff, Albert B.J. On convergence proofs for perceptrons. Technical report, 1962.
Renegar, James. Incorporating condition measures into the complexity theory of linear programming. SIAM Journal on Optimization, 5(3):506-524, 1995.

Rosenblatt, Frank. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

Saha, Ankan, Vishwanathan, S.V.N., and Zhang, Xinhua. New approximation algorithms for minimum enclosing convex shapes. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1146-1160. SIAM, 2011.

Schölkopf, Bernhard and Smola, Alexander J. Learning with Kernels. The MIT Press, 2002.

Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alex J. A generalized representer theorem. In Computational Learning Theory, pp. 416-426. Springer, 2001.

Soheili, Negar and Peña, Javier. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728-737, 2012.

Soheili, Negar and Peña, Javier. A primal-dual smooth perceptron-von Neumann algorithm. In Discrete Geometry and Optimization, pp. 303-320. Springer, 2013a.

Soheili, Negar and Peña, Javier. A deterministic rescaled perceptron algorithm. 2013b.

Tseng, Paul. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization, 2008.