# Saddle Points and Accelerated Perceptron Algorithms

Adams Wei Yu (WEIYU@CS.CMU.EDU)
Fatma Kılınç-Karzan (FKILINC@ANDREW.CMU.EDU)
Jaime G. Carbonell (JGC@CS.CMU.EDU)

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA

**Abstract**

In this paper, we consider the problem of finding a linear (binary) classifier or providing a near-infeasibility certificate if there is none. We bring a new perspective to addressing these two problems simultaneously in a single efficient process, by investigating a related Bilinear Saddle Point Problem (BSPP). More specifically, we show that a BSPP-based approach provides either a linear classifier or an $\epsilon$-infeasibility certificate. We show that the accelerated primal-dual algorithm, Mirror Prox, can be used for this purpose and achieves the best known convergence rate of $O(\frac{\sqrt{\log n}}{\rho(A)})$ ($O(\frac{\sqrt{\log n}}{\epsilon})$), which is almost independent of the problem size, $n$. Our framework also solves kernelized and conic versions of the problem, with the same rate of convergence. We support our theoretical findings with an empirical study on synthetic and real data, highlighting the efficiency and numerical stability of our algorithm, especially on large-scale instances.

## 1. Introduction

One of the central tasks in supervised learning is to train a binary classifier: given a training data set $A \in \mathbb{R}^{m \times n}$ of size $n$, where each column $A_j \in \mathbb{R}^m$ is an instance with label either $1$ or $-1$, find a binary (linear) classifier that separates the data into two groups based on their labels. Without loss of generality, we assume that $A$ does not contain a zero column. After reversing the sign of each negative instance (Blum & Dunagan, 2002) (for simplicity, we still denote the data matrix by $A$), training a classifier is equivalent to finding a feasible solution to the following system of homogeneous linear inequalities w.r.t. $y \in \mathbb{R}^m$:

$$A^T y > 0. \quad (1)$$

We refer to (1) as the Linear Dual Feasibility Problem (LDFP), and to its solutions as linear classifiers or feasibility (separability) certificates. When the training data is not linearly separable, an infeasibility (inseparability) certificate is given by a solution to the Linear Alternative Problem (LAP) of (1):

$$Ax = 0, \quad \mathbf{1}^T x = 1, \quad x \geq 0, \quad (2)$$

where $\mathbf{1} \in \mathbb{R}^n$ denotes the vector with all coordinates equal to 1. LDFP and LAP are Linear Feasibility Problems (LFP) that are dual to each other: (1) is feasible if and only if (2) is infeasible (by Gordan's Theorem (Chvatal, 1983)). In particular, a feasible solution of one problem is an infeasibility certificate of the other. Often, instead of seeking an exact infeasibility certificate, it is more practical to opt for an $\epsilon$-infeasibility certificate for LDFP, i.e., to find an $\epsilon$-solution for LAP, $x_\epsilon$, such that $\|Ax_\epsilon\|_2 \leq \epsilon$, $\mathbf{1}^T x_\epsilon = 1$, $x_\epsilon \geq 0$. Without loss of generality, we assume $\|A_j\|_2 = 1$ for all $j$. Such a transformation does not change the feasibility status of either LDFP or LAP, but it simplifies our analysis.

Given that (1) and (2) are convex (in fact just linear) optimization problems, quite a few algorithms, including polynomial-time Interior Point Methods (IPMs), can be used to solve them. Nonetheless, in real-life classification scenarios, these problems, especially the ones with dense data, e.g., dense $A$ matrices, remain challenging for IPMs.
In these cases, each iteration of IPMs requires $O(n^3)$ arithmetic operations (a.o.), resulting in unacceptable overall runtimes. As a result, algorithms with computationally cheap iterations, i.e., first-order methods (FOMs) like gradient descent, are more attractive despite their inferior rates of convergence. In fact, the first algorithm suggested to solve (1), the perceptron algorithm of (Rosenblatt, 1958), is precisely from this class. This class of FOM algorithms, including the perceptron and its variants, involves only elementary operations in each iteration. The time complexity per iteration is dominated by a simple matrix-vector multiplication, and thus it is $O(mn)$. Due to their scalability, our focus in this paper is also limited to this class of FOMs.

Most of the FOMs directly address either (1) or (2) separately, or both simultaneously. Assuming that the a.o. involved in their iterations are of the same order, one can compare the efficiency of these algorithms based on their rate of convergence, i.e., the number of iterations needed to find a feasible solution for LDFP or an $\epsilon$-solution for LAP. The convergence rates of these algorithms are usually measured in terms of the parameters $n$, $m$, $\epsilon$, and the margin, $\rho(A)$:

$$\rho(A) := \max_{\|u\|_2 = 1} \ \min_{j=1,\ldots,n} \frac{u^T A_j}{\|A_j\|_2}. \quad (3)$$

For example, (Novikoff, 1962) established that the rate of convergence of the perceptron algorithm is $O(\frac{1}{\rho(A)^2})$. Among these quantities, $\rho(A)$ in fact provides a measure of the difficulty of solving LDFP or LAP, or equivalently of determining the separability of the data $A$. LDFP is feasible if $\rho(A) > 0$, and LAP is feasible if $\rho(A) < 0$ (see (Li & Terlaky, 2013)). The smaller its magnitude $|\rho(A)|$, the harder it is to solve the corresponding problem. With our normalization ($\|A_j\|_2 = 1$ for all $j$), we have $|\rho(A)| \leq 1$. Unfortunately, in real classification scenarios, $|\rho(A)|$ is rather tiny, so the complexity $O(\frac{1}{\rho(A)^2})$ of the original perceptron algorithm is not so promising, even though it is completely independent of the size $n$ of the problem.

In this paper, we suggest a single algorithm which solves (1) or (2) simultaneously by providing either a feasibility (separability) certificate or an almost-infeasibility (inseparability) certificate. While doing so, our algorithm not only enjoys the best rate of convergence in terms of its dependence on $\rho(A)$, but also retains the simplicity of each iteration. To achieve this, we circumvent dealing directly with (1) or (2), and thus avoid making any assumptions on the feasibility status of either one of them. Instead, bringing a new perspective and a simplified analysis, we show that by solving a related Bilinear Saddle Point Problem (BSPP) via a primal-dual algorithm, one can obtain either a feasible solution for LDFP or an $\epsilon$-solution for LAP, depending on the feasibility status of the respective problem. To solve the BSPP, we adopt the Mirror Prox (MP) algorithm, an accelerated primal-dual FOM introduced by (Nemirovski, 2004). While our suggested framework, Mirror Prox for Feasibility Problems (MPFP), maintains the same iteration efficiency as the perceptron, i.e., $O(mn)$ per iteration, we establish that MP achieves a convergence rate of $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ for LDFP ($O(\frac{\sqrt{\log n}}{\epsilon})$ for LAP). To the best of our knowledge, this is the first algorithm that simultaneously solves both of these problems at these rates. Unlike the perceptron algorithm, the overall rate of convergence of the MP algorithm has a very mild dependence, $O(\sqrt{\log n})$, on the number of training data.
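To make the baseline concrete, the following is a minimal NumPy sketch of the classical perceptron update for (1); it assumes the columns of $A$ have already been sign-adjusted and normalized as described above, and the function name and iteration budget are illustrative rather than taken from the paper.

```python
import numpy as np

def perceptron(A, max_iter=100000):
    """Classical perceptron for A^T y > 0.
    Columns of A are assumed unit-norm, with negative instances sign-flipped.
    Returns a separator y if found within max_iter updates, else None."""
    m, n = A.shape
    y = np.zeros(m)
    for _ in range(max_iter):
        margins = A.T @ y          # margins of all n instances
        j = np.argmin(margins)     # most violated instance
        if margins[j] > 0:         # A^T y > 0 componentwise: LDFP solved
            return y
        y += A[:, j]               # perceptron update on the violator
    return None                    # no separator found within the budget
```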
Nonetheless, $O(\sqrt{\log n})$ is in fact almost a constant factor even for extremely large $n$, such as $n \in [10^{10}, 10^{20}]$, and thus we claim (and also empirically show) that MP has quite competitive performance on large-scale classification problems. Note that such a dependency is consistent with the existing literature in the case of LDFP. It is also strictly better (in terms of convergence rates) than the recent results for LAP, because MP has a factor of only $O(\sqrt{\log n})$, whereas the competing algorithms have a factor of $O(\sqrt{n})$, significantly limiting their computational performance. We further confirm the efficiency and scalability of our algorithms via a numerical study on both synthetic and real data. In addition to their excellent theoretical and practical performance, our study also reveals that MP-based algorithms are numerically more stable than the other state-of-the-art methods.

MPFP also offers great flexibility in adjusting to the geometry of the problem. In particular, we show that by proper customization, our framework can easily be extended to handle two important generalizations: kernelized feasibility problems and general conic feasibility problems. In both of these generalizations, we maintain the same rate of convergence as the original version while also retaining a cheap iteration cost.

The connections of classification with saddle point problems, and the efficiency and flexibility of the MP framework discussed in this paper, open up further strikingly important possibilities for acceleration based on randomization. In particular, the sublinear-time behavior achieved by the MP variants introduced in (Juditsky et al., 2013) is of great interest for future research.

## 2. Related Work

There is an extensive literature on deterministic and stochastic FOMs for solving classification problems (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002; Cotter et al., 2012), as well as on the perceptron algorithm and its variants. In this paper, we limit our focus to deterministic FOMs and the most relevant literature. We categorize the representative literature into two parts in terms of the types of certificates, i.e., solutions to LDFP and LAP, that the corresponding algorithms provide. Table 1 summarizes the rates of convergence of these algorithms. We note that the overall number of a.o. involved in each iteration of any one of these algorithms is $O(mn)$, and thus the comparison presented in Table 1 is in fact meaningful.

| Algorithm | LDFP | LAP $\epsilon$-solution |
|---|---|---|
| PCT | $O(\frac{1}{\rho(A)^2})$ | N/A |
| SPCT | $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ | N/A |
| VN | $O(\frac{1}{\rho(A)^2})$ | $O(\min(\frac{1}{\epsilon^2}, \frac{1}{\rho(A)^2}\log\frac{1}{\epsilon}))$ |
| ISPVN | $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{|\rho(A)|})$ | $O(\frac{\sqrt{n}}{\max\{|\rho(A)|,\epsilon\}}\log\frac{1}{\epsilon})$ |
| MPLFP (this paper) | $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ | $O(\frac{\sqrt{\log n}}{\epsilon})$ |

Table 1. Summary of the convergence rates of different algorithms.

The first category of algorithms in Table 1 assumes that LDFP is feasible and only provides solutions for LDFP, i.e., feasibility (separability) certificates. The original perceptron (PCT) algorithm (Rosenblatt, 1958), with its $O(\frac{1}{\rho(A)^2})$ rate of convergence, is an example of this type of algorithm. As observed in (Saha et al., 2011), acceleration of the perceptron algorithm via (Nesterov, 2005)'s extra-gradient-based smoothing technique is possible. This technique underlies the Smooth Perceptron (SPCT) algorithm suggested by (Soheili & Peña, 2012), which terminates in at most $\frac{2\sqrt{\log n}}{\rho(A)} - 1$ iterations. Under the assumption that LDFP is feasible, SPCT achieves the same theoretical iteration complexity as MPFP.
However, SPCT does not provide infeasibility (inseparability) certificates, and verifying the feasibility status of LDFP is equivalent to solving LDFP.

Another vein of algorithms aims to either solve LDFP or provide an $\epsilon$-solution to LAP simultaneously, without any assumption on their feasibility status beforehand. The von Neumann (VN) algorithm (Dantzig, 1992) is a well-known example of such a method. Combining the analyses of (Dantzig, 1992) and (Epelman & Freund, 2000), one can conclude that (i) if LAP is feasible, then in at most $O(\min(\frac{1}{\epsilon^2}, \frac{1}{\rho(A)^2}\log\frac{1}{\epsilon}))$ iterations VN returns an $\epsilon$-solution; (ii) if LAP is infeasible, then VN provides a feasible solution to LDFP within $O(\frac{1}{\rho(A)^2})$ iterations. Recently, based on the SPCT and VN algorithms, (Soheili & Peña, 2013) suggested the Iterated Smooth Perceptron-Von Neumann (ISPVN) algorithm. ISPVN finds a feasible solution for LDFP within $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{|\rho(A)|})$ iterations provided that LDFP is feasible, or otherwise gives an $\epsilon$-solution to LAP within $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{\epsilon})$ iterations. An extension of ISPVN to the kernelized setting, with the same convergence rate, is also possible (cf. Theorem 4 in (Ramdas & Peña, 2014)). We note that, compared to ISPVN, MPFP achieves significantly better performance in terms of its dependence on both the dimension $n$ of the problem ($O(\sqrt{\log n})$ as compared to $O(\sqrt{n})$) and the condition number $\rho(A)$. On the other hand, the complexities of MPFP and ISPVN for finding an $\epsilon$-solution of LAP indicate a tradeoff between $n$, $\rho(A)$ and $\epsilon$, so they are not directly comparable. Moreover, the improved complexity of MPFP in terms of its dependence on $n$ also gives an affirmative answer to the conjecture of (Ramdas & Peña, 2014).

## 3. Notation and Preliminaries

We first introduce some notation and key concepts related to our setup and analysis. Throughout this paper, we use Matlab notation for vectors and matrices, i.e., $[x; y]$ denotes the concatenation of two column vectors $x, y$.

The Bilinear Saddle Point Problem (BSPP) forms the backbone of our analysis. In its most general form, a BSPP is defined as

$$\max_{y \in Y} \ \min_{x \in X} \ \phi(x, y) \quad (S)$$

where $\phi(x, y) = \upsilon + \langle a_1, x \rangle + \langle a_2, y \rangle + \langle y, Bx \rangle$; $X, Y$ are nonempty convex compact sets in Euclidean spaces $E_x$, $E_y$, and $Z := X \times Y$, hence $\phi(x, y) : Z \to \mathbb{R}$. Note that (S) gives rise to two convex optimization problems that are dual to each other:

$$\mathrm{Opt}(P) = \min_{x \in X} \left[\overline{\phi}(x) := \max_{y \in Y} \phi(x, y)\right] \quad (P)$$
$$\mathrm{Opt}(D) = \max_{y \in Y} \left[\underline{\phi}(y) := \min_{x \in X} \phi(x, y)\right] \quad (D)$$

with $\mathrm{Opt}(P) = \mathrm{Opt}(D) = \mathrm{Opt}$, and to the variational inequality (v.i.): find $z_* \in Z$ such that

$$\langle F(z), z - z_* \rangle \geq 0 \quad \text{for all } z \in Z, \quad (4)$$

where $F : Z \to E_x \times E_y$ is the affine monotone operator defined by

$$F(x, y) = \left[ F_x(y) := \frac{\partial \phi(x, y)}{\partial x};\ F_y(x) := -\frac{\partial \phi(x, y)}{\partial y} \right].$$

It is well known that the solutions to (S), i.e., the saddle points of $\phi$ on $X \times Y$, are exactly the pairs $z_* = [x_*; y_*]$ comprised of optimal solutions to problems (P) and (D). They are also solutions to the v.i. (4). For the BSPP (S), the accuracy of a candidate solution $z = [x; y]$ is quantified by the saddle point residual

$$\epsilon_{\mathrm{sad}}(z) = \overline{\phi}(x) - \underline{\phi}(y) = \left[\overline{\phi}(x) - \mathrm{Opt}(P)\right] + \left[\mathrm{Opt}(D) - \underline{\phi}(y)\right].$$

## 4. General Mirror Prox Framework

The MP algorithm is quite flexible in terms of adjusting to the geometry of the problem, characterized by the domain of the BSPP (S), i.e., $X$, $Y$. The following components are standard in forming the MP setup for given domains $X$, $Y$, and in analyzing its convergence rate:

- **Norm:** $\|\cdot\|$ on the Euclidean space $E$ where the domain $Z = X \times Y$ of (S) lives, along with the dual norm $\|\zeta\|_* = \max_{\|z\| \leq 1} \langle \zeta, z \rangle$.
- **Distance-generating function (d.g.f.):** $\omega(z)$, which is convex and continuous on $Z$, admits a continuous selection $\omega'(z)$ of subgradients on the set $Z^o = \{z \in Z : \partial\omega(z) \neq \emptyset\}$ (here $\partial\omega(z)$ is the subdifferential of $\omega$ at $z$), and is strictly convex with modulus 1 w.r.t. $\|\cdot\|$:
$$\forall z', z'' \in Z^o: \quad \langle \omega'(z') - \omega'(z''), z' - z'' \rangle \geq \|z' - z''\|^2.$$
- **Bregman distance:** $V_z(u) = \omega(u) - \omega(z) - \langle \omega'(z), u - z \rangle$, where $z \in Z^o$ and $u \in Z$.
- **Prox-mapping:** given a prox center $z \in Z^o$,
$$\mathrm{Prox}_z(\xi) = \operatorname*{argmin}_{w \in Z} \{ \langle \xi, w \rangle + V_z(w) \} : E \to Z^o.$$
- **$\omega$-center:** $z_\omega = \operatorname*{argmin}_{z \in Z} \omega(z) \in Z^o$ of $Z$.
- **$\Omega$:** $\Omega = \Omega_z := \max_{z \in Z} V_{z_\omega}(z) \leq \max_{z \in Z} \omega(z) - \min_{z \in Z} \omega(z)$.
- **Lipschitz constant:** $L$ of $F$, from $\|\cdot\|$ to $\|\cdot\|_*$, satisfying $\|F(z) - F(z')\|_* \leq L \|z - z'\|$ for all $z, z'$.

Based on this setup, the general template of the Mirror Prox algorithm for Feasibility Problems (MPFP) is given in Algorithm 1. We refer to the customizations of Algorithm 1 that handle linear, kernelized, and conic feasibility problems as MPLFP, MPKFP, and MPCFP, respectively.

**Algorithm 1** MPFP
1. **Input:** $\omega$-center $z_\omega$, step sizes $\{\gamma_t\}$ and $\epsilon$.
2. **Output:** $z^t (= [x^t; y^t])$.
3. $t = 1$; $v_1 = z_\omega$;
4. **while** $\underline{\phi}(y^t) \leq 0$ and $\epsilon_{\mathrm{sad}}(z^t) > \epsilon$ **do**
5. &nbsp;&nbsp; $w_t = \mathrm{Prox}_{v_t}(\gamma_t F(v_t))$;
6. &nbsp;&nbsp; $v_{t+1} = \mathrm{Prox}_{v_t}(\gamma_t F(w_t))$;
7. &nbsp;&nbsp; $z^t = \left[\sum_{s=1}^t \gamma_s\right]^{-1} \sum_{s=1}^t \gamma_s w_s$;
8. &nbsp;&nbsp; $t = t + 1$;
9. **end while**

The standard customization of MPFP is based on associating a norm $\|\cdot\|_x$ and a d.g.f. $\omega_x(\cdot)$ with the domain $X$, and similarly $\|\cdot\|_y$, $\omega_y(\cdot)$ with the domain $Y$. Then, given two scalars $\alpha_x, \alpha_y > 0$, we build the d.g.f. and $\omega$-center $z_\omega$ for $Z = X \times Y$ as $\omega(z) = \alpha_x \omega_x(x) + \alpha_y \omega_y(y)$ and $z_\omega = [x_{\omega_x}; y_{\omega_y}]$, where $\omega_x(\cdot)$ and $\omega_y(\cdot)$ as well as $x_{\omega_x}$ and $y_{\omega_y}$ are customized based on the geometry of the domains $X$, $Y$. Also, by letting $\xi = [\xi_x; \xi_y]$ and $z = [x; y]$, our prox mapping becomes decomposable as $\mathrm{Prox}_z(\xi) = [\mathrm{Prox}^{\omega_x}_x(\xi_x/\alpha_x); \mathrm{Prox}^{\omega_y}_y(\xi_y/\alpha_y)]$, where $\mathrm{Prox}^{\omega_x}_x(\cdot)$ and $\mathrm{Prox}^{\omega_y}_y(\cdot)$ are respectively the prox mappings w.r.t. $\omega_x(x)$ in the domain $X$ and $\omega_y(y)$ in the domain $Y$. Because of space limitations, we provide the detailed derivation of the prox-mapping operators and the rationale behind our parameter choices $\alpha_x, \alpha_y$ in the appendix.

The convergence rate of the MP algorithm for solving a BSPP was established by (Nemirovski, 2004) as follows:

**Theorem 1** ((Nemirovski, 2004)). Suppose the step sizes in the Mirror Prox algorithm satisfy $\gamma_t = L^{-1}$. Then at every iteration $t \geq 1$, the corresponding solution $z^t = [x^t; y^t]$ satisfies $x^t \in X$, $y^t \in Y$, and we have
$$\overline{\phi}(x^t) - \underline{\phi}(y^t) = \epsilon_{\mathrm{sad}}(z^t) \leq \frac{\Omega L}{t}.$$

## 5. Linear Feasibility Problems

In this section, we first associate a BSPP with the linear feasibility problems LDFP and LAP, and establish close connections between the solutions of these problems. Then we discuss the customization of the general MPFP framework of Section 4 for simultaneously solving LDFP and LAP, and the computational complexity of the resulting MPLFP. We provide the customization of MPFP to kernelized and conic feasibility problems in Section 6. Because of space limitations, all of the proofs are provided in the appendix.

### 5.1. BSPP Formulation for Linear Feasibility Problems

To address the linear feasibility problems LDFP (1) and LAP (2) simultaneously, we consider the BSPP

$$\mathrm{Opt} = \max_{y \in B_m} \ \min_{x \in \Delta_n} \ y^T A x, \quad (5)$$

where the domains $X, Y$ of the variables $x, y$ are $\Delta_n := \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1,\ x \geq 0\}$, i.e., the standard $n$-dimensional simplex, and $B_m := \{y \in \mathbb{R}^m : \|y\|_2 \leq 1\}$, the unit Euclidean ball in $\mathbb{R}^m$, respectively. Then $Z = \Delta_n \times B_m$. Also, $\phi(x, y) = y^T A x$, $\overline{\phi}(x) = \max_{y \in B_m} y^T A x$, $\underline{\phi}(y) = \min_{x \in \Delta_n} y^T A x$, and $F(x, y) = [A^T y; -A x]$.

The connections between LDFP, LAP and the BSPP, and their solutions, play an important role in our main result given in Theorem 3. These are explored in the next section.
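Before moving on, the following is a small NumPy sketch of how the bounds and the saddle point residual of the BSPP (5) are evaluated: $\overline{\phi}(x) = \|Ax\|_2$, $\underline{\phi}(y) = \min_j (A^T y)_j$, and $\epsilon_{\mathrm{sad}} = \overline{\phi}(x) - \underline{\phi}(y)$. The toy data at the end is purely illustrative.

```python
import numpy as np

def upper_bound(A, x):
    """phi_bar(x) = max_{||y||_2 <= 1} y^T A x = ||A x||_2."""
    return np.linalg.norm(A @ x)

def lower_bound(A, y):
    """phi_under(y) = min_{x in simplex} y^T A x = min_j (A^T y)_j."""
    return np.min(A.T @ y)

def saddle_residual(A, x, y):
    """eps_sad(z) = phi_bar(x) - phi_under(y), nonnegative on Delta_n x B_m."""
    return upper_bound(A, x) - lower_bound(A, y)

# Toy usage (illustrative data, not from the paper):
A = np.array([[1.0, 0.6], [0.0, 0.8]])
x = np.array([0.5, 0.5])   # a point of the simplex Delta_2
y = np.array([1.0, 0.0])   # a point of the unit ball B_2
print(saddle_residual(A, x, y))
```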
### 5.2. Connections between LDFP/LAP and BSPP

We first establish a close connection between $\rho(A)$ and the objective value of the BSPP (5) in Lemma 1, and then relate the solutions of the BSPP (5) and LDFP/LAP in Theorem 2.

**Lemma 1.** (a) $\rho(A) = \max_{\|y\|_2 = 1} \min_{x \in \Delta_n} y^T A x$; (b) whenever $\rho(A) > 0$, then $\rho(A) = \max_{\|y\|_2 \leq 1} \min_{x \in \Delta_n} y^T A x = \mathrm{Opt}$.

Moreover, for any primal-dual algorithm solving the BSPP (5), we have the following relations:

**Theorem 2.** Let $z^t = [x^t; y^t]$ be the solution at iteration $t$ of a primal-dual algorithm for solving problem (5). Suppose that the algorithm terminates at step $t$ when either $\underline{\phi}(y^t) > 0$ or $\epsilon_{\mathrm{sad}}(z^t) \leq \epsilon$. Then:
(a) if $\underline{\phi}(y^t) > 0$, then $A^T y^t > 0$; otherwise,
(b) if $\epsilon_{\mathrm{sad}}(z^t) \leq \epsilon$, then $\|A x^t\|_2 \leq \epsilon$.

Theorem 2 reveals that, depending on the sign of $\rho(A)$, the primal-dual solution pair $z^t = [x^t; y^t]$ for the BSPP (5) indeed corresponds to a solution of either LDFP or LAP. In fact, part (a) of the theorem corresponds to the case when LDFP is feasible (since $\rho(A) \geq \underline{\phi}(y^t) > 0$), and $y^t$ is a feasible solution of LDFP; part (b) is essentially the case when $x^t$ is an $\epsilon$-solution of LAP. The merit of a primal-dual algorithm as stated in Theorem 2 is that it can automatically identify whether LDFP or LAP is feasible during the execution and provide the corresponding solution. As a result, we do not need to assess the feasibility of either LDFP or LAP upfront, which is already as difficult as solving the original feasibility problems.

Any primal-dual algorithm capable of solving the BSPP (5) can be customized to fit the conditions of Theorem 2, and thus is suitable for simultaneously solving LDFP/LAP. Yet the convergence rates of these algorithms can differ significantly in terms of their dependence on $\rho(A)$. In Section 5.4, we show that the Mirror Prox algorithm achieves the best known performance in terms of $\rho(A)$. Further generalizations of the MP algorithm to handle kernelized and conic feasibility problems are possible as well. These generalizations retain the same rate of convergence given in Theorem 1, yet they may differ in terms of the a.o. needed for their iterations. We discuss these in Section 6.

### 5.3. MPLFP: Customization of MPFP for LDFP/LAP

For LDFP and LAP, we have $Z = X \times Y = \Delta_n \times B_m$, and hence we pick $\omega_x(x) = \sum_{i=1}^n x_i \ln(x_i)$ and $\omega_y(y) = \frac{1}{2} y^T y$, which leads to, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$,

$$[\mathrm{Prox}^{\omega_x}_x(\zeta)]_i = \frac{x_i \exp\{-\zeta_i\}}{\sum_{k=1}^n x_k \exp\{-\zeta_k\}}, \qquad [\mathrm{Prox}^{\omega_y}_y(\zeta)]_j = \begin{cases} y_j - \zeta_j, & \text{if } \|y - \zeta\|_2 \leq 1,\\[2pt] \dfrac{y_j - \zeta_j}{\|y - \zeta\|_2}, & \text{otherwise}. \end{cases}$$

Therefore, each iteration of Algorithm 1 involves only the computation of $F(\xi) = [A^T \xi_y; -A \xi_x]$ and

$$\mathrm{Prox}_z(\gamma F(\xi)) = \left[\mathrm{Prox}^{\omega_x}_x\!\left(\frac{\gamma A^T \xi_y}{\alpha_x}\right);\ \mathrm{Prox}^{\omega_y}_y\!\left(-\frac{\gamma A \xi_x}{\alpha_y}\right)\right].$$

This choice of d.g.f. leads to $z_\omega = [x_{\omega_x}; y_{\omega_y}]$, where $x_{\omega_x} = \frac{1}{n}\mathbf{1} \in \mathbb{R}^n$ and $y_{\omega_y} = 0_m$, the zero vector in $\mathbb{R}^m$. We compute the associated $\Omega = \Omega_z$ and $L$ following the derivations in the appendix, and set $\gamma_t = \frac{1}{L}$.

### 5.4. Convergence Analysis for MPLFP

In the specific case of the linear feasibility problems LDFP and LAP, following the derivation of $\alpha_x$ and $\alpha_y$ presented in the appendix, one can optimally select $\alpha_x$ and $\alpha_y$ to achieve $\Omega L \leq (2\log n)^{1/2}$. Hence, by combining Theorems 1 and 2, we arrive at the following main result:

**Theorem 3.** Let the step sizes of MPFP be $\gamma_t = L^{-1}$. Then
(a) when $\rho(A) > 0$, MPFP terminates in at most $N = \lceil \Omega L / \rho(A) \rceil + 1 \leq \lceil (2\log n)^{1/2} / \rho(A) \rceil + 1$ iterations with a feasible solution to LDFP given by $y^N$;
(b) when $\rho(A) < 0$, MPFP terminates in at most $N = \lceil \Omega L / \epsilon \rceil + 1 \leq \lceil (2\log n)^{1/2} / \epsilon \rceil + 1$ iterations with an $\epsilon$-solution for LAP given by $x^N$.
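The following is a minimal NumPy sketch of MPLFP, i.e., Algorithm 1 with the customization of Section 5.3 and the termination rule of Theorem 2. The default values of `gamma`, `alpha_x`, `alpha_y` and `max_iter` are illustrative placeholders; the paper sets $\gamma_t = 1/L$ and derives the optimal $\alpha_x, \alpha_y$ in its appendix.

```python
import numpy as np

def prox_simplex(x, zeta):
    # Entropy prox on Delta_n: [x_i exp(-zeta_i)] / sum_k x_k exp(-zeta_k)
    w = x * np.exp(-(zeta - zeta.min()))      # shift for numerical stability
    return w / w.sum()

def prox_ball(y, zeta):
    # Euclidean prox on B_m: project y - zeta onto the unit ball
    w = y - zeta
    nrm = np.linalg.norm(w)
    return w if nrm <= 1.0 else w / nrm

def mplfp(A, eps=1e-3, gamma=0.5, alpha_x=1.0, alpha_y=1.0, max_iter=100000):
    """Sketch of MPLFP for column-normalized A; parameters are illustrative."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.zeros(m)   # omega-center z_omega
    x_sum, y_sum, w_sum = np.zeros(n), np.zeros(m), 0.0
    for t in range(1, max_iter + 1):
        # w_t = Prox_{v_t}(gamma * F(v_t)), with F(x, y) = [A^T y; -A x]
        xw = prox_simplex(x, gamma * (A.T @ y) / alpha_x)
        yw = prox_ball(y, -gamma * (A @ x) / alpha_y)
        # v_{t+1} = Prox_{v_t}(gamma * F(w_t))
        x, y = (prox_simplex(x, gamma * (A.T @ yw) / alpha_x),
                prox_ball(y, -gamma * (A @ xw) / alpha_y))
        # z^t: gamma-weighted average of the intermediate iterates w_t
        x_sum += gamma * xw; y_sum += gamma * yw; w_sum += gamma
        x_avg, y_avg = x_sum / w_sum, y_sum / w_sum
        lower = np.min(A.T @ y_avg)           # phi_under(y^t)
        upper = np.linalg.norm(A @ x_avg)     # phi_bar(x^t)
        if lower > 0:                         # Theorem 2(a): LDFP solved by y^t
            return "LDFP", y_avg
        if upper - lower <= eps:              # Theorem 2(b): eps-solution of LAP
            return "LAP", x_avg
    return "budget exceeded", (x_sum / w_sum, y_sum / w_sum)
```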
**Complexity per iteration of MPLFP.** Each iteration $t$ involves the following elementary computations with the given a.o. complexity: (1) $A x_t$, $A^T y_t$: $O(mn)$; (2) prox mappings: $O(m + n)$; (3) $\overline{\phi}(x^t) = \|A x^t\|_2$: $O(m)$; (4) $\underline{\phi}(y^t) = \min_{i \in \{1,\ldots,n\}} (A_i^T y^t)$: $O(n)$. Therefore, the overall a.o. involved in each iteration of MPLFP is $O(mn)$, which is the same as that of the perceptron algorithm.

## 6. Generalized Feasibility Problems

There are two important extensions of the basic MPFP framework, namely the Kernelized Feasibility Problems (KDFP, KAP) and the Conic Feasibility Problems (CDFP, CAP). Both of these extensions merely require customizing the MPFP framework to the geometry of the problem by using a proper representation or a proper selection of prox mappings. Hence, they can easily be handled within the existing MPFP framework while still enjoying the same convergence rate as MPLFP. To the best of our knowledge, the extension of MP to KDFP achieving $O(\frac{\sqrt{\log n}}{|\rho(\Psi)|})$ ($O(\frac{\sqrt{\log n}}{\epsilon})$ in the case of KAP) performance is novel, and the same performance of MP for the conic feasibility problem (in terms of its dependence on $n$ and $\rho(A)$) is far superior to all of the other competing algorithms.

### 6.1. Kernelized Feasibility Problems

In kernelized classification problems, along with the data $A$, we are given a feature map $\Phi(\cdot) : \mathbb{R}^m \to F$, which maps each data point $A_i$ to a new feature in the Reproducing Kernel Hilbert Space $F = \mathbb{R}^d$. The feature map is specified implicitly via a kernel, $K_\Phi(a, a') := \langle \Phi(a), \Phi(a') \rangle$, with the motivation that explicitly computing $\Phi(a)$ can be rather time consuming, whereas the inner product $\langle \Phi(a), \Phi(a') \rangle$, via the kernel $K_\Phi(a, a')$, is much easier to obtain (see (Schölkopf & Smola, 2002)). Therefore, kernelized algorithms are designed under the assumption that only black-box access to the kernel is available (i.e., our methods work for any kernel, as long as we can compute $K_\Phi(a, a')$ efficiently). By normalization, we assume that $K_\Phi(a, a) = 1$ and $K_\Phi(a, a') \leq 1$ for all $a, a' \in \mathbb{R}^m$, and in our runtime analysis we assume that the kernel matrix associated with the data $A$, i.e., $G_\Phi$ with $[G_\Phi]_{i,j} = K_\Phi(A_i, A_j)$, is known.

Let $Q$ be a diagonal matrix, where $Q_{ii} = 1$ or $-1$ is the label of the $i$-th data point $A_i$. In order to simplify the notation in our analysis, we work with the feature map $\Psi(A_i) = Q_{ii}\Phi(A_i)$, and the resulting kernel is given by $K(A_i, A_j) = \langle \Psi(A_i), \Psi(A_j) \rangle = \langle Q_{ii}\Phi(A_i), Q_{jj}\Phi(A_j) \rangle$, which is the $(i,j)$-th element of the kernel matrix $G := Q^T G_\Phi Q$. For notational convenience, we also let $\Psi := [\Psi(A_1), \ldots, \Psi(A_n)]$. Thus the kernelized classification problem is equivalent to finding a feasible solution to the following Kernelized Dual Feasibility Problem (KDFP): $y^T \Psi > 0$. When no such solution exists, an inseparability certificate is provided by an $\epsilon$-solution for the Kernelized Alternative Problem (KAP): $\Psi x = 0$, $\mathbf{1}^T x = 1$, $x \geq 0$. Therefore, one can equivalently consider the following kernelized BSPP:

$$\max_{y \in B_d} \ \min_{x \in \Delta_n} \ y^T \Psi x.$$

By a simple customization, the MPFP framework can be generalized to solve the kernelized BSPP using only a black-box kernel oracle. Throughout the algorithm, in order to avoid any explicit computation involving $\Psi$, we keep track of two vectors $x_t, g_t \in \mathbb{R}^n$. While the role of $x_t$ in MPKFP remains the same, we use $g_t$ as a surrogate for $y_t$. In particular, we let $g_0 = 0_n$, implying $y_0 = \Psi g_0$, and initialize the algorithm with $z_\omega = [\frac{1}{n}\mathbf{1}; 0_d] = [x_0; y_0]$. In all subsequent iterations, we always maintain the relation $y_t = \Psi g_t$ implicitly. This is the key modification in MPKFP.
The benefit of this modification is that $\Psi^T y_t = \Psi^T \Psi g_t = G g_t$, i.e., $[\Psi^T y_t]_i = \sum_{j=1}^n K(A_i, A_j)(g_t)_j$; hence we can avoid explicit computations involving $\Psi$. In particular, at each iteration $t$, the corresponding kernelized BSPP solution is given by $z^t = \left[\sum_{s=1}^t \gamma_s\right]^{-1} \sum_{s=1}^t \gamma_s [x_s; \Psi g_s]$. Based on $z^t$, the corresponding lower and upper bounds are given by

$$\underline{\phi}(y^t) = \min_{i=1,\ldots,n} (y^t)^T \Psi(A_i) = \min_{i=1,\ldots,n} \sum_{j=1}^n (g^t)_j K(A_j, A_i), \qquad \overline{\phi}(x^t) = \|\Psi x^t\|_2 = \sqrt{(x^t)^T G\, x^t},$$

using only the kernel $K$. Moreover, for $y = \Psi g$, the prox mapping computations are given by

$$\mathrm{Prox}^{\omega_x}_x\!\left(\frac{\gamma \Psi^T \xi_y}{\alpha_x}\right) = \frac{x \odot \eta}{\sum_{i=1}^n (x \odot \eta)_i}, \quad \text{where } \eta = \left[\exp\Big\{-\tfrac{\gamma (Gg)_1}{\alpha_x}\Big\}, \ldots, \exp\Big\{-\tfrac{\gamma (Gg)_n}{\alpha_x}\Big\}\right]$$

and $\odot$ is the elementwise product, and

$$\mathrm{Prox}^{\omega_y}_y\!\left(-\frac{\gamma \Psi \xi_x}{\alpha_y}\right) = \begin{cases} \Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right), & \text{if } \left\|\Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right)\right\|_2 \leq 1,\\[4pt] \Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right) \Big/ \left\|\Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right)\right\|_2, & \text{otherwise}. \end{cases}$$

Note that the $\mathrm{Prox}^{\omega_x}_x(\cdot)$ computation does not use $\Psi$, and in $\mathrm{Prox}^{\omega_y}_y(\cdot)$ we have $\|\Psi(g + \frac{\gamma\xi_x}{\alpha_y})\|_2^2 = (g + \frac{\gamma\xi_x}{\alpha_y})^T G (g + \frac{\gamma\xi_x}{\alpha_y})$, so it relies only on the kernel function $K$. Hence it is sufficient to keep only the (perhaps normalized) coefficient vector $g + \frac{\gamma\xi_x}{\alpha_y}$ after each $\mathrm{Prox}^{\omega_y}_y(\cdot)$ operation.

**Complexity.** (1) Per iteration: given $K$, the complexity of each iteration is $O(n^2)$, due to the computations of $G g_t$, $(g + \frac{\gamma\xi_x}{\alpha_y})^T G (g + \frac{\gamma\xi_x}{\alpha_y})$, $\underline{\phi}(y^t)$ and $\overline{\phi}(x^t)$. (2) Convergence rate: as the domains $X = \Delta_n$ and $Y = B_d$ remain the same as in MPLFP, MPKFP shares the same convergence rate as the former, except that $\rho(A)$ is replaced by $\rho(\Psi)$.

**Remark.** Let $(x^*, g^*)$ be the output of MPKFP. If the data is separable in the feature space, we use $g^*$ for classification as follows: for a new test data point $A_0$, its class label is determined by the sign of $(\Phi(A_0))^T \Psi g^* = \sum_{i=1}^n Q_{ii} \langle \Phi(A_0), \Phi(A_i) \rangle g^*_i = \sum_{i=1}^n Q_{ii} K_\Phi(A_0, A_i) g^*_i$. Otherwise, the $\epsilon$-inseparability certificate, i.e., an $\epsilon$-solution for KAP, is given by $Q x^*$.
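To summarize the kernel-only bookkeeping of this subsection, the following is a small NumPy sketch of the quantities MPKFP evaluates through the Gram matrix $G = \Psi^T\Psi$ while holding $y = \Psi g$ implicitly. The function names and the generic increment `delta` (standing for the coefficient-space step $\gamma\xi_x/\alpha_y$) are illustrative helpers, not the paper's implementation.

```python
import numpy as np

def kernel_bounds(G, x, g):
    """Bounds for the kernelized BSPP using only G = Psi^T Psi:
    phi_under(y) = min_i (G g)_i   (with y = Psi g held implicitly),
    phi_bar(x)   = ||Psi x||_2 = sqrt(x^T G x)."""
    lower = np.min(G @ g)
    upper = np.sqrt(max(x @ (G @ x), 0.0))   # guard tiny negative round-off
    return lower, upper

def kernel_ball_prox_coeffs(G, g, delta):
    """y-prox in coefficient space: with y = Psi g and step Psi delta,
    keep h = g + delta if ||Psi h||_2 <= 1, else rescale h by 1/||Psi h||_2."""
    h = g + delta
    nrm = np.sqrt(max(h @ (G @ h), 0.0))
    return h if nrm <= 1.0 else h / nrm
```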
### 6.2. Conic Feasibility Problems

The second extension we discuss is the Conic Dual Feasibility Problem (CDFP) of the form

$$A^* y \in \mathrm{int}(K^*), \quad (6)$$

where $A^*$ is the conjugate linear map of a linear map $A$, and $K \subseteq \mathbb{R}^n$ is a proper (closed, convex, pointed, with nonempty interior) cone with dual cone $K^*$. Note that $A^* y \in \mathrm{int}(K^*)$ is equivalent to requiring $\langle x, A^* y \rangle > 0$ for all $x \in K \setminus \{0\}$. Let $e$ be a vector from $\mathrm{int}(K^*)$; as the selection of $e$ depends on $K$, below we specify it separately for each $K$. Define $\Delta(K) = \{x \in K : \langle e, x \rangle = 1\}$, the base of the cone $K$ given by the vector $e$. Hence $\langle x, A^* y \rangle > 0$ for all $x \in K \setminus \{0\}$ is equivalent to $\min_x \{\langle x, A^* y \rangle : x \in \Delta(K)\} > 0$. Moreover, the Conic Alternative Problem (CAP) is given by

$$Ax = 0, \quad \langle e, x \rangle = 1, \quad x \in K. \quad (7)$$

Based on the same rationale as in the previous sections, we consider the following conic version of the BSPP:

$$\max_{y \in B_m} \ \min_{x \in \Delta(K)} \ y^T A x. \quad (8)$$

Once again, (8), analogous to (5), is a BSPP, albeit with different domains, and hence it can still be solved within the MPFP framework, achieving the same iteration complexity given in Theorem 3. The main customization of MPFP applied to (8), as opposed to (5), is in selecting the various algorithm parameters and the prox-mapping operators for the corresponding domains. In particular, one needs to specify an efficient prox-mapping operator customized to the domain $\Delta(K)$. We present several interesting cases of $K$ for which such an efficient prox-mapping onto $\Delta(K)$ exists. These include $K = \mathbb{R}^n_+$, the nonnegative orthant; $K = L^n = \{x \in \mathbb{R}^n : x_n \geq \sqrt{x_1^2 + \ldots + x_{n-1}^2}\}$, the second-order cone; and $K = S^n_+ = \{X \in \mathbb{R}^{n \times n} : X = X^T,\ a^T X a \geq 0\ \forall a \in \mathbb{R}^n\}$, the positive semidefinite cone. The basic settings for these cases, as suggested in (Nemirovski, 2004; Juditsky et al., 2013), are as follows. A sketch of the second-order-cone prox mapping appears after this list.

**Nonnegative orthant:** $K = K^* = \mathbb{R}^n_+$. This is precisely the case we have addressed for the linear feasibility problems (see Section 5.3).

**Second-order cone:** $K = K^* = L^n$. Here $e = [0, \ldots, 0, 1]^T$ and $\Delta(K) = B_{n-1} \times \{1\}$. d.g.f.: $\omega(x) = \frac{1}{2}\sum_{i=1}^{n-1} x_i^2$; $\Omega_x \leq \frac{1}{2}$ with $x_{\omega_x} = [0, \ldots, 0, 1]^T$. Given $x \in \Delta(K)$ and $\xi \in \mathbb{R}^n$, for $i = 1, \ldots, n-1$,

$$[\mathrm{Prox}_x(\xi)]_i = \begin{cases} x_i - \xi_i, & \text{if } \sum_{i=1}^{n-1}(x_i - \xi_i)^2 \leq 1,\\[2pt] \dfrac{x_i - \xi_i}{\sqrt{\sum_{i=1}^{n-1}(x_i - \xi_i)^2}}, & \text{otherwise}, \end{cases}$$

and $[\mathrm{Prox}_x(\xi)]_n = 1$. Note that the computational complexity of this prox mapping is $O(n)$, as discussed in the MPLFP setup (see Section 5.3).

**Positive semidefinite cone:** $K = K^* = S^n_+$. For any two symmetric matrices $A, B$, we let $\langle A, B \rangle = \mathrm{Tr}(AB)$ be the corresponding inner product. Here $e = I_n$ and $\Delta(K) = \{X \in S^n_+ : X \succeq 0,\ \mathrm{Tr}(X) = 1\}$ (the flat spectrahedron). d.g.f.: the matrix entropy $\omega(X) = \mathrm{Entr}(\lambda(X)) = \sum_i \lambda_i(X)\log(\lambda_i(X))$, where $\lambda_i(X)$ is the $i$-th eigenvalue of $X$; the resulting $\Omega_X \leq 4\log(n)$ with $x_{\omega_x} = \frac{1}{n} I_n$. Given $X \in S^n_+$ and $\Xi \in S^n$, computing $\mathrm{Prox}_X(\Xi)$ reduces to computing the eigenvalue decomposition of the matrix $X$, which also yields $\omega'(X)$, and a subsequent eigenvalue decomposition of the matrix $H = \Xi - \omega'(X)$. Let $H = U\mathrm{Diag}(h)U^T$ be the eigenvalue decomposition of $H$, where $\mathrm{Diag}(h)$ stands for the diagonal matrix with diagonal elements from $h$. Then computing $\mathrm{Prox}_X(\Xi)$ reduces to constructing the matrix $W = U\mathrm{Diag}(w)U^T$, where $w = \operatorname*{argmin}_{z \in \mathbb{R}^n}\{\langle \mathrm{Diag}(h), \mathrm{Diag}(z)\rangle + \omega(\mathrm{Diag}(z)) : \mathrm{Diag}(z) \in \Delta(K)\}$, which is identical to the computation of the prox function in the simplex setup. The computational complexity of this prox mapping is $O(n^3)$ per iteration, due to the two singular value decompositions ($O(n^3)$) and simple a.o. of $O(n)$.

**Discussion of MPCFP and conic ISPVN.** The MPFP framework is quite flexible: in particular, it can easily be adjusted to handle a variety of domains $Z$ by properly selecting $\omega(\cdot)$ and thus $\mathrm{Prox}_z(\cdot)$. Besides, for the aforementioned cones, the computational cost of each iteration of MPCFP is quite low. While ISPVN is capable of handling the same conic feasibility problems, the analysis of (Soheili & Peña, 2013; Ramdas & Peña, 2014) is based on Euclidean d.g.f.'s, and thus the respective operations involved for projections onto these domains are much more expensive (involving sorting operations); hence they are inferior to the ones presented above.
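As an illustration of the conic customization, the following is a minimal NumPy sketch of the prox mapping onto the second-order-cone base $\Delta(K) = B_{n-1} \times \{1\}$ described in the list above; the function name is illustrative.

```python
import numpy as np

def prox_soc_base(x, xi):
    """Prox mapping onto the second-order-cone base Delta(K) = B_{n-1} x {1}
    with the Euclidean d.g.f. of Section 6.2: project the first n-1 coordinates
    of x - xi onto the unit ball and pin the last coordinate to 1."""
    w = x[:-1] - xi[:-1]
    nrm = np.linalg.norm(w)
    if nrm > 1.0:
        w = w / nrm
    return np.concatenate([w, [1.0]])
```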
## 7. Numerical Experiments

In this section, we conduct an empirical study of the MPFP framework and compare it with other baseline algorithms, using both synthetic data (for MPLFP and LAP) and real data (for MPKFP). All the codes are written in Matlab 2011b and are run in a single-threaded fashion on a Linux server with 6 dual-core 2.8 GHz CPUs and 64 GB memory. In the implementation of MPLFP and MPKFP, we select $\alpha_x, \alpha_y$ as described in the appendix, for a normalized matrix $A$, i.e., $\|A_i\|_2 = 1$ for all $i$. We implemented the Smooth Perceptron (SPCT) (Soheili & Peña, 2012), the normalized Perceptron (PCT) (Ramdas & Peña, 2014) and the normalized von Neumann (VN) algorithm (Soheili & Peña, 2012) for comparison. We also implemented ISPVN (Soheili & Peña, 2013) and ISNKPVN (Ramdas & Peña, 2014), but as the authors do not provide a parameter selection strategy, we searched over the parameter space and chose the value with the best performance, i.e., the parameter denoted by $\gamma$ in (Soheili & Peña, 2013) (Theorem 1) was set to 2. We note that our numerical study is based on comparing basic implementations of all of these algorithms, including ours. There may be ways to obtain small improvements in computational performance for any one of them. Nevertheless, guided by our theoretical findings, we do not expect any major changes in the overall conclusions drawn.

### 7.1. Synthetic Data

We first compare the performance of the different algorithms (non-kernelized versions) on synthetic data. In our instance generation for LDFP, $\kappa$ serves as a proxy for the condition number $\rho(A)$, i.e., a larger $\kappa$ corresponds to a larger $\rho(A)$ in (3). We provide details of our instance generation for both LDFP and LAP in the appendix.

**Experimental setup.** We test the methods on a wide spectrum of $(m, n)$ combinations. Given a combination, we generate 20 instances for LDFP and 20 for LAP, on which all the methods are tested. For each method, the iteration limit is $10^6$ and the runtime limit is $5 \times 10^4$ s. When either one of these limits is reached, we report a "TO" (timeout). We empirically observe that, occasionally, SPCT suffers from numerical overflow, and ISPVN frequently falls into a dead loop. We report a "Fail" in the table if a method fails on all of the 20 instances. Table 2 summarizes the average runtime of each algorithm over the instances that the algorithm successfully solved. Because of space limitations, we present the number of iterations for LDFP, the effect of $\rho(A)$, and our results on LAP instances in the appendix.

**Performance on LDFP.** We set $\kappa = 1$ for the LDFP case and, by varying $m, n$, we generate normalized instances $A \in \mathbb{R}^{m \times n}$ such that $\|A_j\|_2 = 1$ for all $j$. In all of the instances generated, the ISPVN algorithm failed, and hence we omitted it from the comparison. As indicated by Table 2, MPLFP outperforms all the other methods in terms of runtime for every pair $(m, n)$. The runner-up algorithm, SPCT, is clearly much slower, while PCT and VN perform equally worse. Moreover, as $(m, n)$ increases, the time savings achieved by MPLFP become more significant. More importantly, even on very large, high-dimensional data ($m = 10^3$, $n = 5 \times 10^6$), MPLFP finds a feasible solution for LDFP within 10 hours, while the baseline methods either run out of time (PCT and VN) or fail (SPCT and ISPVN) on large-scale problems ($m = 10^2$, $n = 5 \times 10^5$). This highlights the suitability of MPLFP for large-scale learning.

| $(m, n)$ | MPLFP | SPCT | PCT | VN |
|---|---|---|---|---|
| $(10^2, 5 \times 10^3)$ | 1.3 | 4.3 | 98.5 | 127.3 |
| $(10^3, 5 \times 10^3)$ | 5.5 | 16.2 | 142.2 | 141.5 |
| $(10^2, 5 \times 10^4)$ | 112.7 | 455.0 | TO | TO |
| $(10^3, 5 \times 10^4)$ | 318.1 | 1301.0 | TO | TO |
| $(10^2, 5 \times 10^5)$ | 25506 | Fail | TO | TO |
| $(10^3, 5 \times 10^5)$ | 22471 | Fail | TO | TO |

Table 2. Runtime (seconds) of all methods for LDFP with a column-normalized matrix $A$ ($\kappa = 1$). The iteration limit is $10^6$ and the runtime limit is $5 \times 10^4$ s.

**The effect of ρ(A).** We tested the performance of the algorithms by varying $\kappa$, as a proxy for $\rho(A)$, on instances with $(m, n) = (100, 5000)$ and $\|A_j\|_2 = 1$ for all $j$. For various values of $\kappa \in [1, 10]$ (i.e., for feasible LDFP), the runtime of each algorithm is reported in Figure 1. It is not surprising that as $\kappa$ (and hence $\rho(A)$) becomes larger, which means the data become more separable, the runtime decreases for all of the methods. Nonetheless, MPLFP is still considerably faster than its closest competitor in all cases.

[Figure 1. $\kappa$ (a proxy for $\rho(A)$) vs. runtime of MPLFP, SPCT, PCT and VN.]

### 7.2. Real Data: CIFAR-10 Image Classification

We test the efficiency of MPKFP in training classifiers with the kernel method. For comparison, we also implement the Kernelized Perceptron (KPCT), the Kernelized von Neumann (KVN) algorithm, and ISNKPVN (Ramdas & Peña, 2014). Our testbed is the CIFAR-10 dataset (Krizhevsky, 2009) (http://www.cs.utoronto.ca/~kriz/cifar.html),
which contains 60,000 color images in 10 mutually exclusive classes, with 6,000 images per class. Each image $A_i$ is encoded as an $m = 3072$-dimensional vector. (To achieve a lower prediction error rate, one could use other, more advanced features, which is not the focus of this paper.) We evenly pick around 1,000 images from each class to construct a training set of $n = 10{,}000$ images in total, and repeat this process to form a test set of the same size. We train a one-vs-rest binary classifier for each class, and hence obtain 10 classifiers in total. After that, each test image is run through all the classifiers, and its predicted label corresponds to the classifier that gives the largest output value. We choose the Radial Basis Function (RBF) kernel given by $K_\Phi(A_i, A_j) = \exp\{-5.5\,\|A_i - A_j\|_2\}$. We run all of the algorithms and, at different pre-specified iteration limits $N \in \{10, 32, 100, 320, 1000\}$, we collect their respective solutions. We treat these solutions as the corresponding classifiers and benchmark their prediction quality on the test set.

Figure 2 shows how the prediction error rate drops as the number of training iterations per classifier increases. It can be seen that, within the same iteration limit, MPKFP outperforms all the other methods in terms of prediction error rate. In particular, the error rate obtained by the perceptron after 1,000 iterations is still higher than that achieved within merely 10 iterations by MPKFP, which highlights the suitability of MPKFP for fast classifier training.

[Figure 2. Number of training iterations per classifier vs. prediction error rate for MPKFP, ISNKPVN, KPCT and KVN.]

## 8. Conclusions

We build a framework based on BSPP to either find a binary classifier (DFP) or return a near-infeasibility certificate if there is none (AP). We adopt the accelerated primal-dual algorithm, Mirror Prox, to solve the BSPP and achieve a convergence rate of $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ for DFP and $O(\frac{\sqrt{\log n}}{\epsilon})$ for AP. Our results also extend to general kernelized and conic problems, all of which share the same rate of convergence. We further confirm the efficiency and numerical stability of our approach via a numerical study on both synthetic and real data. Future work includes exploring whether these methods can be modified to achieve sublinear-time behavior, or to allow for rapid incremental retraining in active learning scenarios.

## References

Blum, Avrim and Dunagan, John. Smoothed analysis of the perceptron algorithm for linear programming. In SODA, pp. 905-914, 2002.

Chvatal, Vasek. Linear Programming. Macmillan, 1983.

Cotter, Andrew, Shalev-Shwartz, Shai, and Srebro, Nathan. The kernelized stochastic batch perceptron. In ICML, 2012.

Cristianini, Nello and Shawe-Taylor, John. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

Dantzig, George Bernard. An ε-precise feasible solution to a linear program with a convexity constraint in 1/ε² iterations independent of problem size. Technical Report 92-5, Stanford University, 1992.

Epelman, Marina and Freund, Robert M. Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program., 88(3):451-485, 2000.

Juditsky, Anatoli, Kılınç-Karzan, Fatma, and Nemirovski, Arkadi. Randomized first order algorithms with applications to ℓ1-minimization. Math. Program., 142(1-2):269-310, 2013.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Li, Dan and Terlaky, Tamás. The duality between the perceptron algorithm and the von Neumann algorithm. In Modeling and Optimization: Theory and Applications, volume 62, pp. 113-136, 2013.
Nemirovski, Arkadi. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.

Nesterov, Yurii. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235-249, 2005.

Novikoff, Albert B. J. On convergence proofs for perceptrons. Technical report, 1962.

Ramdas, Aaditya and Peña, Javier. Margins, kernels and non-linear smoothed perceptrons. In ICML, 2014.

Rosenblatt, Frank. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.

Saha, Ankan, Vishwanathan, S. V. N., and Zhang, Xinhua. New approximation algorithms for minimum enclosing convex shapes. In SODA, pp. 1146-1160, 2011.

Schölkopf, Bernhard and Smola, Alex J. Learning with Kernels. The MIT Press, 2002.

Soheili, Negar and Peña, Javier. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728-737, 2012.

Soheili, Negar and Peña, Javier. A primal-dual smooth perceptron-von Neumann algorithm. In Discrete Geometry and Optimization, volume 69, pp. 303-320, 2013.