# Saddle Points and Accelerated Perceptron Algorithms

Adams Wei Yu (WEIYU@CS.CMU.EDU)
Fatma Kılınç-Karzan (FKILINC@ANDREW.CMU.EDU)
Jaime G. Carbonell (JGC@CS.CMU.EDU)

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA

**Abstract**

In this paper, we consider the problem of finding a linear (binary) classifier or providing a near-infeasibility certificate if there is none. We bring a new perspective to addressing these two problems simultaneously in a single efficient process, by investigating a related Bilinear Saddle Point Problem (BSPP). More specifically, we show that a BSPP-based approach provides either a linear classifier or an $\epsilon$-infeasibility certificate. We show that the accelerated primal-dual algorithm, Mirror Prox, can be used for this purpose and achieves the best known convergence rate of $O(\frac{\sqrt{\log n}}{\rho(A)})$ ($O(\frac{\sqrt{\log n}}{\epsilon})$), which is almost independent of the problem size, $n$. Our framework also solves kernelized and conic versions of the problem, with the same rate of convergence. We support our theoretical findings with an empirical study on synthetic and real data, highlighting the efficiency and numerical stability of our algorithm, especially on large-scale instances.

## 1. Introduction

One of the central tasks in supervised learning is to train a binary classifier: given a training data set $A \in \mathbb{R}^{m \times n}$ of size $n$, where each column $A_j \in \mathbb{R}^m$ is an instance with label either $1$ or $-1$, find a binary (linear) classifier that separates the data into two groups based on their labels. Without loss of generality, we assume that $A$ does not contain a zero column. After reversing the sign of each negative instance (Blum & Dunagan, 2002) (for simplicity, we still denote the data matrix by $A$), training a classifier is equivalent to finding a feasible solution to the following system of homogeneous linear inequalities w.r.t. $y \in \mathbb{R}^m$:

$$A^T y > 0. \quad (1)$$

We refer to (1) as the Linear Dual Feasibility Problem (LDFP), and to its solutions as linear classifiers or feasibility (separability) certificates. When the training data is not linearly separable, an infeasibility (inseparability) certificate is given by a solution to the Linear Alternative Problem (LAP) of (1):

$$Ax = 0, \quad \mathbf{1}^T x = 1, \quad x \geq 0, \quad (2)$$

where $\mathbf{1} \in \mathbb{R}^n$ denotes the vector with all coordinates equal to 1. LDFP and LAP are Linear Feasibility Problems (LFP) that are dual to each other: (1) is feasible if and only if (2) is infeasible (by Gordan's Theorem (Chvatal, 1983)). In particular, a feasible solution of one problem is an infeasibility certificate of the other. Often, instead of seeking an exact infeasibility certificate, it is more practical to opt for an $\epsilon$-infeasibility certificate for LDFP, i.e., to find an $\epsilon$-solution for LAP, $x_\epsilon$, such that $\|Ax_\epsilon\|_2 \leq \epsilon$, $\mathbf{1}^T x_\epsilon = 1$, $x_\epsilon \geq 0$. Without loss of generality, we assume $\|A_j\|_2 = 1$ for all $j$. Such a transformation does not change the feasibility status of either LDFP or LAP, but it simplifies our analysis.

Given that (1) and (2) are convex (in fact just linear) optimization problems, quite a few algorithms, including polynomial-time Interior Point Methods (IPMs), can be used to solve them. Nonetheless, in real-life classification scenarios, these problems, especially the ones with dense data, e.g., dense $A$ matrices, remain challenging for IPMs.
In these cases, each iteration of IPMs requires $O(n^3)$ arithmetic operations (a.o.), resulting in unacceptable overall runtimes. As a result, algorithms with computationally cheap iterations, i.e., first-order methods (FOMs) like gradient descent, are more attractive despite their inferior rates of convergence. In fact, the first algorithm suggested to solve (1), the perceptron algorithm of (Rosenblatt, 1958), is precisely from this class. This class of FOM algorithms, including the perceptron and its variants, involves only elementary operations in each iteration. The time complexity per iteration is dominated by a simple matrix-vector multiplication, and thus it is $O(mn)$. Due to their scalability, our focus in this paper is also limited to this class of FOMs.

Most of the FOMs directly address either (1) or (2) separately, or both simultaneously. Assuming that the a.o. involved in their iterations are of the same order, one can compare the efficiency of these algorithms based on their rate of convergence, i.e., the number of iterations needed to find a feasible solution for LDFP or an $\epsilon$-solution for LAP. The convergence rates of these algorithms are usually measured in terms of the parameters $n$, $m$, $\epsilon$, and the margin, $\rho(A)$:

$$\rho(A) := \max_{\|u\|_2 = 1} \ \min_{j=1,\ldots,n} \frac{u^T A_j}{\|A_j\|_2}. \quad (3)$$

For example, (Novikoff, 1962) established that the rate of convergence of the perceptron algorithm is $O(\frac{1}{\rho(A)^2})$. Among these quantities, $\rho(A)$ in fact provides a measure of the difficulty of solving LDFP or LAP, or equivalently of determining the separability of the data $A$. LDFP is feasible if $\rho(A) > 0$, and LAP is feasible if $\rho(A) < 0$ (see (Li & Terlaky, 2013)). The smaller its magnitude $|\rho(A)|$, the harder it is to solve the corresponding problem. With our normalization ($\|A_j\|_2 = 1$ for all $j$), we have $|\rho(A)| \leq 1$. Unfortunately, in real classification scenarios, $|\rho(A)|$ is rather tiny, so the complexity $O(\frac{1}{\rho(A)^2})$ of the original perceptron algorithm is not so promising, even though it is completely independent of the size $n$ of the problem.

In this paper, we suggest a single algorithm which solves (1) or (2) simultaneously by providing either a feasibility (separability) certificate or an almost-infeasibility (inseparability) certificate. While doing so, our algorithm not only enjoys the best rate of convergence in terms of its dependence on $\rho(A)$, but also retains the simplicity of each iteration. To achieve this, we circumvent dealing directly with (1) or (2), and thus avoid making any assumptions on the feasibility status of either one of them. Instead, bringing a new perspective and a simplified analysis, we show that by solving a related Bilinear Saddle Point Problem (BSPP) via a primal-dual algorithm, one can obtain either a feasible solution for LDFP or an $\epsilon$-solution for LAP, depending on the feasibility status of the respective problem. To solve the BSPP, we adopt the Mirror Prox (MP) algorithm, an accelerated primal-dual FOM introduced by (Nemirovski, 2004). While our suggested framework, Mirror Prox for Feasibility Problems (MPFP), maintains the same iteration efficiency as the perceptron, i.e., $O(mn)$ per iteration, we establish that MP achieves a convergence rate of $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ for LDFP ($O(\frac{\sqrt{\log n}}{\epsilon})$ for LAP). To the best of our knowledge, this is the first algorithm that simultaneously solves both of these problems at these rates. Unlike the perceptron algorithm, the overall rate of convergence of the MP algorithm has a very mild dependence, $O(\sqrt{\log n})$, on the number of training data.
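To make the baseline concrete, the following is a minimal NumPy sketch of the classical perceptron update for (1); it assumes the columns of $A$ have already been sign-adjusted and normalized as described above, and the function name and iteration budget are illustrative rather than taken from the paper.

```python
import numpy as np

def perceptron(A, max_iter=100000):
    """Classical perceptron for A^T y > 0.
    Columns of A are assumed unit-norm, with negative instances sign-flipped.
    Returns a separator y if found within max_iter updates, else None."""
    m, n = A.shape
    y = np.zeros(m)
    for _ in range(max_iter):
        margins = A.T @ y          # margins of all n instances
        j = np.argmin(margins)     # most violated instance
        if margins[j] > 0:         # A^T y > 0 componentwise: LDFP solved
            return y
        y += A[:, j]               # perceptron update on the violator
    return None                    # no separator found within the budget
```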
Nonetheless, $O(\sqrt{\log n})$ is in fact almost a constant factor even for extremely large $n$, such as $n \in [10^{10}, 10^{20}]$, and thus we claim (and also empirically show) that MP has quite competitive performance on large-scale classification problems. Note that such a dependency is consistent with the existing literature in the case of LDFP. It is also strictly better (in terms of convergence rates) than the recent results for LAP, because MP has a factor of only $O(\sqrt{\log n})$, whereas the competing algorithms have a factor of $O(\sqrt{n})$, significantly limiting their computational performance. We further confirm the efficiency and scalability of our algorithms via a numerical study on both synthetic and real data. In addition to their excellent theoretical and practical performance, our study also reveals that MP-based algorithms are numerically more stable than the other state-of-the-art methods.

MPFP also offers great flexibility in adjusting to the geometry of the problem. In particular, we show that by proper customization, our framework can easily be extended to handle two important generalizations: kernelized feasibility problems and general conic feasibility problems. In both of these generalizations, we maintain the same rate of convergence as the original version while also retaining a cheap iteration cost.

The connections of classification with saddle point problems, and the efficiency and flexibility of the MP framework discussed in this paper, open up further strikingly important possibilities for acceleration based on randomization. In particular, the sublinear-time behavior achieved by the MP variants introduced in (Juditsky et al., 2013) is of great interest for future research.

## 2. Related Work

There is an extensive literature on deterministic and stochastic FOMs for solving classification problems (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002; Cotter et al., 2012), as well as on the perceptron algorithm and its variants. In this paper, we limit our focus to deterministic FOMs and the most relevant literature. We categorize the representative literature into two parts in terms of the types of certificates, i.e., solutions to LDFP and LAP, that the corresponding algorithms provide. Table 1 summarizes the rates of convergence of these algorithms. We note that the overall number of a.o. involved in each iteration of any one of these algorithms is $O(mn)$, and thus the comparison presented in Table 1 is in fact meaningful.

| Algorithm | LDFP | LAP $\epsilon$-solution |
|---|---|---|
| PCT | $O(\frac{1}{\rho(A)^2})$ | N/A |
| SPCT | $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ | N/A |
| VN | $O(\frac{1}{\rho(A)^2})$ | $O(\min(\frac{1}{\epsilon^2}, \frac{1}{\rho(A)^2}\log\frac{1}{\epsilon}))$ |
| ISPVN | $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{|\rho(A)|})$ | $O(\frac{\sqrt{n}}{\max\{|\rho(A)|,\epsilon\}}\log\frac{1}{\epsilon})$ |
| MPLFP (this paper) | $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ | $O(\frac{\sqrt{\log n}}{\epsilon})$ |

Table 1. Summary of the convergence rates of different algorithms.

The first category of algorithms in Table 1 assumes that LDFP is feasible and only provides solutions for LDFP, i.e., feasibility (separability) certificates. The original perceptron (PCT) algorithm (Rosenblatt, 1958), with its $O(\frac{1}{\rho(A)^2})$ rate of convergence, is an example of this type of algorithm. As observed in (Saha et al., 2011), acceleration of the perceptron algorithm via (Nesterov, 2005)'s extra-gradient-based smoothing technique is possible. This technique underlies the Smooth Perceptron (SPCT) algorithm suggested by (Soheili & Peña, 2012), which terminates in at most $\frac{2\sqrt{\log n}}{\rho(A)} - 1$ iterations. Under the assumption that LDFP is feasible, SPCT achieves the same theoretical iteration complexity as MPFP.
However, SPCT does not provide infeasibility (inseparability) certificates, and verifying the feasibility status of LDFP is equivalent to solving LDFP.

Another vein of algorithms aims to either solve LDFP or provide an $\epsilon$-solution to LAP simultaneously, without any assumption on their feasibility status beforehand. The von Neumann (VN) algorithm (Dantzig, 1992) is a well-known example of such a method. Combining the analyses of (Dantzig, 1992) and (Epelman & Freund, 2000), one can conclude that (i) if LAP is feasible, then in at most $O(\min(\frac{1}{\epsilon^2}, \frac{1}{\rho(A)^2}\log\frac{1}{\epsilon}))$ iterations VN returns an $\epsilon$-solution; (ii) if LAP is infeasible, then VN provides a feasible solution to LDFP within $O(\frac{1}{\rho(A)^2})$ iterations. Recently, based on the SPCT and VN algorithms, (Soheili & Peña, 2013) suggested the Iterated Smooth Perceptron-Von Neumann (ISPVN) algorithm. ISPVN finds a feasible solution for LDFP within $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{|\rho(A)|})$ iterations provided that LDFP is feasible, or otherwise gives an $\epsilon$-solution to LAP within $O(\frac{\sqrt{n}}{|\rho(A)|}\log\frac{1}{\epsilon})$ iterations. An extension of ISPVN to the kernelized setting, with the same convergence rate, is also possible (cf. Theorem 4 in (Ramdas & Peña, 2014)). We note that, compared to ISPVN, MPFP achieves significantly better performance in terms of its dependence on both the dimension $n$ of the problem ($O(\sqrt{\log n})$ as compared to $O(\sqrt{n})$) and the condition number $\rho(A)$. On the other hand, the complexities of MPFP and ISPVN for finding an $\epsilon$-solution of LAP indicate a tradeoff between $n$, $\rho(A)$ and $\epsilon$, so they are not directly comparable. Moreover, the improved complexity of MPFP in terms of its dependence on $n$ also gives an affirmative answer to the conjecture of (Ramdas & Peña, 2014).

## 3. Notation and Preliminaries

We first introduce some notation and key concepts related to our setup and analysis. Throughout this paper, we use Matlab notation for vectors and matrices, i.e., $[x; y]$ denotes the concatenation of two column vectors $x, y$.

The Bilinear Saddle Point Problem (BSPP) forms the backbone of our analysis. In its most general form, a BSPP is defined as

$$\max_{y \in Y} \ \min_{x \in X} \ \phi(x, y) \quad (S)$$

where $\phi(x, y) = \upsilon + \langle a_1, x \rangle + \langle a_2, y \rangle + \langle y, Bx \rangle$; $X, Y$ are nonempty convex compact sets in Euclidean spaces $E_x$, $E_y$, and $Z := X \times Y$, hence $\phi(x, y) : Z \to \mathbb{R}$. Note that (S) gives rise to two convex optimization problems that are dual to each other:

$$\mathrm{Opt}(P) = \min_{x \in X} \left[\overline{\phi}(x) := \max_{y \in Y} \phi(x, y)\right] \quad (P)$$
$$\mathrm{Opt}(D) = \max_{y \in Y} \left[\underline{\phi}(y) := \min_{x \in X} \phi(x, y)\right] \quad (D)$$

with $\mathrm{Opt}(P) = \mathrm{Opt}(D) = \mathrm{Opt}$, and to the variational inequality (v.i.): find $z_* \in Z$ such that

$$\langle F(z), z - z_* \rangle \geq 0 \quad \text{for all } z \in Z, \quad (4)$$

where $F : Z \to E_x \times E_y$ is the affine monotone operator defined by

$$F(x, y) = \left[ F_x(y) := \frac{\partial \phi(x, y)}{\partial x};\ F_y(x) := -\frac{\partial \phi(x, y)}{\partial y} \right].$$

It is well known that the solutions to (S), i.e., the saddle points of $\phi$ on $X \times Y$, are exactly the pairs $z_* = [x_*; y_*]$ comprised of optimal solutions to problems (P) and (D). They are also solutions to the v.i. (4). For the BSPP (S), the accuracy of a candidate solution $z = [x; y]$ is quantified by the saddle point residual

$$\epsilon_{\mathrm{sad}}(z) = \overline{\phi}(x) - \underline{\phi}(y) = \left[\overline{\phi}(x) - \mathrm{Opt}(P)\right] + \left[\mathrm{Opt}(D) - \underline{\phi}(y)\right].$$

## 4. General Mirror Prox Framework

The MP algorithm is quite flexible in terms of adjusting to the geometry of the problem, characterized by the domain of the BSPP (S), i.e., $X$, $Y$. The following components are standard in forming the MP setup for given domains $X$, $Y$, and in analyzing its convergence rate:

- **Norm:** $\|\cdot\|$ on the Euclidean space $E$ where the domain $Z = X \times Y$ of (S) lives, along with the dual norm $\|\zeta\|_* = \max_{\|z\| \leq 1} \langle \zeta, z \rangle$.
- **Distance-generating function (d.g.f.):** $\omega(z)$, which is convex and continuous on $Z$, admits a continuous selection $\omega'(z)$ of subgradients on the set $Z^o = \{z \in Z : \partial\omega(z) \neq \emptyset\}$ (here $\partial\omega(z)$ is the subdifferential of $\omega$ at $z$), and is strictly convex with modulus 1 w.r.t. $\|\cdot\|$:
$$\forall z', z'' \in Z^o: \quad \langle \omega'(z') - \omega'(z''), z' - z'' \rangle \geq \|z' - z''\|^2.$$
- **Bregman distance:** $V_z(u) = \omega(u) - \omega(z) - \langle \omega'(z), u - z \rangle$, where $z \in Z^o$ and $u \in Z$.
- **Prox-mapping:** given a prox center $z \in Z^o$,
$$\mathrm{Prox}_z(\xi) = \operatorname*{argmin}_{w \in Z} \{ \langle \xi, w \rangle + V_z(w) \} : E \to Z^o.$$
- **$\omega$-center:** $z_\omega = \operatorname*{argmin}_{z \in Z} \omega(z) \in Z^o$ of $Z$.
- **$\Omega$:** $\Omega = \Omega_z := \max_{z \in Z} V_{z_\omega}(z) \leq \max_{z \in Z} \omega(z) - \min_{z \in Z} \omega(z)$.
- **Lipschitz constant:** $L$ of $F$, from $\|\cdot\|$ to $\|\cdot\|_*$, satisfying $\|F(z) - F(z')\|_* \leq L \|z - z'\|$ for all $z, z'$.

Based on this setup, the general template of the Mirror Prox algorithm for Feasibility Problems (MPFP) is given in Algorithm 1. We refer to the customizations of Algorithm 1 that handle linear, kernelized, and conic feasibility problems as MPLFP, MPKFP, and MPCFP, respectively.

**Algorithm 1** MPFP
1. **Input:** $\omega$-center $z_\omega$, step sizes $\{\gamma_t\}$ and $\epsilon$.
2. **Output:** $z^t (= [x^t; y^t])$.
3. $t = 1$; $v_1 = z_\omega$;
4. **while** $\underline{\phi}(y^t) \leq 0$ and $\epsilon_{\mathrm{sad}}(z^t) > \epsilon$ **do**
5. &nbsp;&nbsp; $w_t = \mathrm{Prox}_{v_t}(\gamma_t F(v_t))$;
6. &nbsp;&nbsp; $v_{t+1} = \mathrm{Prox}_{v_t}(\gamma_t F(w_t))$;
7. &nbsp;&nbsp; $z^t = \left[\sum_{s=1}^t \gamma_s\right]^{-1} \sum_{s=1}^t \gamma_s w_s$;
8. &nbsp;&nbsp; $t = t + 1$;
9. **end while**

The standard customization of MPFP is based on associating a norm $\|\cdot\|_x$ and a d.g.f. $\omega_x(\cdot)$ with the domain $X$, and similarly $\|\cdot\|_y$, $\omega_y(\cdot)$ with the domain $Y$. Then, given two scalars $\alpha_x, \alpha_y > 0$, we build the d.g.f. and $\omega$-center $z_\omega$ for $Z = X \times Y$ as $\omega(z) = \alpha_x \omega_x(x) + \alpha_y \omega_y(y)$ and $z_\omega = [x_{\omega_x}; y_{\omega_y}]$, where $\omega_x(\cdot)$ and $\omega_y(\cdot)$ as well as $x_{\omega_x}$ and $y_{\omega_y}$ are customized based on the geometry of the domains $X$, $Y$. Also, by letting $\xi = [\xi_x; \xi_y]$ and $z = [x; y]$, our prox mapping becomes decomposable as $\mathrm{Prox}_z(\xi) = [\mathrm{Prox}^{\omega_x}_x(\xi_x/\alpha_x); \mathrm{Prox}^{\omega_y}_y(\xi_y/\alpha_y)]$, where $\mathrm{Prox}^{\omega_x}_x(\cdot)$ and $\mathrm{Prox}^{\omega_y}_y(\cdot)$ are respectively the prox mappings w.r.t. $\omega_x(x)$ in the domain $X$ and $\omega_y(y)$ in the domain $Y$. Because of space limitations, we provide the detailed derivation of the prox-mapping operators and the rationale behind our parameter choices $\alpha_x, \alpha_y$ in the appendix.

The convergence rate of the MP algorithm for solving a BSPP was established by (Nemirovski, 2004) as follows:

**Theorem 1** ((Nemirovski, 2004)). Suppose the step sizes in the Mirror Prox algorithm satisfy $\gamma_t = L^{-1}$. Then at every iteration $t \geq 1$, the corresponding solution $z^t = [x^t; y^t]$ satisfies $x^t \in X$, $y^t \in Y$, and we have
$$\overline{\phi}(x^t) - \underline{\phi}(y^t) = \epsilon_{\mathrm{sad}}(z^t) \leq \frac{\Omega L}{t}.$$

## 5. Linear Feasibility Problems

In this section, we first associate a BSPP with the linear feasibility problems LDFP and LAP, and establish close connections between the solutions of these problems. Then we discuss the customization of the general MPFP framework of Section 4 for simultaneously solving LDFP and LAP, and the computational complexity of the resulting MPLFP. We provide the customization of MPFP to kernelized and conic feasibility problems in Section 6. Because of space limitations, all of the proofs are provided in the appendix.

### 5.1. BSPP Formulation for Linear Feasibility Problems

To address the linear feasibility problems LDFP (1) and LAP (2) simultaneously, we consider the BSPP

$$\mathrm{Opt} = \max_{y \in B_m} \ \min_{x \in \Delta_n} \ y^T A x, \quad (5)$$

where the domains $X, Y$ of the variables $x, y$ are $\Delta_n := \{x \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1,\ x \geq 0\}$, i.e., the standard $n$-dimensional simplex, and $B_m := \{y \in \mathbb{R}^m : \|y\|_2 \leq 1\}$, the unit Euclidean ball in $\mathbb{R}^m$, respectively. Then $Z = \Delta_n \times B_m$. Also, $\phi(x, y) = y^T A x$, $\overline{\phi}(x) = \max_{y \in B_m} y^T A x$, $\underline{\phi}(y) = \min_{x \in \Delta_n} y^T A x$, and $F(x, y) = [A^T y; -A x]$.

The connections between LDFP, LAP and the BSPP, and their solutions, play an important role in our main result given in Theorem 3. These are explored in the next section.
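Before moving on, the following is a small NumPy sketch of how the bounds and the saddle point residual of the BSPP (5) are evaluated: $\overline{\phi}(x) = \|Ax\|_2$, $\underline{\phi}(y) = \min_j (A^T y)_j$, and $\epsilon_{\mathrm{sad}} = \overline{\phi}(x) - \underline{\phi}(y)$. The toy data at the end is purely illustrative.

```python
import numpy as np

def upper_bound(A, x):
    """phi_bar(x) = max_{||y||_2 <= 1} y^T A x = ||A x||_2."""
    return np.linalg.norm(A @ x)

def lower_bound(A, y):
    """phi_under(y) = min_{x in simplex} y^T A x = min_j (A^T y)_j."""
    return np.min(A.T @ y)

def saddle_residual(A, x, y):
    """eps_sad(z) = phi_bar(x) - phi_under(y), nonnegative on Delta_n x B_m."""
    return upper_bound(A, x) - lower_bound(A, y)

# Toy usage (illustrative data, not from the paper):
A = np.array([[1.0, 0.6], [0.0, 0.8]])
x = np.array([0.5, 0.5])   # a point of the simplex Delta_2
y = np.array([1.0, 0.0])   # a point of the unit ball B_2
print(saddle_residual(A, x, y))
```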
### 5.2. Connections between LDFP/LAP and BSPP

We first establish a close connection between $\rho(A)$ and the objective value of the BSPP (5) in Lemma 1, and then relate the solutions of the BSPP (5) and LDFP/LAP in Theorem 2.

**Lemma 1.** (a) $\rho(A) = \max_{\|y\|_2 = 1} \min_{x \in \Delta_n} y^T A x$; (b) whenever $\rho(A) > 0$, then $\rho(A) = \max_{\|y\|_2 \leq 1} \min_{x \in \Delta_n} y^T A x = \mathrm{Opt}$.

Moreover, for any primal-dual algorithm solving the BSPP (5), we have the following relations:

**Theorem 2.** Let $z^t = [x^t; y^t]$ be the solution at iteration $t$ of a primal-dual algorithm for solving problem (5). Suppose that the algorithm terminates at step $t$ when either $\underline{\phi}(y^t) > 0$ or $\epsilon_{\mathrm{sad}}(z^t) \leq \epsilon$. Then:
(a) if $\underline{\phi}(y^t) > 0$, then $A^T y^t > 0$; otherwise,
(b) if $\epsilon_{\mathrm{sad}}(z^t) \leq \epsilon$, then $\|A x^t\|_2 \leq \epsilon$.

Theorem 2 reveals that, depending on the sign of $\rho(A)$, the primal-dual solution pair $z^t = [x^t; y^t]$ for the BSPP (5) indeed corresponds to a solution of either LDFP or LAP. In fact, part (a) of the theorem corresponds to the case when LDFP is feasible (since $\rho(A) \geq \underline{\phi}(y^t) > 0$), and $y^t$ is a feasible solution of LDFP; part (b) is essentially the case when $x^t$ is an $\epsilon$-solution of LAP. The merit of a primal-dual algorithm as stated in Theorem 2 is that it can automatically identify whether LDFP or LAP is feasible during the execution and provide the corresponding solution. As a result, we do not need to assess the feasibility of either LDFP or LAP upfront, which is already as difficult as solving the original feasibility problems.

Any primal-dual algorithm capable of solving the BSPP (5) can be customized to fit the conditions of Theorem 2, and thus is suitable for simultaneously solving LDFP/LAP. Yet the convergence rates of these algorithms can differ significantly in terms of their dependence on $\rho(A)$. In Section 5.4, we show that the Mirror Prox algorithm achieves the best known performance in terms of $\rho(A)$. Further generalizations of the MP algorithm to handle kernelized and conic feasibility problems are possible as well. These generalizations retain the same rate of convergence given in Theorem 1, yet they may differ in terms of the a.o. needed for their iterations. We discuss these in Section 6.

### 5.3. MPLFP: Customization of MPFP for LDFP/LAP

For LDFP and LAP, we have $Z = X \times Y = \Delta_n \times B_m$, and hence we pick $\omega_x(x) = \sum_{i=1}^n x_i \ln(x_i)$ and $\omega_y(y) = \frac{1}{2} y^T y$, which leads to, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$,

$$[\mathrm{Prox}^{\omega_x}_x(\zeta)]_i = \frac{x_i \exp\{-\zeta_i\}}{\sum_{k=1}^n x_k \exp\{-\zeta_k\}}, \qquad [\mathrm{Prox}^{\omega_y}_y(\zeta)]_j = \begin{cases} y_j - \zeta_j, & \text{if } \|y - \zeta\|_2 \leq 1,\\[2pt] \dfrac{y_j - \zeta_j}{\|y - \zeta\|_2}, & \text{otherwise}. \end{cases}$$

Therefore, each iteration of Algorithm 1 involves only the computation of $F(\xi) = [A^T \xi_y; -A \xi_x]$ and

$$\mathrm{Prox}_z(\gamma F(\xi)) = \left[\mathrm{Prox}^{\omega_x}_x\!\left(\frac{\gamma A^T \xi_y}{\alpha_x}\right);\ \mathrm{Prox}^{\omega_y}_y\!\left(-\frac{\gamma A \xi_x}{\alpha_y}\right)\right].$$

This choice of d.g.f. leads to $z_\omega = [x_{\omega_x}; y_{\omega_y}]$, where $x_{\omega_x} = \frac{1}{n}\mathbf{1} \in \mathbb{R}^n$ and $y_{\omega_y} = 0_m$, the zero vector in $\mathbb{R}^m$. We compute the associated $\Omega = \Omega_z$ and $L$ following the derivations in the appendix, and set $\gamma_t = \frac{1}{L}$.

### 5.4. Convergence Analysis for MPLFP

In the specific case of the linear feasibility problems LDFP and LAP, following the derivation of $\alpha_x$ and $\alpha_y$ presented in the appendix, one can optimally select $\alpha_x$ and $\alpha_y$ to achieve $\Omega L \leq (2\log n)^{1/2}$. Hence, by combining Theorems 1 and 2, we arrive at the following main result:

**Theorem 3.** Let the step sizes of MPFP be $\gamma_t = L^{-1}$. Then
(a) when $\rho(A) > 0$, MPFP terminates in at most $N = \lceil \Omega L / \rho(A) \rceil + 1 \leq \lceil (2\log n)^{1/2} / \rho(A) \rceil + 1$ iterations with a feasible solution to LDFP given by $y^N$;
(b) when $\rho(A) < 0$, MPFP terminates in at most $N = \lceil \Omega L / \epsilon \rceil + 1 \leq \lceil (2\log n)^{1/2} / \epsilon \rceil + 1$ iterations with an $\epsilon$-solution for LAP given by $x^N$.
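The following is a minimal NumPy sketch of MPLFP, i.e., Algorithm 1 with the customization of Section 5.3 and the termination rule of Theorem 2. The default values of `gamma`, `alpha_x`, `alpha_y` and `max_iter` are illustrative placeholders; the paper sets $\gamma_t = 1/L$ and derives the optimal $\alpha_x, \alpha_y$ in its appendix.

```python
import numpy as np

def prox_simplex(x, zeta):
    # Entropy prox on Delta_n: [x_i exp(-zeta_i)] / sum_k x_k exp(-zeta_k)
    w = x * np.exp(-(zeta - zeta.min()))      # shift for numerical stability
    return w / w.sum()

def prox_ball(y, zeta):
    # Euclidean prox on B_m: project y - zeta onto the unit ball
    w = y - zeta
    nrm = np.linalg.norm(w)
    return w if nrm <= 1.0 else w / nrm

def mplfp(A, eps=1e-3, gamma=0.5, alpha_x=1.0, alpha_y=1.0, max_iter=100000):
    """Sketch of MPLFP for column-normalized A; parameters are illustrative."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.zeros(m)   # omega-center z_omega
    x_sum, y_sum, w_sum = np.zeros(n), np.zeros(m), 0.0
    for t in range(1, max_iter + 1):
        # w_t = Prox_{v_t}(gamma * F(v_t)), with F(x, y) = [A^T y; -A x]
        xw = prox_simplex(x, gamma * (A.T @ y) / alpha_x)
        yw = prox_ball(y, -gamma * (A @ x) / alpha_y)
        # v_{t+1} = Prox_{v_t}(gamma * F(w_t))
        x, y = (prox_simplex(x, gamma * (A.T @ yw) / alpha_x),
                prox_ball(y, -gamma * (A @ xw) / alpha_y))
        # z^t: gamma-weighted average of the intermediate iterates w_t
        x_sum += gamma * xw; y_sum += gamma * yw; w_sum += gamma
        x_avg, y_avg = x_sum / w_sum, y_sum / w_sum
        lower = np.min(A.T @ y_avg)           # phi_under(y^t)
        upper = np.linalg.norm(A @ x_avg)     # phi_bar(x^t)
        if lower > 0:                         # Theorem 2(a): LDFP solved by y^t
            return "LDFP", y_avg
        if upper - lower <= eps:              # Theorem 2(b): eps-solution of LAP
            return "LAP", x_avg
    return "budget exceeded", (x_sum / w_sum, y_sum / w_sum)
```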
**Complexity per iteration of MPLFP.** Each iteration $t$ involves the following elementary computations with the given a.o. complexity: (1) $A x_t$, $A^T y_t$: $O(mn)$; (2) prox mappings: $O(m + n)$; (3) $\overline{\phi}(x^t) = \|A x^t\|_2$: $O(m)$; (4) $\underline{\phi}(y^t) = \min_{i \in \{1,\ldots,n\}} (A_i^T y^t)$: $O(n)$. Therefore, the overall a.o. involved in each iteration of MPLFP is $O(mn)$, which is the same as that of the perceptron algorithm.

## 6. Generalized Feasibility Problems

There are two important extensions of the basic MPFP framework, namely the Kernelized Feasibility Problems (KDFP, KAP) and the Conic Feasibility Problems (CDFP, CAP). Both of these extensions merely require customizing the MPFP framework to the geometry of the problem by using a proper representation or a proper selection of prox mappings. Hence, they can easily be handled within the existing MPFP framework while still enjoying the same convergence rate as MPLFP. To the best of our knowledge, the extension of MP to KDFP achieving $O(\frac{\sqrt{\log n}}{|\rho(\Psi)|})$ ($O(\frac{\sqrt{\log n}}{\epsilon})$ in the case of KAP) performance is novel, and the same performance of MP for the conic feasibility problem (in terms of its dependence on $n$ and $\rho(A)$) is far superior to all of the other competing algorithms.

### 6.1. Kernelized Feasibility Problems

In kernelized classification problems, along with the data $A$, we are given a feature map $\Phi(\cdot) : \mathbb{R}^m \to F$, which maps each data point $A_i$ to a new feature in the Reproducing Kernel Hilbert Space $F = \mathbb{R}^d$. The feature map is specified implicitly via a kernel, $K_\Phi(a, a') := \langle \Phi(a), \Phi(a') \rangle$, with the motivation that explicitly computing $\Phi(a)$ can be rather time consuming, whereas the inner product $\langle \Phi(a), \Phi(a') \rangle$, via the kernel $K_\Phi(a, a')$, is much easier to obtain (see (Schölkopf & Smola, 2002)). Therefore, kernelized algorithms are designed under the assumption that only black-box access to the kernel is available (i.e., our methods work for any kernel, as long as we can compute $K_\Phi(a, a')$ efficiently). By normalization, we assume that $K_\Phi(a, a) = 1$ and $K_\Phi(a, a') \leq 1$ for all $a, a' \in \mathbb{R}^m$, and in our runtime analysis we assume that the kernel matrix associated with the data $A$, i.e., $G_\Phi$ with $[G_\Phi]_{i,j} = K_\Phi(A_i, A_j)$, is known.

Let $Q$ be a diagonal matrix, where $Q_{ii} = 1$ or $-1$ is the label of the $i$-th data point $A_i$. In order to simplify the notation in our analysis, we work with the feature map $\Psi(A_i) = Q_{ii}\Phi(A_i)$, and the resulting kernel is given by $K(A_i, A_j) = \langle \Psi(A_i), \Psi(A_j) \rangle = \langle Q_{ii}\Phi(A_i), Q_{jj}\Phi(A_j) \rangle$, which is the $(i,j)$-th element of the kernel matrix $G := Q^T G_\Phi Q$. For notational convenience, we also let $\Psi := [\Psi(A_1), \ldots, \Psi(A_n)]$. Thus the kernelized classification problem is equivalent to finding a feasible solution to the following Kernelized Dual Feasibility Problem (KDFP): $y^T \Psi > 0$. When no such solution exists, an inseparability certificate is provided by an $\epsilon$-solution for the Kernelized Alternative Problem (KAP): $\Psi x = 0$, $\mathbf{1}^T x = 1$, $x \geq 0$. Therefore, one can equivalently consider the following kernelized BSPP:

$$\max_{y \in B_d} \ \min_{x \in \Delta_n} \ y^T \Psi x.$$

By a simple customization, the MPFP framework can be generalized to solve the kernelized BSPP using only a black-box kernel oracle. Throughout the algorithm, in order to avoid any explicit computation involving $\Psi$, we keep track of two vectors $x_t, g_t \in \mathbb{R}^n$. While the role of $x_t$ in MPKFP remains the same, we use $g_t$ as a surrogate for $y_t$. In particular, we let $g_0 = 0_n$, implying $y_0 = \Psi g_0$, and initialize the algorithm with $z_\omega = [\frac{1}{n}\mathbf{1}; 0_d] = [x_0; y_0]$. In all subsequent iterations, we always maintain the relation $y_t = \Psi g_t$ implicitly. This is the key modification in MPKFP.
The benefit of this modification is that $\Psi^T y_t = \Psi^T \Psi g_t = G g_t$, i.e., $[\Psi^T y_t]_i = \sum_{j=1}^n K(A_i, A_j)(g_t)_j$; hence we can avoid explicit computations involving $\Psi$. In particular, at each iteration $t$, the corresponding kernelized BSPP solution is given by $z^t = \left[\sum_{s=1}^t \gamma_s\right]^{-1} \sum_{s=1}^t \gamma_s [x_s; \Psi g_s]$. Based on $z^t$, the corresponding lower and upper bounds are given by

$$\underline{\phi}(y^t) = \min_{i=1,\ldots,n} (y^t)^T \Psi(A_i) = \min_{i=1,\ldots,n} \sum_{j=1}^n (g^t)_j K(A_j, A_i), \qquad \overline{\phi}(x^t) = \|\Psi x^t\|_2 = \sqrt{(x^t)^T G\, x^t},$$

using only the kernel $K$. Moreover, for $y = \Psi g$, the prox mapping computations are given by

$$\mathrm{Prox}^{\omega_x}_x\!\left(\frac{\gamma \Psi^T \xi_y}{\alpha_x}\right) = \frac{x \odot \eta}{\sum_{i=1}^n (x \odot \eta)_i}, \quad \text{where } \eta = \left[\exp\Big\{-\tfrac{\gamma (Gg)_1}{\alpha_x}\Big\}, \ldots, \exp\Big\{-\tfrac{\gamma (Gg)_n}{\alpha_x}\Big\}\right]$$

and $\odot$ is the elementwise product, and

$$\mathrm{Prox}^{\omega_y}_y\!\left(-\frac{\gamma \Psi \xi_x}{\alpha_y}\right) = \begin{cases} \Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right), & \text{if } \left\|\Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right)\right\|_2 \leq 1,\\[4pt] \Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right) \Big/ \left\|\Psi\!\left(g + \tfrac{\gamma \xi_x}{\alpha_y}\right)\right\|_2, & \text{otherwise}. \end{cases}$$

Note that the $\mathrm{Prox}^{\omega_x}_x(\cdot)$ computation does not use $\Psi$, and in $\mathrm{Prox}^{\omega_y}_y(\cdot)$ we have $\|\Psi(g + \frac{\gamma\xi_x}{\alpha_y})\|_2^2 = (g + \frac{\gamma\xi_x}{\alpha_y})^T G (g + \frac{\gamma\xi_x}{\alpha_y})$, so it relies only on the kernel function $K$. Hence it is sufficient to keep only the (perhaps normalized) coefficient vector $g + \frac{\gamma\xi_x}{\alpha_y}$ after each $\mathrm{Prox}^{\omega_y}_y(\cdot)$ operation.

**Complexity.** (1) Per iteration: given $K$, the complexity of each iteration is $O(n^2)$, due to the computations of $G g_t$, $(g + \frac{\gamma\xi_x}{\alpha_y})^T G (g + \frac{\gamma\xi_x}{\alpha_y})$, $\underline{\phi}(y^t)$ and $\overline{\phi}(x^t)$. (2) Convergence rate: as the domains $X = \Delta_n$ and $Y = B_d$ remain the same as in MPLFP, MPKFP shares the same convergence rate as the former, except that $\rho(A)$ is replaced by $\rho(\Psi)$.

**Remark.** Let $(x^*, g^*)$ be the output of MPKFP. If the data is separable in the feature space, we use $g^*$ for classification as follows: for a new test data point $A_0$, its class label is determined by the sign of $(\Phi(A_0))^T \Psi g^* = \sum_{i=1}^n Q_{ii} \langle \Phi(A_0), \Phi(A_i) \rangle g^*_i = \sum_{i=1}^n Q_{ii} K_\Phi(A_0, A_i) g^*_i$. Otherwise, the $\epsilon$-inseparability certificate, i.e., an $\epsilon$-solution for KAP, is given by $Q x^*$.
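To summarize the kernel-only bookkeeping of this subsection, the following is a small NumPy sketch of the quantities MPKFP evaluates through the Gram matrix $G = \Psi^T\Psi$ while holding $y = \Psi g$ implicitly. The function names and the generic increment `delta` (standing for the coefficient-space step $\gamma\xi_x/\alpha_y$) are illustrative helpers, not the paper's implementation.

```python
import numpy as np

def kernel_bounds(G, x, g):
    """Bounds for the kernelized BSPP using only G = Psi^T Psi:
    phi_under(y) = min_i (G g)_i   (with y = Psi g held implicitly),
    phi_bar(x)   = ||Psi x||_2 = sqrt(x^T G x)."""
    lower = np.min(G @ g)
    upper = np.sqrt(max(x @ (G @ x), 0.0))   # guard tiny negative round-off
    return lower, upper

def kernel_ball_prox_coeffs(G, g, delta):
    """y-prox in coefficient space: with y = Psi g and step Psi delta,
    keep h = g + delta if ||Psi h||_2 <= 1, else rescale h by 1/||Psi h||_2."""
    h = g + delta
    nrm = np.sqrt(max(h @ (G @ h), 0.0))
    return h if nrm <= 1.0 else h / nrm
```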
### 6.2. Conic Feasibility Problems

The second extension we discuss is the Conic Dual Feasibility Problem (CDFP) of the form

$$A^* y \in \mathrm{int}(K^*), \quad (6)$$

where $A^*$ is the conjugate linear map of a linear map $A$, and $K \subseteq \mathbb{R}^n$ is a proper (closed, convex, pointed, with nonempty interior) cone with dual cone $K^*$. Note that $A^* y \in \mathrm{int}(K^*)$ is equivalent to requiring $\langle x, A^* y \rangle > 0$ for all $x \in K \setminus \{0\}$. Let $e$ be a vector from $\mathrm{int}(K^*)$; as the selection of $e$ depends on $K$, below we specify it separately for each $K$. Define $\Delta(K) = \{x \in K : \langle e, x \rangle = 1\}$, the base of the cone $K$ given by the vector $e$. Hence $\langle x, A^* y \rangle > 0$ for all $x \in K \setminus \{0\}$ is equivalent to $\min_x \{\langle x, A^* y \rangle : x \in \Delta(K)\} > 0$. Moreover, the Conic Alternative Problem (CAP) is given by

$$Ax = 0, \quad \langle e, x \rangle = 1, \quad x \in K. \quad (7)$$

Based on the same rationale as in the previous sections, we consider the following conic version of the BSPP:

$$\max_{y \in B_m} \ \min_{x \in \Delta(K)} \ y^T A x. \quad (8)$$

Once again, (8), analogous to (5), is a BSPP, albeit with different domains, and hence it can still be solved within the MPFP framework, achieving the same iteration complexity given in Theorem 3. The main customization of MPFP applied to (8), as opposed to (5), is in selecting the various algorithm parameters and the prox-mapping operators for the corresponding domains. In particular, one needs to specify an efficient prox-mapping operator customized to the domain $\Delta(K)$. We present several interesting cases of $K$ for which such an efficient prox-mapping onto $\Delta(K)$ exists. These include $K = \mathbb{R}^n_+$, the nonnegative orthant; $K = L^n = \{x \in \mathbb{R}^n : x_n \geq \sqrt{x_1^2 + \ldots + x_{n-1}^2}\}$, the second-order cone; and $K = S^n_+ = \{X \in \mathbb{R}^{n \times n} : X = X^T,\ a^T X a \geq 0\ \forall a \in \mathbb{R}^n\}$, the positive semidefinite cone. The basic settings for these cases, as suggested in (Nemirovski, 2004; Juditsky et al., 2013), are as follows. A sketch of the second-order-cone prox mapping appears after this list.

**Nonnegative orthant:** $K = K^* = \mathbb{R}^n_+$. This is precisely the case we have addressed for the linear feasibility problems (see Section 5.3).

**Second-order cone:** $K = K^* = L^n$. Here $e = [0, \ldots, 0, 1]^T$ and $\Delta(K) = B_{n-1} \times \{1\}$. d.g.f.: $\omega(x) = \frac{1}{2}\sum_{i=1}^{n-1} x_i^2$; $\Omega_x \leq \frac{1}{2}$ with $x_{\omega_x} = [0, \ldots, 0, 1]^T$. Given $x \in \Delta(K)$ and $\xi \in \mathbb{R}^n$, for $i = 1, \ldots, n-1$,

$$[\mathrm{Prox}_x(\xi)]_i = \begin{cases} x_i - \xi_i, & \text{if } \sum_{i=1}^{n-1}(x_i - \xi_i)^2 \leq 1,\\[2pt] \dfrac{x_i - \xi_i}{\sqrt{\sum_{i=1}^{n-1}(x_i - \xi_i)^2}}, & \text{otherwise}, \end{cases}$$

and $[\mathrm{Prox}_x(\xi)]_n = 1$. Note that the computational complexity of this prox mapping is $O(n)$, as discussed in the MPLFP setup (see Section 5.3).

**Positive semidefinite cone:** $K = K^* = S^n_+$. For any two symmetric matrices $A, B$, we let $\langle A, B \rangle = \mathrm{Tr}(AB)$ be the corresponding inner product. Here $e = I_n$ and $\Delta(K) = \{X \in S^n_+ : X \succeq 0,\ \mathrm{Tr}(X) = 1\}$ (the flat spectrahedron). d.g.f.: the matrix entropy $\omega(X) = \mathrm{Entr}(\lambda(X)) = \sum_i \lambda_i(X)\log(\lambda_i(X))$, where $\lambda_i(X)$ is the $i$-th eigenvalue of $X$; the resulting $\Omega_X \leq 4\log(n)$ with $x_{\omega_x} = \frac{1}{n} I_n$. Given $X \in S^n_+$ and $\Xi \in S^n$, computing $\mathrm{Prox}_X(\Xi)$ reduces to computing the eigenvalue decomposition of the matrix $X$, which also yields $\omega'(X)$, and a subsequent eigenvalue decomposition of the matrix $H = \Xi - \omega'(X)$. Let $H = U\mathrm{Diag}(h)U^T$ be the eigenvalue decomposition of $H$, where $\mathrm{Diag}(h)$ stands for the diagonal matrix with diagonal elements from $h$. Then computing $\mathrm{Prox}_X(\Xi)$ reduces to constructing the matrix $W = U\mathrm{Diag}(w)U^T$, where $w = \operatorname*{argmin}_{z \in \mathbb{R}^n}\{\langle \mathrm{Diag}(h), \mathrm{Diag}(z)\rangle + \omega(\mathrm{Diag}(z)) : \mathrm{Diag}(z) \in \Delta(K)\}$, which is identical to the computation of the prox function in the simplex setup. The computational complexity of this prox mapping is $O(n^3)$ per iteration, due to the two singular value decompositions ($O(n^3)$) and simple a.o. of $O(n)$.

**Discussion of MPCFP and conic ISPVN.** The MPFP framework is quite flexible: in particular, it can easily be adjusted to handle a variety of domains $Z$ by properly selecting $\omega(\cdot)$ and thus $\mathrm{Prox}_z(\cdot)$. Besides, for the aforementioned cones, the computational cost of each iteration of MPCFP is quite low. While ISPVN is capable of handling the same conic feasibility problems, the analysis of (Soheili & Peña, 2013; Ramdas & Peña, 2014) is based on Euclidean d.g.f.'s, and thus the respective operations involved for projections onto these domains are much more expensive (involving sorting operations); hence they are inferior to the ones presented above.
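As an illustration of the conic customization, the following is a minimal NumPy sketch of the prox mapping onto the second-order-cone base $\Delta(K) = B_{n-1} \times \{1\}$ described in the list above; the function name is illustrative.

```python
import numpy as np

def prox_soc_base(x, xi):
    """Prox mapping onto the second-order-cone base Delta(K) = B_{n-1} x {1}
    with the Euclidean d.g.f. of Section 6.2: project the first n-1 coordinates
    of x - xi onto the unit ball and pin the last coordinate to 1."""
    w = x[:-1] - xi[:-1]
    nrm = np.linalg.norm(w)
    if nrm > 1.0:
        w = w / nrm
    return np.concatenate([w, [1.0]])
```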
## 7. Numerical Experiments

In this section, we conduct an empirical study of the MPFP framework and compare it with other baseline algorithms, using both synthetic data (for MPLFP and LAP) and real data (for MPKFP). All the codes are written in Matlab 2011b and are run in a single-threaded fashion on a Linux server with 6 dual-core 2.8 GHz CPUs and 64 GB memory. In the implementation of MPLFP and MPKFP, we select $\alpha_x, \alpha_y$ as described in the appendix, for a normalized matrix $A$, i.e., $\|A_i\|_2 = 1$ for all $i$. We implemented the Smooth Perceptron (SPCT) (Soheili & Peña, 2012), the normalized Perceptron (PCT) (Ramdas & Peña, 2014) and the normalized von Neumann (VN) algorithm (Soheili & Peña, 2012) for comparison. We also implemented ISPVN (Soheili & Peña, 2013) and ISNKPVN (Ramdas & Peña, 2014), but as the authors do not provide a parameter selection strategy, we searched over the parameter space and chose the value with the best performance, i.e., the parameter denoted by $\gamma$ in (Soheili & Peña, 2013) (Theorem 1) was set to 2. We note that our numerical study is based on comparing basic implementations of all of these algorithms, including ours. There may be ways to obtain small improvements in computational performance for any one of them. Nevertheless, guided by our theoretical findings, we do not expect any major changes in the overall conclusions drawn.

### 7.1. Synthetic Data

We first compare the performance of the different algorithms (non-kernelized versions) on synthetic data. In our instance generation for LDFP, $\kappa$ serves as a proxy for the condition number $\rho(A)$, i.e., a larger $\kappa$ corresponds to a larger $\rho(A)$ in (3). We provide details of our instance generation for both LDFP and LAP in the appendix.

**Experimental setup.** We test the methods on a wide spectrum of $(m, n)$ combinations. Given a combination, we generate 20 instances for LDFP and 20 for LAP, on which all the methods are tested. For each method, the iteration limit is $10^6$ and the runtime limit is $5 \times 10^4$ s. When either one of these limits is reached, we report a "TO" (timeout). We empirically observe that, occasionally, SPCT suffers from numerical overflow, and ISPVN frequently falls into a dead loop. We report a "Fail" in the table if a method fails on all of the 20 instances. Table 2 summarizes the average runtime of each algorithm over the instances that the algorithm successfully solved. Because of space limitations, we present the number of iterations for LDFP, the effect of $\rho(A)$, and our results on LAP instances in the appendix.

**Performance on LDFP.** We set $\kappa = 1$ for the LDFP case and, by varying $m, n$, we generate normalized instances $A \in \mathbb{R}^{m \times n}$ such that $\|A_j\|_2 = 1$ for all $j$. In all of the instances generated, the ISPVN algorithm failed, and hence we omitted it from the comparison. As indicated by Table 2, MPLFP outperforms all the other methods in terms of runtime for every pair $(m, n)$. The runner-up algorithm, SPCT, is clearly much slower, while PCT and VN perform equally worse. Moreover, as $(m, n)$ increases, the time savings achieved by MPLFP become more significant. More importantly, even on very large, high-dimensional data ($m = 10^3$, $n = 5 \times 10^6$), MPLFP finds a feasible solution for LDFP within 10 hours, while the baseline methods either run out of time (PCT and VN) or fail (SPCT and ISPVN) on large-scale problems ($m = 10^2$, $n = 5 \times 10^5$). This highlights the suitability of MPLFP for large-scale learning.

| $(m, n)$ | MPLFP | SPCT | PCT | VN |
|---|---|---|---|---|
| $(10^2, 5 \times 10^3)$ | 1.3 | 4.3 | 98.5 | 127.3 |
| $(10^3, 5 \times 10^3)$ | 5.5 | 16.2 | 142.2 | 141.5 |
| $(10^2, 5 \times 10^4)$ | 112.7 | 455.0 | TO | TO |
| $(10^3, 5 \times 10^4)$ | 318.1 | 1301.0 | TO | TO |
| $(10^2, 5 \times 10^5)$ | 25506 | Fail | TO | TO |
| $(10^3, 5 \times 10^5)$ | 22471 | Fail | TO | TO |

Table 2. Runtime (seconds) of all methods for LDFP with a column-normalized matrix $A$ ($\kappa = 1$). The iteration limit is $10^6$ and the runtime limit is $5 \times 10^4$ s.

**The effect of ρ(A).** We tested the performance of the algorithms by varying $\kappa$, as a proxy for $\rho(A)$, on instances with $(m, n) = (100, 5000)$ and $\|A_j\|_2 = 1$ for all $j$. For various values of $\kappa \in [1, 10]$ (i.e., for feasible LDFP), the runtime of each algorithm is reported in Figure 1. It is not surprising that as $\kappa$ (and hence $\rho(A)$) becomes larger, which means the data become more separable, the runtime decreases for all of the methods. Nonetheless, MPLFP is still considerably faster than its closest competitor in all cases.

[Figure 1. $\kappa$ (a proxy for $\rho(A)$) vs. runtime of MPLFP, SPCT, PCT and VN.]

### 7.2. Real Data: CIFAR-10 Image Classification

We test the efficiency of MPKFP in training classifiers with the kernel method. For comparison, we also implement the Kernelized Perceptron (KPCT), the Kernelized von Neumann (KVN) algorithm, and ISNKPVN (Ramdas & Peña, 2014). Our testbed is the CIFAR-10 dataset (Krizhevsky, 2009) (http://www.cs.utoronto.ca/~kriz/cifar.html),
which contains 60,000 color images in 10 mutually exclusive classes, with 6,000 images per class. Each image $A_i$ is encoded as an $m = 3072$-dimensional vector. (To achieve a lower prediction error rate, one could use other, more advanced features, which is not the focus of this paper.) We evenly pick around 1,000 images from each class to construct a training set of $n = 10{,}000$ images in total, and repeat this process to form a test set of the same size. We train a one-vs-rest binary classifier for each class, and hence obtain 10 classifiers in total. After that, each test image is run through all the classifiers, and its predicted label corresponds to the classifier that gives the largest output value. We choose the Radial Basis Function (RBF) kernel given by $K_\Phi(A_i, A_j) = \exp\{-5.5\,\|A_i - A_j\|_2\}$. We run all of the algorithms and, at different pre-specified iteration limits $N \in \{10, 32, 100, 320, 1000\}$, we collect their respective solutions. We treat these solutions as the corresponding classifiers and benchmark their prediction quality on the test set.

Figure 2 shows how the prediction error rate drops as the number of training iterations per classifier increases. It can be seen that, within the same iteration limit, MPKFP outperforms all the other methods in terms of prediction error rate. In particular, the error rate obtained by the perceptron after 1,000 iterations is still higher than that achieved within merely 10 iterations by MPKFP, which highlights the suitability of MPKFP for fast classifier training.

[Figure 2. Number of training iterations per classifier vs. prediction error rate for MPKFP, ISNKPVN, KPCT and KVN.]

## 8. Conclusions

We build a framework based on BSPP to either find a binary classifier (DFP) or return a near-infeasibility certificate if there is none (AP). We adopt the accelerated primal-dual algorithm, Mirror Prox, to solve the BSPP and achieve a convergence rate of $O(\frac{\sqrt{\log n}}{|\rho(A)|})$ for DFP and $O(\frac{\sqrt{\log n}}{\epsilon})$ for AP. Our results also extend to general kernelized and conic problems, all of which share the same rate of convergence. We further confirm the efficiency and numerical stability of our approach via a numerical study on both synthetic and real data. Future work includes exploring whether these methods can be modified to achieve sublinear-time behavior, or to allow for rapid incremental retraining in active learning scenarios.

## References

Blum, Avrim and Dunagan, John. Smoothed analysis of the perceptron algorithm for linear programming. In SODA, pp. 905-914, 2002.

Chvatal, Vasek. Linear Programming. Macmillan, 1983.

Cotter, Andrew, Shalev-Shwartz, Shai, and Srebro, Nathan. The kernelized stochastic batch perceptron. In ICML, 2012.

Cristianini, Nello and Shawe-Taylor, John. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

Dantzig, George Bernard. An ε-precise feasible solution to a linear program with a convexity constraint in 1/ε² iterations independent of problem size. Technical Report 92-5, Stanford University, 1992.

Epelman, Marina and Freund, Robert M. Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. Math. Program., 88(3):451-485, 2000.

Juditsky, Anatoli, Kılınç-Karzan, Fatma, and Nemirovski, Arkadi. Randomized first order algorithms with applications to ℓ1-minimization. Math. Program., 142(1-2):269-310, 2013.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Li, Dan and Terlaky, Tamás. The duality between the perceptron algorithm and the von Neumann algorithm. In Modeling and Optimization: Theory and Applications, volume 62, pp. 113-136, 2013.
Nemirovski, Arkadi. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.

Nesterov, Yurii. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235-249, 2005.

Novikoff, Albert B. J. On convergence proofs for perceptrons. Technical report, 1962.

Ramdas, Aaditya and Peña, Javier. Margins, kernels and non-linear smoothed perceptrons. In ICML, 2014.

Rosenblatt, Frank. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.

Saha, Ankan, Vishwanathan, S. V. N., and Zhang, Xinhua. New approximation algorithms for minimum enclosing convex shapes. In SODA, pp. 1146-1160, 2011.

Schölkopf, Bernhard and Smola, Alex J. Learning with Kernels. The MIT Press, 2002.

Soheili, Negar and Peña, Javier. A smooth perceptron algorithm. SIAM Journal on Optimization, 22(2):728-737, 2012.

Soheili, Negar and Peña, Javier. A primal-dual smooth perceptron-von Neumann algorithm. In Discrete Geometry and Optimization, volume 69, pp. 303-320, 2013.