# Scalable Learning in Reproducing Kernel Kreĭn Spaces

Dino Oglic¹ and Thomas Gärtner²

¹Department of Informatics, King's College London, UK. ²School of Computer Science, University of Nottingham, UK. Correspondence to: Dino Oglic.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

Abstract

We provide the first mathematically complete derivation of the Nyström method for low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. Building on this result, we devise highly scalable methods for learning in reproducing kernel Kreĭn spaces. The devised approaches provide a principled and theoretically well-founded means to tackle large-scale learning problems with indefinite kernels. The main motivation for our work comes from problems with structured representations (e.g., graphs, strings, time series), where it is relatively easy to devise a pairwise (dis)similarity function based on intuition and/or knowledge of domain experts. Such functions are typically not positive definite, and verifying this condition is often well beyond the expertise of practitioners. The effectiveness of the devised approaches is evaluated empirically using indefinite kernels defined on structured and vectorial data representations.

1. Introduction

In learning problems with structured data it is relatively easy to devise a pairwise similarity/dissimilarity function based on the intuition or knowledge of domain experts. Such functions are typically not positive definite, and verifying this condition is often well beyond the expertise of practitioners. Learning problems with indefinite similarity/dissimilarity functions are typically modeled via Kreĭn spaces (e.g., see Ong et al., 2004; Loosli et al., 2016; Oglic and Gärtner, 2018), which are vector spaces with an indefinite bilinear form (Azizov and Iokhvidov, 1981; Iokhvidov et al., 1982). The computational and space complexities of these approaches are similar to those of standard kernel methods that work with positive definite kernels (Schölkopf and Smola, 2001). The Nyström method (Nyström, 1930; Smola and Schölkopf, 2000; Williams and Seeger, 2001) is an effective approach for low-rank approximation of positive definite kernels that can scale kernel methods to problems with millions of instances (Schölkopf and Smola, 2001). We provide the first mathematically complete derivation that extends the Nyström method to low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. To tackle the computational issues arising in large-scale problems with indefinite kernels, we also devise several novel Nyström-based low-rank approaches tailored for scalable learning in reproducing kernel Kreĭn spaces.

We start by showing that the Nyström method can be used for low-rank approximations of indefinite kernel matrices and provide means for finding their approximate eigendecompositions (Section 2.2). We then devise two landmark sampling strategies based on state-of-the-art techniques (Gittens and Mahoney, 2016; Oglic and Gärtner, 2017) used in Nyström approximations of positive definite kernels (Section 2.3).
Having described means for finding low-rank factorizations of indefinite kernel matrices, we formulate low-rank variants of two least squares methods (Tikhonov and Arsenin, 1977; Rifkin, 2002; Oglic and Gärtner, 2018) for learning in reproducing kernel Kreĭn spaces (Section 2.4). We also derive a novel low-rank variant of the support vector machine for scalable learning in reproducing kernel Kreĭn spaces (Section 2.5), inspired by Oglic and Gärtner (2018). Having introduced means for scalable learning in reproducing kernel Kreĭn spaces, we evaluate the effectiveness of these approaches and the Nyström low-rank approximations on datasets from standard machine learning repositories (Section 3). The empirical results demonstrate the effectiveness of the proposed approaches in: i) classification tasks, and ii) problems of finding a low-rank approximation of an indefinite kernel matrix. The experiments are performed using 15 representative datasets and a variety of indefinite kernels. The paper concludes with a discussion in which we contrast our approach with other relevant approaches (Section 4).

2. Scalable Learning in Kreĭn Spaces

In this section, we first provide a novel derivation of the Nyström method that allows us to extend the approach to low-rank approximation of indefinite kernel matrices. Building on this result, we then provide means to scale Kreĭn kernel methods to datasets with millions of instances/pairwise (dis)similarities. More specifically, we devise low-rank variants of kernel ridge regression and support vector machines in reproducing kernel Kreĭn spaces, as well as a low-rank variant of the variance-constrained ridge regression proposed in Oglic and Gärtner (2018). As the effectiveness of low-rank approximations based on the Nyström method critically depends on the selected landmarks, we also adapt two state-of-the-art landmark sampling strategies proposed for the approximation of positive definite kernels. In addition to this, we make a theoretical contribution (Proposition 3) relating the Kreĭn support vector machine (Loosli et al., 2016) to previous work on learning with flip-spectrum kernel matrices (Graepel et al., 1998; Chen et al., 2009).

2.1. Reproducing Kernel Kreĭn Spaces

Let $\mathcal{K}$ be a vector space over the scalar field $\mathbb{R}$. A bilinear form on $\mathcal{K}$ is a function $\langle \cdot, \cdot \rangle_{\mathcal{K}} \colon \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ such that, for all $f, g, h \in \mathcal{K}$ and scalars $\alpha, \beta \in \mathbb{R}$, it holds that: i) $\langle \alpha f + \beta g, h \rangle_{\mathcal{K}} = \alpha \langle f, h \rangle_{\mathcal{K}} + \beta \langle g, h \rangle_{\mathcal{K}}$, and ii) $\langle f, \alpha g + \beta h \rangle_{\mathcal{K}} = \alpha \langle f, g \rangle_{\mathcal{K}} + \beta \langle f, h \rangle_{\mathcal{K}}$. The bilinear form is called non-degenerate if $(\forall f \in \mathcal{K})\colon \big( (\forall g \in \mathcal{K})\ \langle f, g \rangle_{\mathcal{K}} = 0 \big) \implies f = 0$. The bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ is symmetric if, for all $f, g \in \mathcal{K}$, we have $\langle f, g \rangle_{\mathcal{K}} = \langle g, f \rangle_{\mathcal{K}}$. The form is called indefinite if there exist $f, g \in \mathcal{K}$ such that $\langle f, f \rangle_{\mathcal{K}} > 0$ and $\langle g, g \rangle_{\mathcal{K}} < 0$. On the other hand, if $\langle f, f \rangle_{\mathcal{K}} \geq 0$ for all $f \in \mathcal{K}$, then the form is called positive. A non-degenerate, symmetric, and positive bilinear form on $\mathcal{K}$ is called an inner product. Any two elements $f, g \in \mathcal{K}$ that satisfy $\langle f, g \rangle_{\mathcal{K}} = 0$ are $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal. Similarly, any two subspaces $\mathcal{K}_1, \mathcal{K}_2 \subseteq \mathcal{K}$ that satisfy $\langle f_1, f_2 \rangle_{\mathcal{K}} = 0$ for all $f_1 \in \mathcal{K}_1$ and $f_2 \in \mathcal{K}_2$ are called $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal. Having reviewed bilinear forms, we are now ready to introduce the notion of a Kreĭn space.
Definition 1 (Bognár, 1974; Azizov and Iokhvidov, 1981). The vector space $\mathcal{K}$ with a bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ is called a Kreĭn space if it admits a decomposition into a direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ of $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal Hilbert spaces $\mathcal{H}_\pm$ such that the bilinear form can be written as $\langle f, g \rangle_{\mathcal{K}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} - \langle f_-, g_- \rangle_{\mathcal{H}_-}$, where $\mathcal{H}_\pm$ are endowed with inner products $\langle \cdot, \cdot \rangle_{\mathcal{H}_\pm}$, $f = f_+ + f_-$, $g = g_+ + g_-$, and $f_\pm, g_\pm \in \mathcal{H}_\pm$.

For a fixed decomposition $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$, the Hilbert space $\mathcal{H}_{\mathcal{K}} = \mathcal{H}_+ \oplus \mathcal{H}_-$ endowed with the inner product $\langle f, g \rangle_{\mathcal{H}_{\mathcal{K}}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} + \langle f_-, g_- \rangle_{\mathcal{H}_-}$ (with $f_\pm, g_\pm \in \mathcal{H}_\pm$) can be associated with $\mathcal{K}$. For a Kreĭn space $\mathcal{K}$, the decomposition $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ is not necessarily unique. Thus, a Kreĭn space can, in general, be associated with infinitely many Hilbert spaces. However, for any such Hilbert space $\mathcal{H}_{\mathcal{K}}$, the topology introduced on $\mathcal{K}$ via the norm $\| f \|_{\mathcal{H}_{\mathcal{K}}} = \sqrt{\langle f, f \rangle_{\mathcal{H}_{\mathcal{K}}}}$ is independent of the decomposition and the associated Hilbert space. More specifically, all norms $\| \cdot \|_{\mathcal{H}_{\mathcal{K}}}$ generated by different decompositions of $\mathcal{K}$ into direct sums of Hilbert spaces are topologically equivalent (Langer, 1962). The topology on $\mathcal{K}$ defined by the norm of an associated Hilbert space is called the strong topology.

Having reviewed basic properties of Kreĭn spaces, we are now ready to introduce the notion of a reproducing kernel Kreĭn space. For that, let $\mathcal{X}$ be an instance space and denote with $\mathbb{R}^{\mathcal{X}}$ the set of functions from $\mathcal{X}$ to $\mathbb{R}$. For a fixed element $x \in \mathcal{X}$, the map $T_x \colon \mathbb{R}^{\mathcal{X}} \to \mathbb{R}$ that assigns a real number to each function $f \in \mathbb{R}^{\mathcal{X}}$ is called the evaluation functional at $x$, i.e., $T_x(f) = f(x)$ for all $f \in \mathbb{R}^{\mathcal{X}}$.

Definition 2 (Alpay, 1991; Ong et al., 2004). A Kreĭn space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is a reproducing kernel Kreĭn space if $\mathcal{K} \subseteq \mathbb{R}^{\mathcal{X}}$ and the evaluation functional is continuous on $\mathcal{K}$ with respect to the strong topology.

The following theorem provides a characterization of reproducing kernel Kreĭn spaces.

Theorem 1 (Schwartz, 1964; Alpay, 1991). Let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a real-valued symmetric function. Then, there is an associated reproducing kernel Kreĭn space if and only if $k = k_+ - k_-$, where $k_+$ and $k_-$ are positive definite kernels. When the function $k$ admits such a decomposition, one can choose $k_+$ and $k_-$ such that the corresponding reproducing kernel Hilbert spaces are disjoint.

2.2. Nyström Method for Reproducing Kreĭn Kernels

Let $\mathcal{X}$ be an instance space and $X = \{x_1, \ldots, x_n\}$ an independent sample from a Borel probability measure defined on $\mathcal{X}$. For a positive definite kernel $h$ and a set of landmarks $Z = \{z_1, \ldots, z_m\} \subset X$, the Nyström method (Nyström, 1930; Smola and Schölkopf, 2000; Williams and Seeger, 2001) first projects the evaluation functionals $h(x_i, \cdot)$ onto $\operatorname{span}(\{h(z_1, \cdot), \ldots, h(z_m, \cdot)\})$ and then approximates the kernel matrix $H$ with entries $\{H_{ij} = h(x_i, x_j)\}_{i,j=1}^{n}$ by inner products between the projections of the corresponding evaluation functionals. The projections of the evaluation functionals $h(x_i, \cdot)$ are linear combinations of the landmarks, and the coefficients are given by the following convex optimization problem

$$\alpha^* = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^{m \times n}}\ \sum_{i=1}^{n} \Big\| h(x_i, \cdot) - \sum_{j=1}^{m} \alpha_{j,i}\, h(z_j, \cdot) \Big\|_{\mathcal{H}}^2. \quad (1)$$

While this approach works for positive definite kernels, it cannot be directly applied to reproducing Kreĭn kernels. In particular, let $\mathcal{K}$ be a reproducing kernel Kreĭn space with an indefinite kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The reproducing Kreĭn kernel $k$ is defined by an indefinite bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ which does not induce a norm on $\mathcal{K}$, and for $a, b \in \mathcal{K}$ the value of $\langle a - b, a - b \rangle_{\mathcal{K}}$ does not capture the distance between $a$ and $b$. More specifically, as the bilinear form is indefinite, there exists an element $c \in \mathcal{K}$ such that $\langle c, c \rangle_{\mathcal{K}} < 0$.
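To make this indefiniteness concrete at the matrix level, the following minimal NumPy sketch (our own illustration, not part of the original experiments; the sample and bandwidths are arbitrary) builds a kernel matrix from the difference of two Gaussian kernels and verifies that its spectrum contains eigenvalues of both signs, in line with the decomposition $k = k_+ - k_-$ from Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                   # toy sample of n = 200 instances

def gaussian_kernel(X, gamma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    return np.exp(-gamma * d2)

# Indefinite kernel as a difference of two Gaussian kernels, k = k_+ - k_-.
K = gaussian_kernel(X, gamma=1.0) - gaussian_kernel(X, gamma=0.1)

# The diagonal of K is zero, so its trace is zero and the (real, symmetric)
# spectrum necessarily contains both positive and negative eigenvalues.
eigvals = np.linalg.eigvalsh(K)
print((eigvals < 0).sum(), "negative and", (eigvals > 0).sum(), "positive eigenvalues")
```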
For an evaluation functional $k(x, \cdot) \in \mathcal{K}$ and a linear subspace $\mathcal{L}_Z \subset \mathcal{K}$ spanned by a set of evaluation functionals $\{k(z_1, \cdot), \ldots, k(z_m, \cdot)\}$, we define $\bar{k}(x, \cdot)$ to be the orthogonal projection of $k(x, \cdot)$ onto the subspace $\mathcal{L}_Z$ if the evaluation functional admits a decomposition (Azizov and Iokhvidov, 1981; Iokhvidov et al., 1982)

$$k(x, \cdot) = \bar{k}(x, \cdot) + k^{\perp}(x, \cdot),$$

where $\bar{k}(x, \cdot) = \sum_{i=1}^{m} \alpha_{i,x}\, k(z_i, \cdot)$ with $\alpha_x \in \mathbb{R}^m$, and $\langle k^{\perp}(x, \cdot), \mathcal{L}_Z \rangle_{\mathcal{K}} = 0$. For a landmark $z \in Z$, the inner product between the corresponding evaluation functional $k(z, \cdot)$ and $k(x, \cdot)$ then gives

$$k(x, z) = \langle k(x, \cdot), k(z, \cdot) \rangle_{\mathcal{K}} = \langle \bar{k}(x, \cdot), k(z, \cdot) \rangle_{\mathcal{K}} = \sum_{i=1}^{m} \alpha_{i,x}\, k(z_i, z), \quad (2)$$

where the second equality follows from the $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonality of $k^{\perp}(x, \cdot)$ and $\mathcal{L}_Z$. Denote with $K_{ZZ}$ the block in the kernel matrix $K$ corresponding to the landmarks $Z$ and let $k_x = \operatorname{vec}(k(x, z_1), \ldots, k(x, z_m))$. From Eq. (2) it then follows that $k_x = K_{ZZ}\, \alpha_x$. Thus, in Kreĭn spaces an orthogonal projection exists if the matrix $K_{ZZ}$ is non-singular. If this condition is satisfied, then the projection is given by $\bar{k}(x, \cdot) = \sum_{i=1}^{m} \alpha^*_{i,x}\, k(z_i, \cdot)$ with $\alpha^*_x = K_{ZZ}^{-1} k_x \in \mathbb{R}^m$.

Having computed the projection of a point onto the span of the landmarks in a Kreĭn space, we now proceed to define the Nyström approximation of the corresponding indefinite kernel matrix. In this, we follow the approach for positive definite kernels (Schölkopf and Smola, 2001; Smola and Schölkopf, 2000) and approximate the Kreĭn kernel matrix $K$ using the bilinear form on the span of the landmarks. More specifically, we have that

$$\tilde{k}(x_i, x_j) = \Big\langle \sum_{p=1}^{m} \alpha^*_{p,i}\, k(z_p, \cdot),\ \sum_{q=1}^{m} \alpha^*_{q,j}\, k(z_q, \cdot) \Big\rangle_{\mathcal{K}} = k_{x_i}^{\top} K_{ZZ}^{-1} k_{x_j}.$$

Thus, the low-rank approximation of the Kreĭn kernel matrix $K$ is given by

$$\tilde{K}_{X|Z} = K_{XZ} K_{ZZ}^{-1} K_{ZZ} K_{ZZ}^{-1} K_{ZX} = K_{XZ} K_{ZZ}^{-1} K_{ZX}. \quad (3)$$

This approach for low-rank approximation of Kreĭn kernel matrices also provides a direct way for an out-of-sample extension in the non-transductive setting. In particular, for an out-of-sample instance $x \in \mathcal{X}$ it holds that $\tilde{k}_{x,X} = \operatorname{vec}(\tilde{k}(x, x_1), \ldots, \tilde{k}(x, x_n)) = K_{XZ} K_{ZZ}^{-1} k_x$.

In applications to estimation problems (see Sections 2.4 and 2.5), an approximate low-rank eigendecomposition of the kernel matrix, also known as the one-shot variant of the Nyström method (Fowlkes et al., 2004), is sometimes preferred over the plain Nyström approximation described above. To derive such a factorization, we first observe that the low-rank approximation of the indefinite kernel matrix can be written as $\tilde{K}_{X|Z} = L S L^{\top}$ with $L = K_{XZ} U_{ZZ} |D_{ZZ}|^{-1/2}$, where $K_{ZZ} = U_{ZZ} D_{ZZ} U_{ZZ}^{\top}$ is an eigendecomposition of the block in the kernel matrix corresponding to the landmarks and $S = \operatorname{sign}(D_{ZZ})$. Substituting a singular value decomposition $L = A \Sigma B^{\top}$ into the latter equation (with orthonormal matrices $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times m}$), we deduce the following low-rank factorization

$$\tilde{K}_{X|Z} = A\, \big( \Sigma B^{\top} S B \Sigma \big)\, A^{\top} = A M A^{\top},$$

where $M = \Sigma B^{\top} S B \Sigma$ is a symmetric matrix with an eigendecomposition $M = P \Lambda P^{\top}$. Hence, $\tilde{K}_{X|Z} = (AP)\, \Lambda\, (AP)^{\top}$ with $(AP)^{\top} (AP) = I_m$. As the matrix $U = AP \in \mathbb{R}^{n \times m}$ contains $m$ orthonormal column vectors and $\Lambda$ is a diagonal matrix, we have thus derived an approximate low-rank eigendecomposition of $K$.
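As a companion to the derivation above, the following NumPy sketch implements the plain Nyström approximation from Eq. (3) and the one-shot approximate eigendecomposition $\tilde{K}_{X|Z} = U \Lambda U^{\top}$. It is a minimal illustration under the assumptions of this section: it presumes a non-singular landmark block and omits numerical safeguards such as pseudo-inversion or eigenvalue thresholding.

```python
import numpy as np

def nystroem_indefinite(K_XZ, K_ZZ):
    """Plain Nystrom approximation of an indefinite kernel matrix, Eq. (3):
    K_{X|Z} = K_XZ K_ZZ^{-1} K_ZX (assumes K_ZZ is non-singular)."""
    return K_XZ @ np.linalg.solve(K_ZZ, K_XZ.T)

def one_shot_eigendecomposition(K_XZ, K_ZZ):
    """One-shot Nystrom factorization K_{X|Z} ~= U diag(lam) U^T with U^T U = I_m."""
    d, U_ZZ = np.linalg.eigh(K_ZZ)                        # K_ZZ = U_ZZ D_ZZ U_ZZ^T
    s = np.sign(d)                                        # S = sign(D_ZZ)
    L = (K_XZ @ U_ZZ) / np.sqrt(np.abs(d))                # L = K_XZ U_ZZ |D_ZZ|^{-1/2}
    A, sigma, Bt = np.linalg.svd(L, full_matrices=False)  # L = A Sigma B^T
    M = sigma[:, None] * (Bt @ (s[:, None] * Bt.T)) * sigma   # M = Sigma B^T S B Sigma
    lam, P = np.linalg.eigh(M)                            # M = P Lambda P^T
    return A @ P, lam                                     # U = A P, Lambda = diag(lam)
```

Continuing the toy example above, for a set of landmark indices `idx` the blocks `K[:, idx]` and `K[np.ix_(idx, idx)]` play the roles of $K_{XZ}$ and $K_{ZZ}$, and the Frobenius error of the resulting rank-$m$ factorization can be checked directly against $K$.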
2.3. Landmark Selection for the Nyström Method with Indefinite Kernels

The effectiveness of a low-rank approximation based on the Nyström method depends crucially on the choice of landmarks, and an optimal choice is a difficult discrete optimization problem. Landmark selection for the Nyström method has been studied extensively in the context of the approximation of positive definite matrices (e.g., see Drineas and Mahoney, 2005; Kumar et al., 2012; Gittens and Mahoney, 2016; Alaoui and Mahoney, 2015; Oglic and Gärtner, 2017). We follow this line of research and present two landmark selection strategies for indefinite Kreĭn kernels inspired by the state-of-the-art sampling schemes: (approximate) kernel K-means++ sampling (Oglic and Gärtner, 2017) and statistical leverage scores (Alaoui and Mahoney, 2015; Drineas et al., 2006; Drineas and Mahoney, 2005).

In both cases, we propose to first sample a small number of instances uniformly at random and create a sketch matrix $\tilde{K} = U \Lambda U^{\top}$ using the procedure described in Section 2.2. Then, using this sketch matrix, we propose to approximate: i) statistical leverage scores for all instances, and/or ii) squared distances between instances in the feature space of the factorization $H = L L^{\top}$ with $L = U |\Lambda|^{1/2}$. The approximate leverage score assigned to the $i$-th instance is given as the squared norm of the $i$-th row of the matrix $U$, that is $\ell(x_i) = \| U_{(i)} \|^2$ with $1 \leq i \leq n$. As the two matrices $H$ and $\tilde{K}$ have identical eigenvectors, the approximate leverage scores obtained using the positive definite matrix $H$ capture the informative part of the eigenspace of the indefinite matrix $\tilde{K}$. This follows from the Eckart–Young–Mirsky theorem (Eckart and Young, 1936; Mirsky, 1960), which implies that the optimal low-rank approximation of an indefinite kernel matrix is given by a set of landmarks spanning the same subspace as the one spanned by the eigenvectors corresponding to the top eigenvalues, sorted in descending order with respect to their absolute values. The landmark selection strategy based on the approximate leverage scores then works by taking a set of independent samples from $p_{\ell}(x) = \ell(x) / \sum_{i=1}^{n} \ell(x_i)$.

For approximate kernel K-means++ landmark selection, we propose to perform K-means++ clustering (Arthur and Vassilvitskii, 2007) in the feature space defined by the factorization matrix $L$, that is, each instance is represented by a row of this matrix. The approach works by first sampling an instance uniformly at random and setting it as the first landmark (i.e., the first cluster centroid). Following this, the next landmark/centroid is selected by sampling an instance with probability proportional to its clustering contribution. More formally, assuming that the landmarks $\{z_1, z_2, \ldots, z_s\}$ have already been selected, the $(s+1)$-st one is selected by taking a sample from the distribution

$$p^{++}_{s+1}(x) = \frac{\min_{1 \leq i \leq s} \| x - z_i \|^2}{\sum_{j=1}^{n} \min_{1 \leq i \leq s} \| x_j - z_i \|^2}.$$
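Both sampling schemes can be sketched on top of the one-shot factorization from Section 2.2. The snippet below is a simplified illustration rather than the exact experimental implementation (e.g., a full kernel K-means++ run would continue with Lloyd-style updates, and the sketch itself is not resampled); it reuses the `one_shot_eigendecomposition` helper from the earlier sketch.

```python
import numpy as np

def approx_leverage_probs(U):
    """Approximate leverage scores: squared norms of the rows of U, normalized."""
    scores = np.sum(U**2, axis=1)
    return scores / scores.sum()                  # p_l(x_i) = l(x_i) / sum_j l(x_j)

def sample_leverage(K, m, sketch_idx, rng):
    """Leverage-score landmark sampling based on a uniform sketch."""
    U, lam = one_shot_eigendecomposition(K[:, sketch_idx], K[np.ix_(sketch_idx, sketch_idx)])
    return rng.choice(len(K), size=m, replace=False, p=approx_leverage_probs(U))

def sample_kernel_kmeanspp(K, m, sketch_idx, rng):
    """Approximate kernel K-means++ seeding in the feature space L = U |Lambda|^{1/2}."""
    U, lam = one_shot_eigendecomposition(K[:, sketch_idx], K[np.ix_(sketch_idx, sketch_idx)])
    L = U * np.sqrt(np.abs(lam))                  # rows are approximate feature vectors
    idx = [rng.integers(len(L))]                  # first landmark chosen uniformly at random
    for _ in range(m - 1):
        # squared distance of every instance to its nearest already-selected landmark
        d2 = np.min(((L[:, None, :] - L[idx][None, :, :])**2).sum(-1), axis=1)
        idx.append(rng.choice(len(L), p=d2 / d2.sum()))
    return np.asarray(idx)
```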
2.4. Scaling Least Squares Methods for Indefinite Kernels using the Nyström Method

We present two regularized risk minimization problems with the squared error loss function for scalable learning in reproducing kernel Kreĭn spaces. Our choice of the regularization term is motivated by the considerations in Oglic and Gärtner (2018), where the authors regularize with respect to a decomposition of the Kreĭn kernel into a direct sum of Hilbert spaces. We start with a Kreĭn least squares method (KREĬN LSM), which is a variant of kernel ridge regression, i.e.,

$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2,$$

where $f = f_+ + f_- \in \mathcal{K}$, $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ with disjoint $\mathcal{H}_\pm$, $f_\pm \in \mathcal{H}_\pm$, and hyperparameters $\lambda_\pm \in \mathbb{R}_+$. This is a convex optimization problem for which the representer theorem holds (Oglic and Gärtner, 2018, Appendix A) and the optimal solution is $f^* = \sum_{i=1}^{n} \alpha^*_i\, k(x_i, \cdot)$ with $\alpha^* \in \mathbb{R}^n$. Applying the reproducing property of the Kreĭn kernel and setting the gradient of the objective to zero, we derive

$$\alpha^* = \left( H + n \Lambda \right)^{-1} P y,$$

where $K = U D U^{\top}$, $S = \operatorname{sign}(D)$, $H = U D S U^{\top}$, $P = U S U^{\top}$, and $\Lambda = \lambda_+ S_+ + \lambda_- |S_-|$ with $S_\pm = (S \pm I)/2$.

An important difference compared to stabilization approaches (e.g., see Loosli et al., 2016) is that we are solving a regularized risk minimization problem for which a globally optimal solution can be found in polynomial time. Another difference is that stabilization approaches perform subspace descent, whereas we optimize jointly over the decomposition components of a Kreĭn space. In the special case with $\lambda_+ = \lambda_-$, the approach outputs a hypothesis equivalent to that of a stabilization approach along the lines of Loosli et al. (2016). In particular, the matrix $H$ is called the flip-spectrum transformation of $K$, and $k_{x,X}^{\top} P$ is the corresponding out-of-sample transformation. Learning with the flip-spectrum transformation of an indefinite kernel matrix was first considered in Graepel et al. (1998), and the corresponding out-of-sample transformation was first proposed in Chen et al. (2009). The following proposition (a proof is provided in Appendix A) establishes the equivalence between the least squares method with the flip-spectrum matrix in place of an indefinite kernel matrix and Kreĭn kernel ridge regression regularized with a single hyperparameter.

Proposition 2. If the Kreĭn kernel ridge regression problem is regularized via the norm $\| \cdot \|_{\mathcal{H}_{\mathcal{K}}}$ with $\lambda = \lambda_+ = \lambda_-$, then the optimal hypothesis is equivalent to that obtained with kernel ridge regression and the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix.

Having established this, we now proceed to formulate a Kreĭn regression problem with the low-rank approximation $\tilde{K}_{X|Z}$ in place of the indefinite kernel matrix $K$. More formally, after substituting the low-rank approximation into the Kreĭn kernel ridge regression problem, we transform it using

$$z = |D_{ZZ}|^{-1/2} U_{ZZ}^{\top} K_{ZX}\, \alpha = L_{X|Z}^{\top} \alpha, \qquad \Phi = K_{XZ} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ} = L_{X|Z} S_{ZZ},$$

so that $\tilde{K}_{X|Z}\, \alpha = L_{X|Z} S_{ZZ}\, z = \Phi z$ and $\alpha^{\top} H_\pm\, \alpha = z_\pm^{\top} z_\pm$, where $H_\pm = L_{X|Z} |S_{ZZ,\pm}| L_{X|Z}^{\top}$, $z_\pm = |S_{ZZ,\pm}|\, z$, and $S_{ZZ,\pm} = (S_{ZZ} \pm I)/2$. Hence, we can write a low-rank variant of the Kreĭn kernel ridge regression problem as

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ \| \Phi z - y \|_2^2 + n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2.$$

The problem is convex in $z$ and the optimal solution satisfies $z^* = \left( \Phi^{\top} \Phi + n \Lambda \right)^{-1} \Phi^{\top} y$. An out-of-sample extension for this learning problem is given by $f^*(x) = k_x^{\top} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ}\, z^*$.

Having introduced a low-rank variant of Kreĭn kernel ridge regression, we proceed to define a scalable variance-constrained least squares method (KREĬN VC-LSM). This risk minimization problem is given by (Oglic and Gärtner, 2018)

$$\min_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} f(x_i)^2 = r^2,$$

with hyperparameters $r \in \mathbb{R}$ and $\lambda_\pm \in \mathbb{R}_+$. To simplify our derivations (just as in Oglic and Gärtner, 2018), we have assumed without loss of generality that the kernel matrix $K$ is centered. The hard constraint then fixes the variance of the predictor over the training instances. Similar to Kreĭn kernel ridge regression, we can transform this problem into

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2 - 2 z^{\top} \Phi^{\top} y \quad \text{s.t.} \quad z^{\top} \Phi^{\top} \Phi\, z = r^2.$$

Now, performing a singular value decomposition $\Phi = A \Sigma B^{\top}$ and taking $\gamma = \Sigma B^{\top} z$, we obtain

$$\gamma^* = \operatorname*{arg\,min}_{\gamma \in \mathbb{R}^m}\ n\, \gamma^{\top} \Sigma^{-1} B^{\top} \Lambda B \Sigma^{-1} \gamma - 2 \left( A^{\top} y \right)^{\top} \gamma \quad \text{s.t.} \quad \gamma^{\top} \gamma = r^2.$$

A globally optimal solution to this non-convex problem can be computed by following the procedures outlined in Gander et al. (1989) and Oglic and Gärtner (2018). The cost of computing the solution is $\mathcal{O}(m^3)$ and the cost of the low-rank transformation of the problem is $\mathcal{O}(m^3 + m^2 n)$. An out-of-sample extension can also be obtained by following the derivation for Kreĭn kernel ridge regression.
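The low-rank Kreĭn kernel ridge regression above reduces to an $m \times m$ linear system. The following sketch (our own illustration of the simpler of the two least squares methods; it reuses the notation of this section and again assumes a non-singular landmark block) builds $\Phi$ and the diagonal regularizer, solves for $z^*$, and returns the out-of-sample predictor.

```python
import numpy as np

def krein_lowrank_lsm(K_XZ, K_ZZ, y, lam_pos, lam_neg):
    """Low-rank Krein kernel ridge regression (KREIN LSM), Section 2.4."""
    n = len(y)
    d, U = np.linalg.eigh(K_ZZ)                   # K_ZZ = U_ZZ D_ZZ U_ZZ^T
    s = np.sign(d)                                # S_ZZ = sign(D_ZZ)
    T = (U / np.sqrt(np.abs(d))) * s              # T = U_ZZ |D_ZZ|^{-1/2} S_ZZ
    Phi = K_XZ @ T                                # Phi = K_XZ U_ZZ |D_ZZ|^{-1/2} S_ZZ
    # Diagonal regularizer: lam_pos on positive-sign coordinates, lam_neg on negative ones.
    Lam = np.where(s > 0, lam_pos, lam_neg)
    z = np.linalg.solve(Phi.T @ Phi + n * np.diag(Lam), Phi.T @ y)

    def predict(k_x):
        """Out-of-sample extension f*(x) = k_x^T U_ZZ |D_ZZ|^{-1/2} S_ZZ z*."""
        return k_x @ T @ z

    return z, predict
```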
2.5. Scaling Support Vector Machines for Indefinite Kernels using the Nyström Method

In this section, we propose a low-rank support vector machine for scalable classification with indefinite kernels. Our regularization term is again motivated by the considerations in Oglic and Gärtner (2018), and this is one of the two main differences compared to the Kreĭn support vector machine proposed in Loosli et al. (2016). The latter approach outputs a hypothesis which can equivalently be obtained using the standard support vector machine with the flip-spectrum kernel matrix combined with the corresponding out-of-sample transformation (introduced in Section 2.4). The second difference of our approach compared to Loosli et al. (2016) stems from the fact that in low-rank formulations one optimizes the primal of the problem, defined with the squared hinge loss instead of the plain hinge loss. In particular, the latter loss function is not differentiable, which can complicate the hyperparameter optimization. We note that the identical choice of loss function was used in other works on primal-based optimization of support vector machines (e.g., see Mangasarian, 2002; Keerthi and DeCoste, 2005). We propose the following optimization problem as the Kreĭn squared hinge support vector machine (KREĬN SH-SVM)

$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \max\{1 - y_i f(x_i),\, 0\}^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2.$$

Similar to Section 2.4, the representer theorem holds for this problem and, applying the reproducing property of the Kreĭn kernel, we can transform it to a matrix form. If we again substitute the low-rank approximation $\tilde{K}_{X|Z}$ in place of the Kreĭn kernel matrix $K$, we observe that

$$f(x_i) = \tilde{k}_{x_i, X}^{\top} \alpha = k_{x_i}^{\top} K_{ZZ}^{-1} K_{ZX}\, \alpha = k_{x_i}^{\top} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ}\, z = \Phi_i z,$$

where $\Phi_i$ denotes the $i$-th row of the matrix $\Phi$. The low-rank variant of the approach can then be written as

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ \sum_{i=1}^{n} \max\{1 - y_i \Phi_i z,\, 0\}^2 + n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2.$$

The derivation of the solution follows that for the standard primal-based training of support vector machines, with the only difference being that the diagonal matrix $\Lambda$ is used instead of the scalar hyperparameter controlling the hypothesis complexity (e.g., see Mangasarian, 2002; Keerthi and DeCoste, 2005). To automatically tune the hyperparameters, one can follow the procedure described in Chapelle et al. (2002) and use implicit differentiation to compute the gradient of the optimal solution with respect to the hyperparameters.
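For completeness, the low-rank squared-hinge objective can be minimized with any smooth unconstrained optimizer. The sketch below is our own minimal illustration using L-BFGS from SciPy as a stand-in for the Newton-type primal solvers cited above; it reuses `Phi` and the diagonal regularizer `Lam` as constructed in the previous sketch and is not the exact implementation used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def krein_lowrank_sh_svm(Phi, y, Lam, n):
    """Low-rank Krein squared-hinge SVM (KREIN SH-SVM): minimizes
    sum_i max(1 - y_i Phi_i z, 0)^2 + n * z^T diag(Lam) z over z in R^m."""
    def objective(z):
        margins = np.maximum(1.0 - y * (Phi @ z), 0.0)         # active squared-hinge terms
        loss = np.sum(margins**2) + n * np.sum(Lam * z**2)
        grad = -2.0 * Phi.T @ (y * margins) + 2.0 * n * Lam * z
        return loss, grad

    res = minimize(objective, np.zeros(Phi.shape[1]), jac=True, method="L-BFGS-B")
    return res.x                                               # predictions: sign(Phi @ z*)
```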
We conclude with a discussion of a potential shortcoming inherent to the Kreĭn support vector machine (Loosli et al., 2016). As the following proposition shows (a proof can be found in Appendix A), that approach is equivalent to the standard support vector machine with the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix (Graepel et al., 1998), combined with the corresponding out-of-sample transformation (Chen et al., 2009).

Proposition 3. The Kreĭn support vector machine (Loosli et al., 2016) is equivalent to the standard support vector machine with the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix (Graepel et al., 1998), combined with the out-of-sample transformation from Chen et al. (2009).

While at first glance this result seems incremental, it makes an important contribution towards understanding the Kreĭn support vector machine (Loosli et al., 2016). In particular, the discussion of the experiments in Loosli et al. (2016) differentiates between the Kreĭn support vector machine and the flip-spectrum approach. This happens despite the illustrations indicating that they produce identical hypotheses in synthetic experiments (e.g., see Figures 3 and 4 in Loosli et al., 2016, and the discussion therein).

Figure 1: Reduction in the approximation error for an indefinite kernel matrix defined as the difference between two Gaussian kernels as the approximation rank increases. [Panels: (a) parkinsons, (b) ailerons, (c) elevators, (d) pole-telecom, (e) cal-housing, all with ι = 50%; x-axis: rank (log scale); y-axis: log Frobenius error; curves: approx. kernel K-means++, approx. leverage scores, and uniform sampling, each with (k | k) and (k | k log n) landmark budgets.] In the figure legend, we use (k | l) to express the fact that a rank k approximation of the kernel matrix is computed using a set of l landmarks.

Figure 2: Approximation errors of Nyström low-rank approximations (with different approximation ranks) as a function of the time required to compute these approximations. [Panels: (a) parkinsons, (b) ailerons, (c) elevators, (d) pole-telecom, (e) cal-housing, all with ι = 50%; x-axis: log time (Nyström); y-axis: log Frobenius error; same curves as in Figure 1.] In the figure legend, we again use (k | l) to express the fact that a rank k approximation of the kernel matrix is computed using a set of l landmarks.

3. Experiments

In this section, we report the results of experiments aimed at demonstrating the effectiveness of: i) the Nyström method in low-rank approximations of indefinite kernel matrices, and ii) the described scalable Kreĭn approaches in classification tasks with pairwise (dis)similarity matrices.

In the first set of experiments, we take several datasets from the UCI and LIACC repositories and define kernel matrices on them using the same indefinite kernels as previous work (Oglic and Gärtner, 2018, Appendix D). We use $0 \leq \iota = \sum_{i \colon \lambda_i < 0} |\lambda_i| \,/\, \sum_{i} |\lambda_i| \leq 1$ to quantify the level of indefiniteness of a kernel matrix. Prior to the computation of kernel matrices, all data matrices were normalized to have zero mean and unit variance across features. Following this, we applied the Nyström method with the landmark selection strategies presented in Section 2.3 to derive approximations with different ranks. We measure the effectiveness of a low-rank approximation by its error in the Frobenius norm. To quantify the effectiveness of the approximate eigendecomposition of the kernel matrix (i.e., the one-shot Nyström method) derived in Section 2.2, we performed rank k approximations using sets of k log n landmarks. Figures 1 and 2 summarize the results obtained with an indefinite kernel defined by the difference between two Gaussian kernels. The reported error/time is the median error/time over 10 repetitions of the experiment.
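As a small aside, the indefiniteness measure ι introduced above can be computed directly from the spectrum of the kernel matrix; a minimal sketch, assuming a symmetric kernel matrix `K` given as a NumPy array:

```python
import numpy as np

def indefiniteness(K):
    """iota: sum of |negative eigenvalues| divided by the sum of all |eigenvalues|."""
    eigvals = np.linalg.eigvalsh(K)
    return np.sum(np.abs(eigvals[eigvals < 0])) / np.sum(np.abs(eigvals))
```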
Figure 1 indicates a sharp (approximately exponential) decay in the approximation error as the rank of the approximation increases. The devised approximate kernel K-means++ sampling strategy performs best in terms of accuracy in the experiments where rank k approximations are generated using k landmarks. The approximate leverage score strategy is quite competitive and, in rank k approximations generated using k log n landmarks, it performs as well as or even better than the approximate kernel K-means++ sampling scheme. Oglic and Gärtner (2017) have evaluated these two state-of-the-art strategies on the low-rank approximation of positive definite kernels. In contrast to that work, we had to resort to an approximate kernel K-means++ sampling scheme because of the indefiniteness of the bilinear form defining a Kreĭn space. As a result, we do not observe the gap between the curves describing the two sampling strategies that was reported in Oglic and Gärtner (2017) for positive definite kernels. Our hypothesis is that this is due to sub-optimal choices of the landmarks that define the sketch matrices. In our simulations, we have generated sketches by sampling the corresponding landmarks uniformly at random. In support of this hypothesis, rather large approximation errors for uniformly selected landmarks can be observed in the approximation of other indefinite kernels (see Appendix C). Figure 2 reports the time required to generate a Nyström low-rank approximation and indicates that the considered sampling strategies account for only a small fraction of the total time required to generate the low-rank approximation.

In the second set of experiments, we evaluate the effectiveness of the proposed least squares methods and the support vector machine on classification tasks with pairwise dissimilarity matrices (Pekalska and Duin, 2005; Duin and Pekalska, 2009), available at http://prtools.org/disdatasets/index.html. Following the instructions in Pekalska and Haasdonk (2009), the dissimilarity matrices are converted to similarities by applying the transformation characteristic of multi-dimensional scaling (e.g., see the negative double-centering transformation in Cox and Cox, 2000). In each simulation, we perform 10-fold stratified cross-validation and measure the effectiveness of an approach with the average/median percentage of misclassified examples. For multi-class problems, we only evaluate the effectiveness of a single binary one-vs-all classifier (just as in Oglic and Gärtner, 2018, Appendix C).

Figure 3: Reduction in the classification error as the approximation rank of a Nyström low-rank approximation increases. The reported error is the median classification error obtained using 10-fold stratified cross-validation. [Panels: (a) coilyork, ι = 26%; (b) balls3D, ι = 0.07%; (c) prodom, ι = 99%; (d) chicken10-120, ι = 18%; (e) protein, ι = 0.07%; x-axis: rank (log scale); y-axis: classification error (%); curves: SF-LSM, Kreĭn LSM, Kreĭn VC-LSM, and Kreĭn SH-SVM.]

Figure 3 shows the reduction in the classification error as the approximation rank increases. The reported error is the median error over the 10 folds. Here, SF-LSM represents the baseline in which similarities are used as features and a linear ridge regression model is trained in that instance space (Chen et al., 2009; Alabdulmohsin et al., 2015).
The figure indicates that the baseline is quite competitive, but overall the proposed low-rank variants perform very well across different datasets (additional plots are provided in Appendix C). Table 1 provides detailed results over all the datasets. In Appendix C (Table 2), we also compare the effectiveness of our low-rank approaches with the relevant state-of-the-art methods that make no approximations and represent hypotheses via the span of kernel functions centered at training instances. The empirical results indicate a competitive performance of our low-rank approaches with only 100 landmarks across all the datasets and a variety of indefinite kernel functions.

| Dataset | Dissimilarity type | Kreĭn VC-LSM | Kreĭn LSM | Kreĭn SH-SVM | SF-LSM |
|---|---|---|---|---|---|
| coilyork | Graph matching | 32.22 (±7.89) | 31.21 (±5.28) | 38.20 (±7.20) | 35.33 (±10.09) |
| balls3D | Shortest distance between balls | 1.00 (±2.00) | 0.50 (±1.50) | 0.00 (±0.00) | 0.50 (±1.50) |
| prodom | Structural alignment of proteins | 0.92 (±0.46) | 0.92 (±0.46) | 0.54 (±0.47) | 1.57 (±0.58) |
| chicken10 | String edit distance | 16.35 (±4.31) | 15.69 (±4.97) | 16.82 (±6.57) | 14.37 (±4.02) |
| protein | Structural alignment of proteins | 4.19 (±2.47) | 3.72 (±2.76) | 5.23 (±2.89) | 5.15 (±3.91) |
| zongker | Deformable template matching | 17.70 (±2.06) | 17.75 (±2.23) | 15.30 (±3.39) | 17.05 (±2.36) |
| chicken25 | String edit distance | 19.29 (±4.64) | 20.41 (±4.09) | 25.77 (±4.68) | 18.17 (±6.67) |
| pdish57 | Hausdorff distance | 3.40 (±0.39) | 3.40 (±0.42) | 2.73 (±0.62) | 3.03 (±0.67) |
| pdism57 | Hausdorff distance | 0.38 (±0.26) | 0.38 (±0.26) | 0.30 (±0.29) | 0.63 (±0.42) |
| woody50 | Plant leaves shape dissimilarity | 30.84 (±5.25) | 30.47 (±5.54) | 38.42 (±7.13) | 26.41 (±4.42) |

Table 1: The table reports the results of our experiment on benchmark datasets for learning with indefinite kernels (Pekalska and Duin, 2005). The goal of the experiment is to evaluate the effectiveness of the state-of-the-art approaches for scalable learning in reproducing kernel Kreĭn spaces on classification tasks with pairwise dissimilarity matrices; all four learners use a rank 100 approximation. We measure the effectiveness of an approach using the average classification error obtained with 10-fold stratified cross-validation (standard deviations are given in brackets).

4. Discussion

The Nyström method has recently been used for approximate eigendecomposition and low-rank approximation of indefinite kernel matrices (Gisbrecht and Schleif, 2015; Schleif and Tiño, 2015; Schleif et al., 2016). To circumvent the fact that the original derivations of the approach are restricted to positive definite Mercer kernels (Smola and Schölkopf, 2000; Williams and Seeger, 2001), Gisbrecht and Schleif (2015) provide a derivation of the approach based on approximations of the integral eigenfunctions arising in an eigendecomposition of an indefinite kernel. In particular, the authors of that work introduce an integral operator defined with an indefinite kernel and its empirical/sample-based approximation, which asymptotically converges to the original (indefinite) integral operator. Based on this result, Gisbrecht and Schleif (2015) provide a derivation of the Nyström method for indefinite kernels that treats the approximate equalities arising in the approximations of integral eigenfunctions as if they were exact. While such an assumption might hold for some datasets, it fails to hold in the general case, and this makes their extension of the Nyström method to indefinite kernels mathematically incomplete.
Our derivation of the approach does not rely on such an assumption and thus provides a stronger result. Moreover, our proof is much simpler than the one in Gisbrecht and Schleif (2015) and provides a geometrical intuition for the approximation. In addition to this, Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016) proposed a method for finding an approximate low-rank eigendecomposition of an indefinite kernel matrix (for the sake of completeness, we review this approach in Appendix B). From the perspective of the exact number of floating point operations (FLOPs), the approach by Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016) requires 7 matrix-to-matrix multiplications (each with a cost of $m^2 n$ FLOPs) and 2 eigendecompositions (each with a cost of $m^3$ FLOPs). Thus, in total their approach requires $7 m^2 n + 2 m^3$ FLOPs to find an approximate low-rank eigendecomposition of an indefinite kernel matrix. In contrast to this, the approach proposed in Section 2.2 comes with a much better runtime complexity and requires at most $3 m^2 n + 3 m^3$ FLOPs. To see the practical runtime benefit of our approach, take the problem of approximating a kernel matrix defined with $n = 10^6$ instances using $m = 10^3$ landmarks. Our method for approximate low-rank eigendecomposition then requires roughly $4 \cdot 10^{12}$ fewer FLOPs than the approach proposed by Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016).

Beside the considered low-rank approximations, it is possible to treat indefinite similarity functions as features and learn with linear models (Alabdulmohsin et al., 2015; Chen et al., 2009) or squared kernel matrices (Graepel et al., 1998). However, Balcan et al. (2008) have shown that learning with a positive definite kernel corresponding to a feature space in which the target concept is separable by a linear hypothesis yields a larger margin compared to learning with a linear model in a feature space constructed using that kernel function. As a result, if a kernel is used to construct a feature representation, the sample complexity of a linear model in that space might be higher than for learning with a kernelized variant of regularized risk minimization.

The effectiveness of a particular landmark selection strategy is a problem studied separately from the derivation of the Nyström method and we, therefore, do not focus on that problem in this work. However, clustering and leverage score sampling have been proposed and validated in earlier publications and are state of the art for low-rank approximation of positive definite kernels (Kumar et al., 2012; Alaoui and Mahoney, 2015; Gittens and Mahoney, 2016; Oglic and Gärtner, 2017). As the flip-spectrum matrix shares its eigenspace with the indefinite kernel matrix, the convergence results on the effectiveness of landmark selection strategies for Nyström low-rank approximation of positive definite kernels apply to indefinite kernels (e.g., see Section 2.3 or Eckart and Young, 1936; Mirsky, 1960). In particular, bounds for the leverage score sampling strategy applied to the flip-spectrum matrix carry over to our derivation of the Nyström method for indefinite kernels.

We conclude with a reference to Schleif et al. (2018) and Loosli et al. (2016), where an issue concerning the sparsity of a solution returned by the Kreĭn support vector machine has been raised.
We hypothesize that our approach can overcome this limitation by either controlling the approximation rank or penalizing the low-rank objective with the ℓ1-norm of the linear model. We leave the theoretical study and evaluation of such an approach for future work.

Acknowledgment: We are grateful for access to the University of Nottingham High Performance Computing Facility. Dino Oglic was supported in part by EPSRC grant EP/R012067/1.

References

Alabdulmohsin, I., Gao, X., and Zhang, X. Z. (2015). Support vector machines with indefinite kernels. In Proceedings of the Sixth Asian Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Alaoui, A. E. and Mahoney, M. W. (2015). Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28.

Alpay, D. (1991). Some remarks on reproducing kernel Kreĭn spaces. Rocky Mountain Journal of Mathematics.

Arthur, D. and Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Azizov, T. Y. and Iokhvidov, I. S. (1981). Linear operators in spaces with indefinite metric and their applications. Journal of Soviet Mathematics.

Balcan, M.-F., Blum, A., and Srebro, N. (2008). A theory of learning with similarity functions. Machine Learning.

Bognár, J. (1974). Indefinite inner product spaces. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159.

Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research.

Cox, T. F. and Cox, M. A. A. (2000). Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition.

Drineas, P., Kannan, R., and Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing.

Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research.

Duin, R. P. and Pekalska, E. (2009). Datasets and tools for dissimilarity analysis in pattern recognition. Beyond Features: Similarity-Based Pattern Analysis and Recognition.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211-218.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225.

Gander, W., Golub, G. H., and von Matt, U. (1989). A constrained eigenvalue problem. Linear Algebra and its Applications.

Gisbrecht, A. and Schleif, F.-M. (2015). Metric and non-metric proximity transformations at linear costs. Neurocomputing, 167:643-657.

Gittens, A. and Mahoney, M. W. (2016). Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research.

Graepel, T., Herbrich, R., Bollmann-Sdorra, P., and Obermayer, K. (1998). Classification on pairwise proximity data. In Advances in Neural Information Processing Systems 11.

Iokhvidov, I. S., Kreĭn, M. G., and Langer, H. (1982). Introduction to the spectral theory of operators in spaces with an indefinite metric. Berlin: Akademie-Verlag.
Keerthi, S. S. and DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341-361.

Kumar, S., Mohri, M., and Talwalkar, A. (2012). Sampling methods for the Nyström method. Journal of Machine Learning Research.

Langer, H. (1962). Zur Spektraltheorie J-selbstadjungierter Operatoren. Mathematische Annalen.

Loosli, G., Canu, S., and Ong, C. S. (2016). Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17:913-929.

Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, Oxford II. Series, 11:50-59.

Nyström, E. J. (1930). Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica.

Oglic, D. (2018). Constructive Approximation and Learning by Greedy Algorithms. PhD thesis, University of Bonn, Germany.

Oglic, D. and Gärtner, T. (2017). Nyström method with kernel K-means++ samples as landmarks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research. PMLR.

Oglic, D. and Gärtner, T. (2018). Learning in reproducing kernel Kreĭn spaces. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR.

Ong, C. S., Mary, X., Canu, S., and Smola, A. J. (2004). Learning with non-positive kernels. In Proceedings of the Twenty-First International Conference on Machine Learning.

Pekalska, E. and Duin, R. P. W. (2005). The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (Machine Perception and Artificial Intelligence). World Scientific Publishing Co., Inc.

Pekalska, E. and Haasdonk, B. (2009). Kernel discriminant analysis with positive definite and indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6).

Rifkin, R. M. (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis.

Schleif, F.-M., Gisbrecht, A., and Tiño, P. (2016). Probabilistic classifiers with low rank indefinite kernels. arXiv preprint arXiv:1604.02264.

Schleif, F.-M., Raab, C., and Tiño, P. (2018). Sparsification of indefinite learning models. In Structural, Syntactic, and Statistical Pattern Recognition, pages 173-183. Springer.

Schleif, F.-M. and Tiño, P. (2015). Indefinite proximity learning: A review. Neural Computation, 27(10):2039-2096.

Schleif, F.-M. and Tiño, P. (2017). Indefinite core vector machine. Pattern Recognition, 71:187-195.

Schölkopf, B. and Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.

Schwartz, L. (1964). Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). Journal d'Analyse Mathématique.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning.

Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of ill-posed problems. W. H. Winston, Washington, D.C.

Williams, C. K. I. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13.