# Scalable Learning in Reproducing Kernel Kreĭn Spaces

Dino Oglic¹ and Thomas Gärtner²

¹Department of Informatics, King's College London, UK. ²School of Computer Science, University of Nottingham, UK. Correspondence to: Dino Oglic.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

Abstract

We provide the first mathematically complete derivation of the Nyström method for low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. Building on this result, we devise highly scalable methods for learning in reproducing kernel Kreĭn spaces. The devised approaches provide a principled and theoretically well-founded means to tackle large-scale learning problems with indefinite kernels. The main motivation for our work comes from problems with structured representations (e.g., graphs, strings, time series), where it is relatively easy to devise a pairwise (dis)similarity function based on intuition and/or knowledge of domain experts. Such functions are typically not positive definite, and verifying this condition is often well beyond the expertise of practitioners. The effectiveness of the devised approaches is evaluated empirically using indefinite kernels defined on structured and vectorial data representations.

1. Introduction

In learning problems with structured data it is relatively easy to devise a pairwise similarity/dissimilarity function based on the intuition or knowledge of domain experts. Such functions are typically not positive definite, and verifying this condition is often well beyond the expertise of practitioners. Learning problems with indefinite similarity/dissimilarity functions are typically modeled via Kreĭn spaces (e.g., see Ong et al., 2004; Loosli et al., 2016; Oglic and Gärtner, 2018), which are vector spaces with an indefinite bilinear form (Azizov and Iokhvidov, 1981; Iokhvidov et al., 1982). The computational and space complexities of these approaches are similar to those of standard kernel methods that work with positive definite kernels (Schölkopf and Smola, 2001). The Nyström method (Nyström, 1930; Smola and Schölkopf, 2000; Williams and Seeger, 2001) is an effective approach for low-rank approximation of positive definite kernels that can scale kernel methods to problems with millions of instances (Schölkopf and Smola, 2001). We provide the first mathematically complete derivation that extends the Nyström method to low-rank approximation of indefinite kernels and propose an efficient method for finding an approximate eigendecomposition of such kernel matrices. To tackle the computational issues arising in large-scale problems with indefinite kernels, we also devise several novel Nyström-based low-rank approaches tailored for scalable learning in reproducing kernel Kreĭn spaces.

We start by showing that the Nyström method can be used for low-rank approximations of indefinite kernel matrices and provide means for finding their approximate eigendecompositions (Section 2.2). We then devise two landmark sampling strategies based on state-of-the-art techniques (Gittens and Mahoney, 2016; Oglic and Gärtner, 2017) used in Nyström approximations of positive definite kernels (Section 2.3).
Having described means for finding low-rank factorizations of indefinite kernel matrices, we formulate low-rank variants of two least squares methods (Tikhonov and Arsenin, 1977; Rifkin, 2002; Oglic and Gärtner, 2018) for learning in reproducing kernel Kreĭn spaces (Section 2.4). We also derive a novel low-rank variant of the support vector machine for scalable learning in reproducing kernel Kreĭn spaces (Section 2.5), inspired by Oglic and Gärtner (2018). Having introduced means for scalable learning in reproducing kernel Kreĭn spaces, we evaluate the effectiveness of these approaches and the Nyström low-rank approximations on datasets from standard machine learning repositories (Section 3). The empirical results demonstrate the effectiveness of the proposed approaches in: i) classification tasks, and ii) problems of finding a low-rank approximation of an indefinite kernel matrix. The experiments are performed using 15 representative datasets and a variety of indefinite kernels. The paper concludes with a discussion in which we contrast our approach with other relevant approaches (Section 4).

2. Scalable Learning in Kreĭn Spaces

In this section, we first provide a novel derivation of the Nyström method that allows us to extend the approach to low-rank approximation of indefinite kernel matrices. Building on this result, we then provide means to scale Kreĭn kernel methods to datasets with millions of instances/pairwise (dis)similarities. More specifically, we devise low-rank variants of kernel ridge regression and support vector machines in reproducing kernel Kreĭn spaces, as well as a low-rank variant of the variance-constrained ridge regression proposed in Oglic and Gärtner (2018). As the effectiveness of low-rank approximations based on the Nyström method critically depends on the selected landmarks, we also adapt two state-of-the-art landmark sampling strategies proposed for the approximation of positive definite kernels. In addition to this, we make a theoretical contribution (Proposition 3) relating the Kreĭn support vector machine (Loosli et al., 2016) to previous work on learning with flip-spectrum kernel matrices (Graepel et al., 1998; Chen et al., 2009).

2.1. Reproducing Kernel Kreĭn Spaces

Let $\mathcal{K}$ be a vector space over the scalar field $\mathbb{R}$. A bilinear form on $\mathcal{K}$ is a function $\langle \cdot, \cdot \rangle_{\mathcal{K}} \colon \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ such that, for all $f, g, h \in \mathcal{K}$ and scalars $\alpha, \beta \in \mathbb{R}$, it holds that: i) $\langle \alpha f + \beta g, h \rangle_{\mathcal{K}} = \alpha \langle f, h \rangle_{\mathcal{K}} + \beta \langle g, h \rangle_{\mathcal{K}}$, and ii) $\langle f, \alpha g + \beta h \rangle_{\mathcal{K}} = \alpha \langle f, g \rangle_{\mathcal{K}} + \beta \langle f, h \rangle_{\mathcal{K}}$. The bilinear form is called non-degenerate if $(\forall f \in \mathcal{K})\colon \big( (\forall g \in \mathcal{K})\ \langle f, g \rangle_{\mathcal{K}} = 0 \big) \implies f = 0$. The bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ is symmetric if, for all $f, g \in \mathcal{K}$, we have $\langle f, g \rangle_{\mathcal{K}} = \langle g, f \rangle_{\mathcal{K}}$. The form is called indefinite if there exist $f, g \in \mathcal{K}$ such that $\langle f, f \rangle_{\mathcal{K}} > 0$ and $\langle g, g \rangle_{\mathcal{K}} < 0$. On the other hand, if $\langle f, f \rangle_{\mathcal{K}} \geq 0$ for all $f \in \mathcal{K}$, then the form is called positive. A non-degenerate, symmetric, and positive bilinear form on $\mathcal{K}$ is called an inner product. Any two elements $f, g \in \mathcal{K}$ that satisfy $\langle f, g \rangle_{\mathcal{K}} = 0$ are $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal. Similarly, any two subspaces $\mathcal{K}_1, \mathcal{K}_2 \subseteq \mathcal{K}$ that satisfy $\langle f_1, f_2 \rangle_{\mathcal{K}} = 0$ for all $f_1 \in \mathcal{K}_1$ and $f_2 \in \mathcal{K}_2$ are called $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal. Having reviewed bilinear forms, we are now ready to introduce the notion of a Kreĭn space.
Definition 1 (Bognár, 1974; Azizov and Iokhvidov, 1981). The vector space $\mathcal{K}$ with a bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ is called a Kreĭn space if it admits a decomposition into a direct sum $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ of $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonal Hilbert spaces $\mathcal{H}_\pm$ such that the bilinear form can be written as $\langle f, g \rangle_{\mathcal{K}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} - \langle f_-, g_- \rangle_{\mathcal{H}_-}$, where $\mathcal{H}_\pm$ are endowed with inner products $\langle \cdot, \cdot \rangle_{\mathcal{H}_\pm}$, $f = f_+ + f_-$, $g = g_+ + g_-$, and $f_\pm, g_\pm \in \mathcal{H}_\pm$.

For a fixed decomposition $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$, the Hilbert space $\mathcal{H}_{\mathcal{K}} = \mathcal{H}_+ \oplus \mathcal{H}_-$ endowed with the inner product $\langle f, g \rangle_{\mathcal{H}_{\mathcal{K}}} = \langle f_+, g_+ \rangle_{\mathcal{H}_+} + \langle f_-, g_- \rangle_{\mathcal{H}_-}$ (with $f_\pm, g_\pm \in \mathcal{H}_\pm$) can be associated with $\mathcal{K}$. For a Kreĭn space $\mathcal{K}$, the decomposition $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ is not necessarily unique. Thus, a Kreĭn space can, in general, be associated with infinitely many Hilbert spaces. However, for any such Hilbert space $\mathcal{H}_{\mathcal{K}}$, the topology introduced on $\mathcal{K}$ via the norm $\| f \|_{\mathcal{H}_{\mathcal{K}}} = \sqrt{\langle f, f \rangle_{\mathcal{H}_{\mathcal{K}}}}$ is independent of the decomposition and the associated Hilbert space. More specifically, all norms $\| \cdot \|_{\mathcal{H}_{\mathcal{K}}}$ generated by different decompositions of $\mathcal{K}$ into direct sums of Hilbert spaces are topologically equivalent (Langer, 1962). The topology on $\mathcal{K}$ defined by the norm of an associated Hilbert space is called the strong topology.

Having reviewed basic properties of Kreĭn spaces, we are now ready to introduce the notion of a reproducing kernel Kreĭn space. For that, let $\mathcal{X}$ be an instance space and denote with $\mathbb{R}^{\mathcal{X}}$ the set of functions from $\mathcal{X}$ to $\mathbb{R}$. For a fixed element $x \in \mathcal{X}$, the map $T_x \colon \mathbb{R}^{\mathcal{X}} \to \mathbb{R}$ that assigns a real number to each function $f \in \mathbb{R}^{\mathcal{X}}$ is called the evaluation functional at $x$, i.e., $T_x(f) = f(x)$ for all $f \in \mathbb{R}^{\mathcal{X}}$.

Definition 2 (Alpay, 1991; Ong et al., 2004). A Kreĭn space $(\mathcal{K}, \langle \cdot, \cdot \rangle_{\mathcal{K}})$ is a reproducing kernel Kreĭn space if $\mathcal{K} \subseteq \mathbb{R}^{\mathcal{X}}$ and the evaluation functional is continuous on $\mathcal{K}$ with respect to the strong topology.

The following theorem provides a characterization of reproducing kernel Kreĭn spaces.

Theorem 1 (Schwartz, 1964; Alpay, 1991). Let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a real-valued symmetric function. Then, there is an associated reproducing kernel Kreĭn space if and only if $k = k_+ - k_-$, where $k_+$ and $k_-$ are positive definite kernels. When the function $k$ admits such a decomposition, one can choose $k_+$ and $k_-$ such that the corresponding reproducing kernel Hilbert spaces are disjoint.

2.2. Nyström Method for Reproducing Kreĭn Kernels

Let $\mathcal{X}$ be an instance space and $X = \{x_1, \ldots, x_n\}$ an independent sample from a Borel probability measure defined on $\mathcal{X}$. For a positive definite kernel $h$ and a set of landmarks $Z = \{z_1, \ldots, z_m\} \subset X$, the Nyström method (Nyström, 1930; Smola and Schölkopf, 2000; Williams and Seeger, 2001) first projects the evaluation functionals $h(x_i, \cdot)$ onto $\operatorname{span}(\{h(z_1, \cdot), \ldots, h(z_m, \cdot)\})$ and then approximates the kernel matrix $H$ with entries $\{H_{ij} = h(x_i, x_j)\}_{i,j=1}^{n}$ by inner products between the projections of the corresponding evaluation functionals. The projections of the evaluation functionals $h(x_i, \cdot)$ are linear combinations of the landmarks, and the coefficients are given by the following convex optimization problem

$$\alpha^* = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^{m \times n}}\ \sum_{i=1}^{n} \Big\| h(x_i, \cdot) - \sum_{j=1}^{m} \alpha_{j,i}\, h(z_j, \cdot) \Big\|_{\mathcal{H}}^2. \quad (1)$$

While this approach works for positive definite kernels, it cannot be directly applied to reproducing Kreĭn kernels. In particular, let $\mathcal{K}$ be a reproducing kernel Kreĭn space with an indefinite kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The reproducing Kreĭn kernel $k$ is defined by an indefinite bilinear form $\langle \cdot, \cdot \rangle_{\mathcal{K}}$ which does not induce a norm on $\mathcal{K}$, and for $a, b \in \mathcal{K}$ the value of $\langle a - b, a - b \rangle_{\mathcal{K}}$ does not capture the distance between $a$ and $b$. More specifically, as the bilinear form is indefinite, there exists an element $c \in \mathcal{K}$ such that $\langle c, c \rangle_{\mathcal{K}} < 0$.
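To make this indefiniteness concrete at the matrix level, the following minimal NumPy sketch (our own illustration, not part of the original experiments; the sample and bandwidths are arbitrary) builds a kernel matrix from the difference of two Gaussian kernels and verifies that its spectrum contains eigenvalues of both signs, in line with the decomposition $k = k_+ - k_-$ from Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))                   # toy sample of n = 200 instances

def gaussian_kernel(X, gamma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    return np.exp(-gamma * d2)

# Indefinite kernel as a difference of two Gaussian kernels, k = k_+ - k_-.
K = gaussian_kernel(X, gamma=1.0) - gaussian_kernel(X, gamma=0.1)

# The diagonal of K is zero, so its trace is zero and the (real, symmetric)
# spectrum necessarily contains both positive and negative eigenvalues.
eigvals = np.linalg.eigvalsh(K)
print((eigvals < 0).sum(), "negative and", (eigvals > 0).sum(), "positive eigenvalues")
```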
For an evaluation functional $k(x, \cdot) \in \mathcal{K}$ and a linear subspace $\mathcal{L}_Z \subset \mathcal{K}$ spanned by a set of evaluation functionals $\{k(z_1, \cdot), \ldots, k(z_m, \cdot)\}$, we define $\bar{k}(x, \cdot)$ to be the orthogonal projection of $k(x, \cdot)$ onto the subspace $\mathcal{L}_Z$ if the evaluation functional admits a decomposition (Azizov and Iokhvidov, 1981; Iokhvidov et al., 1982)

$$k(x, \cdot) = \bar{k}(x, \cdot) + k^{\perp}(x, \cdot),$$

where $\bar{k}(x, \cdot) = \sum_{i=1}^{m} \alpha_{i,x}\, k(z_i, \cdot)$ with $\alpha_x \in \mathbb{R}^m$, and $\langle k^{\perp}(x, \cdot), \mathcal{L}_Z \rangle_{\mathcal{K}} = 0$. For a landmark $z \in Z$, the inner product between the corresponding evaluation functional $k(z, \cdot)$ and $k(x, \cdot)$ then gives

$$k(x, z) = \langle k(x, \cdot), k(z, \cdot) \rangle_{\mathcal{K}} = \langle \bar{k}(x, \cdot), k(z, \cdot) \rangle_{\mathcal{K}} = \sum_{i=1}^{m} \alpha_{i,x}\, k(z_i, z), \quad (2)$$

where the second equality follows from the $\langle \cdot, \cdot \rangle_{\mathcal{K}}$-orthogonality of $k^{\perp}(x, \cdot)$ and $\mathcal{L}_Z$. Denote with $K_{ZZ}$ the block in the kernel matrix $K$ corresponding to the landmarks $Z$ and let $k_x = \operatorname{vec}(k(x, z_1), \ldots, k(x, z_m))$. From Eq. (2) it then follows that $k_x = K_{ZZ}\, \alpha_x$. Thus, in Kreĭn spaces an orthogonal projection exists if the matrix $K_{ZZ}$ is non-singular. If this condition is satisfied, then the projection is given by $\bar{k}(x, \cdot) = \sum_{i=1}^{m} \alpha^*_{i,x}\, k(z_i, \cdot)$ with $\alpha^*_x = K_{ZZ}^{-1} k_x \in \mathbb{R}^m$.

Having computed the projection of a point onto the span of the landmarks in a Kreĭn space, we now proceed to define the Nyström approximation of the corresponding indefinite kernel matrix. In this, we follow the approach for positive definite kernels (Schölkopf and Smola, 2001; Smola and Schölkopf, 2000) and approximate the Kreĭn kernel matrix $K$ using the bilinear form on the span of the landmarks. More specifically, we have that

$$\tilde{k}(x_i, x_j) = \Big\langle \sum_{p=1}^{m} \alpha^*_{p,i}\, k(z_p, \cdot),\ \sum_{q=1}^{m} \alpha^*_{q,j}\, k(z_q, \cdot) \Big\rangle_{\mathcal{K}} = k_{x_i}^{\top} K_{ZZ}^{-1} k_{x_j}.$$

Thus, the low-rank approximation of the Kreĭn kernel matrix $K$ is given by

$$\tilde{K}_{X|Z} = K_{XZ} K_{ZZ}^{-1} K_{ZZ} K_{ZZ}^{-1} K_{ZX} = K_{XZ} K_{ZZ}^{-1} K_{ZX}. \quad (3)$$

This approach for low-rank approximation of Kreĭn kernel matrices also provides a direct way for an out-of-sample extension in the non-transductive setting. In particular, for an out-of-sample instance $x \in \mathcal{X}$ it holds that $\tilde{k}_{x,X} = \operatorname{vec}(\tilde{k}(x, x_1), \ldots, \tilde{k}(x, x_n)) = K_{XZ} K_{ZZ}^{-1} k_x$.

In applications to estimation problems (see Sections 2.4 and 2.5), an approximate low-rank eigendecomposition of the kernel matrix, also known as the one-shot variant of the Nyström method (Fowlkes et al., 2004), is sometimes preferred over the plain Nyström approximation described above. To derive such a factorization, we first observe that the low-rank approximation of the indefinite kernel matrix can be written as $\tilde{K}_{X|Z} = L S L^{\top}$ with $L = K_{XZ} U_{ZZ} |D_{ZZ}|^{-1/2}$, where $K_{ZZ} = U_{ZZ} D_{ZZ} U_{ZZ}^{\top}$ is an eigendecomposition of the block in the kernel matrix corresponding to the landmarks and $S = \operatorname{sign}(D_{ZZ})$. Substituting a singular value decomposition $L = A \Sigma B^{\top}$ into the latter equation (with orthonormal matrices $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times m}$), we deduce the following low-rank factorization

$$\tilde{K}_{X|Z} = A\, \big( \Sigma B^{\top} S B \Sigma \big)\, A^{\top} = A M A^{\top},$$

where $M = \Sigma B^{\top} S B \Sigma$ is a symmetric matrix with an eigendecomposition $M = P \Lambda P^{\top}$. Hence, $\tilde{K}_{X|Z} = (AP)\, \Lambda\, (AP)^{\top}$ with $(AP)^{\top} (AP) = I_m$. As the matrix $U = AP \in \mathbb{R}^{n \times m}$ contains $m$ orthonormal column vectors and $\Lambda$ is a diagonal matrix, we have thus derived an approximate low-rank eigendecomposition of $K$.
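As a companion to the derivation above, the following NumPy sketch implements the plain Nyström approximation from Eq. (3) and the one-shot approximate eigendecomposition $\tilde{K}_{X|Z} = U \Lambda U^{\top}$. It is a minimal illustration under the assumptions of this section: it presumes a non-singular landmark block and omits numerical safeguards such as pseudo-inversion or eigenvalue thresholding.

```python
import numpy as np

def nystroem_indefinite(K_XZ, K_ZZ):
    """Plain Nystrom approximation of an indefinite kernel matrix, Eq. (3):
    K_{X|Z} = K_XZ K_ZZ^{-1} K_ZX (assumes K_ZZ is non-singular)."""
    return K_XZ @ np.linalg.solve(K_ZZ, K_XZ.T)

def one_shot_eigendecomposition(K_XZ, K_ZZ):
    """One-shot Nystrom factorization K_{X|Z} ~= U diag(lam) U^T with U^T U = I_m."""
    d, U_ZZ = np.linalg.eigh(K_ZZ)                        # K_ZZ = U_ZZ D_ZZ U_ZZ^T
    s = np.sign(d)                                        # S = sign(D_ZZ)
    L = (K_XZ @ U_ZZ) / np.sqrt(np.abs(d))                # L = K_XZ U_ZZ |D_ZZ|^{-1/2}
    A, sigma, Bt = np.linalg.svd(L, full_matrices=False)  # L = A Sigma B^T
    M = sigma[:, None] * (Bt @ (s[:, None] * Bt.T)) * sigma   # M = Sigma B^T S B Sigma
    lam, P = np.linalg.eigh(M)                            # M = P Lambda P^T
    return A @ P, lam                                     # U = A P, Lambda = diag(lam)
```

Continuing the toy example above, for a set of landmark indices `idx` the blocks `K[:, idx]` and `K[np.ix_(idx, idx)]` play the roles of $K_{XZ}$ and $K_{ZZ}$, and the Frobenius error of the resulting rank-$m$ factorization can be checked directly against $K$.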
2.3. Landmark Selection for the Nyström Method with Indefinite Kernels

The effectiveness of a low-rank approximation based on the Nyström method depends crucially on the choice of landmarks, and an optimal choice is a difficult discrete optimization problem. Landmark selection for the Nyström method has been studied extensively in the context of the approximation of positive definite matrices (e.g., see Drineas and Mahoney, 2005; Kumar et al., 2012; Gittens and Mahoney, 2016; Alaoui and Mahoney, 2015; Oglic and Gärtner, 2017). We follow this line of research and present two landmark selection strategies for indefinite Kreĭn kernels inspired by the state-of-the-art sampling schemes: (approximate) kernel K-means++ sampling (Oglic and Gärtner, 2017) and statistical leverage scores (Alaoui and Mahoney, 2015; Drineas et al., 2006; Drineas and Mahoney, 2005).

In both cases, we propose to first sample a small number of instances uniformly at random and create a sketch matrix $\tilde{K} = U \Lambda U^{\top}$ using the procedure described in Section 2.2. Then, using this sketch matrix, we propose to approximate: i) statistical leverage scores for all instances, and/or ii) squared distances between instances in the feature space of the factorization $H = L L^{\top}$ with $L = U |\Lambda|^{1/2}$. The approximate leverage score assigned to the $i$-th instance is given as the squared norm of the $i$-th row of the matrix $U$, that is $\ell(x_i) = \| U_{(i)} \|^2$ with $1 \leq i \leq n$. As the two matrices $H$ and $\tilde{K}$ have identical eigenvectors, the approximate leverage scores obtained using the positive definite matrix $H$ capture the informative part of the eigenspace of the indefinite matrix $\tilde{K}$. This follows from the Eckart–Young–Mirsky theorem (Eckart and Young, 1936; Mirsky, 1960), which implies that the optimal low-rank approximation of an indefinite kernel matrix is given by a set of landmarks spanning the same subspace as the one spanned by the eigenvectors corresponding to the top eigenvalues, sorted in descending order with respect to their absolute values. The landmark selection strategy based on the approximate leverage scores then works by taking a set of independent samples from $p_{\ell}(x) = \ell(x) / \sum_{i=1}^{n} \ell(x_i)$.

For approximate kernel K-means++ landmark selection, we propose to perform K-means++ clustering (Arthur and Vassilvitskii, 2007) in the feature space defined by the factorization matrix $L$, that is, each instance is represented by a row of this matrix. The approach works by first sampling an instance uniformly at random and setting it as the first landmark (i.e., the first cluster centroid). Following this, the next landmark/centroid is selected by sampling an instance with probability proportional to its clustering contribution. More formally, assuming that the landmarks $\{z_1, z_2, \ldots, z_s\}$ have already been selected, the $(s+1)$-st one is selected by taking a sample from the distribution

$$p^{++}_{s+1}(x) = \frac{\min_{1 \leq i \leq s} \| x - z_i \|^2}{\sum_{j=1}^{n} \min_{1 \leq i \leq s} \| x_j - z_i \|^2}.$$
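Both sampling schemes can be sketched on top of the one-shot factorization from Section 2.2. The snippet below is a simplified illustration rather than the exact experimental implementation (e.g., a full kernel K-means++ run would continue with Lloyd-style updates, and the sketch itself is not resampled); it reuses the `one_shot_eigendecomposition` helper from the earlier sketch.

```python
import numpy as np

def approx_leverage_probs(U):
    """Approximate leverage scores: squared norms of the rows of U, normalized."""
    scores = np.sum(U**2, axis=1)
    return scores / scores.sum()                  # p_l(x_i) = l(x_i) / sum_j l(x_j)

def sample_leverage(K, m, sketch_idx, rng):
    """Leverage-score landmark sampling based on a uniform sketch."""
    U, lam = one_shot_eigendecomposition(K[:, sketch_idx], K[np.ix_(sketch_idx, sketch_idx)])
    return rng.choice(len(K), size=m, replace=False, p=approx_leverage_probs(U))

def sample_kernel_kmeanspp(K, m, sketch_idx, rng):
    """Approximate kernel K-means++ seeding in the feature space L = U |Lambda|^{1/2}."""
    U, lam = one_shot_eigendecomposition(K[:, sketch_idx], K[np.ix_(sketch_idx, sketch_idx)])
    L = U * np.sqrt(np.abs(lam))                  # rows are approximate feature vectors
    idx = [rng.integers(len(L))]                  # first landmark chosen uniformly at random
    for _ in range(m - 1):
        # squared distance of every instance to its nearest already-selected landmark
        d2 = np.min(((L[:, None, :] - L[idx][None, :, :])**2).sum(-1), axis=1)
        idx.append(rng.choice(len(L), p=d2 / d2.sum()))
    return np.asarray(idx)
```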
2.4. Scaling Least Squares Methods for Indefinite Kernels using the Nyström Method

We present two regularized risk minimization problems with the squared error loss function for scalable learning in reproducing kernel Kreĭn spaces. Our choice of the regularization term is motivated by the considerations in Oglic and Gärtner (2018), where the authors regularize with respect to a decomposition of the Kreĭn kernel into a direct sum of Hilbert spaces. We start with a Kreĭn least squares method (KREĬN LSM), which is a variant of kernel ridge regression, i.e.,

$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2,$$

where $f = f_+ + f_- \in \mathcal{K}$, $\mathcal{K} = \mathcal{H}_+ \oplus \mathcal{H}_-$ with disjoint $\mathcal{H}_\pm$, $f_\pm \in \mathcal{H}_\pm$, and hyperparameters $\lambda_\pm \in \mathbb{R}_+$. This is a convex optimization problem for which the representer theorem holds (Oglic and Gärtner, 2018, Appendix A) and the optimal solution is $f^* = \sum_{i=1}^{n} \alpha^*_i\, k(x_i, \cdot)$ with $\alpha^* \in \mathbb{R}^n$. Applying the reproducing property of the Kreĭn kernel and setting the gradient of the objective to zero, we derive

$$\alpha^* = \left( H + n \Lambda \right)^{-1} P y,$$

where $K = U D U^{\top}$, $S = \operatorname{sign}(D)$, $H = U D S U^{\top}$, $P = U S U^{\top}$, and $\Lambda = \lambda_+ S_+ + \lambda_- |S_-|$ with $S_\pm = (S \pm I)/2$.

An important difference compared to stabilization approaches (e.g., see Loosli et al., 2016) is that we are solving a regularized risk minimization problem for which a globally optimal solution can be found in polynomial time. Another difference is that stabilization approaches perform subspace descent, whereas we optimize jointly over the decomposition components of a Kreĭn space. In the special case with $\lambda_+ = \lambda_-$, the approach outputs a hypothesis equivalent to that of a stabilization approach along the lines of Loosli et al. (2016). In particular, the matrix $H$ is called the flip-spectrum transformation of $K$, and $k_{x,X}^{\top} P$ is the corresponding out-of-sample transformation. Learning with the flip-spectrum transformation of an indefinite kernel matrix was first considered in Graepel et al. (1998), and the corresponding out-of-sample transformation was first proposed in Chen et al. (2009). The following proposition (a proof is provided in Appendix A) establishes the equivalence between the least squares method with the flip-spectrum matrix in place of an indefinite kernel matrix and Kreĭn kernel ridge regression regularized with a single hyperparameter.

Proposition 2. If the Kreĭn kernel ridge regression problem is regularized via the norm $\| \cdot \|_{\mathcal{H}_{\mathcal{K}}}$ with $\lambda = \lambda_+ = \lambda_-$, then the optimal hypothesis is equivalent to that obtained with kernel ridge regression and the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix.

Having established this, we now proceed to formulate a Kreĭn regression problem with the low-rank approximation $\tilde{K}_{X|Z}$ in place of the indefinite kernel matrix $K$. More formally, after substituting the low-rank approximation into the Kreĭn kernel ridge regression problem, we transform it using

$$z = |D_{ZZ}|^{-1/2} U_{ZZ}^{\top} K_{ZX}\, \alpha = L_{X|Z}^{\top} \alpha, \qquad \Phi = K_{XZ} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ} = L_{X|Z} S_{ZZ},$$

so that $\tilde{K}_{X|Z}\, \alpha = L_{X|Z} S_{ZZ}\, z = \Phi z$ and $\alpha^{\top} H_\pm\, \alpha = z_\pm^{\top} z_\pm$, where $H_\pm = L_{X|Z} |S_{ZZ,\pm}| L_{X|Z}^{\top}$, $z_\pm = |S_{ZZ,\pm}|\, z$, and $S_{ZZ,\pm} = (S_{ZZ} \pm I)/2$. Hence, we can write a low-rank variant of the Kreĭn kernel ridge regression problem as

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ \| \Phi z - y \|_2^2 + n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2.$$

The problem is convex in $z$ and the optimal solution satisfies $z^* = \left( \Phi^{\top} \Phi + n \Lambda \right)^{-1} \Phi^{\top} y$. An out-of-sample extension for this learning problem is given by $f^*(x) = k_x^{\top} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ}\, z^*$.

Having introduced a low-rank variant of Kreĭn kernel ridge regression, we proceed to define a scalable variance-constrained least squares method (KREĬN VC-LSM). This risk minimization problem is given by (Oglic and Gärtner, 2018)

$$\min_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} f(x_i)^2 = r^2,$$

with hyperparameters $r \in \mathbb{R}$ and $\lambda_\pm \in \mathbb{R}_+$. To simplify our derivations (just as in Oglic and Gärtner, 2018), we have assumed without loss of generality that the kernel matrix $K$ is centered. The hard constraint then fixes the variance of the predictor over the training instances. Similar to Kreĭn kernel ridge regression, we can transform this problem into

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2 - 2 z^{\top} \Phi^{\top} y \quad \text{s.t.} \quad z^{\top} \Phi^{\top} \Phi\, z = r^2.$$

Now, performing a singular value decomposition $\Phi = A \Sigma B^{\top}$ and taking $\gamma = \Sigma B^{\top} z$, we obtain

$$\gamma^* = \operatorname*{arg\,min}_{\gamma \in \mathbb{R}^m}\ n\, \gamma^{\top} \Sigma^{-1} B^{\top} \Lambda B \Sigma^{-1} \gamma - 2 \left( A^{\top} y \right)^{\top} \gamma \quad \text{s.t.} \quad \gamma^{\top} \gamma = r^2.$$

A globally optimal solution to this non-convex problem can be computed by following the procedures outlined in Gander et al. (1989) and Oglic and Gärtner (2018). The cost of computing the solution is $\mathcal{O}(m^3)$ and the cost of the low-rank transformation of the problem is $\mathcal{O}(m^3 + m^2 n)$. An out-of-sample extension can also be obtained by following the derivation for Kreĭn kernel ridge regression.
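The low-rank Kreĭn kernel ridge regression above reduces to an $m \times m$ linear system. The following sketch (our own illustration of the simpler of the two least squares methods; it reuses the notation of this section and again assumes a non-singular landmark block) builds $\Phi$ and the diagonal regularizer, solves for $z^*$, and returns the out-of-sample predictor.

```python
import numpy as np

def krein_lowrank_lsm(K_XZ, K_ZZ, y, lam_pos, lam_neg):
    """Low-rank Krein kernel ridge regression (KREIN LSM), Section 2.4."""
    n = len(y)
    d, U = np.linalg.eigh(K_ZZ)                   # K_ZZ = U_ZZ D_ZZ U_ZZ^T
    s = np.sign(d)                                # S_ZZ = sign(D_ZZ)
    T = (U / np.sqrt(np.abs(d))) * s              # T = U_ZZ |D_ZZ|^{-1/2} S_ZZ
    Phi = K_XZ @ T                                # Phi = K_XZ U_ZZ |D_ZZ|^{-1/2} S_ZZ
    # Diagonal regularizer: lam_pos on positive-sign coordinates, lam_neg on negative ones.
    Lam = np.where(s > 0, lam_pos, lam_neg)
    z = np.linalg.solve(Phi.T @ Phi + n * np.diag(Lam), Phi.T @ y)

    def predict(k_x):
        """Out-of-sample extension f*(x) = k_x^T U_ZZ |D_ZZ|^{-1/2} S_ZZ z*."""
        return k_x @ T @ z

    return z, predict
```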
2.5. Scaling Support Vector Machines for Indefinite Kernels using the Nyström Method

In this section, we propose a low-rank support vector machine for scalable classification with indefinite kernels. Our regularization term is again motivated by the considerations in Oglic and Gärtner (2018), and this is one of the two main differences compared to the Kreĭn support vector machine proposed in Loosli et al. (2016). The latter approach outputs a hypothesis which can equivalently be obtained using the standard support vector machine with the flip-spectrum kernel matrix combined with the corresponding out-of-sample transformation (introduced in Section 2.4). The second difference of our approach compared to Loosli et al. (2016) stems from the fact that in low-rank formulations one optimizes the primal of the problem, defined with the squared hinge loss instead of the plain hinge loss. In particular, the latter loss function is not differentiable, which can complicate the hyperparameter optimization. We note that the identical choice of loss function was used in other works on primal-based optimization of support vector machines (e.g., see Mangasarian, 2002; Keerthi and DeCoste, 2005). We propose the following optimization problem as the Kreĭn squared hinge support vector machine (KREĬN SH-SVM)

$$f^* = \operatorname*{arg\,min}_{f \in \mathcal{K}}\ \frac{1}{n} \sum_{i=1}^{n} \max\{1 - y_i f(x_i),\, 0\}^2 + \lambda_+ \| f_+ \|_{\mathcal{H}_+}^2 + \lambda_- \| f_- \|_{\mathcal{H}_-}^2.$$

Similar to Section 2.4, the representer theorem holds for this problem and, applying the reproducing property of the Kreĭn kernel, we can transform it to a matrix form. If we again substitute the low-rank approximation $\tilde{K}_{X|Z}$ in place of the Kreĭn kernel matrix $K$, we observe that

$$f(x_i) = \tilde{k}_{x_i, X}^{\top} \alpha = k_{x_i}^{\top} K_{ZZ}^{-1} K_{ZX}\, \alpha = k_{x_i}^{\top} U_{ZZ} |D_{ZZ}|^{-1/2} S_{ZZ}\, z = \Phi_i z,$$

where $\Phi_i$ denotes the $i$-th row of the matrix $\Phi$. The low-rank variant of the approach can then be written as

$$z^* = \operatorname*{arg\,min}_{z \in \mathbb{R}^m}\ \sum_{i=1}^{n} \max\{1 - y_i \Phi_i z,\, 0\}^2 + n \lambda_+ \| z_+ \|_2^2 + n \lambda_- \| z_- \|_2^2.$$

The derivation of the solution follows that for the standard primal-based training of support vector machines, with the only difference being that the diagonal matrix $\Lambda$ is used instead of the scalar hyperparameter controlling the hypothesis complexity (e.g., see Mangasarian, 2002; Keerthi and DeCoste, 2005). To automatically tune the hyperparameters, one can follow the procedure described in Chapelle et al. (2002) and use implicit differentiation to compute the gradient of the optimal solution with respect to the hyperparameters.
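For completeness, the low-rank squared-hinge objective can be minimized with any smooth unconstrained optimizer. The sketch below is our own minimal illustration using L-BFGS from SciPy as a stand-in for the Newton-type primal solvers cited above; it reuses `Phi` and the diagonal regularizer `Lam` as constructed in the previous sketch and is not the exact implementation used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def krein_lowrank_sh_svm(Phi, y, Lam, n):
    """Low-rank Krein squared-hinge SVM (KREIN SH-SVM): minimizes
    sum_i max(1 - y_i Phi_i z, 0)^2 + n * z^T diag(Lam) z over z in R^m."""
    def objective(z):
        margins = np.maximum(1.0 - y * (Phi @ z), 0.0)         # active squared-hinge terms
        loss = np.sum(margins**2) + n * np.sum(Lam * z**2)
        grad = -2.0 * Phi.T @ (y * margins) + 2.0 * n * Lam * z
        return loss, grad

    res = minimize(objective, np.zeros(Phi.shape[1]), jac=True, method="L-BFGS-B")
    return res.x                                               # predictions: sign(Phi @ z*)
```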
We conclude with a discussion of a potential shortcoming inherent to the Kreĭn support vector machine (Loosli et al., 2016). As the following proposition shows (a proof can be found in Appendix A), that approach is equivalent to the standard support vector machine with the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix (Graepel et al., 1998), combined with the corresponding out-of-sample transformation (Chen et al., 2009).

Proposition 3. The Kreĭn support vector machine (Loosli et al., 2016) is equivalent to the standard support vector machine with the flip-spectrum matrix in place of an indefinite Kreĭn kernel matrix (Graepel et al., 1998), combined with the out-of-sample transformation from Chen et al. (2009).

While at first glance this result seems incremental, it makes an important contribution towards understanding the Kreĭn support vector machine (Loosli et al., 2016). In particular, the discussion of the experiments in Loosli et al. (2016) differentiates between the Kreĭn support vector machine and the flip-spectrum approach. This happens despite the illustrations indicating that they produce identical hypotheses in synthetic experiments (e.g., see Figures 3 and 4 in Loosli et al., 2016, and the discussion therein).

Figure 1: Reduction in the approximation error for an indefinite kernel matrix defined as the difference between two Gaussian kernels as the approximation rank increases. [Panels: (a) parkinsons, (b) ailerons, (c) elevators, (d) pole-telecom, (e) cal-housing, all with ι = 50%; x-axis: rank (log scale); y-axis: log Frobenius error; curves: approx. kernel K-means++, approx. leverage scores, and uniform sampling, each with (k | k) and (k | k log n) landmark budgets.] In the figure legend, we use (k | l) to express the fact that a rank k approximation of the kernel matrix is computed using a set of l landmarks.

Figure 2: Approximation errors of Nyström low-rank approximations (with different approximation ranks) as a function of the time required to compute these approximations. [Panels: (a) parkinsons, (b) ailerons, (c) elevators, (d) pole-telecom, (e) cal-housing, all with ι = 50%; x-axis: log time (Nyström); y-axis: log Frobenius error; same curves as in Figure 1.] In the figure legend, we again use (k | l) to express the fact that a rank k approximation of the kernel matrix is computed using a set of l landmarks.

3. Experiments

In this section, we report the results of experiments aimed at demonstrating the effectiveness of: i) the Nyström method in low-rank approximations of indefinite kernel matrices, and ii) the described scalable Kreĭn approaches in classification tasks with pairwise (dis)similarity matrices.

In the first set of experiments, we take several datasets from the UCI and LIACC repositories and define kernel matrices on them using the same indefinite kernels as previous work (Oglic and Gärtner, 2018, Appendix D). We use $0 \leq \iota = \sum_{i \colon \lambda_i < 0} |\lambda_i| \,/\, \sum_{i} |\lambda_i| \leq 1$ to quantify the level of indefiniteness of a kernel matrix. Prior to the computation of kernel matrices, all data matrices were normalized to have zero mean and unit variance across features. Following this, we applied the Nyström method with the landmark selection strategies presented in Section 2.3 to derive approximations with different ranks. We measure the effectiveness of a low-rank approximation by its error in the Frobenius norm. To quantify the effectiveness of the approximate eigendecomposition of the kernel matrix (i.e., the one-shot Nyström method) derived in Section 2.2, we performed rank k approximations using sets of k log n landmarks. Figures 1 and 2 summarize the results obtained with an indefinite kernel defined by the difference between two Gaussian kernels. The reported error/time is the median error/time over 10 repetitions of the experiment.
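As a small aside, the indefiniteness measure ι introduced above can be computed directly from the spectrum of the kernel matrix; a minimal sketch, assuming a symmetric kernel matrix `K` given as a NumPy array:

```python
import numpy as np

def indefiniteness(K):
    """iota: sum of |negative eigenvalues| divided by the sum of all |eigenvalues|."""
    eigvals = np.linalg.eigvalsh(K)
    return np.sum(np.abs(eigvals[eigvals < 0])) / np.sum(np.abs(eigvals))
```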
Figure 1 indicates a sharp (approximately exponential) decay in the approximation error as the rank of the approximation increases. The devised approximate kernel K-means++ sampling strategy performs best in terms of accuracy in the experiments where rank k approximations are generated using k landmarks. The approximate leverage score strategy is quite competitive and, in rank k approximations generated using k log n landmarks, it performs as well as or even better than the approximate kernel K-means++ sampling scheme. Oglic and Gärtner (2017) have evaluated these two state-of-the-art strategies on the low-rank approximation of positive definite kernels. In contrast to that work, we had to resort to an approximate kernel K-means++ sampling scheme because of the indefiniteness of the bilinear form defining a Kreĭn space. As a result, we do not observe the gap between the curves describing the two sampling strategies that was reported in Oglic and Gärtner (2017) for positive definite kernels. Our hypothesis is that this is due to sub-optimal choices of the landmarks that define the sketch matrices. In our simulations, we have generated sketches by sampling the corresponding landmarks uniformly at random. In support of this hypothesis, rather large approximation errors for uniformly selected landmarks can be observed in the approximation of other indefinite kernels (see Appendix C). Figure 2 reports the time required to generate a Nyström low-rank approximation and indicates that the considered sampling strategies account for only a small fraction of the total time required to generate the low-rank approximation.

In the second set of experiments, we evaluate the effectiveness of the proposed least squares methods and the support vector machine on classification tasks with pairwise dissimilarity matrices (Pekalska and Duin, 2005; Duin and Pekalska, 2009), available at http://prtools.org/disdatasets/index.html. Following the instructions in Pekalska and Haasdonk (2009), the dissimilarity matrices are converted to similarities by applying the transformation characteristic of multi-dimensional scaling (e.g., see the negative double-centering transformation in Cox and Cox, 2000). In each simulation, we perform 10-fold stratified cross-validation and measure the effectiveness of an approach with the average/median percentage of misclassified examples. For multi-class problems, we only evaluate the effectiveness of a single binary one-vs-all classifier (just as in Oglic and Gärtner, 2018, Appendix C).

Figure 3: Reduction in the classification error as the approximation rank of a Nyström low-rank approximation increases. The reported error is the median classification error obtained using 10-fold stratified cross-validation. [Panels: (a) coilyork, ι = 26%; (b) balls3D, ι = 0.07%; (c) prodom, ι = 99%; (d) chicken10-120, ι = 18%; (e) protein, ι = 0.07%; x-axis: rank (log scale); y-axis: classification error (%); curves: SF-LSM, Kreĭn LSM, Kreĭn VC-LSM, and Kreĭn SH-SVM.]

Figure 3 shows the reduction in the classification error as the approximation rank increases. The reported error is the median error over the 10 folds. Here, SF-LSM represents the baseline in which similarities are used as features and a linear ridge regression model is trained in that instance space (Chen et al., 2009; Alabdulmohsin et al., 2015).
The figure indicates that the baseline is quite competitive, but overall the proposed low-rank variants perform very well across different datasets (additional plots are provided in Appendix C). Table 1 provides detailed results over all the datasets. In Appendix C (Table 2), we also compare the effectiveness of our low-rank approaches with the relevant state-of-the-art methods that make no approximations and represent hypotheses via the span of kernel functions centered at training instances. The empirical results indicate a competitive performance of our low-rank approaches with only 100 landmarks across all the datasets and a variety of indefinite kernel functions.

| Dataset | Dissimilarity type | Kreĭn VC-LSM | Kreĭn LSM | Kreĭn SH-SVM | SF-LSM |
|---|---|---|---|---|---|
| coilyork | Graph matching | 32.22 (±7.89) | 31.21 (±5.28) | 38.20 (±7.20) | 35.33 (±10.09) |
| balls3D | Shortest distance between balls | 1.00 (±2.00) | 0.50 (±1.50) | 0.00 (±0.00) | 0.50 (±1.50) |
| prodom | Structural alignment of proteins | 0.92 (±0.46) | 0.92 (±0.46) | 0.54 (±0.47) | 1.57 (±0.58) |
| chicken10 | String edit distance | 16.35 (±4.31) | 15.69 (±4.97) | 16.82 (±6.57) | 14.37 (±4.02) |
| protein | Structural alignment of proteins | 4.19 (±2.47) | 3.72 (±2.76) | 5.23 (±2.89) | 5.15 (±3.91) |
| zongker | Deformable template matching | 17.70 (±2.06) | 17.75 (±2.23) | 15.30 (±3.39) | 17.05 (±2.36) |
| chicken25 | String edit distance | 19.29 (±4.64) | 20.41 (±4.09) | 25.77 (±4.68) | 18.17 (±6.67) |
| pdish57 | Hausdorff distance | 3.40 (±0.39) | 3.40 (±0.42) | 2.73 (±0.62) | 3.03 (±0.67) |
| pdism57 | Hausdorff distance | 0.38 (±0.26) | 0.38 (±0.26) | 0.30 (±0.29) | 0.63 (±0.42) |
| woody50 | Plant leaves shape dissimilarity | 30.84 (±5.25) | 30.47 (±5.54) | 38.42 (±7.13) | 26.41 (±4.42) |

Table 1: The table reports the results of our experiment on benchmark datasets for learning with indefinite kernels (Pekalska and Duin, 2005). The goal of the experiment is to evaluate the effectiveness of the state-of-the-art approaches for scalable learning in reproducing kernel Kreĭn spaces on classification tasks with pairwise dissimilarity matrices; all four learners use a rank 100 approximation. We measure the effectiveness of an approach using the average classification error obtained with 10-fold stratified cross-validation (standard deviations are given in brackets).

4. Discussion

The Nyström method has recently been used for approximate eigendecomposition and low-rank approximation of indefinite kernel matrices (Gisbrecht and Schleif, 2015; Schleif and Tiño, 2015; Schleif et al., 2016). To circumvent the fact that the original derivations of the approach are restricted to positive definite Mercer kernels (Smola and Schölkopf, 2000; Williams and Seeger, 2001), Gisbrecht and Schleif (2015) provide a derivation of the approach based on approximations of the integral eigenfunctions arising in an eigendecomposition of an indefinite kernel. In particular, the authors of that work introduce an integral operator defined with an indefinite kernel and its empirical/sample-based approximation, which asymptotically converges to the original (indefinite) integral operator. Based on this result, Gisbrecht and Schleif (2015) provide a derivation of the Nyström method for indefinite kernels that treats the approximate equalities arising in the approximations of integral eigenfunctions as if they were exact. While such an assumption might hold for some datasets, it fails to hold in the general case, and this makes their extension of the Nyström method to indefinite kernels mathematically incomplete.
Our derivation of the approach does not rely on such an assumption and thus provides a stronger result. Moreover, our proof is much simpler than the one in Gisbrecht and Schleif (2015) and provides a geometrical intuition for the approximation. In addition to this, Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016) proposed a method for finding an approximate low-rank eigendecomposition of an indefinite kernel matrix (for the sake of completeness, we review this approach in Appendix B). From the perspective of the exact number of floating point operations (FLOPs), the approach by Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016) requires 7 matrix-to-matrix multiplications (each with a cost of $m^2 n$ FLOPs) and 2 eigendecompositions (each with a cost of $m^3$ FLOPs). Thus, in total their approach requires $7 m^2 n + 2 m^3$ FLOPs to find an approximate low-rank eigendecomposition of an indefinite kernel matrix. In contrast to this, the approach proposed in Section 2.2 comes with a much better runtime complexity and requires at most $3 m^2 n + 3 m^3$ FLOPs. To see the practical runtime benefit of our approach, take the problem of approximating a kernel matrix defined with $n = 10^6$ instances using $m = 10^3$ landmarks. Our method for approximate low-rank eigendecomposition then requires roughly $4 \cdot 10^{12}$ fewer FLOPs than the approach proposed by Gisbrecht and Schleif (2015); Schleif and Tiño (2015); Schleif et al. (2016).

Beside the considered low-rank approximations, it is possible to treat indefinite similarity functions as features and learn with linear models (Alabdulmohsin et al., 2015; Chen et al., 2009) or squared kernel matrices (Graepel et al., 1998). However, Balcan et al. (2008) have shown that learning with a positive definite kernel corresponding to a feature space in which the target concept is separable by a linear hypothesis yields a larger margin compared to learning with a linear model in a feature space constructed using that kernel function. As a result, if a kernel is used to construct a feature representation, the sample complexity of a linear model in that space might be higher than for learning with a kernelized variant of regularized risk minimization.

The effectiveness of a particular landmark selection strategy is a problem studied separately from the derivation of the Nyström method and we, therefore, do not focus on that problem in this work. However, clustering and leverage score sampling have been proposed and validated in earlier publications and are state of the art for low-rank approximation of positive definite kernels (Kumar et al., 2012; Alaoui and Mahoney, 2015; Gittens and Mahoney, 2016; Oglic and Gärtner, 2017). As the flip-spectrum matrix shares its eigenspace with the indefinite kernel matrix, the convergence results on the effectiveness of landmark selection strategies for Nyström low-rank approximation of positive definite kernels apply to indefinite kernels (e.g., see Section 2.3 or Eckart and Young, 1936; Mirsky, 1960). In particular, bounds for the leverage score sampling strategy applied to the flip-spectrum matrix carry over to our derivation of the Nyström method for indefinite kernels.

We conclude with a reference to Schleif et al. (2018) and Loosli et al. (2016), where an issue concerning the sparsity of a solution returned by the Kreĭn support vector machine has been raised.
We hypothesize that our approach can overcome this limitation by either controlling the approximation rank or penalizing the low-rank objective with the ℓ1-norm of the linear model. We leave the theoretical study and evaluation of such an approach for future work.

Acknowledgment: We are grateful for access to the University of Nottingham High Performance Computing Facility. Dino Oglic was supported in part by EPSRC grant EP/R012067/1.

References

Alabdulmohsin, I., Gao, X., and Zhang, X. Z. (2015). Support vector machines with indefinite kernels. In Proceedings of the Sixth Asian Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Alaoui, A. E. and Mahoney, M. W. (2015). Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28.

Alpay, D. (1991). Some remarks on reproducing kernel Kreĭn spaces. Rocky Mountain Journal of Mathematics.

Arthur, D. and Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Azizov, T. Y. and Iokhvidov, I. S. (1981). Linear operators in spaces with indefinite metric and their applications. Journal of Soviet Mathematics.

Balcan, M.-F., Blum, A., and Srebro, N. (2008). A theory of learning with similarity functions. Machine Learning.

Bognár, J. (1974). Indefinite inner product spaces. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer.

Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159.

Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research.

Cox, T. F. and Cox, M. A. A. (2000). Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition.

Drineas, P., Kannan, R., and Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing.

Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research.

Duin, R. P. and Pekalska, E. (2009). Datasets and tools for dissimilarity analysis in pattern recognition. Beyond Features: Similarity-Based Pattern Analysis and Recognition.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211-218.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. (2004). Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225.

Gander, W., Golub, G. H., and von Matt, U. (1989). A constrained eigenvalue problem. Linear Algebra and its Applications.

Gisbrecht, A. and Schleif, F.-M. (2015). Metric and non-metric proximity transformations at linear costs. Neurocomputing, 167:643-657.

Gittens, A. and Mahoney, M. W. (2016). Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research.

Graepel, T., Herbrich, R., Bollmann-Sdorra, P., and Obermayer, K. (1998). Classification on pairwise proximity data. In Advances in Neural Information Processing Systems 11.

Iokhvidov, I. S., Kreĭn, M. G., and Langer, H. (1982). Introduction to the spectral theory of operators in spaces with an indefinite metric. Berlin: Akademie-Verlag.
Keerthi, S. S. and DeCoste, D. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341-361.

Kumar, S., Mohri, M., and Talwalkar, A. (2012). Sampling methods for the Nyström method. Journal of Machine Learning Research.

Langer, H. (1962). Zur Spektraltheorie J-selbstadjungierter Operatoren. Mathematische Annalen.

Loosli, G., Canu, S., and Ong, C. S. (2016). Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17:913-929.

Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, Oxford II. Series, 11:50-59.

Nyström, E. J. (1930). Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica.

Oglic, D. (2018). Constructive Approximation and Learning by Greedy Algorithms. PhD thesis, University of Bonn, Germany.

Oglic, D. and Gärtner, T. (2017). Nyström method with kernel K-means++ samples as landmarks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research. PMLR.

Oglic, D. and Gärtner, T. (2018). Learning in reproducing kernel Kreĭn spaces. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR.

Ong, C. S., Mary, X., Canu, S., and Smola, A. J. (2004). Learning with non-positive kernels. In Proceedings of the Twenty-First International Conference on Machine Learning.

Pekalska, E. and Duin, R. P. W. (2005). The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (Machine Perception and Artificial Intelligence). World Scientific Publishing Co., Inc.

Pekalska, E. and Haasdonk, B. (2009). Kernel discriminant analysis with positive definite and indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(6).

Rifkin, R. M. (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning. PhD thesis.

Schleif, F.-M., Gisbrecht, A., and Tiño, P. (2016). Probabilistic classifiers with low rank indefinite kernels. arXiv preprint arXiv:1604.02264.

Schleif, F.-M., Raab, C., and Tiño, P. (2018). Sparsification of indefinite learning models. In Structural, Syntactic, and Statistical Pattern Recognition, pages 173-183. Springer.

Schleif, F.-M. and Tiño, P. (2015). Indefinite proximity learning: A review. Neural Computation, 27(10):2039-2096.

Schleif, F.-M. and Tiño, P. (2017). Indefinite core vector machine. Pattern Recognition, 71:187-195.

Schölkopf, B. and Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.

Schwartz, L. (1964). Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). Journal d'Analyse Mathématique.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning.

Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of ill-posed problems. W. H. Winston, Washington, D.C.

Williams, C. K. I. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13.