Minimizing Quadratic Functions in Constant Time

Kohei Hayashi
National Institute of Advanced Industrial Science and Technology
hayashi.kohei@gmail.com

Yuichi Yoshida
National Institute of Informatics and Preferred Infrastructure, Inc.
yyoshida@nii.ac.jp

A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following $n$-dimensional quadratic minimization problem in constant time, which is independent of $n$: $z^* = \min_{v \in \mathbb{R}^n} \bigl( \langle v, Av \rangle + n \langle v, \mathrm{diag}(d)v \rangle + n \langle b, v \rangle \bigr)$, where $A \in \mathbb{R}^{n \times n}$ is a matrix and $d, b \in \mathbb{R}^n$ are vectors. Our theoretical analysis specifies the number of samples $k(\delta, \epsilon)$ such that the approximated value $z$ satisfies $|z - z^*| = O(\epsilon n^2)$ with probability $1 - \delta$. The empirical performance (accuracy and runtime) is positively confirmed by numerical experiments.

1 Introduction

A quadratic function is one of the most important function classes in machine learning, statistics, and data mining. Many fundamental problems, such as linear regression, k-means clustering, principal component analysis, support vector machines, and kernel methods [14], can be formulated as the minimization of a quadratic function.

In some applications, it suffices to compute the minimum value of a quadratic function rather than its minimizer. For example, Yamada et al. [21] proposed an efficient method for estimating the Pearson divergence, which provides useful information about data, such as the density ratio [18]. They formulated the estimation problem as the minimization of a squared loss and showed that the Pearson divergence can be estimated from the minimum value. The least-squares mutual information [19] is another example that can be computed in a similar manner.

Despite its importance, the minimization of a quadratic function suffers from scalability issues. Let $n \in \mathbb{N}$ be the number of variables (the dimension of the problem). In general, such a minimization problem can be solved by quadratic programming (QP), which requires poly($n$) time. If the problem is convex and unconstrained, it reduces to solving a system of linear equations, which requires $O(n^3)$ time. Both methods easily become infeasible even for medium-scale problems, say, $n > 10000$.

Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in $n$. This is problematic when handling problems of ultrahigh dimension, for which even linear time is slow or prohibitive. For example, stochastic gradient descent (SGD) is an optimization method that is widely used for large-scale problems. A nice property of this method is that, if the objective function is strongly convex, it outputs a point that is sufficiently close to an optimal solution after a constant number of iterations [5]. Nevertheless, each iteration requires at least $O(n)$ time to access the variables. Another technique is low-rank approximation, such as Nyström's method [20]. The underlying idea is to approximate the problem by a low-rank matrix, which drastically reduces the time complexity. However, we still need to compute a matrix-vector product of size $n$, which requires $O(n)$ time. Clarkson et al. [7] proposed sublinear-time algorithms for special cases of quadratic function minimization.
However, their algorithms are sublinear only with respect to the number of pairwise interactions of the variables, which is $O(n^2)$, and they require $O(n \log^c n)$ time for some $c \geq 1$.

Our contributions: Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, we consider the following quadratic problem:

$$\text{minimize}_{v \in \mathbb{R}^n} \ p_{n,A,d,b}(v), \quad \text{where} \quad p_{n,A,d,b}(v) = \langle v, Av \rangle + n\langle v, \mathrm{diag}(d)v \rangle + n\langle b, v \rangle. \qquad (1)$$

Here, $\langle \cdot, \cdot \rangle$ denotes the inner product and $\mathrm{diag}(d)$ denotes the diagonal matrix whose diagonal entries are specified by $d$. Note that a constant term can be included in (1); however, it is irrelevant when optimizing (1), and hence we ignore it.

Let $z^* \in \mathbb{R}$ be the optimal value of (1) and let $\epsilon, \delta \in (0,1)$ be parameters. Then, the main goal of this paper is the computation of $z$ with $|z - z^*| = O(\epsilon n^2)$ with probability at least $1 - \delta$ in constant time, that is, independent of $n$. Here, we assume the real RAM model [6], in which we can perform basic algebraic operations on real numbers in one step. Moreover, we assume that we have query access to $A$, $b$, and $d$, with which we can obtain any entry of them by specifying an index. We note that $z^*$ is typically $\Theta(n^2)$ because $\langle v, Av \rangle$ consists of $\Theta(n^2)$ terms, while $\langle v, \mathrm{diag}(d)v \rangle$ and $\langle b, v \rangle$ consist of $\Theta(n)$ terms. Hence, we can regard the error of $\Theta(\epsilon n^2)$ as an error of $\Theta(\epsilon)$ per term, which is reasonably small in typical situations.

Let $\cdot|_S$ be the operator that extracts the submatrix (or subvector) specified by an index set $S \subseteq \mathbb{N}$; then, our algorithm is defined as follows, where the parameter $k := k(\epsilon, \delta)$ will be determined later.

Algorithm 1
Input: an integer $n \in \mathbb{N}$, query access to the matrix $A \in \mathbb{R}^{n \times n}$ and to the vectors $d, b \in \mathbb{R}^n$, and $\epsilon, \delta > 0$.
1: $S \leftarrow$ a sequence of $k = k(\epsilon, \delta)$ indices independently and uniformly sampled from $\{1, 2, \ldots, n\}$.
2: return $\frac{n^2}{k^2} \min_{v \in \mathbb{R}^k} p_{k, A|_S, d|_S, b|_S}(v)$.

In other words, we sample a constant number of indices from the set $\{1, 2, \ldots, n\}$ and then solve the problem (1) restricted to these indices (a code sketch of this procedure is given at the end of this section). Note that the number of queries and the time complexity are $O(k^2)$ and poly($k$), respectively.

In order to analyze the difference between the optimal values of $p_{n,A,d,b}$ and $p_{k,A|_S,d|_S,b|_S}$, we want to measure the distances between $A$ and $A|_S$, $d$ and $d|_S$, and $b$ and $b|_S$, and to show that they are small. To this end, we exploit graph limit theory, initiated by Lovász and Szegedy [11] (see the book [10]), in which the distance between two graphs on different numbers of vertices is measured by considering continuous versions. Although the primary interest of graph limit theory is graphs, we can extend the argument to analyze matrices and vectors.

Using synthetic and real settings, we demonstrate that our method is orders of magnitude faster than standard polynomial-time algorithms and that its accuracy is sufficiently high.

Related work: Several constant-time approximation algorithms are known for combinatorial optimization problems, such as the max cut problem on dense graphs [8, 13], constraint satisfaction problems [1, 22], and the vertex cover problem [15, 16, 25]. However, as far as we know, no such algorithm is known for continuous optimization problems. A related notion is property testing [9, 17], which aims to design constant-time algorithms that distinguish inputs satisfying some predetermined property from inputs that are far from satisfying it. Characterizations of constant-time testable properties are known for the properties of a dense graph [2, 3] and the affine-invariant properties of a function on a finite field [23, 24].
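The following is a minimal NumPy sketch of Algorithm 1, under the additional assumptions that the input matrix is symmetric and that the restricted problem is bounded below, so that the inner minimization can be solved in closed form via a pseudoinverse (as is done in the experiments of Section 5). The function names and the `query_A`/`query_d`/`query_b` interface are illustrative choices, not part of the algorithm itself; any off-the-shelf QP solver could replace the closed-form step.

```python
import numpy as np

def minimize_quadratic_restricted(A_S, d_S, b_S):
    """Exactly minimize p_k(v) = <v, A_S v> + k <v, diag(d_S) v> + k <b_S, v>
    over v in R^k, assuming A_S is symmetric and the problem is bounded below."""
    k = len(b_S)
    M = A_S + k * np.diag(d_S)                 # half of the Hessian of the restricted objective
    v = -0.5 * k * np.linalg.pinv(M) @ b_S     # stationary point: 2 M v + k b_S = 0
    return v @ A_S @ v + k * (v * d_S) @ v + k * b_S @ v

def algorithm1(query_A, query_d, query_b, n, k, rng=np.random.default_rng()):
    """Sampling-based estimate of z* = min_v p_{n,A,d,b}(v) (Algorithm 1).

    query_A(i, j), query_d(i), query_b(i) model query access to the input;
    only O(k^2) entries are ever read."""
    S = rng.integers(0, n, size=k)                      # step 1: sample k indices
    A_S = np.array([[query_A(i, j) for j in S] for i in S])
    d_S = np.array([query_d(i) for i in S])
    b_S = np.array([query_b(i) for i in S])
    return (n ** 2 / k ** 2) * minimize_quadratic_restricted(A_S, d_S, b_S)   # step 2
```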
2 Preliminaries

For an integer $n$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$. The notation $a = b \pm c$ means that $b - c \leq a \leq b + c$. In this paper, we only consider functions and sets that are measurable.

Let $S = (x_1, \ldots, x_k)$ be a sequence of $k$ indices in $[n]$. For a vector $v \in \mathbb{R}^n$, we denote the restriction of $v$ to $S$ by $v|_S \in \mathbb{R}^k$; that is, $(v|_S)_i = v_{x_i}$ for every $i \in [k]$. For a matrix $A \in \mathbb{R}^{n \times n}$, we denote the restriction of $A$ to $S$ by $A|_S \in \mathbb{R}^{k \times k}$; that is, $(A|_S)_{ij} = A_{x_i x_j}$ for every $i, j \in [k]$.

2.1 Dikernels

Following [12], we call a (measurable) function $f : [0,1]^2 \to \mathbb{R}$ a dikernel. A dikernel is a generalization of a graphon [11], which is symmetric and whose range is bounded in $[0,1]$. We can regard a dikernel as a matrix whose index is specified by a real value in $[0,1]$. We stress that the term dikernel has nothing to do with kernel methods.

For two functions $f, g : [0,1] \to \mathbb{R}$, we define their inner product as $\langle f, g \rangle = \int_0^1 f(x)g(x)\,dx$. For a dikernel $W : [0,1]^2 \to \mathbb{R}$ and a function $f : [0,1] \to \mathbb{R}$, we define the function $Wf : [0,1] \to \mathbb{R}$ as $(Wf)(x) = \langle W(x, \cdot), f \rangle$.

Let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel. The $L_p$ norm $\|W\|_p$ for $p \geq 1$ and the cut norm $\|W\|_\square$ of $W$ are defined as
$$\|W\|_p = \Bigl( \int_0^1 \!\! \int_0^1 |W(x,y)|^p \, dx\, dy \Bigr)^{1/p} \quad \text{and} \quad \|W\|_\square = \sup_{S, T \subseteq [0,1]} \Bigl| \int_{S \times T} W(x,y)\, dx\, dy \Bigr|,$$
respectively, where the supremum is over all pairs of measurable subsets. We note that these norms satisfy the triangle inequality and that $\|W\|_\square \leq \|W\|_1$.

Let $\lambda$ be the Lebesgue measure. A map $\pi : [0,1] \to [0,1]$ is said to be measure-preserving if the pre-image $\pi^{-1}(X)$ is measurable for every measurable set $X$ and $\lambda(\pi^{-1}(X)) = \lambda(X)$. A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and then also measure-preserving). For a measure-preserving bijection $\pi : [0,1] \to [0,1]$ and a dikernel $W : [0,1]^2 \to \mathbb{R}$, we define the dikernel $\pi(W) : [0,1]^2 \to \mathbb{R}$ as $\pi(W)(x,y) = W(\pi(x), \pi(y))$.

2.2 Matrices and Dikernels

Let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel and $S = (x_1, \ldots, x_k)$ be a sequence of elements in $[0,1]$. Then, we define the matrix $W|_S \in \mathbb{R}^{k \times k}$ so that $(W|_S)_{ij} = W(x_i, x_j)$.

We can construct a dikernel $\widehat{A} : [0,1]^2 \to \mathbb{R}$ from a matrix $A \in \mathbb{R}^{n \times n}$ as follows. Let $I_1 = [0, \frac{1}{n}]$, $I_2 = (\frac{1}{n}, \frac{2}{n}]$, $\ldots$, $I_n = (\frac{n-1}{n}, 1]$. For $x \in [0,1]$, we define $i_n(x) \in [n]$ as the unique integer such that $x \in I_{i_n(x)}$. Then, we define $\widehat{A}(x, y) = A_{i_n(x)\, i_n(y)}$. The main motivation for creating a dikernel from a matrix is that, by doing so, we can define the distance between two matrices $A$ and $B$ of different sizes via the cut norm, namely $\|\widehat{A} - \widehat{B}\|_\square$. We note that the distribution of $A|_S$, where $S$ is a sequence of $k$ indices uniformly and independently sampled from $[n]$, exactly matches the distribution of $\widehat{A}|_S$, where $S$ is a sequence of $k$ elements uniformly and independently sampled from $[0,1]$.
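To make the construction of Section 2.2 concrete, the following is a small NumPy illustration of the step-function dikernel $\widehat{A}$ and of the sampling equivalence stated above. The helper name `dikernel_from_matrix` and the 0-indexed convention are ours, introduced only for illustration.

```python
import numpy as np

def dikernel_from_matrix(A):
    """Step-function dikernel A_hat with A_hat(x, y) = A[i_n(x), i_n(y)],
    where i_n(x) is the index of the interval I_i containing x (0-indexed here)."""
    n = A.shape[0]
    def index(x):
        return min(max(int(np.ceil(x * n)), 1), n) - 1   # i_n(x), with i_n(0) = 1
    return lambda x, y: A[index(x), index(y)]

rng = np.random.default_rng(0)
n, k = 5, 3
A = rng.standard_normal((n, n))
A_hat = dikernel_from_matrix(A)

# Restricting A_hat to k uniform points of [0, 1] ...
xs = rng.uniform(0.0, 1.0, size=k)
A_hat_S = np.array([[A_hat(x, y) for y in xs] for x in xs])

# ... has the same distribution as restricting A to k uniform indices of [n].
S = rng.integers(0, n, size=k)
A_S = A[np.ix_(S, S)]
```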
3 Sampling Theorem and the Properties of the Cut Norm

In this section, we prove the following theorem, which states that, given a sequence of dikernels $W^1, \ldots, W^T : [0,1]^2 \to [-L, L]$, we can obtain a good approximation to all of them by sampling a small number of elements in $[0,1]$. Formally, we prove the following:

Theorem 3.1. Let $W^1, \ldots, W^T : [0,1]^2 \to [-L, L]$ be dikernels. Let $S$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, with probability at least $1 - \exp(-\Omega(kT/\log_2 k))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for any functions $f, g : [0,1] \to [-K, K]$ and every $t \in [T]$, we have
$$\bigl| \langle f, W^t g \rangle - \langle f, \pi(\widehat{W^t|_S}) g \rangle \bigr| = O\Bigl( LK^2 \sqrt{T/\log_2 k} \Bigr).$$

We start with the following lemma, which states that, if a dikernel $W : [0,1]^2 \to \mathbb{R}$ has a small cut norm, then $\langle f, Wg \rangle$ is negligible no matter what $f$ and $g$ are. Hence, we can focus on the cut norm when proving Theorem 3.1.

Lemma 3.2. Let $\epsilon \geq 0$ and let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel with $\|W\|_\square \leq \epsilon$. Then, for any functions $f, g : [0,1] \to [-K, K]$, we have $|\langle f, Wg \rangle| \leq \epsilon K^2$.

Proof. For $\tau \in \mathbb{R}$ and a function $h : [0,1] \to \mathbb{R}$, let $L_\tau(h) := \{x \in [0,1] \mid \tau \leq h(x)\}$ be the level set of $h$ at $\tau$. For $f' = f/K$ and $g' = g/K$, we have
$$|\langle f, Wg \rangle| = K^2 |\langle f', Wg' \rangle| = K^2 \Bigl| \int\!\!\!\int \int_{L_{\tau_1}(f') \times L_{\tau_2}(g')} W(x,y)\, dx\, dy\, d\tau_1\, d\tau_2 \Bigr| \leq K^2 \int\!\!\!\int \Bigl| \int_{L_{\tau_1}(f') \times L_{\tau_2}(g')} W(x,y)\, dx\, dy \Bigr|\, d\tau_1\, d\tau_2 \leq \epsilon K^2.$$

To introduce the next technical tool, we need several definitions. We say that a partition $\mathcal{Q}$ is a refinement of a partition $\mathcal{P} = (V_1, \ldots, V_p)$ if $\mathcal{Q}$ is obtained by splitting each set $V_i$ into one or more parts. A partition $\mathcal{P} = (V_1, \ldots, V_p)$ of the interval $[0,1]$ is called an equipartition if $\lambda(V_i) = 1/p$ for every $i \in [p]$. For a dikernel $W : [0,1]^2 \to \mathbb{R}$ and an equipartition $\mathcal{P} = (V_1, \ldots, V_p)$ of $[0,1]$, we define $W_{\mathcal{P}} : [0,1]^2 \to \mathbb{R}$ as the function obtained by averaging $W$ over each $V_i \times V_j$ for $i, j \in [p]$. More formally, we define
$$W_{\mathcal{P}}(x,y) = \frac{1}{\lambda(V_i)\lambda(V_j)} \int_{V_i \times V_j} W(x', y')\, dx'\, dy' = p^2 \int_{V_i \times V_j} W(x', y')\, dx'\, dy',$$
where $i$ and $j$ are the unique indices such that $x \in V_i$ and $y \in V_j$, respectively.

The following lemma states that any dikernel $W : [0,1]^2 \to \mathbb{R}$ can be well approximated by $W_{\mathcal{Q}}$ for an equipartition $\mathcal{Q}$ into a small number of parts.

Lemma 3.3 (Weak regularity lemma for functions on $[0,1]^2$ [8]). Let $\mathcal{P}$ be an equipartition of $[0,1]$ into $k$ sets. Then, for any dikernel $W : [0,1]^2 \to \mathbb{R}$ and any $\epsilon > 0$, there exists a refinement $\mathcal{Q}$ of $\mathcal{P}$ with $|\mathcal{Q}| \leq k\, 2^{C/\epsilon^2}$ for some constant $C > 0$ such that $\|W - W_{\mathcal{Q}}\|_\square \leq \epsilon \|W\|_2$.

Corollary 3.4. Let $W^1, \ldots, W^T : [0,1]^2 \to \mathbb{R}$ be dikernels. Then, for any $\epsilon > 0$, there exists an equipartition $\mathcal{P}$ into $|\mathcal{P}| \leq 2^{CT/\epsilon^2}$ parts for some constant $C > 0$ such that, for every $t \in [T]$, $\|W^t - W^t_{\mathcal{P}}\|_\square \leq \epsilon \|W^t\|_2$.

Proof. Let $\mathcal{P}_0$ be the trivial partition, that is, the partition consisting of the single part $[0,1]$. Then, for each $t \in [T]$, we iteratively apply Lemma 3.3 with $\mathcal{P}_{t-1}$, $W^t$, and $\epsilon$, and obtain a partition $\mathcal{P}_t$ into at most $|\mathcal{P}_{t-1}|\, 2^{C/\epsilon^2}$ parts such that $\|W^t - W^t_{\mathcal{P}_t}\|_\square \leq \epsilon \|W^t\|_2$. Since $\mathcal{P}_t$ is a refinement of $\mathcal{P}_{t-1}$, we have $\|W^i - W^i_{\mathcal{P}_t}\|_\square \leq \|W^i - W^i_{\mathcal{P}_{t-1}}\|_\square$ for every $i \in [t-1]$. Then, $\mathcal{P}_T$ satisfies the desired property with $|\mathcal{P}_T| \leq (2^{C/\epsilon^2})^T = 2^{CT/\epsilon^2}$.

As long as $S$ is sufficiently large, $W$ and $\widehat{W|_S}$ are close in the cut norm:

Lemma 3.5 ((4.15) of [4]). Let $W : [0,1]^2 \to [-L, L]$ be a dikernel and let $S$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, we have
$$\mathbb{E}_S \bigl\| \widehat{W|_S} - W \bigr\|_\square < \frac{8L}{k^{1/4}}.$$

Finally, we need the following concentration inequality.

Lemma 3.6 (Azuma's inequality). Let $(\Omega, \mathcal{A}, P)$ be a probability space, $k$ be a positive integer, and $C > 0$. Let $z = (z_1, \ldots, z_k)$, where $z_1, \ldots, z_k$ are independent random variables and $z_i$ takes values in some measure space $(\Omega_i, \mathcal{A}_i)$. Let $f : \Omega_1 \times \cdots \times \Omega_k \to \mathbb{R}$ be a function. Suppose that $|f(x) - f(y)| \leq C$ whenever $x$ and $y$ differ in only one coordinate. Then,
$$\Pr\bigl[ |f(z) - \mathbb{E}_z[f(z)]| > \lambda C \bigr] < 2e^{-\lambda^2 / (2k)}.$$

Now we prove the counterpart of Theorem 3.1 for the cut norm.

Lemma 3.7. Let $W^1, \ldots, W^T : [0,1]^2 \to [-L, L]$ be dikernels. Let $S$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, with probability at least $1 - \exp(-\Omega(kT/\log_2 k))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for every $t \in [T]$, we have
$$\bigl\| W^t - \pi(\widehat{W^t|_S}) \bigr\|_\square = O\Bigl( L\sqrt{T/\log_2 k} \Bigr).$$

Proof. First, we bound the expectations and then prove their concentration. We apply Corollary 3.4 to $W^1, \ldots, W^T$ and $\epsilon$ (to be chosen later), and let $\mathcal{P} = (V_1, \ldots, V_p)$ be the obtained equipartition with $p \leq 2^{CT/\epsilon^2}$ parts such that $\|W^t - W^t_{\mathcal{P}}\|_\square \leq \epsilon L$ for every $t \in [T]$.
By Lemma 3.5, for every $t \in [T]$, we have
$$\mathbb{E}_S\bigl\|\widehat{W^t_{\mathcal{P}}|_S} - \widehat{W^t|_S}\bigr\|_\square = \mathbb{E}_S\bigl\|\widehat{(W^t_{\mathcal{P}} - W^t)|_S}\bigr\|_\square \leq \epsilon L + \frac{8L}{k^{1/4}}.$$
Then, for any measure-preserving bijection $\pi : [0,1] \to [0,1]$ and every $t \in [T]$, we have
$$\mathbb{E}_S\bigl\|W^t - \pi(\widehat{W^t|_S})\bigr\|_\square \leq \bigl\|W^t - W^t_{\mathcal{P}}\bigr\|_\square + \mathbb{E}_S\bigl\|W^t_{\mathcal{P}} - \pi(\widehat{W^t_{\mathcal{P}}|_S})\bigr\|_\square + \mathbb{E}_S\bigl\|\pi(\widehat{W^t_{\mathcal{P}}|_S}) - \pi(\widehat{W^t|_S})\bigr\|_\square \leq 2\epsilon L + \frac{8L}{k^{1/4}} + \mathbb{E}_S\bigl\|W^t_{\mathcal{P}} - \pi(\widehat{W^t_{\mathcal{P}}|_S})\bigr\|_\square. \qquad (2)$$

Thus, we are left with the problem of sampling from $\mathcal{P}$. Let $S = (x_1, \ldots, x_k)$ be a sequence of independent random variables that are uniformly distributed in $[0,1]$, and let $Z_i$ be the number of points $x_j$ that fall into the set $V_i$. It is easy to compute that $\mathbb{E}[Z_i] = k/p$ and $\mathrm{Var}[Z_i] = \frac{k}{p}\bigl(1 - \frac{1}{p}\bigr)$.

We construct a partition $\mathcal{P}' = (V'_1, \ldots, V'_p)$ of $[0,1]$ such that $\lambda(V'_i) = Z_i/k$ and $\lambda(V_i \cap V'_i) = \min(1/p, Z_i/k)$ for every $i \in [p]$. For each $t \in [T]$, we construct the dikernel $W'^t : [0,1]^2 \to \mathbb{R}$ such that the value of $W'^t$ on $V'_i \times V'_j$ equals the value of $W^t_{\mathcal{P}}$ on $V_i \times V_j$. Then, $W'^t$ agrees with $W^t_{\mathcal{P}}$ on the set $Q = \bigcup_{i,j \in [p]} (V_i \cap V'_i) \times (V_j \cap V'_j)$, and there exists a measure-preserving bijection $\pi$ such that $\pi(\widehat{W^t_{\mathcal{P}}|_S}) = W'^t$ for every $t \in [T]$. Then, for every $t \in [T]$, we have
$$\bigl\|W^t_{\mathcal{P}} - \pi(\widehat{W^t_{\mathcal{P}}|_S})\bigr\|_\square = \bigl\|W^t_{\mathcal{P}} - W'^t\bigr\|_\square \leq \bigl\|W^t_{\mathcal{P}} - W'^t\bigr\|_1 \leq 2L\bigl(1 - \lambda(Q)\bigr) \leq 2L \sum_{i \in [p]} \Bigl| \frac{Z_i}{k} - \frac{1}{p} \Bigr|,$$
since $\lambda(Q) = \bigl(\sum_{i \in [p]} \min(1/p, Z_i/k)\bigr)^2$ and $\sum_{i \in [p]} Z_i/k = 1$. We rewrite this as
$$\bigl\|W^t_{\mathcal{P}} - \pi(\widehat{W^t_{\mathcal{P}}|_S})\bigr\|_\square^2 \leq 4L^2 p \sum_{i \in [p]} \Bigl( \frac{Z_i}{k} - \frac{1}{p} \Bigr)^2.$$
The expectation of the right-hand side is $(4L^2 p / k^2) \sum_{i \in [p]} \mathrm{Var}(Z_i) < 4L^2 p / k$. By the Cauchy-Schwarz inequality,
$$\mathbb{E}_S\bigl\|W^t_{\mathcal{P}} - \pi(\widehat{W^t_{\mathcal{P}}|_S})\bigr\|_\square \leq 2L\sqrt{\frac{p}{k}}.$$
Inserting this into (2), we obtain
$$\mathbb{E}_S\bigl\|W^t - \pi(\widehat{W^t|_S})\bigr\|_\square \leq 2\epsilon L + \frac{8L}{k^{1/4}} + 2L\sqrt{\frac{p}{k}} \leq 2\epsilon L + \frac{8L}{k^{1/4}} + 2L\sqrt{\frac{2^{CT/\epsilon^2}}{k}}.$$
Choosing $\epsilon = \sqrt{CT/\log_2 k^{1/4}} = \sqrt{4CT/\log_2 k}$, we obtain the upper bound
$$\mathbb{E}_S\bigl\|W^t - \pi(\widehat{W^t|_S})\bigr\|_\square \leq 2L\sqrt{\frac{4CT}{\log_2 k}} + \frac{8L}{k^{1/4}} + \frac{2L}{k^{3/8}} = O\Bigl( L\sqrt{\frac{T}{\log_2 k}} \Bigr).$$
Observing that $\|W^t - \pi(\widehat{W^t|_S})\|_\square$ changes by at most $O(L/k)$ if one element in $S$ changes, we apply Azuma's inequality with $\lambda = k\sqrt{T/\log_2 k}$ and the union bound over $t \in [T]$ to complete the proof.

The proof of Theorem 3.1 follows immediately from Lemmas 3.2 and 3.7.

4 Analysis of Algorithm 1

In this section, we analyze Algorithm 1. Because we want to use dikernels for the analysis, we introduce a continuous version of $p_{n,A,d,b}$ (recall (1)). The real-valued function $P_{n,A,d,b}$ on functions $f : [0,1] \to \mathbb{R}$ is defined as
$$P_{n,A,d,b}(f) = \langle f, \widehat{A}f \rangle + \langle f^2, \widehat{d\mathbf{1}^\top}\mathbf{1} \rangle + \langle f, \widehat{b\mathbf{1}^\top}\mathbf{1} \rangle,$$
where $f^2 : [0,1] \to \mathbb{R}$ is the function such that $f^2(x) = f(x)^2$ for every $x \in [0,1]$ and $\mathbf{1} : [0,1] \to \mathbb{R}$ is the constant function whose value is one everywhere.

The following lemma states that the minimizations of $p_{n,A,d,b}$ and $P_{n,A,d,b}$ are equivalent:

Lemma 4.1. Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, for any $K > 0$, we have
$$\min_{v \in [-K,K]^n} p_{n,A,d,b}(v) = n^2 \inf_{f : [0,1] \to [-K,K]} P_{n,A,d,b}(f).$$

Proof. First, we show that $n^2 \inf_{f : [0,1] \to [-K,K]} P_{n,A,d,b}(f) \leq \min_{v \in [-K,K]^n} p_{n,A,d,b}(v)$. Given a vector $v \in [-K,K]^n$, we define $f : [0,1] \to [-K,K]$ as $f(x) = v_{i_n(x)}$. Then,
$$\langle f, \widehat{A}f \rangle = \sum_{i,j \in [n]} \int_{I_i \times I_j} A_{ij} f(x) f(y)\, dx\, dy = \frac{1}{n^2} \sum_{i,j \in [n]} A_{ij} v_i v_j = \frac{1}{n^2}\langle v, Av \rangle,$$
$$\langle f^2, \widehat{d\mathbf{1}^\top}\mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I_i \times I_j} d_i f(x)^2\, dx\, dy = \sum_{i \in [n]} \int_{I_i} d_i f(x)^2\, dx = \frac{1}{n} \sum_{i \in [n]} d_i v_i^2 = \frac{1}{n} \langle v, \mathrm{diag}(d)v \rangle,$$
$$\langle f, \widehat{b\mathbf{1}^\top}\mathbf{1} \rangle = \sum_{i,j \in [n]} \int_{I_i \times I_j} b_i f(x)\, dx\, dy = \sum_{i \in [n]} \int_{I_i} b_i f(x)\, dx = \frac{1}{n} \sum_{i \in [n]} b_i v_i = \frac{1}{n} \langle b, v \rangle.$$
Hence, $n^2 P_{n,A,d,b}(f) = p_{n,A,d,b}(v)$, which gives the claimed inequality.

Next, we show that $\min_{v \in [-K,K]^n} p_{n,A,d,b}(v) \leq n^2 \inf_{f : [0,1] \to [-K,K]} P_{n,A,d,b}(f)$. Let $f : [0,1] \to [-K,K]$ be a measurable function. Then, for $x \in [0,1]$, the partial derivative of $P_{n,A,d,b}$ with respect to the value of $f$ at $x$ is
$$\frac{\partial P_{n,A,d,b}(f)}{\partial f(x)} = \sum_{i \in [n]} \int_{I_i} A_{i\, i_n(x)} f(y)\, dy + \sum_{j \in [n]} \int_{I_j} A_{i_n(x)\, j} f(y)\, dy + 2 d_{i_n(x)} f(x) + b_{i_n(x)}.$$
Note that the form of this partial derivative depends only on $i_n(x)$; hence, in the optimal solution $f^* : [0,1] \to [-K,K]$, we can assume $f^*(x) = f^*(y)$ whenever $i_n(x) = i_n(y)$. In other words, $f^*$ is constant on each of the intervals $I_1, \ldots, I_n$. For such an $f^*$, we define the vector $v \in \mathbb{R}^n$ by $v_i = f^*(x)$, where $x \in [0,1]$ is any element in $I_i$.
Then, we have
$$\sum_{i,j \in [n]} A_{ij} v_i v_j = n^2 \sum_{i,j \in [n]} \int_{I_i \times I_j} A_{ij} f^*(x) f^*(y)\, dx\, dy = n^2 \langle f^*, \widehat{A} f^* \rangle,$$
$$\langle v, \mathrm{diag}(d)v \rangle = \sum_{i \in [n]} d_i v_i^2 = n \sum_{i \in [n]} \int_{I_i} d_i f^*(x)^2\, dx = n \langle (f^*)^2, \widehat{d\mathbf{1}^\top}\mathbf{1} \rangle,$$
$$\langle b, v \rangle = \sum_{i \in [n]} b_i v_i = n \sum_{i \in [n]} \int_{I_i} b_i f^*(x)\, dx = n \langle f^*, \widehat{b\mathbf{1}^\top}\mathbf{1} \rangle.$$
Finally, we have $p_{n,A,d,b}(v) = n^2 P_{n,A,d,b}(f^*)$, which gives the claimed inequality and completes the proof.

Now we show that Algorithm 1 well-approximates the optimal value of (1) in the following sense:

Theorem 4.2. Let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of problem (1). By choosing $k(\epsilon, \delta) = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta}\log\log\frac{1}{\delta})$, with probability at least $1 - \delta$, a sequence $S$ of $k$ indices independently and uniformly sampled from $[n]$ satisfies the following: Let $\tilde{v}$ and $\tilde{z}$ be an optimal solution and the optimal value, respectively, of the problem $\min_{v \in \mathbb{R}^k} p_{k,A|_S,d|_S,b|_S}(v)$. Then, we have
$$\Bigl| \frac{n^2}{k^2}\tilde{z} - z^* \Bigr| \leq \epsilon L K^2 n^2,$$
where $K = \max\{\max_{i \in [n]} |v^*_i|, \max_{i \in [k]} |\tilde{v}_i|\}$ and $L = \max\{\max_{i,j} |A_{ij}|, \max_i |d_i|, \max_i |b_i|\}$.

Proof. We instantiate Theorem 3.1 with $k = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta}\log\log\frac{1}{\delta})$ and the dikernels $\widehat{A}$, $\widehat{d\mathbf{1}^\top}$, and $\widehat{b\mathbf{1}^\top}$. Then, with probability at least $1 - \delta$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that
$$\max\Bigl\{ \bigl|\langle f, (\widehat{A} - \pi(\widehat{A|_S}))f \rangle\bigr|,\ \bigl|\langle f^2, (\widehat{d\mathbf{1}^\top} - \pi(\widehat{d\mathbf{1}^\top|_S}))\mathbf{1} \rangle\bigr|,\ \bigl|\langle f, (\widehat{b\mathbf{1}^\top} - \pi(\widehat{b\mathbf{1}^\top|_S}))\mathbf{1} \rangle\bigr| \Bigr\} \leq \epsilon L K^2$$
for any function $f : [0,1] \to [-K, K]$. Then, we have
$$\tilde{z} = \min_{v \in \mathbb{R}^k} p_{k,A|_S,d|_S,b|_S}(v) = \min_{v \in [-K,K]^k} p_{k,A|_S,d|_S,b|_S}(v) = k^2 \inf_{f : [0,1] \to [-K,K]} P_{k,A|_S,d|_S,b|_S}(f) \quad \text{(by Lemma 4.1)}$$
$$= k^2 \inf_{f : [0,1] \to [-K,K]} \Bigl( \langle f, (\pi(\widehat{A|_S}) - \widehat{A})f \rangle + \langle f, \widehat{A}f \rangle + \langle f^2, (\pi(\widehat{d\mathbf{1}^\top|_S}) - \widehat{d\mathbf{1}^\top})\mathbf{1} \rangle + \langle f^2, \widehat{d\mathbf{1}^\top}\mathbf{1} \rangle + \langle f, (\pi(\widehat{b\mathbf{1}^\top|_S}) - \widehat{b\mathbf{1}^\top})\mathbf{1} \rangle + \langle f, \widehat{b\mathbf{1}^\top}\mathbf{1} \rangle \Bigr)$$
$$\geq k^2 \inf_{f : [0,1] \to [-K,K]} \Bigl( \langle f, \widehat{A}f \rangle + \langle f^2, \widehat{d\mathbf{1}^\top}\mathbf{1} \rangle + \langle f, \widehat{b\mathbf{1}^\top}\mathbf{1} \rangle \Bigr) - \epsilon L K^2 k^2 = \frac{k^2}{n^2} \min_{v \in [-K,K]^n} p_{n,A,d,b}(v) - \epsilon L K^2 k^2 \quad \text{(by Lemma 4.1)}$$
$$\geq \frac{k^2}{n^2} \min_{v \in \mathbb{R}^n} p_{n,A,d,b}(v) - \epsilon L K^2 k^2 = \frac{k^2}{n^2} z^* - \epsilon L K^2 k^2.$$
Rearranging this inequality (and applying the symmetric argument to bound $\tilde{z}$ from above), we obtain the desired result.

We can show that $K$ is bounded when $A$ is symmetric and full rank. To see this, we first note that we can assume that $A + n\,\mathrm{diag}(d)$ is positive definite, as otherwise $p_{n,A,d,b}$ is not bounded below and the problem is uninteresting. Then, for any set $S \subseteq [n]$ of $k$ indices, $(A + n\,\mathrm{diag}(d))|_S$ is again positive definite because it is a principal submatrix. Hence, we have $v^* = -(A + n\,\mathrm{diag}(d))^{-1} n b / 2$ and $\tilde{v} = -(A|_S + k\,\mathrm{diag}(d|_S))^{-1} k\, b|_S / 2$, which means that $K$ is bounded.

5 Experiments

In this section, we demonstrate the effectiveness of our method by experiment (the program code is available at https://github.com/hayasick/CTOQ). All experiments were conducted on an Amazon EC2 c3.8xlarge instance. Error bars indicate the standard deviations over ten trials with different random seeds.

Numerical simulation: We investigated the actual relationships between $n$, $k$, and $\epsilon$. To this end, we prepared synthetic data as follows. We randomly generated inputs as $A_{ij} \sim U_{[-1,1]}$, $d_i \sim U_{[0,1]}$, and $b_i \sim U_{[-1,1]}$ for $i, j \in [n]$, where $U_{[a,b]}$ denotes the uniform distribution with support $[a,b]$. After that, we solved (1) by using Algorithm 1 and compared the result with the exact solution obtained by QP (we used GLPK, https://www.gnu.org/software/glpk/, as the QP solver). The results (Figure 1) show that the approximation errors were evenly controlled regardless of $n$, which is consistent with the error analysis (Theorem 4.2).

[Figure 1: Numerical simulation: absolute approximation error scaled by $n^2$, shown for $n \in \{200, 500, 1000, 2000\}$ over $k \in \{10, 20, 40, 80, 160\}$.]
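The following is a small self-contained sketch of this simulation under the same assumptions as the Algorithm 1 sketch in Section 1 (NumPy, closed-form inner solver via a pseudoinverse). The random matrix is symmetrized so that the closed form applies, the exact baseline is computed in closed form rather than with GLPK, and the particular $n$, $k$, and seed are illustrative, so this is a sketch of the comparison rather than a reproduction of the reported numbers.

```python
import numpy as np

def quad_min_value(A, d, b, m):
    """Closed-form minimum of <v,Av> + m<v,diag(d)v> + m<b,v> for symmetric A
    (pseudoinverse used in case the Hessian is singular)."""
    M = A + m * np.diag(d)
    v = -0.5 * m * np.linalg.pinv(M) @ b
    return v @ A @ v + m * (v * d) @ v + m * b @ v

rng = np.random.default_rng(0)
n, k = 2000, 80
A = rng.uniform(-1, 1, (n, n)); A = (A + A.T) / 2   # symmetrized; the minimum value is unchanged
d = rng.uniform(0, 1, n)
b = rng.uniform(-1, 1, n)

z_exact = quad_min_value(A, d, b, n)                 # exact optimum of (1)
S = rng.integers(0, n, size=k)                       # Algorithm 1, step 1
z_approx = (n**2 / k**2) * quad_min_value(A[np.ix_(S, S)], d[S], b[S], k)   # step 2

print(abs(z_approx - z_exact) / n**2)                # error scaled by n^2, as in Figure 1
```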
Table 1: Pearson divergence: runtime in seconds (top block: our method; bottom block: Nyström's method).

  k \ n     500     1000    2000    5000
  20        0.002   0.002   0.002   0.002
  40        0.003   0.003   0.003   0.003
  80        0.007   0.007   0.008   0.008
  160       0.030   0.030   0.033   0.035

  20        0.005   0.012   0.046   0.274
  40        0.010   0.022   0.087   0.513
  80        0.022   0.049   0.188   0.942
  160       0.076   0.116   0.432   1.972

Table 2: Pearson divergence: absolute approximation error, mean ± standard deviation (top block: our method; bottom block: Nyström's method).

  k \ n     500                1000               2000               5000
  20        0.0027 ± 0.0028    0.0012 ± 0.0012    0.0021 ± 0.0019    0.0016 ± 0.0022
  40        0.0018 ± 0.0023    0.0006 ± 0.0007    0.0012 ± 0.0011    0.0011 ± 0.0020
  80        0.0007 ± 0.0008    0.0004 ± 0.0003    0.0008 ± 0.0008    0.0007 ± 0.0017
  160       0.0003 ± 0.0003    0.0002 ± 0.0001    0.0003 ± 0.0003    0.0002 ± 0.0003

  20        0.3685 ± 0.9142    1.3006 ± 2.4504    3.1119 ± 6.1464    0.6989 ± 0.9644
  40        0.3549 ± 0.6191    0.4207 ± 0.7018    0.9838 ± 1.5422    0.3744 ± 0.6655
  80        0.0184 ± 0.0192    0.0398 ± 0.0472    0.2056 ± 0.2725    0.5705 ± 0.7918
  160       0.0143 ± 0.0209    0.0348 ± 0.0541    0.0585 ± 0.1112    0.0254 ± 0.0285

Application to kernel methods: Next, we considered the kernel approximation of the Pearson divergence [21]. The problem is defined as follows. Suppose we have two data sets $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and $x' = (x'_1, \ldots, x'_{n'}) \in \mathbb{R}^{n'}$, where $n, n' \in \mathbb{N}$. Let $H \in \mathbb{R}^{n \times n}$ be the Gram matrix such that
$$H_{\ell m} = \frac{\alpha}{n} \sum_{i=1}^{n} \varphi(x_i, x_\ell)\varphi(x_i, x_m) + \frac{1-\alpha}{n'} \sum_{j=1}^{n'} \varphi(x'_j, x_\ell)\varphi(x'_j, x_m),$$
where $\varphi(\cdot,\cdot)$ is a kernel function and $\alpha \in (0,1)$ is a parameter. Also, let $h \in \mathbb{R}^n$ be the vector such that $h_\ell = \frac{1}{n}\sum_{i=1}^{n}\varphi(x_i, x_\ell)$. Then, an estimator of the $\alpha$-relative Pearson divergence between the distributions of $x$ and $x'$ is obtained as
$$-\frac{1}{2} - \min_{v \in \mathbb{R}^n} \Bigl( \frac{1}{2}\langle v, Hv \rangle - \langle h, v \rangle + \frac{\lambda}{2}\langle v, v \rangle \Bigr).$$
Here, $\lambda > 0$ is a regularization parameter. In this experiment, we used the Gaussian kernel $\varphi(x, y) = \exp(-(x-y)^2/(2\sigma^2))$ and set $n' = 200$ and $\alpha = 0.5$; $\sigma^2$ and $\lambda$ were chosen by 5-fold cross-validation as suggested in [21]. We randomly generated the data sets as $x_i \sim \mathcal{N}(1, 0.5)$ for $i \in [n]$ and $x'_j \sim \mathcal{N}(1.5, 0.5)$ for $j \in [n']$, where $\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

We encoded this problem into (1) by setting $A = \frac{1}{2}H$, $b = -\frac{1}{n}h$, and $d = \frac{\lambda}{2n}\mathbf{1}_n$, where $\mathbf{1}_n$ denotes the $n$-dimensional all-ones vector. After that, given $k$, we computed the second step of Algorithm 1 with the pseudoinverse of $A|_S + k\,\mathrm{diag}(d|_S)$. Absolute approximation errors and runtimes were compared with those of Nyström's method, whose approximation rank was set to $k$. In terms of accuracy, our method clearly outperformed Nyström's method (Table 2). In addition, the runtime of our method was nearly constant, whereas the runtime of Nyström's method grew linearly in $n$ (Table 1).
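To illustrate the encoding step above, the following sketch builds $H$ and $h$ from data, encodes them as $A$, $b$, $d$, and runs Algorithm 1 with the pseudoinverse-based inner solve described in the text. It uses the same assumptions as the earlier sketches (NumPy, symmetric inputs); the fixed values of `lam` and `sigma2` are illustrative placeholders for the cross-validated choices, and the function names are ours.

```python
import numpy as np

def gaussian_kernel(x, y, sigma2):
    return np.exp(-(x - y) ** 2 / (2 * sigma2))

def encode_pearson(x, xp, alpha, lam, sigma2):
    """Encode (1/2)<v,Hv> - <h,v> + (lam/2)<v,v> into problem (1):
    A = H/2, b = -h/n, d = (lam/(2n)) 1_n."""
    n, n_prime = len(x), len(xp)
    Phi  = gaussian_kernel(x[:, None],  x[None, :], sigma2)   # phi(x_i, x_l), shape (n, n)
    Phip = gaussian_kernel(xp[:, None], x[None, :], sigma2)   # phi(x'_j, x_l), shape (n', n)
    H = (alpha / n) * Phi.T @ Phi + ((1 - alpha) / n_prime) * Phip.T @ Phip
    h = Phi.mean(axis=0)
    return H / 2, -h / n, (lam / (2 * n)) * np.ones(n)

def algorithm1_value(A, d, b, k, rng):
    """Algorithm 1 on explicit inputs: sample k indices, solve the restricted problem."""
    n = len(b)
    S = rng.integers(0, n, size=k)
    A_S, d_S, b_S = A[np.ix_(S, S)], d[S], b[S]
    M = A_S + k * np.diag(d_S)
    v = -0.5 * k * np.linalg.pinv(M) @ b_S        # pseudoinverse, as in the experiments
    z_S = v @ A_S @ v + k * (v * d_S) @ v + k * b_S @ v
    return (n ** 2 / k ** 2) * z_S

rng = np.random.default_rng(0)
x  = rng.normal(1.0, np.sqrt(0.5), 1000)          # x_i ~ N(1, 0.5)
xp = rng.normal(1.5, np.sqrt(0.5), 200)           # x'_j ~ N(1.5, 0.5), n' = 200
A, b, d = encode_pearson(x, xp, alpha=0.5, lam=0.1, sigma2=1.0)   # lam, sigma2: illustrative
print(algorithm1_value(A, d, b, k=80, rng=rng))   # approximate minimum of (1)
```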
6 Acknowledgments

We would like to thank Makoto Yamada for suggesting a motivating problem for our method. K. H. is supported by MEXT KAKENHI 15K16055. Y. Y. is supported by MEXT Grant-in-Aid for Scientific Research on Innovative Areas (No. 24106001), JST CREST (Foundations of Innovative Algorithms for Big Data), and JST ERATO (Kawarabayashi Large Graph Project).

References

[1] N. Alon, W. F. de la Vega, R. Kannan, and M. Karpinski. Random sampling and approximation of MAX-CSP problems. In STOC, pages 232-239, 2002.
[2] N. Alon, E. Fischer, I. Newman, and A. Shapira. A combinatorial characterization of the testable graph properties: It's all about regularity. SIAM Journal on Computing, 39(1):143-167, 2009.
[3] C. Borgs, J. Chayes, L. Lovász, V. T. Sós, B. Szegedy, and K. Vesztergombi. Graph limits and parameter testing. In STOC, pages 261-270, 2006.
[4] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801-1851, 2008.
[5] L. Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, pages 146-168, 2004.
[6] V. Brattka and P. Hertling. Feasible real random access machines. Journal of Complexity, 14(4):490-526, 1998.
[7] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM, 59(5):23:1-23:49, 2012.
[8] A. Frieze and R. Kannan. The regularity lemma and approximation schemes for dense problems. In FOCS, pages 12-20, 1996.
[9] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653-750, 1998.
[10] L. Lovász. Large Networks and Graph Limits. American Mathematical Society, 2012.
[11] L. Lovász and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933-957, 2006.
[12] L. Lovász and K. Vesztergombi. Non-deterministic graph property testing. Combinatorics, Probability and Computing, 22(5):749-762, 2013.
[13] C. Mathieu and W. Schudy. Yet another algorithm for dense max cut: go greedy. In SODA, pages 176-182, 2008.
[14] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[15] H. N. Nguyen and K. Onak. Constant-time approximation algorithms via local improvements. In FOCS, pages 327-336, 2008.
[16] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In SODA, pages 1123-1131, 2012.
[17] R. Rubinfeld and M. Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252-271, 1996.
[18] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
[19] T. Suzuki and M. Sugiyama. Least-squares independent component analysis. Neural Computation, 23(1):284-301, 2011.
[20] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[21] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. In NIPS, 2011.
[22] Y. Yoshida. Optimal constant-time approximation algorithms and (unconditional) inapproximability results for every bounded-degree CSP. In STOC, pages 665-674, 2011.
[23] Y. Yoshida. A characterization of locally testable affine-invariant properties via decomposition theorems. In STOC, pages 154-163, 2014.
[24] Y. Yoshida. Gowers norm, function limits, and parameter estimation. In SODA, pages 1391-1406, 2016.
[25] Y. Yoshida, M. Yamamoto, and H. Ito. Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM Journal on Computing, 41(4):1074-1093, 2012.