Journal of Machine Learning Research 18 (2017) 1-31. Submitted 11/15; Revised 6/17; Published 9/17

Distributed Learning with Regularized Least Squares

Shao-Bo Lin (sblin1983@gmail.com), Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Xin Guo (x.guo@polyu.edu.hk), Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Ding-Xuan Zhou (mazhou@cityu.edu.hk), Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Editor: Ingo Steinwart

© 2017 Lin, Guo and Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-586.html.

Abstract

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds and learning rates in expectation, in both the $L^2$-metric and the RKHS-metric, that the global output function of this distributed learning scheme is a good approximation to the algorithm processing the whole data in one single machine. Our derived learning rates in expectation are optimal and are stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in expectation in the literature.

Keywords: distributed learning, divide-and-conquer, error analysis, integral operator, second order decomposition

1. Introduction and Distributed Learning Algorithms

In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottlenecks and algorithmic scalability (Zhou et al., 2014). To overcome these difficulties, several approaches for generating scalable approximate algorithms have been introduced in the literature, such as low-rank approximations of kernel matrices for kernel principal component analysis (Schölkopf et al., 1998; Bach, 2013), incomplete Cholesky decomposition (Fine and Scheinberg, 2002), early stopping of iterative optimization algorithms for gradient descent methods (Yao et al., 2007; Raskutti et al., 2014; Lin et al., 2016), and greedy-type algorithms. Another method, proposed recently, is distributed learning based on a divide-and-conquer approach and a particular learning algorithm implemented in individual machines (Zhang et al., 2015; Shamir and Srebro, 2014). This method produces distributed learning algorithms consisting of three steps: partitioning the data into disjoint subsets, applying a particular learning algorithm implemented on an individual machine to each data subset to produce an individual output (function), and synthesizing a global output by taking some average of the individual outputs. This method can significantly reduce computing time and lower the single-machine memory requirements. For practical applications in medicine, finance, business, and some other areas, the data are often stored naturally across multiple servers in a distributive way and are not combined, both to protect privacy and to avoid high costs. In such situations the first step, data partitioning, is not needed.
It has been observed in many practical applications that, when the number of data subsets is not too large, the divide-and-conquer approach causes no noticeable loss of performance compared with the learning scheme that processes the whole data set on a single big machine. Theoretical attempts have recently been made in (Zhang et al., 2013, 2015) to derive learning rates for distributed learning with least squares regularization schemes under certain assumptions. This paper aims at an error analysis of distributed learning with regularized least squares and its approximation to the algorithm processing the whole data set in one single machine.

Recall (Cristianini and Shawe-Taylor, 2000; Evgeniou et al., 2000) that in a reproducing kernel Hilbert space (RKHS) $(\mathcal H_K, \|\cdot\|_K)$ induced by a Mercer kernel $K$ on an input metric space $X$, with a sample $D = \{(x_i, y_i)\}_{i=1}^N \subset X \times Y$, where $Y = \mathbb R$ is the output space, the least squares regularization scheme can be stated as

$f_{D,\lambda} = \arg\min_{f \in \mathcal H_K} \left\{ \frac{1}{|D|} \sum_{(x,y) \in D} (f(x) - y)^2 + \lambda \|f\|_K^2 \right\}.$  (1)

Here $\lambda > 0$ is a regularization parameter and $N := |D|$ is the cardinality of $D$. This learning algorithm is also called kernel ridge regression in statistics and has been well studied in learning theory; see, e.g., (De Vito et al., 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007; Smale and Zhou, 2007; Steinwart and Christmann, 2008). The regularization scheme (1) can be solved explicitly by a standard matrix inversion technique, which requires $O(N^3)$ time and $O(N^2)$ memory. This matrix inversion technique, however, may not conquer the challenges on storage or computation arising from big data.

The distributed learning algorithm studied in this paper starts by partitioning the data set $D$ into $m$ disjoint subsets $\{D_j\}_{j=1}^m$. It then assigns each data subset $D_j$ to one machine or processor, which produces a local estimator $f_{D_j,\lambda}$ by the least squares regularization scheme (1).
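To make the single-machine baseline concrete: by the representer theorem, scheme (1) reduces to an $N \times N$ linear system, whose direct solution exhibits the $O(N^3)$ time cost mentioned above. The following is a minimal sketch in Python/NumPy; the Gaussian kernel, its width, and all parameter values are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(s, t, width=0.25):
    # An illustrative Mercer kernel on X = [0, 1].
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

def krr_fit(x, y, lam, kernel=gaussian_kernel):
    # Representer theorem: f_{D,lambda} = sum_i c_i K(., x_i).
    # Plugging this form into (1) gives the linear system
    # (G + lam * N * I) c = y, where G_{ij} = K(x_i, x_j) and N = |D|.
    G = kernel(x[:, None], x[None, :])
    N = len(x)
    c = np.linalg.solve(G + lam * N * np.eye(N), y)
    # Return the estimator as a function of new inputs t.
    return lambda t: kernel(t[:, None], x[None, :]) @ c
```

Solving the system costs $O(N^3)$ time and storing $G$ costs $O(N^2)$ memory, which is exactly the bottleneck that the divide-and-conquer approach below is designed to avoid.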
Finally, these local estimators are communicated to a central processor, which synthesizes a global estimator $\overline f_{D,\lambda}$ by taking the weighted average

$\overline f_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|} f_{D_j,\lambda}$  (2)

of the local estimators $\{f_{D_j,\lambda}\}_{j=1}^m$. This algorithm has been studied with a matrix analysis approach in (Zhang et al., 2015), where some error analysis was conducted under certain eigenfunction assumptions on the integral operator associated with the kernel, presenting error bounds in expectation.

In this paper we shall use a novel integral operator approach to prove that $\overline f_{D,\lambda}$ is a good approximation of $f_{D,\lambda}$. We present a representation of the difference $\overline f_{D,\lambda} - f_{D,\lambda}$ in terms of empirical integral operators, and analyze the error $\overline f_{D,\lambda} - f_{D,\lambda}$ in expectation without any eigenfunction assumption. As a by-product, we present the best learning rates in expectation for the least squares regularization scheme (1) in a general setting, which surprisingly has not been done for a general kernel in the literature (see the detailed comparisons below).

2. Main Results

Our analysis is carried out in the standard least squares regression framework with a Borel probability measure $\rho$ on $Z := X \times Y$, where the input space $X$ is a compact metric space. The sample $D$ is drawn independently according to $\rho$. The Mercer kernel $K : X \times X \to \mathbb R$ defines an integral operator $L_K$ on $\mathcal H_K$ by

$L_K f = \int_X K_x f(x)\, d\rho_X, \qquad f \in \mathcal H_K,$  (3)

where $K_x$ is the function $K(\cdot, x)$ in $\mathcal H_K$ and $\rho_X$ is the marginal distribution of $\rho$ on $X$.

2.1 Error Bounds for the Distributed Learning Algorithm

Our error bounds in expectation for the distributed learning algorithm (2) require the uniform boundedness condition on the output $y$: for some constant $M > 0$, there holds $|y| \le M$ almost surely.
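The three steps of the distributed algorithm (partition, local solve, weighted average (2)) can be sketched as follows. This is a minimal illustration, assuming a local kernel ridge regression solver as in (1); the kernel, its width, and the partitioning by index blocks are illustrative choices.

```python
import numpy as np

def krr_fit(x, y, lam, width=0.25):
    # Local least squares regularization scheme (1) on one data subset.
    G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))
    c = np.linalg.solve(G + lam * len(x) * np.eye(len(x)), y)
    return lambda t: np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * width ** 2)) @ c

def distributed_krr(x, y, lam, m, width=0.25):
    # Step 1: partition D into m disjoint subsets D_1, ..., D_m.
    parts = np.array_split(np.arange(len(x)), m)
    # Step 2: solve (1) locally on each subset (in practice, on m machines).
    local = [krr_fit(x[idx], y[idx], lam, width) for idx in parts]
    # Step 3: weighted average (2) with weights |D_j| / |D|.
    weights = [len(idx) / len(x) for idx in parts]
    return lambda t: sum(w * f(t) for w, f in zip(weights, local))
```

Note that with $m = 1$ the synthesized estimator coincides with the single-machine estimator $f_{D,\lambda}$, while for $m > 1$ each machine only handles an $(N/m) \times (N/m)$ system.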
Our bounds are stated in terms of the approximation error

$\|f_\lambda - f_\rho\|_\rho,$  (4)

where $f_\lambda$ is the data-free limit of (1) defined by

$f_\lambda = \arg\min_{f \in \mathcal H_K} \left\{ \int_Z (f(x) - y)^2\, d\rho + \lambda \|f\|_K^2 \right\},$  (5)

$\|\cdot\|_\rho$ denotes the norm of $L^2_{\rho_X}$, the Hilbert space of square integrable functions with respect to $\rho_X$, and $f_\rho$ is the regression function (conditional mean) of $\rho$ defined by

$f_\rho(x) = \int_Y y\, d\rho(y|x), \qquad x \in X,$

with $\rho(\cdot|x)$ being the conditional distribution of $\rho$ induced at $x \in X$. Since $K$ is continuous, symmetric and positive semidefinite, $L_K$ is a compact positive operator of trace class and $L_K + \lambda I$ is invertible. Define a quantity measuring the complexity of $\mathcal H_K$ with respect to $\rho_X$, the effective dimension (Zhang, 2005), to be the trace of the operator $(L_K + \lambda I)^{-1} L_K$:

$\mathcal N(\lambda) = \mathrm{Tr}\left((L_K + \lambda I)^{-1} L_K\right), \qquad \lambda > 0.$  (6)

In Section 6 we shall prove the following first main result of this paper, concerning error bounds in expectation for $\overline f_{D,\lambda} - f_{D,\lambda}$ in $\mathcal H_K$ and in $L^2_{\rho_X}$. Denote $\kappa = \sup_{x \in X}\sqrt{K(x,x)}$.

Theorem 1 Assume $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$, then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le C_\kappa \sqrt{\lambda}\, m \overline C_m \left\{1 + \sqrt{m \overline C_m}\right\} \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le C_\kappa\, m \overline C_m \left\{1 + \sqrt{m \overline C_m}\right\} \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

where $C_\kappa$ is a constant depending only on $\kappa$, and $\overline C_m$ is the quantity given in terms of $m$, $N$, $\lambda$, $\mathcal N(\lambda)$ by

$\overline C_m := \frac{m}{(N\lambda)^2} + \frac{\mathcal N(\lambda)}{N\lambda}.$

To derive explicit learning rates from the general error bounds in Theorem 1, one can quantify the growth of $\mathcal N(\lambda)$ by the following capacity assumption, a characteristic of the complexity of the hypothesis space (Caponnetto and De Vito, 2007; Blanchard and Krämer, 2010): for some constants $\beta > 0$ and $c > 0$,

$\mathcal N(\lambda) \le c \lambda^{-\beta}, \qquad \lambda > 0.$  (7)

Let $\{(\lambda_\ell, \phi_\ell)\}_\ell$ be a set of normalized eigenpairs of $L_K$ on $\mathcal H_K$, with $\{\phi_\ell\}_\ell$ an orthonormal basis of $\mathcal H_K$ and $\{\lambda_\ell\}_\ell$ arranged in nonincreasing order. A sufficient condition for the capacity assumption (7) with $0 < \beta < 1$ is $\lambda_\ell = O(\ell^{-1/\beta})$, which can be found in, e.g., Caponnetto and De Vito (2007).
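Given the eigenvalues of $L_K$, the effective dimension (6) is the simple sum $\mathcal N(\lambda) = \sum_\ell \lambda_\ell/(\lambda_\ell + \lambda)$. The sketch below (a hypothetical helper, assuming the operator eigenvalues are known or have been estimated) computes it and illustrates two of its basic properties: it decreases in $\lambda$ and is bounded by $\mathrm{Tr}(L_K)/\lambda$.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    # N(lambda) = Tr((L_K + lam I)^{-1} L_K) = sum_l lam_l / (lam_l + lam),
    # computed here from a (finite, assumed known) list of eigenvalues of L_K.
    ev = np.asarray(eigvals, dtype=float)
    return float(np.sum(ev / (ev + lam)))

# Illustrative polynomially decaying spectrum.
ev = 1.0 / np.arange(1, 1001) ** 2
n_small = effective_dimension(ev, 0.01)   # larger for smaller lambda
n_large = effective_dimension(ev, 0.1)
```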
Remark 2 The sufficient condition $\lambda_\ell = O(\ell^{-1/\beta})$ with index $\beta = \frac{d}{2\tau} < 1$ is satisfied by the Sobolev space $W^\tau(B(\mathbb R^d))$ with smoothness index $\tau > d/2$ on a ball $B(\mathbb R^d)$ of the Euclidean space $\mathbb R^d$, when the marginal distribution $\rho_X$ is the uniform distribution on $B(\mathbb R^d)$; see (Steinwart et al., 2009; Edmunds and Triebel, 1996).

Condition (7) with $\beta = 1$ always holds true with the choice of the constant $c = \kappa^2$. In fact, the eigenvalues of the operator $(L_K + \lambda I)^{-1} L_K$ are $\{\frac{\lambda_\ell}{\lambda_\ell + \lambda}\}_\ell$, so its trace is given by

$\mathcal N(\lambda) = \sum_\ell \frac{\lambda_\ell}{\lambda_\ell + \lambda} \le \frac1\lambda \sum_\ell \lambda_\ell = \frac{\mathrm{Tr}(L_K)}{\lambda} \le \frac{\kappa^2}{\lambda}.$

In the existing literature on learning rates for the classical least squares algorithm (1), the regularization parameter $\lambda$ is often taken to satisfy the restriction $\frac{\mathcal N(\lambda)}{N\lambda} = O(1)$, as in (Caponnetto and De Vito, 2007) up to a logarithmic factor, or in (Steinwart et al., 2009) under some assumptions on $(K, \rho_X)$ (see (14) below). Here, to derive learning rates for $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho]$ with $m \ge 1$ corresponding to the distributed learning algorithm (2), we consider $\lambda$ satisfying the following restriction with some constant $C_0 > 0$:

$0 < \lambda \le C_0 \quad \text{and} \quad m \mathcal N(\lambda) \le C_0 N \lambda.$  (8)

Corollary 3 Assume $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$ and $\lambda$ satisfies (8), then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_\kappa \sqrt{\lambda}\left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_\kappa \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

where $\widetilde C_\kappa$ is a constant depending only on $\kappa$, $C_0$, and the largest eigenvalue of $L_K$.

In the special case $f_\rho \in \mathcal H_K$, the approximation error can be bounded as $\|f_\lambda - f_\rho\|_\rho \le \|f_\rho\|_K \sqrt\lambda$. A more general regularity condition can be imposed on the regression function:

$f_\rho = L_K^r(g_\rho) \quad \text{for some } g_\rho \in L^2_{\rho_X},\ r > 0,$  (9)

where the integral operator $L_K$ is regarded as a compact positive operator on $L^2_{\rho_X}$ and its $r$th power $L_K^r$ is well defined for any $r > 0$. Condition (9) means that $f_\rho$ lies in the range of $L_K^r$; the special case $f_\rho \in \mathcal H_K$ corresponds to the choice $r = 1/2$. It implies $\|f_\lambda - f_\rho\|_\rho \le \lambda^r \|g_\rho\|_\rho$ by Lemma 21 below.
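The link between eigenvalue decay and the capacity assumption (7) can be checked numerically: for $\lambda_\ell = \ell^{-1/\beta}$ the product $\mathcal N(\lambda)\,\lambda^\beta$ should stay bounded as $\lambda \to 0$. A small sketch under this assumed spectrum (all numbers illustrative):

```python
import numpy as np

beta = 0.5
ell = np.arange(1, 100_001)
eigvals = ell ** (-1.0 / beta)   # polynomial decay lambda_l = l^{-1/beta}

def eff_dim(lam):
    # N(lambda) = sum_l lambda_l / (lambda_l + lam), truncated spectrum.
    return float(np.sum(eigvals / (eigvals + lam)))

# Capacity assumption (7): N(lambda) <= c * lambda^{-beta}, i.e. the ratio
# N(lambda) * lambda^beta should remain bounded as lambda decreases.
ratios = [eff_dim(lam) * lam ** beta for lam in (1.0, 1e-1, 1e-2, 1e-3, 1e-4)]
```

For this spectrum the ratio in fact converges (to $\pi/2$ for $\beta = 1/2$, by comparing the sum with an integral), so a constant $c$ in (7) exists.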
Thus, under condition (9), we obtain from Corollary 3, by choosing $\lambda$ to minimize the order of the error bound subject to the restriction (8), the following convergence rates for the error $\overline f_{D,\lambda} - f_{D,\lambda}$ of the distributed learning algorithm (2).

Corollary 4 Assume regularity condition (9) for some $0 < r \le 1$, capacity assumption (7) with $\beta = \frac{1}{2\alpha}$ for some $\alpha > 0$, and $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$ with

$m \le N^{\frac{\alpha\max\{2r,1\} + \frac12 + 2\alpha(r-1)}{2\alpha\max\{2r,1\} + 1 + 2\alpha(r-1)}}$  (10)

and $\lambda = m N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$, then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_{\kappa,c,r} \left(\frac1m\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_{\kappa,c,r} \left(\frac1m\right)^{\frac{\alpha\max\{2r-1,0\}}{2\alpha\max\{2r,1\}+1}},$

where $\widetilde C_{\kappa,c,r}$ is a constant independent of $N$ or $m$. In particular, when $f_\rho \in \mathcal H_K$ and $m \le N^{\frac{1}{2+2\alpha}}$, taking $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$ yields the rates

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_{\kappa,c,r}\, N^{-\frac{\alpha}{2\alpha+1}} m^{-\frac{1}{4\alpha+2}} \quad \text{and} \quad \mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_{\kappa,c,r}\, \frac{1}{\sqrt m}.$

Remark 5 In Corollary 4 we present learning rates in both the $\mathcal H_K$ and the $L^2_{\rho_X}$ norms. The rates with respect to the $L^2_{\rho_X}$ norm provide estimates of the generalization ability of the algorithm for regression. The rates with respect to the $\mathcal H_K$ norm give error estimates in the uniform metric, due to the relation $\|f\|_\infty \le \kappa\|f\|_K$, and might be used to solve some mismatch problems in learning theory, where the generalization ability of learning algorithms is measured with respect to a probability measure $\mu$ different from $\rho_X$.

Remark 6 The learning rates in Corollary 4 are stated for the norms of the difference $\overline f_{D,\lambda} - f_{D,\lambda}$, which reflects the variance of the distributed learning algorithm (2). These rates decrease as $m$ increases (subject to the restriction (10)) and thereby the regularization parameter $\lambda$ increases; this is different from the learning rates presented for $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho]$ in (Zhang et al., 2015). To derive learning rates for $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho]$ by our analysis, the regularization parameter $\lambda$ should be independent of $m$, as shown in Corollary 11 below.
2.2 Optimal Learning Rates for the Least Squares Regularization Scheme

The second main result of this paper is optimal learning rates in expectation for the least squares regularization scheme (1). Here we can even relax the uniform boundedness of $y$ to a moment condition: for some constant $p \ge 1$,

$\sigma^2_\rho \in L^p_{\rho_X},$  (11)

where $\sigma^2_\rho$ is the conditional variance defined by $\sigma^2_\rho(x) = \int_Y (y - f_\rho(x))^2\, d\rho(y|x)$. The following learning rates in expectation for the least squares regularization scheme (1) will be proved in Section 5. The existence of $f_\lambda$ is ensured by $\mathbb E[y^2] < \infty$.

Theorem 7 Assume $\mathbb E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. Then we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)(1 + \kappa)\left\{1 + \frac{1}{(N\lambda)^2} + \frac{\mathcal N(\lambda)}{N\lambda}\right\}\left\{\|f_\lambda - f_\rho\|_\rho + \frac{\kappa^{1/p}\sqrt{\|\sigma^2_\rho\|_p}\,\mathcal N(\lambda)^{\frac{p-1}{2p}}}{\lambda^{\frac{1}{2p}}\sqrt N}\right\}.$  (12)

If the parameters satisfy $\frac{\mathcal N(\lambda)}{N\lambda} = O(1)$, we have the following explicit bound.

Corollary 8 Assume $\mathbb E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. If $\lambda$ satisfies (8) with $m = 1$, then we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(\|f_\lambda - f_\rho\|_\rho + \frac{\mathcal N(\lambda)^{\frac{p-1}{2p}}}{\lambda^{\frac{1}{2p}}\sqrt N}\right).$

In particular, if $p = \infty$, that is, the conditional variances are uniformly bounded, then

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(\|f_\lambda - f_\rho\|_\rho + \sqrt{\frac{\mathcal N(\lambda)}{N}}\right).$

In particular, when regularity condition (9) is satisfied, we have the following learning rates in expectation.

Corollary 9 Assume $\mathbb E[y^2] < \infty$, moment condition (11) for some $1 \le p \le \infty$, and regularity condition (9) for some $0 < r \le 1$. If the capacity assumption $\mathcal N(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ holds with some $\alpha > 0$, then by taking $\lambda = N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2r\alpha}{2\alpha\max\{2r,1\}+1}} + N^{-\frac{\alpha\max\{2r,1\} - \frac{2\alpha-1}{2p}}{2\alpha\max\{2r,1\}+1}}\right).$

In particular, when $p = \infty$ (the conditional variances are uniformly bounded), we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2r\alpha}{2\alpha\max\{2r,1\}+1}}\right).$

Remark 10 It was shown in (Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007) that under the conditions of Corollary 9 with $p = \infty$ and $r \in [\frac12, 1]$, the best learning rate, called the minimax lower rate of convergence, for learning algorithms with output functions in $\mathcal H_K$ is $O(N^{-\frac{2r\alpha}{4\alpha r+1}})$. So the convergence rate we obtain in Corollary 9 is optimal.
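The optimality claim in Remark 10 is pure exponent arithmetic: for $r \ge \frac12$ one has $\max\{2r,1\} = 2r$, so the exponent in Corollary 9 (with $p = \infty$) coincides with the minimax exponent. The hypothetical helper below just transcribes the two displayed exponents, using exact rational arithmetic, and checks this coincidence.

```python
from fractions import Fraction

def rate_exponent(r, alpha):
    # Exponent of N^{-...} in Corollary 9 with p = infinity:
    # 2 r alpha / (2 alpha max(2r, 1) + 1).
    s = max(2 * r, Fraction(1))
    return Fraction(2) * r * alpha / (2 * alpha * s + 1)

def minimax_exponent(r, alpha):
    # Minimax lower-rate exponent from Remark 10: 2 r alpha / (4 alpha r + 1).
    return Fraction(2) * r * alpha / (4 * alpha * r + 1)
```

For example, $r = 1$, $\alpha = 1$ gives the familiar exponent $2/5$ for kernel ridge regression with eigenvalue decay $i^{-2}$.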
Combining the bounds for $\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho$ and $\|f_{D,\lambda} - f_\rho\|_\rho$, we shall prove in Section 6 the following learning rates in expectation for the distributed learning algorithm (2) for regression.

Corollary 11 Assume $|y| \le M$ almost surely and regularity condition (9) for some $\frac12 < r \le 1$. If the capacity assumption $\mathcal N(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ holds with some $\alpha > 0$, $|D_j| = \frac Nm$ for $j = 1, \ldots, m$, and $m$ satisfies the restriction

$m \le N^{\frac{2\alpha(2r-1)}{4\alpha r+1}},$  (13)

then by taking $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$, we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right).$

Remark 12 Corollary 11 shows that distributed learning with the least squares regularization scheme in an RKHS can achieve the optimal learning rates in expectation, provided that $m$ satisfies the restriction (13). It should be pointed out that our error analysis is carried out under regularity condition (9) with $1/2 < r \le 1$, while the work in (Zhang et al., 2015) focused on the case $r = 1/2$. When $r$ approaches $1/2$, the number $m$ of local processors allowed under the restriction (13) reduces to $1$, which corresponds to the non-distributed case. In our follow-up work, we will consider relaxing the restriction (13) in a semi-supervised learning framework by using additional unlabelled data, as done in (Caponnetto and Yao, 2010). The main contribution of our analysis for distributed learning in this paper is the removal of an eigenfunction assumption of (Zhang et al., 2015), by means of a novel second order decomposition for a difference of operator inverses.

Remark 13 In Corollary 11 and Corollary 9, the choice of the regularization parameter $\lambda$ is independent of the number $m$ of local processors. This is consistent with the results in (Zhang et al., 2015). There have been several approaches to selecting the regularization parameter $\lambda$ in regularization schemes in the literature, including cross-validation (Györfi et al., 2002; Blanchard and Krämer, 2016) and the balancing principle (De Vito et al., 2010).
For practical applications of distributed learning algorithms, how to choose $\lambda$ and $m$ (except in situations where the data are naturally stored in a distributive way) is crucial. Though we only consider the theoretical topic of error analysis in this paper, it would be interesting to develop parameter selection methods for distributed learning.

3. Comparisons and Discussion

The least squares regularization scheme (1) is a classical algorithm for regression and has been extensively investigated in statistics and learning theory; there is a vast literature on this topic. Here, for a general kernel beyond those of the Sobolev spaces, we compare our results with the best learning rates in the existing literature.

Under the assumption that the orthogonal projection $f_{\mathcal H}$ of $f_\rho$ in $L^2_{\rho_X}$ onto the closure of $\mathcal H_K$ satisfies regularity condition (9) for some $\frac12 \le r \le 1$, and that the eigenvalues $\{\lambda_i\}_i$ of $L_K$ satisfy $\lambda_i \sim i^{-2\alpha}$ with some $\alpha > 1/2$, it was proved in (Caponnetto and De Vito, 2007) that

$\lim_{\tau \to \infty} \limsup_{N \to \infty} \sup_{\rho \in \mathcal P(\alpha)} \mathrm{Prob}\left[\|f_{D,\lambda_N} - f_{\mathcal H}\|^2_\rho > \tau \lambda_N^{2r}\right] = 0, \qquad \lambda_N = \begin{cases} N^{-\frac{2\alpha}{4\alpha r+1}}, & \text{if } \frac12 < r \le 1,\\[2pt] \left(\frac{\log N}{N}\right)^{\frac{2\alpha}{2\alpha+1}}, & \text{if } r = \frac12, \end{cases}$

where $\mathcal P(\alpha)$ denotes a set of probability measures $\rho$ satisfying some moment decay condition (which is satisfied when $|y| \le M$). This learning rate in probability is stated with a limit as $\tau \to \infty$, and it carries a logarithmic factor in the case $r = \frac12$. In particular, to guarantee $\|f_{D,\lambda_N} - f_{\mathcal H}\|^2_\rho \le \tau_\eta \lambda_N^{2r}$ with confidence $1 - \eta$ by this result, one needs to restrict $N \ge N_\eta$ to be large enough and to take the constant $\tau_\eta$, depending on $\eta$, large enough. Using existing tools for error analysis, including those in (Caponnetto and De Vito, 2007), we develop a novel second order decomposition technique for a difference of operator inverses in this paper, and succeed in deriving the optimal learning rate in expectation in Corollary 9, removing the logarithmic factor in the case $r = \frac12$.
Under the assumptions that $|y| \le M$ almost surely, that the eigenvalues $\{\lambda_i\}_i$ satisfy $\lambda_i \le a i^{-2\alpha}$ with some $\alpha > 1/2$ and $a > 0$, and that for some constant $C > 0$ the pair $(K, \rho_X)$ satisfies

$\|f\|_\infty \le C\, \|f\|_{L^2_{\rho_X}}^{1 - \frac{1}{2\alpha}} \|f\|_K^{\frac{1}{2\alpha}} \qquad \text{for every } f \in \mathcal H_K,$  (14)

it was proved in (Steinwart et al., 2009) that for some constant $c_{\alpha,C}$ depending only on $\alpha$ and $C$, with confidence $1 - \eta$, for any $0 < \lambda \le 1$,

$\|\pi_M(f_{D,\lambda}) - f_\rho\|^2_\rho \le 9 \mathcal A_2(\lambda) + c_{\alpha,C}\, \frac{a^{1/(2\alpha)} M^2 \log(3/\eta)}{\lambda^{1/(2\alpha)} N}.$

Here $\pi_M$ is the projection onto the interval $[-M, M]$, defined (Chen et al., 2004; Wu et al., 2006) by

$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M,\\ f(x), & \text{if } |f(x)| \le M,\\ -M, & \text{if } f(x) < -M, \end{cases}$

and $\mathcal A_2(\lambda)$ is the approximation error defined by

$\mathcal A_2(\lambda) = \inf_{f \in \mathcal H_K}\left\{\|f - f_\rho\|^2_\rho + \lambda\|f\|^2_K\right\}.$

When $f_\rho \in \mathcal H_K$, $\mathcal A_2(\lambda) = O(\lambda)$, and the choice $\lambda_N = N^{-\frac{2\alpha}{2\alpha+1}}$ gives a learning rate of order $\|\pi_M(f_{D,\lambda_N}) - f_\rho\|_\rho = O(N^{-\frac{\alpha}{2\alpha+1}})$. Here one needs to impose condition (14) on the function spaces $L^2_{\rho_X}$ and $\mathcal H_K$, and to take the projection onto $[-M, M]$. Our learning rates in Corollary 9 require no such condition on the function spaces, nor do we take the projection. Let us mention the fact from (Steinwart et al., 2009; Mendelson and Neeman, 2010) that condition (14) is satisfied when $\mathcal H_K$ is a Sobolev space on a domain of a Euclidean space and $\rho_X$ is the uniform distribution, or when the eigenfunctions $\{\phi_\ell/\sqrt{\lambda_\ell} : \lambda_\ell > 0\}$ of $L_K$, normalized in $L^2_{\rho_X}$, are uniformly bounded. Recall that $\|\phi_\ell\|_{L^2_{\rho_X}} = \sqrt{\lambda_\ell}$, since

$\lambda_\ell = \lambda_\ell \|\phi_\ell\|^2_K = \langle \lambda_\ell \phi_\ell, \phi_\ell\rangle_K = \langle L_K \phi_\ell, \phi_\ell\rangle_K = \left\langle \int_X K_x \phi_\ell(x)\, d\rho_X(x), \phi_\ell\right\rangle_K = \int_X \phi_\ell(x)\phi_\ell(x)\, d\rho_X(x) = \|\phi_\ell\|^2_{L^2_{\rho_X}}.$

Learning rates for the least squares regularization scheme (1) in the $\mathcal H_K$-metric have been investigated in (Smale and Zhou, 2007).
For the distributed learning algorithm (2) with subsets $\{D_j\}_{j=1}^m$ of equal size, under the assumption that for some constants $2 < k \le \infty$ and $A < \infty$ the eigenfunctions $\{\phi_i\}_i$ satisfy

$\begin{cases} \sup_{\lambda_i > 0} \mathbb E\left[\left|\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right|^{2k}\right] \le A^{2k}, & \text{when } k < \infty,\\[2pt] \sup_{\lambda_i > 0} \left\|\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\|_\infty \le A, & \text{when } k = \infty, \end{cases}$  (15)

and that $f_\rho \in \mathcal H_K$ and $\lambda_i \le a i^{-2\alpha}$ for some $\alpha > 1/2$ and $a > 0$, it was proved in (Zhang et al., 2015) that for $\lambda = N^{-\frac{2\alpha}{2\alpha+1}}$ and $m$ satisfying the restriction

$m \le c_\alpha \begin{cases} \left(\frac{N^{\frac{2(k-4)\alpha - k}{2\alpha+1}}}{A^{4k}\log^k N}\right)^{\frac{1}{k-2}}, & \text{when } k < \infty,\\[4pt] \frac{N^{\frac{2\alpha-1}{2\alpha+1}}}{A^4 \log N}, & \text{when } k = \infty, \end{cases}$  (16)

with a constant $c_\alpha$ depending only on $\alpha$, there holds

$\mathbb E\left[\|\overline f_{D,\lambda} - f_\rho\|^2_\rho\right] = O\left(N^{-\frac{2\alpha}{2\alpha+1}}\right).$

This interesting result was achieved by a matrix analysis approach, for which the eigenfunction assumption (15) played an essential role. It might be challenging to check the eigenfunction assumption (15), which involves the integral operator $L_K = L_{K,\rho_X}$, for the pair $(K, \rho_X)$. To our best knowledge, besides the case of finite dimensional RKHSs, only three classes of pairs $(K, \rho_X)$ have been proved in the literature to satisfy this eigenfunction assumption: the first class (Steinwart et al., 2009) consists of the Sobolev reproducing kernels on Euclidean domains together with uniform measures $\rho_X$; the second (Williamson et al., 2001) consists of periodical kernels; and the third class can be constructed by a Mercer type expansion

$K(x, y) = \sum_i \lambda_i\, \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, \frac{\phi_i(y)}{\sqrt{\lambda_i}},$  (17)

where $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$ is an orthonormal system in $L^2_{\rho_X}$. An example of a $C^\infty$ Mercer kernel was presented in (Zhou, 2002, 2003) to show that smoothness of the Mercer kernel alone does not guarantee the uniform boundedness of the eigenfunctions $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$. Furthermore, it was shown in (Gittens and Mahoney, 2016) that the normalized eigenfunctions associated with radial basis kernels tend to be localized when the radial basis parameters are made smaller, which raises a practical concern about the uniform boundedness assumption on the eigenfunctions.
Remark 14 To see how the eigenfunctions $\{\phi_i(x)\}_i$ change with the marginal distribution, we consider a different measure $\mu$ induced by a nonnegative function $P \in L^2_{\rho_X}$ as $d\mu = P(x)\, d\rho_X$. An eigenpair $(\lambda, \phi)$ of the integral operator $L_{K,\mu}$ associated with the pair $(K, \mu)$ with $\lambda > 0$ satisfies

$L_{K,\mu}\phi = \int_X K(\cdot, x)\phi(x)P(x)\, d\rho_X = \sum_{\lambda_i > 0} \lambda_i\, \frac{\phi_i}{\sqrt{\lambda_i}} \int_X \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, \phi(x)P(x)\, d\rho_X = \lambda\phi.$

We introduce an index set $I := \{i : \lambda_i > 0\}$ and a possibly infinite matrix

$K^P := \left(\lambda_i \int_X \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, P(x)\, \frac{\phi_j(x)}{\sqrt{\lambda_j}}\, d\rho_X\right)_{i,j \in I},$  (18)

and express $\phi$ in terms of the orthonormal basis $\{\phi_i\}_{i \in I}$ as $\phi = \sum_{i \in I} \frac{c_i}{\sqrt{\lambda_i}}\phi_i$, where the sequence $c = (c_i)_{i \in I}$ is given by $c_i = \left\langle \phi, \frac{\phi_i}{\sqrt{\lambda_i}}\right\rangle_{L^2_{\rho_X}} = \sqrt{\lambda_i}\langle \phi, \phi_i\rangle_K$. Then the eigenpair $(\lambda, \phi)$ satisfies

$L_{K,\mu}\phi = \lambda\phi \iff K^P c = \lambda c.$

Thus the eigenpairs of the integral operator $L_{K,\mu}$ associated with $(K, \mu)$ can be characterized by those of the possibly infinite matrix $K^P$ defined by (18). Finding the eigenpairs of $K^P$ is an interesting question involving matrix analysis in linear algebra and multiplier operators in harmonic analysis. Note that when $P \equiv 1$, corresponding to $\mu = \rho_X$, $K^P$ is diagonal with diagonal entries $\{\lambda_i\}_i$, and its eigenvectors yield the eigenfunctions $\{\phi_i\}_i$.

From the above observation we can see that the marginal distribution $\rho_X$ plays an essential role in the eigenfunction assumption (15), which might be difficult to check for a general marginal distribution $\rho_X$. For example, it is even unknown whether any of the Gaussian kernels on $X = [0,1]^d$ satisfies the eigenfunction assumption (15) when $\rho_X$ is a general Borel measure. Our learning rates stated in Corollary 4 and Corollary 11 do not require the eigenfunction assumption (15).
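A finite-dimensional analogue of the matrix (18) can be built explicitly: replace $X$ by a grid of $n$ points with uniform $\rho_X$, take an orthonormal system $\{u_i\}$ in the weighted $\ell^2$ inner product as a stand-in for $\{\phi_i/\sqrt{\lambda_i}\}$, and form $[K^P]_{ij} = \lambda_i \sum_x u_i(x) P(x) u_j(x) \rho_X(x)$. The sketch below (all constructions illustrative) checks the diagonal case $P \equiv 1$ noted in Remark 14.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
# Orthonormal system {u_i} in L^2(rho_X) with rho_X uniform on n grid points:
# columns of a random orthogonal matrix, rescaled so that
# sum_x u_i(x) u_j(x) * (1/n) = delta_ij.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
U = Q * np.sqrt(n)                      # U[:, i] stands in for phi_i / sqrt(lam_i)
lam = 1.0 / (np.arange(n) + 1.0) ** 2   # illustrative positive eigenvalues

def K_P(P):
    # Discrete analogue of (18): [K^P]_{ij} = lam_i * int u_i P u_j d rho_X.
    return lam[:, None] * (U.T @ np.diag(P / n) @ U)

# P == 1 corresponds to mu = rho_X, so K^P must be diag(lam_1, ..., lam_n).
assert np.allclose(K_P(np.ones(n)), np.diag(lam))
```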
Moreover, our restriction (10) on the number $m$ of local processors in Corollary 4 is more general than (16) when $\alpha$ is close to $1/2$: with $r = 1/2$, corresponding to the condition $f_\rho \in \mathcal H_K$, our restriction (10) in Corollary 4 is $m \le N^{\frac{1}{4+6\alpha}}$, with power index tending to $\frac17$, while the restriction in (16) with $k = \infty$ takes the form $m \le c_\alpha \frac{N^{\frac{2\alpha-1}{2\alpha+1}}}{A^4 \log N}$, with power index tending to $0$, as $\alpha \to \frac12$.

Note that the learning rates stated in Corollary 4 are for the difference $\overline f_{D,\lambda} - f_{D,\lambda}$ between the output function of the distributed learning algorithm (2) and that of the algorithm (1) using the whole data. In the special case $r = \frac12$, we can see that the bound $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho] \le \widetilde C_{\kappa,c,r}\, N^{-\frac{\alpha}{2\alpha+1}} m^{-\frac{1}{4\alpha+2}}$, achieved by choosing $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$, becomes smaller as $m$ becomes larger. Here the dependence of $\lambda$ on $m$ is crucial for achieving this convergence rate of the sample error: if we fix $\lambda$ and $N$, the error bounds for $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|]$ stated in Theorem 1 and Corollary 3 become larger as $m$ increases. On the other hand, as one expects, increasing the number $m$ of local processors increases the approximation error for the regression problem, which can also be seen from the bound of Theorem 7 with $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$. The result in Corollary 11 with $r > \frac12$ compensates, and gives the optimal learning rate $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho] = O(N^{-\frac{2r\alpha}{4\alpha r+1}})$ by restricting $m$ as in (13).

Besides the divide-and-conquer technique, there are many other approaches in the literature to reduce the computational complexity of handling big data. These include localized learning (Meister and Steinwart, 2016), Nyström regularization (Bach, 2013; Rudi et al., 2015), and online learning (Dekel et al., 2012). The divide-and-conquer technique has the advantage of reducing the single-machine complexity in both space and time without a significant loss of prediction power. How the big data are decomposed for distributed learning is an important and interesting problem.
Here we only study the approach of random decomposition, where the data subsets are regarded as drawn from the same distribution. There should be better decomposition strategies for some practical applications. For example, data points might intuitively be divided according to their spatial distribution, so that the learning process yields locally optimal predictors, which could then be synthesized into the output function; see (Thomann et al., 2016).

In this paper we consider regularized least squares with Mercer kernels. Our results might be extended to more general kernels. A natural class consists of bounded and positive semidefinite kernels, as studied in (Steinwart and Scovel, 2012). By means of the Mercer type expansion (17), one needs some assumptions on the system $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$, the domain $X$, and the measure $\rho_X$ to relax the continuity of the kernel while keeping the compactness of the integral operator. How to minimize the assumptions and maximize the scope of applications of the framework, such as the situation of an input space $X$ without a metric (Shen et al., 2014; De Vito et al., 2013), is a valuable question to investigate.

Here we only consider distributed learning with regularized least squares. It would be of great interest and value to develop the theory of distributed learning with other algorithms, such as spectral algorithms (Bauer et al., 2007), empirical feature-based learning (Guo and Zhou, 2012; Guo et al., 2017; Shi et al., 2011), the minimum error entropy principle (Hu et al., 2015; Fan et al., 2016), and randomized Kaczmarz algorithms (Lin and Zhou, 2015).

Remark 15 After the submission of this paper, in our follow-up paper by Z. C. Guo, S. B. Lin, and D. X. Zhou, entitled "Learning theory of distributed spectral algorithms" and published in Inverse Problems, error analysis and optimal learning rates for distributed learning with spectral algorithms were derived.
In late 2016, we learned that similar analysis was carried out for classical (non-distributed) spectral algorithms implemented on one machine by G. Blanchard and N. Mücke in a paper entitled "Optimal rates for regularization of statistical inverse learning problems" (arXiv:1604.04054, April 2016), and by L. H. Dicker, D. P. Foster, and D. Hsu in a paper entitled "Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators" (arXiv:1605.08839, May 2016). We are indebted to one of the referees for pointing this out to us.

4. Second Order Decomposition of Operator Differences and Norms

Our error estimates are achieved by a novel second order decomposition of operator differences in our integral operator approach. We approximate the integral operator $L_K$ by the empirical integral operator $L_{K,D(x)}$ on $\mathcal H_K$, defined with the input data set $D(x) = \{x_i\}_{i=1}^N = \{x : (x, y) \in D \text{ for some } y \in Y\}$ as

$L_{K,D(x)}(f) = \frac{1}{|D|}\sum_{x \in D(x)} f(x) K_x = \frac{1}{|D|}\sum_{x \in D(x)} \langle f, K_x\rangle_K K_x, \qquad f \in \mathcal H_K,$  (19)

where the reproducing property $f(x) = \langle f, K_x\rangle_K$ for $f \in \mathcal H_K$ is used. Since $K$ is a Mercer kernel, $L_{K,D(x)}$ is a finite-rank positive operator and $L_{K,D(x)} + \lambda I$ is invertible. The operator difference in our study is that of the inverses of two operators, defined by

$Q_{D(x)} = \left(L_{K,D(x)} + \lambda I\right)^{-1} - \left(L_K + \lambda I\right)^{-1}.$  (20)

If we denote $A = L_{K,D(x)} + \lambda I$ and $B = L_K + \lambda I$, then our second order decomposition of the operator difference $Q_{D(x)}$ is a special case of the following second order decomposition of the difference $A^{-1} - B^{-1}$.

Lemma 16 Let $A$ and $B$ be invertible operators on a Banach space. Then we have

$A^{-1} - B^{-1} = B^{-1}\{B - A\} B^{-1} + B^{-1}\{B - A\} A^{-1}\{B - A\} B^{-1}.$  (21)

Proof We can decompose the operator $A^{-1} - B^{-1}$ as

$A^{-1} - B^{-1} = B^{-1}\{B - A\} A^{-1}.$  (22)

This is the first order decomposition. Write the last factor $A^{-1}$ as $B^{-1} + (A^{-1} - B^{-1})$ and apply another first order decomposition, similar to (22), namely $A^{-1} - B^{-1} = A^{-1}\{B - A\} B^{-1}$. It follows from (22) that

$A^{-1} - B^{-1} = B^{-1}\{B - A\} B^{-1} + B^{-1}\{B - A\}\left\{A^{-1}\{B - A\} B^{-1}\right\}.$
Then the desired identity (21) is verified. The lemma is proved.

To demonstrate how the second order decomposition leads to tight error bounds for the classical least squares regularization scheme (1) with output function $f_{D,\lambda}$, we recall the following formula (see, e.g., (Caponnetto and De Vito, 2007; Smale and Zhou, 2007)):

$f_{D,\lambda} - f_\lambda = \left(L_{K,D(x)} + \lambda I\right)^{-1} \Delta_D, \qquad \Delta_D := \frac{1}{|D|}\sum_{z \in D}\xi_\lambda(z) - \mathbb E[\xi_\lambda],$  (23)

where $\Delta_D$ is induced by the random variable $\xi_\lambda$ with values in the Hilbert space $\mathcal H_K$ defined by

$\xi_\lambda(z) = (y - f_\lambda(x)) K_x, \qquad z = (x, y) \in Z.$  (24)

Then we can express $f_{D,\lambda} - f_\lambda$ by means of the notation $Q_{D(x)} = (L_{K,D(x)} + \lambda I)^{-1} - (L_K + \lambda I)^{-1}$ as

$f_{D,\lambda} - f_\lambda = Q_{D(x)} \Delta_D + (L_K + \lambda I)^{-1}\Delta_D.$

Due to the identity between the $L^2_{\rho_X}$ norm and the $\mathcal H_K$ metric,

$\|g\|_\rho = \left\|L_K^{1/2} g\right\|_K, \qquad g \in L^2_{\rho_X},$  (25)

we can estimate the error $\|f_{D,\lambda} - f_\lambda\|_\rho = \|L_K^{1/2}(f_{D,\lambda} - f_\lambda)\|_K$ by bounding the $\mathcal H_K$ norm of the following expression, obtained from the second order decomposition (21) with $B - A = L_K - L_{K,D(x)}$:

$L_K^{1/2} Q_{D(x)}\Delta_D = L_K^{1/2}(L_K + \lambda I)^{-1/2}\left\{(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\}(L_K + \lambda I)^{-1}\Delta_D$
$\quad + L_K^{1/2}(L_K + \lambda I)^{-1/2}\left\{(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\}\left(L_{K,D(x)} + \lambda I\right)^{-1}\left\{\left(L_K - L_{K,D(x)}\right)(L_K + \lambda I)^{-1/2}\right\}\left\{(L_K + \lambda I)^{-1/2}\Delta_D\right\}.$

Combining this expression with the operator norm bound $\|L_K^{1/2}(L_K + \lambda I)^{-1/2}\| \le 1$ and the notation

$\Xi_D = \left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|$  (26)

for convenience, we find

$\left\|L_K^{1/2} Q_{D(x)}\Delta_D\right\|_K \le \Xi_D\left\|(L_K + \lambda I)^{-1}\Delta_D\right\|_K + \Xi_D\left\|\left(L_{K,D(x)} + \lambda I\right)^{-1}\right\|\,\Xi_D\left\|(L_K + \lambda I)^{-1/2}\Delta_D\right\|_K.$  (27)

By decomposing $(L_K + \lambda I)^{-1}\Delta_D$ as $(L_K + \lambda I)^{-1/2}(L_K + \lambda I)^{-1/2}\Delta_D$ and using the bounds $\|(L_{K,D(x)} + \lambda I)^{-1}\| \le \frac1\lambda$ and $\|(L_K + \lambda I)^{-1/2}\| \le \frac{1}{\sqrt\lambda}$, we know that

$\|f_{D,\lambda} - f_\lambda\|_\rho \le \left\{\frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda} + 1\right\}\left\|(L_K + \lambda I)^{-1/2}\Delta_D\right\|_K.$  (28)

Thus the classical least squares regularization scheme (1) can be analyzed after estimating the operator norm $\Xi_D = \|(L_K + \lambda I)^{-1/2}(L_K - L_{K,D(x)})\|$ and the function norm $\|(L_K + \lambda I)^{-1/2}\Delta_D\|_K$. In the following two lemmas, to be proved in the appendix, these norms are estimated in terms of effective dimensions by methods from the existing literature (Caponnetto and De Vito, 2007; Bauer et al., 2007; Blanchard and Krämer, 2010).
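The algebraic identities (21) and (22) underlying this approach are purely matrix-level facts, so they can be sanity-checked numerically in finite dimensions, with random positive definite matrices standing in for $L_{K,D(x)} + \lambda I$ and $L_K + \lambda I$ (the construction below is only an illustrative check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
# Random symmetric positive definite A and B (hence invertible, as in Lemma 16).
A = rng.standard_normal((d, d)); A = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); B = B @ B.T + d * np.eye(d)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)

# First order decomposition (22): A^{-1} - B^{-1} = B^{-1} (B - A) A^{-1}.
first_order = Binv @ (B - A) @ Ainv
# Second order decomposition (21):
# A^{-1} - B^{-1} = B^{-1}(B-A)B^{-1} + B^{-1}(B-A)A^{-1}(B-A)B^{-1}.
second_order = Binv @ (B - A) @ Binv + Binv @ (B - A) @ Ainv @ (B - A) @ Binv

assert np.allclose(Ainv - Binv, first_order)
assert np.allclose(Ainv - Binv, second_order)
```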
Lemma 17 Let $D$ be a sample drawn independently according to $\rho$. Then the following estimates for the operator norm $\|(L_K + \lambda I)^{-1/2}(L_K - L_{K,D(x)})\|$ hold.

(a) $\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^2\right] \le \frac{\kappa^2 \mathcal N(\lambda)}{|D|}.$

(b) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

$\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\| \le B_{|D|,\lambda}\log(2/\delta),$  (29)

where we denote the quantity

$B_{|D|,\lambda} = \frac{2\kappa}{\sqrt{|D|}}\left\{\frac{\kappa}{\sqrt{|D|\lambda}} + \sqrt{\mathcal N(\lambda)}\right\}.$

(c) For any $d > 1$, there holds

$\left\{\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^d\right]\right\}^{1/d} \le \left(2 d\, \Gamma(d) + 1\right)^{1/d} B_{|D|,\lambda},$

where $\Gamma$ is the Gamma function, defined for $d > 0$ by $\Gamma(d) = \int_0^\infty u^{d-1} e^{-u}\, du$.

Lemma 18 Let $D$ be a sample drawn independently according to $\rho$, let $g$ be a measurable bounded function on $Z$, and let $\xi_g$ be the random variable with values in $\mathcal H_K$ defined by $\xi_g(z) = g(z)K_x$ for $z = (x, y) \in Z$. Then the following statements hold.

(a) $\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \mathcal N(\lambda).$

(b) For almost every $x \in X$, there holds $\left\|(L_K + \lambda I)^{-1/2}(K_x)\right\|_K \le \frac{\kappa}{\sqrt\lambda}.$

(c) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

$\left\|(L_K + \lambda I)^{-1/2}\left(\frac{1}{|D|}\sum_{z \in D}\xi_g(z) - \mathbb E[\xi_g]\right)\right\|_K \le 2\|g\|_\infty \log(2/\delta)\left\{\frac{\kappa}{|D|\sqrt\lambda} + \sqrt{\frac{\mathcal N(\lambda)}{|D|}}\right\}.$

Remark 19 In the existing literature, the first order decomposition (22) was used. To bound the norm of $(L_{K,D(x)} + \lambda I)^{-1}\Delta_D$ by this approach, one needs either to use the brute force estimate $\|(L_{K,D(x)} + \lambda I)^{-1}\| \le \frac1\lambda$, leading to suboptimal learning rates, or to apply the approximation of $L_K$ by $L_{K,D(x)}$, which is valid only with confidence and leads to confidence-based estimates with $\lambda$ depending on the confidence level. In our second order decomposition, we decompose the inverse operator $(L_{K,D(x)} + \lambda I)^{-1}$ further after the first order decomposition (22). This leads to finer estimates, with an additional factor $\{B - A\} B^{-1/2}$ in the second term of (21), and gives the refined error bound (28).

5. Deriving Error Bounds for the Least Squares Regularization Scheme

In this section we prove our main result on error bounds for the least squares regularization scheme (1), which illustrates how to apply the second order decomposition (21) of operator differences in our integral operator approach.
Proposition 20 Assume $E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. Then
$$E\left[\|f_{D,\lambda} - f_\lambda\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)\left(1 + \frac{1}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\kappa^{\frac1p}\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac{1}{2p}}} + \frac{\kappa\,\|f_\lambda - f_\rho\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$

Proof We apply the error bound (28) and the Schwarz inequality to get
$$E\left[\|f_{D,\lambda} - f_\lambda\|_\rho\right] \le \left\{E\left[\left(1 + \frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda}\right)^2\right]\right\}^{1/2}\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\Delta_D\right\|_K^2\right]\right\}^{1/2}. \quad (31)$$
The first factor in the above bound involves $\Xi_D = \left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|$ and can be estimated by applying Lemma 17 as
$$\left\{E\left[\left(1 + \frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda}\right)^2\right]\right\}^{1/2} \le 1 + \left\{\frac{E\left[\Xi_D^2\right]}{\lambda}\right\}^{1/2} + \left\{\frac{E\left[\Xi_D^4\right]}{\lambda^2}\right\}^{1/2} \le 1 + \sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}} + \frac{\sqrt{49\,\mathcal{B}^4_{|D|,\lambda}}}{\lambda} \le 2 + \frac{56\kappa^4}{(|D|\lambda)^2} + \frac{57\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}. \quad (32)$$
To deal with the second factor in the bound (31), we separate $\Delta_D = \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda]$ as $\Delta_D = \widetilde\Delta_D + \widehat\Delta_D$, where
$$\widetilde\Delta_D := \frac{1}{|D|}\sum_{z\in D}\xi_0(z), \qquad \widehat\Delta_D := \frac{1}{|D|}\sum_{z\in D}\left(\xi_\lambda - \xi_0\right)(z) - E[\xi_\lambda]$$
are induced by another random variable $\xi_0$ with values in the Hilbert space $H_K$ defined by
$$\xi_0(z) = (y - f_\rho(x))K_x, \qquad z = (x,y) \in Z. \quad (33)$$
Then the second factor in (31) can be separated as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} + \left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_D\right\|_K^2\right]\right\}^{1/2}. \quad (34)$$
Let us bound the first term of (34). Observe that
$$(L_K+\lambda I)^{-1/2}\widetilde\Delta_D = \sum_{(x,y)\in D}\frac{1}{|D|}(y - f_\rho(x))\,(L_K+\lambda I)^{-1/2}(K_x).$$
Each term in this expression is unbiased because $\int_Y (y - f_\rho(x))\,d\rho(y|x) = 0$. This unbiasedness and the independence tell us that
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right] = \frac{1}{|D|}E\left[(y - f_\rho(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \frac{1}{|D|}E\left[\sigma^2_\rho(x)\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right].$$
If $\sigma^2_\rho \in L^\infty_{\rho_X}$, then $\sigma^2_\rho(x) \le \left\|\sigma^2_\rho\right\|_\infty$ almost surely, and by Lemma 18 we have
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \sqrt{\left\|\sigma^2_\rho\right\|_\infty}\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}}.$$
If $\sigma^2_\rho \in L^p_{\rho_X}$ with $1 \le p < \infty$, we take $q = \frac{p}{p-1}$ (with $q = \infty$ for $p = 1$) satisfying $\frac1p + \frac1q = 1$ and apply the Hölder inequality $E[|\xi\eta|] \le \left(E[|\xi|^p]\right)^{1/p}\left(E[|\eta|^q]\right)^{1/q}$ to $\xi = \sigma^2_\rho(x)$ and $\eta = \left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2$ to obtain
$$E\left[\sigma^2_\rho(x)\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] \le \left\|\sigma^2_\rho\right\|_p\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^{2q}\right]\right\}^{1/q}.$$
But
$$E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^{2q}\right] \le \left(\frac{\kappa}{\sqrt\lambda}\right)^{2q-2}E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \left(\frac{\kappa^2}{\lambda}\right)^{q-1}\mathcal{N}(\lambda)$$
by Lemma 18.
So we have
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{\left\|\sigma^2_\rho\right\|_p}{|D|}\left(\left(\frac{\kappa^2}{\lambda}\right)^{q-1}\mathcal{N}(\lambda)\right)^{1/q}\right\}^{1/2} = \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}}.$$
Combining the above two cases, we know that for either $p = \infty$ or $p < \infty$, the first term of (34) can be bounded as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}}.$$
The second term of (34) can be bounded easily as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{1}{|D|}E\left[(f_\rho(x)-f_\lambda(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{1}{|D|}E\left[(f_\rho(x)-f_\lambda(x))^2\right]\frac{\kappa^2}{\lambda}\right\}^{1/2} = \frac{\kappa\,\|f_\rho - f_\lambda\|_\rho}{\sqrt{|D|\lambda}}.$$
Putting the above estimates for the two terms of (34) into (31) and applying the bound (32) involving $\Xi_D$, we know that $E\left[\|f_{D,\lambda}-f_\lambda\|_\rho\right]$ is bounded by
$$\left(2 + \frac{56\kappa^4}{(|D|\lambda)^2} + \frac{57\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}} + \frac{\kappa\,\|f_\rho-f_\lambda\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$
Then our desired error bound follows. The proof of the proposition is complete.

Proof of Theorem 7 Combining Proposition 20 with the triangle inequality
$$\|f_{D,\lambda} - f_\rho\|_\rho \le \|f_{D,\lambda} - f_\lambda\|_\rho + \|f_\lambda - f_\rho\|_\rho,$$
we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \|f_\lambda - f_\rho\|_\rho + \left(2 + 56\kappa^4 + 57\kappa^2\right)\left(1 + \frac{1}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\kappa^{\frac1p}\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}} + \frac{\kappa\,\|f_\lambda - f_\rho\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$
Then the desired error bound holds true, and the proof of Theorem 7 is complete.

Proof of Corollary 8 By the definition of the effective dimension, $\mathcal{N}(\lambda) = \sum_\ell \frac{\lambda_\ell}{\lambda_\ell + \lambda} \ge \frac{\lambda_1}{\lambda_1 + \lambda}$. Combining this with the restriction (8) with $m = 1$, we find $\mathcal{N}(\lambda) \ge \frac{\lambda_1}{\lambda_1 + C_0}$ and $N\lambda \ge \frac{\lambda_1}{(\lambda_1 + C_0)C_0}$. Putting these and the restriction (8) with $m = 1$ into the error bound (12), we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)(1+\kappa)\left(1 + \frac{(\lambda_1 + C_0)^2 C_0^2}{\lambda_1^2} + C_0\right)\left\{\frac{(\lambda_1 + C_0)C_0}{\lambda_1}\,\|f_\lambda - f_\rho\|_\rho + \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{N}\,\lambda^{\frac1{2p}}}\right\}.$$
Then the desired bound follows. The proof of Corollary 8 is complete.

To prove Corollary 9, we need the following bounds (Smale and Zhou, 2007) for $\|f_\lambda - f_\rho\|_\rho$ and $\|f_\lambda - f_\rho\|_K$.

Lemma 21 Assume regularity condition (9) with $0 < r \le 1$. There holds
$$\|f_\lambda - f_\rho\|_\rho \le \lambda^r\|g_\rho\|_\rho. \quad (36)$$
Furthermore, if $\frac12 \le r \le 1$, then we have
$$\|f_\lambda - f_\rho\|_K \le \lambda^{r-\frac12}\|g_\rho\|_\rho. \quad (37)$$

Proof of Corollary 9 By Lemma 21, regularity condition (9) with $0 < r \le 1$ implies $\|f_\lambda - f_\rho\|_\rho \le \lambda^r\|g_\rho\|_\rho$. If $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac{1}{2\alpha}}$ for some constant $C_0 \ge 1$, then the choice $\lambda = N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ yields
$$\frac{\mathcal{N}(\lambda)}{N\lambda} \le C_0\,N^{\frac{2\alpha+1}{2\alpha\max\{2r,1\}+1} - 1} \le C_0.$$
So (8) with $m = 1$ is satisfied.
With this choice we also have
$$\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{N}\,\lambda^{\frac1{2p}}} \le \sqrt{C_0}\, N^{-\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}.$$
Putting these estimates into Corollary 8, we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{2\alpha\max\{2r,1\}+1}} + N^{-\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}\right) = O\left(N^{-\frac{\alpha\min\{2r,\,\max\{2r,1\}\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}\right).$$
But we find $\min\{2r, \max\{2r,1\}\} = 2r$ by discussing the two different cases $0 < r < \frac12$ and $\frac12 \le r \le 1$. Then our conclusion follows immediately. The proof of Corollary 9 is complete.

6. Proof of Error Bounds for the Distributed Learning Algorithm

To analyze the error $\overline{f}_{D,\lambda} - f_{D,\lambda}$ for distributed learning, we recall the notation $Q_{D(x)} = \left(L_{K,D(x)} + \lambda I\right)^{-1} - (L_K + \lambda I)^{-1}$ for the difference of inverse operators and use the notation $Q_{D_j(x)}$ involving the data subset $D_j$. The empirical integral operator $L_{K,D_j(x)}$ is defined with $D$ replaced by the data subset $D_j$. For our error analysis for the distributed learning algorithm (2), we need the following error decomposition for $\overline{f}_{D,\lambda} - f_{D,\lambda}$.

Lemma 22 Assume $E[y^2] < \infty$. For $\lambda > 0$, we have
$$\overline{f}_{D,\lambda} - f_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\left[\left(L_{K,D_j(x)} + \lambda I\right)^{-1} - \left(L_{K,D(x)} + \lambda I\right)^{-1}\right]\Delta_j = \sum_{j=1}^m \frac{|D_j|}{|D|}Q_{D_j(x)}\widetilde\Delta_j + \sum_{j=1}^m \frac{|D_j|}{|D|}Q_{D_j(x)}\widehat\Delta_j - Q_{D(x)}\Delta_D, \quad (38)$$
where
$$\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\xi_\lambda(z) - E[\xi_\lambda], \qquad \Delta_D := \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda],$$
$$\widetilde\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\xi_0(z), \qquad \widehat\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\left(\xi_\lambda - \xi_0\right)(z) - E[\xi_\lambda].$$

Proof Recall the expression (23) for $f_{D,\lambda} - f_\lambda$. When the data subset $D_j$ is used, we have $f_{D_j,\lambda} - f_\lambda = \left(L_{K,D_j(x)} + \lambda I\right)^{-1}\Delta_j$. So we know that
$$\overline{f}_{D,\lambda} - f_\lambda = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(f_{D_j,\lambda} - f_\lambda\right) = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(L_{K,D_j(x)} + \lambda I\right)^{-1}\Delta_j.$$
We can decompose $\Delta_D$ as
$$\Delta_D = \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda] = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(\frac{1}{|D_j|}\sum_{z\in D_j}\xi_\lambda(z) - E[\xi_\lambda]\right) = \sum_{j=1}^m \frac{|D_j|}{|D|}\Delta_j$$
in the expression (23) for $f_{D,\lambda} - f_\lambda$ and find
$$f_{D,\lambda} - f_\lambda = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(L_{K,D(x)} + \lambda I\right)^{-1}\Delta_j.$$
Then the first desired expression for $\overline{f}_{D,\lambda} - f_{D,\lambda}$ follows. By adding and subtracting the operator $(L_K+\lambda I)^{-1}$, writing $\Delta_j = \widetilde\Delta_j + \widehat\Delta_j$, and noting $E[\xi_0] = 0$, we know that the first expression implies (38). This proves Lemma 22.
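The distributed learning algorithm (2) analyzed here fits scheme (1) on each subset $D_j$ and averages the outputs with weights $|D_j|/|D|$. The following toy sketch is ours (function names, the Gaussian kernel, and the data are illustrative assumptions, not the paper's code); it uses the standard closed form $\alpha = (\mathbb{K} + \lambda|D| I)^{-1}y$ for the coefficients of the least squares regularization scheme in an RKHS.

```python
import numpy as np

BANDWIDTH = 0.2  # Gaussian kernel width (illustrative choice)

def gauss_kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * BANDWIDTH ** 2))

def krr_fit(x, y, lam):
    """Scheme (1): f_{D,lambda}(t) = sum_i alpha_i K(t, x_i) with
    alpha = (K + lam * |D| * I)^{-1} y."""
    n = len(x)
    alpha = np.linalg.solve(gauss_kernel(x, x) + lam * n * np.eye(n), y)
    return x, alpha

def krr_predict(model, t):
    x, alpha = model
    return gauss_kernel(t, x) @ alpha

def distributed_krr(x, y, m, lam):
    """Divide-and-conquer: partition D into m disjoint subsets, fit scheme (1)
    on each, and average the outputs with weights |D_j| / |D|."""
    parts = np.array_split(np.arange(len(x)), m)
    models = [krr_fit(x[p], y[p], lam) for p in parts]
    weights = [len(p) / len(x) for p in parts]
    return lambda t: sum(w * krr_predict(mod, t) for w, mod in zip(weights, models))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 400)            # unsorted, so each subset covers the domain
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(400)
t = np.linspace(-1.0, 1.0, 100)

f_bar = distributed_krr(x, y, m=8, lam=1e-3)      # distributed estimator
f_whole = distributed_krr(x, y, m=1, lam=1e-3)    # m = 1: single-machine f_{D,lambda}
gap = float(np.max(np.abs(f_bar(t) - f_whole(t))))
mse = float(np.mean((f_bar(t) - np.sin(np.pi * t)) ** 2))
```

On this toy problem the averaged estimator stays close both to the single-machine estimator and to the regression function, which is the qualitative content of the error bounds proved in this section.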
Before proving our first main result on the error $\overline{f}_{D,\lambda} - f_{D,\lambda}$ in the $H_K$ metric and the $L^2_{\rho_X}$ metric, we state the following more general result which allows different sizes for the data subsets $\{D_j\}$.

Theorem 23 Assume that for some constant $M > 0$, $|y| \le M$ almost surely. Then we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le C'_\kappa\sum_{j=1}^m\frac{|D_j|}{|D|}\left\{M\sqrt\lambda\left(\mathcal{S}^2_{|D_j|,\lambda}+\mathcal{S}^3_{|D_j|,\lambda}\right)+\left(\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}}+\mathcal{S}^2_{|D_j|,\lambda}\right)\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}\right\}$$
$$+\,C'_\kappa\left(\mathcal{S}_{|D|,\lambda}+\mathcal{S}^2_{|D|,\lambda}\right)\left\{M\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}}+\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D|\lambda}}\right\},$$
and the same bound with an additional factor $\frac{1}{\sqrt\lambda}$ holds for $E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_K\right]$, where $C'_\kappa$ is a constant depending only on $\kappa$, and $\mathcal{S}_{|D|,\lambda}$ is the quantity given by
$$\mathcal{S}_{|D|,\lambda} = \left\{\frac{1}{|D|^2\lambda^2}+\frac{\mathcal{N}(\lambda)}{|D|\lambda}\right\}^{1/2}.$$

Proof Recall the expression (38) for $\overline{f}_{D,\lambda} - f_{D,\lambda}$ in Lemma 22. It enables us to express
$$L_K^{1/2}\left(\overline{f}_{D,\lambda} - f_{D,\lambda}\right) = J_1 + J_2 - J_3, \quad (39)$$
where the terms $J_1, J_2, J_3$ are given by
$$J_1 = \sum_{j=1}^m\frac{|D_j|}{|D|}\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j, \qquad J_2 = \sum_{j=1}^m\frac{|D_j|}{|D|}\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j, \qquad J_3 = \left[L_K^{1/2}Q_{D(x)}\right]\Delta_D.$$
These three terms will be dealt with separately in the following.

For the first term $J_1$ of (39), each summand with $j \in \{1,\ldots,m\}$ can be expressed as $\sum_{z=(x,y)\in D_j}\frac{1}{|D|}(y - f_\rho(x))\left[L_K^{1/2}Q_{D_j(x)}\right](K_x)$, and is unbiased because $\int_Y (y - f_\rho(x))\,d\rho(y|x) = 0$. The unbiasedness and the independence tell us that
$$E\left[\|J_1\|_K\right] \le \left\{E\left[\|J_1\|_K^2\right]\right\}^{1/2} = \left\{\sum_{j=1}^m\frac{|D_j|^2}{|D|^2}E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2\right]\right\}^{1/2}. \quad (40)$$
Let $j \in \{1,\ldots,m\}$. The estimate (27) derived from the second order decomposition (21) yields
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 \le \left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi_{D_j}^2}{\lambda}\right)^2\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_j\right\|_K^2. \quad (41)$$
Now we apply the formula
$$E[\xi] = \int_0^\infty \mathrm{Prob}\,[\xi > t]\,dt \quad (42)$$
for nonnegative random variables to estimate the expected value of (41). By Part (b) of Lemma 17, for $0 < \delta < 1$, there exists a subset $Z^{|D_j|}_{\delta,1}$ of $Z^{|D_j|}$ of measure at least $1-\delta$ such that
$$\Xi_{D_j} \le \mathcal{B}_{|D_j|,\lambda}\log(2/\delta), \qquad D_j \in Z^{|D_j|}_{\delta,1}. \quad (43)$$
Applying Part (c) of Lemma 18 to $g(z) = y - f_\rho(x)$ with $\|g\|_\infty \le 2M$ and the data subset $D_j$, we know that there exists another subset $Z^{|D_j|}_{\delta,2}$ of $Z^{|D_j|}$ of measure at least $1-\delta$ such that
$$\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_j\right\|_K \le \frac{2M}{\kappa}\,\mathcal{B}_{|D_j|,\lambda}\log(2/\delta), \qquad D_j \in Z^{|D_j|}_{\delta,2}.$$
Combining this with (43) and (41), we know that for $D_j \in Z^{|D_j|}_{\delta,1} \cap Z^{|D_j|}_{\delta,2}$,
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 \le \frac{M^2}{\kappa^2}\left(\frac{\mathcal{B}^2_{|D_j|,\lambda}}{\lambda} + \frac{\mathcal{B}^4_{|D_j|,\lambda}}{\lambda^2}\right)\mathcal{B}^2_{|D_j|,\lambda}\left(2\log(2/\delta)\right)^6.$$
Since the measure of the set $Z^{|D_j|}_{\delta,1}\cap Z^{|D_j|}_{\delta,2}$ is at least $1-2\delta$, by denoting
$$C_{|D_j|,\lambda} = \frac{64M^2}{\kappa^2}\left(\frac{\mathcal{B}^2_{|D_j|,\lambda}}{\lambda} + \frac{\mathcal{B}^4_{|D_j|,\lambda}}{\lambda^2}\right)\mathcal{B}^2_{|D_j|,\lambda},$$
we see that
$$\mathrm{Prob}\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 > C_{|D_j|,\lambda}\left(\log(2/\delta)\right)^6\right] \le 2\delta.$$
For $0 < t < \infty$, the equation $C_{|D_j|,\lambda}\left(\log(2/\delta)\right)^6 = t$ has the solution $\delta_t = 2\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}$. When $\delta_t < 1$, we have
$$\mathrm{Prob}\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 > t\right] \le 2\delta_t = 4\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}.$$
This inequality holds trivially when $\delta_t \ge 1$ since the probability is at most 1. Thus we can apply the formula (42) to the nonnegative random variable $\xi = \left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2$ and obtain
$$E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2\right] = \int_0^\infty\mathrm{Prob}\,[\xi > t]\,dt \le \int_0^\infty 4\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}dt,$$
which equals $24\,\Gamma(6)\,C_{|D_j|,\lambda}$. Therefore,
$$E\left[\|J_1\|_K\right] \le \left\{\sum_{j=1}^m\frac{|D_j|^2}{|D|^2}\,24\,\Gamma(6)\,C_{|D_j|,\lambda}\right\}^{1/2} \le C''_\kappa M\sqrt\lambda\sum_{j=1}^m\frac{|D_j|}{|D|}\left(\mathcal{S}^2_{|D_j|,\lambda}+\mathcal{S}^3_{|D_j|,\lambda}\right),$$
where $C''_\kappa$ depends only on $\kappa$ and we have used the bound $\mathcal{B}^2_{|D_j|,\lambda} \le 8\kappa^2\left(1+\kappa^2\right)\lambda\,\mathcal{S}^2_{|D_j|,\lambda}$.

For the second term $J_2$ of (39), we apply (27) again and obtain
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j\right\|_K \le \left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi^2_{D_j}}{\lambda}\right)\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_j\right\|_K.$$
Applying the Schwarz inequality and Lemmas 17 and 18, we get
$$E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j\right\|_K\right] \le \left\{E\left[\left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi^2_{D_j}}{\lambda}\right)^2\right]\right\}^{1/2}\left\{\frac{1}{|D_j|}E\left[(f_\rho(x)-f_\lambda(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right]\right\}^{1/2}$$
$$\le \left(\sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{|D_j|\lambda}} + \frac{\sqrt{49\,\mathcal{B}^4_{|D_j|,\lambda}}}{\lambda}\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}} \le \left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D_j|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D_j|\lambda}\right)\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}.$$
It follows that
$$E\left[\|J_2\|_K\right] \le \sum_{j=1}^m\frac{|D_j|}{|D|}\left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D_j|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D_j|\lambda}\right)\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}.$$
The last term $J_3$ of (39) has been handled by (27) and in the proof of Proposition 20 by ignoring the summand 1 in the bound (31), and we find from the trivial bound $\sigma^2_\rho \le 4M^2$ with $p = \infty$ that
$$E\left[\|J_3\|_K\right] \le \left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\right)\left(2M\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}} + \frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D|\lambda}}\right).$$
Combining the above estimates for the three terms of (39), we see that the desired error bound in the $L^2_{\rho_X}$ metric holds true. The estimate in the $H_K$ metric follows from the same steps in deriving the error bound in the $L^2_{\rho_X}$ metric except that in the representation (39) the operator $L_K^{1/2}$ in front disappears. This change gives an additional factor $\frac{1}{\sqrt\lambda}$, the bound for the operator norm $\left\|(L_K+\lambda I)^{-1/2}\right\|$, and proves the desired error bound in the $H_K$ metric.
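The integration step above, from the tail bound $\mathrm{Prob}[\xi > t] \le 4\exp\{-(t/C_{|D_j|,\lambda})^{1/6}\}$ to the expectation bound $24\,\Gamma(6)\,C_{|D_j|,\lambda}$ via formula (42), can be checked numerically; the value of $C$ below is an arbitrary placeholder standing in for $C_{|D_j|,\lambda}$.

```python
import math
import numpy as np

C = 2.0  # placeholder for C_{|D_j|,lambda}; any positive value works

# Formula (42): E[xi] <= \int_0^infty 4 exp{-(t/C)^{1/6}} dt.  Substituting
# t = C u^6 (so dt = 6 C u^5 du) turns the right-hand side into
# 24 C \int_0^infty u^5 e^{-u} du = 24 Gamma(6) C.
n = 2_000_000
du = 80.0 / n                              # integrate u over [0, 80]; the tail beyond is negligible
u = (np.arange(n) + 0.5) * du              # midpoint rule
numeric = float(np.sum(24.0 * C * u ** 5 * np.exp(-u)) * du)
closed_form = 24.0 * math.gamma(6) * C     # Gamma(6) = 5! = 120
```

The factor 24 is the product of the constant 4 from the tail probability and the Jacobian factor 6 from the substitution; the quadrature and the closed form agree to high accuracy.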
Remark 24 A crucial step in the above error analysis for the distributed learning algorithm is to use the unbiasedness and independence to get (40), where the norm is squared inside the expected value. Thus we can obtain the optimal learning rates in expectation for the distributed learning algorithm (2), but we have difficulty in getting rates in probability. It would be interesting to derive error bounds in probability by combining our second order decomposition technique with some analysis in the literature (Caponnetto and De Vito, 2007; Blanchard and Krämer, 2010; Steinwart and Christmann, 2008; Wu and Zhou, 2008).

Proof of Theorem 1 Since $|D_j| = \frac Nm$ for $j = 1,\ldots,m$, we have $\mathcal{S}^2_{|D_j|,\lambda} = m\,\mathcal{C}_m$ with $\mathcal{C}_m := \frac{m}{(N\lambda)^2} + \frac{\mathcal{N}(\lambda)}{N\lambda}$, and the bound in Theorem 23 in the $L^2_{\rho_X}$ metric can be simplified as
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le C'_\kappa M\sqrt\lambda\, m\,\mathcal{C}_m\left(1+\sqrt{m\,\mathcal{C}_m}\right) + C'_\kappa\left\|f_\rho-f_\lambda\right\|_\rho\sqrt{\frac{m}{N\lambda}}\left(1+m\,\mathcal{C}_m\right) + C'_\kappa\left(\sqrt{\mathcal{C}_m}+\mathcal{C}_m\right)\left(M\sqrt\lambda\sqrt{\mathcal{C}_m}+\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{N\lambda}}\right)$$
$$\le C'_\kappa M\left(\sqrt m+1\right)\sqrt\lambda\,\mathcal{C}_m\left\{\sqrt m + m\sqrt{\mathcal{C}_m}\right\} + 2C'_\kappa\left\|f_\rho-f_\lambda\right\|_\rho\sqrt{\frac{m}{N\lambda}}\left(1+m\,\mathcal{C}_m\right).$$
Then the desired error bound in the $L^2_{\rho_X}$ metric with $C_\kappa = 2C'_\kappa$ follows. The proof for the error bound in the $H_K$ metric is similar. The proof of Theorem 1 is complete.

Proof of Corollary 3 As in the proof of Corollary 8, the restriction (8) implies $\mathcal{N}(\lambda) \ge \frac{\lambda_1}{\lambda_1+C_0}$ and $N\lambda \ge \frac{m\lambda_1}{(\lambda_1+C_0)C_0}$. It follows that
$$\frac{m}{(N\lambda)^2} \le \frac{(\lambda_1+C_0)C_0}{\lambda_1}\cdot\frac{1}{N\lambda} \le \frac{(\lambda_1+C_0)^2C_0}{\lambda_1^2}\cdot\frac{\mathcal{N}(\lambda)}{N\lambda},$$
and hence, with $\widetilde C'_\kappa := \frac{(\lambda_1+C_0)^2C_0}{\lambda_1^2}+1$, we have $\mathcal{C}_m \le \widetilde C'_\kappa\,\frac{\mathcal{N}(\lambda)}{N\lambda}$. Putting these bounds into Theorem 1 and applying the restriction $\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le C_0$ from (8), we obtain the desired error bounds in the $L^2_{\rho_X}$ metric and in the $H_K$ metric by taking the constant $\widetilde C_\kappa = 2C_\kappa\left(\widetilde C'_\kappa\right)^3(1+C_0)^2$. This proves Corollary 3.

Proof of Corollary 4 If $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac1{2\alpha}}$ for some constant $C_0 \ge 1$, then the choice $\lambda = \left(\frac mN\right)^{\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ satisfies (8). With this choice we also have
$$\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le C_0\left(\frac mN\right)^{\frac{2\alpha(\max\{2r,1\}-1)}{2\alpha\max\{2r,1\}+1}}.$$
Since regularity condition (9) yields $\|f_\lambda - f_\rho\|_\rho \le \|g_\rho\|_\rho\,\lambda^r$ by Lemma 21, we have by Corollary 3,
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le \widetilde C_\kappa\sqrt{C_0}\left\{\|g_\rho\|_\rho\,\sqrt m\left(\frac mN\right)^{\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)+\frac{\alpha}{2\alpha\max\{2r,1\}+1}} + M\left(\frac mN\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}}\right\}.$$
The inequality required for the first term not to exceed the second is equivalent to
$$m^{1+\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)} \le N^{\frac12+\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)},$$
and it can be expressed as (10). Since (10) is valid, we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le 2\widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)\left(\frac mN\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}}.$$
This proves the first desired convergence rate. The second rate follows easily. This proves Corollary 4.

Proof of Corollary 11 By Corollary 9, with the choice $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$, we can immediately bound $\|f_{D,\lambda}-f_\rho\|_\rho$ as $E\left[\|f_{D,\lambda}-f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right)$. The assumption $\mathcal{N}(\lambda) = O\left(\lambda^{-\frac1{2\alpha}}\right)$ tells us that for some constant $C_0 \ge 1$, $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac1{2\alpha}}$ for all $\lambda > 0$. So the choice $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$ yields
$$\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le \frac{C_0\,m\,\lambda^{-\frac{1+2\alpha}{2\alpha}}}{N} = C_0\,m\,N^{\frac{1+2\alpha}{4\alpha r+1}-1} = C_0\,m\,N^{-\frac{2\alpha(2r-1)}{4\alpha r+1}}. \quad (44)$$
If $m$ satisfies (13), then (8) is valid, and by Corollary 3,
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le \widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)N^{-\frac{2\alpha r}{4\alpha r+1}}\left(m\,N^{-\frac{2\alpha(2r-1)+\frac12}{4\alpha r+1}} + 1\right).$$
Since (13) is satisfied, we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le 2\widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)N^{-\frac{2\alpha r}{4\alpha r+1}},$$
and thereby $E\left[\left\|\overline{f}_{D,\lambda}-f_\rho\right\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right)$. This proves Corollary 11.

Acknowledgments

We thank the anonymous referees for their constructive suggestions. The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 11304114]. The corresponding author is Ding-Xuan Zhou.

This appendix provides detailed proofs of the norm estimates stated in Lemmas 17 and 18 involving the approximation of $L_K$ by $L_{K,D(x)}$. To this end, we need the following probability inequality for vector-valued random variables established in (Pinelis, 1994).
Lemma 25 For a random variable $\xi$ on $(Z,\rho)$ with values in a separable Hilbert space $(H, \|\cdot\|)$ satisfying $\|\xi\| \le \widetilde M < \infty$ almost surely, and a random sample $\{z_i\}_{i=1}^s$ independently drawn according to $\rho$, there holds with confidence $1-\widetilde\delta$,
$$\left\|\frac1s\sum_{i=1}^s\left[\xi(z_i) - E(\xi)\right]\right\| \le \frac{2\widetilde M\log(2/\widetilde\delta)}{s} + \sqrt{\frac{2E\left[\|\xi\|^2\right]\log(2/\widetilde\delta)}{s}}.$$

Proof of Lemma 17 We apply Lemma 25 to the random variable $\eta_1$ defined by
$$\eta_1(x) = (L_K+\lambda I)^{-1/2}\left(\langle\cdot, K_x\rangle_K K_x\right), \qquad x \in X. \quad (46)$$
It takes values in $HS(H_K)$, the Hilbert space of Hilbert-Schmidt operators on $H_K$, with inner product $\langle A,B\rangle_{HS} = \mathrm{Tr}(B^T A)$. Here $\mathrm{Tr}$ denotes the trace of a (trace-class) linear operator. The norm is given by $\|A\|^2_{HS} = \sum_i\|Ae_i\|_K^2$, where $\{e_i\}$ is an orthonormal basis of $H_K$. The space $HS(H_K)$ is a subspace of the space of bounded linear operators on $H_K$, denoted as $(\mathcal{L}(H_K), \|\cdot\|)$, with the norm relations
$$\|A\| \le \|A\|_{HS}, \qquad \|AB\|_{HS} \le \|A\|_{HS}\|B\|. \quad (47)$$
Now we use effective dimensions to estimate norms involving $\eta_1$. The random variable $\eta_1$ defined by (46) has mean $E(\eta_1) = (L_K+\lambda I)^{-1/2}L_K$ and sample mean $(L_K+\lambda I)^{-1/2}L_{K,D(x)}$. Recall the set of normalized (in $H_K$) eigenfunctions $\{\varphi_i\}_i$ of $L_K$. It is an orthonormal basis of $H_K$. If we regard $L_K$ as an operator on $L^2_{\rho_X}$, the normalized eigenfunctions in $L^2_{\rho_X}$ are $\left\{\frac{1}{\sqrt{\lambda_i}}\varphi_i\right\}_i$ and they form an orthonormal basis of the orthogonal complement of the eigenspace associated with the eigenvalue 0. By the Mercer Theorem, we have the following uniformly convergent Mercer expansion
$$K(x,y) = \sum_i\lambda_i\,\frac{1}{\sqrt{\lambda_i}}\varphi_i(x)\,\frac{1}{\sqrt{\lambda_i}}\varphi_i(y) = \sum_i\varphi_i(x)\varphi_i(y). \quad (48)$$
Take the orthonormal basis $\{\varphi_i\}_i$ of $H_K$. By the definition of the $HS$ norm, we have
$$\|\eta_1(x)\|^2_{HS} = \sum_i\left\|(L_K+\lambda I)^{-1/2}\langle\varphi_i, K_x\rangle_K K_x\right\|_K^2.$$
For a fixed $i$, $\langle\varphi_i, K_x\rangle_K K_x = \varphi_i(x)K_x$, and $K_x \in H_K$ can be expanded in the orthonormal basis $\{\varphi_\ell\}_\ell$ as
$$K_x = \sum_\ell\langle\varphi_\ell, K_x\rangle_K\varphi_\ell = \sum_\ell\varphi_\ell(x)\varphi_\ell. \quad (49)$$
Hence
$$\|\eta_1(x)\|^2_{HS} = \sum_i(\varphi_i(x))^2\left\|(L_K+\lambda I)^{-1/2}K_x\right\|_K^2 = \sum_i(\varphi_i(x))^2\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}.$$
Combining this with (48), we see that
$$\|\eta_1(x)\|^2_{HS} = K(x,x)\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}, \qquad x \in X, \quad (50)$$
and therefore
$$E\left[\|\eta_1\|^2_{HS}\right] \le \kappa^2 E\left[\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}\right] = \kappa^2\sum_\ell\frac{1}{\lambda_\ell+\lambda}\int_X(\varphi_\ell(x))^2\,d\rho_X, \qquad \int_X(\varphi_\ell(x))^2\,d\rho_X = \|\varphi_\ell\|^2_{L^2_{\rho_X}} = \left\|L_K^{1/2}\varphi_\ell\right\|_K^2 = \lambda_\ell. \quad (51)$$
Hence
$$E\left[\|\eta_1\|^2_{HS}\right] \le \kappa^2\sum_\ell\frac{\lambda_\ell}{\lambda_\ell+\lambda} = \kappa^2\,\mathrm{Tr}\left((L_K+\lambda I)^{-1}L_K\right) = \kappa^2\mathcal{N}(\lambda). \quad (52)$$
Since $\frac{1}{|D|}\sum_{x\in D(x)}\eta_1(x) - E[\eta_1] = (L_K+\lambda I)^{-1/2}\left(L_{K,D(x)} - L_K\right)$, the independence yields
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^2_{HS}\right] \le \frac{1}{|D|}E\left[\|\eta_1\|^2_{HS}\right] \le \frac{\kappa^2\mathcal{N}(\lambda)}{|D|}.$$
Then our desired inequality in Part (a) follows from the first inequality of (47).

From (49) and (50), we find a bound for $\eta_1$ as
$$\|\eta_1(x)\|_{HS} \le \kappa\left(\frac1\lambda\sum_\ell(\varphi_\ell(x))^2\right)^{1/2} \le \frac{\kappa^2}{\sqrt\lambda}.$$
Applying Lemma 25 to the random variable $\eta_1$ with $\widetilde M = \frac{\kappa^2}{\sqrt\lambda}$, we know by (47) that with confidence at least $1-\delta$,
$$\left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\| \le \left\|\frac{1}{|D|}\sum_{x\in D(x)}\eta_1(x) - E[\eta_1]\right\|_{HS} \le \frac{2\kappa^2\log(2/\delta)}{|D|\sqrt\lambda} + \sqrt{\frac{2\kappa^2\mathcal{N}(\lambda)\log(2/\delta)}{|D|}}.$$
Writing the above bound by taking out a factor $\frac{2\kappa\log(2/\delta)}{\sqrt{|D|}}$, we get the desired bound (29).

Recall $\mathcal{B}_{|D|,\lambda}$ defined by (30). To prove Part (c), apply the formula (42) for nonnegative random variables to $\xi = \left\|(L_K+\lambda I)^{-1/2}\left(L_K-L_{K,D(x)}\right)\right\|^d$ and use the bound
$$\mathrm{Prob}\,[\xi > t] = \mathrm{Prob}\left[\xi^{\frac1d} > t^{\frac1d}\right] \le 2\exp\left\{-\frac{t^{1/d}}{\mathcal{B}_{|D|,\lambda}}\right\}$$
derived from (29) for $t \ge \left(\log 2\right)^d\mathcal{B}^d_{|D|,\lambda}$. We find
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\left(L_K-L_{K,D(x)}\right)\right\|^d\right] \le \left(\log 2\right)^d\mathcal{B}^d_{|D|,\lambda} + \int_{(\log 2)^d\mathcal{B}^d_{|D|,\lambda}}^\infty 2\exp\left\{-\frac{t^{1/d}}{\mathcal{B}_{|D|,\lambda}}\right\}dt.$$
The second term on the right-hand side of the above inequality is at most $2d\,\mathcal{B}^d_{|D|,\lambda}\int_0^\infty u^{d-1}\exp\{-u\}\,du$. Then the desired bound in Part (c) follows from $\int_0^\infty u^{d-1}\exp\{-u\}\,du = \Gamma(d)$, and the lemma is proved.

Proof of Lemma 18 Consider the random variable $\eta_2$ defined by
$$\eta_2(z) = (L_K+\lambda I)^{-1/2}(K_x), \qquad z = (x,y) \in Z. \quad (53)$$
It takes values in $H_K$. By (49), it satisfies
$$\|\eta_2(z)\|_K^2 = \sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}, \qquad E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \sum_\ell\frac{\lambda_\ell}{\lambda_\ell+\lambda} = \mathcal{N}(\lambda).$$
This is the statement of Part (a). We also see from (49) that for almost every $x \in X$,
$$\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K \le \frac{1}{\sqrt\lambda}\left(\sum_\ell(\varphi_\ell(x))^2\right)^{1/2} \le \frac{\kappa}{\sqrt\lambda}.$$
This verifies the statement of Part (b). For Part (c), we consider another random variable $\eta_3$ defined by
$$\eta_3(z) = (L_K+\lambda I)^{-1/2}(g(z)K_x), \qquad z = (x,y) \in Z. \quad (54)$$
It takes values in $H_K$ and satisfies
$$\|\eta_3(z)\|_K = |g(z)|\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K \le \frac{\kappa\|g\|_\infty}{\sqrt\lambda}, \qquad E\left[\|\eta_3\|_K^2\right] \le \|g\|^2_\infty\, E\left[\|\eta_2\|_K^2\right] = \|g\|^2_\infty\,\mathcal{N}(\lambda).$$
Applying Lemma 25 verifies the statement in Part (c). The proof of the lemma is complete.

References

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Proceedings of the 26th Annual Conference on Learning Theory, 185-209, 2013.

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory.
Journal of Complexity, 23:52-72, 2007.

G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. Advances in Neural Information Processing Systems, 226-234, 2010.

G. Blanchard and N. Krämer. Convergence rates for kernel conjugate gradient for random design regression. Analysis and Applications, 14:763-794, 2016.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.

A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8:161-183, 2010.

D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143-1175, 2004.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and X. Lin. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10:455-479, 2010.

E. De Vito, V. Umanità, and S. Villa. An extension of Mercer theorem to vector-valued measurable kernels. Applied and Computational Harmonic Analysis, 34:339-351, 2013.

D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

J. Fan, T. Hu, Q. Wu, and D. X. Zhou.
Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, 41:164-189, 2016.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2002.

A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17:1-65, 2016.

X. Guo and D. X. Zhou. An empirical feature-based learning algorithm producing sparse approximations. Applied and Computational Harmonic Analysis, 32:389-400, 2012.

Z. C. Guo, D. H. Xiang, X. Guo, and D. X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15:433-455, 2017.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, Berlin, 2002.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13:437-455, 2015.

J. H. Lin, L. Rosasco, and D. X. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.

J. H. Lin and D. X. Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341-3365, 2015.

S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526-565, 2010.

M. Meister and I. Steinwart. Optimal learning rates for localized SVMs. Journal of Machine Learning Research, 17:1-44, 2016.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22:1679-1706, 1994.

G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15:335-366, 2014.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization.
Advances in Neural Information Processing Systems, 1657-1665, 2015.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In the 52nd Annual Allerton Conference on Communication, Control and Computing, 2014.

W. J. Shen, H. S. Wong, Q. W. Xiao, X. Guo, and S. Smale. Introduction to the peptide binding problem of computational immunology: new results. Foundations of Computational Mathematics, 14:951-984, 2014.

L. Shi, Y. L. Feng, and D. X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31:286-302, 2011.

S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153-172, 2007.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.), pp. 79-93, 2009.

I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363-417, 2012.

P. Thomann, I. Steinwart, I. Blaschzyk, and M. Meister. Spatial decompositions for large scale SVMs. arXiv:1612.00374v1, 2016.

R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47:2516-2532, 2001.

Q. Wu, Y. Ying, and D. X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6:171-192, 2006.

Q. Wu and D. X. Zhou. Learning with sample dependent hypothesis spaces.
Computers and Mathematics with Applications, 56:2896-2907, 2008.

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289-315, 2007.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077-2098, 2005.

Y. C. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321-3363, 2013.

Y. C. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299-3340, 2015.

D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739-767, 2002.

D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743-1752, 2003.

Z. H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams. Big data opportunities and challenges: Discussions from data analytics perspectives. IEEE Computational Intelligence Magazine, 9:62-74, 2014.