Journal of Machine Learning Research 18 (2017) 1-31. Submitted 11/15; Revised 6/17; Published 9/17

Distributed Learning with Regularized Least Squares

Shao-Bo Lin (sblin1983@gmail.com), Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Xin Guo (x.guo@polyu.edu.hk), Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Ding-Xuan Zhou (mazhou@cityu.edu.hk), Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong

Editor: Ingo Steinwart

© 2017 Lin, Guo and Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-586.html.

Abstract

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds and learning rates in expectation, in both the $L^2$-metric and the RKHS-metric, that the global output function of this distributed learning scheme is a good approximation to the algorithm processing the whole data in one single machine. Our derived learning rates in expectation are optimal and are stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in expectation in the literature.

Keywords: distributed learning, divide-and-conquer, error analysis, integral operator, second order decomposition

1. Introduction and Distributed Learning Algorithms

In the era of big data, the rapid expansion of computing capacities in automatic data generation and acquisition brings data of unprecedented size and complexity, and raises a series of scientific challenges such as storage bottlenecks and algorithmic scalability (Zhou et al., 2014). To overcome these difficulties, several approaches for generating scalable approximate algorithms have been introduced in the literature, such as low-rank approximations of kernel matrices for kernel principal component analysis (Schölkopf et al., 1998; Bach, 2013), incomplete Cholesky decomposition (Fine and Scheinberg, 2002), early stopping of iterative optimization algorithms for gradient descent methods (Yao et al., 2007; Raskutti et al., 2014; Lin et al., 2016), and greedy-type algorithms. Another method, proposed recently, is distributed learning based on a divide-and-conquer approach and a particular learning algorithm implemented in individual machines (Zhang et al., 2015; Shamir and Srebro, 2014). This method produces distributed learning algorithms consisting of three steps: partitioning the data into disjoint subsets, applying a particular learning algorithm implemented on an individual machine to each data subset to produce an individual output (function), and synthesizing a global output by taking some average of the individual outputs. This method can significantly reduce computing time and lower the single-machine memory requirements. For practical applications in medicine, finance, business, and some other areas, the data are often stored naturally across multiple servers in a distributive way and are not combined, both to protect privacy and to avoid high costs. In such situations the first step, data partitioning, is not needed.
It has been observed in many practical applications that, when the number of data subsets is not too large, the divide-and-conquer approach causes no noticeable loss of performance compared with the learning scheme that processes the whole data set on a single big machine. Theoretical attempts have recently been made in (Zhang et al., 2013, 2015) to derive learning rates for distributed learning with least squares regularization schemes under certain assumptions. This paper aims at an error analysis of distributed learning with regularized least squares and its approximation to the algorithm processing the whole data set in one single machine.

Recall (Cristianini and Shawe-Taylor, 2000; Evgeniou et al., 2000) that in a reproducing kernel Hilbert space (RKHS) $(\mathcal H_K, \|\cdot\|_K)$ induced by a Mercer kernel $K$ on an input metric space $X$, with a sample $D = \{(x_i, y_i)\}_{i=1}^N \subset X \times Y$, where $Y = \mathbb R$ is the output space, the least squares regularization scheme can be stated as

$f_{D,\lambda} = \arg\min_{f \in \mathcal H_K} \left\{ \frac{1}{|D|} \sum_{(x,y) \in D} (f(x) - y)^2 + \lambda \|f\|_K^2 \right\}.$  (1)

Here $\lambda > 0$ is a regularization parameter and $N := |D|$ is the cardinality of $D$. This learning algorithm is also called kernel ridge regression in statistics and has been well studied in learning theory; see, e.g., (De Vito et al., 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007; Smale and Zhou, 2007; Steinwart and Christmann, 2008). The regularization scheme (1) can be solved explicitly by a standard matrix inversion technique, which requires $O(N^3)$ time and $O(N^2)$ memory. This matrix inversion technique, however, may not conquer the challenges on storage or computation arising from big data.

The distributed learning algorithm studied in this paper starts by partitioning the data set $D$ into $m$ disjoint subsets $\{D_j\}_{j=1}^m$. It then assigns each data subset $D_j$ to one machine or processor, which produces a local estimator $f_{D_j,\lambda}$ by the least squares regularization scheme (1).
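To make the single-machine baseline concrete: by the representer theorem, scheme (1) reduces to an $N \times N$ linear system, whose direct solution exhibits the $O(N^3)$ time cost mentioned above. The following is a minimal sketch in Python/NumPy; the Gaussian kernel, its width, and all parameter values are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def gaussian_kernel(s, t, width=0.25):
    # An illustrative Mercer kernel on X = [0, 1].
    return np.exp(-(s - t) ** 2 / (2 * width ** 2))

def krr_fit(x, y, lam, kernel=gaussian_kernel):
    # Representer theorem: f_{D,lambda} = sum_i c_i K(., x_i).
    # Plugging this form into (1) gives the linear system
    # (G + lam * N * I) c = y, where G_{ij} = K(x_i, x_j) and N = |D|.
    G = kernel(x[:, None], x[None, :])
    N = len(x)
    c = np.linalg.solve(G + lam * N * np.eye(N), y)
    # Return the estimator as a function of new inputs t.
    return lambda t: kernel(t[:, None], x[None, :]) @ c
```

Solving the system costs $O(N^3)$ time and storing $G$ costs $O(N^2)$ memory, which is exactly the bottleneck that the divide-and-conquer approach below is designed to avoid.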
Finally, these local estimators are communicated to a central processor, which synthesizes a global estimator $\overline f_{D,\lambda}$ by taking the weighted average

$\overline f_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|} f_{D_j,\lambda}$  (2)

of the local estimators $\{f_{D_j,\lambda}\}_{j=1}^m$. This algorithm has been studied with a matrix analysis approach in (Zhang et al., 2015), where some error analysis was conducted under certain eigenfunction assumptions on the integral operator associated with the kernel, presenting error bounds in expectation.

In this paper we shall use a novel integral operator approach to prove that $\overline f_{D,\lambda}$ is a good approximation of $f_{D,\lambda}$. We present a representation of the difference $\overline f_{D,\lambda} - f_{D,\lambda}$ in terms of empirical integral operators, and analyze the error $\overline f_{D,\lambda} - f_{D,\lambda}$ in expectation without any eigenfunction assumption. As a by-product, we present the best learning rates in expectation for the least squares regularization scheme (1) in a general setting, which surprisingly has not been done for a general kernel in the literature (see the detailed comparisons below).

2. Main Results

Our analysis is carried out in the standard least squares regression framework with a Borel probability measure $\rho$ on $Z := X \times Y$, where the input space $X$ is a compact metric space. The sample $D$ is drawn independently according to $\rho$. The Mercer kernel $K : X \times X \to \mathbb R$ defines an integral operator $L_K$ on $\mathcal H_K$ by

$L_K f = \int_X K_x f(x)\, d\rho_X, \qquad f \in \mathcal H_K,$  (3)

where $K_x$ is the function $K(\cdot, x)$ in $\mathcal H_K$ and $\rho_X$ is the marginal distribution of $\rho$ on $X$.

2.1 Error Bounds for the Distributed Learning Algorithm

Our error bounds in expectation for the distributed learning algorithm (2) require the uniform boundedness condition on the output $y$: for some constant $M > 0$, there holds $|y| \le M$ almost surely.
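The three steps of the distributed algorithm (partition, local solve, weighted average (2)) can be sketched as follows. This is a minimal illustration, assuming a local kernel ridge regression solver as in (1); the kernel, its width, and the partitioning by index blocks are illustrative choices.

```python
import numpy as np

def krr_fit(x, y, lam, width=0.25):
    # Local least squares regularization scheme (1) on one data subset.
    G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))
    c = np.linalg.solve(G + lam * len(x) * np.eye(len(x)), y)
    return lambda t: np.exp(-(t[:, None] - x[None, :]) ** 2 / (2 * width ** 2)) @ c

def distributed_krr(x, y, lam, m, width=0.25):
    # Step 1: partition D into m disjoint subsets D_1, ..., D_m.
    parts = np.array_split(np.arange(len(x)), m)
    # Step 2: solve (1) locally on each subset (in practice, on m machines).
    local = [krr_fit(x[idx], y[idx], lam, width) for idx in parts]
    # Step 3: weighted average (2) with weights |D_j| / |D|.
    weights = [len(idx) / len(x) for idx in parts]
    return lambda t: sum(w * f(t) for w, f in zip(weights, local))
```

Note that with $m = 1$ the synthesized estimator coincides with the single-machine estimator $f_{D,\lambda}$, while for $m > 1$ each machine only handles an $(N/m) \times (N/m)$ system.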
Our bounds are stated in terms of the approximation error

$\|f_\lambda - f_\rho\|_\rho,$  (4)

where $f_\lambda$ is the data-free limit of (1) defined by

$f_\lambda = \arg\min_{f \in \mathcal H_K} \left\{ \int_Z (f(x) - y)^2\, d\rho + \lambda \|f\|_K^2 \right\},$  (5)

$\|\cdot\|_\rho$ denotes the norm of $L^2_{\rho_X}$, the Hilbert space of square integrable functions with respect to $\rho_X$, and $f_\rho$ is the regression function (conditional mean) of $\rho$ defined by

$f_\rho(x) = \int_Y y\, d\rho(y|x), \qquad x \in X,$

with $\rho(\cdot|x)$ being the conditional distribution of $\rho$ induced at $x \in X$. Since $K$ is continuous, symmetric and positive semidefinite, $L_K$ is a compact positive operator of trace class and $L_K + \lambda I$ is invertible. Define a quantity measuring the complexity of $\mathcal H_K$ with respect to $\rho_X$, the effective dimension (Zhang, 2005), to be the trace of the operator $(L_K + \lambda I)^{-1} L_K$:

$\mathcal N(\lambda) = \mathrm{Tr}\left((L_K + \lambda I)^{-1} L_K\right), \qquad \lambda > 0.$  (6)

In Section 6 we shall prove the following first main result of this paper, concerning error bounds in expectation for $\overline f_{D,\lambda} - f_{D,\lambda}$ in $\mathcal H_K$ and in $L^2_{\rho_X}$. Denote $\kappa = \sup_{x \in X}\sqrt{K(x,x)}$.

Theorem 1 Assume $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$, then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le C_\kappa \sqrt{\lambda}\, m \overline C_m \left\{1 + \sqrt{m \overline C_m}\right\} \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le C_\kappa\, m \overline C_m \left\{1 + \sqrt{m \overline C_m}\right\} \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

where $C_\kappa$ is a constant depending only on $\kappa$, and $\overline C_m$ is the quantity given in terms of $m$, $N$, $\lambda$, $\mathcal N(\lambda)$ by

$\overline C_m := \frac{m}{(N\lambda)^2} + \frac{\mathcal N(\lambda)}{N\lambda}.$

To derive explicit learning rates from the general error bounds in Theorem 1, one can quantify the growth of $\mathcal N(\lambda)$ by the following capacity assumption, a characteristic of the complexity of the hypothesis space (Caponnetto and De Vito, 2007; Blanchard and Krämer, 2010): for some constants $\beta > 0$ and $c > 0$,

$\mathcal N(\lambda) \le c \lambda^{-\beta}, \qquad \lambda > 0.$  (7)

Let $\{(\lambda_\ell, \phi_\ell)\}_\ell$ be a set of normalized eigenpairs of $L_K$ on $\mathcal H_K$, with $\{\phi_\ell\}_\ell$ an orthonormal basis of $\mathcal H_K$ and $\{\lambda_\ell\}_\ell$ arranged in nonincreasing order. A sufficient condition for the capacity assumption (7) with $0 < \beta < 1$ is $\lambda_\ell = O(\ell^{-1/\beta})$, which can be found in, e.g., Caponnetto and De Vito (2007).
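Given the eigenvalues of $L_K$, the effective dimension (6) is the simple sum $\mathcal N(\lambda) = \sum_\ell \lambda_\ell/(\lambda_\ell + \lambda)$. The sketch below (a hypothetical helper, assuming the operator eigenvalues are known or have been estimated) computes it and illustrates two of its basic properties: it decreases in $\lambda$ and is bounded by $\mathrm{Tr}(L_K)/\lambda$.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    # N(lambda) = Tr((L_K + lam I)^{-1} L_K) = sum_l lam_l / (lam_l + lam),
    # computed here from a (finite, assumed known) list of eigenvalues of L_K.
    ev = np.asarray(eigvals, dtype=float)
    return float(np.sum(ev / (ev + lam)))

# Illustrative polynomially decaying spectrum.
ev = 1.0 / np.arange(1, 1001) ** 2
n_small = effective_dimension(ev, 0.01)   # larger for smaller lambda
n_large = effective_dimension(ev, 0.1)
```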
Remark 2 The sufficient condition $\lambda_\ell = O(\ell^{-1/\beta})$ with index $\beta = \frac{d}{2\tau} < 1$ is satisfied by the Sobolev space $W^\tau(B(\mathbb R^d))$ with smoothness index $\tau > d/2$ on a ball $B(\mathbb R^d)$ of the Euclidean space $\mathbb R^d$, when the marginal distribution $\rho_X$ is the uniform distribution on $B(\mathbb R^d)$; see (Steinwart et al., 2009; Edmunds and Triebel, 1996).

Condition (7) with $\beta = 1$ always holds true with the choice of the constant $c = \kappa^2$. In fact, the eigenvalues of the operator $(L_K + \lambda I)^{-1} L_K$ are $\{\frac{\lambda_\ell}{\lambda_\ell + \lambda}\}_\ell$, so its trace is given by

$\mathcal N(\lambda) = \sum_\ell \frac{\lambda_\ell}{\lambda_\ell + \lambda} \le \frac1\lambda \sum_\ell \lambda_\ell = \frac{\mathrm{Tr}(L_K)}{\lambda} \le \frac{\kappa^2}{\lambda}.$

In the existing literature on learning rates for the classical least squares algorithm (1), the regularization parameter $\lambda$ is often taken to satisfy the restriction $\frac{\mathcal N(\lambda)}{N\lambda} = O(1)$, as in (Caponnetto and De Vito, 2007) up to a logarithmic factor, or in (Steinwart et al., 2009) under some assumptions on $(K, \rho_X)$ (see (14) below). Here, to derive learning rates for $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho]$ with $m \ge 1$ corresponding to the distributed learning algorithm (2), we consider $\lambda$ satisfying the following restriction with some constant $C_0 > 0$:

$0 < \lambda \le C_0 \quad \text{and} \quad m \mathcal N(\lambda) \le C_0 N \lambda.$  (8)

Corollary 3 Assume $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$ and $\lambda$ satisfies (8), then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_\kappa \sqrt{\lambda}\left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_\kappa \left\{M + m\|f_\rho - f_\lambda\|_\rho\right\},$

where $\widetilde C_\kappa$ is a constant depending only on $\kappa$, $C_0$, and the largest eigenvalue of $L_K$.

In the special case $f_\rho \in \mathcal H_K$, the approximation error can be bounded as $\|f_\lambda - f_\rho\|_\rho \le \|f_\rho\|_K \sqrt\lambda$. A more general regularity condition can be imposed on the regression function:

$f_\rho = L_K^r(g_\rho) \quad \text{for some } g_\rho \in L^2_{\rho_X},\ r > 0,$  (9)

where the integral operator $L_K$ is regarded as a compact positive operator on $L^2_{\rho_X}$ and its $r$th power $L_K^r$ is well defined for any $r > 0$. Condition (9) means that $f_\rho$ lies in the range of $L_K^r$; the special case $f_\rho \in \mathcal H_K$ corresponds to the choice $r = 1/2$. It implies $\|f_\lambda - f_\rho\|_\rho \le \lambda^r \|g_\rho\|_\rho$ by Lemma 21 below.
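The link between eigenvalue decay and the capacity assumption (7) can be checked numerically: for $\lambda_\ell = \ell^{-1/\beta}$ the product $\mathcal N(\lambda)\,\lambda^\beta$ should stay bounded as $\lambda \to 0$. A small sketch under this assumed spectrum (all numbers illustrative):

```python
import numpy as np

beta = 0.5
ell = np.arange(1, 100_001)
eigvals = ell ** (-1.0 / beta)   # polynomial decay lambda_l = l^{-1/beta}

def eff_dim(lam):
    # N(lambda) = sum_l lambda_l / (lambda_l + lam), truncated spectrum.
    return float(np.sum(eigvals / (eigvals + lam)))

# Capacity assumption (7): N(lambda) <= c * lambda^{-beta}, i.e. the ratio
# N(lambda) * lambda^beta should remain bounded as lambda decreases.
ratios = [eff_dim(lam) * lam ** beta for lam in (1.0, 1e-1, 1e-2, 1e-3, 1e-4)]
```

For this spectrum the ratio in fact converges (to $\pi/2$ for $\beta = 1/2$, by comparing the sum with an integral), so a constant $c$ in (7) exists.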
Thus, under condition (9), we obtain from Corollary 3, by choosing $\lambda$ to minimize the order of the error bound subject to the restriction (8), the following convergence rates for the error $\overline f_{D,\lambda} - f_{D,\lambda}$ of the distributed learning algorithm (2).

Corollary 4 Assume regularity condition (9) for some $0 < r \le 1$, capacity assumption (7) with $\beta = \frac{1}{2\alpha}$ for some $\alpha > 0$, and $|y| \le M$ almost surely. If $|D_j| = \frac Nm$ for $j = 1, \ldots, m$ with

$m \le N^{\frac{\alpha\max\{2r,1\} + \frac12 + 2\alpha(r-1)}{2\alpha\max\{2r,1\} + 1 + 2\alpha(r-1)}}$  (10)

and $\lambda = m N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$, then we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_{\kappa,c,r} \left(\frac1m\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}},$

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_{\kappa,c,r} \left(\frac1m\right)^{\frac{\alpha\max\{2r-1,0\}}{2\alpha\max\{2r,1\}+1}},$

where $\widetilde C_{\kappa,c,r}$ is a constant independent of $N$ or $m$. In particular, when $f_\rho \in \mathcal H_K$ and $m \le N^{\frac{1}{2+2\alpha}}$, taking $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$ yields the rates

$\mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho\right] \le \widetilde C_{\kappa,c,r}\, N^{-\frac{\alpha}{2\alpha+1}} m^{-\frac{1}{4\alpha+2}} \quad \text{and} \quad \mathbb E\left[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_K\right] \le \widetilde C_{\kappa,c,r}\, \frac{1}{\sqrt m}.$

Remark 5 In Corollary 4 we present learning rates in both the $\mathcal H_K$ and the $L^2_{\rho_X}$ norms. The rates with respect to the $L^2_{\rho_X}$ norm provide estimates of the generalization ability of the algorithm for regression. The rates with respect to the $\mathcal H_K$ norm give error estimates in the uniform metric, due to the relation $\|f\|_\infty \le \kappa\|f\|_K$, and might be used to solve some mismatch problems in learning theory, where the generalization ability of learning algorithms is measured with respect to a probability measure $\mu$ different from $\rho_X$.

Remark 6 The learning rates in Corollary 4 are stated for the norms of the difference $\overline f_{D,\lambda} - f_{D,\lambda}$, which reflects the variance of the distributed learning algorithm (2). These rates decrease as $m$ increases (subject to the restriction (10)) and thereby the regularization parameter $\lambda$ increases; this is different from the learning rates presented for $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho]$ in (Zhang et al., 2015). To derive learning rates for $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho]$ by our analysis, the regularization parameter $\lambda$ should be independent of $m$, as shown in Corollary 11 below.
2.2 Optimal Learning Rates for the Least Squares Regularization Scheme

The second main result of this paper is optimal learning rates in expectation for the least squares regularization scheme (1). Here we can even relax the uniform boundedness of $y$ to a moment condition: for some constant $p \ge 1$,

$\sigma^2_\rho \in L^p_{\rho_X},$  (11)

where $\sigma^2_\rho$ is the conditional variance defined by $\sigma^2_\rho(x) = \int_Y (y - f_\rho(x))^2\, d\rho(y|x)$. The following learning rates in expectation for the least squares regularization scheme (1) will be proved in Section 5. The existence of $f_\lambda$ is ensured by $\mathbb E[y^2] < \infty$.

Theorem 7 Assume $\mathbb E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. Then we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)(1 + \kappa)\left\{1 + \frac{1}{(N\lambda)^2} + \frac{\mathcal N(\lambda)}{N\lambda}\right\}\left\{\|f_\lambda - f_\rho\|_\rho + \frac{\kappa^{1/p}\sqrt{\|\sigma^2_\rho\|_p}\,\mathcal N(\lambda)^{\frac{p-1}{2p}}}{\lambda^{\frac{1}{2p}}\sqrt N}\right\}.$  (12)

If the parameters satisfy $\frac{\mathcal N(\lambda)}{N\lambda} = O(1)$, we have the following explicit bound.

Corollary 8 Assume $\mathbb E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. If $\lambda$ satisfies (8) with $m = 1$, then we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(\|f_\lambda - f_\rho\|_\rho + \frac{\mathcal N(\lambda)^{\frac{p-1}{2p}}}{\lambda^{\frac{1}{2p}}\sqrt N}\right).$

In particular, if $p = \infty$, that is, the conditional variances are uniformly bounded, then

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(\|f_\lambda - f_\rho\|_\rho + \sqrt{\frac{\mathcal N(\lambda)}{N}}\right).$

In particular, when regularity condition (9) is satisfied, we have the following learning rates in expectation.

Corollary 9 Assume $\mathbb E[y^2] < \infty$, moment condition (11) for some $1 \le p \le \infty$, and regularity condition (9) for some $0 < r \le 1$. If the capacity assumption $\mathcal N(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ holds with some $\alpha > 0$, then by taking $\lambda = N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2r\alpha}{2\alpha\max\{2r,1\}+1}} + N^{-\frac{\alpha\max\{2r,1\} - \frac{2\alpha-1}{2p}}{2\alpha\max\{2r,1\}+1}}\right).$

In particular, when $p = \infty$ (the conditional variances are uniformly bounded), we have

$\mathbb E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2r\alpha}{2\alpha\max\{2r,1\}+1}}\right).$

Remark 10 It was shown in (Caponnetto and De Vito, 2007; Steinwart et al., 2009; Bauer et al., 2007) that under the conditions of Corollary 9 with $p = \infty$ and $r \in [\frac12, 1]$, the best learning rate, called the minimax lower rate of convergence, for learning algorithms with output functions in $\mathcal H_K$ is $O(N^{-\frac{2r\alpha}{4\alpha r+1}})$. So the convergence rate we obtain in Corollary 9 is optimal.
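The optimality claim in Remark 10 is pure exponent arithmetic: for $r \ge \frac12$ one has $\max\{2r,1\} = 2r$, so the exponent in Corollary 9 (with $p = \infty$) coincides with the minimax exponent. The hypothetical helper below just transcribes the two displayed exponents, using exact rational arithmetic, and checks this coincidence.

```python
from fractions import Fraction

def rate_exponent(r, alpha):
    # Exponent of N^{-...} in Corollary 9 with p = infinity:
    # 2 r alpha / (2 alpha max(2r, 1) + 1).
    s = max(2 * r, Fraction(1))
    return Fraction(2) * r * alpha / (2 * alpha * s + 1)

def minimax_exponent(r, alpha):
    # Minimax lower-rate exponent from Remark 10: 2 r alpha / (4 alpha r + 1).
    return Fraction(2) * r * alpha / (4 * alpha * r + 1)
```

For example, $r = 1$, $\alpha = 1$ gives the familiar exponent $2/5$ for kernel ridge regression with eigenvalue decay $i^{-2}$.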
Combining the bounds for $\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho$ and $\|f_{D,\lambda} - f_\rho\|_\rho$, we shall prove in Section 6 the following learning rates in expectation for the distributed learning algorithm (2) for regression.

Corollary 11 Assume $|y| \le M$ almost surely and regularity condition (9) for some $\frac12 < r \le 1$. If the capacity assumption $\mathcal N(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ holds with some $\alpha > 0$, $|D_j| = \frac Nm$ for $j = 1, \ldots, m$, and $m$ satisfies the restriction

$m \le N^{\frac{2\alpha(2r-1)}{4\alpha r+1}},$  (13)

then by taking $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$, we have

$\mathbb E\left[\|\overline f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right).$

Remark 12 Corollary 11 shows that distributed learning with the least squares regularization scheme in an RKHS can achieve the optimal learning rates in expectation, provided that $m$ satisfies the restriction (13). It should be pointed out that our error analysis is carried out under regularity condition (9) with $1/2 < r \le 1$, while the work in (Zhang et al., 2015) focused on the case $r = 1/2$. When $r$ approaches $1/2$, the number $m$ of local processors allowed under the restriction (13) reduces to $1$, which corresponds to the non-distributed case. In our follow-up work, we will consider relaxing the restriction (13) in a semi-supervised learning framework by using additional unlabelled data, as done in (Caponnetto and Yao, 2010). The main contribution of our analysis for distributed learning in this paper is the removal of an eigenfunction assumption of (Zhang et al., 2015), by means of a novel second order decomposition for a difference of operator inverses.

Remark 13 In Corollary 11 and Corollary 9, the choice of the regularization parameter $\lambda$ is independent of the number $m$ of local processors. This is consistent with the results in (Zhang et al., 2015). There have been several approaches to selecting the regularization parameter $\lambda$ in regularization schemes in the literature, including cross-validation (Györfi et al., 2002; Blanchard and Krämer, 2016) and the balancing principle (De Vito et al., 2010).
For practical applications of distributed learning algorithms, how to choose $\lambda$ and $m$ (except in situations where the data are naturally stored in a distributive way) is crucial. Though we only consider the theoretical topic of error analysis in this paper, it would be interesting to develop parameter selection methods for distributed learning.

3. Comparisons and Discussion

The least squares regularization scheme (1) is a classical algorithm for regression and has been extensively investigated in statistics and learning theory; there is a vast literature on this topic. Here, for a general kernel beyond those of the Sobolev spaces, we compare our results with the best learning rates in the existing literature.

Under the assumption that the orthogonal projection $f_{\mathcal H}$ of $f_\rho$ in $L^2_{\rho_X}$ onto the closure of $\mathcal H_K$ satisfies regularity condition (9) for some $\frac12 \le r \le 1$, and that the eigenvalues $\{\lambda_i\}_i$ of $L_K$ satisfy $\lambda_i \sim i^{-2\alpha}$ with some $\alpha > 1/2$, it was proved in (Caponnetto and De Vito, 2007) that

$\lim_{\tau \to \infty} \limsup_{N \to \infty} \sup_{\rho \in \mathcal P(\alpha)} \mathrm{Prob}\left[\|f_{D,\lambda_N} - f_{\mathcal H}\|^2_\rho > \tau \lambda_N^{2r}\right] = 0, \qquad \lambda_N = \begin{cases} N^{-\frac{2\alpha}{4\alpha r+1}}, & \text{if } \frac12 < r \le 1,\\[2pt] \left(\frac{\log N}{N}\right)^{\frac{2\alpha}{2\alpha+1}}, & \text{if } r = \frac12, \end{cases}$

where $\mathcal P(\alpha)$ denotes a set of probability measures $\rho$ satisfying some moment decay condition (which is satisfied when $|y| \le M$). This learning rate in probability is stated with a limit as $\tau \to \infty$, and it carries a logarithmic factor in the case $r = \frac12$. In particular, to guarantee $\|f_{D,\lambda_N} - f_{\mathcal H}\|^2_\rho \le \tau_\eta \lambda_N^{2r}$ with confidence $1 - \eta$ by this result, one needs to restrict $N \ge N_\eta$ to be large enough and to take the constant $\tau_\eta$, depending on $\eta$, large enough. Using existing tools for error analysis, including those in (Caponnetto and De Vito, 2007), we develop a novel second order decomposition technique for a difference of operator inverses in this paper, and succeed in deriving the optimal learning rate in expectation in Corollary 9, removing the logarithmic factor in the case $r = \frac12$.
Under the assumptions that $|y| \le M$ almost surely, that the eigenvalues $\{\lambda_i\}_i$ satisfy $\lambda_i \le a i^{-2\alpha}$ with some $\alpha > 1/2$ and $a > 0$, and that for some constant $C > 0$ the pair $(K, \rho_X)$ satisfies

$\|f\|_\infty \le C\, \|f\|_{L^2_{\rho_X}}^{1 - \frac{1}{2\alpha}} \|f\|_K^{\frac{1}{2\alpha}} \qquad \text{for every } f \in \mathcal H_K,$  (14)

it was proved in (Steinwart et al., 2009) that for some constant $c_{\alpha,C}$ depending only on $\alpha$ and $C$, with confidence $1 - \eta$, for any $0 < \lambda \le 1$,

$\|\pi_M(f_{D,\lambda}) - f_\rho\|^2_\rho \le 9 \mathcal A_2(\lambda) + c_{\alpha,C}\, \frac{a^{1/(2\alpha)} M^2 \log(3/\eta)}{\lambda^{1/(2\alpha)} N}.$

Here $\pi_M$ is the projection onto the interval $[-M, M]$, defined (Chen et al., 2004; Wu et al., 2006) by

$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M,\\ f(x), & \text{if } |f(x)| \le M,\\ -M, & \text{if } f(x) < -M, \end{cases}$

and $\mathcal A_2(\lambda)$ is the approximation error defined by

$\mathcal A_2(\lambda) = \inf_{f \in \mathcal H_K}\left\{\|f - f_\rho\|^2_\rho + \lambda\|f\|^2_K\right\}.$

When $f_\rho \in \mathcal H_K$, $\mathcal A_2(\lambda) = O(\lambda)$, and the choice $\lambda_N = N^{-\frac{2\alpha}{2\alpha+1}}$ gives a learning rate of order $\|\pi_M(f_{D,\lambda_N}) - f_\rho\|_\rho = O(N^{-\frac{\alpha}{2\alpha+1}})$. Here one needs to impose condition (14) on the function spaces $L^2_{\rho_X}$ and $\mathcal H_K$, and to take the projection onto $[-M, M]$. Our learning rates in Corollary 9 require no such condition on the function spaces, nor do we take the projection. Let us mention the fact from (Steinwart et al., 2009; Mendelson and Neeman, 2010) that condition (14) is satisfied when $\mathcal H_K$ is a Sobolev space on a domain of a Euclidean space and $\rho_X$ is the uniform distribution, or when the eigenfunctions $\{\phi_\ell/\sqrt{\lambda_\ell} : \lambda_\ell > 0\}$ of $L_K$, normalized in $L^2_{\rho_X}$, are uniformly bounded. Recall that $\|\phi_\ell\|_{L^2_{\rho_X}} = \sqrt{\lambda_\ell}$, since

$\lambda_\ell = \lambda_\ell \|\phi_\ell\|^2_K = \langle \lambda_\ell \phi_\ell, \phi_\ell\rangle_K = \langle L_K \phi_\ell, \phi_\ell\rangle_K = \left\langle \int_X K_x \phi_\ell(x)\, d\rho_X(x), \phi_\ell\right\rangle_K = \int_X \phi_\ell(x)\phi_\ell(x)\, d\rho_X(x) = \|\phi_\ell\|^2_{L^2_{\rho_X}}.$

Learning rates for the least squares regularization scheme (1) in the $\mathcal H_K$-metric have been investigated in (Smale and Zhou, 2007).
For the distributed learning algorithm (2) with subsets $\{D_j\}_{j=1}^m$ of equal size, under the assumption that for some constants $2 < k \le \infty$ and $A < \infty$ the eigenfunctions $\{\phi_i\}_i$ satisfy

$\begin{cases} \sup_{\lambda_i > 0} \mathbb E\left[\left|\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right|^{2k}\right] \le A^{2k}, & \text{when } k < \infty,\\[2pt] \sup_{\lambda_i > 0} \left\|\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\|_\infty \le A, & \text{when } k = \infty, \end{cases}$  (15)

and that $f_\rho \in \mathcal H_K$ and $\lambda_i \le a i^{-2\alpha}$ for some $\alpha > 1/2$ and $a > 0$, it was proved in (Zhang et al., 2015) that for $\lambda = N^{-\frac{2\alpha}{2\alpha+1}}$ and $m$ satisfying the restriction

$m \le c_\alpha \begin{cases} \left(\frac{N^{\frac{2(k-4)\alpha - k}{2\alpha+1}}}{A^{4k}\log^k N}\right)^{\frac{1}{k-2}}, & \text{when } k < \infty,\\[4pt] \frac{N^{\frac{2\alpha-1}{2\alpha+1}}}{A^4 \log N}, & \text{when } k = \infty, \end{cases}$  (16)

with a constant $c_\alpha$ depending only on $\alpha$, there holds

$\mathbb E\left[\|\overline f_{D,\lambda} - f_\rho\|^2_\rho\right] = O\left(N^{-\frac{2\alpha}{2\alpha+1}}\right).$

This interesting result was achieved by a matrix analysis approach, for which the eigenfunction assumption (15) played an essential role. It might be challenging to check the eigenfunction assumption (15), which involves the integral operator $L_K = L_{K,\rho_X}$, for the pair $(K, \rho_X)$. To our best knowledge, besides the case of finite dimensional RKHSs, only three classes of pairs $(K, \rho_X)$ have been proved in the literature to satisfy this eigenfunction assumption: the first class (Steinwart et al., 2009) consists of the Sobolev reproducing kernels on Euclidean domains together with uniform measures $\rho_X$; the second (Williamson et al., 2001) consists of periodical kernels; and the third class can be constructed by a Mercer type expansion

$K(x, y) = \sum_i \lambda_i\, \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, \frac{\phi_i(y)}{\sqrt{\lambda_i}},$  (17)

where $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$ is an orthonormal system in $L^2_{\rho_X}$. An example of a $C^\infty$ Mercer kernel was presented in (Zhou, 2002, 2003) to show that smoothness of the Mercer kernel alone does not guarantee the uniform boundedness of the eigenfunctions $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$. Furthermore, it was shown in (Gittens and Mahoney, 2016) that the normalized eigenfunctions associated with radial basis kernels tend to be localized when the radial basis parameters are made smaller, which raises a practical concern about the uniform boundedness assumption on the eigenfunctions.
Remark 14 To see how the eigenfunctions $\{\phi_i(x)\}_i$ change with the marginal distribution, we consider a different measure $\mu$ induced by a nonnegative function $P \in L^2_{\rho_X}$ as $d\mu = P(x)\, d\rho_X$. An eigenpair $(\lambda, \phi)$ of the integral operator $L_{K,\mu}$ associated with the pair $(K, \mu)$ with $\lambda > 0$ satisfies

$L_{K,\mu}\phi = \int_X K(\cdot, x)\phi(x)P(x)\, d\rho_X = \sum_{\lambda_i > 0} \lambda_i\, \frac{\phi_i}{\sqrt{\lambda_i}} \int_X \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, \phi(x)P(x)\, d\rho_X = \lambda\phi.$

We introduce an index set $I := \{i : \lambda_i > 0\}$ and a possibly infinite matrix

$K^P := \left(\lambda_i \int_X \frac{\phi_i(x)}{\sqrt{\lambda_i}}\, P(x)\, \frac{\phi_j(x)}{\sqrt{\lambda_j}}\, d\rho_X\right)_{i,j \in I},$  (18)

and express $\phi$ in terms of the orthonormal basis $\{\phi_i\}_{i \in I}$ as $\phi = \sum_{i \in I} \frac{c_i}{\sqrt{\lambda_i}}\phi_i$, where the sequence $c = (c_i)_{i \in I}$ is given by $c_i = \left\langle \phi, \frac{\phi_i}{\sqrt{\lambda_i}}\right\rangle_{L^2_{\rho_X}} = \sqrt{\lambda_i}\langle \phi, \phi_i\rangle_K$. Then the eigenpair $(\lambda, \phi)$ satisfies

$L_{K,\mu}\phi = \lambda\phi \iff K^P c = \lambda c.$

Thus the eigenpairs of the integral operator $L_{K,\mu}$ associated with $(K, \mu)$ can be characterized by those of the possibly infinite matrix $K^P$ defined by (18). Finding the eigenpairs of $K^P$ is an interesting question involving matrix analysis in linear algebra and multiplier operators in harmonic analysis. Note that when $P \equiv 1$, corresponding to $\mu = \rho_X$, $K^P$ is diagonal with diagonal entries $\{\lambda_i\}_i$, and its eigenvectors yield the eigenfunctions $\{\phi_i\}_i$.

From the above observation we can see that the marginal distribution $\rho_X$ plays an essential role in the eigenfunction assumption (15), which might be difficult to check for a general marginal distribution $\rho_X$. For example, it is even unknown whether any of the Gaussian kernels on $X = [0,1]^d$ satisfies the eigenfunction assumption (15) when $\rho_X$ is a general Borel measure. Our learning rates stated in Corollary 4 and Corollary 11 do not require the eigenfunction assumption (15).
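A finite-dimensional analogue of the matrix (18) can be built explicitly: replace $X$ by a grid of $n$ points with uniform $\rho_X$, take an orthonormal system $\{u_i\}$ in the weighted $\ell^2$ inner product as a stand-in for $\{\phi_i/\sqrt{\lambda_i}\}$, and form $[K^P]_{ij} = \lambda_i \sum_x u_i(x) P(x) u_j(x) \rho_X(x)$. The sketch below (all constructions illustrative) checks the diagonal case $P \equiv 1$ noted in Remark 14.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
# Orthonormal system {u_i} in L^2(rho_X) with rho_X uniform on n grid points:
# columns of a random orthogonal matrix, rescaled so that
# sum_x u_i(x) u_j(x) * (1/n) = delta_ij.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
U = Q * np.sqrt(n)                      # U[:, i] stands in for phi_i / sqrt(lam_i)
lam = 1.0 / (np.arange(n) + 1.0) ** 2   # illustrative positive eigenvalues

def K_P(P):
    # Discrete analogue of (18): [K^P]_{ij} = lam_i * int u_i P u_j d rho_X.
    return lam[:, None] * (U.T @ np.diag(P / n) @ U)

# P == 1 corresponds to mu = rho_X, so K^P must be diag(lam_1, ..., lam_n).
assert np.allclose(K_P(np.ones(n)), np.diag(lam))
```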
Moreover, our restriction (10) on the number $m$ of local processors in Corollary 4 is more general than (16) when $\alpha$ is close to $1/2$: with $r = 1/2$, corresponding to the condition $f_\rho \in \mathcal H_K$, our restriction (10) in Corollary 4 is $m \le N^{\frac{1}{4+6\alpha}}$, with power index tending to $\frac17$, while the restriction in (16) with $k = \infty$ takes the form $m \le c_\alpha \frac{N^{\frac{2\alpha-1}{2\alpha+1}}}{A^4 \log N}$, with power index tending to $0$, as $\alpha \to \frac12$.

Note that the learning rates stated in Corollary 4 are for the difference $\overline f_{D,\lambda} - f_{D,\lambda}$ between the output function of the distributed learning algorithm (2) and that of the algorithm (1) using the whole data. In the special case $r = \frac12$, we can see that the bound $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|_\rho] \le \widetilde C_{\kappa,c,r}\, N^{-\frac{\alpha}{2\alpha+1}} m^{-\frac{1}{4\alpha+2}}$, achieved by choosing $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$, becomes smaller as $m$ becomes larger. Here the dependence of $\lambda$ on $m$ is crucial for achieving this convergence rate of the sample error: if we fix $\lambda$ and $N$, the error bounds for $\mathbb E[\|\overline f_{D,\lambda} - f_{D,\lambda}\|]$ stated in Theorem 1 and Corollary 3 become larger as $m$ increases. On the other hand, as one expects, increasing the number $m$ of local processors increases the approximation error for the regression problem, which can also be seen from the bound of Theorem 7 with $\lambda = m N^{-\frac{2\alpha}{2\alpha+1}}$. The result in Corollary 11 with $r > \frac12$ compensates, and gives the optimal learning rate $\mathbb E[\|\overline f_{D,\lambda} - f_\rho\|_\rho] = O(N^{-\frac{2r\alpha}{4\alpha r+1}})$ by restricting $m$ as in (13).

Besides the divide-and-conquer technique, there are many other approaches in the literature to reduce the computational complexity of handling big data. These include localized learning (Meister and Steinwart, 2016), Nyström regularization (Bach, 2013; Rudi et al., 2015), and online learning (Dekel et al., 2012). The divide-and-conquer technique has the advantage of reducing the single-machine complexity in both space and time without a significant loss of prediction power. How the big data are decomposed for distributed learning is an important and interesting problem.
Here we only study the approach of random decomposition, where the data subsets are regarded as drawn from the same distribution. There should be better decomposition strategies for some practical applications. For example, data points might intuitively be divided according to their spatial distribution, so that the learning process yields locally optimal predictors, which could then be synthesized into the output function; see (Thomann et al., 2016).

In this paper we consider regularized least squares with Mercer kernels. Our results might be extended to more general kernels. A natural class consists of bounded and positive semidefinite kernels, as studied in (Steinwart and Scovel, 2012). By means of the Mercer type expansion (17), one needs some assumptions on the system $\left\{\frac{\phi_i(x)}{\sqrt{\lambda_i}}\right\}_i$, the domain $X$, and the measure $\rho_X$ to relax the continuity of the kernel while keeping the compactness of the integral operator. How to minimize the assumptions and maximize the scope of applications of the framework, such as the situation of an input space $X$ without a metric (Shen et al., 2014; De Vito et al., 2013), is a valuable question to investigate.

Here we only consider distributed learning with regularized least squares. It would be of great interest and value to develop the theory of distributed learning with other algorithms, such as spectral algorithms (Bauer et al., 2007), empirical feature-based learning (Guo and Zhou, 2012; Guo et al., 2017; Shi et al., 2011), the minimum error entropy principle (Hu et al., 2015; Fan et al., 2016), and randomized Kaczmarz algorithms (Lin and Zhou, 2015).

Remark 15 After the submission of this paper, in our follow-up paper by Z. C. Guo, S. B. Lin, and D. X. Zhou, entitled "Learning theory of distributed spectral algorithms" and published in Inverse Problems, error analysis and optimal learning rates for distributed learning with spectral algorithms were derived.
In late 2016, we learned that similar analysis was carried out for classical (non-distributed) spectral algorithms implemented on one machine by G. Blanchard and N. Mücke in a paper entitled "Optimal rates for regularization of statistical inverse learning problems" (arXiv:1604.04054, April 2016), and by L. H. Dicker, D. P. Foster, and D. Hsu in a paper entitled "Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators" (arXiv:1605.08839, May 2016). We are indebted to one of the referees for pointing this out to us.

4. Second Order Decomposition of Operator Differences and Norms

Our error estimates are achieved by a novel second order decomposition of operator differences in our integral operator approach. We approximate the integral operator $L_K$ by the empirical integral operator $L_{K,D(x)}$ on $\mathcal H_K$, defined with the input data set $D(x) = \{x_i\}_{i=1}^N = \{x : (x, y) \in D \text{ for some } y \in Y\}$ as

$L_{K,D(x)}(f) = \frac{1}{|D|}\sum_{x \in D(x)} f(x) K_x = \frac{1}{|D|}\sum_{x \in D(x)} \langle f, K_x\rangle_K K_x, \qquad f \in \mathcal H_K,$  (19)

where the reproducing property $f(x) = \langle f, K_x\rangle_K$ for $f \in \mathcal H_K$ is used. Since $K$ is a Mercer kernel, $L_{K,D(x)}$ is a finite-rank positive operator and $L_{K,D(x)} + \lambda I$ is invertible. The operator difference in our study is that of the inverses of two operators, defined by

$Q_{D(x)} = \left(L_{K,D(x)} + \lambda I\right)^{-1} - \left(L_K + \lambda I\right)^{-1}.$  (20)

If we denote $A = L_{K,D(x)} + \lambda I$ and $B = L_K + \lambda I$, then our second order decomposition of the operator difference $Q_{D(x)}$ is a special case of the following second order decomposition of the difference $A^{-1} - B^{-1}$.

Lemma 16 Let $A$ and $B$ be invertible operators on a Banach space. Then we have

$A^{-1} - B^{-1} = B^{-1}\{B - A\} B^{-1} + B^{-1}\{B - A\} A^{-1}\{B - A\} B^{-1}.$  (21)

Proof We can decompose the operator $A^{-1} - B^{-1}$ as

$A^{-1} - B^{-1} = B^{-1}\{B - A\} A^{-1}.$  (22)

This is the first order decomposition. Write the last factor $A^{-1}$ as $B^{-1} + (A^{-1} - B^{-1})$ and apply another first order decomposition, similar to (22), namely $A^{-1} - B^{-1} = A^{-1}\{B - A\} B^{-1}$. It follows from (22) that

$A^{-1} - B^{-1} = B^{-1}\{B - A\} B^{-1} + B^{-1}\{B - A\}\left\{A^{-1}\{B - A\} B^{-1}\right\}.$
Then the desired identity (21) is verified. The lemma is proved.

To demonstrate how the second order decomposition leads to tight error bounds for the classical least squares regularization scheme (1) with output function $f_{D,\lambda}$, we recall the following formula (see, e.g., (Caponnetto and De Vito, 2007; Smale and Zhou, 2007)):

$f_{D,\lambda} - f_\lambda = \left(L_{K,D(x)} + \lambda I\right)^{-1} \Delta_D, \qquad \Delta_D := \frac{1}{|D|}\sum_{z \in D}\xi_\lambda(z) - \mathbb E[\xi_\lambda],$  (23)

where $\Delta_D$ is induced by the random variable $\xi_\lambda$ with values in the Hilbert space $\mathcal H_K$ defined by

$\xi_\lambda(z) = (y - f_\lambda(x)) K_x, \qquad z = (x, y) \in Z.$  (24)

Then we can express $f_{D,\lambda} - f_\lambda$ by means of the notation $Q_{D(x)} = (L_{K,D(x)} + \lambda I)^{-1} - (L_K + \lambda I)^{-1}$ as

$f_{D,\lambda} - f_\lambda = Q_{D(x)} \Delta_D + (L_K + \lambda I)^{-1}\Delta_D.$

Due to the identity between the $L^2_{\rho_X}$ norm and the $\mathcal H_K$ metric,

$\|g\|_\rho = \left\|L_K^{1/2} g\right\|_K, \qquad g \in L^2_{\rho_X},$  (25)

we can estimate the error $\|f_{D,\lambda} - f_\lambda\|_\rho = \|L_K^{1/2}(f_{D,\lambda} - f_\lambda)\|_K$ by bounding the $\mathcal H_K$ norm of the following expression, obtained from the second order decomposition (21) with $B - A = L_K - L_{K,D(x)}$:

$L_K^{1/2} Q_{D(x)}\Delta_D = L_K^{1/2}(L_K + \lambda I)^{-1/2}\left\{(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\}(L_K + \lambda I)^{-1}\Delta_D$
$\quad + L_K^{1/2}(L_K + \lambda I)^{-1/2}\left\{(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\}\left(L_{K,D(x)} + \lambda I\right)^{-1}\left\{\left(L_K - L_{K,D(x)}\right)(L_K + \lambda I)^{-1/2}\right\}\left\{(L_K + \lambda I)^{-1/2}\Delta_D\right\}.$

Combining this expression with the operator norm bound $\|L_K^{1/2}(L_K + \lambda I)^{-1/2}\| \le 1$ and the notation

$\Xi_D = \left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|$  (26)

for convenience, we find

$\left\|L_K^{1/2} Q_{D(x)}\Delta_D\right\|_K \le \Xi_D\left\|(L_K + \lambda I)^{-1}\Delta_D\right\|_K + \Xi_D\left\|\left(L_{K,D(x)} + \lambda I\right)^{-1}\right\|\,\Xi_D\left\|(L_K + \lambda I)^{-1/2}\Delta_D\right\|_K.$  (27)

By decomposing $(L_K + \lambda I)^{-1}\Delta_D$ as $(L_K + \lambda I)^{-1/2}(L_K + \lambda I)^{-1/2}\Delta_D$ and using the bounds $\|(L_{K,D(x)} + \lambda I)^{-1}\| \le \frac1\lambda$ and $\|(L_K + \lambda I)^{-1/2}\| \le \frac{1}{\sqrt\lambda}$, we know that

$\|f_{D,\lambda} - f_\lambda\|_\rho \le \left\{\frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda} + 1\right\}\left\|(L_K + \lambda I)^{-1/2}\Delta_D\right\|_K.$  (28)

Thus the classical least squares regularization scheme (1) can be analyzed after estimating the operator norm $\Xi_D = \|(L_K + \lambda I)^{-1/2}(L_K - L_{K,D(x)})\|$ and the function norm $\|(L_K + \lambda I)^{-1/2}\Delta_D\|_K$. In the following two lemmas, to be proved in the appendix, these norms are estimated in terms of effective dimensions by methods from the existing literature (Caponnetto and De Vito, 2007; Bauer et al., 2007; Blanchard and Krämer, 2010).
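The algebraic identities (21) and (22) underlying this approach are purely matrix-level facts, so they can be sanity-checked numerically in finite dimensions, with random positive definite matrices standing in for $L_{K,D(x)} + \lambda I$ and $L_K + \lambda I$ (the construction below is only an illustrative check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
# Random symmetric positive definite A and B (hence invertible, as in Lemma 16).
A = rng.standard_normal((d, d)); A = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); B = B @ B.T + d * np.eye(d)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)

# First order decomposition (22): A^{-1} - B^{-1} = B^{-1} (B - A) A^{-1}.
first_order = Binv @ (B - A) @ Ainv
# Second order decomposition (21):
# A^{-1} - B^{-1} = B^{-1}(B-A)B^{-1} + B^{-1}(B-A)A^{-1}(B-A)B^{-1}.
second_order = Binv @ (B - A) @ Binv + Binv @ (B - A) @ Ainv @ (B - A) @ Binv

assert np.allclose(Ainv - Binv, first_order)
assert np.allclose(Ainv - Binv, second_order)
```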
Lemma 17 Let $D$ be a sample drawn independently according to $\rho$. Then the following estimates for the operator norm $\|(L_K + \lambda I)^{-1/2}(L_K - L_{K,D(x)})\|$ hold.

(a) $\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^2\right] \le \frac{\kappa^2 \mathcal N(\lambda)}{|D|}.$

(b) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

$\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\| \le B_{|D|,\lambda}\log(2/\delta),$  (29)

where we denote the quantity

$B_{|D|,\lambda} = \frac{2\kappa}{\sqrt{|D|}}\left\{\frac{\kappa}{\sqrt{|D|\lambda}} + \sqrt{\mathcal N(\lambda)}\right\}.$

(c) For any $d > 1$, there holds

$\left\{\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^d\right]\right\}^{1/d} \le \left(2 d\, \Gamma(d) + 1\right)^{1/d} B_{|D|,\lambda},$

where $\Gamma$ is the Gamma function, defined for $d > 0$ by $\Gamma(d) = \int_0^\infty u^{d-1} e^{-u}\, du$.

Lemma 18 Let $D$ be a sample drawn independently according to $\rho$, let $g$ be a measurable bounded function on $Z$, and let $\xi_g$ be the random variable with values in $\mathcal H_K$ defined by $\xi_g(z) = g(z)K_x$ for $z = (x, y) \in Z$. Then the following statements hold.

(a) $\mathbb E\left[\left\|(L_K + \lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \mathcal N(\lambda).$

(b) For almost every $x \in X$, there holds $\left\|(L_K + \lambda I)^{-1/2}(K_x)\right\|_K \le \frac{\kappa}{\sqrt\lambda}.$

(c) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds

$\left\|(L_K + \lambda I)^{-1/2}\left(\frac{1}{|D|}\sum_{z \in D}\xi_g(z) - \mathbb E[\xi_g]\right)\right\|_K \le 2\|g\|_\infty \log(2/\delta)\left\{\frac{\kappa}{|D|\sqrt\lambda} + \sqrt{\frac{\mathcal N(\lambda)}{|D|}}\right\}.$

Remark 19 In the existing literature, the first order decomposition (22) was used. To bound the norm of $(L_{K,D(x)} + \lambda I)^{-1}\Delta_D$ by this approach, one needs either to use the brute force estimate $\|(L_{K,D(x)} + \lambda I)^{-1}\| \le \frac1\lambda$, leading to suboptimal learning rates, or to apply the approximation of $L_K$ by $L_{K,D(x)}$, which is valid only with confidence and leads to confidence-based estimates with $\lambda$ depending on the confidence level. In our second order decomposition, we decompose the inverse operator $(L_{K,D(x)} + \lambda I)^{-1}$ further after the first order decomposition (22). This leads to finer estimates, with an additional factor $\{B - A\} B^{-1/2}$ in the second term of (21), and gives the refined error bound (28).

5. Deriving Error Bounds for the Least Squares Regularization Scheme

In this section we prove our main result on error bounds for the least squares regularization scheme (1), which illustrates how to apply the second order decomposition (21) of operator differences in our integral operator approach.
Proposition 20 Assume $E[y^2] < \infty$ and moment condition (11) for some $1 \le p \le \infty$. Then
$$E\left[\|f_{D,\lambda} - f_\lambda\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)\left(1 + \frac{1}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\kappa^{\frac1p}\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac{1}{2p}}} + \frac{\kappa\,\|f_\lambda - f_\rho\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$

Proof We apply the error bound (28) and the Schwarz inequality to get
$$E\left[\|f_{D,\lambda} - f_\lambda\|_\rho\right] \le \left\{E\left[\left(1 + \frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda}\right)^2\right]\right\}^{1/2}\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\Delta_D\right\|_K^2\right]\right\}^{1/2}. \quad (31)$$
The first factor in the above bound involves $\Xi_D = \left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|$ and can be estimated by applying Lemma 17 as
$$\left\{E\left[\left(1 + \frac{\Xi_D}{\sqrt\lambda} + \frac{\Xi_D^2}{\lambda}\right)^2\right]\right\}^{1/2} \le 1 + \left\{\frac{E\left[\Xi_D^2\right]}{\lambda}\right\}^{1/2} + \left\{\frac{E\left[\Xi_D^4\right]}{\lambda^2}\right\}^{1/2} \le 1 + \sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}} + \frac{\sqrt{49\,\mathcal{B}^4_{|D|,\lambda}}}{\lambda} \le 2 + \frac{56\kappa^4}{(|D|\lambda)^2} + \frac{57\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}. \quad (32)$$
To deal with the second factor in the bound (31), we separate $\Delta_D = \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda]$ as $\Delta_D = \widetilde\Delta_D + \widehat\Delta_D$, where
$$\widetilde\Delta_D := \frac{1}{|D|}\sum_{z\in D}\xi_0(z), \qquad \widehat\Delta_D := \frac{1}{|D|}\sum_{z\in D}\left(\xi_\lambda - \xi_0\right)(z) - E[\xi_\lambda]$$
are induced by another random variable $\xi_0$ with values in the Hilbert space $H_K$ defined by
$$\xi_0(z) = (y - f_\rho(x))K_x, \qquad z = (x,y) \in Z. \quad (33)$$
Then the second factor in (31) can be separated as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} + \left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_D\right\|_K^2\right]\right\}^{1/2}. \quad (34)$$
Let us bound the first term of (34). Observe that
$$(L_K+\lambda I)^{-1/2}\widetilde\Delta_D = \sum_{(x,y)\in D}\frac{1}{|D|}(y - f_\rho(x))\,(L_K+\lambda I)^{-1/2}(K_x).$$
Each term in this expression is unbiased because $\int_Y (y - f_\rho(x))\,d\rho(y|x) = 0$. This unbiasedness and the independence tell us that
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right] = \frac{1}{|D|}E\left[(y - f_\rho(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \frac{1}{|D|}E\left[\sigma^2_\rho(x)\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right].$$
If $\sigma^2_\rho \in L^\infty_{\rho_X}$, then $\sigma^2_\rho(x) \le \left\|\sigma^2_\rho\right\|_\infty$ almost surely, and by Lemma 18 we have
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \sqrt{\left\|\sigma^2_\rho\right\|_\infty}\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}}.$$
If $\sigma^2_\rho \in L^p_{\rho_X}$ with $1 \le p < \infty$, we take $q = \frac{p}{p-1}$ (with $q = \infty$ for $p = 1$) satisfying $\frac1p + \frac1q = 1$ and apply the Hölder inequality $E[|\xi\eta|] \le \left(E[|\xi|^p]\right)^{1/p}\left(E[|\eta|^q]\right)^{1/q}$ to $\xi = \sigma^2_\rho(x)$ and $\eta = \left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2$ to obtain
$$E\left[\sigma^2_\rho(x)\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] \le \left\|\sigma^2_\rho\right\|_p\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^{2q}\right]\right\}^{1/q}.$$
But
$$E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^{2q}\right] \le \left(\frac{\kappa}{\sqrt\lambda}\right)^{2q-2}E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \left(\frac{\kappa^2}{\lambda}\right)^{q-1}\mathcal{N}(\lambda)$$
by Lemma 18.
So we have
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{\left\|\sigma^2_\rho\right\|_p}{|D|}\left(\left(\frac{\kappa^2}{\lambda}\right)^{q-1}\mathcal{N}(\lambda)\right)^{1/q}\right\}^{1/2} = \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}}.$$
Combining the above two cases, we know that for either $p = \infty$ or $p < \infty$, the first term of (34) can be bounded as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}}.$$
The second term of (34) can be bounded easily as
$$\left\{E\left[\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_D\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{1}{|D|}E\left[(f_\rho(x)-f_\lambda(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right]\right\}^{1/2} \le \left\{\frac{1}{|D|}E\left[(f_\rho(x)-f_\lambda(x))^2\right]\frac{\kappa^2}{\lambda}\right\}^{1/2} = \frac{\kappa\,\|f_\rho - f_\lambda\|_\rho}{\sqrt{|D|\lambda}}.$$
Putting the above estimates for the two terms of (34) into (31) and applying the bound (32) involving $\Xi_D$, we know that $E\left[\|f_{D,\lambda}-f_\lambda\|_\rho\right]$ is bounded by
$$\left(2 + \frac{56\kappa^4}{(|D|\lambda)^2} + \frac{57\kappa^2\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\kappa^{\frac1p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}} + \frac{\kappa\,\|f_\rho-f_\lambda\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$
Then our desired error bound follows. The proof of the proposition is complete.

Proof of Theorem 7 Combining Proposition 20 with the triangle inequality
$$\|f_{D,\lambda} - f_\rho\|_\rho \le \|f_{D,\lambda} - f_\lambda\|_\rho + \|f_\lambda - f_\rho\|_\rho,$$
we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \|f_\lambda - f_\rho\|_\rho + \left(2 + 56\kappa^4 + 57\kappa^2\right)\left(1 + \frac{1}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\left\{\kappa^{\frac1p}\sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{|D|}\,\lambda^{\frac1{2p}}} + \frac{\kappa\,\|f_\lambda - f_\rho\|_\rho}{\sqrt{|D|\lambda}}\right\}.$$
Then the desired error bound holds true, and the proof of Theorem 7 is complete.

Proof of Corollary 8 By the definition of the effective dimension, $\mathcal{N}(\lambda) = \sum_\ell \frac{\lambda_\ell}{\lambda_\ell + \lambda} \ge \frac{\lambda_1}{\lambda_1 + \lambda}$. Combining this with the restriction (8) with $m = 1$, we find $\mathcal{N}(\lambda) \ge \frac{\lambda_1}{\lambda_1 + C_0}$ and $N\lambda \ge \frac{\lambda_1}{(\lambda_1 + C_0)C_0}$. Putting these and the restriction (8) with $m = 1$ into the error bound (12), we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] \le \left(2 + 56\kappa^4 + 57\kappa^2\right)(1+\kappa)\left(1 + \frac{(\lambda_1 + C_0)^2 C_0^2}{\lambda_1^2} + C_0\right)\left\{\frac{(\lambda_1 + C_0)C_0}{\lambda_1}\,\|f_\lambda - f_\rho\|_\rho + \sqrt{\left\|\sigma^2_\rho\right\|_p}\,\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{N}\,\lambda^{\frac1{2p}}}\right\}.$$
Then the desired bound follows. The proof of Corollary 8 is complete.

To prove Corollary 9, we need the following bounds (Smale and Zhou, 2007) for $\|f_\lambda - f_\rho\|_\rho$ and $\|f_\lambda - f_\rho\|_K$.

Lemma 21 Assume regularity condition (9) with $0 < r \le 1$. There holds
$$\|f_\lambda - f_\rho\|_\rho \le \lambda^r\|g_\rho\|_\rho. \quad (36)$$
Furthermore, if $\frac12 \le r \le 1$, then we have
$$\|f_\lambda - f_\rho\|_K \le \lambda^{r-\frac12}\|g_\rho\|_\rho. \quad (37)$$

Proof of Corollary 9 By Lemma 21, regularity condition (9) with $0 < r \le 1$ implies $\|f_\lambda - f_\rho\|_\rho \le \lambda^r\|g_\rho\|_\rho$. If $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac{1}{2\alpha}}$ for some constant $C_0 \ge 1$, then the choice $\lambda = N^{-\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ yields
$$\frac{\mathcal{N}(\lambda)}{N\lambda} \le C_0\,N^{\frac{2\alpha+1}{2\alpha\max\{2r,1\}+1} - 1} \le C_0.$$
So (8) with $m = 1$ is satisfied.
With this choice we also have
$$\frac{\left(\mathcal{N}(\lambda)\right)^{\frac12\left(1-\frac1p\right)}}{\sqrt{N}\,\lambda^{\frac1{2p}}} \le \sqrt{C_0}\, N^{-\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}.$$
Putting these estimates into Corollary 8, we know that
$$E\left[\|f_{D,\lambda} - f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{2\alpha\max\{2r,1\}+1}} + N^{-\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}\right) = O\left(N^{-\frac{\alpha\min\{2r,\,\max\{2r,1\}\}}{2\alpha\max\{2r,1\}+1} + \frac{1}{2p}\cdot\frac{2\alpha-1}{2\alpha\max\{2r,1\}+1}}\right).$$
But we find $\min\{2r, \max\{2r,1\}\} = 2r$ by discussing the two different cases $0 < r < \frac12$ and $\frac12 \le r \le 1$. Then our conclusion follows immediately. The proof of Corollary 9 is complete.

6. Proof of Error Bounds for the Distributed Learning Algorithm

To analyze the error $\overline{f}_{D,\lambda} - f_{D,\lambda}$ for distributed learning, we recall the notation $Q_{D(x)} = \left(L_{K,D(x)} + \lambda I\right)^{-1} - (L_K + \lambda I)^{-1}$ for the difference of inverse operators and use the notation $Q_{D_j(x)}$ involving the data subset $D_j$. The empirical integral operator $L_{K,D_j(x)}$ is defined with $D$ replaced by the data subset $D_j$. For our error analysis for the distributed learning algorithm (2), we need the following error decomposition for $\overline{f}_{D,\lambda} - f_{D,\lambda}$.

Lemma 22 Assume $E[y^2] < \infty$. For $\lambda > 0$, we have
$$\overline{f}_{D,\lambda} - f_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|}\left[\left(L_{K,D_j(x)} + \lambda I\right)^{-1} - \left(L_{K,D(x)} + \lambda I\right)^{-1}\right]\Delta_j = \sum_{j=1}^m \frac{|D_j|}{|D|}Q_{D_j(x)}\widetilde\Delta_j + \sum_{j=1}^m \frac{|D_j|}{|D|}Q_{D_j(x)}\widehat\Delta_j - Q_{D(x)}\Delta_D, \quad (38)$$
where
$$\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\xi_\lambda(z) - E[\xi_\lambda], \qquad \Delta_D := \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda],$$
$$\widetilde\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\xi_0(z), \qquad \widehat\Delta_j := \frac{1}{|D_j|}\sum_{z\in D_j}\left(\xi_\lambda - \xi_0\right)(z) - E[\xi_\lambda].$$

Proof Recall the expression (23) for $f_{D,\lambda} - f_\lambda$. When the data subset $D_j$ is used, we have $f_{D_j,\lambda} - f_\lambda = \left(L_{K,D_j(x)} + \lambda I\right)^{-1}\Delta_j$. So we know that
$$\overline{f}_{D,\lambda} - f_\lambda = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(f_{D_j,\lambda} - f_\lambda\right) = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(L_{K,D_j(x)} + \lambda I\right)^{-1}\Delta_j.$$
We can decompose $\Delta_D$ as
$$\Delta_D = \frac{1}{|D|}\sum_{z\in D}\xi_\lambda(z) - E[\xi_\lambda] = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(\frac{1}{|D_j|}\sum_{z\in D_j}\xi_\lambda(z) - E[\xi_\lambda]\right) = \sum_{j=1}^m \frac{|D_j|}{|D|}\Delta_j$$
in the expression (23) for $f_{D,\lambda} - f_\lambda$ and find
$$f_{D,\lambda} - f_\lambda = \sum_{j=1}^m \frac{|D_j|}{|D|}\left(L_{K,D(x)} + \lambda I\right)^{-1}\Delta_j.$$
Then the first desired expression for $\overline{f}_{D,\lambda} - f_{D,\lambda}$ follows. By adding and subtracting the operator $(L_K+\lambda I)^{-1}$, writing $\Delta_j = \widetilde\Delta_j + \widehat\Delta_j$, and noting $E[\xi_0] = 0$, we know that the first expression implies (38). This proves Lemma 22.
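The distributed learning algorithm (2) analyzed here fits scheme (1) on each subset $D_j$ and averages the outputs with weights $|D_j|/|D|$. The following toy sketch is ours (function names, the Gaussian kernel, and the data are illustrative assumptions, not the paper's code); it uses the standard closed form $\alpha = (\mathbb{K} + \lambda|D| I)^{-1}y$ for the coefficients of the least squares regularization scheme in an RKHS.

```python
import numpy as np

BANDWIDTH = 0.2  # Gaussian kernel width (illustrative choice)

def gauss_kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * BANDWIDTH ** 2))

def krr_fit(x, y, lam):
    """Scheme (1): f_{D,lambda}(t) = sum_i alpha_i K(t, x_i) with
    alpha = (K + lam * |D| * I)^{-1} y."""
    n = len(x)
    alpha = np.linalg.solve(gauss_kernel(x, x) + lam * n * np.eye(n), y)
    return x, alpha

def krr_predict(model, t):
    x, alpha = model
    return gauss_kernel(t, x) @ alpha

def distributed_krr(x, y, m, lam):
    """Divide-and-conquer: partition D into m disjoint subsets, fit scheme (1)
    on each, and average the outputs with weights |D_j| / |D|."""
    parts = np.array_split(np.arange(len(x)), m)
    models = [krr_fit(x[p], y[p], lam) for p in parts]
    weights = [len(p) / len(x) for p in parts]
    return lambda t: sum(w * krr_predict(mod, t) for w, mod in zip(weights, models))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 400)            # unsorted, so each subset covers the domain
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(400)
t = np.linspace(-1.0, 1.0, 100)

f_bar = distributed_krr(x, y, m=8, lam=1e-3)      # distributed estimator
f_whole = distributed_krr(x, y, m=1, lam=1e-3)    # m = 1: single-machine f_{D,lambda}
gap = float(np.max(np.abs(f_bar(t) - f_whole(t))))
mse = float(np.mean((f_bar(t) - np.sin(np.pi * t)) ** 2))
```

On this toy problem the averaged estimator stays close both to the single-machine estimator and to the regression function, which is the qualitative content of the error bounds proved in this section.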
Before proving our first main result on the error $\overline{f}_{D,\lambda} - f_{D,\lambda}$ in the $H_K$ metric and the $L^2_{\rho_X}$ metric, we state the following more general result which allows different sizes for the data subsets $\{D_j\}$.

Theorem 23 Assume that for some constant $M > 0$, $|y| \le M$ almost surely. Then we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le C'_\kappa\sum_{j=1}^m\frac{|D_j|}{|D|}\left\{M\sqrt\lambda\left(\mathcal{S}^2_{|D_j|,\lambda}+\mathcal{S}^3_{|D_j|,\lambda}\right)+\left(\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}}+\mathcal{S}^2_{|D_j|,\lambda}\right)\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}\right\}$$
$$+\,C'_\kappa\left(\mathcal{S}_{|D|,\lambda}+\mathcal{S}^2_{|D|,\lambda}\right)\left\{M\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}}+\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D|\lambda}}\right\},$$
and the same bound with an additional factor $\frac{1}{\sqrt\lambda}$ holds for $E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_K\right]$, where $C'_\kappa$ is a constant depending only on $\kappa$, and $\mathcal{S}_{|D|,\lambda}$ is the quantity given by
$$\mathcal{S}_{|D|,\lambda} = \left\{\frac{1}{|D|^2\lambda^2}+\frac{\mathcal{N}(\lambda)}{|D|\lambda}\right\}^{1/2}.$$

Proof Recall the expression (38) for $\overline{f}_{D,\lambda} - f_{D,\lambda}$ in Lemma 22. It enables us to express
$$L_K^{1/2}\left(\overline{f}_{D,\lambda} - f_{D,\lambda}\right) = J_1 + J_2 - J_3, \quad (39)$$
where the terms $J_1, J_2, J_3$ are given by
$$J_1 = \sum_{j=1}^m\frac{|D_j|}{|D|}\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j, \qquad J_2 = \sum_{j=1}^m\frac{|D_j|}{|D|}\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j, \qquad J_3 = \left[L_K^{1/2}Q_{D(x)}\right]\Delta_D.$$
These three terms will be dealt with separately in the following.

For the first term $J_1$ of (39), each summand with $j \in \{1,\ldots,m\}$ can be expressed as $\sum_{z=(x,y)\in D_j}\frac{1}{|D|}(y - f_\rho(x))\left[L_K^{1/2}Q_{D_j(x)}\right](K_x)$, and is unbiased because $\int_Y (y - f_\rho(x))\,d\rho(y|x) = 0$. The unbiasedness and the independence tell us that
$$E\left[\|J_1\|_K\right] \le \left\{E\left[\|J_1\|_K^2\right]\right\}^{1/2} = \left\{\sum_{j=1}^m\frac{|D_j|^2}{|D|^2}E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2\right]\right\}^{1/2}. \quad (40)$$
Let $j \in \{1,\ldots,m\}$. The estimate (27) derived from the second order decomposition (21) yields
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 \le \left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi_{D_j}^2}{\lambda}\right)^2\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_j\right\|_K^2. \quad (41)$$
Now we apply the formula
$$E[\xi] = \int_0^\infty \mathrm{Prob}\,[\xi > t]\,dt \quad (42)$$
for nonnegative random variables to estimate the expected value of (41). By Part (b) of Lemma 17, for $0 < \delta < 1$, there exists a subset $Z^{|D_j|}_{\delta,1}$ of $Z^{|D_j|}$ of measure at least $1-\delta$ such that
$$\Xi_{D_j} \le \mathcal{B}_{|D_j|,\lambda}\log(2/\delta), \qquad D_j \in Z^{|D_j|}_{\delta,1}. \quad (43)$$
Applying Part (c) of Lemma 18 to $g(z) = y - f_\rho(x)$ with $\|g\|_\infty \le 2M$ and the data subset $D_j$, we know that there exists another subset $Z^{|D_j|}_{\delta,2}$ of $Z^{|D_j|}$ of measure at least $1-\delta$ such that
$$\left\|(L_K+\lambda I)^{-1/2}\widetilde\Delta_j\right\|_K \le \frac{2M}{\kappa}\,\mathcal{B}_{|D_j|,\lambda}\log(2/\delta), \qquad D_j \in Z^{|D_j|}_{\delta,2}.$$
Combining this with (43) and (41), we know that for $D_j \in Z^{|D_j|}_{\delta,1} \cap Z^{|D_j|}_{\delta,2}$,
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 \le \frac{M^2}{\kappa^2}\left(\frac{\mathcal{B}^2_{|D_j|,\lambda}}{\lambda} + \frac{\mathcal{B}^4_{|D_j|,\lambda}}{\lambda^2}\right)\mathcal{B}^2_{|D_j|,\lambda}\left(2\log(2/\delta)\right)^6.$$
Since the measure of the set $Z^{|D_j|}_{\delta,1}\cap Z^{|D_j|}_{\delta,2}$ is at least $1-2\delta$, by denoting
$$C_{|D_j|,\lambda} = \frac{64M^2}{\kappa^2}\left(\frac{\mathcal{B}^2_{|D_j|,\lambda}}{\lambda} + \frac{\mathcal{B}^4_{|D_j|,\lambda}}{\lambda^2}\right)\mathcal{B}^2_{|D_j|,\lambda},$$
we see that
$$\mathrm{Prob}\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 > C_{|D_j|,\lambda}\left(\log(2/\delta)\right)^6\right] \le 2\delta.$$
For $0 < t < \infty$, the equation $C_{|D_j|,\lambda}\left(\log(2/\delta)\right)^6 = t$ has the solution $\delta_t = 2\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}$. When $\delta_t < 1$, we have
$$\mathrm{Prob}\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2 > t\right] \le 2\delta_t = 4\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}.$$
This inequality holds trivially when $\delta_t \ge 1$ since the probability is at most 1. Thus we can apply the formula (42) to the nonnegative random variable $\xi = \left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2$ and obtain
$$E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widetilde\Delta_j\right\|_K^2\right] = \int_0^\infty\mathrm{Prob}\,[\xi > t]\,dt \le \int_0^\infty 4\exp\left\{-\left(t/C_{|D_j|,\lambda}\right)^{1/6}\right\}dt,$$
which equals $24\,\Gamma(6)\,C_{|D_j|,\lambda}$. Therefore,
$$E\left[\|J_1\|_K\right] \le \left\{\sum_{j=1}^m\frac{|D_j|^2}{|D|^2}\,24\,\Gamma(6)\,C_{|D_j|,\lambda}\right\}^{1/2} \le C''_\kappa M\sqrt\lambda\sum_{j=1}^m\frac{|D_j|}{|D|}\left(\mathcal{S}^2_{|D_j|,\lambda}+\mathcal{S}^3_{|D_j|,\lambda}\right),$$
where $C''_\kappa$ depends only on $\kappa$ and we have used the bound $\mathcal{B}^2_{|D_j|,\lambda} \le 8\kappa^2\left(1+\kappa^2\right)\lambda\,\mathcal{S}^2_{|D_j|,\lambda}$.

For the second term $J_2$ of (39), we apply (27) again and obtain
$$\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j\right\|_K \le \left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi^2_{D_j}}{\lambda}\right)\left\|(L_K+\lambda I)^{-1/2}\widehat\Delta_j\right\|_K.$$
Applying the Schwarz inequality and Lemmas 17 and 18, we get
$$E\left[\left\|\left[L_K^{1/2}Q_{D_j(x)}\right]\widehat\Delta_j\right\|_K\right] \le \left\{E\left[\left(\frac{\Xi_{D_j}}{\sqrt\lambda} + \frac{\Xi^2_{D_j}}{\lambda}\right)^2\right]\right\}^{1/2}\left\{\frac{1}{|D_j|}E\left[(f_\rho(x)-f_\lambda(x))^2\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right]\right\}^{1/2}$$
$$\le \left(\sqrt{\frac{\kappa^2\mathcal{N}(\lambda)}{|D_j|\lambda}} + \frac{\sqrt{49\,\mathcal{B}^4_{|D_j|,\lambda}}}{\lambda}\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}} \le \left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D_j|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D_j|\lambda}\right)\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}.$$
It follows that
$$E\left[\|J_2\|_K\right] \le \sum_{j=1}^m\frac{|D_j|}{|D|}\left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D_j|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D_j|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D_j|\lambda}\right)\right)\frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D_j|\lambda}}.$$
The last term $J_3$ of (39) has been handled by (27) and in the proof of Proposition 20 by ignoring the summand 1 in the bound (31), and we find from the trivial bound $\sigma^2_\rho \le 4M^2$ with $p = \infty$ that
$$E\left[\|J_3\|_K\right] \le \left(\kappa\sqrt{\frac{\mathcal{N}(\lambda)}{|D|\lambda}} + 56\kappa^2\left(\frac{\kappa^2}{(|D|\lambda)^2} + \frac{\mathcal{N}(\lambda)}{|D|\lambda}\right)\right)\left(2M\sqrt{\frac{\mathcal{N}(\lambda)}{|D|}} + \frac{\kappa\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{|D|\lambda}}\right).$$
Combining the above estimates for the three terms of (39), we see that the desired error bound in the $L^2_{\rho_X}$ metric holds true. The estimate in the $H_K$ metric follows from the same steps in deriving the error bound in the $L^2_{\rho_X}$ metric except that in the representation (39) the operator $L_K^{1/2}$ in front disappears. This change gives an additional factor $\frac{1}{\sqrt\lambda}$, the bound for the operator norm $\left\|(L_K+\lambda I)^{-1/2}\right\|$, and proves the desired error bound in the $H_K$ metric.
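The integration step above, from the tail bound $\mathrm{Prob}[\xi > t] \le 4\exp\{-(t/C_{|D_j|,\lambda})^{1/6}\}$ to the expectation bound $24\,\Gamma(6)\,C_{|D_j|,\lambda}$ via formula (42), can be checked numerically; the value of $C$ below is an arbitrary placeholder standing in for $C_{|D_j|,\lambda}$.

```python
import math
import numpy as np

C = 2.0  # placeholder for C_{|D_j|,lambda}; any positive value works

# Formula (42): E[xi] <= \int_0^infty 4 exp{-(t/C)^{1/6}} dt.  Substituting
# t = C u^6 (so dt = 6 C u^5 du) turns the right-hand side into
# 24 C \int_0^infty u^5 e^{-u} du = 24 Gamma(6) C.
n = 2_000_000
du = 80.0 / n                              # integrate u over [0, 80]; the tail beyond is negligible
u = (np.arange(n) + 0.5) * du              # midpoint rule
numeric = float(np.sum(24.0 * C * u ** 5 * np.exp(-u)) * du)
closed_form = 24.0 * math.gamma(6) * C     # Gamma(6) = 5! = 120
```

The factor 24 is the product of the constant 4 from the tail probability and the Jacobian factor 6 from the substitution; the quadrature and the closed form agree to high accuracy.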
Remark 24 A crucial step in the above error analysis for the distributed learning algorithm is to use the unbiasedness and independence to get (40), where the norm is squared inside the expected value. Thus we can obtain the optimal learning rates in expectation for the distributed learning algorithm (2), but we have difficulty in getting rates in probability. It would be interesting to derive error bounds in probability by combining our second order decomposition technique with some analysis in the literature (Caponnetto and De Vito, 2007; Blanchard and Krämer, 2010; Steinwart and Christmann, 2008; Wu and Zhou, 2008).

Proof of Theorem 1 Since $|D_j| = \frac Nm$ for $j = 1,\ldots,m$, we have $\mathcal{S}^2_{|D_j|,\lambda} = m\,\mathcal{C}_m$ with $\mathcal{C}_m := \frac{m}{(N\lambda)^2} + \frac{\mathcal{N}(\lambda)}{N\lambda}$, and the bound in Theorem 23 in the $L^2_{\rho_X}$ metric can be simplified as
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le C'_\kappa M\sqrt\lambda\, m\,\mathcal{C}_m\left(1+\sqrt{m\,\mathcal{C}_m}\right) + C'_\kappa\left\|f_\rho-f_\lambda\right\|_\rho\sqrt{\frac{m}{N\lambda}}\left(1+m\,\mathcal{C}_m\right) + C'_\kappa\left(\sqrt{\mathcal{C}_m}+\mathcal{C}_m\right)\left(M\sqrt\lambda\sqrt{\mathcal{C}_m}+\frac{\left\|f_\rho-f_\lambda\right\|_\rho}{\sqrt{N\lambda}}\right)$$
$$\le C'_\kappa M\left(\sqrt m+1\right)\sqrt\lambda\,\mathcal{C}_m\left\{\sqrt m + m\sqrt{\mathcal{C}_m}\right\} + 2C'_\kappa\left\|f_\rho-f_\lambda\right\|_\rho\sqrt{\frac{m}{N\lambda}}\left(1+m\,\mathcal{C}_m\right).$$
Then the desired error bound in the $L^2_{\rho_X}$ metric with $C_\kappa = 2C'_\kappa$ follows. The proof for the error bound in the $H_K$ metric is similar. The proof of Theorem 1 is complete.

Proof of Corollary 3 As in the proof of Corollary 8, the restriction (8) implies $\mathcal{N}(\lambda) \ge \frac{\lambda_1}{\lambda_1+C_0}$ and $N\lambda \ge \frac{m\lambda_1}{(\lambda_1+C_0)C_0}$. It follows that
$$\frac{m}{(N\lambda)^2} \le \frac{(\lambda_1+C_0)C_0}{\lambda_1}\cdot\frac{1}{N\lambda} \le \frac{(\lambda_1+C_0)^2C_0}{\lambda_1^2}\cdot\frac{\mathcal{N}(\lambda)}{N\lambda},$$
and hence, with $\widetilde C'_\kappa := \frac{(\lambda_1+C_0)^2C_0}{\lambda_1^2}+1$, we have $\mathcal{C}_m \le \widetilde C'_\kappa\,\frac{\mathcal{N}(\lambda)}{N\lambda}$. Putting these bounds into Theorem 1 and applying the restriction $\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le C_0$ from (8), we obtain the desired error bounds in the $L^2_{\rho_X}$ metric and in the $H_K$ metric by taking the constant $\widetilde C_\kappa = 2C_\kappa\left(\widetilde C'_\kappa\right)^3(1+C_0)^2$. This proves Corollary 3.

Proof of Corollary 4 If $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac1{2\alpha}}$ for some constant $C_0 \ge 1$, then the choice $\lambda = \left(\frac mN\right)^{\frac{2\alpha}{2\alpha\max\{2r,1\}+1}}$ satisfies (8). With this choice we also have
$$\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le C_0\left(\frac mN\right)^{\frac{2\alpha(\max\{2r,1\}-1)}{2\alpha\max\{2r,1\}+1}}.$$
Since regularity condition (9) yields $\|f_\lambda - f_\rho\|_\rho \le \|g_\rho\|_\rho\,\lambda^r$ by Lemma 21, we have by Corollary 3,
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le \widetilde C_\kappa\sqrt{C_0}\left\{\|g_\rho\|_\rho\,\sqrt m\left(\frac mN\right)^{\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)+\frac{\alpha}{2\alpha\max\{2r,1\}+1}} + M\left(\frac mN\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}}\right\}.$$
The inequality required for the first term not to exceed the second is equivalent to
$$m^{1+\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)} \le N^{\frac12+\frac{2\alpha}{2\alpha\max\{2r,1\}+1}\left(r-\frac12\right)},$$
and it can be expressed as (10). Since (10) is valid, we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le 2\widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)\left(\frac mN\right)^{\frac{\alpha\max\{2r,1\}}{2\alpha\max\{2r,1\}+1}}.$$
This proves the first desired convergence rate. The second rate follows easily. This proves Corollary 4.

Proof of Corollary 11 By Corollary 9, with the choice $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$, we can immediately bound $\|f_{D,\lambda}-f_\rho\|_\rho$ as $E\left[\|f_{D,\lambda}-f_\rho\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right)$. The assumption $\mathcal{N}(\lambda) = O\left(\lambda^{-\frac1{2\alpha}}\right)$ tells us that for some constant $C_0 \ge 1$, $\mathcal{N}(\lambda) \le C_0\lambda^{-\frac1{2\alpha}}$ for all $\lambda > 0$. So the choice $\lambda = N^{-\frac{2\alpha}{4\alpha r+1}}$ yields
$$\frac{m\,\mathcal{N}(\lambda)}{N\lambda} \le \frac{C_0\,m\,\lambda^{-\frac{1+2\alpha}{2\alpha}}}{N} = C_0\,m\,N^{\frac{1+2\alpha}{4\alpha r+1}-1} = C_0\,m\,N^{-\frac{2\alpha(2r-1)}{4\alpha r+1}}. \quad (44)$$
If $m$ satisfies (13), then (8) is valid, and by Corollary 3,
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le \widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)N^{-\frac{2\alpha r}{4\alpha r+1}}\left(m\,N^{-\frac{2\alpha(2r-1)+\frac12}{4\alpha r+1}} + 1\right).$$
Since (13) is satisfied, we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right] \le 2\widetilde C_\kappa\sqrt{C_0}\left(\|g_\rho\|_\rho + M\right)N^{-\frac{2\alpha r}{4\alpha r+1}},$$
and thereby $E\left[\left\|\overline{f}_{D,\lambda}-f_\rho\right\|_\rho\right] = O\left(N^{-\frac{2\alpha r}{4\alpha r+1}}\right)$. This proves Corollary 11.

Acknowledgments

We thank the anonymous referees for their constructive suggestions. The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 11304114]. The corresponding author is Ding-Xuan Zhou.

This appendix provides detailed proofs of the norm estimates stated in Lemmas 17 and 18 involving the approximation of $L_K$ by $L_{K,D(x)}$. To this end, we need the following probability inequality for vector-valued random variables established in (Pinelis, 1994).
Lemma 25 For a random variable $\xi$ on $(Z,\rho)$ with values in a separable Hilbert space $(H, \|\cdot\|)$ satisfying $\|\xi\| \le \widetilde M < \infty$ almost surely, and a random sample $\{z_i\}_{i=1}^s$ independently drawn according to $\rho$, there holds with confidence $1-\widetilde\delta$,
$$\left\|\frac1s\sum_{i=1}^s\left[\xi(z_i) - E(\xi)\right]\right\| \le \frac{2\widetilde M\log(2/\widetilde\delta)}{s} + \sqrt{\frac{2E\left[\|\xi\|^2\right]\log(2/\widetilde\delta)}{s}}.$$

Proof of Lemma 17 We apply Lemma 25 to the random variable $\eta_1$ defined by
$$\eta_1(x) = (L_K+\lambda I)^{-1/2}\left(\langle\cdot, K_x\rangle_K K_x\right), \qquad x \in X. \quad (46)$$
It takes values in $HS(H_K)$, the Hilbert space of Hilbert-Schmidt operators on $H_K$, with inner product $\langle A,B\rangle_{HS} = \mathrm{Tr}(B^T A)$. Here $\mathrm{Tr}$ denotes the trace of a (trace-class) linear operator. The norm is given by $\|A\|^2_{HS} = \sum_i\|Ae_i\|_K^2$, where $\{e_i\}$ is an orthonormal basis of $H_K$. The space $HS(H_K)$ is a subspace of the space of bounded linear operators on $H_K$, denoted as $(\mathcal{L}(H_K), \|\cdot\|)$, with the norm relations
$$\|A\| \le \|A\|_{HS}, \qquad \|AB\|_{HS} \le \|A\|_{HS}\|B\|. \quad (47)$$
Now we use effective dimensions to estimate norms involving $\eta_1$. The random variable $\eta_1$ defined by (46) has mean $E(\eta_1) = (L_K+\lambda I)^{-1/2}L_K$ and sample mean $(L_K+\lambda I)^{-1/2}L_{K,D(x)}$. Recall the set of normalized (in $H_K$) eigenfunctions $\{\varphi_i\}_i$ of $L_K$. It is an orthonormal basis of $H_K$. If we regard $L_K$ as an operator on $L^2_{\rho_X}$, the normalized eigenfunctions in $L^2_{\rho_X}$ are $\left\{\frac{1}{\sqrt{\lambda_i}}\varphi_i\right\}_i$ and they form an orthonormal basis of the orthogonal complement of the eigenspace associated with the eigenvalue 0. By the Mercer Theorem, we have the following uniformly convergent Mercer expansion
$$K(x,y) = \sum_i\lambda_i\,\frac{1}{\sqrt{\lambda_i}}\varphi_i(x)\,\frac{1}{\sqrt{\lambda_i}}\varphi_i(y) = \sum_i\varphi_i(x)\varphi_i(y). \quad (48)$$
Take the orthonormal basis $\{\varphi_i\}_i$ of $H_K$. By the definition of the $HS$ norm, we have
$$\|\eta_1(x)\|^2_{HS} = \sum_i\left\|(L_K+\lambda I)^{-1/2}\langle\varphi_i, K_x\rangle_K K_x\right\|_K^2.$$
For a fixed $i$, $\langle\varphi_i, K_x\rangle_K K_x = \varphi_i(x)K_x$, and $K_x \in H_K$ can be expanded in the orthonormal basis $\{\varphi_\ell\}_\ell$ as
$$K_x = \sum_\ell\langle\varphi_\ell, K_x\rangle_K\varphi_\ell = \sum_\ell\varphi_\ell(x)\varphi_\ell. \quad (49)$$
Hence
$$\|\eta_1(x)\|^2_{HS} = \sum_i(\varphi_i(x))^2\left\|(L_K+\lambda I)^{-1/2}K_x\right\|_K^2 = \sum_i(\varphi_i(x))^2\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}.$$
Combining this with (48), we see that
$$\|\eta_1(x)\|^2_{HS} = K(x,x)\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}, \qquad x \in X, \quad (50)$$
and therefore
$$E\left[\|\eta_1\|^2_{HS}\right] \le \kappa^2 E\left[\sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}\right] = \kappa^2\sum_\ell\frac{1}{\lambda_\ell+\lambda}\int_X(\varphi_\ell(x))^2\,d\rho_X, \qquad \int_X(\varphi_\ell(x))^2\,d\rho_X = \|\varphi_\ell\|^2_{L^2_{\rho_X}} = \left\|L_K^{1/2}\varphi_\ell\right\|_K^2 = \lambda_\ell. \quad (51)$$
Hence
$$E\left[\|\eta_1\|^2_{HS}\right] \le \kappa^2\sum_\ell\frac{\lambda_\ell}{\lambda_\ell+\lambda} = \kappa^2\,\mathrm{Tr}\left((L_K+\lambda I)^{-1}L_K\right) = \kappa^2\mathcal{N}(\lambda). \quad (52)$$
Since $\frac{1}{|D|}\sum_{x\in D(x)}\eta_1(x) - E[\eta_1] = (L_K+\lambda I)^{-1/2}\left(L_{K,D(x)} - L_K\right)$, the independence yields
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\|^2_{HS}\right] \le \frac{1}{|D|}E\left[\|\eta_1\|^2_{HS}\right] \le \frac{\kappa^2\mathcal{N}(\lambda)}{|D|}.$$
Then our desired inequality in Part (a) follows from the first inequality of (47).

From (49) and (50), we find a bound for $\eta_1$ as
$$\|\eta_1(x)\|_{HS} \le \kappa\left(\frac1\lambda\sum_\ell(\varphi_\ell(x))^2\right)^{1/2} \le \frac{\kappa^2}{\sqrt\lambda}.$$
Applying Lemma 25 to the random variable $\eta_1$ with $\widetilde M = \frac{\kappa^2}{\sqrt\lambda}$, we know by (47) that with confidence at least $1-\delta$,
$$\left\|(L_K+\lambda I)^{-1/2}\left(L_K - L_{K,D(x)}\right)\right\| \le \left\|\frac{1}{|D|}\sum_{x\in D(x)}\eta_1(x) - E[\eta_1]\right\|_{HS} \le \frac{2\kappa^2\log(2/\delta)}{|D|\sqrt\lambda} + \sqrt{\frac{2\kappa^2\mathcal{N}(\lambda)\log(2/\delta)}{|D|}}.$$
Writing the above bound by taking out a factor $\frac{2\kappa\log(2/\delta)}{\sqrt{|D|}}$, we get the desired bound (29).

Recall $\mathcal{B}_{|D|,\lambda}$ defined by (30). To prove Part (c), apply the formula (42) for nonnegative random variables to $\xi = \left\|(L_K+\lambda I)^{-1/2}\left(L_K-L_{K,D(x)}\right)\right\|^d$ and use the bound
$$\mathrm{Prob}\,[\xi > t] = \mathrm{Prob}\left[\xi^{\frac1d} > t^{\frac1d}\right] \le 2\exp\left\{-\frac{t^{1/d}}{\mathcal{B}_{|D|,\lambda}}\right\}$$
derived from (29) for $t \ge \left(\log 2\right)^d\mathcal{B}^d_{|D|,\lambda}$. We find
$$E\left[\left\|(L_K+\lambda I)^{-1/2}\left(L_K-L_{K,D(x)}\right)\right\|^d\right] \le \left(\log 2\right)^d\mathcal{B}^d_{|D|,\lambda} + \int_{(\log 2)^d\mathcal{B}^d_{|D|,\lambda}}^\infty 2\exp\left\{-\frac{t^{1/d}}{\mathcal{B}_{|D|,\lambda}}\right\}dt.$$
The second term on the right-hand side of the above inequality is at most $2d\,\mathcal{B}^d_{|D|,\lambda}\int_0^\infty u^{d-1}\exp\{-u\}\,du$. Then the desired bound in Part (c) follows from $\int_0^\infty u^{d-1}\exp\{-u\}\,du = \Gamma(d)$, and the lemma is proved.

Proof of Lemma 18 Consider the random variable $\eta_2$ defined by
$$\eta_2(z) = (L_K+\lambda I)^{-1/2}(K_x), \qquad z = (x,y) \in Z. \quad (53)$$
It takes values in $H_K$. By (49), it satisfies
$$\|\eta_2(z)\|_K^2 = \sum_\ell\frac{(\varphi_\ell(x))^2}{\lambda_\ell+\lambda}, \qquad E\left[\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K^2\right] = \sum_\ell\frac{\lambda_\ell}{\lambda_\ell+\lambda} = \mathcal{N}(\lambda).$$
This is the statement of Part (a). We also see from (49) that for almost every $x \in X$,
$$\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K \le \frac{1}{\sqrt\lambda}\left(\sum_\ell(\varphi_\ell(x))^2\right)^{1/2} \le \frac{\kappa}{\sqrt\lambda}.$$
This verifies the statement of Part (b). For Part (c), we consider another random variable $\eta_3$ defined by
$$\eta_3(z) = (L_K+\lambda I)^{-1/2}(g(z)K_x), \qquad z = (x,y) \in Z. \quad (54)$$
It takes values in $H_K$ and satisfies
$$\|\eta_3(z)\|_K = |g(z)|\left\|(L_K+\lambda I)^{-1/2}(K_x)\right\|_K \le \frac{\kappa\|g\|_\infty}{\sqrt\lambda}, \qquad E\left[\|\eta_3\|_K^2\right] \le \|g\|^2_\infty\, E\left[\|\eta_2\|_K^2\right] = \|g\|^2_\infty\,\mathcal{N}(\lambda).$$
Applying Lemma 25 verifies the statement in Part (c). The proof of the lemma is complete.

References

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Proceedings of the 26th Annual Conference on Learning Theory, 185-209, 2013.

F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory.
Journal of Complexity, 23:52-72, 2007.

G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. Advances in Neural Information Processing Systems, 226-234, 2010.

G. Blanchard and N. Krämer. Convergence rates for kernel conjugate gradient for random design regression. Analysis and Applications, 14:763-794, 2016.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.

A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications, 8:161-183, 2010.

D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. Journal of Machine Learning Research, 5:1143-1175, 2004.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and X. Lin. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165-202, 2012.

E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59-85, 2005.

E. De Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10:455-479, 2010.

E. De Vito, V. Umanità, and S. Villa. An extension of Mercer theorem to vector-valued measurable kernels. Applied and Computational Harmonic Analysis, 34:339-351, 2013.

D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1-50, 2000.

J. Fan, T. Hu, Q. Wu, and D. X. Zhou.
Consistency analysis of an empirical minimum error entropy algorithm. Applied and Computational Harmonic Analysis, 41:164-189, 2016.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, 2002.

A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 17:1-65, 2016.

X. Guo and D. X. Zhou. An empirical feature-based learning algorithm producing sparse approximations. Applied and Computational Harmonic Analysis, 32:389-400, 2012.

Z. C. Guo, D. H. Xiang, X. Guo, and D. X. Zhou. Thresholded spectral algorithms for sparse approximations. Analysis and Applications, 15:433-455, 2017.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, Berlin, 2002.

T. Hu, J. Fan, Q. Wu, and D. X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13:437-455, 2015.

J. H. Lin, L. Rosasco, and D. X. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17(77):1-38, 2016.

J. H. Lin and D. X. Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341-3365, 2015.

S. Mendelson and J. Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526-565, 2010.

M. Meister and I. Steinwart. Optimal learning rates for localized SVMs. Journal of Machine Learning Research, 17:1-44, 2016.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22:1679-1706, 1994.

G. Raskutti, M. Wainwright, and B. Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15:335-366, 2014.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization.
Advances in Neural Information Processing Systems, 1657-1665, 2015.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In the 52nd Annual Allerton Conference on Communication, Control and Computing, 2014.

W. J. Shen, H. S. Wong, Q. W. Xiao, X. Guo, and S. Smale. Introduction to the peptide binding problem of computational immunology: new results. Foundations of Computational Mathematics, 14:951-984, 2014.

L. Shi, Y. L. Feng, and D. X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Applied and Computational Harmonic Analysis, 31:286-302, 2011.

S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153-172, 2007.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (S. Dasgupta and A. Klivans, eds.), pp. 79-93, 2009.

I. Steinwart and C. Scovel. Mercer's theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363-417, 2012.

P. Thomann, I. Steinwart, I. Blaschzyk, and M. Meister. Spatial decompositions for large scale SVMs. arXiv:1612.00374v1, 2016.

R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47:2516-2532, 2001.

Q. Wu, Y. Ying, and D. X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6:171-192, 2006.

Q. Wu and D. X. Zhou. Learning with sample dependent hypothesis spaces.
Computers and Mathematics with Applications, 56:2896-2907, 2008.

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26:289-315, 2007.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077-2098, 2005.

Y. C. Zhang, J. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321-3363, 2013.

Y. C. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299-3340, 2015.

D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739-767, 2002.

D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743-1752, 2003.

Z. H. Zhou, N. V. Chawla, Y. Jin, and G. J. Williams. Big data opportunities and challenges: Discussions from data analytics perspectives. IEEE Computational Intelligence Magazine, 9:62-74, 2014.