Journal of Machine Learning Research 21 (2020) 1-38 Submitted 7/19; Revised 2/20; Published 4/20

Distributed Kernel Ridge Regression with Communications

Shao-Bo Lin sblin1983@gmail.com
Center of Intelligent Decision-Making and Machine Learning, School of Management, Xi'an Jiaotong University, Xi'an, China

Di Wang wangdi@wzu.edu.cn
Center of Intelligent Decision-Making and Machine Learning, School of Management, Xi'an Jiaotong University, Xi'an, China

Ding-Xuan Zhou mazhou@cityu.edu.hk
School of Data Science and Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China

Editor: John Shawe-Taylor

Abstract

This paper focuses on generalization performance analysis for distributed algorithms in the framework of learning theory. Taking distributed kernel ridge regression (DKRR) as an example, we succeed in deriving its optimal learning rates in expectation and providing theoretically optimal ranges for the number of local processors. Due to the gap between theory and experiments, we also deduce optimal learning rates for DKRR in probability to essentially reflect the generalization performance and limitations of DKRR. Furthermore, we propose a communication strategy to improve the learning performance of DKRR and demonstrate the power of communications in DKRR via both theoretical assessments and numerical experiments.

Keywords: learning theory, distributed learning, kernel ridge regression, communication

1. Introduction

Commonly in this era, data of huge size are stored on numerous machines and cannot be shared in order to protect data privacy. Typical examples include clinical data in medicine, where medical data are collected in different hospitals, and financial data in business, where commercial data are generated in different companies. These distributively stored data bring a new challenge for machine learning in the sense that everyone would like to use the others' data but is unwilling to share their own. Nonparametric distributed learning (NDL) (Zhang et al., 2015; Lin et al., 2017) presents a preferable approach to conquer this challenge by combining the prediction results from many local processors without sharing individual data with each other.

∗ Corresponding author.

©2020 Shao-Bo Lin, Di Wang, and Ding-Xuan Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-592.html.

Figure 1: Training and testing flows of the divide-and-conquer learning. (a) Flow of training; (b) flow of testing.

There are three ingredients of NDL: local processing, communication, and synthesization. The local processing issue refers to applying a particular learning algorithm such as kernel ridge regression (Zhang et al., 2015), local average regression (Chang et al., 2017), multi-penalty regularization (Guo et al., 2019), coefficient-based regularization (Pang and Sun, 2018; Shi, 2019), and spectral algorithms (Guo et al., 2017; Mücke and Blanchard, 2018; Lin et al., 2020) to tackle the data subset in a local machine and produce a local estimator. The communication issue focuses on exchanging exclusive information such as data (Bellet et al., 2015), gradients (Zeng and Yin, 2018) and local estimators (Huang and Huo, 2019) between different local machines.
The synthesization issue devotes to producing a global estimator by combining local estimators and communicated information on the global machine, typical strategies of which are the majority voting (Mann et al., 2009), weighted average (Chang et al., 2017) and gradient-based algorithms (Bellet et al., 2015). One of the most popular NDL approaches is the divide-and-conquer learning, in which the communication is not required and the weighted average is used in the synthesization issue. Figure 1 presents the training and testing flows of the divide-and-conquer learning. As shown in Figure 1, to give a prediction of a query point xt, only a real number f Dj,λ(xt) is communicated to the global machine, which succeeds in protecting the data privacy of each local machine. Generalization performances of the divide-and-conquer learning have been proved to be similar to running the corresponding algorithm processing the whole data on a single but large enough machine (Zhang et al., 2015; Chang et al., 2017; Lin et al., 2017; Guo et al., 2017; M ucke and Blanchard, 2018; Pang and Sun, 2018; Shi, 2019; Lin et al., 2020). The theoretical problem is, however, that there is a strict restriction on the number of local machines to guarantee the optimal generalization performance, which is difficult to be satisfied in real applications. In this paper, taking the DKRR to be the specific algorithm, we aim at enlarging the number of local machines by considering communications among different local machines. There are three purposes in our study. At first, we improve the existing results for DKRR Distributed Kernel Ridge Regression with Communications in expectation in the sense of removing the eigen-function assumption in (Zhang et al., 2015) and relaxing the regularity assumption in (Lin et al., 2017). Our main tool to achieve this goal is a tight operator product estimate based on a recently developed concentration inequality for self-adjoint operators (Minsker, 2017). These estimates improve the results in (Lin et al., 2017; Guo et al., 2017), where the second order decomposition of operator differences and a classical concentration inequality in (Pinelis, 1994) are used. Since generalization error estimates in probability quantify the generalization performance of DKRR in a single trial while estimates in expectation describe the average error, it is highly desired to deduce optimal learning rates for DKRR in probability. However, almost all existing results for DKRR are established in expectation (Zhang et al., 2015; Lin et al., 2017; Chang et al., 2017). The main reason is that the power of averaging in DKRR can be directly reflected by expectation, provided the samples are assumed to be drawn independently and identically according to some distribution. Our second purpose is to derive optimal learning rates for DKRR in probability, by means of a novel error decomposition technique motivated by Lin and Zhou (2018). Since the advantage of averaging cannot be directly used, the restriction on the number of local machines is a bit strict. Our estimates in probability support numerical observations that cannot be seen from the estimates in expectation. Our last purpose is to develop a communication strategy to improve the performance of DKRR. Combining the recently developed integral approach (Lin et al., 2017; Guo et al., 2017) with a Newton-Raphson iteration, we design a communication strategy to enlarge the number of local machines to guarantee optimal learning rates for DKRR. 
Our basic idea is to communicate gradients of each local machine and use the Newton-Raphson iteration on the global machine to synthesize the global estimator. Both theoretical analysis and numerical experiments are conducted to verify the power of communications. Theoretically, we prove that, in the sense of probability, DKRR with communications can reach the optimal learning rates, while the restriction on the number of local machines is the same as that in expectation. Numerically, we exhibit that communications enlarge the number of local machines of DKRR and thus essentially improve its learning performance.

The rest of the paper is organized as follows. In the next section, we present the communication strategy as well as its motivation. In Section 3, theoretical results including optimal learning rates for DKRR in expectation, optimal learning rates for DKRR in probability and optimal learning rates for DKRR with communications in probability are given. Section 4 makes some comparisons between our results and related work. In Section 5, we provide the main tools in our analysis, where a novel concentration inequality is used to bound the difference between integral operators and their empirical counterparts and some novel error decomposition strategies are adopted to quantify the generalization error. In Section 6, we prove the theoretical results presented in Section 3. In the last section, we conduct a series of numerical studies to verify the superior performance of DKRR with communications.

2. DKRR with Communications

In this section, we propose a novel communication strategy for DKRR to improve its learning performance.

Figure 2: Comparison of DKRR, KRR, and the local approximation as the number of local machines varies. The kernel is $K(t,t') = 1 + \min\{t,t'\}$ for $t,t' \in [0,1]$. Data $\{(t_i, y_i)\}_{i=1}^{10000}$ are generated with $\{t_i\}_{i=1}^{10000}$ drawn i.i.d. according to the uniform distribution on $[0,1]$ and $y = f(t) + \varepsilon$, where $f(t) = \min\{t, 1-t\}$ and $\varepsilon \sim \mathcal{N}(0, 0.2)$. The green line (local approximation) denotes the performance of running KRR on a local machine with a noiseless data subset (of size $10000/m$), $\{(t_i, f(t_i))\}$.

2.1. Limitations of DKRR without Communications

Let $m$ be the number of local machines, $D_j = \{(x_{i,j}, y_{i,j})\}_{i=1}^{|D_j|}$ be the data subset stored in the $j$-th local machine with $1 \le j \le m$, and $D = \bigcup_{j=1}^m D_j$ be the disjoint union of $\{D_j\}_{j=1}^m$, where $|D_j|$ denotes the cardinality of $D_j$. Write $D_j(x) := \{x : (x,y) \in D_j\}$. Let $(\mathcal{H}_K, \|\cdot\|_K)$ be the reproducing kernel Hilbert space (RKHS) induced by a Mercer kernel $K$ on a compact metric (input) space $\mathcal{X}$. DKRR is defined (Zhang et al., 2015) with a regularization parameter $\lambda > 0$ by
$$
\bar{f}^0_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|} f_{D_j,\lambda}, \qquad (1)
$$
where, for a data set $D$,
$$
f_{D,\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{|D|} \sum_{(x,y) \in D} (f(x) - y)^2 + \lambda \|f\|_K^2 \right\}. \qquad (2)
$$
The limitations of DKRR were studied in (Shang and Cheng, 2017) and (Liu et al., 2018) by presenting a sharp upper bound on $m$ to guarantee comparable performances for DKRR and kernel ridge regression (KRR). Their core idea is that the weighted average in (1) cannot improve the approximation ability of KRR in each local machine. The representer theorem shows
$$
f_{D_j,\lambda} \in \mathcal{H}_{K,D_j} := \left\{ \sum_{x \in D_j(x)} a_x K_x : a_x \in \mathbb{R} \right\},
$$
where $K_x := K(x, \cdot)$. Since $\mathcal{H}_{K,D_j}$ is a $|D_j|$-dimensional linear space, its approximation ability becomes worse as $m$ increases, just as the trend of the green line in Figure 2 shows. Therefore, it is impossible to derive comparable generalization errors for DKRR (blue line in Figure 2) and KRR with the whole data (black line in Figure 2) when $m$ is larger than $m_1$.
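To make the divide-and-conquer baseline and the setup of Figure 2 concrete, the following is a minimal NumPy sketch of the estimator in (1)-(2) on the one-dimensional example from the caption. The sample size, regularization parameter, and random splits below are illustrative choices, not the paper's exact experimental protocol (the paper selects regularization parameters by grid search).

```python
import numpy as np

def kernel(s, t):
    # K(t, t') = 1 + min{t, t'}, the kernel used in Figure 2
    return 1.0 + np.minimum(s[:, None], t[None, :])

def krr_fit(x, y, lam):
    # dual coefficients of KRR for objective (2): (K + n*lam*I)^{-1} y
    n = len(x)
    return np.linalg.solve(kernel(x, x) + n * lam * np.eye(n), y)

def dkrr_predict(x, y, m, lam, x_test):
    # equation (1): weighted average of the m local KRR estimators
    pred = np.zeros(len(x_test))
    for idx in np.array_split(np.random.permutation(len(x)), m):
        a = krr_fit(x[idx], y[idx], lam)
        pred += (len(idx) / len(x)) * (kernel(x_test, x[idx]) @ a)
    return pred

# illustrative run (smaller sample than Figure 2 to keep the demo light)
rng = np.random.default_rng(0)
N, lam = 2000, 1e-3
x = rng.uniform(0, 1, N)
y = np.minimum(x, 1 - x) + rng.normal(0, 0.2, N)
x_test = rng.uniform(0, 1, 500)
f_test = np.minimum(x_test, 1 - x_test)
for m in (10, 50, 200):
    mse = np.mean((dkrr_predict(x, y, m, lam, x_test) - f_test) ** 2)
    print(m, mse)
```

Running such a sketch typically reproduces the qualitative trend of Figure 2: the averaged estimator tracks KRR for small $m$ and degrades once $m$ grows too large.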
An ideal range of $m$ to guarantee the optimal generalization performance of DKRR is $[1, m_1]$. However, as shown in Figure 2, the practical range $[1, m_2]$ is much narrower than $[1, m_1]$. This phenomenon indicates that the main bottleneck for DKRR is not the approximation ability but the fact that the weighted average is not good enough to compensate for the loss of samples. Thus, efficient communication strategies and synthesization methods are required to enlarge the range of $m$ that guarantees the best generalization performance of distributed learning.

2.2. Motivations from Operator Representations

Before presenting our communication strategy, we first give its motivation. Let $S_D : \mathcal{H}_K \to \mathbb{R}^{|D|}$ be the sampling operator (Smale and Zhou, 2007) defined by $S_D f := (f(x))_{(x,y) \in D}$. Its scaled adjoint $S_D^T : \mathbb{R}^{|D|} \to \mathcal{H}_K$ is given by
$$
S_D^T c := \frac{1}{|D|} \sum_{i=1}^{|D|} c_i K_{x_i}, \qquad c := (c_1, c_2, \dots, c_{|D|})^T \in \mathbb{R}^{|D|}.
$$
Define
$$
L_{K,D} f := S_D^T S_D f = \frac{1}{|D|} \sum_{(x,y) \in D} f(x) K_x.
$$
Then, it can be found in (Smale and Zhou, 2007) and (Lin et al., 2017) respectively that
$$
f_{D,\lambda} = (L_{K,D} + \lambda I)^{-1} S_D^T y_D \qquad (3)
$$
and
$$
\bar{f}^0_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|} \left( L_{K,D_j} + \lambda I \right)^{-1} S_{D_j}^T y_{D_j},
$$
where $y_D := (y_1, \dots, y_{|D|})^T$. For an arbitrary $f \in \mathcal{H}_K$, we have
$$
f_{D,\lambda} = f - (L_{K,D} + \lambda I)^{-1} \left[ (L_{K,D} + \lambda I) f - S_D^T y_D \right], \qquad (4)
$$
$$
\bar{f}^0_{D,\lambda} = f - \sum_{j=1}^m \frac{|D_j|}{|D|} \left( L_{K,D_j} + \lambda I \right)^{-1} \left[ \left( L_{K,D_j} + \lambda I \right) f - S_{D_j}^T y_{D_j} \right]. \qquad (5)
$$
Since the (half) gradient of the empirical risk in (2) over $\mathcal{H}_K$ at $f$ is
$$
G_{D,\lambda,f} = \frac{1}{|D|} \sum_{(x,y) \in D} (f(x) - y) K_x + \lambda f = (L_{K,D} + \lambda I) f - S_D^T y_D
$$
and the Hessian is
$$
H_{D,\lambda} = \frac{1}{|D|} \sum_{(x,y) \in D} \langle \cdot, K_x \rangle_K K_x + \lambda I = L_{K,D} + \lambda I,
$$
(4) and (5) can be regarded as the well-known Newton-Raphson iteration. Comparing (4) with (5) and noting that the global gradient can be obtained via communications, i.e., $G_{D,\lambda,f} = \sum_{j=1}^m \frac{|D_j|}{|D|} G_{D_j,\lambda,f}$, we aim at designing a communication strategy via a Newton-Raphson iteration of the form
$$
\bar{f}^{\ell}_{D,\lambda} = \bar{f}^{\ell-1}_{D,\lambda} - \sum_{j=1}^m \frac{|D_j|}{|D|} \left( L_{K,D_j} + \lambda I \right)^{-1} \left[ (L_{K,D} + \lambda I) \bar{f}^{\ell-1}_{D,\lambda} - S_D^T y_D \right], \qquad \ell \in \mathbb{N}. \qquad (6)
$$

2.3. DKRR with Communications

In order to derive an estimator with the operator representation (6), we propose a communication strategy for DKRR($\ell$) by iterating the following procedure for $\ell = 1, \dots, L$.

Step 1. Communicate the global estimator $\bar{f}^{\ell-1}_{D,\lambda}$ to the local machines and compute the local gradient function $G_{D_j,\lambda,\ell} := G_{D_j,\lambda,\bar{f}^{\ell-1}_{D,\lambda}}$.

Step 2. Communicate $\{G_{D_j,\lambda,\ell} : j = 1, \dots, m\}$ back to the global machine and synthesize the global gradient $G_{D,\lambda,\ell} := \sum_{j=1}^m \frac{|D_j|}{|D|} G_{D_j,\lambda,\ell}$.

Step 3. Communicate the gradient function $G_{D,\lambda,\ell}$ to each local machine and generate the gradient data $G_{j,\ell} = \{(x, G_{D,\lambda,\ell}(x)) : x \in D_j(x)\}$. Then run KRR on the data $G_{j,\ell}$ to obtain a function
$$
g_{D_j,\lambda,\ell} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{|D_j|} \sum_{(x,y) \in G_{j,\ell}} (f(x) - y)^2 + \lambda \|f\|_K^2 \right\}, \qquad j = 1, \dots, m. \qquad (7)
$$

Step 4. Communicate $g_{D_j,\lambda,\ell}$ back to the global machine and obtain
$$
\bar{f}^{\ell}_{D,\lambda} = \bar{f}^{\ell-1}_{D,\lambda} - \frac{1}{\lambda} \sum_{j=1}^m \frac{|D_j|}{|D|} \left( G_{D,\lambda,\ell} - g_{D_j,\lambda,\ell} \right). \qquad (8)
$$

Due to (3) and (7), we have $g_{D_j,\lambda,\ell} = (L_{K,D_j} + \lambda I)^{-1} L_{K,D_j} G_{D,\lambda,\ell}$. This together with the identity
$$
\frac{1}{\lambda} \left[ I - (L_{K,D_j} + \lambda I)^{-1} L_{K,D_j} \right] G_{D,\lambda,\ell} = (L_{K,D_j} + \lambda I)^{-1} G_{D,\lambda,\ell}
$$
and (8) yields (6). The training and testing flows of DKRR($\ell$) are exhibited in Figure 3. Comparing Figure 1 with Figure 3, communications are required in both the training and testing stages of DKRR($\ell$). Noticing that communicating functions is infeasible in practice, Appendix B presents a simple realization of DKRR($\ell$) by communicating input data. We believe that there are other efficient implementations of DKRR($\ell$) and leave them as future studies.

Figure 3: Training and testing flows of distributed learning with communications. (a) Flow of training; (b) flow of testing.
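To make Steps 1-4 concrete, here is a minimal single-process NumPy sketch of DKRR($\ell$) in dual (coefficient) form. For compactness it assumes a Gaussian kernel and that the full kernel matrix is available in one place, so it mimics the arithmetic of (6)-(8) rather than an actual multi-machine deployment, where only gradient values at the local inputs would be exchanged (cf. Appendix B). All function and variable names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.2):
    # pairwise Gaussian kernel matrix; A, B are (n, d) and (p, d) arrays
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def dkrr_with_communications(X, y, blocks, lam, n_comm, kernel=gaussian_kernel):
    """Sketch of DKRR(l); blocks is a list of index arrays partitioning range(len(X))."""
    N = len(X)
    K = kernel(X, X)                  # full kernel matrix, held centrally for illustration
    alpha = np.zeros(N)               # coefficients of \bar f over {K_{x_i}}, i = 1..N
    # initialisation: weighted average of local KRR estimators, equations (1)-(2)
    for idx in blocks:
        nj = len(idx)
        Kjj = K[np.ix_(idx, idx)]
        alpha[idx] += (nj / N) * np.linalg.solve(Kjj + nj * lam * np.eye(nj), y[idx])
    # communication rounds, Steps 1-4 / equation (6)
    for _ in range(n_comm):
        # global gradient G = (L_{K,D} + lam*I) f - S_D^T y_D, in coefficient form
        g = K @ alpha / N + lam * alpha - y / N
        correction = np.zeros(N)
        for idx in blocks:
            nj = len(idx)
            Kjj = K[np.ix_(idx, idx)]
            G_vals = K[idx] @ g       # values of the gradient function at D_j's inputs
            beta = np.linalg.solve(Kjj + nj * lam * np.eye(nj), G_vals)   # equation (7)
            local = np.zeros(N)
            local[idx] = beta
            correction += (nj / N) * (g - local) / lam                    # equation (8)
        alpha = alpha - correction
    return alpha   # predict at x' via sum_i alpha_i K(x_i, x')
```

The design mirrors the operator derivation: each round synthesizes the global gradient, applies the local Newton-Raphson correction $(L_{K,D_j}+\lambda I)^{-1}$ through a local KRR fit to the gradient values, and averages the corrections on the global machine.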
3. Main Results

In this section, we analyze the generalization performances of the proposed algorithm as well as DKRR in a standard regression setting (Cucker and Zhou, 2007). Let a sample $D = \{(x_i, y_i)\}_{i=1}^N$ be independently drawn according to $\rho$, a Borel probability measure on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \mathbb{R}$. The primary objective is the regression function defined by
$$
f_\rho(x) = \int_{\mathcal{Y}} y \, d\rho(y|x), \qquad x \in \mathcal{X},
$$
where $\rho(y|x)$ denotes the conditional distribution at $x$ induced by $\rho$. Throughout the paper, we assume that $K$ is a Mercer kernel and $\mathcal{X}$ is compact, which implies $\kappa := \sqrt{\sup_{x \in \mathcal{X}} K(x,x)} < \infty$.

3.1. Optimal Learning Rates for DKRR in Expectation

To derive optimal learning rates for DKRR, we need some assumptions on the decay of the outputs, the regularity of the regression function and the capacity of $\mathcal{H}_K$.

Assumption 1 We assume $\int_{\mathcal{Y}} y^2 d\rho < \infty$ and
$$
\int_{\mathcal{Y}} \left( \exp\left\{ \frac{|y - f_\rho(x)|}{M} \right\} - \frac{|y - f_\rho(x)|}{M} - 1 \right) d\rho(y|x) \le \frac{\gamma^2}{2M^2}, \qquad \forall x \in \mathcal{X}, \qquad (9)
$$
where $M$ and $\gamma$ are positive constants.

Condition (9) is satisfied if the noise is uniformly bounded, Gaussian or sub-Gaussian (Caponnetto and De Vito, 2007). Let $\rho_X$ be the marginal distribution of $\rho$ and $L^2_{\rho_X}$ be the Hilbert space of $\rho_X$ square-integrable functions on $\mathcal{X}$, with norm denoted by $\|\cdot\|_\rho$. The Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defines an integral operator $L_K$ on $\mathcal{H}_K$ (or $L^2_{\rho_X}$) by
$$
L_K f = \int_{\mathcal{X}} K_x f(x) \, d\rho_X, \qquad f \in \mathcal{H}_K \ (\text{or } f \in L^2_{\rho_X}).
$$
Our second assumption is the capacity assumption measured by the effective dimension (Guo et al., 2017; Lin et al., 2017),
$$
\mathcal{N}(\lambda) = \mathrm{Tr}\big( (\lambda I + L_K)^{-1} L_K \big), \qquad \lambda > 0.
$$

Assumption 2 There exists some $s \in (0,1]$ such that
$$
\mathcal{N}(\lambda) \le C_0 \lambda^{-s}, \qquad (10)
$$
where $C_0 \ge 1$ is a constant independent of $\lambda$.

Condition (10) with $s = 1$ is always satisfied by taking the constant $C_0 = \mathrm{Tr}(L_K) \le \kappa^2$. For $0 < s < 1$, it was shown in (Guo et al., 2017, Page 7) that (10) is slightly more general than the eigenvalue decaying assumption in (Caponnetto and De Vito, 2007) and has been employed in (Blanchard and Krämer, 2016; Guo et al., 2017; Lin et al., 2017; Chang et al., 2017; Mücke and Blanchard, 2018) to derive optimal learning rates for kernel-based algorithms.

Assumption 3 For $r > 0$, assume
$$
f_\rho = L_K^r h_\rho \qquad \text{for some } h_\rho \in L^2_{\rho_X}, \qquad (11)
$$
where $L_K^r$ denotes the $r$-th power of $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ as a compact and positive operator.

The regularity condition (11) describes the regularity of $f_\rho$ and has been adopted in a large literature to quantify learning rates for some algorithms (Smale and Zhou, 2007; Caponnetto and De Vito, 2007; Blanchard and Krämer, 2016; Guo et al., 2017; Lin et al., 2017). Based on the above three assumptions, we can derive the following optimal learning rates for DKRR in expectation.

Theorem 1 Under Assumptions 1-3 with $\frac{1}{2} \le r \le 1$ and $0 < s \le 1$, if $\lambda = |D|^{-\frac{1}{2r+s}}$, $|D_1| = \dots = |D_m|$ and
$$
m \le C_1 |D|^{\frac{2r+s-1}{2r+s}} (\log |D|)^{-3}, \qquad (12)
$$
then
$$
E\big[ \| \bar{f}^0_{D,\lambda} - f_\rho \|_\rho^2 \big] \le C_2 |D|^{-\frac{2r}{2r+s}},
$$
where $C_1, C_2$ are constants independent of $|D|$ or $m$, whose values will be given explicitly in the proof.

Theorem 1 exhibits optimal learning rates for DKRR in expectation under the restriction (12). In previous studies (Lin et al., 2017; Guo et al., 2017; Mücke and Blanchard, 2018), optimal learning rates for DKRR were built upon the restriction
$$
m \le |D|^{\frac{2r-1}{2r+s}}. \qquad (13)
$$
A direct consequence is that if $r = 1/2$, then DKRR with $m = |D|^\theta$ for an arbitrarily small $\theta > 0$ may not achieve the optimal learning rate, according to the existing work.
In particular, it was shown in (Lin et al., 2017) that a parameter value for λ larger than |D| 1/(2r+s) is Distributed Kernel Ridge Regression with Communications required under this circumstance, which leads to a sub-optimal learning rate. Comparing (13) with (12), we relax the restriction on m so that optimal learning rates for DKRR also hold for r = 1/2. The restriction (12) with r = 1/2 is similar to that in (Zhang et al., 2015). However, we remove the eigenfunction assumption in (Zhang et al., 2015) and derive optimal learning rates for DKRR under Assumption 3 with 1 2 r 1. It should be mentioned that removing the eigenfunction assumption in (Zhang et al., 2015) was already made in a series of previous papers (Lin et al., 2017; Guo et al., 2017; M ucke and Blanchard, 2018; Lin et al., 2020). However, an additional level of regularity, r > 1/2 is imposed due to (13), excluding r = 1/2 for Zhang et al. (2015). Our study in Theorem 1 successfully fills this gap. 3.2. Optimal Learning Rates for DKRR in Probability Theorem 1 presented optimal learning rates for DKRR in expectation. However, the expectation describes the average information for multiple trails and fails to capture the learning performance of DKRR for a single trail. This explains the inconsistency between the theoretical result in Theorem 1 and numerical observation in Figure 2, where m2 is much smaller than m1. In the following theorem, we deduce learning rates for DKRR in probability. Theorem 2 Let 0 < δ < 1. Under Assumptions 1-3 with 1 2 r 1 and 0 < s 1, if λ = |D| 1 2r+s , |D1| = = |Dm|, m |D| 2r+s 1 C3 log3 |D|, (14) and 16|D| 2r+s 1 4r+2s exp n C4|D| 2r+s 1 8r+4s log |D| o δ, (15) then with confidence at least 1 δ, there holds f 0 D,λ fρ ρ C5|D| r 2r+s log2 8 where C3, C4, C5 are constants independent of |D|, m or δ, whose values will be given explicitly in the proof. It has been a difficult task to derive optimal learning rates for DKRR in probability as shown in (16). Compared with the classical error decomposition in expectation (Chang et al., 2017), where the generalization error is decomposed into approximation error, sample error and distributed error, the error decomposition in probability is totally different. In particular, it is not easy to separate a distributed error in probability to control the number of local machines. As a consequence, the upper bound of (14) is tighter than that of (12), showing a stricter restriction on m to guarantee the optimal learning rate in probability. Neglecting the logarithmic fact, we have |D| 2r+s 1 2r+s . Noting that m2 m1 in Figure 2, the error estimate in probability coincides with the numerical results, showing the power of estimate in probability. Based on the confidence-based error estimate in Theorem 2, we can derive almost sure convergence of DKRR. Lin, Wang, and Zhou Corollary 3 Under Assumption 1, Assumption 3 with 1 2 r 1 and Assumption 2 with 0 < s 1, if λ = |D| 1 2r+s , |D1| = = |Dm| and (14) holds, then for any ε > 0, there holds lim |D| |D| r 2r+s(1 ε) f 0 D,λ fρ ρ = 0. 3.3. Optimal Learning Rates for DKRR with Communications In the previous subsection, we presented theoretical limitations of DKRR in terms of a small range of m to guarantee the optimal learning rate in probability. In the following theorem, we show that our proposed communication strategy can improve the performance of DKRR. Theorem 4 Let 0 < δ < 1. 
Under Assumptions 1-3 with 1 2 r 1 and 0 < s 1, if λ = |D| 1 2r+s , |D1| = = |Dm|, (2r+s 1)(ℓ+1) (2r+s)(ℓ+2) C6 log3 |D| , (17) (2r+s 1)(ℓ+1) (2r+s)(ℓ+2) C7 log3 |D| exp{ |D| 2r+s 1 2(2r+s)(ℓ+2) log |D|} δ < 1, then with confidence at least 1 δ, there holds f ℓ D,λ fρ ρ C8|D| r 2r+s logℓ+2 4 where C6, C7, C8 are constants independent of m, δ or |D|, whose value will be given explicitly in the proof. Comparing (17) with (14), the proposed communication strategy relaxes the restric- tion on m from order |D| 2r+s 1 4r+2s to |D| (2r+s 1)(ℓ+1) (2r+s)(ℓ+2) . Furthermore, the upper bound of m is monotonically increasing with the number of communications, showing the power of communications in DKRR. As ℓ , up to a logarithmic factor, the restriction tends to the best one in (12). At the first glance, the restriction (17) is always worse than (12), contradicting our assertions on the outperformance of communications. However, it should be highlighted that (12) only guarantees the error bound in expectation. This means that if m satisfies (12), we cannot conduct feasibility analysis of DKRR via a single (or finite many) trial. Figure 2 numerically shows the drawback of analysis in expectation. Theorem 4 conducts the error analysis in probability, a totally different theoretical framework from Theorem 1. In this framework, Theorem 4 shows that communications improve the performance of DKRR since (17) is better than (14). It will be shown in Proposition 12 below that under (17), the error of DKRR(ℓ) converges exponentially fast with respect to the number of communications, meaning that only a small ℓin DKRR(ℓ) is required to get a satisfactory error bound in probability. Based on Theorem 4, we present almost sure convergence of DKRR with communications. Distributed Kernel Ridge Regression with Communications Corollary 5 Under Assumptions 1-3 with 1 2 r 1 and 0 < s 1, if λ = |D| 1 2r+s , |D1| = = |Dm| and (17) holds, then for any ε > 0, there holds lim |D| |D| r 2r+s(1 ε) f ℓ D,λ fρ ρ = 0. 4. Related Work and Discussions KRR is a classical learning algorithm for regression and has been extensively studied in statistics and learning theory. Optimal learning rates for KRR were established in (Caponnetto and De Vito, 2007; Steinwart et al., 2009; Lin et al., 2017, 2019). For DKRR, optimal learning rates were deduced in (Zhang et al., 2015) under Assumption 1, Assumption 3 with r = 1/2, some eigenvalue decaying assumption that is similar to Assumption 2, and an additional boundedness assumption for the eigenfunctions. Lin et al. (2017) removed the eigenfunction assumption by introducing a novel integral operator approach as well as a second-order decomposition for operator difference. However, optimal learning rates for DKRR (Lin et al., 2017) were derived under Assumption 3 with 1 2 < r 1, excluding the most popular case r = 1/2, i.e., fρ HK. Although several recent work (Guo et al., 2017; Chang et al., 2017; M ucke and Blanchard, 2018; Shi, 2019; Lin et al., 2020) focused on conquering this theoretical drawback, there is no essential improvement in presenting a good bound for the number of local machines according to the theory in (Shang and Cheng, 2017; Liu et al., 2018). In this paper, we succeed in deriving a tight bound for the number of local machines as (12) by applying the concentration inequality established in (Minsker, 2017) to describe the similarity of different operators (see the next section for detailed descriptions). Different from Lin et al. 
(2017), Theorem 1 in this paper removes the eigenfunction assumption of Zhang et al. (2015) without presenting additional regularity assumption. Previous optimal learning rates for DKRR (Zhang et al., 2015; Lin et al., 2017; Guo et al., 2017; Chang et al., 2017; M ucke and Blanchard, 2018; Shi, 2019; Lin et al., 2020) were built in expectation. Technically, the generalization error in expectation can be divided into the approximation error, sample error and distributed error (Chang et al., 2017) by using the unbiasedness property E[y|x] = fρ(x). The approximation error, independent of the sample, describes the approximation capability of the hypothesis space. The sample error connects the synthesized estimator (1) with the estimator (2) by showing an additional |Dj| |D| in the sample error for local estimators. The distributed error measures the limitation of the distributed learning algorithm (1) and presents the upper bound of m to guarantee optimal learning rates for DKRR. Our error estimate in expectation also follows from this classical error decomposition (see Lemma 8 below). However, this widely used error decomposition is not applicable to DKRR in probability since there lacks an expectation operator to realize the unbiasedness E[y|x] = fρ(x). Thus, it requires novel approaches to deduce optimal learning rates for DKRR in probability. According to an explicit operator representation for the kernel-based gradient descent algorithm, optimal learning rates as well as a novel error decomposition based on its operator representation for distributed gradient descent algorithms were established in probability (Lin and Zhou, 2018). Using the similar error decomposition as Lin and Zhou (2018), a minimum error entropy algorithm with distributed gradient descents was proposed in (Hu et al., 2019) and a tight learning rate was derived. However, DKRR requires the computation of the inverse matrix (or operator), the error Lin, Wang, and Zhou decomposition of gradient descent algorithm in (Lin and Zhou, 2018; Hu et al., 2019) is not suitable for DKRR. In this paper, as shown in Proposition 10 below, we succeed in deriving a novel error decomposition for DKRR by introducing some measurements to quantify the difference between integral operators and their empirical counterparts. Then, applying the recently developed concentration inequalities for positive operators (Minsker, 2017), we derive optimal learning rates for DKRR in probability under much looser restriction on m than (Lin and Zhou, 2018; Hu et al., 2019). Numerous communication strategies (Li et al., 2014; Shamir et al., 2014; Bellet et al., 2015; Huang and Huo, 2019) were proposed to improve the learning performance of distributed learning algorithms in the framework of parametric regression (linear regression). To the best of our knowledge, our proposed communication strategy is the first work focusing on improving the performance of learning algorithms in nonparametric regression. As shown in Figure 1, nonparametric regression transmits function values and protects the privacy of local machines, while parametric regression (Zhang et al., 2013) transmits coefficients that may disclose the detailed information for local estimators. The most related work is Huang and Huo (2019), where a communication strategy based on Newton-Raphson iterations is proposed to equip ridge regression in linear regression. Our work differs from Huang and Huo (2019) in the following three aspects. 
Firstly, our analysis is carried out in nonparametric regression rather than linear regression. Secondly, the communication strategy is based on the operator representation, which is exclusive for kernel approaches. Thirdly, our theory focuses on enlarging the range of number of local machines rather than improving the learning rate of distributed learning algorithms, since DKRR without communications is already optimal for not so large m based on previous studies (Lin et al., 2017). It would be interesting to extend our communication strategy to other distributed learning schemes such as distributed learning with convolutional neural networks in deep learning (Zhou, 2018a,b, 2020). 5. Operator Similarities and Error Decomposition We analyze the learning performance of DKRR(ℓ) by using the integral operator approach (Smale and Zhou, 2007; Lin et al., 2017; Guo et al., 2017; Guo and Shi, 2019). Our novelty in analysis is tight bounds on quantifying the similarity between different operator. These bounds together with the exclusive error decomposition yield optimal learning rates for DKRR and show the advantage of communications in distributed learning. 5.1. Similarities of Operators The similarity between f ℓ D,λ and f D,λ depends heavily on that between the operator LK and LK,D. The classical method for analyzing similarity between LK and LK,D is to bound the norm of operator difference LK LK,D. By using a concentration inequality in Hilbert spaces from Pinelis (1994), it can be found (Caponnetto and De Vito, 2007; Blanchard and Kr amer, 2016) that for any δ (0, 1), with confidence at least 1 δ, there holds SD,λ := (LK + λI) 1/2(LK LK,D) 2κ(κ + 1)AD,λ log 2 Distributed Kernel Ridge Regression with Communications AD,λ := 1 p The bound in (18) is tight. However, in estimating the difference between f D,λ and f ℓ D,λ, one also needs to estimate RD,λ := (LK + λI) 1/2(LK LK,D)(LK + λI) 1/2 . A classical approach (Lin et al., 2017; Guo et al., 2017) is to use RD,λ 1 λSD,λ and get that RD,λ 2κ(κ + 1)λ 1/2AD,λ log 2 holds with confidence 1 δ. The leading term in (20) is |D|λ for λ |D| 1. In the following lemma, which will be proved in Appendix A, we reduce the leading term for bounding RD,λ |D|λ by using a new concentration inequality for self-adjoint operators (Minsker, 2017). Lemma 6 Let 0 < δ 1. If 0 < λ 1 and N(λ) 1, then with confidence 1 δ, there holds RD,λ C 1BD,λ log 4 where C 1 := max{(κ2 + 1)/3, 2 κ2 + 1} and BD,λ := 1 + log N(λ) 1 + log N(λ) λ|D| . (22) Besides the differences between RD,λ and SD,λ, another quantity to measure the similarity between LK,D and LK is the operator product (LK + λI)1/2(LK,D + λI) 1/2 . A recently developed second order decomposition for positive operators (Lin et al., 2017; Guo et al., 2017) asserts that if A and B are invertible operators on a Banach space, then A 1 B 1 = B 1(B A)B 1(B A)A 1 + B 1(B A)B 1. This implies the following decomposition of the operator product BA 1 = (B A)B 1(B A)A 1 + (B A)B 1 + I. (23) Inserting A = LK,D + λI and B = LK + λI to (23) and noting (18), it is easy to derive the following upper bound for (LK + λI)(LK,D + λI) 1 (e.g., Guo et al., 2017): for any 0 < δ < 1, with confidence at least 1 δ, there holds (LK + λI)(LK,D + λI) 1 2 Lin, Wang, and Zhou Hence, according to the Cordes inequality (Bhatia, 1997) AτBτ AB τ, 0 < τ 1, QD,λ := (LK + λI)1/2(LK,D + λI) 1/2 The leading term of the right-hand side of (24) is |D|λ . In the following lemma, whose proof is postponed to Appendix A, we improve (24) by using Lemma 6. 
Lemma 7 Assume 0 < λ 1 and N(λ) 1. For δ 4 exp{ 1/(2C 1BD,λ)}, with confidence 1 δ, there holds QD,λ In (24), to guarantee the boundedness of QD,λ, it requires δ C 2 for some C 2 > 0 depending only on κ. However, in Lemma 7, recalling (22), it is sufficient that log(N(λ)) δ C 3 for some C 3 > 0 depending only on κ. 5.2. Error Decomposition for DKRR in Expectation We use the error decomposition for DKRR in (Chang et al., 2017), where the data-free limit and noise-free version of f Dj,λ, fλ = arg min f HK X (f(x) fρ(x))2dρX + λ f 2 K = (LK + λI) 1LKfρ, (25) and f Dj,λ := E[f Dj,λ|Dj(x)] = (LK,Dj + λI) 1LK,Djfρ (26) are used. The following lemma can be found in (Chang et al., 2017). Lemma 8 Let f 0 D,λ be defined by (1). We have 1 2E h f 0 D,λ fρ 2 ρ i fλ fρ 2 ρ + |D|2 E f Dj,λ fλ 2 ρ + |D| E f Dj,λ fλ 2 The three terms on the right-hand side of (27) are respectively the approximation error, sample error and distributed error. Based on Lemma 8, we can derive the following error decomposition for DKRR in expectation. Distributed Kernel Ridge Regression with Communications Proposition 9 Let f 0 D,λ be defined by (1). If fρ HK, then 1 2E h f 0 D,λ fρ 2 ρ i fλ fρ 2 ρ + |D|2 E h Q4 Dj,λ(PDj,λ + SDj,λ fλ K)2i |D| E h Q4 Dj,λR2 Dj,λ (LK + λI)1/2(fλ fρ) 2 K i , (28) where PD,λ := (LK + λI) 1/2(LKfρ ST Dy D) K . Proof. We first bound the sample error. It follows from (3) and (25) that f D,λ fλ = (LK,D + λI) 1ST Dy D (LK + λI) 1LKfρ = (LK,D + λI) 1(ST Dy D LKfρ) + [(LK,D + λI) 1 (LK + λI) 1]LKfρ = (LK,D + λI) 1/2(LK,D + λI) 1/2(LK + λI)1/2(LK + λI) 1/2(ST Dy D LKfρ) + (LK,D + λI) 1/2(LK,D + λI) 1/2(LK + λI)1/2(LK + λI) 1/2(LK LK,D)fλ. So (LK + λI)1/2(f Dj,λ fλ) K Q2 Dj,λ(PDj,λ + SDj,λ fλ K). (29) Then, we bound the distributed error. Due to (3) and (26), we get f D,λ fλ = (LK,D + λI) 1LK,Dfρ (LK + λI) 1LKfρ = (LK,D + λI) 1(LK,D LK)fρ + [(LK,D + λI) 1 (LK + λI) 1]LKfρ = (LK,D + λI) 1(LK,D LK)fρ + (LK,D + λI) 1(LK LK,D)fλ = (LK,D + λI) 1(LK,D LK)(fρ fλ) = (LK,D + λI) 1(LK + λI)1/2(LK + λI) 1/2(LK,D LK)(LK + λI) 1/2 (LK + λI)1/2(fρ fλ). Combining this with fρ HK yields (LK + λI)1/2(f Dj,λ fλ) K Q2 Dj,λRDj,λ (LK + λI)1/2(fλ fρ) K. (30) Plugging (29) and (30) into (27), we have 1 2E h f 0 D,λ fρ 2 ρ i fλ fρ 2 ρ + |D|2 E h Q4 Dj,λ(PDj,λ + SDj,λ fλ K)2i |D| E h Q4 Dj,λR2 Dj,λ (LK + λI)1/2(fλ fρ) 2 K i . This completes the proof of Proposition 9. Lin, Wang, and Zhou 5.3. Error Decomposition for DKRR in Probability To deduce learning rates for DKRR in probability, we need the following error decomposition. It holds for f 0 D,λ f D,λ ρ, which is totally different from Proposition 9 focusing on the expectation. Proposition 10 Let f 0 D,λ be defined by (1). Then f 0 D,λ f D,λ ρ (LK + λI)1/2(f 0 D,λ f D,λ) K |D| (RDj,λ + RD,λ)Q2 Dj,λ(PDj,λ + SDj,λ fλ K). (31) Proof. From the definition of f 0 D,λ, we see f 0 D,λ f D,λ = |D| (LK,Dj + λI) 1ST Djy Dj (LK,D + λI) 1ST Dy D |D| (LK,Dj + λI) 1 (LK,D + λI) 1 ST Djy Dj |D| (LK,D + λI) 1(LK,D LK,Dj)f Dj,λ |D| (LK,D + λI) 1(LK,D LK)(f Dj,λ fλ) |D| (LK,D + λI) 1(LK LK,Dj)(f Dj,λ fλ) = (LK,D + λI) 1(LK + λI)1/2(LK + λI) 1/2(LK,D LK)(LK + λI) 1/2 |D| (LK + λI)1/2(f Dj,λ fλ) + (LK,D + λI) 1(LK + λI)1/2 |D| (LK + λI) 1/2(LK,Dj LK)(LK + λI) 1/2(LK + λI)1/2(f Dj,λ fλ). (LK + λI)1/2(f 0 D,λ f D,λ) K Q2 D,λRD,λ |D| (LK + λI)1/2(f Dj,λ fλ) K |D| RDj,λ (LK + λI)1/2(f Dj,λ fλ) K. Distributed Kernel Ridge Regression with Communications This together with (29) gives f 0 D,λ f D,λ ρ Q2 D,λRD,λ |D| Q2 Dj,λ(PDj,λ + SDj,λ fλ K) |D| RDj,λQ2 Dj,λ(PDj,λ + SDj,λ fλ K). 
This completes the proof of Proposition 10. 5.4. Error Decomposition for DKRR(ℓ) In this subsection, we derive an error decomposition for DKRR(ℓ). At first, we show in the following proposition the power of communications. Proposition 11 Let ℓ 0. We have (LK + λI)1/2(f ℓ D,λ f D,λ) K |D| Q2 Dj,λ(RDj,λ + RD,λ) (LK + λI)1/2(f 0 D,λ f D,λ) K. Proof. Since f D,λ = f ℓ 1 D,λ (LK,D + λI) 1[(LK,D + λI)f ℓ 1 D,λ ST Dy D], f D,λ f ℓ D,λ = f ℓ 1 D,λ |D| (LK,Dj + λI) 1[(LK,D + λI)f ℓ 1 D,λ ST Dy D] f ℓ 1 D,λ + (LK,D + λI) 1[(LK,D + λI)f ℓ 1 D,λ ST Dy D] |D| [(LK,Dj + λI) 1 (LK,D + λI) 1][(LK,D + λI)f ℓ 1 D,λ ST Dy D] |D| (LK,Dj + λI) 1(LK,D LK,Dj)(LK,D + λI) 1[(LK,D + λI)f ℓ 1 D,λ ST Dy D] |D| (LK,Dj + λI) 1(LK,D LK,Dj)(f ℓ 1 D,λ f D,λ) |D| (LK,Dj + λI) 1(LK,D LK)(f ℓ 1 D,λ f D,λ) |D| (LK,Dj + λI) 1(LK LK,Dj)(f ℓ 1 D,λ f D,λ) =: UD,λ,1 + UD,λ,2. Lin, Wang, and Zhou (LK + λI)1/2UD,λ,1 |D| (LK + λI)1/2(LK,Dj + λI) 1(LK + λI)1/2(LK + λI) 1/2(LK,D LK) (LK + λI) 1/2(LK + λI)1/2(f ℓ 1 D,λ f D,λ), (LK + λI)1/2UD,λ,1 K |D| Q2 Dj,λRD,λ (LK + λI)1/2(f ℓ 1 D,λ f D,λ) K. Similarly, we find (LK + λI)1/2UD,λ,2 K |D| Q2 Dj,λRDj,λ (LK + λI)1/2(f ℓ 1 D,λ f D,λ) K. (LK + λI)1/2(f ℓ D,λ f D,λ) K |D| Q2 Dj,λRDj,λ + |D| Q2 Dj,λRD,λ (LK + λI)1/2(f ℓ 1 D,λ f D,λ) K |D| Q2 Dj,λRDj,λ + |D| Q2 Dj,λRD,λ (LK + λI)1/2(f 0 D,λ f D,λ) K. This completes the proof of Proposition 11. Combining Proposition 11 with Proposition 10, we can derive the following error decomposition for DKRR(ℓ). Proposition 12 Let ℓ 0. We have (LK + λI)1/2(f ℓ D,λ f D,λ) K |D| Q2 Dj,λ(RDj,λ + RD,λ) |D| (RD,λ + RDj,λ)Q2 Dj,λ(PDj,λ + SDj,λ fλ K). In this section, we present proofs of our main results. Distributed Kernel Ridge Regression with Communications 6.1. Optimal Learning Rates for DKRR in Expectation In this subsection, we prove optimal learning rates for DKRR in expectation. We need the following general theorem based on Assumption 1 and Assumption 3. Theorem 13 Under Assumption 3 with 1 2 r 1 and Assumption 1, if 0 < λ 1 and N(λ) 1, then E h f 0 D,λ fρ 2 ρ i 2λ2r + 24 exp{ 1/(2C 1BDj,λ)} + C2 1A2 Dj,λ 16 exp{ 1/(2C 1BDj,λ)} + 32 C2 2B2 Dj,λλ2r , (32) where C1 := 4((κM + γ)(κ + 1) + κ2r(2κ + 1) hρ ρ) and C2 := 4κ(κ + 1) hρ ρ. Proof. Due to Assumption 3 with r 1/2, we obtain fλ K = (LK + λI) 1LKLr Khρ κ2r 1 L1/2 K hρ K = κ2r 1 hρ ρ. (33) Moreover, (9) implies (Blanchard and Kr amer, 2016; Lin and Zhou, 2018) that with confidence at least 1 δ, there holds PDj,λ 2(κM + γ)(κ + 1)ADj,λ log 2 Thus, for δ 12 exp{ 1/(2C 1BDj,λ)}, it follows from Lemma 7, (18), (33) and (34) that with confidence 1 δ, there holds Q2 Dj,λ PDj,λ + SDj,λ fλ K C1ADj,λ log 6 δ , j = 1, . . . , m. (35) Using the probability to expectation formula 0 P [ξ > t] dt (36) to the positive random variable ξ1,j = Q4 Dj,λ(PDj,λ + SDj,λ fλ K)2 for any j = 1, . . . , m, we have E[ξ1,j] = Z 0 P [ξ1,j > t] dt = Z 12 exp{ 1/(2C 1 BDj,λ)} 0 P [ξ1,j > t] dt + Z 12 exp{ 1/(2C 1 BDj,λ)} P [ξ1,j > t] dt 12 exp{ 1/(2C 1BDj,λ)} + Z 12 exp{ 1/(2C 1 BDj,λ)} P[ξ1,j > t]dt. When t 12 exp{ 1/(2C 1BDj,λ)}, it follows from (35) that P[ξ1,j > t] 6 exp{ C 1 1 A 1 Dj,λt1/2}, j = 1, . . . , m, Lin, Wang, and Zhou 12 exp{ 1/(2C 1 BDj,λ)} P[ξ1,j > t]dt 6 Z 0 exp{ C 1 1 A 1 Dj,λt1/2}dt 12 C2 1A2 Dj,λ 0 ue udu = 12 C2 1A2 Dj,λ. E h Q4 Dj,λ(PDj,λ + SDj,λ fλ K)2i 12 exp{ 1/(2C 1BDj,λ)} + 12 C2 1A2 Dj,λ. (37) Due to Assumption 3 with r 1/2, we have (LK + λI)1/2(fλ fρ) K λ (LK + λI) 1/2Lr+1/2 K hρ ρ λr hρ ρ. 
(38) Then, Lemma 6 and Lemma 7 with δ 8 exp{ 1/(2C 1BDj,λ)} yield that with confidence 1 δ, there holds Q2 Dj,λRDj,λ (LK + λI)1/2(fλ fρ) K C2BDj,λλr log 8 δ , j = 1, . . . , m. (39) Then, applying (36) to ξ2,j = Q4 Dj,λR2 Dj,λ (LK + λI)1/2(fλ fρ) 2 K and using the same approach as above, we can derive for any j = 1, . . . , m, E[Q4 Dj,λR2 Dj,λ (LK + λI)1/2(fλ fρ) 2 K] 8 exp{ 1/(2C 1BDj,λ)} + 16 C2 2B2 Dj,λλ2r. (40) Plugging (37), (38) and (40) into (28), we get (32) directly. This completes the proof of Theorem 13. Based on Theorem 13, we can prove Theorem 1 as follows. Proof of Theorem 1. For C1 := min n 2r+s 2s , (2r+s)3 2rs max{(κ2+1)/3,2 o , due to (22), |D1| = = |Dm|, λ = |D| 1 2r+s and (10), we have for any j = 1, . . . , m, BDj,λ 2s 2r + sm|D| 2r+s 1 2r+s log |D| + 2s 2r + sm|D| 2r+s 1 2r+s log |D|. (41) Noting C 1 = max{(κ2 + 1)/3, 2 κ2 + 1}, we see that (12) implies BDj,λ 2r + s 4r C 1 log |D| (42) and exp{ 1/(2C 1BDj,λ)} |D| 2r 2r+s , j = 1, . . . , m. (43) According to (19), |D1| = = |Dm|, λ = |D| 1 2r+s and (10), we also get ADj,λ m|D| 4r+2s 1 4r+2s + m|D| r 2r+s , j = 1, . . . , m. (44) Inserting (42), (43) and (44) into (32) and noting λ = |D| 1 2r+s and (12), we have E[ f 0 D,λ fρ 2 ρ] C1|D| 2r 2r+s , where C2 := 42 + 48 C2 1 + 32 C2 2 2r+s 4r C 1 2 . This completes the proof of Theorem 1. Distributed Kernel Ridge Regression with Communications 6.2. Optimal Learning Rates for DKRR in Probability In this subsection, we prove Theorem 2. To this end, we need the following theorem for DKRR in probability. Theorem 14 Under Assumption 1 and Assumption 3 with 1 2 r 1, if 0 < λ 1, N(λ) 1, and 16m exp{ 1/(2C 1BDj,λ)} < 1, j = 1, . . . , m, (45) then for 16m exp{ 1/(2C 1BDj,λ)} δ < 1, j = 1, . . . , m, (46) with confidence 1 δ there holds f 0 D,λ f D,λ ρ C3 log2(4m) |D| ADj,λBDj,λ log2 4 where C3 = 16C 1 C1. Proof. Let j {1, . . . , m} be fixed. It follows from Lemma 6 and BD,λ BDj,λ that with confidence 1 δ 4, there holds RD,λ + RDj,λ 2C 1BDj,λ log 16 Since δ 16 exp{ 1/(2C 1BDj,λ)}, Lemma 7 together with (35) shows that with confidence 1 3δ/4, there holds Q2 D,λQ2 Dj,λ PDj,λ + SDj,λ fλ K 2 C1ADj,λ log 16 Plugging (48) and (49) into (31), for fixed j {1, . . . , m}, we see with confidence 1 δ, there holds Q2 D,λ(RDj,λ + RD,λ)Q2 Dj,λ(PDj,λ + SDj,λ fλ K) C3 log2 16 δ ADj,λBDj,λ. (50) Thus, for δ 16 exp{ 1/(2C 1BDj,λ)}, the above estimate implies that with confidence at least 1 mδ, there holds max 1 j m Q2 D,λ(RDj,λ + RD,λ)Q2 Dj,λ(PDj,λ + SDj,λ fλ K) C3 log2 16 δ ADj,λBDj,λ. Scaling mδ to δ, for δ 16m exp{ 1/(2C 1BDj,λ)}, with confidence 1 δ, there holds f 0 D,λ f D,λ ρ 4C 1 C1 log2 16m |D| ADj,λBDj,λ. δ + log(4m) (log(4m) + 1) log 4 δ 2 log(4m) log 4 Lin, Wang, and Zhou which follows from (47). Then the proof of Theorem 14 is complete. Proof of Theorem 2. Due to the triangle inequality, we have f 0 D,λ fρ ρ f D,λ f 0 D,λ ρ + f D,λ fρ ρ. (51) To bound f D,λ fρ ρ, under (11) with 1 2 r 1, we obtain from (29) and (38) that f D,λ fρ ρ λr + Q2 D,λ(PD,λ + SD,λ fλ K). Then, (14) with C3 := max{4C 1(4 log 2 + 1), 16 log2 2} and (15) with C4 := 4C 1 imply δ exp{2C 1BD,λ}, and then (39) with Dj replaced by D shows that with confidence 1 δ/2, there holds f D,λ fρ ρ λr + C1AD,λ log 16 Noting (14) and (15) imply (45) and (46), and plugging (47) and (52) into (51), with confidence 1 δ, there holds f 0 D,λ fρ ρ λr + C1AD,λ log 16 δ + C3 log2(4m) log2 8 |D| ADj,λBDj,λ. 
(53) Since |D1| = = |Dm|, λ = |D| 1 2r+s and 2s/(2r + s) 1, we have from (41) and (44) that ADj,λBDj,λ m|D| 4r+2s 1 4r+2s + m|D| r 2r+s m|D| 2r+s 1 2r+s log |D| + q m|D| 2r+s 1 2r+s log |D| . (54) This together with (14) shows log2(4m)ADj,λBDj,λ 4|D| r 2r+s . Inserting the above inequality into (53) and noting AD,λ |D| 4r+2s 1 4r+2s + |D| r 2r+s 2|D| r 2r+s , (55) we see with confidence 1 δ, there holds f 0 D,λ fρ ρ C5|D| r 2r+s log2 8 where we use log 16 C5 = 1 + 2 C1 + 4 C3. This completes the proof of Theorem 2. To prove Corollary 3, we need the following Borel-Cantelli Lemma (Dudley, 2002, page 262). The Borel-Cantelli Lemma asserts for a sequence {ηn}n of events that if the sum of the probabilities is finite, i.e., P n=1 P[ηn] < , then the probability that infinitely many of them occur is 0. Distributed Kernel Ridge Regression with Communications Lemma 15 Let {ηn} be a sequence of events in some probability space and {εn} be a sequence of positive numbers satisfying limn εn = 0. If n=1 Prob[|ηn η| > εn] < , then ηn converges to η almost surely. Proof of Corollary 3. Let N := |D|, it is easy to check that δ = δN = C4N 2 satisfies (15) for some C4 > 0 independent of N. Set ΨN = N r 2r+s . By Theorem 2, if λ = |D| 1 2r+s , |D1| = = |Dm| and (14) holds, then for any N and ε > 0, Ψ 1+ε N f 0 D,λ fρ ρ > C5Ψε N Denote µN = C5Ψε N log 8 δN 2 . Obviously, N=2 P Ψ 1+ε N ft,D fρ ρ > µN and µN 0 when N 0. Then our conclusion follows from Lemma 15. This completes the proof of Corollary 3. 6.3. Optimal Learning Rates for DKRR(ℓ) In this subsection, we present the proof of Theorem 4. To this end, we prove the following theorem. Theorem 16 Under Assumption 1 and Assumption 3 with 1 2 r 1, if 0 < λ 1, N(λ) 1, and (45) holds, then for δ satisfying (46), with confidence 1 δ there holds f ℓ D,λ f D,λ ρ 2 C3(4C 1)ℓ log(4m) log 4 |D| ADj,λBDj,λ. Proof. Under (45) and (46), we obtain from Lemma 7, (48) and (50) that with confidence 1 mδ, there holds max 1 j m Q2 Dj,λ(RDj,λ + RD,λ) 4C 1BDj,λ log 16 max 1 j m Q2 D,λ(RD,λ + RDj,λ)Q2 Dj,λ(PDj,λ + SDj,λ fλ K) C3 log2 16 δ ADj,λBDj,λ. Lin, Wang, and Zhou Scaling mδ to δ, we obtain from (32) that f ℓ D,λ f D,λ ρ (LK + λI)1/2(f ℓ D,λ f D,λ) K 4C 1 log 16m C3 log2 16m |D| ADj,λBDj,λ. Noting further log 16 δ 2 log 4m, we have with confidence 1 δ, there holds f ℓ D,λ f D,λ ρ 2 C3(4C 1)ℓlogℓ+2(4m) logℓ+2 4 |D| ADj,λBDj,λ. This completes the proof of Theorem 16. Proof of Theorem 4. Since |D1| = = |Dm| and λ = |D| 1 2r+s , it follows from (41) and 2s/(2r + s) 1 that m|D| 2r+s 1 2r+s log |D| + p m log |D||D| 2r+s 1 This together with (54) shows that with confidence 1 δ, there holds |D| BDj,λADj,λ m|D| 2r+s 1 2r+s log |D| + p m log |D||D| 2r+s 1 4r+2s ℓ+1 m|D| 4r+2s 1 4r+2s + m|D| r 2r+s . Since (17) with C6 := (4C 1(2+4 log 2))2 and (15) with C7 = C6/4, we obtain (45) and (46). Then, it follows from Theorem 16 that with confidence 1 δ, there holds f ℓ D,λ f D,λ ρ 2 C3|D| r 2r+s logℓ+2 4 Hence, we obtain from (52) and (55) that with confidence 1 δ, there holds f ℓ D,λ fρ ρ f ℓ D,λ f D,λ ρ + f D,λ fρ ρ C8|D| r 2r+s logℓ+2 4 where C8 = 1 + C1 + 2 C3, which completes the proof of Theorem 4. Proof of Corollary 5. The proof of Corollary 5 is the same as that of Corollary 3 with (14) replaced by (17). 7. Experiments In this section, we report numerical results to verify our theoretical statements. We employ three criteria for comparisons. 
The first criterion is the global mean squared error (GMSE) Distributed Kernel Ridge Regression with Communications which is the mean square error (MSE) of KRR with training all samples in a batch mode. GMSE provides a baseline to assess the performances of DKRR and DKRR(ℓ). The second criterion is the average error (AE) which is the MSE of DKRR. The third criterion is the average error with communications (AEC), which is the MSE of DKRR (ℓ). Regularization parameters in all experiments are selected by grid search. We carry out three simulations to verify our theoretical statements. The first simulation is devoted to illustrating the power of communications in DKRR. The second simulation is performed to demonstrate the relation between the generalization ability of DKRR(ℓ) and the number of training samples for fixed numbers of local machines and communications. The third simulation focuses on comparisons on the training complexities of DKRR and DKRR(ℓ). Before carrying out simulations, we describe how the synthetic data is generated. The inputs {xi}N i=1 of training samples are independently drawn according to the uniform distribution on the (hyper-)cube [0, 1]d with d = 1 or d = 3. The corresponding outputs {yi}N i=1 are generated from the regression models yi = gj(xi) + εi for i = 1, 2, , N and j = 1, 2, where εi is the independent Gaussian noise N(0, 0.2), g1(x) = x if 0 < x 0.5, 1 x if 0.5 < 0 1 for 1-dimensional data, and g2(x) = (1 x 2)6(35 x 2 2 + 18 x 2 + 3) if 0 < x 2 1, 0 if x 2 > 1 for 3-dimensional data. The inputs {x i}N i=1 of testing samples are also drawn independently according to the uniform distribution on the (hyper-)cube [0, 1]d but the corresponding outputs {y i}N i=1 are generated by y i = gj(x i). It can be found in (Wu, 1995; Schaback and Wendland, 2006) that g1 W 1 1 and g2 W 4 3 , where W α d represents the α-order Sobolev space on [0, 1]d. If we define K1(x, x ) = 1 + min(x, x ) and K2(x, x ) = h( x x 2) with h(r) = (1 r)4(4r + 1) if 0 < r 1, 0 if r > 1, then we know (Wu, 1995; Schaback and Wendland, 2006) that K1 and K2 are reproducing kernels for W 1 1 and W 2 3 , respectively. Obviously, g1 HK1 and g2 HK2. In the training process of DKRR and DKRR(ℓ), we uniformly distribute N training samples to m local machines. Simulation 1: We generate 10000 samples for training and 1000 samples for testing. The number m of local machines varies from {20, 40, 60, , 480} for 1-dimensional data, and varies from {2, 4, 6, , 60} for 3-dimensional data. The testing results are shown in Figure 4 and Figure 5. Figure 4 shows the relation between MSE and the number of local machines by different numbers of communications. From Figure 4, we can conclude the following four assertions. 1) When m is not too large, AEs are always comparable to GMSEs. There exists an upper bound of m, denoted by m B (e.g., m B 40 for d = 1 and m B 6 for d = 3), larger than which AE curves increase dramatically and far from the GMSE curves. This verifies the theoretical statement in Theorem 2. 2) AECs also Lin, Wang, and Zhou 80 160 240 320 400 480 The number of local machines DKRR Com. #1 Com. #2 Com. #4 Com. #8 Com. #16 Baseline Local approximation 10 20 30 40 50 60 The number of local machines DKRR Com. #1 Com. #2 Com. #4 Com. #8 Com. #16 Baseline Local approximation Figure 4: The relation between MSE and the number of local machines for fixed numbers of communications. The left figure and right figure are respectively the results on the 1-dimensional data and the 3-dimensional data. Com. 
# represents the number of communications. 2 4 6 8 10 12 14 16 18 20 The number of communications DKRR(l): m=20 DKRR(l): m=80 DKRR(l): m=140 DKRR(l): m=200 DKRR(l): m=260 DKRR(l): m=320 DKRR: m=320 Baseline 2 4 6 8 10 12 The number of communications DKRR(l): m=4 DKRR(l): m=10 DKRR(l): m=16 DKRR(l): m=22 DKRR(l): m=28 DKRR(l): m=34 DKRR: m=34 Baseline Figure 5: The relation between MSE and the number of communications for fixed numbers of local machines. The left figure and right figure are respectively the results on the 1-dimensional data and the 3-dimensional data. have the upper bound, such as m B 200 for d = 1 and ℓ= 2, which is much larger than that of AEs. This result confirms Theorem 4 by showing the power of communications in the sense that communications in DKRR can help to relax the restriction on m. 3) The upper bound m B of AECs increases with the number of communication increasing, which implies the necessity of communications and verifies Theorem 4 by showing that the upper bound in (17) is monotonically increasing with respect to the number of communications. 4) Different from DKRR which cannot sufficiently embody the best approximation ability of local estimator, DKRR(ℓ) succeeds in presenting an upper bound of m which is close to m1 in Figure 2. Distributed Kernel Ridge Regression with Communications 4000 8000 12000 16000 20000 The number of training samples 10-4 Local Machines # 20 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples 10-4 Local Machines # 100 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 180 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 260 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 340 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 400 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline Figure 6: The relation between the number of training samples and MSE on the 1dimensional data. The relation between MSE and the number of communications by different numbers of local machines is illustrated in Figure 5. We can see that there exists an upper bound of m (e.g., about 280 for 1-dimensional data and 30 for 3-dimensional data), less than which AECs are guaranteed to converge to GMSEs with a fast rate, and larger than which AECs diverge dramatically. This verifies Proposition 12 by comparing Pm j=1 |Dj| |D| Q2 Dj,λ(RDj,λ + RD,λ) with 1. Simulation 2: This simulation investigates the generalization performance of DKRR and DKRR(ℓ) with varying the number of training samples N {3000, 3500, 4000, , 20000} and fixing the number of testing samples N = 1000. The regression results are shown in Figure 6 and Figure 7, from which it can be drawn the following conclusions. 1) For fixed numbers of local machines and communications, AE curves are getting closer and closer to GMSE curves, and AEC curves are converging to GMSE curves with the number of training samples increasing. This verifies our theoretical assertions of Theorem 2 and Theorem 4. 
2) For each fixed number of local machines, there exists a lower bound for the number of training samples, denoted by NB (e.g., NB 6000 for d = 1 and m = 180, and NB 13500 for d = 3 and m = 40), larger than which, the generalization performance of DKRR(ℓ) is significantly superior than that of DKRR, and AECs converge to GMSEs with the number of communications increasing. Obviously, the bound NB increases as the number of local machines increases. This also confirms the conclusion in Theorem 4. Simulation 3: In this simulation, we compare the training complexities of DKRR and DKRR(ℓ). The computational complexity analysis for the training flow in Appendix B Lin, Wang, and Zhou 4000 8000 12000 16000 20000 The number of training samples Local Machines # 4 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 10 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 16 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 28 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 40 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline 4000 8000 12000 16000 20000 The number of training samples Local Machines # 52 Com. #1 Com. #2 Com. #3 Com. #4 Com. #8 Com. #16 DKRR Baseline Figure 7: The relation between the number of training samples and MSE on the 3dimensional data. is given in Table 1, in which the computational complexities of Step 1, Step 4 and Step 6 refer to those on each local machine, provided that parallelization is implemented. It should be noted that the time of data transmission in DKRR(ℓ) is not considered, and the computational complexity of Step 1 is also that of DKRR. Thus, the training complexities of DKRR and DKRR(ℓ) are O(N2τ/m+N3/m3) and O(N2τ/m+N3/m3+N2ℓ/m+m Nℓ), respectively. We define ΩDKRR = N2τ/m + N3/m3, ΩDKRR(ℓ) = N2τ/m + N3/m3 + N2ℓ/m + m Nℓ. It can be seen that ΩDKRR monotonously decreases with m increasing; ΩDKRR(ℓ) first decreases and then increases with m increasing, and the number m with minimum ΩDKRR(ℓ) is given by (τ + ℓ)2 + 12ℓ+ (τ + ℓ) Once data and the kernel function are given, the complexity τ of calculating a kernel value can be considered as a constant. Provided ℓis large enough, we have m N, from which a rough conclusion can be drawn as follows: The training complexity of DKRR(ℓ) mainly focuses on the local process (i.e., Step 1 of the training flow in Appendix B) when m < N, while it mainly focuses on the communications (i.e., from Step 4 to Step 7 of the training flow in Appendix B) when m > N. In the following, some numerical results are reported to verify the above analysis of training complexity. Distributed Kernel Ridge Regression with Communications Item Step 1 Step2 Step4 Step5 Step6 Step7 KDk,Dj MDj,λ αDj KDk,Dj αDj f 0 D,λ,Dk GDj,λ,k,ℓ GD,λ,k,ℓ βj,ℓ HDj,λ,k,ℓ f ℓ D,λ,Dk m2 N3 m3 N 2 m2 N2 m2 N N2 m2 N N2 m2 N 2 m2 N O(m N) O m Nℓ+ N2ℓ Table 1: Computational complexity of the training flow in Appendix B. INDV represents the complexity of calculating each individual item, and TOT represents the total complexities of Step 1, Step 2 and the loop from Step 4 to Step 7. τ represents the complexity of calculating a value of a kernel function K( , ). 
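To illustrate how the complexity proxies above can guide the choice of $m$, the following small sketch numerically minimizes $\Omega_{\mathrm{DKRR}(\ell)} = N^2\tau/m + N^3/m^3 + N^2\ell/m + mN\ell$ over a grid of candidate $m$ values. The grid, the kernel-evaluation cost $\tau$, and the example numbers are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def omega_dkrr_ell(N, m, tau, ell):
    # training-complexity proxy from the text: local KRR cost plus communication cost
    return N**2 * tau / m + N**3 / m**3 + N**2 * ell / m + m * N * ell

def best_m(N, tau, ell, m_grid):
    # pick the grid point with the smallest complexity proxy
    costs = [omega_dkrr_ell(N, m, tau, ell) for m in m_grid]
    return m_grid[int(np.argmin(costs))]

# example: N = 20000 samples, assumed kernel-evaluation cost tau = 3, ell = 8 communications
print(best_m(20000, tau=3, ell=8, m_grid=range(10, 601, 10)))
```

In practice this grid minimizer would then be capped by the accuracy-driven bound $\hat{m}^{N,\ell}_B$, as in equation (56).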
We start by introducing some notations that are used below. AEN,m represents the MSE of DKRR with m local machines for N training samples. AECN,m,ℓrepresents the MSE of DKRR(ℓ) with m local machines and ℓcommunications for N training samples. GMSEN represents the MSE of KRR with training N samples in a batch mode. The relative errors for DKRR and DKRR(ℓ) are defined by REN,m = |AEN,m GMSEN| RECN,m,ℓ= |AECN,m,ℓ GMSEN| GMSEN , respectively. For DKRR with N training samples, we denote the maximum number m of local machines that satisfies REN,m < ε as m N B , i.e., m N B = max{m|REN,m < ε}. For DKRR(ℓ) with N training samples and ℓcommunications, we denote the maximum number m of local machines that satisfies RECN,m,ℓ< ε as ˆm N,ℓ B , i.e., ˆm N,ℓ B = max{m|RECN,m,ℓ< ε}. In the experiments, ε is set as 0.05, and the computing capability of each local machine is assumed to be the same for simplicity. We generate 20000 training samples and 1000 testing samples. The number m varies from {20, 40, 60, , 600} for the 1-dimensional data, and varies from {2, 4, 6, , 60} for the 3-dimensional data. The training time of DKRR with the number m changing from the minimum number (i.e., 20 for 1-dimensional data and 2 for 3-dimensional data) to m N B and the training time of DKRR(ℓ) with the number m changing from the minimum number to ˆm N,ℓ B by different numbers ℓ {2, 4, 8, 16, 32} are shown in Figure 8. The following observations can be made from these results. 1) Because there are no calculations for communication steps, DKRR naturally consumes the least time when m < m N B . 2) For the two simulation datasets, we have m N B ˆm N,ℓ B , resulting that DKRR(ℓ) can further reduce the least training time of DKRR with the maximum number m N B when the number m approaches to ˆm N,ℓ B . However, the improvement is very limited for the 1-dimensional data, because the number m N B of DKRR is sufficiently large ( m N B = 120), and thus the marginal utility for the reduction of training time is very small when DKRR(ℓ) further enlarges the maximum number of local machines. 3) With the number m increasing, the training time curves of DKRR(ℓ) (ℓ= 4, 8, 16, 32) first decrease quickly and then increase slowly for the 1-dimensional data, and monotonously decrease for the 3-dimensional data. This phenomenon coincides with the aforementioned theoretical analysis of training complexity, because ˆm N,ℓ B 450 > N for the 1-dimensional data, and ˆm N,ℓ B 50 < N for the 3-dimensional data. In practical applications, we estimate the optimal number of local Lin, Wang, and Zhou 100 200 300 400 500 The number of local machines The training time (second) DKRR Com. # 2 Com. # 4 Com. # 8 Com. # 16 Com. # 32 Estimated m* 5 10 15 20 25 30 35 40 45 The number of local machines The training time (second) DKRR Com. # 2 Com. # 4 Com. # 8 Com. # 16 Com. # 32 Figure 8: The relation between the number of local machines and the training time for fixed numbers of communications. The left figure and right figure are respectively the results on the 1-dimensional data and the 3-dimensional data with N = 20000 training samples. machines by the formula ( m , if ˆm N,ℓ B > m , ˆm N,ℓ B , otherwise (56) to expect it to have the lowest time consumption. The estimated m and the corresponding training time for DKRR(ℓ) with ℓ {2, 4, 8, 16, 32} on the 1-dimensional data are marked by red pentagrams in the left figure of Figure 8. 
It can be seen that the training time with the estimated number $\hat m$ is very close to the minimum training time, which validates the correctness and reliability of equation (56). We also record the results obtained by varying the number of training samples $N \in \{3000, 3500, 4000, \ldots, 20000\}$ while fixing the number of testing samples at 1000. Figure 9 shows the maximum number $m_B^N$ of DKRR and the maximum number $\hat m_B^{N,\ell}$ of DKRR($\ell$) as the numbers of training samples and communications vary, and Figure 10 shows the training time of DKRR with $m_B^N$ local machines and the training time of DKRR($\ell$) with $\hat m_B^{N,\ell}$ local machines.

Figure 9: The relation between the number of training samples and the maximum number of local machines. The left and right panels show the results on the 1-dimensional data and the 3-dimensional data, respectively. (Curves: DKRR and DKRR($\ell$) with $\ell \in \{2, 4, 8, 16, 32\}$.)

Figure 10: The relation between the number of training samples and the training time with the optimal number of local machines. The left and right panels show the results on the 1-dimensional data and the 3-dimensional data, respectively. (Curves: DKRR and DKRR($\ell$) with $\ell \in \{2, 4, 8, 16, 32\}$.)

From these results, we have the following observations. 1) The maximum number of local machines increases as the number of training samples increases for both DKRR and DKRR($\ell$), but it grows faster for DKRR($\ell$) than for DKRR. 2) The maximum number $\hat m_B^{N,\ell}$ of DKRR($\ell$) is robust to the number $\ell$ of communications when $\ell > 4$. This verifies the fast convergence shown in Figure 5, and also implies that DKRR($\ell$) can achieve satisfactory generalization performance within a small number of communications. 3) For the 1-dimensional data, DKRR($\ell$) does not have much advantage in training time, and is even worse than DKRR when the number of training samples satisfies $N \ge 10000$ and the number of communications satisfies $\ell \ge 4$. Furthermore, the training time increases significantly as the number of communications increases. The reason is twofold. On one hand, the marginal utility of enlarging the number of local machines diminishes when $m_B^N$ of DKRR is already large enough to deal efficiently with large-scale data. On the other hand, the training complexity of DKRR($\ell$) is dominated by the communications when the number $m$ is sufficiently large. 4) For the 3-dimensional data, the number $m_B^N$ of DKRR is relatively small (e.g., $m_B^N = 12$ for $N = 20000$ training samples). DKRR($\ell$) then shows significantly superior performance in training time compared with DKRR, even for the large number $\ell = 32$ of communications. This is exactly the focus of this paper: enlarging the maximum number of local machines that guarantees optimal learning rates when DKRR only admits a small number $m_B^N$ of local machines. All these simulations verify our theoretical statements and show the power of communications in distributed learning.

Acknowledgments

The work of Shao-Bo Lin is supported partially by the National Natural Science Foundation of China (Nos.
61876133 and 11771012). The work of Di Wang is supported partially by the National Natural Science Foundation of China (Grant No. 61772374) and by the Zhejiang Provincial Natural Science Foundation (Grant No. LY17F030004). The work of Ding-Xuan Zhou is supported partially by the Research Grants Council of Hong Kong (Project No. CityU 11306617), by the Hong Kong Institute for Data Science, and by the National Natural Science Foundation of China (Grant No. 11461161006).

Appendix A. Proofs of Lemmas in Section 5

We use a new concentration inequality for positive operators (Minsker, 2017), which is a refined estimate of the well-known Bernstein inequality for matrices (Tropp, 2015). This lemma has been adopted in (Dicker et al., 2017) and (Guo and Shi, 2019) to derive optimal learning rates for kernel-based spectral algorithms and coefficient-based regularization algorithms, respectively.

Lemma 17 Let $R > 0$ be a positive real constant and consider a finite sequence of self-adjoint Hilbert-Schmidt operators $\{\xi_i\}_{i=1}^n$ satisfying $\mathbb{E}[\xi_i] = 0$ and $\|\xi_i\| \le R$ almost surely. Suppose there are constants $V, W > 0$ such that $\|\mathbb{E}[(\sum_{i=1}^n \xi_i)^2]\| \le V$ and $\mathrm{Tr}(\mathbb{E}[(\sum_{i=1}^n \xi_i)^2]) \le VW$. Then, for all $t \ge V^{1/2} + R/3$, there holds
$$\mathbb{P}\Big[\Big\|\sum_{i=1}^n \xi_i\Big\| \ge t\Big] \le 4W \exp\left(-\frac{t^2}{2(V + Rt/3)}\right).$$

Proof of Lemma 6. Define the random variable
$$\eta(x) = (L_K + \lambda I)^{-1/2} (K_x \otimes K_x) (L_K + \lambda I)^{-1/2}, \qquad x \in \mathcal{X}.$$
It is easy to see that $\eta(x)$ is a self-adjoint operator for $x \in \mathcal{X}$. Furthermore,
$$\mathbb{E}[\eta] = (L_K + \lambda I)^{-1/2} L_K (L_K + \lambda I)^{-1/2}, \qquad \frac{1}{|D|}\sum_{(x,y)\in D} \eta(x) = (L_K + \lambda I)^{-1/2} L_{K,D} (L_K + \lambda I)^{-1/2}.$$
Set $\xi(x) = \eta(x) - \mathbb{E}[\eta]$, $x \in \mathcal{X}$. Then we have $\mathbb{E}[\xi] = 0$. If $\{\sigma_\ell, \psi_\ell\}_{\ell \ge 1}$ are normalized eigenpairs of the integral operator $L_K$ on $L^2_{\rho_X}$, then $\|\sqrt{\sigma_\ell}\,\psi_\ell\|_K = 1$ when $\sigma_\ell > 0$. Thus $\{\sqrt{\sigma_\ell}\,\psi_\ell : \sigma_\ell > 0\}$ forms an orthonormal basis of $\mathcal{H}_K$. This implies
$$\|\eta(x)\|^2 = \sup_{\|f\|_K = 1} \|\eta(x) f\|_K^2 \le \sum_{\ell} \big\|\eta(x)\sqrt{\sigma_\ell}\,\psi_\ell\big\|_K^2 = \sum_{\ell}\frac{\sigma_\ell\psi_\ell^2(x)}{\lambda+\sigma_\ell}\,\big\|(L_K+\lambda I)^{-1/2}K_x\big\|_K^2 = \left(\sum_{\ell}\frac{\sigma_\ell\psi_\ell^2(x)}{\lambda+\sigma_\ell}\right)^2,$$
where we use $\|(L_K+\lambda I)^{-1/2}K_x\|_K^2 = \sum_\ell \sigma_\ell\psi_\ell^2(x)/(\lambda+\sigma_\ell)$. Consequently,
$$\|\xi(x)\| = \|\eta(x) - \mathbb{E}[\eta]\| \le \max_{x\in\mathcal{X}} \|\eta(x)\| + 1 \le \max_{x\in\mathcal{X}} \sum_\ell \frac{\sigma_\ell \psi_\ell^2(x)}{\lambda + \sigma_\ell} + 1 \le (\kappa^2 + 1)\lambda^{-1},$$
where we use $\|\mathbb{E}[\eta]\| = \|(L_K + \lambda I)^{-1} L_K\| \le 1$ and $0 < \lambda \le 1$. For an arbitrary $f \in \mathcal{H}_K$, it follows from $[\eta(x)]^2 f = \eta(x)[\eta(x) f]$ that
$$[\eta(x)]^2 = \langle K_x, (L_K + \lambda I)^{-1} K_x\rangle_K\, \eta(x) = \sum_\ell \frac{\sigma_\ell \psi_\ell^2(x)}{\lambda + \sigma_\ell}\, \eta(x).$$
Then we get from $\xi(x) = \eta(x) - \mathbb{E}[\eta]$ that
$$\Big\|\mathbb{E}\Big[\Big(\sum_{(x,y)\in D} \xi(x)\Big)^2\Big]\Big\| \le |D|\big(\|\mathbb{E}[\eta^2]\| + \|(\mathbb{E}[\eta])^2\|\big) \le |D|\big(\kappa^2 \lambda^{-1} \|\mathbb{E}[\eta]\| + \|\mathbb{E}[\eta]\|^2\big) \le |D|\lambda^{-1}(\kappa^2 + 1)$$
and
$$\mathrm{Tr}\Big(\mathbb{E}\Big[\Big(\sum_{(x,y)\in D} \xi(x)\Big)^2\Big]\Big) \le |D|\,\mathrm{Tr}\big(\mathbb{E}[\eta^2] - (\mathbb{E}[\eta])^2\big) \le |D|\big(\kappa^2\lambda^{-1}\mathrm{Tr}(\mathbb{E}[\eta]) + \mathrm{Tr}((\mathbb{E}[\eta])^2)\big) \le |D|\lambda^{-1}\mathcal{N}(\lambda)(\kappa^2 + 1).$$
Plugging $R = \lambda^{-1}(\kappa^2+1)$, $V = |D|\lambda^{-1}(\kappa^2+1)$ and $W = \mathcal{N}(\lambda)$ into Lemma 17, we have that for all
$$t \ge \sqrt{\frac{\kappa^2+1}{\lambda|D|}} + \frac{\kappa^2+1}{3\lambda|D|},$$
there holds
$$\mathbb{P}[R_{D,\lambda} \ge t] = \mathbb{P}\Big[\Big\|\sum_{(x,y)\in D}\xi(x)\Big\| \ge |D|t\Big] \le 4\mathcal{N}(\lambda)\exp\left(-\frac{\lambda|D|t^2}{2(\kappa^2+1) + (\kappa^2+1)t/3}\right).$$
Set $4\mathcal{N}(\lambda)\exp\big\{-\frac{\lambda|D|\varepsilon^2}{(\kappa^2+1)(2+\varepsilon/3)}\big\} = \delta$. Solving for $\varepsilon$ shows that, with confidence $1-\delta$,
$$\varepsilon \le \frac{(\kappa^2+1)\log\frac{4\mathcal{N}(\lambda)}{\delta}}{3\lambda|D|} + \sqrt{\frac{2(\kappa^2+1)\log\frac{4\mathcal{N}(\lambda)}{\delta}}{\lambda|D|}}.$$
Hence, taking
$$t = \frac{(\kappa^2+1)\log\frac{4\mathcal{N}(\lambda)}{\delta}}{3\lambda|D|} + \sqrt{\frac{2(\kappa^2+1)\log\frac{4\mathcal{N}(\lambda)}{\delta}}{\lambda|D|}},$$
we then get that (21) holds with confidence $1-\delta$. This completes the proof of Lemma 6.

We then prove Lemma 7 by using Lemma 6.

Proof of Lemma 7. Since $\delta \ge 4\exp\{-2C^{-1}B_{D,\lambda}\}$, it follows from Lemma 6 that $R_{D,\lambda} < \frac{1}{2}$ holds with confidence $1-\delta$. Then,
$$(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2} = (L_K+\lambda I)^{1/2}\big[(L_{K,D}+\lambda I)^{-1} - (L_K+\lambda I)^{-1}\big](L_K+\lambda I)^{1/2} + I$$
$$= I + (L_K+\lambda I)^{-1/2}(L_K - L_{K,D})(L_K+\lambda I)^{-1/2}\,(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}.$$
Therefore,
$$\big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\| \le 1 + R_{D,\lambda}\,\big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\| \le 1 + \frac{1}{2}\,\big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\|.$$
This implies $\big\|(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big\| \le 2$. Since $t^{1/2}$ is operator monotone (Bhatia, 1997, Chap. 4), we have
$$Q_{D,\lambda} \le \Big\|\big[(L_K+\lambda I)^{1/2}(L_{K,D}+\lambda I)^{-1}(L_K+\lambda I)^{1/2}\big]^{1/2}\Big\| \le \sqrt{2}.$$
This completes the proof of Lemma 7.
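The effective dimension $\mathcal{N}(\lambda) = \mathrm{Tr}((L_K+\lambda I)^{-1}L_K)$ appearing in the bounds above can be estimated from data through the Gram matrix. The following small Python sketch uses the standard plug-in estimate $\mathrm{Tr}\big(\mathbf{K}(\mathbf{K} + \lambda n I)^{-1}\big)$ on a synthetic 1-dimensional sample with a Gaussian kernel; the kernel width and the sample are illustrative choices, not taken from the experiments.

```python
import numpy as np

def empirical_effective_dimension(K, lam):
    """Plug-in estimate of N(lambda) = Tr((L_K + lam I)^{-1} L_K),
    computed from an n x n Gram matrix K as Tr(K (K + lam * n * I)^{-1})."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K) / n          # eigenvalues of the normalized Gram matrix
    return float(np.sum(eigvals / (eigvals + lam)))

# Example: N(lambda) shrinks as lambda grows, which drives the confidence
# term log(4 N(lambda) / delta) in the probabilistic bound of Lemma 6.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))
K = np.exp(-(X - X.T) ** 2 / (2 * 0.1 ** 2))     # Gaussian kernel Gram matrix, width 0.1
for lam in (1e-1, 1e-2, 1e-3):
    print(lam, empirical_effective_dimension(K, lam))
```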
Appendix B. Training and Testing Flows for DKRR(ℓ)

Generally speaking, it is difficult to communicate functions in practice. Thus, the implementation of DKRR($\ell$) requires communicating additional information. In this part, we numerically realize DKRR($\ell$) by communicating the inputs of the data for the sake of simplicity, although there are numerous state-of-the-art approaches that approximate the gradients with much lower communication cost.

Training Flow:

Step 1 (local processing). Run KRR (2) on the $j$-th local machine with data $D_j$ and communicate $D_j(x) = \{x : (x, y) \in D_j\}$ to the $k$-th local machine for $k = 1, \ldots, m$. Store the $m$ matrices $K_{D_k,D_j} := \{K(x, x')\}_{x \in D_k(x),\, x' \in D_j(x)}$ of size $|D_k| \times |D_j|$ for $k = 1, \ldots, m$, the matrix $M_{D_j,\lambda} = (K_{D_j,D_j} + \lambda|D_j| I_{D_j})^{-1}$, and the vector $\alpha_{D_j} = M_{D_j,\lambda}\, y_{D_j}$, where $I_{D_j}$ is the identity matrix of size $|D_j| \times |D_j|$.

Step 2 (synthesization). On the $j$-th local machine, communicate the $m$ vectors $K_{D_k,D_j}\alpha_{D_j}$ of size $|D_k|$, $k = 1, \ldots, m$, to the global machine. Synthesize the $m$ global vectors
$$f^0_{D,\lambda,D_k} = \sum_{j=1}^m \frac{|D_j|}{|D|} K_{D_k,D_j}\alpha_{D_j}, \qquad k = 1, \ldots, m.$$

Step 3 (distributing). For $\ell = 1, 2, \ldots$, distribute $f^{\ell-1}_{D,\lambda,D_j}$ to the $j$-th local machine.

Step 4 (local gradients). On the $j$-th local machine, compute the $m$ gradient vectors
$$G_{D_j,\lambda,k,\ell} = \frac{K_{D_k,D_j}}{|D_j|}\big(f^{\ell-1}_{D,\lambda,D_j} - y_{D_j}\big) + \lambda f^{\ell-1}_{D,\lambda,D_k}, \qquad k = 1, \ldots, m,$$
and communicate these $m$ vectors to the global machine.

Step 5 (synthesizing gradients). Synthesize the $m$ global gradient vectors
$$G_{D,\lambda,k,\ell} := \sum_{j=1}^m \frac{|D_j|}{|D|} G_{D_j,\lambda,k,\ell}, \qquad k = 1, \ldots, m,$$
and distribute the $m$ vectors $G_{D,\lambda,k,\ell}$, $k = 1, \ldots, m$, to each local machine.

Step 6 (KRR on gradient data). On the $j$-th local machine, compute the vector $\beta_{j,\ell} = M_{D_j,\lambda}\, G_{D,\lambda,j,\ell}$ and the $m$ vectors $H_{D_j,\lambda,k,\ell} := G_{D,\lambda,k,\ell} - K_{D_k,D_j}\beta_{j,\ell}$, $k = 1, \ldots, m$. Communicate the $m$ vectors $H_{D_j,\lambda,k,\ell}$, $k = 1, \ldots, m$, to the global machine.

Step 7 (final estimator). Generate the $m$ vectors
$$f^\ell_{D,\lambda,D_k} = f^{\ell-1}_{D,\lambda,D_k} - \frac{1}{\lambda|D|}\sum_{j=1}^m |D_j|\, H_{D_j,\lambda,k,\ell}, \qquad k = 1, \ldots, m,$$
and transmit $f^\ell_{D,\lambda,D_j}$ to the $j$-th local machine for $j = 1, \ldots, m$.

Testing Flow: Given the vector $D'(x) := (x'_1, \ldots, x'_{|D'|})$, which consists of $|D'|$ query points.

Step 1 (local estimates). Transmit $D'(x)$ to the $m$ local machines. On the $j$-th local machine, store the vector $K_{D',D_j}\alpha_{D_j}$ of size $|D'|$.

Step 2 (global estimates). Compute the test vector of size $|D'|$,
$$f^0_{D,\lambda}(D') := \sum_{j=1}^m \frac{|D_j|}{|D|} K_{D',D_j}\alpha_{D_j}.$$

Step 3 (local gradients). For $\ell = 1, 2, \ldots$, distribute $f^{\ell-1}_{D,\lambda}(D')$ to the $m$ local machines, and compute
$$G_{D_j,\lambda,\ell}(D') := \frac{K_{D',D_j}}{|D_j|}\big(f^{\ell-1}_{D,\lambda,D_j} - y_{D_j}\big) + \lambda f^{\ell-1}_{D,\lambda}(D'),$$
where $f^{\ell-1}_{D,\lambda,D_j}$ is obtained from the training flow.

Step 4 (global gradients). Transmit $G_{D_j,\lambda,\ell}(D')$, $j = 1, \ldots, m$, to the global machine and form the global gradient vector
$$G_{D,\lambda,\ell}(D') = \sum_{j=1}^m \frac{|D_j|}{|D|} G_{D_j,\lambda,\ell}(D').$$

Step 5 (final estimator). Generate the vector of estimates
$$f^\ell_{D,\lambda}(D') = f^{\ell-1}_{D,\lambda}(D') - \frac{1}{\lambda|D|}\sum_{j=1}^m |D_j|\Big[G_{D,\lambda,\ell}(D') - K_{D',D_j}\beta_{j,\ell}\Big].$$
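For concreteness, here is a minimal single-process NumPy sketch of the training flow above, simulating the $m$ local machines and the global machine inside one process. The Gaussian kernel, the synthetic data, and the kernel width are illustrative assumptions; the $1/\lambda$ factor in Step 7 follows the Newton-type reading of the update used above.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Gaussian kernel matrix {K(x, x')} for rows of A against rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def dkrr_l_train(X_blocks, y_blocks, lam, n_comm, kernel=gaussian_kernel):
    """One-process simulation of the DKRR(l) training flow (Steps 1-7 above).
    Returns f^l at the points of each block, plus alpha_{D_j} and the
    per-iteration beta_{j,l} that the testing flow would reuse."""
    m = len(X_blocks)
    sizes = [len(x) for x in X_blocks]
    N = float(sum(sizes))
    # Step 1 (local processing): kernel blocks K_{D_k,D_j}, M_{D_j,lam}, alpha_{D_j}.
    K = [[kernel(X_blocks[k], X_blocks[j]) for j in range(m)] for k in range(m)]
    M = [np.linalg.inv(K[j][j] + lam * sizes[j] * np.eye(sizes[j])) for j in range(m)]
    alpha = [M[j] @ y_blocks[j] for j in range(m)]
    # Step 2 (synthesization): f^0 at the points of every block.
    f = [sum(sizes[j] / N * (K[k][j] @ alpha[j]) for j in range(m)) for k in range(m)]
    betas = []
    for _ in range(n_comm):                       # Steps 3-7, one pass per communication
        # Steps 4-5: local gradients, averaged into the global gradient at each block.
        G = [sum(sizes[j] / N * (K[k][j] @ (f[j] - y_blocks[j]) / sizes[j] + lam * f[k])
                 for j in range(m)) for k in range(m)]
        # Step 6: local corrections beta_{j,l} and H_{D_j,lam,k,l}.
        beta = [M[j] @ G[j] for j in range(m)]
        H = [[G[k] - K[k][j] @ beta[j] for j in range(m)] for k in range(m)]
        # Step 7: weighted average of the corrections, scaled by 1/lam.
        f = [f[k] - sum(sizes[j] * H[k][j] for j in range(m)) / (lam * N) for k in range(m)]
        betas.append(beta)
    return f, alpha, betas

# Toy usage on synthetic 1-dimensional data split over m = 4 "machines".
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(400)
X_blocks, y_blocks = np.array_split(X, 4), np.array_split(y, 4)
f_blocks, alpha, betas = dkrr_l_train(X_blocks, y_blocks, lam=1e-3, n_comm=4,
                                      kernel=lambda A, B: gaussian_kernel(A, B, width=0.2))
```

The testing flow (Steps 1–5 above) would reuse `alpha` and the per-iteration `betas` in the same way, with $K_{D',D_j}$ in place of $K_{D_k,D_j}$.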
References

R. Bhatia. Matrix Analysis, Volume 169 of Graduate Texts in Mathematics. Springer-Verlag Press, Berlin, 1997.
A. Bellet, Y. Liang, A. B. Garakani, M. F. Balcan, and F. Sha. A distributed Frank-Wolfe algorithm for communication-efficient sparse learning. In Proceedings of the Fifteenth SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pages 478-486, Vancouver, British Columbia, 2015.
G. Blanchard and N. Krämer. Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications, 14(6):763-794, 2016.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least squares algorithm. Foundations of Computational Mathematics, 7:331-368, 2007.
X. Chang, S. B. Lin, and D. X. Zhou. Distributed semi-supervised learning with kernel ridge regression. Journal of Machine Learning Research, 18(46):1-22, 2017.
X. Chang, S. B. Lin, and Y. Wang. Divide and conquer local average regression. Electronic Journal of Statistics, 11(1):1326-1350, 2017.
F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
Z. C. Guo, S. B. Lin, and D. X. Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.
Z. C. Guo and L. Shi. Optimal rates for coefficient-based regularized regression. Applied and Computational Harmonic Analysis, 47(3):662-701, 2019.
Z. C. Guo, S. B. Lin, and L. Shi. Distributed learning with multi-penalty regularization. Applied and Computational Harmonic Analysis, 46(3):478-499, 2019.
L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022-1047, 2017.
R. M. Dudley. Real Analysis and Probability, Cambridge Studies in Advanced Mathematics 74. Cambridge University Press, Cambridge, 2002.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag Press, Berlin, 2002.
T. Hu, Q. Wu, and D. X. Zhou. Distributed kernel gradient descent algorithm for minimum error entropy principle. Applied and Computational Harmonic Analysis, 49(1):229-256, 2020.
C. Huang and X. M. Huo. A distributed one-step estimator. Mathematical Programming, 174(1-2):41-76, 2019.
M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19-27, Montréal, Québec, 2014.
J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48(3):868-890, 2020.
S. B. Lin, X. Guo, and D. X. Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1-31, 2017.
S. B. Lin and D. X. Zhou. Distributed kernel-based gradient descent algorithms. Constructive Approximation, 47:249-276, 2018.
S. B. Lin, Y. Lei, and D. X. Zhou. Boosted kernel ridge regression: optimal learning rates and early stopping. Journal of Machine Learning Research, 20(46):1-36, 2019.
M. Liu, Z. Shang, and G. Cheng. How many machines can we use in parallel computing for kernel ridge regression? arXiv preprint arXiv:1805.09948, 2018.
G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pages 1231-1239, Whistler, British Columbia, 2009.
S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics & Probability Letters, 127:111-119, 2017.
N. Mücke and G. Blanchard. Parallelizing spectrally regularized kernel algorithms. Journal of Machine Learning Research, 19(30):1-29, 2018.
M. Pang and H. Sun. Distributed regression learning with coefficient regularization. Journal of Mathematical Analysis and Applications, 466(1):676-689, 2018.
I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Annals of Probability, 22(4):1679-1706, 1994.
R. Schaback and H. Wendland. Kernel techniques: from machine learning to meshless methods. Acta Numerica, 15:543-639, 2006.
O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, pages 1000-1008, Beijing, 2014.
Z. Shang and G. Cheng. Computational limits of a distributed algorithm for smoothing spline. Journal of Machine Learning Research, 18(108):1-37, 2017.
L. Shi. Distributed learning with indefinite kernels. Analysis and Applications, 17(6):947-975, 2019.
S. Smale and D. X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153-172, 2007.
I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In S. Dasgupta and A. Klivans, editors, Annual Conference on Learning Theory, pages 79-93, 2009.
J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1-230, 2015.
Z. M. Wu. Compactly supported positive definite radial functions. Advances in Computational Mathematics, 4:283-292, 1995.
J. Zeng and W. Yin. On nonconvex decentralized gradient descent. IEEE Transactions on Signal Processing, 66(11):2834-2848, 2018.
Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(68):3321-3363, 2013.
Y. Zhang, J. C. Duchi, and M. J. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16(102):3299-3340, 2015.
D. X. Zhou. Deep distributed convolutional neural networks: Universality. Analysis and Applications, 16(6):895-919, 2018.
D. X. Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787-794, 2020.
D. X. Zhou. Distributed approximation with deep convolutional neural networks. Preprint, 2018.