Distributed Randomized Sketching Kernel Learning

Rong Yin,1,2 Yong Liu,3,4 Dan Meng1,2
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
4 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
yinrong@iie.ac.cn, liuyonggsai@ruc.edu.cn, mengdan@iie.ac.cn
Corresponding author: Yong Liu

We investigate the statistical and computational requirements of distributed kernel ridge regression with randomized sketching (DKRR-RS) and achieve the optimal learning rate with only a fraction of the computations. More precisely, the proposed DKRR-RS combines sparse randomized sketching, divide-and-conquer, and KRR to scale up kernel methods, and at the basic setting it attains, in expectation, the same learning rate as exact KRR while greatly reducing the computational cost, which improves on previous state-of-the-art solutions. Then, to close the gap between theory and experiments, we derive the optimal learning rate in probability for DKRR-RS to reflect its generalization performance. Finally, to further improve the learning performance, we construct an efficient communication strategy for DKRR-RS and demonstrate the power of communications via theoretical assessment. Extensive experiments validate the effectiveness of DKRR-RS and the communication strategy on real datasets.

Introduction

Kernel methods have been widely used in data mining, machine learning, and other fields (Yin et al. 2020c; Saunders, Gammerman, and Vovk 1998; Liu 2021; Li and Liu 2021b; Yin et al. 2020b; Li and Liu 2021a; Wang et al. 2021; Li et al. 2018). However, they are infeasible for large-scale scenarios due to their high computational requirements, typically at least quadratic in the number of data points. To address this issue, a variety of approximate kernel ridge regression (KRR) methods have been proposed. The main principle is to characterize statistical and computational trade-offs, that is, to sacrifice statistical accuracy to gain computational benefits. Representative methods include Nyström approximation (Yin et al. 2021, 2020a; Rudi, Carratino, and Rosasco 2017; Li, Kwok, and Lu 2010), which constructs an approximate kernel matrix from a few anchor points; random features (Liu, Liu, and Wang 2021; Li, Liu, and Wang 2019; Rudi, Camoriano, and Rosasco 2016; Rahimi and Recht 2007); iterative optimization (Lin and Cevher 2020; Lin, Lei, and Zhou 2019; Shalev-Shwartz et al. 2011); distributed learning (Liu, Liu, and Wang 2021; Lin, Wang, and Zhou 2020; Wang 2019; Guo, Lin, and Shi 2019; Chang, Lin, and Zhou 2017; Lin, Guo, and Zhou 2017; Zhang, Duchi, and Wainwright 2015, 2013), which divides the training data into subsets processed on local processors and carries out the necessary communications; and randomized sketching (Lin and Cevher 2020; Lian, Liu, and Fan 2021; Liu, Shang, and Cheng 2019; Yang, Pilanci, and Wainwright 2017), which projects the kernel matrix onto a smaller one via a sketch matrix. The above studies show that randomized sketching and distributed learning are particularly effective for kernel methods.
Recently, combinations of these accelerated algorithms have attracted much attention and their learning properties have been explored, including the combination of Nyström and iterative optimization (Rudi, Carratino, and Rosasco 2017), randomized sketching and iterative optimization (Lin and Cevher 2020), and divide-and-conquer and multi-pass SGD (Lin and Cevher 2018). However, the computational requirements of the existing approximate KRR estimates are still high, and both the complexity reduction and the statistical analysis of combining randomized sketching with distributed learning in kernel learning remain unclear. To further overcome these computational bottlenecks, we propose DKRR-RS, which combines divide-and-conquer learning with a sparse randomized sketching method for KRR and keeps the optimal learning rate. This paper makes the following main contributions. 1) The proposed DKRR-RS improves the existing state-of-the-art results for distributed learning and randomized sketching. In particular, at the basic setting (2ζ + γ = 2) and under the basic assumptions (see the section Basic Assumptions for details), DKRR-RS requires only O(N^{0.5+γ}) time and O(N) space in expectation to guarantee the same learning rate as exact KRR, where N is the number of data, ζ ∈ [1/2, 1], and γ ∈ [0, 1]. This takes a substantial step in provably reducing the computational requirements by combining randomized sketching with distributed learning. 2) The theoretical performance in expectation reflects the average result of many trials, but may fail to capture the learning performance of a single trial. To essentially reflect the generalization performance, we obtain the optimal learning rate of DKRR-RS in probability and numerically verify its effectiveness. 3) We propose a communication strategy, called DKRR-RS-CM, to further improve the learning performance of DKRR-RS, and validate the theoretical bounds via numerical experiments.

The rest of the paper is organized as follows. We first introduce the related work and preliminaries. Then the section The Proposed Algorithms introduces the proposed methods DKRR-RS and DKRR-RS-CM. The theoretical results are shown in the following section. Finally, the experiments and conclusions are presented.

Related Work

The optimal learning rate of randomized sketching in KRR was first established in (Yang, Pilanci, and Wainwright 2017), which utilizes the fast Fourier transform to accelerate the matrix products in random orthogonal system (ROS) sketches. However, due to the dense sketch matrices involved (the discrete Fourier transform matrix and the Hadamard matrix), O(N^2 + Nm^2) time and O(N^2) space are needed for the optimal learning rate with the sketch dimension m = Ω(d_N log^4(N)), where d_N > 1. Subsequently, the randomized sketching method in (Liu, Shang, and Cheng 2019), called Gauss, also achieves the optimal learning rate with time O(N^2 m) and space O(N^2), based on a dense Gaussian sketch matrix and the local Rademacher framework, when m = Ω(d_N). Recently, Lin and Cevher (Lin and Cevher 2020) constructed a new randomized sketching method called Subgauss, in which the entries of the sketch matrix are i.i.d. sub-Gaussian (such as Gaussian or Bernoulli). They first introduced the integral operator framework to derive the optimal learning rate with time O(N^2 m) and space O(N^2), when m = Ω(N^{γ/(2ζ+γ)}), where γ/(2ζ+γ) > 0. Heng Lian et al.
(Lian, Liu, and Fan 2021) combined divide-and-conquer and random sketching based on two well-known types of dense sketch matrices, the Gaussian sketch and the ROS sketch. As mentioned above, dense sketch matrices lead to high time and space complexity. Moreover, the conditions of the theoretical analysis in (Lian, Liu, and Fan 2021) differ from those in this paper. Compared with these methods, DKRR-RS keeps lower time and space complexity with the optimal generalization error under the same conditions.

Representative distributed KRR methods include DKRR (Guo, Lin, and Shi 2019; Chang, Lin, and Zhou 2017; Lin, Guo, and Zhou 2017; Zhang, Duchi, and Wainwright 2015, 2013) based on divide-and-conquer, DKRR-RF (Li, Liu, and Wang 2019) based on DKRR and random features (Rudi, Camoriano, and Rosasco 2016), and DKRR-NY-PCG (Yin et al. 2020a) based on DKRR and Nyström-PCG (Rudi, Carratino, and Rosasco 2017), all of which derive the optimal learning rate in expectation. However, they place a restrictive limit on the number of local processors p: to derive the optimal learning rate, p must be restricted to a constant in the most popular case (ζ = 1/2, γ = 1). Subsequently, Yin et al. (Yin et al. 2021), Liu et al. (Liu, Liu, and Wang 2021), and Lin et al. (Lin, Wang, and Zhou 2020) introduced communication strategies into DKRR and DKRR-RF to relax the restriction on p. However, (Lin, Wang, and Zhou 2020) requires communicating the input data between local processors, which makes it difficult to protect data privacy. In addition, its high communication complexity O(Nd) at each iteration makes it infeasible in practice for large-scale datasets, where d is the input dimension.

Preliminaries

Let X be the input space and R the output space. The training set D = ∪_{j=1}^p D_j = {(x_i, y_i)}_{i=1}^N is sampled identically and independently from X × R with respect to ρ, where ρ is a fixed but unknown distribution and p > 1. All subsets {D_j}_{j=1}^p are disjoint and |D_1| = ... = |D_p| = n. Let ρ_X(·) be the marginal measure on X induced by ρ, and ρ(·|x) the conditional probability measure on R with respect to x ∈ X and ρ. We denote by K_x the function K(x, ·) and by (H, ⟨·, ·⟩_H) the Hilbert space of functions with the inner product induced by K, defined by ⟨K_x, K_{x'}⟩_H = K(x, x') for all x, x' ∈ X. Let L²_{ρ_X} be the Lebesgue space of ρ_X-square-integrable functions, endowed with the inner product ⟨φ, ψ⟩_{ρ_X} = ∫ φ(x)ψ(x) dρ_X(x) for φ, ψ ∈ L²_{ρ_X}, and norm ‖ψ‖_ρ = √⟨ψ, ψ⟩_{ρ_X} for all ψ ∈ L²_{ρ_X}. For any f ∈ H and φ ∈ L²_{ρ_X}, define the linear map S: H → L²_{ρ_X} by Sf = ⟨f, K(·)⟩_H ∈ L²_{ρ_X}, with adjoint S*: L²_{ρ_X} → H given by S*φ = ∫ φ(x) K_x dρ_X(x) ∈ H; define L: L²_{ρ_X} → L²_{ρ_X} by L = SS* and T: H → H by T = S*S. For any υ ∈ ℝ^n, define S_n: H → ℝ^n by S_n f = (1/√n)(⟨f, K_{x_i}⟩)_{i=1}^n ∈ ℝ^n, with adjoint S_n*: ℝ^n → H given by S_n*υ = (1/√n) ∑_{i=1}^n υ_i K_{x_i} ∈ H, and T_n: H → H by T_n = S_n* S_n.

Kernel Ridge Regression (KRR)

Given a hypothesis space H of measurable functions from X to R, the goal of supervised learning can be formalized as minimizing the expected risk

inf_{f∈H} E(f),  E(f) = ∫_{X×R} (f(x) - y)² dρ(x, y).  (1)

Define the regression function (Steinwart and Christmann 2008) that minimizes the expected risk over all measurable functions as f_ρ(x) = ∫_R y dρ(y|x), almost everywhere. A good empirical solution f̂ should have a small excess risk E(f̂) - inf_{f∈H} E(f). Suppose there is an f_H ∈ H such that E(f_H) = min_{f∈H} E(f). This paper focuses on KRR. Given a Mercer kernel K: X × X → ℝ, KRR can be stated as

f̂_{D,λ} = argmin_{f∈H} (1/N) ∑_{i=1}^N (f(x_i) - y_i)² + λ‖f‖²_H,  λ > 0.  (2)
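In the operator notation introduced above (assuming the 1/√N normalization of S_N and ŷ_N, reconstructed following (Rudi, Camoriano, and Rosasco 2015)), the empirical objective in Eq. (2) can be written as a regularized least-squares problem over H; this is the identity behind Eqs. (9) and (11)-(13) below:

```latex
\frac{1}{N}\sum_{i=1}^{N}\bigl(f(x_i)-y_i\bigr)^{2}+\lambda\|f\|_{H}^{2}
  \;=\;\bigl\|S_N f-\hat y_N\bigr\|_{2}^{2}+\lambda\|f\|_{H}^{2},
\qquad
S_N f=\tfrac{1}{\sqrt N}\bigl(f(x_i)\bigr)_{i=1}^{N},\quad
\hat y_N=\tfrac{1}{\sqrt N}\,y_N ,
\\[4pt]
\tfrac{1}{2}\,\nabla_f\Bigl[\|S_N f-\hat y_N\|_2^{2}+\lambda\|f\|_H^{2}\Bigr]
  =(T_N+\lambda I)f-S_N^{*}\hat y_N
\;\;\Longrightarrow\;\;
(T_N+\lambda I)\,\hat f_{D,\lambda}=S_N^{*}\,\hat y_N .
```

Projecting this stationarity condition onto the sketched hypothesis space H_m yields Eq. (9), and the expression on the left of the second line is exactly the "half gradient" G_{D,m,λ} of Eq. (11) once the projection P_m is applied.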
There is a unique closed-form solution to Eq. (2) according to the Representer Theorem (Schölkopf, Herbrich, and Smola 2001):

f̂_{D,λ}(x) = ∑_{i=1}^N α̂_i K(x_i, x),  with α̂ = (K_N + λNI)^{-1} y,  (3)

where K_N is the N × N kernel matrix with K_N(i, j) = K(x_i, x_j) and y = [y_1, ..., y_N]^T. The O(N³) time for solving the linear system in Eq. (3) and the O(N²) memory for storing the kernel matrix K_N are computationally infeasible in the large-scale setting.

KRR with Divide-and-Conquer (DKRR)

KRR with divide-and-conquer (DKRR) is defined as

f̄_{D,λ} = (1/p) ∑_{j=1}^p f̂_{D_j,λ},  (4)

where f̂_{D_j,λ} is the solution in Eq. (3) computed on the subset D_j. Its time complexity and space complexity are O(N³/p³) and O(N²/p²), respectively.

The Proposed Algorithms

DKRR with Randomized Sketching (DKRR-RS)

We introduce a novel randomized sketching into DKRR in Eq. (4), called DKRR-RS. The randomized sketching is based on a sparse sketch matrix R ∈ ℝ^{m×n}, where the sketch dimension m < n. Let σ(i) ∈ {-1/√m, +1/√m} be a 2-wise independent hash function, with σ(i) = t for t ∈ {-1/√m, +1/√m} with probability 1/2. The entries R_{i,j} of R are designed as

R_{i,j} = σ(i) with probability m/n, and R_{i,j} = 0 with probability 1 - m/n.  (5)

Note that the sketch matrix R is sparse: one only needs to store the non-zero elements. Moreover, in a matrix-matrix product (say of the form RA for some matrix A), one can achieve a significant n/m-fold speedup compared with a dense sketch matrix. We project the kernel matrix K_n by the sketch matrix R; that is, we restrict the approximate KRR solver to the hypothesis space H_m = {f | f = ∑_{i=1}^n (R^T α̂_j)_i K(x_i, ·), α̂_j ∈ ℝ^m}. Let f̃_{D_j,m,λ} be the local estimator for the j-th subset D_j in DKRR-RS. Then the problem

f̃_{D_j,m,λ} = argmin_{f∈H_m} (1/n) ∑_{i=1}^n (f(x_i) - y_i)² + λ‖f‖²_H  (6)

can be transformed into

f̃_{D_j,m,λ}(x) = ∑_{i=1}^n (R^T α̂_j)_i K(x_i, x),  (7)

with

α̂_j = (R K_n² R^T + λn R K_n R^T)^{-1} (R K_n y_n).  (8)

Equivalently, f̃_{D_j,m,λ} is characterized by the equation

(P_m T_n P_m + λI) f̃_{D_j,m,λ} = P_m S_n* ŷ_n,  (9)

where P_m is the projection operator with range H_m and ŷ_n = (1/√n) y_n (Rudi, Camoriano, and Rosasco 2015). The global estimator is obtained by the weighted average of the approximate local estimators:

f̄⁰_{D,m,λ} = (1/p) ∑_{j=1}^p f̃_{D_j,m,λ}.  (10)

The prediction stage is based on the approximate parameters α̂_j of each local processor, and we obtain the final prediction estimator by averaging the local estimators of the local processors.

Complexity Analysis of DKRR-RS

Time Complexity: Due to the sparsity of the sketch matrix R, only the non-zero elements need to be multiplied in the matrix-matrix product, instead of every element. Therefore, the cost of computing RK_n is only O(Nm²/p). The matrix K_n² does not need to be represented explicitly: RK_n²R^T can be written as (RK_n)(RK_n)^T, whose time complexity is O(Nm²/p) in addition to computing RK_n. Taking the remaining operations into account, the time complexity of DKRR-RS sums up to O(Nm²/p). Space Complexity: The kernel matrix K_n on each subset is the decisive part of the memory consumption, so the space complexity is O(N²/p²). Communication Complexity: It is O(m) in DKRR-RS.
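To make Eqs. (5)-(10) concrete, below is a minimal NumPy sketch of the local computation and the averaged predictor. It assumes the ±1/√m sign normalization and the m/n sampling probability as read off from Eq. (5), uses a Gaussian kernel for illustration, and stores R densely for brevity (a practical implementation would keep only the non-zero entries, e.g. with scipy.sparse); all names are illustrative and not taken from the authors' code.

```python
import numpy as np

def sparse_sketch(m, n, rng):
    """Sketch matrix of Eq. (5): entries are +/- 1/sqrt(m) with prob. m/n, else 0."""
    signs = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
    mask = rng.random((m, n)) < m / n          # ~m non-zeros per row on average
    return signs * mask

def gaussian_kernel(A, B, h):
    """K(x, x') = exp(-||x - x'||^2 / (2 h^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def local_estimator(Xj, yj, m, lam, h, rng):
    """Local sketched KRR of Eqs. (7)-(8) on one subset D_j."""
    n = Xj.shape[0]
    Kn = gaussian_kernel(Xj, Xj, h)            # n x n local kernel matrix
    R = sparse_sketch(m, n, rng)               # m x n sketch
    RK = R @ Kn                                # cheap when R is stored sparsely
    A = RK @ RK.T + lam * n * (RK @ R.T)       # R K_n^2 R^T + lam * n * R K_n R^T
    alpha = np.linalg.solve(A, RK @ yj)        # m-dimensional coefficients, Eq. (8)
    return R.T @ alpha, Xj                     # expansion coefficients of Eq. (7)

# Toy run: p local estimators averaged as in Eq. (10).
rng = np.random.default_rng(0)
X, y = rng.standard_normal((600, 3)), rng.standard_normal(600)
p, m, lam, h = 3, 40, 1e-3, 1.0
models = [local_estimator(X[idx], y[idx], m, lam, h, rng)
          for idx in np.array_split(np.arange(600), p)]
predict = lambda Xt: np.mean([gaussian_kernel(Xt, Xj, h) @ c for c, Xj in models], axis=0)
print(predict(X[:5]))
```

The only m-dimensional object that ever has to leave a local processor is α̂_j (or the function value it induces), which is what keeps the communication cost at O(m).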
DKRR-RS with Communications (DKRR-RS-CM)

To further improve the learning performance of approximate KRR, we propose a communication-based DKRR-RS, called DKRR-RS-CM. The efficient communication strategy enlarges the admissible range of the number of partitions p while guaranteeing the optimal statistical performance of distributed KRR; it is adapted from (Lin, Wang, and Zhou 2020) and avoids communicating local data. Let G_{D,m,λ} be

G_{D,m,λ}(f) = (P_m T_N P_m + λI) f - P_m S_N* ŷ_N.  (11)

Note that G_{D,m,λ}(f) is the half gradient over H_m of the empirical risk in argmin_{f∈H_m} (1/N) ∑_{i=1}^N (f(x_i) - y_i)² + λ‖f‖²_H. According to Eq. (9), it holds that

f̃_{D,m,λ} = (P_m T_N P_m + λI)^{-1} P_m S_N* ŷ_N,  (12)

f̄⁰_{D,m,λ} = (1/p) ∑_{j=1}^p (P_m T_n P_m + λI)^{-1} P_m S_n* ŷ_n.  (13)

Therefore, for any f ∈ H, we have

f̃_{D,m,λ} = f - (P_m T_N P_m + λI)^{-1} [(P_m T_N P_m + λI) f - P_m S_N* ŷ_N],  (14)

f̄⁰_{D,m,λ} = f - (1/p) ∑_{j=1}^p (P_m T_n P_m + λI)^{-1} [(P_m T_n P_m + λI) f - P_m S_n* ŷ_n].  (15)

Comparing Eq. (14) and Eq. (15), and noting that the global gradient can be obtained by communicating the local gradients, i.e., G_{D,m,λ}(f) = (1/p) ∑_{j=1}^p G_{D_j,m,λ}(f), we consider a Newton-Raphson-type communication strategy:

f̄^l_{D,m,λ} = f̄^{l-1}_{D,m,λ} - (1/p) ∑_{j=1}^p β_j^{l-1},  l > 0,  (16)

where β_j^{l-1} = (P_m T_n P_m + λI)^{-1} G_{D,m,λ}(f̄^{l-1}_{D,m,λ}). The process of DKRR-RS-CM is summarized in Algorithm 1. When l = 0, DKRR-RS-CM executes the computation of DKRR-RS; otherwise, the four-step communication strategy below is implemented.

Algorithm 1: DKRR-RS with Communications (DKRR-RS-CM)
Input: p disjoint subsets {D_j}_{j=1}^p with D = ∪_{j=1}^p D_j, kernel parameter, regularization parameter λ, sketch dimension m.
Output: f̄^M_{D,m,λ}
1: If l = 0
   Local processor: compute f̃_{D_j,m,λ} in Eq. (7) and communicate it back to the global processor.
   Global processor: compute f̄⁰_{D,m,λ} in Eq. (10) and communicate it to each local processor.
2: End If
3: For l = 1 to M do
   Local processor: compute the local gradient G_{D_j,m,λ}(f̄^{l-1}_{D,m,λ}) and communicate it back to the global processor.
   Global processor: compute the global gradient G_{D,m,λ}(f̄^{l-1}_{D,m,λ}) = (1/p) ∑_{j=1}^p G_{D_j,m,λ}(f̄^{l-1}_{D,m,λ}) and communicate it to each local processor.
   Local processor: compute β_j^{l-1} = (P_m T_n P_m + λI)^{-1} G_{D,m,λ}(f̄^{l-1}_{D,m,λ}) and communicate it back to the global processor.
   Global processor: compute f̄^l_{D,m,λ} in Eq. (16) and communicate it to each local processor.
4: End For

The first step computes the local gradients on each local processor and communicates them back to the global processor. The second step computes the global gradient from the local gradients on the global processor and communicates it to each local processor. The third step computes β_j^{l-1} from the global gradient on each local processor and communicates them back to the global processor. The fourth step computes f̄^l_{D,m,λ} on the global processor and communicates it to each local processor. Here one iteration ends. The four-step communication strategy is repeated until l = M, and f̄^M_{D,m,λ} is output. The testing flow is shown in the Appendix.

Complexity Analysis of DKRR-RS-CM

Time Complexity: For each local processor, the matrix products RK_n and (RK_n)(RK_n)^T and the inverse of RK_n(RK_n)^T + λn RK_n R^T need to be computed once. In each iteration, the local gradient G_{D_j,m,λ}(f̄^{l-1}_{D,m,λ}) and β_j^{l-1} are computed on each local processor. Therefore, the total time complexity is O(m²N/p + MmN/p) for each local processor. Space Complexity: The key element in memory is the matrix K_n, so the space complexity on each local processor is O(N²/p²). Communication Complexity: For l = 0, the global processor and each local processor communicate f̃_{D_j,m,λ} and f̄⁰_{D,m,λ}. In each iteration, the local processors receive G_{D,m,λ} and f̄^l_{D,m,λ}, and send G_{D_j,m,λ} and β_j^{l-1} back to the global processor. Therefore, the communication complexity is O(Mm).
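The four-step pattern of Algorithm 1 is easier to see on a small simulation. The snippet below mimics the communication structure of Eq. (16) on a plain finite-dimensional ridge-regression surrogate, with local matrices H_j standing in for P_m T_n P_m + λI; it illustrates the iteration scheme only, not the RKHS estimator itself, and all names are hypothetical.

```python
import numpy as np

def simulate_dkrr_cm(X_parts, y_parts, lam, M):
    """Four-step communication loop of Algorithm 1 on a ridge surrogate:
    w <- w - (1/p) sum_j H_j^{-1} g(w), with g the global half gradient."""
    d = X_parts[0].shape[1]
    # l = 0: each local processor solves its own problem, the global processor averages
    H = [Xj.T @ Xj / len(Xj) + lam * np.eye(d) for Xj in X_parts]      # local "Hessians"
    b = [Xj.T @ yj / len(Xj) for Xj, yj in zip(X_parts, y_parts)]
    w = np.mean([np.linalg.solve(Hj, bj) for Hj, bj in zip(H, b)], axis=0)
    for _ in range(M):
        # step 1: local half gradients, sent to the global processor
        g_loc = [Hj @ w - bj for Hj, bj in zip(H, b)]
        # step 2: global processor averages them and broadcasts the global gradient
        g = np.mean(g_loc, axis=0)
        # step 3: each local processor returns its correction beta_j = H_j^{-1} g
        beta = [np.linalg.solve(Hj, g) for Hj in H]
        # step 4: global processor averages the corrections and broadcasts the update
        w = w - np.mean(beta, axis=0)
    return w

rng = np.random.default_rng(1)
X, y = rng.standard_normal((900, 5)), rng.standard_normal(900)
parts = np.array_split(np.arange(900), 3)
print(simulate_dkrr_cm([X[i] for i in parts], [y[i] for i in parts], lam=0.1, M=4))
```

Only gradients and model parameters (d-dimensional here, m-dimensional in DKRR-RS-CM) cross processor boundaries, never the raw data, which is the privacy and communication advantage discussed in the comparison with (Lin, Wang, and Zhou 2020).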
Theoretical Analysis

In this section, we characterize the generalization performance of DKRR-RS and DKRR-RS-CM, showing that they achieve the same optimal learning rate as exact KRR with dramatically reduced computations.

Basic Assumptions

Assumption 1 (Moment Assumption). There exist positive constants Q and b such that, for all j ≥ 2 with j ∈ ℕ,

∫_R |y|^j dρ(y|x) ≤ (1/2) j! b^{j-2} Q²,

almost everywhere on X. Typically, this assumption is related to a noise assumption in the regression model, and it is satisfied if y is bounded almost surely (Rudi, Carratino, and Rosasco 2017).

Assumption 2 (Regularity Assumption). f_H satisfies ∫ (f_H(x) - f_ρ(x))² K_x ⊗ K_x dρ_X(x) ⪯ B² T, and the following Hölder source condition holds:

f_H = L^ζ g_0, with ‖g_0‖_ρ ≤ R.  (17)

Here, B and R are non-negative numbers and ζ ∈ [1/2, 1]. Eq. (17) reflects the regularity of the function f_H (Smale and Zhou 2007): the bigger ζ is, the higher the regularity of f_H and the faster the convergence rate. ⊗ denotes the tensor product.

Assumption 3 (Capacity Assumption). For some γ ∈ [0, 1] and c_γ > 0, T satisfies

tr(T(T + λI)^{-1}) ≤ c_γ λ^{-γ}, for all λ > 0.  (18)

This assumption controls the variance of the estimator and is equivalent to the classic entropy and covering number conditions (Steinwart and Christmann 2008). The effective dimension, the left-hand side of Eq. (18), is often used to measure the complexity of the hypothesis space H (Caponnetto and Vito 2007), and γ reflects the size of H. The assumption always holds for γ = 1, since T is a trace-class operator. It is satisfied, e.g., if the eigenvalues of T satisfy a polynomial decay condition σ_i ≲ i^{-1/γ}, or with γ = 0 if T is of finite rank.

Optimal Learning Rate for DKRR-RS in Expectation

Theorem 1. Under Assumptions 1-3 and the sketch matrix R constructed by Eq. (5), let δ ∈ (0, 1], γ ∈ [0, 1], ζ ∈ [1/2, 1], and let f̄⁰_{D,m,λ} be the estimator in Eq. (10). When λ = Ω(N^{-1/(2ζ+γ)}), m = Ω(N^{γ/(2ζ+γ)}), and p = O(N^{(2ζ+γ-1)/(2ζ+γ)}), with probability at least 1 - δ, we have

E‖f̄⁰_{D,m,λ} - f_H‖²_ρ = O(N^{-2ζ/(2ζ+γ)}).

Remark 1. Note that E[E(f̄⁰_{D,m,λ})] - inf_{f∈H} E(f) = E‖f̄⁰_{D,m,λ} - f_H‖²_ρ (Smale and Zhou 2007). From a theoretical perspective, Theorem 1 shows that if the sketch dimension m = Ω(N^{γ/(2ζ+γ)}) and the number of partitions p = O(N^{(2ζ+γ-1)/(2ζ+γ)}), DKRR-RS achieves the same optimal learning rate O(N^{-2ζ/(2ζ+γ)})¹ as exact KRR (Lin and Cevher 2020; Caponnetto and Vito 2007). At the basic setting (2ζ + γ = 2), the time and space complexity of DKRR-RS are O(N^{0.5+γ}) and O(N) with the optimal learning rate. The most popular case (ζ = 1/2, γ = 1) and the worst case (ζ = 1, γ = 0) are included in the basic setting. In the case γ = 0, the time cost is only O(N^{0.5}), which is a substantial step in scaling up kernel methods.

Remark 2. Optimal learning rates in expectation have been established for DKRR (Lin, Guo, and Zhou 2017; Guo, Lin, and Shi 2019), DKRR-NY-PCG (Yin et al. 2020a), and DKRR-RF (Li, Liu, and Wang 2019). However, they place a strict restriction on the number of local processors p: in the most popular case, to reach the optimal learning rate, p in those methods must be restricted to a constant O(1), whereas our result allows p = O(√N). DKRR-RF (Liu, Liu, and Wang 2021) obtains the optimal learning rate in expectation with p = O(N^{(2ζ+γ-1)/(2ζ+γ)}) and m = Ω(N^{(1+γ(2ζ-1))/(2ζ+γ)}). Compared with DKRR-RF (Liu, Liu, and Wang 2021) in expectation, DKRR-RS reduces the time complexity by a factor of N^{2(2(ζ-1)γ+1)/(2ζ+γ)} with the optimal learning rate, where 2(ζ - 1)γ + 1 > 0. Compared to DKRR-NY-PCG (Yin et al. 2020a), DKRR-RS reduces the time and space complexity by factors of N^{2(1-γ)/(2ζ+γ)} and N^{γ/(2ζ+γ)}, respectively, with the optimal learning rate, where 1 - γ ≥ 0.

¹ We hide the logarithmic terms of learning rates and complexities in this paper.
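As a quick sanity check of the complexity claims in Remark 1 (our own arithmetic, using the choices of m and p stated in Theorem 1 as reconstructed above), the per-processor costs of DKRR-RS evaluate to:

```latex
\underbrace{\tfrac{N m^{2}}{p}}_{\text{time}}
 = N^{\,1+\frac{2\gamma}{2\zeta+\gamma}-\frac{2\zeta+\gamma-1}{2\zeta+\gamma}}
 = N^{\frac{2\gamma+1}{2\zeta+\gamma}}
 \;\overset{2\zeta+\gamma=2}{=}\; N^{\,0.5+\gamma},
\qquad
\underbrace{\tfrac{N^{2}}{p^{2}}}_{\text{space}}
 = N^{\,2-\frac{2(2\zeta+\gamma-1)}{2\zeta+\gamma}}
 = N^{\frac{2}{2\zeta+\gamma}}
 \;\overset{2\zeta+\gamma=2}{=}\; N .
```

This matches the O(N^{0.5+γ}) time and O(N) space stated at the basic setting; for the most popular case (ζ = 1/2, γ = 1) the time is O(N^{1.5}), and for γ = 0 it is only O(N^{0.5}).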
Optimal Learning Rate for DKRR-RS in Probability

The expectation in Theorem 1 describes the average behavior over multiple trials but may fail to capture the learning performance of a single trial. To essentially reflect the generalization performance of DKRR-RS, we establish the optimal learning rate in probability.

Theorem 2. Under Assumptions 1-3 and the sketch matrix R constructed by Eq. (5), let δ ∈ (0, 1], γ ∈ [0, 1], ζ ∈ [1/2, 1], and let f̄⁰_{D,m,λ} be the estimator in Eq. (10). When λ = Ω(N^{-1/(2ζ+γ)}), m = Ω(N^{γ/(2ζ+γ)}), and p = O(N^{(2ζ+γ-1)/(4ζ+2γ)}), with probability at least 1 - δ, we have

‖f̄⁰_{D,m,λ} - f_H‖²_ρ = O(N^{-2ζ/(2ζ+γ)}).

Remark 3. DKRR-RS thus obtains the optimal learning rate not only in expectation but also in probability, and the upper bound O(N^{(2ζ+γ-1)/(4ζ+2γ)}) on p in Theorem 2 is stricter than O(N^{(2ζ+γ-1)/(2ζ+γ)}) in Theorem 1. This is because the error decomposition in probability does not easily separate out a distributed error that controls p, in contrast to the decomposition in expectation. To derive the optimal learning rate, we provide a novel error decomposition in probability; please see the details in the Appendix.

Optimal Learning Rate for DKRR-RS-CM in Probability

This part shows that the proposed communication strategy can improve the learning performance of DKRR-RS, that is, enlarge the admissible range of the number of partitions p.

Theorem 3. Under Assumptions 1-3 and the sketch matrix R constructed by Eq. (5), let δ ∈ (0, 1], γ ∈ [0, 1], ζ ∈ [1/2, 1], and let f̄^M_{D,m,λ} be the estimator output by Algorithm 1. When λ = Ω(N^{-1/(2ζ+γ)}), m = Ω(N^{γ/(2ζ+γ)}), and p = O(N^{(2ζ+γ-1)(M+1)/((2ζ+γ)(M+2))}), with probability at least 1 - δ, we have

‖f̄^M_{D,m,λ} - f_H‖²_ρ = O(N^{-2ζ/(2ζ+γ)}).

Proof. The proofs of Theorems 1, 2, and 3 are given in the Appendix.

Remark 4. Clearly, the upper bound on the number of partitions can be relaxed to O(N^{(2ζ+γ-1)(M+1)/((2ζ+γ)(M+2))}) in Theorem 3, compared with O(N^{(2ζ+γ-1)/(4ζ+2γ)}) in Theorem 2, while keeping the optimal learning rate; this demonstrates the role of the communication strategy in improving the performance of DKRR-RS. Note that as the number of communications M increases, the upper bound on p increases. When M → ∞, p can reach the same bound O(N^{(2ζ+γ-1)/(2ζ+γ)}) as in expectation.

Compared with the Related Works

Comparisons of Time and Space Complexity. Table 1 shows the computational complexity of state-of-the-art approximate KRR estimates that attain the same statistical accuracy as exact KRR at the basic setting. The proposed DKRR-RS only requires N^{0.5+γ} time and N space for the optimal learning rate in expectation, which is more efficient than the other methods, where γ ∈ [0, 1]. DKRR-RS-CM also keeps the lowest time and communication complexity with the optimal learning rate in probability among the communication-based methods DKRR-CM (Lin, Wang, and Zhou 2020), DKRR-RF-CM (Liu, Liu, and Wang 2021), and DKRR-NY-CM (Yin et al. 2021). At the same time, the proposed DKRR-RS and DKRR-RS-CM keep the best upper bound on p under the same conditions.

Proof Techniques. From a theoretical perspective, this paper is a non-trivial extension of these approximate methods. Compared with (Lin and Cevher 2020): they study stochastic gradient methods and randomized sketching, whereas we consider distributed learning and randomized sketching. To obtain the optimal learning rate, we deduce a new decomposition of ‖f̃_{D_j,m,λ} - f_{m,λ}‖_ρ and introduce several techniques to obtain a tight distributed error bound in expectation, and we deduce tight bounds on ‖f̄⁰_{D,m,λ} - f̃_{D,m,λ}‖_ρ and ‖f̄^l_{D,m,λ} - f̃_{D,m,λ}‖_ρ in probability, which are not available in (Lin and Cevher 2020).
See the Appendix for details.

Algorithms | Time | Space | Comm | p | m | Type
KRR (Caponnetto and Vito 2007) | N^3 | N^2 | / | / | / | Pro
DKRR (Chang, Lin, and Zhou 2017) | N^2 | N | N^{0.5} | N^{0.5} | / | Exp
DKRR (Lin, Wang, and Zhou 2020) | N^{2.25} | N^{1.5} | N^{0.75} | N^{0.25} | / | Pro
DKRR-CM (Lin, Wang, and Zhou 2020) | N^{(3M+9)/(2M+4)} | N^{(M+3)/(M+2)} | MdN | N^{(M+1)/(2M+4)} | / | Pro
Nyström (Rudi, Camoriano, and Rosasco 2015) | N^2 | N^{1.5} | / | / | N^{0.5} | Pro
Nyström-PCG (Rudi, Carratino, and Rosasco 2017) | N^{1.5} | N^{1.5} | / | / | N^{0.5} | Pro
DKRR-NY-PCG (Yin et al. 2020a) | N^{1.5} | N^{1+γ₂} | N^{0.5} | N^{0.5-γ₂} | N^{0.5} | Exp
DKRR-NY-CM (Yin et al. 2021) | N^{(3M+7)/(2M+4)} | N^{(2M+5)/(2M+4)} | MN^{0.5} | N^{(M+1)/(2M+4)} | N^{0.5} | Pro
Random Features (Rudi, Camoriano, and Rosasco 2016) | N^{2+2γ₁} | N^{1.5+γ₁} | / | / | N^{0.5+γ₁} | Pro
DKRR-RF (Li, Liu, and Wang 2019) | N^{1.5+2γ₁+γ₂} | N^{1+γ₁+γ₂} | N^{0.5+γ₁} | N^{0.5-γ₂} | N^{0.5+γ₁} | Exp
DKRR-RF (Liu, Liu, and Wang 2021) | N^{1.5+2γ₁} | N^{1+γ₁} | N^{0.5+γ₁} | N^{0.5} | N^{0.5+γ₁} | Exp
DKRR-RF (Liu, Liu, and Wang 2021) | N^{1.75+2γ₁} | N^{1.25+γ₁} | N^{0.5+γ₁} | N^{0.25} | N^{0.5+γ₁} | Pro
DKRR-RF-CM (Liu, Liu, and Wang 2021) | N^{(3M+7)/(2M+4)+2γ₁} | N^{(2M+5)/(2M+4)+γ₁} | MN^{0.5+γ₁} | N^{(M+1)/(2M+4)} | N^{0.5+γ₁} | Pro
ROS (Yang, Pilanci, and Wainwright 2017) | N^2 + N d_N^2 | N^2 | / | / | d_N log^4(N) | Pro
Gauss (Liu, Shang, and Cheng 2019) | N^2 d_N | N^2 | / | / | d_N | Pro
Subgauss (Lin and Cevher 2020) | N^{2+γ₂} | N^2 | / | / | N^{γ₂} | Pro
DKRR-RS (Theorem 1) | N^{0.5+2γ₂} | N | N^{γ₂} | N^{0.5} | N^{γ₂} | Exp
DKRR-RS (Theorem 2) | N^{0.75+2γ₂} | N^{1.5} | N^{γ₂} | N^{0.25} | N^{γ₂} | Pro
DKRR-RS-CM (Theorem 3) | N^{(M+3)/(2M+4)+2γ₂} | N^{(M+3)/(M+2)} | MN^{γ₂} | N^{(M+1)/(2M+4)} | N^{γ₂} | Pro

Table 1: Complexity of the state-of-the-art KRR estimates with the same learning rate as exact KRR at the basic setting. "Comm", "Pro", and "Exp" denote communication complexity, "in probability", and "in expectation". m denotes the sketch dimension in randomized sketching, the sampling scale in Nyström methods, and the number of random features in random-feature methods. N and M are the numbers of training data and communications. d > 0, γ ∈ [0, 1], γ₁ = (1-γ)γ/2 ∈ [0, 0.5], γ₂ = γ/2 ∈ [0, 0.5], d_N > 1. Logarithmic terms are not shown.

Compared with (Liu, Liu, and Wang 2021; Yin et al. 2021; Lin, Wang, and Zhou 2020; Yin et al. 2020a; Li, Liu, and Wang 2019): although these works also adopt distributed learning and/or communication strategies, their proof techniques differ from ours, because the distributed errors in this paper are tied to the proposed randomized sketching method, which does not exist in those works. In addition, by introducing new proof techniques and new operator representations, we relax the restriction on p from O(1) to O(√N) in the most popular case and reduce the lower bound on m compared with (Yin et al. 2020a; Li, Liu, and Wang 2019). Finally, (Lin, Wang, and Zhou 2020) requires communicating data among the local processors at each iteration, which causes high communication complexity for large-scale datasets and makes it difficult to protect the privacy of the data on local processors. In contrast, we only communicate gradients and model parameters instead of data, which yields a smaller communication complexity of O(m) per iteration and better privacy protection. Meanwhile, we also have smaller communication and time complexity than (Liu, Liu, and Wang 2021; Yin et al. 2021) owing to the smaller lower bound on m, so that our method is more suitable for large-scale datasets. Overall, we provide novel distributed bounds with randomized sketching in expectation and in probability, as well as a communication-based distributed bound in probability. By introducing several novel techniques and decompositions, we derive the best upper bound on p, which is a non-trivial extension of (Liu, Liu, and Wang 2021; Yin et al. 2021; Lin and Cevher 2020; Lin, Wang, and Zhou 2020; Yin et al. 2020a; Li, Liu, and Wang 2019). Please see the details in the Appendix.
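Before turning to the experiments, a quick numeric illustration (our own arithmetic, based on the bound stated in Theorem 3) of how the admissible number of partitions grows with the number of communications M at the basic setting 2ζ + γ = 2:

```python
# Exponent of the upper bound on p in Theorem 3: (2z+g-1)(M+1) / ((2z+g)(M+2)).
def p_exponent(M, s=2.0):        # s = 2*zeta + gamma (basic setting: s = 2)
    return (s - 1) * (M + 1) / (s * (M + 2))

for M in [0, 2, 4, 8, 10**6]:
    print(M, round(p_exponent(M), 3))
# 0 -> 0.25 (Theorem 2), 2 -> 0.375, 4 -> 0.417, 8 -> 0.45,
# and as M grows the exponent approaches 0.5, the in-expectation bound of Theorem 1.
```

The monotone growth of this exponent is exactly what Figure 2 probes empirically with M = 2, 4, 8 communications.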
Experiments

In this section, we present extensive experiments on commonly used datasets to verify our theoretical predictions. The empirical evaluations of DKRR-RS and DKRR-RS-CM use the Gaussian kernel e^{-(x₁-x₂)²/(2h²)} on the cadata (20640 samples), shuttle (43500 samples), w8a (49749 samples), and connect-4 (67557 samples) datasets², where the optimal h ∈ 2^[-2:0.5:5] and λ ∈ 2^[-16:3:-4] are selected via 5-fold cross-validation. The datasets are normalized, with 70% of the samples used for training and the rest for testing. The experiments use RMSE for regression problems and classification error for classification problems, and are repeated 5 times on a server with 32 cores (2.40 GHz) and 32 GB of RAM.

² The datasets are from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
³ The code is from the authors of (Yang, Pilanci, and Wainwright 2017).
⁴ The code is from (Li, Kwok, and Lu 2010).

[Figure 1 omitted: four panels plotting testing error and training time (log scale, in seconds) against the number of partitions p for Nyström, Subgauss, ROS, DKRR, DKRR-NY, and DKRR-RS; panels (c) and (d) are on connect-4.]
Figure 1: The testing error and training time versus the number of partitions p for various algorithms on the cadata and connect-4 datasets. For each algorithm, where applicable, m = 600.

Figure 1 compares DKRR-RS with Subgauss, ROS³, DKRR, the classical Nyström⁴ (Li, Kwok, and Lu 2010), and DKRR-NY, which is a combination of Nyström (Li, Kwok, and Lu 2010) and DKRR, with m = 600 > √N. From Figure 1, one can see that DKRR-RS keeps an approximately optimal error with the least time, and the larger the number of partitions, the shorter the training time of the p-related algorithms. The gap between DKRR-RS and DKRR shows that the proposed sparse randomized sketching can greatly reduce the time consumption at the cost of a small error. These results are consistent with our theoretical analysis and with Theorems 1 and 2: the proposed DKRR-RS achieves satisfactory accuracy in less time with suitable p and m.
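For reference, the kernel and the hyperparameter grids described in the setup above translate directly into code; a small sketch (the bracket notation 2^[a:s:b] is MATLAB-style start:step:stop, and the helper names are ours):

```python
import numpy as np

# Hyperparameter grids for 5-fold cross-validation, written out explicitly.
h_grid = 2.0 ** np.arange(-2.0, 5.0 + 1e-9, 0.5)      # kernel widths h in 2^[-2:0.5:5]
lam_grid = 2.0 ** np.arange(-16.0, -4.0 + 1e-9, 3.0)  # regularization lambda in 2^[-16:3:-4]
print(len(h_grid), len(lam_grid))                      # 15 and 5 candidate values

# Gaussian kernel used throughout: K(x1, x2) = exp(-||x1 - x2||^2 / (2 h^2)).
gauss = lambda x1, x2, h: np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * h ** 2))
```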
Dataset | Nyström time | Nyström error | Subgauss time | Subgauss error | ROS time | ROS error | DKRR-RS time | DKRR-RS error  (m = 900)
cadata | 0.55 | 1.08±0.133 | 2.19 | 0.15±0.021 | 3.95 | 1.20±0.119 | 0.21 | 0.19±0.023
shuttle | 1.86 | 0.25±0.012 | 10.4 | 0.05±0.006 | 21.1 | 0.26±0.022 | 0.30 | 0.05±0.001
w8a | 3.27 | 0.04±0.001 | 13.0 | 0.02±0.002 | 36.0 | 0.05±0.003 | 0.39 | 0.02±0.004
Dataset | Nyström time | Nyström error | Subgauss time | Subgauss error | ROS time | ROS error | DKRR-RS time | DKRR-RS error  (m = 1500)
cadata | 1.22 | 0.97±0.038 | 3.20 | 0.13±0.046 | 4.44 | 0.99±0.023 | 0.29 | 0.14±0.056
shuttle | 4.52 | 0.25±0.013 | 15.9 | 0.04±0.002 | 23.5 | 0.25±0.015 | 0.50 | 0.05±0.002
w8a | 6.00 | 0.03±0.0061 | 16.3 | 0.02±0.001 | 36.5 | 0.04±0.002 | 0.71 | 0.02±0.007

Table 2: Comparison of average training time (left, in seconds) and average testing error (right) in solving KRR among the Nyström, Subgauss, ROS, and DKRR-RS algorithms on the cadata, shuttle, and w8a datasets, with m = 900 and m = 1500 and the number of partitions p = 3. We bold the numbers of the best algorithm.

Table 2 compares the training time and testing error of DKRR-RS and the m-related algorithms (Subgauss, ROS, and Nyström) on the cadata, shuttle, and w8a datasets with p = 3 and m = 900 or m = 1500. The experimental results show that the larger m is, the longer the training time and the smaller the testing error. For the same m, DKRR-RS is evidently superior to the other algorithms in training time and keeps the (nearly) best testing error, which is consistent with the theoretical analysis. When m = 1500, DKRR-RS is even one order of magnitude faster than the Nyström algorithm.

[Figure 2 omitted: testing error versus the number of partitions p (up to 20) for DKRR-RS and for DKRR-RS-CM with 2, 4, and 8 communications on connect-4.]
Figure 2: The testing error versus the number of partitions p for DKRR-RS and DKRR-RS-CM on the connect-4 dataset; 2, 4, and 8 denote the numbers of communications; m = 600.

Figure 2 compares DKRR-RS-CM (M = 2, 4, 8) with DKRR-RS, with m = 600 > √N, and shows that: 1) With the increase of p, the errors of the distributed algorithms (DKRR-RS-CM and DKRR-RS) gradually increase; when p exceeds certain upper bounds, the errors move far from their starting points, which demonstrates Theorems 1, 2, and 3. 2) The upper bound on p for DKRR-RS-CM is larger than that of DKRR-RS, which verifies the power of the communication strategy in enlarging the range of p. 3) As the number of communications increases, the upper bound on p increases, which is consistent with Theorem 3. More experiments and the details of the datasets are given in the Appendix.

Conclusions

In this paper, we propose the DKRR-RS method by combining distributed learning and randomized sketching, and investigate its statistical and computational requirements in expectation and in probability. Then, to further improve the learning performance, we construct an efficient communication strategy for DKRR-RS and demonstrate the power of communications via theoretical and empirical assessments.

Acknowledgments

This work was supported in part by the Special Research Assistant project of CAS (No. E0YY221-2020000702), the National Natural Science Foundation of China (No. 62106259, No. 62076234), and the Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098). We thank Professor Weiping Wang for his help with this paper.

References

Caponnetto, A.; and Vito, E. D. 2007. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3): 331-368.
Chang, X.; Lin, S.-B.; and Zhou, D.-X. 2017. Distributed semi-supervised learning with kernel ridge regression. The Journal of Machine Learning Research, 18(1): 1493-1514.
Guo, Z.-C.; Lin, S.-B.; and Shi, L. 2019. Distributed learning with multi-penalty regularization. Applied and Computational Harmonic Analysis, 46(3): 478-499.
Li, J.; Liu, Y.; and Wang, W. 2019. Distributed learning with random features. arXiv preprint arXiv:1906.03155.
Li, J.; Liu, Y.; Yin, R.; Zhang, H.; Ding, L.; and Wang, W. 2018. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems, 1593-1602.
Li, M.; Kwok, J. T.; and Lu, B. L. 2010. Making large-scale Nyström approximation possible. In International Conference on Machine Learning, 631-638.
Li, S.; and Liu, Y. 2021a. Sharper generalization bounds for clustering. In International Conference on Machine Learning, 6392-6402.
Li, S.; and Liu, Y. 2021b. Towards sharper generalization bounds for structured prediction. In Advances in Neural Information Processing Systems.
Lian, H.; Liu, J.; and Fan, Z. 2021. Distributed learning for sketched kernel regression. Neural Networks, 143: 368-376.
Lin, J.; and Cevher, V. 2018. Optimal distributed learning with multi-pass stochastic gradient methods. In International Conference on Machine Learning, 3098-3107.
Lin, J.; and Cevher, V. 2020. Convergences of regularized algorithms and stochastic gradient methods with random projections. The Journal of Machine Learning Research, 21(20): 1-44.
Lin, S.-B.; Guo, X.; and Zhou, D.-X. 2017. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1): 3202-3232.
Lin, S.-B.; Lei, Y.; and Zhou, D.-X. 2019. Boosted kernel ridge regression: Optimal learning rates and early stopping. The Journal of Machine Learning Research, 20(46): 1-36.
Lin, S.-B.; Wang, D.; and Zhou, D.-X. 2020. Distributed kernel ridge regression with communications. The Journal of Machine Learning Research, 21: 93:1-93:38.
Liu, M.; Shang, Z.; and Cheng, G. 2019. Sharp theoretical analysis for nonparametric testing under random projection. In Conference on Learning Theory, 2175-2209.
Liu, Y. 2021. Refined learning bounds for kernel and approximate k-means. In Advances in Neural Information Processing Systems.
Liu, Y.; Liu, J.; and Wang, S. 2021. Effective distributed learning with random features: Improved bounds and algorithms. In International Conference on Learning Representations.
Rahimi, A.; and Recht, B. 2007. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 1177-1184.
Rudi, A.; Camoriano, R.; and Rosasco, L. 2015. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 1657-1665.
Rudi, A.; Camoriano, R.; and Rosasco, L. 2016. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, 3215-3225.
Rudi, A.; Carratino, L.; and Rosasco, L. 2017. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, 3888-3898.
Saunders, C.; Gammerman, A.; and Vovk, V. 1998. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, 515-521.
Schölkopf, B.; Herbrich, R.; and Smola, A. J. 2001. A generalized representer theorem. In International Conference on Computational Learning Theory, 416-426. Springer.
Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; and Cotter, A. 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1): 3-30.
Smale, S.; and Zhou, D.-X. 2007. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2): 153-172.
Steinwart, I.; and Christmann, A. 2008. Support vector machines. Springer Science & Business Media.
Wang, J.; Liu, J.; Liu, Y.; et al. 2021. Improved learning rates of a functional Lasso-type SVM with sparse multi-kernel representation. In Advances in Neural Information Processing Systems.
Wang, S. 2019. A sharper generalization bound for divide-and-conquer ridge regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 5305-5312.
Yang, Y.; Pilanci, M.; and Wainwright, M. J. 2017. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3): 991-1023.
Yin, R.; Liu, Y.; Lu, L.; Wang, W.; and Meng, D. 2020a. Divide-and-conquer learning with Nyström: Optimal rate and algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, 6696-6703.
Yin, R.; Liu, Y.; Wang, W.; and Meng, D. 2020b. Extremely sparse Johnson-Lindenstrauss transform: From theory to algorithm. In 2020 IEEE International Conference on Data Mining, 1376-1381. IEEE.
Yin, R.; Liu, Y.; Wang, W.; and Meng, D. 2020c. Sketch kernel ridge regression using circulant matrix: Algorithm and theory. IEEE Transactions on Neural Networks and Learning Systems, 31(9): 3512-3524.
Yin, R.; Liu, Y.; Wang, W.; and Meng, D. 2021. Distributed Nyström kernel learning with communications. In International Conference on Machine Learning, 12019-12028.
Zhang, Y.; Duchi, J.; and Wainwright, M. 2013. Divide and conquer kernel ridge regression. In Conference on Learning Theory, 592-617.
Zhang, Y.; Duchi, J.; and Wainwright, M. 2015. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1): 3299-3340.