# federated_spectral_clustering_via_secure_similarity_reconstruction__83eae9d7.pdf

Federated Spectral Clustering via Secure Similarity Reconstruction

Dong Qiao1,2 Chris Ding1 Jicong Fan 1,2

1The Chinese University of Hong Kong, Shenzhen, China 2Shenzhen Research Institute of Big Data, Shenzhen, China dongqiao@link.cuhk.edu.cn {chrisding,fanjicong}@cuhk.edu.cn

Federated learning has a significant advantage in protecting data and information privacy. Many scholars proposed various secure learning methods within the framework of federated learning but the study on secure federated unsupervised learning especially clustering is limited. We in this work propose a secure kernelized factorization method for federated spectral clustering on distributed data. The method is non-trivial because the kernel or similarity matrix for spectral clustering is computed by data pairs, which violates the principle of privacy protection. Our method implicitly constructs an approximation for the kernel matrix on distributed data such that we can perform spectral clustering under the constraint of privacy protection. We provide a convergence guarantee of the optimization algorithm, reconstruction error bounds of the Gaussian kernel matrix, and the sufficient condition of correct clustering of our method. We also present guarantees of differential privacy. Numerical results on synthetic and real datasets demonstrate that the proposed method is efficient and accurate in comparison to baselines.

1 Introduction

In the era of big data, human beings can analyze massive data in various fields due to the improvement of storage and computational capabilities of computing devices [Li et al., 2020b]. Some popular fields such as artificial intelligence, machine learning, internet of things (Io T), and cloud computing have seen explosive development over the past few years. Nevertheless, a side effect of this trend is that individuals and organizations have more and more concerns about potential violation of privacy [Kairouz et al., 2021]. As a result, it has become a challenge to mine valuable information from user data but not to directly access it.

Federated learning [Kairouz et al., 2021; Mc Mahan et al., 2017] can train a global model without retrieving dispersed data [Yang et al., 2018]. This advantage has made it so popular that many scholars have put much effort into the study of federated learning. For example, Yang et al. [2019] presented the definitions of horizontal federated learning, vertical federated learning, and federated transfer learning. Some privacy-preserving machine learning models were also presented. For instance, He et al. [2020] developed a federated group knowledge transfer algorithm to train small CNNs on edge devices. Chen et al. [2018] proposed a protocol to conduct privacy-preserving ridge regression over high-dimensional data. Besides, Kim et al. [2018] proposed a block-chained federated learning architecture that enables on-device learning without any central coordination.

Regardless of the great progress of federated learning, it can be found that most of the existing studies are for supervised learning [Li et al., 2020a; Ghosh et al., 2020]. Note that collecting labeled data may deserve very high cost in real situations [Li et al., 2020b] while unlabeled data are abundant. Thus, it

Corresponding author

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

is necessary and important to study federated learning for unsupervised learning [Zhang et al., 2020; Tzinis et al., 2021; Zhuang et al., 2021; Dennis et al., 2021] such as clustering [Ng et al., 2001; Fan and Chow, 2017; Fan et al., 2018, 2021; Fan, 2021; Cai et al., 2022; Fan et al., 2022]. For example, Li et al. [2021] proposed a federated matrix factorization with a privacy guarantee for recommendation systems. Wang and Chang [2022] proposed two federated matrix factorization algorithms that can be used for federated clustering. Besides, there are some studies on federated spectral clustering. For instance, Wang et al. [2020] presented a federated multi-view spectral clustering method under the assumption that the data of each view are in one client. Hernández-Pereira et al. [2021] developed a cooperative spectral clustering model to deal with distributed data but the model is linear. However, the study on federated spectral clustering is still very limited and deserves more attention and effort.

In this paper, we propose a federated kernelized factorization method to reconstruct a similarity matrix for secure spectral clustering on distributed data. Our contributions are as follows.

We propose a federated spectral clustering algorithm and provide convergence guarantee for the optimization.

We further propose to add noise to the data or the learned factors to enhance the security of clustering and provide guarantees of differential privacy.

We provide upper bounds for the reconstruction error of the true similarity matrix and theoretical guarantees for correct clustering.

We test our method on both synthetic data and real datasets in comparison to baselines, which verify the effectiveness of our method.

Notations We use y, y, and Y to denote scalar, vector, and matrix, respectively. The element of Y at row i and column j is denoted by yij. We use 2 to denote the ℓ2 norm of a vector and use Tr( ), F , and sp to denote the trace, Frobenius norm, and spectral norm of a matrix respectively. The ℓ norm and ℓ2, norm of a matrix Y are defined as Y = maxij |yij| and

Y 2, = maxj q P

i y2 ij respectively. K, K, K, and k denote the kernel matrix, kernel function, number of clusters, and the number k in KNN, respectively. ϕ denotes the feature map induced by K.

2 Federated Spectral Clustering (Fed SC)

Suppose we have n data points of dimension m distributing in P clients. For convenience, we denote by X Rm n the matrix composed of all the n data points and denote by Xp Rm Np the matrix composed of the Np data points in client cp, where Np 1, p = 1, . . . , P, and PP p=1 Np = n. Without loss of generality, we let X = [X1, X2, . . . , XP ], which means {Xp}P p=1 are submatrices of X. Our goal is to perform spectral clustering on these data to partition them into K groups, under the constraint that the data in each client cannot leave the client itself and the privacy of the data should be protected as much as possible, though there could be a central server conducting clustering.

The aforementioned task is non-trivial because in spectral clustering, the first step is constructing an adjacency matrix A Rn n, which has to evaluate the similarity between every data pair (xi, xj) using a metric function M( , ) and hence violates the privacy constraint in the task. To solve the problem, we present a federated spectral clustering model in this section.

2.1 Similarity Reconstruction via Feature Space Factorization

In spectral clustering, for M( , ), there are many choices such as k nearest neighbor similarity and various kernel functions. Let K( , ) be a kernel function and we have

K(xi, xj) = ϕ(xi)T ϕ(xj), (1)

where ϕ : Rm Rm is a feature map induced by the kernel function2 and does not need to be carried out explicitly. When it comes to federated spectral clustering, the central server has no access

2The most widely-used kernel is the Gaussian kernel K(xi, xj) = exp 1 2r2 xi xj 2 2 , of which the feature map ϕ is an infinite-order polynomial feature map and r is a hyperparameter controlling the smoothness.

to the raw data distributed in clients and hence cannot compute K(xi, xj) using (1). However, if the central server can learn an effective approximation (denoted by \ ϕ(xi)) for each ϕ(xi) without accessing xi, K(xi, xj) can be estimated, i.e.,

K(xi, xj) \ ϕ(xi) T \ ϕ(xj). (2)

Thus, inspired by [Fan and Udell, 2019; Fan et al., 2021], we propose to approximate each ϕ(xi) by

\ ϕ(xi) = ϕ(Z)ci, (3)

where Z = [z1, z2, . . . , zd] Rm d, ϕ(Z) = [ϕ(z1), ϕ(z2) . . . , ϕ(zd)], and ci Rd. Both Z and ci are learned from individual columns of X and they can be regarded as intermediate variables avoiding the direct access of central server to xi (details of the learning will be introduced later). It follows from (2) and (3) that

K(xi, xj) c T i ϕ(Z) ϕ(Z)cj. (4)

Thus we can reconstruct the similarity between xi and xj via (4). For convenience, let Kxx = K(X, X) = ϕ(X) ϕ(X), Kzz = K(Z, Z) = ϕ(Z) ϕ(Z) Rd d, and C = [c1, c2, . . . , cn] Rd n. Then we have

Kxx (ϕ(Z)C)T (ϕ(Z)C) = CT Kzz C ˆ Kxx. (5)

Now we use ˆ Kxx as a reconstructed similarity matrix for spectral clustering.

In the form of federated learning, we expand (3) to

ϕ(Xp) ϕ(Z)Cp, p = 1, . . . , P. (6)

It indicates that Z is shared for all P clients and Cp is private for client cp. Note that (6) is a matrix factorization problem in the feature space induced by a kernel on the data in client cp, p = 1, . . . , P. Letting C = [C1, . . . , CP ], we solve the following distributed optimization problem3

minimize Z, C F(Z, C)

p=1 ωpfp(Z, Cp). (7)

In (7), fp is a local objective function for client cp and ω1, . . . , ωP are nonnegative weights for the clients. In this work, we let

fp(Z, Cp) =1

2 ϕ(Xp) ϕ(Z)Cp 2 F + λ

2Tr(K(Xp, Xp)) Tr(CT p K(Z, Xp)) + 1

2Tr(CT p K(Z, Z)Cp) + λ

2 Cp 2 F , (8)

where λ 0 is a penalty parameter. To guarantee the privacy of information, problem (7) shall be solved in the framework of federated learning.

2.2 Fed SC by Similarity Reconstruction and Model Averaging

In this section, we develop a Fed SC algorithm by similarity reconstruction and model averaging. As a classic and popular framework, Federated Averaging (or Fed Avg) is first introduced in [Mc Mahan et al., 2017] for federated learning. In our work, the proposed Fed SC is, therefore, built up based on the backbone of Fed Avg as in Figure 1. Fed SC consists of two stages. The first stage, shown by the left plot of Figure 1, is federated similarity reconstruction, which constructs a similarity matrix in the manner of federated learning. The second stage, shown by the left plot of Figure 1, is using the reconstructed similarity matrix to implement spectral clustering.

Stage I Federated Similarity Reconstruction

Step 1 : As the startup settings for our algorithm, the shared variable Z (i.e., the dictionary matrix Z) and each local coefficient matrix Cp for p = 1, 2, , P are initialized randomly in the central server and each client, respectively.

3Note that we do not show the data Xp in the objective explicitly since it is absorbed into fp.

Local Update Module

!"# $ !, %$ ! !"# # !, %# ! !"# # & !, %& !

Client 1 Client 2 Client P

Aggregation Module for Shared Variable r

Spectral Clustering Module

= Spectral Clustering !"# Z,$ !, K

Client 1 Client 2 Client P

%! &,'! & j

Information Flow of Federated Spectral Clustering:

Spectral Clustering o p q

Local Update Module !" = argmin #$ % &"'(, !

&!" = argmin )$ % &!, !"

o &+ -, + - &. -, . -

Figure 1: Diagram of the proposed Fed SC. Stage I (left plot): Federated Similarity Reconstruction (Steps 1-5). Stage II (right plot): Spectral Clustering (Steps 6-8).

Step 2 : For each round s, where 1 s S, the previous shared variable Z will firstly be broadcast to each participated client. After that, every client uses this received dictionary matrix Z to run its own iterative updates of local variables in the Local Update Module (LUM) as:

Cs p = arg min Cp fp(Zs 1, Cp) = arg min Cp

ϕ(Xp) ϕ(Zs 1)Cp 2 F + λ

2 Cp 2 F (9)

Zs p = arg min Zp fp(Zp, Cs p) = arg min Zp

ϕ(Xp) ϕ(Zp)Cs p 2 F + λ

Cs p 2 F (10)

Step 3 : Each client sends back its own dictionary matrix Zs p, p = 1, 2, . . . , P, to the central server.

Step 4 : The central server collects all (or a subset As 1 of) these uploaded matrices {Zs p}P p=1 and averages them into one new matrix Zs in Aggregation Module (AM), i.e.,

Zs = 1 |As 1|

p As 1 Zs p (11)

where |As 1| is the number of participated clients. In our study, we fix the number of participating clients for each round s. Therefore, we use the notation P instead of |As 1| in the sequel. This aggregated dictionary matrix Zs will then be used to push the next round of federated iteration until the tolerance condition is broken.

Step 5 : When Stage I comes to an end, the spectral clustering will start.

Stage II Spectral Clustering

Step 6 : Each client sends (ZS p , CS p ) back to the central server for the final aggregation of information.

Step 7 : The required similarity matrix is then constructed based on the obtained dictionary matrix ZS and coefficient matrix CS in Spectral Clustering Module (SCM). Based on this approximated similarity matrix, the standard spectral clustering is implemented as usual:

y = Spectral Clustering(CT K(Z, Z)C, K). (12)

Step 8 : The central server broadcasts its clustering results to every corresponding client.

2.3 Optimization Algorithm for Federated Similarity Reconstruction

As described in the above section, alternate updating of local variables is a key to solving the proposed Fed SC problem. In the following two parts, we discuss the optimization for Z and C, respectively.

For a client cp, consider the corresponding local optimization problem minimize Z,C fp(Z, C) (13)

where fp(Z, C) = 1

2 ϕ(Xp) ϕ(Z)C 2 F + λ

2 C 2 F = 1

2Tr(K(Xp, Xp)) Tr(CT K(Z, Xp)) + 1 2Tr(CT K(Z, Z)C)+ λ

2 C 2 F . Let the derivative of fp(Z, C) w.r.t. C be zero, we get the following one-step update for C: Cs p = (K(Zs 1, Zs 1) + λId) 1K(Zs 1, Xp), p = 1, 2, . . . , P. (14)

The derivative of fp(Z, C) w.r.t. Z is L Z = 1

σ2 (Xp WZ Z WZ) + 2

σ2 (ZQZ Z QZ), (15)

where the intermediate variables are detailed as WZ = CT K(Xp, Z) WZ = diag(1T n WZ) QZ = (0.5CCT ) K(Z, Z) QZ = diag(1T d QZ). Here 1n and 1d are the column vectors with all elements of 1. Because of the kernel function, Z cannot be updated like Cp. Here, we use the gradient method to update it. At local iteration t, by setting Zs,0 p = Zs 1, the update scheme of Z is

Zs,t p = Zs,t 1 p ηt fp

Z (Zs,t 1 p ). (16)

where ηt is the step size and can be set as the reverse of the Lipschitz constant of gradient if possible.

We summarize the optimization details in Algorithm 1 (shown in Appendix A).

2.4 Convergence Analysis of The Proposed Algorithm

First of all, it is obvious that all local objective functions fp( , ) for p = 1, . . . , P are lower bounded. To analyze the convergence of Algorithm 1, we make two assumptions. The first one is the Lipschitz continuity of the gradient of the local objective functions. Assumption 2.1. The gradients of all local objective functions fp( , ) for p = 1, . . . , P are Ls Zp Lipschitz continuous in Z, that is Zfp(Zs,t, Cs p) Zfp(Zs,t 1, Cs p) F Ls Zp Zs,t Zs,t 1 F . (17)

In addition, there exist some lower and upper bounds for Ls Zp, i.e., 0 < LZ Ls Zp LZ hold for all p = 1, . . . , P and s = 1, . . . , S.

The second assumption, similar to [Li et al., 2019; Lian et al., 2017], is as follows. Assumption 2.2. The difference between the local gradient and the global gradient is bounded as Zfp(Z, Cp) ZF(Z, C) F ζ, p = 1, . . . , P. (18)

To build the convergence condition, we define the following iterative terms of Zs,t and Cs for all t = 1, . . . , Q and s = 1, . . . , S:

TC(Zs,0, Cs) =

p=1 ωp Cs p Cs 1 p 2

TZ(Zs,t, Cs) = Zs,t Zs,t 1 2 F

where the instantaneous average Zs,t is defined as

p As Zs,t p . (20)

Based on the above assumptions, we provide the following convergence guarantee for Algorithm 1.

Theorem 2.3 (Convergence of Algorithm 1). Suppose Assumption 2.1 and Assumption 2.2 hold. Let T = S(1 + Q) be the total number of global and local rounds. Then the sequence {(Zs,t, Cs)} generated by Algorithm 1 with stepsize 1/Ls Zp and ωp = Np

n satisfies

s=1 TC(Zs,0, Cs) +

t=1 TZ(Zs,t, Cs)

T [F(Z1,0, C0) f] + 16ζ2ψD

where ψ = 1 + ( P +8)(Q 1)(2Q 1) P 4(Q 1)2(1+L 2 Z/L2 Z) and D = 2 γmin+λ + 4 LZ .

The proof can be found in Appendix E. We see that when T , the algorithm converges to a finite value, which is small if ζ is small and LZ is close to LZ.

3 Security-Enhanced Fed SC

In order to enhance the security of Fed SC, we present two noise-augmented variants of the proposed algorithm in this section.

3.1 Fed SC with Perturbed Data

We add random noise to the data in each client and then perform Algorithm 1 to reconstruct a similarity matrix, which further improves the privacy of data. Specifically, the data X Rm n is perturbed by a noise matrix E Rm n to form the noisy data matrix

X = X + E, (22)

where Eij N(0, σ2). We then perform Algorithm 1 with a Gaussian kernel of parameter r on X and obtain Z, C = [C1, C2, . . . , CP ], and

ˆ K x x = CT K(Z, Z)C. (23)

We have the following reconstruction (for the true similarity matrix Kxx = K(X, X)) error bound4. Theorem 3.1 (Error bound of similarity matrix reconstruction). Suppose X 2, = θ, C 2, =

τC, and ϕ(Z)C ϕ( X) 2, γ, where θ, τC, and γ are some nonnegative constants. Then with

the probability at least 1 n(n 1)e t, the reconstructed similarity matrix ˆ K x x satisfies ˆ K x x Kxx 1

2θ)2 2θ2i + (

dτC + 1)γ (24)

where ξ = q

Note that r is the hyperparameter of the Gaussian kernel. In our experiment, r was automatically estimated as the mean of all pairwise distances between data points, i.e., r = 1 n2 P

i,j xi xj 2. Assume |xik xjk| = O(ε) for all i, j [n], k [m], then xi xj = O( mε), which means r2 is linear with mε2. Thus, the reconstruction error ˆ K x x Kxx is upper bounded by O(σ2/ε2 +

dγτC), where ϵ2/σ2 can be regarded as a signal-noise ratio. Therefore, the bound is useful. In general, Theorem 3.1 indicates that when the added noise is small and the optimization makes γ small, the reconstruction error for the true similarity matrix is less than a small constant with high probability. This verified the effectiveness of our similarity reconstruction method.

It should be pointed out that ˆ K x x is not guaranteed to be a sparse matrix and hence the corresponding graph may not contain multiple connected components. We therefore use an extra KNN-based operation to get a sparse similarity matrix, which may also reduce the computational cost of eigenvalue decomposition when n is very large. Specifically, we let

ˆ K x x = get Sparse Matrixby KNN( ˆ K x x, k) (25)

4We defer the proof for all theoretical results to the supplementary material.

which only keeps the largest k connections from each point to other points. Finally, we perform spectral clustering using ˆ K x x. The central server broadcasts the clustering results to each participating client.

As mentioned before, one can choose to inject some noise into its raw data to avoid privacy leakage. However, a question is how much noise we can add to the data to the largest extent for the guarantee of correct clustering. We first present the following definitions.

Definition 3.2 (Local neighbor set). Suppose xi and xj are data points of X Rm n with class labels Li and Lj respectively, and let KNN(xi) be the set of the k-nearest neighbors of xi. We define

Nk,intra i := {xj X|Li = Lj and xj KNN(xi)}. (26)

Definition 3.3 (Global neighbor set). Suppose xi and xj are data points of X Rm n with class labels Li and Lj respectively, and let KNN(xi) be the point set of k-nearest neighbors of xi. We define Nk,global i := {xj X|xj KNN(xi)}. (27)

If we call the local neighbor of data point xi the intra-class neighbor of xi, another definition called inter-class neighbor of xi can be further introduced as follows.

Definition 3.4 (Inter-class neighbor set). Suppose xi and xj are data points of data matrix X Rm n with class labels, Li and Lj, respectively, and let KNN(xi) be the point set of k-nearest neighbors of xi. We define

Nk,inter i := {xj X|Li = Lj and xj KNN(xi)}. (28)

Based on the above definitions, the following definition is presented to determine whether a data point can be correctly clustered or not.

Definition 3.5 (Correct clustering). Suppose xi Rm and xj Rm are data points of data matrix X Rm n, xi is said to be correctly clustered with a tolerance of ϵ if

a. ˆ Kij maxk( ˆ Kinter ik ) ϵ for any of xj Nk,intra i ;

b. ˆ Kij mink( ˆ Kintra ik ) + ϵ for any of xj Nk,inter i .

Based on Definition 3.5, the following theorem gives the guarantee of our security-enhanced Fed SC.

Theorem 3.6 (Guarantee of noisy spectral clustering). Let B(σ) = 1 r2 (σξ +

2θ)2 2θ2 + (

dτC + 1)γ. Then with the probability of at least 1 n(n 1)e t, performing spectral clustering using ˆ K x x yields correct clustering results if

2 max i 1 4

max k (Kinter ik ) min k (Kintra ik ) (29)

where Kinter ik = (Kxx)inter ik and Kintra ik = (Kxx)intra ik .

Based on Theorem 3.1 and Theorem 3.6, we can get a bound on the variance of noise for Fed SC with perturbed data:

r2(B1 B2) + 2θ2

where B1 = ϵ

4 maxk(Kinter ik ) mink(Kintra ik ) and B2 = (

dτC + 1)γ. This bound indicates that the intensity of noise should not be too strong otherwise it may seriously affect the performance of federated spectral clustering. But, at least under this bound, one can choose to inject as much noise as possible into the raw data to ensure data security and privacy.

Using Theorem 3.22 in [Dwork et al., 2014] and the post-processing property of differential privacy, we have the following privacy guarantee for this enhanced Fed SC algorithm.

Proposition 3.7. Fed SC with perturbed data given by (22) is (ε, δ)-differentially private if σ 2cτX/ε, where c2 > 2 ln(1.25/δ).

Based on this proposition and (30), we obtain the following privacy-utility trade-off:

2 ln 1.25/δτX/ε < σ 1

r2(B1 B2) + 2θ2

2θ i . (31)

This ensures both clustering performance and (ε, δ)-differential privacy. In particular, if we substitute σ with the upper bound, we can get a strong level of privacy but the worst utility. By the way, B1 B2 is related to the property of the data. A larger B1 B2 means a better property for clustering, which further provides a larger upper bound for the noise level σ, yielding a stronger privacy guarantee.

3.2 Fed SC with Perturbed Factors

In Fed SC with perturbed factor, we added Gaussian noise to Z in every round of the optimization but added Gaussian noise to C in the last round of the optimization. To be more specific, C = C + EC, and Z = Z + EZ, where the entries of EC and EZ are drawn from N(0, σ2 C) and N(0, σ2 Z) respectively. Then we perform spectral clustering using the following reconstructed kernel matrix:

Kxx = CT K( Z, Z) C. (32)

The following theorem shows the reconstruction error bound for the ground truth kernel matrix Kxx. Theorem 3.8. Assume ϕ(Z)C ϕ(X) 2, γ, C 2, τC. Then with probability at least 1 (n + d)e t, it holds that Kxx Kxx γzc(γzc + 2) (33)

where γzc = γ +

d σCξd + τC

2 1 exp σ2 Zξ2 d 2r2 and ξ2 d = d + 2

We see that, given a fixed γ, the reconstruction error becomes smaller if σZ and σC are smaller. Based on Theorem 3.8 and Definitions 3.2-3.5, we can obtain a bound similar to that in Theorem 3.6 to guarantee correct clustering, which will not be detailed here. Theorem 3.9. In Fed SC, assume max(p,j){ xpj , x pj } τX, max(i,j) zi xj = Υ, Zs p sp τZ s, and CS 2, τC, we perturb {Zs p}P p=1, s = 1, 2, . . . , S with noise drawn from N(0, σ2 Z) with the parameter σZ p

(8S 2(g Z) log(e + (εZ/δZ))/ε2 Z) where (g Z) =

r2 n 1 + (τX + τZ) (τX+Υ)

r2 o and perturb {CS p }P p=1 with noise drawn from N(0, σ2 C) with

the parameter σC 2cλ 1

dτX(τX + Υ)/(r2εC) for c2 > 2 ln(1.25/δC). Then, Fed SC is (εC + εZ, δC + δZ)-differentially private.

Theorem 3.9 shows that our Fed SC with perturbed factors can protect the data privacy provided that the noises added to C and Z are sufficiently large. Similarly to (31), we can also get a privacy-utility trade-off using Theorem 3.8 and Theorem 3.9, which is detailed in Appendix B.

4 Related Work

It should be pointed out that the study on federated spectral clustering in literature is very limited. Besides our work, the only work that aims to address the problem is [Hernández-Pereira et al., 2021]. More introduction and discussion about the related work (federated matrix factorization/clustering [Yang et al., 2021; Ghosh et al., 2020; Dennis et al., 2021; Wang and Chang, 2022] and spectral clustering [Von Luxburg, 2007; Hernández-Pereira et al., 2021]) are in the supplementary material.

5 Experiments

5.1 Performance on similarity reconstruction

Taking the COIL20 dataset [Nene et al., 1996] as an example, we first obtain the similarity matrix from vanilla spectral clustering based on the same kernel function. Then, we use the proposed method to derive the estimated similarity matrix ˆ K x x which is actually an approximation of ground truth. To

make it clearer, we also give the corresponding sparse similarity matrices by KNN sparsification (25). Figure 2 shows the similarity matrices constructed by different methods. We see that the proposed method can be able to successfully reconstruct the similarity matrix in the federated scenarios. The reconstruction errors on synthetic data, iris, banknote authentication, and COIL20 are in the supplementary material.

5.2 Clustering performance of Fed SC

In this subsection, we check the clustering performance of the proposed security-enhanced Fed SC method on both synthetic and real-world datasets. The synthetic dataset is generated from concentric circles. The details are in the supplementary. This synthetic dataset is visualized in Figure 3(a). Here, we continue to adopt the aforementioned COIL20 as an example of real-world datasets.

Figure 2: Visualization of similarity matrices: (a) similarity matrix of vanilla spectral clustering; (b) approximated similarity matrix of the proposed method; (c)(d) the corresponding sparse similarity matrices generated by KNN sparsification.

Taking the synthetic dataset as an example, the first group of cases helps illustrate the effectiveness of the proposed Fed SC method. we first apply the vanilla spectral clustering method to the clean data. The predictive result is shown in Figure 3(b). It is clear that the vanilla spectral clustering method correctly clusters the data points lying in concentric circles. We then use the proposed Fed SC method to cluster the data. One can find in Figure 3(c) that almost all of the data points also have been grouped correctly. However, when we inject some volume of noise

(a) Ground truth

(b) Vanilla SC

(d) Noisy data

(e) Vanilla SC

Figure 3: Fed SC on concentric circles: (a) ground truth; (b) cluster assignment generated by vanilla SC; (c) cluster assignment generated by Fed SC; (d) Noisy ground truth; (e) cluster assignment generated by vanilla SC on noisy data; (f) cluster assignment generated by Fed SC on noisy data.

into the raw data, things may change a lot. Figure 3(d) is actually the ground truth Figure 3(a) adding some Gaussian noise. When focusing on this noisy data of concentric circles, we see from Figure 3(e) that the vanilla SC failed to cluster the data points while the proposed Fed SC method is still able to correctly cluster the data to some extent as in Figure 3(f). As we know, the similarity graph directly constructed from raw data could be very sensitive to each data point. When we add too much noise, the similarity graph may fail to model the local neighborhood relationships which may be the reason why data points in Figure 3(e) are not separable for vanilla SC. Instead, Fed SC is based on matrix factorization in the high-dimensional feature space and has a potential denoising effect. Therefore, it is possible for our method to achieve a better performance.The visualization of COIL20 can be found in the supplementary material.

5.3 Comparison with baselines

We compare our method with the clustering method DSC proposed by [Hernández-Pereira et al., 2021]. Because the existing literature on federated spectral clustering is rare, we here select both classic K-means and spectral clustering as the baselines. Two metrics including accuracy and NMI are adopted to evaluate the clustering results on four datasets including iris [Dua and Graff, 2017], COIL20 [Nene et al., 1996], banknote authentication [Dua and Graff, 2017], and USPS [Hull, 1994]. The details are in the supplementary material. Besides the clean data, we also consider adding noise to them to test the performance of methods under the condition of privacy protection. We directly inject Gaussian noise with zero mean and variance σ2 to the raw matrix X Rm n as X = X + E where E is a Gaussian noise matrix, each element Ei,j of which is i.i.d. with N(0, σ2).

Table 1 shows the clustering accuracy. Our Fed SC almost always achieves comparable clustering results to vanilla SC. It even outperformed vanilla SC in some cases and K-means in most cases since Fed SC has a potential denoising effect by approximating a similarity matrix. More importantly, Fed SC significantly outperformed DSC in almost all cases. The reason is that DSC performs spectral clustering on each local dataset, which may lead to very unstable and inaccurate results.

Table 1: Comparison of clustering accuracy (X and X denote the raw data and corrupted data respectively). The results of NMI are in Section D.4.

Kmeans SC DSC Fed SC

Iris 0.8933 0.0000 0.9000 0.0000 0.5480 0.0679 0.9000 0.0031 COIL20 0.6113 0.0534 0.8025 0.0009 0.1009 0.0100 0.7828 0.0231 Bank 0.6122 0.0000 0.5918 0.0000 0.5582 0.0045 0.7672 0.1457 USPS 0.6704 0.0047 0.6635 0.0000 0.1686 0.0014 0.6596 0.0021 ORL 0.6325 0.0270 0.7865 0.0106 0.1653 0.0073 0.7235 0.0170

X with 0.1σ

Iris 0.8940 0.0152 0.9120 0.0332 0.4533 0.0658 0.8993 0.0299 COIL20 0.6283 0.0484 0.8024 0.0020 0.0995 0.0122 0.7790 0.0213 Bank 0.6067 0.0022 0.6067 0.0944 0.5558 0.0023 0.7168 0.1068 USPS 0.6732 0.0035 0.6647 0.0016 0.1690 0.0007 0.6643 0.0022 ORL 0.6195 0.0329 0.7810 0.0065 0.1600 0.0089 0.7323 0.0250

X with 0.3σ

Iris 0.8420 0.0274 0.8327 0.0267 0.4533 0.0674 0.8427 0.0404 COIL20 0.6422 0.0366 0.7997 0.0029 0.0981 0.0084 0.7793 0.0240 Bank 0.6020 0.0038 0.5859 0.0105 0.5588 0.0074 0.6046 0.0064 USPS 0.6704 0.0063 0.6720 0.0044 0.1673 0.0019 0.6884 0.0509 ORL 0.6098 0.0167 0.7885 0.0047 0.1665 0.0057 0.7417 0.0280

X with 0.5σ

Iris 0.7740 0.0252 0.7313 0.0494 0.3833 0.0204 0.7540 0.0336 COIL20 0.6389 0.0296 0.7950 0.0080 0.1033 0.0167 0.7403 0.0294 Bank 0.6051 0.0076 0.5923 0.0094 0.5566 0.0030 0.6086 0.0073 USPS 0.6699 0.0031 0.7843 0.0030 0.1683 0.0017 0.7778 0.0062 ORL 0.5983 0.0295 0.7930 0.0172 0.1615 0.0096 0.7107 0.0345

X with 0.7σ

Iris 0.6500 0.0420 0.6087 0.0468 0.3927 0.0349 0.6120 0.0455 COIL20 0.6220 0.0627 0.7662 0.0172 0.0893 0.0055 0.6803 0.0198 Bank 0.6100 0.0112 0.6046 0.0144 0.5566 0.0035 0.6106 0.0107 USPS 0.6638 0.0044 0.7747 0.0040 0.1675 0.0004 0.7587 0.0117 ORL 0.5723 0.0360 0.7860 0.0093 0.1613 0.0066 0.6910 0.0232

5.4 More numerical result

The t SNE visualization, clustering results in terms of NMI, the performance of Fed SC with perturbed factors, etc, are in the supplementary material.

6 Conclusion

This paper has proposed a secure kernelized factorization method for federated spectral clustering on distributed data. We provide theoretical guarantees for optimization convergence, correct clustering, and differential privacy. The numerical experiments on synthetic and real image datasets verified the effectiveness of our method. To the best knowledge of the authors, this is the work that successfully addresses the problem of federated spectral clustering. One limitation of this work is that we haven t tested our Fed SC on very large datasets, though the moderate-size datasets are sufficient to justify the effectiveness of our Fed SC. Note that for large-scale datasets, the bottleneck of clustering is the eigenvalue decomposition of the Laplacian matrix, not our Fed SC algorithm.

Acknowledgments

This work was partially supported by the Youth program 62106211 of the National Natural Science Foundation of China, the General Program JCYJ20210324130208022 of Shenzhen Fundamental Research, the research funding T00120210002 of Shenzhen Research Institute of Big Data, the Guangdong Key Lab of Mathematical Foundations for Artificial Intelligence, and the funding UDF01001770 of The Chinese University of Hong Kong, Shenzhen.

Jinyu Cai, Jicong Fan, Wenzhong Guo, Shiping Wang, Yunhe Zhang, and Zhao Zhang. Efficient deep embedded subspace clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1 10, June 2022.

Yi-Ruei Chen, Amir Rezapour, and Wen-Guey Tzeng. Privacy-preserving ridge regression on distributed data. Information Sciences, 451:34 49, 2018.

Don Kurian Dennis, Tian Li, and Virginia Smith. Heterogeneity for the win: One-shot federated clustering. In International Conference on Machine Learning, pages 2611 2620. PMLR, 2021.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014.

Jicong Fan and Tommy W.S. Chow. Sparse subspace clustering for data with missing entries and high-rank matrix completion. Neural Networks, 93:36 44, 2017.

Jicong Fan and Madeleine Udell. Online high rank matrix completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Jicong Fan, Zhaoyang Tian, Mingbo Zhao, and Tommy W.S. Chow. Accelerated low-rank representation for subspace clustering and semi-supervised classification on large-scale data. Neural Networks, 100:39 48, 2018.

Jicong Fan, Chengrun Yang, and Madeleine Udell. Robust non-linear matrix factorization for dictionary learning, denoising, and clustering. IEEE Transactions on Signal Processing, 69:1755 1770, 2021.

Jicong Fan, Yiheng Tu, Zhao Zhang, Mingbo Zhao, and Haijun Zhang. A simple approach to automated spectral clustering. In Advances in Neural Information Processing Systems, volume 35, pages 9907 9921. Curran Associates, Inc., 2022.

Jicong Fan. Large-scale subspace clustering via k-factorization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD 21, page 342 352, New York, NY, USA, 2021. Association for Computing Machinery.

Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems, 33:19586 19597, 2020.

Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068 14080, 2020.

Elena Hernández-Pereira, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, and Beatriz Pérez Sánchez. Federated learning approach for spectral clustering. In ESANN, 2021.

Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550 554, 1994.

Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376 1385. PMLR, 2015.

Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021.

Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. On-device federated learning via blockchain and its latency analysis. ar Xiv preprint ar Xiv:1808.03949, 2018.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302 1338, 2000.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. ar Xiv preprint ar Xiv:1907.02189, 2019.

Li Li, Yuxi Fan, Mike Tse, and Kuo-Yi Lin. A review of applications in federated learning. Computers & Industrial Engineering, 149:106854, 2020.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020.

Zitao Li, Bolin Ding, Ce Zhang, Ninghui Li, and Jingren Zhou. Federated matrix factorization with privacy guarantee. Proceedings of the VLDB Endowment, 15(4):900 913, 2021.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273 1282. PMLR, 2017.

SA Nene, SK Nayar, and H Murase. Columbia university image library (coil-20). Technical Report CUCS-005-96, 1996.

Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001.

Efthymios Tzinis, Jonah Casebeer, Zhepei Wang, and Paris Smaragdis. Separate but together: Unsupervised federated learning for speech enhancement from non-iid data. In 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 46 50. IEEE, 2021.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395 416, 2007.

Shuai Wang and Tsung-Hui Chang. Federated matrix factorization: Algorithm design and application to data clustering. IEEE Transactions on Signal Processing, 70:1625 1640, 2022.

Hongtao Wang, Ang Li, Bolin Shen, Yuyan Sun, and Hongmei Wang. Federated multi-view spectral clustering. IEEE Access, 8:202249 202259, 2020.

Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. Applied federated learning: Improving google keyboard query suggestions. ar Xiv preprint ar Xiv:1812.02903, 2018.

Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1 19, 2019.

Enyue Yang, Yunfeng Huang, Feng Liang, Weike Pan, and Zhong Ming. Fcmf: Federated collective matrix factorization for heterogeneous collaborative filtering. Knowledge-Based Systems, 220:106946, 2021.

Fengda Zhang, Kun Kuang, Zhaoyang You, Tao Shen, Jun Xiao, Yin Zhang, Chao Wu, Yueting Zhuang, and Xiaolin Li. Federated unsupervised representation learning. ar Xiv preprint ar Xiv:2010.08982, 2020.

Weiming Zhuang, Xin Gan, Yonggang Wen, Shuai Zhang, and Shuai Yi. Collaborative unsupervised visual representation learning from decentralized data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4912 4921, 2021.

A Framework of the Proposed Algorithm

In the beginning, Z and Cp for p = 1, 2, . . . , P are initialized randomly in the central server and clients, respectively. Then, Z is broadcast to every participating client and helps do the alternate updating of Zp and Cp based on the update schemes (16) and (14). Afterward, the obtained matrix Zp is sent back to the central server and aggregated for the next round of training. When the tolerance condition is broken, both ZS p and CS p are sent back to the central server for the subsequent clustering task.

Algorithm 1 Proposed Federated Similarity Reconstruction Input: Distributed data {Xp, : p P := {1, 2, . . . , P}}, clients weights {ωp : p P}.

1: Initialize Z0 at server side and {C0 p}P p=1 at client sides. 2: Randomly choose A0 P with |As| = P. 3: for round s = 1 to S do 4: Server side: compute Zs 1 = 1

P P p As 1 Zs 1 p .

5: Broadcast Zs 1 to clients cp, p As. 6: Client side: 7: for client p = 1 to P in parallel do 8: set Zs,0 p = Zs 1

9: update local variable Cs p: 10: Cs p = (K(Zs,0, Zs,0) + λId) 1K(Zs,0, Xp) 11: update local variable Zs p: 12: for t = 1 to Q do 13: Zs,t p = Zs,t 1 p ηs Zfp(Zs,t 1 p ) 14: end for 15: denote Zs p = Zs,Q p 16: if client p As 1 then 17: upload Zs p to the server. 18: end if 19: end for 20: Randomly choose As P with |As| = P. 21: end for Output: Z, Cp, p = 1, 2, . . . , P

B More theoretical results about Fed SC with perturbed factors

Based on Theorem 3.8 and Theorem 3.6, we can get a bound on the variance of noise for Fed SC with perturbed factors: γzc(σZ, σC) 1 + 2 p

1 + B1 (34)

where B1 = ϵ 2 maxi 1

4 maxk(Kinter ik ) mink(Kintra ik ) and γzc(σZ, σC) is exactly the γzc in Theorem 3.8 and is clearly a non-decreasing function with respect to σZ and σC, respectively. Therefore, it is a valid upper bound on both σZ and σC.

Proof. By Theorems 3.8 and 3.6, we have

γzc(γzc + 2) B1 (35)

That is, we need to solve a quadratic equation γ2 zc + 2γzc B1 = 0 with both γzc 0 and B1 0. Since the discriminant = 4 + 4B1 0, we have two roots 1 2 1 + B1 for this equation. Thus, it has 0 γzc 1 + 2 1 + B1 as desired.

This bound indicates that the intensity of noise should not be too strong otherwise it may seriously affect the performance of federated spectral clustering. But, at least under this bound, one can choose to inject as much noise as possible into the raw data to ensure data security and privacy.

Based on Theorem 3.9 and (34), we obtain the following privacy-utility trade-off:

γzc(σZ, σC) 1 + 2 1 + B1 σZ p

(8S 2(g Z) log(e + (εZ/δZ))/ε2 Z) σC 2cλ 1

dτX(τX + Υ)/(r2εC) (36)

This ensures both clustering performance and (ε, δ)-differential privacy. In particular, if we increase the intensity of either σZ or σC to reach the upper bound of γzc, we can get a strong level of privacy but the worst utility. By the way, B1 is related to the property of the data. A larger B1 means a better property for clustering, which further provides a larger upper bound for the noise level, yielding a stronger privacy guarantee.

C More discussion on the privacy-utility trade-off of Fed SC

Although theoretical results are presented in our study to ensure the security of Fed SC, we still provide here some insight into methods of using Secure Aggregation or other cryptographic techniques to handle pair-wise client functions (i.e., kernel function in our study). Even though the communication cost will be high, it might be an alternative to reduce the DP noise levels. Specifically, it can be performed as follows.

Step 1: ZS p is posted to the central server;

Step 2: Central server aggregates ZS p to get the global ZS;

Step 3: Central server computes KZZ = K(ZS, ZS) and broadcast it to clients; Step 4: For client p and ci Cp, if cj Cp, then client p directly calculate ˆKij = ci KZZcj; if cj Cp , then client p firstly encrypts and transfers its ci to client p , and then client p also encrypts its cj and use the cipher text to compute enc( ˆKij) = enc(c T i )enc(KZZ)enc(cj); Client p transfers the result enc( ˆKij) back to client p; Client p decrypts enc( ˆKij) to get ˆKij.

Step 5: Each client p sends its estimated results ˆKij back to the central server without sending its own Cp; Step 6: Central Server checks whether these posted results from clients are compatible with each other in case of injection attacks and then performs spectral clustering.

This alternative does not send clients C to the central server and gives an extra cross-validation process in the central server which may be useful to enhance security.

D More details and results of the experiments

It should be pointed out that in Table 2 of the main paper, the signal-noise ratio is as high as 12d B, which means the noise is tiny. The parameter d in our Fed SC was automatically determined and has a much larger value than that in the noiseless case. That is why the performance of Fed SC in the noisy case is even better than that in the noiseless case of some datasets. In this appendix, we increase the noise level (σe = βσ, where σ denotes the standard deviation of the clean data) and consider one more real dataset.

D.1 Dataset description

Synthetic data The synthetic data is generated from concentric circles. For θi [0, 2π], i = 1, 2, . . . , 1258, all the points of this synthetic dataset X R2 1258 are generated by

xi = (xi1, xi2) : xi1 = r cos(θi) + ei1 xi2 = r sin(θi) + ei2 (37)

where θ0 = 0, θ1258 = 2π, and the remaining θi are evenly spaced points between θ0 and θ1258. The additive noise ei1 and ei2 are drawn from N(0, σ2 e). We let σe = 0.1σx, where σx denotes the standard deviation of the data without the additive noise e. In our experiment, we set the hyperparameter r in (37) to 2 and 4.

Figure 4: Synthetic dataset of 2 concentric circles: (a) ground truth; (b) noisy data perturbed by Gaussian noise with mean zero and standard deviation 0.2std(X).

Real-world data Both iris and banknote authentication are from the UCI machine learning library. COIL20 is an image dataset from Columbia Imaging and Vision Laboratory. USPS is a dataset for handwritten text recognition research. The details of the mentioned datasets are shown in Table 2.

Table 2: Summary of four real-world datasets

# of clusters # of attributes # of instances Iris 3 4 150 COIL20 20 20 20 1440 Bank 2 5 1372 USPS 10 16 16 9298 ORL 40 92 112 400

The visualizations (by t-SNE [Van der Maaten and Hinton, 2008]) of three real-world datasets are shown in Figure 5.

-100 0 100 -100

-50 0 50 -50

Figure 5: t-SNE visualization of some real-world datasets

D.2 Evaluation metrics

We use two metrics to evaluate the clustering performance of our method: accuracy and normalized mutual information (NMI). Between them, accuracy is affected by the misclassification rate of cluster assignment. The smaller the misclassification rate, the greater the accuracy. In this study, given data matrix X Rm n with n sample points of m features, let Li and Lri for i = 1, 2, . . . , n be the true labels and predictive labels, respectively, the accuracy of predictive cluster assignment Lr is computed by

accuracy = 1

ncard({i|Li = Lri, i = 1, . . . , n}). (38)

NMI is a normalized version of mutual information that measures the agreement of two cluster assignments without considering their permutations.

NMI = 2I(L; Lr) H(L) + H(Lr) (39)

where I(L; Lr) is the mutual information between L and Lr, and H(L) and H(Lr) are the entropy of L and Lr, respectively.

D.3 Parameter settings

In our experiments, we set some hyperparameters including λC, d, r, and k for implementing the proposed Fed SC. Among them, λC as the penalty parameter of the regularization term is set to 1e 2. k is the hyperparameter of the KNN-based operation on the similarity matrix. We set k to max(ceil(log(n)), 1) based on [Von Luxburg, 2007]. The details of methods to set the remaining two parameters are as follows.

Setting of hyparameter r The hyperparameter r controls the smoothness of Gaussian kernel function. Due to the distinct characteristics of the datasets, we adopt the following adaptive method to determine the value of r: r = c Mean(Re(

D)) (40) where c = 1 and D is the distance matrix of data points, of which its element

Dij = xi xj 2 2, for i, j = 1, 2, . . . , n. (41)

In reality, the global D cannot be obtained because Fed SC does not allow data transmission. Then we compute Dp for each client cp, p = 1, . . . , P and then determine r as follows

p=1 Mean(Re( p

where c can be tuned.

Setting of hyparameter d The hyperparameter d controls the complexity of the approximation ϕ(X) ϕ(Z)C. We adaptively determine the value of d for each dataset:

i=1 si tol for k n

where si denotes the i-th largest eigenvalue of the kernel matrix Kxx, i = 1, . . . , n. In our experiment, we set the threshold tol to 0.99.

It is worth noting that we use the same r in the vanilla SC and our Fed SC for a fair comparison, though D cannot be obtained in our Fed SC. In addition, the setting of d relies on Kxx that is also not obtainable in our Fed SC. We use this setting for convenience and in reality we need to determine d by other methods such as letting d = γm, where γ is a hyperparameter.

D.4 Clustering results

NMI results The average results of 10 repeated trials are reported in Table 1 (ACC) and Table 3 (NMI). Note that for the USPS dataset, only 5 trials are performed to save time due to its large number of sample points. We see that our Fed SC outperformed Kmeans and DSC significantly in almost all cases and has at least comparable performance as SC in most cases.

Influence of d The clustering accuracies with different d are reported in Table 4.

Results of Fed SC with perturbed factors The results on COIL20 are reported in Table 5, where σZ = αzstd(Zp) and σC = αcstd(C). Through this experiment, it can be observed that perturbing factors can have a more significant impact on the accuracy of clustering results than perturbing raw data. That is, perturbing factors are more sensitive and can achieve a specified level of differential privacy with weaker noise. Furthermore, it can be seen from Table 5 that the clustering performance is more sensitive to σC than σZ.

Malicious attack on C Due to the kernel trick, the optimization problem is nonlinear and nonconvex. Hence, it is very difficult for potential attackers to recover the data X from the uploaded factors Z, C, especially when X or Z, C are perturbed by noise. Nevertheless, the attacker may perform K-means on the {Cp}P p=1 to obtain clustering results. However, we find that the clustering accuracy

Table 3: Comparison with existing clustering methods (NMI)

NMI Kmeans SC DSC Fed SC

Iris 0.6356 0.0000 0.6647 0.0000 0.2516 0.0707 0.6708 0.0139 COIL20 0.7658 0.0143 0.8809 0.0006 0.1259 0.0225 0.8613 0.0144 Bank 0.1480 0.0000 0.1228 0.0000 0.0122 0.0160 0.4901 0.3050 USPS 0.6125 0.0026 0.8083 0.0000 0.0064 0.0016 0.7972 0.0121 ORL 0.8343 0.0136 0.8980 0.0045 0.3970 0.0045 0.8525 0.0097

Xn with 0.1σ

Iris 0.6267 0.0305 0.6878 0.0764 0.1458 0.0794 0.6666 0.0600 COIL20 0.7721 0.0122 0.8805 0.0012 0.1214 0.0254 0.8529 0.0209 Bank 0.1383 0.0039 0.1314 0.2010 0.0039 0.0081 0.3853 0.2390 USPS 0.6147 0.0022 0.8097 0.0016 0.0076 0.0010 0.7920 0.0133 ORL 0.8280 0.0153 0.8948 0.0019 0.3944 0.0074 0.8554 0.0104

Xn with 0.3σ

Iris 0.5292 0.0528 0.5072 0.0442 0.1403 0.0792 0.5385 0.0594 COIL20 0.7745 0.0185 0.8767 0.0014 0.1174 0.0224 0.8477 0.0188 Bank 0.1327 0.0062 0.1031 0.0211 0.0135 0.0253 0.1489 0.0151 USPS 0.6112 0.0052 0.7927 0.0123 0.0070 0.0016 0.7837 0.0091 ORL 0.8136 0.0067 0.8974 0.0047 0.3987 0.0069 0.8578 0.0147

Xn with 0.5σ

Iris 0.4316 0.0310 0.3912 0.0426 0.0710 0.0421 0.4046 0.0388 COIL20 0.7676 0.0148 0.8721 0.0061 0.1238 0.0286 0.8293 0.0168 Bank 0.1408 0.0133 0.1189 0.0175 0.0059 0.0109 0.1521 0.0113 USPS 0.6088 0.0036 0.8049 0.0164 0.0059 0.0028 0.7951 0.0127 ORL 0.8058 0.0150 0.8970 0.0063 0.3977 0.0095 0.8381 0.0133

Xn with 0.7σ

Iris 0.3025 0.0407 0.2755 0.0381 0.0715 0.0459 0.2773 0.0408 COIL20 0.7589 0.0234 0.8564 0.0141 0.1032 0.0127 0.7772 0.0196 Bank 0.1506 0.0208 0.1420 0.0262 0.0074 0.0114 0.1577 0.0196 USPS 0.6007 0.0051 0.8004 0.0110 0.0042 0.0009 0.7636 0.0184 ORL 0.7877 0.0165 0.8944 0.0043 0.3946 0.0056 0.8236 0.0102

Table 4: Accurcy of the proposed algorithm on COIL20 (K = 20)

d 7K 8K 9K 194 (SVD) 10K 11K 12K Trial 1 0.7806 0.7882 0.7951 0.7944 0.7778 0.7979 0.7694 Trial 2 0.8035 0.7431 0.7819 0.8007 0.7812 0.8049 0.7792 Trial 3 0.7562 0.7896 0.7569 0.8097 0.7708 0.7569 0.7764 Trial 4 0.7444 0.7771 0.7771 0.7889 0.7854 0.7903 0.7958 Trial 5 0.7965 0.7361 0.8014 0.7833 0.7632 0.7264 0.7694 Mean 0.7762 0.7668 0.7825 0.7954 0.7757 0.7753 0.7781

d 10K 20K 30K 749 (SVD) 40K 50K 60K Trial 1 0.7708 0.7722 0.7764 0.8201 0.7000 0.7306 0.8139 Trial 2 0.7743 0.7889 0.7646 0.7882 0.7694 0.7451 0.7299 Trial 3 0.7375 0.7743 0.7167 0.7493 0.7792 0.7965 0.7299 Trial 4 0.7965 0.7688 0.8014 0.7736 0.7903 0.7382 0.7674 Trial 5 0.7493 0.7507 0.7576 0.7000 0.7792 0.8076 0.8194 Mean 0.7657 0.7710 0.7633 0.7662 0.7636 0.7636 0.7721

on C is lower than those of Kmeans, SC, and Fed SC reported in Table 1. For instance, the clustering accuracy on Iris is reported in Table.

Influence of P Table 7 compares the clustering performance between using a single client (P = 1) and using multiple ones (P = 8). It is clear that the operation of splitting data across multiple clients may lead to an accuracy loss of clustering, which implies that our method is valid.

More results on MNIST and CIFAR10 We have already included datasets of high-dimensional images in Table 1 like USPS and COIL20 with sizes 16 16 and 20 20, respectively. Nevertheless, to further improve the experiment, we added the results of MNIST (28 28) and CIFAR10(32 32) in Table 8.

Table 5: Clustering accuracy (average over 10 trials) of Fed SC with perturbed factors on COIL20.

αc 0 0.05 0.1 0.15 Mean Std Mean Std Mean Std Mean Std

0 0.7831 0.0268 0.6573 0.0291 0.6012 0.0391 0.4835 0.0526 0.05 0.7824 0.0126 0.6573 0.0243 0.6165 0.0286 0.4998 0.0567 0.1 0.7817 0.0299 0.6625 0.0258 0.6453 0.0186 0.5508 0.0415 0.15 0.7881 0.0223 0.6526 0.0258 0.6219 0.0467 0.5901 0.0326

Table 6: Clustering accuracy among different ways

Kmeans SC DSC Fed SC Attack on C Iris 0.8920 0.0028 0.9000 0.0000 0.5493 0.1263 0.9027 0.0064 0.6360 0.1557

E Proof for theorem on the convergence of Fed SC algorithm

We derive the objective descent with respect to Cs and Zs,t, respectively.

Objective descent with w.r.t. C: Based on the update scheme (14) of C, we have

Cfp(Zs,0, Cs p) = (K(Zs,0, Zs,0) + λId)Cs p K(Zs,0, Xp) = 0 (44)

where Zs,0 = Zs 1 = 1

p As 1 Zs 1,Q p .

According to the proposed Fed SC problem 7, we have

fp(Zs,0, Cs p) fp(Zs,0, Cs 1 p )

ϕ(Xp) ϕ(Zs,0)Cs p 2

ϕ(Xp) ϕ(Zs,0)Cs 1 p 2

= Tr((Cs p Cs 1 p )T K(Zs,0, Xp)) + λ

2 Tr(Cs p(Cs p)T Cs 1 p (Cs 1 p )T )

2Tr( Cs p(Cs p)T Cs 1 p (Cs 1 p )T K(Zs,0, Zs,0))

= Tr((Cs p Cs 1 p )T (K(Zs,0, Zs,0) + λId)Cs p)

2Tr( Cs p(Cs p)T Cs 1 p (Cs 1 p )T K(Zs,0, Zs,0) + λId )

2 Tr([Cs p Cs 1 p ]T [K(Zs,0, Zs,0) + λId][Cs p Cs 1 p ]) | {z } T.1

2(γs min + λ) Cs p Cs 1 p 2

where γs min = γmin(K(Zs,0, Zs,0)).

Summing it up from p = 1 to P, we have

F(Zs,0, Cs) F(Zs,0, Cs 1) 1

2(γs min + λ)

p=1 ωp Cs p Cs 1 p 2

Objective descent with w.r.t. Z: According to Assumption 2.1, it implies

F(Zs,t, Cs) F(Zs,t 1, Cs) ZF(Zs,t 1, Cs), Zs,t Zs,t 1 | {z } T.2

Zs,t Zs,t 1 2 F

Table 7: Clustering accuracy between using a single client and using multiple clients

Iris COIL20 Bank ORL P = 1 0.9007 0.0086 0.8073 0.0089 0.7536 0.1271 0.7937 0.0084 P = 8 0.8993 0.0165 0.8003 0.0049 0.6821 0.1405 0.7112 0.0258

Table 8: Clustering accuracy on MNIST and CIFAR10

Kmeans SC DSC Fed SC

X MNIST 0.5448 0.0257 0.6265 0.0439 0.1292 0.0144 0.6139 0.0464 CIFAR10 0.2171 0.0132 0.2182 0.0133 0.1235 0.0062 0.2134 0.0131

X with 0.3σ MNIST 0.5402 0.0225 0.5755 0.0366 0.1337 0.0191 0.5606 0.0457 CIFAR10 0.2209 0.0154 0.2198 0.0172 0.1194 0.0051 0.2187 0.0125

X with 0.7σ MNIST 0.5374 0.0555 0.5711 0.0329 0.1340 0.0135 0.5029 0.0264 CIFAR10 0.2202 0.0147 0.2205 0.0084 0.1209 0.0045 0.2134 0.0152

Now, we give a bound on T.2. Lemma E.1. For any s and t, it holds that

ZF(Zs,t 1, Cs), Zs,t Zs,t 1 = Ls Z Zs,t Zs,t 1 2 F + ZF(Zs,t 1, Cs), Zs,t Zs,t 1 (48)

Proof. Based on the update schemes of Z, we have

Zs,t p = Zs,t 1 p 1

Ls Z Zfp(Zs,t 1 p , Cs p)

Zs,t = Zs,t 1 1 PLs Z

p As Zfp(Zs,t 1 p , Cs p)

p As Zfp(Zs,t 1 p , Cs p) + Ls Z(Zs,t Zs,t 1)

Then, consider the following terms ZF(Zs,t 1, Cs), Zs,t Zs,t 1

= ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p) Ls Z(Zs,t Zs,t 1), Zs,t Zs,t 1

= Ls Z Zs,t Zs,t 1 2 F

+ ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p), Zs,t Zs,t 1

Based on Lemma 1, we continue to do the following derivation. F(Zs,t, Cs) F(Zs,t 1, Cs)

Zs,t Zs,t 1 2 F

+ ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p), Zs,t Zs,t 1

Zs,t Zs,t 1 2 F

ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p)

F | {z } T.3

where we used the fact that x, y x 2 2 + y 2 2 , c > 0.

Now, we give a bound on T.3.

Lemma E.2. For any s and t, it holds that

ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p)

P + 2(1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

Proof. For any s and t, it holds that

ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p)

ZF(Zs,t 1, Cs)

p=1 ωp Zfp(Zs,t 1 p , Cs p)

p=1 ωp Zfp(Zs,t 1 p , Cs p) 1

p As Zfp(Zs,t 1 p , Cs p)

ZF(Zs,t 1, Cs)

p=1 ωp Zfp(Zs,t 1 p , Cs p)

F | {z } T.4

p=1 ωp Zfp(Zs,t 1 p , Cs p) 1

p As Zfp(Zs,t 1 p , Cs p)

F | {z } T.5

where we used the fact that Pn i=1 ai 2 2 n Pn i=1 ai 2 2.

Firstly, we give a bound on T.4:

ZF(Zs,t 1, Cs)

p=1 ωp Zfp(Zs,t 1 p , Cs p)

p=1 ωp Zf(Zs,t 1, Cs p) Zfp(Zs,t 1 p , Cs p)

p=1 ωp ZF(Zs,t 1, Cs p) Zfp(Zs,t 1 p , Cs p) 2

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

where we used the fact that Zfp( , Cs p) is Ls Zp-Lipschitz continuous.

Secondly, we give a bound on T.5:

p=1 ωp Zfp(Zs,t 1 p , Cs p) 1

p As Zfp(Zs,t 1 p , Cs p)

p=1 ωp Zfp(Zs,t 1 p , Cs p) Zfp (Zs,t 1 p , Cs p )

p=1 ωp Zfp(Zs,t 1 p , Cs p) Zfp (Zs,t 1 p , Cs p )

p=1 ωp Zfp(Zs,t 1 p , Cs p) Zfp (Zs,t 1 p , Cs p )

p=1 ωp Zfp(Zs,t 1 p , Cs p) Zfp (Zs,t 1 p , Cs p ) 2

F | {z } T.6

Thirdly, we give a bound on T.6:

Zfp(Zs,t 1 p , Cs p) Zfp (Zs,t 1 p , Cs p ) 2

F = Zfp(Zs,t 1 p , Cs p) Zfp(Zs,t 1, Cs p) + Zfp(Zs,t 1, Cs p)

ZF(Zs,t 1, Cs) + ZF(Zs,t 1, Cs)

h Zfp (Zs,t 1 p , Cs p ) Zfp (Zs,t 1, Cs p ) + Zfp (Zs,t 1, Cs p )

ZF(Zs,t 1, Cs) + ZF(Zs,t 1, Cs) 2 F = Zfp(Zs,t 1 p , Cs p) Zfp(Zs,t 1, Cs p)

+ Zfp(Zs,t 1, Cs p) ZF(Zs,t 1, Cs)

+ h Zfp (Zs,t 1 p , Cs p ) Zfp (Zs,t 1, Cs p ) i

+ Zfp (Zs,t 1, Cs p ) ZF(Zs,t 1, Cs) 2

F 4 Zfp(Zs,t 1 p , Cs p) Zfp(Zs,t 1, Cs p) 2

F + 4 Zfp(Zs,t 1, Cs p) ZF(Zs,t 1, Cs) 2

+ 4 Zfp (Zs,t 1 p , Cs p ) Zfp (Zs,t 1, Cs p ) 2

F + 4 Zfp (Zs,t 1, Cs p ) ZF(Zs,t 1, Cs) 2

4(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F + 4(Ls Zp )2 Zs,t 1 Zs,t 1 p 2

Here, it is the second time that we use the facts that Pn i=1 ai 2 2 n Pn i=1 ai 2 2 and that Zfp( , Cs p) is Ls Zp-Lipschitz continuous.

Based on the bound of T.6, we derive the bound on T.5.

p=1 ωp Zfp(Zs,t 1 p , Cs p) 1

p As Zfp(Zs,t 1 p , Cs p)

4(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F + 4(Ls Zp )2 Zs,t 1 Zs,t 1 p 2

| {z } Bound of T.5

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

Fourthly, based on the bounds of T.4 and T.5, we continue to derive the final required bound. ZF(Zs,t 1, Cs) 1

p As Zfp(Zs,t 1 p , Cs p)

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F | {z } Bound of T.3

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

| {z } Bound of T.4

P + 2(1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

Based on Lemma 2, we continue to do the following derivation.

F(Zs,t, Cs) F(Zs,t 1, Cs)

Zs,t Zs,t 1 2 F

P + 2(1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

| {z } Bound of T.3

Zs,t Zs,t 1 2 F + 16ζ2

Ls Z (1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

Then, summing it up from t = 1 to Q yields

F(Zs,Q, Cs) F(Zs,0, Cs)

Zs,t Zs,t 1 2 F + 16Qζ2

Ls Z (1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F | {z } T.7

Now, we give a bound on T.7.

Lemma E.3. For any s, it holds that

Zs,t Zs,t p 2 F

4t (Ls Z)2 P

j=0 (Ls Zp)2 Zs,j Zs,j p 2

+ 4t (Ls Z)2 P

p =1 ωp (Ls Zp )2 Zs,j Zs,j p 2

Proof. Based on the update schemes of C and Z, we have

Zs,t p = Zs,t 1 p 1

Ls Z Zfp(Zs,t 1 p , Cs p)

Zs,t = Zs,t 1 1 PLs Z

p As Zfp(Zs,t 1 p , Cs p) (62)

Consequently, we have

Zs,t p = Zs,0 p 1

j=0 Zfp(Zs,j p , Cs p)

Zs,t = Zs,0 1 PLs Z

p As Zfp(Zs,j p , Cs p)

where Zs,0 p = Zs,0 = Zs 1.

Based on the above identities,

Zs,t Zs,t p 2 F

Zs,0 1 PLs Z

p As Zfp(Zs,j p , Cs p)

j=0 Zfp(Zs,j p , Cs p)

= 1 (Ls Z)2

p As Zfp(Zs,j p , Cs p)

j=0 Zfp(Zs,j p , Cs p)

p As Zfp(Zs,j p , Cs p) Zfp(Zs,j p , Cs p)

= t (Ls Z)2 P 2

h Zfp (Zs,j p , Cs p ) Zfp(Zs,j p , Cs p) i

t (Ls Z)2 P

Zfp (Zs,j p , Cs p ) Zfp(Zs,j p , Cs p) 2

= t (Ls Z)2 P

p =1 ωp Zfp (Zs,j p , Cs p ) Zfp(Zs,j p , Cs p) 2

F | {z } T.6

t (Ls Z)2 P

p =1 ωp 4(Ls Zp)2 Zs,j Zs,j p 2

F + 4(Ls Zp )2 Zs,j Zs,j p 2

| {z } Bound of T.6

= 4t (Ls Z)2 P

j=0 (Ls Zp)2 Zs,j Zs,j p 2

+ 4t (Ls Z)2 P

p =1 ωp (Ls Zp )2 Zs,j Zs,j p 2

Based on Lemma 3, we give a bound on T.7.

Lemma E.4. For any s, it holds that

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F 8Q(Q 1)(2Q 1)ζ2

P 4(Q 1)2(1 + L 2 Z/L2 Z) (65)

Proof. Based on Lemma 3, we have

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

p=1 ωp(Ls Zp)2

j=0 (Ls Zp)2

Zs,j Zs,j p

2 F + 8(t 1)2ζ2

p =1 ωp (Ls Zp )2 Zs,j Zs,j p 2

p=1 ωp(Ls Zp)2 Ls Zp Ls Z

Zs,j Zs,j p 2

p =1 ωp (Ls Zp )2 Zs,j Zs,j p 2

p=1 ωp(Ls Zp)2 LZ

Zs,j Zs,j p 2

p =1 ωp (Ls Zp )2 Zs,j Zs,j p 2

P (1 + L 2 Z L2 Z )

p=1 ωp(Ls Zp)2 Zs,j Zs,j p 2

F + 8Q(Q 1)(2Q 1)ζ2

P (1 + L 2 Z L2 Z )

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F + 8Q(Q 1)(2Q 1)ζ2

Here, we used the inequality as

Zs,j Zs,j p 2

Zs,t 1 Zs,t 1 p 2

Rearranging the above inequality, we have

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F 8Q(Q 1)(2Q 1)ζ2

P 4(Q 1)2(1 + L 2 Z/L2 Z) (68)

Based on Lemma 4, we continue to do the following derivation:

F(Zs,Q, Cs) F(Zs,0, Cs)

Zs,t Zs,t 1 2 F + 16Qζ2

Ls Z (1 + 8

p=1 ωp(Ls Zp)2 Zs,t 1 Zs,t 1 p 2

F | {z } T.7

Zs,t Zs,t 1 2 F + 16Qζ2

Ls Z (1 + 8

P ) 8Q(Q 1)(2Q 1)ζ2

P 4(Q 1)2(1 + L 2 Z/L2 Z) | {z } Bound of T.7

Zs,t Zs,t 1 2 F + 16Qζ2ψ

where ψ = 1 + ( P +8)(Q 1)(2Q 1) P 4(Q 1)2(1+L 2 Z/L2 Z).

Then, combining (46) and (69) yields

1 2(γs min + λ)

p=1 ωp Cs p Cs 1 p 2

Zs,t Zs,t 1 2 F

[F(Zs,0, Cs 1) F(Zs,Q, Cs)] + 16Qζ2ψ

Derivation of the main result: Based on (70), we derive the convergence in terms of the iterative terms, TC(Zs,0, Cs) and TZ(Zs,t, Cs) for s = 1, 2, . . . , S.

TC(Zs,0, Cs) 2 γs min + λ[F(Zs,0, Cs 1) F(Zs,Q, Cs)] + 32Qζ2ψ (γs min + λ) PLs Z (71)

Then, summing it up from s = 1 to S yields

s=1 TC(Zs,0, Cs) 2 γmin + λ[F(Z1,0, C0) f] + 32SQζ2ψ (γmin + λ) PLZ (72)

Similarly, we have

t=1 TZ(Zs,t, Cs) 4 Ls Z [F(Zs,0, Cs 1) F(Zs,Q, Cs)] + 64Qζ2ψ

P(Ls Z)2 (73)

Then, summing it up from s = 1 to S yields

t=1 TZ(Zs,t, Cs) 4 LZ [F(Z1,0, C0) f] + 64SQζ2ψ

Lastly, combining (72) and (74) and dividing two sides of it by T = S(1 + Q) yields

s=1 TC(Zs,0, Cs) +

t=1 TZ(Zs,t, Cs)

T [F(Z1,0, C0) f] + 16ζ2ψD

where D = 2 γmin+λ + 4 LZ .

F Proof for theorem on error bound on noisy similarity matrix

Proof. It follows from the triangle inequality of matrix norm that ˆ K x x Kxx = ˆ K x x K x x + K x x Kxx

ˆ K x x K x x + K x x Kxx . (76)

Since X = X + E and ϕ( X) = ϕ(Z)C, we have

ˆ K x x = (ϕ(Z)C)T (ϕ(Z)C) = CT K(Z, Z)C K x x = ϕ( X)T ϕ( X) = K( X, X) Kxx = ϕ(X)T ϕ(X) = K(X, X) (77)

where E is a Gaussian noise matrix, of which Ei,j N(0, σ2). Hence, we obtain ˆ K x x Kxx CT K(Z, Z)C K( X, X) + K( X, X) K(X, X) . (78)

Denote by xi (or ei) the i-th column of X Rm n (or E Rm n), i = 1, . . . , n, we first derive the upper bound of the second term of the RHS of (78) as follows:

K x x Kxx = K( X, X) K(X, X) = max i,j |K( xi, xj) K(xi, xj)|

(xi + ei) (xj + ej) 2 2 2r2

xi xj 2 2 2r2

max i,j 1 2r2

(xi + ei) (xj + ej) 2 2 + xi xj 2 2

= max i,j 1 2r2

(xi xj) + (ei ej) 2 2 xi xj 2 2

= max i,j 1 2r2

ei ej 2 2 + 2 xi xj, ei ej

max i,j 1 2r2

ei ej 2 2 + 2| xi xj, ei ej |

max i,j 1 2r2

ei ej 2 2 + 2 xi xj 2 ei ej 2 ,

where for the first inequality we have used the fact that the exponential function is locally Lipschitz continuous, i.e., |ex ey| < |x y| for x, y < 0.

Now, let us figure out the upper bound of ei ej 2 2. Note that

ei ej 2 2 =

l=1 (eli elj)2 = 2σ2 m X

where eli represents the l-th element of the column vector ei, k = 1, . . . , m. It is clear that for l = 1, . . . , m, E[eli elj] = 0,

var[eli elj] = 2σ2. (81)

Hence, eli elj

2σ is a standard Gaussian random variable drawn from N(0, 1). Based on this, we can define a random variable as

which is distributed according to the Chi-squared distribution with m degrees of freedom.

From Laurent and Massart [2000], we know that for any positive t, the Chi-squared variable Q with m degrees of freedom satisfies

P(Q > m + 2

mt + 2t) 1 e t. (83)

Hence, a bound on ei ej 2 2 with probability 1 e t is

ei ej 2 2 = 2σ2Q 2σ2(m + 2

mt + 2t). (84)

Using union bound for (84), we have

max i,j ei ej 2 2 = 2σ2Q 2σ2(m + 2

mt + 2t). (85)

holds with probability at least 1 n(n 1)e t. Assume xi 2 θ, then xi xi 2 2θ. For convenience, let ξ = p

mt + 2t. It follows from (79) and (85) that, with probability at least 1 n(n 1)e t,

K x x Kxx 1 2r2

h 2σ2ξ2 + 2 xi xj 2

2θ)2 2θ2i .

Now, we figure out the upper bound of the first term of the RHS of (78). We have ˆ K x x K x x = CT K(Z, Z)C K( X, X)

= (ϕ(Z)C)T (ϕ(Z)C) ϕ( X)T ϕ( X)

= (ϕ(Z)C)T (ϕ(Z)C) (ϕ(Z)C)T ϕ( X) + (ϕ(Z)C)T ϕ( X) ϕ( X)T ϕ( X)

= (ϕ(Z)C)T (ϕ(Z)C ϕ( X)) + (ϕ(Z)C ϕ( X))T ϕ( X)

(ϕ(Z)C)T (ϕ(Z)C ϕ( X)) + (ϕ(Z)C ϕ( X))T ϕ( X)

(ϕ(Z)C) 2, ϕ(Z)C ϕ( X) 2, + ϕ(Z)C ϕ( X) 2, ϕ( X) 2,

= ( (ϕ(Z)C) 2, + ϕ( X) 2, ) ϕ(Z)C ϕ( X) 2, (87) where 2, is a norm such that X 2, = maxi X:,i 2 for a real matrix X. Here we can just

assume that ϕ(Z)C ϕ( X) 2, γ, γ is some constant; this relies on the optimization.

Moreover, assume C 2 τC, we have ϕ( X) 2, = 1

ϕ(Z)C 2, ϕ(Z) F max j C:,j

Hence, we can continue to do the derivation of the preceding inequality. ˆ K x x K x x ϕ(Z)C 2, + ϕ( X) 2,

ϕ(Z)C ϕ( X) 2,

dτC + 1)γ (89)

As a result, the overall bound is ˆ K x x Kxx ˆ K x x K x x + K x x Kxx

2θ)2 2θ2i + (

dτC + 1)γ (90)

where ξ = q

G Proof for Theorem 3.6

Proof. For a data matrix X Rm n, it is perturbed by a Gaussian noise matrix E Rm n with N(0, σ2) to form the noisy data matrix X = X + E. Let Kxx = K(X, X) be the ground truth and ˆ K x x = CT K(Z, Z)C be the approximated similarity matrix by Fed KMF algorithm 1. Then, consider any two data points in X, xi and xj, with the identical label, we know that

Kiu Kij (91)

where Kiu = maxk((Kxx)inter ik ) and Kij = (Kxx)ij.

After running Fed KMF (Algorithm 1) on the noisy matrix X, if the approximated similarity matrix satisfies ˆ K x x Kxx < B(σ), then consider two points in X, xi and xj, that are actually xi and xj perturbed by some noise, we have

ˆ Kiu ˆ Kij = ˆ Kiu Kiu + Kiu ˆ Kij + Kij Kij

= ( ˆ Kiu Kiu) + (Kij ˆ Kij) + (Kiu Kij)

| ˆ Kiu Kiu| + |Kij ˆ Kij| + (Kiu Kij) B(σ) + B(σ) + (Kiu Kij) = 2B(σ) + (Kiu Kij)

where ˆ Kiu = maxk(( ˆ K x x)inter ik ) and ˆ Kij = ( ˆ K x x)ij.

Based on Definition 3.5, xi and xj can be correctly clustered only if the following inequality holds with some tolerance ϵ > 0. ˆ Kiu ˆ Kij ϵ (93)

Thus, combining inequalities (92) and (93), the bound function B(σ) satisfies

2[ϵ (Kiu Kij)] (94)

Similarly, if we consider any two data points in X, xi and xj, with different labels, we know that

Kij Kiv (95)

where Kiv = mink((Kxx)intra ik ).

After running Fed KMF algorithm 1 on the noisy matrix X, we have

ˆ Kij ˆ Kiv = ˆ Kij Kij + Kij ˆ Kiv + Kiv Kiv

= ( ˆ Kij Kij) + (Kiv ˆ Kiv) + (Kij Kiv)

| ˆ Kij Kij| + |Kiv ˆ Kiv| + (Kij Kiv) B(σ) + B(σ) + (Kij Kiv) = 2B(σ) + (Kij Kiv)

where ˆ Kiu = maxk(( ˆ K x x)intra ik ).

Based on Definition 3.5, xi and xj can be correctly clustered only if the following inequality holds with some tolerance ϵ > 0. ˆ Kij ˆ Kiv ϵ (97)

Thus, combining inequalities (96) and (97), the bound function B(σ) satisfies

2[ϵ (Kij Kiv)] (98)

Then, with two upper bounds on B(σ), (94) and (98), we have

2 min i {ϵ (Kiu Kij), ϵ (Kij Kiv)}. (99)

where ϵ is the parameter of tolerance.

Alternatively, a slightly looser version is like

B(σ) min i 1 4[2ϵ (Kiu Kiv)]

2 max i 1 4(Kiu Kiv). (100)

where Kiu = maxk((Kxx)inter ik ) and Kiv = mink((Kxx)intra ik ).

H Proof for Proposition 3.7

Proof. The ℓ2-sensitivity [Dwork et al., 2014] of a function f : N|X| Rk is: 2(f) = max x y f(x) f(y) 2,

where x y denotes that x and y are neighboring datasets. In our case, the function is f(x) = x. Then f(x) f(y) 2 = x y 2 2τX. It means 2(f) 2τX. Now using Theorem 3.22 in [Dwork et al., 2014] and Lemma H.1, we get the desired result.

Lemma H.1 (Post-Processing [Dwork et al., 2014]). Let M : N|X| R be a randomized algorithm that is (ε, δ)-differentially private. Let h : R R be an arbitrary randomized mapping. Then h M : N|X| R is (ε, δ) differentially private.

I Proof for Theorem 3.8

Proof. Based on the assumptions C 2, τC, ϕ(Z)C ϕ(X) 2, γ, C = C + EC for the entry (EC)ij N(0, σ2 C), and Z = Z + EZ for the entry (EZ)ij N(0, σ2 Z), we obtain Kxx Kxx = CT K( Z, Z) C K(X, X)

= (ϕ( Z) C)T (ϕ( Z) C) ϕT (X)ϕ(X)

= (ϕ( Z) C)T (ϕ( Z) C) (ϕ( Z) C)T ϕ(X) + (ϕ( Z) C)T ϕ(X) ϕT (X)ϕ(X)

= (ϕ( Z) C)T (ϕ( Z) C) ϕ(X)) + (ϕ( Z) C ϕ(X))T ϕ(X)

ϕ( Z) C ϕ(X) 2, + ϕ( Z) C ϕ(X) 2, ϕ(X) 2,

ϕ( Z) C 2, + ϕ(X) 2,

ϕ( Z) C ϕ(X) 2, | {z } T.1 (101)

where ϕ(X) 2, = 1. An upper bound on ϕ( Z) C 2, can be obtained only if we derive the

upper bound on ϕ( Z) C ϕ(X) 2, . That is, if ϕ( Z) C ϕ(X) 2, γzc, it implies that

ϕ( Z) C 2, ϕ( Z) C ϕ(X) 2, + ϕ(X) 2, = γzc + 1 (102)

which consequently gives Kxx Kxx γzc(γzc + 2) (103)

Hence, we derive the upper bound on ϕ( Z) C ϕ(X) 2, for the remaining proof.

ϕ( Z) C ϕ(X) 2, = ϕ( Z) C ϕ(Z)C + ϕ(Z)C ϕ(X) 2,

ϕ( Z) C ϕ(Z)C 2, | {z } T.2

+ ϕ(Z)C ϕ(X) 2, (104)

where the second term ϕ(Z)C ϕ(X) 2, γ is the assumption. Next, we derive the upper bound on T.2. ϕ( Z) C ϕ(Z)C 2, = ϕ( Z) C ϕ( Z)C + ϕ( Z)C ϕ(Z)C 2,

= ϕ( Z)( C C) + (ϕ( Z) ϕ(Z))C 2,

ϕ( Z)( C C) 2, + (ϕ( Z) ϕ(Z))C 2,

d EC 2, + ϕ( Z) ϕ(Z) F C 2,

d + τC ϕ( Z) ϕ(Z) F | {z } T.3

where we used 1 σ2 C EC 2 2, ξ2 d = d + 2

dt + 2t with probability at least 1 ne t [Laurent and

Massart, 2000] since it is the fact that 1 σ2 C EC 2 2, χ2 d where the entry (EC)ij N(0, σ2 C). For T.3, we have ϕ( Z) ϕ(Z) 2

F = Tr((ϕ( Z) ϕ(Z))T (ϕ( Z) ϕ(Z)))

= Tr(ϕT ( Z)ϕ( Z) ϕT ( Z)ϕ(Z) ϕT (Z)ϕ( Z) + ϕT (Z)ϕ(Z))

= 2d 2 Tr(ϕT ( Z)ϕ(Z)) | {z } T.4

For T.4, we can obtain

Tr(ϕT ( Z)ϕ(Z)) =

j=1 ϕT ( zj)ϕ(zj) =

j=1 exp zj + (EZ):,j zj 2 2 2r2

j=1 exp (EZ):,j 2 2 2r2

d exp σ2 Zξ2 d 2r2

where we use 1 σ2 Z (EZ):,j 2 2 ξ2 d = d + 2

dt + 2t with probability at least 1 de t [Laurent and

Massart, 2000] since it is the fact that 1 σ2 Z (EZ):,j 2 2 χ2 d where the entry (EZ)ij N(0, σ2 Z).

Hence, we can go back to give an upper bound on ϕ( Z) C ϕ(X) 2, :

ϕ( Z) C ϕ(X) 2, γ + σCξd

2d 1 exp σ2 Zξ2 d 2r2

Finally, we have Kxx Kxx γzc(γzc + 2) (109)

2 1 exp σ2 Zξ2 d 2r2

J Proof for Theorem 3.9

Lemma J.1. Assume Υ = maxzj,x zj x 2 and x x 2 2τX, let {CS p }P p=1 be perturbed

by noise drawn from N(0, σ2) with the parameter σ 2cλ 1

r2ε for c2 > 2 ln(1.25/δ). Then, the Gaussian Mechanism that adds noise to {CS p }P p=1 is (ε, δ)-differentially private. Lemma J.2. Assume max(p,j){ xpj 2, x pj 2} τX, max(i,j) zi xj = Υ, Zs p sp τZ s, and CS 2, τC, let {Zs p}P p=1 for s = 1, , S be perturbed by noise drawn from N(0, σ2) with variance (8S 2(g Z) log(e + (ε/δ))/ε2) where (g Z) =

r2 n 1 + (τX + τZ) (τX+Υ)

r2 o . Then, the Gaussian Mechanism that adds noise to {Zs p}P p=1 for s = 1, , S is (ε, δ)-differentially private.

By Lemma J.2, the mechanism that adds Gaussian noise to Zs p for s = 1, , S with variance (8S 2(g Z) log(e + (εZ/δZ))/ε2 Z) satisfies (εZ, δZ)-differential privacy under S-fold adaptive composition for any εZ > 0 and δZ (0, 1]. By Lemma J.1, the Gaussian Mechanism that injects noise to CS p with parameter σ 2cλ 1

dτX(τX + Υ)/(r2εC) is (εC, δC)-differentially private. Therefore, by Theorem 3.16 of [Dwork et al., 2014], the proposed algorithm that adds Gaussian noise to Zs p for s = 1, , S and CS p is (εC + εZ, δC + δZ)-differentially private. This finished the proof.

K Proof for Lemma J.1

Proof. In our Fed SC, for each column of C, we have

c = g C(x) = GK(Z, x) = G

exp z1 x 2 2 2r2

... exp zd x 2 2 2r2

where G = (K(Z, Z) + λId) 1. We have g C(x) g C(x ) 2 G sp K(Z, x) K(Z, x ) 2, (112) where sp denotes the spectral norm of matrix. Since exp(z) is locally Lipschitz continuous when z < 0, we have exp zj x 2 2 2r2

exp zj x 2 2 2r2

1 2r2 zj x 2 zj x 2 2

= 1 2r2 x x 2 2 + 2 zj x , x x

1 2r2 x x 2 2 + 2 zj x 2 x x 2

Let Υ = maxzj,x zj x 2 and x x 2 2τX. Then the ℓ2-sensitivity of g is

2(g C) sup x,x G σ

1 2r2 ( x x 2 2 + 2 zj x 2 x x 2)

1 2r2 (4τ 2 X + 4ΥτX)

dτX(τX + Υ)

dτX(τX + Υ)

Then according to Theorem 3.22 in [Dwork et al., 2014], for c2 > 2 ln(1.25/δ) the Gaussian Mechanism with parameter σ 2cλ 1

r2ε is (ε, δ)-differentially private. This finished the proof.

L Proof for Lemma J.2

Proof. Proof Sketch The (ϵ, δ)-differential privacy of the proposed algorithm can be achieved by injecting noise into Z for each local update and into C at the final round. To prove this, we first compute the sensitivity of Z and C for determining the differential privacy of them. Then, we use the adaptive composition [Kairouz et al., 2015] to get the superposition of them which will give the final theoretical result.

Now, the formal proof is as follows.

In our Fed SC, consider one-step update of Z at client of p

Zs,t p = Zs,t 1 p ηt f Z (Zs,t 1 p ) (115)

where the derivative is given by f Z (Zs,t 1 p ) = 1

r2 (Xp WZ Z WZ) + 2

r2 (ZQZ Z QZ) (116)

and Xp = [xp1, , xpj 1, xpj, xpj+1, , xp Np].

For the simplicity of the proof, we omit the pair of iteration parameters (s, t) and instead denote two adjacent local updates by k and k 1 for nonnegative k 1. Thus, we have the equivalent version of a one-step update of Zp at client p

Zk p = Zk 1 p ηk

r2 (Xp WZ Zk 1 p WZ) + 2

r2 (Zk 1 p QZ Zk 1 p QZ) (117)

where Xp = [xp1, , xpj 1, xpj, xpj+1, , xp Np].

To compute the sensitivity of Zp, we give the counterpart of the above update as

(Zk p) = Zk 1 p ηk

r2 (X p W Z Zk 1 p W Z) + 2

r2 (Zk 1 p QZ Zk 1 p QZ) (118)

where X p = [xp1, , xpj 1, x pj, xpj+1, , xp Np].

Next, let s start to derive the upper bound on Zk p (Zk p) F term by term. Zk p (Zk p) F = ηk

r2 Xp WZ Zk 1 p WZ X p W Z Zk 1 p W Z F

r2 Xp WZ X p W Z Zk 1 p WZ W Z F

Xp WZ X p W Z F | {z } T.1

+ Zk 1 p WZ W Z F | {z } T.2

For T.1, we have

Xp WZ X p W Z F = Xp WZ X p WZ + X p WZ X p W Z F Xp X p WZ F | {z } T.3

+ X p (WZ W Z) F | {z } T.4

For T.3, we have

Xp X p WZ F = (xpj x pj)( c T j K(xpj, Zk 1 p )) F

c T j K(xpj, Zk 1 p ) 2

= xpj x pj 2

(cj cj)T (K(Zk 1 p , xpj) K(Zk 1 p , xpj))

cj cj 2 K(Zk 1 p , xpj) K(Zk 1 p , xpj) 2

= xpj x pj 2

cpj 4 K(Zk 1 p , xpj) 4

cpj 2 K(Zk 1 p , xpj) 2

xpj x pj 2 C 2, K(Zk 1 p , xpj) 2

Here, we use a b 2 = q

(a a)T (b b) for a, b Rd for the second equality; Cauchy-

Schwarz inequality for the second inequality; p

a a 2 = a 4 for a Rd for the third inequality.

Let Z,x = K(Z, x) K(Z, x ), then we have for T.4,

X p (WZ W Z) F = x pj n c T j K(xpj, Zk 1 p ) K(x pj, Zk 1 p ) o F

c T j K(xpj, Zk 1 p ) K(x pj, Zk 1 p ) 2

(cj cj)T Zk 1 p ,xpj Zk 1 p ,xpj

x pj 2 C 2, Zk 1 p ,xpj

Therefore, assume max{ xpj 2, x pj 2} τX, which means x x 2 2τX, we can get an upper bound on T.1.

Xp WZ X p W Z F Xp X p WZ F | {z } T.5

+ X p (WZ W Z) F | {z } T.6

xpj x pj 2 C 2, K(Zk 1 p , xpj) 2 + x pj 2 C 2, Zk 1 p ,xpj

= C 2, xpj x pj 2

K(Zk 1 p , xpj) 2 + x pj 2

Zk 1 p ,xpj

1 + τX(τX + Υ)

Here, we also used the fact that K(Zk 1 p , xpj) 2

d and the derived bound 2

r2 on Zk 1 p ,xpj

2 given by K, and C 2, τC given in Proof F.

Assume Zk p sp τZ k, we have for T.2 Zk 1 p WZ W Z F = Zk 1 p diag(1T n WZ) diag(1T n W Z) F = Zk 1 p diag(1T n(WZ W Z)) F Zk 1 p sp diag(1T n(WZ W Z)) F = Zk 1 p sp 1T n(WZ W Z) 2

c T j K(xpj, Zk 1 p ) K(x pj, Zk 1 p ) 2

Zk 1 p sp C 2, Zk 1 p ,xpj

dτZτCτX(τX + Υ)

Thus, we get the upper bounds on T.1 and T.2, respectively, and finally give an upper bound on Zk p (Zk p) F .

Zk p (Zk p) F ηk

Xp WZ X p W Z F | {z } T.1

+ Zk 1 p WZ W Z F | {z } T.2

1 + τX(τX + Υ)

dτZτCτX(τX + Υ)

1 + (τX + τZ) (τX + Υ)

Therefore, if we define Zp = g Z(Xp), the ℓ2-sensitivity of g Z is

2(g Z) = sup Xp,X p g Z(Xp) g Z(X p) 2

= sup Xp,X p

Zk p (Zk p) F

1 + (τX + τZ) (τX + Υ)

By Theorem 4.3 of [Kairouz et al., 2015], the mechanism that adds Gaussian noise to Zs p for s = 1, , S with variance (8S 2(g Z) log(e+(εZ/δZ))/ε2 Z) satisfies (εZ, δZ)-differential privacy under S-fold adaptive composition for any εZ > 0 and δZ (0, 1]. This finished the proof.