Homomorphic Matrix Completion

Xiao-Yang Liu¹, Zechu Li², Xiaodong Wang¹
¹Department of Electrical Engineering, Columbia University, New York
²Department of Computer Science, Columbia University, New York
{xl2427, zl2993, xw2008}@columbia.edu
Equal contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

In recommendation systems, global positioning, system identification, and mobile social networks, it is a fundamental routine that a server completes a low-rank matrix from an observed subset of its entries. However, sending data to a cloud server raises data privacy concerns due to eavesdropping attacks and the single-point-of-failure problem; e.g., the Netflix prize contest was canceled after a privacy lawsuit. In this paper, we propose a homomorphic matrix completion algorithm for privacy-preserving purposes. First, we formulate the homomorphic matrix completion problem, in which a server performs matrix completion on cyphertexts, and propose an encryption scheme that is fast and easy to implement. Second, we prove that the proposed scheme satisfies the homomorphism property: decrypting the matrix recovered on cyphertexts yields the target matrix (on plaintexts). Third, we prove that the proposed scheme satisfies an $(\epsilon, \delta)$-differential privacy property. With a similar level of privacy guarantee, we reduce the best-known error bound $O(\sqrt[10]{1/n_2})$ to EXACT recovery, at the price of more samples. Finally, on synthetic data and real-world data, we show that both homomorphic nuclear-norm minimization and alternating minimization algorithms achieve accurate recoveries on cyphertexts, verifying the homomorphism property.

1 Introduction

The recurring low-rank matrix completion problem [4, 6, 22, 30, 13, 29, 46] concerns completing a low-rank matrix from a randomly observed subset of entries. It has wide applications in recommendation systems (collaborative filtering) [1, 47, 26], computer vision [2, 16, 27], global positioning [48, 32], sensory data analysis in the Internet of Things [25, 34, 33], system identification, network data analysis [51, 11], mobile social networks [23, 38], etc. Existing works [6, 9] have demonstrated a remarkable fact: if an $n \times n$ matrix with rank $r \ll n$ satisfies a certain incoherence property, then with high probability it is possible to exactly recover the matrix from $O(nr\,\mathrm{poly}\log n) \ll n^2$ entries using polynomial-time algorithms. Intuitively, one needs roughly $(2nr - r^2)$ parameters [6] (this is the dimension of the tangent space to the manifold of rank-$r$ matrices) to fix an $n \times n$ matrix of rank $r$, and the sampling randomness introduces a $\log n$ factor due to a coupon collector's effect. The information-theoretic lower bound is $\Omega(nr \log n)$ [6], while the tightest known upper bound is $O(nr \log^2 n)$ [9], with another $\log n$ factor from the golfing scheme used by the recovery algorithms.

The low-rank matrix completion problem usually deals with large-scale matrices that involve extensive computations, and in mobile computing, smart devices usually outsource such a huge computation task to a cloud server. However, sending data to a server or publishing anonymized data raises privacy concerns [23, 44, 42]; e.g., the recommendation contest Netflix prize was canceled after a privacy lawsuit [35]. There are two major obstructive factors: anonymization in data publishing is still vulnerable, and storing sensitive data on a cloud server may encounter the single-point-of-failure (SPOF) problem, e.g., hackers.
Figure 1: Matrix completion on plaintexts vs. homomorphic matrix completion on cyphertexts.

Existing works [20, 19, 10] address the privacy concern in various ways; a popular approach is to add noise to the data [20], thereby trading off recovery accuracy against the level of privacy.

In cloud computing and distributed systems, the homomorphism property [14, 45] allows computations to be carried out on cyphertexts, generating an encrypted result which, when decrypted, matches the result of the same operations performed on the corresponding plaintexts. In this manner, homomorphic encryption securely chains together different services without sacrificing recovery accuracy, though possibly at the price of extensive computation. There are several partially homomorphic crypto-systems, as well as a number of fully homomorphic crypto-systems [14, 45]. In addition, the homomorphic property can be used to build many other secure systems, for example secure voting systems, collision-resistant hash functions, and private information retrieval schemes [43].

In this paper, we integrate the large-scale distributed matrix completion task with a homomorphic encryption-decryption scheme, which guarantees EXACT recovery and differential privacy at the price of more samples. First, we define the homomorphic matrix completion problem, which ensures data privacy by preserving a homomorphism property between plaintexts and cyphertexts. Specifically, we propose a homomorphic encryption-decryption scheme in which each node performs local encryption and decryption and uploads an encrypted incomplete vector to a server that carries out the matrix completion computation. Then, we theoretically prove that the proposed scheme satisfies the homomorphism and differential privacy properties, reducing the best-known error bound $O(\sqrt[10]{1/n_2})$ [20] to EXACT recovery. Finally, based on synthetic and real-world data, we show that the homomorphic nuclear-norm minimization and alternating minimization algorithms achieve accurate recoveries on both cyphertexts and plaintexts, verifying the homomorphism property.

2 Homomorphic Matrix Completion Problem

First, we formally define the homomorphic matrix completion problem. Then, we introduce a notion of privacy by adapting joint $(\epsilon, \delta)$-differential privacy into a subspace-aware variant.

2.1 Notations and Preliminaries

Notations: Let $e_i$ denote the $i$-th standard basis vector, $I_k$ the $k \times k$ identity matrix, and $\mathcal{I}$ the identity linear operator. For a matrix $X$, the $(i,j)$-th element is $X_{ij}$ or $X(i,j)$, the $j$-th column is $X_j$, and the transpose is $X^\top$. The concatenation of two matrices $A \in \mathbb{R}^{n_1 \times n_2}$ and $B \in \mathbb{R}^{n_1 \times n_3}$ with the same number of rows is denoted by $[A, B] \in \mathbb{R}^{n_1 \times (n_2 + n_3)}$. By "with high probability" (w.h.p.) we mean with probability at least $1 - c_1 n^{-c_2}$ for some positive constants $c_1, c_2$. Let $N(0, \sigma^2)$ denote a Gaussian distribution with mean $0$ and standard deviation $\sigma$. We use an overline to represent the encrypted version of a variable: variables before encryption are called plaintexts, e.g., $X$, while the encrypted variables are called cyphertexts, e.g., $\overline{X}$.

Let $P_\Omega(M) \in \mathbb{R}^{n_1 \times n_2}$ denote the observed entries of a data matrix $M \in \mathbb{R}^{n_1 \times n_2}$, where $\Omega \subseteq \{(1,1), (1,2), \ldots, (n_1, n_2)\}$ indexes the observed entries. We define a linear operator $P_\Omega : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2}$ to represent the partial observation model as follows:

$[P_\Omega(M)]_{ij} = \begin{cases} M_{ij}, & \text{if } (i,j) \in \Omega, \\ 0, & \text{otherwise}. \end{cases}$  (1)

We assume that the true data matrix $M$ is low-rank, i.e., $\mathrm{rank}(M) = r \ll \min(n_1, n_2)$.
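To make the sampling model concrete, the following is a minimal numpy sketch (ours, not the authors' code) of the partial observation operator $P_\Omega$ in (1) under the Bernoulli model $\mathrm{Ber}(p)$; the sizes and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 6, 5, 0.5
M = rng.standard_normal((n1, n2))

# Ber(p): each entry is observed independently with probability p = m / (n1 * n2).
mask = rng.random((n1, n2)) < p        # boolean indicator of Omega

def P_Omega(X, mask):
    """Partial observation operator (1): keep entries in Omega, zero out the rest."""
    return np.where(mask, X, 0.0)

# P_Omega is a projection: applying it twice changes nothing.
assert np.allclose(P_Omega(P_Omega(M, mask), mask), P_Omega(M, mask))
```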
The (compact) singular value decomposition (SVD) is $M = U\Sigma V^\top$, where $U \in \mathbb{R}^{n_1 \times r}$ holds the $r$ left singular vectors (a basis of the column subspace), $V \in \mathbb{R}^{n_2 \times r}$ holds the $r$ right singular vectors (a basis of the row subspace), and $\Sigma = \mathrm{diag}([\sigma_1, \sigma_2, \ldots, \sigma_r]) \in \mathbb{R}^{r \times r}$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. The 2-norm of a vector is $\|x\|_2$, while the Frobenius norm and nuclear norm of $M$ are $\|M\|_F = \sqrt{\sum_{i,j} |M_{ij}|^2}$ and $\|M\|_* = \sum_{i=1}^r \sigma_i$, respectively. The operator norms (spectral norms) of a matrix $M$ and a linear operator $\mathcal{L}$ are defined as follows:

$\|M\| \triangleq \sup_{x \in \mathbb{R}^{n_2},\, \|x\|_2 \le 1} \|Mx\|_2 = \sigma_1, \quad \text{and} \quad \|\mathcal{L}\| \triangleq \sup_{\|X\|_F \le 1} \|\mathcal{L}(X)\|_F.$  (2)

Definition 1. (Column subspace and null space [28]) Let $A \in \mathbb{R}^{n_1 \times n_2}$. The set $S(A) = \{b \in \mathbb{R}^{n_1} \mid b = Ax,\ x \in \mathbb{R}^{n_2}\}$ is the column space or range of $A$, and the set $\mathrm{Ker}(A) = \{x \in \mathbb{R}^{n_2} \mid Ax = 0\}$ is the kernel or (right) null space of $A$.

The null space (kernel space) of the operator $P_\Omega$ is $\mathrm{Ker}(P_\Omega) = \{Z \in \mathbb{R}^{n_1 \times n_2} \mid P_\Omega(Z) = 0\}$, which we denote by $\Omega^\perp$ since $\mathrm{Ker}(P_\Omega)$ corresponds to the set of matrices vanishing on $\Omega$.

Let $\Omega \sim \mathrm{Uni}(m)$ denote a set of $m$ entries sampled uniformly from all sets of $m$ entries, and $\Omega \sim \mathrm{Ber}(p)$ denote a set with $\mathbb{E}|\Omega| = m$ entries, where each entry is sampled independently according to a Bernoulli model with $p = m/(n_1 n_2)$. Let $P_U$ and $P_V$ denote the orthogonal projections onto the column and row spaces of $M$, respectively:

$P_U = UU^\top, \quad P_V = VV^\top.$  (3)

Define an orthogonal decomposition $\mathbb{R}^{n_1 \times n_2} = T \oplus T^\perp$, where $T$ is the linear space spanned by matrices with the same column space or row space as $M$, and $T^\perp$ is its orthogonal complement, consisting of matrices whose row space is orthogonal to the row space $V$ and whose column space is orthogonal to the column space $U$. $T$ can be expressed as follows:

$T = \{UA^\top + BV^\top \mid A \in \mathbb{R}^{n_2 \times r},\ B \in \mathbb{R}^{n_1 \times r}\}.$  (4)

The orthogonal projection $P_T$ onto $T$ and the orthogonal projection $P_{T^\perp}$ onto $T^\perp$ are as follows:

$P_T(X) = P_U X + X P_V - P_U X P_V, \quad P_{T^\perp}(X) = (\mathcal{I} - P_T)(X) = (I_{n_1} - P_U)\, X\, (I_{n_2} - P_V).$  (5)
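The projections in (3)-(5) are easy to check numerically. The sketch below (our illustration, with arbitrary small dimensions) verifies that $P_T$ and $P_{T^\perp}$ give an orthogonal decomposition and that $P_T$ fixes every element $UA^\top + BV^\top$ of $T$ in (4).

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 8, 6, 2

# A rank-r matrix and its compact SVD.
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, V = U[:, :r], Vt[:r, :].T

P_U, P_V = U @ U.T, V @ V.T              # projections in (3)

def P_T(X):
    return P_U @ X + X @ P_V - P_U @ X @ P_V            # eq. (5)

def P_T_perp(X):
    return (np.eye(n1) - P_U) @ X @ (np.eye(n2) - P_V)  # eq. (5)

X = rng.standard_normal((n1, n2))
assert np.allclose(P_T(X) + P_T_perp(X), X)             # T and T_perp decompose R^{n1 x n2}
A, B = rng.standard_normal((n2, r)), rng.standard_normal((n1, r))
Z = U @ A.T + B @ V.T                                   # an element of T, eq. (4)
assert np.allclose(P_T(Z), Z)                           # P_T fixes T
```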
2.2 Problem Formulation for Homomorphic Matrix Completion

We are interested in completing large-scale matrices and want to outsource this compute-intensive task from mobile devices to a cloud server. Here we aim to prevent the matrix entries from leaking, which is the key concern for recommendation systems, as in Netflix's privacy lawsuit [35].

Distributed matrix completion problem on plaintexts. Assume that there are $n_2$ nodes with limited computing power and a cloud server with superior computing power. The $j$-th node has an attribute vector $M_j \in \mathbb{R}^{n_1}$, $j = 1, \ldots, n_2$; however, it is incomplete, and the observed entries are indexed by a set $\Omega_j \subseteq \{(1,j), (2,j), \ldots, (n_1,j)\}$. We assume that the true values of these $n_2$ vectors form a low-rank matrix $M \in \mathbb{R}^{n_1 \times n_2}$ with rank $r \ll \min(n_1, n_2)$, that the 2-norms of the attribute vectors are bounded by $L$, i.e., $\max_{1 \le j \le n_2} \|M_j\|_2 \le L$, and that the observation set is $\Omega = \bigcup_{j=1,\ldots,n_2} \Omega_j$. Nodes upload their incomplete vectors to a cloud server to carry out a matrix completion task:

Find a matrix $X \in \mathbb{R}^{n_1 \times n_2}$, s.t. $P_\Omega(X) = P_\Omega(M)$, $\mathrm{rank}(X) \le r$,  (6)

where $\Omega \sim \mathrm{Uni}(m)$ and $r$ may be unknown. Without loss of generality, we assume that $n_1 \ge n_2$ from now on. Note that our formulation also includes the case [37] where a matrix is partitioned into blocks and then completed in parallel.

Homomorphic matrix completion problem on cyphertexts. In cloud computing, the homomorphism property allows computations to be carried out on cyphertexts, generating an encrypted result which, when decrypted, matches the result of the operations performed on the plaintexts. Following this paradigm, we define a homomorphic matrix completion problem that ensures data privacy. As shown in Fig. 1, this novel framework consists of three main steps: 1) each node locally encrypts its vector as $P_{\Omega_j}(\overline{M}_j) = P_{\Omega_j}(g(M_j))$ with its private keys, $j = 1, \ldots, n_2$, and uploads $P_{\Omega_j}(\overline{M}_j)$ to a cloud server, which later forms an incomplete matrix $P_\Omega(\overline{M}) \in \mathbb{R}^{n_1 \times n_2}$; 2) given $P_\Omega(\overline{M})$, the cloud server solves the following matrix completion problem and sends the recovered vector $\widehat{\overline{M}}_j$ back to the $j$-th node, $j = 1, \ldots, n_2$:

Find a matrix $X \in \mathbb{R}^{n_1 \times n_2}$, s.t. $P_\Omega(X) = P_\Omega(\overline{M})$, $\mathrm{rank}(X) \le \overline{r}$,  (7)

where $\overline{r} = \mathrm{rank}(\overline{M})$ may be slightly larger than $r$ due to the encryption scheme $g(\cdot)$; 3) each node locally decrypts its own vector using its private keys, i.e., $\widehat{M}_j = g^{-1}(\widehat{\overline{M}}_j)$, $j = 1, \ldots, n_2$.

2.3 Notions of Privacy

We introduce a new variant of differential privacy for low-rank matrices.

2.3.1 Differential Privacy

Let $D = \{d_1, \ldots, d_n\}$ be a dataset of $n$ entries and $\mathcal{T}$ a fixed domain, where each entry $d_j \in \mathcal{T}$ encodes potentially sensitive information about node $j$. Let $\mathcal{A} : \mathcal{T}^n \to \mathcal{O}^n$ be an algorithm that operates on dataset $D$ and produces $n$ outputs, one for each node $j$, from a set of possible outputs $\mathcal{O}$. Let $D_{-j}$ denote the dataset $D$ without the entry of the $j$-th node, and similarly let $\mathcal{A}_{-j}(D)$ denote the set of outputs without the output for the $j$-th node. Let $(d_j; D_{-j})$ denote the dataset obtained by adding a data entry $d_j$ to the dataset $D_{-j}$. The $(\epsilon, \delta)$-differential privacy and joint $(\epsilon, \delta)$-differential privacy [21] are given in the following.

Definition 2. ($(\epsilon, \delta)$-differential privacy [12]) An algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-differential privacy if for any node $j$, any two possible values of data entry $d_j, d_j' \in \mathcal{T}$ for node $j$, any tuple of data entries for all other nodes $D_{-j} \in \mathcal{T}^{n-1}$, and any output set $O \subseteq \mathcal{O}^n$, we have

$\mathbb{P}_{\mathcal{A}}[\mathcal{A}(d_j; D_{-j}) \in O] \le e^\epsilon\, \mathbb{P}_{\mathcal{A}}[\mathcal{A}(d_j'; D_{-j}) \in O] + \delta.$  (8)

Definition 3. (Joint $(\epsilon, \delta)$-differential privacy [21]) An algorithm $\mathcal{A}$ satisfies joint $(\epsilon, \delta)$-differential privacy if for any node $j$, any two possible values of data entry $d_j, d_j' \in \mathcal{T}$ for node $j$, any tuple of data entries for all other nodes $D_{-j} \in \mathcal{T}^{n-1}$, and any output set $O \subseteq \mathcal{O}^{n-1}$, we have

$\mathbb{P}_{\mathcal{A}}[\mathcal{A}_{-j}(d_j; D_{-j}) \in O] \le e^\epsilon\, \mathbb{P}_{\mathcal{A}}[\mathcal{A}_{-j}(d_j'; D_{-j}) \in O] + \delta.$  (9)

Intuitively, an algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-differential privacy if for any node $j$ and dataset $D$, $\mathcal{A}(D)$ and $D_{-j}$ do not reveal "much" information about $d_j$. For low-rank matrices, [20] used the relaxed notion of joint $(\epsilon, \delta)$-differential privacy: an algorithm $\mathcal{A}$ satisfies joint $(\epsilon, \delta)$-differential privacy if for any node $j$ and dataset $D$, $\mathcal{A}_{-j}(D)$ (the outputs for the other $n-1$ nodes) and $D_{-j}$ (the data entries of the other $n-1$ nodes) do not reveal "much" information about $d_j$. Relaxing $(\epsilon, \delta)$-differential privacy to joint $(\epsilon, \delta)$-differential privacy is reasonable for the matrix completion problem, since the recovered $j$-th column for the $j$-th node can reveal a lot of information about $d_j$.

2.3.2 Differential Privacy for Low-rank Matrix Completion

We would like to point out that the joint $(\epsilon, \delta)$-differential privacy in Def. 3 can be further refined. For a low-rank matrix $M$, its column subspace $S(M)$ is global information, which is shared across all $n_2$ nodes and can be easily inferred from $\mathcal{A}_{-j}(D)$ and $D_{-j}$. Note that the differential privacy notion aims to protect individual information rather than global information. We adapt it to low-rank matrices by excluding the shared column subspace: low-rank matrices have linearly dependent columns, and this dependency is reflected in the fact that they share a common column subspace.
Formally, a rank-$r$ matrix $M = U\Sigma V^\top$ can be expressed as $M = UC$, where $U \in \mathbb{R}^{n_1 \times r}$ and $C = \Sigma V^\top \in \mathbb{R}^{r \times n_2}$; alternatively, each column can be expressed as $M_j = UC_j$, for $j = 1, \ldots, n_2$, where $C_j$ is the coefficient vector (individual information) of the $j$-th node in the column subspace with basis $U$ (global information). The following subspace-aware joint $(\epsilon, \delta)$-differential privacy considers the coefficient vectors $C_j$ for $j = 1, \ldots, n_2$, i.e., $D$ in Def. 3 corresponds to the coefficient matrix $C \in \mathbb{R}^{r \times n_2}$.

Definition 4. (Subspace-aware joint $(\epsilon, \delta)$-differential privacy) Assume the $n_2$ nodes' data vectors form a rank-$r$ matrix $M \in \mathbb{R}^{n_1 \times n_2}$ with $M = U\Sigma V^\top = UC$, where $U \in \mathbb{R}^{n_1 \times r}$ and $C = \Sigma V^\top \in \mathbb{R}^{r \times n_2}$. A matrix completion algorithm $\mathcal{A}$ satisfies subspace-aware joint $(\epsilon, \delta)$-differential privacy if for any node $j$, any two possible coefficient vectors $C_j, C_j' \in \mathbb{R}^r$ for node $j$, any tuple of coefficient vectors for all other nodes $C_{-j} \in \mathbb{R}^{r \times (n_2 - 1)}$, and any output set $O \subseteq \mathbb{R}^{r \times (n_2 - 1)}$ consisting of estimated coefficient vectors in a column subspace with basis $U$, we have

$\mathbb{P}_{\mathcal{A}}[\mathcal{A}_{-j}(C_j; C_{-j} \mid U) \in O] \le e^\epsilon\, \mathbb{P}_{\mathcal{A}}[\mathcal{A}_{-j}(C_j'; C_{-j} \mid U) \in O] + \delta.$  (10)

3 Novel Homomorphic Framework for Matrix Completion

We propose a homomorphic encryption-decryption scheme: each node performs local encryption and decryption and uploads an encrypted vector to a server that performs the matrix completion computation.

3.1 Our Idea: Hiding a Low-rank Data Matrix in a Larger Subspace

To preserve the privacy of a low-rank data matrix $M \in \mathbb{R}^{n_1 \times n_2}$ with rank $r$, our idea is to hide $M$ (which lies in an $r$-dimensional subspace) in a larger subspace of dimension $\overline{r}$, such that $r \le \overline{r}$ and $r, \overline{r} \ll n_1$. A sound approach is to enlarge the original subspace of the data matrix (i.e., the plaintext) as follows: a cloud server generates a random matrix $K \in \mathbb{R}^{n_1 \times k}$ as public keys, $k \ll n_1$, and broadcasts $K$ to all $n_2$ nodes; then, node $j$ generates $k$ random numbers as private keys $R_j \in \mathbb{R}^k$ and encrypts its vector $M_j \in \mathbb{R}^{n_1}$ as follows (a version with missing entries will be given in (12)):

$\overline{M}_j = M_j + KR_j$, $j = 1, \ldots, n_2$; equivalently, $\overline{M} = M + KR$.  (11)

In the encryption scheme (11), $M$ is added to $KR$, resulting in a matrix $\overline{M}$ with rank $\overline{r} \le r + k$. Since $\overline{r} \ll n_1$, $\overline{M}$ is also low-rank, and it is possible to recover $\overline{M}$ from a subset of its entries.

Figure 2: Our encryption method. Plaintext and cyphertext have the same set of missing entries.
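A minimal numerical sketch of the enlargement (11), with sizes of our own choosing: the cyphertext $\overline{M} = M + KR$ has rank at most $r + k$ and thus remains low-rank, and subtracting $KR$ recovers the plaintext exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r, k, sigma = 200, 150, 5, 10, 1.0

M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r plaintext
K = rng.standard_normal((n1, k))                 # public keys, generated by the server
R = sigma * rng.standard_normal((k, n2))         # private keys, one column R_j per node

M_bar = M + K @ R                                # encryption, eq. (11)

assert np.linalg.matrix_rank(M_bar) <= r + k     # cyphertext is still low-rank
assert np.allclose(M_bar - K @ R, M)             # decryption is exact
```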
3.2 Proposed Homomorphic Encryption-Decryption Scheme

We propose a homomorphic encryption-decryption scheme that consists of the following steps; the pseudocodes are summarized in Alg. 1 and Alg. 2.

Algorithm 1 Homomorphic matrix completion at the cloud server
Input: parameters $n_1, n_2, \overline{r}, k$.
Output: matrix $K \in \mathbb{R}^{n_1 \times k}$ as public keys; the recovered matrix $\widehat{\overline{X}} \in \mathbb{R}^{n_1 \times n_2}$ (cyphertexts).
1: Generate a random matrix $K \in \mathbb{R}^{n_1 \times k}$ and broadcast $K$ to all $n_2$ nodes;
2: until received all $n_2$ encrypted vectors $P_{\Omega_j}(\overline{M}_j)$ (line 4 in Alg. 2) do
3:   Carry out the matrix completion task in (7) and obtain $\widehat{\overline{X}} \in \mathbb{R}^{n_1 \times n_2}$;
4:   Send the recovered vector $\widehat{\overline{X}}_j \in \mathbb{R}^{n_1}$ back to the $j$-th node, $j = 1, \ldots, n_2$.
5: end

Algorithm 2 Homomorphic matrix completion at node $j$, for $j = 1, \ldots, n_2$
Input: an incomplete vector $P_{\Omega_j}(M_j)$, observation set $\Omega_j$, and parameters $n_1, r, k$.
Output: a recovered vector $\widehat{X}_j$ (plaintexts).
1: until received $K \in \mathbb{R}^{n_1 \times k}$ from the server (line 1 in Alg. 1) do
2:   Generate $k$ random numbers $R_j \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2 I_k)$;
3:   Perform local encryption as $P_{\Omega_j}(\overline{M}_j) = P_{\Omega_j}(M_j) + P_{\Omega_j}(KR_j)$;
4:   Upload $P_{\Omega_j}(\overline{M}_j)$ to the cloud server;
5: end
6: until received the recovered vector $\widehat{\overline{X}}_j$ from the cloud server (line 4 in Alg. 1) do
7:   Using $R_j$ and $K$, decrypt $\widehat{\overline{X}}_j$ to obtain $\widehat{X}_j$, i.e., $\widehat{X}_j = \widehat{\overline{X}}_j - KR_j$.
8: end

First, in line 1 of Alg. 1, the cloud server generates a random matrix $K \in \mathbb{R}^{n_1 \times k}$ as public keys, then broadcasts $K$ to all $n_2$ nodes.

Second, in lines 1-5 of Alg. 2, after receiving $K \in \mathbb{R}^{n_1 \times k}$ from the server (line 1 in Alg. 1), the $j$-th node locally carries out an encryption with $k$ private keys (i.e., $R_j \in \mathbb{R}^k$). As shown in Fig. 2, the $j$-th node locally encrypts its incomplete vector $P_{\Omega_j}(M_j)$ as follows:

$P_{\Omega_j}(\overline{M}_j) = P_{\Omega_j}(M_j) + P_{\Omega_j}(KR_j)$, $j = 1, \ldots, n_2$,  (12)

where $R_j \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2 I_k)$, and $P_{\Omega_j}(KR_j)$ keeps the entries in $\Omega_j$ and sets the entries in the complement of $\Omega_j$ to zero; thus $P_{\Omega_j}(\overline{M}_j)$ has the same set of missing entries as $P_{\Omega_j}(M_j)$. Note that $R_j$ is stored locally: these are private keys that will NOT be shared with any other node. Then, each node uploads its encrypted vector $P_{\Omega_j}(\overline{M}_j)$ to the cloud server.

Third, in lines 2-5 of Alg. 1, after receiving all $n_2$ encrypted vectors $P_{\Omega_j}(\overline{M}_j)$, $j = 1, \ldots, n_2$, the server forms an incomplete matrix $P_\Omega(\overline{M})$ with $\Omega = \bigcup_{j=1,\ldots,n_2} \Omega_j$. Then, the server carries out the matrix completion task in (7) and sends the recovered vector $\widehat{\overline{X}}_j$ back to the $j$-th node, $j = 1, \ldots, n_2$.

Finally, in lines 6-8 of Alg. 2, using the locally stored private keys $R_j$ and the public keys $K$, the $j$-th node decrypts its own vector, i.e., $\widehat{X}_j = g^{-1}(\widehat{\overline{X}}_j) = \widehat{\overline{X}}_j - KR_j$, $j = 1, \ldots, n_2$.

4 Homomorphism Property Holds at a Price of More Samples

We prove that the homomorphism property holds for the proposed scheme, which guarantees exact recovery on cyphertexts at the cost of more samples. The detailed proofs are given in Appx. C.

Overview: Starting from a necessary and sufficient condition in Lemma 1, we relax it to a sufficient condition in Lemma 3 for the homomorphism property to hold. Then, we provide a homomorphic version of the Rudelson selection estimation theorem in Theorem 2 that guarantees Lemma 3 with high probability. Therefore, we obtain a sample complexity for EXACT recovery in Theorem 3, where our interesting finding is that the homomorphism property holds at a price of more samples.

4.1 Sufficient Condition for Low-rank Matrix Completion

We start from a necessary and sufficient condition for low-rank matrix completion. Note that a similar necessary and sufficient condition for sparse vector recovery is discussed in compressive sensing [3, 8, 46]; here, we apply a similar argument to obtain Lemma 1. We define the set of matrices with rank at most $r$ and a rank-descent cone as follows:

$\mathcal{M} = \{X \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(X) \le r\}$, $\mathcal{D}_{\mathcal{M}}(M) = \{t(X - M) \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(X) \le r,\ t \ge 0\}$,  (13)

where $\mathcal{M}$ is the closure of the manifold of rank-$r$ matrices. Accordingly, for $\overline{M}$, we have

$\overline{\mathcal{M}} = \{X \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(X) \le \overline{r}\}$, $\mathcal{D}_{\overline{\mathcal{M}}}(\overline{M}) = \{t(X - \overline{M}) \in \mathbb{R}^{n_1 \times n_2} : \mathrm{rank}(X) \le \overline{r},\ t \ge 0\}$.  (14)

Lemma 1. (Necessary and sufficient condition for low-rank matrix completion) $M$ is the unique optimal solution to (6) if and only if $\Omega^\perp \cap \mathcal{D}_{\mathcal{M}}(M) = \{0\}$, where $\Omega^\perp$ denotes $\mathrm{Ker}(P_\Omega)$.

Geometric interpretation: $M$ is the unique optimal solution to problem (6) if and only if, starting from $M$, the rank of $M + D$ increases for every nonzero direction $D \in \Omega^\perp$. Therefore, the homomorphism property of low-rank matrix completion in problem (7) holds if

$\Omega^\perp \cap \mathcal{D}_{\mathcal{M}}(M) = \{0\} = \Omega^\perp \cap \mathcal{D}_{\overline{\mathcal{M}}}(\overline{M}).$  (15)
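The geometric interpretation can be probed numerically: for a generic nonzero direction $D \in \Omega^\perp$ (i.e., supported off $\Omega$), the rank of $M + D$ exceeds $r$. A small sketch of ours (generic random instances, an illustration rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, r, p = 30, 20, 3, 0.6

M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r matrix
mask = rng.random((n1, n2)) < p                                  # Omega

# A generic direction in Omega_perp = Ker(P_Omega): nonzero only off Omega.
D = np.where(mask, 0.0, rng.standard_normal((n1, n2)))

print(np.linalg.matrix_rank(M))      # r
print(np.linalg.matrix_rank(M + D))  # strictly larger than r for generic D
```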
Lemma 2. ([15, 7], Theorem 6.1) Let $M = U\Sigma V^\top$ be the compact SVD of matrix $M$. The tangent cone $\mathcal{T}_{\mathcal{M}}(M)$ of the set $\mathcal{M}$ at $M$ is a linear subspace given by

$\mathcal{T}_{\mathcal{M}}(M) = \{UA^\top + BV^\top \mid A \in \mathbb{R}^{n_2 \times r},\ B \in \mathbb{R}^{n_1 \times r}\} \triangleq T$,  (16)

and its complementary space is denoted by $T^\perp$.

Since the rank-descent cone is a subset of the tangent cone defined in (16) ([17], Theorem 4.8), i.e., $\mathcal{D}_{\mathcal{M}}(M) \subseteq T$ and $\mathcal{D}_{\overline{\mathcal{M}}}(\overline{M}) \subseteq \overline{T}$, we relax (15) to the following sufficient condition.

Lemma 3. A sufficient condition for the homomorphic property of matrix completion under the proposed scheme in Alg. 1 and Alg. 2 is $\Omega^\perp \cap \overline{T} = \{0\}$.

Interpretation: if $\Omega^\perp \cap \overline{T} = \{0\}$ holds, then we know that $\overline{M} = M + KR$ is the unique optimal solution to problem (7) and $M$ is the unique optimal solution to problem (6). Since $\overline{M} = M + KR$ is a one-to-one mapping, the decryption $\overline{M} - KR$ returns the desired true matrix $M$.

4.2 Homomorphic Version of the Rudelson Selection Estimation Theorem

The Rudelson selection estimation theorem [39] investigates the number of random points needed to bring a convex body into a nearly isotropic position. Such an approximate isometry property is fundamentally useful for characterizing the number of entries needed to complete a low-rank matrix.

Definition 5. (Coherence) Let $U \in \mathbb{R}^{n \times r}$ be the $r$ left singular vectors of $M$ (corresponding to the column subspace $S(M)$) and $P_U$ be the orthogonal projection onto $U$. Then the coherence of $U$ (or $S(M)$, respectively) is defined as

$\mu(S(M)) = \mu(U) \triangleq \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|_2^2 = \frac{n}{r} \max_{1 \le i \le n} \|U^\top e_i\|_2^2$,  (17)

since $(U^\top U)^{-1} = I$ and $U$ is orthonormal.

The concept of "coherence" measures the relationship between a low-dimensional space and the observation operator $P_\Omega$, namely the cosine (with a scaling factor $n/r$) of the principal angle between the low-dimensional space and a standard basis. $M$ is said to satisfy the standard incoherence condition with parameter $\mu_0$ if

$\mu(U) \le \mu_0$, and $\mu(V) \le \mu_0$.  (18)

A small $\mu_0$ ensures that the information in the row/column spaces of $M$ is not too concentrated on a small number of rows/columns. It characterizes the contribution of an entry to recovering $M$: a small $\mu_0$ means that each entry provides approximately the same amount of information.

Theorem 1. (Rudelson selection estimation theorem [3]) Assume that $\Omega \sim \mathrm{Ber}(p)$ with $p = \Theta(\frac{m}{n_1 n_2})$, and that $M$ obeys the standard incoherence condition (18) with parameter $\mu_0$. There is a constant $C_R$ such that for $\beta > 1$, with probability $1 - 3n_2^{-\beta}$, we have

$\|p^{-1} P_T P_\Omega P_T - P_T\| \le C_R \sqrt{\frac{\mu_0 n_2 r (\beta \log n_2)}{m}} \triangleq \epsilon < 1.$  (19)

We derive the following homomorphic variant of the Rudelson selection estimation theorem [39] and will use it to guarantee Lemma 3. Our new contribution here is to derive the conditions under which the approximate isometry property holds simultaneously for both cyphertexts and plaintexts.

Theorem 2. (Homomorphic version of the Rudelson selection estimation theorem) Assume that $\Omega \sim \mathrm{Ber}(p)$ with $p = \Theta(\frac{m}{n_1 n_2})$, and that $M$ and $\overline{M}$ satisfy the standard incoherence condition (18) with parameters $\mu_0$ and $\overline{\mu}_0$, respectively. Under the proposed scheme in Alg. 1 and Alg. 2, there are constants $C_R, C_R'$ such that for $\beta > 1$, with probability $1 - 3n_2^{-\beta}$, we have

(cyphertext) $\|p^{-1} P_{\overline{T}} P_\Omega P_{\overline{T}} - P_{\overline{T}}\| \le C_R' \sqrt{\frac{n_2 \overline{\mu}_0 \overline{r} (\beta \log n_2)}{m}} \triangleq \overline{\epsilon} < 1$,  (20)

which implies

(plaintext) $\|p^{-1} P_T P_\Omega P_T - P_T\| \le C_R \sqrt{\frac{n_2 \mu_0 r (\beta \log n_2)}{m}} \triangleq \epsilon < 1.$

Note that $\|p^{-1} P_{\overline{T}} P_\Omega P_{\overline{T}} - P_{\overline{T}}\| < 1$ implies that the sufficient condition $\Omega^\perp \cap \overline{T} = \{0\}$ holds.
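The coherence in Definition 5 is straightforward to evaluate. The sketch below (ours, with illustrative sizes) computes $\mu_0$ for a random low-rank plaintext and for its cyphertext, illustrating the coherence change that Lemma 4 bounds.

```python
import numpy as np

def coherence(A):
    """mu(U) of Definition 5 for the column space of A: (n/r) max_i ||U^T e_i||_2^2."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = int((s > 1e-10 * s[0]).sum())                  # numerical rank
    U = U[:, :r]
    return A.shape[0] / r * (np.linalg.norm(U, axis=1) ** 2).max()

rng = np.random.default_rng(4)
n1, n2, r, k = 400, 300, 5, 10
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
K = rng.standard_normal((n1, k))
R = rng.standard_normal((k, n2))

print("mu_0 (plaintext) :", coherence(M))
# Lemma 4 bounds the cyphertext's mu_bar * r_bar by r * mu_0 + C * max(k, log n2).
print("mu_0 (cyphertext):", coherence(M + K @ R))
```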
4.3 Sample Complexity for EXACT Recovery

We now prove Theorem 3: the homomorphism property holds for the proposed scheme in Alg. 1 and Alg. 2, provided that there is a sufficient number of observations ($|\Omega|$ is large enough).

Theorem 3. For Alg. 1 and Alg. 2, with probability $1 - 3n_2^{-\beta}$, the homomorphism property holds if $p \ge \frac{C_0 \overline{\mu}_0 \overline{r} (\beta \log n_2)}{n_1}$, where $C_0$ is a positive constant.

Next, we characterize the change of coherence from $\mu_0$ to $\overline{\mu}_0$ and provide the sample complexity for EXACT recovery in Alg. 1 and Alg. 2.

Lemma 4. The new coherence under the proposed scheme in Alg. 1 and Alg. 2 satisfies

$\overline{\mu}_0 \overline{r} \le r\mu_0 + C \max(k, \log n_2)$, with probability $1 - c\, n_2^{-3} \log n_2$.  (21)

Combining Theorem 3 and Lemma 4, we characterize the required number of entries.

Corollary 1. For Alg. 1 and Alg. 2, with probability $1 - 6 n_2^{-\beta} \log n_2$, the homomorphism property holds if $p \ge \frac{C_0 (r\mu_0 + C \max(k, \log n_2))(\beta \log n_2)}{n_1}$, where $c, C_0, C$ are positive constants.

5 Differential Privacy Property Holds

In this section, we show that differential privacy holds for the proposed scheme in Alg. 1 and Alg. 2. It is well known that one can achieve $(\epsilon, \delta)$-differential privacy by adding Gaussian noise.

Definition 6. (Privacy loss as a random variable [12]) Consider a mechanism $\mathcal{A}$ on a pair of databases $D, D'$. For an outcome $o \in \mathcal{O}$, the privacy loss on $o$ is defined as the logarithmic ratio between the probability of observing $o$ on input $D$ and that on input $D'$:

$\mathcal{L}_{\mathcal{A}(D) \| \mathcal{A}(D')} = \ln \frac{\mathbb{P}(\mathcal{A}(D) = o)}{\mathbb{P}(\mathcal{A}(D') = o)}$,  (22)

where $\mathbb{P}(\mathcal{A}(D) = o)$ is a probability density over a continuous set $\mathcal{O}$.

Two potential issues of the proposed scheme in Alg. 1 and Alg. 2 are projection recovery and the possibly unknown rank value $\overline{r}$: in a single-round encryption, one could apply a corresponding projection to obtain the real data. Therefore, we execute the proposed scheme twice and introduce two parameters $\sigma_1$ and $\sigma_2$:

First-round encryption: the server randomly generates a matrix $K_1 \in \mathbb{R}^{n_1 \times k}$ and each node generates $k$ random numbers $R_1 \overset{\text{i.i.d.}}{\sim} N(0, \sigma_1^2 I_k)$. Then, we have $P_\Omega(\overline{M}) = P_\Omega(M) + P_\Omega(K_1 R_1)$.

Second-round encryption: the server obtains the column space of $\overline{M}$ as $K_2 \in \mathbb{R}^{n_1 \times \overline{r}}$ with $\overline{r} = r + k$, and then each node generates $r + k$ random numbers $R_2 \overset{\text{i.i.d.}}{\sim} N(0, \sigma_2^2 I_{r+k})$. Then, we have $P_\Omega(\overline{\overline{M}}) = P_\Omega(\overline{M}) + P_\Omega(K_2 R_2)$.

Theorem 4 states that the proposed scheme satisfies the subspace-aware joint $(\epsilon, \delta)$-differential privacy of Section 2.3.2. The detailed proofs are given in Appx. D, where the key is to quantify the $\sigma$ under which the privacy-loss random variable in (22) is bounded by $\epsilon$ with probability at least $1 - \delta$.

Theorem 4. Let $\epsilon \in (0, 1)$ and $c^2 > 2\ln(1.25/\delta)$. Assume the true data matrix $M \in \mathbb{R}^{n_1 \times n_2}$ has rank $r$ and each column has bounded 2-norm, i.e., $\max_{1 \le j \le n_2} \|M_j\|_2 \le L$. Let $R_1 \sim N(0, \sigma_1^2 I_k)$ with $\sigma_1 \ge 2cL\sqrt{2\ln(2/\delta)}/\epsilon$ and $R_2 \sim N(0, \sigma_2^2 I_{k+r})$ with $\sigma_2 \ge 2cL\sqrt{2\ln(2/\delta)}/\epsilon$. Then the encryption and decryption scheme in Alg. 1 and Alg. 2 satisfies the subspace-aware joint $(\epsilon, \delta)$-differential privacy property.

A substantial improvement is that, for the same level of privacy (the same parameters $\epsilon, \delta$ in the above joint $(\epsilon, \delta)$-DP property), our algorithms are able to achieve EXACT recovery. Note that by proving the homomorphism property and characterizing the sample complexity, we reduce the error bound $O(\sqrt[10]{1/n_2})$ of [20] to ZERO, since we have EXACT recovery.
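As a companion to Theorem 4, here is a sketch (ours) of the Gaussian noise calibration. The bound $\sigma \ge 2cL\sqrt{2\ln(2/\delta)}/\epsilon$ with $c^2 > 2\ln(1.25/\delta)$ is taken from the theorem statement; the concrete choice $c = 1.01\sqrt{2\ln(1.25/\delta)}$ is an assumption of ours for illustration (any $c$ satisfying the constraint works).

```python
import numpy as np

def calibrate_sigma(eps, delta, L):
    """Smallest sigma allowed by Theorem 4 when column 2-norms are bounded by L."""
    assert 0 < eps < 1
    c = 1.01 * np.sqrt(2 * np.log(1.25 / delta))   # any c with c^2 > 2 ln(1.25/delta)
    return 2 * c * L * np.sqrt(2 * np.log(2 / delta)) / eps

rng = np.random.default_rng(5)
eps, delta, L, k = 0.5, 1e-6, 1.0, 10
sigma1 = calibrate_sigma(eps, delta, L)
R1 = sigma1 * rng.standard_normal(k)   # one node's private keys, R_1 ~ N(0, sigma1^2 I_k)
print(f"sigma1 = {sigma1:.2f}")
```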
6 Performance Evaluation

We evaluate the proposed scheme on synthetic data and real-world datasets using two matrix completion algorithms [41, 18], verifying the homomorphism property.

6.1 Experimental Settings

Datasets. We experiment with synthetic data and real-world datasets. The synthetic data is generated randomly according to the low-rank $1{,}000 \times 1{,}000$ matrix model and serves as a well-controlled input for verification. The real-world datasets include two benchmark datasets for recommendation systems, namely the MovieLens10M (Top 400)³ and Netflix (Top 400) datasets. The MovieLens dataset contains ratings of the 400 most-rated movies made by approximately 7,000 users, and the Netflix dataset contains ratings of the 400 most-rated movies made by approximately 480 thousand users.

³https://movielens.org/

Matrix completion algorithms. For the matrix completion on the server, we use the nuclear-norm minimization (NN) and alternating minimization (AM) algorithms. In Section 6.2, we compare both algorithms with their homomorphic versions. In Section 6.3, on the real-world datasets, we also include the private Frank-Wolfe (FW) algorithm [20] for comparison.

Performance metric. We measure the recovery error via the relative square root error $\mathrm{RSE} = \|\widehat{M} - M\|_F / \|M\|_F$. All experiments are executed ten times and we report the average results. (An end-to-end sketch of this evaluation pipeline is given at the end of this section.)

6.2 Results on Synthetic Data

We experiment with randomly generated low-rank matrices on the NN and AM algorithms and their homomorphic versions, HNN and HAM. We vary the rank $r$ of the generated matrix and the percentage of observed entries from 1%, 5%, and so on up to 95%.

Figure 3: Comparing NN and AM algorithms with their homomorphic versions. The figure plots the success rates within 10 trials, where white and black cells mean "success" and "fail", respectively. A trial is a success if $\mathrm{RSE} \le 10^{-5}$. We set $k = 10$ in Alg. 1 and Alg. 2.

As shown in Fig. 3, we observe two trends: 1) for a given rank $r$, the success rate increases as the percentage of observed entries increases; and 2) for a given percentage of observed entries, the success rate decreases as the rank $r$ increases. On the other hand, we find that HNN and HAM need slightly more observed entries to reach the success threshold, which verifies Theorem 3: the scheme guarantees exact recovery at the cost of more samples. As an interpretation, the homomorphic version hides the plaintext matrix in a larger space, namely from rank $r$ to rank $r + k$. Given that we set $k = 10$ in these experiments, the results of HNN and HAM can be obtained by shifting the results of their counterparts left by one grid cell.

6.3 Results on MovieLens10M and Netflix Datasets

Figure 4: Results on the MovieLens10M and Netflix datasets. We vary the percentage of observed entries and measure the RSE recovery error.

Fig. 4 shows the results on the MovieLens10M and Netflix datasets. For the newly introduced FW algorithm, we set the privacy parameters $\epsilon = 2\log(1/\delta)$ and $\delta = 10^{-6}$. For the NN and AM algorithms, the settings are the same as in Section 6.2. First of all, we observe that the homomorphic algorithms achieve significantly lower recovery errors than the FW algorithm. This highlights the difference between the proposed scheme and existing strategies: we do not sacrifice recovery accuracy to improve privacy. On the other hand, we find that the homomorphic algorithms reach the same level of recovery error as the vanilla algorithms on plaintexts but need more samples. This behavior is consistent with our theoretical proofs and our observations in Section 6.2. Moreover, we analyze the impact of increasing the percentage of observed entries on the three types of algorithms, as shown in Fig. 4. For the AM and FW algorithms, the recovery error decreases smoothly as the percentage increases (note that the y-axis is logarithmic). However, the NN algorithm exhibits a sharp error drop as we increase the percentage of observed entries.
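To tie the scheme and the evaluation together, here is an end-to-end sketch of the synthetic experiment on a small instance. The solver is a generic soft-impute (singular-value soft-thresholding) iteration that we substitute for the paper's NN solver, and all sizes and parameters are illustrative, so the recovery is approximate rather than the exact recovery the theory guarantees.

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, r, k, p = 100, 80, 2, 3, 0.7

# Plaintext, keys, cyphertext (eq. (11)), and Bernoulli sampling.
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
K = rng.standard_normal((n1, k))
R = rng.standard_normal((k, n2))
M_bar = M + K @ R
mask = rng.random((n1, n2)) < p

def complete(Y_obs, mask, tau=0.1, n_iter=500):
    """Soft-impute: alternate data consistency with singular-value soft-thresholding."""
    X = np.zeros_like(Y_obs)
    for _ in range(n_iter):
        Y = np.where(mask, Y_obs, X)                 # keep observed entries, fill the rest
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrink singular values
    return X

X_bar = complete(np.where(mask, M_bar, 0.0), mask)   # server: completion on cyphertexts
M_hat = X_bar - K @ R                                # node side: local decryption

rse = np.linalg.norm(M_hat - M) / np.linalg.norm(M)  # RSE metric from Section 6.1
print(f"RSE = {rse:.2e}")
```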
7 Conclusion and Future Work

This work studied the problem of privacy-preserving data completion in a distributed manner. To address the privacy concern, we defined the homomorphic matrix completion problem and proposed a homomorphic encryption-decryption scheme. Unlike existing works that preserve privacy by sacrificing recovery accuracy, our work guarantees EXACT recovery while trading off privacy against the number of samples. We theoretically proved that the proposed scheme satisfies the homomorphism and differential privacy properties. Experimentally, we showed that the proposed scheme is compatible with two matrix completion algorithms, namely nuclear-norm minimization and alternating minimization, and verified the homomorphism property. In the future, it would be interesting to extend this homomorphic framework to the tensor completion problem [31, 32]. It would also be practically interesting to study federated learning applications [24] and to develop high-performance implementations for high-dimensional data analysis [50, 49, 36].

Acknowledgement

Xiao-Yang Liu would like to thank Prof. John Wright (Department of Electrical Engineering at Columbia University) for his insightful sharing.

References

[1] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, New York, NY, USA, 2007.
[2] R. S. Cabral, F. Torre, J. P. Costeira, and A. Bernardino. Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190-198, 2011.
[3] E. J. Candès. Mathematics of sparsity (and a few other things). In Proceedings of the International Congress of Mathematicians, Seoul, South Korea, volume 123, 2014.
[4] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.
[5] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
[6] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.
[7] T. P. Cason, P.-A. Absil, and P. Van Dooren. Iterative methods for low rank approximation of graph similarity matrices. Linear Algebra and its Applications, 438(4):1863-1882, 2013.
[8] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.
[9] Y. Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory, 61(5):2909-2923, 2015.
[10] S. Chien, P. Jain, W. Krichene, S. Rendle, S. Song, A. Thakurta, and L. Zhang. Private alternating least squares: Practical private matrix completion with tighter rates. In International Conference on Machine Learning, pages 1877-1887. PMLR, 2021.
[11] L. Deng, H. Zheng, X.-Y. Liu, X. Feng, and Z. D. Chen. Network latency estimation with leverage sampling for personal devices: An adaptive tensor completion approach. IEEE/ACM Transactions on Networking, 28(6):2797-2808, 2020.
[12] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[13] A. Elmahdy, J. Ahn, C. Suh, and S. Mohajer. Matrix completion with hierarchical graph side information. Advances in Neural Information Processing Systems, 33:9061-9074, 2020.
[14] C. Gentry. Fully homomorphic encryption using ideal lattices. ACM STOC, 9:169-178, 2009.
[15] J. Harris. Algebraic geometry: a first course, volume 133. Springer Science & Business Media, 1992.
[16] D. T. Huynh and E. Elhamifar. Interactive multi-label CNN learning with partial labels. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9420-9429, 2020.
[17] J. Jahn. Introduction to the theory of nonlinear optimization. Springer Science & Business Media, 2007.
[18] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing, pages 665-674, 2013.
[19] P. Jain, J. Rush, A. Smith, S. Song, and A. Guha Thakurta. Differentially private model personalization. Advances in Neural Information Processing Systems, 34, 2021.
[20] P. Jain, O. D. Thakkar, and A. Thakurta. Differentially private matrix completion revisited. In International Conference on Machine Learning, pages 2220-2229, 2018.
[21] M. Kearns, M. Pai, A. Roth, and J. Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the Conference on Innovations in Theoretical Computer Science, pages 403-410. ACM, 2014.
[22] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980-2998, 2010.
[23] L. Kong, L. He, X.-Y. Liu, Y. Gu, M.-Y. Wu, and X. Liu. Privacy-preserving compressive sensing for crowdsensing based trajectory recovery. In IEEE 35th International Conference on Distributed Computing Systems (ICDCS), pages 31-40, 2015.
[24] L. Kong, X.-Y. Liu, H. Sheng, P. Zeng, and G. Chen. Federated tensor mining for secure industrial internet of things. IEEE Transactions on Industrial Informatics, 2019.
[25] L. Kong, M. Xia, X.-Y. Liu, M.-Y. Wu, and X. Liu. Data loss and reconstruction in sensor networks. In Proceedings of IEEE INFOCOM, pages 1654-1662. IEEE, 2013.
[26] Y. Koren, S. Rendle, and R. Bell. Advances in collaborative filtering. Recommender Systems Handbook, pages 91-142, 2022.
[27] K. Kundu and J. Tighe. Exploiting weakly supervised visual patterns to learn from partial annotations. Advances in Neural Information Processing Systems, 33:561-572, 2020.
[28] A. J. Laub. Matrix analysis for scientists and engineers. SIAM, 2005.
[29] Z. Li, B. Ding, C. Zhang, N. Li, and J. Zhou. Federated matrix factorization with privacy guarantee. Proceedings of the VLDB Endowment, 15(4):900-913, 2021.
[30] G. Liu, Q. Liu, and X. Yuan. A new theory for matrix completion. Advances in Neural Information Processing Systems, 30, 2017.
[31] X.-Y. Liu, S. Aeron, V. Aggarwal, and X. Wang. Low-tubal-rank tensor completion using alternating minimization. IEEE Transactions on Information Theory, 66(3):1714-1737, 2019.
[32] X.-Y. Liu, S. Aeron, V. Aggarwal, X. Wang, and M.-Y. Wu. Adaptive sampling of RF fingerprints for fine-grained indoor localization. IEEE Transactions on Mobile Computing, 15(10):2411-2423, 2015.
[33] X.-Y. Liu and X. Wang. LS-decomposition for robust recovery of sensory big data. IEEE Transactions on Big Data, 4(4):542-555, 2017.
[34] X.-Y. Liu, Y. Zhu, L. Kong, C. Liu, Y. Gu, A. V. Vasilakos, and M.-Y. Wu. CDC: Compressive data collection for wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems, 26(8):2188-2197, 2014.
[35] S. Lohr. Netflix cancels contest after concerns are raised about privacy. The New York Times, Mar. 12, 2010. http://www.nytimes.com/2010/03/13/technology/13netflix.html?mcubz=0.
[36] H. Lu, T. Zhang, and X.-Y. Liu. High-performance homomorphic matrix completion on GPUs. In IEEE 21st International Conference on High Performance Computing and Communications, pages 1627-1634. IEEE, 2019.
[37] L. W. Mackey, A. Talwalkar, and M. I. Jordan. Distributed matrix completion and robust factorization. Journal of Machine Learning Research, 16(1):913-960, 2015.
[38] S. Rallapalli, L. Qiu, Y. Zhang, and Y.-C. Chen. Exploiting temporal stability and low-rank structure for localization in mobile networks. In Proceedings of the International Conference on Mobile Computing and Networking, pages 161-172. ACM, 2010.
[39] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60-72, 1999.
[40] R. Schneider and A. Uschmajew. Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality. SIAM Journal on Optimization, 25(1):622-646, 2015.
[41] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In International Conference on Machine Learning, 2011.
[42] V. Singhal and T. Steinke. Privately learning subspaces. Advances in Neural Information Processing Systems, 34, 2021.
[43] H. Sun and S. A. Jafar. The capacity of private information retrieval. IEEE Transactions on Information Theory, 2017.
[44] J. Upadhyay. The price of privacy for low-rank factorization. Advances in Neural Information Processing Systems, 31, 2018.
[45] M. Van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan. Fully homomorphic encryption over the integers. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 24-43. Springer, 2010.
[46] J. Wright and Y. Ma. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022.
[47] Q. Wu, H. Zhang, X. Gao, J. Yan, and H. Zha. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In International Conference on Machine Learning, pages 11329-11339. PMLR, 2021.
[48] Q. Ye, J. Cheng, H. Du, X. Jia, and J. Zhang. A matrix-completion approach to mobile network localization. In Proceedings of the 15th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 327-336. ACM, 2014.
[49] T. Zhang, X.-Y. Liu, X. Wang, and A. Walid. cuTensor-Tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Transactions on Parallel and Distributed Systems, 31(3):595-610, 2019.
[50] T. Zhang, H. Lu, and X.-Y. Liu. High-performance homomorphic matrix completion on multiple GPUs. IEEE Access, 8:25395-25406, 2020.
[51] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu. Spatio-temporal compressive sensing and Internet traffic matrices. In ACM SIGCOMM Computer Communication Review, volume 39, pages 267-278. ACM, 2009.

Broader Impact Statement

This paper is within the area of private machine learning, which calls for privacy-preserving data completion, addressed here by proposing a homomorphic encryption-decryption scheme. Due to the wide application areas of the matrix completion problem, this work may have broad practical impact in recommendation systems, global positioning, system identification, mobile social networks, etc.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A] We are not aware of foreseeable negative impacts.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes]
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]