# Differentially Private Model Personalization

Prateek Jain (Google Research, prajain@google.com) · Keith Rush (Google Research, krush@google.com) · Adam Smith (Boston University, ads22@bu.edu) · Shuang Song (Google Research, shuangsong@google.com) · Abhradeep Thakurta (Google Research, athakurta@google.com)

**Abstract.** We study personalization of supervised learning with user-level differential privacy. Consider a setting with many users, each of whom has a training data set drawn from their own distribution $P_i$. Assuming some shared structure among the problems $P_i$, can users collectively learn the shared structure and solve their tasks better than they could individually, while preserving the privacy of their data? We formulate this question using joint, user-level differential privacy; that is, we control what is leaked about each user's entire data set. We provide algorithms that exploit popular non-private approaches in this domain, like the Almost-No-Inner-Loop (ANIL) method, and give strong user-level privacy guarantees for our general approach. When the problems $P_i$ are linear regression problems with each user's regression vector lying in a common, unknown low-dimensional subspace, we show that our efficient algorithms satisfy nearly optimal estimation error guarantees. We also establish a general, information-theoretic upper bound via an exponential-mechanism-based algorithm.

## 1 Introduction

Modern machine learning techniques are amazingly successful but come with a range of risks to the privacy of the personal data on which they are trained. Complex models often encode exact personal information in surprising ways, allowing, in extreme cases, the exact recovery of training data from black-box use of the model [7, 8]. The emerging architecture of modern learning systems, in which models are trained collaboratively by networks of mobile devices using extremely rich, personal information, exacerbates these risks.

The paradigm of model personalization, a special case of multitask learning, has emerged as one way to address both privacy and scalability issues. The idea is to let users train models on their own data (for example, to recognize friends' and family members' faces in photos, or to suggest text completions that match the user's style) based on information that is common to the many other similar learning problems being solved by other users in the system. Even a fairly limited amount of shared information, such as a useful feature representation or a starting set of parameters for optimization, can dramatically reduce the amount of data each user requires. But that shared information can nevertheless be highly disclosive.

In this paper, we formulate a model for reasoning rigorously about the loss to privacy incurred by sharing information for model personalization. In our model, there are $n$ users, each holding a dataset of $m$ labeled examples. We assume user $j$'s data set $D_j$ is drawn i.i.d. from a distribution $P_j$; the user's goal is to learn a prediction rule that generalizes well to unseen examples from $P_j$. Ideally, the user should succeed much better than they could have on their own. We give new algorithms for this setting, analyze their accuracy on specific data distributions, and test our results empirically.

We ask that our algorithms satisfy user-level, joint differential privacy (DP) [28] (called task-level privacy in the context of multi-task learning [32]).
In this setting, each user provides their data set $D_j$ as input to the algorithm and receives output $A_j = A_j(D_1, \ldots, D_n)$. We require that for every choice of the other data sets $D_{-j} = (D_1, \ldots, D_{j-1}, D_{j+1}, \ldots, D_n)$ and for every two data sets $D_j$ and $D'_j$, the collective view of the other users $A_{-j}$ be distributed essentially identically regardless of whether user $j$ inputs $D_j$ or $D'_j$. The standard model of differential privacy does not directly fit our setting, since the model ultimately trained by user $j$ will definitely reveal information about user $j$'s data set. That said, the algorithms we design can ultimately be viewed as an appropriate composition of modules that satisfy the usual notion of DP (an approach known as the billboard model). For simplicity, we describe our algorithms in a centralized model in which the data are stored in a single location and the algorithm $A$ is run as a single operation. In most cases, we expect $A$ to be run as a distributed protocol, using either general tools such as multiparty computation or lightweight, specialized ones such as differentially private aggregation to simulate the shared platform.

Intuitively, a strong privacy requirement at the user level, while still demanding that users share some common information, is significantly challenging. For one, as each user individually has a small amount of data, she has to share information about her model/data to learn a meaningful representation. Furthermore, in practical personalization settings, there is a feedback loop between the common (or pooled) knowledge of all users and the personalized models for each user. That is, starting with reasonable personalized models for each user leads to better pooled information, while good pooled information then helps each user learn a better personal model. The requirement of strong privacy guarantees forces the quality of the pooled information to degrade to some extent, which can then lead to poorer personalized models and form a negative feedback loop.

### 1.1 Contributions

We consider two types of algorithms for DP model personalization: inefficient algorithms (based on the exponential mechanism [35]) that establish information-theoretic upper bounds on achievable error, and efficient ones based on popular iterative approaches to non-private personalization [40, 26, 51, 52]. These latter approaches are popular for their convergence speed and low communication overhead. As is often the case, those same features make them attractive starting points for DP algorithms.

**Problem Setting:** Consider a set of $n$ users, and suppose each user $j \in [n]$ holds a data set of $m$ records $D_j = \{(x_{ij}, y_{ij})\}_{i \in [m]}$, where $x_{ij} \in \mathbb{R}^d$ and $y_{ij} \in \mathbb{R}$. The goal is to learn a personalized model $f_j(\cdot) = f(\cdot\,; \theta_j) : \mathbb{R}^d \to \mathbb{R}$ for each user $j$, where $\theta_j$ is a vector of parameters describing the model. We aim to learn a shared, low-dimensional representation for the features that allows users to train good predictors individually. For concreteness, we consider a linear embedding specified by a $d \times k$ matrix $U$, where $k \ll d$. We may think of $U$ either as providing a $k$-dimensional representation of the feature $x_{ij}$ (as $U^\top x_{ij}$) or, alternatively, as a compact way to specify a $d$-dimensional regression vector $\theta_j = U v_j$, where $v_j$ is a vector of length $k$. In both cases, user $j$'s final predictor has the form
$$f_j(x_{ij}) = f\big(\langle x_{ij}, U v_j\rangle\big) = f\big(\langle U^\top x_{ij}, v_j\rangle\big).$$
One may view this model as a two-layer neural network, where the first layer is shared across all users and the second layer is trained individually.
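The following is a minimal NumPy sketch (not from the paper; all names are illustrative) of this two-layer predictor: a shared orthonormal embedding $U$ and a per-user head $v_j$, with $f$ the identity as in the linear regression case studied below.

```python
import numpy as np

def predict(U: np.ndarray, v_j: np.ndarray, x: np.ndarray) -> float:
    """Return <U^T x, v_j> = <x, U v_j>, i.e. f_j(x) with f = identity."""
    return float((U.T @ x) @ v_j)          # equivalently x @ (U @ v_j)

d, k = 50, 2
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # shared orthonormal embedding (first layer)
v_j = rng.standard_normal(k)                       # user j's low-dimensional head (second layer)
x = rng.standard_normal(d)
assert np.isclose(predict(U, v_j, x), x @ (U @ v_j))
```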
A useful setting to have in mind is one where $k \ll m \ll d$, so users do not have enough data to find a good solution on their own, but they do have enough data to find the best vector $v_j$ once an embedding $U$ has been specified. Without loss of generality, we assume $U \in \mathbb{R}^{d \times k}$ to be an orthonormal basis and refer to it as the embedding matrix. For brevity, we define the matrix $V = [v_1 | \cdots | v_n] \in \mathbb{R}^{k \times n}$ with the $v_j$'s as columns.

**Measure of Accuracy:** Let
$$L_{\mathrm{Pop}}(U; V) = \mathbb{E}_{(i,j) \sim \mathrm{unif}([m] \times [n]),\ (x_{ij}, y_{ij}) \sim P_j}\left[\ell\big(\langle U^\top x_{ij}, v_j\rangle; y_{ij}\big)\right],$$
where the loss function takes the form $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. We will focus on the excess population risk defined in (1). The privately learned models are denoted by $(U^{\mathrm{priv}}, V^{\mathrm{priv}})$. The error measure is defined with respect to any fixed choice of parameters $(U^*, V^*)$:
$$\mathrm{Risk}_{\mathrm{Pop}}\big((U^{\mathrm{priv}}, V^{\mathrm{priv}}); (U^*, V^*)\big) = L_{\mathrm{Pop}}(U^{\mathrm{priv}}, V^{\mathrm{priv}}) - L_{\mathrm{Pop}}(U^*, V^*). \tag{1}$$

**Alternating Minimization Framework:** We develop an efficient framework based on alternating minimization [46, 29, 23]: starting from an initial embedding map $U^0$, the algorithm proceeds in rounds that alternate between users individually selecting the model $v_j^{(t)}$ that minimizes the error of the predictor $f(\langle \cdot, U^{(t)} v_j^{(t)}\rangle)$, and then running a DP algorithm, for which user $j$ provides inputs $(D_j, v_j^{(t)})$, to privately select a new embedding $U^{(t+1)}$ that minimizes the error of the predictor $f(\langle \cdot, U^{(t+1)} v_j^{(t)}\rangle)$. In both steps, the optimization to be performed is convex when the loss being optimized is convex. This helps us handle the inherent non-convexity in the problem formulation.

**Instantiation and Analysis for Linear Regression with Gaussian Data:** For the specific case of linear regression with the squared error loss, we show that our framework can be fully instantiated with an efficient algorithm which converges quickly to an optimal solution. For simplicity, we consider the case where the feature vectors and field noise are normally distributed and independent of each user's true model $\theta_j^*$, and furthermore that the $\theta_j^*$ vectors admit a common low-dimensional representation $U^* \in \mathbb{R}^{d \times k}$, so that $\theta_j^* = U^* v_j^*$. We show that careful initialization of $U^0$ followed by alternating minimization converges to a near-optimal embedding as long as $m = \omega(k^2)$ and $n = \omega\big(\frac{k^{2.5} d^{1.5}}{\varepsilon}\big)$. Notice that non-privately, one would require $n = \omega(dk)$ users to get any reasonable test error. For standard private linear regression in $dk$ dimensions, current state-of-the-art results (Theorem 3.2, [3]) have a sample complexity similar to what we achieve.

**Theorem 1.1** (Special case of Theorem 4.2). Suppose the output for point $x_{ij} \sim N(0, 1)^d$ of user $j$ is given by $y_{ij} \sim \langle (U^*)^\top x_{ij}, v_j^*\rangle + N(0, \sigma_F^2)$, where $U^* \in \mathbb{R}^{d \times k}$ is an orthonormal matrix that describes the shared representation, and suppose $v_j^* \sim N(0, 1)^k$. Let $\sigma_F \leq \sqrt{k}$ and $\varepsilon \leq 1$. Then, assuming the number of users $n$ is at least $(kd)^{1.5}/\varepsilon$ and the number of points per user $m$ is at least $k^2$, with high probability Algorithm 1 learns an embedding matrix $U^{\mathrm{priv}}$ such that the average test error of a linear regressor learned over points embedded by $U^{\mathrm{priv}}$ is at most
$$\tilde{O}\left(\frac{d^3 k^5}{\varepsilon^2 n^2} + \frac{\sigma_F^2\, k}{m}\right).$$
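As a concrete reference for the data model assumed in Theorem 1.1 (and formalized as Eq. (2) in Section 4), the following is a small synthetic-data generator; it is a sketch for illustration, not code from the paper, and the function name and shapes are our own choices.

```python
import numpy as np

def generate_users(n, m, d, k, sigma_F, seed=0):
    """Draw a shared orthonormal U* (d x k), per-user heads v*_j ~ N(0, I_k),
    Gaussian features x ~ N(0, I_d), and responses y = <U*^T x, v*_j> + N(0, sigma_F^2)."""
    rng = np.random.default_rng(seed)
    U_star, _ = np.linalg.qr(rng.standard_normal((d, k)))
    V_star = rng.standard_normal((k, n))              # column j is v*_j
    X = rng.standard_normal((n, m, d))                # m feature vectors per user
    noise = sigma_F * rng.standard_normal((n, m))
    Y = np.einsum('nmd,dk,kn->nm', X, U_star, V_star) + noise
    return U_star, V_star, X, Y
```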
Our instantiation of the framework in this case has two major components.

The initial embedding $U^0$ is derived from the users' data by a single noisy averaging step which roughly approximates the $d \times d$ projector onto the $k$-dimensional column space of $U^*$. The idea is that given two data points $(x_{ij}, y_{ij})$ and $(x_{(i+1)j}, y_{(i+1)j})$, the expected value of the rank-one matrix $y_{ij}\, y_{(i+1)j}\, x_{ij}\, x_{(i+1)j}^\top$ is (when rescaled) a projector onto the space spanned by the regression vector $\theta_j$. Adding these rank-one matrices across many data points and users produces a matrix with high overlap with the desired projector $U^*(U^*)^\top$. This is similar to the approach taken by [13] to design a non-private algorithm for a related, less general setting.

The DP minimization step, which fixes the $v_j$'s and seeks a near-minimal $U$, can be performed using any DP algorithm for convex minimization [9, 4]. In this particular case, one can view this step as solving a linear regression problem in which $U$ represents a list of $dk$ real parameters: once $x$ and $v$ are fixed, $\langle U^\top x, v\rangle = x^\top U v$ is a linear function of $U$.

For the analysis to be tractable, we restrict our attention to linear regression with independent, normally distributed features. However, the framework we provide is more general and can be applied to a wider class of models. Developing mathematical tools to analyze the behavior of noisy alternating minimization algorithms in more general settings remains an important open question. Additionally, we run simulations on synthetic data to demonstrate the effectiveness of our proposed algorithm. Our algorithm reaches a significantly better privacy-utility tradeoff compared to two baselines: (i) each user uses only their own data, and (ii) all users jointly learn a single model under differential privacy.

**Information-theoretic Upper Bounds:** In addition to developing efficient algorithms for particular settings, we give upper bounds on the achievable error of user-level DP model personalization via inefficient algorithms. Specifically, we consider the natural approach of using the exponential mechanism [35] to select a common structure that provides low prediction error on average across users. For the specific case of a shared linear embedding (a generalization of the linear regression setting above), when the feature vectors are drawn i.i.d. from $N(0, 1)^d$ and the $v_j^*$'s are drawn i.i.d. from $N(0, 1)^k$, we provide an upper bound showing that $n = \omega\big(\frac{k^{1.5} d^{1.5}}{\varepsilon}\big)$ users suffice to learn a good model, assuming $m$ is sufficiently large for users to train the remaining parameters locally (Theorem 5.2). In comparison to alternating minimization, the sample complexity is better by a factor of $k$.

In summary, we initiate a systematic study of differentially private model personalization in the practically important few-shot (or per-user sparse data) learning regime. We propose using the users' data to learn a strong common representation/embedding under differential privacy, which can in turn be used to learn sample-efficient models for each user. Using a simple but foundational problem setting, we demonstrate rigorously that this technique can indeed learn an accurate common representation as well as personalized models, despite each user holding only a small number of data points.

### 1.2 Related Work

**Personalization Frameworks:** Model personalization is a special case of multitask or few-shot learning [10, 25], where the goal is to leverage shared structure amongst multiple tasks to better learn the individual tasks. There are many different frameworks for multi-task learning, each capturing a different kind of shared structure. In the context of model personalization, where tasks correspond to users, two broad approaches stand out.

**Neighboring models.** This approach assumes that while each user learns their own model, all or a fraction of the models are close to each other and thus can be learned together [18, 25].

**Common representation.**
This approach, which we adopt in this paper, assumes a low-dimensional shared subspace in which all points can be represented, so that each user/task can learn a sample-efficient model to solve the individual task [47, 38]. A common instantiation is a DNN architecture in which the weights in the last layer are user-specific but all other weights are shared. Algorithmically, this second approach is more complex, since it entails simultaneously finding an accurate representation of the data and models building upon that representation. But several studies [38, 47] have shown it to be significantly more effective than other approaches like neighboring models. Recent works on this approach (e.g., [44, 47, 22, 40]) follow a training strategy similar to ours; that is, they alternately update the shared representation using gradient descent and then fine-tune individual classifiers [38, 30, 47]. In particular, the Almost-No-Inner-Loop (ANIL) method of [38] is most similar to the alternating optimization method that we adopt (see Algorithm 1). Theoretical understanding of these methods generally lags significantly behind their empirical success. However, several interesting recent results explain the effectiveness of these methods on simple tasks [13, 46]. Most of the papers in this domain focus on the linear regression problem with a shared low-dimensional representation that we study [46, 11, 48]. They show that one can provide much better estimates of the shared representation, and of the overall prediction error, by pooling information than would be possible for individual users acting alone. These existing analyses do not allow for noise in the iterations. In fact, for the general problem, the noise can lead to suboptimal solutions. Thus, a key contribution of our work is to show that, in a widely studied setting, alternating minimization converges even when the minimization over $U$ is noisy.

**Privacy:** In our setting, the data set is made up of the users' individual data sets $D_1, \ldots, D_n$, where each $D_j$ potentially contains many records (labeled training examples). Users interact via a central algorithm, which we assume for simplicity to be implemented correctly and securely (either by a trusted party or using cryptographic techniques like multiparty computation). This algorithm provides output to each of the users. We aim to control what those outputs leak about the users' input data. That is, the presence/absence of a user and her entire data set should not affect the outputs significantly. This notion is known as user-level or task-level privacy and has been widely studied in the literature [34, 31], albeit mostly without a personalization component. The only works we are aware of that look at personalization (or multitask learning more generally) with user-level guarantees are [19] and [24]. Geyer et al. [19] consider the neighboring-models approach, which cannot work in the setting we study. Jain et al. [24] consider matrix completion, which can be viewed as a version of our setting in which training examples are limited to indicator vectors (items from a known discrete set). A few studies attempt to provide only record-level privacy, a significantly weaker notion of privacy where the presence/absence of only a single record should be undetectable from the output of the model. While this notion has been studied extensively for standard non-personalized models [27, 9], for personalized models the literature is somewhat limited [21, 32].
The work of [32] discusses both task- and record-level privacy, but ultimately provides only algorithms that satisfy the weaker guarantee. As mentioned above, our goal is to provide strong user-level privacy guarantees, so such methods do not apply in our case.

### 1.3 Notation

We denote all matrices with bold upper-case letters (e.g., $A$) and all vectors with bold lower-case letters (e.g., $a$). Unless specified explicitly, all vectors are column vectors. We denote the clipping operation on a vector $a$ as $\mathrm{clip}(a; \zeta) = a \cdot \min\left\{1, \frac{\zeta}{\|a\|_2}\right\}$.

## 2 Background on Privacy

**Billboard model:** In this paper, we operate in the billboard model [20] of differential privacy [15, 14, 36]. Consider $n$ users and a computing server. The server runs a differentially private algorithm on sensitive information from the users and broadcasts the output to all the users. Each user $j \in [n]$ can then use the broadcasted output in a computation that relies solely on her own data. The output of this computation is not made available to other users. A block schematic is shown in Figure 1. One important attribute of the billboard model is that it trivially satisfies joint differential privacy [28].

**User-level privacy protection:** In this work, we provide user-level privacy protection [16]; i.e., from the output of the algorithm available to an adversary, they will not be able to detect the presence/absence of all the data samples belonging to a single user. Correspondingly, in the definition of differential privacy below (Definition 2.1), a record consists of all the data samples belonging to a single user. Furthermore, we adhere to the replacement model of privacy, where the protection is with respect to the replacement of a user with another, instead of the presence/absence of a user.

**Definition 2.1** (Differential Privacy [15, 14, 36]). A randomized algorithm $A$ is $(\varepsilon, \delta)$-differentially private if for any pair of data sets $D$ and $D'$ that differ in one record, and for all $S$ in the output range of $A$, we have $\Pr[A(D) \in S] \leq e^{\varepsilon} \Pr[A(D') \in S] + \delta$, where the probability is over the randomness of $A$. Similarly, an algorithm $A$ is $(\alpha, \rho)$-Rényi differentially private (RDP) if $D_\alpha\big(A(D) \,\|\, A(D')\big) \leq \rho$, where $D_\alpha$ is the Rényi divergence of order $\alpha$.

## 3 Model Personalization via Private Alternating Minimization

Figure 1: User-compute interaction in the billboard model. Shaded boxes represent privileged computation. $U$ refers to the common embedding function, and $v_j$ refers to the model for user $j \in [n]$.

In this section, we first provide a generic/meta algorithm for private model personalization (Algorithm 1, $\mathcal{A}_{\text{Priv-AltMin}}$). The main idea is to alternate between two steps for $T$ iterations, i.e., for $t \in [T]$: (i) estimate the best embedding matrix $U^{(t)}$ based on the current personalized models $\big[v_1^{(t)}, \ldots, v_n^{(t)}\big]$ while preserving user-level $(\alpha, \rho)$-RDP, and (ii) update the personalized models based on the updated embedding matrix $U^{(t)}$. Finally, output $U^{\mathrm{priv}} \leftarrow U^{(T+1)}$, which each user $j \in [n]$ uses to train her final personalized model $v_j^{\mathrm{priv}}$. While $\mathcal{A}_{\text{Priv-AltMin}}$ is a fairly natural method for model personalization, to the best of our knowledge, this is the first work that formally studies its privacy/utility trade-offs under user-level privacy. Prior works [40, 37] have used similar ideas in the non-private meta-learning setting.
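The following is an informal Python skeleton of this alternating scheme, shown here only to convey the control flow; it is not the paper's implementation. The helper names are hypothetical, and `dp_update_embedding` stands in for the user-level DP step (Line 6 of Algorithm 1 below), which Section 4 instantiates for linear regression.

```python
import numpy as np

def fit_local_head(U, X_j, y_j):
    """Each user solves a k-dimensional least-squares problem on her own data."""
    Z = X_j @ U                                     # (m, k) embedded features
    v_j, *_ = np.linalg.lstsq(Z, y_j, rcond=None)
    return v_j

def private_alt_min(users, U_init, T, dp_update_embedding, seed=0):
    """users: list of (X_j, y_j); dp_update_embedding: any user-level DP routine
    that returns a new d x k orthonormal embedding from a batch of users and heads."""
    rng = np.random.default_rng(seed)
    U = U_init
    # As in Algorithm 1, each round uses a disjoint block of users.
    blocks = np.array_split(rng.permutation(len(users)), T)
    for t in range(T):
        batch = [users[j] for j in blocks[t]]
        V_t = [fit_local_head(U, X_j, y_j) for (X_j, y_j) in batch]   # local, non-private step
        U = dp_update_embedding(batch, V_t)                            # user-level DP step
    return U
```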
The estimation of the embedding matrix can be implemented by any differentially private convex optimization algorithm (e.g., DP-SGD [42, 4, 1]). As discussed in Section 4, for the specific case of linear regression, we can perturb the sufficient statistics to obtain the differential privacy guarantee and then optimize over them. A similar idea was used in [41, 39]. We provide a formal description in Algorithm 1. In Section 4, we instantiate it in the context of personalized linear regression. There, we also provide formal excess population risk guarantees under some data generating assumptions.

**Algorithm 1** $\mathcal{A}_{\text{Priv-AltMin}}$: Differentially Private Alternating Minimization Meta-algorithm

**Require:** Data sets from each user $j \in [n]$: $D_j = \{(x_{ij} \in \mathbb{R}^d, y_{ij} \in \mathbb{R}) : i \in [m]\}$ with $m \bmod 4 = 0$; rank of the projector: $k$; privacy parameters: $(\alpha, \rho)$; number of iterations: $T$; initial rank-$k$ subspace matrix: $U^{\mathrm{init}}$; loss function: $\ell$.

1: Initialize $U^{(1)} \leftarrow U^{\mathrm{init}}$.
2: Randomly permute the users $j \in [n]$ via a uniformly random permutation $\pi$ of $[n]$. Set $j \leftarrow \pi(j)$ for all $j \in [n]$.
3: **for** $t \in [T]$ **do**
4: $\quad S_t \leftarrow \left[1 + \frac{(t-1)n}{T},\ \frac{tn}{T}\right]$.
5: $\quad$ Each user $j \in S_t$ independently solves $v_j^{(t)} \leftarrow \arg\min_{v \in \mathbb{R}^k} \sum_{i \in [m/4]} \ell\big(\langle (U^{(t)})^\top x_{ij}, v\rangle; y_{ij}\big)$.
6: $\quad$ Estimate $U^{(t+1)} \leftarrow \arg\min_{U \in \mathcal{K}} \sum_{i \in [m/4+1,\, m/2],\ j \in S_t} \ell\big(\langle U^\top x_{ij}, v_j^{(t)}\rangle; y_{ij}\big)$ under $(\alpha, \rho)$-RDP, where $\mathcal{K}$ is the set of all rank-$k$ matrices with orthonormal columns in $\mathbb{R}^{d \times k}$.
7: **end for**
8: $U^{\mathrm{priv}} \leftarrow U^{(T+1)}$.

Since Line 6 guarantees $(\alpha, \rho)$-RDP and disjoint sets of users are used in each iteration, we can conclude that the whole algorithm guarantees $(\alpha, \rho)$-RDP.

## 4 Instantiating Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$ with Linear Regression

In this section, we instantiate Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$ (Algorithm 1) in the context of linear regression. While our privacy guarantees hold for any instantiation of the training data, the utility guarantees hold under the following data generating assumption.

**Data generation:** We instantiate the problem description in Section 1.1 as follows. There is a fixed model $v_j^* \in \mathbb{R}^k$ for each user $j \in [n]$ and a fixed rank-$k$ matrix with orthonormal columns $U^* \in \mathbb{R}^{d \times k}$ shared across all users. Let $V^* := [v_1^* | \cdots | v_n^*]$. For each feature vector $x_{ij} \in \mathbb{R}^d$, the response $y_{ij}$ is given by:
$$y_{ij} = \langle (U^*)^\top x_{ij}, v_j^*\rangle + z_{ij}, \qquad z_{ij} \sim N(0, \sigma_F^2). \tag{2}$$

In Theorem 4.2, we provide the privacy and utility guarantee for an instantiation of Algorithm 1 ($\mathcal{A}_{\text{Priv-AltMin}}$) where the loss function is $\ell\big(\langle U^\top x_{ij}, v\rangle; y_{ij}\big) = \big(y_{ij} - \langle U^\top x_{ij}, v\rangle\big)^2$. We adhere to Assumption 4.1 for the utility analysis.

**Assumption 4.1** (Assumptions for Utility Analysis). Let $\lambda_i > 0$ be the $i$-th eigenvalue of $\frac{1}{n} V^*(V^*)^\top$, and let $\mu := \max_{j \in [n]} \|v_j^*\|_2 / \sqrt{k \lambda_k}$ be the incoherence parameter. Let the noise-to-signal ratio be $\mathrm{NSR} = \sigma_F / \sqrt{\lambda_k}$. We assume: (i) $x_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1)^d$ for all $i \in [m]$, $j \in [n]$, with the corresponding $y_{ij}$ generated using (2); (ii) $m = \tilde{\Omega}\big((1 + \mathrm{NSR})\,k + k^2\big)$; (iii) $n = \tilde{\Omega}\Big(\frac{\lambda_1}{\lambda_k}\mu^2 d k + \frac{\lambda_1}{\lambda_k} k^2 \big(\mathrm{NSR}^2 + \mu^2 k\big)^2 + \frac{\lambda_1}{\lambda_k}\,\Delta_{(\varepsilon,\delta)}\,\big(\mathrm{NSR}^2 + \mu^2 k\big)\, d^{3/2}\Big)$, where $\Delta_{(\varepsilon,\delta)}$ is the privacy noise multiplier defined in Theorem 4.2. Here, $\tilde{\Omega}(\cdot)$ hides $\mathrm{polylog}(n, m, k)$ factors.

**Theorem 4.2** (Main Result: Bound on Excess Risk). Let $V^{\mathrm{priv}} = [v_1^{\mathrm{priv}}, \ldots, v_n^{\mathrm{priv}}]$ with $v_j^{\mathrm{priv}} \leftarrow \arg\min_{v \in \mathbb{R}^k} \frac{2}{m} \sum_{i} \big(y_{ij} - \langle (U^{\mathrm{priv}})^\top x_{ij}, v\rangle\big)^2$. Let Assumption 4.1 hold. Then, Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$ with the parameters in Lemma 4.4 and $\Delta_{(\varepsilon,\delta)} := \frac{\sqrt{16 \log(1/\delta)}}{\varepsilon}$ outputs $U^{\mathrm{priv}}$ such that (i) it is $(\varepsilon, \delta)$-differentially private, and (ii) it has the following excess population risk with probability at least $1 - 1/n^9$ (over the randomness of the data generation and the algorithm):
$$\mathrm{Risk}_{\mathrm{Pop}}\big((U^{\mathrm{priv}}, V^{\mathrm{priv}}); (U^*, V^*)\big) = O\left(\frac{\Delta_{(\varepsilon,\delta)}^2\,(\sigma_F^2 + \mu^2 k^2 d \lambda_k)\,\mu^4 k^3 d^2}{n^2} + \frac{\sigma_F^2\, \mu^4 k^2 d}{nm}\right)\cdot \mathrm{polylog}(d, n) + \frac{\sigma_F^2\, k}{m}.$$

See supplementary material for the proof.

**Remark 1.** Let us understand the bound above for a simple setting where the personal model for each user is $v_j^* \sim N(0, 1)^k$. Assuming large enough $n$, this implies that $\lambda_k \approx 1$ and $\mu = \tilde{O}(1)$. Now, even when $V^*$ is known a priori, to obtain a reasonable estimate of $U^*$ we need to solve the following linear regression problem while ensuring DP: $U^{\mathrm{priv}} = \arg\min_U \sum_{ij}\big(y_{ij} - \langle x_{ij}(v_j^*)^\top, U\rangle\big)^2$. Note that $x_{ij}(v_j^*)^\top$ is isotropic. Without differential privacy, the information-theoretically optimal estimation error is $\Theta\big(\frac{\sigma_F^2 dk}{nm}\big)$, where $dk$ is the size of the linear regression problem and $mn$ is the number of samples. If we were to solve the above regression problem with DP, the best known algorithm [41] would have an additional error of $\tilde{O}\big(\big(\frac{\kappa\sqrt{dk}}{n\varepsilon}\big)^2\big)$, where $\kappa = \sigma_F + \max_{ij} \|x_{ij}(v_j^*)^\top\|_F \|U^*\|_F = \tilde{O}(\sigma_F + \sqrt{dk^2})$. Note that the first two terms in Theorem 4.2 indeed match $O\big(\big(\frac{\kappa\sqrt{dk}}{n\varepsilon}\big)^2 + \frac{\sigma_F^2 dk}{nm}\big)$ up to an additional factor of $k$ and up to $\mathrm{polylog}(d, n)$ factors. Finally, the last error term in the above theorem is due to the excess risk in estimating $v$ for a given user with $m$ samples, and is information-theoretically optimal.

**Remark 2.** Under the assumptions in Remark 1 and for $\sigma_F = 0$, the sample complexity for Theorem 4.2 is $n = \tilde{\omega}(k^{2.5} d^{1.5}/\varepsilon + d)$ and $m = \tilde{\omega}(k^2)$. Note that, for $\varepsilon \to \infty$, the complexity is $O(k)$ worse than the information-theoretic optimum. Furthermore, the sample complexity suffers from an additional $\sqrt{d}$ factor for constant $\varepsilon$ compared to the non-private case. Even for standard linear regression, a similar additional $\sqrt{d}$ factor is present in the sample complexity bound [41]; we leave further investigation into the optimal sample complexity for future work.

In Section 4.1, we show an instantiation of Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$ (Algorithm 1) such that, if the embedding matrix $U^{\mathrm{init}}$ is initialized well, then $U^{\mathrm{priv}}(U^{\mathrm{priv}})^\top$ converges in the Frobenius norm to $U^*(U^*)^\top$. In Section 4.2, we provide an algorithm to obtain a good initialization of the embedding matrix $U^{\mathrm{init}}$. Combining these two results implies Theorem 4.2.

### 4.1 Local Subspace Convergence

In Algorithm 2, we instantiate Line 6 of Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$. For any matrix $A \in \mathbb{R}^{d_1 \times d_2}$, let $\vec{A} \in \mathbb{R}^{d_1 d_2}$ be the vectorized representation with the columns of $A$ placed consecutively. Let $\mathcal{N}_{\mathrm{sym}}(0, \sigma^2)^{d \times d}$ denote a Wigner matrix with entries drawn i.i.d. from $N(0, \sigma^2)$. The privacy guarantee of Algorithm 2 is presented in Lemma 4.3 and the local subspace guarantee in Lemma 4.4.

**Algorithm 2** Instantiating Line 6 of Algorithm 1 ($\mathcal{A}_{\text{Priv-AltMin}}$)

**Require:** Set of users at time step $t \in [T]$: $S_t$; current models: $\{v_j^{(t)} : j \in S_t\}$; data samples: $\{(x_{ij}, y_{ij}) : j \in S_t, i \in [m]\}$; privacy parameter: $\Delta_{(\varepsilon,\delta)}$; clipping threshold for the model: $\eta$; clipping threshold for the response: $\zeta$.

1: $\vec{W}_{ij} \leftarrow \mathrm{clip}\big(\overrightarrow{x_{ij}(v_j^{(t)})^\top}; \eta\big)$ and $\tilde{y}_{ij} \leftarrow \mathrm{clip}(y_{ij}; \zeta)$ for all $i \in [m/4 + 1, m/2]$, $j \in S_t$.
2: $W^{\mathrm{priv}} \leftarrow \sum_{j \in S_t,\, i \in [m/4+1, m/2]} \vec{W}_{ij}\vec{W}_{ij}^\top + \mathcal{N}_{\mathrm{sym}}\big(0,\ \tfrac{m^2\eta^4\Delta_{(\varepsilon,\delta)}^2}{4}\big)^{dk \times dk}$, and $b^{\mathrm{priv}} \leftarrow \sum_{j \in S_t,\, i \in [m/4+1, m/2]} \tilde{y}_{ij}\vec{W}_{ij} + \mathcal{N}\big(0,\ \tfrac{m^2\zeta^2\eta^2\Delta_{(\varepsilon,\delta)}^2}{4}\big)^{dk}$.
3: $\vec{Z}^{(t+1)} \leftarrow \arg\min_{u \in \mathbb{R}^{dk}} \frac{4}{m|S_t|}\big(u^\top W^{\mathrm{priv}} u - 2\, u^\top b^{\mathrm{priv}}\big)$.
4: **return** $U^{(t+1)} \leftarrow$ the $Q$ part of the QR-decomposition of $Z^{(t+1)}$ (the $d \times k$ matrix obtained by reshaping $\vec{Z}^{(t+1)}$).

**Lemma 4.3** (Privacy guarantee). If we set $\Delta_{(\varepsilon,\delta)} = \sqrt{8\log(1/\delta)}/\varepsilon$, then the instantiation of Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$ with Algorithm 2 is $(\varepsilon, \delta)$-differentially private in the billboard model.

**Lemma 4.4** (Local Subspace Convergence). Recall Assumption 4.1. In Algorithm 2, let the model clipping threshold be $\eta = \tilde{O}(\mu\sqrt{\lambda_k d k})$ and the response clipping threshold be $\zeta = \tilde{O}\big(\sigma_F + \mu\sqrt{k\lambda_k}\big)$. Let the number of iterations of Algorithm 1 ($\mathcal{A}_{\text{Priv-AltMin}}$) be $T = \Omega\big(\log\frac{\lambda_1/\lambda_k}{\mathrm{NSR} + \Delta_{(\varepsilon,\delta)}}\big)$, and assume $U^{\mathrm{init}}$ is such that
$\big\|(I - U^*(U^*)^\top)\, U^{\mathrm{init}}\big\|_F \leq \frac{\lambda_k}{32\lambda_1}$. Then we have the following for Algorithm 1 ($\mathcal{A}_{\text{Priv-AltMin}}$), instantiated with Algorithm 2, with probability at least $1 - 1/n^{10}$ (over the randomness of the data generation and the algorithm):
$$\big\|(I - U^*(U^*)^\top)\, U^{\mathrm{priv}}\big\|_F = \tilde{O}\big(\Delta_{(\varepsilon,\delta)}\,(\mathrm{NSR} + \mu\,\cdots)\big).$$
Here, the noise-to-signal ratio is $\mathrm{NSR} = \sigma_F/\sqrt{\lambda_k}$ and the privacy parameter is $\Delta_{(\varepsilon,\delta)} = \sqrt{8\log(1/\delta)}/\varepsilon$. In $\tilde{O}(\cdot)$, we hide $\mathrm{polylog}(d, n)$ factors.

See supplementary material for the proofs. The analysis of Lemma 4.4 roughly follows the analysis of alternating minimization [46], while accounting for the noise introduced due to privacy. At each iteration, we show that the embedding subspace gets closer in the Frobenius norm, and each of the personalized models gets closer in the $\ell_2$-norm.

### 4.2 Initialization Algorithm

In Algorithm 3, we describe a private estimator for estimating $U^*$. This estimator is eventually used to initialize the linear regression instantiation of Algorithm 1. We provide the privacy and subspace-closeness guarantees in Lemmas 4.5 and 4.6, with proofs in the supplementary material.

**Algorithm 3** $\mathcal{A}_{\text{Priv-init}}$: Private Initialization Algorithm for Algorithm $\mathcal{A}_{\text{Priv-AltMin}}$

**Require:** Data sets from each user $j \in [n]$: $D_j = \{(x_{ij} \in \mathbb{R}^d, y_{ij} \in \mathbb{R}) : i \in [m]\}$; clipping bound for the response: $\zeta$; noise standard deviation for privacy: $\Delta_{(\varepsilon,\delta)}$; rank of the orthonormal basis: $k$.

1: $W_{ij} \leftarrow \mathrm{sym}\Big(\frac{x_{(2i)j}\, x_{(2i+1)j}^\top}{\|x_{(2i)j}\|_2\, \|x_{(2i+1)j}\|_2}\Big)\cdot \mathrm{clip}\big(y_{(2i)j}; \zeta\big)\cdot \mathrm{clip}\big(y_{(2i+1)j}; \zeta\big)$ for all $i \in [m/2]$ and $j \in [n]$. Here, $\mathrm{sym}(W)$ makes a matrix $W \in \mathbb{R}^{d \times d}$ symmetric by replicating the upper triangle.
2: $M^{\mathrm{Noisy}} \leftarrow \frac{2}{nm}\left(\sum_{i \in [m/2],\, j \in [n]} W_{ij} + \mathcal{N}_{\mathrm{sym}}\big(0,\ \Delta_{(\varepsilon,\delta)}^2\,\zeta^4\, m^2\big)^{d \times d}\right)$.
3: $U^{\mathrm{priv}} \leftarrow$ top-$k$ eigenvectors of $M^{\mathrm{Noisy}}$, as columns.

**Lemma 4.5** (Privacy guarantee). If we set $\Delta_{(\varepsilon,\delta)} = \sqrt{8\log(1/\delta)}/\varepsilon$, Algorithm 3 ($\mathcal{A}_{\text{Priv-init}}$) is $(\varepsilon, \delta)$-differentially private.

**Lemma 4.6** (Subspace closeness). Recall Assumption 4.1. Let the clipping bound for the response be $\zeta = \tilde{O}(\sigma_F + \mu\sqrt{k\lambda_k})$. We have the following for Algorithm 3 ($\mathcal{A}_{\text{Priv-init}}$) with probability at least $1 - 1/n^{10}$:
$$\big\|(I - U^*(U^*)^\top)\, U^{\mathrm{priv}}\big\|_2 = \tilde{O}\left(\frac{\Delta_{(\varepsilon,\delta)}\,(\mathrm{NSR}^2 + \mu^2 k)\, d^{3/2}}{n} + (\mathrm{NSR}^2 + \mu^2 k)\,\cdots\right).$$
Here, the privacy parameter is $\Delta_{(\varepsilon,\delta)} = \sqrt{8\log(1/\delta)}/\varepsilon$; in $\tilde{O}(\cdot)$ we hide $\mathrm{polylog}(d, n)$ factors.

The proof goes via a direct analysis of the distance between the subspace estimated from the training examples and the true subspace. While the convergence guarantee in Lemma 4.6 is unconditional, it is weaker than Lemma 4.4, especially in its dependence on $k$ and $\mathrm{NSR}$. Lemma 4.6 implies that under Assumption 4.1, $\big\|(I - U^*(U^*)^\top)\, U^{\mathrm{priv}}\big\|_F = O(\lambda_k/\lambda_1)$, which is sufficient to satisfy the initialization condition in Lemma 4.4. Hence, if we initialize $U$ using Algorithm 3 ($\mathcal{A}_{\text{Priv-init}}$) with a disjoint set of samples for each user, it immediately follows that the local convergence guarantee in Lemma 4.4 is in fact a global convergence guarantee.
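To make the structure of the initialization step concrete, the following sketch captures the spirit of Algorithm 3 with simplified constants: average clipped rank-one moments over disjoint pairs of points, add a symmetric Gaussian matrix, and take the top-$k$ eigenvectors. The parameter `noise_std` is a stand-in for the exact noise scale in Algorithm 3 (which must be calibrated to the user-level sensitivity); this is an illustrative sketch, not the paper's code.

```python
import numpy as np

def private_init(X, y, k, zeta, noise_std, seed=0):
    """Sketch of a noisy moment-based initialization.
    X: (n, m, d) features, y: (n, m) responses; returns a d x k orthonormal basis."""
    rng = np.random.default_rng(seed)
    n, m, d = X.shape
    y_c = np.clip(y, -zeta, zeta)
    M = np.zeros((d, d))
    for j in range(n):
        for i in range(0, m - 1, 2):                   # disjoint pairs (i, i+1)
            x1, x2 = X[j, i], X[j, i + 1]
            W = np.outer(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
            W *= y_c[j, i] * y_c[j, i + 1]
            M += 0.5 * (W + W.T)                       # symmetrize
    M *= 2.0 / (n * m)
    G = rng.standard_normal((d, d)) * noise_std        # stand-in noise scale
    M_noisy = M + np.triu(G) + np.triu(G, 1).T         # symmetric Gaussian perturbation
    eigvals, eigvecs = np.linalg.eigh(M_noisy)
    return eigvecs[:, np.argsort(eigvals)[-k:]]        # top-k eigenvectors of M_noisy
```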
## 5 Exponential Mechanism based Model Personalization

In this section, we take a more general approach towards outputting a projector $U^{\mathrm{priv}}$ that approximately minimizes the excess population risk, without worrying about actually estimating the projector onto $U^*$. Here, since we only care about low excess risk, as opposed to subspace closeness, we can guarantee better convergence under milder assumptions. Recall the loss function $L_{\mathrm{Pop}}(U, V)$ from (1). We want to optimize $\min_{U \in \mathcal{K}}\ \min_{V \in \mathbb{R}^{k \times n},\ \|v_j\|_2 \leq C} L_{\mathrm{Pop}}(U, V)$ while ensuring $\varepsilon$-DP in the billboard model. (Here $\mathcal{K} \subseteq \mathbb{R}^{d \times k}$ is the set of matrices with orthonormal columns, and $v_j$ corresponds to the $j$-th column of $V$.) To that end, we use the exponential mechanism [35] over an $\ell_F$-net of radius $\phi$ over $\mathcal{K}$. The algorithm is presented in Algorithm 4 ($\mathcal{A}_{\mathrm{Exp}}$). The privacy analysis of $\mathcal{A}_{\mathrm{Exp}}$ follows from the standard analysis of the exponential mechanism, and the utility analysis goes via first proving an excess empirical risk bound and then appealing to uniform convergence to obtain the excess population risk bound.

**Algorithm 4** $\mathcal{A}_{\mathrm{Exp}}$: Joint Differentially Private ERM via the Exponential Mechanism

**Require:** Data sets from each user $j \in [n]$: $D_j = \{(x_{ij} \in \mathbb{R}^d, y_{ij} \in \mathbb{R}) : i \in [m]\}$ with $m \bmod 2 = 0$; model $\ell_2$-norm constraint $C$; clipping bound on the projected features $L_f$; privacy parameter $\varepsilon$; rank of the projection matrix $k$; net width $\phi$; loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$; Lipschitz constant $\xi$ of $\ell$ w.r.t. its first parameter.

1: Define a score function for any rank-$k$ matrix with orthonormal columns $U \in \mathbb{R}^{d \times k}$ as $\mathrm{score}(U) = \frac{2}{nm}\sum_{j \in [n]} \min_{\|v_j\|_2 \leq C} \sum_{i \in [m/2]} \ell\big(\langle \mathrm{clip}(U^\top x_{ij}; L_f), v_j\rangle; y_{ij}\big)$.
2: Define a net $\mathcal{N}_\phi$ of $\|\cdot\|_F$-radius $\phi$ over matrices with orthonormal columns in $\mathbb{R}^{d \times k}$.
3: Sample $U^{\mathrm{priv}} \in \mathcal{N}_\phi$ with $\Pr[U^{\mathrm{priv}} = U] \propto \exp\big(-\frac{\varepsilon n}{8 L_f C \xi}\, \mathrm{score}(U)\big)$.
4: Each user $j \in [n]$ independently estimates $v_j^{\mathrm{priv}} \leftarrow \arg\min_{\|v\|_2 \leq C} \sum_{i = m/2+1}^{m} \ell\big(\langle (U^{\mathrm{priv}})^\top x_{ij}, v\rangle; y_{ij}\big)$.

**Theorem 5.1** (Privacy guarantee). Algorithm 4 is $\varepsilon$-differentially private in the billboard model.

**Theorem 5.2** (Utility guarantee). Suppose the loss function $\ell$ is $\xi$-Lipschitz in its first parameter, and $C$ is the bound on the constraint set. Set the net width $\phi = 1/(\varepsilon n)$ and the clipping norm $L_f = 40\sqrt{d \log(nm)}$ in Algorithm 4. Assuming $\varepsilon \in (0, 1)$ and that the feature vectors are drawn i.i.d. from $N(0, 1)^d$, we have, with probability at least $1 - 1/\min\{d, n\}^{10}$:
$$\mathrm{Risk}_{\mathrm{Pop}}\big((U^{\mathrm{priv}}, V^{\mathrm{priv}}); (U^*, V^*)\big) = O\big(\cdots\big)\cdot \mathrm{polylog}(d, n).$$
Here, $U^*$ and $V^*$ are any fixed values of the common embedding matrix (orthonormal, in $\mathbb{R}^{d \times k}$) and the matrix of individual regression vectors (with columns of norm at most $C$), respectively.

See supplementary material for the proofs of Theorems 5.1 and 5.2.

**Comparison of the utility guarantee to Theorem 4.2:** The utility guarantee for $\mathcal{A}_{\mathrm{Exp}}$ (Theorem 5.2) is much more general than that in Theorem 4.2. Unlike Theorem 4.2, it allows an arbitrary Lipschitz loss function $\ell$ and any distribution over the feature vectors. However, for linear regression with i.i.d. spherical normal feature vectors and setting the diameter of the constraint set $C = \sqrt{k}$, one can make Theorems 4.2 and 5.2 comparable. Theorem 4.2 shows an excess population risk of $\tilde{O}\big(\frac{k^5 d^3}{\varepsilon^2 n^2} + \frac{k}{m}\big)$, whereas Theorem 5.2 gives $\tilde{O}(\cdots)$. Theorem 4.2 is tighter in the regime where $n = \Omega(k^{3.5} d^{1.5}/\varepsilon)$. This difference is comparable to the so-called fast rates [43]. However, the sample complexity of Theorem 5.2 is better in terms of $m$ by a factor of $k^{1.5}$.
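For intuition, the selection step in Algorithm 4 can be sketched as follows in Python. This is an illustrative sketch, not the paper's implementation: `candidates` stands in for the $F$-norm net $\mathcal{N}_\phi$ (which is exponentially large in $dk$, the reason the mechanism is computationally inefficient), and `score_fn` stands in for the per-user-averaged loss defined in Line 1.

```python
import numpy as np

def exp_mech_select(candidates, score_fn, eps, n, L_f, C, xi, rng=None):
    """Sample one candidate embedding with probability decreasing exponentially in its score,
    using the exponent -eps * n * score / (8 * L_f * C * xi) from Line 3 of Algorithm 4."""
    rng = rng or np.random.default_rng(0)
    scores = np.array([score_fn(U) for U in candidates])
    logits = -eps * n * scores / (8.0 * L_f * C * xi)
    logits -= logits.max()                     # numerical stability before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx]
```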
## 6 Numerical Simulation

In this section, we provide numerical simulations to validate our theoretical results. There are three baselines: (i) each user solves their problem using their own data, without consulting other users (called "own data"); (ii) a single differentially private global model trained on all users' data together (called "single model"); and (iii) a non-private variant of alternating minimization, trained to convergence. Overall, we observe that Algorithm 1 ($\mathcal{A}_{\text{Priv-AltMin}}$) outperforms the private baselines by a significant margin.

We consider linear regression problems on synthetic data and the alternating minimization Algorithm 1. We set the number of users $n = 50{,}000$, the number of samples per user $m = 10$, the data dimension $d = 50$, and the rank $k = 2$. We sample $x$, $U^*$ and $v^*$ from Gaussian distributions, and the field noise $\sigma_F$ of the target $y$ is set to $0.01$. We normalize $U^*$ to unit norm. We run Algorithm 1 with full batch, i.e., $T = 1$, and for multiple epochs. The privacy cost accumulates over epochs, and we use RDP sequential composition to account for it. We fix the clipping norm to $10^{-4}$ and pick the optimal number of epochs in $\{1, 2, 5, 10\}$.

In Figure 2, we fix $\delta = 10^{-6}$ and plot the population mean squared error (MSE), computed with respect to the ground-truth model, for multiple values of $\varepsilon$. As a reference, the MSE of a purely random model is around 4. In addition to $(\varepsilon, \delta)$-differential privacy, we also report the value of zero-concentrated differential privacy (zCDP) [5], as it better captures the privacy properties, especially for Gaussian-mechanism-based algorithms. The definition can be found in Appendix A. In particular, the privacy parameters $\varepsilon \in \{1, 2, 5, 10\}$ correspond to $0.009$-, $0.04$-, $0.23$-, and $0.90$-zCDP, respectively. We observe that at $\varepsilon \geq 5$, the population MSE for Algorithm 3 is comparable to the non-private baseline. In contrast, the error for the single-model baseline remains very high, even at large values of $\varepsilon$.

Figure 2: MSE vs. per-user $\varepsilon$ for linear regression (curves: non-private, own data, single model, ours).

## 7 Conclusion

In this paper, we studied the problem of personalized supervised learning with user-level differential privacy. Through our framework and Algorithm 1, we demonstrated that we can indeed learn an accurate shared linear representation of the data, despite a limited number of samples per user and while preserving each user's privacy. Our error bounds and sample complexity bounds are nearly optimal in key parameters and are, in fact, comparable to the best known bounds available for the much simpler standard linear regression problem. This work leads to several interesting questions: (i) in our model, can we provide similar privacy/utility trade-offs for deep-network-based embedding functions instead of a linear embedding function? (ii) Can we make a variant of the exponential mechanism algorithm computationally feasible? (iii) Can we empirically validate the privacy/utility trade-offs on real-world data sets? As more and more ML models are personalized to user tastes, ensuring the privacy of individuals' data is paramount to a fair, responsible system. We provide a rigorous framework for designing such solutions, which we hope will motivate practitioners and researchers to make privacy a first-class citizen when designing their personalization-based ML systems.

## Acknowledgements and Funding Disclosure

We would like to thank Brendan McMahan and Daniel Ramage for various helpful discussions during the course of this project. Adam Smith was supported in part by NSF award CCF-1763786 as well as a Sloan Foundation research award and a Google Faculty Research Award.

## References

[1] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security (CCS'16), pages 308–318, 2016.
[2] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 2003.
[3] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In NeurIPS, pages 11279–11288, 2019.
[4] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp.
on Foundations of Computer Science (FOCS), 2014.
[5] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
[6] Emmanuel J. Candès and Yaniv Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. CoRR, abs/1001.0339, 2010.
[7] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
[8] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020.
[9] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
[10] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.
[11] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. arXiv preprint arXiv:2102.07078, 2021.
[12] Chandler Davis. The rotation of eigenvectors by a perturbation. Journal of Mathematical Analysis and Applications, 6(2):159–173, 1963.
[13] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
[14] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology — EUROCRYPT, pages 486–503, 2006.
[15] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Third Conf. on Theory of Cryptography (TCC), pages 265–284, 2006.
[16] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In STOC, 2010.
[17] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 11–20, 2014.
[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[19] Robin C. Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. CoRR, abs/1712.07557, 2017.
[20] Justin Hsu, Zhiyi Huang, Aaron Roth, Tim Roughgarden, and Zhiwei Steven Wu. Private matchings and allocations. In STOC, 2014.
[21] Rui Hu, Yuanxiong Guo, Hongning Li, Qingqi Pei, and Yanmin Gong. Personalized federated learning with differential privacy. IEEE Internet of Things Journal, 7(10):9530–9539, 2020.
[22] Shaoli Huang and Dacheng Tao. All you need is a good representation: A multi-level and classifier-centric representation for few-shot learning. arXiv preprint arXiv:1911.12476, 2019.
[23] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization.
In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674, 2013.
[24] Prateek Jain, Om Dipakbhai Thakkar, and Abhradeep Thakurta. Differentially private matrix completion revisited. In International Conference on Machine Learning, pages 2215–2224. PMLR, 2018.
[25] Muhammad Abdullah Jamal and Guo-Jun Qi. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11719–11727, 2019.
[26] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. CoRR, abs/1909.12488, 2019.
[27] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In 49th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pages 531–540, 2008.
[28] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014.
[29] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[30] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10649–10657. IEEE Computer Society, 2019.
[31] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. CoRR, abs/2102.11845, 2021.
[32] Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially private meta-learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020.
[33] Percy Liang. CS229T/STAT231: Statistical learning theory (winter 2016), 2016.
[34] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[35] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 94–103, 2007.
[36] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
[37] Sewoong Oh, Prateek Jain, and Kiran Thekumparampil. Sample efficient linear meta-learning by alternating minimization. Personal communication, 2021.
[38] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.
[39] Or Sheffet. Old techniques in differentially private linear regression. In ALT, 2019.
[40] Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, Keith Rush, and Sushant Prakash. Federated reconstruction: Partially local federated learning. CoRR, abs/2102.03448, 2021.
[41] Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages 58–77. IEEE, 2017.
[42] Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate.
Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.
[43] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2009.
[44] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
[45] Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Society, 2012.
[46] Kiran Koshy Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Sample efficient linear meta-learning by alternating minimization, 2021.
[47] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: A good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
[48] Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
[49] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 2012.
[50] Joel A. Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
[51] Kangkang Wang, Rajiv Mathews, Chloé Kiddon, Hubert Eichner, Françoise Beaufays, and Daniel Ramage. Federated evaluation of on-device personalization. CoRR, abs/1910.10252, 2019.
[52] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. CoRR, abs/2002.04758, 2020.