# private_model_personalization_revisited__6ad9fed6.pdf

Private Model Personalization Revisited

Conor Snedeker * 1 Xinyu Zhou * 1 Raef Bassily 2

We study model personalization under user-level differential privacy (DP) in the shared representation framework. In this problem, there are n users whose data is statistically heterogeneous, and their optimal parameters share an unknown embedding U Rd k that maps the user parameters in Rd

to low-dimensional representations in Rk, where k d. Our goal is to privately recover the shared embedding and the local low-dimensional representations with small excess risk in the federated setting. We propose a private, efficient federated learning algorithm to learn the shared embedding based on the Fed Rep algorithm in (Collins et al., 2021). Unlike (Collins et al., 2021), our algorithm satisfies differential privacy, and our results hold for the case of noisy labels. In contrast to prior work on private model personalization (Jain et al., 2021), our utility guarantees hold under a larger class of users distributions (sub-Gaussian instead of Gaussian distributions). Additionally, in natural parameter regimes, we improve the privacy error term in (Jain et al., 2021) by a factor of e O(dk). Next, we consider the binary classification setting. We present an information-theoretic construction to privately learn the shared embedding and derive a margin-based accuracy guarantee that is independent of d. Our method utilizes the Johnson-Lindenstrauss transform to reduce the effective dimensions of the shared embedding and the users data. This result shows that dimension-independent risk bounds are possible in this setting under a margin loss.

*Equal contribution 1Department of Computer Science & Engineering, The Ohio State University 2Department of Computer Science & Engineering and the Translational Data Analytics Institute (TDAI), The Ohio State University. Correspondence to: Conor Snedeker <snedeker.31@buckeyemail.osu.edu>, Xinyu Zhou <zhou.3542@buckeyemail.osu.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1 Introduction

The rapid advances in machine learning have revolutionized domains such as healthcare, finance, and personalized services. However, this progress relies heavily on the availability of user data, which presents two critical challenges. First, user data is often statistically heterogeneous, that is, individual users have distinct data distributions. This heterogeneity hinders the training of a single global model that performs well for all users. Second, the reliance on sensitive user data raises significant privacy concerns, necessitating robust mechanisms that offer strong privacy protections.

Model personalization has emerged as a key strategy to address the challenge of statistical heterogeneity by adapting models to individual users, rather than relying on a single global model that may perform suboptimally across diverse data distributions. A widely used framework for personalization is shared representation learning, where users collaborate to learn a low-dimensional embedding that captures commonalities in their tasks. This shared embedding allows users to train their local models more efficiently, leveraging the embedding to complement their unique data.

However, despite its effectiveness in handling heterogeneity, model personalization introduces additional privacy concerns, as learning shared representations involves user collaboration, which can inadvertently expose sensitive information. Ensuring rigorous privacy protections in this setting requires formal guarantees, such as user-level differential privacy (DP) (Dwork et al., 2006), which prevents adversaries from inferring an individual s presence in the training dataset based on the trained model. Existing approaches to private model personalization suffer from limitations such as restrictive assumptions about data distributions, centralized processing requirements, and suboptimal privacy-utility trade-offs.

In this work, we study model personalization under userlevel differential privacy via shared representation learning. In this problem, there are n users, where user i [n] is associated with a data distribution Di and a dataset Si = (z1, . . . zm) Dm i . The optimal parameter of user i is denoted as w i Rd. We aim at learning a shared low-dimensional embedding U Rd k, where k d, and user-specific parameters v 1, . . . , v n Rk such that w i = U v i for each user i [n], under user-level DP. We

Private Model Personalization Revisited

propose a novel private federated algorithm for model personalization via shared representation that achieves superior privacy-utility trade-off under relaxed assumptions. Our contributions are summarized as follows:

1.1 Contributions

Efficient private federated algorithm for regression: We extend the Fed Rep algorithm of (Collins et al., 2021) to ensure user-level DP while addressing the case of noisy labels in statistically heterogeneous user data. The algorithm is iterative, where in each iteration t, it alternates between local updates by the users to their local vectors v1,t, . . . , vn,t Rk and private aggregated gradient update to the shared embedding Ut Rd k. Previous work (Jain et al., 2021) required performing exact minimization with respect to both the local vectors and the shared embedding in a centralized manner. Since the embedding update in (Jain et al., 2021) involves a minimization over data from all the users, transforming their algorithm to a federated algorithm is not directly feasible. Meanwhile, ours is naturally a federated algorithm since the gradient is computed by each user locally, which the server privately aggregates. In natural parameter regimes, we show that our algorithm attains excess population risk of e O d2k n2ϵ2σ4 min, + d nmσ4 min,

σmin, is k-th largest singular value of 1 n V . Compared to (Jain et al., 2021), under the same parameter regimes, this reduces the privacy error term by a factor of e O(dk).

Robustness to broader data distributions: Unlike prior work (Jain et al., 2021) that assumes Gaussian data, the utility guarantees of our private federated algorithm apply to a broader class of sub-Gaussian distributions. Moreover, our results apply to statistically heterogeneous users in the sense that individual users may have distinct sub-Gaussian distributions, while the work of (Jain et al., 2021) assumes that the features of all the users are standard Gaussian. Moreover, we extend the results of (Collins et al., 2021) not only to the private case but also to the non-realizable case of noisy labels.

Improved initialization under privacy constraints: We introduce a private initialization algorithm for our federated algorithm. Our initialization algorithm is based on the subspace recovery approach of (Duchi et al., 2022). Our algorithm leverages a top-k singular value decomposition (SVD) with added Gaussian noise. This ensures a feasible starting point for the shared embedding, even under relaxed data assumptions.

Dimension-independent risk guarantees for classification: We provide an information-theoretic construction for the binary classification setting under the margin loss and derive an risk bound that benefits from margin guarantees. We incorporate the Johnson-Lindenstrauss transform to re-

duce the effective dimensionality of the embedding space. This leads to a margin-based risk bound independent of the input dimension d. This result holds for arbitrary distributions where the feature vectors are bounded in the Euclidean norm.

1.2 Related Work

Our work builds on and extends several lines of research in federated learning, model personalization, and differential privacy:

Federated personalization via shared representations: Shared representation learning has proven effective in addressing user heterogeneity (Collins et al., 2021; Jain et al., 2021). Additionally, centralized subspace recovery methods (Tripuraneni et al., 2021; Duchi et al., 2022) may be used to provide a good initialization for the shared representation for better federated personalization guarantees. While prior work focuses on non-private or centralized settings, our approach incorporates user-level DP in a federated framework and achieves better utility guarantees.

Differentially private model personalization: (Jain et al., 2021) introduced a private algorithm for model personalization with centralized processing and restrictive Gaussian assumptions. Our method improves excess risk guarantees and extends applicability to sub-Gaussian data distributions. There are other prior works studying private model personalization like (Bietti et al., 2022; Hu et al., 2021), but these works are not under the representation learning framework and their results are not comparable to ours.

Optimization methods for representation learning: Relevant optimization techniques include alternating exact minimization frameworks (Thekumparampil et al., 2021) and gradient-based optimization, which has also been widely used for personalization (e.g., (Collins et al., 2021; Raghu et al., 2019; Lee et al., 2019; Tian et al., 2020)). Our alternating minimization framework adapts these gradient techniques to a privacy-preserving federated setting with noisy labels.

Dimensionality reduction with DP guarantees: The use of the Johnson-Lindenstrauss transform in private learning has been explored in recent work (e.g., (Lê Nguyen et al., 2020; Bassily et al., 2022; Arora et al., 2022)). We adapt this technique for federated personalization, deriving marginbased risk guarantees independent of the data dimension d.

2 Preliminaries

We consider a setting of n users, where each user i [n] is associated with distribution Di observed by a dataset Si = ((x1, y1), . . . (xm, ym)) Dm i where xi Rd and yi R. Given a low dimensional embedding matrix U Rd k

Private Model Personalization Revisited

with orthonormal columns and local vector v Rk, the loss incurred by the parameter (U, v) over data point (x, y) is defined as ℓ(U, v, (x, y)) = ℓ (y, x, Uv ). Suppose k d. It is useful conceptually to think of the matrix U as a transformation x 7 U x that maps x Rd onto the low-dimensional space Rk. In addition, we assume that k m. This assumption is necessary to attain useful utility guarantees as in prior work (Jain et al., 2018).

Given a collection of data sets S = (S1, . . . , Sn) Rnm(d+1), embedding matrix U Rd k and collection of local vectors V = [v1, . . . vn] Rn k, we write the average empirical risk of (U, V ) over S as

b L(U, V ; S) = 1 nm

(x,y) Si ℓ(U, vi, (x, y))

We also define the average population loss of (U, V ) over the data distributions D = (D1, . . . , Dn) as

L(U, V ; D) = 1

i=1 E(x,y) Di [ℓ(U, vi, (x, y))] .

Given any (U , V ) Rd k Rn k for U with orthonormal columns, our goal is to minimize the excess population risk w.r.t (U , V ) defined as

E(U, V ) = L(U, V ; D) L(U , V ; D).

Let M Rd k be any matrix. Recall our setting in which k n. Denote the spectral norm of M with M 2 and its Frobenius norm with M F . We denote by σmax(M) and σmin(M) the largest and k-th largest singular values of M, respectively. Further, suppose that Q Rd k is a matrix with orthonormal columns and P Rk k an upper triangular matrix where QP = M is the QR decomposition of M. We take QR( ) to be the function that maps M to (Q, P). Finally, given a subspace B, we denote its orthogonal subspace as B . Definition 1. Let X R be a random variable. We call X a centered R-sub-Gaussian if X satisfies E eλX eλ2R2/2

for all λ R. Furthermore, we denote the distribution of X as SG(R2).

We now introduce a quantity that is commonly used in matrix completion (Jain et al., 2013) and shared representation learning (Jain et al., 2021; Collins et al., 2021) as a measure of distance between subspaces of Rd. It is conceptually useful to think of this distance as measuring the dissimilarity of span and alignment between two subspaces. Definition 2. We define the principal angle between the column spaces of matrices M1, M2 Rd k via dist(M1, M2) = ˆ M 1, ˆ M2 2 where ˆ M 1, , ˆ M2 are ma-

trices with orthonormal columns that satisfy span( ˆ M 1, ) = span(M1) and span( ˆ M2) = span(M2).

Note that we can instantiate the matrix ˆ M1, with Id d ˆ M1 ˆ M 1 . This is used many times in our analysis. We show how to bound the excess risk with the principal angle in Lemma 35 in Appendix B. Definition 3 (User-level Differential Privacy (Dwork et al., 2006)). A randomized algorithm A is (ϵ, δ)-user-level differentially private (user-level DP) if for any pair of neighboring collections of datasets S, S Rnm(d+1) differing in a single user dataset and any event B in the output range of A, we have

P [A(S) B] eϵP [A(S ) B] + δ.

Billboard Model. In this paper, we adopt the billboard model of differential privacy (Hsu et al., 2014; Kearns et al., 2014). In this model, there are n users and a central computing server. The server first runs a differentially private algorithm using the sensitive data from the users. The output of the algorithm is then broadcast to all users. Each user will independently train their own model using the broadcast output and the user s own data. The results of these individual computations will be kept to each user and never shared with other users.

Personalization via Representation Learning. We denote the the optimal parameters of user i as w i , and we assume all user parameters {w 1, . . . w n} share a low dimensional embedding matrix U Rk d with k d such that w i = U v i with a local vector v i Rk for i [n]. Given a dataset collection S = (S1, . . . Sn), the goal is to privately find a shared low dimensional embedding matrix U Rd k

and local vectors V = [v1, . . . vn] with low excess risk compared with U and V = [v 1, . . . v n] . Throughout this work, we assume U has orthonormal columns, hence the shared matrix U effectively maps d-dimensional feature vectors x Rd to low-dimensional representations U x Rk that ease local demands on each user. Given parameter spaces U, V, the objective can be formulated as a minimization problem written as min U U,V V L(U, V ; D).

3 Private Fed Rep Algorithm

In this section, we propose an efficient federated private model personalization algorithm based on the Fed Rep algorithm in (Collins et al., 2021) for regression problems. Our algorithm is based on an alternating minimization framework where the algorithm alternates between the updates of user local vectors and the shared embedding by first finding an empirical minimizer for each user, followed by computing a gradient with respect to the embedding. After we compute the private shared representation, each user will independently solve for their user-specific local model, which will be kept to each user. We provide rigorous utility and privacy guarantees for our algorithm and give a detailed comparison with prior work.

Private Model Personalization Revisited

We make several modifications to the original Fed Rep algorithms and its analysis. First, we sample disjoint data batches in each iteration to update the local vectors and shared embedding separately while the original Fed Rep uses the same data batches for both. This use of disjoint data batches enables us to give a tighter bound on the Frobenius norm of the computed gradients, which in turn reduces the amount of noise required for privacy preservation. Moreover, we consider the setting of noisy labels, as opposed to the original Fed Rep analysis. Incorporating label noise introduces unique technical challenges into the analysis. Roughly speaking, without label noise, we can show the distance between the computed embedding with the ground truth decreases geometrically with high probability across iterations. However, such monotonic distance decrease is no longer guaranteed in the presence of label noise. Instead, we employ an induction-based argument to upper bound the distance.

Before we introduce the Private Fed Rep algorithm, we first give some definitions and assumptions that hold for the entire section.

3.1 Problem Setting and Assumptions

In this section, we focus on quadratic loss defined ℓ(U, v, (x, y)) = (y x, Uv )2 for all U Rd k, v Rk, (x, y) Rd+1. Define Γ = maxi [n] v i 2. Denote

σmin, = σmin 1 n V and σmax, = σmax 1 n V , respectively. In our setting, we assume that the rank of 1 n V Rn k is k, which implies σmin, , the k-th largest singular value of 1 n V , is greater than 0. The values Γ, σmax, , and σmin, are important in this setting and appear in our final bounds. We additionally define the condition number to be γ = σmax,

The guarantees of our algorithm will rely on the following assumptions. Assumption 4 (Client Diversity). There are known Λ, λ > 0 such that Λ σmax, σmin, λ.

The assumption that σmin, > 0 is made explicitly or implicitly in prior work (Collins et al., 2021; Jain et al., 2021). Observe that Assumption 4 implies γ Λ

λ . In this sense, Λ

λ is an estimate of the condition number γ.

To make our analysis tractable, we will introduce some standard data assumptions similar to those invoked in prior works (Collins et al., 2021; Du et al., 2020; Jain et al., 2021; Thekumparampil et al., 2021; Tripuraneni et al., 2021). Although we make these assumptions, our algorithm can be used for a wider range of data sources than what is shown here. Assumption 5. Let i [n] and x Di. For all i [n] the feature distribution Di has covariance matrix Id d and

is sub-Gaussian in the sense of E e x,u e

u 2 2 2 for all u Rd.

Assumption 6. Let SG(R) be any centered R-sub Gaussian distribution over R. Sample ζ SG(R). For all i [n] and any data points (x, y) Di we generate the label y for the i-th user via y = x U v i + ζ where ζ is independent for all x Di.

Unless stated otherwise, in this section, we assume that Assumptions 4, 5, and 6 hold. However, we note that our algorithms do not require knowledge of Assumptions 5 and 6.

3.2 Main Algorithm

Algorithm 1 uses alternating minimization for convergence to an approximate minimizer of the excess population risk in parameters U Rd k with orthonormal columns and V Rn k, simultaneously. At each iteration of the inner loop each user samples b data points without replacement from their data set Si. The users then split this sample into disjoint batches Bi,t, B i,t for use in the computing parameters in iteration t. Our inner loop uses parallel execution at iteration t to locally compute a minimizer vi,t arg minv Rk b L(Ut, v; Bi,t) and a gradient i,t = U b L(vi,t, Ut; B i,t) for each user i. The server then aggregates clipped gradients {clip( i,t, ψ)}t [T ] for its gradient step and then returns Ut+1 to the users where clip(M, τ) = M M F min n 1, τ M F

o for any M Rd k

Each iteration Algorithm 1 ensures that Ut+1 has orthonormal columns. Due to orthonormality, the noise injected during the aggregation of ( i,t)i [n] does not affect the norm of i,t+1 for each i, t. This lack of noise propagation, Dx i having a covariance matrix Id d, and the use of disjoint batches Bi,t, B i,t ensure the following bound on the size of i,t.

Lemma 7. With probability at least 1 O(T n 10), we have i,t F e O (R + Γ)Γ

for all i [n] and all t [T] simultaneously.

We use Lemma 7 to choose the clipping parameter ψ in Algorithm 1 in later results.

Theorem 8. Algorithm 1 is (ϵ, δ)-user-level DP in the bill-

board model by setting ˆσ = C ψ

nϵ for some absolute constant C > 0.

We define ϵ,δ := C

ϵ for any ϵ > 0 and δ [0, 1].

For the following result, recall the definition of γ = σmax,

σmin, and Assumption 4, where we assume there exists λ > 0 such that σmin, λ.

Private Model Personalization Revisited

Algorithm 1 Private Fed Rep for linear regression

Require: Si = {(xi,1, yi,1), . . . , (xi,m, yi,m)} data for users i [n], learning rate η, iterations T, privacy noise parameters ˆσ, clipping parameters ψ, ψinit, batch size b m/2T , initial embedding Uinit Rd k

Let S0 i {(xi,j, yi,j) : j [m/2]} i [n] Let S1 i Si \ S0 i i [n] 1: Initialize: U0 Uinit 2: for t = 0, . . . , T 1 do 3: Server sends Ut to clients [n] 4: for Clients i [n] in parallel do 5: Sample two disjoint batches Bi,t and B i,t, each of size b, without replacement from S0 i 6: Update vi as vi,t arg minv Rk b L(Ut, v; Bi,t) 7: Compute the gradient w.r.t U

i,t U b L(Ut, vi,t; B i,t)

8: Send i,t to server 9: end for 10: Server aggregates the client gradients as

i=1 clip( i,t, ψ) + ξt+1

Ut+1, Pt+1 QR( ˆUt+1)

where ξt+1 N d k(0, ˆσ2) 11: end for 12: Server sends U priv UT to all clients 13: for clients i [n] independently do 14: vpriv i arg minv Rk b L(U priv, v; S1 i ) i [n] 15: end for 16: Return: U priv, V priv [vpriv 1 , . . . , vpriv n ]

Lemma 9. Let η 1 2σ2max, . Suppose 1 dist2(U0, U ) c for some constant c > 0. Set the clipping parameter ψ = e O (R + Γ)Γ

dk . Assume T =

ηλ2 = e O min nmλ2

max{ ϵ,δ,1}(R+Γ)Γd

km+R2Γ2dσ2 min, ,

mΓ2σ2 max, max{R2,1} max{Γ2,1}γ4kΓ2+R2kσ2max,

and Assump-

tions 5 and 6 hold for all user data. Set ˆσ as in Theorem 8 and batch size b = m/2T . Then, U priv, the first output of Algorithm 1, satisfies

dist(U priv, U )

1 cησ2 min, 4

2 dist(U0, U )

k T nϵσ2 min, +

(R2 + Γ2)Γ2d T

with probability at least 1 O(T n 10).

Note that dist(U0, U ) in the right-hand side of the bound in Lemma 9 is bounded from above by 1 c.

Given Lemma 9, we may obtain our main excess population risk bound. Let U priv be the final shared embedding returned from Algorithm 1. The intuition behind this result is that a small principal angle between the two matrices U priv, U

implies U priv approximates U as a transformation Rd Rk. Our choice of vpriv i arg minv Rk b L(U priv, v; S1 i ) for S1 i independent of U priv means wi = U privvpriv i reliably transforms a fresh feature vector x Dx i into a scalar x wi that is close to the noisy label x U v i + ζ when n, m are sufficiently large.

Recall in Assumption 4 we assume there exist Λ, λ > 0 such that Λ σmax, σmin, λ.

Theorem 10. Suppose all conditions of Lemma 9 hold with η = 1 2Λ2 , T = Θ Λ2 log(n3)

λ2 , and that Assumption 4 holds

as well. Then, U priv and V priv, the outputs of Algorithm 1, satisfy

L(U priv, V priv; D) L(U , V ; D)

(R2 + Γ2)Γ4Λ2d2k

n2ϵ2σ4 min, λ2 + (R2 + Γ2)Γ4Λ2d

nmσ4 min, λ2

with probability at least 1 O(T n 10). Furthermore, Algorithm 1 is (ϵ, δ)-user-level DP in the billboard model.

Note our results in Theorem 10 can also be extended to the case where new clients share the same embedding U . This is formally presented in the following corollary.

Corollary 11. Suppose all conditions of Theorem 10 hold. Consider a new client with a dataset Sn+1 Dm n+1, where Assumptions 5 and 6 hold with v n+1 2 Γ. Let U priv be the output of Algorithm 1 and vpriv n+1 = arg minv Rk ˆL(U priv, v; Sn+1), if m k log m . We have

L(U priv, vpriv n+1; Dn+1) L(U , v n+1; Dn+1)

(R2 + Γ2)Γ4Λ2d2k

n2ϵ2σ4 min, λ2 + (R2 + Γ2)Γ4Λ2d

nmσ4 min, λ2

with probability at least 1 O(T n 10 + m 100).

Initialization. We require dist(U0, U ) in Lemma 9 to be bounded away from 1 otherwise the bound is trivial since the first term becomes 1 when c = 0. By definition of the principal angle (2), dist(U0, U ) = 1 when the column space of U0 and U are orthogonal. For now, we assume

Private Model Personalization Revisited

such a U0 whose column space overlaps modestly with that of U in the sense of dist(U0, U ) < 1 c can be found with reasonable effort. In the next subsection we give an efficient algorithm that guarantees such an initialization.

3.3 Private Initialization Algorithm

In this section, we introduce an initialization algorithm for our private Fed Rep algorithm (Algorithm 1). Our initialization algorithm is based on the estimator in (Duchi et al., 2022) and the privacy guarantee is achieved by adding Gaussian noise to the estimator before using top-k-SVD. Our private initialization algorithm achieves the same convergence rate as the one in (Jain et al., 2021). But unlike (Jain et al., 2021), which is limited to i.i.d. standard Gaussian feature vectors, the guarantee for our algorithm holds in a broader class of sub-Gaussian feature vectors. The detailed steps of the algorithm are provided in Algorithm 2, while its privacy and utility guarantees are established in Lemma 12.

Algorithm 2 Private Initialization for Private Fed Rep

Require: Si = {(xi,1, yi,1), . . . , (xi,m/2, yi,m/2)} data for users i [n], privacy parameters ϵ, δ, clipping bound ψinit, rank k

1: Let ˆσinit ψinit q

2 log( 1.25

δ ) nϵ 2: Let ξinit N d d( 0, ˆσ2 init) 3: for Clients i [n] in parallel do 4: Send Zi 2 m(m 1) P

j1 =j2 yi,j1yi,j2xi,j1x i,j2 to server 5: end for 6: Server aggregates Zi and add noise for privatization

i=1 clip(Zi, ψinit) + ξinit

7: Server computes

Uinit DU init rank-k-SVD( ˆZ)

8: Return: Uinit

Lemma 12. Suppose that Assumptions 5 and 6 hold. Let Uinit be the output of Algorithm 2. Then, by setting ψinit = e O((R2 + Γ2)d), we have

dist(Uinit, U ) e O

(R2 + Γ2)d3/2

nϵσ2 min, +

(R2 + Γ2)Γ2d

with probability at least 1 O(n 10). Furthermore, Algorithm 2 is (ϵ, δ) user-level DP.

By using Lemma 12 in conjunction with Theorem 10 and

an iteration count T = Θ Λ2 log(R2d/k)

, we are able to

achieve the same utility guarantee as Theorem 10 without any assumptions on dist(Uinit, U ).

Comparison with (Jain et al., 2021): Under conventional assumptions in prior work (Tripuraneni et al., 2021) to normalize V and SG(R) so that both Γ and R are eΘ(1), the Priv-Alt Min algorithm in (Jain et al., 2021) achieves an excess risk bound of e O d3k2

n2ϵ2σ4 min, + d nmσ4 min,

m for Gaussian data. Meanwhile, our Private Fed Rep algorithm achieves a rate of e O d2kΛ2

n2ϵ2σ4 min, λ2 + dΛ2

nmσ4 min, λ2 + k m.

When Λ λ (an upper bound on the condition number of 1 n V ) is e O(1), which is a regime considered in (Jain et al., 2021) and assumed in (Collins et al., 2021), we improve the privacy error term in (Jain et al., 2021) by a factor of e O(dk).

More generally, in the regime where n =

e O min dkm

d3km ϵ2σ4 min, , q

d2mΛ2 ϵ2σ4 min, λ2

, our bound is

tighter than the bound of (Jain et al., 2021) by a factor of e O dkλ2

dk). Meanwhile, in the regime

where n = e O min d2k2m

d3km ϵ2σ4 min, , dΛ2

kσ4 min, λ2

and n = eΩ dkm

ϵ2 , we achieve a bound tighter than that

of (Jain et al., 2021) by a factor of e O d2k2mλ2

Λ λ = e O dk

Additionally, our work improves over that of (Jain et al., 2021) in two important respects. First, it requires less restrictive data assumptions (sub-Gaussian instead of Gaussian feature vectors). Second, our algorithm can be naturally written as a federated algorithm while the algorithm in (Jain et al., 2021) requires centralized processing and transforming it to a federated algorithm is infeasible.

3.4 Synthetic Data Experiment

The results in Figure 1 are obtained via data features from N(0, Id) with problem parameters n = 20, 000, d = 50, k = 2, and m = 10. Our data labels are generated as in Assumption 6 given label noise sampled from N(0, R2) with R = 0.01. We use local GD and non-private Fed Rep as baselines for our comparison. See Appendix B.4 for details1.

4 Margin-based Dimension Reduced Construction for Classification

We now introduce our information-theoretic guarantees for private representation learning for classification problem with margin loss. Our approach is based on the observation that learning a shared linear representation can be framed as

1Note as well this Git Hub repository with a copy of our code.

Private Model Personalization Revisited

Figure 1: Graph of population MSE over choice of privacy parameter ϵ [1, 8] for synthetic data comparing Algorithm 1 to Priv-Alt Min in (Jain et al., 2021).

training a 2-layer linear neural network, enabling the use of the Johnson-Lindenstrauss (JL) transform (Johnson, 1984) for dimensionality reduction.

We demonstrate this utility guarantee as follows. First, we introduce the assumptions and definitions that hold for this section. This includes the definition for the Johnson Lindenstrauss transform along with a randomized instantiation. Subsequently, we describe the score function and covering argument required to combine dimension reduction with the exponential mechanism. We then introduce Algorithm 3 that exploits the margin loss to attain our main result of the section, Theorem 17. Finally, we discuss the relevance of the main result and its independence from the data dimension d.

Let k d. By applying a linear transformation of the data via an appropriately chosen matrix M Rk d, we show that learning personalized parameters can support further dimension reduction. This is done by first learning e U Rk k over the linearly transformed data in Rk then constructing the final shared embedding M e U Rd k. We obtain utility guarantees independent of the data dimension d via this process of first learning a low-dimensional embedding then mapping it to Rd k.

In this section, we study the classification setting. Let X be the d-dimensional Euclidean ball with constant radius r > 0 centered at the origin and Y = { 1, 1}. Define that data space Z = X Y. Let U Rd k be the space of d k matrices with orthonormal columns and V Rn k be the space of matrices with rows bounded in Euclidean norm by a universal constant Γ > 0. The data distributions for our n users D1, . . . , Dn are possibly distinct distributions over X Y. We further define the distribution sequence D = (D1, . . . , Dn).

Definition 13. Let (U, v) Rd k Rk and (x, y)

Rd { 1, 1} any data point. We define the margin loss as ℓρ(U, v, z) = 1 [y x, Uv ρ] and denote the 0-1 loss ℓ(U, v, z) = ℓ0(U, v, z) = 1 [y x, Uv 0].

Let U U, v Rk, and S = (z 1 . . . , z m) Zm. We define b Lρ(U, v; S ) = 1 m Pm j=1 ℓρ(U, v, z j) for any U, v, S . Suppose V = [v1, . . . , vn] V and let S = (S1, . . . , Sn) be a sequence of datasets Si Zm for all i [n]. We further define b Lρ(U, V ; S) = 1

n Pn i=1 b Lρ(U, vi; Si) for all U, V, S. Our goal is to privately optimize the population loss L(U, V ; D) = 1 n Pn i=1 EZi Di [L(U, v1, Zi)] over U, V with personalization of V . We accomplish this via an application of the exponential mechanism over a cover of a dimension-reduced version of U.

Definition 14. Let G Rd be any set of t vectors. Fix τ, β (0, 1). We call the random matrix M Rk d a (t, τ, β)-Johnson-Lindenstrauss (JL) transform if for any u, u G

| Mu, Mu u, u | τ u 2 u 2

with probability at least 1 β over M.

The JL transform is a popular and efficient method of dimension reduction whose existence is ensured by the Johnson-Lindenstrauss lemma (Johnson, 1984; Nelson, 2020; Woodruff et al., 2014). Via a JL transform M we are able to preprocess a feature vector x into a low-dimensional vector Mx Rk for use in our federated learning problem. This preprocessing showcased in Algorithm 3 effectively reduces the dimension of our learning problem.

Lemma 15. (Bassily et al., 2022) Let τ, β (0, 1). Take G Rd to be any set of t vectors. Setting k =

for a k d matrix M with entries drawn

uniformly and independently from n 1

k o implies that M

is a (t, τ, β)-JL transform.

Suppose M Rk d is a random matrix with entries drawn uniformly from n 1

k o . Define UM = {MU : U U}. Let BF be the k k-dimensional Frobenius ball of radius

2k. We define N γ to be a Frobenius norm γ-cover of BF . Assuming that γ 1, we have UM BF with high probability. That is, for any U UM there exists e U N γ

where U e U γ. Moreover, |N γ| O

Algorithm 3 first takes as input the user data sequence S and score function f. It then constructs a JL transform M in the sense of Lemma 15. Algorithm 3 reduces the dimensionality of the users data in S by applying M to each feature vector in each user s dataset. The exponential mechanism is run over a cover N γ using a score function f( , SM), which will be defined shortly, to obtain a low-dimensional shared

Private Model Personalization Revisited

embedding e U Rk k. This M e U Rd k is used by Algorithm 3 where the final embedding U priv = M e U and local vectors V priv = [vpriv 1 , . . . , vpriv n ] are computed by the users via U priv.

Remark: Our algorithm functions as an improper learner in the sense that the learned shared representation U priv is not necessarily an orthonormal matrix. However, this does not affect the performance of the final user-specific classifiers in terms of expected loss. In particular, Theorem 17 establishes that the expected loss incurred by our algorithm remains close to that of an optimal orthonormal shared representation, ensuring that the lack of orthonormality does not degrade utility.

Assume for simplicity that m is even. We partition Si = S0 i S1 i where S0 i = {z1,j, . . . , z m

2 ,j} and S1 i = {z m

2 +1,j, . . . , zm,j} for each i [n]. Further, we denote St = (St 1, . . . , St n) where t {0, 1}. Suppose that S = (S1, . . . , Sn) Znm(d+1) is a sequence of n datasets with m samples each. Let SM = ((S1)M, . . . (Sn)M) where (Si)M = {(Mxi,j, yi,j) : j [m]} for all i [n]. The score function for Algorithm 3 is f(e U, SM) = 1

n Pn i=1 min vi Γ b Lρ(e U, vi; (Si)M) for e U N γ. Since we provide a user-level privacy guarantee, we must bound the sensitivity of f over neighboring sequences S, S where S differs from S by at most a single user s entire dataset. It can be shown that the sensitivity is bounded 1

n, which follows easily from the fact that the margin loss is bounded by 1.

Algorithm 3 Private Representation Learning for Personalized Classification Require: dataset sequences S0 and S1 of equal size, score function f(U , ) = min V V b Lρ(U , V ; ) over matrices U Rk k, privacy parameter ϵ > 0, target dimension k = O r2Γ2 log(nm/δ)

1: Sample M Rk d with entries drawn i.i.d uniformly from n 1

2: Let SM = ((S1)M, . . . , (Sn)M) where (Si)M = (Mx, y) : (x, y) S0 i for i [n] 3: Let N γ be a Frobenius norm γ-cover of BF 4: Run the exponential mechanism over N γ, privacy parameter ϵ, sensitivity 1

n, and score function f(U , SM), to select e U N γ

5: Let U priv M e U 6: Each user i [n] independently computes vpriv i arg min v 2 Γ b L(U priv, v, S1 i )

7: Return: U priv, V priv = [vpriv 1 , . . . vpriv n ]

The lemma below is an extension of a fundamental result for the exponential mechanism (Mc Sherry & Talwar, 2007).

Key to obtaining this result is that γ is both the error parameter of the JL transform and the radius of the sets in our cover N γ.

Lemma 16. Fix ϵ, ρ > 0, β (0, 1). Algorithm 3 is (ϵ, 0)- user-level DP. Sample S Dm. Then, Algorithm 3 returns U priv from input S such that

min V V L(U priv, V ; S0) min (U,V ) U V b Lρ(U, V ; S0)

+ e O r2Γ2k

with probability at least 1 β over the randomness of S and the internal randomness of the algorithm.

Theorem 17. Fix ϵ, ρ > 0, β (0, 1). Algorithm 3 is (ϵ, 0)- user-level DP in the billboard model. Sample user data S Dm. Then, Algorithm 3 returns U priv, V priv from input S such that

L(U priv, V priv; D) min (U,V ) U V b Lρ(U, V ; S0)

with probability at least 1 β over the randomness of S and the internal randomness of the algorithm.

To the best of our knowledge, the above bound is the first margin-based population loss bound that is independent of the data dimension in private personalization with shared embedding.

5 Conclusion

In this work, we revisit the problem of model personalization under user-level differential privacy (DP) in the shared representation framework. We propose a novel private federated personalization algorithm that extends the Fed Rep method while ensuring rigorous privacy guarantees. Our approach efficiently learns a low-dimensional shared embedding and user-specific local models while providing strong privacy-utility trade-offs, even under noisy labels and broader sub-Gaussian data distributions. A key contribution of our work is demonstrating that private model personalization can be achieved in a federated setting with improved risk bounds. Specifically, we show that our approach reduces the privacy error term by a factor of e O(dk) in a natural parameter regime, leading to a higher accuracy in highdimensional settings. Moreover, our private initialization technique ensures a good starting point for learning shared representations, even under privacy constraints. Additionally, for binary classification, we provide an informationtheoretic construction for private model personalization that leverages dimensionality reduction techniques, and hence, derive margin-based dimension-independent risk bound.

Private Model Personalization Revisited

While our work provides significant advancements in private federated personalization, several open directions remain for future research: (i) Relaxing the identity covariance assumption: Current results assume sub-Gaussian feature distributions with identity covariance. While this assumption simplifies the analysis and aligns with prior work, it does not hold in many real-world applications. A key challenge is extending our methods to handle arbitrary covariance matrices. This could involve leveraging adaptive preconditioning techniques or covariance-aware mechanisms to improve learning under more general feature distributions. (ii) Extending beyond quadratic loss functions: Most of efficient private personalization algorithms, including ours, rely on quadratic loss (i.e., least-squares regression) due to its analytical convenience. Developing gradient-based methods for more general loss functions, such as logistic or hinge losses, remains an important open problem. Current techniques for private optimization in these settings are either computationally expensive or lack strong utility guarantees.

Acknowledgment

The authors research is supported by NSF Award 2112471 and NSF CAREER Award 2144532.

Impact Statement

This work advances the theoretical and algorithmic foundations of privacy-preserving personalized learning, a key challenge in the deployment of trustworthy machine learning systems. In domains such as healthcare, education, and mobile applications, users data are often sensitive and highly personalized making it critical to ensure privacy while delivering individualized models.

Our contributions demonstrate that personalization and strong privacy guarantees can coexist in federated learning systems, without compromising accuracy. The proposed algorithms address limitations in prior work by supporting a broader range of user data distributions and operating in realistic federated environments. Importantly, we show that the accuracy of learned models can remain robust even in high-dimensional settings or under label noise, which frequently arise in practice.

By enabling theoretically sound, privacy-preserving personalization, this research can help guide the design of secure and reliable machine learning solutions across a wide array of socially impactful applications.

Arora, R., Bassily, R., Guzmán, C., Menart, M., and Ullah, E. Differentially private generalized linear models revisited. Advances in neural information processing systems,

35:22505 22517, 2022.

Bassily, R., Mohri, M., and Suresh, A. T. Differentially private learning with margin guarantees. Advances in Neural Information Processing Systems, 35:32127 32141, 2022.

Bietti, A., Wei, C.-Y., Dudik, M., Langford, J., and Wu, S. Personalization improves privacy-accuracy tradeoffs in federated learning. In International Conference on Machine Learning, pp. 1945 1962. PMLR, 2022.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of cryptography conference, pp. 635 658. Springer, 2016.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting shared representations for personalized federated learning. In International conference on machine learning, pp. 2089 2099. PMLR, 2021.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020.

Duchi, J. C., Feldman, V., Hu, L., and Talwar, K. Subspace recovery from heterogeneous data with non-isotropic noise. Advances in Neural Information Processing Systems, 35:5854 5866, 2022.

Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265 284. Springer, 2006.

Hsu, J., Huang, Z., Roth, A., Roughgarden, T., and Wu, Z. S. Private matchings and allocations. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 21 30, 2014.

Hu, S., Wu, Z. S., and Smith, V. Private multi-task learning: Formulation and applications to federated learning. ar Xiv preprint ar Xiv:2108.12978, 2021.

Jain, P., Netrapalli, P., and Sanghavi, S. Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pp. 665 674, 2013.

Jain, P., Thakkar, O. D., and Thakurta, A. Differentially private matrix completion revisited. In International Conference on Machine Learning, pp. 2215 2224. PMLR, 2018.

Jain, P., Rush, J., Smith, A., Song, S., and Guha Thakurta, A. Differentially private model personalization. Advances in neural information processing systems, 34:29723 29735, 2021.

Private Model Personalization Revisited

Johnson, W. B. Extensions of lipshitz mapping into hilbert space. In Conference modern analysis and probability, 1984, pp. 189 206, 1984.

Kearns, M., Pai, M., Roth, A., and Ullman, J. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pp. 403 410, 2014.

Lê Nguyen, H., Ullman, J., and Zakynthinou, L. Efficient private algorithms for learning large-margin halfspaces. In Algorithmic Learning Theory, pp. 704 724. PMLR, 2020.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Metalearning with differentiable convex optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10657 10665, 2019.

Mc Sherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 07), pp. 94 103. IEEE, 2007.

Nelson, J. Sketching algorithms, 2020.

Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? towards understanding the effectiveness of maml. ar Xiv preprint ar Xiv:1909.09157, 2019.

Stewart, G. Matrix perturbation theory. Computer Science and Scientific Computing/Academic Press, Inc, 1990.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Sample efficient linear meta-learning by alternating minimization. ar Xiv preprint ar Xiv:2105.08306, 2021.

Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision ECCV 2020: 16th European Conference, Glasgow, UK, August 23 28, 2020, Proceedings, Part XIV 16, pp. 266 282. Springer, 2020.

Tripuraneni, N., Jin, C., and Jordan, M. Provable metalearning of linear representations. In International Conference on Machine Learning, pp. 10434 10443. PMLR, 2021.

Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Woodruff, D. P. et al. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1 2):1 157, 2014.

Private Model Personalization Revisited

A Technical Lemmas 12

B Missing Proofs for Section 3 14

B.1 Main algorithm results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

B.2 Private initialization results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

B.3 Auxiliary lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

B.4 Private Fed Rep experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

C Missing Proofs for Section 4 46

Private Model Personalization Revisited

A Technical Lemmas

Proposition 18. Let X = (X1, . . . , Xa) Ra be random vector. Assume that Xp is Rp-sub-Gaussian for each p [a] and that Xp, Xq are uncorrelated for p = q. Let R = (R1, . . . , Ra). Then, for each ν Ra, the random variable ν, X is ( ν 2 maxp [a] Ra)-sub-Gaussian.

Proof. Fix an arbitrary λ R. Observe

E h eλ ν,X i = E

p=1 e(λνp)Xp #

p=1 E h e(λνp)Xpi

p=1 eλ2ν2 p R2 p

eλ2 ν 2 2 maxp [a] R2 a

where the second equality holds by uncorrelatedness, the first inequality from the definition of sub-Gaussian, and the final inequality from definition of ν 2 2.

Proposition 19. Let X = (X1, . . . , Xa) Ra be random vector with E [X] = 0. Assume Xp, Xq are uncorrelated for p = q. Suppose that ν, π Ra with ν, π = 0. Then, ν, X and π, X are uncorrelated.

Proof. By the assumptions of the above proposition

E [ ν, X π, X ] = E

p,q [a] νpπq Xp Xq

p=1 νpπp X2 p

p =q νpπq Xp Xq

p=1 νpπp + X

p =q νpπq E [Xp] E [Xq]

= E [ ν, X ] E [ π, X ]

where the last three equalities hold by ν, π = 0, the uncorrelatedness of the components of X, and the linearity of expectation.

Lemma 20 (Mc Diarmid s inequality). Let f : Rn R be a measurable function. Assume there exists a constant ci > 0 where, for all x1, . . . , xn R

sup x i R |f(x1, . . . , xi 1, xi, xi+1, . . . , xn) f(x1, . . . , xi 1, x i, xi+1, . . . , xn)| ci

for each i [n]. Suppose X1, . . . , Xn R is a sequence of i.i.d. random variables. Then, for all s > 0

P [f(X1, . . . , Xn) E [f(X1, . . . , Xn)] s] e 2s2 Pn i=1 ci .

Private Model Personalization Revisited

Definition 21. We say that a random variable X R is R-sub-exponential if

E h eλ|X|i eλR

for all λ such that 0 λ 1/R.

For the definition above, we call the smallest such R the sub-exponential norm of an R-sub-exponential random variable. Lemma 22 (Bernstein inequality). Let X1, . . . , Xn Rd be a sequence of i.i.d. R-sub-exponential random variables. Then, for any s > 0

e cn min s2

for some constant c > 0. Lemma 23 (Lemma 4.4.1 (Vershynin, 2018)). Let M Ra b be any matrix. Take Sa 1, Sb 1 to be the unit spheres in Ra, Rb, respectively. Suppose that N a, N b are Euclidean ϵ-covers over Sa 1, Sb 1. Then

M 2 1 1 ϵ max v N a,v N b v Mv .

Definition 24. A centered random vector X Ra with covariance matrix Σ is isotropic if

Σ = E XX = Ia.

Definition 25. A centered random vector X Ra is R-sub-Gaussian if

E h e X,u i e

for all u Ra.

Note that when a centered vector X Ra is 1-sub-Gaussian and has covariance matrix Ia, the components of X are 1-sub-Gaussian. We use this fact in our results in conjunction with Assumption 5. Theorem 26 (Theorem 4.6.1 (Vershynin, 2018)). Let M be an a b matrix whose rows are independent mean-zero K-sub-Gaussian isotropic random vectors in Rb. Then, for any α 0, there exists c > 0 where a c K2

b + α σmin(M) σmax(M) a + c K2

with probability 1 2e α2. Corollary 27. Let M be an a b matrix whose rows are independently drawn from N(0, ˆσ2Ia a). Then, for any α 0, we have ˆσ a

b + α σmin(M) σmax(M) ˆσ a +

with probability at least 1 2e α2.

Proof. Observe that we have M = ˆσ M for a matrix M N a b(0, 1). As well, note that M satisfies Theorem 26 with K = 1 since its rows are standard Gaussians.

For any matrix A Ra b, we also have that σp(c A) = c σp(A) for c > 0 a constant and σp(A) the p-th singular value of A. Combining the above reasoning and Theorem 26, there exists a constant c > 0 such that a c

b + α σmin( M) σmax( M) a + c

implies ˆσ a c

b + α σmin(M) σmax(M) ˆσ a + c

with probability at least 1 2e α2. Furthermore, for standard Gaussians we have c = 1.

Note that one could simply apply Theorem 26 directly to the matrix M from Corollary 27 by using K = ˆσ, but this does not give the desired result. In particular, we require that ˆσ scale both terms of the upper bound of σmax(M) in our later analysis of the Gaussian mechanism.

Private Model Personalization Revisited

B Missing Proofs for Section 3

B.1 Main algorithm results

We first restate our algorithm and main assumptions. Assumption (Restatement of Assumption 4). There are known Λ, λ > 0 such that Λ σmax, σmin, λ. Assumption (Restatement of Assumption 5). Let i [n] and x Di. For all i [n] the feature distribution Di has

covariance matrix Id and is sub-Gaussian in the sense of E e x,u e

u 2 2 2 for all u Rd. Assumption (Restatement of Assumption 6). Let SG(R) be any centered R-sub-Gaussian distribution over R. Sample ζ SG(R). For all i [n] and any data points (x, y) Di we generate the label y for the i-th user via

y = x U v i + ζ

where ζ is independent for all x Di.

Unless stated otherwise, in this section, we assume that Assumptions 4, 5, and 6 hold. Furthermore, throughout our proofs we assume n = Ω(dk). This assumption is reasonable in the context of our work since n = Ω(dk) is required for non-trivial results in our final bounds.

Algorithm 1 Private Fed Rep for linear regression

Require: Si = {(xi,1, yi,1), . . . , (xi,m, yi,m)} data for users i [n], learning rate η, iterations T, privacy noise parameter ˆσ, clipping parameters ψ, ψinit, batch size b m/2T , initial embedding Uinit Rd k

Let S0 i {(xi,j, yi,j) : j [m/2]} i [n] Let S1 i Si \ S0 i i [n] Initialize: U0 Uinit for t = 0, . . . , T 1 do

Server sends Ut to clients [n] for Clients i [n] in parallel do

Sample two disjoint batches Bi,t and B i,t, each of size b, without replacement from S0 i Update vi as vi,t arg min v Rk b L(Ut, v; Bi,t)

Compute the gradient w.r.t U i,t U b L(Ut, vi,t; B i,t)

Send i,t to server end for Server aggregates the client gradients as

i=1 clip( i,t, ψ) + ξt+1

Ut+1, Pt+1 QR( ˆUt+1)

where ξt+1 N d k(0, ˆσ2) end for Server sends U priv UT to all clients for Clients i [n] independently do

vpriv i arg minv Rk b L(U priv, v; S1 i ) i [n] end for Return: U priv, V priv [vpriv 1 , . . . , vpriv n ]

Lemma (Restatement of Lemma 7). With probability at least 1 O(T n 10), we have

i,t F e O (R + Γ)Γ

Private Model Personalization Revisited

for all i [n] and all t [T] simultaneously.

Proof. The privacy guarantee directly follows from the the privacy guarantee of Gaussian mechanism and advanced composition. In the following, we prove a high probability bound for i,t F . Let B i = {(x i,j, y i,j) : j [b]} be a data batch of user i sampled at iteration t as in Algorithm 1. Then we have

x i,j, Utvi,t+1 U v i + ζi,j x i,jv i,t+1

Denote u = Utvi,t+1 U v i . Throughout the proof we condition on the event

E = |ζi,j| R p

26 log(nb), u, x i,j u 2 p

26 log(nb), and vi,t+1 2 5

4Γ for all (i, j) [n] [b]

which holds by the sub-Gaussianity of x i,j, the independence between x i,j and Utvi,t+1 U v i , and Proposition 42 with probability at least 1 O(nb n 14) 1 O(n 12). Given event E, we have

x i,j, Utvi,t+1 U v i + ζi,j 5

26 log(nb).

Meanwhile, the (p, q)-element of x i,jv i,t+1 has

|(x i,j)p(vi,t+1)q| 5

26 log(nb) (1)

for all i, j with probability at least 1 O(nb n 14) by the sub-Gaussianity of the components of x i,j and conditioning on E. Therefore, taking union bound on p, q, t without conditioning, we have, with probability at least 1 O(Tdk n 12)

i,t F 82(R + Γ)Γ

for all i, t simultaneously. Finally, we bound 1 O(Tdk n 12) 1 O(T n 10).

Theorem (Restatement of Theorem 8). Algorithm 1 is (ϵ, δ)-user-level DP in the billboard model by setting ˆσ =

nϵ for some absolute constant C > 0.

Proof. Since the computation of V priv is performed by each user independently, it is sufficient to show that the computation of U priv satisfies centralized user-level DP.

Given the definition of the clipping function, for each i [n], we have clip( i,t, ψ) F ψ which implies that the

sensitivity for each update of the Ut is bounded by ψ. By taking ˆσ = C ψ

nϵ , we obtain the privacy cost of each

iteration t is ρ = O ϵ2 T ln(1/δ) under zero-concentrated differential privacy (z-CDP, (Bun & Steinke, 2016)). By the

composition property of z-CDP, we obtain the total privacy cost ρ = PT t=1 ρt = O ϵ2 ln(1/δ) . Then, we can convert the

z-CDP guarantee to DP guarantee with the privacy parameter being ρ + p

ρ ln(1/δ) = O(ϵ) with sufficiently small δ. Therefore, the computation of U priv is (ϵ, δ)-DP and the algorithm is (ϵ, δ)-DP in the billboard model.

For the rest of our proof we condition on the event of clipping does not affect the original gradient norm in Algorithm 1. By Lemma 7, this event has probability at least 1 (T n 10) for all i, t simultaneously when selecting ψ = e O (R + Γ)Γ

Let v1,t, . . . , vn,t be the sequence of local user parameters generated at iteration t of Algorithm 1. Define the matrix ΣVt = 1

n Pn i=1 vi,tv i,t. Take B t = (B 1,t, . . . , B n,t) to be the sequence of batches from of Algorithm 1 used to update Ut at iteration t for each user i. Assume that we re-index the batches B i,t = {(xt i,j, yt i,j) : j [b]} for each i, t. Let Ut+1, Pt+1 be the matrices from the QR decomposition, and

i=1 i,t = 1

j=1 Uℓ(Ut, vi,t, (xt i,j, yt i,j)) Rd k

Private Model Personalization Revisited

the aggregated gradient, all computed in Algorithm 1. For ease of notation we also drop the iteration index t on the data points (xt i,j, yt i,j). Recall that η is the fixed stepsize for Algorithm 1.

Lemma 28. If Pt+1 is invertible, then

dist(Ut+1, U ) P 1 t+1 2 Ik ηΣVt 2 dist(Ut, U ) + η t EB t [ t] 2 + η ξt 2 .

Proof. Let ˆUt+1 be the t-th iterate of Algorithm 1 before QR decomposition. Observe

(U ) ˆUt+1 2 = (U ) (Ut η t + ηξt) 2

(U ) (Ut η t + ηEB t[ t] ηEB t[ t]) 2 + η ξt 2

(U ) (Ut ηEB t[ t]) 2 + η t EB t[ t] 2 + η ξt 2

since (U ) 2 = 1. Then, since the data has covariance matrix Id and label noise ζi,j that satisfies E[ζi,j] = 0

EB t[ t] = 1

j=1 EB t ((x i,j) Utvi,t (x i,j) U v i + ζi,j)x i,jv i,t

j=1 EB t x i,j(x i,j) (Utvi,t U v i )v i,t

j=1 EB t (Utvi,t U v i )v i,t

j=1 Utvi,tv i,t U v i v i,t

where the second equality holds because (x i,j) Utvi,t (x i,j) U v i + ζi,j is a scalar and the fourth equality by the fact that B t is independent of Ut, Vt. Hence

j=1 (U ) (Utvi,tv i,t U v i v i,t) = 1

j=1 (U ) Utvi,tv i,t = (U ) Ut 1 n

i=1 vi,tv i,t

since (U ) U = 0. Then, for ΣVt = 1

n Pn i=1 vi,tv i,t

(U ) (Ut ηEB t+1[ t]) 2 = (U ) Ut(Ik ηΣVt) 2 Ik ηΣVt 2 (U ) Ut 2.

Hence (U ) ˆUt+1 2 Ik ηΣVt 2 (U ) Ut 2 + η t EB t+1[ t] 2 + η ξt 2. (2)

Now, assuming that Pt+1 is invertible

(U ) Ut+1 2 = (U ) ˆUt+1P 1 t+1 2

P 1 t+1 2 (U ) ˆUt+1 2.

So (U ) Ut+1 2 P 1 t+1 2 (U ) ˆUt+1 2. (3)

Combining (2) and (3) finishes the proof.

Our proof for our main result follows from bounding the terms and factors on the right-hand side of the inequality in Lemma 28; namely

dist(Ut+1, U ) P 1 t+1 2 Ik ηΣVt 2dist(Ut, U ) + η t EB t [ t] 2 + η ξt 2 . (4)

Private Model Personalization Revisited

The above inequality has four important components. The term t EB t [ t] 2 is the deviation of the aggregated gradient from its mean and hence can be bounded via concentration inequalities (see Proposition 30) and the term ξt 2 is the magnitude of the Gaussian noise added for privacy, which is not difficult to bound as well (see Proposition 29).

Deriving bounds on P 1 t+1 2 and Ik ηΣVt 2dist(Ut, U ) in (4) is less obvious and require some extra work. We can understand Ik ηΣVt 2 as a measurement of how close ΣVt is to η 1Ik. So, when Ik ηΣVt 2 is small, the vectors v1,t, . . . , vn,t are evenly distributed in Rk with no bias in any direction. For example, if v1,t, . . . , vn,t are independent with vi,t N(0, η 1Ik) for each i, the matrix Ik ηΣVt = Ik η

n Pn i=1 vi,tv i,t will concentrate around the (d k)-dimensional zero matrix.

The quantity P 1 t+1 2 is more subtle. This quantity is a result of relating dist(Ut+1, U ) and dist(Ut, U ) via (U ) ˆUt+1 as in our proof of Lemma 28. We can interpret P 1 t+1 2 as quantifying how far ˆUt+1 is from having orthonormal columns. As well, we require that P 1 t+1 exists in general because the principal angle has dist(M, M ) = 1 whenever rank(M) = rank(M ).

Recall that ψ is the gradient clipping parameter in Algorithm 1. Proposition 29. For all t, we have

d T log(n) log(1/δ)

with probability at least 1 O(n 10).

Proof. Let ˆσ = ψ

2T log(1.25/δ)

nϵ as in the statement of Theorem 8. Let α > 0. Then, by Corollary 27

k + α σmin(ξt) σmax(ξt) ˆσ

with probability at least 1 2e α2. Choosing α = 10d log n gives us

k + 10d log n p

2T log(1.25/δ)

d T log(n) log(1/δ)

with probability at least 1 2e 10d log n.

Proposition 30. For any t, we have

t EB t [ t] 2 e O

η2(R2 + Γ2)Γ2d

with probability at least 1 O(n 10).

Proof. Let N d and N k be Euclidean 1

4-covers of the d and k-dimensional unit spheres, respectively. Then, by Lemma 23 we have t EB t [ t] 2 4

3 max a N d,b N k a t EB t [ t] b.

Now a t EB t [ t] b

x i,j, Utvi,t U v i + ζi,j a, x i,j vi,t, b

j=1 EB t x i,j, Utvi,t U v i + ζi,j a, x i,j vi,t, b .

Private Model Personalization Revisited

We condition on vi,t 2 5

4Γ for each i, which has probability at least 1 O(n 14) by Proposition 42. Then, we have that

2 nb x i,j, Utvi,t U v i + ζi,j a, x i,j vi,t, b EB t x i,j, Utvi,t U v i + ζi,j a, x i,j vi,t, b

are centered, independent sub-exponential random variables when also conditioning on Ut. The sub-exponential norm of these random variables is 5

4 2 3(R+Γ)Γ

nb . Then, given Ut, Vt are independent of B t and assuming nb 20(d + k) log n, we get from Lemma 22

a t EB t [ t] b e O

η2(R2 + Γ2)Γ2d

with probability at least 1 O(n 10) . This bound holds unconditionally since vi,t are bounded with probability at least 1 O(n 10) for each i. Thus

t EB t [ t] 2 e O

η2(R2 + Γ2)Γ2d

with probability at least 1 O(n 10) by the union bound.

In Lemma 9 and Theorem 10, we assume there exist c0, c1 > 1 such that

c1 max{ ϵ,δ, 1}(R + Γ)Γd

km log2(nm) + R2Γ2dσ2 min, log(nm) ,

mΓ2σ2 max, c0 max{R2, 1} max{Γ2, 1}γ4kΓ2 log2 n + R2kσ2max, log(nm)

This relationship between T and the problem parameters is equivalent to setting

along with assuming the following two conditions on m and n:

Assumption 31. Let E0 = 1 dist2(U0, U ) > 0. We assume there exists some constant c0 > 1 where

max{R2, 1} max{Γ2, 1}γ4k log2 n

E2 0σ2max, + R2k

Γ2 log(nm) T. (8)

Assumption 32. Let ϵ,δ = C

log(1.25/δ)

ϵ for some constant C > 0 and E0 = 1 dist2(U0, U ). For the user count n, we assume, for some constant c1 > 1

max { ϵ,δ, 1} (R + Γ) Γd

k log2(nm) E2 0λ2 + R2Γ2d log(nm)

Our Assumptions 31 and 32 are used repeatedly throughout Appendix B. These conditions are easily leveraged as individual assumptions on m and n, unlike the upper bound on T. Including the equality (7) allows us to contain all required conditions on T, m, and n to one convenient setting.

Lemma 33. Let E0 = 1 dist2(U0, U ) and ψ = e O (R + Γ)Γ

dk . Suppose Assumption 31 and 32 hold. Then, for any iteration t, we have that Pt+1 is invertible and

1 ησ2 min, E0 2 log n

with probability at least 1 O(n 10).

Private Model Personalization Revisited

The proof of Lemma 33 is deferred to Appendix B.3.

Lemma 34. Let η 1 2σ2max, , n e8, E0 = 1 dist2(U0, U ), and ψ = e O (R + Γ)Γ

dk . Suppose Assumptions 31

and 32 hold along with T = log n ησ2 min, . Then, for all t, we have

P 1 t+1 2 Ik ηΣVt 2

1 ησ2 min, E0

with probability at least 1 O(n 10).

Proof. We will show that

Ik ηΣVt 2 1 ησ2 min

1 ησ2 min, E0

with high probability for each t. Then selecting n e8 and using Lemma 33 implies the product P 1 t+1 2 Ik ηΣVt 2 satisfies the lemma statement.

Note that ΣVt = 1 n V t Vt = 1 n Pn i=1 vi,tv i,t and define σmin,Vt = σmin 1 n Vt . Given Vt = V (U ) Ut F by

Lemma 39 and V and (U ) U are full rank, we have

1 n V (U ) Ut

1 n V σmin (U ) Ut σmax

Now, via the argument of Lemma 39, we have, with probability at least 1 O(n 10)

1 n F ντk 1 τk σmax, +

26R2 log(nb)

(1 τk)2b (10)

for τk = cτ q

b . Recall that γ = σmax,

σmin, and ν = Γ σmax, . Then, there exists a constant ˆc > 0 such that, by Assumption 31 with c0 16ˆc2, we have

ντk 1 τk σmax, ˆc

E2 0σ2max, c0γ4 log n σmax, E0 4 p

γ4 log n σmin, E0

4 log n (11)

26R2 log(nb)

(1 τk)2b ˆc

E2 0σ2max, c0γ4k log n σmax E0 4 p

γ4k log n σmin, E0

4 log n (12)

since m c0 max{R2,1} max{Γ2,1}γ4k log2 n

Furthermore σmin (U ) Ut q

1 (U ) Ut 2 2 = q

1 dist2(Ut, U )

Therefore, we obtain

σmin,Vt σmin

1 n V σmin (U ) Ut σmax

1 dist2(Ut, U ) E0 2 log n

Private Model Personalization Revisited

with probability at least 1 O(n 10). Thus

1 ησ2 min,Vt 1 ησ2 min,

1 dist2(Ut, U ) E0 2 log n

Claim 1: We have Ik ηΣVt 2 1 ησ2 min,Vt.

with probability at least 1 O(n 10).

We prove Claim 1 by showing 1 2σ2max, 2 σ2 max,Vt+σ2 min,Vt , which implies

Ik ηΣVt 2 max |1 ησ2 max,Vt|, |1 ησ2 min,Vt| 1 ησ2 min,Vt.

First, we reuse

1 n F ντk 1 τk σmax, +

26R2 log(nb)

as in (10). By Lemma 39 and the triangle inequality along with the submultiplicativity of the spectral norm

σmax,Vt = σmax

1 n V (U ) Ut

σmax, + σmax

σmin,Vt σmax, + σmax

1 n F . (14)

By (11) and (12), we have

1 n F σmax,

2 and thus via (14) σ2 min,Vt 2σ2 max, .

Since σmin,Vt σmax,Vt by definition, we have σ2 min,Vt + σ2 max,Vt 4σ2 max, . This gives us 1 2σ2 max, 2 σ2 max,Vt+σ2 min,Vt with the required probability. This proves Claim 1.

Next, we will prove the following claim using induction.

Claim 2: Let α = 2 5T

log(nm) 1 25T 2log(nm). We have q

1 dist2(Ut, U ) p

(1 tα)E0 for each iteration t [T].

Base case: when t = 0, the inequality q

1 dist2(U0, U ) E0 is clearly true since E0 = 1 dist2(U0, U ).

Inductive hypothesis: we assume q

1 (U ) Ut 2 2 p

(1 tα)E0 for some t [T].

Inductive Step: note tα < 1

2 for any t [T]. Our assumption on q

1 (U ) Ut 2 2, (13), and assuming n e E0 imply

σ2 min,Vt σ2 min,

(1 tα)E0 E0 2 log n

Now, since η 1 2σ2 max, 2 σ2 max,Vt+σ2 min,Vt , Weyl s inequality and (15) give us

Ik ηΣVt 2 1 ησ2 min,Vt 1 ησ2 min,

(1 tα)E0 E0 2 log n

Private Model Personalization Revisited

By Lemma 33

1 ησ2 min, E0 2 log n

with probability at least 1 O(n 10). This event is a superset of the probabilistic bounds from earlier in this lemma.

(U ) Ut+1 2

1 ησ2 min, E0 2 log n

(1 tα)E0 E0 2 log n

η2(R2 + Γ2)Γ2d

with probability at least 1 O(n 10). Assuming n e E0/(

2 1/2)2, we have

(1 tα)E0 E0 2 log n

2 1 ησ2 min,

E0/2 E0 2 log n

2 1 ησ2 min, E0

As well, if n e8, we have

1 ησ2 min, E0 2 log n

(1 tα)E0 E0 2 log n

1 ησ2 min, E0 2 log n

E0/2 E0 2 log n

1 ησ2 min, E0 2 log n

1 ησ2 min, E0

1 ησ2 min, E0

1 ησ2 min, E0

1 ησ2 min, E0

We now use the lower bound assumptions of the user count n and data sample count m. Recall the exact statements of Assumption 31 and 32. That is, there exist c0, c1 > 1 such that

max{R2, 1} max{Γ2, 1}γ4k log2 n

E2 0σ2max, + R2k

Γ2 log(nm) T

max { ϵ,δ, 1} (R + Γ) Γd

k log2(nm) E2 0λ2 + R2Γ2d log(nm)

The problem parameters that lower bound m, n now come into use during this proof. Furthermore, recall that by Assumption 4 we know some λ > 0 such that λ σmin, . Then, by Assumptions 31 and 32 along with n e8 and T = log n

ηλ2 , there

Private Model Personalization Revisited

exists a constant ˆc > 0 such that

(U ) Ut+1 2

1 ησ2 min, E0

2 (U ) Ut 2

1 ησ2 min, E0 2 log n

2 η(R + Γ)Γd p

Tk log(1/δ) log(nb)

1 ησ2 min, E0 2 log n

η2(R2 + Γ2)Γ2d log n

1 ησ2 min, E0

2 (U ) Ut 2 + O

η(R + Γ)Γd p

Tk log(1/δ) log(nb)

η2(R2 + Γ2)Γ2d log(nm)

1 ησ2 min, E0

2 (U ) Ut 2 + O

T c1T log(nm)

E4 0η2(R + Γ)Γσ2max, λ2T

c0c1 max{R2, 1} max{Γ2, 1}γ4k 3 2 T 2 log(nm) log2 n

1 ησ2 min, E0

2 (U ) Ut 2 + O

ησmin, E2 0 c1T p

E4 0η(R + Γ)Γσ2max, log n

c0c1 max{R2, 1} max{Γ2, 1}γ4k 3 2 T 2 log(nm) log2 n

1 ησ2 min, E0

2 (U ) Ut 2 + ˆc E2 0 c1γT p

log(nm) + ˆc

E4 0 c0c1γ4k 3 2 T 2 log(nm) log n

since η 1 2σ2max, and λ σmin, .Recall that γ, k 1. Using the choice of c0, c1 10ˆc in (17), we have

(U ) Ut+1 2

1 ησ2 min, E0

2 (U ) Ut 2 + E2 0 10γT p

E4 0 100γ4k 3 2 T 2 log(nm) log n

(U ) Ut 2 + E2 0 5T p

Thus, given E0 1

1 (U ) Ut+1 2 2 1

(U ) Ut 2 + E2 0 5T p

= 1 (U ) Ut 2 2 2E2 0 (U ) Ut 2 2 5T p

log(nm) E4 0 25T 2log(nm)

1 (U ) Ut 2 2 E0

log(nm) + 1 25T 2 log(nm)

Then, by our assumptions that q

1 (U ) Ut 2 2 p

(1 tα)E0, we have

1 (U ) Ut+1 2 2 1 (U ) Ut+1 2 2 αE0 (1 tα)E0 αE0 = (1 (t + 1)α)E0.

Private Model Personalization Revisited

Thus, by inductive hypothesis

σmin (U ) Ut q

1 dist2(Ut, U ) p

for any t 0. This proves Claim 2.

Observe that tα Tα = T 10T

log(nm)+1 25T 2 log(nm) < 1

2 for all t. Using this, Claim 2 implies

σmin (U ) Ut q

1 dist2(Ut, U ) p

for each t 0. Via (18) and n e8 along with η 1 2σ2max,

P 1 t+1 2 Ik ηΣVt 2

1 ησ2 min, E0

with probability at least 1 O(n 10) for any t [T 1]. This proves the lemma.

Lemma (Restatement of Lemma 9). Let η 1 2σ2max, . Suppose 1 dist2(U0, U ) c for some constant c > 0. Set

the clipping parameter ψ = e O (R + Γ)Γ

dk . Assume T = log n

ηλ2 = e O min nmλ2

max{ ϵ,δ,1}(R+Γ)Γd

km+R2Γ2dσ2 min, ,

mΓ2σ2 max, max{R2,1} max{Γ2,1}γ4kΓ2+R2kσ2max,

and Assumptions 5 and 6 hold for all user data. Set ˆσ as in Theorem 8 and batch

size b = m/2T . Then, U priv, the first output of Algorithm 1, satisfies

dist(U priv, U )

1 cησ2 min, 4

2 dist(U0, U ) + e O

k T nϵσ2 min, +

(R2 + Γ2)Γ2d T

with probability at least 1 O(T n 10).

Proof. By Lemma 28

(U ) Ut+1 2 P 1 t+1 2 Ik ηΣVt 2 (U ) Ut 2 + η t EB t [ t] 2 + η ξt 2 . (19)

We must bound all three terms on the right-hand side of (19). Combining (19) and Proposition 30

(U ) Ut+1 2 P 1 t+1 2

Ik ηΣVt 2 (U ) Ut 2 + e O

η2(R2 + Γ2)Γ2d

with probability at least 1 O(n 10). Combining our choice of ψ with Proposition 29, via the union bound

(U ) Ut+1 2 P 1 t+1 2

Ik ηΣVt 2 (U ) Ut 2 + e O

η2(R2 + Γ2)Γ2d

for all t [T 1] simultaneously with probability at least 1 O(T n 10).

To bound the right-hand side of the above inequality, we need to bound P 1 t+1 2. By Lemma 33, with probability at least

1 O(T n 10), we have P 1 t+1 2 1 ησ2 min, E0 2 log n

2 = O(1) for all t simultaneously, where the last equality follows

from the fact that η 1 2σmax, , E0 1, and log n 1 (assuming, without loss of generality, that n 2). Further, by Lemma 33

(U ) Ut+1 2

1 ησ2 min, E0

2 (U ) Ut 2 + e O

η2(R2 + Γ2)Γ2d

Private Model Personalization Revisited

for all t simultaneously with probability at least 1 O(T n 10).

Now, what remains is to obtain the tightest possible bound on the recursion from Lemma 22 via the summation of a geometric sum. Since dist(U , Ut) = (U ) Ut 2 1, we have, for all t simultaneously

dist(U , Ut+1)

1 ησ2 min, E0

2 dist(U , Ut) + e O

η2(R2 + Γ2)Γ2d

1 ησ2 min, E0

2 dist(U , Ut)

1 ησ2 min, E0

2 η(R + Γ)Γd

η2(R2 + Γ2)Γ2d

1 ησ2 min, E0

2 dist(U , Ut)

1 ησ2 min, E0/4

η2(R2 + Γ2)Γ2d

1 cησ2 min, 4

2 dist(U , U0) + e O

k T nϵσ2 min, +

(R2 + Γ2)Γ2d

with probability at least 1 O(T n 10), given that 1 1

x 2x for any x > 1 and since there exists c > 0 such that

E0 c. Selecting batch size b = m/2T completes the proof of the lemma.

Lemma 35 (Adaptation of Theorem 4.2 (Jain et al., 2021)). Let X be a matrix with rows sampled from Di and y the vector of its labels generated according to Assumption 6. Suppose that (x, y) with x Di is a data point independent of X with label vector y. Take U to be any matrix with orthonormal columns and v arg minv v U X y 2 2. Then

E ℓ( v, U, (x, y)) ℓ(v , U , (x, y)) Γ2 ( U U I)U 2 2 + R2k

Proof. Assume that X Dm i is a data matrix with label vector y generated as in Assumption 6 using noise vector ζ SG(R2)m. Let x Di be a data point that is independent of X. By definition of our loss function ℓand data generation method

Ex Di,ζ [ℓ(v, U, (x, y))] = Ex Di h x Uv x U v + ζ 2i

= (Uv U v ) Ex Di xx (Uv U v ) + E[ζ2]

= Uv U v 2 2 + E[ζ2]

for any fixed U Rd k, v Rk. Since v arg minv v U X y 2 2 we have v = U XX U 1 U X y, where this

Private Model Personalization Revisited

inverse exists by our assumption that k m. Then

E h U v U v 2 2

i = E U U XX U 1 U X y U v 2

= E U U XX U 1 U XX U v U v + U U XX U 1 U X ζ 2

E h U U U v U v 2

= U U U v U v 2

Γ2 ( U U I)U 2

where the first inequality follows from the fact that U XX U 1 U XX U = I implies U = U XX U 1 U XX by orthonormality of the columns of U and the expected mean square estimation error of sub-Gaussian noise. Combining our two inequalities with E [ℓ(v , U , (x, y))] = E[ζ2] completes the proof.

Theorem (Restatement of Theorem 10). Suppose all conditions of Lemma 9 hold with η = 1 2Λ2 , T = Θ Λ2 log(n3)

that Assumption 4 holds as well. Then, U priv and V priv, the outputs of Algorithm 1, satisfy

L(U priv, V priv; D) L(U , V ; D)

(R2 + Γ2)Γ4Λ2d2k

n2ϵ2σ4 min, λ2 + (R2 + Γ2)Γ4Λ2d

nmσ4 min, λ2

with probability at least 1 O(T n 10). Furthermore, Algorithm 1 is (ϵ, δ)-user-level DP.

Proof. Via Lemma 9 and η 1 2σ2max,

dist(U priv, U )

1 cησ2 min, 4

2 dist(U0, U ) + e O

k T nϵσ2 min, +

(R2 + Γ2)Γ2d T

with probability at least 1 O(T n 10). Note that dist(UT , U ) = (UT U T I)U 2. Applying Lemma 35 and plugging our choice of T in this bound finishes the proof.

Theorem 36. (Theorem 4 (Tripuraneni et al., 2021)) Given a new user with a dataset Sn+1 of size m2 whose elements sampled from distribution Dn+1 where assumptions 6 and 5 hold with v n+1 Γ. Then if dist(U priv, U ) δ and m2 k log m2, let vpriv n+1 = arg minv Rk b L(U priv, v; Sn+1), then we have

U v n+1 U privvpriv n+1 2 O Γ2δ2 + R2k

with probability at least 1 O(m 100 2 ) Corollary (Restatement of Corollary 11). Suppose all conditions of Theorem 10 hold. Consider a new client with a dataset Sn+1 Dm n+1, where Assumptions 5 and 6 hold with v n+1 2 Γ. Let U priv be the output of Algorithm 1 and vpriv n+1 = arg minv Rk ˆL(U priv, v; Sn+1), if m k log m . We have

L(U priv, vpriv n+1; Dn+1) L(U , v n+1; Dn+1)

(R2 + Γ2)Γ4Λ2d2k

n2ϵ2σ4 min, λ2 + (R2 + Γ2)Γ4Λ2d

nmσ4 min, λ2 + R2k

with probability at least 1 O(T n 10 + m 100).

Private Model Personalization Revisited

Proof. From the proof in Lemma 35, we have

L(U priv, vpriv n+1; Dn+1) L(U , v n+1; Dn+1) U v n+1 U privvpriv n+1 2 + E[η2] E[η2]

= U v n+1 U privvpriv n+1 2

Then by Lemma 9 and our choice of T, we obtain

dist(U priv, U ) e O

k nϵσ2 min, λ +

(R2 + Γ2)Γ2dΛ2

nmσ4 min, λ2

Then we plug the bound of dist(U priv, U ) into Theorem 36, and obtain

L(U priv, vpriv n+1; Dn+1) L(U , v n+1; Dn+1) U v n+1 U privvpriv n+1 2

(R2 + Γ2)Γ4Λ2d2k

n2ϵ2σ4 min, λ2 + (R2 + Γ2)Γ4Λ2d

nmσ4 min, λ2 + R2k

B.2 Private initialization results

Theorem 37 (Theorem 2.1 (Duchi et al., 2022)). Let M, ˆ M Rd d be symmetric matrices. Suppose p [k] for k a positive integer with k < d. Denote by λp the p-th largest eigenvalue of M. Let A, ˆA be matrices with columns the top-k eigenvectors of M, ˆ M, respectively. Then, λk λk+1 > 0 implies

dist(A, ˆA) 2 M ˆ M 2 λk λk+1 .

Proof. The theorem holds trivially when 2 M ˆ M 2 λk λk+1 > 1. So, we focus on the case 2 M ˆ M 2 λk λk+1 1.

Let ˆλp be the p-th largest eigenvalue of ˆ M. By Weyl s inequality and 2 M ˆ M 2 λk λk+1 1, we have

ˆλk+1 λk+1 M ˆ M 2 λk λk+1

This and the assumption λk λk+1 > 0 imply

λk+1 ˆλk+1 λk λk+1

2 > 0. (23)

By the Davis-Kahan theorem (Stewart, 1990)

A ˆA 2 2 M ˆ M 2 λk ˆλk+1 . (24)

Combining (23) and (24) along with A ˆA 2 = dist(A, ˆA) finishes the proof.

We use a slight adaptation of Theorem L.1 from (Duchi et al., 2022). The statement of the result below is different from the original result by a single step. This step is where the authors use the Davis-Kahan theorem to obtain a bound on the principal angle. We instead state their Theorem L.1 before applying Davis-Kahan for use in our private initialization guarantee. Theorem 38 (Adaptation of Theorem L.1 (Duchi et al., 2022)). Let S = (S1, . . . , Sn) be a sequence of datasets where Si = {(xi,1, yi,1), . . . , (xi,m/2, yi,m/2)} are sampled according to Assumptions 5 and 6 for each i [n]. Define Zi = 2 m(m 1) P

j1,j2 [m/2]:j1 =j2 yi,j1yi,j2xi,j1x i,j2, Z = 1

n Pn i=1 Zi, and Z = 1

n Pn i=1 E [Zi]. Then, we have

(R2 + Γ2)Γ2d

with probability at least 1 O(n 10).

Private Model Personalization Revisited

Algorithm 2 Private Initialization for Private Fed Rep

Require: Si = {(xi,1, yi,1), . . . , (xi,m/2, yi,m/2)} data for users i [n], privacy parameters ϵ, δ, clipping bound ψinit, rank k

1: Let ˆσinit ψinit q

δ ) nϵ 2: Let ξinit N d d( 0, ˆσ2 init) 3: for Clients i [n] in parallel do 4: Send Zi 2 m(m 1) P

j1 =j2 yi,j1yi,j2xi,j1x i,j2 to server 5: end for 6: Server aggregates Zi and add noise for privatization

i=1 clip(Zi, ψinit) + ξinit

7: Server computes Uinit DU init rank-k-SVD( ˆZ)

8: Return: Uinit

Lemma (Restatement of Lemma 12). Suppose that Assumptions 5 and 6 hold. Let Uinit be the output of Algorithm 2. Then, by setting ψinit = e O((R2 + Γ2)d), we have

dist(Uinit, U ) e O

(R2 + Γ2)d3/2

nϵσ2 min, +

(R2 + Γ2)Γ2d

with probability at least 1 O(n 10). Furthermore, Algorithm 2 is (ϵ, δ) user-level DP.

Proof. The privacy guarantee follows directly from the guarantee of Gaussian mechanism. For our utility guarantee, we condition on the event

E = n |yi,j| O((Γ + R) log(mn)), xp i,j O p

log dmn for all (i, j, p) [n] [m] [d] simultaneously o

where xp i,j represents the p-th entry of xi,j. The condition E holds with probability at least 1 O(n 10).

Conditioning on event E, we obtain Zi F e O (R2 + Γ2)d

and the clipping will not change the gradient norm. Let Z = 1

n Pn i=1 Zi, Z = 1

n Pn i=1 E [Zi]. Via Corollary 27 and the fact that the clipping does not affect the bound, we have

ˆZ Z 2 = ξinit 2 O

dˆσinit = O (R2 + Γ2)d3/2 log n log2(dmn)

with probability at least 1 2e 10 log n.

By Theorem 38, we have

(R2 + Γ2)Γ2d

with probability over 1 O(n 10). Therefore, by (25) and (26), we have

ˆZ Z 2 ˆZ Z 2 + ˆZ Z 2

(R2 + Γ2)d3/2 log n log2(dmn)

nϵ + log3(nd)

(R2 + Γ2)Γ2d

Private Model Personalization Revisited

with probability at least 1 O(n 10) via the union bound. Finally, using Theorem 37 and the fact that Z = 1 n Pn i=1(U v i )(U v i ) with (27), we obtain

dist(Uinit, U ) 2 ˆZ Z 2

(R2 + Γ2)d3/2 log n log2(dmn)

nϵσ2 min, + log3(nd)

(R2 + Γ2)Γ2d

with probability at least 1 O(n 10).

B.3 Auxiliary lemmas

The results of this section include multiple adaptations of those from (Collins et al., 2021) such as Lemma 43 and Lemma 33. Our proofs, when they are adaptations, are substantially more complex due to the addition of label noise and differential privacy to design Private Fed Rep (Algorithm 1).

This section has the following structure. We first characterize the solution of the local minimization step of Algorithm 1 in Lemma 39. Next, we introduce in Proposition 40 terms quantifying the label noise terms that periodically appear throughout our proofs. Using this proposition, we give a bound on the error from estimating v 1, . . . , v n incurred during the local minimization step of Algorithm 1 in Lemma 41. Using similar methods, in Lemma 43 we bound the spectral distance of the linear operator that defines the gradient step of Algorithm 1 from the identity operator. Finally, using all of these intermediate results allows us to prove Lemma 33, a key lemma in our main proof of Section B.1.

Take It, I t to be the index sets of our batches in Algorithm 1. Let Bi,t = {(xi,j, yi,j) : j It} and B i,t = {(x i,j, y i,j) : j I t}. We omit iterations t on our quantities for ease of notation. Further, we reindex the elements of Bi,t, B i,t to Bi,t = {(xi,j, yi,j) : j [b]} and B i,t = {(x i,j, y i,j) : j [b]}. Let Ai,j = eix i,j, A i,j = eix i,j for each (i, j) [n] It. Define A : Rn d Rnb where A(M) = ( Ai,j, M F )(i,j) [n] It for all matrices M Rn d. We analogously define the operator A with respect to the matrices A i,j.

Denote (A ) : Rnb Rn d the adjoint operator of A defined as (A ) (M) = Pn i=1 Pb j=1 A i,j, M A i,j. In this sense

(A) A : Rn d Rn d is a single operator. Furthermore, recall that ξt N(0, ˆσ2)d k and choose ˆσ = e O (R+Γ)Γ

dk nϵ . Define the following recursion from Algorithm 1.

Algorithm 1 recursion

Vt+1 arg min V Rn k 1 nb

A(V (U ) V (Ut) ) + ζ 2

nb (A ) A (Vt+1(Ut) V (U ) ) Vt+1

nb U A (Vt+1(Ut) ), ζ + ηξt

Ut+1, Pt+1 QR( ˆUt+1).

We will now state a theorem that gives an analytic form for Vt+1. Suppose p, q [d]. Let uq,t, u q be the q-th column of

Private Model Personalization Revisited

Ut, U , respectively. Define

Ai,jup,tu q,t(Ai,j) Rn n

Ai,jup,t(u q) (Ai,j) Rn n

Dp,q = up,t, u q In n Rn n

j=1 ζi,j Ai,jup,t Rn.

Take G, C, D to be nk nk block matrices with blocks Gp,q, Cp,q, Cp,q and W an nk-dimensional vector created by concatenating Wp for each p [k]. Denote ev = Vec(V ), a column vector of dimension nk.

For the following lemma we define

F = [(G 1((GD C)ev + W))1 . . . (G 1((GD C)ev + W))k] Rn k (30)

where (G 1((GD C)ev + W))p is the p-th n-dimensional block of the nk-dimensional vector G 1((GD C)ev + W).

Lemma 39. The matrix Vt+1 satisfies Vt+1 = V (U ) Ut F

at each iteration t + 1 for the error matrix F Rn k.

Proof. Note that Vt+1 minimizes the function F(V, Ut) = 2 nb A(V (U ) V (Ut) ) + ζ 2 2 and so vp F(Vt+1, Ut) = 0 for vp the p-th column of V for each p [k]. Recall our definition of Ai,j = eix i,j for ei the i-th n-dimensional standard basis vector. Given that A(V (U ) V (Ut) ) + ζ 2

2 = A(V (U ) V (Ut) ) 2

2 + 2 A(V (U ) V (Ut) ), ζ + ζ 2 2

we have for uq,t the q-th column of Ut

0 = vp F(Vt+1, Ut)

u q,t(Ai,j) vq,t+1 (u q) (Ai,j) v q Ai,jup,t + 4

nb vp A(V (U ) Vt+1(Ut) ), ζ .

Let (M) ,p be the p-th column of a matrix M. Then

vp A(V (U ) Vt+1(Ut) ), ζ = vp A(Vt+1(Ut) ), ζ

= vp ( Ai,j, Vt+1(Ut) F )(i,j) [n] It, ζ

(i,j) [n] [b] ζi,j Ai,j, Vt+1(Ut) F

(i,j) [n] [b] ζi,j Ai,j, Vt+1(Ut) +

(i,j) [n] [b] ζi,j Ai,j Ut, Vt+1

(i,j) [n] [b] ζi,j Ai,j Ut

Private Model Personalization Revisited

the p-th column of the matrix P

(i,j) [n] [b] ζi,j Ai,j Ut Rn k. Hence

Ai,jup,tu q,t(Ai,j) vq,t+1 = 1

Ai,jup,t(u q) (Ai,j) v q + 2

j=1 ζi,j Ai,jup,t.

Then, for evt+1 = (v 1,t+1, . . . , v k,t+1) Rnk and ev = ((v 1) , . . . , (v k) ) Rnk we have

evt+1 = G 1C(ev + W) = Dev G 1((GD C)ev + W)

conditioned on the event that G 1 exists. We denote F = [(G 1((GD C)ev + W))1 . . . (G 1((GD C)ev + W))k] Rn k where (G 1((GD C)ev + W))p is the p-th n-th dimensional block of the nk-dimensional vector G 1((GD C)ev + W). Recalling the definition of D, we have that Vt+1 = V (U ) Ut F.

In order to evaluate the final bounds with label noise included we must bound the following terms

1 b V A(Vt+1(Ut) ), ζ

and 1 nb U A(Vt+1(Ut) ), ζ

in spectral norm.

Proposition 40. With probability 1 O(n 11), we have

(1) 1 b V A(Vt+1(Ut) ), ζ 2

26R2n log(nb)

Furthermore, with probability at least 1 O(n 10), we have

(2) 1 nb U A(Vt+1(Ut) ), ζ 2 4

2 15R2Γ2d log n

Proof. Claim 1: With probability 1 e 11 log(nb), we have 1 b V A(Vt+1(Ut) ), ζ 2

26R2n log(nb)

Note that for any given p [n] and q [k] 1

b V A(Vt+1(Ut) ), ζ

b V ( eix i,j, Vt+1(Ut) F )(i,j) [n] [m], ζ

(i,j) [n] [m] ζi,j V ( eix i,j, Vt+1(Ut) F )(i,j)

(i,j) [n] [m] ζi,j V ( eix i,j Ut, Vt+1 F )(i,j)

(i,j) [n] [m] ζi,jeix i,j Ut

j=1 ζp,j xp,j, uq,t

Private Model Personalization Revisited

for uq,t the q-th column of Ut. Observe that ζp,j is independent of both xp,j and uq,t for all j [m]. Condition on the event

E = {|ζi,j| R p

26 log(nb) for all (i, j) [n] [m]}

which has probability at least 1 e 13 log(nb). Via this conditioning the random variable ζp,j xp,j, uq,t is R p

26 log(nb)- sub-Gaussian given that xp,j, uq,t is 1-sub-Gaussian. Note as well ζp,j xp,j, uq,t is mean zero and independent for each

b Pb j=1 ζp,j xp,j, uq,t is centered, q

26R2 log(nb)

b -sub-Gaussian, and independent for every p.

Let σ2 be the variance of 1

b Pb j=1 ζp,j xp,j, uq,t , which has σ q

26R2 log(nb)

b . Then, 1 m σ Pb j=1 ζp,j xp,j, uq,t

q [k] has covariance matrix Ik. By the one-sided version of Theorem 26

m σ V A(Vt+1(Ut) ), ζ O n +

with probability at least 1 e α2. Multiplying through by σ and setting α = n, we have 1 b V A(Vt+1(Ut) ), ζ 2

26R2n log(nb)

with probability at least 1 e n conditioned on the event E. So, in general 1 b V A(Vt+1(Ut) ), ζ 2

26R2n log(nb)

with probability at least 1 e n e 12 log(nb) 1 e 11 log(nb). This proves our first claim.

Claim 2: With probability at least 1 O(n 10), we have 1 nb U A(Vt+1(Ut) ), ζ 2 4

2 15R2Γ2d log n

Note that 1 nb U A(Vt+1(Ut) ), ζ = U

j=1 ζi,j Ai,j, Vt+1(Ut) +

j=1 ζi,j(Vt+1) Ai,j, (Ut) +

j=1 ζi,j(Ai,j) Vt+1, Ut

j=1 ζi,j(Ai,j) Vt+1.

Observe that since (Ai,j) is a matrix with one non-zero column x i,j, we have (Ai,j) Vt+1 = x i,j(vi,t+1) where vi,t+1 is the i-th row of Vt+1. Let N d and N k be Euclidean 1

4-covers of the d and k-dimensional unit spheres, respectively. Then, by Lemma 23

j=1 ζi,j(Ai,j) Vt+1

j=1 ζi,jx i,j(vi,t+1)

3 max a N d,b N k a

j=1 ζi,jx i,j(vi,t+1)

3 max a N d,b N k

j=1 ζi,j a, x i,j vi,t+1, b

Private Model Personalization Revisited

Let vi,t+1 2 5

4Γ for all i with probability at least 1 O(n 14) via Proposition 42. As well, we condition on the event

E = vi,t+1 2 5

4Γ for all i simultaneously

which has probability P [E] 1 O(n 13).

Since | vi,t+1, b | 25

16Γ by Cauchy-Schwarz, the random variable ζi,j a, x i,j vi,t+1, b is sub-exponential. Furthermore, the variable 1 nbζi,j a, x i,j vi,t+1, b has sub-exponential norm bounded by 125RΓ

nb and is mean zero. Following from Lemma 22 conditioned on E and for nb 15d log n, we have

j=1 ζi,j a, x i,j vi,t+1, b 4

2 15R2Γ2d log n

with probability at least 1 e 15d log n. Via the union bound over N d, N k

j=1 ζi,j(Ai,j) Vt+1

3 max a N d,b N k 1 nb

j=1 ζi,j a, x i,j vi,t+1, b

2 15R2Γ2d log n

with probability at least 1 9d+ke 15d log n 1 e 10d log n. Let E be the event

j=1 ζi,j(Ai,j) Vt+1

2 15R2Γ2d log n

Applying the fact that P[Ec] P[Ec|E] + P[Ec] e 10d log n + O(n 13) = O(n 10) finishes the second claim.

Recall the following definitions given prior to Lemma 39 where Ai,j = eix i,j. Denote

Ai,jup,tu q,t(Ai,j) Rn n

Ai,jup,t(u q) (Ai,j) Rn n

Dp,q = up,t, u q In n Rn n

j=1 ζi,j Ai,jup,t Rn.

Take G, C, D to be nk nk block matrices with blocks Gp,q, Cp,q, Cp,q and W an nk-dimensional vector created by concatenating Wp for each p [k]. Let Gi, Ci, Di be the k k matrices formed by taking the i-th diagonal element from each Gp,q, Cp,q, Dp,q, respectively.

Lemma 41. Let τk = cτ q

b for some c > 0. Then

(1) G 1 2 1 1 τk

with probability at least 1 O(n 13).

Furthermore,

(2) F F ντk 1 τk V 2dist(Ut, U ) +

26R2nk log(nb)

with probability at least 1 O(n 11).

Private Model Personalization Revisited

Proof. Claim 1: We have

G 1 2 1 1 τk

with probability at least 1 e 13k log n.

Let a be a normalized vector in nk dimensions. Define ai Rk to be the sub-vector of a constructed by choosing each ((p 1)n + i)-th component for p = 1, . . . , k. Observe that

σmin(G) = min a: a 2=1 a Ga

= min a: a 2=1

i=1 ai Giai

min i [n] σmin(Gi).

b Pb j=1 xi,jx i,j. By our definition below (31), we have Gi = U t Πi Ut. Note that 1

b U t xi,j is 1

b-sub-Gaussian and independent for each i, j. Assume b k. Then, using a one-sided form of Theorem 26, there exists a constant cτ > 0 where

σmin(U t Πi Ut) 1 cτ

with probability at least 1 e α2. Setting α = 14k log n gives us

σmin(Gi) 1 τk

with probability 1 e 14k log n for τk as in the lemma statement. Via the union bound over i [n]

σmin(G) 1 τk

with probability at least 1 e 13k log n.

Claim 2: We have

F F ντk 1 τk V 2dist(Ut, U ) +

26R2nk log(nb)

with probability at least 1 2e 13k log n e 11 log(nb).

The proof follows from bounding Hi = Gi Ci Di for each i [n] in spectral norm with Lemma 22 and exploiting the definition of our parameter ν = maxi [n] v i 2 σmax, .

Let Xi be the design matrix for xi,1, . . . , xi,b. Note that, by the definitions below (31), we have

Gi Di Ci = (Ut) Πi Ut(Ut) t U U Πi U = 1

b (Ut) X i Xi(Ut(Ut) Id)U .

(GD C)Vec(V ) 2 2 =

i=1 Hi(v i ) 2 2

i=1 Hi 2 2 v i 2 2

i=1 Hi 2 2.

(GD C)Vec(V ) 2 2 ν2

i=1 Hi 2 2. (32)

Private Model Personalization Revisited

We now bound each Hi using concentration inequalities. Define A = 1

b Xi Ut and B = 1

b Xi(Ut(Ut) Id)U . We

denote the rows of A and B with ai,j = 1

b(Ut) xi,j and bi,j = 1

b U (Ut(Ut) Id)xi,j, respectively. Note that, for N k

a Euclidean 1

4-cover of the unit sphere in k dimensions, we have

B A 2 2 max u,u N k u B Au

= 2 max u,u N k u

j=1 bi,ja i,j

= 2 max u,u N k

j=1 u, bi,j ai,j, u

via Lemma 23. Now, fix some (u, u ) N k N k. Then, ai,j = 1

b(Ut) xi,j and bi,j = 1

b U (Ut(Ut) Id)xi,j. So, u, ai,j is 5 4

b-sub-Gaussian and bi,j, u is 5 4

bdist(Ut, U )-sub-Gaussian. This means their product is 25

16bdist(Ut, U )- sub-exponential.

Now, from Lemma 22, there exists c > 0 such that

P Hi 2 2 2s P

max u,u N k

j=1 u, bi,j ai,j, u s

92ke c b min s2

2.5dist2(Ut,U ) , s 1.6dist(Ut,U ) .

Let τ > 0 satisfy s 1.6dist(Ut,U ) = max(τ, τ 2). Then

τ 2 = min s2

2.5dist2(Ut, U ), s 1.6dist(Ut, U )

Picking τ 2 = 14k log n

c b and assuming that b 11k log n ensures

35dist2(Ut, U )k log n

e 14k log n (33)

for any fixed i [n]. So

P (GD C)Vec(V ) 2 2 35ν2 V 2 2dist2(Ut, U )k log n

i=1 Hi 2 2 35ν2 V 2 2dist2(Ut, U )k log n

i=1 Hi 2 2 35ν2dist2(Ut, U )k log n

n P H1 2 2 35dist2(Ut, U )k log n

e 13k log n

where the first inequality follows from (32) and the last inequality follows from (33). The rest of the proof follows from the fact that F = [(G 1((GD C)ev + W))1 . . . (G 1((GD C)ev + W))k]. That is, we bound G 1[W1, . . . , Wk] via an application of Claim 1 along with Proposition 40 part (1). This gives a bound on the norm of F = [G 1((GD C)ev )1 . . . G 1((GD C)ev )k] + G 1[W1, . . . , Wk] by the union bound.

Private Model Personalization Revisited

Let fi be the i-th row of F in row vector form. Also, recall that Gi, Ci, Di are the k k matrices formed by taking the i-th diagonal element from each Gp,q, Cp,q, Dp,q as in (31).

Proposition 42. Suppose Assumption 31 holds. For all t [T 1], we have that vi,t+1 from Algorithm 1 satisfies

with probability at least 1 O(n 14) for all i [n].

Proof. By Lemma 39, we have (as row vectors) vi,t+1 = v i (U ) Ut fi. This implies

vi,t+1 2 v i 2 (U ) Ut 2 + fi 2

and so vi,t+1 2 Γ + fi 2. Fix i [n] and assume that ντk < 1. This is not difficult to achieve when using

Assumption 31. Recall that we defined τk = cτ q

b . Now, denoting the i-th row of a matrix with (M)i,

Gi 1(Gi Ci Di)(v i ) +

j=1 ζi,j Gi 1Ai,j Ut

Gi 1 2 (Gi Ci Di)(v i ) 2 + Gi 1 2

j=1 ζi,jx i,j Ut

ντk 1 τk v i 2 dist(Ut, U ) +

40R2k log n

ντk 1 τk Γ +

40R2k log n

with probability at least 1 2e 14k log(n) e 14 log(nb) via the argument of Lemma 41 part (1), and the same arguments as Lemma 41 part (2) and Proposition 40 part (1) applied to the vectors Gi 1(Gi Ci Di)(v i ) and 1 b Pb j=1 ζi,jx i,j Ut

Now, assuming that m 4000c2 τ max{R2,1} max{Γ2,1}γ4k log2 n

E2 0σ2max, T and m 4000 R2k

Γ2 T log(nm), we have

ντk 1 τk Γ +

40R2k log n

Taking the union bound over all i finishes the result.

Lemma 43. Let τ k = 5

nb . Suppose Assumption 31 holds. Then, we have, for any iteration t, that

b (A ) A (Vt+1(Ut) V (U ) ) (Vt+1(Ut) V (U ) ) Vt+1

τ kdist(Ut, U ) + 3RΓ2p

15 24dk log(n) log(nb)

with probability at least 1 O(n 10).

Proof. Take W = (W1, . . . , Wk) Rnk where Wp = 2 b Pn i=1 Pb j=1 ζi,j Ai,jup,t. Recall that F = [(G 1((GD

C)ev + W))1 . . . (G 1((GD C)ev + W))k] Rn k by (30). Define f W = [G 1W1 . . . G 1Wk]. Let Qt be the matrix defined via rows qi where q i = Ut(Ut) U (v i ) Ut(fi) U (v i ) .

Private Model Personalization Revisited

Finally, define by e Qt the matrix with rows eqi,t = qi,t f Wi, (Ut) where f Wi, is the i-th row of f W. Note that

b (A ) A (Vt+1(Ut) V (U ) ) (Vt+1(Ut) V (U ) ) Vt+1

b (A ) A ( e Qt) e Qt

b (A ) A (f Wi(Ut) ) f Wi(Ut) Vt+1

The lemma follows from bounding the right-hand terms individually.

Let τ k = 5

Claim 1: We have 1 n

b (A ) A ( e Qt) e Qt

2 τ kdist(Ut, U )

with probability at least 1 e 10d log n 2e 13k log n.

Note that eqi 2 Ut(Ut) U (v i ) U (v i ) 2 + Ut(fi) Ut(f Wi) 2

Γdist(Ut, U ) + fi f Wi 2.

Now fi f Wi 2 = Gi 1 2 Gi Di Ci 2 (v i ) 2

dist(Ut, U ) νΓτk

with probability at least 1 2e 13k log n via an argument identical to Lemma 41 part (2) without the label noise term. Choose c0 1000c2 in Assumption 31 to ensure ντk 1

4. Then fi f Wi 2 Γdist(Ut, U ) and hence eqi 2 2Γdist(Ut, U ) with probability at least 1 2e 13k log n.

By Proposition 42, we have vi,t+1 2 5

4Γ with probability at least 1 O(n 14) for each i . Observe

(A ) A ( e Qt) e Qt Vi,t+1 = 1

x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) .

We first condition on the event

E = eqi 2 2Γdist(Ut, U ) and vi,t+1 2 5

4Γ for all i [n]

which has probability at least 1 O(n 10) by the union bound. Define the Euclidean 1

4-covers of the d and k-dimensional unit spheres N d and N k, respectively. Via Lemma 23

x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) 2

3 max a N d,b N k 1 nb

x i,j, eqi,t a, x i,j vi,t+1, b a, eqi,t vi,t+1, b .

The variable x i,j, eqi,t is 2Γdist(Ut, U )-sub-Gaussian and a, x i,j is 5

4-sub-Gaussian via Proposition 18. This means x i,j, eqi,t a, x i,j is sub-exponential with norm 5

2Γdist(Ut, U ). Then, the variable 1 nb x i,j, eqi,t a, x i,j vi,t+1, b is sub-exponential with norm 5Γ nb dist(Ut, U ) v i 2 5Γ2

nb dist(Ut, U )

Private Model Personalization Revisited

given that we have conditioned on E. Note that the variables x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) are centered. Furthermore, due to our conditioning we can concentrate these variables with respect to the randomness x i,j since they are independent of eqi,t, vi,t+1. So, by Lemma 22 there exists a constant c > 0 where

x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) 2

9d+k exp c nb min s2

52Γ4dist(Ut, U )2 , s 5Γ2dist(Ut, U )

From (35) we denote τ 2 = min s2 52Γ4dist(Ut,U )2 , s 5Γ2dist(Ut,U ) and further set τ 2 = 13(d+k) log n

c nb . If nb 13(d+k) log n

x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) 2

13dist(Ut, U )

9d+ke 13(d+k) log n e 10d log n.

Recall that P[E] 1 O(n 10) . Therefore, we have

x i,j, eqi,t x i,j(vi,t+1) eqi,t(vi,t+1) 2

13dist(Ut, U )

e 10d log n + O(n 10) = O(n 10).

This proves Claim 1.

Claim 2: We have

b (A ) A (f Wi(Ut) ) f Wi(Ut) Vt+1

15 24dk log(n) log(nb)

with probability at least 1 O(n 10).

Observe 1 nb

(A ) A (f Wi(Ut) ) f Wi(Ut) Vt+1

x i,j, f Wi(Ut) x i,j(vi,t+1) f Wi(Ut) (vi,t+1) .

Note that the random variables x i,j, f Wi(Ut) x i,j(vi,t+1) f Wi(Ut) (vi,t+1) have mean zero since x i,j and f Wi(Ut)

are independent along with E h f Wi(Ut) i = 0. Using a covering argument identical to Claim 1, we have

x i,j, f Wi(Ut) x i,j(vi,t+1) f Wi(Ut) (vi,t+1) 2

3 max a N d,b N k 1 nb

x i,j, f Wi(Ut) a, x i,j vi,t+1, b a, f Wi(Ut) vi,t+1, b .

Recall that f Wi = 1 nb Pn i=1 Pb j=1 ζ i,jx i,j(vi,t+1)

i, Rk. Conditioning on vi,t+1 2 5

4Γ, we have f Wi 2

24k log(nb)

nb with probability at least 1 ke 12 log(nb) 1 e 11 log(nb) by the union bound on the components of f Wi. Multiplication by U t on the right does not change this bound given orthonormality.

Private Model Personalization Revisited

We thus condition on the new event

f Wi U t 2 5RΓ p

24k log(nb)

nb and vi,t+1 2 5

4Γ for all i [n]

which has probability at least 1 O(n 11) via our above work and Proposition 42. The variable x i,j, f Wi(Ut) is 5RΓ

24k log(nb)

-sub-Gaussian via conditioning on E. This means

x i,j, f Wi(Ut) a, x i,j vi,t+1, b a, f Wi(Ut) vi,t+1, b

24k log(nb)

-sub-exponential. Using an argument identical to how we showed (36), the inequality (37) implies

b (A ) A (f Wi(Ut) ) f Wi(Ut) Vt+1

15 24dk log(n) log(nb)

(nb) 3 4 (38)

with probability at least 1 O(n 10). Combining (36) and (38) via the union bound finishes the second claim.

Combining (34) with Claims 1 and 2 via the union bound finishes the proof.

Recall the exact statements of Assumption 31 and 32. That is, there exist c0, c1 > 1 such that

max{R2, 1} max{Γ2, 1}γ4k log2 n

E2 0σ2max, + R2k

Γ2 log(nm) T

max { ϵ,δ, 1} (R + Γ) Γd

k log2(nm) E2 0λ2 + R2Γ2d log(nm)

Lower bounds on the constants c0, c1 are used many times in the proof for the following lemma.

Lemma (Restatement of Lemma 33). Let E0 = 1 dist2(U0, U ) and ψ = e O (R + Γ)Γ

dk . Suppose Assumption 31 and 32 hold. Then, for any iteration t, we have that Pt+1 is invertible and

1 ησ2 min, E0 2 log n

with probability at least 1 O(n 10).

Proof. Recall that F = G 1(GD C)V + 1 b G 1 V A(Vt+1(Ut) ), ζ and let τk = cτ q

nb for some constant c > 0. Letting c0 2000c2 τ in Assumption 31, we have 1 1 τk 4 3 and noting G 1 2 1 1 τk by Lemma 41 part (1), we have

47R2n log(nb)

with probability at least 1 O(n 11) via the argument Lemma 41 part (2) for the spectral norm.

Recall that Qt = Vt+1(Ut) V (U ) and Vt = V (U ) Ut F by Lemma 39. Now, denote H t = 1

b(A ) A (Qt)Vt+1 + nξt, Ht = 1

b(A ) A (Qt), and Wt = η nb U A(Vt+1(Ut) ), ζ . By Recursion 28, we have

P t+1Pt+1 = ˆU t+1 ˆUt+1

= (Ut) Ut η

n((Ut) H t + (H t) Ut) + ((Ut) Wt + (Wt) Ut) η

n((Wt) H t + H t Wt)

n2 (H t) H t + (Wt) Wt

Private Model Personalization Revisited

By Weyl s inequality, the above implies

σ2 min(Pt+1) 1 η

nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut) + λmin((Ut) Wt + (Wt) Ut)

nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) λmax((Ut) ξt + ξ t Ut)

λmax((Wt) ξt + ξ t Wt) + η2

n2 λmin((H t) H t) + λmin((Wt) Wt)

nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut) + λmin((Ut) Wt + (Wt) Ut)

nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) λmax((Ut) ξt + ξ t Ut)

λmax((Wt) ξt + ξ t Wt).

To complete this proof we must bound each of these terms individually. That is, we upper bound

η nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut) (40a)

λmax((Ut) ξt + ξ t Ut) (40b)

λmax((Wt) ξt + ξ t Wt) (40c) η nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) (40d)

and lower bound

λmin((Ut) Wt + (Wt) Ut). (41)

These bounds follow from our previous propositions and lemmas.

Term (40a) has

η nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut)

= max a 2=1 η na (Ut) Ht Vt+1 + (Vt+1) (Ht) Uta

= max a 2=1 2η

n a (Vt+1) Ht Uta

max a 2=1 2η

n a (Vt+1) 1

b (A ) A (Qt) Qt

Uta + max a 2=1 2η

n a (Vt+1) Qt Uta.

max a 2=1 2η

n a (Vt+1) 1

b (A ) A (Qt) Qt

Uta 2ητ k + 3RΓ2p

15 24dk log(n) log(nb)

by Lemma 43. We are able to drop the second term on the right-hand side above later due to its very fast rate of decay in n, b. Now

max a 2=1 2η

n a (Vt+1) Qt Uta = max a 2=1 2η

n Qt, Vt+1aa (Ut)

= max a 2=1 2η

n Qt, V (U ) Utaa (Ut)

n Qt, Faa (Ut)

We bound the first term above via

Private Model Personalization Revisited

n Qt, V (U ) Utaa (Ut)

n tr Ut(Vt+1) U (V ) V (U ) Utaa (Ut)

n tr Ut(Ut) U (V ) Ut(F) U (V ) V (U ) Utaa (Ut)

n tr Ut(Ut) I U (V ) V (U ) Utaa (Ut) 2η

n tr Ut(F) V (U ) Utaa (Ut)

n tr (F) V (U ) Utaa

tr (F) V (U ) Utaa

a (F) V (U ) Uta

n a 2 (F) V (U ) Ut 2 a 2

1 τk σ2 max, +

47η2R2σ2max, log(nb)

with probability at least 1 O(n 11) by (39), where equalities 3 and 4 hold by the cycling property of the trace and ((Ut) Ut) = 0. Then

n Qt, Faa (Ut)

n tr Ut(Ut) U (V ) Ut(F) U (V ) Faa (Ut)

n tr Ut(Ut) I U (V ) Faa (Ut) + 2η

n tr Faa (Ut) Ut(F)

3 η ν2τ 2 k (1 τk)2 σ2 max, + 154η2R2 log(nb)

with probability at least 1 O(n 11) where the third equality follows from the cycling property of the trace and ((Ut) Ut) = 0. Combining (42) and (43) gives us

max a 2=1 2η

n a (Vt+1) Qt Uta 8η ντk (1 τk)2 σ2 max, + 154η2R2 log(nb)

given that τ 2 k τk by selecting c0 35c2 τ in Assumption 31. Define τk = ντk + τ k σ2 max, . Letting c0, c1 4000, for (40a), we have

η nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut) 2ητ k + 8η ντk (1 τk)2 σ2 max, + 154η2R2 log(nb)

47η2R2σ2max, log(nb)

15 24dk log(n) log(nb)

8η τk (1 τk)2 σ2 max, + 4

47η2R2 max{σ2max, , 1} log(nb)

Private Model Personalization Revisited

Note that (Ut) ξt in (40b) is a k k Gaussian matrix with independent columns. Now, by an equivalent statement as Corollary 27 for a k k matrix

λmax((Ut) ξt + ξ t Ut) = max a: a 2=1 a (Ut) ξt + ξ t Uta

= 2 max a: a 2=1 a (Ut) ξta

2 (Ut) ξt 2

for w > 0 with probability at least 1 2e α2 for some constant c > 0. We select α = 10k log n, which means

λmax((Ut) ξt + ξ t Ut) 4ηˆσ p

with probability at least 1 O(n 10).

Now, for (40c) λmax((Wt) ξt + ξ t Wt) = max a 2=1 a ((Wt) ξt + ξ t Wt)a

2 max a 2=1 a ξ t Wta

2 ξt 2 Wt 2

2 15R2η2Γ2d log n

with probability at least 1 O(n 10) via Corollary 27 and Proposition 40 part (1). Now we bound the term (40d). Note that

η nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) 2η

1 b (A ) A (Qt)Vt+1

b (A ) A (Qt) Qt

2 + Qt Vt+1 2

3ητ kdist(Ut, U ) + 2η

n Qt Vt+1 2

by Lemma 43 and taking c0 4000c2 τ and c1 4000 from Assumptions 31 and 32 so 2ητ kdist(Ut, U ) is dominant over the noise term. Then

η nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) 2ητ kdist(Ut, U ) + 2η

n Qt Vt+1 2

Note that 2τ kdist(Ut, U ) 1 by taking c0 4000c2 τ and c1 4000 from Assumptions 31 and 32 since τ k =

nb . Then, by (39), the orthonormality of Ut, U , and the definition of Qt

Qt Vt+1 2 Ut(Ut) U (V ) Ut(F) U (V ) 2 V (U ) Ut F 2 (2 V 2 + F 2)( V 2 + F 2)

(2 + 2ντk) V 2 +

47R2n log(nb)

(1 + 2ντk) V 2 +

47R2n log(nb)

(2 + 2ντk)2 V 2 2 +

47(2 + 2ντk)2R2 V 2 2n log(nb) b + 47R2n log(nb)

n Qt Vt+1 2 2η(2 + ντk)2σ2 max, + 39

47η2R2σ2max, log(nb)

b + 154ηR2 log(nb)

Private Model Personalization Revisited

by selecting the constant c0 in Assumption 31 to have c0 4000c2 τ such that ντk 1 10. Recall that

2 15η2R2Γ2d log n

with probability at least 1 O(n 10) by Proposition 40. So, combining (45) and (46) η nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt)

2 15η2R2Γ2d log n

2 15η2(1 + ντk)2R2Γ2dσ2max, log(nb)

2 15η2R2Γ2d log n

47η2R2σ2max, log(nb)

b + 154ηR2 log(nb)

with probability at least 1 O(n 11). Recall ντk 1 10 since c0 4000c2 τ. Moreover, for c1 4000

2 15η2R2Γ2d log(nb)

2 15η2(2 + 2ντk)2R2Γ2dσ2max, log(nb)

2 15η2R2Γ2d log n

47η2R2σ2max, log(nb)

b + 154ηR2 log(nb)

2 15η2R2Γ2d max{σ2max, , 1} log(nb)

nb by Assumptions 31 and 32.

Observe that for (41)

λmin((Ut) Wt + (Wt) Ut) 2λmin((Ut) Wt) 2 λmin((Ut) Wt)

and 2 λmin((Ut) Wt) 2σmin((Ut) Wt)) 2σmax((Ut) Wt)).

Note that concentration inequalities hold for the sum of uncorrelated sub-Gaussian random variables and U t having orthonormal rows. Then, by a modified version of Lemma 40 part (2) for a k k-dimensional version of Wt

σmax((Ut) Wt) = max a 2=1 a (Ut) Wta

2 15η2R2Γ2k log(nb)

with probability at least 1 O(n 10). Notice that our upper bound is independent of the data dimension d. Thus

λmin((Ut) Wt + (Wt) Ut) 8

2 15η2R2Γ2k log(nb)

Via the above work we have, simultaneously by the union bound, there exists a constant ˆc > 0 such that

η nλmax((Ut) Ht Vt+1 + (Vt+1) (Ht) Ut) ˆcη τkσ2 max, + ˆc

η2R2 max{σ2max, , 1} log(nb)

b λmax((Ut) ξt + ξ t Ut) ˆcηˆσ p

λmax((Wt) ξt + ξ t Wt) ˆcηˆσ p

R2η2Γ2d log n

η nλmax((Wt) Ht Vt+1 + (Vt+1) (Ht) Wt) ˆc

η2R2Γ2d max{σ2max, , 1} log(nb)

λmin((Ut) Wt + (Wt) Ut) ˆc

η2R2Γ2k log(nb)

Private Model Personalization Revisited

with probability at least 1 O(n 10). The value of ˆc may change between lines. Its precise value does not matter and this constant is used to simplify our proof. By our choice of ψ, we have ˆσ = O (R+Γ)Γ

T dk log(nb) nϵ . Putting together the

bounds in (47) and selecting c0 4000c2 τ and c1 4000 in Assumptions 31 and 32

σ2 min(Pt+1) 1 ˆcη τkσ2 max, ˆcηˆσ p

η2R2 max{σ2max, , 1} log(nb)

R2η2Γ2d log n

η2R2Γ2k log(nb)

η2R2Γ2d max{σ2max, , 1} log(nb)

nb 1 ˆcη τkσ2 max, 2ˆcηˆσ p

η2R2 max{Γ2, 1} log(nb)

η2R2Γ2d max{σ2max, , 1} log(nb)

with probability at least 1 O(n 10). Via our choice of ˆσ, there is a constant ˆc > 0 such that

σ2 min(Pt+1) 1 ˆcη τkσ2 max, 2ˆcη(R + Γ)Γd q

Tk log(1.25/δ) log3(nb)

η2R2 max{Γ2, 1} log(nb)

η2R2Γ2d max{σ2max, , 1} log(nb)

Recall again that τk = cτ q

b , τ k = 5

nb , and ν = Γ σmax, 1. Then

η τkσ2 max, = ηντkσ2 max, + ητ k

= cτηνσ2 max,

13ην2σ2 max,

Recall the exact statements of Assumption 31 and 32. That is, there exist c0, c1 > 1 such that

max{R2, 1} max{Γ2, 1}γ4k log2 n

E2 0σ2max, + R2k

Γ2 log(nm) T

max { ϵ,δ, 1} (R + Γ) Γd

k log2(nm) E2 0λ2 + R2Γ2d log(nm)

The problem parameters that lower bound m, n now come into use during the following steps. Furthermore, recall that by Assumption 4 we know some λ > 0 such that λ σmin, . Then, there exists a constant ˆc > 0 such that

ˆcη τkσ2 max, ˆcησ2 min, E0 c0 log n + ˆcηλ2E2 0 q

c0c1γ2k 3 2 log2(nm) log n .

ˆcη τkσ2 max, ˆcησ2 min, E0 c0 log n + ˆcησ2 min, E2 0 q

c0c1γ2k 3 2 log2(nm) log n . (48)

Private Model Personalization Revisited

since λ σmin, . Further, this same constant ˆc satisfies

2ˆcη(R + Γ)Γd q

Tk log(1.25/δ) log3(nb)

η2R2 max{Γ2, 1} log(nb)

η2R2Γ2d max{σ2max, , 1} log(nb)

2ˆcηλ2E2 0 c1 p

log(nm) + ˆcησ2 min, E0

1 c0 max{Γ2, 1}k log n

+ ˆcηλσmin, E2 0

max{σ2max, , 1}

c0c1 max{Γ2, 1}k 3 2 log2(n)log(nm)

and thus 2ˆcη(R + Γ)Γd q

Tk log(1.25/δ) log3(nb)

η2R2 max{Γ2, 1} log(nb)

η2R2Γ2d max{σ2max, , 1} log(nb)

2ˆcησ2 min, E2 0 c1 p

log(nm) + ˆcησ2 min, E0

1 c0 max{Γ2, 1}k log n

+ ˆcησ2 min, E2 0

max{σ2max, , 1}

c0c1 max{Γ2, 1}k 3 2 log2(n)log(nm)

since λ σmin, . Note that E2 0 E0 since E0 (0, 1). Then, combining c0, c1 max{10

2ˆc2, 4000c2 τ, 4000} with (48) and (49) gives us

σ2 min(Pt+1) 1 ησ2 min, E0 2 log n (50)

with probability at least 1 O(n 10).

Private Model Personalization Revisited

B.4 Private Fed Rep experiments

In this subsection we describe the synthetic data experiments we designed to compare our Private Fed Rep (Algorithm 1) to the Private Alternating Minimization Meta-Algorithm (Priv-Alt Min) of (Jain et al., 2021). Our comparison is described in Figure 2 as a graph of population mean square error (MSE) over choice of privacy parameter ϵ > 0. Note that we also experimented with non-private Alt Min, which performs similarly to the non-private Fed Rep in Figure 2.

Figure 2: Graph of population MSE over choice of privacy parameter ϵ [1, 8]. Local optimization is done via non-private gradient descent on each user s data separately and their population MSEs averaging over n users.

The features x Rd of our synthetic data are sampled from a standard normal Gaussian distribution. We select optimal parameters U , v 1, . . . , v n by sampling U from a d k-dimensional Gaussian distribution then generating an orthonormal matrix U via (U , P) = QR(U) and sampling v i N(0, Ik) for all i [n]. Labels are generated as in Assumption 6 and we choose Gaussian noise with standard deviation R = 0.01. Further, both Private Fed Rep and Priv-Alt Min are initialized using an implementation of Algorithm 2.

Our problem is instantiated with d = 50, k = 2, m = 10, and n = 20, 000. For Fed Rep we prune our hyperparameters, deciding on T = 5 and learning rate η = 2.5 with clipping parameter ψ = 10. Similarly, Priv-Alt Min with iterations optimized for T = 5 and clipping parameter 10 4. The Gaussian mechanism variance for both algorithms is calculated

using the privacy parameter ϵ,δ =

16 log(1.25/δ)

ϵ with δ = 10 6.

Private Model Personalization Revisited

C Missing Proofs for Section 4

We first restate our algorithm along with some initial definitions and results.

Define BF = {U Rd k : U F

2k}. Assume for simplicity that m is even. We also partitioned Si = S0 i S1 i where S0 i = {z1,j, . . . , z m

2 ,j} and S1 i = {z m

2 +1,j, . . . , zm,j} for each i [n]. Further, we denote St = (St 1, . . . , St n) where t {0, 1}. Suppose that S = (S1, . . . , Sn) Znm(d+1) is a sequence of n datasets with m samples each. Let SM = ((S1)M, . . . (Sn)M) where (Si)M = {(Mxi,j, yi,j) : j [m]} for all i [n]

Algorithm 3 Private Representation Learning for Personalized Classification

Require: dataset sequences S0 and S1 of equal size, score function f(U , ) = min V V b Lρ(U , V ; ) over matrices

U Rk k, privacy parameter ϵ > 0, target dimension k = O r2Γ2 log(nm/δ)

1: Sample M Rk d with entries drawn i.i.d uniformly from n 1

2: Let SM = ((S1)M, . . . , (Sn)M) where (Si)M = (Mx, y) : (x, y) S0 i for i [n] 3: Let N γ be a Frobenius norm γ-cover of BF 4: Run the exponential mechanism over N γ, privacy parameter ϵ, sensitivity 1

n, and score function f(U , SM), to select e U N γ

5: Let U priv M e U 6: Each user i [n] independently computes vpriv i arg min v 2 Γ b L(U priv, v, S1 i )

7: Return: U priv, V priv = [vpriv 1 , . . . vpriv n ]

Definition (Restatement of Definition 13). Let (U, v) Rd k Rk and (x, y) Rd { 1, 1} any data point. We define the margin loss as ℓρ(U, v, z) = 1 [y x, Uv ρ]

and denote the 0-1 loss ℓ(U, v, z) = ℓ0(U, v, z) = 1 [y x, Uv 0] .

Definition (Restatement of Definition 14). Let G Rd be any set of t vectors. Fix τ, β (0, 1). We call the random matrix M Rk d a (t, τ, β)-Johnson-Lindenstrauss (JL) transform if for any u, u G

| Mu, Mu u, u | τ u 2 u 2

with probability at least 1 β over M. Proposition 44 ((Woodruff et al., 2014)). Let G Rq be any set of t vectors. Fix τ, β (0, 1). For M Rk q a (t, τ, β)-JL transform for any u G

(1 τ) u 2 2 Mu 2 2 (1 + τ) u 2 2

with probability at least 1 β over M, which holds simultaneously with Definition 14.

Lemma (Restatement of Lemma 15). Let τ, β (0, 1). Take G Rd to be any set of t vectors. Setting k = O log( t

for a k d matrix M with entries drawn uniformly and independently from n 1

k o implies that M is a (t, τ, β)-JL transform. Lemma 45 (Lemma D.1 (Bassily et al., 2022)). Fix β (0, 1). Suppose U Rd k and that xi,j Rd, vi Rk for all (i, j) [n] [m]. Let M Rk d be a ((n + 1)m, γ, β/2)-JL transform in the sense of Proposition 44. Then, there exists a constant c 1 such that, with probability at least 1 β over the randomness of M, we have

xi,j 2 2 (1)

Private Model Personalization Revisited

v i U M Mxi,j v i U xi,j c Uvi 2 xi,j 2

for all (i, j) [n] [m] simultaneously.

Recall that U is the space of orthonormal d k matrices and V the set of n k matrices with columns whose Euclidean

norms are bounded by Γ > 0. Throughout the following results we assume that γ c q

k for some constant c 1 and a target dimension k for a JL transform. Furthermore, whenever we define a JL transform M, we assume that it preserves the norm of data points xi,j for all (i, j) [n] [m] and some fixed matrix in U U. Let N γ be a Frobenius γ-cover of the

2k-radius Frobenius ball BF . By cover we mean that N γ contains the center points of Frobenius balls of γ-radius whose union contain BF . Note as well that N γ BF . Proposition 46. Let M Rk d be a ((n + 1)m, γ, β/2)-JL transform in the sense of Proposition 44. Assume c2 log (nm/β) k and that xi,j is a sequence of r-bounded feature vectors for each (i, j) [n] [m]. Suppose N γ is a Frobenius norm γ-cover of the k k-dimensional

2k radius ball. Moreover, let U U and e U N γ where e U MU F γ. Then, there exists a constant c 1 where, with probability at least 1 β over the randomness of M, we have v i e U Mxi,j v i U xi,j (

for any V = [v1, . . . , vn] V and all (i, j) [n] [m] simultaneously.

Proof. Note that since U has orthonormal columns, we have Uvi 2 = vi 2 Γ for all j. So

v i U M Mxi,j v i U xi,j crΓ

with probability at least 1 1

2β by part (3) of Lemma 45. Recall that γ c q

k for some constant c 1. Assume

c2r2Γ2 log (nm/β) k , which implies γ 1. Let BF be the k k-dimensional Frobenius ball of radius

2k. We define N γ to be a Frobenius norm γ-cover of BF . Note MU F

2k with probability at least 1 1

2β by part (2) of Lemma 45. That is, with probability at least 1 1

2β, there exists e U N γ such that e U MU γ. Let e U N γ be within γ Frobenius-distance of MU conditioned on the event MU F

Now, there exists a constant c 1 such that v i e U Mxi,j v i U M Mxi,j vi 2 e U Mxi,j U M Mxi,j 2 vi 2 e U MU F Mxi,j 2

where the third inequality holds by part (1) of Lemma 45 along with our choice of e U and the fourth inequality by our choice of γ. Hence v i e U Mxi,j v i U M Mxi,j

with probability at least 1 1

2β. Combining (47) and (48) with the union bound and triangle inequality completes the proof.

Lemma (Restatement of Lemma 16). Fix ϵ, ρ > 0, β (0, 1). Algorithm 3 is (ϵ, 0)-user-level DP. Sample S Dm. Then, Algorithm 3 returns U priv from input S such that

min V V b L(U priv, V ; S0) min (U,V ) U V b Lρ(U, V ; S0) + e O r2Γ2k

with probability at least 1 β over the randomness of S and the internal randomness of the algorithm.

Private Model Personalization Revisited

Proof. Let M Rk d be a ((n + 1)m, γ, β/4)-JL transform in the sense of Proposition 44. Further, let BF Rk k be

the Frobenius ball of radius

2k. Recall that γ c q

k for a constant c 1. Assume c2 log (nm/β) k , which implies γ 1.

Define VU arg min V V b Lρ(U , V ; S0) for each U Rd k and fix U arg min U U b Lρ(U , VU ; S0). Denote the columns of VU as v1, . . . , vn. Let N γ be a Frobenius norm γ-cover over BF . We have MU F

2k with probability at least 1 1

4β by part (2) of Lemma 45. Choose γ = aρ (

2+1)crΓ for some constant a (0, 1) along with ˆU N γ within

Frobenius distance γ of MU conditioned on the event MU F

2k. Then, by Proposition 46, for all (i, j) [n] [m] simultaneously, we have v i ˆU Mxi,j v i U xi,j aρ

with probability at least 1 1

2β. Assuming 1 aρ 0, the above implies, with probability at least 1 1

b L(M ˆU, VM ˆU; S0) b L(M ˆU, VU; S0) b Lρ(U, VU; S0) (49)

where the left-hand inequality holds by VM ˆU being a minimizer for b L(M ˆU, S0).

Let U arg min U BF b L(M U , VM U ; S0). Then, by definition of U and the fact that N γ BF , we have

b L(M U, VM U; S0) b L(M ˆU, VM ˆU; S0). (50)

Let (S0 i )M = {(Mxi,j, yi,j) : j [m/2]} for all i [n] and SM = ((S0 i )M)i [n]. This definition and the characterization of our losses in Definition 13 imply b L( U, VM U; S0 M) = b L(M U, VM U; S0). Via the exponential mechanism over N γ given score function b L(U , VM U ; S0 M) for U N γ and the usual empirical loss guarantees for the exponential mechanism, we obtain U priv Rd k where U priv = M e U for some e U N γ such that

b L(U priv, VU priv; S0) = b L(e U, VM e U; S0 M) b L( U, VM U; S0 M) + O log|N γ|

with probability at least 1 1

2β. This gives us

b L(U priv, VU priv; S0) b L(M U, VM U; S0) + O

r2Γ2k log rΓ

k ρ log ((nm/β))

since |N γ| = O

k γ k k and k = O r2Γ2 log((nm/β))

ρ2 by our choice of γ. Combining (49), (50), and (51) via the

union bound along with recalling our definition VU arg min V V b Lρ(U , V ; S0) finishes the proof.

Theorem (Restatement of Theorem 17). Fix ϵ, ρ > 0, β (0, 1). Algorithm 3 is (ϵ, 0)-user-level DP in the billboard model. Sample user data S Dm. Then, Algorithm 3 returns U priv, V priv from input S such that

L(U priv, V priv; D) min (U,V ) U V b Lρ(U, V ; S0) + e O

with probability at least 1 β over the randomness of S and the internal randomness of the algorithm.

Proof. Let V priv V be the matrix with columns vpriv i arg min v Γ b L(U priv, v; S1 i ) and ˆV V with columns ˆvi arg min v Γ b L(U priv, v; S0 i ) for each i [n]. We first note that

L(U priv, V priv; D) min (U,V ) U V b Lρ(U, V ; S0)

= L(U priv, V priv; D) L(U priv, ˆV ; D) + L(U priv, ˆV ; D) b L(U priv, ˆV , S0)

+ b L(U priv, ˆV , S0) min (U,V ) U V b Lρ(U, V ; S0) (52)

Private Model Personalization Revisited

Let M Rk d be a ((n + 1)m, γ, β/6)-JL transform in the sense of Proposition 44. Let γ = c q

k for a constant c > 0 and γ = O ρ

rΓ . Assume c2 log (nm/β) k , which implies γ 1. Denote by BF the k k Frobenius ball of radius

2k. Let N γ be a γ-cover of BF . Using Lemma 16, we have

b L(U priv, ˆV ; S0) min (U,V ) U V b Lρ(U, V ; S0) e O r2Γ2k

with probability at least 1 1

3β over the randomness of M.

Claim 1: We have

L(U priv, V priv; D) L(U priv, ˆV ; D) e O

with probability at least 1 1

By the definition of V priv, we have b L(U priv, V priv; S1) b L(U priv, ˆV ; S1). (54)

Our proof strategy is to obtain generalization error bounds for the parameters (U priv, V priv) and (U priv, ˆV ) with respect to the loss b L( ; S1), which we leverage to prove the claim. We first analyze the generalization properties with respect to ℓ, D of our 2-layer linear networks induced by the matrix product Uv for any U BF and v Rk with v 2 Γ. We denote the family of 2-layer linear networks induced by the matrix product Uv as L and define BΓ Rk to be the centered

2kΓ-radius Euclidean ball. Let H0 be the space of binary classifiers induced by taking the sign of the functionals , Uv with Uv L. Similarly, define H to be the space of binary linear classifiers induced by functions , w for all w BΓ.

Since L BΓ each functional , Uv must also be contained in the space of functionals , w with w BΓ. This naturally implies H0 H. Hence, the VC dimension of H0 is no larger than the VC dimension of H, i.e. at most k + 1. Recall N γ BF and that, by the description of Algorithm 3, there is a particular e U N γ such that U priv = M e U. Then, we have e U BF and thus the binary classifier induced by e Uv is in H0 for any v with v 2 Γ.

To garner a generalization guarantee, we use the fact that the VC dimension of H0 is O(k ). Recall that (Si)M = {(Mxi,j, yi,j) : (xi,j, yi,j) Si} for each i [n]. Denote SM = ((Si)M)i [n] and define the sequence of distributions DM = ((D1)M, . . . , (Dn)M) where for, each i [n], we have that x (Di)M has the same distribution as Mx with x Di. Thus, we obtain, from the VC dimension generalization error bounds on H that, for each i [n], the following

L(e U, vpriv i ; (Di)M) b L(e U, vpriv i ; (S1 i )M) e O

and L(e U, ˆvi; (Di)M) b L(e U, ˆvi; (S1 i )M) e O

with probability at least 1 1 6nβ. Then, by the union bound and taking the arithmetic mean over the n users

L(e U, V priv; DM) b L(e U, V priv; S1 M) e O

and L(e U, ˆV ; DM) b L(e U, ˆV ; S1 M) e O

with probability at least 1 1

3β. Via the inner product that characterizes our losses in Definition 13, we have L(e U, ˆV ; DM) = L(U priv, ˆV ; D) and L(e U, V priv; DM) = L(U priv, V priv; D). Using the above inequalities and (54), we obtain

L(U priv, V priv; D) L(U priv, ˆV ; D) e O

Private Model Personalization Revisited

with probability at least 1 1

3β, which proves Claim 1.

Another application of the VC bound and the union bound gives us, with probability at least 1 1

L(U priv, ˆV ; D) b L(U priv, ˆV ; S0) e O

Combining (53), Claim 1, and (55) via the union bound, and recalling that k = O r2Γ2 log(nm/β)

ρ2 by our choice of γ, finishes the proof.