Multi-Task Differential Privacy Under Distribution Skew

Walid Krichene 1, Prateek Jain 1, Shuang Song 1, Mukund Sundararajan 2, Abhradeep Thakurta 1, Li Zhang 1

1Google Research, 2Google. Correspondence to: Walid Krichene. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study the problem of multi-task learning under user-level differential privacy, in which n users contribute data to m tasks, each involving a subset of users. One important aspect of the problem, which can significantly impact quality, is the distribution skew among tasks: tasks with far fewer data samples than others are more susceptible to the noise added for privacy. It is natural to ask whether algorithms can adapt to this skew to improve the overall utility. We give a systematic analysis of the problem, by studying how to optimally allocate a user's privacy budget among tasks. We propose a generic algorithm, based on an adaptive reweighting of the empirical loss, and show that in the presence of distribution skew, this gives a quantifiable improvement in excess empirical risk. Experimental studies on recommendation problems that exhibit a long tail of small tasks demonstrate that our methods significantly improve utility, achieving the state of the art on two standard benchmarks.

1. Introduction

Machine learning models trained on sensitive user data present the risk of leaking private user information (Dwork et al., 2007; Korolova, 2010; Calandrino et al., 2011; Shokri et al., 2017). Differential Privacy (DP) (Dwork et al., 2006) mitigates this risk, and has become a gold standard widely adopted in industry and government (Abowd, 2018; Wilson et al., 2020; Rogers et al., 2021; Amin et al., 2022). In this work, we adopt the notion of user-level DP (Dwork & Roth, 2014; Kearns et al., 2014), which seeks to protect all data samples of a given user, a harder goal than protecting a single sample. User-level DP has been studied under different settings, including empirical risk minimization (Amin et al., 2019; Levy et al., 2021), mean estimation (Cummings et al., 2022), and matrix completion (Jain et al., 2018; Chien et al., 2021).

The setting we study is that of multi-task learning, in which users contribute data to multiple tasks. This is for example the case in recommendation systems: given a set of n users and m items (pieces of content, such as movies or songs), the goal is to learn a representation for each item based on the users' preferences. Another example is multi-class classification, where the goal is to learn m classifiers (one per class). Notice that the training data of a given user may be relevant to only a small subset of tasks. This is certainly the case in recommendation, where users typically interact with a small subset of the catalog of available content. This can also be the case in classification if the number of classes is very large. The interaction pattern between users and tasks can be described using a bipartite graph Ω ⊆ [m] × [n], such that user j contributes data to task i if and only if (i, j) ∈ Ω. We will also denote by Ω_i = {j : (i, j) ∈ Ω} the set of users contributing to task i. The goal is then to minimize the empirical loss summed over tasks,

L(θ) = Σ_{i=1}^m Σ_{j∈Ω_i} ℓ(θ_i; x_ij, y_ij),    (1)

where θ_i represents the model parameters for task i, and x_ij, y_ij are respectively the features and labels contributed by user j to task i.
In general, we will assume that Ω, x, and y are all private.

In practice, the tasks can be imbalanced, i.e., the distribution of task sizes |Ω_i| can be heavily skewed. This is the case in recommendation systems, in which there is a long tail of items with orders of magnitude fewer training examples than average (Yin et al., 2012; Liu & Zheng, 2020). Classification tasks can also exhibit a similar skew among classes (Kubat & Matwin, 1997). Under differential privacy constraints, this skew makes tail tasks (those with less data) more susceptible to quality losses. This disparate impact of differential privacy is well documented, for example in classification with class imbalance (Bagdasaryan et al., 2019) or in DP language models (McMahan et al., 2018).

A natural question is whether DP algorithms can be made adaptive to such distribution skew. Some heuristics were found to be useful in practice. For example, Chien et al. (2021) propose a sampling heuristic, which limits the number of samples per user while biasing the sampling towards tail tasks. This was found to improve performance in practice (compared to uniform sampling), though no analysis was provided on the effect of this biased sampling. Another heuristic was proposed by Xu et al. (2021); it applies to DPSGD, and operates by increasing the clipping norm of gradients on tail tasks. Similarly, this was observed to improve performance (compared to uniform clipping). In both cases, the intuition is that by allocating a larger proportion of a user's privacy budget to the tail (either via sampling or clipping), one can improve utility on the tail, and potentially on the overall objective (1). However, a formal analysis of biased sampling and clipping remains elusive, as they introduce a bias that is hard to analyze.

We take a different approach, introducing task-user weights w_ij to gain finer control over the budget allocation. The weights play a similar role to sampling or clipping: by assigning larger weights to a task, we improve the utility of that task. The question is then how to optimally choose the weights w to minimize the excess risk on the overall objective (1), under privacy constraints (which translate into constraints involving w and Ω). Fortunately, this problem is amenable to formal analysis, and we derive optimal choices of weights, and corresponding error bounds, under different assumptions on the loss and the task-user graph Ω.

1.1. Contributions

1. We give a formal analysis of the private multi-task learning problem under distribution skew. We propose a generic method based on computing task-user weights, and applying those weights to the loss function (1), as a mechanism to control each user's privacy budget allocation among tasks. The method is generic in that it applies to several algorithms, such as DPSGD (Bassily et al., 2014; Abadi et al., 2016) or Sufficient Statistics Perturbation (Foulds et al., 2016).

2. We derive utility bounds that explicitly show how the budget allocation trades off privacy and utility (Theorems 3.8 and 3.12). By adapting the allocation to the task sizes, we obtain utility bounds (Theorem 4.3) with a guaranteed improvement compared to uniform allocation. The improvement increases with the degree of skew in the task distribution.

3. We conduct experiments on synthetic data and on two standard recommendation benchmarks which exhibit a heavy distribution skew.
Our methods significantly improve accuracy, even compared to strong baselines such as the tail-sampling heuristic of Chien et al. (2021). We also provide a detailed analysis of the quality impact across tasks. We find that adaptive budget allocation visibly improves quality on tail tasks, with a limited impact on head tasks.

1.2. Related Work

A common technique in user-level differential privacy is to adaptively control the sensitivity with respect to the data of a user. This is typically done via clipping (Abadi et al., 2016), sampling (Kasiviswanathan et al., 2013; Amin et al., 2019), using privacy filters (Feldman & Zrnic, 2021), or using weights in the objective function (Proserpio et al., 2014; Epasto et al., 2020) to scale down the contribution of high-sensitivity users. Our approach is related, in that we adaptively control sensitivity. But while these methods focus on adapting to users (users with high sensitivity are assigned lower weights), we focus on adapting to tasks (within each user, easier tasks are assigned lower weights). We mention in particular (Epasto et al., 2020), who use weights to bound user sensitivity, and analyze the optimal choice of weights under certain assumptions. Their approach is perhaps the most similar to ours, in that the choice of weights is phrased as an optimization problem. But our setting is inherently different: they consider the traditional ERM setting with a single task, while we consider the multi-task setting (with heterogeneous tasks). As a result, their solution assigns identical weights to all examples of a given user, while our solution is designed to allocate a user's budget among tasks, and generally assigns different weights to different examples of a user.

Turning to the multi-task setting, prior work on DP multi-task learning (Li et al., 2020; Hu et al., 2021) has focused on task-level privacy, where every user is identified with one task. We consider a more general setting, in which tasks do not necessarily coincide with users. In particular, under our model, it is important that a user can contribute to more than one task.

Another related problem is that of answering multiple linear or convex queries under DP, e.g. (Dwork et al., 2010; Hardt & Rothblum, 2010; Ullman, 2015). Though related, the two problems are different because the utility measure is inherently different: in the multi-query setting, the goal is to answer all queries accurately (a minimax objective), while in the multi-task setting, the utility is a sum across tasks, as in (1). Besides, algorithms in the multi-query convex setting (closest to ours) often have to maintain a distribution over a discretization of the data space, making them prohibitively expensive in continuous feature spaces, due to an exponential dependence on the dimension.

In practice, budget allocation among tasks is often done through uniform sampling, for example in DP SQL (Wilson et al., 2020), LinkedIn's Audience Engagements API (Rogers et al., 2021), and Plume (Amin et al., 2022). The techniques that do adapt to the task distribution (via sampling as in (Chien et al., 2021) or clipping as in (Xu et al., 2021)) tend to be heuristic in nature, and lack a rigorous utility analysis. To the best of our knowledge, we present the first systematic analysis of multi-task, user-level DP under distribution skew.
Our method operates directly on the loss function, which makes it applicable to standard algorithms such as DPSGD, and it has a guaranteed improvement under distribution skew.

2. Preliminaries

2.1. Notation

Throughout the paper, we denote by Ω ⊆ [m] × [n] the task-user membership graph, where m and n are the number of tasks and users, respectively. We assume that m ≤ n. For all i, let Ω_i := {j : (i, j) ∈ Ω} be the set of users who contribute to task i, and for all j, let Ω^j := {i : (i, j) ∈ Ω} be the set of tasks to which user j contributes data. We write n_i = |Ω_i| and n^j = |Ω^j|.

Let ∥x∥ be the L2 norm of a vector x, and ∥X∥_F the Frobenius norm of a matrix X. For a symmetric matrix A, let A^† denote the pseudo-inverse of A's projection on the PSD cone. For a closed convex set C, let ∥C∥ denote its Euclidean diameter, and Π_C(·) the Euclidean projection onto C. Let N^d denote the standard multivariate normal distribution in dimension d (zero mean, identity covariance), and N^{d×d} the distribution of symmetric d × d matrices whose upper-triangular entries are i.i.d. standard normal. Finally, whenever we state that a bound holds with high probability (abbreviated w.h.p.), we mean with probability 1 − 1/n^c, where c is any positive constant, say 2.

2.2. User-Level Differential Privacy

Let D be a domain of data sets. Two data sets D, D′ ∈ D are called neighboring if one is obtained from the other by removing all of the data samples of a single user. Let A : D → S be a randomized algorithm with output space S.

Definition 2.1 (User-Level Differential Privacy (Dwork & Roth, 2014)). Algorithm A is (ε, δ)-differentially private (DP) if for all neighboring data sets D, D′ ∈ D, and all measurable S ⊆ S,

Pr(A(D) ∈ S) ≤ e^ε Pr(A(D′) ∈ S) + δ.

Remark 2.2. The data of a user j consists of Ω^j (the set of tasks that user contributes to), together with the set of features and labels {(x_ij, y_ij) : i ∈ Ω^j}. Notice that the user-task graph Ω itself is private. For example, in the recommendation setting, Ω represents which pieces of content each user interacted with, while the labels y_ij represent how much a user liked that piece of content; we want to protect both.

For simplicity of presentation, our results are stated in terms of Rényi differential privacy (RDP). Translation from RDP to DP is standard, and can be done for example using (Mironov, 2017, Proposition 3).

Definition 2.3 (User-Level Rényi Differential Privacy (Mironov, 2017)). An algorithm A is (α, ρ)-Rényi differentially private (RDP) if for all neighboring data sets D, D′ ∈ D,

D_α(A(D) ∥ A(D′)) ≤ ρ,

where D_α is the Rényi divergence of order α.
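To make the RDP-to-DP translation concrete, the following sketch (our own illustration, not code from the paper) converts an (α, αβ)-RDP guarantee, holding for all α > 1, into an (ε, δ)-DP guarantee by minimizing the standard bound ε = αβ + log(1/δ)/(α − 1) of (Mironov, 2017, Proposition 3) over a grid of orders α:

```python
import numpy as np

def rdp_to_dp(beta: float, delta: float) -> float:
    """Convert an (alpha, alpha * beta)-RDP guarantee (for all alpha > 1)
    into an (epsilon, delta)-DP guarantee using
    epsilon = alpha * beta + log(1/delta) / (alpha - 1)
    (Mironov, 2017, Proposition 3), minimized over a grid of alpha."""
    alphas = 1.0 + np.logspace(-4, 4, 10_000)  # candidate orders alpha > 1
    eps = alphas * beta + np.log(1.0 / delta) / (alphas - 1.0)
    return float(eps.min())

# Example: the budget beta = eps^2 / (8 log(1/delta)) from Theorem 3.3
# indeed yields a final epsilon below the target (here, target eps = 1).
delta = 1e-5
beta = 1.0 ** 2 / (8 * np.log(1 / delta))
print(rdp_to_dp(beta, delta))  # ~0.72 < 1
```

The closed-form minimizer is α = 1 + √(log(1/δ)/β), giving ε = β + 2√(β log(1/δ)); the grid search simply avoids deriving it by hand.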
3. Budget Allocation via Weights

First, observe that the objective function (1) can be decomposed as

L(θ) = Σ_{i=1}^m L_i(θ_i);   L_i(θ_i) = Σ_{j∈Ω_i} ℓ(θ_i; x_ij, y_ij),    (2)

where θ_i ∈ R^d are the model parameters for task i, and θ is a shorthand for (θ_1, ..., θ_m). For all i, we will denote the non-private solution of task i by θ*_i = argmin_{θ_i} L_i(θ_i). The goal is to learn a private solution θ̂ under user-level DP. The quality of θ̂ will be measured in terms of the excess empirical risk,

Σ_{i=1}^m E[L_i(θ̂_i)] − L_i(θ*_i),

where the expectation is over the randomness of the algorithm.

Remark 3.1. Although the objective decomposes as a sum of the L_i(θ_i), the problems are coupled through the privacy constraint: since user j contributes data to all tasks in Ω^j, the combined sensitivity of these tasks with respect to user j's data must be bounded. One is then faced with the question of how to allocate this sensitivity among tasks. This will be achieved by using weights in the objective function.

Our general approach is to apply a DP algorithm to a weighted version of the objective L(θ). Let w := (w_ij)_{(i,j)∈Ω} be a collection of arbitrary positive weights, and define the weighted multi-task objective

L_w(θ) = Σ_{i=1}^m Σ_{j∈Ω_i} w_ij ℓ(θ_i; x_ij, y_ij).    (3)

We first give some intuition on the effect of the weights (a formal analysis is given in the next section). Suppose ℓ is bounded, and that we compute θ̂ by applying noisy gradient descent (with fixed noise standard deviation) to L_w(θ). Then w has the following effect on privacy and utility: increasing the weights of task i generally improves its utility (it increases the norm of task i's gradients, hence its signal-to-noise ratio); increasing the weights of user j decreases her privacy (it increases the norm of user j's gradients, hence the L2 sensitivity with respect to her data). Utility depends on the task weights (w_ij)_{j∈Ω_i}, while sensitivity/privacy depends on the user weights (w_ij)_{i∈Ω^j}. By characterizing precisely how sensitivity and utility scale with the weights w, we will be able to choose w to achieve the optimal trade-off.

Remark 3.2. Although we will apply a DP algorithm to the weighted objective L_w, utility will always be measured in terms of the original, unweighted objective L. The weights should be thought of as a tool to allocate users' privacy budget, rather than a change in the objective function.

3.1. Weighted Multi-Task Ridge Regression

We start by analyzing the case of multi-task ridge regression, optimized using the Sufficient Statistics Perturbation (SSP) method (Foulds et al., 2016; Wang, 2018). This paves the way for the more general convex setting of the next section. Suppose that each task is a ridge regression problem. The multi-task objective (2) becomes

L(θ) = Σ_{i=1}^m Σ_{j∈Ω_i} (θ_i^⊤ x_ij − y_ij)² + λ∥θ_i∥²,    (4)

where λ is a regularization constant. The exact solution of (4) is given by θ*_i = A_i^{-1} b_i, where

A_i = Σ_{j∈Ω_i} (x_ij x_ij^⊤ + λI),   b_i = Σ_{j∈Ω_i} y_ij x_ij.    (5)

We propose a task-weighted version of SSP, described in Algorithm 1.

Algorithm 1 Task-weighted SSP (Sufficient Statistics Perturbation)
1: Inputs: task-user graph Ω, user features and labels (x_ij, y_ij)_{(i,j)∈Ω}, clipping parameters Γ_x, Γ_θ, regularization parameter λ, weights w, domain C.
2: For all i, j: x_ij ← clip(x_ij, Γ_x), y_ij ← clip(y_ij, Γ_x Γ_θ), where clip(x, Γ) = x · min(1, Γ/∥x∥)
3: for 1 ≤ i ≤ m do
4:   Sample ξ_i ~ N^d, Ξ_i ~ N^{d×d}
5:   Â_i ← Σ_{j∈Ω_i} w_ij (x_ij x_ij^⊤ + λI) + Γ_x² Ξ_i
6:   b̂_i ← Σ_{j∈Ω_i} w_ij y_ij x_ij + Γ_x² Γ_θ ξ_i
7:   θ̂_i ← Â_i^† b̂_i
8: end for
9: Return θ̂

The private solution θ̂_i is obtained by forming weighted, noisy estimates Â_i, b̂_i of the sufficient statistics (5), then returning θ̂_i = Â_i^† b̂_i. Notice that if we use larger weights for task i, the relative noise scale in Â_i, b̂_i becomes smaller (see Lines 5-6), which improves the quality of the estimate, while also increasing the sensitivity. Once we analyze how the weights affect privacy and utility, we will be able to reason about an optimal choice of weights.

First, we state the privacy guarantee of Algorithm 1. All proofs are deferred to Appendix A.

Theorem 3.3 (Privacy guarantee of Algorithm 1). Let (w_ij)_{(i,j)∈Ω} be a collection of task-user weights, and let β = max_{j∈[n]} Σ_{i∈Ω^j} w_ij². Then Algorithm 1 run with weights w is (α, αβ)-RDP for all α > 1.

The result can be easily translated to traditional DP: for example, for all ε, δ > 0 with ε ≤ log(1/δ), if β ≤ ε²/(8 log(1/δ)), then the algorithm is (ε, δ)-DP.
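To make Algorithm 1 concrete, here is a minimal NumPy sketch of the per-task computation (Lines 2-7). The function names and the eigendecomposition-based pseudo-inverse are our own choices, not the paper's released implementation:

```python
import numpy as np

def clip(v, gamma):
    """Rescale v to norm at most gamma (Line 2 of Algorithm 1)."""
    nrm = np.linalg.norm(v)
    return v * min(1.0, gamma / nrm) if nrm > 0 else v

def task_weighted_ssp(xs, ys, w, lam, gamma_x, gamma_theta, rng):
    """Solve one task i of Algorithm 1.
    xs: (n_i, d) features of the users in Omega_i; ys: (n_i,) labels;
    w: (n_i,) per-(task, user) weights w_ij."""
    d = xs.shape[1]
    xs = np.stack([clip(x, gamma_x) for x in xs])
    ys = np.clip(ys, -gamma_x * gamma_theta, gamma_x * gamma_theta)
    # Weighted sufficient statistics (eq. (5)), then Gaussian noise.
    A = sum(wij * (np.outer(x, x) + lam * np.eye(d)) for wij, x in zip(w, xs))
    b = sum(wij * y * x for wij, y, x in zip(w, ys, xs))
    z = rng.standard_normal((d, d))
    Xi = np.triu(z) + np.triu(z, 1).T             # symmetric noise ~ N^{d x d}
    A_hat = A + gamma_x ** 2 * Xi                 # Line 5
    b_hat = b + gamma_x ** 2 * gamma_theta * rng.standard_normal(d)  # Line 6
    # Line 7: pseudo-inverse of A_hat's projection on the PSD cone.
    evals, evecs = np.linalg.eigh(A_hat)
    evals = np.maximum(evals, 0.0)
    inv = np.divide(1.0, evals, out=np.zeros_like(evals), where=evals > 0)
    return evecs @ (inv * (evecs.T @ b_hat))
```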
Remark 3.4. The theorem can be interpreted as follows: each user j has a total budget β, which can be allocated among the tasks in Ω^j by choosing weights such that Σ_{i∈Ω^j} w_ij² ≤ β. For this reason, we will refer to β as the user RDP budget.

Remark 3.5. Sampling a fixed number of tasks per user (as done in (Chien et al., 2021)) is a special case of this formulation, where w_ij is a constant if task i is sampled for user j, and 0 otherwise. Using general weights allows a finer trade-off between tasks, as we shall see next. There is also a practical advantage to using weights: sampling methods require choosing a maximum number of tasks to sample per user (a hyper-parameter), and this number enters the privacy accounting through the user sensitivity. This hyper-parameter needs careful tuning: if it is too large, many users fall below the threshold and we end up adding too much noise; if it is too small, we drop too many samples. With weights, there is no such hyper-parameter (we keep all samples of all users), which simplifies parameter tuning.

For the utility analysis, we make the following standard assumption.

Assumption 3.6. There exist Γ_x, Γ_θ > 0 such that for all i, j: ∥x_ij∥ ≤ Γ_x, ∥θ*_i∥ ≤ Γ_θ, and |y_ij| ≤ Γ_x Γ_θ.

Remark 3.7. We state the result in the ridge regression case with bounded data to simplify the presentation. Similar results can be obtained under different assumptions, for instance when the x_ij are i.i.d. Gaussian, in which case one can bound the minimum eigenvalue of the covariance matrices A_i without the need for regularization; see (Wang, 2018) for a detailed discussion.

Theorem 3.8 (Utility guarantee of Algorithm 1). Suppose Assumption 3.6 holds. Let ω ∈ R^m be a vector of positive weights such that for all j, Σ_{i∈Ω^j} ω_i² ≤ β. Let θ̂ be the output of Algorithm 1 with weights¹ w_ij = ω_i for all (i, j) ∈ Ω. Then Algorithm 1 is (α, αβ)-RDP for all α > 1, and w.h.p.,

L(θ̂) − L(θ*) = O( (Γ_x⁴ Γ_θ² d / λ) Σ_{i=1}^m 1/(ω_i² n_i) ).    (6)

¹The choice w_ij = ω_i (all examples of a task share the same weight) simplifies the proof, and is sufficient to improve utility under skew; see Section 4.

Theorem 3.8 highlights that under the user-level privacy constraint (max_j Σ_{i∈Ω^j} ω_i² ≤ β), there is a certain trade-off between tasks: by increasing ω_i we improve quality for task i, but this may require decreasing ω_{i'} for other tasks to preserve a constant total privacy budget. This allows us to reason about how to choose weights that achieve the optimal trade-off under a given RDP budget β. Remarkably, this problem is convex:

min_ω Σ_{i=1}^m 1/(ω_i² n_i)   s.t.   Σ_{i∈Ω^j} ω_i² ≤ β   for all j ∈ {1, ..., n}.

If the task-user graph Ω were public, one could solve the problem exactly. But since task-user membership often represents private information, we want to protect Ω. Our approach will be to compute an approximate solution. Before tackling this problem, we first derive similar weight-dependent privacy and utility bounds for noisy gradient descent.
3.2. Weighted Multi-Task Convex Minimization

We now consider the more general problem of minimizing empirical risk for convex losses in the multi-task setting. Our method applies noisy GD (Bassily et al., 2014) to the weighted objective (3). This is summarized in Algorithm 2.

Algorithm 2 Task-weighted noisy gradient descent
1: Inputs: task-user graph Ω, user features and labels x, y, domain C, clipping parameter Γ, weights w, initial parameters θ̂^(0), number of steps T, learning rates η_i^(t).
2: for 1 ≤ i ≤ m do
3:   for 1 ≤ t ≤ T do
4:     Sample ξ_i^(t) ~ N^d
5:     g_i^(t) ← Σ_{j∈Ω_i} w_ij clip(∇_{θ_i} ℓ(θ_i^(t−1); x_ij, y_ij), Γ)
6:     θ̂_i^(t) ← Π_C[θ_i^(t−1) − η_i^(t) (g_i^(t) + Γ √(T/2) ξ_i^(t))]
7:   end for
8: end for
9: Return θ̂^(T) = (θ̂_1^(T), ..., θ̂_m^(T))

We consider two standard settings, in which noisy GD is known to achieve nearly optimal empirical risk bounds:

Assumption 3.9 (Lipschitz convex on a bounded domain). Assume that C is a bounded convex domain, and that there exists Γ > 0 such that ℓ(θ; x, y) is a convex and Γ-Lipschitz function of θ ∈ C, uniformly in x, y.

Assumption 3.10 (Lipschitz, strongly convex). Assume that there exist λ, Γ > 0 such that ℓ(θ; x, y) is a λ-strongly convex and Γ-Lipschitz function of θ, uniformly in x, y.

Next, we derive privacy and utility bounds.

Theorem 3.11 (Privacy guarantee of Algorithm 2). Let (w_ij)_{(i,j)∈Ω} be a collection of task-user weights, and let β = max_{j∈[n]} Σ_{i∈Ω^j} w_ij². Then Algorithm 2 is (α, αβ)-RDP for all α > 1.

Theorem 3.12 (Utility guarantee of Algorithm 2). Let ω ∈ R^m be a vector of positive weights such that for all j, Σ_{i∈Ω^j} ω_i² ≤ β. Let θ̂ be the output of Algorithm 2 with weights w_ij = ω_i. Then Algorithm 2 is (α, αβ)-RDP for all α > 1. Furthermore,

(a) Under Assumption 3.9, if T = 2 Σ_{i=1}^m n_i² / (d Σ_{i=1}^m 1/ω_i²) and η_i^(t) = ∥C∥ / (Γ (ω_i² n_i² + Td/2)^{1/2} √t), then

E[L(θ̂)] − L(θ*) = Õ( ∥C∥ Γ ( m d Σ_{i=1}^m 1/ω_i² )^{1/2} ),    (7)

where Õ hides polylogarithmic factors in the n_i.

(b) Under Assumption 3.10, if T = 2 |Ω| / (d Σ_{i=1}^m 1/(ω_i² n_i)) and η_i^(t) = 1/(ω_i n_i λ t), then

E[L(θ̂)] − L(θ*) = Õ( (Γ² d / λ) Σ_{i=1}^m 1/(ω_i² n_i) ).    (8)

We consider some special cases, as a consistency check.

Single task. Suppose there is a single task (m = 1); then the RDP budget constraint is simply satisfied with ω_1² = β. Plugging this into the bounds from the theorem, and dividing by the number of examples n (as our loss L is a summation, instead of an average, over all examples), we get that the average empirical risk is bounded by O(∥C∥ Γ √d / (n √β)) and O(Γ² d / (λ n² β)) in the convex and strongly convex cases, respectively. This recovers the usual convex ERM bounds (Bassily et al., 2014) with the correspondence β = O(ε² / log(1/δ)).

Complete bipartite Ω. Suppose there are m tasks and each user participates in all tasks. Then the budget constraint becomes m ω_i² = β for all i. Plugging this into the bounds and dividing by mn (the total number of examples), we get that the average empirical risk is O(∥C∥ Γ √(md) / (n √β)) and O(Γ² md / (λ n² β)) in the convex and strongly convex cases, respectively. Notice that this is equivalent to solving a single task in dimension md, and the resulting ERM bound is the same.
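A minimal sketch of the per-task loop of Algorithm 2, with our own function names; the per-example clipping at Γ and the noise scale Γ√(T/2) follow Lines 5-6:

```python
import numpy as np

def clip(g, gamma):
    """Rescale g to norm at most gamma."""
    nrm = np.linalg.norm(g)
    return g * min(1.0, gamma / nrm) if nrm > 0 else g

def task_weighted_noisy_gd(grad_fn, xs, ys, w, theta0, T, gamma,
                           lr_fn, project, rng):
    """One task i of Algorithm 2 (task-weighted noisy GD).
    grad_fn(theta, x, y): per-example gradient of the loss;
    w: weights w_ij for the users j in Omega_i;
    lr_fn(t): learning rate eta_i^(t); project: projection onto C."""
    theta = theta0
    noise_scale = gamma * np.sqrt(T / 2.0)   # noise multiplier of Line 6
    for t in range(1, T + 1):
        # Line 5: weighted sum of clipped per-example gradients.
        g = sum(wij * clip(grad_fn(theta, x, y), gamma)
                for wij, x, y in zip(w, xs, ys))
        xi = rng.standard_normal(theta.shape)
        theta = project(theta - lr_fn(t) * (g + noise_scale * xi))
    return theta
```

For instance, for the ridge loss of Section 3.1, grad_fn(θ, x, y) = 2(θᵀx − y)x + 2λθ, and project can be the identity in the unconstrained case.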
4. Optimizing Weights Under Distribution Skew

Equipped with the utility bounds of the previous section, we can ask what choice of weights minimizes the error bound under a given privacy budget. Observe that all utility bounds (6), (7), (8) are increasing in the quantity Σ_{i=1}^m 1/(ω_i² n_i^γ), where γ = 0 in the convex case, and γ = 1 in the strongly convex case (for both SSP and noisy GD). This amounts to solving a constrained optimization problem of the form

min_ω Σ_{i=1}^m 1/(ω_i² n_i^γ)   s.t.   Σ_{i∈Ω^j} ω_i² ≤ β   for all j ∈ {1, ..., n}.    (9)

In general, the task-user graph Ω is private (see Remark 2.2). We cannot solve (9) exactly, but under distributional assumptions on Ω, we can solve it approximately. In modeling the distribution of Ω, we take inspiration from the closely related problem of matrix completion, where a standard assumption is that the elements of Ω are sampled uniformly at random, see e.g. (Jain et al., 2013); in other words, Ω is an Erdős–Rényi random graph. However, such an assumption implies that the task sizes n_i = |Ω_i| concentrate around their mean, which fails to capture the task-skew problem we set out to solve. Instead, we relax the assumption by removing uniformity of tasks while keeping uniformity of users. Specifically:

Assumption 4.1. Assume that Ω is obtained by sampling, for each task i, n_i users independently and uniformly from {1, ..., n}. Furthermore, assume that the task sizes n_i are publicly known, and that n_i ≥ 1 for all i.

In practice, both users and tasks can be heterogeneous. In that sense, Assumption 4.1 is a stylized model, meant primarily to capture the long tail of tasks. We leave extensions to the user-heterogeneous case to future work. The assumption that the n_i are publicly known is relatively mild: in some applications, this information is available; for example, in video recommendation, the number of views of each video is often public. A similar assumption was made by (Epasto et al., 2020), where the number of examples contributed by each user is assumed to be public. When the n_i are not available, the solution can be based on private estimates of the n_i; we derive the extension to this case in Appendix B. In our experiments, we always privately estimate the n_i and account for the privacy cost of doing so.

A consequence of Assumption 4.1 is that the quantity B_j := Σ_{i∈Ω^j} ω_i² (which appears in the inequality constraint in (9)) becomes a random variable, concentrated around its mean, Σ_{i=1}^m (n_i/n) ω_i². More precisely, we will show that, w.h.p., all B_j are bounded above by c(n) Σ_{i=1}^m (n_i/n) ω_i² for an appropriate c(n). We can then replace the n constraints in (9) by a single constraint (for the privacy guarantee, the constraints will be enforced with probability 1, as we discuss below; but for the purpose of optimizing the utility bound, it suffices that the constraints hold w.h.p.). The problem becomes:

min_ω Σ_{i=1}^m 1/(ω_i² n_i^γ)   s.t.   c(n) Σ_{i=1}^m (n_i/n) ω_i² ≤ β,    (10)

which we can solve in closed form as follows. Let β̄ = β/c(n). The Lagrangian is Σ_{i=1}^m 1/(ω_i² n_i^γ) + λ (Σ_{i=1}^m n_i ω_i²/n − β̄) (λ is a Lagrange multiplier), and the KKT conditions yield: for all i, −1/(ω_i³ n_i^γ) + λ n_i ω_i/n = 0, and Σ_{i=1}^m n_i ω_i²/n = β̄, which simplifies to

ω*_i = n_i^{−(γ+1)/4} ( n β̄ / Σ_{i'=1}^m n_{i'}^{(1−γ)/2} )^{1/2}.    (11)

With ω_i of this form and c(n) = c log n, it can be shown (see Lemma A.4) that w.h.p., max_j Σ_{i∈Ω^j} (ω*_i)² ≤ β. But since we need the privacy guarantee to always hold, we simply add the following clipping step: let w* be the task-user weights defined as

w*_ij = ω*_i · min( 1, β^{1/2} / (Σ_{i'∈Ω^j} (ω*_{i'})²)^{1/2} )   for all (i, j) ∈ Ω.    (12)

In other words, for each user j, if the total sensitivity Σ_{i∈Ω^j} (ω*_i)² exceeds β, we scale down the weights of this user to meet the budget. This step ensures that the privacy guarantee is exact. For the utility analysis, we know that w.h.p., w*_ij = ω*_i (since the constraint is satisfied w.h.p.), so the utility bounds hold w.h.p.

Remark 4.2 (Weight clipping preserves privacy). Observe from eq. (12) that the clipped weights of user j depend only on ω* (which is released differentially privately) and on Ω^j; they do not depend on the data of other users j'. This means that weight clipping can be done separately for each user, much like gradient clipping in DPSGD. Furthermore, the clipped weights are never publicly released; weight clipping is internal to each algorithm.
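For illustration, the closed-form weights (11) and the per-user clipping step (12) can be computed as follows; this is a sketch with our own names, taking the task sizes n_i as known inputs (per Assumption 4.1):

```python
import numpy as np

def optimal_task_weights(n_i, n, beta_bar, gamma):
    """Closed-form solution of (10), eq. (11):
    omega_i = n_i^{-(gamma+1)/4} * sqrt(n * beta_bar / sum_i' n_i'^{(1-gamma)/2})."""
    n_i = np.asarray(n_i, dtype=float)
    scale = np.sqrt(n * beta_bar / np.sum(n_i ** ((1.0 - gamma) / 2.0)))
    return n_i ** (-(gamma + 1.0) / 4.0) * scale

def clip_user_weights(omega, user_tasks, beta):
    """Per-user clipping, eq. (12): scale down user j's weights if the total
    sensitivity sum_{i in Omega^j} omega_i^2 exceeds beta.
    user_tasks: the task indices Omega^j of one user j."""
    total = np.sum(omega[user_tasks] ** 2)
    factor = min(1.0, np.sqrt(beta / total)) if total > 0 else 1.0
    return omega[user_tasks] * factor

# Example on a heavily skewed task distribution (gamma = 1, strongly convex).
n, beta = 10_000, 0.5
beta_bar = beta / (2 * np.log(n))        # beta / c(n) with c(n) = c log n
n_i = np.array([9_000, 500, 50, 10, 5])
omega = optimal_task_weights(n_i, n, beta_bar, gamma=1)
# omega increases as n_i decreases: tail tasks receive larger weights.
w_user = clip_user_weights(omega, user_tasks=[0, 2, 4], beta=beta)
```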
The next theorem provides the utility guarantee under the weights w*.

Theorem 4.3 (Privacy-utility trade-off of Algorithms 1 and 2 under adaptive weights). Suppose Assumption 4.1 holds.

(a) Under Assumption 3.6, let θ̂ be the output of Algorithm 1 run with weights w* (eq. (11)-(12) with γ = 1). Then w.h.p.,

L(θ̂) − L(θ*) = O( Γ_x⁴ Γ_θ² d m² / (n λ β̄) ).    (13)

(b) Under Assumption 3.9, if Algorithm 2 is run with weights w* (eq. (11)-(12) with γ = 0) and the parameters listed in Theorem 3.12(a), then

E[L(θ̂)] − L(θ*) = Õ( ∥C∥ Γ ( m d / (n β̄) )^{1/2} Σ_{i=1}^m n_i^{1/2} ),    (14)

where E denotes expectation conditioned on a high-probability event.

(c) Under Assumption 3.10, if Algorithm 2 is run with w* (eq. (11)-(12) with γ = 1) and the parameters listed in Theorem 3.12(b), then

E[L(θ̂)] − L(θ*) = Õ( Γ² d m² / (λ n β̄) ).    (15)

In all three cases, the algorithm is (α, αβ)-RDP for all α > 1.

To understand the effect of adapting to task skew, we compare these bounds to the case of uniform weights. With uniform weights, the privacy constraint yields ω_i^uniform = (n β̄ / Σ_{i'=1}^m n_{i'})^{1/2} for all i. Taking the ratio between the utility bound for ω^uniform and the utility bound for ω*, we obtain the following. In the Lipschitz bounded case, the relative improvement is

R_0(n_1, ..., n_m) = ( m Σ_{i=1}^m n_i )^{1/2} / Σ_{i=1}^m n_i^{1/2}.

In the strongly convex case (both for SSP and noisy GD), the relative improvement is

R_1(n_1, ..., n_m) = ( Σ_{i=1}^m 1/n_i ) ( Σ_{i=1}^m n_i ) / m².

In particular, R_0 and R_1 are lower bounded by 1 (by Cauchy-Schwarz), with equality when the task distribution is uniform (all n_i equal). In both cases, the improvement can be arbitrarily large. To see this, consider the extreme case where n_1 = m^{1+ν} and n_i is a constant for all other m − 1 tasks (ν > 0). A simple calculation shows that in this case, the leading terms as m → ∞ are R_0 ≈ m^{ν/2} and R_1 ≈ m^ν; both are unbounded in m.

Finally, we give a qualitative comment on the optimal weights (11). While it is clear that increasing the weight of a task improves its quality, it was not clear, a priori, that increasing weights on tail tasks benefits the overall objective (1) (since tail tasks represent fewer terms in the sum). The analysis shows that this is indeed the case: the optimal trade-off (see eq. (11)) is obtained when ω_i ∝ n_i^{−(1+γ)/4}, i.e., larger weights are assigned to smaller tasks. This can be explained by a certain diminishing-returns effect that depends on the task size: in the optimization problem (10), the marginal utility benefit of increasing ω_i is smaller for larger tasks (due to the n_i^γ term in the objective), while the marginal privacy cost of increasing ω_i is higher for larger tasks (due to the n_i term in the constraint).

5. Empirical Evaluation

To evaluate our methods, we run experiments² on large-scale recommendation benchmarks on the MovieLens data sets (Harper & Konstan, 2016). As mentioned in the introduction, recommendation problems are known to exhibit a long tail of content with much less training data than average, and this is the main practical issue we seek to address. We also run experiments on synthetic data, reported in Appendix C. In our evaluation, we investigate the following questions:

1. Whether, on realistic data sets, our adaptive weight methods can improve the overall utility by shifting weights towards the tail, as the analysis indicates. In particular, Theorem 4.3 suggests that (under distributional assumptions on Ω) the optimal task weights are of the form ω_i ∝ n_i^{−(1+γ)/4}. We experiment with weights of the form ω_i ∝ n_i^{−μ} for different values of μ ∈ [0, 1].

2. Do we observe similar improvements for different algorithms? Our analysis applies both to SSP (Algorithm 1) and noisy GD (Algorithm 2). We run the experiments with both algorithms.

²The code is available at: https://github.com/google-research/google-research/tree/master/dp_alternating_minimization
3. Beyond improvements in the average metrics, what is the extent of the quality impact across the head and the tail of the task distribution?

Experimental setup. Each of the MovieLens data sets consists of a sparse matrix of ratings given by users to movies. The first benchmark, from Lee et al. (2013), is a rating prediction task on MovieLens 10 Million (abbreviated as ML10M), where quality is evaluated using the RMSE of the predicted ratings. The second benchmark, from Liang et al. (2018), is a top-k item recommendation task on MovieLens 20 Million (abbreviated as ML20M), where the model is used to recommend k movies to each user, and quality is measured using Recall@k. Figure 7 in the appendix shows the movie distribution skew in each data set; for example, in ML20M, the top 10% of movies account for 86% of the training data. In both cases, we train a DP matrix factorization model. We identify movies with tasks and apply our algorithms to learning the movie embedding representations; for details, see Appendix D.

The current state of the art on these benchmarks is the DP alternating least squares (DPALS) method (Chien et al., 2021), which we use as a baseline. We compare to DPALS both with uniform sampling (sample a fixed number of movies per user, uniformly at random) and with tail-biased sampling (sort all movies by increasing counts, then for each user, keep the k first movies). The latter heuristic was specifically designed by Chien et al. (2021) to address the movie distribution skew, and is the current SOTA. We also experiment with DPSGD on the same benchmarks, and compare uniform sampling, tail-biased sampling, and adaptive weights.

When computing adaptive weights for our methods, we first compute private estimates n̂_i of the counts (which we include in the privacy accounting), then define task weights following eq. (11)-(12), but allowing a wider range of exponents. More precisely, we replace eq. (11) with

ω*_i = n̂_i^{−μ} ( n β̄ / Σ_{i'=1}^m n̂_{i'}^{1−2μ} )^{1/2},    (16)

where μ is a hyper-parameter. In the analysis, the optimal choice of weights corresponds to μ = 1/4 in the convex case (eq. (11) with γ = 0), and μ = 1/2 in the strongly convex case (eq. (11) with γ = 1). We experiment with different values of μ ∈ [0, 1] to evaluate the effect of shifting the weight distribution towards the tail.

Effect of adaptive weights on the privacy-utility trade-off. We first evaluate the privacy-utility trade-off. The results are reported in Figure 1 (for ML10M) and Figure 2 (for ML20M), where we also include non-private baselines for reference.

Figure 1. Privacy-utility trade-off on ML10M, using uniform sampling, tail-biased sampling, and adaptive weights (our method), applied to DPALS (top) and DPSGD (bottom); non-private ALS and SGD are included for reference. Utility is measured using RMSE (lower is better).

Figure 2. Privacy-utility trade-off on ML20M, using uniform sampling, tail-biased sampling, and adaptive weights (our method), applied to DPALS (top) and DPSGD (bottom); non-private ALS and SGD are included for reference. Utility is measured using Recall@20 (higher is better).
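For concreteness, the adaptive weight computation of eq. (16), including the private count estimates n̂_i, can be sketched as follows. The function names are ours, and the Gaussian noise on the counts is our assumption for illustration; the paper only states that the counts are estimated privately and accounted for in the overall budget:

```python
import numpy as np

def private_counts(counts, noise_std, rng):
    """Noisy estimates n_hat_i of the task counts n_i (Gaussian mechanism
    assumed here for illustration). Clamp at 1 so that downstream powers
    of n_hat_i are well defined."""
    noisy = np.asarray(counts, float) + noise_std * rng.standard_normal(len(counts))
    return np.maximum(noisy, 1.0)

def mu_task_weights(n_hat, n, beta_bar, mu):
    """Eq. (16): omega_i = n_hat_i^{-mu} * sqrt(n * beta_bar /
    sum_i' n_hat_i'^{1 - 2 mu}). Setting mu = 1/4 (convex) or mu = 1/2
    (strongly convex) recovers eq. (11)."""
    n_hat = np.asarray(n_hat, dtype=float)
    scale = np.sqrt(n * beta_bar / np.sum(n_hat ** (1.0 - 2.0 * mu)))
    return n_hat ** (-mu) * scale
```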
Our adaptive weights methods achieve the state of the art on both benchmarks. We see improvements when applying adaptive weights both to DPSGD and to DPALS. The extent of the improvement varies depending on the benchmark. In particular, the improvement is remarkable on the ML20M benchmark across all values of ε; using adaptive weights significantly narrows the gap between the previous SOTA and the non-private baseline.

To illustrate the effect of the exponent μ, we report, in Figure 3, the performance on ML10M for different values of μ. The best performance is typically achieved when μ is between 1/4 and 1/2, a range consistent with the analysis. For larger values of μ (for example μ = 1), there is a clear degradation of quality, likely due to assigning too little weight to head tasks. The trend is similar on ML20M; see Figure 9 in the appendix.

Figure 3. Effect of the weight exponent μ (see eq. (16)) on the ML10M benchmark, using (a) DPALS and (b) DPSGD, for ε ∈ {5, 10, 20} and μ ∈ {0, 1/8, 1/4, 1/3, 1/2, 1}.

Impact on head and tail tasks. To better understand the extent of the quality impact on head/tail movies, we report the same metrics, sliced by movie counts. We sort the movies i by increasing counts n_i, then group them into equally sized buckets, and compute average metrics³ on each bucket. The results are reported in Figure 4 for ε = 1, and similar plots are provided in Appendix D for other values of ε.

The results show a clear trade-off between head and tail tasks. On buckets 0 and 1 (the tail), it is generally the case that the larger μ is, the better the quality, while on bucket 4 (the head), the opposite trend can be observed. This is consistent with the intuition that as μ increases, more budget is assigned to the tail, which tends to shift the trade-off in favor of the tail. Even compared to the previous SOTA (tail sampling), the improvements on all but the top bucket are significant. Although the tail-sampling heuristic was designed to improve tail quality, the experiments indicate that adaptive weights are much more effective. Furthermore, the parameter μ allows more control over this trade-off. Finally, to give a concrete illustration of these improvements, we inspect the quality of recommendations on a few sample queries, reported in Appendix D. We find that there is a visible improvement on tail recommendations in models trained using our method.

³Since recall is naturally lower on tail tasks, we report Recall@k with larger k for tail buckets: we use k = 20 for the first bucket, 40 for the second, 60 for the third, and so on. This allows for more readable plots.

Figure 4. (a) RMSE on ML10M and (b) Recall on ML20M for ε = 1, sliced by movie frequency, comparing Ada-DPALS (μ ∈ {0, 1/4, 1/3, 1/2, 1}) against DPALS with tail-biased and uniform sampling. Each bucket contains an equal number of movies; buckets are ordered by increasing movie counts.
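A sketch of the slicing used in Figure 4 (our own function names): movies are sorted by count, split into equally sized buckets, and a per-item metric is averaged within each bucket:

```python
import numpy as np

def bucketed_metric(counts, per_item_metric, num_buckets=5):
    """Slice a per-item metric by item popularity. counts[i] is n_i and
    per_item_metric[i] the metric value for item i (footnote 3 uses
    Recall@(20 * (b + 1)) for bucket b). Buckets are ordered by
    increasing count and contain an equal number of items."""
    order = np.argsort(counts)                    # sort items by count
    buckets = np.array_split(order, num_buckets)  # equal-size buckets
    return [float(np.mean(per_item_metric[b])) for b in buckets]
```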
6. Conclusion

To address the practical problem of long-tailed data distributions, we formally analyze the question of budget allocation among tasks under user-level DP. Our method is based on computing weights that adapt to the task distribution. We quantify the utility improvements under optimal weights, in a range of settings. Importantly, the method achieves significant improvements on benchmarks, and allows finer control over quality trade-offs between head and tail tasks.

To compute optimal weights, our analysis relied on certain distributional assumptions on the task-user graph Ω. Although this allowed us to model the task distribution skew, relaxing these assumptions is a promising direction. In particular, it may be possible to directly compute a privacy-preserving solution of problem (9) in its general form, using techniques similar to the constraint-private LPs studied by (Hsu et al., 2014).

Acknowledgements

We are grateful to Harsh Mehta, Steffen Rendle, John Anderson, and the anonymous reviewers for insightful discussions.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pp. 308-318, New York, NY, USA, 2016. Association for Computing Machinery.

Abowd, J. M. The U.S. Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pp. 2867, New York, NY, USA, 2018. Association for Computing Machinery.

Amin, K., Kulesza, A., Munoz, A., and Vassilvitskii, S. Bounding user contributions: A bias-variance trade-off in differential privacy. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 263-271. PMLR, 2019.

Amin, K., Gillenwater, J., Joseph, M., Kulesza, A., and Vassilvitskii, S. Plume: Differential privacy at scale. CoRR, abs/2201.11603, 2022.

Bagdasaryan, E., Poursaeed, O., and Shmatikov, V. Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, volume 32, 2019.

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464-473, 2014.

Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., and Lamere, P. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pp. 3121-3124, 2010.

Calandrino, J. A., Kilzer, A., Narayanan, A., Felten, E. W., and Shmatikov, V. "You might also like:" Privacy risks of collaborative filtering. In 2011 IEEE Symposium on Security and Privacy, pp. 231-246. IEEE, 2011.

Chien, S., Jain, P., Krichene, W., Rendle, S., Song, S., Thakurta, A., and Zhang, L. Private alternating least squares: Practical private matrix completion with tighter rates. In Proceedings of the 38th International Conference on Machine Learning, 2021.

Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Cummings, R., Feldman, V., McMillan, A., and Talwar, K.
Mean estimation with user-level privacy under data heterogeneity. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC '06, pp. 265-284, Berlin, Heidelberg, 2006. Springer-Verlag.

Dwork, C., McSherry, F., and Talwar, K. The price of privacy and the limits of LP decoding. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 85-94, 2007.

Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51-60, 2010.

Epasto, A., Mahdian, M., Mao, J., Mirrokni, V., and Ren, L. Smoothly bounding user contributions in differential privacy. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13999-14010. Curran Associates, Inc., 2020.

Feldman, V. and Zrnic, T. Individual privacy accounting via a Rényi filter. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021.

Foulds, J., Geumlek, J., Welling, M., and Chaudhuri, K. On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI '16, pp. 192-201, Arlington, Virginia, USA, 2016. AUAI Press.

Hardt, M. and Rothblum, G. N. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 61-70, 2010.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Hsu, J., Roth, A., Roughgarden, T., and Ullman, J. Privately solving linear programs. In Automata, Languages, and Programming, pp. 612-624, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

Hu, S., Wu, Z. S., and Smith, V. Private multi-task learning: Formulation and applications to federated learning. CoRR, abs/2108.12978, 2021.

Huang, C., Li, Y., Loy, C. C., and Tang, X. Learning deep representation for imbalanced classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5375-5384, 2016.

Jain, P., Netrapalli, P., and Sanghavi, S. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, pp. 665-674, New York, NY, USA, 2013. Association for Computing Machinery.

Jain, P., Thakkar, O. D., and Thakurta, A. Differentially private matrix completion revisited. In International Conference on Machine Learning, pp. 2215-2224. PMLR, 2018.

Jain, P., Rush, J., Smith, A., Song, S., and Guha Thakurta, A. Differentially private model personalization. In Advances in Neural Information Processing Systems, volume 34, pp. 29723-29735. Curran Associates, Inc., 2021.

Kasiviswanathan, S. P., Nissim, K., Raskhodnikova, S., and Smith, A. Analyzing graphs with node differential privacy. In Proceedings of the 10th Theory of Cryptography Conference on Theory of Cryptography, TCC '13, pp.
457-476, Berlin, Heidelberg, 2013. Springer-Verlag.

Kearns, M., Pai, M., Roth, A., and Ullman, J. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pp. 403-410, 2014.

Korolova, A. Privacy violations using microtargeted ads: A case study. In 2010 IEEE International Conference on Data Mining Workshops, pp. 474-482. IEEE, 2010.

Kubat, M. and Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, Tennessee, USA, July 8-12, 1997, pp. 179-186, 1997.

Lee, J., Kim, S., Lebanon, G., and Singer, Y. Local low-rank matrix approximation. In Proceedings of the 30th International Conference on Machine Learning, ICML '13, pp. II-82-II-90, 2013.

Levy, D., Sun, Z., Amin, K., Kale, S., Kulesza, A., Mohri, M., and Suresh, A. T. Learning with user-level privacy. In Advances in Neural Information Processing Systems, volume 34, pp. 12466-12479, 2021.

Li, J., Khodak, M., Caldas, S., and Talwalkar, A. Differentially private meta-learning. In 8th International Conference on Learning Representations, ICLR 2020, 2020.

Liang, D., Krishnan, R. G., Hoffman, M. D., and Jebara, T. Variational autoencoders for collaborative filtering. In WWW '18, pp. 689-698, 2018.

Liu, S. and Zheng, Y. Long-tail session-based recommendation, pp. 509-514. Association for Computing Machinery, New York, NY, USA, 2020.

McMahan, H. B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 2018.

Mironov, I. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263-275. IEEE, 2017.

Proserpio, D., Goldberg, S., and McSherry, F. Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets. Proc. VLDB Endow., 7(8):637-648, 2014.

Rogers, R., Subramaniam, S., Peng, S., Durfee, D., Lee, S., Kancha, S. K., Sahay, S., and Ahammad, P. LinkedIn's Audience Engagements API: A privacy preserving data analytics system at scale. Journal of Privacy and Confidentiality, 11(3), 2021.

Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2017.

Ullman, J. Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '15, pp. 303-312, New York, NY, USA, 2015. Association for Computing Machinery.

Wang, Y. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pp. 93-103. AUAI Press, 2018.

Wang, Y.-X., Ramanan, D., and Hebert, M. Learning to model the tail. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Wilson, R. J., Zhang, C. Y., Lam, W., Desfontaines, D., Simmons-Marengo, D., and Gipson, B.
Differentially private SQL with bounded user contribution. In Privacy Enhancing Technologies Symposium (PETS), 2020.

Xu, D., Du, W., and Wu, X. Removing disparate impact on model accuracy in differentially private stochastic gradient descent. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, pp. 1924-1932, New York, NY, USA, 2021. Association for Computing Machinery.

Yin, H., Cui, B., Li, J., Yao, J., and Chen, C. Challenging the long tail recommendation. Proc. VLDB Endow., 5(9):896-907, 2012.

The proofs of the main results are provided in Appendix A. Appendix B discusses the case when task counts are not public. We report additional experiments on synthetic data (Appendix C) and real data (Appendix D).

A.1. Theorem 3.3 (Privacy Guarantee of Algorithm 1)

Proof. The result is an application of the Gaussian mechanism. The procedure computes, for i ∈ [m], the estimates Â_i and b̂_i (Lines 5-6), given as follows:

Â_i = A_i + Γ_x² Ξ_i,   A_i = Σ_{j∈Ω_i} w_ij (x_ij x_ij^⊤ + λI),
b̂_i = b_i + Γ_x Γ_y ξ_i,   b_i = Σ_{j∈Ω_i} w_ij y_ij x_ij,

where Ξ_i ~ N^{d×d}, ξ_i ~ N^d, and Γ_y := Γ_x Γ_θ. Let Ā be the matrix obtained by stacking (A_i)_{i∈[m]}. If Ā′ is the same matrix obtained without user j's data, then

∥Ā − Ā′∥_F² = Σ_{i∈Ω^j} ∥w_ij x_ij x_ij^⊤∥_F² ≤ Σ_{i∈Ω^j} w_ij² Γ_x⁴ ≤ β Γ_x⁴,

where we use the fact that for all i, ∥x_ij∥ ≤ Γ_x, and that Σ_{i∈Ω^j} w_ij² ≤ β. Since we add Gaussian noise with standard deviation Γ_x², releasing Â is (α, αβ/2)-RDP (Mironov, 2017). Similarly, if b̄ is obtained by stacking (b_i)_{i∈[m]}, and b̄′ is the same vector without user j's data, then ∥b̄ − b̄′∥² = Σ_{i∈Ω^j} ∥w_ij y_ij x_ij∥² ≤ Σ_{i∈Ω^j} w_ij² Γ_y² Γ_x² ≤ β Γ_y² Γ_x², and releasing b̂ is (α, αβ/2)-RDP. By simple RDP composition, the whole process is (α, αβ)-RDP. □

A.2. Theorem 3.8 (Utility Guarantee of Algorithm 1)

Proof. Since by assumption the weights satisfy Σ_{i∈Ω^j} ω_i² ≤ β, the RDP guarantee is an immediate consequence of Theorem 3.3. To prove the utility bound, recall that the total loss is a sum over tasks,

L(θ) = Σ_{i=1}^m ∥A_i θ_i − b_i∥_F² + n_i λ ∥θ_i∥².    (17)

We will bound the excess risk of each term, using the following result. For a proof, see, e.g., (Wang, 2018, Appendix B.1).

Lemma A.1. Suppose Assumption 3.6 holds. Consider the linear regression problem L_i(θ_i) = ∥A_i θ_i − b_i∥_F², let θ*_i be its solution, and let θ̂_i be the SSP estimate obtained by replacing A_i and b_i with their noisy estimates Â_i = A_i + σ Γ_x² Ξ, b̂_i = b_i + σ Γ_x² Γ_θ ξ, where Ξ ~ N^{d×d}, ξ ~ N^d. Then w.h.p.,

L_i(θ̂_i) − L_i(θ*_i) = O( d² Γ_x² Γ_θ² σ² / (α n_i) ),   where α = λ_min(A_i^⊤ A_i) d / (n_i Γ_x²) ≥ λ d / Γ_x².

Fix a task i. Since by assumption all weights w_ij are equal to ω_i, Lines 5-6 become Â_i = ω_i A_i + Γ_x² Ξ_i and b̂_i = ω_i b_i + Γ_x² Γ_θ ξ_i, and θ̂_i is the solution to Â_i θ_i = b̂_i. This corresponds to the SSP algorithm, applied to the loss L_i, with noise variance σ² = 1/ω_i². By Lemma A.1, we have

L_i(θ̂_i) − L_i(θ*_i) = O( d² Γ_x² Γ_θ² / (α n_i ω_i²) ).    (18)

We conclude by summing (18) over i ∈ [m]. □

A.3. Theorem 3.11 (Privacy Guarantee of Algorithm 2)

Proof. The result is an application of the Gaussian mechanism. At each step t of Algorithm 2, the procedure computes, for i ∈ [m], a noisy estimate of the gradient g_i^(t) (Line 5). Let g^(t) ∈ R^{md} be the vector obtained by stacking the g_i^(t) for all i, and let g′^(t) be the same vector obtained without user j's data. Then

∥g′^(t) − g^(t)∥² = Σ_{i∈Ω^j} ∥w_ij ∇ℓ(θ_i; x_ij, y_ij)∥² ≤ Γ² Σ_{i∈Ω^j} w_ij² ≤ β Γ²,

where we use that ℓ is Γ-Lipschitz and the assumption that Σ_{i∈Ω^j} w_ij² ≤ β. Since we add Gaussian noise with standard deviation Γ √(T/2), each step is (α, αβ/T)-RDP. Finally, by composition over the T steps, the algorithm is (α, αβ)-RDP. □
A.4. Theorem 3.12 (Utility Guarantee of Algorithm 2)

We will make use of the following standard lemmas (for example, Lemmas 2.5 and 2.6 in (Bassily et al., 2014)). Let f be a convex function defined on a domain C, and let θ* = argmin_{θ∈C} f(θ). Consider the SGD algorithm with learning rate η_t:

θ^(t+1) = Π_C[θ^(t) − η_t g^(t)].

Assume that there exists G such that for all t, E[g^(t)] = ∇f(θ^(t)) and E[∥g^(t)∥²] ≤ G².

Lemma A.2 (Lipschitz case). Let η^(t) = ∥C∥/(G √t). Then for all T ≥ 1,

E[f(θ^(T))] − f(θ*) = O( ∥C∥ G log T / √T ).

Lemma A.3 (Strongly convex case). Assume that f is λ-strongly convex, and let η^(t) = 1/(λt). Then for all T ≥ 1,

E[f(θ^(T))] − f(θ*) = O( G² log T / (λ T) ).

Proof of Theorem 3.12. First, since w_ij = ω_i, the gradient g_i^(t) in Line 5 of the algorithm becomes g_i^(t) = ω_i ∇L_i(θ_i^(t−1)). Let ĝ_i^(t) = g_i^(t) + Γ √(T/2) ξ_i^(t). Then E[ĝ_i^(t)] = g_i^(t), and

E[∥ĝ_i^(t)∥²] = E[∥g_i^(t)∥²] + E[∥Γ √(T/2) ξ_i^(t)∥²] ≤ Γ² [ω_i² n_i² + Td/2].

In the first equality we use the independence of ξ_i^(t) and g_i^(t). In the inequality we use that the squared norm of a standard multivariate normal has expectation d, and the fact that L_i is (n_i Γ)-Lipschitz (since it is the sum of n_i terms, each Γ-Lipschitz). Define G_i² := Γ² [ω_i² n_i² + Td/2].

First, consider the Lipschitz bounded case. Applying Lemma A.2 to f = ω_i L_i, G = G_i, and η_i^(t) = ∥C∥/(G_i √t), we have for all T:

ω_i E[L_i(θ_i) − L_i(θ*_i)] = O( ∥C∥ G_i log T / √T ).

Dividing by ω_i and summing over tasks i ∈ {1, ..., m}, we have

E[L(θ)] − L(θ*) = O( ∥C∥ Γ log T ( m ( Σ_{i=1}^m n_i²/T + (d/2) Σ_{i=1}^m 1/ω_i² ) )^{1/2} ),

where we used Σ_{i=1}^m √a_i ≤ (m Σ_{i=1}^m a_i)^{1/2} (by concavity). Finally, setting T to equate the two terms under the square root, we get T = 2 Σ_{i=1}^m n_i² / (d Σ_{i=1}^m 1/ω_i²), and with this choice of T,

E[L(θ)] − L(θ*) = Õ( ∥C∥ Γ ( m d Σ_{i=1}^m 1/ω_i² )^{1/2} ),

as desired.

We now turn to the strongly convex case. Applying Lemma A.3 to f = ω_i L_i, G = G_i (same as above), strong convexity constant ω_i n_i λ, and η_i^(t) = 1/(ω_i n_i λ t), we have for all T:

ω_i E[L_i(θ_i) − L_i(θ*_i)] = O( G_i² log T / (n_i ω_i λ T) ).

Dividing by ω_i and summing over tasks, we get

E[L(θ)] − L(θ*) = O( (Γ² log T / λ) ( Σ_{i=1}^m n_i / T + (d/2) Σ_{i=1}^m 1/(ω_i² n_i) ) ).

Setting T to equate the last two sums, we get T = 2 Σ_{i=1}^m n_i / (d Σ_{i=1}^m 1/(ω_i² n_i)), and with this choice of T,

E[L(θ)] − L(θ*) = Õ( (Γ² d / λ) Σ_{i=1}^m 1/(ω_i² n_i) ),

as desired. □

A.5. Theorem 4.3 (Privacy-Utility Trade-off Under Adaptive Weights)

To prove the theorem, we will use the following concentration result:

Lemma A.4 (Concentration bound on the privacy budget). Suppose that Assumption 4.1 holds and let B > 0 be given. Let n_i = |Ω_i| and ω_i = n_i^{−(1+γ)/4} ( nB / Σ_{i'} n_{i'}^{(1−γ)/2} )^{1/2}. Then there exists c > 0 (that does not depend on B) such that, with high probability, for all j, Σ_{i∈Ω^j} ω_i² ≤ B c log n.

Proof. Fix a user j. We seek to bound Σ_{i∈Ω^j} ω_i². Let p_ij = P((i, j) ∈ Ω); recall that by Assumption 4.1, p_ij = n_i/n. For each i, define X_i as the random variable which takes the value ω_i² = nB n_i^{−(1+γ)/2} / Σ_{i'} n_{i'}^{(1−γ)/2} with probability n_i/n, and 0 otherwise. Then bounding Σ_{i∈Ω^j} ω_i² is equivalent to bounding Σ_{i=1}^m X_i. We have

E[X_i] = B n_i^{(1−γ)/2} / Σ_{i'} n_{i'}^{(1−γ)/2},
Var[X_i] ≤ E[X_i²] = (n_i/n) ( nB n_i^{−(1+γ)/2} / Σ_{i'} n_{i'}^{(1−γ)/2} )² = nB² n_i^{−γ} / ( Σ_{i'} n_{i'}^{(1−γ)/2} )² ≤ B²/n,

where the last inequality uses that n_i ≥ 1 for all i (see Assumption 4.1). Hence we have

Σ_{i=1}^m E[X_i] = B,   Σ_{i=1}^m Var[X_i] ≤ (m/n) B² ≤ B²,

where we use the assumption that n ≥ m. Furthermore, we have ω_i² ≤ B, since n_i ≥ 1 for all i. By the standard Bernstein inequality ((Boucheron et al., 2013), Section 2.8), there exists c > 0 (that does not depend on B or the n_i) such that, w.h.p.,

Σ_{i=1}^m X_i ≤ B c log n.

The high-probability bound follows by taking the union over all users j. □
Proof of Theorem 4.3. Let ω*, w* be the weights given in eqs. (11) and (12), respectively (ω* are the optimal weights, and w* their clipped version). We apply Lemma A.4 with B = β̄ = β/(c log n), and hence obtain that with high probability, for all j, Σ_{i∈Ω^j} (ω*_i)² ≤ β. In other words, w.h.p., clipping does not occur and w*_ij = ω*_i for all (i, j) ∈ Ω. Conditioned on this high-probability event, we can apply the general utility bounds in Theorems 3.8 and 3.12, with ω_i = ω*_i. This yields the desired bounds. □

A.6. Related Work on Class-Imbalanced Learning

The question of distribution skew has been studied in the non-private setting, and several techniques have been proposed to address class imbalance, either using re-sampling or re-weighting (Brodersen et al., 2010; Huang et al., 2016; Wang et al., 2017; Cui et al., 2019). The main question in this line of work is: if one measures performance using a balanced metric (i.e., one that assigns equal importance to the different classes, rather than following the skewed data distribution), how should one modify the training process to improve quality? This question is orthogonal to our approach. Our study shows that regardless of the loss being optimized (i.e., even if one cares about the empirical, not the balanced, loss), learning can be improved through a more careful allocation of the privacy budget. It should be possible to obtain a similar improvement in the balanced case, in which the objective function assigns equal weights to each task (for example, by defining L_i(θ_i) in eq. (2) to be a mean instead of a sum).

B. Utility Analysis Under Approximate Counts

In Section 4, the optimal choice of weights ω* assumes knowledge of the counts n_i. If the counts n_i are not public, we can use DP estimates n̂_i, and use them to solve the problem. Since differentially private counting has been studied extensively, we only state the utility bound in terms of the accuracy of the noisy counts; the final privacy bound follows by standard composition. For any constant s > 0, we call a counting procedure M s-accurate if, w.h.p., the counts n̂_i produced by M satisfy |n̂_i − n_i| ≤ s for all i.

We give an analysis of Algorithm 1 with approximate counts. The general idea is to apply the algorithm with weights ω̂ of the same form as the optimal weights ω*, but where the exact counts are replaced with estimated counts. Suppose we are given s-accurate count estimates n̂_i. Define

ω̂_i = (n̂_i + s)^{−1/2} ( n β̄ / m )^{1/2}.    (19)

This corresponds to eq. (11) (with γ = 1), but with n_i replaced by n̂_i + s (it will become clear below why we need to slightly over-estimate the counts). As in the exact case, we need to clip the weights, so that the privacy guarantee always holds. Define the clipped version ŵ as

ŵ_ij = ω̂_i · min( 1, β^{1/2} / (Σ_{i'∈Ω^j} ω̂_{i'}²)^{1/2} )   for all (i, j) ∈ Ω.    (20)

Remark B.1. The input to the algorithm is the unclipped weights ω̂, which are computed differentially privately. The clipped weights are not released as part of the procedure; they are simply used as a scaling factor in the computation of Â_i, b̂_i (Lines 5-6).

Theorem B.2 (Privacy-utility trade-off of Algorithm 1 with estimated counts). Suppose Assumptions 3.6 and 4.1 hold. Let n̂ ∈ R^m be count estimates from an s-accurate procedure for some s > 0. Define ω̂, ŵ as in eqs. (19)-(20), and let θ̂ be the output of Algorithm 1 run with weights ŵ. Then Algorithm 1 is (α, αβ)-RDP for all α > 1, and w.h.p.,

L(θ̂) − L(θ*) = O( ( Γ_x⁴ Γ_θ² d m / (n λ β̄) ) Σ_{i=1}^m (n̂_i + s)/n_i ).    (21)
Theorem B.2 (Privacy-utility trade-off of Algorithm 1 with estimated counts). Suppose Assumptions 3.6 and 4.1 hold. Let n̂ ∈ R^m be count estimates from an s-accurate procedure for some s > 0. Define ω̂, ŵ as in eq. (19)-(20). Let θ̂ be the output of Algorithm 1 run with weights ŵ. Then Algorithm 1 is (α, αβ)-RDP for all α > 1, and w.h.p.,

L(θ̂) − L(θ*) = O( (Γ_x⁴ Γ² dm)/(nλβ) Σ_{i=1}^m (n̂_i + s)/n_i ). (21)

Proof of Theorem B.2. We first show that w.h.p., the approximate weights ω̂_i under-estimate the exact weights ω_i*. This will allow us to argue that w.h.p., clipping does not occur. First, define the function g : R^m → R^m by g(η)_i = √(nβ̄/m) η_i^{−1/2}. Observe that whenever η ≥ η′ (coordinate-wise), g(η) ≤ g(η′) (coordinate-wise). Let η* be the vector of true counts (η_i* = n_i), and η̂ be the vector of approximate counts (η̂_i = n̂_i). We have ω* = g(η*) and ω̂ = g(η̂ + s). Since by assumption the count estimates are s-accurate, w.h.p. |n̂_i − n_i| ≤ s, thus η* ≤ η̂ + s and

ω̂ = g(η̂ + s) ≤ g(η*) = ω*,

that is, the estimated weights ω̂ under-estimate the optimal weights ω* w.h.p. (this is precisely why, in the definition of ω̂, we used the adjusted counts η̂ + s instead of η̂). Next, ω̂ ≤ ω* implies that

Σ_{i∈Ω_j} ω̂_i² ≤ Σ_{i∈Ω_j} (ω_i*)². (22)

We also know, from Lemma A.4 (applied with B = β̄), that w.h.p., Σ_{i∈Ω_j} (ω_i*)² ≤ cβ̄ log n = β for all j. Combining this with (22), we have that w.h.p., ω̂ satisfies the RDP budget constraint, therefore ŵ_ij = ω̂_i (clipping does not occur since the budget constraint is satisfied). Applying Theorem 3.8 with weights ω̂ yields the desired result.

We now compare the utility bounds under adaptive weights with exact counts (13), adaptive weights with approximate counts (21), and uniform weights. The bounds are, respectively,

O( (Γ_x⁴ Γ² d m²)/(nλβ) ), O( (Γ_x⁴ Γ² d m)/(nλβ) Σ_{i=1}^m (n̂_i + s)/n_i ), O( (Γ_x⁴ Γ² d)/(nλβ) Σ_{i=1}^m n_i Σ_{i=1}^m 1/n_i ).

The cost of using approximate counts (compared to exact counts) is a relative increase by the factor

(1/m) Σ_{i=1}^m (n̂_i + s)/n_i ≤ 1 + (2s/m) Σ_{i=1}^m 1/n_i,

where we used that w.h.p., n̂_i ≤ n_i + s. The last term is the average relative error of the count estimates. If the estimates are accurate on average, we therefore do not expect a large utility loss from using approximate counts.

C. Synthetic Experiments

We conduct experiments following the setup in Section 3. Specifically, we consider m = 100 linear regression tasks of dimension d = 5 and data from n = 10,000 users. We assume that each task has an optimal solution θ_i*, and each user j has a vector u_j, such that x_ij = u_j and y_ij = ⟨u_j, θ_i*⟩ + N(0, σ_F²), with σ_F representing some inherent data noise. We generate the u_j and θ_i* from a standard Gaussian distribution in R^d, projected onto the unit ball, and set σ_F = 10⁻³.

To model a skewed distribution of tasks, we first sample, for each task i, a value q_i ∈ [0, 1] following a power-law distribution with parameter a ∈ {1, 2} (the pdf of the distribution is f(x) ∝ ax^{a−1}), representing two levels of skewness. We then normalize the q_i such that they sum to 20, and construct Ω by sampling each (i, j) with probability q_i. This way, in expectation, each user contributes to 20 tasks, and q_i n users contribute to task i. The distribution of q_i n is plotted in Figure 5a. We partition the data into training and test sets following an 80-20 random split.

We run Ada-DPALS and Ada-DPSGD with weights set to ω_i ∝ n_i^{−µ} for varying µ, where n_i is the number of users contributing to task i. The weights are normalized per user. Setting µ = 0 corresponds to the standard unweighted objective function. Since the purpose of the experiment is to validate the analysis and illustrate the algorithm in a stylized setting, we run the algorithm using the exact n_i instead of estimating them privately. In Figures 5b-5e, we plot the RMSE for different values of µ and ϵ, for both algorithms and both skewness levels. We set δ = 10⁻⁵.
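The following minimal sketch reproduces the data-generating process described above (variable names are ours, and the clipping of the q_i to [0, 1] is a safeguard we add):

```python
# Synthetic multi-task data of Section C: m linear regression tasks, n users,
# power-law task participation probabilities q_i.
import numpy as np

rng = np.random.default_rng(0)
n, m, d, sigma_F, a = 10_000, 100, 5, 1e-3, 1.0   # a in {1, 2}: skewness level

def project_unit_ball(x):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1.0)

theta = project_unit_ball(rng.normal(size=(m, d)))  # task optima theta_i^*
u = project_unit_ball(rng.normal(size=(n, d)))      # user vectors u_j

q = rng.power(a, size=m)               # pdf f(x) = a * x^(a-1) on [0, 1]
q = np.clip(20.0 * q / q.sum(), 0, 1)  # each user joins ~20 tasks in expectation

# Omega: include (i, j) independently with probability q_i.
mask = rng.random((m, n)) < q[:, None]
i_idx, j_idx = np.nonzero(mask)
x = u[j_idx]                                      # features x_ij = u_j
y = np.einsum('kd,kd->k', x, theta[i_idx])        # labels <u_j, theta_i^*> ...
y = y + rng.normal(scale=sigma_F, size=y.shape)   # ... plus inherent noise
```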
[Figure 5: RMSE vs. µ on synthetic data, with δ = 10⁻⁵. Panels: (a) n_i in the synthetic data; (b) DP multi-regression, skewness a = 1; (c) DP multi-regression, skewness a = 2; (d) DPSGD, skewness a = 1; (e) DPSGD, skewness a = 2.]

We observe that with uniform weights (µ = 0), the quality of the estimate can be quite poor, especially at lower values of ϵ. The quality improves significantly as we increase µ. The theoretical analysis (Equation (11)) suggests µ = 1/2 as the optimal choice, yet the empirically optimal µ can vary case by case. This may be due to the fact that the analysis makes no assumptions about the feature and label distributions, while in this experiment the data is sampled from a linear model; we leave the design of better weighting strategies for such easier data to future work.

D. Additional Experiments on MovieLens and Million Song Data

In this section, we report additional experiments on the MovieLens and Million Song Data (Bertin-Mahieux et al., 2011) data sets. The code used to run the experiments is available at the following URL: https://github.com/google-research/google-research/tree/master/dp_alternating_minimization.

We report additional experiments on the larger Million Song Data benchmark (abbreviated MSD); see Figures 6-8. Due to the larger size of the data, we were only able to tune models of smaller size (embedding dimension up to 32), but we expect the trends reported here to persist in larger dimensions.

Table 1. Statistics of the MovieLens and Million Song data sets.

                             | ML10M  | ML20M   | MSD
n (number of users)          | 69,878 | 136,677 | 571,355
m (number of items)          | 10,677 | 20,108  | 41,140
|Ω| (number of observations) | 10M    | 9.99M   | 33.63M

[Figure 6: Privacy-utility trade-off (test Recall@20 vs. ϵ ∈ {1, 5, 10, 15, 20}) on the Million Song Data benchmark, comparing Ada-DPALS, DPALS (tail), and DPALS (uniform).]

D.1. Detailed Experimental Setting

We follow the same experimental setting as (Jain et al., 2018; Chien et al., 2021; Jain et al., 2021). When reporting (ϵ, δ)-DP guarantees, we use values of ϵ ∈ [1, 20], and take δ = 10⁻⁵ for ML10M and δ = 1/n for ML20M and MSD. All hyper-parameter values are specified in the source code.

We solve a private matrix completion problem, where the training data is a partially observed rating matrix (y_ij)_{(i,j)∈Ω} (y_ij is the rating given by user j to item i), and the goal is to compute a low-rank approximation Y ≈ U V^⊤ by minimizing the following objective function:

L(U, V) = Σ_{(i,j)∈Ω} (⟨u_i, v_j⟩ − y_ij)² + λ‖u_i‖² + λ‖v_j‖².

The matrix U ∈ R^{m×d} represents the item embeddings, and the matrix V ∈ R^{n×d} represents the user embeddings.

Algorithms. We consider a family of alternating minimization algorithms studied by (Chien et al., 2021; Jain et al., 2021), in which we use our algorithm as a sub-routine. This is summarized in Algorithm 3.

Algorithm 3 Alternating Minimization for Private Matrix Completion
1: Inputs: Training data {y_ij}_{(i,j)∈Ω}, number of steps T, initial item matrix Û^(0), pre-processing RDP budget β₀, training RDP budget β.
2: Pre-process the training data (with RDP budget β₀).
3: for t = 1, …, T do
4:   V̂^(t) ← argmin_V L(Û^(t−1), V)
5:   Compute a differentially private solution Û^(t) of min_U L(U, V̂^(t)) (with RDP budget β)
6: end for
7: Return Û^(T)
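A minimal sketch of the structure of Algorithm 3 follows. The helper names and the closed-form ridge solve are ours; the DP item update is left abstract, since it can be instantiated with Algorithm 1 or Algorithm 2.

```python
# Skeleton of Algorithm 3: alternate an exact least-squares update of the
# user matrix V (Line 4) with a differentially private update of the item
# matrix U (Line 5), releasing only the item matrix.
import numpy as np

def solve_user(U, item_idx, ratings, lam):
    """Exact ridge solve: argmin_v sum_{i in Omega_j} (<u_i, v> - y_ij)^2 + lam*|v|^2."""
    d = U.shape[1]
    A = U[item_idx].T @ U[item_idx] + lam * np.eye(d)
    return np.linalg.solve(A, U[item_idx].T @ ratings)

def alternating_minimization(data_by_user, U0, T, lam, dp_item_update):
    """data_by_user[j] = (item indices, ratings) of user j; dp_item_update is
    an abstract DP solver for min_U L(U, V) under the per-step RDP budget
    (e.g. weighted SSP or weighted DPSGD)."""
    U = U0
    for _ in range(T):
        V = np.stack([solve_user(U, idx, r, lam) for idx, r in data_by_user])  # Line 4
        U = dp_item_update(V, data_by_user)                                    # Line 5
    return U  # Line 7: only the item matrix is released (cf. Remark D.1 below)
```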
Algorithm 3 describes a family of algorithms that includes DPALS (Chien et al., 2021) and the AltMin algorithm of (Jain et al., 2021). It starts by pre-processing the training data (this includes centering the data, and computing private counts to be used for sampling or for computing adaptive weights). Then, it alternates between updating V and updating U. Updating V is done by computing an exact least squares solution, while updating U is done differentially privately.

Remark D.1. The algorithm only outputs the item embedding matrix Û. This is indeed sufficient for the recommendation task: given an item matrix Û (learned differentially privately), to generate predictions for user j, one can compute the user's embedding v_j* = argmin_{v∈R^d} Σ_{i∈Ω_j} (⟨û_i, v⟩ − y_ij)² + λ‖v‖², then complete the j-th column by computing Û v_j*. This minimization problem only depends on the published matrix Û and user j's data, and can therefore be solved in isolation for each user (for example on the user's own device), without incurring additional privacy cost. This is sometimes referred to as the billboard model of differential privacy; see for example (Jain et al., 2021).

[Figure 7: Task distribution skew in MovieLens and Million Song Data. Panels: (a) cumulative distribution function and (b) sorted task counts, for movies sorted by decreasing count, on ML10M, ML20M, and MSD.]

Notice that updating the item embeddings U (Line 5 in Algorithm 3) consists in solving, for each item i, the following problem differentially privately:

min_{u_i∈R^d} Σ_{j∈Ω_i} (⟨u_i, v_j⟩ − y_ij)² + λ‖u_i‖².

This corresponds to our multi-task problem (1), where we identify each item with a task, with parameters θ_i = u_i, features x_ij = v_j, and labels y_ij. We can then apply either Algorithm 1 (SSP with adaptive weights) or Algorithm 2 (DPSGD with adaptive weights) to compute the item update. The baselines we compare to are summarized in Table 2.

Table 2. Algorithms used in experiments.

Method          | Sub-routine used to update U (Line 5 in Algorithm 3) | Budget allocation method
DPALS (uniform) | SSP         | Uniform sampling
DPALS (tail)    | SSP         | Tail-biased sampling of (Chien et al., 2021)
Ada-DPALS       | Algorithm 1 | Adaptive weights (eq. (11)-(12))
DPSGD (uniform) | DPSGD       | Uniform sampling
DPSGD (tail)    | DPSGD       | Tail-biased sampling of (Chien et al., 2021)
Ada-DPSGD       | Algorithm 2 | Adaptive weights (eq. (11)-(12))

We compare to two algorithms: the DPALS method (Chien et al., 2021), which is the current state of the art on these benchmarks, and the DPSGD algorithm (which we found to perform well in the full-batch regime, at the cost of higher run times). In each case, we compare three methods of budget allocation: uniform sampling, the tail-biased sampling heuristic of (Chien et al., 2021), and our adaptive weights method. Note that tail sampling is already adaptive to the task skew: it estimates task counts and uses them to select the tasks to sample for each user. For both tail sampling and adaptive weights, movie counts are estimated privately, and we account for the privacy cost of doing so. The proportion of the RDP budget spent on estimating counts (out of the total RDP budget) is 20% for ϵ = 20, 14% for ϵ = 5, and 12% for ϵ = 1. This was tuned on the DPALS baseline (with tail sampling).

Metrics. The quality metrics used by the benchmarks are defined as follows. Let Ω^test be the set of test ratings. Then for a given factorization U, V,

RMSE(U, V) = √( Σ_{(i,j)∈Ω^test} (⟨u_i, v_j⟩ − y_ij)² / |Ω^test| ).

Recall is defined as follows. For a given user j, let Ω_j^test be the set of test movies rated by user j, and denote by Ω̂_j the set of top-k predictions for user j. Then

Recall@k = (1/n) Σ_{j=1}^n |Ω_j^test ∩ Ω̂_j| / min(k, |Ω_j^test|).
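To make the evaluation concrete, here is a minimal sketch of both metrics (function names and data layout are ours):

```python
# RMSE and Recall@k as defined above.
import numpy as np

def rmse(U, V, test_ratings):
    """test_ratings: iterable of (i, j, y_ij) triples over the test set."""
    errs = [(U[i] @ V[j] - y) ** 2 for i, j, y in test_ratings]
    return float(np.sqrt(np.mean(errs)))

def recall_at_k(scores_by_user, test_items_by_user, k=20):
    """scores_by_user[j]: predicted score of every item for user j (with the
    user's training items excluded); test_items_by_user[j]: set of test items."""
    recalls = []
    for scores, test_items in zip(scores_by_user, test_items_by_user):
        top_k = set(np.argsort(scores)[-k:].tolist())
        recalls.append(len(top_k & test_items) / min(k, len(test_items)))
    return float(np.mean(recalls))
```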
[Figure 8: Metrics sliced by movie frequency on the ML10M, ML20M, and MSD benchmarks, using the DPALS method. Each plot compares Ada-DPALS (µ ∈ {1, 1/2, 1/3, 1/4}, plus µ = 0 on ML20M and MSD), DPALS (tail), and DPALS (uniform), across count buckets 0-4. Panels: (a) ML10M, ϵ = 1; (b) ML10M, ϵ = 5; (c) ML10M, ϵ = 20; (d) ML20M, ϵ = 1; (e) ML20M, ϵ = 5; (f) ML20M, ϵ = 20; (g) MSD, ϵ = 1; (h) MSD, ϵ = 5; (i) MSD, ϵ = 20. Each bucket contains an equal number of movies; buckets are ordered by increasing frequency.]

[Figure 9: Comparison of the adaptive weights method with different values of µ ∈ {0, 1/8, 1/4, 1/3, 1/2, 1} on ML20M (test Recall@20, for ϵ ∈ {5, 10, 20}), when applied to DPALS (left) and DPSGD (right).]

D.2. Quality Impact on Head/Tail Movies

In Figure 8, we report sliced RMSE (on ML10M) and Recall (on ML20M and MSD) for different values of ϵ. The following trend can be observed on all benchmarks, across all values of ϵ. DPALS with tail sampling improves upon uniform sampling, especially on the tail buckets. Our method (Ada-DPALS) further improves upon tail sampling. The improvement is quite significant for lower values of ϵ and for tail buckets. For example, on ML10M with ϵ = 1 (Figure 8a), we observe a large gap in RMSE across all values of µ: the improvement is at least 21.6% on bucket 0, at least 23.7% on bucket 1, at least 22.8% on bucket 3, and at least 8.4% on bucket 4. The gap narrows as ϵ increases, which is consistent with the global privacy/utility trade-off plots in Figure 1.

The exponent µ controls the trade-off between head and tail tasks: recall that the weights are defined as ω_i ∝ 1/n̂_i^µ, where the n̂_i are the count estimates. A larger value of µ induces larger weights (and hence better quality) on the tail.
This is visible on both benchmarks and across all values of ϵ: on the lower buckets 0 and 1, better performance is obtained for larger values of µ, while the trend is reversed on the top bucket. Comparing performance on the overall objective, we find that the best performance is typically achieved around µ = 1/4; see Figures 3 and 9. When applied to DPALS, performance remains high over a range of µ ∈ [1/4, 1/2]. When applied to DPSGD, performance appears more sensitive to µ, and the best performance is achieved at µ = 1/4.

D.3. Qualitative Evaluation on ML20M

To give a qualitative evaluation of the improvements achieved by our method, we inspect a few example queries. Though anecdotal, these examples give a perhaps more concrete illustration of the quality impact that our method can have, especially on tail recommendations. We compare the following models: the non-private ALS baseline (the same model as in Figure 1-b), and two private models with ϵ = 1: the DPALS method with tail-biased sampling, and Ada-DPALS (with adaptive weights) with µ = 1/3 (we found values of µ ∈ [1/4, 1/2] to be qualitatively similar).

We evaluate the models by displaying the nearest neighbor movies of a given query movie, where the similarity between movies is defined by the learned movie embedding matrix Û (the similarity between two movies i₁ and i₂ is ⟨û_{i₁}, û_{i₂}⟩). We show a few examples in Tables 3 and 4; additional examples can be obtained by training and querying the models interactively using the provided code. We select examples with varying levels of frequency (shown in the last column) to illustrate how privacy may affect quality differently depending on item frequency.

The first query is The Shawshank Redemption, the most frequent movie in the data set. We see a large overlap of the top nearest neighbors according to all three models; in other words, privacy has little impact on this item. A similar observation can be made for other popular items.

The second query is Pinocchio, a Disney animated movie from 1940. The nearest neighbors according to the non-private baseline (ALS) are other Disney movies from neighboring decades. The DPALS results are noisy: some of the top neighbors are Disney movies, but the fifth and sixth neighbors seem unrelated (Action and Drama movies). The Ada-DPALS model returns more relevant results, all of the neighbors being Disney movies.

The third example is Harry Potter and the Half-Blood Prince, an Adventure/Fantasy movie. The nearest neighbors from the ALS baseline are mostly other Harry Potter movies. The DPALS model misses the Harry Potter neighbors, and instead returns mostly action/adventure movies of varying relevance. The Ada-DPALS model recovers several of the Harry Potter neighbors.

The next example is Nausicaä of the Valley of the Wind, a Japanese animated movie from Studio Ghibli, released in 1984. The nearest neighbors according to the ALS model are similar movies from Studio Ghibli. The neighbors returned by DPALS are much noisier: the first result (Spirited Away) is a Studio Ghibli movie and is the most relevant in the list, while the other neighbors are arguably unrelated to the query. Ada-DPALS returns much more relevant results: all neighbors are Japanese animation movies, and five out of six are from Studio Ghibli.

The last example is Interstellar, a Sci-Fi movie released in 2014. The ALS neighbors are other popular movies released around the same time (2013-2014), with a bias towards Action/Sci-Fi.
Both private models (DPALS and Ada-DPALS) return mixed results: some are relevant (Action/Sci-Fi movies), but others are arguably much less so; for example, the third DPALS neighbor is a Hitchcock movie from 1945. The Ada-DPALS neighbors appear slightly better overall; in particular, the model returns two movies by the same director.

As can be seen from these examples, Ada-DPALS generally returns better quality results than DPALS. If we compare both models to the non-private baseline (ALS), the overlap between Ada-DPALS and ALS is generally larger than the overlap between DPALS and ALS. To quantify this statement, we generate, for each movie, the top-20 nearest neighbors according to each model, then compute the percentage overlap with ALS. The results are reported in Figure 10.

[Figure 10: Percentage overlap between the top-20 nearest neighbors according to the private models (ϵ = 1; Ada-DPALS (µ = 1/3), DPALS (tail), and DPALS (uniform)) and the top-20 nearest neighbors according to the non-private baseline (ALS), sliced by popularity bucket. Each bucket contains an equal number of movies, ordered by increasing frequency.]
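A minimal sketch of this overlap computation (names are ours; similarity is the inner product of the learned item embeddings, as above):

```python
# Top-k neighbor overlap between a private model and the ALS baseline,
# as reported in Figure 10 (here without the bucketing by popularity).
import numpy as np

def top_k_neighbors(U, query, k=20):
    """Indices of the k items most similar to `query` under <u_i1, u_i2>."""
    sims = U @ U[query]
    sims[query] = -np.inf          # exclude the query itself
    return set(np.argpartition(sims, -k)[-k:].tolist())

def mean_overlap(U_private, U_als, k=20):
    m = U_als.shape[0]
    return float(np.mean([
        len(top_k_neighbors(U_private, i, k) & top_k_neighbors(U_als, i, k)) / k
        for i in range(m)
    ]))
```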
Table 3. Nearest neighbors according to ALS (non-private), DPALS, and Ada-DPALS (µ = 1/3). Columns: movie title | genres | frequency (in training set).

Query: Shawshank Redemption, The (1994) | Crime Drama | 47518

ALS:
  Usual Suspects, The (1995) | Crime Mystery Thriller | 33834
  Silence of the Lambs, The (1991) | Crime Horror Thriller | 42738
  Pulp Fiction (1994) | Comedy Crime Drama Thriller | 44626
  Schindler's List (1993) | Drama War | 35334
  Apollo 13 (1995) | Adventure Drama IMAX | 26192
  Forrest Gump (1994) | Comedy Drama Romance War | 40422

DPALS:
  Silence of the Lambs, The (1991) | Crime Horror Thriller | 42738
  Usual Suspects, The (1995) | Crime Mystery Thriller | 33834
  Schindler's List (1993) | Drama War | 35334
  Pulp Fiction (1994) | Comedy Crime Drama Thriller | 44626
  Forrest Gump (1994) | Comedy Drama Romance War | 40422
  Braveheart (1995) | Action Drama War | 32735

Ada-DPALS:
  Silence of the Lambs, The (1991) | Crime Horror Thriller | 42738
  Pulp Fiction (1994) | Comedy Crime Drama Thriller | 44626
  Usual Suspects, The (1995) | Crime Mystery Thriller | 33834
  Forrest Gump (1994) | Comedy Drama Romance War | 40422
  Schindler's List (1993) | Drama War | 35334
  Braveheart (1995) | Action Drama War | 32735

Query: Pinocchio (1940) | Animation Children Fantasy Musical | 5120

ALS:
  Snow White and the Seven Dwarfs (1937) | Animation Children Drama Fantasy Musical | 7865
  Dumbo (1941) | Animation Children Drama Musical | 3580
  Cinderella (1950) | Animation Children Fantasy Musical Romance | 3957
  Aristocats, The (1970) | Animation Children | 2669
  Fantasia (1940) | Animation Children Fantasy Musical | 6135
  Alice in Wonderland (1951) | Adventure Animation Children Fantasy Musical | 3487

DPALS:
  Snow White and the Seven Dwarfs (1937) | Animation Children Drama Fantasy Musical | 7865
  Jumanji (1995) | Adventure Children Fantasy | 6203
  Beauty and the Beast (1991) | Animation Children Fantasy Musical Romance IMAX | 16391
  Aladdin (1992) | Adventure Animation Children Comedy Musical | 19912
  Assassins (1995) | Action Crime Thriller | 1146
  Miracle on 34th Street (1947) | Comedy Drama | 2445

Ada-DPALS:
  Snow White and the Seven Dwarfs (1937) | Animation Children Drama Fantasy Musical | 7865
  Fantasia (1940) | Animation Children Fantasy Musical | 6135
  Pocahontas (1995) | Animation Children Drama Musical Romance | 2815
  Sword in the Stone, The (1963) | Animation Children Fantasy Musical | 2217
  Dumbo (1941) | Animation Children Drama Musical | 3580
  Jungle Book, The (1994) | Adventure Children Romance | 2765

Table 4. Nearest neighbors according to ALS (non-private), DPALS, and Ada-DPALS (µ = 1/3). Columns: movie title | genres | frequency (in training set).

Query: Harry Potter and the Half-Blood Prince (2009) | Adventure Fantasy Mystery Romance IMAX | 2176

ALS:
  Harry Potter and the Deathly Hallows: Part 1 (2010) | Action Adventure Fantasy IMAX | 2099
  Harry Potter and the Order of the Phoenix (2007) | Adventure Drama Fantasy IMAX | 2896
  Harry Potter and the Deathly Hallows: Part 2 (2011) | Action Adventure Drama Fantasy Mystery IMAX | 2265
  Sherlock Holmes (2009) | Action Crime Mystery Thriller | 2877
  Harry Potter and the Goblet of Fire (2005) | Adventure Fantasy Thriller IMAX | 4773
  Tangled (2010) | Animation Children Comedy Fantasy Musical Romance IMAX | 1259

DPALS:
  Animatrix, The (2003) | Action Animation Drama Sci-Fi | 1216
  Ratatouille (2007) | Animation Children Drama | 4728
  Avengers, The (2012) | Action Adventure Sci-Fi IMAX | 2770
  Tangled (2010) | Animation Children Comedy Fantasy Musical Romance IMAX | 1259
  Slumdog Millionaire (2008) | Crime Drama Romance | 5415
  Sherlock Holmes: A Game of Shadows (2011) | Action Adventure Comedy Crime Mystery Thriller | 1166

Ada-DPALS:
  Harry Potter and the Deathly Hallows: Part 2 (2011) | Action Adventure Drama Fantasy Mystery IMAX | 2265
  Harry Potter and the Prisoner of Azkaban (2004) | Adventure Fantasy IMAX | 6433
  Ratatouille (2007) | Animation Children Drama | 4728
  Sherlock Holmes: A Game of Shadows (2011) | Action Adventure Comedy Crime Mystery Thriller | 1166
  Harry Potter and the Order of the Phoenix (2007) | Adventure Drama Fantasy IMAX | 2896
  Avatar (2009) | Action Adventure Sci-Fi IMAX | 4960

Query: Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) (1984) | Adventure Animation Drama Fantasy Sci-Fi | 2151

ALS:
  Laputa: Castle in the Sky (Tenkû no shiro Rapyuta) (1986) | Action Adventure Animation Children Fantasy Sci-Fi | 2227
  My Neighbor Totoro (Tonari no Totoro) (1988) | Animation Children Drama Fantasy | 3593
  Kiki's Delivery Service (Majo no takkyûbin) (1989) | Adventure Animation Children Drama Fantasy | 1421
  Porco Rosso (Crimson Pig) (Kurenai no buta) (1992) | Adventure Animation Comedy Fantasy Romance | 1022
  Grave of the Fireflies (Hotaru no haka) (1988) | Animation Drama War | 2026
  Howl's Moving Castle (Hauru no ugoku shiro) (2004) | Adventure Animation Fantasy Romance | 3503

DPALS:
  Spirited Away (Sen to Chihiro no kamikakushi) (2001) | Adventure Animation Fantasy | 9161
  Terminal, The (2004) | Comedy Drama Romance | 1963
  Finding Neverland (2004) | Drama | 3371
  Ring, The (2002) | Horror Mystery Thriller | 3535
  Kill Bill: Vol. 1 (2003) | Action Crime Thriller | 12467
  Man Who Wasn't There, The (2001) | Crime Drama | 2157

Ada-DPALS:
  Spirited Away (Sen to Chihiro no kamikakushi) (2001) | Adventure Animation Fantasy | 9161
  My Neighbor Totoro (Tonari no Totoro) (1988) | Animation Children Drama Fantasy | 3593
  Princess Mononoke (Mononoke-hime) (1997) | Action Adventure Animation Drama Fantasy | 6101
  Ghost in the Shell (Kôkaku kidôtai) (1995) | Animation Sci-Fi | 4070
  Howl's Moving Castle (Hauru no ugoku shiro) (2004) | Adventure Animation Fantasy Romance | 3503
  Laputa: Castle in the Sky (Tenkû no shiro Rapyuta) (1986) | Action Adventure Animation Children Fantasy Sci-Fi | 2227

Query: Interstellar (2014) | Sci-Fi IMAX | 1048

ALS:
  Gone Girl (2014) | Drama Thriller | 847
  Edge of Tomorrow (2014) | Action Sci-Fi IMAX | 881
  Gravity (2013) | Action Sci-Fi IMAX | 1248
  Guardians of the Galaxy (2014) | Action Adventure Sci-Fi | 1072
  Wolf of Wall Street, The (2013) | Comedy Crime Drama | 1060
  Grand Budapest Hotel, The (2014) | Comedy Drama | 1339

DPALS:
  Day After Tomorrow, The (2004) | Action Adventure Drama Sci-Fi Thriller | 1233
  Alive (1993) | Drama | 994
  Spellbound (1945) | Mystery Romance Thriller | 1365
  Source Code (2011) | Action Drama Mystery Sci-Fi Thriller | 1595
  Fast and the Furious, The (2001) | Action Crime Thriller | 1406
  Ice Storm, The (1997) | Drama | 3020

Ada-DPALS:
  Django Unchained (2012) | Action Drama Western | 2692
  Inception (2010) | Action Crime Drama Mystery Sci-Fi Thriller IMAX | 9147
  Dark Knight Rises, The (2012) | Action Adventure Crime IMAX | 2848
  Intouchables (2011) | Comedy Drama | 1803
  Source Code (2011) | Action Drama Mystery Sci-Fi Thriller | 1595
  Avatar (2009) | Action Adventure Sci-Fi IMAX | 4960