# Deep Learning with Label Differential Privacy

Badih Ghazi (Google Research, badihghazi@google.com), Noah Golowich (EECS, MIT, nzg@mit.edu; part of this work was done while at Google Research), Ravi Kumar (Google Research, ravi.k53@gmail.com), Pasin Manurangsi (Google Research, pasin@google.com), Chiyuan Zhang (Google Research, chiyuan@google.com)

Abstract. The Randomized Response (RR) algorithm [96] is a classical technique to improve robustness in survey aggregation, and has been widely adopted in applications with differential privacy guarantees. We propose a novel algorithm, Randomized Response with Prior (RRWithPrior), which can provide more accurate results while maintaining the same level of privacy guaranteed by RR. We then apply RRWithPrior to learn neural networks with label differential privacy (Label DP), and show that when only the labels need to be protected, the model performance can be significantly improved over the previous state-of-the-art private baselines. Moreover, we study different ways to obtain priors, which, when used with RRWithPrior, can further improve the model performance and reduce the accuracy gap between private and non-private models. We complement the empirical results with a theoretical analysis showing that Label DP is provably easier than protecting both the inputs and the labels.

1 Introduction

The widespread adoption of machine learning in recent years has increased concerns about the privacy of individuals whose data is used during model training. Differential privacy (DP) [34, 33] has emerged as a popular privacy notion that has been the basis of several practical deployments in industry [35, 82, 43, 9, 29] and the U.S. Census [3]. A classical algorithm that predates DP and was initially designed to eliminate evasive answer biases in survey aggregation is Randomized Response (RR) [96]: when the input is an element from a finite alphabet, the output is equal to the input with a certain probability, and is a uniformly random other element of the alphabet with the remaining probability. This simple algorithm can be shown to satisfy the strong notion of local DP [38, 58], whereby the response of each user is protected, in contrast with the so-called central DP setting, where a curator has access to the raw user data and only the output of the curator is required to be DP. We note that schemes building on RR have been studied in several previous works on DP estimation (e.g., [30, 57]), and have been deployed in practice [82]. Meanwhile, the large error incurred by RR (e.g., [17]) has stimulated significant research aiming to improve its accuracy, mostly by relaxing to weaker privacy models (e.g., [36, 24, 4]). In this work, we take a different approach and seek to improve RR by leveraging available prior information. (A recent work of Liu et al. [63] also used priors to improve accuracy, but in the context of the DP multiplicative weights algorithm, which applies in the central DP setting.) The prior information can consist of domain-specific knowledge, (models trained on) publicly available data, or historical runs of a training algorithm. Our algorithm is presented in Section 3 (Algorithm 2). At a high level, given a prior distribution p on the alphabet, the algorithm uses p to prune the alphabet.
If the prior is reliable, then even if the alphabet is only minimally pruned, the probability that the output equals the input is larger than when RR is applied to the entire alphabet. On the other hand, if the prior is uniform over the entire alphabet, then our algorithm recovers the classical RR. To implement the above recipe, one needs to specify how to effect the pruning using p. It turns out that the magnitude of the pruning can itself vary depending on p, but we obtain a closed-form formula for determining it. Interestingly, by studying a suitable linear program, we show that the resulting RRWithPrior strategy is optimal in that, among all ε-DP algorithms, it maximizes the probability that the output equals the input when the latter is sampled from p (Theorem 3).

1.1 Applications to Learning with Label Differential Privacy

A great number of papers over the last decade have developed DP machine learning algorithms (e.g., [19, 102, 88, 83-85, 78]). In the case of deep learning, the seminal work of Abadi et al. [2] introduced a DP training framework (DP-SGD) that was integrated into TensorFlow [80] and PyTorch [92]. Despite numerous follow-up works (including, e.g., [75-77, 67, 98, 71, 20, 93]) and extensive efforts, the accuracy of models trained with DP-SGD remains significantly lower than that of models trained without DP constraints. Notably, for the widely considered CIFAR-10 dataset, the highest reported accuracy for DP models is 69.3% [93], which, strikingly, relies on handcrafted visual features, even though learned features have long been shown to be superior in non-private settings. Even using pre-training with external (CIFAR-100) data, the best reported DP accuracy, 73% (for DP parameters ε = 8 and δ = 10⁻⁵, cited from Abadi et al. [2, Figure 6]; for a formal definition of DP, we refer the reader to Definition 2.1), is still far below the non-private baselines (> 95%). This performance gap is a roadblock for many real-world applications seeking to adopt DP. In this paper, we focus on a more restricted, but important, special case where the DP guarantee is only required to hold with respect to the labels, as described next.

In the label differential privacy (Label DP) setting, the labels are considered sensitive and their privacy needs to be protected, while the input points are not sensitive. This notion has been studied in the PAC setting [18, 13] and for the particular case of sparse linear regression [94], and it captures several practical scenarios. Examples include: (i) computational advertising, where the impressions are known to the ad-tech provider (ad tech, short for advertising technology, comprises the tools that help agencies and brands target, deliver, and analyze their digital advertising efforts; see, e.g., blog.hubspot.com/marketing/what-is-ad-tech) and are thus considered non-sensitive, while the conversions reveal user interest and are thus private (see, e.g., Nalpas and Dutton [70] and [8]); (ii) recommendation systems, where the choices are known, e.g., to the streaming service provider, but the user ratings are considered sensitive; and (iii) user surveys and analytics, where demographic information (e.g., age, gender) is non-sensitive but income is sensitive; in fact, this was the motivating reason for Warner [96] to propose RR many decades ago!

We present a novel multi-stage algorithm (LP-MST) for training deep neural networks with Label DP that builds on top of RRWithPrior (Section 3); see Algorithm 3. We benchmark its empirical performance (Section 5) on multiple datasets, domains, and architectures, including the following. On CIFAR-10, we show that it achieves 20% higher accuracy than DP-SGD (we remark that the notion of ε-DP in [2, 76, 75] is not directly comparable to ε-Label DP in our work, in that they use the addition/removal notion of neighboring datasets whereas we use the substitution one; please see the Supplementary Material for more discussion). On the more challenging CIFAR-100, we present the first non-trivial DP learning results. On MovieLens, which consists of user ratings of movies, we show improvements via LP-MST.
In some applications, domain-specific algorithms can be used to obtain priors directly, without going through multi-stage training. For image classification problems, we demonstrate how priors computed from a (non-private) self-supervised learning [22, 23, 44, 53, 16] phase on the input images can be used to achieve higher accuracy with a Label DP guarantee under extremely small privacy budgets (ε ≈ 0.1; see Section 5.2 for details). We note that, due to the requirement of DP-SGD to compute and clip per-instance gradients, it remains technically challenging to scale it to larger models or mini-batch sizes, despite numerous attempts to mitigate this problem [42, 5, 26, 90]. In contrast, our formulation allows us to use state-of-the-art deep learning architectures such as ResNet [51]. We also stress that our LP-MST algorithm goes beyond deep learning methods that are robust to label noise. (See [87] for a survey of the latter.)

Our empirical results suggest that protecting the privacy of labels can be significantly easier than protecting the privacy of both inputs and labels. We find further evidence for this by showing that, for the special case of stochastic convex optimization (SCO), the sample complexity of algorithms privatizing the labels is much smaller than that of algorithms privatizing both labels and inputs; specifically, we achieve dimension-independent bounds for Label DP (Section 6). We also show that a good prior can ensure smaller population error for non-convex losses. (Details are in the Supplementary Material.)

2 Preliminaries

For any positive integer K, let [K] := {1, . . . , K}. Randomized response (RR) [96] is the following: let ε ≥ 0 be a parameter and let y ∈ [K] be the true value known to RR_ε. When an observer queries the value of y, RR_ε responds with a random draw ỹ from the following probability distribution:

$$\Pr[\tilde{y} = \hat{y}] = \begin{cases} \dfrac{e^{\varepsilon}}{e^{\varepsilon}+K-1} & \text{for } \hat{y} = y, \\[2mm] \dfrac{1}{e^{\varepsilon}+K-1} & \text{otherwise.} \end{cases} \qquad (1)$$
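As a concrete reference point, the following is a minimal Python sketch of the RR_ε mechanism in equation (1); labels are taken as 0, . . . , K−1 here, and the function name is ours rather than anything from the paper.

```python
import numpy as np

def randomized_response(y, K, eps, rng):
    """Classical RR: return y w.p. e^eps / (e^eps + K - 1), else a uniformly random other label."""
    if rng.random() < np.exp(eps) / (np.exp(eps) + K - 1):
        return y
    # Conditioned on not returning y, each of the other K - 1 labels is equally likely,
    # so every label different from y ends up with probability 1 / (e^eps + K - 1) overall.
    others = [label for label in range(K) if label != y]
    return int(rng.choice(others))

rng = np.random.default_rng(0)
noisy_label = randomized_response(y=3, K=10, eps=2.0, rng=rng)
```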
In this paper, we focus on the application of learning with label differential privacy. We recall the definition of differential privacy (DP), which is applicable to any notion of neighboring datasets. For a textbook reference, we refer the reader to Dwork and Roth [32].

Definition 2.1 (Differential Privacy (DP) [33, 34]). Let ε, δ ∈ ℝ≥0. A randomized algorithm A taking as input a dataset is said to be (ε, δ)-differentially private ((ε, δ)-DP) if for any two neighboring datasets D and D′, and for any subset S of outputs of A, it is the case that Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ. If δ = 0, then A is said to be ε-differentially private (ε-DP).

When applied to machine learning methods in general and deep learning in particular, DP is usually enforced on the weights of the trained model [see, e.g., 19, 59, 2]. In this work, we focus on the notion of label differential privacy.

Definition 2.2 (Label Differential Privacy). Let ε, δ ∈ ℝ≥0. A randomized training algorithm A taking as input a dataset is said to be (ε, δ)-label differentially private ((ε, δ)-Label DP) if for any two training datasets D and D′ that differ in the label of a single example, and for any subset S of outputs of A, it is the case that Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ. If δ = 0, then A is said to be ε-label differentially private (ε-Label DP).

All proofs skipped in the main body are given in the Supplementary Material.

3 Randomized Response with Prior

In many real-world applications, a prior distribution on the labels can be obtained publicly from domain knowledge and can help the learning process. In particular, we consider a setting where, for each (private) label y in the training set, there is an associated prior p = (p_1, . . . , p_K). The goal is to output a randomized label ỹ that maximizes the probability that the output is correct (or, equivalently, maximizes the signal-to-noise ratio), i.e., Pr[ỹ = y]. The privacy constraint here is that the algorithm should be ε-DP with respect to y. (It need not be private with respect to the prior p.) We first describe our algorithm RRWithPrior assuming access to such priors.

3.1 Algorithm: RRWithPrior

We build our RRWithPrior algorithm on a subroutine called RRTop-k, shown in Algorithm 1, which is a modification of randomized response where we only consider the set of k labels i with largest p_i. If the input label y belongs to this set, we use standard randomized response on this set. Otherwise, we output a label from this set uniformly at random.

Algorithm 1 RRTop-k
Input: A label y ∈ [K]
Parameters: k ∈ [K], prior p = (p_1, . . . , p_K)
1. Let Y_k be the set of k labels with maximum prior probability (with ties broken arbitrarily).
2. If y ∈ Y_k, then output y with probability $\frac{e^{\varepsilon}}{e^{\varepsilon}+k-1}$ and output each y′ ∈ Y_k \ {y} with probability $\frac{1}{e^{\varepsilon}+k-1}$.
3. If y ∉ Y_k, output an element of Y_k uniformly at random.

The main idea behind RRWithPrior is to dynamically estimate an optimal k* based on the prior p, and run RRTop-k with k = k*. Specifically, we choose k* by maximizing Pr[RRTop-k(y) = y]. It is not hard to see that this probability is exactly equal to $\frac{e^{\varepsilon}}{e^{\varepsilon}+k-1} \cdot \Pr_{y \sim p}[y \in Y_k]$ when y ∼ p. RRWithPrior is presented in Algorithm 2.

Algorithm 2 RRWithPrior
Input: A label y ∈ [K]
Parameters: prior p = (p_1, . . . , p_K)
1. For each k ∈ [K], compute $w_k := \frac{e^{\varepsilon}}{e^{\varepsilon}+k-1} \sum_{y \in Y_k} p_y$, where Y_k is the set of k labels with maximum prior probability (ties broken arbitrarily).
2. Let k* = arg max_{k ∈ [K]} w_k.
3. Return the output of RRTop-k(y) with k = k*.

3.1.1 Privacy Analysis

It is not hard to show that RRTop-k is ε-DP.

Lemma 1. RRTop-k is ε-DP.

The privacy guarantee of RRWithPrior follows immediately from that of RRTop-k (Lemma 1), since our choice of k* does not depend on the label y:

Corollary 2. RRWithPrior is ε-DP.

For learning with a Label DP guarantee, we first use RRWithPrior to query a randomized label for each example of the training set, and then apply a general learning algorithm that is robust to random label noise to this dataset. Note that, unlike DP-SGD [2], which makes new queries on the gradients in every training epoch, we query the randomized label once and reuse it in all training epochs.
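The following is a minimal NumPy sketch of Algorithms 1 and 2; the function names and the tie-breaking by sort order are our choices rather than anything prescribed above.

```python
import numpy as np

def rr_top_k(y, prior, k, eps, rng):
    """RRTop-k (Algorithm 1): randomized response restricted to the k labels with largest prior."""
    top_k = np.argsort(prior)[::-1][:k]              # Y_k, ties broken by sort order
    if y in top_k:
        # Output y w.p. e^eps/(e^eps + k - 1); each other label of Y_k w.p. 1/(e^eps + k - 1).
        probs = np.full(k, 1.0 / (np.exp(eps) + k - 1))
        probs[top_k == y] = np.exp(eps) / (np.exp(eps) + k - 1)
    else:
        # If y lies outside Y_k, output a uniformly random element of Y_k.
        probs = np.full(k, 1.0 / k)
    return int(rng.choice(top_k, p=probs))

def rr_with_prior(y, prior, eps, rng):
    """RRWithPrior (Algorithm 2): pick k* maximizing w_k, then run RRTop-k with k = k*."""
    K = len(prior)
    top = np.sort(prior)[::-1]
    w = [np.exp(eps) / (np.exp(eps) + k - 1) * top[:k].sum() for k in range(1, K + 1)]
    k_star = int(np.argmax(w)) + 1                   # the choice of k* never looks at y
    return rr_top_k(y, prior, k_star, eps, rng)

rng = np.random.default_rng(0)
prior = np.array([0.05, 0.10, 0.60, 0.25])
noisy_label = rr_with_prior(y=2, prior=prior, eps=1.0, rng=rng)
```

With the skewed prior above and ε = 1, w_k is maximized at k = 2, so the mechanism effectively runs RR over the two most likely labels only.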
3.2 Optimality of RRWithPrior

In this section we prove the optimality of RRWithPrior. For this, we need additional notation. For any algorithm R that takes as input a label y and outputs a randomized label ỹ, we let Obj_p(R) denote the probability that the output label equals the input label y when y is distributed according to p; i.e., $\mathrm{Obj}_p(R) = \Pr_{y \sim p}[R(y) = y]$, where y ∼ p means Pr[y = i] = p_i for all i ∈ [K]. The main result of this section is that, among all ε-DP algorithms, RRWithPrior maximizes Obj_p(R), as stated more formally next.

Theorem 3. Let p be any probability distribution on [K] and let R be any ε-DP algorithm that randomizes the input label given the prior p. We have that Obj_p(RRWithPrior) ≥ Obj_p(R).

Before we proceed to the proof, we remark that our proof employs a linear program (LP) to characterize the optimal mechanisms; a generic form of such LPs has been used before in [48, 41]. However, these works focus on different problems (linear queries) and their results do not apply here.

Proof of Theorem 3. Consider any ε-DP algorithm R, and let $q_{\hat{y}|y}$ denote Pr[R(y) = ŷ]. Observe that $\mathrm{Obj}_p(R) = \sum_{y \in [K]} p_y\, q_{y|y}$. Since $q_{\cdot|y}$ is a probability distribution, we must have $\sum_{\hat{y} \in [K]} q_{\hat{y}|y} = 1$ for all y ∈ [K], and $q_{\hat{y}|y} \ge 0$ for all ŷ, y ∈ [K]. Finally, the ε-DP guarantee of R implies that $q_{\hat{y}|y} \le e^{\varepsilon}\, q_{\hat{y}|y'}$ for all ŷ, y, y′ ∈ [K]. Combining the above, Obj_p(R) is upper-bounded by the optimum of the following linear program, which we refer to as LP1:

maximize $\sum_{y \in [K]} p_y\, q_{y|y}$
s.t. $q_{\hat{y}|y} \le e^{\varepsilon}\, q_{\hat{y}|y'}$ for all ŷ, y, y′ ∈ [K],  (2)
$\sum_{\hat{y} \in [K]} q_{\hat{y}|y} = 1$ for all y ∈ [K],  (3)
$q_{\hat{y}|y} \ge 0$ for all ŷ, y ∈ [K].

Notice that constraints (2) and (3) together imply that $q_{y|y} + e^{-\varepsilon} \sum_{\hat{y} \neq y} q_{\hat{y}|\hat{y}} \le 1$ for all y ∈ [K]. In other words, the optimum of LP1 is at most the optimum of the following LP, which we call LP2:

maximize $\sum_{y \in [K]} p_y\, q_{y|y}$
s.t. $q_{y|y} + e^{-\varepsilon} \sum_{\hat{y} \neq y} q_{\hat{y}|\hat{y}} \le 1$ for all y ∈ [K],  (4)
$q_{y|y} \ge 0$ for all y ∈ [K].  (5)

An optimal solution to LP2 must be a vertex (aka extreme point) of the polytope defined by (4) and (5). Recall that an extreme point of a K-dimensional polytope must satisfy K independent constraints with equality. In our case, this means that one of the following occurs: either inequality (5) is satisfied with equality for all y ∈ [K], resulting in the all-zero solution (whose objective is zero); or, for some non-empty subset Y ⊆ [K], inequality (4) is satisfied with equality for all y ∈ Y and inequality (5) is satisfied with equality for all y ∈ [K] \ Y. The latter results in $q_{y|y} = \frac{e^{\varepsilon}}{e^{\varepsilon}+|Y|-1}$ if y ∈ Y and $q_{y|y} = 0$ if y ∉ Y, which yields an objective value of $\frac{e^{\varepsilon}}{e^{\varepsilon}+|Y|-1} \sum_{y \in Y} p_y$. In conclusion, we have that

$$\mathrm{Obj}_p(R) \;\le\; \max_{\emptyset \neq Y \subseteq [K]} \frac{e^{\varepsilon}}{e^{\varepsilon}+|Y|-1} \sum_{y \in Y} p_y \;=\; \max_{k \in [K]} \frac{e^{\varepsilon}}{e^{\varepsilon}+k-1} \max_{Y \subseteq [K],\, |Y|=k} \sum_{y \in Y} p_y \;=\; \max_{k \in [K]} w_k,$$

where the last two equalities follow from our definitions of Y_k and w_k. Notice that Obj_p(RRWithPrior) = max_{k ∈ [K]} w_k. Thus, we get that Obj_p(RRWithPrior) ≥ Obj_p(R), as desired.
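Theorem 3 can also be checked numerically by solving LP1 directly for a small instance; the following sketch (with an arbitrary example prior and ε = 1, values chosen by us) uses scipy to compare the LP optimum against max_k w_k.

```python
import numpy as np
from scipy.optimize import linprog

K, eps = 4, 1.0
p = np.array([0.40, 0.35, 0.15, 0.10])                  # an example prior, sorted descending

# Variables q[yhat, y] = Pr[R(y) = yhat], flattened as index yhat * K + y.
c = np.zeros(K * K)
for y in range(K):
    c[y * K + y] = -p[y]                                 # linprog minimizes, so negate p_y * q_{y|y}

rows, b_ub = [], []                                      # constraints (2): q_{yhat|y} <= e^eps q_{yhat|y'}
for yhat in range(K):
    for y in range(K):
        for yp in range(K):
            if y == yp:
                continue
            row = np.zeros(K * K)
            row[yhat * K + y] = 1.0
            row[yhat * K + yp] = -np.exp(eps)
            rows.append(row)
            b_ub.append(0.0)

A_eq = np.zeros((K, K * K))                              # constraints (3): each column sums to one
for y in range(K):
    A_eq[y, [yhat * K + y for yhat in range(K)]] = 1.0

res = linprog(c, A_ub=np.array(rows), b_ub=b_ub, A_eq=A_eq, b_eq=np.ones(K),
              bounds=(0, None), method="highs")

top = np.sort(p)[::-1]
w = [np.exp(eps) / (np.exp(eps) + k - 1) * top[:k].sum() for k in range(1, K + 1)]
print(-res.fun, max(w))                                  # the two numbers coincide, as Theorem 3 predicts
```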
4 Application of RRWithPrior: Multi-Stage Training

Our RRWithPrior algorithm requires publicly available priors, which can usually be obtained from domain-specific knowledge. In this section, we describe a training framework that bootstraps from a uniform prior and progressively learns refined priors via multi-stage training. This general framework can be applied to arbitrary domains, even when no public prior distributions are available. Specifically, we assume that we have a training algorithm A that outputs a probabilistic classifier which, on a given unlabeled sample x, can assign a probability p_y to each class y ∈ [K]. We partition our dataset into subsets S^(1), . . . , S^(T), and we start with a trivial model M^(0) that outputs equal probabilities for all classes. At each stage t ∈ [T], we use the most recent model M^(t−1) to assign the probabilities (p_1, . . . , p_K) for each sample x_i from S^(t). Applying RRWithPrior with this prior to the true label y_i, we get a randomized label ỹ_i for x_i. We then use all the samples with randomized labels obtained so far to train the model M^(t). The full description of our LP-MST (Label Privacy Multi-Stage Training) method is presented in Algorithm 3. We remark that the partition S^(1), . . . , S^(T) can be chosen arbitrarily, as long as it does not depend on the labels y_1, . . . , y_n. We also stress that the training algorithm A need not be private. We use LP-1ST to denote our algorithm with one stage, LP-2ST to denote our algorithm with two stages, and so on. We note that LP-1ST is equivalent to using vanilla RR. The t-th stage of a multi-stage algorithm is denoted stage-t.

Algorithm 3 Multi-Stage Training (LP-MST)
Input: Dataset S = {(x_1, y_1), . . . , (x_n, y_n)}
Parameters: Number T of stages, training algorithm A
1. Partition S into S^(1), . . . , S^(T).
2. Let M^(0) be the trivial model that always assigns equal probability to each class.
3. For t = 1 to T:
   (a) Let S̃^(t) = ∅.
   (b) For each (x_i, y_i) ∈ S^(t):
       i. Let p = (p_1, . . . , p_K) be the probabilities predicted by M^(t−1) on x_i.
       ii. Let ỹ_i = RRWithPrior_p(y_i).
       iii. Add (x_i, ỹ_i) to S̃^(t).
   (c) Let M^(t) be the model resulting from training on S̃^(1) ∪ · · · ∪ S̃^(t) using A.
4. Output M^(T).

The privacy guarantee of LP-MST is given by the following:

Observation 4. For any ε > 0, if RRWithPrior is ε-DP, then LP-MST is ε-Label DP.

Proof. We will in fact prove a stronger statement: the algorithm is ε-DP even when we output all T models M^(1), . . . , M^(T) together with all the randomized labels ỹ_1, . . . , ỹ_n. For any possible output models m^(1), . . . , m^(T) and output labels z_1, . . . , z_n, the probability Pr[M^(1) = m^(1), . . . , M^(T) = m^(T), ỹ_1 = z_1, . . . , ỹ_n = z_n] factorizes over the stages t ∈ [T] into the probability that the randomized labels of the examples in S^(t) equal the prescribed z_i given M^(t−1) = m^(t−1), multiplied by the probability that training on the randomized labels of S^(1) ∪ · · · ∪ S^(t) produces M^(t) = m^(t). Consider any two datasets D, D′ that differ in a single user's label; suppose this user is j and that the user belongs to partition S^(τ) for some τ ∈ [T]. Then the above expression for D and that for D′ agree in all but one factor, namely Pr[ỹ_j = z_j | M^(τ−1) = m^(τ−1)], which is the probability that RRWithPrior, run with the prior given by m^(τ−1) on x_j, outputs z_j. Since RRWithPrior is ε-DP, we conclude that the ratio between the two probabilities is at most e^ε, as desired.

We stress that this observation holds because each sensitive label y_i is used only once, in Line 3(b)ii of Algorithm 3, as the dataset S is partitioned at the beginning of the algorithm. As a result, since each stage is ε-Label DP, the entire algorithm is also ε-Label DP. This is known as (an adaptive version of) parallel composition [68].
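To make the structure of Algorithm 3 concrete, here is a schematic sketch of the LP-MST loop, reusing rr_with_prior from the sketch above; make_model, train, and predict_proba stand in for the (non-private) training algorithm A and its probabilistic classifier, and are placeholders rather than an actual training setup.

```python
import numpy as np

def lp_mst(stages, num_classes, eps, make_model, train, rng):
    """stages: T disjoint lists of (x, y) pairs; the split must not depend on the labels."""
    model = None                                    # plays the role of M^(0) (uniform prior)
    noisy_so_far = []
    for part in stages:                             # stage t = 1, ..., T
        for x, y in part:
            if model is None:
                prior = np.full(num_classes, 1.0 / num_classes)
            else:
                prior = model.predict_proba(x)      # prior from M^(t-1); never looks at y
            y_noisy = rr_with_prior(y, prior, eps, rng)
            noisy_so_far.append((x, y_noisy))       # each true label is queried exactly once
        model = train(make_model(), noisy_so_far)   # M^(t), trained on all noisy labels so far
    return model
```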
Finally, we point out that the running time of our RRWithPrior algorithm is quasi-linear in K (the time needed to sort the prior). This is essentially optimal within multi-stage training, since O(K) time is already required to write down the prior after each stage. Moreover, for reasonable values of K, the running time is dominated by back-propagation for gradient estimation; the focus of the current work is on small to modest label spaces (i.e., values of K).

5 Empirical Evaluation

We evaluate RRWithPrior on standard benchmark datasets that have been widely used in previous works on private machine learning. In the first part, we study our general multi-stage training algorithm, which bootstraps from a uniform prior; we evaluate it on image classification and collaborative filtering tasks. In the second part, we focus on image classification only and use domain-specific techniques to obtain priors for RRWithPrior. We use modern neural network architectures (e.g., ResNets [51]) and the mixup [101] regularization for learning with noisy labels. Please see the Supplementary Material for full details on the datasets and the experimental setup.

5.1 Evaluation with Multi-Stage Training

CIFAR-10 [60] is a 10-class image classification benchmark dataset. We evaluate our algorithm and compare it to previously reported DP baselines in Table 1. Due to scalability issues, previous DP algorithms could only use simplified architectures with non-private accuracy significantly below the state-of-the-art. Moreover, even when compared to those weaker non-private baselines, a large performance drop is observed in the private models. In contrast, we use ResNet18 with 95% non-private accuracy. Overall, our algorithms improve the previous state-of-the-art by a margin of 20% across all values of ε.

Table 1: Test accuracy (%) on CIFAR-10. The baseline performances taken from previously published results correspond to (ε, δ)-DP with δ = 10⁻⁵. The star (*) indicates the use of CIFAR-100 pre-trained representations.

| Algorithm | ε = 1 | ε = 2 | ε = 3 | ε = 4 | ε = 6 | ε = 8 | ε = ∞ |
|---|---|---|---|---|---|---|---|
| DP-SGD w/ pre-train* [2] | - | 67 | - | 70 | - | 73 | 80 |
| DP-SGD [77] | - | - | - | - | - | 61.6 (ε = 7.53) | 76.6 |
| Tempered Sigmoid [77] | - | - | - | - | - | 66.2 (ε = 7.53) | - |
| Yu et al. [98] | - | - | - | - | - | 44.3 (ε = 6.78) | - |
| Nasr et al. [71] | - | - | 55 | - | - | - | - |
| Chen and Lee [20] | - | - | 53 | - | - | - | - |
| ScatterNet+CNN [93] | - | - | 69.3 | - | - | - | - |
| LP-1ST | 59.96 | 82.38 | 89.89 | 92.58 | 93.58 | 94.70 | 94.96 |
| LP-2ST | 63.67 | 86.05 | 92.19 | 93.37 | 94.26 | 94.52 | - |
| LP-1ST w/ pre-train* | 67.64 | 83.99 | 90.24 | 92.83 | 94.02 | 94.96 | 95.25 |
| LP-2ST w/ pre-train* | 70.16 | 87.22 | 92.12 | 93.53 | 94.41 | 94.59 | - |

Abadi et al. [2] treated CIFAR-100 as public data and used it to pre-train a representation to boost the performance of DP-SGD. We also observe performance improvements with CIFAR-100 pre-training (Table 1, bottom two rows). But even without pre-training, our results are significantly better than those of DP-SGD with pre-training.

In Table 2 we also show results on CIFAR-100, a more challenging variant with 10× more classes. To the best of our knowledge, these are the first non-trivial reported results on CIFAR-100 for DP learning. For ε = 8, our algorithm is only 2% below the non-private baseline.

Table 2: Experiments on CIFAR-100. The non-private baseline (ε = ∞) is 76.38% test accuracy.

| Algorithm | ε = 3 | ε = 4 | ε = 5 | ε = 6 | ε = 8 |
|---|---|---|---|---|---|
| LP-1ST | 20.96 | 46.28 | 61.38 | 68.34 | 73.59 |
| LP-2ST | 28.74 | 50.15 | 63.51 | 70.58 | 74.14 |

In addition, we also evaluate on MovieLens-1M [49], which contains 1 million anonymous ratings of approximately 3,900 movies, made by 6,040 MovieLens users. Following [15], we randomly split the data into 80% train and 20% test, and show the test Root Mean Square Error (RMSE) in Table 3.

Table 3: Experiments on MovieLens-1M. The numbers show the test RMSE.

| Algorithm | ε = 1 | ε = 2 | ε = 3 | ε = 4 | ε = 8 | ε = ∞ |
|---|---|---|---|---|---|---|
| LP-1ST | 1.122 | 0.981 | 0.902 | 0.877 | 0.867 | 0.868 |
| LP-2ST | 1.034 | 0.928 | 0.891 | 0.874 | 0.865 | - |
| Gaussian DP [15] | - | - | - | - | 0.915 (ε ≈ 10) | - |

Results on MNIST [61], Fashion-MNIST [97], and KMNIST [25], as well as comparisons to more baselines, can be found in the Supplementary Material. On all the datasets we evaluated, our algorithms not only significantly outperform the previous methods, but also greatly shrink the performance gap between private and non-private models. The latter is critical for applying deep learning systems to real-world tasks with privacy concerns.

Beyond Two Stages. In Figure 1(a), we report results on LP-MST with T > 2.
For the cases we tried, we consistently observe 1–2% improvements in test accuracy when going from LP-2ST to LP-3ST. In our preliminary experiments, going beyond four stages leads to diminishing returns on some datasets.

5.2 Evaluation with Domain-Specific Priors

The multi-stage training framework evaluated in the previous section is a general, domain-agnostic algorithm that bootstraps itself from uniform priors. In some cases, domain-specific priors can be obtained to further improve the learning performance. In this section, we focus on image classification applications, where recent advances in self-supervised learning (SSL) [22, 23, 44, 53, 16] show that high-quality image representations can be learned on large image datasets without using the class labels. In the setting of Label DP, the unlabeled images are considered public data, so we design an algorithm that uses SSL to obtain priors, which are then fed to RRWithPrior for discriminative learning. Specifically, we partition the training examples into groups by clustering them using their representations extracted from SSL models. We then query a histogram of labels for each group via the discrete Laplace mechanism (aka geometric mechanism) [41]. If the groups are largely homogeneous, consisting mostly of examples from the same class, then we can make the histogram queries with a minimal privacy budget. The queried histograms are used as label priors for all the points in the group.

Figure 1(b) shows the results on two different SSL representations: BYOL [44], trained on unlabeled CIFAR-10 images, and DINO [16], trained on ImageNet [27] images. Compared to the baseline, the SSL-based priors significantly improve the model performance with small privacy budgets. Note that, since the SSL priors are not true priors, with a large privacy budget (ε = 8) this approach actually underperforms the uniform prior. But in most real-world applications, small values of ε are generally more useful.

Figure 1: (a) Test accuracy (%) on various datasets with LP-MST for T > 2. The curve "CIFAR-10 w/ pre-train" uses CIFAR-100 as public data to pre-train the model. (b) RRWithPrior with priors obtained from a histogram query based on clustering in various SSL representations. We also plot recent results (Cluster RR) from Esfandiari et al. [37], which is a clustering-based Label DP algorithm.
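The clustering-based prior described above can be sketched as follows; the choice of k-means, the number of clusters, the exact noise scale, and the normalization are our assumptions for illustration, not the reported experimental configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def ssl_cluster_priors(embeddings, labels, num_classes, eps, n_clusters, rng):
    """Cluster public SSL embeddings, then release one noisy label histogram per cluster.
    `labels` is an integer numpy array aligned with `embeddings`."""
    # Clustering touches only the (non-sensitive) inputs via their SSL representations.
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    priors = np.zeros((n_clusters, num_classes))
    for c in range(n_clusters):
        counts = np.bincount(labels[cluster_ids == c], minlength=num_classes).astype(float)
        # Discrete Laplace (geometric) noise; substituting one label changes two counts,
        # so we budget the sensitivity as 2 and use parameter exp(-eps / 2).
        q = 1.0 - np.exp(-eps / 2.0)
        noise = rng.geometric(q, size=num_classes) - rng.geometric(q, size=num_classes)
        noisy = np.maximum(counts + noise, 0.0)
        priors[c] = (noisy + 1e-6) / (noisy + 1e-6).sum()   # clip and renormalize into a prior
    return priors, cluster_ids
```

Since each training example contributes to exactly one cluster's histogram, the per-cluster queries compose in parallel; the resulting per-cluster prior is then handed to RRWithPrior for the labels in that cluster, under whatever split of the overall label-privacy budget one chooses between the histogram query and the subsequent randomized-response step.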
6 Theoretical Analysis

Previous works have shown that Label DP can be provably easier than DP in certain settings; specifically, in the PAC learning setting, Beimel et al. [13] proved that finite VC dimension implies learnability by Label DP algorithms, whereas it is known that this is not sufficient for DP algorithms [7]. We extend the theoretical understanding of this phenomenon to the stochastic convex optimization (SCO) setting. Specifically, we show that, by applying RR on the labels and running SGD on top of the resulting noisy dataset with an appropriate debiasing of the noise, one can arrive at the following dimension-independent excess population loss.

Theorem 5 (Informal). For any ε ∈ (0, 1), there is an ε-Label DP algorithm for stochastic convex optimization with excess population loss $O_{D,L}\!\left(\frac{K}{\varepsilon\sqrt{n}}\right)$, where D denotes the diameter of the parameter space and L denotes the Lipschitz constant of the loss function.

The above excess population loss can be compared to that of Bassily et al. [12], who gave an (ε, δ)-DP algorithm with excess population loss $O_{D,L}\!\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{p}}{\varepsilon n}\right)$, where p denotes the dimension of the parameter space; this bound is also known to be tight in the standard DP setting. The main advantage of our guarantee in Theorem 5 is that it is independent of the dimension p. Furthermore, we show that our bound is tight up to polylogarithmic factors and up to the dependency on the number of classes K. The above result provides theoretical evidence that running RR on the labels and then training on this noisy dataset can be effective. We can further extend this to the setting where, instead of running RR, we run RRTop-k before running the aforementioned (debiased) SGD, although, perhaps as expected, our bound on the population loss now depends on the quality of the priors.

Corollary 6 (Informal). Suppose that we are given a prior p^x for every x, and let Y^x_k denote the set of top-k labels with respect to p^x. Then, for any ε ∈ (0, 1), there is an ε-Label DP algorithm for stochastic convex optimization with excess population loss $O_{D,L}\!\left(\frac{k}{\varepsilon\sqrt{n}} + \Pr_{(x,y) \sim \mathcal{D}}[y \notin Y^x_k]\right)$, where D, L are as defined in Theorem 5 and $\mathcal{D}$ is the data distribution.

When our top-k set is perfect (i.e., y always belongs to Y^x_k), the bound reduces to that of Theorem 5, but with the smaller k instead of K. Moreover, the second term is, in some sense, a penalty we pay in the excess population loss for the inaccuracy of the top-k prior. We defer the formal treatment and the proofs to the Supplementary Material, in which we also present additional generalization results for non-convex settings. Note that Corollary 6 does not cover the exact setup we run in experiments, where we dynamically calculate an optimal k* for each x given generic priors (via RRWithPrior), and for which the utility is much more complicated to analyze mathematically. Nonetheless, the above corollary corroborates the intuition that a good prior helps with training.

7 Conclusions and Future Directions

In this work, we introduced a novel algorithm, RRWithPrior (which can be used to improve on the traditional RR mechanism), and applied it to Label DP problems. We showed that prior information can be incorporated into the randomized label querying framework while maintaining privacy constraints. We demonstrated two frameworks for applying RRWithPrior: (i) a general multi-stage training algorithm, LP-MST, that bootstraps from uniform priors, and (ii) an algorithm that builds priors from clustering with SSL-based representations. The former is general purpose and can be applied to tasks even when no domain-specific priors are available, while the latter uses a domain-specific algorithm to extract priors and performs well even with very small privacy budgets. In both cases, by focusing on Label DP, our RRWithPrior significantly improved the model performance over previous state-of-the-art DP models that aimed to protect both the inputs and the labels. We note that, following up on our work, additional results on deep learning with Label DP were obtained [66, 100]. The narrowed performance gap between private and non-private models is vital for adding DP to real-world deep learning models. We nevertheless stress that our algorithms only protect the labels but not the input points, which might not constitute sufficient privacy protection in all settings.

Our work opens up several interesting questions. Firstly, note that our multi-stage training procedure uses very different ingredients than those of Abadi et al. [2] (which employ DP-SGD, privacy amplification by subsampling, and Rényi accounting); can these tools be used to further improve Label DP?
Secondly, while our procedure can be implemented in the most stringent local DP setting [58] (in fact, it can be implemented in the slightly weaker sequentially interactive local DP model [31]), can it be improved in the weaker central (aka trusted curator) DP model, assuming the curator knows the prior? Thirdly, while our algorithm achieves pure DP (i.e., δ = 0), is higher accuracy possible for approximate DP (i.e., δ > 0)?

Acknowledgements

The authors would like to thank Sami Torbey for very helpful feedback on an early version of this work. At MIT, Noah Golowich was supported by a Fannie and John Hertz Foundation Fellowship and an NSF Graduate Fellowship.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In CCS, pages 308–318, 2016.
[3] J. M. Abowd. The US Census Bureau adopts differential privacy. In KDD, pages 2867–2867, 2018.
[4] J. Acharya, K. Bonawitz, P. Kairouz, D. Ramage, and Z. Sun. Context aware local differential privacy. In ICML, pages 52–62, 2020.
[5] A. Agarwal and I. Ganichev. Auto-vectorizing TensorFlow graphs: Jacobians, auto-batching and beyond. arXiv:1903.04243, 2019.
[6] A. Agarwal, P. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In NIPS, pages 1–9, 2009.
[7] N. Alon, R. Livni, M. Malliaris, and S. Moran. Private PAC learning implies finite Littlestone dimension. In STOC, pages 852–860, 2019.
[8] E. Anderson. Masked learning, aggregation and reporting workflow (Masked LARK). https://github.com/WICG/privacy-preserving-ads/blob/main/MaskedLARK.md, 2021.
[9] Apple Differential Privacy Team. Learning with privacy at scale. Apple Machine Learning Journal, 2017.
[10] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
[11] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta. Private stochastic convex optimization with optimal rates. arXiv:1908.09970, 2019.
[12] R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In NeurIPS, pages 11279–11288, 2019.
[13] A. Beimel, K. Nissim, and U. Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. ToC, 12(1):1–61, 2016.
[14] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Summer School on Machine Learning, pages 169–207. Springer, 2003.
[15] Z. Bu, J. Dong, Q. Long, and W. J. Su. Deep learning with Gaussian differential privacy. Harvard Data Science Review, 2020(23), 2020.
[16] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. arXiv:2104.14294, 2021.
[17] T. H. Chan, E. Shi, and D. Song. Optimal lower bound for differentially private multi-party aggregation. In ESA, pages 277–288, 2012.
[18] K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. In COLT, pages 155–186, 2011.
[19] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011.
[20] C. Chen and J. Lee. Stochastic adaptive line search for differentially private optimization. In Big Data, pages 1011–1020, 2020.
[21] P. Chen, B. Liao, G. Chen, and S. Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, pages 1062–1070, 2019.
[22] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
[23] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
[24] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via shuffling. In EUROCRYPT, pages 375–403, 2019.
[25] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha. Deep learning for classical Japanese literature. arXiv:1812.01718, 2018.
[26] F. Dangel, F. Kunstner, and P. Hennig. BackPACK: Packing more into backprop. arXiv:1912.10985, 2019.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[28] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.
[29] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In NIPS, pages 3571–3580, 2017.
[30] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and minimax bounds: Sharp rates for probability estimation. In NIPS, pages 1529–1537, 2013.
[31] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Minimax optimal procedures for locally private estimation. JASA, 113(521):182–201, 2018.
[32] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[33] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
[34] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[35] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In CCS, pages 1054–1067, 2014.
[36] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In SODA, pages 2468–2479, 2019.
[37] H. Esfandiari, V. Mirrokni, U. Syed, and S. Vassilvitskii. Label differential privacy via clustering. arXiv:2110.02159, 2021.
[38] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222, 2003.
[39] V. Feldman and T. Zrnic. Individual privacy accounting via a Rényi filter. arXiv:2008.11193, 2020.
[40] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: Optimal rates in linear time. In STOC, pages 439–449, 2020.
[41] A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. SICOMP, 41(6):1673–1693, 2012.
[42] I. Goodfellow. Efficient per-example gradient computations. arXiv:1510.01799, 2015.
[43] A. Greenberg. Apple's differential privacy is about collecting your data, but not your data. Wired, June 13, 2016.
[44] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[45] A. Gupta, K. Ligett, F. McSherry, A. Roth, and K. Talwar. Differentially private combinatorial optimization. In SODA, pages 1106–1125, 2010.
[46] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
[47] B. Han, G. Niu, X. Yu, Q. Yao, M. Xu, I. W. Tsang, and M. Sugiyama. SIGUA: Forgetting may make learning with noisy labels more robust. In ICML, pages 4006–4016, 2020.
[48] M. Hardt and K. Talwar. On the geometry of differential privacy. In STOC, pages 705–714, 2010.
[49] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4), 2015.
[50] H. Harutyunyan, K. Reing, G. V. Steeg, and A. Galstyan. Improving generalization by controlling label-noise information in neural network weights. In ICML, pages 4071–4081, 2020.
[51] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[52] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016.
[53] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
[54] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In WWW, pages 173–182, 2017.
[55] W. Hu, Z. Li, and D. Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In ICLR, 2020.
[56] L. Jiang, D. Huang, M. Liu, and W. Yang. Beyond synthetic noise: Deep learning on controlled noisy labels. In ICML, pages 4804–4815, 2020.
[57] P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. In ICML, pages 2436–2444, 2016.
[58] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SICOMP, 40(3):793–826, 2011.
[59] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In COLT, pages 25.1–25.40, 2012.
[60] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, 2009.
[61] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[62] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[63] T. Liu, G. Vietri, T. Steinke, J. Ullman, and Z. S. Wu. Leveraging public data for practical private query release. In ICML, 2021.
[64] M. Lukasik, S. Bhojanapalli, A. K. Menon, and S. Kumar. Does label smoothing mitigate label noise? In ICML, pages 6448–6458, 2020.
[65] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey. Normalized loss functions for deep learning with noisy labels. In ICML, pages 6543–6553, 2020.
[66] M. Malek, I. Mironov, K. Prasad, I. Shilov, and F. Tramèr. Antipodes of label differential privacy: PATE and ALIBI. In NeurIPS, 2021.
[67] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In ICLR, 2018.
[68] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. CACM, 53(9):89–97, 2010.
[69] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar. Can gradient clipping mitigate label noise? In ICLR, 2019.
[70] M. Nalpas and S. Dutton. A more private way to measure ad conversions, the Event Conversion Measurement API, October 2020. https://web.dev/conversion-measurement/#how-this-api-preserves-user-privacy.
[71] M. Nasr, R. Shokri, et al. Improving deep learning with differential privacy using gradient encoding and denoising. arXiv:2007.11524, 2020.
[72] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, volume 26, pages 1196–1204, 2013.
[73] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester, 1983.
[74] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox. SELF: Learning to filter noisy labels with self-ensembling. In ICLR, 2019.
[75] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In ICLR, 2017.
[76] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, and Ú. Erlingsson. Scalable private learning with PATE. In ICLR, 2018.
[77] N. Papernot, A. Thakurta, S. Song, S. Chien, and Ú. Erlingsson. Tempered sigmoid activations for deep learning with differential privacy. In AAAI, 2021.
[78] H. Phan, M. T. Thai, H. Hu, R. Jin, T. Sun, and D. Dou. Scalable differential privacy with certified robustness in adversarial learning. In ICML, pages 7683–7694, 2020.
[79] G. Pleiss, T. Zhang, E. R. Elenberg, and K. Q. Weinberger. Identifying mislabeled data using the area under the margin ranking. In ICML, 2020.
[80] C. Radebaugh and U. Erlingsson. Introducing TensorFlow Privacy: Learning with differential privacy for training data, March 2019. blog.tensorflow.org.
[81] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pages I-71–I-79, 2013.
[82] S. Shankland. How Google tricks itself to protect Chrome user privacy. CNET, October 2014.
[83] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In CCS, pages 1310–1321, 2015.
[84] M. T. Smith, M. A. Álvarez, M. Zwiessele, and N. Lawrence. Differentially private regression with Gaussian processes. In AISTATS, 2018.
[85] M. T. Smith, M. A. Álvarez, and N. Lawrence. Differentially private regression and classification with sparse Gaussian processes. arXiv:1909.09147, 2019.
[86] H. Song, M. Kim, D. Park, and J.-G. Lee. Prestopping: How does early stopping help generalization against label noise? arXiv:1911.08059, 2019.
[87] H. Song, M. Kim, D. Park, and J.-G. Lee. Learning from noisy labels with deep neural networks: A survey. arXiv:2007.08199, 2020.
[88] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In GlobalSIP, pages 245–248, 2013.
[89] T. Steinke and J. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016.
[90] P. Subramani, N. Vadivelu, and G. Kamath. Enabling fast differentially private SGD via just-in-time compilation and vectorization. In PPML, 2020.
[91] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[92] D. Testuggine and I. Mironov. PyTorch Differential Privacy Series Part 1: DP-SGD algorithm explained, August 2020. medium.com.
[93] F. Tramèr and D. Boneh. Differentially private learning needs better features (or much more data). In ICLR, 2021.
[94] D. Wang and J. Xu. On sparse linear regression in the local differential privacy model. In ICML, pages 6628–6637, 2019.
[95] D. Wang, M. Ye, and J. Xu. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, pages 2719–2728, 2017.
[96] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. JASA, 60(309):63–69, 1965.
[97] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[98] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex. Differentially private model publishing for deep learning. In S&P, pages 332–349, 2019.
[99] X. Yu, B. Han, J. Yao, G. Niu, I. W. Tsang, and M. Sugiyama. How does disagreement help generalization against label corruption? In ICML, pages 7164–7173, 2019.
[100] S. Yuan, M. Shen, I. Mironov, and A. C. Nascimento. Practical, label private deep learning training based on secure multiparty computation and differential privacy. Cryptology ePrint Archive, 2021.
[101] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[102] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. VLDB, 5(11):1364–1375, 2012.
[103] J. Zhang, K. Zheng, W. Mou, and L. Wang. Efficient private ERM for smooth objectives. In IJCAI, pages 3922–3928, 2017.
[104] Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778–8788, 2018.
[105] S. Zheng, P. Wu, A. Goswami, M. Goswami, D. Metaxas, and C. Chen. Error-bounded correction of noisy labels. In ICML, pages 11447–11457, 2020.