REGRESSION WITH LABEL DIFFERENTIAL PRIVACY

Badih Ghazi, Pritish Kamath, Ravi Kumar, Ethan Leeman, Pasin Manurangsi, Avinash Varadarajan, Chiyuan Zhang*

We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution on label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a "randomized response on bins", and we propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.

1 INTRODUCTION

In recent years, differential privacy (DP, Dwork et al., 2006a;b) has emerged as a popular notion of user privacy in machine learning (ML). At a high level, it guarantees that the output model weights remain statistically indistinguishable when any single training example is arbitrarily modified. Numerous DP training algorithms have been proposed, with open-source libraries tightly integrated into popular ML frameworks such as TensorFlow Privacy (Radebaugh & Erlingsson, 2019) and PyTorch Opacus (Yousefpour et al., 2021).

In the context of supervised ML, a training example consists of input features and a target label. While many existing works focus on protecting both features and labels (e.g., Abadi et al. (2016)), there are important scenarios where the input features are already known to the adversary, so protecting the privacy of the features is not needed. A canonical example arises in computational advertising, where the features are known to one website (a publisher), whereas the conversion events, i.e., the labels, are known to another website (the advertiser).1 Thus, from the first website's perspective, only the labels can be treated as unknown and private. This motivates the study of label DP algorithms, where statistical indistinguishability is required only when the label of a single example is modified.2 The study of this model goes back at least to the work of Chaudhuri & Hsu (2011). Recently, several works, including Ghazi et al. (2021a) and Malek Esmaeili et al. (2021), have studied label DP deep learning algorithms for classification objectives.

Our Contributions. In this work, we study label DP for regression tasks. We provide a new algorithm that, given a global prior distribution (which, if unknown, could be estimated privately), derives a label DP mechanism that is optimal under a given objective loss function. We provide an explicit characterization of the optimal mechanism for a broad family of objective functions, including the most commonly used regression losses such as the Poisson log loss, the mean squared error, and the mean absolute error. More specifically, we show that the optimal mechanism belongs to a class of randomized response on bins (Algorithm 1). We show this by writing the optimization problem as a linear program (LP) and characterizing its optimum. With this characterization in mind, it suffices for us to compute the optimal mechanism among the class of randomized response on bins.

* Authors in alphabetical order. Email: {badih.ghazi, ravi.k53}@gmail.com, {pritishk, ethanleeman, pasin, avaradar, chiyuan}@google.com
1 A similar use case is in mobile advertising, where websites are replaced by apps.
2 We note that this label DP setting is particularly timely and relevant for ad attribution and conversion measurement, given the deprecation of third-party cookies by several browsers and platforms (Wilander, 2020; Wood, 2019; Schuh, 2020).

[Figure 1: Learning with feature-oblivious label DP. The labels party sends a single ε-DP message M(y1, . . . , yn) to the features party (Step 1), which then trains an ε-Label DP model on its features together with the received message (Step 2).]

We then provide an efficient algorithm for this task, based on dynamic programming (Algorithm 2). In practice, a prior distribution on the labels is not always available. This leads to our two-step algorithm (Algorithm 3), where we first use a portion of the privacy budget to build an approximate histogram of the labels, and then feed this approximate histogram as a prior into the optimization algorithm in the second step; this second step uses the remaining privacy budget. We show that as the number of samples grows, this two-step algorithm yields an expected loss (between the privatized label and the raw label) that is arbitrarily close to the expected loss of the optimal local DP mechanism. (We give a quantitative bound on the convergence rate.)

Our two-step algorithm can be naturally deployed in the two-party learning setting where each example is vertically partitioned, with one party holding the features and the other party holding the (sensitive) labels. The algorithm is in fact one-way, requiring a single message to be communicated from the labels party to the features party, and we require that this one-way communication satisfies (label) DP. We refer to this setting, depicted in Figure 1, as feature-oblivious label DP.

We evaluate our algorithm on three datasets: the 1940 US Census IPUMS dataset, the Criteo Sponsored Search Conversion dataset, and a proprietary app install ads dataset from a commercial mobile app store. We compare our algorithm to several baselines and demonstrate that it achieves higher utility across all tested privacy budgets, with significantly lower test errors in the high-privacy regime. For example, for privacy budget ε = 0.5, compared to the best baseline methods, the test MSE of our algorithm is 1.5× smaller on the Criteo and US Census datasets, and the relative test error is 5× smaller on the app ads dataset.

Organization. In Section 2, we recall some basics of DP and learning theory, and define the feature-oblivious label DP setting in which our algorithm can be implemented. Our label DP algorithm for regression objectives is presented in Section 3. Our experimental evaluation and results are described in Section 4. A brief overview of related work appears in Section 5. We conclude with some interesting future directions in Section 6. Most proofs are deferred to the Appendix (along with additional experimental details and background material).

2 PRELIMINARIES

We consider the standard setting of supervised learning, where we have a set of examples of the form (x, y) ∈ X × Y, drawn from some unknown distribution D, and we wish to learn a predictor fθ (parameterized by θ) that minimizes L(fθ) := E_{(x,y)∼D} ℓ(fθ(x), y), for some loss function ℓ : R × Y → R≥0; we will consider the case where Y ⊆ R. Some common loss functions include the zero-one loss ℓ_{0-1}(ŷ, y) := 1[ŷ ≠ y] and the logistic loss ℓ_{log}(ŷ, y) := log(1 + e^{−ŷy}) for binary classification, and the squared loss ℓ_{sq}(ŷ, y) := (1/2)(ŷ − y)², the absolute-value loss ℓ_{abs}(ŷ, y) := |ŷ − y|, and the Poisson log loss ℓ_{Poi}(ŷ, y) := ŷ − y log(ŷ) for regression.
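For concreteness, the following short Python sketch spells out the three regression losses above in a vectorized form that the later sketches in Section 3 can reuse; the function names are ours and are not taken from any released code.

```python
import numpy as np

# Standard regression losses from the preliminaries, written as numpy-friendly
# functions so that later sketches can evaluate them over arrays of labels.

def squared_loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    return np.abs(y_hat - y)

def poisson_log_loss(y_hat, y):
    # Assumes y_hat > 0 (e.g., the output of an exponential/softplus link).
    return y_hat - y * np.log(y_hat)
```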
This paper focuses on the regression setting. We wish to perform this learning with differential privacy (DP). We start by recalling the definition of DP, which can be applied to any notion of adjacent pairs of datasets. For an overview of DP, we refer the reader to the book of Dwork & Roth (2014).

Definition 1 (DP; Dwork et al. (2006b)). Let ε be a positive real number. A randomized algorithm A taking as input a dataset is said to be ε-differentially private (denoted ε-DP) if for any two adjacent datasets X and X′, and any subset S of outputs of A, we have Pr[A(X) ∈ S] ≤ e^ε · Pr[A(X′) ∈ S].

In supervised learning applications, the input to a DP algorithm is the training dataset (i.e., a set of labeled examples) and the output is the description of a predictor (e.g., the weights of the trained model). Typically, two datasets are considered adjacent if they differ on a single training example; this notion protects both the features and the label of any single example. As discussed in Section 1, in certain situations, protecting the features is either unnecessary or impossible, which motivates the study of label DP that we define next.

Definition 2 (Label DP; Chaudhuri & Hsu (2011)). An algorithm A taking as input a training dataset is said to be ε-label differentially private (denoted ε-Label DP) if it is ε-DP when two input datasets are considered adjacent if they differ on the label of a single training example.

We next define an important special case of training with label DP, which commonly arises in practice. The setting can be viewed as a special case of vertical federated learning, where each training example is divided across different parties at the beginning of the training process; see, e.g., the survey of Kairouz et al. (2021) and the references therein. We consider the special case where we only allow a single round of communication from the first party, who holds all the labels, to the second party, who holds all the features and who then trains the ML model and outputs it. We refer to this setting (depicted in Figure 1) as feature-oblivious label DP, since the label DP property is guaranteed without having to look at the features (but with the ability to look at the labels of all users); we next give the formal definition.

Definition 3 (Feature-Oblivious Label DP). Consider the two-party setting where the features party holds as input a sequence (x_i)_{i=1}^n of all feature vectors (across all n users), and the labels party holds as input a sequence (y_i)_{i=1}^n of the corresponding labels. The labels party sends a single message M(y_1, . . . , y_n) to the features party; this message can be randomized using the internal randomness of the labels party. Based on its input and on this incoming message, the features party then trains an ML model, which it outputs. We say that this output is feature-oblivious ε-Label DP if the message M(y_1, . . . , y_n) is ε-DP with respect to the adjacency relation where a single y_i can differ across the two datasets.

We stress the practical applicability of the setting described in Definition 3. The labels party could be an advertiser who observes the purchase data of the users and wants to enable a publisher (the features party) to train a model predicting the likelihood that a purchase (on the advertiser's side) is driven by the showing of a certain type of ad (on the publisher's side). The advertiser would also want to limit the leakage of sensitive information to the publisher.
From a practical standpoint, the simplest option for the advertiser is to send a single privatized message to the publisher, who can then train a model using the features at its disposal. This is exactly the feature-oblivious label DP setting.

3 LABEL DP LEARNING ALGORITHM

A common template for learning with label DP is to: (i) compute noisy labels using a local DP mechanism M, and (ii) run a learning algorithm on the dataset with noisy labels. All of the baseline algorithms that we consider follow this template, through different ways of generating noisy labels, such as (a) randomized response (for categorical labels), (b) the (continuous/discrete) Laplace mechanism, and (c) the (continuous/discrete) staircase mechanism (for formal definitions of these mechanisms, see Appendix D).

Intuitively, such a learning algorithm will be most effective when the noisy label mostly agrees with the true label. Suppose the loss function ℓ satisfies the triangle inequality, namely that ℓ(ỹ, y) ≤ ℓ(ŷ, y) + ℓ(ỹ, ŷ) for all y, ŷ, ỹ. Then we have that

  E_{(x,y)∼D} ℓ(fθ(x), y) ≤ E_{y∼P, ŷ∼M(y)} ℓ(ŷ, y) + E_{(x,y)∼D, ŷ∼M(y)} ℓ(fθ(x), ŷ),   (1)

where, for ease of notation, we use P to denote the marginal on y for (x, y) ∼ D. The learning algorithm, in part (ii) of the template above, aims to minimize the second term on the RHS of (1). Thus, it is natural to choose a mechanism M, in part (i) of the template, that minimizes the first term on the RHS of (1).3 Ghazi et al. (2021a) studied this question for the case of the 0-1 loss and showed that a randomized response on the top k labels with the highest prior masses (for some k) is optimal. Their characterization was limited to this classification loss. In this work, we develop a characterization for a large class of regression loss functions.

3 Note that the first term on the RHS of (1) is a constant, independent of the number of training samples. Therefore, this upper bound is not vanishing in the number of samples, and hence the inequality can be quite loose.

Algorithm 1 RR-on-Bins^Φ_ε.
  Parameters: Φ : Y → Ŷ (for label set Y and output set Ŷ), privacy parameter ε ≥ 0.
  Input: A label y ∈ Y.
  Output: ŷ ∈ Ŷ.
  return a sample ŷ ∼ Ŷ, where the random variable Ŷ is distributed, for each ŷ ∈ Ŷ, as
    Pr[Ŷ = ŷ] = e^ε / (e^ε + |Ŷ| − 1)  if ŷ = Φ(y),
    Pr[Ŷ = ŷ] = 1 / (e^ε + |Ŷ| − 1)   otherwise.

Algorithm 2 Compute optimal Φ for RR-on-Bins^Φ_ε.
  Input: Distribution P over Y ⊆ R, privacy parameter ε ≥ 0, loss function ℓ : R² → R≥0.
  Output: Output set Ŷ ⊆ R and Φ : Y → Ŷ.
  y_1, . . . , y_k ← elements of Y in increasing order
  Initialize A[i][j] ← ∞ for all i, j ∈ {0, . . . , k}
  for r, i ∈ {1, . . . , k} with r ≤ i do
    L[r][i] ← min_{ŷ∈R} Σ_{y∈Y} p_y · e^{1[y∈[y_r, y_i]]·ε} · ℓ(ŷ, y)
  A[0][0] ← 0
  for i ∈ {1, . . . , k} do
    for j ∈ {1, . . . , i} do
      A[i][j] ← min_{0≤r<i} A[r][j−1] + L[r+1][i]
  return the Φ and Ŷ achieving min_{d∈[k]} A[k][d] / (d − 1 + e^ε)

3.1 RANDOMIZED RESPONSE ON BINS

Our basic mechanism, RR-on-Bins^Φ_ε, is parameterized by a privacy parameter ε > 0 and a non-decreasing function Φ : Y → Ŷ that maps the label set Y ⊆ R to an output set Ŷ ⊆ R. The algorithm is simple: perform ε-randomized response on Φ(y), randomizing over Ŷ; see Algorithm 1 for a formal definition. Any Φ we consider will be non-decreasing unless otherwise stated, and so we often omit mentioning it explicitly. The important parameters in RR-on-Bins are the mapping Φ and the output set Ŷ. We choose Φ with the goal of minimizing E ℓ(ŷ, y). First we show, under some basic assumptions on ℓ, that for any given distribution P over labels Y, there exists a non-decreasing map Φ such that M = RR-on-Bins^Φ_ε minimizes E_{y∼P, ŷ∼M(y)} ℓ(ŷ, y). Since Φ is non-decreasing, it follows that Φ^{-1}(ŷ) is an interval4 of Y, for all ŷ ∈ Ŷ. Exploiting this property, we give an efficient algorithm for computing the optimal Φ.
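The following is a minimal Python sketch of the RR-on-Bins mechanism of Algorithm 1, assuming a finite output set; the helper names (rr_on_bins, phi) are ours and are not taken from any released code.

```python
import numpy as np

def rr_on_bins(y, phi, output_set, epsilon, rng=None):
    """Sketch of RR-on-Bins (Algorithm 1): report phi(y) with probability
    e^eps / (e^eps + |output_set| - 1), and any other bin value with
    probability 1 / (e^eps + |output_set| - 1)."""
    rng = rng if rng is not None else np.random.default_rng()
    outputs = list(output_set)
    k = len(outputs)
    probs = np.full(k, 1.0 / (np.exp(epsilon) + k - 1))
    probs[outputs.index(phi(y))] = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    return rng.choice(outputs, p=probs)
```

For illustration, with bins = [50.0, 100.0, 250.0] (roughly the three bins chosen for ε = 3 on Criteo in Figure 2), one could take phi = lambda y: min(bins, key=lambda b: abs(b - y)); the bins and the map Φ used in the paper are instead computed by Algorithm 2.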
To state our results formally, we use L(M; P) to denote E_{y∼P, ŷ∼M(y)} ℓ(ŷ, y), the first term on the RHS of (1). Our results hold under the following natural assumption; note that all the standard loss functions, such as the squared loss, the absolute-value loss, and the Poisson log loss, satisfy Assumption 4.

Assumption 4. The loss function ℓ : R × R → R≥0 is such that:
  • For all y ∈ R, ℓ(·, y) is continuous.
  • For all y ∈ R, ℓ(ŷ, y) is decreasing in ŷ when ŷ ≤ y and increasing in ŷ when ŷ ≥ y.
  • For all ŷ ∈ R, ℓ(ŷ, y) is decreasing in y when y ≤ ŷ and increasing in y when y ≥ ŷ.

We can now state our main result:

Theorem 5. For any loss function ℓ : R × R → R≥0 satisfying Assumption 4 and any finitely supported distribution P over Y ⊆ R, there is an output set Ŷ ⊆ R and a non-decreasing map Φ : Y → Ŷ such that L(RR-on-Bins^Φ_ε; P) = inf_M L(M; P), where the infimum is over all ε-DP mechanisms M.

Proof Sketch. The proof is done in two stages. First, we handle the case where Ŷ is restricted to be a subset of O, where O is a finite subset of R. Under this restriction, the optimal mechanism M that minimizes L(M; P) can be computed as the solution to an LP. Since an optimal solution to an LP can always be found at the vertices of the constraint polytope, we study properties of the vertices of the said polytope and show that the minimizer amongst its vertices necessarily takes the form of RR-on-Bins^Φ_ε. To handle the case of general Ŷ, we first note that due to Assumption 4, it suffices to consider Ŷ ⊆ [y_min, y_max] (where y_min = min_{y∈Y} y and y_max = max_{y∈Y} y). Next, we consider a sequence of increasingly finer discretizations of [y_min, y_max], and show that RR-on-Bins^Φ_ε can come arbitrarily close to the optimal L(M; P) when restricting Ŷ to be a subset of the discretized set. But observe that any Φ induces a partition of Y into intervals. Since there are only finitely many partitions of Y into intervals, it follows that RR-on-Bins^Φ_ε can in fact exactly achieve the optimal L(M; P). The full proof is deferred to Appendix A.

4 An interval of Y is a subset of the form [a, b] ∩ Y, for some a, b ∈ R, consisting of consecutive elements of Y in sorted order.

Algorithm 3 Labels Party's Randomizer LabelRandomizer_{ε1,ε2}.
  Parameters: Privacy parameters ε1, ε2 ≥ 0.
  Input: Labels y_1, . . . , y_n ∈ Y.
  Output: ŷ_1, . . . , ŷ_n ∈ Ŷ.
  P′ ← M^Lap_{ε1}(y_1, . . . , y_n)
  Φ′, Ŷ′ ← result of running Algorithm 2 with distribution P′ and privacy parameter ε2
  for i ∈ [n] do
    ŷ_i ← RR-on-Bins^{Φ′}_{ε2}(y_i)
  return (ŷ_1, . . . , ŷ_n)

Given Theorem 5, it suffices to focus on RR-on-Bins^Φ_ε mechanisms and optimize over the choice of Φ. We give an efficient dynamic programming-based algorithm for the latter in Algorithm 2. The main idea is as follows. We can break the representation of Φ into two parts: (i) a partition P_Φ = {S_1, S_2, . . .} of Y into intervals5, such that Φ is constant over each S_i, and (ii) the values ŷ_i that Φ(·) takes over each interval S_i. Let y_1, . . . , y_k be the elements of Y in increasing order. Our dynamic program has a state A[i][j], which is (proportional to) the loss of the optimal mechanism if we restrict the input to only {y_1, . . . , y_i} and consider partitioning it into j intervals S_1, . . . , S_j. To compute A[i][j], we try all possibilities for S_j. Recall that S_j must be an interval, i.e., S_j = {y_{r+1}, . . . , y_i} for some r < i. This breaks the problem into two parts: computing the optimal RR-on-Bins for {y_1, . . . , y_r} with j − 1 intervals, and solving for the optimal output label ŷ for S_j. The answer to the first part is simply A[r][j−1], whereas the second part can be written as a univariate optimization problem (denoted by L[r][i] in Algorithm 2). When ℓ is convex (in the first argument), this optimization problem is convex and can thus be solved efficiently. Furthermore, for the squared loss, the absolute-value loss, and the Poisson log loss, this problem can be solved in amortized O(1) time, resulting in a total running time of O(k²) for the entire dynamic-programming algorithm. The full description is presented in Algorithm 2.6 The complete analyses of its correctness and running time are deferred to Appendix B.

5 For convenience, we assume that S_1, S_2, . . . are sorted in increasing order.
6 We remark that the corresponding Φ, Ŷ in the last line can be efficiently computed by recording the minimizer for each A[i][j] and L[r][i], and going backward starting from i = k and j = arg min_{d∈[k]} (1/(d − 1 + e^ε)) · A[k][d].
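To make the dynamic program concrete, here is a simplified, unoptimized Python sketch of Algorithm 2 for a finite label set. It can be used with the loss helpers sketched in Section 2; the inner minimization over ŷ is done by brute force over a candidate grid rather than with the amortized-O(1) updates of Appendix B, and all names are ours.

```python
import numpy as np

def optimal_bins(labels, probs, epsilon, loss, candidates=None):
    """Simplified sketch of Algorithm 2: choose contiguous bins and their
    output values minimizing the expected loss of RR-on-Bins.

    labels: distinct label values in increasing order; probs: prior masses;
    loss(y_hat, y): vectorized loss satisfying Assumption 4; candidates:
    finite grid of allowed outputs (defaults to the labels themselves).
    Returns a list of (bin_low, bin_high, output_value) triples."""
    labels = np.asarray(labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    k = len(labels)
    cands = labels if candidates is None else np.asarray(candidates, dtype=float)
    e_eps = np.exp(epsilon)

    # cost[r][i]: best unnormalized cost of a single bin covering labels r..i-1
    # (prior mass weighted by e^eps inside the bin and by 1 outside).
    cost = np.full((k, k + 1), np.inf)
    best_out = np.zeros((k, k + 1))
    for r in range(k):
        for i in range(r + 1, k + 1):
            inside = (np.arange(k) >= r) & (np.arange(k) < i)
            w = np.where(inside, e_eps * probs, probs)
            vals = np.array([np.sum(w * loss(c, labels)) for c in cands])
            j = int(np.argmin(vals))
            cost[r][i], best_out[r][i] = vals[j], cands[j]

    # A[i][j]: best total unnormalized cost using the first i labels and j bins.
    A = np.full((k + 1, k + 1), np.inf)
    A[0][0] = 0.0
    split = np.zeros((k + 1, k + 1), dtype=int)
    for i in range(1, k + 1):
        for j in range(1, i + 1):
            for r in range(j - 1, i):
                if A[r][j - 1] + cost[r][i] < A[i][j]:
                    A[i][j] = A[r][j - 1] + cost[r][i]
                    split[i][j] = r

    # The normalization factor e^eps + d - 1 depends on the number of bins d,
    # so the best d is chosen at the end (last line of Algorithm 2).
    scores = [A[k][num] / (e_eps + num - 1) for num in range(1, k + 1)]
    d = int(np.argmin(scores)) + 1

    # Backtrack to recover the bins and their output values.
    bins, i = [], k
    for j in range(d, 0, -1):
        r = split[i][j]
        bins.append((labels[r], labels[i - 1], best_out[r][i]))
        i = r
    return list(reversed(bins))
```

For example, optimal_bins(labels, probs, epsilon=3.0, loss=squared_loss) returns the bins whose output values define Φ; a label y is then mapped to the output value of the bin containing it and fed to rr_on_bins.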
3.2 ESTIMATING THE PRIOR PRIVATELY

In the previous subsection, we assumed that the distribution P of labels is known beforehand. This might not be the case in all practical scenarios. Below we present an algorithm that first privately approximates the distribution P and then works with this approximation instead. We also analyze how using such an approximate prior affects the performance of the algorithm.

To formalize this, let us denote by M^Lap_ε the ε-DP Laplace mechanism for approximating the prior. Given n samples drawn from P, M^Lap_ε constructs a histogram over Y and adds Laplace noise with scale 2/ε to each entry, followed by clipping (to ensure that entries are non-negative) and normalization. A formal description of M^Lap_ε can be found in Algorithm 4 in the Appendix.

[Figure 2: Illustration of the training-label randomization mechanisms under privacy budget ε = 3: (a) RR-on-Bins (MSE = 4591.64 between the original and noised labels) and (b) the Laplace mechanism (MSE = 14461.29). RR-on-Bins maps the labels to 3 bins chosen for optimal MSE via Algorithm 2. The 2D density plots show original labels versus noised labels (both in the range 0 to 400); contours are generated in log scale, and the legends show the MSE between the original labels and the ε-DP randomized labels.]

Our mechanism for the unknown-prior case, described in Algorithm 3, is now simple: we split the privacy budget into ε1, ε2, run the ε1-DP Laplace mechanism to get an approximate prior distribution P′, and run the optimal RR-on-Bins with privacy parameter ε2 to randomize the labels. It is not hard to see that the algorithm is (ε1 + ε2)-DP. (The full proof is deferred to the Appendix.)

Theorem 6. LabelRandomizer_{ε1,ε2} is (ε1 + ε2)-DP.

Let us now discuss the performance of the above algorithm in comparison to the case where the prior P is known, under the assumption that y_1, . . . , y_n are sampled i.i.d. from P. We show in the following theorem that the difference between the expected population losses in the two cases converges to zero as the number of samples n → ∞, provided that Y is finite:

Theorem 7. Let ℓ : R × R → R≥0 be any loss function satisfying Assumption 4. Furthermore, assume that ℓ(ŷ, y) ≤ B for some parameter B, for all y, ŷ ∈ Y. For any distribution P on Y, ε > 0, and any sufficiently large n ∈ N, there is a choice of ε1, ε2 > 0 such that ε1 + ε2 = ε and

  E_{y_1,...,y_n∼P; P′,Φ′,Ŷ′} [L(RR-on-Bins^{Φ′}_{ε2}; P)] − inf_M L(M; P) ≤ O(B · √(|Y|/n)),

where P′, Φ′, Ŷ′ are as in LabelRandomizer_{ε1,ε2} and the infimum is over all ε-DP mechanisms M.
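Putting the pieces together, the following Python sketch mirrors the two-step LabelRandomizer of Algorithm 3: a Laplace-noised histogram serves as the approximate prior (as in Algorithm 4 of the Appendix), and the remaining budget is spent on the optimal RR-on-Bins. It reuses the rr_on_bins and optimal_bins helpers sketched earlier; the names and the budget split in the usage example are illustrative, not taken from the paper's code.

```python
import numpy as np

def label_randomizer(ys, label_set, eps1, eps2, loss, rng=None):
    """Sketch of Algorithm 3. Step 1: eps1-DP Laplace histogram as the prior.
    Step 2: eps2-DP RR-on-Bins with bins optimized for that prior."""
    rng = rng if rng is not None else np.random.default_rng()
    label_set = np.sort(np.asarray(label_set, dtype=float))

    # Step 1: noisy histogram of the labels (Laplace noise with scale 2/eps1),
    # clipped to be non-negative and normalized into a distribution.
    counts = np.array([np.sum(np.asarray(ys) == v) for v in label_set], dtype=float)
    noisy = np.maximum(counts + rng.laplace(scale=2.0 / eps1, size=len(label_set)), 0.0)
    prior = (noisy / noisy.sum() if noisy.sum() > 0
             else np.full(len(label_set), 1.0 / len(label_set)))

    # Step 2: compute the optimal bins for the approximate prior and apply
    # RR-on-Bins with the remaining budget eps2 to every label.
    bins = optimal_bins(label_set, prior, eps2, loss)
    outputs = [out for (_, _, out) in bins]

    def phi(y):
        for lo, hi, out in bins:
            if lo <= y <= hi:
                return out
        return outputs[-1]  # fallback; should not trigger for y in label_set

    return [rr_on_bins(y, phi, outputs, eps2, rng=rng) for y in ys]
```

For instance, label_randomizer(ys, label_set, eps1=0.1, eps2=0.4, loss=squared_loss) spends a total privacy budget of ε = 0.5 on the labels.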
We remark that the above bound is achieved by setting ε1 = √(|Y|/n); indeed, we need to assume that n is sufficiently large so that ε1 < ε. The high-level idea of the proof is to first use a known utility bound for the Laplace mechanism (Diakonikolas et al., 2015) to show that P and P′ are close. Then, we argue that the optimal RR-on-Bins is robust, in the sense that small changes in the prior (from P to P′) and in the privacy parameter (from ε to ε2) do not affect the resulting population loss too much.

4 EXPERIMENTAL EVALUATION

We evaluate the RR-on-Bins mechanism on three datasets, and compare with the Laplace mechanism (Dwork et al., 2006b), the staircase mechanism (Geng & Viswanath, 2014), and the exponential mechanism (McSherry & Talwar, 2007). Note that the Laplace mechanism and the staircase mechanism both have a discrete and a continuous variant. For real-valued labels (the Criteo Sponsored Search Conversion dataset), we use the continuous variant, and for integer-valued labels (the US Census dataset and the App Ads Conversion Count dataset), we use the discrete variant. All of these algorithms can be implemented in the feature-oblivious label DP setting of Figure 1. Detailed model and training configurations can be found in Appendix E.

Table 1: MSE on the Criteo dataset. The first column block (Mechanism) measures the error introduced by the DP randomization mechanisms on the training labels. The second column block (Generalization) measures the test error of the models trained on the corresponding private labels.

Privacy Budget | Mechanism: Laplace  | Mechanism: RR-on-Bins | Generalization: Laplace | Generalization: RR-on-Bins
0.05           | 60746.98 ± 46.31    | 11334.84 ± 9.07       | 24812.56 ± 139.35       | 11339.71 ± 36.45
0.1            | 59038.06 ± 51.31    | 11325.53 ± 9.25       | 23933.23 ± 172.43       | 11328.04 ± 36.34
0.3            | 52756.01 ± 56.64    | 11210.48 ± 9.06       | 20961.83 ± 149.47       | 11185.20 ± 36.10
0.5            | 47253.12 ± 57.12    | 10977.09 ± 8.85       | 18411.30 ± 111.82       | 10901.33 ± 36.54
0.8            | 40223.13 ± 48.66    | 10435.43 ± 9.77       | 15428.75 ± 91.32        | 10256.37 ± 37.39
1.0            | 36226.54 ± 45.05    | 9976.86 ± 8.21        | 13788.51 ± 75.71        | 9744.08 ± 37.59
1.5            | 28170.93 ± 39.45    | 8636.43 ± 7.04        | 10808.53 ± 52.31        | 8406.88 ± 36.57
2.0            | 22219.20 ± 28.04    | 7260.05 ± 10.55       | 8892.80 ± 32.92         | 7294.93 ± 34.03
3.0            | 14411.77 ± 20.26    | 4600.24 ± 11.15       | 6770.33 ± 22.86         | 5577.50 ± 31.75
4.0            | 9851.53 ± 17.27     | 2631.36 ± 4.41        | 5764.32 ± 28.95         | 4769.61 ± 25.01
6.0            | 5270.57 ± 10.30     | 709.74 ± 6.18         | 4955.21 ± 26.75         | 4371.68 ± 25.31
8.0            | 3239.22 ± 6.54      | 176.47 ± 2.12         | 4668.40 ± 20.34         | 4333.12 ± 31.94
∞              | 0.00 ± 0.00         | 0.00 ± 0.00           | 4322.91 ± 28.31         | 4319.86 ± 29.27

4.1 CRITEO SPONSORED SEARCH CONVERSION

The Criteo Sponsored Search Conversion Log Dataset (Tallis & Yadav, 2018) contains logs obtained from Criteo Predictive Search (CPS). Each data point describes an action performed by a user (a click on a product-related advertisement), with additional information indicating whether a conversion (the product was bought) occurred within a 30-day window and could be attributed to the action. We formulate a (feature-oblivious) label DP problem to predict the revenue (in €) obtained when a conversion takes place (the SalesAmountInEuro attribute). This dataset represents a sample of 90 days of Criteo live traffic data, with a total of 15,995,634 examples. We remove examples where no conversion happened (SalesAmountInEuro is −1), resulting in a dataset of 1,732,721 examples. The conversion value goes up to €62,458.773. We clip the conversion value at €400, which corresponds to the 95th percentile of the value distribution.
In Figure 2, we visualize an example of how RR-on-Bins randomizes these labels, and compare it with the Laplace mechanism, which is a canonical DP mechanism for scalars. In this case (and for ε = 3), RR-on-Bins chooses 3 bins at around 50, 100, and 250 and maps the sensitive labels to those bins via Algorithm 1. The joint distribution of the sensitive labels and the randomized labels maintains an overall concentration along the diagonal. On the other hand, the joint distribution for the Laplace mechanism is generally spread out. Table 1 quantitatively compares the two mechanisms across different privacy budgets. The first block (Mechanism) shows the MSE between the sensitive training labels and the private labels generated by the two mechanisms, respectively. We observe that RR-on-Bins leads to significantly smaller MSE than the Laplace mechanism for the same label DP guarantee. Furthermore, as shown in the second block of Table 1 (Generalization), the reduced noise in the training labels leads to lower test errors. Figure 3 compares with two additional baselines: the exponential mechanism and the staircase mechanism. For both the Mechanism errors and the Generalization errors, RR-on-Bins consistently outperforms the other methods in both the low- and high-ε regimes.

[Figure 3: MSE on the Criteo dataset as a function of the privacy budget (ε from 0.05 to ∞), comparing the Laplace, staircase, and exponential mechanisms with RR-on-Bins: (a) error introduced by the DP randomization mechanisms on the training labels and (b) test error of the models trained on the corresponding private labels.]

4.2 US CENSUS

The 1940 US Census dataset has been publicly available for research since 2012, and has 131,903,909 rows. This data is commonly used in the DP literature (e.g., Wang et al. (2019); Cao et al. (2021); Ghazi et al. (2021b)). In this paper, we set up a label DP problem by learning to predict the duration for which the respondent worked during the previous year (the WKSWORK1 field, measured in number of weeks). Figure 4a shows that RR-on-Bins outperforms the baseline mechanisms across a wide range of privacy budgets.

[Figure 4: Test performance across different privacy budgets. (a) MSE on the US Census dataset, comparing the Laplace, staircase, and exponential mechanisms with RR-on-Bins. (b) Relative error on the App Ads Conversion Count dataset, comparing the Laplace and staircase mechanisms with RR-on-Bins; the relative error is computed with respect to the non-private baseline, i.e., (E − E_baseline)/E_baseline.]

4.3 APP ADS CONVERSION COUNT PREDICTION

Finally, we evaluate our algorithm on an app install ads prediction dataset from a commercial mobile app store. The examples in this dataset are ad clicks, and each label counts post-click events (aka conversions) occurring in the app after the user installs it. For example, if a user installs a ride-share app after clicking the corresponding ad, the label could be the total number of rides that the user purchases in a given time window after the installation. We note that ad prediction tasks/datasets of a similar nature have been previously used for evaluation (Badanidiyuru et al., 2021). Figure 4b shows the performance on this dataset.
Consistent with the results observed on the other datasets, RR-on-Bins outperforms the other mechanisms across all the privacy budget values. In summary, by first estimating a prior distribution (privately), RR-on-Bins significantly reduces label noise compared to other label DP mechanisms, especially in the high-privacy (i.e., intermediate to small ε) regime. This leads to significantly better test performance for the regression networks trained on these less noisy labels. For example, for ε = 0.5, compared to the best baseline methods, the test MSE of RR-on-Bins is 1.5× smaller on the Criteo and US Census datasets, and the relative test error is 5× smaller on the App Ads dataset.

5 RELATED WORK

There has been a large body of work on learning with DP. Various theoretical and practical settings have been studied, including empirical risk minimization (Chaudhuri et al., 2011), optimization (Song et al., 2013), regression analysis (Zhang et al., 2012), and deep learning (Abadi et al., 2016). The vast majority of prior works studied the setting where both the features and the labels are deemed sensitive and hence ought to be protected.

Chaudhuri & Hsu (2011) and Beimel et al. (2016) studied the sample complexity of learning with label DP for classification tasks. Wang & Xu (2019) studied label DP for sparse linear regression in the local DP setting. Ghazi et al. (2021a) provided a procedure for training deep neural classifiers with label DP. For the classification loss, they formulate an LP and derive an explicit solution, which they refer to as RRTop-k. Our work can be viewed as an analog of theirs for regression tasks. Malek Esmaeili et al. (2021) used the PATE framework of Papernot et al. (2017; 2018) to propose a new label DP training method. We note that this, as well as related work on using unsupervised and semi-supervised learning to improve label DP algorithms (Esfandiari et al., 2022; Tang et al., 2022), does not apply to the feature-oblivious label DP setting.

A variant of two-party learning (with a "features party" and a "labels party") was recently studied by Li et al. (2021), who considered the interactive setting where the two parties can engage in a protocol with an arbitrary number of rounds (with the same application to computational advertising described in Section 1 as a motivation). By contrast, in this paper we consider the one-way communication setting (from Figure 1), which is arguably more practical and easier to deploy.

Kairouz et al. (2016) studied optimal local DP algorithms under a given utility function and prior. For a number of utility functions, they show that optimal mechanisms belong to a class of staircase mechanisms. Staircase mechanisms are much more general than randomized response on bins. In that regard, our work shows that a particular subset of staircase mechanisms is optimal for regression tasks. In particular, while we give an efficient dynamic programming algorithm for optimizing over randomized response on bins, we are not aware of any efficient algorithm that can compute an optimal staircase mechanism. (A straightforward algorithm takes 2^{O(k)} time.)

6 CONCLUSIONS AND FUTURE DIRECTIONS

In this work we propose a new label DP mechanism for regression. The resulting training algorithm can be implemented in the feature-oblivious label DP setting. We provide theoretical results shedding light on the operation and guarantees of the algorithm.
We also evaluate it on three datasets, demonstrating that it achieves higher accuracy compared to several non-trivial, natural approaches. Our work raises many questions for future exploration. First, in our evaluation, we set the objective function optimized using the LP to be the same as the loss function used during training; different ways of setting the LP objective in relation to the training loss are worth exploring. Second, our noising mechanism typically introduces a bias; mitigating it, say by adding unbiasedness constraints into the LP, is an interesting direction. Third, it might also be useful to investigate de-noising techniques that can be applied as a post-processing step on top of our label DP mechanism; e.g., the method proposed in Tang et al. (2022) for computer vision tasks, and the ALIBI method (Malek Esmaeili et al., 2021) for classification tasks, which is based on Bayesian inference. We believe that due to its simplicity and practical appeal, the setting of learning with featureoblivious label DP merits further study, from both a theoretical and an empirical standpoint. Finally, we leave the question of obtaining better feature-aware label DP regression algorithms for a future investigation. ACKNOWLEDGMENTS We thank Peter Kairouz and Sewoong Oh for many useful discussions, and Eu-Jin Goh, Sam Ieong, Christina Ilvento, Andrew Tomkins, and the anonymous reviewers for their comments. Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In CCS, pp. 308 318, 2016. Ashwinkumar Badanidiyuru, Andrew Evdokimov, Vinodh Krishnan, Pan Li, Wynn Vonnegut, and Jayden Wang. Handling many conversions per click in modeling delayed feedback. In Ad KDD, 2021. Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. To C, 12(1):1 61, 2016. A Colin Cameron and Pravin K Trivedi. Regression Analysis of Count Data. Cambridge University Press, 2013. Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. Data poisoning attacks to local differential privacy protocols. In USENIX Security, pp. 947 964, 2021. Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In COLT, pp. 155 186, 2011. Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. JMLR, 12(3), 2011. Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In NIPS, pp. 2566 2574, 2015. Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211 407, 2014. Cynthia Dwork, Krishnaram Kenthapadi, Frank Mc Sherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pp. 486 503, 2006a. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pp. 265 284, 2006b. Hossein Esfandiari, Vahab Mirrokni, Umar Syed, and Sergei Vassilvitskii. Label differential privacy via clustering. In AISTATS, pp. 7055 7075, 2022. Quan Geng and Pramod Viswanath. The optimal mechanism in differential privacy. In ISIT, pp. 2371 2375, 2014. Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, and Chiyuan Zhang. Deep learning with label differential privacy. Neur IPS, pp. 27131 27145, 2021a. 
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, and Amer Sinha. Differentially private aggregation in the shuffle model: Almost central accuracy in almost a single message. In ICML, pp. 3692 3701, 2021b. Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. Universally utility-maximizing privacy mechanisms. SICOMP, 41(6):1673 1693, 2012. Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. JMLR, 17:17:1 17:51, 2016. Peter Kairouz, H. Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adri a Gasc on, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Koneˇcn y, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancr ede Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Ozg ur, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tram er, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. Oscar Li, Jiankai Sun, Xin Yang, Weihao Gao, Hongyi Zhang, Junyuan Xie, Virginia Smith, and Chong Wang. Label leakage and protection in two-party split learning. In ICLR, 2021. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017. Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, and Florian Tramer. Antipodes of label differential privacy: PATE and ALIBI. Neur IPS, 34:6934 6945, 2021. Frank Mc Sherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pp. 94 103, 2007. Nicolas Papernot, Mart ın Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semisupervised knowledge transfer for deep learning from private training data. In ICLR, 2017. Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Ulfar Erlingsson. Scalable private learning with PATE. In ICLR, 2018. Carey Radebaugh and Ulfar Erlingsson. Introducing Tensor Flow Privacy: Learning with Differential Privacy for Training Data, March 2019. blog.tensorflow.org. Justin Schuh. Building a more private web: A path towards making third party cookies obsolete, January 2020. https://blog.chromium.org/2020/01/ building-more-private-web-path-towards.html. Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Global SIP, pp. 245 248, 2013. Marcelo Tallis and Pranjul Yadav. Reacting to variations in product demand: An application for conversion rate (CR) prediction in sponsored search. In IEEE Big Data, pp. 1856 1864, 2018. Xinyu Tang, Milad Nasr, Saeed Mahloujifar, Virat Shejwalkar, Liwei Song, Amir Houmansadr, and Prateek Mittal. Machine learning with differentially private labels: Mechanisms and frameworks. Po PETS, 4:332 350, 2022. Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model. In ICML, pp. 6628 6637, 2019. 
Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. Answering multi-dimensional analytical queries under local differential privacy. In SIGMOD, pp. 159 176, 2019. Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer bias. JASA, 60(309):63 69, 1965. John Wilander. Full Third-Party Cookie Blocking and More, March 2020. https://webkit.org/ blog/10218/full-third-party-cookie-blocking-and-more/. Marissa Wood. Today s Firefox Blocks Third-Party Tracking Cookies and Cryptomining by Default, September 2019. https://blog.mozilla.org/en/products/firefox/ todays-firefox-blocks-third-party-tracking-cookies-and-cryptomining-by-default/. Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in Py Torch. ar Xiv:2109.12298, 2021. Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, and Marianne Winslett. Functional mechanism: regression analysis under differential privacy. VLDB, 5(11):1364 1375, 2012. A OPTIMALITY OF RR-on-BinsΦ ε In this section, we prove Theorem 5. We prove this in two stages. First we consider the case where P has finite support and ˆY is constrained to be a subset of a pre-specified finite set O R. Next, we consider the case where ˆY is allowed to be an arbitrary subset of R, and thus, the output distribution of the mechanism on any input y can be an arbitrary probability measure over R. Stage I: Finitely supported P , ˆ Y O, ℓstrictly increasing/decreasing. Theorem 8. Let P be a distribution over R with finite support, O a finite subset of R, and ℓ: R R R 0 be such that For all y R, ℓ(ˆy, y) is strictly decreasing in ˆy when ˆy y and strictly increasing in ˆy when ˆy y. For all ˆy R, ℓ(ˆy, y) is strictly decreasing in y when y ˆy and strictly increasing in y when y ˆy. Then for all ε > 0, there exists Φ : Y O such that L(RR-on-BinsΦ ε ; P) = inf M L(M; P). where inf M is over all ε-DP mechanisms M with inputs in Y and outputs in O. The proof of Theorem 8 is inspired by the proof of Theorem 3.1 of Ghosh et al. (2012). Namely, we describe the problem of minimizing L(M; P) as a linear program (LP). We arrange the variables of the LP into a matrix and associate any solution to the LP a signature matrix, which represents when constraints of the LP are met with equality. Then we make several observations of the signature matrix for any optimal solution, which eventually leads to the proof of the theorem. Proof of Theorem 8. Without loss of generality, we may assume that Y = supp(P) and therefore is finite. When ˆY is restricted to be a subset of O, the optimal mechanism M which minimizes L(M; P) = Ey P,ˆy M(y) ℓ(ˆy, y), is encoded in the solution of the following LP, with |Y| |O| variables My ˆy = Pr[M(y) = ˆy] indexed by (y, ˆy) Y O. Namely, the first two constraints enforce that (My ˆy)y is a valid probability distribution and the third constraint enforces the ε-DP constraint. ˆy O My ˆy ℓ(ˆy, y) subject to y Y, ˆy O : My ˆy 0, ˆy O My ˆy = 1, ˆy O, y, y Y, y = y : My ˆy eε My ˆy. Corresponding to any feasible solution M := (My ˆy)y,ˆy, we associate a |Y| |O| signature matrix SM. First, let pmin ˆy = miny My ˆy and let pmax ˆy = maxy My ˆy. Note that, from the constraints it follows that pmax ˆy eε pmin ˆy . Definition 9 (Signature matrix). 
For any feasible M for the LP in (2), the signature entry SM(y, ˆy) for all y Y and ˆy O is defined as SM(y, ˆy) = 0 if My ˆy = 0 U if My ˆy = pmax ˆy = eε pmin ˆy L if My ˆy = pmin ˆy = e ε pmax ˆy S otherwise. 1.5 2.5 3.5 4.5 1 0 1/2 1/4 1/4 2 0 1/2 1/4 1/4 3 0 1/4 1/4 1/2 4 0 1/3 1/3 1/3 5 0 1/3 1/3 1/3 1.5 2.5 3.5 4.5 Figure 5: Example of a signature matrix for eε = 1/2 for Y = {1, 2, 3, 4, 5} and O = {1.5, 2.5, 3.5, 4.5} We visualize SM as a matrix with rows corresponding to y s and columns corresponding to ˆy s, both ordered in increasing order (see Figure 5). If M is an RR-on-BinsΦ ε mechanism for some Φ : Y O, then it is easy to see that the corresponding signature matrix satisfies some simple properties (given in Claim 10 below). Interestingly, we establish a converse, thereby characterizing the signature of matrices M that correspond to RR-on-BinsΦ ε mechanism for some Φ : Y O. Claim 10. M corresponds to an RR-on-BinsΦ ε mechanism for some non-decreasing Φ : Y ˆY (for ˆY O) if and only if either (1) One column consists entirely of S, while all other columns are entirely 0; let Ψ(y) = ˆy, where ˆy corresponds to the unique S column, or all of the following hold: (2a) Each column in SM is either entirely 0, or entirely consisting of U s and L s, with at least one U and one L. (2b) Each row contains a single U entry, with all other entries being either L or 0; for each y, we denote this unique column index of U by Ψ(y). (2c) For all y < y it holds that Ψ(y) Ψ(y ). In each case, it holds that Φ = Ψ. Proof of Claim 10. Suppose M corresponds to RR-on-BinsΦ ε for some non-decreasing Φ. If Φ is constant, then condition (1) holds, since all columns corresponding to ˆy / range(Φ) in SM are all 0, whereas, the only remaining column consists of all S. If Φ is not constant, then My ˆy = eε1[Φ(y)=ˆy]/(eε + | ˆY| 1), and hence its signature SM is such that SM(y, ˆy) is U if Φ(y) = ˆy, and L if ˆy range(Φ) {Φ(y)}, and 0 otherwise. It is easy to verify that all three conditions hold: (2a) a column corresponding to ˆy is entirely 0 if and only if ˆy / range(Φ) and entirely consisting of U s and L s otherwise with at least one U corresponding to y such that Φ(y) = ˆy and at least one L corresponding to y such that Φ(y) = ˆy (since Φ is non-constant), (2b) Each row corresponding to y has a unique U corresponding to ˆy = Ψ(y) = Φ(y), and (2c) For y < y , it holds that Ψ(y) Ψ(y ) since Φ is non-decreasing. To establish the converse, suppose we have that SM satisfies condition (1). Then we immediately get that M corresponds to RR-on-Bins for the constant map Φ = Ψ. Next, suppose SM satisfies all three conditions (2a) (2c). Immediately, we have that M is given as eε pmin ˆy if SM(y, ˆy) = U pmin ˆy if SM(y, ˆy) = L 0 if SM(y, ˆy) = 0. Let ˆY correspond to the set of non-zero columns of SM. Since each row of M corresponds to a probability distribution, we have for each ˆy ˆY that P ˆy ˆ Y pmin ˆy +(eε 1)pmin ˆy = 1 by considering the row corresponding to any y Ψ 1(ˆy). Thus, we get that pmin ˆy is the same for all values in ˆY, and in particular, pmin ˆy = 1 eε+| ˆ Y| 1, and thus, we get that the corresponding M is uniquely given by eε+| ˆ Y| 1 if SM(y, ˆy) = U 1 eε+| ˆ Y| 1 if SM(y, ˆy) = L 0 if SM(y, ˆy) = 0, which clearly corresponds to RR-on-BinsΦ ε for Φ(y) = Ψ(y). We have that Ψ, and hence Φ, is non-decreasing from condition (2c). This completes the proof of Claim 10. 
Thus, to show optimality of RR-on-BinsΦ ε , it suffices to show that there exists a minimizer M of the LP (2), such that SM satisfies the conditions in Claim 10. It is well known that an optimal solution to any LP can be found at the vertices of the constraint polytope and that for a LP with n variables, vertices must meet n linearly independent constraints. We use this to find a submatrix of SM, which will determine M in its entirety. If M is a feasible point, then each column of SM is either entirely 0, or entirely consisting of S, U, and L. This is necessary since if My ˆy = 0, then 0 My ˆy eεMy ˆy = 0. Let k denote the number of non-zero columns of SM. Lemma 11. Suppose M is a vertex of the LP (2) with k non-zero columns of SM. If k 2, then M has a k k submatrix M (k) with all distinct rows, consisting of only U s and L s. Proof. M is a vertex if and only if there are |Y| |O| many linearly independent constraints that are tight. We count the number of tight constraints just from SM and note some instances of dependence to give a lower bound on the number of linearly independent constraints. 1. |Y| constraints, given by P ˆy My ˆy = 1 are always tight. 2. For a zero column of SM corresponding to ˆy, each My ˆy 0 is tight corresponding to each y Y. These correspond to |Y| (|O| k) constraints. 3. For a non-zero column of SM corresponding to ˆy, let Cˆy(U) denote the number of U entries in column ˆy and similarly, define Cˆy(L) and Cˆy(S) analogously. For each pair y1, y2 such that SM(y1, ˆy) = L and SM(y2, ˆy) = U, we have a tight constraint My1 ˆy eε My2 ˆy. However, these constraints are not linearly independent. In fact, there are only Cˆy(U)+Cˆy(L) 1 = |Y| Cˆy(S) 1 many linearly independent constraints among these. 4. Another instance where the constraints might be dependent is if two rows of SM, corresponding to say y1 and y2, are identical and do not contain any S s. In this instance, the two equations of P ˆy Myi ˆy = 1 and the inequalities given by the 0s, U s, and L s are not independent. The DP inequality conditions imply that all the coordinates are equal between the two rows, which imply a dependence relation between those and the two conditions. Counting these all up, we have a lower bound on the number of linearly independent constraints. This must be at least |Y| |O|. Let (# of duplicate rows not containing S) be the difference between the number of all rows not containing S and the number of distinct rows not containing S. Thus, we get |Y|+|Y| (|O| k)+(|Y| 1) k X ˆy Cˆy(S) (# of duplicate rows not containing S) |Y| |O|. Rearranging, ˆy Cˆy(S) (# of duplicate rows not containing S) k, and because X ˆy Cˆy(S) (# rows containing S), we conclude |Y| (# rows containing S) (# of duplicate rows not containing S) k. Hence, there are at least k rows, which are all distinct and contain only 0 s, U s, and L s. Narrowing our scope to just the k non-zero columns, we get a k k sub-matrix M (k) that contains only U s and L s. This concludes the proof of Lemma 11. So far, we did not use any information about the objective. Next we use the properties of the loss function ℓto show that if M is an optimal solution to the LP, any submatrix provided by Lemma 11 can only be of one form, the matrix with U s along the diagonal and L s everywhere else: Lemma 12. Suppose M is an optimal vertex solution to the LP 2 with k 2 non-zero columns. Then any k k submatrix M (k) given by Lemma 11 has U s along the diagonal and L s otherwise. Proof. Firstly, in each row of M (k), the U s are consecutive. 
Suppose for any y and ˆy1 < ˆy2 < ˆy3, it holds that SM(y, ˆy1) = SM(y, ˆy3) = U, whereas SM(y, ˆy2) is either L or S. This implies that ℓ(ˆy1, y), ℓ(ˆy3, y) < ℓ(ˆy2, y), since otherwise, it is possible to reduce My ˆy1 (resp. M(y ˆy3)) and increase My ˆy2 thereby further reducing the objective, without violating any constraints. But this is a contradiction to the assumption of ℓin Theorem 8, since ℓ( , y) cannot increase from ˆy1 to ˆy2 and then decrease from ˆy2 to ˆy3. Secondly, SM cannot contain the following 2 2 matrix, where y1 < y2 and ˆy1 < ˆy2: ˆy1 ˆy2 y1 L U y2 U L Since M is optimal, we get that ℓ(ˆy1, y1) > ℓ(ˆy2, y1), and from the assumption on ℓit follows that ˆy1 < y1 (since otherwise, y1 ˆy1 < ˆy2 which would imply ℓ(ˆy1, y1) < ℓ(ˆy2, y1)). Similarly, we have that ℓ(ˆy2, y2) > ℓ(ˆy1, y2) and y2 < ˆy2. However, since ˆy1 < y1 < y2 < ˆy2, we have that ℓ(ˆy2, y1) > ℓ(ˆy2, y2) and ℓ(ˆy1, y2) > ℓ(ˆy1, y1), which gives rise to a contradiction: ℓ(ˆy1, y1) > ℓ(ˆy2, y1) > ℓ(ˆy2, y2) > ℓ(ˆy1, y2) > ℓ(ˆy1, y1). In particular, this claim of not containing the above 2 2 matrix applies to M (k). Lastly, every row of M (k) has at least one U. Suppose for contradiction that the row y in SM does not contain a single U. Then it has all L s and 0 s. Any column that contains an L must contain a U in SM. So necessarily there is a row y in SM containing a U. Now, y containing only L s and 0 s implies that P ˆy My ˆy < P ˆy My ˆy, which is a contradiction of the constraints of the LP. Since the rows of this k k sub-matrix M (k) are all distinct, and it does not contain any 2 2 submatrix as above, the only possible signature is where all the diagonal signature entries are U and the rest are L. Let si [k] be the index of the first index U for the ith row and similarly ei [k] be the last index of U. Because the U s are continuous, si and ei determine the signature of the row. If si = sj for i = j, then ei = ej. Else if ei < ej, then P ˆy My ˆy < P ˆy My ˆy where y corresponds to row i and y corresponds to j which is a contradiction. Similarly for ei > ej. The same argument can be made if ei = ej, then si = sj. The rows of M (k) are distinct, so this implies that si = sj, ei = ej for i = j. The observation about the 2 2 matrix shows that if i < j, si sj and ei ej. So the si and ei are both k distinct ordered sequences of [k]. The only such sequence is si = ei = i, implying M (k) is a diagonal matrix of this form. This completes the proof of Lemma 12. We now have the necessary observations to prove Theorem 8. Let M be an optimal vertex solution to the LP 2. That is L(M; P) = inf M L(M; P). If M has one non-zero column, then SM has one column entirely of S and the rest 0. By Claim 10, M corresponds to a RR-on-BinsΦ ε mechanism. If M has k 2 non-zero columns, then by Lemma 12, SM has a k k submatrix M (k) with U along the diagonal and L otherwise. We show this completely determines SM in the form described by Claim 10(2a 2c). Note that SM for any vertex M has at most one S per row. If SM(y, ˆy1) and SM(y, ˆy2) are both equal to S for y1 = y2, then it is possible to write M as a convex combination of two other feasible points given as M + ηM and M ηM for small enough η, where M (y ˆy2) = M (y ˆy1) = 1 and M (y ˆy ) = 0 for all other y , ˆy . Intuitively this corresponds to moving mass My ˆy1 to / from My ˆy2, in a way that does not violate any of the constraints. But M is a vertex so it cannot be the convex combination of two feasible points. 
But moreover, SM for an optimal vertex solution has exactly one U per row and the rest L s and 0 s. Suppose the row corresponding to y has SM(y, ˆy) either L or S and the rest of the entries either L or 0. From the observations above of the submatrix M (k), there exists a row y such that SM(y , ˆy) = U. Then, we have P ˆy My ˆy < P ˆy My ˆy = 1, which contradicts feasibility. Similarly, if a row of SM contains a U and an S (or multiple U s), then we would have P ˆy My ˆy > P This allows us to define Ψ : Y O to be the unique ˆy such that SM(y, ˆy) = U. Note that Ψ is non-decreasing from the observation earlier that SM cannot contain the 2 2 signature above. This completely characterizes SM as the form described by Claim 10(2a 2c), completing the proof of Theorem 8. We use a continuity argument to show that Theorem 8 holds in the case where the loss function ℓ(ˆy, y) is decreasing / increasing instead of strictly decreasing / increasing. Corollary 13. Let P be a distribution over R with finite support, O a finite subset of R, and ℓ: R R R 0 is such that For all y R, ℓ(ˆy, y) is decreasing in ˆy when ˆy y and increasing in ˆy when ˆy y. For all ˆy R, ℓ(ˆy, y) is decreasing in y when y ˆy and increasing in y when y ˆy. Then for all ε > 0, there exists Φ : Y O such that L(RR-on-BinsΦ ε ; P) = inf M L(M; P). where inf M is over all ε-DP mechanisms M with inputs in Y and outputs in O. Proof. Let V correspond to the set of solutions M that are vertices of the LP. Let W V correspond to the set of vertex solutions that are RR-on-Bins. A rephrasing of Theorem 8 is that the minimum loss over mechanisms in V is equal to the minimum loss over mechanisms in W. For η > 0, define ℓη(ˆy, y) = ℓ(ˆy, y)+η |y ˆy| . Note that ℓη satisfies the conditions of Theorem 8. For any fixed mechanism M, the loss is continuous at η = 0: ˆy O My ˆy ℓη(ˆy, y) ˆy O My ˆy ℓ(ˆy, y) The minimum of finitely many continuous functions is continuous. In particular lim η 0 min M V ˆy O My ˆy ℓη(ˆy, y) ˆy O My ˆy ℓ(ˆy, y) and similarly lim η 0 min M W ˆy O My ˆy ℓη(ˆy, y) ˆy O My ˆy ℓ(ˆy, y) but the LHS of both equations are equal by Theorem 8, so ˆy O My ˆy ℓ(ˆy, y) ˆy O My ˆy ℓ(ˆy, y) completing our proof. Stage II: Finitely supported P and ˆ Y R. We now set out to prove Theorem 5. Let ymin and ymax denote the minimum and maximum values in Y, respectively. We will first show that inf ˆ Y R inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P) = inf M L(M; P), (3) where the infimum on the RHS is over all ε-DP mechanisms M. Since RR-on-BinsΦ ε is an ε-DP mechanism, we have inf M L(M; P) inf ˆ Y R inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P). (4) To show the converse, let γ > 0 be any parameter. There must exist an ε-DP mechanism M : Y R such that L(M ; P) inf M L(M; P) + γ/2. (5) Let O [ymin, ymax] be defined as follows: For each y Y, let ay = minˆy [ymin,ymax] ℓ(ˆy, y) and by = maxˆy [ymin,ymax] ℓ(ˆy, y). Let T := 4(by ay)/γ . Let oy,t be the finite set containing the maximal and minimal element of {ˆy | ℓ(ˆy, y) = ay + t T (by ay)} if the set is non-empty. Otherwise, let it be the empty set. Let Oy := ST t=0 oy,t. Finally, let O = S Naturally, O is finite. Finally, let M be the mechanism that first runs M to get ˆy and then outputs the element in O closest to ˆy. By post-processing of DP, M remains ε-DP. Furthermore, it is not hard to see that by the construction of O and by Assumption 4, we have L(M ; P) L(M ; P) + γ/2. (6) Finally, since range(M ) = O is finite, the proof in Stage I implies that inf ˆ Y O inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P) L(M ; P). 
(7) Combining (5), (6), and (7), we can conclude that inf ˆ Y R inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P) inf M L(M; P) + γ. (8) Since (8) holds for any γ > 0, combining with (4), we can conclude that (3) holds. Next, we will show that there exists ˆY and Φ : Y ˆY such that L(RR-on-BinsΦ ε ; P) = inf ˆ Y R inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P). (9) Combining this with (3) completes the proof. For any Φ : Y R, let PΦ denote the partition on Y induced by Φ 1. Note that the RHS of (9) can be written as inf ˆ Y R inf Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P) = min P inf Φ:Y ˆ Y ˆ Y R PΦ=P L(RR-on-BinsΦ ε ; P), (10) where the minimum is over all partitions of P. For a fixed partition P = PΦ of Y, L(RR-on-BinsΦ ε ; P) can simply be written as y Y py eε1[y Si] eε + | ˆY| 1 ℓ(ˆyi, y) where ˆyi is the output corresponding to the part Si in the partition. Notice that the function ˆyi P y Y py eε1[y Si] eε+| ˆ Y| 1ℓ(ˆyi, y) is continuous. Furthermore, it is ob- vious that it increases once it becomes further from [ymin, ymax]. Thus, the minimum must be achieved at some point ˆy i [ymin, ymax]. As a result, by defining Φ P such that Φ P(Si) = ˆy i , this also achieves the minimum for infΦ:Y ˆ Y ˆ Y R PΦ=P L(RR-on-BinsΦ ε ; P). Therefore, plugging this back into (10), we can conclude that the minimum of inf ˆ Y R infΦ:Y ˆ Y L(RR-on-BinsΦ ε ; P) must be achieved by some ˆY , Φ . This completes our proof of Theorem 5. A.1 EXTENSION TO ARBITRARY (INFINITELY-SUPPORTED) P AND ˆY R. We show that Theorem 5 can be extended to hold even for infinitely-supported distributions P over a bounded interval, but under an additional assumption that the loss ℓis Lipschitz over the said interval. Assumption 14. For a specified bounded interval [ymin, ymax], loss function ℓ: [ymin, ymax] [ymin, ymax] R 0 is such that both ℓ( , y) and ℓ(ˆy, ) are L-Lipschitz. Theorem 15. For all ymin, ymax R, loss functions ℓ: R R R 0 satisfying Assumptions 4 and 14, and all distributions P over Y [ymin, ymax], there is a finite output set ˆY R and a non-decreasing map Φ : Y ˆY such that L(RR-on-BinsΦ ε ; P) = inf M L(M; P), where the infimum is over all ε-DP mechanisms M. We break down the proof into a series of lemmas. First we show that the infimum is reached by the infimum of RR-on-Bins mechanisms: Lemma 16. Under the assumptions of Theorem 15, it holds that inf ˆ Y R, Φ:Y ˆ Y L(RR-on-BinsΦ ε ; P) = inf M L(M; P). Proof. To prove the lemma, it suffices to show that for any g N, there exists a choice of ˆY R and Φ : Y ˆY such that L(RR-on-BinsΦ ε ; P) inf M L(M; P) + γ where γ = L ymax ymin Consider a discretization of Y given as Y := ymin + iγ L : 0 i g . Let ρ : Y Y be the map that rounds any element of Y to the closes element of Y. Note that |y ρ(y)| γ/(2L) for all y Y. Consider the distribution P over Y given by the rounding process , which samples y P and returns ρ(y). Note that there is a natural coupling between P and P such that |y ˆy| γ/(2L) holds with probability 1. From Theorem 5, we have that there exists a finite ˆY R and non-decreasing map Φ : Y ˆY such that L(RR-on-Bins Φ ε ; P) = inf M L( M; P), (11) where M is an ε-DP mechanism mapping Y to ˆY. We can extend Φ to Φ : Y ˆY given as Φ(y) := Φ(ρ(y)). It is easy to see that since Φ is non-decreasing, Φ is also non-decreasing. From Assumption 14, it follows that L(RR-on-BinsΦ ε ; P) = E y P ℓ(RR-on-BinsΦ ε (y), y) = E y P ℓ(RR-on-Bins Φ ε (ρ(y)), y) E y P ℓ(RR-on-Bins Φ ε (ρ(y)), ρ(y)) + γ/2 (from Assumption 14) = L(RR-on-Bins Φ ε ; P) + γ/2. 
Similarly, for any ε-DP mechanism M mapping Y to Ŷ, we can construct an ε-DP mechanism $\tilde{M}$ mapping $\tilde{Y}$ to Ŷ where $\tilde{M}(\tilde{y})$ is sampled as M(y) for $y \sim P|_{\rho(y) = \tilde{y}}$. Note that sampling $\tilde{y} \sim \tilde{P}$ and then returning $y \sim P|_{\rho(y) = \tilde{y}}$ is equivalent to sampling y ∼ P. Thus, we have
$L(\tilde{M}; \tilde{P}) = \mathbb{E}_{\tilde{y} \sim \tilde{P}} \, \ell(\tilde{M}(\tilde{y}), \tilde{y}) = \mathbb{E}_{y \sim P} \, \ell(M(y), \rho(y)) \le \mathbb{E}_{y \sim P} \, \ell(M(y), y) + \gamma/2 = L(M; P) + \gamma/2,$ (13)
where the inequality again uses Assumption 14. Thus, combining (11), (12), and (13), we get
$L(\text{RR-on-Bins}^{\Phi}_{\varepsilon}; P) \le L(\text{RR-on-Bins}^{\tilde{\Phi}}_{\varepsilon}; \tilde{P}) + \gamma/2 = \inf_{\tilde{M}} L(\tilde{M}; \tilde{P}) + \gamma/2 \le \inf_{M} L(M; P) + \gamma.$

Next we show that the infimum over RR-on-Bins mechanisms is also matched when restricting to finitely many bins. Towards this end, let RR-on-Bins^n_ε denote the set of all RR-on-Bins^Φ_ε mechanisms where Φ: Y → Ŷ with |Ŷ| = n.

Lemma 17. Suppose ℓ(ŷ, y) is integrable over y for any ŷ. Then for all ε > 0, it holds that
$\lim_{n \to \infty} \; \inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P) \ge \inf_{M \in \text{RR-on-Bins}^{1}_{\varepsilon}} L(M; P).$

Remark: Interestingly, Lemma 17 does not require any assumption on the distribution P or on ℓ aside from integrability. For example, P can be an unbounded distribution.

Proof. Let $\alpha := \inf_{\hat{y} \in \mathbb{R}} \int \ell(\hat{y}, y) \, dP(y)$. This is precisely $\inf_{M \in \text{RR-on-Bins}^{1}_{\varepsilon}} L(M; P)$, where ŷ is the output value of the single bin. Note that α does not depend on ε. For any n > 1, consider RR-on-Bins^Φ_ε, where Φ: Y → Ŷ for Ŷ = {ŷ_1, ..., ŷ_n}. Then we can bound the loss from below:
$L(\text{RR-on-Bins}^{\Phi}_{\varepsilon}; P) = \int \Big[ \sum_{i=1}^{n} \frac{1}{e^{\varepsilon} + n - 1} \ell(\hat{y}_i, y) + \frac{e^{\varepsilon} - 1}{e^{\varepsilon} + n - 1} \ell(\Phi(y), y) \Big] dP(y) \ge \sum_{i=1}^{n} \int \frac{1}{e^{\varepsilon} + n - 1} \ell(\hat{y}_i, y) \, dP(y) \ge \sum_{i=1}^{n} \frac{1}{e^{\varepsilon} + n - 1} \, \alpha = \frac{n}{e^{\varepsilon} + n - 1} \, \alpha.$
In particular, $\inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P) \ge \frac{n}{e^{\varepsilon} + n - 1} \alpha$, which implies that $\lim_{n \to \infty} \inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P) \ge \alpha$.

Corollary 18. Suppose ℓ(ŷ, y) is integrable over y for any ŷ. Then for all ε > 0, there exists n ≥ 1 such that $\inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P) = \inf_{n'} \inf_{M \in \text{RR-on-Bins}^{n'}_{\varepsilon}} L(M; P)$.

Proof. For any n, let $\alpha_n := \inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P)$. From Lemma 17, we have that $\lim_{n \to \infty} \alpha_n \ge \alpha_1$. If $\inf_n \alpha_n = \alpha_1$, then the corollary is true for n = 1. Else, if $\inf_n \alpha_n = \alpha^* < \alpha_1$, then there exists $n_0 > 0$ such that for all $n > n_0$ it holds that $\alpha_n > (\alpha_1 + \alpha^*)/2$. Thus, $\inf_n \alpha_n = \min_{n \le n_0} \alpha_n$, which implies that the infimum is realized for some finite n.

Finally, we show that the infimum over RR-on-Bins^n_ε is also achievable, completing the proof of Theorem 15.

Lemma 19. Suppose P is supported within the interval Y = [y_min, y_max] and ℓ: R × R → R≥0 satisfies Assumptions 4 and 14. Then for any n > 0, there is an output set Ŷ ⊆ R with |Ŷ| = n and a non-decreasing map Φ: Y → Ŷ such that $L(\text{RR-on-Bins}^{\Phi}_{\varepsilon}; P) = \inf_{M \in \text{RR-on-Bins}^{n}_{\varepsilon}} L(M; P)$.

Proof. Because of the increasing / decreasing nature of ℓ, we can restrict the outputs of Φ to lie in [y_min, y_max]. Given an output range {ŷ_1, ..., ŷ_n}, the RR-on-Bins^Φ_ε mechanism that minimizes L(RR-on-Bins^Φ_ε; P) satisfies $\Phi(y) = \arg\min_{\hat{y}_i} \ell(\hat{y}_i, y)$. So we can view L(RR-on-Bins^Φ_ε; P) as a function of (ŷ_1, ..., ŷ_n) ∈ [y_min, y_max]^n. We claim that this function is continuous: indeed, by Lipschitz continuity, changing (ŷ_1, ..., ŷ_n) to (ŷ'_1, ..., ŷ'_n) with |ŷ_i − ŷ'_i| < δ for some δ > 0 changes L(RR-on-Bins^Φ_ε; P) by at most Lδ. Therefore L(RR-on-Bins^Φ_ε; P) is a continuous function of (ŷ_1, ..., ŷ_n) over the compact set [y_min, y_max]^n and hence attains its infimum. This infimum defines a Φ that satisfies the lemma.

B ANALYSIS OF DYNAMIC PROGRAMMING ALGORITHM FOR FINDING OPTIMAL RR-on-Bins

Below we give the correctness and running time analysis of Algorithm 2.
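Before the formal analysis, the following is a small, self-contained Python sketch of the objective that Algorithm 2 optimizes; it is our own illustration (function and variable names are ours), not the reference implementation. It evaluates the per-bin cost $\sum_{y} p_y e^{\varepsilon \cdot 1[y \in S]} \ell(\hat{y}, y)$ appearing in (14) below and runs an unoptimized dynamic program over contiguous bins. For simplicity the candidate bin outputs are restricted to the observed label values (exact for the absolute-value loss, an approximation otherwise); Algorithm 2 organizes the inner minimization far more efficiently, as analyzed below.

```python
import numpy as np

def bin_cost(prior, labels, lo, hi, eps, loss, candidates):
    """Cost of one bin covering labels[lo:hi]:
    min over candidate outputs yhat of sum_y p_y * exp(eps * 1[y in bin]) * loss(yhat, y).
    This is the per-bin term of the dynamic-programming objective."""
    in_bin = np.zeros(len(labels), dtype=bool)
    in_bin[lo:hi] = True
    weights = prior * np.where(in_bin, np.exp(eps), 1.0)
    return min(float(np.sum(weights * loss(yhat, labels))) for yhat in candidates)

def optimal_rr_on_bins_loss(prior, labels, eps, loss, candidates=None):
    """Unoptimized dynamic program over contiguous bins (our sketch, not Algorithm 2).

    `labels` must be sorted and `prior` is the label distribution p. Returns the
    optimal expected loss of an RR-on-Bins mechanism and the number of bins d
    (bin boundaries can be recovered with standard back-pointers)."""
    k = len(labels)
    if candidates is None:
        candidates = labels  # simplification; exact for the absolute-value loss
    # cost[r][i]: cost of a single bin covering labels[r:i], for 0 <= r < i <= k.
    cost = [[0.0] * (k + 1) for _ in range(k)]
    for r in range(k):
        for i in range(r + 1, k + 1):
            cost[r][i] = bin_cost(prior, labels, r, i, eps, loss, candidates)
    # A[i][j]: min total bin cost over partitions of the first i labels into j intervals.
    A = np.full((k + 1, k + 1), np.inf)
    A[0][0] = 0.0
    for i in range(1, k + 1):
        for j in range(1, i + 1):
            A[i][j] = min(A[r][j - 1] + cost[r][i] for r in range(j - 1, i))
    # Expected loss of the best d-bin mechanism is A[k][d] / (e^eps + d - 1).
    losses = [A[k][d] / (np.exp(eps) + d - 1) for d in range(1, k + 1)]
    d_best = int(np.argmin(losses)) + 1
    return losses[d_best - 1], d_best

if __name__ == "__main__":
    labels = np.arange(5, dtype=float)
    prior = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
    sq_loss = lambda yhat, y: 0.5 * (yhat - y) ** 2
    print(optimal_rr_on_bins_loss(prior, labels, eps=1.0, loss=sq_loss))
```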
Correctness Proof. We will prove the following statement by strong induction on i:
$A[i][j] = \min_{\substack{\Phi: \{y_1, \ldots, y_i\} \to \mathbb{R} \\ |\mathcal{P}_\Phi| = j}} \; \sum_{S \in \mathcal{P}_\Phi} \sum_{y \in Y} p_y \, e^{\varepsilon \cdot 1[y \in S]} \, \ell(\Phi(S), y),$ (14)
where the minimum is across all Φ: {y_1, ..., y_i} → R such that, for each ŷ ∈ range(Φ), $\Phi^{-1}(\hat{y})$ is an interval, and $|\mathcal{P}_\Phi| = j$ where the j intervals are S_1, ..., S_j (in increasing order). Before we prove the above statement, note that it implies that
$A[n][d] = (d - 1 + e^{\varepsilon}) \cdot \min_{\hat{Y}, \, \Phi: Y \to \hat{Y}, \, |\mathcal{P}_\Phi| = d} L(\text{RR-on-Bins}^{\Phi}_{\varepsilon}; P).$
Thus, the last line of the algorithm ensures that we output the optimal RR-on-Bins as desired.

We will now prove (14) via strong induction on i.

Base Case. For i = 0 (and thus j = 0), the statement is obviously true.

Inductive Step. Now, suppose that the statement is true for all i = 0, ..., t − 1 for some t ∈ N. We will show that it also holds for i = t. To see this, we may rewrite the RHS term (for i = t) as
$\min_{\substack{\Phi: \{y_1, \ldots, y_t\} \to \mathbb{R} \\ |\mathcal{P}_\Phi| = j}} \sum_{S \in \mathcal{P}_\Phi} \sum_{y \in Y} p_y \, e^{\varepsilon \cdot 1[y \in S]} \, \ell(\Phi(S), y) = \min_{0 \le r < t} \Big( A[r][j-1] + \min_{\hat{y}} \sum_{y \in Y} p_y \, e^{\varepsilon \cdot 1[y \in [y_{r+1}, y_t]]} \, \ell(\hat{y}, y) \Big),$
where r indexes the last label before the final interval: by the induction hypothesis, the first term accounts for the first j − 1 intervals over {y_1, ..., y_r}, and the second term is the cost of the last interval {y_{r+1}, ..., y_t}. This is exactly the recurrence computed by Algorithm 2, completing the inductive step.

Running Time. The nontrivial part of the computation is the inner minimization: for each left endpoint r, Algorithm 2 computes the optimal output value $\hat{y}^*_{r,i}$ of the interval $[y_r, y_i]$ for all $i \ge r$, together with the weights $w_{lo}$ and $w_{hi}$ of the (exponentially reweighted) prior mass lying below and above the current candidate. The initial computation of $\hat{y}^*_{r,r}$ and of $w_{lo}, w_{hi}$ takes O(k) time. For i = r + 1, ..., k, initialize $\hat{y}^*_{r,i} = \hat{y}^*_{r,i-1}$ and update $w_{lo}$ or $w_{hi}$ (corresponding to the weight change from $p_{y_i}$ to $e^{\varepsilon} p_{y_i}$ of $y_i$). We then perform updates to reach the correct value of $\hat{y}^*_{r,i}$: if $w_{lo} < w_{hi}$, move to the next larger value in Y; if $w_{lo} \ge p_{\hat{y}^*_{r,i}} e^{\varepsilon \cdot 1[\hat{y}^*_{r,i} \in [y_r, y_i]]} + w_{hi}$, move to the next smaller value in Y; otherwise, stop and keep the current $\hat{y}^*_{r,i}$. Each such update takes O(1) time. Finally, observe that if we ever move $\hat{y}^*_{r,i}$ to a larger value, it must be that $y_i > \hat{y}^*_{r,i}$; after this i, $\hat{y}^*_{r,i}$ will never decrease again. As a result, in total across all i = r + 1, ..., k, the total number of updates is at most 2k. Thus, the total running time for a fixed r is O(k). Summing up across r ∈ [k], we can conclude that the total running time of the entire dynamic programming algorithm is O(k²).

C RR-on-Bins WITH APPROXIMATE PRIOR

Recall from Section 3.2 that, when the prior is unknown, we split the budget into ε1, ε2, use the ε1-DP Laplace mechanism to approximate the prior, and then run RR-on-Bins with privacy parameter ε2. The remainder of this section gives the details and proofs omitted from Section 3.2. Throughout this section, we use k to denote |Y| (the size of the input label set).

C.1 LAPLACE MECHANISM AND ITS GUARANTEES

We start by recalling the Laplace mechanism for estimating a distribution. Recall that the Laplace distribution with scale parameter b, denoted by Lap(b), is the distribution supported on R whose probability density function is $\frac{1}{2b} \exp(-|x|/b)$. The Laplace mechanism is presented in Algorithm 4. It is well known (e.g., Dwork et al. (2006b)) that this mechanism satisfies ε-DP. Its utility guarantee, which we will use in the analysis of LabelRandomizer, is also well known.

Algorithm 4 Laplace Mechanism for Estimating a Probability Distribution, M^Lap_ε.
Parameters: Privacy parameter ε ≥ 0.
Input: Labels y_1, ..., y_n ∈ Y.
Output: A probability distribution P' over Y.
  For each y ∈ Y: h_y ← number of i such that y_i = y.
  For each y ∈ Y: h'_y ← max{h_y + Lap(2/ε), 0}.
  return the distribution P' over Y with $p'_y = h'_y / \sum_{y'' \in Y} h'_{y''}$.

Theorem 20 (e.g., Diakonikolas et al. (2015)). For any distribution P on Y, n ∈ N, and ε > 0, we have
$\mathbb{E}_{y_1, \ldots, y_n \sim P} \big[ \| P - M^{\text{Lap}}_{\varepsilon}(y_1, \ldots, y_n) \|_1 \big] \le O\Big( \sqrt{\tfrac{k}{n}} + \tfrac{k}{\varepsilon n} \Big).$
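As a concrete illustration of Algorithm 4, the following is a short Python sketch; the function name, the NumPy usage, and the uniform fallback for the degenerate all-zero case are our own choices.

```python
import numpy as np

def laplace_prior_estimate(labels, label_set, eps, rng=None):
    """Sketch of Algorithm 4: an eps-DP estimate of the label distribution.

    Changing one example's label changes two counts by one each, so the L1
    sensitivity of the histogram is 2 and the noise scale is 2/eps."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    counts = np.array([np.sum(labels == y) for y in label_set], dtype=float)
    noisy = np.maximum(counts + rng.laplace(scale=2.0 / eps, size=len(label_set)), 0.0)
    total = noisy.sum()
    if total == 0.0:
        # Degenerate case not specified by Algorithm 4; we fall back to uniform.
        return np.full(len(label_set), 1.0 / len(label_set))
    return noisy / total
```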
C.2 PROOF OF THEOREM 6

Theorem 6. LabelRandomizer_{ε1,ε2} is (ε1 + ε2)-DP.

Proof of Theorem 6. The Laplace mechanism is ε1-DP; by the post-processing property of DP, Φ* is also ε1-DP. For fixed Φ*, since RR-on-Bins^{Φ*}_{ε2} is ε2-DP and it is applied only once to each label, the parallel composition theorem ensures that (ŷ_1, ..., ŷ_n) is ε2-DP. Finally, applying the basic composition theorem, we can conclude that the entire algorithm is (ε1 + ε2)-DP.

C.3 PROOF OF THEOREM 7

Theorem 7. Let ℓ: R × R → R≥0 be any loss function satisfying Assumption 4. Furthermore, assume that ℓ(ŷ, y) ≤ B for some parameter B for all y, ŷ ∈ Y. For any distribution P on Y, ε > 0, and any sufficiently large n ∈ N, there is a choice of ε1, ε2 > 0 such that ε1 + ε2 = ε and
$\mathbb{E}_{y_1, \ldots, y_n \sim P; \, P', \Phi^*, \hat{Y}^*} \big[ L(\text{RR-on-Bins}^{\Phi^*}_{\varepsilon_2}; P) \big] \le \inf_{M} L(M; P) + O\big(B \sqrt{k/n}\big),$
where P', Φ*, Ŷ* are as in LabelRandomizer_{ε1,ε2} and the infimum is over all ε-DP mechanisms M.

Proof of Theorem 7. From Theorem 20, we have
$\mathbb{E}_{D \sim P^n; \, P' \sim M^{\text{Lap}}_{\varepsilon_1}(D)} \big[ \| P - P' \|_1 \big] \le O\Big( \sqrt{\tfrac{k}{n}} + \tfrac{k}{\varepsilon_1 n} \Big).$
Recall from Theorem 5 that there exists Φ: Y → Ŷ such that L(RR-on-Bins^Φ_ε; P) = inf_M L(M; P). Consider instead the RR-on-Bins^Φ_{ε2} mechanism. We have
$\big| \inf_{M} L(M; P) - L(\text{RR-on-Bins}^{\Phi}_{\varepsilon_2}; P) \big| = \big| L(\text{RR-on-Bins}^{\Phi}_{\varepsilon}; P) - L(\text{RR-on-Bins}^{\Phi}_{\varepsilon_2}; P) \big| \le \sum_{y \in Y} p_y \, B \Big( \Big| \tfrac{e^{\varepsilon}}{e^{\varepsilon} + |\hat{Y}| - 1} - \tfrac{e^{\varepsilon_2}}{e^{\varepsilon_2} + |\hat{Y}| - 1} \Big| + \sum_{\hat{y} \in \hat{Y} \setminus \{\Phi(y)\}} \Big| \tfrac{1}{e^{\varepsilon} + |\hat{Y}| - 1} - \tfrac{1}{e^{\varepsilon_2} + |\hat{Y}| - 1} \Big| \Big) = B \cdot \tfrac{2(|\hat{Y}| - 1)(e^{\varepsilon} - e^{\varepsilon_2})}{(e^{\varepsilon} + |\hat{Y}| - 1)(e^{\varepsilon_2} + |\hat{Y}| - 1)} \le 2B \big(1 - e^{\varepsilon_2 - \varepsilon}\big) \le 2B(\varepsilon - \varepsilon_2) = 2B\varepsilon_1.$
Recall that RR-on-Bins^{Φ*}_{ε2} is an ε2-DP mechanism that is optimal for the prior P'. That is, we have
$L(\text{RR-on-Bins}^{\Phi^*}_{\varepsilon_2}; P') \le L(\text{RR-on-Bins}^{\Phi}_{\varepsilon_2}; P').$
Finally, we also have
$\big| L(\text{RR-on-Bins}^{\Phi}_{\varepsilon_2}; P) - L(\text{RR-on-Bins}^{\Phi}_{\varepsilon_2}; P') \big| \le B \| P - P' \|_1$ and
$\big| L(\text{RR-on-Bins}^{\Phi^*}_{\varepsilon_2}; P) - L(\text{RR-on-Bins}^{\Phi^*}_{\varepsilon_2}; P') \big| \le B \| P - P' \|_1.$
Combining the above five inequalities, we arrive at
$\mathbb{E}_{y_1, \ldots, y_n \sim P; \, P', \Phi^*, \hat{Y}^*} \big[ L(\text{RR-on-Bins}^{\Phi^*}_{\varepsilon_2}; P) \big] \le \inf_{M} L(M; P) + 2B\varepsilon_1 + O\Big( B \Big( \sqrt{\tfrac{k}{n}} + \tfrac{k}{\varepsilon_1 n} \Big) \Big).$
Setting $\varepsilon_1 = \sqrt{k/n}$ then yields the desired bound (note that this requires n > k/ε² for the setting of ε2 = ε − ε1 to be valid).

D DP MECHANISMS DEFINITIONS

In this section, we recall the definitions of the various DP notions used throughout the paper.

Definition 21 (Global Sensitivity). Let f be a function taking as input a dataset and returning as output a vector in R^d. Then the global sensitivity Δ(f) of f is defined as the maximum, over all pairs (X, X') of adjacent datasets, of ‖f(X) − f(X')‖_1.

The (discrete) Laplace distribution with scale parameter b > 0 is denoted by DLap(b). Its probability mass function is given by p(y) ∝ exp(−|y|/b) for any y ∈ Z.

Definition 22 (Discrete Laplace Mechanism). Let f be a function taking as input a dataset X and returning as output a vector in Z^d. The discrete Laplace mechanism applied to f on input X returns f(X) + (Y_1, ..., Y_d), where each Y_i is sampled i.i.d. from DLap(Δ(f)/ε). The output of the mechanism is ε-DP.

Next, recall that the (continuous) Laplace distribution Lap(b) with scale parameter b > 0 has probability density function given by h(y) ∝ exp(−|y|/b) for any y ∈ R.

Definition 23 (Continuous Laplace Mechanism, Dwork et al. (2006b)). Let f be a function taking as input a dataset X and returning as output a vector in R^d. The continuous Laplace mechanism applied to f on input X returns f(X) + (Y_1, ..., Y_d), where each Y_i is sampled i.i.d. from Lap(Δ(f)/ε). The output of the mechanism is ε-DP.
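For illustration, here is a minimal Python sketch of how the (discrete or continuous) Laplace mechanism is applied to labels as a baseline randomizer, assuming, as is standard for label DP over a bounded label set, a sensitivity of y_max − y_min, and with the post-processing clipping discussed in Appendix F. This is our own sketch, not the experimental code.

```python
import numpy as np

def sample_discrete_laplace(scale, size, rng):
    """DLap(scale): pmf proportional to exp(-|y|/scale) on the integers,
    sampled as the difference of two i.i.d. geometric variables."""
    q = np.exp(-1.0 / scale)
    g1 = rng.geometric(1.0 - q, size=size) - 1  # failures before first success
    g2 = rng.geometric(1.0 - q, size=size) - 1
    return (g1 - g2).astype(float)

def laplace_randomize_labels(labels, eps, y_min, y_max, discrete=False, rng=None):
    """Baseline label randomizer: add Laplace / discrete Laplace noise with
    sensitivity (y_max - y_min), then clip back to [y_min, y_max]."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels, dtype=float)
    scale = (y_max - y_min) / eps
    if discrete:
        noise = sample_discrete_laplace(scale, labels.shape, rng)
    else:
        noise = rng.laplace(scale=scale, size=labels.shape)
    return np.clip(labels + noise, y_min, y_max)
```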
We next define the discrete and continuous versions of the staircase mechanism (Geng & Viswanath, 2014).

Definition 24 (Discrete Staircase Distribution). Fix a sensitivity Δ ∈ N. The discrete staircase distribution is parameterized by an integer 1 ≤ r ≤ Δ and has probability mass function given by
$p_r(i) = \begin{cases} a(r) & \text{for } 0 \le i < r, \\ e^{-\varepsilon} a(r) & \text{for } r \le i < \Delta, \\ e^{-k\varepsilon} p_r(i - k\Delta) & \text{for } k\Delta \le i < (k+1)\Delta \text{ and } k \in \mathbb{N}, \\ p_r(-i) & \text{for } i < 0, \end{cases}$ (15)
where $a(r) := \frac{1 - b}{2r + 2b(\Delta - r) - (1 - b)}$ and $b := e^{-\varepsilon}$.

Let f be a function taking as input a dataset X and returning as output a scalar in Z. The discrete staircase mechanism applied to f on input X returns f(X) + Y, where Y is sampled from the discrete staircase distribution given in (15).

Definition 25 (Continuous Staircase Distribution). The continuous staircase distribution is parameterized by γ ∈ (0, 1) and has probability density function given by
$h_\gamma(x) = \begin{cases} a(\gamma) & \text{for } x \in [0, \gamma\Delta), \\ e^{-\varepsilon} a(\gamma) & \text{for } x \in [\gamma\Delta, \Delta), \\ e^{-k\varepsilon} h_\gamma(x - k\Delta) & \text{for } x \in [k\Delta, (k+1)\Delta) \text{ and } k \in \mathbb{N}, \\ h_\gamma(-x) & \text{for } x < 0, \end{cases}$ (16)
where $a(\gamma) := \frac{1 - e^{-\varepsilon}}{2\Delta\,(\gamma + e^{-\varepsilon}(1 - \gamma))}$.

Let f be a function taking as input a dataset X and returning as output a scalar in R. The continuous staircase mechanism applied to f on input X returns f(X) + Y, where Y is sampled from the continuous staircase distribution given in (16).

Definition 26 (Exponential Mechanism, McSherry & Talwar (2007)). Let q(·, ·) be a scoring function such that q(X, r) is a real number equal to the score to be assigned to output r when the input dataset is X. The exponential mechanism returns a sample from the distribution that puts mass proportional to exp(εq(X, r)) on each possible output r. It is (2εΔ(q))-DP, where Δ(q) is defined as the maximum global sensitivity of q(·, r) over all possible values of r.

Definition 27 (Randomized Response, Warner (1965)). Let ε ≥ 0 and let K be a positive integer. The randomized response mechanism with parameters ε and K (denoted by RR_{ε,K}) takes as input y ∈ {1, ..., K} and returns a random sample ŷ drawn from the following probability distribution:
$\Pr[\hat{y} = v] = \begin{cases} \frac{e^{\varepsilon}}{e^{\varepsilon} + K - 1} & \text{for } v = y, \\ \frac{1}{e^{\varepsilon} + K - 1} & \text{otherwise.} \end{cases}$ (17)
The output of RR_{ε,K} is ε-DP.
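Our main mechanism composes a bin map with the randomized response primitive above. The following is a small Python sketch of randomized response on bins (Algorithm 1) under that reading; the names and the example bin layout are ours and purely illustrative.

```python
import numpy as np

def rr_on_bins(y, bin_map, bin_values, eps, rng=None):
    """Sketch of RR-on-Bins: release the true bin's output value with probability
    e^eps / (e^eps + d - 1), and each other bin value with probability
    1 / (e^eps + d - 1), where d is the number of bins."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(bin_values)
    true_bin = bin_map(y)
    if d == 1:
        return bin_values[0]
    if rng.random() < np.exp(eps) / (np.exp(eps) + d - 1):
        return bin_values[true_bin]
    others = [v for i, v in enumerate(bin_values) if i != true_bin]
    return others[rng.integers(len(others))]

# Example: three bins over integer labels 0..9 with hypothetical output values.
boundaries = [0, 3, 6, 10]   # bins [0, 3), [3, 6), [6, 10)
outputs = [1.2, 4.5, 7.8]    # illustrative bin output values (not optimized here)
phi = lambda label: int(np.searchsorted(boundaries, label, side="right")) - 1
print(rr_on_bins(y=4, bin_map=phi, bin_values=outputs, eps=1.0))
```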
E ADDITIONAL EXPERIMENT DETAILS

We evaluate the proposed RR-on-Bins mechanism on three datasets, and compare it with the Laplace mechanism (Dwork et al., 2006b), the staircase mechanism (Geng & Viswanath, 2014), and the exponential mechanism (McSherry & Talwar, 2007). For real-valued labels (the Criteo Sponsored Search Conversion dataset), we use the continuous Laplace mechanism (Definition 23) and the continuous staircase mechanism (Definition 25); for integer-valued labels (the US Census dataset and the App Ads Conversion Count dataset), we use the discrete Laplace mechanism (Definition 22) and the discrete staircase mechanism (Definition 24).

E.1 CRITEO SPONSORED SEARCH CONVERSION

The Criteo Sponsored Search Conversion dataset is publicly available from https://ailab.criteo.com/criteo-sponsored-search-conversion-log-dataset/. To predict the conversion value (SalesAmountInEuro), we use the following attributes as inputs:
Numerical attributes: Time_delay_for_conversion, nb_clicks_1week, product_price.
Categorical attributes: product_age_group, device_type, audience_id, product_gender, product_brand, product_category_1 through product_category_7, product_country, product_id, product_title, partner_id, user_id.
All categorical features in this dataset have been hashed. We build a vocabulary for each feature by counting all the unique values; all values with fewer than 5 occurrences are mapped to a single out-of-vocabulary item. We randomly partition the dataset into 80%/20% train/test splits. For each evaluation configuration, we report the mean and standard deviation over 10 random runs; in each run, the dataset is also partitioned with a different random seed.

Our deep neural network consists of a feature extraction module and 3 fully connected layers. Specifically, each categorical attribute is mapped to an 8-dimensional feature vector using the prebuilt vocabulary for that attribute. The mapped feature vectors are concatenated together with the numerical attributes to form a feature vector. Then 3 fully connected layers with output dimensions 128, 64, and 1 are used to map the feature vector to the output prediction. The ReLU activation is applied after each fully connected layer, except for the output layer. We train the model by minimizing the mean squared error (MSE) with L2 regularization 10^-4, using the RMSProp optimizer. We use learning rate 0.001 with cosine decay (Loshchilov & Hutter, 2017), batch size 8192, and train for 50 epochs. These hyperparameters were chosen based on a setup with minor label noise (generated via the Laplace mechanism for ε = 6), and then kept fixed throughout all runs.

For the RR-on-Bins mechanism, we use the recommended ε1 = √(|Y|/n) from Theorem 7 to query a private prior, and with the remaining privacy budget, optimize the mean squared loss via dynamic programming. When running Algorithm 2, it would be quite expensive to use Y as the set of all unique (real-valued) training labels, so we simply discretize the labels by rounding them down to integer values and use Y = {0, 1, ..., 400}. The integer labels are then mapped via Algorithm 1.

E.2 US CENSUS

The 1940 US Census can be downloaded from https://www.archives.gov/research/census/1940. We predict the time that the respondent worked during the previous year (the WKSWORK1 field, measured in number of weeks), based on the following input fields: the gender (SEX), the age (AGE), the marital status (MARST), the number of children ever born to each woman (CHBORN), the school attendance (SCHOOL), the employment status (EMPSTAT), the primary occupation (OCC), and the type of industry in which the person performed an occupation (IND). We use only the 50,582,693 examples with a non-zero WKSWORK1 field, and randomly partition the dataset into 80%/20% train/test splits. For each evaluation configuration, we report the mean and standard deviation over 10 random runs; in each run, the dataset is also partitioned with a different random seed.

Our deep neural network consists of a feature extraction module and 3 fully connected layers. The feature extraction module can be described via the following pseudocode, where the vocabulary size is chosen according to the value range of each field in the 1940 US Census documentation, and the embedding dimension for all categorical features is fixed at 8:

Features = Concat([
    Embedding{vocab_size=2}(SEX - 1),
    AGE / 30.0,
    Embedding{vocab_size=6}(MARST - 1),
    Embedding{vocab_size=100}(CHBORN),
    Embedding{vocab_size=2}(SCHOOL - 1),
    Embedding{vocab_size=4}(EMPSTAT),
    Embedding{vocab_size=1000}(OCC),
    Embedding{vocab_size=1000}(IND),
])

The feature vectors are mapped to the output via 3 fully connected layers with output dimensions 128, 64, and 1. The ReLU activation is applied after each fully connected layer, except for the output layer. We train the model by minimizing the MSE with L2 regularization 10^-4, using the RMSProp optimizer. We use learning rate 0.001 with cosine decay (Loshchilov & Hutter, 2017), batch size 8192, and train for 200 epochs. For the RR-on-Bins mechanism, we use the recommended ε1 = √(|Y|/n) from Theorem 7 to query a private prior, and with the remaining privacy budget, optimize the mean squared loss via dynamic programming.
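For concreteness, the following is a minimal Keras sketch consistent with the model description above (8-dimensional embeddings, fully connected layers of sizes 128, 64, and 1, MSE with L2 regularization 10^-4, RMSProp with a cosine-decayed learning rate of 0.001). It is our own illustrative reconstruction rather than the code used in the experiments; in particular, the decay schedule length is arbitrary and the categorical inputs are assumed to be already offset as in the pseudocode above.

```python
import tensorflow as tf

EMB_DIM = 8
CATEGORICAL_VOCABS = {"SEX": 2, "MARST": 6, "CHBORN": 100, "SCHOOL": 2,
                      "EMPSTAT": 4, "OCC": 1000, "IND": 1000}

def build_census_model(l2=1e-4):
    inputs, features = {}, []
    for name, vocab_size in CATEGORICAL_VOCABS.items():
        inp = tf.keras.Input(shape=(1,), dtype=tf.int32, name=name)
        emb = tf.keras.layers.Embedding(vocab_size, EMB_DIM)(inp)
        features.append(tf.keras.layers.Flatten()(emb))
        inputs[name] = inp
    age = tf.keras.Input(shape=(1,), dtype=tf.float32, name="AGE")
    features.append(tf.keras.layers.Rescaling(1.0 / 30.0)(age))  # AGE / 30.0
    inputs["AGE"] = age
    x = tf.keras.layers.Concatenate()(features)
    reg = tf.keras.regularizers.l2(l2)
    x = tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=reg)(x)
    x = tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    out = tf.keras.layers.Dense(1, kernel_regularizer=reg)(x)
    return tf.keras.Model(inputs=inputs, outputs=out)

model = build_census_model()
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001, decay_steps=100_000)  # decay length is illustrative
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=schedule), loss="mse")
# model.fit(train_features, privatized_labels, batch_size=8192, epochs=200)
```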
E.3 APP ADS CONVERSION COUNT PREDICTION

The app install ads prediction dataset is collected from a commercial mobile app store. The examples in this dataset are ad clicks, and each label counts the post-click events (aka conversions) happening in the app after the user installs it. The neural network models used here are similar to the models used in the other experiments: embedding layers map the categorical input attributes to dense feature vectors, and the concatenated feature vectors are then passed through several fully connected layers to generate the prediction. We use the Adam optimizer (Kingma & Ba, 2015) and the Poisson regression loss (Cameron & Trivedi, 2013) for training. For the RR-on-Bins mechanism, we use the recommended ε1 = √(|Y|/n) from Theorem 7 to query a private prior, and with the remaining privacy budget, optimize the mean squared error via dynamic programming. Following the convention for event-count prediction in ads problems, we train the model with the Poisson log loss. We report the relative Poisson log loss on the test set with respect to the non-private baseline.

F LABEL CLIPPING

The Laplace mechanism and the staircase mechanism compared in this paper can both push the randomized labels out of the original label value range. This issue is especially severe for small ε, as the range of the randomized labels can be orders of magnitude wider than the original label value range. Since the original label value range is known (it is required for deciding the noise scale of each DP mechanism), we can post-process the randomized labels by clipping them to this range. We compare the effects of clipping for both the Laplace mechanism and the staircase mechanism on the Criteo dataset in Figure 6. In both cases, this simple mitigation reduces the error significantly, especially for smaller ε. We therefore always apply clipping to the randomized labels in all the experiments of this paper.

Figure 6 also shows the exponential mechanism (McSherry & Talwar, 2007), which is equivalent to restricting the Laplace noise to values that do not push the randomized labels out of the original value range. In our case, we implement it via rejection sampling on top of the standard Laplace mechanism. This algorithm avoids the boundary artifacts of clipping, but to guarantee the same level of privacy, the noise needs to scale with 2/ε instead of 1/ε as in the vanilla Laplace mechanism. As a result, while it can be helpful in the low-ε regime, in the moderate to high ε regime it is noticeably worse than all other methods, including the vanilla Laplace mechanism, due to the doubled noise scale.

[Figure 6: MSE on the Criteo dataset, with or without post-processing clipping, comparing Laplace (with and without clipping), staircase (with and without clipping), and the exponential mechanism across privacy budgets ε from 0.05 to ∞. (a) Mechanism: the error introduced by each DP randomization mechanism on the training labels. (b) Generalization: the test error of the models trained on the corresponding private labels.]

G COMPARISON WITH RRWithPrior

In this paper, we extended the formulation of Ghazi et al. (2021a) from classification to regression losses. Here we provide a brief comparison between our algorithm (RR-on-Bins) and the RRWithPrior algorithm from Ghazi et al. (2021a) on the US Census dataset. Instead of using multi-stage training, we fit RRWithPrior into our feature-oblivious label DP framework, and use the same privately queried global prior as the input to both RR-on-Bins and RRWithPrior.
For RRWithPrior, we simply treat each of the integer label values in the US Census dataset as an independent category during the randomization stage. Once the private labels are obtained, the training stage is identical for both algorithms. The results are shown in Table 2. Note that these results are for reference purposes only, as this is not an entirely fair comparison: RRWithPrior was designed for classification problems (Ghazi et al., 2021a) and ignores the similarity metric between different labels, whereas RR-on-Bins takes it into account.

Privacy Budget | Mechanism MSE, RRWithPrior | Mechanism MSE, RR-on-Bins | Generalization MSE, RRWithPrior | Generalization MSE, RR-on-Bins
0.5 | 274.11 | 179.88 | 273.97 | 183.28
1.0 | 274.11 | 159.49 | 273.97 | 172.64
2.0 | 274.11 | 107.89 | 273.97 | 152.03
4.0 | 138.00 | 36.43 | 145.89 | 136.59
8.0 | 11.58 | 2.54 | 134.26 | 134.35
∞ | 0.00 | 0.00 | 134.27 | 134.27

Table 2: MSE on the US Census dataset, comparing RR-on-Bins with RRWithPrior. The first block (Mechanism) measures the error introduced by the DP randomization mechanisms on the training labels. The second block (Generalization) measures the test error of the models trained on the corresponding private labels. Note that RRWithPrior was designed for a classification loss (Ghazi et al., 2021a); here we ignore the numerical similarity metric and treat each integer label as a separate category.