Published as a conference paper at ICLR 2022

HYPERPARAMETER TUNING WITH RENYI DIFFERENTIAL PRIVACY

Nicolas Papernot, Google Research, Brain Team, papernot@google.com
Thomas Steinke, Google Research, Brain Team, hyper@thomas-steinke.net
(Alphabetical author order.)

ABSTRACT

For many differentially private algorithms, such as the prominent noisy stochastic gradient descent (DP-SGD), the analysis needed to bound the privacy leakage of a single training run is well understood. However, few studies have reasoned about the privacy leakage resulting from the multiple training runs needed to fine-tune the value of the training algorithm's hyperparameters. In this work, we first illustrate how simply setting hyperparameters based on non-private training runs can leak private information. Motivated by this observation, we then provide privacy guarantees for hyperparameter search procedures within the framework of Rényi Differential Privacy. Our results improve and extend the work of Liu and Talwar (STOC 2019). Our analysis supports our previous observation that tuning hyperparameters does indeed leak private information, but we prove that, under certain assumptions, this leakage is modest, as long as each candidate training run needed to select hyperparameters is itself differentially private.

1 INTRODUCTION

Machine learning (ML) systems memorize training data and regurgitate excerpts from it when probed (Carlini et al., 2020). If the training data includes sensitive personal information, then this presents an unacceptable privacy risk (Shokri et al., 2017). It may however still be useful to apply machine learning to such data, e.g., in the case of healthcare (Kourou et al., 2015; Wiens et al., 2019). This has led to a significant body of research on the development of privacy-preserving machine learning methods.

Differential privacy (DP) (Dwork et al., 2006b;a) provides a robust and quantitative privacy guarantee. It has been widely accepted as the best framework for formally reasoning about the privacy guarantees of a machine learning algorithm. A popular method for ensuring DP is noisy (stochastic) gradient descent (a.k.a. DP-SGD) (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016). DP-SGD differs from standard (stochastic) gradient descent in three ways. First, gradients are computed on a per-example basis rather than directly averaged across a minibatch of training examples. Second, each of these individual gradients is clipped to ensure its ℓ2-norm is bounded. Third, Gaussian noise is added to the gradients as they are averaged and applied to update model parameters. These modifications bound the sensitivity of each update so that the added noise ensures differential privacy. The composition (Dwork et al., 2010) and privacy amplification by subsampling (Balle et al., 2018) properties of differential privacy thus imply that the overall training procedure is differentially private. We can compute tight privacy loss bounds for DP-SGD using techniques like the Moments Accountant (Abadi et al., 2016) or the closely related framework of Rényi DP (Mironov, 2017; Mironov et al., 2019).
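To make these three modifications concrete, here is a minimal NumPy sketch of a single DP-SGD update. It is an illustration rather than the authors' implementation: the function name, the clipping norm, and the noise multiplier are assumptions chosen for the example.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One noisy SGD update on a minibatch (illustrative sketch).

    per_example_grads: shape (batch_size, num_params), one gradient per example.
    """
    batch_size = per_example_grads.shape[0]
    # 1. Clip each per-example gradient so its l2-norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # 2. Sum the clipped gradients; the sum now has l2-sensitivity clip_norm.
    grad_sum = clipped.sum(axis=0)
    # 3. Add Gaussian noise scaled to the sensitivity, then average and step.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=grad_sum.shape)
    return params - lr * (grad_sum + noise) / batch_size
```

Each update is then a Gaussian mechanism, which is what lets the composition and subsampling results cited above account for the full training run.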
Machine learning systems have hyperparameters, such as the learning rate, minibatch size, or choice of a regularizer to prevent overfitting. Details of the model architecture can also be treated as hyperparameters of the optimization problem. Furthermore, learning within the constraints of differential privacy may introduce additional hyperparameters, as illustrated in the DP-SGD optimizer by the ℓ2-norm bound value for clipping, the scale of the Gaussian noise, and the choice of stopping time.

Typically the training procedure is repeated many times with different hyperparameter settings in order to select the best setting, an operation known as hyperparameter tuning. This methodology implies that even if each run of the training procedure is privacy-preserving, we need to take into account the fact that the training procedure is repeated (possibly many times) when reasoning about the privacy of the overall learning procedure.

Can the tuning of hyperparameters reveal private information? This question has received remarkably little attention and, in practice, it is often ignored entirely. We study this question and provide both positive and negative answers.

1.1 OUR CONTRIBUTIONS

We show that, under certain circumstances, the setting of hyperparameters can leak private information. Hyperparameters are a narrow channel for private information to leak through, but they can still reveal information about individuals if we are careless. Specifically, if we tune the hyperparameters in an entirely non-private fashion, then individual outliers can noticeably skew the optimal hyperparameter settings. This is sufficient to reveal the presence or absence of these outliers à la membership inference (Shokri et al., 2017); it shows that we must exercise care when setting hyperparameters.

We provide tools for ensuring that the selection of hyperparameters is differentially private. Specifically, if we repeat the training procedure multiple times (with different hyperparameters) and each repetition of the training procedure is differentially private on its own, then outputting the best repetition is differentially private. Of course, a basic version of such a result follows from the composition properties of differential privacy (that is, the fact that one can sum the privacy loss bounds of multiple differentially private analyses performed on the same data to bound the overall privacy loss from analyzing this data). However, we provide quantitatively sharper bounds. Specifically, our privacy loss bounds are either independent of the number of repetitions or grow logarithmically in the number of repetitions, whereas composition would give linear bounds.

Rather than repeating the training procedure a fixed number of times, our results require repeating the training procedure a random number of times. The privacy guarantees depend on the distribution of the number of runs; we consider several distributions and provide generic results. We discover a tradeoff between the privacy parameters and how heavy-tailed the distribution of the number of repetitions is.

1.2 BACKGROUND AND RELATED WORK

Differential privacy (DP) is a framework to reason about the privacy guarantees of randomized algorithms which analyze data (Dwork et al., 2006b;a). An algorithm is said to be DP if its outputs on any pair of datasets that only differ on one individual's data are indistinguishable. A bound on this indistinguishability serves as the quantification for privacy.
Formally, a randomized algorithm M : Xⁿ → Y is (ε, δ)-DP if for any inputs x, x′ ∈ Xⁿ differing only on the addition, removal, or replacement of one individual's records, and for any subset of outputs S ⊆ Y, we have

$$\mathbb{P}[M(x) \in S] \le e^{\varepsilon}\,\mathbb{P}[M(x') \in S] + \delta. \tag{1}$$

Here, the parameter ε is known as the privacy loss bound: the smaller ε is, the stronger the privacy guarantee provided, because it is hard for an adversary to distinguish the outputs of the algorithm on two adjacent inputs. The parameter δ is essentially the probability that the guarantee fails to hold. One of the key properties of DP is that it composes: running multiple independent DP algorithms is also DP, and composition theorems allow us to bound the privacy parameters of such a sequence of mechanisms in terms of the individual mechanisms' privacy parameters (Dwork & Roth, 2014).

There is a vast literature on differential privacy in machine learning. A popular tool is the DP-SGD optimizer (Abadi et al., 2016). Because the noise added is Gaussian and DP-SGD applies the same (differentially private) training step sequentially, it is easier to reason about its privacy guarantees in the framework of Rényi differential privacy (Mironov, 2017). Rényi differential privacy (RDP) generalizes pure differential privacy (with δ = 0) as follows. An algorithm M is said to be (λ, ε)-RDP with λ ≥ 1 and ε ≥ 0 if, for any adjacent inputs x, x′,

$$D_\lambda\left(M(x)\,\middle\|\,M(x')\right) := \frac{1}{\lambda-1}\log\left(\mathop{\mathbb{E}}_{Y \sim M(x)}\left[\left(\frac{\mathbb{P}[M(x) = Y]}{\mathbb{P}[M(x') = Y]}\right)^{\lambda-1}\right]\right) \le \varepsilon,$$

where D_λ(P‖Q) is the Rényi divergence of order λ between distributions P and Q. In the framework of RDP, one obtains sharp and simple composition: if each individual mechanism Mᵢ is (λ, εᵢ)-RDP, then the composition of running all of the mechanisms on the data satisfies (λ, Σᵢ εᵢ)-RDP. For instance, the privacy analysis of DP-SGD first analyzes the individual training steps and then applies composition. Note that it is common to keep track of multiple orders λ in the analysis. Thus ε should be thought of as a function ε(λ), rather than a single number. In many cases, such as Gaussian noise addition, this is a linear function, i.e., ε(λ) = ρ·λ for some ρ ∈ ℝ, and such a linear bound yields the definition of zero-concentrated DP with parameter ρ (ρ-zCDP) (Bun & Steinke, 2016).
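Because RDP composes by simple addition at each order, the accounting just described reduces to a few lines of code. The sketch below is our own illustration (the step count and ρ are made-up values), not a library API.

```python
# Track epsilon(lambda) at several Renyi orders; composition adds curves pointwise.
ORDERS = [1.5, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]

def compose_rdp(per_step_epsilons, num_steps):
    """Compose num_steps identical (lambda, eps)-RDP mechanisms."""
    return {lam: num_steps * eps for lam, eps in per_step_epsilons.items()}

rho = 1e-4  # illustrative per-step zCDP parameter: eps(lambda) = rho * lambda
per_step = {lam: rho * lam for lam in ORDERS}
total = compose_rdp(per_step, num_steps=10_000)  # still linear: (num_steps * rho) * lambda
```

The linear case makes the zCDP view explicit: composing 10,000 steps of ρ-zCDP yields (10,000·ρ)-zCDP.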
One could naïvely extend this composition-based approach to analyze the privacy of a training algorithm which involves hyperparameter tuning. Indeed, if each training run performed to evaluate one candidate set of hyperparameter values is DP, the overall procedure is also DP by composition over all the hyperparameter values tried. However, this would lead to very loose privacy guarantees. Chaudhuri & Vinterbo (2013) were the first to obtain improved DP bounds for hyperparameter tuning, but their results require a stability property of the learning algorithm. The only prior work that has attempted to obtain tighter guarantees for DP hyperparameter tuning in a black-box fashion is the work of Liu & Talwar (2019). Their work is the starting point for ours. Liu & Talwar (2019) show that, if we start with an (ε, 0)-DP algorithm, repeatedly run it a random number of times following a geometric distribution, and finally return the best output produced by these runs, then this system satisfies (3ε, 0)-differential privacy.¹ Liu & Talwar (2019) also consider algorithms satisfying (ε, δ)-DP for δ > 0. However, their analysis is restricted to the (ε, δ) formulation of DP and they do not give any results for Rényi DP. This makes it difficult to apply these results to modern DP machine learning systems, such as models trained with DP-SGD.

¹ Liu & Talwar (2019) prove several other results with slightly different formulations of the problem, but this result is representative and is most relevant to our discussion here.

Our results directly improve on the results of Liu & Talwar (2019). We show that replacing the geometric distribution on the number of repetitions in their result with the logarithmic distribution yields (2ε, 0)-differential privacy as the final result. We also consider other distributions on the number of repetitions, which give a spectrum of results. We simultaneously extend these results to the Rényi DP framework, which yields sharper privacy analyses. Independently, Mohapatra et al. (2021) study adaptive hyperparameter tuning under DP with composition. In contrast, our results are for non-adaptive hyperparameter tuning, i.e., random search.

A closely related line of work is on the problem of private selection. Well-known algorithms for private selection include the exponential mechanism (McSherry & Talwar, 2007) and the sparse vector technique (Dwork et al., 2009; Zhu & Wang, 2020). However, this line of work assumes that there is a low-sensitivity function determining the quality of each of the options. This is usually not the case for hyperparameters. Our results simply treat the ML algorithm as a black box; we only assume that its output is private and make no assumptions about how that output was generated. Our results also permit returning the entire trained model along with the selected hyperparameters.

2 MOTIVATION

A hyperparameter typically takes categorical values (e.g., the choice of activation function in a neural network layer), or is a single number (e.g., a real number for the learning rate or an integer for the number of epochs). Thus, it is intuitive that a hyperparameter provides little capacity as a channel to leak private information from the training data. Nevertheless, leakage can happen, in particular when training is done without preserving privacy. We illustrate how with a constructed example of hyperparameter tuning for a support vector machine (SVM) learned from a synthetic data distribution. We consider an SVM with a soft margin; we use stochastic gradient descent to minimize the corresponding objective involving the hinge loss and a weight penalty:

$$\ell_{w,b}(x, y) = \|w\|_2^2 + \alpha \max\{0,\ 1 - y(w \cdot x + b)\},$$

where y ∈ {−1, 1} indicates the label of training example x ∈ ℝ². Because our purpose here is to illustrate how leakage of private information can arise from hyperparameter tuning, we work with a synthetic data distribution for simplicity of exposition: we draw 40 inputs from isotropic 2D Gaussians of standard deviation 1.0 to form the training set D. The negative class is sampled from a Gaussian centered at μ₋₁ = (7.86, 3.36) and the positive class from a Gaussian centered at μ₁ = (6.42, 9.17).

Our learning procedure has a single hyperparameter α controlling how much importance is given to the hinge loss, i.e., how much the SVM is penalized for using slack variables to misclassify some of the training data. We first tune the value of α with the training set D and report training accuracy as a function of α in Figure 1. Next, we repeat this experiment on a dataset D′ to which we added 8 outliers x_o = (7.9, 8.0) to the negative class. The resulting hyperparameter tuning curve is added to Figure 1.
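A minimal sketch of this toy experiment follows. The class means, sample count, and outlier coordinates come from the text above; the 20/20 class split, the SGD schedule, and the α grid are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(with_outliers=False):
    neg = rng.normal((7.86, 3.36), 1.0, size=(20, 2))  # negative class around mu_{-1}
    pos = rng.normal((6.42, 9.17), 1.0, size=(20, 2))  # positive class around mu_{1}
    X = np.vstack([neg, pos])
    y = np.array([-1] * 20 + [1] * 20)
    if with_outliers:  # 8 copies of x_o = (7.9, 8.0) added to the negative class
        X = np.vstack([X, np.tile((7.9, 8.0), (8, 1))])
        y = np.concatenate([y, -np.ones(8, dtype=int)])
    return X, y

def train_svm(X, y, alpha, steps=5000, lr=0.005):
    """SGD on the soft-margin objective l_{w,b}(x, y) above."""
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        i = rng.integers(len(y))
        if y[i] * (X[i] @ w + b) < 1:      # hinge-loss term is active
            w -= lr * (2 * w - alpha * y[i] * X[i])
            b -= lr * (-alpha * y[i])
        else:
            w -= lr * 2 * w
    return w, b

for alpha in [0.5, 1, 2, 4, 8, 16]:
    X, y = make_data(with_outliers=True)
    w, b = train_svm(X, y, alpha)
    print(alpha, np.mean(np.sign(X @ w + b) == y))  # training accuracy vs alpha
```

Sweeping α with and without `with_outliers` reproduces the qualitative gap between the two tuning curves.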
By comparing both curves, it is clear that the choice of hyperparameter α which maximizes accuracy differs in the two settings: the best performance is achieved around α = 8 with outliers, whereas increasing α is detrimental to performance without outliers.

Figure 1: Accuracy of the model as a function of the regularization weight α, with and without outliers. Note how the model performance exhibits a turning point with outliers, whereas increasing the value of α is detrimental without outliers.

This difference can be exploited to perform a variant of a membership inference attack (Shokri et al., 2017): here, one could infer from the value of the hyperparameter α whether or not the outlier points x_o were part of the training set. While the example is constructed, it shows how we must be careful when tuning hyperparameters: in corner cases such as the one presented here, it is possible for some information contained in the training data to leak through hyperparameter choices. In particular, this implies that the common practice of tuning hyperparameters without differential privacy and then using the selected hyperparameter values to repeat training one last time with differential privacy is not ideal. In Section 3, we show how training with differential privacy during the different runs needed to tune the hyperparameter can bound such leakage effectively, provided one carefully chooses the number of runs over which hyperparameters are tuned.

3 OUR POSITIVE RESULTS

3.1 PROBLEM FORMULATION

We begin by appropriately formalizing the problem of differentially private hyperparameter tuning, following the framework of Liu & Talwar (2019). Suppose we have m randomized base algorithms M₁, M₂, …, Mₘ : Xⁿ → Y. These correspond to m possible settings of the hyperparameters. Ideally we would simply run each of these algorithms once and return the best outcome. For simplicity, we consider a finite set of hyperparameter possibilities; if the hyperparameters of interest are continuous, then we must pick a finite subset to search over (which is in practice sufficient).

Here we make two simplifying assumptions: First, we assume that there is a total order on the range Y, which ensures that "best" is well-defined. In particular, we are implicitly assuming that the algorithm computes a quality score (e.g., accuracy on a test set) for the trained model it produces; this may require allocating some privacy budget to this evaluation.² Second, we are assuming that the output includes both the trained model and the corresponding hyperparameter values (i.e., the output of Mⱼ includes the index j).³ These assumptions can be made without loss of generality.

² If the trained model's quality is evaluated on the training set, then we must increase the privacy loss budget to account for this composition. However, if the model is evaluated on a held-out set, then the privacy budget need not be split; these data points are fresh from a privacy perspective.

³ Note that in the more well-studied problem of selection, usually we would only output the hyperparameter values (i.e., just the index j) and not the corresponding trained model nor its quality score.

3.2 STRAWMAN APPROACH: REPEAT THE BASE ALGORITHM A FIXED NUMBER OF TIMES

The obvious approach to this problem would be to run each algorithm once and to return the best of the m outcomes. From composition, we know that the privacy cost grows at most linearly with m. It turns out that this is in fact tight. There exists an (ε, 0)-DP algorithm such that if we repeatedly run it m times and return the best output, the resulting procedure is not (mε − τ, 0)-DP for any τ > 0. This was observed by Liu & Talwar (2019, Appendix B) and we provide an analysis in Appendix D.1. This negative result also extends to Rényi DP. To avoid this problem, we will run the base algorithms a random number of times. The added uncertainty significantly enhances privacy. However, we must carefully choose this random distribution and analyze it.
3.3 OUR ALGORITHM FOR HYPERPARAMETER TUNING

To obtain good privacy bounds, we must run the base algorithms a random number of times. We remark that random search rather than grid search is often performed in practice (Bergstra & Bengio, 2012), so this is not a significant change in methodology. Specifically, we pick a total number of runs K from some distribution. Then, for each run k = 1, 2, …, K, we pick an index jₖ ∈ [m] uniformly at random and run M_{jₖ}. Then, at the end, we return the best of the K outcomes.

The privacy guarantee of this overall system then depends on the privacy guarantees of each of the mechanisms Mⱼ as well as the distribution of the number of runs K. Specifically, we assume that there exists a uniform (Rényi) DP bound for all of the mechanisms Mⱼ. Note that DP is convex, where convexity here means that if M₁, M₂, …, Mₘ are each individually DP, then running M_{jₖ} for a random jₖ ∈ [m] is also DP with the same parameters. To simplify notation, we also assume that there is a single mechanism Q : Xⁿ → Y that picks a random index j ∈ [m] and then runs Mⱼ. In essence, our goal is to boost the success probability of Q by repeating it many times.

The distribution of the number of runs K must be chosen both to ensure good privacy guarantees and to ensure that the system is likely to pick a good setting of hyperparameters. Also, the overall runtime of the system depends on K, and we want the runtime to be efficient and predictable. Our results consider several distributions on the number of repetitions K and the ensuing tradeoff between these three considerations.

3.4 MAIN RESULTS

There are many possible distributions for the number of repetitions K. In this section, we first consider two, the truncated negative binomial and Poisson distributions, and state our main privacy results for these distributions. Later, in Section 3.5, we state a more general technical lemma which applies to any distribution on the number of repetitions K.

Definition 1 (Truncated Negative Binomial Distribution). Let γ ∈ (0, 1) and η ∈ (−1, ∞). Define a distribution D_{η,γ} on ℕ = {1, 2, …} as follows. If η ≠ 0 and K is drawn from D_{η,γ}, then

$$\forall k \in \mathbb{N}, \quad \mathbb{P}[K = k] = \frac{(1-\gamma)^k}{\gamma^{-\eta} - 1} \cdot \prod_{\ell=0}^{k-1} \frac{\ell + \eta}{\ell + 1} \quad \text{and} \quad \mathbb{E}[K] = \frac{\eta\,(1-\gamma)}{\gamma\,(1-\gamma^{\eta})}.$$

If K is drawn from D_{0,γ}, then

$$\mathbb{P}[K = k] = \frac{(1-\gamma)^k}{k \cdot \log(1/\gamma)} \tag{4}$$

and E[K] = (1/γ − 1)/log(1/γ). This is called the negative binomial distribution, since $\prod_{\ell=0}^{k-1} \frac{\ell+\eta}{\ell+1} = \binom{k+\eta-1}{k}$ if we extend the definition of binomial coefficients to non-integer η. The distribution is called truncated because P[K = 0] = 0, whereas the standard negative binomial distribution includes 0 in its support. The η = 0 case D_{0,γ} is known as the logarithmic distribution. The η = 1 case D_{1,γ} is simply the geometric distribution.
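For concreteness, the following sketch samples K from D_{η,γ} by inverse-CDF sampling using the pmf ratio implied by Definition 1, and wraps it into the tuning algorithm of Section 3.3. The function names and the (score, model) return convention are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_truncated_neg_binomial(eta, gamma):
    """Draw K ~ D_{eta,gamma} (Definition 1) by inverse-CDF sampling."""
    u = rng.random()
    if eta == 0:  # logarithmic distribution
        p = (1 - gamma) / np.log(1 / gamma)            # P[K = 1]
    else:
        p = eta * (1 - gamma) / (gamma ** (-eta) - 1)  # P[K = 1]
    k, cdf = 1, p
    while cdf < u:
        # pmf ratio: P[K = k+1] / P[K = k] = (1 - gamma)(k + eta)/(k + 1)
        p *= (1 - gamma) * (k + eta) / (k + 1)
        k += 1
        cdf += p
    return k

def tune(base_algorithms, data, eta, gamma):
    """Algorithm A: run a uniformly random base algorithm K times; keep the best.

    Each base algorithm is assumed to be DP and to return a (score, model) pair.
    """
    K = sample_truncated_neg_binomial(eta, gamma)
    runs = []
    for _ in range(K):
        j = rng.integers(len(base_algorithms))  # random hyperparameter setting
        score, model = base_algorithms[j](data)
        runs.append((score, j, model))
    return max(runs, key=lambda r: r[0])        # best score, its index, its model
```

Setting η = 1 recovers a geometric number of runs, and η = 0 gives the logarithmic distribution used in several of our comparisons.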
Next, we state our main privacy result for this distribution.

Theorem 2 (Main Privacy Result - Truncated Negative Binomial). Let Q : Xⁿ → Y be a randomized algorithm satisfying (λ, ε)-RDP and (λ̂, ε̂)-RDP for some ε, ε̂ ≥ 0, λ ∈ (1, ∞), and λ̂ ∈ [1, ∞).⁴ Assume Y is totally ordered. Let η ∈ (−1, ∞) and γ ∈ (0, 1). Define an algorithm A : Xⁿ → Y as follows. Draw K from the truncated negative binomial distribution D_{η,γ} (Definition 1). Run Q(x) repeatedly K times. Then A(x) returns the best value from the K runs. Then A satisfies (λ, ε′)-RDP where

$$\varepsilon' = \varepsilon + (1+\eta)\left(1 - \frac{1}{\hat\lambda}\right)\hat\varepsilon + \frac{(1+\eta)\log(1/\gamma)}{\hat\lambda} + \frac{\log \mathbb{E}[K]}{\lambda - 1}.$$

⁴ If λ̂ = 1, then ε̂ corresponds to the KL divergence; see Definition 9. However, ε̂ is multiplied by 1 − 1/λ̂ = 0 in this case, so it is irrelevant.

Figure 2: Rényi DP guarantees from Corollary 4 for various expected numbers of repetitions of the logarithmic distribution (i.e., truncated negative binomial with η = 0), compared with the base algorithm (0.1-zCDP) and naïve composition.

Figure 3: Rényi DP guarantees for repetition using different distributions or naïve composition with mean 10, with a 0.1-zCDP base algorithm.

Figure 4: Privacy versus expected number of repetitions using different distributions or naïve composition. Rényi DP guarantees are converted to approximate DP, i.e., we plot ε such that we attain (ε, 10⁻⁶)-DP. The base algorithm is 0.1-zCDP.

Theorem 2 shows a tradeoff between privacy and utility for the distribution D_{η,γ} of the number of repetitions. Privacy improves as η decreases and γ increases. However, this corresponds to fewer repetitions and thus a lower chance of success. We will study this aspect in Section 3.6.

Theorem 2 assumes two RDP bounds, which makes it slightly hard to interpret. Thus we consider two illustrative special cases. We start with pure DP (a.k.a. pointwise DP), i.e., (ε, δ)-DP with δ = 0, which is equivalent to (∞, ε)-RDP. This corresponds to Theorem 2 with λ → ∞ and λ̂ → ∞.

Corollary 3 (Theorem 2 for pure DP). Let Q : Xⁿ → Y be a randomized algorithm satisfying (ε, 0)-DP. Let η ∈ (−1, ∞) and γ ∈ (0, 1). Define A : Xⁿ → Y as in Theorem 2. Then A satisfies ((2 + η)ε, 0)-DP.

Our result is a generalization of the result of Liu & Talwar (2019): they show that, if K follows a geometric distribution and Q satisfies (ε, 0)-DP, then A satisfies (3ε, 0)-DP. Setting η = 1 in Corollary 3 recovers their result. If we set η < 1, then we obtain an improved privacy bound.

Another example is if Q satisfies concentrated DP (Dwork & Rothblum, 2016; Bun & Steinke, 2016). This is the type of guarantee that is obtained by adding Gaussian noise to a bounded-sensitivity function. In particular, this is the type of guarantee we would obtain from noisy gradient descent.⁵

⁵ Note that the privacy of noisy stochastic gradient descent (DP-SGD) is not well characterized by concentrated DP (Bun et al., 2018), but, for our purposes, this is a close enough approximation.

Corollary 4 (Theorem 2 for Concentrated DP). Let Q : Xⁿ → Y be a randomized algorithm satisfying ρ-zCDP, i.e., (λ, ρλ)-Rényi DP for all λ > 1. Let η ∈ (−1, ∞) and γ ∈ (0, 1). Define A : Xⁿ → Y and K ∼ D_{η,γ} as in Theorem 2. Assume ρ ≤ log(1/γ). Then A satisfies (λ, ε′)-Rényi DP for all λ > 1 with

$$\varepsilon' \le \begin{cases} 2\sqrt{\rho \log(\mathbb{E}[K])} + 2(1+\eta)\sqrt{\rho\log(1/\gamma)} - \eta\rho & \text{if } \lambda \le 1 + \sqrt{\tfrac{1}{\rho}\log(\mathbb{E}[K])}, \\[4pt] \rho(\lambda - 1) + \tfrac{1}{\lambda-1}\log(\mathbb{E}[K]) + 2(1+\eta)\sqrt{\rho\log(1/\gamma)} - \eta\rho & \text{if } \lambda > 1 + \sqrt{\tfrac{1}{\rho}\log(\mathbb{E}[K])}. \end{cases}$$

Figure 2 shows what the guarantee of Corollary 4 looks like.
Here we start with 0.1-zCDP and perform repetition following the logarithmic distribution (η = 0) with varying scales (given by γ), and plot the Rényi DP guarantee attained by outputting the best of the repeated runs. The improvement over naïve composition, which instead grows linearly, is clear. We also study other distributions on the number of repetitions, obtained by varying η, and Figure 3 gives a comparison. Figure 4 shows what these bounds look like if we convert to approximate (ε, δ)-DP with δ = 10⁻⁶.

Remark 5. Corollary 4 uses the monotonicity property of Rényi divergences: if λ₁ ≤ λ₂, then D_{λ₁}(P‖Q) ≤ D_{λ₂}(P‖Q) (Van Erven & Harremos, 2014, Theorem 3). Thus (λ₂, ε)-RDP implies (λ₁, ε)-RDP for any λ₁ ≤ λ₂. In particular, the bound of Theorem 2 yields ε′ → ∞ as λ → 1, so we use monotonicity to bound ε′ for small λ.

Poisson Distribution. We next consider the Poisson distribution, which offers a different privacy-utility tradeoff than the truncated negative binomial distribution. The Poisson distribution with mean μ ≥ 0 is given by P[K = k] = e^{−μ}μᵏ/k! for all k ≥ 0. Note that P[K = 0] = e^{−μ} > 0 here, whereas the truncated negative binomial distribution does not include 0 in its support. We could condition on K ≥ 1 here too, but we prefer to stick with the standard definition. We remark that (modulo the issue around P[K = 0]) the Poisson distribution is closely related to the truncated negative binomial distribution. If we take the limit as η → ∞ while the mean remains fixed, then the negative binomial distribution becomes a Poisson distribution. Conversely, the negative binomial distribution can be represented as a convex combination of Poisson distributions, or as a compound of Poisson and logarithmic distributions; see Appendix A.2 for more details.

Theorem 6 (Main Privacy Result - Poisson Distribution). Let Q : Xⁿ → Y be a randomized algorithm satisfying (λ, ε)-RDP and (ε̂, δ̂)-DP for some λ ∈ (1, ∞) and ε, ε̂, δ̂ ≥ 0. Assume Y is totally ordered. Let μ > 0. Define an algorithm A : Xⁿ → Y as follows. Draw K from a Poisson distribution with mean μ, i.e., P[K = k] = e^{−μ}μᵏ/k! for all k ≥ 0. Run Q(x) repeatedly K times. Then A(x) returns the best value from the K runs. If K = 0, A(x) returns some arbitrary output independent of the input x. If e^{ε̂} ≤ 1 + 1/(λ − 1), then A satisfies (λ, ε′)-RDP where

$$\varepsilon' = \varepsilon + \mu\hat\delta + \frac{\log \mu}{\lambda - 1}.$$

The assumptions of Theorem 6 are different from those of Theorem 2: we assume a Rényi DP guarantee and an approximate DP guarantee on Q, rather than two Rényi DP guarantees. We remark that a Rényi DP guarantee can be converted into an approximate DP guarantee: (λ, ε)-RDP implies (ε̂, δ̂)-DP for all ε̂ ≥ ε and

$$\hat\delta = \frac{e^{-(\lambda-1)(\hat\varepsilon - \varepsilon)}}{\lambda - 1}\left(1 - \frac{1}{\lambda}\right)^{\lambda}$$

(Mironov, 2017; Canonne et al., 2020). Thus this statement can be directly compared to our other result. We show such a comparison in Figure 3 and Figure 4. The proofs of Theorems 2 and 6 are included in Appendix B.2.
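To make these guarantees concrete, the sketch below evaluates the Corollary 4 bound for a 0.1-zCDP base algorithm with a logarithmic number of repetitions and converts the resulting RDP curve to approximate DP via the conversion formula just stated. This is our own reimplementation of the stated formulas (with illustrative parameter values), not the authors' plotting code.

```python
import numpy as np

RHO, ETA, GAMMA = 0.1, 0.0, 0.01                  # 0.1-zCDP base, logarithmic K
EXPECTED_K = (1 / GAMMA - 1) / np.log(1 / GAMMA)  # E[K] for the eta = 0 case

def corollary4_rdp(lam):
    """Upper bound on epsilon'(lambda) for the tuned algorithm A (Corollary 4)."""
    slack = 2 * (1 + ETA) * np.sqrt(RHO * np.log(1 / GAMMA)) - ETA * RHO
    if lam <= 1 + np.sqrt(np.log(EXPECTED_K) / RHO):  # small orders: monotonicity
        return 2 * np.sqrt(RHO * np.log(EXPECTED_K)) + slack
    return RHO * (lam - 1) + np.log(EXPECTED_K) / (lam - 1) + slack

def rdp_to_approx_dp(rdp_curve, orders, delta=1e-6):
    """Smallest eps such that the RDP curve implies (eps, delta)-DP."""
    return min(
        rdp_curve(lam)
        + (np.log(1 / delta) + np.log((1 - 1 / lam) ** lam / (lam - 1))) / (lam - 1)
        for lam in orders
    )

print(rdp_to_approx_dp(corollary4_rdp, orders=np.linspace(1.01, 64, 1000)))
```

This is the computation behind Figure 4: fix the base guarantee, sweep the distribution parameters, and report the best (ε, 10⁻⁶)-DP conversion over Rényi orders.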
3.5 GENERIC RÉNYI DP BOUND FOR ANY DISTRIBUTION ON THE NUMBER OF REPETITIONS

We now present our main technical lemma, which applies to any distribution on the number of repetitions K. Theorems 2 and 6 are derived from this result. It gives a Rényi DP bound for the repeated algorithm in terms of the Rényi DP of the base algorithm and the probability generating function of the number of repetitions, applied to probabilities derived from the base algorithm.

Lemma 7 (Generic Bound). Fix λ > 1. Let K be a random variable supported on ℕ ∪ {0}. Let f : [0, 1] → ℝ be the probability generating function of K, i.e., f(x) := Σ_{k=0}^{∞} P[K = k]·xᵏ. Let Q and Q′ be distributions on Y. Assume Y is totally ordered. Define a distribution A on Y as follows. First sample K. Then sample from Q independently K times and output the best of these samples.⁶ This output is a sample from A. We define A′ analogously, with Q′ in place of Q. Then

$$D_\lambda(A \| A') \le D_\lambda(Q \| Q') + \frac{1}{\lambda - 1}\log\left(f'(q)^{\lambda}\, f'(q')^{1-\lambda}\right), \tag{6}$$

where applying the same postprocessing to Q and Q′ gives the probabilities q and q′ respectively, i.e., there exists an arbitrary function g : Y → [0, 1] such that q = E_{X∼Q}[g(X)] and q′ = E_{X′∼Q′}[g(X′)].

⁶ If K = 0, the output can be arbitrary, as long as it is the same for both A and A′.

The proof of this generic bound is found in Appendix B.1. To interpret the theorem, we should imagine adjacent inputs x, x′ ∈ Xⁿ; the distributions then correspond to the algorithms run on these inputs: A = A(x), A′ = A(x′), Q = Q(x), and Q′ = Q(x′). The bounds on Rényi divergence thus correspond to Rényi DP bounds. The derivative of the probability generating function f′(x) = E[K·x^{K−1}] is somewhat mysterious. A first-order intuition is that, if q = q′, then f′(q)^λ f′(q′)^{1−λ} = f′(q) ≤ f′(1) = E[K], and thus the last term in the bound (6) is simply log(E[K])/(λ−1). A second-order intuition is that q ≈ q′ by DP and postprocessing and, if f′ is smooth, then f′(q) ≈ f′(q′), and the first-order intuition holds up to these approximations. Vaguely, f′ being smooth corresponds to the distribution of K being spread out (i.e., not a point mass) and not too heavy-tailed (i.e., K is small most of the time). The exact quantification of this smoothness depends on the form of the DP guarantee relating q and q′.

Figure 5: Expected quantile of the repeated algorithm A as a function of the final privacy guarantee (ε, 10⁻⁶)-DP for various distributions of K, where each invocation of the base algorithm Q is 0.1-zCDP.

Figure 6: Final success probability (β) of the repeated algorithm A as a function of the final privacy guarantee (ε, 10⁻⁶)-DP for various distributions, where each invocation of the base algorithm Q has a 1/100 probability of success and is 0.1-zCDP.

Figure 7: Accuracy of the CNN model obtained at the end of the hyperparameter search, for the different distributions on the number of repetitions K we considered. We report the mean over 500 trials of the experiment.

In our work, we primarily compare three distributions on the number of repetitions: a point mass (corresponding to naïve repetition), the truncated negative binomial distribution, and the Poisson distribution. A point mass has a monomial as its probability generating function, i.e., if P[K = k] = 1, then f(x) = E[x^K] = xᵏ.
The probability generating function of the truncated negative binomial distribution (Definition 1) is

$$f(x) = \mathop{\mathbb{E}}_{K \sim D_{\eta,\gamma}}\left[x^K\right] = \begin{cases} \dfrac{(1-(1-\gamma)x)^{-\eta} - 1}{\gamma^{-\eta} - 1} & \text{if } \eta \ne 0, \\[6pt] \dfrac{\log(1-(1-\gamma)x)}{\log(\gamma)} & \text{if } \eta = 0. \end{cases} \tag{7}$$

The probability generating function of the Poisson distribution with mean μ is given by f(x) = e^{μ(x−1)}. We discuss probability generating functions further in Appendix A.2.

3.6 UTILITY AND RUNTIME OF OUR HYPERPARAMETER TUNING ALGORITHM

Our analytical results thus far, Theorems 2 and 6 and Lemma 7, provide privacy guarantees for our hyperparameter tuning algorithm A when it is used with various distributions on the number of repetitions K. We now turn to the utility that this algorithm provides. The utility of a hyperparameter search is determined by how many times the base algorithm (denoted Q in the theorem statements) is run when we invoke the overall algorithm (A). The more often Q is run, the more likely we are to observe a good output, and the more likely A is to return the corresponding hyperparameter values. Note that the number of repetitions K also determines the algorithm's runtime, so these are closely linked.

How does this distribution on the number of repetitions K map to utility? As a first-order approximation, the utility and runtime are proportional to E[K]. Hence several of our figures compare the different distributions on K based on a fixed expectation, and Figure 4 plots E[K] on the vertical axis. However, this first-order approximation ignores the fact that some of the distributions we consider are more concentrated than others; even if the expectation is large, there might still be a significant probability that K is small. Indeed, for η ≤ 1, the mode of the truncated negative binomial distribution is K = 1. We found this to be an obstacle to using the (truncated) negative binomial distribution in practice in our experiments, and we discuss this further in Appendix A.2.1.

We can formulate utility guarantees more precisely by looking at the expected quantile of the output to measure our algorithm's utility. If we run the base algorithm Q once, then the quantile of the output is (by definition) uniform on [0, 1] and has mean 0.5. If we repeat the base algorithm a fixed number of times k, then the quantile of the best output follows a Beta(k, 1) distribution, as it is the maximum of k independent uniform random variables. The expectation in this case is k/(k+1) = 1 − 1/(k+1). If we repeat a random number of times K, the expected quantile of the best result is given by

$$\mathbb{E}\left[1 - \frac{1}{K+1}\right] = \int_0^1 x\, f'(x)\,\mathrm{d}x = 1 - \int_0^1 f(x)\,\mathrm{d}x, \tag{8}$$

where f(x) = E[x^K] is the probability generating function of K; see Appendix A.2.1 for further details. Figure 5 plots this quantity against privacy. We see that the Poisson distribution performs very well in an intermediate range, while the negative binomial distribution with η = 0.5 does well if we want a strong utility guarantee. This means that Poisson is best used when little privacy budget is available for the hyperparameter search. Instead, the negative binomial distribution with η = 0.5 allows us to improve the utility of the solution returned by the hyperparameter search, but this only holds when spending a larger privacy budget (in our example, the budget has to be at least ε = 4; otherwise Poisson is more advantageous). The negative binomial with η = −0.5 does very poorly.

From a runtime perspective, the distribution of K should have light tails. All of the distributions we have considered have subexponential tails. However, larger η corresponds to better concentration in the negative binomial distribution, with the Poisson distribution having the best concentration.
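Equation (8) is straightforward to evaluate numerically from the probability generating functions given above; the sketch below does so at (roughly) matched means, with arbitrary illustrative parameters.

```python
import numpy as np
from scipy.integrate import quad

def pgf_tnb(x, eta, gamma):
    """PGF of the truncated negative binomial D_{eta,gamma} (Equation (7))."""
    if eta == 0:
        return np.log(1 - (1 - gamma) * x) / np.log(gamma)
    return ((1 - (1 - gamma) * x) ** (-eta) - 1) / (gamma ** (-eta) - 1)

def expected_quantile(pgf):
    """Expected quantile of the best of K runs: 1 - integral_0^1 f(x) dx."""
    return 1 - quad(pgf, 0, 1)[0]

# All three choices below have E[K] of roughly 6.3:
print(expected_quantile(lambda x: pgf_tnb(x, eta=0, gamma=0.05)))  # logarithmic
print(expected_quantile(lambda x: pgf_tnb(x, eta=1, gamma=0.16)))  # geometric
print(expected_quantile(lambda x: np.exp(6.3 * (x - 1))))          # Poisson
```

A fixed number of repetitions k corresponds to the PGF x ↦ xᵏ, recovering the 1 − 1/(k+1) expression as a sanity check.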
Experimental Evaluation. To confirm these findings, we apply our algorithm to a real hyperparameter search task. Specifically, we fine-tune the learning rate of a convolutional neural network trained on MNIST. We implement DP-SGD in JAX for an all-convolutional architecture with a stack of 32, 32, 64, 64, 64 feature maps generated by 3×3 kernels. We vary the learning rate between 0.025 and 1 on a logarithmic scale but fix all other hyperparameters: 60 epochs, minibatch size of 256, ℓ2 clipping norm of 1, and noise multiplier of 1.1. In Figure 7, we plot the maximal accuracy achieved during the hyperparameter search for the different distributions considered previously, as a function of the total privacy budget expended by the search. The experiment is repeated 500 times and the mean result reported. This experiment shows that the Poisson distribution achieves the best privacy-utility tradeoff for this relatively simple hyperparameter search. This agrees with the theoretical analysis presented above, which shows that the Poisson distribution performs well in the intermediate range of utility, as this is a simple hyperparameter search.
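The search loop itself is small. A hedged sketch follows, where `train_dp_sgd` stands in for one DP-SGD training run at a given learning rate; its name and signature are our own, not taken from our implementation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_random_search(train_dp_sgd, mu=10.0, lr_min=0.025, lr_max=1.0):
    """Best-of-K random search with K ~ Poisson(mu), as analyzed in Theorem 6.

    train_dp_sgd(lr) is assumed to run one DP training job and return a
    (validation_accuracy, model) pair. If K = 0 we return None, an arbitrary
    output independent of the data, as the theorem permits.
    """
    K = rng.poisson(mu)
    best = None
    for _ in range(K):
        # Random search: draw the learning rate log-uniformly in [lr_min, lr_max].
        lr = np.exp(rng.uniform(np.log(lr_min), np.log(lr_max)))
        acc, model = train_dp_sgd(lr)
        if best is None or acc > best[0]:
            best = (acc, lr, model)
    return best
```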
4 CONCLUSION

Our positive results build on the work of Liu & Talwar (2019) and show that repeatedly running the base algorithm and only returning the best output can incur much lower privacy cost than naïve composition would suggest. This, however, requires that we randomize the number of repetitions, rather than repeating a fixed number of times. We analyze a variety of distributions for the number of repetitions, each of which gives a different privacy/utility tradeoff.

While our results focused on the privacy implications of tuning hyperparameters with, and without, differential privacy, our findings echo prior observations that tuning details of the model architecture without privacy and then repeating training with DP affords suboptimal utility-privacy tradeoffs (Papernot et al., 2020); in that work, the authors demonstrated that the optimal choice of activation function in a neural network can be different when learning with DP, and that tuning it with DP from the start can improve the model's utility with no change to the privacy guarantee. We envision that future work will be able to build on our algorithm for private tuning of hyperparameters to facilitate privacy-aware searches for model architectures and training algorithm configurations to effectively learn with them.

Limitations. We show that hyperparameter tuning is not free from privacy cost. Our theoretical and experimental results show that, in the setting of interest, the privacy parameter may double or even triple after accounting for hyperparameter tuning, which could be prohibitive. In this case, one compromise would be to state both privacy guarantees: that of the base algorithm, which does not account for hyperparameter tuning, and that of the overall system, which does account for this. The reader may wonder whether our positive results can be improved. In Appendix D, we give some intuition for why they cannot (easily) be improved. We also note that our results are only immediately applicable to the hyperparameter tuning algorithm from Section 3.3. Other algorithms, in particular those that adaptively choose hyperparameter candidates, will require further analysis.

Finally, among the distributions on the number of repetitions K that we have analyzed, the distribution that provides the best privacy-utility tradeoffs will depend on the setting. While it is good to have choices, this does leave some work to be done by those using our results. Fortunately, the differences between the distributions seem to be relatively small, so this choice is unlikely to be critical.

REPRODUCIBILITY & ETHICS STATEMENTS

Reproducibility. We give precise theorem statements for our main results and we have provided complete proofs in the Appendix, as well as all the necessary calculations and formulas for plotting our figures. We have also fully specified the setup required to reproduce our experimental results, including hyperparameters. Our algorithm is simple, fully specified, and can be easily implemented.

Ethics. Our work touches on privacy, which is an ethically sensitive topic. If differentially private algorithms such as ours are applied to real-world sensitive data, then potential harms to the people whose data is being used must be carefully considered. However, our work is not directly using real-world sensitive data. Our main results are theoretical and our experiments use either synthetic data or MNIST, which is a standard non-private dataset.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their detailed feedback and interactive discussion during the review period. We also thank our colleagues Abhradeep Guha Thakurta, Andreas Terzis, Peter Kairouz, and Shuang Song for insightful discussions about differentially private hyperparameter tuning that led to the present project, as well as their comments on early drafts of this document.

REFERENCES

Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318, 2016.

Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647, 2018.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464-473. IEEE, 2014.

Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. SIAM Journal on Computing, (0):STOC16-377, 2021.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012.

Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635-658. Springer, 2016.

Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pp. 1-10, 2014.

Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 74-86, 2018.

Clément L. Canonne, Gautam Kamath, and Thomas Steinke. The discrete Gaussian for differential privacy. arXiv preprint arXiv:2004.00010, 2020.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. arXiv preprint arXiv:2012.07805, 2020.

Kamalika Chaudhuri and Staal A. Vinterbo. A stability-based validation procedure for differentially private machine learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/e6d8545daa42d5ced125a4bf747b3688-Paper.pdf.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486-503. Springer, 2006a.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265-284. Springer, 2006b.

Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 381-390, 2009.

Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51-60. IEEE, 2010.

Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, and Dimitrios I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8-17, 2015.

Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298-309, 2019. URL https://arxiv.org/abs/1811.07971.

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pp. 94-103. IEEE, 2007.

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263-275. IEEE, 2017.

Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled Gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.

Shubhankar Mohapatra, Sajin Sasy, Xi He, Gautam Kamath, and Om Thakkar. The role of adaptive optimizers for honest private hyperparameter selection. arXiv preprint arXiv:2111.04906, 2021.

Nicolas Papernot, Abhradeep Thakurta, Shuang Song, Steve Chien, and Úlfar Erlingsson. Tempered sigmoid activations for deep learning with differential privacy. arXiv preprint arXiv:2007.14191, 2020.

Ryan Rogers and Thomas Steinke. A better privacy analysis of the exponential mechanism. DifferentialPrivacy.org, 07 2021. https://differentialprivacy.org/exponential-mechanism-bounded-range/.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2017.
Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245-248. IEEE, 2013.

Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 552-563. IEEE, 2017.

Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.

Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, Mohammed Saeed, Pilar N. Ossorio, Sonoo Thadaney Israni, and Anna Goldenberg. Do no harm: A roadmap for responsible machine learning for health care. Nature Medicine, 25(9):1337-1340, 2019.

Yuqing Zhu and Yu-Xiang Wang. Improving sparse vector technique with Renyi differential privacy. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020. URL https://papers.nips.cc/paper/2020/hash/e9bf14a419d77534105016f5ec122d62-Abstract.html.

A FURTHER BACKGROUND

A.1 DIFFERENTIAL PRIVACY & RÉNYI DP

For completeness, we provide some basic background on differential privacy and, in particular, Rényi differential privacy. We start with the standard definition of differential privacy:

Definition 8 (Differential Privacy). A randomized algorithm M : Xⁿ → Y is (ε, δ)-differentially private if, for all neighbouring pairs of inputs x, x′ ∈ Xⁿ and all measurable S ⊆ Y,

$$\mathbb{P}[M(x) \in S] \le e^{\varepsilon}\,\mathbb{P}[M(x') \in S] + \delta.$$

When δ = 0, this is referred to as pure (or pointwise) differential privacy, and we may abbreviate (ε, 0)-DP to ε-DP. When δ > 0, this is referred to as approximate differential privacy. The definition of pure DP was introduced by Dwork et al. (2006b) and approximate DP was introduced by Dwork et al. (2006a).

Note that the notion of neighbouring datasets is context-dependent, but this is often glossed over. Our results are general and can be applied regardless of the specifics of what is a neighbouring dataset. (However, we do require symmetry, i.e., if (x, x′) is a pair of neighbouring inputs, then so is (x′, x).) Usually two datasets are said to be neighbouring if they differ only by the addition/removal or replacement of the data corresponding to a single individual. Some papers only consider addition or removal of a person's records, rather than replacement. But these are equivalent up to a factor of two.

In order to define Rényi DP, we first define the Rényi divergences:

Definition 9 (Rényi Divergences). Let P and Q be probability distributions on a common space Ω. Assume that P is absolutely continuous with respect to Q, i.e., for all measurable S ⊆ Ω, if Q(S) = 0, then P(S) = 0. Let P(x) and Q(x) denote the densities of P and Q respectively.⁷ The KL divergence from P to Q is defined as

$$D_1(P\|Q) := \mathop{\mathbb{E}}_{X \sim P}\left[\log \frac{P(X)}{Q(X)}\right].$$

The max divergence from P to Q is defined as

$$D_\infty(P\|Q) := \sup\left\{\log \frac{P(S)}{Q(S)} : S \subseteq \Omega,\ P(S) > 0\right\}.$$

For λ ∈ (1, ∞), the Rényi divergence from P to Q of order λ is defined as

$$D_\lambda(P\|Q) := \frac{1}{\lambda-1}\log\left(\mathop{\mathbb{E}}_{X\sim P}\left[\left(\frac{P(X)}{Q(X)}\right)^{\lambda-1}\right]\right) = \frac{1}{\lambda-1}\log\left(\mathop{\mathbb{E}}_{X\sim Q}\left[\left(\frac{P(X)}{Q(X)}\right)^{\lambda}\right]\right) = \frac{1}{\lambda-1}\log\left(\int_\Omega P(x)^\lambda\, Q(x)^{1-\lambda}\,\mathrm{d}x\right).$$

⁷ In general, we can only define the ratio P(x)/Q(x) to be the Radon-Nikodym derivative of P with respect to Q. To talk about P(x) and Q(x) separately, we must assume some base measure with respect to which these are defined. In most cases the base measure is either the counting measure, in the case of discrete distributions, or the Lebesgue measure, in the case of continuous distributions.
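For discrete distributions, the expressions above are straightforward to evaluate. The following sketch (our own illustration) computes D_λ(P‖Q) from probability vectors and checks the monotonicity-in-λ property stated next.

```python
import numpy as np

def renyi_divergence(p, q, lam):
    """D_lambda(P || Q) for discrete distributions given as probability vectors.

    Assumes P is absolutely continuous w.r.t. Q (q[i] == 0 implies p[i] == 0).
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if lam == 1:  # the KL-divergence limiting case
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return float(np.log(np.sum(p[mask] ** lam * q[mask] ** (1 - lam))) / (lam - 1))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
divs = [renyi_divergence(p, q, lam) for lam in (1, 2, 4, 8)]
assert all(a <= b + 1e-12 for a, b in zip(divs, divs[1:]))  # non-decreasing in lambda
```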
We state some basic properties of Rényi divergences; for further information see, e.g., Van Erven & Harremos (2014).

Lemma 10. Let P, Q, P′, and Q′ be probability distributions. Let λ ∈ [1, ∞]. The following hold.

Non-negativity: D_λ(P‖Q) ≥ 0.

Monotonicity & continuity: D_λ(P‖Q) is a continuous and non-decreasing function of λ.

Data processing inequality (a.k.a. postprocessing): Let f(P) denote the distribution obtained by applying some (possibly randomized) function to a sample from P, and let f(Q) denote the distribution obtained by applying the same function to a sample from Q. Then D_λ(f(P)‖f(Q)) ≤ D_λ(P‖Q).

Finite case suffices: We have D_λ(P‖Q) = sup_f D_λ(f(P)‖f(Q)) even when f is restricted to functions with a finite range.

Chain rule (a.k.a. composition): D_λ(P × P′‖Q × Q′) = D_λ(P‖Q) + D_λ(P′‖Q′), where P × P′ and Q × Q′ denote the product distributions of the individual distributions.

Convexity: The function (P, Q) ↦ e^{(λ−1)D_λ(P‖Q)} is convex for all λ ∈ (1, ∞). The function (P, Q) ↦ D_λ(P‖Q) is convex if and only if λ = 1.

Now we can state the definition of Rényi DP (RDP), which is due to Mironov (2017).

Definition 11 (Rényi Differential Privacy). A randomized algorithm M : Xⁿ → Y is (λ, ε)-Rényi differentially private if, for all neighbouring pairs of inputs x, x′ ∈ Xⁿ, D_λ(M(x)‖M(x′)) ≤ ε.

A closely related definition is that of zero-concentrated differential privacy (Bun & Steinke, 2016) (which is based on an earlier definition (Dwork & Rothblum, 2016) of concentrated differential privacy that does not refer to Rényi divergences).

Definition 12 (Concentrated Differential Privacy). A randomized algorithm M : Xⁿ → Y is ρ-zCDP if, for all neighbouring pairs of inputs x, x′ ∈ Xⁿ and all λ ∈ (1, ∞), D_λ(M(x)‖M(x′)) ≤ ρ·λ.

Usually, we consider a family of (λ, ε(λ))-RDP guarantees, where ε(λ) is a function, rather than a single guarantee. Concentrated DP is one example of such a family, where the function is linear, and this captures the behaviour of many natural algorithms. In particular, adding Gaussian noise to a bounded-sensitivity function: if f : Xⁿ → ℝ^d has sensitivity Δ, i.e., ‖f(x) − f(x′)‖₂ ≤ Δ for all neighbouring x, x′, and M : Xⁿ → ℝ^d is the algorithm that returns a sample from N(f(x), σ²I), then M satisfies (Δ²/(2σ²))-zCDP.

We can convert from pure DP to concentrated or Rényi DP as follows (Bun & Steinke, 2016).

Lemma 13. If M satisfies (ε, 0)-differential privacy, then M satisfies ½ε²-zCDP, i.e., (λ, ½ε²λ)-RDP for all λ ∈ (1, ∞).

Conversely, we can convert from concentrated or Rényi DP to approximate DP as follows (Canonne et al., 2020).

Lemma 14. If M satisfies (λ, ε̂)-RDP, then M satisfies (ε, δ)-DP, where ε ≥ 0 is arbitrary and

$$\delta = \frac{\exp\left((\lambda-1)(\hat\varepsilon - \varepsilon)\right)}{\lambda - 1}\left(1 - \frac{1}{\lambda}\right)^{\lambda}.$$

A.2 PROBABILITY GENERATING FUNCTIONS

Let K be a random variable supported on ℕ ∪ {0}. The probability generating function (PGF) of K is defined by

$$f(x) = \mathbb{E}\left[x^K\right] = \sum_{k=0}^{\infty} \mathbb{P}[K = k]\,x^k.$$

The PGF f(x) is always defined for x ∈ [0, 1], but may or may not be defined for x > 1. The PGF characterizes K. In particular, we can recover the probability mass function from the derivatives of the PGF (hence the name):

$$\mathbb{P}[K = k] = \frac{f^{(k)}(0)}{k!},$$

where f^{(k)}(x) denotes the k-th derivative of f(x); in particular, P[K = 0] = f(0).
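As a quick numerical sanity check on these identities, the sketch below compares the analytic PGF of the truncated negative binomial distribution against a Monte Carlo estimate of E[x^K]; the parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)
ETA, GAMMA = 1.0, 0.2   # the geometric case, for concreteness

def pgf(x):
    return ((1 - (1 - GAMMA) * x) ** (-ETA) - 1) / (GAMMA ** (-ETA) - 1)

def sample_k():
    """Inverse-CDF draw from D_{ETA,GAMMA} using the pmf ratio of Definition 1."""
    u = rng.random()
    p = ETA * (1 - GAMMA) / (GAMMA ** (-ETA) - 1)  # P[K = 1]
    k, cdf = 1, p
    while cdf < u:
        p *= (1 - GAMMA) * (k + ETA) / (k + 1)
        k += 1
        cdf += p
    return k

samples = np.array([sample_k() for _ in range(100_000)])
for x in (0.25, 0.5, 0.9):
    print(x, pgf(x), np.mean(x ** samples))  # analytic vs empirical E[x^K]
print(samples.mean(), ETA * (1 - GAMMA) / (GAMMA * (1 - GAMMA ** ETA)))  # f'(1) = E[K]
```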
We remark that is is often easiest to specify the PGF and derive the probability distribution from it, rather than vice versa; indeed, we arrived at the truncated negative binomial distribution by starting with the PGF that we want and then differentiating. We can also easily recover the moments of K from the PGF: We have f (k)(x) = P ℓ=k P [K = ℓ] xℓ k ℓ (ℓ 1) (ℓ 2) (ℓ k + 1). In particular, f(1) = E [1] = 1 and f (1) = E [K] and f (1) = E [K(K 1)]. Note that the PGF is a rescaling of the moment generating function (MGF) g(t) := E et K = f(et). The PGF can be related to the MGF in another way: Suppose Λ is a random variable on [0, ). Now suppose we draw K Poisson(Λ). Then the PGF of K is the MGF of Λ i.e., E x K = E K Poisson(Λ) x K = E Λ eΛ (x 1) = g(x 1), where g is the MGF of Λ. In particular, if Λ is drawn from a Gamma distribution, then this would yield K from a negative binomial distribution which has a PGF of the form f NB(x) = 1 (1 γ)x γ η .8 Note that our results work with a truncated negative binomial distribution, which is a negative binomial conditioned on K = 0. This corresponds to an affine rescaling of the PGF, namely f TNB(x) = f NB(x) f NB(0) f NB(1) f NB(0). We can also obtain a negative binomial distribution as a compound of a Poisson distribution and a logarithmic distribution. That is, if we draw T from a Poisson distribution and draw K1, K2, , KT independently from a logarithmic distribution, then K = PT t=1 Kt follows a negative binomial distribution. The PGF of the logarithmic distribution is given by f Kt(x) = E x Kt = log(1 (1 γ)x) log(γ) and the PGF of Poisson is given by f T (x) = E x T = eµ (x 1). Hence f K(x) = E x K = E T = E T f Kt(x)T = f T (f Kt(x)) = exp µ log(1 (1 γ)x) which is equivalent to f NB(x) = 1 (1 γ)x γ η with η = µ log(1/γ). Finally we remark that we can also use the PGF to show convergence in probability. In particular, lim η ,γ= η η+µ f NB(x) = lim η ,γ= η η+µ η = lim η ,γ= η η+µ η (x 1) η = eµ(x 1). That is, if we take the limit of the negative binomial distribution as η but the mean µ = η 1 γ γ remains fixed, then we obtain a Poisson distribution. If we take η 0, then f NB(x) 1, which is to say that the negative binomial distribution converges to a point mass at 0 as η 0. However, the truncated negative binomial distribution converges to a logarithmic distribution as η 0. A.2.1 PROBABILITY GENERATING FUNCTIONS AND UTILITY Recall that in Section 3.6, we analyzed the expected utility and runtime of different distributions on the number of repetitions K. Given our discussion of probability generating functions for these distributions, we can offer an alternative perspective on the expected utility and runtime. Suppose each invocation of Q has a probability 1/m of producing a good output. This would be the case if we are considering m hyperparameter settings and only one is good where here we consider the outcome to be binary (good or bad) for simplicity and what is a good or bad is determined only by the total order on the range Y and some threshold on the quality score (e.g., accuracy). Then A has a probability β := 1 P [A(x) Bad] = 1 E K h P [Q(x) Bad]Ki = 1 E (1 1/m)K = 1 f(1 1/m) 8The PGF of Binomial(n, p) is f(x) = (1 p + px)n. This expression is similar to the PGF of the negative binomial, except the negative binomial has a negative exponent. Published as a conference paper at ICLR 2022 of outputting a good output, where f(x) = E x K is the probability generating function of the distribution. 
If we make the first-order approximation f(1 − 1/m) ≈ f(1) − f′(1)·(1/m) = 1 − E[K]/m, then we have β ≈ E[K]/m. In other words, for small values of 1/m, the probability of success is amplified by a multiplicative factor of E[K]. However, the above first-order approximation only holds for large m and, hence, small overall success probabilities β. In practice, we want β ≈ 1. The different distributions (Poisson and truncated negative binomial with different values of η) have very different behaviours, even with the same expectation. In the regime where we want the overall success probability to be high (i.e., β ≈ 1), smaller η performs worse, because the distribution is more heavy-tailed. The best performing distribution is the Poisson distribution, which is almost as concentrated as naïve repetition. Figure 6 shows the success probability β as a function of the final (ε, 10⁻⁶)-DP guarantee. This demonstrates that there is a tradeoff between distributions.

More generally, we can relate the PGF of K to the expected utility of our repeated algorithm. Let X ∈ ℝ be a random variable corresponding to the utility of one run of the base algorithm Q. E.g., X could represent the accuracy, loss, AUC/AUROC, or simply the quantile of the output. Now let Y ∈ ℝ be the utility of our repeated algorithm A, which runs the base algorithm Q repeatedly K times for a random K. That is, Y = max{X₁, …, X_K}, where X₁, X₂, … are independent copies of X. Let cdf_X(x) = P[X ≤ x] and

$$\mathrm{cdf}_Y(x) = \mathbb{P}[Y \le x] = \mathop{\mathbb{E}}_K\left[\mathbb{P}[X \le x]^{K}\right] = f(\mathrm{cdf}_X(x)),$$

where f(x) = E[x^K] is the PGF of the number of repetitions K. Assuming for the moment that X is a continuous random variable, we can derive the probability density function of Y from the cumulative distribution function:

$$\mathrm{pdf}_Y(x) = \frac{\mathrm{d}}{\mathrm{d}x}\mathrm{cdf}_Y(x) = \frac{\mathrm{d}}{\mathrm{d}x}f(\mathrm{cdf}_X(x)) = f'(\mathrm{cdf}_X(x))\cdot\mathrm{pdf}_X(x).$$

This allows us to compute the expected utility:

$$\mathbb{E}[Y] = \int x\,\mathrm{pdf}_Y(x)\,\mathrm{d}x = \int x\, f'(\mathrm{cdf}_X(x))\,\mathrm{pdf}_X(x)\,\mathrm{d}x = \mathbb{E}\left[X\, f'(\mathrm{cdf}_X(X))\right].$$

In particular, we can compute the expected quantile (8), in which case X is uniform on [0, 1] and, hence, cdf_X(x) = x and pdf_X(x) = 1 for x ∈ [0, 1]. Integration by parts gives

$$\mathbb{E}[Y] = \int_0^1 x\, f'(x)\,\mathrm{d}x = \left[x\,f(x)\right]_0^1 - \int_0^1 f(x)\,\mathrm{d}x = 1\cdot f(1) - 0\cdot f(0) - \int_0^1 f(x)\,\mathrm{d}x = 1 - \int_0^1 f(x)\,\mathrm{d}x.$$

Note that

$$\int_0^1 f(x)\,\mathrm{d}x = \int_0^1 \mathop{\mathbb{E}}_K\left[x^K\right]\mathrm{d}x = \mathop{\mathbb{E}}_K\left[\int_0^1 x^K\,\mathrm{d}x\right] = \mathop{\mathbb{E}}_K\left[\frac{1}{K+1}\right].$$

Finally, we also want to ensure that the runtime of our hyperparameter tuning algorithm is well-behaved. In particular, we wish to avoid heavy-tailed runtimes. We can obtain tail bounds on the number of repetitions K from the PGF or MGF too: for all t > 0, we have

$$\mathbb{P}[K \ge k] = \mathbb{P}\left[e^{t(K-k)} \ge 1\right] \le \mathbb{E}\left[e^{t(K-k)}\right] = f(e^t)\,e^{-tk}.$$

Thus, if the PGF f(x) = E[x^K] is finite for some x = e^t > 1, then we obtain a subexponential tail bound on K.

B PROOFS FROM SECTION 3

B.1 PROOF OF GENERIC BOUND

Proof of Lemma 7. We assume that Y is a finite set and that P[K = 0] = 0; this is, essentially, without loss of generality.⁹ Denote Q(≤ y) := Σ_{y′∈Y} I[y′ ≤ y]·Q(y′), and similarly for Q(< y), and analogously for Q′ in place of Q. For each y ∈ Y, we have

$$A(y) = \sum_{k=1}^{\infty} \mathbb{P}[K = k]\left(Q(\le y)^k - Q(< y)^k\right) = f(Q(\le y)) - f(Q(< y)) = Q(y)\, f'(q_y)$$

for some q_y ∈ [Q(< y), Q(≤ y)], by the mean value theorem.

⁹ Our proof can be extended to the general case. Alternatively, if Y is infinite, we can approximate the relevant quantities arbitrarily well with a finite partition; see Lemma 10 or Van Erven & Harremos (2014, Theorem 10).

B.2 PROOFS OF DISTRIBUTION-SPECIFIC BOUNDS

Truncated Negative Binomial Distribution.

Proof of Theorem 2. The probability generating function of the truncated negative binomial distribution is

$$f(x) = \mathop{\mathbb{E}}_{K\sim D_{\eta,\gamma}}\left[x^K\right] = \begin{cases}\dfrac{(1-(1-\gamma)x)^{-\eta} - 1}{\gamma^{-\eta} - 1} & \text{if } \eta \ne 0,\\[6pt] \dfrac{\log(1-(1-\gamma)x)}{\log(\gamma)} & \text{if } \eta = 0.\end{cases}$$
Its derivative is

f′(x) = { η(1−γ)·(1 − (1−γ)x)^{−η−1}/(γ^{−η} − 1)   if η ≠ 0;
          (1−γ)/((1 − (1−γ)x)·log(1/γ))            if η = 0 }
      = (1 − (1−γ)x)^{−η−1} · γ^{η+1} · E[K].

Now we delve into the privacy analysis. Let Q = Q(x) and Q′ = Q(x′) denote the output distributions of Q on two neighbouring inputs. Similarly, let A = A(x) and A′ = A(x′) be the corresponding pair of output distributions of the repeated algorithm. By Lemma 7, for appropriate values q, q′ ∈ [0, 1] and for all λ > 1 and all λ̂ > 1,¹⁰ we have

D_λ(A‖A′) ≤ D_λ(Q‖Q′) + (1/(λ−1))·log( f′(q)^λ · f′(q′)^{1−λ} )
= D_λ(Q‖Q′) + (1/(λ−1))·log( γ^{η+1}·E[K] · (1 − (1−γ)q)^{−λ(η+1)} · (1 − (1−γ)q′)^{−(1−λ)(η+1)} )
= D_λ(Q‖Q′) + (1/(λ−1))·log( γ^{η+1}·E[K] · ( (γ + (1−γ)(1−q))^{1−λ̂} · (γ + (1−γ)(1−q′))^{λ̂} )^ν · (γ + (1−γ)(1−q))^{−u} )
    (choosing λ̂·ν = (λ−1)(1+η) and (λ̂−1)·ν + u = λ(η+1))
≤ D_λ(Q‖Q′) + (1/(λ−1))·log( γ^{η+1}·E[K] · ( γ + (1−γ)·e^{(λ̂−1)·D_λ̂(Q′‖Q)} )^ν · (γ + (1−γ)(1−q))^{−u} )
    (1−q and 1−q′ are postprocessings of Q and Q′ respectively, (a, b) ↦ b^{λ̂}·a^{1−λ̂} is jointly convex, and ν ≥ 0)
≤ D_λ(Q‖Q′) + (1/(λ−1))·log( γ^{η+1}·E[K] · ( γ + (1−γ)·e^{(λ̂−1)·D_λ̂(Q′‖Q)} )^ν · γ^{−u} )
    (γ ≤ γ + (1−γ)(1−q) and u ≥ 0)
= D_λ(Q‖Q′) + (ν/(λ−1))·( (λ̂−1)·D_λ̂(Q′‖Q) + log(1 − γ·(1 − e^{−(λ̂−1)·D_λ̂(Q′‖Q)})) ) + (1/(λ−1))·log( γ^{η+1−u}·E[K] )
≤ D_λ(Q‖Q′) + (1+η)·(1 − 1/λ̂)·D_λ̂(Q′‖Q) + (1+η)·log(1/γ)/λ̂ + log(E[K])/(λ−1),

where the last step substitutes ν = (λ−1)(1+η)/λ̂ and u = (1+η)(λ+λ̂−1)/λ̂ and drops the non-positive term (ν/(λ−1))·log(1 − γ·(1 − e^{−(λ̂−1)·D_λ̂(Q′‖Q)})).

¹⁰Our proof here assumes λ̂ > 1, but the result holds for λ̂ = 1 too. This can be shown either by analyzing this case separately or simply by taking the limit λ̂ → 1 and using the continuity properties of Rényi divergence.

Proof of Corollary 4. We assume that Q : Xⁿ → Y is a randomized algorithm satisfying ρ-zCDP, i.e., (λ, ρλ)-Rényi DP for all λ > 1. Substituting this guarantee into Theorem 2 (i.e., setting ε = ρλ and ε̂ = ρλ̂) gives that the repeated algorithm A satisfies (λ, ε′)-RDP for

ε′ ≤ ρλ + (1+η)·(1 − 1/λ̂)·ρλ̂ + (1+η)·log(1/γ)/λ̂ + log(E[K])/(λ−1).

This holds for all λ ∈ (1, ∞) and all λ̂ ∈ [1, ∞). We set λ̂ = √(log(1/γ)/ρ) to minimize this expression. Note that we assume ρ ≤ log(1/γ); hence this is a valid setting of λ̂ ≥ 1. This reduces the expression to

ε′ ≤ ρλ − (1+η)·ρ + 2(1+η)·√(ρ·log(1/γ)) + log(E[K])/(λ−1).

This bound is minimized when λ − 1 = √(log(E[K])/ρ). If λ − 1 < √(log(E[K])/ρ), then we can apply the monotonicity property of Rényi DP (Remark 5 and Lemma 10) and substitute in the bound with this optimal λ. That is, we obtain the bound

ε′ ≤ { ρλ − (1+η)ρ + 2(1+η)·√(ρ·log(1/γ)) + log(E[K])/(λ−1)   if λ > 1 + √(log(E[K])/ρ);
       2·√(ρ·log(E[K])) + 2(1+η)·√(ρ·log(1/γ)) − ηρ           if λ ≤ 1 + √(log(E[K])/ρ) }.

Proof of the Poisson Distribution Bound

Proof of Theorem 6. The probability generating function for the Poisson distribution is f(x) = E[x^K] = e^{µ(x−1)}. Thus f′(x) = µ·e^{µ(x−1)}. As in the previous proofs, let x and x′ be neighbouring inputs. Denote Q = Q(x), Q′ = Q(x′), A = A(x), and A′ = A(x′). By Lemma 7,

D_λ(A‖A′) ≤ D_λ(Q‖Q′) + (1/(λ−1))·log( f′(q)^λ · f′(q′)^{1−λ} )
= D_λ(Q‖Q′) + (1/(λ−1))·log( µ · e^{µλ(q−1) + µ(1−λ)(q′−1)} )
= D_λ(Q‖Q′) + ( µ·(λq − (λ−1)q′ − 1) + log µ )/(λ−1)
= D_λ(Q‖Q′) + µ·((λ−1)(1−q′) − λ(1−q))/(λ−1) + log(µ)/(λ−1)
≤ D_λ(Q‖Q′) + µ·((λ−1)·(e^{ε̂}·(1−q) + δ̂) − λ(1−q))/(λ−1) + log(µ)/(λ−1)
    (by our (ε̂, δ̂)-DP assumption on Q and since 1−q and 1−q′ are postprocessings)
= D_λ(Q‖Q′) + µ·(1−q)·(e^{ε̂} − λ/(λ−1)) + µ·δ̂ + log(µ)/(λ−1)
≤ D_λ(Q‖Q′) + µ·δ̂ + log(µ)/(λ−1),

where the final inequality follows from our assumption that e^{ε̂} ≤ 1 + 1/(λ−1).
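For readers who want to evaluate the resulting guarantee numerically, here is a small Python sketch (our own illustration, under our reading of Corollary 4's two-case bound) that computes ε′(λ) for a ρ-zCDP base algorithm and converts it to (ε, δ)-DP via the standard conversion ε(δ) = ε′(λ) + log(1/δ)/(λ−1); the function names and parameter values are ours.

```python
import math

def corollary4_rdp(lmbda, rho, eta, gamma, expected_k):
    # RDP bound for the tuned algorithm when each run is rho-zCDP and the
    # number of runs is truncated-negative-binomial with parameters (eta, gamma).
    assert rho <= math.log(1.0 / gamma), "Corollary 4 assumes rho <= log(1/gamma)"
    common = 2.0 * (1.0 + eta) * math.sqrt(rho * math.log(1.0 / gamma))
    if lmbda > 1.0 + math.sqrt(math.log(expected_k) / rho):
        return rho * lmbda - (1.0 + eta) * rho + common \
               + math.log(expected_k) / (lmbda - 1.0)
    return 2.0 * math.sqrt(rho * math.log(expected_k)) + common - eta * rho

def approx_dp_from_rdp(rho, eta, gamma, expected_k, delta):
    # Standard RDP-to-(eps, delta)-DP conversion, minimized over a grid of orders.
    best = float("inf")
    for i in range(1, 10000):
        lmbda = 1.0 + i / 100.0
        eps = corollary4_rdp(lmbda, rho, eta, gamma, expected_k) \
              + math.log(1.0 / delta) / (lmbda - 1.0)
        best = min(best, eps)
    return best

# Example: each run is 0.1-zCDP; K ~ TNB(eta=1, gamma=0.01), which has E[K] = 100.
print(approx_dp_from_rdp(rho=0.1, eta=1.0, gamma=0.01, expected_k=100.0, delta=1e-6))
```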
C CONDITIONAL SAMPLING APPROACH

In the main text, we analyzed the approach where we run the underlying algorithm Q a random number of times according to a carefully-chosen distribution and output the best result from these independent runs. An alternative approach, also studied by Liu & Talwar (2019), is to start with a pre-defined threshold for a good-enough output, run Q repeatedly until it produces such a result, and then output that result. This approach has some advantages, namely being simpler and avoiding the heavy-tailed behaviour of the logarithmic distribution while attaining the same kind of privacy guarantee. However, the disadvantage of this approach is that we must specify the acceptance threshold a priori. If we set the threshold too high, then we may have to keep running Q for a long time.¹¹ If we set the threshold too low, then we may end up with a suboptimal output.

¹¹One solution to avoid an overly long runtime in the case that Q(S) is too small is to modify Q to output a designated failure symbol with some small probability p and then have A halt if this occurs. This would represent a failure of the algorithm to produce a good output, but would avoid a privacy failure.

We analyze this approach under Rényi DP, thereby extending the results of Liu & Talwar (2019). Our algorithm A now works as follows. We start with a base algorithm Q and a set of good outputs S. Now A(x) computes y = Q(x) and, if y ∈ S, then it returns y and halts. Otherwise, A repeats the procedure. This is equivalent to sampling from the conditional distribution Q(x) | Q(x) ∈ S. The number of times Q is run follows a geometric distribution with mean 1/Q(S).

Proposition 15. Let λ ∈ (1, ∞). Let Q and Q′ be probability distributions on Ω with D_λ(Q‖Q′) < ∞. Let S ⊆ Ω have nonzero measure under Q and also under Q′. Let Q_S and Q′_S denote the conditional distributions of Q and Q′ respectively conditioned on being in the set S. That is, Q_S(E) = Q(E ∩ S)/Q(S) and Q′_S(E) = Q′(E ∩ S)/Q′(S) for all measurable E ⊆ Ω. Then, for all p, q, r ∈ [1, ∞] satisfying 1/p + 1/q + 1/r = 1, we have

D_λ(Q_S‖Q′_S) ≤ ((λ − 1/p − 1/r)/(λ−1))·D_{r(λ−1/p)}(Q‖Q′) + ((λ + 1/q − 2)/(λ−1))·D_{λ+1/q−1}(Q′‖Q) + ((1/r + 1)/(λ−1))·log(1/Q(S)).

Proof. For x ∈ Ω, denote the various distributional densities at x (relative to some base measure) by Q_S(x), Q′_S(x), Q(x), and Q′(x). We have Q_S(x) = Q(x)·I[x ∈ S]/Q(S) and Q′_S(x) = Q′(x)·I[x ∈ S]/Q′(S). Now we have

e^{(λ−1)·D_λ(Q_S‖Q′_S)} = ∫_Ω Q_S(x)^λ · Q′_S(x)^{1−λ} dx
= Q(S)^{−λ} · Q′(S)^{λ−1} · ∫_Ω I[x ∈ S] · Q(x)^λ · Q′(x)^{1−λ} dx
≤ Q(S)^{−λ} · Q′(S)^{λ−1} · (∫_S Q(x) dx)^{1/p} · (∫_S Q′(x) dx)^{1/q} · (∫_S (Q(x)^{λ−1/p} · Q′(x)^{1−λ−1/q})^r dx)^{1/r}    (Hölder's inequality)
= Q(S)^{1/p−λ} · Q′(S)^{1/q+λ−1} · (∫_S Q(x)^{rλ−r/p} · Q′(x)^{r−rλ−r/q} dx)^{1/r}
= Q′(S)^{λ₀} · Q(S)^{1−λ₀} · Q(S)^{−1/r−1} · (∫_S Q(x)^{λ₁} · Q′(x)^{1−λ₁} dx)^{1/r}    (λ₀ := λ + 1/q − 1, λ₁ := rλ − r/p)
≤ e^{(λ₀−1)·D_{λ₀}(Q′‖Q)} · Q(S)^{−1/r−1} · (e^{(λ₁−1)·D_{λ₁}(Q‖Q′)})^{1/r}.    (postprocessing & non-negativity)

The number of parameters in Proposition 15 is excessive. Thus we provide some corollaries that simplify the expression somewhat.

Corollary 16. Let λ, Q, Q′, S, Q_S, Q′_S be as in Proposition 15. The following inequalities all hold:

D_∞(Q_S‖Q′_S) ≤ D_∞(Q‖Q′) + D_∞(Q′‖Q).
D_λ(Q_S‖Q′_S) ≤ D_λ(Q‖Q′) + ((λ−2)/(λ−1))·D_{λ−1}(Q′‖Q) + (2/(λ−1))·log(1/Q(S)).
D_λ(Q_S‖Q′_S) ≤ D_∞(Q‖Q′) + ((λ−2)/(λ−1))·D_{λ−1}(Q′‖Q) + (1/(λ−1))·log(1/Q(S)).
D_λ(Q_S‖Q′_S) ≤ (λ/(λ−1))·D_∞(Q‖Q′) + D_λ(Q′‖Q) + (1/(λ−1))·log(1/Q(S)).
∀r ≥ 1:  D_λ(Q_S‖Q′_S) ≤ D_{r(λ−1)+1}(Q‖Q′) + ((λ−2)/(λ−1))·D_{λ−1}(Q′‖Q) + ((1/r + 1)/(λ−1))·log(1/Q(S)).

The first inequality in Corollary 16 is essentially the result given by Liu & Talwar (2019): if Q satisfies ε-DP, then A satisfies 2ε-DP.
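To illustrate, the following Python sketch (ours) evaluates the second inequality of Corollary 16 for a base algorithm with Gaussian-like divergences D_λ(Q‖Q′) = ρλ and D_{λ−1}(Q′‖Q) = ρ(λ−1), i.e., ρ-zCDP; the acceptance probability Q(S) = 0.05 is an arbitrary illustrative assumption. This is the quantity plotted in Figure 8 below.

```python
import math

def conditional_sampling_rdp(lmbda, rho, q_s):
    # Second inequality of Corollary 16:
    # D_lambda(Q_S || Q'_S) <= D_lambda(Q||Q')
    #                        + (lambda-2)/(lambda-1) * D_{lambda-1}(Q'||Q)
    #                        + 2/(lambda-1) * log(1/Q(S))
    d_forward = rho * lmbda          # D_lambda(Q||Q') for a rho-zCDP base algorithm
    d_reverse = rho * (lmbda - 1.0)  # D_{lambda-1}(Q'||Q)
    return d_forward + (lmbda - 2.0) / (lmbda - 1.0) * d_reverse \
           + 2.0 / (lmbda - 1.0) * math.log(1.0 / q_s)

for lmbda in (2.0, 4.0, 8.0, 16.0):
    print(lmbda, conditional_sampling_rdp(lmbda, rho=0.1, q_s=0.05))
```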
Figure 8 plots the guarantee of the second inequality in Corollary 16 when D_λ(Q‖Q′) = 0.1·λ and D_{λ−1}(Q′‖Q) = 0.1·(λ−1).

D NEGATIVE RESULTS ON IMPROVEMENTS TO OUR ANALYSIS

It is natural to wonder whether our results could be further improved. In this section, we give some examples demonstrating that, quantitatively, there is little room for improvement.

D.1 WHY A FIXED NUMBER OF REPETITIONS DOES NOT RESULT IN GOOD PRIVACY

We first consider more closely the strawman approach discussed in Section 3.2: the base algorithm Q is repeated a fixed number of times k and we return the best output. This corresponds to picking k from a point mass distribution. To understand why it performs so poorly from a privacy standpoint, we first apply our main result from Section 3.5 to the resulting point mass distribution.

Point Mass: Suppose K is a point mass, i.e., P[K = k] = 1, so A runs the algorithm Q a deterministic number of times. Then the probability generating function (PGF) is f(x) = x^k and its derivative is f′(x) = k·x^{k−1}. Let Q denote the base algorithm. We abuse notation and let Q = Q(x) and Q′ = Q(x′), where x and x′ are neighbouring inputs. Similarly, let A = A(x) and A′ = A(x′) be the final output distributions obtained by running Q and Q′ repeatedly k times and returning the best result. We follow the same pattern of analysis that we applied to the other distributions in Theorems 2 and 6: Lemma 7 gives the bound

D_λ(A‖A′) ≤ D_λ(Q‖Q′) + (1/(λ−1))·log( k·(q^λ·q′^{1−λ})^{k−1} )
≤ D_λ(Q‖Q′) + (1/(λ−1))·log( k·(e^{(λ−1)·D_λ(Bern(q)‖Bern(q′))})^{k−1} )
≤ D_λ(Q‖Q′) + (1/(λ−1))·log( k·(e^{(λ−1)·D_λ(Q‖Q′)})^{k−1} )
= k·D_λ(Q‖Q′) + log(k)/(λ−1),

where the final inequality follows from the fact that Bern(q) and Bern(q′) are postprocessings of Q and Q′ respectively.

This bound is terrible. In fact, it is slightly worse than a naïve composition analysis, which would give D_λ(A‖A′) ≤ D_λ(Q^k‖Q′^k) = k·D_λ(Q‖Q′). It shows that a deterministic number of repetitions does not yield good privacy parameters, at least with this analysis.

It is surprising that running the base algorithm Q a fixed number of times k and returning the best output performs so poorly from a privacy standpoint. We will now give a simple example that demonstrates that this is inherent and not just a limitation of our analysis. Liu & Talwar (2019, Appendix B) give a similar example.

Proposition 17. For all ε > 0, there exists an ε-DP algorithm Q : Xⁿ → Y such that the following holds. Define an algorithm A : Xⁿ → Y that runs Q a fixed number of times k and returns the best output from these runs. Then A is not ε̂-DP for any ε̂ < kε. Furthermore, for all λ > 1, A is not (λ, ε̂(λ))-Rényi DP for any ε̂(λ) < ε*(λ), where

ε*(λ) = kε − k·log(1 + e^{−ε})/(λ−1).

Proof. The base algorithm is simply randomized response. We let Y = {1, 2} with the total order preferring 1 over 2. We define a pair of distributions Q and Q′ on {1, 2} and then set the base algorithm so that these are its output distributions on a pair of neighbouring inputs:

Q = (1/(1+e^ε), e^ε/(1+e^ε)),   Q′ = (e^ε/(1+e^ε), 1/(1+e^ε)).

Then D_∞(Q‖Q′) = D_∞(Q′‖Q) = ε. Thus we can ensure that the base algorithm yielding this pair of distributions is ε-DP. Now we look at the corresponding pair of distributions from repeating the base algorithm k times.
We have

A = (1 − (e^ε/(1+e^ε))^k, (e^ε/(1+e^ε))^k),   A′ = (1 − (1/(1+e^ε))^k, (1/(1+e^ε))^k).

The first part of the result follows:

D_∞(A‖A′) ≥ log(A(2)/A′(2)) = kε.

For all λ > 1,

e^{(λ−1)·D_λ(A‖A′)} ≥ A(2)^λ · A′(2)^{1−λ} = e^{εkλ}·(1 + e^ε)^{−k},

whence

D_λ(A‖A′) ≥ (εkλ − k·log(1 + e^ε))/(λ−1) = kε − k·log(1 + e^{−ε})/(λ−1).

The second part of Proposition 17 shows that this problem is not specific to pure DP. For λ ≥ 1 + 1/ε, we have ε*(λ) = Ω(kε), so we are paying linearly in k.

However, Proposition 17 is somewhat limited to pure ε-DP, or at least to (λ, ε(λ))-RDP with not-too-small values of λ. This is because the bad event is relatively low-probability. Specifically, the high-privacy-loss event has probability (1 + e^{−ε})^{−k}. This is small, unless ε ≥ Ω(log k). We can change the example to make the bad event happen with constant probability; however, the base algorithm will then no longer be pure ε-DP. Specifically, we can replace the two distributions in Proposition 17 with the following:

Q = (1 − e^{−1/k}, e^{−1/k}),   Q′ = (1 − e^{−ε₀−1/k}, e^{−ε₀−1/k}).

If we repeat this base algorithm a fixed number of times k, then the corresponding pair of distributions is given by

A = (1 − e^{−1}, e^{−1}),   A′ = (1 − e^{−kε₀−1}, e^{−kε₀−1}).

Now we have D_∞(A‖A′) = kε₀ and the bad event happens with probability e^{−1} ≈ 0.37. On the other hand, D_∞(Q‖Q′) = ε₀ like before, but

D_∞(Q′‖Q) = log((1 − e^{−ε₀−1/k})/(1 − e^{−1/k})) ≈ log(kε₀ + 1).

But we still have a good guarantee in terms of Rényi divergences. In particular, D₁(Q′‖Q) ≤ (ε₀ + 1/k)·log(kε₀ + 1), and we can set ε₀ = o(1/log k) to ensure that we get reasonable (λ, ε(λ))-RDP guarantees for small λ.

At a higher level, it should not be a surprise that this negative example is relatively brittle. Our positive results show that it only takes a very minor adjustment to the number of repetitions to obtain significantly tighter privacy guarantees for hyperparameter tuning than what one would obtain from naïve composition. In particular, running a fixed number of times k versus running Poisson(k) times is not that different, but our positive results show that the latter already circumvents this problem in general.

We also remark that composition behaves differently in the low-order RDP or approximate DP regime relative to the pure-DP or high-order RDP regime covered by Proposition 17. Thus the naïve composition baseline we compare to also shifts from basic composition (ε-DP becomes kε-DP under k-fold repetition (Dwork et al., 2006b)) to advanced composition (ε-DP becomes (O(ε·√(k·log(1/δ))), δ)-DP (Dwork et al., 2010)). Proving tightness of basic composition is easy, but proving tightness for advanced composition is non-trivial (in general, it relies on the machinery of fingerprinting codes (Bun et al., 2014)). This means it is not straightforward to extend Proposition 17 to this regime.

D.2 TIGHT EXAMPLE FOR CONDITIONAL SAMPLING

We are also interested in the tightness of our generic results. We begin by studying the conditional sampling approach outlined in Appendix C. This approach is simpler, so it is easier to give a tight example; this also proves instructive for the random-repetition approach of Section 3.

Our tight example for the conditional sampling approach consists of a pair of distributions Q and Q′. These should be thought of as the output distributions of the algorithm Q on two neighbouring inputs. The distributions are supported on only three points. Such a small output space seems contrived, but it should be thought of as representing a partition of a large output space into three sets.
The first set is where the privacy loss is large and the second set is where the privacy loss is very negative, while the third set is everything in between. Fix s, t > 0 and a ∈ [0, 1/4]. Note that (1 − 2a)^{−1} ≤ e^{3a}. Let

Q = (a·e^{−s}, a·e^{−s}, 1 − 2a·e^{−s}),
Q′ = (a·e^{−s−t}, a, 1 − a − a·e^{−s−t}),
S = {1, 2} ⊆ {1, 2, 3},
Q_S = (1/2, 1/2),   Q′_S = (e^{−s−t}/(1 + e^{−s−t}), 1/(1 + e^{−s−t})).

Intuitively, the set S corresponds to the outputs that (1) have large privacy loss or (2) very negative privacy loss, and we exclude (3) the outputs with middling privacy loss. Once we condition on S, we still have the outputs (1) with large privacy loss, but that privacy loss is further increased by the renormalization. Specifically, the negative privacy loss means the renormalization constants

Q(S) = 2a·e^{−s}   and   Q′(S) = a·(1 + e^{−s−t})

are very different if s is large. In effect, the negative privacy loss becomes a positive privacy loss that is added to the already-large privacy loss.

Figure 8: Upper and lower bounds for Rényi DP of conditional sampling. For each λ, we pick the parameters s and t such that D_λ(Q‖Q′) = D_λ(Q′‖Q) = 0.1·λ and we plot the upper bound from the second inequality in Corollary 16 along with the exact value of D_λ(Q_S‖Q′_S).

We make the above intuition more precise by calculating the various quantities. For all λ > 1, we have Q(S) = 2a·e^{−s} and

e^{(λ−1)·D_λ(Q‖Q′)} = a·e^{−sλ−(s+t)(1−λ)} + a·e^{−sλ} + (1 − 2a·e^{−s})^λ·(1 − a − a·e^{−s−t})^{1−λ}
≤ a·e^{(λ−1)t−s} + a·e^{−sλ} + e^{−2a·e^{−s}·λ}·(1 − 2a)^{1−λ}
≤ a·e^{(λ−1)t−s} + a + e^{3a(λ−1)}
≈ a·e^{(λ−1)t−s}    (assuming t is large)
= (Q(S)/2)·e^{(λ−1)t},
⟹ D_λ(Q‖Q′) ≈ t − log(2/Q(S))/(λ−1);

e^{(λ−1)·D_λ(Q′‖Q)} = a·e^{−s(1−λ)−(s+t)λ} + a^λ·(a·e^{−s})^{1−λ} + (1 − a − a·e^{−s−t})^λ·(1 − 2a·e^{−s})^{1−λ}
≤ a·e^{−tλ−s} + a·e^{s(λ−1)} + e^{3a·e^{−s}·(λ−1)}·e^{−aλ}
≈ a·e^{s(λ−1)}    (assuming s is large)
⟹ D_λ(Q′‖Q) ≈ s + (s − log(2/Q(S)))/(λ−1);

e^{(λ−1)·D_λ(Q_S‖Q′_S)} = 2^{−λ}·(1 + e^{−s−t})^{λ−1}·(e^{(λ−1)(s+t)} + 1) ≈ 2^{−λ}·e^{(λ−1)(s+t)},
⟹ D_λ(Q_S‖Q′_S) ≈ s + t − λ·log(2)/(λ−1).

We contrast this with our upper bound from the second part of Corollary 16:

D_λ(Q_S‖Q′_S) ≤ D_λ(Q‖Q′) + ((λ−2)/(λ−1))·D_{λ−1}(Q′‖Q) + (2/(λ−1))·log(1/Q(S))
≈ (t − log(2/Q(S))/(λ−1)) + ((λ−2)/(λ−1))·(s + (s − log(2/Q(S)))/(λ−2)) + (2/(λ−1))·log(1/Q(S))
= s + t − 2·log(2)/(λ−1).

This example shows that our upper bound is tight up to small factors, namely the lower-order terms we ignore with ≈ and the gap ((λ−2)/(λ−1))·log 2 ≤ log 2. Figure 8 illustrates how the upper and lower bounds compare.

D.3 TIGHTNESS OF OUR GENERIC RESULT

Now we consider the setting from Section 3 where our base algorithm is run repeatedly a random number of times and the best result is given as output. Let Q : Xⁿ → Y denote the base algorithm. Assume Y is totally ordered. Let K ∈ ℕ be a random variable and let f(x) = E[x^K] be its probability generating function. Define A : Xⁿ → Y to be the algorithm that runs Q repeatedly K times and returns the best output.

For a tight example, we again restrict our attention to distributions supported on three points:

Q = Q(x) = (1 − b − c, b, c),   Q′ = Q(x′) = (1 − b′ − c′, b′, c′),
A = A(x) = (1 − f(b + c), f(b + c) − f(c), f(c)),   A′ = A(x′) = (1 − f(b′ + c′), f(b′ + c′) − f(c′), f(c′)).

Here the total ordering prefers the first option (corresponding to the first coordinate probability), then the second, and then the third, which implies the expressions for A and A′. Note that the probability values are not necessarily ordered the same way as the ordering on outcomes. Now we must set these four values to show tightness of our results. We make the first-order approximation

A ≈ (1 − f(b + c), f′(c)·b, f(c)),   A′ ≈ (1 − f(b′ + c′), f′(c′)·b′, f(c′)).
We take this approximation and get

e^{(λ−1)·D_λ(A‖A′)} ≈ (f′(c)·b)^λ · (f′(c′)·b′)^{1−λ} = b^λ·(b′)^{1−λ} · (f′(c))^λ·(f′(c′))^{1−λ} ≈ e^{(λ−1)·D_λ(Q‖Q′)} · (f′(c))^λ·(f′(c′))^{1−λ},

where the final approximation assumes that the second term dominates in

e^{(λ−1)·D_λ(Q‖Q′)} = (1 − b − c)^λ·(1 − b′ − c′)^{1−λ} + b^λ·(b′)^{1−λ} + c^λ·(c′)^{1−λ}.

Contrast this with our upper bound (Lemma 7), which says

e^{(λ−1)·D_λ(A‖A′)} ≤ e^{(λ−1)·D_λ(Q‖Q′)} · (f′(q))^λ·(f′(q′))^{1−λ},

where q and q′ are arbitrary (but identical) postprocessings of Q and Q′. In particular, we can set the values so that q = c and q′ = c′. This is not a formal proof, since we make imprecise approximations, but it illustrates that our main generic result (Lemma 7) is tight up to low-order terms.

D.4 SELECTION & LOWER BOUNDS

Private hyperparameter tuning is a generalization of the private selection problem. In the private selection problem, we are given a utility function u : Xⁿ × [m] → ℝ which has sensitivity 1 in its first argument; i.e., for all neighbouring x, x′ ∈ Xⁿ and all j ∈ [m] = {1, 2, …, m}, we have |u(x, j) − u(x′, j)| ≤ 1. The goal is to output an approximation to argmax_{j∈[m]} u(x, j) subject to differential privacy.

The standard algorithm for private selection is the exponential mechanism (McSherry & Talwar, 2007). The exponential mechanism M is defined by

∀j ∈ [m]   P[M(x) = j] = exp(ε·u(x, j)/2) / Σ_{ℓ∈[m]} exp(ε·u(x, ℓ)/2).

It provides (ε, 0)-DP and, at the same time, (1/8)·ε²-zCDP (Rogers & Steinke, 2021). On the utility side, we have the guarantees

E[u(x, M(x))] ≥ max_{j∈[m]} u(x, j) − (2/ε)·log(m)   and   P[u(x, M(x)) ≤ max_{j∈[m]} u(x, j) − (2/ε)·log(m/β)] ≤ β

for all inputs x and all β > 0 (Bassily et al., 2021, Lemma 7.1) (Dwork & Roth, 2014, Theorem 3.11). It is also well known that the exponential mechanism is optimal up to constants. That is, (ε, 0)-DP selection entails an additive error of Ω(log(m)/ε) (Steinke & Ullman, 2017).

Our results can be applied to the selection problem. The base algorithm Q(x) simply picks an index j ∈ [m] uniformly at random, then privately estimates u(x, j) by adding Laplace or Gaussian noise ξ, and outputs the pair (j, u(x, j) + ξ). The total order on the output space [m] × ℝ simply selects for the highest estimated utility (breaking ties arbitrarily). If we take ξ to be Laplace noise with scale 1/ε, then Q is (ε, 0)-DP. Applying Corollary 3 yields a ((2 + η)·ε, 0)-DP algorithm A with the following utility guarantee. Let K be the number of repetitions and let (j_1, u(x, j_1) + ξ_1), …, (j_K, u(x, j_K) + ξ_K) be the outputs from the runs of the base algorithm. The probability that the repeated algorithm A will consider j* := argmax_{j∈[m]} u(x, j) is

P[j* ∈ {j_1, …, j_K}] = 1 − E_K[Π_{k∈[K]} P[j* ≠ j_k]] = 1 − f(1 − 1/m),

where f(x) = E[x^K] is the probability generating function of the number of repetitions K. For each noise sample, we have ∀k P[|ξ_k| ≤ t] = 1 − e^{−εt} for all t > 0. Thus, for all t > 0, the probability that all noise samples are smaller than t is

P[∀k ∈ [K] |ξ_k| ≤ t] = E[(1 − e^{−εt})^K] = f(1 − e^{−εt}).

By a union bound, we have

P[u(x, A(x)) ≥ u(x, j*) − 2t] ≥ f(1 − e^{−εt}) − f(1 − 1/m)

for all t > 0. Setting η = 0 yields f(x) = log(1 − (1−γ)x)/log(γ), so

P[u(x, A(x)) ≥ u(x, j*) − 2t] ≥ (1/log(1/γ))·log( (1 + ((1−γ)/γ)·(1/m)) / (1 + ((1−γ)/γ)·e^{−εt}) ).

Set t = log(m¹⁰ − 1)/ε and γ = 1/(exp(εt) + 1) = m⁻¹⁰, so that ((1−γ)/γ)·e^{−εt} = 1 and ((1−γ)/γ)·(1/m) = m⁹ − 1/m. This gives

P[u(x, A(x)) ≥ u(x, j*) − (20/ε)·log m] ≥ (1/(10·log m))·log((m⁹ + 1 − 1/m)/2) ≥ 9/10 − 1/(10·log₂ m).

That is, we can match the result of the exponential mechanism up to (large) constants. In particular, this means that the lower bounds for selection translate to our results; i.e., our results are tight up to constants.
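The reduction just described is easy to simulate. The following Python sketch (ours, not the paper's code) implements the base algorithm (a uniformly random index plus Laplace(1/ε) noise) and the repeated selection with a Poisson number of runs; the toy utility vector and parameter values are illustrative assumptions. With E[K] = 10 and m = 100, the best index is considered with probability 1 − f(1 − 1/m) ≈ 0.1, matching the amplification-by-E[K] heuristic from Appendix A.2.1.

```python
import math
import random

def base_algorithm(u, eps):
    # Pick a hyperparameter index uniformly, then release its utility with
    # Laplace(1/eps) noise (a difference of two exponentials is Laplace).
    j = random.randrange(len(u))
    noise = random.expovariate(eps) - random.expovariate(eps)
    return j, u[j] + noise

def sample_poisson(mu):
    # Knuth's method; adequate for the small means used here.
    k, p, threshold = 0, 1.0, math.exp(-mu)
    while p >= threshold:
        p *= random.random()
        k += 1
    return k - 1

def tuned_selection(u, eps, sample_k):
    k = sample_k()
    if k == 0:
        return None  # an arbitrary input-independent output is allowed here
    runs = [base_algorithm(u, eps) for _ in range(k)]
    return max(runs, key=lambda pair: pair[1])  # highest noisy score wins

u = [0.0] * 99 + [5.0]  # toy utilities u(x, .): one clearly-best setting, m = 100
trials = [tuned_selection(u, eps=1.0, sample_k=lambda: sample_poisson(10.0))
          for _ in range(1000)]
wins = sum(1 for r in trials if r is not None and r[0] == 99)
print("fraction of trials selecting the best setting:", wins / 1000)
```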
E EXTENDING OUR RESULTS TO APPROXIMATE DP

Our results are all in the framework of Rényi DP. A natural question is what can be said if the base algorithm instead only satisfies approximate DP, i.e., (ε, δ)-DP with δ > 0. Liu & Talwar (2019) considered these questions and gave several results. We now briefly show how our results can be extended in a black-box fashion to this setting. We begin by defining approximate Rényi divergences and approximate Rényi DP.

Definition 18 (Approximate Rényi Divergence). Let P and Q be probability distributions over Ω. Let λ ∈ [1, ∞] and δ ∈ [0, 1]. We define

D^δ_λ(P‖Q) = inf{ D_λ(P′‖Q′) : P = (1−δ)·P′ + δ·P″, Q = (1−δ)·Q′ + δ·Q″ },

where P = (1−δ)·P′ + δ·P″ denotes the fact that P can be expressed as a convex combination of two distributions P′ and P″ with weights 1−δ and δ respectively.

Definition 19 (Approximate Rényi Differential Privacy). A randomized algorithm M : Xⁿ → Y is δ-approximately (λ, ε)-Rényi differentially private if, for all neighbouring pairs of inputs x, x′ ∈ Xⁿ, D^δ_λ(M(x)‖M(x′)) ≤ ε.

Definition 19 is an extension of the definition of approximate zCDP (Bun & Steinke, 2016). Some remarks about the basic properties of approximate RDP are in order:

• (ε, δ)-DP is equivalent to δ-approximate (∞, ε)-RDP.
• (ε, δ)-DP implies δ-approximate (λ, ε²λ/2)-RDP for all λ ∈ (1, ∞).
• δ-approximate (λ, ε)-RDP implies (ε̂, δ̂)-DP for δ̂ = δ + exp((λ−1)·(ε−ε̂)).
• δ-approximate (λ, ε)-Rényi differential privacy is closed under postprocessing.
• If M₁ is δ₁-approximately (λ, ε₁)-Rényi differentially private and M₂ is δ₂-approximately (λ, ε₂)-Rényi differentially private, then their composition is (δ₁+δ₂)-approximately (λ, ε₁+ε₂)-RDP.

Our results for Rényi DP can be extended to approximate Rényi DP by the following lemma.

Lemma 20. Assume Y is a totally ordered set. For a distribution Q on Y and a random variable K supported on ℕ ∪ {0}, define A^K_Q as follows: first we sample K; then we sample from Q independently K times and output the best of these samples. This output is a sample from A^K_Q.

Let Q, Q′, Q_{1−δ₀}, Q_{δ₀}, Q′_{1−δ₀}, Q′_{δ₀} be distributions on Y satisfying Q = (1−δ₀)·Q_{1−δ₀} + δ₀·Q_{δ₀} and Q′ = (1−δ₀)·Q′_{1−δ₀} + δ₀·Q′_{δ₀}. Let K be a random variable on ℕ ∪ {0} and let f(x) = E[x^K] be the probability generating function of K. Define a random variable K̃ on ℕ ∪ {0} by P[K̃ = k] = P[K = k]·(1−δ₀)^k / f(1−δ₀).¹² Then, for all λ ≥ 1, we have

D^δ_λ(A^K_Q ‖ A^K_{Q′}) ≤ D_λ(A^{K̃}_{Q_{1−δ₀}} ‖ A^{K̃}_{Q′_{1−δ₀}}),

where δ = 1 − f(1−δ₀).

¹²Note that the PGF of K̃ is given by E[x^{K̃}] = Σ_{k≥0} x^k·P[K = k]·(1−δ₀)^k / f(1−δ₀) = f(x·(1−δ₀)) / f(1−δ₀).

How do we use this lemma? We should think of A^K_Q as representing the algorithm we want to analyze, where the base algorithm Q satisfies δ₀-approximate (λ, ε)-RDP. The above lemma says it suffices to analyze the algorithm A^{K̃}_{Q_{1−δ₀}}, whose base distribution satisfies (λ, ε)-RDP. We end up with a δ-approximate RDP result, where the final δ depends on δ₀ and the PGF of K. As an example, we can combine Lemma 20 with Theorem 6 to obtain the following result for the approximate case.

Corollary 21. Let Q : Xⁿ → Y be a randomized algorithm satisfying (ε₀, δ₀)-DP. Assume Y is totally ordered. Let µ > 0. Define an algorithm A : Xⁿ → Y as follows: draw K from a Poisson distribution with mean µ and run Q(x) repeatedly K times; then A(x) returns the best value from the K runs. (If K = 0, A(x) returns some arbitrary output independent from the input x.) For all λ ≤ 1 + 1/(e^{ε₀} − 1), the algorithm A satisfies δ′-approximate (λ, ε′)-RDP, where

ε′ = ε₀ + (e^{ε₀} − 1)·log(µ),   δ′ = 1 − e^{−µ·δ₀} ≤ µ·δ₀.
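Before proving these statements, we note that Corollary 21's parameters are easy to compute. The following snippet (ours) evaluates them for illustrative inputs; the function name and parameter values are our own.

```python
import math

def corollary21_params(eps0, delta0, mu):
    # Corollary 21: for a base algorithm that is (eps0, delta0)-DP and
    # K ~ Poisson(mu), the tuned algorithm is delta'-approximately
    # (lambda, eps')-RDP at the largest admissible order lambda.
    lam = 1.0 + 1.0 / math.expm1(eps0)            # lambda = 1 + 1/(e^eps0 - 1)
    eps = eps0 + math.expm1(eps0) * math.log(mu)  # eps' = eps0 + (e^eps0 - 1) log(mu)
    delta = 1.0 - math.exp(-mu * delta0)          # delta' = 1 - e^(-mu delta0) <= mu delta0
    return lam, eps, delta

print(corollary21_params(eps0=0.1, delta0=1e-8, mu=20.0))
```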
Proof of Lemma 20. For a distribution P on Y and an integer k ≥ 0, let max P^k denote the distribution on Y obtained by taking k independent samples from P and returning the maximum value per the total ordering on Y. (If k = 0, this is some arbitrary fixed distribution.) Using this notation, we can express A^K_Q as a convex combination:

A^K_Q = Σ_{k≥0} P[K = k] · max Q^k.

Suppose P = (1−δ)·P′ + δ·P″ is a convex combination. We can view sampling from P as a two-step process: first we sample a Bernoulli random variable B ∈ {0, 1} with expectation δ; if B = 0, we return a sample from P′ and, if B = 1, we return a sample from P″. Thus, if we draw k independent samples from P like this, then with probability (1−δ)^k all of these Bernoullis are 0 and we generate k samples from P′; otherwise, we generate some mix of samples from P′ and P″. Hence we can write

max P^k = (1−δ)^k · max(P′)^k + (1 − (1−δ)^k) · P̃_k

for some distribution P̃_k. It follows that we can express

A^K_Q = Σ_{k≥0} P[K = k] · max Q^k = Σ_{k≥0} P[K = k] · ( (1−δ₀)^k · max(Q_{1−δ₀})^k + (1 − (1−δ₀)^k) · P̃_k ) = f(1−δ₀) · A^{K̃}_{Q_{1−δ₀}} + (1 − f(1−δ₀)) · P*

for some distributions P̃₀, P̃₁, … and P*, where (1 − f(1−δ₀)) · P* = Σ_{k≥0} P[K = k] · (1 − (1−δ₀)^k) · P̃_k. Similarly, we can express

A^K_{Q′} = f(1−δ₀) · A^{K̃}_{Q′_{1−δ₀}} + (1 − f(1−δ₀)) · P*′

for some distribution P*′. Using these convex combinations, the definition of approximate Rényi divergence gives

D^δ_λ(A^K_Q ‖ A^K_{Q′}) ≤ D_λ(A^{K̃}_{Q_{1−δ₀}} ‖ A^{K̃}_{Q′_{1−δ₀}})

with δ = 1 − f(1−δ₀).

Proof of Corollary 21. Fix neighbouring inputs x, x′ and let Q = Q(x) and Q′ = Q(x′) be the corresponding pair of output distributions of the base algorithm. Then, in the notation of Lemma 20, A(x) = A^K_Q and A(x′) = A^K_{Q′} for K ← Poisson(µ). Setting δ′ = 1 − f(1−δ₀) = 1 − e^{−µ·δ₀} ≤ µ·δ₀, we have

D^{δ′}_λ(A(x)‖A(x′)) = D^{δ′}_λ(A^K_Q ‖ A^K_{Q′}) ≤ D_λ(A^{K̃}_{Q_{1−δ₀}} ‖ A^{K̃}_{Q′_{1−δ₀}}),

where K̃ ← Poisson(µ·(1−δ₀)). Now we apply Theorem 6 to the algorithm corresponding to the pair of distributions A^{K̃}_{Q_{1−δ₀}} and A^{K̃}_{Q′_{1−δ₀}}. The base algorithm, corresponding to the pair Q_{1−δ₀} and Q′_{1−δ₀}, satisfies (ε₀, 0)-DP. This yields

D_λ(A^{K̃}_{Q_{1−δ₀}} ‖ A^{K̃}_{Q′_{1−δ₀}}) ≤ ε₀ + log((1−δ₀)·µ)/(λ−1)

if e^{ε₀} ≤ 1 + 1/(λ−1). Setting λ = 1 + 1/(e^{ε₀} − 1) and applying monotonicity (Remark 5 and Lemma 10), we obtain the result.

E.1 TRUNCATING THE NUMBER OF REPETITIONS

Another natural question is what happens if we truncate the distribution of the number of repetitions K. For example, we may have an upper limit on the acceptable runtime. This does not require relaxing to approximate DP, as done by Liu & Talwar (2019).

Let K be the non-truncated number of repetitions and let f(x) = E[x^K] be its PGF. Let m ∈ ℕ. Let K̄ be the truncated number of repetitions; that is, P[K̄ = k] = I[k ≤ m]·P[K = k]/P[K ≤ m]. Let f̄(x) = E[x^{K̄}] be the corresponding PGF. We have f′(x) = Σ_{k≥1} k·x^{k−1}·P[K = k] and f̄′(x) = Σ_{k=1}^m k·x^{k−1}·P[K = k]/P[K ≤ m]. Hence, for x ∈ [0, 1],

0 ≤ f′(x) − f̄′(x)·P[K ≤ m] = Σ_{k=m+1}^∞ k·x^{k−1}·P[K = k] ≤ x^m·Σ_{k=m+1}^∞ k·P[K = k] = x^m·E[K·I[K > m]].

Also, f̄′(x)·P[K ≤ m] = Σ_{k=1}^m k·x^{k−1}·P[K = k] ≥ x^{m−1}·E[K·I[K ≤ m]]. Thus

1 ≤ f′(x)/(f̄′(x)·P[K ≤ m]) ≤ 1 + (x^m·E[K·I[K > m]])/(x^{m−1}·E[K·I[K ≤ m]]) = 1 + x·E[K·I[K > m]]/(E[K] − E[K·I[K > m]]).

Now we can bound the quantity of interest in Lemma 7: for all q, q′ ∈ [0, 1], we have

f̄′(q)^λ·f̄′(q′)^{1−λ} ≤ ( f′(q)/P[K ≤ m] )^λ · ( f′(q′)/(P[K ≤ m]·(1 + E[K·I[K > m]]/(E[K] − E[K·I[K > m]]))) )^{1−λ}
= f′(q)^λ·f′(q′)^{1−λ} · (1/P[K ≤ m]) · (1 + E[K·I[K > m]]/(E[K] − E[K·I[K > m]]))^{λ−1}.

This gives us a generic result for truncated distributions:

Lemma 22 (Generic Bound (cf. Lemma 7) for Truncated Distributions). Fix λ > 1 and m ∈ ℕ. Let K be a random variable supported on ℕ ∪ {0}. Let f : [0, 1] → ℝ be the probability generating function of K, i.e., f(x) := Σ_{k≥0} P[K = k]·x^k. Let Q and Q′ be distributions on Y.
Assume Y is totally ordered. Define a distribution A on Y as follows. First we sample K̄, which is K conditioned on K ≤ m, i.e., P[K̄ = k] = P[K = k | K ≤ m]. Then we sample from Q independently K̄ times and output the best of these samples.¹³ This output is a sample from A. We define A′ analogously, with Q′ in place of Q. Then

D_λ(A‖A′) ≤ D_λ(Q‖Q′) + (1/(λ−1))·log( f′(q)^λ·f′(q′)^{1−λ} ) + (1/(λ−1))·log( 1/(1 − P[K > m]) ) + log( 1 + E[K·I[K > m]]/(E[K] − E[K·I[K > m]]) ),   (9)

where q and q′ are probabilities attained by applying the same arbitrary postprocessing to Q and Q′ respectively; i.e., there exists a function g : Y → [0, 1] such that q = E_{X←Q}[g(X)] and q′ = E_{X′←Q′}[g(X′)].

This gives almost identical bounds to the non-truncated distribution as long as P[K > m] ≪ 1 and E[K·I[K > m]] ≪ E[K], which should hold for large enough m.

¹³If K̄ = 0, the output can be arbitrary, as long as it is the same for both A and A′.
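As a sanity check, the following sketch (ours) evaluates the two correction terms in (9) for K ← Poisson(µ); both decay rapidly once the truncation point m exceeds the mean, confirming the remark above. The finite summation range is a numerical convenience.

```python
import math

def poisson_pmf(k, mu):
    # P[K = k] for K ~ Poisson(mu), computed in log-space for stability
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def lemma22_corrections(mu, m, lmbda):
    # Tail mass P[K > m] and tail mean E[K * I[K > m]], truncated numerically.
    upper = 10 * m + 100
    tail = sum(poisson_pmf(k, mu) for k in range(m + 1, upper))
    tail_mean = sum(k * poisson_pmf(k, mu) for k in range(m + 1, upper))
    term1 = math.log(1.0 / (1.0 - tail)) / (lmbda - 1.0)   # log(1/(1-P[K>m]))/(lambda-1)
    term2 = math.log(1.0 + tail_mean / (mu - tail_mean))   # log(1 + E[K I]/(E[K]-E[K I]))
    return term1, term2

for m in (10, 20, 30, 40):
    print(m, lemma22_corrections(mu=10.0, m=m, lmbda=2.0))
```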