# provable_privacy_with_nonprivate_preprocessing__13a31316.pdf Provable Privacy with Non-Private Pre-Processing Yaxi Hu 1 Amartya Sanyal 1 Bernhard Sch olkopf 1 When analyzing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication, standard scaling and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines. 1. Introduction With the growing emphasis on user data privacy, Differential Privacy (DP), has become the preferred solution for safeguarding training data in Machine Learning (ML) and data analysis (Dwork et al., 2006). DP algorithms are designed to ensure that individual inputs minimally affect the algorithm s output, thus preserving privacy. This approach is now widely adopted by various organizations for conducting analyses while maintaining user privacy (Apple DP, 2017; U.S. Census Bureau, 2020). Pre-processing data is a standard practice in data analysis and machine learning. Techniques such as data imputation 1Max Planck Institute for Intelligent Systems, T ubingen, Germany. Correspondence to: Yaxi Hu , Amartya Sanyal . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). An implementation for our framework is available at https://github.com/yaxihu/ privacy-non-private-preprocessing. for handling missing values (Anil Jadhav & Ramanathan, 2019), deduplication for reducing memorization and eliminating bias (Kandpal et al., 2022; Lee et al., 2022), standard scaling for reducing the impact of outliers, and dimensionality reduction for denoising or visualization (Abadi et al., 2016; Zhou et al., 2021; Pinto et al., 2024) are commonly used. These pre-processings are also prevalent prior to applying DP algorithms, a pipeline that we term pre-processed DP pipeline. Among other uses, this has been shown to improve privacy-accuracy trade-off (Tramer & Boneh, 2021; Ganesh et al., 2023). A fundamental assumption, required by DP, is that individual data points are independent. However, if training data is used in pre-processing, this assumption is compromised. For example, when deduplicating a dataset, whether a point remains after the pre-processing is dependent on the presence of other points in its vicinity. Similarly, for mean imputation, the imputed value depends on the values of other data points. These dependencies, also evident in PCA and pre-training, can undermine the privacy guarantee of the pre-processed DP pipeline. There are multiple strategies to address this. A straightforward method to derive privacy guarantees for this pipeline is to use group privacy where the size of the group can be as large as the size of the dataset, thereby resulting in weak privacy guarantees. 
This idea, albeit not in the context of pre-processing, was previous explored under the name Dependent Differential Privacy (DDP) (Zhao et al., 2017; Liu et al., 2016). Another approach is to use public data for the pre-processing. Broadly referred to as semi-private learning algorithms, examples of these methods include pre-training on public data and learning projection functions using the public data e.g. (Pinto et al., 2024; Li et al., 2022b;a; Yu et al., 2021). Despite their success, these methods crucially rely on the availability of high-quality public data. In the absence of public data, an alternative approach is to privatise the pre-processing algorithm. However, designing new private pre-processing algorithms complicate the process, increasing the risk of privacy breaches due to errors in implementation or analysis. Moreover, private pre-processing can be statistically or computationally more demanding than private learning itself. For example, the sample complexity of DP-PCA(Chaudhuri et al., 2012) is dependent on the dimension of the training data (Liu et al., 2022), implying that the costs associated with DP-PCA Provable Privacy with Non-Private Pre-Processing could surpass the benefits of private learning in a lowerdimensional space. To circumvent these challenges, an alternative approach is non-private pre-processing with a more rigorous analysis of the entire pipeline. This method is straightforward, circumvents the need for modifying existing processes, and avoids the costs associated with private pre-processing. Naturally, this raises the important question, What is the price of non-private pre-processing in differentially private data analysis ? Our work shows that the overall privacy cost of preprocessed DP pipeline can be bounded with minimal degradation in privacy guarantee. To do this, we rely on two new technical notions: sensitivity of pre-processing functions (Definition 4) and Smooth-DP (Definition 6). In short, 1. We introduce a generic framework to quantify the privacy loss in the pre-processed DP pipeline. Applying this framework, we evaluate the impact of commonly used pre-processing techniques such as deduplication, quantization, data imputation, standard scaling, and PCA on the overall privacy guarantees. 2. We base our analysis on a novel variant of differential privacy, termed smooth DP, which may be of independent interest. We demonstrate that smooth DP retains essential DP characteristics, including postprocessing and composition capabilities, as well as a slightly weaker form of amplification by sub-sampling. 3. We propose an algorithm to balance desired privacy levels against utility in the pre-processed DP pipeline. This approach is based on the Propose-Test-Release Mechanism, allowing the user to choose desired privacy-utility trade-off. Related Work Closely related to our work, Debenedetti et al. (2023) also studies the necessity of conducting privacy analysis across the entire ML pipeline, rather than focusing solely on the training process. They identify that pre-processing steps, such as deduplication, can unintentionally introduce correlations into the pre-processed datasets, leading to privacy breaches. They show that the empirical privacy parameter of DP-SGD (Abadi et al., 2016) can be deteriorated by over five times with membership inference attack designed to exploit the correlation introduced by deduplication. 
While they show privacy attacks using deduplication with DP-SGD that can maximise the privacy loss, our work quantifies the privacy loss in deduplication as well as other pre-processing algorithms with several privacy preserving mecahnisms, thereby presenting a more holistic picture of this problem. Another line of research, focusing on privacy in correlated datasets (Liu et al., 2016; Zhao et al., 2017; Humphries et al., 2023), shows that correlations in the datasets can increase privacy risk of ML models. In response, Pufferfish Differential Privacy (Kifer & Machanavajjhala, 2014; Song et al., 2017) and Dependent Differential Privacy (Liu et al., 2016; Zhao et al., 2017) were proposed as privacy notions tailored for datasets with inherent dependencies. However, these definitions usually require complete knowledge of the datasets dependency structure or the data generating process and sometimes lead to vacuous privacy guarantees. Moreover, the application of these privacy notions usually complicates the privacy analysis, as many privacy axioms, such as composition, do not hold under these more general notions. 2. Preliminaries in Differential Privacy Before providing the main results of our work, we first introduce some common definitions and mechanisms in Differential Privacy. 2.1. R enyi Differential Privacy Differential privacy restricts the change in the output distribution of a randomized algorithm by altering a single element in the dataset. Formally, for ε, δ > 0, a randomized algorithm A satisfies (ε, δ)-DP if for any two dataset S, S that differ by exactly one element and any possible output of the algorithm O, P [A(S) O] eεP [A(S ) O] + δ. (1) In this work, we mainly focus on a stronger notion of DP, known as R enyi Differential Privacy (RDP)(Mironov, 2017), based on the R enyi divergence between two distributions. Definition 1 (R enyi Divergence). Let P, R be two probability distributions with supp(P) supp(R). Let α > 1. The R enyi divergence with order α between P and R is defined as Dα(P||R) = 1 α 1 log Ex R Let d H( , ) denote the Hamming distance between two datasets, we formally define RDP as follows. Definition 2 ((α, ε(α))-RDP). Let ε(α) be a function that maps each α to a positive real number. A randomized algorithm A is (α, ε(α))-RDP if for all α > 1, for any datasets S, S differing at a single point, it holds that Dα(A(S)||A(S )) ε(α). This definition of RDP can be easily converted to standard DP via Lemma M. While Definition 2 is unconditional on the possible set of datasets, a relaxed version of conditional RDP can be defined over a given dataset collection L such that the neighboring datasets S, S L. When the dataset collection is known in advance, the conditional definition Provable Privacy with Non-Private Pre-Processing of RDP allows for tighter privacy analysis. We use this conditional version in some of our analysis1. 2.2. Private mechanisms There are several ways to make non-private algorithms private. All of them implicitly or explicitly add carefully calibrated noise to the non-private algorithm. Below, we briefly define the three most common ways in which DP is injected in data analysis tasks and machine learning algorithms. Output perturbation The easiest way to inject DP guarantees in an estimation problem is to perturb the output of the non-private estimator with appropriately calibrated noise. Two most common ways to do so are Gaussian Mechanism and Laplace Mechanism. 
For any deterministic estimator f, both mechanisms add noise proportional to the global sensitivity f, defined as the maximum difference in f over all pairs of neighboring datasets. For a given privacy parameter ε > 0, both the Gaussian mechanism, denoted MG, and the Laplace mechanism, denoted ML, produce an output of the form f(S) + ξ. Here, ξ follows a Gaussian distribution N(0, 2 f/ε2) for the Gaussian mechanism, and a Laplace distribution Lap( f/ε) for the Laplace mechanism. Random sampling While Output perturbation is naturally suited to privatising the output of non-private estimators, it is less intuitive when selecting discrete objects from a set. In this case, a private mechanism can sample from a probability distribution defined on the set of objects. The Exponential Mechanism, denoted as ME, falls under this category and is one of the most fundamental private mechanisms. Given a score function Q with global sensitivity Q, it randomly outputs an estimator w with probability proportional to exp Q(w,S)ε Gradient perturbation Finally, most common ML applications use gradient-based algorithms to minimize a loss function on a given dataset. A common way to inject privacy in these algorithms is to introduce Gaussian noise into the gradient computations in each gradient descent step. This is referred to as Differential Private Gradient Descent (DP-GD) denoted as AGD (Bassily et al., 2014; Song et al., 2021). Other variants that are commonly used are Differential Private Stochastic Gradient Descent (DP-SGD) with subsampling (Bassily et al., 2014; Abadi et al., 2016), denoted as ASGD samp, and DP-SGD with iteration (Feldman et al., 2018), denoted as ASGD iter. We include the detailed description of each of these methods in Appendix A.2.2. 1To avoid confusion, we note that our privacy guarantees are not conditional, in the sense that it does not suffer a catastrophic failure under any dataset. 3. Main Results We first introduce a norm-based privacy notion, called Smooth RDP, that allows us to conduct a more fine-grained analysis on the impact of pre-processing algorithms. Using this definition, we establish our main results on the privacy guarantees of a pre-processed DP pipeline. 3.1. Smooth RDP Our analysis on the privacy guarantees of pre-processed DP pipelines relies on a privacy notion that ensures indistinguishability between two datasets with a bounded L12 distance. Here, the L12 distance between two datasets S and S of size n is defined as d12(S, S ) = Pn i=1 Si S i 2. We introduce this privacy notion as Smooth R enyi Differential Privacy (SRDP), defined as follows: Definition 3 ((α, ε(α, τ))-smooth RDP). Let ε(α, τ) be a function that maps each α, τ pair to a real value. A randomized algorithm A is (α, ε(α, τ))-SRDP if for each α > 1 and τ > 0, sup S,S :d12(S,S ) τ Dα(A(S)||A(S )) ε(α, τ). SRDP shares similarities with distance-based privacy notions (Lecuyer et al., 2019; Epasto et al., 2023), but they differ in a key aspect: the distance-based privacy considers neighboring datasets with bounded distance, while SRDP allows for comparison over two datasets differing in every entry. Similar to conditional RDP over a set L, we define conditional SRDP over a set L by imposing the additional assumption that S, S L in Definition 3. 
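For illustration, the following minimal sketch (our own, assuming datasets are stored as row-aligned numpy arrays of shape (n, d)) computes the two neighborhood notions involved above: the Hamming distance underlying RDP (Definition 2) and the L12 distance underlying SRDP (Definition 3).

```python
import numpy as np

def hamming_distance(S, S_prime):
    # Number of rows at which the two datasets differ (RDP neighborhood).
    return int(np.sum(np.any(S != S_prime, axis=1)))

def d12_distance(S, S_prime):
    # L12 distance: sum over rows of the Euclidean distance between
    # corresponding points (SRDP neighborhood, Definition 3).
    return float(np.sum(np.linalg.norm(S - S_prime, axis=1)))

# Example: two datasets of 5 points in R^3 that differ slightly in every row.
rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))
S_prime = S + 0.01 * rng.normal(size=(5, 3))

print(hamming_distance(S, S_prime))  # 5: every row differs, so group privacy is loose
print(d12_distance(S, S_prime))      # small tau: the SRDP regime
```

Two datasets that differ slightly in every row have Hamming distance n, so a group-privacy argument is loose, but they have a small L12 distance, which is exactly the regime in which SRDP yields tighter guarantees.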
While conditional SRDP over a set L can be considered as a special case of Pufferfish R enyi Privacy (Kifer & Machanavajjhala, 2014), we show that it satisfies desirable properties, such as sequential composition and privacy amplification by subsampling, which are not satisfied by the more general Pufferfish Renyi Privacy (Pierquin & Bellet, 2023). Properties of SRDP Similar to RDP, SRDP satisfies (sequential) composition and closure under post-processing. Lemma 1. Let L be any dataset collection. Then, the following holds. Composition Let k be a positive integer. Let α > 0, εi : R R R, i [k]. For any i [k], if the randomized algorithms Ai is (α, εi(α, τ))-SRDP over L, then the composition of the k algorithms (A1, . . . , Ak) is (α, Pk i=1 εi(α, τ))-SRDP over L. Post-processing Let α > 0, ε : R R R, and f be an arbitrary algorithm. For any τ > 0, if A is (α, ε(α, τ))- SRDP over L, then f A is (α, ε(α, τ))-SRDP over L. Provable Privacy with Non-Private Pre-Processing Table 1. RDP and SRDP parameters of DP mechanisms. We let γ and κτ denote the inverse pointwise divergence and maximum divergence between two datasets. Notation Meaning Mechanism Assumptions RDP SRDP f Output function MG f is L-Lipschitz αε2 2 2 f ML f is L-Lipschitz ε Lτ f ε Q Score function ME Q is L-Lipschitz ε Lτ Q ε ℓ Loss function AGD ℓis L-Lipschitz and µ-smooth, σ = L T εn 2αε2 αµ2τ2ε2 T Number of iterations ASGD samp ℓis L-Lipschitz and µ-smooth, σ = Ω(L T/εn), inverse point-wise divergence γ, 1 α min n ε2n2 log n2ε L 2 αµ2τ2ε2γ2 σ Variance of gradient noise ASGD iter ℓis convex, L-Lipschitz and µ-smooth, σ = 8 2 log nηL ε n , ε = O(1/nα2), maximum divergence κτ, L p 2 ατ2µ2n log(n κτ +2) 2(n κτ +1)L2 log n η Learning rate SRDP also satisfies a form of Privacy amplification by subsampling. We state the weaker version without additional assumptions in Appendix A.1 and use a stronger version for ASGD iter in Theorem 1 with some light assumptions. SRDP parameters for common private mechanisms In Theorem 1, we present the RDP and SRDP parameters for the private mechanisms discussed above. The SRDP parameter usually relies on the Lipschitzness and Smoothness (see Appendix A.1 for definitions) of the output or objective function. Theorem 1 (Informal). The DP mechanisms discussed in Section 2.2 satisfy RDP and SRDP under assumptions on the output or the objective functions. We summarize the parameters and their corresponding assumptions in Table 1. Table 1 demonstrates that the RDP parameter ε of most private mechanisms increases to O(τε) for SRDP. In contrast, a naive analysis using group privacy inflates the privacy parameter to O(nε), as any two datasets with d12 distance τ can differ by at most n entries. Hence, SRDP always leads to tighter privacy parameter than group privacy for τ = o(n), which we are able to exploit later. For DP-SGD by subsampling and iteration, the SRDP guarantee is dependent on two properties: the inverse pointwise divergence and maximum divergence of the datasets, detailed in Appendix A.2.2. Briefly, the inverse pointwise divergence γ measures the ratio between the d12 distance and the maximum pointwise distance between two datasets. The maximum divergence κτ measures the number of points that the two datasets differ. DP-SGD with subsampling benefits from a small γ, while DP-SGD with iteration benefits from a small κτ. As we will show in the later sections, at least one of these conditions are usually met in practice. 3.2. 
Privacy of Pre-Processed DP Pipelines Before stating the main result, we introduce the term datadependent pre-processing algorithm. Let X Rd be the instance space with Euclidean norm bounded by 1, i.e. x X, x 2 1. A deterministic data-dependent pre-processing algorithm π : X n F takes a dataset S as input and returns a pre-processing function πS : X X in a function space F. Intuitively, privacy of a private algorithm is retained under non-private data-dependent preprocessing, if a single element in the dataset has bounded impact on the output of the pre-processing algorithm. We define the sensitivity of a pre-processing algorithm to quantify the impact of a simgle element below. Definition 4. Let S1 = S {z1} and S2 = S {z2} be two arbitrary neighboring datasets, we define the L and L2 sensitivity of a pre-processing function π as2 (π) = sup S1,S2:d H(S1,S2)=1 d H(πS1(S), πS2(S)), 2(π) = sup S1,S2:d H(S1,S2)=1 max x S πS1(x) πS2(x) 2 . When the neighboring datasets S, S are from a dataset collection L, we define conditional sensitivity of the preprocessing algorithm π as (L, π) and (L, π) in a similar manner. The conditional sensitivity is nondecreasing as the size of the set of L increases, i.e. if L L , then (L, π) (L , π) (π) and 2(L, π) 2(L , π) 2(π) for all π. In Theorem 2, we present the privacy guarantees of preprocessed DP pipeline in terms of the RDP and SRDP parameters of the private algorithm and the sensitivity of the 2When one of πS1(x) and πS2(x) is , we define πS1(x) πS2(x) 2 = 1. Provable Privacy with Non-Private Pre-Processing pre-processing algorithms. This is a meta theorem, which we then refine to get specific guarantees for different combinations of pre-processing and private mechanisms in Theorem 7. With a slight abuse of notation, we denote the output of a private algorithm A on πS(S) as A π(S). Theorem 2. For a set of datasets L and for any α 1, τ > 0, consider an algorithm A that is (α, ε(α))-RDP and (α, ε(α, τ))-SRDP over the set L. For a pre-processing algorithm π with L sensitivity and L2 sensitivity 2, A π is (α, bε)-RDP over L for all c1, c2 1, where bε max αc1 1 c1 (α 1) ε (αc1, 2 ) + ε c1α 1 αc2 1 c2(α 1)ε (αc2) + ε c2α 1 Proof sketch. Consider two neighboring datasets S1 and S2, where S1 = S {z1}, S2 = S {z2}. Let π1 and π2 be the output functions of the pre-processing algorithm π on S1 and S2 respectively. Our objective is to upper bound the R enyi divergence between the output distribution of A on the pre-processed datasets π1(S1) and π2(S2). We proceed by constructing a new dataset S that consists of π2(S) and the point π1(z1), as indicated in Figure 4. The construction ensures that S and π2(S2) are neighboring datasets, and that the L12 distance between S and π2(S2) is upper bounded by 2 . We then apply the RDP property of A to upper bound the divergence between A( S) and A(π2(S2)). Then, we employ the SRDP property of A to upper bound the divergence between A( S) and A(π1(S1)). Finally, we establish the desired upper bound in Equation (3) by combining the previous two divergences using the weak triangle inequality of R enyi divergence (Lemma J). While the privacy guarantee provided by Theorem 2 is conditional over a dataset collection L, it can be extend to an unconditional privacy guarantee over all possible datasets using the Propose-Test-Release (PTR) framework (Dwork & Lei, 2009). We show an example of the application of PTR and its guarantees in Section 5.2. 
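To make Definition 4 concrete, the sketch below probes the sensitivities of mean imputation (analysed formally in Section 4.1) on a single pair of neighboring datasets. Since it examines only one pair rather than the suprema, it only lower-bounds the true sensitivities; the helper names are illustrative and not part of our framework implementation.

```python
import numpy as np

def mean_impute_fn(S_fit):
    # Returns the pre-processing function pi_S learned from S_fit:
    # missing entries (NaN) are replaced by the column means of S_fit.
    col_mean = np.nanmean(S_fit, axis=0)
    def pi(x):
        x = x.copy()
        mask = np.isnan(x)
        x[mask] = col_mean[mask]
        return x
    return pi

def empirical_sensitivities(S, z1, z2):
    # Probe Definition 4 for one pair of neighboring datasets
    # S1 = S u {z1}, S2 = S u {z2}; this gives a lower bound on the suprema.
    pi1 = mean_impute_fn(np.vstack([S, z1]))
    pi2 = mean_impute_fn(np.vstack([S, z2]))
    out1 = np.array([pi1(x) for x in S])
    out2 = np.array([pi2(x) for x in S])
    diff = np.linalg.norm(out1 - out2, axis=1)
    delta_inf = int(np.sum(diff > 0))   # number of points whose image changed
    delta_2 = float(diff.max())         # largest per-point L2 change
    return delta_inf, delta_2

rng = np.random.default_rng(1)
S = rng.normal(size=(100, 4))
S[rng.choice(100, size=5, replace=False), 0] = np.nan   # p = 5 missing values
z1, z2 = rng.normal(size=4), rng.normal(size=4)
print(empirical_sensitivities(S, z1, z2))
```

In this toy run only the p rows containing a missing value change, matching the intuition that the L-infinity sensitivity of mean imputation is governed by the number of missing values, while the per-point L2 change is small.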
Comparison with the group privacy or DDP analysis A naive analysis on the privacy guarantee of A π using group privacy or DDP (Zhao et al., 2017; Liu et al., 2016) provides an upper bound that grows polynomially3 with , which can be as large as poly(n). In contrast, Theorem 2 implies a tighter bound on privacy for common DP mechanisms, especially when 2 = o(n).4 Next, we present various examples of pre-processing algorithms with small 3Linearly if we consider Approximate DP. 4Theorem 2 is also applicable for other distance metrics, as long as Smooth RDP and the sensitivity of pre-processing are defined under comparable distance metric. Figure 1. Illustration of the privacy analysis: For two neighboring datasets S1, S2, a pre-processing algorithm π yields the preprocessed datasets π1(S1) and π2(S2) respectively. A synthetic dataset S is constructed by combining the pre-processed datasets, ensuring that S and π2(S2) are neighboring datasets and that S and π1(S1) have bounded L12 distance. sensitivities 2 = O(1). In Figure 2, we illustrate the improvement of the privacy analysis in Theorem 2 over the conventional analysis via group privacy for numerous pre-processing algorithms. 4. Privacy Guarantees of Common Pre-Processing Algorithms In this section, we use Theorem 2 to provide overall privacy guarantees for several common pre-processing algorithms. First, in Section 4.1, we define the pre-processing algorithms π and bound their L2 and L sensitivities. Then, in Section 4.2 we provide the actual privacy guarantees for all combinations of these pre-processing algorithms and privacy mechanisms defined in Section 2.2. We assume the instance space X to be the Euclidean ball with radius 1. 4.1. Sensitivity of Common Pre-Processing Algorithms Approximate deduplication Many machine learning models, especially Large Language Models (LLMs), are trained on internet-sourced data, often containing many duplicates or near-duplicates. These duplicates can cause issues like memorization, bias reinforcement, and longer training times. To mitigate this, approximate deduplication algorithms are used in preprocessing, as discussed in Rae et al. (2021); Touvron et al. (2023). We examine a variant of these algorithms termed η-approximate deduplication. The concept of η-approximate deduplication involves defining a good cluster. For a dataset S and a point x S, consider B(x, η; S) = { x S : x x 2 η}, a ball of radius η around x. This forms a good cluster if B(x, η; S) = B(x, 3η; S). Essentially, this means that any point within a good cluster is at least 2η distant from all other points outside the cluster in the dataset. We also define B(S) = {Bi = B(xi, η; S)}m i=1 as the set of all good Provable Privacy with Non-Private Pre-Processing clusters in a dataset S. The η-approximate deduplication process, πd η, identifies and retains only the center of each good cluster, removing all other points. Points are removed in reverse order of cluster size, prioritizing those with more duplicates. Quantization Quantization, another pre-processing algorithm for data compression and error correction, is especially useful when the dataset contains measurement errors. We describe a quantization method similar to η-approximate deduplication, denoted as πq η: it identifies all good clusters in the dataset, and replaces all points within each good cluster with the cluster s centroid in the reverse order of the size of the good clusters. 
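The following sketch implements a simplified version of the two procedures just described (good-cluster detection, η-approximate deduplication, and quantization). Tie-breaking and the exact removal order may differ from the variants analysed in the paper, so it should be read as illustrative rather than as the analysed algorithm.

```python
import numpy as np

def good_clusters(S, eta):
    # A cluster around point i is "good" if the eta-ball and the 3*eta-ball
    # around it contain exactly the same points, i.e. B(x, eta; S) = B(x, 3eta; S).
    dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    clusters = []
    for i in range(len(S)):
        ball_eta = np.flatnonzero(dists[i] <= eta)
        ball_3eta = np.flatnonzero(dists[i] <= 3 * eta)
        if len(ball_eta) == len(ball_3eta):
            clusters.append((i, ball_eta))
    # Process clusters in reverse order of their size (largest first).
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return clusters

def approx_dedup(S, eta):
    # Keep only the centre of each good cluster, drop the remaining members.
    keep = np.ones(len(S), dtype=bool)
    for center, members in good_clusters(S, eta):
        if keep[center]:
            keep[members] = False
            keep[center] = True
    return S[keep]

def quantize(S, eta):
    # Replace every point of a good cluster by the cluster centroid.
    S_out = S.copy()
    for _, members in good_clusters(S, eta):
        S_out[members] = S[members].mean(axis=0)
    return S_out

# Tiny example: 4 near-duplicates at the origin plus one far-away point.
S = np.vstack([np.zeros((4, 2)), np.ones((1, 2)) * 10])
print(len(approx_dedup(S, eta=0.1)))   # 2: one representative per good cluster
```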
The difference between de duplication and quantization is that while quantization replaces near duplicates with a representative value, deduplication removes them entirely. We discuss the L2 and L sensitivity of deduplication and quantization in Proposition 3. Proposition 3. For a dataset collection L, the L2 and L sensitivities5 of η-approximate deduplication πd η and quantization πq η are 2(L, πd η) = 1 and 2(L, πq η) = η, and (L, πd η) = (L, πq η) = max S L max B B(S) 2 |B| . While the L2 sensitivity of deduplication is a constant 1, is usually upper bounded by a small number. This is because in realistic datasets, the number of near duplicates is generally small for sufficiently small η. This leads to upper bounding the product 2 by a small number. For example, in text datasets, the fraction of near duplicates is typically smaller than 0.1, as demonstrated in Table 2 and 3 in Lee et al. (2022). Model-based imputation Survey data, such as US census data, often contains missing values, resulting from the participants unable to provide certain information, invalid responses, and changing questionnaire over time. Hence, it is crucial to process the missing values in these datasets with data imputation methods, to make optimal use of the available data for analysis while minimizing the introduction of bias into the results. We consider several imputation techniques which use the values of the dataset to impute the missing value. This can involve training a regressor to predict the missing feature based on the other feature, or simply imputing with datasetwide statistics like mean, median or trimmed mean. For the sake of clarity, we only discuss mean imputation in the main text but provide the guarantees for other imputations in Proposition 16 and Table 3 in Appendix C.2. 5When the definition of neighboring dataset is refined to addition and deletion of a single data point, the L sensitivity of both deduplication and quantization on a set L can be reduced to max S L,B B(S) |B|. Corollary 4. For a dataset collection L with maximum p missing values in any dataset, the L sensitivity of mean imputation πmean over L is p and the L2-sensitivity of πmean is upper bounded by 2 n p. Principal Component Analysis Principal Component Analysis (PCA) is a prevalent pre-processing algorithm. It computes a transformation matrix A k Rk d using the top k eigenvalues of the dataset S s covariance matrix. PCA serves two main purposes: dimension reduction and rank reduction. For dimension reduction, denoted as πPCA dim, PCA projects data into a lower-dimensional space (typically for high-dimensional data visualization), using the preprocessing function πS,dim(x) = A k x. For rank reduction, represented as πPCA rank, PCA leverages the low-rankness of the dataset with the function πS,rank(x) = Ak A k x. The primary difference between these two PCA applications is in the output dimensionality. Dimension reduction yields data of dimension k, while rank reduction maintains the original dimension d, but with a low-rank covariance matrix of rank k. We detail the L and L2 sensitivity for both PCA variants in Proposition 5. Proposition 5. For a dataset collection L, the L sensitivity of πPCA dim and πPCA rank is the size of the datasets in L, i.e. n. The L2-sensitivity of πPCA dim and πPCA rank is bounded by 2 2 and 2 respectively, where 2 = 4(3n + 2) n(n 1) min{δk min(L), δ1 min(L)}, where δk min(L) = min S L λk(S) λk+1(S) is the minimum gap between the kth and (k + 1)th eigenvalue over any covariance matrix of S L. 
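As an illustration of the two PCA pre-processing maps and of the eigengap that drives Proposition 5, consider the following sketch (the constants of Proposition 5 are not re-derived here; the example data is synthetic).

```python
import numpy as np

def pca_maps(S, k):
    # Top-k eigenvectors A_k (d x k) of the empirical covariance of S.
    cov = np.cov(S, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    A_k = eigvecs[:, order[:k]]

    def pi_dim(x):
        # Dimension reduction: project to R^k.
        return A_k.T @ x

    def pi_rank(x):
        # Rank reduction: map back to R^d; the output lies in a rank-k subspace.
        return A_k @ (A_k.T @ x)

    eigengap = eigvals[order[k - 1]] - eigvals[order[k]]  # lambda_k - lambda_{k+1}
    return pi_dim, pi_rank, eigengap

# Example: the eigengap controls the L2 sensitivity in Proposition 5.
rng = np.random.default_rng(2)
S = rng.normal(size=(200, 10)) @ np.diag([3, 3, 3, 1, 1, 1, .1, .1, .1, .1])
pi_dim, pi_rank, gap = pca_maps(S, k=3)
print(pi_dim(S[0]).shape, pi_rank(S[0]).shape, gap)
```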
Standard Scaling Scaling is one of the most common pre-processing methods. In Proposition 6, we provide the sensitivity results of standard scaling which scales each feature to have mean 0 and standard deviation 1, and min max scaling, which scales each feature to the interval between 0 and 1. Proposition 6. For a dataset collection L, the L sensitivity of standard scaling and min max scaling is the size of the datasets in L, i.e. n. The L2 sensitivity of standard scaling is 2 = 2 σ3 minn + 2 nσmin , where σmin is the minimum standard deviation over datasets in L. 4.2. Privacy Analysis for Pre-Processing Algorithms After establishing the necessary elements of our analysis, including the sensitivities of pre-processing algorithms (Section 3) and the SRDP parameters of private mechanisms (Table 1), we are ready to present the exact overall privacy guarantees for specific pre-processed DP pipelines. In Table 2, Provable Privacy with Non-Private Pre-Processing Table 2. Overall privacy guarantees bε of pre-processed DP pipelines with α 11. Let p represent the L sensitivity of deduplication, quantization and mean imputation. We also assume the size of the dataset n 101 for PCA, and the Lipschitz and smoothness parameters, along with global sensitivity, are set to 1. We omit the privacy guarantees of deduplication due to space constraint. However, we note that the privacy guarantees for deduplication are the same as those for quantization with η = 1. See Appendix D for details. Quantization Mean imputation PCA Standard Scaling MG 1.05αε2 1 + η2p2 1.05αε2 1 + 4p2 (n p)2 1.05αε2 1 + 12.22 (δk min)2 1.05αε2 1 + 4 σ3 min AGD 1.05αε2 4 + η2p2 4.2αε2 1 + p2 (n p)2 1.05αε2 4 + 12.22 (δk min)2 4.2αε2 1 + 1 σ3 min ML/ME ε (1 + ηp) ε 1 + 2p n p ε 1 + 12.2 ε 1 + 4 σ3 min ASGD samp 1.05αε2 2α + 12.22 (δk min)2 2.1αε2 α + 8 σ6 min ASGD iter 1.1αε2 n p log n p 1 + 4p2 n log n (n p)3 log n p (a) Quantization (b) Mean Imputation (c) PCA for rank reduction Figure 2. Visualization of the overall privacy of pre-processed Gaussian mechanism analysed with group privacy and our bound from Theorem 2. Here, η is the distance threshold of the quantization algorithm, n is the size of the possible datasets, and δmin is the minimum gap between the kth and the k + 1th eigenvalue of all possible datasets. we present the privacy guarantees for various combinations of pre-processing methods and private mechanisms. Theorem 7 (Informal). Pre-processed DP pipelines comprised of all combinations of private mechanisms in Section 2.2 and pre-processing algorithms in Section 4.1 are (α, bε)-RDP where bε is specified in Table 2. Table 2 shows that the pre-processing methods discussed in Section 4.1 typically lead to a minimal, constant-order increase in the privacy cost for common DP mechanisms, depending on the datasets in L. Specifically, deduplication results in a constant increase in the privacy cost when the datasets in L do not contain large clusters, i.e. = o(n), whereas quantization can handle larger clusters of size = o(n/η). The privacy cost of mean imputation remains constant for datasets with few missing values (p = o(n)). Standard scaling leads to constant amplification in privacy cost when all features in dataset collection L have nonvanishing standard deviation. For PCA, we only present the results for rank reduction and the results for dimension reduction follows similarly. The additional privacy cost remains small for low rank datasets with δmin bounded away from 0. 
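To show how such an overall guarantee is used in practice, the sketch below takes the Table 2 entry for the Gaussian mechanism with quantization, read here as bε(α) = 1.05 α ε² (1 + η² p²), and converts the resulting RDP guarantee into an (ε, δ)-DP guarantee with the standard conversion of Mironov (2017); the reading of the table entry and the numerical values are illustrative.

```python
import numpy as np

def overall_rdp_quantized_gaussian(alpha, eps, eta, p):
    # Table 2 (Gaussian mechanism + quantization), read as
    # eps_hat(alpha) = 1.05 * alpha * eps^2 * (1 + eta^2 * p^2).
    return 1.05 * alpha * eps**2 * (1 + eta**2 * p**2)

def rdp_to_dp(alpha, eps_rdp, delta):
    # Standard conversion from (alpha, eps_rdp)-RDP to (eps, delta)-DP (Mironov, 2017):
    # eps = eps_rdp + log(1/delta) / (alpha - 1).
    return eps_rdp + np.log(1.0 / delta) / (alpha - 1)

# Example: pick the RDP order alpha that gives the smallest overall DP epsilon.
eps, eta, p, delta = 0.1, 0.05, 10, 1e-5
alphas = np.arange(2, 129)
dp_eps = [rdp_to_dp(a, overall_rdp_quantized_gaussian(a, eps, eta, p), delta)
          for a in alphas]
best = int(np.argmin(dp_eps))
print(f"alpha = {alphas[best]}, overall (eps, delta)-DP epsilon ~= {dp_eps[best]:.3f}")
```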
Moreover, Figure 2 demonstrates a comparison between our SRDP-based analysis (Theorem 2) and the naive group privacy/DDP analysis for the pre-processed Gaussian mechanism. This comparison illustrates the advantage of our SRDP-based analysis over group privacy/DDP and how the privacy guarantees vary with the properties of the dataset collection L. For quantization, the SRDP-based analysis provides significantly stronger privacy guarantees for smaller η. For mean imputation, the difference between the SRDP-based analysis and the group privacy analysis is large when the number of missing values is small. However, in the extreme scenario where over 90% of the data points are missing, group privacy achieves a similar guarantee to our SRDP-based analysis. In the case of PCA for rank reduction, the privacy parameter obtained with the SRDP-based analysis remains constant, while the one obtained with group privacy grows with the size of the dataset.

5. Unconditional Privacy Guarantees in Practice

The previous section provides a comprehensive privacy analysis for pre-processed DP pipelines where the resulting privacy guarantee depends on properties of the dataset collection. In this section, we first illustrate how conditional privacy analysis can become ineffective for pathological datasets (Section 5.1). Then, we introduce a PTR-inspired framework to address this and establish unconditional privacy guarantees in Section 5.2. Using PCA for rank reduction as an example, we provide a convergence guarantee for the excess empirical loss of generalized linear models and validate our results with synthetic experiments.

5.1. Limitations of conditional privacy guarantee

The analyses in Sections 3 and 4 rely on the chosen dataset collection L. Typically, for dataset collections L that are well-structured, non-private pre-processing leads to only a slight decline in the privacy parameters, as discussed in Section 4.2. However, this degradation can become significant for datasets that exhibit pathological characteristics. In the following, we present several examples where our guarantees in Theorem 2 become vacuous due to the pathological nature of the dataset.

Imputation: If the number of missing points is comparable to the size of the dataset, i.e. p ≈ n, the privacy guarantee in Theorem 2 deteriorates to the level obtained from group privacy or DDP. This deterioration is also reflected in Figure 2b.

PCA: When performing PCA with a reduction to rank k, the privacy guarantee can worsen and scale with n for very small δ^k_min, in particular when δ^k_min = O(1/n). This situation can occur naturally when the data is high rank or when k is chosen incorrectly.

Deduplication: Consider a dataset S = {z1, ..., zn} in which each pair of distinct data points is more than η apart, yet every point lies within η of a specific reference point x, i.e. ∥zi − zj∥2 > η and ∥zi − x∥2 ≤ η for all i ≠ j ∈ [n]. Under these conditions, applying deduplication to S leaves the dataset unchanged, since no point lies within η of another and hence no non-trivial good cluster exists. However, if x is added to S, then deduplication eliminates all points except x. As a result, the L∞ sensitivity of deduplication becomes n, yielding the same privacy analysis as group privacy. Interestingly, a similar example was leveraged by Debenedetti et al. (2023) in their side-channel attack.
It is important to recognise that datasets exhibiting pathologucal characteristics, like those above, are usually not of practical interest in data analysis. For instance, applying mean imputation to a dataset where the number of missing points is nearly equal to the size of the dataset or implementing approximate deduplication that results in the removal of nearly all data are not considered sensible practice. A possible solution to this problem is to potentially refuse to provide an output when the input dataset is deemed patho- logical, provided that the decision to refuse is made in a manner that preserves privacy. This approach is adopted in the following section, where we employ the Propose-Test Release framework proposed by Dwork & Lei (2009) to establish unconditional privacy guarantees over all possible datasets, at the expense of accuracy on datasets that are pathological . 5.2. Unconditional privacy guarantees via PTR We present Algorithm 1 that applies the PTR procedure to combine the non-private PCA for rank reduction with DP-GD. It exhibits unconditional privacy guarantee over all possible datasets (Theorem 8). While this technique is specifically described for combining PCA with DP-GD, it s worth noting that the same approach can be applied to other pairings of DP mechanisms and pre-processing algorithms, though we do not explicitly detail each combination since their implementations follow the same basic principles. Algorithm 1 PTR for πPCA rank on DP-GD Input: Dataset S, estimated lower bound β of δk(S), privacy parameters ε, δ, Lipschitzness L 1: Set Γ = min S :δk(S )<β d H(S, S ) + Lap 1 δk(S ) = λk (S ) λk+1 (S ) 2: if Γ (log 2 δ )/ε then 3: Return 4: else 5: Compute the pre-processing function πS PCA rank us- ing the dataset S and set σ = 2L T εn . 6: Return ADP GD(πS PCA rank(S)) with parameter σ. 7: end if Theorem 8. For any L-Lipschitz and µ-smooth loss function ℓ, ε > 0 and δ exp 1.05ε2 1 + 12.22µ2 L2β2 , Algorithm 1 with privacy parameters ε, δ, and estimated lower bound β is (bε + ε, δ)-DP on a dataset of size n 101, where bε = 3ε r 1.05 1 + 12.22µ2 The RDP parameter on PCA with DP-GD in Table 2 with the same parameter σ can be converted to (bε, δ)- DP if it holds that for all S L, δk (S) β . In comparison, the guarantee in Theorem 8 is marginally worse by an additive ε, but it remains applicable even when δk (S) < β for some S L. However, when δk (S) < β for some S L, the bound in Table 2 degrades to the same as obtained via group privacy/DDP. For generalized linear models, we present an high probability upper bound on the excess empirical loss of Algorithm 1 conditional on the properties of the private dataset. Given a dataset S = {(xi, yi)}n i=1 and a loss function ℓ, let ˆℓS(θ) = 1 n Pn i=1 ℓ( θ, xi , yi) denote the empirical loss of a generalized linear model on the dataset S and let Provable Privacy with Non-Private Pre-Processing θ = arg minθ Rd ˆℓS(θ). For simplicity, we assume S is centered and ℓis L-Lipschitz. Proposition 9. For δ, ε defined in Algorithm 1, n 101, and any β δk (S), with probability at least 1 ξ, Algorithm 1 outputs ˆθ such that the excess empirical risk E h ˆℓS(ˆθ) i ˆℓS(θ ) = O 1 + k log 1/δ where ξ = 1 2δ exp (δk(S) β)nε 12.2 , Λ = Pd i=k+1 λi(S) where the high probability is over the randomness in Step 1 and the expectation is over the randomness of Step 6. The results follow a similar analysis as (Song et al., 2021), which shows that the convergence bound of DP-GD scales with the rank of the dataset. 
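A minimal sketch of Algorithm 1 is given below. The subroutines dp_gd and dist_lower_bound are assumed to be supplied by the caller: the former is any DP-GD implementation, the latter a lower bound on the Hamming distance from S to the nearest dataset whose eigengap falls below β, which we leave abstract here.

```python
import numpy as np

def eigengap(S, k):
    # delta_k(S) = lambda_k(S) - lambda_{k+1}(S) of the empirical covariance.
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(S, rowvar=False)))[::-1]
    return eigvals[k - 1] - eigvals[k]

def ptr_pca_dpgd(S, k, beta, eps, delta, L, T, dp_gd, dist_lower_bound, rng=None):
    # Sketch of Algorithm 1 (PTR for rank-reduction PCA followed by DP-GD).
    rng = rng or np.random.default_rng()
    n = len(S)

    # Step 1: noisy distance to the set of "unsafe" datasets (eigengap below beta).
    gamma = dist_lower_bound(S, k, beta) + rng.laplace(scale=1.0 / eps)

    # Step 2: refuse to release anything if S may be too close to that set.
    if gamma <= np.log(2.0 / delta) / eps:
        return None  # the "bottom" output

    # Steps 5-6: non-private rank-k PCA, then DP-GD on the projected data
    # with gradient noise scale sigma = 2 L sqrt(T) / (eps * n).
    eigvecs = np.linalg.eigh(np.cov(S, rowvar=False))[1]
    A_k = eigvecs[:, -k:]                 # top-k eigenvectors (columns)
    S_proj = S @ A_k @ A_k.T              # pi^PCA_rank applied to every point
    sigma = 2.0 * L * np.sqrt(T) / (eps * n)
    return dp_gd(S_proj, sigma, T)
```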
Returning to the utility guarantee in Proposition 9: when the dataset remains full rank but the first k eigenvalues dominate the rest, e.g. Λ = O(√k), the convergence bound following (Song et al., 2021) is O(√d). In contrast, Proposition 9 yields a dimension-independent convergence bound of order O(√k), with a slight degradation in the privacy guarantee. Here, β controls the trade-off between privacy and utility: a large β leads to a tighter privacy guarantee (small effective ε in Theorem 8) at the risk of worse utility (large ξ in Proposition 9).

Experimental results: We conducted experiments on a synthetic, approximately low-rank dataset to corroborate our results in Proposition 9, and summarise the results in Figure 3. We generate a 2-class low-rank dataset consisting of 1000 data points of dimension 6000 and approximate rank 50. The synthetic dataset has positive yet small eigenvalues for the kth eigenvectors with k > 50, ensuring that the quantity 2LΛ in Equation (4) is small but positive. For each overall privacy parameter ε, we evaluated the excess empirical risk of DP-SGD with: a) non-private PCA with an adjusted privacy parameter from Table 2, b) DP-PCA, with half of the privacy budget allocated to DP-PCA, and c) no pre-processing.

Figure 3 shows that the pre-processed DP pipeline outperforms logistic regression without pre-processing. This is because the dataset's approximate low-rankness, indicated by 2LΛ > 0, prevents logistic regression from leveraging the dimension-independent optimization guarantees on the original high-dimensional dataset (Song et al., 2021). However, non-private PCA, while incurring a minor constant-order privacy cost, effectively exploits the data's low-rankness and offers significant benefits, especially at smaller ε values where the optimization error is dominated by the first term in Equation (4). Figure 3 also demonstrates that DP-PCA exhibits the worst performance among the three methods, because the inherent error of order Ω(d) of DP-PCA dominates for a high-dimensional dataset with d ≫ n (Liu et al., 2022).

Figure 3. Comparison of excess empirical loss of private logistic regression: for each level of overall privacy ε, the pre-processed DP pipeline consistently outperforms the other methods.

In Appendix F, we provide the details of the experimental setup and a discussion of a practical modification of Algorithm 1 with clipping.

6. Conclusion

In this paper, we investigate the often-neglected impact of pre-processing algorithms in private ML pipelines. We propose a framework to assess the additional privacy cost from non-private pre-processing steps using two new technical notions: Smooth RDP and sensitivities of pre-processing algorithms. Finally, we propose a PTR-based procedure to relax some of the necessary assumptions in our framework and make it practically usable. Several interesting directions of future work remain unexplored, including handling more complex pre-processing algorithms, such as pre-trained deep neural feature extractors, and private algorithms such as private data synthesis.

Acknowledgements

We thank Francesco Pinto for the initial discussion of the project. We also thank Alexandru Țifea, Jiduan Wu and Omri Ben-Dov for their useful feedback. Additionally, we extend our gratitude to the reviewers of ICML for their careful reviews and insightful suggestions.

Impact Statement

Our work addresses a common practice, non-private pre-processing, that leads to additional privacy cost in ML applications.
We identify the importance of extending the privacy guarantees from the standalone ML model to the whole ML pipeline and provide simple tools to achieve it. Our work contributes to the safer use of machine learning in society. Provable Privacy with Non-Private Pre-Processing Abadi, M., Chu, A., Goodfellow, I., Mc Mahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016. Alabi, D., Mc Millan, A., Sarathy, J., Smith, A., and Vadhan, S. Differentially private simple linear regression. ar Xiv:2007.05157, 2022. Anil Jadhav, D. P. and Ramanathan, K. Comparison of performance of data imputation methods for numeric dataset. Applied Artificial Intelligence, 2019. Apple DP. Learning with privacy at scale, 2017. Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Annual Symposium on Foundations of Computer Science (FOCS), 2014. Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research (JMLR), 2011. Chaudhuri, K., Sarwate, A., and Sinha, K. Near-optimal differentially private principal components. In Conference on Neural Information Processing Systems (Neur IPS), 2012. Debenedetti, E., Severi, G., Carlini, N., Choquette-Choo, C. A., Jagielski, M., Nasr, M., Wallace, E., and Tram er, F. Privacy side channels in machine learning systems. ar Xiv:2309.05610, 2023. Dwork, C. and Lei, J. Differential privacy and robust statistics. In Annual ACM Symposium on Theory of Computing (STOC), 2009. Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 2014. Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography, 2006. Epasto, A., Mirrokni, V., Narayanan, S., and Zhong, P. k-means clustering with distance-based privacy. In Conference on Neural Information Processing Systems (Neur IPS), 2023. Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018. Ganesh, A., Haghifam, M., Nasr, M., Oh, S., Steinke, T., Thakkar, O., Thakurta, A., and Wang, L. Why is public pretraining necessary for private model training? In International Conference on Machine Learning (ICML), 2023. Horn, R. A. and Johnson, C. R. Matrix Analysis. 1985. Humphries, T., Oya, S., Tulloch, L., Rafuse, M., Goldberg, I., Hengartner, U., and Kerschbaum, F. Investigating membership inference attacks under data dependencies. In IEEE Computer Security Foundations Symposium (CSF), 2023. Kandpal, N., Wallace, E., and Raffel, C. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning (ICML), 2022. Kifer, D. and Machanavajjhala, A. Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems, 2014. Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy (SP), 2019. Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022. Li, T., Zaheer, M., Reddi, S., and Smith, V. 
Private adaptive optimization with side information. In International Conference on Machine Learning (ICML), 2022a. Li, X., Tramer, F., Liang, P., and Hashimoto, T. Large language models can be strong differentially private learners. In International Conference on Learning Representations (ICLR), 2022b. Liu, C., Chakraborty, S., and Mittal, P. Dependence makes you vulnerable: Differential privacy under dependent tuples. In Annual Network and Distributed System Security Symposium (NDSS), 2016. Liu, F. Statistical properties of sanitized results from differentially private laplace mechanism with univariate bounding constraints. ar Xiv:1607.08554, 2016. Liu, X., Kong, W., Jain, P., and Oh, S. DP-PCA: Statistically optimal and differentially private pca. In Conference on Neural Information Processing Systems (Neur IPS), 2022. Mironov, I. R enyi differential privacy. In IEEE Computer Security Foundations Symposium (CSF), 2017. Nesterov, Y. E. Introductory Lectures on Convex Optimization - A Basic Course. 2004. ISBN 978-1-4613-4691-3. Provable Privacy with Non-Private Pre-Processing Pierquin, C. and Bellet, A. R enyi pufferfish privacy: General additive noise mechanisms and privacy amplification by iteration. ar Xiv:2312.13985, 2023. Pinto, F., Hu, Y., Yang, F., and Sanyal, A. PILLAR: How to make semi-private learning more effective. IEEE Conference on Secure and Trustworthy Machine Learning (Sa TML), 2024. Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., and et al., Y. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ar Xiv:2112.11446, 2021. Schindler, D. A Variant of Weyl s Inequality for Systems of Forms and Applications. 2015. Song, S., Wang, Y., and Chaudhuri, K. Pufferfish privacy mechanisms for correlated data. In ACM International Conference on Management of Data, 2017. Song, S., Steinke, T., Thakkar, O., and Thakurta, A. Evading the curse of dimensionality in unconstrained private glms. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., and et al., B. Llama 2: Open Foundation and Fine-Tuned Chat Models. ar Xiv:2307.09288, 2023. Tramer, F. and Boneh, D. Differentially private learning needs better features (or much more data). In International Conference on Learning Representations (ICLR), 2021. U.S. Census Bureau. Understanding differential privacy, 2020. van Erven, T. and Harremo es, P. R enyi divergence and Kullback-Leibler divergence. ar Xiv:1206.2459, 2012. Yu, D., Zhang, H., Chen, W., and Liu, T.-Y. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. In International Conference on Learning Representations (ICLR), 2021. Yu, Y., Wang, T., and Samworth, R. J. A useful variant of the Davis Kahan theorem for statisticians. Biometrika, 2014. Zhao, J., Zhang, J., and Poor, H. V. Dependent differential privacy for correlated data. In IEEE Globecom Workshops (GC Wkshps), 2017. Zhou, Y., Wu, S., and Banerjee, A. Bypassing the ambient dimension: Private SGD with gradient subspace identification. In International Conference on Learning Representations (ICLR), 2021. Zwald, L. and Blanchard, G. On the convergence of eigenspaces in kernel principal component analysis. In Conference on Neural Information Processing Systems (Neur IPS), 2005. 
Provable Privacy with Non-Private Pre-Processing We present the detailed proof of our results and some additional findings in the appendix. To distinguish the lemmas from existing literature and new lemmas used in our proofs, we adopt alphabetical ordering for established results and numerical ordering for our contributions. A. Proofs regarding Smooth RDP (Section 3.1) In this section, we present the proofs regarding SRDP properties. In Appendix A.1, we introduce SRDP satisfies some desired properties, and present their proofs. Appendix A.2 provides the exact SRDP parameters of common DP mechanisms. A.1. Properties of SRDP and their proofs While conditional SRDP over a set L can be considered as a special case of Pufferfish R enyi Privacy (Kifer & Machanavajjhala, 2014), we show that it satisfies desirable properties, such as sequential composition and privacy amplification by subsampling, which are not satisfied by the more general Pufferfish Renyi Privacy (Pierquin & Bellet, 2023). Properties of SRDP Similar to RDP, SRDP satisfies (sequential) composition and closure under post-processing. Lemma 2. Let L be any dataset collection. Then, the following holds. Composition Let α > 0, εi : R R R, i [k]. For any i [k], if the randomized algorithms Ai is (α, εi(α, τ))-SRDP over L, then the composition (A1, . . . , Ak) is (α, Pk i=1 εi(α, τ))-SRDP over L. Post-processing Let α > 0, ε : R R R, and f be an arbitrary algorithm. For any τ > 0, if A is (α, ε(α, τ))-SRDP over L, then f A is (α, ε(α, τ))-SRDP over L. SRDP also satisfies a form of Privacy amplification by subsampling. We state the weaker version without additional assumptions in Appendix A.1 and use a stronger version for ASGD iter in Theorem 1 with some light assumptions. Lemma 2. Let L be any dataset collection. Then, the following holds. Composition Let α > 0, εi : R R R, i [k]. For any i [k], if the randomized algorithms Ai is (α, εi(α, τ))-SRDP over L, then the composition (A1, . . . , Ak) is (α, Pk i=1 εi(α, τ))-SRDP over L. Post-processing Let α > 0, ε : R R R, and f be an arbitrary algorithm. For any τ > 0, if A is (α, ε(α, τ))-SRDP over L, then f A is (α, ε(α, τ))-SRDP over L. Proof. Composition: For i [k] and any two datasets S, S L with d12(S, S ) τ, let νi and ν i be the distribution of Ai(S) and Ai(S ) respectively. Then, using to the independence between Ai(S) and Ai+1(S), denote the joint distribution of Ai(S) and Ai+1(S) as νi νi+1 and the joint distribution of Ai(S ) and Ai+1(S ) as ν i ν i+1. Then, we prove the composition property of by upper bounding Dα (A1(S), . . . , Ak(S)||A1(S ), . . . , Ak(S )) for S, S L with d12(S, S ) τ. To achieve this, we employ the following lemma on additivity of R enyi divergence (Lemma A). Lemma A (Additivity of Renyi divergence (van Erven & Harremo es, 2012)). For α > 1 and distributions ν1, ν2, ν 1, ν 2, Dα (ν1 ν2||ν 1 ν 2) = Dα (ν1||ν 1) + Dα (ν2||ν 2) . Applying Lemma A at step (a), Dα (A1(S), . . . , Ak(S)||A1(S ), . . . , Ak(S )) = Dα(ν1 . . . νk||ν 1 . . . ν k) (a) Dα (ν1||ν 1) + Dα (ν2 . . . νk||ν 2 . . . ν k) (b) = ε1(α, τ) + Dα (ν2 . . . νk||ν 2 . . . ν k) (c) = ε1(α, τ) + ε2(α, τ) + . . . i=1 εi(α, τ) Provable Privacy with Non-Private Pre-Processing where step (b) follows from the fact that A1 is SRDP with parameter ε1(α, τ), and step (c) is obtained by decomposing Dα(ν2 . . . νk||ν 2 . . . ν i+1) in the similar manner as step (a) and applying Lemma A iteratively. This completes the proof for composition of SRDP. 
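As a quick numerical sanity check of the additivity property (Lemma A), and of the closed form for equal-variance Gaussians used later in Lemma C, one can compare a Monte Carlo estimate of the Rényi divergence with the closed form; this is purely illustrative and not part of the proof.

```python
import numpy as np

def renyi_gaussian(alpha, mu0, mu1, sigma):
    # Closed form for equal-variance Gaussians (Lemma C):
    # D_alpha(N(mu0, sigma^2) || N(mu1, sigma^2)) = alpha (mu0 - mu1)^2 / (2 sigma^2).
    return alpha * (mu0 - mu1) ** 2 / (2 * sigma ** 2)

def renyi_mc(alpha, mu0, mu1, sigma, n=2_000_000, seed=0):
    # Monte Carlo estimate of D_alpha(P || Q) = 1/(alpha-1) log E_{x~Q}[(p(x)/q(x))^alpha],
    # with P = N(mu0, sigma^2) and Q = N(mu1, sigma^2).
    x = np.random.default_rng(seed).normal(mu1, sigma, size=n)
    log_ratio = ((x - mu1) ** 2 - (x - mu0) ** 2) / (2 * sigma ** 2)  # log p(x) - log q(x)
    return np.log(np.mean(np.exp(alpha * log_ratio))) / (alpha - 1)

alpha, sigma = 2.0, 1.0
d1 = renyi_gaussian(alpha, 0.0, 0.3, sigma)                 # first component
d2 = renyi_gaussian(alpha, 0.0, 0.4, sigma)                 # second component
d_joint = alpha * (0.3 ** 2 + 0.4 ** 2) / (2 * sigma ** 2)  # product distribution
print(d_joint, d1 + d2)                       # additivity (Lemma A): both equal 0.25
print(renyi_mc(alpha, 0.0, 0.3, sigma), d1)   # MC estimate close to the closed form
```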
Post-processing: For any S, S L with d12(S, S ) τ, let X = A(S), X = A(S ) be random variables with probability distribution νX, ν X, and Y = f(A(S)), Y = f(A(S )) be random variables with distributions νY and ν Y . For any value x, we denote the conditional distribution of Y = f(x) and Y = f(x) by νY |x and ν Y |x respectively. We note that νY |x = ν Y |x by definition. We will upper bound the R enyi divergence between f(A(S)) and f(A(S )), Dα (f(A(S))||f(A(S ))) = 1 α 1 log Ey ν Y = 1 α 1 log Ey ν Y νY |x(y)νX(x) ν Y |x(y)ν X(x) (a) 1 α 1 log Ey ν Y νY |x(y)νX(x) ν Y |x(y)ν X(x) (b) = 1 α 1 log Ex ν X α = Dα (A(S)||A(S )) where (a) follows from Jensen s inequality and the convexity of h(x) = xα for α 2 and x 0, and (b) follows from the fact that νY |x = ν Y |x for fixed x. This completes the proof. We present a general version of privacy amplification by subsampling for SRDP in Lemma 3. We consider a subsampling mechanism that uniformly selects B elements from a dataset with replacement. Lemma 3 implies that the SRDP parameter of a SRDP algorithm decreases by O(1/ n) when B = 1 for datasets of size n. Lemma 3 (Privacy amplification by subsampling of SRDP). For B 1, let π : X n X B be the subsampling mechanism that samples B elements from the dataset S of size n uniformly at random. Let A be a randomized algorithm that is (α, ε(α, τ))-SRDP. For α, τ such that ε(α, τ) 1 α 1 and for any integer 1 B , A π satisfies (α, ε (α, τ))-SRDP, where ε (α, τ) = 2 1 k B 1 Proof. Let S, S L with d12(S, S ) τ. First, order the points in S, S as S = {zi}n i=1 and S = {z i}n i=1 such that Pn i=1 zi z i 2 τ. Let m be a fixed integer. Let I = {i1, i2, . . .} be the set of indices such that for any j I, zj z j τ m holds. For 1 m n 1, if d12(S, S ) τ, there can be at most m 1 indices in I. Thus, for an index i sampled from [n] uniformly at random, we have Pi Unif([n]) h zi z i τ i = 1 Pi Unif([n]) h zi z i τ Let J = {i1, ..., i B} be a set where each index ij is sampled independently and uniformly from [n]. For any integer 1 m n 1, we compute the probability that d12(SJ, S J) τ PJ Unif([n])B i J zi z i 2 Bτ # (a) PJ Unif([n])B h i J, zi z i 2 τ i (b) 1 m 1 where step (a) follows as i J, zi z i 2 τ m implies P i J zi z i 2 Bτ m , step (b) follows by Equation (7) and the independence of each index in J. Provable Privacy with Non-Private Pre-Processing Replacing m = k B, we have for 1 k B n 1, i.e.for 1 PJ Unif([n])B i J zi z i 2 τ Next, we apply the weak convexity of R enyi divergence (Lemma B) to get the desired bound. Lemma B (Weak convexity of Renyi divergence, Lemma 25 in Feldman et al. (2018)). Let µ1, µ2 and υ1, υ2 be probability distributions over the same domain such that Dα (µi||υi) c α 1 for i {1, 2} and c (0, 1]. For β (0, 1), Dα (βµ1 + (1 β)µ2||βυ1 + (1 β)υ2) (1 + c) [βDα (µ1||υ1) + (1 β)Dα (µ2||υ2)] . Let E denote the event that the L12 distance between SJ and S J is smaller than τ i J zi z i 2 τ Unif([n])B. Then, Equation (9) implies that the probability of E is at least 1 k B 1 Let µ1, υ1 be the distribution of A(SJ) and A(S J) conditional on E occurring, and let µ2, υ2 be the distribution of A(SJ) and A(S J) conditional on E does not occur. By our assumption on τ, α, ε(α, τ) 1 α 1 and c = 1. Applying Lemma B with c = 1 in step (a), Dα (A(SJ)||A(S J)) = Dα (P [E] A(SJ|E) + P [ E] A(SJ| E)||P [E] A(S J|E) + P [ E] A(S J| E)) (a) 2P [E] ε α, τ + 2P [ E] ε (α, τ) (b) 2 1 k B 1 In step (b), we note that 2P [E] ε α, τ k + 2P [ E] ε (α, τ) increases with P [E] by the fact that ε α, τ k ε (α, τ). 
Then, we obtain the upper bound by substitute in the lower bound of P [E] from Equation (9). This completes the proof. However, under additional structural assumption on the pair of datasets that needs to remain indistinguishable, we can establish a more effective amplification of smooth RDP through subsampling. In Appendix A.2.2, we present the additional assumption with Definition 7. We also apply the stronger amplification in the proof of overall privacy guarantees for DP-SGD with subsampling. A.2. Proof of Theorem 1(Derivation of Table 1) In this section, we explore results regarding the SRDP parameters for various DP mechanisms. We start with the formal definitions of two fundamental assumptions, Lipschitzness and smoothness, in Definition 5 and 6. Then, we provide the proof of Theorem 1 (SRDP parameters in Table 1). For clarity, we present the proof for each private mechanism separately in Appendices A.2.1 and A.2.2. Definition 5 (Lipschitzness). A function f : K R is L-Lipschitz over the domain K X with respect to the distance function d : X X R+ if for any X, X K, |f(X) f(X )| Ld(X, X ). Definition 6 (Smoothness). A loss function ℓ: Θ K R+ is µ-smooth if for all θ Θ and x, x K, θℓ(θ, x) θℓ(θ, x) 2 µ x x 2. We note that Definition 6 is different from the usual definition of smoothness in the literature that requires θℓ(θ, x) θℓ(θ , x) 2 µ θ θ 2 for all x K and θ, θ Θ (Chaudhuri et al., 2011; Feldman et al., 2018). However, many loss functions, including square loss and logistic loss, are smooth on common model classes based on Definition 6. A.2.1. OUTPUT PERTURBATION AND RANDOM SAMPLING METHODS This section provides the detailed theorems and proofs for Smooth RDP parameters of output perturbation methods (Gaussian and Laplace mechanism) and random sampling methods (Exponential mechanism) in Theorem 10 to 12 respectively, as stated in Table 1. Provable Privacy with Non-Private Pre-Processing Theorem 10 (Theorem 1 for Gaussian mechanism). For an L-Lipschitz function f with global sensitivity f and the privacy parameter ε > 0, let MG(S) = f(S) + N 0, 2 f/ε2 denote the Gaussian mechanism. For a dataset collection L and for τ > 0, MG is α, αε2 2 -RDP and α, αL2τ 2ε2 2 -SRDP for α 1. Proof. We first show that MG is α, αε2 2 -RDP. For any neighboring datasets S, S with d H(S, S ) = 1, Dα (MG(S)||MG(S )) = Dα f(S) + N 0, 2 f/ε2 ||f(S ) + N 0, 2 f/ε2 = Dα N f(S), 2 f/ε2 ||N f(S ), 2 f/ε2 We apply Lemma C to upper bound the R enyi divergence between two Gaussian random variable with same variance. Lemma C (Corrolary 3 in Mironov (2017)). For any two Gaussian distributions with the same variance σ2 but different means µ0, µ1, denoted by N µ0, σ2 and N(µ1, σ2), the following holds, Dα N(µ0, σ2)||N(µ1, σ2) α (µ0 µ1)2 Dα (MG(S)||MG(S )) (a) sup (S,S ) S α (f(S) f(S ))2 where (a) follows from Lemma C and (b) follows due to the fact that |f(S) f(S )| f by the definition of global sensitivity. Then, we show that MG is α, αL2τ 2ε2 2 -SRDP over the dataset collection L. Specifically, we will show that for any τ > 0, for any two datasets S, S L with d12(S, S ) τ, Dα (A(S)||A(S )) αL2τ 2ε2 2 2 f . Let S, S L such that d12(S, S ) τ, we have Dα (MG(S)||MG(S )) = Dα N f(S), 2 f/ε2 ||N f(S ), 2 f/ε2 (a) α (f(S) f(S ))2 (b) α (L S S F )2 ε2 (c) αL2τ 2ε2 where step (a) follows by Lemma C, step (b) follows by the Lipschitzness of the function f with respect to the Frobenius norm, and step (c) follows by the fact that Frobenius norm is always smaller than L12 norm, i.e. S S F d12(S, S ). 
Theorem 11 (Theorem 1 for Laplace mechanism). For an L-Lipschitz function f with global sensitivity f and the privacy parameter ε, let ML(S) = f(S) + Lap ( f/ε) denote the Laplace mechanism. For a dataset collection L and for τ > 0, ML is (α, ε)-RDP and α, Lτ f ε -SRDP for α 1. Proof. Laplace mechanism ML is (ε, 0)-DP (Dwork et al., 2006). By the equivalence of (ε, 0)-DP and ( , ε)- RDP (Mironov, 2017), ML is ( , ε)-RDP. This implies that ML is (α, ε)-RDP for any 1 α by the monotonicity of R enyi divergence (Lemma D). Lemma D (Monotonicity of R enyi divergence (Mironov, 2017)). For 1 α < β, for any two distributions P, Q, the following holds Dα(P||Q) Dβ(P||Q). Next, we will show ML is (α, Lτε/ f)-SRDP over the set L. Specifically, we first show that ML is ( , Lτε/ f))-SRDP. Using the monotonicity of R enyi divergence (Lemma D), we can propagate this property to any α < . Provable Privacy with Non-Private Pre-Processing Let h be any output in the output space of f. For any two datasets S, S L with d12(S, S ) τ, P(ML(S) = h) P(ML(S ) = h) = ε f exp |h f(S)|ε ε f exp |h f(S )|ε f (| f(S ) + f(S)|) (b) exp Lτε where (a) follows by the L-Lipschitzness of the f with respect to Frobenius norm and (b) follows from the fact that S S F d12(S, S ) τ. This implies the output distributions have bounded infinite R enyi divergence, i.e. D (ML(S)||ML(S )) = max h H log P(ML(S) = h) P(ML(S ) = h) Lτε where the last inequality follows from Equation (11). By Monotocity of R enyi divergence (Lemma D), for any α 1, the α-R enyi divergence between the output distribution of Laplace mechanism on any two datasets S, S L with d12(S, S ) τ is also bounded by Lτε/ f. This concludes the proof. Theorem 12 (Theorem 1 for Exponential mechanism). For an L-Lipschitz score function Q with global sensitivity Q, privacy parameter ε, and a dataset collection L, the exponential mechanism, and for any τ > 0, ME is (α, ε)-RDP and (α, Lτε/ Q)-SRDP over L for any α 1. Proof. The proof for RDP of Exponential mechanism is similar to that of Laplace mechanism. As the exponential mechanism ME is (ε, 0)-DP (Dwork et al., 2006). By the equivalence of (ε, 0)-DP and ( , ε)-RDP (Mironov, 2017), ME is ( , ε)-RDP. By the monotonicity of R enyi divergence (Lemma D), ME is also (α, ε)-RDP for any α 1. To show that ME is α, Lτε -SRDP over the set L, we first show that ME is ( , Lτε/ Q)-SRDP over L and then apply the monotonicity of R enyi divergence (Lemma D). For any S, S L such that d12(S, S ) τ and any output h in the output space, P(ME(S) = h) P (ME(S ) = h) = exp Q(S,h)ε exp Q(S ,h)ε h H exp Q(S ,h)ε h H exp Q(S,h)ε exp Q(S,h)ε exp Q(S ,h)ε h H exp Q(S ,h) Q(S,h)ε 2 Q exp Q(S,h)ε h H exp Q(S,h)ε exp 2 maxh H |Q(S, h) Q(S , h)| ε (a) exp Lτε where step (a) follows by the L-Lipschitzness of the score function Q(S, h) for all h H and the fact that S S F d12(S, S ) τ. This implies bounded infinite order R enyi divergence between the two output distributions, i.e. D (ME(S)||ME(S )) = max h H log P(ME(S) = h) P(ME(S ) = h) Lτε Provable Privacy with Non-Private Pre-Processing where the last inequality follows from Equation (13). As shown in Equation (14), ME is ( , Lτε/ Q)-SRDP. This implies that ME is (α, Lτε/ Q)-SRDP for any α 1 by monotonicity of R enyi divergence (Lemma D). This completes the proof. A.2.2. GRADIENT-BASED METHODS In this section, we start with providing the formal definitions of gradient-based methods mentioned in Section 2.2 and evaluated in Table 1, including DP-GD, DP-SGD with subsampling, and DP-SGD with iteration. 
Then, we provide the detailed theorems and proofs for their Smooth RDP parameters, as stated in Table 1, in Theorems 13 to 15.

DP-GD Given an L-Lipschitz loss function ℓ, DP-GD (Bassily et al., 2014; Song et al., 2021), denoted by A_DP-GD, starts from a random initialization w_0 in the parameter space H and conducts projected gradient descent for T iterations as w_{t+1} = Π_H(w_t − η g_t), where the noisy gradient on the whole dataset is g_t = (1/n) Σ_{i=1}^n ∇_w ℓ(w_t, x_i) + N(0, σ²), and outputs the average parameter over the T iterations, (1/T) Σ_{t=1}^T w_t (a minimal code sketch of this procedure is given after Lemma F below).

DP-SGD with subsampling Another commonly used method is DP-SGD with subsampling (Abadi et al., 2016; Bassily et al., 2014), denoted by A_SGD-samp. In each gradient descent step, DP-SGD with subsampling first draws a uniform subsample of size B from the dataset without replacement. The gradient update is then performed by adding noise to the average gradient computed on this subsample; in contrast, DP-GD computes the average gradient over the entire dataset for its updates.

DP-SGD with iteration Differentially Private Fixed Gradient Descent (DP-FGD), denoted by A^{(T)}_FGD, is a variant of DP-SGD. It processes the data points in a fixed order: the gradient at the t-th step is calculated using the t-th point in the dataset, and the parameter obtained after the T-th iteration is output. DP-SGD with iteration (Feldman et al., 2018), denoted by A_SGD-iter, uses DP-FGD as a base procedure. It takes an extra parameter κ_τ, first samples an integer T uniformly from [n − κ_τ + 1] = {1, . . . , n − κ_τ + 1}, and releases the output of A^{(T)}_FGD, i.e., the result of the T-th iteration. While DP-SGD with iteration cannot take advantage of privacy amplification by subsampling in its privacy analysis, it relies on privacy amplification by iteration (Feldman et al., 2018) to achieve a privacy guarantee comparable to that of A_SGD-samp.

Theorem 13 (Theorem 1 for DP-GD). For an L-Lipschitz and µ-smooth loss function ℓ, privacy parameter ε, dataset collection L, and σ = L√T/(εn), A_GD is (α, αε²/2)-RDP and (α, αµ²τ²ε²/(2L²))-SRDP over L for any α ≥ 1 and τ > 0.

Proof. We denote each noisy gradient descent step as θ_{t+1} = θ_t − η g(θ_t, S), where g(θ_t, S) = (1/n) Σ_{i=1}^n ∇_θ ℓ(θ_t, S_i) + N(0, σ²) is the noisy gradient with σ = L√T/(εn). To show that each gradient descent step is RDP, it suffices to bound the Rényi divergence of the gradient operator g, since RDP is preserved under post-processing (Lemma E).

Lemma E (Post-processing of RDP (Mironov, 2017)). Let α ≥ 1, ε : R → R, and let f be an arbitrary algorithm. If A is (α, ε(α))-RDP, then f ∘ A is (α, ε(α))-RDP.

Now, we show that the gradient operator g is (α, αε²/(2T))-RDP. By the Lipschitzness of the loss function, ‖∇_θ ℓ(θ, S_i)‖ ≤ L for all θ and S_i. Thus, the global sensitivity of the average gradient (1/n) Σ_{i=1}^n ∇_θ ℓ(θ, S_i) is upper bounded by L/n. For any neighboring datasets S, S′ such that d_H(S, S′) = 1, and for any θ,
\[
D_\alpha\bigl(g(\theta, S)\,\|\,g(\theta, S')\bigr)
\overset{(a)}{\le} \frac{\alpha \bigl\| \frac{1}{n}\sum_{i=1}^n \nabla_\theta \ell(\theta, S_i) - \frac{1}{n}\sum_{i=1}^n \nabla_\theta \ell(\theta, S'_i) \bigr\|_2^2}{2\sigma^2}
\overset{(b)}{\le} \frac{\alpha (L/n)^2}{2\sigma^2} = \frac{\alpha\varepsilon^2}{2T},
\]
where step (a) follows from the Rényi divergence of the Gaussian mechanism (Lemma C), and step (b) follows from the fact that the sensitivity of each gradient estimate is L/n. This implies that each gradient estimate g, and thus each gradient descent step, is (α, αε²/(2T))-RDP. Applying the composition theorem of RDP (Lemma F), DP-GD with T gradient descent steps is (α, αε²/2)-RDP.

Lemma F (Composition of RDP, Proposition 1 in Mironov (2017)). For α ≥ 1, let f be (α, ε_1)-RDP and g be (α, ε_2)-RDP. Then, the mechanism (f, g) is (α, ε_1 + ε_2)-RDP.
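As referenced above, a minimal NumPy sketch of DP-GD is given below; the per-example gradient oracle grad_loss, the learning rate, and the choice of H as an L2 ball are illustrative assumptions of ours, not prescribed by the paper.

```python
import numpy as np

def dp_gd(S, grad_loss, w0, T, eps, lip, eta=0.1, radius=1.0, rng=None):
    """Sketch of DP-GD: T projected noisy gradient steps on the full dataset with
    Gaussian noise of scale sigma = L * sqrt(T) / (eps * n), returning the average
    iterate. `grad_loss(w, x)` is assumed to return a gradient of norm at most `lip`
    (the Lipschitz constant L of the loss)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(S)
    sigma = lip * np.sqrt(T) / (eps * n)
    w, iterates = np.array(w0, dtype=float), []
    for _ in range(T):
        g = np.mean([grad_loss(w, x) for x in S], axis=0)  # average gradient on the whole dataset
        g = g + rng.normal(0.0, sigma, size=g.shape)       # Gaussian perturbation
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > radius:                                  # projection Pi_H onto an L2 ball
            w = w * (radius / norm)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)                       # average parameter over T iterations
```

No per-example clipping is needed in this sketch because the loss is assumed L-Lipschitz, which is what bounds the sensitivity of the averaged gradient by L/n.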
Provable Privacy with Non-Private Pre-Processing Then, for any S, S L such that d12(S, S ) τ, we show that the R enyi divergence between the gradient estimate with S and S , Dα (g(θ, S)||g(θ, S )), is upper bounded. By the definition of gradient operator g and σ = L T εn , we have Dα (g(θ, S)||g(θ, S )) = Dα i=1 θℓ(θ, Si), L2T i=1 θℓ(θ, S i), L2T n Pn i=1 θℓ(θ, Si) 1 n Pn i=1 θℓ(θ, S i) 2 2 2 L2T (b) αε2µ2 Pn i=1 Si S i 2 2 (c) = αε2µ2τ 2 where step (a) follows from Lemma C, step (b) follows from the smoothness assumption of the loss function, i.e. θℓ(θ, Si) θℓ(θ, S i) 2 µ Si S i 2, and step (c) follows from the definition of L12 distance and the fact that d12(S, S ) τ. Applying composition of SRDP (Lemma 2) over the T gradient descent steps concludes the proof. For establishing the overall privacy guarantees for DP-SGD with subsampling and iteration, we introduce two properties of the dataset collection: the inverse point-wise distance (Definition 7) and the maximum distance (Definition 8). Then, we present the overall privacy guarantee for DP-SGD with subsampling and iteration in Theorem 13 and Theorem 15, with additional assumptions on the inverse point-wise distance and τ-constrained maximum distance of the dataset collection. Definition 7 (Inverse point-wise distance of a dataset collection L). Let k be the maximum integer such that for every pair of datasets S1 = Si 1 n i=1 L and S2 = Si 2 n i=1 L and for all i [n], Si 1 Si 2 2 d12(S1,S2) k . The inverse point-wise distance γ of a dataset collection L is defined as γ = n Definition 8 (τ-constrained maximum distance of a dataset collection L). For τ > 0, the τ-constrained maximum distance κτ of a dataset collection L is defined as the maximum Hamming distance between any two datasets S1, S2 L such that d12(S1, S2) τ. Theorem 14 (Theorem 1 for DP-SGD with subsampling). For any L-Lipschitz and µ-smooth loss function ℓ, privacy parameter ε, dataset collection L with inverse point-wise distance γ, and σ = Ω L T εn , ASGD samp is α, α2ε2 and α, αµ2τ 2ε2γ2 2L2 -SRDP over L for any τ > 0 and 1 α min n ε2n2 log n2ε L Proof. We let πu be a subsampling mechanism that uniformly samples one point from the dataset. We denote each gradient descent step as θt+1 = θt ηg(θt, S) where g(θt, S) = θℓ(θ, πu(S)) + N(0, σ2) represents the gradient estimate. We will show that each gradient estimate g is (α, α2ε2/2T)-RDP and thus, each gradient descent step satisfies RDP with the same parameters by post-processing theorem of RDP (Lemma E). Then, by the composition theorem for RDP (Lemma F) over the T gradient descent steps, we can conclude that ASGD samp is (α, α2ε2/2)-RDP. First, for each gradient estimate g(θ, πu(S)), we apply Lemma G to upper bound the R enyi divergence for each gradient step in Equation (18). Lemma G (Lemma 3 in Abadi et al. (2016)). Suppose that f : X Rp is a function with f( ) 2 L. Assume σ 1, and let i Unif([n]) represent a uniform random variable over the integers in the set [n] = 1, 2, . . . , n. Then, for any positive integer 1 α σ2 ln n σ and any pair of neighboring datasets S, S , the mechanism M(S) = f(Si) + N(0, σ2I) satisfies Dα (M(S)||M(S )) L2α(α + 1) n2(1 1/n)σ2 | {z } (I) | {z } (II) For neighboring datasets S, S and for any θ, each gradient estimate satisfies, Dα (g(θ, πu(S))||g(θ, πu(S ))) (a) cα2L2 Provable Privacy with Non-Private Pre-Processing Note that part (II) is smaller than part (I) in Equation (17) for α T ε . Hence, step (a) follows by application of Lemma G for some positive constant c 1. 
Step (b) follows by choosing σ = L Next, we will show that ASGD samp is α, ατ 2ε2µ2γ2 2L2 -SRDP. It suffices to show that each gradient estimate is α, ατ 2ε2µ2γ2 2T L2 -SRDP by the post-processing and composition theorem of SRDP (Lemma 2). Let k be the maximum integer such that for every pair S, S L with d12(S, S ) τ, the point-wise distance is less than τ/k. We note that for any S, S L with d12(S, S ) τ, then for any i [n], d12(πu(S), πu(S )) = Si S i 2 τ Next, we will upper bound the R enyi divergence between two gradient estimates. Let i Unif ([n]) be the sampled index. Then, for any S, S L such that d12(S, S ) τ, for any θ, Dα (g(θ, πu(S))||g(θ, πu(S ))) (a) Dα (g(θ, Si)||g(θ, S i)) (b) θℓ(θ, Si) θℓ(θ, S i) 2 2 2σ2 (d) ατ 2ε2µ2 2 (e) = ατ 2ε2µ2γ2 In step (a), i Unif([n]) is as defined above. Step (b) follows by the R enyi divergence of Gaussian distributions (Lemma C), and step (c) follows by the smoothness assumption of ℓ, i.e. θℓ(θ, S) θℓ(θ, S ) 2 µ S S 2 and Equation (19). Step (d) follows by the substitution of σ = L c T εn and c 1, and step (e) follows by the definition of inverse point-wise distance γ = n k as specified in Definition 7. This completes the proof. Remark 1. Without Definition 7, we can employ the weaker subsampling results for SRDP (Lemma 3). However, that would results in an extra n3/2 factor in the SRDP parameter, as Equation (20) will be replaced with Equation (21). Dα (g(θ, πu(S))||g(θ, πu(S ))) min k 1 1 2n 3 + (2n 4) 2n 3 1 ! αµ2τ 2ε2n2 (b) 2µ2τ 2ε2n3/2α where (a) is obtained by choosing k = 2n 3 and (b) follows by 1 2n 3 + (2n 4)( 2n 3 1) n(2n 3) Theorem 15 (Theorem 1 for DP-SGD with iteration). For an L-Lipschitz and µ-smooth convex loss function ℓ, let β = supz X Y 2 θℓ(θ; z) . For any privacy parameter ε, learning rate η 2/β, τ 0, dataset collection L with τ-constrained maximum distance κτ, and α > 1 such that max n L p 2(α 1)α, τL p 2(α 1)α o < q L satisfies that for any S, S L with d12(S, S ) τ, the differing points in S, S are consecutive, then ASGD iter with parameter κτ and σ = q ε is α, αε2 2 -RDP and α, ατ 2µ2n log(n κτ +2) 2(n κτ +1)L2 log n -SRDP. Proof. We first note that ASGD iter is (α, αε2 2 )-RDP following Theorem 26 in Feldman et al. (2018) (see Lemma H below for completeness). Lemma H (Privacy guarantee of SGD by iteration, Theorem 26 in Feldman et al. (2018)). Let ℓbe an convex L-Lipschitz and β-smooth loss function over Rd. Then, for any learning rate η 2 β , α > 1, σ L p 2(α 1)α, ASGD iter satisfies α, 4αL2 log n Provable Privacy with Non-Private Pre-Processing For SRDP, we consider datasets S, S L such that d12(S, S ) τ and with all differing points appearing consecutively. Without loss of generality, we assume the first differing point has index t. By the assumption on the τ-constrained maximum distance of L, there are in total κτ consecutive differing points in S, S , t n κτ + 1. In the following, we consider two cases: i) t > T, and ii) t T. As the gradient descent step with S and S are exactly the same before t, in the first case, with t > T, we have Dα (ASGD iter(S)||ASGD iter(S )) = Dα AT F GD(S)||AT F GD Dα At F GD(S)||At F GD = 0. In the second case, we first employ Lemma 4 to upper bound the R enyi divergence of the output of DPFGD at some fixed step T. Lemma 4. For τ 0, let L be a dataset collection with dataset size n and τ-constrained maximum distance κτ. Let L, τ and σ be parameters that satisfy the same assumptions as in Theorem 15. 
For any two datasets S, S L with d12(S, S ) τ, the algorithm AT FGD satisfies Dα AT FGD(S)||AT FGD(S ) 2ατ 2L σ2(n t + 1), where t n κτ + 1 being the index of the first pair of differing points. For some fixed T, by Lemma 4 the R enyi divergence between the outputs of ASGD iter on datasets S and S is upper bounded by Dα AT F GD(S)||AT F GD(S ) 2ατ 2L σ2(T t + 1). (22) Then, as T is a uniform random variable in ASGD iter(S), we can upper bound the R enyi divergence at some random time T by the weak convexity of R enyi divergence (Lemma B). We note that t T in case (ii), then for all T, as α satisfies σ τL p Dα AT F GD(S)||AT F GD(S ) 2ατ 2L2 σ2 (T t + 1) 2ατ 2L2 Therefore, we can apply Lemma B with c = 1. For any t T, Dα (ASGD iter(S)||ASGD iter(S )) (a) 2 n κτ + 1 T [n κτ +1] Dα AT FGD(S)||AT FGD(S ) (b) 2 n κτ + 1 σ2 (T t + 1) (n κτ + 1)σ2 log (n κτ t + 2) σ2 log(n κτ + 2) where (a) follows applying Lemma B and (b) follows from Equation (22) for T t. Substituting σ concludes the proof. Proof of Lemma 4. Consider two datasets S, S L that satisfy the assumptions in Lemma 4. Define the point-wise distance between these datasets as di = Si S i 2. Let t be the first index where di > 0. By the assumption that differing points in S, S are consecutively ordered, it follows that t n κτ + 1. By the definition of τ-constrained maximum distance (Definition 8) and the assumptions on the dataset collection L, we have, κτ , i { t, t + 1, . . . , t + κτ} (24) Provable Privacy with Non-Private Pre-Processing and di = 0 otherwise. We will define Contractive Noisy Iteration (CNI) (Definition 9) and construct a CNI that outputs AT FGD after T steps. Definition 9 (Contractive Noisy Iteration (CNI)). Given an initial state θ0 Θ, a sequence of contractive functions ψt : Θ Θ, and a noise parameter σ > 0, the Contractive Noisy Iteration (θ0, {θt} , N(0, σ2)) is defined by the following update rule: θt+1 = ψt+1(θt) + Zt, where Zt N(0, σ2). For S, S L, we construct two series of contractive function {ψi} and {ψ i} as the gradient descent on the ith data point of S and S respectively. Formally, ψi(θ) = θ η θ(θ, Si) ψ i(θ) = θ η θ(θ, S i) (25) The functions ψi and ψ i are contractive functions for η 2/β (Nesterov, 2004). It follows by the definition of DPFGD that θT = AT FGD(S) and θ T = AT FGD(S ) are the T th outputs of the CNIs (θ0, {ψt} , N(0, (ησ)2)) and (θ0, {ψ t} , N(0, (ησ)2)) respectively. For these two CNIs, we can apply Lemma I to upper bound the R enyi divergence between their T th outputs. Lemma I (Theorem 22 in Feldman et al. (2018) with fixed noise distribution). Let XT , X T denote the output of two Contractive Noisy Iteration (X0, {ψt}, N(0, (ησ)2)) and (X0, {ψ t}, N(0, (ησ)2)) after T steps. Let st = supx ψt(x) ψ t(x) 2. Let a1, . . . , a T be a sequence of reals such that zt = P i t ai 0 for all t < T and z T = 0. Then, Dα (XT ||X T ) i=1 Dα N(0, η2σ2)||N(ai, η2σ2) Following the definition in Lemma I, for the two contractive noisy maps (θ0, {ψt} , N(0, (ησ)2)) and (θ0, {ψ t} , N(0, (ησ)2)), we define si as si = sup θ ψi(θ) ψ i(θ) 2 (a) sup θ θ ηg(θ, Si) θ + ηg(θ, S i) 2 (b) ηµdi (c) = κτ i { t, t + 1, . . . , t + κτ} 0 Otherwise where step (a) follows by Equation (25), step (b) follows by the smoothness assumption on the loss function ℓ, i.e. θℓ(θ, Si) θℓ(θ, S i) 2 µ Si S i 2 for any θ, Si, S i. Step (c) follows by Equation (24). By choosing ai = 0 for all i < t and ai = ητL T t+1 for i t, we have that for all t T κτ +1, zt = P i t ai 0 and z T = 0. 
This allows for the application of Lemma I, Dα AT FGD(S)||AT FGD(S) (a) 2α η2σ2 (b) 2α η2σ2 where step (a) follows from the application Lemma I and R enyi divergence of two Gaussian distributions (Lemma C). Step (b) follows by the definition of ai. Provable Privacy with Non-Private Pre-Processing Figure 4. Illustration of the privacy analysis B. Proof of Meta-Theorem (Theorem 2) Theorem 2. For a set of datasets L and for any α 1, τ > 0, consider an algorithm A that is (α, ε(α))-RDP and (α, ε(α, τ))-SRDP over the set L. For a pre-processing algorithm π with L sensitivity and L2 sensitivity 2, A π is (α, bε)-RDP over L for all c1, c2 1, where bε max αc1 1 c1 (α 1) ε (αc1, 2 ) + ε c1α 1 αc2 1 c2(α 1)ε (αc2) + ε c2α 1 Proof. Consider two neighboring datasets S1 and S2, where S1 = S {z1} and S2 = S {z2}. Let π1 and π2 be the pre-processing functions output by the pre-processing algorithm π on S1 and S2 respectively. Our objective is to establish an upper bound on the R enyi divergence between the output distribution of A on the pre-processed dataset π1(S1) and π2(S2), i.e. Dα (A(π1(S1))||A(π2(S2))) and Dα (A(π2(S2))||A(π1(S1))). In the following, we first derive an upper bound on the R enyi divergence between A(π1(S1)) and A(π2(S2)). To do so, we construct a new dataset S using the components of π1(S1) and π2(S2) as indicated in Figure 4, i.e. S = π1(S1) π2(S2). Then, using the same approach, we will upper bound the R enyi divergence between A(π2(S2)) and A(π1(S1)). By contruction, S and π2(S2) are neighboring datasets, and that the L12 distance between S and π2(S2) is upper bounded by 2 . Using the RDP property of algorithm A we upper bound the divergence between A( S) and A(π2(S2)), Dα A(π1(S1))||A( S) ε(α, 2). (28) Similarly, using the SRDP property of the algorithm A over L, we upper bound the divergence between A( S) and A(π1(S1)), Dα A( S)||A(π2(S2)) ε(α). (29) Now, we combine Equations (28) and (29) using the weak triangle inequality of R enyi divergence (Lemma J), to upper bound the R enyi divergence between A(π1(S1)) and A(π2(S2). Lemma J (Triangle inequality of R enyi divergence (Mironov, 2017)). Let µ1, µ2, µ3 be distributions with the same support. Then, for α > 1, p, q > 1 such that 1 q = 1, it holds that Dα(µ1||µ2) α 1/p α 1 Dpα(µ1||µ3) + Dq(α 1/p)(µ3||µ2). Provable Privacy with Non-Private Pre-Processing Using Lemma J, for any c1 > 1, we have Dα(A(π1(S1))||A(π2(S2))) α 1 c1 α 1 Dc1α(A(π1(S1))|A( S)) + D c1α 1 c1 1 (A( S)||A(π2(S2))) (a) αc1 1 c1(α 1) ε(αc1, 2) + ε c1α 1 where step (a) follows from Equations (28) and (29). Similarly, by constructing a dataset S consisting of π1(S) and π2(z2), we upper bound Dα A(π2(S2))||A( S) and Dα A( S)||A(π1(S1)) with the SRDP and RDP property in a similar manner as Equations (28) and (29). Applying Lemma J, we can show that for any c2 > 1, Dα(A(π2(S2))||A(π1(S1))) αc2 1 c2(α 1)ε(αc2) + ε c2α 1 Combining Equation (30) and Equation (31) concludes the proof. Provable Privacy with Non-Private Pre-Processing C. Proofs for the sensitivity of different pre-processing algorithms (Section 4.1) In this section, we bound the sensitivity of pre-processing algorithms discussed in Section 4.1. C.1. Sensitivity analysis of deduplication and quantization Proposition 3. For a dataset collection L, the L2 and L sensitivities6 of η-approximate deduplication πd η and quantization πq η are 2(L, πd η) = 1 and 2(L, πq η) = η, and (L, πd η) = (L, πq η) = max S L max B B(S) 2 |B| . Proof. Consider two neighboring datasets S1, S2 L. 
Without loss of generality, let S = S1 S2, z1 = S1 \ S and z2 = S2 \ S, similar to the notations of original datasets in Figure 4. Sensitivity analysis of deduplication As the data space X is bounded by 1, it is obvious that the upper bound on L2 sensitivity of πd η is 1. To bound the L sensitivity, we first recall the definition of a good cluster. For a dataset S and a point x S, define B(x, η; S) = { x S : x x 2 η}, a ball of radius η around x. A point x is the centroid of a good cluster if B(x, η; S) = B(x, 3η; S). The set of all good clusters in a dataset S is denoted by B(S) = {Bi}m i=1 where Bi := B(xi, η; S) for all xi S satisfying B(xi, η; S) = B(xi, 3η; S). We will first prove that the difference between the datasets πd η,S1(S) and πd η,S(S) is the ball B(z1, η; S1). Similarly, we show that the maximum difference between πd η,S2(S) and πd η,S(S) is B(z2, η; S2). Taking supremum over all neighboring datasets in L, these two results imply that the L sensitivity of deduplication is upper bounded by twice the size of the largest good cluster in any dataset S L. To calculate that the maximum difference between the datasets πd η,S1(S) and πd η,S(S), we assume without loss of generality that there exists x1, . . . , xk S satisfying z1 xi η for all i [k]. We consider the following cases: Case I: x1, . . . xk are not in any good cluster centered at some c S. We will discuss the two sub-cases: one where the point z1 forms the centroid of a good cluster, and another where it does not. If the point z1 is the centroid of a good cluster, then x1, . . . xk are the only points in the good cluster B(z1, η; S) and will be removed by πd η,S1. In contrast, in Case I, x1, . . . , xk will not be removed by πd η,S. Hence, the difference between πd η,S1(S) and πd η,S(S) is {x1, . . . , xk} B(z1, η; S). If the point z1 is not the centroid of a good cluster, then z1 is not in any good cluster. We will prove this claim by contradiction. Assume z1 is in a good cluster centered at some c S. Then, for any xi, i [k], xi c 2 xi z1 2 + z1 c 2 2η. (32) If xi c 2 η, then xi is also in the good cluster around c S, contradicting the assumption that none of {x1, . . . xk} is in any good cluster. On the other hand, if η xi c 2 2η, then B(c, 3η; S) cannot be a good cluster. Therefore, we have shown that when x1, . . . xk are not in any good cluster and the point z1 is not the centroid of a good cluster, then none of {x, . . . , xk, z1} is in a good cluster. In this case, πd η,S(S) = πd η,S1(S). Case II: There exists some point x {x1, . . . , xk} in a good cluster centered at c S, i.e.x B(c, η; S) and B(c, η; S) is a good cluster. We first consider the effect of πd η,S and πd η,S1 on the single point x. Note that z1 c 2 z1 x 2 + x c 2 2η. (33) If z1 c 2 η, B(c, η; S1) = B(c, η; S {z1}) is also a good cluster. Then, πd η,S1 and πd η,S1 has the same effect on x. 6When the definition of neighboring dataset is refined to addition and deletion of a single data point, the L sensitivity of both deduplication and quantization on a set L can be reduced to max S L,B B(S) |B|. Provable Privacy with Non-Private Pre-Processing If η z1 c 2 2η, then z1 B(c, 3η; S1) but z1 / B(c, η; S1). This implies B(c, 3η; S1) = B(c, η; S1), and B(c, η; S1) is not a good cluster. Therefore, πd η,S removes the point x, while πd η,S1 does not. In this case, x is a different point between πd η,S(S) and πd η,S1(S). We note that there are at most k different points between πd η,S(S) and πd η,S1(S), when all points {x1, . . . 
, xk} are in some good cluster B(ci, η; S) for ci S and z1 is selected such that none of B(ci, η; S1) remains to be a good cluster. In this case, the difference between πd η,S(S) and πd η,S1(S) is {x1, . . . , xk} B(z1, η; S). Following a similar argument, we can show that the maximum number of different points between πd η,S2(S) and πd η,S(S) is k and this maximum set of different points is a subset of B(z2, η; S2). This concludes the proof for the sensitivity of deduplication. Senstivity analysis of quantization The analysis of L sensitivity of quantization is the same as that of deduplication. To get the L2 sensitivity of quantization, we consider two cases. If a point is in a good cluster, then quantization process change this point to the centroid of the cluster. The L2 distance incurred by the pre-processing is upper bounded by η by the definition of η-quantization. If a point is not in a good cluster, it remains unchanged after quantization. Combining the two cases, the L2 sensitivity of quantization is upper bounded by η. C.2. Sensitivity analysis of model-based imputation In this section, we first provide a general result on the sensitivity for any model-based imputation method. We then introduce specific imputation methods, including mean imputation (Corollary 4 in the main text), median imputation, trimmed mean imputation, and linear regression. We summarize their sensitivity results in Table 3. For a given model f, the imputation algorithm πf first generates an imputation function f S by fitting the model f to a dataset S. Then, it replaces each missing value in the dataset with the prediction based on the imputation function f S. The L2 and L sensitivity of model-based imputation is presented in Proposition 16. Several widely used models for imputation include mean, median, trimmed mean, and linear regression, described as follows. Mean imputation replaces the missing values in the jth feature with the empirical mean of the available data for that feature. On the other hand, median imputation replaces each missing value with the median of the non-missing points in the corresponding feature. The trimmed mean estimator with parameter m is an interpolation between the mean and median. It estimates each missing value by computing the mean after removing the m smallest and largest points from the remaining points of the feature that is not missing. The above methods address the missing values at a feature j using data from that feature alone. In contrast, linear regression also employs the information of the other features of the missing point. Specifically, it estimates the missing value by the prediction of a linear regression on the jth feature using some or all other features in the dataset. Many imputation methods, such as median and linear regression, do not have bounded global sensitivity even when the instance space is bounded (Alabi et al., 2022). However, the local sensitivity of these methods are usually bounded on well-behaved datasets. Given a collection of well-behaved datasets L, the L2 sensitivity over the collection is the upper bound on the local sensitivity of all datasets S L. We exploit this property to find the L2 sensitivity of the aforementioned imputation methods. We introduce additional notation and provide a summary of the L2 sensitivity for different methods in Table 3. Proposition 16 (Sensitivity of model-based imputation). 
For a dataset collection L, the L sensitivity of model-based imputation πf over L is the maximum number of missing values present in any dataset in L. Furthermore, the L2-sensitivity of πf over L is given by 2(L, πf) = max S ,S L d H(S,S )=1 j=1 (f S(xj) f S (xj))2, where xj denotes the jth feature of x. Specifically, for mean imputation, median imputation, trimmed mean imputation, and regression, we present their sensitivities and corresponding assumptions in Table 3. Provable Privacy with Non-Private Pre-Processing Notation Meaning Imputation Model f L2-Sensitivity 2(L, πf) p Maximum number of missing points in any S L Mean 2 n p xmin (m) Minimum mth ordered statistics of any feature in any S L Median xmax ((n+1)/2) xmin (n/2) xmax (m) Maximum mth ordered statistics of any feature in any S L m-Trimmed Mean xmax (m) xmin (n m)/n 2m p λmax Maximum eigenvalue of XT X, for any S = (X, Y ) L Linear regression λ2 max (λmax+1)λ2 min + 1 λmin λmin Minimum eigenvalue of XT X, for any S = (X, Y ) L Table 3. Notations and L2 sensitivity of different imputation models Proof. It is obvious that the L sensitivity of model-based imputation is upper bounded by the number of entries with missing values in any of the dataset S L. The L2 sensitivity is upper bounded by the sensitivity of the model f over the set L. This concludes the proof. L2 Sensitivity of mean over L For neighboring datasets S1, S2 L, write S1 = S {z1} and S2 = S {z2}. Without loss of generality, denote S = {si}n 1 i=1 . For each data point si, we denote its jth feature as sij. Also, we denote the number of available data points for the jth feature as nj. In the following, we derive an upper bound on the L2 sensitivity of mean imputation 2(L, πmean), 2(L, πmean)2 i=1 sij + z1j i=1 sij + z2j nj (z1j z2j) 2 j=1 (z1j z2j)2 = 4 (n p)2 where the last inequality follows by the fact that the instance space is bounded with diameter 1. Taking square root from both side of Equation (34) completes the proof. L2 Sensitivity of median and m-trimmed mean over L The L2-sensitivity of median and m-trimmed mean over L follows directly by the definition. L2 Sensitivity of linear regression πLR over L For linear regression, we present the L2 sensitivity by considering two datasets X, X L where X = X {z}. For any j [d], imputation with linear regression of the feature Xj looks at the a submatrix of X that does not include the jth feature. Denote the submatrix that is used for linear regression as Xr (Xr includes a subset of features, specified by the index r, of the original dataset X). Let Σ = 1 n X X, and let Σr = 1 n X r Xr be a principal matrix of Σ with submatrix Xr. We calculate the L2 sensitivity of linear regression on imputing the point xj below, 2(L, πLR) = x r (X r Xr) 1X r Xj x r (X r Xr + zz ) 1(X r Xj + zzj) xr (X r Xr) 1X r Xj (X r Xr + zz ) 1(X r Xj + zzj) (X r Xr) 1 (X r Xr + zz ) 1 X r Xj (X r Xr + zz ) 1zzj (X r Xr) 1 (X r Xr + zz ) 1 X r Xj + (X r Xr + zz ) 1zzj (a) (X r Xr) 1zz (X r Xr) 1 (1 + z (X r Xr) 1z) X r Xj (b) λ2 max (λmax + 1)λ2 min + 1 λmin Provable Privacy with Non-Private Pre-Processing where step (a) follows from Sherman-Morrison-Formula and that the eigenvalue of a principal submatrix is always larger than the smallest eigenvalue of the original matrix following Theorem 4.3.15 in (Horn & Johnson, 1985) and then taking supremum over all X L. 
Similarly, step (b) follows from the fact that the eigenvalue of a principal submatrix is always smaller than the largest eigenvalue of the original matrix following Theorem 4.3.15 in (Horn & Johnson, 1985) and then taking supremum over all X L. C.3. Sensitivity analysis of PCA Proposition 5. For a dataset collection L, the L sensitivity of πPCA dim and πPCA rank is the size of the datasets in L, i.e. n. The L2-sensitivity of πPCA dim and πPCA rank is bounded by 2 2 and 2 respectively, where 2 = 4(3n + 2) n(n 1) min{δk min(L), δ1 min(L)}, where δk min(L) = min S L λk(S) λk+1(S) is the minimum gap between the kth and (k + 1)th eigenvalue over any covariance matrix of S L. Proof. PCA for dimension reduction: For any two neighboring datasets S, S, without loss of generality, we denote S = {xi}n i=1, and S = {xi}n 1 i=1 {x n}. Let Σ, Σ denote their empirical covariance matrices and ˆµ, ˆµ denote the empirical mean. Let Ak and Ak be the matrix consisting of first k eigenvectors of Σ and Σ respectively. First, for any x X, we upper bound the A k x A k x F by a linear function of Σ Σ F using properties of the dataset collection L, A k x A k x F (a) Ak A k F (b) = Tr 2(I A k Ak) (36) where step (a) follows from Cauchy-Schwarz inequality and bounded instance space and step (b) follows from the definition of Frobenius norm A F = Tr(A A) and the orthonormality of Ak, Ak. Let σi denote the ith singular value of A k Ak, and let θi be the ith canonical angle of A k Ak, i.e. cos θi = σi. We can write Tr 2(I A k Ak) as follows, Tr 2(I A k Ak) (a) = 2 i=1 (cos θi)2 ! i=1 (sin θi)2 ! (d) 4 Σ Σ F min{δk min(L), δ1 min(L)}. where step (a) and (b) are due to the definition of singular value and canonical angle, step (c) follows from the fact that (sin θ)2 + (cos θ)2 = 1 for any θ, and step (d) follows by Davis-Kahan Theorem (Lemma K) stated below. Lemma K (Davis-Kahan Theorem (Yu et al., 2014)). Let Σ, ˆΣ Rd d be symmetric and positive definite, with eigenvalues λ1, . . . λd and ˆλ1 . . . ˆλd respectively. For k d, let V and ˆV be the dataset whose matrices consisting of the first k eigenvectors of Σ and ˆΣ respectively. Then, sin Θ(V, ˆV ) F 2 min Σ ˆΣ F min{λk λk+1, λ1 λ2}, where Θ(V, ˆV ) denotes the k k diagonal matrix of the principal angles between two subspaces V and ˆV . It remains to upper bound the Frobenius norm of Σ Σ F . By the definition of the empirical covariance matrix, we Provable Privacy with Non-Private Pre-Processing decompose Σ as i=1 (xi ˆµ )(xi ˆµ ) + 1 n 1(x n ˆµ )(x n ˆµ ) (a) = 1 n 1 i=1 (xi µ 1 n (x n xn))(xi µ 1 n (x n xn)) | {z } part I + 1 n 1(x n µ 1 n (x n xn))(x n µ 1 n (x n xn)) | {z } part II where (a) follows from µ = µ + 1 n (x n xn). Part (I) can be written as Part I = 1 n 1 i=1 (xi µ 1 n (x n xn))(xi µ 1 n (x n xn)) i=1 (xi µ)(xi µ) 1 n(n 1) (xi µ)(x n xn) + (x n xn)(xi µ) n2 (x n xn)(x n xn) =Σ 1 n 1(xn µ)(xn µ) + 1 n(n 1) (xn µ)(x n xn) + (x n xn)(xn µ) n2 (x n xn)(x n xn) Similarly, we can write part (II) as Part II = 1 n 1(x n µ 1 n (x n xn))(x n µ 1 n (x n xn)) =(x n µ)(x n µ) n 1 (x n µ)(x n xn) n(n 1) (x n xn)(x n µ) + (x n xn)(x n xn) Substituting Equation (39) and Equation (40) into Equation (38), and by the fact that xx T F 1 for x 2 1, we get Σ Σ F 2(3n + 2) n(n 1) . (41) Finally, substituting Equation (41) into Equation (37), and then Equation (37) into Equation (36) completes the proof. PCA for rank reduction: For any two neighboring datasets S, S L, we define the notations of Σ, Σ, Ak, Ak similarly as in the proof for πPCA dim. 
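Before completing the rank-reduction case, we note that the sensitivity bound in Proposition 5 is easy to evaluate numerically, and that, combined with Δ∞ = n, it appears to be the source of the constant 12.2 in Theorem 17 (for n ≥ 101 the product n · Δ2 is at most 12.2/δmin). The sketch below reads Proposition 5 as Δ2 = 4(3n + 2)/(n(n − 1) · min{δ^k_min(L), δ^1_min(L)}); the function names are ours.

```python
def pca_l2_sensitivity(n, delta_min):
    """Delta_2 from Proposition 5, read as 4(3n + 2) / (n (n - 1) * delta_min),
    where delta_min lower-bounds min{delta^k_min(L), delta^1_min(L)} and must be
    supplied by the analyst (e.g. as a data-independent assumption)."""
    return 4 * (3 * n + 2) / (n * (n - 1) * delta_min)

def pca_linf_times_l2(n, delta_min):
    """Delta_inf * Delta_2 with Delta_inf = n; for n >= 101 this is at most 12.2 / delta_min."""
    return n * pca_l2_sensitivity(n, delta_min)

print(pca_linf_times_l2(101, 1.0))      # 12.2
print(pca_linf_times_l2(100_000, 1.0))  # ~12.0
```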
We state Lemma L, which is used to upper bound Ak A k x Ak A k x F . Lemma L (Simplified version of Theorem 3 in (Zwald & Blanchard, 2005)). Let A be a symmetric positive definite matrix with eigenvalues λ1 > λ2 > . . . > λd. Let B be a symmetric positive matrix. For an integer k > 0, let Ak be the matrix consisting the first k eigenvectors of A and Ak be the matrix consisting of the first k eigenvectors of A + B. Then, Ak and Ak satisfy that Ak A k Ak A k 2 B λk λk+1 . Provable Privacy with Non-Private Pre-Processing Applying Lemma L with A + B = Σ and A = Σ, we can show an upper bound on the term Ak A k x Ak A k x F for any x X. Ak A k x Ak A k x F Σ Σ F λk(S) λk+1(S) 4(3n + 2) n(n 1)δmin(S) (42) where the last inequality follows by Equation (41) and the definition of δmin(S) = min{δk min(S), δ1 min(S)}. Taking the supremum over all dataset S L concludes the proof. C.4. Sensitivity Analysis of Scaling Proposition 6. For a dataset collection L, the L sensitivity of standard scaling and min max scaling is the size of the datasets in L, i.e. n. The L2 sensitivity of standard scaling is 2 = 2 σ3 minn + 2 nσmin , where σmin is the minimum standard deviation over datasets in L. Proof. The L sensitivity for both scaling methods is trivially upper bounded by the size of the datasets in L. In the following, we prove the L2 sensitivity for standard scaling and min max scaling respectively. Proof of L2 sensitivity for standard scaling For any two neighboring datasets S, S , let µ, µ denote the mean of S, S respectively and σ, σ denote the standard deviation of S, S respectively. Then, the L2 sensitivity of standard scaling is 2 = max S,S ,x = max S,S ,x σ x σ µ σx + σµ = max S,S ,x (σ σ) (x µ) 2 + max S,S (a) 2 max S,S σ σ σ2 min + max S,S µ µ where step (a) follows from x µ 2 2 for any x and S and the definition of σmin. From Liu (2016), the global sensitivity of sample mean and variance are 2 n. We then show that for a dataset collection L, the global sensitivity conditional on L is upper bounded by 1/σminn. For any S, S with sample variances σ2, (σ )2, we have 2σmin 1 σminn, (44) where the first inequality follows by rearranging σ2 (σ )2 = |(σ σ ) (σ + σ )| 2σmin |σ σ |, and the second inequality by substituting the global sensitivity of sample covariance from Liu (2016). Substituting the sensitivity of sample mean 2/n and sample standard deviation 1 σminn into Equation (43), we conclude the proof. Provable Privacy with Non-Private Pre-Processing D. Proofs for the overall privacy guarantees (Section 4.2) In this section, we provide the proofs of the overall privacy guarantees for specific pre-processed DP pipelines, as stated in Table 2 in Section 4.2. For clarity, we restate the full version of Theorem 7 and specify the privacy guarantee for each category of privacy mechanisms (each row in Table 2) in Theorem 17. Then, we present the proof for each category separately. As the proof of DP-GD is the same as that of Gaussian mechanism, we omit the proof for DP-GD for simplicity. Theorem 17 (Full version of Theorem 7). Let p denote the L sensitivity of deduplication, quantization and mean imputation. Let n min {101, p} be the size of any dataset in the dataset collection L. Let ℓbe a 1-Lipschitz and 1-smooth loss function. For an output function f and a score function Q, assume their Lipschitz parameter and global sensitivity are both 1. 
Then, (i) Gaussian mechanism with output function f satisfy α, 1.05αε2(1 + p2) -RDP, α, 1.05αε2(1 + η2p2) -RDP, α, 1.05αε2 1 + 4p2 (n p)2 -RDP, α, 1.05αε2 1 + 12.22 -RDP and 1.05αε2 1 + 4 σ3 min -RDP when coupled with deduplication, quantization, mean imputation, PCA for rank reduction and standard scaling respectively. (ii) Exponential mechanism with score function Q and Laplace mechanism with output function f satisfy (α, ε(1+p))-RDP, (α, ε(1 + ηp))-RDP, α, ε 1 + 2p n p -RDP, α, ε 1 + 12.2 δmin -RDP and 4.2αε2 1 + 1 σ3 min -RDP when coupled with deduplication, quantization, mean imputation, PCA and standard scaling for rank reduction respectively. (iii) DP-SGD with subsampling with loss function ℓsatisfies α, 1.05αε2 2α + 12.22 -RDP and 2.1αε2 α + 8 σ6 min - RDP when coupled with PCA for rank reduction and standard scaling respectively. (iv) DP-SGD with iteration with loss function ℓcoupled with deduplication, quantization and mean imputation satisfy α, 1.1αε2 1 + p2n log(n p) (n p) log n -RDP, α, 1.1αε2 1 + η2p2n log(n p) (n p) log n -RDP, and α, 1.1αε2 1 + 4p2n log(n p) (n p)3 log n -RDP respectively. Proof of Theorem 17 (i). We apply Theorem 2 with c1 = c2 = 2, bε(α) max 2α 1 2(α 1) ε(2α, 2 2) + ε(2α 1), 2α 1 2(α 1)ε(2α) + ε(2α 1, 2 ) (a) max 2α 1 2(α 1) ε(2α, 2 ) + 2α 1 2(α 1)ε(2α), 2α 1 2(α 1)ε(2α) + 2α 1 2(α 1) ε(2α, 2 ) 2(α 1) ( ε(2α, 2 2) + ε(2α)) (b) 1.05 ( ε(2α, 2 ) + ε(2α)) . where step (a) follows from the monotonicity of Renyi Divergence (Lemma D) and step (b) follows from 2α 1 2(α 1) 1.05 for α > 11. By substituting the expression of RDP and SRDP parameter from Table 1 for Gaussian mechanism, the L2 and L sensitivity ( 2 and ) for deduplication, quantization, PCA, standard scaling and mean imputation from Proposition 3, 5 and 6 and Corollary 4, and the Lipschitz parameter and global sensitivity of the output function f into Equation (45), we complete the proof of the overall privacy guarantees for Gaussian mechanism. Proof of Theorem 17 (ii). We apply Theorem 2 with c1 = c2 = 1, bε(α) max α 1 α 1 ε(α, 2 2) + ε( ), α 1 α 1ε(α) + ε( , 2 ) = max{ ε(α, 2 ) + ε( ), ε(α) + ε( , 2 )} (46) Provable Privacy with Non-Private Pre-Processing We then derive an upper bound on bε(α) by monotonicity of RDP, bε(α) (a) bε( ) (b) ε( , 2 ) + ε( ) (47) where step (a) follows from Lemma D and step (b) follows by setting α = in Equation (46). By substituting the expression of RDP and SRDP parameter from Table 1 for Laplace mechanism, the L2 and L sensitivity ( 2 and ) for deduplication, quantization, imputation and PCA from Proposition 3, Corollary 4 and Proposition 5, and the Lipschitz parameter and global sensitivity of the output function f into Equation (45), we complete the proof of the overall privacy guarantees for Laplace mechanism. Similarly, for exponential mechanism, we substitute the expression of RDP and SRDP parameter from Table 1 for exponential mechanism, the L2 and L sensitivity ( 2 and ) for deduplication, quantization, PCA, standard scaling and mean imputation from Proposition 3, 5 and 6 and Corollary 4, and the Lipschitzness and the global sensitivity of the score function Q, into Equation (45). This completes the proof of the overall privacy guarantees for exponential mechanism. In the following, we provide the proof of combining DP-SGD with subsampling with PCA for rank reduction because datasets preprocessed by PCA for rank reduction satisfies Definition 7 automatically. Proof of Theorem 17 (iii). 
We first note that for PCA, for any two neighboring datasets, S1, S2, by the definition of sensitivity of PCA, πPCA rank(S1), πPCA rank(S2) and πPCA dim(S1), πPCA dim(S2) has inverse point-wise divergence 1 because = n. Following a similar argument as in the proof Gaussian mechanism (Theorem 17 (i)), we set the parameters in Theorem 2 c1 = c2 = 2 and obtain Equation (45). Then, we substitute in Equation (45) the following parameters: a)RDP and SRDP parameters for DP-SGD with subsampling from Table 1, b)γ = 1, c) Lipschitzness and smoothness parameter of the loss function ℓ, and d) L2 and L sensitivity ( 2 and ) for PCA and standard scaling from Proposition 5 and 6. This completes the proof. For DP-SGD with iteration, the pre-processing mechanisms achieve a tighter bound while satisfying n by Definition 8. Therefore, this method will not provide tighter bound for PCA and standard scaling. However, it improves the privacy analysis for imputation, deduplication and quantization, as proved below. Proof of Theorem 17 (iv). We first note that the maximum divergence κτ of datasets pre-processed by imputation, deduplication or quantization is upper bounded by = p. Following a similar argument as in the proof of Gaussian mechanism (Theorem 17 (i)), we set the parameters in Theorem 2 c1 = c2 = 2. We substitute in Equation (45) the following parameters: a)RDP and SRDP parameters for DP-SGD with iteration from Table 1, b)κτ = = p, c) Lipschitzness and smoothness parameter of the loss function ℓ, and d) L2 and L sensitivity ( 2 and ) for quantization, deduplication and imputation from from Proposition 3 and Corollary 4. This completes the proof. Provable Privacy with Non-Private Pre-Processing E. Privacy and accuracy guarantees of Algorithm 1 (Section 5.2) In this section, we present the proofs for the theoretical guarantees of Algorithm 1, as stated in Section 5.2. Specifically, we provide the proofs for the privacy guarantee (Theorem 8) and the accuracy guarantee (Proposition 9) of Algorithm 1. Theorem 8. For any L-Lipschitz and µ-smooth loss function ℓ, ε > 0 and δ exp 1.05ε2 1 + 12.22µ2 L2β2 , Algorithm 1 with privacy parameters ε, δ, and estimated lower bound β is (bε + ε, δ)-DP on a dataset of size n 101, where bε = 1.05 1 + 12.22µ2 Proof. For brevity, we denote Algorithm 1 as APTR. For any two neighboring datasets S1, S2 L, we consider two cases: i) the minimum eigen-gap of either S1 or S2 is smaller than or equal to β, and ii) the minimum eigen-gap of both S1 and S2 is greater than β. Case (i) Without loss of generality, we assume the eigen-gap of S1 is smaller than or equal to β. We will show that for any neighboring dataset S2 of S1, i.e. d H(S1, S2) = 1, APTR satisfies the following inequality, for any output set O H { } P [APTR(S1) O] eϵP [APTR(S2) O] + δ First consider the case O H. Then, assuming that the output is with high probability i.e. P [APTR(S1) = ] δ/2, we have that P[APTR(S1) O] P[APTR(S1) = ] δ 2 eϵP[APTR(S2) O] + δ Now we show P [APTR(S1) = ] δ/2. Given a input dataset S, we denote the Γ in step 1 of Algorithm 1 by Γ(S). P(APTR(S1) = ) (a) P Γ(S1) log 2 (b) = P Lap 1 where step (a) follows because the algorithm does not return only if Γ(S1) log 2 δ ε (Line 5 in Algorithm 1), and step (b) follows from min S :δmin(S ) β d H(S , S1) = 0 as S1 itself satisfies δmin(S1) β. Step (c) follows by the tail bound of Laplace distributions: for a positive real number t > 0, Pz Lap(0,b)(z t) e t/b. Now consider the only other possible output O = { }. 
We note that for any neighboring dataset S2 of S1, min S :δmin(S ) β d H(S , S1) = 0, min S :δmin(S ) β d H(S , S2) 1 (51) This implies P Γ(S2) log 2 δ ε 1 . (52) Define J(S) = min S :δmin(S )<β d H(S, S ). It is simple to note that J has global sensitivity 1 due to Equation (51). Thus, Γ is essentially Laplace mechanism on J and thereby (ε, 0)-DP. This implies that for any neighboring datasets S2 of S1, P(APTR(S1) = ) P(APTR(S2) = ) = P Γ(S1) log 2 P Γ(S2) log 2 δ ε (a) P Lap 1 δ ε 1 (b) eε (53) where the numerator of step (a) follows by Equation (50) and the denominator of step (a) follows by Equation (52). Step (b) follows by the tail bound of Laplace distributions. Combining Equation (50) and Equation (53), we show that Equation (48) holds in case (i). Case (ii) Consider S1, S2 whose kth-eigengaps are greater than β. We will show that the following holds, for any O H { } P [APTR(S1) O] ebεP [APTR(S2) O] + δ Provable Privacy with Non-Private Pre-Processing where bε = 3ε r 1.05 1 + 12.22µ2 δ . First consider the case O = { }. Following a similar argument as Equation (53), we have P [APTR(S1) = ] P [APTR(S2) = ] eε ebε. (55) Now consider the case when the output is not . Then, choosing any α 11 yields that the privacy parameter of DP-GD combined with non-private PCA is α, 1.05αε 1 + 12.22µ2 L2β2 -RDP from Table 2. Then, invoking Lemma M converts the RDP parameter to Approximate DP. Lemma M (RDP to Approximate DP (Mironov, 2017)). If A is an (α, ε)-RDP algorithm, then for 0 < δ < 1, it satisfies ε + log 1 δ α 1 , δ -differential privacy. In particular, we choose α = r δ 1.05ε2 1+ 12.22µ2 L2β2 + 1 11 and obtain that the output in step 6 of Algorithm 1 is (bε, δ)-DP, 1.05 1 + 12.22µ2 δ + 1.05ε 1 + 12.22µ2 1.05 1 + 12.22µ2 Here, step (a) follows by Lemma M with chosen α and step (b) utilizes the fact that δ satisfies log 1 δ 1.05ε2 1 + 12.22µ2 This algorithm also discloses information about the dataset regarding its minimum eigen-gap. Specifically, when the minimum eigen-gap of the private dataset is smaller than β, then with high probability the output is . However, the additional privacy cost incurred is smaller than that of releasing Γ directly. As Γ is a Laplace mechanism with global sensitivity 1, releasing Γ satisfies (ε, 0)-DP. By basic composition of approximate differential privacy (Lemma N), we can combine the privacy cost for releasing the results from DP-GD and releasing Γ. Thus, Algorithm 1 is (bε + ε, δ)-DP under case (ii). Lemma N (Basic composition of differential privacy (Dwork & Roth, 2014)). If algorithm A1 is (ε1, δ1)-DP and A2 is (ε2, δ2)-DP, then (A1, A2) is (ε1 + ε2, δ1 + δ2)-DP. Combining the two cases concludes the proof. Proposition 9. For δ, ε defined in Algorithm 1, n 101, and any β δk (S), with probability at least 1 ξ, Algorithm 1 outputs ˆθ such that the excess empirical risk E h ˆℓS(ˆθ) i ˆℓS(θ ) = O 1 + k log 1/δ where ξ = 1 2δ exp (δk(S) β)nε 12.2 , Λ = Pd i=k+1 λi(S) where the high probability is over the randomness in Step 1 and the expectation is over the randomness of Step 6. Proof. We first prove that the probability that Algorithm 1 outputs is small. Then, we show that in the even that it does not output but a real vector, then the excess empirical risk is small. Let Sd H = {S, S | d H(S, S ) = 1 δmin(S), δmin(S ) > 0}. Let ΣS, ΣS denote the covariance matrices of S and S respectively and δk (S) denote the kth eigen-gap of ΣS. 
Then, for any S, S Sd H, we first upper bound |δk (S) δk (S )| |δk(S ) δk(S)| = λk( Σ) λk+1( Σ) λk(Σ) + λk+1(Σ) λk( Σ) λk(Σ) + λk+1(Σ) λk+1( Σ) . (56) To bound this term, we upper bound λk( Σ) λk(Σ) for any k. Using Weyl s inequality (Schindler, 2015) λj(Σ) λj( Σ) Σ Σ op Σ Σ F 6.1 Provable Privacy with Non-Private Pre-Processing where the last inequality follows from a similar argument as Equation (41) in the proof of Proposition 5 for n 101. This yields |δk(S ) δk(S)| 12.2 Thus, for any S0, Sm with d H(S0, Sm) = m, we can construct a series of datasets S1, . . . , Sm 1 such that d H(Sj, Sj+1) = 1 for j {0, ..., m 1}. By applying Equation (58) iteratively over each i, we have δk(S0) δk(Sm) + 12.2m More generally, for any pairs of dataset S, S , we have d H(S, S ) |δk (S) δk (S )| n min S :δk(S ) β d H(S, S ) (δk(S) β) n 12.2 . (61) Now, we upper bound the probability that Algorithm 1 does not output . P Γ (S) log 1 = P min S :δk(S ) β d H(S, S ) + Lap 1 P (δk(S) β) n 12.2 + Lap 1 2 exp ε log 1 δ ε (δk(S) β) n 2δ exp β δk(S) where step (a) follows from the tail bound of Laplace distribution when δk(S) β + 12.2 log 1 We have shown that Algorithm 1 does not output with probability 1 1 2δ exp β δk(S) 12.2 nε . It remains to derive the convergence guarantee of Algorithm 1 when the output is not . To show this, we utilize the property that PCA transforms the dataset to have a low-rank covariance matrix, which allows us to apply the dimension-independent convergence guarantee of DP-GD following the analysis of Song et al. (2021) and establish the desired convergence guarantee. Let A k Rk d be the matrix consisting of the first k eigenvectors of the covariance matrix of S, Σ = 1 n Pn i=1 xix i . Let Sk = {(Ak A k xi, yi)}n i=1. We can decompose the error into three terms, E h ˆℓS(ˆθ) i ℓS(θ ) E h ˆℓS(ˆθ) i E h ˆℓSk(ˆθ) i | {z } (a) + E h ˆℓSk(ˆθ) i ℓSk(θ ) | {z } (b) + |ℓS(θ ) ℓSk(θ )| | {z } (c) By Theorem 3.1 in Song et al. (2021), part (b) is upper bounded by O L Lemma O (Theorem 3.1 in Song et al. (2021)). Let θ0 = 0p be the initial point of AGD. Let the dataset be centered at 0. Let k be the rank of the projector to the eigenspace of the covariance matrix Pn i=1 xix i . For a L-Lipschitz loss function ℓ, AGD with T = n2ε2 with appropriate learning rate η output ˆθ satisfying E h ˆℓS(ˆθ) i ˆℓS(θ ) O 1 + 2k log 1 Provable Privacy with Non-Private Pre-Processing Next, we show that part (a) and (c) are upper bounded by L Pn i=k+1 λi(S) = LΛ. To show this, we prove that for any θ Bd 2, ˆℓSk(θ) ˆℓS(θ) L Pn i=k+1 λi(S). ˆℓSk(θ) ˆℓS(θ) = i=1 ℓ(θ Ak A k xi, yi) 1 i=1 ℓ(θ xi, yi) i=1 L θ Ak A k xi θ xi 2 i=1 L Ak A k xi xi 2 i=k+1 λi(S) = LΛ where step (a) follows from Lipschitzness of the loss function ℓ, step (b) follows due to the projection step in DP-GD projects θ to the Euclidean ball with radius 1, and step (c) is obtained by substituting the reconstruction error of PCA, which is Pd i=k+1 λi(S). This concludes the proof. Provable Privacy with Non-Private Pre-Processing F. Experiment setups and discussion F.1. Experiment setups Data generation The synthetic data is generated with the make_classification function in the sklearn library. We generate a 2-class low rank dataset consisting of 1000 data points with dimension 6000 and rank 50. We set the parameter n_cluster_per_class in make_classification to 1. Models and allocation of privacy budget We compared the excess empirical loss of three models in Figure 3. We provide the details of the models below. 
Pre-processed DP pipeline: We employ non-private PCA to reduce the dimensionality of the original dataset to k and then apply private logistic regression. In particular, we use the make_private_with_epsilon method from the Opacus library with the PyTorch SGD optimizer, learning rate 1e-2, max_grad_norm = 10, and epochs = 10. The privacy parameters epsilon and delta are obtained by adjusting the desired overall privacy level with our framework (Theorem 17).

No pre-processing: We directly apply private logistic regression with the same parameters on the original high-dimensional dataset.

DP-PCA: We first implement the DP-PCA of Chaudhuri et al. (2012) and then apply private logistic regression. We allocate half of the privacy budget to DP-PCA and the remaining half to private logistic regression.

F.2. Discussion on clipping

In practice, when the sensitivity of the original function is unbounded, one can apply clipping to restrict the sensitivity (Abadi et al., 2016; Liu et al., 2022). We can also incorporate clipping into our pre-processed DP pipeline. Specifically, we first non-privately compute the PCA projection matrix A_k from the original dataset S. We clip the original data to S_clipped with a clipping threshold C ∈ {0.1R, 0.7R, 0.99R}, where R denotes the maximum norm over the original dataset. Then, we apply the projection matrix to the clipped dataset to obtain the pre-processed dataset S_preprocessed = A_k A_k^T S_clipped. Finally, we apply private logistic regression on S_preprocessed, with the privacy parameter set by our framework. As shown in Table 4, the effect of clipping on performance depends on the clipping threshold. This is because clipping reduces the L2 sensitivity of PCA from 12R/(nδ^k_min) to 12C/(nδ^k_min), allowing a larger privacy parameter to be used during private learning and hence better utility. However, clipping also introduces inaccuracies in the dataset.

Table 4. Excess empirical loss of the clipped pre-processed DP pipeline

ε      Clipping 0.1    Clipping 0.7    Clipping 0.99    Pre-processed DP pipeline
1.0    0.84            0.89            0.88             0.89
2.0    0.84            0.89            0.89             0.89
5.0    0.85            0.90            0.90             0.90

We also note that R is dataset-dependent, and its use might lead to additional privacy leakage.
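For completeness, a minimal NumPy sketch of the clipped pre-processing step described above is given below; the function names and the centering used before the SVD are our own choices rather than details fixed by the paper.

```python
import numpy as np

def clipped_pca_preprocess(X, k, clip_frac=0.7):
    """Clipped pre-processed pipeline (Appendix F.2, sketch): compute the top-k PCA
    projection non-privately from the raw data, clip each row to norm C = clip_frac * R,
    then project the clipped data."""
    R = np.linalg.norm(X, axis=1).max()                  # maximum row norm (data-dependent)
    C = clip_frac * R
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    A_k = Vt[:k].T                                       # d x k matrix of top-k principal directions
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_clipped = X * np.minimum(1.0, C / np.maximum(row_norms, 1e-12))
    X_pre = X_clipped @ A_k @ A_k.T                      # S_preprocessed = A_k A_k^T S_clipped
    return X_pre, C
```

The DP training step (e.g. Opacus' make_private_with_epsilon, as in Appendix F.1) is then run on X_pre, with (epsilon, delta) adjusted through the framework (Theorem 17) using C in place of R in the sensitivity bound, as discussed above.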