# easy_differentially_private_linear_regression__df67c3a1.pdf

Published as a conference paper at ICLR 2023

EASY DIFFERENTIALLY PRIVATE LINEAR REGRESSION

Kareem Amin Matthew Joseph M onica Ribero Sergei Vassilvitskii

Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present signiﬁcant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given n samples of d-dimensional data used to train m models, we construct an efﬁcient analogue using an approximate Tukey depth that runs in time O(d2n + dm log(m)). We ﬁnd that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.

1 INTRODUCTION

Existing methods for differentially private linear regression include objective perturbation (Kifer et al., 2012), ordinary least squares (OLS) using noisy sufﬁcient statistics (Dwork et al., 2014; Wang, 2018; Sheffet, 2019), and DP-SGD (Abadi et al., 2016). Carefully applied, these methods can obtain high utility in certain settings. However, each method also has its drawbacks. Objective perturbation and sufﬁcient statistics require the user to provide bounds on the feature and label norms, and DP-SGD requires extensive hyperparameter tuning (of clipping norm, learning rate, batch size, and so on).

In practice, users of differentially private algorithms struggle to provide instance-speciﬁc inputs like feature and label norms without looking at the private data (Sarathy et al., 2022). Unfortunately, looking at the private data also nulliﬁes the desired differential privacy guarantee. Similarly, while recent work has advanced the state of the art of private hyperparameter tuning (Liu & Talwar, 2019; Papernot & Steinke, 2022), non-private hyperparameter tuning remains the most common and highest utility approach in practice. Even ignoring its (typically elided) privacy cost, this tuning adds signiﬁcant time and implementation overhead. Both considerations present obstacles to differentially private linear regression in practice.

With these challenges in mind, the goal of this work is to provide an easy differentially private linear regression algorithm that works quickly and with no user input beyond the data itself. Here, ease refers to the experience of end users. The algorithm we propose requires care to construct and implement, but it only requires an end user to specify their dataset and desired level of privacy. We also emphasize that ease of use, while nice to have, is not itself the primary goal. Ease of use affects both privacy and utility, as an algorithm that is difﬁcult to use will sacriﬁce one or both when data bounds and hyperparameters are imperfectly set.

1.1 CONTRIBUTIONS

Our algorithm generalizes previous work by Alabi et al. (2022), which proposes a differentially private variant of the Theil-Sen estimator for one-dimensional linear regression (Theil, 1992). The core idea is to partition the data into m subsets, non-privately estimate a regression model on each, and then apply the exponential mechanism with some notion of depth to privately estimate a high-depth model from a restricted domain that the end user speciﬁes. In the simple one-dimensional case (Alabi et al.,

{kamin, mtjoseph, mribero, sergeiv}@google.com. Part of this work done while M onica was at UT Austin.

Published as a conference paper at ICLR 2023

2022) each model is a slope, the natural notion of high depth is the median, and the user provides an interval for candidate slopes.

We generalize this in two ways to obtain our algorithm, Tukey EM. The ﬁrst step is to replace the median with a multidimensional analogue based on Tukey depth. Second, we adapt a technique based on propose-test-release (PTR), originally introduced by Brown et al. (2021) for private estimation of unbounded Gaussians, to construct an algorithm which does not require bounds on the domain for the overall exponential mechanism. We ﬁnd that a version of Tukey EM using an approximate and efﬁciently computable notion of Tukey depth achieves empirical performance competitive with (and often exceeding) that of non-privately tuned baseline private linear regression algorithms, across several synthetic and real datasets. We highlight that the approximation only affects utility and efﬁciency; Tukey EM is still differentially private. Given an instance where Tukey EM constructs m models from n samples of d-dimensional data, the main guarantee for our algorithm is the following:

Theorem 1.1. Tukey EM is (ε, δ)-DP and takes time O(d2n + dm log(m)).

Two caveats apply. First, our use of PTR comes at the cost of an approximate (ε, δ)-DP guarantee as well as a failure probability: depending on the dataset, it is possible that the PTR step fails, and no regression model is output. Second, the algorithm technically has one hyperparameter, the number m of models trained. Our mitigation of both issues is empirical. Across several datasets, we observe that a simple heuristic about the relationship between the number of samples n and the number of features d, derived from synthetic experiments, typically sufﬁces to ensure that the PTR step passes and speciﬁes a high-utility choice of m. For the bulk of our experiments, the required relationship is on the order of n 1000 d. We emphasize that this heuristic is based only on the data dimensions n and d and does not require further knowledge of the data itself.

1.2 RELATED WORK

Linear regression is a speciﬁc instance of the more general problem of convex optimization. Ignoring dependence on the parameter and input space diameter for brevity, DP-SGD (Bassily et al., 2014) and objective perturbation (Kifer et al., 2012) obtain the optimal O(

d/ε) error for empirical risk minimization. Ada OPS and Ada SSP also match this bound (Wang, 2018). Similar results are known for population loss (Bassily et al., 2019), and still stronger results using additional statistical assumptions on the data (Cai et al., 2020; Varshney et al., 2022). Recent work provides theoretical guarantees with no boundedness assumptions on the features or labels (Milionis et al., 2022) but requires bounds on the data s covariance matrix to use an efﬁcient subroutine for private Gaussian estimation and does not include an empirical evaluation. The main difference between these works and ours is empirical utility without data bounds and hyperparameter tuning.

Another relevant work is that of Liu et al. (2022), which also composes a PTR step adapted from Brown et al. (2021) with a call to a restricted exponential mechanism. The main drawback of this work is that, as with the previous work (Brown et al., 2021), neither the PTR step nor the restricted exponential mechanism step is efﬁcient. This applies to other works that have applied Tukey depth to private estimation as well (Beimel et al., 2019; Kaplan et al., 2020; Liu et al., 2021; Ramsay & Chenouri, 2021). The main difference between these works and ours is that our approach produces an efﬁcient, implemented mechanism.

Finally, concurrent independent work by Cumings-Menon (2022) also studies the usage of Tukey depth, as well as the separate notion of regression depth, to privately select from a collection of non-private regression models. A few differences exist between their work and ours. First, they rely on additive noise scaled to smooth sensitivity to construct a private estimate of a high-depth point. Second, their methods are not computationally efﬁcient beyond small d, and are only evaluated for d 2. Third, their methods require the end user to specify bounds on the parameter space.

2 PRELIMINARIES

We start with the deﬁnition of differential privacy, using the add-remove variant.

Deﬁnition 2.1 (Dwork et al. (2006)). Databases D, D from data domain X are neighbors, denoted D D , if they differ in the presence or absence of a single record. A randomized mechanism

Published as a conference paper at ICLR 2023

M : X Y is (ε, δ)-differentially private (DP) if for all D D X and any S Y

PM [M(D) S] eεPM [M(D ) S] + δ.

When δ = 0, M is ε-DP. One general ε-DP algorithm is the exponential mechanism. Deﬁnition 2.2 (Mc Sherry & Talwar (2007)). Given database D and utility function u : X Y R mapping (database, output) pairs to scores with sensitivity

u = max D D ,y Y |u(D, y) u(D , y)|,

the exponential mechanism selects item y Y with probability proportional to exp ( ϵu(D,y)

2 u ). We say the utility function u is monotonic if, for D1 D2, for any y, u(D1, y) u(D2, y). Given monotonic u, the 2 inside the exponent denominator can be dropped. Lemma 2.3 (Mc Sherry & Talwar (2007)). The exponential mechanism is ϵ-DP.

Finally, we deﬁne Tukey depth. Deﬁnition 2.4 (Tukey (1975)). A halfspace hv is deﬁned by a vector v Rd, hv = {y Rd | v, y 0}. Let D Rd be a collection of n points. The Tukey depth TD(y) of a point y Rd with respect to D is the minimum number of points in D in any halfspace containing y,

TD(y) = min hv|y hv

Note that for a collection of n points, the maximum possible Tukey depth is n/2. We will prove a theoretical utility result for a version of our algorithm that uses exact Tukey depth. However, Tukey depth is NP-hard to compute for arbitrary d (Johnson & Preparata, 1978), so our experiments instead use a notion of approximate Tukey depth that can be computed efﬁciently. The approximate notion of Tukey depth only takes a minimum over the 2d halfspaces corresponding to the canonical basis. Deﬁnition 2.5. Let E = {e1, ..., ed} be the canonical basis for Rd and let D Rd. The approximate Tukey depth of a point y Rd with respect to D, denoted TD(y), is the minimum number of points in D in any of the 2d halfspaces determined by E containing y,

TD(y) = min ej|ej E,y hyi ej

x D 1x hyi ej .

Stated more plainly, approximate Tukey depth only evaluates depth with respect to the d axis-aligned directions. A simple illustration of the difference between exact and approximate Tukey depth appears in Figure 3 in the Appendix s Section 7.1. For both exact and approximate Tukey depth, when D is clear from context, we omit it for neatness.

3 MAIN ALGORITHM

Our algorithm, Tukey EM, consists of four steps:

1. Randomly partition the dataset into m subsets, non-privately compute the OLS estimator on each subset, and collect the m estimators into set {βi}m i=1. 2. Compute the volumes of regions of different approximate Tukey depths with respect to {βi}m i=1. 3. Run a propose-test-release (PTR) algorithm using these volumes. If it passes, set B to be the region of Rd with approximate Tukey depth at least m/4 in {βi}m i=1 and proceed to the next step. If not, release (failure). 4. If the previous step succeeds, apply the exponential mechanism, using approximate Tukey depth as the utility function, to privately select a point from B.

A basic utility result for the version of Tukey EM using exact Tukey depth appears below. The result is a direct application of work from Brown et al. (2021), and the (short) proof appears in the Appendix s Section 7.2..

Published as a conference paper at ICLR 2023

Theorem 3.1 (Brown et al. (2021)). Let 0 < α, γ < 1 and let S = {β1, ..., βm} be an i.i.d. sample from the multivariate normal distribution N(β , Σ) with covariance Σ Rd d and mean E [βi] = β Rd . Given ˆβ Rd with Tukey depth at least p with respect to S, there exists a constant

c > 0 such that when m c d+log(1/γ)

α2 with probability 1 γ, ˆβ β Σ Φ 1(1 p/m + α), where Φ denotes the CDF of of the standard univariate Gaussian.

In practice, we observe that empirical distributions of models for real data often feature Gaussian-like concentration, fast tail decay, and symmetry. Plots of histograms for the the models learned by Tukey EM on experiment datasets appear in the Appendix s Section 7.7. Nonetheless, we emphasize that Theorem 3.1 is a statement of sufﬁciency, not necessity. Tukey EM does not require any distributional assumption to be private, nor does non-Gaussianity preclude accurate estimation.

The remaining subsections elaborate on the details of our version using approximate Tukey depth, culminating in the full pseudocode in Algorithm 2 and overall result, Theorem 1.1.

3.1 COMPUTING VOLUMES

We start by describing how to compute volumes corresponding to different Tukey depths. As shown in the next subsection, these volumes will be necessary for the PTR subroutine. Deﬁnition 3.2. Given database D, deﬁne Vi,D = vol({y | y Rd and TD(y) i}), the volume of the region of points in Rd with approximate Tukey depth at least i in D. When D is clear from context, we write Vi for brevity.

Since our notion of approximate Tukey depth uses the canonical basis (Deﬁnition 2.5), it follows that V1, V2, . . . , Vm/21 form a sequence of nested (hyper)rectangles, as shown in Figure 3. With this observation, computing a given Vi is simple. For each axis, project the non-private models {βi}m i=1 onto the axis and compute the distance between the two points of exact Tukey depth i (from the left and right ) in the one-dimensional sorted array. This yields one side length for the hyperrectangle. Repeating this d times in total and taking the product then yields the total volume of the hyperrectangle, as formalized next. The simple proof appears in the Appendix s Section 7.2.

Lemma 3.3. Lines 5 to 10 of Algorithm 2 compute {Vi}m/2 i=1 in time O(dm log(m)).

3.2 APPLYING PROPOSE-TEST-RELEASE

The next step of Tukey EM employs PTR to restrict the output region eventually used by the exponential mechanism. We collect this process into a subroutine PTRCheck.

The overall strategy applies work done by Brown et al. (2021). Their algorithm privately checks if the given database has a large Hamming distance to any unsafe database and then, if this PTR check passes, runs an exponential mechanism restricted to a domain of high Tukey depth. Since a safe database is deﬁned as one where the restricted exponential mechanism has a similar output distribution on any neighboring database, the overall algorithm is DP. As part of their utility analysis, they prove a lemma translating a volume condition on regions of different Tukey depths to a lower bound on the Hamming distance to an unsafe database (Lemma 3.8 (Brown et al., 2021)). This enables them to argue that the PTR check typically passes if it receives enough Gaussian data, and the utility guarantee follows. However, their algorithm requires computing both exact Tukey depths of the samples and the current database s exact Hamming distance to unsafety. The given runtimes for both computations are exponential in the dimension d (see their Section C.2 (Brown et al., 2021)).

We rely on approximate Tukey depth (Deﬁnition 2.5) to resolve both issues. First, as the previous section demonstrated, computing the approximate Tukey depths of a collection of m d-dimensional points only takes time O(dm log(m)). Second, we adapt their lower bound to give a 1-sensitive lower bound on the Hamming distance between the current database and any unsafe database. This yields an efﬁcient replacement for the exact Hamming distance calculation used by Brown et al. (2021).

The overall structure of PTRCheck is therefore as follows: use the volume condition to compute a 1-sensitive lower bound on the given database s distance to unsafety; add noise to the lower bound

1We assume m is even for simplicity. The algorithm and its guarantees are essentially the same when m is odd, and our implementation handles both cases.

Published as a conference paper at ICLR 2023

Algorithm 1 PTRCheck

1: Input: Tukey depth region volumes V , privacy parameters ε and δ 2: Use Lemma 3.6 with t = |V |

2 and δ 8eε to compute lower bound k for distance to unsafe database

3: if k + Lap (1/ε) log(1/2δ)

ε then 4: Return True 5: else 6: Return False

and compare it to a threshold calibrated so that an unsafe dataset has probability δ of passing; and if the check passes, run the exponential mechanism to pick a point of high approximate Tukey depth from the domain of points with moderately high approximate Tukey depth. Before proceeding to the details of the algorithm, we ﬁrst deﬁne a few necessary terms. Deﬁnition 3.4 (Deﬁnition 2.1 Brown et al. (2021)). Two distributions P, Q over domain W are (ε, δ)-indistinguishable, denoted P ε,δ Q, if for any measurable subset W W,

Pw P [w W] eεPw Q [w W] + δ and Pw Q [w W] eεPw P [w W] + δ.

Note that (ε, δ)-DP is equivalent to (ε, δ)-indistinguishability between output distributions on arbitrary neighboring databases. Given database D, let A denote the exponential mechanism with utility function TD (see Deﬁnition 2.5). Given nonnegative integer t, let At denote the same mechanism that assigns score to any point with score < t, i.e., only samples from points of score t. We will say a database is safe if At is indistinguishable between neighbors. Deﬁnition 3.5 (Deﬁnition 3.1 Brown et al. (2021)). Database D is (ε, δ, t)-safe if for all neighboring D D, we have At(D) ε,δ At(D ). Let Safe(ε,δ,t) be the set of safe databases, and let Unsafe(ε,δ,t) be its complement.

We now state the main result of this section, Lemma 3.6. Brieﬂy, it modiﬁes Lemma 3.8 from Brown et al. (2021) to construct a 1-sensitive lower bound on distance to unsafety. Lemma 3.6. Deﬁne M(D) to be a mechanism that receives as input database D and computes the largest k {0, . . . , t 1} such that there exists g > 0 where, for volumes V deﬁned using a monotonic utility function, Vt k 1,D Vt+k+g+1,D e εg/2 δ

or outputs 1 if the inequality does not hold for any such k. Then for arbitrary D

1. M is 1-sensitive, and

2. for all z Unsafe(ε,4eεδ,t), d H(D, z) > M(D).

The proof of Lemma 3.6 appears in the Appendix s Section 7.2. Our implementation of the algorithm described by Lemma 3.6 randomly perturbs the models with a small amount of noise to avoid having regions with 0 volume. We note that this does not affect the overall privacy guarantee.

PTRCheck therefore runs the mechanism deﬁned by Lemma 3.6, add Laplace noise to the result, and proceeds to the restricted exponential mechanism if the noisy statistic crosses a threshold. Pseudocode appears in Algorithm 1, and we now state its guarantee as proved in the Appendix s Section 7.2. Lemma 3.7. Given the depth volumes V computed in Lines 9 to 10 of Algorithm 2, PTRCheck(V, ε, δ) is ε-DP and takes time O(m log(m)).

3.3 SAMPLING

If PTRCheck passes, Tukey EM then calls the exponential mechanism restricted to points of approximate Tukey depth at least t = m/4, a subroutine denoted Restricted Tukey EM (Line 12 in Algorithm 2). Note that the passage of PTR ensures that with probability at least 1 δ, running Restricted Tukey EM is (ε, δ)-DP. We use a common two step process for sampling from an exponential mechanism over a continuous space: 1) sample a depth using the exponential mechanism, then 2) return a uniform sample from the region corresponding to the sampled depth.

Published as a conference paper at ICLR 2023

3.3.1 SAMPLING A DEPTH

We ﬁrst deﬁne a slight modiﬁcation W of the volumes V introduced earlier.

Deﬁnition 3.8. Given database D, deﬁne Wi,D = vol({y | y Rd and TD(y) = i}), the volume of the region of points in Rd with approximate Tukey depth exactly i in D.

To execute the ﬁrst step of sampling, for i {m/4, m/4 + 1, . . . , m/2}, Wi,D = Vi,D Vi+1,D, so we can compute {Wi,D}m/2 i=m/4 from the V computed earlier in time O(m). The restricted exponential mechanism then selects approximate Tukey depth i {m/4, m/4 + 1, . . . , m/2} with probability

P [i] Wi,D exp(ε i).

Note that this expression drops the 2 in the standard exponential mechanism because approximate Tukey depth is monotonic; see Appendix Section 7.3 for details. For numerical stability, racing sampling Medina & Gillenwater (2020) can sample from this distribution using logarithmic quantities.

3.3.2 UNIFORMLY SAMPLING FROM A REGION

Having sampled a depth ˆi, it remains to return a uniform random point of approximate Tukey depth ˆi. By construction, Wˆi,D is the volume of the set of points y = (y1, ..., yd) such that the depth along every dimension j is at least ˆi, and the depth along at least one dimension j is exactly ˆi. The result is straightforward when d = 1: draw a uniform sample from the union of the two intervals of points of depth exactly ˆi (depth from the left and right ).

For d > 1, the basic idea of the sampling process is to partition the overall volume into disjoint subsets, compute each subset volume, sample a subset according to its proportion in the whole volume, and then sample uniformly from that subset. Our partition will split the overall region of depth exactly i according to the ﬁrst dimension with dimension-speciﬁc depth exactly i. Since any point in the overall region has at least one such dimension, this produces a valid partition, and we will see that computing the volumes of these partitions is straightforward using the S computed earlier. Finally, the last sampling step will be easy because the ﬁnal subset will simply be a pair of (hyper)rectangles. Since space is constrained and the details are relatively straightforward from the sketch above, full speciﬁcation and proofs for this process Sample Point With Depth(S, i) appear in Section 7.4. For immediate purposes, it sufﬁces to record the following guarantee:

Lemma 3.9. Sample Point With Depth(S, i) returns a uniform random sample from the region of points with approximate Tukey depth i in S in time O(d).

3.4 OVERALL ALGORITHM

We now have all of the necessary material for the main result, Theorem 1.1, restated below. The proof essentially collects the results so far into a concise summary.

Theorem 3.10. Tukey EM, given in Algorithm 2, is (ε, δ)-DP and takes time O d2n + dm log(m) .

Proof. Line 11 of the Tukey EM pseudocode in Algorithm 2 calls the check with privacy parameters ε/2 and δ/[8eε]. By the sensitivity guarantee of Lemma 3.6, the check itself is ε/2-DP. By the safety guarantee of Lemma 3.6 and our choice of threshold, if it passes, with probability at least 1 δ/2, the given database lies in Safe(ε/2,δ/2,t). A passing check therefore ensures that the sampling step in Line 12 is (ε/2, δ)-DP. By composition, the overall privacy guarantee is (ε, δ)-DP. Turning to runtime, the m OLS computations inside Line 3 each multiply d n

m d matrices, for O(d2n) time overall. From Lemma 3.3, Lines 5 to 10 take time O(dm log(m)). Lemma 3.7 gives the O(m log(m)) time for Line 11, and Lemma 3.9 gives the O(d) time for Line 12.

4 EXPERIMENTS

4.1 BASELINES

1. Non DP computes the standard non-private OLS estimator β = (XT X) 1XT y.

Published as a conference paper at ICLR 2023

Algorithm 2 Tukey EM

1: Input: Features matrix X Rn d, label vector y Rn, number of models m, privacy parameters ε and δ 2: Evenly and randomly partition X and y into subsets {(Xi, yi)}m i=1 3: for i = 1, . . . , m do 4: Compute OLS estimator βi (XT i Xi) 1XT i yi 5: for dimension j [d] do 6: {βi,j}m i=1 projection of {βi}m i=1 onto dimension j 7: (Sj,1, . . . , Sj,m) {βi,j}m i=1 sorted in nondecreasing order 8: Collect projected estimators into S Rd m, where each row is nondecreasing 9: for i [m/2] do 10: Compute volume of region of depth i, Vi Qd j=1(Sj,m (i 1) Sj,i) 11: if PTRCheck(V, ε/2, δ) then 12: ˆβ Restricted Tukey EM(V, S, m/4, ε/2) 13: Return ˆβ 14: else 15: Return

2. Ada SSP (Wang, 2018) computes a DP OLS estimator based on noisy versions of XT X and XT y. This requires the end user to supply bounds on both X 2 and y 2. Our implementation uses these values non-privately for each dataset. The implementation is therefore not private and represents an artiﬁcially strong version of Ada SSP. As speciﬁed by Wang (2018), Ada SSP (privately) selects a ridge parameter and runs ridge regression.

3. DPSGD (Abadi et al., 2016) uses DP-SGD, as implemented in Tensor Flow Privacy and Keras (Chollet et al., 2015), to optimize mean squared error using a single linear layer. The layer s weights are regression coefﬁcients. A discussion of hyperparameter selection appears in Section 4.4. As we will see, appropriate choices of these hyperparameters is both dataset-speciﬁc and crucial to DPSGD s performance. Since we allow DPSGD to tune these non-privately for each dataset, our implementation of DPSGD is also artiﬁcially strong.

All experiment code can be found on Github (Google, 2022).

4.2 DATASETS

We evaluate all four algorithms on the following datasets. The ﬁrst dataset is synthetic, and the rest are real. The datasets are intentionally selected to be relatively easy use cases for linear regression, as reﬂected by the consistent high R2 for Non DP.2 However, we emphasize that, beyond the constraints on d and n suggested by Section 4.3, they have not been selected to favor Tukey EM: all feature selection occurred before running any of the algorithms, and we include all datasets evaluated where Non DP achieved a positive R2. A complete description of the datasets appears both in the public code and the Appendix s Section 7.5. For each dataset, we additionally add an intercept feature.

1. Synthetic (d = 11, n = 22,000, Pedregosa et al. (2011)). This dataset uses sklearn.make regression and N(0, σ2) label noise with σ = 10.

2. California (d = 9, n = 20,433, Nugent (2017)) predicting house price.

3. Diamonds (d = 10, n = 53,940, Agarwal (2017)), predicting diamond price.

4. Trafﬁc (d = 3, n = 7,909, NYSDOT (2013)), predicting number of passenger vehicles.

5. NBA (d = 6, n = 21,613, Lauga (2022)), predicting home team score.

6. Beijing (d = 25, n = 159,375, ruiqurm (2018)), predicting house price.

7. Garbage (d = 8, n = 18,810, DSNY (2022)), predicting tons of garbage collected.

8. MLB (d = 11, n = 140,657, Samaniego (2018)), predicting home team score.

2R2 measures the variation in labels accounted for by the features. R2 = 1 is perfect, R2 = 0 is the trivial baseline achieved by simply predicting the average label, and R2 < 0 is worse than the trivial baseline.

Published as a conference paper at ICLR 2023

4.3 CHOOSING THE NUMBER OF MODELS

Before turning to the results of this comparison, recall from Section 3 that Tukey EM privately aggregates m non-private OLS models. If m is too low, PTRCheck will probably fail; if m is too high, and each model is trained on only a small number of points, even a non-private aggregation of inaccurate models will be an inaccurate model as well.

Experiments on synthetic data support this intuition. In the left plot in Figure 2, each solid line represents synthetic data with a different number of features, generated by the same process as the Synthetic dataset described in the previous section. We vary the number of models m on the x-axis and plot the distance computed by Lemma 3.6. As d grows, the number of models required to pass the PTRCheck threshold, demarcated by the dashed horizontal line, grows as well.

To select the value of R2 used for Tukey EM, we ran it on each dataset using m = 250, 500, . . . , 2000 and selected the smallest m where all PTR checks passed. We give additional details in Section 7.6 but note here that the resulting choices closely track those given by Figure 2. Furthermore, across many datasets, simply selecting m = 1000 typically produces nearly optimal R2, with several datasets exhibiting little dependence on the exact choice of m.

4.4 ACCURACY COMPARISON

Our main experiments compare the four methods at (ln(3), 10 5)-DP. A concise summary of the experiment results appears in Figure 1. For every method other than Non DP (which is deterministic), we report the median R2 values across the trials. For each dataset, the methods with interquartile ranges overlapping that of the method with the highest median R2 are bolded. Extended plots recording R2 for various m appear in Section 7.6. All datasets use 10 trials, except for California and Diamonds, which use 50.

Dataset Non DP Ada SSP Tukey EM DPSGD (tuned) DPSGD (90% tuned) Synthetic 0.997 0.991 0.997 0.997 0.997 California 0.637 -1.285 0.099 0.085 -1.03 Diamonds 0.907 0.216 0.307 0.828 0.371 Trafﬁc 0.966 0.944 0.965 0.938 0.765 NBA 0.621 0.018 0.618 0.531 0.344 Beijing 0.702 0.209 0.698 0.475 0.302 Garbage 0.542 0.119 0.534 0.215 0.152 MLB 0.722 0.519 0.721 0.718 0.712

Figure 1: For each dataset, the DP methods with interquartile ranges overlapping that of the DP method with the highest median R2 are bolded.

A few takeaways are immediate. First, on most datasets Tukey EM obtains R2 exceeding or matching that of both Ada SSP and DPSGD. Tukey EM achieves this even though Ada SSP receives non-private access to the true feature and label norms, and DPSGD receives non-private access to extensive hyperparameter tuning.

We brieﬂy elaborate on the latter. Our experiments tune DPSGD over a large grid consisting of 2,184 joint hyperparameter settings, over learning rate {10 6, 10 5, . . . , 1}, clip norm {10 6, 10 5, . . . , 106}, microbatches {25, 26, . . . , 210}, and epochs {1, 5, 10, 20}. Ignoring the extensive computational resources required to do so at this scale (100 trials of each of the 2,184 hyperparameter combinations, for each dataset), we highlight that even mildly suboptimal hyperparameters are sufﬁcient to signiﬁcantly decrease DPSGD s utility. Figure 1 quantiﬁes this by recording the R2 obtained by the hyperparameters that achieved the highest and 90th percentile median R2 during tuning. While the optimal hyperparameters consistently produce results competitive with or sometimes exceeding that of Tukey EM, even the mildly suboptimal hyperparameters nearly always produce results signiﬁcantly worse than those of Tukey EM. The exact hyperparameters used appear in Section 7.6.

We conclude our discussion of DPSGD by noting that it has so far omitted any attempt at differentially private hyperparameter tuning. We suggest that the results here indicate that any such method will

Published as a conference paper at ICLR 2023

250 500 750 1000 1250 1500 1750 2000 # models (m)

lower bound on distance to unsafe dataset

d = 5.0 d = 10.0 d = 15.0

d = 20.0 d = 25.0 d = 30.0

d = 35.0 d = 40.0 d = 45.0

d = 50.0 threshold

1000 1200 1400 1600 1800 2000 # models m

mean time (s) from 10 trials

synthetic time

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

Figure 2: Left: plot of Hamming distance to unsafety using Lemma 3.6 as the feature dimension d and number of models m varies, using (ln(3), 10 5)-DP and n = (d + 1)m throughout. Right: plot of average time in seconds as the number of models m used by Tukey EM varies

need to select hyperparameters with high accuracy while using little privacy budget, and emphasize that the presentation of DPSGD in our experiments is generous.

Overall, Tukey EM s overall performance on the eight datasets is strong. We propose that the empirical evidence is enough to justify Tukey EM as a ﬁrst-cut method for linear regression problems whose data dimensions satisfy its heuristic requirements (n 1000 d).

4.5 TIME COMPARISON

We conclude with a brief discussion of runtime. The rightmost plot in Figure 2 records the average runtime in seconds over 10 trials of each method. Tukey EM is slower than the covariance matrixbased methods Non DP and Ada SSP, but it still runs in under one second, and it is substantially faster than DPSGD. Tukey EM s runtime also, as expected, depends linearly on the number of models m. Since the plots are essentially identical across datasets, we only include results for the Synthetic dataset here. Finally, we note that, for most reasonable settings of m, Tukey EM has runtime asymptotically identical to that of Non DP (Theorem 1.1). The gap in practical performance is likely a consequence of the relatively unoptimized nature of our implementation.

5 FUTURE DIRECTIONS

An immediate natural extension of Tukey EM would generalize the approach to similar problems such as logistic regression. More broadly, while this work focused on linear regression for the sake of simplicity and wide applicability, the basic idea of Tukey EM can in principle be applied to select from arbitrary non-private models that admit expression as vectors in Rd. Part of our analysis observed that Tukey EM may beneﬁt from algorithms and data that lead to Gaussian-like distributions over models; describing the characteristics of algorithms and data that induce this property or a similar property that better characterizes the performance of Tukey EM is an open question.

6 ACKNOWLEDGMENTS

We thank Gavin Brown for helpful discussion of Brown et al. (2021), and we thank Jenny Gillenwater for useful feedback on an early draft. We also thank attendees of the Fields Institute Workshop on Differential Privacy and Statistical Data Analysis for helpful general discussions.

Published as a conference paper at ICLR 2023

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Conference on Computer and Communications Security (CCS), 2016.

Shivam Agarwal. Diamonds dataset. https://www.kaggle.com/datasets/ shivam2503/diamonds, 2017. Accessed: 2022-7-6.

Daniel Alabi, Audra Mc Millan, Jayshree Sarathy, Adam Smith, and Salil Vadhan. Differentially private simple linear regression. In Privacy Enhancing Technologies Symposium (PETS), 2022.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efﬁcient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Neural Information Processing Systems (Neur IPS), 2019.

Amos Beimel, Shay Moran, Kobbi Nissim, and Uri Stemmer. Private center points and learning of halfspaces. In Conference on Learning Theory (COLT), 2019.

Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covarianceaware private mean estimation without private covariance estimation. In Neural Information Processing Systems (Neur IPS), 2021.

T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. Annals of Statistics, 2020.

Franc ois Chollet et al. Keras. https://keras.io, 2015.

Ryan Cumings-Menon. Differentially private estimation via statistical depth. ar Xiv preprint ar Xiv:2207.12602, 2022.

DSNY. Dsny monthly tonnage data. https://data.cityofnewyork.us/ City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5, 2022. Accessed: 2022-7-28.

Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.

Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Symposium on the Theory of Computing (STOC), 2014.

Google. dp regression. https://github.com/google-research/ google-research/tree/master/dp_regression, 2022.

David S. Johnson and Franco P Preparata. The densest hemisphere problem. Theoretical Computer Science, 1978.

Haim Kaplan, Micha Sharir, and Uri Stemmer. How to ﬁnd a point in the convex hull privately. In Symposium on Computational Geometry (So CG), 2020.

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (COLT), 2012.

Nathan Lauga. Nba games dataset. https://www.kaggle.com/datasets/ nathanlauga/nba-games?select=games.csv, 2022. Accessed: 2022-7-6.

Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Symposium on the Theory of Computing (STOC), 2019.

Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. ar Xiv preprint ar Xiv:2102.09159, 2021.

Published as a conference paper at ICLR 2023

Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Conference on Learning Theory (COLT), 2022.

Frank Mc Sherry and Kunal Talwar. Mechanism design via differential privacy. In Foundations of Computer Science (FOCS), 2007.

Andr es Mu noz Medina and Jenny Gillenwater. Duff: A dataset-distance-based utility function family for the exponential mechanism. ar Xiv preprint ar Xiv:2010.04235, 2020.

Jason Milionis, Alkis Kalavasis, Dimitris Fotakis, and Stratis Ioannidis. Differentially private regression with unbounded covariates. In Artiﬁcial Intelligence and Statistics (AISTATS), 2022.

Cam Nugent. California housing prices dataset. https://www.kaggle.com/datasets/ camnugent/california-housing-prices, 2017. Accessed: 2022-7-6.

NYSDOT. New york weigh-in-motion station vehicle trafﬁc counts dataset. https://data.ny.gov/Transportation/ Weigh-In-Motion-Station-Vehicle-Traffic-Counts-201/gdpg-i86w, 2013. Accessed: 2022-7-6.

Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential privacy. In International Conference on Learning Representations (ICLR), 2022.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 2011.

Kelly Ramsay and Shoja eddin Chenouri. Differentially private depth functions and their associated medians. ar Xiv preprint ar Xiv:2101.02800, 2021.

ruiqurm. Housing price in beijing. https://www.kaggle.com/datasets/ruiqurm/ lianjia, 2018. Accessed: 2022-7-28.

Antonio Samaniego. Mlb games data from retrosheet. https://www.kaggle.com/ datasets/samaxtech/mlb-games-data-from-retrosheet?select=game_ log.csv, 2018. Accessed: 2022-7-28.

Jayshree Sarathy, Sophia Song, Audrey Emma Haque, Tania Schlatter, and Salil Vadhan. Don t look at the data! how differential privacy reconﬁgures data subjects, data analysts, and the practices of data science. In Theory and Practice of Differential Privacy (TPDP), 2022.

Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory (ALT), 2019.

Henri Theil. A rank-invariant method of linear and polynomial regression analysis, 1992.

John W Tukey. Mathematics and the picturing of data. In International Congress of Mathematicians (IMC), 1975.

Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping. In Conference on Learning Theory (COLT), 2022.

Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Uncertainty in Artiﬁcial Intelligence (UAI), 2018.

Published as a conference paper at ICLR 2023

7.1 ILLUSTRATION OF EXACT AND APPROXIMATE TUKEY DEPTH

1 2 3 4 5 6 7

1 2 3 4 5 6 7 8

Figure 3: An illustrated comparison between exact (left) and approximate (right) Tukey depth. In both ﬁgures, the set of points is {(1, 1), (7, 3), (5, 7), (3, 3), (5, 5), (6, 3)}, the region of depth 0 is white, the region of depth 1 is light gray, and the region of depth 2 is dark gray. Note that for exact Tukey depth, the regions of different depths form a sequence of nested convex polygons; for approximate Tukey depth, they form a sequence of nested rectangles.

7.2 OMITTED PROOFS

We start with the proof of our utility result for exact Tukey depth, Theorem 3.1.

Proof of Theorem 3.1. This is a direct application of the results of Brown et al. (2021). They analyzed a notion of probabilistic (and normalized) Tukey depth over samples from a distribution: TN(µ,Σ)(y) := minv PX N(µ,Σ) [ X, v y, v ]. Their Proposition 3.3 shows that TN(µ,Σ)(y) can be characterized in terms of Φ, the CDF of the standard one-dimensional Gaussian distribution. Speciﬁcally, they show TN(µ,Σ)(y) = Φ( y µ Σ). From their Lemma 3.5, if m c d+log(1/γ)

then with probability 1 γ, |p/m TN(β ,Σ)(ˆβ)| α. Thus

α TN(β ,Σ)(ˆβ) p/m

p/m α Φ( ˆβ β Σ)

p/m α 1 Φ( ˆβ β Σ)

Φ( ˆβ β Σ) 1 p/m + α

ˆβ β Σ Φ 1(1 p/m + α)

where the third inequality used the symmetry of Φ.

Next, we prove our result about computing the volumes associated with different approximate Tukey depths, Lemma 3.3.

Proof of Lemma 3.3. By the deﬁnition of approximate Tukey depth, for arbitrary y = (y1, . . . , yd) of Tukey depth at least i, each of the 2d halfspaces hy1 e1, hy1 e1, . . . , hyd ed, hyd ed contains at least i points from D, where x y denotes multiplication of a scalar and vector. Fix some dimension j [d]. Since min(|hyj ej D|, |hyj ej D|) i, yj [Sj,i, Sj,m (i 1)]. Thus Vi,D = Qd j=1(Sj,m (i 1) Sj,i). The computation of S starting in Line 5 sorts d arrays of length m and so takes time O(dm log(m)). Line 9 iterates over m/2 depths and computes d quantities, each in constant time, so its total time is O(dm).

The next result is our 1-sensitive and efﬁcient adaptation of the lower bound from Brown et al. (2021). We ﬁrst restate that result. While their paper uses swap DP, the same result holds for add-remove DP.

Published as a conference paper at ICLR 2023

Lemma 7.1 (Lemma 3.8 (Brown et al., 2021)). For any k 0, if there exists a g > 0 such that Vt k 1,D Vt+k+g+1,D e εg/2 δ, then for every database z in Unsafe(ε,4eεδ,t), d H(D, z) > k, where d H denotes Hamming distance.

We now prove its adaptation, Lemma 3.6.

Proof of Lemma 3.6. We ﬁrst prove item 1. Let D and D be neighboring databases, D = D {x}, and let k D and k D denote the mechanism s outputs on the respective databases. It sufﬁces to show |k D k D | 1.

Consider some Vp for nonnegative integer p. If x has depth less than p in D, then Vp 1,D Vp,D Vp,D > Vp+1,D. Otherwise, Vp+1,D < Vp,D = Vp,D < Vp 1,D. In either case,

Vp+1,D < Vp,D Vp,D Vp 1,D. (1)

Now suppose there exist k D 0 and g D > 0 such that Vt k D 1,D

Vt+k D +g D +1,D e εg D /2 δ. Then by

Equation 1, Vt k D ,D Vt+k D +g D ,D e εg D /2 δ, so k D k D 1. Similarly, if there exist k D 0 and

g D > 0 such that Vt k D 1,D Vt+k D+g D+1,D e εg D/2 δ, then by Equation 1, Vt k D,D

Vt+k D+g D,D e εg D /2 δ, so

k D k D 1. Thus if k D 0 or k D 0, |k D k D | 1. The result then follows since k 1.

We now prove item 2. This holds for k 0 by Lemma 7.1; for k = 1, the lower bound on distance to unsafety is the trivial one, d H(D, z) 0.

The next proof is for Lemma 3.7, which veriﬁes the overall privacy and runtime of PTRCheck.

Proof of Lemma 3.7. The privacy guarantee follows from the 1-sensitivity of computing k (Lemma 3.6). For the runtime guarantee, we perform a binary search over the distance lower bound k starting from the maximum possible m/4 and, for each k, an exhaustive search over g. Note that if some k, g pair satisﬁes the inequality in Lemma 7.1, there exists some g for every k < k that satisﬁes it as well. Thus since both have range m/4, the total time is O(m log(m)).

7.3 USING MONOTONICITY

This section discusses our use of monotonicity in the restricted exponential mechanism. Deﬁnition 2.2 states that, if u is monotonic, the exponential mechanism can sample an output y with probability proportional to exp εu(D,y)

and satisfy ε-DP. Approximate Tukey depth is monotonic, so our application can also sample from this distribution. It remains to incorporate monotonicity into the PTR step.

It sufﬁces to show that Lemma 7.1 also holds for a restricted exponential mechanism using a monotonic score function. Turning to the proof of Lemma 7.1 given by Brown et al. (2021), it sufﬁces to prove their Lemma 3.7 using wx(S) = R

S exp(εq(x; y))dy. Note that their wx(S) differs by the 2 in its denominator inside the exponent term; this modiﬁcation is where we incorporate monotonicity. This difference shows up in two places in their argument. First, we can replace their bound

P [Mε,t(x) = y] P [Mε,t(x ) = y] eε/2 wx (Yt,x )

wx(Yt,x) eε wx(Yt,x )

with the two cases that arise in add-remove differential privacy. The ﬁrst considers x x and yields

P [Mε,t(x) = y] P [Mε,t(x ) = y] eε wx (Yt,x )

wx(Yt,x) eε wx(Yt,x )

since the mechanism on x never assigns a lower score to an output than on x . Using the same logic, the second considers x x , and we get

P [Mε,t(x) = y] P [Mε,t(x ) = y] wx (Yt,x )

wx(Yt,x) eε wx(Yt,x )

Published as a conference paper at ICLR 2023

The second application in their argument, which bounds P[Mε,t(x )=y]

P[Mε,t(x)=y] , uses the same logic. As a result, their Lemma 3.7 also holds for a monotonic restricted exponential mechanism, and we can drop the 2 in the sampling distribution as desired.

7.4 SAMPLING FROM A REGION DETAILS

We start by formally deﬁning our partition.

Deﬁnition 7.2. Given d-dimensional database D and dimension j [d], for y Rd, let TD,j(y) denote the exact (one-dimensional) Tukey depth of point y with respect to dimension j in database D. Let Bi denote the region of points with approximate Tukey depth i. Deﬁne the partition {Cj,i}d j=1 of Bi as the volume of points where depth i occurs in dimension j for the ﬁrst time, i.e.,

Cj,i := {y Rd | min j <j TD,j (y) > i and TD,j(y) = i and min j >j TD,j (y) i}.

The partition is well deﬁned because any point with approximate Tukey depth i is in exactly one of the Cj,i volumes. Each Cj,i is also the Cartesian product of three sets: any y Cj,i must have 1) depth strictly greater than i in dimensions 1, ..., j 1, 2) depth i in dimension j, and 3) depth at least i in dimensions j + 1, ..., d. Being the Cartesian product of three sets, the total volume of Cj,i can be computed as the product of the three corresponding volumes in lower dimensions. We will denote these by V<j,i, Wj,i, V>j,i, formalized below.

Deﬁnition 7.3. Given d-dimensional database D, dimension j [d], and depth i, deﬁne

1. Vj,i,D = vol({yj | y Rd, TD(y) i}), the total length in dimension j of the region with approximate Tukey depth at least i.

2. Wj,i,D = vol({yj | y Rd, TD,j(y) = i and TD(y) i}), the total length in dimension j of the region with depth exactly i in dimension j and approximate Tukey depth at least i.

3. V<j,i,D = vol({y1:j 1 | y Rd, TD(y) i}), the volume of the projection onto the ﬁrst j 1 dimensions of points with approximate Tukey depth at least i. Deﬁne V>j,i,D analogously.

When D is clear from context, we drop it from the subscript.

The next lemma shows how to compute these and other relevant volumes. We again note that we ﬁx m to be even for neatness. The odd case is similar.

Lemma 7.4. Given matrix S Rd m of projected and sorted models, as in Line 8 of Algorithm 2,

1. Vj,i = Sj,m (i 1) Sj,i,

2. Wj,i = Vj,i Vj,i+1,

3. V<j,i = Qj 1 j =1 Vj,i and V>j,i = Qd j =j+1 Vj,i

Proof. The proofs of the ﬁrst item uses essentially the same reasoning as the proof of Lemma 3.3. For the second item, any point contributing to Vj,i but not Vj,i+1 has depth exactly i in dimension j. For the third item, a point contributes to V<j,i if and only if it has depth at least i in all d dimensions; since the resulting region is a rectangle, its volume is the product of its side lengths.

With Lemma 7.4, we can now prove that Sample Point With Depth works as intended.

Proof of Lemma 3.9. Given S Rd m, deﬁne Bi and {Cj,i}d j=1 as in Deﬁnition 7.2. To show the outcome of Sample Point With Depth(S, i) is uniformly distributed over points with approximate Tukey depth i, it sufﬁces to show the algorithm samples a Cj,i with probability proportional to its volume. Recall that Cj,i is the set of points with depth greater than i in dimensions 1, 2, . . . , j 1, exactly i in dimension j, and at least i in the remaining dimensions. With this interpretation, Cj,i is

Published as a conference paper at ICLR 2023

Algorithm 3 Restricted Tukey EM

1: Input: Tukey depth region volumes V , sorted collection of estimators S, depth restriction t, privacy parameter ε 2: for i = t, t + 1, . . . , |V | 1 do 3: Compute volume of region of Tukey depth exactly i, Wi Vi Vi+1 4: Sample depth ˆi from distribution where P [i] Wi exp ε i

5: Return y Sample Point With Depth(S,ˆi)

Algorithm 4 Sample Point With Depth

1: Input: Sorted collection of estimators S, depth i 2: d, m number of rows and columns in S 3: Compute V 1,i using Lemma 7.4 4: for j = 1, . . . , d do 5: Compute V<j,i+1, Wj,i, and V>j,i using Lemma 7.4 6: Compute vol(Cj,i) = V<j,i+1 Wj,i V>j,i 7: Sample index j [d] with probability vol(Cj,i)

V 1,i 8: for j = 1, . . . , j 1 do 9: yj uniform random sample from [Sj ,i+1, Sj ,m i] 10: yj uniform random sample from [Sj ,i, Sj ,i+1) (Sj ,m i, Sj ,m (i 1)] 11: for j = j + 1, . . . , d do 12: yj uniform random sample from [Sj ,i, Sj ,m (i 1)] 13: Return y

the Cartesian product of three lower dimensional regions, and thus its volume is the product of the corresponding volumes, vol(Cj,i) = V<j,i+1 Wj,i V>j,i.

These quantities, along with the normalizing constant V 1,i, can be computed using Lemma 7.4.

Since i is ﬁxed, computing the full set of Vj,i and Wj,i takes time O(d), and by tracking partial sums and using logarithms, we can compute the full set of V<j,i and V>j,i in time O(d) as well. The last step is sampling the ﬁnal point y, which takes time O(d) using the previously computed S.

7.5 DATASET FEATURE SELECTION DETAILS

This section provides details for each of the real datasets evaluated in our experiments.

1. California Housing Nugent (2017): The label is median housevalue, and the categorical ocean proximity is dropped.

2. Diamonds Agarwal (2017): The label is price. Ordinal categorical features (carat, color, clarity) are replaced with integers 1, 2, . . ..

3. Trafﬁc NYSDOT (2013): The label is passenger vehicle count (Class 2), and the remaining features are motorcycles (Class 1) and pickups, panels, and vans (Class 3).

4. NBA Lauga (2022): The label is PTS home, and the features are FT PCT home, FG3 PCT home, FG PCT home, AST home, and REB home.

5. Beijing Housing ruiqurm (2018): The label is total Price. Features are days on market (DOM), followers, area of house in meters (square), number of kitchens (kitchen), building Type, renovation Condition, building material (building Structure), ladders per residence (ladder Ratio), elevator presence elevator, whether previous owner owned for at least ﬁve years (five Years Property), proximity to subway (subway), district, and nearby average housing price (community Average). Categorical building Type, renovation Condition, and building Structure are encoded as one-hot variables. We additionally removed a single outlier row (60422) whose norm is more than two

Published as a conference paper at ICLR 2023

orders of magnitude larger than that of other points; none of the DP algorithms achieved positive R2 with the row included.

6. New York Garbage DSNY (2022): The label is REFUSETONSCOLLECTED. The features are PAPERTONSCOLLECTED and MPGTONSCOLLECTED. The categorical BOROUGH is encoded as one-hot variables.

7. MLB Samaniego (2018): The label is home team runs (h score) and the features are v strikeouts, v walks, v pitchers used, v errors, h at bats, h hits, h doubles, h triples, h homeruns, and h stolen bases.

7.6 EXTENDED EXPERIMENT RESULTS

Figure 4 records, respectively, the number of epochs, the clipping norm, the learning rate, and the number of microbatches for both the best and 90th percentile hyperparameter settings on each dataset.

Dataset DPSGD (tuned) DPSGD (90% tuned) Synthetic (20, 1, 1, 128) (20, 10 3, 0.1, 64) California (20, 100, 1, 64) (10, 1000, 0.01, 64) Diamonds (20, 106, 1, 128) (10, 106, 0.1, 32) Trafﬁc (1, 106, 1, 1024) (10, 105, 0.1, 512) NBA (20, 100, 1, 512) (20, 10 6, 0.1, 32) Beijing (20, 100, 0.01, 512) (20, 0.1, 0.001, 512) Garbage (20, 1, 1, 32) (5, 1000, 1, 64) MLB (10, 100, 0.01, 512) (10, 10 5, 0.001, 128)

Figure 4: Hyperparameter settings used by DPSGD on each dataset.

7.7 DISTRIBUTION OF MODELS FOR ALL DATASETS

Figures 7 through 13 display histograms of models trained on each dataset. For each plot, we train 2,000 standard OLS models on disjoint partitions of the data and plot the resulting histograms for each coefﬁcient. The red curve plots a Gaussian distribution with the same mean and standard deviation as the underlying data.

7.7 Distribution of models for all datasets 17

1000 1200 1400 1600 1800 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

800 1000 1200 1400 1600 1800 2000 # models m

R2 from 50 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

800 1000 1200 1400 1600 1800 2000 # models m

R2 from 50 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

600 800 1000 1200 1400 1600 1800 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

800 1000 1200 1400 1600 1800 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

1300 1400 1500 1600 1700 1800 1900 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

800 1000 1200 1400 1600 1800 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

800 1000 1200 1400 1600 1800 2000 # models m

R2 from 10 trials

Non-DP Ada SSP

DPSGD-Opt DPSGD-Subopt

Figure 5: Plots of R2 as the number of models m used by Tukey EM varies. The lines mark medians and the shaded regions span the ﬁrst and third quartiles. All datasets except Housing and Diamonds use 10 trials. Housing and Diamonds use 50 trials due to the variance of Tukey EM. Methods other than Tukey EM appear as ﬂat lines because they do not vary with m. Each plot varies the number of models m in increments of 250, starting with the m sufﬁcient to pass PTR in all trials.

7.7 Distribution of models for all datasets 18

Figure 6: Histograms of models on the California dataset.

25 0 25 50 75 0.000

25 0 25 50 75 0.000

0 25 50 75 100 0.000

40 20 0 20 40 60 0.000

25 0 25 50 75 100 0.000

25 0 25 50 75 0.000

0 20 40 60 80 100 0.000

40 20 0 20 40 60 0.00

50 25 0 25 50 0.000

40 20 0 20 40 60 0.000

Figure 7: Histograms of models on Synthetic.

1000 0 1000 0.000

10 0 10 20 30 0.00

7500 5000 2500 0 2500 0.0000

Figure 8: Histograms of models on Trafﬁc.

7.7 Distribution of models for all datasets 19

0 2000 4000 6000 8000 0.0000

1000 500 0 500 1000 0.00000

500 0 500 1000 0.0000

0 500 1000 0.00000

750 500 250 0 250 0.0000

500 250 0 250 0.0000

0 1000 2000 0.00000

1000 0 1000 2000 0.0000

0 2000 0.0000

Figure 9: Histograms of models on Diamonds.

0 10 20 0.00

5 0 5 10 15 20 0.00

0 5 10 15 0.000

0 2 4 6 0.0

Figure 10: Histograms of models on NBA.

7.7 Distribution of models for all datasets 20

1 0 1 2 3 0.0

1 0 1 2 3 0.0

2 4 6 8 0.0

400 200 0 200 400 0.000

250 0 250 500 0.000

200 0 200 0.000

100 0 0.000

100 50 0 50 100 0.000

20 0 20 0.00

0.002 0.004 0.006 0.008 0.010 0

150 100 50 0 50 0.000

150 100 50 0 0.00

200 100 0 0.000

200 100 0 0.000

300 200 100 0 0.000

400 200 0 200 400 0.000

200 100 0 100 0.000

100 0 100 0.000

200 100 0 100 200 0.00

200 100 0 0.000

200 100 0 0.000

200 100 0 100 0.000

100 0 100 0.00

200 100 0 100 0.000

400 200 0 0.000

Figure 11: Histograms of models on Beijing.

20 10 0 10 0.00

10 0 10 20 30 40 0.00

1500 1000 500 0 500 1000 1500 0.0000

1000 500 0 500 1000 1500 0.0000

1000 0 1000 2000 0.0000

1000 500 0 500 1000 1500 0.0000

1000 500 0 500 1000 1500 2000 0.000

0 1000 2000 0.0000

Figure 12: Histograms of models on Garbage.

7.7 Distribution of models for all datasets 21

0.2 0.0 0.2 0.4 0

0.2 0.0 0.2 0

0.25 0.00 0.25 0.50 0.75 1.00 0.0

0.0 0.5 1.0 0.0

0.25 0.20 0.15 0.10 0.05 0.00 0

0.2 0.4 0.6 0.8 0

0.5 0.0 0.5 1.0 1.5 0.0

0.0 0.5 1.0 1.5 0.00

0.5 0.0 0.5 1.0 0.00

1 0 1 2 3 4 0.0

Figure 13: Histograms of models on MLB.