# Oracle Efficient Private Non-Convex Optimization

Seth Neel*¹, Aaron Roth*², Giuseppe Vietri*³, Zhiwei Steven Wu*³

*Equal contribution. ¹Wharton Statistics Department, University of Pennsylvania, Philadelphia, Pennsylvania, USA. ²Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, USA. ³Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, USA. Correspondence to: Seth Neel, Aaron Roth, Giuseppe Vietri, Zhiwei Steven Wu.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

One of the most effective algorithms for differentially private learning and optimization is objective perturbation. This technique augments a given optimization problem (e.g. deriving from an ERM problem) with a random linear term, and then exactly solves it. However, prior analyses of this approach crucially rely on the convexity and smoothness of the objective function. We give two algorithms that extend this approach substantially. The first algorithm requires nothing except boundedness of the loss function, and operates over a discrete domain. We achieve this by introducing a novel normalization step into the objective perturbation algorithm, which provides enough stability to satisfy differential privacy even without convexity. The second algorithm operates over a continuous domain and its privacy analysis requires only that the loss function be bounded and Lipschitz in its continuous parameter. We complement our theoretical results with an empirical evaluation of the non-convex case, in which we use an integer program solver as our optimization oracle. We find that for the problem of learning linear classifiers, directly optimizing for 0/1 loss using our approach can outperform the more standard approach of privately optimizing a convex surrogate loss function on the Adult dataset.

1. Introduction

Consider the general problem of optimizing a function $L : \mathcal{L}^n \times \mathcal{W} \to \mathbb{R}$ defined with respect to a dataset $D \in \mathcal{L}^n$ and a parameter $w \in \mathcal{W}$: $\arg\min_{w \in \mathcal{W}} L(D, w)$. This general class of problems includes classical empirical risk minimization, amongst others, and is a basic problem in learning and optimization. We say that such a function $L$ is 1-sensitive in the dataset $D$ if changing one datapoint in $D$ can change the value of $L(D, w)$ by at most 1, for any parameter value $w$. Suppose that we want to solve an optimization problem like this subject to the constraint of differential privacy. The exponential mechanism provides a powerful, general-purpose, and often error-optimal method to solve this problem (McSherry & Talwar, 2007). It requires no assumptions on the function other than that it is 1-sensitive (this is a minimal assumption for privacy: more generally, its guarantees are parameterized by the sensitivity of the function). Unfortunately, the exponential mechanism is generally infeasible to run: its implementation (and the implementation of related mechanisms, like Report-Noisy-Max (Dwork & Roth, 2014)) requires the ability to enumerate the parameter space $\mathcal{W}$, making it infeasible in most learning settings, despite its use in proving general information theoretic bounds in private PAC learning (Kasiviswanathan et al., 2011).
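For intuition, the exponential mechanism is straightforward to write down when $\mathcal{W}$ is small enough to enumerate. The Python sketch below (our own function name and interface, assuming a 1-sensitive objective $L(D, w) = \sum_i l_i(w)$) makes the source of intractability explicit: the mechanism must score every element of $\mathcal{W}$.

```python
import numpy as np

def exponential_mechanism(losses, W, eps, rng=None):
    """Samples w from a finite set W with probability proportional to
    exp(-eps * L(D, w) / 2), where L(D, w) = sum_i l_i(w) is assumed 1-sensitive.
    The cost is the explicit enumeration and scoring of every element of W."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([sum(l(w) for l in losses) for w in W])
    logits = -eps * (scores - scores.min()) / 2.0  # shift scores for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return W[rng.choice(len(W), p=probs)]
```

The algorithms in this paper aim for comparable generality while replacing the enumeration with calls to an optimization oracle.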
When $L(D, w)$ is continuous, convex, and satisfies second order conditions like strong convexity or smoothness, the situation is better: there are a number of algorithms available, including simple output perturbation (Chaudhuri et al., 2011) and objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012; Iyengar et al., 2019). This partly mirrors the situation in non-private data analysis, in which convex optimization problems can be solved quickly and efficiently, and most non-convex problems are NP-hard in the worst case. In the non-private case, however, the worst-case complexity of optimization problems does not tell the whole story. For many non-convex optimization problems, such as integer programming, there are fast heuristics that not only reliably succeed in optimizing functions deriving from real inputs, but can also certify their own success. In such settings, can we leverage these heuristics to obtain practical private optimization algorithms?

In this paper, we give two novel analyses of objective perturbation algorithms that extend their applicability to 1-sensitive non-convex problems (and more generally, bounded sensitivity functions). We also get new results for convex problems, without the need for second order conditions like smoothness or strong convexity. Our first algorithm operates over a discrete parameter space $\mathcal{W}$, and requires no further assumptions beyond 1-sensitivity for either its privacy or accuracy analysis: it is comparable in generality to the exponential mechanism. The second algorithm operates over a continuous parameter space $\mathcal{W}$, and requires only that $L(D, w)$ be Lipschitz-continuous in its second argument. Its privacy analysis does not require convexity. Its accuracy analysis does require convexity, but does not require any second order conditions. We implement our first algorithm to directly optimize classification error over a discrete set of linear functions on the Adult dataset, and find that it substantially outperforms private logistic regression.

1.1. Related work

Objective perturbation was first introduced by Chaudhuri et al. (2011), and analyzed for the special case of strongly convex functions. Its analysis was subsequently improved and generalized (Kifer et al., 2012; Iyengar et al., 2019) to apply to smooth convex functions, and to tolerate a small degree of error in the optimization procedure. We introduce a variant of objective perturbation that crucially uses a novel normalization step. Our paper is the first to give an analysis of a variant of objective perturbation without the assumption of convexity, and the first to give an accuracy analysis without making second order assumptions on the objective function even in the convex case. Chaudhuri et al. (2011) also introduced the related technique of output perturbation, which perturbs the exact optimizer of a strongly convex function.

The work most closely related to our first algorithm is Neel et al. (2018), who also give a similar oracle-efficient algorithm for non-convex differentially private optimization: that is, an algorithm that reduces private optimization to calls to a non-private optimization oracle. Their algorithm (Report Separator Perturbed Noisy Max, or RSPM) relies on an implicit perturbation of the optimization objective by augmenting the dataset $D$ with a random collection of examples drawn from a separator set. This is a non-standard piece of combinatorial structure, and so Neel et al.
(2018)'s approach only works in limited settings (we do not know of any settings in which it applies beyond the handful specifically identified in Neel et al. (2018)). The algorithms which we introduce in this paper are substantially more general: because they directly perturb the objective, they do not rely on the existence of a small separator set for the class of functions in question. One of the contributions of our paper is the first experimental analysis of RSPM, in Section 5. Neel et al. (2018) also give a generic method to transform an algorithm (like ours) whose privacy analysis depends on the success of the optimization oracle into an algorithm whose privacy analysis does not depend on this, whenever the optimization heuristic can certify its success (integer program solvers have this property). Their method applies to the algorithms we develop in this paper.

Our second algorithm crucially uses an ℓ1-stability result for the Follow-the-Perturbed-Leader (FTPL) method in Suggala & Netrapalli (2019) in the context of online learning. Recently, Vietri et al. (2020) use the same FTPL method in the context of private synthetic data release, but they do not rely on the stability of FTPL to achieve privacy. Similar to our work, they follow the approach of designing oracle-efficient algorithms that use heuristics like integer program solvers to solve private problems that are hard in the worst case, which was first proposed by Gaboardi et al. (2014).

2. Preliminaries

We first define a dataset, a loss function with respect to a dataset, and the two types of optimization oracles we will call upon. We then define differential privacy and state basic properties. A dataset $D \in \mathcal{L}^n$ is defined as a (multi)set of $G$-Lipschitz loss functions $l$. (Note that frequently, the dataset will explicitly contain data points, and the loss functions will be implicitly defined.) For $w$ in a parameter space $\mathcal{W} \subseteq \mathbb{R}^d$, the loss on dataset $D$ is defined to be
$$L(D, w) = \sum_{i=1}^{n} l_i(w),$$
where $n = |D|$ is the size of the dataset $D$.

We will define two types of perturbed loss functions, and the corresponding oracles which are assumed to be able to optimize each type. These will be used in our discrete objective perturbation algorithm in Section 3 and our sampling based objective perturbation algorithm in Section 4, respectively. Given a vector $\eta \in \mathbb{R}^d$, we define the perturbed loss to be
$$L(D, w, \eta) = L(D, w) - \langle \eta, w \rangle.$$
This is simply the loss function augmented with a linear term. Let $\pi$ be the normalization function formally defined in Section 3, which informally maps a $d$-dimensional vector with ℓ2 norm at most $D$ to a unit vector in $\mathbb{R}^{d+1}$. Given a vector $\eta \in \mathbb{R}^{d+1}$, we define the perturbed normalized loss to be
$$L_\pi(D, w, \eta) = L(D, w) - \langle \eta, \pi(w) \rangle.$$

Definition 2.1 (Approximate Linear Optimization Oracle). Given as input a dataset $D \in \mathcal{L}^n$ and a $d$-dimensional vector $\eta$, an $\alpha$-approximate linear optimization oracle $O_\alpha$ returns $w^* = O_\alpha(D, \eta) \in \mathcal{W}$ such that
$$L(D, w^*, \eta) \le \inf_{w \in \mathcal{W}} L(D, w, \eta) + \alpha.$$
When $\alpha = 0$ we say $O$ is a linear optimization oracle.

Definition 2.2 (Approximate Normalized Linear Optimization Oracle). Given as input a dataset $D \in \mathcal{L}^n$ and a $(d+1)$-dimensional vector $\eta$, an $\alpha$-approximate normalized linear optimization oracle $O_{\alpha,\pi}$ returns $w^* = O_{\alpha,\pi}(D, \eta) \in \mathcal{W}$ such that
$$L_\pi(D, w^*, \eta) \le \inf_{w \in \mathcal{W}} L_\pi(D, w, \eta) + \alpha.$$
When $\alpha = 0$ we say $O_\pi$ is a normalized linear optimization oracle.
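To fix ideas, here is a minimal Python sketch of the perturbed loss and of an exact ($\alpha = 0$) linear optimization oracle implemented by brute-force enumeration over a finite $\mathcal{W}$. The names and toy losses are ours, for illustration only; in the settings we care about, the oracle would instead be a heuristic solver (e.g. an integer program solver) rather than an enumeration.

```python
import numpy as np

def perturbed_loss(losses, w, eta):
    """L(D, w, eta) = sum_i l_i(w) - <eta, w>, for a dataset given as a list of losses."""
    return sum(l(w) for l in losses) - eta @ w

def brute_force_linear_oracle(losses, eta, W):
    """Exact (alpha = 0) linear optimization oracle over a finite parameter set W,
    implemented by enumeration; a stand-in for the heuristic solvers assumed in the text."""
    W = [np.asarray(w, dtype=float) for w in W]
    scores = [perturbed_loss(losses, w, eta) for w in W]
    return W[int(np.argmin(scores))]

# Toy usage: three Lipschitz losses and a small discrete parameter set in R^2.
rng = np.random.default_rng(0)
losses = [lambda w, x=x: abs(w @ x) for x in rng.normal(size=(3, 2))]
W = [np.array([a, b]) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
w_star = brute_force_linear_oracle(losses, rng.normal(size=2), W)
```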
We remark that while it seems less natural to assume an oracle for the normalized perturbed loss, which involves the non-linearity $\pi(w)$, in the supplement we show how we can linearize this term by introducing an auxiliary variable and a convex constraint. This is ultimately how we implement this oracle in our experiments.

Definition 2.3. A randomized algorithm $M : \mathcal{L}^n \to \mathcal{W}$ is an $(\alpha, \beta)$-minimizer for $\mathcal{W}$ if for every dataset $D \in \mathcal{L}^n$, with probability $1 - \beta$, it outputs $M(D) = w$ such that
$$\frac{1}{n} L(D, w) \le \inf_{w^* \in \mathcal{W}} \frac{1}{n} L(D, w^*) + \alpha.$$

Certain optimization routines will have guarantees only for discrete parameter spaces:

Definition 2.4 (Discrete parameter spaces). A $\tau$-separated discrete parameter space $\mathcal{W}_\tau \subseteq \mathbb{R}^d$ is a discrete set such that for any pair of distinct vectors $w_1, w_2 \in \mathcal{W}_\tau$ we have $\|w_1 - w_2\|_2 \ge \tau$.

Finally, we define differential privacy. We call two data sets $D, D' \in \mathcal{L}^n$ neighbors (written as $D \sim D'$) if $D'$ can be derived from $D$ by replacing a single loss function $l_i \in D$ with some other element of $\mathcal{L}$.

Definition 2.5 (Differential Privacy (Dwork et al., 2006b;a)). Fix $\epsilon, \delta \ge 0$. A randomized algorithm $A : \mathcal{L}^n \to \mathcal{O}$ is $(\epsilon, \delta)$-differentially private (DP) if for every pair of neighboring data sets $D \sim D' \in \mathcal{L}^n$, and for every event $\Omega \subseteq \mathcal{O}$:
$$\Pr[A(D) \in \Omega] \le \exp(\epsilon) \Pr[A(D') \in \Omega] + \delta.$$

The Laplace distribution centered at 0 with scale $b$ is the distribution with probability density function $\mathrm{Lap}(z \mid b) = \frac{1}{2b} e^{-|z|/b}$. We also make use of the exponential distribution, which has density function $\mathrm{Exp}(z \mid b) = \frac{1}{b} e^{-z/b}$ if $z \ge 0$ and $\mathrm{Exp}(z \mid b) = 0$ otherwise.

3. Objective perturbation over a discrete decision space

In this section we give an objective perturbation algorithm that is $(\epsilon, \delta)$-differentially private for any non-convex Lipschitz objective over a discrete decision space $\mathcal{W}_\tau$. We assume that each $l \in \mathcal{L}$ is $G$-Lipschitz over $\mathcal{W}_\tau$ with respect to the ℓ2 norm: that is, for any $w, w' \in \mathcal{W}_\tau$, $|l(w) - l(w')| \le G \|w - w'\|_2$. Note that if $l$ takes values in $[0, 1]$, then we know $l$ is also $1/\tau$-Lipschitz due to the $\tau$-separation in $\mathcal{W}_\tau$.

The Normalization Trick. The key technical innovation in this section of the paper is the modification of the standard objective perturbation algorithm by introducing a normalization step: rather than minimizing the perturbed loss, we minimize the perturbed normalized loss. Let $D$ be a bound on the maximum ℓ2 norm of any vector in $\mathcal{W}_\tau$. We will make use of a normalization onto the unit sphere in one higher dimension. The normalization function $\pi : \mathbb{R}^d \to \mathbb{R}^{d+1}$ is defined as
$$\pi(w) = \frac{1}{D}\left(w_1, \ldots, w_d,\; D\sqrt{1 - \|w\|_2^2 / D^2}\right).$$
Note that $\|\pi(w)\|_2 = 1$ for all $w \in \mathcal{W}_\tau$, and also that for any $w, w' \in \mathcal{W}_\tau$,
$$\|\pi(w) - \pi(w')\|_2^2 \ge \frac{1}{D^2} \|w - w'\|_2^2, \qquad (1)$$
since $\|\pi(w) - \pi(w')\|_2^2 = \frac{1}{D^2}\big(\|w - w'\|_2^2 + D^2\big(\sqrt{1 - \|w\|_2^2/D^2} - \sqrt{1 - \|w'\|_2^2/D^2}\big)^2\big) \ge \frac{1}{D^2}\|w - w'\|_2^2$. This shows that normalizing into the $(d+1)$-dimensional sphere cannot force points too much closer together than they start.
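The unit-norm property and the contraction bound (1) are the only two facts about $\pi$ that the privacy argument below uses. The following sketch (our own code, not part of the paper's implementation) implements the map and checks both properties numerically:

```python
import numpy as np

def normalize(w, D):
    """The lifting map pi: R^d -> R^{d+1} from Section 3, assuming ||w||_2 <= D.

    Appends the coordinate D * sqrt(1 - ||w||^2 / D^2) and rescales by 1/D,
    so that ||pi(w)||_2 = 1 for every feasible w.
    """
    w = np.asarray(w, dtype=float)
    extra = D * np.sqrt(max(0.0, 1.0 - (w @ w) / D**2))
    return np.concatenate([w, [extra]]) / D

# Sanity checks for the two properties used in the privacy proof:
# unit norm, and the contraction bound (inequality (1)).
rng = np.random.default_rng(0)
D = 5.0
for _ in range(1000):
    w1, w2 = rng.uniform(-1, 1, size=(2, 4)) * D / 2  # points with ||w||_2 <= D
    p1, p2 = normalize(w1, D), normalize(w2, D)
    assert abs(np.linalg.norm(p1) - 1.0) < 1e-9
    assert np.linalg.norm(p1 - p2) ** 2 >= np.linalg.norm(w1 - w2) ** 2 / D**2 - 1e-9
```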
The intuition behind the privacy proof is that the linear perturbation term provides stability; specifically, we will argue that for any value of the noise $\eta$ that induces a particular minimizer $\hat{w}$ on a dataset $D$, there is a nearby value $\eta'$ that would induce $\hat{w}$ on any adjacent dataset $D'$. The argument proceeds by contradiction: suppose that there existed some $v \ne \hat{w}$ that was the minimizer on $D'$. Then since $D$ and $D'$ only differ in one data point, the difference between the normalized losses of $v$ and $\hat{w}$ on $D'$ can be broken into three terms: the difference between their scores on $D$ with the original perturbation term $\eta$, the difference between their scores on the two data points that differ between $D$ and $D'$, and the inner product of their normalized difference $\pi(\hat{w}) - \pi(v)$ with $\eta' - \eta$. The first term is positive by virtue of $\hat{w}$ being the minimizer on the original dataset $D$. The second term can be lower bounded using the Lipschitzness of $L$. The third term is lower bounded using the fact that $\eta' - \eta$ is chosen to maximize the inner product $\langle \eta' - \eta, \pi(\hat{w}) - \pi(v) \rangle$, by making the change in noise $\eta' - \eta$ move in the direction of $\pi(\hat{w})$. We can only guarantee this has a greater inner product with $\hat{w}$ than with $v$ if $\|\pi(\hat{w})\|_2 = \|\pi(v)\|_2$, which is the rationale behind the normalization trick. Then the whole expression can be shown to be lower bounded by 0, contradicting the fact that $v$ is the unique minimizer of the normalized loss on $D'$.

We now prove that OPDisc is differentially private, illustrating the importance of the normalization trick. We then state an accuracy bound, which follows from a simple tail bound on the random linear perturbation term.

Algorithm 1 Objective Perturbation over Discrete Space (OPDisc)
input: $D = \{l_i\}_{i=1}^n$, oracle $O_\pi$ over $\mathcal{W}_\tau$, privacy parameters $\epsilon, \delta$
  $\sigma \leftarrow \frac{7 G D^2 \sqrt{2 \ln(1/\delta)}}{\tau \epsilon}$
  Draw a random vector $\eta \sim N(0, \sigma^2)^{d+1}$ and use the normalized oracle to solve:
  $\hat{w} \leftarrow O_\pi(D, \eta) \in \arg\min_{w \in \mathcal{W}_\tau} L_\pi(D, w, \eta)$
output: $\hat{w}$

Theorem 1. Algorithm 1 is $(\epsilon, \delta)$-differentially private.

Proof. For any realized noise vector $\eta$, we write $\hat{w} = O_\pi(D, \eta)$ as the output. Now consider a set of mappings $\mathcal{G} : \mathcal{W}_\tau \times \mathbb{R}^{d+1} \to \mathbb{R}^{d+1}$. If we can show:
1. there exists $g \in \mathcal{G}$ such that $\hat{w} = O_\pi(D', g(\hat{w}, \eta))$ (Lemma 4);
2. $\Pr[\eta] \approx \Pr[g(\hat{w}, \eta)]$ (Lemma 3);
3. with probability 1, $\arg\min_{w \in \mathcal{W}_\tau} L_\pi(D, w, \eta)$ is unique (Lemma 2);
then the probability of outputting any particular $\hat{w}$ on input $D$ is close to the corresponding probability on input $D'$, as desired. Lemma 3 follows from simple properties of the Gaussian distribution, and Lemma 2 from the discreteness of $\mathcal{W}_\tau$; both are established in the Appendix. We focus on proving Lemma 4, which is the central part of the proof.

Lemma 2. Fix any $\tau$-separated parameter space $\mathcal{W}_\tau$. For every dataset $D$ there is a subset $B \subset \mathbb{R}^{d+1}$ such that $\Pr[\eta \in B] = 0$ and for any $\eta \in \mathbb{R}^{d+1} \setminus B$ there is a unique minimizer
$$\hat{w} \in \arg\min_{w \in \mathcal{W}_\tau} L(D, w) - \langle \eta, \pi(w) \rangle.$$

Denote the set of noise vectors that induce output $w$ on dataset $D$ by $E(D, w) = \{\eta : O_\pi(D, \eta) = w\}$. Define our mapping $g \in \mathcal{G}$ by
$$g(\hat{w}, \eta) \stackrel{\text{def}}{=} g_{\hat{w}}(\eta) = \eta + c\,\pi(\hat{w}), \qquad \text{where } c = \frac{4GD^2}{\tau}.$$
Note that the vector $\eta' - \eta = g_{\hat{w}}(\eta) - \eta$ is parallel to $\pi(\hat{w})$. Lemma 3 shows that with high probability over the draw of $\eta$, $\Pr[\eta] \approx \Pr[g_{\hat{w}}(\eta)]$.

Lemma 3. Let $\eta \sim N(0, \sigma^2)^{d+1}$ with $\sigma \ge \frac{7 G D^2 \sqrt{2\log(1/\delta)}}{\tau \epsilon}$, and let $w \in \mathcal{W}_\tau$. Then there exists a set $C \subset \mathbb{R}^{d+1}$ such that $\Pr[\eta \in C^c] \ge 1 - \delta$, and for all $r \in C^c$, if $p$ denotes the probability density function of $\eta$,
$$p(r) \le e^{\epsilon}\, p(g_w(r)).$$

Lemma 4. Fix any $\hat{w}$ and any pair of neighboring datasets $D, D'$. Let $\eta \in E(D, \hat{w})$ be such that $\hat{w}$ is the unique minimizer of $\inf_{w} L(D, w) - \langle \eta, \pi(w) \rangle$. Then $g_{\hat{w}}(\eta) \in E(D', \hat{w})$. Hence:
$$\mathbb{1}\{\eta \in E(D, \hat{w})\} \le \mathbb{1}\{g_{\hat{w}}(\eta) \in E(D', \hat{w})\}.$$

Proof. Let $c = \frac{4GD^2}{\tau}$. Suppose that $v \ne \hat{w}$ is the output on the neighboring dataset $D'$ when the noise vector is $g_{\hat{w}}(\eta)$. We will derive a contradiction. Since $v$ is the unique minimizer on $D'$:
$$\big(L(D', v) - \langle g_{\hat{w}}(\eta), \pi(v) \rangle\big) - \big(L(D', \hat{w}) - \langle g_{\hat{w}}(\eta), \pi(\hat{w}) \rangle\big) < 0.$$
Let $i$ be the index where $D$ and $D'$ differ, so that $l_i \in D$ and $l'_i \in D'$. Then $L(D', w) = L(D, w) - l_i(w) + l'_i(w)$. Now, writing the loss function in terms of $D$ and rearranging terms:
$$\Big[\big(L(D, v) - \langle \eta, \pi(v) \rangle\big) - \big(L(D, \hat{w}) - \langle \eta, \pi(\hat{w}) \rangle\big)\Big] + l_i(\hat{w}) - l_i(v) - \big(l'_i(\hat{w}) - l'_i(v)\big) + \langle c\,\pi(\hat{w}), \pi(\hat{w}) \rangle - \langle c\,\pi(\hat{w}), \pi(v) \rangle < 0.$$
Since $\hat{w}$ is the unique minimizer for $D$ and $\eta$, the term in the square brackets is positive. Hence:
$$l_i(\hat{w}) - l_i(v) - \big(l'_i(\hat{w}) - l'_i(v)\big) + \langle c\,\pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle < 0.$$
Since $l_i, l'_i$ are $G$-Lipschitz functions, $l_i(\hat{w}) - l_i(v) - \big(l'_i(\hat{w}) - l'_i(v)\big) \ge -2G\|\hat{w} - v\|_2$.
Now comes the importance of the normalization trick: because $\|\pi(v)\|_2 = \|\pi(\hat{w})\|_2 = 1$,
$$\langle c\,\pi(\hat{w}), \pi(\hat{w}) - \pi(v) \rangle = \frac{c}{2}\, \|\pi(\hat{w}) - \pi(v)\|_2^2,$$
by expanding $\|\pi(\hat{w}) - \pi(v)\|_2^2$. Note that without the normalization, this last term could be negative, breaking the contradiction argument. Substituting, the inequality becomes
$$-2G\|\hat{w} - v\|_2 + \frac{c}{2}\|\pi(\hat{w}) - \pi(v)\|_2^2 < 0.$$
For the next step we use inequality (1). We also apply the assumption that for two distinct vectors $\hat{w} \ne v$ we have $\|\hat{w} - v\|_2 \ge \tau$:
$$\frac{c}{2D^2}\|\hat{w} - v\|_2^2 < 2G\|\hat{w} - v\|_2 \qquad \text{(inequality (1))}$$
$$\frac{c}{2D^2}\|\hat{w} - v\|_2 < 2G \qquad \text{(divide both sides by $\|\hat{w} - v\|_2$)}$$
$$c\,\|\hat{w} - v\|_2 < 4GD^2$$
$$c\,\tau < 4GD^2 \qquad \text{(by assumption $\|\hat{w} - v\|_2 \ge \tau$)}$$
$$c < \frac{4GD^2}{\tau} \qquad \text{(divide both sides by $\tau$)}.$$
This contradicts $c = \frac{4GD^2}{\tau}$.

Putting the Lemmas together, for any $S \subseteq \mathcal{W}_\tau$:
$$\Pr[O_\pi(D, \eta) \in S] = \Pr\Big[\eta \in \bigcup_{\hat{w} \in S} E(D, \hat{w})\Big]$$
$$= \int_{\mathbb{R}^{d+1}} p(\eta)\, \mathbb{1}\Big\{\eta \in \bigcup_{\hat{w} \in S} E(D, \hat{w})\Big\}\, d\eta$$
$$= \int_{(\mathbb{R}^{d+1} \setminus B) \setminus C} p(\eta)\, \mathbb{1}\Big\{\eta \in \bigcup_{\hat{w} \in S} E(D, \hat{w})\Big\}\, d\eta + \int_{C} p(\eta)\, \mathbb{1}\Big\{\eta \in \bigcup_{\hat{w} \in S} E(D, \hat{w})\Big\}\, d\eta \qquad (2)$$
$$\le \int_{(\mathbb{R}^{d+1} \setminus C) \setminus B} p(\eta)\, \mathbb{1}\Big\{\eta \in \bigcup_{\hat{w} \in S} E(D, \hat{w})\Big\}\, d\eta + \delta \qquad (3)$$
$$= \sum_{\hat{w} \in S} \int_{\mathbb{R}^{d+1} \setminus (C \cup B)} p(\eta)\, \mathbb{1}\{\eta \in E(D, \hat{w})\}\, d\eta + \delta$$
$$\le \sum_{\hat{w} \in S} \int_{\mathbb{R}^{d+1} \setminus (C \cup B)} p(\eta)\, \mathbb{1}\{g_{\hat{w}}(\eta) \in E(D', \hat{w})\}\, d\eta + \delta \qquad (4)$$
$$\le \sum_{\hat{w} \in S} \int_{\mathbb{R}^{d+1} \setminus (C \cup B)} e^{\epsilon}\, p(g_{\hat{w}}(\eta))\, \mathbb{1}\{g_{\hat{w}}(\eta) \in E(D', \hat{w})\}\, d\eta + \delta \qquad (5)$$
$$= \sum_{\hat{w} \in S} \int_{\mathbb{R}^{d+1} \setminus (g_{\hat{w}}(C) \cup g_{\hat{w}}(B))} e^{\epsilon}\, p(\eta')\, \mathbb{1}\{\eta' \in E(D', \hat{w})\}\, d\eta' + \delta \qquad (6)$$
$$\le e^{\epsilon} \sum_{\hat{w} \in S} \int_{\mathbb{R}^{d+1}} p(\eta)\, \mathbb{1}\{\eta \in E(D', \hat{w})\}\, d\eta + \delta$$
$$= e^{\epsilon}\, \Pr\Big[\eta \in \bigcup_{\hat{w} \in S} E(D', \hat{w})\Big] + \delta = e^{\epsilon}\, \Pr[O_\pi(D', \eta) \in S] + \delta,$$
where equality (2) follows from Lemma 2 (the set $B$ has measure zero, and outside $B$ the events $E(D, \hat{w})$ are disjoint, so the union may be written as a sum). Inequality (3) holds because $C$ is chosen such that $\Pr[\eta \in C] < \delta$. Inequality (4) is from Lemma 4 and inequality (5) is from the bounded-ratio Lemma 3. Lastly, equality (6) follows because the mapping $\eta \mapsto g_{\hat{w}}(\eta)$ is one-to-one; also note that $\big|\frac{\partial g_{\hat{w}}(\eta)}{\partial \eta}\big| = 1$. This completes the proof.

We now state the accuracy guarantee, which follows from a standard Gaussian tail bound. Then in Subsection 3.1 we compare this guarantee to the accuracy guarantee for the competing RSPM method for learning discrete hyperplanes, in order to shed some light on the accuracy guarantee in practice.

Theorem 5 (Utility). Algorithm 1 is an $(\alpha, \beta)$-minimizer for $\mathcal{W}_\tau$ with
$$\alpha = O\!\left(\frac{G D^2 \sqrt{2(d+1)\ln(4/\beta)\ln(1/\delta)}}{\tau\,\epsilon n}\right).$$

3.1. Comparing OPDisc and RSPM

While both OPDisc and the RSPM algorithm of Neel et al. (2018) require discrete parameter spaces, OPDisc is substantially more general in that it only requires the loss functions to be Lipschitz, whereas RSPM assumes the loss functions are bounded in $\{0, 1\}$ (and hence $1/\tau$-Lipschitz over $\mathcal{W}_\tau$) and assumes the existence of a small separator set (defined in the supplement). Nevertheless, we might hope that in addition to greater generality, OPDisc has comparable or superior accuracy for natural classes of learning problems. We show this is indeed the case for the fundamental task of privately learning discrete hyperplanes, where it is better by a linear factor in the dimension. We define the RSPM algorithm, for which we must define the notion of a separator set, in the supplement.

Theorem 6 (RSPM Utility (Neel et al., 2018)). Let $\mathcal{W}_\tau$ be a discrete parameter space with a separator set of size $m$. The Gaussian RSPM algorithm is an oracle-efficient $(\alpha, \beta)$-minimizer for $\mathcal{W}_\tau$ with
$$\alpha = O\!\left(\frac{\sqrt{m\,\ln(2m/\beta)\,\ln(1/\delta)}}{\epsilon n}\right).$$

Let $I_\tau$ be a $\tau$-discretization of $[-1, 1]^d$, e.g. $I_\tau = \{-1, -1+\tau, \ldots, 0, \tau, 2\tau, \ldots, 1\}^d$. Let $\mathcal{W}_\tau$ be the subset of vectors in this discretization that lie within the unit Euclidean ball: $\mathcal{W}_\tau = I_\tau \cap B(1)$, where $B(1)$ is the unit ℓ2 ball in $\mathbb{R}^d$. $\mathcal{W}_\tau$ is $\tau$-separated since any two distinct $w, w'$ differ in at least one coordinate by at least $\tau$. Moreover, $\mathcal{W}_\tau$ admits a separator set of size $m = \frac{2(d-1)}{\tau}$ (see the Appendix of Neel et al. (2018)). Since the loss functions $l_i(w) = \mathbb{1}\{\langle w, x_i \rangle \ge 1\}$ take values in $\{0, 1\}$ and $\mathcal{W}_\tau$ is $\tau$-separated, the loss functions $l_i$ are $\frac{1}{\tau}$-Lipschitz.
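Before comparing the two accuracy bounds, the following sketch makes the construction concrete: it builds the discretized parameter space $\mathcal{W}_\tau$ described above and runs an OPDisc-style routine over it, with a brute-force minimization standing in for the normalized oracle $O_\pi$ (a MIP solver in our experiments) and with the noise scale following the form reconstructed above; the exact constant should be treated as an assumption.

```python
import itertools
import numpy as np

def discrete_unit_ball(d, tau):
    """W_tau = I_tau intersected with the unit Euclidean ball (small d only:
    the grid has (2/tau + 1)^d points before filtering)."""
    grid_1d = np.arange(-1.0, 1.0 + tau / 2, tau)
    return [np.array(w) for w in itertools.product(grid_1d, repeat=d)
            if np.linalg.norm(w) <= 1.0]

def op_disc(losses, W_tau, G, D, tau, eps, delta, rng=None):
    """Sketch of Algorithm 1 (OPDisc) with a brute-force normalized oracle.

    The noise scale sigma follows the reconstructed form
    7 * G * D^2 * sqrt(2 ln(1/delta)) / (tau * eps); treat the constant as an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(W_tau[0])
    sigma = 7.0 * G * D**2 * np.sqrt(2.0 * np.log(1.0 / delta)) / (tau * eps)
    eta = rng.normal(0.0, sigma, size=d + 1)

    def pi(w):  # the normalization map from Section 3
        return np.concatenate([w, [D * np.sqrt(max(0.0, 1.0 - w @ w / D**2))]]) / D

    def normalized_perturbed_loss(w):
        return sum(l(w) for l in losses) - eta @ pi(w)

    return min(W_tau, key=normalized_perturbed_loss)

# Toy run: 0/1 hyperplane losses for random labeled points, tau = 0.5, G = 1/tau, D = 1.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 3)), rng.choice([-1, 1], size=50)
losses = [lambda w, x=x, yi=yi: float(yi != np.sign(x @ w + 1e-12)) for x, yi in zip(X, y)]
W_tau = discrete_unit_ball(3, 0.5)
w_priv = op_disc(losses, W_tau, G=1 / 0.5, D=1.0, tau=0.5, eps=1.0, delta=1e-6, rng=rng)
```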
By Theorem 6, RSPM has accuracy bound
$$\alpha_{\mathrm{RSPM}} = O\!\left(\frac{1}{\epsilon n}\sqrt{\frac{d\,\log(d/\beta\tau)\,\log(1/\delta)}{\tau}}\right).$$
By Theorem 5, OPDisc has accuracy bound
$$\alpha_{\mathrm{OPDisc}} = O\!\left(\frac{\sqrt{d\,\log(1/\beta)\,\log(1/\delta)}}{\tau^2\,\epsilon n}\right).$$
Thus, in this case, OPDisc has an accuracy bound that differs by a factor of roughly $d\tau$. However, the bound of OPDisc is better only when $\tau$ is greater than $1/d^2$, which raises the question of how to set this parameter. The trade-off is that setting $\tau$ too small makes the algorithm OPDisc add too much noise to the objective, and our accuracy guarantee degrades very quickly. On the other hand, if $\tau$ is too large, then we can miss the optimal solution by a large margin. However, for practical scenarios, setting the value of $\tau$ to be much larger than $1/d^2$ gives a discretized decision space whose optimum is not too far from the optimum of the corresponding continuous decision space. For instance, in our experiments, we set $\tau$ equal to one.

4. Objective perturbation for Lipschitz functions

We now present an objective perturbation algorithm (paired with an additional output perturbation step) which applies to arbitrary parameter spaces. The privacy guarantee holds for (possibly non-convex) Lipschitz loss functions, while the accuracy guarantee applies only if the loss functions are convex and bounded. Even in the convex case, this is a substantially more general statement than was previously known for objective perturbation: we do not require any second order conditions like strong convexity or smoothness (or even differentiability). Our guarantees also hold with access only to an $\alpha$-approximate optimization oracle.

We present the full algorithm in Algorithm 2. It 1) uses the approximate linear oracle (in Definition 2.1) to solve polynomially many perturbed optimization objectives, each with an independent random perturbation, and 2) perturbs the average of these solutions with Laplace noise. Before we proceed to our analysis, let us first introduce some relevant parameters. Let $\mathcal{W}$ have ℓ∞ diameter $D_\infty$ and ℓ2 diameter $D_2$. We assume that the loss functions $l_i \in \mathcal{L}$ are $G$-Lipschitz with respect to the ℓ1 norm, and assume the loss functions are scaled to take values in $[0, 1]$. Our utility analysis requires convexity of the loss functions, and essentially follows from the high-probability bounds on the linear perturbation terms in the first stage and the output perturbation in the second stage.

Algorithm 2 Objective Perturbation Sampling (OPSamp)
input: approximate optimization oracle $O_\alpha$, a dataset $D = \{l_i\}_{i=1}^n$, privacy parameters $\epsilon, \delta$
  $\gamma \leftarrow \frac{\epsilon}{n\, d^{5/4}\, D_2}$; $\quad m \leftarrow \frac{\ln(2d/\delta)}{2\gamma^2}$; $\quad \eta \leftarrow \frac{2d\epsilon}{250\, G^2 d^2 D_2\, (1 + \log(2/\beta))\, n}$
  for $k = 1$ to $m$ do
    Sample a random vector $\sigma_k \sim \mathrm{Exp}(\eta)^d$
    $w_k \leftarrow O_\alpha(D, \sigma_k)$
  end for
  $\lambda \leftarrow 4 D_\infty \gamma + 250\, \eta\, G\, d^2 D_2 + \frac{\alpha}{10G}$
  $\mu \sim \mathrm{Lap}(\lambda/\epsilon)^d$
output: $\frac{1}{m} \sum_{k=1}^m w_k + \mu$

Theorem 7 (Utility). Assuming the loss functions are convex, Algorithm 2 is an $(\alpha', \beta)$-minimizer for $\frac{1}{n} L(D, w)$ with
$$\alpha' = O\!\left(\frac{\mathrm{poly}(d)\, D_2 \log(1/\beta)}{\epsilon n} + \alpha \log(1/\beta)\right),$$
where $\alpha$ is the approximation error of the oracle $O_\alpha$.
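The following Python sketch mirrors the structure of Algorithm 2: average the oracle's answers over $m$ independently perturbed objectives, then add Laplace noise calibrated to the ℓ1-sensitivity bound derived below. The parameter settings ($\gamma$, $m$, $\eta$, $\lambda$) follow the forms reconstructed above and should be treated as assumptions about constants; the oracle is assumed exact ($\alpha = 0$), and `linear_oracle` is any callable implementing Definition 2.1. Note that $m$ is polynomially large in $n$ and $d$, so this is illustrative rather than practical as written.

```python
import numpy as np

def op_samp(losses, linear_oracle, d, D_inf, D_2, G, eps, delta, beta, n, rng=None):
    """Sketch of Algorithm 2 (OPSamp).

    `linear_oracle(losses, sigma)` should return (approximately) a minimizer in R^d of
    L(D, w) - <sigma, w>. Constants in gamma, m, eta, lam are reconstructed assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = eps / (n * d ** 1.25 * D_2)
    m = int(np.ceil(np.log(2 * d / delta) / (2 * gamma ** 2)))  # polynomially many calls
    eta = 2 * d * eps / (250 * G ** 2 * d ** 2 * D_2 * (1 + np.log(2 / beta)) * n)

    # Average the oracle's answers on m independently perturbed objectives.
    w_bar = np.zeros(d)
    for _ in range(m):
        sigma = rng.exponential(scale=eta, size=d)
        w_bar += np.asarray(linear_oracle(losses, sigma), dtype=float) / m

    # Output perturbation: Laplace noise calibrated to the l1-sensitivity bound;
    # the alpha/(10 G) term is dropped because the oracle is assumed exact here.
    lam = 4 * D_inf * gamma + 250 * eta * G * d ** 2 * D_2
    return w_bar + rng.laplace(scale=lam / eps, size=d)
```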
The privacy analysis of this algorithm crucially depends on a stability lemma proven by Suggala & Netrapalli (2019) in the context of online learning, and does not require convexity.¹

Lemma 8 (Stability lemma (Suggala & Netrapalli, 2019)). Fix any pair of neighboring data sets $D, D'$, and let $O_\alpha(D, \sigma)$ and $O_\alpha(D', \sigma)$ be the outputs of an $\alpha$-approximate oracle on datasets $D$ and $D'$ respectively. Then
$$\mathbb{E}_\sigma\big[\|O_\alpha(D, \sigma) - O_\alpha(D', \sigma)\|_1\big] \le 250\, \eta\, G\, d^2 D_2 + \frac{\alpha}{10G}.$$

From now on, let $\Sigma = \{\sigma_i : i \in [m]\}$ be a sequence of $m$ i.i.d. $d$-dimensional noise vectors, and let $W(D, \Sigma) = \frac{1}{m}\sum_i O_\alpha(D, \sigma_i)$ be the average output of $m$ calls to an $\alpha$-approximate oracle.

Lemma 9. If $m = \frac{\ln(2d/\delta)}{2\gamma^2}$ for $0 < \gamma \le 1$, then with probability $1 - \delta/2$,
$$\big\|W(D, \Sigma) - \mathbb{E}_\sigma[O_\alpha(D, \sigma)]\big\|_1 \le 2 D_\infty \gamma,$$
where the randomness is taken over the different runs of $O_\alpha$.

The next lemma combines Lemma 8 and Lemma 9 to get a high probability sensitivity bound for the average output of the approximate oracle.

Lemma 10 (High Probability ℓ1-sensitivity). For any pair of neighboring datasets $D, D'$, let $W(D, \Sigma), W(D', \Sigma)$ be the sample averages after $m = \frac{\ln(2d/\delta)}{2\gamma^2}$ calls to an $\alpha$-approximate oracle. Then, with probability $1 - \delta$ over the random draws of $\Sigma$,
$$\|W(D, \Sigma) - W(D', \Sigma)\|_1 \le 4 D_\infty \gamma + 250\, \eta\, G\, d^2 D_2 + \frac{\alpha}{10G}. \qquad (7)$$

Theorem 11. Algorithm 2 is $(\epsilon, \delta)$-differentially private.

Proof sketch. Given a pair of neighboring data sets $D, D'$, we condition on the event that the noise vectors $\Sigma$ satisfy the ℓ1-sensitivity bound (7), which occurs with probability at least $1 - \delta$. Then the privacy guarantee follows from the use of the Laplace mechanism.

¹Compared to the bound in Suggala & Netrapalli (2019), our bound has an additional factor of 2, since our neighboring relationship in Definition 2.5 is defined via replacement, whereas in Suggala & Netrapalli (2019) stability is defined in terms of adding another loss function.

5. Experiments

For our experiments, we consider the problem of privately learning a linear threshold function to solve a binary classification task. Given a labeled data set $\{(x_i, y_i)\}_{i=1}^n$ where each $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, the classification problem is to find a hyperplane that best separates the positive from the negative samples. A common approach is to optimize a convex surrogate loss function that approximates the classification loss. We use this approach (private logistic regression) as our baseline. In comparison, using our algorithm OPDisc, we instead directly optimize 0/1 classification error over a discrete parameter space, using an integer program solver. Although this can be computationally expensive, we find that it is feasible for relatively small datasets (we use a balanced subset of the Adult dataset with roughly $n = 15{,}000$ records and $d = 23$ features, after one-hot encoding of categorical features). In this setting, we find that OPDisc can substantially outperform private logistic regression. We remark that small data is the regime in which applying differential privacy is most challenging, and we view our approach as a promising way forward in this important setting.

Data description and pre-processing. We use the Adult dataset (Lichman, 2013), a common benchmark dataset derived from Census data. The classification task is to predict whether an individual earns over 50K per year. The dataset has $n = 48842$ records and 14 features that are a mix of both categorical and continuous attributes. The Adult dataset is unbalanced: only 7841 individuals have the positive (over 50K) label. To arrive at a balanced dataset (so that constant functions achieve 50% error), we take all positive individuals, and an equal number of negative individuals selected at random, for a total dataset size of $n = 15682$. We encode categorical features with one-hot encodings, which increases the dimensionality of the dataset. We found it challenging to run our algorithm with more than 30 features, and so we take a subset of 7 features from the Adult dataset that are represented by $d = 23$ real-valued features after one-hot encoding. We chose the subset of features to optimize the accuracy of our logistic regression baseline.
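A sketch of this preprocessing in Python/pandas is below. The file name and column names are hypothetical (the Adult dataset's exact schema varies by source), and the feature subset shown is for illustration only, not the subset used in our experiments.

```python
import pandas as pd

# Hypothetical file name and column names; adjust to the actual Adult schema.
df = pd.read_csv("adult.csv")
df["label"] = (df["income"] == ">50K").astype(int)

# Balance the classes: keep every positive record and an equal-size random
# subsample of negative records, as described above.
pos = df[df["label"] == 1]
neg = df[df["label"] == 0].sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg]).sample(frac=1.0, random_state=0)  # shuffle

# One-hot encode a small subset of (hypothetical) categorical features; in our setup
# this kind of encoding is what brings the dimensionality to d = 23 real-valued features.
features = ["age", "education", "occupation", "sex", "hours-per-week"]
X = pd.get_dummies(balanced[features])
y = balanced["label"].replace({0: -1})
```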
Baseline: private logistic regression (LR). We use as our baseline private logistic regression, which optimizes over the space of continuous halfspaces with the goal of minimizing the logistic loss function, given by $l_i(w) = \log(1 + \exp(-y_i \langle w, x_i \rangle))$. We implement a differentially private stochastic gradient descent (private SGD) algorithm from Bassily et al. (2014); Abadi et al. (2016), keeping track of the privacy loss using the moments accountant method as implemented in the TensorFlow Privacy library. The algorithm involves three parameters: gradient clip norm, mini-batch size, and learning rate. For each pair of target privacy parameters (ϵ, δ), we run a grid search to identify the triplet of parameters that gives the highest accuracy. To lower the variance of the accuracy, we also average over all the iterates in the run of private SGD.

Implementation details for OPDisc and RSPM. For both OPDisc and RSPM, we encode each record $(x_i, y_i) \in D$ as a 0/1 loss function: $l_i(w) = \mathbb{1}[y_i \ne \mathrm{sgn}(\langle x_i, w \rangle)]$. For both algorithms, we use separation parameter $\tau = 1$ and constrain the weight vectors to have ℓ2 norm bounded by $d$. In OPDisc, each coordinate $w_j$ can take values in the discrete set $\{-B, -B+1, \ldots, B-1, B\}$ with $B = d$, and we constrain $\|w\|_2$ to be at most $d$. In RSPM, we optimize over the set $\{-1, 0, 1\}^d$. OPDisc requires an approximate normalized linear optimization oracle (Definition 2.2) and RSPM requires a linear optimization oracle (Definition 2.1). In the appendix, we show that the optimization problems can be cast as mixed-integer programs (MIPs), allowing us to implement the oracles via the Gurobi MIP solver. The Gurobi solver was able to solve each of the integer programs we passed it.

Figure 1. Accuracy versus ϵ.
Figure 2. Distribution of run time.
Figure 3. Accuracy and runtime evaluation of OPDisc, RSPM, and Private Logistic Regression (LR) on the Adult data set with size $n = 15682$ and $d = 23$ features. The value of $\delta = 1/n^2$ for all methods in all runs.

Empirical evaluation. We evaluate our algorithms by their 0/1 classification accuracy. Figure 1(a) plots the accuracy of OPDisc and our baseline (y-axis) as a function of the privacy parameter ϵ (x-axis), averaged over 15 runs. We fix $\delta = 1/n^2$ for all three algorithms across all runs. The error bars report the empirical standard deviation. We see that both OPDisc and RSPM improve dramatically over the logistic regression baseline. This shows that in small-data settings, it is possible to improve over the error/privacy tradeoff given by standard convex-surrogate approaches by appealing to non-convex optimization heuristics. OPDisc also obtains consistently better error than RSPM, and has a significantly lower variance in its error compared to the other two algorithms. Figure 2(a) gives a histogram of the run-time of our three methods for our experiment. For both OPDisc and RSPM, the running time is dominated by the integer-program solver. We see that while our method frequently completes quite quickly (often even beating our logistic regression baseline!), it has high variance, and occasionally requires a long time to run. However, we were always able to solve the necessary optimization problems, eventually.
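For reference, the private SGD baseline described above can be sketched as follows. This is a schematic numpy version with per-example gradient clipping and Gaussian noise; it is not the TensorFlow Privacy implementation used in our experiments, and calibrating the noise multiplier to a target (ϵ, δ) requires a privacy accountant, which this sketch omits. The parameter names are illustrative.

```python
import numpy as np

def dp_sgd_logistic(X, y, clip_norm, noise_multiplier, lr, epochs, batch_size, rng=None):
    """Schematic DP-SGD for logistic loss: per-example gradient clipping plus Gaussian
    noise on the summed minibatch gradient. Privacy accounting is intentionally omitted."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w, iterates = np.zeros(d), []
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            # Per-example gradients of log(1 + exp(-y <w, x>)), clipped to clip_norm.
            margins = y[idx] * (X[idx] @ w)
            grads = (-y[idx] / (1.0 + np.exp(margins)))[:, None] * X[idx]
            norms = np.maximum(np.linalg.norm(grads, axis=1) / clip_norm, 1.0)
            clipped = grads / norms[:, None]
            noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
            w -= lr * (clipped.sum(axis=0) + noise) / len(idx)
            iterates.append(w.copy())
    # Average over iterates to reduce variance, as in the experiments.
    return np.mean(iterates, axis=0)
```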
6. Acknowledgments

Giuseppe Vietri has been supported by the GAANN fellowship from the U.S. Department of Education.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318. ACM, 2016.

Bassily, R., Smith, A. D., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pp. 464-473, 2014. doi: 10.1109/FOCS.2014.56. URL https://doi.org/10.1109/FOCS.2014.56.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069-1109, 2011.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486-503. Springer, 2006a.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, pp. 265-284, Berlin, Heidelberg, 2006b. Springer-Verlag. ISBN 3-540-32731-2, 978-3-540-32731-8. doi: 10.1007/11681878_14. URL http://dx.doi.org/10.1007/11681878_14.

Gaboardi, M., Arias, E. J. G., Hsu, J., Roth, A., and Wu, Z. S. Dual query: Practical private query release for high dimensional data. In International Conference on Machine Learning, pp. 1170-1178, 2014.

Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., and Wang, L. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.

Kifer, D., Smith, A., and Thakurta, A. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, 2012.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

McSherry, F. and Talwar, K. Mechanism design via differential privacy. In FOCS, volume 7, pp. 94-103, 2007.

Neel, S., Roth, A., and Wu, Z. S. How to use heuristics for differential privacy. arXiv preprint arXiv:1811.07765, 2018.

Suggala, A. S. and Netrapalli, P. Online non-convex learning: Following the perturbed leader is optimal. arXiv preprint arXiv:1903.08110, 2019.

Vietri, G., Tian, G., Bun, M., Steinke, T., and Wu, Z. S. New oracle-efficient algorithms for private synthetic data release. In International Conference on Machine Learning, 2020.