# Differentially Private Learning with Margin Guarantees

Raef Bassily, The Ohio State University & Google Research NY, bassily.1@osu.edu
Mehryar Mohri, Google Research & Courant Institute, mohri@google.com
Ananda Theertha Suresh, Google Research, NY, theertha@google.com

We present a series of new differentially private (DP) algorithms with dimension-independent margin guarantees. For the family of linear hypotheses, we give a pure DP learning algorithm that benefits from relative deviation margin guarantees, as well as an efficient DP learning algorithm with margin guarantees. We also present a new efficient DP learning algorithm with margin guarantees for kernel-based hypotheses with shift-invariant kernels, such as Gaussian kernels, and point out how our results can be extended to other kernels using oblivious sketching techniques. We further give a pure DP learning algorithm for a family of feed-forward neural networks for which we prove margin guarantees that are independent of the input dimension. Additionally, we describe a general label DP learning algorithm, which benefits from relative deviation margin bounds and is applicable to a broad family of hypothesis sets, including that of neural networks. Finally, we show how our DP learning algorithms can be augmented in a general way to include model selection, to select the best confidence margin parameter.

## 1 Introduction

Preserving privacy is a crucial objective for machine learning algorithms. A widely adopted criterion in statistical data privacy is the notion of differential privacy (DP) [DMNS06, Dwo06, DR14], which ensures that the information gained by an adversary is roughly invariant to the presence or absence of an individual in a dataset. Despite the remarkable theoretical and algorithmic progress in differential privacy over the last decade or more, however, its application to learning still faces several obstacles.

A recent series of publications has shown that differentially private PAC learning of infinite hypothesis sets is not possible, even for common hypothesis sets such as that of linear functions. In fact, this is the case for any hypothesis set containing threshold functions [BNSV15, ALMM19]. These results imply serious limitations for private agnostic learnability.

Another rich body of literature has studied differentially private empirical risk minimization (DP-ERM) and differentially private stochastic convex optimization (DP-SCO) (e.g., [CMS11, JT14, BST14, BFTT19, FKT20, SSTT21, BGN21, AFKT21, BGM21]). When the underlying optimization problem is constrained (constrained setting), tight upper and lower bounds have been derived for the excess empirical risk of DP-ERM [BST14] and for the excess population risk of DP-SCO [BFTT19, FKT20]. These results show that learning guarantees necessarily admit a dependency on the dimension $d$ of the form $\sqrt{d}/m$, where $m$ is the sample size. This dependency persists even in the special case of generalized linear losses (GLLs) [BST14], which limits the benefit of such guarantees, since learning algorithms typically deal with high-dimensional spaces. When the underlying optimization problem is unconstrained (unconstrained setting) and the loss is a generalized linear loss, the bounds given by [JT14], [SSTT21] and [BGM21] are dimension-independent, but they admit a dependency on $\|w^*\|_2$, where $w^*$ is the unconstrained minimizer of the
expected loss (population risk), or on $\|\hat{w}\|_2$, where $\hat{w}$ is the unconstrained minimizer of the empirical loss. Since the problem is unconstrained, the norm of these vectors can be very large, even for classification problems for which the minimizer of the zero-one loss admits a relatively small norm. Thus, in both the constrained and unconstrained settings, the learning guarantees derived from DP-ERM and DP-SCO are weak for hypothesis sets commonly used in machine learning.

The results just mentioned raise some fundamental questions about private learning: is differentially private learning with favorable (dimension-independent) guarantees possible for standard hypothesis sets? Must one resort to distribution-dependent bounds instead? In view of the negative PAC-learning results and other learning bounds mentioned earlier, we will seek instead optimistic margin-based learning bounds. In the context of classification, learning bounds for linear hypotheses based on the dimension or, more generally, on the VC-dimension of the hypothesis set are known to be too pessimistic, since they deal with the worst case. Instead, margin bounds have been shown to be the most informative and useful guarantees [KP02, SFBL97]. This motivates our study of differentially private learning algorithms with margin-based guarantees. Note that our confidence-margin analysis and guarantees do not require the hard-margin separability assumption adopted in [BDMN05, LNUZ20], which is a strong assumption that typically does not hold in practice. Another existing study that deals with somewhat related questions is that of [CHS14], but that paper deals with a specific class of maximization problems and adopts a non-standard definition of margin. Another related line of work is that of [RBHT09] and [CMS11] on DP kernel classifiers. These works either provide suboptimal, dimension-dependent learning guarantees or make strong assumptions about the Fourier coefficients of the kernel predictors. We discuss these prior works in more detail in Section 1.1.

Main contributions. We present a series of new differentially private (DP) algorithms for learning linear classifiers, kernel classifiers, and neural-network classifiers with dimension-independent, confidence-margin guarantees. In Section 3, we study the family of linear hypotheses. We first give a pure DP learning algorithm with relative deviation margin guarantees. Next, we present an efficient DP learning algorithm with margin guarantees. Our algorithm is based on a faster construction for the JL-transform and a faster DP-ERM algorithm. While the general structure of our algorithms for linear classifiers is similar to that of [LNUZ20], our results require a new analysis that takes into account the scale-sensitive nature of the margin loss and the $\rho$-hinge loss. In Section 4, we present a new efficient DP learning algorithm with margin guarantees for kernel-based hypothesis sets, assuming that the positive definite kernel used is shift-invariant, as with the most commonly used Gaussian kernels. Our algorithm combines kernel approximation with the use of the JL-transform. Our result is based on a new style of analysis that uses regularized ERM as a reference. These ideas enable us to attain a bound that nearly matches the non-private margin bound, without resorting to the strong assumptions in prior work. Our confidence-margin bounds for DP learning of linear and kernel classifiers nearly match the standard, non-private confidence-margin bounds.
In Section 5, we initiate the study of DP learning of neural networks with margin guarantees. We design a pure DP learning algorithm for a family of feed-forward neural networks for which we prove a confidence-margin bound that is independent of the input dimension and exhibits a better dependence on the network parameters than the bounds attained via uniform convergence. Our result entails a new analysis of an embedding-based network compression technique. Our margin bound for neural networks is the first of its type: it is independent of the input dimension and scales only linearly with the number of activation units. In Appendix E, we further present a label privacy learning algorithm, which we show benefits from relative deviation margin bounds. The algorithm and its guarantee are applicable to a broad family of hypothesis sets, including that of neural networks. Finally, we show in Appendix F how our DP learning algorithms can be augmented in a general way to include model selection, to select the best confidence margin parameter.

### 1.1 Related work

Prior work on unconstrained GLLs. [JT14] and [SSTT21] showed that it is possible to derive dimension-independent risk bounds for DP-ERM and DP-SCO in the context of linear prediction, when the parameter space is unconstrained and the loss function is convex and Lipschitz (GLL). However, their bounds scale with $\|w^*\|$, the norm of the optimal unconstrained minimizer of the expected surrogate loss, such as the hinge loss. Also, using their techniques for unconstrained DP-ERM for GLLs together with uniform convergence would yield generalization error bounds that scale with the norm of the unconstrained empirical risk minimizer $\hat{w}$. The first issue with this line of work is that the norms of such unconstrained solutions can be very large, thereby resulting in uninformative bounds. In fact, one can construct simple, low-dimensional examples where $\|\hat{w}\| = \Omega(m)$ while there is a predictor $w$ with $\|w\| = O(1)$ that attains the same expected zero-one error; see Figure 1 (a detailed analysis of that example is given in Appendix H). More importantly, the paradigm adopted in this line of work is to first devise an algorithm and only then derive bounds for its excess risk. In contrast, we start from strong generalization error bounds, which we use to guide the design of our algorithm.

Figure 1: Simple example in dimension one for which the minimizer of the expected hinge loss $\mathbb{E}[\ell_{\mathrm{hinge}}(w)]$ is $w^* = 1/\gamma$ and thus $\|w^*\| = 1/\gamma \gg 1$ for $\gamma \ll 1$. Here, any other $w > 0$, in particular one with a small norm, achieves the same zero-one loss as $w^*$.

Prior work on DP learning of hard-margin halfspaces. [BDMN05] and [LNUZ20] studied DP learning of linear classifiers in the separable setting, that is, with a hard or geometric margin. [BDMN05] gave a construction based on a private version of the Perceptron algorithm, which results in a dimension-dependent bound on the expected error. This result was later improved by [LNUZ20], who gave new constructions with dimension-independent guarantees based on their nice idea of using embeddings, namely the Johnson-Lindenstrauss (JL) transform, to reduce the dimensionality of the problem from $d$ to $O(1/\gamma^2)$, where $\gamma$ is the geometric margin. Note that hard-margin separability is a strong assumption that typically does not hold in practice. Moreover, the constructions proposed in [LNUZ20] require the knowledge of the margin for their guarantees to be valid.
In contrast, our work considers the more general notion of confidence margin, which does not require the existence of a geometric margin and applies to realistic scenarios with non-separable data. Moreover, the confidence-margin parameter $\rho$ in our algorithms is tunable and can be optimized. Importantly, our algorithms still yield meaningful learning guarantees even if this parameter is not optimized. Our algorithms for linear classifiers also make use of an embedding as a pre-processing step. However, despite the similar structure to that of [LNUZ20], our algorithm requires a new analysis and different settings of parameters. This new analysis is necessary to deal with the scale-sensitive nature of our bounds, due to the absence of a hard margin.

Prior work on DP kernel classifiers. [RBHT09] were the first to provide differentially private constructions for SVMs, both in the finite-dimensional feature space setting and in the kernel setting. However, their constructions are suboptimal and the resulting bounds suffer from a polynomial dependence on the dimension of the feature space. In addition, their error bound has a suboptimal dependence on the sample size $m$ and also has an explicit dependence on the 1-norm of the dual variables of the SVM. In general, this norm can be as large as $m$, in which case their error bound becomes vacuous. [CMS11] gave a similar construction for shift-invariant kernels. However, their error guarantees are based on the kernel approximation results of [RR08] and hence entail a relatively strong condition on the Fourier coefficients of the kernel predictors and the kernel density. We note that the standard assumption of bounded reproducing kernel Hilbert space (RKHS) norm does not imply such a condition. [JT13] gave algorithms for DP prediction with kernels, where the goal is to privately generate predictions (labels) on a small test set that admits no privacy constraints. In such scenarios, their algorithms do not output a classifier. [JT13] also gave a construction for a private learner that outputs a kernel classifier; however, their construction is computationally inefficient and the resulting error guarantees are dimension-dependent.

## 2 Preliminaries

We consider an input space $\mathcal{X}$, a binary output space $\mathcal{Y} = \{-1,+1\}$, and a hypothesis set $H$ of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. We denote by $\mathcal{D}$ a distribution over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and denote by $R_{\mathcal{D}}(h)$ the generalization error and by $\widehat{R}_S(h)$ the empirical error of a hypothesis $h \in H$:
$$R_{\mathcal{D}}(h) = \mathbb{E}_{z=(x,y)\sim \mathcal{D}}\big[\mathbb{1}_{y h(x) \le 0}\big], \qquad \widehat{R}_S(h) = \mathbb{E}_{z=(x,y)\sim S}\big[\mathbb{1}_{y h(x) \le 0}\big],$$
where we write $z \sim S$ to indicate that $z$ is randomly drawn from the empirical distribution defined by $S$. Given $\rho \ge 0$, we similarly define the $\rho$-margin loss and the empirical $\rho$-margin loss of $h \in H$:
$$R^{\rho}_{\mathcal{D}}(h) = \mathbb{E}_{z=(x,y)\sim \mathcal{D}}\big[\mathbb{1}_{y h(x) \le \rho}\big], \qquad \widehat{R}^{\rho}_S(h) = \mathbb{E}_{z=(x,y)\sim S}\big[\mathbb{1}_{y h(x) \le \rho}\big].$$
The $\rho$-margin loss is not convex. Hence, we also consider the $\rho$-hinge loss to provide computationally efficient algorithms. For any $\rho > 0$, define the $\rho$-hinge loss as $\ell_{\rho}(u) \triangleq \max\!\big(1 - \tfrac{u}{\rho},\, 0\big)$, $u \in \mathbb{R}$. Similarly, given $\rho > 0$ and a sample $S$, we define the expected and the empirical $\rho$-hinge losses of a linear predictor $w$ as
$$L^{\rho}_{\mathcal{D}}(w) = \mathbb{E}_{z=(x,y)\sim \mathcal{D}}\big[\ell_{\rho}(y\,\langle w, x\rangle)\big], \qquad \widehat{L}^{\rho}_S(w) = \mathbb{E}_{z=(x,y)\sim S}\big[\ell_{\rho}(y\,\langle w, x\rangle)\big].$$
In the context of learning, differential privacy is defined as follows.

Definition 2.1 (Differential privacy). Let $\varepsilon, \delta \ge 0$. Let $\mathcal{A}\colon (\mathcal{X}\times\mathcal{Y})^m \to H$ be a randomized algorithm. We say that $\mathcal{A}$ is $(\varepsilon,\delta)$-DP if, for any measurable subset $\mathcal{O} \subseteq H$ and all $S, S' \in (\mathcal{X}\times\mathcal{Y})^m$ that differ in one sample, the following inequality holds:
$$\mathbb{P}\big(\mathcal{A}(S) \in \mathcal{O}\big) \le e^{\varepsilon}\, \mathbb{P}\big(\mathcal{A}(S') \in \mathcal{O}\big) + \delta. \tag{1}$$
If $\delta = 0$, we refer to this guarantee as pure differential privacy.
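To make the preceding definitions concrete, the following minimal NumPy sketch (ours, not part of the paper) evaluates the empirical zero-one error, the $\rho$-margin loss, and the $\rho$-hinge loss of a linear predictor on a synthetic sample; the predictor, data, and value of $\rho$ are illustrative.

```python
import numpy as np

def empirical_losses(w, X, y, rho):
    """Empirical zero-one error, rho-margin loss, and rho-hinge loss of the
    linear predictor x -> <w, x> on a sample (X, y) with labels y in {-1, +1}."""
    margins = y * (X @ w)                                        # y <w, x> per example
    zero_one = np.mean(margins <= 0.0)                           # empirical error R_S(w)
    margin_loss = np.mean(margins <= rho)                        # empirical rho-margin loss
    hinge_loss = np.mean(np.maximum(1.0 - margins / rho, 0.0))   # empirical rho-hinge loss
    return zero_one, margin_loss, hinge_loss

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = np.ones(5)
y = np.sign(X @ w + 1e-12)            # labels in {-1, +1}
print(empirical_losses(w, X, y, rho=0.5))
```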
## 3 Private Algorithms for Linear Classification with Margin Guarantees

In this section, we present two private learning algorithms for linear classification with margin guarantees: first, a computationally inefficient pure DP algorithm, which we show benefits from relative deviation margin bounds; next, a computationally efficient DP algorithm with a dimension-independent bound expressed in terms of the empirical $\rho$-hinge loss.

Let $\mathbb{B}_d(r) \triangleq \{x \in \mathbb{R}^d : \|x\|_2 \le r\}$ denote the Euclidean ball in $\mathbb{R}^d$ of radius $r$ and let $\mathcal{X} \subseteq \mathbb{B}_d(r)$ denote the feature space. We will use the shorthand $\mathbb{B}_d$ for $\mathbb{B}_d(1)$. We consider the class of linear predictors over $\mathcal{X}$ defined by $H_{\mathrm{Lin}} = \{h_w\colon x \mapsto \langle w, x\rangle \mid w \in \mathbb{B}_d(\Lambda)\}$. Note that one can represent the general class of affine functions over $\mathbb{R}^d$ as linear functions over $\mathbb{R}^{d+1}$ by simply mapping each $x \in \mathbb{R}^d$ to $x' = (x, 1) \in \mathbb{R}^{d+1}$. Thus, without loss of generality, we will consider $H_{\mathrm{Lin}}$. Here, we view $d$ as possibly much larger than the sample size $m$. We also note that, even though the predictors in the input class $H_{\mathrm{Lin}}$ admit a $\Lambda$-bounded norm, we do not constrain the algorithm to output a predictor with bounded norm, which circumvents the necessary dependence on the dimension in constrained DP optimization [BST14].

### 3.1 Pure DP Algorithm for Linear Classification

A standard method for designing differentially private algorithms for a continuous hypothesis class is to apply the exponential mechanism [MT07] to a cover of the hypothesis class. Since $H_{\mathrm{Lin}}$ is $d$-dimensional, the size of a useful cover is about $2^{\Omega(d)}$; thus, a direct application of the exponential mechanism yields an $\Omega(d)$-bound (we give a simple example illustrating this in Appendix G). Instead, we seek to reduce the size of the cover without significantly impacting its accuracy, using random projections. This results in a mapping $\Phi$ from $\mathbb{R}^d$ to a lower-dimensional space $\mathbb{R}^k$. For linear classification, we wish to preserve intra-point distances and angles, that is, $\langle x, x'\rangle \approx \langle \Phi x, \Phi x'\rangle$ for points $x$ and $x'$. It is known that this property can be fulfilled as a corollary of the Johnson-Lindenstrauss lemma [Nel11, Theorem 109]. For completeness, we provide a full proof in Appendix A. More interestingly, we show that we can reduce the dimension to $k = O(\Lambda^2 r^2/\rho^2)$ without the error increasing significantly. We then run the exponential mechanism in this lower-dimensional space and compute a classifier $\tilde{w}$ in that space. We finally derive a classifier in the original space by applying the transpose of the projection matrix: $\Phi^{\top}\tilde{w}$. Note that the final output $\Phi^{\top}\tilde{w}$ has expected norm $O(\sqrt{d}\,\rho/r)$ and may not lie in $\mathbb{B}_d(\Lambda)$. Algorithm 1 gives the pseudocode of the full algorithm.

The algorithm and the analysis in this section include a dimensionality reduction technique for mapping the feature vectors from the input $d$-dimensional space to a $k$-dimensional space, where $k = O(\Lambda^2 r^2/\rho^2)$ for some $\rho \in (0, \Lambda r]$. Hence, we will be dealing with compressed parameter vectors in $\mathbb{R}^k$. To distinguish these two spaces, we will denote the empirical error and the empirical $\rho$-margin error in this $k$-dimensional space by $\widehat{R}^{(k)}_{S'}(w')$ and $\widehat{R}^{(k),\rho}_{S'}(w')$, respectively, where $w' \in \mathbb{B}_k(2\Lambda)$ and $S' \in (\mathbb{R}^k \times \mathcal{Y})^m$.

Algorithm 1 $\mathcal{A}_{\mathrm{PrivMrg}}$: Private Learner of Linear Classifiers with Margin Guarantees
Require: Dataset $S = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathcal{X} \times \{\pm 1\})^m$; privacy parameter $\varepsilon > 0$; margin parameter $\rho \in (0, \Lambda r]$; confidence parameter $\beta > 0$.
1: Let $k = O(\Lambda^2 r^2/\rho^2)$.
2: Let $\Phi$ be a random $k \times d$ matrix with entries drawn i.i.d. uniformly from $\{\pm 1/\sqrt{k}\}$.
3: Let $S^{\Phi} = \{(x^{\Phi}, y) : (x,y) \in S\}$, where for any $x \in \mathbb{R}^d$, $x^{\Phi} \triangleq \Phi x \in \mathbb{R}^k$.
4: Let $\mathcal{C}$ be a $\frac{\rho}{10 r}$-cover of $\mathbb{B}_k(2\Lambda)$.
5: Run the exponential mechanism over $\mathcal{C}$ with privacy parameter $\varepsilon$, sensitivity $\frac{1}{m}$, and score function $\widehat{R}^{(k),\rho}_{S^{\Phi}}(w)$ for $w \in \mathcal{C}$, to select $\tilde{w} \in \mathcal{C}$.
6: return $w^{\mathrm{Priv}} = \Phi^{\top}\tilde{w}$, where $\Phi^{\top}$ denotes the transpose of $\Phi$.
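For illustration, the following NumPy sketch runs the selection step of Algorithm 1 on synthetic data. It implements the standard exponential mechanism with the empirical margin loss as score and sensitivity $1/m$; the candidate set below is a crude random stand-in for the metric cover $\mathcal{C}$, and all sizes and constants are illustrative choices of ours rather than the settings prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def exponential_mechanism(scores, eps, sensitivity, rng):
    """Sample an index with probability proportional to exp(-eps * score / (2 * sensitivity)).
    Lower scores are better; the selected index is eps-DP with respect to the dataset behind the scores."""
    scores = np.asarray(scores, dtype=float)
    logits = -eps * (scores - scores.min()) / (2.0 * sensitivity)   # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(scores), p=p)

# Toy instance of Algorithm 1's selection step.
d, k, m, Lam, r, rho, eps = 200, 20, 500, 1.0, 1.0, 0.2, 1.0
X = rng.normal(size=(m, d)); X = r * X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 1e-12)

Phi = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)        # JL matrix (step 2)
XPhi = X @ Phi.T                                               # projected features (step 3)

cover = rng.normal(size=(64, k))                               # random stand-in for the cover C (step 4)
cover = 2 * Lam * cover / np.linalg.norm(cover, axis=1, keepdims=True)
margin_loss = [(y * (XPhi @ w) <= rho).mean() for w in cover]  # score of each candidate

idx = exponential_mechanism(margin_loss, eps, sensitivity=1.0 / m, rng=rng)  # step 5
w_priv = Phi.T @ cover[idx]                                    # map back to R^d (step 6)
```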
Theorem 3.1. Algorithm 1 is $\varepsilon$-differentially private. For any $\beta \in (0,1)$, with probability at least $1-\beta$ over the draw of a sample $S$ of size $m$ from $\mathcal{D}$, the solution $w^{\mathrm{Priv}}$ it returns satisfies
$$R_{\mathcal{D}}(w^{\mathrm{Priv}}) \le \min_{w\in\mathbb{B}_d(\Lambda)}\bigg\{\widehat{R}^{\rho}_S(w) + \tilde{O}\bigg(\sqrt{\widehat{R}^{\rho}_S(w)\cdot\frac{\Lambda^2 r^2/\rho^2}{\min(1,\varepsilon)\,m}} + \frac{\Lambda^2 r^2/\rho^2}{\min(1,\varepsilon)\,m}\bigg)\bigg\}.$$

The proof is given in Appendix B.1. The theorem provides a margin guarantee for the private solution. Note that no assumption is made about separability or the existence of a favorable hard margin. Instead, through the first term, the bound is based on the distribution of the empirical margins $y\langle w, x\rangle$. The theorem suggests that, when the empirical $\rho$-margin loss remains small for a relatively large value of the confidence margin parameter $\rho$, then $w^{\mathrm{Priv}}$ benefits from a strong generalization guarantee. These comments hold similarly for the other margin bounds presented in this paper. This result, although given for a computationally inefficient method, is stronger than several previously known ones: first, it is an $(\varepsilon,0)$-pure differential privacy guarantee; second, it is dimension-independent and, furthermore, unlike prior work, the norm of the optimal hypothesis does not appear in the bound. Moreover, since it is a relative deviation margin bound, it smoothly interpolates between the realizable case of $\widehat{R}^{\rho}_S(w) = 0$ and the case of $\widehat{R}^{\rho}_S(w) > 0$. For a sample of size $m$, the bound is based on an interpolation between a $1/\sqrt{m}$-term that includes the square root of the empirical margin loss as a factor and another term in $1/m$. In particular, when the empirical margin loss is zero, the bound only admits the $1/m$ fast-rate term. As a corollary, note that, up to constants, one can always obtain privacy for $\varepsilon > 1$ essentially for free.

Dependence on $\rho$: Our bound depends on the choice of $\rho$. We note that this is fundamentally different from the dependence on $\|w^*\|$ in prior works on GLLs [JT14, SSTT21], for two reasons. First, unlike $\|w^*\|$, $\rho$ is a measurable and, more importantly, tunable quantity, for which we can select an optimal setting¹ (as we do in Appendix F). Second, the optimal choice of this parameter can be much smaller than $\|w^*\|$, as demonstrated in Appendix H.

¹ Note that, without loss of generality, we can set $\Lambda = 1$ and optimize only with respect to $\rho$.

### 3.2 Efficient Private Algorithm for Linear Classification

The $\rho$-margin loss is not convex and it is known that its minimization is generally intractable. Instead, we devise a computationally efficient algorithm whose guarantees are expressed in terms of the empirical $\rho$-hinge loss. Algorithm 2 shows the pseudocode of our algorithm. We now discuss its key steps.

Fast JL-transform. Our algorithm entails a dimensionality reduction step (step 3) as in Algorithm 1. Here, we note that the new dimension $k$ is chosen to be $O(\varepsilon m)$, which enables us to control the influence of the dimensionality reduction on the empirical hinge loss. We also note that this step is carried out via a fast construction for the JL-transform (Lemma A.4), which takes $O(d\log(d))$ time, assuming $d > \varepsilon m$.

Algorithm 2 $\mathcal{A}_{\mathrm{E\text{-}PrivMrg}}$: Efficient Private Learner of Linear Classifiers with Margin Guarantees
Require: Dataset $S = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathbb{B}_d(r) \times \{\pm 1\})^m$; privacy parameters $\varepsilon,\delta$; norm bound $\Lambda$; margin parameter $\rho \in (0, \Lambda r]$; confidence parameter $\beta > 0$.
1: Let $k = \frac{\varepsilon m}{\log(m/\beta)\,\log^{3/2}(1/\delta)\,\log(1/\beta)}$.
2: Let $\Phi$ be a random $k \times d$ matrix from the construction in Lemma A.4.
3: Let $S^{\Phi} = \{(\Pi_{\mathbb{B}_k(2r)}(x^{\Phi}), y) : (x,y) \in S\}$, where for any $x \in \mathbb{R}^d$, $x^{\Phi} \triangleq \Phi x \in \mathbb{R}^k$ and $\Pi_{\mathbb{B}_k(2r)}$ is the Euclidean projection onto $\mathbb{B}_k(2r)$.
4: Privately solve the convex ERM problem $\arg\min_{w\in\mathbb{B}_k(2\Lambda)} \widehat{L}^{\rho}_{S^{\Phi}}(w)$ via Algorithm 4 (Appendix B.2) and let $\tilde{w} \in \mathbb{B}_k(2\Lambda)$ denote its output.
5: return $w^{\mathrm{Priv}} = \Phi^{\top}\tilde{w}$, where $\Phi^{\top}$ denotes the transpose of $\Phi$.
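The following sketch traces the non-private skeleton of Algorithm 2 on synthetic data: project the features, clip them to $\mathbb{B}_k(2r)$, minimize the $\rho$-hinge loss over $\mathbb{B}_k(2\Lambda)$, and map the solution back with $\Phi^{\top}$. A dense Gaussian matrix stands in for the fast JL construction of Lemma A.4, and a plain projected-subgradient loop stands in for the DP-ERM solver (Algorithm 4); neither stand-in provides any privacy guarantee, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def project_ball(w, radius):
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def hinge_erm(Xk, y, rho, Lam, steps=2000, lr=0.1):
    """Projected subgradient descent on the rho-hinge loss over B_k(2*Lam).
    NON-PRIVATE stand-in for the DP-ERM solver (Algorithm 4) used in step 4."""
    w = np.zeros(Xk.shape[1])
    for _ in range(steps):
        margins = y * (Xk @ w)
        active = (margins < rho).astype(float)                     # subgradient of max(1 - u/rho, 0)
        grad = -(Xk * (y * active)[:, None]).mean(axis=0) / rho
        w = project_ball(w - lr * grad, 2 * Lam)
    return w

d, k, m, r, Lam, rho = 500, 50, 400, 1.0, 1.0, 0.25
X = rng.normal(size=(m, d)); X = r * X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 1e-12)

Phi = rng.normal(size=(k, d)) / np.sqrt(k)                          # stand-in for the fast JL map (Lemma A.4)
XPhi = np.array([project_ball(x, 2 * r) for x in X @ Phi.T])        # step 3: project features onto B_k(2r)
w_tilde = hinge_erm(XPhi, y, rho, Lam)                              # step 4 (non-private here)
w_priv = Phi.T @ w_tilde                                            # step 5: map back to R^d
print("empirical rho-hinge loss:",
      np.maximum(1.0 - y * (XPhi @ w_tilde) / rho, 0.0).mean())
```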
Near-linear-time DP convex ERM. After this step, we invoke an efficient algorithm for DP-ERM (step 4 of Algorithm 2) to find an approximate minimizer of the empirical $\rho$-hinge loss $\widehat{L}^{\rho}_{S^{\Phi}}(w)$, rather than using the exponential mechanism to find an approximate minimizer of the empirical zero-one loss $\widehat{R}_{S^{\Phi}}(w)$. To improve the running time of step 4, we use the construction of [BGM21, Algorithm 2] to solve DP-ERM in near-linear time and with a high-probability guarantee on the excess empirical risk (see Algorithm 4 in Appendix B.2). The algorithm of [BGM21] is devised for DP-SCO with respect to non-smooth generalized linear losses. It is based on a combination of a smoothing technique via proximal steps and the phased SGD algorithm [FKT20, Algorithm 2] for smooth DP-SCO. The algorithm of [BGM21] can be used for DP-ERM if it is fed with a sample from the empirical distribution of the dataset. However, the privacy guarantee requires a careful privacy analysis to deal with the fact that this sample may contain duplicate entries from the original dataset. Moreover, since the original algorithms in [FKT20, BGM21] provide only expectation guarantees and we aim at high-probability learning bounds, we use a standard private confidence-boosting technique to provide a high-probability guarantee on the excess risk of our variant. We summarize the guarantees of this variant in the following lemma. The details of the construction and the proof of the lemma below can be found in Appendix B.2.

Lemma 3.1. Let $m \in \mathbb{N}$, $0 < \delta < \frac{1}{m}$, and $0 < \varepsilon \le \log(1/\delta)$. Algorithm 4 (Appendix B.2) is $(\varepsilon,\delta)$-DP. Let $\beta \in (0,1)$, $k \in \mathbb{N}$, and $r, \Lambda > 0$. Let $S \in (\mathbb{B}_k(r) \times \{\pm 1\})^m$ be the input dataset and $\mathbb{B}_k(\Lambda)$ be the parameter space. With probability at least $1-\beta$ over the randomness in Algorithm 4, the output $\tilde{w}$ satisfies
$$\widehat{L}^{\rho}_S(\tilde{w}) \le \min_{w\in\mathbb{B}_k(\Lambda)} \widehat{L}^{\rho}_S(w) + \tilde{O}\bigg(\frac{r\Lambda}{\rho}\cdot\frac{\sqrt{k\log(1/\delta)}}{\varepsilon m}\bigg).$$
Moreover, Algorithm 4 requires $O(m\log(m)\log(1/\beta))$ gradient computations.

We now state our main result in this section, which we prove in Appendix B.3.

Theorem 3.2. Let $0 < \delta < \frac{1}{m}$ and $0 < \varepsilon \le \log(1/\delta)$. Algorithm 2 is $(\varepsilon,\delta)$-DP. Let $\beta \in (0,1)$ and let $S \sim \mathcal{D}^m$ for a distribution $\mathcal{D}$ over $\mathbb{B}_d(r)\times\{\pm 1\}$. Algorithm 2 outputs $w^{\mathrm{Priv}} \in \mathbb{R}^d$ such that, with probability at least $1-\beta$,
$$R_{\mathcal{D}}(w^{\mathrm{Priv}}) \le \min_{w\in\mathbb{B}_d(\Lambda)} \widehat{L}^{\rho}_S(w) + \tilde{O}\bigg(\frac{r\Lambda}{\rho}\Big(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{\varepsilon m}}\Big)\bigg).$$
Moreover, Algorithm 2 runs in time $O\big(m d\log(\max(d,m)) + \varepsilon m^2\log(m)\log^{3/2}(1/\delta)\big)$.

## 4 Private Algorithms for Kernel-Based Classifiers with Margin Guarantees

In this section, we present private algorithms with margin guarantees for kernel-based predictors [SS02, STC04]. We first consider a continuous, positive definite, shift-invariant kernel $K\colon \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, where $K(x,x) = r^2$ for all $x \in \mathcal{X}$. The associated feature map is defined by $\Phi(x) \triangleq K(\cdot,x)$, $x \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{B}_d(r)$.

Overview of the technique. Our approach is based on approximating the feature map $\Phi$ by a finite-dimensional feature map $\psi\colon \mathcal{X}\to\mathbb{B}_{2D}(r)$ determined via Random Fourier Features (RFFs). The dimension $2D$ of the approximate feature map is chosen to be sufficiently large to ensure that, for all pairs of feature vectors $x_i, x_j$ in a training set $S = ((x_1,y_1),\ldots,(x_m,y_m))$, we have $|\langle \psi(x_i),\psi(x_j)\rangle - K(x_i,x_j)| \le \frac{1}{m}$ with high probability over the randomness of $\psi$ (due to the RFFs).
This suffices to derive an upper bound (a margin bound) on the true error of a finite-dimensional linear predictor trained on the sample made of the labeled points $(\psi(x), y)$, $(x,y) \in S$, that is essentially the same as the margin bound known for the kernel classifier. Hence, in effect, we reduce the problem to that of learning a linear classifier in a $2D$-dimensional space, which we can solve privately using Algorithm 2. Note that the output predictor in this case is a finite-dimensional linear function rather than a function in the reproducing kernel Hilbert space. A full description of our DP learner of kernel classifiers is given in Algorithm 3 below.

Bochner's theorem and RFFs. Since the kernel $K$ is shift-invariant, it can be expressed as $K(x,x') = r^2\,\bar{K}(x-x')$, $x,x' \in \mathcal{X}$, for some function $\bar{K}\colon \mathcal{Z}\to\mathbb{R}$, where $\mathcal{Z} = \{z = x - x' : x,x' \in \mathcal{X}\}$. Moreover, since $K$ is positive definite, $\bar{K}$ is the Fourier transform of a probability distribution $\mathrm{P}_{\bar K}$:
$$\bar{K}(x) = \int e^{i\langle\omega,x\rangle}\,\mathrm{P}_{\bar K}(\omega)\,d\omega.$$
This follows from Bochner's theorem [Rud17]. Random Fourier Features (RFFs) provide a simple method, introduced in [RR07], to approximate kernel feature maps in a data-independent fashion. The idea is based on Bochner's theorem. In particular, we first sample $\omega_1,\ldots,\omega_D$ independently from the probability distribution $\mathrm{P}_{\bar K}$. Then, we define an approximate feature map as follows:
$$\psi(x) \triangleq \frac{r}{\sqrt{D}}\big(\cos\langle\omega_1,x\rangle, \sin\langle\omega_1,x\rangle, \ldots, \cos\langle\omega_D,x\rangle, \sin\langle\omega_D,x\rangle\big), \quad x \in \mathcal{X}. \tag{2}$$
For $D$ sufficiently large, it can be shown that $\langle\psi(x),\psi(x')\rangle$ concentrates around $K(x,x')$ for all pairs $x,x' \in \mathcal{X}$ [MRT18, Theorem 6.28]. In our analysis below, we only need that concentration to hold uniformly over pairs $x,x'$ from a fixed training set, rather than uniformly over all pairs $x,x' \in \mathcal{X}$. This leads to a simpler approximation guarantee, which we formally state below.

Theorem 4.1. Let $S_{\mathcal{X}} = (x_1,\ldots,x_m) \in \mathcal{X}^m$. Let $K$ be a shift-invariant, positive definite kernel with $K(x,x) = r^2$ for all $x \in \mathcal{X}$. Let $\mathrm{P}_{\bar K}$ be the probability distribution associated with $K$. Suppose $\omega_1,\ldots,\omega_D$ are drawn independently from $\mathrm{P}_{\bar K}$. With probability 1, we have $\|\psi(x)\|_2 = r$ for all $x \in \mathcal{X}$. For any $\beta \in (0,1)$, with probability at least $1-\beta$, for all $i,j \in [m]$ such that $i \ne j$ we have
$$\big|\langle\psi(x_i),\psi(x_j)\rangle - K(x_i,x_j)\big| \le 2r^2\sqrt{\frac{\log(2m/\beta)}{D}}.$$

Proof. The first assertion, concerning $\|\psi(x)\|_2$ for $x \in \mathcal{X}$, follows directly from the definition of $\psi(x)$ and a basic trigonometric identity. The second assertion, about the inner products, follows from the identity $\mathbb{E}_{\omega_1,\ldots,\omega_D}[\langle\psi(x_i),\psi(x_j)\rangle] = K(x_i,x_j)$, which holds for all $x_i,x_j$, combined with an application of Hoeffding's bound and a union bound over all $m^2$ pairs $x_i,x_j \in S_{\mathcal{X}}$. The unbiasedness of $\langle\psi(x_i),\psi(x_j)\rangle$ follows from the fact that its expectation is the Fourier transform of $r^2\,\mathrm{P}_{\bar K}(\omega)$, which, by Bochner's theorem, equals $r^2\,\bar{K}(x_i - x_j) = K(x_i,x_j)$.
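As an illustration of Theorem 4.1, the sketch below builds the RFF map (2) for a Gaussian kernel, for which $\mathrm{P}_{\bar K}$ is itself a Gaussian distribution, and checks that $\|\psi(x)\|_2 = r$ and that $\langle\psi(x_i),\psi(x_j)\rangle$ concentrates around $K(x_i,x_j)$; the bandwidth and the sizes are illustrative choices of ours, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def rff_features(X, D, sigma, r, rng):
    """Random Fourier features psi(x) in R^{2D} for the Gaussian kernel
    K(x, x') = r^2 * exp(-||x - x'||^2 / (2 sigma^2)); here P_Kbar = N(0, sigma^{-2} I)."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(D, d))            # omega_1, ..., omega_D
    Z = X @ W.T                                               # <omega_t, x> for all t, x
    return (r / np.sqrt(D)) * np.hstack([np.cos(Z), np.sin(Z)])

d, m, D, sigma, r = 10, 50, 4000, 1.0, 1.0
X = rng.normal(size=(m, d))
Psi = rff_features(X, D, sigma, r, rng)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = r**2 * np.exp(-sq / (2 * sigma**2))                       # exact kernel matrix
print("norms of psi(x):", np.linalg.norm(Psi, axis=1)[:3])    # each equals r
print("max |<psi,psi'> - K|:", np.abs(Psi @ Psi.T - K).max()) # small for large D
```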
We now state our main result, which we prove in Appendix C.

Theorem 4.2. Let $r > 0$. Let $K\colon \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a shift-invariant, positive definite kernel with $K(x,x) = r^2$ for all $x \in \mathcal{X}$. For any $\varepsilon > 0$ and $\delta \in (0,1)$, Algorithm 3 is $(\varepsilon,\delta)$-differentially private. Define $\mathbb{H}_{\Lambda} \triangleq \{h \in \mathbb{H} : \|h\|_{\mathbb{H}} \le \Lambda\}$, where $\|\cdot\|_{\mathbb{H}}$ is the norm of the reproducing kernel Hilbert space (RKHS) $\mathbb{H}$ associated with the kernel $K$. Let $\beta \in (0,1)$. Given an input sample $S$ of $m$ examples drawn i.i.d. from a distribution $\mathcal{D}$ over $\mathcal{X}\times\{\pm 1\}$, Algorithm 3 outputs $h_{w^{\mathrm{Priv}}}$ such that, with probability at least $1-\beta$,
$$R_{\mathcal{D}}(h_{w^{\mathrm{Priv}}}) \le \min_{h\in\mathbb{H}_{\Lambda}} \widehat{L}^{\rho}_S(h) + \tilde{O}\bigg(\frac{r\Lambda}{\rho}\Big(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{\varepsilon m}}\Big)\bigg),$$
where, for any $h \in \mathbb{H}_{\Lambda}$, $\widehat{L}^{\rho}_S(h) = \frac{1}{m}\sum_{i=1}^m \ell_{\rho}\big(y_i\,\langle h, \Phi(x_i)\rangle_{\mathbb{H}}\big)$, $\Phi$ is the feature map associated with the kernel $K$, and $\langle\cdot,\cdot\rangle_{\mathbb{H}}$ is the inner product associated with the RKHS $\mathbb{H}$.

Algorithm 3 $\mathcal{A}_{\mathrm{PrivKerMrg}}$: Efficient Private Learner of Kernel Classifiers with Margin Guarantees
Require: Dataset $S = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathcal{X}\times\{\pm 1\})^m$; shift-invariant, positive definite kernel $K\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ with $K(x,x) = r^2$ for all $x \in \mathcal{X}$; privacy parameters $\varepsilon,\delta$; reproducing kernel Hilbert space (RKHS) norm bound $\Lambda$; margin parameter $\rho \in (0, 2\Lambda r]$; confidence parameter $\beta > 0$.
1: Let $\mathrm{P}_{\bar K}$ be the probability distribution associated with $K$.
2: Let $D = m^2\log(2m/\beta)$.
3: Draw $\omega_1,\ldots,\omega_D$ independently from $\mathrm{P}_{\bar K}$.
4: Let $S' = \big((\psi(x_i), y_i)\big)_{i\in[m]}$, where $\psi$ is as defined in (2).
5: $w^{\mathrm{Priv}} \leftarrow \mathcal{A}_{\mathrm{E\text{-}PrivMrg}}\big(S', \varepsilon, \delta, 2\Lambda, \rho, \beta/2\big)$, where $\mathcal{A}_{\mathrm{E\text{-}PrivMrg}}$ is the private learner described in Algorithm 2. Note that the input dimension to $\mathcal{A}_{\mathrm{E\text{-}PrivMrg}}$ is $2D$ rather than $d$, and the norm bound parameter is $2\Lambda$.
6: return the private predictor $h_{w^{\mathrm{Priv}}}\colon \mathcal{X}\to\mathbb{R}$ defined by $h_{w^{\mathrm{Priv}}}(x) \triangleq \langle w^{\mathrm{Priv}}, \psi(x)\rangle$ for all $x \in \mathcal{X}$.

Polynomial kernels: Our results can be extended to polynomial kernels using a different approach to construct a finite-dimensional approximation of the kernel. A polynomial kernel of degree $p$, denoted $K_p$, can be expressed as $K_p(x,x') = (\langle x,x'\rangle + c)^p$, $x,x' \in \mathcal{X}$, where $c > 0$ is some constant. Note that a feature map $\Phi_p$ associated with such a kernel can be expressed as a vector in $\mathbb{R}^{\bar d}$, where $\bar d = O(d^p)$. In particular, $\Phi_p(x)$ is the vector of all monomials of a $p$-th degree polynomial. Ignoring computational efficiency considerations (or when $p$ is a small constant), there is a simpler private construction than the one used for shift-invariant kernels. In that case, we can directly use the JL-transform to embed $\{\Phi_p(x_1),\ldots,\Phi_p(x_m)\}$ into a $k$-dimensional subspace exactly as in Section 3.2, which results in a $k$-dimensional approximation of the kernel (by the properties of the JL-transform). Hence, we can directly use Algorithm 2 on the dataset $S_p \triangleq ((\Phi_p(x_1),y_1),\ldots,(\Phi_p(x_m),y_m))$, as illustrated in the sketch after this section. We therefore obtain the same bound on the expected error as above, except that $r$ is replaced by $r^p$. That dependence on $r^p$ is inherent in this case, even non-privately, since $K_p(x,x)$ can be as large as $r^{2p}$. However, as discussed below, more efficient solutions can be designed for approximating these and many other kernels.

Further extensions. Our work can directly benefit from the method of [LSS13], which is computationally faster than that of [RR07]: $O((m+d)\log d)$ instead of $O(md)$. Their technique also extends to any kernel that is a function of an inner product in the input space. We can further use, instead of the JL-transform, the oblivious sketching technique of [AKK+20] from numerical linear algebra, which builds on previous work by [PP13] and [ANW14], to design sketches for polynomial kernels with a target dimension that is only polynomially dependent on the degree of the kernel function, as well as a sketch for the Gaussian kernel on bounded datasets that does not suffer from an exponential dependence on the dimensionality of the input data points. More recently, [SWYZ21] presented new oblivious sketches that further considerably improved upon the running-time complexity of these techniques. Their method also applies to other slow-growing kernels such as the neural tangent (NTK) and arc-cosine kernels.
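The following sketch illustrates the polynomial-kernel remark for degree $p = 2$: it forms the explicit feature map $\Phi_2$, whose inner products recover $K_2$ exactly, and then applies a JL matrix to these features, which approximately preserves the kernel values. The dimensions and the Gaussian JL matrix are illustrative choices of ours, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(4)

def poly2_features(X, c):
    """Explicit feature map Phi_2 for the degree-2 polynomial kernel
    K_2(x, x') = (<x, x'> + c)^2, so that <Phi_2(x), Phi_2(x')> = K_2(x, x') exactly."""
    m, d = X.shape
    quad = np.einsum('ni,nj->nij', X, X).reshape(m, d * d)   # all products x_i x_j
    lin = np.sqrt(2.0 * c) * X                               # recovers the 2c <x, x'> term
    const = np.full((m, 1), c)                               # recovers the c^2 term
    return np.hstack([quad, lin, const])                     # dimension d^2 + d + 1

d, m, c, k = 30, 40, 1.0, 300
X = rng.normal(size=(m, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
P2 = poly2_features(X, c)
K2 = (X @ X.T + c) ** 2
print("explicit features, max error:", np.abs(P2 @ P2.T - K2).max())   # ~ float precision

Phi = rng.normal(size=(k, P2.shape[1])) / np.sqrt(k)   # JL map applied to the explicit features
PJ = P2 @ Phi.T
print("after JL, max error:", np.abs(PJ @ PJ.T - K2).max())  # small relative to K_2(x,x) = (1+c)^2
```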
## 5 Private Algorithms for Learning Neural Networks with Margin Guarantees

In this section, we describe a private learning algorithm that benefits from favorable margin guarantees when run with a family of neural networks with a very large input dimension. We consider a family $\mathcal{H}_{\mathrm{NN}}$ of $L$-layer feed-forward neural networks defined over $\mathbb{B}_d(r)$, with $d$ potentially very large compared to the sample size $m$. A function $h$ in $\mathcal{H}_{\mathrm{NN}}$ can be viewed as a cascade of linear maps composed with a non-linear activation function (see Figure 2, left column). Here, $W_1,\ldots,W_L$ are the weight matrices defining the network and $\sigma_{\lambda}$ is a non-linear activation. For simplicity, the width (number of neurons) in each hidden layer, denoted by $N$, is assumed to be the same for all the layers. Also, we assume that the output of the network is a real scalar, and hence $W_L \in \mathbb{R}^N$. Furthermore, we assume no activation in the output layer. We also assume the same activation in all layers and choose it to be a coordinate-wise sigmoid: for any $u = (u_1,\ldots,u_N) \in \mathbb{R}^N$, $\sigma_{\lambda}(u) = (\sigma_{\lambda}(u_1),\ldots,\sigma_{\lambda}(u_N))$ for some $\lambda > 0$, where $\sigma_{\lambda}\colon\mathbb{R}\to\mathbb{R}$ is a sigmoid function with Lipschitz constant $\lambda$. Note that, since $\sigma_{\lambda}$ is $\lambda$-Lipschitz, the coordinate-wise map is $\lambda$-Lipschitz with respect to $\|\cdot\|_2$: for any $u,v \in \mathbb{R}^N$, $\|\sigma_{\lambda}(u) - \sigma_{\lambda}(v)\|_2 \le \lambda\|u-v\|_2$. A typical choice in practice is $\lambda = 1$, but we will keep the dependence on $\lambda$ for generality. We define $\mathcal{H}^{\Lambda}_{\mathrm{NN}}$ as the subset of $\mathcal{H}_{\mathrm{NN}}$ with weight matrices that are $\Lambda$-bounded in Frobenius norm: $\|W_j\|_F \le \Lambda$ for all $j \in [L]$, for some $\Lambda > 0$.

Figure 2: Illustration of the neural networks before and after the JL-transforms.

We design a pure DP algorithm for learning $L$-layer feed-forward networks in $\mathcal{H}^{\Lambda}_{\mathrm{NN}}$ that benefits from the following margin-based guarantee.

Theorem 5.1. Let $\varepsilon > 0$, $\beta \in (0,1)$, and $\rho > 0$. Then, there is an $\varepsilon$-DP algorithm which returns an $L$-layer network $h^{\mathrm{Priv}}$ with $N$ neurons per layer that, with probability at least $1-\beta$ over the draw of a sample $S \sim \mathcal{D}^m$ and the internal randomness of the algorithm, admits the following guarantee:
$$R_{\mathcal{D}}(h^{\mathrm{Priv}}) \le \min_{h\in\mathcal{H}^{\Lambda}_{\mathrm{NN}}} \widehat{R}^{\rho}_S(h) + O\bigg(\frac{r\Lambda(2\lambda\Lambda)^L\sqrt{N\tau}}{\rho\sqrt{m}} + \frac{r^2\Lambda^2(2\lambda\Lambda)^{2L}\,N\,\tau}{\rho^2\,\varepsilon\,m}\bigg),$$
where $\tau = \log(Lm/\beta)\log\big(r\Lambda(\lambda\Lambda)^L/\rho\big)$.

Note that this guarantee is independent of $d$ and, assuming $L$ is a constant, the bound scales roughly as $\sqrt{N/(\rho^2 m)} + N/(\rho^2\varepsilon m)$, where $\rho$ is the confidence-margin parameter and $\varepsilon$ is the privacy parameter. Note that, for $\varepsilon \approx 1$, the bound scales with the number of neurons $N$, which is more favorable than standard bounds obtained via a uniform convergence argument, which depend on $d$ as well as on the total number of edges ($\approx N^2$), in addition to a similar dependence on $L$.

Our construction. Our DP learner is based on using $L$ embeddings $\Phi_0 \in \mathbb{R}^{k\times d},\ldots,\Phi_{L-1} \in \mathbb{R}^{k\times N}$, given by data-independent JL-transform matrices, to reduce the dimension of the inputs in each layer, including the input layer, to $k = O\big(r^2\Lambda^2(2\lambda\Lambda)^{2L}/\rho^2\big)$. We randomly generate a set $\Phi = (\Phi_0,\ldots,\Phi_{L-1})$ of $L$ independent JL matrices whose dimensions are as described above. We let $\mathcal{H}^{\Phi}_{\mathrm{NN}}$ denote the family of $L$-layer neural networks, where each network $h_{\Phi} \in \mathcal{H}^{\Phi}_{\mathrm{NN}}$ is associated with weight matrices $\widetilde{W}_j \in \mathbb{R}^{k\times N}$, $j \in [L-1]$, and $\widetilde{W}_L \in \mathbb{R}^{k\times 1}$, with the embedding $\Phi_{j-1}$ applied to the inputs of layer $j$ (see Figure 2, right column). We define $\mathcal{H}^{\Phi,2\Lambda}_{\mathrm{NN}}$ as the subfamily where $\|\widetilde{W}_j\|_F \le 2\Lambda$ for all $j \in [L]$. A sketch of the layer-compression idea is given after this section. We start by creating a $\gamma$-cover $\mathcal{C}$ of the product space of the matrices $\widetilde{W}_1,\ldots,\widetilde{W}_L$ associated with $\mathcal{H}^{\Phi,2\Lambda}_{\mathrm{NN}}$, where $\gamma = O\big(\rho/(r(4\lambda\Lambda)^L)\big)$; that is, $\mathcal{C}$ is a $\gamma$-cover of $\mathbb{B}_{kN}(2\Lambda)\times\cdots\times\mathbb{B}_{kN}(2\Lambda)\times\mathbb{B}_k(2\Lambda)$ with respect to $\|\cdot\|_F$. We define $\widehat{\mathcal{H}}^{\Phi,2\Lambda}_{\mathrm{NN}}$ to be the corresponding family of networks whose associated matrices are in $\mathcal{C}$. Given an input dataset $S = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathbb{B}_d(r)\times\{\pm 1\})^m$, we then run the exponential mechanism over $S$ with privacy parameter $\varepsilon$, with the score function being the empirical zero-one loss $\widehat{R}_S(h)$ for $h \in \widehat{\mathcal{H}}^{\Phi,2\Lambda}_{\mathrm{NN}}$ and sensitivity $\frac{1}{m}$, to return a neural network $h^{\mathrm{Priv}} \in \widehat{\mathcal{H}}^{\Phi,2\Lambda}_{\mathrm{NN}}$.
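The sketch below illustrates the embedding idea behind this construction on a single hidden layer: replacing an input $x$ and a weight matrix $W$ by $\Phi x$ and $\Phi W$ approximately preserves the pre-activations $\langle w_j, x\rangle$, and the $\lambda$-Lipschitz activation keeps the layer outputs close. This is a one-layer illustration with an assumed logistic sigmoid, not the paper's full $L$-layer construction or its parameter choices.

```python
import numpy as np

rng = np.random.default_rng(5)

d, N, k = 2000, 32, 256        # input dim, layer width, compressed dim (illustrative)
lam = 1.0
def sigmoid(u):                # an assumed logistic activation; it is lam-Lipschitz (lam/4, in fact)
    return 1.0 / (1.0 + np.exp(-lam * u))

W = rng.normal(size=(d, N)) / np.sqrt(d)                    # one hidden layer's weight matrix
x = rng.normal(size=d); x /= np.linalg.norm(x)

Phi = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)     # JL matrix Phi_0
W_tilde = Phi @ W                                           # compressed weights in R^{k x N}

out_orig = sigmoid(W.T @ x)                                 # layer output before compression
out_comp = sigmoid(W_tilde.T @ (Phi @ x))                   # layer output after embedding the input
print("max |difference|:", np.abs(out_orig - out_comp).max())
```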
## 6 Conclusion

We presented a series of new differentially private algorithms with dimension-independent margin guarantees, including algorithms for linear classification, kernel-based classification, and learning with a family of feed-forward neural networks, as well as label DP learning with general hypothesis sets. Our kernel-based algorithms can be extended to non-linear classification with many other kernels, including a variety of kernels that can be approximated using polynomial kernels, using techniques based on oblivious sketching. Our study of DP algorithms with margin guarantees for a family of neural networks can be viewed as an initiatory step that could serve as the basis for a more extensive analysis of DP algorithms for broader families of neural networks.

## Acknowledgments and Disclosure of Funding

This work was done while RB was a visiting scientist at Google, NY. RB's work at OSU on this research is supported by NSF Award AF-1908281 and a Google Faculty Research Award.

## References

[AC06] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 557-563, 2006.
[AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in l1 geometry. arXiv preprint arXiv:2103.01516, 2021.
[AKK+20] Thomas D. Ahle, Michael Kapralov, Jakob Bæk Tejs Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 141-160. SIAM, 2020.
[AL09] Nir Ailon and Edo Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete & Computational Geometry, 42(4):615-630, 2009.
[ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 852-860. ACM, 2019.
[ANW14] Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2258-2266, 2014.
[Bar98] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.
[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Chen Li, editor, Proceedings of the Twenty-Fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA, pages 128-138. ACM, 2005.
[BFSV+21a] Robert Istvan Busa-Fekete, Umar Syed, Sergei Vassilvitskii, et al. On the pitfalls of label differential privacy. In NeurIPS 2021 Workshop LatinX in AI, 2021.
[BFSV+21b] Robert Istvan Busa-Fekete, Umar Syed, Sergei Vassilvitskii, et al. Population level privacy leakage in binary classification with label noise. In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021.
[BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. Private stochastic convex optimization with optimal rates. arXiv preprint arXiv:1908.09970, 2019.
[BGM21] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. arXiv preprint arXiv:2107.05585. Appeared at NeurIPS 2021, 2021.
[BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-Euclidean differentially private stochastic convex optimization. arXiv preprint arXiv:2103.01278, 2021.
[BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Differentially private release and learning of threshold functions. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 634-649. IEEE Computer Society, 2015.
[BST99] Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel Methods: Support Vector Learning, pages 43-54, 1999.
[BST14] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 464-473. IEEE Computer Society, 2014.
[CGM19] Corinna Cortes, Spencer Greenberg, and Mehryar Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. Annals of Mathematics and Artificial Intelligence, 85(1):45-70, 2019.
[CHS14] Kamalika Chaudhuri, Daniel J. Hsu, and Shuang Song. The large margin mechanism for differentially private maximization. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1287-1295, 2014.
[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069-1109, 2011.
[CMTS21] Corinna Cortes, Mehryar Mohri, and Ananda Theertha Suresh. Relative deviation margin bounds. In International Conference on Machine Learning, pages 2122-2131. PMLR, 2021.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings, volume 3876 of Lecture Notes in Computer Science, pages 265-284. Springer, 2006.
[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211-407, 2014.
[Dwo06] Cynthia Dwork. Differential privacy. In Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, editors, Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II, volume 4052 of Lecture Notes in Computer Science, pages 1-12. Springer, 2006.
[EMSV21] Hossein Esfandiari, Vahab Mirrokni, Umar Syed, and Sergei Vassilvitskii. Label differential privacy via clustering. arXiv preprint arXiv:2110.02159, 2021.
[FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Konstantin Makarychev, Yury Makarychev, Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors, Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 439-449. ACM, 2020.
[GGK+21] Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, and Chiyuan Zhang. On deep learning with label differential privacy. arXiv preprint arXiv:2102.06062, 2021.
[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[JT13] Prateek Jain and Abhradeep Thakurta. Differentially private learning with kernels. In International Conference on Machine Learning, pages 118-126. PMLR, 2013.
[JT14] Prateek Jain and Abhradeep Thakurta. (Near) dimension independent risk bounds for differentially private learning. In ICML, 2014.
[KP02] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
[KW11] Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269-1281, 2011.
[LN17] Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633-638. IEEE, 2017.
[LNUZ20] Huy Le Nguyen, Jonathan Ullman, and Lydia Zakynthinou. Efficient private algorithms for learning large-margin halfspaces. In Algorithmic Learning Theory, pages 704-724. PMLR, 2020.
[LSS13] Quoc V. Le, Tamás Sarlós, and Alexander J. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 244-252. JMLR.org, 2013.
[MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 94-103. IEEE, 2007.
[Nel10] Jelani Nelson. Johnson-Lindenstrauss notes, 2010.
[Nel11] Jelani Osei Nelson. Sketching and streaming high-dimensional vectors. PhD thesis, Massachusetts Institute of Technology, 2011.
[Nel15] Jelani Nelson. Dimensionality reduction notes 2, 2015.
[PP13] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239-247, 2013.
[RBHT09] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. arXiv preprint arXiv:0911.5708, 2009.
[RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 1177-1184. Curran Associates, Inc., 2007.
[RR08] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems, 21, 2008.
[RS16] Sofya Raskhodnikova and Adam Smith. Lipschitz extensions for node-private graph statistics and the generalized exponential mechanism. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 495-504. IEEE, 2016.
[Rud17] Walter Rudin. Fourier Analysis on Groups. Courier Dover Publications, 2017.
[SFBL97] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322-330, 1997.
[SS02] Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[SSTT21] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. Evading the curse of dimensionality in unconstrained private GLMs. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2638-2646. PMLR, 13-15 Apr 2021.
[STC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[SWYZ21] Zhao Song, David P. Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 9812-9823. PMLR, 2021.
[YSMN21] Sen Yuan, Milan Shen, Ilya Mironov, and Anderson C. A. Nascimento. Practical, label private deep learning training based on secure multiparty computation and differential privacy. Cryptology ePrint Archive, 2021.