# rademacher_observations_private_data_and_boosting__c87f7fc9.pdf

Rademacher Observations, Private Data, and Boosting

Richard Nock RICHARD.NOCK@NICTA.COM.AU Giorgio Patrini GIORGIO.PATRINI@NICTA.COM.AU Arik Friedman ARIK.FRIEDMAN@NICTA.COM.AU NICTA & { The Australian National University, The University of New South Wales}, Sydney, Australia

The minimization of the logistic loss is a popular approach to batch supervised learning. Our paper starts from the surprising observation that, when ﬁtting linear classiﬁers, the minimization of the logistic loss is equivalent to the minimization of an exponential rado-loss computed (i) over transformed data that we call Rademacher observations (rados), and (ii) over the same classiﬁer as the one of the logistic loss. Thus, a classiﬁer learnt from rados can be directly used to classify observations. We provide a learning algorithm over rados with boosting-compliant convergence rates on the logistic loss (computed over examples). Experiments on domains with up to millions of examples, backed up by theoretical arguments, display that learning over a small set of random rados can challenge the state of the art that learns over the complete set of examples. We show that rados comply with various privacy requirements that make them good candidates for machine learning in a privacy framework. We give several algebraic, geometric and computational hardness results on reconstructing examples from rados. We also show how it is possible to craft, and efﬁciently learn from, rados in a differential privacy framework. Tests reveal that learning from differentially private rados brings non-trivial privacy vs accuracy tradeoffs.

1. Introduction

This paper deals with the following fundamental question:

What information is sufﬁcient for learning, and what guarantees can it bring that regular data cannot ?

Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

By regular , we mean the usual inputs provided to a learner. In our context of batch supervised learning, this is a training set of examples, each of which is an observation with a class, and learning means inducing in reduced time an accurate function from observations to classes, a classiﬁer. It turns out that we do not need the detail of classes to learn a linear classiﬁer: an aggregate, whose size is the dimension of the observation space, is minimally sufﬁcient, the mean operator (Patrini et al., 2014).

But do we need examples ?

This perhaps surprising and non-trivial question is becoming crucial now that the nature of stored and processed signals intelligence data is heavily debated in the public sphere (Landau, 2015; Sproull et al., 2015). In the context of machine learning (ML), the objective of being accurate is more and more frequently subsumed by more complex goals, sometimes involving challenging tradeoffs in which accuracy does not ultimately appear in the topmost requirements. Privacy is one such crucial goal (Duchi et al., 2014; Enserink & Chin, 2015; Goroff, 2015). There are various models to capture the privacy requirement, such as secure multi-party computation and differential privacy (DP, (Dwork & Roth, 2014)). The former usually relies on cryptographic protocols, which can be heavy even for bare classiﬁcation and simple algorithms (Bost et al., 2014). The latter usually relies on the power of randomization to ensure that any local change cannot be spotted from the output delivered (Dwork et al., 2010; Dwork & Roth, 2014). In a ML setting, randomization can be performed at various stages, from the examples to the output of a classiﬁer. We focus on the upstream stage of the process, i.e. the input to the learner, which grants the beneﬁts that all subsequent stages also comply with differential privacy. Randomization has its power: it also has its limits in this case, as it may signiﬁcantly degrade the performance of learners.

The way we address this problem starts from a surprising observation, whose relevance to supervised ML goes beyond learning with private data: learning a linear classiﬁer over examples throughout the minimization of the expected logistic loss is equivalent to learning the same classiﬁer

Rademacher Observations

by minimizing an exponential loss over a complete set of transformed data that we call Rademacher observations, rados. Each rado is the sum of edge vectors over examples (edge = observation label). We also show that efﬁcient learning from all rados may also be achieved when carried out over subsets of all possible rados.

This is our ﬁrst contribution, and we expect it to be useful in several other areas of supervised learning. In the context of learning with private data, our other contributions can be summarized as showing how rados may yield new privacy guarantees not limited to differential privacy while authorizing boosting-compliant rates for learning. More precisely, our second contribution is to propose a radobased learning algorithm, which has boosting-compliant convergence rates over the logistic loss computed over the examples. Thus, we learn an accurate classiﬁer over rados, and the same classiﬁer is accurate over examples as well.

The fact that efﬁcient learning may be achieved through subset of rados is interesting because it opens the problem of designing this particular subset to address domainspeciﬁc requirements that add to the ML accuracy requirement. Among our other contributions, we provide one important design example, showing how to build differentially private mechanisms for rado delivery, such as when protecting speciﬁc sensitive features in data. Experiments conﬁrm in this case that learning from differentially private rados may still be competitive with learning from examples. We provide another design which pairs to our rado-based boosting algorithm, with the crucial property that when examples have been DP-protected by the popular Gaussian mechanism (Dwork & Roth, 2014), the joint pair (rado delivery design, boosting algorithm) may achieve convergence rates comparable to the noise-free setting with high probability, even over strong DP protection regimes. Our last contribution is to show that rados may protect the privacy of the original examples not only in the DP framework, but also from several algebraic, geometric and even computational-complexity theoretic standpoints.

The remainder of this paper is organized as follows. Section 2 presents Rademacher observations, shows the equivalence between learning from examples and learning from rados, and how learning from subsets of rados may be sufﬁcient for efﬁcient learning; 3 presents our radobased boosting algorithm, and 4 presents experiments with this algorithm; 5 presents our results in DP models, 6 presents related experiments; 7 provides results on the hardness of reconstructing examples from rados from algebraic, geometric and computational standpoints. To keep a readable paper, proofs and additional experiments are given in a companion Ar Xiv paper (Nock et al., 2015).

2. Rados and supervised learning

Let [n] = {1, 2, ..., n}. We are given a set of m examples S .= {(xi, yi), i [m]}, where xi X Rd is an observation and yi { 1, 1} is a label, or class. X is the domain. A linear classiﬁer θ Θ for some ﬁxed Θ Rd gives a label to x X equal to the sign of θ x R. Our results can be lifted to kernels (at least with ﬁnite dimension feature maps) following standard arguments (Quadrianto et al., 2009). We let Σm .= { 1, 1}m.

Deﬁnition 1 For any σ Σm, the Rademacher observation πσ with signature σ is πσ .= (1/2) P i(σi + yi)xi.

The simplest way to randomly sample rados is to pick σ as i.i.d. Rademacher variables, hence the name. Reference to S is implicit in the deﬁnition of πσ. A Rademacher observation sums edge vectors (the terms yixi), over the subset of examples for which yi = σi. When σ = y is the vector of classes, πσ = mµS is m times the mean operator, a minimal sufﬁcient statistics for the class (Quadrianto et al., 2009; Patrini et al., 2014). Thus, up to the normalization by m, any rado is a minimal sufﬁcient statistic for the class in a subset of the training sample. A popular approach to learn θ over S is to minimize the surrogate risk Flog (S, θ) built from the logistic loss (logloss):

Flog (S, θ) .= 1

i log 1 + exp yiθ xi .(1)

We deﬁne the exponential rado-risk F r exp(S, θ, U), computed on any U Σm with cardinal |U| = n, as:

F r exp(S, θ, U) .= 1

σ U exp θ πσ . (2)

It turns out that Flog = g(F r exp) for some continuous strictly increasing g and speciﬁc choice of U (in fact, U = Σm); hence, minimizing one criterion is equivalent to minimizing the other and vice versa. This is formalized below.

Lemma 2 The following holds true, for any θ and S:

Flog(S, θ) = log(2) + 1

m log F r exp(S, θ, Σm) . (3)

Lemma 2 shows that learning with examples via the minimization of Flog (S, θ), and learning with all rados via the minimization of F r exp(S, θ, Σm), are essentially equivalent tasks. Since the cardinal |Σm| = 2m is exponential, it is unrealistic, even on moderate-size samples, to pick that latter option. This raises however a very interesting question: if we replace Σm by subset U of size 2m,what does the relationship between examples and rados in eq. (3) become? We answer this question under the setting that:

Rademacher Observations

(i) instead of Σm, we consider a predeﬁned Σr Σm;

(ii) instead of considering U = Σr, we sample uniformly i.i.d. U Σr for n 1 rados.

While (ii) is directly targeted at reducing the number of rados, (i) is an upper-level strategic design to tackle additional constraints, such as differential privacy. We now need following deﬁnition of the logistic rado-risk:

F r log (S, θ, U) .= log(2) + 1

m log F r exp(S, θ, U) ,(4)

for any U Σm, so that Flog (S, θ) = F r log (S, θ, Σm). We also deﬁne the open ball B(0, r) .= {x Rd : x 2 < r}.

Theorem 3 Assume Θ B(0, rθ), for some rθ > 0. Let:

.= supθ Θ maxπσ Σr exp( θ πσ)

F rexp(S, θ, Σr) ,

.= F r exp(S, θ, Σr)

F rexp(S, θ, Σm) ,

where Σr follows (i) above. Then η > 0, there is probability 1 η over the sampling of U in (ii) above that:

Flog (S, θ) F r log(S, θ, U) + Q 1

rθ max Σr πσ 2 + d log 2en

and Q .= (1/m) log satisﬁes Q = 0 if Σr = Σm and

Q rθ θF r log (S, θ, Σm) 2 +

otherwise, letting

πr .= Eσ Σr(1/m) πσ 2. Furthermore, 0 β < 1/2, if m is sufﬁciently large, then letting π r .= maxΣr (1/m) πσ 2, ineq. (5) becomes:

Flog (S, θ) F r log(S, θ, U) + Q

Theorem 3 does not depend on the algorithm that learns θ. The right-hand side of ineq. (5) shows two penalties. Q arises from the choice of Σr and is therefore structural. Regardless of Σr, when the classiﬁer is reasonably accurate over all rados and expected examples edges in Σr average to a ball of reduced radius, the upperbound on Q in ineq. (7) can be very small. The other penalty, which depends on q, is statistical and comes from the sampling in Σr. Theorem 3 shows that when Σr = Σm, even when n m, the

Algorithm 1 Rado boosting (RADOBOOST)

Input set of rados Sr .= {π1, π2, ..., πn}; T N ; Step 1 : let θ0 0, w0 (1/n)1 ; Step 2 : for t = 1, 2, ..., T Step 2.1 : [d] ι(t) WFI(Sr, wt); Step 2.2 : let

j=1 wtjπjι(t) ; (9)

2π ι(t) log 1 + rt

1 rt ; (10)

Step 2.3 : for j = 1, 2, ..., n

w(t+1)j wtj

Return θT deﬁned by θT k .= P t:ι(t)=k αt , k [d];

minimization of F r log (S, θ, U) may still bring, with high probability, guarantees on the minimization of Flog (S, θ). Thus, a lightweight optimization procedure over a small number of rados may bring guarantees on the minimization of the expected logloss over examples for the same classiﬁer. The following Section exhibits one such algorithm.

3. Boosting using rados

Algorithm 1 provides a boosting algorithm, RADOBOOST, that learns from a set of Rademacher observations Sr .= {π1, π2, ..., πn}. Their (unknown) Rademacher assignments are denoted U .= {σ1, σ2, ..., σn} Σm. These rados have been computed from some sample S, unknown to RADOBOOST. In the statement of the algorithm, πjk denotes coordinate k of πj, and π k .= maxj |πjk|. More generally, the coordinates of some vector z Rd are denoted z1, z2, ..., zd. Step 2.1 gets a feature index ι(t) from a weak feature index oracle, WFI. In its general form, WFI returns a feature index maximizing |rt| in (9). The weight update was preferred to Ada Boost s because rados can have large feature values and the weight update prevents numerical precision errors that could otherwise occur using Ada Boost s exponential weight update. We now prove a key Lemma on RADOBOOST, namely the fast convergence of the exponential rado-risk F r exp(S, θ, U) under a weak learning assumption (WLA). We shall then obtain the convergence of the logistic rado-risk (4), and, via Theorem 3, the convergence with high probability of Flog (S, θ).

(WLA) γ > 0 such that t 1, the feature returned by WFI in Step 2.2 (9) satisﬁes |rt| γ.

Rademacher Observations

ADABOOST(n)

Domain m d 100σ

Abalone 4 177 8

23.20 1.44 0.24

25.14 1.83 [3: [1:3]]

Wine-white 4 898 11 1

30.44 3.25 0.20

32.48 3.55 [3: [1:3]]

Magic 19 020 10

20.91 0.99 0.05

22.75 1.51 [3: [5:3]]

EEG 14 980 14 14

44.36 1.99 0.07

44.23 1.73 [4: [4:3]]

Hardware 28 179 95

16.76 0.73 0.04

7.61 3.24 [2: [8:3]]

Twitter 583 250 77 44

53.09 11.23 [1: 3]

6.00 0.77 [1: [1:5]]

Su Sy 5 000 000 17

27.43 0.19 [2: 4]

27.26 0.55 [1: [1:6]]

Higgs 11 000 000 28

45.39 0.28 [9: 5]

47.86 0.06 [1: [1:7]]

Table 1. Comparison of RADOBOOST (n random rados), ADABOOST (Schapire & Singer, 1999) (full training fold) and ADABOOST(n) (n random examples in training fold); domains ranked in increasing d m value. Column n/m (resp. n/2m ) for ADABOOST(n) (resp RADOBOOST) is proportion of training data wrt fold size (resp. full set of rados). Notation [a:b] is shorthand for a 10b. Column 100σ is the number of features with outlier values distant from the mean by more than 100σ in absolute value. Column p (resp. p ) is p-value for a two-tailed paired t-test on ADABOOST (resp. ADABOOST(n)) vs RADOBOOST. ε means < 0.01.

Lemma 4 Suppose the (WLA) holds. Then after T rounds of boosting in RADOBOOST, the following upperbound holds on the exponential rado-loss of θT :

F r exp(S, θT , U) exp T γ2/2 . (12)

We now consider Theorem 3 with Σr = Σm, and therefore Q = 0. Blending Lemma 4 and Theorem 3 using (4) yields that, under the (WLA), we may observe with high probability (again, ﬁxing Σr = Σm, so Q = 0 in Theorem 3):

Flog (S, θT ) log(2) T γ2

2m + Q , (13)

where Q is the rightmost term in ineq. (5) or ineq. (8). So provided n 2m is sufﬁciently large, minimizing the exponential rado-risk over a subset of rados brings a classiﬁer whose average logloss on the whole set of examples may decrease at rate Ω(γ2/m) under a weak learning assumption made over rados only. This rate competes with those for direct approaches to boosting the logloss (Nock & Nielsen, 2008), and we now show that our weak learning assumption is also essentially equivalent to the one done in boosting over examples (Schapire & Singer, 1999). Let us rewrite rt(w) as the normalized edge in (9), making explicit the dependence in the current rado weights. Let

rex t ( w) .= 1

i=1 wixiι(t) (14)

be the normalized edges for the same feature ι(t) as the one picked in step 2.1 of RADOBOOST, but computed over examples using some weight vector w Pm; here, Pm is the m-dim probability simplex and x ι(t) .= maxi |xik|.

Lemma 5 wt Pn, γ > 0, there exists w Pm and γex > 0 such that |rt(wt)| γ iff |rex t ( w)| γex.

The proof of the Lemma gives clues to explain why the presence of outlier feature values may favor RADOBOOST.

4. Basic experiments with RADOBOOST

We have compared RADOBOOST to its main contender, ADABOOST (Schapire & Singer, 1999), using the same weak learner; in ADABOOST, it returns a feature maximizing |rt| as in eq. (14). In these basic experiments, we have deliberately not optimized the set of rados in which we sample U for RADOBOOST; hence, we have Σr = Σm. We have performed comparisons with 10 folds stratiﬁed cross-validation (CV) on 16 domains of the UCI repository (Bache & Lichman, 2013) of varying size. For space considerations, Table 1 presents the results on the 8 largest domains. SI presents the complete experiments. Each algorithm was ran for a total number of T = 1000 iterations; furthermore, the classiﬁer kept for testing is the one minimizing the empirical risk throughout the T iterations; in doing so, we also assessed the early convergence of algorithms. We ﬁxed n = min{1000, train fold size/2}. Table 1 displays that RADOBOOST compares favourably to AD-

ABOOST, and furthermore it tends to be all the better as m and d increase. On some domains like Hardware and Twitter, the difference is impressive and clearly in favor of RADOBOOST. Experimentally, we interpret it by the fact that random rados may have large norms on big domains, which may yield large boosting leveraging coefﬁcients. On domains like Twitter, this boosts convergence. Also, outlier features (see column 100σ in Table 1) can trick AD-

ABOOST in picking the wrong sign for αt for a large number of iterations. This drawback can be easily corrected (SI (Nock et al., 2015)) by enforcing minimal |rt| values. This improves ADABOOST on Hardware and Twitter. Improvements observed on RADOBOOST are even more favorable.

5. Rados and differential privacy

We discuss the delivery of rados to comply with DP constraints and their eventual impact on boosting. We thus adress both levels (i+ii) of rado delivery in 2. Our model is the standard DP model (Dwork & Roth, 2014). Intuitively,

Rademacher Observations

(a) L S S DP L L

Figure 1. Summary of the DP-related contributions of Section 5 (in color). (a) : usual DP mechanism that protects examples (S) prior to delivery to learner (L); (b) : mechanism that crafts differentially private rados (R) from unprotected examples ( 5.1); (c) : mechanism crafting rados from DP examples with objective to improve performances of rado-based learner L ( 5.2).

Algorithm 2 Feature-wise DP rados (DP-FEAT)

Input set of examples S, sensitive feature j [d], number of rados n, differential privacy parameter ǫ > 0; Step 1 : let β 1/(1 + exp(ǫ/2)) [0, 1/2); Step 2 : sample σ1, σ2, ..., σn i.i.d. (uniform) in Σβ,j m ; Return set of rados {πσ : σ sampled in Step 2};

an algorithm is DP if for any two neighboring datasets, it assigns similar probability to any possible output O. In other words, any particular record has only limited inﬂuence on the probability of any given output of the algorithm, and therefore the output discloses very little information about any particular record in the input. Formally, a randomized algorithm A is (ǫ, δ)-differentially-private (Dwork et al., 2006) for some ǫ, δ > 0 iff:

PA[O|S] exp(ǫ) PA[O|S ] + δ, S S , O,(15)

where the probability is over the coin tosses of A. This model is very strong, especially when δ = 0, and in the context of ML, maintaining high accuracy in strong DP regimes is generally a tricky tradeoff (Duchi et al., 2014). Because rados are an intermediate step between training sample S and a rado-based learner, there are two ways to design rados with respect to the DP framework: crafting DP rados from unprotected examples, or crafting rados from DP examples with the aim to improve the performance of the rado-based learner (Figure 5.2). These scenarii can be reduced to the design of Σr.

5.1. A feature-wise DP mechanism for rados

In this Subsection, we consider a relaxation of differentialprivacy, namely feature-wise differential privacy, where the differential privacy requirement applies to j -neighboring datasets: we say that two samples S, S are j -neighbors, noted S j S , if they are the same except for the value of the jth [d] observation feature of some example. We further assume that the feature is boolean. For example, we may have a medical database containing a column representing the HIV status of a doctor s patients (1 row =

m mj (+) mj (+)

S (coord. j )

Figure 2. How DP-FEAT works: neighbor samples S and S differ by one value for feature j (i.e. one edge coordinate, represented); the rado whose support relies only on the -1 in S (dashed lines) yields inﬁnite ratio PA[O|I]/PA[O|I ] in (15). This rado would never be sampled by DP-FEAT. On the other hand, a rado that sums an equal number s of +1 and -1 (dotted lines) may yield ratio very close to 1 (such a rado can be sampled by DP-FEAT).

a patient), and we do not wish that changing a single patient HIV status signiﬁcantly changes the density of that feature s values in rados. This setting would also be very useful in genetic applications to hide in rados gene disorders that affect one or few genes. Feature-wise DP is analogous to the concept of α-label privacy (Chaudhuri & Hsu, 2011), where differential privacy is guaranteed with respect to the label. Algorithm A in ineq. (15) is given in Algorithm 2. It relies on the following subset Σr .= Σβ,j m Σm (m+ .= |{i : yixij = +1}| (m/2)):

Σβ,j m .= {σ Σm:πσj [m+ β, m++ β]} ,(16)

with β .= (m/2) β(m + 1). The key feature of this mechanism is that it does not alter the examples in the sense that DP rados belong to the set of cardinal 2m that can be generated from S. Usual data-centered DP mechanisms would rather alter data, e.g. via noise injection (Goroff, 2015). Algorithm 2 exploits the fact that it is the tails of feature j that leak sensitive information about the feature in rados (see Figure 2).

Theorem 6 If ǫ = Ω(1/m) and ǫ = o(1), DPFEAT maintains (n ǫ, n δ)-differential privacy on feature j for some δ = o(1/m).

We have implemented Step 2 in Algorithm DP-FEAT in the simplest way, using Rademacher rejection sampling where each σj is picked i.i.d. as σj Σm until σj Σβ,j m . The following Theorem shows its algorithmic efﬁciency.

Theorem 7 For any η > 0, let n η .= η(1 exp(2β 1))/(4β), and let n R denote the total number of rados sampled in Σm until n rados are found in Σβ,j m . Then for any η > 0, there is probability 1 η that

( 1 if n n η l 1

m DBE(1 β 1/2) log n

m otherwise ,

where DBE is the bit-entropy divergence: DBE(p q) = p log(p/q)+(1 p) log((1 p)/(1 q)), for p, q (0, 1).

Rademacher Observations

Remark that replacing Σm by Σr = Σβ,j m would not necessarily impair the boosting convergence of RADOBOOST trained from rados samples from DPFEAT (Lemma 4). The only systematic change would be in ineq. (13) where we would have to integrate the structural penalty Q from Theorem 3 to further upperbound Flog (S, θT). In this case, the upperbound in (7) reveals that at least when the mean operator in Σβ,j m has small norm which may be the case even when some examples in S have large norm and the gradient penalty is small, then Q may be small as well.

Finally, the tail truncation design exploited in DPFEAT can be fairly simply generalized in two directions, to handle (a) real-valued features, and/or (b) several sensitive features instead of one.

5.2. Boosting from DP examples via rados

We now show how to craft rados from DP protected examples so as to approximately keep the convergence rates of RADOBOOST. More precisely, since edge vectors are sufﬁcient to learn (eq. 1), we assume that edge vectors are DP (neighbor samples, S S , would differ on one edge vector). A gold standard to protect data in the DP framework is to convolute data with noise. One popular mechanism is the Gaussian mechanism (Dwork & Roth, 2014; Hardt & Price, 2014), which convolutes data with independent Gaussian random variables N(0, ς2I), whose standard deviation ς depends on the DP requirement (ǫ, δ). Strong DP regimes are tricky to handle for learning algorithms. For example, the approximation factor ρ of the singular vectors under DP noise of the noisy power method roughly behaves as ρ = Ω(ς/ ) (Hardt & Price, 2014) (Corollary 1.1) where = O(d) is a difference between two singular values. When ς is small, this is a very good bound. When the DP requirement blows up, the bound remains relevant if d increases, which may be hard to achieve in practice it is easier in general to increase m than d, which requires to compute new features for past examples.

We consider ineq. (15) with neighbors I and I being two sets of m edge vectors differing by one edge vector, and O is a noisiﬁed set of m edge vectors generated through the Gaussian mechanism (Dwork & Roth, 2014) (Appendix A). We show the following non-trivial result: provided we design another particular Σr, the convergence rate of RADOBOOST, as measured over non-noisy rados, essentially survives noise injection in the edge vectors through the Gaussian mechanism, even under strong noise regimes, as long as m is large enough. The intuition is straightforward: we build rados summing a large number of edge vectors only (this is the design of Σr), so that the i.i.d. noise component gets sufﬁciently concentrated for the algorithm to be able to learn almost as fast as in the noise-free set-

ting. We emphasize the non-trivial fact that convergence rate is measured over the non-noisy rados, which of course RADOBOOST does not see. The result is of independent interest in the boosting framework, since it makes use of a particular weak learner (WFI), which we call prudential, which picks features with |rt| (9) upperbounded.

We start by renormalizing coefﬁcients αt (eq. (10)) in RADOBOOST by a parameter κ 1 given as input, so that we now have αt (1/(κπ ι(t))) log((1 + rt)/(1 rt)) in Step 2.2. It is not hard to check that the convergence rate of RADOBOOST now becomes, prior to applying the (WLA)

F r log(S, θT , U) log(2) 1

t r2 t . (17)

We say that WFI is λp-prudential for λp > 0 iff it selects at each iteration a feature such that |rt| λp. Edges vectors have been DP-protected as yi(xi + xr i ), with xr i N(0, ς2I) (for i [m]). Let mσ .= |{i : σi = yi}| denote the support of a rado, and (m > 0 ﬁxed):

Σr = Σm m .= {σ Σm : mσ = m } . (18)

Theorem 8 U Σr, τ > 0, if

m = Ω(ς ln(1/τ)), then λp > 0 such that RADOBOOST having access to a λp-prudential weak learner returns after T iteration a classiﬁer θT which meets with probability 1 τ:

F r log(S, θT , U) log(2) 1

t r2 t . (19)

The proof details parameters and dependencies hidden in the statement. The use of a prudential weak learner is rather intuitive in a noisy setting since αt blows up when |rt| is close to 1. Theorem 8 essentially yield that a sufﬁciently large support for rados is enough to keep with high probability the convergence rate of RADOBOOST within noisefree regime. Of course, the weak learner is prudential, which implies bounded |rt| < 1, and furthermore the leveraging coefﬁcients αt are normalized, which implies smaller margins. Still, Theorem 8 is a good theoretical argument to rely on rados when learning from DP edge vectors.

6. Experiments on differential privacy

Table 2 presents a subset of the experiments carried out with RADOBOOST and ADABOOST in the contexts of Subsections 5.1 and 5.2. Due to size constraints, the full Table (and more extensive experiments) can be found in SI (Nock et al., 2015). Unless otherwise stated, experimental settings (cross validation, number of rados for learning, etc.) are the same as in Section 4.

Rademacher Observations

Section 5.1

Section 5.2

RADOBOOST vs ADABOOST

RADOBOOST: 5.2 vs 5.1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

ε (DP parameter)

Rado Boost (Rado: DP-compliant)

Radoboost (Rado: random)

0.1 1 10 100 1000

perr(Radoboost - Ada Boost)

perr Ada Boost

σ (Gaussian)

perr (Rado: s / m = 0.25) perr (Rado: s / m = 0.50) perr (Rado: s / m = 0.75)

perr (Rado: random)

perr Ada Boost

0.1 1 10 100

perr(Median-prud. - Strong)

σ (Gaussian)

Rado support : s / m = 0.25

Rado support : s / m = 0.5 Rado support : s / m = 0.75

Transfusion

Table 2. Left table: RADOBOOST on feature-wise DP rados (Subsection 5.1, showing standard deviations) vs RADOBOOST on plain random rados baseline and ADABOOST baseline (trained with complete fold). Center: test error of RADOBOOST minus ADABOOST s (also showing ADABOOST error on right axis, dotted line), for rados with ﬁxed support s (= m , in green, red, blue) and plain random rados (dotted grey). Right: test error of RADOBOOST using ﬁxed support s rados and a prudential learner, minus RADOBOOST using plain random rados and strong learner of Section 4. See text and SI (Nock et al., 2015)).

In a ﬁrst set of experiments, we have assessed the impact on learning of the feature-wise DP mechanism: on each tested domain, we have selected at random a binary feature, and then used Algorithm DP-FEAT to protect the feature for different values of DP parameter ǫ, in a range that covers usual DP experiments (Hsu et al., 2014) (Table 1). The main conclusion that can be drawn from the experiments is that learning from DP rados can compete with learning from random rados, and even learning from examples (ADABOOST), even for rather small ǫ.

We then have assessed the impact on learning of examples that have been protected using the Gaussian mechanism (Dwork & Roth, 2014), with or without rados, with or without a prudential weak learner for boosting, and with or without using a ﬁxed support for rado computation. SI (Nock et al., 2015) provides extensive results for all domains but the largest ones (Twitter, Su Sy, Higgs). In the central column (see also SI (Nock et al., 2015)), computing the differences between RADOBOOST s error and ADABOOST s reveals that, on domains where it is beaten by ADABOOST when there is no noise, RADOBOOST almost always rapidly becomes competitive with ADABOOST as noise increases. Hence, RADOBOOST is a good contender from the boosting family to learn from differentially private (or noisy) data. Second, using a prudential weak learner which picks the median feature (instead of the more efﬁcient weak learner that picks the best as in Section 4) can have RADOBOOST with ﬁxed support rados compete or beat RADOBOOST with plain random rados, at least for small noise levels (see Transfusion and Magic in the right column of Table 2 and SI (Nock et al., 2015)). Replacing the median-prudential weak learner by a strong learner can actually degrade RADOBOOST s results (see SI (Nock et al., 2015)). These two observations advocate in favor of the theory developed in Subsection 5.2. Finally, using rados with ﬁxed support instead of plain random ra-

dos (Section 4) can signiﬁcantly improve the performances of RADOBOOST (see SI (Nock et al., 2015)).

7. From rados to examples: hardness results

The problem we address here is how we can recover examples from rados, and when we cannot recover examples from rados. This last setting is particularly useful from the privacy standpoint, as this may save us costly obfuscation techniques that impede ML tasks (Bost et al., 2014).

7.1. Algebraic and geometric hardness

For any m N , we deﬁne matrix Gm {0, 1}m 2m as:

Gm .= 0 2m 1 1 2m 1 Gm 1 Gm 1

if m > 1, and G1 .= [0 1] otherwise (zd denotes a vector in Rd). Each column of Gm is the binary indicator vector for the edge vectors considered in a rado. Hereafter, we let E Rd m the matrix of columnwise edge vectors from S, Π Rd n the columnwise rado matrix and U {0, 1}2m n in which each column gives the index of a rado computed in Sr. By construction, we have:

Π = EGm U , (21)

and so we have the following elementary results for the (non) reconstruction of E (proof omitted).

Lemma 9 (a) when recoverable, edge-vectors satisfy: E = ΠU G m(Gm UU G m) 1; (b) when U, Π, m are known but n < m, there is not a single solution to eq. (21) in general.

Lemma 9 states that even when U, Π and m are known, elementary constraints on rados can make the recovery of

Rademacher Observations

edge vectors hard notice that such constraints are met in our experiments with RADOBOOST in Sections 4 and 6.

But this represents a lot of unnecessary knowledge to learn from rados: RADOBOOST just needs Π to learn. We now explore the guarantees that providing this sole information brings in terms of (not) reconstructing E. M Ra b, we let C(M) denote the set of column vectors, and for any C Rd, we let C ǫ .= z CB(z, ǫ). We deﬁne the Hausdorff distance, DH(E, E ), between E and E :

DH(E, E ) .= inf{ǫ : C(E) C(E ) ǫ C(E ) C(E) ǫ} .

The following Lemma shows that if the only information known is Π, then there exist samples that bring the same set of rados C(Π) as the unknown E but who are at distance proportional to the width of the domain at hand.

Lemma 10 For any Π Rd n, suppose eq. (21) holds, for some unknowns m > 0, E Rd m, U {0, 1}2m n. Suppose C(E) B(0, R) for some R > 0. Then there exists E Rd (m+1), U {0, 1}2m+1 n such that

C(E ) B(0, R) and Π = E Gm+1U , (22)

DH(E, E ) = Ω R log d

if m 2d, and DH(E, E ) = Ω(R/

d) otherwise.

Hence, without any more knowledge, leaks, approximations or assumptions on the domain at hand, the recovery of E pays in the worst case a price proportional to the radius of the smallest enclosing B(0, .) ball for the unknown set of examples. We emphasize that this inapproximability result does not rely on the computational power at hand.

7.2. Computational hardness

In this Subsection, we investigate two important problems in the recovery of examples. The ﬁrst problem addresses whether we can approximately recover sparse examples from a given set of rados, that is, roughly, solve (21) with a sparsity constraint on examples. The ﬁrst Lemma we give is related to the hardness of solving underdetermined linear systems for sparse solutions (Donoho & Tanner, 2005). The sparsity constraint can be embedded in the compressed sensing framework (Donoho, 2006) to yield ﬁner hardness and approximability results, which is beyond the scope of our paper. We deﬁne problem Sparse-Approximation as:

(Instance) : set of rados Sr = {π1, π2, ..., πn}, m N , r, ℓ R+, . p, Lp-norm for p R+;

(Question) : Does there exist set S .= {(xi, yi), i [m]} and set U .= {σ1, σ2, ..., σn} { 1, 1}m such that:

xi p ℓ, i [m] , (Sparse examples) πj πσj p r , j [n] . (Rado approx.)

Lemma 11 Sparse-Approximation is NP-Hard.

In the context of rados, the second problem we address has very large privacy applications. Suppose entity A has a huge database of people (e.g. clients), and obtains a set of rados emitted by another entity B . An important question that A may ask is whether the rados observed can be approximately constructed by its database, for example to ﬁgure out which of its clients are also its competitors . We deﬁne this as problem Probe-Sample-Subsumption :

(Instance) : set of examples S, set of rados Sr = {π1, π2, ..., πn}, m N , p, r R+.

(Question) : Does there exist S .= {(xi, yi), i [m]} S and set U .= {σ1, σ2, ..., σn} { 1, 1}m such that:

πj πσj p r , j [n] . (Rado approx.)

Lemma 12 Probe-Sample-Subsumption is NP-Hard.

This worst-case result calls for interesting domain-speciﬁc qualiﬁcations, such as in genetics where the privacy of individual genomes can be compromised by population-wise statistics (Homer et al., 2008; Nietfeld et al., 2011).

8. Conclusion

We have introduced novel quantities that are sufﬁcient for efﬁcient learning, Rademacher observations. The fact that a subset of these can replace traditional examples for efﬁcient learning opens interesting problems on how to craft these subsets to cope with additional constraints. We have illustrated these constraints in the ﬁeld of efﬁcient learning from privacy-preserving data, from various standpoints that include differential privacy as well as algebraic, geometric and computational considerations. In that last case, results rely on NP-Hardness, and thus go beyond the hardness of factoring integers on which rely some popular cryptographic techniques (Bost et al., 2014). Rados are also cryptography-compliant: homomorphic encryption schemes can be used to compute rados in the encrypted domain from encrypted edge vectors or examples rado computation can thus be easily distributed in secure multiparty computation applications. Finally, rados may allow signiﬁcant memory savings for learning, and could be of use in areas where speed matters, like on-line learning.

Rademacher Observations

Acknowledgments

Thanks are due to the reviewers for many useful comments about the paper. The authors are indebted to Tiberio Caetano for early discussions that brought the idea of Rademacher observations and their use in privacy related applications. Thanks are also due to Stephen Hardy and Hugh Durrant-Whyte for many stimulating discussions and feedback on the subject. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Center of Excellence Program.

Bache, K. and Lichman, M. UCI machine learning repository, 2013.

Bost, R., Popa, R.-A., Tu, S., and Goldwasser, S. Machine learning classiﬁcation over encrypted data. Cryptology e Print Archive, Report 2014/331, 2014.

Chaudhuri, K. and Hsu, D. Sample complexity bounds for differentially private learning. In Proc. of the 24 th

COLT, pp. 155 186, 2011.

Donoho, D.-L. Compressed sensing. IEEE T. IT, 52(4): 1289 1306, 2006.

Donoho, D.-L. and Tanner, J. Sparse non-negative solution of underdetermined linear equations by linear programming. PNAS, 102:9446 9451, 2005.

Duchi, J.-C., Jordan, M.-I., and Wainwright, M. Privacyaware learning. JACM, 2014.

Dwork, C. and Roth, A. The algorithmic foudations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9:211 407, 2014.

Dwork, C., Mc Sherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proc. of the 3rd TCC, pp. 265 284, 2006.

Dwork, C., Rothblum, G.-N., and Vadhan, S.-P. Boosting and differential privacy. In Proc. of the 51 st FOCS, pp. 51 60, 2010.

Enserink, M. and Chin, G. The end of privacy. Science, 347:490 491, 2015.

Goroff, D.-L. Balancing privacy versus accuracy in research protocols. Science, 347:479 480, 2015.

Hardt, M. and Price, E. The noisy power method: a meta algorithm with applications. In NIPS*27, pp. 2861 2869, 2014.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J.-V., Stephan, D.-A., Nelson, S.-F., and Craig, D.-W. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLo S Genetics, 4:e100167, 2008.

Hsu, J., Gaboardi, M., Haeberlen, A., Khanna, S., Narayan, A., Pierce, B.-C., and Roth, A. Differential privacy: An economic method for choosing epsilon. In Proc. of the 27th IEEE CSFS, pp. 398 410, 2014.

Landau, S. Control use of data to protect privacy. Science, 347:504 506, 2015.

Nietfeld, J.-J., Sugarman, J., and Litton, J.-E. The biopin, a concept to improve biobanking. Nature Reviews Cancer, 11:303 308, 2011.

Nock, R. and Nielsen, F. On the efﬁcient minimization of classiﬁcation-calibrated surrogates. In NIPS*21, pp. 1201 1208, 2008.

Nock, R., Patrini, G., and Friedman, A. Rademacher observations, private data, and boosting. Co RR, abs/1502.02322, 2015.

Patrini, G., Nock, R., Rivera, P., and Caetano, T.-S. (Almost) no label no cry. In NIPS*27, 2014.

Quadrianto, N., Smola, A.-J., Caetano, T.-S., and Le, Q.- V. Estimating labels from label proportions. JMLR, 10: 2349 2374, 2009.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using conﬁdence-rated predictions. MLJ, 37:297 336, 1999.

Sproull, R.-F., Du Mouchel, W.-H., Kearns, M., Lampson, B.-W., Landau, S., Leiter, M.-E., Parker, E. Rindskopf, and Weinberger, P.-J. Bulk collection of signal intelligence: technical options. National Academies Press, 2015. Committee on responding to section 5(D) of Presidential Policy Directive 28: The Feasibility of Software to Provide Alternatives to Bulk Signals Intelligence Collection.