# Transfer Learning in Differential Privacy's Hybrid-Model

Refael Kohen¹, Or Sheffet¹

¹Faculty of Engineering, Bar-Ilan University, Israel. Correspondence to: Refael Kohen, Or Sheffet. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

**Abstract.** The hybrid-model (Avent et al., 2017) in Differential Privacy is an augmentation of the local-model where, in addition to N local-agents, we are assisted by one special agent who is in fact a curator holding the sensitive details of n additional individuals. Here we study the problem of machine learning in the hybrid-model where the n individuals in the curator's dataset are drawn from a different distribution than that of the general population (the local-agents). We give a general scheme, Subsample-Test-Reweigh, for this transfer-learning problem, which reduces any curator-model DP-learner to a hybrid-model learner in this setting, using iterative subsampling and reweighing of the n examples held by the curator, based on a smooth variation of the Multiplicative-Weights algorithm (introduced by Bun et al. (2020)). Our scheme has a sample complexity that depends on the χ²-divergence between the two distributions. We give worst-case bounds on the sample complexity required for our private reduction. Aiming to reduce said sample complexity, we give two specific instances where it can be drastically reduced (one instance is analyzed mathematically, the other empirically) and pose several directions for follow-up work.

## 1. Introduction

Differential privacy (DP) has become the modern era's de-facto gold standard for privacy-preserving data analysis. In particular, the local-model of DP, in which each user participates in a computation by sending randomized messages whose view yields at most ϵ privacy-loss, has gained much popularity due to its (relative) design simplicity. Much work has dealt with the problem of machine learning in the local-model of DP: assuming the N users' details are drawn i.i.d. from some distribution T, how can we design local-DP protocols for finding a hypothesis h of small loss w.r.t. T? However, as opposed to the curator-model of DP, in which all data is held by a trusted curator in charge of executing the computation, machine learning in the local-model of DP suffers from two drawbacks: (1) it has a much larger sample complexity, and (2) it is of limited learning capabilities, as only problems that are SQ-learnable are learnable in the local-model, in contrast to the curator-model that allows us to learn (almost) any PAC-learnable problem (Kasiviswanathan et al., 2008). In order to address these problems, there has been an extensive study in recent years of augmentations of DP's local-model. Most notably, the shuffle model has gained much focus (Bittau et al., 2017; Cheu et al., 2019; Balle et al., 2019; 2020; Ghazi et al., 2021a;b) as it drastically reduces the privacy-loss of local-model protocols. Yet the focus of this work is on a different, far less studied, augmentation of local-DP. We study the hybrid-model of differential privacy (Avent et al., 2017), in which a local-model computation protocol with N users is augmented by the aid of one special agent who is in fact a curator holding the sensitive details of n additional individuals.
(We refer to the former N users as the local-agents and to the latter n individuals as the curator-agents.) The hybrid-model captures a situation in which some users (the n curator-agents) trust a proposed curator and allow her unrestricted access to their sensitive details, while the remaining N users trust solely themselves and opt for the local-model. Indeed, the hybrid-model improves upon the learning capabilities of the local-model: the theoretical work of Beimel et al. (2020) proved that in the hybrid-model certain problems, which cannot be solved efficiently in the standard local-DP model, are efficiently and privately computable. Alas, in the analysis of Beimel et al. (2020), the n curator-agents come from the exact same distribution T as the remaining N local-agents.¹ In contrast, this work is motivated by a setting in which the n individuals voluntarily opt in to the curator-model. Coping with such a selection bias was posed by Beimel et al. (2020) as an acute open problem, since (quoting Beimel et al. (2020) verbatim:) "from a practical point of view, this ... is aligned with current industry practices, and the ... individuals willing to contribute via a curator can be employees, technology enthusiasts, or individuals recruited as alpha- or beta-testers of products..." (Merriman, Oct 7, 2014; Microsoft, Sep 15, 2017; Mozilla, June 4, 2019)

¹Such a situation may arise when an extrinsic powerful agency, e.g. the census, randomly samples n individuals and mandates that they provide the curator with their sensitive details.

In this work we model this selection bias as a particular type of transfer-learning problem. We no longer assume that the n curator-agents were drawn from the same distribution T as the local-agents, but rather that they were drawn from some different distribution S (the source distribution). Our goal, however, remains the same: to learn a good hypothesis w.r.t. T (the target distribution) while incurring at most ϵ privacy-loss for any of the agents involved in the computation. Naturally, should S and T be so different that they have disjoint supports, the n curator-agents provide no assistance to our transfer-learning problem. Thus, in our model S and T are of bounded χ²-divergence (see details in Section 2). In other words, we study a particular variation of a transfer-learning problem: in the process of finding h of small loss w.r.t. T, we are allowed to conduct a DP computation over a set of examples drawn from S and also to conduct a local-DP computation over N additional examples drawn from T. And so we ask:

Are there learning tasks that are infeasible in the local-model, yet can be computed privately and efficiently when we have DP curator-model access to samples drawn from a different distribution?

**Our Contribution.** Our transfer-learning problem has two naïve baselines. (1) Relying solely on the N local-agents and learning a small-loss hypothesis w.r.t. T via some off-the-shelf local-model protocol; this is infeasible for certain problems (e.g. PARITY (Kasiviswanathan et al., 2008)) and costly for others (e.g. sparse problems in a d-dimensional setting, where known sample-complexity bounds grow with the dimension d (Duchi et al., 2013; Smith et al., 2017)).
(2) Relying solely on the n curator-agents and learning a hypothesis via some off-the-shelf DP learning algorithm guaranteed to return a hypothesis of small loss over S, such as a loss upper bounded by err_S(h) ≤ α/D_∞(T‖S) or err_S(h) ≤ α²/(χ²(T‖S)+1) (assuming these divergences are finite; see definitions in Section 2).² This avenue is feasible only when a hypothesis h indeed exists whose error over S is as small as we require, but it fails to give a meaningful guarantee in an agnostic setting, even if the best hypothesis in H has error Θ(α).

²This follows from the well-known inequality stating that for any event E we have Pr_T[E] ≤ √((χ²(T‖S)+1)·Pr_S[E]).

Our work proposes a third technique, whose sample complexity for the N local-agents is independent of d, whose sample complexity for the n curator-agents depends on pdim(H) (so if |H| = poly(d), e.g. the Example in Section 4, then n = polylog(d)), and which may be feasible in the agnostic setting as well. Our proposed framework resembles Subsample & Aggregate (Nissim et al., 2007) in broad brushstrokes, except that instead of subsampling in parallel in order to convert a non-private mechanism into a private one, we subsample sequentially in order to convert a curator-model learning mechanism over S into a hybrid-model learner. The crux of our technique is that each time we subsample and produce a hypothesis of small loss w.r.t. S yet high loss w.r.t. T, we make a Multiplicative-Weights (MW)-based update step to our subsampling distribution. With each MW-update we transition closer to a target distribution which is based on the importance-sampling (IS) weights of the points drawn from S. We thus call our technique Subsample-Test-Reweigh.

For clarity, we partition our analysis into two parts. The first, detailed in Section 3, presents the main ideas of our technique while tabling the notion of privacy for a later section. Specifically, seeing as learning in the local-model is equivalent to SQ-learning and the vast majority of PAC-learnable problems are learnable in the curator-model, we pose the following model. Fix a labeling function ℓ which is PAC-learnable via a hypothesis class H of pdim d through some learning algorithm M. We are given n labeled examples drawn from S and only SQ-oracle access to T (see formal definitions in Section 2). So we iteratively (1) set a distribution µ over the n examples, (2) use M to learn a hypothesis h of small loss w.r.t. µ, (3) query the SQ-oracle to see if h's loss over T is sufficiently small (halt if yes), (4) if h has large loss, reweigh the distribution using the MW-algorithm and proceed to the next iteration. In Section 3 we prove that w.h.p. this algorithm outputs, within T = O(α⁻²) iterations, a hypothesis h of error err_T(h) = O(α), provided n = Ω̃(d·(χ²(T‖S)+1)·α⁻²). The crux of the proof lies in showing (w.h.p.) the existence of a particular distribution ū over the n samples drawn from S s.t. for every h ∈ H the loss of h w.r.t. ū and its loss w.r.t. T are close. Based on the seminal results of Cortes et al. (2010), it is not surprising that this ū is given by truncating the IS weights w(x) = P_T(x)/P_S(x) of the points x drawn from S.

Next, in Section 4 we give the privacy-preserving version of our MW-based technique from Section 3. Replacing the SQ-oracle calls with simple applications of the Randomized-Response mechanism is trivial, but maintaining the privacy of the n curator-model agents is far trickier. First, it is evident that our off-the-shelf learner M must now be a privacy-preserving PAC-learner.
More importantly, our MW-update step must also maintain bounded privacy-loss. Luckily, this latter point was already addressed by Bun et al. (2020), who use the notion of κ-dense distributions to give a MW-based algorithm in which any intermediate distribution µ is such that no single point has probability mass exceeding 1/(κn) (see details in Section 4). Using several known results regarding subsampling and privacy (Karwa & Vadhan, 2018; Bun et al., 2018), we end up with the following reduction. Denote by m₁ the sample complexity of some off-the-shelf curator-model learner that outputs a hypothesis of loss ≤ α under privacy-loss parameter ϵ = 1. In order to successfully apply the learner T times under our κ-dense MW-update subsampling scheme and have a total privacy-loss of ϵ, we require a sample of n curator-agents drawn from S, where n = Ω̃(m₁·√T/(ϵκ)) = Ω̃(m₁·(χ²(T‖S)+1)/(ϵα²)), and N = Ω̃(ϵ⁻²α⁻⁴) local-agents. While, admittedly, these bounds are less than ideal, this is the first result to prove the feasibility of private transfer learning using a poly-size sample, even for problems whose sample complexity in the local-model is exponential. In Section 5 we pose suggestions for reducing this bound as open problems, leveraging the fact that the worst-case bounds on either the number of iterations T or the density parameter κ may be drastically reduced for specific hypothesis classes / instances. We give two particular examples of such instances: one proven rigorously (for the case of PARITY under the uniform distribution) and one based on empirical evaluations of (the non-private version of) our transfer-learning technique in two high-dimensional Gaussian settings, in which our algorithm makes far fewer iterations than our O(α⁻²) worst-case upper bound.

### 1.1. Related Work

The bounds and limitations of private learning were first established by Kasiviswanathan et al. (2008), who proved that (roughly speaking) the learning capabilities of the local-model are equivalent to SQ-learning. This, together with the seminal result of Blum et al. (1994), gives an exponential lower bound on the number of local-agents required for learning PARITY in the local-model (fully formally proven in Beimel et al. (2020)). Other results regarding the power and limitations of the local-model were given in Beimel et al. (2008); Duchi et al. (2013); Bassily & Smith (2015); Smith et al. (2017); Duchi & Rogers (2019), culminating in Joseph et al. (2019). And yet, the classical SGD-algorithm is still applicable in the local-model (Smith et al., 2017). The two main works about the hybrid-model (Avent et al., 2017; Beimel et al., 2020) have already been discussed, as has the elegant private boosting paradigm of Bun et al. (2020). Other works have studied the applicability of the MW-algorithm for various tasks in DP (Hardt & Rothblum, 2010; Gupta et al., 2011). In the context of transfer learning, a few works tied differential privacy with multi-task learning (Gupta et al., 2016; Xie et al., 2017; Li et al., 2020), hypothesis testing (Wang et al., 2018) and a semi-supervised setting (Kumar, 2022); all of them focus on empirically measuring utility, and none gives a general framework with proven guarantees as this work does. Also, it could be interesting to implement a version of our algorithm which isn't private w.r.t. the samples from S, in a PATE-like setting (Papernot et al., 2018) where public (or partially public) datasets are also available.
The classic problem of transfer learning has been studied extensively since the 20th century, and has too long and too rich a history to be surveyed properly here. We thus mention a few recent works that achieved sample-complexity bounds based on importance-sampling (IS) weights and showed concentration bounds related to divergences between the source and the target distributions. Some works (Agapiou et al., 2017; Chatterjee & Diaconis, 2017) showed sample-complexity bounds based, in part, on exp(KL(T‖S)) ≤ χ²(T‖S)+1, as well as on other properties of the distribution³ which lack our desired sub-Gaussian behavior. The seminal result of Cortes et al. (2010) achieved a bound that is, in spirit, more similar to the desired sub-Gaussian behavior. They also proved that with sample-size bounds depending on χ²(T‖S) we can achieve accurate estimation of the loss of any hypothesis (simultaneously) in a finite-pdim hypothesis class H. Recently, Metelli et al. (2021) suggested a method of correcting the weights that allows obtaining sub-Gaussian behavior, assuming prior knowledge of the second moment of the weights. Maia Polo & Vicente (2022) studied the case where access to distribution T is restricted to unlabeled examples, and proposed methods for evaluating the IS weights. Yao & Doretto (2010) use a boosting algorithm (based on the MW-algorithm) for transfer learning, yet in their work the distributions S and T are identical and it is the feature sets of the classes that differ.

³Namely, the probability of drawing a point of large weight.

## 2. Preliminaries

**PAC- and SQ-Learning.** The seminal work of Valiant (1984) defined PAC-learnability, where the learner has access to poly-many examples drawn from a given distribution T over a domain X, labeled by some ℓ, and its goal is to approximate ℓ as well as the best hypothesis in some given hypothesis class H. We measure prediction success using some loss function L, where for every x ∈ X, h ∈ H and any labeling function ℓ it holds that L(h(x), ℓ(x)) ∈ [0, 1]; for any distribution T we denote err_T(h) = E_{x∼T}[L(h(x), ℓ(x))]. Thus, an (α, β)-PAC-learner outputs w.p. ≥ 1−β a hypothesis h ∈ H with err_T(h) ≤ α_H + α, where α_H ≝ min_{h∈H} err_T(h). The learner's ability to draw n i.i.d. examples is simulated via an example-oracle which, upon a query, returns such a labeled example. Thus, in the PAC-model, a learner has full access to all the details of any drawn example. In contrast, the Statistical Query (SQ) model (Kearns, 1998) restricts the algorithm to viewing solely statistical properties of the distribution T and the labeling function ℓ. This is simulated via access to a SQ_τ-oracle which, upon a statistical query ϕ : X × R → [0, 1], returns an estimation of E_{x∼T}[ϕ(x, ℓ(x))] up to a tolerance parameter τ (polynomially bounded away from 0).

**Differential Privacy.** Given a domain X, two multi-sets I, I′ ∈ Xⁿ are called neighbors if they differ on a single entry. An algorithm (alternatively, mechanism) M is said to be (ϵ, δ)-differentially private (DP) (Dwork et al., 2006b;a) if for any two neighboring I, I′ and any set S of possible outputs we have Pr[M(I) ∈ S] ≤ e^ϵ·Pr[M(I′) ∈ S] + δ. The Randomized-Response mechanism (Warner, 1965; Kasiviswanathan et al., 2008) is one of the classic (ϵ, δ)-DP mechanisms in the local-model. Given privacy parameters ϵ, δ, on an input b ∈ [0, 1] it outputs RR_{ϵ,δ}(b) ∼ N(b, 2ln(2/δ)/ϵ²). When applied to N i.i.d. draws from a Bernoulli r.v. of mean µ, we estimate µ by θ = (1/N)·Σᵢ RR_{ϵ,δ}(bᵢ). A standard application of the Hoeffding bound and Gaussian concentration bounds shows that applying the mechanism to N(α, β, ϵ, δ) = 4ln(2/δ)ln(4/β)/(ϵ²α²) users yields Pr[|θ − µ| ≤ α] ≥ 1 − β.
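To make the preliminaries concrete, here is a minimal sketch of the Gaussian-noise Randomized-Response mechanism and the resulting mean estimator. This is our illustrative rendering, not the authors' code; the function names are our own.

```python
import numpy as np

def rr(b, eps, delta, rng):
    """Gaussian-noise randomized response: release b in [0,1] with
    noise variance 2*ln(2/delta)/eps^2, as defined in Section 2."""
    sigma = np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    return b + rng.normal(0.0, sigma)

def estimate_mean(bits, eps, delta, rng):
    """Each local-agent privatizes her own bit; the aggregator averages."""
    return np.mean([rr(b, eps, delta, rng) for b in bits])

rng = np.random.default_rng(0)
bits = rng.binomial(1, 0.3, size=20_000)   # i.i.d. Bernoulli(0.3) bits
print(estimate_mean(bits, eps=1.0, delta=1e-5, rng=rng))  # close to 0.3
```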
**Differentially Private Machine Learning** is by now too large a field to be surveyed properly. It was formally initiated by Kasiviswanathan et al. (2008) who, as discussed above, proved that the set of hypothesis classes which are SQ-learnable is essentially equivalent to the set of hypothesis classes learnable in the local-model of DP. It is also worth mentioning the DP techniques for Empirical Risk Minimization, and especially private SGD (Chaudhuri et al., 2011; Bassily et al., 2014), which we use as our private learner in Appendix A.

**Divergence Between Distributions.** Given two distributions S and T over the same domain X that have a Radon-Nikodym derivative, we refer to said derivative at a point x ∈ X as the importance-sampling (IS) weight, w(x) = P_T(x)/P_S(x). Given a convex function f : (0, ∞) → R with f(1) = 0, the f-divergence between two distributions T and S is D_f(T‖S) = E_{x∼S}[f(w(x))]. In the specific case f(x) = ½|x−1| we obtain the total-variation distance, denoted TV(T, S); when f(x) = x·log(x) we obtain the Kullback-Leibler (KL) divergence, denoted KL(T‖S); and when f(x) = x²−1 we obtain the χ²-divergence, denoted χ²(T‖S). Thus, a finite χ²-divergence between distributions implies a finite second moment for w(x). It is indeed quite simple to see that E_{x∼T}[w(x)] = E_{x∼S}[w(x)²] = χ²(T‖S)+1. Moreover, a well-known result (Csiszár & Shields, 2004) states that for any twice-differentiable f it holds that D_f(T‖S) ≤ (f″(1)/2)·χ²(T‖S) (as follows from f's Taylor series). In the context of transfer learning, Cortes et al. (2010) proved that weighing all examples in a sufficiently large sample drawn from S according to their IS weights gives an accurate estimation of the loss, over T, of any hypothesis in a finite-pdim hypothesis class H, where their sample-size bounds depend on χ²(T‖S). Another well-used notion of divergence is the α-Rényi divergence between T and S, defined as D_α(T‖S) = (1/(α−1))·ln(E_{x∼S}[w(x)^α]) for any α > 1, which was also used to define related notions of privacy (Bun & Steinke, 2016; Mironov, 2017; Bun et al., 2018). Note that D₂(T‖S) = ln(χ²(T‖S)+1). We also denote D_∞(T‖S) = sup_{x∈X} w(x).
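The moment identity above is only asserted in the text; for completeness, here is the one-line change-of-measure verification:

$$
\mathbb{E}_{x\sim T}[w(x)] \;=\; \int_X \frac{P_T(x)}{P_S(x)}\,P_T(x)\,dx \;=\; \int_X w(x)^2\,P_S(x)\,dx \;=\; \mathbb{E}_{x\sim S}[w(x)^2] \;=\; \mathbb{E}_{x\sim S}[w(x)^2-1]+1 \;=\; \chi^2(T\|S)+1,
$$

where the last step uses $f(x)=x^2-1$ and $\mathbb{E}_{x\sim S}[1]=1$.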
**Bernstein's Inequality.** In our work we use several standard concentration bounds (the Markov, Chernoff, and Hoeffding inequalities), and also the slightly less familiar inequality of Bernstein (1954): let {Xᵢ}ⁿᵢ₌₁ be independent zero-mean random variables, and suppose that |Xᵢ| ≤ M almost surely for all i. Then for any positive t,

$$\Pr\Big[\sum_i X_i > t\Big] \;\le\; \exp\Big(\frac{-t^2/2}{\sum_i \mathbb{E}[X_i^2] + Mt/3}\Big).$$

## 3. A Non-Private Model

For the sake of clarity, we introduce our algorithm in stages, where at first we disregard the privacy aspect of the problem. In this section we deal with a specific transfer-learning model, whose details are as follows. We are given a known domain X and some unknown labeling function ℓ : X → R. We also have oracle access to two unknown distributions S and T over X: an example-oracle Ex for S, which upon a query returns an example x drawn from S and labeled by ℓ; and a statistical-query oracle SQ_τ for T, which upon a query ϕ : X × R → [0, 1] returns an answer in the range E_{x∼T}[ϕ(x, ℓ(x))] ± τ. Our goal is to learn ℓ through some hypothesis class H with pdim(H) = d; i.e., to find a hypothesis h whose loss w.r.t. ℓ over T is comparable to the loss of the best hypothesis in H.

Formally, we introduce our algorithm in the realizable setting: given a parameter α > 0 and a loss function L : R² → [0, 1], our goal is to find h such that err_T(h) = O(α). (We later discuss the extension to the agnostic case.) In addition, we are given an algorithm M which is an (α, β)-PAC-learner for our hypothesis class H. Namely, given α, β > 0, our learner M takes a sample of m(α, β) i.i.d. examples drawn from some distribution µ and labeled by some ℓ, and w.p. ≥ 1−β returns a function h ∈ H whose loss is upper bounded by err_µ(h) ≤ α. So, had we been able to draw i.i.d. examples from T, we would have just fed them to the algorithm and produced a good hypothesis h. Alas, in our model we only have statistical-query access to T via SQ_τ, with tolerance τ. In this section we propose a general reduction from a PAC-learner over S to finding a good hypothesis w.r.t. T, which depends solely on χ²(T‖S) ≝ χ². We argue that Algorithm 1 achieves our goal within T = O(log((χ²+1)/α)/α²) iterations.

**Algorithm 1: Non-Private Subsample-Test-Reweigh**
- Input: parameters 0 < α, β < 1/8.
- Draw n ≥ (800(χ²+1)·log(1/α)/α²)·(d·log(400d/α³) + log(8/β)) labeled points i.i.d. from S, denoted x₁, x₂, ..., xₙ.
- Set weight w¹ᵢ ← 1 for each 1 ≤ i ≤ n. Set T ← 32·log₂(8(χ²+1)/α)/α².
- For t = 1, 2, 3, ..., T:
  - Set µᵗ as the distribution where µᵗᵢ ∝ wᵗᵢ.
  - Draw m(α, β/T) examples i.i.d. from µᵗ; apply M to the drawn points and obtain a function hᵗ.
  - Set aᵗ as the reply of the SQ_τ-oracle to the query ϕᵗ(x, ℓ(x)) = L(hᵗ(x), ℓ(x)).
  - If aᵗ > 2α + τ + α_H: for every i, set wᵗ⁺¹ᵢ ← wᵗᵢ·exp(−(α/8)·[1 − L(hᵗ(xᵢ), ℓ(xᵢ))]).
  - Else: return hᵗ (and halt).

Note that Algorithm 1 is presented for the realizable case. If we deal with an agnostic case, namely where α_H > 0, then our algorithm iterates as long as err_{µᵗ}(hᵗ) < α_H + α (a condition which ought to hold when we begin with h¹, a good hypothesis w.r.t. S for which err_S(h¹) ≈ err_T(h¹)). It ought to be clear that by returning the hᵗ with the smallest error estimate given by the SQ-oracle over all T iterations, a hypothesis h ∈ H of small loss w.r.t. T is obtained.
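The following Python sketch mirrors Algorithm 1; it is our illustrative rendering, not the authors' code. `learner` stands for the PAC-learner M and `sq_oracle` for the SQ_τ-oracle, both assumed to be supplied by the caller.

```python
import numpy as np

def subsample_test_reweigh(xs, ys, learner, sq_oracle, loss,
                           alpha, n_rounds, m, rng):
    """Non-private Subsample-Test-Reweigh (Algorithm 1, sketch).
    xs, ys: numpy arrays holding the n examples drawn from S and labels.
    learner(batch_x, batch_y) -> hypothesis h (callable on x).
    sq_oracle(phi) -> estimate of E_{x~T}[phi(x, l(x))] up to tolerance tau.
    """
    n = len(xs)
    w = np.ones(n)                      # MW weights over the n examples
    for _ in range(n_rounds):
        mu = w / w.sum()                # current subsampling distribution
        idx = rng.choice(n, size=m, p=mu)
        h = learner(xs[idx], ys[idx])   # small loss w.r.t. mu, w.h.p.
        a = sq_oracle(lambda x, y: loss(h(x), y))
        if a <= 2 * alpha:              # test: small loss w.r.t. T?
            return h
        # reweigh: cost m_i = 1 - loss shrinks points h already fits
        costs = 1.0 - np.array([loss(h(x), y) for x, y in zip(xs, ys)])
        w *= np.exp(-(alpha / 8.0) * costs)
    return None                         # no small-loss hypothesis found
```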
**Theorem 3.1.** In the above-described setting, w.p. ≥ 1−2β, Algorithm 1 halts and outputs a hypothesis h with err_T(h) ≤ 2α + 2τ.

The proof of Theorem 3.1 relies on the following lemma, which builds on and extends a theorem of Cortes et al. (2010).

**Lemma 3.2.** Fix 0 < α, β ≤ 1/8 and two distributions S and T whose χ²-divergence is χ²(T‖S) = χ². Then, if n = Ω((χ²+1)·(log(1/α)/α²)·(d·log(d/α) + log(1/β))), w.p. ≥ 1−β there exists a distribution ū over the n drawn points such that (i) its divergence to the uniform distribution over the n drawn points, U[n], satisfies KL(ū‖U[n]) ≤ log₂(8(χ²+1)/α), and (ii) for every h ∈ H, |err_ū(h) − err_T(h)| ≤ 5α/8.

*Proof of Theorem 3.1.* Based on Lemma 3.2, we simply apply the characterization of the MW-algorithm from the seminal work of Arora et al. (2012). Setting the cost of example i w.r.t. hypothesis hᵗ as mᵗᵢ = 1 − L(hᵗ(xᵢ), ℓ(xᵢ)), applying the MW-algorithm with costs m¹, ..., m^T and update rate η = α/8 guarantees that for any fixed distribution ν:

$$\sum_t \mathbb{E}_{i\sim\mu^t}[m^t_i] \;\le\; \sum_t \mathbb{E}_{i\sim\nu}[m^t_i] + \frac{\alpha T}{8} + \frac{8\,\mathrm{KL}(\nu\|U[n])}{\alpha}.$$

Note that w.p. ≥ 1−β our algorithm M returns a hypothesis with err_{µᵗ}(hᵗ) ≤ α in all T iterations. In contrast, each MW-update happens only when err_T(hᵗ) ≥ (2α+τ) − τ = 2α; and so, w.p. ≥ 1−β, we have the distribution ν = ū given by Lemma 3.2, which we plug into the above bound to get

$$\sum_t (1-\alpha) \;\le\; \sum_t \big[1 - (2\alpha - \tfrac{5\alpha}{8})\big] + \frac{\alpha T}{8} + \frac{8\log_2(8(\chi^2+1)/\alpha)}{\alpha}.$$

Rearranging, we get 3αT/8 ≤ αT/8 + 8·log₂(8(χ²+1)/α)/α. Hence, w.p. ≥ 1−2β, Algorithm 1 halts within some iteration t* ≤ 32·log₂(8(χ²+1)/α)/α², at which point err_T(h^{t*}) ≤ (2α+τ) + τ = 2α + 2τ. ∎

*Proof of Lemma 3.2.* Let w : X → R₊ be the function defined by w(x) = P_T(x)/P_S(x), where P_T and P_S denote the PDFs of T and S resp. We call w(x) the weight of x. Let u(x) = min(w(x), 4(χ²+1)/α) be the truncated weight of x. Recall that E_{x∼S}[w(x)] = 1 and that E_{x∼S}[w(x)²] = E_{x∼T}[w(x)] = χ²+1. Thus, since for every x ∈ X it holds that 0 ≤ u(x) ≤ w(x), we have E_{x∼S}[u(x)] ≤ 1 and E_{x∼S}[u(x)²] ≤ E_{x∼T}[u(x)] ≤ χ²+1. Denoting A = {x ∈ X : w(x) > 4(χ²+1)/α}, we can apply Markov's inequality (over T) to infer that Pr_{x∼T}[x ∈ A] = E_{x∼T}[1_A(x)] ≤ α/4, with 1_A(x) denoting the indicator of whether x ∈ A or not. And so:

$$\mathbb{E}_{x\sim S}[w(x)-u(x)] = \mathbb{E}_{x\sim S}[(w(x)-u(x))\mathbf{1}_A(x)] \le \int_X w(x)\mathbf{1}_A(x)P_S(x)\,dx = \int_X \mathbf{1}_A(x)P_T(x)\,dx = \mathbb{E}_{x\sim T}[\mathbf{1}_A(x)] \le \alpha/4,$$

which implies that E_{x∼S}[u(x)] ≥ 1 − α/4. We continue to bounding U = Σⁿᵢ₌₁ u(xᵢ) for the n points x₁, x₂, ..., xₙ in our sample taken i.i.d. from S, provided n ≥ 150(χ²+1)·ln(4/β)/α². To that end, we apply the Bernstein inequality. First, since E[u(xᵢ)] ≤ 1 and |u(xᵢ)| ≤ 4(χ²+1)/α,

$$\Pr[U > (1+\tfrac{\alpha}{8})n] \le \Pr\Big[U - \sum_i \mathbb{E}[u(x_i)] > \tfrac{\alpha n}{8}\Big] \le \exp\Big(\frac{-\alpha^2 n^2/128}{\sum_i \mathrm{Var}[u(x_i)] + \tfrac{4(\chi^2+1)}{\alpha}\cdot\tfrac{\alpha n}{24}}\Big) \le \exp\Big(\frac{-\alpha^2 n}{150(\chi^2+1)}\Big) \le \beta/4,$$

and similarly we can prove Pr[(1−α/4)n − U > αn/8] ≤ β/4. Hence, w.p. ≥ 1−β/2 it holds that (1−3α/8)n ≤ U ≤ (1+α/8)n. Now, setting ū as the distribution where point i is sampled w.p. u(xᵢ)/U, we can use the above bounds on U to infer that

$$\mathrm{KL}(\bar u\|U[n]) = \sum_i \frac{u(x_i)}{U}\log_2\Big(\frac{n\,u(x_i)}{U}\Big) \le \log_2\Big(\frac{4(\chi^2+1)/\alpha}{1-3\alpha/8}\Big) \le \log_2\Big(\frac{8(\chi^2+1)}{\alpha}\Big),$$

proving the first part of the claim. As for the second part of the claim, we simply use the result of Cortes et al. (2010), which shows universal convergence w.r.t. any unnormalized weight function. Namely, setting α′ = α/√(50(χ²+1)·log(1/α)),⁴ Theorem 8 of Cortes et al. (2010) gives, for any hypothesis class H of pdim(H) = d,

$$\Pr\left[\sup_{h\in H}\frac{\mathbb{E}_{x\sim S}[u(x)L(h(x),\ell(x))] - \frac{1}{n}\sum_i u(x_i)L(h(x_i),\ell(x_i))}{\sqrt{\mathbb{E}_{x\sim S}[u^2(x)L^2(h(x),\ell(x))]}} > \alpha'\right] \le 4\exp\Big(d\log\tfrac{2en}{d} - \tfrac{n\alpha^2}{200(\chi^2+1)\log(1/\alpha)}\Big) \;\;(*)$$

which is at most β/2 when n ≥ (800(χ²+1)·log(1/α)/α²)·(d·log(400d/α³) + log(8/β)). Note that (∗) is taken verbatim from Cortes et al. (2010), Theorem 8. Since E_{x∼S}[u²(x)L²(h(x),ℓ(x))] ≤ E_{x∼S}[w²(x)] = χ²+1, a normalized deviation of α′ translates to an additive deviation of at most α′·√(χ²+1) ≤ α/4 between E_{x∼S}[u(x)L(h(x),ℓ(x))] and (1/n)·Σᵢ u(xᵢ)L(h(xᵢ),ℓ(xᵢ)), simultaneously for all h ∈ H.

⁴Which satisfies α/4 > α′·√((2+ln(1/α′))(χ²+1)) = α′·√((2+ln(1/α′))·E_{x∼S}[w²(x)]) when α ≤ 1/8.

Thus, w.p. ≥ 1−β both the above bounds on U hold and we have that for any h ∈ H:

$$\mathrm{err}_T(h) = \mathbb{E}_{x\sim S}[w(x)L(h(x),\ell(x))] \le \mathbb{E}_{x\sim S}[u(x)L(h(x),\ell(x))] + \mathbb{E}_{x\sim S}[w(x)-u(x)] \le \frac{1}{n}\sum_i u(x_i)L(h(x_i),\ell(x_i)) + \frac{\alpha}{4} + \frac{\alpha}{4} = \frac{U}{n}\,\mathrm{err}_{\bar u}(h) + \frac{\alpha}{2} \le (1+\tfrac{\alpha}{8})\,\mathrm{err}_{\bar u}(h) + \frac{\alpha}{2} \le \mathrm{err}_{\bar u}(h) + \frac{5\alpha}{8},$$

and the reverse inequality, err_ū(h) ≤ err_T(h) + 5α/8, follows from a symmetric argument, concluding the proof. ∎
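To illustrate the distribution ū constructed in the proof, here is a small sketch of ours. It assumes the densities of S and T are known in closed form (which the algorithm itself never requires); the χ² value is computed via the paper's own formula for Gaussians with one shrunken coordinate.

```python
import numpy as np
from scipy import stats

sigma, alpha = 0.9, 0.05
# chi^2(T||S)+1 for 1-d Gaussians N(0, sigma^2) vs N(0, 1)
# (the paper's closed-form expression with k = 1 shrunken coordinate)
chi2 = (1.0 / (sigma**2 * (2 - sigma**2)))**0.5 - 1.0
cap = 4 * (chi2 + 1) / alpha                 # truncation threshold
S, T = stats.norm(0, 1.0), stats.norm(0, sigma)

xs = S.rvs(size=10_000, random_state=0)      # n points drawn from S
w = T.pdf(xs) / S.pdf(xs)                    # IS weights w(x) = P_T/P_S
u = np.minimum(w, cap)                       # truncated weights u(x)
u_bar = u / u.sum()                          # the distribution \bar{u}
kl = np.sum(u_bar * np.log2(u_bar * len(xs)))   # KL(\bar{u}||U[n]), in bits
print(kl, "<=", np.log2(8 * (chi2 + 1) / alpha))
```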
## 4. The Private Boosting Paradigm

We now turn our attention to the full hybrid-model, and to the need to privatize Algorithm 1. Designing a private version of Algorithm 1 requires multiple changes. The first, and perhaps easiest, is that instead of SQ_τ-oracle access to T we estimate the error of hᵗ using RR (in the local-model). Assuming our algorithm makes at most T iterations, a standard argument shows that each such query requires Ω(log(T/β)/(ϵ²α²)) users so that w.p. ≥ 1−β/T we get an α-estimation of err_T(hᵗ); so the number of local-users required for our paradigm is Ω(T·log(T/β)/(ϵ²α²)) = Ω(log((χ²+1)/α)·log(log((χ²+1)/α)/(αβ))/(ϵ²α⁴)).

Second, also a rather straightforward change, we need to replace the learning mechanism M with a privacy-preserving learning mechanism, whose sample complexity depends also on the privacy-loss parameter(s). This implies that the sample complexity of M is a function of 4 parameters, m = m(α, β, ϵ, δ).⁵ The question of setting the privacy parameters of M, and thereby inferring its sample complexity, is discussed momentarily.

⁵For brevity, we use (ϵ, δ) as the privacy parameters, even if M is a (possibly truncated) zCDP- or Rényi-DP-mechanism.

Lastly, the more challenging aspect of the problem is maintaining the privacy of the n examples drawn from S. To that end, we rely on the MW-variant of Bun et al. (2020), which in turn requires that we introduce one more parameter, namely κ, into the problem.

**Definition 4.1.** Fix some 0 < κ < 1. Given n points and a set of weights w₁, w₂, ..., wₙ ≥ 0, denote w_avg = (1/n)·Σᵢ wᵢ and w_max = maxᵢ{wᵢ}. We say that the distribution induced by these weights, namely the distribution where µᵢ ∝ wᵢ, is κ-dense if κ·w_max ≤ w_avg.

From a privacy standpoint, it is clear why a dense distribution is a desired trait: it ensures no single point has probability mass exceeding 1/(κn), so in a random sample of m draws from µ we expect each point to be drawn at most m/(κn) times. Bun et al. (2020) showed that for any set of weights w̄ = (w₁, w₂, ..., wₙ) where ∀i, wᵢ ∈ (0, κ], by projecting

$$\Pi_\kappa(\bar w) = \big(\min\{c\cdot w_1, 1\},\, \min\{c\cdot w_2, 1\},\, \ldots,\, \min\{c\cdot w_n, 1\}\big) \tag{2}$$

for the smallest c s.t. ‖Π_κ(w̄)‖₁ = κn, we obtain a set of weights whose induced distribution µ = Π_κ(w̄)/‖Π_κ(w̄)‖₁ is κ-dense.
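A minimal sketch of this projection (our code; Bun et al. (2020) do not prescribe a specific root-finding routine), using bisection over the scaling constant c:

```python
import numpy as np

def project_kappa_dense(w, kappa, tol=1e-12):
    """Project weights w (each in (0, kappa]) onto a kappa-dense
    distribution: scale by the smallest c with sum(min(c*w_i, 1)) equal
    to kappa*n, then normalize (Eq. (2))."""
    n = len(w)
    target = kappa * n
    lo, hi = 0.0, 1.0
    while np.minimum(hi * w, 1.0).sum() < target:   # grow until c is bracketed
        hi *= 2.0
    while hi - lo > tol:                            # bisection on c
        mid = (lo + hi) / 2.0
        if np.minimum(mid * w, 1.0).sum() < target:
            lo = mid
        else:
            hi = mid
    pi = np.minimum(hi * w, 1.0)
    return pi / pi.sum()                            # kappa-dense distribution

mu = project_kappa_dense(
    np.random.default_rng(1).uniform(0.0001, 0.01, 1000), kappa=0.01)
print(mu.max() * len(mu) * 0.01 <= 1 + 1e-6)        # mu_max <= 1/(kappa*n)
```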
As for the privacy of the curator-agents, by setting m1 = m(α, β/T, 1, κδ 8e T ), Claim 4.1 asserts that in each iteration we are (ϵ , δ )-DP for ϵ = 6 1 m1 8T ln(2/δ) < 1 and δ 4m1e1 2T δ 2T . Applying the Advanced Composition theorem of (Dwork et al., 2010) we get that in all T iterations together we are (ϵ, δ)-DP w.r.t. to each of the n curator-agents. 6Loosely speaking, a parameter under which we typically get privacy for free. Algorithm 2 Private Subsample-Test-Reweigh Input: parameters 0 < α, β < 1/8, 0 < ϵ, δ. A (ϵ, δ)-DP learning algorithm M of sample complexity m(α, β, ϵ, δ). Set κ α 8(χ2+1), T 128 log2( 8(χ2+1) α ) α2 , N0 4 ln(2/δ) ln(8T/β) Draw a sample of n m(α, β 288T ln(2/δ) ϵκ points i.i.d. from S, denoted x1, x2, ..., xn, all labeled by ℓ. Similarly draw N0 T local-users from T. Set weight w1 i κ for each 1 i n. for (t = 1, 2, 3, ..., T) do Set µt as a distribution where µt = Πκ( w) Πκ( w 1 . Apply M to a sample of m1 = m(α, β 2T ) examples drawn i.i.d. from µt, to obtain some hypo. ht. Pick arbitrarily a new batch B of N0 local-users, and set at 1 N0 P x B RRϵ,δ L(ht(x), ℓ(x)) . if (at > 3α) then i set wt+1 i wt i exp( α/8 [1 L(ht(xi), ℓ(xi))]) else return ht (and halt) end if end for Theorem 4.3. W.p. 1 3β, Algorithm 2 returns an hypothesis h such that err T (h) 4α + αH, provided the number of curator-agents is n = Ω((χ2 + 1) log(1/α) α2 (d log( d α) + log( 1 β ))), and the number of local-agents is N = N0T = Ω( log((χ2+1)/α) log(log(χ2+1)/αβ) Proof. Again, the proof relies on the characterization of the utility of the MW-mechanism w.r.t. k-dense set of weights. Again, let w1 be the initial weights vector where w1 i = κ for all i. Let w [0, 1]n be any fixed set of weights which is κ-dense, and denote ˆw = w w 1 as its induced distribution. Then, we have that for any sequence of costs m1, m2, ..., mt, Bun et al. (2020) proved that this MW-mechanism with learning rate of η = α 8 guarantees that t E i µt[mt i] X t E i ˆ w[mt i] + αT 1 κnϕ( w , w1) where ϕ( w , w1) is the Bregman divergence induced by the entropy function, namely ϕ( w , w1) = X i w i log(w i w1 i ) w i + w1 i As w we aim to use the weights which Lemma 3.2 guarantees whose nice properties hold w.p. 1 β. Thus, we set for each xi drawn from S the weight w i = u(xi) 4(χ2+1)/α 1. As Lemma 3.2 asserts, it holds that U = P u(xi) Transfer Learning In Differential Privacy s Hybrid-Model lies in the range [(1 3α/8)n, (1 + α/8)n], thus wavg = 1 n P αu(xi) 4(χ2+1) α(1 3α/8) 4(χ2+1) . Thus, by setting κ = α 8(χ2+1) we have that wavg κ 1 κwmax, implying w is indeed κ-dense. Also, the same bounds on U imply that ϕ( w , w1) = X i 2κu(xi) log(2κu(xi) κ ) 2κu(xi) + κ i (2u(xi) log(2u(xi)) 2u(xi) + 1) κ n 2U + 2 X i u(xi) log(8(χ2+1)/α) 2κ log(8(χ2+1)/α) U 4κn log(8(χ2+1)/α) The remainder of the proof is as in Theorem 3.1. In each iteration it must hold that errµt(ht) α; whereas at 3α, which w.p. 1 β using classic utility bounds for the Randomized Response mechanism implies that err T(ht) 2α. Plugging this bound to Equation (3) stating the regret of the κ-dense MW-algorithm we get 8 + 4 log2( 8(χ2+1) Rearranging yields the bound 3αT 8 + 32 log2( 8(χ2+1) which implies that we halt in T 128 log2( 8(χ2+1) α ) α2 iterations. Again, upon halting at 3α so it holds err T(ht) 4α. Thus in a realizable setting, given a finite H, we can set M as the exponential mechanism over H returns w.p. 1 β an hypothesis of error Θ(α) when the sample size it at least Ω(ln(|H|)/αϵ). We thus obtain the following corollary. Corollary 4.4. 
Thus, in a realizable setting, given a finite H, we can set M as the exponential mechanism over H, which returns w.p. ≥ 1−β a hypothesis of error Θ(α) when the sample size is at least Ω(ln(|H|)/(αϵ)). We thus obtain the following corollary.

**Corollary 4.4.** For any S, T with bounded χ²-divergence, there exists an (ϵ, δ)-DP hybrid-model learner for a finite H in the realizable case which returns w.p. ≥ 1−β a hypothesis h ∈ H with err_T(h) = Θ(α), provided that we have n = Ω((χ²+1)·ln(|H|/β)/(α³ϵ)) curator-agents drawn from S and N = Ω(ln((χ²+1)/α)/(ϵ²α⁴)) local-agents drawn from T.

**Example: Sparse Hypotheses.** It is worth noting that our sample-complexity bounds are independent of the dimension d (assuming the domain X ⊆ R^d). So consider a specific case dealing with an s-sparse problem over a high-dimensional domain where s ≪ d; say, H is the class of linear separators that use no more than s = O(1) of the d features of each example. This implies that |H| = d^{O(s)} = poly(d). We thus obtain s·polylog(d) sample-complexity bounds (setting all other parameters to be reasonable constants), whereas learning solely over T requires a sample complexity that grows with d (see Duchi et al. (2013)).

**Private SGD.** The specific case where the private learning mechanism M is SGD is discussed in Appendix A. This case is discussed separately for two main reasons. First, its algorithmic presentation is slightly different from that of Algorithm 2, since in each iteration it makes successive draws from µᵗ rather than a single draw of a subsample. Secondly, this is the one canonical case where L is continuous (rather than binary), and so the subject of scaling comes into play: whereas thus far we assumed L is bounded by 1, it is straightforward to see that an L-Lipschitz convex loss function over a convex set of diameter D takes values in [0, LD], and thus our goal is now to obtain a loss ≤ α (rather than ≤ α·LD). Lastly, aiming for a tight analysis, we use the notion of (ρ, ω)-tCDP (Bun et al., 2018), which is then converted into an (ϵ, δ)-DP guarantee. Nonetheless, the sample-complexity bounds we get are similar to those of Algorithm 2.

## 5. On Reducing Sample-Complexity Bounds

While our work is the first to show the feasibility of transfer learning using a poly-size sample, its sample-complexity bound for the curator-agents is a large multiplicative factor, O(ϵ⁻¹α⁻²), over the sample complexity of a (non-transfer) curator-model learner. We believe it is possible to significantly improve said bound, at least for particular instances. Here we provide two specific instances where the number of required iterations until convergence of our algorithm is indeed o(α⁻²), leading to a much smaller sample complexity.

**Transfer Learning for PARITY under the Uniform Distribution.** Consider the domain X = {0,1}^d and the class of PARITY functions, where for any S ⊆ [d] we have c_S(x) = ⊕_{i∈S} xᵢ. It is a well-known result that under the uniform distribution the PARITY class cannot be learned in the local-model unless the number of local-agents is N = exp(d), yet a sample of size n = Θ(d/(ϵα)) suffices to learn PARITY under any distribution in the curator-model (Kasiviswanathan et al., 2008). In Appendix B we show that in the hybrid-model one can learn PARITY in a single iteration, provided S and the uniform distribution T have polynomial χ²-divergence. The crux of the proof lies in showing that w.h.p. a sufficiently large sample drawn from S contains d linearly independent vectors over F₂^d. This implies that the private curator-model learner for PARITY (Kasiviswanathan et al., 2008), which leverages (multiple) Gaussian eliminations over F₂^d, returns w.h.p. the true labeling function in the PARITY hypothesis class (see the sketch below).
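A small sketch of ours of the single-iteration idea: once the sample contains d linearly independent points over F₂, solving the linear system X·c = ℓ(X) by Gaussian elimination recovers the parity set exactly. The actual learner of Kasiviswanathan et al. (2008) is a private variant of this; the code below is only the non-private core.

```python
import numpy as np

def solve_parity(X, y):
    """Recover the parity vector c from labeled samples over F_2 by
    Gaussian elimination on the augmented system [X | y] (mod 2).
    Returns None if X does not contain d independent rows."""
    X, y = X.copy() % 2, y.copy() % 2
    n, d = X.shape
    row = 0
    for col in range(d):                       # forward elimination mod 2
        piv = next((r for r in range(row, n) if X[r, col]), None)
        if piv is None:
            return None                        # rank-deficient sample
        X[[row, piv]], y[[row, piv]] = X[[piv, row]], y[[piv, row]]
        for r in range(n):
            if r != row and X[r, col]:
                X[r] ^= X[row]; y[r] ^= y[row]
        row += 1
    return y[:d]                               # c such that X @ c = y mod 2

rng = np.random.default_rng(0)
d = 30
c = rng.integers(0, 2, d)                      # hidden parity set
X = rng.integers(0, 2, (200, d))               # sample (here: uniform "S")
y = (X @ c) % 2
print(np.array_equal(solve_parity(X, y), c))   # True w.h.p.
```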
**Empirical Experiments: When Both S and T are Gaussians.** Next, we show empirically that in other settings, both the number of required iterations until convergence and our sample-complexity bounds are far greater than required. First, we consider a setting where S is a simple spherical Gaussian in d = 500 dimensions, S = N(0, I_d), whereas for T we picked an arbitrary set of k = 10 coordinates and set the standard deviation on these k to σ = 0.02, while the remaining d−k coordinates have standard deviation 1; i.e., T = N(0, I_{d−k} ⊕ σ²I_k). A simple calculation shows that χ²(T‖S)+1 = (1/(σ²(2−σ²)))^{k/2} > 3·10¹⁵. Our target hyperplane separator is set such that its only non-zero coordinates are the k ones on which S and T have different variance, and such that it classifies precisely an α = 0.01 fraction of the mass of T as 1. Testing this setting with the non-private version⁷ (Algorithm 1) over n ≈ 90,000 samples, we obtain a hypothesis of error ≤ 2α in all t = 50 repetitions of our experiment. Moreover, our iterative algorithm runs for only T ≈ 1200-1300 iterations, far below the α⁻² upper bound. The results appear in Figures 1a and 1b in Appendix C. Second, we consider a setting where the true hypothesis corresponds to a k-dimensional ball over the k coordinates that are more concentrated in T than in S, and we ran both the private and the non-private versions of our algorithm. While the non-private version converged very fast, the private version required a very large sample of curator-agents, and so we were able to conduct only preliminary experiments with it. The experiment is detailed in Appendix D, and the results appear in Figures 2 and 3.

⁷We thoroughly apologize, but experimenting with the private version becomes infeasible on a desktop computer due to its sample-complexity constraints.

**Open Problems.** Our work is the first to present a general framework for transfer learning in the hybrid-model, thus providing an initial answer to the open problem of selection bias posed by Beimel et al. (2020). While our framework does surpass certain naïve baselines in some specific cases, there is still much work to be done in order to reduce its sample complexity. Considering different divergences can be a promising direction, especially if we know that the 4th moment of the IS weights is bounded (as it may increase the value of κ). A different avenue is the study of repeated uses of Subsample-Test-Reweigh, where person A applies the paradigm until she finds a good hypothesis, then hands over her last set of weights to person B, who uses it for learning a different hypothesis. Lastly, we believe there is more to be studied in the intersection of DP and transfer learning in general, as the problem can also be tackled via the alternative approach of discrepancy between hypothesis classes (Ben-David et al., 2010; Mansour et al., 2009).

**Acknowledgements.** This work was done when the first author was advised by the second author. O.S. is supported by the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office, and by ISF grant no. 2559/20. Both authors thank the anonymous reviewers for many helpful suggestions in improving this paper.

## References

Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D., and Stuart, A. M. Importance sampling: Intrinsic dimension and computational cost, 2017.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications.
Theory Comput., 8(1):121-164, 2012.

Avent, B., Korolova, A., Zeber, D., Hovden, T., and Livshits, B. BLENDER: Enabling local search with a hybrid differential privacy model. In USENIX Security, pp. 747-764. USENIX Association, 2017.

Balle, B., Bell, J., Gascón, A., and Nissim, K. The privacy blanket of the shuffle model. In CRYPTO, volume 11693, pp. 638-667. Springer, 2019.

Balle, B., Bell, J., Gascón, A., and Nissim, K. Private summation in the multi-message shuffle model. In CCS, pp. 657-676. ACM, 2020.

Bassily, R. and Smith, A. D. Local, private, efficient protocols for succinct histograms. In STOC, pp. 127-135. ACM, 2015.

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, 2014.

Beimel, A., Nissim, K., and Omri, E. Distributed private data analysis: Simultaneously solving how and what. In CRYPTO, volume 5157, pp. 451-468. Springer, 2008.

Beimel, A., Korolova, A., Nissim, K., Sheffet, O., and Stemmer, U. The power of synergy in differential privacy: Combining a small curator with local randomizers. In ITC, volume 163 of LIPIcs, pp. 14:1-14:25. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151-175, 2010.

Bernstein, S. Collected Works II. Moscow: Akad. Nauk SSSR, 1954.

Bittau, A., Erlingsson, Ú., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnés, J., and Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In SOSP, pp. 441-459. ACM, 2017.

Blum, A., Furst, M. L., Jackson, J. C., Kearns, M. J., Mansour, Y., and Rudich, S. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pp. 253-262. ACM, 1994.

Bun, M. and Steinke, T. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In TCC, volume 9985 of Lecture Notes in Computer Science, pp. 635-658, 2016.

Bun, M., Dwork, C., Rothblum, G. N., and Steinke, T. Composable and versatile privacy via truncated CDP. In STOC, pp. 74-86. ACM, 2018.

Bun, M., Carmosino, M. L., and Sorrell, J. Efficient, noise-tolerant, and private learning via boosting. In COLT, volume 125, pp. 1031-1077. PMLR, 2020.

Chatterjee, S. and Diaconis, P. The sample size required in importance sampling, 2017.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12, 2011.

Cheu, A., Smith, A. D., Ullman, J. R., Zeber, D., and Zhilyaev, M. Distributed differential privacy via shuffling. In EUROCRYPT, volume 11476 of Lecture Notes in Computer Science, pp. 375-403. Springer, 2019.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In NIPS, pp. 442-450. Curran Associates, Inc., 2010.

Csiszár, I. and Shields, P. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417-528, 2004.

Duchi, J. C. and Rogers, R. Lower bounds for locally private estimation via communication complexity. In Beygelzimer, A. and Hsu, D. (eds.), COLT, volume 99, pp. 1161-1191. PMLR, 2019.

Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In FOCS, pp. 429-438. IEEE Computer Society, 2013.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M.
Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006a.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC, 2006b.

Dwork, C., Rothblum, G., and Vadhan, S. Boosting and differential privacy. In FOCS, 2010.

Ghazi, B., Golowich, N., Kumar, R., Pagh, R., and Velingker, A. On the power of multiple anonymous messages: Frequency estimation and selection in the shuffle model of differential privacy. In EUROCRYPT, volume 12698, pp. 463-488. Springer, 2021a.

Ghazi, B., Kumar, R., Manurangsi, P., Pagh, R., and Sinha, A. Differentially private aggregation in the shuffle model: Almost central accuracy in almost a single message. In ICML, volume 139, pp. 3692-3701. PMLR, 2021b.

Gupta, A., Hardt, M., Roth, A., and Ullman, J. Privately releasing conjunctions and the statistical query barrier. In STOC, 2011.

Gupta, S. K., Rana, S., and Venkatesh, S. Differentially private multi-task learning. In Intelligence and Security Informatics - 11th Pacific Asia Workshop, PAISI 2016, Auckland, New Zealand, April 19, 2016, Proceedings, volume 9650 of Lecture Notes in Computer Science, pp. 101-113. Springer, 2016.

Hardt, M. and Rothblum, G. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 61-70. IEEE, 2010.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Joseph, M., Mao, J., Neel, S., and Roth, A. The role of interactivity in local differential privacy. In FOCS, pp. 94-105. IEEE Computer Society, 2019.

Karwa, V. and Vadhan, S. P. Finite sample differentially private confidence intervals. In Karlin, A. R. (ed.), ITCS, volume 94 of LIPIcs, pp. 44:1-44:9, 2018.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? In FOCS, 2008.

Kearns, M. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983-1006, 11 1998.

Kumar, M. Differentially private transferrable deep learning with membership-mappings, 2022.

Li, J., Khodak, M., Caldas, S., and Talwalkar, A. Differentially private meta-learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.

Maia Polo, F. and Vicente, R. Effective sample size, dimensionality, and generalization in covariate shift adaptation. Neural Computing and Applications, pp. 1-13, 2022.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms, 2009.

Merriman, C. Microsoft reminds privacy-concerned Windows 10 beta testers that they're volunteers. In The Inquirer, http://www.theinquirer.net/2374302, Oct 7, 2014.

Metelli, A. M., Russo, A., and Restelli, M. Subgaussian and differentiable importance sampling for off-policy evaluation and learning. Advances in Neural Information Processing Systems, 34, 2021.

Microsoft. Windows Insider Program Agreement. https://insider.windows.com/en-us/program-agreement, Sep 15, 2017.

Mironov, I. Rényi differential privacy. In CSF, pp. 263-275, 2017.

Mozilla. Firefox Privacy Notice. https://www.mozilla.org/en-US/privacy/firefox/#pre-release, June 4, 2019.

Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 75-84. ACM, 2007.
Full version in: http://www.cse.psu.edu/~asmith/pubs/NRS07.

Papernot, N., Song, S., Mironov, I., Raghunathan, A., Talwar, K., and Erlingsson, Ú. Scalable private learning with PATE. In ICLR. OpenReview.net, 2018.

Smith, A. D., Thakurta, A., and Upadhyay, J. Is interaction necessary for distributed private learning? In IEEE Symposium on Security and Privacy, SP, pp. 58-77, 2017.

Valiant, L. G. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 11 1984.

Wang, Y., Gu, Q., and Brown, D. E. Differentially private hypothesis transfer learning. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2018, Dublin, Ireland, September 10-14, 2018, Proceedings, Part II, volume 11052 of Lecture Notes in Computer Science, pp. 811-826. Springer, 2018.

Warner, S. L. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), March 1965.

Xie, L., Baytas, I. M., Lin, K., and Zhou, J. Privacy-preserving distributed multi-task learning with asynchronous updates. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017, pp. 1195-1204. ACM, 2017.

Yao, Y. and Doretto, G. Boosting for transfer learning with multiple sources. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1855-1862, 2010. doi: 10.1109/CVPR.2010.5539857.

## A. Private Stochastic Gradient Descent

**The Private SGD Algorithm.** In this section we discuss our private Subsample-Test-Reweigh paradigm where the private learning mechanism applied in each iteration is the standard private SGD (see Hazan (2016); Bassily et al. (2014)). For simplicity, we give the algorithm here.

**Algorithm 3: Online SGD**
- Input: parameters α, β, σ; a Lipschitz constant L and a convex set H ⊆ R^d of diameter D; a distribution µ over a sample S.
- Let h¹ ∈ H be arbitrary.
- Set R ← max{9D²(L²+σ²d)/α², 8D²L²·ln(2/β)/α²}; or set R ← max{(4(L²+σ²d)/(λα))·ln((L²+σ²d)/(λα)), 8D²L²·ln(2/β)/α²} if L is λ-strongly convex.
- For r = 1, 2, 3, ..., R:
  - Draw a labeled example (x_r, ℓ(x_r)) ∼ µ.
  - Draw a random vector v ∼ N(0, σ²I_d).
  - Set η_r ← D/(σ√r); or set η_r ← D/(λr) if L is λ-strongly convex.
  - Set h^{r+1} ← Π_H(h^r − η_r·(∇L(h^r(x_r), ℓ(x_r)) + v)).
- Return h̄ = (1/R)·Σ^R_{r=1} h^r.

Standard arguments from Hazan (2016) give the following utility theorem.

**Theorem A.1.** Let H ⊆ R^d be a convex set of diameter D, let L be an L-Lipschitz loss function, and denote α_H = min_{h∈H} E_{(x,ℓ(x))∼µ}[L(h(x), ℓ(x))]. Then, w.p. ≥ 1−β, after R iterations err_µ(h̄) ≤ α_H + α.

*Proof.* The proof follows the usual analysis of SGD, under the observation that in each iteration

$$\mathbb{E}_{(x_r,\ell(x_r))\sim\mu,\ v\sim N(0,\sigma^2 I_d)}[\nabla L(h^r(x_r),\ell(x_r)) + v] = \mathbb{E}_{(x,y)\sim\mu}[\nabla L(h^r(x),\ell(x))]$$

and

$$\mathbb{E}[\|\nabla L(h^r(x_r),\ell(x_r)) + v\|^2] \le \mathbb{E}[\|\nabla L(h^r(x_r),\ell(x_r))\|^2] + 2\,\mathbb{E}[\nabla L(h^r(x_r),\ell(x_r))\cdot v] + \mathbb{E}[\|v\|^2] \le L^2 + \sigma^2 d.$$

Plugging these into the bounds on the generalization properties of SGD for convex functions, we obtain

$$\mathrm{err}_\mu(\bar h) \le \mathrm{err}_\mu(h^*) + 3D\sqrt{\frac{L^2+\sigma^2 d}{R}},$$

so setting R ≥ max{9D²(L²+σ²d)/α², 8D²L²·ln(2/β)/α²} yields err_µ(h̄) ≤ α_H + α/2 as required. Similarly, for a λ-strongly convex function we get

$$\mathrm{err}_\mu(\bar h) \le \mathrm{err}_\mu(h^*) + \frac{(L^2+d\sigma^2)(1+\ln(R))}{2\lambda R},$$

so setting R ≥ max{(4(L²+σ²d)/(λα))·ln((L²+σ²d)/(λα)), 8D²L²·ln(2/β)/α²} yields err_µ(h̄) ≤ α_H + α as required. ∎

While we can give a privacy analysis of Algorithm 3 on its own, we defer it to the privacy analysis of our full algorithm.
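A minimal runnable sketch of Algorithm 3 (ours; the projection Π_H is taken to be clipping onto an L2-ball of diameter D, one common instantiation, and the step sizes follow the algorithm as reconstructed above):

```python
import numpy as np

def project_ball(h, radius):
    """Pi_H: project onto the L2-ball of the given radius (diameter D)."""
    norm = np.linalg.norm(h)
    return h if norm <= radius else h * (radius / norm)

def online_noisy_sgd(xs, ys, mu, grad_loss, d, D, sigma, R,
                     lam=None, rng=None):
    """Algorithm 3 (sketch): SGD over draws from mu with Gaussian
    gradient noise N(0, sigma^2 I_d); returns the average iterate."""
    rng = rng or np.random.default_rng()
    h, h_sum = np.zeros(d), np.zeros(d)
    for r in range(1, R + 1):
        i = rng.choice(len(xs), p=mu)             # (x_r, l(x_r)) ~ mu
        g = grad_loss(h, xs[i], ys[i])            # gradient of the loss
        v = rng.normal(0.0, sigma, size=d)        # privacy noise
        eta = D / (sigma * np.sqrt(r)) if lam is None else D / (lam * r)
        h = project_ball(h - eta * (g + v), D / 2)
        h_sum += h
    return h_sum / R                              # h_bar = average iterate
```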
**Algorithm 4: Private Subsample-Test-Reweigh with SGD**
- Input: parameters 0 < α, β < 1/8, 0 < ϵ < 1, 0 < δ < 1/4; a private SGD mechanism over a convex set H ⊆ R^d of diameter D with a convex L-Lipschitz loss function L, making at most R SGD-iterations.
- Set κ ← α/(8(χ²+1)), T ← 128L²D²·log₂(8(χ²+1)/α)/α², N₀ ← 4·ln(2/δ)·ln(8T/β)/(ϵ²α²), σ² ← max{20L², (16ϵ⁻¹L²·ln(1/δ) + 8L²)/ln(κn)}.
- Set R ← max{9D²(L²+σ²d)/α², 8D²L²·ln(2/β)/α²}; or set R ← max{(4(L²+σ²d)/(λα))·ln((L²+σ²d)/(λα)), 8D²L²·ln(2/β)/α²} if L is λ-strongly convex.
- Draw a sample of n ≥ √((52RT/(8κ²ϵ))·ln(52RT/(8κ²ϵ))) points i.i.d. from S, denoted x₁, x₂, ..., xₙ, all labeled by ℓ. Draw N₀·T local-users from T.
- Set weight w¹ᵢ ← κ for each 1 ≤ i ≤ n.
- For t = 1, 2, 3, ..., T:
  - Set µᵗ = Π_κ(w̄ᵗ)/‖Π_κ(w̄ᵗ)‖₁.
  - Apply private SGD using µᵗ and noise σ² for R iterations, obtaining a hypothesis hᵗ.
  - Pick arbitrarily a new batch B of N₀ local-users, and set aᵗ ← (1/N₀)·Σ_{x∈B} RR_{ϵ,δ}(L(hᵗ(x), ℓ(x))).
  - If aᵗ > 3α + α_H: for every i, set wᵗ⁺¹ᵢ ← wᵗᵢ·exp(−(α/(8LD))·[LD − L(hᵗ(xᵢ), ℓ(xᵢ))]).
  - Else: return hᵗ (and halt).

**The Subsample-Test-Reweigh Paradigm Using SGD.** We now transition to the full analysis of STR when we apply SGD as the intermediate procedure. The utility analysis of Algorithm 4 is just like that of Algorithm 2, modulo the fact that the non-negative loss is now bounded by LD rather than 1, which implies we must increase the number of MW-iterations by a factor of L²D². It requires a sample complexity of n = Ω(L²D²·(χ²+1)·(log(1/α)/α²)·(d·log(d/α) + log(1/β))) curator-agents, and N = N₀·T = Ω(L²D²·log((χ²+1)/α)·log(log((χ²+1)/α)/(αβ))/(ϵ²α⁴)) local-agents, to return, w.p. ≥ 1−3β, a hypothesis of error ≤ α_H + α. (The increase in the number of curator-agents is due to the fact that the loss now lies in the range [0, LD] rather than [0, 1].)

The privacy analysis of Algorithm 4 requires that we transition to the notion of (ρ, ω)-tCDP given in Bun et al. (2018). Its definition, as well as some of its basic properties (proven in Bun & Steinke (2016); Mironov (2017); Bun et al. (2018)), are provided below.

**Definition A.1.** A mechanism M is said to be (ρ, ω)-tCDP if for any two neighboring inputs I and I′ the α-Rényi divergence of the two output distributions is bounded: D_α(M(I)‖M(I′)) ≤ αρ for any 1 < α ≤ ω.

**Fact A.2.** Let f be a d-dimensional function of L2-global-sensitivity max_{I,I′ neighbors} ‖f(I) − f(I′)‖ ≤ L. Then the mechanism that outputs, for any instance I, a draw from N(f(I), σ²I_d) is (2L²/σ², ∞)-tCDP. Let M₁, M₂ be two mechanisms which are (ρ₁, ω₁)- and (ρ₂, ω₂)-tCDP, resp. Then the mechanism that applies both to the same instance, using independent coin tosses for each, is (ρ₁+ρ₂, min{ω₁, ω₂})-tCDP. A mechanism which is (ρ, ω)-tCDP is also (ρω + ln(1/δ)/(ω−1), δ)-DP for any δ < 1/e.

Perhaps the most important property of tCDP, however, is that it is amplified by subsampling.

**Theorem A.3** (Thm. 12 of Bun et al. (2018), reworded). Fix ρ ∈ (0, 0.1]. Let s be a constant satisfying log(1/s) ≥ 3ρ·(2 + ln(1/ρ)). Let M be a (ρ, ∞)-tCDP mechanism. Let I and I′ be two neighboring instances and let µ be a distribution where the probability of sampling the one entry on which the two instances differ is s. Then the mechanism that samples entries from the input and then applies M to the subsample is (13s²ρ, log(1/s)/(4ρ))-tCDP.

Now, throughout the execution of Algorithm 4 we subsample a point and apply the Gaussian mechanism for RT iterations in total. In all of these iterations we use a distribution in which the probability of subsampling any given point into M is at most 1/(κn).
Moreover, our private-SGD mechanism, when applied to an L-Lipschitz loss function, is (2L²/σ², ∞)-tCDP. Thus, if 2L²/σ² ≤ 0.1 and if ln(κn) ≥ (6L²/σ²)·(2 + ln(σ²/(2L²))), then all conditions of Theorem A.3 hold. This suggests that each noisy SGD step over a randomly drawn sample is (26L²/(σ²κ²n²), σ²·ln(κn)/(8L²))-tCDP; thus, by composition over all RT steps, we are (26RTL²/(σ²κ²n²), σ²·ln(κn)/(8L²))-tCDP. Thus, w.r.t. the curator-agents, we are (ϵ, δ)-DP for any δ < 1/4 with

$$\epsilon = \frac{26RT\ln(\kappa n)}{8\kappa^2 n^2} + \frac{8L^2\ln(1/\delta)}{\sigma^2\ln(\kappa n) - 8L^2}.$$

This suggests that in order to achieve (ϵ, δ)-DP w.r.t. the curator-agents we set σ² = (16ϵ⁻¹L²·ln(1/δ) + 8L²)/ln(κn), and we must set n so that n ≥ √(52RT·ln(κn)/(8κ²ϵ)). Seeing as κ < 1, it thus suffices to set n = √((52RT/(8κ²ϵ))·ln(52RT/(8κ²ϵ))). It remains to verify that under these values the conditions of Theorem A.3 hold; but indeed, note that σ²/(2L²) ≥ 10, so 3ρ(2+ln(1/ρ)) ≤ (3/10)·(2+ln(10)) < 1.3 for ρ = 2L²/σ², whereas under any reasonable set of parameters κn = √((52RT/(8ϵ))·ln(52RT/(8κ²ϵ))) ≥ 4, implying that ln(κn) ≥ ln(4) > 1.3.

**Theorem A.4.** For any given ϵ ∈ (0, 1), δ ∈ (0, 1/4), 0 < α < 1/8, 0 < β < 1/3, two distributions S, T with bounded χ²-divergence, and a convex loss function L over a convex set H ⊆ R^d of diameter D, Algorithm 4 is an (ϵ, δ)-DP algorithm in the hybrid-model provided that n = Ω((χ²+1)·D²L²·√(d·ln(1/δ))·√(ln(2/β))/(α³ϵ)) if the loss function is L-Lipschitz and convex, or n = Ω(LD·(χ²+1)·√(ln(1/β))/(ϵα)) if the loss function is L-Lipschitz and λ-strongly convex. Furthermore, if the number of local-agents is N = Ω(L²D²·log((χ²+1)/α)·log(log((χ²+1)/α)/(αβ))/(ϵ²α⁴)), then w.p. ≥ 1−3β it returns a hypothesis h ∈ H with err_T(h) ≤ α_H + 4α.

## B. Transfer Learning for the PARITY Problem Under the Uniform Distribution

Consider the domain X = {0,1}^d and the class of PARITY functions, where for any S ⊆ [d] we have c_S(x) = ⊕_{i∈S} xᵢ. It is a well-known result that under the uniform distribution the PARITY class cannot be learned in the local-model unless the number of local-agents is N = exp(d), yet a sample of size n = Θ(d/(ϵα)) suffices to learn PARITY under any distribution in the curator-model (Kasiviswanathan et al., 2008). Here we show that in the hybrid-model one can learn the PARITY class in a single iteration, provided S and the uniform distribution T have polynomial χ²-divergence. To establish this, we prove the following sequence of claims and corollaries.

**Proposition B.1.** Fix 0 < β < 1/2. Let S be a sample of n ≥ d·log₂(d/β) points drawn i.i.d. from the uniform distribution over {0,1}^d. Then the probability that S does not contain d linearly independent vectors is at most β.

*Proof.* We prove the claim inductively as we iterate over the vectors in S. By a simple counting argument, the probability of drawing a vector that is linearly dependent on a given set of i vectors in the d-dimensional space F₂^d is at most 2^{i−d}. So fix any 0 ≤ i ≤ d−1. At each step we have a set of i linearly independent vectors spanning a subspace of dimension i. The probability that the next t = log₂(d/β) vectors in S are all linearly dependent on these i basis vectors is at most (2^{i−d})^t ≤ 2^{−t} = β/d. And so, after d iterations, we have found a set of d linearly independent vectors in S w.p. ≥ 1−β. ∎

**Claim B.2.** Fix 0 < β < 1/2. Let T be the uniform distribution over {0,1}^d and let S be a distribution over the same domain s.t. χ²(T‖S) = χ² is finite.
Let $S$ be a sample of $n \geq 4(\chi^2+1)d\ln(d/\beta)$ points drawn i.i.d. from $\mathcal{S}$. Then the probability that $S$ does not contain $d$ linearly independent vectors is at most $\beta$.

Proof. The claim is proven in a similar inductive fashion to Proposition B.1. Fix any $0 \leq i \leq d-1$. At each step we have a set of $i$ linearly independent vectors spanning a subspace of dimension $i$. We argue that the probability that all of the next $t = 4(\chi^2+1)\ln(d/\beta)$ vectors in $S$ are linearly dependent on these $i$ basis vectors is at most $\beta/d$, from which the claim follows immediately. So, given $i$ linearly independent vectors, let $E$ be the event that we draw a vector not in their span. Under the uniform distribution, $\Pr_{\mathcal{T}}[E] \geq 1 - 2^{i-d} \geq 1/2$. Standard bounds on the $\chi^2$-divergence give that $(\chi^2+1)\Pr_{\mathcal{S}}[E] \geq \Pr_{\mathcal{T}}[E]^2 \geq 1/4$, implying that $\Pr_{\mathcal{S}}[E] \geq \frac{1}{4(\chi^2+1)}$ and so $\Pr_{\mathcal{S}}[\neg E] \leq 1 - \frac{1}{4(\chi^2+1)}$. It follows that the probability that among the next $t = 4(\chi^2+1)\ln(d/\beta)$ draws not a single one lies outside the span of these $i$ vectors is at most $\left(1 - \frac{1}{4(\chi^2+1)}\right)^t \leq \exp\left(-\frac{t}{4(\chi^2+1)}\right) = \beta/d$, as required.

Corollary B.3. Fix $\beta > 0$. Set $k = \log_{4/3}(2/\beta)$. Under the same notation as in Claim B.2, let $S_1, S_2, \ldots, S_k$ be $k$ independently drawn batches from $\mathcal{S}$ s.t. each $S_i$ contains at least $n \geq \frac{32(\chi^2+1)d\ln(2dk/\beta)}{\epsilon}$ points. Then w.p. $\geq 1-\beta$ it holds that when we draw from each $S_i$ a subsample $S'_i$, where each $x \in S_i$ is put in the subsample w.p. $\epsilon/4$ independently of all other examples, all $S'_i$ contain $d$ linearly independent vectors.

Proof. A straightforward application of the Chernoff bound gives that w.p. $\geq 1-\beta/2$ each of the $k$ subsamples $S'_i$ contains at least $(\epsilon/8)|S_i|$ many points. This suffices for us to apply Claim B.2 and obtain that w.p. $\geq 1-\beta/2$ all $S'_i$ contain $d$ linearly independent vectors.

Based on Corollary B.3 we can now apply the PARITY learning algorithm from Kasiviswanathan et al. (2008) on the $kn$ curator-agents just once, and then test its correctness over the $N$ local-agents. This algorithm outputs for each $S'_i$ either $\bot$ or a solution in the affine subspace that solves a system of equations over $\mathbb{F}_2$. But due to the linear independence of each $S'_i$, this solution must be the indicator vector of the relevant features of the true classifying function $c_S \in$ PARITY. It follows that each $S_i$ outputs the true classifying function w.p. $\geq 1/4$; and so, w.p. $\geq 1-\beta$, all $S_i$ return either $\bot$ or $c_S$, where at least one of the $S_i$-s outputs the true classifier. Note that the true classifier's loss under either distribution, $\mathcal{S}$ or $\mathcal{T}$, must be 0. Using additional $O(\epsilon^{-2}\alpha^{-2})$ local-agents we can test and verify that we indeed obtained a hypothesis $c \in$ PARITY of small loss.
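The core of the curator-model PARITY learner invoked above is Gaussian elimination over $\mathbb{F}_2$: given labeled examples $(x, c_S(x))$, solve for the indicator vector of $S$. The following is a hedged standalone sketch of that step, not the exact learner of Kasiviswanathan et al. (2008). With $d$ linearly independent examples, as Claim B.2 guarantees w.h.p., the solution is unique; a rank-deficient system corresponds to the $\bot$ output.

```python
import numpy as np

def solve_parity_gf2(X, y):
    """Solve <s, x> = y (mod 2) for s; returns None (i.e., "bot") if rank < d."""
    n, d = X.shape
    A = np.concatenate([X % 2, (y % 2).reshape(-1, 1)], axis=1).astype(np.uint8)
    row = 0
    for col in range(d):                       # forward elimination over F_2
        pivot = next((r for r in range(row, n) if A[r, col]), None)
        if pivot is None:
            continue                           # no pivot: rank-deficient column
        A[[row, pivot]] = A[[pivot, row]]      # move the pivot row into place
        for r in range(n):
            if r != row and A[r, col]:
                A[r] ^= A[row]                 # XOR is addition mod 2
        row += 1
    if row < d:
        return None                            # under-determined system: output bot
    return A[:d, d].copy()                     # the indicator vector of S
```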
C. Experimental Evaluation of Non-Private Subsample-Test-Reweigh

In this section we show empirically that, in another setting, both the number of iterations required until convergence and our sample-complexity bounds are far greater than what is actually needed. We consider a setting where $\mathcal{S}$ is a simple spherical Gaussian in $d = 500$ dimensions, $\mathcal{S} = N(\vec{0}, I_d)$, whereas for $\mathcal{T}$ we picked an arbitrary set of $k = 10$ coordinates and set the standard deviation on these $k$ coordinates to $\sigma = 0.02$, while the remaining $d-k$ coordinates have standard deviation 1; i.e., $\mathcal{T} = N(\vec{0}, I_{d-k} \oplus (0.02)^2 I_k)$. It is a matter of simple calculation to show that $\chi^2(\mathcal{T}\|\mathcal{S}) + 1 = \left(\frac{1}{\sigma^2(2-\sigma^2)}\right)^{k/2} > 3\cdot 10^{15}$. The true hypothesis is a hyperplane separator defined on the $k$ coordinates on which $\mathcal{S}$ and $\mathcal{T}$ have different variance, chosen so that it classifies precisely $\alpha = 0.01$ of the mass of $\mathcal{T}$ as 1. We applied the non-private version of our algorithm (Algorithm 1).

In the non-private version, in order to learn in each iteration a hyperplane separator over a subsample of examples from $\mathcal{S}$, we used SVM, where the optimization objective is $\frac{1}{2}\|w\|^2 + C\sum_x \max\{0, 1-\langle w, x\rangle\}$ with a very large $C = 10^{30}$ (aiming to find as exact a hyperplane as possible), over a subsample whose size is set to $\frac{d+\ln(T/0.05)}{\alpha}$. First, aiming to find the curator-agent sample complexity that already yields a hypothesis of error $\leq 2\alpha$, we ran our experiments with varying values of $n \in \{10^4, 2\cdot 10^4, \ldots, 8\cdot 10^4, 9\cdot 10^4\}$, repeating each experiment $t = 50$ times. We observed that for $n = 80{,}000$ we consistently return a hypothesis of small loss. Results appear in Figure 1a. Then, for even larger values of $n \in \{9\cdot 10^4, 1\cdot 10^5, \ldots, 1.4\cdot 10^5\}$, we ran our experiment to see at which iteration we halt. For $n = 80{,}000$ we halt after roughly 1300 iterations, whereas for $n = 140{,}000$ we halt after $T \approx 1200$, yielding the rather surprising result that $T$ isn't greatly affected by the increasing sample complexity. Regardless, this is very far below the $O(\alpha^{-2})$ upper bound. Results appear in Figure 1b. In fact, the error of the resulting hypothesis along the run itself takes roughly the same values, as seen in Figure 1c.

Figure 1: Empirical experiment results. (a) Loss on T: the loss on distribution $\mathcal{T}$ until convergence (for samples 90K-140K), or (for samples 10K-80K) until reaching the early-stopping condition, namely that the average loss on $\mathcal{T}$ over the last 200 iterations has not improved by more than 0.01 compared to the best average loss over the previous 200 iterations. (b) Number of iterations: the number of iterations until convergence decreases with the sample size. (c) Loss on T along the run: the loss on $\mathcal{T}$ decreases with the iterations.
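For reference, the following is a hedged sketch of one MW-iteration of this experiment: draw a subsample from $\mathcal{S}$ according to the current weights $\mu^t$, label it by a separator supported on the $k$ low-variance coordinates, and fit a large-$C$ SVM. The use of scikit-learn's LinearSVC, the concrete separator, and the threshold are illustrative assumptions (we also use a smaller $C$ than the $10^{30}$ above, for numerical stability).

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, k = 500, 10
S_pool = rng.standard_normal((100_000, d))    # the curator's sample from S

w_true = np.zeros(d)
w_true[:k] = 1.0                              # separator on the k coordinates
theta = 2.33 * 0.02 * np.sqrt(k)              # threshold so ~1% of T's mass is labeled 1

def label(X):
    return np.where(X @ w_true > theta, 1, -1)

def fit_one_iteration(mu, m):
    """One STR iteration: weighted subsample from S, then a near-hard-margin SVM."""
    idx = rng.choice(len(S_pool), size=m, p=mu)
    X = S_pool[idx]
    # Large C approximates the objective 0.5*||w||^2 + C * sum(hinge) above;
    # m must be large enough that both labels appear in the subsample.
    clf = LinearSVC(C=1e6, loss="hinge", max_iter=100_000)
    clf.fit(X, label(X))
    return clf
```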
D. Experimental Evaluation of Private Subsample-Test-Reweigh

In order to implement the private algorithm, we used another example, one with bounded examples and hypotheses. Similarly to the previous experiment, we consider a setting where $\mathcal{S}$ is a simple spherical Gaussian, now in $d = 200$ dimensions, $\mathcal{S} = N(\vec{0}, I_d)$, whereas for $\mathcal{T}$ we picked an arbitrary set of $k = 6$ coordinates and set the standard deviation on these $k$ coordinates to $\sigma = 0.4$, while the remaining $d-k$ coordinates have standard deviation 1; i.e., $\mathcal{T} = N(\vec{0}, I_{d-k} \oplus (0.4)^2 I_k)$. It is a matter of simple calculation to show that $\chi^2(\mathcal{T}\|\mathcal{S}) + 1 = \left(\frac{1}{\sigma^2(2-\sigma^2)}\right)^{k/2} > 39.1$. However, we now set the true labeling function as one that is correlated with points of large importance-sampling weight. Looking at the $k$ coordinates with smaller variance in $\mathcal{T}$, it is fairly simple to see that any origin-centered ball has more probability mass under $\mathcal{T}$ than under $\mathcal{S}$. And so, our hypothesis class is the set of origin-centered ellipsoids $\mathcal{H} = \{(w_1, \ldots, w_d, b) \in [0,1]^{d+1} : \sum_{i=1}^d w_i x_i^2 \leq b\}$, where for $x \sim \mathcal{S}$ the sum $\sum_i x_i^2$ over the $k$ chosen coordinates is $\chi^2_k$-distributed. The true hypothesis $\ell$ is the separator with $w = \vec{0}_{d-k} \oplus \vec{1}_k$ and $b = \sigma^2 r_0$, where $r_0$ is a numerically set threshold for which $\Pr_{X \sim \chi^2_k}[X < r_0] = 0.3$. Thus $\ell$ labels precisely 30% of the probability mass of $\mathcal{T}$ as 1, whereas it labels a significantly smaller fraction of $\mathcal{S}$ as 1. As the curator-model learning algorithm we used online SGD as presented in Hazan (2016), after mapping each example $x \in \mathbb{R}^d$ to the vector $y \in \mathbb{R}^d_+$ where $y_i = x_i^2$ for each coordinate $i$. This allows us to use the hinge-loss function $L(y) = \max\{0, \ell(y)\cdot(\langle w, y\rangle - b)\}$. The learning rate of the algorithm depends on the diameter of the hypothesis class and on the Lipschitz constant of the loss, both of which require an upper bound on $\|y\|$.

So we set a value $B = 27.9$ (empirically, the maximal value of $\|y\|$ over 90% of the examples drawn from $\mathcal{S}$) and projected each example of norm $> B$ onto the scaled simplex $\{y \in \mathbb{R}^d_+ : \|y\|_1 \leq B\}$. Therefore, including the additional intercept coordinate, we get $\|\nabla L(y)\|^2 \leq B^2 + 1$, so we can use the Lipschitz parameter $L = \sqrt{B^2+1}$. We also verify that the diameter is at most $\sqrt{d+1}$ by verifying that $\sigma$ ...

Non-privately, we ran the online SGD with a maximum of $10^4$ iterations, or until finding a hyperplane with a loss of at most $\alpha = 0.01$, with $\beta = 10^{-6}/T$. Aiming to find the curator-agent sample complexity for which we get a hypothesis of error $\leq 2\alpha$ over $\mathcal{T}$, we ran our experiments with varying values of $n \in \{10\mathrm{K}, 25\mathrm{K}, 50\mathrm{K}, 75\mathrm{K}, 100\mathrm{K}, 125\mathrm{K}\}$, repeating each experiment $t = 50$ times. Although all these values of $n$ are far below the worst-case bound (which in our setting is roughly $15\cdot 10^{12}$), we consistently return a hypothesis of small loss on $\mathcal{T}$. Results appear in Figure 2a. Again we see that the number of MW-iterations $T$ isn't greatly affected by the increasing sample complexity, as seen in Figure 2c. Moreover, for all values of $n$ we halt after roughly 1000 iterations (results appear in Figure 2b), which is far below the $O(\alpha^{-2})$ upper bound.

In the private version of the algorithm we applied Algorithm 4 with the private online SGD (Algorithm 3). The settings of the non-private version required a vast amount of memory, because the sample-complexity bounds are quite large even for moderate values of $\alpha$ and $\epsilon$, so we had to change some parameters and use fairly close distributions $\mathcal{S}$ and $\mathcal{T}$. The changes from the non-private settings are as follows. We set the dimension to $d = 8$ and the number of closer coordinates to $k = 2$, so $\mathcal{S} = N(\vec{0}, (0.1)^2 I_d)$ whereas $\mathcal{T} = N(\vec{0}, (0.1)^2 I_{d-k} \oplus (0.07)^2 I_k)$. (It is a matter of simple calculation to show that $\chi^2(\mathcal{T}\|\mathcal{S}) + 1 = \left(\frac{\sigma_S^4}{\sigma_T^2(2\sigma_S^2 - \sigma_T^2)}\right)^{k/2} > 1.35$, where $\sigma_S, \sigma_T$ are the standard deviations in the $k$ coordinates of $\mathcal{S}$ and $\mathcal{T}$ respectively.) We also set $r_0$ as a threshold for 0.4 of the examples, i.e. $\Pr_{X\sim\chi^2_k}[X < r_0] = 0.4$. We set $\alpha = 0.08$ and $\kappa = \frac{\alpha}{4(\chi^2+1)}$, and used the privacy parameters $\epsilon = 0.5$ and $\delta = 0.0001$. Following similar calculations to before, we set the bound on our hypothesis set's diameter as $B = 0.006$ and the Lipschitz value as $L = 0.1$. We succeeded in getting from the SGD a hypothesis whose loss is lower than $\alpha = 0.08$ with $\beta = 10^{-6}/T$ after 50M iterations. So we ran the SGD with $R = 50$M, $R = 75$M, and $R = 100$M iterations, repeating each experiment $t = 50$ times, to see their influence on the required number of MW-iterations. Our calculations lead to a sample complexity of $n = 324{,}700{,}000$, which we used in all runs. We can see (Figure 3b) that the number of MW-iterations decreases as the number of SGD iterations increases. In addition, the loss on $\mathcal{T}$ starts higher as the number of SGD iterations decreases (Figure 3c). However, in all these runs the MW algorithm converged to $2\alpha$ (Figure 3a). Due to the sample size, we were not able to experiment thoroughly with the private version of our algorithm; nonetheless, our experiments do show that we succeeded in implementing the algorithm. Furthermore, we also experimented with an SGD version which minimized a Lasso-regularized version of our objective. This version converged in a single iteration, namely, it found the true separating coordinates on $\mathcal{S}$. This is similar in spirit to the work of Avent et al. (2017), which also used the curator-agents to find the coordinates of the regression.
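A sketch of the preprocessing described earlier in this section, under our reading of the projection set as the $\ell_1$-ball intersected with the nonnegative orthant: map $x \mapsto y$ with $y_i = x_i^2$, project any $y$ with $\|y\|_1 > B$ onto $\{y \geq 0 : \|y\|_1 \leq B\}$ (Euclidean projection, via the standard sort-based algorithm), and evaluate the hinge loss from the text. The helper names are ours.

```python
import numpy as np

def project_scaled_simplex(y, B):
    """Euclidean projection of y >= 0 onto {z >= 0 : sum(z) <= B}."""
    if y.sum() <= B:
        return y
    u = np.sort(y)[::-1]                       # sort-based simplex projection
    css = np.cumsum(u) - B
    idx = np.arange(1, len(y) + 1)
    rho = idx[u - css / idx > 0][-1]           # largest index kept positive
    theta = css[rho - 1] / rho
    return np.maximum(y - theta, 0.0)

def hinge_loss(w, b, y, lab):
    """L(y) = max{0, ell(y) * (<w, y> - b)}, with lab = ell(y) in {-1, +1}."""
    return max(0.0, lab * (np.dot(w, y) - b))

x = np.random.default_rng(1).standard_normal(200)   # a draw like S above (d = 200)
y = project_scaled_simplex(x**2, B=27.9)            # B as set empirically above
```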
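Finally, a hedged illustration of the Lasso-style variant mentioned in the last paragraph: fitting an $\ell_1$-regularized linear model once on the squared-coordinate features tends to place its nonzero weights exactly on the separating coordinates. The scikit-learn model, the regularization strength, and the empirical choice of the threshold $b$ are illustrative assumptions, not the paper's exact regularized SGD.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, k, n = 8, 2, 20_000
Y = (0.1 * rng.standard_normal((n, d))) ** 2     # features y_i = x_i^2, x ~ S
score = Y[:, :k].sum(axis=1)                     # mass on the k separating coordinates
b = np.quantile(score, 0.3)                      # threshold playing the role of sigma^2 * r0
labels = (score <= b).astype(int)                # ell labels the small ellipsoid as 1

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(Y, labels)
support = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
print(support)                                   # expected: the k separating coordinates
```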
Figure 2: Empirical experiment results. (a) Loss on T: the loss on distribution $\mathcal{T}$ until convergence. (b) Number of iterations: the number of iterations until convergence. (c) Loss on T along the run: the loss on $\mathcal{T}$ decreases with the iterations.

Figure 3: Empirical experiment results. (a) Loss on T: the loss on distribution $\mathcal{T}$ until convergence. (b) Number of iterations: the number of MW-iterations until convergence decreases as the number of SGD iterations increases. (c) Loss on T along the run: the loss on $\mathcal{T}$ decreases with the iterations.