# pairwise_learning_with_differential_privacy_guarantees__347e6b31.pdf

The Thirty-Fourth AAAI Conference on Artiﬁcial Intelligence (AAAI-20)

Pairwise Learning with Differential Privacy Guarantees

Mengdi Huai,1* Di Wang,2* Chenglin Miao,2 Jinhui Xu,2 Aidong Zhang1

1Department of Computer Science, University of Virginia 2Department of Computer Science and Engineering, State University of New York at Buffalo 1{mh6ck, aidong}@virginia.edu, 2{dwang45, cmiao, jinhui}@buffalo.edu

Pairwise learning has received much attention recently as it is more capable of modeling the relative relationship between pairs of samples. Many machine learning tasks can be categorized as pairwise learning, such as AUC maximization and metric learning. Existing techniques for pairwise learning all fail to take into consideration a critical issue in their design, i.e., the protection of sensitive information in the training set. Models learned by such algorithms can implicitly memorize the details of sensitive information, which offers opportunity for malicious parties to infer it from the learned models. To address this challenging issue, in this paper, we propose several differentially private pairwise learning algorithms for both online and ofﬂine settings. Speciﬁcally, for the online setting, we ﬁrst introduce a differentially private algorithm (called On Pair Str C) for strongly convex loss functions. Then, we extend this algorithm to general convex loss functions and give another differentially private algorithm (called On Pair C). For the ofﬂine setting, we also present two differentially private algorithms (called Off Pair Str C and Off Pair C) for strongly and general convex loss functions, respectively. These proposed algorithms can not only learn the model effectively from the data but also provide strong privacy protection guarantee for sensitive information in the training set. Extensive experiments on real-world datasets are conducted to evaluate the proposed algorithms and the experimental results support our theoretical analysis.

Introduction As an important family of learning problems, pairwise learning has drawn much attention recently. Since pairwise learning involves a loss function depending on pairs of samples, it shows great advantage in modeling the relative relationship between pairs of samples over traditional pointwise learning (e.g., classiﬁcation), in which the loss function only takes individual samples as the input. In practice, many learning tasks can be categorized as pairwise learning problmes. For instance, metric learning (Huai et al. 2019; Jin, Wang, and Zhou 2009; Huai et al. 2018a; Sun et al. 2012; Huai et al. 2018b; Suo et al. 2018) aims to learn a distance metric from a given collection of pair of similar/dissimilar samples that preserves the distance relation

*The ﬁrst two authors contributed equally to this work. Copyright 2020, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

among the data, which can be formulated as a pairwise learning problem. Apart from metric learning, many other learning tasks, such as AUC maximization (Zhao et al. 2011; Natole, Ying, and Lyu 2018) and ranking (Tang and Wang 2018), can also be categorized as pairwise learning. Existing pairwise learning algorithms can be roughly divided into two categories: online and ofﬂine. The online pairwise learning algorithms process the input data records in a sequential manner and iteratively update the model upon the arrival of each sample (Zhao et al. 2011; Kar et al. 2013). In contrast, the ofﬂine pairwise learning algorithms require the entire training dataset ready before the learning process starts and take it as whole to update the model (Cao, Guo, and Ying 2016; Jin, Wang, and Zhou 2009). Although the importance of pairwise learning has been recognized in many real-world applications, existing pairwise learning algorithms fail to take into consideration an important issue in their designs, that is, the protection of sensitive information in the training set. The training datasets for pairwise learning are often collected from individual users and thus may contain private personal information. The models learned by such algorithms can implicitly memorize some details of the sensitive information, which undesirably offers opportunity for malicious parties to compromise the users privacy. Taking the patient similarity learning task as example, a hospital may want to train a universal patient similarity learning model from patients (crossing many hospitals) so as to obtain a better understanding of the diseases and diagnoses. Due to trust to the hospital, patients may be willing to provide necessary information for such a learning process. However, without a proper mechanism, the patients privacy may be breached when the trained model by the hospital is provided to other parties (such as medical research institutes or drug makers). This is because these parties can infer patients private information using various attack techniques, such as model inversion attack (Fredrikson, Jha, and Ristenpart ) and membership attack (Shokri et al. 2017). Thus, without a convincing privacy-preserving mechanism, the patients may not be willing to participate in such a learning task. Hence, a big challenge facing pairwise learning is how to learn a model privately such that sensitive information cannot be inferred from the learned model. To the best of our knowledge, no existing work has addressed the above challenge. This motivates us to design,

in this paper, privacy-preserving pairwise learning methods which can not only keep the sensitive information private but also guarantee good generalization performance. Among existing privacy-preserving strategies, differential privacy (DP) (Dwork et al. 2006), as a rigorous notion for data privacy, can provide very rigid privacy and utility guarantee. Although various DP methods exist for (online) pointwise learning, such as objective perturbation or DP-SGD (Chaudhuri and Monteleoni 2009; Bassily, Smith, and Thakurta 2014; Jain, Kothari, and Thakurta 2012; Wang, Chen, and Xu 2019), they cannot be applied to pairwise learning algorithms directly. This is mainly because the training sample pairs in pairwise learning algorithms are not i.i.d. and the loss function depends on more than one data records. In the light of the above challenges, in this paper, we propose efﬁcient differentially private algorithms for the aforementioned two types of pairwise learning problems. The contributions of this paper can be summarized as follows:

Firstly, we consider the pairwise learning problem in the online setting, and propose an (ϵ, δ)-DP algorithm called online pairwise private GIGA-Strongly convex method (On Pair Str C). This algorithm achieves a regret upper bound of O(

ϵ ) when the losses are strongly convex, where d is the feature dimension and n is the data size. We then extend this algorithm to general convex losses by proposing an algorithm called online pairwise private GIGA-convex method (On Pair C), which has a regret up-

per bound of O(

dn 3 4 ϵ ).

Secondly, we study the pairwise learning problem in the ofﬂine setting. We show that it is possible to achieve generalization errors of O(

d nϵ) and O(

d 4 nϵ) for strongly and general convex loss functions respectively by adopting the results in the online settings. We then improve these bounds by proposing an offline pairwise private GIGAStrongly convex algorithm (Off Pair Str C) and an offline pairwise private GIGA-convex algorithm (Off Pair C) for the two types of loss functions. Particularly, in the case of general convex loss functions, our improved algorithm can achieve a generalization error of O(

Related Work

As mentioned earlier, there is no existing result on pairwise learning under the differential privacy model. Thus, we only compare ours with those private pointwise learning methods. There is a long list of papers on differentially private pointwise learning in the last decade which attack the problem from different perspectives. For DP pointwise learning with convex loss functions, there are a lot of works on it, such as (Chaudhuri and Monteleoni 2009; Chaudhuri, Monteleoni, and Sarwate 2011; Bassily, Smith, and Thakurta 2014; Wang, Ye, and Xu 2017; Wang, Chen, and Xu 2019). However, all of the above results focus only on pointwise loss functions and cannot be extended to pairwise loss functions. Differentially private pointwise learning in the online setting has also been studied previously (Jain, Kothari, and

Thakurta 2012; Thakurta and Smith 2013). The works that are most related to ours are probably (Jain, Kothari, and Thakurta 2012) and (Thakurta and Smith 2013), where the authors gave a general framework for online convex optimization under differential privacy. However, there are some signiﬁcant differences from ours. Firstly, (Jain, Kothari, and Thakurta 2012) and (Thakurta and Smith 2013) consider only pointwise loss functions while we study pairwise loss functions. Thus, their methods cannot be directly extended to pairwise loss functions, making them incomparable with ours; secondly, due to the differences in the structure of two problems and the deﬁnitions of the regret, the analyses of the upper bounds and the DP guarantees are quite different (see Remark 1 for more details).

Preliminaries We say that two datasets D, D are neighbors if they differ by only one entry, which is denoted as D D . Deﬁnition 1 (Differential Privacy (Dwork et al. 2006)). A randomized algorithm A is (ϵ, δ)-differentially private (DP) if for all neighboring datasets D, D and for all events S in the output space of A, the following holds

Pr(A(D) S) eϵPr(A(D ) S) + δ.

When δ = 0, A is ϵ-differentially private. In this paper we focus on (ϵ, δ)-DP and use Gaussian mechanism (Dwork et al. 2006) to guarantee (ϵ, δ)-DP. Deﬁnition 2 (Gaussian Mechanism). Given any function q : X n Rd, the Gaussian mechanism is deﬁned as MG(D, q, ϵ) = q(D) + Y, where Y is drawn from Gaussian

Distribution N(0, σ2Id) with σ

2 ln(1.25/δ)Δ2(q)

ϵ . Here Δ2(q) is the ℓ2-sensitivity of the function q, i.e., Δ2(q) = sup D D ||q(D) q(D )||2. Gaussian mechanism preserves (ϵ, δ)-differential privacy. Additionally, we also use zero Concentrated Differential Privacy (z CDP) (Bun and Steinke 2016) and its composition property to guarantee (ϵ, δ)-DP. Compared to directly using the composition property of DP, it has many advantages (see (Lee and Kifer 2018; Wang and Xu 2019) for more details). Deﬁnition 3. A randomized mechanism A is ρ-z CDP if, for all neighboring dataset D, D and all α (1, ),

Dα(A(D)||A(D )) ρα,

where Dα( || ) is the α-R enyi Divergence 1. Lemma 1. (Bun and Steinke 2016) Suppose that two mechanisms satisfy ρ1-z CDP and ρ2-z CDP, respectively. Then, their composition is (ρ1 + ρ2)-z CDP. Lemma 2. (Bun and Steinke 2016) For a Gaussian mechanism q(D) + Y with Y N(0, σ2Id), it satisﬁes ( Δ2 2(q) 2σ2 )- z CDP. Lemma 3. (Bun and Steinke 2016) If a mechanism is ρ-

z CDP, then it is (ρ + 2

δ , δ)-DP for any δ > 0.

1For two distributions P and Q on Ω and α (1, ), the α-R enyi Divergence between P, Q is deﬁned as Dα(P||Q) = 1 α 1 log

Ω P(x)αQ(x)1 αdx.

Private Pairwise Learning Different from the pointwise loss function ℓ: C D R, a pairwise loss function is a function on pairs of data records, i.e., ℓ: C D D R, where D is the data universe. Given a dataset D = {z1, z2, , zn} Dn and a loss function ℓ( ; , ), its empirical risk can be deﬁned as:

L(w; D) = 1 n(n 1)

j =i ℓ(w; zi, zj). (1)

When the inputs are drawn i.i.d from an unknown underlying distribution P on D, the population risk is

LP(w) = Ezi,zj P[ℓ(w; zi, zj)]. (2)

Similar to the deﬁnition of private pointwise learning (Bassily, Smith, and Thakurta 2014), we can deﬁne private pairwise learning as follows. Deﬁnition 4 (Private pairwise learning). Let C Rd be a convex, closed and bounded constraint set, D be a data universe, and ℓ: C D D R be a pairwise loss function. Also, let D = {z1 = (x1, y1), z2 = (x2, y2), , zn = (xn, yn)} Dn be a dataset with data records {xi}n i=1 Rd and labels (responses) {yi}n i=1 [ 1, 1]n. Private pairwise learning is to ﬁnd a private estimator wpriv C so that the algorithm is (ϵ, δ) or ϵ differential private and the error is minimized, where the error for an estimator w can be measured by either the optimality gap Err D(w) = L(w; D) minw C L(w; D) or the generalization error Err P(w) = LP(w) minw C LP(w). In this paper, we will focus on a special class of pairwise loss functions 2 which contains the loss functions of metric learning, AUC maximization and bipartite ranking. Assumption 1. For the loss function, we assume that it has the form of ℓ(w; z, z ) = φ(Y (y, y )h(w; x, x )), and ℓis a G-Lipschitz and L-smooth convex function over w, where Y (y, y ) = y y or Y (y, y ) = yy . In the experimental part, we will let φ be the logistic function, i.e., φ(x) = log(1 + e x). Example 1: Metric Learning (Cao, Guo, and Ying 2016) The goal here is to learn a Mahalanobios metric M 2 W (x, x ) = (x x )T W(w x ) using loss function ℓ(W; z, z ) = φ(yy (1 M 2 W (x, x )), where y, y { 1, +1}. The constraint set C is C = {W : W Sd, W F 1}. Example 2: AUC Maximization (Zhao et al. 2011), Bipartite Ranking (Cl emenc on et al. 2008) The goal here is to maximize the area under the ROC curve for a linear classiﬁcation problem with the constraint of w 2 1. Here h(w; x, x ) = w T (x x ) and ℓ(w; z, z ) = φ((y y )h(w; x, x )), where y, y { 1, +1}.

Online Private Pairwise Learning Here we follow online pairwise learning in (Kar et al. 2013). An online learning algorithm A is given sequential access to

2We note that all the (ϵ, δ)-DP algorithms in this paper can be extended to general pairwise loss functions, although the upper bounds of the generalization errors may differ.

a stream of elements z1, z2, z3, , zn. At each time step t = 2, 3, , n, the algorithm selects a parameter wt 1 C upon which the data record zt is revealed, and the algorithm incurs the following penalty

ˆLt(wt 1, Dt) = 1 t 1

i=1 ℓ(wt 1; zt, zi), (3)

where Dt = {z1, , zt}. Thus, the online algorithm A maps a data sequence {z1, , zn} to a sequence of parameters {w1, , wn 1}. In the non-private case, the goal is to select {w1, , wn 1} so as to minimize the regret, i.e.,

t=2 ˆLt(wt 1, Dt) min w C

t=2 ˆLt(w, Dt).

Moreover, if all data are chosen i.i.d from the distribution P, we also want to minimize the generalized regret, i.e.,

t=2 LP(wt 1) (n 1) min w C LP(w). (5)

If ℓis convex, then from (5) we have parameter w = w1+ +wn 1

n 1 satisﬁes the following generalization error:

LP( w) min w C LP(w) RP,A(n)

However, under the differential privacy model, we need to guarantee that the output sequence {w1, , wn 1} is DP. Thus, private pairwise learning in the online setting can be deﬁned as follows: Deﬁnition 5 (Online private pairwise learning). Let Z = {z1, z2, , zn} be any sequence of data records in the data universe D. Let the sequence of outputs by algorithm A be A(Z) = {w1, , wn 1}. Then, A is (ϵ, δ)-differentially private if given any other data sequence Z which differs in at most one entry with Z, for all events S, we have Pr[A(Z) S] eϵPr[A(Z ) S] + δ. The goal of online private pairwise learning is to select private outputs {w1, , wn 1} that minimizes the (generalized) regret. From above discussions on (5) and (6), we know that if the generalized regret is low, the algorithm will have a good performance on generalization theoretically. From this view, the online setting is more general. Thus, in the paper, we will ﬁrst consider the online private pairwise learning and provide regrets for both strongly and general convex losses. After that, we will study the problem in the ofﬂine setting.

Online Private Pairwise Learning We ﬁrst consider the case that the loss function is strongly convex. After that, we will use the regularization perturbation strategy in (Thakurta and Smith 2013) to extend the resulting algorithm to general convex loss functions. Our algorithm is inspired by the stability of Generalized Inﬁnitesimal Gradient Ascent (GIGA) (Zinkevich 2003; Jain, Kothari, and Thakurta 2012), which is a well-known online convex algorithm (see Remark 1 for discussions on

Algorithm 1 Online Pairwise Private GIGA-Strongly Convex (On Pair Str C)

1: Input: Privacy parameters ϵ and δ, sequence of data record {z1, z2, , zn}, constrained convex set C Rd, and pairwise loss function ℓ( ; , ). 2: Parameters: ℓis G-Lipschitz, L-smooth and α-strongly convex over w. Step time T1 = max{ 16L2

3: Compute ρ which satisﬁes ρ + 2

4: for t = 1, , T1 do 5: Receive the data record zt (incurs penalty ˆLt(wt 1, Dt) when t 2). 6: Randomly choose a parameter wt C. 7: end for 8: for t = T1 + 1, , n do 9: Receive the data record zt (incurs penalty ˆLt(wt 1, Dt)). 10: Set step size ηt = t 1

t 2 2 αt 11: wt = ΠC[wt 1 ηt ˆLt(wt 1, Dt)], where ΠC is the projection onto the set C.

12: Set σ2 t = 32G2(n T1)

α2t2ρ . Let wt = wt + nt, where nt N(0, σ2 t Id). 13: Output wt = arg minw C w wt 2 2. 14: end for

the difference of our algorithm with the previous ones). The main steps are given in Algorithm 1. We call the above algorithm excluding the portion of random perturbation (i.e., steps 12 and 13) Pairwise GIGA. The following lemma gives an upper bound on the ℓ2-norm sensitivity of the output in the t-th iteration of Pairwise GIGA, which is to ensure (ϵ, δ)-DP of Algorithm 1.

Lemma 4. Let At(Dt) denote the output of Pairwise GIGA in the t-th iteration. Then, under the assumption of Algorithm 1, for any t 1 and Dt D t,

At(Dt) At(D t) 2 8G

In Algorithm 1, steps 4 to 7 seem weird due to the random sampling of wt 1. However, as we see from the proof of Lemma 4, this condition is necessary for the stability analysis. Moreover, the similar steps have also been adopted in some algorithms on DP online learning, such as (Thakurta and Smith 2013; Jain, Kothari, and Thakurta 2012).

Theorem 1. Under Assumption 1 and the assumption that the loss function ℓis α-strongly convex, for any 0 < ϵ, δ 1, Algorithm 1 is (ϵ, δ)-differentially private.

Note that to guarantee DP, we ﬁrst transfer (ϵ, δ)-DP to ρ-z CDP by Lemma 3, and then use composition theorem to make Algorithm 1 be ρ-z CDP (i.e., we make each iteration T1 + 1 t n be ρ n T1 -z CDP). It is easy to see that in this case the variance of the noise satisﬁes σ2 t = 32G2(n T1)

log(1/δ))2 . When ϵ log(1/δ) 1 (this case

will always holds since in practice we select ϵ = 0.1 5

and δ = 1 n), by Taylor expansion of 1 + x, we have (

log(1/δ) + ϵ

log(1/δ))2 ϵ2 4 log(1/δ). Thus in to-

tal, we have σ2 t 128G2(n T1) log(1/δ)

α2t2ϵ2 . The following theorem shows an upper bound on the regret of Algorithm 1, which can be transformed to generalized error (we will show it in the following section).

Theorem 2. Under the assumptions in Theorem 1 and the additional condition of ϵ log 1

δ 1, Algorithm 1 has the following upper bound on the regret of its outputs

RA(n, D) O( G2

α2 C 2 + G2 log n

with probability at least 1 ζ.

Remark 1. We note that (Jain, Kothari, and Thakurta 2012) used the differentially private version of GIGA and IGD (Kulis and Bartlett ) in their DP pointwise learning. But their Private GIGA or IGD algorithm is quite different from our method of On Pair Str C (Algorithm 1). Firstly, (Jain, Kothari, and Thakurta 2012) needs to assume that each loss function ˆLt is independent (see the proofs of Lemma 4 and Lemma 5 in (Jain, Kothari, and Thakurta 2012)), which means that it is only applicable to pointwise loss functions. However, in our problem, the penalty function (3) depends on previous data records, which means that it is much more complicated than the case in (Jain, Kothari, and Thakurta 2012). Thus, we need a much ﬁner and more different analysis on the stability of Pairwise GIGA. Also, the parameters of the step size ηt and time step T1 are quite different from those in (Jain, Kothari, and Thakurta 2012). Additionally, in order to show the power of our method, we also consider the case with additional ﬁnite buffer constraint, which has not been studied in (Jain, Kothari, and Thakurta 2012). Thus, our method is more general. Secondly, the upper bound (7) on the regret of our algorithm is less than that in (Jain, Kothari, and Thakurta 2012) with a factor of log n

δ . This is due to the fact that we use the composition property of z CDP instead of advanced composition theorem of DP (Dwork, Rothblum, and Vadhan 2010). Thirdly, since the deﬁnition of regret in our paper is different from that in pointwise learning (Jain, Kothari, and Thakurta 2012), the same upper bound (i.e., O(

dn ϵ )) on the regret for strongly convex losses are actually incomparable.

We now use the perturbation strategy in (Thakurta and Smith 2013) to obtain result for general convex losses.

Theorem 3. Let ℓbe a pairwise loss function satisfying Assumption 1. Then, for any 0 < ϵ, δ 1, Algorithm 2 is (ϵ, δ)- DP. Moreover, if ϵ log 1

δ 1 and take α = O( 1

4 n), then with probability at least 1 ζ, the following upper bound on regret for the outputs holds:

RA(n, D) O( L2G2 C 2 2

Algorithm 2 Online Pairwise Private GIGA-Convex (On Pair C)

1: Input: Privacy parameters ϵ and δ, sequence of data records {z1, z2, , zn}, constrained convex set C, pairwise loss ℓ( ; , ), and a parameter α to be deﬁned later. 2: Parameters: ℓis G-Lipschitz, L-smooth and convex over w. 3: Randomly select a point w0 C. Let ℓ(w; z, z ) = ℓ(w; z, z ) + α

2 w w0 2 2.

4: Run Algorithm 1 with loss ℓ, which is G = G+α C 2Lipschitz, L = L + α-smooth and α-strongly convex.

Comparing (8) with (7), we can see that for strongly convex pairwise loss functions, the average regret, i.e., RA(n)

upper bounded by O(

d nϵ), while for general convex ones,

d 4 nϵ). This is the same as in the case of pointwise loss functions (Thakurta and Smith 2013).

Ofﬂine Private Pairwise Learning

Generalization Error Induced by Generalized Regret

We ﬁrst observe that Algorithm 1 and 2 preserve (ϵ, δ)- DP in the ofﬂine settings. Also, as discussed in (5) and (6), if we get the generalized regret for the output {w1, w2, , wn 1}, we can easily obtain a generalization error by (6). By a theorem in (Kar et al. 2013), we can have the following generalization bounds for w = w1+ +wn 1

n 1 of Algorithm 1 and 2. Before this, we ﬁrst let the Rademacher averages of the pairwise loss functions class ℓ C := {(z, z ) ℓ(w; z, z ), w C} be denoted by the following (Kar et al. 2013):

Rn(ℓ C) = E[sup w C

i=1 ϵiℓ(w; z, zi)], (9)

where {ϵi}n i=1 are the Rademacher variables, and the expectation is over {ϵi}n i=1, z, {zi}n i=1.

Theorem 4. Under Assumption 1, the parameter w = w1+ +wn 1

n 1 satisﬁes the following generalization error for loss function ℓwith probability at least 1 2ζ if ϵ log 1

δ 1, where w1, w2, , wn 1 are the outputs of Algorithm 2 (Algorithm 1 for strongly convex loss functions),

Err P( w) O n t=2 Rt 1(ℓ C)

Moreover, if the loss is α-strongly convex, then we have:

Err P( w) O 1 n 1

t=2 Rt 1(ℓ C)+

Remark 2. Note that there are many problems whose Rademacher average is Rn(ℓ C) = O(

d n), e.g. Example 1 and 2 (Kar et al. 2013). Thus for Example 1, the generalization error is O( d ϵ 4 n) for logistic loss while it is O( d ϵ n) if adding an additional Frobenious regularization to the losses. Similar result holds for Example 2, where the generalization error is O(

d ϵ 4 n) while it is O(

d ϵ n) in the case with additional ℓ2-norm regularization.

Improved Upper Bounds by Ofﬂine Differentially Private Algorithms

Inspired by the sensitivity of Pairwise GIGA in Lemma 4 and Theorem 4, we propose an ofﬂine DP algorithm which has better upper bounds compared to (10) and (11). The basic idea is to use output perturbation. Speciﬁcally, we ﬁrst run Pairwise GIGA in the ofﬂine settings and then add some Gaussian noises to w = w1+ +wn

n to keep the algorithm (ϵ, δ)-DP, since the sensitivity of w is based on each wi, which can be obtained by Lemma 4. For general convex loss functions, we can still use the perturbation idea, which is the same as in Algorithm 2. See Algorithm 3 and 4 for details. The reason that we can improve the generalization error is due to the following fact. From Algorithms 1 and 2, we can see that the output sequences {w1, w2, , wn 1} satisfy the conditions of (ϵ, δ)-DP in each iteration. However, in the ofﬂine setting, we only need to ensure that the ﬁnal output is DP. Thus, instead of adding noise in each iteration, we can add noises only once to the ﬁnal output, meaning that we can add a smaller scale of noises compared to the online ones.

Theorem 5. For any 0 < ϵ, δ 1, Algorithm 3 is (ϵ, δ)-DP for any α-strongly convex loss functions satisfying Assumption 1. Moreover, if ϵ log 1

δ 1, then with probability at least 1 2ζ, the output ˆw satisﬁes:

Err P( ˆw) O(

d G2 C 2 log n

t=1 Rt(ℓ C)).

(12) Algorithm 4 is (ϵ, δ)-DP for any convex loss functions satisfying Assumption 1 if α = O( 1 n). Moreover, if ϵ log 1

δ 1, then with probability at least 1 2ζ, the output ˆw satisﬁes:

Err P( ˆw) O

d G2 C 2 2 log n

t=1 Rt(ℓ C) . (13)

From Theorem 5, we can see that for strongly and general convex loss functions, the bounds in (13) and (12) are respectively lower than those in (10) and (11). Speciﬁcally, for general convex loss functions, we can improve the upper bound from O(

d ϵ 4 n) to O(

d ϵ n) if Rn(ℓ C) = O(

Algorithm 3 Ofﬂine Pairwise Private GIGA-Strongly Convex (Off Pair Str C)

1: Input: Privacy parameters ϵ and δ, sequence of data {z1, z2, , zn}, constrained convex set C, pairwise loss ℓ( ; , ), and step number T1 = max{ 16L2

α2 , 7}. 2: Parameters: ℓis G-Lipschitz, L-smooth and α-strongly convex over w. 3: Randomly sample w1, , w T1 C. 4: for t = T1 + 1, , n do 5: Set step size ηt = t 1

t 2 2 αt. 6: Update

wt = arg min w C w (wt 1 ηt ˆLt(wt 1, Dt)) 2 2.

7: end for 8: Let w = w1+ +wn

n . 9: Denote w = w + σ, where σ N(0, 128G2 log2 n log(1.25/δ)

α2n2ϵ2 Id). 10: Return ˆw = arg minw C w w 2 2.

Algorithm 4 Pairwise Private GIGA-Convex (Off Pair C)

1: Input: Privacy parameters ϵ and δ, sequence of data {z1, , zn}, constrained convex set C, pairwise loss function ℓ( ; , ), and a parameter α to be deﬁned later. 2: Parameters: ℓis G-Lipschitz, L-smooth and convex over w. 3: Let ℓ(w; z, z ) = ℓ(w; z, z ) + α

2 w w0 2 2, w0 is any point in C. 4: Run Algorithm 3 with loss ℓ, which is G = G+α C 2Lipschitz, L = L + α-smooth and α-strongly convex.

Experiments In this section, we empirically evaluate the performance of the proposed differentially private algorithms on real-world datasets. We take two popular pairwise learning tasks, i.e., AUC maximization and metric learning, as examples. All of the experiments in this paper are conducted over 20 runs of different random permutations for each adopted dataset, and we report the averaged results.

Experimental Setup Datasets We use four real-world datasets that are widely adopted in pairwise learning tasks. These datasets are the Diabetes dataset, the Diabetic Retinopathy dataset, the Hepatitis dataset and the Cancer dataset (Dua and Graff 2017).

Performance Measures To evaluate the performance of the proposed algorithms, we use the following measures:

1. AUC: For AUC maximization task, we report the AUC measurement (Zhao et al. 2011) for each of the proposed algorithms over every adopted dataset. A larger AUC value means that the corresponding AUC maximization algorithm can generate more accurate results. 2. Classiﬁcation Accuracy: For metric learning task, we calculate the classiﬁcation accuracy that is deﬁned as the percentage of the correctly classiﬁed samples in the test set. The less the classiﬁcation accuracy, the worse the performance of the proposed algorithm. In this paper, the KNN classiﬁer is adopted to assign labels to the test samples. For the KNN classiﬁer, we set K to be 3. 3. Objective function value: For both metric learning task and AUC maximization task, we also report the objective function value of the proposed differentially private algorithms. A smaller objective function value means that the original pairwise learning model is less perturbed.

Baselines Since there is no existing work that addresses the privacy issue in pairwise learning, in experiments, we take the original pairwise learning algorithms that do not take any actions to protect the private information as the baselines. We denote the baseline methods as Non Private, which is the GIGA for pairwise loss functions (Kar et al. 2013).

Experiments for AUC Maximization We ﬁrst evaluate the performance of the proposed differentially private pairwise learning algorithms (i.e., On Pair Str C, On Pair C, Off Pair Str C and Off Pair C) for AUC maximization task (see Example 2 for the problem formulation). We add additional ℓ2 regularization λ

2 w 2 2 with λ = 10 3 to loss function for the strongly convex case. We study the effect of the training size n and the privacy parameter ϵ on the performance of the proposed On Pair Str C, On Pair C, Off Pair Str C and Off Pair C algorithms. Here we ﬁx δ = 1

n and consider three cases where the value of parameter ϵ is set to be 0.5, 1.5 and 2.5, respectively. For On Pair Str C and Off Pair Str C, we vary the training size from 40 to 90 and conduct the experiment on the Hepatitis and Cancer datasets. For On Pair C and Off Pair C, the experiment is conducted on the Diabetes and Diabetic Retinopathy datasets and we vary the training size from 50 to 350. In Figure 1 and Figure 2, we respectively report the objective values of On Pair Str C and On Pair C. The experimental results show that the larger the value of the training size n, the smaller the objective value. Additionally, when n is ﬁxed, the smaller the value of ϵ, the larger the objective value is. The performance of the proposed algorithms are comparable with that of the baseline, which can be observed from Figure 2. The results for Off Pair Str C and Off Pair C are shown in Figure 3 and Figure 4, respectively. Figure 3 shows the objective value of Off Pair Str C when the training size varies and Figure 4 reports the AUC measurement of Off Pair C. The results in the two ﬁgures also show that the larger the training size is or privacy parameter ϵ is, the higher the AUC measurement value is, which means that the proposed algorithm is less perturbed and more accurate. These experimental results verify that the proposed online differential private algorithms can

achieve good utility while guaranteeing strong privacy protection when they are applied to AUC maximization task.

=0.5 =1.5 =2.5 0.68

Objective value

(a) Hepatitis

=0.5 =1.5 =2.5 0.63

Objective value

Figure 1: The objective value of On Pair Str C for AUC maximization.

50 100 150 200 250 300 350 Training size

Objective value

(a) Diabetes

50 100 150 200 250 300 350 Training size

Objective value

(b) Diabetic Retinopathy

Figure 2: The objective value of On Pair C for AUC maximization.

=0.5 =1.5 =2.5 0.65

Objective value

(a) Hepatitis

=0.5 =1.5 =2.5 0.49

Objective value

Figure 3: The objective value of Off Pair Str C for AUC maximization.

Experiments for Metric Learning Next, we evaluate the performance of the proposed differentially private pairwise learning algorithms for the metric learning task (see Example 1 for the problem formulation). We add additional Frobenius norm λ

2 W 2 F to the loss function for the strongly convex case, where λ = 10 3. Similar to the experiments for AUC maximization, we evaluate the effect of the privacy parameter ϵ and the training size n. Due to the space limit, in this section, we only report the experimental results for general convex pairwise learning algorithms, i.e., On Pair C and Off Pair C. In these experiments, the value of δ is ﬁxed as 1 n, and we consider three cases where the parameter ϵ is set to be

=0.5 =1.5 =2.5

(a) Diabetes

=0.5 =1.5 =2.5

(b) Diabetic Retinopathy

Figure 4: The AUC measurement of Off Pair C.

=0.5 =1.5 =2.5 0.65

Objective value

(a) Diabetes

=0.5 =1.5 =2.5 0.65

Objective value

(b) Diabetic Retinopathy

Figure 5: The objective value of On Pair C for metric learning task under different training sizes.

50 100 150 200 250 300 350 Training size

(a) Diabetes

50 100 150 200 250 300 350 Training size

(b) Diabetic Retinopathy

Figure 6: The classiﬁcation accuracy of Off Pair C for metric learning task under different training sizes.

0.5, 1.5 and 2.5, respectively. We ﬁrst calculate the objective value of On Pair C when the training size varies from 50 to 350, and the results on the Diabetes and Diabetic Retinopathy datasets are shown in Figure 5. As for the ofﬂine algorithm Off Pair C, we report the classiﬁcation accuracy in Figure 6. As we can see, the derived experimental results are similar to that for AUC maximization. The proposed algorithms perform competitively with the baseline when we vary the values of n and ϵ.

Conclusions In this paper, we consider the pairwise learning problems in both online and ofﬂine settings. For the online setting, we ﬁrst propose an (ϵ, δ)-DP algorithm (called On Pair Str C) for the strongly convex loss functions, and then extend this algorithm to general convex loss functions by proposing another differentially private algorithm (called On Pair C). For the ofﬂine setting, we also propose two differentially private algorithms (called Off Pair Str C and Off Pair C) for strongly convex loss functions and general convex loss functions, re-

spectively, and then give their regret upper bounds. The experimental results on real-world datasets not only conﬁrm our theoretical analysis but also demonstrate the effectiveness of the proposed algorithms in real-world applications.

Acknowledgments This work is supported in part by the US National Science Foundation under grants IIS-1924928, IIS-1938167, IIS1910492, CCF-1716400, and OAC-1934600. Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reﬂect the views of the National Science Foundation.

References Bassily, R.; Smith, A.; and Thakurta, A. 2014. Private empirical risk minimization: Efﬁcient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS). Bun, M., and Steinke, T. 2016. Concentrated differential privacy: Simpliﬁcations, extensions, and lower bounds. In TCC, 635 658. Springer. Cao, Q.; Guo, Z.-C.; and Ying, Y. 2016. Generalization bounds for metric and similarity learning. Machine Learning. Chaudhuri, K., and Monteleoni, C. 2009. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems. Chaudhuri, K.; Monteleoni, C.; and Sarwate, A. D. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research. Cl emenc on, S.; Lugosi, G.; Vayatis, N.; et al. 2008. Ranking and empirical minimization of u-statistics. The Annals of Statistics. Dua, D., and Graff, C. 2017. UCI machine learning repository. Dwork, C.; Mc Sherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In TCC. Springer. Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In FOCS, 51 60. Fredrikson, M.; Jha, S.; and Ristenpart, T. Model inversion attacks that exploit conﬁdence information and basic countermeasures. In CCS. Huai, M.; Miao, C.; Li, Y.; Suo, Q.; Su, L.; and Zhang, A. 2018a. Metric learning from probabilistic labels. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, 1541 1550. Huai, M.; Miao, C.; Suo, Q.; Li, Y.; Gao, J.; and Zhang, A. 2018b. Uncorrelated patient similarity learning. In Proceedings of the 2018 SIAM International Conference on Data Mining, 270 278. SIAM. Huai, M.; Xue, H.; Miao, C.; Yao, L.; Su, L.; Chen, C.; and Zhang, A. 2019. Deep metric learning: the generalization analysis and an adaptive algorithm. In Proceedings

of the 28th International Joint Conference on Artiﬁcial Intelligence, 2535 2541. AAAI Press. Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Conference on Learning Theory. Jin, R.; Wang, S.; and Zhou, Y. 2009. Regularized distance metric learning: Theory and algorithm. In NIPS, 862 870. Kar, P.; Sriperumbudur, B.; Jain, P.; and Karnick, H. 2013. On the generalization ability of online learning algorithms for pairwise loss functions. In International Conference on Machine Learning, 441 449. Kulis, B., and Bartlett, P. L. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Lee, J., and Kifer, D. 2018. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Natole, M.; Ying, Y.; and Lyu, S. 2018. Stochastic proximal algorithms for auc maximization. In International Conference on Machine Learning. Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy. Sun, J.; Wang, F.; Hu, J.; and Edabollahi, S. 2012. Supervised patient similarity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter 14(1). Suo, Q.; Zhong, W.; Ma, F.; Ye, Y.; Huai, M.; and Zhang, A. 2018. Multi-task sparse metric learning for monitoring patient similarity progression. In 2018 IEEE International Conference on Data Mining (ICDM), 477 486. IEEE. Tang, J., and Wang, K. 2018. Ranking distillation: Learning compact ranking models with high performance for recommender system. In Proc. of the 24th SIGKDD International Conference on Knowledge Discovery & Data Mining. Thakurta, A. G., and Smith, A. 2013. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, 2733 2741. Wang, D., and Xu, J. 2019. Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view. Thirty-Third AAAI Conference on Artiﬁcial Intelligence, (AAAI-19), Honolulu, Hawaii, USA, January 27-February 1, 2019. Wang, D.; Chen, C.; and Xu, J. 2019. Differentially private empirical risk minimization with non-convex loss functions. In Proceedings of the 36th International Conference on Machine Learning. Long Beach, California, USA: PMLR. Wang, D.; Ye, M.; and Xu, J. 2017. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS. Zhao, P.; Hoi, S. C.; Jin, R.; and Yang, T. 2011. Online auc maximization. In ICML, 233 240. Zinkevich, M. 2003. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proc. of the 20th International Conference on Machine Learning (ICML-03).