# Label-Only Membership Inference Attacks

Christopher A. Choquette-Choo 1, Florian Tramèr 2, Nicholas Carlini 3, Nicolas Papernot 1

1 University of Toronto and Vector Institute, 2 Stanford University, 3 Google. Correspondence to: Christopher A. Choquette-Choo. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Membership inference is one of the simplest privacy threats faced by machine learning models that are trained on private sensitive data. In this attack, an adversary infers whether a particular point was used to train the model, or not, by observing the model's predictions. Whereas current attack methods all require access to the model's predicted confidence score, we introduce a label-only attack that instead evaluates the robustness of the model's predicted (hard) labels under perturbations of the input, to infer membership. Our label-only attack is not only as effective as attacks requiring access to confidence scores, it also demonstrates that a class of defenses against membership inference, which we call confidence masking because they obfuscate the confidence scores to thwart attacks, are insufficient to prevent the leakage of private information. Our experiments show that training with differential privacy or strong ℓ2 regularization are the only current defenses that meaningfully decrease leakage of private information, even for points that are outliers of the training distribution.

1. Introduction

Machine learning algorithms are often trained on sensitive or private user information, e.g., medical records (Stanfill et al., 2010), conversations (Devlin et al., 2018), or financial information (Ngai et al., 2011). Trained models can inadvertently leak information about their training data (Shokri et al., 2016; Carlini et al., 2019), violating users' privacy. In perhaps the simplest form of information leakage, membership inference (MI) attacks (Shokri et al., 2016) enable an adversary to determine whether or not a data point was used in the training data. Revealing just this information can cause harm: it leaks information about specific individuals instead of the entire population. Consider a model trained to learn the link between a cancer patient's morphological data and their reaction to some drug. An adversary with a victim's morphological data and query access to the trained model cannot directly infer if the victim has cancer. However, inferring that the victim's data was part of the model's training set reveals that the victim indeed has cancer.

Existing MI attacks exploit the higher confidence that models exhibit on their training data (Pyrgelis et al., 2017; Truex et al., 2018; Hayes et al., 2019; Salem et al., 2018). An adversary queries the model on a candidate data point to obtain the model's confidence and infers the candidate's membership in the training set based on a decision rule. The difference in prediction confidence is largely attributed to overfitting (Shokri et al., 2016; Yeom et al., 2018).

A large body of work has been devoted to understanding and mitigating MI leakage in ML models. Existing defense strategies fall into two broad categories and either (1) reduce overfitting (Truex et al., 2018; Shokri et al., 2016; Salem et al., 2018); or (2) perturb a model's predictions so as to minimize the success of known membership attacks (Nasr et al., 2018a; Jia et al., 2019; Yang et al., 2020).
Defenses in (1) use regularization techniques or increase the amount of training data to reduce overfitting. In contrast, the adversary-aware defenses of (2) explicitly aim to minimize the MI advantage of a particular attack. They do so either by modifying the training procedure (e.g., an additional loss penalty) or the inference procedure after training. These defenses implicitly or explicitly rely on a strategy that we call confidence masking (similar to gradient masking from the adversarial examples literature (Papernot et al., 2017)), where the MI signal in the model's confidence scores is masked to thwart existing attacks.

We introduce label-only MI attacks. Our attacks are more general: an adversary need only obtain the trained model's (hard) labels, without prediction confidences. This threat model is more realistic, as ML models deployed in user-facing products need not expose raw confidence scores. Thus, our attacks can be mounted on any ML classifier.

In the label-only setting, a naive baseline predicts misclassified points as non-members. Our focus is surpassing this baseline. To this end, we will have to make multiple queries to the target model. We show how to extract fine-grained MI signal by analyzing a model's robustness to perturbations of the target data, which reveals signatures of its decision boundary geometry. Our adversary queries the model for predicted labels on augmentations of data points (e.g., translations in vision domains) as well as adversarial examples.

We make the following contributions. In Section 5.1, we introduce the first label-only attacks, which match confidence-vector attacks; by combining them, we outperform all others. In Sections 5.2, 5.3, and 5.4, we show that confidence masking is not a viable defense to privacy leakage, by breaking two canonical defenses that use it: MemGuard and Adversarial Regularization. In Section 6, we evaluate two additional techniques for reducing overfitting, data augmentation and transfer learning. We find that data augmentation can worsen MI leakage while transfer learning can mitigate it. In Section 7, we introduce outlier MI: a stronger property that defenses should satisfy to protect the membership of worst-case inputs; at present, differentially private training and (strong) ℓ2 regularization appear to be the only effective defenses. Our code is available at https://github.com/cchoquette/membership-inference.

2. Background and Related Work

Membership inference attacks (Shokri et al., 2016) are a form of privacy leakage that identifies whether a given data sample was in a machine learning model's training dataset. Given a sample x and access to a trained model h, the adversary uses a classifier or decision rule f to compute a membership prediction f(x; h) ∈ {0, 1}, with the goal that f(x; h) = 1 whenever x is a training point. The main challenge in mounting a MI attack is creating the attack classifier f, under various assumptions about the adversary's knowledge of h and its training data distribution. Most prior work assumes that an adversary has only black-box access to the trained model h, via a query interface that on input x returns part or all of the confidence vector h(x) ∈ [0, 1]^C (for a classification task with C classes). The attack classifier f is often trained on a local shadow (or source) model ĥ, which is trained on the same (or a similar) distribution as h's training data. Because the adversary trained ĥ, they can assign membership labels to any input x, and use this dataset to train f.
Salem et al. (2018) later showed that this strategy succeeds even when the adversary only has data from a different, but similar, task, and that shadow models are unnecessary: a threshold predicting f(x; h) = 1 when the maximum prediction confidence, max_i h(x)_i, exceeds a tuned threshold suffices. Yeom et al. (2018) investigate how querying inputs x' related to x can improve MI. Song et al. (2019) explore how models explicitly trained to be robust to adversarial examples can become more vulnerable to MI (similar to our analysis of data augmentation in Section 6). Both works are crucially different from ours because they use a different attack methodology and assume access to the confidence scores. Sablayrolles et al. (2019) demonstrate that black-box attacks (like ours) can approximate white-box attacks by effectively estimating the model loss for a data point. Refer to Appendix A for a detailed background, including on defenses.

3. Attack Model Design

Our proposed MI attacks improve on existing attacks by (1) combining multiple strategically perturbed samples (queries) into a fine-grained signal of the model's decision boundary, and (2) operating in a label-only regime. Thus, our attacks pose a threat to any query-able ML service.

3.1. A Naive Baseline: The Gap Attack

Label-only MI attacks face a challenge of granularity. For any query x, our attack model's information is limited to only the predicted class label, argmax_i h(x)_i. A simple baseline attack (Yeom et al., 2018) that predicts any misclassified data point as a non-member of the training set is a useful benchmark to assess the extra (non-trivial) information that MI attacks, label-only or otherwise, can extract. We call this baseline the gap attack because its accuracy is directly related to the gap between the model's accuracy on training data (acc_train) and held-out data (acc_test):

1/2 + (acc_train - acc_test)/2,    (1)

where acc_train, acc_test ∈ [0, 1]. To exploit additional leakage on top of this baseline attack (i.e., to achieve non-trivial MI), any label-only adversary must necessarily make additional queries to the model. To the best of our knowledge, this trivial baseline is the only attack proposed in prior work that uses only the predicted label, y = argmax_i h(x)_i.

3.2. Attack Intuition

Our strategy is to compute label-only proxies for the model's confidence by evaluating its robustness to strategic input perturbations of x, either synthetic (i.e., data augmentations) or adversarial (i.e., adversarial examples) (Szegedy et al., 2013). Following a max-margin perspective, we predict that data points that exhibit high robustness are training data points. Works in the adversarial example literature share a similar perspective: non-training points are closer to the decision boundary and thus more susceptible to perturbations (Tanay & Griffin, 2016; Tian et al., 2018; Hu et al., 2019).

Our intuition for leveraging robustness is two-fold. First, models trained with data augmentation have the capacity to overfit to the augmented samples (Zhang et al., 2016). Thus, we probe the effective train-test gap on the augmented dataset by evaluating x together with its augmentations, giving us a more fine-grained MI signal. Second, for models not trained using augmentation, their robustness to perturbations can be a proxy for model confidence. Consider the special case of (binary) logistic regression models, with a learned weight vector w and bias b: the model outputs a confidence score for the positive class of the form h(x) := σ(w · x + b), where σ(t) = 1/(1 + e^{-t}) ∈ (0, 1) is the logistic function.
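To make this proxy concrete, note that inverting the logistic function recovers the model's (unnormalized) margin from its confidence score:

$$\sigma^{-1}\big(h(x)\big) = \log\frac{h(x)}{1 - h(x)} = w \cdot x + b.$$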
Here, there is a monotone relationship between the confidence at x and the Euclidean distance to the model's decision boundary. This distance is (w · x + b)/||w||_2 = σ^{-1}(h(x))/||w||_2. Thus, obtaining a point's distance to the boundary yields the same information as the confidence score. Computing this distance is exactly the problem of finding the smallest adversarial perturbation, which can be done using label-only access to a classifier (Brendel et al., 2017; Chen et al., 2019). Our thesis is that this relationship will persist for deep, non-linear models. This thesis is supported by prior work suggesting that deep neural networks can be closely approximated by linear functions in the vicinity of the data (Goodfellow et al., 2014).

3.3. Data Augmentation

Our data augmentation attacks create a MI classifier f(x; h) for a model h. Given a target point (x_0, y_true), the adversary trains f to output f(x_0; h) = 1 if x_0 was a training member. To do this, they tune f to maximize MI accuracy on a source (or shadow) model, assuming knowledge of the target model's architecture and training data distribution. They then transfer f to perform MI by querying the black-box model h. Using x_0, we create additional data points {x̂_1, ..., x̂_N} via different data augmentation strategies, described below. We query the target model h (ĥ during tuning) to obtain labels (y_0, y_1, ..., y_N) = (h(x_0), h(x̂_1), ..., h(x̂_N)). Let b_i = I(y_true ≠ y_i) be the indicator for whether the i-th queried point was misclassified. Finally, we apply f(b_0, ..., b_N) ∈ {0, 1} to classify x_0.

We experiment with two common data augmentations in the computer vision domain: image rotations and translations. For rotations, we generate N = 3 images as rotations by a magnitude r (i.e., by ±r), for r ∈ [1, 15]. For translations, we generate N = 4d + 1 translated images satisfying |i| + |j| = d for a pixel bound d, where we shift horizontally by i pixels and vertically by j pixels. In both cases we include the source image.

3.4. Decision Boundary Distance

These attacks predict membership using a point's distance to the model's decision boundary. Here, we extend the intuition that this distance can be a proxy for the confidence of linear models (see Section 3.2) to deep neural networks. Recall that confidence-thresholding attacks predict highly confident samples as members (Salem et al., 2018). Given some estimate dist_h(x, y) of a point's ℓ2-distance to the model's boundary, we predict x to be a member if dist_h(x, y) > τ for some threshold τ. We define dist_h(x, y) = 0 for misclassified points, where argmax_i h(x)_i ≠ y, because no perturbation was needed. We tune τ on a shadow model ĥ, and find that even crude estimates, e.g., from Gaussian noise, can lead to nearly comparable attacks (see Section 5.5). We now discuss methods for estimating dist_h(x, y).

A White-Box Baseline for estimating dist_h(x, y) is an idealized white-box attack and is therefore not label-only. We use adversarial examples generated by the Carlini and Wagner attack (Carlini & Wagner, 2017): given (x, y), the attack tries to find the point x' closest to x in the Euclidean norm such that argmax_i h(x')_i ≠ y.

Label-only attacks use only black-box access. We rely on label-only adversarial example attacks (Brendel et al., 2017; Chen et al., 2019). These attacks start from a random point x' that is misclassified, i.e., argmax_i h(x')_i ≠ y. They then walk along the boundary while minimizing the distance to x. We use HopSkipJump (Chen et al., 2019), which closely approximates stronger white-box attacks.
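Concretely, the membership decision rule of this attack reduces to a single threshold test once a label-only boundary search is available. The following sketch is schematic rather than an excerpt of our released code: query_label and find_boundary_point are placeholder names standing in for the target model's hard-label interface and for a decision-based attack such as HopSkipJump, and tau is the threshold tuned on the local shadow model.

```python
import numpy as np

def boundary_distance_attack(query_label, find_boundary_point, x, y_true, tau):
    """Label-only membership decision rule (schematic sketch).

    query_label(x)            -> hard label predicted by the target model for x
    find_boundary_point(x, y) -> a nearby misclassified point, found by a
                                 decision-based attack (e.g., HopSkipJump) that
                                 itself uses only hard-label queries
    tau                       -> distance threshold tuned on a local shadow model
    """
    # Misclassified points get distance 0: no perturbation is needed to flip the label.
    if query_label(x) != y_true:
        dist = 0.0
    else:
        x_boundary = find_boundary_point(x, y_true)
        dist = float(np.linalg.norm((x_boundary - x).ravel()))  # estimated L2 distance
    # Points far from the boundary are predicted to be training members.
    return int(dist > tau)
```

Substituting a white-box perturbation (e.g., the Carlini and Wagner attack) for find_boundary_point yields the white-box baseline described above.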
Robustness to random noise is a simpler approach based on random perturbations. Again, our intuition stems from linear models: a point's distance to the boundary is directly related to the model's accuracy when the point is perturbed by isotropic Gaussian noise (Ford et al., 2019). We compute a proxy for dist_h(x, y) by evaluating the accuracy of h on N points x̂_i = x + N(0, σ²I), where σ is tuned on ĥ. For binary features we instead use Bernoulli noise: each feature x_j of x is flipped with probability p, which is tuned on ĥ.

Many signals for robustness can be combined to improve attack performance. We evaluate dist_h(x, y) for augmentations of x from Section 3.3. We only evaluate this combined attack where indicated, due to its high query cost (see Section 5.5).

4. Evaluation Setup

Our evaluation is aimed at understanding how label-only MI attacks compare with prior attacks that rely on access to a richer query interface. To this end, we use an identical evaluation setup as prior works (Nasr et al., 2018b; Shokri et al., 2016; Long et al., 2017; Truex et al., 2018; Salem et al., 2018) (see Appendix B). We answer the following questions in our evaluation (Sections 5, 6, and 7):

1. Can label-only MI attacks match prior attacks that use the model's (full) confidence vector?
2. Are defenses against confidence-based MI attacks always effective against label-only attacks?
3. What is the query complexity of label-only attacks?
4. Which defenses prevent both label-only and full confidence-vector attacks?

To evaluate an attack's success, we pick a balanced set of points from the task distribution, of which half come from the target model's training set. We measure attack success as overall MI accuracy, but find F1 scores to approximately match, with near 100% recall. See Supplement B.2 for further discussion of this evaluation. Overall, we stress that our main goal is to show that in settings where MI attacks have been shown to succeed, a label-only query interface is sufficient. In general, we should not expect our label-only attacks to exceed the performance of prior MI attacks, since the former use strictly less information from queries than the latter. There are three notable exceptions: our combined attack (Section 5.1), confidence-masking defenses (Section 5.2), and models trained with significant data augmentation (Section 6.1). (The combined attack's performance exceeds prior confidence-vector attacks; we do not test its confidence-vector analog, but our results indicate that it should perform comparably.) In the latter two cases, we find that existing attacks severely underestimate MI.

4.1. Attack Setup

We provide a detailed account of model architectures and training procedures in Supplement B.1, and of our threat model in Supplement C. We evaluate our attacks on 7 datasets used by the canonical work of Shokri et al. (2016). These include 3 computer vision tasks (MNIST, CIFAR-10, and CIFAR-100; available at https://www.tensorflow.org/api_docs/python/tf/keras/datasets), which are our main focus because of the importance of data augmentation to them, and 4 non-computer-vision tasks (the Adult dataset, http://archive.ics.uci.edu/ml/datasets/Adult, and the Texas-100, Purchase-100, and Locations datasets, https://github.com/privacytrustlab/datasets) to showcase our attacks' transferability. We train target neural networks on subsets of the original training data, exactly as performed by Shokri et al. (2016) and several later works (in both data amount and train-test gap). Controlling the training set size lets us control the amount of overfitting, which strongly influences the strength of MI attacks (Yeom et al., 2018). Prior work has almost exclusively studied (confidence-based) MI attacks on these small datasets, where models exhibit a high degree of overfitting.
Recall that our goal is to show that label-only attacks match confidence-based approaches; scaling MI attacks (whether confidence-based or label-only) to larger training datasets is an important area of future work.

5. Evaluation of Label-Only Attacks

5.1. Label-Only Attacks Match Confidence-Vector Attacks

We first focus on question 1). Recall from Section 3.1 that any label-only attack (with knowledge of y) is always trivially lower-bounded by the baseline gap attack of Yeom et al. (2018), which predicts any misclassified point as a non-member. Our main result is that our label-only attacks consistently outperform the gap attack and perform on par with prior confidence-vector attacks; by combining attacks, we can even surpass the canonical confidence-vector attacks.

Observing Figure 1 and Table 1 (a) and (c), we see that the confidence-vector attack outperforms the baseline gap attack, demonstrating that it exploits non-trivial MI. Remarkably, we find that our label-only boundary distance attack performs at least on par with the confidence-vector attack. Moreover, our simpler but more query-efficient (see Section 5.5) data augmentation attacks also consistently outperform the baseline, but fall short of the confidence-vector attacks. Finally, by combining these two label-only attacks, we can consistently outperform every other attack. These models were not trained with data augmentation; in Section 6.1, we find that when they are, our data augmentation attacks outperform all others. We also verify that as the training set size increases, all attacks monotonically decrease in accuracy because the train-test gap is reduced. Note that on CIFAR-100, we experiment with the largest training subset possible, 30,000 data points, since we use the other half as the source model training set (and as target model non-members).

Figure 1. Accuracy of MI attacks on CIFAR-10 (attack accuracy, %, vs. number of training data points). We evaluate 100 to 10,000 training points and compare the baseline gap attack, the confidence-vector attack that relies on a fine-grained query interface, and our label-only attacks based on data augmentation and distance to the decision boundary. We also show the confidence-vector attack performance against MemGuard, noting that our label-only performances remain nearly unaltered. For the data augmentation attack, we report the best accuracy across multiple values of r (rotation angle) and d (number of translated pixels).

Beyond Images. We show that our label-only attacks can be applied outside of the image domain in Table 2. Our label-only attack evaluates a model's accuracy under random perturbations, adding Gaussian noise to continuous-featured inputs and flipping binary values according to Bernoulli noise (see Section 3.4). Using 10,000 queries, our attacks closely match (with at most a 4 percentage-point degradation) confidence-based attacks. Note that our attacks could also be instantiated in audio or natural language domains, using existing adversarial example attacks (Carlini & Wagner, 2018) and data augmentations (Zhang et al., 2015).

5.2. Breaking Confidence Masking Defenses

Answering question 2), we showcase an example where our label-only attacks outperform prior attacks by a significant margin, despite the strictly more restricted query interface that they assume. We evaluate defenses against MI attacks and show that, while these defenses do protect against existing confidence-vector attacks, they have little to no effect on our label-only attacks.
Because any ML classifier providing confidences also provides the predicted labels, our attacks fall within these defenses' threat model, refuting their security claims. We identify a common pattern in these defenses that we call confidence masking: they aim to prevent MI by directly minimizing the privacy leakage in a model's confidence scores. To this end, confidence-masking defenses explicitly or implicitly mask (or obfuscate) the information contained in the model's confidences (e.g., by adding noise) to thwart existing attacks. These defenses, however, have a minimal effect on the model's predicted labels. MemGuard (Jia et al., 2019) and prediction purification (Yang et al., 2020) explicitly maintain the invariant that the model's predicted labels are not affected by the defense, i.e.,

∀x: argmax h(x) = argmax h_defense(x),

where h_defense is the defended version of the model h. An immediate issue with the design of confidence-masking defenses is that, by construction, they will not prevent any label-only attack. Yet, these defenses were reported to drive the success rates of existing MI attacks to within chance. This result suggests that prior attacks fail to properly extract the membership information contained in the model's predicted labels, and implicitly contained within its scores. Our label-only attack performances clearly indicate that confidence masking is not a viable defense strategy against MI.

We show that two peer-reviewed defenses, MemGuard (Jia et al., 2019) and adversarial regularization (Nasr et al., 2018a), fail to prevent label-only attacks and thus do not significantly reduce MI compared to an undefended model. Other proposed defenses, e.g., reducing the precision or cardinality of the confidence vector (Shokri et al., 2016; Truex et al., 2018; Salem et al., 2018), and recent defenses like prediction purification (Yang et al., 2020), also rely on confidence masking: they are unlikely to resist label-only attacks. See Supplement D for more details on these defenses.

Table 1. Accuracy of MI attacks on CIFAR-100 and MNIST. The target models are trained using 30,000 data points for CIFAR-100 and 1,000 for MNIST. The "Undefended" columns (sub-tables (a) and (c) in the original) report results without any defense; the "MemGuard" columns (sub-tables (b) and (d)) report results with MemGuard (Jia et al., 2019), which prevents the confidence-vector attacks via confidence masking. "Combined" refers to the boundary and translation attack. Results affected by confidence masking (marked in red in the original) are the confidence-vector accuracies under MemGuard.

| Attack            | CIFAR-100 Undefended | CIFAR-100 MemGuard | MNIST Undefended | MNIST MemGuard |
|-------------------|----------------------|--------------------|------------------|----------------|
| Gap attack        | 83.5                 | 83.5               | 53.2             | 53.2           |
| Confidence-vector | 88.1                 | 50.0               | 55.7             | 50.0           |
| Data augmentation | 84.6                 | 84.6               | 53.9             | 53.9           |
| Boundary distance | 88.0                 | 88.0               | 57.8             | 57.8           |
| Combined          | 89.2                 | 89.2               | 58.7             | 58.7           |
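To see why post-processing that preserves the predicted label cannot affect a label-only adversary, consider the following toy sketch of a confidence-masking defense. It is an illustrative stand-in written for this discussion, not MemGuard's actual optimization: it perturbs the confidence vector arbitrarily, subject only to the argmax-preserving invariant stated in Section 5.2.

```python
import numpy as np

def mask_confidences(scores, rng):
    """Toy confidence-masking post-processor: perturb the confidence vector while
    preserving the predicted label, i.e., argmax h(x) == argmax h_defense(x)."""
    label = int(np.argmax(scores))
    noisy = np.clip(scores + rng.uniform(-0.3, 0.3, size=scores.shape), 1e-6, None)
    noisy /= noisy.sum()                 # renormalize to a probability vector
    if int(np.argmax(noisy)) != label:   # restore the original top label if needed
        noisy[label] = noisy.max() + 1e-3
        noisy /= noisy.sum()
    return noisy

# A confidence-thresholding attack sees very different scores before and after this
# post-processing, but a label-only attack only ever uses the argmax, which is
# unchanged by construction.
rng = np.random.default_rng(0)
h_x = np.array([0.05, 0.85, 0.10])  # undefended confidence vector
assert np.argmax(mask_confidences(h_x, rng)) == np.argmax(h_x)
```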
5.3. Breaking MemGuard

We implement the strongest version of MemGuard, which can make arbitrary changes to the confidence vector while leaving the model's predicted label unchanged. Observing Figure 1 and Table 1 (b) and (d), we see that MemGuard successfully defends against prior confidence-vector attacks but, as expected, offers no protection against our label-only attacks. All our attacks significantly outperform the (non-adaptive) confidence-vector attack and the baseline gap attack.

The main reason that Jia et al. (2019) found MemGuard to protect against confidence-vector attacks is that these attacks were not properly adapted to this defense. Specifically, MemGuard is evaluated against confidence-vector attacks that are tuned on source models without MemGuard enabled. This observation also holds for other defenses such as Yang et al. (2020). Thus, these attacks' membership predictors are tuned to distinguish members from non-members based on high confidence scores, which these defenses obfuscate. In a sense, a label-only attack like ours is the right adaptive attack against these defenses: since the model's confidence scores are no longer reliable, the adversary's best strategy is to use hard labels, which these defenses explicitly do not modify.

Table 2. Accuracy of membership inference attacks on Texas, Purchase, Location, and Adult. Where augmentations may not exist, noise robustness can still perform on or near par with confidence-vector attacks. The target models are trained exactly as in Shokri et al. (2016): 1,600 points for Location and 10,000 for the rest. Our noise robustness attack uses 10,000 queries. The "Undefended" columns (sub-tables (a), (c), (e), and (g) in the original) report results without any defense; the "MemGuard" columns (sub-tables (b), (d), (f), and (h)) report results with MemGuard (Jia et al., 2019), which prevents the confidence-vector attacks via confidence masking. Results affected by confidence masking (marked in red in the original) are the confidence-vector accuracies under MemGuard.

| Attack            | Texas Undef. | Texas MemGuard | Purchase Undef. | Purchase MemGuard | Location Undef. | Location MemGuard | Adult Undef. | Adult MemGuard |
|-------------------|--------------|----------------|-----------------|-------------------|-----------------|-------------------|--------------|----------------|
| Gap attack        | 73.9         | 73.9           | 67.1            | 67.1              | 72.1            | 72.1              | 58.7         | 58.7           |
| Confidence-vector | 84.0         | 50.0           | 86.1            | 50.0              | 92.6            | 50.0              | 59.9         | 50.0           |
| Noise robustness  | 80.3         | 80.3           | 87.4            | 87.4              | 89.2            | 89.2              | 58.7         | 58.7           |

Moving forward, we recommend that the trivial gap baseline serve as an indicator of confidence masking: for a defense to protect against MI, a confidence-vector attack should not perform significantly worse than the gap attack. Thus, to protect against (all) MI attacks, a defense cannot solely post-process the confidence vector: the model will still be vulnerable to label-only attacks.

5.4. Breaking Adversarial Regularization

The work of Nasr et al. (2018a) differs from MemGuard and prediction purification in that it does not simply obfuscate confidence vectors at test time.
Rather, it jointly trains a target model and a defensive confidence-vector MI classifier in a min-max fashion: the attack model is trained to maximize MI, and the target model to produce accurate outputs that nonetheless fool the attacker. See Supplement D for more details. We train a target model defended using adversarial regularization, exactly as in Nasr et al. (2018a). By varying its hyper-parameters, we achieve a defended state where the confidence-vector attack is within 3 percentage points of chance, as shown in Supplement Figure 9. Again, our label-only attacks significantly outperform this attack (compare Figures 6 (a) and (b)) because the train-test gap is only marginally reduced. This defense is not entirely ineffective: it prevents label-only attacks from exceeding the gap attack by more than 3 percentage points. However, when tuned so that label-only attacks are sufficiently defended, it incurs significantly worse test-accuracy trade-offs than other defenses (see Figure 5). And yet, evaluating the defense solely on confidence-vector attacks overestimates the achieved privacy.

5.5. The Query Complexity of Label-Only Attacks

We now answer question 3) and investigate how the query budget affects the success rate of different label-only attacks. Recall that our rotation attack evaluates N = 3 queries of images rotated by r, and our translation attack N = 4d + 1 queries for shifts satisfying a total displacement of d. Figure 2 (a)-(b) shows that there is a range of perturbation magnitudes for which the attack exceeds the baseline (i.e., 1 ≤ r ≤ 8 for rotations and 1 ≤ d ≤ 2 for translations). When the augmentations are too small or too large, the attack performs poorly because the augmentations have a similar effect on both train and test samples (i.e., small augmentations rarely change model predictions and large augmentations often cause misclassifications). An optimal parameter choice (r and d), which an adversary can tune using its local source model, outperforms the baseline by 3-4 percentage points. As we will see in Section 6, these attacks outperform all others on models that used data augmentation at training time.

In Figure 2 (c), we compare the different boundary distance attacks discussed in Section 3.4. With 2,500 queries, the label-only attack matches the white-box upper bound using 2,000 queries and also matches the best confidence-vector attack (see Figure 1). With 12,500 queries, our combined attack can outperform all others. Query limiting would likely not be a suitable defense, as Sybil attacks (Douceur, 2002) can circumvent it; even in low query regimes (<100), our attacks outperform the trivial gap attack by 4 percentage points. Finally, with <300 queries, our simple noise robustness attack outperforms our other label-only attacks. At large query budgets, our boundary distance attack produces more precise distance estimates and outperforms it. Note that the monetary costs are modest, at $0.25 per sample (pricing from https://www.clarifai.com/pricing).

Figure 2. Comparison of query settings for different label-only MI attacks on CIFAR-10: (a) the rotation attack (attack accuracy, %, vs. r); (b) the translation attack (attack accuracy, %, vs. d); (c) the boundary distance attacks (attack accuracy, %, vs. number of queries, comparing the label-only boundary distance, random noise, combined boundary & translation, gap, and white-box boundary distance attacks). Target models are trained on a subset of 2,500 data points. In (a) and (b), we compare the performance of the data augmentation attack against two baselines (the gap attack and the label-only boundary distance attack with 2,500 queries) as we increase the r and d parameters. In (c), we compare attacks that threshold on a point's distance to the boundary in a black-box setting with a white-box baseline using Carlini and Wagner's attack (Carlini & Wagner, 2017). We describe these attacks in Section 3.4. Costs are $0.1 per 1,000 queries (https://www.clarifai.com/pricing).
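For concreteness, the translation variant of the data augmentation attack (Section 3.3), whose query cost is counted above, can be outlined as follows. This is a schematic sketch rather than our released implementation: query_label is a placeholder for the target model's hard-label interface, and np.roll stands in for a pixel shift (it wraps pixels around; a padded shift is the more faithful choice in practice).

```python
import numpy as np

def translation_queries(x, d):
    """The source image plus every shift with |i| + |j| = d: 4d + 1 queries in total
    (horizontal shift by i pixels, vertical shift by j pixels)."""
    shifts = [(0, 0)] + [(i, j) for i in range(-d, d + 1)
                         for j in range(-d, d + 1)
                         if abs(i) + abs(j) == d and (i, j) != (0, 0)]
    return [np.roll(np.roll(x, i, axis=1), j, axis=0) for i, j in shifts]

def augmentation_attack_features(query_label, x, y_true, d):
    """Binary vector b where b_k = 1 iff the k-th query was misclassified."""
    return np.array([int(query_label(x_k) != y_true)
                     for x_k in translation_queries(x, d)])
```

The resulting indicator vector is fed to the membership classifier f, which is fit on a local shadow model where membership labels are known; the rotation variant is analogous, with N = 3 queries.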
6. Defending with Better Generalization

Since confidence-masking defenses cannot robustly defend against MI attacks, we now investigate to what extent standard regularization techniques that aim to limit a model's ability to overfit to its training set can defend against them. We study how data augmentation, transfer learning, dropout (Srivastava et al., 2014), ℓ1/ℓ2 regularization, and differentially private (DP) training (Abadi et al., 2016) impact MI. We explore three questions in this section:

A. How does training with data augmentation impact MI attacks, especially those that evaluate augmented data?
B. How well do other standard machine learning regularization techniques help in reducing MI?
C. How do these defenses compare to differential privacy, which can provide formal guarantees against any form of membership leakage?

6.1. Training with Data Augmentation Exacerbates MI

Data augmentation is commonly used in machine learning to reduce overfitting and encourage generalization, especially in low-data regimes (Shorten & Khoshgoftaar, 2019; Mikołajczyk & Grochowski, 2018). Data augmentation is an interesting case study for our attacks. As it reduces a model's overfitting, one would expect it to reduce MI. But a model trained with augmentation will have been trained to strongly recognize x and its augmentations, which is precisely the signal that our data augmentation attacks exploit.

We train target models with data augmentation similar to Section 3.3 and focus on translations, as they are most common in computer vision. We use a simple pipeline where all translations of each image are evaluated in each training epoch. Though this differs slightly from the standard random sampling, we choose it to illustrate the maximum MI when the adversary's queries exactly match the samples seen in training. Observe from Figure 3 that augmentation reduces overfitting and improves generalization: test accuracy increases from 49.7% without translations to 58.7% at d = 5, and the train-test gap decreases. Due to improved generalization, the confidence-vector and boundary distance attacks' accuracies decrease. Yet, the success rate of the data augmentation attack increases. This increase confirms our initial intuition that the model now leaks additional membership information via its invariance to training-time augmentation. Though the model trained with d = 5 pixel shifts has higher test accuracy, our data augmentation attack exceeds the confidence-vector performance on the non-augmented model. (Though we find in Supplement Figure 8 that the attack is strongest when the adversary correctly guesses d, these values are often fixed for a given domain and image resolution; adversarial knowledge of the augmentation pipeline is thus not a strong assumption.) Thus, model generalization is not the only variable affecting membership leakage: models that overfit less on the original data may actually be more vulnerable to MI because they implicitly overfit more on a related dataset.

Attacking a high-accuracy ResNet. We use, without modification, the pipeline from FixMatch (Sohn et al., 2020), which trains a ResNet-28 to 96% accuracy on the CIFAR-10 dataset, comparable to the state of the art.
As with our other experiments, this model is trained using a subset of CIFAR-10, which sometimes leads to observably overfit models, indicated by a higher gap attack accuracy. We train models using four regularizations: random vertical flips, random shifts by up to d = 4 pixels, random image cutout (DeVries & Taylor, 2017), and (non-random) weight decay of magnitude 0.0005. All are either enabled or disabled together.

Figure 3. Accuracy of MI attacks on CIFAR-10 models trained with data augmentation on a subset of 2,500 images (attack accuracy, %, vs. the d used in training). As in our attack, d controls the number of pixels by which images are translated during training, where d = 0 means no augmentation. For models trained with significant amounts of data augmentation, MI attacks become stronger despite the model generalizing better.

Our results here, shown in Figure 4, corroborate those obtained with the simpler pipeline above: though test accuracy improves, our data augmentation attacks match or outperform the confidence-vector attack.

Figure 4. Accuracy of membership inference attacks on CIFAR-10 models trained as in FixMatch (Sohn et al., 2020): (a) without augmentations; (b) with augmentations (attack accuracy, %, vs. number of training data points). Our data augmentation attacks, which mimic the training augmentations, match or outperform the confidence-vector attacks when augmentations were used in training. We evaluate 1,000 randomly generated augmentations for this attack. As in previous experiments, we find that our label-only distance attack performs on par with the confidence-vector attack. Interestingly, the gap attack accuracy also improves, due to a relatively larger increase in training accuracy. "With Augmentations" and "Without Augmentations" refer to using all regularizations, as in FixMatch, or none, respectively.

6.2. Other Techniques to Prevent Overfitting

We explore questions B) and C) using other standard regularization techniques, with details in Supplement E. In transfer learning, we either train only a new last layer (last-layer fine-tuning) or fine-tune the entire model (full fine-tuning). We pre-train a model on CIFAR-100 to 51.6% test accuracy and then use transfer learning. We find that the boundary distance attack performs on par with the confidence-vector attack in all cases. We observe that last-layer fine-tuning degrades all our attacks to the generalization gap, preventing non-trivial MI (see Figure 10 in Supplement F). This result corroborates intuition: linear layers have less capacity to overfit compared to neural networks. We observe that full fine-tuning leaks more membership information but achieves better test accuracy, as shown in Figure 5.

Finally, DP training (Abadi et al., 2016) formally enforces that the trained model does not strongly depend on any individual training point, i.e., that it does not overfit. We use differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) (see Supplement E). To achieve test accuracies comparable to undefended models, the formal privacy guarantees must become mostly meaningless (i.e., ϵ > 100). In Figure 6, we find that most forms of regularization fail to prevent even the baseline gap attack from reaching 60% accuracy or more. Only strong ℓ2 regularization (λ ≥ 1) and DP training consistently reduce MI. Figure 5 gives the clearest picture of the privacy-utility trade-off. We see that both prevent MI at a high cost in test accuracy: they cause the model to underfit.
However, we also clearly see the utility benefits of transfer learning: these models achieve consistently better test accuracy due to features learned from non-private data. Combining DP training with transfer learning mitigates privacy leakage at only minimal cost in test accuracy, achieving the best trade-off. When transfer learning is not an option, dropout performs better.

Figure 5. Test accuracy and label-only attack accuracy for various defenses (attack accuracy, %, vs. target model test accuracy, %). Same setup as Figure 6. Models towards the bottom right are better defenses: more private and more accurate.

Figure 6. Comparison of MI attacks on various defenses: (a) label-only attacks; (b) full confidence-vector attacks (attack accuracy, %, vs. train-test gap, with the baseline gap attack shown for reference). Defense types: adversarial regularization, DP full fine-tune, DP last layer, data augmentation, differential privacy, dropout, full fine-tune, ℓ1 regularization, ℓ2 regularization, last layer, MemGuard, and none. Target models are trained on 2,500 data points from CIFAR-10. Point sizes indicate relative regularization amounts within a defense.

7. Worst-Case (Outlier) MI

Here, we perform MI only for a small subset of "outliers". Even if a model generalizes well on average, it might still have overfit to unusual data in the tails of the distribution (Carlini et al., 2019). We use a similar but modified process as Long et al. (2018) to identify potential outliers. First, the adversary uses a source model ĥ to map each targeted data point x to its feature space, i.e., the activations of its penultimate layer, denoted z(x). We define two points x1, x2 as neighbors if their features are close, i.e., d(z(x1), z(x2)) ≤ δ, where d(·, ·) is the cosine distance and δ is a tunable parameter. An outlier is a point with fewer than γ neighbors in feature space, where γ is another tunable parameter. Given a dataset X of potential targets and an intended fraction of outliers β (e.g., 1% of X), we tune δ and γ so that a β-fraction of points x ∈ X are outliers. We use precision as the MI success metric.

We run our attacks on the outliers of the same models as in Figure 6. We find (see Figure 11 in Supplement F) that we can always improve the attack by targeting outliers, but that strong ℓ2 regularization and DP training prevent MI. As before, we find that the label-only boundary distance attack matches the confidence-vector attack performance.

8. Conclusion

We developed three new label-only membership inference attacks that can match, and even exceed, the success of prior confidence-vector attacks, despite operating in a more restrictive adversarial model. Their label-only nature requires fundamentally different attack strategies, which in turn cannot be trivially prevented by obfuscating a model's confidence scores.
We have used these attacks to break two state-of-the-art defenses against membership inference attacks. We have found that the problem with these confidence-masking defenses runs deeper: they cannot prevent any label-only attack. As a result, any defense against MI necessarily has to help reduce a model's train-test gap. Finally, via a rigorous evaluation across many proposed defenses to MI, we have shown that differential privacy (with transfer learning) provides the strongest defense, both in an average-case and a worst-case sense, but that it may come at a cost in the model's test accuracy. To center our analysis on comparing the confidence-vector and label-only settings, we use the same threat model as prior work (Shokri et al., 2016) and leave a fine-grained analysis of label-only attacks under reduced adversarial knowledge (e.g., reduced data and model architecture knowledge (Yeom et al., 2018; Salem et al., 2018)) to future work.

Acknowledgments

We thank the reviewers for their insightful feedback. This work was supported by CIFAR (through a Canada CIFAR AI Chair), by NSERC (under the Discovery Program, NFRF Exploration program, and COHESA strategic research network), and by gifts from Intel and Microsoft. We also thank the Vector Institute's sponsors.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pp. 308–318, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341394. doi: 10.1145/2976749.2978318. URL https://doi.org/10.1145/2976749.2978318.

Bengio, Y., Goodfellow, I., and Courville, A. Deep learning, volume 1. MIT Press, 2017.

Brendel, W., Rauber, J., and Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Carlini, N. and Wagner, D. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. IEEE, 2018.

Carlini, N., Liu, C., Erlingsson, U., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284, 2019.

Chen, J., Jordan, M. I., and Wainwright, M. J. HopSkipJumpAttack: A query-efficient decision-based attack, 2019.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data, 2018.

Cui, X., Goel, V., and Kingsbury, B. Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(9):1469–1477, 2015.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Douceur, J. R. The Sybil attack. In International Workshop on Peer-to-Peer Systems, pp. 251–260. Springer, 2002.
Fadaee, M., Bisazza, A., and Monz, C. Data augmentation for low-resource neural machine translation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017. doi: 10.18653/v1/P17-2090. URL http://dx.doi.org/10.18653/v1/P17-2090.

Ford, N., Gilmer, J., Carlini, N., and Cubuk, D. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.

Fredrikson, M., Jha, S., and Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333, 2015.

Gal, Y. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples, 2014.

Hayes, J., Melis, L., Danezis, G., and Cristofaro, E. D. LOGAN: Membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019(1):133–152, 2019. URL https://content.sciendo.com/view/journals/popets/2019/1/article-p133.xml.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.

Hu, S., Yu, T., Guo, C., Chao, W.-L., and Weinberger, K. Q. A new defense against adversarial images: Turning a weakness into a strength. In Advances in Neural Information Processing Systems, pp. 1635–1646, 2019.

Jayaraman, B., Wang, L., Evans, D., and Gu, Q. Revisiting membership inference under realistic assumptions. arXiv preprint arXiv:2005.10881, 2020.

Jia, J., Salem, A., Backes, M., Zhang, Y., and Gong, N. Z. MemGuard: Defending against black-box membership inference attacks via adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 259–274, 2019.

Leino, K. and Fredrikson, M. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. arXiv preprint arXiv:1906.11798, 2019.

Long, Y., Bindschaedler, V., and Gunter, C. A. Towards measuring membership privacy. arXiv preprint arXiv:1712.09136, 2017.

Long, Y., Bindschaedler, V., Wang, L., Bu, D., Wang, X., Tang, H., Gunter, C. A., and Chen, K. Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889, 2018.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, pp. 3, 2013.

Mikołajczyk, A. and Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp. 117–122, May 2018. doi: 10.1109/IIPHDW.2018.8388338.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.

Nasr, M., Shokri, R., and Houmansadr, A. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646, 2018a.

Nasr, M., Shokri, R., and Houmansadr, A. Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646, 2018b.

Ngai, E. W., Hu, Y., Wong, Y. H., Chen, Y., and Sun, X. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569, 2011.
Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS '17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4. doi: 10.1145/3052973.3053009. URL http://doi.acm.org/10.1145/3052973.3053009.

Perez, L. and Wang, J. The effectiveness of data augmentation in image classification using deep learning, 2017.

Pyrgelis, A., Troncoso, C., and De Cristofaro, E. Knock knock, who's there? Membership inference on aggregate location data. arXiv preprint arXiv:1708.06145, 2017.

Sablayrolles, A., Douze, M., Schmid, C., Ollivier, Y., and Jégou, H. White-box vs black-box: Bayes optimal strategies for membership inference. In International Conference on Machine Learning, pp. 5558–5567. PMLR, 2019.

Sajjad, M., Khan, S., Muhammad, K., Wu, W., Ullah, A., and Baik, S. W. Multi-grade brain tumor classification using deep CNN with extensive data augmentation. Journal of Computational Science, 30:174–182, 2019.

Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., and Backes, M. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models, 2018.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models, 2016.

Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence, 2020.

Song, L., Shokri, R., and Mittal, P. Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 241–257, 2019.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Stanfill, M. H., Williams, M., Fenton, S. H., Jenders, R. A., and Hersh, W. R. A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association, 17(6):646–651, 2010.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pp. 270–279. Springer, 2018.

Tanay, T. and Griffin, L. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.

Taylor, L. and Nitschke, G. Improving deep learning with generic data augmentation. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1542–1547, Nov 2018. doi: 10.1109/SSCI.2018.8628742.

Tian, S., Yang, G., and Cai, Y. Detecting adversarial examples through image transformation, 2018.

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618, 2016.
Truex, S., Liu, L., Gursoy, M. E., Yu, L., and Wei, W. Towards demystifying membership inference attacks. arXiv preprint arXiv:1807.09173, 2018.

Wang, B. and Gong, N. Z. Stealing hyperparameters in machine learning. 2018 IEEE Symposium on Security and Privacy (SP), pp. 36–52, 2018.

Yang, Z., Shao, B., Xuan, B., Chang, E.-C., and Zhang, F. Defending model inversion and membership inference attacks via prediction purification, 2020.

Yeom, S., Giacomelli, I., Fredrikson, M., and Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282. IEEE, 2018.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pp. 649–657, Cambridge, MA, USA, 2015. MIT Press.