# dataset_inference_ownership_resolution_in_machine_learning__46f3d1d3.pdf

Published as a conference paper at ICLR 2021

DATASET INFERENCE: OWNERSHIP RESOLUTION IN MACHINE LEARNING

Pratyush Maini, IIT Delhi, pratyush.maini@gmail.com
Mohammad Yaghini, Nicolas Papernot, University of Toronto and Vector Institute, {mohammad.yaghini,nicolas.papernot}@utoronto.ca

With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks into a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that the knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce dataset inference, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset, as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. Unlike prior work, it does not require retraining or overfitting the defended model.¹

1 INTRODUCTION

Machine learning models have increasingly many parameters (Brown et al., 2020; Kolesnikov et al., 2019), requiring larger datasets and a significant investment of resources. For example, OpenAI's development of GPT-3 is estimated to have cost over USD 4 million (Li, 2020). Yet, models are often exposed to the public to provide services such as machine translation (Wu et al., 2016) or image recognition (Wu et al., 2019). This gives adversaries an incentive to steal models via the exposed interfaces using model extraction. This threat raises a question of ownership resolution: how can an owner prove that a suspect model stole their intellectual property? Specifically, we aim to determine whether a potentially stolen model was derived from an owner's model or dataset. An adversary may derive and steal intellectual property from a victim in many ways. A prominent way is (1) model extraction (Tramèr et al., 2016), where the adversary exploits access to a model's (1.a) prediction vectors (e.g., through an API) to reproduce a copy of the model at a lower cost than what is incurred in developing it. Perhaps less directly, (1.b) the adversary could also use the victim model as a labeling oracle to train their model on an initially unlabeled dataset obtained either from a public source or collected by the adversary.
In a more extreme threat model, (2) the adversary could also get access to the dataset itself which was used to train the victim model, and train their own model by either (2.a) distilling the victim model, or (2.b) training from scratch altogether. Finally, adversaries may gain (3) complete access to the victim model, but not the dataset. This may happen when a victim wishes to open-source their work for academic purposes but disallows its commercialization, or simply via insider access. The adversary may (3.a) fine-tune over the victim model, or (3.b) use the victim for data-free distillation (Fang et al., 2019).

*Work done while an intern at the University of Toronto and Vector Institute.
¹Code and models for reproducing our work can be found at github.com/cleverhans-lab/dataset-inference

Preventing all forms of model stealing is impossible without decreasing model accuracy for legitimate users: model extraction adversaries can obfuscate malicious queries as legitimate ones from the expected distribution. Most prior efforts thus focus on watermarking models before deployment. Rather than preventing model stealing, they aim to detect theft by allowing the victim to claim ownership by verifying that a suspect model responds with the expected outputs on watermarked inputs. This strategy not only requires re-training and decreases model accuracy, it can also be vulnerable to adaptive attacks that lessen the impact of watermarks on the decision surface during extraction. Thus, even recent work that has managed to prevail (Yang et al., 2019) despite distillation (Hinton et al., 2015) or extraction (Jia et al., 2020) has suffered a trade-off in model performance.

In our work, we make the key observation that all stolen models necessarily contain direct or indirect information from the victim model's training set. This holds regardless of how the adversary gained access to the stolen model. This leads us to propose a fundamentally different defense strategy: we identify stolen models because they possess knowledge contained in the private training set of the victim. Indeed, a successful model extraction attack will distill the victim's knowledge of its training data into the stolen copy. Hence, we propose to identify stolen copies by showing that they were trained (at least partially and indirectly) on the same dataset as the victim. We call this process dataset inference (DI). In particular, we find that stolen models are more confident about points in the victim model's training set than about a random point drawn from the task distribution. The more an adversary interacts with the victim model to steal it, the easier it will be to claim ownership by distinguishing the stolen model's behavior on the victim model's training set. We distinguish a model's behavior on its training data from other subsets of data by measuring the prediction certainty of any data point: the margin of a given data point to neighbouring classes. At its core, DI builds on the premise of input memorization, albeit weak. One might think that DI succeeds only for models trained on small datasets when overfitting is likely. Surprisingly, in practice we find that even models trained on ImageNet end up memorizing training data in some form. Among the related work discussed in §2, distinguishing a classifier's behavior on examples from its train and test sets is closest to membership inference (Shokri et al., 2017).
Membership inference (MI) is an attack predicting whether individual examples were used to train a model or not. Dataset inference flips this situation and exploits an information asymmetry: the potential victim of model theft is now the one testing for membership, and naturally has access to the training data. Whereas MI typically requires a large train-test gap because such a setting allows a greater distinction between individual points in and out of the training set (Yeom et al., 2018; Choo et al., 2020), dataset inference succeeds even when the defender has only slightly better than random chance of guessing membership correctly, because the victim aggregates the result of DI over multiple points from the training set.

In summary, our contributions are:

- We introduce dataset inference as a general framework for ownership resolution in machine learning. Our key observation is that knowledge of the training set leads to an information asymmetry which advantages legitimate model owners when resolving ownership.
- We theoretically show on a linear model that the success of MI decreases with the size of the training set (as overfitting decreases), whereas DI is independent of it. Despite the failure of MI on a binary classification task, DI still succeeds with high probability.
- We propose two different methods to characterize training vs. test behavior: targeted adversarial attacks in the white-box setting, and a novel Blind Walk method for the black-box, label-only setting. We then create a concise embedding of each data point that is fed to a confidence regressor to distinguish between points inside and outside a model's training set. Hypothesis testing then returns the final ownership claim. Unlike prior efforts, our method not only helps defend ML services against model extraction attacks, but also against extreme scenarios such as complete theft of the victim's model or training data. In §7, we also introduce and evaluate our approach against adaptive attacks.
- We evaluate our method on the CIFAR10, SVHN, CIFAR100 and ImageNet datasets and obtain greater than 99% confidence in detecting model or data theft under the threat models studied in this work, by exposing as few as 50 random samples from our private dataset.

We remark that dataset inference applies beyond intellectual property issues. For example, Song & Shmatikov (2019) showed that models trained for gender classification also learn features predictive of ethnicity. This raises ethical concerns, and dataset inference could assess whether a sensitive dataset was used by a model developer for purposes different from those stated at data collection time.

2 RELATED WORK

Model Extraction. Model extraction (Tramèr et al., 2016; Jagielski et al., 2020; Truong et al., 2021) is the process whereby an adversary tries to steal a copy of a machine learning model that may have been remotely deployed (such as behind a prediction API). Depending on the level of access provided by the prediction API, model extraction may be performed using only the labels (Chandrasekaran et al., 2019; Correia-Silva et al., 2018) or the entire prediction logits of the deployed service (Orekondy et al., 2018). Model extraction has seen a cycle of attacks and defenses. Once an adversary has knowledge of the defense strategy adopted by the victim, they adaptively modify the attack to circumvent that defense (see watermarking).
Model extraction can also be a reconnaissance step used to prepare for further attacks, e.g., finding adversarial examples (Papernot et al., 2017; Shumailov et al., 2020).

Watermarking. Since Uchida et al. (2017) embedded watermarks into neural networks and Adi et al. (2018) used them as signatures to claim possession, watermarks have been widely adopted as a way to resolve ownership claims. The underlying idea is to manipulate the model to learn information other than that from the true data distribution, and to use this knowledge for verification afterwards. This strategy not only requires new training procedures and decreases the model's accuracy (Jia et al., 2020), but is also vulnerable to adaptive attacks that lessen the impact of watermarks on the model's decision surface during extraction (Liu et al., 2018; Chen et al., 2019; Wang et al., 2019; Shafieinejad et al., 2019).

Membership Inference. Shokri et al. (2017) train a number of shadow classifiers on confidence scores produced by the target model, with labels indicating whether samples came from the training or testing set. MI attacks are shown to work in white-box (Leino & Fredrikson, 2020; Sablayrolles et al., 2019) as well as black-box scenarios against various target models, including generative models (Hayes et al., 2019). Yeom et al. (2018) explore overfitting as the root cause of MI vulnerability. Choo et al. (2020) show that MI can succeed even in scenarios where the victim only provides labels.

Out-of-Distribution Detection. Liang et al. (2017) and Lee et al. (2018) measure model performance on modifying an input to determine whether a sample is in- or out-of-distribution. The premise is that in-distribution samples are easier to manipulate, whereas out-of-distribution samples require more work. In contrast, our work solves a much more challenging problem: the data distribution may be the same, but can we still identify which of the datasets was used for training?

3 THREAT MODEL AND DEFINITION OF DATASET INFERENCE

Consider a victim V who trains a model f_V on their private data S_V ⊆ K_V, where K_V represents the private knowledge of V. While K_V is an abstract concept that cannot be concretely defined, the private dataset S_V represents a definite part of the victim's knowledge that can be formalized. An adversary A may gain access to a subset of K_V and use it to train its own model f_A. V suspects theft, and would like to prove that f_A is indeed a copy of f_V. Hence, V employs dataset inference on f_A to determine whether a subset of their private knowledge K ⊆ K_V was used to train f_A. We formally define the victim and their dataset inference experiment below.

Definition 1 (Dataset Inferring Victim V(f, α, m)) Let V : F × [0, 1] × N → {1, ⊥} be a victim with private access to S_V ⊆ K_V, where F represents the set of all classifiers trained on samples from a data distribution D. Given a classifier f, V can reveal at most m samples from S_V to either conclusively prove that a subset of their private knowledge K ⊆ K_V has been used in the training of f with a Type-I error (FPR) < α, or return an inconclusive result ⊥.

Definition 2 (Dataset Inference Experiment Exp_DI(V, m, α, S_V, D)) Let F be as in Definition 1, let F_V be the set of all classifiers trained on the victim's private dataset S_V ∼ D, and let m be a natural number. The dataset inference experiment proceeds as follows:

1. Choose b ∈ {0, 1} uniformly at random.
2. Set f_A ∈ F_V if b = 1; else f_A ∈ F.
3.
$$\mathrm{Exp}_{DI}(V, m, \alpha, S_V, D) = \begin{cases} 1 & \text{if } V(f_A, \alpha, m) = 1 \text{ and } b = 1 \\ 0 & \text{otherwise} \end{cases}$$

4 THEORETICAL MOTIVATION

Dataset inference (DI) aims to leverage the disparity in the response of an ML model to inputs that it saw during training, versus those that it did not. We call this response the "prediction margin". In §4, we introduce our theoretical framework. In §4.1, we quantify the difference in the expected response of a model to any point in the training and test set. Finally, in §4.2 we describe how DI succeeds with high probability in this setting, while membership inference (MI) fails.

Setup. Consider a data distribution D, such that any input-label pair (x, y) can be described as:

$$y \in \{-1, +1\}; \qquad x_1 = y \cdot u \in \mathbb{R}^K, \qquad x_2 \sim \mathcal{N}(0, \sigma^2 I) \in \mathbb{R}^D \tag{1}$$

where x = (x_1, x_2) ∈ R^(K+D) and u ∈ R^K is a fixed vector. Observe that the last D dimensions of x represent Gaussian noise (with variance σ²) having no correlation with the correct label. However, the first K dimensions are sufficient to separate inputs from classes {−1, +1} (Nagarajan & Kolter, 2019). S ∼ D, with |S| = m, represents the private training set of a model with m distinct training examples.

Architecture. We consider the scenario of classifying the input distribution using a linear classifier f with weights w = (w_1, w_2), such that for any input x: f(x) = w_1 · x_1 + w_2 · x_2, and the final classification decision is sgn(f(x)). While we only discuss the case of a linear network in this analysis, the success of DI only increases with the number of parameters of a machine learning model, as is the case for MI (Yeom et al., 2018). This, in effect, makes the following analysis a stronger result to prove. Prior works have also argued that over-parametrized deep learning networks memorize training points (Zhang et al., 2016; Feldman, 2019). At its core, DI builds on the premise of (weak) input memorization. Results on DNNs are discussed in §7.

4.1 PREDICTION MARGIN

In our work, we use the prediction margin to capture the confidence of a machine learning model in its prediction. In other words, we try to capture the robustness of a model's prediction under uncertainty, which is equivalent to viewing the local landscape of a machine learning model. For the purpose of the theoretical analysis, it is convenient to define it as the margin of a data point from the decision boundary, y · f(x). As we scale our method to deep networks in the empirical evaluation, we will describe alternative methods of measuring the prediction margin in multi-class settings.

Theorem 1 (Train-Test Margin) Given a linear classifier f trained to classify inputs (x, y) ∈ S (training set), the difference in the expected prediction margin for samples in S and D is given by

$$\mathbb{E}_{(x,y) \sim S}\left[y \cdot f(x)\right] - \mathbb{E}_{(x,y) \sim D}\left[y \cdot f(x)\right] = D\sigma^2,$$

where σ² is the Gaussian noise variance as in (1).

The proof (Appendix A.2) first calculates the weights of the learned classifier f by assuming that it is trained using gradient descent with a fixed learning rate, visiting every training point exactly once. We then analyze the expected margin for data points included in training or not.

4.2 DATASET INFERENCE VS. MEMBERSHIP INFERENCE

We now show how MI fails to distinguish between train and test samples in this same setting. This happens because an adversary has to make a decision about the presence of a given data point in the training set by querying a single point. However, DI succeeds with high probability in the same setting because it aggregates signal over multiple data points.
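To make this contrast concrete, the toy distribution and one-pass linear learner above can be simulated in a few lines. The sketch below is a minimal illustration under the stated assumptions (u taken as the all-ones vector, m = 2000 for speed); it is not the paper's released code. It checks that the average train-test margin gap is close to Dσ² (Theorem 1), that a single-point membership guess is only modestly better than chance, and that an aggregate test over many points separates the two sets decisively.

```python
# Minimal simulation of the linear setup in Section 4 (illustrative sketch, not the paper's code).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
K, D, m, sigma = 100, 900, 2000, 1.0
u = np.ones(K)                                 # the fixed vector u of Equation (1) (assumed all-ones)

def sample(n):
    """Draw n labelled points from the distribution D of Equation (1)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x1 = y[:, None] * u                        # first K dims: perfectly predictive of y
    x2 = rng.normal(0.0, sigma, size=(n, D))   # last D dims: label-independent Gaussian noise
    return x1, x2, y

# One pass of gradient updates on y * f(x) with learning rate 1 (as in Appendix A.2):
x1_tr, x2_tr, y_tr = sample(m)
w1 = (y_tr[:, None] * x1_tr).sum(axis=0)       # equals m * u
w2 = (y_tr[:, None] * x2_tr).sum(axis=0)       # equals sum_i y_i * x2_i

def margin(x1, x2, y):
    return y * (x1 @ w1 + x2 @ w2)             # prediction margin y * f(x)

x1_te, x2_te, y_te = sample(m)
tr, te = margin(x1_tr, x2_tr, y_tr), margin(x1_te, x2_te, y_te)
print("mean train-test margin gap:", tr.mean() - te.mean(), "~ D*sigma^2 =", D * sigma**2)

# Single-point guess (MI-style): an oracle threshold halfway between the two means.
thr = (tr.mean() + te.mean()) / 2
acc = 0.5 * (tr > thr).mean() + 0.5 * (te <= thr).mean()
print("single-point membership accuracy:", acc)   # modest; approaches 0.5 as m grows (Theorem 2)

# Aggregate decision (DI-style): a one-sided two-sample t-test over all points.
print("aggregate t-test p-value:", stats.ttest_ind(tr, te, alternative="greater").pvalue)
```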
We note that the statistical differences between the prediction margins of training and test data points in §4.1 only appear when we take an expectation over multiple samples.

Figure 1: The effect of including a labeled point (x, y) in the train set. (a) If x is in the training set, the classifier will learn to maximize the decision boundary's distance to the other classes Y \ {y}. (b) If x is not in the training set, it has no direct impact on the learned landscape.

Failure of Membership Inference. Consider a membership-inferring adversary M that has no knowledge of the victim's training data S, but has domain knowledge such as the publicly available data distribution D. Define M(x, f) as the adversary's decision function to predict whether x belongs to S. Let R represent a distribution that uniformly at random samples from either S (b = 1) or D (b = 0). Then, M makes a membership decision about (x, b) ∼ R. Φ denotes the Gaussian CDF.

Theorem 2 (Failure of MI) Given a linear classifier f trained on S, the probability that an adversary M correctly predicts the membership of inputs randomly belonging to the training or test set is

$$P_{x \sim R}\left[M(x, f) = b\right] = 1 - \Phi\left(-\sqrt{\tfrac{D}{2m}}\right),$$

which decreases with |S| = m. Moreover, $\lim_{m \to \infty} P_{x \sim R}[M(x, f) = b] = 0.5$.

The theorem suggests that the success of MI when querying a single data point is extremely low. As m increases, the adversary can do no better than a coin flip. This means that the success of MI is directly proportional to overfitting (we present the proof in Appendix A.3).

Success of Dataset Inference. Take V to be a dataset-inferring victim (Definition 1). Let ψ_V(f, S; D) be V's decision function for ownership resolution. In the next theorem, we show that the success of DI is high and independent of the training set size. (Proof in Appendix A.4.)

Theorem 3 (Success of DI) Choose b ∈ {0, 1} uniformly at random, and let the adversary's linear classifier f be trained on S′ ∼ D with |S′| = |S| if b = 0, and on S otherwise. The probability that V correctly decides whether the adversary stole its knowledge is

$$P\left[\psi(f, S; D) = b\right] = 1 - \Phi\left(-\sqrt{\tfrac{D}{8}}\right).$$

Moreover, $\lim_{D \to \infty} P[\psi(f, S; D) = b] = 1$.

Example. Assume a dataset of training size 50K and input dimensions K = 100, D = 900 (i.e., 100 strongly correlated features, which is roughly similar to the MNIST dataset). We have P[ψ(f, S; D) = 1] = 1 − 10⁻²⁶ ≈ 1.0, while P_(x,b)∼R[M(x, f) = b] = 0.526. Therefore, in a problem setting where membership inference succeeds only slightly above random chance, dataset inference succeeds nearly every time.

5 DATASET INFERENCE

Dataset inference is the process of determining whether a victim's private knowledge has been directly or indirectly incorporated in a model trained by an adversary. Our key intuition is that classifiers generally try to maximize the distance of training examples from the model's decision boundaries. This means that any model which has stolen the victim's private knowledge should also position data similar to the victim's private training data far from its own decision boundaries (see Figure 1). When a victim suspects knowledge was stolen from their model, they may measure how the adversary's model responds to their own training data to substantiate their ownership claim.

5.1 EMBEDDING GENERATION

For a model f and data point x, we aim to extract a feature embedding for x that characterizes its prediction margin (or distance from the decision boundaries) w.r.t. f.
The victim V extracts these embeddings for points (x, y) ∼ D and labels them as inside (b = 1) or outside (b = 0) of their private dataset S_V.² We introduce two methods for generating embeddings based on the level of access the victim may have to the adversary's model.

²Recall that for our discussion of linear networks in §4, we used a simple metric to compute the prediction margin of a given data point, y · f(x). However, the same does not apply to deep networks.

Figure 2: Training (dotted) the confidence regressor with embeddings of public and private data and the victim's model f_V; dataset inference (solid) using m private samples (x_1, x_2, ..., x_m) and the adversary's model f_A. (Diagram: private and public samples → embedding generator → ownership tester → "stolen" / "inconclusive".)

White-Box Setting: MinGD. White-box embedding generation is used when V and A resolve the claim for ownership in the presence of a neutral arbitrator, such as a court. Indeed, Kumar et al. (2020) highlight that such attacks potentially fall under the Computer Fraud and Abuse Act in the USA and are prosecutable for "reverse engineering the model's source code". Both parties provide access to their models, and then the prediction margin is measured for the suspected adversary's model on the victim's train and test data points. For any data point (x, y), we evaluate its minimum distance to each neighbouring target class t by gradient descent on the following objective (Szegedy et al., 2013): min_δ Δ(x, x + δ) s.t. f(x + δ) = t. The distance metric Δ(x⁽ⁱ⁾, x⁽ʲ⁾) refers to the ℓ_p distance between points x⁽ⁱ⁾ and x⁽ʲ⁾ for p ∈ {1, 2, ∞}, and t is the target label. The distance to each target class is a feature in the embedding vector analyzed by the ownership tester of §5.2.

Black-Box Setting: Blind Walk. V may want to perform DI on a publicly deployed model f that only allows label query access. This makes them incapable of computing the gradients required for MinGD. Moreover, querying f repeatedly would be costly for V. Therefore, we introduce a new method called Blind Walk, which estimates the prediction margin of a given data point through its robustness to random noise rather than a gradient search. We sample a random initial direction δ. Starting from an input (x, y), we take k ∈ N steps in the same direction until f(x + kδ) = t, with t ≠ y. Then, Δ(x, x + kδ) is used as a proxy for the prediction margin of the model. Thus, the approach only requires label access to f. We repeat the search over multiple random initial directions to increase the information about the point's robustness, and use each of these distance values as a feature in the generated embedding (a minimal sketch is given below). In practice, we find Blind Walk to perform better than MinGD with the ownership tester of §5.2. We discuss further details justifying these observations in Appendix C.

5.2 OWNERSHIP TESTER

It is important for the victim to resolve ownership claims in as few queries as possible, since each query involves the victim revealing part of their private dataset S_V. Since claiming ownership would likely lead to legal action, it is paramount that the victim minimizes their false positive rate. We thus test ownership in two phases: a regression model first infers whether the potentially stolen model's predictions on individual examples contain the victim's private knowledge; this is then followed by a hypothesis test which aggregates these results to decide dataset inference.
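Before detailing these two phases, here is a minimal sketch of the Blind Walk embedding generation from §5.1. It assumes only a hypothetical label-only `predict` interface for the suspect model; the function name, step size, step cap, and the choice of Gaussian directions are illustrative assumptions rather than the paper's exact implementation (which mixes uniform, Gaussian, and Laplace noise, see §6.2).

```python
# Illustrative Blind Walk sketch (label-only access); not the reference implementation.
import numpy as np

def blind_walk_embedding(predict, x, y, num_directions=30, step=0.02, max_steps=50, seed=0):
    """Distance-to-misclassification along random directions, a proxy for the prediction
    margin of (x, y). `predict(x)` returns the predicted label (label-only query access)."""
    rng = np.random.default_rng(seed)
    features = []
    for _ in range(num_directions):
        delta = rng.standard_normal(x.shape)              # one random initial direction
        delta = step * delta / np.linalg.norm(delta)      # fixed-size step along that direction
        k = max_steps
        for i in range(1, max_steps + 1):
            if predict(np.clip(x + i * delta, 0.0, 1.0)) != y:
                k = i                                     # first step count that flips the label
                break
        features.append(np.linalg.norm(k * delta))        # distance travelled until misclassification
    return np.array(features)                             # one feature per random direction
```

Each of these distances becomes one coordinate of the embedding that is fed to the confidence regressor g_V described next.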
This is another key difference with membership inference efforts: rather than always predicting that a point is from the train or test data, we claim ownership of a model only when we have sufficient confidence. This is done through statistical hypothesis testing, which takes the false positive rate α as a hyper-parameter, and produces either a conclusive positive result with an error of at most α, or an inconclusive result.

Confidence Regressor. As defined in §5.1, we extract distance embeddings w.r.t. f_V for data points in both V's private data S_V and unseen, publicly available data. Using the embeddings and the ground-truth membership labels, we train a regression model g_V. The goal of g_V is to predict a (proxy) measure of confidence that a sample contains f_V's private information. For our hypothesis testing, we require that g_V produce smaller values for samples from S_V. Complete access to the dataset S_V allows V to train g_V accurately, as illustrated via dotted arrows in Figure 2.

Hypothesis Testing. This is the step where dataset inference claims are made (solid lines in Figure 2). Using the confidence scores produced by g_V and the membership labels, we create equal-sized sample vectors c and c_V from public and private training data, respectively. We test the null hypothesis H₀: µ < µ_V, where µ = c̄ and µ_V = c̄_V are the mean confidence scores. The test either rejects H₀ and conclusively rules that f_A is "stolen", or gives an inconclusive result.

6 EXPERIMENTAL SETUP AND IMPLEMENTATION OF DATASET INFERENCE

Unlike prior work on membership inference, which evaluates victim models trained to overfit on small subsets of the original dataset, we train all of our victim models on large common benchmarks.

Datasets. We perform our experiments on the CIFAR10, CIFAR100, SVHN and ImageNet datasets. These remain popular image classification benchmarks; further description can be found in Appendix E.1. All details about experiments on SVHN and ImageNet are in Appendix E.2.

Model Architecture. The victim model is a WideResNet (Zagoruyko & Komodakis, 2016) with depth 28 and widening factor 10 (WRN-28-10) for both CIFAR10 and CIFAR100, and is trained with a dropout rate of 0.3 (Srivastava et al., 2014). For the model stealing attacks described in §6.1, we use smaller architectures such as WRN-16-1 on CIFAR10 and WRN-16-10 on CIFAR100.

6.1 MODEL STEALING ATTACKS

We consider the strongest model stealing attacks in the literature, and introduce new attacks targeting dataset inference to perform an adaptive evaluation of our defense. The adversary A can gain different levels of access to V's private knowledge:

(1) A_Q has query access to f_V. We consider model extraction (Tramèr et al., 2016) based adversaries which may (1.a) have access to the model's prediction vectors (via an API). A_Q queries f_V on a non-task-specific dataset, and minimizes the KL divergence with its predictions. (1.b) Alternately, to further distance its predictions from the victim, the adversary may only use the most confident label from these queries (as pseudo-labels) to train.

(2) A_M has access to the victim's model f_V. This may happen when V wishes to open-source their work for academic purposes but does not allow its commercialization, or via insider access.
(2.a) A_M may fine-tune over f_V, or (2.b) use f_V for data-free distillation (Fang et al., 2019) in a zero-shot learning framework that only utilizes synthetic and non-semantic queries.³

(3) A_D has access to the complete private dataset S_V of the victim. They may train their own model either (3.a) by distilling f_V (over query access), or (3.b) by training from scratch using different learning schemes or architectures. (For further details see Appendix B.)

Finally, we also perform DI against an independent and honest machine learning model I that was trained on its own private dataset. This model is used as a control, to ensure that we do not claim ownership of models that were not trained by stealing knowledge from our victim model.

³This is the first work to consider data-free distillation as a stealing attack.

Training the threat models. For model extraction and fine-tuning attacks on CIFAR10 and CIFAR100, we use a subset of 500,000 unlabeled Tiny Images that are closest to CIFAR10, as created by Carmon et al. (2019). For SVHN, we use the extra training data released by the authors. We train the student model for 20 epochs for model extraction methods and 5 epochs for fine-tuning. For zero-shot learning, we use the data-free adversarial distillation method (Fang et al., 2019) and train the student model for 200 epochs. In the case of distillation and modified architecture, we have access to the original training data of the victim, and we train both models for 100 epochs on the full training set. In all training methods, we use a fixed learning rate schedule with the SGD optimizer, decaying the learning rate by a factor of 0.2 at 0.3×, 0.6×, and 0.8× of the total number of epochs.

6.2 IMPLEMENTATION DETAILS FOR DATASET INFERENCE

Embedding generation. For the white-box method (MinGD), we perform the attack against each target class while optimizing the ℓ₁, ℓ₂ and ℓ∞ norms. Hence, we obtain an embedding of size 30 (classes × distance measures). In the case of CIFAR100, we only attack the 10 most confident target classes, as indicated by the prediction vector f(x). For the black-box method (Blind Walk), we sample 10 times from uniform, Gaussian, and Laplace distributions to perturb the input. Once again, we obtain an embedding vector of size 30. More details are deferred to Appendix C.

Training the confidence regressor. We train a two-layer linear network (with tanh activation) g_V for the task of providing confidence about a given data point's membership in private versus public data. The regressor's loss function is L(x, y) = −b · g_V(x), where the label b = 1 for a point in the (public) training set of the respective model, and b = −1 if it came from the victim's private set.

Hypothesis Tests. We query models with an equal number of samples from the public and private datasets, create embeddings, and calculate confidence score vectors c and c_V, respectively. We perform a two-sample t-test on the distributions of c and c_V and calculate the p-value for the one-sided hypothesis H₀: µ < µ_V against H_alt: µ > µ_V. From L(x, y), it follows that g_V learns to minimize g_V(x) when x ∈ S_V, and to maximize it otherwise. Therefore, a vector that contains samples from S_V produces lower confidence scores and decreases the test's p-value. If the p-value is below a predefined significance level α, H₀ is rejected, and the model under test is marked as "stolen". In the following results, we repeat each statistical test 100 times with data sampled randomly with replacement.
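The testing step thus reduces to a one-sided two-sample t-test on the regressor's outputs. Below is a minimal sketch assuming the confidence vectors have already been computed; the function names, the subsample size, and the sampling scheme are illustrative assumptions, and the harmonic-mean aggregation mirrors the description that begins the next paragraph.

```python
# Sketch of the ownership test on precomputed confidence scores (illustrative, not the paper's code).
import numpy as np
from scipy import stats

def ownership_pvalue(c_public, c_private):
    """One-sided two-sample t-test of H0: mu <= mu_V against H_alt: mu > mu_V,
    where mu is the mean score on public points and mu_V on the victim's private points."""
    return stats.ttest_ind(c_public, c_private, alternative="greater").pvalue

def dataset_inference(c_public, c_private, m=10, repeats=100, alpha=0.01, seed=0):
    """Repeat the test on random subsamples of size m and combine p-values with the
    (uncalibrated) harmonic mean; flag 'stolen' only if the combined p-value is below alpha."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(repeats):
        pub = rng.choice(c_public, size=m, replace=True)
        priv = rng.choice(c_private, size=m, replace=True)
        pvals.append(ownership_pvalue(pub, priv))
    p_combined = len(pvals) / np.sum(1.0 / np.asarray(pvals))
    return ("stolen" if p_combined < alpha else "inconclusive"), p_combined
```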
To control for multiple testing, and to account for the unknown dependence between the p-values thus generated, we aggregate these values using the harmonic mean (Wilson, 2018). To produce bootstrap 99-percentile confidence intervals, we repeat the experiment 40 times. Table 1 shows p-values and the effect size, ∆µ = µ − µ_V, which captures the average confidence of our hypothesis test in claiming that a model was stolen. We test our approach against 6 different attackers and in two different settings (black- and white-box). In addition, Table 1 also reports "Source", where the victim's complete model f_V has been stolen, and "Independent", the control model trained on a separate dataset. Understandably, we typically observe the largest and smallest effect sizes for these two baselines, which serve as bounds to interpret our evaluation of attacks.

Our evaluation shows that DI is robust both to the strongest model stealing techniques and to an adaptive attack we propose based on zero-shot learning. DI can claim a model was stolen with at least 95% confidence for most threat models with only 10 samples. Hence, the defense exploits an inherent property of model training. Among the six attacks we considered, we observe that our method consistently flags fine-tuned models as stolen. This departs from prior defenses against model extraction: e.g., watermarks often lack robustness to fine-tuning. Here, DI is unaffected because fine-tuning does not remove knowledge from all the private data used to train the stolen model. The label-query and zero-shot attacks challenge DI the most. This is expected because zero-shot learning uses only synthetic data points for querying, and in the case of label-query, A_Q is merely using V to label their dataset, which leaks much less private knowledge than distillation-based model extraction. In practice, their higher query complexity makes both of these attacks the most (financially) expensive to mount. We present concurring results on SVHN and ImageNet in Appendix E.2.

DI requires few private points. In Figure 3, we show that the number of private points the victim has to reveal (from its training set) to achieve a particular p-value when claiming model ownership is low: 40, and often as few as 20, samples suffice to achieve a false positive rate (FPR) α of at most 1%.

Query efficiency. For the black-box scenario where the victim wants to assess the ownership of a model served through an API, DI is a query-efficient approach that comes at a low cost for the victim.

| Model | Stealing Attack | CIFAR10, MinGD (∆µ / p-value) | CIFAR10, Blind Walk (∆µ / p-value) | CIFAR100, MinGD (∆µ / p-value) | CIFAR100, Blind Walk (∆µ / p-value) |
|---|---|---|---|---|---|
| V | Source | 0.838 / 10⁻⁴ | 1.823 / 10⁻⁴² | 1.219 / 10⁻¹⁶ | 1.967 / 10⁻⁴⁴ |
| A_D | Distillation | 0.586 / 10⁻⁴ | 0.778 / 10⁻⁵ | 0.362 / 10⁻² | 1.098 / 10⁻⁵ |
| A_D | Diff. Architecture | 0.645 / 10⁻⁴ | 1.400 / 10⁻¹⁰ | 1.016 / 10⁻⁶ | 1.471 / 10⁻¹⁴ |
| A_M | Zero-Shot Learning | 0.371 / 10⁻² | 0.406 / 10⁻² | 0.466 / 10⁻² | 0.405 / 10⁻² |
| A_M | Fine-tuning | 0.832 / 10⁻⁵ | 1.839 / 10⁻²⁷ | 1.047 / 10⁻⁷ | 1.423 / 10⁻¹⁰ |
| A_Q | Label-query | 0.475 / 10⁻³ | 1.006 / 10⁻⁴ | 0.270 / 10⁻² | 0.107 / 10⁻¹ |
| A_Q | Logit-query | 0.563 / 10⁻³ | 1.048 / 10⁻⁴ | 0.385 / 10⁻² | 0.184 / 10⁻¹ |
| I | Independent | 0.103 / 1 | −0.397 / 0.675 | −0.242 / 0.545 | −1.793 / 1 |

Table 1: Ownership tester's effect size ∆µ (higher is better) and p-value (lower is better) using m = 10 samples on multiple threat models (see §6.1). The highest and lowest effect sizes among the model stealing attacks (A_D, A_M, A_Q) are marked in red and blue respectively.
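For completeness, the effect size ∆µ and the bootstrap confidence intervals reported above can be obtained from the same confidence vectors. A minimal sketch is shown below; the function names and the exact bootstrap scheme are illustrative assumptions rather than the paper's implementation.

```python
# Effect size and bootstrap 99% confidence interval for Delta_mu (illustrative sketch).
import numpy as np

def effect_size(c_public, c_private):
    """Delta_mu = mean(public scores) - mean(private scores); larger values indicate the
    suspect model treats the victim's private points more like its own training data."""
    return np.mean(c_public) - np.mean(c_private)

def bootstrap_ci(c_public, c_private, repeats=40, m=10, level=0.99, seed=0):
    """Percentile bootstrap over repeated m-sample draws, mirroring the repeated tests above."""
    rng = np.random.default_rng(seed)
    draws = [effect_size(rng.choice(c_public, m, replace=True),
                         rng.choice(c_private, m, replace=True)) for _ in range(repeats)]
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi
```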
Figure 3: p-value against the number of revealed samples (m), for CIFAR10 and CIFAR100 in the white-box and black-box settings (attack legend: Distillation, Label-Query, Logit-Query, Zero-Shot Learning, Fine-Tuning, Diff. Architecture). Significance levels (FPR) α = 0.01 and 0.05 are drawn as dotted lines. Under most attack scenarios, the victim V can dispute the adversary's ownership of f_A (with an FPR of at most 1%) by revealing fewer than 50 private samples.

For 100 data points, DI can be performed in fewer than 30,000 queries to the API. More efficient embedding generation optimizations can improve this significantly further (see Appendix D).

White-box access is not essential to DI. While access to gradient information can help in particular scenarios (such as logit-query on CIFAR100, where it reduces m from 40 to 20 samples), for fine-tuned adversary models, or those trained with a different architecture to evade detection, our proposed black-box solution (Blind Walk) performs surprisingly better than its white-box counterpart. We conjecture that Blind Walk's advantage stems from a combination of factors: (a) gradient-based approaches are sensitive to numerical instabilities, and (b) the approach is stochastic and aims to find the expected prediction margin rather than the worst case (it searches for any incorrect neighbouring class in a randomly chosen direction rather than focusing on the distance to specific target classes). Hence, our proposed Blind Walk inference procedure is highly efficient.

DI does not require overfitting or retraining. Unlike past defenses (watermarks) and attacks (MI) which we discussed previously, DI uniquely applies as a post-hoc solution to any publicly deployed model, irrespective of whether it overfit its training set. This means that model owners in the real world can perform DI immediately, to protect models that they have already deployed.

8 DISCUSSION AND CONCLUSION

While adversarial ML often consists of a cycle of attacks and defenses, we turn this game on its head. Dataset inference leverages the knowledge a defender has of their training set to identify models that an adversary created, either by directly accessing this training set without authorization or by indirectly distilling knowledge from one of the models released by the defender. With dataset inference, model developers resolve model ownership conflicts without making changes to their existing models. Interestingly, the ability to claim ownership through dataset inference gracefully degrades as the adversary spends increasingly more resources to train the stolen model. For instance, if an adversary extracts a copy and later fine-tunes it with a different dataset to conceal the theft, the model will become more different and dataset inference will be less likely to succeed. But this is expected and desired: it means the adversary faced a higher cost to obfuscate the stolen copy. This is in itself not an easy task, because of accuracy degradation and catastrophic forgetting.

Finally, it remains a promising direction for future work to study the confluence of DI with privacy-preserving models trained using ε-differential privacy (DP). Leino & Fredrikson (2020) have shown that while DP can help against membership inference (MI) attacks, it comes at a steep cost in accuracy. Since DI amplifies the membership signal using multiple private samples, we hypothesize that the ε values required to make DI ineffective would be even lower than those required for MI. Therefore, ε values that make the model private likely do not interfere with dataset inference.

Acknowledgments.
We thank the reviewers for their feedback. We would also like to thank members of Clever Hans Lab, especially Ilia Shumailov and Varun Chandrasekaran. This work was supported by a Canada CIFAR AI Chair, NSERC, Microsoft, and sponsors of the Vector Institute. Published as a conference paper at ICLR 2021 Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th {USENIX} Security Symposium ({USENIX} Security 18), pp. 1615 1631, 2018. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John Duchi. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (Neur IPS), 2019. Varun Chandrasekaran, Kamalika Chaudhuri, Irene Giacomelli, Somesh Jha, and Songbai Yan. Exploring connections between active learning and model extraction, 2019. Xinyun Chen, Wenxiao Wang, Chris Bender, Yiming Ding, Ruoxi Jia, Bo Li, and Dawn Song. Refit: a unified watermark removal framework for deep learning systems with limited data. ar Xiv preprint ar Xiv:1911.07205, 2019. Christopher A. Choquette Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot. Label-only membership inference attacks, 2020. Jacson Rodrigues Correia-Silva, Rodrigo F. Berriel, Claudine Badue, Alberto F. de Souza, and Thiago Oliveira-Santos. Copycat cnn: Stealing knowledge by persuading confession with random non-labeled data. 2018 International Joint Conference on Neural Networks (IJCNN), Jul 2018. doi: 10.1109/ijcnn.2018.8489592. URL http://dx.doi.org/10.1109/IJCNN.2018. 8489592. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. Gongfan Fang, Jie Song, Chengchao Shen, Xinchao Wang, Da Chen, and Mingli Song. Data-free adversarial distillation. ar Xiv preprint ar Xiv:1912.11006, 2019. Vitaly Feldman. Does learning require memorization? a short tale about a long tail, 2019. Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. Logan: Membership inference attacks against generative models. Proceedings on Privacy Enhancing Technologies, 2019: 133 152, 01 2019. doi: 10.2478/popets-2019-0008. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. High accuracy and high fidelity extraction of neural networks, 2020. Hengrui Jia, Christopher A. Choquette-Choo, and Nicolas Papernot. Entangled watermarks as a defense against model extraction, 2020. Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning, 2019. Alex Krizhevsky. 
Learning multiple layers of features from tiny images. University of Toronto, 2012. Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks, 2014. Published as a conference paper at ICLR 2021 Ram Shankar Siva Kumar, Jonathon Penney, Bruce Schneier, and Kendra Albert. Legal risks of adversarial machine learning research. ar Xiv preprint ar Xiv:2006.16179, 2020. Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167 7177, 2018. Klas Leino and Matt Fredrikson. Stolen memories: Leveraging model memorization for calibrated white-box membership inference. In USENIX Security Symposium, 2020. Chuan Li. Open AI s GPT-3 Language Model: A Technical Overview, 2020. URL https:// lambdalabs.com/blog/demystifying-gpt-3/. Shiyu Liang, Yixuan Li, and R Srikant. Principled detection of out-of-distribution examples in neural networks. ar Xiv preprint ar Xiv:1706.02690, 2017. Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273 294. Springer, 2018. Paul Micaelli and Amos Storkey. Zero-shot knowledge transfer via adversarial belief matching, 2019. Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/ housenumbers/nips2011_housenumbers.pdf. Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization. Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 2018. doi: 10.23915/ distill.00010. https://distill.pub/2018/building-blocks. Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff nets: Stealing functionality of black-box models, 2018. Nicolas Papernot, Patrick Mc Daniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning, 2017. Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Yann Ollivier, and Herve Jegou. Whitebox vs Black-box: Bayes Optimal Strategies for Membership Inference. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of Machine Learning Research, volume 97, pp. 5558 5567. PMLR, 2019. Masoumeh Shafieinejad, Jiaqi Wang, Nils Lukas, and Florian Kerschbaum. On the robustness of the backdoor-based watermarking in deep neural networks. ar Xiv preprint ar Xiv:1906.07745, 2019. Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3 18. IEEE, 2017. Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. Sponge examples: Energy-latency attacks on neural networks. ar Xiv preprint ar Xiv:2006.03463, 2020. Congzheng Song and Vitaly Shmatikov. 
Overlearning reveals sensitive attributes. Co RR, abs/1905.11742, 2019. URL http://arxiv.org/abs/1905.11742. Published as a conference paper at ICLR 2021 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1): 1929 1958, January 2014. ISSN 1532-4435. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015. Florian Tram er, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601 618, Austin, TX, August 2016. USENIX Association. ISBN 978-1-931971-32-4. Jean-Baptiste Truong, Pratyush Maini, Robert J. Walls, and Nicolas Papernot. Data-free model extraction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin ichi Satoh. Embedding watermarks into deep neural networks. Co RR, abs/1701.04082, 2017. URL http://arxiv.org/abs/ 1701.04082. Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707 723. IEEE, 2019. Daniel J. Wilson. The harmonic mean p-value for combining dependent tests. bio Rxiv, 2018. doi: 10.1101/171751. URL https://www.biorxiv.org/content/early/2018/02/07/ 171751. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google s neural machine translation system: Bridging the gap between human and machine translation. Co RR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144. Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. Ziqi Yang, Hung Dang, and Ee-Chien Chang. Effectiveness of distillation attack and countermeasure on neural network watermarking, 2019. S. Yeom, I. Giacomelli, M. Fredrikson, and S. Jha. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268 282, 2018. Hongxu Yin, Pavlo Molchanov, Zhizhong Li, Jose M. Alvarez, Arun Mallya, Derek Hoiem, Niraj K. Jha, and Jan Kautz. Dreaming to distill: Data-free knowledge transfer via deepinversion, 2019. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2016. Published as a conference paper at ICLR 2021 A THEORETICAL MOTIVATION In this section, we provide the formal proofs of Theorems 1, 2, 3 as stated in 4. 
First, we describe the preliminaries including the binary classification task and the machine learning model used to train the same in Appendix A.1. A.1 PRELIMINARIES We repeat the preliminaries described in 4 to discuss the proofs in the following sections. Setup. Consider a data distribution D, such that any input-label pair (x, y) can be described as: y u.a.r { 1, +1}; x1 = yu RK, x2 N(0, σ2) RD (2) where x = (x1, x2) RK+D and u RK is some fixed vector. This suggests that the last D dimensions of the input is Gaussian noise which has no correlation with the correct label. However, the first K input dimensions are sufficient to perfectly separate data points from classes { 1, +1}. The setup is adapted from Nagarajan & Kolter (2019). We use S+ and D+ to represent the subset of the training set S and the distribution D with label y = 1. Architecture. We consider the scenario of classifying the input distribution using a linear classifier, f, with weights w = (w1, w2), such that for any input: f(x) = w1 x1 + w2 x2 (3) While we only discuss the case of a linear network in this analysis, the success of dataset inference (like membership inference) only increases with the number of parameters in a machine learning model (Yeom et al., 2018), which in effect makes the following analysis a stronger result to prove. Prior works have also argued how over-parametrized deep learning networks memorize training points (Zhang et al., 2016; Feldman, 2019). A.2 TRAIN-TEST PREDICTION MARGIN (THEOREM 1) Training Algorithm. We assume that the learning algorithm initializes the weights of the classifier f to zero. Sample a training set S Dm = x(i), y(i) | i = 1 . . . m . The learning algorithm maximizes the loss L(x, y) = y f(x) and visits every training point once, with a gradient update step of learning rate α = 1. w1 w1 + αy(i)x1 (i) w2 w2 + αy(i)x2 (i) (4) From the optimization steps described above, one may note that the learned weights for the classifier f are given by w1 = mu and w2 = P i y(i)x2(i) irrespective of the training batch size. Inference. For any data point (x(j), y(j)), we calculate its prediction margin as the distance from the linear boundary, which is proportional to its label times the classifier s output y f(x). For any point, x = (x1, x2) D, the prediction margin is therefore given by: y f(x) = y (w1 x1 + w2 x2) = y (mu) (yu) + y i y(i)x2 (i) ! i y(i)x2 (i) x2 Published as a conference paper at ICLR 2021 Now, we calculate the expected value of the margin for a point randomly sampled from the training set. Consider any point in the training set (x, y) S+ = (x(j), 1) for some index j. Then, we have: Ex(j) S+f(x(j)) = y c + Ex2(i) N(0,σ2) i y(i)x2 (i) x2 (j) !# + Ex2(j) N(0,σ2) h y(i)(x2 (j))2i = c + 0 + Dσ2 Note that in (6), we utilize the fact that the square of a standard normal variable follows the χ2 (1) distribution; and that the expected value of product of independent random variables is same as the product of their expectations, followed by the linearity of expectation. Similarly, now consider a new data point (x, 1) D+. E(x,y) D+f(x) = yc + Ex2(i) N(0,σ2) i y(i)x2 (i) x2 Once again, in (7) we utilize the fact that the expected value of product of independent random variables is same as the product of their expectations, followed by the linearity of expectation. At an aggregate over multiple data points, we can hence show that E(x,y) S+f(x) E(x,y) D+f(x) = Dσ2. This concludes the proof for Theorem 1. 
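Before turning to the proofs of Theorems 2 and 3 in the next two subsections, the limiting behaviour they claim can be checked by evaluating the closed-form expressions stated in §4 numerically. The sketch below is illustrative only.

```python
# Numerical check of the limiting behaviour in Theorems 2 and 3 (illustrative sketch).
from scipy.stats import norm

def mi_success(D, m):
    """Theorem 2: probability that single-point membership inference guesses b correctly."""
    return 1.0 - norm.cdf(-(D / (2 * m)) ** 0.5)

def di_success(D):
    """Theorem 3: probability that the dataset-inference victim decides b correctly."""
    return 1.0 - norm.cdf(-(D / 8.0) ** 0.5)

for m in [100, 10_000, 1_000_000]:
    print(f"MI success, D=900, m={m}: {mi_success(900, m):.3f}")   # tends to 0.5 as m grows
for D in [10, 100, 900]:
    print(f"DI success, D={D}: {di_success(D):.6f}")               # tends to 1 as D grows
```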
A.3 FAILURE OF MEMBERSHIP INFERENCE (THEOREM 2) In this section, we take a formal view of the conditions that lead to the failure and success of membership inference. Before we begin with our formal analysis, we would like to point out that the statistical difference between the distribution of training and test data points in Theorem 1 is only observed when we aggregate an expectation over multiple samples. Now, we show that the variance of this difference is so large, that it is very difficult to make any claims from a single input data point. Consider an adversary that does not have knowledge of the private data used to train a machine learning model. However, it contains domain knowledge of the task that the model is trying to solve. This may include the range and dimension of possible inputs to the model. In our case, the adversary has knowledge of the data distribution D, but not of the training set S. For a single data point x = (x1, x2), s.t. (x, y) D, the adversary aims to predict whether it was used to train the machine learning model, f. The prediction margin for (x, y) D is given by: y f(x) = y (w1 x1 + w2 x2) = c + i y(i)x(i) 2 x2 From the analysis in Theorem 2, the adversary knows that E(x,y) S [y f(x)] = c + Dσ2 and E(x,y) D [y f(x)] = c. Let M(x|f) represents the membership decision of the adversary for a given data point x and classifier f. The strongest adversary will use the following decision rule: M(x|f) = 1, if (y f(x) c) t 0, o.w. (9) where t 0, Dσ2 is some threshold that the adversary can tune in order to achieve maximum true positive rate and minimum false positives. Similar to Yeom et al. (2018), we consider the scenario where the input data is randomly (with equal probability via coin flip b) sampled from either S (if b = 1) or D (if b = 0). Let such a distribution be specified as (x, b) R. The adversary M must maximize the single objective P(x,b) R [M(x|f) = b]. In summary, P(x,b) R [M(x|f) = b] = P(x,y) S [M(x|f) = 1] + P(x,y) D [M(x|f) = 0] Published as a conference paper at ICLR 2021 We simplify our analysis by considering the data point (x, y) D+ (has true label, y = 1). However, the analysis generally applies to any (x, y) D. Case 1: (x, y) D+. Assume meta-variable z2 = P i y(i)x2(i) . Therefore, z2 N(0, mσ2I), while x2 N(0, σ2I). Recall that x2, z2 RD. Assuming D to be large, we can conveniently apply the central limit theorem to approximate the distribution of the internal term. Let the individual dimensions of z2 be denoted by z2(i). Then, we have that: P(x,y) D+ [M(x|f) = 0] = P(x,y) D+ i y(i)x2 (i) x2 = P(x,y) D+ [(z2 x2) < t] = P(x,y) D+ j Dz2(j) x2(j) Let α represents the distribution followed by z2(i) x2(i). From CLT, we have that the combined distribution behaves like a normal distribution, with µ = µα = and σ2 = σ2 α D . σ2 α = m D2 σ4 (12) We use the fact that Var[XY ] = Var[X]Var[Y ] + E[X]2Var[Y ] + E[Y ]2Var[X] and Var[c X] = c2 Var[X] for computing σ2 α. Therefore, let r N(0, m Dσ4): P(x,y) D+ [M(x|f) = 0] = Pr N(0,m Dσ4) [r < t] (13) It can be observed that P [r < t] increases with the threshold value t. For t = 0, P [(z2 x2) < 0] = 0.5. Whereas, for t = Dσ2, the probability decreases with the value of m (this can be intuitively understood as since the size of training set increases, overfitting decreases, making MI more difficult). Even for as low as m = 100 points in the training set, P (z2 x2) < Dσ2 = 0.6. For any value of t 0, σ2 , the maximum probability for size of training data m = 100 is 0.6. 
Further, as the size of the training set increases, the probability tends to 0.5. Case 2: (x, y) S+. Once again, as in the proof for Theorem 1, consider any point in the training set (x, y) S+ = (x(j), 1) for some index j. We will now calculate the probability of success of the adversary that follows the decision rule described above: P(x,y) S+ [M(x|f) = 1] = P(x,y) S+ i y(i)x2 (i) x2 = P(x,y) S+ i y(i)x2 (i) x2 (j) ! + x2 (j) x2 (j) > t Now, following the discussion in the first case, we know that the first term can be approximated by a variable α N(0, (m 1)Dσ4). Similarly, using CLT over the sum of multiple random variables sampled from a χ2 1 distribution, we can approximate the second term in the above equation with a variable β N(Dσ2, Dσ4). Finally, using the property for sum of independent gaussians, we can approximate the entire prediction margin to be represented by a sample u N(Dσ2, m Dσ4). Then, we have that: P(x,y) S+ [M(x|f) = 1] = Pu N(Dσ2,m Dσ4) [u > t] (15) Hence, we show that the adversary can do no better than a coin flip. This concludes the proof for Theorem 2. The interested reader may further analyze the assertion that the optimal value of t lies in 0, Dσ2 . To resolve the optimal threshold t for membership inference, we restructure the arguments as follows. Recall from (10) that the adversary aims to ensure both true positive rates and true negative Published as a conference paper at ICLR 2021 rates are high. We know: P(x,y) D+ [M(x|f) = 0] = Pr N(0,m Dσ4) [r < t] P(x,y) S+ [M(x|f) = 1] = Pu N(Dσ2,m Dσ4) [u > t] P(x,b) R [M(x|f) = b] P [u r > 0] (16) We know that both u, r are sampled from normal distributions. Therefore, define γ = (u r) N(Dσ2, 2m Dσ4). This simplifies our discussion to a single normal distribution with mean µγ = Dσ2 and variance, σ2 γ = 2m Dσ4. We can now calculate the CDF at x = 0 to evaluate the maximum probability of success of membership inference (decision taken by the optimal adversary). Let Z N(0, 1). It can hence be shown that: P[γ > 0] = P (σγZ + µγ) = P Z > µγ Clearly, as m , P[γ > 0] 0.5. This concludes the proof for Theorem 2. A.4 SUCCESS OF DATASET INFERENCE (THEOREM 3) In Theorem 2 we showed that an adversary querying a single data point can say no better than a coin flip about the presence or absence of a given data point in a model s training set. In this section, we show that when we reverse this adversarial game, the victim can utilize the information asymmetry to predict with high confidence if a potential adversary s model stole their knowledge in any form. First, recall that the victim has access to its own private training set of size m. For the purposes of this proof, we call it Sm V . As the victim has complete information of the data distribution, it can randomly sample another dataset S0 D. The victim considers that the potential adversary s model was stolen if the mean prediction margin for the points in SV is greater than S0 by some threshold parameter λ. Let ψV(f, S; D) be V s decision function to resolve ownership claims. Recall that in Theorem 1 we had calculated the expected value of the difference in the prediction margin for the points in the training set versus those in the test set. In the proof of this theorem, we calculate the probability of the mean of the difference being greater than some value λ. Now, let us calculate the probability of this margin for a data point randomly sampled from the training set. Let t V represent the mean of the prediction margin of all points in SV for a classifier f. 
A.4 SUCCESS OF DATASET INFERENCE (THEOREM 3)

In Theorem 2 we showed that an adversary querying a single data point can do no better than a coin flip in deciding the presence or absence of that point in a model's training set. In this section, we show that when we reverse this adversarial game, the victim can utilize the information asymmetry to predict with high confidence whether a potential adversary's model stole their knowledge in any form.

First, recall that the victim has access to its own private training set of size m; for the purposes of this proof, we call it S_V. As the victim has complete information about the data distribution, it can randomly sample another dataset S_0 ∼ D. The victim considers that the potential adversary's model was stolen if the mean prediction margin for the points in S_V is greater than that for S_0 by some threshold parameter λ. Let ψ_V(f, S; D) be V's decision function to resolve ownership claims. Recall that in Theorem 1 we calculated the expected value of the difference in the prediction margin for points in the training set versus those in the test set. In the proof of this theorem, we calculate the probability of the mean of this difference being greater than some value λ.

Let t_V represent the mean of the prediction margin (in excess of c) over all points in S_V for a classifier f. Similarly, let t_0 represent the mean of the prediction margin over all points in S_0 for the classifier f. We use u_2^{(j)} to denote the last D dimensions of the j-th point in S_0. Then,

    t_V = (1/m) Σ_j [ ⟨x_2^{(j)}, x_2^{(j)}⟩ + ⟨Σ_{i≠j} y^{(i)} x_2^{(i)}, x_2^{(j)}⟩ ]
    t_0 = (1/m) Σ_j ⟨Σ_i y^{(i)} x_2^{(i)}, u_2^{(j)}⟩
    P[ψ_V(f, S; D) = 1] = P[(t_V − t_0) > λ]

Recognize the similarity of this formulation with the one discussed in the proof of Theorem 2 in Appendix A.3. Let t = t_V − t_0. Then the random variable t is a sample from the distribution of the mean of m draws of γ defined in Appendix A.3. We can now directly use the central limit theorem for this proof. Therefore,

    µ_t = µ_γ = Dσ²,    σ²_t = σ²_γ / m = 2Dσ⁴    (19)

Hence, t ∼ N(Dσ², 2Dσ⁴). It is important to note that this distribution is independent of the number of training points. Hence, unlike membership inference, the success of DI is not curtailed by the lack of overfitting. Similarly, for an honest adversary, the distribution of the prediction margin for points in S_V is the same as that for the points in S_0. It directly follows that:

    P[ψ_V(f, S; D) = 0] = P[t̂ < λ]    (20)

where t̂ ∼ N(0, 2Dσ⁴). Once again, as in the proof of Theorem 2, by the symmetry of two normal distributions with the same variance and shifted means, the optimal value of the parameter λ that maximizes true positives and minimizes false positives is λ* = µ_t / 2. Let Z ∼ N(0, 1). It can hence be shown that:

    P[t > λ*] = P[σ_t Z + µ_t > µ_t / 2] = P[Z > −µ_t / (2σ_t)]

Clearly, as D → ∞, P[t > λ*] → 1.0 (and, by symmetry, the false-positive probability P[t̂ > λ*] → 0). This concludes the proof of Theorem 3.
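The same toy model can be used to illustrate Theorem 3: aggregating the margin over the m revealed points separates a model trained on S_V from an independent model, with a gap governed by D rather than by m. The sketch below is our own illustration under the same assumed constants as in the previous snippet; ψ_V is implemented as the threshold test with λ* = Dσ²/2 from the proof.

```python
# Minimal sketch of the dataset-inference test analyzed above (Theorem 3),
# under the same illustrative Gaussian model and constants as before
# (not the paper's released code).
import numpy as np

rng = np.random.default_rng(1)
D, SIGMA, m = 100, 1.0, 1000
LAMBDA = D * SIGMA**2 / 2            # lambda* = mu_t / 2 from the proof

def sample_private_set(m):
    y = rng.choice([-1.0, 1.0], size=m)
    x2 = rng.normal(0.0, SIGMA, size=(m, D))
    return y, x2

def margin_gap(w2, y_v, x2_v):
    """t_V - t_0: mean margin on the victim's set minus on a fresh set S_0."""
    t_v = np.mean(y_v * (x2_v @ w2))                  # victim's private training points
    u2 = rng.normal(0.0, SIGMA, size=(m, D))          # S_0 ~ D, labels taken as +1
    t_0 = np.mean(u2 @ w2)
    return t_v - t_0

y_v, x2_v = sample_private_set(m)                     # victim's private set S_V
w2_stolen = (y_v[:, None] * x2_v).sum(axis=0)         # model trained on S_V

y_a, x2_a = sample_private_set(m)                     # honest adversary's own data
w2_honest = (y_a[:, None] * x2_a).sum(axis=0)         # independent model

for name, w2 in [("stolen", w2_stolen), ("independent", w2_honest)]:
    gap = margin_gap(w2, y_v, x2_v)
    print(f"{name:11s}: t_V - t_0 = {gap:8.2f}  ->  flagged = {gap > LAMBDA}")
# The stolen model's gap concentrates around D*sigma^2, the independent
# model's around 0, and the separation grows with D (not with m).
```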
B MODEL STEALING TECHNIQUES

In this section, we provide more details about the various threat models that we consider in this work. We also provide specific use-cases and motivation for the respective threat models, and introduce a new adaptive adversary targeted specifically against DI.

V: Victim. The victim V wishes to release its machine learning model to the community, either as a service or by open-sourcing it for non-commercial academic use. V wants to ensure that the deployed model is not being misused beyond the terms of the license provided.

A_D: Data Access. The adversary A_D is able to gain complete access to the victim's private training data and aims to deploy its own MLaaS by training on it. We note that labeled private training data is one of the most expensive commodities in the deployment cycle of modern machine learning systems.

1. Model Distillation: Traditionally, model distillation (Hinton et al., 2015) was used as a method to compress larger models by training smaller students on the logits of a teacher model. We use it as a threat model that the adversary may employ to distance its predictions from a model trained directly on hard labels from the dataset. The adversary requires both query access and access to the victim's private training data for this attack.

2. Modified Architecture: Multiple works have attempted to identify unique properties (or 'fingerprints') of a model by analyzing specific activations and representative features of internal model layers (Olah et al., 2017; 2018; Yin et al., 2019). We study the threat model where the adversary trains an alternate architecture on the victim's private dataset D_priv to validate the robustness of our method to changes in model structure.

A_M: Model Access. The use-case of such an adversary is two-fold: (1) the victim open-sources their machine learning model under a license that does not allow other individuals to monetize it; or (2) the adversary gains insider access to the trained model of the victim. In both cases, the adversary aims to monetize its own MLaaS and deploys its own model on the web, modifying the original victim model to reduce its dependence on K.

1. Fine-tuning: The adversary has full access to the victim's machine learning model, but not to its training data. While fine-tuning is typically used to transfer the knowledge of large pre-trained models to a given task (Devlin et al., 2018), we use it as a stealing attack, where the adversary uses the predictions of the victim model on unlabeled public data to modify its decision boundaries. We consider the setting where the adversary can fine-tune all layers.

2. Zero-Shot Learning: This is the strongest adversary that we introduce, targeted specifically at evading dataset inference. To the best of our knowledge, we are the first to consider such a threat model. The adversary uses no direct knowledge of the actual training data, so as to avoid any features that it may learn as a result of training on the victim's private dataset. The adversary has complete access to the victim model and uses data-free knowledge transfer (Micaelli & Storkey, 2019; Fang et al., 2019) to train a student model.

A_Q: Query Access. Model extraction (Tramèr et al., 2016) is the most popular form of model stealing attack against machine learning models deployed on the web. We discuss related work on model extraction attacks in more detail in Section 2. Depending on the access provided by the machine learning service, an adversary may aim to extract the model using the logits or the labels alone.

1. Model Extraction Using Labels: The victim model is used to provide pseudo-labels for a public dataset. The adversary trains their model on this pseudo-dataset. The key difference is that the input data points may be semantically irrelevant with respect to the task labels that the adversary's model is being trained on.

2. Model Extraction Using Logits: The performance of model extraction attacks improves when the victim provides confidence values for the different output classes, rather than the hard labels alone. The adversary's model is trained to minimize the KL divergence with the outputs of the victim on a public (or non-task-specific) dataset.

I: Independent Model. Finally, we also study the results of dataset inference on an independent and honest machine learning model that is trained on its own private dataset. This is used as a control to verify that the dataset inference procedure does not always predict that the potential adversary stole the victim's knowledge.⁴

⁴Note that since we consider the difference in the distribution of outputs of the auxiliary classifier on embeddings from the test and training sets (rather than hard labels from the auxiliary classifier), even in the absence of this control we can undeniably verify the confidence of dataset inference. The control is only included to contrast the difference and make the effects of the method clearer to the reader.

C EMBEDDING GENERATION

Embedding Generation Hyperparameters. For the MinGD attack, we perform adversarial attacks defined by the optimization problem

    min_δ Δ(x, x + δ)  s.t.  f(x + δ) = t;  x + δ ∈ [0, 1]^n    (22)

The distance metric Δ(x^{(i)}, x^{(j)}) refers to the ℓ_p distance between points x^{(i)} and x^{(j)} for p ∈ {1, 2, ∞}, and t is the target label. To perform the optimization, we perform gradient descent with steps of size α_p. We take a maximum of 500 steps of gradient optimization, but terminate early at the earliest misclassification. The step sizes for the individual perturbation types are given by {α_∞, α_2, α_1} = {0.001, 0.01, 0.1}.
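The following condensed PyTorch sketch illustrates the MinGD distance search just described. The function names, the use of signed-gradient steps for all three norms, and the cross-entropy surrogate for reaching the target class are our own simplifications of Eq. (22); this is not the paper's released implementation.

```python
# Condensed sketch of MinGD embedding generation: take targeted gradient
# steps of size alpha_p until the first misclassification (at most 500
# steps) and record the l_p distance reached. Details are simplified.
import torch
import torch.nn.functional as F

STEP_SIZES = {float("inf"): 1e-3, 2: 1e-2, 1: 1e-1}    # {alpha_inf, alpha_2, alpha_1}

def min_gd_distance(model, x, y, target, p, max_steps=500):
    """Approximate l_p distance from x to the decision region of class `target`."""
    alpha = STEP_SIZES[p]
    x_adv = x.clone().detach()
    for _ in range(max_steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv.unsqueeze(0))
        if logits.argmax(dim=1).item() == target:        # earliest misclassification
            break
        loss = F.cross_entropy(logits, torch.tensor([target]))
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Signed-gradient step (a simplification; the full method scales steps per norm)
            x_adv = (x_adv - alpha * grad.sign()).clamp(0.0, 1.0)   # stay in [0, 1]^n
    return torch.norm((x_adv - x).flatten(), p=p).item()

def mingd_embedding(model, x, y, num_classes):
    """Distances to every other class under each l_p norm, concatenated as one embedding."""
    feats = [min_gd_distance(model, x, y, t, p)
             for p in STEP_SIZES
             for t in range(num_classes) if t != y]
    return torch.tensor(feats)
```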
For the Blind Walk attack, we sample a random initial direction δ. Starting from an input (x, y), we take k ∈ N steps in the same direction until f(x + kδ) = t with t ≠ y. Then, Δ(x, x + kδ) is used as a proxy for the prediction margin of the model. We repeat the search over multiple random initial directions to increase the information about a training data point's robustness, and use each of these distance values as features in the generated embedding (a short sketch of this procedure is given at the end of this section). As an implementation detail, we sample from uniform, Laplace, and Gaussian noise to generate embedding features. To measure the final perturbation distance from the initial starting point, we use a different ℓ_p norm for each of the noise sampling methods: for uniform noise, we compute the ℓ_∞ distance; for Gaussian noise, the ℓ_2 distance; and for Laplacian noise, the ℓ_1 distance of the nearest misclassification. While we take k steps of Blind Walk until misclassification, we do not exceed 50 steps, and we terminate prematurely without a misclassification if the prediction label does not change.

Performance of the White-Box Approach. In our evaluations, we find that the white-box MinGD method generally underperforms the Blind Walk method. This happens despite its ability to compute the nearest distance to any target class more accurately. At the outset, this may seem counter-intuitive: with more access, the quality of mapping the neighbourhood should only improve. However, we note an important distinction. The end goal of the query generation process is not to calculate the minimum distance to target classes accurately, but rather to understand the prediction margin, or local landscape, of a given data point. Readers may recall from the adversarial examples literature (Szegedy et al., 2013) that adversarial examples can easily be constructed even on the dataset that a given machine learning model was trained on. This observation cuts against the intuition of Figure 1b: despite the neighbouring class boundaries being pushed away, the existence of adversarial examples reveals small 'pits' within the landscape of the model. We hypothesize that the gradient-based optimization objective captures this minimum (adversarial) distance and fails to capture a prediction margin that is more representative of the classifier's prediction confidence or general landscape. On the contrary, Blind Walk performs remarkably well at this end goal: since we no longer adversarially optimize the minimum distance to the neighbouring classes, multiple Blind Walk runs effectively map the average-case prediction margin, which we argue is more useful than the worst-case prediction margin obtained by MinGD.
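Below is a minimal PyTorch sketch of the Blind Walk procedure described at the start of this section, pairing each noise family with its ℓ_p norm as in the text. Helper names, the noise scale, and the handling of walks that never change the label are our own choices rather than the paper's released code.

```python
# Minimal sketch of Blind Walk: walk from x in a fixed random direction until
# the label changes (at most 50 steps) and record the distance travelled,
# using the noise-family/norm pairing from the text.
import torch

NOISES_AND_NORMS = [("uniform", float("inf")), ("gaussian", 2), ("laplace", 1)]

def sample_direction(kind, shape, scale=0.02):     # scale is an arbitrary choice here
    if kind == "uniform":
        return scale * torch.empty(shape).uniform_(-1.0, 1.0)
    if kind == "gaussian":
        return scale * torch.randn(shape)
    return scale * torch.distributions.Laplace(0.0, 1.0).sample(shape)   # "laplace"

@torch.no_grad()
def blind_walk_distance(model, x, y, kind, p, max_steps=50):
    """l_p distance travelled along one random direction until misclassification."""
    delta = sample_direction(kind, x.shape)
    for k in range(1, max_steps + 1):
        x_k = (x + k * delta).clamp(0.0, 1.0)
        if model(x_k.unsqueeze(0)).argmax(dim=1).item() != y:   # first label flip
            break
    return torch.norm((x_k - x).flatten(), p=p).item()          # capped at 50 steps

@torch.no_grad()
def blind_walk_embedding(model, x, y, repeats=10):
    """Repeat the walk over several random directions to build the embedding
    (e.g., 10 walks per noise family yields a 30-feature embedding)."""
    feats = [blind_walk_distance(model, x, y, kind, p)
             for kind, p in NOISES_AND_NORMS for _ in range(repeats)]
    return torch.tensor(feats)
```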
D EFFECT OF EMBEDDING SIZE

For all models, richer embeddings reduce the need for more revealed samples (see Figure 4). Note that in the main body of this work, we used a fixed embedding vector size of 30 input features. However, recall that in the black-box setting, the victim incurs an additional cost for each query made to the potential adversary. Therefore, in this section we aim to understand the marginal utility of each extra embedding feature added. In general, we find that for most of the threat models studied, using only 10 features for the embedding space is sufficient to achieve the required threshold p-value of 0.01. This suggests that we can cut the number of queries made to the potential adversary to a third without any loss in confidence of the prediction. Interestingly, we also note that even in scenarios where the victim reveals only 15 samples, additional embedding features offer an insignificant advantage compared to querying fresh samples. This suggests that the amount of entropy gained by revealing a new data point is significantly more than that gained by extracting more features (beyond 10) for the same data point. We also note that the effect is not consistent in the zero-shot learning threat model.

E EXPERIMENTS

E.1 DATASET DESCRIPTION

CIFAR10. CIFAR10 (Krizhevsky, 2012) contains 60,000 coloured images, with 10,000 reserved for testing. There are 10 target classes with 5,000 training images per class.

SVHN. SVHN (Netzer et al., 2011) is a dataset obtained from house numbers in Google Street View images. The underlying task is digit classification from 32×32 coloured images.

CIFAR100. CIFAR100 (Krizhevsky, 2012) also contains 60,000 coloured images, with 10,000 reserved for testing. There are 100 target classes with 500 training images per class.

ImageNet. The ImageNet dataset (Deng et al., 2009) is a large-scale benchmark consisting of various challenges, including image recognition, for machine learning systems.

[Figure 4: p-value vs. distance-embedding size, with panels for the Distillation, Label-Query, Logit-Query, Zero-Shot Learning, Fine-Tuning, and Diff. Architecture threat models.]

Dataset Inference on SVHN (Blind Walk attack)

    Access   Model Stealing Attack    µ        p-value
    V        Source                   0.950    10^-8
    A_D      Distillation             0.537    10^-3
             Diff. Architecture       0.450    10^-2
    A_M      Zero-Shot Learning       0.512    10^-3
             Fine-tuning              0.581    10^-4
    A_Q      Label-query              0.513    10^-3
             Logit-query              0.515    10^-2
             Random-query             0.475    10^-2
    I        Independent             -0.322    10^-1

Table 2: Ownership Tester's effect size in a small-data regime (using only m = 10 samples) on the SVHN dataset using the Blind Walk attack. The 2nd highest and the lowest effect sizes are marked in red and blue.

E.2 ADDITIONAL DATASETS

To further validate our claims about the success of dataset inference, we provide evidence on two additional datasets: SVHN and ImageNet.

SVHN. The results of DI via the Blind Walk attack on the various threat models discussed in Appendix B are presented in Table 2. To perform the model extraction and fine-tuning attacks, we utilized the set of extra images available with the SVHN dataset; we use the first 50,000 images in this set to stage the attacks that require a surrogate dataset. For training on the original dataset with a different architecture, we once again utilize the pre-activation version of ResNet-18, as for CIFAR10 and CIFAR100 in the main paper. Notably, we also introduce another threat model, Random-query, which describes a scenario where the victim is queried with completely random inputs.
While the zero-shot learning framework also queries the victim with synthetic images, those images are synthesized to maximize the disparity between the predictions of the student and the teacher. In contrast, in the case of Random-query, we query the victim by sampling inputs from a normal distribution, x ∼ N(0, 1). DI is resilient to completely random queries as well.

    Threat Model              ImageNet Architecture    µ        p-value
    V    Source               Wide ResNet-50-2         1.868    10^-34
    A_D  Diff. Architecture   AlexNet                  0.790    10^-3
         Diff. Architecture   Inception V3             1.085    10^-5

Table 3: Ownership Tester's effect size in a small-data regime (using only m = 10 samples) on the ImageNet dataset using the Blind Walk attack.

    Fraction Overlap    µ        p-value
    0.0                -0.172    0.308
    0.3                 0.499    7.93×10^-3
    0.5                 0.514    5.78×10^-3
    0.7                 0.576    2.52×10^-3
    1.0                 0.566    3.45×10^-3

Table 4: Ownership Tester's effect size in a small-data regime (using only m = 10 samples) for varying fractions of training-set overlap.

Note that we do not include Random-query as a threat model for the CIFAR10 and CIFAR100 datasets, because random querying is insufficient to achieve model extraction accuracy greater than the majority-class baseline on more complicated tasks such as these. However, for SVHN, we were able to train an extracted model with 90.2% test set accuracy using random queries alone. Similar observations have been shared in other model extraction literature as well (Truong et al., 2021). We found that our conclusions hold for this additional dataset, and we can claim ownership with as few as 10 examples.

ImageNet. We remark that prior work in model extraction has not successfully demonstrated the efficacy of model stealing on large-scale benchmarks like ImageNet: these methods require many queries to steal a model and are hence not practical yet. As a proof of validity of our approach, we demonstrate dataset inference (DI) in the threat model that assumes complete data theft: the adversary directly steals the dataset used by the victim to train a model, rather than querying the victim model. We use three models pre-trained on ImageNet with different architectures, and treat the one with a Wide ResNet-50-2 (Zagoruyko & Komodakis, 2016) backbone as the victim. We then observe whether DI is able to correctly identify that the two other pre-trained models (AlexNet (Krizhevsky, 2014) and Inception V3 (Szegedy et al., 2015)) were also trained on the same dataset (i.e., ImageNet). We confirm that DI is able to claim ownership (i.e., that the suspect models were indeed trained using knowledge of the victim's training set) with a p-value of 10^-3 on revealing only 10 samples.

For performing DI, the victim trains the confidence regressor with the help of embeddings generated by querying the Wide ResNet-50-2 architecture over the training and validation sets separately. We first note that the confidence regressor generalizes well to other points in the victim's train and test sets, despite the fact that the ImageNet dataset is orders of magnitude larger than the previous benchmarks experimented on: with only 10 examples, we attain p-values less than 10^-30. Finally, to test whether this generalization holds for other architectures that underwent a disjoint training procedure, we experiment over AlexNet and Inception V3. We find that, given 10 examples from the ImageNet training dataset, DI can confidently say that the suspect models utilize knowledge of the victim's training set, with p-values less than 10^-4. Note that these models are trained on the full large-scale dataset, whereas prior work on MI attacks trains victims on small subsets of the training data to enable overfitting (Yeom et al., 2018). We believe this demonstrates that DI scales to complex tasks.
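For readers who want to reproduce the shape of these results, the snippet below shows one way an ownership p-value and effect size of the kind reported in Tables 2-4 can be computed from the confidence regressor's outputs on revealed private samples versus public samples. We use a one-sided Welch t-test purely as an illustration; the exact test statistic of the ownership tester follows the main paper and may differ.

```python
# Illustration of computing an ownership effect size and p-value from
# per-sample confidence scores. One-sided Welch t-test used as one concrete
# instantiation; not necessarily the exact statistic of the paper.
import numpy as np
from scipy import stats

def ownership_test(conf_private, conf_public):
    """conf_private / conf_public: regressor outputs for the m revealed
    training samples and for m public (unseen) samples."""
    conf_private = np.asarray(conf_private, dtype=float)
    conf_public = np.asarray(conf_public, dtype=float)
    effect_size = conf_private.mean() - conf_public.mean()
    t_stat, p_two_sided = stats.ttest_ind(conf_private, conf_public, equal_var=False)
    # One-sided alternative: private samples score higher than public ones
    # (under this toy orientation of the regressor's output).
    p_value = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return effect_size, p_value

# Toy usage with synthetic scores (m = 10 revealed samples):
rng = np.random.default_rng(0)
mu, p = ownership_test(rng.normal(0.5, 0.3, 10), rng.normal(0.0, 0.3, 10))
print(f"effect size = {mu:.3f}, p-value = {p:.2e}")
```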
F EXTENT OF OVERLAP

In this section, we elaborate upon the effect of overlap between the private datasets of two parties and how dataset inference responds to such scenarios. More specifically, we study the amount of overlap required for DI to be able to claim theft of common knowledge in the following scenario: we consider a competitor (or adversary) who owns their own private training dataset S_A. The adversary gains access to the victim's training dataset S_V, and now trains their ML model on S = S_A ∪ S_V^λ, where S_V^λ ⊆ S_V and |S_V^λ| = λ|S_V|. That is, the new dataset S contains a fraction λ of the training points private to V. Since at training time the adversary optimizes the prediction margin over all points in S, the prediction margin for points in S_V is also affected. At test time, when the victim queries on these points, DI is expected to succeed.

[Figure 5: Ownership Tester's p-value depicted as a function of the number of training samples revealed (m). In the figure on the left, for an honest adversary whose dataset has no overlap with the victim's dataset, the p-value increases as the number of revealed samples increases, indicating a decrease in the confidence of the claim of knowledge theft. For all adversaries with fractional data overlap (on the right), DI achieves a p-value of less than 0.01 with under 10 samples.]

We validate this claim on the SVHN dataset, which provides a set of extra images apart from the train and test sets. We train the adversary on the union of the extra set and varying fractions of the train set, where the train set is supposed to be private to the victim. At the time of dataset inference, the victim queries the adversary's model with 50 samples from its private train set and from the test set. Dataset inference succeeds with p-value = 10^-3 for the tested fractions of overlap 0.3, 0.5, 0.7, and 1.0. More importantly, as the overlap goes to 0, DI once again does not claim knowledge theft. We present our results for DI on revealing 10 samples in Table 4, and a graphical illustration of these results with varying numbers of revealed samples in Figure 5. From Table 4, it may also be noted how the effect size increases with the amount of overlap with the private training set, indicating that DI becomes increasingly confident of knowledge theft.
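As a concrete illustration of the setup in this section, the sketch below builds the overlapping training set S = S_A ∪ S_V^λ for SVHN, treating the 'extra' split as the adversary's own data S_A and the 'train' split as the victim's private set S_V. Variable names and sampling details are our own; this is not the paper's released code.

```python
# Minimal sketch of constructing S = S_A ∪ S_V^λ for the SVHN overlap
# experiment: adversary's own data = the 'extra' split, victim's private
# data = the 'train' split, of which a fraction λ is "stolen".
import torch
from torch.utils.data import ConcatDataset, Subset
from torchvision import datasets, transforms

def build_overlap_trainset(root, overlap_fraction, seed=0):
    tfm = transforms.ToTensor()
    s_a = datasets.SVHN(root, split="extra", download=True, transform=tfm)  # S_A
    s_v = datasets.SVHN(root, split="train", download=True, transform=tfm)  # S_V (victim-private)

    g = torch.Generator().manual_seed(seed)
    k = int(overlap_fraction * len(s_v))                   # |S_V^λ| = λ |S_V|
    stolen_idx = torch.randperm(len(s_v), generator=g)[:k]
    return ConcatDataset([s_a, Subset(s_v, stolen_idx.tolist())])           # S = S_A ∪ S_V^λ

# e.g., the λ = 0.5 adversary from Table 4:
# train_set = build_overlap_trainset("./data", overlap_fraction=0.5)
```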