
Published in Transactions on Machine Learning Research (04/2023)

Differentially Private Image Classification from Features

Harsh Mehta harshm@google.com Google Research

Walid Krichene walidk@google.com Google Research

Abhradeep Thakurta athakurta@google.com Google Research

Alexey Kurakin kurakin@google.com Google Research

Ashok Cutkosky ashok@cutkosky.com Boston University

Reviewed on OpenReview: https://openreview.net/forum?id=Cj6pLclmwT

In deep learning, leveraging transfer learning has recently been shown to be an effective strategy for training large, high-performance models with Differential Privacy (DP). Moreover, somewhat surprisingly, recent works have found that privately training just the last layer of a pre-trained model provides the best utility with DP. While past studies largely rely on first-order differentially private training algorithms like DP-SGD for training large models, in the specific case of privately learning from features, we observe that the computational burden is often low enough to allow for more sophisticated optimization schemes, including second-order methods. To that end, we systematically explore the effect of design parameters such as loss function and optimization algorithm. We find that, while commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting: least-squares linear regression is much more effective than logistic regression from both privacy and computational standpoints, especially at stricter epsilon values (ε < 1). On the optimization side, we also explore using Newton's method, and find that second-order information is quite helpful even with privacy, although the benefit significantly diminishes with stricter privacy guarantees. While both methods use second-order information, least squares is more effective at lower epsilon values while Newton's method is more effective at larger epsilon values. To combine the benefits of both methods, we propose a novel optimization algorithm called DP-FC, which leverages feature covariance instead of the Hessian of the logistic regression loss and performs well across all ε values we tried. With this, we obtain new SOTA results on ImageNet-1k, CIFAR-100 and CIFAR-10 across all values of ε typically considered. Most remarkably, on ImageNet-1K, we obtain top-1 accuracy of 88% under a DP guarantee of $(8, 8 \times 10^{-7})$ and 84.3% under $(0.1, 8 \times 10^{-7})$.

1 Introduction

Despite impressive performance, large machine learning models are susceptible to attacks. Previous work has demonstrated successful membership attacks where the goal is to extract exact instances of training examples that the model was trained on (Shokri et al., 2017; Carlini et al., 2019; 2021; Choquette-Choo et al., 2020; Liu et al., 2021b; Balle et al., 2022), and models of larger size are known to be more likely to memorize training data. These attacks can be quite egregious if the model was trained on sensitive data like personal photos or emails. One approach to mitigate this risk is to train such models with privacy guarantees. In particular, Differential Privacy (DP) has become the gold standard in quantifying the risk and providing formal guarantees against membership attacks (Nasr et al., 2021).

Code: https://github.com/google-research/google-research/tree/master/dp_transfer

Informally, Differential Privacy implies that an adversary learns almost the same thing about an individual data point regardless of its presence or absence in the training data set. More formally, DP is defined as follows. Two datasets $D$, $D'$ are called neighboring datasets if one is obtained from the other by deleting one data point.

Definition 1.1 (Differential Privacy (Dwork et al., 2006b;a)). A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if, for any pair of neighboring datasets $D$ and $D'$, and for all events $S$ in the output range of $\mathcal{A}$, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta,$$

where the probability is over the randomness of $\mathcal{A}$.

In the field of deep learning, Differentially Private Stochastic Gradient Descent (DP-SGD) (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016) is the most commonly used method for training models with DP guarantees. While DP-SGD is a fairly general algorithm, a naive application can suffer from several computational challenges. To make matters worse, the gap between the performance of a model with and without privacy typically widens as the model is made larger. This stands as a significant obstacle to wider adoption and deployment of large practical models with privacy guarantees.

For the problem of image classification, several works have shown that transfer learning can be a very effective strategy for improving the privacy-utility trade-off when formal privacy guarantees are required (Kurakin et al., 2022; Mehta et al., 2022; De et al., 2022). In this setting, a model is first pre-trained with "non-sensitive" data without privacy guarantees, then fine-tuned on a "sensitive" dataset over which a formal privacy guarantee is required. Similar to previous works, we use several publicly available image classification benchmarks (like ImageNet) as simulated "sensitive" datasets.

Interestingly, Mehta et al. (2022); De et al. (2022) observe that privately fine-tuning just the last layer of a pre-trained model (using DP-SGD) leads to state of the art results on the ImageNet-1k dataset. This is quite fortuitous, since privately fine-tuning the full model typically introduces significant computational challenges. We build on this observation, and perform a comprehensive exploration of various design parameters, including the choice of loss function and optimization algorithm, beyond simple DP-SGD. In this restricted setting of learning a single layer privately using features extracted from a pre-trained model, more sophisticated methods, such as second-order methods, are computationally viable. Our main contributions are as follows:

Somewhat surprisingly, we find that linear regression solved using DP Least Squares performs much better than logistic regression solved using DP-SGD, especially at lower epsilons.

Postulating that the benefits largely stem from the use of second-order information in the least squares solution, we further explore using Newton's method to solve logistic regression. While Newton's method outperforms linear regression in the non-private setting, we find that it still performs worse with privacy constraints, largely because sanitizing the Hessian with logistic regression requires adding far more noise than in linear regression, where part of the Hessian can be shared across all classes.

To combine the benefits of both, we introduce a method which we call Differentially Private SGD with Feature Covariance (abbreviated as DP-FC), where we simply replace the Hessian in Newton's method with sanitized Feature Covariance. Using Feature Covariance instead of the Hessian allows us to make use of second-order information in the training procedure while sharing it across classes and iterations, which greatly reduces the amount of noise that needs to be added to sanitize it. This allows us to continue using logistic regression, which performs better in the non-private setting, while benefiting from the improved privacy-utility trade-off seen with linear regression in the private setting.


With DP-FC, we surpass previous state of the art results considerably on 3 image classification benchmarks, namely ImageNet-1k, CIFAR-10 and CIFAR-100, just by performing DP fine-tuning on features extracted from a pre-trained model; see Table 1 for a summary. Consistent with previous works, we also find that performance increases as the pre-training dataset and the model are made larger.

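| Dataset | Epsilon | Previous SOTA | Accuracy | Method | Epochs (= Steps) | Pretraining DS |
|---|---|---|---|---|---|---|
| CIFAR-10 | 0.01 | | 95.55 (0.98) | DP-LS | 1 | JFT |
| CIFAR-10 | 0.05 | | 97.81 (0.22) | DP-FC | 10 | JFT |
| CIFAR-10 | 0.1 | | 98.23 (0.12) | DP-FC | 10 | JFT |
| CIFAR-10 | 0.5 | | 98.73 (0.04) | DP-FC | 10 | JFT |
| CIFAR-10 | 1.0 | 96.7 | 98.80 (0.05) | DP-FC | 10 | JFT |
| CIFAR-10 | 2.0 | 97.1 | 98.80 (0.03) | DP-FC | 10 | JFT |
| CIFAR-10 | 4.0 | 97.2 | 98.83 (0.02) | DP-FC | 10 | JFT |
| CIFAR-10 | 8.0 | 97.4 | 98.84 (0.02) | DP-FC | 10 | JFT |
| CIFAR-10 | ∞ | | 98.90 (0.00) | LS | 1 | JFT |
| CIFAR-100 | 0.01 | | 76.96 (0.08) | DP-LS | 1 | I21K |
| CIFAR-100 | 0.05 | | 78.31 (0.33) | DP-LS | 1 | JFT |
| CIFAR-100 | 0.1 | | 80.57 (0.26) | DP-LS | 1 | JFT |
| CIFAR-100 | 0.5 | | 84.82 (0.38) | DP-FC | 10 | JFT |
| CIFAR-100 | 1.0 | 83.0 | 87.85 (0.14) | DP-FC | 10 | JFT |
| CIFAR-100 | 2.0 | 86.2 | 88.77 (0.16) | DP-FC | 10 | JFT |
| CIFAR-100 | 4.0 | 87.7 | 89.51 (0.17) | DP-FC | 10 | JFT |
| CIFAR-100 | 8.0 | 88.4 | 89.78 (0.08) | DP-FC | 10 | JFT |
| CIFAR-100 | ∞ | | 90.60 (0.00) | LS | 1 | JFT |
| ImageNet-1K | 0.01 | | 81.99 (0.08) | DP-LS | 1 | JFT |
| ImageNet-1K | 0.05 | | 83.74 (0.10) | DP-LS | 1 | JFT |
| ImageNet-1K | 0.1 | | 84.28 (0.06) | DP-FC | 1 | JFT |
| ImageNet-1K | 0.5 | | 86.04 (0.06) | DP-FC | 10 | JFT |
| ImageNet-1K | 1.0 | 84.4 | 86.78 (0.07) | DP-FC | 10 | JFT |
| ImageNet-1K | 2.0 | 85.6 | 87.34 (0.04) | DP-FC | 10 | JFT |
| ImageNet-1K | 4.0 | 86.0 | 87.70 (0.03) | DP-FC | 10 | JFT |
| ImageNet-1K | 8.0 | 86.7 | 88.02 (0.03) | DP-FC | 10 | JFT |
| ImageNet-1K | ∞ | | 88.90 (0.00) | Newton | 10 | JFT |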

Table 1: Compilation of our best private Top-1 test accuracies. We report the median and standard deviation across 5 training runs with different seeds. All numbers are SOTA across all epsilons to the best of our knowledge. Previous state of the art results for CIFAR-10 and CIFAR-100 are taken from Bu et al. (2022a) and for ImageNet-1K from De et al. (2022). We use ε = ∞ to denote the non-private setting, where we turn off all sanitization steps including clipping. We set δ to $8 \times 10^{-7}$ for ImageNet-1k, and $1 \times 10^{-5}$ for CIFAR-10 and CIFAR-100. Interestingly, in the non-private setting, most previous works (including Zhai et al. (2021)) use Linear Regression when finetuning from features, but we found that Logistic Regression (solved using Newton's method) performs much better and leads to an impressive 88.9% accuracy when finetuning just the last layer. To put this in perspective, this is only 2.1 percentage points less than the current state of the art non-private accuracy of 91% on ImageNet-1k (Yu et al., 2022b). We report extensive hyperparameter details in the appendix for reproducibility of our results.

2 Private Learning from Features

In this section, we describe the details of optimization strategies we considered, and state the privacy guarantees for each of them.

Given a data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we optimize the function $L : \mathbb{R}^{m \times d} \to \mathbb{R}$ defined as follows

$$L(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} \ell(\langle \theta_j, x_i \rangle, y_{ij}) \qquad (1)$$

where $n$ is the number of examples, $m$ is the number of classes, $\theta \in \mathbb{R}^{m \times d}$ is the weight matrix to be learned, $x_i$ is the feature vector for example $i$, $y_{ij}$ is the label of example $i$ and class $j$, and $\ell$ is a convex loss function. We assume that $y_{ij} \in [0, 1]$ for all $i, j$. Additionally, we also use the shorthand $\ell(\theta; (x_i, y_i)) = \sum_{j=1}^{m} \ell(\langle \theta_j, x_i \rangle, y_{ij})$ where helpful in order to simplify the notation.

In the case of learning from features extracted from a pre-trained model, the feature vectors $x_i$ are last layer features. Further, unless otherwise specified, we will assume $\ell$ to be the logistic loss, i.e. $\ell_{\text{logistic}}(z, y) = -y \log \sigma(z) - (1 - y) \log(1 - \sigma(z))$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$. Finally, we will also assume that each step of optimization considers the whole batch, which greatly simplifies both the privacy analysis of the algorithms and the experiments. The rest of this section describes several iterative solvers for this minimization problem, both in non-private and private settings. Some of these methods rely on the fact that we are only interested in fine-tuning the last layer, while others are more general. In the privacy analysis, we use zCDP (zero-Concentrated Differential Privacy) (Bun & Steinke, 2016), but we always state our empirical results with the final privacy guarantee in $(\varepsilon, \delta)$-DP terms, as done in previous works.
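To illustrate how a zCDP budget relates to an $(\varepsilon, \delta)$ statement, the following is a minimal sketch using the standard conversion from Bun & Steinke (2016): $\rho$-zCDP implies $(\rho + 2\sqrt{\rho \log(1/\delta)}, \delta)$-DP. The function names `zcdp_to_eps` and `rho_for_eps` are hypothetical, and this is not necessarily the conversion used for the reported results (the appendix describes the actual implementation); tighter conversions exist.

```python
import math


def zcdp_to_eps(rho: float, delta: float) -> float:
    """Epsilon implied by rho-zCDP at a given delta (Bun & Steinke, 2016).

    Standard (not the tightest) bound: eps = rho + 2 * sqrt(rho * log(1 / delta)).
    """
    if rho < 0 or not (0.0 < delta < 1.0):
        raise ValueError("need rho >= 0 and 0 < delta < 1")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def rho_for_eps(eps: float, delta: float) -> float:
    """Largest rho whose converted epsilon equals eps under the bound above.

    With s = sqrt(rho) and L = log(1 / delta), solve s^2 + 2 * sqrt(L) * s = eps.
    """
    log_term = math.log(1.0 / delta)
    s = math.sqrt(log_term + eps) - math.sqrt(log_term)
    return s * s


# Example: zCDP budget available for eps = 1 at the CIFAR delta from Table 1.
rho = rho_for_eps(1.0, 1e-5)
```

Under this bound, the resulting `rho` is what the theorems later in this section would then translate into a noise multiplier $\sigma$.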

Arguably the most popular approach to solving the above minimization problem in the non-private setting is Stochastic Gradient Descent (SGD). In the full-batch setting, at every iteration, SGD performs the update:

$$g_t = \sum_{i=1}^{n} \nabla \ell(\theta_t; (x_i, y_i)), \qquad \theta_{t+1} = \theta_t - \eta_t g_t \qquad (2)$$

where $\eta_t$ denotes the learning rate used for iteration $t$.

DP-SGD is a private variant of this algorithm and the baseline in all our experiments. Computationally, in order to bound the sensitivity of each training example, Abadi et al. (2016) suggest computing a gradient for each example separately and clipping each to a maximum norm of $C$ (a user-specified hyper-parameter):

$$g_t \leftarrow \sum_{i=1}^{n} \mathrm{clip}\left(\nabla \ell(\theta_t; (x_i, y_i))\right), \qquad g_t \leftarrow g_t + \mathcal{N}\left(0, (\sigma C)^2\right) \qquad (3)$$

where $\mathrm{clip}(v) = v \cdot \min\left\{1, \frac{C}{\|v\|_2}\right\}$.

After summing the clipped example gradients, a noise vector sampled from a Gaussian distribution with standard deviation σC is added, where σ is a parameter that determines the privacy guarantee via the Gaussian mechanism.
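The clip-sum-noise step described above can be sketched as follows for the last-layer logistic regression setting. This is an illustrative NumPy sketch, not the authors' implementation: the function name `dp_sgd_step` and its arguments are hypothetical. It relies on the fact that the per-example gradient with respect to the $m \times d$ weight matrix is the outer product of the residual $\sigma(\theta x_i) - y_i$ and $x_i$, so its norm factorizes and per-example gradients never need to be materialized.

```python
import numpy as np


def dp_sgd_step(theta, X, Y, lr, clip_norm, noise_multiplier, rng):
    """One full-batch DP-SGD step for multi-class logistic regression.

    theta: (m, d) weights, X: (n, d) features, Y: (n, m) labels in [0, 1].
    """
    # Derivative of the logistic loss w.r.t. each logit: sigmoid(z) - y.
    residual = 1.0 / (1.0 + np.exp(-(X @ theta.T))) - Y            # (n, m)
    # Per-example gradient is outer(residual_i, x_i); its Frobenius norm
    # equals ||residual_i||_2 * ||x_i||_2.
    norms = np.linalg.norm(residual, axis=1) * np.linalg.norm(X, axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Sum of clipped per-example gradients, shape (m, d).
    g = (residual * scale[:, None]).T @ X
    # Gaussian mechanism: i.i.d. noise with standard deviation sigma * C.
    g = g + rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape)
    return theta - lr * g
```

Passing the sanitized `g` to another first-order optimizer (for example Adam) instead of the plain update on the last line would follow the same pattern.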

As shown in Algorithm 4, once the gradient has been sanitized, we are free to use it to accumulate statistics (e.g. first or second moment estimates) which are typically useful in the optimization process. Algorithm 4 presents a generalized version of DP-SGD where the gradients get processed in the traditional DP-SGD fashion, and are then passed to a first-order optimizer as an input. This lets us instantiate DP versions of well-known first-order optimizers like SGD, Momentum and Adam. We employ DP-Adam in all our experiments. Finally, we omit the privacy analysis for the DP-SGD baseline since it is standard, but we do include in the appendix the details of the implementation we used for translation from privacy constraints to noise scale.

2.2 DP-Newton

Since the optimization problem under consideration is a relatively simple convex problem, second-order DP algorithms can be viable. We first consider a privatized version of the popular Newton's method and denote it as DP-Newton. To control the sensitivity of each training example, one naive approach is to compute per-example Hessians and clip their norm, in a similar fashion to per-example gradient clipping in DP-SGD. But even in our last-layer fine-tuning setting, this can be prohibitively expensive. For instance, when training on features extracted from ViT-G for ImageNet-1k fine-tuning, the Hessian tensor (in block diagonal form) is of size [1000, 1664, 1664], with approximately 2.8B entries. In order to avoid instantiating the Hessian for every training example, we instead choose to clip the feature vectors $x_i$, then translate the clipping threshold into bounds on the Hessian and gradient norms. This is summarized in Algorithm 1.
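The size estimate above can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical helper names, and the float32 memory figure is an added assumption about storage, not a number from the paper.

```python
def block_diag_hessian_entries(num_classes: int, feature_dim: int) -> int:
    """Entries in a block-diagonal Hessian with one d x d block per class."""
    return num_classes * feature_dim * feature_dim


# ImageNet-1k classes and the feature dimension quoted in the text.
entries = block_diag_hessian_entries(1000, 1664)   # about 2.8 billion
# Assumption: 4 bytes per entry (float32), for a single training example.
gib_per_example = entries * 4 / 2**30
# A feature covariance shared across classes needs only d x d entries.
shared_covariance_entries = 1664 * 1664
```

Under these assumptions a single per-example Hessian would occupy roughly 10 GiB, which is why clipping the features instead is attractive.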

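Algorithm 1 Differentially Private Newton's Method

Require: Data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \in \mathcal{D}$, loss function $\ell : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$, regularization coefficient $\lambda$, learning rate $\eta$, clipping norm $C$, number of iterations $T$, noise multiplier $\sigma$, bound on the second derivative of the loss $\beta_H$.

1: Clip all features: $x_i \leftarrow \mathrm{clip}(x_i)$ for all $i \in \{1, \ldots, n\}$, where $\mathrm{clip}(v) = v \cdot \min\left\{1, \frac{C}{\|v\|_2}\right\}$.
2: Randomly initialize $\theta_0$.
3: for $t = 0, \ldots, T - 1$ do
4:     for $j = 1, \ldots, m$ do
5:         $g_{t,j} \leftarrow \sum_{i=1}^{n} \nabla \ell(\theta_{t,j}^\top x_i, y_{ij})$.
6:         $H_{t,j} \leftarrow \sum_{i=1}^{n} \nabla^2 \ell(\theta_{t,j}^\top x_i, y_{ij}) + \lambda I$.
7:         $g_{t,j} \leftarrow \frac{1}{n} g_{t,j} + \mathcal{N}\left(0_d, \left(\frac{\sigma C \sqrt{m}}{n}\right)^2\right)$, where $\mathcal{N}\left(0_d, \left(\frac{\sigma C \sqrt{m}}{n}\right)^2\right)$ indicates a $d$-dimensional vector each of whose coordinates is an i.i.d. Gaussian with standard deviation $\sigma C \sqrt{m} / n$.
8:         $H_{t,j} \leftarrow \frac{1}{n} H_{t,j} + \mathcal{N}\left(0_{d \times d}, \left(\frac{\sigma \beta_H C^2 \sqrt{m}}{n}\right)^2\right)$, where $\mathcal{N}\left(0_{d \times d}, \left(\frac{\sigma \beta_H C^2 \sqrt{m}}{n}\right)^2\right)$ indicates a $d \times d$ matrix each of whose coordinates is an i.i.d. Gaussian with standard deviation $\sigma \beta_H C^2 \sqrt{m} / n$.
9:         $\theta_{t+1,j} \leftarrow \theta_{t,j} - \eta H_{t,j}^{-1} g_{t,j}$
10:     end for
11: end for
12: return $\theta_T$.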
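The steps of Algorithm 1 can be sketched in NumPy as below for the logistic loss ($\beta_H = 1/4$). This is a hedged illustration rather than the authors' code: the function name `dp_newton` and its signature are hypothetical, the weights are initialized to zero instead of randomly for reproducibility, and no attempt is made to symmetrize or regularize the noisy Hessian beyond the $\lambda I$ term.

```python
import numpy as np


def dp_newton(X, Y, num_iters, lr, clip_norm, noise_multiplier, reg, rng, beta_h=0.25):
    """Sketch of DP-Newton for per-class logistic regression.

    X: (n, d) features, Y: (n, m) labels in [0, 1]; returns theta of shape (m, d).
    """
    n, d = X.shape
    m = Y.shape[1]
    # Line 1: clip every feature vector to L2 norm at most clip_norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    theta = np.zeros((m, d))  # assumption: zero init instead of random init
    g_std = noise_multiplier * clip_norm * np.sqrt(m) / n
    h_std = noise_multiplier * beta_h * clip_norm**2 * np.sqrt(m) / n
    for _ in range(num_iters):
        probs = 1.0 / (1.0 + np.exp(-(X @ theta.T)))               # (n, m)
        for j in range(m):
            # Lines 5-6: gradient and regularized Hessian for class j.
            g = X.T @ (probs[:, j] - Y[:, j])                      # (d,)
            w = probs[:, j] * (1.0 - probs[:, j])                  # second derivative
            H = (X * w[:, None]).T @ X + reg * np.eye(d)           # (d, d)
            # Lines 7-8: scale by 1/n and add Gaussian noise.
            g_noisy = g / n + rng.normal(0.0, g_std, size=d)
            H_noisy = H / n + rng.normal(0.0, h_std, size=(d, d))
            # Line 9: Newton update with the sanitized statistics.
            theta[j] = theta[j] - lr * np.linalg.solve(H_noisy, g_noisy)
    return theta
```

Note that both noise scales grow with $\sqrt{m}$ and that a fresh $d \times d$ noise matrix is drawn for every class and iteration, which is the cost that the later methods aim to avoid.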

Theorem 2.1 (Privacy guarantee for Algorithm 1). Suppose $\ell$ is twice differentiable, and that for all $(z, y)$, $|\ell'(z, y)| \le 1$ and $|\ell''(z, y)| \le \beta_H$. Then, Algorithm 1 satisfies $\frac{T}{\sigma^2}$-zCDP.

Here $\ell'$ and $\ell''$ denote the first and second derivatives of $\ell$ with respect to its first argument. We choose the bound on the first derivative to be 1 without loss of generality (via scaling of $\ell$). Note that the assumptions of the theorem are satisfied for the sigmoid cross-entropy loss (i.e. logistic regression) with $\beta_H = \frac{1}{4}$. Indeed, $|\ell'_{\text{logistic}}(z, y)| = |y - \sigma(z)|$, and $|\ell''_{\text{logistic}}(z, y)| = \sigma(z)(1 - \sigma(z))$ where $\sigma(z) = \frac{1}{1 + e^{-z}}$. The first expression is bounded by 1 (since $y, \sigma(z) \in [0, 1]$) and the second expression is bounded by $\frac{1}{4}$.

For the squared loss $\ell(z, y) = \frac{1}{2}(z - y)^2$, notice that (non-private) Newton's method reaches the solution in a single step, regardless of where it is initialized. Therefore it suffices to initialize Algorithm 1 at $\theta_0 = 0$, and run it for a single step ($T = 1$). When initialized at $\theta = 0$, we have $z = 0$ and the assumptions of the theorem are satisfied (first and second derivatives are bounded, with $\beta_H = 1$).
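The derivative bounds for the logistic loss can be checked numerically. The snippet below is a small sanity check with hypothetical helper names; it evaluates the first and second derivatives on a grid and is of course not a proof.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_d1(z: float, y: float) -> float:
    """First derivative of the logistic loss in z: sigmoid(z) - y."""
    return sigmoid(z) - y


def logistic_d2(z: float, y: float) -> float:
    """Second derivative in z: sigmoid(z) * (1 - sigmoid(z)); independent of y."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Largest values observed on a grid of logits in [-50, 50] and labels in [0, 1].
grid = [k / 10.0 for k in range(-500, 501)]
max_d1 = max(abs(logistic_d1(z, y)) for z in grid for y in (0.0, 0.3, 1.0))
max_d2 = max(logistic_d2(z, 0.0) for z in grid)
```

The maximum of the second derivative is attained at $z = 0$, where $\sigma(z) = 1/2$.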

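Proof. The crux of the proof is to ensure that Lines 7 and 8 individually satisfy $\frac{1}{2m\sigma^2}$-zCDP. The rest of the proof is just simple composition of zCDP (Bun & Steinke, 2016) across $m$ classes and $T$ iterations. To see this, first let $D$ and $D'$ be neighboring datasets and let $(x, y)$ be the differing datapoint between $D$ and $D'$.

Now, notice that in the computation of $g_t$, we have the following: $g_{t,j}(D) - g_{t,j}(D') = \ell'(\langle \theta_{t,j}, x \rangle, y_j) \cdot x$. Thus, since $\|x\|_2 \le C$ and $|\ell'(\langle \theta_j, x \rangle, y_j)| \le 1$:

$$\|g_{t,j}(D) - g_{t,j}(D')\|_2 \le C \qquad (4)$$

From equation (4) it immediately follows that the computation of $g_{t,j}$ for each $t \in \{0, \ldots, T - 1\}$ and each $j \in \{1, \ldots, m\}$ satisfies $\frac{1}{2m\sigma^2}$-zCDP: the Gaussian mechanism with sensitivity $C$ and noise standard deviation $\sigma C \sqrt{m}$ is $\frac{C^2}{2(\sigma C \sqrt{m})^2}$-zCDP, and the common factor $\frac{1}{n}$ in Line 7 rescales the sensitivity and the noise equally. Now, moving on to the sensitivity of $H_t$ in Line 8, we have $H_{t,j}(D) - H_{t,j}(D') = \ell''(\langle \theta_j, x \rangle, y_j) \cdot x x^\top$. Then, since $\|x\|_2 \le C$ and $|\ell''(\langle \theta_j, x \rangle, y_j)| \le \beta_H$, we have:

$$\|H_{t,j}(D) - H_{t,j}(D')\|_F \le \beta_H C^2, \qquad (5)$$

and equation (5) immediately implies that the computation of $H_{t,j}$ for each $t \in \{0, \ldots, T - 1\}$ and $j \in \{1, \ldots, m\}$ satisfies $\frac{1}{2m\sigma^2}$-zCDP. Composing over the two mechanisms, $m$ classes and $T$ iterations gives $2 \cdot m \cdot T \cdot \frac{1}{2m\sigma^2} = \frac{T}{\sigma^2}$. This completes the proof.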

One drawback of DP-SGD and DP-Newton is that each training example $i$ contributes to the gradients (resp. Hessians) of all classes $j \in \{1, \ldots, m\}$, so the sensitivity (and hence the scale of required noise) increases with the number of classes. This is visible in Algorithm 1, where the amount of noise added for privacy protection grows with $m$ (as $\sqrt{m}$ in Lines 7 and 8). When the number of classes is large, this reduces the signal-to-noise ratio and can hurt quality. In this section, we develop a method that aims to address this issue. We take inspiration from the matrix factorization literature, in which one can separate the loss function into the contribution of positive and negative classes. We assume that labels are binary ($y_{ij} \in \{0, 1\}$), and we consider the following quadratic loss function:

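$$L(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ \frac{1}{2} y_{ij} (x_i^\top \theta_j - y_{ij})^2 + \frac{\alpha}{2} (x_i^\top \theta_j)^2 \right] + \frac{\lambda}{2} \|\theta\|_F^2;$$

in other words, we take $\ell(z, y) = \frac{1}{2}\left(y(z - y)^2 + \alpha z^2\right)$. The first term fits the positive labels (notice that this term vanishes when $y = 0$), while the second term fits negative labels, and $\alpha$ is a hyper-parameter that trades off the two terms. This formulation was studied by Hu et al. (2008) and enjoys remarkable empirical success (Koren & Bell, 2015). It turns out that this method is also well-suited for privacy, as we shall discuss below.

By expanding the quadratic terms, we can write the loss, up to an additive constant that does not depend on $\theta$, as

$$L(\theta) = \frac{1}{2} \sum_{j=1}^{m} \left[ \theta_j^\top A_j \theta_j - 2 \theta_j^\top b_j + \alpha \theta_j^\top G \theta_j + \lambda \theta_j^\top \theta_j \right]$$

where for all $j \in \{1, \ldots, m\}$

$$A_j = \sum_{i : y_{ij} = 1} x_i x_i^\top, \qquad b_j = \sum_{i : y_{ij} = 1} x_i, \qquad G = \sum_{i=1}^{n} x_i x_i^\top \qquad (6)$$

The exact solution is then given by $\theta_j = [A_j + \alpha G + \lambda I]^{-1} b_j$.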

Notice that the solution $\theta_j$ for class $j$ depends on class-specific statistics (the matrix $A_j$ and the vector $b_j$), as well as the global quantity $G$. Algorithm 2 computes a private estimate of the solution by adding Gaussian noise to each of $G$, $A_j$, $b_j$ (Lines 2, 4, and 5 respectively), then solving the linear system using the noised statistics (Line 6). This is a variant of the popular sufficient statistics perturbation algorithm for DP linear regression. One crucial observation is that the noisy version of $G$ is only computed once (Line 2), and reused for all classes. This allows us to control the sensitivity of the solution w.r.t. each example; indeed, suppose example $i$ only has one positive class $j_0$, then that example only contributes to $G$, $A_{j_0}$, and $b_{j_0}$. In particular, the sensitivity of the solution (and hence the amount of noise we need to add) does not scale with the total number of classes, only with the number of positive classes per example. This is represented by the parameter $k$ in Algorithm 2 ($k$ is equal to 1 for single-class classification tasks, and even in multi-class tasks, we typically have $k \ll m$). We now give the formal privacy guarantee:

Theorem 2.2 (Privacy guarantee for Algorithm 2). Algorithm 2 satisfies $\frac{3}{2\sigma^2}$-zCDP.

Proof. First, let $D, D'$ be neighboring data sets that differ in the data point $(x, y)$, and let $G(D)$ be the global statistic in equation (6) computed on data set $D$. Then $\|G(D) - G(D')\|_F = \|x x^\top\|_F = \|x\|_2^2$, and since $\|x\|_2 \le C$ (due to clipping), $\|G(D) - G(D')\|_F \le C^2$; thus the computation of $G$ (Line 2) is $\frac{1}{2\sigma^2}$-zCDP. Now moving to the computation of $A_j$: let $A(D)$ be the concatenation of all the class-specific statistics, i.e. $A(D) = [A_1(D) | \ldots | A_m(D)]$. Notice that $\|A_j(D) - A_j(D')\|_F = \|x x^\top\|_F$ if $y_j = 1$ and 0 otherwise. Since by assumption the number of positive classes per example is bounded by $k$, we have $\|A(D) - A(D')\|_F \le \sqrt{k}\, \|x x^\top\|_F$, which is bounded by $\sqrt{k}\, C^2$ due to clipping. Therefore the computation of all $A_j$ combined (Line 4) is $\frac{1}{2\sigma^2}$-zCDP. Finally, by a similar argument, we have $\|b(D) - b(D')\|_2 \le \sqrt{k}\, C$, and the computation of all $b_j$ combined (Line 5) is $\frac{1}{2\sigma^2}$-zCDP.

By simple composition of zCDP (Bun & Steinke, 2016), the algorithm is $\frac{3}{2\sigma^2}$-zCDP.

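Algorithm 2 Differentially Private Least Squares

Require: Data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \in \mathcal{D}$, weight coefficient $\alpha$, regularization coefficient $\lambda$, maximum number of positive classes per example $k$, clipping norm $C$, noise multiplier $\sigma$.

1: Clip all features: $x_i \leftarrow \mathrm{clip}(x_i)$ for all $i \in \{1, \ldots, n\}$.
2: $G \leftarrow \sum_{i=1}^{n} x_i x_i^\top + \mathcal{N}\left(0_{d \times d}, (\sigma C^2)^2\right)$, where $\mathcal{N}\left(0_{d \times d}, (\sigma C^2)^2\right)$ indicates a $d \times d$ matrix, each of whose coordinates is an i.i.d. Gaussian with standard deviation $\sigma C^2$.
3: for $j = 1, \ldots, m$ do
4:     $A_j \leftarrow \sum_{i : y_{ij} = 1} x_i x_i^\top + \mathcal{N}\left(0_{d \times d}, (\sigma \sqrt{k}\, C^2)^2\right)$
5:     $b_j \leftarrow \sum_{i : y_{ij} = 1} x_i + \mathcal{N}\left(0_d, (\sigma \sqrt{k}\, C)^2\right)$
6:     $\theta_j \leftarrow \left[A_j + \alpha G + \lambda I\right]^{-1} b_j$
7: end for
8: return $\theta$.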
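Algorithm 2 amounts to perturbing three sufficient statistics and solving one linear system per class. The following NumPy sketch illustrates this under stated assumptions: the function name `dp_least_squares` and its arguments are hypothetical, labels are assumed to be exactly 0 or 1, and the noise scales for $A_j$ and $b_j$ ($\sigma \sqrt{k} C^2$ and $\sigma \sqrt{k} C$) are taken from the sensitivity bounds in the proof of Theorem 2.2.

```python
import numpy as np


def dp_least_squares(X, Y, alpha, reg, clip_norm, noise_multiplier, k, rng):
    """Sketch of DP least squares via sufficient statistics perturbation.

    X: (n, d) features, Y: (n, m) binary labels; returns theta of shape (m, d).
    """
    n, d = X.shape
    m = Y.shape[1]
    # Line 1: clip features to L2 norm at most clip_norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sigma = noise_multiplier
    # Line 2: the global statistic G is noised once and shared by all classes.
    G = X.T @ X + rng.normal(0.0, sigma * clip_norm**2, size=(d, d))
    theta = np.zeros((m, d))
    for j in range(m):
        X_pos = X[Y[:, j] == 1]                      # examples positive for class j
        # Lines 4-5: class-specific statistics with noise scaled by sqrt(k).
        A_j = X_pos.T @ X_pos + rng.normal(
            0.0, sigma * np.sqrt(k) * clip_norm**2, size=(d, d))
        b_j = X_pos.sum(axis=0) + rng.normal(
            0.0, sigma * np.sqrt(k) * clip_norm, size=d)
        # Line 6: solve the regularized linear system for class j.
        theta[j] = np.linalg.solve(A_j + alpha * G + reg * np.eye(d), b_j)
    return theta
```

Predictions would then be obtained as the argmax of `X @ theta.T`. Because the data is touched only when forming `G`, `A_j` and `b_j`, the whole procedure is a single pass (one "epoch" in the tables).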

From our early experiments, we found that Newton's method with logistic regression performs better than least squares linear regression in the non-private setting. But in the private setting, DP-LS outperforms DP-Newton, especially for lower values of epsilon. Notice that both are second-order methods, and both rely on estimating and inverting the Hessian (Line 9 in Algorithm 1 and Line 6 in Algorithm 2); the main advantage of DP-LS is that the private Hessian computation does not have to be composed over classes or iterations. In this section, in order to mitigate the trade-off between the two methods, we introduce a method called Differentially Private SGD with Feature Covariance (DP-FC), which leverages the covariance of features to make use of second-order information without paying the cost of composition over classes or iterations. Indeed, since feature covariance depends neither on the model parameters nor on the prediction, it can be shared across both classes and iterations. This is described in Algorithm 3. The method can be interpreted as DP-SGD with preconditioning (where the approximate feature covariance $G$ is used as preconditioner). We empirically observe that this leads to greatly reduced sensitivity compared to DP-Newton and significant improvements in overall metrics across all values of epsilon we tried.

Theorem 2.3 (Privacy guarantee for Algorithm 3). Algorithm 3 satisfies $\frac{T + 1}{2\sigma^2}$-zCDP.

Proof. The crux of the proof is to ensure that the computation of $G$ (Line 3) and $g_t$ (Line 7) individually satisfy $\frac{1}{2\sigma^2}$-zCDP. The rest of the proof is by simple composition of zCDP (Bun & Steinke, 2016) (once for the $G$ computation, and $T$ times for the $g_t$ computation).

Let $D$ and $D'$ be neighboring datasets and let $(x, y)$ be the differing datapoint between $D$ and $D'$.

In the computation of $g_t$, we have the following: $\|g_t(D) - g_t(D')\|_2 = \|\mathrm{clip}(\nabla \ell(\theta_t; (x, y)))\|_2 \le C_g$. It immediately follows that the computation of $g_t$ for each $t \in \{0, \ldots, T - 1\}$ satisfies $\frac{1}{2\sigma^2}$-zCDP. Now, moving on to the sensitivity of $G$, we have $\|G(D) - G(D')\|_F = \|x x^\top\|_F \le C_G^2$, since $\|x\|_2 \le C_G$. It immediately follows that $G$ satisfies $\frac{1}{2\sigma^2}$-zCDP, which completes the proof.
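To compare the three guarantees, one can invert each theorem to obtain the noise multiplier $\sigma$ required for a given zCDP budget $\rho$. The helpers below are a minimal sketch with hypothetical names; they simply solve $\rho = T/\sigma^2$ (Theorem 2.1), $\rho = 3/(2\sigma^2)$ (Theorem 2.2) and $\rho = (T+1)/(2\sigma^2)$ (Theorem 2.3) for $\sigma$.

```python
import math


def sigma_dp_newton(rho: float, num_iters: int) -> float:
    """Noise multiplier for Algorithm 1: rho = T / sigma^2."""
    return math.sqrt(num_iters / rho)


def sigma_dp_ls(rho: float) -> float:
    """Noise multiplier for Algorithm 2: rho = 3 / (2 * sigma^2)."""
    return math.sqrt(3.0 / (2.0 * rho))


def sigma_dp_fc(rho: float, num_iters: int) -> float:
    """Noise multiplier for Algorithm 3: rho = (T + 1) / (2 * sigma^2)."""
    return math.sqrt((num_iters + 1) / (2.0 * rho))


# Example with an arbitrary illustrative budget and T = 10 iterations.
rho = 0.05
sigmas = {
    "DP-Newton": sigma_dp_newton(rho, 10),
    "DP-LS": sigma_dp_ls(rho),
    "DP-FC": sigma_dp_fc(rho, 10),
}
```

For equal $\rho$ and $T$, DP-LS needs the smallest multiplier and DP-Newton the largest; in addition, the DP-Newton noise in Algorithm 1 carries an extra $\sqrt{m}$ factor that does not appear in the other two algorithms.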


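Algorithm 3 Differentially Private SGD with Feature Covariance (DP-FC) Method

Require: Data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \in \mathcal{D}$, loss function $\ell : \mathbb{R}^{m \times d} \times \mathbb{R} \to \mathbb{R}$, learning rate $\eta$, clipping norms $C_G$ and $C_g$, number of iterations $T$, noise multiplier $\sigma$.

1: Clip features for covariance computation: $\tilde{x}_i \leftarrow \mathrm{clip}(x_i)$ for all $i \in \{1, \ldots, n\}$, where $\mathrm{clip}(x) = x \cdot \min\left\{1, \frac{C_G}{\|x\|_2}\right\}$.
2: Compute the feature covariance: $G = \sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i^\top$.
3: $G \leftarrow \frac{1}{n} G + \mathcal{N}\left(0_{d \times d}, \left(\frac{\sigma C_G^2}{n}\right)^2\right) + \lambda I$
4: Randomly initialize $\theta_0$.
5: for $t = 0, \ldots, T - 1$ do
6:     $g_t \leftarrow \sum_{i=1}^{n} \mathrm{clip}\left(\nabla \ell(\theta_t; (x_i, y_i))\right)$, where $\mathrm{clip}(g) = g \cdot \min\left\{1, \frac{C_g}{\|g\|_2}\right\}$.
7:     $g_t \leftarrow \frac{1}{n} g_t + \mathcal{N}\left(0_d, \left(\frac{\sigma C_g}{n}\right)^2\right)$
8:     $\theta_{t+1} \leftarrow \theta_t - \eta\, g_t G^{-1}$
9: end for
10: return $\theta_T$.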
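Algorithm 3 can be read as DP-SGD preconditioned by a feature covariance that is sanitized once. The NumPy sketch below illustrates this reading; it is not the authors' implementation. The name `dp_fc` and its arguments are hypothetical, weights are initialized to zero rather than randomly, and the gradient noise is drawn i.i.d. for all $m \times d$ coordinates, which is an assumption about how the Gaussian mechanism is applied to the clipped gradient.

```python
import numpy as np


def dp_fc(X, Y, num_iters, lr, clip_cov, clip_grad, noise_multiplier, reg, rng):
    """Sketch of DP-SGD with a sanitized feature-covariance preconditioner.

    X: (n, d) features, Y: (n, m) labels in [0, 1]; returns theta of shape (m, d).
    """
    n, d = X.shape
    m = Y.shape[1]
    # Lines 1-3: clip features, form the covariance, add noise once.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_clipped = X * np.minimum(1.0, clip_cov / np.maximum(norms, 1e-12))
    G = X_clipped.T @ X_clipped / n
    G = G + rng.normal(0.0, noise_multiplier * clip_cov**2 / n, size=(d, d))
    G = G + reg * np.eye(d)
    G_inv = np.linalg.inv(G)                 # reused for every class and iteration
    theta = np.zeros((m, d))                 # assumption: zero init
    for _ in range(num_iters):
        # Line 6: clipped per-example gradients of the logistic loss, summed.
        residual = 1.0 / (1.0 + np.exp(-(X @ theta.T))) - Y          # (n, m)
        grad_norms = np.linalg.norm(residual, axis=1) * np.linalg.norm(X, axis=1)
        scale = np.minimum(1.0, clip_grad / np.maximum(grad_norms, 1e-12))
        g = (residual * scale[:, None]).T @ X / n                    # (m, d)
        # Line 7: Gaussian noise on the averaged gradient.
        g = g + rng.normal(0.0, noise_multiplier * clip_grad / n, size=g.shape)
        # Line 8: preconditioned update.
        theta = theta - lr * g @ G_inv
    return theta
```

Only the gradient is sanitized inside the loop, so the privacy cost per iteration matches that of DP-SGD, with one additional charge for `G`.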

3 Empirical Results

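| Pretraining | Method | Epochs | NP | 0.01 | 0.05 | 0.1 | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-21K | DP-LS | 1 | 75.5 | 71.2 | 71.5 | 71.5 | 73.0 | 73.6 | 74.1 | 74.6 | 75.0 |
| ImageNet-21K | DP-Newton | 1 | 72.7 | 70.2 | 71.3 | 71.3 | 71.4 | 71.6 | 71.7 | 72.0 | 72.2 |
| ImageNet-21K | DP-Newton | 10 | 78.2 | 66.5 | 71.0 | 71.4 | 71.9 | 71.9 | 72.3 | 73.0 | 74.2 |
| ImageNet-21K | DP-FC | 1 | 72.9 | 71.9 | 72.1 | 72.5 | 72.9 | 72.9 | 72.9 | 72.9 | 72.9 |
| ImageNet-21K | DP-FC | 10 | 77.1 | 71.8 | 71.9 | 72.1 | 75.4 | 76.3 | 76.8 | 77.0 | 77.1 |
| ImageNet-21K | DP-Adam | 1 | 73.8 | - | - | - | - | 39.1 | 54.2 | 63.3 | 68.3 |
| ImageNet-21K | DP-Adam | 10 | 76.1 | - | - | 22.1 | 52.9 | 62.3 | 66.6 | 69.5 | 71.0 |
| ImageNet-21K | DP-Adam | 100 | 78.3 | - | - | - | 47.2 | 59.1 | 65.8 | 68.8 | 70.4 |
| JFT | DP-LS | 1 | 87.5 | 82.4 | 83.8 | 84.1 | 85.8 | 86.2 | 86.4 | 86.6 | 86.7 |
| JFT | DP-Newton | 1 | 85.9 | 77.8 | 80.3 | 81.0 | 82.9 | 83.6 | 84.0 | 84.5 | 84.9 |
| JFT | DP-Newton | 10 | 88.9 | 76.0 | 79.7 | 80.1 | 81.8 | 82.9 | 83.1 | 84.7 | 85.3 |
| JFT | DP-FC | 1 | 85.8 | 82.1 | 83.8 | 84.3 | 85.1 | 85.4 | 85.5 | 85.6 | 85.6 |
| JFT | DP-FC | 10 | 88.4 | 81.0 | 83.1 | 83.7 | 86.1 | 86.8 | 87.4 | 87.8 | 88.0 |
| JFT | DP-Adam | 1 | 84.8 | - | 68.7 | 74.0 | 82.1 | 83.7 | 84.4 | 84.8 | 84.8 |
| JFT | DP-Adam | 10 | 87.3 | - | 70.5 | 75.6 | 83.4 | 84.9 | 85.6 | 86.3 | 86.7 |
| JFT | DP-Adam | 100 | 88.6 | - | 49.7 | 64.4 | 78.3 | 81.5 | 83.9 | 85.4 | 86.3 |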

Table 2: Comparison of Top-1 test accuracies when privately fine-tuning on ImageNet-1K. Column headers give the value of epsilon, and NP denotes the non-private setting. We denote accuracy below 20% with the symbol "-". When pre-trained with JFT, we observe that DP-FC performs best for epsilon values in the range [0.1, 8.0], whereas DP-LS is best for even lower epsilons. In the case of pre-training with ImageNet-21k, we find that DP-FC (10 epochs) outperforms all other methods across the board. For reference, we note that the best non-private accuracy on ImageNet-1k is 91% (Yu et al., 2022b). We used the ViT-B/16 model for pre-training with ImageNet-21k and ViT-G/14 for JFT.

Datasets. We use 3 datasets for private finetuning, namely 1) ImageNet-1k (Deng et al., 2009) with 1k classes and 1.3M images, 2) CIFAR-10 and 3) CIFAR-100. We also refer to these as the private datasets, for which we want a privacy guarantee. For pre-training, we rely on JFT-3B, ImageNet-21k and ImageNet-1K (as done in Zhai et al. (2021)). For JFT, we intentionally chose a slightly smaller version of the dataset, i.e. JFT-3B instead of JFT-4B, enabling us to exactly follow Zhai et al. (2021) and thus lowering the risk of the project. Also note that, as done in recent works, none of our finetuning datasets are in reality sensitive datasets: we are simulating a public/private dataset split only for demonstration purposes (Kurakin et al., 2022; Mehta et al., 2022; De et al., 2022). The JFT datasets are not publicly available but have been used extensively as pre-training datasets in the non-private setting to obtain state-of-the-art results (Dosovitskiy et al., 2021; Brock et al., 2021; Tolstikhin et al., 2021; Zhai et al., 2021). Similar to Mehta et al. (2022); De et al. (2022), to make sure that our simulated public and private datasets capture a practical scenario, we carefully de-duplicate our pre-training datasets w.r.t. all splits of our finetuning datasets (Kolesnikov et al., 2020; Dosovitskiy et al., 2021). More details about this process can be found in the appendix.

| Pretraining | Method | Epochs | Non-Private | 0.01 | 0.05 | 0.1 | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | DP-LS | 1 | 91.3 | 81.1 | 83.7 | 84.3 | 86.3 | 87.7 | 88.7 | 89.6 | 90.3 |
| ImageNet-1K | DP-Newton | 1 | 91.0 | 77.4 | 79.6 | 80.5 | 84.2 | 85.7 | 87.1 | 88.4 | 89.2 |
| ImageNet-1K | DP-Newton | 10 | 91.6 | 79.7 | 81.4 | 82.7 | 86.1 | 87.5 | 89.0 | 88.6 | 90.2 |
| ImageNet-1K | DP-FC | 1 | 91.0 | 79.0 | 81.5 | 83.1 | 86.6 | 88.0 | 88.9 | 89.7 | 90.3 |
| ImageNet-1K | DP-FC | 10 | 91.1 | 81.2 | 84.7 | 86.5 | 89.5 | 90.4 | 91.0 | 91.1 | 91.1 |
| ImageNet-1K | DP-Adam | 1 | 77.9 | 52.6 | 74.3 | 76.7 | 78.2 | 78.2 | 77.8 | 77.6 | 77.8 |
| ImageNet-1K | DP-Adam | 10 | 82.3 | 56.4 | 78.3 | 76.7 | 82.0 | 82.6 | 82.4 | 82.2 | 82.7 |
| ImageNet-1K | DP-Adam | 100 | 87.5 | 40.9 | 63.8 | 72.3 | 83.6 | 86.8 | 87.4 | 87.4 | 87.6 |
| ImageNet-21K | DP-LS | 1 | 96.5 | 94.5 | 95.2 | 95.4 | 95.8 | 96.0 | 95.5 | 96.2 | 96.3 |
| ImageNet-21K | DP-Newton | 1 | 96.5 | 94.3 | 94.9 | 95.2 | 95.6 | 95.7 | 96.0 | 96.1 | 96.2 |
| ImageNet-21K | DP-Newton | 10 | 96.5 | 94.8 | 95.2 | 95.4 | 95.6 | 95.9 | 96.0 | 96.2 | 96.2 |
| ImageNet-21K | DP-FC | 1 | 96.6 | 94.6 | 95.2 | 95.5 | 95.9 | 96.0 | 96.2 | 96.3 | 96.3 |
| ImageNet-21K | DP-FC | 10 | 96.6 | 94.8 | 95.6 | 95.8 | 96.1 | 96.3 | 96.5 | 96.5 | 96.5 |
| ImageNet-21K | DP-Adam | 1 | 95.2 | 90.0 | 94.7 | 95.1 | 95.1 | 95.1 | 95.1 | 95.1 | 95.2 |
| ImageNet-21K | DP-Adam | 10 | 96.1 | 83.8 | 94.9 | 95.5 | 95.8 | 95.8 | 95.8 | 95.9 | 96.0 |
| ImageNet-21K | DP-Adam | 100 | 96.5 | 64.0 | 90.0 | 93.3 | 95.5 | 95.7 | 95.9 | 96.1 | 96.2 |
| JFT | DP-LS | 1 | 98.9 | 97.4 | 98.2 | 98.4 | 98.4 | 98.6 | 98.8 | 98.8 | 98.9 |
| JFT | DP-Newton | 1 | 98.9 | 94.1 | 96.5 | 97.2 | 98.1 | 98.3 | 98.5 | 98.7 | 98.8 |
| JFT | DP-Newton | 10 | 98.9 | 95.9 | 97.5 | 97.9 | 98.2 | 98.5 | 98.6 | 98.4 | 98.8 |
| JFT | DP-FC | 1 | 98.9 | 95.2 | 97.6 | 97.9 | 98.5 | 98.6 | 98.8 | 98.8 | 98.9 |
| JFT | DP-FC | 10 | 98.9 | 97.3 | 98.2 | 98.4 | 98.8 | 98.8 | 98.9 | 98.9 | 98.9 |
| JFT | DP-Adam | 1 | 97.5 | 93.5 | 97.0 | 97.5 | 97.5 | 97.6 | 97.6 | 97.6 | 97.6 |
| JFT | DP-Adam | 10 | 98.7 | 87.6 | 97.7 | 98.1 | 98.5 | 98.6 | 98.7 | 98.7 | 98.7 |
| JFT | DP-Adam | 100 | 98.9 | 79.2 | 93.2 | 96.3 | 98.3 | 98.6 | 98.6 | 98.8 | 98.8 |

Table 3: Comparison of Top-1 test accuracies when privately finetuning on CIFAR-10. We denote accuracy below 20% with the symbol "-". Similar to other datasets, DP-FC (10 epochs) outperforms all other methods almost across the board, with the single exception of an epsilon of 0.01 when pre-training with JFT, where DP-LS performs slightly better. For reference, we note that the best non-private accuracy on CIFAR-10 is 99.5% (Dosovitskiy et al., 2021). We used the ViT-B/16 model for pre-training with ImageNet-21k and ImageNet-1K, and ViT-G/14 for JFT.

Model variants. We evaluate the transfer learning capabilities of the Vision Transformer (ViT) (Dosovitskiy et al., 2021) model family in our study. We follow the standard notation to indicate the model size and the input patch size; for example, ViT-B/32 means the "Base" variant with 32x32 input patch size. Note that for ViT, compute requirements scale up as we reduce the patch size. We obtained features from the ViT-G/14 model pre-trained on JFT-3B (Zhai et al., 2021), and from ViT-B/16 pre-trained on ImageNet-21k and ImageNet-1k (Steiner et al., 2021).

| Pretraining | Method | Epochs | Non-Private | 0.01 | 0.05 | 0.1 | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1K | DP-LS | 1 | 71.9 | 49.2 | 51.8 | 53.9 | 57.6 | 60.0 | 62.5 | 65.4 | 67.5 |
| ImageNet-1K | DP-Newton | 1 | 68.8 | 44.6 | 48.6 | 49.3 | 50.2 | 51.2 | 54.4 | 57.2 | 59.5 |
| ImageNet-1K | DP-Newton | 10 | 72.1 | 36.4 | 48.2 | 49.8 | 50.7 | 51.8 | 53.5 | 56.1 | 59.5 |
| ImageNet-1K | DP-FC | 1 | 68.7 | 49.2 | 50.2 | 52.1 | 58.3 | 60.8 | 62.8 | 64.6 | 66.1 |
| ImageNet-1K | DP-FC | 10 | 71.4 | 48.9 | 49.7 | 53.7 | 61.4 | 64.9 | 68.2 | 69.8 | 70.4 |
| ImageNet-1K | DP-Adam | 1 | 52.1 | - | 26.0 | 34.6 | 47.6 | 51.2 | 51.7 | 51.7 | 52.2 |
| ImageNet-1K | DP-Adam | 10 | 57.3 | - | 20.0 | 33.5 | 51.2 | 54.4 | 55.6 | 57.5 | 56.3 |
| ImageNet-1K | DP-Adam | 100 | 67.9 | - | - | - | 36.3 | 40.2 | 51.5 | 58.4 | 63.7 |
| ImageNet-21K | DP-LS | 1 | 83.9 | 77.2 | 77.7 | 78.1 | 79.8 | 80.7 | 81.4 | 81.9 | 82.4 |
| ImageNet-21K | DP-Newton | 1 | 83.0 | 76.5 | 77.2 | 77.2 | 77.6 | 78.3 | 79.0 | 79.5 | 80.5 |
| ImageNet-21K | DP-Newton | 10 | 83.0 | 73.6 | 77.4 | 77.8 | 78.3 | 78.9 | 79.6 | 80.4 | 81.4 |
| ImageNet-21K | DP-FC | 1 | 83.0 | 77.1 | 77.5 | 78.2 | 80.0 | 80.9 | 81.6 | 81.9 | 82.4 |
| ImageNet-21K | DP-FC | 10 | 84.3 | 77.1 | 77.1 | 78.5 | 81.6 | 82.7 | 83.3 | 83.8 | 83.9 |
| ImageNet-21K | DP-Adam | 1 | 79.9 | 27.6 | 67.0 | 72.7 | 78.3 | 79.5 | 79.7 | 79.7 | 79.7 |
| ImageNet-21K | DP-Adam | 10 | 82.0 | - | 55.7 | 67.5 | 78.6 | 80.7 | 81.5 | 81.5 | 81.6 |
| ImageNet-21K | DP-Adam | 100 | 84.6 | - | 29.7 | 45.2 | 71.3 | 76.2 | 79.5 | 81.1 | 82.2 |
| JFT | DP-LS | 1 | 90.6 | 74.9 | 80.3 | 82.5 | 85.5 | 86.4 | 87.7 | 88.4 | 88.9 |
| JFT | DP-Newton | 1 | 89.9 | 73.1 | 72.6 | 73.4 | 78.5 | 80.9 | 82.8 | 84.6 | 85.9 |
| JFT | DP-Newton | 10 | 89.9 | 69.6 | 75.7 | 76.7 | 77.4 | 77.6 | 80.0 | 82.9 | 85.4 |
| JFT | DP-FC | 1 | 89.9 | 73.5 | 78.6 | 81.0 | 85.2 | 86.7 | 87.7 | 88.3 | 88.6 |
| JFT | DP-FC | 10 | 90.1 | 72.1 | 75.9 | 79.0 | 86.2 | 88.1 | 89.0 | 90.0 | 90.1 |
| JFT | DP-Adam | 1 | 83.5 | 27.9 | 61.8 | 71.3 | 79.7 | 82.1 | 83.4 | 83.5 | 83.5 |
| JFT | DP-Adam | 10 | 88.2 | 21.9 | 60.2 | 69.7 | 81.9 | 83.9 | 86.2 | 86.8 | 87.8 |
| JFT | DP-Adam | 100 | 90.0 | - | 29.4 | 50.5 | 73.7 | 78.7 | 83.1 | 86.1 | 88.0 |

Table 4: Comparison of Top-1 test accuracies when privately finetuning on CIFAR-100. We denote accuracy below 20% with the symbol "-". Similar to other datasets, DP-FC outperforms all other methods for moderate privacy budgets, whereas DP-LS performs slightly better for very strict privacy guarantees, depending on the pre-training dataset. For reference, we note that the best non-private accuracy on CIFAR-100 is 96.08% (Foret et al., 2021). We used the ViT-B/16 model for pre-training with ImageNet-21k and ImageNet-1K, and ViT-G/14 for JFT.

Next, we present our main set of private fine-tuning results and core observations on all three datasets, namely ImageNet-1k (Table 2), CIFAR-10 (Table 3) and CIFAR-100 (Table 4).

3.1 Better pre-training continues to improve private ﬁne-tuning performance

We evaluate private fine-tuning performance on features extracted from pre-trained models of two sizes, ViT-G/14 and ViT-B/16, pre-trained on three different datasets: JFT-3B, ImageNet-21K and ImageNet-1K. We do this to quantify the extent to which representation quality improves as the pre-training dataset and the model size are increased together. As shown in Figure 1, as the model size and pre-training dataset size are increased, we continue to see improvement in downstream private fine-tuning performance on all three datasets we consider. In addition, we make the following observations from our results.

Figure 1 (panels: (a) CIFAR-10, (b) CIFAR-100, (c) ImageNet-1K): Comparison of top-1 accuracies with private fine-tuning using the DP-FC method on all 3 datasets across a range of epsilons. We observe that better pre-training helps even more for lower values of epsilon (stricter privacy guarantee).

First, better pre-training can help even more at stricter privacy budgets. As shown in Figure 1, for both CIFAR-10 and CIFAR-100, when comparing features extracted from ViT-B/16 pre-trained on ImageNet-1K with features from ViT-G/14 pre-trained on JFT, the improvement in performance at ε = 1 is larger than at ε = 8. We see a similar trend at even lower epsilons in Table 3 and Table 4.

Second, features extracted from off-the-shelf pre-trained models can suffice for DP. Except for the fact that we deduplicate all splits of our pre-training dataset against our fine-tuning datasets, we use the exact same procedure that is used to pre-train large vision models. This suggests that, in practice, a recipe in which features extracted from a large off-the-shelf vision model are used to privately fine-tune a classifier can be quite effective for DP performance. Since there is no need for a special pre-trained model for use in DP, this considerably reduces the cost of training private image classifiers.

Lastly, private fine-tuning of high-quality features considerably closes the gap between private and non-private performance. In the non-private setting, Zhai et al. (2021) obtain an impressive 90.45% top-1 accuracy by fine-tuning the whole ViT-G/14 model on the ImageNet-1K dataset. We observe that even by fine-tuning just the last layer, we can obtain as much as 88.9% top-1 accuracy on ImageNet-1K from the same-sized model. Thus the marginal benefit of fine-tuning the whole model is < 2% even in the non-private case. In the private case, fine-tuning on pre-extracted features with DP-FC at ε = 8 leads to a state-of-the-art 88% top-1 accuracy. On ViT-G/14, this represents a < 2.5% difference between the best non-private and private performance. This is also just 3% below the best non-private accuracy of 91% on ImageNet-1K (Yu et al., 2022b).
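For convenience, the gaps quoted in this paragraph can be written out explicitly. The block below only restates the arithmetic on the accuracies given above (absolute percentage points) and introduces no new results.

```latex
\begin{align*}
90.45\% - 88.9\% &= 1.55\% < 2\%
  && \text{(full fine-tuning vs. last layer only, non-private)}\\
90.45\% - 88\% &= 2.45\% < 2.5\%
  && \text{(non-private ViT-G/14 vs. DP-FC at } \varepsilon = 8\text{)}\\
91\% - 88\% &= 3\%
  && \text{(best non-private result vs. DP-FC at } \varepsilon = 8\text{)}
\end{align*}
```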

3.2 Better optimizers improve privacy-utility trade-oﬀ

We observe that the choice of optimizer can have a signiﬁcant impact on the privacy-utility trade-oﬀ.

First, optimizers that work well in the large-batch regime are better for private training. Even in the non-private setting, it is known that batch size is intimately tied to the optimization procedure, and that increasing it beyond a certain point while keeping the number of epochs constant can lead to suboptimal use of resources (Goyal et al., 2017; You et al., 2017; 2019). The maximum batch size that can be used without jeopardizing the utility-compute trade-off is also heavily dependent on the choice of optimizer (Zhang et al., 2019). In the private setting, however, several works have observed that increasing the batch size improves the privacy-utility trade-off (Dormann et al., 2021; Li et al., 2014; Hoory et al., 2021). It has also been observed empirically that utility in the large-batch regime can be further improved by leveraging optimizers that work well in that regime, such as DP-Adam or DP-LAMB (Anil et al., 2021; Mehta et al., 2022; Bu et al., 2022a). Second-order methods such as LS or Newton are also well suited to the large-batch regime (they were developed for, and are typically used in, the full-batch setting).


Second, optimizers with faster convergence rates are advantageous in the private setting because they can reduce the number of epochs required for convergence. Indeed, the typical privacy analysis of iterative methods works by composition over iterations, which means that the privacy penalty scales with the number of passes over the data. Thus, in addition to the computational benefit, any improvement in convergence rate directly helps training with privacy, because less noise needs to be added per step under the same privacy constraints. We empirically observe that this reduction in noise in the case of DP-FC, in combination with sharing the privatized feature covariance across classes and iterates, allows us to obtain better results with 10 epochs than with even 100 epochs of DP-SGD.
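To make the link between iteration count and noise concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper's accounting: full-batch steps (no subsampling), each step a Gaussian mechanism with noise standard deviation σC on a quantity of L2 sensitivity C, and zCDP composition (Bun & Steinke, 2016), under which each such step costs 1/(2σ²) and costs add up across steps. The function name and the budget value are hypothetical.

```python
import math


def noise_multiplier_for_budget(rho_total: float, num_steps: int) -> float:
    """Noise multiplier sigma that fits num_steps Gaussian releases in rho_total-zCDP.

    Assumption: each step costs 1 / (2 * sigma^2) in zCDP and the costs add,
    so num_steps / (2 * sigma^2) = rho_total.
    """
    if rho_total <= 0 or num_steps <= 0:
        raise ValueError("rho_total and num_steps must be positive")
    return math.sqrt(num_steps / (2.0 * rho_total))


# Hypothetical total budget; the paper's accountant and budgets may differ.
rho = 0.5
sigma_10_steps = noise_multiplier_for_budget(rho, 10)
sigma_100_steps = noise_multiplier_for_budget(rho, 100)
```

Under these assumptions the required noise multiplier grows with the square root of the number of steps, so a method that converges in 10 steps instead of 100 needs roughly 3.2 times less noise per step for the same budget.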

4 Related Work

Differential privacy (Dwork et al., 2006b) is a popular method to guarantee privacy in a quantifiable way in many data-driven applications. To achieve differential privacy in machine learning tasks, practitioners commonly train models with a privatized variant of gradient descent called DP-SGD (Song et al., 2013; Bassily et al., 2014; Abadi et al., 2016).

Despite its theoretical guarantees, differentially private training has two major drawbacks that limit its wide adoption. First, DP-SGD can be much slower than regular SGD if implemented naively. To address this, several deep learning frameworks offer to vectorize the computation (Subramani et al., 2020), and it is sometimes possible to bound the sensitivity of each example without calculating the gradient for every example separately, leading to a dramatic cost reduction in both memory and compute (Goodfellow, 2015; Li et al., 2022b; Bu et al., 2022a;d). In our work, however, this cost is minimal since we only train the last layer. We do note that the computational burden can still be an issue for optimization schemes like DP-Newton, where we had to resort to feature clipping in order to bound the sensitivity of the gradient and the Hessian.
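As an illustration of bounding per-example sensitivity without materializing per-example gradients, consider a single linear layer. For one example, the gradient with respect to the weight matrix is the outer product of the gradient with respect to the logits and the input features, so its Frobenius norm is simply the product of the two vector norms. The sketch below, with hypothetical function names, checks this identity against the explicitly materialized gradient; it is an illustration of the idea, not the implementation used in the cited works.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def per_example_grad_norm(features, logit_grad):
    """Frobenius norm of outer(logit_grad, features), without forming the matrix."""
    return l2_norm(features) * l2_norm(logit_grad)


def materialized_grad_norm(features, logit_grad):
    """Reference: build the full per-example gradient matrix, then take its norm."""
    grad = [[g * x for x in features] for g in logit_grad]
    return math.sqrt(sum(entry * entry for row in grad for entry in row))


# Hypothetical example: 2 features, 3 classes.
x = [3.0, 4.0]
dlogits = [1.0, -2.0, 2.0]
fast = per_example_grad_norm(x, dlogits)
slow = materialized_grad_norm(x, dlogits)
```

The shortcut costs O(d + k) per example for d features and k classes, compared with O(dk) for the materialized version.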

In addition to the computational cost, models trained with differential privacy usually suffer from so-called utility loss, which means that accuracy (or any other quality metric) is worse, sometimes significantly so, than that of a non-private model (Dormann et al., 2021; Klause et al., 2022). Over the years, several lines of improvement have been proposed, including adaptive clipping (Pichapati et al., 2019; Thakkar et al., 2019; Bu et al., 2022b; Golatkar et al., 2022), parameter-efficient fine-tuning (Yu et al., 2022a; Mehta et al., 2022; Bu et al., 2022c; Cattan et al., 2022; Li et al., 2022a) and even leveraging intermediate checkpoints (De et al., 2022; Shejwalkar et al., 2022). One recent trend that significantly improves the utility of private models involves transfer learning: previous works demonstrate improved performance when a large public or non-sensitive dataset of the same modality as the private data is available (Kurakin et al., 2022; De et al., 2022; Mehta et al., 2022; Tramèr & Boneh, 2021; Yu et al., 2022a; Li et al., 2022b; Hoory et al., 2021). Our work also leverages large pre-trained models in order to obtain high-quality features for private fine-tuning. In addition, similar to works that study and reduce the dimensionality of the model in the context of DP (Li et al., 2022a; Golatkar et al., 2022; Yu et al., 2021; Zhang et al., 2021; Zhou et al., 2020), we focus solely on learning the last layer privately, which can be seen as an implicit way to reduce dimensionality.

In the context of differential privacy, several recent papers also advocate the use of large batch sizes in order to improve the privacy-utility trade-off (Mehta et al., 2022; McMahan et al., 2018; Anil et al., 2021; Dormann et al., 2021; Hoory et al., 2021; Liu et al., 2021b; Kurakin et al., 2022). Even though this work does not explicitly explore the effect of changing the batch size, the fact that we are able to obtain state-of-the-art results in the full-batch setting may point to the effectiveness of large batch sizes in the context of DP. Further, our work also zooms in on differentially private linear and logistic regression. Several existing works have studied differentially private convex optimization (Chaudhuri et al., 2011; Kifer et al., 2012; Song et al., 2013; Bassily et al., 2014; Wu et al., 2016; McMahan et al., 2017; Bassily et al., 2019; Iyengar et al., 2019; Feldman et al., 2020; Bassily et al., 2020; Song et al., 2020; Andrew et al., 2021). There is also growing interest in the special case of linear regression (Smith et al., 2017; Sheffet, 2019; Liu et al., 2021a; Cai et al., 2021; Varshney et al., 2022; Wang, 2018) and even in second-order methods in the context of differential privacy (Avella-Medina et al., 2021; Chien et al., 2021). In this work, we illustrate empirically the extent to which second-order methods can help in DP. It would be quite interesting to see how other popular second-order methods like Shampoo (Gupta et al., 2018; Anil et al., 2020) would fare in the context of DP.

5 Conclusion

In this work, we focus on private fine-tuning for image classification using features extracted from a pre-trained model. Given that fine-tuning on just the features is significantly cheaper than fine-tuning the full model, we systematically explore optimization schemes that are perceived to be expensive in high-dimensional settings. As illustrated on three fine-tuning datasets (ImageNet-1k, CIFAR-10 and CIFAR-100), we find that DP-LS (least squares) outperforms DP-SGD with logistic regression, especially for lower values of epsilon. Given the intuition that second-order information may be the reason for the superior performance of DP-LS, we also explore Newton's method with the logistic loss. Noticing that the amount of noise required by Newton's method scales with the number of classes and iterations, we introduce an optimization scheme called DP-FC, which replaces the Hessian with the feature covariance matrix, which can be shared across classes and iterations. Using this insight, we demonstrate that it is indeed possible to get state-of-the-art results by fine-tuning just the last layer of a pre-trained model under privacy constraints. We hope that our work significantly lowers the barrier to training private models.
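The idea behind DP-FC summarized above can be sketched schematically as follows. The notation is an assumption made for illustration and may differ from the paper's algorithm listing: $x_i$ are (clipped) feature vectors, $Z$ is a symmetric Gaussian noise matrix, $\tilde{g}_t^{(k)}$ is the privatized gradient for class $k$ at step $t$, and $\lambda \ge 0$ is a regularization constant.

```latex
% Schematic sketch under assumed notation, not the paper's exact algorithm.
\begin{align*}
\tilde{A} &= \sum_{i=1}^{n} x_i x_i^{\top} + Z
  && \text{(privatized once, reused for all classes and steps)}\\
\theta_{t+1}^{(k)} &= \theta_t^{(k)} - \bigl(\tilde{A} + \lambda I\bigr)^{-1} \tilde{g}_t^{(k)}
  && \text{(Newton-like step with } \tilde{A} \text{ in place of the Hessian)}
\end{align*}
```

Because $\tilde{A}$ does not depend on the class or on the current iterate, its privacy cost is paid once, whereas a per-class, per-step Hessian would have to be privatized repeatedly.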

Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security (CCS '16), pp. 308-318, 2016.

Galen Andrew, Om Thakkar, Hugh Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RUQ1zwZR8_.

Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second order optimization for deep learning, 2020.

Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private BERT. CoRR, abs/2108.01624, 2021. URL https://arxiv.org/abs/2108.01624.

Marco Avella-Medina, Casey Bradshaw, and Po-Ling Loh. Diﬀerentially private inference via noisy optimization, 2021.

Borja Balle, Giovanni Cherubin, and Jamie Hayes. Reconstructing training data with informed adversaries, 2022.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pp. 464-473, 2014.

Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 11279-11288, 2019. URL http://papers.nips.cc/paper/9306-private-stochastic-convex-optimization-with-optimal-rates.

Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/2e2c4bf7ceaa4712a72dd5ee136dc9a8-Abstract.html.


James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Andrew Brock, Soham De, and Samuel L Smith. Characterizing signal propagation to close the performance gap in unnormalized resnets. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=IX3Nnir2omJ.

Zhiqi Bu, Jialin Mao, and Shiyun Xu. Scalable and efficient training of large convolutional neural networks with differential privacy, 2022a.

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic clipping: Differentially private deep learning made easier and stronger. arXiv, abs/2206.07136, 2022b.

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private bias-term only fine-tuning of foundation models. arXiv, abs/2210.00036, 2022c.

Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. arXiv, abs/2210.00038, 2022d.

Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635-658. Springer, 2016.

T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49(5):2825-2850, 2021.

Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Conference on Security Symposium, 2019.

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raﬀel. Extracting training data from large language models. In USENIX Security, 2021.

Yannis Cattan, Christopher A. Choquette-Choo, Nicolas Papernot, and Abhradeep Thakurta. Fine-tuning with differential privacy necessitates an additional hyperparameter search. arXiv, abs/2210.02156, 2022.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069-1109, 2011.

Steve Chien, Prateek Jain, Walid Krichene, Steﬀen Rendle, Shuang Song, Abhradeep Thakurta, and Li Zhang. Private alternating least squares: Practical private matrix completion with tighter rates, 2021.

Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot. Label-only membership inference attacks, 2020.

Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun 2020. doi: 10.1109/cvprw50498.2020.00359. URL http://dx.doi.org/10.1109/CVPRW50498.2020.00359.

Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. Unlocking high-accuracy diﬀerentially private image classiﬁcation through scale, 2022. URL https://arxiv.org/abs/2204.13650.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009.


Friedrich Dormann, Osvald Frisk, Lars Norvang Andersen, and Christian Fischer Pedersen. Not all noise is accounted equally: How differentially private learning benefits from large sampling rates. 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), Oct 2021. doi: 10.1109/mlsp52302.2021.9596307. URL http://dx.doi.org/10.1109/mlsp52302.2021.9596307.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.

Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT, pp. 486-503, 2006a.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin (eds.), Theory of Cryptography, 2006b.

Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Timothy M. Chan (ed.), Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pp. 2468-2479. SIAM, 2019. doi: 10.1137/1.9781611975482.151. URL https://doi.org/10.1137/1.9781611975482.151.

Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. arXiv preprint arXiv:2012.12803, 2020.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=6Tm1mposlrM.

Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. 2018. URL https://mlsys.org/Conferences/doc/2018/146.pdf.

Aditya Golatkar, Alessandro Achille, Yu-Xiang Wang, Aaron Roth, Michael Kearns, and Stefano Soatto. Mixed differential privacy in computer vision. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2022. doi: 10.1109/cvpr52688.2022.00819. URL http://dx.doi.org/10.1109/CVPR52688.2022.00819.

Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley (eds.). Google Vizier: A Service for Black-Box Optimization, 2017. URL http://www.kdd.org/kdd2017/papers/view/google-vizier-a-service-for-black-box-optimization.

Ian J. Goodfellow. Efficient per-example gradient computations. arXiv, abs/1510.01799, 2015.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour, 2017.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization, 2018.

Shlomo Hoory, Amir Feder, Avichai Tendler, Sofia Erell, Alon Cohen, Itay Laish, Hootan Nakhost, Uri Stemmer, Ayelet Benjamini, Avinatan Hassidim, and Yossi Matias. Learning and evaluating a differentially private pre-trained language model. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 1178-1189, Punta Cana, Dominican Republic, 2021. URL https://aclanthology.org/2021.findings-emnlp.102/.

Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pp. 263-272, 2008.


Roger Iyengar, Joseph P Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical diﬀerentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), 2019.

Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? In 49th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pp. 531-540, 2008.

Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pp. 25 1, 2012.

Helena Klause, Alexander Ziller, Daniel Rueckert, Kerstin Hammernik, and Georgios Kaissis. Diﬀerentially private training of residual networks with scale normalisation, 2022.

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. Lecture Notes in Computer Science, pp. 491-507, 2020. ISSN 1611-3349. doi: 10.1007/978-3-030-58558-7_29. URL http://dx.doi.org/10.1007/978-3-030-58558-7_29.

Yehuda Koren and Robert Bell. Advances in collaborative filtering. Recommender systems handbook, pp. 77-118, 2015.

Alexey Kurakin, Shuang Song, Steve Chien, Roxana Geambasu, Andreas Terzis, and Abhradeep Thakurta. Toward training at imagenet scale with diﬀerential privacy, 2022.

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pp. 661-670, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450329569. doi: 10.1145/2623330.2623612. URL https://doi.org/10.1145/2623330.2623612.

Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin A. Inan, Janardhan Kulkarni, Yin Tat Lee, and Abhradeep Thakurta. When does differentially private learning not suffer in high dimensions? arXiv, abs/2207.00160, 2022a.

Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=bVuP3ltATMz.

Xiyang Liu, Weihao Kong, and Sewoong Oh. Diﬀerential privacy and robust statistics in high dimensions, 2021a.

Yugeng Liu, Rui Wen, Xinlei He, Ahmed Salem, Zhikun Zhang, Michael Backes, Emiliano De Cristofaro, Mario Fritz, and Yang Zhang. Ml-doctor: Holistic risk assessment of inference attacks against machine learning models, 2021b.

H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.

H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJ0hF1Z0b.

Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Large scale transfer learning for diﬀerentially private image classiﬁcation. 2022.

Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263-275. IEEE, 2017.

Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi diﬀerential privacy of the sampled gaussian mechanism, 2019.


Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary instantiation: Lower bounds for differentially private machine learning. 2021 IEEE Symposium on Security and Privacy (SP), May 2021. doi: 10.1109/sp40001.2021.00069. URL http://dx.doi.org/10.1109/sp40001.2021.00069.

Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X Yu, Sashank J Reddi, and Sanjiv Kumar. Adaclip: Adaptive clipping for private sgd. arXiv preprint arXiv:1908.07643, 2019.

Or Sheffet. Old techniques in differentially private linear regression. In Aurélien Garivier and Satyen Kale (eds.), Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pp. 789-827. PMLR, 22-24 Mar 2019. URL https://proceedings.mlr.press/v98/sheffet19a.html.

Virat Shejwalkar, Arun Ganesh, Rajiv Mathews, Om Thakkar, and Abhradeep Thakurta. Recycling scraps: Improving private learning by leveraging intermediate checkpoints. arXiv, abs/2210.01864, 2022.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18, 2017.

Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pp. 58-77. IEEE, 2017.

Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 245-248. IEEE, 2013.

Shuang Song, Om Thakkar, and Abhradeep Thakurta. Characterizing private clipped gradient descent on convex generalized linear problems. arXiv preprint arXiv:2006.06783, 2020.

Xingyou Song, Sagi Perel, Chansoo Lee, Greg Kochanski, and Daniel Golovin. Open source vizier: Distributed infrastructure and api for reliable and flexible blackbox optimization. In Automated Machine Learning Conference, Systems Track (AutoML-Conf Systems), 2022.

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers, 2021.

Pranav Subramani, Nicholas Vadivelu, and Gautam Kamath. Enabling fast differentially private sgd via just-in-time compilation and vectorization. arXiv preprint arXiv:2010.09063, 2020.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, D. Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, 2015.

Om Thakkar, Galen Andrew, and H Brendan McMahan. Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871, 2019.

Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Peter Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-mixer: An all-MLP architecture for vision. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=EI2KOXKdnP.

Florian Tramèr and Dan Boneh. Differentially private learning needs better features (or much more data). In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YTWGvpFOQD-.

Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (nearly) optimal private linear regression for sub-gaussian data via adaptive clipping. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 1126-1166. PMLR, 02-05 Jul 2022. URL https://proceedings.mlr.press/v178/varshney22a.html.


Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv, abs/1803.02596, 2018.

Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. Subsampled rényi diﬀerential privacy and analytical moments accountant. Journal of Privacy and Conﬁdentiality, 10(2), Jun 2020. ISSN 2575-8527. doi: 10.29012/jpc.723. URL http://dx.doi.org/10.29012/jpc.723.

Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeﬀrey F. Naughton. Bolt-on diﬀerential privacy for scalable stochastic gradient descent-based analytics, 2016.

Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training, 2017.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes, 2019.

Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=7aogOj_VYO0.

Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially private fine-tuning of language models. arXiv, abs/2110.06500, 2022a.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022b.

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021.

Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model, 2019.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization, 2017.

Huanyu Zhang, Ilya Mironov, and Meisam Hejazinia. Wide network learning with diﬀerential privacy, 2021.

Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private sgd with gradient subspace identiﬁcation, 2020.

Yuqing Zhu and Yu-Xiang Wang. Poission subsampled rényi differential privacy. In International Conference on Machine Learning, pp. 7634-7642. PMLR, 2019.

A Limitations

This work leverages a large proprietary dataset called JFT-3B to pre-train the ViT-G/14 model in order to illustrate the benefits of scale for differential privacy with transfer learning. In order to make our work more generalizable and reproducible, we also include results with models pre-trained on ImageNet-21k and ImageNet-1k.

Another limitation of our work may be that our pre-training dataset is largely in-distribution with the private fine-tuning datasets. We would argue that, in practice, this is still valuable, since it illustrates the effectiveness of the approach and helps estimate the utility of gathering a public dataset to pre-train on, given a sensitive dataset for which one wants a privacy guarantee. Finally, out-of-distribution performance is an interesting research question even in the non-private setting, and its exploration in the context of privacy is a valuable direction for future work.


We additionally note that we do not include any cost related to hyperparameter tuning in our privacy accounting and budget. This is in line with the setup of the baselines we consider and allows for a fair comparison with reported numbers. However, we recognize this as a limitation of our work, and we see more research in this direction as a considerable opportunity.

In terms of societal impact, the biggest cost of this work is training the largest ViT-G model on a large dataset and its energy impact. However, we argue that our results ultimately point towards amortizing and increasingly leveraging already trained models for high-performance DP training, and thus potentially reducing the overall energy consumption.

B Algorithmic details

B.1 Privacy Analysis details for DP-SGD

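Algorithm 4 Generalized First Order Differentially Private Algorithm

Require: Data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \in D$, loss function $\ell: \mathbb{R}^{m \times d} \times \mathbb{R} \to \mathbb{R}$, a first order optimizer Opt, clipping norm $C$, number of iterations $T$, noise multiplier $\sigma$

1: Randomly initialize $\theta_0$.
2: for $t = 1, \ldots, T$ do
3:   $g_t \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathrm{clip}(\nabla \ell(\theta_t; (x_i, y_i)))$, where $\mathrm{clip}(v) = v \cdot \min\{1, C / \|v\|_2\}$
4:   $\tilde{g}_t \leftarrow g_t + \mathcal{N}(0, (\sigma C)^2)$
5:   $\theta_{t+1} \leftarrow$ single step of first order optimization with gradient $\mathrm{Opt}(\tilde{g}_t)$
6: end for
7: return $\frac{1}{T} \sum_{t=1}^{T} \theta_t$ or $\theta_T$.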
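As a concrete illustration of Algorithm 4, the following is a minimal NumPy sketch of the clip, noise, and update steps. It assumes a linear model with squared loss and plain gradient descent as Opt. The function names (`clip_rows`, `dp_first_order_step`), the toy data, and the convention of adding noise to the clipped sum before averaging are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

def clip_rows(grads, clip_norm):
    """Scale each row v to v * min(1, C / ||v||_2)."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def dp_first_order_step(theta, x, y, clip_norm, sigma, lr, rng):
    """One noisy full-batch step for a linear model with squared loss."""
    n = x.shape[0]
    residual = x @ theta - y                      # shape (n,)
    per_example_grads = residual[:, None] * x     # gradient of 0.5 * residual^2
    clipped = clip_rows(per_example_grads, clip_norm)
    # Assumption: Gaussian noise with std sigma * C is added to the clipped
    # sum, and the noisy sum is then averaged over the n examples.
    noise = rng.normal(0.0, sigma * clip_norm, size=theta.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / n
    return theta - lr * noisy_grad                # Opt = gradient descent here

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
y = x @ np.ones(8)
theta = np.zeros(8)
for _ in range(50):
    theta = dp_first_order_step(theta, x, y, clip_norm=1.0,
                                sigma=0.5, lr=0.5, rng=rng)
```

Swapping the last line of `dp_first_order_step` for an Adam update would give a DP-Adam style variant; only the noisy gradient is passed to the optimizer.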

The privacy parameters (ε, δ) are functions of C, σ, |Bt|, |D|, and the total number of iterations T. The DP-SGD algorithm involves setting the right clipping norm C and noise multiplier σ given a privacy budget, batch size, and dataset size. The (ε, δ) guarantee is computed by analyzing the Gaussian mechanism with privacy amplification by subsampling and composition across iterations (Kasiviswanathan et al., 2008; Bassily et al., 2014; Abadi et al., 2016; Mironov, 2017; McMahan et al., 2017; Mironov et al., 2019; Erlingsson et al., 2019; Zhu & Wang, 2019; Feldman et al., 2020; Wang et al., 2020). Our implementation relies on the TensorFlow Privacy¹ codebase for converting (ε, δ) and the clipping norm C to and from the noise multiplier σ. We rely on the default Rényi accountant implementation already open-sourced as part of the TensorFlow Privacy library.

To put the (ε, δ) values in context, a privacy guarantee of, say, ε ≤ 4 on ImageNet-1k satisfies the much stronger property of zCDP ≤ 1 (0.154 for ε = 4), which is by now an industry standard.
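To make the relation between the noise multiplier, zCDP, and (ε, δ) more tangible, here is a small sketch under simplifying assumptions: full-batch training with no subsampling amplification, and the standard conversion bound from ρ-zCDP to (ε, δ)-DP. The symbol ρ, the function names, and the example values are introduced here for illustration. The bound is generally looser than the Rényi accountant used in the paper, so its output is not expected to reproduce the paper's numbers.

```python
import math

def gaussian_zcdp(sigma, steps):
    """zCDP parameter rho after `steps` compositions of a Gaussian mechanism
    whose noise std is sigma times the L2 sensitivity (no subsampling)."""
    return steps / (2.0 * sigma ** 2)

def zcdp_to_epsilon(rho, delta):
    """Upper bound: rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Hypothetical values, chosen only to exercise the functions.
rho = gaussian_zcdp(sigma=10.0, steps=10)
eps_bound = zcdp_to_epsilon(rho, delta=8e-7)
print(f"rho = {rho:.3f}, epsilon <= {eps_bound:.2f}")
```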

C Pre-training Details

We conduct all our experiments in JAX (Bradbury et al., 2018; Frostig et al., 2018), a framework that leverages just-in-time compilation using XLA² and auto-vectorizes the backward pass. We leverage this functionality throughout our experiments. Finally, we conduct our experiments on the TPUv4 architecture.

C.1 Pre-training with JFT-3B

Dataset. Contrary to Mehta et al. (2022); De et al. (2022), we use a smaller version of JFT, namely JFT-3B (instead of JFT-4B), for our pre-training. We do this to lower the risk of the project and follow Zhai et al. (2021) exactly for pre-training the ViT-G/14 model. The JFT-3B dataset consists of nearly 3 billion images, annotated with a class hierarchy of around 30k labels via a semi-automatic pipeline. As done previously, we ignore the hierarchical aspect of the labels and use only the assigned labels as targets for multi-label classification via a sigmoid cross-entropy loss.

¹ https://github.com/tensorflow/privacy
² https://www.tensorflow.org/xla

Deduplication. To neither inflate our results nor break the privacy guarantee offered by fine-tuning privately on ImageNet, we extend the deduplication process proposed by Kolesnikov et al. (2020) and deduplicate JFT-3B with respect to all splits of ImageNet. We use a model-based deduplication system which removes both exact and near-duplicates across common image transformations such as crop, shift, and resize.

Hyperparameters. At the pre-training stage, we follow Zhai et al. (2021) exactly and stick with the common practice of employing the Adafactor optimizer with β1 = 0.9 and β2 = 0.999, a batch size of 32768, a dropout rate of 0.0, a clip global norm of 1, and a high weight decay of 3.0 for the "head" and 0.03 for the "body". In addition, we remove the additional [class] token to save memory. All models are pre-trained at resolution [224, 224], with Inception crop followed by random horizontal flip as pre-processing. We also use a reciprocal square-root schedule with a linear learning rate warmup of 10k steps. Finally, the ViT-G/14 model was pre-trained using 2048 TPUv3 chips.
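The learning rate schedule described above can be sketched as follows. This is one common form of a reciprocal square-root schedule with linear warmup; the exact functional form and the base learning rate are not given in this section, so `rsqrt_schedule` and `base_lr` are assumptions made for illustration.

```python
import math

def rsqrt_schedule(step, base_lr, warmup_steps=10_000):
    """Linear warmup to base_lr, then decay proportional to 1 / sqrt(step)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)

# The learning rate peaks at the end of warmup and halves every 4x steps after.
for step in (0, 5_000, 10_000, 40_000, 160_000):
    print(step, rsqrt_schedule(step, base_lr=1.0))
```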

C.2 Pre-training with ImageNet-21k and ImageNet-1k

Datasets. ImageNet-21k is a superset of ImageNet-1k with 21k classes and 14M images (Deng et al., 2009). As before, to neither inflate our results nor break the privacy guarantee, we extend the deduplication process proposed by Kolesnikov et al. (2020) and deduplicate ImageNet-21k with respect to all splits of ImageNet-1k, CIFAR-10 and CIFAR-100. Similarly, we deduplicate ImageNet-1k with respect to all splits of CIFAR-10 and CIFAR-100.

Hyperparameters. At the pre-training stage, we stick with the common practice of employing the Adam optimizer (even for ResNet) with β1 = 0.9 and β2 = 0.999, and a batch size of 4096. Unlike pre-training with the JFT dataset, we follow the recommendations of Steiner et al. (2021) and use the AugReg strategy: we lower the weight decay to 0.1 (which gets multiplied by the learning rate) and do not use dropout, but instead use the data augmentation strategy called medium1, which combines Mixup with α = 0.2 (Zhang et al., 2017) and RandAugment with l = 15 and m = 2 (Cubuk et al., 2020). We also use a linear learning rate warmup for the first 10k steps and then linearly decay the learning rate until the end. Our model is pre-trained with 224x224-sized images.
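For readers unfamiliar with Mixup, a minimal NumPy sketch of the operation with α = 0.2 is given below. Drawing a single mixing coefficient per batch and pairing examples through a random permutation are assumptions of this sketch; the pre-training pipeline may differ in these details.

```python
import numpy as np

def mixup(images, labels_one_hot, alpha, rng):
    """Convexly combine each example with a random partner from the batch."""
    lam = rng.beta(alpha, alpha)              # mixing coefficient in [0, 1]
    perm = rng.permutation(images.shape[0])   # partner index for each example
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels_one_hot + (1.0 - lam) * labels_one_hot[perm]
    return mixed_images, mixed_labels

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8, 8, 3))        # toy batch
labels = np.eye(4)                            # one class per example
mixed_images, mixed_labels = mixup(images, labels, alpha=0.2, rng=rng)
```

With α = 0.2 the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the two originals.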

| Model | Dataset | Epochs | Base η | TPU v4 hours |
|---|---|---|---|---|
| ViT-B/16 | ImageNet-21k | 300 | $10^{-3}$ | 2.7k |
| ViT-B/16 | ImageNet-1k | 300 | $10^{-3}$ | 0.4k |

Table 5: Pre-training hyperparameters. We used a batch size of 4096 and a learning rate warmup of 10k steps followed by linear decay. Additionally, we set the dropout rate to 0.0, the clip global norm to 1, and the weight decay to 0.0001. We use images of resolution 224x224. Note that we intentionally keep the model size the same to illustrate the effect of a larger pre-training dataset on private fine-tuning.

D Finetuning Details

D.1 Datasets

ImageNet-1k. We fine-tune on the ImageNet train split and report the top-1 accuracies obtained on the official test split. Following Mehta et al. (2022), we use images at an input resolution of 256x256, center-cropped from a resolution of 384x384. Note that this is a slightly lower resolution, and omits the Inception crop (Szegedy et al., 2015), compared to what is typically done in the non-private setting. Finally, for training with DP, we fix δ to 8e-7.

CIFAR-10 and CIFAR-100. As above, we fine-tune on the train split and report the top-1 accuracies obtained on the official test split. We also change the input resolution to 256x256, center-cropped from an image of resolution 384x384. This may look unusual at first for CIFAR-10 and CIFAR-100, since the original resolution of the images is 32x32; we first upsample them to 384x384 and then center-crop them. We found that using higher resolution images made a big difference in performance (even in the non-private setting), especially when using features from a pre-trained model. Finally, for training with DP, we fix δ to 1e-5.
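The CIFAR pre-processing described above (upsample from 32x32 to 384x384, then center-crop to 256x256) can be sketched in NumPy as follows. The interpolation method is not stated in this section; nearest-neighbour upsampling is used here purely as an assumption to keep the example dependency-free, and the helper names are illustrative.

```python
import numpy as np

def upsample_nearest(image, factor):
    """Nearest-neighbour upsampling of an (H, W, C) image by an integer factor."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def center_crop(image, size):
    """Crop the central size x size window from an (H, W, C) image."""
    height, width = image.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return image[top:top + size, left:left + size]

cifar_image = np.zeros((32, 32, 3), dtype=np.uint8)   # placeholder image
resized = upsample_nearest(cifar_image, 384 // 32)    # 32x32 -> 384x384
model_input = center_crop(resized, 256)               # 384x384 -> 256x256
```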

D.2 Compute requirement

To illustrate that our proposed methods do not impose significant compute overhead, we also report the exact compute required for each of our methods in Table 6, using ImageNet-1k features (the largest dataset we consider) obtained from ViT-G/14 pre-trained on JFT-3B (the highest dimensional model we consider).

As shown in Table 6, the absolute compute required for private fine-tuning with our methods is small. Moreover, our best performing method (DP-FC) requires only slightly more compute than our baseline (DP-Adam) while significantly outperforming it in quality.

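| Pretraining Dataset | Method | Epochs | TPUv4 hours |
|---|---|---|---|
| JFT-3B | DP-LS | 1 | 0.9 |
| JFT-3B | DP-Newton | 1 | 0.5 |
| JFT-3B | DP-Newton | 10 | 2.4 |
| JFT-3B | DP-FC | 1 | 0.4 |
| JFT-3B | DP-FC | 10 | 1.2 |
| JFT-3B | DP-Adam | 1 | 0.2 |
| JFT-3B | DP-Adam | 10 | 0.9 |
| JFT-3B | DP-Adam | 100 | 8.7 |

Table 6: Comparison of TPU v4 hours required, as a measure of the exact compute required for private fine-tuning with ImageNet-1k features obtained from ViT-G/14 pre-trained on the JFT-3B dataset.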

D.3 Hyperparameter Tuning

All of our results use the full-batch setting with a constant learning rate (when applicable). Additionally, following Mehta et al. (2022), we set the initial weights to 0.0.

To gain confidence in our proposed methods, we wanted to properly tune the important hyperparameters. Fortunately, since we are only interested in learning the last layer from features, each training run is relatively inexpensive. To tune hyperparameters, we employed a Bayesian optimization package called Vizier (Golovin et al., 2017; Song et al., 2022). We held out 5% of the training set as our validation set and report top-1 accuracies on the test set using the tuned hyperparameters.
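The 5% hold-out can be sketched as a random index split. The function name, the seed, and the use of a uniformly random permutation are assumptions of this example; the section does not say how the held-out examples were selected.

```python
import numpy as np

def split_train_validation(num_examples, validation_fraction=0.05, seed=0):
    """Return disjoint index arrays (train, validation) over range(num_examples)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    num_validation = int(round(validation_fraction * num_examples))
    return order[num_validation:], order[:num_validation]

# Example with the size of the CIFAR train split.
train_idx, val_idx = split_train_validation(50_000)
```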

Note that, for a fair comparison, similar to previous work and our baseline (DP-SGD), we do not account for the privacy cost of hyperparameter tuning in our results. For proper tuning and trustworthy results, we made sure to run more trials than required for tuning each method. However, for all tuning runs, we found that the optimizer converged quite early, suggesting that the cost of hyperparameter tuning can be further reduced.

D.4 DP-Adam hyperparameters

| Model | Pre-training DS | Fine-tuning DS | η | λ | DP Clipping Norm (C) |
|---|---|---|---|---|---|
| ViT-G/14 | JFT-3B | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 1.0 |
| ViT-G/14 | JFT-3B | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-G/14 | JFT-3B | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |

Table 7: Fine-tuning hyperparameters for DP-Adam. All models are trained in the full-batch setting with a constant learning rate and no dropout. When training the models with DP, we replace the global clipping with the per-example clipping norm specified in the table. Following Mehta et al. (2022), we set the initial weights to 0.0 and the bias to -10.0, and train with sigmoid cross-entropy loss. Note that we employed a Bayesian optimization package called Vizier (Golovin et al., 2017; Song et al., 2022) and used a total of 200 trials for jointly tuning both the learning rate and weight decay over the ranges specified in the table.
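To see what the initialization above implies, the short sketch below evaluates the sigmoid output at the start of training. With zero weights, the logit for every class equals the bias, so each class starts with a very small predicted probability. The feature value used is arbitrary and the variable names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

initial_weight, initial_bias = 0.0, -10.0
feature = 3.7                                   # arbitrary; irrelevant at zero weights
initial_probability = sigmoid(initial_weight * feature + initial_bias)
print(f"{initial_probability:.2e}")
```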

D.5 DP-LS hyperparameters

| Model | Pre-training DS | Fine-tuning DS | α | λ | DP Clipping Norm (C) |
|---|---|---|---|---|---|
| ViT-G/14 | JFT-3B | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 1.0 |
| ViT-G/14 | JFT-3B | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-G/14 | JFT-3B | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |

Table 8: Fine-tuning hyperparameters for DP-LS. All models are trained in the full-batch setting. When training the models with DP, we replace the global clipping with the per-example clipping norm specified in the table. Specific to DP-LS, we clip the RHS with $C$ and the Gramians with $C^2$. Interestingly, least squares is invariant to the starting weights, which removes an important confounding factor from the private training procedure. Similar to DP-Adam, we employed a Bayesian optimization package called Vizier (Golovin et al., 2017; Song et al., 2022) and used a total of 200 trials for jointly tuning both α and weight decay over the ranges specified in the table.
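The clipping described for DP-LS can be illustrated with a sketch of private least squares via noisy sufficient statistics. This is a simplified stand-in rather than the paper's algorithm: clipping the feature rows (which bounds each Gramian term by $C^2$ and, for one-hot labels, each RHS term by $C$), the noise calibration, the equal treatment of the two statistics, and all function names are assumptions made here, and the hyperparameter α from Table 8 is not modelled.

```python
import numpy as np

def clip_rows(x, clip_norm):
    """Rescale each row so that its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

def symmetric_gaussian(d, std, rng):
    """Symmetric d x d matrix whose upper triangle holds i.i.d. N(0, std^2) entries."""
    upper = np.triu(rng.normal(0.0, std, size=(d, d)))
    return upper + np.triu(upper, k=1).T

def dp_least_squares(x, y_one_hot, clip_norm, sigma, weight_decay, rng):
    """Solve the ridge normal equations on noisy sufficient statistics."""
    d, k = x.shape[1], y_one_hot.shape[1]
    xc = clip_rows(x, clip_norm)      # ||x_i||_2 <= C
    gram = xc.T @ xc                  # each x_i x_i^T has Frobenius norm <= C^2
    rhs = xc.T @ y_one_hot            # each x_i y_i^T has Frobenius norm <= C
    gram_noisy = gram + symmetric_gaussian(d, sigma * clip_norm ** 2, rng)
    rhs_noisy = rhs + rng.normal(0.0, sigma * clip_norm, size=(d, k))
    return np.linalg.solve(gram_noisy + weight_decay * np.eye(d), rhs_noisy)

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))                 # toy features
labels = np.eye(3)[rng.integers(0, 3, size=200)]     # toy one-hot labels
theta = dp_least_squares(features, labels, clip_norm=1.0, sigma=0.5,
                         weight_decay=1.0, rng=rng)
```

Because the solution is obtained in closed form from the two statistics, no initial weights appear anywhere in the sketch, which matches the invariance to starting weights noted in the caption.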

D.6 DP-Newton hyperparameters

| Model | Pre-training DS | Fine-tuning DS | η | λ | DP Clipping Norm (C) |
|---|---|---|---|---|---|
| ViT-G/14 | JFT-3B | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 1.0 |
| ViT-G/14 | JFT-3B | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-G/14 | JFT-3B | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |

Table 9: Fine-tuning hyperparameters for DP-Newton. All models are trained in the full-batch setting with a constant learning rate and no dropout. When training the models with DP, we replace the global clipping with the per-example clipping norm specified in the table. Specific to DP-Newton, we clip the features instead of the gradient and the Hessian. We do this to save on computational cost, since we then do not need to explicitly compute the per-example gradient and Hessian. For the privacy analysis, we use the clipping norm $C$ to sanitize the gradient and $C^2/4.0$ for the Hessian. Following Mehta et al. (2022), we set the initial weights to 0.0 and train with sigmoid cross-entropy loss. We employed a Bayesian optimization package called Vizier (Golovin et al., 2017; Song et al., 2022) and used a total of 300 trials for jointly tuning both the learning rate and weight decay over the ranges specified in the table.
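The two bounds used for DP-Newton follow from the form of the per-example derivatives of the sigmoid cross-entropy loss. A short derivation is given below for a single class, assuming a logit $z = \theta^\top x$, a label $y \in [0, 1]$, and features clipped so that $\|x\|_2 \le C$. The symbols $z$ and $p$ are introduced here for illustration.

$$
\begin{aligned}
p &= \frac{1}{1 + e^{-z}}, \qquad \ell(\theta; (x, y)) = -y \log p - (1 - y) \log(1 - p), \\
\nabla_\theta \ell &= (p - y)\, x
\quad\Rightarrow\quad \|\nabla_\theta \ell\|_2 = |p - y|\, \|x\|_2 \le C, \\
\nabla_\theta^2 \ell &= p (1 - p)\, x x^\top
\quad\Rightarrow\quad \|\nabla_\theta^2 \ell\|_F = p (1 - p)\, \|x\|_2^2 \le \frac{C^2}{4}.
\end{aligned}
$$

The factor $1/4$ is the maximum of $p(1 - p)$ over $p \in [0, 1]$, attained at $p = 1/2$. Clipping the features therefore bounds both quantities without materializing per-example gradients or Hessians.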

D.7 DP-FC hyperparameters

| Model | Pretraining DS | Finetuning DS | η | λ | DP Clipping Norm (C) |
|---|---|---|---|---|---|
| ViT-G/14 | JFT-3B | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 1.0 |
| ViT-G/14 | JFT-3B | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-G/14 | JFT-3B | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | ImageNet-1k | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-21k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-10 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |
| ViT-B/16 | ImageNet-1k | CIFAR-100 | $[10^{-8}, 10^{8}]$ | $[10^{-8}, 10^{8}]$ | 0.005 |

Table 10: Fine-tuning hyperparameters for DP-FC. All models are trained in the full-batch setting with a constant learning rate and no dropout. When training the models with DP, we replace the global clipping with the per-example clipping norm specified in the table. Specific to DP-FC, we clip the per-example gradients with clipping norm $C$ and use $C^2$ for the Gramian, which is shared across classes and iterations. Following Mehta et al. (2022), we set the initial weights to 0.0 and the initial bias to -10.0, and train with sigmoid cross-entropy loss. We employed a Bayesian optimization package called Vizier (Golovin et al., 2017; Song et al., 2022) and used a total of 200 trials for jointly tuning both the learning rate and weight decay over the ranges specified in the table.
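The structure described in the caption (a noisy feature Gramian computed once and shared across classes and iterations, combined with noisy clipped gradients at every step) can be sketched as below. This is an illustrative approximation only: the exact update rule, normalization, noise calibration, and bias handling of DP-FC are not specified in this section, so the preconditioned step, the omission of the bias term, and all names are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clip_rows(v, clip_norm):
    """Rescale each row so that its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

def symmetric_gaussian(d, std, rng):
    """Symmetric d x d matrix whose upper triangle holds i.i.d. N(0, std^2) entries."""
    upper = np.triu(rng.normal(0.0, std, size=(d, d)))
    return upper + np.triu(upper, k=1).T

def dp_fc_sketch(x, y, clip_norm, sigma, lr, weight_decay, steps, rng):
    """Gradient steps preconditioned by a noisy feature Gramian (illustrative)."""
    n, d = x.shape
    k = y.shape[1]
    # Noisy Gramian of clipped features: computed once, reused for all classes and steps.
    xc = clip_rows(x, clip_norm)
    gram = xc.T @ xc + symmetric_gaussian(d, sigma * clip_norm ** 2, rng)
    precond = gram / n + weight_decay * np.eye(d)
    theta = np.zeros((d, k))
    for _ in range(steps):
        probs = sigmoid(x @ theta)
        # Per-example gradient of sigmoid cross-entropy: outer(x_i, p_i - y_i).
        per_example = (x[:, :, None] * (probs - y)[:, None, :]).reshape(n, d * k)
        clipped = clip_rows(per_example, clip_norm)
        noise = rng.normal(0.0, sigma * clip_norm, size=d * k)
        grad = ((clipped.sum(axis=0) + noise) / n).reshape(d, k)
        theta = theta - lr * np.linalg.solve(precond, grad)
    return theta

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 4))                  # toy features
labels = np.eye(3)[rng.integers(0, 3, size=32)]      # toy one-hot labels
theta = dp_fc_sketch(features, labels, clip_norm=1.0, sigma=0.1, lr=1.0,
                     weight_decay=0.1, steps=5, rng=rng)
```

Unlike a Newton step, the preconditioner here does not depend on the current weights, so it is privatized once instead of at every iteration.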