# adversarial_machine_unlearning__6491b7aa.pdf

Published as a conference paper at ICLR 2025

ADVERSARIAL MACHINE UNLEARNING

Zonglin Di1 , Sixie Yu2 , Yevgeniy Vorobeychik3, Yang Liu1

1University of California, Santa Cruz, 2Stellar Cyber, Inc., 3Washington University in St. Louis 1{zdi, yangliu}@ucsc.edu, 2bit.yusixie@gmail.com 3yvorobeychik@wustl.edu

This paper focuses on the challenge of machine unlearning, aiming to remove the influence of specific training data on machine learning models. Traditionally, the development of unlearning algorithms runs parallel with that of membership inference attacks (MIA), a type of privacy threat to determine whether a data instance was used for training. However, the two strands are intimately connected: one can view machine unlearning through the lens of MIA success with respect to removed data. Recognizing this connection, we propose a game-theoretic framework that integrates MIAs into the design of unlearning algorithms. Specifically, we model the unlearning problem as a Stackelberg game in which an unlearner strives to unlearn specific training data from a model, while an auditor employs MIAs to detect the traces of the ostensibly removed data. Adopting this adversarial perspective allows the utilization of new attack advancements, facilitating the design of unlearning algorithms. Our framework stands out in two ways. First, it takes an adversarial approach and proactively incorporates the attacks into the design of unlearning algorithms. Secondly, it uses implicit differentiation to obtain the gradients that limit the attacker s success, thus benefiting the process of unlearning. We present empirical results to demonstrate the effectiveness of the proposed approach for machine unlearning. The code is available at https://github.com/UCSC-REAL/SG-Unlearn.

1 INTRODUCTION

The enactment of the General Data Protection Regulation (GDPR) by the EU has elevated the importance of deleting user data from machine learning models to a critical level. This process is distinctly more intricate compared to removing data from conventional databases. Erasing the data s imprint from a machine learning model necessitates an approach to negate the data s influence on the model comprehensively while maintaining the utility and accuracy of the model.

Beyond this, establishing the true extent to which data influence has been erased from the model poses a significant challenge (Song & Mittal, 2021). Numerous methods and metrics have been advanced to validate the thoroughness of data removal, each with varying degrees of reliability and efficacy (Guo et al., 2020; Thudi et al., 2022b). We propose a novel adversarial perspective on unlearning that we argue is a more robust framework for effective machine unlearning. In this approach, the focus shifts to simulating possible attacks aimed at inferring whether the data that should have been forgotten nevertheless maintains some influence on the model. If, within this adversarial framework, an attacker fails to distinguish whether a data point was part of the training set or merely a typical instance of unseen data, we can conclude that the influence of the data point on the data has been successfully unlearned.

We leverage advancements from the burgeoning domain of Membership Inference Attacks (MIA) to simulate an adversary (Shokri et al., 2017), therein framing a Stackelberg Game (SG) between an unlearner, tasked with orchestrating the unlearning process, and an auditor deploying MIA to deduce the membership of data in the model s training set. The key idea is for the unlearner to adjust the model being unlearned by utilizing gradient feedbacks from the auditor s optimization

* Equal Contribution. Corresponding author: Yang Liu <yangliu@ucsc.edu>

Published as a conference paper at ICLR 2025

problem, moving the model in a direction that limits the effectiveness of the attack, thus achieving the goal of unlearning. Specifically, we formulate the MIA as a utility-maximizing problem, where the utility measures the remaining influence of a data point in the unlearned model. The unlearner s loss function is defined as a combination of the degradation of model performance and the auditor s utility. We harness the development from implicit differentiation and design a gradient-based algorithm to solve the game, allowing for seamless integration into existing end-to-end pipelines (Gould et al., 2016; Amos & Kolter, 2017; Agrawal et al., 2019b).

The contributions of the present paper are summarized below

We propose to evaluate the effectiveness of an unlearning algorithm from an adversarial perspective, inspiring us to develop a game theory framework that enables the use of advanced MIAs for enhancing the unlearning process.

Additionally, we design a gradient-based solution method to solve the game by leveraging implicit differentiation, making it amenable to end-to-end pipelines.

Finally, we support the efficacy of the game and the solution method with extensive results.

2 RELATED WORK

The first related thread is machine unlearning, which focuses on removing the influence of a subset of data (referred to as the forget set) from a machine learning model. The unlearning approaches are divided into two classes. The first one is exact unlearning, which involves retraining the model on data excluding the forget set. The second one is approximate unlearning. The ideas behind approximate unlearning are twofold. The first is to track the influence of each training data on the updates to a model s weights, allowing for a rollback during unlearning (Bourtoule et al., 2021; Graves et al., 2021b; Chen et al., 2022). The second is using a loss function to capture the objectives of unlearning (e.g., removing the influence of the forget set while maintaining model utility) and modifying the model weights to minimize the loss function (Guo et al., 2020; Golatkar et al., 2020b; Izzo et al., 2021b; Warnecke et al., 2023; Chundawat et al., 2023; Jia et al., 2023). The method proposed in this paper aligns with the second idea. Specifically, we design a loss function that simulates an auditor who uses MIAs to evaluate the effectiveness of unlearning. By differentiating through the auditor s optimization problem, we compute the gradients that reduce the auditor s utility, thus increasing the effectiveness of unlearning. Besides algorithmic developments, Jagielski et al. (2023) proposes a measure to quantify the forgetting during training; Thudi et al. (2022b) takes a formal analysis of the definition of approximation unlearning and propose methods to verify exact unlearning. Due to space constraint, it is not feasible to provide a comprehensive review of all related studies. We refer the readers to the survey article by Nguyen et al. (2022) for a more exhaustive discussion.

The second related line is membership inference attacks (MIA). Shokri et al. (2017) introduced MIAs, showing the privacy risks of machine learning models. Subsequently, different attack methods are proposed (Chen et al., 2021; Carlini et al., 2022; Ye et al., 2022; Bertran et al., 2023). On the other hand, Carlini et al. (2022) shows that existing criteria to evaluate MIAs are limited in capturing real-world scenarios and propose more practical evaluation metrics. In addition, comprehensive evaluation frameworks and tools are developed (Murakonda & Shokri, 2020; Song & Mittal, 2021). Finally, Nasr et al. (2018); Sharma et al. (2024) proposes a defense mechanism to counter MIAs from an adversarial perspective. Unlike these works, which relies on a purely NN-based minimax formulation limiting the integration of non-NN-based MIAs and existing unlearning methods, our Stackelberg game framework models the sequential nature of unlearning and auditing, enabling seamless incorporation of diverse, differentiable attack and defense strategies.

3 PRELIMINARIES

Machine Unlearning. Let D = {(xi, yi) | xi X, yi Y} be a labeled dataset, where X (resp. Y) denote the feature (resp. label) space. The training, validation, and test sets are Dtr, Dval, and Dte, respectively. A machine learning (ML) algorithm is denoted by A, mapping from the joint space of features and labels X Y to a hypothesis class. We refer to the model trained on the entire training set as the original model, i.e., θo = A(Dtr).

Published as a conference paper at ICLR 2025

Let Df = {(xf j , yf j )}q j=1 Dtr represent a forget set. The goal of machine unlearning is to remove the influence of Df from the original model, resulting in an unlearned model θu (i.e., θu = U(θo)) where U represents a machine unlearning algorithm. The unlearning algorithm may have access to other inputs (e.g., the validation set Dval) depending on the problem settings. Let Dr be the retain set, the subset of the training data excluding the forget set, i.e., Dr = Dtr \ Df. The gold standard of machine unlearning is θr = A(Dr), a model trained on the retain set, excluding the influence of Df. We denote θr as the gold standard when comparing machine unlearning algorithms. Retraining is expensive, especially for deep neural networks. This motivates the development of efficient machine unlearning algorithms that satisfy the following conditions: 1) the influence of Df is removed from the unlearned model, 2) the performance of the unlearned model is comparable to the performance of the original model, and 3) the computational costs (e.g., running time) are lower compared to those incurred during retraining.

Membership Inference Attacks. A membership inference attack (MIA) aims to determine whether a data instance was used to train an ML model (Shokri et al., 2017). An instance that was in the training set is called a member, while one that was not in the training set is called a non-member. Formally, given a target model θ, an attacker infers the membership of an instance (x, y) based on the model s outputs (i.e., Sθ(x)) and the label. The attacker does not have access to either the training data or the model parameters of the target model. Instead, he gathers proxy training and test sets and learns a model θ to mimic the behavior of the target model. Using the outputs of θ on its own training and test data, the attacker acquires a labeled (member v.s. non-member) dataset, and then uses the labeled dataset to train a binary classifier for determining the membership of an instance.

We adapt the idea of MIA to determine whether the influence of the forget set still exists in an unlearned model θu. Define an auditing set Dθu = {(sf j , 1), (ste j , 0)}q j=1, where sf j (resp. ste j ) represents the outputs of the forget (resp. test) instances from the unlearned model, that is, sf j = Sθu(xf j ) (resp. ste j = Sθu(xte j )).

Here, the test instances serve as an empirical distribution for the unseen data. The outputs can be scalars, such as the instance-wise cross-entropy losses. The outputs can also be the vectors of probabilities across the classes (Shokri et al., 2017; Carlini et al., 2022). The labels 1" and 0" indicate members and non-members, respectively. The MIA reduces to a binary classification task on Dθu, aiming to differentiate the forget instances from the test ones based on the outputs.

4 THE GAME MODEL

We model the machine unlearning problem as a Stackelberg game (SG) between an unlearner who deploys models as services, and an auditor who launches MIAs against the models. The key idea is to assess the effectiveness of an unlearning algorithm by measuring whether the auditor will succeed. In particular, the unlearning is considered effective when the auditor is unable to differentiate between the forget set from the test set based on their outputs from the unlearned model. The SG is played in a sequential manner: the unlearner first deploys an unlearned model, and then the auditor launches an MIA in response. Importantly, the advantage of first-mover endows the unlearner with the power to make a decision knowing that the auditor will play a best response (i.e., launching a strong attack). We now formally define the models for both players.

4.1 THE AUDITOR S MODEL

We begin by defining the auditor s model. Suppose the unlearner has deployed an unlearned model θu. Following standard setup (Shokri et al., 2017; Song & Mittal, 2021), we assume that the auditor has black-box access to the model, allowing him to query the model, e.g., submitting data to the model and collecting the outputs. The auditor s goal is to determine whether the influence of the forget set still exists in the model based on the outputs. To achieve this, the auditor constructs an auditing set Dθu, consisting of the model s outputs when passing the forget and test instances through the unlearning model θu (see Section 3 for details about the auditing set). The auditor assesses the distinctiveness of the two sets with a binary classifier trained on the auditing set through cross validation.

Published as a conference paper at ICLR 2025

Let Ua be the auditor s utility function, quantifying the distinctiveness of the forget and test instances. Intuitively, a large Ua indicates that the outputs of the forget instances are highly differentiable from the outputs of the test instances, strong evidence that the influence of the forget set still exists in the unlearned model. We formulate the auditor s model as the following optimization problem

Ua(θa, θu) = M( Dval θu ; θa) where θa Bθu = arg max θ a Ha M( Dtr θu; θ a). (1)

The auditing set Dθu is divided into the training Dtr θu and the validation Dval θu sets. The constraint encodes the process of learning a binary classifier. The set Bθu are the auditor s best-responses to the unlearner s decision θu, that is, a specific MIA that maximally differentiates the forget and test instances. The function M is an evaluation metric for the binary classifier on a dataset. The definition of M is flexible. One can use the accuracy to quantify the average performance of the classifier, where true positives are weighted equally with true negatives (Shokri et al., 2017; Song & Mittal, 2021). Alternatively, an average measure may not capture real privacy threats. Instead, ROC curve or true positive rates at specified false positive rates can be used for evaluation Carlini et al. (2022).

The auditor s model exhibits a high degree of generality, unifying several advanced MIAs in the literature; this includes neural network-based attacks proposed by Nasr et al. (2018), quantile regression-based attacks from Bertram et al. (2019), and prediction confidence-based attacks by Song & Mittal (2021), etc. Under the formulation of Equation 1, the mentioned attacks differ in 1) the hypothesis class Ha of the binary classifier and 2) the objective function M. Notice the dependence of the auditor s best-response on θu (i.e., Bθu) arising from the unlearner s first-mover advantage. The unlearner utilizes this dependence to select an unlearned model that limits the auditor s discriminative power, which we discuss next.

4.2 THE UNLEARNER S MODEL

Next, we define the unlearner s model. Let Cu represent the unlearner s cost function, which encompasses two main objectives for unlearning. The first objective is to maintain the utility of the model, ensuring that the unlearned model performs comparably (e.g., in terms of predictive power) to the original model. To achieve this objective, we minimize a loss function L(Dr; θu) computed on the retain set Dr, following the principles of empirical risk minimization. All regularization terms are included in the loss function to simplify notation. The second objective focuses on eliminating the influence of the forget set from the model being unlearned. We approach this objective adversarially by considering the auditor s utility M. In essence, a smaller value of the auditor s utility indicates that the forget set is harder to be distinguished from the test set, providing strong evidence that the unlearning process is effective.

Formally, the unlearner s optimization problem is to minimize the cost function below

Cu (θu, θa) = L(Dr; θu) + α M( Dval θu ; θa). (2)

The parameter α R+ balances the loss L and the auditor s utility M. Depending on the specific setting, the cost function Cu can be extended to incorporate additional objectives for unlearning. For instance, one can specify that the unlearned model should perform poorly on the forget set (Graves et al., 2021b); this can be achieved by minimizing an evaluation metric (e.g., likelihood) on the forget set. Also, several sparsity-promoting techniques have been shown helpful for unlearning (Jia et al., 2023); one way to achieve this is by adding an ℓ1 regularization to the cost function.

4.3 THE STACKELBERG GAME

Now, with the unlearner and the auditor models in place, we formally define the Stackelberg Game (SG). The SG is to solve the following bi-level optimization problem (Colson et al., 2007)

min θu Hu L(Dr; θu) + α M( Dval θu ; θa) | {z } Unlearner

s.t. θa Bθu | {z } Auditor

For example, Bertram et al. (2019) proposed a quantile-regression-based MIA. In this case, the best response is the optimal model parameters for the regression.

Published as a conference paper at ICLR 2025

The objective function has two components: the first ensures generalization by minimizing loss on the retain set, while the second quantifies privacy leakage by assessing the auditor s ability to differentiate between forget and test instances. A lower auditor utility indicates more effective unlearning, as it reduces the distinguishability between the two sets. The hierarchical structure encodes the sequential order of the play, with the upper level corresponding to the unlearner s optimization problem and the lower level capturing the auditor s best-responses. During the unlearning, the unlearner needs to proactively consider the auditor s responses. This requires selecting an unlearning model where the influence of the forget set is erased, or from the auditor s perspective, the forget instances are indistinguishable from the test ones.

The key assumption of the SG is that if the forget set cannot be distinguished from the test set in terms of the effectiveness of an MIA its influence is deemed eliminated from the unlearned model. We justify this assumption from three angles. First, one common way to measure forgetfulness is by assessing the accuracy of the unlearned model on the forget set (Graves et al., 2021b; Chundawat et al., 2023; Baumhauer et al., 2022). This approach is grounded on the observation that machine learning models exhibit distinct performance between training data and unseen data. However, it is important to note that accuracy on the forget set does not necessarily correlate with forgetfulness, as there are inherently difficult (or easy) instances that result in low (or high) accuracy regardless of whether they were part of the training set (Carlini et al., 2022). Secondly, MIAs have been used to study training data forgetting (Jagielski et al., 2023), demonstrating its utility in detecting residual traces of a dataset. Finally, from an adversarial perspective, if a sophisticated attack like an MIA cannot differentiate the forget set from the test set, it is reasonable to expect that the influence of the forget set has been removed.

We solve the SG using gradient-based methods, allowing for easy integration into end-to-end training pipelines. Specifically, we use Implicit Function Theorem to differentiate through the auditor s optimization problem Equation 1, obtaining the gradient of the auditor s utility with respect to (w.r.t) the unlearning model s weights, i.e., M/ θu. As a result, the SG becomes a differentiable layer, compatible with the standard forward-backward computation. The solution methods will be detailed in the next section.

5 SG-UN L E A R N: STACKELBERG GAME UNLEARN

In this section, we describe the algorithm for solving the SG. In general, it is NP-hard to find an optimal solution for the unlearner (Conitzer & Sandholm, 2006). Instead, we focus on gradient-based algorithms to find an approximate solution, i.e., a model parameter θu exhibiting good unlearning performance. The main technical challenge is computing the gradient of the auditor s utility w.r.t. the unlearning model s weights (i.e., M/ θu), which requires differentiation through the auditor s optimization problem. While the differentiation can be bypassed in some special cases, e.g., when the unlearner s hypothesis class is of linear regressions (Tong et al., 2018), this is rarely applicable in the current setting given our primary focus on unlearning deep neural networks.

Our solution leverages both the Implicit Function Theorem (IFT) (Dontchev et al., 2009) and tools from Differentiable Optimization (DO) to compute the gradients (Gould et al., 2016; Amos & Kolter, 2017; Agrawal et al., 2019b), thereby rendering the SG a differentiable layer seamlessly integrable into existing end-to-end pipelines.

We start by expanding the gradient of Cu w.r.t. θu using the chain rule

θu = L(Dr; θu)

θu + M( Dval θu ; θa) θa θa

Dtr θu Dtr θu θu . (4)

The first term on the right-hand side can be easily computed using an automatic differentiation tool like Py Torch (Paszke et al., 2017). In essence, the computation involves passing Dr through the unlearning model (i.e., θu) in the forward pass, computing the loss L, and getting the gradients in the backward pass. The second term on the right is an expansion of M/ θu using the chain rule; for clarity we omit the arguments of the functions. The gradient M/ θa is obtained by performing a standard forward-backward pass. Some evaluation metrics for binary classification, such as the 0-1 loss, AUC, recall, etc., are non-differentiable. Therefore, we adhere to standard practices by employing a differentiable proxy for M, such as utilizing the logistic loss as a substitute for the 0-1 loss.

Published as a conference paper at ICLR 2025

Leveraging Implicit Function Theorem Computing the gradient θa/ Dtr θu requires differentiation through the attacker s optimization problem. The main challenge is the absence of an explicit function that maps Dtr θu to θa. However, under certain regularity assumptions, one can derive an implicit mapping between Dtr θu and θa based on the optimality conditions of the auditor s optimization problem (Gould et al., 2016). A concrete example is when the optimization problem is convex, such as learning a support vector machine (SVM). In this case, the KKT conditions are necessary and sufficient conditions for the optimality, and it connects θa with Dtr θu through a system of linear equations, i.e., f( Dtr θu, θa) = 0, (5)

where f encapsulates the stationarity conditions, the primal and dual feasibility conditions, and the complementary slackness conditions (Boyd & Vandenberghe, 2004). For illustration purposes, a concrete example of the KKT conditions f for linear SVM is provided in Appendix A.10. We apply IFT to the system of linear equations, resulting in

θa Dtr θu =

f( Dtr θu, θa) θa

! 1 f( Dtr θu, θa)

For further insights into differentiating through an optimization problem using the implicit function theorem, we recommend referring to the lectures by Gould (2023).

Leveraging Differentiable Optimization In practice, we capitalize on tools from Differentiable Optimization (DO) to compute the gradients. Intuitively, we can consider DO as software that implements IFT, as shown in Equation 6, for a given optimization problem. What we need to do is describing the auditor s optimization problem using a specialized modeling language, e.g., cvxpy (Diamond & Boyd, 2016). We then use DO to transform this description into a differentiable layer. Subsequently, this differentiable layer is positioned atop the model undergoing unlearning, thereby establishing a computational pathway from θu to θa. The pseudocode for this process is provided in Algorithm 1. This algorithm has a time complexity of O(n3), where big-O notations inherently describes an upper bound and n denotes the size of the attacker s optimization problem (i.e., the number of variables and/or constraints). This cubic dependence stems from the matrix inversion in Equation 6. When Equation 1 admits specific structures, such as when the auditor uses a linear SVM, we leverage qpth to exploit the linear structures of the KKT conditions and enhance efficiency. The corresponding results, denoted as SG (Acc.), are presented in Table 1, with further discussion in Appendix A.11.

Algorithm 1 SG-Unlearn

1: Input: Dr, Df, Dte and the original model θo 2: Initialize: i = 0, θ0 u = θo, a scheduler ηi

3: while i < epoch do 4: Compute L(Dr; θi u) on the retain set in a forward pass

5: Update θi u θi u ηi L(Dr;θi u) θiu 6: Construct the auditing set Dθi u from Df and Dte 7: Describe the auditor s optimization problem Equation 1 with cvxpy 8: Convert the description to a differentiable layer Auditor Layer 9: Plug Auditor Layer into the computational graph 10: Get the auditor s best response θi a Auditor Layer( Dθi u )

11: Compute M Dval

θi u ; θi a

12: Update θi+1 u θi u ηi M Dval

θiu 13: i i + 1 14: end while 15: Return: θi u

This includes several state-of-the-art MIAs (Bertran et al., 2023; Song & Mittal, 2021).

Published as a conference paper at ICLR 2025

6 EXPERIMENTS

6.1 EXPERIMENT SETUP

We conduct experiments on both computer vision (CV) and natural language processing (NLP) datasets. For the CV tasks, we use the widely recognized image classification datasets CIFAR-10, CIFAR-100, and SVHN (Krizhevsky et al., 2009; Netzer et al., 2011). The backbone model we use is Res Net-18 (He et al., 2016). For NLP tasks, we assess performance on the 20 Newsgroups dataset, leveraging the BERT (Devlin, 2018) model.

We explore two learning scenarios: random forgetting and class-wise forgetting. In random forgetting, instances are sampled uniformly at random from all classes. In contrast, class-wise forgetting involves selecting all instances from a specific class. For CIFAR-10 and CIFAR-100, the forget set consists of 10% of the entire training set, while the ratio is reduced to 5% for SVHN. In all experiments, the attacker s optimization problem is formulated as a binary classification task, where a linear support vector machine (SVM) is used to distinguish between forget and test instances.

The baseline methods we use for comparison with SG include Retrain, Fine-Tune (FT) (Warnecke et al., 2021; Golatkar et al., 2020a), Gradient Ascent (GA) (Graves et al., 2021a; Thudi et al., 2022a), Influence Unlearning (IU) (Izzo et al., 2021a; Koh & Liang, 2017b), ℓ1-sparse (Jia et al., 2023), Random Label (RL) (Hayase et al., 2020), Boundary Expansion (BE), Boundary Shrink (BS) (Chen et al., 2023), SCRUB (Kurmanji et al., 2024) and DAU (Sharma et al., 2024). Further details on the baseline methods are provided in Section A.7 of the Appendix. For all methods, we use the SGD optimizer with a weight decay of 5e-4 and a momentum of 0.9. Other hyper-parameters are selected through the validation set. Specifically, we create a new auditing set. For each unlearning method, we select the hyper-parameters that maximize the difference between the validation accuracy and the MIA accuracy on this new auditing set. This approach ensures that the model both generalizes well to unseen data (high validation accuracy) and is less vulnerable to the attacks (low MIA accuracy). The hyperparameters are listed in Table 9 in the Appendix. In addition, we propose an accelerated version of SG (Acc.) and the discussion is given in Appendix A.11.

6.2 EVALUATION METRICS

We evaluate SG and the baseline methods using metrics commonly adopted in prior studies (Bourtoule et al., 2021; Jagielski et al., 2023; Jia et al., 2023; Chundawat et al., 2023). It is important to note that the test accuracy is evaluated on a subset of the test data that is separate from the one used for solving SG. Retain accuracy (Accr) and test accuracy (Accte) are used to quantify model utility (Jia et al., 2023). MIA accuracy, AUC and F1 score are the metrics to quantify the effectiveness of unlearning, all of which are estimated on the auditing set with 10-fold cross Carlini et al. (2022). An effective unlearning algorithm should result in MIA metrics that approach random guessing (0.5). Forget accuracy (Accf) and the absolute difference between the forget and test accuracy (|Accf Accte|):

An effective unlearning algorithm should result in Accf being close to Accte. This indicates that the unlearned model no longer retains specific information about the forget data, as its performance on the forget set should be similar to its performance on unseen test data (Dte), reflecting the removal of the influence of Df.

To gather additional statistical evidence regarding the effectiveness of unlearning, we collect the crossentropy losses of the forget and test instances from the unlearned model into the empirical distributions Lf and Lte, respectively. Next, we run a Kolmogorov-Smirnov statistics (KS Stat.) test to determine if the distributions can be differentiated from each other. The KS statistic quantifies the differences between Lf and Lte, where the p-value indicates whether the difference is significant (Massey Jr, 1951). In addition to the KS statistics, we provide the Wasserstein distance (W. Dist.) between the empirical distributions of Lf and Lte. This complements the KS statistics and evaluates the unlearning performance in terms of the similarity between the losses.

Retraining the unlearning model on Dtr \ Df from scratch.

Published as a conference paper at ICLR 2025

6.3 RESULTS

The experimental results for random forgetting and class-wise forgetting are presented in Section 6.3.1 and 6.3.2, respectively. We consider retrain as the gold standard for evaluating unlearning algorithms: the closer to the metrics of retrain the more effective the algorithm. We highlight the closest metrics to retrain in bold.

6.3.1 RANDOM FORGETTING

We present the results for CIFAR-10, CIFAR-100 and 20 News Group in Table 1. The results for SVHN datasets are provided in Appendix A.3. SG achieves the best performance for most of the metrics, demonstrating its effectiveness in unlearning. Specifically, the KS statistic of SG is consistently lower than those of the other baselines, exhibiting an order of magnitude difference in the statistics for CIFAR-10 compared to most baselines. Intuitively, ML models behave differently on training data compared to unseen data, and this difference is usually reflected in the corresponding losses (Carlini et al., 2022). The small KS statistic of SG implies that the forget and test instances exhibit greater similarity in terms of the model s behavior, although there is still a discernible difference between the losses. Another metric for measuring the similarity is the Wasserstein distance (W. Dist.). The baseline RL achieves the lowest distance, although the difference with SG is not statistically significant. A visualization of the cross-entropy losses for the forget and test instances from one of the experiments is provided in Figure 4. The experiment results of Tiny Image Net (Le & Yang, 2015) and Celeb A (Liu et al., 2018) are given in Table 7 and 8. We also consider the vision transformer model. The results of Vi T model (Dosovitskiy, 2020) on CIFAR-10 are given in Table 5.

Table 1: Experimental results (Meanstd) on CIFAR-10, CIFAR-100 and 20 News Group for random forgetting. The highlighted metrics are the closest to those of retraining, which are considered as the best performance compared with the other baselines. SG denotes the original method. SG (Acc.) uses qpth. SG + Li RA employs Li RA as the attacker. SG (Acc.) + RL sets L in Equation 4 as the objective of RL, with qpth as the attacker.

CIFAR-10 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.99960.0001 0.92910.0022 0.92300.0043 0.0061 0.50690.0073 0.50830.0099 0.50940.0187 0.02550.0080 0.03070.0115 14.92

FT 0.98860.0055 0.91140.0050 0.98510.0056 0.0737 0.54050.0031 0.54570.0070 0.62930.0127 0.09330.0050 0.31580.0142 0.45 GA 0.99960.0001 0.93040.0007 0.99950.0003 0.0691 0.55040.0066 0.56110.0071 0.66250.0106 0.14030.0037 0.27820.0026 0.18 IU 0.97230.0255 0.89660.0242 0.97220.0243 0.0756 0.53980.0055 0.55480.0076 0.61930.0204 0.10500.0192 0.40830.0554 0.02 ℓ1-sparse 0.99700.0007 0.92340.0014 0.99380.0016 0.0704 0.55010.0049 0.56940.0087 0.64050.6259 0.10180.0034 0.27040.0074 0.96 RL 0.99880.0001 0.92170.0008 0.98100.0025 0.0593 0.52170.0087 0.52970.0133 0.59350.0140 0.09860.0136 0.15200.0058 0.84 BE 0.99960.0001 0.93040.0007 0.99960.0003 0.0692 0.55410.0049 0.56390.0058 0.66290.0082 0.14120.0030 0.27830.0020 0.27 BS 0.99950.0001 0.93070.0008 0.99950.0004 0.0688 0.55880.0072 0.57790.0097 0.65900.0156 0.14660.0032 0.30720.0026 0.46 SCRUB 0.99710.0018 0.92510.0018 0.99590.0022 0.0708 0.55330.0059 0.56790.0073 0.63370.0149 0.10380.0071 0.24850.0154 1.30 DAU 0.95830.0033 0.90090.0026 0.93110.0042 0.0302 0.51840.0018 0.51460.0090 0.64390.0008 0.03930.0057 0.12160.0091 8.34

SG 0.99480.0029 0.89400.0048 0.93510.0070 0.0411 0.52020.0054 0.51340.0084 0.64800.0043 0.04820.0082 0.15550.0194 1.47 SG (Acc.) 0.99620.0003 0.88700.0011 0.90900.0001 0.0293 0.51100.0003 0.52000.0001 0.63580.0002 0.02740.0005 0.10870.0016 0.88 SG + Li RA 0.99480.0038 0.88650.0059 0.91580.0093 0.0293 0.51510.0039 0.51000.0070 0.63900.0026 0.03630.0059 0.11260.0014 8.33 SG (Acc.) + RL 0.99680.0048 0.92370.0051 0.94680.0108 0.0231 0.52080.0145 0.52300.0206 0.52360.0198 0.09580.0224 0.10820.0191 1.79

CIFAR-100 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.99960.0001 0.70350.0025 0.69250.0039 0.0110 0.51840.0057 0.52810.0053 0.51040.0081 0.02030.0045 0.05670.0200 13.08

FT 0.99910.0001 0.71170.0021 0.99840.0006 0.2867 0.66300.0075 0.73000.0102 0.68780.0107 0.45660.0083 1.25830.0166 0.39 GA 0.99960.0001 0.71580.0008 0.99960.0002 0.2838 0.69770.0060 0.76010.0065 0.72070.0088 0.49150.0030 1.22190.0038 0.20 IU 0.99710.0029 0.70260.0080 0.99590.0034 0.2933 0.66600.0089 0.73050.0134 0.69500.0124 0.45830.0168 1.26120.0366 0.21 ℓ1-sparse 0.99580.0013 0.70950.0025 0.98900.0028 0.2785 0.67380.0081 0.73920.0088 0.69520.0073 0.37170.0113 1.11570.0079 0.84 RL 0.99650.0054 0.66650.0031 0.84830.0447 0.1818 0.58080.0426 0.61520.0527 0.60800.0466 0.23230.0731 0.80670.1705 0.73 BE 0.99950.0001 0.71730.0014 0.99960.0002 0.2823 0.69770.0037 0.76610.0066 0.72480.0065 0.49400.0031 1.21750.0119 0.23 BS 0.99950.0001 0.71600.0013 0.99960.0002 0.2836 0.69870.0052 0.76510.0072 0.72390.0054 0.49630.0033 1.23820.0206 0.39 SCRUB 0.99930.0001 0.70970.0019 0.99910.0004 0.2894 0.70150.0057 0.77470.0060 0.72990.0056 0.47170.0052 1.22800.0128 1.14 DAU 0.93460.0006 0.66870.0045 0.80210.0073 0.1334 0.57600.0044 0.58160.0045 0.65260.0042 0.13800.0134 0.72680.0858 13.35

SG 0.89930.0105 0.63780.0066 0.72390.0093 0.0861 0.54120.0070 0.53200.0076 0.60610.0056 0.09880.0061 0.53160.0295 3.07 SG (Acc.) 0.96460.0019 0.60280.0003 0.70320.0027 0.1004 0.55190.0008 0.55710.0010 0.61920.0004 0.11200.0029 0.61760.0007 1.55 SG + Li RA 0.95740.0038 0.60080.0059 0.69420.0093 0.0934 0.54110.0039 0.53030.0070 0.60570.0026 0.02740.0059 0.10870.0014 10.68 SG (Acc.) + RL 0.99660.0014 0.66790.0033 0.80950.0113 0.1416 0.56090.0397 0.60320.0412 0.60130.0023 0.20410.0935 0.63910.1845 2.39

20 News Group Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 1.0000 0.8528 0.9224 0.0696 0.5285 0.5512 0.5501 0.1405 0.5925 40.8

FT 0.9999 0.8518 0.8035 0.0482 0.5672 0.6059 0.6220 0.2495 1.1894 20.2 GA 0.0483 0.0483 0.0500 0.0017 0.4995 0.4973 0.2704 0.0334 0.0990 26.1 IU 1.0000 0.8575 0.9990 0.1415 0.5676 0.6054 0.6348 0.2986 0.9614 27.9 RL 0.9985 0.8298 0.6709 0.1589 0.7123 0.7651 0.7148 0.5334 1.1402 21.2

SG 1.0000 1.0000 1.0000 0.0000 0.5065 0.4922 0.5627 0.0791 0.0007 15.6

Another observation from the table is the inherent trade-off between model performance, measured by test accuracy, and the effectiveness of unlearning, measured by MIA accuracy. This trade-off has been documented in prior studies as a common challenge in unlearning tasks (Golatkar et al., 2020a; Bourtoule et al., 2021). Specifically, SG is more effective at unlearning the forget instances,

Published as a conference paper at ICLR 2025

as indicated by the highlighted MIA metrics. However, this effectiveness comes at a cost to the test accuracy on CIFAR-10 and CIFAR-100, a phenomenon observed in other unlearning techniques as well (Jia et al., 2023; Graves et al., 2021a). Despite this trade-off, the degradation in test accuracy remains minimal.

6.3.2 CLASS-WISE FORGETTING

We use CIFAR-10 as the benchmark for class-wise forgetting. For results on other datasets and discussion, please refer to Appendix A.9.

In random forgetting, instances are uniformly sampled across all classes, preserving the overall dataset distribution. In contrast, class-wise forgetting removes all instances from a specific class, resulting in a significant distribution shift that makes forget data more detectable. The experimental results, presented in Table 2, highlight the metrics closest to retraining. However, all methods perform poorly on MIA-related metrics, as the auditor can easily distinguish between forget and test instances due to the distinct distributional shifts caused by class-wise forgetting. Additionally, no single method consistently outperforms the others.

Table 2: Experimental results (Meanstd) on CIFAR-10 for class-wise forgetting. The highlighted metrics are the closest to those of retrain, which is considered as the best performance compared with the other baselines.

CIFAR-10 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.99960.0001 0.93330.0009 0.00000.0000 0.9333 0.99350.0006 0.99830.0004 0.99360.0007 0.98030.0002 9.56010.0911 13.96

FT 0.99580.0022 0.92260.0030 0.60430.0450 0.3183 0.99150.0011 0.99850.0002 0.99150.0011 0.79750.0133 0.93240.1847 1.16 GA 0.84780.0046 0.79420.0055 0.00070.0002 0.7935 0.99440.0011 0.99960.0002 0.99380.0015 0.92690.0087 15.29410.1656 0.84 IU 0.93390.0161 0.86440.0141 0.06190.0149 0.8025 0.99720.0009 0.99960.0001 0.99720.0007 0.81510.0195 8.25740.6484 0.31 ℓ1-sparse 0.99720.0005 0.92850.0014 0.09140.0310 0.8371 0.99100.0014 0.99890.0001 0.99100.0014 0.92080.0078 2.55520.1738 1.84 RL 0.99960.0000 0.93300.0008 0.00010.0001 0.9329 0.99160.0013 0.99900.0005 0.99160.0013 0.96950.0025 6.39890.0789 1.97 BE 0.97100.0012 0.89840.0023 0.24770.0022 0.6507 0.99640.0005 0.99900.0004 0.99640.0005 0.73060.0047 4.89840.0432 0.32 BS 0.96910.0031 0.89690.0031 0.25040.0105 0.6465 0.99650.0001 0.99880.0005 0.99650.0002 0.71960.0072 5.01550.0922 0.66 SCRUB 1.00000.0000 0.92690.0021 0.00000.0000 0.9269 1.00000.0000 1.00000.0000 1.00000.0000 0.99990.0001 70.99342.9441 3.47

SG 0.96670.0054 0.90560.0055 0.00000.0000 0.9056 0.98140.0026 0.99020.0025 0.98180.0025 0.96960.0032 5.27540.1882 0.84

6.3.3 THE EFFECT OF THE ATTACKER MODEL

Finally, we conduct a comparative study to understand the impact of adversarial modeling on the unlearning process, controlled by the parameter α as defined in Equation 2. We show the results for random forgetting and defer the results for class-wise forgetting to the appendix. In Figure 1, we compare two cases where α is set to either 1 or 0, denoted by SG-1 and SG-0 respectively. The comparison is done across four metrics: 1) the test accuracy; 2) the MIA accuracy; 3) the defender s utility, evaluated as the test accuracy minus the MIA accuracy, which provides a combined scalar value that measures both the performance of the unlearned model and the effectiveness of unlearning; 4) the Wasserstein distance between the empirical distributions of Lf and Lte. We show the averages over 10 experiments with different seeds, and 95% confidence intervals are displayed. The first observation is that the adversarial term (i.e., α M( Dval θu ; θa)) acts as a regularizer, improving the generalizability of the unlearned model. This observation is supported by comparing the test accuracy of SG-1 and SG-0 on CIFAR-10 (top middle). Similar findings have been reported in Nasr et al. (2018). Another observation is that adversarial modeling limits the attacker s ability to differentiate between forget instances and test instances; this is demonstrated by the MIA accuracy on CIFAR-100. The right-most column displays the Wasserstein distances between Lf and Lte. It is evident that the two losses are closer as a result of adversarial modeling, especially for CIFAR-100 dataset. Additionally, the distances progressively decrease throughout the epochs, confirming the effectiveness of the gradient-based method.

In addition to the existence of attacker, we also investigate the strength of the attacker by changing α. We select the α in large range of {0.05, 0.1, 0.25, 0.5, 1, 2, 5}. In Figure 2 in Appendix, we compare the performance regarding the test accuracy Accte. The cross of the red dash line is the performance of the retrain model. We can find that SG is robust to the attacker strength.

6.4 MIA SELECTION

To validate the generalization ability of SG, we select the MIA used in (Jia et al., 2023) as the attacker and we evaluate SG according to (Jia et al., 2023). The results given in Table 3 show that SG is not

Published as a conference paper at ICLR 2025

0 10 20 30 0.1

Defender Utility

0 10 20 30 0.88

Test Accuracy

0 10 20 30 0.88

MIA Accuracy

Wasserstein Distance

0 10 20 30 0.000

0 10 20 30 0.60

0 10 20 30 0.50

Figure 1: An ablation study to understand the impact of adversarial modeling on the process of unlearning; α = 1 and α = 0 corresponds to the cases with and without adversarial modeling, respectively. The results are the averages over 10 experiments with different seeds, and 95% confidence intervals are displayed. From the left to the right: 1) the defender s utility, evaluated as the test accuracy Accte minus the MIA accuracy; 2) test accuracy; 3) MIA accuracy; 4) Wasserstein distance between the cross-entropy losses of the forget and test instances. Top row: CIFAR-10; Bottom row: CIFAR-100. Epoch 0: Original model.

sensitive to the MIA. Furthermore, we also consider the state-of-art MIA which is Li RA (Carlini et al., 2022) as the attacker. The results are given in Table 1.

Table 3: SG is evaluated using the MIA attacker described in (Jia et al., 2023), comparing its performance against baseline models on the CIFAR-10 dataset under the random forgetting paradigm. For the metrics UA, MIA, RA, and TA, the value closest to the Retrain baseline is highlighted in bold.

CIFAR-10 UA (1 Accf) MIA RA (Accr) TA (Accr) Avg. Gap ( ) RTE (min, )

Retrain 0.08070.0047 0.17410.0069 1.00000.0001 0.91610.0024 - 24.66

FT 0.01100.0019 0.04060.0041 0.99830.0003 0.93700.0010 0.0555 1.58 GA 0.00560.0001 0.01190.0005 0.99480.0002 0.94550.0005 0.0680 0.31 IU 0.17510.0219 0.21390.0170 0.83280.0244 0.78130.0285 0.1091 1.18 ℓ1-sparse 0.01210.0038 0.04330.0052 0.97390.0031 0.95490.0018 0.0661 1.82 RL 0.02800.0037 0.18590.0348 0.99970.0001 0.94080.0012 0.0224 1.98 BE 0.00000.0000 0.00260.0002 1.00000.0000 0.95350.0018 0.0724 3.17 BS 0.00480.0007 0.01160.0004 0.99470.0001 0.94580.0003 0.0684 1.41 SCRUB 0.00700.0059 0.03880.0125 0.99590.0034 0.94220.0026 0.0598 4.05

SG 0.07480.0041 0.18350.0117 0.99900.0131 0.90720.0189 0.0063 3.48

7 DISCUSSION

In this paper, we design an adversarial framework for addressing the problem of unlearning a set data from a machine learning model. Our approach focuses on evaluating the effectiveness of unlearning from an adversarial perspective, leveraging membership inference attacks (MIAs) to detect any residual traces of the data within the model. The framework allows for a proactive design of the unlearning algorithm, synthesizing two lines of research machine unlearning and MIAs that have heretofore progressed in parallel. By using implicit differentiation techniques, we develop a gradient-based algorithm for solving the game, making the framework easily integrable into existing end-to-end learning pipelines. We present empirical results to support the efficacy of the framework and the algorithm. We believe our work can make a progress in trustworthy ML.

Published as a conference paper at ICLR 2025

ETHIC STATEMENT

This work does not involve potential malicious or unintended uses, fairness considerations, privacy considerations, security considerations, crowd sourcing, or research with human subjects.

REPRODUCIBILITY STATEMENT

We provide details to reproduce our results in Appendix A.7 and A.8. We also provide pseudo-code in Algorithm 1 and will release the code upon acceptance.

ACKNOWLEDGMENT

Z. Di and Y. Liu are partially supported by the NSF under grants IIS-2007951, IIS-2143895, and IIS-2040800. Y. Vorobeychik is supported by the NSF through grants IIS-2214141, IIS-1905558, and CNS-2310470, as well as by the ARO under grant W911NF-25-1-0059 and the ONR under grant N00014-24-1-2663.

A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019a.

Akshay Agrawal, Brandon Amos, Shane T. Barratt, Stephen P. Boyd, Steven Diamond, and J. Zico Kolter. Differentiable convex optimization layers. In Proceedings of the 32nd Conference on Neural Information Processing Systems (Neur IPS), pp. 9558 9570, 2019b.

Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 136 145. PMLR, 2017.

Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. Machine Learning, 111(9):3203 3226, 2022.

Theo Bertram, Elie Bursztein, Stephanie Caro, Hubert Chao, Rutledge Chin Feman, Peter Fleischer, Albin Gustafsson, Jess Hemerly, Chris Hibbert, Luca Invernizzi, et al. Five years of the right to be forgotten. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pp. 959 972, 2019.

Martin Bertran, Shuai Tang, Michael Kearns, Jamie Morgenstern, Aaron Roth, and Zhiwei Steven Wu. Scalable membership inference attacks via quantile regression. Co RR, abs/2307.03694, 2023. URL https://doi.org/10.48550/ar Xiv.2307.03694.

Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In Proceedings of IEEE Symposium on Security and Privacy (S&P), pp. 141 159. IEEE, 2021.

Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In Proceedings of the 39th IEEE Symposium on Security and Privacy (S&P), pp. 1897 1914. IEEE, 2022.

Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When machine unlearning jeopardizes privacy. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi (eds.), Proceedings of the ACM Conference on Computer and Communications (CCS), pp. 896 911. ACM, 2021.

Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. Graph unlearning. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pp. 499 513, 2022.

Published as a conference paper at ICLR 2025

Min Chen, Weizhuo Gao, Gaoyang Liu, Kai Peng, and Chen Wang. Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7766 7775, 2023.

Vikram S. Chundawat, Ayush K. Tarun, Murari Mandal, and Mohan S. Kankanhalli. Can bad teaching induce forgetting? unlearning in deep networks using an incompetent teacher. In Brian Williams, Yiling Chen, and Jennifer Neville (eds.), Proceedings of the Thirty-Seventh Conference on Artificial Intelligence (AAAI), pp. 7210 7217, 2023.

Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153:235 256, 2007.

Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Joan Feigenbaum, John C.-I. Chuang, and David M. Pennock (eds.), Proceedings 7th ACM Conference on Electronic Commerce (EC), pp. 82 90. ACM, 2006.

Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018.

Steven Diamond and Stephen Boyd. Cvxpy: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909 2913, 2016.

Asen L Dontchev, R Tyrrell Rockafellar, and R Tyrrell Rockafellar. Implicit functions and solution mappings: A view from variational analysis, volume 616. Springer, 2009.

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304 9312, 2020a.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9301 9309, 2020b.

Stephen Gould. Lecture notes on differentiable optimisation in deep learning. Co RR, 2023. URL https://users.cecs.anu.edu.au/~sgould/papers/ isaac22-lecture-notes.pdf.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. Co RR, abs/1607.05447, 2016.

Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11516 11524, 2021a.

Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI), pp. 11516 11524. AAAI Press, 2021b.

Chuan Guo, Tom Goldstein, Awni Y. Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119, pp. 3832 3842. PMLR, 2020.

Tomohiro Hayase, Suguru Yasutomi, and Takashi Katoh. Selective forgetting of deep networks at a finer level than samples. ar Xiv preprint ar Xiv:2012.11849, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770 778, 2016.

Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008 2016. PMLR, 2021a.

Published as a conference paper at ICLR 2025

Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In Arindam Banerjee and Kenji Fukumizu (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130, pp. 2008 2016, 2021b.

Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Guha Thakurta, Nicolas Papernot, and Chiyuan Zhang. Measuring forgetting of memorized training examples. In Proceedings the 11th International Conference on Learning Representations (ICLR), 2023.

Jinghan Jia, Jiancheng Liu, Parikshit Ram, Yuguang Yao, Gaowen Liu, Yang Liu, Pranay Sharma, and Sijia Liu. Model sparsification can simplify machine unlearning. Co RR, abs/2304.04934, 2023. URL https://doi.org/10.48550/ar Xiv.2304.04934.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pp. 1885 1894, 2017a.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885 1894. PMLR, 2017b.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Meghdad Kurmanji, Peter Triantafillou, Jamie Hayes, and Eleni Triantafillou. Towards unbounded machine unlearning. Advances in Neural Information Processing Systems, 36, 2024.

Yann Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15(2018):11, 2018.

Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 46(253):68 78, 1951.

Sasi Kumar Murakonda and Reza Shokri. Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning. Co RR, abs/2007.09339, 2020.

Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine learning with membership privacy using adversarial regularization. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pp. 634 646, 2018.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.

Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. Co RR, abs/2209.02299, 2022.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

Rohan Sharma, Shijie Zhou, Kaiyi Ji, and Changyou Chen. Discriminative adversarial unlearning. ar Xiv preprint ar Xiv:2402.06864, 2024.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 3 18. IEEE, 2017.

Liwei Song and Prateek Mittal. Systematic evaluation of privacy risks of machine learning models. In Proceedings of the 30th USENIX Security Symposium (USENIX), pp. 2615 2632, 2021.

Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling sgd: Understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (Euro S&P), pp. 303 319. IEEE, 2022a.

Published as a conference paper at ICLR 2025

Anvith Thudi, Hengrui Jia, Ilia Shumailov, and Nicolas Papernot. On the necessity of auditable algorithmic definitions for machine unlearning. In Kevin R. B. Butler and Kurt Thomas (eds.), Proceedings of the Thirty-first USENIX Security Symposium (USENIX), pp. 4007 4022, 2022b.

Liang Tong, Sixie Yu, Scott Alfeld, and Yevgeniy Vorobeychik. Adversarial regression with multiple learners. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pp. 4953 4961, 2018.

Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. ar Xiv preprint ar Xiv:2108.11577, 2021.

Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. In Proceedings of the 30th Annual Network and Distributed System Security Symposium (NDSS), 2023.

Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. Enhanced membership inference attacks against machine learning models. In Proceedings of the Conference on Computer and Communications Security (CCS), pp. 3093 3106. ACM, 2022.

Published as a conference paper at ICLR 2025

The Appendix is organized as follows:

Section A.1: the description of all the notation we use in this paper.

Section A.2: the experiment results of Vi T model and on CIFAR-10 and CIFAR-100.

Section A.3: the experiment results for random forgetting on SVHN dataset.

Section A.4: the experiment results for random forgetting on Tiny Image Net dataset.

Section A.5: the experiment results of the practical scenario on Celeb A dataset.

Section A.6: the cross-entropy loss distribution of the testing and forgetting instances.

Section A.7: the baseline method we use in this paper and their settings.

Section A.8: the experiment details and hyper-parameters.

Section A.9: the experiment results of class-wise forgetting

Section A.10: the detailed explanation of our problem formulation and how we solve the problem.

Section A.11: the implementation details about the accelerated version of the proposed method and the performance comparison.

Section A.12: the experiment results of sequence unlearning

A.1 NOTATION TABLE

Table 4: A summary of the notations used in the paper Notation Meaning

D = {(xi, yi)} A dataset (xi, yi) One data point where xi is the feature while yi is the label X, Y The feature space and the label space Df, Dte, Dval, Dtr, Dr The forgetting, testing, validation, training, and retain set (xf j , yf j ) One data point from the forget set Df A A machine learning algorithm U A machine unlearning algorithm θo The original model, i.e., A(Dtr) θu The unlearned model, i.e., U(θo) θr The retrained model, i.e., A(Dr) Dθu The auditing dataset for membership inference attack sf j The output of a forget instance in the auditing dataset Dθu from θu ste j The output of a testing instance in the auditing dataset Dθu from θu Dtr θu, Dval θu The training and validation split of Dθu. Cu The unlearner s cost function Bθu The auditor s best response given an unlearning model θu Ua The utility function of the auditor Ha The hypothesis class of the auditor Hu The hypothesis class of the unlearner α The trade-off factor as defined in the unlearner s cost function Equation 2

A.2 CIFAR-10 AND CIFAR-100 DATASET

A.2.1 VIT RESULTS

In addition to the models based on convolution layer, we also test our method on the transformer-based models. The model we use is Vi T and the results are given in Table 5.

Published as a conference paper at ICLR 2025

Table 5: Experimental results (Meanstd) on CIFAR-10 for random forgetting using Vi T. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

CIFAR-10 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.83840.0020 0.74270.0017 0.74830.0033 0.0056 0.50000.0007 0.50000.0059 0.57940.0163 0.01900.0105 0.03040.0073 206.50s

FT 0.86300.0017 0.75990.0033 0.82210.0083 0.0622 0.52860.0001 0.53520.0005 0.61880.0019 0.06750.0078 0.24130.0142 18.85 GA 0.84730.0017 0.75940.0105 0.84610.0018 0.0867 0.54180.0020 0.55120.0050 0.63020.0002 0.08980.0074 0.30550.0313 3.01 ℓ1-sparse 0.84720.0015 0.75910.0103 0.84570.0007 0.0866 0.54230.0016 0.55120.0050 0.63050.0002 0.08900.0076 0.30480.0313 6.32 RL 0.84150.0027 0.75920.0085 0.81240.0034 0.0532 0.51570.0008 0.51140.0056 0.58620.0017 0.05980.0099 0.14730.0158 8.09

SG 0.85150.0045 0.74760.0203 0.83220.0091 0.0846 0.50190.0074 0.51000.0013 0.58140.0010 0.03740.0004 0.10410.0034 11.42

A.2.2 THE EFFECT OF ATTACK STRENGTH

0.90 0.91 0.92 0.93 0.46

MIA Accuracy

0.90 0.91 0.92 0.93

Forget Accuracy

0.90 0.91 0.92 0.93 0.025

KS Statistic

0.05 0.1 0.25 0.5 1 2 5

0.64 0.66 0.68 0.70 0.52

0.64 0.66 0.68 0.70

0.64 0.66 0.68 0.70

Test Accuracy

Figure 2: Experiments with different values of the trade-off parameter α. We consider 7 values {0.05, 0.1, 0.25, 0.5, 1, 2, 5}. Each dot represents a batch of 5 random experiments with the same α. The coordinates of a dot are the corresponding metrics averaged over the 5 runs. Top row: CIFAR-10; Bottom row: CIFAR-100.

A.3 SVHN DATASET

The results of SVHN dataset on random forgetting are given in Table 6. The comparison of SG on SVHN for random forgetting with and without attacker is illustrated in Figure 3.

Table 6: Experimental results (Meanstd) on SVHN for random forgetting. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

SVHN Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.99590.0002 0.96100.0010 0.95340.0024 0.0076 0.52480.0058 0.54220.0075 0.51490.0157 0.03060.0117 0.06860.0145 20.46

FT 0.99910.0001 0.71170.0021 0.98760.0070 0.2867 0.53720.0123 0.55920.0121 0.55230.0170 0.06130.0342 0.17430.0091 1.55 GA 0.99540.0001 0.96410.0002 0.99490.0006 0.0308 0.51910.0072 0.54110.0051 0.55000.0178 0.08670.0065 0.14730.0026 0.97 IU 0.90760.0707 0.88170.0658 0.90500.0713 0.0233 0.53730.0116 0.55800.0097 0.54690.0187 0.04730.0207 0.14070.0882 0.41 ℓ1-sparse 0.93780.0615 0.91910.0540 0.92980.0620 0.0107 0.54570.0220 0.56650.0229 0.53470.0324 0.03960.0112 0.11580.0954 1.86 RL 0.99490.0002 0.96090.0006 0.97970.0018 0.0188 0.52110.0106 0.54110.0147 0.51440.0225 0.10790.0175 0.06420.0060 2.65 BE 0.99550.0001 0.96330.0002 0.99550.0006 0.0322 0.52090.0090 0.54410.0064 0.55530.0175 0.10160.0062 0.15280.0019 0.46 BS 0.99560.0002 0.96410.0001 0.99520.0008 0.0311 0.53220.0060 0.55090.0034 0.55940.0176 0.09940.0074 0.14040.0033 0.81 SCRUB 0.98320.0010 0.95590.0014 0.98090.0020 0.0250 0.52730.0031 0.54310.0103 0.52960.0196 0.04920.0139 0.10320.0141 1.85

SG 0.96860.0017 0.95760.0033 0.95600.0027 0.0016 0.50120.0052 0.50890.0272 0.32920.1798 0.05940.0233 0.01850.0041 3.16

A.4 TINYIMAGENET DATASET

The results of Tiny Image Net dataset on random forgetting are given in Table 7.

Published as a conference paper at ICLR 2025

Table 7: Experimental results (Meanstd) on Tiny Image Net for random forgetting. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

Tiny Image Net Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.83770.0009 0.59670.0045 0.50570.0014 0.0910 0.54710.0028 0.56770.0029 0.48030.0037 0.11010.0021 0.46080.0124 237.12

FT 0.82420.0009 0.60950.0023 0.70330.0015 0.0938 0.54020.0019 0.53360.0013 0.60560.0024 0.09570.0030 0.50170.0071 65.07 GA 0.81320.0138 0.59660.0061 0.80560.0170 0.2090 0.59660.0056 0.60320.0057 0.66190.0059 0.19680.0103 0.91950.0371 12.49 IU 0.83590.0010 0.60610.0001 0.83400.0033 0.2269 0.60510.0029 0.61500.0026 0.67080.0032 0.21700.0007 0.96840.0082 6.73 ℓ1-sparse 0.78200.0015 0.61440.0012 0.63750.0045 0.0231 0.50390.0034 0.49060.0026 0.56740.0040 0.05000.0025 0.22170.0082 103.85 RL 0.77470.0006 0.60180.0020 0.59160.0030 0.0102 0.52800.0025 0.57020.0018 0.47530.0047 0.16610.0013 0.33040.0034 60.79 BE 0.80540.0115 0.55610.0120 0.80380.0162 0.2477 0.62170.0026 0.63410.0033 0.67790.0005 0.24030.0047 1.09620.0040 31.37 BS 0.82610.0008 0.57750.0001 0.82320.0020 0.2457 0.62340.0044 0.63330.0025 0.68180.0038 0.24150.0049 1.06230.0070 8.67

SG 0.84860.0007 0.59760.0005 0.55600.0053 0.0416 0.52120.0055 0.53410.0070 0.55220.0198 0.08930.0003 0.34160.0046 13.36

Figure 3: An ablation study to understand the impact of adversarial modeling on the process of unlearning; α = 1 and α = 0 corresponds to the cases with and without adversarial modeling, respectively. The results are the averages over 10 experiments with different seeds, and 95% confidence intervals are displayed. From the left to the right: 1) the defender s utility, evaluated as the test accuracy Accte minus the MIA accuracy; 2) test accuracy; 3) MIA accuracy; 4) Wasserstein distance between the cross-entropy losses of the forget and test instances.

Defender Utility

Test Accuracy

MIA Accuracy

0 10 20 30 0.0

Wasserstein Distance

A.5 CELEBA DATASET

We select Celeb A dataset as the practical scenario. The model is to classify whether the human is smiling or not while the unlearning goal is to forget the identities. The results of Celeb A dataset are given in Table 8.

Table 8: Experimental results (Meanstd) on Celeb A. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

Celeb A Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.95840.0114 0.90870.0110 0.92840.0048 0.0076 0.51230.0085 0.51030.0087 0.63850.0082 0.02850.0155 0.06860.0613 33.70

FT 0.93610.0010 0.92570.0021 0.93200.0028 0.0063 0.50380.0003 0.50700.0037 0.61800.0037 0.01330.0008 0.01580.0052 3.43 GA 0.94440.0003 0.92760.0001 0.94800.0012 0.0204 0.52150.0004 0.52140.0019 0.63250.0006 0.02360.0018 0.04910.0038 2.43 ℓ1-sparse 0.71040.0883 0.70400.0917 0.71630.0921 0.0123 0.50380.0002 0.50740.0068 0.58500.0332 0.02510.0107 0.02770.0125 5.12 RL 0.93630.0007 0.92570.0005 0.93790.0063 0.0122 0.50360.0006 0.50810.0047 0.62380.0025 0.01880.0026 0.03430.0069 2.65 BE 0.93860.0027 0.92100.0021 0.94060.0015 0.0196 0.50810.0005 0.51710.0038 0.60550.0009 0.02890.0048 0.05460.0049 3.59 BS 0.94340.0005 0.92700.0002 0.94650.0043 0.0195 0.50960.0012 0.51390.0005 0.62870.0018 0.02700.0008 0.04830.0065 1.67

SG 0.93480.0011 0.92220.0001 0.92880.0010 0.0066 0.51590.0002 0.51030.0007 0.63580.0009 0.02740.0003 0.01080.0006 9.74

A.6 LOSS DISTRIBUTIONS

A visualization of the cross-entropy losses of the forget and test instances is shown in Figure 4.

A.7 BASELINE METHODS

Retrain: The first baseline is retraining, where the unlearned model is obtained by training on the retain set from scratch. We aim to develop unlearning algorithms so that the metrics they produce are as closely aligned with those of the retraining as possible. Fine-Tuning (FT): As the second baseline, FT continues to train the original model on the retain set for a few epochs. This a standard baseline used in various prior research (Graves et al., 2021b; Warnecke et al., 2023).

Published as a conference paper at ICLR 2025

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

0 10 20 30 40 50

Forget Loss

Figure 4: The distributions of the cross-entropy losses for the forget and test instances from the unlearned models. The y-axis is in log scale for better visualization. From the first to the last figure, they are random forgetting on CIFAR-10, CIFAR-100, SVHN and class-wise forgetting on CIFAR-10, CIFAR-100, SVHN.

Published as a conference paper at ICLR 2025

Table 9: The hyper-parameter for the baseline method and SG used in this paper.

Parameters Retrain FT GA IU ℓ1-sparse RL BE BS SCRUB DAU SG

Learning rate 1e-2 5e-2 1e-3 1e-2 1e-2 1e-5 1e-5 5e-4 [1e-2, 2e-2] 1e-2 Num. of epoch 160 30 5 10 10 10 10 10 [3, 5] 30 γ 5e-4 α 10 T 4 Decay epochs [3, 5, 9] β 0.1 Attacker α 1.0

Gradient Ascent (GA): This baseline takes the original model as the starting point and runs a few epochs of gradient ascent on the forget set Df. The intuition is to disrupt the model s generalizability on Df (Graves et al., 2021b). Another name of GA is Neg Grad (Kurmanji et al., 2024). Influence Unlearning (IU): This baseline uses Influence Function to estimate the updates required for a model s weights as a result of removing the forget set from the training data (Izzo et al., 2021b; Koh & Liang, 2017a). ℓ1-sparse: This baseline integrates an ℓ1 norm-based sparse penalty into machine unlearning loss Jia et al. (2023). Random Label (RL): This baseline trains the original model on the retain set and the forgetting set Df whose labels are random to make the model unlearn Df while keep the model capability as much as possible. Boundary Expansion: This baseline proposes a neighbor searching method to identify the nearest but incorrect class labels to guide the way of boundary shifting. Boundary Shrink: This baseline artificially assigns forgetting samples to an extra shadow class of the original model Chen et al. (2023). SCRUB: This baseline achieve MU by using a teacher model and student model Kurmanji et al. (2024). DAU: We follow the official implementation of (Sharma et al., 2024) on the CIFAR-10 and CIFAR-100 dataset for random forgetting.

A.8 EXPERIMENT DETAILS

The hyperparameters used for SG and the baselines are in Table 9. The losses for the retraining baseline across the epochs are displayed in Figure 5. We run all the experiments using Py Torch 1.12 on NVIDIA A5000 GPUs and AMD EPYC 7513 32-Core Processor.

A.9 CLASS-WISE FORGETTING

The results of SVHN dataset on classwise forgetting are given in Table 13. Regarding non-iid samples, classwise forgetting represents an extreme scenario in which the model is tasked with forgetting an entire class from the dataset. The results for classwise forgetting are reported in Tables 2 and 13. As noted in prior unlearning literature (Chen et al., 2023; Jia et al., 2023), classwise forgetting exhibits consistent trends, where the forgetting accuracy and MIA metrics reach 1. This is because forgetting an entire class from the training set is straightforward and results in a clear distinction from the remaining data. Importantly, there is no clear state-of-the-art (SOTA) method that dominates across all metrics for classwise forgetting, and baseline methods generally perform similarly. Furthermore, the performance gap between SG and the best metrics is minimal. For instance, the differences in MIA metrics between SG and the best-performing approach are less than 1% in both Tables 2 and 13.

A.10 AN EXAMPLE OF THE CONDITION IN EQUATION 5

In this section, we provide a concrete example of the KKT conditions for linear support vector machines (SVM). As described in Section 1, the KKT conditions are key to relating the attacker s model parameters, denoted as θa, with the auditing set Dθu, which allows us to derive the gradient θa/ Dθu. The conditions f can be similarly derived for any model where the learning problem is convex. To simplify the notations, we use {(xi, yi)}q i=1 to represent Dθu. A standard formulation of

Published as a conference paper at ICLR 2025

0 20 40 60 80 100 Epoch

0 50 100 150 200 Epoch

0 20 40 60 80 100 Epoch

Figure 5: The training loss for the retrain baseline. For CIFAR10 and CIFAR100, the learning rate is multiplied by 0.1 when epoch is at 60, 120, 160; for SVHN, the same multiplication is done at epoch 60, 120. Top to bottom: CIFAR10, CIFAR100, SVHN.

Table 10: Experimental results (Meanstd) on SVHN for classwise forgetting. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

SVHN Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist. RTE (min., )

Retrain 0.99630.0001 0.96390.0007 0.00000.0000 0.9639 0.99500.0005 0.99860.0004 0.99510.0005 0.99090.0003 8.69240.0750 20.46

FT 0.99780.0002 0.96220.0020 0.09950.0179 0.8627 0.99450.0005 0.99850.0006 0.99460.0007 0.95330.0038 2.32200.0216 3.17 GA 0.94440.0055 0.91440.0047 0.00000.0000 0.9144 0.99690.0004 0.99980.0000 0.99700.0002 0.98490.0012 16.48340.2720 0.98 IU 0.80440.1177 0.80610.0978 0.00000.0000 0.8061 0.99980.0003 1.00000.0000 0.99980.0003 0.99360.0056 15.06971.7117 0.41 ℓ1-sparse 0.97990.0004 0.95800.0017 0.00000.0000 0.9580 0.99210.0013 0.99660.0003 0.99210.0012 0.98180.0030 4.51390.2512 3.73 RL 0.99590.0001 0.96120.0013 0.00000.0000 0.9612 0.99120.0013 0.99710.0016 0.99130.0012 0.98130.0016 5.59780.0357 2.60 BE 0.98800.0008 0.95460.0012 0.28120.0061 0.6734 0.99760.0005 0.99950.0002 0.99760.0006 0.91060.0050 4.38160.0659 0.46 BS 0.98640.0010 0.95370.0010 0.31090.0052 0.6428 0.99750.0003 0.99950.0002 0.99760.0003 0.90720.0031 4.42900.1169 0.82 SCRUB 0.99160.0007 0.96160.0014 0.00000.0000 0.9616 0.99990.0001 1.00000.0001 0.99990.0001 0.99890.0008 24.45902.2852 3.91

SG 0.97160.0007 0.96010.0014 0.00000.0000 0.9601 0.99280.0001 0.99540.0001 0.99290.0001 0.99070.0008 5.01482.2852 5.92

the linear SVM is as follows

min θa,b 1 2 θa 2

s.t. yi (θ a xi + b) 1, i, (7)

where b is the bias term. The standard form is typically formulated as a minimization problem, so the attacker is to maximize V = 1

2 θa 2. Equation 7 is a convex program, and the optimal solution (i.e., θ a and b ) is characterized by the KKT conditions. The Lagrangian of the above is as follows

Published as a conference paper at ICLR 2025

Table 11: The attack time for each epoch between cvxpylayers and qpth on different datasets.

CIFAR-10 CIFAR-100 Tiny Image Net Celeb A

cvxplayers 31.92s 70.13s 24.39s 173.04s qpth 7.44s 7.97s 4.68s 15.09s

where αi 0 are the Lagrantian multipliers:

L(θa, b, αi) = 1

i=1 αi yi (θ a xi + b) 1 . (8)

Following sandard procedures (Boyd & Vandenberghe, 2004), the KKT conditions are as folllows

f( Dθu, θa) =

i=1 αiyixi = 0

i=1 αiyi = 0

yi (θ a xi + b) 1 αi 0, i

αi(yi(θ a xi + b) 1) = 0, i

which implicitly define a function between θa and the data Dθu = {(xi, yi)}q i=1. In practice, we describe the optimization problem Equation 7 using cvxpy (Diamond & Boyd, 2016). Then, we employ an off-the-shelf package called cvxpylayers (Agrawal et al., 2019b) to automatically derive the KKT conditions and compute the gradient θa/ Dθu.

A.11 THE ACCELERATED SG

To accelerate the SG performance, one particular setting discussed in our work involves the MIA auditor using a linear SVM as the classifier, as detailed in Appendix A.10. This specific problem can be formulated as a convex programming problem known as Quadratic Programming (QP). Due to the linear KKT condition in the decision variables, we leverage a specialized solver called qpth (Amos & Kolter, 2017) instead of using a more generic convex programming solver like cvxpylayer (Agrawal et al., 2019a). The qpth solver takes advantage of the special structure of QPs, specifically that the KKT conditions can be expressed as linear system equations, allowing for more efficient computation. This tailored approach significantly reduces the computational burden compared to solving a general convex programming problem shown in Table 1.

In addition, we compare the attack time and the accelerated time relative to retraining. The results are given in Tables 11 and 12.

Table 12: The speed comparison against retraining across various datasets between cvxpylayers and qpth. The percentage in parentheses represents the ratio relative to Retrain.

CIFAR-10 CIFAR-100 Tiny Image Net Celeb A

Retrain 14.92s 13.08s 33.70s 235.68s cvxplayers 1.47s (9.85%) 3.07s (23.47%) 15.20s (45.10%) 127.92s (54.04%) qpth 0.88 (5.90%) 1.55 (11.85%) 9.74 (28.90%) 13.36 (5.65%)

A.12 SEQUENCE UNLEARNING

In addition to one-time unlearning, we also explore sequential unlearning to enhance the robustness of SG. The unlearning scenario we adopt is random forgetting. Specifically, we select

Published as a conference paper at ICLR 2025

10% of the training data, Dtr, as the forgetting set Df and divide it into five disjoint subsets: {D(1) f , D(2) f , D(3) f , D(4) f , D(5) f }. Each subset is used sequentially for unlearning. For instance, the

first model, θ(1) u , is unlearned from the original model, θo, using D(1) f . Subsequently, the second

model, θ(2) f , is unlearned from θ(1) f using D(2) f , and so on. We report our experimental results on CIFAR-10.

Table 13: Experimental results (Meanstd) on CIFAR-10 for sequence forgetting. The highlighted metrics are the closest to those of retraining, which is considered as the best performance compared with the other baselines.

Part 1 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist.

Retrain 0.99990.0000 0.91670.0013 0.90400.0099 0.0127 0.48950.0049 0.50450.0027 0.50660.0485 0.05930.0058 0.03810.0058 FT 0.99840.0002 0.92870.0019 0.99230.0017 0.0636 0.54970.0062 0.56810.0203 0.68240.0044 0.11800.0059 0.28010.0207 GA 0.99970.0000 0.93270.0020 1.00000.0000 0.0673 0.56780.0012 0.59460.0097 0.69640.0004 0.17030.0033 0.30740.0062 IU 0.99960.0001 0.93230.0021 1.00000.0000 0.0677 0.56570.0025 0.59710.0092 0.69530.0011 0.16500.0059 0.30920.0065 ℓ1-sparse 0.88930.0013 0.87230.0059 0.86070.0054 0.0116 0.49820.0094 0.48220.0317 0.48730.1217 0.08200.0073 0.04720.0108 RL 0.99860.0001 0.92570.0002 0.92130.0069 0.0044 0.47900.0094 0.30160.0194 0.61700.0059 0.36000.0196 0.12870.0065 BS 0.99970.0000 0.93230.0015 1.00000.0000 0.0677 0.56750.0012 0.60240.0110 0.69600.0002 0.17030.0033 0.30760.0063 SALUN 0.99860.0001 0.92740.0024 0.92330.0053 0.0041 0.47950.0027 0.29660.0194 0.61790.0045 0.36400.0142 0.13780.0146 SG 0.98810.0084 0.88510.0100 0.91600.0134 0.0309 0.50270.0033 0.51150.0084 0.65110.0034 0.05330.0017 0.10780.0338

Part 2 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist.

Retrain 0.99990.0000 0.91430.0034 0.90700.0116 0.0073 0.50070.0091 0.49700.0331 0.38340.1959 0.06500.0119 0.07620.0416 FT 0.99840.0000 0.93140.0006 0.99300.0014 0.0616 0.54830.0029 0.56300.0157 0.68060.0018 0.10800.0144 0.25690.0217 GA 0.99840.0003 0.92990.0013 0.99870.0009 0.0688 0.56670.0053 0.58560.0068 0.69320.0031 0.14870.0111 0.29550.0240 IU 0.99840.0003 0.92870.0019 0.99830.0005 0.0696 0.56750.0046 0.59170.0066 0.69360.0028 0.15130.0074 0.29950.0232 ℓ1-sparse 0.81840.0017 0.81730.0041 0.79570.0187 0.0216 0.49920.0166 0.50540.0135 0.43250.1057 0.06070.0097 0.07970.0048 RL 0.99840.0001 0.92780.0007 0.93430.0063 0.0065 0.47350.0039 0.31380.0159 0.61310.0035 0.34570.0060 0.12310.0189 BS 0.99950.0001 0.93280.0012 1.00000.0000 0.0672 0.56670.0029 0.59950.0155 0.69500.0016 0.16270.0060 0.30560.0095 SALUN 0.99850.0001 0.92720.0013 0.92900.0022 0.0018 0.48670.0012 0.30770.0038 0.62380.0029 0.34730.0068 0.13720.0104 SG 0.99540.0016 0.88500.0043 0.91630.0108 0.0313 0.50420.0086 0.50100.0149 0.65900.0012 0.07900.0184 0.10310.0235

Part 3 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist.

Retrain 1.00000.0000 0.91540.0036 0.91030.0017 0.0051 0.49520.0065 0.49370.0308 0.41780.1459 0.06800.0014 0.05320.0416 FT 0.99840.0001 0.93030.0022 0.99000.0008 0.0597 0.55100.0015 0.56430.0117 0.68170.0016 0.10270.0104 0.27270.0134 GA 0.99840.0001 0.93080.0011 0.99700.0016 0.0662 0.56980.0037 0.59330.0146 0.69590.0023 0.16470.0196 0.28020.0202 IU 0.99850.0001 0.93140.0006 0.99700.0008 0.0656 0.57000.0046 0.59730.0152 0.69610.0028 0.16300.0187 0.27940.0218 ℓ1-sparse 0.82310.0060 0.81930.0045 0.79700.0065 0.0223 0.50650.0083 0.52190.0096 0.35600.0099 0.06530.0123 0.07110.0204 RL 0.99850.0000 0.92610.0024 0.92800.0051 0.0019 0.48750.0085 0.37520.0738 0.55440.1062 0.33530.0203 0.13990.0323 SALUN 0.99850.0002 0.92630.0019 0.93330.0076 0.0070 0.49000.0076 0.30930.0153 0.62860.0051 0.33170.0078 0.12860.0152 BS 0.99840.0002 0.92870.0019 0.99230.0017 0.0636 0.54970.0062 0.56810.0203 0.68240.0044 0.11800.0059 0.28010.0207 SG 0.98140.0145 0.88950.0146 0.91500.0177 0.0255 0.50010.0145 0.50130.0111 0.62190.0101 0.08570.0137 0.10190.0441

Part 4 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist.

Retrain 0.99990.0000 0.91400.0052 0.91330.0135 0.0007 0.51020.0066 0.51370.0422 0.49590.2180 0.07470.0167 0.09360.0123 FT 0.99840.0001 0.92860.0026 0.99100.0024 0.0624 0.55250.0058 0.57580.0111 0.68350.0036 0.11470.0084 0.28960.0273 GA 0.99840.0000 0.92980.0021 0.99770.0009 0.0679 0.57320.0050 0.59150.0072 0.69760.0030 0.14530.0202 0.29020.0158 IU 0.99840.0001 0.93030.0022 0.99830.0009 0.0680 0.57170.0066 0.59660.0059 0.69710.0043 0.14500.0205 0.29040.0150 ℓ1-sparse 0.82770.0009 0.82090.0053 0.81770.0076 0.0032 0.49600.0071 0.48510.0162 0.59100.0082 0.05770.0133 0.06060.0077 RL 0.99840.0002 0.92680.0011 0.93570.0069 0.0089 0.49250.0043 0.33600.0016 0.62940.0036 0.29900.0099 0.13140.0080 BS 0.99950.0001 0.93090.0016 0.99870.0005 0.0678 0.56980.0045 0.63300.0137 0.69700.0026 0.17930.0012 0.31190.0114 SALUN 0.99820.0003 0.92420.0007 0.93230.0103 0.0081 0.48670.0086 0.33320.0123 0.61140.0248 0.32000.0091 0.12630.0113 SG 0.98450.0048 0.88360.0037 0.91530.0113 0.0317 0.51180.0178 0.51390.0187 0.62660.0145 0.08070.0118 0.10180.1059

Part 5 Accr Accte Accf |Accf Accte| MIA acc. MIA AUC MIA F1 KS Stat. W. Dist.

Retrain 0.99990.0000 0.91230.0043 0.90470.0082 0.0076 0.50350.0092 0.50500.0304 0.50430.1700 0.05230.0189 0.05770.0199 FT 0.99860.0001 0.93010.0012 0.98830.0009 0.0582 0.55400.0047 0.56440.0256 0.68520.0024 0.11330.0118 0.28090.0203 GA 0.99850.0002 0.92980.0032 0.99830.0012 0.0685 0.56950.0071 0.58090.0215 0.69480.0040 0.14670.0077 0.30550.0170 IU 0.99860.0002 0.92860.0026 0.99870.0009 0.0701 0.57170.0065 0.58380.0206 0.69640.0033 0.14470.0090 0.31100.0168 ℓ1-sparse 0.82160.0008 0.81410.0026 0.80230.0081 0.0118 0.50400.0133 0.50380.0204 0.47640.0902 0.07600.0086 0.07690.0049 RL 0.99870.0002 0.92690.0010 0.91900.0000 0.0079 0.49130.0055 0.39570.0949 0.54040.1278 0.30830.0123 0.09560.0092 BS 0.99950.0001 0.93050.0018 0.99900.0008 0.0685 0.57120.0082 0.62150.0139 0.69710.0056 0.17970.0062 0.31250.0090 SALUN 0.99850.0001 0.92730.0007 0.92470.0037 0.0026 0.48450.0041 0.33210.0113 0.62300.0017 0.29470.0204 0.11430.0123 SG 0.99260.0020 0.88560.0084 0.91390.0105 0.0283 0.50760.0114 0.50370.0055 0.67090.0061 0.09530.0160 0.10370.0686