# Verification of Machine Unlearning is Fragile

Binchi Zhang 1, Zihan Chen 1, Cong Shen 1, Jundong Li 1

1 University of Virginia, Charlottesville, VA, USA. Correspondence to: Jundong Li.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

As privacy concerns escalate in the realm of machine learning, data owners now have the option to utilize machine unlearning to remove their data from machine learning models, following recent legislation. To enhance transparency in machine unlearning and avoid potential dishonesty by model providers, various verification strategies have been proposed. These strategies enable data owners to ascertain whether their target data has been effectively unlearned from the model. However, our understanding of the safety issues of machine unlearning verification remains nascent. In this paper, we explore the novel research question of whether model providers can circumvent verification strategies while retaining the information of data supposedly unlearned. Our investigation leads to a pessimistic answer: the verification of machine unlearning is fragile. Specifically, we categorize the current verification strategies, with respect to potential dishonesty among model providers, into two types. Subsequently, we introduce two novel adversarial unlearning processes capable of circumventing both types. We validate the efficacy of our methods through theoretical analysis and empirical experiments on real-world datasets. This study highlights the vulnerabilities and limitations of machine unlearning verification, paving the way for further research into the safety of machine unlearning.

1. Introduction

In the deep learning era, machine learning (ML) has grown increasingly data-dependent. A significant volume of personal data has been utilized to train real-world ML systems. While the plentiful use of personal data has facilitated the advancement of machine learning, it simultaneously poses a threat to user privacy and has resulted in severe data breaches (Nguyen et al., 2022). Consequently, recent regulations and laws (GDPR, 2016; CCPA, 2018) have mandated a novel form of request: the elimination of personal data and its impact from a trained ML model, known as machine unlearning (MUL) (Cao & Yang, 2015). In recent years, a proliferation of MUL techniques has emerged, employing various methods to remove certain training data (Nguyen et al., 2022; Xu et al., 2023). Despite the success achieved, current MUL techniques are still a black box for data owners, i.e., data owners cannot monitor the unlearning process and ascertain whether their data has been truly unlearned from the model (Eisenhofer et al., 2022; Xu et al., 2023). For instance, many technology companies, e.g., Google and Microsoft, provide machine learning as a service (MLaaS). Under MLaaS, individual data owners upload personal data to the server. The server then trains an ML model using the collected dataset and provides its predictive functionality as a service to the data owners (Sommer et al., 2022). In particular, data owners might want their data to be unlearned once this service is no longer required. However, after sending an unlearning request, data owners will not receive any proof that their data was indeed unlearned.
Without the ability to verify, data owners have to trust the model provider blindly for the efficacy and integrity of the unlearning process. On the other hand, the model provider might deceive data owners and pretend to have their data unlearned to avoid potential deterioration of the model utility and extra computational cost caused by unlearning (Eisenhofer et al., 2022). Other than MLaaS, similar problems also exist in other domains involving personal data, e.g., social media and financial systems. To address the above issue, recent studies have started exploring verification strategies for MUL techniques from different aspects, e.g., injecting backdoor data into the training set or reproducing the unlearning operations. Regarding the model provider's potential dishonesty, a natural question arises: are current MUL verification strategies ensured to be safe? Specifically, we aim to study the following research question: Can model providers successfully circumvent current verification strategies with dishonest unlearning?

To answer this question, we systematically investigate the vulnerability of verification methods and obtain a pessimistic answer: current MUL verification is fragile, i.e., data owners may fail to verify the integrity of the unlearning process using current MUL verification strategies. To support our study, we first categorize existing verification strategies into two types: backdoor verification and reproducing verification. We then propose an adversarial unlearning process that can successfully circumvent both of them, i.e., it always satisfies these two types of verification but still preserves the information of unlearned data. During unlearning, our approach selects mini-batches excluding the unlearned data to effectively evade detection by existing verification strategies, even the most stringent reproducing verification strategy. Meanwhile, the selected batches are designed to yield model updates akin to those that would be produced by the unlearned data. By deliberately choosing retained data that mimics the influence of the unlearned data in training, the unlearned model yielded by our adversarial method retains the information from the unlearned data. In addition, we propose a weaker but more efficient adversarial unlearning process that can deceive a subset of reproducing verification by forging the unlearning processes directly from the original training steps. We provide theoretical guarantees for the efficacy of our approaches. Furthermore, we conduct comprehensive empirical experiments to validate the efficacy of our proposed adversarial unlearning processes on real-world datasets. We highlight our main contributions as follows:

- We propose two adversarial unlearning methods that can circumvent both types of current MUL verification strategies (backdoor and reproducing) while preserving the information of unlearned data.
- We prove the capacity of the proposed adversarial methods to satisfy the stringent reproducing verification. We also prove that they can preserve the unlearned model utility as in the original training and improve the efficiency.
- We conduct empirical experiments to verify the efficacy of our proposed adversarial methods on real-world datasets, exposing the vulnerability of MUL verification.

2. Related Works

2.1. Machine Unlearning

The goal of MUL is to make a trained ML model forget some specific training data (Cao & Yang, 2015).
A simple but powerful way of unlearning is to retrain the model from scratch, which is also called exact unlearning (Cao & Yang, 2015; Bourtoule et al., 2021; Kim & Woo, 2022; Chen et al., 2022). In exact unlearning, the unlearned model is ensured to behave as if it has never seen the unlearned data, which satisfies the goal of MUL. Following this path, Cao & Yang first proposed the retraining-based methodology for MUL. Follow-up studies (Bourtoule et al., 2021; Kim & Woo, 2022; Chen et al., 2022) took steps to improve the efficiency of the retraining framework on image and graph data. Despite the efforts to improve efficiency, the retraining-based framework still has difficulty accommodating frequent unlearning requests in the real world (Chien et al., 2023). Due to the large computational cost of exact unlearning, approximate unlearning was proposed to efficiently update the original model so as to estimate the retrained model (Guo et al., 2020; Ullah et al., 2021; Izzo et al., 2021; Zhang et al., 2022; Pan et al., 2023; Wu et al., 2023a;b; Che et al., 2023; Warnecke et al., 2023). A common way to update the original model is to use the influence function (Koh & Liang, 2017), which can be seen as conducting a single Newton step (Boyd & Vandenberghe, 2004; Sekhari et al., 2021; Neel et al., 2021) on the model. In particular, approximate unlearning can be certified if the distance between the unlearned model and the retrained model is bounded in the probability space (Nguyen et al., 2022).

2.2. Verification for Machine Unlearning

Backdoor Verification. Backdoor verification of MUL requires data owners to actively inject backdoor poisoned data (e.g., changing the original label to a different one for misleading) as backdoor triggers (Sommer et al., 2022; Gao et al., 2022; Guo et al., 2023). In this way, the model trained on the backdoor data misclassifies the backdoor data as the modified classes. To verify the integrity of unlearning, data owners can deliberately request the model provider to unlearn the backdoor data. If the model provider honestly unlearns the backdoor data, the predictions are supposed to be the original labels with high confidence (the backdoor is not triggered); otherwise, the predictions can still be misled to the poisoned label (the backdoor is triggered). In addition, Sommer et al. formulated backdoor verification as a hypothesis test, enabling a probabilistic guarantee of a successful verification.

Reproducing Verification. Inspired by verifiable computation techniques, reproducing verification of MUL requires the model provider to provide a proof of unlearning (PoUL) that records the operations for unlearning, and data owners can reproduce every unlearning step to verify the integrity of unlearning. Considering that the model provider might deceive the data owner while conducting exact unlearning, Thudi et al. first introduced the proof of learning (PoL) technique to verify the model retraining operation. Although they showed that the PoL can be forged by model providers, their forging strategy is not realistic and brings limitations to understanding the safety of MUL verification. Following this paradigm, Weng et al. presented a trusted hardware-empowered PoUL technique with an SGX enclave (Costan & Devadas, 2016). With the trusted hardware, their framework provides a better safety guarantee for verifying MUL.
Recently, Eisenhofer et al. proposed the first cryptographic definition of verifiable unlearning and instantiated the PoUL with SNARKs (Setty, 2020) and hash chains. Their verification framework has a safety guarantee from a cryptographic perspective. Comparing the different types of verification, backdoor verification requires data owners to inject poisoned data beforehand, while reproducing verification requires model providers to generate a proof of the unlearning operations. In addition, Xu et al. summarized the current MUL evaluation methods (mainly for approximate unlearning), e.g., accuracy (Golatkar et al., 2020; 2021; Mehta et al., 2022), relearning time (Kim & Woo, 2022; Chundawat et al., 2023; Tarun et al., 2023), and membership inference attacks (Chen et al., 2021), as verification strategies. However, even knowing the evaluation results, data owners still need to compare the results with the model retrained from scratch (the exact unlearning method), which is unknown to data owners in practice. More importantly, evaluation methods fail to take into account the dishonest behaviors of model providers. Therefore, we do not focus on the safety problem of evaluation in this paper.

3. Threat Model

In our problem, we consider the data owner as a victim and the model provider as an adversary regarding safety in MUL verification. First, the model provider trains an ML model based on a training dataset provided by the data owners. After the data owners send an unlearning request, the model provider deceives the data owners that their data has been unlearned, while the model provider updates the model with an adversarial unlearning process rather than normal unlearning methods to preserve the information of the unlearned data. Next, we introduce the settings of our adversarial unlearning process from two perspectives: the adversary's goal and the adversary's knowledge.

Adversary's Goal. The goal of using an adversarial unlearning process is to preserve the information of unlearned data during unlearning while satisfying the reproducing and backdoor verification. Note that the model provider might cheat the data owner for two benefits, i.e., better model utility and lower computational cost. Hence, other than preserving the unlearned data, the adversary's goal should also include these benefits.

Adversary's Knowledge. The model provider has complete access to the training process, e.g., the training data, the unlearned data, and the model. As a tentative step toward exploring the vulnerability of MUL verification, we mainly focus on the verification of exact unlearning, i.e., verifying whether the model provider honestly retrains the model from scratch. We conclude the connection between our threat model and the different types of verification strategies for MUL in Figure 1.

[Figure 1. The connection of our threat model and different verification strategies. Our retraining method can deceive the backdoor and reproducing verification, and our forging method can only deceive a subset of reproducing verification but with better efficiency.]

4. Methodology

In this section, we introduce our design of an adversarial unlearning process that deceives both the reproducing verification and the backdoor verification. To satisfy the reproducing verification, the model provider should provide a valid Proof of Retraining (PoRT; for better clarity, we use the term PoRT in place of the aforementioned PoUL, as the two coincide under exact unlearning). Hence, we first introduce the notation and background knowledge of the PoRT before proposing our adversarial unlearning process.

4.1. Preliminary
Notation. We denote D as a training dataset with n data samples and f_w as an ML model with w collecting the learnable parameters. Let A be a learning process that takes the training set D as input and outputs the optimal w that minimizes the empirical risk on D. In particular, a learning process A is used to minimize the empirical loss function

$$\mathcal{A}(D) = \arg\min_{w} L(w, D), \quad (1)$$

where $L(w, D) = \frac{1}{|D|} \sum_{(x, y) \in D} l(f_w(x), y)$ is the empirical risk over D, (x, y) denotes a pair of input data and output label, and l(·) denotes a task-specific loss function, e.g., the cross-entropy loss. We use D_u to denote the set of data to be unlearned. Consequently, the model retrained from scratch can be obtained by A(D\D_u).

Proof of Retraining. In reproducing verification, the model provider is required to provide a PoRT, and the data owner or a third-party verifier can reproduce all the retraining operations in the PoRT to verify the integrity of unlearning. A prevalent instantiation of the PoRT is to record the trajectory of retraining during the unlearning process (Thudi et al., 2022; Weng et al., 2022; Eisenhofer et al., 2022), denoted by P_r = {w_r^(t), d_r^(t), g_r^(t)}_{t∈I}, where w_r^(t) denotes the intermediate model parameters during retraining, d_r^(t) denotes the data used for deriving w_r^(t), g_r^(t) denotes the updating function, and I is the set of indices of the intermediate learning steps during retraining (I = {1, 2, ..., T}, where T is the number of model updating steps). In addition, a Proof of Training (PoT) can be defined similarly to the PoRT, i.e., P_t = {w^(t), d^(t), g^(t)}_{t∈I}, with the difference that the unlearned data should be involved in the training but not in the retraining. In this paper, we assume that the equivalence between the deployed models and the models appearing in the PoT (w^(T)) and the PoRT (w_r^(T)) can be verified; otherwise, reproducing verification would never be possible and the problem becomes trivial.

Reproducing Verification. We next provide a formal definition of reproducing verification (Thudi et al., 2022; Weng et al., 2022; Eisenhofer et al., 2022). As the retraining process can be divided into iterative steps, the PoRT can be seen as an ordered set of triplets {w_r^(t), d_r^(t), g_r^(t)}. Each triplet describes an iterative operation that updates the retrained model based on an updating function as w_r^(t) = g_r^(t)(w_r^(t-1), d_r^(t)). As an example, we can instantiate the updating function as the commonly used mini-batch stochastic gradient descent (Goodfellow et al., 2016):

$$g_r^{(t)}\!\left(w_r^{(t-1)}, d_r^{(t)}\right) = w_r^{(t-1)} - \gamma^{(t)} \nabla L\!\left(w_r^{(t-1)}, d_r^{(t)}\right), \quad (2)$$

where γ^(t) denotes the learning rate. We can define a valid PoRT that satisfies the reproducing verification as follows.

Definition 4.1. A valid Proof of Retraining is defined as P_r = {w_r^(t), d_r^(t), g_r^(t)}_{t∈I} that satisfies the following two properties:
1. Reproducible: ∀t ∈ I, ||w_r^(t) − g_r^(t)(w_r^(t-1), d_r^(t))|| ≤ ε;
2. Removable: ∀t ∈ I, d_r^(t) ∩ D_u = ∅.

In the reproducible property, Thudi et al. set ε as a threshold for error tolerance, allowing the verifier to accept some numerical imprecision when reproducing the update rule; ||w_r^(t) − g_r^(t)(w_r^(t-1), d_r^(t))|| is called the verification error at step t.
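To make Definition 4.1 concrete, the sketch below replays a PoRT against the mini-batch SGD update in Equation (2) and checks both properties. It is a minimal illustration under our own assumptions about how a proof is serialized (flat parameter tensors, index-annotated batches); the helper names `batch_grad` and `verify_port` are ours, not from the paper.

```python
import torch

def batch_grad(loss_fn, w, batch):
    """Gradient of the empirical loss L(w, d) over one mini-batch,
    with the parameters w represented as a single flat tensor."""
    w = w.detach().clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(loss_fn(w, batch), w)
    return grad

def verify_port(port, w0, loss_fn, unlearned_ids, eps=0.0):
    """Reproducing verification of a PoRT = [(w_t, batch, lr_t), ...].

    Each batch is a list of (sample_id, x, y) triples. Returns True iff every
    step is removable (no unlearned sample) and reproducible within eps.
    """
    w_prev = w0
    for w_t, batch, lr in port:
        # Removable: the recorded batch d_r^(t) must not intersect D_u.
        if any(sid in unlearned_ids for sid, _, _ in batch):
            return False
        # Reproducible: replay g_r^(t)(w_r^(t-1), d_r^(t)) and compare.
        w_replayed = w_prev - lr * batch_grad(loss_fn, w_prev, batch)
        if torch.norm(w_t - w_replayed) > eps:
            return False
        w_prev = w_t
    return True
```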
In (Weng et al., 2022; Eisenhofer et al., 2022), the threshold ε can be reduced to exactly 0 with the help of SNARKs (Setty, 2020) and Intel SGX (Costan & Devadas, 2016). Hence, we consider both the 0-threshold and the ε-threshold cases.

4.2. First Adversarial Method (Retraining)

We first introduce our adversarial unlearning process against both reproducing and backdoor verification. Recall that the adversary's goal is to satisfy the unlearning verification while preserving the information of unlearned data. To satisfy the reproducing verification with a 0 verification error for each updating function, the model provider has to follow exactly the retraining process to generate the PoRT, strictly ensuring the reproducible and removable properties.

[Figure 2. An illustration of the retraining-based adversarial unlearning framework. The PoRT is generated from the retraining process, where the sampling of the mini-batch d_r^(t) ⊂ D\D_u is guided by its gradient similarity with d^(t) ⊂ D.]

However, inspired by recent studies on the possibility that different training data might lead to a similar gradient descent step (Shumailov et al., 2021; Thudi et al., 2022), we can find data in the retained set D\D_u that yields gradients similar to those of the unlearned data. In particular, this idea is based on the commonly used mini-batch gradient descent. Under full-batch gradient descent, the gradients in the original training and the retraining are computed over the whole of D and D\D_u, respectively; their difference is deterministic and cannot be manipulated. In contrast, under mini-batch gradient descent, the gradients in the original training and the retraining are computed over random samples from D and D\D_u. With this randomness, it is possible to deliberately select, from D\D_u, samples that induce model updates similar to those of the mini-batches from D during retraining. In this way, the retrained model might still learn from the unlearned data as in the original training, even without explicitly using them.

As discussed above, we convert the problem of designing an adversarial unlearning process into selecting d_r^(t) ⊂ D\D_u that yields model updates similar to d^(t) ⊂ D. Evidently, if the original mini-batch contains no unlearned data, d^(t) ∩ D_u = ∅, we can directly set d_r^(t) = d^(t). If the original mini-batch contains unlearned data, then our goal is to find d_r^(t) ⊂ D\D_u that makes ∇L(w_r^(t-1), d^(t)) and ∇L(w_r^(t-1), d_r^(t)) as close as possible. Regarding this goal, previous studies simply leveraged random sampling multiple times and chose the candidate yielding the smallest distance (Shumailov et al., 2021; Thudi et al., 2022):

$$d_r^{(t)} = \arg\min_{d_i} \left\| \nabla L\!\left(w_r^{(t-1)}, d^{(t)}\right) - \nabla L\!\left(w_r^{(t-1)}, d_i\right) \right\|, \quad (3)$$

where i = 1, ..., s and d_1, ..., d_s ⊂ D\D_u; we denote the right-hand side of Equation (3) as S_r(w_r^(t-1); d^(t)). However, this method is computationally expensive and does not provide theoretical guarantees.
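Before introducing our alternative, the random-sampling selection S_r of Equation (3) can be sketched as follows (a minimal sketch using the same flat-parameter loss convention as the verifier sketch above; `select_batch_random` is our own name). Its cost grows linearly with the number of candidates s, since every candidate requires a fresh gradient computation.

```python
import random
import torch

def batch_grad(loss_fn, w, batch):
    """Gradient of the mean loss over a batch at flat parameters w."""
    w = w.detach().clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(loss_fn(w, batch), w)
    return grad

def select_batch_random(loss_fn, w_prev, orig_batch, retained_pool, s, m):
    """S_r: draw s candidate batches of size m from D \\ D_u and keep the one
    whose gradient is closest to that of the original batch d^(t)."""
    target = batch_grad(loss_fn, w_prev, orig_batch)
    best, best_dist = None, float("inf")
    for _ in range(s):
        candidate = random.sample(retained_pool, m)
        dist = torch.norm(target - batch_grad(loss_fn, w_prev, candidate)).item()
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best
```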
In this paper, we instead replace each unlearned sample in the original batch, (x_u, y_u) ∈ d^(t) ∩ D_u, with its (class-wise) closest neighbor in the retained data:

$$N(x_u, y_u) = \arg\min_{(x, y) \in D \setminus D_u,\ y = y_u} \|x - x_u\|. \quad (4)$$

Consequently, our proposed mini-batch selection method can be expressed as

$$S_n\!\left(d^{(t)}\right) = \left(d^{(t)} \setminus D_u\right) \cup \left\{ N(x, y) \mid (x, y) \in d^{(t)} \cap D_u \right\}. \quad (5)$$

It is worth noting that using S_r can sometimes be better than S_n in practice, i.e., ∇L(w_r^(t-1), S_r(w_r^(t-1); d^(t))) might be closer to ∇L(w_r^(t-1), d^(t)) than ∇L(w_r^(t-1), S_n(d^(t))), as long as the sample size s is set large enough. However, our proposed closest-neighbor selection is still necessary, as it provides a worst-case upper bound for the theoretical analysis and a more efficient way of adversarial unlearning when considering real-world threats. Consequently, we can generate a PoRT based on our adversarial unlearning process with carefully selected mini-batches. We provide an illustration of this framework in Figure 2, and the detailed procedure is shown in Algorithm 1.

Algorithm 1 Retraining-based Adversarial Unlearning
  Input: training data D, unlearned data D_u.
  Output: Proof of Retraining P_r.
  Initialize w_r^(0) and P_r ← ∅.
  for t = 1 to T do
      Uniformly sample a mini-batch d^(t) ⊂ D.
      Choose d_r^(t) ← S_r(w_r^(t-1); d^(t)) or d_r^(t) ← S_n(d^(t)).
      w_r^(t) ← g_r^(t)(w_r^(t-1), d_r^(t)).
      P_r ← P_r ∪ {(w_r^(t), d_r^(t), g_r^(t))}.
  end for

Specifically, the following properties can be guaranteed for Algorithm 1.

Proposition 4.2. Algorithm 1 returns a valid Proof of Retraining under the threshold ε = 0.

Proposition 4.3. Let C_D := max_c max_{x∈c} min_{z∈c} ||x − z||, where c denotes a class in the label domain. For the continuity of the loss function, we assume (i) ||∇_w l(w, x)|| ≤ G, (ii) ||∂²l(w, x)/(∂w ∂x)|| ≤ L_x, and (iii) ||∇²_w l(w, x)|| ≤ L. Let γ^(t) = γ ≤ 1/L and let m be the size of the mini-batches. For Algorithm 1, we have

$$\mathbb{E}_T\!\left[\left\|\nabla L\!\left(w_r^{(T)}, D\right)\right\|^2\right] \le \frac{L\!\left(w_r^{(0)}, D\right) - L\!\left(w_r^{(T)}, D\right)}{\gamma T} + \frac{\gamma L}{2}\left(G^2 + p_u B^2\right) + (1 - \gamma L)\, p_u G B,$$

where p_u = 1 − (1 − |D_u|/|D|)^m and B = L_x C_D.

Assumptions (i) and (iii) are the same as in (Ajalloeian & Stich, 2020), indicating that l(·, x) is L-smooth and G-Lipschitz continuous. In addition, Assumption (ii) is the same as in (Thudi et al., 2022), indicating that ∇_w l(w, ·) is L_x-Lipschitz. We provide the proofs of Proposition 4.2 and Proposition 4.3 in Appendix A. Proposition 4.2 ensures that Algorithm 1 can satisfy the reproducing verification; Proposition 4.3 guarantees that the retraining process can reach a neighborhood of the optimum over D, i.e., the adversarial unlearning process can still make the retrained model learn from D_u ⊂ D.

Next, regarding the backdoor verification: if the retrained model reached the exact optimum over D, the backdoor poisoned data in the unlearned set could still be triggered, and the adversarial unlearning process would be recognized. However, according to Proposition 4.3, the radius of the neighborhood, (γL/2)(G² + p_u B²) + (1 − γL) p_u G B (the distance between the retrained model and the original optimum), will increase after injecting backdoor data, which reduces the probability of the backdoor data being triggered. The rationale is that after flipping the label of a backdoor sample x_b from y_b to y'_b, the value of min_{x∈c_{y'_b}} ||x − x_b|| can increase because x_b actually belongs to class y_b, which has a different distribution. Instead of learning the backdoor mapping x_b → y'_b, our adversarial unlearning process uses data truly from class y'_b to replace each backdoor sample. Hence, our unlearned model cannot be triggered by the backdoor poisoned data.
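The sketch below mirrors Algorithm 1 with the nearest-neighbor selection S_n: precompute a class-wise closest retained neighbor for every unlearned sample (Equation (4)), then swap those neighbors in whenever a sampled batch touches D_u (Equation (5)). The tensor-based data layout and the names `nearest_retained_neighbor` and `adversarial_retraining` are our own simplifications of the procedure.

```python
import torch

def nearest_retained_neighbor(X, Y, unlearned_idx, retained_idx):
    """N(x_u, y_u) in Eq. (4): map each unlearned index to the closest retained
    index with the same label (Euclidean distance on flattened inputs)."""
    nn_map, flat = {}, X.flatten(1)
    for u in unlearned_idx:
        cands = [r for r in retained_idx if Y[r] == Y[u]]
        dists = torch.norm(flat[cands] - flat[u], dim=1)
        nn_map[u] = cands[int(dists.argmin())]
    return nn_map

def select_batch_nn(batch_idx, unlearned_set, nn_map):
    """S_n(d^(t)) in Eq. (5): keep retained samples, swap unlearned ones."""
    return [nn_map[i] if i in unlearned_set else i for i in batch_idx]

def adversarial_retraining(model, criterion, X, Y, unlearned_idx,
                           lr=1e-2, epochs=1, batch_size=64):
    """Algorithm 1 (sketch): record a PoRT whose batches never touch D_u,
    yet whose updates imitate the batches that would have contained D_u."""
    n = X.shape[0]
    unlearned_set = set(unlearned_idx)
    retained_idx = [i for i in range(n) if i not in unlearned_set]
    nn_map = nearest_retained_neighbor(X, Y, unlearned_idx, retained_idx)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    port = []
    for _ in range(epochs):
        for _ in range(n // batch_size):
            batch_idx = torch.randperm(n)[:batch_size].tolist()   # d^(t) ~ D
            sub_idx = select_batch_nn(batch_idx, unlearned_set, nn_map)
            opt.zero_grad()
            criterion(model(X[sub_idx]), Y[sub_idx]).backward()
            opt.step()                                            # w_r^(t)
            state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            port.append((state, sub_idx, lr))                     # PoRT triplet
    return port
```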
4.3. Second Adversarial Method (Forging)

Although our first adversarial unlearning method can deceive the verification of MUL while preserving the information of unlearned data, its computational overhead is no better than naive retraining, limiting the benefits the model provider earns from the adversarial unlearning process. Hence, we propose another adversarial unlearning process that is more computationally efficient. As a trade-off, the power of our second adversarial method becomes weaker, i.e., it can only deceive the reproducing verification under an ε > 0 threshold.

Inspired by (Thudi et al., 2022), if the model provider records a PoT during the original training, the computation needed to generate a PoRT can be reduced by reusing the PoT. In (Thudi et al., 2022), the model provider can update the PoT using a forging map and obtain a valid PoRT under ε-threshold reproducing verification. In particular, their forging map replaces the overlapping batches (d^(t) ∩ D_u ≠ ∅) with a batch d_r^(t) ⊂ D\D_u that induces a similar model update to d^(t) in each triplet {w^(t), d^(t), g^(t)}, i.e., ∇L(w^(t-1), d^(t)) ≈ ∇L(w^(t-1), d_r^(t)). However, this forging map is not realistic because it leaves the model parameters unchanged, w_r^(t) = w^(t), which can be easily recognized by the verifier (under multiple verification requests, the verifier will receive PoRTs with the same model parameters each time). In this paper, we propose a novel forging map F : P_t → P_r that directly updates both the model parameters and the batch data of each triplet in the PoT to generate a valid PoRT, instead of retraining.

[Figure 3. An illustration of the forging-based adversarial unlearning framework. Different from the retraining-based adversarial method, the PoRT here is generated directly from the PoT recorded during the original training: w_r^(t) (with d_r^(t)) is obtained by applying the forging map to the PoT instead of using the model updating function g_r^(t).]

We formulate our proposed forging map as a triplet-wise updating function, i.e., for any t ∈ I,

$$\left(w_r^{(t)}, d_r^{(t)}, g_r^{(t)}\right) = F^{(t)}\!\left(w^{(t)}, d^{(t)}, g^{(t)}\right). \quad (6)$$

In this way, each triplet in P_r can be generated separately based on the corresponding triplet in P_t. Normally, the model updating function g remains unchanged during the retraining process, so we mainly focus on updating w^(t) and d^(t) with F^(t). We divide the updating of w^(t) and d^(t) into two cases: excluding unlearned data, I_e = {t | d^(t) ∩ D_u = ∅}, and including unlearned data, I_n = {t | d^(t) ∩ D_u ≠ ∅} (I_e and I_n denote the index sets of the triplets in the two cases). We assume that the amount of unlearned data is far less than the retained data. Next, we discuss the two cases separately.

(1) For t ∈ I_e, although the data d^(t) require no modification (we can simply let d_r^(t) = d^(t)), we still need to slightly alter the model parameters w^(t), because otherwise the verifier might notice a large proportion of unchanged models when comparing w_r^(t) with w^(t). On the other hand, the majority of triplets should be updated carefully, because the model utility should not decrease very much after multiple verification requests. Considering both requirements, efficiency and model utility, we use single-step stochastic gradient descent (SGD) to update the model parameters in F^(t):

$$w_r^{(t)} = w^{(t)} - \gamma_r^{(t)} \nabla l\!\left(f_{w^{(t)}}\!\left(x^{(t)}\right), y^{(t)}\right), \quad (7)$$

where (x^(t), y^(t)) ∈ D\D_u is a random sample chosen from the retained data and γ_r^(t) is a small value that controls the verification error ||w_r^(t) − g_r^(t)(w_r^(t-1), d_r^(t))|| ≤ ε.

(2) For t ∈ I_n, the forging map should remove the unlearned data from d^(t). To reduce the verification error, we still exploit the same strategy as in our first adversarial method and select the mini-batch d_r^(t) ⊂ D\D_u with S_n(d^(t)) (we do not use S_r due to its efficiency issue). Consequently, to update the model parameters in the forging map F^(t), we let

$$w_r^{(t)} = g^{(t)}\!\left(w^{(t-1)}, S_n\!\left(d^{(t)}\right)\right). \quad (8)$$

We conclude our second adversarial unlearning process in Algorithm 2 and provide an illustration in Figure 3.

Algorithm 2 Forging-based Adversarial Unlearning
  Input: training data D, unlearned data D_u, closest-neighbor mapping N : D_u → D\D_u, Proof of Training P_t = {w^(t), d^(t), g^(t)}.
  Output: Proof of Retraining P_r.
  P_r ← ∅.
  for t = 1 to T in parallel do
      if d^(t) ∩ D_u = ∅ then
          d_r^(t) ← d^(t).
          w_r^(t) ← w^(t) − γ_r^(t) ∇l(f_{w^(t)}(x^(t)), y^(t)).
      else
          d_r^(t) ← S_n(d^(t)).
          w_r^(t) ← g^(t)(w^(t-1), d_r^(t)).
      end if
      g_r^(t) ← g^(t).
      P_r ← P_r ∪ {(w_r^(t), d_r^(t), g_r^(t))}.
  end for

The advantage of Algorithm 2 is that the forging map can be implemented in parallel thanks to the triplet-wise formulation in Equation (6), which largely reduces the execution time in real-world scenarios with frequent unlearning requests.
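A minimal sketch of the triplet-wise forging map in Equation (6): each recorded training step is forged independently, either by a tiny extra SGD step on one retained sample (the I_e case, Equation (7)) or by replaying the step on the neighbor-substituted batch (the I_n case, Equation (8)). The flat-parameter representation and the names `forge_triplet` and `forge_port` are our own; because the steps do not depend on one another, the loop in `forge_port` is the part that Algorithm 2 dispatches to parallel workers.

```python
import torch

def loss_grad(loss_fn, w, xb, yb):
    """Gradient of the mean batch loss at flat parameters w."""
    w = w.detach().clone().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_fn(w, xb, yb), w)
    return g

def forge_triplet(w_t, w_prev, batch_idx, lr, unlearned_set, nn_map,
                  loss_fn, X, Y, gamma_r=1e-4):
    """Forge one PoRT triplet (w_r^(t), d_r^(t)) from a recorded PoT triplet."""
    if not any(i in unlearned_set for i in batch_idx):        # t in I_e, Eq. (7)
        d_r = list(batch_idx)
        # One retained sample (this batch contains no unlearned data).
        j = d_r[int(torch.randint(len(d_r), (1,)))]
        w_r = w_t - gamma_r * loss_grad(loss_fn, w_t, X[[j]], Y[[j]])
    else:                                                     # t in I_n, Eq. (8)
        d_r = [nn_map[i] if i in unlearned_set else i for i in batch_idx]
        w_r = w_prev - lr * loss_grad(loss_fn, w_prev, X[d_r], Y[d_r])
    return w_r, d_r

def forge_port(pot, unlearned_set, nn_map, loss_fn, X, Y):
    """Apply the forging map to every PoT triplet; each step is independently
    forgeable, hence the loop can be parallelized across workers."""
    return [forge_triplet(w_t, w_prev, batch_idx, lr,
                          unlearned_set, nn_map, loss_fn, X, Y)
            for (w_t, w_prev, batch_idx, lr) in pot]
```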
Specifically, we analyze the condition under which Algorithm 2 satisfies the reproducing verification, as well as the time complexity of Algorithm 2, as follows.

Proposition 4.4. Under the same assumptions as in Proposition 4.3, when g_r^(t) is vanilla SGD,

$$0 \le \gamma^{(t)} \le \frac{1}{2L}\left(\sqrt{9 + \frac{4\varepsilon L}{L_x C_D}} - 3\right) \quad \text{and} \quad 0 \le \gamma_r^{(t)} \le \frac{\varepsilon - \gamma^{(t)} L_x C_D}{G\left(2 + \gamma^{(t)} L\right)},$$

Algorithm 2 returns a valid Proof of Retraining under an ε > 0 threshold.

Proposition 4.5. Assume the time complexity of naive retraining is T(n). The time complexity of Algorithm 1 is T(n), and the time complexity of Algorithm 2 is $\frac{p_u}{m P}\, T(n)$, where p_u = 1 − (1 − |D_u|/|D|)^m, P denotes the number of parallel processes, and m denotes the batch size.

The proofs of Proposition 4.4 and Proposition 4.5 can be found in Appendix A. Proposition 4.4 provides a theoretical guarantee for when Algorithm 2 successfully deceives the ε-threshold reproducing verification. Proposition 4.5 compares the time complexity of Algorithm 1 and Algorithm 2 with naive retraining and demonstrates the superiority of Algorithm 2 in computational efficiency.
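Both propositions involve p_u, the probability that a uniformly drawn mini-batch contains at least one unlearned sample, which is also the fraction of steps that fall into the more expensive I_n case of Algorithm 2. A quick calculation gives a feel for its scale; the 2% unlearning fraction matches the setting in Section 5.1, while the batch sizes are illustrative assumptions of ours rather than reported hyperparameters.

```python
# p_u = 1 - (1 - |D_u|/|D|)^m: probability that a uniform mini-batch of size m
# intersects the unlearned set D_u.
def p_u(unlearn_fraction: float, m: int) -> float:
    return 1.0 - (1.0 - unlearn_fraction) ** m

for m in (32, 64, 128):                       # illustrative batch sizes
    print(f"m={m:3d}  p_u={p_u(0.02, m):.3f}")
# m= 32  p_u=0.476
# m= 64  p_u=0.726
# m=128  p_u=0.925
```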
Despite the efficiency of Algorithm 2, it can fail to satisfy the backdoor verification because the final unlearned model w_r^(T) still depends on the information of the injected backdoor data.

5. Experiments

In this section, we empirically evaluate the vulnerability of MUL verification with numerical experiments.

Datasets. Our experiments are based on three widely adopted real-world datasets for image classification: MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009). The MNIST dataset consists of a collection of handwritten digit images; the CIFAR-10 dataset contains color images in 10 classes, with each class representing a specific object category, e.g., cats and automobiles; the SVHN dataset consists of house-number images captured from Google Street View. The statistics of the three datasets are shown in Table 1. All datasets are publicly accessible (MNIST with the GNU General Public License, CIFAR-10 with the MIT License, and SVHN with the CC BY-NC License).

Table 1. Dataset statistics.

| Dataset | # Train | # Test | # Class | Image Size |
|---|---|---|---|---|
| MNIST | 60,000 | 10,000 | 10 | 28 × 28 |
| CIFAR-10 | 50,000 | 10,000 | 10 | 32 × 32 × 3 |
| SVHN | 73,257 | 26,032 | 10 | 32 × 32 × 3 |

Evaluation Metrics. Recall that the goal of the adversarial unlearning processes is to preserve the model utility and improve the efficiency of unlearning while circumventing the verification methods. To demonstrate the efficacy of our adversarial unlearning methods, we use the verification error threshold ε to evaluate the forging-based method and the probability of type II errors (the verifier believes the target data is unlearned when it actually is not) to evaluate our retraining-based method. In addition, we should also ensure that the adversarial unlearning methods benefit model providers in preserving model utility and improving efficiency. Hence, we use utility metrics (e.g., the F1-score) and the execution time to verify the benefits of the adversarial unlearning methods.

Implementation. We implemented all experiments in the PyTorch (Paszke et al., 2019) library and used SGD as the optimizer for training. All experiments were conducted on an Nvidia RTX A6000 GPU. We report the average value and standard deviation of the numerical results under five random seeds. More details of the hyperparameter settings are presented in Appendix B. Our code is available at https://github.com/zhangbinchi/unlearning-verification-is-fragile.

5.1. Adversarial Unlearning vs Verification

We first empirically demonstrate the efficacy of our proposed adversarial unlearning methods, i.e., that they can satisfy the verification and let the verifier believe that the data is unlearned normally. We next discuss the reproducing verification and the backdoor verification separately.

Reproducing Verification. For reproducing verification, our retraining-based method can always satisfy the verification with an exact 0 verification error because the PoRT is recorded directly from the retraining process. Hence, we mainly focus on the efficacy of the forging-based method in this experiment. As the choice of the verification error threshold ε is highly personalized, we directly fix the hyperparameters of our forging-based method and record the verification error at each model updating step. Specifically, we randomly choose 2% of the data as the unlearned set. The hyperparameter settings are shown in Appendix B, and the experimental results are shown in Figure 4.

[Figure 4. Verification error of the forging-based adversarial unlearning method for MLP over MNIST, CNN over CIFAR-10, and ResNet over SVHN: (a) proportion of steps whose verification error falls below a given threshold; (b) details of the error statistics for I_e and I_n steps across epochs.]

We can observe that the verification errors in most steps for all datasets are below 1e-3, i.e., our forging-based method can satisfy the reproducing verification with a 1e-3 threshold.
Compared with the scale of the adopted neural networks, 1e-3 can be seen as a small value. We can also see a reduction in the overall verification error as the model scale decreases (from ResNet to MLP). In addition, we illustrate the mean value and the standard deviation of the verification error of the I_e (excluding D_u) and I_n (including D_u) steps in different epochs and obtain the following observations: 1) the verification errors in the I_e and I_n steps are consistent, perhaps due to the high dependence of both errors on the gradient scale; 2) within the overall threshold, the verification error tends to first increase and then decrease until staying near a local optimum; again, we attribute this to the dependence of the errors on the gradients.

Backdoor Verification. For backdoor verification, we exploit Athena (Sommer et al., 2022) as the verification strategy. In particular, we randomly choose the backdoor training and test data, inject a specific pattern of pixels, and change the labels of the backdoor training data to a fixed target label. To test the integrity of unlearning, we make the unlearned data include the backdoor training data. If the backdoor success rate is close to random prediction, the model predicts the original labels for the backdoor test data, and the unlearned model was not trained on the poisoned data; thus, the verifier believes that the model provider followed the unlearning request. Specifically, we regard "the target data is unlearned" as the null hypothesis and "the target data is not unlearned" as the alternative hypothesis. We use the probability of type II errors (the verifier thinks the target data is unlearned but it actually is not) to evaluate our adversarial unlearning methods. The results are shown in Table 2.

Table 2. Probability of the type II error (β value) of backdoor verification for different (adversarial) unlearning strategies over three datasets.

| Method | MNIST | CIFAR-10 | SVHN |
|---|---|---|---|
| Original | 2.61 × 10^-42 | 5.14 × 10^-20 | 1.11 × 10^-28 |
| Retrain | 0.998 | 0.998 | 0.999 |
| Adv-R | 0.997 | 0.997 | 0.999 |
| Adv-F | 5.46 × 10^-34 | 2.78 × 10^-16 | 2.08 × 10^-26 |

Consistent with our former discussion, the retraining-based adversarial method and naive retraining are almost always regarded as truly unlearning the target data; the forging-based adversarial method and the original training are hardly ever regarded as truly unlearning the target data.
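A simplified sketch of the backdoor-based check used in this experiment: stamp a small pixel pattern on the selected training images, flip their labels to a fixed target class, and later measure how often the (allegedly) unlearned model still predicts that target class on triggered test images. This is our own minimal rendition, not the Athena implementation; the trigger shape, target class, and function names are illustrative assumptions.

```python
import torch

def stamp_trigger(x, value=1.0, size=3):
    """Stamp a size x size patch in the bottom-right corner of each image
    (x has shape [N, C, H, W])."""
    x = x.clone()
    x[..., -size:, -size:] = value
    return x

def poison(x_train, y_train, idx, target_class):
    """Backdoor the selected training samples: add the trigger, flip the label."""
    x_train, y_train = x_train.clone(), y_train.clone()
    x_train[idx] = stamp_trigger(x_train[idx])
    y_train[idx] = target_class
    return x_train, y_train

@torch.no_grad()
def backdoor_success_rate(model, x_test, y_test, target_class):
    """Fraction of triggered test images (true class != target) that the model
    sends to the target class. A rate near random prediction suggests the
    backdoor was unlearned; a high rate suggests it was not."""
    keep = y_test != target_class
    preds = model(stamp_trigger(x_test[keep])).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```

In the hypothesis-testing view described above, counts derived from this success rate are the kind of statistic from which β values like those in Table 2 are computed.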
5.2. Adversary's Goal

In the previous experiments, we demonstrated that our proposed adversarial unlearning methods can satisfy reproducing and backdoor verification. Next, we aim to show that 1) the adversarial unlearning methods deceive the verification: they still memorize the information of the unlearned data, and 2) model providers benefit from adversarial unlearning: they preserve the model utility and improve the unlearning efficiency.

Utility. In this experiment, we compare the model utility between naive retraining and our adversarial unlearning. To simulate the data heterogeneity in real-world scenarios, we add a class-imbalanced unlearning setting. For the retraining-based adversarial method, we adopt both the random sampling S_r and the nearest neighbor S_n approaches to select mini-batches. For the forging-based adversarial method, we use the last recorded model on the PoRT, w_r^(T), for utility evaluation. To demonstrate the long-term effect of adversarial unlearning, we divide 10% of the training data into the unlearned set. Details of the hyperparameter settings and the complete experimental results can be found in Appendix B.1, and we provide the truncated experimental results in Table 3.

Table 3. Comparison of the model utility among original training, naive retraining, and the adversarial unlearning methods over three popular DNNs across three real-world datasets. We record the macro F1-score of the predictions on the unlearned set Du, the retained set D\Du, and the test set Dt. The prefix "im-" denotes the results in the class-imbalanced setting.

| Method | MLP/MNIST Du | MLP/MNIST D\Du | MLP/MNIST Dt | CNN/CIFAR-10 Du | CNN/CIFAR-10 D\Du | CNN/CIFAR-10 Dt | ResNet/SVHN Du | ResNet/SVHN D\Du | ResNet/SVHN Dt |
|---|---|---|---|---|---|---|---|---|---|
| Original | 99.47 ± 0.09 | 99.76 ± 0.08 | 97.00 ± 0.17 | 100.00 ± 0.00 | 100.00 ± 0.00 | 85.33 ± 0.31 | 100.00 ± 0.00 | 100.00 ± 0.00 | 94.91 ± 0.09 |
| Retrain | 96.43 ± 0.19 | 99.52 ± 0.10 | 96.75 ± 0.13 | 83.60 ± 0.31 | 100.00 ± 0.00 | 83.12 ± 0.23 | 94.33 ± 0.24 | 100.00 ± 0.00 | 94.57 ± 0.06 |
| Adv-R (Sr) | 98.17 ± 0.16 | 99.33 ± 0.18 | 96.78 ± 0.13 | 83.81 ± 0.44 | 100.00 ± 0.00 | 83.08 ± 0.34 | 94.38 ± 0.11 | 100.00 ± 0.00 | 94.54 ± 0.09 |
| Adv-R (Sn) | 96.34 ± 0.11 | 98.65 ± 0.19 | 96.60 ± 0.14 | 82.40 ± 0.39 | 100.00 ± 0.00 | 81.85 ± 0.44 | 94.64 ± 0.20 | 100.00 ± 0.00 | 94.75 ± 0.04 |
| Adv-F | 99.30 ± 0.13 | 99.33 ± 0.10 | 96.94 ± 0.14 | 100.00 ± 0.00 | 100.00 ± 0.00 | 85.20 ± 0.24 | 100.00 ± 0.00 | 100.00 ± 0.00 | 94.91 ± 0.07 |
| im-Original | 60.29 ± 13.07 | 97.00 ± 2.60 | 96.88 ± 0.07 | 100.00 ± 0.00 | 100.00 ± 0.00 | 85.44 ± 0.22 | 100.00 ± 0.00 | 100.00 ± 0.00 | 94.66 ± 0.30 |
| im-Retrain | 38.76 ± 13.41 | 95.86 ± 4.34 | 89.92 ± 5.70 | 24.25 ± 6.98 | 90.88 ± 5.03 | 65.22 ± 5.94 | 33.08 ± 11.97 | 95.19 ± 3.78 | 83.89 ± 5.83 |
| im-Adv-R (Sr) | 39.48 ± 12.20 | 99.71 ± 0.19 | 91.04 ± 4.80 | 25.94 ± 6.89 | 96.00 ± 4.90 | 76.51 ± 4.32 | 34.76 ± 12.06 | 98.00 ± 4.01 | 87.63 ± 6.11 |
| im-Adv-R (Sn) | 42.90 ± 11.87 | 97.96 ± 0.59 | 92.80 ± 4.54 | 23.97 ± 8.75 | 91.46 ± 1.81 | 67.34 ± 4.02 | 33.92 ± 11.85 | 99.37 ± 0.20 | 84.87 ± 5.42 |
| im-Adv-F | 64.21 ± 9.89 | 97.28 ± 3.34 | 96.81 ± 0.04 | 100.00 ± 0.00 | 100.00 ± 0.00 | 85.11 ± 0.21 | 100.00 ± 0.00 | 100.00 ± 0.00 | 94.77 ± 0.09 |

From the results, we can obtain the following observations: 1) in the normal setting, the retraining-based adversarial method has a utility similar to naive retraining; 2) in the class-imbalanced setting, the retraining-based adversarial method has a much better utility than naive retraining, which means model providers can benefit more when data heterogeneity exists in the unlearning process; 3) in both settings, the forging-based adversarial method has the best utility of the unlearned model (close to the original training).

Efficiency. In this experiment, we compare the computational efficiency between naive retraining and our adversarial unlearning. For the retraining-based adversarial method, we use the nearest neighbor selection to choose the mini-batches and obtain the nearest-neighbor mapping by ranking the pre-computed distances between the unlearned data and the retained data. For the forging-based adversarial method, we adopt five processes in parallel to compute the PoRT. We record the average execution time for each epoch and show the results in Figure 5.

[Figure 5. Comparison of the execution time per epoch among original training, naive retraining, and the adversarial unlearning methods (Adv-F, Adv-R-Sn) over the three real-world datasets.]

From the results, we can obtain that 1) the forging-based method has the shortest execution time; 2) the time cost of forging in practice can be longer than the theoretical results due to the resource bottleneck of the adopted single GPU; 3) the retraining-based method is slightly slower than naive retraining due to the nearest-neighbor calculation, but our proposed nearest neighbor selection is much faster than random sampling.

5.3. Mini-batch Selection: Nearest Neighbor vs Random Sampling

In Section 4.2, we introduced two mini-batch selection strategies for the retraining-based method: random sampling S_r and nearest neighbor S_n. Additionally, in Section 5.2, we demonstrated that our nearest neighbor selection strategy is faster than random sampling. This section delves deeper into comparing these two strategies.
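For reference, the gradient distance plotted in Figure 6(b) and 6(c) is the quantity minimized in Equation (3); a minimal way to measure it for any substituted batch is sketched below (same flat-parameter loss convention as the earlier sketches; `gradient_distance` is our own name).

```python
import torch

def batch_grad(loss_fn, w, xb, yb):
    """Gradient of the mean batch loss at flat parameters w."""
    w = w.detach().clone().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_fn(w, xb, yb), w)
    return g

def gradient_distance(loss_fn, w, orig_batch, substituted_batch):
    """|| grad L(w, d^(t)) - grad L(w, d_r^(t)) ||: the selection criterion in
    Eq. (3) and the quantity compared between S_r and S_n in this section."""
    (xo, yo), (xs, ys) = orig_batch, substituted_batch
    return torch.norm(batch_grad(loss_fn, w, xo, yo) -
                      batch_grad(loss_fn, w, xs, ys)).item()
```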
[Figure 6. Comparison of the two mini-batch selection strategies, random sampling S_r and nearest neighbor S_n: (a) selection time of the two strategies; (b) gradient distance between the selected batches and the original batch; (c) gradient distance on different datasets.]

Figure 6(a) and Figure 6(b) illustrate, respectively, the selection time (per batch) of the two strategies and the gradient distance (per epoch) between the selected batches and their corresponding original batches containing unlearned data. We conduct these experiments on the MNIST dataset, varying the number of batch samples M for the random sampling strategy; in this strategy, M candidate batches are selected, and the one with the smallest gradient distance to the original batch is chosen. We tested M values of 5, 10, 25, and 50. From Figure 6(a), we observe that: i) in the random sampling strategy, the selection time increases with M due to the increased gradient computations; ii) our nearest neighbor selection consistently outperforms random sampling in speed, even with smaller M values, by avoiding extra gradient computations. Figure 6(b) shows that: i) larger M values in random sampling lead to better batch selection; ii) nearest neighbor selection performs better than random sampling with small M values and is comparable when M is large. Further, we extended our experiments to two additional datasets with M = 50 for the random sampling strategy. As depicted in Figure 6(c), our strategy exhibits similar performance to random sampling. In summary, the nearest neighbor strategy offers significantly faster selection times and comparable model training performance compared to the random sampling strategy.

6. Conclusion

In this paper, we expose the vulnerability in the verification of MUL. In particular, we summarize current verification strategies into two types. Regarding both types, we propose two adversarial unlearning processes that circumvent the verification while preserving the information of unlearned data.
Our theoretical and empirical results highlight the following conclusions: the adversarial unlearning method can circumvent the verification methods while improving the model utility without extra computation costs. Put another way of thinking, by adopting an adversarial unlearning method, a dishonest model provider gains a higher model utility (still memorizes the unlearned data) after unlearning without compromising efficiency. Furthermore, the retraining-based adversarial method can perfectly circumvent all existing verification methods relying on the natural similarity within the training data and the randomness of the training algorithm. This threat poses a novel challenge for the safe verification of machine unlearning: there is no strict guarantee for the verification of MUL. Better verification methods are needed to obtain precise and reliable verification results. Verification of Machine Unlearning is Fragile Acknowledgements This work is supported in part by the National Science Foundation under grants (IIS-2006844, IIS-2144209, IIS2223769, CNS-2154962, BCS-2228534, ECCS-2033671, ECCS-2143559, CPS-2313110, and SII-2132700), the Commonwealth Cyber Initiative Awards under grants (VV-1Q23007, HV-2Q23-003, and VV-1Q24-011), the JP Morgan Chase Faculty Research Award, and the Cisco Faculty Research Award. Impact Statement Our proposed adversarial unlearning methods in this paper can threaten the safety of the verification of MUL. Generally, we can consider that a model provider trains an ML model based on some personal data collected from data owners. According to the legislation (CCPA, 2018; GDPR, 2016), the data owners have the right to have their data removed from the ML model and take actions to verify the efficacy of unlearning. However, using our proposed adversarial unlearning algorithms, the model provider can make the data owners (verifiers) believe that their data has been unlearned while preserving the information of these personal data. Despite that, our research has a more significant positive influence compared with the potential risks. The study of MUL verification is still at a nascent stage. Given the deficient understanding of the safety of MUL verification, our study highlights the vulnerability of current verification strategies of MUL and inspires further research on the safe verification of MUL. Moreover, we tentatively provide discussions on the weak points of our proposed adversarial unlearning strategies and insights for detecting these misbehaviors in Appendix C. Ajalloeian, A. and Stich, S. U. On the convergence of sgd with biased gradients. ar Xiv preprint ar Xiv:2008.00051, 2020. Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C. A., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141 159, 2021. Boyd, S. P. and Vandenberghe, L. Convex optimization. Cambridge university press, 2004. Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy (SP), pp. 463 480, 2015. CCPA. California consumer privacy act. 2018. URL https://oag.ca.gov/privacy/ccpa. Che, T., Zhou, Y., Zhang, Z., Lyu, L., Liu, J., Yan, D., Dou, D., and Huan, J. Fast federated machine unlearning with nonlinear functional theory. In International conference on machine learning, pp. 4241 4268, 2023. Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. When machine unlearning jeopardizes privacy. 
In Proceedings of the 2021 ACM SIGSAC conference on computer and communications security, pp. 896 911, 2021. Chen, M., Zhang, Z., Wang, T., Backes, M., Humbert, M., and Zhang, Y. Graph unlearning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 499 513, 2022. Chien, E., Pan, C., and Milenkovic, O. Efficient model updates for approximate unlearning of graph-structured data. In International Conference on Learning Representations, 2023. Chundawat, V. S., Tarun, A. K., Mandal, M., and Kankanhalli, M. Zero-shot machine unlearning. IEEE Transactions on Information Forensics and Security, 2023. Costan, V. and Devadas, S. Intel sgx explained. Cryptology e Print Archive, 2016. Eisenhofer, T., Riepel, D., Chandrasekaran, V., Ghosh, E., Ohrimenko, O., and Papernot, N. Verifiable and provably secure machine unlearning. ar Xiv preprint ar Xiv:2210.09126, 2022. Gao, X., Ma, X., Wang, J., Sun, Y., Li, B., Ji, S., Cheng, P., and Chen, J. Verifi: Towards verifiable federated unlearning. ar Xiv preprint ar Xiv:2205.12709, 2022. GDPR. General data protection regulation. 2016. URL https://gdpr-info.eu/. Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304 9312, 2020. Golatkar, A., Achille, A., Ravichandran, A., Polito, M., and Soatto, S. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 792 801, 2021. Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT press, 2016. Guo, C., Goldstein, T., Hannun, A., and Van Der Maaten, L. Certified data removal from machine learning models. In International Conference on Machine Learning, pp. 3832 3842, 2020. Verification of Machine Unlearning is Fragile Guo, Y., Zhao, Y., Hou, S., Wang, C., and Jia, X. Verifying in the dark: Verifiable machine unlearning by using invisible backdoor triggers. IEEE Transactions on Information Forensics and Security, 2023. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008 2016, 2021. Kim, J. and Woo, S. S. Efficient two-stage model retraining for machine unlearning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4361 4369, 2022. Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International conference on machine learning, pp. 1885 1894, 2017. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. maintainers, T. and contributors. Torchvision: Pytorch s computer vision library. https://github.com/ pytorch/vision, 2016. Mehta, R., Pal, S., Singh, V., and Ravi, S. N. Deep unlearning via randomized conditionally independent hessians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10422 10431, 2022. 
Neel, S., Roth, A., and Sharifi-Malvajerdi, S. Descent-todelete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, pp. 931 962, 2021. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011. Nguyen, T. T., Huynh, T. T., Nguyen, P. L., Liew, A. W.-C., Yin, H., and Nguyen, Q. V. H. A survey of machine unlearning. ar Xiv preprint ar Xiv:2209.02299, 2022. Pan, C., Chien, E., and Milenkovic, O. Unlearning graph classifiers with limited data resources. In Proceedings of the ACM Web Conference 2023, pp. 716 726, 2023. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. Sekhari, A., Acharya, J., Kamath, G., and Suresh, A. T. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075 18086, 2021. Setty, S. Spartan: Efficient and general-purpose zksnarks without trusted setup. In Annual International Cryptology Conference, pp. 704 737, 2020. Shafran, A., Peleg, S., and Hoshen, Y. Membership inference attacks are easier on difficult problems. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14820 14829, 2021. Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp. 3 18, 2017. Shumailov, I., Shumaylov, Z., Kazhdan, D., Zhao, Y., Papernot, N., Erdogdu, M. A., and Anderson, R. J. Manipulating sgd with data ordering attacks. Advances in Neural Information Processing Systems, 34:18021 18032, 2021. Sommer, D. M., Song, L., Wagh, S., and Mittal, P. Athena: Probabilistic verification of machine unlearning. Proc. Privacy Enhancing Technol, 3:268 290, 2022. Tarun, A. K., Chundawat, V. S., Mandal, M., and Kankanhalli, M. Deep regression unlearning. In International Conference on Machine Learning, pp. 33921 33939, 2023. Thudi, A., Jia, H., Shumailov, I., and Papernot, N. On the necessity of auditable algorithmic definitions for machine unlearning. In 31st USENIX Security Symposium (USENIX Security 22), pp. 4007 4022, 2022. Ullah, E., Mai, T., Rao, A., Rossi, R. A., and Arora, R. Machine unlearning via algorithmic stability. In Conference on Learning Theory, pp. 4126 4142, 2021. Warnecke, A., Pirch, L., Wressnegger, C., and Rieck, K. Machine unlearning of features and labels. In Network and Distributed System Security Symposium, NDSS, 2023. Weng, J., Yao, S., Du, Y., Huang, J., Weng, J., and Wang, C. Proof of unlearning: Definitions and instantiation. ar Xiv preprint ar Xiv:2210.11334, 2022. Wu, J., Yang, Y., Qian, Y., Sui, Y., Wang, X., and He, X. Gif: A general graph unlearning strategy via influence Verification of Machine Unlearning is Fragile function. In Proceedings of the ACM Web Conference 2023, pp. 651 661, 2023a. Wu, K., Shen, J., Ning, Y., Wang, T., and Wang, W. H. Certified edge unlearning for graph neural networks. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2606 2617, 2023b. Xu, H., Zhu, T., Zhang, L., Zhou, W., and Yu, P. S. Machine unlearning: A survey. ACM Computing Surveys, pp. 1 36, 2023. Yuan, L., Tay, F. E., Li, G., Wang, T., and Feng, J. Revisiting knowledge distillation via label smoothing regularization. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3903 3911, 2020. Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., and Khazaeni, Y. Bayesian nonparametric federated learning of neural networks. In International conference on machine learning, pp. 7252 7261. PMLR, 2019. Zhang, Z., Zhou, Y., Zhao, X., Che, T., and Lyu, L. Prompt certified machine unlearning with randomized gradient smoothing and quantization. Advances in Neural Information Processing Systems, 35:13433 13455, 2022. Verification of Machine Unlearning is Fragile A.1. Proof of Proposition 4.2 Proof. To prove that Algorithm 1 returns a valid Po RT, we verify the two properties of a valid Po RT in Definition 4.1. First, we verify the removable property. For any d Sr(w(t 1) r ; d(t)) Sn(d(t)), we have d D\Du. Hence, we have d(t) r Du = . We then verify the reproducible property. It is obvious that w(t) r g(t) r (w(t 1) r , d(t) r ) = 0 and the reproducible property holds for ε = 0. In conclusion, Algorithm 1 returns a valid Po RT. A.2. Proof of Proposition 4.3 Proof. Recall that Algorithm 1 can be seen as a mini-batch SGD with the following model updating function: w(t+1) r = w(t) r γ L(w(t) r , d(t+1) r ) = w(t) r γ L(w(t) r , Sn(d(t+1))), (9) where d(t+1) D. It is worth noting that d(t+1) is randomly chosen from D, while d(t+1) r relies on the selection of d(t+1) and the unlearned set Du. If we take the expectation value over both sides, it will be difficult to directly compute the expectation E[ L(w(t) r , d(t+1) r )] on the right-hand side. Consequently, we let b(t) = L(w(t) r , d(t+1) r ) L(w(t) r , d(t+1)) and have w(t+1) r = w(t) r γ( L(w(t) r , d(t+1)) + b(t)). (10) For simplicity, we denote L(w) as the loss over the whole training set L(w, D) and Lt(w) as the loss over the original mini-batch at the t-th iteration L(w, d(t+1)). Next, we focus on the loss function value L(w(t+1) r ) = L(w(t) r γ( Lt(w(t) r ) + b(t))) = L(w(t) r ) L(w(t) r ) γ( Lt(w(t) r ) + b(t)) + γ2 2 ( Lt(w(t) r ) + b(t)) 2L(ξ(t))( Lt(w(t) r ) + b(t)) L(w(t) r ) γ L(w(t) r ) ( Lt(w(t) r ) + b(t)) + γ2L 2 L(w(t) r ) + b(t) 2 (11) The second equality sign is based on the first-order Taylor s theorem with γ2 2 ( Lt(w(t) r ) + b(t)) 2L(ξ(t))( Lt(w(t) r ) + b(t)) as the Lagrange remainder. Given w(t) r , we take the expectation over the randomly selected mini-batch d(t) for both sides and obtain E[L(w(t+1) r )] E[L(w(t) r ) γ L(w(t) r ) ( Lt(w(t) r ) + b(t)) + γ2L 2 L(w(t) r ) + b(t) 2] E[L(w(t) r )] γE[ L(w(t) r ) Lt(w(t) r )] γE[ L(w(t) r ) b(t)] + γ2L 2 E[ L(w(t) r ) + b(t) 2]) E[L(w(t) r )] γ L(w(t) r ) 2 γ(1 γL)E[ L(w(t) r ) b(t)] + γ2L 2 (G2 + E[ b(t) 2]). For each iteration t, we have one corresponding inequality as Equation (12). We then sum up the Equation (12) for t = 0 to T 1 and have t=0 E[L(w(t+1) r )] t=0 E[L(w(t) r )] γ L(w(t) r ) 2 γ(1 γL)E[ L(w(t) r ) b(t)] + γ2L 2 (G2 + E[ b(t) 2]). (13) After telescoping Equation (13), we have t=0 L(w(t) r ) 2 + (1 γL)E[ L(w(t) r ) b(t)] L(w(0) r ) L(w(T ) r ) + γ2L t=0 (G2 + E[ b(t) 2]). (14) Verification of Machine Unlearning is Fragile Without loss of generality, we specify the norm as ℓ-2 norm and have b(t) = L(w(t) r , d(t+1) r ) L(w(t) r , d(t+1)) i=1 wl(w(t) r , d(t+1) ri ) wl(w(t) r , d(t+1) i ) i=1 wl(w(t) r , d(t+1) ri ) wl(w(t) r , d(t+1) i ) . In particular, we use d(t+1) ri and d(t+1) i to denote the i-th data point in the mini-batch d(t+1) r and d(t+1). 
We let $d_{ri}^{(t+1)} = (x_{ri}^{(t+1)}, y_{ri}^{(t+1)})$ and $d_i^{(t+1)} = (x_i^{(t+1)}, y_i^{(t+1)})$, denoting the pairs of feature vector and label. It is worth noting that $y_{ri}^{(t+1)} = y_i^{(t+1)}$ according to the definition of $S_n$. We denote by $w[j]$ the $j$-th element of $w$. Given $y_{ri}^{(t+1)} = y_i^{(t+1)}$, we omit the variable $y$ in $l(\cdot)$ and have
$$\Big|\frac{\partial l(w_r^{(t)}, x_{ri}^{(t+1)})}{\partial w[j]} - \frac{\partial l(w_r^{(t)}, x_i^{(t+1)})}{\partial w[j]}\Big| = \Big|\int_{s=0}^{1} \big(x_{ri}^{(t+1)} - x_i^{(t+1)}\big)^\top \frac{\partial^2 l\big(w_r^{(t)}, x_i^{(t+1)} + s(x_{ri}^{(t+1)} - x_i^{(t+1)})\big)}{\partial w[j]\,\partial x}\, ds\Big| \le \int_{s=0}^{1} \big\|x_{ri}^{(t+1)} - x_i^{(t+1)}\big\| \cdot \Big\|\frac{\partial^2 l\big(w_r^{(t)}, x_i^{(t+1)} + s(x_{ri}^{(t+1)} - x_i^{(t+1)})\big)}{\partial w[j]\,\partial x}\Big\|\, ds \le \big\|x_{ri}^{(t+1)} - x_i^{(t+1)}\big\| \cdot \sup_{w, x}\Big\|\frac{\partial^2 l(w, x)}{\partial w[j]\,\partial x}\Big\|. \quad (16)$$
We then incorporate Equation (16) into Equation (15) and obtain that
$$\sum_{i=1}^{m} \big\|\nabla_w l(w_r^{(t)}, d_{ri}^{(t+1)}) - \nabla_w l(w_r^{(t)}, d_i^{(t+1)})\big\| \le \sum_{i=1}^{m} L_x \big\|x_{ri}^{(t+1)} - x_i^{(t+1)}\big\|. \quad (17)$$
In addition, the unlearned data cannot be involved in every single iteration when $|D_u| < T$. When $d^{(t+1)} \cap D_u = \emptyset$, we have $d^{(t+1)} = d_r^{(t+1)}$ and $b^{(t)} = 0$. Considering that $d^{(t+1)}$ is chosen uniformly from $D$, we can compute the probability of $d^{(t+1)} \cap D_u \ne \emptyset$ as $p_u = 1 - (1 - |D_u|/|D|)^m$. Consequently, when $T \to \infty$, we can consider that only $p_u T$ iterations contain the bias term $b^{(t)}$, and we have $b^{(t)} = 0$ for the remaining $(1 - p_u)T$ iterations. We then incorporate Equation (17) back into Equation (14) and have
$$\gamma \sum_{t=0}^{T-1} \|\nabla L(w_r^{(t)})\|^2 - \gamma(1 - \gamma L) p_u T G B \le \gamma \sum_{t=0}^{T-1} \Big(\|\nabla L(w_r^{(t)})\|^2 + (1 - \gamma L)\mathbb{E}[\nabla L(w_r^{(t)})^\top b^{(t)}]\Big) \le L(w_r^{(0)}) - L(w_r^{(T)}) + \frac{\gamma^2 L}{2}\sum_{t=0}^{T-1}\big(G^2 + \mathbb{E}[\|b^{(t)}\|^2]\big) \le L(w_r^{(0)}) - L(w_r^{(T)}) + \frac{\gamma^2 L T}{2}\big(G^2 + p_u B^2\big). \quad (18)$$
Finally, we can finish the proof by
$$\mathbb{E}_T\big[\|\nabla L(w_r^{(T)})\|^2\big] = \frac{1}{T}\sum_{t=0}^{T-1} \|\nabla L(w_r^{(t)})\|^2 \le \frac{L(w_r^{(0)}) - L(w_r^{(T)})}{\gamma T} + \frac{\gamma L}{2}\big(G^2 + p_u B^2\big) + (1 - \gamma L) p_u G B.$$

A.3. Proof of Proposition 4.4

Proof. To prove that Algorithm 2 returns a valid PoRT, we verify the two properties of a valid PoRT in Definition 4.1. First, we verify the removable property. For the $t$-th iteration, if $t \in I_e$, then $d_r^{(t)} \cap D_u = d^{(t)} \cap D_u = \emptyset$ holds; if $t \in I_n$, then
$$d_r^{(t)} \cap D_u = S_n(d^{(t)}) \cap D_u = \big((d^{(t)} \setminus D_u) \cap D_u\big) \cup \big(\{N(x, y) \mid (x, y) \in d^{(t)} \cap D_u\} \cap D_u\big) = \Big\{\arg\min_{(x', y') \in D \setminus D_u,\, y' = y} \|x' - x\| \,\Big|\, (x, y) \in d^{(t)} \cap D_u\Big\} \cap D_u = \emptyset.$$
The last equality holds because each element $\arg\min_{(x', y') \in D \setminus D_u,\, y' = y} \|x' - x\|$ belongs to $D \setminus D_u$. Hence, we have verified the removable property of Algorithm 2. Next, we verify the reproducible property. For any $t \in I$, we have $t \in I_e$ or $t \in I_n$. We discuss these two conditions separately. For simplicity, we denote $p^{(t)} = w_r^{(t)} - w^{(t)}$.

(1) $t \in I_e$: we have $d_r^{(t)} = d^{(t)}$ and
$$\begin{aligned} w_r^{(t)} - g_r^{(t)}(w_r^{(t-1)}, d_r^{(t)}) &= w^{(t)} + p^{(t)} - w_r^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d_r^{(t)}) \\ &= w^{(t)} + p^{(t)} - w^{(t-1)} - p^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d_r^{(t)}) \\ &= p^{(t)} - p^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d^{(t)}) - \gamma^{(t)} \nabla L(w^{(t-1)}, d^{(t)}). \end{aligned}$$
We then focus on the norm value and have
$$\big\|w_r^{(t)} - g_r^{(t)}(w_r^{(t-1)}, d_r^{(t)})\big\| = \big\|p^{(t)} - p^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d^{(t)}) - \gamma^{(t)} \nabla L(w^{(t-1)}, d^{(t)})\big\| \le \big\|p^{(t)} - p^{(t-1)}\big\| + \gamma^{(t)} L \big\|w_r^{(t-1)} - w^{(t-1)}\big\| = \big\|p^{(t)} - p^{(t-1)}\big\| + \gamma^{(t)} L \big\|p^{(t-1)}\big\|.$$

(2) $t \in I_n$: we have $d_r^{(t)} = S_n(d^{(t)})$ and
$$\begin{aligned} w_r^{(t)} - g_r^{(t)}(w_r^{(t-1)}, d_r^{(t)}) &= w^{(t)} + p^{(t)} - w_r^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d_r^{(t)}) \\ &= w^{(t-1)} - \gamma^{(t)} \nabla L(w^{(t-1)}, d^{(t)}) + p^{(t)} - w_r^{(t-1)} + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d_r^{(t)}) \\ &= p^{(t)} - p^{(t-1)} + \gamma^{(t)} \nabla L(w^{(t-1)}, d_r^{(t)}) - \gamma^{(t)} \nabla L(w^{(t-1)}, d^{(t)}) + \gamma^{(t)} \nabla L(w_r^{(t-1)}, d_r^{(t)}) - \gamma^{(t)} \nabla L(w^{(t-1)}, d_r^{(t)}). \end{aligned}$$
We then focus on the norm value and have
$$\big\|w_r^{(t)} - g_r^{(t)}(w_r^{(t-1)}, d_r^{(t)})\big\| \le \big\|p^{(t)} - p^{(t-1)}\big\| + \gamma^{(t)} \big\|\nabla L(w^{(t-1)}, d_r^{(t)}) - \nabla L(w^{(t-1)}, d^{(t)})\big\| + \gamma^{(t)} \big\|\nabla L(w_r^{(t-1)}, d_r^{(t)}) - \nabla L(w^{(t-1)}, d_r^{(t)})\big\| \le \big\|p^{(t)} - p^{(t-1)}\big\| + \gamma^{(t)} \big(\|b^{(t-1)}\| + L \|p^{(t-1)}\|\big),$$
where $b^{(t-1)} = \nabla L(w^{(t-1)}, d_r^{(t)}) - \nabla L(w^{(t-1)}, d^{(t)})$. Next, we find the upper bound of $\|p^{(t)}\|$.
When $t \in I_e$, we have $\|p^{(t)}\| = \gamma_r^{(t)} \|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\|$; when $t \in I_n$, we have
$$\|p^{(t)}\| = \big\|w^{(t-1)} - \gamma^{(t)} \nabla L(w^{(t-1)}, d^{(t)}) - w^{(t-1)} + \gamma^{(t)} \nabla L(w^{(t-1)}, d_r^{(t)})\big\| = \gamma^{(t)} \|b^{(t-1)}\|. \quad (25)$$
In addition, we know from Equation (17) that $\|b^{(t-1)}\| \le L_x C_D$. Combining the case $t \in I_e$ and the case $t \in I_n$, we have $\|p^{(t)}\| \le \max\{\gamma_r^{(t)} \|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\|,\ \gamma^{(t)} L_x C_D\} = P$. Finally, we verify the reproducible property by proving $\|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)} L \|p^{(t-1)}\| \le \varepsilon$ and $\|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)}(\|b^{(t-1)}\| + L\|p^{(t-1)}\|) \le \varepsilon$. Noting that $\|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)}(\|b^{(t-1)}\| + L\|p^{(t-1)}\|) \ge \|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)} L \|p^{(t-1)}\|$, we only need to verify the second inequality.

(i) If $\gamma_r^{(t)} \|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\| \le \gamma^{(t)} L_x C_D$, we have
$$\|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)}\big(\|b^{(t-1)}\| + L\|p^{(t-1)}\|\big) \le \|p^{(t)}\| + \|p^{(t-1)}\| + \gamma^{(t)}\big(\|b^{(t-1)}\| + L\|p^{(t-1)}\|\big) \le 3\gamma^{(t)} L_x C_D + (\gamma^{(t)})^2 L L_x C_D \le \varepsilon. \quad (26)$$
By solving this quadratic inequality, we obtain that when $\gamma^{(t)} \le \frac{1}{2L}\Big(\sqrt{9 + \frac{4\varepsilon L}{L_x C_D}} - 3\Big)$, Equation (26) holds.

(ii) If $\gamma_r^{(t)} \|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\| \ge \gamma^{(t)} L_x C_D$, we have
$$\|p^{(t)} - p^{(t-1)}\| + \gamma^{(t)}\big(\|b^{(t-1)}\| + L\|p^{(t-1)}\|\big) \le \gamma^{(t)} L_x C_D + \gamma_r^{(t)} \|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\| \big(2 + \gamma^{(t)} L\big) \le \varepsilon. \quad (27)$$
Consequently, we obtain that when $\gamma_r^{(t)} \le \frac{\varepsilon - \gamma^{(t)} L_x C_D}{\|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\|\,(2 + \gamma^{(t)} L)}$, Equation (27) holds. In order to avoid having no solution for $\gamma_r^{(t)}$, we also need
$$0 \le \frac{\gamma^{(t)} L_x C_D}{\|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\|\,(2 + \gamma^{(t)} L)} \le \gamma_r^{(t)} \le \frac{\varepsilon - \gamma^{(t)} L_x C_D}{\|\nabla l(f_{w^{(t)}}(x^{(t)}), y^{(t)})\|\,(2 + \gamma^{(t)} L)}. \quad (28)$$
By solving this inequality for $\gamma^{(t)}$, we again obtain $0 \le \gamma^{(t)} \le \frac{1}{2L}\big(\sqrt{9 + \frac{4\varepsilon L}{L_x C_D}} - 3\big)$. Combining the conclusions from the above cases and the assumption that $\|\nabla_w l(w, x)\| \le G$, we have obtained that when $0 \le \gamma^{(t)} \le \frac{1}{2L}\big(\sqrt{9 + \frac{4\varepsilon L}{L_x C_D}} - 3\big)$ and $0 \le \gamma_r^{(t)} \le \frac{\varepsilon - \gamma^{(t)} L_x C_D}{G(2 + \gamma^{(t)} L)}$, the reproducible property holds, i.e., $\|w_r^{(t)} - g_r^{(t)}(w_r^{(t-1)}, d_r^{(t)})\| \le \varepsilon$. In conclusion, by incorporating the reproducible property and the removable property, our proof is finished.

A.4. Proof of Proposition 4.5

Proof. We assume the time complexity of naive retraining is $T(n)$. For Algorithm 1, the only difference from naive retraining is the selection of mini-batches. At best, we can compute the nearest neighbor of each data sample beforehand and obtain a $T(n)$ complexity. For Algorithm 2, every iteration in the PoRT can be computed separately in parallel. In particular, we have $T(n) = n \cdot m \cdot T(1)$, where $n$ is the number of iterations, $m$ is the batch size, and $T(1)$ denotes the time for computing the gradient of one data sample. If $t \in I_e$, computing $w_r^{(t)}$ requires a complexity of $T(1)$; if $t \in I_n$, computing $w_r^{(t)}$ requires a complexity of $m \cdot T(1)$. According to the analysis in Appendix A.2, when $n \to \infty$, the number of iterations in $I_n$ approaches $p_u n$ and the number of iterations in $I_e$ approaches $(1 - p_u)n$. Finally, with the parallel processes, the overall complexity is $P + T(1)(1 - p_u)n$.

B. Experimental Setups and Additional Experimental Results

We implemented all experiments with the PyTorch (Paszke et al., 2019) library and used SGD as the optimizer for training. For consistent hyperparameter settings across all datasets, we fix the learning rate $\gamma^{(t)}$ as $10^{-2}$, the weight decay parameter as $5 \times 10^{-4}$, and the number of training epochs as 30, and set the batch size to 128. In determining the selection $S_r$, specifically for data batches containing unlearned data, we randomly sample 50 data batches from the retained set $D \setminus D_u$ and select the batch that yields the smallest distance, as defined in Equation (3). All experiments were conducted on an NVIDIA RTX A6000 GPU. We report the average value and the standard deviation of the numerical results under five different random seeds. For the experiment on verification errors in Figure 4, we set the learning rate $\gamma^{(t)}$ as $5 \times 10^{-3}$ and the weight decay parameter as 0. We fix the size of the unlearned set as 1,000 and set the learning rate $\gamma_r^{(t)}$ as $10^{-3}$.

Table 4. Model Utility in Normal Settings. We assess the model utility among original training, naive retraining, and adversarial unlearning methods over three popular DNNs across three real-world datasets. We record the macro F1-score (mean ± standard deviation over five seeds) of the predictions on the unlearned set Du, retained set D\Du, and test set Dt.

Model | Method | MNIST (Du / D\Du / Dt) | CIFAR-10 (Du / D\Du / Dt) | SVHN (Du / D\Du / Dt)
MLP | Original | 99.47±0.09 / 99.76±0.08 / 97.00±0.17 | 80.96±0.15 / 80.75±0.96 / 50.54±0.57 | 89.56±0.78 / 89.40±0.45 / 80.03±0.45
MLP | Retrain | 96.43±0.19 / 99.52±0.10 / 96.75±0.13 | 49.32±0.59 / 82.96±1.29 / 49.29±0.77 | 82.42±0.22 / 89.83±0.39 / 79.50±0.37
MLP | Adv-R (Sr) | 96.36±0.13 / 99.48±0.17 / 96.58±0.12 | 48.49±0.41 / 82.33±0.40 / 48.27±0.48 | 81.85±0.85 / 89.31±0.86 / 78.63±0.71
MLP | Adv-R (Sn) | 96.34±0.11 / 98.65±0.19 / 96.60±0.14 | 50.43±0.17 / 80.01±1.30 / 49.86±0.23 | 82.98±0.57 / 92.60±0.56 / 79.77±0.57
MLP | Adv-F | 99.30±0.13 / 99.33±0.10 / 96.94±0.14 | 81.13±0.76 / 81.38±0.79 / 50.81±0.63 | 89.99±0.71 / 89.77±0.55 / 80.11±0.85
CNN | Original | 99.69±0.31 / 99.74±0.30 / 99.13±0.25 | 100.00±0.00 / 100.00±0.00 / 85.33±0.31 | 99.64±0.73 / 99.61±0.77 / 94.66±1.00
CNN | Retrain | 99.31±0.12 / 100.00±0.00 / 99.39±0.05 | 83.60±0.31 / 100.00±0.00 / 83.12±0.23 | 94.24±0.22 / 100.00±0.00 / 94.96±0.10
CNN | Adv-R (Sr) | 99.24±0.13 / 100.00±0.00 / 99.27±0.02 | 83.81±0.44 / 100.00±0.00 / 83.08±0.34 | 94.29±0.17 / 100.00±0.00 / 94.90±0.07
CNN | Adv-R (Sn) | 99.35±0.09 / 100.00±0.00 / 99.40±0.02 | 83.12±0.31 / 100.00±0.00 / 82.71±0.33 | 94.36±0.32 / 100.00±0.00 / 94.92±0.15
CNN | Adv-F | 100.00±0.00 / 100.00±0.00 / 99.46±0.08 | 100.00±0.00 / 100.00±0.00 / 85.20±0.24 | 100.00±0.00 / 99.99±0.01 / 95.13±0.10
ResNet | Original | 100.00±0.00 / 100.00±0.00 / 99.53±0.05 | 99.99±0.03 / 99.99±0.03 / 84.11±0.74 | 100.00±0.00 / 100.00±0.00 / 94.91±0.09
ResNet | Retrain | 99.41±0.09 / 100.00±0.00 / 99.46±0.07 | 82.76±0.38 / 100.00±0.00 / 82.10±0.39 | 94.33±0.24 / 100.00±0.00 / 94.57±0.06
ResNet | Adv-R (Sr) | 99.43±0.09 / 100.00±0.00 / 99.46±0.04 | 82.29±0.98 / 99.99±0.01 / 81.71±0.78 | 94.38±0.11 / 100.00±0.00 / 94.54±0.09
ResNet | Adv-R (Sn) | 99.41±0.06 / 100.00±0.00 / 99.42±0.04 | 82.40±0.39 / 100.00±0.00 / 81.85±0.44 | 94.64±0.20 / 100.00±0.00 / 94.75±0.04
ResNet | Adv-F | 100.00±0.00 / 100.00±0.00 / 99.49±0.03 | 100.00±0.00 / 100.00±0.00 / 84.54±0.37 | 100.00±0.00 / 100.00±0.00 / 94.91±0.07

Table 5. Model Utility in Class-Imbalanced Settings. We assess the model utility among original training, naive retraining, and adversarial unlearning methods over three popular DNNs across three real-world datasets. We record the macro F1-score (mean ± standard deviation over five seeds) of the predictions on the unlearned set Du, retained set D\Du, and test set Dt.

Model | Method | MNIST (Du / D\Du / Dt) | CIFAR-10 (Du / D\Du / Dt) | SVHN (Du / D\Du / Dt)
MLP | Original | 60.29±13.07 / 97.00±2.60 / 96.88±0.07 | 34.50±6.41 / 75.68±2.29 / 50.95±0.19 | 39.13±7.26 / 84.02±3.90 / 80.50±0.83
MLP | Retrain | 38.76±13.41 / 95.86±4.34 / 89.92±5.70 | 13.29±4.49 / 77.93±2.83 / 44.81±1.70 | 22.65±8.11 / 76.87±3.93 / 67.33±4.90
MLP | Adv-R (Sr) | 39.48±12.20 / 99.71±0.19 / 91.04±4.80 | 13.13±4.32 / 81.21±3.21 / 44.19±1.81 | 26.33±8.99 / 87.25±2.74 / 72.09±4.50
MLP | Adv-R (Sn) | 42.90±11.87 / 97.96±0.59 / 92.80±4.54 | 16.17±4.76 / 67.25±2.15 / 45.74±1.80 | 28.06±9.45 / 82.16±1.65 / 70.67±3.55
MLP | Adv-F | 64.21±9.89 / 97.28±3.34 / 96.81±0.04 | 35.46±6.74 / 75.08±1.68 / 51.06±0.86 | 40.10±7.87 / 82.78±4.54 / 79.70±1.08
CNN | Original | 100.00±0.00 / 100.00±0.00 / 99.48±0.06 | 100.00±0.00 / 100.00±0.00 / 85.44±0.22 | 100.00±0.00 / 100.00±0.00 / 95.13±0.07
CNN | Retrain | 47.64±15.76 / 100.00±0.00 / 94.98±5.35 | 24.25±6.98 / 90.88±5.03 / 65.22±5.94 | 32.79±12.61 / 92.68±4.79 / 84.23±6.65
CNN | Adv-R (Sr) | 46.00±12.84 / 100.00±0.00 / 94.94±4.77 | 25.94±6.89 / 96.00±4.90 / 76.51±4.32 | 33.43±12.12 / 97.52±3.87 / 86.69±5.40
CNN | Adv-R (Sn) | 49.90±15.03 / 99.96±0.07 / 95.08±4.82 | 23.97±8.75 / 91.46±1.81 / 67.34±4.02 | 33.89±12.26 / 98.49±0.40 / 85.76±6.46
CNN | Adv-F | 92.49±15.01 / 99.97±0.05 / 99.45±0.12 | 100.00±0.00 / 100.00±0.00 / 85.11±0.21 | 100.00±0.00 / 100.00±0.00 / 95.14±0.05
ResNet | Original | 100.00±0.00 / 100.00±0.00 / 99.54±0.05 | 98.30±3.39 / 99.96±0.08 / 85.02±0.96 | 100.00±0.00 / 100.00±0.00 / 94.66±0.30
ResNet | Retrain | 50.74±14.51 / 99.00±2.00 / 95.89±5.40 | 21.80±6.43 / 89.13±2.84 / 66.86±3.29 | 33.08±11.97 / 95.19±3.78 / 83.89±5.83
ResNet | Adv-R (Sr) | 51.76±14.11 / 100.00±0.00 / 96.33±4.73 | 25.53±6.93 / 100.00±0.00 / 74.18±4.26 | 34.76±12.06 / 98.00±4.01 / 87.63±6.11
ResNet | Adv-R (Sn) | 50.61±11.18 / 100.00±0.00 / 96.54±4.27 | 23.83±6.84 / 96.53±1.05 / 68.86±4.00 | 33.92±11.85 / 99.37±0.20 / 84.87±5.42
ResNet | Adv-F | 100.00±0.00 / 100.00±0.00 / 99.54±0.03 | 97.13±5.74 / 99.98±0.03 / 84.04±0.46 | 100.00±0.00 / 100.00±0.00 / 94.77±0.09

B.1. Model Utility Study

B.1.1. Experimental Settings

All datasets utilized in our experiments adhere to the standard train/test split provided by the Torchvision library (maintainers & contributors, 2016). Within each experiment, 20% of the training data is set aside as the validation set. To assess model performance, we randomly sample 10% of the remaining training data to form the unlearn set. We report the average macro F1-score, along with its standard deviation, based on model predictions for the unlearn set Du, the retained set D\Du, and the test set Dt. These metrics are computed across five random seeds to ensure robustness. Additionally, to simulate scenarios where the data distribution of the unlearn set deviates from the overall training set, we introduce a class-imbalanced unlearning setting under a fixed train/val/test split. For each class c, we follow the approach of Yurochkin et al. (2019) by drawing a 5-dimensional vector qc from a Dirichlet distribution with parameter 0.5. We then assign data to the i-th piece proportionally to qc[i]. In each experiment, one piece of data is selected as the unlearn set Du, while the remaining pieces constitute the retained set D\Du. The average model performance is recorded across five experiments.

Table 6. Comparison of the model utility among original training, naive retraining, and adversarial unlearning methods based on ResNet-50 over the Tiny-ImageNet dataset. We record the macro F1-score (mean ± standard deviation) of the predictions on the unlearned set Du, retained set D\Du, and test set Dt.

Method | Normal (Du / D\Du / Dt) | Imbalanced (Du / D\Du / Dt)
Original | 90.34±0.47 / 90.21±0.43 / 36.59±0.74 | 91.76±0.64 / 91.42±0.59 / 33.81±0.60
Retrain | 31.03±0.62 / 93.78±0.57 / 31.08±0.33 | 8.59±1.94 / 95.79±18.38 / 26.19±1.91
Adv-R (Sr) | 31.26±0.46 / 92.83±0.63 / 31.78±0.16 | 8.98±2.95 / 95.65±18.65 / 26.76±1.17
Adv-R (Sn) | 31.82±0.26 / 93.09±0.42 / 31.76±0.43 | 8.70±0.96 / 92.83±5.36 / 26.80±1.20
Adv-F | 90.16±0.66 / 90.30±0.51 / 36.57±0.80 | 91.37±0.56 / 91.10±0.56 / 33.69±0.81

B.1.2. Additional Experimental Results

We provide the complete experimental results of the model utility for different types of unlearning strategies under the normal and class-imbalanced settings in Table 4 and Table 5, respectively. From the full results, we can observe that:
1. Naive retraining can cause a larger utility drop when the model under-fits the data and data heterogeneity exists in the unlearning process.
2. In the normal setting, the retraining-based adversarial method has similar utility to naive retraining, while in the class-imbalanced setting, the retraining-based adversarial method achieves much better utility than naive retraining, which means model providers can benefit more when data heterogeneity exists in the unlearning process.
3. The forging-based adversarial method performs much better than the other unlearning methods in all cases, even better than the original training in some cases, which means the model provider can benefit the most from forging (but also with a larger probability of being detected by backdoor verification).

To further show the scalability of our proposed methods, we conduct additional experiments using the ResNet-50 model (He et al., 2016) on the Tiny-ImageNet dataset (Le & Yang, 2015). The Tiny-ImageNet dataset is a subset of the ImageNet dataset. It consists of 200 object classes, and for each class it provides 500 training images, 50 validation images, and 50 test images. All images have been downsampled to 64×64×3 pixels. As the test set is released without labels, we use the validation set as the test set in our experiment. Within each experiment, 20% of the training data is set aside as the validation set, and the division of the unlearn set and other parameter settings are consistent with the main experiments in the paper. We train the ResNet-50 model from scratch for 50 epochs following the experimental setting of Yuan et al. (2020) and record the accuracy under 5 random seeds. For Sr, we randomly sample 10 mini-batches from the retained set D\Du and select the batch that yields the smallest distance, as defined in Equation (3) (see the sketch below for both selection rules). The experimental results on the utility of the unlearned model are shown in Table 6. Overall, the experimental results are consistent with Table 3 in our paper (though overfitting exists), i.e., Adv-F achieves comparable performance to the original model, and Adv-R performs better than naive retraining.
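To make the two batch-selection rules concrete, the following is a minimal PyTorch-style sketch. The S_n rule (replacing each unlearned point with its nearest retained neighbor of the same label) follows the definition used in the proof of Proposition 4.4; for S_r, the exact distance is the one in Equation (3), which is not reproduced in this appendix, so the gradient-difference norm below is only a stand-in assumption. All function and variable names are illustrative rather than part of our released implementation.

```python
import torch

def s_n(batch_x, batch_y, unlearn_mask, retained_x, retained_y):
    """S_n: replace each unlearned point with its nearest retained neighbor of the same label."""
    new_x = batch_x.clone()
    for i in torch.nonzero(unlearn_mask).flatten():
        same_label = (retained_y == batch_y[i]).nonzero().flatten()
        # nearest neighbor in feature space among retained points sharing the label
        dists = torch.norm(retained_x[same_label].flatten(1) - batch_x[i].flatten(), dim=1)
        new_x[i] = retained_x[same_label[dists.argmin()]]
    return new_x, batch_y  # labels are unchanged by construction

def s_r(model, loss_fn, batch_x, batch_y, retained_loader, n_candidates=50):
    """S_r: among candidate retained-only batches, keep the one whose gradient is closest
    to that of the original batch (a stand-in for the distance in Equation (3))."""
    def batch_grad(x, y):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    target = batch_grad(batch_x, batch_y)
    best, best_dist = None, float("inf")
    for i, (cx, cy) in enumerate(retained_loader):
        if i >= n_candidates:
            break
        d = torch.norm(batch_grad(cx, cy) - target).item()
        if d < best_dist:
            best, best_dist = (cx, cy), d
    return best
```

In either case the substituted batch contains no unlearned data, which is what makes the resulting trace satisfy the removable property while still mimicking the update the unlearned data would have produced.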
We also find that the difference between the normal setting and the imbalanced setting is not as distinct as in Table 3. We explain this as follows:
1. Tiny-ImageNet has 200 classes, while the datasets in Table 3 only have 10 classes. The impact of imbalanced sampling can be weakened when there are far more classes.
2. Tiny-ImageNet has only 400 training samples per class, while the datasets in Table 3 have over 5,000 samples. The effect of nearest-neighbor selection can be diminished because the constant $C_D$ (introduced in Proposition 4.3) might increase; subsequently, the gap between the adversarially unlearned model and the original model can be larger according to Proposition 4.3.

Table 7. Results of backdoor verification on different (adversarial) unlearning strategies over three datasets.

Method | MLP & MNIST (in-atk acc p / ex-atk acc q / β) | CNN & CIFAR-10 (in-atk acc p / ex-atk acc q / β) | ResNet & SVHN (in-atk acc p / ex-atk acc q / β)
Original | 0.998±0.007 / 0.101±0.010 / 2.61×10^-42 | 0.933±0.105 / 0.088±0.139 / 5.14×10^-20 | 0.982±0.003 / 0.095±0.001 / 1.11×10^-28
Retrain | 0.102±0.016 / 0.103±0.012 / 0.998 | 0.118±0.022 / 0.124±0.010 / 0.998 | 0.110±0.001 / 0.096±0.001 / 0.999
Adv-R | 0.103±0.017 / 0.102±0.015 / 0.997 | 0.129±0.021 / 0.102±0.009 / 0.997 | 0.109±0.001 / 0.096±0.001 / 0.999
Adv-F | 0.995±0.003 / 0.103±0.013 / 5.46×10^-34 | 0.914±0.119 / 0.100±0.050 / 2.78×10^-16 | 0.981±0.006 / 0.096±0.001 / 2.08×10^-26

B.2. Backdoor Verification

For backdoor verification, we exploit Athena (Sommer et al., 2022) as the verification strategy. Specifically, we consider two hypotheses: H0, the data has been unlearned, and H1, the data has not been unlearned. In assessing the effectiveness of a backdoor verification strategy applied to an algorithm $A$, we focus on the deletion confidence for a given acceptable tolerance of Type I error $\alpha$ ($\alpha = \Pr[\text{Reject } H_0 \mid H_0 \text{ is true}]$), i.e.,
$$\rho_{A,\alpha}(n) = 1 - \beta = 1 - \Pr[\text{Accept } H_0 \mid H_1 \text{ is true}]. \quad (30)$$
We follow Sommer et al. (2022) to compute the Type II error $\beta$ as a function of $\alpha$ and the number of testing samples $n$, i.e.,
$$\beta = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \cdot \mathbb{I}\Big[\sum_{l=0}^{k} \binom{n}{l} q^l (1-q)^{n-l} \le 1 - \alpha\Big], \quad (31)$$
where $q$ and $p$ represent the backdoor attack accuracy for deleted and undeleted data, respectively. The function $\mathbb{I}(x) = 1$ if $x$ is true and 0 otherwise. We apply a similar method to that of Sommer et al. (2022) to estimate $q$ and $p$:

Estimating p. To estimate $p$, we first introduce a specific backdoor pattern to 10% of the unlearn set. This involves randomly selecting four pixels in each sample, setting their values to 1, and assigning a target label $c_b$. The model is then trained on this partially backdoored dataset $D^b_{\text{train}}$. For evaluation, we extract 2% of the test samples $D^b_{\text{test}}$, apply the same backdoor pattern, and calculate the backdoor success rate for this trigger with the target label $c_b$ as
$$p = \frac{1}{|D^b_{\text{test}}|}\big|\{(x, y) \in D^b_{\text{test}} \mid A(D^b_{\text{train}})(x) = c_b\}\big|. \quad (32)$$

Estimating q. To estimate $q$, we randomly select an additional 2% of the test samples, denoted as $D^{ex}_{\text{test}}$. On these samples, a different backdoor pattern is applied. We then calculate the backdoor success rate, focusing on the trigger with an alternate target label $c_{ex}$. This enables us to determine $q$ as
$$q = \frac{1}{|D^{ex}_{\text{test}}|}\big|\{(x, y) \in D^{ex}_{\text{test}} \mid A(D^b_{\text{train}})(x) = c_{ex}\}\big|. \quad (33)$$

We set $\alpha$ to $10^{-3}$, $n$ to 30, and use the estimates of $p$ and $q$ to compute $\beta$ in Equation (31); a minimal sketch of this computation is given below. For the MNIST and CIFAR-10 datasets, the target labels $c_b$ and $c_{ex}$ are selected randomly, owing to the even distribution of their test data across classes.
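The sketch below implements the hypothesis-test computation as reconstructed in Equation (31) (the summation limits of the indicator were recovered from context and should be read as an assumption), together with the empirical success-rate estimation behind Equations (32)–(33). The `apply_trigger` helper and the model interface are illustrative, not part of the Athena implementation.

```python
from math import comb

def type_ii_error(p, q, n, alpha):
    """beta per the reconstructed Equation (31): probability of accepting H0 although H1 holds,
    with successes ~ Binomial(n, p) under H1 and ~ Binomial(n, q) under H0."""
    beta, cdf_q = 0.0, 0.0
    for k in range(n + 1):
        cdf_q += comb(n, k) * q**k * (1 - q)**(n - k)      # Pr[Bin(n, q) <= k]
        if cdf_q <= 1 - alpha:                             # k lies outside the rejection region
            beta += comb(n, k) * p**k * (1 - p)**(n - k)   # H1 mass that still accepts H0
    return beta

def backdoor_success_rate(model, samples, target_label, apply_trigger):
    """Estimate p (Eq. 32) or q (Eq. 33): fraction of triggered samples predicted as the target label."""
    hits = sum(1 for x, _ in samples
               if model(apply_trigger(x)).argmax().item() == target_label)
    return hits / len(samples)

# With the values used in our experiments (alpha = 1e-3, n = 30), a strong backdoor
# (p close to 1) against a near-random q gives a vanishing beta, i.e., deletion
# confidence rho = 1 - beta close to 1 (Equation (30)).
print(type_ii_error(p=0.99, q=0.10, n=30, alpha=1e-3))
```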
In contrast, for the SVHN dataset, we specifically choose $c_b = 3$ and $c_{ex} = 4$ for a better explanation, owing to its uneven data distribution. Notably, in the SVHN dataset, data labeled as 3 or 4 accounts for nearly one-tenth of the test data, which is significant given that the dataset comprises ten classes. We provide the complete experimental results of the backdoor verification in Table 7. We first clarify that if $p$ or $q$ approaches the level of random prediction, it suggests the corresponding backdoored data has not been utilized during the training process. Conversely, a high value of $p$ indicates the effectiveness of the backdoor strategy; in other words, a large $p$ signifies that the backdoor attack successfully influenced the model's behavior. The results in Table 7 reveal that the backdoor verification mechanism is highly effective for models trained on $D^b_{\text{train}}$. This effectiveness is primarily attributed to the substantial difference between $p$ and $q$. Additionally, the verification mechanism is capable of detecting models modified using the forging-based method with a high degree of probability, because such modifications only slightly alter the original model. In contrast, the retraining-based model can deceive backdoor verification, as it does not directly utilize the partially backdoored unlearning data, resulting in a model that is not affected by the poisoned data.

Table 8. Results of the membership inference attack on different (adversarial) unlearning strategies over three datasets. We report the AUC score (%) of the attack.

Method | MNIST | CIFAR-10 | SVHN
Original | 50.60±0.61 | 72.59±1.03 | 59.67±0.31
Retrain | 50.16±0.92 | 48.97±0.65 | 51.88±1.39
Adv-F | 50.09±0.34 | 72.41±0.55 | 59.04±1.10
Adv-R | 49.66±0.85 | 48.93±1.22 | 50.09±0.65

Figure 7. Comparison of verification error under different learning rate configurations. (a) Verification error distribution; (b) verification error statistics.

B.3. Membership Inference Attack

Membership Inference Attack (MIA) (Shokri et al., 2017) is regarded as an effective evaluation method for machine unlearning, as it measures the privacy leakage of the data supposedly unlearned. Different from the reproducing verification and the backdoor verification, we categorize MIA as an evaluation method rather than a verification method. To clarify this, we first distinguish two different settings for measuring unlearning efficacy, which we refer to as evaluation and verification (and which might be mixed up in previous works). Evaluation is supposed to be conducted by honest model providers to choose the best unlearning method from a candidate set of unlearning algorithms. In contrast, verification is supposed to be conducted by data owners or third parties to check the unlearning efficacy. The most distinct difference between evaluation and verification is the capacity of the evaluator (or verifier): the evaluator (model provider) has full access to the original model, the unlearned model, and the dataset, while the verifier (data owner) only has limited access to the models and the data. As a typical example of an evaluation method, comparing the model utility with the retrained model (Golatkar et al., 2020; Nguyen et al., 2022) can only be conducted by the model provider with full access.
For MIA, the attacker requires extensive knowledge of the model architecture for white-box attack variants, access to auxiliary or shadow data, and computational power comparable to that of the model provider (Sommer et al., 2022). However, considering that a powerful third-party verifier can be seen as a potential membership-inference attacker, we conduct additional experiments to examine whether our proposed adversarial methods can circumvent MIA. We adapt MIA against machine learning models (Shokri et al., 2017) to machine unlearning. We use a two-layer MLP as the attack model (discriminator). The attack model is trained in the same way, aiming to distinguish the training samples (unlearned samples) from the test samples. We label the predictions of the unlearned data as positive data, and we randomly sample predictions of test data to ensure that the numbers of positive and negative cases for the attack model are balanced. Ideally, the attack model should have low accuracy, since the unlearned data is supposed to be removed from the trained model and be indistinguishable from the test samples. However, if the unlearning is ineffective, the unlearned data is still memorized by the unlearned model like the retained training samples, leading to high attack accuracy. We record the AUC score of the attack under five different random seeds. The experimental results are shown in Table 8. From the results, we can observe that our proposed Adv-R and naive retraining have an attack accuracy around 50% (similar to random guessing), indicating that our proposed Adv-R is able to circumvent MIA. For simple target tasks (e.g., MNIST), the predictions of the test samples are similar to the predictions of the training samples, and they are difficult to distinguish since the model learns well and has good generalizability (the predictions are correct and have high confidence). Hence, the attack accuracy is low. For difficult target tasks (e.g., CIFAR-10), the model might produce confident and accurate predictions for training samples but not for test samples. Subsequently, the training and test samples are easier for attack models to distinguish. This can also be seen as a limitation of membership inference attacks, which are ineffective for well-learned models (Shafran et al., 2021).

Figure 8. Comparison of gradient norm over three datasets. (a) Gradient norm over MNIST; (b) gradient norm over CIFAR-10; (c) gradient norm over SVHN.

B.4. Effect of Learning Rate on Verification Error

From Proposition 4.4, we know that we can find small enough learning rates $\gamma^{(t)}$ and $\gamma_r^{(t)}$ to forge a valid PoRT based on Algorithm 2 for an arbitrarily small reproducing error threshold $\varepsilon$. To verify the effect of the learning rate on the verification error, we conduct experiments with different learning rate configurations over the MNIST dataset, using the MLP model. We choose five different values for the learning rate $\gamma^{(t)}$: $5\times 10^{-3}$, $10^{-3}$, $5\times 10^{-4}$, $10^{-4}$, and $5\times 10^{-5}$. To simplify hyperparameter tuning, we directly set the forging learning rate the same as the original learning rate, i.e., $\gamma_r^{(t)} = \gamma^{(t)}$. Experimental results are shown in Figure 7; a sketch of how the per-step verification error is computed is given below.
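The quantity summarized in Figure 7 is the per-step reproducing error, i.e., how far the claimed parameters of each recorded iteration deviate from one SGD step recomputed by the verifier. The minimal sketch below assumes the PoRT stores, for each step, the previous weights, the mini-batch, the learning rate, and the resulting weights as flat tensors; this layout and the names are illustrative assumptions.

```python
import torch

def reproducing_errors(model, loss_fn, port_steps):
    """For each recorded step (w_prev, (x, y), lr, w_next), recompute one SGD update from w_prev
    and return ||w_next - (w_prev - lr * grad)|| -- the quantity thresholded by epsilon."""
    errors = []
    for w_prev, (x, y), lr, w_next in port_steps:
        torch.nn.utils.vector_to_parameters(w_prev, model.parameters())
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grad = torch.cat([p.grad.flatten() for p in model.parameters()])
        errors.append(torch.norm(w_next - (w_prev - lr * grad)).item())
    return errors

# Figure 7(a)-style summary: proportion of steps whose error stays below a candidate threshold.
# errs = reproducing_errors(model, loss_fn, port_steps)
# frac_below = sum(e <= 1e-4 for e in errs) / len(errs)
```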
In particular, we show the percentage of verification errors (across iterations) that fall within the corresponding threshold in Figure 7(a), and the statistics of the verification error together with its maximum value (representing the threshold $\varepsilon$ and shown as red triangles) in Figure 7(b). From the results, we can observe that the verification error threshold (the maximum) and the mean verification error both decrease as the learning rate becomes smaller. In addition, in the case of $\gamma^{(t)} = \gamma_r^{(t)} = \gamma$, we can obtain a corollary of Proposition 4.4 as $\varepsilon \le LC\gamma^2 + 3C\gamma$, where $C = \max\{L_x C_D, G\}$ (this follows by substituting $\gamma_r^{(t)} = \gamma^{(t)} = \gamma$ into the conditions of Proposition 4.4 and bounding both $L_x C_D$ and $G$ by $C$). As the learning rate $\gamma$ is very small ($\gamma$ usually has a significantly lower order of magnitude than $C$), we can approximately omit the second-order term and obtain that $\varepsilon$ is roughly proportional to $\gamma$, which conforms to the experimental results shown in Figure 7(b).

B.5. Error Statistics of Reproducing Verification

We provide an in-depth analysis of the variation trend of the error statistics in Figure 4(b). Basically, as mentioned in the experiments, we attribute this result to the change of the gradient norm. We first look at the verification error from a theoretical perspective. Based on the proof of Proposition 4.4, when a batch contains the unlearned data, the verification error is related to $b$ (the difference between the gradient on the original data and the gradient on the forging data); when a batch does not contain the unlearned data, the verification error is related to the stochastic gradient (the gradient on a random data point). As the amount of unlearned data is small (the batches excluding the unlearned data distinctly outnumber the batches including the unlearned data), the mean verification error over an epoch is mainly determined by the norm of the stochastic gradient. For MNIST, the model learns well and the optimization converges fast (the accuracy nearly reaches 90% after the first epoch). Hence, the verification error remains stable and relatively small. For CIFAR-10, the task is challenging for the CNN, and the norm of the stochastic gradient first grows larger and then decreases as the optimization converges (the convergence is not complete at the end, so we are not able to observe the plateau as for MNIST). For SVHN, the difficulty of the task lies between MNIST and CIFAR-10 for the ResNet. Hence, the verification error goes through a short increase and then a decrease, and finally settles into a stable plateau as for MNIST. To support our insights, we plot the norm of the stochastic gradient (we directly use the gradient over the training batch) under different settings. The results are shown in Figure 8. The variation trend of the norm of the stochastic gradients matches that of the verification error shown in Figure 4(b). The experimental results demonstrate that the reproducing verification error is closely related to the gradient norm.

C. Dealing with the Vulnerability of MUL Verification

In the aforementioned studies, we have exposed the vulnerability of the verification strategies of MUL, supported by theoretical and experimental results. Next, we provide some simple insights on how to deal with this vulnerability.

Retraining-based Adversarial Unlearning. Detecting retraining-based adversarial unlearning is extremely difficult because it does not explicitly utilize the unlearned data to update the model parameters. Consequently, the benefit for model providers from retraining-based adversarial unlearning is relatively small.
In our observation, the performance of the unlearned model produced by retraining-based adversarial unlearning is better than naive retraining but worse than original training, which conforms to the behavior of approximate unlearning methods (Guo et al., 2020; Golatkar et al., 2020; 2021; Mehta et al., 2022). Unfortunately, existing studies on the verification of approximate unlearning remain nascent. Nevertheless, it would be promising to investigate safe verification for approximate unlearning and exploit such verification methods to detect retraining-based adversarial unlearning.

Forging-based Adversarial Unlearning. Forging-based adversarial unlearning is relatively easy to detect because of its bounded difference from the original model. According to our experimental results, forging-based adversarial unlearning can be detected by backdoor verification with high confidence (Table 7). Correspondingly, the model provider can benefit greatly from forging-based adversarial unlearning (utility as high as the original model and a low unlearning time cost). Although, theoretically, the model provider can find proper learning rates $\gamma^{(t)}$ and $\gamma_r^{(t)}$ to circumvent the reproducing verification with an arbitrarily small threshold $\varepsilon$, the choice of learning rate is limited in practice. When model providers choose a smaller learning rate to circumvent stricter reproducing verification, the original training process can take a longer time and may even fail to converge. Hence, forging-based adversarial unlearning methods cannot circumvent arbitrarily strict reproducing verification, and the verifier can carefully select the error threshold to reject questionable unlearning operations, as illustrated in the sketch below.
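As a closing illustration of this recommendation, the following is a minimal sketch of a reproducing verifier in the spirit of Definition 4.1: it replays a submitted PoRT step by step and rejects it if any recorded mini-batch still contains a point from D_u (removable property) or if any step deviates from a recomputed SGD update by more than the chosen threshold ε (reproducible property). The PoRT layout and the hash-based data-identity check are illustrative assumptions, not the protocol used by any specific verification scheme.

```python
import torch

def verify_port(model, loss_fn, port_steps, unlearned_hashes, epsilon):
    """Accept a PoRT only if (i) no recorded mini-batch contains unlearned data and
    (ii) every recorded step is reproducible within epsilon."""
    for w_prev, (x, y, sample_hashes), lr, w_next in port_steps:
        # removable property: the recorded mini-batch must be disjoint from the unlearned set
        if any(h in unlearned_hashes for h in sample_hashes):
            return False
        # reproducible property: recompute one SGD step from w_prev and compare with w_next
        torch.nn.utils.vector_to_parameters(w_prev, model.parameters())
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grad = torch.cat([p.grad.flatten() for p in model.parameters()])
        if torch.norm(w_next - (w_prev - lr * grad)) > epsilon:
            return False
    return True
```

Honest retraining reproduces each step exactly (the reproducible property holds with error 0 in Proposition 4.2), whereas the per-step error of a forged trace scales with the learning rate as discussed above, so a strict ε mainly constrains forging-based dishonesty.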