Published as a conference paper at ICLR 2022

KNOWLEDGE REMOVAL IN SAMPLING-BASED BAYESIAN INFERENCE

Shaopeng Fu¹, Fengxiang He²,¹ & Dacheng Tao¹
¹The University of Sydney, Australia, ²JD Explore Academy, China
shfu7008@uni.sydney.edu.au, fengxiang.f.he@gmail.com, dacheng.tao@sydney.edu.au
*The authors contributed equally.

ABSTRACT

The right to be forgotten has been legislated in many countries, but its enforcement in the AI industry would cause unbearable costs. When single data deletion requests come, companies may need to delete whole models learned with massive resources. Existing works propose methods to remove knowledge learned from data for explicitly parameterized models, which however are not applicable to sampling-based Bayesian inference, i.e., Markov chain Monte Carlo (MCMC), as MCMC can only infer implicit distributions. In this paper, we propose the first machine unlearning algorithm for MCMC. We first convert the MCMC unlearning problem into an explicit optimization problem. Based on this problem conversion, an MCMC influence function is designed to provably characterize the knowledge learned from data, which then delivers the MCMC unlearning algorithm. Theoretical analysis shows that MCMC unlearning would not compromise the generalizability of the MCMC models. Experiments on Gaussian mixture models and Bayesian neural networks confirm the effectiveness of the proposed algorithm. The code is available at https://github.com/fshp971/mcmc-unlearning.

1 INTRODUCTION

The right to be forgotten refers to the right of individuals to request data controllers, such as tech giants, to delete the data collected from them. It has been recognized in many countries through legislation, including the European Union's General Data Protection Regulation (2016) and the California Consumer Privacy Act (2018). In the field of machine learning, legal experts suggest that companies may need to delete and re-train their machine learning models to meet the legal requirements of the right to be forgotten (Li et al., 2018; Garg et al., 2020). However, training a machine learning model usually costs massive amounts of resources, including data, energy, and time. Thus, the enforcement of the right to be forgotten in the AI industry may result in unbearable costs.

Toward this end, recent works (Cao & Yang, 2015; Guo et al., 2020) propose machine unlearning algorithms to efficiently characterize and remove the knowledge learned from specific data. To enable knowledge removal analysis, existing unlearning approaches usually require the machine learning models to be explicitly parameterized, which however does not cover an important family of machine learning algorithms, Markov chain Monte Carlo (MCMC). MCMC is a sampling-based Bayesian inference method that aims to draw samples along a Markov chain to approximate a target posterior distribution (Hastings, 1970; Geman & Geman, 1984; Welling & Teh, 2011). Distributions inferred by MCMC are usually implicitly parameterized, which makes it difficult to directly apply existing unlearning algorithms to the MCMC unlearning problem.

In this paper, we propose an MCMC unlearning algorithm to realize the right to be forgotten in sampling-based Bayesian inference for the first time. The key to tackling the MCMC unlearning problem is to design methods to directly analyze the implicit parameter of the learned distribution.
To this end, we propose to employ the learned implicit distribution to approximate the target posterior under the measure of KL-divergence, which converts the MCMC unlearning problem into an optimization problem with an explicit distribution parameter for knowledge removal analysis. Based on this conversion, an MCMC influence function (Huber, 2004; Koh & Liang, 2017) is designed to characterize the knowledge learned from single data samples. The new influence function then delivers the MCMC unlearning algorithm, which removes data knowledge by directly subtracting the corresponding influence from the learned distribution.

Theoretical analyses are conducted for the proposed MCMC unlearning algorithm. For the knowledge removal analysis, we prove that the proposed algorithm provably realizes an ε-knowledge removal, a new notion defined for evaluating machine unlearning performance in Bayesian inference. For the generalization analysis, a PAC-Bayesian generalization upper bound (McAllester, 1998; 1999; 2003; He et al., 2019) is established for the distribution processed by the MCMC unlearning algorithm. The upper bound shows that the difference introduced by MCMC unlearning in the generalization upper bound is no larger than the order of $O(\sqrt{|S'|/N})$. This result demonstrates that the proposed algorithm would not compromise the generalization ability of the learned distribution.

In summary, our work has four main contributions: (1) We enable knowledge removal analysis in MCMC by converting the MCMC unlearning problem into an optimization problem. (2) Based on this problem conversion, we design an MCMC influence function to characterize the knowledge learned from data, which then delivers the MCMC unlearning algorithm. (3) We conduct theoretical analyses and prove that MCMC unlearning realizes an ε-knowledge removal and has little impact on the generalization ability of the learned distribution. (4) We conduct comprehensive experiments to verify the effectiveness of the proposed algorithm, including Gaussian mixture models for clustering on synthetic data and Bayesian neural networks for classification on real-world datasets. The results suggest that MCMC unlearning can effectively remove the knowledge learned from the requested data while keeping the information learned from other data intact.

2 RELATED WORKS

Markov chain Monte Carlo. MCMC is a Bayesian inference method that draws samples along a Markov chain, where the drawn samples are proved to converge to samples from a target distribution (Hastings, 1970). Although many improvements have been made, such as Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011), Gibbs sampling (Geman & Geman, 1984; George & McCulloch, 1993), and slice sampling (Neal, 2003; Walker, 2007), MCMC still suffers from huge computational costs when inferring on large-scale datasets, e.g., in computer vision and image processing tasks (Zhang et al., 2020b; Wang et al., 2020), since each sampling step in MCMC requires computation over the whole dataset. To tackle this issue, Welling & Teh (2011) introduce a scalable MCMC sampler named stochastic gradient Langevin dynamics (SGLD), which adds properly scaled noise to a standard stochastic gradient optimization algorithm (Robbins & Monro, 1951) so that the update iterates converge to samples from the true posterior distribution.
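For concreteness, the following is a minimal sketch of a single SGLD step (our illustration, not code from the paper; the callables `grad_log_prior` and `grad_log_lik` are assumed interfaces): it takes a mini-batch stochastic-gradient step on the potential $U(\theta)$ (formally defined in Section 3) and adds Gaussian noise whose scale is tied to the step size.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, batch, N, eta, rng):
    """One SGLD update (Welling & Teh, 2011).

    theta          : current posterior sample, shape (d,)
    grad_log_prior : theta -> gradient of log p(theta)
    grad_log_lik   : (theta, z) -> gradient of log p(z | theta)
    batch          : mini-batch drawn from the training set of size N
    eta            : step size eta_t, e.g. following a * (b + t) ** (-r)
    """
    # Mini-batch estimate of grad U(theta); the N/|batch| factor rescales the
    # mini-batch likelihood gradient to an unbiased full-dataset estimate.
    grad_U = -(N / len(batch)) * sum(grad_log_lik(theta, z) for z in batch) \
             - grad_log_prior(theta)
    # Gradient step on the potential plus Gaussian noise W ~ N(0, I).
    return theta - eta * grad_U + np.sqrt(2.0 * eta) * rng.normal(size=theta.shape)
```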
Since then, a variety of stochastic gradient MCMC samplers (SG-MCMC) have been proposed, including stochastic gradient HMC (SGHMC) (Chen et al., 2014), Patterson & Teh (2013), Ahn et al. (2012), Ding et al. (2014), and Zhang et al. (2020a). Ma et al. (2015) further design a complete framework for constructing SG-MCMC.

Machine unlearning. Machine unlearning aims to remove the knowledge learned from specific data in efficient manners (Cao & Yang, 2015). Existing approaches can be roughly divided into exact methods and approximate methods. Exact methods aim to exactly recover the model trained without the removed data and are usually realized by reducing the retraining cost. For example, Ginart et al. (2019) design an efficient unlearning method for k-means clustering via model quantization. Bourtoule et al. (2021) propose a SISA framework that only retrains the affected part of the model during data deletion. Other advances include Ullah et al. (2021) and Brophy & Lowd (2021). On the other hand, approximate methods aim to modify the trained model to approximate the one trained only on the remaining data. A series of works (Guo et al., 2020; Golatkar et al., 2020; 2021) are designed by characterizing the influence of data on the trained model with influence functions (Cook & Weisberg, 1982; Huber, 2004; Koh & Liang, 2017). Other approaches include Baumhauer et al. (2020) and Izzo et al. (2021). Some works have studied machine unlearning in Bayesian inference (Nguyen et al., 2020; Gong et al., 2021). However, these works only focus on variational inference, an optimization-based Bayesian inference method, and cannot be applied to sampling-based Bayesian inference, i.e., MCMC. In contrast, our approach is the first machine unlearning method for MCMC, with a theoretical knowledge removal guarantee.

3 PRELIMINARIES

Suppose $p(z|\theta)$ is a parameterized distribution over the sample space $Z$, where $\theta \in \Theta \subseteq \mathbb{R}^d$ is the parameter with a prior distribution $p(\theta)$. Suppose $S = \{z_1, z_2, \ldots, z_N\}$ is a dataset consisting of $N$ samples, where each sample $z_i \in Z$ is drawn from the parameterized distribution $p(z|\theta)$, i.e., $z_i \sim p(z|\theta)$. Bayesian inference aims to infer the posterior distribution $p(\theta|S)$ of the parameter $\theta$,
$$p(\theta|S) = \frac{p(\theta)\prod_{i=1}^{N} p(z_i|\theta)}{\int p(\theta)\prod_{i=1}^{N} p(z_i|\theta)\,d\theta}.$$
For simplicity, for the given dataset $S$, we let $p_S$ denote the true posterior distribution $p(\theta|S)$, and $\hat{p}_S$ denote the posterior distribution inferred by Bayesian inference.

Usually, the denominator of the posterior $p_S$ has no closed-form solution, which prevents directly calculating the posterior distribution. Thereby, researchers have designed methods to approximate the target posterior. A canonical approach is the sampling-based Bayesian inference method, Markov chain Monte Carlo (MCMC) (Hastings, 1970). MCMC infers a posterior distribution by first constructing a Markov chain and then drawing samples according to the state of the chain. When the target posterior is the stationary distribution of the Markov chain, it is proved that the drawn samples converge to samples from the target posterior.

However, MCMC methods usually suffer from scalability issues on large-scale datasets. To this end, stochastic gradient MCMC (SG-MCMC) (Ma et al., 2015) leverages mini-batch estimation (Robbins & Monro, 1951) to enable scalable sampling. In SG-MCMC, the posterior distribution $p(\theta|S)$ can be rewritten as $p(\theta|S) \propto \exp(-U(\theta))$, where $U(\theta)$ is the potential function defined as
$$U(\theta) = -\sum_{z_i \in S} \log p(z_i|\theta) - \log p(\theta).$$
In each sampling step, drawing samples with SG-MCMC requires one to estimate the potential function $U(\theta)$ with a mini-batch $\tilde{S} \subseteq S$ as follows,
$$\tilde{U}(\theta) = -\frac{N}{|\tilde{S}|}\sum_{z_i \in \tilde{S}} \log p(z_i|\theta) - \log p(\theta).$$
Typical SG-MCMC methods include SGLD and SGHMC. Stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) inserts Gaussian noise into stochastic gradients. In each sampling step, SGLD updates the posterior sample $\theta_t$ as follows,
$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \tilde{U}(\theta_t) + \sqrt{2\eta_t}\,W,$$
where $W$ is drawn from the standard Gaussian distribution $N(0, I)$ and $\eta_t > 0$ is the step size. To improve the sampling efficiency of SGLD, stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) inserts a momentum $v_t$ into the posterior sample update as below,
$$\theta_{t+1} = \theta_t + v_t, \qquad v_{t+1} = (1-\alpha)v_t - \eta_t \nabla_\theta \tilde{U}(\theta_t) + \sqrt{2\alpha\eta_t}\,W,$$
where $W$ is also drawn from the Gaussian distribution $N(0, I)$, $\eta_t > 0$ is the step size, and $\alpha \in (0, 1)$ is the momentum factor. The initial momentum $v_0$ is drawn from the Gaussian distribution $N(0, \eta_0 I)$. To ensure the drawn samples converge to the target posterior, the step size $\eta_t$ for both SGLD and SGHMC needs to satisfy (Welling & Teh, 2011): (1) $\sum_{t=1}^{\infty} \eta_t = \infty$; and (2) $\sum_{t=1}^{\infty} \eta_t^2 < \infty$. A typical step size schedule is $\eta_t = a(b+t)^{-r}$, where $a, b > 0$ and $r \in (0.5, 1]$.

4 MCMC UNLEARNING ALGORITHM

This section presents the main results of this paper. We first define a new notion, ε-knowledge removal, to assess machine unlearning performance in Bayesian inference. Then, an MCMC unlearning algorithm is designed with knowledge removal and generalizability guarantees.

4.1 ε-KNOWLEDGE REMOVAL FOR BAYESIAN INFERENCE

Suppose a client requests to remove her/his data $S' \subseteq S$ from the whole training dataset $S$. An unlearning algorithm $A$ for Bayesian inference aims to remove the knowledge learned from the requested data $S'$ from the inferred distribution $\hat{p}_S$ as follows,
$$\hat{p}_{S \setminus S'} = A(\hat{p}_S, S'),$$
where $\hat{p}_{S \setminus S'}$ is named the processed distribution. In order to meet the regulation requirements, the following notion is defined to quantify machine unlearning performance in Bayesian inference.

Definition 1 (ε-knowledge removal). For the distribution $\hat{p}_S$ inferred on dataset $S$ via Bayesian inference and any subset $S' \subseteq S$, we say that algorithm $A$ performs ε-knowledge removal if
$$KL(\hat{p}_{S \setminus S'} \,\|\, p_{S \setminus S'}) \leq \varepsilon,$$
where $\hat{p}_{S \setminus S'} = A(\hat{p}_S, S')$ is the processed distribution.

Remark 1. Intuitively, a smaller ε indicates that the algorithm $A$ has better unlearning performance.

In practice, one usually can only obtain an inferred distribution $\hat{p}_S(\cdot|\omega)$ subject to a random seed $\omega$. In this case, the overall inferred distribution is $\hat{p}_S = \int \hat{p}_S(\cdot|\omega)p(\omega)\,d\omega$, which suggests that the KL-divergence in Definition 1 can be upper-bounded via the convexity of the KL-divergence as below,
$$KL(\hat{p}_{S \setminus S'} \,\|\, p_{S \setminus S'}) = KL\!\left(\mathbb{E}_\omega \hat{p}_{S \setminus S'}(\cdot|\omega) \,\big\|\, \mathbb{E}_\omega p_{S \setminus S'}(\cdot|\omega)\right) \leq \mathbb{E}_\omega\, KL\!\left(\hat{p}_{S \setminus S'}(\cdot|\omega) \,\big\|\, p_{S \setminus S'}(\cdot|\omega)\right).$$
Motivated by the above inequality, the following knowledge removal estimator is defined to efficiently estimate the knowledge removal performance in practice.

Definition 2 (knowledge removal estimator). Suppose $\omega_1, \ldots, \omega_M$ are $M$ i.i.d. random seeds in Bayesian inference. Then, the knowledge removal estimator $\hat{\varepsilon}_M$ is defined as below,
$$\hat{\varepsilon}_M = \frac{1}{M}\sum_{m=1}^{M} KL\!\left(\hat{p}_{S \setminus S'}(\cdot|\omega_m) \,\big\|\, p_{S \setminus S'}(\cdot|\omega_m)\right),$$
where $\hat{p}_{S \setminus S'}(\cdot|\omega_m) = A(\hat{p}_S(\cdot|\omega_m), S')$ is the processed distribution for the m-th random seed $\omega_m$.

Remark 2. Our experiments employ this knowledge removal estimator $\hat{\varepsilon}_M$ to estimate the ε value in the ε-knowledge removal guarantee. See Section 5.2 for details.
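As an illustration of how $\hat{\varepsilon}_M$ can be computed in practice, the sketch below (our own illustration, not code from the paper) estimates each per-seed KL term from posterior samples with the k-nearest-neighbor divergence estimator of Wang et al. (2009), which is also the estimator that the experiments adopt (see Appendix D.3); all function names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_estimate(p_samples, q_samples, k=1):
    """k-NN estimator of KL(p || q) (Wang et al., 2009).

    p_samples: (n, d) draws from p; q_samples: (m, d) draws from q.
    """
    n, d = p_samples.shape
    m = q_samples.shape[0]
    # rho: distance from each p-sample to its k-th nearest neighbor in
    # p_samples (k+1 because the closest neighbor is the point itself).
    rho = cKDTree(p_samples).query(p_samples, k=k + 1)[0][:, -1]
    # nu: distance from each p-sample to its k-th nearest neighbor in q_samples.
    nu = cKDTree(q_samples).query(p_samples, k=k)[0]
    if k > 1:
        nu = nu[:, -1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))

def knowledge_removal_estimator(processed_chains, retrained_chains, k=1):
    """Average per-seed KL estimates over M matched pairs of sample chains
    (Definition 2): each pair shares the same random seed omega_m."""
    kls = [knn_kl_estimate(p, q, k)
           for p, q in zip(processed_chains, retrained_chains)]
    return float(np.mean(kls))
```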
Remark 3. This estimator is similar to the Local Forgetting Bound defined in Golatkar et al. (2020). The difference is that the bound in Golatkar et al. (2020) aims to estimate the randomness introduced by some "readout functions", while $\hat{\varepsilon}_M$ aims to quantify the ε-knowledge removal guarantee by estimating the difference between the processed and retrained distributions.

4.2 METHODOLOGY

This section presents the MCMC unlearning algorithm. To make the analysis easier, we assume that one can obtain the exact target posterior distribution via MCMC. In other words, for a posterior $\hat{p}_S$ that is inferred by MCMC, we assume that $\hat{p}_S = p_S$.

Developing an unlearning algorithm for MCMC is challenging, since there is no explicitly parameterized distribution for knowledge removal analysis. To tackle this challenge, we re-formulate the MCMC unlearning problem as minimizing the following KL-divergence,
$$\min_\delta KL(p_S(\cdot - \delta) \,\|\, p_{S \setminus S'}), \qquad (1)$$
where $p_S$ is the distribution already learned via MCMC, $\delta$ is the distribution shifting scale, $S'$ is the dataset that is requested to be deleted, and $p_{S \setminus S'}$ is the target posterior that one aims to obtain. The idea behind Eq. (1) is to approximate the target posterior $p_{S \setminus S'}$ by directly shifting the currently learned distribution $p_S$. In this case, one no longer needs to know the exact implicit distribution parameter to perform machine unlearning; the explicit shifting scale $\delta$ is enough.

So far, the MCMC unlearning problem has been converted to the optimization problem defined in Eq. (1). We then design the MCMC unlearning algorithm based on this problem conversion. Let $\delta^*_{S \setminus S'}$ denote any local minimizer of Eq. (1); the goal of MCMC unlearning is to calculate the optimal shifting scale $\delta^*_{S \setminus S'}$. To achieve that goal, we first expand the KL-divergence in Eq. (1) as follows,
$$KL(p_S(\cdot - \delta) \,\|\, p_{S \setminus S'}) = KL(p_S \,\|\, p_{S \setminus S'}(\cdot + \delta)) = \mathbb{E}_{\theta \sim p_S}[\log p(\theta|S)] + \log p(S \setminus S') + \mathbb{E}_{\theta \sim p_S}[-\log p(S, \theta + \delta) + \log p(S'|\theta + \delta)]. \qquad (2)$$
Through Eq. (2), one can find that when removing dataset $S'$ via optimizing Eq. (1), an additional term $\mathbb{E}_{\theta \sim p_S} \log p(S'|\theta + \delta)$ is introduced to force the distribution $p_S(\cdot - \delta)$ to approach the target posterior $p_{S \setminus S'}$. This motivates us to follow existing works (Koh & Liang, 2017; Guo et al., 2020) and conduct influence analysis on Eq. (1) based on the introduced term $\mathbb{E}_{\theta \sim p_S} \log p(S'|\theta + \delta)$.

Specifically, we define a new function $F_{S',\tau}(\delta, S) = \mathbb{E}_{\theta \sim p_S}[-\log p(S, \theta + \delta) - \tau \log p(S'|\theta + \delta)]$, where $\tau \in [-1, 0]$. Suppose there exists a function $\hat{\delta}(\tau)$ with domain $[-1, 0]$ such that $\hat{\delta}(0) = 0$ and $\hat{\delta}(\tau)$ is a local minimizer of $F_{S',\tau}(\delta, S)$. Then, one can let the target shifting scale $\delta^*_{S \setminus S'}$ be $\hat{\delta}(-1)$ and approximate it by the first-order approximation $\hat{\delta}(-1) \approx \hat{\delta}(0) - \partial\hat{\delta}(0)/\partial\tau$, which means $\delta^*_{S \setminus S'} \approx -\partial\hat{\delta}(0)/\partial\tau$. Following a standard derivation, one can calculate the first derivative $\partial\hat{\delta}(0)/\partial\tau$ as the MCMC influence function $I(S')$ defined below. Thus, the optimal $\delta^*_{S \setminus S'}$ is approximated as $\delta^*_{S \setminus S'} \approx -I(S')$. For the detailed influence analysis, see Appendix A.1.

Definition 3 (MCMC influence function). The MCMC influence function of dataset $S' \subseteq S$ is defined to be
$$I(S') = -\left(\mathbb{E}_{\theta \sim p_S} \nabla^2_\theta \log p(\theta, S)\right)^{-1} \sum_{z \in S'} \left(\mathbb{E}_{\theta \sim p_S} \nabla_\theta \log p(z|\theta)\right).$$

The MCMC influence function estimates the influence of the data $S'$ on the learned distribution $p_S$, as $\delta^*_{S \setminus S'} \approx -I(S')$. The computational cost of the MCMC influence function would be high; an efficient calculation method is given in Appendix B. Finally, based on Definition 3, the MCMC unlearning algorithm is designed as follows.
Algorithm 1 (MCMC unlearning). Suppose one has drawn a series of samples $\{\theta_1, \ldots, \theta_T\}$ from the posterior $p_S$ via MCMC. Then, the MCMC unlearning algorithm removes the learned knowledge of dataset $S'$ from each drawn sample $\theta_i$ as follows,
$$\theta'_i \leftarrow \theta_i - I(S'),$$
where $I(S')$ is the MCMC influence function for dataset $S' \subseteq S$.

Remark 4. Algorithm 1 is equivalent to the following unlearning process,
$$p_{S \setminus S'}(\cdot) = A(p_S, S') = p_S(\cdot + I(S')) \approx p_S(\cdot - \delta^*_{S \setminus S'}).$$

Remark 5. Due to the inherent geometric difference between the distribution functions $p_S$ and $p_{S \setminus S'}$, it is impossible for Algorithm 1 to shift $p_S$ so as to exactly recover $p_{S \setminus S'}$. However, the experiments (see Section 5) show that the designed algorithm is good enough for removing learned knowledge from MCMC distributions.

4.3 KNOWLEDGE REMOVAL ANALYSIS

This section studies the ε-knowledge removal of the proposed MCMC unlearning algorithm. We first introduce several assumptions (Assumptions 1-4) for the knowledge removal analysis.

Assumption 1. The functions $-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)$ and $-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)$ are $L_1$-Lipschitz continuous on the support set $\mathrm{supp}(\delta)$.

The Lipschitzness of $-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)$ can be realized by choosing an appropriate prior distribution for $\theta$, for example, letting $p(\theta)$ be a Laplace distribution. Besides, the Lipschitzness of $-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)$ can usually be achieved in practice by setting $p(z|\theta + \delta)$ to be a neural network with a bounded domain.

Assumption 2. There exists a neighborhood $V_2 \subseteq \mathrm{supp}(\delta)$ of the origin such that for any dataset $S$ and any sample point $z \in Z$, the following hold:
(Smoothness). $\nabla_\delta[-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)]$ and $\nabla_\delta[-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)]$ are $L_2$-Lipschitz continuous on $V_2$.
(Lipschitz Hessian). $\nabla^2_\delta[-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)]$ and $\nabla^2_\delta[-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)]$ are $L_3$-Lipschitz continuous on $V_2$.

Assumption 2 assumes local smoothness and Hessian Lipschitzness for the involved functions, which are common in the literature for analyzing Newton methods (Nesterov & Polyak, 2006; Carmon et al., 2018; Zhou et al., 2018).

Assumption 3 (strong convexity). There exists a neighborhood $V_3 \subseteq \mathrm{supp}(\delta)$ of the origin such that for any dataset $S$ and any sample point $z \in Z$, $-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)$ and $-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)$ are $\mu$-strongly convex on $V_3$.

The influence analysis and the defined MCMC influence function require the Hessian matrices of $-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta)$ and $-\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta)$ to be invertible. Assumption 3 ensures the invertibility of these Hessian matrices.

Assumption 4. Suppose Assumptions 1-3 hold. Then there exists a sub-neighborhood $V \subseteq V_2 \cap V_3$ of the origin such that for any dataset $S$ and any real $\tau \in [-1, 0]$, the function $-\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta|S) - \tau\, \mathbb{E}_{\theta \sim p_S} \log p(S'|\theta + \delta)$ has at least one local minimizer that falls in $V$.

Assumption 4 assumes that the target optimal solution $\delta^*_{S \setminus S'}$ of Eq. (1) exists in the subspace that we are interested in, which makes the influence analysis meaningful.

Under Assumptions 1-4, we prove that the MCMC influence function defined in Definition 3 rigorously estimates the influence of data on the shifting scale $\delta$.

Theorem 1. Under Assumptions 1-4, we have that
$$\delta^*_{S \setminus S'} = -I(S') + O\!\left(\frac{|S'|^2}{N^2}\right).$$
The proof of Theorem 1 is presented in Appendix A.1. Based on Theorem 1, a knowledge removal guarantee is proved for the MCMC unlearning algorithm.

Theorem 2. Suppose Assumptions 1-4 hold.
Then, the MCMC unlearning algorithm $A(p_S, S') = p_S(\cdot + I(S'))$ performs $\varepsilon_{S \setminus S'}$-knowledge removal, where
$$\varepsilon_{S \setminus S'} = KL(p_S(\cdot - \delta^*_{S \setminus S'}) \,\|\, p_{S \setminus S'}) + O\!\left(\frac{|S'|^4}{N^3}\right),$$
and $\delta^*_{S \setminus S'}$ is a local minimizer of the KL-divergence $KL(p_S(\cdot - \delta) \,\|\, p_{S \setminus S'})$.

Proof sketch. We expand $KL(p_S(\cdot + I(S')) \,\|\, p_{S \setminus S'})$ into a Taylor series up to second order around the point $\delta = \delta^*_{S \setminus S'}$. By the first-order optimality condition, the first-order term in the Taylor series becomes 0. By leveraging the approximation error given in Theorem 1, the second-order term is proved to be no larger than the order of $O(|S'|^4/N^3)$. The full proof is presented in Appendix A.2.

Remark 6. As explained in the previous section, Algorithm 1 cannot completely remove the learned knowledge from the inferred distribution due to the inherent geometric difference between the distributions $p_S$ and $p_{S \setminus S'}$. Nevertheless, Theorem 2 demonstrates that Algorithm 1 can bring the KL-divergence in Eq. (1) close to its local minimum.

4.4 GENERALIZATION ANALYSIS

This section studies the impact of the proposed MCMC unlearning algorithm on the generalization ability of the inferred distribution. The analysis is conducted based on the PAC-Bayes theory (McAllester, 1999; 2003; He et al., 2019), a framework for analyzing the generalization ability of stochastic machine learning models (Mohri et al., 2012; He & Tao, 2020).

[Figure 1: Experiment results for GMMs. (a) Visualized results of the GMMs experiments. There are four groups of points in the synthetic dataset, where different groups are in different colors. The samples generated from the original, processed, and target models via MCMC sampling are colored in black, blue, and red, respectively. (b) Time costs for processing single deletion requests via retraining and MCMC unlearning, for SGLD and SGHMC.]

Specifically, suppose $Q$ is the parameter distribution of a stochastic model and $S$ is the training dataset. Then, the expected risk $R(Q)$ and empirical risk $\hat{R}_S(Q)$ of $Q$ are defined to be
$$R(Q) = \mathbb{E}_{\theta \sim Q}\,\mathbb{E}_z\, \ell(\theta, z), \qquad \hat{R}_S(Q) = \mathbb{E}_{\theta \sim Q}\,\frac{1}{N}\sum_{i=1}^{N} \ell(\theta, z_i),$$
where $\theta$ is the model parameter drawn from the distribution $Q$ and $\ell$ is a loss function ranging in $[0, 1]$. The difference between the expected risk and the empirical risk is the generalization error; its magnitude characterizes the generalization ability of the stochastic model. Then, a generalization bound for the distribution processed by Algorithm 1 is proved as below.

Theorem 3. Suppose Assumptions 1-4 hold. Let $p_{S \setminus S'}$ denote the distribution processed by Algorithm 1. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the following inequality holds,
$$R(p_{S \setminus S'}) \leq \hat{R}_S(p_{S \setminus S'}) + \sqrt{\frac{\mathbb{E}_{\theta \sim p_S}[\log p_S(\theta) + \|\theta\|_1] + \|I(S')\|_1 + \log 2 + \log\frac{1}{\delta} + \log N + 2}{2N - 1}},$$
where $\|I(S')\|_1 \leq O(|S'|/N)$. The proof is given in Appendix A.3.

Corollary 1. Algorithm 1 increases the generalization upper bound in Theorem 3 by no more than the order of $O(\sqrt{|S'|/N})$.

Proof. According to Theorem 3, the difference in the generalization upper bound introduced by Algorithm 1 is
$$O\!\left(\sqrt{\frac{C + \|I(S')\|_1}{2N-1}}\right) - O\!\left(\sqrt{\frac{C}{2N-1}}\right) \leq O\!\left(\sqrt{\frac{\|I(S')\|_1}{2N-1}}\right) \leq O\!\left(\sqrt{\frac{|S'|}{N}}\right),$$
where $C$ collects the terms in the bound that do not depend on $I(S')$. This completes the proof.

Remark 7. Corollary 1 indicates that the proposed MCMC unlearning algorithm would not compromise the generalization ability of the inferred distribution.
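Before moving to the experiments, the following sketch summarizes how Definition 3 and Algorithm 1 could be realized with Monte Carlo estimates over the drawn chain. This is our own naive dense-Hessian illustration for small models, not the authors' implementation (the practical version uses the inverse-HVP approximation of Appendix B); the callables `hess_log_joint` and `grad_log_lik` are assumed interfaces, and sign conventions follow Definition 3 and Algorithm 1 above.

```python
import numpy as np

def mcmc_influence(theta_samples, hess_log_joint, grad_log_lik, removed_data):
    """Monte Carlo estimate of the MCMC influence function I(S') (Definition 3).

    theta_samples  : draws {theta_1, ..., theta_T} from p_S via MCMC, each (d,)
    hess_log_joint : theta -> Hessian of log p(theta, S) at theta, shape (d, d)
    grad_log_lik   : (theta, z) -> gradient of log p(z | theta), shape (d,)
    """
    # E_{theta ~ p_S}[ Hessian of log p(theta, S) ], averaged over the chain.
    H = np.mean([hess_log_joint(t) for t in theta_samples], axis=0)
    # Sum over removed points of E_{theta ~ p_S}[ grad log p(z | theta) ].
    g = sum(np.mean([grad_log_lik(t, z) for t in theta_samples], axis=0)
            for z in removed_data)
    # I(S') = -(E[Hessian])^{-1} * sum_z E[gradient].
    return -np.linalg.solve(H, g)

def unlearn(theta_samples, influence):
    """Algorithm 1: shift every posterior sample theta_i to theta_i - I(S')."""
    return [t - influence for t in theta_samples]
```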
5 EXPERIMENTS

In this section, we empirically verify the effectiveness and efficiency of the proposed MCMC unlearning algorithm on Gaussian mixture models and Bayesian neural networks.

5.1 EXPERIMENTS FOR GAUSSIAN MIXTURE MODELS

We first visualize the effect of the MCMC unlearning algorithm on a synthetic clustering dataset that consists of 2,000 examples, where each example is two-dimensional and belongs to one of 4 clusters. Gaussian mixture models (GMMs) are employed to infer the cluster centers. For the details of the GMMs used in our experiments, see Appendix C.1.

Experiment design. We employ SGLD and SGHMC to infer the GMMs on the clustering dataset. Then, 400 points are removed from each of the grey and yellow parts, around 40% of the whole dataset, via the proposed unlearning algorithm. We also train models on only the remaining set with the same MCMC inference settings to show the targets of the forgetting tasks. For the details of the experiments, see Appendix C.

Results analysis. The visualization results are presented in Fig. 1a, which show that after unlearning, the processed models are close to the target models. Besides, the time costs for processing single data deletion requests are presented in Fig. 1b, which shows that the proposed MCMC unlearning algorithm is significantly faster than retraining the whole model. These results demonstrate that the proposed algorithm can effectively and efficiently remove the learned knowledge of specified data.

Table 1: Experiment results on CIFAR-10. Different metrics are used to evaluate the machine unlearning performance, including the classification errors on the remaining set Sr, removed set Sf, and test set Stest, the knowledge removal estimator $\hat{\varepsilon}_M$, and the membership inference attack (MIA) accuracy on Sf. Every experiment is repeated 5 times. The results show that the proposed MCMC unlearning algorithm beats the baseline methods in almost every experiment setting.

CIFAR-10 + SGLD

| Remove | Method | Err. on Sr (%) | Err. on Sf (%) | Err. on Stest (%) | $\hat{\varepsilon}_M$ (×10³) | MIA Acc. (%) |
|---|---|---|---|---|---|---|
| 3,000 Examples | Retrain | 20.45 ± 0.95 | 46.69 ± 1.05 | 31.80 ± 0.71 | 0.00 ± 0.00 | 51.85 ± 1.39 |
| | Origin | 21.71 ± 0.94 | 18.91 ± 1.13 | 31.22 ± 0.53 | 4079.51 ± 2.07 | 80.11 ± 3.87 |
| | IS | 21.81 ± 0.80 | 28.71 ± 1.06 | 31.77 ± 0.82 | 4228.11 ± 1.60 | 72.29 ± 2.07 |
| | Ours | 21.56 ± 0.96 | 31.14 ± 1.75 | 31.76 ± 0.66 | 4070.59 ± 1.99 | 69.21 ± 5.22 |
| 5,000 Examples | Retrain | 18.61 ± 0.90 | 100.00 ± 0.00 | 36.25 ± 0.63 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| | Origin | 21.82 ± 0.91 | 19.07 ± 1.26 | 31.23 ± 0.53 | 4173.23 ± 1.90 | 80.62 ± 4.58 |
| | IS | 21.17 ± 0.87 | 54.05 ± 1.86 | 33.25 ± 0.77 | 4342.04 ± 1.64 | 46.49 ± 4.90 |
| | Ours | 20.73 ± 0.86 | 73.96 ± 4.25 | 34.81 ± 0.31 | 4151.34 ± 1.94 | 26.64 ± 7.68 |

CIFAR-10 + SGHMC

| Remove | Method | Err. on Sr (%) | Err. on Sf (%) | Err. on Stest (%) | $\hat{\varepsilon}_M$ (×10³) | MIA Acc. (%) |
|---|---|---|---|---|---|---|
| 3,000 Examples | Retrain | 20.24 ± 0.37 | 45.83 ± 1.61 | 31.80 ± 0.48 | 0.00 ± 0.00 | 55.34 ± 3.00 |
| | Origin | 21.50 ± 0.40 | 18.83 ± 1.46 | 31.47 ± 0.54 | 0.00 ± 0.00 | 82.85 ± 1.49 |
| | IS | 21.61 ± 0.40 | 28.64 ± 1.23 | 31.90 ± 0.53 | 4296.82 ± 2.02 | 70.79 ± 3.52 |
| | Ours | 21.31 ± 0.37 | 30.33 ± 2.44 | 31.75 ± 0.55 | 4137.04 ± 2.42 | 69.83 ± 2.78 |
| 5,000 Examples | Retrain | 18.56 ± 0.38 | 100.00 ± 0.00 | 36.21 ± 0.34 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| | Origin | 21.61 ± 0.39 | 18.91 ± 1.13 | 31.47 ± 0.52 | 4233.95 ± 1.73 | 82.70 ± 1.35 |
| | IS | 21.03 ± 0.44 | 54.20 ± 1.48 | 33.55 ± 0.36 | 4408.05 ± 1.21 | 44.20 ± 1.63 |
| | Ours | 20.54 ± 0.36 | 69.56 ± 4.45 | 34.67 ± 0.78 | 4212.63 ± 1.72 | 29.12 ± 3.94 |

5.2 EXPERIMENTS FOR BAYESIAN NEURAL NETWORKS

[Figure 2: Time costs for processing single data deletion requests on BNNs (SGLD and SGHMC; Retrain, IS, Ours). A smaller time cost implies a higher efficiency of the unlearning method.]
We then apply the MCMC unlearning algorithm to Bayesian neural networks (BNNs) on the CIFAR-10 (Krizhevsky et al., 2009) dataset for classification. Analogous experiments are also conducted on the Fashion-MNIST (Xiao et al., 2017) dataset; see Appendix D for more details.

Dataset & BNN. We divide the training set S of CIFAR-10 into two parts, the removed part Sf and the remaining part Sr. A certain number of examples are randomly chosen from a single class in the dataset to form the removed training set Sf. The test set is denoted by Stest. A BNN consisting of two convolutional layers and two fully-connected layers is adopted in the experiments. For more details about the dataset and model architecture, see Appendix D.

Baseline method. We compare the proposed MCMC unlearning algorithm with the importance sampling method (IS). Specifically, when a data deletion request comes, the importance sampling method processes the request by performing MCMC sampling on the remaining data.

Evaluation metrics. Four kinds of metrics are used to evaluate the performance of machine unlearning: (1) Classification errors. We calculate classification errors on the remaining set Sr, removed set Sf, and test set Stest for different models. Ideally, after removing data, the three errors of the processed model would approach those of the retrained model. (2) ε-knowledge removal guarantee. We use the knowledge removal estimator $\hat{\varepsilon}_M$ defined in Definition 2 to estimate the ε value in the ε-knowledge removal guarantee. A smaller estimate implies better knowledge removal performance. Calculating $\hat{\varepsilon}_M$ requires fixing a series of random seeds $\omega_1, \ldots, \omega_M$. To this end, we use a series of pre-trained models as the random seeds. See Appendix D.3 for the detailed calculation procedures. (3) Membership inference attack (MIA) accuracy. We employ a white-box MIA (Yeom et al., 2018) to infer whether the examples from the removed subset Sf come from the training set of the model. Intuitively, the attack accuracy would decline after unlearning. Thereby, a small attack accuracy implies a strong knowledge removal performance. See Appendix D.4 for more details about the MIA and its calculation procedures. (4) Time costs for unlearning. For different unlearning methods, we estimate the average time usage for processing single data deletion requests. A smaller time cost corresponds to higher unlearning efficiency.

[Figure 3: The changes of the prediction confidences for test set examples of CIFAR-10 along with data removal, after removing 0, 2048, 4096, and 5000 samples. Panels (a) and (b) show plane images (plane is the removed class); panels (c) and (d) show a cat image and a ship image (not the removed class). The BNN is trained with SGLD and the knowledge removal is conducted with the proposed MCMC unlearning algorithm. The predictions for the true labels are colored in red.]

Experiment design. We first train the BNN on the complete training set S with SGLD and SGHMC, respectively.
Then, the subset Sf is removed iteratively, where a fixed number of data points are removed in each iteration. We also train models on only the remaining set Sr to show the targets of the machine unlearning task. For the details of the experiments, see Appendix D.5.

Results analysis. The experiment results are presented in Table 1. In all experiments, we observe that the proposed MCMC unlearning algorithm can significantly increase the model error on the sample set Sf while keeping the errors on the sample sets Sr and Stest close to those of the retrained model. Besides, our algorithm can also reduce the ε value in the knowledge removal guarantee and the MIA accuracy on Sf, which further indicates it can indeed remove specified knowledge from the MCMC models. In contrast, the importance sampling method could neither effectively make the classification errors approach the targets nor achieve stronger ε-knowledge removal. We also present the time costs for different unlearning methods in Fig. 2. One can find that the MCMC unlearning algorithm is significantly faster than the other methods. All these experiment results demonstrate the effectiveness and efficiency of the proposed algorithm.

Finally, we give a case study about the prediction changes on test set examples to illustrate the effect of the proposed MCMC unlearning algorithm. The results are presented in Fig. 3, where one can observe that as data removal continues, the prediction confidence of the model on the removed class significantly decreases, while those on other classes remain the same or increase.

6 CONCLUSION

The right to be forgotten imposes a considerable compliance burden on AI companies. A company may need to delete a whole model learned with massive resources due to a request to delete a single sample point. Existing works are only applicable to explicitly parameterized optimization problems and do not work for the sampling-based Bayesian inference method, i.e., MCMC. In this work, we convert the MCMC unlearning problem into an optimization problem. Then, an MCMC influence function is designed to characterize the influence of data on the distribution inferred by MCMC. It then delivers the first MCMC unlearning algorithm. Theoretical analyses show that the proposed algorithm can indeed remove the learned knowledge, and would not compromise the generalization ability of the inferred distribution. Experiments show that the proposed algorithm can remove the influence of specified samples without compromising the knowledge learned from the remaining data.

ACKNOWLEDGMENTS

FH is supported in part by the Major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" key project (No. 2021ZD0111700). SF and DT are supported by Australian Research Council Projects FL-170100117, IH-180100002, IC-190100031, and LE-200100049. The authors sincerely appreciate Yue Xu and the anonymous ICLR reviewers for their helpful comments.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318, 2016.

Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, pp. 1771-1778, 2012.

Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. arXiv preprint arXiv:2002.02730, 2020.

Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141-159. IEEE, 2021.

Jonathan Brophy and Daniel Lowd. Machine unlearning for random forests. In International Conference on Machine Learning, pp. 1092-1104. PMLR, 2021.

Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463-480. IEEE, 2015.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751-1772, 2018.

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1683-1691. PMLR, 2014.

R Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York: Chapman and Hall, 1982.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. Advances in Cryptology - EUROCRYPT 2020, 12106:373, 2020.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721-741, 1984.

Edward I George and Robert E McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881-889, 1993.

Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making AI forget you: Data deletion in machine learning.
In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9304-9312, 2020.

Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 792-801, 2021.

Jinu Gong, Osvaldo Simeone, and Joonhyuk Kang. Bayesian variational federated learning and unlearning in decentralized networks. arXiv preprint arXiv:2104.03834, 2021.

Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning, pp. 3832-3842. PMLR, 2020.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.

Fengxiang He and Dacheng Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.

Fengxiang He, Tongliang Liu, and Dacheng Tao. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Processing Systems, pp. 1143-1152, 2019.

Peter J Huber. Robust Statistics, volume 523. John Wiley & Sons, 2004.

Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pp. 2008-2016. PMLR, 2021.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885-1894, 2017.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Tiffany Li, Eduard Fosch Villaronga, and Peter Kieseberg. Humans forget, machines remember: Artificial intelligence and the right to be forgotten. Computer Law & Security Review, 34(2):304, 2018.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

David A McAllester. Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory, pp. 230-234, 1998.

David A McAllester. PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory, pp. 164-170, 1999.

David A McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

Radford M Neal. Slice sampling. The Annals of Statistics, 31(3):705-767, 2003.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational Bayesian unlearning. Advances in Neural Information Processing Systems, 33, 2020.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18. IEEE, 2017.

Enayat Ullah, Tung Mai, Anup Rao, Ryan A. Rossi, and Raman Arora. Machine unlearning via algorithmic stability. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pp. 4126-4142. PMLR, 2021.

Stephen G Walker. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36(1):45-54, 2007.

Chaoyue Wang, Chang Xu, and Dacheng Tao. Self-supervised pose adaptation for cross-domain image animation. IEEE Transactions on Artificial Intelligence, 1(1):34-46, 2020. doi: 10.1109/TAI.2020.3031581.

Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392-2405, 2009.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681-688, 2011.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268-282. IEEE, 2018.

Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020a.

Youjian Zhang, Chaoyue Wang, and Dacheng Tao. Video frame interpolation without temporal priors. In Advances in Neural Information Processing Systems, volume 33, pp. 13308-13318, 2020b.

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized Newton methods. In International Conference on Machine Learning, pp. 5990-5999. PMLR, 2018.

A PROOFS

This section collects all the proofs in this paper. To avoid technicalities, we assume that all functions are differentiable throughout this paper. Furthermore, the order of differentiation and integration is assumed to be interchangeable. For simplicity, we denote
$$h(\delta, z) := -\mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta), \qquad f(\delta) := -\mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta),$$
$$F(\delta, S) := \sum_{z \in S} h(\delta, z) + f(\delta) = -\sum_{z \in S} \mathbb{E}_{\theta \sim p_S} \log p(z|\theta + \delta) - \mathbb{E}_{\theta \sim p_S} \log p(\theta + \delta).$$

A.1 PROOF OF THEOREM 1

To prove Theorem 1, we first define the following function,
$$F_{S',\tau}(\delta, S) = F(\delta, S) + \tau \sum_{z \in S'} h(\delta, z),$$
where $\tau \in [-1, 0]$. Under Assumption 3, it can be shown that $F_{S',\tau}(\delta, S)$ is strongly convex on the space $V_3$ defined in Assumption 3.

Lemma 1. Suppose Assumption 3 holds. Then, for any $\tau \in [-1, 0]$, $F_{S',\tau}(\delta, S)$ is strongly convex on the space $V_3$.

Proof of Lemma 1.
We rearrange the function $F_{S',\tau}$ as follows,
$$F_{S',\tau}(\delta, S) = F(\delta, S \setminus S') + (1 + \tau) \sum_{z \in S'} h(\delta, z).$$
Clearly, $F(\delta, S \setminus S')$ is strongly convex on $V_3$. Since $\tau \in [-1, 0]$, we have that $1 + \tau \geq 0$. Thus, $(1 + \tau) \sum_{z \in S'} h(\delta, z)$ is either strongly convex on $V_3$ or equal to zero. Therefore, $F_{S',\tau}(\delta, S)$ is strongly convex on $V_3$. The proof is completed.

We then prove that when Assumption 4 holds, there exists a continuous mapping that connects the origin $0 \in \mathbb{R}^d$ and the target shifting scale $\delta^*_{S \setminus S'}$.

Lemma 2. Suppose Assumption 4 holds. Then, there exists a continuous function $\hat{\delta} : [-1, 0] \to V$ such that for any $\tau \in [-1, 0]$, $\hat{\delta}(\tau)$ is the global minimizer of the function $F_{S',\tau}(\delta, S)$ on the space $V$.

Proof of Lemma 2. By applying Lemma 1 and Assumption 4, we have that the global minimizer of the function $F_{S',\tau}(\delta, S)$ uniquely exists in $V$. Therefore, one can define the mapping $\hat{\delta}$ as below,
$$\hat{\delta}(\tau) = \arg\min_{\delta \in V} F_{S',\tau}(\delta, S),$$
where $\tau \in [-1, 0]$. We then prove the continuity of $\hat{\delta}$. Notice that $\hat{\delta}(\tau)$ is the solution of the following equation,
$$\nabla_\delta F(\delta, S) + \tau \sum_{z \in S'} \nabla_\delta h(\delta, z) = 0. \qquad (3)$$
Since $F_{S',\tau}(\delta, S)$ is strongly convex, its Hessian matrix $\nabla^2_\delta F_{S',\tau}(\hat{\delta}(\tau), S)$ is invertible. Combining this with the implicit function theorem, we have that $\hat{\delta}$ is continuous on the interval $[-1, 0]$. The proof is completed.

When $\tau$ takes $0$ and $-1$, the function value $\hat{\delta}(\tau)$ becomes $0 \in \mathbb{R}^d$ and $\delta^*_{S \setminus S'}$, respectively. Therefore, $\delta^*_{S \setminus S'}$ can be expanded into a Taylor series as follows,
$$\delta^*_{S \setminus S'} = \hat{\delta}(-1) = \hat{\delta}(0) - \frac{\partial \hat{\delta}(0)}{\partial \tau} + \frac{1}{2} \frac{\partial^2 \hat{\delta}(\xi)}{\partial \tau^2}, \qquad (4)$$
where $\xi \in [-1, 0]$, and $\frac{1}{2} \frac{\partial^2 \hat{\delta}(\xi)}{\partial \tau^2}$ is the Lagrange form of the remainder. We prove that the Lagrange remainder term becomes negligible as the training set size $N$ goes to infinity. To do this, we first prove the following lemma.

Lemma 3. Suppose Assumptions 1-4 hold. The mapping $\hat{\delta}$ is as defined in Lemma 2. Then, we have that
$$\frac{\partial \hat{\delta}(\tau)}{\partial \tau} = -\left(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\right)^{-1} \sum_{z \in S'} \nabla_\delta h(\hat{\delta}(\tau), z),$$
and for any $\tau \in [-1, 0]$, $\left\|\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right\|_2 \leq O(|S'|/N)$.

Proof. We first calculate $\frac{\partial \hat{\delta}(\tau)}{\partial \tau}$ based on Eq. (3) in the proof of Lemma 2. Differentiating both sides of Eq. (3) with respect to $\tau$, we have that
$$\nabla^2_\delta F(\hat{\delta}(\tau), S) \frac{\partial \hat{\delta}(\tau)}{\partial \tau} + \sum_{z \in S'} \nabla_\delta h(\hat{\delta}(\tau), z) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau} = 0. \qquad (5)$$
According to Lemma 1, $F_{S',\tau}(\delta, S)$ is strongly convex on $V$. Thus, the Hessian matrix
$$\nabla^2_\delta F_{S',\tau}(\hat{\delta}(\tau), S) = \nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)$$
is positive definite, hence invertible. Combining this with Eq. (5), we have that
$$\frac{\partial \hat{\delta}(\tau)}{\partial \tau} = -\left(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\right)^{-1} \sum_{z \in S'} \nabla_\delta h(\hat{\delta}(\tau), z). \qquad (6)$$
We then upper bound the norm of $\frac{\partial \hat{\delta}(\tau)}{\partial \tau}$. Based on Eq. (6), we have that
$$\left\|\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right\|_2 \leq \left\|\left(\frac{1}{N}\Big(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\Big)\right)^{-1}\right\|_2 \cdot \left\|\frac{1}{N} \sum_{z \in S'} \nabla_\delta h(\hat{\delta}(\tau), z)\right\|_2. \qquad (7)$$
We first consider the first term on the right-hand side of Eq. (7). By Assumption 3, both $f(\delta)$ and $h(\delta, z)$ are $\mu$-strongly convex on the space $V \subseteq \mathrm{supp}(\delta)$. Thus, we have that
$$\frac{1}{N}\Big(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\Big) \succeq \frac{1}{N}\Big(\sum_{z \in S \setminus S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) + \nabla^2_\delta f(\hat{\delta}(\tau))\Big) \succeq \frac{(N - |S'| + 1)\mu}{N}\, I. \qquad (8)$$
Let $\lambda_{\min}$ denote the smallest eigenvalue of the matrix $\frac{1}{N}(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z))$. Then, Eq. (8) implies that $\lambda_{\min} \geq \frac{N - |S'|}{N}\mu$. Hence, we have the following,
$$\left\|\left(\frac{1}{N}\Big(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\Big)\right)^{-1}\right\|_2 = \frac{1}{\lambda_{\min}} \leq \frac{N}{(N - |S'|)\mu} = O(1). \qquad (9)$$
We then upper bound the second term on the right-hand side of Eq. (7). By applying Assumption 1, we have that
$$\left\|\frac{1}{N} \sum_{z \in S'} \nabla_\delta h(\hat{\delta}(\tau), z)\right\|_2 \leq \frac{L_1 |S'|}{N}. \qquad (10)$$
Finally, inserting Eqs. (9)-(10) into Eq. (7), we eventually have that
$$\left\|\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right\|_2 \leq O(1) \cdot O\!\left(\frac{|S'|}{N}\right) = O\!\left(\frac{|S'|}{N}\right).$$
The proof is completed.

Then, the following lemma is obtained based on Lemma 3, which demonstrates the strictness of the approximation in Eq. (4).

Lemma 4.
Suppose Assumptions 1-4 hold. The mapping $\hat{\delta}$ is as defined in Lemma 2. Then, for any $\tau \in [-1, 0]$, we have that
$$\left\|\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2}\right\|_2 \leq O\!\left(\frac{|S'|^2}{N^2}\right).$$

Proof. We first calculate $\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2}$ based on Eq. (3). Similar to the proof of Lemma 3, we differentiate both sides of Eq. (3) twice with respect to $\tau$ and have that
$$\sum_{i=1}^{N} B(\tau, z_i) + A(\tau) + 2 \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau} + \tau \sum_{z \in S'} B(\tau, z) + \Big(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\Big) \frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2} = 0,$$
which means
$$\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2} = -\left(\frac{1}{N}\Big(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z)\Big)\right)^{-1} \frac{1}{N}\Big(\sum_{i=1}^{N} B(\tau, z_i) + \tau \sum_{z \in S'} B(\tau, z) + A(\tau) + 2 \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}\Big), \qquad (11)$$
in which the invertibility of $\frac{1}{N}(\nabla^2_\delta F(\hat{\delta}(\tau), S) + \tau \sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z))$ is guaranteed by Eq. (8), $A(\tau), B(\tau, z) \in \mathbb{R}^d$, and for $i = 1, \ldots, d$ their entries are
$$A(\tau)_i = \left(\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right)^{\!\top} \nabla^2_\delta\!\left(\frac{\partial f(\hat{\delta}(\tau))}{\partial \delta_i}\right) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}, \qquad B(\tau, z)_i = \left(\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right)^{\!\top} \nabla^2_\delta\!\left(\frac{\partial h(\hat{\delta}(\tau), z)}{\partial \delta_i}\right) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}.$$
We then upper bound the norm of $\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2}$. Based on Eq. (11) and Eq. (9), we have that
$$\left\|\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2}\right\|_2 \leq O(1) \cdot \frac{1}{N}\left(\Big\|\sum_{i=1}^{N} B(\tau, z_i)\Big\|_2 + \Big\|\tau \sum_{z \in S'} B(\tau, z)\Big\|_2 + \|A(\tau)\|_2 + 2\Big\|\sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}\Big\|_2\right). \qquad (15)$$
Thus, the remaining task is to upper bound the norms of $A(\tau)$, $B(\tau, z)$, and $\sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}$.

We first upper bound $\|A(\tau)\|_2$. Applying Lemma 3, we have that $\|\partial \hat{\delta}(\tau)/\partial \tau\|_2 \leq O(|S'|/N)$. Applying the Lipschitz Hessian condition in Assumption 2, each $\|\nabla^2_\delta(\partial f(\hat{\delta}(\tau))/\partial \delta_i)\|_2$ is bounded by a real constant. Therefore, we have that
$$\|A(\tau)\|_2 \leq O(1) \cdot O\!\left(\frac{|S'|^2}{N^2}\right) = O\!\left(\frac{|S'|^2}{N^2}\right). \qquad (16)$$
For $B(\tau, z)$, we similarly have that
$$\|B(\tau, z)\|_2 \leq O\!\left(\frac{|S'|^2}{N^2}\right). \qquad (17)$$
To upper bound the norm of $\sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}$, we apply Lemma 3 and Assumption 2 and have that
$$\left\|\sum_{z \in S'} \nabla^2_\delta h(\hat{\delta}(\tau), z) \frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right\|_2 \leq \sum_{z \in S'} \left\|\nabla^2_\delta h(\hat{\delta}(\tau), z)\right\|_2 \left\|\frac{\partial \hat{\delta}(\tau)}{\partial \tau}\right\|_2 \leq O\!\left(\frac{|S'|^2}{N}\right). \qquad (18)$$
Inserting Eqs. (16), (17), and (18) into Eq. (15), we eventually have that
$$\left\|\frac{\partial^2 \hat{\delta}(\tau)}{\partial \tau^2}\right\|_2 \leq O(1) \cdot \frac{1}{N}\left(N \cdot O\!\left(\frac{|S'|^2}{N^2}\right) + |S'| \cdot O\!\left(\frac{|S'|^2}{N^2}\right) + O\!\left(\frac{|S'|^2}{N^2}\right) + O\!\left(\frac{|S'|^2}{N}\right)\right) = O\!\left(\frac{|S'|^2}{N^2}\right).$$
The proof is completed.

Finally, combining all the results, we prove Theorem 1.

Proof of Theorem 1. The proof is completed by applying Lemmas 3 and 4 to Eq. (4).

A.2 PROOF OF THEOREM 2

This section presents the proof of Theorem 2.

Proof of Theorem 2. Expanding the KL-divergence $KL(p_S(\cdot + I(S')) \,\|\, p_{S \setminus S'})$ into a Taylor series around the local region of $\delta = \delta^*_{S \setminus S'}$, we have
$$KL(p_S(\cdot + I(S')) \,\|\, p_{S \setminus S'}) = KL(p_S(\cdot - \delta^*_{S \setminus S'}) \,\|\, p_{S \setminus S'}) + \frac{1}{2}\big(I(S') + \delta^*_{S \setminus S'}\big)^{\!\top} \nabla^2_\delta KL(p_S(\cdot - \xi) \,\|\, p_{S \setminus S'}) \big(I(S') + \delta^*_{S \setminus S'}\big), \qquad (19)$$
where the first-order term equals $0$ according to the first-order optimality condition. By applying Theorem 1 and Assumption 2, we further have
$$\frac{1}{2}\left\|\big(I(S') + \delta^*_{S \setminus S'}\big)^{\!\top} \nabla^2_\delta KL(p_S(\cdot - \xi) \,\|\, p_{S \setminus S'}) \big(I(S') + \delta^*_{S \setminus S'}\big)\right\| \leq O\big((N - |S'| + 1) L_2\, \mathrm{diam}(V)\big) \cdot O\!\left(\frac{|S'|^4}{N^4}\right) \leq O\!\left(\frac{|S'|^4}{N^3}\right). \qquad (20)$$
Combining Eqs. (19) and (20), the proof is completed.

A.3 PROOF OF THEOREM 3

This section presents the proof of Theorem 3. We derive the generalization bound for the proposed algorithm under the PAC-Bayesian framework (McAllester, 1998; 1999; 2003). The framework provides guarantees for randomized predictors (e.g., Bayesian predictors). Specifically, let $Q$ be a distribution on the parameter space $\Theta$, and let $P$ denote the prior distribution over the parameter space $\Theta$. Then, the expected risk $R(Q)$ is bounded in terms of the empirical risk $\hat{R}(Q, S)$ and the KL-divergence $KL(Q \| P)$ by the following result from PAC-Bayes theory.

Lemma 5 (cf. McAllester (2003), Theorem 1). For any real $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have the following inequality for all distributions $Q$:
$$R(Q) \leq \hat{R}(Q, S) + \sqrt{\frac{KL(Q \| P) + \log\frac{1}{\delta} + \log N + 2}{2N - 1}}. \qquad (21)$$
Based on Lemma 5, we prove the generalization bound in Theorem 3.

Proof of Theorem 3. Let the prior distribution $P$ be a Laplace distribution $\mathrm{lap}(0, 1)$, and let $p_{S \setminus S'} = p_S(\cdot + I(S'))$ be the distribution processed by the MCMC unlearning algorithm. Then, one can bound the KL-divergence $KL(p_{S \setminus S'} \| P)$ as follows (where we assume that $\Theta = \mathbb{R}^d$),
$$KL(p_{S \setminus S'} \| P) = \int_\Theta \log\frac{p_{S \setminus S'}(\theta)}{p(\theta)}\, p_{S \setminus S'}(\theta)\, d\theta = \int_\Theta \log\frac{p_S(\theta)}{p(\theta - I(S'))}\, p_S(\theta)\, d\theta \leq \int_\Theta \big[\log p_S(\theta) + \|\theta - I(S')\|_1 + \log 2\big]\, p_S(\theta)\, d\theta \leq \mathbb{E}_{\theta \sim p_S} \log p_S(\theta) + \mathbb{E}_{\theta \sim p_S} \|\theta\|_1 + \|I(S')\|_1 + \log 2. \qquad (22)$$
Inserting Eq. (22) into Eq. (21) in Lemma 5, we obtain the PAC-Bayesian generalization bound. Eventually, by applying Lemma 3, we have that $\|I(S')\|_1 \leq O(|S'|/N)$. The proof is completed.

B EFFICIENT CALCULATION OF THE MCMC INFLUENCE FUNCTION

A major computational burden in the MCMC unlearning algorithm is calculating the product $H^{-1}v$, where $H$ is the Hessian matrix of some scalar-valued function $f(x)$ and $v$ is a constant vector. When the function $f(x)$ is of high dimension, the calculation has a considerably high computational cost. We follow Agarwal et al. (2017) and Koh & Liang (2017) and apply a divide-and-conquer strategy to address the issue. This strategy relies on calculating the Hessian-vector product $Hv$.

Hessian-vector product (HVP). We first discuss how to efficiently calculate $Hv$. The calculation of $Hv$ can be decomposed into two steps: (1) calculate the gradient $\frac{\partial f(x)}{\partial x}$, and then (2) calculate $\nabla_x\big(\frac{\partial f(x)}{\partial x} v\big)$. It is worth noting that $\frac{\partial f(x)}{\partial x} \in \mathbb{R}^{1 \times d}$ and $v \in \mathbb{R}^{d \times 1}$, where $d > 0$ is the parameter dimension. Thus, $\frac{\partial f(x)}{\partial x} v$ is a scalar value, and calculating its gradient $\nabla_x(\frac{\partial f(x)}{\partial x} v)$ has a very low computational cost on platforms such as PyTorch (Paszke et al., 2017) or TensorFlow (Abadi et al., 2015).

Calculating $H^{-1}v$. When the norm $\|H\| \leq 1$, the matrix $H^{-1}$ can be expanded via the Taylor series $H^{-1} = \sum_{i=0}^{\infty} (I - H)^i$. Define $H^{-1}_j = \sum_{i=0}^{j} (I - H)^i$. Then, we have the following recursive equation,
$$H^{-1}_j v = v + (I - H) H^{-1}_{j-1} v.$$
Agarwal et al. (2017) prove that when $j \to \infty$, we have $\mathbb{E}[H^{-1}_j] \to H^{-1}$. Therefore, we employ $H^{-1}_j v$ to approximate $H^{-1}v$. Moreover, to ensure that the condition $\|H\| \leq 1$ holds, we scale $H$ to $cH$ by a scale $c \in \mathbb{R}_+$ such that $\|cH\| \leq 1$. We then approximate $(cH)^{-1}$ and eventually recover $H^{-1} = c(cH)^{-1}$, which can be plugged into the applicable equations above. A code sketch of this procedure is given after the training details below.

C EXPERIMENT DETAILS FOR GAUSSIAN MIXTURE MODELS

This section provides additional experiment details for the Gaussian mixture models.

C.1 GAUSSIAN MIXTURE MODELS

We employ Gaussian mixture models (GMMs) to infer the cluster centers on the synthetic dataset. A GMM assumes that each example is drawn from one of $K$ Gaussian distributions centered at $\mu_1, \ldots, \mu_K$, respectively. The hierarchical structure of the GMM is as follows: (1) draw a cluster index $c_i$ from the uniform distribution over $\{1, \ldots, K\}$; and (2) sample $z_i$ from a Gaussian distribution centered at $\mu_{c_i}$. That is,
$$\mu_k \sim N(0, \sigma^2 I), \qquad c_i \sim \mathrm{categorical}\!\left(\tfrac{1}{K}\right), \qquad z_i \sim N(\mu_{c_i}, I),$$
where $1 \leq k \leq K$, $1 \leq i \leq n$, $\mu_k \in \mathbb{R}^d$, $c_i \in \{1, \ldots, K\}$, $z_i \in \mathbb{R}^d$, and the hyperparameter $\sigma \in \mathbb{R}$ is the prior standard deviation. In our experiments, the factor $K$ is set as 4, and the prior standard deviation $\sigma$ is set as 1.

C.2 EXPERIMENT DETAILS

The experiments for GMMs have two main phases:

Training phase. Every GMM is trained for 4,000 iterations. The batch size is set as 64. For both SGLD and SGHMC, the learning rate schedule is set as $4t^{-0.5005}/N$, where $t$ is the training iteration step and $N$ is the training set size. Besides, the momentum factor $\alpha$ of SGHMC is set as 0.9.
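As referenced in Appendix B, the following is a minimal PyTorch sketch of the HVP and Neumann-series inverse-HVP routine (our illustration under the stated assumptions, not the authors' released code; function names are illustrative). The unlearning phase below plugs in the recursion number $j$ and the scaling factor $c$.

```python
import torch

def hvp(f, x, v):
    """Hessian-vector product H v of a scalar function f at x (x must require
    grad), via the gradient-of-(gradient . v) trick described in Appendix B."""
    grad = torch.autograd.grad(f(x), x, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, x)[0]

def inverse_hvp(f, x, v, j=32, c=1.0):
    """Approximate H^{-1} v with the truncated Neumann series
    H_j^{-1} v = v + (I - cH) H_{j-1}^{-1} v on the scaled Hessian cH,
    then rescale by c so that H^{-1} v = c (cH)^{-1} v."""
    out = v.clone()
    for _ in range(j):
        out = v + out - c * hvp(f, x, out)  # apply (I - cH), then add v
    return c * out
```

In the experiments, $j$ is set to 32 for GMMs and 64 for BNNs, and $c$ is chosen on the order of $1/N'$, where $N'$ is the current number of remaining training examples.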
Unlearning phase. We remove a batch of 4 data points each time. When calculating the inverse-Hessian-vector product $H^{-1}v$ in the influence function (see Appendix B), the recursive calculation number $j$ is set as 32, and the scaling factor $c$ is set as $1/N'$, where $N'$ is the number of currently remaining training examples. Notice that $N'$ gradually decreases as the forgetting process continues. Moreover, we employ the Monte Carlo method to calculate the expectations in the MCMC influence function. Specifically, we repeatedly draw the sample $\theta$ 5 times, calculate the matrix or vector in the MCMC influence function, and average the results to approximate the expectations.

D EXPERIMENT DETAILS FOR BAYESIAN NEURAL NETWORKS

This section provides additional experiment details for Bayesian neural networks.

D.1 DATASETS

We employ two image datasets, Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009), in our experiments. Fashion-MNIST consists of grey-scale images from 10 classes, where each class contains 6,000 training examples and 1,000 test examples. CIFAR-10 consists of color images from 10 classes, where each class contains 5,000 training examples and 1,000 test examples. No data augmentation is used in the experiments. Furthermore, it is worth noting that the models used for Fashion-MNIST and CIFAR-10 are first pretrained on MNIST (LeCun et al., 1998) and CIFAR-100 (Krizhevsky et al., 2009), respectively. Please see Appendices D.3 and D.5 for more details.

D.2 BAYESIAN NEURAL NETWORKS

This section introduces the model architectures used in our experiments and the implementation details for training BNNs.

D.2.1 MODEL ARCHITECTURES

For Fashion-MNIST, we use a LeNet (LeCun et al., 1998) architecture that consists of two convolutional layers followed by three fully-connected layers. The convolutional layers use 5×5 convolutions, each followed by a 2×2 max-pooling layer and a ReLU activation function. The channel numbers of the two convolutional layers are 6 and 16, respectively. Besides, each fully-connected layer is also followed by a ReLU activation function, and the hidden layer channels are 120 and 84, respectively.

For CIFAR-10, we follow Abadi et al. (2016) and use a network architecture that consists of two convolutional layers followed by two fully-connected layers. Each convolutional layer uses a 5×5 convolution with a channel number of 64, followed by a 2×2 max-pooling layer and a ReLU activation function. Besides, each fully-connected layer is also followed by a ReLU activation function, and the hidden layer channels are 384.

D.2.2 MODEL TRAINING

In all the experiments, we use an isotropic Gaussian distribution $N(0, \sigma_0^2 I)$ as the prior of the BNNs, in which the hyperparameter $\sigma_0$ is set as 0.1. Traditional SG-MCMC methods inject random noise into BNNs during training. However, it has been found that noise injection in the early training iterations hurts the convergence of BNNs (Zhang et al., 2020a). To alleviate this problem, we follow Zhang et al. (2020a) to first avoid noise injection in the early training iterations and then resume SG-MCMC as usual in the rest of the training.

D.3 ESTIMATING THE ε-KNOWLEDGE REMOVAL GUARANTEE

We use the knowledge removal estimator $\hat{\varepsilon}_M$ defined in Definition 2 to estimate the ε value in the ε-knowledge removal guarantee.
D.3 ESTIMATING THE ε-KNOWLEDGE REMOVAL GUARANTEE

We use the knowledge removal estimator $\hat{\varepsilon}_M$ defined in Definition 2 to estimate the value of $\varepsilon$ in the ε-knowledge removal guarantee. According to the definition of $\hat{\varepsilon}_M$, calculating it requires first fixing a series of random seeds and then calculating the KL-divergence between two MCMC distributions. For the random seeds, we employ the pretraining technique (Golatkar et al., 2020) to fix them: for the experiments on Fashion-MNIST, all models are first pretrained on MNIST, and for the experiments on CIFAR-10, all models are first pretrained on CIFAR-100. For the KL-divergence calculation, we follow Wang et al. (2009) and estimate the KL-divergence with a k-nearest-neighbor (k-NN) estimator, in which 100 samples are drawn from each of the processed and retrained distributions.

D.4 MEMBERSHIP INFERENCE ATTACK

Membership inference attack (MIA) (Shokri et al., 2017; Yeom et al., 2018) is a privacy attack that aims to infer whether a given example is in the training set based on the output of the model. In our experiments, we adopt a white-box threshold-based MIA (Yeom et al., 2018) to assess the unlearning performance. The calculation consists of two steps: (1) learn an optimal MIA threshold on the remaining training set $S_r$ and the test set $S_{\mathrm{test}}$, and (2) calculate the MIA accuracy with the learned threshold on the removed training set $S_f$.

Learning the optimal threshold. Suppose the to-be-inferred example $(x, y)$ comes from $S_r$ and $S_{\mathrm{test}}$ with equal probabilities. Then, the attack accuracy of an MIA with threshold $\rho \in [0, 1]$ on the model $f_\theta$ is
$$\mathrm{Acc}(\rho, S_r, S_{\mathrm{test}}) = 0.5\left(\frac{1}{|S_r|}\sum_{(x,y)\in S_r} \mathbb{1}[f_\theta(x)_y \ge \rho] + \frac{1}{|S_{\mathrm{test}}|}\sum_{(x,y)\in S_{\mathrm{test}}} \mathbb{1}[f_\theta(x)_y < \rho]\right),$$
where $f_\theta(x)_y$ is the output confidence for label $y$ and $\mathbb{1}[\cdot]$ is the indicator function. The optimal threshold $\rho_{\mathrm{optim}}$ is obtained as the maximizer of the above attack accuracy, i.e., $\rho_{\mathrm{optim}} = \arg\max_{\rho} \mathrm{Acc}(\rho, S_r, S_{\mathrm{test}})$.

Calculating the MIA accuracy on $S_f$. With the optimal threshold $\rho_{\mathrm{optim}}$ learned on $S_r$ and $S_{\mathrm{test}}$, the MIA accuracy on the removed set $S_f$ is
$$\mathrm{Acc}(\rho_{\mathrm{optim}}, S_f) = \frac{1}{|S_f|}\sum_{(x,y)\in S_f} \mathbb{1}[f_\theta(x)_y \ge \rho_{\mathrm{optim}}].$$
Intuitively, after the knowledge learned from the removed subset $S_f$ is erased, the MIA accuracy $\mathrm{Acc}(\rho_{\mathrm{optim}}, S_f)$ should decline. As a result, a low MIA accuracy on the removed subset $S_f$ implies a strong knowledge removal performance.
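The two-step evaluation above can be sketched in a few lines of NumPy. This is our own minimal illustration, not the exact evaluation code; `conf_r`, `conf_test`, and `conf_f` denote arrays of the model's confidence on the true label for the examples in $S_r$, $S_{\mathrm{test}}$, and $S_f$, respectively:

```python
import numpy as np

def mia_accuracy(conf_r, conf_test, conf_f):
    # (1) Sweep candidate thresholds and pick the one that best separates
    #     members (S_r) from non-members (S_test). Sweeping only the
    #     observed confidences suffices, since the attack accuracy can
    #     only change at those values.
    candidates = np.unique(np.concatenate([conf_r, conf_test]))
    accs = [0.5 * ((conf_r >= rho).mean() + (conf_test < rho).mean())
            for rho in candidates]
    rho_optim = candidates[int(np.argmax(accs))]
    # (2) Report how often the attack still flags removed examples as members.
    return (conf_f >= rho_optim).mean()
```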
D.5 EXPERIMENT DETAILS

This section provides the omitted experiment details, including the hyperparameter settings and the procedures of the experiments.

D.5.1 EXPERIMENTS ON FASHION-MNIST

Pretraining phase. In every experiment, we pretrain a non-Bayesian model on MNIST via SGD for 2,000 iterations, with a batch size of 128, a learning rate of $0.1/N$, where $N$ is the training set size, and a momentum factor of 0.9.

Training phase. Every BNN is trained for 10,000 iterations. The batch size is set as 128. For both SGLD and SGHMC, we first train the model without noise injection for the first 1,000 iterations; in this stage, the learning rate is fixed to $0.01/N$. We then resume standard SGLD and SGHMC for the rest of the training; in this stage, the learning rate schedule is set as $0.01 \cdot t^{-0.5005}/N$, where $t$ is the training iteration step. Besides, the momentum factor $\alpha$ of SGHMC is set as 0.9.

Unlearning phase. We remove a batch of 64 data points each time. For the MCMC unlearning algorithm, when calculating the inverse-Hessian-vector product $H^{-1}v$ in the MCMC influence function (see Appendix B), the recursive calculation number $j$ is set as 64 and the scaling factor $c$ is set as $0.05/N$, where $N$ is the number of currently remaining training examples; notice that $N$ gradually decreases as the forgetting process continues. Besides, we employ the Monte Carlo method to calculate the expectations in the MCMC influence function: each time, we draw a sample $\theta$ via the MCMC sampler and calculate the matrices and vectors in the MCMC influence function based on $\theta$ to estimate the desired expectations. For the importance sampling (IS) baseline, we process each data deletion request by performing 1,000 MCMC sampling steps on the remaining data.

D.5.2 EXPERIMENTS ON CIFAR-10

Pretraining phase. In every experiment, we pretrain a non-Bayesian model on CIFAR-100 via SGD for 10,000 iterations, with a batch size of 128, a learning rate of $0.1/N$, where $N$ is the training set size, and a momentum factor of 0.9.

Training phase. Every BNN is trained for 20,000 iterations. The batch size is set as 128. For both SGLD and SGHMC, we first train the model without noise injection for the first 5,000 iterations; in this stage, the learning rate is fixed to $0.01/N$. We then resume standard SGLD and SGHMC for the rest of the training; in this stage, the learning rate schedule is set as $0.01 \cdot t^{-0.5005}/N$, where $t$ is the training iteration step. Besides, the momentum factor $\alpha$ of SGHMC is set as 0.9.

Unlearning phase. We remove a batch of 64 data points each time. For the MCMC unlearning algorithm, when calculating the inverse-Hessian-vector product $H^{-1}v$ in the MCMC influence function (see Appendix B), the recursive calculation number $j$ is set as 64 and the scaling factor $c$ is set as $0.08/N$, where $N$ is the number of currently remaining training examples; notice that $N$ gradually decreases as the forgetting process continues. Besides, we employ the Monte Carlo method to calculate the expectations in the MCMC influence function: each time, we draw a sample $\theta$ via the MCMC sampler and calculate the matrices and vectors in the MCMC influence function based on $\theta$ to estimate the desired expectations. For the importance sampling (IS) baseline, we process each data deletion request by performing 1,000 MCMC sampling steps on the remaining data.

D.6 EXPERIMENT RESULTS ON FASHION-MNIST

This section presents all the experiment results of BNNs on the Fashion-MNIST dataset. The evaluations of unlearning performance are shown in Table 2, while the time costs for unlearning are shown in Figure 4. The results further justify the effectiveness of the proposed algorithm.

[Figure 4: Time costs for processing single data deletion requests on BNNs (Fashion-MNIST; Retrain, IS, and Ours, under SGLD and SGHMC).]

D.7 EVALUATION ON PREDICTION DIFFERENCES

In this section, we adopt a metric named the prediction difference to better quantify the deviation between the retrained and processed models. For two BNNs $p_1(\cdot)$ and $p_2(\cdot)$ learned by MCMC, their prediction difference on a given dataset $S$ is defined as
$$\mathbb{E}_{\theta_1\sim p_1}\,\mathbb{E}_{\theta_2\sim p_2}\ \frac{1}{|S|}\sum_{(x_i, y_i)\in S} \big\| f_{\theta_1}(x_i) - f_{\theta_2}(x_i) \big\|_1,$$
where $f_{\theta_1}(x_i)$ and $f_{\theta_2}(x_i)$ are the two prediction confidence vectors for the example $x_i$. Intuitively, a smaller prediction difference indicates a smaller deviation between the retrained and processed models.

We calculate the prediction difference between the retrained model and the processed model on three different datasets: $S_r$, $S_f$, and $S_{\mathrm{test}}$. The experiment results on CIFAR-10 and Fashion-MNIST are shown in Table 3. The results show that the proposed MCMC unlearning algorithm effectively reduces the prediction difference, which further justifies the effectiveness of the proposed method.
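The two expectations in the prediction difference can be estimated by averaging over posterior samples. Below is a minimal PyTorch sketch under our own naming (our illustration, not the released code); `models_1` and `models_2` denote networks instantiated from samples of $p_1$ and $p_2$, and `loader` iterates over the dataset $S$:

```python
import torch

def prediction_difference(models_1, models_2, loader):
    # Monte Carlo estimate of E_{theta1 ~ p1} E_{theta2 ~ p2} of the mean
    # L1 distance between the two confidence vectors over the dataset.
    diffs = []
    with torch.no_grad():
        for m1 in models_1:
            for m2 in models_2:
                total, count = 0.0, 0
                for x, _ in loader:
                    p1 = torch.softmax(m1(x), dim=-1)
                    p2 = torch.softmax(m2(x), dim=-1)
                    total += (p1 - p2).abs().sum(dim=-1).sum().item()
                    count += x.shape[0]
                diffs.append(total / count)
    return sum(diffs) / len(diffs)
```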
Table 2: Experiment results on Fashion-MNIST. Different metrics are used to evaluate the machine unlearning performance, including the classification errors on the remaining set $S_r$, removed set $S_f$, and test set $S_{\mathrm{test}}$, the knowledge removal estimator $\hat{\varepsilon}_M$, and the membership inference attack (MIA) accuracy on $S_f$. Every experiment is repeated 5 times.

| Setting | Method | Err. on Sr (%) | Err. on Sf (%) | Err. on Stest (%) | ε̂M (×10³) | MIA Acc. (%) |
|---|---|---|---|---|---|---|
| SGLD, remove 4,000 | Retrain | 12.81 ± 0.49 | 37.62 ± 0.76 | 15.63 ± 0.59 | 0.00 ± 0.00 | 16.12 ± 5.79 |
| | Origin | 13.49 ± 0.81 | 17.39 ± 0.41 | 14.73 ± 0.80 | 351.00 ± 4.98 | 10.86 ± 8.02 |
| | IS | 12.67 ± 0.70 | 34.93 ± 2.16 | 15.30 ± 0.77 | 365.97 ± 3.04 | 13.97 ± 11.37 |
| | Ours | 12.93 ± 0.72 | 35.84 ± 1.85 | 15.57 ± 0.77 | 350.30 ± 4.82 | 10.36 ± 2.31 |
| SGLD, remove 6,000 | Retrain | 11.21 ± 0.57 | 100.00 ± 0.00 | 21.05 ± 0.49 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| | Origin | 13.38 ± 0.85 | 17.08 ± 0.51 | 14.73 ± 0.80 | 364.63 ± 2.45 | 7.20 ± 5.93 |
| | IS | 11.10 ± 0.66 | 87.30 ± 5.82 | 19.90 ± 0.87 | 378.11 ± 1.45 | 4.52 ± 2.31 |
| | Ours | 11.39 ± 0.68 | 98.13 ± 1.61 | 20.59 ± 0.99 | 362.55 ± 2.47 | 1.25 ± 0.67 |
| SGHMC, remove 4,000 | Retrain | 14.57 ± 1.11 | 38.90 ± 3.42 | 17.34 ± 1.11 | 0.00 ± 0.00 | 5.58 ± 5.81 |
| | Origin | 15.21 ± 1.30 | 17.58 ± 1.68 | 16.40 ± 1.17 | 362.60 ± 8.24 | 12.03 ± 12.61 |
| | IS | 14.27 ± 1.22 | 36.54 ± 1.73 | 16.97 ± 1.16 | 375.62 ± 5.10 | 13.76 ± 13.12 |
| | Ours | 14.56 ± 1.24 | 37.48 ± 2.02 | 17.29 ± 1.08 | 361.72 ± 7.87 | 11.28 ± 13.25 |
| SGHMC, remove 6,000 | Retrain | 12.99 ± 0.94 | 100.00 ± 0.00 | 22.61 ± 0.69 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| | Origin | 15.16 ± 1.31 | 17.27 ± 1.80 | 16.40 ± 1.17 | 366.98 ± 6.24 | 8.32 ± 12.07 |
| | IS | 12.52 ± 1.04 | 87.48 ± 7.22 | 20.94 ± 1.44 | 382.34 ± 3.48 | 4.20 ± 1.84 |
| | Ours | 12.93 ± 1.16 | 98.23 ± 1.04 | 22.39 ± 0.85 | 364.99 ± 6.00 | 1.59 ± 0.51 |

Table 3: The results of prediction differences on CIFAR-10 and Fashion-MNIST. The prediction differences between the retrained model and the processed model are calculated on the remaining set $S_r$, removed set $S_f$, and test set $S_{\mathrm{test}}$, respectively.

| Setting | Method | Pred. diff. on Sr | Pred. diff. on Sf | Pred. diff. on Stest |
|---|---|---|---|---|
| CIFAR-10 + SGLD, remove 3,000 | Origin | 0.18 ± 0.00 | 0.50 ± 0.02 | 0.21 ± 0.01 |
| | IS | 0.18 ± 0.00 | 0.37 ± 0.01 | 0.20 ± 0.01 |
| | Ours | 0.17 ± 0.00 | 0.35 ± 0.02 | 0.19 ± 0.01 |
| CIFAR-10 + SGLD, remove 5,000 | Origin | 0.23 ± 0.00 | 1.42 ± 0.02 | 0.37 ± 0.01 |
| | IS | 0.21 ± 0.00 | 0.92 ± 0.03 | 0.29 ± 0.01 |
| | Ours | 0.19 ± 0.00 | 0.67 ± 0.05 | 0.25 ± 0.01 |
| CIFAR-10 + SGHMC, remove 3,000 | Origin | 0.19 ± 0.00 | 0.50 ± 0.01 | 0.22 ± 0.00 |
| | IS | 0.19 ± 0.00 | 0.38 ± 0.01 | 0.21 ± 0.00 |
| | Ours | 0.18 ± 0.00 | 0.35 ± 0.01 | 0.20 ± 0.01 |
| CIFAR-10 + SGHMC, remove 5,000 | Origin | 0.23 ± 0.00 | 1.44 ± 0.02 | 0.37 ± 0.00 |
| | IS | 0.22 ± 0.00 | 0.94 ± 0.02 | 0.31 ± 0.00 |
| | Ours | 0.20 ± 0.00 | 0.73 ± 0.05 | 0.27 ± 0.00 |
| Fashion-MNIST + SGLD, remove 4,000 | Origin | 0.13 ± 0.02 | 0.37 ± 0.02 | 0.15 ± 0.02 |
| | IS | 0.11 ± 0.02 | 0.22 ± 0.01 | 0.12 ± 0.02 |
| | Ours | 0.11 ± 0.02 | 0.21 ± 0.02 | 0.12 ± 0.02 |
| Fashion-MNIST + SGLD, remove 6,000 | Origin | 0.17 ± 0.01 | 1.45 ± 0.02 | 0.30 ± 0.01 |
| | IS | 0.13 ± 0.01 | 0.56 ± 0.05 | 0.18 ± 0.01 |
| | Ours | 0.13 ± 0.01 | 0.33 ± 0.07 | 0.15 ± 0.01 |
| Fashion-MNIST + SGHMC, remove 4,000 | Origin | 0.15 ± 0.03 | 0.43 ± 0.03 | 0.18 ± 0.03 |
| | IS | 0.14 ± 0.03 | 0.26 ± 0.05 | 0.15 ± 0.03 |
| | Ours | 0.14 ± 0.03 | 0.26 ± 0.05 | 0.15 ± 0.03 |
| Fashion-MNIST + SGHMC, remove 6,000 | Origin | 0.18 ± 0.02 | 1.44 ± 0.02 | 0.30 ± 0.02 |
| | IS | 0.14 ± 0.02 | 0.55 ± 0.07 | 0.18 ± 0.02 |
| | Ours | 0.13 ± 0.02 | 0.35 ± 0.05 | 0.16 ± 0.02 |