# Variational Bayesian Unlearning

Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet

Dept. of Computer Science, National University of Singapore, Republic of Singapore
Dept. of Electrical Engineering and Computer Science, MIT, USA
{qphong,lowkh}@comp.nus.edu.sg, jaillet@mit.edu

This paper studies the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased. We frame this problem as one of minimizing the Kullback-Leibler divergence between the approximate posterior belief of model parameters after directly unlearning from erased data vs. the exact posterior belief from retraining with remaining data. Using the variational inference (VI) framework, we show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief given the full data (i.e., including the remaining data); the latter prevents catastrophic unlearning that can render the model useless. In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging. We propose two novel tricks to tackle this challenge. We empirically demonstrate our unlearning methods on Bayesian models such as sparse Gaussian process and logistic regression using synthetic and real-world datasets.

## 1 Introduction

Our interactions with machine learning (ML) applications have surged in recent years such that large quantities of users' data are now deeply ingrained into the ML models being trained for these applications. This greatly complicates the regulation of access to each user's data or the implementation of personal data ownership, which are enforced by the General Data Protection Regulation in the European Union [24]. In particular, if a user would like to exercise her right to be forgotten [24] (e.g., when quitting an ML application), then it would be desirable to have the trained ML model unlearn from her data. Such a problem of *machine unlearning* [4] extends to the practical scenario where a small subset of data previously used for training is later identified as malicious (e.g., anomalies) [4, 9] and the trained ML model can perform well once again if it can unlearn from the malicious data.

A naive alternative to machine unlearning is to simply retrain an ML model from scratch with the data remaining after erasing the data to be unlearned from. In practice, this is prohibitively expensive in terms of time and space costs since the remaining data is often large, such as in the above scenarios. How then can a trained ML model directly and efficiently unlearn from a small subset of data to be erased to become (a) exactly and, if not, (b) approximately close to that from retraining with the large remaining data? Unfortunately, (a) exact unlearning is only possible for selected ML models (e.g., naive Bayes classifier, linear regression, k-means clustering, and item-item collaborative filtering [4, 12, 30]). This motivates the need to consider (b) approximate unlearning as it is applicable to a broader family of ML models like neural networks [9, 13] but, depending on its choice of loss function, may suffer from catastrophic unlearning[^1] that can render the model useless.
For example, to mitigate this issue, the works of [9, 13] have to patch up their loss functions by, respectively, additionally bounding the loss incurred by erased data with a rectified linear unit and injecting a regularization term to retain information of the remaining data. This begs the question of whether there exists a loss function that can directly quantify the approximation gap and naturally prevent catastrophic unlearning.

[^1]: A trained ML model is said to experience *catastrophic unlearning* from the erased data when its resulting performance is considerably worse than that from retraining with the remaining data.

Our work here addresses the above question by focusing on the family of Bayesian models. Specifically, our proposed loss function measures the Kullback-Leibler (KL) divergence between the approximate posterior belief of model parameters by directly unlearning from erased data vs. the exact posterior belief from retraining with remaining data. Using the variational inference (VI) framework, we show that minimizing this KL divergence is equivalent to minimizing (instead of maximizing) a counterpart of the evidence lower bound called the evidence upper bound (EUBO) (Sec. 3.2). Interestingly, the EUBO lends itself to a natural interpretation of a trade-off between fully unlearning from erased data vs. not entirely forgetting the posterior belief given the full data (i.e., including the remaining data); the latter prevents catastrophic unlearning induced by the former.

Often, in model training, only an approximate (instead of exact) posterior belief of model parameters given the full data can be learned, say, also using VI. This makes unlearning even more challenging. To tackle this challenge, we analyze two sources of inaccuracy in the approximate posterior belief learned using VI, which lay the groundwork for proposing our first trick of an adjusted likelihood of erased data (Sec. 3.3.1): our key idea is to curb unlearning in the region of model parameters with low approximate posterior belief where both sources of inaccuracy primarily occur. Additionally, to avoid the risk of incorrectly tuning the adjusted likelihood, we propose another trick of reverse KL (Sec. 3.3.2) which is naturally more protected from such inaccuracy without needing the adjusted likelihood. Nonetheless, our adjusted likelihood is general enough to be applied to reverse KL.

VI is a popular approximate Bayesian inference framework due to its scalability to massive datasets [15, 18] and its ability to model complex posterior beliefs using generative adversarial networks [33] and normalizing flows [21, 29]. Our work in this paper exploits VI to broaden the family of ML models that can be unlearned, which we empirically demonstrate using synthetic and real-world datasets on several Bayesian models such as sparse Gaussian process and logistic regression with the approximate posterior belief modeled by a normalizing flow (Sec. 4).

## 2 Variational Inference (VI)

In this section, we revisit the VI framework [2] for learning an approximate posterior belief of the parameters $\theta$ of a Bayesian model.
Given a prior belief $p(\theta)$ of the unknown model parameters $\theta$ and a set $D$ of training data, an approximate posterior belief $q(\theta|D) \approx p(\theta|D)$ is optimized by minimizing the KL divergence $\mathrm{KL}[q(\theta|D) \,\|\, p(\theta|D)] \triangleq \int q(\theta|D) \log(q(\theta|D)/p(\theta|D))\, \mathrm{d}\theta$ or, equivalently, maximizing the evidence lower bound (ELBO) $\mathcal{L}$ [2]:

$$\mathcal{L} \triangleq \int q(\theta|D) \log p(D|\theta)\, \mathrm{d}\theta - \mathrm{KL}[q(\theta|D) \,\|\, p(\theta)]\ . \tag{1}$$

Such an equivalence follows directly from $\mathcal{L} = \log p(D) - \mathrm{KL}[q(\theta|D) \,\|\, p(\theta|D)]$ where the log-marginal likelihood $\log p(D)$ is independent of $q(\theta|D)$. Since $\mathrm{KL}[q(\theta|D) \,\|\, p(\theta|D)] \geq 0$, the ELBO $\mathcal{L}$ is a lower bound of $\log p(D)$. The ELBO $\mathcal{L}$ in (1) can be interpreted as a trade-off between attaining a higher likelihood of $D$ (first term) vs. not entirely forgetting the prior belief $p(\theta)$ (second term).

When the ELBO $\mathcal{L}$ (1) cannot be evaluated in closed form, it can be maximized using stochastic gradient ascent (SGA) by approximating the expectation in

$$\mathcal{L} = \mathbb{E}_{q(\theta|D)}\!\left[\log p(D|\theta) + \log(p(\theta)/q(\theta|D))\right] = \int q(\theta|D) \left(\log p(D|\theta) + \log(p(\theta)/q(\theta|D))\right) \mathrm{d}\theta$$

with stochastic sampling in each iteration of SGA. The approximate posterior belief $q(\theta|D)$ can be represented by a simple distribution (e.g., in the exponential family) for computational ease or a complex distribution (e.g., using generative neural networks) for expressive power. Note that when the distribution of $q(\theta|D)$ is modeled by a generative neural network whose density cannot be evaluated, the ELBO can be maximized with adversarial training by alternating between estimating the log-density ratio $\log(p(\theta)/q(\theta|D))$ and maximizing the ELBO [33]. On the other hand, when the distribution of $q(\theta|D)$ is modeled by a normalizing flow (e.g., inverse autoregressive flow (IAF) [21]) whose density can be computed, the ELBO can be maximized with SGA.

## 3 Bayesian Unlearning

### 3.1 Exact Bayesian Unlearning

Let the (full) training data $D$ be partitioned into a small subset $D_e$ of data to be erased and a (large) set $D_r$ of remaining data, i.e., $D = D_r \cup D_e$ and $D_r \cap D_e = \emptyset$. The problem of exact Bayesian unlearning involves recovering the exact posterior belief $p(\theta|D_r)$ of model parameters $\theta$ given remaining data $D_r$ from that given full data $D$ (i.e., $p(\theta|D)$ assumed to be available) by directly unlearning from erased data $D_e$. Note that $p(\theta|D_r)$ can also be obtained from retraining with remaining data $D_r$, which is computationally costly, as discussed in Sec. 1. By using Bayes' rule and assuming conditional independence between $D_r$ and $D_e$ given $\theta$,

$$p(\theta|D_r) = p(\theta|D)\, p(D_e|D_r)/p(D_e|\theta) \propto p(\theta|D)/p(D_e|\theta)\ . \tag{2}$$

When the model parameters $\theta$ are discrete-valued, $p(\theta|D_r)$ can be obtained from (2) directly. The use of a conjugate prior also makes unlearning relatively simple. We will investigate the more interesting case of a non-conjugate prior in the rest of Sec. 3.
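To make (2) concrete for the discrete-valued case, the following minimal sketch uses a hypothetical coin-bias model on a finite grid (our own illustrative setup, not from the paper): the posterior given the full data is divided by the likelihood of the erased observations and renormalized, which recovers exactly the posterior from retraining with the remaining data.

```python
import numpy as np

# Minimal sketch of exact Bayesian unlearning via Eq. (2) for a
# discrete-valued parameter: a coin bias theta on a finite grid with
# i.i.d. Bernoulli observations (an illustrative, hypothetical setup).
theta = np.linspace(0.01, 0.99, 99)              # discrete support of theta
prior = np.full(theta.shape, 1.0 / theta.size)   # uniform prior belief p(theta)

def bernoulli_lik(data):
    """p(data | theta) evaluated over the whole grid, for i.i.d. coin flips."""
    return np.prod([theta**y * (1.0 - theta)**(1 - y) for y in data], axis=0)

def posterior(data):
    post = prior * bernoulli_lik(data)
    return post / post.sum()

D_r = [1, 1, 0, 1, 1, 1]   # remaining data
D_e = [0, 1]               # erased data
D = D_r + D_e              # full data

# Unlearn D_e directly from p(theta | D):  p(theta | D_r) ∝ p(theta | D) / p(D_e | theta)
p_full = posterior(D)
p_unlearned = p_full / bernoulli_lik(D_e)
p_unlearned /= p_unlearned.sum()

# The result matches retraining with the remaining data only.
assert np.allclose(p_unlearned, posterior(D_r))
```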
### 3.2 Approximate Bayesian Unlearning with Exact Posterior Belief $p(\theta|D)$

The problem of approximate Bayesian unlearning differs from that of exact Bayesian unlearning (Sec. 3.1) in that only an approximate posterior belief $q_u(\theta|D_r)$ (instead of the exact one $p(\theta|D_r)$) can be recovered by directly unlearning from erased data $D_e$. Since existing unlearning methods often use their model predictions to construct their loss functions [3, 4, 12, 14], we have initially considered doing likewise (albeit in the Bayesian context) by defining the loss function as the KL divergence between the approximate predictive distribution $q_u(y|D_r) \triangleq \int p(y|\theta)\, q_u(\theta|D_r)\, \mathrm{d}\theta$ vs. the exact predictive distribution $p(y|D_r) = \int p(y|\theta)\, p(\theta|D_r)\, \mathrm{d}\theta$ where the observation $y$ (i.e., drawn from a model with parameters $\theta$) is conditionally independent of $D_r$ given $\theta$. However, it may not be possible to evaluate these predictive distributions in closed form, hence making the optimization of this loss function computationally difficult. Fortunately, such a loss function can be bounded from above by the KL divergence between the posterior beliefs $q_u(\theta|D_r)$ vs. $p(\theta|D_r)$, as proven in Appendix A:

**Proposition 1.** $\mathrm{KL}[q_u(y|D_r) \,\|\, p(y|D_r)] \leq \mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)]$.[^2]

[^2]: Similarly, $\mathrm{KL}[p(y|D_r) \,\|\, q_u(y|D_r)] \leq \mathrm{KL}[p(\theta|D_r) \,\|\, q_u(\theta|D_r)]$ holds.

Proposition 1 reveals that reducing $\mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)]$ decreases $\mathrm{KL}[q_u(y|D_r) \,\|\, p(y|D_r)]$, thus motivating its use as the loss function instead. In particular, it follows immediately from our result below (i.e., proven in Appendix B) that minimizing $\mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)]$ is equivalent to minimizing a counterpart of the ELBO called the evidence upper bound (EUBO) $\mathcal{U}$:

**Proposition 2.** Define the EUBO $\mathcal{U}$ as

$$\mathcal{U} \triangleq \int q_u(\theta|D_r) \log p(D_e|\theta)\, \mathrm{d}\theta + \mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D)]\ . \tag{3}$$

Then, $\mathcal{U} = \log p(D_e|D_r) + \mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)] \geq \log p(D_e|D_r)$ such that $p(D_e|D_r)$ is independent of $q_u(\theta|D_r)$.

From Proposition 2, minimizing EUBO (3) is equivalent to minimizing $\mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)]$, which is precisely what is achieved using VI (i.e., by maximizing ELBO (1)) from retraining with remaining data $D_r$. This is illustrated in Fig. 1a where unlearning from $D_e$ by minimizing EUBO maximizes ELBO w.r.t. $D_r$; in Fig. 1b, retraining with $D_r$ by maximizing ELBO minimizes EUBO w.r.t. $D_e$. The EUBO $\mathcal{U}$ (3) can be interpreted as a trade-off between fully unlearning from erased data $D_e$ (first term) vs. not entirely forgetting the exact posterior belief $p(\theta|D)$ given the full data $D$ (i.e., including the remaining data $D_r$) (second term). The latter can be viewed as a regularization term to prevent catastrophic unlearning[^1] (i.e., potentially induced by the former) that naturally results from minimizing our loss function $\mathrm{KL}[q_u(\theta|D_r) \,\|\, p(\theta|D_r)]$, which differs from the works of [9, 13] needing to patch up their loss functions (Sec. 1). Generative models can be used to model the approximate posterior belief $q_u(\theta|D_r)$ in the EUBO $\mathcal{U}$ (3) in the same way as that in the ELBO $\mathcal{L}$ (1).

*Figure 1: Plots of EUBO and ELBO over optimization iterations when (a) unlearning from $D_e$ by minimizing EUBO and (b) retraining with $D_r$ by maximizing ELBO.*

*Figure 2: Plot of $q(\theta|D)$ learned using VI. The gray shaded region corresponds to values of $\theta$ where $q(\theta|D) \leq \lambda \max_{\theta'} q(\theta'|D)$. Vertical blue strips on the horizontal axis show 100 samples of $\theta \sim q(\theta|D)$.*

### 3.3 Approximate Bayesian Unlearning with Approximate Posterior Belief $q(\theta|D)$

Often, in model training, only an approximate posterior belief[^3] $q(\theta|D)$ (instead of the exact $p(\theta|D)$ in Sec. 3.2) of model parameters $\theta$ given full data $D$ can be learned, say, using VI by maximizing the ELBO (Sec. 2). Our proposed unlearning methods are parsimonious in requiring only $q(\theta|D)$ and erased data $D_e$ to be available, which makes unlearning even more challenging since there is no further information about $p(\theta|D)$ nor the difference between $p(\theta|D)$ vs. $q(\theta|D)$. So, we estimate the unknown $p(\theta|D_r)$ (2) with

$$\tilde{p}(\theta|D_r) \propto q(\theta|D)/p(D_e|\theta) \tag{4}$$

and minimize the KL divergence between the approximate posterior belief recovered by directly unlearning from erased data $D_e$ vs. $\tilde{p}(\theta|D_r)$ (4) instead. We will discuss two novel tricks below to alleviate the undesirable consequence of using $\tilde{p}(\theta|D_r)$ (4) instead of the unknown $p(\theta|D_r)$ (2).
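Before introducing the two tricks, here is a minimal sketch of the basic EUBO-based unlearning step they both build on: minimizing an EUBO of the form (3) with the trained $q(\theta|D)$ substituted for the unknown $p(\theta|D)$ (this is exactly $\tilde{\mathcal{U}}$ in (5) below). The 1-D Gaussian setup, the stand-in erased-data likelihood, and the hyperparameters are our own illustrative assumptions, not the paper's code.

```python
import torch

# Minimal sketch (not the authors' code) of EUBO-based unlearning with SGA:
# theta is 1-D, q(theta|D) is a Gaussian assumed to come from training, and
# the variational q_u is a reparameterized Gaussian initialized at q(theta|D).
q_full = torch.distributions.Normal(loc=1.2, scale=0.3)   # q(theta | D), assumed given

def log_lik_erased(theta):                                 # stand-in log p(D_e | theta)
    D_e = torch.tensor([0.9, 1.1, 0.8])                    # hypothetical erased data
    return torch.distributions.Normal(theta.unsqueeze(-1), 1.0).log_prob(D_e).sum(-1)

mu = torch.tensor(1.2, requires_grad=True)                 # initialize q_u at q(theta|D)
log_sigma = torch.log(torch.tensor(0.3)).requires_grad_(True)
opt = torch.optim.RMSprop([mu, log_sigma], lr=1e-3)        # the paper uses RMSProp

for _ in range(2000):
    opt.zero_grad()
    q_u = torch.distributions.Normal(mu, log_sigma.exp())
    theta = q_u.rsample((64,))                             # reparameterization trick
    # EUBO = E_{q_u}[log p(D_e|theta)] + KL[q_u || q(theta|D)]
    eubo = log_lik_erased(theta).mean() + torch.distributions.kl_divergence(q_u, q_full)
    eubo.backward()
    opt.step()
```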
### 3.3.1 EUBO with Adjusted Likelihood

Let the loss function $\mathrm{KL}[\tilde{q}_u(\theta|D_r) \,\|\, \tilde{p}(\theta|D_r)]$ be minimized w.r.t. the approximate posterior belief $\tilde{q}_u(\theta|D_r)$ that is recovered by directly unlearning from erased data $D_e$. Similar to Proposition 2, $\tilde{q}_u(\theta|D_r)$ can be optimized by minimizing the following EUBO:

$$\tilde{\mathcal{U}} \triangleq \int \tilde{q}_u(\theta|D_r) \log p(D_e|\theta)\, \mathrm{d}\theta + \mathrm{KL}[\tilde{q}_u(\theta|D_r) \,\|\, q(\theta|D)] \tag{5}$$

which follows from simply replacing the unknown $p(\theta|D)$ in $\mathcal{U}$ (3) with $q(\theta|D)$. We discuss the difference between $p(\theta|D)$ vs. $q(\theta|D)$ in the remark below:

**Remark 1.** We analyze two possible sources of inaccuracy in $q(\theta|D)$ that is learned using VI by minimizing the loss function $\mathrm{KL}[q(\theta|D) \,\|\, p(\theta|D)]$ (Sec. 2). Firstly, $q(\theta|D)$ often underestimates the variance of $p(\theta|D)$: though $q(\theta|D)$ tends to be close to 0 at values of $\theta$ where $p(\theta|D)$ is close to 0, the reverse is not enforced [1] (see, for example, Fig. 2). So, $q(\theta|D)$ can differ from $p(\theta|D)$ at values of $\theta$ where $q(\theta|D)$ is close to 0. Secondly, if $q(\theta|D)$ is learned through stochastic optimization of the ELBO (i.e., with stochastic samples of $\theta \sim q(\theta|D)$ in each iteration of SGA), then it is unlikely that the ELBO is maximized using samples of $\theta$ with small $q(\theta|D)$ (Fig. 2). Thus, both sources of inaccuracy primarily occur at values of $\theta$ with small $q(\theta|D)$. Though $q(\theta|D)$ can also be inaccurate at values of $\theta$ with large $q(\theta|D)$, such an inaccuracy can be reduced by representing $q(\theta|D)$ with a complex distribution (Sec. 2).

Remark 1 motivates us to curb unlearning at values of $\theta$ with small $q(\theta|D)$ by proposing our first novel trick of an *adjusted likelihood* of the erased data:

$$p_{\mathrm{adj}}(D_e|\theta; \lambda) \triangleq \begin{cases} p(D_e|\theta) & \text{if } q(\theta|D) > \lambda \max_{\theta'} q(\theta'|D)\ , \\ 1 & \text{otherwise (i.e., shaded area in Fig. 2)}\ ; \end{cases} \tag{6}$$

for any $\theta$, where $\lambda \in [0, 1]$ controls the magnitude of a threshold under which $q(\theta|D)$ is considered small.

[^3]: With a slight abuse of notation, we let $q(\theta|D)$ be the approximate posterior belief that maximizes the ELBO $\mathcal{L}$ (1) (Sec. 2) from Sec. 3.3 onwards.

*Figure 3: Plot of approximate posterior beliefs with varying $\lambda$ obtained by minimizing (a) EUBO (i.e., $\tilde{q}_u(\theta|D_r; \lambda)$) and (b) reverse KL (i.e., $\tilde{q}_v(\theta|D_r; \lambda)$); the horizontal axis denotes $\theta = \alpha$. In (a), a huge probability mass of $\tilde{q}_u(\theta|D_r; \lambda = 0)$ is at large values of $\alpha$ beyond the plotting area, and the top of the plot of $\tilde{q}_u(\theta|D_r; \lambda = 10^{-20})$ is cut off due to lack of space.*

To understand the effect of $\lambda$, let $\tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda) \propto q(\theta|D)/p_{\mathrm{adj}}(D_e|\theta; \lambda)$, i.e., by replacing $p(D_e|\theta)$ in (4) with $p_{\mathrm{adj}}(D_e|\theta; \lambda)$. Then, using (6),

$$\tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda) \propto \begin{cases} q(\theta|D)/p(D_e|\theta) & \text{if } q(\theta|D) > \lambda \max_{\theta'} q(\theta'|D)\ , \\ q(\theta|D) & \text{otherwise (i.e., shaded area in Fig. 2)}\ . \end{cases} \tag{7}$$

According to (7), unlearning is therefore focused at values of $\theta$ with sufficiently large $q(\theta|D)$ (i.e., $q(\theta|D) > \lambda \max_{\theta'} q(\theta'|D)$). Let the loss function $\mathrm{KL}[\tilde{q}_u(\theta|D_r; \lambda) \,\|\, \tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda)]$ be minimized w.r.t. the approximate posterior belief $\tilde{q}_u(\theta|D_r; \lambda)$ that is recovered by directly unlearning from erased data $D_e$. Similar to (5), $\tilde{q}_u(\theta|D_r; \lambda)$ can be optimized by minimizing the following EUBO:

$$\tilde{\mathcal{U}}_{\mathrm{adj}}(\lambda) \triangleq \int \tilde{q}_u(\theta|D_r; \lambda) \log p_{\mathrm{adj}}(D_e|\theta; \lambda)\, \mathrm{d}\theta + \mathrm{KL}[\tilde{q}_u(\theta|D_r; \lambda) \,\|\, q(\theta|D)] \tag{8}$$

which follows from replacing $p(D_e|\theta)$ in (5) with $p_{\mathrm{adj}}(D_e|\theta; \lambda)$. Note that $\tilde{q}_u(\theta|D_r; \lambda)$ can be represented by a simple distribution (e.g., Gaussian) or a complex one (e.g., generative neural network, IAF). We initialize $\tilde{q}_u(\theta|D_r; \lambda)$ at $q(\theta|D)$ for achieving empirically faster convergence.

When $\lambda = 0$, $\tilde{\mathcal{U}}_{\mathrm{adj}}(\lambda = 0)$ reduces to $\tilde{\mathcal{U}}$ (5), i.e., EUBO is minimized without the adjusted likelihood. As a result, $\tilde{q}_u(\theta|D_r; \lambda = 0) = \tilde{q}_u(\theta|D_r)$. As $\lambda$ increases, unlearning is focused on a smaller and smaller region of $\theta$ with sufficiently large $q(\theta|D)$. When $\lambda$ reaches 1, no unlearning is performed since $\tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda = 1) = q(\theta|D)$, which results in $\tilde{q}_u(\theta|D_r; \lambda = 1) = q(\theta|D)$ minimizing the loss function $\mathrm{KL}[\tilde{q}_u(\theta|D_r; \lambda = 1) \,\|\, \tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda = 1)]$.
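In log space, the adjusted likelihood (6) amounts to masking the erased-data log-likelihood outside the thresholded region (where $\log 1 = 0$). The helper below is an illustrative sketch: `log_q_full` evaluates $\log q(\theta|D)$ and `log_q_full_max` is (an estimate of) its maximum, e.g., $\log q$ at the variational mean; both names are our own.

```python
import torch

# Illustrative sketch of the adjusted likelihood in Eq. (6), in log space.
def log_p_adj(theta, lam, log_lik_erased, log_q_full, log_q_full_max):
    log_lik = log_lik_erased(theta)                         # log p(D_e | theta)
    thresh = torch.log(torch.tensor(lam)) + log_q_full_max  # log(lam * max_theta' q(theta'|D))
    keep = log_q_full(theta) > thresh                       # unlearn only where q(theta|D) is large
    return torch.where(keep, log_lik, torch.zeros_like(log_lik))  # log 1 = 0 elsewhere
```

Note that `lam = 0` makes the threshold $-\infty$ and recovers the unadjusted likelihood, while `lam = 1` masks (essentially) everywhere, matching the two limiting behaviors above; substituting `log_p_adj` for $\log p(D_e|\theta)$ in the EUBO sketch of Sec. 3.3 yields a Monte Carlo estimate of (8).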
**Example 1.** To visualize the effect of varying $\lambda$ on $\tilde{q}_u(\theta|D_r; \lambda)$, we consider learning the shape $\alpha$ of a Gamma distribution with a known rate (i.e., $\theta = \alpha$): $D$ are 20 samples of the Gamma distribution, $D_e$ are the smallest 5 samples in $D$, and the (non-conjugate) prior belief and approximate posterior beliefs of $\alpha$ are all Gaussians. Fig. 3a shows the approximate posterior beliefs $\tilde{q}_u(\theta|D_r; \lambda)$ with varying $\lambda$ as well as $q(\theta|D)$ and $q(\theta|D_r)$ learned using VI. As explained above, $\tilde{q}_u(\theta|D_r; \lambda = 1) = q(\theta|D)$. When $\lambda = 0.001$, $\tilde{q}_u(\theta|D_r; \lambda = 0.001)$ is close to $q(\theta|D_r)$. However, as $\lambda$ decreases to 0, $\tilde{q}_u(\theta|D_r; \lambda)$ moves away from $q(\theta|D_r)$.

The optimized $\tilde{q}_u(\theta|D_r; \lambda)$ suffers from the same issue of underestimating the variance as $q(\theta|D)$ learned using VI (see Remark 1), especially when $\lambda$ tends to 0 (e.g., see $\tilde{q}_u(\theta|D_r; \lambda = 10^{-20})$ in Fig. 3a). Though this issue can be mitigated by tuning $\lambda$ in the adjusted likelihood (6), we may not want to risk facing the consequence of picking a value of $\lambda$ that is too small. So, in Sec. 3.3.2, we will propose another novel trick that is not inconvenienced by this issue.

### 3.3.2 Reverse KL

Let the loss function be the reverse KL divergence $\mathrm{KL}[\tilde{p}(\theta|D_r) \,\|\, \tilde{q}_v(\theta|D_r)]$ that is minimized w.r.t. the approximate posterior belief $\tilde{q}_v(\theta|D_r)$ recovered by directly unlearning from erased data $D_e$. In contrast to the optimized $\tilde{q}_u(\theta|D_r; \lambda)$ from minimizing EUBO (8), the optimized $\tilde{q}_v(\theta|D_r)$ from minimizing the reverse KL divergence overestimates (instead of underestimates) the variance of $\tilde{p}(\theta|D_r)$ [1]: if $\tilde{p}(\theta|D_r)$ is close to 0, then $\tilde{q}_v(\theta|D_r)$ is not necessarily close to 0. From (4), the reverse KL divergence can be expressed as follows:

$$\mathrm{KL}[\tilde{p}(\theta|D_r) \,\|\, \tilde{q}_v(\theta|D_r)] = C_0 - C_1\, \mathbb{E}_{q(\theta|D)}\!\left[(\log \tilde{q}_v(\theta|D_r))/p(D_e|\theta)\right] \tag{9}$$

where $C_0$ and $C_1$ are constants independent of $\tilde{q}_v(\theta|D_r)$. So, the reverse KL divergence (9) can be minimized by maximizing $\mathbb{E}_{q(\theta|D)}[(\log \tilde{q}_v(\theta|D_r))/p(D_e|\theta)]$ with stochastic gradient ascent (SGA). We also initialize $\tilde{q}_v(\theta|D_r)$ at $q(\theta|D)$ for achieving empirically faster convergence. Since stochastic optimization is performed with samples of $\theta \sim q(\theta|D)$ in each iteration of SGA, it is unlikely that the reverse KL divergence (9) is minimized using samples of $\theta$ with small $q(\theta|D)$. This naturally curbs unlearning at values of $\theta$ with small $q(\theta|D)$, as motivated by Remark 1. On the other hand, it is still possible to employ the same trick of adjusted likelihood (Sec. 3.3.1) by minimizing the reverse KL divergence $\mathrm{KL}[\tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda) \,\|\, \tilde{q}_v(\theta|D_r; \lambda)]$ as the loss function or, equivalently, maximizing $\mathbb{E}_{q(\theta|D)}[(\log \tilde{q}_v(\theta|D_r; \lambda))/p_{\mathrm{adj}}(D_e|\theta; \lambda)]$ where $p_{\mathrm{adj}}(D_e|\theta; \lambda)$ and $\tilde{p}_{\mathrm{adj}}(\theta|D_r; \lambda)$ are previously defined in (6) and (7), respectively. The role of $\lambda$ here is the same as that in (8).

To illustrate the difference between minimizing the reverse KL divergence (9) and EUBO (8), Fig. 3b shows the approximate posterior beliefs $\tilde{q}_v(\theta|D_r; \lambda)$ with varying $\lambda$. It can be observed that $\tilde{q}_v(\theta|D_r; \lambda = 1) = q(\theta|D)$ (i.e., no unlearning). In contrast to minimizing EUBO (Fig. 3a), as $\lambda$ decreases to 0, $\tilde{q}_v(\theta|D_r; \lambda)$ does not deviate much from $q(\theta|D_r)$, even when $\lambda = 0$ (i.e., when the reverse KL divergence is minimized without the adjusted likelihood). This is because the optimized $\tilde{q}_v(\theta|D_r; \lambda)$ is naturally more protected from both sources of inaccuracy (Remark 1), as explained above. Hence, we do not have to be as concerned about picking a small value of $\lambda$, which is also consistently observed in our experiments (Sec. 4).
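As a sketch, the reverse-KL trick (9) can be implemented in the same illustrative 1-D Gaussian setup as the EUBO sketch above (again our own assumptions, not the paper's code); note that the samples are drawn from the fixed $q(\theta|D)$, so unlearning is naturally curbed where $q(\theta|D)$ is small.

```python
import torch

# Minimal sketch of the reverse-KL trick: minimize Eq. (9) by maximizing
# E_{q(theta|D)}[ log q_v(theta|D_r) / p(D_e|theta) ] with SGA.
q_full = torch.distributions.Normal(loc=1.2, scale=0.3)     # q(theta | D), assumed given

def log_lik_erased(theta):                                   # stand-in log p(D_e | theta)
    D_e = torch.tensor([0.9, 1.1, 0.8])                      # hypothetical erased data
    return torch.distributions.Normal(theta.unsqueeze(-1), 1.0).log_prob(D_e).sum(-1)

mu = torch.tensor(1.2, requires_grad=True)                   # initialize q_v at q(theta|D)
log_sigma = torch.log(torch.tensor(0.3)).requires_grad_(True)
opt = torch.optim.RMSprop([mu, log_sigma], lr=1e-3)

for _ in range(2000):
    opt.zero_grad()
    q_v = torch.distributions.Normal(mu, log_sigma.exp())
    theta = q_full.sample((256,))             # samples from q(theta|D), not from q_v
    w = torch.exp(-log_lik_erased(theta))     # importance weight 1 / p(D_e | theta)
    loss = -(w * q_v.log_prob(theta)).mean()  # negated MC estimate of the objective
    loss.backward()
    opt.step()
```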
## 4 Experiments and Discussion

This section empirically demonstrates our unlearning methods on Bayesian models such as sparse Gaussian process and logistic regression using synthetic and real-world datasets. Further experimental results on Bayesian linear regression and with a bimodal posterior belief are reported in Appendices C and D, respectively. In our experiments, each dataset comprises pairs of an input $x$ and its corresponding output/observation $y_x$. We use RMSProp as the SGA algorithm with a learning rate of $10^{-4}$.

To assess the performance of our unlearning methods (i.e., by directly unlearning from erased data $D_e$), we consider the difference between their induced predictive distributions vs. that obtained using VI from retraining with remaining data $D_r$, as motivated in Sec. 3.2. To do this, we use a performance metric that measures the KL divergence between the approximate predictive distributions

$$\tilde{q}_u(y_x|D_r) \triangleq \int p(y_x|\theta)\, \tilde{q}_u(\theta|D_r; \lambda)\, \mathrm{d}\theta \quad \text{or} \quad \tilde{q}_v(y_x|D_r) \triangleq \int p(y_x|\theta)\, \tilde{q}_v(\theta|D_r; \lambda)\, \mathrm{d}\theta$$

vs. $q(y_x|D_r) \triangleq \int p(y_x|\theta)\, q(\theta|D_r)\, \mathrm{d}\theta$, where $\tilde{q}_u(\theta|D_r; \lambda)$ and $\tilde{q}_v(\theta|D_r; \lambda)$ are optimized by, respectively, minimizing EUBO (8) and the reverse KL (rKL) divergence (9) while requiring only $q(\theta|D)$ and erased data $D_e$ (Sec. 3.3), and $q(\theta|D_r)$ is learned using VI (Sec. 2). The above predictive distributions are computed via sampling with 100 samples of $\theta$. For tractability reasons, we evaluate the above performance metric over finite input domains, specifically, over that in $D_e$ and $D_r$, which allows us to assess the performance of our unlearning methods on both the erased and remaining data, i.e., whether they can fully unlearn from $D_e$ yet not forget nor catastrophically unlearn from $D_r$, respectively. For example, the performance of our EUBO-based unlearning method over $D_e$ is shown as an average (with standard deviation) of the KL divergences between $\tilde{q}_u(y_x|D_r)$ vs. $q(y_x|D_r)$ over all $(x, y_x) \in D_e$. We also plot an average (with standard deviation) of the KL divergences between $q(y_x|D)$ vs. $q(y_x|D_r)$ over $D_r$ and $D_e$ as baselines (i.e., representing no unlearning), which is expected to be larger than that of our unlearning methods (i.e., if they perform well) and is labeled as *full* in the plots below.
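For a binary classifier, this metric reduces to the KL divergence between two Bernoulli distributions whose parameters are Monte Carlo averages over 100 posterior samples. The sketch below illustrates this computation; `predict_prob(theta, x)` is an assumed model interface (not from the paper) returning $p(y_x = 1|\theta)$.

```python
import numpy as np

# Sketch of the Sec. 4 performance metric for a binary classifier.
def predictive_prob(theta_samples, x, predict_prob):
    """Monte Carlo estimate of the predictive probability p(y_x = 1 | .)."""
    return np.mean([predict_prob(t, x) for t in theta_samples])

def bernoulli_kl(p, q, eps=1e-12):
    """KL[Bernoulli(p) || Bernoulli(q)], clipped for numerical stability."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def avg_predictive_kl(xs, samples_unlearned, samples_retrained, predict_prob):
    """Mean and std of KL[q_u(y_x|D_r) || q(y_x|D_r)] over the inputs in xs,
    each predictive distribution estimated with (e.g., 100) posterior samples."""
    kls = [bernoulli_kl(predictive_prob(samples_unlearned, x, predict_prob),
                        predictive_prob(samples_retrained, x, predict_prob))
           for x in xs]
    return float(np.mean(kls)), float(np.std(kls))
```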
### 4.1 Sparse Gaussian Process (GP) Classification with Synthetic Moon Dataset

This experiment is about unlearning a binary classifier that is previously trained with the synthetic moon dataset (Fig. 4a). The probability of input $x \in \mathbb{R}^2$ being in the blue class (i.e., $y_x = 1$ and denoted by blue dots in Fig. 4a) is defined as $1/(1 + \exp(f_x))$ where $f_x$ is a latent function modeled by a sparse GP [27], which is elaborated in Appendix E. The parameters $\theta$ of the sparse GP consist of 20 inducing variables; the approximate posterior beliefs of $\theta$ are thus multivariate Gaussians (with full covariance matrices), as shown in Appendix E. By comparing Figs. 4b and 4c, it can be observed that after erasing $D_e$ (i.e., mainly in the yellow class), $q(y_x = 1|D_r)$ increases at $x \in D_e$.

*Figure 4: Plots of (a) the synthetic moon dataset with erased data $D_e$ (crosses) and remaining data $D_r$ (dots), and of the predictive distributions obtained using VI from (b) training with full data $D$ and (c) retraining with $D_r$. Graphs of averaged KL divergence vs. $\lambda$ achieved by EUBO, rKL, and $q(\theta|D)$ (i.e., baseline labeled as full) over (d) $D_r$ and (e) $D_e$. Plots of the predictive distributions (f-h) $\tilde{q}_u(y_x = 1|D_r)$ and (i-k) $\tilde{q}_v(y_x = 1|D_r)$ induced, respectively, by EUBO and rKL for $\lambda = 10^{-5}$, $10^{-9}$, and $0$.*

Figs. 4d and 4e show results of the performance of our EUBO- and rKL-based unlearning methods over $D_r$ and $D_e$ with varying $\lambda$, respectively.[^4] When $\lambda = 10^{-9}$, EUBO performs reasonably well (compare Figs. 4g vs. 4c) as its averaged KL divergence is smaller than that of $q(\theta|D)$ (i.e., baseline labeled as *full*). When $\lambda = 0$, EUBO performs poorly (compare Figs. 4h vs. 4c) as its averaged KL divergence is much larger than that of $q(\theta|D)$, as shown in Figs. 4d and 4e. This agrees with our discussion of the issue with picking too small a value of $\lambda$ for EUBO at the end of Sec. 3.3.1. In particular, catastrophic unlearning is observed as the input region containing $D_e$ (i.e., mainly in the yellow class) has a high probability of being in the blue class after unlearning in Fig. 4h. On the other hand, when $\lambda = 0$, rKL performs well (compare Figs. 4k vs. 4c) as its KL divergence is much smaller than that of $q(\theta|D)$, as seen in Figs. 4d and 4e. This agrees with our discussion at the end of Sec. 3.3.2 that rKL can work well without needing the adjusted likelihood.

One may question how the performance of our unlearning methods would vary when erasing a large quantity of data or with different distributions of erased data (e.g., erasing the data randomly vs. deliberately erasing all data in a given class). To address this question, we have discovered that a key factor influencing their unlearning performance in these scenarios is the difference between the posterior beliefs of model parameters $\theta$ given remaining data $D_r$ vs. that given full data $D$, especially at values of $\theta$ with small $q(\theta|D)$ since unlearning in such a region is curbed by the adjusted likelihood and reverse KL. In practice, we expect such a difference not to be large due to the small quantity of erased data and the redundancy in real-world datasets. Due to lack of space, we present the details of this study in Appendix F by considering how much $D_e$ reduces the entropy of $\theta$ given $D_r$.

### 4.2 Logistic Regression with Banknote Authentication Dataset

The banknote authentication dataset [10] of size $|D| = 1372$ is partitioned into erased data of size $|D_e| = 412$ and remaining data of size $|D_r| = 960$. Each input $x$ comprises 4 features extracted from an image of a banknote, and its corresponding binary label $y_x$ indicates whether the banknote is genuine or forged. We use a logistic regression model with 5 parameters that is trained with this dataset. The prior beliefs of the model parameters are independent Gaussians $\mathcal{N}(0, 100)$.
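A minimal sketch of this setup (under our own assumptions about the parameterization, not the paper's code) is below: Bayesian logistic regression with 4 feature weights plus a bias, independent $\mathcal{N}(0, 100)$ priors, and a mean-field Gaussian posterior fitted by maximizing a Monte Carlo estimate of the ELBO (1). The fitted posterior then plays the role of $q(\theta|D)$ in (8) and (9) for unlearning.

```python
import torch

# Sketch of Sec. 4.2: Bayesian logistic regression with 5 parameters,
# N(0, 100) priors (std 10 = sqrt(100)), and a mean-field Gaussian posterior.
prior = torch.distributions.Normal(torch.zeros(5), torch.full((5,), 10.0))

mu = torch.zeros(5, requires_grad=True)        # variational mean
log_sigma = torch.zeros(5, requires_grad=True) # variational log std

def elbo(X, y, n_samples=32):
    """Monte Carlo ELBO. X: (N, 4) float features; y: (N,) float 0/1 labels."""
    q = torch.distributions.Normal(mu, log_sigma.exp())
    w = q.rsample((n_samples,))                 # (S, 5) reparameterized samples
    logits = X @ w[:, :4].T + w[:, 4]           # (N, S): feature weights + bias
    log_lik = -torch.nn.functional.binary_cross_entropy_with_logits(
        logits, y.unsqueeze(-1).expand_as(logits), reduction="none").sum(dim=0)
    return log_lik.mean() - torch.distributions.kl_divergence(q, prior).sum()
```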
Unlike the previous experiment, the erased data $D_e$ here is randomly selected and hence does not reduce the entropy of model parameters $\theta$ given $D_r$ much, as explained in Appendix F; a discussion on erasing informative data (such as that in Sec. 4.1) is in Appendix F. As a result, Figs. 5a and 5b show a very small averaged KL divergence of about $10^{-3}$ between $q(y_x|D)$ vs. $q(y_x|D_r)$ (i.e., baselines) over $D_r$ and $D_e$.[^4] Figs. 5a and 5b also show that our unlearning methods do not perform well when using multivariate Gaussians to model the approximate posterior beliefs of $\theta$: while rKL still gives a useful $\tilde{q}_v(\theta|D_r; \lambda)$ achieving an averaged KL divergence close to that of $q(\theta|D)$, EUBO gives a useless $\tilde{q}_u(\theta|D_r; \lambda)$ incurring a large averaged KL divergence when $\lambda$ is small. On the other hand, when more complex models like normalizing flows with the MADE architecture [26] are used to represent the approximate posterior beliefs, EUBO and rKL can unlearn well (Figs. 5c and 5d).

[^4]: Note that the log plots can only properly display the upper confidence interval of 1 standard deviation (shaded area) and hence do not show the lower confidence interval.

*Figure 5: Graphs of averaged KL divergence vs. $\lambda$ achieved by EUBO, rKL, and $q(\theta|D)$ (i.e., baseline labeled as full) over $D_r$ and $D_e$ for the banknote authentication dataset with the approximate posterior beliefs of the model parameters represented by (a-b) multivariate Gaussians and (c-d) normalizing flows.*

### 4.3 Logistic Regression with Fashion MNIST Dataset

The fashion MNIST dataset of size $|D| = 60000$ (28×28 images of fashion items in 10 classes) is partitioned into erased data of size $|D_e| = 10000$ and remaining data of size $|D_r| = 50000$. The classification model is a neural network with 3 fully-connected hidden layers of 128, 128, and 64 hidden neurons and a softmax layer to output the 10-class probabilities. The model can be interpreted as one of logistic regression on 64 features generated from the hidden layer of 64 neurons. Since modeling all weights of the neural network as random variables can be costly, we model only the 650 weights in the transformation of the 64 features to the inputs of the softmax layer. The other weights remain constant during unlearning and retraining. The prior beliefs of the network weights are $\mathcal{N}(0, 10)$. The approximate posterior beliefs are modeled with independent Gaussians. Though a large part of the network is fixed and we use simple models to represent the approximate posterior beliefs, we show that unlearning is still fairly effective.
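The sketch below illustrates this partially-Bayesian setup under stated assumptions (the exact layer wiring and initialization are our guesses, not the paper's code): a frozen feature extractor, with only the final 64-to-10 transformation ($64 \times 10$ weights plus 10 biases, i.e., the 650 parameters) treated as random variables with $\mathcal{N}(0, 10)$ priors and an independent-Gaussian posterior.

```python
import torch
import torch.nn as nn

# Sketch of the Sec. 4.3 model: only the final softmax-layer transformation
# is Bayesian; the backbone is frozen during unlearning and retraining.
backbone = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU())
for p in backbone.parameters():        # kept constant during unlearning/retraining
    p.requires_grad_(False)

n_bayes = 64 * 10 + 10                 # the 650 random weights
mu = torch.zeros(n_bayes, requires_grad=True)            # independent-Gaussian posterior
log_sigma = torch.full((n_bayes,), -3.0, requires_grad=True)

def class_logits(x, w):
    """Logits for a batch x and one sampled weight vector w of the softmax layer."""
    feats = backbone(x)                # (N, 64) fixed features
    W, b = w[:640].view(64, 10), w[640:]
    return feats @ W + b               # (N, 10); softmax gives class probabilities
```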
As discussed in Secs. 4.1 and 4.2 and Appendix F, the random selection of erased data $D_e$ and the redundancy in $D$ lead to a small averaged KL divergence of about 0.1 between $q(y_x|D)$ vs. $q(y_x|D_r)$ (i.e., baselines) over $D_r$ and $D_e$ (Figs. 6a and 6b) despite choosing a relatively large $|D_e|$. Figs. 6a and 6b show that when $\lambda \geq 10^{-9}$, EUBO and rKL achieve averaged KL divergences comparable to that of $q(\theta|D)$ (i.e., baseline labeled as *full*), hence making their unlearning insignificant.[^4] However, at $\lambda = 0$, the unlearning performance of rKL improves by achieving a smaller averaged KL divergence than that of $q(\theta|D)$, while EUBO's performance deteriorates. Their performance can be further improved by using more complex models to represent their approximate posterior beliefs like that in Sec. 4.2, albeit high-dimensional. Figs. 6c and 6d show the class probabilities for two images evaluated at the mean of the approximate posterior beliefs with $\lambda = 0$. We observe that rKL induces the highest class probability for the same class as that of $q(\theta|D_r)$. The class probabilities for other images are shown in Appendix G. The two images are taken from a separate set of 10000 test images (i.e., different from $D$) where rKL with $\lambda = 0$ yields the same predictions as $q(\theta|D_r)$ and $q(\theta|D)$ in, respectively, 99.34% and 99.22% of the test images, the latter of which are contained in the former.

*Figure 6: Graphs of averaged KL divergence vs. $\lambda$ achieved by EUBO, rKL, and $q(\theta|D)$ (i.e., baseline labeled as full) over (a) $D_r$ and (b) $D_e$. (c-d) Plots of class probabilities for a shirt image and a T-shirt image in the fashion MNIST dataset obtained using $q(\theta|D)$, $q(\theta|D_r)$, and the optimized $\tilde{q}_v(\theta|D_r; \lambda = 0)$ and $\tilde{q}_u(\theta|D_r; \lambda = 0)$.*

### 4.4 Sparse Gaussian Process (GP) Regression with Airline Dataset

This section illustrates the scalability of unlearning to the massive airline dataset of 2 million flights [15]. Training a sparse GP model with this massive dataset is made possible through stochastic VI [15]. Let $X_u$ denote the set of 50 inducing inputs in the sparse GP model and $f_{X_u}$ be a vector of the corresponding latent function values (i.e., inducing variables). The posterior belief $p(f_D, f_{X_u}|D)$ is approximated as $q(f_D, f_{X_u}|D) \triangleq q(f_{X_u}|D)\, p(f_D|f_{X_u})$ where $f_D \triangleq (f_x)_{x \in D}$. Let the sets $X_D$ and $X_{D_e}$ denote the inputs in the full and erased data, respectively. Then, the ELBO can be decomposed as

$$\sum_{x \in X_D} \int\!\!\int q(f_{X_u}|D)\, p(f_x|f_{X_u}) \log p(y_x|f_x)\, \mathrm{d}f_x\, \mathrm{d}f_{X_u} - \mathrm{KL}[q(f_{X_u}|D) \,\|\, p(f_{X_u})] \tag{10}$$

where $\int p(f_x|f_{X_u}) \log p(y_x|f_x)\, \mathrm{d}f_x$ can be evaluated in closed form [11]. To unlearn such a trained model from $D_e$ ($|D_e| = 100$K here), the EUBO (8) can be expressed in a similar way as the ELBO:

$$\tilde{\mathcal{U}}_{\mathrm{adj}}(\lambda) = \sum_{x \in X_{D_e}} \int\!\!\int \tilde{q}_u(f_{X_u}|D_r; \lambda)\, p(f_x|f_{X_u}) \log p_{\mathrm{adj}}(y_x|f_x; \lambda)\, \mathrm{d}f_x\, \mathrm{d}f_{X_u} + \mathrm{KL}[\tilde{q}_u(f_{X_u}|D_r; \lambda) \,\|\, q(f_{X_u}|D)]$$

where $p_{\mathrm{adj}}(y_x|f_x; \lambda) = p(y_x|f_x)$ if $q(f_x, f_{X_u}|D) > \lambda \max_{f'_x, f'_{X_u}} q(f'_x, f'_{X_u}|D)$, and $p_{\mathrm{adj}}(y_x|f_x; \lambda) = 1$ otherwise. EUBO can be minimized using stochastic gradient descent with random subsets (i.e., mini-batches of size 10K) of $D_e$ in each iteration. For rKL, we use the entire $D_e$ in each iteration.

*Table 1: KL divergence achieved by EUBO (top row) and rKL (bottom row) with varying $\lambda$ for the airline dataset.*

| $\lambda$ | $10^{-11}$ | $10^{-13}$ | $10^{-20}$ | $0$ |
|---|---|---|---|---|
| $\mathrm{KL}[\tilde{q}_u(f_{X_u} \mid D_r; \lambda) \parallel q(f_{X_u} \mid D_r)]$ | 2194.49 | 1943.00 | 1384.96 | 2629.71 |
| $\mathrm{KL}[\tilde{q}_v(f_{X_u} \mid D_r; \lambda) \parallel q(f_{X_u} \mid D_r)]$ | 418.42 | 367.12 | 543.45 | 455.11 |

Since $\tilde{q}_u(f_{X_u}|D_r; \lambda)$, $\tilde{q}_v(f_{X_u}|D_r; \lambda)$, and $q(f_{X_u}|D_r)$ in (10) [11] are all multivariate Gaussians, we can directly evaluate the performance of EUBO and rKL with varying $\lambda$ through their respective $\mathrm{KL}[\tilde{q}_u(f_{X_u} \mid D_r; \lambda) \parallel q(f_{X_u} \mid D_r)]$ and $\mathrm{KL}[\tilde{q}_v(f_{X_u} \mid D_r; \lambda) \parallel q(f_{X_u} \mid D_r)]$ which, according to Table 1, are smaller than $\mathrm{KL}[q(f_{X_u} \mid D) \parallel q(f_{X_u} \mid D_r)]$ of value 4344.09 (i.e., baseline representing no unlearning), hence demonstrating reasonable unlearning performance.
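Since all the beliefs being compared are multivariate Gaussians, the KL divergences in Table 1 are available in closed form; a sketch of this standard computation (with the log-determinants computed stably via `slogdet`) is below.

```python
import numpy as np

# Sketch of the Table 1 evaluation: closed-form KL[N(m0, S0) || N(m1, S1)]
# between two multivariate Gaussians, e.g., an unlearned belief over the
# inducing variables vs. the retrained belief q(f_Xu | D_r).
def gaussian_kl(m0, S0, m1, S1):
    k = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)   # stable log-determinants
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + logdet1 - logdet0)
```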
## 5 Conclusion

This paper describes novel unlearning methods for approximately unlearning a Bayesian model from a small subset of the training data to be erased. Our unlearning methods are parsimonious in requiring only the approximate posterior belief of model parameters given the full data (i.e., obtained in model training with VI) and the erased data to be available. This makes unlearning even more challenging due to two sources of inaccuracy in the approximate posterior belief. We introduce the novel tricks of adjusted likelihood and reverse KL to curb unlearning in the region of model parameters with low approximate posterior belief where both sources of inaccuracy primarily occur. Empirical evaluations on synthetic and real-world datasets show that our proposed methods (especially reverse KL without the adjusted likelihood) can effectively unlearn Bayesian models such as sparse GP and logistic regression from erased data. In practice, the approximate posterior beliefs recovered by unlearning from erased data using our proposed methods can be immediately used in ML applications and, at the same time, continue to be improved by retraining with the remaining data at the expense of parsimony. In our future work, we will apply our proposed methods to unlearning more sophisticated Bayesian models like the entire family of sparse GP models [5, 6, 7, 8, 16, 17, 18, 19, 20, 22, 23, 25, 31, 32, 34] and deep GP models [33].

## Broader Impact

As discussed in our introduction (Sec. 1), a direct contribution of our work to society in this information age is to the implementation of personal data ownership (i.e., enforced by the General Data Protection Regulation in the European Union [24]) by studying the problem of machine unlearning for Bayesian models. Such an implementation can boost the confidence of users about sharing their data with an application/organization when they know that the trace of their data can be reduced/erased as requested. As a result, organizations/applications can gather more useful data from users to enhance the service provided back to the users and hence to society.

Our unlearning work can also contribute to the defense against data poisoning attacks (i.e., injecting malicious training data). Instead of retraining the tampered machine learning model from scratch to recover the quality of a service, unlearning the model from the detected malicious data may incur much less time, which improves the user experience and reduces the cost due to the service disruption.

In contrast, the ability to unlearn machine learning models may also open the door to new adversarial activities. For example, in the context of data sharing, multiple parties share their data to train a common machine learning model. An unethical party can deliberately share a low-quality dataset instead of its high-quality one. After obtaining the model trained on the datasets from all parties (including the low-quality dataset), the unethical party can unlearn the low-quality dataset and continue to train the model with its high-quality dataset. By doing this, the unethical party achieves a better model than the other parties in the collaboration. Therefore, the possibility of machine unlearning should be considered in the design of data sharing frameworks.

## Acknowledgments and Disclosure of Funding

This research/project is supported by the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

## References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[2] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. J. American Statistical Association, 112(518):859-877, 2017.

[3] L. Bourtoule, V. Chandrasekaran, C. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot. Machine unlearning. arXiv:1912.03817, 2019.

[4] Y. Cao and J. Yang. Towards making systems forget with machine unlearning. In Proc. IEEE S&P, pages 463-480, 2015.

[5] J. Chen, N. Cao, B. K. H. Low, R. Ouyang, C. K.-Y. Tan, and P. Jaillet. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pages 152-161, 2013.

[6] J. Chen, B. K. H. Low, P. Jaillet, and Y. Yao. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Trans. Autom. Sci. Eng., 12:901-921, 2015.

[7] J. Chen, B. K. H. Low, and C. K.-Y. Tan. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013.

[8] J. Chen, B. K. H. Low, C. K.-Y. Tan, A. Oran, P. Jaillet, J. M. Dolan, and G. S. Sukhatme. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pages 163-173, 2012.

[9] M. Du, Z. Chen, C. Liu, R. Oak, and D. Song. Lifelong anomaly detection through unlearning. In Proc. CCS, pages 1283-1297, 2019.

[10] D. Dua and C. Graff. UCI machine learning repository, 2017.

[11] Y. Gal and M. van der Wilk. Variational inference in sparse Gaussian process regression and latent variable models: a gentle tutorial. arXiv preprint arXiv:1402, 2014.

[12] A. Ginart, M. Guan, G. Valiant, and J. Y. Zou. Making AI forget you: Data deletion in machine learning. In Proc. NeurIPS, pages 3513-3526, 2019.

[13] A. Golatkar, A. Achille, and S. Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep neural networks. In Proc. CVPR, 2020.

[14] C. Guo, T. Goldstein, A. Hannun, and L. van der Maaten. Certified data removal from machine learning models. arXiv:1911.03030, 2019.

[15] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proc. UAI, pages 282-290, 2013.

[16] Q. M. Hoang, T. N. Hoang, and B. K. H. Low. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, pages 2007-2014, 2017.

[17] Q. M. Hoang, T. N. Hoang, B. K. H. Low, and C. Kingsford. Collective model fusion for multiple black-box experts. In Proc. ICML, pages 2742-2750, 2019.

[18] T. N. Hoang, Q. M. Hoang, and B. K. H. Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pages 569-578, 2015.

[19] T. N. Hoang, Q. M. Hoang, and B. K. H. Low. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pages 382-391, 2016.

[20] T. N. Hoang, Q. M. Hoang, B. K. H. Low, and J. P. How. Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI, pages 7850-7857, 2019.

[21] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NeurIPS, pages 4743-4751, 2016.

[22] B. K. H. Low, N. Xu, J. Chen, K. K. Lim, and E. B. Özgül. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pages 499-503, 2014.
[23] B. K. H. Low, J. Yu, J. Chen, and P. Jaillet. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pages 2821-2827, 2015.

[24] A. Mantelero. The EU proposal for a general data protection regulation and the roots of the 'right to be forgotten'. Computer Law & Security Review, 29(3):229-235, 2013.

[25] R. Ouyang and B. K. H. Low. Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception. In Proc. AAAI, pages 3876-3883, 2018.

[26] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Proc. NeurIPS, pages 2338-2347, 2017.

[27] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939-1959, 2005.

[28] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[29] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530-1538, 2015.

[30] S. Schelter. "Amnesia": Towards machine learning models that can forget user data very fast. In Proc. International Workshop on Applied AI for Database Systems and Applications, 2019.

[31] T. Teng, J. Chen, Y. Zhang, and B. K. H. Low. Scalable variational Bayesian kernel selection for sparse Gaussian process regression. In Proc. AAAI, pages 5997-6004, 2020.

[32] N. Xu, B. K. H. Low, J. Chen, K. K. Lim, and E. B. Özgül. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pages 2585-2592, 2014.

[33] H. Yu, Y. Chen, Z. Dai, K. H. Low, and P. Jaillet. Implicit posterior variational inference for deep Gaussian processes. In Proc. NeurIPS, pages 14475-14486, 2019.

[34] H. Yu, T. N. Hoang, B. K. H. Low, and P. Jaillet. Stochastic variational inference for Bayesian sparse Gaussian process regression. In Proc. IJCNN, 2019.