# From Adaptive Query Release to Machine Unlearning

Enayat Ullah (1), Raman Arora (1)

(1) Department of Computer Science, The Johns Hopkins University, USA. Correspondence to: Enayat Ullah .

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** We formalize the problem of machine unlearning as the design of efficient unlearning algorithms corresponding to learning algorithms which perform a selection of adaptive queries from structured query classes. We give efficient unlearning algorithms for linear and prefix-sum query classes. As applications, we show that unlearning in many problems, in particular, stochastic convex optimization (SCO), can be reduced to the above, yielding improved guarantees for the problem. In particular, for smooth Lipschitz losses and any ρ > 0, our results yield an unlearning algorithm with excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\big)$ with unlearning query (gradient) complexity $\tilde{O}(\rho \cdot \text{Retraining Complexity})$, where d is the model dimensionality and n is the initial number of samples. For non-smooth Lipschitz losses, we give an unlearning algorithm with excess population risk $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{1/2}\big)$ with the same unlearning query (gradient) complexity. Furthermore, in the special case of Generalized Linear Models (GLMs), such as those in linear and logistic regression, we get dimension-independent rates of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\big)$ and $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\big)$ for smooth Lipschitz and non-smooth Lipschitz losses, respectively. Finally, we give generalizations of the above from one unlearning request to dynamic streams consisting of insertions and deletions.

## 1. Introduction

The problem of machine unlearning is concerned with updating trained machine learning models upon requests for deletion from the training dataset. This problem has recently gained attention owing to various data privacy laws such
as the General Data Protection Regulation (GDPR) and the California Consumer Act (CCA), among others, which empower users to make such requests to the entity possessing their data. The entity is then required to update the state of the system so that it is indistinguishable from the state had the user's data been absent to begin with. While, as of now, there is no universally accepted definition of indistinguishability as the unlearning criterion, in this work we consider the strictest definition, called exact unlearning (see Definition 1).

Motivating Example: The main objective of our work is to identify algorithmic design principles for unlearning such that it is more efficient than retraining, the naive baseline method. Towards this, we first discuss the example of unlearning for the Gradient Descent (GD) method, which will highlight the key challenges as well as foreshadow the formal setup and techniques. GD and its variants are extremely popular optimization methods with numerous applications in machine learning and beyond. In a machine learning context, GD is typically used to minimize the training loss $\hat{L}(w; S) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; z_i)$, where $S = \{z_i\}_{i=1}^{n}$ is the training dataset and $w$ the model. Starting from an initial model $w_1$, in each iteration the model is updated as

$$w_{t+1} = w_t - \eta \nabla \hat{L}(w_t; S) = w_t - \frac{\eta}{n} \sum_{i=1}^{n} \nabla \ell(w_t; z_i),$$

where $\eta$ is the learning rate. After training, a data point, say $z_n$ without loss of generality, is requested to be unlearnt, so the updated training set is $S' = \{z_i\}_{i=1}^{n-1}$. We now need to apply an efficient unlearning algorithm such that its output is equal to that of running GD on $S'$. Observe that the first iteration of GD is simple enough to be unlearnt efficiently by computing the new gradient $\nabla \hat{L}(w_1; S') = \frac{1}{n-1}\big(n \nabla \hat{L}(w_1; S) - \nabla \ell(w_1; z_n)\big)$ and updating as $w_2' = w_1 - \eta \nabla \hat{L}(w_1; S')$.
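This first-iteration adjustment is easy to check numerically. The following is a minimal sketch of ours (a squared loss and all variable names are illustrative, not the paper's):

```python
import numpy as np

def grad_loss(w, z):
    # gradient of the squared loss l(w; z) = 0.5 * (<w, x> - y)^2
    x, y = z
    return (w @ x - y) * x

rng = np.random.default_rng(0)
n, d = 50, 5
S = [(rng.normal(size=d), rng.normal()) for _ in range(n)]
w1 = np.zeros(d)
eta = 0.1

# average gradient on the full dataset S
g_S = sum(grad_loss(w1, z) for z in S) / n

# unlearn z_n from the *first* iteration: reweight instead of recomputing
z_n = S[-1]
g_Sprime = (n * g_S - grad_loss(w1, z_n)) / (n - 1)

# matches the gradient recomputed from scratch on S' = S \ {z_n}
g_check = sum(grad_loss(w1, z) for z in S[:-1]) / (n - 1)
assert np.allclose(g_Sprime, g_check)

w2_prime = w1 - eta * g_Sprime  # one unlearned GD step
```

The assertion confirms that reweighting the old average gradient reproduces the gradient on S' exactly, at the cost of a single extra gradient evaluation.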
However, in the second iteration (and onwards), the gradient is computed at $w_2'$, which can be different from $w_2$; the above adjustment can no longer be applied, and one may need to retrain from this point onwards. This captures a key challenge for unlearning in problems solved by simple iterative procedures such as GD: adaptivity, that is, the gradients (or, more generally, the queries) computed in later iterations depend on the results of the previous iterations. We systematically formalize such procedures and design efficient unlearning algorithms for them.

## 1.1. Our Results and Techniques

Learning/Unlearning as Query Release: Iterative procedures are an integral constituent of the algorithmic toolkit for solving machine learning problems and beyond. As in the case of GD above, these often consist of a sequence of simple but adaptive computations. The simple computations are often efficiently undo-able (as in the first iteration of GD), but their adaptive nature, wherein a change in the result of one iteration changes the trajectory of the algorithm, makes it difficult to undo the computation, or unlearn, efficiently. As opposed to designing unlearning (and learning) algorithms for specific (machine learning) problems, we study the design of unlearning algorithms corresponding to (a class of) learning algorithms. We formalize this by considering learning algorithms which perform adaptive query release on datasets. Specifically, this consists of a selection of adaptive queries from structured classes like linear and prefix-sum queries (see Section 3 for details). The above example of GD is an instance of a linear query, since the query, which is the average gradient $\frac{1}{n}\sum_{i=1}^{n} \nabla \ell(w_t; z_i)$, is a sum of functions of individual data points. With this view, we study how to design efficient unlearning algorithms for such methods.
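To see the adaptivity obstacle concretely, the small sketch below (our own illustration, on least squares) runs GD on S and on S' in parallel: the iterates already differ after the first step, so later gradients are queried at different models and the cheap reweighting fix from the first iteration no longer applies.

```python
import numpy as np

def grad(w, X, y):
    # average gradient of the squared loss over a dataset
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
n, d, eta, T = 40, 3, 0.1, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
Xp, yp = X[:-1], y[:-1]               # S' = S without the last point

w, wp = np.zeros(d), np.zeros(d)
for t in range(T):
    w = w - eta * grad(w, X, y)       # GD trajectory on S
    wp = wp - eta * grad(wp, Xp, yp)  # GD trajectory on S'

# the trajectories diverge: from iteration 2 onwards the queries are
# evaluated at different iterates, so no O(1) adjustment recovers wp from w
assert not np.allclose(w, wp)
```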
We use efficiency in the sense of the number of queries made (query complexity), ignoring the use of other resources, e.g., space, computation for the selection of queries, etc. To elaborate on why this is interesting, firstly note that this does not make the problem trivial: even with unlimited access to other resources, it is still challenging to design an unlearning algorithm with query complexity smaller than that of retraining (the naive baseline). Secondly, let us revisit the motivation from solving optimization problems. The standard model for measuring computation in optimization is the number of gradient queries a method makes for a target accuracy, often abstracted in an oracle-based setup (Nemirovskij and Yudin, 1983). Importantly, this setup imposes no constraints on other resources, yet it witnesses the optimality of well-known simple procedures like (variants of) GD. We follow this paradigm, and as applications of our results to Stochastic Convex Optimization (SCO), we make progress on the fundamental question of understanding the gradient complexity of unlearning in SCO. Interestingly, our proposed unlearning procedures are simple enough that the improvement over retraining in terms of query complexity holds even when accounting for the (arithmetic) complexity of all other operations in the learning and unlearning methods.

Linear queries: The simplest query class we consider is that of linear queries (details deferred to Appendix B). Herein, we show that the prior work of Ullah et al. (2021), which focused on unlearning in SCO and was limited to the stochastic gradient method, can be easily extended to general linear queries. This observation yields unlearning algorithms for Federated Optimization/Learning and k-means clustering algorithms.
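As a hint of why linear queries support cheap deletions, a Lloyd's-style k-means centroid step (for a fixed assignment) can be phrased as per-cluster sums and counts, which are linear queries; deleting a point only subtracts its contribution. This is a simplified sketch of ours, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 2))
assign = rng.integers(0, 3, size=100)      # fixed cluster assignments

# linear queries: per-cluster coordinate sums and counts
sums = np.array([data[assign == c].sum(axis=0) for c in range(3)])
counts = np.array([(assign == c).sum() for c in range(3)])
centroids = sums / counts[:, None]         # one centroid-update step

# delete point i: only the deleted point's contributions are subtracted
i = 7
c = assign[i]
sums[c] -= data[i]
counts[c] -= 1
new_centroid = sums[c] / counts[c]

# matches recomputation from scratch on the reduced dataset
mask = np.ones(100, dtype=bool)
mask[i] = False
ref = data[mask][assign[mask] == c].mean(axis=0)
assert np.allclose(new_centroid, ref)
```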
Herein, we give a ρ-TV-stable (see Definition 3) learning procedure with T adaptive queries and a corresponding unlearning procedure with an $O(\sqrt{T}\rho)$ relative unlearning complexity (the ratio of unlearning and retraining complexity; see Definition 5).

Prefix-sum queries: Our main contribution is for the class of prefix-sum queries. These are a sub-class of interval queries, which have been extensively studied in differential privacy and are classically solved by the binary tree mechanism (Dwork et al., 2010). We note in passing that for differential privacy, the purpose of the tree is to enable tight privacy accounting, and no explicit tree may be maintained. In contrast, for unlearning, we show that maintaining the binary tree data structure aids efficient unlearning. We give a binary-tree-based ρ-TV-stable learning procedure and a corresponding unlearning procedure with an $\tilde{O}(\rho)$ relative unlearning complexity.

Unlearning in Stochastic Convex Optimization (SCO): Our primary motivation for considering prefix-sum queries is their application to unlearning in SCO (see Section 2 for preliminaries).

1) Smooth SCO: The problem of unlearning in smooth SCO was studied in Ullah et al. (2021), which proposed algorithms with excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{2/3}\big)$, where ρ is the relative unlearning complexity. We show that using a variant of variance-reduced Frank-Wolfe (Zhang et al., 2020), which uses prefix-sum queries, yields an improved excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\big)$. This corresponds to $\tilde{O}(\rho n)$ expected gradient computations upon unlearning.

2) Non-smooth SCO: In the non-smooth setting, which was not covered in prior work, we give an algorithm based on Dual Averaging (Nesterov, 2009), which again uses prefix-sum query access and thus fits into the framework. This algorithm gives us an excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}\big)$ with $\tilde{O}(\rho n)$ expected gradient complexity of unlearning.
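To illustrate why a maintained binary tree helps, the minimal sketch below (ours; the paper's structure additionally stores noisy per-node quantities, omitted here) uses a binary indexed (Fenwick) tree: answering any prefix-sum query, and propagating the single-item update needed for unlearning, each touch only O(log T) nodes.

```python
class PrefixSumTree:
    """Binary indexed (Fenwick) tree: prefix sums in O(log T) per query,
    and removing one item's contribution touches only O(log T) nodes."""

    def __init__(self, T):
        self.n = T
        self.tree = [0.0] * (T + 1)

    def add(self, i, delta):        # 1-indexed position i
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)           # climb to the next covering node

    def prefix(self, i):            # sum of items 1..i
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

t = PrefixSumTree(8)
vals = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
for i, v in enumerate(vals, 1):
    t.add(i, v)
assert t.prefix(4) == 9.0           # 3 + 1 + 4 + 1

# "unlearn" item 3 by subtracting its contribution: O(log T) node updates
t.add(3, -4.0)
assert t.prefix(4) == 5.0
```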
3) Generalized Linear Models (GLMs): GLMs are among the most basic machine learning problems and include the squared loss (in linear regression), logistic loss (in logistic regression), hinge loss (in support vector machines), etc. We study unlearning in two classes of GLMs (see below), for which we combine recently proposed techniques based on dimensionality reduction (Arora et al., 2022) with the above prefix-sum query algorithms to get the following dimension-independent rates.

3(a) Smooth GLM: For the smooth convex GLM setting, we combine the Johnson-Lindenstrauss transform with variance-reduced Frank-Wolfe to get $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\big)$ excess population risk. Note that we incur no overhead in the statistical rate even with a relative unlearning complexity as small as $\rho \approx n^{-1/4}$. This class of smooth GLMs contains the well-studied problem of logistic regression. Hence, our result demonstrates that it is possible to unlearn logistic regression with sub-linear, specifically $\tilde{O}(n^{3/4})$, unlearning complexity with no sacrifice in the statistical rate.

3(b) Lipschitz GLM: Similarly, for the Lipschitz convex GLM setting, we combine the Johnson-Lindenstrauss transform with Dual Averaging, yielding a rate of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\big)$. Please see Table 1 for a summary of the above results.

| Problem | Base algorithm | Rate |
| --- | --- | --- |
| Smooth, Lipschitz SCO | VR-FW | $\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}$ |
| Lipschitz SCO | DA | $\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}$ |
| Smooth, Lipschitz GLM | JL + VR-FW | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}$ |
| Lipschitz GLM | JL + DA | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}$ |

Table 1. Excess population risk guarantees for various problems as well as the base algorithm; ρ: relative unlearning complexity (see Definition 5); VR-FW: variance-reduced Frank-Wolfe; DA: Dual Averaging; JL: Johnson-Lindenstrauss transform.

SCO in dynamic streams: Finally, we consider SCO in dynamic streams, where we observe a sequence of insertions and deletions and are required to produce an output after each time point.
In this case, we present two methods: one which satisfies the exact unlearning guarantee with a worse update time, and another which satisfies weak unlearning, which only requires the model (and not the metadata) to be indistinguishable (see Definition 2), with an improved update time. The exact unlearning method is inspired by the work of Ullah et al. (2021), which handled insertions analogously to deletions. The weak unlearning method is motivated by the observation that the above may be too pessimistic. To elaborate, inserting a new data item does not warrant an (unlearning) guarantee that the algorithm's state be indistinguishable from the case where the point was not inserted. Hence, insertions should require a smaller update time, which is indeed the case for our proposed methods.

## 1.2. Related work

Our work is a direct follow-up of Ullah et al. (2021), which proposed the framework of Total Variation (TV) stability and maximal coupling for the exact machine unlearning problem. They applied this to unlearning in smooth stochastic convex optimization (SCO) and obtained a guarantee of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{2/3}\big)$ on excess population risk, where n is the number of data samples, d the model dimensionality, and ρ the relative unlearning complexity (see Definition 5). We improve upon the results in that work in multiple ways, as described in the preceding section. Besides this, the exact unlearning problem has been studied for k-means clustering (Ginart et al., 2019) and random forests (Brophy and Lowd, 2021). The work of Bourtoule et al. (2021) proposes a general methodology for exact unlearning for deep learning methods. Their focus is on devising practical methods, and they do not provide theoretical guarantees on accuracy, even in simple settings.
Finally, there are works which consider unlearning in SCO; however, they use an approximate notion of unlearning inspired by differential privacy (Guo et al., 2019; Neel et al., 2021; Sekhari et al., 2021; Gupta et al., 2021) and are therefore incomparable to our work.

## 2. Problem Setup and Preliminaries

Let $\mathcal{Z}$ be the data space, $\mathcal{W}$ the model space, and $\mathcal{M}$ the meta-data space, where meta-data is additional information a learning algorithm may save to aid unlearning. We consider a learning algorithm as a map $\mathsf{A} : \mathcal{Z}^* \to \mathcal{W} \times \mathcal{M}$ and an unlearning algorithm as a map $\mathsf{U} : \mathcal{W} \times \mathcal{M} \times \mathcal{Z} \to \mathcal{W} \times \mathcal{M}$. We use $\bar{\mathsf{A}}$ and $\bar{\mathsf{U}}$ to denote the first output (which belongs to $\mathcal{W}$) of $\mathsf{A}$ and $\mathsf{U}$, respectively. We recall the definition of exact unlearning, which requires that the entire state after unlearning be indistinguishable from the state obtained if the learning algorithm were applied to the dataset without the deleted point.

Definition 1 (Exact unlearning). A procedure $(\mathsf{A}, \mathsf{U})$ satisfies exact unlearning if for all datasets S, all $z \in \mathcal{Z}$, and all events $\mathcal{E} \subseteq \mathcal{W} \times \mathcal{M}$, we have
$$\mathbb{P}\left(\mathsf{A}(S \setminus \{z\}) \in \mathcal{E}\right) = \mathbb{P}\left(\mathsf{U}(\mathsf{A}(S), z) \in \mathcal{E}\right).$$

We next define weak unlearning, wherein only the model output, and not the entire state, is required to be indistinguishable.

Definition 2 (Weak unlearning). A procedure $(\mathsf{A}, \mathsf{U})$ satisfies weak unlearning if for all datasets S, all $z \in \mathcal{Z}$, and all events $\mathcal{E} \subseteq \mathcal{W}$, we have
$$\mathbb{P}\left(\bar{\mathsf{A}}(S \setminus \{z\}) \in \mathcal{E}\right) = \mathbb{P}\left(\bar{\mathsf{U}}(\mathsf{A}(S), z) \in \mathcal{E}\right).$$

Unlearning request: We consider the setting where we start with a dataset of n samples and observe one unlearning request. We assume that the choice of the unlearning request is oblivious to the learning process. In Section 6, we generalize our results to a streaming setting of requests.

Total Variation stability, maximal coupling, and efficient unlearning: The Total Variation (TV) distance between two probability distributions P and Q is
$$\mathrm{TV}(P, Q) = \sup_{\text{measurable } E} |P(E) - Q(E)|.$$
Next, we define Total Variation (TV) stability to motivate algorithmic techniques for efficient unlearning.

Definition 3.
An algorithm $\mathsf{A}$ is said to be ρ-Total-Variation (TV) stable if for all datasets S and S' differing in one point, i.e., $|S \triangle S'| = 1$, the total variation distance satisfies $\mathrm{TV}(\mathsf{A}(S), \mathsf{A}(S')) \leq \rho$.

Given two distributions P and Q, a coupling is a joint distribution π with marginals P and Q. Furthermore, a maximal coupling is a coupling π such that the disagreement probability $\mathbb{P}_{(x, y) \sim \pi}\{x \neq y\} = \mathrm{TV}(P, Q)$. In the unlearning context, $P = \mathsf{A}(S)$, the output on the initial dataset, and $Q = \mathsf{A}(S')$, the output on the updated dataset. Hence, the unlearning problem simply becomes one of transporting P to Q with small computational cost, akin to optimal transport (Villani, 2009). Furthermore, observe that when we sample from a maximal coupling between P and Q, by definition we get the same sample for both P and Q except with probability ρ, while still satisfying the exact unlearning criterion. The main idea is that for certain learning algorithms of interest, during unlearning we can efficiently construct a (near-)maximal coupling of P and Q, so that the same model output from P suffices for Q most of the time. In particular, the fraction of times that we need to change the model is (roughly) the TV-stability parameter ρ of the learning algorithm. The goal, therefore, is to design an (accurate) TV-stable learning algorithm and a corresponding efficient coupling-based unlearning algorithm. In this work, we use the technique of reflection coupling, described below.

Reflection Coupling (Lindvall and Rogers, 1986): Reflection coupling is a classical technique in probability for maximally coupling symmetric probability distributions. Consider two probability distributions P and Q with means u and u', and let r be a sample from P. The process involves a rejection sampling step on the two distributions and the sample r (see line 13 in Algorithm 3).
If it results in an accept, we use the same r as the sample from Q; otherwise, we apply the following simple map, which gives the sample from Q (see line 16 in Algorithm 3):
$$\mathrm{Reflect}(u, u', r) = u + u' - r.$$

Our algorithmic techniques borrow tools from differential privacy (Dwork et al., 2014), such as its relationship with Total Variation stability; we describe these in Appendix A.

Stochastic Convex Optimization (SCO): SCO is the dominant framework for computationally-efficient machine learning. Consider a closed convex (constraint) set $\mathcal{W} \subseteq \mathbb{R}^d$ and let D denote its diameter. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is convex in its first argument for every $z \in \mathcal{Z}$. Given n i.i.d. points from an unknown probability distribution $\mathcal{D}$ over $\mathcal{Z}$, the goal is to devise an algorithm whose output has small population risk, defined as $L(w; \mathcal{D}) := \mathbb{E}_{z \sim \mathcal{D}}\, \ell(w; z)$. The excess population risk is then $L(w; \mathcal{D}) - L(w^*; \mathcal{D})$, where $w^*$ denotes a population risk minimizer over $\mathcal{W}$.

Algorithm 1: Template learning algorithm

```
Input: dataset S, steps T, query functions {q_t(·)}_{t ≤ T} where q_t ∈ Q, a query class,
       update functions {U_t(·)}_{t ≤ T}, selector function S(·)
1: Initialize model w_1 ∈ W
2: for t = 1 to T − 1 do
3:     Query dataset: u_t = q_t({w_i}_{i ≤ t}, S)
4:     Update: w_{t+1} = U_t({w_i}_{i ≤ t}, u_t)
5: end for
Output: ŵ = S({w_t}_{t ≤ T})
```

Generalized Linear Models (GLMs): Generalized linear models are loss functions popularly encountered in supervised learning problems, like linear and logistic regression. Herein, $\ell(w; (x, y)) = \phi_y(\langle w, x \rangle)$, where $\phi_y : \mathbb{R} \to \mathbb{R}$ is some link function. We use X to denote the radius bound on data points, i.e., for $x \in \mathcal{X} \subseteq \mathbb{R}^d$, $\|x\| \leq X$. In this case, we consider the unconstrained setup, i.e., $\mathcal{W} = \mathbb{R}^d$, as it allows us to get dimension-independent rates for GLMs, similar to what happens under differential privacy (Jain and Thakurta, 2014; Arora et al., 2022). We introduce the Johnson-Lindenstrauss property below, which is crucial to our construction.

Definition 4 (Johnson-Lindenstrauss property).
A random matrix $\Phi \in \mathbb{R}^{k \times d}$ satisfies the (β, γ)-JL property if for any $u, v \in \mathbb{R}^d$,
$$\mathbb{P}\left(\left|\langle \Phi u, \Phi v \rangle - \langle u, v \rangle\right| \geq \beta \|u\| \|v\|\right) \leq \gamma.$$
There exist many efficient constructions of such random matrices (Nelson, 2011).

## 3. Unlearning for Adaptive Query Release

We now set up the framework of adaptive query release, which is a lens through which to view (existing) iterative learning procedures; this view is useful in our design of corresponding unlearning algorithms. Iterative procedures run on datasets consist of a sequence of interactions with the dataset; each interaction computes a certain function, or query, on the dataset. The chosen query is typically adaptive, i.e., dependent on the prior query outputs. We consider iterative learning procedures which are composed of adaptive queries from a specified query class. Formally, consider a query class $\mathcal{Q} \subseteq \mathcal{W}^{\mathcal{W}^* \times \mathcal{Z}^*}$; herein, each query in $\mathcal{Q}$ is a function of a sequence of models $\{w_i\}_{i \geq 1}$ and the dataset, written $q_t(\{w_i\}_{i \leq t}, S)$.
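As a concrete instantiation of the template learning algorithm (Algorithm 1), the sketch below (our own, with illustrative names) recovers GD on least squares by choosing each $q_t$ to be the linear (average-gradient) query at the latest iterate and each $U_t$ a gradient step:

```python
import numpy as np

def run_template(S, T, query, update, select, w1):
    """Template learning algorithm: adaptively query the dataset,
    update the iterate with the query answer, then select an output."""
    ws = [w1]
    for t in range(T):
        u_t = query(ws, S)          # adaptive query on the dataset
        ws.append(update(ws, u_t))  # update using the query answer
    return select(ws)

# instantiate GD on least squares via a linear (averaged-gradient) query
rng = np.random.default_rng(3)
X, y = rng.normal(size=(30, 4)), rng.normal(size=30)
S = (X, y)
query = lambda ws, S: S[0].T @ (S[0] @ ws[-1] - S[1]) / len(S[1])
update = lambda ws, u: ws[-1] - 0.1 * u
select = lambda ws: ws[-1]

w_hat = run_template(S, T=500, query=query, update=update, select=select,
                     w1=np.zeros(4))
# the output approaches the least-squares solution
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.linalg.norm(w_hat - w_star) < 1e-2
```

The same skeleton covers the prefix-sum-based methods in the paper by letting `query` depend on the whole history of iterates rather than only the latest one.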