# From Adaptive Query Release to Machine Unlearning

Enayat Ullah (1), Raman Arora (1)

(1) Department of Computer Science, The Johns Hopkins University, USA. Correspondence to: Enayat Ullah .

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** We formalize the problem of machine unlearning as the design of efficient unlearning algorithms corresponding to learning algorithms which perform a selection of adaptive queries from structured query classes. We give efficient unlearning algorithms for linear and prefix-sum query classes. As applications, we show that unlearning in many problems, in particular, stochastic convex optimization (SCO), can be reduced to the above, yielding improved guarantees for the problem. In particular, for smooth Lipschitz losses and any ρ > 0, our results yield an unlearning algorithm with excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\big)$ with unlearning query (gradient) complexity $\tilde{O}(\rho \cdot \text{Retraining Complexity})$, where d is the model dimensionality and n is the initial number of samples. For non-smooth Lipschitz losses, we give an unlearning algorithm with excess population risk $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{1/2}\big)$ with the same unlearning query (gradient) complexity. Furthermore, in the special case of Generalized Linear Models (GLMs), such as those in linear and logistic regression, we get dimension-independent rates of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\big)$ and $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\big)$ for smooth Lipschitz and non-smooth Lipschitz losses, respectively. Finally, we give generalizations of the above from one unlearning request to dynamic streams consisting of insertions and deletions.

## 1. Introduction

The problem of machine unlearning is concerned with updating trained machine learning models upon requests for deletion from the training dataset. This problem has recently gained attention owing to various data privacy laws such
as the General Data Protection Regulation (GDPR) and the California Consumer Act (CCA), among others, which empower users to make such requests to the entity possessing their data. The entity is then required to update the state of the system so that it is indistinguishable from the state had the user's data been absent to begin with. While, as of now, there is no universally accepted definition of indistinguishability as the unlearning criterion, in this work we consider the strictest definition, called exact unlearning (see Definition 1).

Motivating Example: The main objective of our work is to identify algorithmic design principles for unlearning such that it is more efficient than retraining, the naive baseline method. Towards this, we first discuss the example of unlearning for the Gradient Descent (GD) method, which will highlight the key challenges as well as foreshadow the formal setup and techniques. GD and its variants are extremely popular optimization methods with numerous applications in machine learning and beyond. In a machine learning context, GD is typically used to minimize the training loss $\hat{L}(w; S) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; z_i)$, where $S = \{z_i\}_{i=1}^{n}$ is the training dataset and $w$ the model. Starting from an initial model $w_1$, in each iteration the model is updated as

$$w_{t+1} = w_t - \eta \nabla \hat{L}(w_t; S) = w_t - \frac{\eta}{n} \sum_{i=1}^{n} \nabla \ell(w_t; z_i),$$

where $\eta$ is the learning rate. After training, a data point, say $z_n$ without loss of generality, is requested to be unlearnt, so the updated training set is $S' = \{z_i\}_{i=1}^{n-1}$. We now need to apply an efficient unlearning algorithm such that its output is equal to that of running GD on $S'$. Observe that the first iteration of GD is simple enough to be unlearnt efficiently by computing the new gradient $\nabla \hat{L}(w_1; S') = \frac{1}{n-1}\big(n \nabla \hat{L}(w_1; S) - \nabla \ell(w_1; z_n)\big)$ and updating as $w_2' = w_1 - \eta \nabla \hat{L}(w_1; S')$.
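This first-iteration adjustment is easy to check numerically. The following is a minimal sketch of ours (a squared loss and all variable names are illustrative, not the paper's):

```python
import numpy as np

def grad_loss(w, z):
    # gradient of the squared loss l(w; z) = 0.5 * (<w, x> - y)^2
    x, y = z
    return (w @ x - y) * x

rng = np.random.default_rng(0)
n, d = 50, 5
S = [(rng.normal(size=d), rng.normal()) for _ in range(n)]
w1 = np.zeros(d)
eta = 0.1

# average gradient on the full dataset S
g_S = sum(grad_loss(w1, z) for z in S) / n

# unlearn z_n from the *first* iteration: reweight instead of recomputing
z_n = S[-1]
g_Sprime = (n * g_S - grad_loss(w1, z_n)) / (n - 1)

# matches the gradient recomputed from scratch on S' = S \ {z_n}
g_check = sum(grad_loss(w1, z) for z in S[:-1]) / (n - 1)
assert np.allclose(g_Sprime, g_check)

w2_prime = w1 - eta * g_Sprime  # one unlearned GD step
```

The assertion confirms that reweighting the old average gradient reproduces the gradient on S' exactly, at the cost of a single extra gradient evaluation.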
However, in the second iteration (and onwards), the gradient is computed at $w_2'$, which can be different from $w_2$; the above adjustment can no longer be applied, and one may need to retrain from this point onwards. This captures a key challenge for unlearning in problems solved by simple iterative procedures such as GD: adaptivity, that is, the gradients (or, more generally, the queries) computed in later iterations depend on the results of the previous iterations. We systematically formalize such procedures and design efficient unlearning algorithms for them.

## 1.1. Our Results and Techniques

Learning/Unlearning as Query Release: Iterative procedures are an integral constituent of the algorithmic toolkit for solving machine learning problems and beyond. As in the case of GD above, these often consist of a sequence of simple but adaptive computations. The simple computations are often efficiently undo-able (as in the first iteration of GD), but their adaptive nature, wherein a change in the result of one iteration changes the trajectory of the algorithm, makes it difficult to undo the computation, or unlearn, efficiently. As opposed to designing unlearning (and learning) algorithms for specific (machine learning) problems, we study the design of unlearning algorithms corresponding to (a class of) learning algorithms. We formalize this by considering learning algorithms which perform adaptive query release on datasets. Specifically, this consists of a selection of adaptive queries from structured classes like linear and prefix-sum queries (see Section 3 for details). The above example of GD is an instance of a linear query, since the query, which is the average gradient $\frac{1}{n}\sum_{i=1}^{n} \nabla \ell(w_t; z_i)$, is a sum of functions of individual data points. With this view, we study how to design efficient unlearning algorithms for such methods.
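To see the adaptivity obstacle concretely, the small sketch below (our own illustration, on least squares) runs GD on S and on S' in parallel: the iterates already differ after the first step, so later gradients are queried at different models and the cheap reweighting fix from the first iteration no longer applies.

```python
import numpy as np

def grad(w, X, y):
    # average gradient of the squared loss over a dataset
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
n, d, eta, T = 40, 3, 0.1, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
Xp, yp = X[:-1], y[:-1]               # S' = S without the last point

w, wp = np.zeros(d), np.zeros(d)
for t in range(T):
    w = w - eta * grad(w, X, y)       # GD trajectory on S
    wp = wp - eta * grad(wp, Xp, yp)  # GD trajectory on S'

# the trajectories diverge: from iteration 2 onwards the queries are
# evaluated at different iterates, so no O(1) adjustment recovers wp from w
assert not np.allclose(w, wp)
```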
We use efficiency in the sense of the number of queries made (query complexity), ignoring the use of other resources, e.g., space, computation for the selection of queries, etc. To elaborate on why this is interesting, firstly note that this does not make the problem trivial: even with unlimited access to other resources, it is still challenging to design an unlearning algorithm with query complexity smaller than that of retraining (the naive baseline). Secondly, let us revisit the motivation from solving optimization problems. The standard model for measuring computation in optimization is the number of gradient queries a method makes for a target accuracy, often abstracted in an oracle-based setup (Nemirovskij and Yudin, 1983). Importantly, this setup imposes no constraints on other resources, yet it witnesses the optimality of well-known simple procedures like (variants of) GD. We follow this paradigm, and as applications of our results to Stochastic Convex Optimization (SCO), we make progress on the fundamental question of understanding the gradient complexity of unlearning in SCO. Interestingly, our proposed unlearning procedures are simple enough that the improvement over retraining in terms of query complexity holds even when accounting for the (arithmetic) complexity of all other operations in the learning and unlearning methods.

Linear queries: The simplest query class we consider is that of linear queries (details deferred to Appendix B). Herein, we show that the prior work of Ullah et al. (2021), which focused on unlearning in SCO and was limited to the stochastic gradient method, can be easily extended to general linear queries. This observation yields unlearning algorithms for Federated Optimization/Learning and k-means clustering algorithms.
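As a hint of why linear queries support cheap deletions, a Lloyd's-style k-means centroid step (for a fixed assignment) can be phrased as per-cluster sums and counts, which are linear queries; deleting a point only subtracts its contribution. This is a simplified sketch of ours, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 2))
assign = rng.integers(0, 3, size=100)      # fixed cluster assignments

# linear queries: per-cluster coordinate sums and counts
sums = np.array([data[assign == c].sum(axis=0) for c in range(3)])
counts = np.array([(assign == c).sum() for c in range(3)])
centroids = sums / counts[:, None]         # one centroid-update step

# delete point i: only the deleted point's contributions are subtracted
i = 7
c = assign[i]
sums[c] -= data[i]
counts[c] -= 1
new_centroid = sums[c] / counts[c]

# matches recomputation from scratch on the reduced dataset
mask = np.ones(100, dtype=bool)
mask[i] = False
ref = data[mask][assign[mask] == c].mean(axis=0)
assert np.allclose(new_centroid, ref)
```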
Herein, we give a ρ-TV-stable (see Definition 3) learning procedure with T adaptive queries and a corresponding unlearning procedure with an $O(\sqrt{T}\rho)$ relative unlearning complexity (the ratio of unlearning and retraining complexity; see Definition 5).

Prefix-sum queries: Our main contribution is for the class of prefix-sum queries. These are a sub-class of interval queries, which have been extensively studied in differential privacy and are classically solved by the binary tree mechanism (Dwork et al., 2010). We note in passing that for differential privacy, the purpose of the tree is to enable tight privacy accounting, and no explicit tree may be maintained. In contrast, for unlearning, we show that maintaining the binary tree data structure aids efficient unlearning. We give a binary-tree-based ρ-TV-stable learning procedure and a corresponding unlearning procedure with an $\tilde{O}(\rho)$ relative unlearning complexity.

Unlearning in Stochastic Convex Optimization (SCO): Our primary motivation for considering prefix-sum queries is their application to unlearning in SCO (see Section 2 for preliminaries).

1) Smooth SCO: The problem of unlearning in smooth SCO was studied in Ullah et al. (2021), which proposed algorithms with excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{2/3}\big)$, where ρ is the relative unlearning complexity. We show that using a variant of variance-reduced Frank-Wolfe (Zhang et al., 2020), which uses prefix-sum queries, yields an improved excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\big)$. This corresponds to $\tilde{O}(\rho n)$ expected gradient computations upon unlearning.

2) Non-smooth SCO: In the non-smooth setting, which was not covered in prior work, we give an algorithm based on Dual Averaging (Nesterov, 2009), which again uses prefix-sum query access and thus fits into the framework. This algorithm gives us an excess population risk of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}\big)$ with $\tilde{O}(\rho n)$ expected gradient complexity of unlearning.
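To illustrate why a maintained binary tree helps, the minimal sketch below (ours; the paper's structure additionally stores noisy per-node quantities, omitted here) uses a binary indexed (Fenwick) tree: answering any prefix-sum query, and propagating the single-item update needed for unlearning, each touch only O(log T) nodes.

```python
class PrefixSumTree:
    """Binary indexed (Fenwick) tree: prefix sums in O(log T) per query,
    and removing one item's contribution touches only O(log T) nodes."""

    def __init__(self, T):
        self.n = T
        self.tree = [0.0] * (T + 1)

    def add(self, i, delta):        # 1-indexed position i
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)           # climb to the next covering node

    def prefix(self, i):            # sum of items 1..i
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

t = PrefixSumTree(8)
vals = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
for i, v in enumerate(vals, 1):
    t.add(i, v)
assert t.prefix(4) == 9.0           # 3 + 1 + 4 + 1

# "unlearn" item 3 by subtracting its contribution: O(log T) node updates
t.add(3, -4.0)
assert t.prefix(4) == 5.0
```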
3) Generalized Linear Models (GLMs): GLMs are among the most basic machine learning problems and include the squared loss (in linear regression), logistic loss (in logistic regression), hinge loss (in support vector machines), etc. We study unlearning in two classes of GLMs (see below), for which we combine recently proposed techniques based on dimensionality reduction (Arora et al., 2022) with the above prefix-sum query algorithms to get the following dimension-independent rates.

3(a) Smooth GLM: For the smooth convex GLM setting, we combine the Johnson-Lindenstrauss transform with variance-reduced Frank-Wolfe to get $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\big)$ excess population risk. Note that we incur no overhead in the statistical rate even with a relative unlearning complexity as small as $\rho \approx n^{-1/4}$. This class of smooth GLMs contains the well-studied problem of logistic regression. Hence, our result demonstrates that it is possible to unlearn logistic regression with sub-linear, specifically $\tilde{O}(n^{3/4})$, unlearning complexity with no sacrifice in the statistical rate.

3(b) Lipschitz GLM: Similarly, for the Lipschitz convex GLM setting, we combine the Johnson-Lindenstrauss transform with Dual Averaging, yielding a rate of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\big)$. Please see Table 1 for a summary of the above results.

| Problem | Base algorithm | Rate |
| --- | --- | --- |
| Smooth, Lipschitz SCO | VR-FW | $\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}$ |
| Lipschitz SCO | DA | $\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}$ |
| Smooth, Lipschitz GLM | JL + VR-FW | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}$ |
| Lipschitz GLM | JL + DA | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}$ |

Table 1. Excess population risk guarantees for various problems as well as the base algorithm; ρ: relative unlearning complexity (see Definition 5); VR-FW: variance-reduced Frank-Wolfe; DA: Dual Averaging; JL: Johnson-Lindenstrauss transform.

SCO in dynamic streams: Finally, we consider SCO in dynamic streams, where we observe a sequence of insertions and deletions and are required to produce an output after each time point.
In this case, we present two methods: one which satisfies the exact unlearning guarantee with a worse update time, and another which satisfies weak unlearning, which only requires the model (and not the metadata) to be indistinguishable (see Definition 2), with an improved update time. The exact unlearning method is inspired by the work of Ullah et al. (2021), which handled insertions analogously to deletions. The weak unlearning method is motivated by the observation that the above may be too pessimistic. To elaborate, inserting a new data item does not warrant an (unlearning) guarantee that the algorithm's state be indistinguishable from the case where the point was not inserted. Hence, insertions should require a smaller update time, which is indeed the case for our proposed methods.

## 1.2. Related work

Our work is a direct follow-up of Ullah et al. (2021), which proposed the framework of Total Variation (TV) stability and maximal coupling for the exact machine unlearning problem. They applied this to unlearning in smooth stochastic convex optimization (SCO) and obtained a guarantee of $\tilde{O}\big(\frac{1}{\sqrt{n}} + \big(\frac{\sqrt{d}}{n\rho}\big)^{2/3}\big)$ on excess population risk, where n is the number of data samples, d the model dimensionality, and ρ the relative unlearning complexity (see Definition 5). We improve upon the results in that work in multiple ways, as described in the preceding section. Besides this, the exact unlearning problem has been studied for k-means clustering (Ginart et al., 2019) and random forests (Brophy and Lowd, 2021). The work of Bourtoule et al. (2021) proposes a general methodology for exact unlearning for deep learning methods. Their focus is on devising practical methods, and they do not provide theoretical guarantees on accuracy, even in simple settings.
Finally, there are works which consider unlearning in SCO; however, they use an approximate notion of unlearning inspired by differential privacy (Guo et al., 2019; Neel et al., 2021; Sekhari et al., 2021; Gupta et al., 2021) and are therefore incomparable to our work.

## 2. Problem Setup and Preliminaries

Let $\mathcal{Z}$ be the data space, $\mathcal{W}$ the model space, and $\mathcal{M}$ the meta-data space, where meta-data is additional information a learning algorithm may save to aid unlearning. We consider a learning algorithm as a map $\mathsf{A} : \mathcal{Z}^* \to \mathcal{W} \times \mathcal{M}$ and an unlearning algorithm as a map $\mathsf{U} : \mathcal{W} \times \mathcal{M} \times \mathcal{Z} \to \mathcal{W} \times \mathcal{M}$. We use $\bar{\mathsf{A}}$ and $\bar{\mathsf{U}}$ to denote the first output (which belongs to $\mathcal{W}$) of $\mathsf{A}$ and $\mathsf{U}$, respectively. We recall the definition of exact unlearning, which requires that the entire state after unlearning be indistinguishable from the state obtained if the learning algorithm were applied to the dataset without the deleted point.

Definition 1 (Exact unlearning). A procedure $(\mathsf{A}, \mathsf{U})$ satisfies exact unlearning if for all datasets S, all $z \in \mathcal{Z}$, and all events $\mathcal{E} \subseteq \mathcal{W} \times \mathcal{M}$, we have
$$\mathbb{P}\left(\mathsf{A}(S \setminus \{z\}) \in \mathcal{E}\right) = \mathbb{P}\left(\mathsf{U}(\mathsf{A}(S), z) \in \mathcal{E}\right).$$

We next define weak unlearning, wherein only the model output, and not the entire state, is required to be indistinguishable.

Definition 2 (Weak unlearning). A procedure $(\mathsf{A}, \mathsf{U})$ satisfies weak unlearning if for all datasets S, all $z \in \mathcal{Z}$, and all events $\mathcal{E} \subseteq \mathcal{W}$, we have
$$\mathbb{P}\left(\bar{\mathsf{A}}(S \setminus \{z\}) \in \mathcal{E}\right) = \mathbb{P}\left(\bar{\mathsf{U}}(\mathsf{A}(S), z) \in \mathcal{E}\right).$$

Unlearning request: We consider the setting where we start with a dataset of n samples and observe one unlearning request. We assume that the choice of the unlearning request is oblivious to the learning process. In Section 6, we generalize our results to a streaming setting of requests.

Total Variation stability, maximal coupling, and efficient unlearning: The Total Variation (TV) distance between two probability distributions P and Q is
$$\mathrm{TV}(P, Q) = \sup_{\text{measurable } E} |P(E) - Q(E)|.$$
Next, we define Total Variation (TV) stability to motivate algorithmic techniques for efficient unlearning.

Definition 3.
An algorithm $\mathsf{A}$ is said to be ρ-Total-Variation (TV) stable if for all datasets S and S' differing in one point, i.e., $|S \triangle S'| = 1$, the total variation distance satisfies $\mathrm{TV}(\mathsf{A}(S), \mathsf{A}(S')) \leq \rho$.

Given two distributions P and Q, a coupling is a joint distribution π with marginals P and Q. Furthermore, a maximal coupling is a coupling π such that the disagreement probability $\mathbb{P}_{(x, y) \sim \pi}\{x \neq y\} = \mathrm{TV}(P, Q)$. In the unlearning context, $P = \mathsf{A}(S)$, the output on the initial dataset, and $Q = \mathsf{A}(S')$, the output on the updated dataset. Hence, the unlearning problem simply becomes one of transporting P to Q with small computational cost, akin to optimal transport (Villani, 2009). Furthermore, observe that when we sample from a maximal coupling between P and Q, by definition we get the same sample for both P and Q except with probability ρ, while still satisfying the exact unlearning criterion. The main idea is that for certain learning algorithms of interest, during unlearning we can efficiently construct a (near-)maximal coupling of P and Q, so that the same model output from P suffices for Q most of the time. In particular, the fraction of times that we need to change the model is (roughly) the TV-stability parameter ρ of the learning algorithm. The goal, therefore, is to design an (accurate) TV-stable learning algorithm and a corresponding efficient coupling-based unlearning algorithm. In this work, we use the technique of reflection coupling, described below.

Reflection Coupling (Lindvall and Rogers, 1986): Reflection coupling is a classical technique in probability for maximally coupling symmetric probability distributions. Consider two probability distributions P and Q with means u and u', and let r be a sample from P. The process involves a rejection sampling step on the two distributions and the sample r (see line 13 in Algorithm 3).
If it results in an accept, we use the same r as the sample from Q; otherwise, we apply the following simple map, which gives the sample from Q (see line 16 in Algorithm 3):
$$\mathrm{Reflect}(u, u', r) = u + u' - r.$$

Our algorithmic techniques borrow tools from differential privacy (Dwork et al., 2014), such as its relationship with Total Variation stability; we describe these in Appendix A.

Stochastic Convex Optimization (SCO): SCO is the dominant framework for computationally-efficient machine learning. Consider a closed convex (constraint) set $\mathcal{W} \subseteq \mathbb{R}^d$ and let D denote its diameter. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is convex in its first argument for every $z \in \mathcal{Z}$. Given n i.i.d. points from an unknown probability distribution $\mathcal{D}$ over $\mathcal{Z}$, the goal is to devise an algorithm whose output has small population risk, defined as $L(w; \mathcal{D}) := \mathbb{E}_{z \sim \mathcal{D}}\, \ell(w; z)$. The excess population risk is then $L(w; \mathcal{D}) - L(w^*; \mathcal{D})$, where $w^*$ denotes a population risk minimizer over $\mathcal{W}$.

Algorithm 1: Template learning algorithm

```
Input: dataset S, steps T, query functions {q_t(·)}_{t ≤ T} where q_t ∈ Q, a query class,
       update functions {U_t(·)}_{t ≤ T}, selector function S(·)
1: Initialize model w_1 ∈ W
2: for t = 1 to T − 1 do
3:     Query dataset: u_t = q_t({w_i}_{i ≤ t}, S)
4:     Update: w_{t+1} = U_t({w_i}_{i ≤ t}, u_t)
5: end for
Output: ŵ = S({w_t}_{t ≤ T})
```

Generalized Linear Models (GLMs): Generalized linear models are loss functions popularly encountered in supervised learning problems, like linear and logistic regression. Herein, $\ell(w; (x, y)) = \phi_y(\langle w, x \rangle)$, where $\phi_y : \mathbb{R} \to \mathbb{R}$ is some link function. We use X to denote the radius bound on data points, i.e., for $x \in \mathcal{X} \subseteq \mathbb{R}^d$, $\|x\| \leq X$. In this case, we consider the unconstrained setup, i.e., $\mathcal{W} = \mathbb{R}^d$, as it allows us to get dimension-independent rates for GLMs, similar to what happens under differential privacy (Jain and Thakurta, 2014; Arora et al., 2022). We introduce the Johnson-Lindenstrauss property below, which is crucial to our construction.

Definition 4 (Johnson-Lindenstrauss property).
A random matrix $\Phi \in \mathbb{R}^{k \times d}$ satisfies the (β, γ)-JL property if for any $u, v \in \mathbb{R}^d$,
$$\mathbb{P}\left(\left|\langle \Phi u, \Phi v \rangle - \langle u, v \rangle\right| \geq \beta \|u\| \|v\|\right) \leq \gamma.$$
There exist many efficient constructions of such random matrices (Nelson, 2011).

## 3. Unlearning for Adaptive Query Release

We now set up the framework of adaptive query release, which is a lens through which to view (existing) iterative learning procedures; this view is useful in our design of corresponding unlearning algorithms. Iterative procedures run on datasets consist of a sequence of interactions with the dataset; each interaction computes a certain function, or query, on the dataset. The chosen query is typically adaptive, i.e., dependent on the prior query outputs. We consider iterative learning procedures which are composed of adaptive queries from a specified query class. Formally, consider a query class $\mathcal{Q} \subseteq \mathcal{W}^{\mathcal{W}^* \times \mathcal{Z}^*}$; herein, each query in $\mathcal{Q}$ is a function of a sequence of models $\{w_i\}_{i \geq 1}$ and the dataset, written $q_t(\{w_i\}_{i \leq t}, S)$.
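As a concrete instantiation of the template learning algorithm (Algorithm 1), the sketch below (our own, with illustrative names) recovers GD on least squares by choosing each $q_t$ to be the linear (average-gradient) query at the latest iterate and each $U_t$ a gradient step:

```python
import numpy as np

def run_template(S, T, query, update, select, w1):
    """Template learning algorithm: adaptively query the dataset,
    update the iterate with the query answer, then select an output."""
    ws = [w1]
    for t in range(T):
        u_t = query(ws, S)          # adaptive query on the dataset
        ws.append(update(ws, u_t))  # update using the query answer
    return select(ws)

# instantiate GD on least squares via a linear (averaged-gradient) query
rng = np.random.default_rng(3)
X, y = rng.normal(size=(30, 4)), rng.normal(size=30)
S = (X, y)
query = lambda ws, S: S[0].T @ (S[0] @ ws[-1] - S[1]) / len(S[1])
update = lambda ws, u: ws[-1] - 0.1 * u
select = lambda ws: ws[-1]

w_hat = run_template(S, T=500, query=query, update=update, select=select,
                     w1=np.zeros(4))
# the output approaches the least-squares solution
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.linalg.norm(w_hat - w_star) < 1e-2
```

The same skeleton covers the prefix-sum-based methods in the paper by letting `query` depend on the whole history of iterates rather than only the latest one.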