# Differentially Private Policy Evaluation

Borja Balle (b.deballepigem@lancaster.ac.uk), Lancaster University
Maziar Gomrokchi (mgomro@cs.mcgill.ca), McGill University
Doina Precup (dprecup@cs.mcgill.ca), McGill University

We present the first differentially private algorithms for reinforcement learning, which apply to the task of evaluating a fixed policy. We establish two approaches for achieving differential privacy, provide a theoretical analysis of the privacy and utility of the two algorithms, and show promising results on simple empirical examples.

1. Introduction

Learning how to make decisions under uncertainty is becoming paramount in many practical applications, such as medical treatment design, energy management, adaptive user interfaces, and recommender systems. Reinforcement learning (Sutton & Barto, 1998) provides a variety of algorithms capable of handling such tasks. However, in many practical applications, aside from obtaining good predictive performance, one might also require that the data used to learn the predictor be kept confidential. This is especially true in medical applications, where patient confidentiality is very important, and in other applications which are user-centric (such as recommender systems).

Differential privacy (DP) (Dwork, 2006) is a very active research area, originating from cryptography, but which has now been embraced by the machine learning community. DP is a formal model of privacy used to design mechanisms that reduce the amount of information leaked by the result of queries to a database containing sensitive information about multiple users (Dwork, 2006). Many supervised learning algorithms have differentially private versions, including logistic regression (Chaudhuri & Monteleoni, 2009; Chaudhuri et al., 2011), support vector machines (Chaudhuri et al., 2011; Rubinstein et al., 2012; Jain & Thakurta, 2013), and the lasso (Thakurta & Smith, 2013). However, differential privacy for reinforcement learning tasks has not been tackled yet, except for the simpler case of bandit problems (Smith & Thakurta, 2013; Mishra & Thakurta, 2015; Tossou & Dimitrakakis, 2016).

In this paper, we tackle differential privacy for reinforcement learning algorithms in the full Markov Decision Process (MDP) setting. We develop differentially private algorithms for the problem of policy evaluation, in which a given way of behaving has to be evaluated quantitatively. We start with the batch, first-visit Monte Carlo approach to policy evaluation, which is well understood and closest to regression algorithms, and provide two differentially private versions, which come with formal privacy proofs as well as guarantees on the quality of the solution obtained. Both algorithms work by injecting Gaussian noise into the parameter vector for the value function, but they differ in how the amount of noise is determined. Our privacy analysis techniques are related to previous output perturbation approaches for empirical risk minimization (ERM), but there are some domain-specific challenges that need to be addressed. In particular, the notion of neighbouring datasets we use is motivated by medical applications where individual patients generate full trajectories. In this case, two datasets differing in a single patient yield regression problems differing in multiple correlated regression targets.
Our utility analysis identifies parameters of the MDP that control how easy it is to maintain privacy in each case. The theoretical utility analysis, as well as some illustrative experiments, show that the accuracy of the private algorithms does not suffer (compared to usual Monte Carlo) when the dataset is large.

The rest of the paper is organized as follows. In Sec. 2 we provide background notation and results on differential privacy and Monte Carlo methods for policy evaluation. Sec. 3 presents our proposed algorithms. The privacy analysis and the utility analysis are outlined in Sec. 4 and Sec. 5, respectively. Detailed proofs for both of these sections are given in the Supplementary Material. In Sec. 6 we provide empirical illustrations of the scaling behaviour of the proposed algorithms, using synthetic MDPs which try to mimic characteristics of real applications. Finally, we conclude in Sec. 7 with a discussion of related work and avenues for future work.

2. Background

2.1. Differential Privacy

DP takes a user-centric approach, by providing privacy guarantees based on the difference of the outputs of a learning algorithm trained on two databases differing in a single user. The central goal is to bound the loss in privacy that a user can suffer when the result of an analysis on a database with her data is made public. This can incentivize users to participate in studies using sensitive data, e.g. mining of medical records. In the context of machine learning, differentially private algorithms are useful because they allow learning models in such a way that their parameters do not reveal information about the training data (McSherry & Talwar, 2007). For example, one can think of using historical medical records to learn prognostic and diagnostic models which can then be shared between multiple health service providers without compromising the privacy of the patients whose data was used to train the model.

To formalize the above discussion, let $\mathcal{X}$ be an input space and $\mathcal{Y}$ an output space. Suppose $A$ is a randomized algorithm that takes as input a tuple $X = (x_1, \ldots, x_m)$ of elements from $\mathcal{X}$ for some $m \geq 1$ and outputs a (random) element $A(X)$ of $\mathcal{Y}$. We interpret $X \in \mathcal{X}^m$ as a dataset containing data from $m$ individuals and define its neighbouring datasets as those that differ from $X$ in their last element: $X' = (x_1, \ldots, x_{m-1}, x'_m)$ with $x_m \neq x'_m$. (Formally, a neighbouring dataset is one which differs in one element, not necessarily the last. However, here we assume the order of the elements in $X$ does not affect the distribution of $A(X)$, and thus define without loss of generality neighbouring datasets as always differing in the last element.) We denote this (symmetric) relation by $X \simeq X'$. Algorithm $A$ is $(\varepsilon, \delta)$-differentially private for some $\varepsilon, \delta > 0$ if for every $m \geq 1$, every pair of datasets $X, X' \in \mathcal{X}^m$ with $X \simeq X'$, and every measurable set $\Omega \subseteq \mathcal{Y}$ we have
$$\mathbb{P}[A(X) \in \Omega] \;\leq\; e^{\varepsilon}\,\mathbb{P}[A(X') \in \Omega] + \delta\;. \qquad (1)$$
This definition means that the distributions over possible outputs of $A$ on inputs $X$ and $X'$ are very similar, so revealing this output leaks almost no information on whether $x_m$ or $x'_m$ was in the dataset.

A simple way to design a DP algorithm for a given function $f : \mathcal{X}^m \to \mathcal{Y}$ is the output perturbation mechanism, which releases $A(X) = f(X) + \eta$, where $\eta$ is noise sampled from a properly calibrated distribution. For real outputs $\mathcal{Y} = \mathbb{R}^d$, the Laplace (resp. Gaussian) mechanism (see e.g. Dwork & Roth (2014)) samples each component of the noise $\eta = (\eta_1, \ldots, \eta_d)$ i.i.d. from a Laplace (resp. Gaussian) distribution with standard deviation $O(GS_1(f)/\varepsilon)$ (resp. $O(GS_2(f)\sqrt{\ln(1/\delta)}/\varepsilon)$), where $GS_p(f)$ is the global sensitivity of $f$ given by
$$GS_p(f) = \sup_{X, X' \in \mathcal{X}^m,\; X \simeq X'} \|f(X) - f(X')\|_p\;.$$
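To make the output perturbation idea concrete, here is a minimal sketch (ours, not from the paper) of a Gaussian mechanism whose noise is calibrated to a known L2 global sensitivity, using the standard calibration $\sigma = GS_2(f)\sqrt{2\ln(1.25/\delta)}/\varepsilon$ from Dwork & Roth (2014), valid for $\varepsilon < 1$; all function and variable names are ours.

```python
import numpy as np

def gaussian_mechanism(f_value, global_sensitivity_l2, eps, delta, rng=None):
    """Release f(X) + N(0, sigma^2 I), with sigma calibrated to the L2 global
    sensitivity of f: sigma = GS_2(f) * sqrt(2 ln(1.25/delta)) / eps."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = global_sensitivity_l2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.normal(scale=sigma, size=f_value.shape)

# Toy usage: privately release the mean of m values in [0, 1].  Replacing one
# record changes the mean by at most 1/m, so GS_2 = 1/m.
x = np.random.rand(1000)
private_mean = gaussian_mechanism(x.mean(), global_sensitivity_l2=1.0 / len(x),
                                  eps=0.5, delta=1e-5)
```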
Calibrating noise to the global sensitivity is a worst-case approach that requires taking the supremum over all possible pairs of neighbouring datasets, and in general does not account for the fact that in some datasets privacy can be achieved with substantially smaller perturbations. In fact, for many applications (like the one we consider in this paper) the global sensitivity is too large to provide useful mechanisms. Ideally one would like to add perturbations proportional to the potential changes around the input dataset $X$, as measured, for example, by the local sensitivity $LS_p(f, X) = \sup_{X' \simeq X} \|f(X) - f(X')\|_p$. Nissim et al. (2007) showed that approaches based on $LS_p$ do not lead to differentially private algorithms, and then proposed an alternative framework for DP mechanisms with data-dependent perturbations based on the idea of smoothed sensitivity. This is the approach we use in this paper; see Section 4 for further details.

2.2. Policy Evaluation

Policy evaluation is the problem of obtaining (an approximation to) the value function of a Markov reward process defined by an MDP $M$ and a policy $\pi$ (Sutton & Barto, 1998; Szepesvári, 2010). In many cases of interest $M$ is unknown, but we have access to trajectories containing state transitions and immediate rewards sampled from $\pi$. When the state space of $M$ is relatively small, tabular methods that represent the value of each state individually can be used. However, in problems with large (or even continuous) state spaces, parametric representations for the value function are typically needed in order to defeat the curse of dimensionality and exploit the fact that similar states will have similar values. In this paper we focus on policy evaluation with linear function approximation in the batch case, where we have access to a set of trajectories sampled from the policy of interest.

Let $M$ be an MDP over a finite state space $S$ with $N = |S|$ states and $\pi$ a policy on $M$. Given an initial state $s_0 \in S$, the interaction of $\pi$ with $M$ is described by a sequence $x = ((s_t, a_t, r_t))_{t \geq 0}$ of state-action-reward triplets. Suppose $0 < \gamma < 1$ is the discount factor of $M$. The value function $V^\pi : S \to \mathbb{R}$ of $\pi$ assigns to each state the expected discounted cumulative reward obtained by a trajectory following policy $\pi$ from that state:
$$V^\pi(s) = \mathbb{E}_{M,\pi}\Big[\sum_{t \geq 0} \gamma^t r_t \,\Big|\, s_0 = s\Big]\;. \qquad (2)$$
The value function can be considered a vector $V^\pi \in \mathbb{R}^S$. We make the usual assumption that any reward $r$ generated by $M$ is bounded: $0 \leq r \leq R_{\max}$, so $0 \leq V^\pi(s) \leq R_{\max}/(1-\gamma)$ for all $s \in S$.

Let $\Phi \in \mathbb{R}^{S \times d}$ be a feature representation that associates each state $s \in S$ with a $d$-dimensional feature vector $\varphi_s = \Phi(s, :)^\top \in \mathbb{R}^d$. The goal is to find a parameter vector $\theta \in \mathbb{R}^d$ such that $\hat{V}^\pi = \Phi\theta$ is a good approximation to $V^\pi$. To do so, we assume that we have access to a collection $X = (x_1, \ldots, x_m)$ of finite trajectories sampled from $M$ by $\pi$, where each $x_i$ is a sequence of states, actions and rewards. We will use a Monte Carlo approach, in which the returns of the trajectories in $X$ are used as regression targets to fit the parameters of $\hat{V}^\pi$ via a least-squares approach (Sutton & Barto, 1998). In particular, we consider first-visit Monte Carlo estimates obtained as follows.
Suppose $x = ((s_1, a_1, r_1), \ldots, (s_T, a_T, r_T))$ is a trajectory that visits $s$ and $i_{x,s}$ is the time of the first visit to $s$; that is, $s_{i_{x,s}} = s$ and $s_t \neq s$ for all $t < i_{x,s}$. The return collected from this first visit is given by
$$F_{x,s} = \sum_{t \geq i_{x,s}} r_t\,\gamma^{t - i_{x,s}} = \sum_{t \geq 0} r_{t + i_{x,s}}\,\gamma^{t}\;,$$
and provides an unbiased estimate of $V^\pi(s)$. For convenience, when state $s$ is not visited by trajectory $x$ we assume $F_{x,s} = 0$. Given the returns from all first visits corresponding to a dataset $X$ with $m$ trajectories, we can find a parameter vector for the estimator $\hat{V}^\pi$ by solving the optimization problem $\mathrm{argmin}_\theta J_X(\theta)$, where
$$J_X(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{s \in S_{x_i}} \rho_s\big(F_{x_i,s} - \varphi_s^\top\theta\big)^2\;, \qquad (3)$$
and $S_x$ is the set of states visited by trajectory $x$. The regression weights $0 \leq \rho_s \leq 1$ are given as an input to the problem and capture the user's belief that some states are more relevant than others.

It is obvious that $J_X(\theta)$ is a convex function of $\theta$. However, in general it is not strongly convex and therefore the optimum of $\mathrm{argmin}_\theta J_X(\theta)$ is not necessarily unique. On the other hand, it is known that differential privacy is tightly related to certain notions of stability (Thakurta & Smith, 2013), and optimization problems with non-unique solutions generally pose a problem to stability. In order to avoid this problem, the private policy evaluation algorithms that we propose in Section 3 are based on optimizing slightly modified versions of $J_X(\theta)$ which promote stability in their solutions. Note that the notions of stability related to DP are for worst-case situations: that is, they need to hold for every possible pair of neighbouring input datasets $X \simeq X'$, regardless of any generative model assumed for the trajectories in those datasets. In particular, these stability considerations are not directly related to the variance of the estimates in $\hat{V}^\pi$.

We end this section by introducing further notation that will be used in the sequel. Given a dataset $X$ with $m$ trajectories, let $F_X \in \mathbb{R}^S$ denote the vector containing the average first-visit returns from all trajectories in $X$ that visit a particular state. In particular, if $X_s$ represents the multiset of trajectories from $X$ that visit state $s$ at some point, then we have
$$F_X(s) = F_{X,s} = \frac{1}{|X_s|}\sum_{x \in X_s} F_{x,s}\;. \qquad (4)$$
If $s$ is not visited by any trajectory in $X$ we set $F_{X,s} = 0$. We also define a diagonal matrix $\Gamma_X \in \mathbb{R}^{S \times S}$ with entries given by the product of the regression weight on each state and the fraction of trajectories in $X$ visiting that state: $\Gamma_X(s, s) = \rho_s|X_s|/m$. Now, a typical calculation solving for $\theta$ in $\nabla_\theta J_X(\theta) = 0$ shows that any $\theta_X \in \mathrm{argmin}_\theta J_X(\theta)$ must also be an optimum of
$$\sum_{s \in S}\frac{\rho_s|X_s|}{m}\big(F_{X,s} - \varphi_s^\top\theta\big)^2\;. \qquad (5)$$
Alternatively, we can say $\theta_X$ is an optimum of $J_X(\theta)$ if and only if it satisfies
$$\Phi^\top\Gamma_X\Phi\,\theta_X = \Phi^\top\Gamma_X F_X\;. \qquad (6)$$
Hence, $J_X(\theta)$ has a unique global optimum if and only if the matrix $\Phi^\top\Gamma_X\Phi$ is invertible. Since it is possible to find neighbouring datasets $X \simeq X'$ where at most one of $\Phi^\top\Gamma_X\Phi$ and $\Phi^\top\Gamma_{X'}\Phi$ is invertible, using $J_X(\theta)$ to define the policy evaluation problem poses a problem for the design of differentially private algorithms. Next we discuss two ways to make this optimization more stable, leading to two different DP policy evaluation algorithms.
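As a concrete reference point for the estimation pipeline above, the following NumPy sketch (ours, not the paper's code) computes first-visit returns $F_{x,s}$, the per-state averages $F_X$ from (4), the matrix $\Gamma_X$, and a solution of the normal equations (6). Trajectories are assumed to be lists of (state, action, reward) triplets with integer state indices; all names are ours.

```python
import numpy as np

def first_visit_returns(trajectory, gamma, num_states):
    """Vector F_x with F_x[s] = discounted return from the first visit to s
    (0 if s is never visited), plus a boolean mask of visited states."""
    F = np.zeros(num_states)
    visited = np.zeros(num_states, dtype=bool)
    rewards = np.array([r for (_, _, r) in trajectory], dtype=float)
    discounts = gamma ** np.arange(len(trajectory))
    for t, (s, _, _) in enumerate(trajectory):
        if not visited[s]:
            visited[s] = True
            # F_{x,s} = sum_{t' >= t} gamma^(t'-t) * r_{t'}
            F[s] = np.dot(rewards[t:], discounts[: len(trajectory) - t])
    return F, visited

def monte_carlo_solution(trajectories, Phi, gamma, rho):
    """Unregularized first-visit Monte Carlo fit: solve the normal equations
    Phi' Gamma_X Phi theta = Phi' Gamma_X F_X, cf. (6), via a pseudo-inverse."""
    num_states, m = Phi.shape[0], len(trajectories)
    sum_F, counts = np.zeros(num_states), np.zeros(num_states)
    for x in trajectories:
        F_x, visited = first_visit_returns(x, gamma, num_states)
        sum_F += F_x
        counts += visited
    F_X = np.divide(sum_F, counts, out=np.zeros(num_states), where=counts > 0)
    Gamma_X = np.diag(rho * counts / m)
    theta = np.linalg.pinv(Phi.T @ Gamma_X @ Phi) @ (Phi.T @ Gamma_X @ F_X)
    return theta, F_X, counts
```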
3. Private First-Visit Monte Carlo Algorithms

In this section we give the details of two differentially private policy evaluation algorithms based on first-visit Monte Carlo estimates. A formal privacy analysis of these algorithms is given in Section 4. Bounds showing how the privacy requirement affects the utility of the value estimates are presented in Section 5.

3.1. Algorithm DP-LSW

One way to make the optimization $\mathrm{argmin}_\theta J_X(\theta)$ more stable to changes in the dataset $X$ is to consider an alternative least-squares optimization leading to a closed-form solution similar to (4), but where the invertibility of the coefficient matrix does not change with $X$. Thus, we modify the objective function (5) by introducing a new set of positive regression weights $w_s > 0$ and letting $\Gamma \in \mathbb{R}^{S \times S}$ be a diagonal matrix with $\Gamma(s, s) = w_s$. In this way we obtain the objective function
$$J^w_X(\theta) = \sum_{s \in S} w_s\big(F_{X,s} - \varphi_s^\top\theta\big)^2 = \|F_X - \Phi\theta\|^2_{2,\Gamma}\;, \qquad (7)$$
where $\|v\|^2_{2,\Gamma} = \|\Gamma^{1/2}v\|^2_2 = v^\top\Gamma v$ is a weighted L2 norm. Using the same argument as before we see that any $\theta^w_X \in \mathrm{argmin}_\theta J^w_X(\theta)$ must satisfy
$$\Phi^\top\Gamma\Phi\,\theta^w_X = \Phi^\top\Gamma F_X\;. \qquad (8)$$
Thus, the new optimization problem is well-posed whenever $\Phi^\top\Gamma\Phi$ is invertible, which henceforth will be our working assumption. Note that this is a mild assumption, since it is satisfied by choosing a feature matrix $\Phi$ with full column rank. Under this assumption we have:
$$\theta^w_X = \big(\Phi^\top\Gamma\Phi\big)^{-1}\Phi^\top\Gamma F_X = \big(\Gamma^{1/2}\Phi\big)^{\dagger}\,\Gamma^{1/2}F_X\;, \qquad (9)$$
where $M^{\dagger}$ denotes the Moore–Penrose pseudo-inverse. The difference between optimizing $J_X(\theta)$ or $J^w_X(\theta)$ is reflected in the differences between (6) and (8). In particular, if the trajectories in $X$ are i.i.d. and $p_s$ denotes the probability that state $s$ is visited by a trajectory in $X$, then taking $w_s = \mathbb{E}_X[\rho_s|X_s|/m] = \rho_s p_s$ yields a loss function $J^w_X(\theta)$ that captures the effect of each state $s$ in $J_X(\theta)$ in the asymptotic regime $m \to \infty$. However, we note that knowledge of these visit probabilities is not required for running our algorithm or for its analysis.

Our first DP algorithm for policy evaluation applies a carefully calibrated output perturbation mechanism to the solution $\theta^w_X$ of $\mathrm{argmin}_\theta J^w_X(\theta)$. This algorithm is called DP-LSW and its full pseudo-code is given in Algorithm 1. It receives as input the dataset $X$, the regression weights $w$, the feature representation $\Phi$, and the MDP parameters $R_{\max}$ and $\gamma$. Additionally, the algorithm is parametrized by the privacy parameters $\varepsilon$ and $\delta$. Its output is the result of adding a random vector $\eta$ drawn from a multivariate Gaussian distribution $\mathcal{N}(0, \sigma^2_X I)$ to the parameter vector $\theta^w_X$. In order to compute the variance of $\eta$ the algorithm needs to solve the discrete optimization problem $\psi^w_X = \max_{0 \leq k \leq K_X} e^{-k\beta}\varphi^w_X(k)$, where $K_X = \max_{s \in S} |X_s|$, $\beta$ is a parameter computed in the algorithm, and $\varphi^w_X(k)$ is given by the following expression:
$$\varphi^w_X(k) = \sum_{s \in S}\frac{w_s}{\max\{|X_s| - k,\, 1\}^2}\;. \qquad (10)$$
Note that $\psi^w_X$ can be computed in time $O(K_X N)$.

Algorithm 1: DP-LSW
Input: $X$, $\Phi$, $\gamma$, $R_{\max}$, $w$, $\varepsilon$, $\delta$
Output: $\hat{\theta}^w_X$
 Compute $\theta^w_X$ ; // cf. (9)
 Let $\alpha \leftarrow 5\sqrt{2\ln(2/\delta)}/\varepsilon$ and $\beta \leftarrow \varepsilon/(4(d + \ln(2/\delta)))$;
 Let $\psi^w_X \leftarrow \max_{0 \leq k \leq K_X} e^{-k\beta}\varphi^w_X(k)$ ; // cf. (10)
 Let $\sigma_X \leftarrow \frac{\alpha R_{\max}}{1-\gamma}\big\|(\Gamma^{1/2}\Phi)^{\dagger}\big\|\sqrt{\psi^w_X}$;
 Sample a $d$-dimensional vector $\eta \sim \mathcal{N}(0, \sigma^2_X I)$;
 Return $\hat{\theta}^w_X = \theta^w_X + \eta$;

The variance of the noise in DP-LSW is proportional to the upper bound $R_{\max}/(1-\gamma)$ on the return from any state. This bound might be excessively pessimistic in some applications, leading to unnecessarily large perturbations of the solution $\theta^w_X$. Fortunately, it is possible to replace the term $R_{\max}/(1-\gamma)$ with any smaller upper bound $F_{\max}$ on the returns generated by the target MDP from any state. In practice this leads to more useful algorithms, but it is important to keep in mind that for the privacy guarantees to remain unaffected, one needs to assume that $F_{\max}$ is a publicly known quantity (i.e. it is not based on an estimate made from private data). These same considerations apply to the algorithm in the next section.
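The sketch below mirrors the structure of Algorithm 1 under our reading of the reconstructed formulas (9), (10) and (15), with $\alpha$ and $\beta$ as defined in the pseudo-code; it is an illustration under those assumptions rather than a reference implementation. `F_X` and `counts` are as returned by the helper shown earlier.

```python
import numpy as np

def dp_lsw(F_X, counts, Phi, w, gamma, R_max, eps, delta, rng=None):
    """Sketch of DP-LSW: output perturbation of the weighted least-squares
    solution, with variance calibrated via the beta-smooth upper bound psi_w."""
    rng = np.random.default_rng() if rng is None else rng
    d = Phi.shape[1]
    # Constants alpha and beta as in Algorithm 1 (cf. Lemma 1).
    alpha = 5.0 * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    beta = eps / (4.0 * (d + np.log(2.0 / delta)))
    # Unperturbed solution, cf. (9): theta_w = (Gamma^{1/2} Phi)^+ Gamma^{1/2} F_X.
    G_half = np.diag(np.sqrt(w))
    pinv = np.linalg.pinv(G_half @ Phi)
    theta_w = pinv @ (G_half @ F_X)
    # Smooth sensitivity term: psi_w = max_{0<=k<=K_X} e^{-k*beta} phi_w(k), cf. (10).
    K_X = int(counts.max())
    phi_w = lambda k: np.sum(w / np.maximum(counts - k, 1.0) ** 2)
    psi_w = max(np.exp(-k * beta) * phi_w(k) for k in range(K_X + 1))
    # Noise scale, cf. (15).
    sigma = (alpha * R_max / (1.0 - gamma)) * np.linalg.norm(pinv, 2) * np.sqrt(psi_w)
    return theta_w + rng.normal(scale=sigma, size=d)
```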
3.2. Algorithm DP-LSL

The second DP algorithm for policy evaluation we propose is also an output perturbation mechanism. It differs from DP-LSW in the way stability of the unperturbed solutions is promoted. In this case, we choose to optimize a regularized version of $J_X(\theta)$. In particular, we consider the objective function $J^\lambda_X(\theta)$ obtained by adding a ridge penalty to the least-squares loss from (3):
$$J^\lambda_X(\theta) = J_X(\theta) + \frac{\lambda}{2m}\|\theta\|^2_2\;, \qquad (11)$$
where $\lambda > 0$ is a regularization parameter. The introduction of the ridge penalty makes the objective function $J^\lambda_X(\theta)$ strongly convex, and thus ensures the existence of a unique solution $\theta^\lambda_X = \mathrm{argmin}_\theta J^\lambda_X(\theta)$, which can be obtained in closed form as:
$$\theta^\lambda_X = \Big(\Phi^\top\Gamma_X\Phi + \frac{\lambda}{2m}I\Big)^{-1}\Phi^\top\Gamma_X F_X\;. \qquad (12)$$
Here $\Gamma_X$ is defined as in Section 2.2.

We call DP-LSL the algorithm obtained by applying an output perturbation mechanism to the minimizer of $J^\lambda_X(\theta)$; the full pseudo-code is given in Algorithm 2. It receives as input the privacy parameters $\varepsilon$ and $\delta$, a dataset of trajectories $X$, the regression weights $\rho$, the feature representation $\Phi$, a regularization parameter $\lambda > \|\Phi\|^2\|\rho\|_\infty$, and the MDP parameters $R_{\max}$ and $\gamma$. After computing the solution $\theta^\lambda_X$ of $\mathrm{argmin}_\theta J^\lambda_X(\theta)$, the algorithm outputs $\hat{\theta}^\lambda_X = \theta^\lambda_X + \eta$, where $\eta$ is a $d$-dimensional noise vector drawn from $\mathcal{N}(0, \sigma^2_X I)$. The variance of $\eta$ is obtained by solving a discrete optimization problem (different from the one in DP-LSW). Let $c_\lambda = \|\Phi\|\,\|\rho\|_\infty/(2\lambda)$ and for $k \geq 0$ define $\varphi^\lambda_X(k)$ as:
$$\varphi^\lambda_X(k) = c_\lambda\sqrt{\sum_{s \in S}\rho_s\min\{|X_s| + k,\, m\}} + \frac{\|\rho\|_2}{2}\;. \qquad (13)$$
Then DP-LSL computes $\psi^\lambda_X = \max_{0 \leq k \leq m} e^{-k\beta}\varphi^\lambda_X(k)$, which can be done in time $O(mN)$.

Algorithm 2: DP-LSL
Input: $X$, $\Phi$, $\gamma$, $R_{\max}$, $\rho$, $\lambda$, $\varepsilon$, $\delta$
Output: $\hat{\theta}^\lambda_X$
 Compute $\theta^\lambda_X$ ; // cf. (12)
 Let $\alpha \leftarrow 5\sqrt{2\ln(2/\delta)}/\varepsilon$ and $\beta \leftarrow \varepsilon/(4(d + \ln(2/\delta)))$;
 Let $\psi^\lambda_X \leftarrow \max_{0 \leq k \leq m} e^{-k\beta}\varphi^\lambda_X(k)$ ; // cf. (13)
 Let $\sigma_X \leftarrow \frac{2\alpha R_{\max}\|\Phi\|}{(1-\gamma)(\lambda - \|\Phi\|^2\|\rho\|_\infty)}\,\psi^\lambda_X$;
 Sample a $d$-dimensional vector $\eta \sim \mathcal{N}(0, \sigma^2_X I)$;
 Return $\hat{\theta}^\lambda_X = \theta^\lambda_X + \eta$;
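A corresponding sketch of Algorithm 2 is given below. Because (13) could only be partially recovered from the source, the smooth-sensitivity term and the constant $c_\lambda$ follow our reading of that equation and should be treated as indicative only; all names are ours, and `m` is the number of trajectories.

```python
import numpy as np

def dp_lsl(F_X, counts, m, Phi, rho, lam, gamma, R_max, eps, delta, rng=None):
    """Sketch of DP-LSL: output perturbation of the ridge solution (12).
    Requires lam > ||Phi||^2 * max(rho) so the noise-scale denominator is positive."""
    rng = np.random.default_rng() if rng is None else rng
    d = Phi.shape[1]
    alpha = 5.0 * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    beta = eps / (4.0 * (d + np.log(2.0 / delta)))
    # Ridge solution, cf. (12).
    Gamma_X = np.diag(rho * counts / m)
    theta_lam = np.linalg.solve(Phi.T @ Gamma_X @ Phi + lam / (2.0 * m) * np.eye(d),
                                Phi.T @ Gamma_X @ F_X)
    # Smooth upper bound psi_lam = max_{0<=k<=m} e^{-k*beta} phi_lam(k), where
    # phi_lam(k) = c_lam * sqrt(sum_s rho_s * min(|X_s| + k, m)) + ||rho||_2 / 2
    # is our reading of (13), with c_lam = ||Phi|| * max(rho) / (2 * lam) assumed.
    phi_norm = np.linalg.norm(Phi, 2)
    c_lam = phi_norm * np.max(rho) / (2.0 * lam)
    phi_lam = lambda k: (c_lam * np.sqrt(np.sum(rho * np.minimum(counts + k, m)))
                         + np.linalg.norm(rho) / 2.0)
    psi_lam = max(np.exp(-k * beta) * phi_lam(k) for k in range(m + 1))
    # Noise scale mirroring the last step of Algorithm 2.
    sigma = (2.0 * alpha * R_max * phi_norm
             / ((1.0 - gamma) * (lam - phi_norm ** 2 * np.max(rho)))) * psi_lam
    return theta_lam + rng.normal(scale=sigma, size=d)
```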
4. Privacy Analysis

This section provides a formal privacy analysis for DP-LSW and DP-LSL and shows that both algorithms are $(\varepsilon, \delta)$-differentially private. We use the smooth sensitivity framework of (Nissim et al., 2007; 2011), which provides tools for the design of DP mechanisms with data-dependent output perturbations. We rely on the following lemma, which provides sufficient conditions for calibrating Gaussian output perturbation mechanisms with variance proportional to smooth upper bounds of the local sensitivity.

Lemma 1 (Nissim et al. (2011)). Let $A$ be an algorithm that on input $X$ computes a vector $\mu_X \in \mathbb{R}^d$ deterministically and then outputs $Z_X \sim \mathcal{N}(\mu_X, \sigma^2_X I)$, where $\sigma^2_X$ is a variance that depends on $X$. Let $\alpha = \alpha(\varepsilon, \delta) = 5\sqrt{2\ln(2/\delta)}/\varepsilon$ and $\beta = \beta(\varepsilon, \delta, d) = \varepsilon/(4d + 4\ln(2/\delta))$. Suppose $\varepsilon$ and $\delta$ are such that the following are satisfied for every pair of neighbouring datasets $X \simeq X'$: (a) $\sigma_X \geq \alpha\|\mu_X - \mu_{X'}\|_2$, and (b) $|\ln(\sigma^2_X) - \ln(\sigma^2_{X'})| \leq \beta$. Then $A$ is $(\varepsilon, \delta)$-differentially private.

Condition (a) says we need variance at least proportional to the local sensitivity $LS_2(f, X)$. Condition (b) asks that the variance does not change too fast between neighbouring datasets by imposing the constraint $\sigma^2_X/\sigma^2_{X'} \leq e^{\beta}$. This is precisely the spirit of the smoothed sensitivity principle: calibrate the noise to a smooth upper bound of the local sensitivity. We acknowledge Lemma 1 is only available in pre-print form, and thus provide an elementary proof in the Supplementary Material for completeness. The remaining proofs from this section are also presented there.

4.1. Privacy Analysis of DP-LSW

We start by providing an upper bound on the norm $\|\theta^w_X - \theta^w_{X'}\|_2$ for any two neighbouring datasets $X \simeq X'$. Using (9) it is immediate that:
$$\|\theta^w_X - \theta^w_{X'}\|_2 \leq \big\|(\Gamma^{1/2}\Phi)^{\dagger}\big\|\,\|F_X - F_{X'}\|_{2,\Gamma}\;. \qquad (14)$$
Next we provide an upper bound on $\|F_X - F_{X'}\|_{2,\Gamma}$.

Lemma 2. Let $X \simeq X'$ be two neighbouring datasets of $m$ trajectories with $X = (x_1, \ldots, x_{m-1}, x)$ and $X' = (x_1, \ldots, x_{m-1}, x')$. Let $\bar{X} = (x_1, \ldots, x_{m-1})$. Let $S_x$ (resp. $S_{x'}$) denote the set of states visited by $x$ (resp. $x'$). Then we have
$$\|F_X - F_{X'}\|_{2,\Gamma} \leq \frac{R_{\max}}{1-\gamma}\sqrt{\sum_{s \in S_x \cup S_{x'}}\frac{w_s}{(|\bar{X}_s| + 1)^2}}\;.$$

Since the condition in Lemma 1 needs to hold for any dataset $X'$ neighbouring $X$, we take the supremum of the bound above over all neighbours, which yields the following corollary.

Corollary 3. If $X$ is a dataset of trajectories, then the following holds for every neighbouring dataset $X' \simeq X$:
$$\|F_X - F_{X'}\|_{2,\Gamma} \leq \frac{R_{\max}}{1-\gamma}\sqrt{\sum_{s \in S}\frac{w_s}{\max\{|X_s|, 1\}^2}}\;.$$

Using this result we see that in order to satisfy item (a) of Lemma 1 we can choose a noise variance satisfying:
$$\sigma_X \geq \frac{\alpha R_{\max}}{1-\gamma}\big\|(\Gamma^{1/2}\Phi)^{\dagger}\big\|\sqrt{\sum_{s \in S}\frac{w_s}{\max\{|X_s|, 1\}^2}}\;, \qquad (15)$$
where only the last multiplicative term depends on the dataset $X$, and the rest can be regarded as a constant that depends on parameters of the problem which are either public or chosen by the user, and will not change for a neighbouring dataset $X'$. Thus, we are left with a lower bound expressible as $\sigma_X \geq C\sqrt{\varphi^w_X}$, where $\varphi^w_X = \sum_s w_s/\max\{|X_s|, 1\}^2$ only depends on the dataset $X$ through its signature $\langle X\rangle \in \mathbb{N}^S$, given by the number of trajectories in $X$ that visit each state: $\langle X\rangle(s) = |X_s|$. Accordingly, we write $\varphi^w_X = \varphi^w(\langle X\rangle)$, where $\varphi^w : \mathbb{N}^S \to \mathbb{R}$ is the function
$$\varphi^w(v) = \sum_{s \in S}\frac{w_s}{\max\{v_s, 1\}^2}\;. \qquad (16)$$
The signatures of two neighbouring datasets $X \simeq X'$ satisfy $\|\langle X\rangle - \langle X'\rangle\|_\infty \leq 1$, because replacing a single trajectory can only change by one the number of first visits to any particular state. Thus, assuming we have a function $\psi^w : \mathbb{N}^S \to \mathbb{R}$ satisfying $\psi^w(v) \geq \varphi^w(v)$ and $|\ln(\psi^w(v)) - \ln(\psi^w(v'))| \leq \beta$ for all $v, v' \in \mathbb{N}^S$ with $\|v - v'\|_\infty \leq 1$, we can take $\sigma_X = C\sqrt{\psi^w(\langle X\rangle)}$. This variance clearly satisfies the conditions of Lemma 1 since
$$|\ln(\sigma^2_X) - \ln(\sigma^2_{X'})| = |\ln(\psi^w(\langle X\rangle)) - \ln(\psi^w(\langle X'\rangle))| \leq \beta\;.$$
The function $\psi^w$ is known as a $\beta$-smooth upper bound of $\varphi^w$, and the following result provides a tool for constructing such functions.

Lemma 4 (Nissim et al. (2007)). Let $\varphi : \mathbb{N}^S \to \mathbb{R}$. For any $k \geq 0$ let $\varphi_k(v) = \max_{\|v - v'\|_\infty \leq k} \varphi(v')$. Given $\beta > 0$, the smallest $\beta$-smooth upper bound of $\varphi$ is the function
$$\psi(v) = \sup_{k \geq 0} e^{-k\beta}\varphi_k(v)\;. \qquad (17)$$

For some functions $\varphi$, the upper bound $\psi$ can be hard to compute or even approximate (Nissim et al., 2007). Fortunately, in our case a simple inspection of (16) reveals that $\varphi^w_k(v)$ is easy to compute. In particular, the following lemma implies that $\psi^w(v)$ can be obtained in time $O(N\|v\|_\infty)$.

Lemma 5. The following holds for every $v \in \mathbb{N}^S$:
$$\varphi^w_k(v) = \sum_{s \in S}\frac{w_s}{\max\{v_s - k, 1\}^2}\;.$$
Furthermore, for every $k \geq \|v\|_\infty - 1$ we have $\varphi^w_k(v) = \sum_{s \in S} w_s$.

Combining the last two lemmas, we see that the quantity $\psi^w_X$ computed in DP-LSW is in fact a $\beta$-smooth upper bound of $\varphi^w_X$. Because the variance $\sigma_X$ used in DP-LSW can be obtained by plugging this upper bound into (15), the two conditions of Lemma 1 are satisfied. This completes the proof of the main result of this section:

Theorem 6. Algorithm DP-LSW is $(\varepsilon, \delta)$-differentially private.
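The snippet below (ours) evaluates the $\beta$-smooth upper bound $\psi^w$ of Lemmas 4 and 5 on two hypothetical signatures, illustrating the point made next: when every state is visited by many trajectories the bound shrinks as the batch grows, whereas a single never-visited state keeps it at or above that state's weight $w_s$ no matter how large the dataset is. The numbers ($\varepsilon = \delta = 0.1$, $d = 5$, ten states) are arbitrary example values.

```python
import numpy as np

def psi_w(counts, w, beta):
    """beta-smooth upper bound of phi_w at a signature (Lemmas 4 and 5):
    max over 0 <= k <= max_s counts_s of exp(-k*beta) * sum_s w_s / max(counts_s - k, 1)^2."""
    K = int(counts.max())
    return max(np.exp(-k * beta) * np.sum(w / np.maximum(counts - k, 1.0) ** 2)
               for k in range(K + 1))

w = np.full(10, 0.1)
beta = 0.1 / (4 * (5 + np.log(2 / 0.1)))           # beta(eps, delta, d) for eps=delta=0.1, d=5
for m in (100, 1000, 10000):
    covered = np.full(10, float(m))                # every state visited by all m trajectories
    uncovered = np.r_[np.full(9, float(m)), 0.0]   # one state never visited
    print(m, psi_w(covered, w, beta), psi_w(uncovered, w, beta))
```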
Before proceeding to the next privacy analysis, note that Corollary 3 is the reason why a mechanism with output perturbations proportional to the global sensitivity is not sufficient in this case. The bound there says that if in the worst case we can find datasets of an arbitrary size $m$ where some states are visited few (or zero) times, then the global sensitivity will not vanish as $m \to \infty$. Hence, the utility of such an algorithm would not improve with the size of the dataset. The smoothed sensitivity approach works around this problem by adding large noise to these datasets, but adding much less noise to datasets where each state appears a sufficient number of times. Corollary 3 also provides the basis for efficiently computing smooth upper bounds of the local sensitivity. In principle, condition (b) in Lemma 1 refers to any dataset neighbouring $X$, of which there are uncountably many because we consider real rewards. Bounding the local sensitivity in terms of the signature reduces this to finitely many classes of neighbours, and the form of the bound in Corollary 3 makes it possible to apply Lemma 4 efficiently.

4.2. Privacy Analysis of DP-LSL

The proof that DP-LSL is differentially private follows the same strategy as for DP-LSW. We start with a lemma that bounds the local sensitivity of $\theta^\lambda_X$ for pairs of neighbouring datasets $X \simeq X'$. We use the notation $I^s_x$ for an indicator variable that is equal to one when state $s$ is visited within trajectory $x$.

Lemma 7. Let $X \simeq X'$ be two neighbouring datasets of $m$ trajectories with $X = (x_1, \ldots, x_{m-1}, x)$ and $X' = (x_1, \ldots, x_{m-1}, x')$. Let $F_x \in \mathbb{R}^S$ (resp. $F_{x'} \in \mathbb{R}^S$) be the vector given by $F_x(s) = F_{x,s}$ (resp. $F_{x'}(s) = F_{x',s}$). Define diagonal matrices $\Gamma_\rho, \Delta_{x,x'} \in \mathbb{R}^{S \times S}$ given by $\Gamma_\rho(s,s) = \rho_s$ and $\Delta_{x,x'}(s,s) = I^s_x - I^s_{x'}$. If the regularization parameter satisfies $\lambda > \|\Phi^\top\Delta_{x,x'}\Gamma_\rho\Phi\|$, then:
$$\|\theta^\lambda_X - \theta^\lambda_{X'}\|_2 \leq \frac{2\,\big\|\big(\Delta_{x,x'}\Phi\theta^\lambda_X - F_x + F_{x'}\big)^\top\Gamma_\rho\Phi\big\|_2}{\lambda - \|\Phi^\top\Delta_{x,x'}\Gamma_\rho\Phi\|}\;.$$

As before, we need to consider the supremum of the bound over all possible neighbours $X'$ of $X$. In particular, we would like to get a bound whose only dependence on the dataset $X$ is through the signature $\langle X\rangle$. This is the purpose of the following corollary:

Corollary 8. Let $X$ be a dataset of trajectories and suppose $\lambda > \|\Phi\|^2\|\rho\|_\infty$. Then the following holds for every neighbouring dataset $X' \simeq X$:
$$\|\theta^\lambda_X - \theta^\lambda_{X'}\|_2 \leq \frac{2R_{\max}\|\Phi\|}{(1-\gamma)\big(\lambda - \|\Phi\|^2\|\rho\|_\infty\big)}\Bigg(\sqrt{\sum_{s \in S}\rho_s|X_s|} + \frac{\|\rho\|_2}{2}\Bigg)\;.$$

By the same reasoning as in Section 4.1, as long as the regularization parameter is larger than $\|\Phi\|^2\|\rho\|_\infty$, a differentially private algorithm can be obtained by adding to $\theta^\lambda_X$ a Gaussian perturbation with a variance satisfying
$$\sigma_X \geq \frac{2\alpha R_{\max}\|\Phi\|}{(1-\gamma)\big(\lambda - \|\Phi\|^2\|\rho\|_\infty\big)}\,\varphi^\lambda(\langle X\rangle)$$
and the second condition of Lemma 1. This second requirement can be achieved by computing a $\beta$-smooth upper bound of the function $\varphi^\lambda : \mathbb{N}^S \to \mathbb{R}$ given by
$$\varphi^\lambda(v) = c_\lambda\sqrt{\sum_{s \in S}\rho_s\min\{v_s,\, m\}} + \frac{\|\rho\|_2}{2}\;.$$
When going from $\varphi^\lambda_X$ to $\varphi^\lambda(v)$ we substituted $|X_s|$ by $\min\{v_s, m\}$ to reflect the fact that any state cannot be visited by more than $m$ trajectories in a dataset $X$ of size $m$. It turns out that in this case the function $\varphi^\lambda_k(v) = \max_{\|v - v'\|_\infty \leq k}\varphi^\lambda(v')$ arising in Lemma 4 is also easy to compute.

Lemma 9. For every $v \in \mathbb{N}^S$, $\varphi^\lambda_k(v)$ is equal to:
$$\varphi^\lambda_k(v) = c_\lambda\sqrt{\sum_{s \in S}\rho_s\min\{v_s + k,\, m\}} + \frac{\|\rho\|_2}{2}\;.$$
Furthermore, for every $k \geq m - \min_s v_s$ we have $\varphi^\lambda_k(v) = c_\lambda\sqrt{m\sum_{s \in S}\rho_s} + \frac{\|\rho\|_2}{2}$.

Finally, in view of Lemma 4, Corollary 8, and Lemma 9, the variance of the noise perturbation in DP-LSL satisfies the conditions of Lemma 1, so we have proved the following.

Theorem 10. Algorithm DP-LSL is $(\varepsilon, \delta)$-differentially private.

5. Utility Analysis

Because the promise of differential privacy has to hold for any possible pair of neighbouring datasets $X \simeq X'$, the analysis in the previous section does not assume any generative model for the input dataset $X$. However, in practical applications we expect $X = (x_1, \ldots, x_m)$ to contain multiple trajectories sampled from the same policy on the same MDP.
The purpose of this section is to show that when the trajectories $x_i$ are i.i.d. the utility of our differentially private algorithms increases as $m \to \infty$. In other words, when the input dataset grows, the amount of noise added by our algorithms decreases, thus leading to more accurate estimates of the value function. This matches the intuition that using data from more users to estimate a fixed number of parameters leads to smaller individual contributions from each user, and makes the privacy constraint easier to satisfy.

To measure the utility of our DP algorithms we shall bound the difference in empirical risk between the private and non-private parameters learned from a given dataset. That is, we want to show that the quantity $\mathbb{E}_{X,\eta}[J^{\star}_X(\hat{\theta}^{\star}_X) - J^{\star}_X(\theta^{\star}_X)]$ vanishes as $|X| = m \to \infty$, for both $\star = w$ and $\star = \lambda$. Due to space reasons, here we only state the main corollaries of our analysis and defer full statements and proofs to the Supplementary Material.

In the case of DP-LSW, we give bounds on the excess empirical risk that decrease quadratically with $m$ under the assumption that either all states are visited with non-zero probability or the user sets the regression weights so that such states do not contribute to $\theta^w_X$.

Corollary 11. Let $S_0 = \{s \in S \mid p_s = 0\}$. If $w_s = 0$ for all $s \in S_0$, then $\mathbb{E}_{X,\eta}[J^w_X(\hat{\theta}^w_X) - J^w_X(\theta^w_X)] = O(1/m^2)$.

Our main statement for DP-LSL has a similar form, but in this case the rate of convergence depends on the choice of regularization parameter $\lambda$. In particular, we assume in the following statement that $\lambda$ grows with $m$, and see what tensions arise in the selection of an adequate regularization schedule.

Corollary 12. Suppose $\lambda = \omega(1)$ with respect to $m$. Then we have $\mathbb{E}_{X,\eta}[J^\lambda_X(\hat{\theta}^\lambda_X) - J^\lambda_X(\theta^\lambda_X)] = O(1/(\lambda m) + 1/\lambda^2 + m/\lambda^3)$.

Note that taking $\lambda = \Theta(m)$ we get a bound on the excess risk of order $O(1/m^2)$. However, if we want the regularization term in $J^\lambda_X(\theta)$ to vanish as $m \to \infty$ we need $\lambda = o(m)$. We shall see the importance of this trade-off in our experiments.

6. Experiments

In this section we illustrate the behaviour of the proposed algorithms on synthetic examples. The domain we use consists of a chain of $N$ states, where in each state the agent has some probability $p$ of staying and probability $(1-p)$ of advancing to the right. There is a reward of 1 when the agent reaches the final, absorbing state, and 0 for all other states. While this is a toy example, it illustrates the typical case of policy evaluation in the medical domain, where patients tend to progress through stages of recovery at different speeds, and past states are not typically revisited (partly because, in the medical domain, states contain historic information about past treatments). Trajectories are drawn by sampling a start state from an initial state distribution and generating state-action-reward transitions according to the described probabilities until the absorbing state is reached. Trajectories are harvested in a batch, and the same batches are processed by all algorithms.
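For concreteness, here is a minimal sketch (ours) of the data-generation process just described, under one reading of the environment: trajectories start in state 0 and the reward of 1 is emitted on the transition that reaches the absorbing state. All names and defaults are ours.

```python
import numpy as np

def sample_chain_trajectory(num_states=40, p_stay=0.5, rng=None):
    """One trajectory from the chain MDP of Section 6: from each non-terminal
    state the agent stays with probability p_stay and moves right otherwise;
    reaching the final (absorbing) state yields reward 1, all other steps 0.
    Returns a list of (state, action, reward) triplets."""
    rng = np.random.default_rng() if rng is None else rng
    trajectory, s = [], 0
    while s < num_states - 1:
        s_next = s if rng.random() < p_stay else s + 1
        reward = 1.0 if s_next == num_states - 1 else 0.0
        trajectory.append((s, 0, reward))   # single action in this toy policy
        s = s_next
    return trajectory

# Harvest a batch of m trajectories; the same batch is fed to all algorithms.
batch = [sample_chain_trajectory() for _ in range(500)]
```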
We experiment with both a tabular representation of the value function, as well as with function approximation. In the latter case, we simply aggregate pairs of adjacent states, which are hence forced to take the same value. We compared the proposed private algorithms DP-LSW and DP-LSL with their non-private equivalents LSW and LSL. The performance measure used is the average root mean squared error over the state space. The error is obtained by comparing the state values estimated by the learning algorithms against the values obtained by exact, tabular dynamic programming. Standard errors computed over 20 independent runs are included.

The main results are summarized in Fig. 1, for an environment with $N = 40$ states, $p = 0.5$, discount $\gamma = 0.99$, and, for the DP algorithms, $\varepsilon = 0.1$ and $\delta = 0.1$. In general, these constants should be chosen depending on the privacy constraints of the domain. Our theoretical results explain the expected effect of these choices on the privacy-utility trade-off, so we do not provide extensive experiments with different values.

Figure 1. Empirical comparison of differentially private and non-private algorithms.

The left plot in Fig. 1 compares the non-private LSL and LSW versions of Monte Carlo evaluation, in the tabular and function approximation case. As can be seen, both algorithms are very stable and converge to the same solution, but LSW converges faster. The second plot compares the performance of all algorithms in the tabular case, over a range of regularization parameters, for two different batch sizes. The third plot compares the expected RMSE of the algorithms when run with state aggregation, as a function of batch size. As can be seen, the DP algorithms converge to the same solutions as the corresponding non-private versions for large enough batch sizes. Interestingly, the two proposed approaches serve different needs. The LSL algorithms work better with small batches of data, whereas the LSW approach is preferable with large batches. From an empirical point of view, the trade-off between accuracy and privacy in the DP-LSL algorithm should be managed by setting a regularization schedule proportional to $m$. While the theory suggests this is not the best schedule in terms of excess empirical risk, it achieves the best overall accuracy. Finally, the last plot shows excess risk as a function of the batch size. Interestingly, more aggressive function approximation helps both differentially private algorithms converge faster. This is intuitive, since using the same data to estimate fewer parameters means the effect of each individual trajectory is already obscured by the function approximation. Decreasing the number of parameters $d$ of the function approximator increases $\beta$ and lowers the smooth sensitivity bounds. In medical applications, one expects to have many attributes measured about patients, and to need aggressive function approximation in order to provide generalization. This result tells us that differentially private algorithms should be favoured in this case as well. Overall, the empirical results are very promising, showing that especially as batch size increases, the noise introduced by the DP mechanism decreases rapidly, and these algorithms provide the same performance but with the additional privacy guarantees.

7. Conclusion

We present the first differentially private algorithms for policy evaluation in the full MDP setting. Our algorithms are built on top of established Monte Carlo methods, and come with utility guarantees showing that the cost of privacy diminishes as training batches get larger. The smoothed sensitivity framework is a key component of our analyses, which differ from previous works on DP mechanisms for ERM and bandit problems in two substantial ways.
First, we consider optimizations with non-Lipschitz loss functions, which prevents us from using most of the established techniques for analyzing privacy and utility in ERM algorithms and complicates some parts of our analysis. In particular, we cannot leverage the tight utility analysis of (Jain & Thakurta, 2014) to get dimension-independent bounds. Second, and more importantly, the natural model of neighbouring datasets for policy evaluation involves replacing a whole trajectory. This implies that neighbouring datasets can differ in multiple regression targets, which is quite different from the usual supervised learning setting, where neighbouring datasets can only change a single regression target. Our approach is also different from the online learning and bandit settings, where there is a single stream of experience and neighbouring datasets differ in one element of the stream. Note that this setting cannot be used naturally in the full MDP setup, because successive observations in a single stream are inherently correlated.

In future work we plan to extend our techniques in two directions. First, we would like to design DP policy evaluation methods based on temporal-difference learning (Sutton, 1988). Secondly, we will tackle the control case, where policy evaluation is often used as a sub-routine, e.g. as in actor-critic methods. We also plan to evaluate the current algorithms on patient data from an ongoing clinical study (in which case, errors cannot be estimated precisely, because the right answer is not known).

References

Chaudhuri, Kamalika and Monteleoni, Claire. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pp. 289–296, 2009.
Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12, 2011.
Dwork, Cynthia. Differential privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming – Volume Part II, pp. 1–12, 2006.
Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013.
Jain, Prateek and Thakurta, Abhradeep Guha. (Near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31st International Conference on Machine Learning, pp. 476–484, 2014.
McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pp. 94–103. IEEE, 2007.
Mishra, Nikita and Thakurta, Abhradeep. Nearly optimal differentially private stochastic multi-arm bandits. In UAI, 2015.
Nissim, Kobbi, Raskhodnikova, Sofya, and Smith, Adam. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 75–84. ACM, 2007.
Nissim, Kobbi, Raskhodnikova, Sofya, and Smith, Adam. Smooth sensitivity and sampling in private data analysis, 2011. URL http://www.cse.psu.edu/~ads22/pubs/NRS07/.
Rubinstein, Benjamin I. P., Bartlett, Peter L., Huang, Ling, and Taft, Nina. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):4, 2012.
Smith, Adam and Thakurta, Abhradeep.
Nearly optimal algorithms for private online learning in full-information and bandit settings. In NIPS, 2013.
Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.
Szepesvári, Csaba. Algorithms for reinforcement learning. Morgan & Claypool Publishers, 2010.
Thakurta, Abhradeep Guha and Smith, Adam. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory, pp. 819–850, 2013.
Tossou, Aristide C. Y. and Dimitrakakis, Christos. Algorithms for differentially private multi-armed bandits. In AAAI Conference on Artificial Intelligence (AAAI 2016), 2016.